Odd encoding issue with UTF-8 + gettext yields ? on non-ASCII

* Odd encoding issue with UTF-8 + gettext yields ? on non-ASCII
@ 2010-08-28 21:17 Ævar Arnfjörð Bjarmason
  2010-08-28 21:33 ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 20+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2010-08-28 21:17 UTC (permalink / raw)
  To: Git Mailing List; +Cc: Marcin Cieslak

I'm having an odd encoding issue with gettext on my
gettextize-git-mainporcelain branch that hadn't been turned up before
because none of the existing messages used non-ASCII translations.

With this in is.po (full version at [is.po]):

    "Content-Type: text/plain; charset=UTF-8\n"
    "Content-Transfer-Encoding: 8bit\n"

I do:

    $ msgfmt -o /opt/git/next-gettext/share/locale/is/LC_MESSAGES/git.mo is.po

Which, under an Icelandic locale gives me:

    $ rm -rf /tmp/meh; LANGUAGE= LC_ALL= LANG=is_IS.UTF-8 git init /tmp/meh
    Bj? til t?ma Git lind ? /tmp/meh/.git/

Those "?" characters are actual ASCII question marks.

But if I don't specify an encoding msgfmt will complain:

    $ msgfmt -o /opt/git/next-gettext/share/locale/is/LC_MESSAGES/git.mo is.po
    is.po: warning: Charset missing in header.
                    Message conversion to user's charset will not work.

But git will now emit the non-ASCII characters from its message
catalogue. Probably because some component now doesn't try to be smart
about encoding.

    $ rm -rf /tmp/meh; LANGUAGE= LC_ALL= LANG=is_IS.UTF-8 git init /tmp/meh
    Bjó til tóma Git lind í /tmp/meh/.git/

That'd probably break under a non-UTF-8 locale, like an ISO-8859-1 one
though.

A `hexdump -C` of the two `.mo` files is exactly the same, aside from
the charset header. I.e. both contain valid UTF-8 sequences, so the
issue is somewhere between the `*.mo` file being read and it being
emitted by `libintl` and the `gettext` function.

We're not doing anything odd in our [gettext.c] that I can see that
could explain this.

To reproduce it, do:

    git clone --reference ~/g/git git://github.com/avar/git.git next-gettext
    cd next-gettext
    git checkout -t origin/gettextize-git-mainporcelain
    make -j 4 prefix=/tmp/git all install
    rm -rf /tmp/meh; LANGUAGE= LANG=is_IS.utf8 /tmp/git/bin/git init /tmp/meh

Which'll give (as mentioned above):

    Bj? til t?ma Git lind ? /tmp/meh/.git/

But editing out the Content-Type line gives:

    Bjó til tóma Git lind í /tmp/meh/.git/

[gettextize-git-mainporcelain]: http://github.com/avar/git/tree/gettextize-git-mainporcelain
[is.po]: http://github.com/avar/git/blob/gettextize-git-mainporcelain/po/is.po
[gettext.c]: http://github.com/avar/git/blob/gettextize-git-mainporcelain/gettext.c


* Re: Odd encoding issue with UTF-8 + gettext yields ? on non-ASCII
  2010-08-28 21:17 Odd encoding issue with UTF-8 + gettext yields ? on non-ASCII Ævar Arnfjörð Bjarmason
@ 2010-08-28 21:33 ` Ævar Arnfjörð Bjarmason
  2010-08-28 21:46   ` Jonathan Nieder
  0 siblings, 1 reply; 20+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2010-08-28 21:33 UTC (permalink / raw)
  To: Git Mailing List; +Cc: Marcin Cieslak

On Sat, Aug 28, 2010 at 21:17, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> I'm having an odd encoding issue with gettext on my
> gettextize-git-mainporcelain branch that hadn't been turned up before
> because none of the existing messages used non-ASCII translations.

Well, this is fun. It turns out that reverting 107880a makes it work,
i.e. you need to set LC_CTYPE since reading *.mo files in a
locale-aware manner involves character classification.

But as 107880a explains, doing so broke other parts of Git.

I'll have to think about how to solve that, one way obviously would be
to fix up our vsnprintf() invocation, but there may be others like it
that I haven't spotted.
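
The literal "?" characters are what one would expect if the UTF-8 catalogue text is being converted to an ASCII-only target charset, which is the codeset in effect while LC_CTYPE stays at "C". A minimal sketch of that kind of conversion, assuming glibc's iconv and its "//TRANSLIT" extension; that gettext performs an equivalent conversion internally is an assumption here, not something confirmed in this thread, and `utf8_to_ascii_translit` is a hypothetical helper:

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/*
 * Convert a UTF-8 string to plain ASCII with transliteration enabled.
 * Returns the number of bytes written to `out` (excluding the NUL),
 * or (size_t)-1 on failure. The "//TRANSLIT" suffix is a glibc iconv
 * extension; what an unconvertible character turns into depends on
 * the implementation and the locale.
 */
size_t utf8_to_ascii_translit(const char *in, char *out, size_t outsize)
{
	iconv_t cd;
	char *inp, *outp;
	size_t inleft, outleft;

	assert(in != NULL && out != NULL && outsize > 0);
	inp = (char *)in;	/* iconv() takes char **, but does not modify the input */
	outp = out;
	inleft = strlen(in);
	outleft = outsize - 1;	/* keep room for the terminating NUL */

	cd = iconv_open("ASCII//TRANSLIT", "UTF-8");
	if (cd == (iconv_t)-1)
		return (size_t)-1;
	if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
		iconv_close(cd);
		return (size_t)-1;
	}
	iconv_close(cd);
	*outp = '\0';
	return (size_t)(outp - out);
}
```

Feeding it the Icelandic message from the `.mo` file and comparing the result with the `git init` output above would be one way to check whether this is the conversion at work.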


* Re: Odd encoding issue with UTF-8 + gettext yields ? on non-ASCII
  2010-08-28 21:33 ` Ævar Arnfjörð Bjarmason
@ 2010-08-28 21:46   ` Jonathan Nieder
  2010-08-28 21:59     ` Jonathan Nieder
  0 siblings, 1 reply; 20+ messages in thread
From: Jonathan Nieder @ 2010-08-28 21:46 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Git Mailing List, Marcin Cieslak

Ævar Arnfjörð Bjarmason wrote:

> Well, this is fun. It turns out that reverting 107880a makes it work,
> i.e. you need to set LC_CTYPE since reading *.mo files in a
> locale-aware manner involves character classification.
> 
> But as 107880a explains doing so broke other parts of Git.
> 
> I'll have to think about how to solve that, one way obviously would be
> to fix up our vsnprintf() invocation, but there may be others like it
> that I haven't spotted.

In case you remember: why did vsnprintf() fail in that example?  If I
understand what C99 says correctly (a big if), then

 printf("%s\n", some_nonsense_string);

should always just work.

ltrace indicates that something is wacky about the format string.

 vsnprintf("Author: ", 143, "%s: %.*s%.*s\n", 0xbf8b5738) = -1

The regexes in http-backend are LC_COLLATE- (but probably not LC_CTYPE-)
sensitive.  I am not sure how to go about exhaustively tracking down
ctype-dependencies.


* Re: Odd encoding issue with UTF-8 + gettext yields ? on non-ASCII
  2010-08-28 21:46   ` Jonathan Nieder
@ 2010-08-28 21:59     ` Jonathan Nieder
  2010-08-28 22:14       ` Marcin Cieslak
  0 siblings, 1 reply; 20+ messages in thread
From: Jonathan Nieder @ 2010-08-28 21:59 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Git Mailing List, Marcin Cieslak

Jonathan Nieder wrote:

>  printf("%s\n", some_nonsense_string);
> 
> should always just work.

Ok, so apparently

 #include <stdio.h>
 #include <locale.h>

 int main(void)
 {
        setlocale(LC_CTYPE, "");
        printf("%.11s\n", "Author: \277");
        return 0;
 }

does not work.  Even while

	printf("%.1s\n", "étale");

does print only one byte.

Ideas?


* Re: Odd encoding issue with UTF-8 + gettext yields ? on non-ASCII
  2010-08-28 21:59     ` Jonathan Nieder
@ 2010-08-28 22:14       ` Marcin Cieslak
  2010-08-28 22:16         ` Jonathan Nieder
  2010-08-28 22:20         ` Jonathan Nieder
  0 siblings, 2 replies; 20+ messages in thread
From: Marcin Cieslak @ 2010-08-28 22:14 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Ævar Arnfjörð Bjarmason, Git Mailing List

On Sat, 28 Aug 2010, Jonathan Nieder wrote:

> Jonathan Nieder wrote:
>
>>  printf("%s\n", some_nonsense_string);
>>
>> should always just work.
>
> Ok, so apparently
>
> #include <stdio.h>
> #include <locale.h>
>
> int main(void)
> {
>        setlocale(LC_CTYPE, "");
>        printf("%.11s\n", "Author: \277");
> 	return 0;
> }

On my FreeBSD box (various locales tested) I get the following bytes
output:

41 75 74 68 6f 72 3a 20 bf 0a

> does not work.

What's wrong?

--Marcin


* Re: Odd encoding issue with UTF-8 + gettext yields ? on non-ASCII
  2010-08-28 22:14       ` Marcin Cieslak
@ 2010-08-28 22:16         ` Jonathan Nieder
  2010-08-29  7:36           ` Ævar Arnfjörð Bjarmason
  2010-08-29 18:12           ` Ævar Arnfjörð Bjarmason
  2010-08-28 22:20         ` Jonathan Nieder
  1 sibling, 2 replies; 20+ messages in thread
From: Jonathan Nieder @ 2010-08-28 22:16 UTC (permalink / raw)
  To: Marcin Cieslak; +Cc: Ævar Arnfjörð Bjarmason, Git Mailing List

Marcin Cieslak wrote:

> What's wrong?

$ /lib/libc.so.6  |head -1
GNU C Library (Debian EGLIBC 2.11.2-2) stable release version 2.11.2, by Roland McGrath et al.
$ cat test.c
#include <stdio.h>
#include <locale.h>

int main(void)
{
        int n;

        setlocale(LC_CTYPE, "");
        n = printf("%.11s\n", "Author: \277");
        perror("printf");
        fprintf(stderr, "return value: %d\n", n);
        return 0;
}
$ make test
cc     test.c   -o test
$ ./test
printf: Invalid or incomplete multibyte or wide character
return value: -1

glibc bug?
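
If the failure is indeed tied to precision-limited `%s` under a multibyte LC_CTYPE, one way to sidestep it is to copy the bytes without going through printf at all. A minimal sketch under that assumption; `append_bytes` is a hypothetical helper, not something in git:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Append at most `maxlen` bytes of `src` to the NUL-terminated string in
 * `dst` (total capacity `dstsize`). `src` is treated as opaque bytes, so
 * no multibyte decoding is involved and invalid sequences cannot fail.
 * Returns the new length of `dst`, or (size_t)-1 if it would not fit.
 */
size_t append_bytes(char *dst, size_t dstsize, const char *src, size_t maxlen)
{
	size_t have, n;

	assert(dst != NULL && src != NULL);
	have = strlen(dst);

	/* Byte-wise equivalent of the "%.*s" precision: stop at NUL or maxlen. */
	n = 0;
	while (n < maxlen && src[n] != '\0')
		n++;

	if (have + n + 1 > dstsize)
		return (size_t)-1;
	memcpy(dst + have, src, n);
	dst[have + n] = '\0';
	return have + n;
}
```

With `"Author: \277"` and a limit of 11 this yields the same nine bytes Marcin reported from FreeBSD, with the trailing 0xbf passed through untouched.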


* Re: Odd encoding issue with UTF-8 + gettext yields ? on non-ASCII
  2010-08-28 22:14       ` Marcin Cieslak
  2010-08-28 22:16         ` Jonathan Nieder
@ 2010-08-28 22:20         ` Jonathan Nieder
  2010-08-28 22:30           ` Marcin Cieslak
  1 sibling, 1 reply; 20+ messages in thread
From: Jonathan Nieder @ 2010-08-28 22:20 UTC (permalink / raw)
  To: Marcin Cieslak; +Cc: Ævar Arnfjörð Bjarmason, Git Mailing List

Marcin Cieslak wrote:

> What's wrong?

http://sourceware.org/bugzilla/show_bug.cgi?id=6530


* Re: Odd encoding issue with UTF-8 + gettext yields ? on non-ASCII
  2010-08-28 22:20         ` Jonathan Nieder
@ 2010-08-28 22:30           ` Marcin Cieslak
  0 siblings, 0 replies; 20+ messages in thread
From: Marcin Cieslak @ 2010-08-28 22:30 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Ævar Arnfjörð Bjarmason, Git Mailing List

On Sat, 28 Aug 2010, Jonathan Nieder wrote:

> Marcin Cieslak wrote:
>
>> What's wrong?
>
> http://sourceware.org/bugzilla/show_bug.cgi?id=6530

Yes, a pretty old SPARC Solaris box gives me the same result
as FreeBSD.

--Marcin


* Re: Odd encoding issue with UTF-8 + gettext yields ? on non-ASCII
  2010-08-28 22:16         ` Jonathan Nieder
@ 2010-08-29  7:36           ` Ævar Arnfjörð Bjarmason
  2010-08-29  8:37             ` Ævar Arnfjörð Bjarmason
  2010-08-30  2:22             ` Jonathan Nieder
  2010-08-29 18:12           ` Ævar Arnfjörð Bjarmason
  1 sibling, 2 replies; 20+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2010-08-29  7:36 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Marcin Cieslak, Git Mailing List

On Sat, Aug 28, 2010 at 22:16, Jonathan Nieder <jrnieder@gmail.com> wrote:
> Marcin Cieslak wrote:
>
>> What's wrong?
>
> $ /lib/libc.so.6  |head -1
> GNU C Library (Debian EGLIBC 2.11.2-2) stable release version 2.11.2, by Roland McGrath et al.
> $ cat test.c
> #include <stdio.h>
> #include <locale.h>
>
> int main(void)
> {
>        int n;
>
>        setlocale(LC_CTYPE, "");
>        n = printf("%.11s\n", "Author: \277");
>        perror("printf");
>        fprintf(stderr, "return value: %d\n", n);
>        return 0;
> }
> $ make test
> cc     test.c   -o test
> $ ./test
> printf: Invalid or incomplete multibyte or wide character
> return value: -1
>
> glibc bug?

It would appear so. It seems my monkeypatch in 107880a was the wrong
way to do it. We should be setting LC_CTYPE, and providing a fallback
for GNU's buggy sprintf().

We also have another bug, compiling git with
SNPRINTF_RETURNS_BOGUS=YesGNuIsBuggy and running "git show v0.99.6~1"
on our own repository causes a segfault, presumably due to the same
bug, but I didn't track it down further than this:

    (gdb) run show v0.99.6~1
    Starting program: /home/avar/g/git/git show v0.99.6~1
    [Thread debugging using libthread_db enabled]

    Program received signal SIGSEGV, Segmentation fault.
    __strnlen (str=0x8f <Address 0x8f out of bounds>, maxlen=<value optimized out>) at strnlen.c:47
    47      strnlen.c: No such file or directory.
            in strnlen.c
    (gdb) bt
    #0  __strnlen (str=0x8f <Address 0x8f out of bounds>, maxlen=<value optimized out>) at strnlen.c:47
    #1  0x00007ffff73318cd in __mbsnrtowcs (dst=0x7fffffffb960 L"\xffffb9a0翿\x404500", src=0x7fffffffcf78, nmc=8557680, len=<value optimized out>,
        ps=0x7fffffffcfb0) at mbsnrtowcs.c:66
    #2  0x00007ffff72ef932 in _IO_vfprintf_internal (s=0x7fffffffd010, format=<value optimized out>, ap=0x7fffffffd330) at vfprintf.c:1614
    #3  0x00007ffff7311432 in _IO_vsnprintf (
        string=0x829510 "H\215C\020H\213\\$\bH\213l$\020L\213d$\030L\213l$ L\213t$(L\213|$0H\203\304\070\303f\017\037D: \177", maxlen=<value optimized out>,
        format=0x58aa28 "%s: %.*s%.*s\n", args=0x7fffffffd330) at vsnprintf.c:120
    #4  0x00000000005569c6 in git_vsnprintf (
        str=0x829510 "H\215C\020H\213\\$\bH\213l$\020L\213d$\030L\213l$ L\213t$(L\213|$0H\203\304\070\303f\017\037D: \177", maxsize=572,
        format=0x58aa28 "%s: %.*s%.*s\n", ap=0x7fffffffd330) at snprintf.c:45
    #5  0x000000000053877f in strbuf_addf (sb=0x7fffffffd640, fmt=0x58aa28 "%s: %.*s%.*s\n") at strbuf.c:203
    #6  0x000000000050267d in pp_user_info (what=0x58aa4e "Author", fmt=CMIT_FMT_MEDIUM, sb=0x7fffffffd640,
        line=0x830675 "David_K\345gedal <davidk@lysator.liu.se> 1126078160 +0200\ncommitter Junio C Hamano <junkio@cox.net> 1126128590 -0700\n\n[PATCH] Simplify git script\n\nThe code for listing the available subcommands was unnec"..., dmode=DATE_NORMAL, encoding=0x57025d "UTF-8") at pretty.c:283
    #7  0x0000000000503be9 in pp_header (fmt=CMIT_FMT_MEDIUM, abbrev=7, dmode=DATE_NORMAL, encoding=0x57025d "UTF-8", commit=0x832cf8, msg_p=0x7fffffffd558,
        sb=0x7fffffffd640) at pretty.c:1077
    #8  0x0000000000503865 in pretty_print_commit (fmt=CMIT_FMT_MEDIUM, commit=0x832cf8, sb=0x7fffffffd640, context=0x7fffffffd5e8) at pretty.c:1219
    #9  0x00000000004ead8b in show_log (opt=0x7fffffffd858) at log-tree.c:508
    #10 0x00000000004eb1b2 in log_tree_diff_flush (opt=0x7fffffffd858) at log-tree.c:557
    #11 0x00000000004eb5e9 in log_tree_diff (opt=0x7fffffffd858, commit=0x832cf8, log=0x7fffffffd728) at log-tree.c:635
    #12 0x00000000004eb322 in log_tree_commit (opt=0x7fffffffd858, commit=0x832cf8) at log-tree.c:658
    #13 0x000000000044ff0b in cmd_log_walk (rev=0x7fffffffd858) at log.c:372
    #14 0x0000000000450555 in cmd_show (argc=2, argv=0x7fffffffe070, prefix=0x0) at log.c:561
    #15 0x00000000004058a7 in run_builtin (p=0x7bf708, argc=2, argv=0x7fffffffe070) at git.c:278
    #16 0x0000000000404a54 in handle_internal_command (argc=2, argv=0x7fffffffe070) at git.c:434
    #17 0x000000000040513e in run_argv (argcp=0x7fffffffdf78, argv=0x7fffffffdf70) at git.c:478
    #18 0x0000000000404875 in main (argc=2, argv=0x7fffffffe070) at git.c:553
    (gdb) frame 4
    #4  0x00000000005569c6 in git_vsnprintf (
        str=0x829510 "H\215C\020H\213\\$\bH\213l$\020L\213d$\030L\213l$ L\213t$(L\213|$0H\203\304\070\303f\017\037D: \177", maxsize=572,
        format=0x58aa28 "%s: %.*s%.*s\n", ap=0x7fffffffd330) at snprintf.c:45
    45                      ret = vsnprintf(str, maxsize-SNPRINTF_SIZE_CORR, format, ap);
    (gdb) p str
    $1 = 0x829510 "H\215C\020H\213\\$\bH\213l$\020L\213d$\030L\213l$ L\213t$(L\213|$0H\203\304\070\303f\017\037D: \177"
    (gdb) p maxsize
    $2 = 572
    (gdb) p SNPRINTF_SIZE_CORR
    No symbol "SNPRINTF_SIZE_CORR" in current context.
    (gdb) p format
    $3 = 0x58aa28 "%s: %.*s%.*s\n"
    (gdb) p ap
    $4 = (__va_list_tag *) 0x7fffffffd330
    (gdb) frame 6
    #6  0x000000000050267d in pp_user_info (what=0x58aa4e "Author", fmt=CMIT_FMT_MEDIUM, sb=0x7fffffffd640,
        line=0x830675 "David_K\345gedal <davidk@lysator.liu.se> 1126078160 +0200\ncommitter Junio C Hamano <junkio@cox.net> 1126128590 -0700\n\n[PATCH] Simplify git script\n\nThe code for listing the available subcommands was unnec"..., dmode=DATE_NORMAL, encoding=0x57025d "UTF-8") at pretty.c:283
    283                     strbuf_addf(sb, "%s: %.*s%.*s\n", what,


* Re: Odd encoding issue with UTF-8 + gettext yields ? on non-ASCII
  2010-08-29  7:36           ` Ævar Arnfjörð Bjarmason
@ 2010-08-29  8:37             ` Ævar Arnfjörð Bjarmason
  2010-08-30  2:22             ` Jonathan Nieder
  1 sibling, 0 replies; 20+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2010-08-29  8:37 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Marcin Cieslak, Git Mailing List

On Sun, Aug 29, 2010 at 07:36, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:

> We also have another bug, compiling git with
> SNPRINTF_RETURNS_BOGUS=YesGNuIsBuggy and running "git show v0.99.6~1"
> on our own repository causes a segfault, presumably due to the same
> bug, but I didn't track it down further than this:

I forgot to mention, compiling it with this partial revert of 107880a of course:

    diff --git a/gettext.c b/gettext.c
    index db99742..7ae5cae 100644
    --- a/gettext.c
    +++ b/gettext.c
    @@ -19,2 +19,3 @@ extern void git_setup_gettext(void) {
            (void)setlocale(LC_MESSAGES, "");
    +       (void)setlocale(LC_CTYPE, "");
            (void)textdomain("git");


* Re: Odd encoding issue with UTF-8 + gettext yields ? on non-ASCII
  2010-08-28 22:16         ` Jonathan Nieder
  2010-08-29  7:36           ` Ævar Arnfjörð Bjarmason
@ 2010-08-29 18:12           ` Ævar Arnfjörð Bjarmason
  2010-08-29 20:45             ` Jonathan Nieder
  1 sibling, 1 reply; 20+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2010-08-29 18:12 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Marcin Cieslak, Git Mailing List, Junio C Hamano

On Sat, Aug 28, 2010 at 22:16, Jonathan Nieder <jrnieder@gmail.com> wrote:

> $ /lib/libc.so.6  |head -1
> GNU C Library (Debian EGLIBC 2.11.2-2) stable release version 2.11.2, by Roland McGrath et al.
> $ cat test.c
> #include <stdio.h>
> #include <locale.h>
>
> int main(void)
> {
>        int n;
>
>        setlocale(LC_CTYPE, "");
>        n = printf("%.11s\n", "Author: \277");
>        perror("printf");
>        fprintf(stderr, "return value: %d\n", n);
>        return 0;
> }
> $ make test
> cc     test.c   -o test
> $ ./test
> printf: Invalid or incomplete multibyte or wide character
> return value: -1

So, my plan of attack is:

 * Add compat/printf from Free, Open or NetBSD. Maybe make
   compat/snprintf.c use that while I'm at it.
 * Use that instead of the GNU libc printf on systems that have glibc.
 * Add a configure check for that.
 * Revert 107880a
 * Get gettext goodness with LC_CTYPE

Does anyone see a problem with that? The potential issue is that
LC_CTYPE is for:

    "regular expression matching, character classification,
    conversion, case-sensitive comparison, and wide character
    functions."

So it might have unintended side-effects. But the only other
workaround I can see is to decree that all consumers of the localized
messages must have a UTF-8 locale.


* Re: Odd encoding issue with UTF-8 + gettext yields ? on non-ASCII
  2010-08-29 18:12           ` Ævar Arnfjörð Bjarmason
@ 2010-08-29 20:45             ` Jonathan Nieder
  2010-08-30  8:57               ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 20+ messages in thread
From: Jonathan Nieder @ 2010-08-29 20:45 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Marcin Cieslak, Git Mailing List, Junio C Hamano

Ævar Arnfjörð Bjarmason wrote:

> So, my plan of attack is:
> 
>  * Add compat/printf from Free, Open or NetBSD. Maybe make
>    compat/snprintf.c use that while I'm at it.

I would prefer to get this fixed in glibc, but of course that
has nothing to do with git.

>  * Use that instead of the GNU libc printf on systems that have glibc.
>  * Add a configure check for that.
>  * Revert 107880a
>  * Get gettext goodness with LC_CTYPE
> 
> Does anyone see a problem with that? The potential issue is that
> LC_CTYPE is for:
> 
>     "regular expression matching,

should be okay, I think (unless http-backend is a problem)

> character classification,

worked around (see git-compat-util.h)

>     conversion,

I don't know what this means; iconv() is not affected by LC_CTYPE,
is it?

> case-sensitive comparison,

Could be a problem: we use strcasecmp() heavily.

> and wide character
>     functions."

no problem. :)

> So it might have unintended side-effects. But the only other
> workaround I can see is to decree that all consumers of the localized
> messages must have a UTF-8 locale.

And that is no workaround at all; the problem is still seen with UTF-8
locales, no?


* Re: Odd encoding issue with UTF-8 + gettext yields ? on non-ASCII
  2010-08-29  7:36           ` Ævar Arnfjörð Bjarmason
  2010-08-29  8:37             ` Ævar Arnfjörð Bjarmason
@ 2010-08-30  2:22             ` Jonathan Nieder
  1 sibling, 0 replies; 20+ messages in thread
From: Jonathan Nieder @ 2010-08-30  2:22 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Marcin Cieslak, Git Mailing List

Ævar Arnfjörð Bjarmason wrote:

> We also have another bug, compiling git with
> SNPRINTF_RETURNS_BOGUS=YesGNuIsBuggy and running "git show v0.99.6~1"
> on our own repository causes a segfault

That's because the glibc bug is not the bug SNPRINTF_RETURNS_BOGUS is
meant to guard against.  Hopefully no printf implementation has both
bugs. :)
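
To make the distinction concrete: under C99, vsnprintf() returns the length that would have been needed when the output is truncated, and a negative value only on error, which is what the glibc failure above produces. A minimal sketch of a caller that keeps the two cases apart and never uses a negative result as a length; `format_checked` and its return codes are hypothetical, not git's compat wrapper:

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format into `buf`. Returns the number of bytes written (excluding the
 * NUL) on success, -1 if vsnprintf() reported an error, and -2 if the
 * output was truncated.
 */
int format_checked(char *buf, size_t size, const char *fmt, ...)
{
	va_list ap;
	int ret;

	assert(buf != NULL && size > 0 && fmt != NULL);
	va_start(ap, fmt);
	ret = vsnprintf(buf, size, fmt, ap);
	va_end(ap);

	if (ret < 0) {
		buf[0] = '\0';	/* conversion error; leave a valid empty string */
		return -1;
	}
	if ((size_t)ret >= size)
		return -2;	/* truncated; ret + 1 bytes would have been enough */
	return ret;
}
```

A caller that grows its buffer and retries should do so only for the truncation case; retrying after an error repeats the same failure.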


* Re: Odd encoding issue with UTF-8 + gettext yields ? on non-ASCII
  2010-08-29 20:45             ` Jonathan Nieder
@ 2010-08-30  8:57               ` Ævar Arnfjörð Bjarmason
  2010-08-30 13:41                 ` Jonathan Nieder
  0 siblings, 1 reply; 20+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2010-08-30  8:57 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Marcin Cieslak, Git Mailing List, Junio C Hamano

On Sun, Aug 29, 2010 at 20:45, Jonathan Nieder <jrnieder@gmail.com> wrote:
> Ævar Arnfjörð Bjarmason wrote:
>
>> So, my plan of attack is:
>>
>>  * Add compat/printf from Free, Open or NetBSD. Maybe make
>>    compat/snprintf.c use that while I'm at it.
>
> I would prefer to get this fixed in glibc, but of course that
> has nothing to do with git.

Yeah, but even if it's fixed there, the glibc versions we have to
support won't all be updated for at least ten years.

So even if the bug were fixed upstream today we'd still need a
workaround.

>>  * Use that instead of the GNU libc printf on systems that have glibc.
>>  * Add a configure check for that.
>>  * Revert 107880a
>>  * Get gettext goodness with LC_CTYPE
>>
>> Does anyone see a problem with that? The potential issue is that
>> LC_CTYPE is for:
>>
>>     "regular expression matching,
>
> should be okay, I think (unless http-backend is a problem)

User-level commands that take regexes would have different semantics
based on the locale though, e.g. git log --grep=<regex>.

>> character classification,
>
> worked around (see git-compat-util.h)

Yay sane_istest!

>>     conversion,
>
> I don't know what this means; iconv() is not affected by LC_CTYPE,
> is it?

I think it's only to do with functions like btowc, see:
http://www.gnu.org/s/libc/manual/html_node/Restartable-multibyte-conversion.html#Restartable-multibyte-conversion

>> case-sensitive comparison,
>
> Could be a problem: we use strcasecmp() heavily.

Yeah, strcasecmp is affected by LC_CTYPE.

>> and wide character
>>     functions."
>
> no problem. :)

Nope.

>> So it might have unintended side-effects. But the only other
>> workaround I can see is to decree that all consumers of the localized
>> messages must have a UTF-8 locale.
>
> And that is no workaround at all; the problem is still seen with UTF-8
> locales, no?

No, it'll be seen with all non-UTF-8 locales. Here's the issue:

When we add non-ASCII to the po/*.po files we'll write it in UTF-8 as
a matter of policy, simply because that's all the rage these days.

However, unless we put "Content-Type: text/plain; charset=UTF-8\n" in
the file the gettext utilities won't *know* that it's in UTF-8, if
it's not there then to them it'll just be a raw stream of bytes. So
they won't do the right conversion under non-UTF-8 locales.

But users using a gettext translation under a UTF-8 locale won't tell
the difference, since the *.po encoding and their expected encoding
don't differ they don't need any conversion anyway.

We can even keep the "Content-Type: text/plain; charset=UTF-8\n" and
*not* use LC_CTYPE if we add a bind_textdomain_codeset("git", "UTF-8")
call to gettext.c. That call asks gettext to return messages for the
"git" domain in UTF-8 (so without LC_CTYPE there still won't be any
conversion to the user's charset), see
http://www.gnu.org/s/libc/manual/html_node/Charset-conversion-in-gettext.html#Charset-conversion-in-gettext

Here's a table explaining the various approaches:

    A: [correctness] LC_CTYPE + *.po charset=UTF-8
    B: [UTF-8-only hack] no LC_CTYPE + no *.po charset=UTF-8
    C: [UTF-8-only hack] no LC_CTYPE + A *.po charset=UTF-8 + bind_textdomain_codeset("git", "UTF-8")

    | Approach | Correct *.po encoding header | GNU printf() issue | LANG=is_IS.utf8 OK | LANG=is_IS.iso88591 OK  |
    |----------+------------------------------+--------------------+--------------------+-------------------------|
    | A        | X                            | X                  | X                  | X                       |
    | B        | No                           | No, no LC_CTYPE    | X                  | No, still outputs UTF-8 |
    | C        | X                            | No, no LC_CTYPE    | X                  | No, still outputs UTF-8 |

A would be preferred for correctness, and with a fallback BSD printf()
we can avoid the GNU libc bug, however that'll mean using LC_CTYPE,
which'll have some small side-effects for the rest of the code.
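
Approach C could be sketched roughly as follows. This is an illustration under assumptions, not the contents of gettext.c: `setup_gettext_utf8` and the `localedir` parameter are hypothetical, and only standard libintl and locale calls are used.

```c
#include <assert.h>
#include <libintl.h>
#include <locale.h>
#include <stddef.h>

/*
 * Approach C: pick up the user's message language but leave LC_CTYPE at
 * its "C" default, and ask gettext for UTF-8 output for the "git" domain.
 * Returns the codeset reported by bind_textdomain_codeset(), or NULL on
 * failure. `localedir` is a caller-supplied catalogue directory.
 */
const char *setup_gettext_utf8(const char *localedir)
{
	assert(localedir != NULL);

	(void)setlocale(LC_MESSAGES, "");	/* language selection only */
	if (!bindtextdomain("git", localedir))
		return NULL;
	if (!textdomain("git"))
		return NULL;
	/* Messages are handed back as UTF-8 bytes whatever the terminal expects. */
	return bind_textdomain_codeset("git", "UTF-8");
}
```

This matches the last column of the table: with LANG=is_IS.iso88591 the translated strings would still come out as UTF-8.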


* Re: Odd encoding issue with UTF-8 + gettext yields ? on non-ASCII
  2010-08-30  8:57               ` Ævar Arnfjörð Bjarmason
@ 2010-08-30 13:41                 ` Jonathan Nieder
  2010-08-30 14:00                   ` Marcin Cieslak
  2010-08-30 14:04                   ` Ævar Arnfjörð Bjarmason
  0 siblings, 2 replies; 20+ messages in thread
From: Jonathan Nieder @ 2010-08-30 13:41 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Marcin Cieslak, Git Mailing List, Junio C Hamano

Ævar Arnfjörð Bjarmason wrote:

> We can even keep the "Content-Type: text/plain; charset=UTF-8\n" and
> *not* use LC_CTYPE if we add a bind_textdomain_codeset("git", "UTF-8")
> call to gettext.

Oh!  I'd personally prefer to do that for now. :)  (Not because of the
known printf problem but because I like to reduce possible unknowns.)


* Re: Odd encoding issue with UTF-8 + gettext yields ? on non-ASCII
  2010-08-30 13:41                 ` Jonathan Nieder
@ 2010-08-30 14:00                   ` Marcin Cieslak
  2010-08-30 14:09                     ` Jonathan Nieder
  2010-08-30 14:13                     ` Ævar Arnfjörð Bjarmason
  2010-08-30 14:04                   ` Ævar Arnfjörð Bjarmason
  1 sibling, 2 replies; 20+ messages in thread
From: Marcin Cieslak @ 2010-08-30 14:00 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Jonathan Nieder
  Cc: Git Mailing List, Junio C Hamano

On Mon, 30 Aug 2010, Ævar Arnfjörð Bjarmason wrote:

> On Sun, Aug 29, 2010 at 20:45, Jonathan Nieder <jrnieder@gmail.com> wrote:
> A would be preferred for correctness, and with a fallback BSD printf()
> we can avoid the GNU libc bug, however that'll mean using LC_CTYPE,
> which'll have some small side-effects for the rest of the code.

The real problem is that you are probably using the same
(locale-enabled) functions for the user-facing side as well as for the
backend (talking to the repository). Some projects decided to use
some special encoding internally (like UCS-2 in case of Java
and Python 2.x, Unicode ordinals in Python 3.x). Otherwise
you may end up with some incompatibilities in the on-disk or
on-network format. I don't think you want to keep telling all bug
reporters for a few years - "Can you try that again with env LANG=C,
please?" :)

Bringing Unicode onboard means that simple strlen() is no longer
what you normally think it does.

On Mon, 30 Aug 2010, Jonathan Nieder wrote:

> Ævar Arnfjörð Bjarmason wrote:
>
>> We can even keep the "Content-Type: text/plain; charset=UTF-8\n" and
>> *not* use LC_CTYPE if we add a bind_textdomain_codeset("git", "UTF-8")
>> call to gettext.
>
> Oh!  I'd personally prefer to do that for now. :)  (Not because of the
> known printf problem but because I like to reduce possible unknowns.)

Well, in this case everybody will be forced to have UTF-8 in output
on-screen, not useful for people using ISO8859-*, KOI8-R and similar
things...

--Marcin


* Re: Odd encoding issue with UTF-8 + gettext yields ? on non-ASCII
  2010-08-30 13:41                 ` Jonathan Nieder
  2010-08-30 14:00                   ` Marcin Cieslak
@ 2010-08-30 14:04                   ` Ævar Arnfjörð Bjarmason
  1 sibling, 0 replies; 20+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2010-08-30 14:04 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Marcin Cieslak, Git Mailing List, Junio C Hamano

On Mon, Aug 30, 2010 at 13:41, Jonathan Nieder <jrnieder@gmail.com> wrote:
> Ævar Arnfjörð Bjarmason wrote:
>
>> We can even keep the "Content-Type: text/plain; charset=UTF-8\n" and
>> *not* use LC_CTYPE if we add a bind_textdomain_codeset("git", "UTF-8")
>> call to gettext.
>
> Oh!  I'd personally prefer to do that for now. :)  (Not because of the
> known printf problem but because I like to reduce possible unknowns.)

By now I want to do that too. I've been experimenting with including
*printf*.c from OpenBSD, NetBSD or FreeBSD and the uClibc and in all
those cases it's a major PITA to wade through the OS-specific code
that deep in the libc.

Even if I could get that sorted it'll be non-trivial to audit all the
code whose semantics will change with LC_CTYPE, and there's a good
chance I'll miss something and cause an embarrassing bug in some
unrelated component.

Better to just document this limitation for now and decree that
gettext users must have a UTF-8 locale.


* Re: Odd encoding issue with UTF-8 + gettext yields ? on non-ASCII
  2010-08-30 14:00                   ` Marcin Cieslak
@ 2010-08-30 14:09                     ` Jonathan Nieder
  2010-08-30 14:33                       ` Ævar Arnfjörð Bjarmason
  2010-08-30 14:13                     ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 20+ messages in thread
From: Jonathan Nieder @ 2010-08-30 14:09 UTC (permalink / raw)
  To: Marcin Cieslak
  Cc: Ævar Arnfjörð Bjarmason, Git Mailing List,
	Junio C Hamano

Marcin Cieslak wrote:

> Well, in this case everybody will be forced to have UTF-8 in output
> on-screen, not useful for people using ISO8859-*, KOI8-R and similar
> things...

Can't we do:

	setlocale(LC_CTYPE, "");
	charset = nl_langinfo(CODESET);
	setlocale(LC_CTYPE, "C");

to allow an arbitrary character set?
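
Spelled out as a function, the idea might look like the sketch below. `user_codeset` is a hypothetical helper; the one detail added here is that the result is copied, since the string returned by nl_langinfo() may be overwritten or invalidated by later setlocale() calls.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return a malloc()ed copy of the user's codeset name (for example
 * "UTF-8" or "ISO-8859-1"), leaving LC_CTYPE set to "C" afterwards.
 * Returns NULL on allocation failure; the caller frees the result.
 */
char *user_codeset(void)
{
	const char *cs;
	char *copy;

	(void)setlocale(LC_CTYPE, "");		/* briefly adopt the user's LC_CTYPE */
	cs = nl_langinfo(CODESET);
	assert(cs != NULL);
	copy = malloc(strlen(cs) + 1);
	if (copy)
		strcpy(copy, cs);		/* copy before the locale changes again */
	(void)setlocale(LC_CTYPE, "C");		/* restore the default for the rest of the code */
	return copy;
}
```

The copy could then be passed to `bind_textdomain_codeset("git", ...)` in place of the hard-coded "UTF-8", so that gettext converts to whatever the terminal expects while the rest of the program keeps running under the "C" LC_CTYPE.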


* Re: Odd encoding issue with UTF-8 + gettext yields ? on non-ASCII
  2010-08-30 14:00                   ` Marcin Cieslak
  2010-08-30 14:09                     ` Jonathan Nieder
@ 2010-08-30 14:13                     ` Ævar Arnfjörð Bjarmason
  1 sibling, 0 replies; 20+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2010-08-30 14:13 UTC (permalink / raw)
  To: Marcin Cieslak; +Cc: Jonathan Nieder, Git Mailing List, Junio C Hamano

On Mon, Aug 30, 2010 at 14:00, Marcin Cieslak <saper@saper.info> wrote:
> On Mon, 30 Aug 2010, Ævar Arnfjörð Bjarmason wrote:
>> On Sun, Aug 29, 2010 at 20:45, Jonathan Nieder <jrnieder@gmail.com> wrote:
>> A would be preferred for correctness, and with a fallback BSD printf()
>> we can avoid the GNU libc bug, however that'll mean using LC_CTYPE,
>> which'll have some small side-effects for the rest of the code.
>
> The real problem is that you are probably using the same
> (locale-enabled) functions for the user-facing side as well as for the
> backend (talking to the repository). Some projects decided to use
> some special encoding internally (like UCS-2 in case of Java
> and Python 2.x, Unicode ordinals in Python 3.x). Otherwise
> you may end up with some incompatibilities in the on-disk or on-network
> format. I don't think you want to keep telling all bug reporters for a few
> years - "Can you try that again with env LANG=C,
> please?" :)

Yeah, those programs can probably get away with it too because they
either implement their own string functions, or don't use setlocale()
at all for their localizations.

> Bringing Unicode onboard means that simple strlen() is no longer
> what you normally think it does.

I'm pretty sure strlen() always gives you the number of bytes before
the terminating null, regardless of locale settings. wcslen() is the
wide-character equivalent.
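
To make the byte/character distinction concrete, here is a minimal
sketch. The helper name utf8_codepoints() is made up for illustration,
and it assumes its input is already valid UTF-8; it is not something
git provides.

```c
#include <stddef.h>
#include <string.h>

/*
 * Count the code points in a valid UTF-8 string. Every byte that is
 * not a continuation byte (bit pattern 10xxxxxx) starts a new code
 * point, so counting those bytes is enough. Hypothetical helper.
 */
static size_t utf8_codepoints(const char *s)
{
	size_t n = 0;

	for (; *s; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}
```

For the UTF-8 string "Bj\xc3\xb3" ("Bjó"), strlen() reports 4 bytes
while utf8_codepoints() reports 3 code points, and neither result
depends on the locale.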


* Re: Odd encoding issue with UTF-8 + gettext yields ? on non-ASCII
  2010-08-30 14:09                     ` Jonathan Nieder
@ 2010-08-30 14:33                       ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 20+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2010-08-30 14:33 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Marcin Cieslak, Git Mailing List, Junio C Hamano

On Mon, Aug 30, 2010 at 14:09, Jonathan Nieder <jrnieder@gmail.com> wrote:
> Marcin Cieslak wrote:
>
>> Well, in this case everybody will be forced to have UTF-8 in output
>> on-screen, not useful for people using ISO8859-*, KOI8-R and similar
>> things...
>
> Can't we do:
>
>        setlocale(LC_CTYPE, "");
>        charset = nl_langinfo(CODESET);
>        setlocale(LC_CTYPE, "C");
>
> to allow an arbitrary character set?

Yes, it seems so! With this patch:

    --- a/gettext.c
    +++ b/gettext.c
    @@ -3,2 +3,3 @@
     #include <libintl.h>
    +#include <langinfo.h>
     #include <stdlib.h>
    @@ -8,2 +9,3 @@ extern void git_setup_gettext(void) {
            char *envdir = getenv("GIT_TEXTDOMAINDIR");
    +       char *charset;

    @@ -19,2 +21,6 @@ extern void git_setup_gettext(void) {
            (void)setlocale(LC_MESSAGES, "");
    +       (void)setlocale(LC_CTYPE, "");
    +       charset = nl_langinfo(CODESET);
    +       (void)bind_textdomain_codeset("git", charset);
    +       (void)setlocale(LC_CTYPE, "C");
            (void)textdomain("git");
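
Read as a whole rather than as a diff, the setup sequence could look
roughly like the sketch below. This is not the actual gettext.c: the
function name example_setup_gettext() and the EXAMPLE_LOCALEDIR
fallback are assumptions made for illustration, and only the calls
visible in the patch are taken from it.

```c
#include <langinfo.h>
#include <libintl.h>
#include <locale.h>
#include <stdlib.h>

/* Hypothetical fallback; a real build would supply its own path. */
#define EXAMPLE_LOCALEDIR "/usr/share/locale"

/*
 * Sketch: pick the message language from the environment, ask the
 * environment's LC_CTYPE for its codeset, tell gettext to convert
 * messages to that codeset, then put LC_CTYPE back to "C" so the
 * rest of the program keeps C-locale character handling.
 */
static void example_setup_gettext(void)
{
	const char *podir = getenv("GIT_TEXTDOMAINDIR");

	if (!podir)
		podir = EXAMPLE_LOCALEDIR;
	(void)bindtextdomain("git", podir);

	(void)setlocale(LC_MESSAGES, "");
	(void)setlocale(LC_CTYPE, "");
	/*
	 * POSIX allows a later setlocale() call to invalidate the
	 * string returned by nl_langinfo(), so it is handed over
	 * before LC_CTYPE is reset.
	 */
	(void)bind_textdomain_codeset("git", nl_langinfo(CODESET));
	(void)setlocale(LC_CTYPE, "C");

	(void)textdomain("git");
}
```

The point of the ordering is that LC_CTYPE is only switched on long
enough to learn the user's codeset; functions such as printf() and
isalpha() never run under it.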

The sanity test still passes:

    ./t0201-gettext-fallbacks.sh ......... ok
    ./t0200-gettext-basic.sh ............. ok
    ./t0203-gettext-setlocale-sanity.sh .. ok
    ./t0202-gettext-perl.sh .............. ok
    All tests successful.

And the resulting git binary can emit the same translated message as
UTF-8 or as ISO-8859-1 text, under the corresponding locales:

    Bjó til tóma Git lind í /tmp/meh/.git/
    Bj� til t�ma Git lind � /tmp/meh/.git/
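
The second line presumably looks garbled here because its non-ASCII
letters are single ISO-8859-1 bytes, which are not valid UTF-8 on
their own. The following sketch shows the byte-level difference. The
helper utf8_to_latin1() is hypothetical and hand-rolled for
illustration only; libintl does its own conversion once the codeset
is bound.

```c
#include <stddef.h>

/*
 * Convert valid UTF-8 to ISO-8859-1. Only code points up to U+00FF
 * exist in ISO-8859-1, so anything else is rejected. Returns the
 * number of bytes written (not counting the terminating null), or
 * (size_t)-1 if the input does not fit the target charset or buffer.
 */
static size_t utf8_to_latin1(const char *in, char *out, size_t outlen)
{
	const unsigned char *p = (const unsigned char *)in;
	size_t n = 0;

	if (outlen == 0)
		return (size_t)-1;

	while (*p) {
		unsigned char c;

		if (*p < 0x80) {
			/* ASCII is one byte in both encodings. */
			c = *p++;
		} else if ((*p == 0xC2 || *p == 0xC3) &&
			   (p[1] & 0xC0) == 0x80) {
			/* U+0080..U+00FF: two UTF-8 bytes become one. */
			c = (unsigned char)(((*p & 0x03) << 6) | (p[1] & 0x3F));
			p += 2;
		} else {
			return (size_t)-1;
		}

		/* Keep room for this byte and the terminating null. */
		if (n + 1 >= outlen)
			return (size_t)-1;
		out[n++] = (char)c;
	}
	out[n] = '\0';
	return n;
}
```

With this, the four UTF-8 bytes of "Bjó" (42 6A C3 B3) become the
three ISO-8859-1 bytes 42 6A F3, and a terminal expecting UTF-8 has
no valid way to display the lone F3.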

