[2.6 patch] UTF-8 fixes in comments


* [2.6 patch] UTF-8 fixes in comments
@ 2008-04-28 15:40 Adrian Bunk
  2008-04-28 23:05 ` Willy Tarreau
  2008-04-29 12:18 ` KOSAKI Motohiro
  0 siblings, 2 replies; 48+ messages in thread
From: Adrian Bunk @ 2008-04-28 15:40 UTC (permalink / raw)
  To: linux-kernel; +Cc: trivial

[-- Attachment #1: Type: text/plain, Size: 1497 bytes --]

This patch converts some non-UTF-8 encoded text in comments to UTF-8.

Signed-off-by: Adrian Bunk <bunk@kernel.org>

---

This patch is attached compressed to prevent my MUA from mangling it.

 Documentation/PCI/pcieaer-howto.txt |    2 -
 arch/arm/mach-omap2/io.c            |    2 -
 arch/s390/kernel/ebcdic.c           |   36 ++++++++++++++--------------
 drivers/hid/hid-input.c             |    2 -
 drivers/isdn/hisax/enternow_pci.c   |    2 -
 drivers/media/video/saa5249.c       |    2 -
 drivers/misc/ibmasm/command.c       |    2 -
 drivers/misc/ibmasm/dot_command.c   |    2 -
 drivers/misc/ibmasm/dot_command.h   |    2 -
 drivers/misc/ibmasm/event.c         |    2 -
 drivers/misc/ibmasm/heartbeat.c     |    2 -
 drivers/misc/ibmasm/i2o.h           |    2 -
 drivers/misc/ibmasm/ibmasm.h        |    2 -
 drivers/misc/ibmasm/ibmasmfs.c      |    2 -
 drivers/misc/ibmasm/lowlevel.c      |    2 -
 drivers/misc/ibmasm/lowlevel.h      |    2 -
 drivers/misc/ibmasm/module.c        |    2 -
 drivers/misc/ibmasm/r_heartbeat.c   |    2 -
 drivers/misc/ibmasm/remote.h        |    2 -
 drivers/misc/ibmasm/uart.c          |    2 -
 drivers/s390/ebcdic.c               |   36 ++++++++++++++--------------
 drivers/scsi/jazz_esp.c             |    2 -
 drivers/spi/omap2_mcspi.c           |    2 -
 drivers/usb/storage/cypress_atacb.c |    2 -
 drivers/video/omap/rfbi.c           |    2 -
 drivers/video/omap/sossi.c          |    2 -
 26 files changed, 60 insertions(+), 60 deletions(-)
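
As an aside, here is a minimal sketch of how stray non-UTF-8 lines of the kind this patch fixes could be located. It is not the method used to prepare the patch; the function name and the sample bytes are made up for illustration.

```python
def non_utf8_lines(data: bytes) -> list[int]:
    """Return the 1-based numbers of lines that are not valid UTF-8."""
    bad = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(number)
    return bad


# Hypothetical file content: line 2 holds an o-umlaut encoded as Latin-1 (0xf6),
# line 3 holds the same letter encoded as UTF-8 (0xc3 0xb6).
sample = b"/* ASCII only */\n/* J\xf6rn, Latin-1 */\n/* J\xc3\xb6rn, UTF-8 */\n"
print(non_utf8_lines(sample))  # [2]
```

Pure ASCII lines always pass the check, since ASCII is a subset of UTF-8.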


[-- Attachment #2: patch-fix-utf-8.gz --]
[-- Type: application/octet-stream, Size: 3987 bytes --]


* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-28 15:40 [2.6 patch] UTF-8 fixes in comments Adrian Bunk
@ 2008-04-28 23:05 ` Willy Tarreau
  2008-04-29  1:29   ` H. Peter Anvin
  2008-04-29  9:01   ` Alan Cox
  2008-04-29 12:18 ` KOSAKI Motohiro
  1 sibling, 2 replies; 48+ messages in thread
From: Willy Tarreau @ 2008-04-28 23:05 UTC (permalink / raw)
  To: Adrian Bunk; +Cc: linux-kernel, trivial

On Mon, Apr 28, 2008 at 06:40:23PM +0300, Adrian Bunk wrote:
> This patch converts some non-UTF-8 encoded text in comments to UTF-8.

Is this really needed Adrian ? I mean, everyone reads iso-8859-1, not
everyone reads UTF-8. Now I get random crappy chars which cripple my
xterms when reading such comments, and I have to do a full-reset once
I've read them. It's not as if it was *that* important, and to be
honest, if you had not sent this patch, I would not even have known
that non-ASCII characters were here. However, it will quickly get
annoying if a recursive grep returns those pesky codes on non-compatible
consoles...

Quite frankly, it does not bring anything beyond trouble. I'm not adding
a NAK here because I find this rude, but I don't like the orientation
we're taking with the sources. We should not force people to install
version X or Y of a particular system just to read sources.

In fact, I would rather have converted accented chars to their ASCII
equivalent to be more friendly to people who only read 7-bit.

Regards,
Willy


* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-28 23:05 ` Willy Tarreau
@ 2008-04-29  1:29   ` H. Peter Anvin
  2008-04-29  5:06     ` Willy Tarreau
  2008-04-29  9:01   ` Alan Cox
  1 sibling, 1 reply; 48+ messages in thread
From: H. Peter Anvin @ 2008-04-29  1:29 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Adrian Bunk, linux-kernel, trivial

Willy Tarreau wrote:
> Is this really needed Adrian ? I mean, everyone reads iso-8859-1, not
> everyone reads UTF-8.

"Everyone" who speaks a Western European language, perhaps; and even 
then, mostly because a lot of tools still have a "oh, it's not valid 
UTF-8, guess iso-8859-1" mode.  The most common instance of non-ASCII 
characters in Linux kernel code are people's names, and there are plenty 
of names which aren't representable in either ASCII or iso-8859-1.
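
The "not valid UTF-8, guess iso-8859-1" mode mentioned above can be sketched roughly as follows. This is an illustration of the general idea only, not the code of any particular tool, and the function name is an assumption.

```python
def decode_with_fallback(raw: bytes) -> str:
    """Decode as UTF-8 if possible, otherwise assume iso-8859-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # iso-8859-1 maps every byte to a character, so this never fails,
        # even when the guess is wrong.
        return raw.decode("iso-8859-1")


print(decode_with_fallback(b"J\xc3\xbcrgen"))  # UTF-8 input
print(decode_with_fallback(b"J\xfcrgen"))      # Latin-1 input

# The guess is silent about other charsets: 0xb3 is an l-with-stroke in
# iso-8859-2, but the fallback turns it into a superscript three.
print(decode_with_fallback(b"\xb3") == b"\xb3".decode("iso-8859-2"))  # False
```

The last line shows why such a mode only helps for text that really was Latin-1 in the first place.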

The debate on this was years ago, and the consensus was to migrate to 
UTF-8; however, the salient information should be expressed in the ASCII 
character set unless impossible.

	-hpa


* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-29  1:29   ` H. Peter Anvin
@ 2008-04-29  5:06     ` Willy Tarreau
  2008-04-29  6:04       ` H. Peter Anvin
                         ` (2 more replies)
  0 siblings, 3 replies; 48+ messages in thread
From: Willy Tarreau @ 2008-04-29  5:06 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Adrian Bunk, linux-kernel, trivial

On Mon, Apr 28, 2008 at 06:29:43PM -0700, H. Peter Anvin wrote:
> Willy Tarreau wrote:
> >Is this really needed Adrian ? I mean, everyone reads iso-8859-1, not
> >everyone reads UTF-8.
> 
> "Everyone" who speaks a Western European language, perhaps; and even 
> then, mostly because a lot of tools still have a "oh, it's not valid 
> UTF-8, guess iso-8859-1" mode.

Or simply because people have not migrated all their install, or have
explicitly disabled UTF-8 a few hours after starting to use it once
they discovered the mess it caused and the poor support from the
tools :-/

> The most common instance of non-ASCII 
> characters in Linux kernel code are people's names, and there are plenty 
> of names which aren't representable in either ASCII or iso-8859-1.
> 
> The debate on this was years ago, and the consensus was to migrate to 
> UTF-8; however, the salient information should be expressed in the ASCII 
> character set unless impossible.

And do we really consider that people's names in *comments* cannot
be converted to pure ASCII ? I'm western european and have always
been against accents in comments (another reason to write comments
in english BTW). Unix and internet have lived without accents for
almost 30 years without anyone really bothering. And now we try to
put them everywhere (even in domain names, implying big security
issues) and it causes real annoyances. People's names have not
changed in 30 years, so I guess that the rules used during this
time to ASCII-fy the names are still usable.

> 	-hpa

Willy



* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-29  5:06     ` Willy Tarreau
@ 2008-04-29  6:04       ` H. Peter Anvin
  2008-04-29  7:29       ` Adrian Bunk
  2008-05-09 12:48       ` David Kågedal
  2 siblings, 0 replies; 48+ messages in thread
From: H. Peter Anvin @ 2008-04-29  6:04 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: H. Peter Anvin, Adrian Bunk, linux-kernel, trivial

Willy Tarreau wrote:
> 
> And do we really consider that people's names in *comments* cannot
> be converted to pure ASCII ? I'm western european and have always
> been against accents in comments (another reason to write comments
> in english BTW). Unix and internet have lived without accents for
> almost 30 years without anyone really bothering. And now we try to
> put them everywhere (even in domain names, implying big security
> issues) and it causes real annoyances. People's names have not
> changed in 30 years, so I guess that the rules used during this
> time to ASCII-fy the names are still usable.
> 

For some languages, it's considered acceptable, for others it's 
considered major corruption.

	-hpa



* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-29  5:06     ` Willy Tarreau
  2008-04-29  6:04       ` H. Peter Anvin
@ 2008-04-29  7:29       ` Adrian Bunk
  2008-04-29  8:14         ` Willy Tarreau
  2008-05-09 12:48       ` David Kågedal
  2 siblings, 1 reply; 48+ messages in thread
From: Adrian Bunk @ 2008-04-29  7:29 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: H. Peter Anvin, linux-kernel, trivial

On Tue, Apr 29, 2008 at 07:06:05AM +0200, Willy Tarreau wrote:
> On Mon, Apr 28, 2008 at 06:29:43PM -0700, H. Peter Anvin wrote:
> > Willy Tarreau wrote:
> > >Is this really needed Adrian ? I mean, everyone reads iso-8859-1, not
> > >everyone reads UTF-8.
> > 
> > "Everyone" who speaks a Western European language, perhaps; and even 
> > then, mostly because a lot of tools still have a "oh, it's not valid 
> > UTF-8, guess iso-8859-1" mode.
> 
> Or simply because people have not migrated all their install, or have
> explicitly disabled UTF-8 a few hours after starting to use it once
> they discovered the mess it caused and the poor support from the
> tools :-/

Non-ancient distributions default to UTF-8 and have tools that handle it 
fine.

If you had bad experiences in the last millennium you should try again.

> > The most common instance of non-ASCII 
> > characters in Linux kernel code are people's names, and there are plenty 
> > of names which aren't representable in either ASCII or iso-8859-1.
> > 
> > The debate on this was years ago, and the consensus was to migrate to 
> > UTF-8; however, the salient information should be expressed in the ASCII 
> > character set unless impossible.
> 
> And do we really consider that people's names in *comments* cannot
> be converted to pure ASCII ? I'm western european and have always
> been against accents in comments (another reason to write comments
> in english BTW).

Accents are very rare in names in the kernel.

Most non-ASCII characters are umlauts and there's no sane way to 
express them in ASCII (and the vowels without umlaut are pronounced 
quite differently and might even make names look very strange).

And that's only within European languages, outside it becomes even 
worse.
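
To make the umlaut point concrete, here is a small sketch comparing two ways of forcing names into ASCII. The German "ae/oe/ue" table is a common convention brought in for illustration, not something proposed in this thread, and both function names are hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop combining marks after decomposition (u-umlaut becomes plain u)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Language-specific convention (assumption: German spelling rules apply).
GERMAN = {"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue", "\u00df": "ss"}


def german_transliteration(text: str) -> str:
    return "".join(GERMAN.get(ch, ch) for ch in text)


name = "J\u00fcrgen"
print(strip_diacritics(name))        # Jurgen
print(german_transliteration(name))  # Juergen

# Some letters have no decomposition at all, so stripping leaves them as-is:
print(strip_diacritics("J\u00f8rn").isascii())  # False
```

The two results differ, and which one (if either) is acceptable depends on the language the name comes from.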

> Unix and internet have lived without accents for
> almost 30 years without anyone really bothering. And now we try to
> put them everywhere (even in domain names, implying big security
> issues) and it causes real annoyances. People's names have not
> changed in 30 years, so I guess that the rules used during this
> time to ASCII-fy the names are still usable.

The comments in the kernel have been converted to UTF-8 quite some time 
ago, what I'm fixing with my patch is just some recent non-UTF-8 stuff 
that crept in.

And names in comments in the kernel were not pure ASCII since very 
early, they were in other charsets.

Mostly iso-8859-1, but not all of them.

I remember that for one name we first guessed which character it was and 
then tried to figure out which charset it was in (no, it was not one 
of iso-8859-*).

So it was not "ASCII -> UTF-8", it was
"several different charsets -> UTF-8".
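
The "guess the character first, then find the charset" approach can be sketched like this. The candidate list, the byte value and the expected letter are all made up for illustration; the thread does not say which name or charset was involved.

```python
# Hypothetical shortlist of charsets to try.
CANDIDATES = ["iso-8859-1", "iso-8859-2", "cp1252", "koi8-r", "cp437"]


def charsets_matching(raw: bytes, expected: str) -> list[str]:
    """Return the candidate charsets in which raw decodes to expected."""
    matches = []
    for name in CANDIDATES:
        try:
            if raw.decode(name) == expected:
                matches.append(name)
        except UnicodeDecodeError:
            pass  # this byte is undefined in that charset
    return matches


# Suppose the byte 0x81 sits where a u-umlaut is expected.
print(charsets_matching(b"\x81", "\u00fc"))  # ['cp437']
```

Once the source charset is known, `raw.decode(name).encode("utf-8")` gives the UTF-8 form.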

> Willy

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed



* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-29  7:29       ` Adrian Bunk
@ 2008-04-29  8:14         ` Willy Tarreau
  2008-04-29  9:06           ` Helge Hafting
                             ` (2 more replies)
  0 siblings, 3 replies; 48+ messages in thread
From: Willy Tarreau @ 2008-04-29  8:14 UTC (permalink / raw)
  To: Adrian Bunk; +Cc: H. Peter Anvin, linux-kernel, trivial

On Tue, Apr 29, 2008 at 10:29:11AM +0300, Adrian Bunk wrote:
> On Tue, Apr 29, 2008 at 07:06:05AM +0200, Willy Tarreau wrote:
> > On Mon, Apr 28, 2008 at 06:29:43PM -0700, H. Peter Anvin wrote:
> > > Willy Tarreau wrote:
> > > >Is this really needed Adrian ? I mean, everyone reads iso-8859-1, not
> > > >everyone reads UTF-8.
> > > 
> > > "Everyone" who speaks a Western European language, perhaps; and even 
> > > then, mostly because a lot of tools still have a "oh, it's not valid 
> > > UTF-8, guess iso-8859-1" mode.
> > 
> > Or simply because people have not migrated all their install, or have
> > explicitly disabled UTF-8 a few hours after starting to use it once
> > they discovered the mess it caused and the poor support from the
> > tools :-/
> 
> Non-ancient distributions default to UTF-8 and have tools that handle it 
> fine.
> 
> If you had bad experiences in the last millenium you should try again.

Well, I accidentally used a freshly installed laptop running mandriva 2008.
I was typing in a terminal inside KDE (I don't know the program name, sort
of an xterm, but with huge borders all around). I made a typo in a word and
typed in a "é" (e acute). Pressing backspace to fix it showed me that I
removed more chars than I had typed. I tried again, pressing this letter 5
times, then backspace 10 times: I removed 5 chars from the prompt. I suspect
that if I had used some chars with a wider encoding (e.g. 4 bytes), I could
have removed as many... Clearly those tools are not ready.

Also, I recently upgraded one machine from 2.6.22 to 2.6.25. Same crappy
behaviour on the console (with bash). I quickly set the vt.defaults on
the kernel command line to fix the problem.

At this stage, I'm not even trying to "fix" the problem, as it's
a philosophical debate and I do not want to enter it. Some people
consider it normal that we break user-space applications and that
it's obvious that all userland code has to be replaced to remain
compatible with "evolutions", and I simply do not support this
principle. I just care about having the ability to disable the
broken behaviour. Most of the problem comes from the variable
length characters causing wrapping lines and misplaced tabs when
read in non UTF-8 aware editors and/or terminals. The rest of
the problem with the terminal going mad could have been caused by
other encodings, I admit.

> > > The most common instance of non-ASCII 
> > > characters in Linux kernel code are people's names, and there are plenty 
> > > of names which aren't representable in either ASCII or iso-8859-1.
> > > 
> > > The debate on this was years ago, and the consensus was to migrate to 
> > > UTF-8; however, the salient information should be expressed in the ASCII 
> > > character set unless impossible.
> > 
> > And do we really consider that people's names in *comments* cannot
> > be converted to pure ASCII ? I'm western european and have always
> > been against accents in comments (another reason to write comments
> > in english BTW).
> 
> Accents are very rare in names in the kernel.
> 
> Most non-ASCII characters are umlauts and there's no sane way to 
> express them in ASCII (and the vowels without umlaut are pronounced 
> quite differently and might even make names look very strange).

Agreed, but it's been done for *years*. I received mails from people
spelled "jorn" or "jurgen" and they had no trouble using that spelling
in their names or mail addresses.

> And that's only within European languages, outside it becomes even 
> worse.
> 
> > Unix and internet have lived without accents for
> > almost 30 years without anyone really bothering. And now we try to
> > put them everywhere (even in domain names, implying big security
> > issues) and it causes real annoyances. People's names have not
> > changed in 30 years, so I guess that the rules used during this
> > time to ASCII-fy the names are still usable.
> 
> The comments in the kernel have been converted to UTF-8 quite some time 
> ago, what I'm fixing with my patch is just some recent non-UTF-8 stuff 
> that creeped in.

Well, if that had already begun, at least you're standardizing.

> And names in comments in the kernel were not pure ASCII since very 
> early, they were in other charsets.
> 
> Mostly iso-8859-1, but not all of them.
> 
> I remember that for one name we first guessed which character it was and 
> then tried to figure out which charset it was in (no, it was not one 
> of iso-8859-*).
> 
> So it was not "ASCII -> UTF-8", it was
> "several different charsets -> UTF-8".

I would have loved to see "several different charsets -> ASCII".

Willy


* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-28 23:05 ` Willy Tarreau
  2008-04-29  1:29   ` H. Peter Anvin
@ 2008-04-29  9:01   ` Alan Cox
  2008-04-29  9:19     ` Jan Engelhardt
  2008-04-29  9:34     ` Willy Tarreau
  1 sibling, 2 replies; 48+ messages in thread
From: Alan Cox @ 2008-04-29  9:01 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Adrian Bunk, linux-kernel, trivial

> In fact, I would have better converted accentuated chars to their ASCII
> equivalent to be more friendly with people who only read 7-bit.

Perhaps we should put them in latin as well just in case any Roman is
struggling with this new language 8) Distributions have been shipping UTF
enabled by default for years and years.

Alan


* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-29  8:14         ` Willy Tarreau
@ 2008-04-29  9:06           ` Helge Hafting
  2008-04-29  9:33             ` Alan Cox
  2008-04-29 10:09             ` Willy Tarreau
  2008-04-29  9:43           ` Adrian Bunk
  2008-04-29 19:31           ` H. Peter Anvin
  2 siblings, 2 replies; 48+ messages in thread
From: Helge Hafting @ 2008-04-29  9:06 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Adrian Bunk, H. Peter Anvin, linux-kernel, trivial

Willy Tarreau wrote:
> On Tue, Apr 29, 2008 at 10:29:11AM +0300, Adrian Bunk wrote:
>   
>> On Tue, Apr 29, 2008 at 07:06:05AM +0200, Willy Tarreau wrote:
>>     
>>> On Mon, Apr 28, 2008 at 06:29:43PM -0700, H. Peter Anvin wrote:
>>>       
>>>> Willy Tarreau wrote:
>>>>         
>>>>> Is this really needed Adrian ? I mean, everyone reads iso-8859-1, not
>>>>> everyone reads UTF-8.
>>>>>           
>>>> "Everyone" who speaks a Western European language, perhaps; and even 
>>>> then, mostly because a lot of tools still have a "oh, it's not valid 
>>>> UTF-8, guess iso-8859-1" mode.
>>>>         
>>> Or simply because people have not migrated all their install, or have
>>> explicitly disabled UTF-8 a few hours after starting to use it once
>>> they discovered the mess it caused and the poor support from the
>>> tools :-/
>>>       
>> Non-ancient distributions default to UTF-8 and have tools that handle it 
>> fine.
>>
>> If you had bad experiences in the last millenium you should try again.
>>     
>
> Well, I accidentally used a freshly installed laptop running mandriva 2008.
> I was typing in a terminal inside KDE (I don't know the program name, sort
> of an xterm, but with huge borders all around). I made a typo in a word and
> typed in a "é" (e acute). Pressing backspace to fix it showed me that I
> remove more chars than typed. I tried again. Pressing this letter 5 times,
> then 10 times backspace. I removed 5 chars from the prompt. I suspect that
> if I had used some chars with wider encoding (eg 4 bytes), I could have
> removed as many... Clearly those tools are not ready.
>   

So don't use that particular tool, and/or file a bug with the
maintainer. :-)
I have used utf-8 for years - the fact that some editors and some terminal
emulators fail is not a problem for me. There are so many that work
just fine. There is unicode xterm, and rxvt if you consider xterm too heavy.
Both vi and emacs have versions that handle utf-8 competently. You may have
to put in a one-off effort in finding a suitable font for your xterm, if you
actually want to see proper umlauts in all cases. If you don't care about
looks, then xterm will display blanks/squares and backspace etc. will
still work.

> Also, I recently upgraded one machine from 2.6.22 to 2.6.25. Same crappy
> behaviour on the console (with bash). I quickly set the vt.defaults on
> the kernel command line to fix the problem.
>
> At this stage, I'm not even trying to "fix" the problem, as it's
> a philosophical debate and I do not want to enter it. Some people
> consider it normal that we break user-space applications and that
> it's obvious that all useland code has to be replaced to remain
> compatible with "evolutions", and I simply do not support this
> principle.

Outside the English-speaking world, userland _was_ completely
broken in the days of ascii. And supporting the multiple
iso8859-xx encodings was completely broken too, if you ever needed
more than one of them.

Unicode gives userland an opportunity to actually work decently
for the first time. Now, ascii may be fine if C development is all
you ever use the machine for. You can mangle a few names in
comments - some people won't like that at all, some won't care.

But try using the same machine for writing a business letter without
a proper character set. You won't be taken seriously. Or even a non-english
gui app with ascii-only menus.

If you want to know what it is like, knock three vowels or so out of the
English alphabet. Consider them not supported. Invent "transcriptions" if
you like. Try writing a letter that way! Or even kernel code with
informative comments. See just how much that sucks.

>  I just care about having the ability to disable the
> broken behaviour. Most of the problem comes from the variable
> length characters causing wrapping lines and misplaced tabs when
> read in non UTF-8 aware editors and/or terminals.

Consider the alternative - disable the broken behavior by using a
tool that handles UTF-8. There are certainly enough aware apps/tools for
those of us that need unicode.

>>> And do we really consider that people's names in *comments* cannot
>>> be converted to pure ASCII ? I'm western european and have always
>>> been against accents in comments (another reason to write comments
>>> in english BTW).
>>>       
>> Accents are very rare in names in the kernel.
>>
>> Most non-ASCII characters are umlauts and there's no sane way to 
>> express them in ASCII (and the vowels without umlaut are pronounced 
>> quite differently and might even make names look very strange).
>>     
>
> Agreed, but it's been done for *years*. I received mails from people
> spelled "jorn" or "jurgen" and they had no trouble using that spelling
> in their names or mail addresses.
>   

It has been done for years because there was no other choice. If you
wanted to work in unix, just forget your own name! Now there is a choice.
Some people still don't care and are fine with "jorn" and such. Some are
pissed off, take offense, stick to windows, or simply put unicode
into kernel comments.

If your mailer doesn't support utf-8, chances are you get some mail
from people with very strange looking names too.

>> And that's only within European languages, outside it becomes even 
>> worse.
>>
>>     
>>> Unix and internet have lived without accents for
>>> almost 30 years without anyone really bothering. And now we try to
>>>       

Lots of people actually bothered - and created various encoding schemes
to struggle with until they came up with unicode. English speakers and
people _only_ interested in simple tools like tar and ls didn't bother,
perhaps. No problem there - the pressure to support more than ascii always
was on those wanting to use more than ascii. Now the kernel contains more
than ascii, and if you want to work on it you will have to cope - or succeed
in patching it out again.

>>> put them everywhere (even in domain names, implying big security
>>> issues) and it causes real annoyances. People's names have not
>>> changed in 30 years, so I guess that the rules used during this
>>> time to ASCII-fy the names are still usable.
>>>       

Such "rules" may work for kernel comments specifically.
But linux is used for much more than that, so it now supports utf-8 just
fine. People who have a properly set up system see no reason why they
can't use utf-8 in the kernel too. Consider tools that work. Or fix
the few remaining ones that don't work - if you are attached to them.

>> The comments in the kernel have been converted to UTF-8 quite some time 
>> ago, what I'm fixing with my patch is just some recent non-UTF-8 stuff 
>> that creeped in.
>>     
>
> Well, if that had already begun, at least you're standardizing.
>
>   
>> And names in comments in the kernel were not pure ASCII since very 
>> early, they were in other charsets.
>>
>> Mostly iso-8859-1, but not all of them.
>>
>> I remember that for one name we first guessed which character it was and 
>> then tried to figure out which charset it was in (no, it was not one 
>> of iso-8859-*).
>>
>> So it was not "ASCII -> UTF-8", it was
>> "several different charsets -> UTF-8".
>>     
>
> I would have loved to see "several different charsets -> ASCII".
>   

And all those that actually used those "different charsets" disagree,
or they'd have used ascii in the first place too. :-)

Helge Hafting


* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-29  9:01   ` Alan Cox
@ 2008-04-29  9:19     ` Jan Engelhardt
  2008-04-29  9:34     ` Willy Tarreau
  1 sibling, 0 replies; 48+ messages in thread
From: Jan Engelhardt @ 2008-04-29  9:19 UTC (permalink / raw)
  To: Alan Cox; +Cc: Willy Tarreau, Adrian Bunk, linux-kernel, trivial


On Tuesday 2008-04-29 11:01, Alan Cox wrote:
>> In fact, I would have better converted accentuated chars to their ASCII
>> equivalent to be more friendly with people who only read 7-bit.
>
>Perhaps we should put them in latin as well just in case any Roman is
>struggling with this new language 8) Distibutions have been shipping UTF
>enabled by default for years and years.

With some being overly late.


* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-29  9:06           ` Helge Hafting
@ 2008-04-29  9:33             ` Alan Cox
  2008-04-29 10:09             ` Willy Tarreau
  1 sibling, 0 replies; 48+ messages in thread
From: Alan Cox @ 2008-04-29  9:33 UTC (permalink / raw)
  To: Helge Hafting
  Cc: Willy Tarreau, Adrian Bunk, H. Peter Anvin, linux-kernel, trivial

> Outside the english-speaking world, userland _was_ completely

(American)

Formal UK English uses accented characters for some foreign imports (eg
café), ï for words like naïve, and if you are really pretentious you need
the æ symbol for words like mediæval although for modern writing this is
considered silly.

The bash problem btw should have been fixed (if it is bash causing it) as
of 2.05b and readline 4.3. If it's being caused by the KDE terminal that
would surprise me but might be worth filing a bug.

Alan


* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-29  9:01   ` Alan Cox
  2008-04-29  9:19     ` Jan Engelhardt
@ 2008-04-29  9:34     ` Willy Tarreau
  2008-04-29  9:41       ` Alan Cox
  1 sibling, 1 reply; 48+ messages in thread
From: Willy Tarreau @ 2008-04-29  9:34 UTC (permalink / raw)
  To: Alan Cox; +Cc: Adrian Bunk, linux-kernel, trivial

On Tue, Apr 29, 2008 at 10:01:07AM +0100, Alan Cox wrote:
> > In fact, I would have better converted accentuated chars to their ASCII
> > equivalent to be more friendly with people who only read 7-bit.
> 
> Perhaps we should put them in latin as well just in case any Roman is
> struggling with this new language 8) Distibutions have been shipping UTF
> enabled by default for years and years.

"enabled" does not mean "working" Alan. I know one distro which I will
not name in order not to offend you which shipped with it enabled by
default, but which would not properly display the characters on the
console, resulting in mangled messages during boot. I particularly
remember the "[ECHEC]" ("[FAILED]") with random garbage instead of the
first 'E'.

:-)
Willy


* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-29  9:34     ` Willy Tarreau
@ 2008-04-29  9:41       ` Alan Cox
  0 siblings, 0 replies; 48+ messages in thread
From: Alan Cox @ 2008-04-29  9:41 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Adrian Bunk, linux-kernel, trivial

> "enabled" does not mean "working" Alan. I know one distro which I will
> not name in order not to offense you which shipped with it enabled by

No offence taken. In fact I seem to remember filing similar bugs at the
time about rpm/popt getting its help formatting wrong in some locales (eg
Welsh) for similar reasons - but that was some time ago.

All the mainstream tools handle utf-8 just fine, joe is quite happy
editing utf-8 these days (as are the legacy vim and emacs editing
tools ;)). There really are no good reasons left not to use UTF-8.

Alan
--
        > you are confusing me even more.  
        Of course.  "I'm from IBM.  I'm here to help."  ;-)
                                -- Alan Altmark


* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-29  8:14         ` Willy Tarreau
  2008-04-29  9:06           ` Helge Hafting
@ 2008-04-29  9:43           ` Adrian Bunk
  2008-04-29 19:31           ` H. Peter Anvin
  2 siblings, 0 replies; 48+ messages in thread
From: Adrian Bunk @ 2008-04-29  9:43 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: H. Peter Anvin, linux-kernel, trivial

On Tue, Apr 29, 2008 at 10:14:23AM +0200, Willy Tarreau wrote:
> On Tue, Apr 29, 2008 at 10:29:11AM +0300, Adrian Bunk wrote:
> > On Tue, Apr 29, 2008 at 07:06:05AM +0200, Willy Tarreau wrote:
> > > On Mon, Apr 28, 2008 at 06:29:43PM -0700, H. Peter Anvin wrote:
> > > > Willy Tarreau wrote:
> > > > >Is this really needed Adrian ? I mean, everyone reads iso-8859-1, not
> > > > >everyone reads UTF-8.
> > > > 
> > > > "Everyone" who speaks a Western European language, perhaps; and even 
> > > > then, mostly because a lot of tools still have a "oh, it's not valid 
> > > > UTF-8, guess iso-8859-1" mode.
> > > 
> > > Or simply because people have not migrated all their install, or have
> > > explicitly disabled UTF-8 a few hours after starting to use it once
> > > they discovered the mess it caused and the poor support from the
> > > tools :-/
> > 
> > Non-ancient distributions default to UTF-8 and have tools that handle it 
> > fine.
> > 
> > If you had bad experiences in the last millenium you should try again.
> 
> Well, I accidentally used a freshly installed laptop running mandriva 2008.
> I was typing in a terminal inside KDE (I don't know the program name, sort
> of an xterm, but with huge borders all around). I made a typo in a word and
> typed in a "é" (e acute). Pressing backspace to fix it showed me that I
> remove more chars than typed. I tried again. Pressing this letter 5 times,
> then 10 times backspace. I removed 5 chars from the prompt. I suspect that
> if I had used some chars with wider encoding (eg 4 bytes), I could have
> removed as many... Clearly those tools are not ready.
>...

This sounds as if you had UTF-8 characters in a non UTF-8 environment.

If you had "explicitly disabled UTF-8" as you described, then that is what
triggered it.
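
A simplified model of that mismatch, to show where "5 letters, 10 backspaces, 5 prompt chars gone" can come from. This is a sketch under stated assumptions (one screen column per character, a line editor that counts either bytes or characters), not the actual tty or readline logic, and the function name is made up.

```python
def prompt_columns_lost(typed: str, byte_oriented: bool) -> int:
    """Columns erased beyond the typed text when all counted input is deleted.

    Assumptions: the terminal draws one column per character, each backspace
    erases one column, and the line editor accepts one backspace per unit
    (byte or character) it believes was typed.
    """
    drawn = len(typed)
    counted = len(typed.encode("utf-8")) if byte_oriented else len(typed)
    return counted - drawn


typed = "\u00e9" * 5  # e acute, pressed five times
print(len(typed), len(typed.encode("utf-8")))           # 5 10
print(prompt_columns_lost(typed, byte_oriented=True))   # 5
print(prompt_columns_lost(typed, byte_oriented=False))  # 0
```

With a byte-counting editor on a UTF-8 terminal, the five extra backspaces eat into the prompt; when both sides agree on the encoding, nothing extra is erased.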

> > > > The most common instance of non-ASCII 
> > > > characters in Linux kernel code are people's names, and there are plenty 
> > > > of names which aren't representable in either ASCII or iso-8859-1.
> > > > 
> > > > The debate on this was years ago, and the consensus was to migrate to 
> > > > UTF-8; however, the salient information should be expressed in the ASCII 
> > > > character set unless impossible.
> > > 
> > > And do we really consider that people's names in *comments* cannot
> > > be converted to pure ASCII ? I'm western european and have always
> > > been against accents in comments (another reason to write comments
> > > in english BTW).
> > 
> > Accents are very rare in names in the kernel.
> > 
> > Most non-ASCII characters are umlauts and there's no sane way to 
> > express them in ASCII (and the vowels without umlaut are pronounced 
> > quite differently and might even make names look very strange).
> 
> Agreed, but it's been done for *years*. I received mails from people
> spelled "jorn" or "jurgen" and they had no trouble using that spelling
> in their names or mail addresses.

Email addresses are a different topic.

But it's not right in names, and if someone then pronounces their name 
according to the wrong writing the result is also wrong.

> > And that's only within European languages, outside it becomes even 
> > worse.
> > 
> > > Unix and internet have lived without accents for
> > > almost 30 years without anyone really bothering. And now we try to
> > > put them everywhere (even in domain names, implying big security
> > > issues) and it causes real annoyances. People's names have not
> > > changed in 30 years, so I guess that the rules used during this
> > > time to ASCII-fy the names are still usable.
> > 
> > The comments in the kernel have been converted to UTF-8 quite some time 
> > ago, what I'm fixing with my patch is just some recent non-UTF-8 stuff 
> > that creeped in.
> 
> Well, if that had already begun, at least you're standardizing.
>...
> Willy

cu
Adrian

--

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed



* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-29  9:06           ` Helge Hafting
  2008-04-29  9:33             ` Alan Cox
@ 2008-04-29 10:09             ` Willy Tarreau
  2008-04-29 10:10               ` Alan Cox
                                 ` (2 more replies)
  1 sibling, 3 replies; 48+ messages in thread
From: Willy Tarreau @ 2008-04-29 10:09 UTC (permalink / raw)
  To: Helge Hafting; +Cc: Adrian Bunk, H. Peter Anvin, linux-kernel, trivial

On Tue, Apr 29, 2008 at 11:06:05AM +0200, Helge Hafting wrote:
> >Well, I accidentally used a freshly installed laptop running mandriva 2008.
> >I was typing in a terminal inside KDE (I don't know the program name, sort
> >of an xterm, but with huge borders all around). I made a typo in a word and
> >typed in a "é" (e acute). Pressing backspace to fix it showed me that I
> >remove more chars than typed. I tried again. Pressing this letter 5 times,
> >then 10 times backspace. I removed 5 chars from the prompt. I suspect that
> >if I had used some chars with wider encoding (eg 4 bytes), I could have
> >removed as many... Clearly those tools are not ready.
> >  
> So don't use that particular tool

It was not my machine, and had you been there, you would have heard me call
it names !

> and/or file a bug with the maintainer. :-)

It's too easy to impose crappy designs on end-users and tell them that if
that does not work they have to file a bug. There is a minimal set of
things that must be tested before shipping. Seeing that the default
terminal emulator in KDE on Mandriva 2008 is configured in UTF-8 and does
not properly render it simply makes me sick. This is broken by design and
even distros trying to get it working for years still can't cope with it.
There must be a reason.

> I have used utf-8 for years - the fact that some editors and some terminal
> emulators fail is not a problem for me. There are so many that works
> just fine. There is unicode xterm, and rxvt if you consider xterm too heavy.
> Both vi and emacs have versions that handle utf-8 competently. You may 
> have to
> put in a one-off effort in finding a suitable font for your xterm, if you
> actually wants to see proper umlauts in all cases. If you don't care about
> looks, then xterm will display blanks/squares and backspace etc. will 
> still work.

I don't care about the *look*. Mutt shows me a question mark when it does
not know. I care about the *behaviour*. Having backspace go back farther
than the prompt is not acceptable. Having 80-col lines span over two lines
is absurd.

> Outside the english-speaking world, userland _was_ completely
> broken in the day of ascii. And supporting the multiple
> iso8859-xx encodings was completely broken too, if you ever needed
> more than one of them.

Yes, but you just had unexpected characters. Just like MS-DOS when
switching from code-page 437 to 850. Aside from this, everything worked.

> Unicode gives userland an opportunity to actually work decently
> for the first time.

Unicode yes, UTF-8 no. UTF-8 is a compressed encoding of unicode.
That's as silly as if you had to replace your terminals to read
native gzip, and expect them as well as all the tools to work
properly!

> Now, ascii may be fine if C development is all
> you ever use the machine for. You can mangle a few names in
> comments - some people won't like that at all, some won't care.
> 
> But try using the same machine for writing a business letter without
> a proper character set. You won't be taken seriously. Or even a non-english
> gui app with ascii-only menus.
>
> If you want to know what it is like, knock three vowels or so out of the
> english alphabet. Consider them not supported. Invent "transcriptions" 
> if you like.

amusing comparison :-)

> Try writing a letter that way! Or even kernel code with informative 
> comments.
> See just how much that suck.
> > I just care about having the ability to disable the
> >broken behaviour. Most of the problem comes from the variable
> >length characters causing wrapping lines and misplaced tabs when
> >read in non UTF-8 aware editors and/or terminals.
> Consider the alternative - disable the broken behavior by using a
> tool that handles UTF-8. There are certainly enough aware apps/tools for
> those of us that  need  unicode.

Well, booting 2.6.25 with "init=/bin/bash" results in backspace
eating the prompt after pressing accented letters. Even the
control chars have been correctly handled on many UNIXes for
decades! The real problem with this crap is that it is viral :
"replace all userland applications or die alone on your island".
Then "ah, your applications behave in a funny manner, well that
may be because of UTF-8, but that is not important, just wait
for the update". I'm not even speaking about the security
implications it has on a lot of tools, starting with regex
libraries.

> >Agreed, but it's been done for *years*. I received mails from people
> >spelled "jorn" or "jurgen" and they had no trouble using that spelling
> >in their names or mail addresses.
> >  
> It has been done for years because there were no other choice. If you
> wanted to work in unix, just forget your own name! Now there is a choice.
> Some people still don' care and is fine with "jorn" and such. Some are
> pissed off, takes offense, or stick to windows or simply puts unicode
> into kernel comments.

Funny that you mention Windows. Windows has been using 16-bit unicode
for a long time without problems. It's a clean encoding. Like it or not.
Since they have started using UTF-8, ordinary Windows users have started
telling me that there are often bizarre characters in texts instead of
accents. That most often happens in forwarded mails. So they get hit
too.

> If your mailer doesn't support utf-8, chances are you get some mail
> from people with very strange looking names too.

Once again, I don't care about the strange looking, just about the
behaviour.

> >>>Unix and internet have lived without accents for
> >>>almost 30 years without anyone really bothering. And now we try to
> >>>      
> Lots of people actually bothered - and created various encoding schemes
> to struggle with until they came up with unicode. English speakers and
> people _only_ interested in simple tools like tar and ls didn't bother 
> perhaps.

You know why we got this encoding ? Simply because it was designed by
English speakers who did not want to be impacted at all by the transition.
That way they can still use their old "elm", "cat" and "vi" with no
hassle and pretend to be UTF-8 ready.

> No problem there - the pressure to support more than ascii always was on 
> those
> wanting to use more than ascii. Now the kernel contains more than ascii,
> and if you want to work on it you will have to cope - or succeed in 
> patching it out again.

I'm not suggesting to patch it out, just that we stay conservative with
the sources. Being limited to certain compilers is already a problem,
but we must avoid putting restrictions on the tools required to read/write
the sources.

> >>>put them everywhere (even in domain names, implying big security
> >>>issues) and it causes real annoyances. People's names have not
> >>>changed in 30 years, so I guess that the rules used during this
> >>>time to ASCII-fy the names are still usable.
> >>>      
> Such "rules" may work for kernel comments specifically.
> But linux is used for much more than that, so it now supports utf-8 just 
> fine.
> People who have a poperly set up system see no reason why they
> can't use utf-8 in the kernel too. Consider tools that work. Or fix
> the few remaining that doesn't work - if you are attached to them.

No, you're speaking as a desktop user. You upgrade every 6 months. When
you have several machines, with various OSes, you know that the first
one which will stuff this crap everywhere will cause even more trouble
with the other ones. At one moment, you'll have to upgrade everything.
BTW, do you have a UTF-8 patch for the vt320 and vt510 I use as an
always-on console on my servers ? Clearly, the system does not have to
be "properly set up" to behave correctly. A kernel running bash as init
is a "properly set up system". Displaying wrong things is OK, behaving
badly is not.

> >I would have loved to see "several different charsets -> ASCII".
> >  
> And all those that actually used those "different charsets" disagree,
> or they'd used ascii in the first place too. :-)

As I said to Adrian, I did not even know there were non-ASCII chars
in our sources, and found it a bit shocking. Well, maybe I'm just an
old-timer and I need to stop working with computers :-/

Willy


* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-29 10:09             ` Willy Tarreau
@ 2008-04-29 10:10               ` Alan Cox
  2008-04-29 10:33                 ` Willy Tarreau
  2008-04-29 19:33                 ` H. Peter Anvin
  2008-04-29 10:42               ` Adrian Bunk
  2008-04-30  9:15               ` Helge Hafting
  2 siblings, 2 replies; 48+ messages in thread
From: Alan Cox @ 2008-04-29 10:10 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Helge Hafting, Adrian Bunk, H. Peter Anvin, linux-kernel, trivial

> Well, booting 2.6.25 with "init=/bin/bash" results in backspace
> eating the prompt after pressing accentuated letters. Even the

Did you put the bash shell and the console into unicode mode ?

> Funny that you mention Windows. Windows has been using 16-bit unicode
> for a long time without problems. It's a clean encoding. Like it or not.

I would describe the UCS-2 situation as a disaster area - embedded nuls
causing breakage, inability to represent the full unicode space and
awkward programming interfaces.

> You know why we got this encoding ? Simply because it was designed by
> english speakers who did not want to be impacted at all by the transition.

Actually it was primarily designed to make moving encoding painless so
that ascii still worked and C properties like \0 plus traditional
Unixisms like "/" just worked.
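
A minimal sketch of the property described above, in Python for brevity. The sample string and variable names are illustrative assumptions, not anything from the thread. It checks that bytes below 0x80 in UTF-8 output are exactly the ASCII characters of the input, so "\0" and "/" keep their meaning, while a 16-bit encoding embeds zero bytes in plain ASCII text.

```python
# Illustrative only: compare how UTF-8 and UTF-16 treat bytes that C and Unix care about.
text = "caf\u00e9/J\u00f6rn"          # hypothetical sample: "café/Jörn"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

# In UTF-8, every byte of a multibyte sequence has the high bit set, so the
# bytes below 0x80 are exactly the ASCII characters of the original text.
low_bytes = bytes(b for b in utf8 if b < 0x80)
ascii_chars = "".join(c for c in text if ord(c) < 0x80).encode("ascii")

utf8_has_nul = 0 in utf8              # no NUL unless the text contains U+0000
utf8_slashes = utf8.count(b"/")       # only the real "/" matches

# In UTF-16 each ASCII character carries a zero byte, which would end a
# C string after the first character.
utf16_has_nul = 0 in utf16

print(low_bytes == ascii_chars, utf8_has_nul, utf8_slashes, utf16_has_nul)
```

This only demonstrates the byte-level property; it says nothing about how any particular tool handles the result.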

> BTW, do you have an UTF-8 patch for the vt320 and vt510 I use as an
> always-on console on my servers ? Clearly, the system does not have to

screen supports the needed transliteration for you.

Alan
--
"Having worked in a university for more than twenty years after leaving
 industry, I had become unused to seeing management skill routinely
 exercised, universities being administered rather than managed"
                -- Peter Checkland



* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-29 10:10               ` Alan Cox
@ 2008-04-29 10:33                 ` Willy Tarreau
  2008-04-29 10:34                   ` Alan Cox
  2008-05-01  9:46                   ` Alexander E. Patrakov
  2008-04-29 19:33                 ` H. Peter Anvin
  1 sibling, 2 replies; 48+ messages in thread
From: Willy Tarreau @ 2008-04-29 10:33 UTC (permalink / raw)
  To: Alan Cox
  Cc: Helge Hafting, Adrian Bunk, H. Peter Anvin, linux-kernel, trivial

On Tue, Apr 29, 2008 at 11:10:14AM +0100, Alan Cox wrote:
> > Well, booting 2.6.25 with "init=/bin/bash" results in backspace
> > eating the prompt after pressing accentuated letters. Even the
> 
> Did you put the bash shell and the console into unicode mode ?

The console yes (by default until I disabled it to restore correct
behaviour). The shell no, it was the one present on my machine and
has never been compiled with UTF-8 support, and should not have to be.

If we say that starting with 2.6.24, we're explicitly breaking
compatibility with old userland, fine. But that was not explicitly
stated.

In my opinion, the problem is that when I press "é", the system sends
two chars to the bash, which itself sends two chars to the terminal,
which only displays one and moves the cursor one step ahead. Then,
pressing backspace once sends one backspace all along, resulting in
the terminal blanking one displayed char, but the shell not being
aware that only half of it was removed. But if you look at how
control chars are handled, if you display ^H then press backspace,
you remove all of it. It's the terminal which adjusts the position
depending on the character length.

So in my opinion, when we send one backspace to the terminal to
remove one character, since there are two in the buffer, we
should not get back one full char. Ideally, the console driver
should send as many backspaces as needed to fix the multiple
characters that were emitted. It's not logical at all that if
we send 3 chars to a process with one key, sending a cancellation
of those chars only sends one backspace.

You see, that's really what I hate with this encoding. Every
stage relies on the next one to do the fixup. And of course, a
lot of combinations fail.
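
The mismatch described above can be put into numbers with a small sketch. This is a simplified model in Python, assuming a terminal that draws UTF-8 and a line editor that counts bytes; it is not bash or readline code, and every name in it is illustrative.

```python
# Simplified model of a UTF-8 terminal driven by a byte-oriented line editor.
# Not bash or readline code; all names here are illustrative.
typed = "\u00e9" * 5                 # five presses of "é"
stored = typed.encode("utf-8")       # bytes the shell receives and buffers

bytes_in_buffer = len(stored)        # what a byte-oriented shell counts
cells_on_screen = len(typed)         # what a UTF-8 terminal actually draws

# A byte-oriented editor accepts one backspace per buffered byte and
# erases one screen cell each time, so it overshoots into the prompt.
prompt_cells_erased = bytes_in_buffer - cells_on_screen

print(bytes_in_buffer, cells_on_screen, prompt_cells_erased)
```

Under these assumptions the ten accepted backspaces erase the five typed cells plus five cells of the prompt, which matches the behaviour reported earlier in the thread.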

> > Funny that you mention Windows. Windows has been using 16-bit unicode
> > for a long time without problems. It's a clean encoding. Like it or not.
> 
> I would describe the UCS-2 situation as a disaster area - embedded nuls
> causing breakage, inability to represent the full unicode space and
> awkward programming interfaces.

But at least, there is no feeling of having it working. You immediately
see if your tools are compliant or not.

> > You know why we got this encoding ? Simply because it was designed by
> > english speakers who did not want to be impacted at all by the transition.
> 
> Actually it was primarily designed to make moving encoding painless so
> that ascii still worked and C properties like \0 plus traditional
> Unixisms like "/" just worked.

I cannot imagine how one can believe that something which transcodes one
char as a series of 1-to-4 chars will be a painless move. A lot of code
is totally broken and was not before the move.

> > BTW, do you have an UTF-8 patch for the vt320 and vt510 I use as an
> > always-on console on my servers ? Clearly, the system does not have to
> 
> screen supports the needed transliteration for you.

That's useful information, thanks. I was not aware of this.

> Alan

Willy


* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-29 10:33                 ` Willy Tarreau
@ 2008-04-29 10:34                   ` Alan Cox
  2008-04-29 22:12                     ` Willy Tarreau
  2008-05-01  9:46                   ` Alexander E. Patrakov
  1 sibling, 1 reply; 48+ messages in thread
From: Alan Cox @ 2008-04-29 10:34 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Helge Hafting, Adrian Bunk, H. Peter Anvin, linux-kernel, trivial

> behaviour). The shell no, it was the one present on my machine and
> has never been compiled with UTF-8 support, and should not have to.

Bizarre, so you are using deliberately misconfigured ancient userspace to
complain about UTF-8.

> In my opinion, the problem is that when I press "é", the system sends
> two chars to the bash, which itself sends two chars to the terminal,
> which only displays one and moves the cursor one step ahead. Then,
> pressing backspace once sends one backspace all along, resulting in
> the terminal blanking one displayed char, but the shell not being

The shell puts the terminal in character by character mode and readline
does this. If you have your shell/readline deliberately set up not to be
doing unicode locales then it will do the wrong thing.

> So in my opinion, when we send one backspace to the terminal to
> remove one character, since there are two in the buffer, we
> should not get back one full char. Ideally, the console driver
> should send as many backspaces as needed to fix the multiple

The console driver isn't involved - readline took over for the shell, and
readline most definitely supports this in a utf8 locale.
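
As a rough illustration of what a UTF-8-aware line editor has to do on backspace, here is a sketch in Python. The helper `erase_last_char` is hypothetical and is not readline's implementation; it only shows the idea of stepping back over continuation bytes (those of the form 10xxxxxx) to the lead byte.

```python
def erase_last_char(buf: bytes) -> bytes:
    """Drop the last UTF-8 encoded character from buf (hypothetical helper)."""
    end = len(buf)
    if end == 0:
        return buf
    end -= 1
    # Continuation bytes look like 10xxxxxx; step back over them to the lead byte.
    while end > 0 and (buf[end] & 0xC0) == 0x80:
        end -= 1
    return buf[:end]


line = "ab\u00e9".encode("utf-8")     # "abé", four bytes
after_one_backspace = erase_last_char(line)
print(after_one_backspace)
```

The sketch assumes well-formed UTF-8 and ignores combining characters and double-width cells, which a real editor would also need to account for.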

Alan


* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-29 10:09             ` Willy Tarreau
  2008-04-29 10:10               ` Alan Cox
@ 2008-04-29 10:42               ` Adrian Bunk
  2008-04-29 11:06                 ` Willy Tarreau
  2008-04-30  9:15               ` Helge Hafting
  2 siblings, 1 reply; 48+ messages in thread
From: Adrian Bunk @ 2008-04-29 10:42 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Helge Hafting, H. Peter Anvin, linux-kernel, trivial

On Tue, Apr 29, 2008 at 12:09:34PM +0200, Willy Tarreau wrote:
> On Tue, Apr 29, 2008 at 11:06:05AM +0200, Helge Hafting wrote:
> > >Well, I accidentally used a freshly installed laptop running mandriva 2008.
> > >I was typing in a terminal inside KDE (I don't know the program name, sort
> > >of an xterm, but with huge borders all around). I made a typo in a word and
> > >typed in a "é" (e acute). Pressing backspace to fix it showed me that I
> > >remove more chars than typed. I tried again. Pressing this letter 5 times,
> > >then 10 times backspace. I removed 5 chars from the prompt. I suspect that
> > >if I had used some chars with wider encoding (eg 4 bytes), I could have
> > >removed as many... Clearly those tools are not ready.
> > >  
> > So don't use that particular tool
> 
> It was not my machine, and had you been there, you would have heard me call
> it names !
> 
> > and/or file a bug with the maintainer. :-)
> 
> It's too easy to impose crappy designs to end-users and tell them that if
> that does not work they have to file a bug. There are a minimal set of
> things that must be tested before shipping. Seeing that the default
> terminal emulator in KDE on Mandriva 2008 is configured in UTF-8 and does
> not properly render it simply makes me sick. This is broken by design and
> even distros trying to get it working for years still can't cope with it.
> There must be a reason.

I can reproduce your problem in a plain xterm when setting LANG=en_US
(most likely the same problem can occur with other non UTF-8 settings).

In this case I'm actually more surprised that the character is displayed 
correctly than that you have to type backspace twice.

Any kind of charset mixing is highly problematic (which is also why my 
patch was attached compressed), so if you disable UTF-8 anywhere in a 
modern distribution problems are to be expected (it could also be a 
bug in Mandriva's default settings, but that would really surprise me).

>...
> > Unicode gives userland an opportunity to actually work decently
> > for the first time.
> 
> Unicode yes, UTF-8 no. UTF-8 is a compressed encoding of unicode.
> That's as silly as if you had to replace your terminals to read
> native gzip, and expect them as well as all the tools to work
> properly!

It's not a compressed encoding, it's a variable-length encoding.

Besides the size advantages, one main advantage of UTF-8 is that ASCII is 
valid UTF-8. This means that for the ASCII source code in the kernel it 
doesn't matter whether it's treated as ASCII or UTF-8, and no conversion 
was needed.

You can't get this property with a fixed-size Unicode encoding.
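
A short sketch of both statements, in Python, with an arbitrary sample line and characters chosen for illustration: pure ASCII text produces the same bytes whether encoded as ASCII or as UTF-8, and other characters take between two and four bytes.

```python
# Illustrative only: ASCII text is already valid UTF-8, byte for byte.
source = "int main(void) { return 0; }\n"    # arbitrary ASCII sample
same_bytes = source.encode("ascii") == source.encode("utf-8")

# Variable length: the number of bytes depends on the code point.
samples = ["a", "\u00e9", "\u20ac", "\U0001d11e"]   # a, é, €, and U+1D11E
lengths = [len(ch.encode("utf-8")) for ch in samples]

print(same_bytes, lengths)
```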

>...
> Willy

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed



* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-29 10:42               ` Adrian Bunk
@ 2008-04-29 11:06                 ` Willy Tarreau
  2008-04-29 11:27                   ` Adrian Bunk
  0 siblings, 1 reply; 48+ messages in thread
From: Willy Tarreau @ 2008-04-29 11:06 UTC (permalink / raw)
  To: Adrian Bunk; +Cc: Helge Hafting, H. Peter Anvin, linux-kernel, trivial

On Tue, Apr 29, 2008 at 01:42:16PM +0300, Adrian Bunk wrote:
> On Tue, Apr 29, 2008 at 12:09:34PM +0200, Willy Tarreau wrote:
> > On Tue, Apr 29, 2008 at 11:06:05AM +0200, Helge Hafting wrote:
> > > >Well, I accidentally used a freshly installed laptop running mandriva 2008.
> > > >I was typing in a terminal inside KDE (I don't know the program name, sort
> > > >of an xterm, but with huge borders all around). I made a typo in a word and
> > > >typed in a "é" (e acute). Pressing backspace to fix it showed me that I
> > > >remove more chars than typed. I tried again. Pressing this letter 5 times,
> > > >then 10 times backspace. I removed 5 chars from the prompt. I suspect that
> > > >if I had used some chars with wider encoding (eg 4 bytes), I could have
> > > >removed as many... Clearly those tools are not ready.
> > > >  
> > > So don't use that particular tool
> > 
> > It was not my machine, and had you been there, you would have heard me call
> > it names !
> > 
> > > and/or file a bug with the maintainer. :-)
> > 
> > It's too easy to impose crappy designs to end-users and tell them that if
> > that does not work they have to file a bug. There are a minimal set of
> > things that must be tested before shipping. Seeing that the default
> > terminal emulator in KDE on Mandriva 2008 is configured in UTF-8 and does
> > not properly render it simply makes me sick. This is broken by design and
> > even distros trying to get it working for years still can't cope with it.
> > There must be a reason.
> 
> I can reproduce your problem in a plain xterm when setting LANG=en_US
> (most likely the same problem can occur with other non UTF-8 settings).

Possibly they broke it when forcing support for variable-length characters?

> In this case I'm actually more surprised that the character is displayed 
> correctly than that you have to type backspace twice.

It's not that I *had* to type it twice. But I *could* type it twice, and
the first one removed the character, the second one the prompt.

> Any kind of charset mixing is highly problematic (which is also why my 
> patch was attached compressed), so if you disable UTF-8 anywhere in a 
> modern distribution problems are somehow expected (it could also be a 
> bug in Mandrivas default settings, but that would really surprise me).

No, it was not disabled at all. I had to type in a command for a
co-worker who had just done a default install the day before, and made a
typo which I wanted to fix.

> > Unicode yes, UTF-8 no. UTF-8 is a compressed encoding of unicode.
> > That's as silly as if you had to replace your terminals to read
> > native gzip, and expect them as well as all the tools to work
> > properly!
> 
> It's not a compressed encoding, it's a variable-length encoding.
> 
> Besides the size advantages one main advantage of UTF-8 is that ASCII is 
> valid UTF-8. This means that for the ASCII source code in the kernel it 
> doesn't matter whether it's treated as ASCII or UTF-8, and no conversion 
> was needed.
> 
> You can't get this property with a fixed-size Unicode encoding.

I don't agree. If you refuse character-set mixing, there's no problem.
Bit 7 of first char == 1 ? => full text is 32 bit.

Willy



* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-29 11:06                 ` Willy Tarreau
@ 2008-04-29 11:27                   ` Adrian Bunk
  2008-04-29 11:32                     ` Adrian Bunk
  0 siblings, 1 reply; 48+ messages in thread
From: Adrian Bunk @ 2008-04-29 11:27 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Helge Hafting, H. Peter Anvin, linux-kernel, trivial

On Tue, Apr 29, 2008 at 01:06:38PM +0200, Willy Tarreau wrote:
> On Tue, Apr 29, 2008 at 01:42:16PM +0300, Adrian Bunk wrote:
> > On Tue, Apr 29, 2008 at 12:09:34PM +0200, Willy Tarreau wrote:
>...
> > > Unicode yes, UTF-8 no. UTF-8 is a compressed encoding of unicode.
> > > That's as silly as if you had to replace your terminals to read
> > > native gzip, and expect them as well as all the tools to work
> > > properly!
> > 
> > It's not a compressed encoding, it's a variable-length encoding.
> > 
> > Besides the size advantages one main advantage of UTF-8 is that ASCII is 
> > valid UTF-8. This means that for the ASCII source code in the kernel it 
> > doesn't matter whether it's treated as ASCII or UTF-8, and no conversion 
> > was needed.
> > 
> > You can't get this property with a fixed-size Unicode encoding.
> 
> I don't agree. If you refuse character-set mixing, there's no problem.
> Bit 7 of first char == 1 ? => full text is 32 bit.

You miss my point.

The point is:
A conversion "ASCII -> UTF-8" is a nop.

This means when changing the kernel from half a dozen charsets used in 
comments to UTF-8 we only had to change the few characters actually 
containing non UTF-8.

Going to something like UTF-32 as you suggest would have involved 
converting every single file in the kernel.
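
To make the size of each conversion concrete, here is a sketch in Python. The comment line and the source line are made-up samples, not taken from the patch. Converting a Latin-1 comment to UTF-8 touches only the one non-ASCII byte, converting ASCII to UTF-8 changes nothing, and converting the same ASCII to UTF-32 rewrites the whole file.

```python
# Illustrative only: the sample lines below are invented for this sketch.
latin1_line = "/* Author: J\u00f6rn */\n".encode("iso-8859-1")
utf8_line = latin1_line.decode("iso-8859-1").encode("utf-8")

# Only the "ö" changes: one byte 0xF6 becomes the two bytes 0xC3 0xB6.
grew_by = len(utf8_line) - len(latin1_line)
round_trip = utf8_line.replace(b"\xc3\xb6", b"\xf6") == latin1_line

# A pure ASCII file needs no conversion at all for UTF-8 ...
ascii_only = b"static int x;\n"
utf8_is_nop = ascii_only.decode("ascii").encode("utf-8") == ascii_only

# ... but every character would be rewritten as four bytes for UTF-32.
utf32 = ascii_only.decode("ascii").encode("utf-32-le")
utf32_ratio = len(utf32) // len(ascii_only)

print(grew_by, round_trip, utf8_is_nop, utf32_ratio)
```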

> Willy

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed



* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-29 11:27                   ` Adrian Bunk
@ 2008-04-29 11:32                     ` Adrian Bunk
  2008-04-29 20:18                       ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 48+ messages in thread
From: Adrian Bunk @ 2008-04-29 11:32 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Helge Hafting, H. Peter Anvin, linux-kernel, trivial

On Tue, Apr 29, 2008 at 02:27:18PM +0300, Adrian Bunk wrote:
> 
> You miss my point.
> 
> The point is:
> A conversion "ASCII -> UTF-8" is a nop.
> 
> This means when changing the kernel from half a dozen charsets used in 
> comments to UTF-8 we only had to change the few characters actually 
> containing non UTF-8.

"containing non-ASCII"

> Going to something like UTF-32 as you suggest would have involved 
> converting every single file in the kernel.

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed



* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-28 15:40 [2.6 patch] UTF-8 fixes in comments Adrian Bunk
  2008-04-28 23:05 ` Willy Tarreau
@ 2008-04-29 12:18 ` KOSAKI Motohiro
  1 sibling, 0 replies; 48+ messages in thread
From: KOSAKI Motohiro @ 2008-04-29 12:18 UTC (permalink / raw)
  To: Adrian Bunk; +Cc: kosaki.motohiro, linux-kernel, trivial

> This patch converts some non-UTF-8 encoded text in comments to UTF-8.
> 
> Signed-off-by: Adrian Bunk <bunk@kernel.org>

Good Job!

   Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

AFAIK some files are already written in UTF-8.
Frankly, speaking from the standpoint of a non-European:

all files are written in ASCII:      no problem
all files are written in iso8859-1:  need to customize the editor
all files are written in UTF-8:      no problem
some files are written in iso8859-1, 
but other files are written in UTF-8: Ouch! Noooooo!!
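
One way to spot the mixed case is to test each file's bytes against ASCII and then UTF-8. The following is a minimal sketch in Python; the function `classify` and its return labels are assumptions made for this example, and it cannot tell which legacy charset a non-UTF-8 file actually uses.

```python
def classify(data: bytes) -> str:
    """Label a byte string as 'ascii', 'utf-8', or 'other' (hypothetical helper)."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: probably iso8859-1 or another legacy charset.
        return "other"


print(classify(b"/* plain */"),
      classify("/* J\u00f6rn */".encode("utf-8")),
      classify("/* J\u00f6rn */".encode("iso-8859-1")))
```

In practice this would be run over the contents of each source file read in binary mode; anything labelled "other" is a candidate for conversion.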


* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-29  8:14         ` Willy Tarreau
  2008-04-29  9:06           ` Helge Hafting
  2008-04-29  9:43           ` Adrian Bunk
@ 2008-04-29 19:31           ` H. Peter Anvin
  2008-04-29 20:05             ` Willy Tarreau
  2 siblings, 1 reply; 48+ messages in thread
From: H. Peter Anvin @ 2008-04-29 19:31 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Adrian Bunk, linux-kernel, trivial

Willy Tarreau wrote:
> 
> Well, I accidentally used a freshly installed laptop running mandriva 2008.
> I was typing in a terminal inside KDE (I don't know the program name, sort
> of an xterm, but with huge borders all around). I made a typo in a word and
> typed in a "é" (e acute). Pressing backspace to fix it showed me that I
> remove more chars than typed. I tried again. Pressing this letter 5 times,
> then 10 times backspace. I removed 5 chars from the prompt. I suspect that
> if I had used some chars with wider encoding (eg 4 bytes), I could have
> removed as many... Clearly those tools are not ready.
> 

Presumably, this was konsole.  konsole works fine with UTF-8 (I use it 
that way every day); the most common cause of this kind of problem is 
people explicitly clobbering the locale or charset class defaults in 
their login scripts.

	-hpa



* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-29 10:10               ` Alan Cox
  2008-04-29 10:33                 ` Willy Tarreau
@ 2008-04-29 19:33                 ` H. Peter Anvin
  1 sibling, 0 replies; 48+ messages in thread
From: H. Peter Anvin @ 2008-04-29 19:33 UTC (permalink / raw)
  To: Alan Cox; +Cc: Willy Tarreau, Helge Hafting, Adrian Bunk, linux-kernel, trivial

Alan Cox wrote:
> 
>> Funny that you mention Windows. Windows has been using 16-bit unicode
>> for a long time without problems. It's a clean encoding. Like it or not.
> 
> I would describe the UCS-2 situation as a disaster area - embedded nuls
> causing breakage, inability to represent the full unicode space and
> awkward programming interfaces.
> 

Not to mention the fact that UCS-2 ran out of code points almost as soon 
as they said "no more codepoints."  The result was UTF-16, a hideous 
abortion which took all the problems with wide encodings, combined it 
with all the problems of multibyte encodings, and added a few new ones 
for good measure.
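
For readers who have not met the mechanism, a small sketch in Python of how UTF-16 handles code points that do not fit in 16 bits. The helper `utf16_units` and the chosen characters are illustrative assumptions; the surrogate arithmetic is the standard UTF-16 rule.

```python
# Illustrative only: how UTF-16 represents code points inside and outside the BMP.
def utf16_units(ch: str) -> list:
    """Return the 16-bit code units of a single character (helper for this sketch)."""
    raw = ch.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


e_acute = utf16_units("\u00e9")        # fits in one unit, as in UCS-2
g_clef = utf16_units("\U0001d11e")     # U+1D11E needs a surrogate pair

# The pair can be derived by hand from the code point.
offset = 0x1D11E - 0x10000
high = 0xD800 + (offset >> 10)
low = 0xDC00 + (offset & 0x3FF)

print([hex(u) for u in e_acute], [hex(u) for u in g_clef], hex(high), hex(low))
```

So UTF-16 text is variable-length as well: code that assumes one 16-bit unit per character breaks on anything outside the Basic Multilingual Plane.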

	-hpa


* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-29 19:31           ` H. Peter Anvin
@ 2008-04-29 20:05             ` Willy Tarreau
  2008-04-29 20:09               ` H. Peter Anvin
  0 siblings, 1 reply; 48+ messages in thread
From: Willy Tarreau @ 2008-04-29 20:05 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Adrian Bunk, linux-kernel, trivial

On Tue, Apr 29, 2008 at 12:31:01PM -0700, H. Peter Anvin wrote:
> Willy Tarreau wrote:
> >
> >Well, I accidentally used a freshly installed laptop running mandriva 2008.
> >I was typing in a terminal inside KDE (I don't know the program name, sort
> >of an xterm, but with huge borders all around). I made a typo in a word and
> >typed in a "é" (e acute). Pressing backspace to fix it showed me that I
> >remove more chars than typed. I tried again. Pressing this letter 5 times,
> >then 10 times backspace. I removed 5 chars from the prompt. I suspect that
> >if I had used some chars with wider encoding (eg 4 bytes), I could have
> >removed as many... Clearly those tools are not ready.
> >
> 
> Presumably, this was konsole.

Possible. It was the one you get by clicking on a terminal icon.
Huuhhh what a horror, I'm discussing icons and GUIs on LKML. I must
take my meds :-)

> konsole works fine with UTF-8 (I use it 
> that way every day); the most common cause of this kind of problems is 
> people explicitly clobbering the locale or charset class defaults in 
> their login scripts.

I really doubt the miss would have done this. Or someone would have done
it for her which I really doubt in such a small time frame after a fresh
install from the day before. I will investigate though.

Willy



* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-29 20:05             ` Willy Tarreau
@ 2008-04-29 20:09               ` H. Peter Anvin
  0 siblings, 0 replies; 48+ messages in thread
From: H. Peter Anvin @ 2008-04-29 20:09 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Adrian Bunk, linux-kernel, trivial

Willy Tarreau wrote:
> 
>> konsole works fine with UTF-8 (I use it 
>> that way every day); the most common cause of this kind of problems is 
>> people explicitly clobbering the locale or charset class defaults in 
>> their login scripts.
> 
> I really doubt the miss would have done this. Or someone would have done
> it for her which I really doubt in such a small time frame after a fresh
> install from the day before. I will investigate though.
> 

From one of Alan's posts it sounds like there was a bug with multibyte
characters in readline at some point that got fixed relatively quickly, 
but still made it out.

	-hpa


* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-29 11:32                     ` Adrian Bunk
@ 2008-04-29 20:18                       ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 48+ messages in thread
From: Jeremy Fitzhardinge @ 2008-04-29 20:18 UTC (permalink / raw)
  To: Adrian Bunk
  Cc: Willy Tarreau, Helge Hafting, H. Peter Anvin, linux-kernel,
	trivial

Adrian Bunk wrote:
> On Tue, Apr 29, 2008 at 02:27:18PM +0300, Adrian Bunk wrote:
>   
>> You miss my point.
>>
>> The point is:
>> A conversion "ASCII -> UTF-8" is a nop.
>>
>> This means when changing the kernel from half a dozen charsets used in 
>> comments to UTF-8 we only had to change the few characters actually 
>> containing non UTF-8.
>>     
>
> "containing non-ASCII"
>   

Same thing ;)

    J


* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-29 10:34                   ` Alan Cox
@ 2008-04-29 22:12                     ` Willy Tarreau
  2008-04-29 22:15                       ` Alan Cox
  0 siblings, 1 reply; 48+ messages in thread
From: Willy Tarreau @ 2008-04-29 22:12 UTC (permalink / raw)
  To: Alan Cox
  Cc: Helge Hafting, Adrian Bunk, H. Peter Anvin, linux-kernel, trivial

Hi Alan,

On Tue, Apr 29, 2008 at 11:34:10AM +0100, Alan Cox wrote:
> > behaviour). The shell no, it was the one present on my machine and
> > has never been compiled with UTF-8 support, and should not have to.
> 
> Bizarre, so you are using deliberately misconfigured ancient userspace to
> complain about utf-8

No I'm not using anything deliberately misconfigured. I'm trying to explain
that on the contrary, any tool which has not been explicitly adapted to those
new usages is impacted.

> > In my opinion, the problem is that when I press "é", the system sends
> > two chars to the bash, which itself sends two chars to the terminal,
> > which only displays one and moves the cursor one step ahead. Then,
> > pressing backspace once sends one backspace all along, resulting in
> > the terminal blanking one displayed char, but the shell not being
> 
> The shell puts the terminal in character by character mode and readline
> does this. If you have your shell/readline deliberately set up not to be
> doing unicode locales then it will do the wrong thing.

Please, I'm not "deliberately" setting my tools *not* to support unicode.
I have tools which have worked for years and which are now asked to behave
strangely.

> > So in my opinion, when we send one backspace to the terminal to
> > remove one character, since there are two in the buffer, we
> > should not get back one full char. Ideally, the console driver
> > should send as many backspaces as needed to fix the multiple
> 
> The console driver isn't involved - readline took over for the shell, and
> readline most definitely supports this in a utf8 locale.

OK I could reproduce the case without ever involving either a shell or
readline or anything. Using "cat" as the init program exhibited the
anomaly, though it was not very easy to analyze. Then I switched to
"init=od -An -tx1 -".

1) if I enter "A" then press backspace, I get nothing. Pressing enter 16
   times flushes the line buffer and "od" prints 16 times "0a", indicating
   nothing was remaining in the buffer.

2) if I enter Ctrl-V Ctrl-A, my display prints "^A", and when I press
   backspace, I correctly get the cursor back two chars. Once again,
   flushing the buffer with enter shows it was empty.

3) if I enter Alt-196, I get a "Ä". Flushing the buffer shows that od
   got two bytes: c3 84.

4) now if I enter Alt-196 and press backspace, my "Ä" is removed by the
   backspace, but only the second byte is flushed from the line buffer.
   Then, if I press enter 15 times, I get a line with c3 0a 0a 0a ...
   And there is no user-land involved here.

I'm really hoping you better understand the problem now. Pressing backspace
to fix input does not correct the input with multi-byte chars, it leaves
incomplete start sequences. If I press Alt-1111111, then backspace, I get
f4 8f 91 0a 0a 0a 0a because it is f4 8f 91 87 minus one byte.

Of course, pressing Backspace multiple times removes them all, but it also
removes previous characters on the display.

Another experiment:

I press 01234, then Alt-255, Backspace, then 56789. On the display, I have
0123456789. od gets 30 31 32 33 34 c3 35 36 37 38 39.

Now if I want to correctly fix the input, I have to press backspace twice,
but then I have to make the '4' disappear from my display, while knowing it
still remains in the buffer. And indeed, my display shows "012356789" but
od sees 30 31 32 33 34 35 36 37 38 39.

And this is without anything on the user-land (except 'od'), just plain
stupid text console booted with "init=..."

So obviously there is something broken as the data fed into stdin does not
match what is displayed for multi-byte characters.

Hoping this clarifies the situation,
Willy
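
The byte counts reported in cases 3) and 4) above can be reproduced
with a small userspace model of canonical-mode erase. This is only a
sketch under assumptions: erase_last and is_continuation are
hypothetical names, and the kernel's own line discipline code is not
reproduced here. The utf8 argument stands in for the tty's iutf8
setting.

```c
#include <assert.h>
#include <stddef.h>

/* True for UTF-8 continuation bytes, which have the form 10xxxxxx. */
static int is_continuation(unsigned char c)
{
	return (c & 0xC0) == 0x80;
}

/*
 * Model one press of backspace on the pending line buf[0..len).
 * Returns the new length. With utf8 == 0 exactly one byte is dropped;
 * with utf8 != 0 the continuation bytes of the last character are
 * dropped together with its lead byte.
 */
static size_t erase_last(const unsigned char *buf, size_t len, int utf8)
{
	assert(buf != NULL || len == 0);

	if (len == 0)
		return 0;
	len--;			/* always drop at least one byte */
	if (utf8) {
		while (len > 0 && is_continuation(buf[len]))
			len--;	/* walk back to the lead byte */
	}
	return len;
}
```

With the bytes 30 31 32 33 34 c3 bf ("01234" followed by Alt-255), the
byte-wise variant leaves the stray c3 behind, while the UTF-8 aware
variant removes both bytes and stops at the '4'.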


* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-29 22:12                     ` Willy Tarreau
@ 2008-04-29 22:15                       ` Alan Cox
  2008-04-29 23:05                         ` Willy Tarreau
  0 siblings, 1 reply; 48+ messages in thread
From: Alan Cox @ 2008-04-29 22:15 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Helge Hafting, Adrian Bunk, H. Peter Anvin, linux-kernel, trivial

> OK I could reproduce the case without ever involving either a shell or
> readline or anything. Using "cat" as the init program exhibited the
> anomaly, though it was not much easy to analyze. Then I switched to
> "init=od -An -tx1 -".

Did you put the console into utf-8 mode before the cat ?


* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-29 22:15                       ` Alan Cox
@ 2008-04-29 23:05                         ` Willy Tarreau
  2008-05-01 20:18                           ` H. Peter Anvin
  0 siblings, 1 reply; 48+ messages in thread
From: Willy Tarreau @ 2008-04-29 23:05 UTC (permalink / raw)
  To: Alan Cox
  Cc: Helge Hafting, Adrian Bunk, H. Peter Anvin, linux-kernel, trivial

On Tue, Apr 29, 2008 at 11:15:54PM +0100, Alan Cox wrote:
> > OK I could reproduce the case without ever involving either a shell or
> > readline or anything. Using "cat" as the init program exhibited the
> > anomaly, though it was not much easy to analyze. Then I switched to
> > "init=od -An -tx1 -".
> 
> Did you put the console into utf-8 mode before the cat ?

I had not *explicitly* disabled it, since as the doc suggests:

        vt.default_utf8=
                        [VT]
                        Format=<0|1>
                        Set system-wide default UTF-8 mode for all tty's.
                        Default is 1, i.e. UTF-8 mode is enabled for all
                        newly opened terminals.

And I know that I can fix the behaviour by explicitly setting it to zero.
Also, the fact that "od" shows me multi-byte characters on the input
indicates to me that everything is set to UTF-8. So unless I'm missing
something, my console is set by default to UTF-8 (I test this on 2.6.25).

Regards,
Willy



* Re: [2.6 patch] UTF-8 fixes in comments
@ 2008-04-30  0:08 Samuel Thibault
  2008-04-30  3:38 ` Chris Adams
                   ` (2 more replies)
  0 siblings, 3 replies; 48+ messages in thread
From: Samuel Thibault @ 2008-04-30  0:08 UTC (permalink / raw)
  To: linux-kernel

Willy Tarreau wrote:
> 3) if I enter Alt-196, I get a "Ä". Flushing the buffer shows that od
> got two bytes: c3 84.

Confirmed.

Try init=/bin/stty -a, that will show

-iutf8

So there is little wonder that canonical mode does not work as expected.

Try init=/bin/sh, from that shell run stty iutf8. Then things will work
fine.  The fix is thus just to make the VT's tty initial iutf8 setup
follow vt.default_utf8.

Samuel
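
What "stty iutf8" does from the shell can also be done from a program
through termios. A minimal sketch, assuming Linux with glibc (IUTF8 is
a Linux extension, not POSIX); with_iutf8 and tty_set_iutf8 are names
made up for this example.

```c
#include <assert.h>
#include <termios.h>
#include <unistd.h>

/* Return iflag with IUTF8 set or cleared. Pure helper, no I/O. */
static tcflag_t with_iutf8(tcflag_t iflag, int on)
{
	return on ? (iflag | IUTF8) : (iflag & ~(tcflag_t)IUTF8);
}

/*
 * Set or clear IUTF8 on the terminal behind fd, roughly what
 * "stty iutf8" / "stty -iutf8" do for their standard input.
 * Returns 0 on success, -1 if fd is not a terminal or a call fails.
 */
static int tty_set_iutf8(int fd, int on)
{
	struct termios tio;

	assert(fd >= 0);
	if (tcgetattr(fd, &tio) < 0)
		return -1;
	tio.c_iflag = with_iutf8(tio.c_iflag, on);
	return tcsetattr(fd, TCSANOW, &tio) < 0 ? -1 : 0;
}
```

Calling tty_set_iutf8(STDIN_FILENO, 1) before reading in canonical
mode would be the programmatic counterpart of the workaround above.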


* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-30  0:08 [2.6 patch] UTF-8 fixes in comments Samuel Thibault
@ 2008-04-30  3:38 ` Chris Adams
  2008-04-30  9:38 ` Samuel Thibault
  2008-04-30 19:49 ` Willy Tarreau
  2 siblings, 0 replies; 48+ messages in thread
From: Chris Adams @ 2008-04-30  3:38 UTC (permalink / raw)
  To: linux-kernel

Once upon a time, Samuel Thibault  <samuel.thibault@ens-lyon.org> said:
>Try init=/bin/sh, from that shell run stty iutf8. Then things will work
>fine.  The fix is thus just to make the VT's tty initial iutf8 setup
>follow vt.default_utf8.

You may also need to select a UTF-8 locale (e.g. LANG="en_US.UTF-8") for
programs like bash to handle this correctly.

-- 
Chris Adams <cmadams@hiwaay.net>
Systems and Network Administrator - HiWAAY Internet Services
I don't speak for anybody but myself - that's enough trouble.


* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-29 10:09             ` Willy Tarreau
  2008-04-29 10:10               ` Alan Cox
  2008-04-29 10:42               ` Adrian Bunk
@ 2008-04-30  9:15               ` Helge Hafting
  2008-04-30 19:22                 ` Adrian Bunk
  2008-04-30 19:42                 ` H. Peter Anvin
  2 siblings, 2 replies; 48+ messages in thread
From: Helge Hafting @ 2008-04-30  9:15 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Adrian Bunk, H. Peter Anvin, linux-kernel, trivial

Willy Tarreau wrote:
> On Tue, Apr 29, 2008 at 11:06:05AM +0200, Helge Hafting wrote:
>   
>>> Well, I accidentally used a freshly installed laptop running mandriva 2008.
>>> I was typing in a terminal inside KDE (I don't know the program name, sort
>>> of an xterm, but with huge borders all around). I made a typo in a word and
>>> typed in a "é" (e acute). Pressing backspace to fix it showed me that I
>>> remove more chars than typed. I tried again. Pressing this letter 5 times,
>>> then 10 times backspace. I removed 5 chars from the prompt. I suspect that
>>> if I had used some chars with wider encoding (eg 4 bytes), I could have
>>> removed as many... Clearly those tools are not ready.
>>>  
>>>       
>> So don't use that particular tool
>>     
>
> It was not my machine, and had you been there, you would have heard me call
> it names !
>   
We all do that, for various reasons...
>   
>> and/or file a bug with the maintainer. :-)
>>     
>
> It's too easy to impose crappy designs to end-users and tell them that if
> that does not work they have to file a bug. There are a minimal set of
> things that must be tested before shipping. Seeing that the default
> terminal emulator in KDE on Mandriva 2008 is configured in UTF-8 and does
> not properly render it simply makes me sick. This is broken by design and
> even distros trying to get it working for years still can't cope with it.
> There must be a reason.
>   
Yeah, ascii-only is a crappy design. :-/
I don't know if mandriva is broken by design - I only use debian.
It would not surprise me if some distros botch utf-8 through negligence.
They are based in english-speaking countries and have their biggest
user bases there - the majority of their customers aren't going to use
more than ascii so why should they bother.

Someone made a "cool" terminal emulator? Transparency and effects?
Distribute it, despite the fact that it won't work in all cases.
Distro contains xterm anyway for those that need a fallback.
Machine owner thinks one terminal emulator is enough and
installs the default or cool one only.
>> I have used utf-8 for years - the fact that some editors and some terminal
>> emulators fail is not a problem for me. There are so many that work
>> just fine. There is unicode xterm, and rxvt if you consider xterm too heavy.
>> Both vi and emacs have versions that handle utf-8 competently. You may
>> have to put in a one-off effort in finding a suitable font for your xterm,
>> if you actually want to see proper umlauts in all cases. If you don't care
>> about looks, then xterm will display blanks/squares and backspace etc.
>> will still work.
>>     
>
> I don't care about the *look*. Mutt shows me a question mark when it does
> not know. I care about the *behaviour*. Having backspace go back farther
> than the prompt is not acceptable. Having 80-col lines span over two lines
> is absurd.
>
>   
>> Outside the english-speaking world, userland _was_ completely
>> broken in the day of ascii. And supporting the multiple
>> iso8859-xx encodings was completely broken too, if you ever needed
>> more than one of them.
>>     
>
> yes but you just had unexpected characters. Just like MS-DOS when
> switching from code-page 437 to 850. Aside this, everything worked.
>   
I don't see how wrong characters are better than backspace eating
the prompt or 80-col overflowing when it shouldn't. It is all breakage
either way.
Stuff breaks if TERM is set wrong for the terminal in use too, or if the
app in use doesn't _use_ the TERM variable. This happens too, and you only
notice if the app runs on a terminal incompatible with TERM=linux.
[...]
>> If you want to know what it is like, knock three vowels or so out of the
>> english alphabet. Consider them not supported. Invent "transcriptions"
>> if you like.
>>     
>
> amusing comparison :-)
>   
Amusing and accurate. I use Norwegian which has 3 non-ascii vowels. As well
as some accented characters, but they don't crop up in _every other sentence_.

>> Lots of people actually bothered - and created various encoding schemes
>> to struggle with until they came up with unicode. English speakers and
>> people _only_ interested in simple tools like tar and ls didn't bother
>> perhaps.
>>     
>
> You know why we got this encoding ? Simply because it was designed by
> english speakers who did not want to be impacted at all by the transition.
> That way they can still use their old "elm", "cat" and "vi" with no
> hassle and pretend to be UTF-8 ready.
>   
It had to be done in an ascii-compatible way. That way, a userland
containing a mix of ascii-only apps, fully utf-8 supporting apps, and apps
with partial utf-8 support will work flawlessly for ascii-only stuff. Like
C source and english language tools. Of course utf-8 only works in the apps
supporting it, but utf-8 users keep fixing this in the apps they need.

Breaking ascii compatibility was not an option, because that means
replacing the entire userland in one operation.  That cannot be done
unless a single authority controls everything, and the open source world
isn't like that.

Variable length encoding is necessary, given that:
* Ascii should work as before, i.e. one "char" per ascii character
* One single encoding so a plain text file can contain the symbols of
   any writing system in use. There are way more than 256 symbols.
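
The two requirements above are exactly what the UTF-8 layout delivers:
ASCII code points stay one unchanged byte, and everything else takes two
to four bytes. A minimal encoder sketch in C, for illustration only;
utf8_encode is a hypothetical name and is not from any message here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one Unicode code point as UTF-8.
 * Returns the number of bytes written (1 to 4), or 0 if cp is a
 * surrogate value or lies beyond U+10FFFF.
 */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
	assert(out != NULL);

	if (cp < 0x80) {		/* ASCII: one byte, unchanged */
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {		/* 110xxxxx 10xxxxxx */
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {		/* 1110xxxx 10xxxxxx 10xxxxxx */
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return 0;	/* surrogates are not characters */
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	if (cp <= 0x10FFFF) {		/* 11110xxx and three 10xxxxxx */
		out[0] = (unsigned char)(0xF0 | (cp >> 18));
		out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
		out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[3] = (unsigned char)(0x80 | (cp & 0x3F));
		return 4;
	}
	return 0;
}
```

Feeding it 'A' yields the single byte 41, which is why pure ASCII files
needed no conversion, while U+00C4 yields c3 84 and U+10F447 yields
f4 8f 91 87, the byte sequences quoted earlier in the thread.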

[...]
>> Such "rules" may work for kernel comments specifically.
>> But linux is used for much more than that, so it now supports utf-8 just
>> fine.
>> People who have a properly set up system see no reason why they
>> can't use utf-8 in the kernel too. Consider tools that work. Or fix
>> the few remaining that don't work - if you are attached to them.
>>     
>
> No, you're speaking as a desktop user. You upgrade every 6-months. When
> you have several machines, with various OSes, you know that the first
> one which will stuff this crap everywhere will cause even more trouble
> with the other ones. At one moment, you'll have to upgrade everything.
> BTW, do you have an UTF-8 patch for the vt320 and vt510 I use as an
> always-on console on my servers ? Clearly, the system does not have to
> be "properly setup" to behave correctly. A kernel running bash as init
> is a "properly setup system". Displaying wrong things is OK, behaving
> badly is not.
>   
No, I don't have a utf-8 patch for vt320 terminals. Using one is your choice.
Either you don't work with utf-8 stuff on it, or you
use intermediate software that translates the utf-8 to something the
terminal can display in an acceptable manner.
>   
>>> I would have loved to see "several different charsets -> ASCII".
>>>  
>>>       
>> And all those that actually used those "different charsets" disagree,
>> or they'd used ascii in the first place too. :-)
>>     
>
> As I said to Adrian, I did not even know there were non-ASCII chars
> in our sources, and found it a bit shocking. Well, maybe I'm just an
> old-timer and I need to stop working with computers :-/
>   
If you _cannot_ accept utf-8, then your computer world will shrink with time.
Or you can live with a few things you don't like - most of us have to,
given that the computer world has so many people with differing opinions.


Helge Hafting


* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-30  0:08 [2.6 patch] UTF-8 fixes in comments Samuel Thibault
  2008-04-30  3:38 ` Chris Adams
@ 2008-04-30  9:38 ` Samuel Thibault
  2008-04-30 19:45   ` Willy Tarreau
  2008-04-30 19:49 ` Willy Tarreau
  2 siblings, 1 reply; 48+ messages in thread
From: Samuel Thibault @ 2008-04-30  9:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: cmadams, Willy Tarreau, Alan Cox, Helge Hafting, Adrian Bunk,
	H. Peter Anvin

Chris Adams wrote:
> Once upon a time, Samuel Thibault  <samuel.thibault@ens-lyon.org> said:
> >Try init=/bin/sh, from that shell run stty iutf8. Then things will work
> >fine.  The fix is thus just to make the VT's tty initial iutf8 setup
> >follow vt.default_utf8.
> 
> You may also need to select a UTF-8 locale (e.g. LANG="en_US.UTF-8") for
> programs like bash to handle this correctly.

Yes of course, but here the purpose was _not_ programs like bash, but
the canonical mode (i.e. programs like cat etc.), for which the LANG
variable has no effect, only iutf8 has.

Samuel


* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-30  9:15               ` Helge Hafting
@ 2008-04-30 19:22                 ` Adrian Bunk
  2008-04-30 19:42                 ` H. Peter Anvin
  1 sibling, 0 replies; 48+ messages in thread
From: Adrian Bunk @ 2008-04-30 19:22 UTC (permalink / raw)
  To: Helge Hafting; +Cc: Willy Tarreau, H. Peter Anvin, linux-kernel, trivial

On Wed, Apr 30, 2008 at 11:15:12AM +0200, Helge Hafting wrote:
> Willy Tarreau wrote:
>...
>> It's too easy to impose crappy designs to end-users and tell them that if
>> that does not work they have to file a bug. There are a minimal set of
>> things that must be tested before shipping. Seeing that the default
>> terminal emulator in KDE on Mandriva 2008 is configured in UTF-8 and does
>> not properly render it simply makes me sick. This is broken by design and
>> even distros trying to get it working for years still can't cope with it.
>> There must be a reason.
>>   
> Yeah, ascii-only is a crappy design. :-/ I don't know if mandriva is 
> broken by design - I only use debian.
> It would not surprise me if some distros  botch utf-8 through negligence.
> They are based in english-speaking countries and have their biggest
> user bases there - the majority of their customers aren't going to use
> more than ascii so why should they bother.

Mandriva is a French company.

And what Willy describes really sounds like someone fiddling with some 
settings (or something like accidentally selecting some non UTF-8 
locale).

Bad things can happen when you somehow get charsets mixed, but
distributions have defaulted to UTF-8 for quite some time, and problems
with a 100% UTF-8 system have therefore become very unlikely.

>...
> Helge Hafting

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed



* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-30  9:15               ` Helge Hafting
  2008-04-30 19:22                 ` Adrian Bunk
@ 2008-04-30 19:42                 ` H. Peter Anvin
  1 sibling, 0 replies; 48+ messages in thread
From: H. Peter Anvin @ 2008-04-30 19:42 UTC (permalink / raw)
  To: Helge Hafting; +Cc: Willy Tarreau, Adrian Bunk, linux-kernel, trivial

Helge Hafting wrote:
> It would not surprise me if some distros  botch utf-8 through negligence.
> They are based in english-speaking countries and have their biggest
> user bases there - the majority of their customers aren't going to use 
> more than ascii so why should they bother.

Well, we were talking about Mandriva, which is a Brazilian-French
company whose main languages are Portuguese and French; you'd think
they'd notice themselves.  Most likely there was something in Willy's
configuration that buggered it up.

	-hpa



* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-30  9:38 ` Samuel Thibault
@ 2008-04-30 19:45   ` Willy Tarreau
  0 siblings, 0 replies; 48+ messages in thread
From: Willy Tarreau @ 2008-04-30 19:45 UTC (permalink / raw)
  To: Samuel Thibault, linux-kernel, cmadams, Alan Cox, Helge Hafting,
	Adrian Bunk, H. Peter Anvin

On Wed, Apr 30, 2008 at 10:38:32AM +0100, Samuel Thibault wrote:
> Chris Adams wrote:
> > Once upon a time, Samuel Thibault  <samuel.thibault@ens-lyon.org> said:
> > >Try init=/bin/sh, from that shell run stty iutf8. Then things will work
> > >fine.  The fix is thus just to make the VT's tty initial iutf8 setup
> > >follow vt.default_utf8.
> > 
> > You may also need to select a UTF-8 locale (e.g. LANG="en_US.UTF-8") for
> > programs like bash to handle this correctly.
> 
> Yes of course, but here the purpose was _not_ programs like bash, but
> the canonical mode (i.e. programs like cat etc.), for which the LANG
> variable has no effect, only iutf8 has.

exactly, thanks for understanding my problem Samuel :-)

Willy


* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-30  0:08 [2.6 patch] UTF-8 fixes in comments Samuel Thibault
  2008-04-30  3:38 ` Chris Adams
  2008-04-30  9:38 ` Samuel Thibault
@ 2008-04-30 19:49 ` Willy Tarreau
  2008-05-03 23:50   ` Samuel Thibault
  2 siblings, 1 reply; 48+ messages in thread
From: Willy Tarreau @ 2008-04-30 19:49 UTC (permalink / raw)
  To: Samuel Thibault, linux-kernel

On Wed, Apr 30, 2008 at 01:08:51AM +0100, Samuel Thibault wrote:
> Willy Tarreau wrote:
> > 3) if I enter Alt-196, I get a "Ä". Flushing the buffer shows that od
> > got two bytes: c3 84.
> 
> Confirmed.
> 
> Try init=/bin/stty -a, that will show
> 
> -iutf8
> 
> So there is little wonder that canonical mode does not work as expected.
> 
> Try init=/bin/sh, from that shell run stty iutf8. Then things will work
> fine.  The fix is thus just to make the VT's tty initial iutf8 setup
> follow vt.default_utf8.

Will try that on a more recent install. My stty does not support
this option. Your analysis makes quite a lot of sense, and such a fix
would wipe part of my annoyances/anger with this recent change.

Thanks,
Willy



* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-29 10:33                 ` Willy Tarreau
  2008-04-29 10:34                   ` Alan Cox
@ 2008-05-01  9:46                   ` Alexander E. Patrakov
  1 sibling, 0 replies; 48+ messages in thread
From: Alexander E. Patrakov @ 2008-05-01  9:46 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Alan Cox, Helge Hafting, Adrian Bunk, H. Peter Anvin,
	linux-kernel, trivial

Willy Tarreau wrote:
> In my opinion, the problem is that when I press "é", the system sends
> two chars to the bash, which itself sends two chars to the terminal,
> which only displays one and moves the cursor one step ahead. Then,
> pressing backspace once sends one backspace all along, resulting in
> the terminal blanking one displayed char, but the shell not being
> aware that only half of it was removed. But if you look at how
> control chars are handled, if you display ^H then press backspace,
> you remove all of it. It's the terminal which adjusts the position
> depending on the character length.

export LANG=en_US.UTF-8 (i.e., inform the userspace that you are using UTF-8), 
unset LC_CTYPE and unset LC_ALL (so that they don't override $LANG), and problem 
solved.

-- 
Alexander E. Patrakov
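
The reason for unsetting LC_ALL and LC_CTYPE is the lookup order for the
character-type category: LC_ALL wins over LC_CTYPE, which wins over LANG,
and an empty value counts as unset. A small sketch of that order in C;
effective_ctype and mentions_utf8 are hypothetical helpers, and the
suffix check is a rough heuristic rather than how any libc decides.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* An unset (NULL) or empty variable is ignored. */
static int is_set(const char *v)
{
	return v != NULL && v[0] != '\0';
}

/*
 * Return the locale name that decides the character type, given the
 * values of LC_ALL, LC_CTYPE and LANG (NULL when unset). Falls back
 * to "C" when none of them is set.
 */
static const char *effective_ctype(const char *lc_all,
				   const char *lc_ctype,
				   const char *lang)
{
	if (is_set(lc_all))
		return lc_all;
	if (is_set(lc_ctype))
		return lc_ctype;
	if (is_set(lang))
		return lang;
	return "C";
}

/* Heuristic only: does the locale name carry a UTF-8 codeset suffix? */
static int mentions_utf8(const char *locale)
{
	assert(locale != NULL);
	return strstr(locale, ".UTF-8") != NULL ||
	       strstr(locale, ".utf8") != NULL;
}
```

In a real program the three arguments would come from getenv("LC_ALL"),
getenv("LC_CTYPE") and getenv("LANG"). A stray LC_ALL=POSIX left in a
login script is enough to override LANG=en_US.UTF-8.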


* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-29 23:05                         ` Willy Tarreau
@ 2008-05-01 20:18                           ` H. Peter Anvin
  0 siblings, 0 replies; 48+ messages in thread
From: H. Peter Anvin @ 2008-05-01 20:18 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Alan Cox, Helge Hafting, Adrian Bunk, linux-kernel, trivial

Willy Tarreau wrote:
> On Tue, Apr 29, 2008 at 11:15:54PM +0100, Alan Cox wrote:
>>> OK I could reproduce the case without ever involving either a shell or
>>> readline or anything. Using "cat" as the init program exhibited the
>>> anomaly, though it was not much easy to analyze. Then I switched to
>>> "init=od -An -tx1 -".
>> Did you put the console into utf-8 mode before the cat ?
> 
> I had not *explicitly* disabled it, since as the doc suggests:
> 
>         vt.default_utf8=
>                         [VT]
>                         Format=<0|1>
>                         Set system-wide default UTF-8 mode for all tty's.
>                         Default is 1, i.e. UTF-8 mode is enabled for all
>                         newly opened terminals.
> 
> And I know that I can fix the behaviour by explicitly setting it to zero.
> Also, the fact that "od" shows me multi-byte characters on the input
> indicates to me that everything is set to UTF-8. So unless I'm missing
> something, my console is set by default to UTF-8 (I test this on 2.6.25).
> 

Yes, there is apparently a real bug here: this vt setting doesn't 
propagate to the tty layer iutf8 flag.

	-hpa


* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-30 19:49 ` Willy Tarreau
@ 2008-05-03 23:50   ` Samuel Thibault
  2008-05-04  8:55     ` Willy Tarreau
  2008-05-04 10:25     ` Fix VT canonical input in UTF-8 mode [Was: UTF-8 fixes in comments] Samuel Thibault
  0 siblings, 2 replies; 48+ messages in thread
From: Samuel Thibault @ 2008-05-03 23:50 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: linux-kernel, akpm

Hello,

Willy Tarreau, le Wed 30 Apr 2008 21:49:20 +0200, a écrit :
> On Wed, Apr 30, 2008 at 01:08:51AM +0100, Samuel Thibault wrote:
> > Willy Tarreau wrote:
> > > 3) if I enter Alt-196, I get a "Ä". Flushing the buffer shows that od
> > > got two bytes: c3 84.
> > 
> > Confirmed.
> > 
> > Try init=/bin/stty -a, that will show
> > 
> > -iutf8
> > 
> > So there is little wonder that canonical mode does not work as expected.
> > 
> > Try init=/bin/sh, from that shell run stty iutf8. Then things will work
> > fine.  The fix is thus just to make the VT's tty initial iutf8 setup
> > follow vt.default_utf8.
> 
> Will try that on a more recent install. Mine's stty does not support
> this option. Your analysis makes quite a lot of sense, and such a fix
> would wipe part of my annoyances/anger with this recent change.

Can you give the patch below a try?
Dynamic per-VT utf-8 switch should also work, provided that you reopen
the VT (i.e. log out).

Samuel



Set IUTF8 as appropriate on VT tty open.

Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>

--- linux/drivers/char/vt.c.orig	2008-05-04 00:37:50.000000000 +0100
+++ linux/drivers/char/vt.c	2008-05-04 00:47:39.000000000 +0100
@@ -2723,6 +2723,10 @@ static int con_open(struct tty_struct *t
 				tty->winsize.ws_row = vc_cons[currcons].d->vc_rows;
 				tty->winsize.ws_col = vc_cons[currcons].d->vc_cols;
 			}
+			if (vc->vc_utf)
+				tty->termios->c_iflag |= IUTF8;
+			else
+				tty->termios->c_iflag &= ~IUTF8;
 			release_console_sem();
 			vcs_make_sysfs(tty);
 			return ret;
@@ -2899,6 +2903,8 @@ int __init vty_init(void)
 	console_driver->minor_start = 1;
 	console_driver->type = TTY_DRIVER_TYPE_CONSOLE;
 	console_driver->init_termios = tty_std_termios;
+	if (default_utf8)
+		console_driver->init_termios.c_iflag |= IUTF8;
 	console_driver->flags = TTY_DRIVER_REAL_RAW | TTY_DRIVER_RESET_TERMIOS;
 	tty_set_operations(console_driver, &con_ops);
 	if (tty_register_driver(console_driver))


* Re: [2.6 patch] UTF-8 fixes in comments
  2008-05-03 23:50   ` Samuel Thibault
@ 2008-05-04  8:55     ` Willy Tarreau
  2008-05-04 10:25     ` Fix VT canonical input in UTF-8 mode [Was: UTF-8 fixes in comments] Samuel Thibault
  1 sibling, 0 replies; 48+ messages in thread
From: Willy Tarreau @ 2008-05-04  8:55 UTC (permalink / raw)
  To: Samuel Thibault, linux-kernel, akpm

Hi Samuel,

On Sun, May 04, 2008 at 12:50:28AM +0100, Samuel Thibault wrote:
> Can you give the patch below a try?
> Dynamic per-VT utf-8 switch should also work, provided that you reopen
> the VT (i.e. log out).

I confirm that your patch works perfectly for me. Now backspace correctly
removes multi-byte characters. My bash is still fooled though but as Alan
explained it, it's readline which has to be upgraded now.

Thanks!
Willy



* Fix VT canonical input in UTF-8 mode [Was: UTF-8 fixes in comments]
  2008-05-03 23:50   ` Samuel Thibault
  2008-05-04  8:55     ` Willy Tarreau
@ 2008-05-04 10:25     ` Samuel Thibault
  2008-05-04 11:03       ` Willy Tarreau
  2008-05-05 23:00       ` Andrew Morton
  1 sibling, 2 replies; 48+ messages in thread
From: Samuel Thibault @ 2008-05-04 10:25 UTC (permalink / raw)
  To: Willy Tarreau, linux-kernel, akpm, stable

Samuel Thibault, le Sun 04 May 2008 00:50:27 +0100, a écrit :
> Willy Tarreau, le Wed 30 Apr 2008 21:49:20 +0200, a écrit :
> > On Wed, Apr 30, 2008 at 01:08:51AM +0100, Samuel Thibault wrote:
> > > Willy Tarreau wrote:
> > > > 3) if I enter Alt-196, I get a "Ä". Flushing the buffer shows that od
> > > > got two bytes: c3 84.
> > > 
> > > Confirmed.
> > > 
> > > Try init=/bin/stty -a, that will show
> > > 
> > > -iutf8
> > > 
> > > So there is little wonder that canonical mode does not work as expected.
> > > 
> > > Try init=/bin/sh, from that shell run stty iutf8. Then things will work
> > > fine.  The fix is thus just to make the VT's tty initial iutf8 setup
> > > follow vt.default_utf8.
> > 
> > Will try that on a more recent install. My stty does not support
> > this option. Your analysis makes quite a lot of sense, and such a fix
> > would wipe part of my annoyances/anger with this recent change.
> 
> Can you give the patch below a try?
> Dynamic per-VT utf-8 switch should also work, provided that you reopen
> the VT (i.e. log out).

Willy Tarreau, le Sun 04 May 2008 10:55:14 +0200, a écrit :
> I confirm that your patch works perfectly for me. Now backspace correctly
> removes multi-byte characters. My bash is still fooled, though; as Alan
> explained, it's readline that has to be upgraded now.

I guess this is suitable for the stable trees of 2.6.24 and 2.6.25
(where UTF-8 is now the default).




Set IUTF8 as appropriate on VT tty open.

Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>

--- linux/drivers/char/vt.c.orig	2008-05-04 00:37:50.000000000 +0100
+++ linux/drivers/char/vt.c	2008-05-04 00:47:39.000000000 +0100
@@ -2723,6 +2723,10 @@ static int con_open(struct tty_struct *t
 				tty->winsize.ws_row = vc_cons[currcons].d->vc_rows;
 				tty->winsize.ws_col = vc_cons[currcons].d->vc_cols;
 			}
+			if (vc->vc_utf)
+				tty->termios->c_iflag |= IUTF8;
+			else
+				tty->termios->c_iflag &= ~IUTF8;
 			release_console_sem();
 			vcs_make_sysfs(tty);
 			return ret;
@@ -2899,6 +2903,8 @@ int __init vty_init(void)
 	console_driver->minor_start = 1;
 	console_driver->type = TTY_DRIVER_TYPE_CONSOLE;
 	console_driver->init_termios = tty_std_termios;
+	if (default_utf8)
+		console_driver->init_termios.c_iflag |= IUTF8;
 	console_driver->flags = TTY_DRIVER_REAL_RAW | TTY_DRIVER_RESET_TERMIOS;
 	tty_set_operations(console_driver, &con_ops);
 	if (tty_register_driver(console_driver))



* Re: Fix VT canonical input in UTF-8 mode [Was: UTF-8 fixes in comments]
  2008-05-04 10:25     ` Fix VT canonical input in UTF-8 mode [Was: UTF-8 fixes in comments] Samuel Thibault
@ 2008-05-04 11:03       ` Willy Tarreau
  2008-05-05 23:00       ` Andrew Morton
  1 sibling, 0 replies; 48+ messages in thread
From: Willy Tarreau @ 2008-05-04 11:03 UTC (permalink / raw)
  To: Samuel Thibault, linux-kernel, akpm, stable

On Sun, May 04, 2008 at 11:25:54AM +0100, Samuel Thibault wrote:
> Willy Tarreau, le Sun 04 May 2008 10:55:14 +0200, a écrit :
> > I confirm that your patch works perfectly for me. Now backspace correctly
> > removes multi-byte characters. My bash is still fooled, though; as Alan
> > explained, it's readline that has to be upgraded now.
> 
> I guess this is suitable for the stable trees of 2.6.24 and 2.6.25
> (where UTF-8 is now the default).

agreed.

> Set IUTF8 as appropriate on VT tty open.
> 
> Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>

You should have added:  CC: stable@kernel.org here so that the stable
team automatically gets notified when it's merged into mainline.

Thanks!
Willy



* Re: Fix VT canonical input in UTF-8 mode [Was: UTF-8 fixes in comments]
  2008-05-04 10:25     ` Fix VT canonical input in UTF-8 mode [Was: UTF-8 fixes in comments] Samuel Thibault
  2008-05-04 11:03       ` Willy Tarreau
@ 2008-05-05 23:00       ` Andrew Morton
  2008-05-05 23:54         ` Samuel Thibault
  1 sibling, 1 reply; 48+ messages in thread
From: Andrew Morton @ 2008-05-05 23:00 UTC (permalink / raw)
  To: Samuel Thibault; +Cc: w, linux-kernel, stable

On Sun, 4 May 2008 11:25:54 +0100
Samuel Thibault <samuel.thibault@ens-lyon.org> wrote:

> Samuel Thibault, le Sun 04 May 2008 00:50:27 +0100, a écrit :
> > Willy Tarreau, le Wed 30 Apr 2008 21:49:20 +0200, a écrit :
> > > On Wed, Apr 30, 2008 at 01:08:51AM +0100, Samuel Thibault wrote:
> > > > Willy Tarreau wrote:
> > > > > 3) if I enter Alt-196, I get a "Ä". Flushing the buffer shows that od
> > > > > got two bytes: c3 84.
> > > > 
> > > > Confirmed.
> > > > 
> > > > Try init=/bin/stty -a, that will show
> > > > 
> > > > -iutf8
> > > > 
> > > > So there is little wonder that canonical mode does not work as expected.
> > > > 
> > > > Try init=/bin/sh, from that shell run stty iutf8. Then things will work
> > > > fine.  The fix is thus just to make the VT's tty initial iutf8 setup
> > > > follow vt.default_utf8.
> > > 
> > > Will try that on a more recent install. My stty does not support
> > > this option. Your analysis makes quite a lot of sense, and such a fix
> > > would wipe part of my annoyances/anger with this recent change.
> > 
> > Can you give the patch below a try?
> > Dynamic per-VT utf-8 switch should also work, provided that you reopen
> > the VT (i.e. log out).
> 
> Willy Tarreau, le Sun 04 May 2008 10:55:14 +0200, a écrit :
> > I confirm that your patch works perfectly for me. Now backspace correctly
> > removes multi-byte characters. My bash is still fooled, though; as Alan
> > explained, it's readline that has to be upgraded now.
> 
> I guess this is suitable for the stable trees of 2.6.24 and 2.6.25
> (where UTF-8 is now the default).
> 
> 
> 
> 
> Set IUTF8 as appropriate on VT tty open.
> 
> Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>

That changelog is pretty darn terse :(  I'll often go through
the email ladder and try to extract the missing information
but this time I don't really see it there.

Things like: what is the kernel's current behaviour, why does
it behave that way, how does the patch fix it?

Thanks.


* Re: Fix VT canonical input in UTF-8 mode [Was: UTF-8 fixes in comments]
  2008-05-05 23:00       ` Andrew Morton
@ 2008-05-05 23:54         ` Samuel Thibault
  0 siblings, 0 replies; 48+ messages in thread
From: Samuel Thibault @ 2008-05-05 23:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: w, linux-kernel, stable

Andrew Morton, le Mon 05 May 2008 16:00:44 -0700, a écrit :
> > Set IUTF8 as appropriate on VT tty open.
> > 
> > Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
> 
> That changelog is pretty darn terse :( 

Erf, sorry.

> I'll often go through
> the email ladder and try to extract the missing information
> but this time I don't really see it there.
> 
> Things like: what is the kernel's current behaviour, why does
> it behave that way, how does the patch fix it?

Well, it's more an implementation than a fix. Let's try again:



For proper TTY canonical-mode handling (e.g. erasing a multi-byte
character with a single backspace), the IUTF8 termios flag has to be set
as appropriate.  Until now, Linux did not set that flag for VT TTYs.

This patch sets or clears IUTF8 on VT tty open according to the current
mode of the VT, and sets the default value according to the
vt.default_utf8 parameter.



Samuel
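
As a companion to the changelog above, the following is a hedged
userspace sketch of how a program could inspect or toggle the same flag
on an already-open tty, roughly what "stty -a" and "stty iutf8" do in the
thread. The function names tty_get_iutf8 and tty_set_iutf8 are
hypothetical; only tcgetattr(), tcsetattr() and the Linux IUTF8 input
flag are taken as given.

```c
#include <stdbool.h>
#include <termios.h>

/* Return 1 if IUTF8 is set on fd, 0 if it is clear, -1 on error
 * (for example when fd does not refer to a terminal). */
static int tty_get_iutf8(int fd)
{
	struct termios tio;

	if (tcgetattr(fd, &tio) < 0)
		return -1;
	return (tio.c_iflag & IUTF8) ? 1 : 0;
}

/* Set or clear IUTF8 on fd; returns 0 on success, -1 on error. */
static int tty_set_iutf8(int fd, bool on)
{
	struct termios tio;

	if (tcgetattr(fd, &tio) < 0)
		return -1;
	if (on)
		tio.c_iflag |= IUTF8;
	else
		tio.c_iflag &= ~IUTF8;
	return tcsetattr(fd, TCSANOW, &tio);
}
```

With the patch applied, calling tty_get_iutf8() on a freshly opened VT
would be expected to reflect that VT's UTF-8 mode without any manual
stty step; on an unpatched kernel the flag stays clear until userspace
sets it.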


* Re: [2.6 patch] UTF-8 fixes in comments
  2008-04-29  5:06     ` Willy Tarreau
  2008-04-29  6:04       ` H. Peter Anvin
  2008-04-29  7:29       ` Adrian Bunk
@ 2008-05-09 12:48       ` David Kågedal
  2 siblings, 0 replies; 48+ messages in thread
From: David Kågedal @ 2008-05-09 12:48 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: H. Peter Anvin, Adrian Bunk, linux-kernel, trivial

Willy Tarreau <w@1wt.eu> writes:

> And do we really consider that people's names in *comments* cannot
> be converted to pure ASCII ? I'm western european and have always
> been against accents in comments (another reason to write comments
> in english BTW). Unix and internet have lived without accents for
> almost 30 years without anyone really bothering. 

That's a ridiculous statement.  Just because you didn't bother, you
can't assume that the people who were actually affected didn't bother.

I went through large parts of the 1990s under the name "David
K}gedal". And I bothered.

And no, the second character in my last name is not an accented a;
they have been separate letters for hundreds of years in Sweden.  So I
can live without using accented letters, as long as I can write
Kågedal including the å. :-)

Not that my name appears anywhere in the Linux source, but I still
felt the urge to reply...

-- 
David Kågedal


Thread overview: 48+ messages
2008-04-30  0:08 [2.6 patch] UTF-8 fixes in comments Samuel Thibault
2008-04-30  3:38 ` Chris Adams
2008-04-30  9:38 ` Samuel Thibault
2008-04-30 19:45   ` Willy Tarreau
2008-04-30 19:49 ` Willy Tarreau
2008-05-03 23:50   ` Samuel Thibault
2008-05-04  8:55     ` Willy Tarreau
2008-05-04 10:25     ` Fix VT canonical input in UTF-8 mode [Was: UTF-8 fixes in comments] Samuel Thibault
2008-05-04 11:03       ` Willy Tarreau
2008-05-05 23:00       ` Andrew Morton
2008-05-05 23:54         ` Samuel Thibault
  -- strict thread matches above, loose matches on Subject: below --
2008-04-28 15:40 [2.6 patch] UTF-8 fixes in comments Adrian Bunk
2008-04-28 23:05 ` Willy Tarreau
2008-04-29  1:29   ` H. Peter Anvin
2008-04-29  5:06     ` Willy Tarreau
2008-04-29  6:04       ` H. Peter Anvin
2008-04-29  7:29       ` Adrian Bunk
2008-04-29  8:14         ` Willy Tarreau
2008-04-29  9:06           ` Helge Hafting
2008-04-29  9:33             ` Alan Cox
2008-04-29 10:09             ` Willy Tarreau
2008-04-29 10:10               ` Alan Cox
2008-04-29 10:33                 ` Willy Tarreau
2008-04-29 10:34                   ` Alan Cox
2008-04-29 22:12                     ` Willy Tarreau
2008-04-29 22:15                       ` Alan Cox
2008-04-29 23:05                         ` Willy Tarreau
2008-05-01 20:18                           ` H. Peter Anvin
2008-05-01  9:46                   ` Alexander E. Patrakov
2008-04-29 19:33                 ` H. Peter Anvin
2008-04-29 10:42               ` Adrian Bunk
2008-04-29 11:06                 ` Willy Tarreau
2008-04-29 11:27                   ` Adrian Bunk
2008-04-29 11:32                     ` Adrian Bunk
2008-04-29 20:18                       ` Jeremy Fitzhardinge
2008-04-30  9:15               ` Helge Hafting
2008-04-30 19:22                 ` Adrian Bunk
2008-04-30 19:42                 ` H. Peter Anvin
2008-04-29  9:43           ` Adrian Bunk
2008-04-29 19:31           ` H. Peter Anvin
2008-04-29 20:05             ` Willy Tarreau
2008-04-29 20:09               ` H. Peter Anvin
2008-05-09 12:48       ` David Kågedal
2008-04-29  9:01   ` Alan Cox
2008-04-29  9:19     ` Jan Engelhardt
2008-04-29  9:34     ` Willy Tarreau
2008-04-29  9:41       ` Alan Cox
2008-04-29 12:18 ` KOSAKI Motohiro
