Re: UTF-8 practically vs. theoretically in the VFS API

From: ebiederm@xmission.com (Eric W. Biederman)
To: Jamie Lokier <jamie@shareable.org>
Cc: "Alex Belits" <abelits@phobos.illtel.denver.co.us>,
	"Måns Rullgård" <mru@kth.se>,
	linux-kernel@vger.kernel.org
Subject: Re: UTF-8 practically vs. theoretically in the VFS API
Date: 22 Feb 2004 14:53:09 -0700
Message-ID: <m1ad3akchm.fsf@ebiederm.dsl.xmission.com>
In-Reply-To: <20040222162848.GC25664@mail.shareable.org>

Jamie Lokier <jamie@shareable.org> writes:

> Eric W. Biederman wrote:
> > I guess my question is when do we know the information is going to
> > a terminal so we should translate it?
> 
> When a program is writing to a terminal device, then we know it's
> going to a terminal _or_ to a program which is pretending to be one
> (pseudo-terminal).  Either way, the behaviour should be the same
> 
> The "screen" program can be used to do translation, although it's a
> rather cumbersome way to go about it, and it has other effects which
> are annoying (at least one key is always designated for "screen" commands).

Right.  At this point I am not worried about temporary solutions.  I
want to pin down how things should be implemented, so that the user
space programs can be fixed.  Pardon me while I think aloud to frame
the problem.

First it is worth noting that the existing practice is that ttys
always use the character set encoding of the user.  Even X cut and
paste frequently abuses the iso8859-1 range, using the native
character set encoding instead of iso8859-1.

Now the work is how to get multiple locales to play nicely with each
other.  utf-8 and unicode are convenient for that as they preserve the
existing assumptions that terminals, filenames, and text files are
all using the same character set encoding, even when multiple locales
are involved.

So within one machine utf-8 solves the multiple locale problem.  The
problem has now moved to interoperability between machines.  Since
multiple machines have different upgrade cycles, and are in different
administrative domains, not everyone moves to utf-8 at the same
time.

When we add the assertion that all I/O going through a terminal device
is in the native locale we break 8bit transparency.  This holds true
even in some instances when both sides use the same character set
encoding, such as utf8.
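
To illustrate the last point, here is a rough Python sketch.  The
function tty_filter() is hypothetical: it stands in for any terminal
layer that insists its data is text in one encoding, and it is an
assumption of this example that such a layer replaces bytes it cannot
decode.  Valid utf8 text passes through unchanged, arbitrary binary
data does not.

```python
# Sketch only: tty_filter() is a hypothetical stand-in for a terminal
# layer that treats everything passing through it as encoded text.

def tty_filter(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Decode bytes as text in `encoding`, then re-encode them.

    Undecodable bytes are replaced with U+FFFD, which is an assumption
    about how such a layer would behave, not a statement about any
    existing tty driver.
    """
    text = raw.decode(encoding, errors="replace")
    return text.encode(encoding)


valid = "Måns".encode("utf-8")           # well-formed UTF-8 text
binary = bytes([0x00, 0x7F, 0x80, 0xFF])  # arbitrary 8-bit data

print(tty_filter(valid) == valid)    # text survives the round trip
print(tty_filter(binary) == binary)  # binary data is altered
```

So both ends agreeing on utf8 is not enough to make the channel safe
for data that is not text.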

There are some mitigating factors to this.  ssh already documents
pseudo ttys as potentially breaking 8 bit transparency.  And
applications that require ttys for stdin/stdout are most likely
interactive.  Interactive programs are either character based, or
broken.

Being an unclean channel for pipes will affect at least XMODEM,
YMODEM, and ZMODEM protocols, and possibly ppp.  These programs
already know how to avoid problem characters, and because ascii is a
common subset of most character set encodings, the effect should be no
worse than a line that is not 8 bit clean.
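
The "common subset" claim can be checked with a small Python sketch.
The list of encodings and the helper name are choices made for this
example only; the encodings are a sample, not an exhaustive survey.

```python
# Sketch: check whether a string encodes to the same bytes in several
# character set encodings.  The encoding list is an arbitrary sample.

SAMPLE_ENCODINGS = ["ascii", "utf-8", "iso8859-1", "koi8-r", "iso8859-5"]


def encodes_identically(text: str, encodings=SAMPLE_ENCODINGS) -> bool:
    """Return True if `text` yields identical bytes in every encoding."""
    return len({text.encode(enc) for enc in encodings}) == 1


# Pure ASCII data looks the same on the wire in all of them.
print(encodes_identically("ATDT 5551212\r\n"))

# A non-ASCII character already differs between utf-8 and iso8859-1.
print(encodes_identically("é", ["utf-8", "iso8859-1"]))
```

A protocol that restricts itself to ascii therefore sees the same
bytes regardless of which of these encodings each side assumes.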

ssh at least has explicit options to allocate or not allocate a
pseudo-tty, so getting an 8 bit clean data path is not a problem with
ssh.

The rule ``All data passing through a pseudo-tty is in the
character set encoding specified by the locale of the owner of the
tty'' seems both reasonable and no significant change from the current
status quo.

Now how does this get implemented?

On the wire between two machines I recommend passing unicode
characters.  Unicode guarantees no round trip loss for any of its
member character sets, and it reduces everything to one set of
translation tables.
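
A minimal Python sketch of that idea follows.  It assumes utf-8 as
the on-the-wire form of unicode and picks koi8-r and iso8859-5 as the
two local encodings; both choices, and the helper names to_wire() and
from_wire(), are illustrative only.

```python
# Sketch: two machines with different local encodings exchange text by
# converting to Unicode on the wire.  UTF-8 as the wire format and the
# two Cyrillic encodings are assumptions made for this example.

def to_wire(local_bytes: bytes, local_encoding: str) -> bytes:
    """Convert bytes in the local encoding to the wire form."""
    return local_bytes.decode(local_encoding).encode("utf-8")


def from_wire(wire_bytes: bytes, local_encoding: str) -> bytes:
    """Convert bytes in the wire form to the local encoding."""
    return wire_bytes.decode("utf-8").encode(local_encoding)


sender_enc, receiver_enc = "koi8-r", "iso8859-5"

original = "привет".encode(sender_enc)
received = from_wire(to_wire(original, sender_enc), receiver_enc)
echoed = from_wire(to_wire(received, receiver_enc), sender_enc)

print(received != original)  # the local byte values differ
print(echoed == original)    # but the round trip loses nothing
```

Each side needs only the table between its own encoding and unicode,
never a table for the encoding used by the other side.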

By convention glibc stores unicode values in wchar_t.  mbstowcs will
convert multibyte strings to internal wide characters, based on the
current locale.  wcstombs will do the opposite.  So going between
unicode and the character set encoding of the current locale is
straightforward.
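
For readers who want to experiment without writing C, a rough Python
analogue is below.  It does not call the glibc functions; it asks the
locale module for the current locale's encoding and uses Python's own
codecs.  The helper names are made up for this sketch.

```python
# Sketch: a Python analogue of converting between the current locale's
# encoding and Unicode.  This uses Python codecs, not the C library
# conversion functions, and the helper names are hypothetical.
import locale


def current_encoding() -> str:
    """Return the encoding name of the current locale."""
    return locale.getpreferredencoding(False)


def locale_to_unicode(raw: bytes, encoding: str = None) -> str:
    """Multibyte string in the locale encoding -> Unicode string."""
    return raw.decode(encoding or current_encoding())


def unicode_to_locale(text: str, encoding: str = None) -> bytes:
    """Unicode string -> multibyte string in the locale encoding."""
    return text.encode(encoding or current_encoding())


# An explicit encoding is passed here so the example does not depend
# on how the machine running it is configured.
sample = "naïve café"
raw = unicode_to_locale(sample, "iso8859-1")
print(locale_to_unicode(raw, "iso8859-1") == sample)
```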

How do we convert the applications?

There are only four cases I can think of where we connect to a remote
system with terminal semantics.
1) Directly connected serial terminals.
2) telnetd
3) rshd
4) sshd

To my knowledge all of their protocols just pass through characters
and are neutral.  So changing these feels like a protocol extension,
ouch!  Those are the programs that bridge multiple administrative
domains, and they do deal with pseudo ttys so they are where something
needs to happen, to support different character set encodings on
different machines. 

If everyone just switches over to using utf-8 even the above cases are
fine.  So if there is a reasonable expectation that everyone will
change to using utf-8 in the near future even those programs don't
need to change.

Given the delay in changing protocols I propose 2 simple programs.
sh-utf8 and utf8-tty.  The first runs a command converting stdout and
stderr from utf8 to the current locale, and converting stdin into
utf8.  The second creates a pseudo tty and relays to its controlling
tty, assuming the controlling tty uses utf8 and its tty uses the
current locale.
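
The conversion loop at the heart of such a program might look like
the Python sketch below.  It covers only the stream transcoding, not
running a command or creating a pseudo tty, and the function name and
parameters are invented for illustration.  The one detail worth
showing is that a multibyte character can be split across two reads,
so the decoder has to keep state between chunks.

```python
# Sketch: relay bytes from one stream to another while converting the
# character set encoding.  transcode_stream() is a hypothetical helper;
# it is not taken from sh-utf8, utf8-tty, or TTYConv.
import codecs
import io


def transcode_stream(src, dst, from_enc, to_enc, chunk_size=4096):
    """Copy src to dst, converting from_enc to to_enc chunk by chunk."""
    # Incremental codecs buffer partial multibyte sequences between
    # calls, so a character split across two reads is still decoded.
    decoder = codecs.getincrementaldecoder(from_enc)(errors="replace")
    encoder = codecs.getincrementalencoder(to_enc)(errors="replace")
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(encoder.encode(decoder.decode(chunk)))
    # Flush anything still buffered at end of input.
    tail = decoder.decode(b"", final=True)
    dst.write(encoder.encode(tail, final=True))


# chunk_size=1 forces every 2-byte UTF-8 character to be split.
src = io.BytesIO("Måns Rullgård".encode("utf-8"))
dst = io.BytesIO()
transcode_stream(src, dst, "utf-8", "iso8859-1", chunk_size=1)
print(dst.getvalue() == "Måns Rullgård".encode("iso8859-1"))
```

Characters that do not exist in the target encoding are replaced
here; whether to replace, drop, or escape them is a policy choice the
real programs would have to make.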

Looking around there already is a TTYConv program that seems to fill
this niche, except you must specify the character set encodings
manually.
http://bedroomlan.dyndns.org/~alexios/coding_ttyconv.html

Comments?

Eric
