Re: [PATCH] console UTF-8 fixes

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "H. Peter Anvin" <hpa@zytor.com>
To: Egmont Koblinger <egmont@uhulinux.hu>
Cc: linux-kernel@vger.kernel.org
Subject: Re: [PATCH] console UTF-8 fixes
Date: Fri, 06 Apr 2007 12:43:03 -0700	[thread overview]
Message-ID: <4616A2C7.3030000@zytor.com> (raw)
In-Reply-To: <20070406191245.GA11974@uhulinux.hu>

Egmont Koblinger wrote:
> 
> - If a certain (otherwise valid UTF-8) character is not found in the glyph
>   table, the current code does one of these two (depending on other
>   circumstances):
> 
>   - Either it displays the replacement character U+FFFD, falling back to a
>     simple question mark. Note that the Unicode replacement character U+FFFD
>     is to be used for invalid sequences. However, it shouldn't necessarily
>     be used when replacing a valid but undisplayable character. Think of
>     Pango for example that renders these as four hex digits inside a square.
>     To be able to visually distinguish between illegal sequences and legal
>     but undisplayable characters, I think U+FFFD or the question mark are
>     bad choices. In fact, any symbol that may normally occur in the text is
>     a bad choice if is displayed simply. Hence I chose to display an
>     inverted dot.
> 

I strongly disagree.  First of all, you're changing the semantics of a 
13-year-old API.  The semantics of the Linux console is that by 
specifying U+FFFD SUBSTITUTION GLYPH in your unicode table, you have 
specified the fallback glyph.

What's worse, you've hard-coded the uses of specific visual 
representations.  That is completely unacceptable.

>   - Another possible thing the current code may do (for latin1-compatible
>     characters) is to simply display the glyph loaded in that position.
>     Suppose I have loaded a latin2 font. In latin2, 0xFB is an "u with
>     double accent". An applications prints U+00FB, which is an "u with
>     circumflex". Since this glyph is not present in latin2, it cannot be
>     printed with the current font. Still, the current code falls back to
>     printing the glyph from the 0xFB position of the glyph table. Hence my
>     app asked to print "u with circumflex" but an "u with double accent"
>     appears on the screen. This is totally contrary to the goals of Unicode
>     and shouldn't ever happen.

When does that happen?  That is clearly a bug.

> - The replacement character for invalid UTF-8 sequences is U+FFFD, falling
>   back to a question mark. I've changed the fallback version to an inverted
>   question mark. This way it's more similar to the common glyph of U+FFFD,
>   and it's more trivial to the user that it's not a literal question mark
>   but rather some erroneous situation.

Brilliant.  You've picked a fallback glyph which is unlikely to exist in 
all fonts.  The whole point of falling back to ? is that it's an ASCII 
character, which means that if the font designer failed to designate a 
fallback glyph -- which is an error!!! -- there is at least some hope of 
conveying the error back to the user.

> - Overlong sequences are not caught currently, they're displayed as if these
>   were valid representations. This may even have security impacts.
> 
> - Lone continuation bytes (section 3.1 of the UTF-8 stress test) are
>   currently displayed as some "random" glyphs rather than the replacement
>   character.
> 
> - Incomplete sequences (sections 3.2 and 3.3) emit no replacement character,
>   but rather cause the subsequent valid character to be displayed more
>   times(!).

These are valid issues.

> - There's no concept of double-width characters. It's way beyond the scope
>   of my patch to try to display them, but at least I think it's important
>   for the cursor to jump two positions when printing such characters, since
>   this is what applications (such as text editors) expect. If the cursor
>   didn't jump two positions, applications would suffer from displaying and
>   refreshing problems, and editing some English letters that are preceded by
>   some CJK characters in the same line became a nightmare. With my patch an
>   inverted dot followed by an inverted space is displayed for double-width
>   characters so it's quite easy to see that they are tied together.

To be able to do CJK you need something like Kon anyway.  This feels 
like bloat.

> - There's no concept of zero-width characters (such as combining accents)
>   either. Yet again it's beyond the scope of my patch to properly handle
>   them. Instead of the current behavior (write a replacement character) I
>   just ignore them so that full-screen applications can keep track of the
>   cursor position correctly.

There is a concept of combining sequences.  Anything else, I suspect 
it's better to let the user know that something bad is happening.

> - I believe (at least I do hope) that my code is cleaner, more
>   straightforward, easier to understand, and is slightly better documented
>   than the current version. The current code doesn't separate UTF-8 decoding
>   and glyph displaying parts. I clearly separated them. First I perform
>   UTF-8 decoding (this emits U+FFFD for invalid sequences), then check for
>   the width of the resulting character, change it to U+FFFD if it's
>   unprintable (e.g. an UTF-16 surrogate), and finally comes the part that
>   does its best in displaying the character on the screen.
> 
> I hope you like it. :)

Please see above comments.

	-hpa

next prev parent reply	other threads:[~2007-04-06 19:43 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-04-06 19:12 [PATCH] console UTF-8 fixes Egmont Koblinger
2007-04-06 19:43 ` H. Peter Anvin [this message]
2007-04-07  9:24   ` Egmont Koblinger
2007-04-07 11:00     ` Jan Engelhardt
2007-04-07 17:26       ` Egmont Koblinger
2007-04-07 17:59         ` H. Peter Anvin
2007-04-10  9:43           ` Egmont Koblinger
2007-04-10 15:43             ` H. Peter Anvin
2007-04-10 17:19               ` Egmont Koblinger
2007-04-10 17:30                 ` H. Peter Anvin
2007-04-10 18:51                   ` Egmont Koblinger
2007-04-11 12:58                     ` Jan Engelhardt
2007-04-10 17:36                 ` Alan Cox
2007-04-10 17:36                   ` H. Peter Anvin
2007-04-11 18:28                   ` Egmont Koblinger
2007-04-11 18:36                     ` H. Peter Anvin
2007-04-12  9:11                       ` Egmont Koblinger
2007-04-12 15:36                         ` H. Peter Anvin
2007-04-12 16:41                           ` Jan Engelhardt
2007-04-12 16:55                             ` Egmont Koblinger
2007-04-12 16:58                               ` H. Peter Anvin
2007-04-12 17:16                                 ` Egmont Koblinger
2007-04-12 17:35                                   ` H. Peter Anvin
2007-04-12 17:44                                     ` Egmont Koblinger
2007-04-12 17:49                                       ` H. Peter Anvin
2007-04-12 18:46                               ` Jan Engelhardt
2007-04-12 12:54                       ` Egmont Koblinger
2007-04-12 13:13                         ` Alan Cox
2007-04-12 14:06                           ` Egmont Koblinger
2007-04-12 14:38                         ` Roman Zippel
2007-04-12 14:58                           ` Egmont Koblinger
2007-04-12 15:52                             ` Roman Zippel
2007-04-12 16:36                               ` Egmont Koblinger
2007-04-12 18:09                                 ` Roman Zippel
2007-04-11 19:00                     ` Jan Engelhardt
2007-04-12  9:22                       ` Egmont Koblinger
2007-04-11 19:36 ` Pavel Machek
2007-04-12  8:14   ` Jan Engelhardt
  -- strict thread matches above, loose matches on Subject: below --
2007-04-17 10:22 Egmont Koblinger
2007-06-19 12:13 ` Egmont Koblinger
     [not found] <8aT6Q-3iM-17@gated-at.bofh.it>
     [not found] ` <8xLa7-25v-5@gated-at.bofh.it>
2007-06-19 13:54   ` Bodo Eggert
2007-06-19 14:42     ` Egmont Koblinger
2007-06-19 17:10       ` Bodo Eggert

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4616A2C7.3030000@zytor.com \
    --to=hpa@zytor.com \
    --cc=egmont@uhulinux.hu \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.