Re: [PATCH] console UTF-8 fixes

public inbox for linux-kernel@vger.kernel.org

From: "H. Peter Anvin" <hpa@zytor.com>
To: Egmont Koblinger <egmont@uhulinux.hu>
Cc: linux-kernel@vger.kernel.org
Subject: Re: [PATCH] console UTF-8 fixes
Date: Fri, 06 Apr 2007 12:43:03 -0700
Message-ID: <4616A2C7.3030000@zytor.com>
In-Reply-To: <20070406191245.GA11974@uhulinux.hu>

Egmont Koblinger wrote:
> 
> - If a certain (otherwise valid UTF-8) character is not found in the glyph
>   table, the current code does one of these two (depending on other
>   circumstances):
> 
>   - Either it displays the replacement character U+FFFD, falling back to a
>     simple question mark. Note that the Unicode replacement character U+FFFD
>     is to be used for invalid sequences. However, it shouldn't necessarily
>     be used when replacing a valid but undisplayable character. Think of
>     Pango for example that renders these as four hex digits inside a square.
>     To be able to visually distinguish between illegal sequences and legal
>     but undisplayable characters, I think U+FFFD or the question mark are
>     bad choices. In fact, any symbol that may normally occur in the text is
>     a bad choice if it is displayed simply. Hence I chose to display an
>     inverted dot.
> 

I strongly disagree.  First of all, you're changing the semantics of a 
13-year-old API.  The semantics of the Linux console is that by 
specifying U+FFFD SUBSTITUTION GLYPH in your unicode table, you have 
specified the fallback glyph.

What's worse, you've hard-coded the uses of specific visual 
representations.  That is completely unacceptable.

>   - Another possible thing the current code may do (for latin1-compatible
>     characters) is to simply display the glyph loaded in that position.
>     Suppose I have loaded a latin2 font. In latin2, 0xFB is a "u with
>     double accent". An application prints U+00FB, which is a "u with
>     circumflex". Since this glyph is not present in latin2, it cannot be
>     printed with the current font. Still, the current code falls back to
>     printing the glyph from the 0xFB position of the glyph table. Hence my
>     app asked to print "u with circumflex" but a "u with double accent"
>     appears on the screen. This is totally contrary to the goals of Unicode
>     and shouldn't ever happen.

When does that happen?  That is clearly a bug.

> - The replacement character for invalid UTF-8 sequences is U+FFFD, falling
>   back to a question mark. I've changed the fallback version to an inverted
>   question mark. This way it's more similar to the common glyph of U+FFFD,
>   and it's more obvious to the user that it's not a literal question mark
>   but rather some erroneous situation.

Brilliant.  You've picked a fallback glyph which is unlikely to exist in 
all fonts.  The whole point of falling back to ? is that it's an ASCII 
character, which means that if the font designer failed to designate a 
fallback glyph -- which is an error!!! -- there is at least some hope of 
conveying the error back to the user.
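
[Illustration only, not code from the kernel or from the patch under
review.  The fallback order argued for above (the requested character,
then whatever the font maps to U+FFFD, then ASCII '?') could be sketched
roughly as below.  struct uni_map_entry, lookup_glyph() and
glyph_with_fallback() are hypothetical names, and a linear table stands
in for whatever lookup structure the console actually uses.]

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical font mapping entry: Unicode code point -> glyph index. */
struct uni_map_entry {
	uint32_t ucs;
	int glyph;
};

/* Linear search; returns the glyph index, or -1 if the font has no mapping. */
static int lookup_glyph(const struct uni_map_entry *map, size_t n, uint32_t ucs)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (map[i].ucs == ucs)
			return map[i].glyph;
	return -1;
}

/*
 * Try the character itself, then the font's U+FFFD entry, then ASCII '?'.
 * Returns -1 only if all three lookups fail.
 */
static int glyph_with_fallback(const struct uni_map_entry *map, size_t n,
			       uint32_t ucs)
{
	int g = lookup_glyph(map, n, ucs);

	if (g < 0)
		g = lookup_glyph(map, n, 0xFFFD);
	if (g < 0)
		g = lookup_glyph(map, n, '?');
	return g;
}
```

[In this sketch the font, not the code, decides what the fallback looks
like: a font that maps U+FFFD gets its own glyph, and one that does not
still ends up with a plain ASCII question mark.]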

> - Overlong sequences are not caught currently, they're displayed as if these
>   were valid representations. This may even have security impacts.
> 
> - Lone continuation bytes (section 3.1 of the UTF-8 stress test) are
>   currently displayed as some "random" glyphs rather than the replacement
>   character.
> 
> - Incomplete sequences (sections 3.2 and 3.3) emit no replacement character,
>   but rather cause the subsequent valid character to be displayed multiple
>   times(!).

These are valid issues.
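
[Illustration only, not the patch under review.  A decoder that rejects
the three cases quoted above (overlong forms, lone continuation bytes,
incomplete sequences) might look roughly like this.  The name
utf8_decode_strict() and the choice to emit a single U+FFFD per malformed
sequence are assumptions of this sketch.  It works on a complete buffer;
a real console would also have to carry state across write() calls.]

```c
#include <stddef.h>
#include <stdint.h>

#define UCS_REPLACEMENT 0xFFFDu

/*
 * Decode one code point from buf[0..len).  Stores the result in *out and
 * returns the number of bytes consumed (0 only when len == 0).
 * Malformed input yields U+FFFD in *out.
 */
static size_t utf8_decode_strict(const unsigned char *buf, size_t len,
				 uint32_t *out)
{
	/* Smallest code point that legitimately needs 2, 3 or 4 bytes. */
	static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
	size_t need, i;
	uint32_t cp;

	*out = UCS_REPLACEMENT;
	if (len == 0)
		return 0;

	if (buf[0] < 0x80) {			/* plain ASCII */
		*out = buf[0];
		return 1;
	}
	if (buf[0] < 0xC0)			/* lone continuation byte */
		return 1;
	if (buf[0] < 0xE0) {
		need = 2;
		cp = buf[0] & 0x1F;
	} else if (buf[0] < 0xF0) {
		need = 3;
		cp = buf[0] & 0x0F;
	} else if (buf[0] < 0xF8) {
		need = 4;
		cp = buf[0] & 0x07;
	} else {
		return 1;			/* 0xF8..0xFF never start a sequence */
	}

	for (i = 1; i < need; i++) {
		/*
		 * Incomplete sequence: consume only the bytes seen so far,
		 * so the byte that broke the sequence is decoded on its own.
		 */
		if (i >= len || (buf[i] & 0xC0) != 0x80)
			return i;
		cp = (cp << 6) | (buf[i] & 0x3F);
	}

	if (cp < min_for_len[need])		/* overlong encoding */
		return need;
	if (cp >= 0xD800 && cp <= 0xDFFF)	/* UTF-16 surrogate */
		return need;
	if (cp > 0x10FFFF)			/* beyond the Unicode range */
		return need;

	*out = cp;
	return need;
}
```

[With this, the overlong pair 0xC0 0xAF comes out as U+FFFD instead of
'/', and a truncated sequence followed by 'A' yields U+FFFD and then a
single 'A', rather than repeating the following character.]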

> - There's no concept of double-width characters. It's way beyond the scope
>   of my patch to try to display them, but at least I think it's important
>   for the cursor to jump two positions when printing such characters, since
>   this is what applications (such as text editors) expect. If the cursor
>   didn't jump two positions, applications would suffer from displaying and
>   refreshing problems, and editing some English letters that are preceded by
>   some CJK characters in the same line would become a nightmare. With my patch an
>   inverted dot followed by an inverted space is displayed for double-width
>   characters so it's quite easy to see that they are tied together.

To be able to do CJK you need something like Kon anyway.  This feels 
like bloat.

> - There's no concept of zero-width characters (such as combining accents)
>   either. Yet again it's beyond the scope of my patch to properly handle
>   them. Instead of the current behavior (write a replacement character) I
>   just ignore them so that full-screen applications can keep track of the
>   cursor position correctly.

There is a concept of combining sequences.  For anything else, I suspect 
it's better to let the user know that something bad is happening.
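
[Illustration only.  The cursor bookkeeping at issue in the last two
quoted points (advance two columns for a double-width character, none
for a combining mark) could be sketched as below.  ucs_width_sketch()
and advance_cursor() are hypothetical names, and the ranges cover only
a few well-known blocks; this is not a complete width table and not the
patch's own code.]

```c
#include <stdint.h>

/* Rough column width of a code point: 0, 1 or 2.  Deliberately incomplete. */
static int ucs_width_sketch(uint32_t ucs)
{
	/* Combining Diacritical Marks: occupy no column of their own. */
	if (ucs >= 0x0300 && ucs <= 0x036F)
		return 0;
	/* CJK Unified Ideographs, Hangul Syllables, fullwidth ASCII forms. */
	if ((ucs >= 0x4E00 && ucs <= 0x9FFF) ||
	    (ucs >= 0xAC00 && ucs <= 0xD7A3) ||
	    (ucs >= 0xFF01 && ucs <= 0xFF60))
		return 2;
	return 1;
}

/* New cursor column after printing ucs at col, clamped to the line width. */
static int advance_cursor(int col, int ncols, uint32_t ucs)
{
	int next = col + ucs_width_sketch(ucs);

	return next > ncols ? ncols : next;
}
```

[What actually gets drawn in those columns is a separate question from
how far the cursor moves, which is the part full-screen applications
depend on.]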

> - I believe (at least I do hope) that my code is cleaner, more
>   straightforward, easier to understand, and is slightly better documented
>   than the current version. The current code doesn't separate UTF-8 decoding
>   and glyph displaying parts. I clearly separated them. First I perform
>   UTF-8 decoding (this emits U+FFFD for invalid sequences), then check for
>   the width of the resulting character, change it to U+FFFD if it's
>   unprintable (e.g. a UTF-16 surrogate), and finally comes the part that
>   does its best in displaying the character on the screen.
> 
> I hope you like it. :)

Please see above comments.

	-hpa
