From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S932309AbXDFTnL (ORCPT ); Fri, 6 Apr 2007 15:43:11 -0400
Received: (majordomo@vger.kernel.org) by vger.kernel.org
	id S932319AbXDFTnL (ORCPT ); Fri, 6 Apr 2007 15:43:11 -0400
Received: from terminus.zytor.com ([192.83.249.54]:35063 "EHLO
	terminus.zytor.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S932309AbXDFTnJ (ORCPT ); Fri, 6 Apr 2007 15:43:09 -0400
Message-ID: <4616A2C7.3030000@zytor.com>
Date: Fri, 06 Apr 2007 12:43:03 -0700
From: "H. Peter Anvin"
User-Agent: Thunderbird 1.5.0.10 (X11/20070302)
MIME-Version: 1.0
To: Egmont Koblinger
CC: linux-kernel@vger.kernel.org
Subject: Re: [PATCH] console UTF-8 fixes
References: <20070406191245.GA11974@uhulinux.hu>
In-Reply-To: <20070406191245.GA11974@uhulinux.hu>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

Egmont Koblinger wrote:
>
> - If a certain (otherwise valid UTF-8) character is not found in the glyph
>   table, the current code does one of these two (depending on other
>   circumstances):
>
>   - Either it displays the replacement character U+FFFD, falling back to a
>     simple question mark. Note that the Unicode replacement character U+FFFD
>     is to be used for invalid sequences. However, it shouldn't necessarily
>     be used when replacing a valid but undisplayable character. Think of
>     Pango for example that renders these as four hex digits inside a square.
>     To be able to visually distinguish between illegal sequences and legal
>     but undisplayable characters, I think U+FFFD or the question mark are
>     bad choices. In fact, any symbol that may normally occur in the text is
>     a bad choice if is displayed simply. Hence I chose to display an
>     inverted dot.
>

I strongly disagree. First of all, you're changing the semantics of a
13-year-old API.
The semantics of the Linux console is that by specifying U+FFFD
SUBSTITUTION GLYPH in your unicode table, you have specified the
fallback glyph.

What's worse, you've hard-coded the uses of specific visual
representations. That is completely unacceptable.

>   - Another possible thing the current code may do (for latin1-compatible
>     characters) is to simply display the glyph loaded in that position.
>     Suppose I have loaded a latin2 font. In latin2, 0xFB is an "u with
>     double accent". An applications prints U+00FB, which is an "u with
>     circumflex". Since this glyph is not present in latin2, it cannot be
>     printed with the current font. Still, the current code falls back to
>     printing the glyph from the 0xFB position of the glyph table. Hence my
>     app asked to print "u with circumflex" but an "u with double accent"
>     appears on the screen. This is totally contrary to the goals of Unicode
>     and shouldn't ever happen.

When does that happen? That is clearly a bug.

> - The replacement character for invalid UTF-8 sequences is U+FFFD, falling
>   back to a question mark. I've changed the fallback version to an inverted
>   question mark. This way it's more similar to the common glyph of U+FFFD,
>   and it's more trivial to the user that it's not a literal question mark
>   but rather some erroneous situation.

Brilliant. You've picked a fallback glyph which is unlikely to exist in
all fonts. The whole point of falling back to ? is that it's an ASCII
character, which means that if the font designer failed to designate a
fallback glyph -- which is an error!!! -- there is at least some hope of
conveying the error back to the user.

> - Overlong sequences are not caught currently, they're displayed as if these
>   were valid representations. This may even have security impacts.
>
> - Lone continuation bytes (section 3.1 of the UTF-8 stress test) are
>   currently displayed as some "random" glyphs rather than the replacement
>   character.
>
> - Incomplete sequences (sections 3.2 and 3.3) emit no replacement character,
>   but rather cause the subsequent valid character to be displayed more
>   times(!).

These are valid issues.

> - There's no concept of double-width characters. It's way beyond the scope
>   of my patch to try to display them, but at least I think it's important
>   for the cursor to jump two positions when printing such characters, since
>   this is what applications (such as text editors) expect. If the cursor
>   didn't jump two positions, applications would suffer from displaying and
>   refreshing problems, and editing some English letters that are preceded by
>   some CJK characters in the same line became a nightmare. With my patch an
>   inverted dot followed by an inverted space is displayed for double-width
>   characters so it's quite easy to see that they are tied together.

To be able to do CJK you need something like Kon anyway. This feels
like bloat.

> - There's no concept of zero-width characters (such as combining accents)
>   either. Yet again it's beyond the scope of my patch to properly handle
>   them. Instead of the current behavior (write a replacement character) I
>   just ignore them so that full-screen applications can keep track of the
>   cursor position correctly.

There is a concept of combining sequences. Anything else, I suspect
it's better to let the user know that something bad is happening.

> - I believe (at least I do hope) that my code is cleaner, more
>   straightforward, easier to understand, and is slightly better documented
>   than the current version. The current code doesn't separate UTF-8 decoding
>   and glyph displaying parts. I clearly separated them. First I perform
>   UTF-8 decoding (this emits U+FFFD for invalid sequences), then check for
>   the width of the resulting character, change it to U+FFFD if it's
>   unprintable (e.g. an UTF-16 surrogate), and finally comes the part that
>   does its best in displaying the character on the screen.
>
> I hope you like it. :)

Please see above comments.

	-hpa
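
For readers following the three decoding problems acknowledged above as
valid (overlong sequences, lone continuation bytes, incomplete
sequences), here is a minimal sketch of a decoder step that reports each
of them as U+FFFD. It is not the kernel's console code and not the
patch under discussion. The name utf8_decode_one() is hypothetical, and
consuming a whole overlong sequence as a single error is an assumed
policy; other policies are possible.

```c
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT 0xFFFDu	/* U+FFFD, used here for every malformed input */

/*
 * Hypothetical helper, for illustration only.
 *
 * Decodes one code point from buf[0..len). Stores the code point, or
 * U+FFFD for a malformed sequence, in *out and returns the number of
 * bytes consumed (0 only when len is 0).
 */
size_t utf8_decode_one(const unsigned char *buf, size_t len, uint32_t *out)
{
	/* Smallest code point that legitimately needs 2, 3 or 4 bytes. */
	static const uint32_t min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
	uint32_t cp;
	size_t need, i;

	if (len == 0) {
		*out = REPLACEMENT;
		return 0;
	}

	if (buf[0] < 0x80) {			/* plain ASCII */
		*out = buf[0];
		return 1;
	} else if ((buf[0] & 0xE0) == 0xC0) {
		need = 2;
		cp = buf[0] & 0x1F;
	} else if ((buf[0] & 0xF0) == 0xE0) {
		need = 3;
		cp = buf[0] & 0x0F;
	} else if ((buf[0] & 0xF8) == 0xF0) {
		need = 4;
		cp = buf[0] & 0x07;
	} else {
		/* Lone continuation byte (0x80..0xBF) or invalid lead byte. */
		*out = REPLACEMENT;
		return 1;
	}

	for (i = 1; i < need; i++) {
		if (i >= len || (buf[i] & 0xC0) != 0x80) {
			/*
			 * Incomplete sequence: report it, but leave the
			 * offending byte for the next call so the following
			 * valid character is decoded exactly once.
			 */
			*out = REPLACEMENT;
			return i;
		}
		cp = (cp << 6) | (buf[i] & 0x3F);
	}

	/* Reject overlong forms, UTF-16 surrogates and out-of-range values. */
	if (cp < min_cp[need] || cp > 0x10FFFF ||
	    (cp >= 0xD800 && cp <= 0xDFFF))
		cp = REPLACEMENT;

	*out = cp;
	return need;
}
```

A caller would loop over the input, advancing by the returned count.
What glyph U+FFFD then maps to is a separate step: per the argument
above, that is whatever the font's unicode table designates for U+FFFD,
with ASCII ? as the last resort.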