public inbox for linux-serial@vger.kernel.org
* [PATCH] tty/vt: UTF-8 parsing update according to RFC 3629, modern Unicode
@ 2023-12-12 15:13 Roman Žilka
  2023-12-12 15:36 ` Greg KH
  0 siblings, 1 reply; 10+ messages in thread
From: Roman Žilka @ 2023-12-12 15:13 UTC
  To: gregkh, jirislaby; +Cc: linux-serial, roman.zilka

vc_translate_unicode() and vc_sanitize_unicode() parse input to the
UTF-8-enabled console, marking invalid byte sequences and producing Unicode
codepoints. The current algorithm follows an obsolete version of Unicode and
may accept invalid byte sequences, pass on non-existent codepoints and reject
valid sequences.

The patch restores the functions' compliance with modern Unicode (v15.1 and
many earlier versions) as well as RFC 3629.
1. Codepoint space is limited to 0x10FFFF.
2. "Noncharacters", such as U+FFFE and U+FFFF, are no longer invalid in Unicode
   and will be accepted. Another option was to complete the set of
   noncharacters (it used to be just those two, now there are more) and
   preserve the rejection step. This is what Unicode suggests, but does not
   require (v15.1, chap. 23.7). However, most codepoints are !iswprint()
   anyway, so singling out just the noncharacters seemed arbitrary, futile
   and unnecessary.

On the side:
3. What remained of vc_sanitize_unicode() has been moved into
   vc_translate_unicode().
4. Corrected the vc_translate_unicode() doc (@rescan).
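
For illustration only (not part of the patch): the check applied once a whole
sequence has been assembled can be sketched in userspace C roughly as below.
The names utf8_check_decoded and utf8_max_by_len are hypothetical; ncont stands
in for the number of continuation bytes consumed (vc_npar in the diff).

```c
#include <stdint.h>

#define REPLACEMENT_CHAR 0xfffdu

/* Largest code point encodable with 0, 1, 2 and 3 continuation bytes.
 * The last entry is capped at U+10FFFF, as in RFC 3629. */
static const uint32_t utf8_max_by_len[] = { 0x7f, 0x7ff, 0xffff, 0x10ffff };

/* Hypothetical helper: validate a code point assembled from a lead byte
 * plus ncont (1..3) continuation bytes. Returns cp, or U+FFFD if invalid. */
static uint32_t utf8_check_decoded(uint32_t cp, unsigned int ncont)
{
	if (ncont < 1 || ncont > 3)
		return REPLACEMENT_CHAR;
	/* Overlong: the value would have fit in a shorter sequence. */
	if (cp <= utf8_max_by_len[ncont - 1])
		return REPLACEMENT_CHAR;
	/* Too large for this length (also catches values above U+10FFFF). */
	if (cp > utf8_max_by_len[ncont])
		return REPLACEMENT_CHAR;
	/* UTF-16 surrogates U+D800..U+DFFF are not valid in UTF-8. */
	if ((cp & 0xfff800) == 0x00d800)
		return REPLACEMENT_CHAR;
	/* Noncharacters such as U+FFFE and U+FFFF pass through unchanged. */
	return cp;
}
```

With this sketch, U+FFFF decoded from a 3-byte sequence is returned as is,
while U+D800 and anything above U+10FFFF map to U+FFFD.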

This is not a security patch. I'm not aware of any present security implications
of the old code.

Signed-off-by: Roman Žilka <roman.zilka@gmail.com>
---
 drivers/tty/vt/vt.c | 36 +++++++-----------------------------
 1 file changed, 7 insertions(+), 29 deletions(-)

diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
index 156efda7c80d..215e162ec8af 100644
--- a/drivers/tty/vt/vt.c
+++ b/drivers/tty/vt/vt.c
@@ -2587,23 +2587,11 @@ static inline int vc_translate_ascii(const struct vc_data *vc, int c)
 }
 
 
-/**
- * vc_sanitize_unicode - Replace invalid Unicode code points with U+FFFD
- * @c: the received character, or U+FFFD for invalid sequences.
- */
-static inline int vc_sanitize_unicode(const int c)
-{
-	if ((c >= 0xd800 && c <= 0xdfff) || c == 0xfffe || c == 0xffff)
-		return 0xfffd;
-
-	return c;
-}
-
 /**
  * vc_translate_unicode - Combine UTF-8 into Unicode in @vc_utf_char
  * @vc: virtual console
- * @c: character to translate
- * @rescan: we return true if we need more (continuation) data
+ * @c: UTF-8 byte to translate
+ * @rescan: true => @c wasn't translated here and needs to be re-processed
  *
  * @vc_utf_char is the being-constructed unicode character.
  * @vc_utf_count is the number of continuation bytes still expected to arrive.
@@ -2611,10 +2599,7 @@ static inline int vc_sanitize_unicode(const int c)
  */
 static int vc_translate_unicode(struct vc_data *vc, int c, bool *rescan)
 {
-	static const u32 utf8_length_changes[] = {
-		0x0000007f, 0x000007ff, 0x0000ffff,
-		0x001fffff, 0x03ffffff, 0x7fffffff
-	};
+	static const u32 utf8_length_changes[] = {0x7f, 0x7ff, 0xffff, 0x10ffff};
 
 	/* Continuation byte received */
 	if ((c & 0xc0) == 0x80) {
@@ -2629,12 +2614,12 @@ static int vc_translate_unicode(struct vc_data *vc, int c, bool *rescan)
 
 		/* Got a whole character */
 		c = vc->vc_utf_char;
-		/* Reject overlong sequences */
+		/* Reject overlong sequences and surrogates */
 		if (c <= utf8_length_changes[vc->vc_npar - 1] ||
-				c > utf8_length_changes[vc->vc_npar])
+				c > utf8_length_changes[vc->vc_npar] ||
+				(c & 0xfff800) == 0x00d800)
 			return 0xfffd;
-
-		return vc_sanitize_unicode(c);
+		return c;
 	}
 
 	/* Single ASCII byte or first byte of a sequence received */
@@ -2660,14 +2645,7 @@ static int vc_translate_unicode(struct vc_data *vc, int c, bool *rescan)
 	} else if ((c & 0xf8) == 0xf0) {
 		vc->vc_utf_count = 3;
 		vc->vc_utf_char = (c & 0x07);
-	} else if ((c & 0xfc) == 0xf8) {
-		vc->vc_utf_count = 4;
-		vc->vc_utf_char = (c & 0x03);
-	} else if ((c & 0xfe) == 0xfc) {
-		vc->vc_utf_count = 5;
-		vc->vc_utf_char = (c & 0x01);
 	} else {
-		/* 254 and 255 are invalid */
 		return 0xfffd;
 	}
 

base-commit: a39b6ac3781d46ba18193c9dbb2110f31e9bffe9
-- 
2.41.0
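
For readers without the full function at hand, the control flow after this
patch can be approximated by the following userspace sketch. It is an
illustration under stated assumptions, not the kernel code: struct utf8_state,
utf8_feed() and utf8_decode() are hypothetical names standing in for the
vc_utf_char/vc_utf_count/vc_npar fields and for the caller that handles
@rescan.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the UTF-8 fields of struct vc_data. */
struct utf8_state {
	uint32_t cp;        /* code point being assembled */
	unsigned int need;  /* continuation bytes still expected */
	unsigned int seen;  /* continuation bytes consumed so far */
};

/* Feed one byte. Returns a code point, U+FFFD for an invalid sequence, or
 * -1 when more bytes are needed. *rescan is set when the byte cut a
 * sequence short and must be fed again to be interpreted itself. */
static int32_t utf8_feed(struct utf8_state *st, uint8_t byte, bool *rescan)
{
	static const uint32_t max_by_len[] = { 0x7f, 0x7ff, 0xffff, 0x10ffff };

	*rescan = false;

	if ((byte & 0xc0) == 0x80) {            /* continuation byte */
		if (!st->need)
			return 0xfffd;          /* unexpected continuation */
		st->cp = (st->cp << 6) | (byte & 0x3f);
		st->seen++;
		if (--st->need)
			return -1;
		/* Reject overlong, out-of-range and surrogate values. */
		if (st->cp <= max_by_len[st->seen - 1] ||
		    st->cp > max_by_len[st->seen] ||
		    (st->cp & 0xfff800) == 0x00d800)
			return 0xfffd;
		return (int32_t)st->cp;
	}

	if (st->need) {                         /* sequence cut short */
		st->need = 0;
		*rescan = true;
		return 0xfffd;
	}

	if (byte <= 0x7f)                       /* plain ASCII */
		return byte;

	st->seen = 0;
	if ((byte & 0xe0) == 0xc0) {
		st->need = 1;
		st->cp = byte & 0x1f;
	} else if ((byte & 0xf0) == 0xe0) {
		st->need = 2;
		st->cp = byte & 0x0f;
	} else if ((byte & 0xf8) == 0xf0) {
		st->need = 3;
		st->cp = byte & 0x07;
	} else {
		return 0xfffd;                  /* 0xf8..0xff start nothing */
	}
	return -1;
}

/* Decode a buffer, re-feeding a byte whenever a rescan is requested.
 * out must have room for len entries. */
static size_t utf8_decode(const uint8_t *in, size_t len, int32_t *out)
{
	struct utf8_state st = { 0, 0, 0 };
	size_t n = 0;

	for (size_t i = 0; i < len; ) {
		bool rescan;
		int32_t r = utf8_feed(&st, in[i], &rescan);

		if (r >= 0)
			out[n++] = r;
		if (!rescan)
			i++;
	}
	return n;
}
```

In this sketch the bytes E2 41 yield U+FFFD followed by 'A': the 'A' first
terminates the truncated sequence and is then fed a second time.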


* [PATCH] tty/vt: UTF-8 parsing update according to RFC 3629, modern Unicode
@ 2023-12-12  7:40 Roman Zilka
  2023-12-12  8:24 ` Greg KH
  2023-12-12  9:20 ` Jiri Slaby
  0 siblings, 2 replies; 10+ messages in thread
From: Roman Zilka @ 2023-12-12  7:40 UTC
  To: gregkh, jirislaby; +Cc: linux-serial

[-- Attachment #1: Type: text/plain, Size: 993 bytes --]

vc_translate_unicode(), vc_sanitize_unicode():
1. Limit codepoint space to 0x10FFFF. The old algorithm followed an ancient
   version of Unicode.
2. Corrected vc_translate_unicode() doc (@rescan).
3. "Noncharacters", such as U+FFFE and U+FFFF, are no longer invalid in
   Unicode, so accept them. Another option was to complete the set of
   noncharacters (it used to be those two, now there are more) and preserve the
   substitution. This is what Unicode suggests, but does not require (v15.1,
   chap. 23.7). However, most codepoints are !iswprint(), so substituting just
   the noncharacters seemed futile. Also, I've never seen noncharacters treated
   in a special way.
4. Moved what remained of vc_sanitize_unicode() into vc_translate_unicode().
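
For reference, the option not taken in item 3 (completing the set of
noncharacters) would amount to a predicate along these lines. This is only a
sketch of the rejected alternative; is_noncharacter is a hypothetical name and
nothing like it is in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true for the 66 code points that Unicode
 * designates as noncharacters. */
static bool is_noncharacter(uint32_t cp)
{
	/* U+FDD0..U+FDEF: a block of 32 noncharacters in the BMP. */
	if (cp >= 0xfdd0 && cp <= 0xfdef)
		return true;
	/* The last two code points of each of the 17 planes:
	 * U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ..., U+10FFFE, U+10FFFF. */
	return cp <= 0x10ffff && (cp & 0xfffe) == 0xfffe;
}
```

The old code rejected only U+FFFE and U+FFFF, i.e. two of these 66 values.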

Signed-off-by: Roman Žilka <roman.zilka@gmail.com>
---
 drivers/tty/vt/vt.c | 36 +++++++-----------------------------
 1 file changed, 7 insertions(+), 29 deletions(-)

base-commit: a39b6ac3781d46ba18193c9dbb2110f31e9bffe9
-- 
2.41.0

[-- Attachment #2: 0001-tty-vt-UTF-8-parsing-update-according-to-RFC-3629-mo.patch.xz --]
[-- Type: application/x-xz, Size: 1732 bytes --]


Thread overview: 10+ messages
2023-12-12 15:13 [PATCH] tty/vt: UTF-8 parsing update according to RFC 3629, modern Unicode Roman Žilka
2023-12-12 15:36 ` Greg KH
2023-12-12 16:23   ` [PATCH v2] " Roman Žilka
2023-12-12 20:26     ` [PATCH v3] " Roman Žilka
2024-01-04 15:28       ` Greg KH
2024-01-09 10:28         ` Roman Žilka
2024-01-09 10:43         ` [PATCH v4] " Roman Žilka
  -- strict thread matches above, loose matches on Subject: below --
2023-12-12  7:40 [PATCH] " Roman Zilka
2023-12-12  8:24 ` Greg KH
2023-12-12  9:20 ` Jiri Slaby
