linux-serial.vger.kernel.org archive mirror
* [PATCH v2 00/13] vt: implement proper Unicode handling
@ 2025-04-15 19:17 Nicolas Pitre
  2025-04-15 19:17 ` [PATCH v2 01/13] vt: minor cleanup to vc_translate_unicode() Nicolas Pitre
                   ` (12 more replies)
  0 siblings, 13 replies; 33+ messages in thread
From: Nicolas Pitre @ 2025-04-15 19:17 UTC
  To: Greg Kroah-Hartman, Jiri Slaby; +Cc: Nicolas Pitre, linux-serial, linux-kernel

The Linux VT console has many problems with regard to proper Unicode
handling:

- Double-width code points introduced after Unicode 5.0 are not recognized
  as such (Unicode is now at version 16.0).

- Zero-width code points are not recognized at all. Editing a file that
  contains many emojis makes the rendering issues obvious. With many
  zero-width characters (such as variation selectors), the console wraps
  long lines early, while a Unicode-aware editor assumes the content was
  rendered as intended, so its rendering logic falls out of sync with what
  is on screen. Combine this with tmux or screen and the terminal becomes
  a mess.

- Text that uses combining diacritics suffers from the same problem, as
  programs expect those characters to take fewer columns than the console
  actually gives them.
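
To make the mismatch concrete, here is a rough Python sketch. It is not
taken from this series: display_width() is a hypothetical helper, and its
rules (width 0 for combining marks and format characters, width 2 for East
Asian Wide/Fullwidth, width 1 otherwise) are a simplifying assumption.

```python
import unicodedata

def display_width(text):
    """Approximate the columns a width-aware program expects text to use."""
    cols = 0
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("M") or category == "Cf":
            continue  # combining marks and format characters: no column
        wide = unicodedata.east_asian_width(ch) in ("F", "W")
        cols += 2 if wide else 1
    return cols

samples = {
    "combining diacritic": "e\u0301",   # 'e' + COMBINING ACUTE ACCENT
    "zero-width joiner": "a\u200db",    # 'a' + ZERO WIDTH JOINER + 'b'
}
for label, text in samples.items():
    expected = display_width(text)
    print(f"{label}: {len(text)} code points, {expected} columns expected")
```

For both samples the expected column count is lower than the number of
code points, which is the gap that confuses editors on a console that
gives every code point a column.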

Some may argue that the Linux VT console is unmaintained and/or no longer
used much, and that one should consider a user space terminal alternative
instead. But every such alternative that is at least as well maintained as
the Linux VT console requires a full, heavy graphical environment, which
is the exact antithesis of what the Linux console is meant to be.

Furthermore, blind users (of whom I am one) make up a significant part of
the Linux console user base, and for us the alternatives are far more
cumbersome to use, which reduces our productivity. So the console has to
stay and be maintained to the best of our abilities.

That being said...

This patch series is about fixing all the above issues. This is accomplished
with some Python scripts leveraging Python's unicodedata module to generate
C code with lookup tables that is suitable for the kernel. In summary:

- The double-width code point table is updated to the latest Unicode version
  and the table itself is optimized to reduce its size.

- A zero-width code point table is created and the console code is modified
  to properly use it.

- A table with base character + combining mark pairs is created to convert
  them into their precomposed equivalents when they're encountered.
  By default the generated table contains only the most commonly used
  Latin, Greek, and Cyrillic recomposition pairs, but one can run the
  provided script with the --full argument to create a table that covers
  all possibilities. Combining marks that are not listed in the table are
  simply treated like zero-width code points and properly ignored.

- All those tables plus related lookup code require about 3500 additional
  bytes of text which is not very significant these days. Yet, one
  can still set CONFIG_CONSOLE_TRANSLATIONS=n to configure this all out
  if need be.
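
As an illustration of the recomposition idea in the list above, the sketch
below uses Python's unicodedata.normalize() as a stand-in. This is an
assumption made for the example only: recompose_pair() is a hypothetical
helper, and the series' own generator script is not shown in this message.

```python
import unicodedata

def recompose_pair(base, mark):
    """Return the precomposed character for base + mark, or None."""
    composed = unicodedata.normalize("NFC", base + mark)
    return composed if len(composed) == 1 else None

pairs = [
    ("e", "\u0301"),       # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
    ("\u0438", "\u0306"),  # CYRILLIC SMALL LETTER I + COMBINING BREVE
    ("q", "\u0301"),       # no precomposed form exists for this pair
]
for base, mark in pairs:
    result = recompose_pair(base, mark)
    shown = f"U+{ord(result):04X}" if result else "no precomposed form"
    print(f"U+{ord(base):04X} + U+{ord(mark):04X} -> {shown}")
```

The None case corresponds to a combining mark with no table entry, which
the series treats as a zero-width code point.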

Note: The generated C code makes scripts/checkpatch.pl complain about
      "... exceeds 100 columns" because the inserted comments with code
      point names, well, make some lines exceed 100 columns. Please make
      an exception for those files and disregard those warnings. When
      checkpatch.pl is used on those files directly with -f then it doesn't
      complain.

This series was tested on top of v6.15-rc2.

Changes from v1 (https://lkml.org/lkml/2025/4/9/1952):

- Moved most of the C functions out of the Python generator, leaving only
  the lookup tables to C code generation

- Cleaned up the Python code

- Unicode processing in vt.c moved to a function of its own

- Folded bug fixes into the series, fixed style, typos, etc.

Thanks to Jiri Slaby for the review.

diffstat:
 drivers/tty/vt/Makefile                   |   3 +-
 drivers/tty/vt/consolemap.c               |   2 -
 drivers/tty/vt/gen_ucs_recompose_table.py | 255 +++++++++++++
 drivers/tty/vt/gen_ucs_width_table.py     | 299 ++++++++++++++++
 drivers/tty/vt/ucs.c                      | 156 ++++++++
 drivers/tty/vt/ucs_recompose_table.h      | 102 ++++++
 drivers/tty/vt/ucs_width_table.h          | 453 ++++++++++++++++++++++++
 drivers/tty/vt/vt.c                       | 138 +++++---
 include/linux/consolemap.h                |  18 +
 9 files changed, 1376 insertions(+), 50 deletions(-)


* [PATCH v2 01/13] vt: minor cleanup to vc_translate_unicode()
  2025-04-15 19:17 [PATCH v2 00/13] vt: implement proper Unicode handling Nicolas Pitre
@ 2025-04-15 19:17 ` Nicolas Pitre
  2025-04-16  3:41   ` Jiri Slaby
  2025-04-15 19:17 ` [PATCH v2 02/13] vt: move unicode processing to a separate file Nicolas Pitre
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 33+ messages in thread
From: Nicolas Pitre @ 2025-04-15 19:17 UTC
  To: Greg Kroah-Hartman, Jiri Slaby; +Cc: Nicolas Pitre, linux-serial, linux-kernel

From: Nicolas Pitre <npitre@baylibre.com>

Make it clearer when a sequence is bad.

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
---
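For readers unfamiliar with the failure modes involved, the sketch below
uses Python's built-in UTF-8 decoder (not the kernel code) to show
malformed input turning into U+FFFD. The byte strings are made-up
examples, and Python may emit a different number of replacement
characters than vc_translate_unicode() would.

```python
cases = {
    "unexpected continuation byte": b"\x80",
    "truncated sequence then ASCII": b"\xe2\x82A",
    "overlong encoding of '/'": b"\xc0\xaf",
}

# errors="replace" substitutes U+FFFD for each malformed subsequence.
decoded = {
    label: raw.decode("utf-8", errors="replace")
    for label, raw in cases.items()
}

for label, text in decoded.items():
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    print(f"{label}: {code_points}")
```
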
 drivers/tty/vt/vt.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
index f5642b3038..b5f3c8a818 100644
--- a/drivers/tty/vt/vt.c
+++ b/drivers/tty/vt/vt.c
@@ -2817,7 +2817,7 @@ static int vc_translate_unicode(struct vc_data *vc, int c, bool *rescan)
 	if ((c & 0xc0) == 0x80) {
 		/* Unexpected continuation byte? */
 		if (!vc->vc_utf_count)
-			return 0xfffd;
+			goto bad_sequence;
 
 		vc->vc_utf_char = (vc->vc_utf_char << 6) | (c & 0x3f);
 		vc->vc_npar++;
@@ -2829,17 +2829,17 @@ static int vc_translate_unicode(struct vc_data *vc, int c, bool *rescan)
 		/* Reject overlong sequences */
 		if (c <= utf8_length_changes[vc->vc_npar - 1] ||
 				c > utf8_length_changes[vc->vc_npar])
-			return 0xfffd;
+			goto bad_sequence;
 
 		return vc_sanitize_unicode(c);
 	}
 
 	/* Single ASCII byte or first byte of a sequence received */
 	if (vc->vc_utf_count) {
-		/* Continuation byte expected */
+		/* A continuation byte was expected */
 		*rescan = true;
 		vc->vc_utf_count = 0;
-		return 0xfffd;
+		goto bad_sequence;
 	}
 
 	/* Nothing to do if an ASCII byte was received */
@@ -2858,11 +2858,14 @@ static int vc_translate_unicode(struct vc_data *vc, int c, bool *rescan)
 		vc->vc_utf_count = 3;
 		vc->vc_utf_char = (c & 0x07);
 	} else {
-		return 0xfffd;
+		goto bad_sequence;
 	}
 
 need_more_bytes:
 	return -1;
+
+bad_sequence:
+	return 0xfffd;
 }
 
 static int vc_translate(struct vc_data *vc, int *c, bool *rescan)
-- 
2.49.0



* [PATCH v2 02/13] vt: move unicode processing to a separate file
  2025-04-15 19:17 [PATCH v2 00/13] vt: implement proper Unicode handling Nicolas Pitre
  2025-04-15 19:17 ` [PATCH v2 01/13] vt: minor cleanup to vc_translate_unicode() Nicolas Pitre
@ 2025-04-15 19:17 ` Nicolas Pitre
  2025-04-16  3:42   ` Jiri Slaby
  2025-04-15 19:17 ` [PATCH v2 03/13] vt: properly support zero-width Unicode code points Nicolas Pitre
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 33+ messages in thread
From: Nicolas Pitre @ 2025-04-15 19:17 UTC
  To: Greg Kroah-Hartman, Jiri Slaby; +Cc: Nicolas Pitre, linux-serial, linux-kernel

From: Nicolas Pitre <npitre@baylibre.com>

This will make it easier to maintain. Also make it depend on
CONFIG_CONSOLE_TRANSLATIONS.

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
---
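The lookup moved by this patch is a binary search over sorted intervals.
A minimal Python sketch of the same idea follows. The ranges are copied
from ucs_double_width_ranges[] in this patch, while is_double_width() and
the use of bisect are choices made for this illustration only.

```python
from bisect import bisect_right

# Ranges copied from ucs_double_width_ranges[] (Unicode 5.0 era).
DOUBLE_WIDTH_RANGES = [
    (0x1100, 0x115F), (0x2329, 0x232A), (0x2E80, 0x303E),
    (0x3040, 0xA4CF), (0xAC00, 0xD7A3), (0xF900, 0xFAFF),
    (0xFE10, 0xFE19), (0xFE30, 0xFE6F), (0xFF00, 0xFF60),
    (0xFFE0, 0xFFE6), (0x20000, 0x2FFFD), (0x30000, 0x3FFFD),
]
FIRSTS = [first for first, _ in DOUBLE_WIDTH_RANGES]

def is_double_width(cp):
    # Cheap rejection of anything outside the table's overall span.
    if cp < DOUBLE_WIDTH_RANGES[0][0] or cp > DOUBLE_WIDTH_RANGES[-1][1]:
        return False
    # Find the last interval starting at or before cp, then check its end.
    index = bisect_right(FIRSTS, cp) - 1
    return cp <= DOUBLE_WIDTH_RANGES[index][1]

for cp in (0x41, 0x4E2D, 0x303F, 0x1F600):
    print(f"U+{cp:04X}: {is_double_width(cp)}")
```

U+1F600 falls in the gap between two rows of this old table, which is the
kind of omission the rest of the series addresses.
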
 drivers/tty/vt/Makefile    |  3 ++-
 drivers/tty/vt/ucs.c       | 54 ++++++++++++++++++++++++++++++++++++++
 drivers/tty/vt/vt.c        | 40 +---------------------------
 include/linux/consolemap.h |  6 +++++
 4 files changed, 63 insertions(+), 40 deletions(-)
 create mode 100644 drivers/tty/vt/ucs.c

diff --git a/drivers/tty/vt/Makefile b/drivers/tty/vt/Makefile
index 2c8ce8b592..e24c8546ac 100644
--- a/drivers/tty/vt/Makefile
+++ b/drivers/tty/vt/Makefile
@@ -7,7 +7,8 @@ FONTMAPFILE = cp437.uni
 obj-$(CONFIG_VT)			+= vt_ioctl.o vc_screen.o \
 					   selection.o keyboard.o \
 					   vt.o defkeymap.o
-obj-$(CONFIG_CONSOLE_TRANSLATIONS)	+= consolemap.o consolemap_deftbl.o
+obj-$(CONFIG_CONSOLE_TRANSLATIONS)	+= consolemap.o consolemap_deftbl.o \
+					   ucs.o
 
 # Files generated that shall be removed upon make clean
 clean-files := consolemap_deftbl.c defkeymap.c
diff --git a/drivers/tty/vt/ucs.c b/drivers/tty/vt/ucs.c
new file mode 100644
index 0000000000..0f6c087158
--- /dev/null
+++ b/drivers/tty/vt/ucs.c
@@ -0,0 +1,54 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/array_size.h>
+#include <linux/bsearch.h>
+#include <linux/consolemap.h>
+#include <linux/minmax.h>
+
+/* ucs_is_double_width() is based on the wcwidth() implementation by
+ * Markus Kuhn -- 2007-05-26 (Unicode 5.0)
+ * Latest version: https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
+ */
+
+struct ucs_interval {
+	u32 first;
+	u32 last;
+};
+
+static const struct ucs_interval ucs_double_width_ranges[] = {
+	{ 0x1100, 0x115F }, { 0x2329, 0x232A }, { 0x2E80, 0x303E },
+	{ 0x3040, 0xA4CF }, { 0xAC00, 0xD7A3 }, { 0xF900, 0xFAFF },
+	{ 0xFE10, 0xFE19 }, { 0xFE30, 0xFE6F }, { 0xFF00, 0xFF60 },
+	{ 0xFFE0, 0xFFE6 }, { 0x20000, 0x2FFFD }, { 0x30000, 0x3FFFD }
+};
+
+static int interval_cmp(const void *key, const void *element)
+{
+	u32 cp = *(u32 *)key;
+	const struct ucs_interval *entry = element;
+
+	if (cp < entry->first)
+		return -1;
+	if (cp > entry->last)
+		return 1;
+	return 0;
+}
+
+/**
+ * ucs_is_double_width() - Determine if a Unicode code point is double-width.
+ * @cp: Unicode code point (UCS-4)
+ *
+ * Return: true if the character is double-width, false otherwise
+ */
+bool ucs_is_double_width(u32 cp)
+{
+	size_t size = ARRAY_SIZE(ucs_double_width_ranges);
+
+	if (cp < ucs_double_width_ranges[0].first ||
+	    cp > ucs_double_width_ranges[size - 1].last)
+		return false;
+
+	return __inline_bsearch(&cp, ucs_double_width_ranges, size,
+				sizeof(*ucs_double_width_ranges),
+				interval_cmp) != NULL;
+}
diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
index b5f3c8a818..bcb508bc15 100644
--- a/drivers/tty/vt/vt.c
+++ b/drivers/tty/vt/vt.c
@@ -104,7 +104,6 @@
 #include <linux/uaccess.h>
 #include <linux/kdb.h>
 #include <linux/ctype.h>
-#include <linux/bsearch.h>
 #include <linux/gcd.h>
 
 #define MAX_NR_CON_DRIVER 16
@@ -2712,43 +2711,6 @@ static void do_con_trol(struct tty_struct *tty, struct vc_data *vc, u8 c)
 	}
 }
 
-/* is_double_width() is based on the wcwidth() implementation by
- * Markus Kuhn -- 2007-05-26 (Unicode 5.0)
- * Latest version: https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
- */
-struct interval {
-	uint32_t first;
-	uint32_t last;
-};
-
-static int ucs_cmp(const void *key, const void *elt)
-{
-	uint32_t ucs = *(uint32_t *)key;
-	struct interval e = *(struct interval *) elt;
-
-	if (ucs > e.last)
-		return 1;
-	else if (ucs < e.first)
-		return -1;
-	return 0;
-}
-
-static int is_double_width(uint32_t ucs)
-{
-	static const struct interval double_width[] = {
-		{ 0x1100, 0x115F }, { 0x2329, 0x232A }, { 0x2E80, 0x303E },
-		{ 0x3040, 0xA4CF }, { 0xAC00, 0xD7A3 }, { 0xF900, 0xFAFF },
-		{ 0xFE10, 0xFE19 }, { 0xFE30, 0xFE6F }, { 0xFF00, 0xFF60 },
-		{ 0xFFE0, 0xFFE6 }, { 0x20000, 0x2FFFD }, { 0x30000, 0x3FFFD }
-	};
-	if (ucs < double_width[0].first ||
-	    ucs > double_width[ARRAY_SIZE(double_width) - 1].last)
-		return 0;
-
-	return bsearch(&ucs, double_width, ARRAY_SIZE(double_width),
-			sizeof(struct interval), ucs_cmp) != NULL;
-}
-
 struct vc_draw_region {
 	unsigned long from, to;
 	int x;
@@ -2953,7 +2915,7 @@ static int vc_con_write_normal(struct vc_data *vc, int tc, int c,
 	bool inverse = false;
 
 	if (vc->vc_utf && !vc->vc_disp_ctrl) {
-		if (is_double_width(c))
+		if (ucs_is_double_width(c))
 			width = 2;
 	}
 
diff --git a/include/linux/consolemap.h b/include/linux/consolemap.h
index c35db4896c..caf079bcb8 100644
--- a/include/linux/consolemap.h
+++ b/include/linux/consolemap.h
@@ -28,6 +28,7 @@ int conv_uni_to_pc(struct vc_data *conp, long ucs);
 u32 conv_8bit_to_uni(unsigned char c);
 int conv_uni_to_8bit(u32 uni);
 void console_map_init(void);
+bool ucs_is_double_width(uint32_t cp);
 #else
 static inline u16 inverse_translate(const struct vc_data *conp, u16 glyph,
 		bool use_unicode)
@@ -57,6 +58,11 @@ static inline int conv_uni_to_8bit(u32 uni)
 }
 
 static inline void console_map_init(void) { }
+
+static inline bool ucs_is_double_width(uint32_t cp)
+{
+	return false;
+}
 #endif /* CONFIG_CONSOLE_TRANSLATIONS */
 
 #endif /* __LINUX_CONSOLEMAP_H__ */
-- 
2.49.0



* [PATCH v2 03/13] vt: properly support zero-width Unicode code points
  2025-04-15 19:17 [PATCH v2 00/13] vt: implement proper Unicode handling Nicolas Pitre
  2025-04-15 19:17 ` [PATCH v2 01/13] vt: minor cleanup to vc_translate_unicode() Nicolas Pitre
  2025-04-15 19:17 ` [PATCH v2 02/13] vt: move unicode processing to a separate file Nicolas Pitre
@ 2025-04-15 19:17 ` Nicolas Pitre
  2025-04-16  3:45   ` Jiri Slaby
  2025-04-15 19:17 ` [PATCH v2 04/13] vt: introduce gen_ucs_width_table.py to create ucs_width_table.h Nicolas Pitre
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 33+ messages in thread
From: Nicolas Pitre @ 2025-04-15 19:17 UTC
  To: Greg Kroah-Hartman, Jiri Slaby; +Cc: Nicolas Pitre, linux-serial, linux-kernel

From: Nicolas Pitre <npitre@baylibre.com>

Zero-width Unicode code points are causing misalignment in vertically
aligned content, disrupting the visual layout. Let's handle zero-width
code points more intelligently.

Double-width code points are stored in the screen grid followed by a white
space code point to create the expected screen layout. When a double-width
code point is followed by a zero-width code point in the console's incoming
byte stream (e.g., an emoji with a presentation selector) then we may
replace the white space padding with that zero-width code point instead of
dropping it. This maximizes screen content information while preserving
proper layout.

If a zero-width code point is preceded by a single-width code point then
the above trick is not possible and such a zero-width code point must
be dropped.

VS16 (Variation Selector 16, U+FE0F) is special as it doubles the width
of the preceding single-width code point. We handle that case by giving
VS16 a width of 1 when that happens.

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
---
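The placement rules from the commit message can be modelled in a few
lines of Python. This is a simplified sketch, not the kernel logic:
toy_width() is a hypothetical stand-in for the width tables, a row is a
plain list with one code point per column, and line wrapping is ignored.

```python
VS16 = 0xFE0F   # VARIATION SELECTOR-16
SPACE = 0x20

def toy_width(cp):
    """Hypothetical stand-in for the kernel's width tables."""
    if cp in (0x200D, VS16) or 0x0300 <= cp <= 0x036F:
        return 0            # ZWJ, VS16, combining diacritical marks
    if 0x1F300 <= cp <= 0x1F5FF:
        return 2            # a slice of the emoji range
    return 1

def put(row, cp):
    """Store cp in row (one code point per column) per the rules above."""
    width = toy_width(cp)
    if width == 2:
        row.extend([cp, SPACE])     # wide glyph followed by space padding
    elif width == 1:
        row.append(cp)
    elif len(row) >= 2 and toy_width(row[-2]) == 2:
        row[-1] = cp                # zero-width takes over the padding cell
    elif cp == VS16 and row:
        row.append(cp)              # VS16 after single-width: one column
    # any other zero-width code point is dropped
    return row

def render(code_points):
    row = []
    for cp in code_points:
        put(row, cp)
    return row

print([hex(cp) for cp in render([0x1F44D, 0x200D])])  # emoji + ZWJ
print([hex(cp) for cp in render([0x2764, VS16])])     # heart + VS16
print([hex(cp) for cp in render([0x65, 0x0301])])     # 'e' + acute accent
```
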
 drivers/tty/vt/vt.c        | 70 ++++++++++++++++++++++++++++++++++++--
 include/linux/consolemap.h | 10 ++++++
 2 files changed, 78 insertions(+), 2 deletions(-)

diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
index bcb508bc15..a989feffad 100644
--- a/drivers/tty/vt/vt.c
+++ b/drivers/tty/vt/vt.c
@@ -443,6 +443,15 @@ static void vc_uniscr_scroll(struct vc_data *vc, unsigned int top,
 	}
 }
 
+static u32 vc_uniscr_getc(struct vc_data *vc, int relative_pos)
+{
+	int pos = vc->state.x + vc->vc_need_wrap + relative_pos;
+
+	if (vc->vc_uni_lines && in_range(pos, 0, vc->vc_cols))
+		return vc->vc_uni_lines[vc->state.y][pos];
+	return 0;
+}
+
 static void vc_uniscr_copy_area(u32 **dst_lines,
 				unsigned int dst_cols,
 				unsigned int dst_rows,
@@ -2905,6 +2914,60 @@ static bool vc_is_control(struct vc_data *vc, int tc, int c)
 	return false;
 }
 
+static void vc_con_rewind(struct vc_data *vc)
+{
+	if (vc->state.x && !vc->vc_need_wrap) {
+		vc->vc_pos -= 2;
+		vc->state.x--;
+	}
+	vc->vc_need_wrap = 0;
+}
+
+#define UCS_VS16	0xfe0f	/* Variation Selector 16 */
+
+static int vc_process_ucs(struct vc_data *vc, int c, int *tc)
+{
+	u32 prev_c, curr_c = c;
+
+	if (ucs_is_double_width(curr_c))
+		return 2;
+
+	if (!ucs_is_zero_width(curr_c))
+		return 1;
+
+	/* From here curr_c is known to be zero-width. */
+
+	if (ucs_is_double_width(vc_uniscr_getc(vc, -2))) {
+		/*
+		 * Let's merge this zero-width code point with the preceding
+		 * double-width code point by replacing the existing
+		 * whitespace padding. To do so we rewind one column and
+		 * pretend this has a width of 1.
+		 * We give the legacy display the same initial space padding.
+		 */
+		vc_con_rewind(vc);
+		*tc = ' ';
+		return 1;
+	}
+
+	/* From here the preceding character, if any, must be single-width. */
+	prev_c = vc_uniscr_getc(vc, -1);
+
+	if (curr_c == UCS_VS16 && prev_c != 0) {
+		/*
+		 * VS16 (U+FE0F) is special. It typically turns the preceding
+		 * single-width character into a double-width one. Let it
+		 * have a width of 1 effectively making the combination with
+		 * the preceding character double-width.
+		 */
+		*tc = ' ';
+		return 1;
+	}
+
+	/* Otherwise zero-width code points are ignored. */
+	return 0;
+}
+
 static int vc_con_write_normal(struct vc_data *vc, int tc, int c,
 		struct vc_draw_region *draw)
 {
@@ -2915,8 +2978,9 @@ static int vc_con_write_normal(struct vc_data *vc, int tc, int c,
 	bool inverse = false;
 
 	if (vc->vc_utf && !vc->vc_disp_ctrl) {
-		if (ucs_is_double_width(c))
-			width = 2;
+		width = vc_process_ucs(vc, c, &tc);
+		if (!width)
+			goto out;
 	}
 
 	/* Now try to find out how to display it */
@@ -2995,6 +3059,8 @@ static int vc_con_write_normal(struct vc_data *vc, int tc, int c,
 			tc = ' ';
 		next_c = ' ';
 	}
+
+out:
 	notify_write(vc, c);
 
 	if (inverse)
diff --git a/include/linux/consolemap.h b/include/linux/consolemap.h
index caf079bcb8..7d778752dc 100644
--- a/include/linux/consolemap.h
+++ b/include/linux/consolemap.h
@@ -29,6 +29,11 @@ u32 conv_8bit_to_uni(unsigned char c);
 int conv_uni_to_8bit(u32 uni);
 void console_map_init(void);
 bool ucs_is_double_width(uint32_t cp);
+static inline bool ucs_is_zero_width(uint32_t cp)
+{
+	/* coming soon */
+	return false;
+}
 #else
 static inline u16 inverse_translate(const struct vc_data *conp, u16 glyph,
 		bool use_unicode)
@@ -63,6 +68,11 @@ static inline bool ucs_is_double_width(uint32_t cp)
 {
 	return false;
 }
+
+static inline bool ucs_is_zero_width(uint32_t cp)
+{
+	return false;
+}
 #endif /* CONFIG_CONSOLE_TRANSLATIONS */
 
 #endif /* __LINUX_CONSOLEMAP_H__ */
-- 
2.49.0



* [PATCH v2 04/13] vt: introduce gen_ucs_width_table.py to create ucs_width_table.h
  2025-04-15 19:17 [PATCH v2 00/13] vt: implement proper Unicode handling Nicolas Pitre
                   ` (2 preceding siblings ...)
  2025-04-15 19:17 ` [PATCH v2 03/13] vt: properly support zero-width Unicode code points Nicolas Pitre
@ 2025-04-15 19:17 ` Nicolas Pitre
  2025-04-16  4:14   ` Jiri Slaby
  2025-04-15 19:17 ` [PATCH v2 05/13] vt: create ucs_width_table.h with gen_ucs_width_table.py Nicolas Pitre
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 33+ messages in thread
From: Nicolas Pitre @ 2025-04-15 19:17 UTC
  To: Greg Kroah-Hartman, Jiri Slaby; +Cc: Nicolas Pitre, linux-serial, linux-kernel

From: Nicolas Pitre <npitre@baylibre.com>

The table in ucs.c is terribly out of date and incomplete. We also need a
second table to store zero-width code points. Properly maintaining those
tables manually is impossible. So here's a script to generate them.

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
---
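The key step in the generator is collapsing individual code points into
(first, last) ranges. A small standalone sketch of that step follows;
to_ranges() is a hypothetical name, and the input is the KNOWN_ZERO_WIDTH
list from the script rather than the full classification, so the real
table ends up with wider ranges than this prints.

```python
def to_ranges(code_points):
    """Collapse code points into sorted, inclusive (first, last) ranges."""
    ranges = []
    for cp in sorted(set(code_points)):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], cp)   # extend the current run
        else:
            ranges.append((cp, cp))            # start a new run
    return ranges

zero_width = [0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF]
for first, last in to_ranges(zero_width):
    # Same row layout as the generated header.
    print(f"\t{{ 0x{first:05X}, 0x{last:05X} }},")
```
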
 drivers/tty/vt/gen_ucs_width_table.py | 256 ++++++++++++++++++++++++++
 1 file changed, 256 insertions(+)
 create mode 100755 drivers/tty/vt/gen_ucs_width_table.py

diff --git a/drivers/tty/vt/gen_ucs_width_table.py b/drivers/tty/vt/gen_ucs_width_table.py
new file mode 100755
index 0000000000..00510444a7
--- /dev/null
+++ b/drivers/tty/vt/gen_ucs_width_table.py
@@ -0,0 +1,256 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+#
+# Leverage Python's unicodedata module to generate ucs_width_table.h
+
+import unicodedata
+import sys
+
+# This script's file name
+from pathlib import Path
+this_file = Path(__file__).name
+
+# Output file name
+out_file = "ucs_width_table.h"
+
+# --- Global Constants for Width Assignments ---
+
+# Known zero-width characters
+KNOWN_ZERO_WIDTH = (
+    0x200B,  # ZERO WIDTH SPACE
+    0x200C,  # ZERO WIDTH NON-JOINER
+    0x200D,  # ZERO WIDTH JOINER
+    0x2060,  # WORD JOINER
+    0xFEFF   # ZERO WIDTH NO-BREAK SPACE (BOM)
+)
+
+# Zero-width emoji modifiers and components
+# NOTE: Some of these characters would normally be single-width according to
+# East Asian Width properties, but we deliberately override them to be
+# zero-width because they function as modifiers in emoji sequences.
+EMOJI_ZERO_WIDTH = [
+    # Skin tone modifiers
+    (0x1F3FB, 0x1F3FF),  # Emoji modifiers (skin tones)
+
+    # Variation selectors (note: VS16 is treated specially in vt.c)
+    (0xFE00, 0xFE0F),    # Variation Selectors 1-16
+
+    # Gender and hair style modifiers
+    # These would be single-width by Unicode properties, but are zero-width
+    # when part of emoji
+    (0x2640, 0x2640),    # Female sign
+    (0x2642, 0x2642),    # Male sign
+    (0x26A7, 0x26A7),    # Transgender symbol
+    (0x1F9B0, 0x1F9B3),  # Hair components (red, curly, white, bald)
+
+    # Tag characters
+    (0xE0020, 0xE007E),  # Tags
+]
+
+# Regional indicators (flag components)
+REGIONAL_INDICATORS = (0x1F1E6, 0x1F1FF)  # Regional indicator symbols A-Z
+
+# Double-width emoji ranges
+#
+# Many emoji characters are classified as single-width according to Unicode
+# Standard Annex #11 East Asian Width property (N or Neutral), but we
+# deliberately override them to be double-width. References:
+# 1. Unicode Technical Standard #51: Unicode Emoji
+#    (https://www.unicode.org/reports/tr51/)
+# 2. Principle of "emoji presentation" in W3C CSS Text specification
+#    (https://drafts.csswg.org/css-text-3/#character-properties)
+# 3. Terminal emulator implementations (iTerm2, Windows Terminal, etc.) which
+#    universally render emoji as double-width characters regardless of their
+#    Unicode EAW property
+# 4. W3C Work Item: Requirements for Japanese Text Layout - Section 3.8.1
+#    Emoji width (https://www.w3.org/TR/jlreq/)
+EMOJI_RANGES = [
+    (0x1F000, 0x1F02F),  # Mahjong Tiles (EAW: N, but displayed as double-width)
+    (0x1F0A0, 0x1F0FF),  # Playing Cards (EAW: N, but displayed as double-width)
+    (0x1F300, 0x1F5FF),  # Miscellaneous Symbols and Pictographs
+    (0x1F600, 0x1F64F),  # Emoticons
+    (0x1F680, 0x1F6FF),  # Transport and Map Symbols
+    (0x1F700, 0x1F77F),  # Alchemical Symbols
+    (0x1F780, 0x1F7FF),  # Geometric Shapes Extended
+    (0x1F800, 0x1F8FF),  # Supplemental Arrows-C
+    (0x1F900, 0x1F9FF),  # Supplemental Symbols and Pictographs
+    (0x1FA00, 0x1FA6F),  # Chess Symbols
+    (0x1FA70, 0x1FAFF),  # Symbols and Pictographs Extended-A
+]
+
+def create_width_tables():
+    """
+    Creates Unicode character width tables and returns the data structures.
+
+    Returns:
+        tuple: (zero_width_ranges, double_width_ranges)
+    """
+
+    # Width data mapping
+    width_map = {}  # Maps code points to width (0, 1, 2)
+
+    # Mark emoji modifiers as zero-width
+    for start, end in EMOJI_ZERO_WIDTH:
+        for cp in range(start, end + 1):
+            width_map[cp] = 0
+
+    # Mark all regional indicators as single-width as they are usually paired
+    # providing a combined width of 2 when displayed together.
+    start, end = REGIONAL_INDICATORS
+    for cp in range(start, end + 1):
+        width_map[cp] = 1
+
+    # Process all assigned Unicode code points (Basic Multilingual Plane +
+    # Supplementary Planes) Range 0x0 to 0x10FFFF (the full Unicode range)
+    for block_start in range(0, 0x110000, 0x1000):
+        block_end = block_start + 0x1000
+        for cp in range(block_start, block_end):
+            try:
+                char = chr(cp)
+
+                # Skip if already processed
+                if cp in width_map:
+                    continue
+
+                # Check for combining marks and format characters
+                category = unicodedata.category(char)
+
+                # Combining marks
+                if category.startswith('M'):
+                    width_map[cp] = 0
+                    continue
+
+                # Format characters
+                # Since we have no support for bidirectional text, all format
+                # characters (category Cf) can be treated with width 0 (zero)
+                # for simplicity, as they don't need to occupy visual space
+                # in a non-bidirectional text environment.
+                if category == 'Cf':
+                    width_map[cp] = 0
+                    continue
+
+                # Known zero-width characters
+                if cp in KNOWN_ZERO_WIDTH:
+                    width_map[cp] = 0
+                    continue
+
+                # Use East Asian Width property
+                eaw = unicodedata.east_asian_width(char)
+                if eaw in ('F', 'W'):  # Fullwidth or Wide
+                    width_map[cp] = 2
+                elif eaw in ('Na', 'H', 'N', 'A'):  # Narrow, Halfwidth, Neutral, Ambiguous
+                    width_map[cp] = 1
+                else:
+                    # Default to single-width for unknown
+                    width_map[cp] = 1
+
+            except (ValueError, OverflowError):
+                # Skip invalid code points
+                continue
+
+    # Process Emoji - generally double-width
+    for start, end in EMOJI_RANGES:
+        for cp in range(start, end + 1):
+            if cp not in width_map or width_map[cp] != 0:  # Don't override zero-width
+                try:
+                    char = chr(cp)
+                    width_map[cp] = 2
+                except (ValueError, OverflowError):
+                    continue
+
+    # Optimize to create range tables
+    def ranges_optimize(width_data, target_width):
+        points = sorted([cp for cp, width in width_data.items() if width == target_width])
+        if not points:
+            return []
+
+        # Group consecutive code points into ranges
+        ranges = []
+        start = points[0]
+        prev = start
+
+        for cp in points[1:]:
+            if cp > prev + 1:
+                ranges.append((start, prev))
+                start = cp
+            prev = cp
+
+        # Add the last range
+        ranges.append((start, prev))
+        return ranges
+
+    # Extract ranges for each width
+    zero_width_ranges = ranges_optimize(width_map, 0)
+    double_width_ranges = ranges_optimize(width_map, 2)
+
+    return zero_width_ranges, double_width_ranges
+
+def write_tables(zero_width_ranges, double_width_ranges):
+    """
+    Write the generated tables to C header file.
+
+    Args:
+        zero_width_ranges: List of (start, end) ranges for zero-width characters
+        double_width_ranges: List of (start, end) ranges for double-width characters
+    """
+
+    # Function to generate code point description comments
+    def get_code_point_comment(start, end):
+        try:
+            start_char_desc = unicodedata.name(chr(start))
+            if start == end:
+                return f"/* {start_char_desc} */"
+            else:
+                end_char_desc = unicodedata.name(chr(end))
+                return f"/* {start_char_desc} - {end_char_desc} */"
+        except ValueError:
+            if start == end:
+                return f"/* U+{start:04X} */"
+            else:
+                return f"/* U+{start:04X} - U+{end:04X} */"
+
+    # Generate C tables
+    with open(out_file, 'w') as f:
+        f.write(f"""\
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * {out_file} - Unicode character width
+ *
+ * Auto-generated by {this_file}
+ *
+ * Unicode Version: {unicodedata.unidata_version}
+ */
+
+/* Zero-width character ranges */
+static const struct ucs_interval ucs_zero_width_ranges[] = {{
+""")
+
+        for start, end in zero_width_ranges:
+            comment = get_code_point_comment(start, end)
+            f.write(f"\t{{ 0x{start:05X}, 0x{end:05X} }}, {comment}\n")
+
+        f.write("""\
+};
+
+/* Double-width character ranges */
+static const struct ucs_interval ucs_double_width_ranges[] = {
+""")
+
+        for start, end in double_width_ranges:
+            comment = get_code_point_comment(start, end)
+            f.write(f"\t{{ 0x{start:05X}, 0x{end:05X} }}, {comment}\n")
+
+        f.write("};\n")
+
+if __name__ == "__main__":
+    # Write tables to header file
+    zero_width_ranges, double_width_ranges = create_width_tables()
+    write_tables(zero_width_ranges, double_width_ranges)
+
+    # Print summary
+    zero_width_count = sum(end - start + 1 for start, end in zero_width_ranges)
+    double_width_count = sum(end - start + 1 for start, end in double_width_ranges)
+    print(f"Generated {out_file} with:")
+    print(f"- {len(zero_width_ranges)} zero-width ranges covering ~{zero_width_count} code points")
+    print(f"- {len(double_width_ranges)} double-width ranges covering ~{double_width_count} code points")
+    print(f"- Unicode Version: {unicodedata.unidata_version}")
-- 
2.49.0



* [PATCH v2 05/13] vt: create ucs_width_table.h with gen_ucs_width_table.py
  2025-04-15 19:17 [PATCH v2 00/13] vt: implement proper Unicode handling Nicolas Pitre
                   ` (3 preceding siblings ...)
  2025-04-15 19:17 ` [PATCH v2 04/13] vt: introduce gen_ucs_width_table.py to create ucs_width_table.h Nicolas Pitre
@ 2025-04-15 19:17 ` Nicolas Pitre
  2025-04-16  4:20   ` Jiri Slaby
  2025-04-15 19:17 ` [PATCH v2 06/13] vt: use new tables in ucs.c Nicolas Pitre
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 33+ messages in thread
From: Nicolas Pitre @ 2025-04-15 19:17 UTC
  To: Greg Kroah-Hartman, Jiri Slaby; +Cc: Nicolas Pitre, linux-serial, linux-kernel

From: Nicolas Pitre <npitre@baylibre.com>

Provide comprehensive ranges for double-width and zero-width Unicode
code points.

Note: scripts/checkpatch.pl complains about "... exceeds 100 columns".
      Please ignore.

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
---
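A binary search over interval tables only works if the rows are sorted
and do not overlap. Assuming the generated tables are searched the same
way ucs.c searches its double-width ranges, the hypothetical check below
(not part of this series) illustrates that property on the first few rows
of the zero-width table.

```python
# First rows of ucs_zero_width_ranges[] as generated in this patch.
ROWS = [
    (0x000AD, 0x000AD),  # SOFT HYPHEN
    (0x00300, 0x0036F),
    (0x00483, 0x00489),
    (0x00591, 0x005BD),
    (0x005BF, 0x005BF),
]

def check_table(rows):
    """Return True if rows are well-formed, ascending and non-overlapping."""
    well_formed = all(first <= last for first, last in rows)
    ordered = all(prev[1] < cur[0] for prev, cur in zip(rows, rows[1:]))
    return well_formed and ordered

print(check_table(ROWS))
```
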
 drivers/tty/vt/ucs_width_table.h | 445 +++++++++++++++++++++++++++++++
 1 file changed, 445 insertions(+)
 create mode 100644 drivers/tty/vt/ucs_width_table.h

diff --git a/drivers/tty/vt/ucs_width_table.h b/drivers/tty/vt/ucs_width_table.h
new file mode 100644
index 0000000000..9cc86b5cdf
--- /dev/null
+++ b/drivers/tty/vt/ucs_width_table.h
@@ -0,0 +1,445 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * ucs_width_table.h - Unicode character width
+ *
+ * Auto-generated by gen_ucs_width_table.py
+ *
+ * Unicode Version: 16.0.0
+ */
+
+/* Zero-width character ranges */
+static const struct ucs_interval ucs_zero_width_ranges[] = {
+	{ 0x000AD, 0x000AD }, /* SOFT HYPHEN */
+	{ 0x00300, 0x0036F }, /* COMBINING GRAVE ACCENT - COMBINING LATIN SMALL LETTER X */
+	{ 0x00483, 0x00489 }, /* COMBINING CYRILLIC TITLO - COMBINING CYRILLIC MILLIONS SIGN */
+	{ 0x00591, 0x005BD }, /* HEBREW ACCENT ETNAHTA - HEBREW POINT METEG */
+	{ 0x005BF, 0x005BF }, /* HEBREW POINT RAFE */
+	{ 0x005C1, 0x005C2 }, /* HEBREW POINT SHIN DOT - HEBREW POINT SIN DOT */
+	{ 0x005C4, 0x005C5 }, /* HEBREW MARK UPPER DOT - HEBREW MARK LOWER DOT */
+	{ 0x005C7, 0x005C7 }, /* HEBREW POINT QAMATS QATAN */
+	{ 0x00600, 0x00605 }, /* ARABIC NUMBER SIGN - ARABIC NUMBER MARK ABOVE */
+	{ 0x00610, 0x0061A }, /* ARABIC SIGN SALLALLAHOU ALAYHE WASSALLAM - ARABIC SMALL KASRA */
+	{ 0x0061C, 0x0061C }, /* ARABIC LETTER MARK */
+	{ 0x0064B, 0x0065F }, /* ARABIC FATHATAN - ARABIC WAVY HAMZA BELOW */
+	{ 0x00670, 0x00670 }, /* ARABIC LETTER SUPERSCRIPT ALEF */
+	{ 0x006D6, 0x006DD }, /* ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA - ARABIC END OF AYAH */
+	{ 0x006DF, 0x006E4 }, /* ARABIC SMALL HIGH ROUNDED ZERO - ARABIC SMALL HIGH MADDA */
+	{ 0x006E7, 0x006E8 }, /* ARABIC SMALL HIGH YEH - ARABIC SMALL HIGH NOON */
+	{ 0x006EA, 0x006ED }, /* ARABIC EMPTY CENTRE LOW STOP - ARABIC SMALL LOW MEEM */
+	{ 0x0070F, 0x0070F }, /* SYRIAC ABBREVIATION MARK */
+	{ 0x00711, 0x00711 }, /* SYRIAC LETTER SUPERSCRIPT ALAPH */
+	{ 0x00730, 0x0074A }, /* SYRIAC PTHAHA ABOVE - SYRIAC BARREKH */
+	{ 0x007A6, 0x007B0 }, /* THAANA ABAFILI - THAANA SUKUN */
+	{ 0x007EB, 0x007F3 }, /* NKO COMBINING SHORT HIGH TONE - NKO COMBINING DOUBLE DOT ABOVE */
+	{ 0x007FD, 0x007FD }, /* NKO DANTAYALAN */
+	{ 0x00816, 0x00819 }, /* SAMARITAN MARK IN - SAMARITAN MARK DAGESH */
+	{ 0x0081B, 0x00823 }, /* SAMARITAN MARK EPENTHETIC YUT - SAMARITAN VOWEL SIGN A */
+	{ 0x00825, 0x00827 }, /* SAMARITAN VOWEL SIGN SHORT A - SAMARITAN VOWEL SIGN U */
+	{ 0x00829, 0x0082D }, /* SAMARITAN VOWEL SIGN LONG I - SAMARITAN MARK NEQUDAA */
+	{ 0x00859, 0x0085B }, /* MANDAIC AFFRICATION MARK - MANDAIC GEMINATION MARK */
+	{ 0x00890, 0x00891 }, /* ARABIC POUND MARK ABOVE - ARABIC PIASTRE MARK ABOVE */
+	{ 0x00897, 0x0089F }, /* ARABIC PEPET - ARABIC HALF MADDA OVER MADDA */
+	{ 0x008CA, 0x00903 }, /* ARABIC SMALL HIGH FARSI YEH - DEVANAGARI SIGN VISARGA */
+	{ 0x0093A, 0x0093C }, /* DEVANAGARI VOWEL SIGN OE - DEVANAGARI SIGN NUKTA */
+	{ 0x0093E, 0x0094F }, /* DEVANAGARI VOWEL SIGN AA - DEVANAGARI VOWEL SIGN AW */
+	{ 0x00951, 0x00957 }, /* DEVANAGARI STRESS SIGN UDATTA - DEVANAGARI VOWEL SIGN UUE */
+	{ 0x00962, 0x00963 }, /* DEVANAGARI VOWEL SIGN VOCALIC L - DEVANAGARI VOWEL SIGN VOCALIC LL */
+	{ 0x00981, 0x00983 }, /* BENGALI SIGN CANDRABINDU - BENGALI SIGN VISARGA */
+	{ 0x009BC, 0x009BC }, /* BENGALI SIGN NUKTA */
+	{ 0x009BE, 0x009C4 }, /* BENGALI VOWEL SIGN AA - BENGALI VOWEL SIGN VOCALIC RR */
+	{ 0x009C7, 0x009C8 }, /* BENGALI VOWEL SIGN E - BENGALI VOWEL SIGN AI */
+	{ 0x009CB, 0x009CD }, /* BENGALI VOWEL SIGN O - BENGALI SIGN VIRAMA */
+	{ 0x009D7, 0x009D7 }, /* BENGALI AU LENGTH MARK */
+	{ 0x009E2, 0x009E3 }, /* BENGALI VOWEL SIGN VOCALIC L - BENGALI VOWEL SIGN VOCALIC LL */
+	{ 0x009FE, 0x009FE }, /* BENGALI SANDHI MARK */
+	{ 0x00A01, 0x00A03 }, /* GURMUKHI SIGN ADAK BINDI - GURMUKHI SIGN VISARGA */
+	{ 0x00A3C, 0x00A3C }, /* GURMUKHI SIGN NUKTA */
+	{ 0x00A3E, 0x00A42 }, /* GURMUKHI VOWEL SIGN AA - GURMUKHI VOWEL SIGN UU */
+	{ 0x00A47, 0x00A48 }, /* GURMUKHI VOWEL SIGN EE - GURMUKHI VOWEL SIGN AI */
+	{ 0x00A4B, 0x00A4D }, /* GURMUKHI VOWEL SIGN OO - GURMUKHI SIGN VIRAMA */
+	{ 0x00A51, 0x00A51 }, /* GURMUKHI SIGN UDAAT */
+	{ 0x00A70, 0x00A71 }, /* GURMUKHI TIPPI - GURMUKHI ADDAK */
+	{ 0x00A75, 0x00A75 }, /* GURMUKHI SIGN YAKASH */
+	{ 0x00A81, 0x00A83 }, /* GUJARATI SIGN CANDRABINDU - GUJARATI SIGN VISARGA */
+	{ 0x00ABC, 0x00ABC }, /* GUJARATI SIGN NUKTA */
+	{ 0x00ABE, 0x00AC5 }, /* GUJARATI VOWEL SIGN AA - GUJARATI VOWEL SIGN CANDRA E */
+	{ 0x00AC7, 0x00AC9 }, /* GUJARATI VOWEL SIGN E - GUJARATI VOWEL SIGN CANDRA O */
+	{ 0x00ACB, 0x00ACD }, /* GUJARATI VOWEL SIGN O - GUJARATI SIGN VIRAMA */
+	{ 0x00AE2, 0x00AE3 }, /* GUJARATI VOWEL SIGN VOCALIC L - GUJARATI VOWEL SIGN VOCALIC LL */
+	{ 0x00AFA, 0x00AFF }, /* GUJARATI SIGN SUKUN - GUJARATI SIGN TWO-CIRCLE NUKTA ABOVE */
+	{ 0x00B01, 0x00B03 }, /* ORIYA SIGN CANDRABINDU - ORIYA SIGN VISARGA */
+	{ 0x00B3C, 0x00B3C }, /* ORIYA SIGN NUKTA */
+	{ 0x00B3E, 0x00B44 }, /* ORIYA VOWEL SIGN AA - ORIYA VOWEL SIGN VOCALIC RR */
+	{ 0x00B47, 0x00B48 }, /* ORIYA VOWEL SIGN E - ORIYA VOWEL SIGN AI */
+	{ 0x00B4B, 0x00B4D }, /* ORIYA VOWEL SIGN O - ORIYA SIGN VIRAMA */
+	{ 0x00B55, 0x00B57 }, /* ORIYA SIGN OVERLINE - ORIYA AU LENGTH MARK */
+	{ 0x00B62, 0x00B63 }, /* ORIYA VOWEL SIGN VOCALIC L - ORIYA VOWEL SIGN VOCALIC LL */
+	{ 0x00B82, 0x00B82 }, /* TAMIL SIGN ANUSVARA */
+	{ 0x00BBE, 0x00BC2 }, /* TAMIL VOWEL SIGN AA - TAMIL VOWEL SIGN UU */
+	{ 0x00BC6, 0x00BC8 }, /* TAMIL VOWEL SIGN E - TAMIL VOWEL SIGN AI */
+	{ 0x00BCA, 0x00BCD }, /* TAMIL VOWEL SIGN O - TAMIL SIGN VIRAMA */
+	{ 0x00BD7, 0x00BD7 }, /* TAMIL AU LENGTH MARK */
+	{ 0x00C00, 0x00C04 }, /* TELUGU SIGN COMBINING CANDRABINDU ABOVE - TELUGU SIGN COMBINING ANUSVARA ABOVE */
+	{ 0x00C3C, 0x00C3C }, /* TELUGU SIGN NUKTA */
+	{ 0x00C3E, 0x00C44 }, /* TELUGU VOWEL SIGN AA - TELUGU VOWEL SIGN VOCALIC RR */
+	{ 0x00C46, 0x00C48 }, /* TELUGU VOWEL SIGN E - TELUGU VOWEL SIGN AI */
+	{ 0x00C4A, 0x00C4D }, /* TELUGU VOWEL SIGN O - TELUGU SIGN VIRAMA */
+	{ 0x00C55, 0x00C56 }, /* TELUGU LENGTH MARK - TELUGU AI LENGTH MARK */
+	{ 0x00C62, 0x00C63 }, /* TELUGU VOWEL SIGN VOCALIC L - TELUGU VOWEL SIGN VOCALIC LL */
+	{ 0x00C81, 0x00C83 }, /* KANNADA SIGN CANDRABINDU - KANNADA SIGN VISARGA */
+	{ 0x00CBC, 0x00CBC }, /* KANNADA SIGN NUKTA */
+	{ 0x00CBE, 0x00CC4 }, /* KANNADA VOWEL SIGN AA - KANNADA VOWEL SIGN VOCALIC RR */
+	{ 0x00CC6, 0x00CC8 }, /* KANNADA VOWEL SIGN E - KANNADA VOWEL SIGN AI */
+	{ 0x00CCA, 0x00CCD }, /* KANNADA VOWEL SIGN O - KANNADA SIGN VIRAMA */
+	{ 0x00CD5, 0x00CD6 }, /* KANNADA LENGTH MARK - KANNADA AI LENGTH MARK */
+	{ 0x00CE2, 0x00CE3 }, /* KANNADA VOWEL SIGN VOCALIC L - KANNADA VOWEL SIGN VOCALIC LL */
+	{ 0x00CF3, 0x00CF3 }, /* KANNADA SIGN COMBINING ANUSVARA ABOVE RIGHT */
+	{ 0x00D00, 0x00D03 }, /* MALAYALAM SIGN COMBINING ANUSVARA ABOVE - MALAYALAM SIGN VISARGA */
+	{ 0x00D3B, 0x00D3C }, /* MALAYALAM SIGN VERTICAL BAR VIRAMA - MALAYALAM SIGN CIRCULAR VIRAMA */
+	{ 0x00D3E, 0x00D44 }, /* MALAYALAM VOWEL SIGN AA - MALAYALAM VOWEL SIGN VOCALIC RR */
+	{ 0x00D46, 0x00D48 }, /* MALAYALAM VOWEL SIGN E - MALAYALAM VOWEL SIGN AI */
+	{ 0x00D4A, 0x00D4D }, /* MALAYALAM VOWEL SIGN O - MALAYALAM SIGN VIRAMA */
+	{ 0x00D57, 0x00D57 }, /* MALAYALAM AU LENGTH MARK */
+	{ 0x00D62, 0x00D63 }, /* MALAYALAM VOWEL SIGN VOCALIC L - MALAYALAM VOWEL SIGN VOCALIC LL */
+	{ 0x00D81, 0x00D83 }, /* SINHALA SIGN CANDRABINDU - SINHALA SIGN VISARGAYA */
+	{ 0x00DCA, 0x00DCA }, /* SINHALA SIGN AL-LAKUNA */
+	{ 0x00DCF, 0x00DD4 }, /* SINHALA VOWEL SIGN AELA-PILLA - SINHALA VOWEL SIGN KETTI PAA-PILLA */
+	{ 0x00DD6, 0x00DD6 }, /* SINHALA VOWEL SIGN DIGA PAA-PILLA */
+	{ 0x00DD8, 0x00DDF }, /* SINHALA VOWEL SIGN GAETTA-PILLA - SINHALA VOWEL SIGN GAYANUKITTA */
+	{ 0x00DF2, 0x00DF3 }, /* SINHALA VOWEL SIGN DIGA GAETTA-PILLA - SINHALA VOWEL SIGN DIGA GAYANUKITTA */
+	{ 0x00E31, 0x00E31 }, /* THAI CHARACTER MAI HAN-AKAT */
+	{ 0x00E34, 0x00E3A }, /* THAI CHARACTER SARA I - THAI CHARACTER PHINTHU */
+	{ 0x00E47, 0x00E4E }, /* THAI CHARACTER MAITAIKHU - THAI CHARACTER YAMAKKAN */
+	{ 0x00EB1, 0x00EB1 }, /* LAO VOWEL SIGN MAI KAN */
+	{ 0x00EB4, 0x00EBC }, /* LAO VOWEL SIGN I - LAO SEMIVOWEL SIGN LO */
+	{ 0x00EC8, 0x00ECE }, /* LAO TONE MAI EK - LAO YAMAKKAN */
+	{ 0x00F18, 0x00F19 }, /* TIBETAN ASTROLOGICAL SIGN -KHYUD PA - TIBETAN ASTROLOGICAL SIGN SDONG TSHUGS */
+	{ 0x00F35, 0x00F35 }, /* TIBETAN MARK NGAS BZUNG NYI ZLA */
+	{ 0x00F37, 0x00F37 }, /* TIBETAN MARK NGAS BZUNG SGOR RTAGS */
+	{ 0x00F39, 0x00F39 }, /* TIBETAN MARK TSA -PHRU */
+	{ 0x00F3E, 0x00F3F }, /* TIBETAN SIGN YAR TSHES - TIBETAN SIGN MAR TSHES */
+	{ 0x00F71, 0x00F84 }, /* TIBETAN VOWEL SIGN AA - TIBETAN MARK HALANTA */
+	{ 0x00F86, 0x00F87 }, /* TIBETAN SIGN LCI RTAGS - TIBETAN SIGN YANG RTAGS */
+	{ 0x00F8D, 0x00F97 }, /* TIBETAN SUBJOINED SIGN LCE TSA CAN - TIBETAN SUBJOINED LETTER JA */
+	{ 0x00F99, 0x00FBC }, /* TIBETAN SUBJOINED LETTER NYA - TIBETAN SUBJOINED LETTER FIXED-FORM RA */
+	{ 0x00FC6, 0x00FC6 }, /* TIBETAN SYMBOL PADMA GDAN */
+	{ 0x0102B, 0x0103E }, /* MYANMAR VOWEL SIGN TALL AA - MYANMAR CONSONANT SIGN MEDIAL HA */
+	{ 0x01056, 0x01059 }, /* MYANMAR VOWEL SIGN VOCALIC R - MYANMAR VOWEL SIGN VOCALIC LL */
+	{ 0x0105E, 0x01060 }, /* MYANMAR CONSONANT SIGN MON MEDIAL NA - MYANMAR CONSONANT SIGN MON MEDIAL LA */
+	{ 0x01062, 0x01064 }, /* MYANMAR VOWEL SIGN SGAW KAREN EU - MYANMAR TONE MARK SGAW KAREN KE PHO */
+	{ 0x01067, 0x0106D }, /* MYANMAR VOWEL SIGN WESTERN PWO KAREN EU - MYANMAR SIGN WESTERN PWO KAREN TONE-5 */
+	{ 0x01071, 0x01074 }, /* MYANMAR VOWEL SIGN GEBA KAREN I - MYANMAR VOWEL SIGN KAYAH EE */
+	{ 0x01082, 0x0108D }, /* MYANMAR CONSONANT SIGN SHAN MEDIAL WA - MYANMAR SIGN SHAN COUNCIL EMPHATIC TONE */
+	{ 0x0108F, 0x0108F }, /* MYANMAR SIGN RUMAI PALAUNG TONE-5 */
+	{ 0x0109A, 0x0109D }, /* MYANMAR SIGN KHAMTI TONE-1 - MYANMAR VOWEL SIGN AITON AI */
+	{ 0x0135D, 0x0135F }, /* ETHIOPIC COMBINING GEMINATION AND VOWEL LENGTH MARK - ETHIOPIC COMBINING GEMINATION MARK */
+	{ 0x01712, 0x01715 }, /* TAGALOG VOWEL SIGN I - TAGALOG SIGN PAMUDPOD */
+	{ 0x01732, 0x01734 }, /* HANUNOO VOWEL SIGN I - HANUNOO SIGN PAMUDPOD */
+	{ 0x01752, 0x01753 }, /* BUHID VOWEL SIGN I - BUHID VOWEL SIGN U */
+	{ 0x01772, 0x01773 }, /* TAGBANWA VOWEL SIGN I - TAGBANWA VOWEL SIGN U */
+	{ 0x017B4, 0x017D3 }, /* KHMER VOWEL INHERENT AQ - KHMER SIGN BATHAMASAT */
+	{ 0x017DD, 0x017DD }, /* KHMER SIGN ATTHACAN */
+	{ 0x0180B, 0x0180F }, /* MONGOLIAN FREE VARIATION SELECTOR ONE - MONGOLIAN FREE VARIATION SELECTOR FOUR */
+	{ 0x01885, 0x01886 }, /* MONGOLIAN LETTER ALI GALI BALUDA - MONGOLIAN LETTER ALI GALI THREE BALUDA */
+	{ 0x018A9, 0x018A9 }, /* MONGOLIAN LETTER ALI GALI DAGALGA */
+	{ 0x01920, 0x0192B }, /* LIMBU VOWEL SIGN A - LIMBU SUBJOINED LETTER WA */
+	{ 0x01930, 0x0193B }, /* LIMBU SMALL LETTER KA - LIMBU SIGN SA-I */
+	{ 0x01A17, 0x01A1B }, /* BUGINESE VOWEL SIGN I - BUGINESE VOWEL SIGN AE */
+	{ 0x01A55, 0x01A5E }, /* TAI THAM CONSONANT SIGN MEDIAL RA - TAI THAM CONSONANT SIGN SA */
+	{ 0x01A60, 0x01A7C }, /* TAI THAM SIGN SAKOT - TAI THAM SIGN KHUEN-LUE KARAN */
+	{ 0x01A7F, 0x01A7F }, /* TAI THAM COMBINING CRYPTOGRAMMIC DOT */
+	{ 0x01AB0, 0x01ACE }, /* COMBINING DOUBLED CIRCUMFLEX ACCENT - COMBINING LATIN SMALL LETTER INSULAR T */
+	{ 0x01B00, 0x01B04 }, /* BALINESE SIGN ULU RICEM - BALINESE SIGN BISAH */
+	{ 0x01B34, 0x01B44 }, /* BALINESE SIGN REREKAN - BALINESE ADEG ADEG */
+	{ 0x01B6B, 0x01B73 }, /* BALINESE MUSICAL SYMBOL COMBINING TEGEH - BALINESE MUSICAL SYMBOL COMBINING GONG */
+	{ 0x01B80, 0x01B82 }, /* SUNDANESE SIGN PANYECEK - SUNDANESE SIGN PANGWISAD */
+	{ 0x01BA1, 0x01BAD }, /* SUNDANESE CONSONANT SIGN PAMINGKAL - SUNDANESE CONSONANT SIGN PASANGAN WA */
+	{ 0x01BE6, 0x01BF3 }, /* BATAK SIGN TOMPI - BATAK PANONGONAN */
+	{ 0x01C24, 0x01C37 }, /* LEPCHA SUBJOINED LETTER YA - LEPCHA SIGN NUKTA */
+	{ 0x01CD0, 0x01CD2 }, /* VEDIC TONE KARSHANA - VEDIC TONE PRENKHA */
+	{ 0x01CD4, 0x01CE8 }, /* VEDIC SIGN YAJURVEDIC MIDLINE SVARITA - VEDIC SIGN VISARGA ANUDATTA WITH TAIL */
+	{ 0x01CED, 0x01CED }, /* VEDIC SIGN TIRYAK */
+	{ 0x01CF4, 0x01CF4 }, /* VEDIC TONE CANDRA ABOVE */
+	{ 0x01CF7, 0x01CF9 }, /* VEDIC SIGN ATIKRAMA - VEDIC TONE DOUBLE RING ABOVE */
+	{ 0x01DC0, 0x01DFF }, /* COMBINING DOTTED GRAVE ACCENT - COMBINING RIGHT ARROWHEAD AND DOWN ARROWHEAD BELOW */
+	{ 0x0200B, 0x0200F }, /* ZERO WIDTH SPACE - RIGHT-TO-LEFT MARK */
+	{ 0x0202A, 0x0202E }, /* LEFT-TO-RIGHT EMBEDDING - RIGHT-TO-LEFT OVERRIDE */
+	{ 0x02060, 0x02064 }, /* WORD JOINER - INVISIBLE PLUS */
+	{ 0x02066, 0x0206F }, /* LEFT-TO-RIGHT ISOLATE - NOMINAL DIGIT SHAPES */
+	{ 0x020D0, 0x020F0 }, /* COMBINING LEFT HARPOON ABOVE - COMBINING ASTERISK ABOVE */
+	{ 0x02640, 0x02640 }, /* FEMALE SIGN */
+	{ 0x02642, 0x02642 }, /* MALE SIGN */
+	{ 0x026A7, 0x026A7 }, /* MALE WITH STROKE AND MALE AND FEMALE SIGN */
+	{ 0x02CEF, 0x02CF1 }, /* COPTIC COMBINING NI ABOVE - COPTIC COMBINING SPIRITUS LENIS */
+	{ 0x02D7F, 0x02D7F }, /* TIFINAGH CONSONANT JOINER */
+	{ 0x02DE0, 0x02DFF }, /* COMBINING CYRILLIC LETTER BE - COMBINING CYRILLIC LETTER IOTIFIED BIG YUS */
+	{ 0x0302A, 0x0302F }, /* IDEOGRAPHIC LEVEL TONE MARK - HANGUL DOUBLE DOT TONE MARK */
+	{ 0x03099, 0x0309A }, /* COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK - COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK */
+	{ 0x0A66F, 0x0A672 }, /* COMBINING CYRILLIC VZMET - COMBINING CYRILLIC THOUSAND MILLIONS SIGN */
+	{ 0x0A674, 0x0A67D }, /* COMBINING CYRILLIC LETTER UKRAINIAN IE - COMBINING CYRILLIC PAYEROK */
+	{ 0x0A69E, 0x0A69F }, /* COMBINING CYRILLIC LETTER EF - COMBINING CYRILLIC LETTER IOTIFIED E */
+	{ 0x0A6F0, 0x0A6F1 }, /* BAMUM COMBINING MARK KOQNDON - BAMUM COMBINING MARK TUKWENTIS */
+	{ 0x0A802, 0x0A802 }, /* SYLOTI NAGRI SIGN DVISVARA */
+	{ 0x0A806, 0x0A806 }, /* SYLOTI NAGRI SIGN HASANTA */
+	{ 0x0A80B, 0x0A80B }, /* SYLOTI NAGRI SIGN ANUSVARA */
+	{ 0x0A823, 0x0A827 }, /* SYLOTI NAGRI VOWEL SIGN A - SYLOTI NAGRI VOWEL SIGN OO */
+	{ 0x0A82C, 0x0A82C }, /* SYLOTI NAGRI SIGN ALTERNATE HASANTA */
+	{ 0x0A880, 0x0A881 }, /* SAURASHTRA SIGN ANUSVARA - SAURASHTRA SIGN VISARGA */
+	{ 0x0A8B4, 0x0A8C5 }, /* SAURASHTRA CONSONANT SIGN HAARU - SAURASHTRA SIGN CANDRABINDU */
+	{ 0x0A8E0, 0x0A8F1 }, /* COMBINING DEVANAGARI DIGIT ZERO - COMBINING DEVANAGARI SIGN AVAGRAHA */
+	{ 0x0A8FF, 0x0A8FF }, /* DEVANAGARI VOWEL SIGN AY */
+	{ 0x0A926, 0x0A92D }, /* KAYAH LI VOWEL UE - KAYAH LI TONE CALYA PLOPHU */
+	{ 0x0A947, 0x0A953 }, /* REJANG VOWEL SIGN I - REJANG VIRAMA */
+	{ 0x0A980, 0x0A983 }, /* JAVANESE SIGN PANYANGGA - JAVANESE SIGN WIGNYAN */
+	{ 0x0A9B3, 0x0A9C0 }, /* JAVANESE SIGN CECAK TELU - JAVANESE PANGKON */
+	{ 0x0A9E5, 0x0A9E5 }, /* MYANMAR SIGN SHAN SAW */
+	{ 0x0AA29, 0x0AA36 }, /* CHAM VOWEL SIGN AA - CHAM CONSONANT SIGN WA */
+	{ 0x0AA43, 0x0AA43 }, /* CHAM CONSONANT SIGN FINAL NG */
+	{ 0x0AA4C, 0x0AA4D }, /* CHAM CONSONANT SIGN FINAL M - CHAM CONSONANT SIGN FINAL H */
+	{ 0x0AA7B, 0x0AA7D }, /* MYANMAR SIGN PAO KAREN TONE - MYANMAR SIGN TAI LAING TONE-5 */
+	{ 0x0AAB0, 0x0AAB0 }, /* TAI VIET MAI KANG */
+	{ 0x0AAB2, 0x0AAB4 }, /* TAI VIET VOWEL I - TAI VIET VOWEL U */
+	{ 0x0AAB7, 0x0AAB8 }, /* TAI VIET MAI KHIT - TAI VIET VOWEL IA */
+	{ 0x0AABE, 0x0AABF }, /* TAI VIET VOWEL AM - TAI VIET TONE MAI EK */
+	{ 0x0AAC1, 0x0AAC1 }, /* TAI VIET TONE MAI THO */
+	{ 0x0AAEB, 0x0AAEF }, /* MEETEI MAYEK VOWEL SIGN II - MEETEI MAYEK VOWEL SIGN AAU */
+	{ 0x0AAF5, 0x0AAF6 }, /* MEETEI MAYEK VOWEL SIGN VISARGA - MEETEI MAYEK VIRAMA */
+	{ 0x0ABE3, 0x0ABEA }, /* MEETEI MAYEK VOWEL SIGN ONAP - MEETEI MAYEK VOWEL SIGN NUNG */
+	{ 0x0ABEC, 0x0ABED }, /* MEETEI MAYEK LUM IYEK - MEETEI MAYEK APUN IYEK */
+	{ 0x0FB1E, 0x0FB1E }, /* HEBREW POINT JUDEO-SPANISH VARIKA */
+	{ 0x0FE00, 0x0FE0F }, /* VARIATION SELECTOR-1 - VARIATION SELECTOR-16 */
+	{ 0x0FE20, 0x0FE2F }, /* COMBINING LIGATURE LEFT HALF - COMBINING CYRILLIC TITLO RIGHT HALF */
+	{ 0x0FEFF, 0x0FEFF }, /* ZERO WIDTH NO-BREAK SPACE */
+	{ 0x0FFF9, 0x0FFFB }, /* INTERLINEAR ANNOTATION ANCHOR - INTERLINEAR ANNOTATION TERMINATOR */
+	{ 0x101FD, 0x101FD }, /* PHAISTOS DISC SIGN COMBINING OBLIQUE STROKE */
+	{ 0x102E0, 0x102E0 }, /* COPTIC EPACT THOUSANDS MARK */
+	{ 0x10376, 0x1037A }, /* COMBINING OLD PERMIC LETTER AN - COMBINING OLD PERMIC LETTER SII */
+	{ 0x10A01, 0x10A03 }, /* KHAROSHTHI VOWEL SIGN I - KHAROSHTHI VOWEL SIGN VOCALIC R */
+	{ 0x10A05, 0x10A06 }, /* KHAROSHTHI VOWEL SIGN E - KHAROSHTHI VOWEL SIGN O */
+	{ 0x10A0C, 0x10A0F }, /* KHAROSHTHI VOWEL LENGTH MARK - KHAROSHTHI SIGN VISARGA */
+	{ 0x10A38, 0x10A3A }, /* KHAROSHTHI SIGN BAR ABOVE - KHAROSHTHI SIGN DOT BELOW */
+	{ 0x10A3F, 0x10A3F }, /* KHAROSHTHI VIRAMA */
+	{ 0x10AE5, 0x10AE6 }, /* MANICHAEAN ABBREVIATION MARK ABOVE - MANICHAEAN ABBREVIATION MARK BELOW */
+	{ 0x10D24, 0x10D27 }, /* HANIFI ROHINGYA SIGN HARBAHAY - HANIFI ROHINGYA SIGN TASSI */
+	{ 0x10D69, 0x10D6D }, /* GARAY VOWEL SIGN E - GARAY CONSONANT NASALIZATION MARK */
+	{ 0x10EAB, 0x10EAC }, /* YEZIDI COMBINING HAMZA MARK - YEZIDI COMBINING MADDA MARK */
+	{ 0x10EFC, 0x10EFF }, /* ARABIC COMBINING ALEF OVERLAY - ARABIC SMALL LOW WORD MADDA */
+	{ 0x10F46, 0x10F50 }, /* SOGDIAN COMBINING DOT BELOW - SOGDIAN COMBINING STROKE BELOW */
+	{ 0x10F82, 0x10F85 }, /* OLD UYGHUR COMBINING DOT ABOVE - OLD UYGHUR COMBINING TWO DOTS BELOW */
+	{ 0x11000, 0x11002 }, /* BRAHMI SIGN CANDRABINDU - BRAHMI SIGN VISARGA */
+	{ 0x11038, 0x11046 }, /* BRAHMI VOWEL SIGN AA - BRAHMI VIRAMA */
+	{ 0x11070, 0x11070 }, /* BRAHMI SIGN OLD TAMIL VIRAMA */
+	{ 0x11073, 0x11074 }, /* BRAHMI VOWEL SIGN OLD TAMIL SHORT E - BRAHMI VOWEL SIGN OLD TAMIL SHORT O */
+	{ 0x1107F, 0x11082 }, /* BRAHMI NUMBER JOINER - KAITHI SIGN VISARGA */
+	{ 0x110B0, 0x110BA }, /* KAITHI VOWEL SIGN AA - KAITHI SIGN NUKTA */
+	{ 0x110BD, 0x110BD }, /* KAITHI NUMBER SIGN */
+	{ 0x110C2, 0x110C2 }, /* KAITHI VOWEL SIGN VOCALIC R */
+	{ 0x110CD, 0x110CD }, /* KAITHI NUMBER SIGN ABOVE */
+	{ 0x11100, 0x11102 }, /* CHAKMA SIGN CANDRABINDU - CHAKMA SIGN VISARGA */
+	{ 0x11127, 0x11134 }, /* CHAKMA VOWEL SIGN A - CHAKMA MAAYYAA */
+	{ 0x11145, 0x11146 }, /* CHAKMA VOWEL SIGN AA - CHAKMA VOWEL SIGN EI */
+	{ 0x11173, 0x11173 }, /* MAHAJANI SIGN NUKTA */
+	{ 0x11180, 0x11182 }, /* SHARADA SIGN CANDRABINDU - SHARADA SIGN VISARGA */
+	{ 0x111B3, 0x111C0 }, /* SHARADA VOWEL SIGN AA - SHARADA SIGN VIRAMA */
+	{ 0x111C9, 0x111CC }, /* SHARADA SANDHI MARK - SHARADA EXTRA SHORT VOWEL MARK */
+	{ 0x111CE, 0x111CF }, /* SHARADA VOWEL SIGN PRISHTHAMATRA E - SHARADA SIGN INVERTED CANDRABINDU */
+	{ 0x1122C, 0x11237 }, /* KHOJKI VOWEL SIGN AA - KHOJKI SIGN SHADDA */
+	{ 0x1123E, 0x1123E }, /* KHOJKI SIGN SUKUN */
+	{ 0x11241, 0x11241 }, /* KHOJKI VOWEL SIGN VOCALIC R */
+	{ 0x112DF, 0x112EA }, /* KHUDAWADI SIGN ANUSVARA - KHUDAWADI SIGN VIRAMA */
+	{ 0x11300, 0x11303 }, /* GRANTHA SIGN COMBINING ANUSVARA ABOVE - GRANTHA SIGN VISARGA */
+	{ 0x1133B, 0x1133C }, /* COMBINING BINDU BELOW - GRANTHA SIGN NUKTA */
+	{ 0x1133E, 0x11344 }, /* GRANTHA VOWEL SIGN AA - GRANTHA VOWEL SIGN VOCALIC RR */
+	{ 0x11347, 0x11348 }, /* GRANTHA VOWEL SIGN EE - GRANTHA VOWEL SIGN AI */
+	{ 0x1134B, 0x1134D }, /* GRANTHA VOWEL SIGN OO - GRANTHA SIGN VIRAMA */
+	{ 0x11357, 0x11357 }, /* GRANTHA AU LENGTH MARK */
+	{ 0x11362, 0x11363 }, /* GRANTHA VOWEL SIGN VOCALIC L - GRANTHA VOWEL SIGN VOCALIC LL */
+	{ 0x11366, 0x1136C }, /* COMBINING GRANTHA DIGIT ZERO - COMBINING GRANTHA DIGIT SIX */
+	{ 0x11370, 0x11374 }, /* COMBINING GRANTHA LETTER A - COMBINING GRANTHA LETTER PA */
+	{ 0x113B8, 0x113C0 }, /* TULU-TIGALARI VOWEL SIGN AA - TULU-TIGALARI VOWEL SIGN VOCALIC LL */
+	{ 0x113C2, 0x113C2 }, /* TULU-TIGALARI VOWEL SIGN EE */
+	{ 0x113C5, 0x113C5 }, /* TULU-TIGALARI VOWEL SIGN AI */
+	{ 0x113C7, 0x113CA }, /* TULU-TIGALARI VOWEL SIGN OO - TULU-TIGALARI SIGN CANDRA ANUNASIKA */
+	{ 0x113CC, 0x113D0 }, /* TULU-TIGALARI SIGN ANUSVARA - TULU-TIGALARI CONJOINER */
+	{ 0x113D2, 0x113D2 }, /* TULU-TIGALARI GEMINATION MARK */
+	{ 0x113E1, 0x113E2 }, /* TULU-TIGALARI VEDIC TONE SVARITA - TULU-TIGALARI VEDIC TONE ANUDATTA */
+	{ 0x11435, 0x11446 }, /* NEWA VOWEL SIGN AA - NEWA SIGN NUKTA */
+	{ 0x1145E, 0x1145E }, /* NEWA SANDHI MARK */
+	{ 0x114B0, 0x114C3 }, /* TIRHUTA VOWEL SIGN AA - TIRHUTA SIGN NUKTA */
+	{ 0x115AF, 0x115B5 }, /* SIDDHAM VOWEL SIGN AA - SIDDHAM VOWEL SIGN VOCALIC RR */
+	{ 0x115B8, 0x115C0 }, /* SIDDHAM VOWEL SIGN E - SIDDHAM SIGN NUKTA */
+	{ 0x115DC, 0x115DD }, /* SIDDHAM VOWEL SIGN ALTERNATE U - SIDDHAM VOWEL SIGN ALTERNATE UU */
+	{ 0x11630, 0x11640 }, /* MODI VOWEL SIGN AA - MODI SIGN ARDHACANDRA */
+	{ 0x116AB, 0x116B7 }, /* TAKRI SIGN ANUSVARA - TAKRI SIGN NUKTA */
+	{ 0x1171D, 0x1172B }, /* AHOM CONSONANT SIGN MEDIAL LA - AHOM SIGN KILLER */
+	{ 0x1182C, 0x1183A }, /* DOGRA VOWEL SIGN AA - DOGRA SIGN NUKTA */
+	{ 0x11930, 0x11935 }, /* DIVES AKURU VOWEL SIGN AA - DIVES AKURU VOWEL SIGN E */
+	{ 0x11937, 0x11938 }, /* DIVES AKURU VOWEL SIGN AI - DIVES AKURU VOWEL SIGN O */
+	{ 0x1193B, 0x1193E }, /* DIVES AKURU SIGN ANUSVARA - DIVES AKURU VIRAMA */
+	{ 0x11940, 0x11940 }, /* DIVES AKURU MEDIAL YA */
+	{ 0x11942, 0x11943 }, /* DIVES AKURU MEDIAL RA - DIVES AKURU SIGN NUKTA */
+	{ 0x119D1, 0x119D7 }, /* NANDINAGARI VOWEL SIGN AA - NANDINAGARI VOWEL SIGN VOCALIC RR */
+	{ 0x119DA, 0x119E0 }, /* NANDINAGARI VOWEL SIGN E - NANDINAGARI SIGN VIRAMA */
+	{ 0x119E4, 0x119E4 }, /* NANDINAGARI VOWEL SIGN PRISHTHAMATRA E */
+	{ 0x11A01, 0x11A0A }, /* ZANABAZAR SQUARE VOWEL SIGN I - ZANABAZAR SQUARE VOWEL LENGTH MARK */
+	{ 0x11A33, 0x11A39 }, /* ZANABAZAR SQUARE FINAL CONSONANT MARK - ZANABAZAR SQUARE SIGN VISARGA */
+	{ 0x11A3B, 0x11A3E }, /* ZANABAZAR SQUARE CLUSTER-FINAL LETTER YA - ZANABAZAR SQUARE CLUSTER-FINAL LETTER VA */
+	{ 0x11A47, 0x11A47 }, /* ZANABAZAR SQUARE SUBJOINER */
+	{ 0x11A51, 0x11A5B }, /* SOYOMBO VOWEL SIGN I - SOYOMBO VOWEL LENGTH MARK */
+	{ 0x11A8A, 0x11A99 }, /* SOYOMBO FINAL CONSONANT SIGN G - SOYOMBO SUBJOINER */
+	{ 0x11C2F, 0x11C36 }, /* BHAIKSUKI VOWEL SIGN AA - BHAIKSUKI VOWEL SIGN VOCALIC L */
+	{ 0x11C38, 0x11C3F }, /* BHAIKSUKI VOWEL SIGN E - BHAIKSUKI SIGN VIRAMA */
+	{ 0x11C92, 0x11CA7 }, /* MARCHEN SUBJOINED LETTER KA - MARCHEN SUBJOINED LETTER ZA */
+	{ 0x11CA9, 0x11CB6 }, /* MARCHEN SUBJOINED LETTER YA - MARCHEN SIGN CANDRABINDU */
+	{ 0x11D31, 0x11D36 }, /* MASARAM GONDI VOWEL SIGN AA - MASARAM GONDI VOWEL SIGN VOCALIC R */
+	{ 0x11D3A, 0x11D3A }, /* MASARAM GONDI VOWEL SIGN E */
+	{ 0x11D3C, 0x11D3D }, /* MASARAM GONDI VOWEL SIGN AI - MASARAM GONDI VOWEL SIGN O */
+	{ 0x11D3F, 0x11D45 }, /* MASARAM GONDI VOWEL SIGN AU - MASARAM GONDI VIRAMA */
+	{ 0x11D47, 0x11D47 }, /* MASARAM GONDI RA-KARA */
+	{ 0x11D8A, 0x11D8E }, /* GUNJALA GONDI VOWEL SIGN AA - GUNJALA GONDI VOWEL SIGN UU */
+	{ 0x11D90, 0x11D91 }, /* GUNJALA GONDI VOWEL SIGN EE - GUNJALA GONDI VOWEL SIGN AI */
+	{ 0x11D93, 0x11D97 }, /* GUNJALA GONDI VOWEL SIGN OO - GUNJALA GONDI VIRAMA */
+	{ 0x11EF3, 0x11EF6 }, /* MAKASAR VOWEL SIGN I - MAKASAR VOWEL SIGN O */
+	{ 0x11F00, 0x11F01 }, /* KAWI SIGN CANDRABINDU - KAWI SIGN ANUSVARA */
+	{ 0x11F03, 0x11F03 }, /* KAWI SIGN VISARGA */
+	{ 0x11F34, 0x11F3A }, /* KAWI VOWEL SIGN AA - KAWI VOWEL SIGN VOCALIC R */
+	{ 0x11F3E, 0x11F42 }, /* KAWI VOWEL SIGN E - KAWI CONJOINER */
+	{ 0x11F5A, 0x11F5A }, /* KAWI SIGN NUKTA */
+	{ 0x13430, 0x13440 }, /* EGYPTIAN HIEROGLYPH VERTICAL JOINER - EGYPTIAN HIEROGLYPH MIRROR HORIZONTALLY */
+	{ 0x13447, 0x13455 }, /* EGYPTIAN HIEROGLYPH MODIFIER DAMAGED AT TOP START - EGYPTIAN HIEROGLYPH MODIFIER DAMAGED */
+	{ 0x1611E, 0x1612F }, /* GURUNG KHEMA VOWEL SIGN AA - GURUNG KHEMA SIGN THOLHOMA */
+	{ 0x16AF0, 0x16AF4 }, /* BASSA VAH COMBINING HIGH TONE - BASSA VAH COMBINING HIGH-LOW TONE */
+	{ 0x16B30, 0x16B36 }, /* PAHAWH HMONG MARK CIM TUB - PAHAWH HMONG MARK CIM TAUM */
+	{ 0x16F4F, 0x16F4F }, /* MIAO SIGN CONSONANT MODIFIER BAR */
+	{ 0x16F51, 0x16F87 }, /* MIAO SIGN ASPIRATION - MIAO VOWEL SIGN UI */
+	{ 0x16F8F, 0x16F92 }, /* MIAO TONE RIGHT - MIAO TONE BELOW */
+	{ 0x16FE4, 0x16FE4 }, /* KHITAN SMALL SCRIPT FILLER */
+	{ 0x16FF0, 0x16FF1 }, /* VIETNAMESE ALTERNATE READING MARK CA - VIETNAMESE ALTERNATE READING MARK NHAY */
+	{ 0x1BC9D, 0x1BC9E }, /* DUPLOYAN THICK LETTER SELECTOR - DUPLOYAN DOUBLE MARK */
+	{ 0x1BCA0, 0x1BCA3 }, /* SHORTHAND FORMAT LETTER OVERLAP - SHORTHAND FORMAT UP STEP */
+	{ 0x1CF00, 0x1CF2D }, /* ZNAMENNY COMBINING MARK GORAZDO NIZKO S KRYZHEM ON LEFT - ZNAMENNY COMBINING MARK KRYZH ON LEFT */
+	{ 0x1CF30, 0x1CF46 }, /* ZNAMENNY COMBINING TONAL RANGE MARK MRACHNO - ZNAMENNY PRIZNAK MODIFIER ROG */
+	{ 0x1D165, 0x1D169 }, /* MUSICAL SYMBOL COMBINING STEM - MUSICAL SYMBOL COMBINING TREMOLO-3 */
+	{ 0x1D16D, 0x1D182 }, /* MUSICAL SYMBOL COMBINING AUGMENTATION DOT - MUSICAL SYMBOL COMBINING LOURE */
+	{ 0x1D185, 0x1D18B }, /* MUSICAL SYMBOL COMBINING DOIT - MUSICAL SYMBOL COMBINING TRIPLE TONGUE */
+	{ 0x1D1AA, 0x1D1AD }, /* MUSICAL SYMBOL COMBINING DOWN BOW - MUSICAL SYMBOL COMBINING SNAP PIZZICATO */
+	{ 0x1D242, 0x1D244 }, /* COMBINING GREEK MUSICAL TRISEME - COMBINING GREEK MUSICAL PENTASEME */
+	{ 0x1DA00, 0x1DA36 }, /* SIGNWRITING HEAD RIM - SIGNWRITING AIR SUCKING IN */
+	{ 0x1DA3B, 0x1DA6C }, /* SIGNWRITING MOUTH CLOSED NEUTRAL - SIGNWRITING EXCITEMENT */
+	{ 0x1DA75, 0x1DA75 }, /* SIGNWRITING UPPER BODY TILTING FROM HIP JOINTS */
+	{ 0x1DA84, 0x1DA84 }, /* SIGNWRITING LOCATION HEAD NECK */
+	{ 0x1DA9B, 0x1DA9F }, /* SIGNWRITING FILL MODIFIER-2 - SIGNWRITING FILL MODIFIER-6 */
+	{ 0x1DAA1, 0x1DAAF }, /* SIGNWRITING ROTATION MODIFIER-2 - SIGNWRITING ROTATION MODIFIER-16 */
+	{ 0x1E000, 0x1E006 }, /* COMBINING GLAGOLITIC LETTER AZU - COMBINING GLAGOLITIC LETTER ZHIVETE */
+	{ 0x1E008, 0x1E018 }, /* COMBINING GLAGOLITIC LETTER ZEMLJA - COMBINING GLAGOLITIC LETTER HERU */
+	{ 0x1E01B, 0x1E021 }, /* COMBINING GLAGOLITIC LETTER SHTA - COMBINING GLAGOLITIC LETTER YATI */
+	{ 0x1E023, 0x1E024 }, /* COMBINING GLAGOLITIC LETTER YU - COMBINING GLAGOLITIC LETTER SMALL YUS */
+	{ 0x1E026, 0x1E02A }, /* COMBINING GLAGOLITIC LETTER YO - COMBINING GLAGOLITIC LETTER FITA */
+	{ 0x1E08F, 0x1E08F }, /* COMBINING CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I */
+	{ 0x1E130, 0x1E136 }, /* NYIAKENG PUACHUE HMONG TONE-B - NYIAKENG PUACHUE HMONG TONE-D */
+	{ 0x1E2AE, 0x1E2AE }, /* TOTO SIGN RISING TONE */
+	{ 0x1E2EC, 0x1E2EF }, /* WANCHO TONE TUP - WANCHO TONE KOINI */
+	{ 0x1E4EC, 0x1E4EF }, /* NAG MUNDARI SIGN MUHOR - NAG MUNDARI SIGN SUTUH */
+	{ 0x1E5EE, 0x1E5EF }, /* OL ONAL SIGN MU - OL ONAL SIGN IKIR */
+	{ 0x1E8D0, 0x1E8D6 }, /* MENDE KIKAKUI COMBINING NUMBER TEENS - MENDE KIKAKUI COMBINING NUMBER MILLIONS */
+	{ 0x1E944, 0x1E94A }, /* ADLAM ALIF LENGTHENER - ADLAM NUKTA */
+	{ 0x1F3FB, 0x1F3FF }, /* EMOJI MODIFIER FITZPATRICK TYPE-1-2 - EMOJI MODIFIER FITZPATRICK TYPE-6 */
+	{ 0x1F9B0, 0x1F9B3 }, /* EMOJI COMPONENT RED HAIR - EMOJI COMPONENT WHITE HAIR */
+	{ 0xE0001, 0xE0001 }, /* LANGUAGE TAG */
+	{ 0xE0020, 0xE007F }, /* TAG SPACE - CANCEL TAG */
+	{ 0xE0100, 0xE01EF }, /* VARIATION SELECTOR-17 - VARIATION SELECTOR-256 */
+};
+
+/* Double-width character ranges */
+static const struct ucs_interval ucs_double_width_ranges[] = {
+	{ 0x01100, 0x0115F }, /* HANGUL CHOSEONG KIYEOK - HANGUL CHOSEONG FILLER */
+	{ 0x0231A, 0x0231B }, /* WATCH - HOURGLASS */
+	{ 0x02329, 0x0232A }, /* LEFT-POINTING ANGLE BRACKET - RIGHT-POINTING ANGLE BRACKET */
+	{ 0x023E9, 0x023EC }, /* BLACK RIGHT-POINTING DOUBLE TRIANGLE - BLACK DOWN-POINTING DOUBLE TRIANGLE */
+	{ 0x023F0, 0x023F0 }, /* ALARM CLOCK */
+	{ 0x023F3, 0x023F3 }, /* HOURGLASS WITH FLOWING SAND */
+	{ 0x025FD, 0x025FE }, /* WHITE MEDIUM SMALL SQUARE - BLACK MEDIUM SMALL SQUARE */
+	{ 0x02614, 0x02615 }, /* UMBRELLA WITH RAIN DROPS - HOT BEVERAGE */
+	{ 0x02630, 0x02637 }, /* TRIGRAM FOR HEAVEN - TRIGRAM FOR EARTH */
+	{ 0x02648, 0x02653 }, /* ARIES - PISCES */
+	{ 0x0267F, 0x0267F }, /* WHEELCHAIR SYMBOL */
+	{ 0x0268A, 0x0268F }, /* MONOGRAM FOR YANG - DIGRAM FOR GREATER YIN */
+	{ 0x02693, 0x02693 }, /* ANCHOR */
+	{ 0x026A1, 0x026A1 }, /* HIGH VOLTAGE SIGN */
+	{ 0x026AA, 0x026AB }, /* MEDIUM WHITE CIRCLE - MEDIUM BLACK CIRCLE */
+	{ 0x026BD, 0x026BE }, /* SOCCER BALL - BASEBALL */
+	{ 0x026C4, 0x026C5 }, /* SNOWMAN WITHOUT SNOW - SUN BEHIND CLOUD */
+	{ 0x026CE, 0x026CE }, /* OPHIUCHUS */
+	{ 0x026D4, 0x026D4 }, /* NO ENTRY */
+	{ 0x026EA, 0x026EA }, /* CHURCH */
+	{ 0x026F2, 0x026F3 }, /* FOUNTAIN - FLAG IN HOLE */
+	{ 0x026F5, 0x026F5 }, /* SAILBOAT */
+	{ 0x026FA, 0x026FA }, /* TENT */
+	{ 0x026FD, 0x026FD }, /* FUEL PUMP */
+	{ 0x02705, 0x02705 }, /* WHITE HEAVY CHECK MARK */
+	{ 0x0270A, 0x0270B }, /* RAISED FIST - RAISED HAND */
+	{ 0x02728, 0x02728 }, /* SPARKLES */
+	{ 0x0274C, 0x0274C }, /* CROSS MARK */
+	{ 0x0274E, 0x0274E }, /* NEGATIVE SQUARED CROSS MARK */
+	{ 0x02753, 0x02755 }, /* BLACK QUESTION MARK ORNAMENT - WHITE EXCLAMATION MARK ORNAMENT */
+	{ 0x02757, 0x02757 }, /* HEAVY EXCLAMATION MARK SYMBOL */
+	{ 0x02795, 0x02797 }, /* HEAVY PLUS SIGN - HEAVY DIVISION SIGN */
+	{ 0x027B0, 0x027B0 }, /* CURLY LOOP */
+	{ 0x027BF, 0x027BF }, /* DOUBLE CURLY LOOP */
+	{ 0x02B1B, 0x02B1C }, /* BLACK LARGE SQUARE - WHITE LARGE SQUARE */
+	{ 0x02B50, 0x02B50 }, /* WHITE MEDIUM STAR */
+	{ 0x02B55, 0x02B55 }, /* HEAVY LARGE CIRCLE */
+	{ 0x02E80, 0x02E99 }, /* CJK RADICAL REPEAT - CJK RADICAL RAP */
+	{ 0x02E9B, 0x02EF3 }, /* CJK RADICAL CHOKE - CJK RADICAL C-SIMPLIFIED TURTLE */
+	{ 0x02F00, 0x02FD5 }, /* KANGXI RADICAL ONE - KANGXI RADICAL FLUTE */
+	{ 0x02FF0, 0x03029 }, /* IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT - HANGZHOU NUMERAL NINE */
+	{ 0x03030, 0x0303E }, /* WAVY DASH - IDEOGRAPHIC VARIATION INDICATOR */
+	{ 0x03041, 0x03096 }, /* HIRAGANA LETTER SMALL A - HIRAGANA LETTER SMALL KE */
+	{ 0x0309B, 0x030FF }, /* KATAKANA-HIRAGANA VOICED SOUND MARK - KATAKANA DIGRAPH KOTO */
+	{ 0x03105, 0x0312F }, /* BOPOMOFO LETTER B - BOPOMOFO LETTER NN */
+	{ 0x03131, 0x0318E }, /* HANGUL LETTER KIYEOK - HANGUL LETTER ARAEAE */
+	{ 0x03190, 0x031E5 }, /* IDEOGRAPHIC ANNOTATION LINKING MARK - CJK STROKE SZP */
+	{ 0x031EF, 0x0321E }, /* IDEOGRAPHIC DESCRIPTION CHARACTER SUBTRACTION - PARENTHESIZED KOREAN CHARACTER O HU */
+	{ 0x03220, 0x03247 }, /* PARENTHESIZED IDEOGRAPH ONE - CIRCLED IDEOGRAPH KOTO */
+	{ 0x03250, 0x0A48C }, /* PARTNERSHIP SIGN - YI SYLLABLE YYR */
+	{ 0x0A490, 0x0A4C6 }, /* YI RADICAL QOT - YI RADICAL KE */
+	{ 0x0A960, 0x0A97C }, /* HANGUL CHOSEONG TIKEUT-MIEUM - HANGUL CHOSEONG SSANGYEORINHIEUH */
+	{ 0x0AC00, 0x0D7A3 }, /* HANGUL SYLLABLE GA - HANGUL SYLLABLE HIH */
+	{ 0x0F900, 0x0FAFF }, /* U+F900 - U+FAFF */
+	{ 0x0FE10, 0x0FE19 }, /* PRESENTATION FORM FOR VERTICAL COMMA - PRESENTATION FORM FOR VERTICAL HORIZONTAL ELLIPSIS */
+	{ 0x0FE30, 0x0FE52 }, /* PRESENTATION FORM FOR VERTICAL TWO DOT LEADER - SMALL FULL STOP */
+	{ 0x0FE54, 0x0FE66 }, /* SMALL SEMICOLON - SMALL EQUALS SIGN */
+	{ 0x0FE68, 0x0FE6B }, /* SMALL REVERSE SOLIDUS - SMALL COMMERCIAL AT */
+	{ 0x0FF01, 0x0FF60 }, /* FULLWIDTH EXCLAMATION MARK - FULLWIDTH RIGHT WHITE PARENTHESIS */
+	{ 0x0FFE0, 0x0FFE6 }, /* FULLWIDTH CENT SIGN - FULLWIDTH WON SIGN */
+	{ 0x16FE0, 0x16FE3 }, /* TANGUT ITERATION MARK - OLD CHINESE ITERATION MARK */
+	{ 0x17000, 0x187F7 }, /* U+17000 - U+187F7 */
+	{ 0x18800, 0x18CD5 }, /* TANGUT COMPONENT-001 - KHITAN SMALL SCRIPT CHARACTER-18CD5 */
+	{ 0x18CFF, 0x18D08 }, /* U+18CFF - U+18D08 */
+	{ 0x1AFF0, 0x1AFF3 }, /* KATAKANA LETTER MINNAN TONE-2 - KATAKANA LETTER MINNAN TONE-5 */
+	{ 0x1AFF5, 0x1AFFB }, /* KATAKANA LETTER MINNAN TONE-7 - KATAKANA LETTER MINNAN NASALIZED TONE-5 */
+	{ 0x1AFFD, 0x1AFFE }, /* KATAKANA LETTER MINNAN NASALIZED TONE-7 - KATAKANA LETTER MINNAN NASALIZED TONE-8 */
+	{ 0x1B000, 0x1B122 }, /* KATAKANA LETTER ARCHAIC E - KATAKANA LETTER ARCHAIC WU */
+	{ 0x1B132, 0x1B132 }, /* HIRAGANA LETTER SMALL KO */
+	{ 0x1B150, 0x1B152 }, /* HIRAGANA LETTER SMALL WI - HIRAGANA LETTER SMALL WO */
+	{ 0x1B155, 0x1B155 }, /* KATAKANA LETTER SMALL KO */
+	{ 0x1B164, 0x1B167 }, /* KATAKANA LETTER SMALL WI - KATAKANA LETTER SMALL N */
+	{ 0x1B170, 0x1B2FB }, /* NUSHU CHARACTER-1B170 - NUSHU CHARACTER-1B2FB */
+	{ 0x1D300, 0x1D356 }, /* MONOGRAM FOR EARTH - TETRAGRAM FOR FOSTERING */
+	{ 0x1D360, 0x1D376 }, /* COUNTING ROD UNIT DIGIT ONE - IDEOGRAPHIC TALLY MARK FIVE */
+	{ 0x1F000, 0x1F02F }, /* U+1F000 - U+1F02F */
+	{ 0x1F0A0, 0x1F0FF }, /* U+1F0A0 - U+1F0FF */
+	{ 0x1F18E, 0x1F18E }, /* NEGATIVE SQUARED AB */
+	{ 0x1F191, 0x1F19A }, /* SQUARED CL - SQUARED VS */
+	{ 0x1F200, 0x1F202 }, /* SQUARE HIRAGANA HOKA - SQUARED KATAKANA SA */
+	{ 0x1F210, 0x1F23B }, /* SQUARED CJK UNIFIED IDEOGRAPH-624B - SQUARED CJK UNIFIED IDEOGRAPH-914D */
+	{ 0x1F240, 0x1F248 }, /* TORTOISE SHELL BRACKETED CJK UNIFIED IDEOGRAPH-672C - TORTOISE SHELL BRACKETED CJK UNIFIED IDEOGRAPH-6557 */
+	{ 0x1F250, 0x1F251 }, /* CIRCLED IDEOGRAPH ADVANTAGE - CIRCLED IDEOGRAPH ACCEPT */
+	{ 0x1F260, 0x1F265 }, /* ROUNDED SYMBOL FOR FU - ROUNDED SYMBOL FOR CAI */
+	{ 0x1F300, 0x1F3FA }, /* CYCLONE - AMPHORA */
+	{ 0x1F400, 0x1F64F }, /* RAT - PERSON WITH FOLDED HANDS */
+	{ 0x1F680, 0x1F9AF }, /* ROCKET - PROBING CANE */
+	{ 0x1F9B4, 0x1FAFF }, /* U+1F9B4 - U+1FAFF */
+	{ 0x20000, 0x2FFFD }, /* U+20000 - U+2FFFD */
+	{ 0x30000, 0x3FFFD }, /* U+30000 - U+3FFFD */
+};
-- 
2.49.0


* [PATCH v2 06/13] vt: use new tables in ucs.c
  2025-04-15 19:17 [PATCH v2 00/13] vt: implement proper Unicode handling Nicolas Pitre
                   ` (4 preceding siblings ...)
  2025-04-15 19:17 ` [PATCH v2 05/13] vt: create ucs_width_table.h with gen_ucs_width_table.py Nicolas Pitre
@ 2025-04-15 19:17 ` Nicolas Pitre
  2025-04-16  4:22   ` Jiri Slaby
  2025-04-17  8:30   ` kernel test robot
  2025-04-15 19:17 ` [PATCH v2 07/13] vt: introduce gen_ucs_recompose_table.py to create ucs_recompose_table.h Nicolas Pitre
                   ` (6 subsequent siblings)
  12 siblings, 2 replies; 33+ messages in thread
From: Nicolas Pitre @ 2025-04-15 19:17 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Jiri Slaby; +Cc: Nicolas Pitre, linux-serial, linux-kernel

From: Nicolas Pitre <npitre@baylibre.com>

This removes the table from ucs.c and substitutes the generated tables
from ucs_width_table.h providing comprehensive ranges for double-width
and zero-width Unicode code points.

Also implements ucs_is_zero_width() to query the new zero-width table.

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
---
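A rough model of the lookup this patch factors out into cp_in_range(),
written in Python for illustration only. It assumes the ranges are sorted and
non-overlapping, and it uses the old 12-entry table removed by this patch as
sample data. The Python helper and its use of bisect are assumptions of this
sketch, not the kernel code.

```python
from bisect import bisect_right

# Sample data: the old 12-entry table that this patch removes from ucs.c.
DOUBLE_WIDTH_RANGES = [
    (0x1100, 0x115F), (0x2329, 0x232A), (0x2E80, 0x303E),
    (0x3040, 0xA4CF), (0xAC00, 0xD7A3), (0xF900, 0xFAFF),
    (0xFE10, 0xFE19), (0xFE30, 0xFE6F), (0xFF00, 0xFF60),
    (0xFFE0, 0xFFE6), (0x20000, 0x2FFFD), (0x30000, 0x3FFFD),
]


def cp_in_range(cp, ranges):
    """Return True if cp falls inside one of the sorted (first, last) ranges."""
    # Quick rejection: cp lies outside the span covered by the whole table.
    if cp < ranges[0][0] or cp > ranges[-1][1]:
        return False
    # Binary search for the last interval whose first code point is <= cp.
    idx = bisect_right(ranges, (cp, float("inf"))) - 1
    return idx >= 0 and ranges[idx][0] <= cp <= ranges[idx][1]
```

The same helper serves both tables, which is why the patch can express
ucs_is_zero_width() and ucs_is_double_width() as one-line wrappers.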
 drivers/tty/vt/ucs.c       | 44 +++++++++++++++++++++-----------------
 include/linux/consolemap.h |  6 +-----
 2 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/drivers/tty/vt/ucs.c b/drivers/tty/vt/ucs.c
index 0f6c087158..5e71aa3896 100644
--- a/drivers/tty/vt/ucs.c
+++ b/drivers/tty/vt/ucs.c
@@ -5,22 +5,12 @@
 #include <linux/consolemap.h>
 #include <linux/minmax.h>
 
-/* ucs_is_double_width() is based on the wcwidth() implementation by
- * Markus Kuhn -- 2007-05-26 (Unicode 5.0)
- * Latest version: https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
- */
-
 struct ucs_interval {
 	u32 first;
 	u32 last;
 };
 
-static const struct ucs_interval ucs_double_width_ranges[] = {
-	{ 0x1100, 0x115F }, { 0x2329, 0x232A }, { 0x2E80, 0x303E },
-	{ 0x3040, 0xA4CF }, { 0xAC00, 0xD7A3 }, { 0xF900, 0xFAFF },
-	{ 0xFE10, 0xFE19 }, { 0xFE30, 0xFE6F }, { 0xFF00, 0xFF60 },
-	{ 0xFFE0, 0xFFE6 }, { 0x20000, 0x2FFFD }, { 0x30000, 0x3FFFD }
-};
+#include "ucs_width_table.h"
 
 static int interval_cmp(const void *key, const void *element)
 {
@@ -34,6 +24,27 @@ static int interval_cmp(const void *key, const void *element)
 	return 0;
 }
 
+static bool cp_in_range(u32 cp, const struct ucs_interval *ranges, size_t size)
+{
+	if (cp < ranges[0].first || cp > ranges[size - 1].last)
+		return false;
+
+	return __inline_bsearch(&cp, ranges, size, sizeof(*ranges),
+				interval_cmp) != NULL;
+}
+
+/**
+ * ucs_is_zero_width() - Determine if a Unicode code point is zero-width.
+ * @cp: Unicode code point (UCS-4)
+ *
+ * Return: true if the character is zero-width, false otherwise
+ */
+bool ucs_is_zero_width(u32 cp)
+{
+	return cp_in_range(cp, ucs_zero_width_ranges,
+			   ARRAY_SIZE(ucs_zero_width_ranges));
+}
+
 /**
  * Determine if a Unicode code point is double-width.
  *
@@ -42,13 +53,6 @@ static int interval_cmp(const void *key, const void *element)
  */
 bool ucs_is_double_width(u32 cp)
 {
-	size_t size = ARRAY_SIZE(ucs_double_width_ranges);
-
-	if (!in_range(cp, ucs_double_width_ranges[0].first,
-			  ucs_double_width_ranges[size - 1].last))
-		return false;
-
-	return __inline_bsearch(&cp, ucs_double_width_ranges, size,
-				sizeof(*ucs_double_width_ranges),
-				interval_cmp) != NULL;
+	return cp_in_range(cp, ucs_double_width_ranges,
+			   ARRAY_SIZE(ucs_double_width_ranges));
 }
diff --git a/include/linux/consolemap.h b/include/linux/consolemap.h
index 7d778752dc..b3a9118666 100644
--- a/include/linux/consolemap.h
+++ b/include/linux/consolemap.h
@@ -29,11 +29,7 @@ u32 conv_8bit_to_uni(unsigned char c);
 int conv_uni_to_8bit(u32 uni);
 void console_map_init(void);
 bool ucs_is_double_width(uint32_t cp);
-static inline bool ucs_is_zero_width(uint32_t cp)
-{
-	/* coming soon */
-	return false;
-}
+bool ucs_is_zero_width(uint32_t cp);
 #else
 static inline u16 inverse_translate(const struct vc_data *conp, u16 glyph,
 		bool use_unicode)
-- 
2.49.0



* [PATCH v2 07/13] vt: introduce gen_ucs_recompose_table.py to create ucs_recompose_table.h
  2025-04-15 19:17 [PATCH v2 00/13] vt: implement proper Unicode handling Nicolas Pitre
                   ` (5 preceding siblings ...)
  2025-04-15 19:17 ` [PATCH v2 06/13] vt: use new tables in ucs.c Nicolas Pitre
@ 2025-04-15 19:17 ` Nicolas Pitre
  2025-04-16  4:29   ` Jiri Slaby
  2025-04-15 19:17 ` [PATCH v2 08/13] vt: create ucs_recompose_table.h with gen_ucs_recompose_table.py Nicolas Pitre
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 33+ messages in thread
From: Nicolas Pitre @ 2025-04-15 19:17 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Jiri Slaby; +Cc: Nicolas Pitre, linux-serial, linux-kernel

From: Nicolas Pitre <npitre@baylibre.com>

The generated table maps base character + combining mark pairs to their
precomposed equivalents using Python's unicodedata module.

By default, the script creates a table with only the most commonly used
Latin, Greek, and Cyrillic recomposition pairs. It is much smaller
than the table with all possible recomposition pairs (71 entries vs 1000
entries). If the full table is needed, running the script with the
--full argument will generate it.

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
---
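For readers unfamiliar with unicodedata, here is a minimal sketch of how a
single (base, mark) pair can be derived from a precomposed code point. The
helper name canonical_pair() is hypothetical and exists only for this
illustration. The script below applies the same filtering across the whole
BMP.

```python
import unicodedata


def canonical_pair(cp):
    """Return (base, mark) for a two-part canonical decomposition, else None."""
    decomp = unicodedata.decomposition(chr(cp))
    # Empty string: no decomposition. A '<tag>' prefix marks a compatibility
    # decomposition, which must not be used for recomposition.
    if not decomp or '<' in decomp:
        return None
    parts = decomp.split()
    if len(parts) != 2:
        return None  # singleton decompositions are of no use here
    return int(parts[0], 16), int(parts[1], 16)


if __name__ == "__main__":
    for cp in (0x00C9, 0x0419, 0x00A0, 0x0041):
        print(f"U+{cp:04X} -> {canonical_pair(cp)}")
```

Inverting that mapping, so that (base, mark) is the key and the precomposed
code point is the value, gives the rows of the generated table.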
 drivers/tty/vt/gen_ucs_recompose_table.py | 255 ++++++++++++++++++++++
 1 file changed, 255 insertions(+)
 create mode 100755 drivers/tty/vt/gen_ucs_recompose_table.py

diff --git a/drivers/tty/vt/gen_ucs_recompose_table.py b/drivers/tty/vt/gen_ucs_recompose_table.py
new file mode 100755
index 0000000000..91e81fb1c9
--- /dev/null
+++ b/drivers/tty/vt/gen_ucs_recompose_table.py
@@ -0,0 +1,255 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+#
+# Leverage Python's unicodedata module to generate ucs_recompose_table.h
+#
+# The generated table maps base character + combining mark pairs to their
+# precomposed equivalents.
+#
+# Usage:
+#   python gen_ucs_recompose_table.py         # Generate with common recomposition pairs
+#   python gen_ucs_recompose_table.py --full  # Generate with all recomposition pairs
+
+import unicodedata
+import sys
+import argparse
+import textwrap
+
+# This script's file name
+from pathlib import Path
+this_file = Path(__file__).name
+
+# Output file name
+out_file = "ucs_recompose_table.h"
+
+common_recompose_description = "most commonly used Latin, Greek, and Cyrillic recomposition pairs only"
+COMMON_RECOMPOSITION_PAIRS = [
+    # Latin letters with accents - uppercase
+    (0x0041, 0x0300, 0x00C0),  # A + COMBINING GRAVE ACCENT = LATIN CAPITAL LETTER A WITH GRAVE
+    (0x0041, 0x0301, 0x00C1),  # A + COMBINING ACUTE ACCENT = LATIN CAPITAL LETTER A WITH ACUTE
+    (0x0041, 0x0302, 0x00C2),  # A + COMBINING CIRCUMFLEX ACCENT = LATIN CAPITAL LETTER A WITH CIRCUMFLEX
+    (0x0041, 0x0303, 0x00C3),  # A + COMBINING TILDE = LATIN CAPITAL LETTER A WITH TILDE
+    (0x0041, 0x0308, 0x00C4),  # A + COMBINING DIAERESIS = LATIN CAPITAL LETTER A WITH DIAERESIS
+    (0x0041, 0x030A, 0x00C5),  # A + COMBINING RING ABOVE = LATIN CAPITAL LETTER A WITH RING ABOVE
+    (0x0043, 0x0327, 0x00C7),  # C + COMBINING CEDILLA = LATIN CAPITAL LETTER C WITH CEDILLA
+    (0x0045, 0x0300, 0x00C8),  # E + COMBINING GRAVE ACCENT = LATIN CAPITAL LETTER E WITH GRAVE
+    (0x0045, 0x0301, 0x00C9),  # E + COMBINING ACUTE ACCENT = LATIN CAPITAL LETTER E WITH ACUTE
+    (0x0045, 0x0302, 0x00CA),  # E + COMBINING CIRCUMFLEX ACCENT = LATIN CAPITAL LETTER E WITH CIRCUMFLEX
+    (0x0045, 0x0308, 0x00CB),  # E + COMBINING DIAERESIS = LATIN CAPITAL LETTER E WITH DIAERESIS
+    (0x0049, 0x0300, 0x00CC),  # I + COMBINING GRAVE ACCENT = LATIN CAPITAL LETTER I WITH GRAVE
+    (0x0049, 0x0301, 0x00CD),  # I + COMBINING ACUTE ACCENT = LATIN CAPITAL LETTER I WITH ACUTE
+    (0x0049, 0x0302, 0x00CE),  # I + COMBINING CIRCUMFLEX ACCENT = LATIN CAPITAL LETTER I WITH CIRCUMFLEX
+    (0x0049, 0x0308, 0x00CF),  # I + COMBINING DIAERESIS = LATIN CAPITAL LETTER I WITH DIAERESIS
+    (0x004E, 0x0303, 0x00D1),  # N + COMBINING TILDE = LATIN CAPITAL LETTER N WITH TILDE
+    (0x004F, 0x0300, 0x00D2),  # O + COMBINING GRAVE ACCENT = LATIN CAPITAL LETTER O WITH GRAVE
+    (0x004F, 0x0301, 0x00D3),  # O + COMBINING ACUTE ACCENT = LATIN CAPITAL LETTER O WITH ACUTE
+    (0x004F, 0x0302, 0x00D4),  # O + COMBINING CIRCUMFLEX ACCENT = LATIN CAPITAL LETTER O WITH CIRCUMFLEX
+    (0x004F, 0x0303, 0x00D5),  # O + COMBINING TILDE = LATIN CAPITAL LETTER O WITH TILDE
+    (0x004F, 0x0308, 0x00D6),  # O + COMBINING DIAERESIS = LATIN CAPITAL LETTER O WITH DIAERESIS
+    (0x0055, 0x0300, 0x00D9),  # U + COMBINING GRAVE ACCENT = LATIN CAPITAL LETTER U WITH GRAVE
+    (0x0055, 0x0301, 0x00DA),  # U + COMBINING ACUTE ACCENT = LATIN CAPITAL LETTER U WITH ACUTE
+    (0x0055, 0x0302, 0x00DB),  # U + COMBINING CIRCUMFLEX ACCENT = LATIN CAPITAL LETTER U WITH CIRCUMFLEX
+    (0x0055, 0x0308, 0x00DC),  # U + COMBINING DIAERESIS = LATIN CAPITAL LETTER U WITH DIAERESIS
+    (0x0059, 0x0301, 0x00DD),  # Y + COMBINING ACUTE ACCENT = LATIN CAPITAL LETTER Y WITH ACUTE
+
+    # Latin letters with accents - lowercase
+    (0x0061, 0x0300, 0x00E0),  # a + COMBINING GRAVE ACCENT = LATIN SMALL LETTER A WITH GRAVE
+    (0x0061, 0x0301, 0x00E1),  # a + COMBINING ACUTE ACCENT = LATIN SMALL LETTER A WITH ACUTE
+    (0x0061, 0x0302, 0x00E2),  # a + COMBINING CIRCUMFLEX ACCENT = LATIN SMALL LETTER A WITH CIRCUMFLEX
+    (0x0061, 0x0303, 0x00E3),  # a + COMBINING TILDE = LATIN SMALL LETTER A WITH TILDE
+    (0x0061, 0x0308, 0x00E4),  # a + COMBINING DIAERESIS = LATIN SMALL LETTER A WITH DIAERESIS
+    (0x0061, 0x030A, 0x00E5),  # a + COMBINING RING ABOVE = LATIN SMALL LETTER A WITH RING ABOVE
+    (0x0063, 0x0327, 0x00E7),  # c + COMBINING CEDILLA = LATIN SMALL LETTER C WITH CEDILLA
+    (0x0065, 0x0300, 0x00E8),  # e + COMBINING GRAVE ACCENT = LATIN SMALL LETTER E WITH GRAVE
+    (0x0065, 0x0301, 0x00E9),  # e + COMBINING ACUTE ACCENT = LATIN SMALL LETTER E WITH ACUTE
+    (0x0065, 0x0302, 0x00EA),  # e + COMBINING CIRCUMFLEX ACCENT = LATIN SMALL LETTER E WITH CIRCUMFLEX
+    (0x0065, 0x0308, 0x00EB),  # e + COMBINING DIAERESIS = LATIN SMALL LETTER E WITH DIAERESIS
+    (0x0069, 0x0300, 0x00EC),  # i + COMBINING GRAVE ACCENT = LATIN SMALL LETTER I WITH GRAVE
+    (0x0069, 0x0301, 0x00ED),  # i + COMBINING ACUTE ACCENT = LATIN SMALL LETTER I WITH ACUTE
+    (0x0069, 0x0302, 0x00EE),  # i + COMBINING CIRCUMFLEX ACCENT = LATIN SMALL LETTER I WITH CIRCUMFLEX
+    (0x0069, 0x0308, 0x00EF),  # i + COMBINING DIAERESIS = LATIN SMALL LETTER I WITH DIAERESIS
+    (0x006E, 0x0303, 0x00F1),  # n + COMBINING TILDE = LATIN SMALL LETTER N WITH TILDE
+    (0x006F, 0x0300, 0x00F2),  # o + COMBINING GRAVE ACCENT = LATIN SMALL LETTER O WITH GRAVE
+    (0x006F, 0x0301, 0x00F3),  # o + COMBINING ACUTE ACCENT = LATIN SMALL LETTER O WITH ACUTE
+    (0x006F, 0x0302, 0x00F4),  # o + COMBINING CIRCUMFLEX ACCENT = LATIN SMALL LETTER O WITH CIRCUMFLEX
+    (0x006F, 0x0303, 0x00F5),  # o + COMBINING TILDE = LATIN SMALL LETTER O WITH TILDE
+    (0x006F, 0x0308, 0x00F6),  # o + COMBINING DIAERESIS = LATIN SMALL LETTER O WITH DIAERESIS
+    (0x0075, 0x0300, 0x00F9),  # u + COMBINING GRAVE ACCENT = LATIN SMALL LETTER U WITH GRAVE
+    (0x0075, 0x0301, 0x00FA),  # u + COMBINING ACUTE ACCENT = LATIN SMALL LETTER U WITH ACUTE
+    (0x0075, 0x0302, 0x00FB),  # u + COMBINING CIRCUMFLEX ACCENT = LATIN SMALL LETTER U WITH CIRCUMFLEX
+    (0x0075, 0x0308, 0x00FC),  # u + COMBINING DIAERESIS = LATIN SMALL LETTER U WITH DIAERESIS
+    (0x0079, 0x0301, 0x00FD),  # y + COMBINING ACUTE ACCENT = LATIN SMALL LETTER Y WITH ACUTE
+    (0x0079, 0x0308, 0x00FF),  # y + COMBINING DIAERESIS = LATIN SMALL LETTER Y WITH DIAERESIS
+
+    # Common Greek characters
+    (0x0391, 0x0301, 0x0386),  # Α + COMBINING ACUTE ACCENT = GREEK CAPITAL LETTER ALPHA WITH TONOS
+    (0x0395, 0x0301, 0x0388),  # Ε + COMBINING ACUTE ACCENT = GREEK CAPITAL LETTER EPSILON WITH TONOS
+    (0x0397, 0x0301, 0x0389),  # Η + COMBINING ACUTE ACCENT = GREEK CAPITAL LETTER ETA WITH TONOS
+    (0x0399, 0x0301, 0x038A),  # Ι + COMBINING ACUTE ACCENT = GREEK CAPITAL LETTER IOTA WITH TONOS
+    (0x039F, 0x0301, 0x038C),  # Ο + COMBINING ACUTE ACCENT = GREEK CAPITAL LETTER OMICRON WITH TONOS
+    (0x03A5, 0x0301, 0x038E),  # Υ + COMBINING ACUTE ACCENT = GREEK CAPITAL LETTER UPSILON WITH TONOS
+    (0x03A9, 0x0301, 0x038F),  # Ω + COMBINING ACUTE ACCENT = GREEK CAPITAL LETTER OMEGA WITH TONOS
+    (0x03B1, 0x0301, 0x03AC),  # α + COMBINING ACUTE ACCENT = GREEK SMALL LETTER ALPHA WITH TONOS
+    (0x03B5, 0x0301, 0x03AD),  # ε + COMBINING ACUTE ACCENT = GREEK SMALL LETTER EPSILON WITH TONOS
+    (0x03B7, 0x0301, 0x03AE),  # η + COMBINING ACUTE ACCENT = GREEK SMALL LETTER ETA WITH TONOS
+    (0x03B9, 0x0301, 0x03AF),  # ι + COMBINING ACUTE ACCENT = GREEK SMALL LETTER IOTA WITH TONOS
+    (0x03BF, 0x0301, 0x03CC),  # ο + COMBINING ACUTE ACCENT = GREEK SMALL LETTER OMICRON WITH TONOS
+    (0x03C5, 0x0301, 0x03CD),  # υ + COMBINING ACUTE ACCENT = GREEK SMALL LETTER UPSILON WITH TONOS
+    (0x03C9, 0x0301, 0x03CE),  # ω + COMBINING ACUTE ACCENT = GREEK SMALL LETTER OMEGA WITH TONOS
+
+    # Common Cyrillic characters
+    (0x0418, 0x0306, 0x0419),  # И + COMBINING BREVE = CYRILLIC CAPITAL LETTER SHORT I
+    (0x0438, 0x0306, 0x0439),  # и + COMBINING BREVE = CYRILLIC SMALL LETTER SHORT I
+    (0x0423, 0x0306, 0x040E),  # У + COMBINING BREVE = CYRILLIC CAPITAL LETTER SHORT U
+    (0x0443, 0x0306, 0x045E),  # у + COMBINING BREVE = CYRILLIC SMALL LETTER SHORT U
+]
+
+full_recompose_description = "all possible recomposition pairs from the Unicode BMP"
+def collect_all_recomposition_pairs():
+    """Collect all possible recomposition pairs from the Unicode data."""
+    # Map to store recomposition pairs: (base, combining) -> recomposed
+    recompose_map = {}
+
+    # Process all assigned Unicode code points in BMP (Basic Multilingual Plane)
+    # We limit to BMP (0x0000-0xFFFF) to keep our table smaller with uint16_t
+    for cp in range(0, 0x10000):
+        try:
+            char = chr(cp)
+
+            # Skip unassigned or control characters
+            if not unicodedata.name(char, ''):
+                continue
+
+            # Find decomposition
+            decomp = unicodedata.decomposition(char)
+            if not decomp or '<' in decomp:  # Skip compatibility decompositions
+                continue
+
+            # Parse the decomposition
+            parts = decomp.split()
+            if len(parts) == 2:  # Simple base + combining mark
+                base = int(parts[0], 16)
+                combining = int(parts[1], 16)
+
+                # Only store if both are in BMP
+                if base < 0x10000 and combining < 0x10000:
+                    recompose_map[(base, combining)] = cp
+
+        except (ValueError, TypeError):
+            continue
+
+    # Convert to a list of tuples and sort for binary search
+    recompose_list = [(base, combining, recomposed)
+                     for (base, combining), recomposed in recompose_map.items()]
+    recompose_list.sort()
+
+    return recompose_list
+
+def validate_common_pairs(full_list):
+    """Validate that all common pairs are in the full list.
+
+    Raises:
+        ValueError: If any common pair is missing or has a different recomposition
+        value than what's in the full table.
+    """
+    full_pairs = {(base, combining): recomposed for base, combining, recomposed in full_list}
+    for base, combining, recomposed in COMMON_RECOMPOSITION_PAIRS:
+        full_recomposed = full_pairs.get((base, combining))
+        if full_recomposed is None:
+            error_msg = f"Error: Common pair (0x{base:04X}, 0x{combining:04X}) not found in full data"
+            print(error_msg)
+            raise ValueError(error_msg)
+        elif full_recomposed != recomposed:
+            error_msg = (f"Error: Common pair (0x{base:04X}, 0x{combining:04X}) has different recomposition: "
+                         f"0x{recomposed:04X} vs 0x{full_recomposed:04X}")
+            print(error_msg)
+            raise ValueError(error_msg)
+
+def generate_recomposition_table(use_full_list=False):
+    """Generate the recomposition C table."""
+
+    # Collect all recomposition pairs for validation
+    full_recompose_list = collect_all_recomposition_pairs()
+
+    # Decide which list to use
+    if use_full_list:
+        print("Using full recomposition list...")
+        recompose_list = full_recompose_list
+        table_description = full_recompose_description
+        alt_list = COMMON_RECOMPOSITION_PAIRS
+        alt_description = common_recompose_description
+    else:
+        print("Using common recomposition list...")
+        # Validate that all common pairs are in the full list
+        validate_common_pairs(full_recompose_list)
+        recompose_list = sorted(COMMON_RECOMPOSITION_PAIRS)
+        table_description = common_recompose_description
+        alt_list = full_recompose_list
+        alt_description = full_recompose_description
+    generation_mode = " --full" if use_full_list else ""
+    alternative_mode = " --full" if not use_full_list else ""
+    table_description_detail = f"{table_description} ({len(recompose_list)} entries)"
+    alt_description_detail = f"{alt_description} ({len(alt_list)} entries)"
+
+    # Calculate min/max values for boundary checks
+    min_base = min(base for base, _, _ in recompose_list)
+    max_base = max(base for base, _, _ in recompose_list)
+    min_combining = min(combining for _, combining, _ in recompose_list)
+    max_combining = max(combining for _, combining, _ in recompose_list)
+
+    # Generate implementation file
+    with open(out_file, 'w') as f:
+        f.write(f"""\
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * {out_file} - Unicode character recomposition
+ *
+ * Auto-generated by {this_file}{generation_mode}
+ *
+ * Unicode Version: {unicodedata.unidata_version}
+ *
+{textwrap.fill(
+    f"This file contains a table with {table_description_detail}. " +
+    f"To generate a table with {alt_description_detail} instead, run:",
+    width=75, initial_indent=" * ", subsequent_indent=" * ")}
+ *
+ *   python {this_file}{alternative_mode}
+ */
+
+/*
+ * Table of {table_description}
+ * Sorted by base character and then combining mark for binary search
+ */
+static const struct ucs_recomposition ucs_recomposition_table[] = {{
+""")
+
+        for base, combining, recomposed in recompose_list:
+            try:
+                base_name = unicodedata.name(chr(base))
+                combining_name = unicodedata.name(chr(combining))
+                recomposed_name = unicodedata.name(chr(recomposed))
+                comment = f"/* {base_name} + {combining_name} = {recomposed_name} */"
+            except ValueError:
+                comment = f"/* U+{base:04X} + U+{combining:04X} = U+{recomposed:04X} */"
+            f.write(f"\t{{ 0x{base:04X}, 0x{combining:04X}, 0x{recomposed:04X} }}, {comment}\n")
+
+        f.write(f"""\
+}};
+
+/*
+ * Boundary values for quick rejection
+ * These are calculated by analyzing the table during generation
+ */
+#define UCS_RECOMPOSE_MIN_BASE  0x{min_base:04X}
+#define UCS_RECOMPOSE_MAX_BASE  0x{max_base:04X}
+#define UCS_RECOMPOSE_MIN_MARK  0x{min_combining:04X}
+#define UCS_RECOMPOSE_MAX_MARK  0x{max_combining:04X}
+""")
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Generate Unicode recomposition table")
+    parser.add_argument("--full", action="store_true",
+                        help="Generate a full recomposition table (default: common pairs only)")
+    args = parser.parse_args()
+
+    generate_recomposition_table(use_full_list=args.full)
-- 
2.49.0



* [PATCH v2 08/13] vt: create ucs_recompose_table.h with gen_ucs_recompose_table.py
  2025-04-15 19:17 [PATCH v2 00/13] vt: implement proper Unicode handling Nicolas Pitre
                   ` (6 preceding siblings ...)
  2025-04-15 19:17 ` [PATCH v2 07/13] vt: introduce gen_ucs_recompose_table.py to create ucs_recompose_table.h Nicolas Pitre
@ 2025-04-15 19:17 ` Nicolas Pitre
  2025-04-16  5:05   ` Jiri Slaby
  2025-04-15 19:17 ` [PATCH v2 09/13] vt: support Unicode recomposition Nicolas Pitre
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 33+ messages in thread
From: Nicolas Pitre @ 2025-04-15 19:17 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Jiri Slaby; +Cc: Nicolas Pitre, linux-serial, linux-kernel

From: Nicolas Pitre <npitre@baylibre.com>

Table of base character + combining mark pairs with their precomposed
equivalents.

Note: scripts/checkpatch.pl complains about "... exceeds 100 columns".
      Please ignore.

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
---
 drivers/tty/vt/ucs_recompose_table.h | 102 +++++++++++++++++++++++++++
 1 file changed, 102 insertions(+)
 create mode 100644 drivers/tty/vt/ucs_recompose_table.h

diff --git a/drivers/tty/vt/ucs_recompose_table.h b/drivers/tty/vt/ucs_recompose_table.h
new file mode 100644
index 0000000000..bd91edde5d
--- /dev/null
+++ b/drivers/tty/vt/ucs_recompose_table.h
@@ -0,0 +1,102 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * ucs_recompose_table.h - Unicode character recomposition
+ *
+ * Auto-generated by gen_ucs_recompose_table.py
+ *
+ * Unicode Version: 16.0.0
+ *
+ * This file contains a table with most commonly used Latin, Greek, and
+ * Cyrillic recomposition pairs only (71 entries). To generate a table with
+ * all possible recomposition pairs from the Unicode BMP (1000 entries)
+ * instead, run:
+ *
+ *   python gen_ucs_recompose_table.py --full
+ */
+
+/*
+ * Table of most commonly used Latin, Greek, and Cyrillic recomposition pairs only
+ * Sorted by base character and then combining mark for binary search
+ */
+static const struct ucs_recomposition ucs_recomposition_table[] = {
+	{ 0x0041, 0x0300, 0x00C0 }, /* LATIN CAPITAL LETTER A + COMBINING GRAVE ACCENT = LATIN CAPITAL LETTER A WITH GRAVE */
+	{ 0x0041, 0x0301, 0x00C1 }, /* LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT = LATIN CAPITAL LETTER A WITH ACUTE */
+	{ 0x0041, 0x0302, 0x00C2 }, /* LATIN CAPITAL LETTER A + COMBINING CIRCUMFLEX ACCENT = LATIN CAPITAL LETTER A WITH CIRCUMFLEX */
+	{ 0x0041, 0x0303, 0x00C3 }, /* LATIN CAPITAL LETTER A + COMBINING TILDE = LATIN CAPITAL LETTER A WITH TILDE */
+	{ 0x0041, 0x0308, 0x00C4 }, /* LATIN CAPITAL LETTER A + COMBINING DIAERESIS = LATIN CAPITAL LETTER A WITH DIAERESIS */
+	{ 0x0041, 0x030A, 0x00C5 }, /* LATIN CAPITAL LETTER A + COMBINING RING ABOVE = LATIN CAPITAL LETTER A WITH RING ABOVE */
+	{ 0x0043, 0x0327, 0x00C7 }, /* LATIN CAPITAL LETTER C + COMBINING CEDILLA = LATIN CAPITAL LETTER C WITH CEDILLA */
+	{ 0x0045, 0x0300, 0x00C8 }, /* LATIN CAPITAL LETTER E + COMBINING GRAVE ACCENT = LATIN CAPITAL LETTER E WITH GRAVE */
+	{ 0x0045, 0x0301, 0x00C9 }, /* LATIN CAPITAL LETTER E + COMBINING ACUTE ACCENT = LATIN CAPITAL LETTER E WITH ACUTE */
+	{ 0x0045, 0x0302, 0x00CA }, /* LATIN CAPITAL LETTER E + COMBINING CIRCUMFLEX ACCENT = LATIN CAPITAL LETTER E WITH CIRCUMFLEX */
+	{ 0x0045, 0x0308, 0x00CB }, /* LATIN CAPITAL LETTER E + COMBINING DIAERESIS = LATIN CAPITAL LETTER E WITH DIAERESIS */
+	{ 0x0049, 0x0300, 0x00CC }, /* LATIN CAPITAL LETTER I + COMBINING GRAVE ACCENT = LATIN CAPITAL LETTER I WITH GRAVE */
+	{ 0x0049, 0x0301, 0x00CD }, /* LATIN CAPITAL LETTER I + COMBINING ACUTE ACCENT = LATIN CAPITAL LETTER I WITH ACUTE */
+	{ 0x0049, 0x0302, 0x00CE }, /* LATIN CAPITAL LETTER I + COMBINING CIRCUMFLEX ACCENT = LATIN CAPITAL LETTER I WITH CIRCUMFLEX */
+	{ 0x0049, 0x0308, 0x00CF }, /* LATIN CAPITAL LETTER I + COMBINING DIAERESIS = LATIN CAPITAL LETTER I WITH DIAERESIS */
+	{ 0x004E, 0x0303, 0x00D1 }, /* LATIN CAPITAL LETTER N + COMBINING TILDE = LATIN CAPITAL LETTER N WITH TILDE */
+	{ 0x004F, 0x0300, 0x00D2 }, /* LATIN CAPITAL LETTER O + COMBINING GRAVE ACCENT = LATIN CAPITAL LETTER O WITH GRAVE */
+	{ 0x004F, 0x0301, 0x00D3 }, /* LATIN CAPITAL LETTER O + COMBINING ACUTE ACCENT = LATIN CAPITAL LETTER O WITH ACUTE */
+	{ 0x004F, 0x0302, 0x00D4 }, /* LATIN CAPITAL LETTER O + COMBINING CIRCUMFLEX ACCENT = LATIN CAPITAL LETTER O WITH CIRCUMFLEX */
+	{ 0x004F, 0x0303, 0x00D5 }, /* LATIN CAPITAL LETTER O + COMBINING TILDE = LATIN CAPITAL LETTER O WITH TILDE */
+	{ 0x004F, 0x0308, 0x00D6 }, /* LATIN CAPITAL LETTER O + COMBINING DIAERESIS = LATIN CAPITAL LETTER O WITH DIAERESIS */
+	{ 0x0055, 0x0300, 0x00D9 }, /* LATIN CAPITAL LETTER U + COMBINING GRAVE ACCENT = LATIN CAPITAL LETTER U WITH GRAVE */
+	{ 0x0055, 0x0301, 0x00DA }, /* LATIN CAPITAL LETTER U + COMBINING ACUTE ACCENT = LATIN CAPITAL LETTER U WITH ACUTE */
+	{ 0x0055, 0x0302, 0x00DB }, /* LATIN CAPITAL LETTER U + COMBINING CIRCUMFLEX ACCENT = LATIN CAPITAL LETTER U WITH CIRCUMFLEX */
+	{ 0x0055, 0x0308, 0x00DC }, /* LATIN CAPITAL LETTER U + COMBINING DIAERESIS = LATIN CAPITAL LETTER U WITH DIAERESIS */
+	{ 0x0059, 0x0301, 0x00DD }, /* LATIN CAPITAL LETTER Y + COMBINING ACUTE ACCENT = LATIN CAPITAL LETTER Y WITH ACUTE */
+	{ 0x0061, 0x0300, 0x00E0 }, /* LATIN SMALL LETTER A + COMBINING GRAVE ACCENT = LATIN SMALL LETTER A WITH GRAVE */
+	{ 0x0061, 0x0301, 0x00E1 }, /* LATIN SMALL LETTER A + COMBINING ACUTE ACCENT = LATIN SMALL LETTER A WITH ACUTE */
+	{ 0x0061, 0x0302, 0x00E2 }, /* LATIN SMALL LETTER A + COMBINING CIRCUMFLEX ACCENT = LATIN SMALL LETTER A WITH CIRCUMFLEX */
+	{ 0x0061, 0x0303, 0x00E3 }, /* LATIN SMALL LETTER A + COMBINING TILDE = LATIN SMALL LETTER A WITH TILDE */
+	{ 0x0061, 0x0308, 0x00E4 }, /* LATIN SMALL LETTER A + COMBINING DIAERESIS = LATIN SMALL LETTER A WITH DIAERESIS */
+	{ 0x0061, 0x030A, 0x00E5 }, /* LATIN SMALL LETTER A + COMBINING RING ABOVE = LATIN SMALL LETTER A WITH RING ABOVE */
+	{ 0x0063, 0x0327, 0x00E7 }, /* LATIN SMALL LETTER C + COMBINING CEDILLA = LATIN SMALL LETTER C WITH CEDILLA */
+	{ 0x0065, 0x0300, 0x00E8 }, /* LATIN SMALL LETTER E + COMBINING GRAVE ACCENT = LATIN SMALL LETTER E WITH GRAVE */
+	{ 0x0065, 0x0301, 0x00E9 }, /* LATIN SMALL LETTER E + COMBINING ACUTE ACCENT = LATIN SMALL LETTER E WITH ACUTE */
+	{ 0x0065, 0x0302, 0x00EA }, /* LATIN SMALL LETTER E + COMBINING CIRCUMFLEX ACCENT = LATIN SMALL LETTER E WITH CIRCUMFLEX */
+	{ 0x0065, 0x0308, 0x00EB }, /* LATIN SMALL LETTER E + COMBINING DIAERESIS = LATIN SMALL LETTER E WITH DIAERESIS */
+	{ 0x0069, 0x0300, 0x00EC }, /* LATIN SMALL LETTER I + COMBINING GRAVE ACCENT = LATIN SMALL LETTER I WITH GRAVE */
+	{ 0x0069, 0x0301, 0x00ED }, /* LATIN SMALL LETTER I + COMBINING ACUTE ACCENT = LATIN SMALL LETTER I WITH ACUTE */
+	{ 0x0069, 0x0302, 0x00EE }, /* LATIN SMALL LETTER I + COMBINING CIRCUMFLEX ACCENT = LATIN SMALL LETTER I WITH CIRCUMFLEX */
+	{ 0x0069, 0x0308, 0x00EF }, /* LATIN SMALL LETTER I + COMBINING DIAERESIS = LATIN SMALL LETTER I WITH DIAERESIS */
+	{ 0x006E, 0x0303, 0x00F1 }, /* LATIN SMALL LETTER N + COMBINING TILDE = LATIN SMALL LETTER N WITH TILDE */
+	{ 0x006F, 0x0300, 0x00F2 }, /* LATIN SMALL LETTER O + COMBINING GRAVE ACCENT = LATIN SMALL LETTER O WITH GRAVE */
+	{ 0x006F, 0x0301, 0x00F3 }, /* LATIN SMALL LETTER O + COMBINING ACUTE ACCENT = LATIN SMALL LETTER O WITH ACUTE */
+	{ 0x006F, 0x0302, 0x00F4 }, /* LATIN SMALL LETTER O + COMBINING CIRCUMFLEX ACCENT = LATIN SMALL LETTER O WITH CIRCUMFLEX */
+	{ 0x006F, 0x0303, 0x00F5 }, /* LATIN SMALL LETTER O + COMBINING TILDE = LATIN SMALL LETTER O WITH TILDE */
+	{ 0x006F, 0x0308, 0x00F6 }, /* LATIN SMALL LETTER O + COMBINING DIAERESIS = LATIN SMALL LETTER O WITH DIAERESIS */
+	{ 0x0075, 0x0300, 0x00F9 }, /* LATIN SMALL LETTER U + COMBINING GRAVE ACCENT = LATIN SMALL LETTER U WITH GRAVE */
+	{ 0x0075, 0x0301, 0x00FA }, /* LATIN SMALL LETTER U + COMBINING ACUTE ACCENT = LATIN SMALL LETTER U WITH ACUTE */
+	{ 0x0075, 0x0302, 0x00FB }, /* LATIN SMALL LETTER U + COMBINING CIRCUMFLEX ACCENT = LATIN SMALL LETTER U WITH CIRCUMFLEX */
+	{ 0x0075, 0x0308, 0x00FC }, /* LATIN SMALL LETTER U + COMBINING DIAERESIS = LATIN SMALL LETTER U WITH DIAERESIS */
+	{ 0x0079, 0x0301, 0x00FD }, /* LATIN SMALL LETTER Y + COMBINING ACUTE ACCENT = LATIN SMALL LETTER Y WITH ACUTE */
+	{ 0x0079, 0x0308, 0x00FF }, /* LATIN SMALL LETTER Y + COMBINING DIAERESIS = LATIN SMALL LETTER Y WITH DIAERESIS */
+	{ 0x0391, 0x0301, 0x0386 }, /* GREEK CAPITAL LETTER ALPHA + COMBINING ACUTE ACCENT = GREEK CAPITAL LETTER ALPHA WITH TONOS */
+	{ 0x0395, 0x0301, 0x0388 }, /* GREEK CAPITAL LETTER EPSILON + COMBINING ACUTE ACCENT = GREEK CAPITAL LETTER EPSILON WITH TONOS */
+	{ 0x0397, 0x0301, 0x0389 }, /* GREEK CAPITAL LETTER ETA + COMBINING ACUTE ACCENT = GREEK CAPITAL LETTER ETA WITH TONOS */
+	{ 0x0399, 0x0301, 0x038A }, /* GREEK CAPITAL LETTER IOTA + COMBINING ACUTE ACCENT = GREEK CAPITAL LETTER IOTA WITH TONOS */
+	{ 0x039F, 0x0301, 0x038C }, /* GREEK CAPITAL LETTER OMICRON + COMBINING ACUTE ACCENT = GREEK CAPITAL LETTER OMICRON WITH TONOS */
+	{ 0x03A5, 0x0301, 0x038E }, /* GREEK CAPITAL LETTER UPSILON + COMBINING ACUTE ACCENT = GREEK CAPITAL LETTER UPSILON WITH TONOS */
+	{ 0x03A9, 0x0301, 0x038F }, /* GREEK CAPITAL LETTER OMEGA + COMBINING ACUTE ACCENT = GREEK CAPITAL LETTER OMEGA WITH TONOS */
+	{ 0x03B1, 0x0301, 0x03AC }, /* GREEK SMALL LETTER ALPHA + COMBINING ACUTE ACCENT = GREEK SMALL LETTER ALPHA WITH TONOS */
+	{ 0x03B5, 0x0301, 0x03AD }, /* GREEK SMALL LETTER EPSILON + COMBINING ACUTE ACCENT = GREEK SMALL LETTER EPSILON WITH TONOS */
+	{ 0x03B7, 0x0301, 0x03AE }, /* GREEK SMALL LETTER ETA + COMBINING ACUTE ACCENT = GREEK SMALL LETTER ETA WITH TONOS */
+	{ 0x03B9, 0x0301, 0x03AF }, /* GREEK SMALL LETTER IOTA + COMBINING ACUTE ACCENT = GREEK SMALL LETTER IOTA WITH TONOS */
+	{ 0x03BF, 0x0301, 0x03CC }, /* GREEK SMALL LETTER OMICRON + COMBINING ACUTE ACCENT = GREEK SMALL LETTER OMICRON WITH TONOS */
+	{ 0x03C5, 0x0301, 0x03CD }, /* GREEK SMALL LETTER UPSILON + COMBINING ACUTE ACCENT = GREEK SMALL LETTER UPSILON WITH TONOS */
+	{ 0x03C9, 0x0301, 0x03CE }, /* GREEK SMALL LETTER OMEGA + COMBINING ACUTE ACCENT = GREEK SMALL LETTER OMEGA WITH TONOS */
+	{ 0x0418, 0x0306, 0x0419 }, /* CYRILLIC CAPITAL LETTER I + COMBINING BREVE = CYRILLIC CAPITAL LETTER SHORT I */
+	{ 0x0423, 0x0306, 0x040E }, /* CYRILLIC CAPITAL LETTER U + COMBINING BREVE = CYRILLIC CAPITAL LETTER SHORT U */
+	{ 0x0438, 0x0306, 0x0439 }, /* CYRILLIC SMALL LETTER I + COMBINING BREVE = CYRILLIC SMALL LETTER SHORT I */
+	{ 0x0443, 0x0306, 0x045E }, /* CYRILLIC SMALL LETTER U + COMBINING BREVE = CYRILLIC SMALL LETTER SHORT U */
+};
+
+/*
+ * Boundary values for quick rejection
+ * These are calculated by analyzing the table during generation
+ */
+#define UCS_RECOMPOSE_MIN_BASE  0x0041
+#define UCS_RECOMPOSE_MAX_BASE  0x0443
+#define UCS_RECOMPOSE_MIN_MARK  0x0300
+#define UCS_RECOMPOSE_MAX_MARK  0x0327
-- 
2.49.0



* [PATCH v2 09/13] vt: support Unicode recomposition
  2025-04-15 19:17 [PATCH v2 00/13] vt: implement proper Unicode handling Nicolas Pitre
                   ` (7 preceding siblings ...)
  2025-04-15 19:17 ` [PATCH v2 08/13] vt: create ucs_recompose_table.h with gen_ucs_recompose_table.py Nicolas Pitre
@ 2025-04-15 19:17 ` Nicolas Pitre
  2025-04-16  5:07   ` Jiri Slaby
  2025-04-15 19:17 ` [PATCH v2 10/13] vt: pad double-width code points with a zero-width space Nicolas Pitre
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 33+ messages in thread
From: Nicolas Pitre @ 2025-04-15 19:17 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Jiri Slaby; +Cc: Nicolas Pitre, linux-serial, linux-kernel

From: Nicolas Pitre <npitre@baylibre.com>

Try replacing any decomposed Unicode sequence with the corresponding
recomposed code point. Code point to glyph correspondence works best
after recomposition. This applies mostly to single-width code points,
so we can't preserve them in their decomposed form anyway.

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
---
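A small Python model of the lookup performed by ucs_recompose(), for
illustration only. The four sample rows are copied from
ucs_recompose_table.h. The recompose() helper and its use of bisect are
assumptions of this sketch and merely stand in for the kernel's
__inline_bsearch() call. The last line cross-checks one result against
Python's own NFC normalization.

```python
import unicodedata
from bisect import bisect_left

# A few rows from ucs_recompose_table.h, sorted by (base, mark).
TABLE = [
    (0x0041, 0x0300, 0x00C0),
    (0x0045, 0x0301, 0x00C9),
    (0x0065, 0x0301, 0x00E9),
    (0x0438, 0x0306, 0x0439),
]
KEYS = [(base, mark) for base, mark, _ in TABLE]


def recompose(base, mark):
    """Return the precomposed code point for (base, mark), or 0 if none."""
    idx = bisect_left(KEYS, (base, mark))
    if idx < len(KEYS) and KEYS[idx] == (base, mark):
        return TABLE[idx][2]
    return 0


if __name__ == "__main__":
    print(hex(recompose(0x0065, 0x0301)))
    print(chr(recompose(0x0045, 0x0301)) == unicodedata.normalize("NFC", "E\u0301"))
```

As in the C code, a return value of 0 means no recomposition was found for
that pair.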
 drivers/tty/vt/ucs.c       | 62 ++++++++++++++++++++++++++++++++++++++
 drivers/tty/vt/vt.c        | 14 +++++++--
 include/linux/consolemap.h |  6 ++++
 3 files changed, 79 insertions(+), 3 deletions(-)

diff --git a/drivers/tty/vt/ucs.c b/drivers/tty/vt/ucs.c
index 5e71aa3896..07b2bd1714 100644
--- a/drivers/tty/vt/ucs.c
+++ b/drivers/tty/vt/ucs.c
@@ -56,3 +56,65 @@ bool ucs_is_double_width(u32 cp)
 	return cp_in_range(cp, ucs_double_width_ranges,
 			   ARRAY_SIZE(ucs_double_width_ranges));
 }
+
+/*
+ * Structure for base with combining mark pairs and resulting recompositions.
+ * Using u16 to save space since all values are within BMP range.
+ */
+struct ucs_recomposition {
+	u16 base;	/* base character */
+	u16 mark;	/* combining mark */
+	u16 recomposed;	/* corresponding recomposed character */
+};
+
+#include "ucs_recompose_table.h"
+
+struct compare_key {
+	u16 base;
+	u16 mark;
+};
+
+static int recomposition_cmp(const void *key, const void *element)
+{
+	const struct compare_key *search_key = key;
+	const struct ucs_recomposition *entry = element;
+
+	/* Compare base character first */
+	if (search_key->base < entry->base)
+		return -1;
+	if (search_key->base > entry->base)
+		return 1;
+
+	/* Base characters match, now compare combining character */
+	if (search_key->mark < entry->mark)
+		return -1;
+	if (search_key->mark > entry->mark)
+		return 1;
+
+	/* Both match */
+	return 0;
+}
+
+/**
+ * ucs_recompose() - Attempt to recompose two Unicode characters into a single character.
+ * @base: Base Unicode code point (UCS-4)
+ * @mark: Combining mark Unicode code point (UCS-4)
+ *
+ * Return: Recomposed Unicode code point, or 0 if no recomposition is possible
+ */
+u32 ucs_recompose(u32 base, u32 mark)
+{
+	/* Check if characters are within the range of our table */
+	if (base < UCS_RECOMPOSE_MIN_BASE || base > UCS_RECOMPOSE_MAX_BASE ||
+	    mark < UCS_RECOMPOSE_MIN_MARK || mark > UCS_RECOMPOSE_MAX_MARK)
+		return 0;
+
+	struct compare_key key = { base, mark };
+	struct ucs_recomposition *result =
+		__inline_bsearch(&key, ucs_recomposition_table,
+				 ARRAY_SIZE(ucs_recomposition_table),
+				 sizeof(*ucs_recomposition_table),
+				 recomposition_cmp);
+
+	return result ? result->recomposed : 0;
+}
diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
index a989feffad..76554c2040 100644
--- a/drivers/tty/vt/vt.c
+++ b/drivers/tty/vt/vt.c
@@ -2925,9 +2925,9 @@ static void vc_con_rewind(struct vc_data *vc)
 
 #define UCS_VS16	0xfe0f	/* Variation Selector 16 */
 
-static int vc_process_ucs(struct vc_data *vc, int c, int *tc)
+static int vc_process_ucs(struct vc_data *vc, int *c, int *tc)
 {
-	u32 prev_c, curr_c = c;
+	u32 prev_c, curr_c = *c;
 
 	if (ucs_is_double_width(curr_c))
 		return 2;
@@ -2964,6 +2964,14 @@ static int vc_process_ucs(struct vc_data *vc, int c, int *tc)
 		return 1;
 	}
 
+	/* try recomposition */
+	prev_c = ucs_recompose(prev_c, curr_c);
+	if (prev_c != 0) {
+		vc_con_rewind(vc);
+		*tc = *c = prev_c;
+		return 1;
+	}
+
 	/* Otherwise zero-width code points are ignored. */
 	return 0;
 }
@@ -2978,7 +2986,7 @@ static int vc_con_write_normal(struct vc_data *vc, int tc, int c,
 	bool inverse = false;
 
 	if (vc->vc_utf && !vc->vc_disp_ctrl) {
-		width = vc_process_ucs(vc, c, &tc);
+		width = vc_process_ucs(vc, &c, &tc);
 		if (!width)
 			goto out;
 	}
diff --git a/include/linux/consolemap.h b/include/linux/consolemap.h
index b3a9118666..8167494229 100644
--- a/include/linux/consolemap.h
+++ b/include/linux/consolemap.h
@@ -30,6 +30,7 @@ int conv_uni_to_8bit(u32 uni);
 void console_map_init(void);
 bool ucs_is_double_width(uint32_t cp);
 bool ucs_is_zero_width(uint32_t cp);
+u32 ucs_recompose(u32 base, u32 mark);
 #else
 static inline u16 inverse_translate(const struct vc_data *conp, u16 glyph,
 		bool use_unicode)
@@ -69,6 +70,11 @@ static inline bool ucs_is_zero_width(uint32_t cp)
 {
 	return false;
 }
+
+static inline u32 ucs_recompose(u32 base, u32 mark)
+{
+	return 0;
+}
 #endif /* CONFIG_CONSOLE_TRANSLATIONS */
 
 #endif /* __LINUX_CONSOLEMAP_H__ */
-- 
2.49.0


* [PATCH v2 10/13] vt: pad double-width code points with a zero-width space
  2025-04-15 19:17 [PATCH v2 00/13] vt: implement proper Unicode handling Nicolas Pitre
                   ` (8 preceding siblings ...)
  2025-04-15 19:17 ` [PATCH v2 09/13] vt: support Unicode recomposition Nicolas Pitre
@ 2025-04-15 19:17 ` Nicolas Pitre
  2025-04-16  5:07   ` Jiri Slaby
  2025-04-15 19:18 ` [PATCH v2 11/13] vt: remove zero-width-space handling from conv_uni_to_pc() Nicolas Pitre
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 33+ messages in thread
From: Nicolas Pitre @ 2025-04-15 19:17 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Jiri Slaby; +Cc: Nicolas Pitre, linux-serial, linux-kernel

From: Nicolas Pitre <npitre@baylibre.com>

In the Unicode screen buffer, we follow double-width code points with a
space to maintain proper column alignment. This, however, creates
semantic problems when e.g. using cut and paste.

Let's use a better code point for column padding, i.e. a zero-width
space rather than a full space. This way the combination keeps a
semantic width of 2.

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
---
 drivers/tty/vt/vt.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)
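
To illustrate the semantic difference, here is a small user-space sketch
that lays out a string one code point per column and pads each
double-width character with either a space (old behaviour) or U+200B (new
behaviour). It only models the Unicode screen buffer content, not the
glyphs sent to the legacy display, and it makes no claim about what the
selection code does with the result. The helpers layout() and as_text()
are hypothetical, and unicodedata.east_asian_width() is used as a
stand-in for ucs_is_double_width().

```python
import unicodedata

UCS_ZWS = 0x200B  # ZERO WIDTH SPACE, as defined by the patch


def layout(text, pad):
    """Return one code point per screen column for the given text."""
    row = []
    for ch in text:
        row.append(ord(ch))
        # Stand-in for ucs_is_double_width(): East Asian Wide/Fullwidth.
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            row.append(pad)  # padding cell for the second column
    return row


def as_text(row):
    """Join the stored code points back into a string."""
    return "".join(chr(cp) for cp in row)


sample = "\u65e5\u672c"            # two double-width CJK ideographs
old = layout(sample, ord(" "))     # padded with a full space
new = layout(sample, UCS_ZWS)      # padded with a zero-width space
```

Both rows occupy four columns, but as_text(old) contains two real spaces
that were never written to the console, while as_text(new) only contains
zero-width padding between the original characters.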

diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
index 76554c2040..1bd1878094 100644
--- a/drivers/tty/vt/vt.c
+++ b/drivers/tty/vt/vt.c
@@ -2923,6 +2923,7 @@ static void vc_con_rewind(struct vc_data *vc)
 	vc->vc_need_wrap = 0;
 }
 
+#define UCS_ZWS		0x200b	/* Zero Width Space */
 #define UCS_VS16	0xfe0f	/* Variation Selector 16 */
 
 static int vc_process_ucs(struct vc_data *vc, int *c, int *tc)
@@ -2941,8 +2942,8 @@ static int vc_process_ucs(struct vc_data *vc, int *c, int *tc)
 		/*
 		 * Let's merge this zero-width code point with the preceding
 		 * double-width code point by replacing the existing
-		 * whitespace padding. To do so we rewind one column and
-		 * pretend this has a width of 1.
+		 * zero-width space padding. To do so we rewind one column
+		 * and pretend this has a width of 1.
 		 * We give the legacy display the same initial space padding.
 		 */
 		vc_con_rewind(vc);
@@ -3065,7 +3066,11 @@ static int vc_con_write_normal(struct vc_data *vc, int tc, int c,
 		tc = conv_uni_to_pc(vc, ' ');
 		if (tc < 0)
 			tc = ' ';
-		next_c = ' ';
+		/*
+		 * Store a zero-width space in the Unicode screen given that
+		 * the previous code point is semantically double width.
+		 */
+		next_c = UCS_ZWS;
 	}
 
 out:
-- 
2.49.0


* [PATCH v2 11/13] vt: remove zero-width-space handling from conv_uni_to_pc()
  2025-04-15 19:17 [PATCH v2 00/13] vt: implement proper Unicode handling Nicolas Pitre
                   ` (9 preceding siblings ...)
  2025-04-15 19:17 ` [PATCH v2 10/13] vt: pad double-width code points with a zero-width space Nicolas Pitre
@ 2025-04-15 19:18 ` Nicolas Pitre
  2025-04-16  5:07   ` Jiri Slaby
  2025-04-15 19:18 ` [PATCH v2 12/13] vt: update gen_ucs_width_table.py to make tables more space efficient Nicolas Pitre
  2025-04-15 19:18 ` [PATCH v2 13/13] vt: refresh ucs_width_table.h and adjust code in ucs.c accordingly Nicolas Pitre
  12 siblings, 1 reply; 33+ messages in thread
From: Nicolas Pitre @ 2025-04-15 19:18 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Jiri Slaby; +Cc: Nicolas Pitre, linux-serial, linux-kernel

From: Nicolas Pitre <npitre@baylibre.com>

This is now taken care of by ucs_is_zero_width().

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
---
 drivers/tty/vt/consolemap.c | 2 --
 drivers/tty/vt/vt.c         | 2 +-
 2 files changed, 1 insertion(+), 3 deletions(-)
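
As a sanity check of the claim above, the following user-space sketch
verifies that every code point matched by the removed special case is
covered by the zero-width ranges. The two ranges are copied from
ucs_width_table.h as posted in this series; old_special_case() and
in_ranges() are hypothetical helper names, and only the BMP is scanned
since the removed test could not match anything above it.

```python
# Excerpt of the zero-width ranges relevant to the removed check.
ZERO_WIDTH_EXCERPT = [
    (0x200B, 0x200F),  # ZERO WIDTH SPACE - RIGHT-TO-LEFT MARK
    (0xFEFF, 0xFEFF),  # ZERO WIDTH NO-BREAK SPACE
]


def old_special_case(ucs):
    """The condition removed from conv_uni_to_pc()."""
    return ucs == 0xFEFF or 0x200B <= ucs <= 0x200F


def in_ranges(cp, ranges):
    """Linear membership test over inclusive (first, last) ranges."""
    return any(first <= cp <= last for first, last in ranges)


# Code points the old check caught that the table excerpt would miss.
uncovered = [cp for cp in range(0x10000)
             if old_special_case(cp)
             and not in_ranges(cp, ZERO_WIDTH_EXCERPT)]
```

An empty "uncovered" list means the old check matched nothing that the
zero-width table does not also match.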

diff --git a/drivers/tty/vt/consolemap.c b/drivers/tty/vt/consolemap.c
index 82d70083fe..bb4bb272eb 100644
--- a/drivers/tty/vt/consolemap.c
+++ b/drivers/tty/vt/consolemap.c
@@ -870,8 +870,6 @@ int conv_uni_to_pc(struct vc_data *conp, long ucs)
 		return -4;		/* Not found */
 	else if (ucs < 0x20)
 		return -1;		/* Not a printable character */
-	else if (ucs == 0xfeff || (ucs >= 0x200b && ucs <= 0x200f))
-		return -2;			/* Zero-width space */
 	/*
 	 * UNI_DIRECT_BASE indicates the start of the region in the User Zone
 	 * which always has a 1:1 mapping to the currently loaded font.  The
diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
index 1bd1878094..24c6cd2eed 100644
--- a/drivers/tty/vt/vt.c
+++ b/drivers/tty/vt/vt.c
@@ -2995,7 +2995,7 @@ static int vc_con_write_normal(struct vc_data *vc, int tc, int c,
 	/* Now try to find out how to display it */
 	tc = conv_uni_to_pc(vc, tc);
 	if (tc & ~charmask) {
-		if (tc == -1 || tc == -2)
+		if (tc == -1)
 			return -1; /* nothing to display */
 
 		/* Glyph not found */
-- 
2.49.0


* [PATCH v2 12/13] vt: update gen_ucs_width_table.py to make tables more space efficient
  2025-04-15 19:17 [PATCH v2 00/13] vt: implement proper Unicode handling Nicolas Pitre
                   ` (10 preceding siblings ...)
  2025-04-15 19:18 ` [PATCH v2 11/13] vt: remove zero-width-space handling from conv_uni_to_pc() Nicolas Pitre
@ 2025-04-15 19:18 ` Nicolas Pitre
  2025-04-16  5:09   ` Jiri Slaby
  2025-04-15 19:18 ` [PATCH v2 13/13] vt: refresh ucs_width_table.h and adjust code in ucs.c accordingly Nicolas Pitre
  12 siblings, 1 reply; 33+ messages in thread
From: Nicolas Pitre @ 2025-04-15 19:18 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Jiri Slaby; +Cc: Nicolas Pitre, linux-serial, linux-kernel

From: Nicolas Pitre <npitre@baylibre.com>

Split table ranges into BMP (16-bit) and non-BMP (above 16-bit).
This reduces the corresponding text size by 20-25%.

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
---
 drivers/tty/vt/gen_ucs_width_table.py | 55 ++++++++++++++++++++++++---
 1 file changed, 49 insertions(+), 6 deletions(-)
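
Here is a stand-alone sketch of the splitting step, with a rough size
estimate assuming 4 bytes per 16-bit interval and 8 bytes per 32-bit
interval. The split helper mirrors the one added by this patch. The first
two input ranges come from the zero-width table; the last two are made up
for illustration (one deliberately straddles U+FFFF), so the byte counts
below say nothing about the real tables. table_bytes() is a hypothetical
helper.

```python
def split_ranges_by_size(ranges):
    """Split (start, end) ranges into BMP and non-BMP lists."""
    bmp_ranges = []
    non_bmp_ranges = []
    for start, end in ranges:
        if end <= 0xFFFF:
            bmp_ranges.append((start, end))
        elif start > 0xFFFF:
            non_bmp_ranges.append((start, end))
        else:
            # Range straddles the BMP boundary: split it at 0xFFFF.
            bmp_ranges.append((start, 0xFFFF))
            non_bmp_ranges.append((0x10000, end))
    return bmp_ranges, non_bmp_ranges


def table_bytes(bmp, non_bmp):
    """Estimated size: two u16 per BMP entry, two u32 per non-BMP entry."""
    return len(bmp) * 4 + len(non_bmp) * 8


ranges = [
    (0x00AD, 0x00AD),    # SOFT HYPHEN
    (0x0300, 0x036F),    # combining diacritical marks
    (0xFFF0, 0x1000F),   # made-up range straddling the BMP boundary
    (0x1F300, 0x1F64F),  # made-up non-BMP range
]
bmp, non_bmp = split_ranges_by_size(ranges)
before = len(ranges) * 8           # every entry stored as two u32
after = table_bytes(bmp, non_bmp)  # split storage
```

Note that a straddling range becomes two entries, so the saving comes
entirely from the BMP entries shrinking from 8 to 4 bytes each.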

diff --git a/drivers/tty/vt/gen_ucs_width_table.py b/drivers/tty/vt/gen_ucs_width_table.py
index 00510444a7..059ed9a8ba 100755
--- a/drivers/tty/vt/gen_ucs_width_table.py
+++ b/drivers/tty/vt/gen_ucs_width_table.py
@@ -194,6 +194,27 @@ def write_tables(zero_width_ranges, double_width_ranges):
         double_width_ranges: List of (start, end) ranges for double-width characters
     """
 
+    # Function to split ranges into BMP (16-bit) and non-BMP (above 16-bit)
+    def split_ranges_by_size(ranges):
+        bmp_ranges = []
+        non_bmp_ranges = []
+
+        for start, end in ranges:
+            if end <= 0xFFFF:
+                bmp_ranges.append((start, end))
+            elif start > 0xFFFF:
+                non_bmp_ranges.append((start, end))
+            else:
+                # Split the range at 0xFFFF
+                bmp_ranges.append((start, 0xFFFF))
+                non_bmp_ranges.append((0x10000, end))
+
+        return bmp_ranges, non_bmp_ranges
+
+    # Split ranges into BMP and non-BMP
+    zero_width_bmp, zero_width_non_bmp = split_ranges_by_size(zero_width_ranges)
+    double_width_bmp, double_width_non_bmp = split_ranges_by_size(double_width_ranges)
+
     # Function to generate code point description comments
     def get_code_point_comment(start, end):
         try:
@@ -221,22 +242,44 @@ def write_tables(zero_width_ranges, double_width_ranges):
  * Unicode Version: {unicodedata.unidata_version}
  */
 
-/* Zero-width character ranges */
-static const struct ucs_interval ucs_zero_width_ranges[] = {{
+/* Zero-width character ranges (BMP - Basic Multilingual Plane, U+0000 to U+FFFF) */
+static const struct ucs_interval16 ucs_zero_width_bmp_ranges[] = {{
+""")
+
+        for start, end in zero_width_bmp:
+            comment = get_code_point_comment(start, end)
+            f.write(f"\t{{ 0x{start:04X}, 0x{end:04X} }}, {comment}\n")
+
+        f.write("""\
+};
+
+/* Zero-width character ranges (non-BMP, U+10000 and above) */
+static const struct ucs_interval32 ucs_zero_width_non_bmp_ranges[] = {
 """)
 
-        for start, end in zero_width_ranges:
+        for start, end in zero_width_non_bmp:
             comment = get_code_point_comment(start, end)
             f.write(f"\t{{ 0x{start:05X}, 0x{end:05X} }}, {comment}\n")
 
         f.write("""\
 };
 
-/* Double-width character ranges */
-static const struct ucs_interval ucs_double_width_ranges[] = {
+/* Double-width character ranges (BMP - Basic Multilingual Plane, U+0000 to U+FFFF) */
+static const struct ucs_interval16 ucs_double_width_bmp_ranges[] = {
+""")
+
+        for start, end in double_width_bmp:
+            comment = get_code_point_comment(start, end)
+            f.write(f"\t{{ 0x{start:04X}, 0x{end:04X} }}, {comment}\n")
+
+        f.write("""\
+};
+
+/* Double-width character ranges (non-BMP, U+10000 and above) */
+static const struct ucs_interval32 ucs_double_width_non_bmp_ranges[] = {
 """)
 
-        for start, end in double_width_ranges:
+        for start, end in double_width_non_bmp:
             comment = get_code_point_comment(start, end)
             f.write(f"\t{{ 0x{start:05X}, 0x{end:05X} }}, {comment}\n")
 
-- 
2.49.0


* [PATCH v2 13/13] vt: refresh ucs_width_table.h and adjust code in ucs.c accordingly
  2025-04-15 19:17 [PATCH v2 00/13] vt: implement proper Unicode handling Nicolas Pitre
                   ` (11 preceding siblings ...)
  2025-04-15 19:18 ` [PATCH v2 12/13] vt: update gen_ucs_width_table.py to make tables more space efficient Nicolas Pitre
@ 2025-04-15 19:18 ` Nicolas Pitre
  2025-04-16  5:12   ` Jiri Slaby
  12 siblings, 1 reply; 33+ messages in thread
From: Nicolas Pitre @ 2025-04-15 19:18 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Jiri Slaby; +Cc: Nicolas Pitre, linux-serial, linux-kernel

From: Nicolas Pitre <npitre@baylibre.com>

Width tables are now split into BMP (16-bit) and non-BMP (above 16-bit).
This reduces the corresponding text size by 20-25%.

Note: scripts/checkpatch.pl complains about "... exceeds 100 columns".
      Please ignore.

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
---
 drivers/tty/vt/ucs.c             |  54 +++-
 drivers/tty/vt/ucs_width_table.h | 540 ++++++++++++++++---------------
 2 files changed, 319 insertions(+), 275 deletions(-)
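
The resulting lookup can be modelled in user space as follows: pick the
table according to whether the code point is in the BMP, reject quickly
if it lies outside the table's overall span, then binary-search the
sorted intervals. This is a sketch, not the kernel code. The BMP entries
are the first few from the refreshed table; the single non-BMP entry
(VARIATION SELECTOR-17 to VARIATION SELECTOR-256) is an assumption about
what the generated table contains. in_intervals() and is_zero_width()
are illustrative names.

```python
from bisect import bisect_right

# Excerpts only, sorted by first code point; intervals are inclusive.
ZERO_WIDTH_BMP = [
    (0x00AD, 0x00AD),  # SOFT HYPHEN
    (0x0300, 0x036F),  # COMBINING GRAVE ACCENT - ...
    (0x0483, 0x0489),  # COMBINING CYRILLIC TITLO - ...
    (0x0591, 0x05BD),  # HEBREW ACCENT ETNAHTA - HEBREW POINT METEG
]
ZERO_WIDTH_NON_BMP = [
    (0xE0100, 0xE01EF),  # assumed entry: variation selectors supplement
]


def in_intervals(cp, intervals):
    """Binary search for cp in sorted, non-overlapping intervals."""
    # Quick rejection against the overall span of the table.
    if not intervals or cp < intervals[0][0] or cp > intervals[-1][1]:
        return False
    # Index of the last interval whose first code point is <= cp.
    i = bisect_right(intervals, (cp, 0x110000)) - 1
    return i >= 0 and intervals[i][0] <= cp <= intervals[i][1]


def is_zero_width(cp):
    """Dispatch to the 16-bit or 32-bit table, like UCS_IS_BMP()."""
    table = ZERO_WIDTH_BMP if cp <= 0xFFFF else ZERO_WIDTH_NON_BMP
    return in_intervals(cp, table)
```

In Python the two tables differ only in content; in the kernel the point
of the split is that BMP entries fit in a pair of u16.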

diff --git a/drivers/tty/vt/ucs.c b/drivers/tty/vt/ucs.c
index 07b2bd1714..710e7750f3 100644
--- a/drivers/tty/vt/ucs.c
+++ b/drivers/tty/vt/ucs.c
@@ -5,17 +5,34 @@
 #include <linux/consolemap.h>
 #include <linux/minmax.h>
 
-struct ucs_interval {
+struct ucs_interval16 {
+	u16 first;
+	u16 last;
+};
+
+struct ucs_interval32 {
 	u32 first;
 	u32 last;
 };
 
 #include "ucs_width_table.h"
 
-static int interval_cmp(const void *key, const void *element)
+static int interval16_cmp(const void *key, const void *element)
+{
+	u16 cp = *(u16 *)key;
+	const struct ucs_interval16 *entry = element;
+
+	if (cp < entry->first)
+		return -1;
+	if (cp > entry->last)
+		return 1;
+	return 0;
+}
+
+static int interval32_cmp(const void *key, const void *element)
 {
 	u32 cp = *(u32 *)key;
-	const struct ucs_interval *entry = element;
+	const struct ucs_interval32 *entry = element;
 
 	if (cp < entry->first)
 		return -1;
@@ -24,15 +41,26 @@ static int interval_cmp(const void *key, const void *element)
 	return 0;
 }
 
-static bool cp_in_range(u32 cp, const struct ucs_interval *ranges, size_t size)
+static bool cp_in_range16(u16 cp, const struct ucs_interval16 *ranges, size_t size)
 {
 	if (!in_range(cp, ranges[0].first, ranges[size - 1].last))
 		return false;
 
 	return __inline_bsearch(&cp, ranges, size, sizeof(*ranges),
-				interval_cmp) != NULL;
+				interval16_cmp) != NULL;
 }
 
+static bool cp_in_range32(u32 cp, const struct ucs_interval32 *ranges, size_t size)
+{
+	if (!in_range(cp, ranges[0].first, ranges[size - 1].last))
+		return false;
+
+	return __inline_bsearch(&cp, ranges, size, sizeof(*ranges),
+				interval32_cmp) != NULL;
+}
+
+#define UCS_IS_BMP(cp)	((cp) <= 0xffff)
+
 /**
  * Determine if a Unicode code point is zero-width.
  *
@@ -41,8 +69,12 @@ static bool cp_in_range(u32 cp, const struct ucs_interval *ranges, size_t size)
  */
 bool ucs_is_zero_width(u32 cp)
 {
-	return cp_in_range(cp, ucs_zero_width_ranges,
-			   ARRAY_SIZE(ucs_zero_width_ranges));
+	if (UCS_IS_BMP(cp))
+		return cp_in_range16(cp, ucs_zero_width_bmp_ranges,
+				     ARRAY_SIZE(ucs_zero_width_bmp_ranges));
+	else
+		return cp_in_range32(cp, ucs_zero_width_non_bmp_ranges,
+				     ARRAY_SIZE(ucs_zero_width_non_bmp_ranges));
 }
 
 /**
@@ -53,8 +85,12 @@ bool ucs_is_zero_width(u32 cp)
  */
 bool ucs_is_double_width(u32 cp)
 {
-	return cp_in_range(cp, ucs_double_width_ranges,
-			   ARRAY_SIZE(ucs_double_width_ranges));
+	if (UCS_IS_BMP(cp))
+		return cp_in_range16(cp, ucs_double_width_bmp_ranges,
+				     ARRAY_SIZE(ucs_double_width_bmp_ranges));
+	else
+		return cp_in_range32(cp, ucs_double_width_non_bmp_ranges,
+				     ARRAY_SIZE(ucs_double_width_non_bmp_ranges));
 }
 
 /*
diff --git a/drivers/tty/vt/ucs_width_table.h b/drivers/tty/vt/ucs_width_table.h
index 9cc86b5cdf..6fcb8f1d57 100644
--- a/drivers/tty/vt/ucs_width_table.h
+++ b/drivers/tty/vt/ucs_width_table.h
@@ -7,210 +7,214 @@
  * Unicode Version: 16.0.0
  */
 
-/* Zero-width character ranges */
-static const struct ucs_interval ucs_zero_width_ranges[] = {
-	{ 0x000AD, 0x000AD }, /* SOFT HYPHEN */
-	{ 0x00300, 0x0036F }, /* COMBINING GRAVE ACCENT - COMBINING LATIN SMALL LETTER X */
-	{ 0x00483, 0x00489 }, /* COMBINING CYRILLIC TITLO - COMBINING CYRILLIC MILLIONS SIGN */
-	{ 0x00591, 0x005BD }, /* HEBREW ACCENT ETNAHTA - HEBREW POINT METEG */
-	{ 0x005BF, 0x005BF }, /* HEBREW POINT RAFE */
-	{ 0x005C1, 0x005C2 }, /* HEBREW POINT SHIN DOT - HEBREW POINT SIN DOT */
-	{ 0x005C4, 0x005C5 }, /* HEBREW MARK UPPER DOT - HEBREW MARK LOWER DOT */
-	{ 0x005C7, 0x005C7 }, /* HEBREW POINT QAMATS QATAN */
-	{ 0x00600, 0x00605 }, /* ARABIC NUMBER SIGN - ARABIC NUMBER MARK ABOVE */
-	{ 0x00610, 0x0061A }, /* ARABIC SIGN SALLALLAHOU ALAYHE WASSALLAM - ARABIC SMALL KASRA */
-	{ 0x0061C, 0x0061C }, /* ARABIC LETTER MARK */
-	{ 0x0064B, 0x0065F }, /* ARABIC FATHATAN - ARABIC WAVY HAMZA BELOW */
-	{ 0x00670, 0x00670 }, /* ARABIC LETTER SUPERSCRIPT ALEF */
-	{ 0x006D6, 0x006DD }, /* ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA - ARABIC END OF AYAH */
-	{ 0x006DF, 0x006E4 }, /* ARABIC SMALL HIGH ROUNDED ZERO - ARABIC SMALL HIGH MADDA */
-	{ 0x006E7, 0x006E8 }, /* ARABIC SMALL HIGH YEH - ARABIC SMALL HIGH NOON */
-	{ 0x006EA, 0x006ED }, /* ARABIC EMPTY CENTRE LOW STOP - ARABIC SMALL LOW MEEM */
-	{ 0x0070F, 0x0070F }, /* SYRIAC ABBREVIATION MARK */
-	{ 0x00711, 0x00711 }, /* SYRIAC LETTER SUPERSCRIPT ALAPH */
-	{ 0x00730, 0x0074A }, /* SYRIAC PTHAHA ABOVE - SYRIAC BARREKH */
-	{ 0x007A6, 0x007B0 }, /* THAANA ABAFILI - THAANA SUKUN */
-	{ 0x007EB, 0x007F3 }, /* NKO COMBINING SHORT HIGH TONE - NKO COMBINING DOUBLE DOT ABOVE */
-	{ 0x007FD, 0x007FD }, /* NKO DANTAYALAN */
-	{ 0x00816, 0x00819 }, /* SAMARITAN MARK IN - SAMARITAN MARK DAGESH */
-	{ 0x0081B, 0x00823 }, /* SAMARITAN MARK EPENTHETIC YUT - SAMARITAN VOWEL SIGN A */
-	{ 0x00825, 0x00827 }, /* SAMARITAN VOWEL SIGN SHORT A - SAMARITAN VOWEL SIGN U */
-	{ 0x00829, 0x0082D }, /* SAMARITAN VOWEL SIGN LONG I - SAMARITAN MARK NEQUDAA */
-	{ 0x00859, 0x0085B }, /* MANDAIC AFFRICATION MARK - MANDAIC GEMINATION MARK */
-	{ 0x00890, 0x00891 }, /* ARABIC POUND MARK ABOVE - ARABIC PIASTRE MARK ABOVE */
-	{ 0x00897, 0x0089F }, /* ARABIC PEPET - ARABIC HALF MADDA OVER MADDA */
-	{ 0x008CA, 0x00903 }, /* ARABIC SMALL HIGH FARSI YEH - DEVANAGARI SIGN VISARGA */
-	{ 0x0093A, 0x0093C }, /* DEVANAGARI VOWEL SIGN OE - DEVANAGARI SIGN NUKTA */
-	{ 0x0093E, 0x0094F }, /* DEVANAGARI VOWEL SIGN AA - DEVANAGARI VOWEL SIGN AW */
-	{ 0x00951, 0x00957 }, /* DEVANAGARI STRESS SIGN UDATTA - DEVANAGARI VOWEL SIGN UUE */
-	{ 0x00962, 0x00963 }, /* DEVANAGARI VOWEL SIGN VOCALIC L - DEVANAGARI VOWEL SIGN VOCALIC LL */
-	{ 0x00981, 0x00983 }, /* BENGALI SIGN CANDRABINDU - BENGALI SIGN VISARGA */
-	{ 0x009BC, 0x009BC }, /* BENGALI SIGN NUKTA */
-	{ 0x009BE, 0x009C4 }, /* BENGALI VOWEL SIGN AA - BENGALI VOWEL SIGN VOCALIC RR */
-	{ 0x009C7, 0x009C8 }, /* BENGALI VOWEL SIGN E - BENGALI VOWEL SIGN AI */
-	{ 0x009CB, 0x009CD }, /* BENGALI VOWEL SIGN O - BENGALI SIGN VIRAMA */
-	{ 0x009D7, 0x009D7 }, /* BENGALI AU LENGTH MARK */
-	{ 0x009E2, 0x009E3 }, /* BENGALI VOWEL SIGN VOCALIC L - BENGALI VOWEL SIGN VOCALIC LL */
-	{ 0x009FE, 0x009FE }, /* BENGALI SANDHI MARK */
-	{ 0x00A01, 0x00A03 }, /* GURMUKHI SIGN ADAK BINDI - GURMUKHI SIGN VISARGA */
-	{ 0x00A3C, 0x00A3C }, /* GURMUKHI SIGN NUKTA */
-	{ 0x00A3E, 0x00A42 }, /* GURMUKHI VOWEL SIGN AA - GURMUKHI VOWEL SIGN UU */
-	{ 0x00A47, 0x00A48 }, /* GURMUKHI VOWEL SIGN EE - GURMUKHI VOWEL SIGN AI */
-	{ 0x00A4B, 0x00A4D }, /* GURMUKHI VOWEL SIGN OO - GURMUKHI SIGN VIRAMA */
-	{ 0x00A51, 0x00A51 }, /* GURMUKHI SIGN UDAAT */
-	{ 0x00A70, 0x00A71 }, /* GURMUKHI TIPPI - GURMUKHI ADDAK */
-	{ 0x00A75, 0x00A75 }, /* GURMUKHI SIGN YAKASH */
-	{ 0x00A81, 0x00A83 }, /* GUJARATI SIGN CANDRABINDU - GUJARATI SIGN VISARGA */
-	{ 0x00ABC, 0x00ABC }, /* GUJARATI SIGN NUKTA */
-	{ 0x00ABE, 0x00AC5 }, /* GUJARATI VOWEL SIGN AA - GUJARATI VOWEL SIGN CANDRA E */
-	{ 0x00AC7, 0x00AC9 }, /* GUJARATI VOWEL SIGN E - GUJARATI VOWEL SIGN CANDRA O */
-	{ 0x00ACB, 0x00ACD }, /* GUJARATI VOWEL SIGN O - GUJARATI SIGN VIRAMA */
-	{ 0x00AE2, 0x00AE3 }, /* GUJARATI VOWEL SIGN VOCALIC L - GUJARATI VOWEL SIGN VOCALIC LL */
-	{ 0x00AFA, 0x00AFF }, /* GUJARATI SIGN SUKUN - GUJARATI SIGN TWO-CIRCLE NUKTA ABOVE */
-	{ 0x00B01, 0x00B03 }, /* ORIYA SIGN CANDRABINDU - ORIYA SIGN VISARGA */
-	{ 0x00B3C, 0x00B3C }, /* ORIYA SIGN NUKTA */
-	{ 0x00B3E, 0x00B44 }, /* ORIYA VOWEL SIGN AA - ORIYA VOWEL SIGN VOCALIC RR */
-	{ 0x00B47, 0x00B48 }, /* ORIYA VOWEL SIGN E - ORIYA VOWEL SIGN AI */
-	{ 0x00B4B, 0x00B4D }, /* ORIYA VOWEL SIGN O - ORIYA SIGN VIRAMA */
-	{ 0x00B55, 0x00B57 }, /* ORIYA SIGN OVERLINE - ORIYA AU LENGTH MARK */
-	{ 0x00B62, 0x00B63 }, /* ORIYA VOWEL SIGN VOCALIC L - ORIYA VOWEL SIGN VOCALIC LL */
-	{ 0x00B82, 0x00B82 }, /* TAMIL SIGN ANUSVARA */
-	{ 0x00BBE, 0x00BC2 }, /* TAMIL VOWEL SIGN AA - TAMIL VOWEL SIGN UU */
-	{ 0x00BC6, 0x00BC8 }, /* TAMIL VOWEL SIGN E - TAMIL VOWEL SIGN AI */
-	{ 0x00BCA, 0x00BCD }, /* TAMIL VOWEL SIGN O - TAMIL SIGN VIRAMA */
-	{ 0x00BD7, 0x00BD7 }, /* TAMIL AU LENGTH MARK */
-	{ 0x00C00, 0x00C04 }, /* TELUGU SIGN COMBINING CANDRABINDU ABOVE - TELUGU SIGN COMBINING ANUSVARA ABOVE */
-	{ 0x00C3C, 0x00C3C }, /* TELUGU SIGN NUKTA */
-	{ 0x00C3E, 0x00C44 }, /* TELUGU VOWEL SIGN AA - TELUGU VOWEL SIGN VOCALIC RR */
-	{ 0x00C46, 0x00C48 }, /* TELUGU VOWEL SIGN E - TELUGU VOWEL SIGN AI */
-	{ 0x00C4A, 0x00C4D }, /* TELUGU VOWEL SIGN O - TELUGU SIGN VIRAMA */
-	{ 0x00C55, 0x00C56 }, /* TELUGU LENGTH MARK - TELUGU AI LENGTH MARK */
-	{ 0x00C62, 0x00C63 }, /* TELUGU VOWEL SIGN VOCALIC L - TELUGU VOWEL SIGN VOCALIC LL */
-	{ 0x00C81, 0x00C83 }, /* KANNADA SIGN CANDRABINDU - KANNADA SIGN VISARGA */
-	{ 0x00CBC, 0x00CBC }, /* KANNADA SIGN NUKTA */
-	{ 0x00CBE, 0x00CC4 }, /* KANNADA VOWEL SIGN AA - KANNADA VOWEL SIGN VOCALIC RR */
-	{ 0x00CC6, 0x00CC8 }, /* KANNADA VOWEL SIGN E - KANNADA VOWEL SIGN AI */
-	{ 0x00CCA, 0x00CCD }, /* KANNADA VOWEL SIGN O - KANNADA SIGN VIRAMA */
-	{ 0x00CD5, 0x00CD6 }, /* KANNADA LENGTH MARK - KANNADA AI LENGTH MARK */
-	{ 0x00CE2, 0x00CE3 }, /* KANNADA VOWEL SIGN VOCALIC L - KANNADA VOWEL SIGN VOCALIC LL */
-	{ 0x00CF3, 0x00CF3 }, /* KANNADA SIGN COMBINING ANUSVARA ABOVE RIGHT */
-	{ 0x00D00, 0x00D03 }, /* MALAYALAM SIGN COMBINING ANUSVARA ABOVE - MALAYALAM SIGN VISARGA */
-	{ 0x00D3B, 0x00D3C }, /* MALAYALAM SIGN VERTICAL BAR VIRAMA - MALAYALAM SIGN CIRCULAR VIRAMA */
-	{ 0x00D3E, 0x00D44 }, /* MALAYALAM VOWEL SIGN AA - MALAYALAM VOWEL SIGN VOCALIC RR */
-	{ 0x00D46, 0x00D48 }, /* MALAYALAM VOWEL SIGN E - MALAYALAM VOWEL SIGN AI */
-	{ 0x00D4A, 0x00D4D }, /* MALAYALAM VOWEL SIGN O - MALAYALAM SIGN VIRAMA */
-	{ 0x00D57, 0x00D57 }, /* MALAYALAM AU LENGTH MARK */
-	{ 0x00D62, 0x00D63 }, /* MALAYALAM VOWEL SIGN VOCALIC L - MALAYALAM VOWEL SIGN VOCALIC LL */
-	{ 0x00D81, 0x00D83 }, /* SINHALA SIGN CANDRABINDU - SINHALA SIGN VISARGAYA */
-	{ 0x00DCA, 0x00DCA }, /* SINHALA SIGN AL-LAKUNA */
-	{ 0x00DCF, 0x00DD4 }, /* SINHALA VOWEL SIGN AELA-PILLA - SINHALA VOWEL SIGN KETTI PAA-PILLA */
-	{ 0x00DD6, 0x00DD6 }, /* SINHALA VOWEL SIGN DIGA PAA-PILLA */
-	{ 0x00DD8, 0x00DDF }, /* SINHALA VOWEL SIGN GAETTA-PILLA - SINHALA VOWEL SIGN GAYANUKITTA */
-	{ 0x00DF2, 0x00DF3 }, /* SINHALA VOWEL SIGN DIGA GAETTA-PILLA - SINHALA VOWEL SIGN DIGA GAYANUKITTA */
-	{ 0x00E31, 0x00E31 }, /* THAI CHARACTER MAI HAN-AKAT */
-	{ 0x00E34, 0x00E3A }, /* THAI CHARACTER SARA I - THAI CHARACTER PHINTHU */
-	{ 0x00E47, 0x00E4E }, /* THAI CHARACTER MAITAIKHU - THAI CHARACTER YAMAKKAN */
-	{ 0x00EB1, 0x00EB1 }, /* LAO VOWEL SIGN MAI KAN */
-	{ 0x00EB4, 0x00EBC }, /* LAO VOWEL SIGN I - LAO SEMIVOWEL SIGN LO */
-	{ 0x00EC8, 0x00ECE }, /* LAO TONE MAI EK - LAO YAMAKKAN */
-	{ 0x00F18, 0x00F19 }, /* TIBETAN ASTROLOGICAL SIGN -KHYUD PA - TIBETAN ASTROLOGICAL SIGN SDONG TSHUGS */
-	{ 0x00F35, 0x00F35 }, /* TIBETAN MARK NGAS BZUNG NYI ZLA */
-	{ 0x00F37, 0x00F37 }, /* TIBETAN MARK NGAS BZUNG SGOR RTAGS */
-	{ 0x00F39, 0x00F39 }, /* TIBETAN MARK TSA -PHRU */
-	{ 0x00F3E, 0x00F3F }, /* TIBETAN SIGN YAR TSHES - TIBETAN SIGN MAR TSHES */
-	{ 0x00F71, 0x00F84 }, /* TIBETAN VOWEL SIGN AA - TIBETAN MARK HALANTA */
-	{ 0x00F86, 0x00F87 }, /* TIBETAN SIGN LCI RTAGS - TIBETAN SIGN YANG RTAGS */
-	{ 0x00F8D, 0x00F97 }, /* TIBETAN SUBJOINED SIGN LCE TSA CAN - TIBETAN SUBJOINED LETTER JA */
-	{ 0x00F99, 0x00FBC }, /* TIBETAN SUBJOINED LETTER NYA - TIBETAN SUBJOINED LETTER FIXED-FORM RA */
-	{ 0x00FC6, 0x00FC6 }, /* TIBETAN SYMBOL PADMA GDAN */
-	{ 0x0102B, 0x0103E }, /* MYANMAR VOWEL SIGN TALL AA - MYANMAR CONSONANT SIGN MEDIAL HA */
-	{ 0x01056, 0x01059 }, /* MYANMAR VOWEL SIGN VOCALIC R - MYANMAR VOWEL SIGN VOCALIC LL */
-	{ 0x0105E, 0x01060 }, /* MYANMAR CONSONANT SIGN MON MEDIAL NA - MYANMAR CONSONANT SIGN MON MEDIAL LA */
-	{ 0x01062, 0x01064 }, /* MYANMAR VOWEL SIGN SGAW KAREN EU - MYANMAR TONE MARK SGAW KAREN KE PHO */
-	{ 0x01067, 0x0106D }, /* MYANMAR VOWEL SIGN WESTERN PWO KAREN EU - MYANMAR SIGN WESTERN PWO KAREN TONE-5 */
-	{ 0x01071, 0x01074 }, /* MYANMAR VOWEL SIGN GEBA KAREN I - MYANMAR VOWEL SIGN KAYAH EE */
-	{ 0x01082, 0x0108D }, /* MYANMAR CONSONANT SIGN SHAN MEDIAL WA - MYANMAR SIGN SHAN COUNCIL EMPHATIC TONE */
-	{ 0x0108F, 0x0108F }, /* MYANMAR SIGN RUMAI PALAUNG TONE-5 */
-	{ 0x0109A, 0x0109D }, /* MYANMAR SIGN KHAMTI TONE-1 - MYANMAR VOWEL SIGN AITON AI */
-	{ 0x0135D, 0x0135F }, /* ETHIOPIC COMBINING GEMINATION AND VOWEL LENGTH MARK - ETHIOPIC COMBINING GEMINATION MARK */
-	{ 0x01712, 0x01715 }, /* TAGALOG VOWEL SIGN I - TAGALOG SIGN PAMUDPOD */
-	{ 0x01732, 0x01734 }, /* HANUNOO VOWEL SIGN I - HANUNOO SIGN PAMUDPOD */
-	{ 0x01752, 0x01753 }, /* BUHID VOWEL SIGN I - BUHID VOWEL SIGN U */
-	{ 0x01772, 0x01773 }, /* TAGBANWA VOWEL SIGN I - TAGBANWA VOWEL SIGN U */
-	{ 0x017B4, 0x017D3 }, /* KHMER VOWEL INHERENT AQ - KHMER SIGN BATHAMASAT */
-	{ 0x017DD, 0x017DD }, /* KHMER SIGN ATTHACAN */
-	{ 0x0180B, 0x0180F }, /* MONGOLIAN FREE VARIATION SELECTOR ONE - MONGOLIAN FREE VARIATION SELECTOR FOUR */
-	{ 0x01885, 0x01886 }, /* MONGOLIAN LETTER ALI GALI BALUDA - MONGOLIAN LETTER ALI GALI THREE BALUDA */
-	{ 0x018A9, 0x018A9 }, /* MONGOLIAN LETTER ALI GALI DAGALGA */
-	{ 0x01920, 0x0192B }, /* LIMBU VOWEL SIGN A - LIMBU SUBJOINED LETTER WA */
-	{ 0x01930, 0x0193B }, /* LIMBU SMALL LETTER KA - LIMBU SIGN SA-I */
-	{ 0x01A17, 0x01A1B }, /* BUGINESE VOWEL SIGN I - BUGINESE VOWEL SIGN AE */
-	{ 0x01A55, 0x01A5E }, /* TAI THAM CONSONANT SIGN MEDIAL RA - TAI THAM CONSONANT SIGN SA */
-	{ 0x01A60, 0x01A7C }, /* TAI THAM SIGN SAKOT - TAI THAM SIGN KHUEN-LUE KARAN */
-	{ 0x01A7F, 0x01A7F }, /* TAI THAM COMBINING CRYPTOGRAMMIC DOT */
-	{ 0x01AB0, 0x01ACE }, /* COMBINING DOUBLED CIRCUMFLEX ACCENT - COMBINING LATIN SMALL LETTER INSULAR T */
-	{ 0x01B00, 0x01B04 }, /* BALINESE SIGN ULU RICEM - BALINESE SIGN BISAH */
-	{ 0x01B34, 0x01B44 }, /* BALINESE SIGN REREKAN - BALINESE ADEG ADEG */
-	{ 0x01B6B, 0x01B73 }, /* BALINESE MUSICAL SYMBOL COMBINING TEGEH - BALINESE MUSICAL SYMBOL COMBINING GONG */
-	{ 0x01B80, 0x01B82 }, /* SUNDANESE SIGN PANYECEK - SUNDANESE SIGN PANGWISAD */
-	{ 0x01BA1, 0x01BAD }, /* SUNDANESE CONSONANT SIGN PAMINGKAL - SUNDANESE CONSONANT SIGN PASANGAN WA */
-	{ 0x01BE6, 0x01BF3 }, /* BATAK SIGN TOMPI - BATAK PANONGONAN */
-	{ 0x01C24, 0x01C37 }, /* LEPCHA SUBJOINED LETTER YA - LEPCHA SIGN NUKTA */
-	{ 0x01CD0, 0x01CD2 }, /* VEDIC TONE KARSHANA - VEDIC TONE PRENKHA */
-	{ 0x01CD4, 0x01CE8 }, /* VEDIC SIGN YAJURVEDIC MIDLINE SVARITA - VEDIC SIGN VISARGA ANUDATTA WITH TAIL */
-	{ 0x01CED, 0x01CED }, /* VEDIC SIGN TIRYAK */
-	{ 0x01CF4, 0x01CF4 }, /* VEDIC TONE CANDRA ABOVE */
-	{ 0x01CF7, 0x01CF9 }, /* VEDIC SIGN ATIKRAMA - VEDIC TONE DOUBLE RING ABOVE */
-	{ 0x01DC0, 0x01DFF }, /* COMBINING DOTTED GRAVE ACCENT - COMBINING RIGHT ARROWHEAD AND DOWN ARROWHEAD BELOW */
-	{ 0x0200B, 0x0200F }, /* ZERO WIDTH SPACE - RIGHT-TO-LEFT MARK */
-	{ 0x0202A, 0x0202E }, /* LEFT-TO-RIGHT EMBEDDING - RIGHT-TO-LEFT OVERRIDE */
-	{ 0x02060, 0x02064 }, /* WORD JOINER - INVISIBLE PLUS */
-	{ 0x02066, 0x0206F }, /* LEFT-TO-RIGHT ISOLATE - NOMINAL DIGIT SHAPES */
-	{ 0x020D0, 0x020F0 }, /* COMBINING LEFT HARPOON ABOVE - COMBINING ASTERISK ABOVE */
-	{ 0x02640, 0x02640 }, /* FEMALE SIGN */
-	{ 0x02642, 0x02642 }, /* MALE SIGN */
-	{ 0x026A7, 0x026A7 }, /* MALE WITH STROKE AND MALE AND FEMALE SIGN */
-	{ 0x02CEF, 0x02CF1 }, /* COPTIC COMBINING NI ABOVE - COPTIC COMBINING SPIRITUS LENIS */
-	{ 0x02D7F, 0x02D7F }, /* TIFINAGH CONSONANT JOINER */
-	{ 0x02DE0, 0x02DFF }, /* COMBINING CYRILLIC LETTER BE - COMBINING CYRILLIC LETTER IOTIFIED BIG YUS */
-	{ 0x0302A, 0x0302F }, /* IDEOGRAPHIC LEVEL TONE MARK - HANGUL DOUBLE DOT TONE MARK */
-	{ 0x03099, 0x0309A }, /* COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK - COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK */
-	{ 0x0A66F, 0x0A672 }, /* COMBINING CYRILLIC VZMET - COMBINING CYRILLIC THOUSAND MILLIONS SIGN */
-	{ 0x0A674, 0x0A67D }, /* COMBINING CYRILLIC LETTER UKRAINIAN IE - COMBINING CYRILLIC PAYEROK */
-	{ 0x0A69E, 0x0A69F }, /* COMBINING CYRILLIC LETTER EF - COMBINING CYRILLIC LETTER IOTIFIED E */
-	{ 0x0A6F0, 0x0A6F1 }, /* BAMUM COMBINING MARK KOQNDON - BAMUM COMBINING MARK TUKWENTIS */
-	{ 0x0A802, 0x0A802 }, /* SYLOTI NAGRI SIGN DVISVARA */
-	{ 0x0A806, 0x0A806 }, /* SYLOTI NAGRI SIGN HASANTA */
-	{ 0x0A80B, 0x0A80B }, /* SYLOTI NAGRI SIGN ANUSVARA */
-	{ 0x0A823, 0x0A827 }, /* SYLOTI NAGRI VOWEL SIGN A - SYLOTI NAGRI VOWEL SIGN OO */
-	{ 0x0A82C, 0x0A82C }, /* SYLOTI NAGRI SIGN ALTERNATE HASANTA */
-	{ 0x0A880, 0x0A881 }, /* SAURASHTRA SIGN ANUSVARA - SAURASHTRA SIGN VISARGA */
-	{ 0x0A8B4, 0x0A8C5 }, /* SAURASHTRA CONSONANT SIGN HAARU - SAURASHTRA SIGN CANDRABINDU */
-	{ 0x0A8E0, 0x0A8F1 }, /* COMBINING DEVANAGARI DIGIT ZERO - COMBINING DEVANAGARI SIGN AVAGRAHA */
-	{ 0x0A8FF, 0x0A8FF }, /* DEVANAGARI VOWEL SIGN AY */
-	{ 0x0A926, 0x0A92D }, /* KAYAH LI VOWEL UE - KAYAH LI TONE CALYA PLOPHU */
-	{ 0x0A947, 0x0A953 }, /* REJANG VOWEL SIGN I - REJANG VIRAMA */
-	{ 0x0A980, 0x0A983 }, /* JAVANESE SIGN PANYANGGA - JAVANESE SIGN WIGNYAN */
-	{ 0x0A9B3, 0x0A9C0 }, /* JAVANESE SIGN CECAK TELU - JAVANESE PANGKON */
-	{ 0x0A9E5, 0x0A9E5 }, /* MYANMAR SIGN SHAN SAW */
-	{ 0x0AA29, 0x0AA36 }, /* CHAM VOWEL SIGN AA - CHAM CONSONANT SIGN WA */
-	{ 0x0AA43, 0x0AA43 }, /* CHAM CONSONANT SIGN FINAL NG */
-	{ 0x0AA4C, 0x0AA4D }, /* CHAM CONSONANT SIGN FINAL M - CHAM CONSONANT SIGN FINAL H */
-	{ 0x0AA7B, 0x0AA7D }, /* MYANMAR SIGN PAO KAREN TONE - MYANMAR SIGN TAI LAING TONE-5 */
-	{ 0x0AAB0, 0x0AAB0 }, /* TAI VIET MAI KANG */
-	{ 0x0AAB2, 0x0AAB4 }, /* TAI VIET VOWEL I - TAI VIET VOWEL U */
-	{ 0x0AAB7, 0x0AAB8 }, /* TAI VIET MAI KHIT - TAI VIET VOWEL IA */
-	{ 0x0AABE, 0x0AABF }, /* TAI VIET VOWEL AM - TAI VIET TONE MAI EK */
-	{ 0x0AAC1, 0x0AAC1 }, /* TAI VIET TONE MAI THO */
-	{ 0x0AAEB, 0x0AAEF }, /* MEETEI MAYEK VOWEL SIGN II - MEETEI MAYEK VOWEL SIGN AAU */
-	{ 0x0AAF5, 0x0AAF6 }, /* MEETEI MAYEK VOWEL SIGN VISARGA - MEETEI MAYEK VIRAMA */
-	{ 0x0ABE3, 0x0ABEA }, /* MEETEI MAYEK VOWEL SIGN ONAP - MEETEI MAYEK VOWEL SIGN NUNG */
-	{ 0x0ABEC, 0x0ABED }, /* MEETEI MAYEK LUM IYEK - MEETEI MAYEK APUN IYEK */
-	{ 0x0FB1E, 0x0FB1E }, /* HEBREW POINT JUDEO-SPANISH VARIKA */
-	{ 0x0FE00, 0x0FE0F }, /* VARIATION SELECTOR-1 - VARIATION SELECTOR-16 */
-	{ 0x0FE20, 0x0FE2F }, /* COMBINING LIGATURE LEFT HALF - COMBINING CYRILLIC TITLO RIGHT HALF */
-	{ 0x0FEFF, 0x0FEFF }, /* ZERO WIDTH NO-BREAK SPACE */
-	{ 0x0FFF9, 0x0FFFB }, /* INTERLINEAR ANNOTATION ANCHOR - INTERLINEAR ANNOTATION TERMINATOR */
+/* Zero-width character ranges (BMP - Basic Multilingual Plane, U+0000 to U+FFFF) */
+static const struct ucs_interval16 ucs_zero_width_bmp_ranges[] = {
+	{ 0x00AD, 0x00AD }, /* SOFT HYPHEN */
+	{ 0x0300, 0x036F }, /* COMBINING GRAVE ACCENT - COMBINING LATIN SMALL LETTER X */
+	{ 0x0483, 0x0489 }, /* COMBINING CYRILLIC TITLO - COMBINING CYRILLIC MILLIONS SIGN */
+	{ 0x0591, 0x05BD }, /* HEBREW ACCENT ETNAHTA - HEBREW POINT METEG */
+	{ 0x05BF, 0x05BF }, /* HEBREW POINT RAFE */
+	{ 0x05C1, 0x05C2 }, /* HEBREW POINT SHIN DOT - HEBREW POINT SIN DOT */
+	{ 0x05C4, 0x05C5 }, /* HEBREW MARK UPPER DOT - HEBREW MARK LOWER DOT */
+	{ 0x05C7, 0x05C7 }, /* HEBREW POINT QAMATS QATAN */
+	{ 0x0600, 0x0605 }, /* ARABIC NUMBER SIGN - ARABIC NUMBER MARK ABOVE */
+	{ 0x0610, 0x061A }, /* ARABIC SIGN SALLALLAHOU ALAYHE WASSALLAM - ARABIC SMALL KASRA */
+	{ 0x061C, 0x061C }, /* ARABIC LETTER MARK */
+	{ 0x064B, 0x065F }, /* ARABIC FATHATAN - ARABIC WAVY HAMZA BELOW */
+	{ 0x0670, 0x0670 }, /* ARABIC LETTER SUPERSCRIPT ALEF */
+	{ 0x06D6, 0x06DD }, /* ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA - ARABIC END OF AYAH */
+	{ 0x06DF, 0x06E4 }, /* ARABIC SMALL HIGH ROUNDED ZERO - ARABIC SMALL HIGH MADDA */
+	{ 0x06E7, 0x06E8 }, /* ARABIC SMALL HIGH YEH - ARABIC SMALL HIGH NOON */
+	{ 0x06EA, 0x06ED }, /* ARABIC EMPTY CENTRE LOW STOP - ARABIC SMALL LOW MEEM */
+	{ 0x070F, 0x070F }, /* SYRIAC ABBREVIATION MARK */
+	{ 0x0711, 0x0711 }, /* SYRIAC LETTER SUPERSCRIPT ALAPH */
+	{ 0x0730, 0x074A }, /* SYRIAC PTHAHA ABOVE - SYRIAC BARREKH */
+	{ 0x07A6, 0x07B0 }, /* THAANA ABAFILI - THAANA SUKUN */
+	{ 0x07EB, 0x07F3 }, /* NKO COMBINING SHORT HIGH TONE - NKO COMBINING DOUBLE DOT ABOVE */
+	{ 0x07FD, 0x07FD }, /* NKO DANTAYALAN */
+	{ 0x0816, 0x0819 }, /* SAMARITAN MARK IN - SAMARITAN MARK DAGESH */
+	{ 0x081B, 0x0823 }, /* SAMARITAN MARK EPENTHETIC YUT - SAMARITAN VOWEL SIGN A */
+	{ 0x0825, 0x0827 }, /* SAMARITAN VOWEL SIGN SHORT A - SAMARITAN VOWEL SIGN U */
+	{ 0x0829, 0x082D }, /* SAMARITAN VOWEL SIGN LONG I - SAMARITAN MARK NEQUDAA */
+	{ 0x0859, 0x085B }, /* MANDAIC AFFRICATION MARK - MANDAIC GEMINATION MARK */
+	{ 0x0890, 0x0891 }, /* ARABIC POUND MARK ABOVE - ARABIC PIASTRE MARK ABOVE */
+	{ 0x0897, 0x089F }, /* ARABIC PEPET - ARABIC HALF MADDA OVER MADDA */
+	{ 0x08CA, 0x0903 }, /* ARABIC SMALL HIGH FARSI YEH - DEVANAGARI SIGN VISARGA */
+	{ 0x093A, 0x093C }, /* DEVANAGARI VOWEL SIGN OE - DEVANAGARI SIGN NUKTA */
+	{ 0x093E, 0x094F }, /* DEVANAGARI VOWEL SIGN AA - DEVANAGARI VOWEL SIGN AW */
+	{ 0x0951, 0x0957 }, /* DEVANAGARI STRESS SIGN UDATTA - DEVANAGARI VOWEL SIGN UUE */
+	{ 0x0962, 0x0963 }, /* DEVANAGARI VOWEL SIGN VOCALIC L - DEVANAGARI VOWEL SIGN VOCALIC LL */
+	{ 0x0981, 0x0983 }, /* BENGALI SIGN CANDRABINDU - BENGALI SIGN VISARGA */
+	{ 0x09BC, 0x09BC }, /* BENGALI SIGN NUKTA */
+	{ 0x09BE, 0x09C4 }, /* BENGALI VOWEL SIGN AA - BENGALI VOWEL SIGN VOCALIC RR */
+	{ 0x09C7, 0x09C8 }, /* BENGALI VOWEL SIGN E - BENGALI VOWEL SIGN AI */
+	{ 0x09CB, 0x09CD }, /* BENGALI VOWEL SIGN O - BENGALI SIGN VIRAMA */
+	{ 0x09D7, 0x09D7 }, /* BENGALI AU LENGTH MARK */
+	{ 0x09E2, 0x09E3 }, /* BENGALI VOWEL SIGN VOCALIC L - BENGALI VOWEL SIGN VOCALIC LL */
+	{ 0x09FE, 0x09FE }, /* BENGALI SANDHI MARK */
+	{ 0x0A01, 0x0A03 }, /* GURMUKHI SIGN ADAK BINDI - GURMUKHI SIGN VISARGA */
+	{ 0x0A3C, 0x0A3C }, /* GURMUKHI SIGN NUKTA */
+	{ 0x0A3E, 0x0A42 }, /* GURMUKHI VOWEL SIGN AA - GURMUKHI VOWEL SIGN UU */
+	{ 0x0A47, 0x0A48 }, /* GURMUKHI VOWEL SIGN EE - GURMUKHI VOWEL SIGN AI */
+	{ 0x0A4B, 0x0A4D }, /* GURMUKHI VOWEL SIGN OO - GURMUKHI SIGN VIRAMA */
+	{ 0x0A51, 0x0A51 }, /* GURMUKHI SIGN UDAAT */
+	{ 0x0A70, 0x0A71 }, /* GURMUKHI TIPPI - GURMUKHI ADDAK */
+	{ 0x0A75, 0x0A75 }, /* GURMUKHI SIGN YAKASH */
+	{ 0x0A81, 0x0A83 }, /* GUJARATI SIGN CANDRABINDU - GUJARATI SIGN VISARGA */
+	{ 0x0ABC, 0x0ABC }, /* GUJARATI SIGN NUKTA */
+	{ 0x0ABE, 0x0AC5 }, /* GUJARATI VOWEL SIGN AA - GUJARATI VOWEL SIGN CANDRA E */
+	{ 0x0AC7, 0x0AC9 }, /* GUJARATI VOWEL SIGN E - GUJARATI VOWEL SIGN CANDRA O */
+	{ 0x0ACB, 0x0ACD }, /* GUJARATI VOWEL SIGN O - GUJARATI SIGN VIRAMA */
+	{ 0x0AE2, 0x0AE3 }, /* GUJARATI VOWEL SIGN VOCALIC L - GUJARATI VOWEL SIGN VOCALIC LL */
+	{ 0x0AFA, 0x0AFF }, /* GUJARATI SIGN SUKUN - GUJARATI SIGN TWO-CIRCLE NUKTA ABOVE */
+	{ 0x0B01, 0x0B03 }, /* ORIYA SIGN CANDRABINDU - ORIYA SIGN VISARGA */
+	{ 0x0B3C, 0x0B3C }, /* ORIYA SIGN NUKTA */
+	{ 0x0B3E, 0x0B44 }, /* ORIYA VOWEL SIGN AA - ORIYA VOWEL SIGN VOCALIC RR */
+	{ 0x0B47, 0x0B48 }, /* ORIYA VOWEL SIGN E - ORIYA VOWEL SIGN AI */
+	{ 0x0B4B, 0x0B4D }, /* ORIYA VOWEL SIGN O - ORIYA SIGN VIRAMA */
+	{ 0x0B55, 0x0B57 }, /* ORIYA SIGN OVERLINE - ORIYA AU LENGTH MARK */
+	{ 0x0B62, 0x0B63 }, /* ORIYA VOWEL SIGN VOCALIC L - ORIYA VOWEL SIGN VOCALIC LL */
+	{ 0x0B82, 0x0B82 }, /* TAMIL SIGN ANUSVARA */
+	{ 0x0BBE, 0x0BC2 }, /* TAMIL VOWEL SIGN AA - TAMIL VOWEL SIGN UU */
+	{ 0x0BC6, 0x0BC8 }, /* TAMIL VOWEL SIGN E - TAMIL VOWEL SIGN AI */
+	{ 0x0BCA, 0x0BCD }, /* TAMIL VOWEL SIGN O - TAMIL SIGN VIRAMA */
+	{ 0x0BD7, 0x0BD7 }, /* TAMIL AU LENGTH MARK */
+	{ 0x0C00, 0x0C04 }, /* TELUGU SIGN COMBINING CANDRABINDU ABOVE - TELUGU SIGN COMBINING ANUSVARA ABOVE */
+	{ 0x0C3C, 0x0C3C }, /* TELUGU SIGN NUKTA */
+	{ 0x0C3E, 0x0C44 }, /* TELUGU VOWEL SIGN AA - TELUGU VOWEL SIGN VOCALIC RR */
+	{ 0x0C46, 0x0C48 }, /* TELUGU VOWEL SIGN E - TELUGU VOWEL SIGN AI */
+	{ 0x0C4A, 0x0C4D }, /* TELUGU VOWEL SIGN O - TELUGU SIGN VIRAMA */
+	{ 0x0C55, 0x0C56 }, /* TELUGU LENGTH MARK - TELUGU AI LENGTH MARK */
+	{ 0x0C62, 0x0C63 }, /* TELUGU VOWEL SIGN VOCALIC L - TELUGU VOWEL SIGN VOCALIC LL */
+	{ 0x0C81, 0x0C83 }, /* KANNADA SIGN CANDRABINDU - KANNADA SIGN VISARGA */
+	{ 0x0CBC, 0x0CBC }, /* KANNADA SIGN NUKTA */
+	{ 0x0CBE, 0x0CC4 }, /* KANNADA VOWEL SIGN AA - KANNADA VOWEL SIGN VOCALIC RR */
+	{ 0x0CC6, 0x0CC8 }, /* KANNADA VOWEL SIGN E - KANNADA VOWEL SIGN AI */
+	{ 0x0CCA, 0x0CCD }, /* KANNADA VOWEL SIGN O - KANNADA SIGN VIRAMA */
+	{ 0x0CD5, 0x0CD6 }, /* KANNADA LENGTH MARK - KANNADA AI LENGTH MARK */
+	{ 0x0CE2, 0x0CE3 }, /* KANNADA VOWEL SIGN VOCALIC L - KANNADA VOWEL SIGN VOCALIC LL */
+	{ 0x0CF3, 0x0CF3 }, /* KANNADA SIGN COMBINING ANUSVARA ABOVE RIGHT */
+	{ 0x0D00, 0x0D03 }, /* MALAYALAM SIGN COMBINING ANUSVARA ABOVE - MALAYALAM SIGN VISARGA */
+	{ 0x0D3B, 0x0D3C }, /* MALAYALAM SIGN VERTICAL BAR VIRAMA - MALAYALAM SIGN CIRCULAR VIRAMA */
+	{ 0x0D3E, 0x0D44 }, /* MALAYALAM VOWEL SIGN AA - MALAYALAM VOWEL SIGN VOCALIC RR */
+	{ 0x0D46, 0x0D48 }, /* MALAYALAM VOWEL SIGN E - MALAYALAM VOWEL SIGN AI */
+	{ 0x0D4A, 0x0D4D }, /* MALAYALAM VOWEL SIGN O - MALAYALAM SIGN VIRAMA */
+	{ 0x0D57, 0x0D57 }, /* MALAYALAM AU LENGTH MARK */
+	{ 0x0D62, 0x0D63 }, /* MALAYALAM VOWEL SIGN VOCALIC L - MALAYALAM VOWEL SIGN VOCALIC LL */
+	{ 0x0D81, 0x0D83 }, /* SINHALA SIGN CANDRABINDU - SINHALA SIGN VISARGAYA */
+	{ 0x0DCA, 0x0DCA }, /* SINHALA SIGN AL-LAKUNA */
+	{ 0x0DCF, 0x0DD4 }, /* SINHALA VOWEL SIGN AELA-PILLA - SINHALA VOWEL SIGN KETTI PAA-PILLA */
+	{ 0x0DD6, 0x0DD6 }, /* SINHALA VOWEL SIGN DIGA PAA-PILLA */
+	{ 0x0DD8, 0x0DDF }, /* SINHALA VOWEL SIGN GAETTA-PILLA - SINHALA VOWEL SIGN GAYANUKITTA */
+	{ 0x0DF2, 0x0DF3 }, /* SINHALA VOWEL SIGN DIGA GAETTA-PILLA - SINHALA VOWEL SIGN DIGA GAYANUKITTA */
+	{ 0x0E31, 0x0E31 }, /* THAI CHARACTER MAI HAN-AKAT */
+	{ 0x0E34, 0x0E3A }, /* THAI CHARACTER SARA I - THAI CHARACTER PHINTHU */
+	{ 0x0E47, 0x0E4E }, /* THAI CHARACTER MAITAIKHU - THAI CHARACTER YAMAKKAN */
+	{ 0x0EB1, 0x0EB1 }, /* LAO VOWEL SIGN MAI KAN */
+	{ 0x0EB4, 0x0EBC }, /* LAO VOWEL SIGN I - LAO SEMIVOWEL SIGN LO */
+	{ 0x0EC8, 0x0ECE }, /* LAO TONE MAI EK - LAO YAMAKKAN */
+	{ 0x0F18, 0x0F19 }, /* TIBETAN ASTROLOGICAL SIGN -KHYUD PA - TIBETAN ASTROLOGICAL SIGN SDONG TSHUGS */
+	{ 0x0F35, 0x0F35 }, /* TIBETAN MARK NGAS BZUNG NYI ZLA */
+	{ 0x0F37, 0x0F37 }, /* TIBETAN MARK NGAS BZUNG SGOR RTAGS */
+	{ 0x0F39, 0x0F39 }, /* TIBETAN MARK TSA -PHRU */
+	{ 0x0F3E, 0x0F3F }, /* TIBETAN SIGN YAR TSHES - TIBETAN SIGN MAR TSHES */
+	{ 0x0F71, 0x0F84 }, /* TIBETAN VOWEL SIGN AA - TIBETAN MARK HALANTA */
+	{ 0x0F86, 0x0F87 }, /* TIBETAN SIGN LCI RTAGS - TIBETAN SIGN YANG RTAGS */
+	{ 0x0F8D, 0x0F97 }, /* TIBETAN SUBJOINED SIGN LCE TSA CAN - TIBETAN SUBJOINED LETTER JA */
+	{ 0x0F99, 0x0FBC }, /* TIBETAN SUBJOINED LETTER NYA - TIBETAN SUBJOINED LETTER FIXED-FORM RA */
+	{ 0x0FC6, 0x0FC6 }, /* TIBETAN SYMBOL PADMA GDAN */
+	{ 0x102B, 0x103E }, /* MYANMAR VOWEL SIGN TALL AA - MYANMAR CONSONANT SIGN MEDIAL HA */
+	{ 0x1056, 0x1059 }, /* MYANMAR VOWEL SIGN VOCALIC R - MYANMAR VOWEL SIGN VOCALIC LL */
+	{ 0x105E, 0x1060 }, /* MYANMAR CONSONANT SIGN MON MEDIAL NA - MYANMAR CONSONANT SIGN MON MEDIAL LA */
+	{ 0x1062, 0x1064 }, /* MYANMAR VOWEL SIGN SGAW KAREN EU - MYANMAR TONE MARK SGAW KAREN KE PHO */
+	{ 0x1067, 0x106D }, /* MYANMAR VOWEL SIGN WESTERN PWO KAREN EU - MYANMAR SIGN WESTERN PWO KAREN TONE-5 */
+	{ 0x1071, 0x1074 }, /* MYANMAR VOWEL SIGN GEBA KAREN I - MYANMAR VOWEL SIGN KAYAH EE */
+	{ 0x1082, 0x108D }, /* MYANMAR CONSONANT SIGN SHAN MEDIAL WA - MYANMAR SIGN SHAN COUNCIL EMPHATIC TONE */
+	{ 0x108F, 0x108F }, /* MYANMAR SIGN RUMAI PALAUNG TONE-5 */
+	{ 0x109A, 0x109D }, /* MYANMAR SIGN KHAMTI TONE-1 - MYANMAR VOWEL SIGN AITON AI */
+	{ 0x135D, 0x135F }, /* ETHIOPIC COMBINING GEMINATION AND VOWEL LENGTH MARK - ETHIOPIC COMBINING GEMINATION MARK */
+	{ 0x1712, 0x1715 }, /* TAGALOG VOWEL SIGN I - TAGALOG SIGN PAMUDPOD */
+	{ 0x1732, 0x1734 }, /* HANUNOO VOWEL SIGN I - HANUNOO SIGN PAMUDPOD */
+	{ 0x1752, 0x1753 }, /* BUHID VOWEL SIGN I - BUHID VOWEL SIGN U */
+	{ 0x1772, 0x1773 }, /* TAGBANWA VOWEL SIGN I - TAGBANWA VOWEL SIGN U */
+	{ 0x17B4, 0x17D3 }, /* KHMER VOWEL INHERENT AQ - KHMER SIGN BATHAMASAT */
+	{ 0x17DD, 0x17DD }, /* KHMER SIGN ATTHACAN */
+	{ 0x180B, 0x180F }, /* MONGOLIAN FREE VARIATION SELECTOR ONE - MONGOLIAN FREE VARIATION SELECTOR FOUR */
+	{ 0x1885, 0x1886 }, /* MONGOLIAN LETTER ALI GALI BALUDA - MONGOLIAN LETTER ALI GALI THREE BALUDA */
+	{ 0x18A9, 0x18A9 }, /* MONGOLIAN LETTER ALI GALI DAGALGA */
+	{ 0x1920, 0x192B }, /* LIMBU VOWEL SIGN A - LIMBU SUBJOINED LETTER WA */
+	{ 0x1930, 0x193B }, /* LIMBU SMALL LETTER KA - LIMBU SIGN SA-I */
+	{ 0x1A17, 0x1A1B }, /* BUGINESE VOWEL SIGN I - BUGINESE VOWEL SIGN AE */
+	{ 0x1A55, 0x1A5E }, /* TAI THAM CONSONANT SIGN MEDIAL RA - TAI THAM CONSONANT SIGN SA */
+	{ 0x1A60, 0x1A7C }, /* TAI THAM SIGN SAKOT - TAI THAM SIGN KHUEN-LUE KARAN */
+	{ 0x1A7F, 0x1A7F }, /* TAI THAM COMBINING CRYPTOGRAMMIC DOT */
+	{ 0x1AB0, 0x1ACE }, /* COMBINING DOUBLED CIRCUMFLEX ACCENT - COMBINING LATIN SMALL LETTER INSULAR T */
+	{ 0x1B00, 0x1B04 }, /* BALINESE SIGN ULU RICEM - BALINESE SIGN BISAH */
+	{ 0x1B34, 0x1B44 }, /* BALINESE SIGN REREKAN - BALINESE ADEG ADEG */
+	{ 0x1B6B, 0x1B73 }, /* BALINESE MUSICAL SYMBOL COMBINING TEGEH - BALINESE MUSICAL SYMBOL COMBINING GONG */
+	{ 0x1B80, 0x1B82 }, /* SUNDANESE SIGN PANYECEK - SUNDANESE SIGN PANGWISAD */
+	{ 0x1BA1, 0x1BAD }, /* SUNDANESE CONSONANT SIGN PAMINGKAL - SUNDANESE CONSONANT SIGN PASANGAN WA */
+	{ 0x1BE6, 0x1BF3 }, /* BATAK SIGN TOMPI - BATAK PANONGONAN */
+	{ 0x1C24, 0x1C37 }, /* LEPCHA SUBJOINED LETTER YA - LEPCHA SIGN NUKTA */
+	{ 0x1CD0, 0x1CD2 }, /* VEDIC TONE KARSHANA - VEDIC TONE PRENKHA */
+	{ 0x1CD4, 0x1CE8 }, /* VEDIC SIGN YAJURVEDIC MIDLINE SVARITA - VEDIC SIGN VISARGA ANUDATTA WITH TAIL */
+	{ 0x1CED, 0x1CED }, /* VEDIC SIGN TIRYAK */
+	{ 0x1CF4, 0x1CF4 }, /* VEDIC TONE CANDRA ABOVE */
+	{ 0x1CF7, 0x1CF9 }, /* VEDIC SIGN ATIKRAMA - VEDIC TONE DOUBLE RING ABOVE */
+	{ 0x1DC0, 0x1DFF }, /* COMBINING DOTTED GRAVE ACCENT - COMBINING RIGHT ARROWHEAD AND DOWN ARROWHEAD BELOW */
+	{ 0x200B, 0x200F }, /* ZERO WIDTH SPACE - RIGHT-TO-LEFT MARK */
+	{ 0x202A, 0x202E }, /* LEFT-TO-RIGHT EMBEDDING - RIGHT-TO-LEFT OVERRIDE */
+	{ 0x2060, 0x2064 }, /* WORD JOINER - INVISIBLE PLUS */
+	{ 0x2066, 0x206F }, /* LEFT-TO-RIGHT ISOLATE - NOMINAL DIGIT SHAPES */
+	{ 0x20D0, 0x20F0 }, /* COMBINING LEFT HARPOON ABOVE - COMBINING ASTERISK ABOVE */
+	{ 0x2640, 0x2640 }, /* FEMALE SIGN */
+	{ 0x2642, 0x2642 }, /* MALE SIGN */
+	{ 0x26A7, 0x26A7 }, /* MALE WITH STROKE AND MALE AND FEMALE SIGN */
+	{ 0x2CEF, 0x2CF1 }, /* COPTIC COMBINING NI ABOVE - COPTIC COMBINING SPIRITUS LENIS */
+	{ 0x2D7F, 0x2D7F }, /* TIFINAGH CONSONANT JOINER */
+	{ 0x2DE0, 0x2DFF }, /* COMBINING CYRILLIC LETTER BE - COMBINING CYRILLIC LETTER IOTIFIED BIG YUS */
+	{ 0x302A, 0x302F }, /* IDEOGRAPHIC LEVEL TONE MARK - HANGUL DOUBLE DOT TONE MARK */
+	{ 0x3099, 0x309A }, /* COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK - COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK */
+	{ 0xA66F, 0xA672 }, /* COMBINING CYRILLIC VZMET - COMBINING CYRILLIC THOUSAND MILLIONS SIGN */
+	{ 0xA674, 0xA67D }, /* COMBINING CYRILLIC LETTER UKRAINIAN IE - COMBINING CYRILLIC PAYEROK */
+	{ 0xA69E, 0xA69F }, /* COMBINING CYRILLIC LETTER EF - COMBINING CYRILLIC LETTER IOTIFIED E */
+	{ 0xA6F0, 0xA6F1 }, /* BAMUM COMBINING MARK KOQNDON - BAMUM COMBINING MARK TUKWENTIS */
+	{ 0xA802, 0xA802 }, /* SYLOTI NAGRI SIGN DVISVARA */
+	{ 0xA806, 0xA806 }, /* SYLOTI NAGRI SIGN HASANTA */
+	{ 0xA80B, 0xA80B }, /* SYLOTI NAGRI SIGN ANUSVARA */
+	{ 0xA823, 0xA827 }, /* SYLOTI NAGRI VOWEL SIGN A - SYLOTI NAGRI VOWEL SIGN OO */
+	{ 0xA82C, 0xA82C }, /* SYLOTI NAGRI SIGN ALTERNATE HASANTA */
+	{ 0xA880, 0xA881 }, /* SAURASHTRA SIGN ANUSVARA - SAURASHTRA SIGN VISARGA */
+	{ 0xA8B4, 0xA8C5 }, /* SAURASHTRA CONSONANT SIGN HAARU - SAURASHTRA SIGN CANDRABINDU */
+	{ 0xA8E0, 0xA8F1 }, /* COMBINING DEVANAGARI DIGIT ZERO - COMBINING DEVANAGARI SIGN AVAGRAHA */
+	{ 0xA8FF, 0xA8FF }, /* DEVANAGARI VOWEL SIGN AY */
+	{ 0xA926, 0xA92D }, /* KAYAH LI VOWEL UE - KAYAH LI TONE CALYA PLOPHU */
+	{ 0xA947, 0xA953 }, /* REJANG VOWEL SIGN I - REJANG VIRAMA */
+	{ 0xA980, 0xA983 }, /* JAVANESE SIGN PANYANGGA - JAVANESE SIGN WIGNYAN */
+	{ 0xA9B3, 0xA9C0 }, /* JAVANESE SIGN CECAK TELU - JAVANESE PANGKON */
+	{ 0xA9E5, 0xA9E5 }, /* MYANMAR SIGN SHAN SAW */
+	{ 0xAA29, 0xAA36 }, /* CHAM VOWEL SIGN AA - CHAM CONSONANT SIGN WA */
+	{ 0xAA43, 0xAA43 }, /* CHAM CONSONANT SIGN FINAL NG */
+	{ 0xAA4C, 0xAA4D }, /* CHAM CONSONANT SIGN FINAL M - CHAM CONSONANT SIGN FINAL H */
+	{ 0xAA7B, 0xAA7D }, /* MYANMAR SIGN PAO KAREN TONE - MYANMAR SIGN TAI LAING TONE-5 */
+	{ 0xAAB0, 0xAAB0 }, /* TAI VIET MAI KANG */
+	{ 0xAAB2, 0xAAB4 }, /* TAI VIET VOWEL I - TAI VIET VOWEL U */
+	{ 0xAAB7, 0xAAB8 }, /* TAI VIET MAI KHIT - TAI VIET VOWEL IA */
+	{ 0xAABE, 0xAABF }, /* TAI VIET VOWEL AM - TAI VIET TONE MAI EK */
+	{ 0xAAC1, 0xAAC1 }, /* TAI VIET TONE MAI THO */
+	{ 0xAAEB, 0xAAEF }, /* MEETEI MAYEK VOWEL SIGN II - MEETEI MAYEK VOWEL SIGN AAU */
+	{ 0xAAF5, 0xAAF6 }, /* MEETEI MAYEK VOWEL SIGN VISARGA - MEETEI MAYEK VIRAMA */
+	{ 0xABE3, 0xABEA }, /* MEETEI MAYEK VOWEL SIGN ONAP - MEETEI MAYEK VOWEL SIGN NUNG */
+	{ 0xABEC, 0xABED }, /* MEETEI MAYEK LUM IYEK - MEETEI MAYEK APUN IYEK */
+	{ 0xFB1E, 0xFB1E }, /* HEBREW POINT JUDEO-SPANISH VARIKA */
+	{ 0xFE00, 0xFE0F }, /* VARIATION SELECTOR-1 - VARIATION SELECTOR-16 */
+	{ 0xFE20, 0xFE2F }, /* COMBINING LIGATURE LEFT HALF - COMBINING CYRILLIC TITLO RIGHT HALF */
+	{ 0xFEFF, 0xFEFF }, /* ZERO WIDTH NO-BREAK SPACE */
+	{ 0xFFF9, 0xFFFB }, /* INTERLINEAR ANNOTATION ANCHOR - INTERLINEAR ANNOTATION TERMINATOR */
+};
+
+/* Zero-width character ranges (non-BMP, U+10000 and above) */
+static const struct ucs_interval32 ucs_zero_width_non_bmp_ranges[] = {
 	{ 0x101FD, 0x101FD }, /* PHAISTOS DISC SIGN COMBINING OBLIQUE STROKE */
 	{ 0x102E0, 0x102E0 }, /* COPTIC EPACT THOUSANDS MARK */
 	{ 0x10376, 0x1037A }, /* COMBINING OLD PERMIC LETTER AN - COMBINING OLD PERMIC LETTER SII */
@@ -350,68 +354,72 @@ static const struct ucs_interval ucs_zero_width_ranges[] = {
 	{ 0xE0100, 0xE01EF }, /* VARIATION SELECTOR-17 - VARIATION SELECTOR-256 */
 };
 
-/* Double-width character ranges */
-static const struct ucs_interval ucs_double_width_ranges[] = {
-	{ 0x01100, 0x0115F }, /* HANGUL CHOSEONG KIYEOK - HANGUL CHOSEONG FILLER */
-	{ 0x0231A, 0x0231B }, /* WATCH - HOURGLASS */
-	{ 0x02329, 0x0232A }, /* LEFT-POINTING ANGLE BRACKET - RIGHT-POINTING ANGLE BRACKET */
-	{ 0x023E9, 0x023EC }, /* BLACK RIGHT-POINTING DOUBLE TRIANGLE - BLACK DOWN-POINTING DOUBLE TRIANGLE */
-	{ 0x023F0, 0x023F0 }, /* ALARM CLOCK */
-	{ 0x023F3, 0x023F3 }, /* HOURGLASS WITH FLOWING SAND */
-	{ 0x025FD, 0x025FE }, /* WHITE MEDIUM SMALL SQUARE - BLACK MEDIUM SMALL SQUARE */
-	{ 0x02614, 0x02615 }, /* UMBRELLA WITH RAIN DROPS - HOT BEVERAGE */
-	{ 0x02630, 0x02637 }, /* TRIGRAM FOR HEAVEN - TRIGRAM FOR EARTH */
-	{ 0x02648, 0x02653 }, /* ARIES - PISCES */
-	{ 0x0267F, 0x0267F }, /* WHEELCHAIR SYMBOL */
-	{ 0x0268A, 0x0268F }, /* MONOGRAM FOR YANG - DIGRAM FOR GREATER YIN */
-	{ 0x02693, 0x02693 }, /* ANCHOR */
-	{ 0x026A1, 0x026A1 }, /* HIGH VOLTAGE SIGN */
-	{ 0x026AA, 0x026AB }, /* MEDIUM WHITE CIRCLE - MEDIUM BLACK CIRCLE */
-	{ 0x026BD, 0x026BE }, /* SOCCER BALL - BASEBALL */
-	{ 0x026C4, 0x026C5 }, /* SNOWMAN WITHOUT SNOW - SUN BEHIND CLOUD */
-	{ 0x026CE, 0x026CE }, /* OPHIUCHUS */
-	{ 0x026D4, 0x026D4 }, /* NO ENTRY */
-	{ 0x026EA, 0x026EA }, /* CHURCH */
-	{ 0x026F2, 0x026F3 }, /* FOUNTAIN - FLAG IN HOLE */
-	{ 0x026F5, 0x026F5 }, /* SAILBOAT */
-	{ 0x026FA, 0x026FA }, /* TENT */
-	{ 0x026FD, 0x026FD }, /* FUEL PUMP */
-	{ 0x02705, 0x02705 }, /* WHITE HEAVY CHECK MARK */
-	{ 0x0270A, 0x0270B }, /* RAISED FIST - RAISED HAND */
-	{ 0x02728, 0x02728 }, /* SPARKLES */
-	{ 0x0274C, 0x0274C }, /* CROSS MARK */
-	{ 0x0274E, 0x0274E }, /* NEGATIVE SQUARED CROSS MARK */
-	{ 0x02753, 0x02755 }, /* BLACK QUESTION MARK ORNAMENT - WHITE EXCLAMATION MARK ORNAMENT */
-	{ 0x02757, 0x02757 }, /* HEAVY EXCLAMATION MARK SYMBOL */
-	{ 0x02795, 0x02797 }, /* HEAVY PLUS SIGN - HEAVY DIVISION SIGN */
-	{ 0x027B0, 0x027B0 }, /* CURLY LOOP */
-	{ 0x027BF, 0x027BF }, /* DOUBLE CURLY LOOP */
-	{ 0x02B1B, 0x02B1C }, /* BLACK LARGE SQUARE - WHITE LARGE SQUARE */
-	{ 0x02B50, 0x02B50 }, /* WHITE MEDIUM STAR */
-	{ 0x02B55, 0x02B55 }, /* HEAVY LARGE CIRCLE */
-	{ 0x02E80, 0x02E99 }, /* CJK RADICAL REPEAT - CJK RADICAL RAP */
-	{ 0x02E9B, 0x02EF3 }, /* CJK RADICAL CHOKE - CJK RADICAL C-SIMPLIFIED TURTLE */
-	{ 0x02F00, 0x02FD5 }, /* KANGXI RADICAL ONE - KANGXI RADICAL FLUTE */
-	{ 0x02FF0, 0x03029 }, /* IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT - HANGZHOU NUMERAL NINE */
-	{ 0x03030, 0x0303E }, /* WAVY DASH - IDEOGRAPHIC VARIATION INDICATOR */
-	{ 0x03041, 0x03096 }, /* HIRAGANA LETTER SMALL A - HIRAGANA LETTER SMALL KE */
-	{ 0x0309B, 0x030FF }, /* KATAKANA-HIRAGANA VOICED SOUND MARK - KATAKANA DIGRAPH KOTO */
-	{ 0x03105, 0x0312F }, /* BOPOMOFO LETTER B - BOPOMOFO LETTER NN */
-	{ 0x03131, 0x0318E }, /* HANGUL LETTER KIYEOK - HANGUL LETTER ARAEAE */
-	{ 0x03190, 0x031E5 }, /* IDEOGRAPHIC ANNOTATION LINKING MARK - CJK STROKE SZP */
-	{ 0x031EF, 0x0321E }, /* IDEOGRAPHIC DESCRIPTION CHARACTER SUBTRACTION - PARENTHESIZED KOREAN CHARACTER O HU */
-	{ 0x03220, 0x03247 }, /* PARENTHESIZED IDEOGRAPH ONE - CIRCLED IDEOGRAPH KOTO */
-	{ 0x03250, 0x0A48C }, /* PARTNERSHIP SIGN - YI SYLLABLE YYR */
-	{ 0x0A490, 0x0A4C6 }, /* YI RADICAL QOT - YI RADICAL KE */
-	{ 0x0A960, 0x0A97C }, /* HANGUL CHOSEONG TIKEUT-MIEUM - HANGUL CHOSEONG SSANGYEORINHIEUH */
-	{ 0x0AC00, 0x0D7A3 }, /* HANGUL SYLLABLE GA - HANGUL SYLLABLE HIH */
-	{ 0x0F900, 0x0FAFF }, /* U+F900 - U+FAFF */
-	{ 0x0FE10, 0x0FE19 }, /* PRESENTATION FORM FOR VERTICAL COMMA - PRESENTATION FORM FOR VERTICAL HORIZONTAL ELLIPSIS */
-	{ 0x0FE30, 0x0FE52 }, /* PRESENTATION FORM FOR VERTICAL TWO DOT LEADER - SMALL FULL STOP */
-	{ 0x0FE54, 0x0FE66 }, /* SMALL SEMICOLON - SMALL EQUALS SIGN */
-	{ 0x0FE68, 0x0FE6B }, /* SMALL REVERSE SOLIDUS - SMALL COMMERCIAL AT */
-	{ 0x0FF01, 0x0FF60 }, /* FULLWIDTH EXCLAMATION MARK - FULLWIDTH RIGHT WHITE PARENTHESIS */
-	{ 0x0FFE0, 0x0FFE6 }, /* FULLWIDTH CENT SIGN - FULLWIDTH WON SIGN */
+/* Double-width character ranges (BMP - Basic Multilingual Plane, U+0000 to U+FFFF) */
+static const struct ucs_interval16 ucs_double_width_bmp_ranges[] = {
+	{ 0x1100, 0x115F }, /* HANGUL CHOSEONG KIYEOK - HANGUL CHOSEONG FILLER */
+	{ 0x231A, 0x231B }, /* WATCH - HOURGLASS */
+	{ 0x2329, 0x232A }, /* LEFT-POINTING ANGLE BRACKET - RIGHT-POINTING ANGLE BRACKET */
+	{ 0x23E9, 0x23EC }, /* BLACK RIGHT-POINTING DOUBLE TRIANGLE - BLACK DOWN-POINTING DOUBLE TRIANGLE */
+	{ 0x23F0, 0x23F0 }, /* ALARM CLOCK */
+	{ 0x23F3, 0x23F3 }, /* HOURGLASS WITH FLOWING SAND */
+	{ 0x25FD, 0x25FE }, /* WHITE MEDIUM SMALL SQUARE - BLACK MEDIUM SMALL SQUARE */
+	{ 0x2614, 0x2615 }, /* UMBRELLA WITH RAIN DROPS - HOT BEVERAGE */
+	{ 0x2630, 0x2637 }, /* TRIGRAM FOR HEAVEN - TRIGRAM FOR EARTH */
+	{ 0x2648, 0x2653 }, /* ARIES - PISCES */
+	{ 0x267F, 0x267F }, /* WHEELCHAIR SYMBOL */
+	{ 0x268A, 0x268F }, /* MONOGRAM FOR YANG - DIGRAM FOR GREATER YIN */
+	{ 0x2693, 0x2693 }, /* ANCHOR */
+	{ 0x26A1, 0x26A1 }, /* HIGH VOLTAGE SIGN */
+	{ 0x26AA, 0x26AB }, /* MEDIUM WHITE CIRCLE - MEDIUM BLACK CIRCLE */
+	{ 0x26BD, 0x26BE }, /* SOCCER BALL - BASEBALL */
+	{ 0x26C4, 0x26C5 }, /* SNOWMAN WITHOUT SNOW - SUN BEHIND CLOUD */
+	{ 0x26CE, 0x26CE }, /* OPHIUCHUS */
+	{ 0x26D4, 0x26D4 }, /* NO ENTRY */
+	{ 0x26EA, 0x26EA }, /* CHURCH */
+	{ 0x26F2, 0x26F3 }, /* FOUNTAIN - FLAG IN HOLE */
+	{ 0x26F5, 0x26F5 }, /* SAILBOAT */
+	{ 0x26FA, 0x26FA }, /* TENT */
+	{ 0x26FD, 0x26FD }, /* FUEL PUMP */
+	{ 0x2705, 0x2705 }, /* WHITE HEAVY CHECK MARK */
+	{ 0x270A, 0x270B }, /* RAISED FIST - RAISED HAND */
+	{ 0x2728, 0x2728 }, /* SPARKLES */
+	{ 0x274C, 0x274C }, /* CROSS MARK */
+	{ 0x274E, 0x274E }, /* NEGATIVE SQUARED CROSS MARK */
+	{ 0x2753, 0x2755 }, /* BLACK QUESTION MARK ORNAMENT - WHITE EXCLAMATION MARK ORNAMENT */
+	{ 0x2757, 0x2757 }, /* HEAVY EXCLAMATION MARK SYMBOL */
+	{ 0x2795, 0x2797 }, /* HEAVY PLUS SIGN - HEAVY DIVISION SIGN */
+	{ 0x27B0, 0x27B0 }, /* CURLY LOOP */
+	{ 0x27BF, 0x27BF }, /* DOUBLE CURLY LOOP */
+	{ 0x2B1B, 0x2B1C }, /* BLACK LARGE SQUARE - WHITE LARGE SQUARE */
+	{ 0x2B50, 0x2B50 }, /* WHITE MEDIUM STAR */
+	{ 0x2B55, 0x2B55 }, /* HEAVY LARGE CIRCLE */
+	{ 0x2E80, 0x2E99 }, /* CJK RADICAL REPEAT - CJK RADICAL RAP */
+	{ 0x2E9B, 0x2EF3 }, /* CJK RADICAL CHOKE - CJK RADICAL C-SIMPLIFIED TURTLE */
+	{ 0x2F00, 0x2FD5 }, /* KANGXI RADICAL ONE - KANGXI RADICAL FLUTE */
+	{ 0x2FF0, 0x3029 }, /* IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT - HANGZHOU NUMERAL NINE */
+	{ 0x3030, 0x303E }, /* WAVY DASH - IDEOGRAPHIC VARIATION INDICATOR */
+	{ 0x3041, 0x3096 }, /* HIRAGANA LETTER SMALL A - HIRAGANA LETTER SMALL KE */
+	{ 0x309B, 0x30FF }, /* KATAKANA-HIRAGANA VOICED SOUND MARK - KATAKANA DIGRAPH KOTO */
+	{ 0x3105, 0x312F }, /* BOPOMOFO LETTER B - BOPOMOFO LETTER NN */
+	{ 0x3131, 0x318E }, /* HANGUL LETTER KIYEOK - HANGUL LETTER ARAEAE */
+	{ 0x3190, 0x31E5 }, /* IDEOGRAPHIC ANNOTATION LINKING MARK - CJK STROKE SZP */
+	{ 0x31EF, 0x321E }, /* IDEOGRAPHIC DESCRIPTION CHARACTER SUBTRACTION - PARENTHESIZED KOREAN CHARACTER O HU */
+	{ 0x3220, 0x3247 }, /* PARENTHESIZED IDEOGRAPH ONE - CIRCLED IDEOGRAPH KOTO */
+	{ 0x3250, 0xA48C }, /* PARTNERSHIP SIGN - YI SYLLABLE YYR */
+	{ 0xA490, 0xA4C6 }, /* YI RADICAL QOT - YI RADICAL KE */
+	{ 0xA960, 0xA97C }, /* HANGUL CHOSEONG TIKEUT-MIEUM - HANGUL CHOSEONG SSANGYEORINHIEUH */
+	{ 0xAC00, 0xD7A3 }, /* HANGUL SYLLABLE GA - HANGUL SYLLABLE HIH */
+	{ 0xF900, 0xFAFF }, /* U+F900 - U+FAFF */
+	{ 0xFE10, 0xFE19 }, /* PRESENTATION FORM FOR VERTICAL COMMA - PRESENTATION FORM FOR VERTICAL HORIZONTAL ELLIPSIS */
+	{ 0xFE30, 0xFE52 }, /* PRESENTATION FORM FOR VERTICAL TWO DOT LEADER - SMALL FULL STOP */
+	{ 0xFE54, 0xFE66 }, /* SMALL SEMICOLON - SMALL EQUALS SIGN */
+	{ 0xFE68, 0xFE6B }, /* SMALL REVERSE SOLIDUS - SMALL COMMERCIAL AT */
+	{ 0xFF01, 0xFF60 }, /* FULLWIDTH EXCLAMATION MARK - FULLWIDTH RIGHT WHITE PARENTHESIS */
+	{ 0xFFE0, 0xFFE6 }, /* FULLWIDTH CENT SIGN - FULLWIDTH WON SIGN */
+};
+
+/* Double-width character ranges (non-BMP, U+10000 and above) */
+static const struct ucs_interval32 ucs_double_width_non_bmp_ranges[] = {
 	{ 0x16FE0, 0x16FE3 }, /* TANGUT ITERATION MARK - OLD CHINESE ITERATION MARK */
 	{ 0x17000, 0x187F7 }, /* U+17000 - U+187F7 */
 	{ 0x18800, 0x18CD5 }, /* TANGUT COMPONENT-001 - KHITAN SMALL SCRIPT CHARACTER-18CD5 */
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 01/13] vt: minor cleanup to vc_translate_unicode()
  2025-04-15 19:17 ` [PATCH v2 01/13] vt: minor cleanup to vc_translate_unicode() Nicolas Pitre
@ 2025-04-16  3:41   ` Jiri Slaby
  0 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2025-04-16  3:41 UTC (permalink / raw)
  To: Nicolas Pitre, Greg Kroah-Hartman
  Cc: Nicolas Pitre, linux-serial, linux-kernel

On 15. 04. 25, 21:17, Nicolas Pitre wrote:
> From: Nicolas Pitre <npitre@baylibre.com>
> 
> Make it clearer when a sequence is bad.
> 
> Signed-off-by: Nicolas Pitre <npitre@baylibre.com>

Reviewed-by: Jiri Slaby <jirislaby@kernel.org>


-- 
js
suse labs

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 02/13] vt: move unicode processing to a separate file
  2025-04-15 19:17 ` [PATCH v2 02/13] vt: move unicode processing to a separate file Nicolas Pitre
@ 2025-04-16  3:42   ` Jiri Slaby
  0 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2025-04-16  3:42 UTC (permalink / raw)
  To: Nicolas Pitre, Greg Kroah-Hartman
  Cc: Nicolas Pitre, linux-serial, linux-kernel

On 15. 04. 25, 21:17, Nicolas Pitre wrote:
> From: Nicolas Pitre <npitre@baylibre.com>
> 
> This will make it easier to maintain. Also make it depend on
> CONFIG_CONSOLE_TRANSLATIONS.
> 
> Signed-off-by: Nicolas Pitre <npitre@baylibre.com>

Reviewed-by: Jiri Slaby <jirislaby@kernel.org>


-- 
js
suse labs

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 03/13] vt: properly support zero-width Unicode code points
  2025-04-15 19:17 ` [PATCH v2 03/13] vt: properly support zero-width Unicode code points Nicolas Pitre
@ 2025-04-16  3:45   ` Jiri Slaby
  0 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2025-04-16  3:45 UTC (permalink / raw)
  To: Nicolas Pitre, Greg Kroah-Hartman
  Cc: Nicolas Pitre, linux-serial, linux-kernel

On 15. 04. 25, 21:17, Nicolas Pitre wrote:
> From: Nicolas Pitre <npitre@baylibre.com>
> 
> Zero-width Unicode code points are causing misalignment in vertically
> aligned content, disrupting the visual layout. Let's handle zero-width
> code points more intelligently.
> 
> Double-width code points are stored in the screen grid followed by a white
> space code point to create the expected screen layout. When a double-width
> code point is followed by a zero-width code point in the console incoming
> bytestream (e.g., an emoji with a presentation selector) then we may
> replace the white space padding by that zero-width code point instead of
> dropping it. This maximizes screen content information while preserving
> proper layout.
> 
> If a zero-width code point is preceded by a single-width code point then
> the above trick is not possible and such zero-width code point must
> be dropped.
> 
> VS16 (Variation Selector 16, U+FE0F) is special as it doubles the width
> of the preceding single-width code point. We handle that case by giving
> VS16 a width of 1 when that happens.
> 
> Signed-off-by: Nicolas Pitre <npitre@baylibre.com>

Reviewed-by: Jiri Slaby <jirislaby@kernel.org>

-- 
js
suse labs

^ permalink raw reply	[flat|nested] 33+ messages in thread
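
The storage rules described in the commit message above can be sketched as a small simulation. This is only an illustration under stated assumptions, not the kernel code: `store_row`, `in_ranges` and the two tiny range lists are hypothetical names, and the ranges are a handful of entries copied from the tables in this thread rather than the full data.

```python
VS16 = 0xFE0F   # VARIATION SELECTOR-16
SPACE = 0x20

# Small illustrative subsets of the width tables, not the full data.
DOUBLE_WIDTH = [(0x2614, 0x2615), (0xAC00, 0xD7A3)]
ZERO_WIDTH = [(0x0300, 0x036F), (0x200B, 0x200F), (0xFE00, 0xFE0F)]


def in_ranges(cp, ranges):
    """Linear membership test; enough for a sketch."""
    return any(first <= cp <= last for first, last in ranges)


def store_row(code_points):
    """Return the screen cells produced for a sequence of code points."""
    cells = []
    pad = None           # index of a padding cell that is still free
    prev_single = False  # last base code point was single-width, not yet widened
    for cp in code_points:
        if in_ranges(cp, ZERO_WIDTH):
            if pad is not None:
                # After a double-width code point: reuse the padding cell.
                cells[pad] = cp
                pad = None
            elif cp == VS16 and prev_single:
                # VS16 after a single-width code point is given width 1.
                cells.append(cp)
                prev_single = False
            # Any other zero-width code point is dropped.
        elif in_ranges(cp, DOUBLE_WIDTH):
            # Double-width: the code point plus a white space padding cell.
            cells += [cp, SPACE]
            pad = len(cells) - 1
            prev_single = False
        else:
            cells.append(cp)
            pad = None
            prev_single = True
    return cells
```

For example, `store_row([0x2615, 0xFE0F])` keeps two cells for HOT BEVERAGE, with the selector stored where the padding would have been, while `store_row([0x61, 0x0301])` keeps only the `a` and drops the combining accent.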

* Re: [PATCH v2 04/13] vt: introduce gen_ucs_width_table.py to create ucs_width_table.h
  2025-04-15 19:17 ` [PATCH v2 04/13] vt: introduce gen_ucs_width_table.py to create ucs_width_table.h Nicolas Pitre
@ 2025-04-16  4:14   ` Jiri Slaby
  2025-04-16  4:19     ` Jiri Slaby
  0 siblings, 1 reply; 33+ messages in thread
From: Jiri Slaby @ 2025-04-16  4:14 UTC (permalink / raw)
  To: Nicolas Pitre, Greg Kroah-Hartman
  Cc: Nicolas Pitre, linux-serial, linux-kernel

On 15. 04. 25, 21:17, Nicolas Pitre wrote:
> From: Nicolas Pitre <npitre@baylibre.com>
> 
> The table in ucs.c is terribly out of date and incomplete. We also need a
> second table to store zero-width code points. Properly maintaining those
> tables manually is impossible. So here's a script to generate them.
> 
> Signed-off-by: Nicolas Pitre <npitre@baylibre.com>

Reviewed-by: Jiri Slaby <jirislaby@kernel.org>

-- 
js
suse labs

^ permalink raw reply	[flat|nested] 33+ messages in thread
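
As a rough idea of what generating such tables involves, here is a minimal sketch built on Python's `unicodedata` module. It is not the script from this patch: the function names are hypothetical, and the selection rules (East Asian Width W/F for double-width, general categories M* and Cf for zero-width) are a simplifying assumption. The table posted in this thread also lists entries such as FEMALE SIGN, so the real script evidently applies additional rules.

```python
import unicodedata


def is_double_width(cp):
    # Assumption: treat East Asian Wide and Fullwidth as double-width.
    return unicodedata.east_asian_width(chr(cp)) in ("W", "F")


def is_zero_width(cp):
    # Assumption: combining marks (Mn, Mc, Me) and format characters (Cf).
    cat = unicodedata.category(chr(cp))
    return cat.startswith("M") or cat == "Cf"


def to_ranges(code_points):
    """Merge code points into sorted, inclusive (first, last) ranges."""
    ranges = []
    for cp in sorted(code_points):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], cp)   # extend the current range
        else:
            ranges.append((cp, cp))            # start a new range
    return ranges


def collect_ranges(predicate, first, last):
    """Ranges of code points in [first, last] that satisfy predicate."""
    return to_ranges(cp for cp in range(first, last + 1) if predicate(cp))
```

Calling `collect_ranges(is_zero_width, 0, 0x10FFFF)` would produce one candidate table; the exact result depends on the Unicode version bundled with the Python interpreter (`unicodedata.unidata_version`).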

* Re: [PATCH v2 04/13] vt: introduce gen_ucs_width_table.py to create ucs_width_table.h
  2025-04-16  4:14   ` Jiri Slaby
@ 2025-04-16  4:19     ` Jiri Slaby
  2025-04-16 13:21       ` Nicolas Pitre
  0 siblings, 1 reply; 33+ messages in thread
From: Jiri Slaby @ 2025-04-16  4:19 UTC (permalink / raw)
  To: Nicolas Pitre, Greg Kroah-Hartman
  Cc: Nicolas Pitre, linux-serial, linux-kernel

On 16. 04. 25, 6:14, Jiri Slaby wrote:
> On 15. 04. 25, 21:17, Nicolas Pitre wrote:
>> From: Nicolas Pitre <npitre@baylibre.com>
>>
>> The table in ucs.c is terribly out of date and incomplete. We also need a
>> second table to store zero-width code points. Properly maintaining those
>> tables manually is impossible. So here's a script to generate them.
>>
>> Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
> 
> Reviewed-by: Jiri Slaby <jirislaby@kernel.org>

Actually, could you create a makefile rule for this too?

Similar to what GENERATE_KEYMAP does in vt/Makefile.

So that you would do:
   make GENERATE_UCS_WIDTH_TABLE=1 drivers/tty/vt/
to let the script generate it on the fly?

thanks,
-- 
js
suse labs


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 05/13] vt: create ucs_width_table.h with gen_ucs_width_table.py
  2025-04-15 19:17 ` [PATCH v2 05/13] vt: create ucs_width_table.h with gen_ucs_width_table.py Nicolas Pitre
@ 2025-04-16  4:20   ` Jiri Slaby
  0 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2025-04-16  4:20 UTC (permalink / raw)
  To: Nicolas Pitre, Greg Kroah-Hartman
  Cc: Nicolas Pitre, linux-serial, linux-kernel

On 15. 04. 25, 21:17, Nicolas Pitre wrote:
> From: Nicolas Pitre <npitre@baylibre.com>
> 
> Provide comprehensive ranges for double-width and zero-width Unicode
> code points.
> 
> Note: scripts/checkpatch.pl complains about "... exceeds 100 columns".
>        Please ignore.
> 
> Signed-off-by: Nicolas Pitre <npitre@baylibre.com>

Reviewed-by: Jiri Slaby <jirislaby@kernel.org>


-- 
js
suse labs

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 06/13] vt: use new tables in ucs.c
  2025-04-15 19:17 ` [PATCH v2 06/13] vt: use new tables in ucs.c Nicolas Pitre
@ 2025-04-16  4:22   ` Jiri Slaby
  2025-04-17  8:30   ` kernel test robot
  1 sibling, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2025-04-16  4:22 UTC (permalink / raw)
  To: Nicolas Pitre, Greg Kroah-Hartman
  Cc: Nicolas Pitre, linux-serial, linux-kernel

On 15. 04. 25, 21:17, Nicolas Pitre wrote:
> From: Nicolas Pitre <npitre@baylibre.com>
> 
> This removes the table from ucs.c and substitutes the generated tables
> from ucs_width_table.h providing comprehensive ranges for double-width
> and zero-width Unicode code points.
> 
> Also implements ucs_is_zero_width() to query the new zero-width table.
> 
> Signed-off-by: Nicolas Pitre <npitre@baylibre.com>

Reviewed-by: Jiri Slaby <jirislaby@kernel.org>

-- 
js
suse labs

^ permalink raw reply	[flat|nested] 33+ messages in thread
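
A query against sorted range tables like these is typically a binary search. The sketch below shows one way to do it, written in Python for brevity and assuming nothing about the actual ucs.c code: `in_table` and `is_zero_width` are hypothetical names, and the tables hold only a few entries copied from the diff earlier in this thread. It mirrors the BMP / non-BMP split of the tables by picking the table from the code point's magnitude first.

```python
from bisect import bisect_right

# A few entries from the zero-width tables, sorted and non-overlapping.
ZERO_WIDTH_BMP = [
    (0x00AD, 0x00AD),  # SOFT HYPHEN
    (0x0300, 0x036F),  # combining diacritical marks
    (0x200B, 0x200F),  # ZERO WIDTH SPACE - RIGHT-TO-LEFT MARK
    (0xFE00, 0xFE0F),  # VARIATION SELECTOR-1 - VARIATION SELECTOR-16
    (0xFEFF, 0xFEFF),  # ZERO WIDTH NO-BREAK SPACE
]
ZERO_WIDTH_NON_BMP = [
    (0x101FD, 0x101FD),  # PHAISTOS DISC SIGN COMBINING OBLIQUE STROKE
    (0xE0100, 0xE01EF),  # VARIATION SELECTOR-17 - VARIATION SELECTOR-256
]


def in_table(cp, table):
    """Binary search: is cp inside one of the (first, last) ranges?"""
    # Index of the last range whose first code point is <= cp.
    i = bisect_right(table, (cp, 0x10FFFF)) - 1
    return i >= 0 and table[i][0] <= cp <= table[i][1]


def is_zero_width(cp):
    # Choose the table by plane, as the split tables suggest.
    table = ZERO_WIDTH_BMP if cp <= 0xFFFF else ZERO_WIDTH_NON_BMP
    return in_table(cp, table)
```

With these sample tables, `is_zero_width(0xFE0F)` and `is_zero_width(0xE0100)` are true, while `is_zero_width(0x41)` is false.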

* Re: [PATCH v2 07/13] vt: introduce gen_ucs_recompose_table.py to create ucs_recompose_table.h
  2025-04-15 19:17 ` [PATCH v2 07/13] vt: introduce gen_ucs_recompose_table.py to create ucs_recompose_table.h Nicolas Pitre
@ 2025-04-16  4:29   ` Jiri Slaby
  2025-04-16 13:17     ` Nicolas Pitre
  0 siblings, 1 reply; 33+ messages in thread
From: Jiri Slaby @ 2025-04-16  4:29 UTC (permalink / raw)
  To: Nicolas Pitre, Greg Kroah-Hartman
  Cc: Nicolas Pitre, linux-serial, linux-kernel

On 15. 04. 25, 21:17, Nicolas Pitre wrote:
> From: Nicolas Pitre <npitre@baylibre.com>
> 
> The generated table maps base character + combining mark pairs to their
> precomposed equivalents using Python's unicodedata module.
> 
> The default script behavior is to create a table with most commonly used
> Latin, Greek, and Cyrillic recomposition pairs only. It is much smaller
> than the table with all possible recomposition pairs (71 entries vs 1000
> entries). But if one needs/wants the full table then simply running the
> script with the --full argument will generate it.
> 
> Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
> ---
>   drivers/tty/vt/gen_ucs_recompose_table.py | 255 ++++++++++++++++++++++
>   1 file changed, 255 insertions(+)
>   create mode 100755 drivers/tty/vt/gen_ucs_recompose_table.py
> 
> diff --git a/drivers/tty/vt/gen_ucs_recompose_table.py b/drivers/tty/vt/gen_ucs_recompose_table.py
> new file mode 100755
> index 0000000000..91e81fb1c9
> --- /dev/null
> +++ b/drivers/tty/vt/gen_ucs_recompose_table.py
> @@ -0,0 +1,255 @@
> +#!/usr/bin/env python3
...
> +    # Generate implementation file
> +    with open(out_file, 'w') as f:
> +        f.write(f"""\
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * {out_file} - Unicode character recomposition
> + *
> + * Auto-generated by {this_file}{generation_mode}
> + *
> + * Unicode Version: {unicodedata.unidata_version}
> + *
> +{textwrap.fill(
> +    f"This file contains a table with {table_description_detail}. " +
> +    f"To generate a table with {alt_description_detail} instead, run:",
> +    width=75, initial_indent=" * ", subsequent_indent=" * ")}
> + *
> + *   python {this_file}{alternative_mode}

This should be python3. Or no 'python' at all -- I assume the script is 
executable given "new file mode 100755".

Reviewed-by: Jiri Slaby <jirislaby@kernel.org>

-- 
js
suse labs

* Re: [PATCH v2 08/13] vt: create ucs_recompose_table.h with gen_ucs_recompose_table.py
  2025-04-15 19:17 ` [PATCH v2 08/13] vt: create ucs_recompose_table.h with gen_ucs_recompose_table.py Nicolas Pitre
@ 2025-04-16  5:05   ` Jiri Slaby
  0 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2025-04-16  5:05 UTC (permalink / raw)
  To: Nicolas Pitre, Greg Kroah-Hartman
  Cc: Nicolas Pitre, linux-serial, linux-kernel

On 15. 04. 25, 21:17, Nicolas Pitre wrote:
> From: Nicolas Pitre <npitre@baylibre.com>
> 
> Table of base character + combining mark pairs with their precomposed
> equivalents.
> 
> Note: scripts/checkpatch.pl complains about "... exceeds 100 columns".
>        Please ignore.
> 
> Signed-off-by: Nicolas Pitre <npitre@baylibre.com>

Reviewed-by: Jiri Slaby <jirislaby@kernel.org>

-- 
js
suse labs

* Re: [PATCH v2 09/13] vt: support Unicode recomposition
  2025-04-15 19:17 ` [PATCH v2 09/13] vt: support Unicode recomposition Nicolas Pitre
@ 2025-04-16  5:07   ` Jiri Slaby
  0 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2025-04-16  5:07 UTC (permalink / raw)
  To: Nicolas Pitre, Greg Kroah-Hartman
  Cc: Nicolas Pitre, linux-serial, linux-kernel

On 15. 04. 25, 21:17, Nicolas Pitre wrote:
> From: Nicolas Pitre <npitre@baylibre.com>
> 
> Try replacing any decomposed Unicode sequence by the corresponding
> recomposed code point. Code point to glyph correspondence works best
> after recomposition, and this applies mostly to single-width code points;
> therefore we can't preserve them in their decomposed form anyway.
> 
> Signed-off-by: Nicolas Pitre <npitre@baylibre.com>

Reviewed-by: Jiri Slaby <jirislaby@kernel.org>
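
For readers following along, the recomposition step described above can be sketched as a lookup in a table sorted by (base, mark). This is only an illustration under stated assumptions: the struct name, the function name and the four sample entries are hypothetical and are not taken from the patch or from the generated ucs_recompose_table.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry layout: base + combining mark -> precomposed form. */
struct recomposition {
	uint32_t base;
	uint32_t mark;
	uint32_t recomposed;
};

/* Sample entries only, sorted by (base, mark) so binary search works. */
static const struct recomposition recompose_table[] = {
	{ 0x0041, 0x0300, 0x00C0 },	/* A + combining grave     -> U+00C0 */
	{ 0x0041, 0x0301, 0x00C1 },	/* A + combining acute     -> U+00C1 */
	{ 0x0065, 0x0301, 0x00E9 },	/* e + combining acute     -> U+00E9 */
	{ 0x0075, 0x0308, 0x00FC },	/* u + combining diaeresis -> U+00FC */
};

/* Returns the precomposed code point, or 0 when the pair is not in the table. */
static uint32_t recompose(uint32_t base, uint32_t mark)
{
	size_t lo = 0;
	size_t hi = sizeof(recompose_table) / sizeof(recompose_table[0]);

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		const struct recomposition *e = &recompose_table[mid];

		if (e->base == base && e->mark == mark)
			return e->recomposed;
		if (e->base < base || (e->base == base && e->mark < mark))
			lo = mid + 1;
		else
			hi = mid;
	}
	return 0;
}
```

A caller would keep the previous base character around and, when a combining mark arrives, replace the pair with the returned code point if it is non-zero.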


-- 
js
suse labs

* Re: [PATCH v2 10/13] vt: pad double-width code points with a zero-width space
  2025-04-15 19:17 ` [PATCH v2 10/13] vt: pad double-width code points with a zero-width space Nicolas Pitre
@ 2025-04-16  5:07   ` Jiri Slaby
  0 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2025-04-16  5:07 UTC (permalink / raw)
  To: Nicolas Pitre, Greg Kroah-Hartman
  Cc: Nicolas Pitre, linux-serial, linux-kernel

On 15. 04. 25, 21:17, Nicolas Pitre wrote:
> From: Nicolas Pitre <npitre@baylibre.com>
> 
> In the Unicode screen buffer, we follow double-width code points with a
> space to maintain proper column alignment. This, however, creates
> semantic problems when e.g. using cut and paste.
> 
> Let's use a better code point for the column padding's purpose i.e. a
> zero-width space rather than a full space. This way the combination
> remains with a width of 2.
> 
> Signed-off-by: Nicolas Pitre <npitre@baylibre.com>

Reviewed-by: Jiri Slaby <jirislaby@kernel.org>
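
The padding idea in the commit message can be sketched roughly as follows. This is a simplified illustration, not the vt code: put_cp() and is_double_width() are hypothetical names, and the width test is a stand-in covering only the CJK Unified Ideographs block.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ZWSP 0x200B	/* U+200B ZERO WIDTH SPACE */

/* Stand-in width test (assumption): only U+4E00..U+9FFF counts as wide. */
static bool is_double_width(uint32_t cp)
{
	return cp >= 0x4E00 && cp <= 0x9FFF;
}

/*
 * Store cp at row[col]. A double-width code point also claims the next
 * cell, which is filled with ZWSP instead of a plain space so that text
 * copied out of the buffer does not gain a visible space.
 * Returns the number of columns consumed, or 0 if cp does not fit.
 */
static size_t put_cp(uint32_t *row, size_t ncols, size_t col, uint32_t cp)
{
	size_t width = is_double_width(cp) ? 2 : 1;

	if (col >= ncols || ncols - col < width)
		return 0;

	row[col] = cp;
	if (width == 2)
		row[col + 1] = ZWSP;
	return width;
}
```

The glyph plus its padding cell still occupy two columns, so column alignment is unchanged; only the content of the second cell differs.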


-- 
js
suse labs

* Re: [PATCH v2 11/13] vt: remove zero-width-space handling from conv_uni_to_pc()
  2025-04-15 19:18 ` [PATCH v2 11/13] vt: remove zero-width-space handling from conv_uni_to_pc() Nicolas Pitre
@ 2025-04-16  5:07   ` Jiri Slaby
  0 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2025-04-16  5:07 UTC (permalink / raw)
  To: Nicolas Pitre, Greg Kroah-Hartman
  Cc: Nicolas Pitre, linux-serial, linux-kernel

On 15. 04. 25, 21:18, Nicolas Pitre wrote:
> From: Nicolas Pitre <npitre@baylibre.com>
> 
> This is now taken care of by ucs_is_zero_width().
> 
> Signed-off-by: Nicolas Pitre <npitre@baylibre.com>

Reviewed-by: Jiri Slaby <jirislaby@kernel.org>


-- 
js
suse labs

* Re: [PATCH v2 12/13] vt: update gen_ucs_width_table.py to make tables more space efficient
  2025-04-15 19:18 ` [PATCH v2 12/13] vt: update gen_ucs_width_table.py to make tables more space efficient Nicolas Pitre
@ 2025-04-16  5:09   ` Jiri Slaby
  0 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2025-04-16  5:09 UTC (permalink / raw)
  To: Nicolas Pitre, Greg Kroah-Hartman
  Cc: Nicolas Pitre, linux-serial, linux-kernel

On 15. 04. 25, 21:18, Nicolas Pitre wrote:
> From: Nicolas Pitre <npitre@baylibre.com>
> 
> Split table ranges into BMP (16-bit) and non-BMP (above 16-bit).
> This reduces the corresponding text size by 20-25%.
> 
> Signed-off-by: Nicolas Pitre <npitre@baylibre.com>

Reviewed-by: Jiri Slaby <jirislaby@kernel.org>
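
To make the space saving concrete: a range that lies entirely in the BMP fits in two 16-bit fields (4 bytes), while a range above U+FFFF needs two 32-bit fields (8 bytes). A minimal user-space sketch of such a split lookup is shown here. The type names, table names and sample ranges are assumptions for illustration, not the generated tables.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct interval16 { uint16_t first, last; };	/* 4 bytes per range */
struct interval32 { uint32_t first, last; };	/* 8 bytes per range */

/* Sample double-width ranges only; each table is sorted and non-overlapping. */
static const struct interval16 wide_bmp[] = {
	{ 0x1100, 0x115F },	/* Hangul Jamo initial consonants */
	{ 0x4E00, 0x9FFF },	/* CJK Unified Ideographs */
};
static const struct interval32 wide_non_bmp[] = {
	{ 0x1F600, 0x1F64F },	/* Emoticons */
	{ 0x20000, 0x2FFFD },	/* Supplementary Ideographic Plane */
};

static int cmp16(const void *key, const void *element)
{
	uint16_t cp = *(const uint16_t *)key;
	const struct interval16 *e = element;

	return cp < e->first ? -1 : cp > e->last ? 1 : 0;
}

static int cmp32(const void *key, const void *element)
{
	uint32_t cp = *(const uint32_t *)key;
	const struct interval32 *e = element;

	return cp < e->first ? -1 : cp > e->last ? 1 : 0;
}

/* Pick the table by code point size, then binary-search it. */
static bool is_wide(uint32_t cp)
{
	if (cp <= 0xFFFF) {
		uint16_t cp16 = (uint16_t)cp;

		return bsearch(&cp16, wide_bmp,
			       sizeof(wide_bmp) / sizeof(wide_bmp[0]),
			       sizeof(wide_bmp[0]), cmp16) != NULL;
	}
	return bsearch(&cp, wide_non_bmp,
		       sizeof(wide_non_bmp) / sizeof(wide_non_bmp[0]),
		       sizeof(wide_non_bmp[0]), cmp32) != NULL;
}
```

Since most ranges of interest sit in the BMP, halving their entry size is plausibly where the quoted 20-25% reduction comes from.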


-- 
js
suse labs

* Re: [PATCH v2 13/13] vt: refresh ucs_width_table.h and adjust code in ucs.c accordingly
  2025-04-15 19:18 ` [PATCH v2 13/13] vt: refresh ucs_width_table.h and adjust code in ucs.c accordingly Nicolas Pitre
@ 2025-04-16  5:12   ` Jiri Slaby
  2025-04-16 13:09     ` Nicolas Pitre
  0 siblings, 1 reply; 33+ messages in thread
From: Jiri Slaby @ 2025-04-16  5:12 UTC (permalink / raw)
  To: Nicolas Pitre, Greg Kroah-Hartman
  Cc: Nicolas Pitre, linux-serial, linux-kernel

On 15. 04. 25, 21:18, Nicolas Pitre wrote:
> From: Nicolas Pitre <npitre@baylibre.com>
> 
> Width tables are now split into BMP (16-bit) and non-BMP (above 16-bit).
> This reduces the corresponding text size by 20-25%.
> 
> Note: scripts/checkpatch.pl complains about "... exceeds 100 columns".
>        Please ignore.
...
> --- a/drivers/tty/vt/ucs.c
> +++ b/drivers/tty/vt/ucs.c
> @@ -5,17 +5,34 @@
...
> -static int interval_cmp(const void *key, const void *element)
> +static int interval16_cmp(const void *key, const void *element)
> +{
> +	u16 cp = *(u16 *)key;

You cast away const. Does the compiler not complain?

> +	const struct ucs_interval16 *entry = element;
> +
> +	if (cp < entry->first)
> +		return -1;
> +	if (cp > entry->last)
> +		return 1;
> +	return 0;
> +}
> +
> +static int interval32_cmp(const void *key, const void *element)
>   {
>   	u32 cp = *(u32 *)key;

Apparently not, given we have been doing this for ages. I wonder why?

Anyway:

Reviewed-by: Jiri Slaby <jirislaby@kernel.org>

-- 
js
suse labs

* Re: [PATCH v2 13/13] vt: refresh ucs_width_table.h and adjust code in ucs.c accordingly
  2025-04-16  5:12   ` Jiri Slaby
@ 2025-04-16 13:09     ` Nicolas Pitre
  0 siblings, 0 replies; 33+ messages in thread
From: Nicolas Pitre @ 2025-04-16 13:09 UTC (permalink / raw)
  To: Jiri Slaby; +Cc: Greg Kroah-Hartman, linux-serial, linux-kernel

On Wed, 16 Apr 2025, Jiri Slaby wrote:

> On 15. 04. 25, 21:18, Nicolas Pitre wrote:
> > From: Nicolas Pitre <npitre@baylibre.com>
> > 
> > Width tables are now split into BMP (16-bit) and non-BMP (above 16-bit).
> > This reduces the corresponding text size by 20-25%.
> > 
> > Note: scripts/checkpatch.pl complains about "... exceeds 100 columns".
> >        Please ignore.
> ...
> > --- a/drivers/tty/vt/ucs.c
> > +++ b/drivers/tty/vt/ucs.c
> > @@ -5,17 +5,34 @@
> ...
> > -static int interval_cmp(const void *key, const void *element)
> > +static int interval16_cmp(const void *key, const void *element)
> > +{
> > +	u16 cp = *(u16 *)key;
> 
> You cast away const. Does the compiler not complain?

Nope.

> > +	const struct ucs_interval16 *entry = element;
> > +
> > +	if (cp < entry->first)
> > +		return -1;
> > +	if (cp > entry->last)
> > +		return 1;
> > +	return 0;
> > +}
> > +
> > +static int interval32_cmp(const void *key, const void *element)
> >   {
> >   	u32 cp = *(u32 *)key;
> 
> Apparently not, given we have been doing this for ages. I wonder why?

Because we're not creating another pointer that could be used for 
modifying the referenced memory.
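
As a side note for readers, a small sketch of the two spellings under discussion. In C an explicit cast is allowed to drop a qualifier, and GCC only reports it when -Wcast-qual is requested, which is not part of -Wall or -Wextra. The struct and function names here are illustrative and use user-space types rather than the kernel's u16.

```c
#include <stdint.h>

struct interval16 { uint16_t first, last; };

/* As in the patch: the cast drops const, but the pointer is only read through. */
static int cmp_cast_away(const void *key, const void *element)
{
	uint16_t cp = *(uint16_t *)key;	/* flagged only with -Wcast-qual */
	const struct interval16 *entry = element;

	if (cp < entry->first)
		return -1;
	if (cp > entry->last)
		return 1;
	return 0;
}

/* Same behavior, with the qualifier kept in the cast. */
static int cmp_keep_const(const void *key, const void *element)
{
	uint16_t cp = *(const uint16_t *)key;
	const struct interval16 *entry = element;

	if (cp < entry->first)
		return -1;
	if (cp > entry->last)
		return 1;
	return 0;
}
```

Both versions read the key identically; the second simply stays quiet even when -Wcast-qual is enabled.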

> Anyway:
> 
> Reviewed-by: Jiri Slaby <jirislaby@kernel.org>
> 
> -- 
> js
> suse labs
> 
> 

* Re: [PATCH v2 07/13] vt: introduce gen_ucs_recompose_table.py to create ucs_recompose_table.h
  2025-04-16  4:29   ` Jiri Slaby
@ 2025-04-16 13:17     ` Nicolas Pitre
  2025-04-17  4:09       ` Jiri Slaby
  0 siblings, 1 reply; 33+ messages in thread
From: Nicolas Pitre @ 2025-04-16 13:17 UTC (permalink / raw)
  To: Jiri Slaby; +Cc: Greg Kroah-Hartman, linux-serial, linux-kernel

On Wed, 16 Apr 2025, Jiri Slaby wrote:

> On 15. 04. 25, 21:17, Nicolas Pitre wrote:
> > +/*
> > + * {out_file} - Unicode character recomposition
> > + *
> > + * Auto-generated by {this_file}{generation_mode}
> > + *
> > + * Unicode Version: {unicodedata.unidata_version}
> > + *
> > +{textwrap.fill(
> > +    f"This file contains a table with {table_description_detail}. " +
> > +    f"To generate a table with {alt_description_detail} instead, run:",
> > +    width=75, initial_indent=" * ", subsequent_indent=" * ")}
> > + *
> > + *   python {this_file}{alternative_mode}
> 
> This should be python3. Or no 'python' at all -- I assume the script is
> executable given "new file mode 100755".

On my system, python has been python3 for many years. I think it is safe.

> Reviewed-by: Jiri Slaby <jirislaby@kernel.org>
> 
> -- 
> js
> suse labs
> 
> 

* Re: [PATCH v2 04/13] vt: introduce gen_ucs_width_table.py to create ucs_width_table.h
  2025-04-16  4:19     ` Jiri Slaby
@ 2025-04-16 13:21       ` Nicolas Pitre
  0 siblings, 0 replies; 33+ messages in thread
From: Nicolas Pitre @ 2025-04-16 13:21 UTC (permalink / raw)
  To: Jiri Slaby; +Cc: Greg Kroah-Hartman, linux-serial, linux-kernel

On Wed, 16 Apr 2025, Jiri Slaby wrote:

> On 16. 04. 25, 6:14, Jiri Slaby wrote:
> > On 15. 04. 25, 21:17, Nicolas Pitre wrote:
> >> From: Nicolas Pitre <npitre@baylibre.com>
> >>
> >> The table in ucs.c is terribly out of date and incomplete. We also need a
> >> second table to store zero-width code points. Properly maintaining those
> >> tables manually is impossible. So here's a script to generate them.
> >>
> >> Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
> > 
> > Reviewed-by: Jiri Slaby <jirislaby@kernel.org>
> 
> Actually, could you create a makefile rule for this too?
> 
> Similar to what GENERATE_KEYMAP does in vt/Makefile.
> 
> So that you would do:
>   make GENERATE_UCS_WIDTH_TABLE=1 drivers/tty/vt/
> to let the script generate it on the fly?

Sure. I have more patches coming up so I'll bundle that change with 
them.


Nicolas

* Re: [PATCH v2 07/13] vt: introduce gen_ucs_recompose_table.py to create ucs_recompose_table.h
  2025-04-16 13:17     ` Nicolas Pitre
@ 2025-04-17  4:09       ` Jiri Slaby
  0 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2025-04-17  4:09 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Greg Kroah-Hartman, linux-serial, linux-kernel

On 16. 04. 25, 15:17, Nicolas Pitre wrote:
> On Wed, 16 Apr 2025, Jiri Slaby wrote:
> 
>> On 15. 04. 25, 21:17, Nicolas Pitre wrote:
>>> +/*
>>> + * {out_file} - Unicode character recomposition
>>> + *
>>> + * Auto-generated by {this_file}{generation_mode}
>>> + *
>>> + * Unicode Version: {unicodedata.unidata_version}
>>> + *
>>> +{textwrap.fill(
>>> +    f"This file contains a table with {table_description_detail}. " +
>>> +    f"To generate a table with {alt_description_detail} instead, run:",
>>> +    width=75, initial_indent=" * ", subsequent_indent=" * ")}
>>> + *
>>> + *   python {this_file}{alternative_mode}
>>
>> This should be python3. Or no 'python' at all -- I assume the script is
>> executable given "new file mode 100755".
> 
> On my system, python has been python3 for many years. I think it is safe.

On many systems (incl. _all_ SUSE's):
$ python
bash: python: command not found

The convention is to call python3 if you want py3 (the same as you do in 
the script's shebang).

-- 
js
suse labs

* Re: [PATCH v2 06/13] vt: use new tables in ucs.c
  2025-04-15 19:17 ` [PATCH v2 06/13] vt: use new tables in ucs.c Nicolas Pitre
  2025-04-16  4:22   ` Jiri Slaby
@ 2025-04-17  8:30   ` kernel test robot
  1 sibling, 0 replies; 33+ messages in thread
From: kernel test robot @ 2025-04-17  8:30 UTC (permalink / raw)
  To: Nicolas Pitre, Greg Kroah-Hartman, Jiri Slaby
  Cc: oe-kbuild-all, Nicolas Pitre, linux-serial, linux-kernel

Hi Nicolas,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tty/tty-linus]
[also build test WARNING on linus/master v6.15-rc2]
[cannot apply to tty/tty-testing tty/tty-next next-20250416]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Nicolas-Pitre/vt-minor-cleanup-to-vc_translate_unicode/20250416-142136
base:   https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty.git tty-linus
patch link:    https://lore.kernel.org/r/20250415192212.33949-7-nico%40fluxnic.net
patch subject: [PATCH v2 06/13] vt: use new tables in ucs.c
config: arc-randconfig-002-20250417 (https://download.01.org/0day-ci/archive/20250417/202504171646.NDaNfFql-lkp@intel.com/config)
compiler: arc-linux-gcc (GCC) 13.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250417/202504171646.NDaNfFql-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202504171646.NDaNfFql-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> drivers/tty/vt/ucs.c:43: warning: Function parameter or struct member 'cp' not described in 'ucs_is_zero_width'
>> drivers/tty/vt/ucs.c:43: warning: expecting prototype for Determine if a Unicode code point is zero(). Prototype was for ucs_is_zero_width() instead
   drivers/tty/vt/ucs.c:55: warning: Function parameter or struct member 'cp' not described in 'ucs_is_double_width'
   drivers/tty/vt/ucs.c:55: warning: expecting prototype for Determine if a Unicode code point is double(). Prototype was for ucs_is_double_width() instead


vim +43 drivers/tty/vt/ucs.c

    35	
    36	/**
    37	 * Determine if a Unicode code point is zero-width.
    38	 *
    39	 * @param cp: Unicode code point (UCS-4)
    40	 * Return: true if the character is zero-width, false otherwise
    41	 */
    42	bool ucs_is_zero_width(u32 cp)
  > 43	{
    44		return cp_in_range(cp, ucs_zero_width_ranges,
    45				   ARRAY_SIZE(ucs_zero_width_ranges));
    46	}
    47	
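
The warnings above come from the comment not following kernel-doc layout: the first line should read "name() - description" and parameters are written "@cp:" rather than "@param cp:". A sketch of a comment in that layout is given here, wrapped in a self-contained user-space stand-in. The struct interval type, the sample ranges and the linear cp_in_range() are assumptions made so the snippet compiles on its own; they are not the code in ucs.c.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins so the snippet compiles outside the kernel. */
struct interval { uint32_t first, last; };

static const struct interval zero_width_ranges[] = {
	{ 0x0300, 0x036F },	/* Combining Diacritical Marks */
	{ 0x200B, 0x200F },	/* ZWSP, ZWNJ, ZWJ, LRM, RLM */
	{ 0xFE00, 0xFE0F },	/* Variation Selectors */
};

/* Simple linear scan; a real table lookup would use binary search. */
static bool cp_in_range(uint32_t cp, const struct interval *ranges, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (cp >= ranges[i].first && cp <= ranges[i].last)
			return true;
	return false;
}

/**
 * ucs_is_zero_width() - Determine if a Unicode code point is zero-width.
 * @cp: Unicode code point (UCS-4)
 *
 * Return: true if the character is zero-width, false otherwise
 */
bool ucs_is_zero_width(uint32_t cp)
{
	return cp_in_range(cp, zero_width_ranges,
			   sizeof(zero_width_ranges) / sizeof(zero_width_ranges[0]));
}
```

The same layout change would apply to the ucs_is_double_width() comment reported at line 55.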

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


Thread overview: 33+ messages
2025-04-15 19:17 [PATCH v2 00/13] vt: implement proper Unicode handling Nicolas Pitre
2025-04-15 19:17 ` [PATCH v2 01/13] vt: minor cleanup to vc_translate_unicode() Nicolas Pitre
2025-04-16  3:41   ` Jiri Slaby
2025-04-15 19:17 ` [PATCH v2 02/13] vt: move unicode processing to a separate file Nicolas Pitre
2025-04-16  3:42   ` Jiri Slaby
2025-04-15 19:17 ` [PATCH v2 03/13] vt: properly support zero-width Unicode code points Nicolas Pitre
2025-04-16  3:45   ` Jiri Slaby
2025-04-15 19:17 ` [PATCH v2 04/13] vt: introduce gen_ucs_width_table.py to create ucs_width_table.h Nicolas Pitre
2025-04-16  4:14   ` Jiri Slaby
2025-04-16  4:19     ` Jiri Slaby
2025-04-16 13:21       ` Nicolas Pitre
2025-04-15 19:17 ` [PATCH v2 05/13] vt: create ucs_width_table.h with gen_ucs_width_table.py Nicolas Pitre
2025-04-16  4:20   ` Jiri Slaby
2025-04-15 19:17 ` [PATCH v2 06/13] vt: use new tables in ucs.c Nicolas Pitre
2025-04-16  4:22   ` Jiri Slaby
2025-04-17  8:30   ` kernel test robot
2025-04-15 19:17 ` [PATCH v2 07/13] vt: introduce gen_ucs_recompose_table.py to create ucs_recompose_table.h Nicolas Pitre
2025-04-16  4:29   ` Jiri Slaby
2025-04-16 13:17     ` Nicolas Pitre
2025-04-17  4:09       ` Jiri Slaby
2025-04-15 19:17 ` [PATCH v2 08/13] vt: create ucs_recompose_table.h with gen_ucs_recompose_table.py Nicolas Pitre
2025-04-16  5:05   ` Jiri Slaby
2025-04-15 19:17 ` [PATCH v2 09/13] vt: support Unicode recomposition Nicolas Pitre
2025-04-16  5:07   ` Jiri Slaby
2025-04-15 19:17 ` [PATCH v2 10/13] vt: pad double-width code points with a zero-width space Nicolas Pitre
2025-04-16  5:07   ` Jiri Slaby
2025-04-15 19:18 ` [PATCH v2 11/13] vt: remove zero-width-space handling from conv_uni_to_pc() Nicolas Pitre
2025-04-16  5:07   ` Jiri Slaby
2025-04-15 19:18 ` [PATCH v2 12/13] vt: update gen_ucs_width_table.py to make tables more space efficient Nicolas Pitre
2025-04-16  5:09   ` Jiri Slaby
2025-04-15 19:18 ` [PATCH v2 13/13] vt: refresh ucs_width_table.h and adjust code in ucs.c accordingly Nicolas Pitre
2025-04-16  5:12   ` Jiri Slaby
2025-04-16 13:09     ` Nicolas Pitre
