linux-serial.vger.kernel.org archive mirror
* [PATCH 00/11] vt: implement proper Unicode handling
@ 2025-04-10  1:13 Nicolas Pitre
  2025-04-10  1:13 ` [PATCH 01/11] vt: minor cleanup to vc_translate_unicode() Nicolas Pitre
                   ` (12 more replies)
  0 siblings, 13 replies; 27+ messages in thread
From: Nicolas Pitre @ 2025-04-10  1:13 UTC
  To: Greg Kroah-Hartman, Jiri Slaby
  Cc: Nicolas Pitre, Dave Mielke, linux-serial, linux-kernel

The Linux VT console has many problems with regard to proper Unicode
handling:

- All new double-width Unicode code points which have been introduced since
  Unicode 5.0 are not recognized as such (we're at Unicode 16.0 now).

- Zero-width code points are not recognized at all. If you try to edit files
  containing a lot of emojis, you will see the rendering issues. When there
  are a lot of zero-width characters (like "variation selectors"), long
  lines get wrapped, but any Unicode-aware editor thinks that the content
  was rendered properly and its rendering logic starts to work in very bad
  ways. Combine this with tmux or screen, and there is a huge mess going on
  in the terminal.

- Text which uses combining diacritics has the same effect as text with
  zero-width characters, as programs expect the characters to take fewer
  columns than they actually do.
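
For illustration only (this is not part of the series): a minimal Python
sketch of the mismatch described above. The helper name expected_columns()
is hypothetical, and its rules are a simplification based on Python's
unicodedata module. It estimates the columns a Unicode-aware program would
budget for a string, while a console that gives every code point its own
cell uses one cell per code point.

```python
import unicodedata

def expected_columns(text):
    """Columns a Unicode-aware program would likely budget for text."""
    total = 0
    for ch in text:
        # Combining marks (category M*) and ZWJ occupy no column.
        if unicodedata.category(ch).startswith("M") or ch == "\u200d":
            continue
        # Fullwidth and Wide characters occupy two columns.
        total += 2 if unicodedata.east_asian_width(ch) in ("F", "W") else 1
    return total

text = "cafe\u0301"  # "e" followed by COMBINING ACUTE ACCENT
print(expected_columns(text), len(text))
```

Whenever the two numbers differ, the editor's idea of the cursor column
and the console's actual layout drift apart by that amount.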

Some may argue that the Linux VT console is unmaintained and/or not used
much any longer and that one should consider a user space terminal
alternative instead. But every such alternative that is at least as well
maintained as the Linux VT console requires a full, heavy graphical
environment, which is the exact antithesis of what the Linux console is
meant to be.

Furthermore, there is a significant Linux console user base represented by
blind users (of which I am a member) for whom the alternatives are way more
cumbersome to use, reducing our productivity. So it has to stay and be
maintained to the best of our abilities.

That being said...

This patch series is about fixing all the above issues. This is accomplished
with some Python scripts leveraging Python's unicodedata module to generate
C code with lookup tables that is suitable for the kernel. In summary:

- The double-width code point table is updated to the latest Unicode version
  and the table itself is optimized to reduce its size.

- A zero-width code point table is created and the console code is modified
  to properly use it.

- A table with base character + combining mark pairs is created to convert
  them into their precomposed equivalents when they're encountered.
  By default the generated table contains most commonly used Latin, Greek,
  and Cyrillic recomposition pairs only, but one can execute the provided
  script with the --full argument to create a table that covers all
  possibilities. Combining marks that are not listed in the table are simply
  treated like zero-width code points and properly ignored.

- All those tables plus related lookup code require about 3500 additional
  bytes of text which is not very significant these days. Yet, one
  can still set CONFIG_CONSOLE_TRANSLATIONS=n to configure this all out
  if need be.
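
For illustration only: a Python sketch of how a table of base character +
combining mark pairs could be derived from unicodedata. This is an
assumption about the general approach, not the series' actual
gen_ucs_recompose.py, which is not shown here. The names
build_recompose_table() and recompose() are hypothetical, and the sketch
ignores Unicode composition exclusions.

```python
import unicodedata

def build_recompose_table(first, last):
    """Map (base, mark) pairs to precomposed code points in [first, last]."""
    table = {}
    for cp in range(first, last + 1):
        decomp = unicodedata.decomposition(chr(cp))
        # Skip code points without a decomposition and compatibility
        # decompositions, which are tagged like "<compat> ...".
        if not decomp or decomp.startswith("<"):
            continue
        parts = [int(p, 16) for p in decomp.split()]
        if len(parts) == 2:
            table[(parts[0], parts[1])] = cp
    return table

# Latin-1 Supplement through Latin Extended-B, as a small example range.
TABLE = build_recompose_table(0x00C0, 0x024F)

def recompose(base, mark):
    """Return the precomposed code point, or None if the pair is unknown."""
    return TABLE.get((base, mark))

print(hex(recompose(0x65, 0x301)))  # "e" + COMBINING ACUTE ACCENT
```

A pair that is absent from the table returns None, which corresponds to
the case above where the combining mark is treated as a zero-width code
point.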

Note: The generated C code makes scripts/checkpatch.pl complain about
      "... exceeds 100 columns" because the inserted comments with code
      point names, well, make some lines exceed 100 columns. Please make
      an exception for those files and disregard those warnings. When
      checkpatch.pl is used on those files directly with -f then it doesn't
      complain.

This series was tested on top of v6.15-rc1.

diffstat:

 drivers/tty/vt/Makefile             |   3 +-
 drivers/tty/vt/gen_ucs_recompose.py | 321 ++++++++++++++++++
 drivers/tty/vt/gen_ucs_width.py     | 336 +++++++++++++++++++
 drivers/tty/vt/ucs_recompose.c      | 170 ++++++++++
 drivers/tty/vt/ucs_width.c          | 536 ++++++++++++++++++++++++++++++
 drivers/tty/vt/vt.c                 | 111 ++++---
 include/linux/consolemap.h          |  18 +
 7 files changed, 1448 insertions(+), 47 deletions(-)


* [PATCH 01/11] vt: minor cleanup to vc_translate_unicode()
  2025-04-10  1:13 [PATCH 00/11] vt: implement proper Unicode handling Nicolas Pitre
@ 2025-04-10  1:13 ` Nicolas Pitre
  2025-04-10  1:13 ` [PATCH 02/11] vt: move unicode processing to a separate file Nicolas Pitre
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: Nicolas Pitre @ 2025-04-10  1:13 UTC
  To: Greg Kroah-Hartman, Jiri Slaby
  Cc: Nicolas Pitre, Dave Mielke, linux-serial, linux-kernel

From: Nicolas Pitre <npitre@baylibre.com>

Make it clearer when a sequence is bad.

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
---
 drivers/tty/vt/vt.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
index f5642b3038..b5f3c8a818 100644
--- a/drivers/tty/vt/vt.c
+++ b/drivers/tty/vt/vt.c
@@ -2817,7 +2817,7 @@ static int vc_translate_unicode(struct vc_data *vc, int c, bool *rescan)
 	if ((c & 0xc0) == 0x80) {
 		/* Unexpected continuation byte? */
 		if (!vc->vc_utf_count)
-			return 0xfffd;
+			goto bad_sequence;
 
 		vc->vc_utf_char = (vc->vc_utf_char << 6) | (c & 0x3f);
 		vc->vc_npar++;
@@ -2829,17 +2829,17 @@ static int vc_translate_unicode(struct vc_data *vc, int c, bool *rescan)
 		/* Reject overlong sequences */
 		if (c <= utf8_length_changes[vc->vc_npar - 1] ||
 				c > utf8_length_changes[vc->vc_npar])
-			return 0xfffd;
+			goto bad_sequence;
 
 		return vc_sanitize_unicode(c);
 	}
 
 	/* Single ASCII byte or first byte of a sequence received */
 	if (vc->vc_utf_count) {
-		/* Continuation byte expected */
+		/* A continuation byte was expected */
 		*rescan = true;
 		vc->vc_utf_count = 0;
-		return 0xfffd;
+		goto bad_sequence;
 	}
 
 	/* Nothing to do if an ASCII byte was received */
@@ -2858,11 +2858,14 @@ static int vc_translate_unicode(struct vc_data *vc, int c, bool *rescan)
 		vc->vc_utf_count = 3;
 		vc->vc_utf_char = (c & 0x07);
 	} else {
-		return 0xfffd;
+		goto bad_sequence;
 	}
 
 need_more_bytes:
 	return -1;
+
+bad_sequence:
+	return 0xfffd;
 }
 
 static int vc_translate(struct vc_data *vc, int *c, bool *rescan)
-- 
2.49.0



* [PATCH 02/11] vt: move unicode processing to a separate file
  2025-04-10  1:13 [PATCH 00/11] vt: implement proper Unicode handling Nicolas Pitre
  2025-04-10  1:13 ` [PATCH 01/11] vt: minor cleanup to vc_translate_unicode() Nicolas Pitre
@ 2025-04-10  1:13 ` Nicolas Pitre
  2025-04-14  6:47   ` Jiri Slaby
  2025-04-10  1:13 ` [PATCH 03/11] vt: properly support zero-width Unicode code points Nicolas Pitre
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 27+ messages in thread
From: Nicolas Pitre @ 2025-04-10  1:13 UTC
  To: Greg Kroah-Hartman, Jiri Slaby
  Cc: Nicolas Pitre, Dave Mielke, linux-serial, linux-kernel

From: Nicolas Pitre <npitre@baylibre.com>

This will make it easier to maintain. Also make it depend on
CONFIG_CONSOLE_TRANSLATIONS.

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
---
 drivers/tty/vt/Makefile    |  3 ++-
 drivers/tty/vt/ucs_width.c | 45 ++++++++++++++++++++++++++++++++++++++
 drivers/tty/vt/vt.c        | 40 +--------------------------------
 include/linux/consolemap.h |  6 +++++
 4 files changed, 54 insertions(+), 40 deletions(-)
 create mode 100644 drivers/tty/vt/ucs_width.c

diff --git a/drivers/tty/vt/Makefile b/drivers/tty/vt/Makefile
index 2c8ce8b592..bee69277bb 100644
--- a/drivers/tty/vt/Makefile
+++ b/drivers/tty/vt/Makefile
@@ -7,7 +7,8 @@ FONTMAPFILE = cp437.uni
 obj-$(CONFIG_VT)			+= vt_ioctl.o vc_screen.o \
 					   selection.o keyboard.o \
 					   vt.o defkeymap.o
-obj-$(CONFIG_CONSOLE_TRANSLATIONS)	+= consolemap.o consolemap_deftbl.o
+obj-$(CONFIG_CONSOLE_TRANSLATIONS)	+= consolemap.o consolemap_deftbl.o \
+					   ucs_width.o
 
 # Files generated that shall be removed upon make clean
 clean-files := consolemap_deftbl.c defkeymap.c
diff --git a/drivers/tty/vt/ucs_width.c b/drivers/tty/vt/ucs_width.c
new file mode 100644
index 0000000000..5f0bde30a1
--- /dev/null
+++ b/drivers/tty/vt/ucs_width.c
@@ -0,0 +1,45 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/types.h>
+#include <linux/array_size.h>
+#include <linux/bsearch.h>
+#include <linux/consolemap.h>
+
+/* ucs_is_double_width() is based on the wcwidth() implementation by
+ * Markus Kuhn -- 2007-05-26 (Unicode 5.0)
+ * Latest version: https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
+ */
+
+struct interval {
+	uint32_t first;
+	uint32_t last;
+};
+
+static int ucs_cmp(const void *key, const void *elt)
+{
+	uint32_t cp = *(uint32_t *)key;
+	struct interval e = *(struct interval *) elt;
+
+	if (cp > e.last)
+		return 1;
+	else if (cp < e.first)
+		return -1;
+	return 0;
+}
+
+static const struct interval double_width[] = {
+	{ 0x1100, 0x115F }, { 0x2329, 0x232A }, { 0x2E80, 0x303E },
+	{ 0x3040, 0xA4CF }, { 0xAC00, 0xD7A3 }, { 0xF900, 0xFAFF },
+	{ 0xFE10, 0xFE19 }, { 0xFE30, 0xFE6F }, { 0xFF00, 0xFF60 },
+	{ 0xFFE0, 0xFFE6 }, { 0x20000, 0x2FFFD }, { 0x30000, 0x3FFFD }
+};
+
+bool ucs_is_double_width(uint32_t cp)
+{
+	if (cp < double_width[0].first ||
+	    cp > double_width[ARRAY_SIZE(double_width) - 1].last)
+		return false;
+
+	return bsearch(&cp, double_width, ARRAY_SIZE(double_width),
+		       sizeof(struct interval), ucs_cmp) != NULL;
+}
diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
index b5f3c8a818..bcb508bc15 100644
--- a/drivers/tty/vt/vt.c
+++ b/drivers/tty/vt/vt.c
@@ -104,7 +104,6 @@
 #include <linux/uaccess.h>
 #include <linux/kdb.h>
 #include <linux/ctype.h>
-#include <linux/bsearch.h>
 #include <linux/gcd.h>
 
 #define MAX_NR_CON_DRIVER 16
@@ -2712,43 +2711,6 @@ static void do_con_trol(struct tty_struct *tty, struct vc_data *vc, u8 c)
 	}
 }
 
-/* is_double_width() is based on the wcwidth() implementation by
- * Markus Kuhn -- 2007-05-26 (Unicode 5.0)
- * Latest version: https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
- */
-struct interval {
-	uint32_t first;
-	uint32_t last;
-};
-
-static int ucs_cmp(const void *key, const void *elt)
-{
-	uint32_t ucs = *(uint32_t *)key;
-	struct interval e = *(struct interval *) elt;
-
-	if (ucs > e.last)
-		return 1;
-	else if (ucs < e.first)
-		return -1;
-	return 0;
-}
-
-static int is_double_width(uint32_t ucs)
-{
-	static const struct interval double_width[] = {
-		{ 0x1100, 0x115F }, { 0x2329, 0x232A }, { 0x2E80, 0x303E },
-		{ 0x3040, 0xA4CF }, { 0xAC00, 0xD7A3 }, { 0xF900, 0xFAFF },
-		{ 0xFE10, 0xFE19 }, { 0xFE30, 0xFE6F }, { 0xFF00, 0xFF60 },
-		{ 0xFFE0, 0xFFE6 }, { 0x20000, 0x2FFFD }, { 0x30000, 0x3FFFD }
-	};
-	if (ucs < double_width[0].first ||
-	    ucs > double_width[ARRAY_SIZE(double_width) - 1].last)
-		return 0;
-
-	return bsearch(&ucs, double_width, ARRAY_SIZE(double_width),
-			sizeof(struct interval), ucs_cmp) != NULL;
-}
-
 struct vc_draw_region {
 	unsigned long from, to;
 	int x;
@@ -2953,7 +2915,7 @@ static int vc_con_write_normal(struct vc_data *vc, int tc, int c,
 	bool inverse = false;
 
 	if (vc->vc_utf && !vc->vc_disp_ctrl) {
-		if (is_double_width(c))
+		if (ucs_is_double_width(c))
 			width = 2;
 	}
 
diff --git a/include/linux/consolemap.h b/include/linux/consolemap.h
index c35db4896c..caf079bcb8 100644
--- a/include/linux/consolemap.h
+++ b/include/linux/consolemap.h
@@ -28,6 +28,7 @@ int conv_uni_to_pc(struct vc_data *conp, long ucs);
 u32 conv_8bit_to_uni(unsigned char c);
 int conv_uni_to_8bit(u32 uni);
 void console_map_init(void);
+bool ucs_is_double_width(uint32_t cp);
 #else
 static inline u16 inverse_translate(const struct vc_data *conp, u16 glyph,
 		bool use_unicode)
@@ -57,6 +58,11 @@ static inline int conv_uni_to_8bit(u32 uni)
 }
 
 static inline void console_map_init(void) { }
+
+static inline bool ucs_is_double_width(uint32_t cp)
+{
+	return false;
+}
 #endif /* CONFIG_CONSOLE_TRANSLATIONS */
 
 #endif /* __LINUX_CONSOLEMAP_H__ */
-- 
2.49.0



* [PATCH 03/11] vt: properly support zero-width Unicode code points
  2025-04-10  1:13 [PATCH 00/11] vt: implement proper Unicode handling Nicolas Pitre
  2025-04-10  1:13 ` [PATCH 01/11] vt: minor cleanup to vc_translate_unicode() Nicolas Pitre
  2025-04-10  1:13 ` [PATCH 02/11] vt: move unicode processing to a separate file Nicolas Pitre
@ 2025-04-10  1:13 ` Nicolas Pitre
  2025-04-14  6:51   ` Jiri Slaby
  2025-04-10  1:13 ` [PATCH 04/11] vt: introduce gen_ucs_width.py to create ucs_width.c Nicolas Pitre
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 27+ messages in thread
From: Nicolas Pitre @ 2025-04-10  1:13 UTC
  To: Greg Kroah-Hartman, Jiri Slaby
  Cc: Nicolas Pitre, Dave Mielke, linux-serial, linux-kernel

From: Nicolas Pitre <npitre@baylibre.com>

Zero-width Unicode code points are causing misalignment in vertically
aligned content, disrupting the visual layout. Let's handle zero-width
code points more intelligently.

Double-width code points are stored in the screen grid followed by a white
space code point to create the expected screen layout. When a double-width
code point is followed by a zero-width code point in the console incoming
bytestream (e.g., an emoji with a presentation selector) then we may
replace the white space padding by that zero-width code point instead of
dropping it. This maximizes screen content information while preserving
proper layout.

If a zero-width code point is preceded by a single-width code point then
the above trick is not possible and such zero-width code point must
be dropped.

VS16 (Variation Selector 16, U+FE0F) is special as it doubles the width
of the preceding single-width code point. We handle that case by giving
VS16 a width of 1 when that happens.
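
For illustration only: the rules above modelled in Python on a single
screen row, represented as a list of code points with one entry per cell.
The names put(), is_double_width() and is_zero_width() in this sketch are
stand-ins built on unicodedata, not the kernel functions, and cursor
wrapping is ignored.

```python
import unicodedata

VS16 = 0xFE0F
SPACE = 0x20

def is_double_width(cp):
    # Stand-in classifier; the series uses generated tables instead.
    return unicodedata.east_asian_width(chr(cp)) in ("F", "W")

def is_zero_width(cp):
    # Stand-in classifier: combining marks plus ZERO WIDTH JOINER.
    return unicodedata.category(chr(cp)).startswith("M") or cp == 0x200D

def put(row, cp):
    """Append cp to row (one list entry per screen cell), return row."""
    if is_double_width(cp):
        row.extend([cp, SPACE])      # wide glyph plus white space padding
    elif is_zero_width(cp):
        if len(row) >= 2 and row[-1] == SPACE and is_double_width(row[-2]):
            row[-1] = cp             # reuse the padding cell
        elif cp == VS16 and row:
            row.append(cp)           # VS16 widens a single-width glyph
        # any other zero-width code point is dropped
    else:
        row.append(cp)
    return row

row = []
for cp in (0x1F600, VS16, 0x65, 0x301):
    put(row, cp)
print([hex(c) for c in row])
```

In this model the emoji and its VS16 share two cells, while the combining
acute accent after "e" is dropped because there is no padding cell for it
to reuse.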

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
---
 drivers/tty/vt/vt.c        | 46 ++++++++++++++++++++++++++++++++++++--
 include/linux/consolemap.h | 10 +++++++++
 2 files changed, 54 insertions(+), 2 deletions(-)

diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
index bcb508bc15..5d53feeb5d 100644
--- a/drivers/tty/vt/vt.c
+++ b/drivers/tty/vt/vt.c
@@ -443,6 +443,15 @@ static void vc_uniscr_scroll(struct vc_data *vc, unsigned int top,
 	}
 }
 
+static u32 vc_uniscr_getc(struct vc_data *vc, int relative_pos)
+{
+	int pos = vc->state.x + vc->vc_need_wrap + relative_pos;
+
+	if (vc->vc_uni_lines && pos >= 0 && pos < vc->vc_cols)
+		return vc->vc_uni_lines[vc->state.y][pos];
+	return 0;
+}
+
 static void vc_uniscr_copy_area(u32 **dst_lines,
 				unsigned int dst_cols,
 				unsigned int dst_rows,
@@ -2905,18 +2914,49 @@ static bool vc_is_control(struct vc_data *vc, int tc, int c)
 	return false;
 }
 
+static void vc_con_rewind(struct vc_data *vc)
+{
+	if (vc->state.x && !vc->vc_need_wrap) {
+		vc->vc_pos -= 2;
+		vc->state.x--;
+	}
+	vc->vc_need_wrap = 0;
+}
+
 static int vc_con_write_normal(struct vc_data *vc, int tc, int c,
 		struct vc_draw_region *draw)
 {
-	int next_c;
+	int next_c, prev_c;
 	unsigned char vc_attr = vc->vc_attr;
 	u16 himask = vc->vc_hi_font_mask, charmask = himask ? 0x1ff : 0xff;
 	u8 width = 1;
 	bool inverse = false;
 
 	if (vc->vc_utf && !vc->vc_disp_ctrl) {
-		if (ucs_is_double_width(c))
+		if (ucs_is_double_width(c)) {
 			width = 2;
+		} else if (ucs_is_zero_width(c)) {
+			prev_c = vc_uniscr_getc(vc, -1);
+			if (prev_c == ' ' &&
+			    ucs_is_double_width(vc_uniscr_getc(vc, -2))) {
+				/*
+				 * Let's merge this zero-width code point with
+				 * the preceding double-width code point by
+				 * replacing the existing whitespace padding.
+				 */
+				vc_con_rewind(vc);
+			} else if (c == 0xfe0f && prev_c != 0) {
+				/*
+				 * VS16 (U+FE0F) is special. Let it have a
+				 * width of 1 when preceded by a single-width
+				 * code point effectively making the latter
+				 * double-width.
+				 */
+			} else {
+				/* Otherwise zero-width code points are ignored */
+				goto out;
+			}
+		}
 	}
 
 	/* Now try to find out how to display it */
@@ -2995,6 +3035,8 @@ static int vc_con_write_normal(struct vc_data *vc, int tc, int c,
 			tc = ' ';
 		next_c = ' ';
 	}
+
+out:
 	notify_write(vc, c);
 
 	if (inverse)
diff --git a/include/linux/consolemap.h b/include/linux/consolemap.h
index caf079bcb8..7d778752dc 100644
--- a/include/linux/consolemap.h
+++ b/include/linux/consolemap.h
@@ -29,6 +29,11 @@ u32 conv_8bit_to_uni(unsigned char c);
 int conv_uni_to_8bit(u32 uni);
 void console_map_init(void);
 bool ucs_is_double_width(uint32_t cp);
+static inline bool ucs_is_zero_width(uint32_t cp)
+{
+	/* coming soon */
+	return false;
+}
 #else
 static inline u16 inverse_translate(const struct vc_data *conp, u16 glyph,
 		bool use_unicode)
@@ -63,6 +68,11 @@ static inline bool ucs_is_double_width(uint32_t cp)
 {
 	return false;
 }
+
+static inline bool ucs_is_zero_width(uint32_t cp)
+{
+	return false;
+}
 #endif /* CONFIG_CONSOLE_TRANSLATIONS */
 
 #endif /* __LINUX_CONSOLEMAP_H__ */
-- 
2.49.0



* [PATCH 04/11] vt: introduce gen_ucs_width.py to create ucs_width.c
  2025-04-10  1:13 [PATCH 00/11] vt: implement proper Unicode handling Nicolas Pitre
                   ` (2 preceding siblings ...)
  2025-04-10  1:13 ` [PATCH 03/11] vt: properly support zero-width Unicode code points Nicolas Pitre
@ 2025-04-10  1:13 ` Nicolas Pitre
  2025-04-14  7:04   ` Jiri Slaby
  2025-04-10  1:13 ` [PATCH 05/11] vt: update ucs_width.c using gen_ucs_width.py Nicolas Pitre
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 27+ messages in thread
From: Nicolas Pitre @ 2025-04-10  1:13 UTC
  To: Greg Kroah-Hartman, Jiri Slaby
  Cc: Nicolas Pitre, Dave Mielke, linux-serial, linux-kernel

From: Nicolas Pitre <npitre@baylibre.com>

The table in the current ucs_width.c is terribly out of date and
incomplete. We also need a second table to store zero-width code points.
Properly maintaining those tables manually is impossible. So here's a
script to automatically generate them.
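
For illustration only: a Python sketch of the two ideas behind such
tables. Consecutive code points are compacted into (first, last)
intervals, and a membership query then becomes a binary search over the
sorted intervals. The names compact() and contains() are hypothetical,
and the bisect-based lookup is only an analogue of a C bsearch() over the
generated table.

```python
from bisect import bisect_right

def compact(points):
    """Group code points into sorted, inclusive (first, last) intervals."""
    ranges = []
    for cp in sorted(set(points)):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp       # extend the current interval
        else:
            ranges.append([cp, cp])  # start a new interval
    return [tuple(r) for r in ranges]

def contains(ranges, cp):
    """True if cp falls inside one of the sorted intervals."""
    firsts = [first for first, _ in ranges]
    i = bisect_right(firsts, cp) - 1  # last interval starting at or before cp
    return i >= 0 and cp <= ranges[i][1]

ranges = compact([0x300, 0x301, 0x302, 0x483])
print(ranges, contains(ranges, 0x301), contains(ranges, 0x303))
```

Storing intervals instead of individual code points is what keeps the
tables small enough to be practical.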

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
---
 drivers/tty/vt/gen_ucs_width.py | 264 ++++++++++++++++++++++++++++++++
 1 file changed, 264 insertions(+)
 create mode 100755 drivers/tty/vt/gen_ucs_width.py

diff --git a/drivers/tty/vt/gen_ucs_width.py b/drivers/tty/vt/gen_ucs_width.py
new file mode 100755
index 0000000000..41997fe001
--- /dev/null
+++ b/drivers/tty/vt/gen_ucs_width.py
@@ -0,0 +1,264 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+#
+# This script uses Python's unicodedata module to generate ucs_width.c
+
+import unicodedata
+import sys
+
+def generate_ucs_width():
+    # Output file name
+    c_file = "ucs_width.c"
+
+    # Width data mapping
+    width_map = {}  # Maps code points to width (0, 1, 2)
+
+    # Define emoji modifiers and components that should have zero width
+    emoji_zero_width = [
+        # Skin tone modifiers
+        (0x1F3FB, 0x1F3FF),  # Emoji modifiers (skin tones)
+
+        # Variation selectors (note: VS16 is treated specially in vt.c)
+        (0xFE00, 0xFE0F),    # Variation Selectors 1-16
+
+        # Gender and hair style modifiers
+        (0x2640, 0x2640),    # Female sign
+        (0x2642, 0x2642),    # Male sign
+        (0x26A7, 0x26A7),    # Transgender symbol
+        (0x1F9B0, 0x1F9B3),  # Hair components (red, curly, white, bald)
+
+        # Tag characters
+        (0xE0020, 0xE007E),  # Tags
+    ]
+
+    # Mark these emoji modifiers as zero-width
+    for start, end in emoji_zero_width:
+        for cp in range(start, end + 1):
+            try:
+                width_map[cp] = 0
+            except (ValueError, OverflowError):
+                continue
+
+    # Mark all regional indicators as single-width as they are usually paired
+    # providing a combined width of 2.
+    regional_indicators = (0x1F1E6, 0x1F1FF)  # Regional indicator symbols A-Z
+    start, end = regional_indicators
+    for cp in range(start, end + 1):
+        try:
+            width_map[cp] = 1
+        except (ValueError, OverflowError):
+            continue
+
+    # Process all assigned Unicode code points (Basic Multilingual Plane + Supplementary Planes)
+    # Range 0x0 to 0x10FFFF (the full Unicode range)
+    for block_start in range(0, 0x110000, 0x1000):
+        block_end = block_start + 0x1000
+        for cp in range(block_start, block_end):
+            try:
+                char = chr(cp)
+
+                # Skip if already processed
+                if cp in width_map:
+                    continue
+
+                # Check if the character is a combining mark
+                category = unicodedata.category(char)
+
+                # Combining marks, format characters, zero-width characters
+                if (category.startswith('M') or  # Mark (combining)
+                    (category == 'Cf' and cp not in (0x061C, 0x06DD, 0x070F, 0x180E, 0x200F, 0x202E, 0x2066, 0x2067, 0x2068, 0x2069)) or
+                    cp in (0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF)):  # Known zero-width characters
+                    width_map[cp] = 0
+                    continue
+
+                # Use East Asian Width property
+                eaw = unicodedata.east_asian_width(char)
+
+                if eaw in ('F', 'W'):  # Fullwidth or Wide
+                    width_map[cp] = 2
+                elif eaw in ('Na', 'H', 'N', 'A'):  # Narrow, Halfwidth, Neutral, Ambiguous
+                    width_map[cp] = 1
+                else:
+                    # Default to single-width for unknown
+                    width_map[cp] = 1
+
+            except (ValueError, OverflowError):
+                # Skip invalid code points
+                continue
+
+    # Process Emoji - generally double-width
+    # Ranges according to Unicode Emoji standard
+    emoji_ranges = [
+        (0x1F000, 0x1F02F),  # Mahjong Tiles
+        (0x1F0A0, 0x1F0FF),  # Playing Cards
+        (0x1F300, 0x1F5FF),  # Miscellaneous Symbols and Pictographs
+        (0x1F600, 0x1F64F),  # Emoticons
+        (0x1F680, 0x1F6FF),  # Transport and Map Symbols
+        (0x1F700, 0x1F77F),  # Alchemical Symbols
+        (0x1F780, 0x1F7FF),  # Geometric Shapes Extended
+        (0x1F800, 0x1F8FF),  # Supplemental Arrows-C
+        (0x1F900, 0x1F9FF),  # Supplemental Symbols and Pictographs
+        (0x1FA00, 0x1FA6F),  # Chess Symbols
+        (0x1FA70, 0x1FAFF),  # Symbols and Pictographs Extended-A
+    ]
+
+    for start, end in emoji_ranges:
+        for cp in range(start, end + 1):
+            if cp not in width_map or width_map[cp] != 0:  # Don't override zero-width
+                try:
+                    char = chr(cp)
+                    width_map[cp] = 2
+                except (ValueError, OverflowError):
+                    continue
+
+    # Optimize to create range tables
+    def ranges_optimize(width_data, target_width):
+        points = sorted([cp for cp, width in width_data.items() if width == target_width])
+        if not points:
+            return []
+
+        # Group consecutive code points into ranges
+        ranges = []
+        start = points[0]
+        prev = start
+
+        for cp in points[1:]:
+            if cp > prev + 1:
+                ranges.append((start, prev))
+                start = cp
+            prev = cp
+
+        # Add the last range
+        ranges.append((start, prev))
+        return ranges
+
+    # Extract ranges for each width
+    zero_width_ranges = ranges_optimize(width_map, 0)
+    double_width_ranges = ranges_optimize(width_map, 2)
+
+    # Get Unicode version information
+    unicode_version = unicodedata.unidata_version
+
+    # Generate C implementation file
+    with open(c_file, 'w') as f:
+        f.write(f"""\
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * ucs_width.c - Unicode character width lookup
+ *
+ * Auto-generated by gen_ucs_width.py
+ *
+ * Unicode Version: {unicode_version}
+ */
+
+#include <linux/types.h>
+#include <linux/array_size.h>
+#include <linux/bsearch.h>
+#include <linux/consolemap.h>
+
+struct interval {{
+	uint32_t first;
+	uint32_t last;
+}};
+
+/* Zero-width character ranges */
+static const struct interval zero_width_ranges[] = {{
+""")
+
+        for start, end in zero_width_ranges:
+            try:
+                start_char_desc = unicodedata.name(chr(start)) if start < 0x10000 else f"U+{start:05X}"
+                if start == end:
+                    comment = f"/* {start_char_desc} */"
+                else:
+                    end_char_desc = unicodedata.name(chr(end)) if end < 0x10000 else f"U+{end:05X}"
+                    comment = f"/* {start_char_desc} - {end_char_desc} */"
+            except ValueError:
+                if start == end:
+                    comment = f"/* U+{start:05X} */"
+                else:
+                    comment = f"/* U+{start:05X} - U+{end:05X} */"
+
+            f.write(f"\t{{ 0x{start:05X}, 0x{end:05X} }}, {comment}\n")
+
+        f.write("""\
+};
+
+/* Double-width character ranges */
+static const struct interval double_width_ranges[] = {
+""")
+
+        for start, end in double_width_ranges:
+            try:
+                start_char_desc = unicodedata.name(chr(start)) if start < 0x10000 else f"U+{start:05X}"
+                if start == end:
+                    comment = f"/* {start_char_desc} */"
+                else:
+                    end_char_desc = unicodedata.name(chr(end)) if end < 0x10000 else f"U+{end:05X}"
+                    comment = f"/* {start_char_desc} - {end_char_desc} */"
+            except ValueError:
+                if start == end:
+                    comment = f"/* U+{start:05X} */"
+                else:
+                    comment = f"/* U+{start:05X} - U+{end:05X} */"
+
+            f.write(f"\t{{ 0x{start:05X}, 0x{end:05X} }}, {comment}\n")
+
+        f.write("""\
+};
+
+
+static int ucs_cmp(const void *key, const void *element)
+{
+	uint32_t cp = *(uint32_t *)key;
+	const struct interval *e = element;
+
+	if (cp > e->last)
+		return 1;
+	if (cp < e->first)
+		return -1;
+	return 0;
+}
+
+static bool is_in_interval(uint32_t cp, const struct interval *intervals, size_t count)
+{
+	if (cp < intervals[0].first || cp > intervals[count - 1].last)
+		return false;
+
+	return __inline_bsearch(&cp, intervals, count,
+				sizeof(*intervals), ucs_cmp) != NULL;
+}
+
+/**
+ * Determine if a Unicode code point is zero-width.
+ *
+ * @param cp: Unicode code point (UCS-4)
+ * Return: true if the character is zero-width, false otherwise
+ */
+bool ucs_is_zero_width(uint32_t cp)
+{
+	return is_in_interval(cp, zero_width_ranges, ARRAY_SIZE(zero_width_ranges));
+}
+
+/**
+ * Determine if a Unicode code point is double-width.
+ *
+ * @param cp: Unicode code point (UCS-4)
+ * Return: true if the character is double-width, false otherwise
+ */
+bool ucs_is_double_width(uint32_t cp)
+{
+	return is_in_interval(cp, double_width_ranges, ARRAY_SIZE(double_width_ranges));
+}
+""")
+
+    # Print summary
+    zero_width_count = sum(end - start + 1 for start, end in zero_width_ranges)
+    double_width_count = sum(end - start + 1 for start, end in double_width_ranges)
+
+    print(f"Generated {c_file} with:")
+    print(f"- {len(zero_width_ranges)} zero-width ranges covering ~{zero_width_count} code points")
+    print(f"- {len(double_width_ranges)} double-width ranges covering ~{double_width_count} code points")
+
+if __name__ == "__main__":
+    generate_ucs_width()
-- 
2.49.0



* [PATCH 05/11] vt: update ucs_width.c using gen_ucs_width.py
  2025-04-10  1:13 [PATCH 00/11] vt: implement proper Unicode handling Nicolas Pitre
                   ` (3 preceding siblings ...)
  2025-04-10  1:13 ` [PATCH 04/11] vt: introduce gen_ucs_width.py to create ucs_width.c Nicolas Pitre
@ 2025-04-10  1:13 ` Nicolas Pitre
  2025-04-11  3:47   ` kernel test robot
  2025-04-10  1:13 ` [PATCH 06/11] vt: introduce gen_ucs_recompose.py to create ucs_recompose.c Nicolas Pitre
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 27+ messages in thread
From: Nicolas Pitre @ 2025-04-10  1:13 UTC
  To: Greg Kroah-Hartman, Jiri Slaby
  Cc: Nicolas Pitre, Dave Mielke, linux-serial, linux-kernel

From: Nicolas Pitre <npitre@baylibre.com>

This replaces ucs_width.c with the code generated by gen_ucs_width.py
providing comprehensive tables for double-width and zero-width Unicode
code points. Also make ucs_is_zero_width() effective.

Note: scripts/checkpatch.pl complains about "... exceeds 100 columns".
      Please ignore.

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
---
 drivers/tty/vt/ucs_width.c | 495 +++++++++++++++++++++++++++++++++++--
 include/linux/consolemap.h |   6 +-
 2 files changed, 475 insertions(+), 26 deletions(-)

diff --git a/drivers/tty/vt/ucs_width.c b/drivers/tty/vt/ucs_width.c
index 5f0bde30a1..47b22583bd 100644
--- a/drivers/tty/vt/ucs_width.c
+++ b/drivers/tty/vt/ucs_width.c
@@ -1,45 +1,498 @@
 // SPDX-License-Identifier: GPL-2.0
+/*
+ * ucs_width.c - Unicode character width lookup
+ *
+ * Auto-generated by gen_ucs_width.py
+ *
+ * Unicode Version: 16.0.0
+ */
 
 #include <linux/types.h>
 #include <linux/array_size.h>
 #include <linux/bsearch.h>
 #include <linux/consolemap.h>
 
-/* ucs_is_double_width() is based on the wcwidth() implementation by
- * Markus Kuhn -- 2007-05-26 (Unicode 5.0)
- * Latest version: https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
- */
-
 struct interval {
 	uint32_t first;
 	uint32_t last;
 };
 
-static int ucs_cmp(const void *key, const void *elt)
+/* Zero-width character ranges */
+static const struct interval zero_width_ranges[] = {
+	{ 0x000AD, 0x000AD }, /* SOFT HYPHEN */
+	{ 0x00300, 0x0036F }, /* COMBINING GRAVE ACCENT - COMBINING LATIN SMALL LETTER X */
+	{ 0x00483, 0x00489 }, /* COMBINING CYRILLIC TITLO - COMBINING CYRILLIC MILLIONS SIGN */
+	{ 0x00591, 0x005BD }, /* HEBREW ACCENT ETNAHTA - HEBREW POINT METEG */
+	{ 0x005BF, 0x005BF }, /* HEBREW POINT RAFE */
+	{ 0x005C1, 0x005C2 }, /* HEBREW POINT SHIN DOT - HEBREW POINT SIN DOT */
+	{ 0x005C4, 0x005C5 }, /* HEBREW MARK UPPER DOT - HEBREW MARK LOWER DOT */
+	{ 0x005C7, 0x005C7 }, /* HEBREW POINT QAMATS QATAN */
+	{ 0x00600, 0x00605 }, /* ARABIC NUMBER SIGN - ARABIC NUMBER MARK ABOVE */
+	{ 0x00610, 0x0061A }, /* ARABIC SIGN SALLALLAHOU ALAYHE WASSALLAM - ARABIC SMALL KASRA */
+	{ 0x0064B, 0x0065F }, /* ARABIC FATHATAN - ARABIC WAVY HAMZA BELOW */
+	{ 0x00670, 0x00670 }, /* ARABIC LETTER SUPERSCRIPT ALEF */
+	{ 0x006D6, 0x006DC }, /* ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA - ARABIC SMALL HIGH SEEN */
+	{ 0x006DF, 0x006E4 }, /* ARABIC SMALL HIGH ROUNDED ZERO - ARABIC SMALL HIGH MADDA */
+	{ 0x006E7, 0x006E8 }, /* ARABIC SMALL HIGH YEH - ARABIC SMALL HIGH NOON */
+	{ 0x006EA, 0x006ED }, /* ARABIC EMPTY CENTRE LOW STOP - ARABIC SMALL LOW MEEM */
+	{ 0x00711, 0x00711 }, /* SYRIAC LETTER SUPERSCRIPT ALAPH */
+	{ 0x00730, 0x0074A }, /* SYRIAC PTHAHA ABOVE - SYRIAC BARREKH */
+	{ 0x007A6, 0x007B0 }, /* THAANA ABAFILI - THAANA SUKUN */
+	{ 0x007EB, 0x007F3 }, /* NKO COMBINING SHORT HIGH TONE - NKO COMBINING DOUBLE DOT ABOVE */
+	{ 0x007FD, 0x007FD }, /* NKO DANTAYALAN */
+	{ 0x00816, 0x00819 }, /* SAMARITAN MARK IN - SAMARITAN MARK DAGESH */
+	{ 0x0081B, 0x00823 }, /* SAMARITAN MARK EPENTHETIC YUT - SAMARITAN VOWEL SIGN A */
+	{ 0x00825, 0x00827 }, /* SAMARITAN VOWEL SIGN SHORT A - SAMARITAN VOWEL SIGN U */
+	{ 0x00829, 0x0082D }, /* SAMARITAN VOWEL SIGN LONG I - SAMARITAN MARK NEQUDAA */
+	{ 0x00859, 0x0085B }, /* MANDAIC AFFRICATION MARK - MANDAIC GEMINATION MARK */
+	{ 0x00890, 0x00891 }, /* ARABIC POUND MARK ABOVE - ARABIC PIASTRE MARK ABOVE */
+	{ 0x00897, 0x0089F }, /* ARABIC PEPET - ARABIC HALF MADDA OVER MADDA */
+	{ 0x008CA, 0x00903 }, /* ARABIC SMALL HIGH FARSI YEH - DEVANAGARI SIGN VISARGA */
+	{ 0x0093A, 0x0093C }, /* DEVANAGARI VOWEL SIGN OE - DEVANAGARI SIGN NUKTA */
+	{ 0x0093E, 0x0094F }, /* DEVANAGARI VOWEL SIGN AA - DEVANAGARI VOWEL SIGN AW */
+	{ 0x00951, 0x00957 }, /* DEVANAGARI STRESS SIGN UDATTA - DEVANAGARI VOWEL SIGN UUE */
+	{ 0x00962, 0x00963 }, /* DEVANAGARI VOWEL SIGN VOCALIC L - DEVANAGARI VOWEL SIGN VOCALIC LL */
+	{ 0x00981, 0x00983 }, /* BENGALI SIGN CANDRABINDU - BENGALI SIGN VISARGA */
+	{ 0x009BC, 0x009BC }, /* BENGALI SIGN NUKTA */
+	{ 0x009BE, 0x009C4 }, /* BENGALI VOWEL SIGN AA - BENGALI VOWEL SIGN VOCALIC RR */
+	{ 0x009C7, 0x009C8 }, /* BENGALI VOWEL SIGN E - BENGALI VOWEL SIGN AI */
+	{ 0x009CB, 0x009CD }, /* BENGALI VOWEL SIGN O - BENGALI SIGN VIRAMA */
+	{ 0x009D7, 0x009D7 }, /* BENGALI AU LENGTH MARK */
+	{ 0x009E2, 0x009E3 }, /* BENGALI VOWEL SIGN VOCALIC L - BENGALI VOWEL SIGN VOCALIC LL */
+	{ 0x009FE, 0x009FE }, /* BENGALI SANDHI MARK */
+	{ 0x00A01, 0x00A03 }, /* GURMUKHI SIGN ADAK BINDI - GURMUKHI SIGN VISARGA */
+	{ 0x00A3C, 0x00A3C }, /* GURMUKHI SIGN NUKTA */
+	{ 0x00A3E, 0x00A42 }, /* GURMUKHI VOWEL SIGN AA - GURMUKHI VOWEL SIGN UU */
+	{ 0x00A47, 0x00A48 }, /* GURMUKHI VOWEL SIGN EE - GURMUKHI VOWEL SIGN AI */
+	{ 0x00A4B, 0x00A4D }, /* GURMUKHI VOWEL SIGN OO - GURMUKHI SIGN VIRAMA */
+	{ 0x00A51, 0x00A51 }, /* GURMUKHI SIGN UDAAT */
+	{ 0x00A70, 0x00A71 }, /* GURMUKHI TIPPI - GURMUKHI ADDAK */
+	{ 0x00A75, 0x00A75 }, /* GURMUKHI SIGN YAKASH */
+	{ 0x00A81, 0x00A83 }, /* GUJARATI SIGN CANDRABINDU - GUJARATI SIGN VISARGA */
+	{ 0x00ABC, 0x00ABC }, /* GUJARATI SIGN NUKTA */
+	{ 0x00ABE, 0x00AC5 }, /* GUJARATI VOWEL SIGN AA - GUJARATI VOWEL SIGN CANDRA E */
+	{ 0x00AC7, 0x00AC9 }, /* GUJARATI VOWEL SIGN E - GUJARATI VOWEL SIGN CANDRA O */
+	{ 0x00ACB, 0x00ACD }, /* GUJARATI VOWEL SIGN O - GUJARATI SIGN VIRAMA */
+	{ 0x00AE2, 0x00AE3 }, /* GUJARATI VOWEL SIGN VOCALIC L - GUJARATI VOWEL SIGN VOCALIC LL */
+	{ 0x00AFA, 0x00AFF }, /* GUJARATI SIGN SUKUN - GUJARATI SIGN TWO-CIRCLE NUKTA ABOVE */
+	{ 0x00B01, 0x00B03 }, /* ORIYA SIGN CANDRABINDU - ORIYA SIGN VISARGA */
+	{ 0x00B3C, 0x00B3C }, /* ORIYA SIGN NUKTA */
+	{ 0x00B3E, 0x00B44 }, /* ORIYA VOWEL SIGN AA - ORIYA VOWEL SIGN VOCALIC RR */
+	{ 0x00B47, 0x00B48 }, /* ORIYA VOWEL SIGN E - ORIYA VOWEL SIGN AI */
+	{ 0x00B4B, 0x00B4D }, /* ORIYA VOWEL SIGN O - ORIYA SIGN VIRAMA */
+	{ 0x00B55, 0x00B57 }, /* ORIYA SIGN OVERLINE - ORIYA AU LENGTH MARK */
+	{ 0x00B62, 0x00B63 }, /* ORIYA VOWEL SIGN VOCALIC L - ORIYA VOWEL SIGN VOCALIC LL */
+	{ 0x00B82, 0x00B82 }, /* TAMIL SIGN ANUSVARA */
+	{ 0x00BBE, 0x00BC2 }, /* TAMIL VOWEL SIGN AA - TAMIL VOWEL SIGN UU */
+	{ 0x00BC6, 0x00BC8 }, /* TAMIL VOWEL SIGN E - TAMIL VOWEL SIGN AI */
+	{ 0x00BCA, 0x00BCD }, /* TAMIL VOWEL SIGN O - TAMIL SIGN VIRAMA */
+	{ 0x00BD7, 0x00BD7 }, /* TAMIL AU LENGTH MARK */
+	{ 0x00C00, 0x00C04 }, /* TELUGU SIGN COMBINING CANDRABINDU ABOVE - TELUGU SIGN COMBINING ANUSVARA ABOVE */
+	{ 0x00C3C, 0x00C3C }, /* TELUGU SIGN NUKTA */
+	{ 0x00C3E, 0x00C44 }, /* TELUGU VOWEL SIGN AA - TELUGU VOWEL SIGN VOCALIC RR */
+	{ 0x00C46, 0x00C48 }, /* TELUGU VOWEL SIGN E - TELUGU VOWEL SIGN AI */
+	{ 0x00C4A, 0x00C4D }, /* TELUGU VOWEL SIGN O - TELUGU SIGN VIRAMA */
+	{ 0x00C55, 0x00C56 }, /* TELUGU LENGTH MARK - TELUGU AI LENGTH MARK */
+	{ 0x00C62, 0x00C63 }, /* TELUGU VOWEL SIGN VOCALIC L - TELUGU VOWEL SIGN VOCALIC LL */
+	{ 0x00C81, 0x00C83 }, /* KANNADA SIGN CANDRABINDU - KANNADA SIGN VISARGA */
+	{ 0x00CBC, 0x00CBC }, /* KANNADA SIGN NUKTA */
+	{ 0x00CBE, 0x00CC4 }, /* KANNADA VOWEL SIGN AA - KANNADA VOWEL SIGN VOCALIC RR */
+	{ 0x00CC6, 0x00CC8 }, /* KANNADA VOWEL SIGN E - KANNADA VOWEL SIGN AI */
+	{ 0x00CCA, 0x00CCD }, /* KANNADA VOWEL SIGN O - KANNADA SIGN VIRAMA */
+	{ 0x00CD5, 0x00CD6 }, /* KANNADA LENGTH MARK - KANNADA AI LENGTH MARK */
+	{ 0x00CE2, 0x00CE3 }, /* KANNADA VOWEL SIGN VOCALIC L - KANNADA VOWEL SIGN VOCALIC LL */
+	{ 0x00CF3, 0x00CF3 }, /* KANNADA SIGN COMBINING ANUSVARA ABOVE RIGHT */
+	{ 0x00D00, 0x00D03 }, /* MALAYALAM SIGN COMBINING ANUSVARA ABOVE - MALAYALAM SIGN VISARGA */
+	{ 0x00D3B, 0x00D3C }, /* MALAYALAM SIGN VERTICAL BAR VIRAMA - MALAYALAM SIGN CIRCULAR VIRAMA */
+	{ 0x00D3E, 0x00D44 }, /* MALAYALAM VOWEL SIGN AA - MALAYALAM VOWEL SIGN VOCALIC RR */
+	{ 0x00D46, 0x00D48 }, /* MALAYALAM VOWEL SIGN E - MALAYALAM VOWEL SIGN AI */
+	{ 0x00D4A, 0x00D4D }, /* MALAYALAM VOWEL SIGN O - MALAYALAM SIGN VIRAMA */
+	{ 0x00D57, 0x00D57 }, /* MALAYALAM AU LENGTH MARK */
+	{ 0x00D62, 0x00D63 }, /* MALAYALAM VOWEL SIGN VOCALIC L - MALAYALAM VOWEL SIGN VOCALIC LL */
+	{ 0x00D81, 0x00D83 }, /* SINHALA SIGN CANDRABINDU - SINHALA SIGN VISARGAYA */
+	{ 0x00DCA, 0x00DCA }, /* SINHALA SIGN AL-LAKUNA */
+	{ 0x00DCF, 0x00DD4 }, /* SINHALA VOWEL SIGN AELA-PILLA - SINHALA VOWEL SIGN KETTI PAA-PILLA */
+	{ 0x00DD6, 0x00DD6 }, /* SINHALA VOWEL SIGN DIGA PAA-PILLA */
+	{ 0x00DD8, 0x00DDF }, /* SINHALA VOWEL SIGN GAETTA-PILLA - SINHALA VOWEL SIGN GAYANUKITTA */
+	{ 0x00DF2, 0x00DF3 }, /* SINHALA VOWEL SIGN DIGA GAETTA-PILLA - SINHALA VOWEL SIGN DIGA GAYANUKITTA */
+	{ 0x00E31, 0x00E31 }, /* THAI CHARACTER MAI HAN-AKAT */
+	{ 0x00E34, 0x00E3A }, /* THAI CHARACTER SARA I - THAI CHARACTER PHINTHU */
+	{ 0x00E47, 0x00E4E }, /* THAI CHARACTER MAITAIKHU - THAI CHARACTER YAMAKKAN */
+	{ 0x00EB1, 0x00EB1 }, /* LAO VOWEL SIGN MAI KAN */
+	{ 0x00EB4, 0x00EBC }, /* LAO VOWEL SIGN I - LAO SEMIVOWEL SIGN LO */
+	{ 0x00EC8, 0x00ECE }, /* LAO TONE MAI EK - LAO YAMAKKAN */
+	{ 0x00F18, 0x00F19 }, /* TIBETAN ASTROLOGICAL SIGN -KHYUD PA - TIBETAN ASTROLOGICAL SIGN SDONG TSHUGS */
+	{ 0x00F35, 0x00F35 }, /* TIBETAN MARK NGAS BZUNG NYI ZLA */
+	{ 0x00F37, 0x00F37 }, /* TIBETAN MARK NGAS BZUNG SGOR RTAGS */
+	{ 0x00F39, 0x00F39 }, /* TIBETAN MARK TSA -PHRU */
+	{ 0x00F3E, 0x00F3F }, /* TIBETAN SIGN YAR TSHES - TIBETAN SIGN MAR TSHES */
+	{ 0x00F71, 0x00F84 }, /* TIBETAN VOWEL SIGN AA - TIBETAN MARK HALANTA */
+	{ 0x00F86, 0x00F87 }, /* TIBETAN SIGN LCI RTAGS - TIBETAN SIGN YANG RTAGS */
+	{ 0x00F8D, 0x00F97 }, /* TIBETAN SUBJOINED SIGN LCE TSA CAN - TIBETAN SUBJOINED LETTER JA */
+	{ 0x00F99, 0x00FBC }, /* TIBETAN SUBJOINED LETTER NYA - TIBETAN SUBJOINED LETTER FIXED-FORM RA */
+	{ 0x00FC6, 0x00FC6 }, /* TIBETAN SYMBOL PADMA GDAN */
+	{ 0x0102B, 0x0103E }, /* MYANMAR VOWEL SIGN TALL AA - MYANMAR CONSONANT SIGN MEDIAL HA */
+	{ 0x01056, 0x01059 }, /* MYANMAR VOWEL SIGN VOCALIC R - MYANMAR VOWEL SIGN VOCALIC LL */
+	{ 0x0105E, 0x01060 }, /* MYANMAR CONSONANT SIGN MON MEDIAL NA - MYANMAR CONSONANT SIGN MON MEDIAL LA */
+	{ 0x01062, 0x01064 }, /* MYANMAR VOWEL SIGN SGAW KAREN EU - MYANMAR TONE MARK SGAW KAREN KE PHO */
+	{ 0x01067, 0x0106D }, /* MYANMAR VOWEL SIGN WESTERN PWO KAREN EU - MYANMAR SIGN WESTERN PWO KAREN TONE-5 */
+	{ 0x01071, 0x01074 }, /* MYANMAR VOWEL SIGN GEBA KAREN I - MYANMAR VOWEL SIGN KAYAH EE */
+	{ 0x01082, 0x0108D }, /* MYANMAR CONSONANT SIGN SHAN MEDIAL WA - MYANMAR SIGN SHAN COUNCIL EMPHATIC TONE */
+	{ 0x0108F, 0x0108F }, /* MYANMAR SIGN RUMAI PALAUNG TONE-5 */
+	{ 0x0109A, 0x0109D }, /* MYANMAR SIGN KHAMTI TONE-1 - MYANMAR VOWEL SIGN AITON AI */
+	{ 0x0135D, 0x0135F }, /* ETHIOPIC COMBINING GEMINATION AND VOWEL LENGTH MARK - ETHIOPIC COMBINING GEMINATION MARK */
+	{ 0x01712, 0x01715 }, /* TAGALOG VOWEL SIGN I - TAGALOG SIGN PAMUDPOD */
+	{ 0x01732, 0x01734 }, /* HANUNOO VOWEL SIGN I - HANUNOO SIGN PAMUDPOD */
+	{ 0x01752, 0x01753 }, /* BUHID VOWEL SIGN I - BUHID VOWEL SIGN U */
+	{ 0x01772, 0x01773 }, /* TAGBANWA VOWEL SIGN I - TAGBANWA VOWEL SIGN U */
+	{ 0x017B4, 0x017D3 }, /* KHMER VOWEL INHERENT AQ - KHMER SIGN BATHAMASAT */
+	{ 0x017DD, 0x017DD }, /* KHMER SIGN ATTHACAN */
+	{ 0x0180B, 0x0180D }, /* MONGOLIAN FREE VARIATION SELECTOR ONE - MONGOLIAN FREE VARIATION SELECTOR THREE */
+	{ 0x0180F, 0x0180F }, /* MONGOLIAN FREE VARIATION SELECTOR FOUR */
+	{ 0x01885, 0x01886 }, /* MONGOLIAN LETTER ALI GALI BALUDA - MONGOLIAN LETTER ALI GALI THREE BALUDA */
+	{ 0x018A9, 0x018A9 }, /* MONGOLIAN LETTER ALI GALI DAGALGA */
+	{ 0x01920, 0x0192B }, /* LIMBU VOWEL SIGN A - LIMBU SUBJOINED LETTER WA */
+	{ 0x01930, 0x0193B }, /* LIMBU SMALL LETTER KA - LIMBU SIGN SA-I */
+	{ 0x01A17, 0x01A1B }, /* BUGINESE VOWEL SIGN I - BUGINESE VOWEL SIGN AE */
+	{ 0x01A55, 0x01A5E }, /* TAI THAM CONSONANT SIGN MEDIAL RA - TAI THAM CONSONANT SIGN SA */
+	{ 0x01A60, 0x01A7C }, /* TAI THAM SIGN SAKOT - TAI THAM SIGN KHUEN-LUE KARAN */
+	{ 0x01A7F, 0x01A7F }, /* TAI THAM COMBINING CRYPTOGRAMMIC DOT */
+	{ 0x01AB0, 0x01ACE }, /* COMBINING DOUBLED CIRCUMFLEX ACCENT - COMBINING LATIN SMALL LETTER INSULAR T */
+	{ 0x01B00, 0x01B04 }, /* BALINESE SIGN ULU RICEM - BALINESE SIGN BISAH */
+	{ 0x01B34, 0x01B44 }, /* BALINESE SIGN REREKAN - BALINESE ADEG ADEG */
+	{ 0x01B6B, 0x01B73 }, /* BALINESE MUSICAL SYMBOL COMBINING TEGEH - BALINESE MUSICAL SYMBOL COMBINING GONG */
+	{ 0x01B80, 0x01B82 }, /* SUNDANESE SIGN PANYECEK - SUNDANESE SIGN PANGWISAD */
+	{ 0x01BA1, 0x01BAD }, /* SUNDANESE CONSONANT SIGN PAMINGKAL - SUNDANESE CONSONANT SIGN PASANGAN WA */
+	{ 0x01BE6, 0x01BF3 }, /* BATAK SIGN TOMPI - BATAK PANONGONAN */
+	{ 0x01C24, 0x01C37 }, /* LEPCHA SUBJOINED LETTER YA - LEPCHA SIGN NUKTA */
+	{ 0x01CD0, 0x01CD2 }, /* VEDIC TONE KARSHANA - VEDIC TONE PRENKHA */
+	{ 0x01CD4, 0x01CE8 }, /* VEDIC SIGN YAJURVEDIC MIDLINE SVARITA - VEDIC SIGN VISARGA ANUDATTA WITH TAIL */
+	{ 0x01CED, 0x01CED }, /* VEDIC SIGN TIRYAK */
+	{ 0x01CF4, 0x01CF4 }, /* VEDIC TONE CANDRA ABOVE */
+	{ 0x01CF7, 0x01CF9 }, /* VEDIC SIGN ATIKRAMA - VEDIC TONE DOUBLE RING ABOVE */
+	{ 0x01DC0, 0x01DFF }, /* COMBINING DOTTED GRAVE ACCENT - COMBINING RIGHT ARROWHEAD AND DOWN ARROWHEAD BELOW */
+	{ 0x0200B, 0x0200E }, /* ZERO WIDTH SPACE - LEFT-TO-RIGHT MARK */
+	{ 0x0202A, 0x0202D }, /* LEFT-TO-RIGHT EMBEDDING - LEFT-TO-RIGHT OVERRIDE */
+	{ 0x02060, 0x02064 }, /* WORD JOINER - INVISIBLE PLUS */
+	{ 0x0206A, 0x0206F }, /* INHIBIT SYMMETRIC SWAPPING - NOMINAL DIGIT SHAPES */
+	{ 0x020D0, 0x020F0 }, /* COMBINING LEFT HARPOON ABOVE - COMBINING ASTERISK ABOVE */
+	{ 0x02640, 0x02640 }, /* FEMALE SIGN */
+	{ 0x02642, 0x02642 }, /* MALE SIGN */
+	{ 0x026A7, 0x026A7 }, /* MALE WITH STROKE AND MALE AND FEMALE SIGN */
+	{ 0x02CEF, 0x02CF1 }, /* COPTIC COMBINING NI ABOVE - COPTIC COMBINING SPIRITUS LENIS */
+	{ 0x02D7F, 0x02D7F }, /* TIFINAGH CONSONANT JOINER */
+	{ 0x02DE0, 0x02DFF }, /* COMBINING CYRILLIC LETTER BE - COMBINING CYRILLIC LETTER IOTIFIED BIG YUS */
+	{ 0x0302A, 0x0302F }, /* IDEOGRAPHIC LEVEL TONE MARK - HANGUL DOUBLE DOT TONE MARK */
+	{ 0x03099, 0x0309A }, /* COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK - COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK */
+	{ 0x0A66F, 0x0A672 }, /* COMBINING CYRILLIC VZMET - COMBINING CYRILLIC THOUSAND MILLIONS SIGN */
+	{ 0x0A674, 0x0A67D }, /* COMBINING CYRILLIC LETTER UKRAINIAN IE - COMBINING CYRILLIC PAYEROK */
+	{ 0x0A69E, 0x0A69F }, /* COMBINING CYRILLIC LETTER EF - COMBINING CYRILLIC LETTER IOTIFIED E */
+	{ 0x0A6F0, 0x0A6F1 }, /* BAMUM COMBINING MARK KOQNDON - BAMUM COMBINING MARK TUKWENTIS */
+	{ 0x0A802, 0x0A802 }, /* SYLOTI NAGRI SIGN DVISVARA */
+	{ 0x0A806, 0x0A806 }, /* SYLOTI NAGRI SIGN HASANTA */
+	{ 0x0A80B, 0x0A80B }, /* SYLOTI NAGRI SIGN ANUSVARA */
+	{ 0x0A823, 0x0A827 }, /* SYLOTI NAGRI VOWEL SIGN A - SYLOTI NAGRI VOWEL SIGN OO */
+	{ 0x0A82C, 0x0A82C }, /* SYLOTI NAGRI SIGN ALTERNATE HASANTA */
+	{ 0x0A880, 0x0A881 }, /* SAURASHTRA SIGN ANUSVARA - SAURASHTRA SIGN VISARGA */
+	{ 0x0A8B4, 0x0A8C5 }, /* SAURASHTRA CONSONANT SIGN HAARU - SAURASHTRA SIGN CANDRABINDU */
+	{ 0x0A8E0, 0x0A8F1 }, /* COMBINING DEVANAGARI DIGIT ZERO - COMBINING DEVANAGARI SIGN AVAGRAHA */
+	{ 0x0A8FF, 0x0A8FF }, /* DEVANAGARI VOWEL SIGN AY */
+	{ 0x0A926, 0x0A92D }, /* KAYAH LI VOWEL UE - KAYAH LI TONE CALYA PLOPHU */
+	{ 0x0A947, 0x0A953 }, /* REJANG VOWEL SIGN I - REJANG VIRAMA */
+	{ 0x0A980, 0x0A983 }, /* JAVANESE SIGN PANYANGGA - JAVANESE SIGN WIGNYAN */
+	{ 0x0A9B3, 0x0A9C0 }, /* JAVANESE SIGN CECAK TELU - JAVANESE PANGKON */
+	{ 0x0A9E5, 0x0A9E5 }, /* MYANMAR SIGN SHAN SAW */
+	{ 0x0AA29, 0x0AA36 }, /* CHAM VOWEL SIGN AA - CHAM CONSONANT SIGN WA */
+	{ 0x0AA43, 0x0AA43 }, /* CHAM CONSONANT SIGN FINAL NG */
+	{ 0x0AA4C, 0x0AA4D }, /* CHAM CONSONANT SIGN FINAL M - CHAM CONSONANT SIGN FINAL H */
+	{ 0x0AA7B, 0x0AA7D }, /* MYANMAR SIGN PAO KAREN TONE - MYANMAR SIGN TAI LAING TONE-5 */
+	{ 0x0AAB0, 0x0AAB0 }, /* TAI VIET MAI KANG */
+	{ 0x0AAB2, 0x0AAB4 }, /* TAI VIET VOWEL I - TAI VIET VOWEL U */
+	{ 0x0AAB7, 0x0AAB8 }, /* TAI VIET MAI KHIT - TAI VIET VOWEL IA */
+	{ 0x0AABE, 0x0AABF }, /* TAI VIET VOWEL AM - TAI VIET TONE MAI EK */
+	{ 0x0AAC1, 0x0AAC1 }, /* TAI VIET TONE MAI THO */
+	{ 0x0AAEB, 0x0AAEF }, /* MEETEI MAYEK VOWEL SIGN II - MEETEI MAYEK VOWEL SIGN AAU */
+	{ 0x0AAF5, 0x0AAF6 }, /* MEETEI MAYEK VOWEL SIGN VISARGA - MEETEI MAYEK VIRAMA */
+	{ 0x0ABE3, 0x0ABEA }, /* MEETEI MAYEK VOWEL SIGN ONAP - MEETEI MAYEK VOWEL SIGN NUNG */
+	{ 0x0ABEC, 0x0ABED }, /* MEETEI MAYEK LUM IYEK - MEETEI MAYEK APUN IYEK */
+	{ 0x0FB1E, 0x0FB1E }, /* HEBREW POINT JUDEO-SPANISH VARIKA */
+	{ 0x0FE00, 0x0FE0F }, /* VARIATION SELECTOR-1 - VARIATION SELECTOR-16 */
+	{ 0x0FE20, 0x0FE2F }, /* COMBINING LIGATURE LEFT HALF - COMBINING CYRILLIC TITLO RIGHT HALF */
+	{ 0x0FEFF, 0x0FEFF }, /* ZERO WIDTH NO-BREAK SPACE */
+	{ 0x0FFF9, 0x0FFFB }, /* INTERLINEAR ANNOTATION ANCHOR - INTERLINEAR ANNOTATION TERMINATOR */
+	{ 0x101FD, 0x101FD }, /* U+101FD */
+	{ 0x102E0, 0x102E0 }, /* U+102E0 */
+	{ 0x10376, 0x1037A }, /* U+10376 - U+1037A */
+	{ 0x10A01, 0x10A03 }, /* U+10A01 - U+10A03 */
+	{ 0x10A05, 0x10A06 }, /* U+10A05 - U+10A06 */
+	{ 0x10A0C, 0x10A0F }, /* U+10A0C - U+10A0F */
+	{ 0x10A38, 0x10A3A }, /* U+10A38 - U+10A3A */
+	{ 0x10A3F, 0x10A3F }, /* U+10A3F */
+	{ 0x10AE5, 0x10AE6 }, /* U+10AE5 - U+10AE6 */
+	{ 0x10D24, 0x10D27 }, /* U+10D24 - U+10D27 */
+	{ 0x10D69, 0x10D6D }, /* U+10D69 - U+10D6D */
+	{ 0x10EAB, 0x10EAC }, /* U+10EAB - U+10EAC */
+	{ 0x10EFC, 0x10EFF }, /* U+10EFC - U+10EFF */
+	{ 0x10F46, 0x10F50 }, /* U+10F46 - U+10F50 */
+	{ 0x10F82, 0x10F85 }, /* U+10F82 - U+10F85 */
+	{ 0x11000, 0x11002 }, /* U+11000 - U+11002 */
+	{ 0x11038, 0x11046 }, /* U+11038 - U+11046 */
+	{ 0x11070, 0x11070 }, /* U+11070 */
+	{ 0x11073, 0x11074 }, /* U+11073 - U+11074 */
+	{ 0x1107F, 0x11082 }, /* U+1107F - U+11082 */
+	{ 0x110B0, 0x110BA }, /* U+110B0 - U+110BA */
+	{ 0x110BD, 0x110BD }, /* U+110BD */
+	{ 0x110C2, 0x110C2 }, /* U+110C2 */
+	{ 0x110CD, 0x110CD }, /* U+110CD */
+	{ 0x11100, 0x11102 }, /* U+11100 - U+11102 */
+	{ 0x11127, 0x11134 }, /* U+11127 - U+11134 */
+	{ 0x11145, 0x11146 }, /* U+11145 - U+11146 */
+	{ 0x11173, 0x11173 }, /* U+11173 */
+	{ 0x11180, 0x11182 }, /* U+11180 - U+11182 */
+	{ 0x111B3, 0x111C0 }, /* U+111B3 - U+111C0 */
+	{ 0x111C9, 0x111CC }, /* U+111C9 - U+111CC */
+	{ 0x111CE, 0x111CF }, /* U+111CE - U+111CF */
+	{ 0x1122C, 0x11237 }, /* U+1122C - U+11237 */
+	{ 0x1123E, 0x1123E }, /* U+1123E */
+	{ 0x11241, 0x11241 }, /* U+11241 */
+	{ 0x112DF, 0x112EA }, /* U+112DF - U+112EA */
+	{ 0x11300, 0x11303 }, /* U+11300 - U+11303 */
+	{ 0x1133B, 0x1133C }, /* U+1133B - U+1133C */
+	{ 0x1133E, 0x11344 }, /* U+1133E - U+11344 */
+	{ 0x11347, 0x11348 }, /* U+11347 - U+11348 */
+	{ 0x1134B, 0x1134D }, /* U+1134B - U+1134D */
+	{ 0x11357, 0x11357 }, /* U+11357 */
+	{ 0x11362, 0x11363 }, /* U+11362 - U+11363 */
+	{ 0x11366, 0x1136C }, /* U+11366 - U+1136C */
+	{ 0x11370, 0x11374 }, /* U+11370 - U+11374 */
+	{ 0x113B8, 0x113C0 }, /* U+113B8 - U+113C0 */
+	{ 0x113C2, 0x113C2 }, /* U+113C2 */
+	{ 0x113C5, 0x113C5 }, /* U+113C5 */
+	{ 0x113C7, 0x113CA }, /* U+113C7 - U+113CA */
+	{ 0x113CC, 0x113D0 }, /* U+113CC - U+113D0 */
+	{ 0x113D2, 0x113D2 }, /* U+113D2 */
+	{ 0x113E1, 0x113E2 }, /* U+113E1 - U+113E2 */
+	{ 0x11435, 0x11446 }, /* U+11435 - U+11446 */
+	{ 0x1145E, 0x1145E }, /* U+1145E */
+	{ 0x114B0, 0x114C3 }, /* U+114B0 - U+114C3 */
+	{ 0x115AF, 0x115B5 }, /* U+115AF - U+115B5 */
+	{ 0x115B8, 0x115C0 }, /* U+115B8 - U+115C0 */
+	{ 0x115DC, 0x115DD }, /* U+115DC - U+115DD */
+	{ 0x11630, 0x11640 }, /* U+11630 - U+11640 */
+	{ 0x116AB, 0x116B7 }, /* U+116AB - U+116B7 */
+	{ 0x1171D, 0x1172B }, /* U+1171D - U+1172B */
+	{ 0x1182C, 0x1183A }, /* U+1182C - U+1183A */
+	{ 0x11930, 0x11935 }, /* U+11930 - U+11935 */
+	{ 0x11937, 0x11938 }, /* U+11937 - U+11938 */
+	{ 0x1193B, 0x1193E }, /* U+1193B - U+1193E */
+	{ 0x11940, 0x11940 }, /* U+11940 */
+	{ 0x11942, 0x11943 }, /* U+11942 - U+11943 */
+	{ 0x119D1, 0x119D7 }, /* U+119D1 - U+119D7 */
+	{ 0x119DA, 0x119E0 }, /* U+119DA - U+119E0 */
+	{ 0x119E4, 0x119E4 }, /* U+119E4 */
+	{ 0x11A01, 0x11A0A }, /* U+11A01 - U+11A0A */
+	{ 0x11A33, 0x11A39 }, /* U+11A33 - U+11A39 */
+	{ 0x11A3B, 0x11A3E }, /* U+11A3B - U+11A3E */
+	{ 0x11A47, 0x11A47 }, /* U+11A47 */
+	{ 0x11A51, 0x11A5B }, /* U+11A51 - U+11A5B */
+	{ 0x11A8A, 0x11A99 }, /* U+11A8A - U+11A99 */
+	{ 0x11C2F, 0x11C36 }, /* U+11C2F - U+11C36 */
+	{ 0x11C38, 0x11C3F }, /* U+11C38 - U+11C3F */
+	{ 0x11C92, 0x11CA7 }, /* U+11C92 - U+11CA7 */
+	{ 0x11CA9, 0x11CB6 }, /* U+11CA9 - U+11CB6 */
+	{ 0x11D31, 0x11D36 }, /* U+11D31 - U+11D36 */
+	{ 0x11D3A, 0x11D3A }, /* U+11D3A */
+	{ 0x11D3C, 0x11D3D }, /* U+11D3C - U+11D3D */
+	{ 0x11D3F, 0x11D45 }, /* U+11D3F - U+11D45 */
+	{ 0x11D47, 0x11D47 }, /* U+11D47 */
+	{ 0x11D8A, 0x11D8E }, /* U+11D8A - U+11D8E */
+	{ 0x11D90, 0x11D91 }, /* U+11D90 - U+11D91 */
+	{ 0x11D93, 0x11D97 }, /* U+11D93 - U+11D97 */
+	{ 0x11EF3, 0x11EF6 }, /* U+11EF3 - U+11EF6 */
+	{ 0x11F00, 0x11F01 }, /* U+11F00 - U+11F01 */
+	{ 0x11F03, 0x11F03 }, /* U+11F03 */
+	{ 0x11F34, 0x11F3A }, /* U+11F34 - U+11F3A */
+	{ 0x11F3E, 0x11F42 }, /* U+11F3E - U+11F42 */
+	{ 0x11F5A, 0x11F5A }, /* U+11F5A */
+	{ 0x13430, 0x13440 }, /* U+13430 - U+13440 */
+	{ 0x13447, 0x13455 }, /* U+13447 - U+13455 */
+	{ 0x1611E, 0x1612F }, /* U+1611E - U+1612F */
+	{ 0x16AF0, 0x16AF4 }, /* U+16AF0 - U+16AF4 */
+	{ 0x16B30, 0x16B36 }, /* U+16B30 - U+16B36 */
+	{ 0x16F4F, 0x16F4F }, /* U+16F4F */
+	{ 0x16F51, 0x16F87 }, /* U+16F51 - U+16F87 */
+	{ 0x16F8F, 0x16F92 }, /* U+16F8F - U+16F92 */
+	{ 0x16FE4, 0x16FE4 }, /* U+16FE4 */
+	{ 0x16FF0, 0x16FF1 }, /* U+16FF0 - U+16FF1 */
+	{ 0x1BC9D, 0x1BC9E }, /* U+1BC9D - U+1BC9E */
+	{ 0x1BCA0, 0x1BCA3 }, /* U+1BCA0 - U+1BCA3 */
+	{ 0x1CF00, 0x1CF2D }, /* U+1CF00 - U+1CF2D */
+	{ 0x1CF30, 0x1CF46 }, /* U+1CF30 - U+1CF46 */
+	{ 0x1D165, 0x1D169 }, /* U+1D165 - U+1D169 */
+	{ 0x1D16D, 0x1D182 }, /* U+1D16D - U+1D182 */
+	{ 0x1D185, 0x1D18B }, /* U+1D185 - U+1D18B */
+	{ 0x1D1AA, 0x1D1AD }, /* U+1D1AA - U+1D1AD */
+	{ 0x1D242, 0x1D244 }, /* U+1D242 - U+1D244 */
+	{ 0x1DA00, 0x1DA36 }, /* U+1DA00 - U+1DA36 */
+	{ 0x1DA3B, 0x1DA6C }, /* U+1DA3B - U+1DA6C */
+	{ 0x1DA75, 0x1DA75 }, /* U+1DA75 */
+	{ 0x1DA84, 0x1DA84 }, /* U+1DA84 */
+	{ 0x1DA9B, 0x1DA9F }, /* U+1DA9B - U+1DA9F */
+	{ 0x1DAA1, 0x1DAAF }, /* U+1DAA1 - U+1DAAF */
+	{ 0x1E000, 0x1E006 }, /* U+1E000 - U+1E006 */
+	{ 0x1E008, 0x1E018 }, /* U+1E008 - U+1E018 */
+	{ 0x1E01B, 0x1E021 }, /* U+1E01B - U+1E021 */
+	{ 0x1E023, 0x1E024 }, /* U+1E023 - U+1E024 */
+	{ 0x1E026, 0x1E02A }, /* U+1E026 - U+1E02A */
+	{ 0x1E08F, 0x1E08F }, /* U+1E08F */
+	{ 0x1E130, 0x1E136 }, /* U+1E130 - U+1E136 */
+	{ 0x1E2AE, 0x1E2AE }, /* U+1E2AE */
+	{ 0x1E2EC, 0x1E2EF }, /* U+1E2EC - U+1E2EF */
+	{ 0x1E4EC, 0x1E4EF }, /* U+1E4EC - U+1E4EF */
+	{ 0x1E5EE, 0x1E5EF }, /* U+1E5EE - U+1E5EF */
+	{ 0x1E8D0, 0x1E8D6 }, /* U+1E8D0 - U+1E8D6 */
+	{ 0x1E944, 0x1E94A }, /* U+1E944 - U+1E94A */
+	{ 0x1F3FB, 0x1F3FF }, /* U+1F3FB - U+1F3FF */
+	{ 0x1F9B0, 0x1F9B3 }, /* U+1F9B0 - U+1F9B3 */
+	{ 0xE0001, 0xE0001 }, /* U+E0001 */
+	{ 0xE0020, 0xE007F }, /* U+E0020 - U+E007F */
+	{ 0xE0100, 0xE01EF }, /* U+E0100 - U+E01EF */
+};
+
+/* Double-width character ranges */
+static const struct interval double_width_ranges[] = {
+	{ 0x01100, 0x0115F }, /* HANGUL CHOSEONG KIYEOK - HANGUL CHOSEONG FILLER */
+	{ 0x0231A, 0x0231B }, /* WATCH - HOURGLASS */
+	{ 0x02329, 0x0232A }, /* LEFT-POINTING ANGLE BRACKET - RIGHT-POINTING ANGLE BRACKET */
+	{ 0x023E9, 0x023EC }, /* BLACK RIGHT-POINTING DOUBLE TRIANGLE - BLACK DOWN-POINTING DOUBLE TRIANGLE */
+	{ 0x023F0, 0x023F0 }, /* ALARM CLOCK */
+	{ 0x023F3, 0x023F3 }, /* HOURGLASS WITH FLOWING SAND */
+	{ 0x025FD, 0x025FE }, /* WHITE MEDIUM SMALL SQUARE - BLACK MEDIUM SMALL SQUARE */
+	{ 0x02614, 0x02615 }, /* UMBRELLA WITH RAIN DROPS - HOT BEVERAGE */
+	{ 0x02630, 0x02637 }, /* TRIGRAM FOR HEAVEN - TRIGRAM FOR EARTH */
+	{ 0x02648, 0x02653 }, /* ARIES - PISCES */
+	{ 0x0267F, 0x0267F }, /* WHEELCHAIR SYMBOL */
+	{ 0x0268A, 0x0268F }, /* MONOGRAM FOR YANG - DIGRAM FOR GREATER YIN */
+	{ 0x02693, 0x02693 }, /* ANCHOR */
+	{ 0x026A1, 0x026A1 }, /* HIGH VOLTAGE SIGN */
+	{ 0x026AA, 0x026AB }, /* MEDIUM WHITE CIRCLE - MEDIUM BLACK CIRCLE */
+	{ 0x026BD, 0x026BE }, /* SOCCER BALL - BASEBALL */
+	{ 0x026C4, 0x026C5 }, /* SNOWMAN WITHOUT SNOW - SUN BEHIND CLOUD */
+	{ 0x026CE, 0x026CE }, /* OPHIUCHUS */
+	{ 0x026D4, 0x026D4 }, /* NO ENTRY */
+	{ 0x026EA, 0x026EA }, /* CHURCH */
+	{ 0x026F2, 0x026F3 }, /* FOUNTAIN - FLAG IN HOLE */
+	{ 0x026F5, 0x026F5 }, /* SAILBOAT */
+	{ 0x026FA, 0x026FA }, /* TENT */
+	{ 0x026FD, 0x026FD }, /* FUEL PUMP */
+	{ 0x02705, 0x02705 }, /* WHITE HEAVY CHECK MARK */
+	{ 0x0270A, 0x0270B }, /* RAISED FIST - RAISED HAND */
+	{ 0x02728, 0x02728 }, /* SPARKLES */
+	{ 0x0274C, 0x0274C }, /* CROSS MARK */
+	{ 0x0274E, 0x0274E }, /* NEGATIVE SQUARED CROSS MARK */
+	{ 0x02753, 0x02755 }, /* BLACK QUESTION MARK ORNAMENT - WHITE EXCLAMATION MARK ORNAMENT */
+	{ 0x02757, 0x02757 }, /* HEAVY EXCLAMATION MARK SYMBOL */
+	{ 0x02795, 0x02797 }, /* HEAVY PLUS SIGN - HEAVY DIVISION SIGN */
+	{ 0x027B0, 0x027B0 }, /* CURLY LOOP */
+	{ 0x027BF, 0x027BF }, /* DOUBLE CURLY LOOP */
+	{ 0x02B1B, 0x02B1C }, /* BLACK LARGE SQUARE - WHITE LARGE SQUARE */
+	{ 0x02B50, 0x02B50 }, /* WHITE MEDIUM STAR */
+	{ 0x02B55, 0x02B55 }, /* HEAVY LARGE CIRCLE */
+	{ 0x02E80, 0x02E99 }, /* CJK RADICAL REPEAT - CJK RADICAL RAP */
+	{ 0x02E9B, 0x02EF3 }, /* CJK RADICAL CHOKE - CJK RADICAL C-SIMPLIFIED TURTLE */
+	{ 0x02F00, 0x02FD5 }, /* KANGXI RADICAL ONE - KANGXI RADICAL FLUTE */
+	{ 0x02FF0, 0x03029 }, /* IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT - HANGZHOU NUMERAL NINE */
+	{ 0x03030, 0x0303E }, /* WAVY DASH - IDEOGRAPHIC VARIATION INDICATOR */
+	{ 0x03041, 0x03096 }, /* HIRAGANA LETTER SMALL A - HIRAGANA LETTER SMALL KE */
+	{ 0x0309B, 0x030FF }, /* KATAKANA-HIRAGANA VOICED SOUND MARK - KATAKANA DIGRAPH KOTO */
+	{ 0x03105, 0x0312F }, /* BOPOMOFO LETTER B - BOPOMOFO LETTER NN */
+	{ 0x03131, 0x0318E }, /* HANGUL LETTER KIYEOK - HANGUL LETTER ARAEAE */
+	{ 0x03190, 0x031E5 }, /* IDEOGRAPHIC ANNOTATION LINKING MARK - CJK STROKE SZP */
+	{ 0x031EF, 0x0321E }, /* IDEOGRAPHIC DESCRIPTION CHARACTER SUBTRACTION - PARENTHESIZED KOREAN CHARACTER O HU */
+	{ 0x03220, 0x03247 }, /* PARENTHESIZED IDEOGRAPH ONE - CIRCLED IDEOGRAPH KOTO */
+	{ 0x03250, 0x0A48C }, /* PARTNERSHIP SIGN - YI SYLLABLE YYR */
+	{ 0x0A490, 0x0A4C6 }, /* YI RADICAL QOT - YI RADICAL KE */
+	{ 0x0A960, 0x0A97C }, /* HANGUL CHOSEONG TIKEUT-MIEUM - HANGUL CHOSEONG SSANGYEORINHIEUH */
+	{ 0x0AC00, 0x0D7A3 }, /* HANGUL SYLLABLE GA - HANGUL SYLLABLE HIH */
+	{ 0x0F900, 0x0FAFF }, /* U+0F900 - U+0FAFF */
+	{ 0x0FE10, 0x0FE19 }, /* PRESENTATION FORM FOR VERTICAL COMMA - PRESENTATION FORM FOR VERTICAL HORIZONTAL ELLIPSIS */
+	{ 0x0FE30, 0x0FE52 }, /* PRESENTATION FORM FOR VERTICAL TWO DOT LEADER - SMALL FULL STOP */
+	{ 0x0FE54, 0x0FE66 }, /* SMALL SEMICOLON - SMALL EQUALS SIGN */
+	{ 0x0FE68, 0x0FE6B }, /* SMALL REVERSE SOLIDUS - SMALL COMMERCIAL AT */
+	{ 0x0FF01, 0x0FF60 }, /* FULLWIDTH EXCLAMATION MARK - FULLWIDTH RIGHT WHITE PARENTHESIS */
+	{ 0x0FFE0, 0x0FFE6 }, /* FULLWIDTH CENT SIGN - FULLWIDTH WON SIGN */
+	{ 0x16FE0, 0x16FE3 }, /* U+16FE0 - U+16FE3 */
+	{ 0x17000, 0x187F7 }, /* U+17000 - U+187F7 */
+	{ 0x18800, 0x18CD5 }, /* U+18800 - U+18CD5 */
+	{ 0x18CFF, 0x18D08 }, /* U+18CFF - U+18D08 */
+	{ 0x1AFF0, 0x1AFF3 }, /* U+1AFF0 - U+1AFF3 */
+	{ 0x1AFF5, 0x1AFFB }, /* U+1AFF5 - U+1AFFB */
+	{ 0x1AFFD, 0x1AFFE }, /* U+1AFFD - U+1AFFE */
+	{ 0x1B000, 0x1B122 }, /* U+1B000 - U+1B122 */
+	{ 0x1B132, 0x1B132 }, /* U+1B132 */
+	{ 0x1B150, 0x1B152 }, /* U+1B150 - U+1B152 */
+	{ 0x1B155, 0x1B155 }, /* U+1B155 */
+	{ 0x1B164, 0x1B167 }, /* U+1B164 - U+1B167 */
+	{ 0x1B170, 0x1B2FB }, /* U+1B170 - U+1B2FB */
+	{ 0x1D300, 0x1D356 }, /* U+1D300 - U+1D356 */
+	{ 0x1D360, 0x1D376 }, /* U+1D360 - U+1D376 */
+	{ 0x1F000, 0x1F02F }, /* U+1F000 - U+1F02F */
+	{ 0x1F0A0, 0x1F0FF }, /* U+1F0A0 - U+1F0FF */
+	{ 0x1F18E, 0x1F18E }, /* U+1F18E */
+	{ 0x1F191, 0x1F19A }, /* U+1F191 - U+1F19A */
+	{ 0x1F200, 0x1F202 }, /* U+1F200 - U+1F202 */
+	{ 0x1F210, 0x1F23B }, /* U+1F210 - U+1F23B */
+	{ 0x1F240, 0x1F248 }, /* U+1F240 - U+1F248 */
+	{ 0x1F250, 0x1F251 }, /* U+1F250 - U+1F251 */
+	{ 0x1F260, 0x1F265 }, /* U+1F260 - U+1F265 */
+	{ 0x1F300, 0x1F3FA }, /* U+1F300 - U+1F3FA */
+	{ 0x1F400, 0x1F64F }, /* U+1F400 - U+1F64F */
+	{ 0x1F680, 0x1F9AF }, /* U+1F680 - U+1F9AF */
+	{ 0x1F9B4, 0x1FAFF }, /* U+1F9B4 - U+1FAFF */
+	{ 0x20000, 0x2FFFD }, /* U+20000 - U+2FFFD */
+	{ 0x30000, 0x3FFFD }, /* U+30000 - U+3FFFD */
+};
+
+
+static int ucs_cmp(const void *key, const void *element)
 {
 	uint32_t cp = *(uint32_t *)key;
-	struct interval e = *(struct interval *) elt;
+	const struct interval *e = element;
 
-	if (cp > e.last)
+	if (cp > e->last)
 		return 1;
-	else if (cp < e.first)
+	if (cp < e->first)
 		return -1;
 	return 0;
 }
 
-static const struct interval double_width[] = {
-	{ 0x1100, 0x115F }, { 0x2329, 0x232A }, { 0x2E80, 0x303E },
-	{ 0x3040, 0xA4CF }, { 0xAC00, 0xD7A3 }, { 0xF900, 0xFAFF },
-	{ 0xFE10, 0xFE19 }, { 0xFE30, 0xFE6F }, { 0xFF00, 0xFF60 },
-	{ 0xFFE0, 0xFFE6 }, { 0x20000, 0x2FFFD }, { 0x30000, 0x3FFFD }
-};
-
-bool ucs_is_double_width(uint32_t cp)
+static bool is_in_interval(uint32_t cp, const struct interval *intervals, size_t count)
 {
-	if (cp < double_width[0].first ||
-	    cp > double_width[ARRAY_SIZE(double_width) - 1].last)
+	if (cp < intervals[0].first || cp > intervals[count - 1].last)
 		return false;
 
-	return bsearch(&cp, double_width, ARRAY_SIZE(double_width),
-		       sizeof(struct interval), ucs_cmp) != NULL;
+	return __inline_bsearch(&cp, intervals, count,
+				sizeof(*intervals), ucs_cmp) != NULL;
+}
+
+/**
+ * Determine if a Unicode code point is zero-width.
+ *
+ * @cp: Unicode code point (UCS-4)
+ * Return: true if the character is zero-width, false otherwise
+ */
+bool ucs_is_zero_width(uint32_t cp)
+{
+	return is_in_interval(cp, zero_width_ranges, ARRAY_SIZE(zero_width_ranges));
+}
+
+/**
+ * Determine if a Unicode code point is double-width.
+ *
+ * @cp: Unicode code point (UCS-4)
+ * Return: true if the character is double-width, false otherwise
+ */
+bool ucs_is_double_width(uint32_t cp)
+{
+	return is_in_interval(cp, double_width_ranges, ARRAY_SIZE(double_width_ranges));
 }
diff --git a/include/linux/consolemap.h b/include/linux/consolemap.h
index 7d778752dc..b3a9118666 100644
--- a/include/linux/consolemap.h
+++ b/include/linux/consolemap.h
@@ -29,11 +29,7 @@ u32 conv_8bit_to_uni(unsigned char c);
 int conv_uni_to_8bit(u32 uni);
 void console_map_init(void);
 bool ucs_is_double_width(uint32_t cp);
-static inline bool ucs_is_zero_width(uint32_t cp)
-{
-	/* coming soon */
-	return false;
-}
+bool ucs_is_zero_width(uint32_t cp);
 #else
 static inline u16 inverse_translate(const struct vc_data *conp, u16 glyph,
 		bool use_unicode)
-- 
2.49.0
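
For readers following the lookup logic in is_in_interval() above, here is a
minimal Python sketch of the same idea: a bounds check followed by a binary
search over sorted, non-overlapping, inclusive ranges. The tables are small
excerpts of the ranges in this patch, and the names in_intervals() and
char_width() are hypothetical helpers for illustration only; they are not
part of the kernel code.

```python
from bisect import bisect_right

# Small excerpts of the generated tables: sorted, non-overlapping,
# inclusive (first, last) ranges. The real tables are much longer.
ZERO_WIDTH = [(0x0200B, 0x0200E), (0x0FE00, 0x0FE0F), (0x1F3FB, 0x1F3FF)]
DOUBLE_WIDTH = [(0x01100, 0x0115F), (0x0AC00, 0x0D7A3), (0x1F300, 0x1F3FA)]


def in_intervals(cp, intervals):
    """Return True if cp falls inside one of the sorted inclusive ranges."""
    # Quick rejection, mirroring the bounds check in is_in_interval().
    if cp < intervals[0][0] or cp > intervals[-1][1]:
        return False
    # Index of the last range whose first value is <= cp.
    idx = bisect_right(intervals, (cp, float("inf"))) - 1
    return idx >= 0 and intervals[idx][0] <= cp <= intervals[idx][1]


def char_width(cp):
    """Hypothetical helper: number of columns for a code point (0, 1 or 2)."""
    if in_intervals(cp, ZERO_WIDTH):
        return 0
    if in_intervals(cp, DOUBLE_WIDTH):
        return 2
    return 1
```

With these excerpts, char_width(0xFE0F) (VARIATION SELECTOR-16) falls in the
zero-width table, char_width(0xAC00) (HANGUL SYLLABLE GA) falls in the
double-width table, and plain ASCII falls through to a width of one.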



* [PATCH 06/11] vt: introduce gen_ucs_recompose.py to create ucs_recompose.c
  2025-04-10  1:13 [PATCH 00/11] vt: implement proper Unicode handling Nicolas Pitre
                   ` (4 preceding siblings ...)
  2025-04-10  1:13 ` [PATCH 05/11] vt: update ucs_width.c using gen_ucs_width.py Nicolas Pitre
@ 2025-04-10  1:13 ` Nicolas Pitre
  2025-04-14  7:08   ` Jiri Slaby
  2025-04-10  1:13 ` [PATCH 07/11] vt: create ucs_recompose.c using gen_ucs_recompose.py Nicolas Pitre
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 27+ messages in thread
From: Nicolas Pitre @ 2025-04-10  1:13 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Jiri Slaby
  Cc: Nicolas Pitre, Dave Mielke, linux-serial, linux-kernel

From: Nicolas Pitre <npitre@baylibre.com>

The script uses Python's unicodedata module to generate a table that maps
base character + combining mark pairs to their precomposed equivalents.
The generated code also provides the ucs_recompose() function to query
that table.

By default, the script creates a table containing only the most commonly
used Latin, Greek, and Cyrillic recomposition pairs. That table is much
smaller than the one with all possible recomposition pairs (71 entries vs
1000 entries). If the full table is needed, running the script with the
--full argument will generate it.

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
---
 drivers/tty/vt/gen_ucs_recompose.py | 321 ++++++++++++++++++++++++++++
 1 file changed, 321 insertions(+)
 create mode 100755 drivers/tty/vt/gen_ucs_recompose.py
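
As a rough illustration of what the commit message describes, the sketch
below uses unicodedata to map a base character + combining mark pair to its
precomposed equivalent. It relies on NFC normalization, which is an
assumption made for this example and not necessarily how
gen_ucs_recompose.py derives its table; recompose_pair() is a hypothetical
name.

```python
import unicodedata


def recompose_pair(base, mark):
    """Return the precomposed code point for base + mark, or None.

    Illustrative only: NFC-normalize the two-character string and accept
    the result only if it collapsed into a single code point.
    """
    composed = unicodedata.normalize("NFC", chr(base) + chr(mark))
    return ord(composed) if len(composed) == 1 else None


# Build a tiny lookup table keyed by (base, mark), in the spirit of the
# generated table that ucs_recompose() queries.
pairs = [(0x0041, 0x0300), (0x0063, 0x0327), (0x006E, 0x0303)]
table = {p: recompose_pair(*p) for p in pairs}
```

Here table[(0x0041, 0x0300)] is 0x00C0 (LATIN CAPITAL LETTER A WITH GRAVE),
matching the first entry of COMMON_RECOMPOSITION_PAIRS below, while a pair
with no precomposed form, such as Q + COMBINING GRAVE ACCENT, yields None.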

diff --git a/drivers/tty/vt/gen_ucs_recompose.py b/drivers/tty/vt/gen_ucs_recompose.py
new file mode 100755
index 0000000000..64418803e4
--- /dev/null
+++ b/drivers/tty/vt/gen_ucs_recompose.py
@@ -0,0 +1,321 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+#
+# This script uses Python's unicodedata module to generate ucs_recompose.c.
+# The generated code maps base character + combining mark pairs to their
+# precomposed equivalents.
+#
+# Usage:
+#   python gen_ucs_recompose.py         # Generate with common recomposition pairs
+#   python gen_ucs_recompose.py --full  # Generate with all recomposition pairs
+
+import unicodedata
+import sys
+import argparse
+import textwrap
+
+common_recompose_description = "most commonly used Latin, Greek, and Cyrillic recomposition pairs only"
+COMMON_RECOMPOSITION_PAIRS = [
+    # Latin letters with accents - uppercase
+    (0x0041, 0x0300, 0x00C0),  # A + COMBINING GRAVE ACCENT = LATIN CAPITAL LETTER A WITH GRAVE
+    (0x0041, 0x0301, 0x00C1),  # A + COMBINING ACUTE ACCENT = LATIN CAPITAL LETTER A WITH ACUTE
+    (0x0041, 0x0302, 0x00C2),  # A + COMBINING CIRCUMFLEX ACCENT = LATIN CAPITAL LETTER A WITH CIRCUMFLEX
+    (0x0041, 0x0303, 0x00C3),  # A + COMBINING TILDE = LATIN CAPITAL LETTER A WITH TILDE
+    (0x0041, 0x0308, 0x00C4),  # A + COMBINING DIAERESIS = LATIN CAPITAL LETTER A WITH DIAERESIS
+    (0x0041, 0x030A, 0x00C5),  # A + COMBINING RING ABOVE = LATIN CAPITAL LETTER A WITH RING ABOVE
+    (0x0043, 0x0327, 0x00C7),  # C + COMBINING CEDILLA = LATIN CAPITAL LETTER C WITH CEDILLA
+    (0x0045, 0x0300, 0x00C8),  # E + COMBINING GRAVE ACCENT = LATIN CAPITAL LETTER E WITH GRAVE
+    (0x0045, 0x0301, 0x00C9),  # E + COMBINING ACUTE ACCENT = LATIN CAPITAL LETTER E WITH ACUTE
+    (0x0045, 0x0302, 0x00CA),  # E + COMBINING CIRCUMFLEX ACCENT = LATIN CAPITAL LETTER E WITH CIRCUMFLEX
+    (0x0045, 0x0308, 0x00CB),  # E + COMBINING DIAERESIS = LATIN CAPITAL LETTER E WITH DIAERESIS
+    (0x0049, 0x0300, 0x00CC),  # I + COMBINING GRAVE ACCENT = LATIN CAPITAL LETTER I WITH GRAVE
+    (0x0049, 0x0301, 0x00CD),  # I + COMBINING ACUTE ACCENT = LATIN CAPITAL LETTER I WITH ACUTE
+    (0x0049, 0x0302, 0x00CE),  # I + COMBINING CIRCUMFLEX ACCENT = LATIN CAPITAL LETTER I WITH CIRCUMFLEX
+    (0x0049, 0x0308, 0x00CF),  # I + COMBINING DIAERESIS = LATIN CAPITAL LETTER I WITH DIAERESIS
+    (0x004E, 0x0303, 0x00D1),  # N + COMBINING TILDE = LATIN CAPITAL LETTER N WITH TILDE
+    (0x004F, 0x0300, 0x00D2),  # O + COMBINING GRAVE ACCENT = LATIN CAPITAL LETTER O WITH GRAVE
+    (0x004F, 0x0301, 0x00D3),  # O + COMBINING ACUTE ACCENT = LATIN CAPITAL LETTER O WITH ACUTE
+    (0x004F, 0x0302, 0x00D4),  # O + COMBINING CIRCUMFLEX ACCENT = LATIN CAPITAL LETTER O WITH CIRCUMFLEX
+    (0x004F, 0x0303, 0x00D5),  # O + COMBINING TILDE = LATIN CAPITAL LETTER O WITH TILDE
+    (0x004F, 0x0308, 0x00D6),  # O + COMBINING DIAERESIS = LATIN CAPITAL LETTER O WITH DIAERESIS
+    (0x0055, 0x0300, 0x00D9),  # U + COMBINING GRAVE ACCENT = LATIN CAPITAL LETTER U WITH GRAVE
+    (0x0055, 0x0301, 0x00DA),  # U + COMBINING ACUTE ACCENT = LATIN CAPITAL LETTER U WITH ACUTE
+    (0x0055, 0x0302, 0x00DB),  # U + COMBINING CIRCUMFLEX ACCENT = LATIN CAPITAL LETTER U WITH CIRCUMFLEX
+    (0x0055, 0x0308, 0x00DC),  # U + COMBINING DIAERESIS = LATIN CAPITAL LETTER U WITH DIAERESIS
+    (0x0059, 0x0301, 0x00DD),  # Y + COMBINING ACUTE ACCENT = LATIN CAPITAL LETTER Y WITH ACUTE
+
+    # Latin letters with accents - lowercase
+    (0x0061, 0x0300, 0x00E0),  # a + COMBINING GRAVE ACCENT = LATIN SMALL LETTER A WITH GRAVE
+    (0x0061, 0x0301, 0x00E1),  # a + COMBINING ACUTE ACCENT = LATIN SMALL LETTER A WITH ACUTE
+    (0x0061, 0x0302, 0x00E2),  # a + COMBINING CIRCUMFLEX ACCENT = LATIN SMALL LETTER A WITH CIRCUMFLEX
+    (0x0061, 0x0303, 0x00E3),  # a + COMBINING TILDE = LATIN SMALL LETTER A WITH TILDE
+    (0x0061, 0x0308, 0x00E4),  # a + COMBINING DIAERESIS = LATIN SMALL LETTER A WITH DIAERESIS
+    (0x0061, 0x030A, 0x00E5),  # a + COMBINING RING ABOVE = LATIN SMALL LETTER A WITH RING ABOVE
+    (0x0063, 0x0327, 0x00E7),  # c + COMBINING CEDILLA = LATIN SMALL LETTER C WITH CEDILLA
+    (0x0065, 0x0300, 0x00E8),  # e + COMBINING GRAVE ACCENT = LATIN SMALL LETTER E WITH GRAVE
+    (0x0065, 0x0301, 0x00E9),  # e + COMBINING ACUTE ACCENT = LATIN SMALL LETTER E WITH ACUTE
+    (0x0065, 0x0302, 0x00EA),  # e + COMBINING CIRCUMFLEX ACCENT = LATIN SMALL LETTER E WITH CIRCUMFLEX
+    (0x0065, 0x0308, 0x00EB),  # e + COMBINING DIAERESIS = LATIN SMALL LETTER E WITH DIAERESIS
+    (0x0069, 0x0300, 0x00EC),  # i + COMBINING GRAVE ACCENT = LATIN SMALL LETTER I WITH GRAVE
+    (0x0069, 0x0301, 0x00ED),  # i + COMBINING ACUTE ACCENT = LATIN SMALL LETTER I WITH ACUTE
+    (0x0069, 0x0302, 0x00EE),  # i + COMBINING CIRCUMFLEX ACCENT = LATIN SMALL LETTER I WITH CIRCUMFLEX
+    (0x0069, 0x0308, 0x00EF),  # i + COMBINING DIAERESIS = LATIN SMALL LETTER I WITH DIAERESIS
+    (0x006E, 0x0303, 0x00F1),  # n + COMBINING TILDE = LATIN SMALL LETTER N WITH TILDE
+    (0x006F, 0x0300, 0x00F2),  # o + COMBINING GRAVE ACCENT = LATIN SMALL LETTER O WITH GRAVE
+    (0x006F, 0x0301, 0x00F3),  # o + COMBINING ACUTE ACCENT = LATIN SMALL LETTER O WITH ACUTE
+    (0x006F, 0x0302, 0x00F4),  # o + COMBINING CIRCUMFLEX ACCENT = LATIN SMALL LETTER O WITH CIRCUMFLEX
+    (0x006F, 0x0303, 0x00F5),  # o + COMBINING TILDE = LATIN SMALL LETTER O WITH TILDE
+    (0x006F, 0x0308, 0x00F6),  # o + COMBINING DIAERESIS = LATIN SMALL LETTER O WITH DIAERESIS
+    (0x0075, 0x0300, 0x00F9),  # u + COMBINING GRAVE ACCENT = LATIN SMALL LETTER U WITH GRAVE
+    (0x0075, 0x0301, 0x00FA),  # u + COMBINING ACUTE ACCENT = LATIN SMALL LETTER U WITH ACUTE
+    (0x0075, 0x0302, 0x00FB),  # u + COMBINING CIRCUMFLEX ACCENT = LATIN SMALL LETTER U WITH CIRCUMFLEX
+    (0x0075, 0x0308, 0x00FC),  # u + COMBINING DIAERESIS = LATIN SMALL LETTER U WITH DIAERESIS
+    (0x0079, 0x0301, 0x00FD),  # y + COMBINING ACUTE ACCENT = LATIN SMALL LETTER Y WITH ACUTE
+    (0x0079, 0x0308, 0x00FF),  # y + COMBINING DIAERESIS = LATIN SMALL LETTER Y WITH DIAERESIS
+
+    # Common Greek characters
+    (0x0391, 0x0301, 0x0386),  # Α + COMBINING ACUTE ACCENT = GREEK CAPITAL LETTER ALPHA WITH TONOS
+    (0x0395, 0x0301, 0x0388),  # Ε + COMBINING ACUTE ACCENT = GREEK CAPITAL LETTER EPSILON WITH TONOS
+    (0x0397, 0x0301, 0x0389),  # Η + COMBINING ACUTE ACCENT = GREEK CAPITAL LETTER ETA WITH TONOS
+    (0x0399, 0x0301, 0x038A),  # Ι + COMBINING ACUTE ACCENT = GREEK CAPITAL LETTER IOTA WITH TONOS
+    (0x039F, 0x0301, 0x038C),  # Ο + COMBINING ACUTE ACCENT = GREEK CAPITAL LETTER OMICRON WITH TONOS
+    (0x03A5, 0x0301, 0x038E),  # Υ + COMBINING ACUTE ACCENT = GREEK CAPITAL LETTER UPSILON WITH TONOS
+    (0x03A9, 0x0301, 0x038F),  # Ω + COMBINING ACUTE ACCENT = GREEK CAPITAL LETTER OMEGA WITH TONOS
+    (0x03B1, 0x0301, 0x03AC),  # α + COMBINING ACUTE ACCENT = GREEK SMALL LETTER ALPHA WITH TONOS
+    (0x03B5, 0x0301, 0x03AD),  # ε + COMBINING ACUTE ACCENT = GREEK SMALL LETTER EPSILON WITH TONOS
+    (0x03B7, 0x0301, 0x03AE),  # η + COMBINING ACUTE ACCENT = GREEK SMALL LETTER ETA WITH TONOS
+    (0x03B9, 0x0301, 0x03AF),  # ι + COMBINING ACUTE ACCENT = GREEK SMALL LETTER IOTA WITH TONOS
+    (0x03BF, 0x0301, 0x03CC),  # ο + COMBINING ACUTE ACCENT = GREEK SMALL LETTER OMICRON WITH TONOS
+    (0x03C5, 0x0301, 0x03CD),  # υ + COMBINING ACUTE ACCENT = GREEK SMALL LETTER UPSILON WITH TONOS
+    (0x03C9, 0x0301, 0x03CE),  # ω + COMBINING ACUTE ACCENT = GREEK SMALL LETTER OMEGA WITH TONOS
+
+    # Common Cyrillic characters
+    (0x0418, 0x0306, 0x0419),  # И + COMBINING BREVE = CYRILLIC CAPITAL LETTER SHORT I
+    (0x0438, 0x0306, 0x0439),  # и + COMBINING BREVE = CYRILLIC SMALL LETTER SHORT I
+    (0x0423, 0x0306, 0x040E),  # У + COMBINING BREVE = CYRILLIC CAPITAL LETTER SHORT U
+    (0x0443, 0x0306, 0x045E),  # у + COMBINING BREVE = CYRILLIC SMALL LETTER SHORT U
+]
+
+full_recompose_description = "all possible recomposition pairs from the Unicode BMP"
+def collect_all_recomposition_pairs():
+    """Collect all possible recomposition pairs from the Unicode data."""
+    # Map to store recomposition pairs: (base, combining) -> recomposed
+    recompose_map = {}
+
+    # Process all assigned Unicode code points in BMP (Basic Multilingual Plane)
+    # We limit to BMP (0x0000-0xFFFF) to keep our table smaller with uint16_t
+    for cp in range(0, 0x10000):
+        try:
+            char = chr(cp)
+
+            # Skip unassigned or control characters
+            if not unicodedata.name(char, ''):
+                continue
+
+            # Find decomposition
+            decomp = unicodedata.decomposition(char)
+            if not decomp or '<' in decomp:  # Skip compatibility decompositions
+                continue
+
+            # Parse the decomposition
+            parts = decomp.split()
+            if len(parts) == 2:  # Simple base + combining mark
+                base = int(parts[0], 16)
+                combining = int(parts[1], 16)
+
+                # Only store if both are in BMP
+                if base < 0x10000 and combining < 0x10000:
+                    recompose_map[(base, combining)] = cp
+
+        except (ValueError, TypeError):
+            continue
+
+    # Convert to a list of tuples and sort for binary search
+    recompose_list = [(base, combining, recomposed)
+                     for (base, combining), recomposed in recompose_map.items()]
+    recompose_list.sort()
+
+    return recompose_list
+
+def validate_common_pairs(full_list):
+    """Validate that all common pairs are in the full list.
+
+    Raises:
+        ValueError: If any common pair is missing or has a different recomposition
+        value than what's in the full table.
+    """
+    full_pairs = {(base, combining): recomposed for base, combining, recomposed in full_list}
+    for base, combining, recomposed in COMMON_RECOMPOSITION_PAIRS:
+        full_recomposed = full_pairs.get((base, combining))
+        if full_recomposed is None:
+            error_msg = f"Error: Common pair (0x{base:04X}, 0x{combining:04X}) not found in full data"
+            print(error_msg)
+            raise ValueError(error_msg)
+        elif full_recomposed != recomposed:
+            error_msg = (f"Error: Common pair (0x{base:04X}, 0x{combining:04X}) has different recomposition: "
+                         f"0x{recomposed:04X} vs 0x{full_recomposed:04X}")
+            print(error_msg)
+            raise ValueError(error_msg)
+
+def generate_recomposition_table(use_full_list=False):
+    """Generate the recomposition table C code."""
+    # Output file name
+    c_file = "ucs_recompose.c"
+
+    # Get Unicode version information
+    unicode_version = unicodedata.unidata_version
+
+    # Collect all recomposition pairs for validation
+    full_recompose_list = collect_all_recomposition_pairs()
+
+    # Decide which list to use
+    if use_full_list:
+        print("Using full recomposition list...")
+        recompose_list = full_recompose_list
+        table_description = full_recompose_description
+        alt_list = COMMON_RECOMPOSITION_PAIRS
+        alt_description = common_recompose_description
+    else:
+        print("Using common recomposition list...")
+        # Validate that all common pairs are in the full list
+        validate_common_pairs(full_recompose_list)
+        recompose_list = sorted(COMMON_RECOMPOSITION_PAIRS)
+        table_description = common_recompose_description
+        alt_list = full_recompose_list
+        alt_description = full_recompose_description
+    generation_mode = " --full" if use_full_list else ""
+    alternative_mode = " --full" if not use_full_list else ""
+    table_description_detail = f"{table_description} ({len(recompose_list)} entries)"
+    alt_description_detail = f"{alt_description} ({len(alt_list)} entries)"
+
+    # Calculate min/max values for boundary checks
+    min_base = min(base for base, _, _ in recompose_list)
+    max_base = max(base for base, _, _ in recompose_list)
+    min_combining = min(combining for _, combining, _ in recompose_list)
+    max_combining = max(combining for _, combining, _ in recompose_list)
+
+    # Generate implementation file
+    with open(c_file, 'w') as f:
+        f.write(f"""\
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * ucs_recompose.c - Unicode character recomposition
+ *
+ * Auto-generated by gen_ucs_recompose.py{generation_mode}
+ *
+ * Unicode Version: {unicode_version}
+ *
+{textwrap.fill(
+    f"This file contains a table with {table_description_detail}. " +
+    f"To generate a table with {alt_description_detail} instead, run:",
+    width=75, initial_indent=" * ", subsequent_indent=" * ")}
+ *
+ *   python gen_ucs_recompose.py{alternative_mode}
+ */
+
+#include <linux/types.h>
+#include <linux/array_size.h>
+#include <linux/bsearch.h>
+#include <linux/consolemap.h>
+
+/*
+ * Structure for recomposition pairs.
+ * First element is the base character, second is the combining mark,
+ * third is the recomposed character.
+ * Using uint16_t to save space since all values are within BMP range.
+ */
+struct recomposition {{
+	uint16_t base;
+	uint16_t combining;
+	uint16_t recomposed;
+}};
+
+/*
+ * Table of {table_description}
+ * Sorted by base character and then combining character for binary search
+ */
+static const struct recomposition recomposition_table[] = {{
+""")
+
+        # Write the recomposition table with comments
+        for base, combining, recomposed in recompose_list:
+            try:
+                base_name = unicodedata.name(chr(base))
+                combining_name = unicodedata.name(chr(combining))
+                recomposed_name = unicodedata.name(chr(recomposed))
+                comment = f"/* {base_name} + {combining_name} = {recomposed_name} */"
+            except ValueError:
+                comment = f"/* U+{base:04X} + U+{combining:04X} = U+{recomposed:04X} */"
+            f.write(f"\t{{ 0x{base:04X}, 0x{combining:04X}, 0x{recomposed:04X} }}, {comment}\n")
+
+        f.write(f"""\
+}};
+
+/*
+ * Boundary values for quick rejection
+ * These are calculated by analyzing the table during generation
+ */
+#define MIN_BASE_CHAR       0x{min_base:04X}
+#define MAX_BASE_CHAR       0x{max_base:04X}
+#define MIN_COMBINING_CHAR  0x{min_combining:04X}
+#define MAX_COMBINING_CHAR  0x{max_combining:04X}
+
+struct compare_key {{
+	uint16_t base;
+	uint16_t combining;
+}};
+
+static int recomposition_compare(const void *key, const void *element)
+{{
+	const struct compare_key *search_key = key;
+	const struct recomposition *table_entry = element;
+
+	/* Compare base character first */
+	if (search_key->base < table_entry->base)
+		return -1;
+	if (search_key->base > table_entry->base)
+		return 1;
+
+	/* Base characters match, now compare combining character */
+	if (search_key->combining < table_entry->combining)
+		return -1;
+	if (search_key->combining > table_entry->combining)
+		return 1;
+
+	/* Both match */
+	return 0;
+}}
+
+/**
+ * ucs_recompose() - Attempt to recompose two Unicode characters into a single character.
+ * @base: Base Unicode code point (UCS-4)
+ * @combining: Combining mark Unicode code point (UCS-4)
+ *
+ * Return: Recomposed Unicode code point, or 0 if no recomposition is possible
+ */
+uint32_t ucs_recompose(uint32_t base, uint32_t combining)
+{{
+	/* Check if characters are within the range of our table */
+	if (base < MIN_BASE_CHAR || base > MAX_BASE_CHAR ||
+	    combining < MIN_COMBINING_CHAR || combining > MAX_COMBINING_CHAR)
+		return 0;
+
+	struct compare_key key = {{ base, combining }};
+
+	const struct recomposition *result =
+		__inline_bsearch(&key, recomposition_table,
+				 ARRAY_SIZE(recomposition_table),
+				 sizeof(*recomposition_table),
+				 recomposition_compare);
+
+	return result ? result->recomposed : 0;
+}}
+""")
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Generate Unicode recomposition table")
+    parser.add_argument("--full", action="store_true",
+                        help="Generate a full recomposition table (default: common pairs only)")
+    args = parser.parse_args()
+
+    generate_recomposition_table(use_full_list=args.full)
-- 
2.49.0



* [PATCH 07/11] vt: create ucs_recompose.c using gen_ucs_recompose.py
  2025-04-10  1:13 [PATCH 00/11] vt: implement proper Unicode handling Nicolas Pitre
                   ` (5 preceding siblings ...)
  2025-04-10  1:13 ` [PATCH 06/11] vt: introduce gen_ucs_recompose.py to create ucs_recompose.c Nicolas Pitre
@ 2025-04-10  1:13 ` Nicolas Pitre
  2025-04-11  6:00   ` kernel test robot
  2025-04-10  1:14 ` [PATCH 08/11] vt: support Unicode recomposition Nicolas Pitre
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 27+ messages in thread
From: Nicolas Pitre @ 2025-04-10  1:13 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Jiri Slaby
  Cc: Nicolas Pitre, Dave Mielke, linux-serial, linux-kernel

From: Nicolas Pitre <npitre@baylibre.com>

This provides ucs_recompose() to recompose two Unicode characters into
a single character if possible. This is needed for the VT to properly
display decomposed UTF-8 sequences.
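
As a rough illustration of the lookup described above, here is a minimal
Python sketch, not the kernel code. It assumes a small hand-picked table
in the same (base, combining, recomposed) layout, sorted by base and then
by combining mark, and uses the standard bisect module in place of the
kernel's bsearch. The names TABLE, KEYS and recompose() are made up for
this example.

```python
import bisect
import unicodedata

# A few entries in the same (base, combining, recomposed) layout as the
# generated table, sorted by base and then by combining mark.
TABLE = sorted([
    (0x0041, 0x0301, 0x00C1),  # A + COMBINING ACUTE ACCENT
    (0x0065, 0x0301, 0x00E9),  # e + COMBINING ACUTE ACCENT
    (0x0075, 0x0308, 0x00FC),  # u + COMBINING DIAERESIS
    (0x0438, 0x0306, 0x0439),  # CYRILLIC SMALL LETTER I + COMBINING BREVE
])
KEYS = [(base, combining) for base, combining, _ in TABLE]

def recompose(base, combining):
    """Return the recomposed code point, or 0 if the pair is not in TABLE."""
    i = bisect.bisect_left(KEYS, (base, combining))
    if i < len(KEYS) and KEYS[i] == (base, combining):
        return TABLE[i][2]
    return 0

# A pair present in the table recomposes to a single code point ...
assert recompose(0x0065, 0x0301) == 0x00E9
# ... and agrees with what NFC normalization produces for the same input.
assert chr(recompose(0x0065, 0x0301)) == unicodedata.normalize("NFC", "e\u0301")
# A pair missing from this reduced table yields 0.
assert recompose(0x0065, 0x0327) == 0
```

The last assertion shows the trade-off of a reduced table: e + COMBINING
CEDILLA has a precomposed form in Unicode, but a lookup in a table that
omits the pair still reports "no recomposition".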

Note: scripts/checkpatch.pl complains about "... exceeds 100 columns".
      Please ignore.

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
---
 drivers/tty/vt/Makefile        |   2 +-
 drivers/tty/vt/ucs_recompose.c | 170 +++++++++++++++++++++++++++++++++
 include/linux/consolemap.h     |   6 ++
 3 files changed, 177 insertions(+), 1 deletion(-)
 create mode 100644 drivers/tty/vt/ucs_recompose.c

diff --git a/drivers/tty/vt/Makefile b/drivers/tty/vt/Makefile
index bee69277bb..a63f6c9438 100644
--- a/drivers/tty/vt/Makefile
+++ b/drivers/tty/vt/Makefile
@@ -8,7 +8,7 @@ obj-$(CONFIG_VT)			+= vt_ioctl.o vc_screen.o \
 					   selection.o keyboard.o \
 					   vt.o defkeymap.o
 obj-$(CONFIG_CONSOLE_TRANSLATIONS)	+= consolemap.o consolemap_deftbl.o \
-					   ucs_width.o
+					   ucs_width.o ucs_recompose.o
 
 # Files generated that shall be removed upon make clean
 clean-files := consolemap_deftbl.c defkeymap.c
diff --git a/drivers/tty/vt/ucs_recompose.c b/drivers/tty/vt/ucs_recompose.c
new file mode 100644
index 0000000000..5c30c989de
--- /dev/null
+++ b/drivers/tty/vt/ucs_recompose.c
@@ -0,0 +1,170 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * ucs_recompose.c - Unicode character recomposition
+ *
+ * Auto-generated by gen_ucs_recompose.py
+ *
+ * Unicode Version: 16.0.0
+ *
+ * This file contains a table with most commonly used Latin, Greek, and
+ * Cyrillic recomposition pairs only (71 entries). To generate a table with
+ * all possible recomposition pairs from the Unicode BMP (1000 entries)
+ * instead, run:
+ *
+ *   python gen_ucs_recompose.py --full
+ */
+
+#include <linux/types.h>
+#include <linux/array_size.h>
+#include <linux/bsearch.h>
+#include <linux/consolemap.h>
+
+/*
+ * Structure for recomposition pairs.
+ * First element is the base character, second is the combining mark,
+ * third is the recomposed character.
+ * Using uint16_t to save space since all values are within BMP range.
+ */
+struct recomposition {
+	uint16_t base;
+	uint16_t combining;
+	uint16_t recomposed;
+};
+
+/*
+ * Table of most commonly used Latin, Greek, and Cyrillic recomposition pairs only
+ * Sorted by base character and then combining character for binary search
+ */
+static const struct recomposition recomposition_table[] = {
+	{ 0x0041, 0x0300, 0x00C0 }, /* LATIN CAPITAL LETTER A + COMBINING GRAVE ACCENT = LATIN CAPITAL LETTER A WITH GRAVE */
+	{ 0x0041, 0x0301, 0x00C1 }, /* LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT = LATIN CAPITAL LETTER A WITH ACUTE */
+	{ 0x0041, 0x0302, 0x00C2 }, /* LATIN CAPITAL LETTER A + COMBINING CIRCUMFLEX ACCENT = LATIN CAPITAL LETTER A WITH CIRCUMFLEX */
+	{ 0x0041, 0x0303, 0x00C3 }, /* LATIN CAPITAL LETTER A + COMBINING TILDE = LATIN CAPITAL LETTER A WITH TILDE */
+	{ 0x0041, 0x0308, 0x00C4 }, /* LATIN CAPITAL LETTER A + COMBINING DIAERESIS = LATIN CAPITAL LETTER A WITH DIAERESIS */
+	{ 0x0041, 0x030A, 0x00C5 }, /* LATIN CAPITAL LETTER A + COMBINING RING ABOVE = LATIN CAPITAL LETTER A WITH RING ABOVE */
+	{ 0x0043, 0x0327, 0x00C7 }, /* LATIN CAPITAL LETTER C + COMBINING CEDILLA = LATIN CAPITAL LETTER C WITH CEDILLA */
+	{ 0x0045, 0x0300, 0x00C8 }, /* LATIN CAPITAL LETTER E + COMBINING GRAVE ACCENT = LATIN CAPITAL LETTER E WITH GRAVE */
+	{ 0x0045, 0x0301, 0x00C9 }, /* LATIN CAPITAL LETTER E + COMBINING ACUTE ACCENT = LATIN CAPITAL LETTER E WITH ACUTE */
+	{ 0x0045, 0x0302, 0x00CA }, /* LATIN CAPITAL LETTER E + COMBINING CIRCUMFLEX ACCENT = LATIN CAPITAL LETTER E WITH CIRCUMFLEX */
+	{ 0x0045, 0x0308, 0x00CB }, /* LATIN CAPITAL LETTER E + COMBINING DIAERESIS = LATIN CAPITAL LETTER E WITH DIAERESIS */
+	{ 0x0049, 0x0300, 0x00CC }, /* LATIN CAPITAL LETTER I + COMBINING GRAVE ACCENT = LATIN CAPITAL LETTER I WITH GRAVE */
+	{ 0x0049, 0x0301, 0x00CD }, /* LATIN CAPITAL LETTER I + COMBINING ACUTE ACCENT = LATIN CAPITAL LETTER I WITH ACUTE */
+	{ 0x0049, 0x0302, 0x00CE }, /* LATIN CAPITAL LETTER I + COMBINING CIRCUMFLEX ACCENT = LATIN CAPITAL LETTER I WITH CIRCUMFLEX */
+	{ 0x0049, 0x0308, 0x00CF }, /* LATIN CAPITAL LETTER I + COMBINING DIAERESIS = LATIN CAPITAL LETTER I WITH DIAERESIS */
+	{ 0x004E, 0x0303, 0x00D1 }, /* LATIN CAPITAL LETTER N + COMBINING TILDE = LATIN CAPITAL LETTER N WITH TILDE */
+	{ 0x004F, 0x0300, 0x00D2 }, /* LATIN CAPITAL LETTER O + COMBINING GRAVE ACCENT = LATIN CAPITAL LETTER O WITH GRAVE */
+	{ 0x004F, 0x0301, 0x00D3 }, /* LATIN CAPITAL LETTER O + COMBINING ACUTE ACCENT = LATIN CAPITAL LETTER O WITH ACUTE */
+	{ 0x004F, 0x0302, 0x00D4 }, /* LATIN CAPITAL LETTER O + COMBINING CIRCUMFLEX ACCENT = LATIN CAPITAL LETTER O WITH CIRCUMFLEX */
+	{ 0x004F, 0x0303, 0x00D5 }, /* LATIN CAPITAL LETTER O + COMBINING TILDE = LATIN CAPITAL LETTER O WITH TILDE */
+	{ 0x004F, 0x0308, 0x00D6 }, /* LATIN CAPITAL LETTER O + COMBINING DIAERESIS = LATIN CAPITAL LETTER O WITH DIAERESIS */
+	{ 0x0055, 0x0300, 0x00D9 }, /* LATIN CAPITAL LETTER U + COMBINING GRAVE ACCENT = LATIN CAPITAL LETTER U WITH GRAVE */
+	{ 0x0055, 0x0301, 0x00DA }, /* LATIN CAPITAL LETTER U + COMBINING ACUTE ACCENT = LATIN CAPITAL LETTER U WITH ACUTE */
+	{ 0x0055, 0x0302, 0x00DB }, /* LATIN CAPITAL LETTER U + COMBINING CIRCUMFLEX ACCENT = LATIN CAPITAL LETTER U WITH CIRCUMFLEX */
+	{ 0x0055, 0x0308, 0x00DC }, /* LATIN CAPITAL LETTER U + COMBINING DIAERESIS = LATIN CAPITAL LETTER U WITH DIAERESIS */
+	{ 0x0059, 0x0301, 0x00DD }, /* LATIN CAPITAL LETTER Y + COMBINING ACUTE ACCENT = LATIN CAPITAL LETTER Y WITH ACUTE */
+	{ 0x0061, 0x0300, 0x00E0 }, /* LATIN SMALL LETTER A + COMBINING GRAVE ACCENT = LATIN SMALL LETTER A WITH GRAVE */
+	{ 0x0061, 0x0301, 0x00E1 }, /* LATIN SMALL LETTER A + COMBINING ACUTE ACCENT = LATIN SMALL LETTER A WITH ACUTE */
+	{ 0x0061, 0x0302, 0x00E2 }, /* LATIN SMALL LETTER A + COMBINING CIRCUMFLEX ACCENT = LATIN SMALL LETTER A WITH CIRCUMFLEX */
+	{ 0x0061, 0x0303, 0x00E3 }, /* LATIN SMALL LETTER A + COMBINING TILDE = LATIN SMALL LETTER A WITH TILDE */
+	{ 0x0061, 0x0308, 0x00E4 }, /* LATIN SMALL LETTER A + COMBINING DIAERESIS = LATIN SMALL LETTER A WITH DIAERESIS */
+	{ 0x0061, 0x030A, 0x00E5 }, /* LATIN SMALL LETTER A + COMBINING RING ABOVE = LATIN SMALL LETTER A WITH RING ABOVE */
+	{ 0x0063, 0x0327, 0x00E7 }, /* LATIN SMALL LETTER C + COMBINING CEDILLA = LATIN SMALL LETTER C WITH CEDILLA */
+	{ 0x0065, 0x0300, 0x00E8 }, /* LATIN SMALL LETTER E + COMBINING GRAVE ACCENT = LATIN SMALL LETTER E WITH GRAVE */
+	{ 0x0065, 0x0301, 0x00E9 }, /* LATIN SMALL LETTER E + COMBINING ACUTE ACCENT = LATIN SMALL LETTER E WITH ACUTE */
+	{ 0x0065, 0x0302, 0x00EA }, /* LATIN SMALL LETTER E + COMBINING CIRCUMFLEX ACCENT = LATIN SMALL LETTER E WITH CIRCUMFLEX */
+	{ 0x0065, 0x0308, 0x00EB }, /* LATIN SMALL LETTER E + COMBINING DIAERESIS = LATIN SMALL LETTER E WITH DIAERESIS */
+	{ 0x0069, 0x0300, 0x00EC }, /* LATIN SMALL LETTER I + COMBINING GRAVE ACCENT = LATIN SMALL LETTER I WITH GRAVE */
+	{ 0x0069, 0x0301, 0x00ED }, /* LATIN SMALL LETTER I + COMBINING ACUTE ACCENT = LATIN SMALL LETTER I WITH ACUTE */
+	{ 0x0069, 0x0302, 0x00EE }, /* LATIN SMALL LETTER I + COMBINING CIRCUMFLEX ACCENT = LATIN SMALL LETTER I WITH CIRCUMFLEX */
+	{ 0x0069, 0x0308, 0x00EF }, /* LATIN SMALL LETTER I + COMBINING DIAERESIS = LATIN SMALL LETTER I WITH DIAERESIS */
+	{ 0x006E, 0x0303, 0x00F1 }, /* LATIN SMALL LETTER N + COMBINING TILDE = LATIN SMALL LETTER N WITH TILDE */
+	{ 0x006F, 0x0300, 0x00F2 }, /* LATIN SMALL LETTER O + COMBINING GRAVE ACCENT = LATIN SMALL LETTER O WITH GRAVE */
+	{ 0x006F, 0x0301, 0x00F3 }, /* LATIN SMALL LETTER O + COMBINING ACUTE ACCENT = LATIN SMALL LETTER O WITH ACUTE */
+	{ 0x006F, 0x0302, 0x00F4 }, /* LATIN SMALL LETTER O + COMBINING CIRCUMFLEX ACCENT = LATIN SMALL LETTER O WITH CIRCUMFLEX */
+	{ 0x006F, 0x0303, 0x00F5 }, /* LATIN SMALL LETTER O + COMBINING TILDE = LATIN SMALL LETTER O WITH TILDE */
+	{ 0x006F, 0x0308, 0x00F6 }, /* LATIN SMALL LETTER O + COMBINING DIAERESIS = LATIN SMALL LETTER O WITH DIAERESIS */
+	{ 0x0075, 0x0300, 0x00F9 }, /* LATIN SMALL LETTER U + COMBINING GRAVE ACCENT = LATIN SMALL LETTER U WITH GRAVE */
+	{ 0x0075, 0x0301, 0x00FA }, /* LATIN SMALL LETTER U + COMBINING ACUTE ACCENT = LATIN SMALL LETTER U WITH ACUTE */
+	{ 0x0075, 0x0302, 0x00FB }, /* LATIN SMALL LETTER U + COMBINING CIRCUMFLEX ACCENT = LATIN SMALL LETTER U WITH CIRCUMFLEX */
+	{ 0x0075, 0x0308, 0x00FC }, /* LATIN SMALL LETTER U + COMBINING DIAERESIS = LATIN SMALL LETTER U WITH DIAERESIS */
+	{ 0x0079, 0x0301, 0x00FD }, /* LATIN SMALL LETTER Y + COMBINING ACUTE ACCENT = LATIN SMALL LETTER Y WITH ACUTE */
+	{ 0x0079, 0x0308, 0x00FF }, /* LATIN SMALL LETTER Y + COMBINING DIAERESIS = LATIN SMALL LETTER Y WITH DIAERESIS */
+	{ 0x0391, 0x0301, 0x0386 }, /* GREEK CAPITAL LETTER ALPHA + COMBINING ACUTE ACCENT = GREEK CAPITAL LETTER ALPHA WITH TONOS */
+	{ 0x0395, 0x0301, 0x0388 }, /* GREEK CAPITAL LETTER EPSILON + COMBINING ACUTE ACCENT = GREEK CAPITAL LETTER EPSILON WITH TONOS */
+	{ 0x0397, 0x0301, 0x0389 }, /* GREEK CAPITAL LETTER ETA + COMBINING ACUTE ACCENT = GREEK CAPITAL LETTER ETA WITH TONOS */
+	{ 0x0399, 0x0301, 0x038A }, /* GREEK CAPITAL LETTER IOTA + COMBINING ACUTE ACCENT = GREEK CAPITAL LETTER IOTA WITH TONOS */
+	{ 0x039F, 0x0301, 0x038C }, /* GREEK CAPITAL LETTER OMICRON + COMBINING ACUTE ACCENT = GREEK CAPITAL LETTER OMICRON WITH TONOS */
+	{ 0x03A5, 0x0301, 0x038E }, /* GREEK CAPITAL LETTER UPSILON + COMBINING ACUTE ACCENT = GREEK CAPITAL LETTER UPSILON WITH TONOS */
+	{ 0x03A9, 0x0301, 0x038F }, /* GREEK CAPITAL LETTER OMEGA + COMBINING ACUTE ACCENT = GREEK CAPITAL LETTER OMEGA WITH TONOS */
+	{ 0x03B1, 0x0301, 0x03AC }, /* GREEK SMALL LETTER ALPHA + COMBINING ACUTE ACCENT = GREEK SMALL LETTER ALPHA WITH TONOS */
+	{ 0x03B5, 0x0301, 0x03AD }, /* GREEK SMALL LETTER EPSILON + COMBINING ACUTE ACCENT = GREEK SMALL LETTER EPSILON WITH TONOS */
+	{ 0x03B7, 0x0301, 0x03AE }, /* GREEK SMALL LETTER ETA + COMBINING ACUTE ACCENT = GREEK SMALL LETTER ETA WITH TONOS */
+	{ 0x03B9, 0x0301, 0x03AF }, /* GREEK SMALL LETTER IOTA + COMBINING ACUTE ACCENT = GREEK SMALL LETTER IOTA WITH TONOS */
+	{ 0x03BF, 0x0301, 0x03CC }, /* GREEK SMALL LETTER OMICRON + COMBINING ACUTE ACCENT = GREEK SMALL LETTER OMICRON WITH TONOS */
+	{ 0x03C5, 0x0301, 0x03CD }, /* GREEK SMALL LETTER UPSILON + COMBINING ACUTE ACCENT = GREEK SMALL LETTER UPSILON WITH TONOS */
+	{ 0x03C9, 0x0301, 0x03CE }, /* GREEK SMALL LETTER OMEGA + COMBINING ACUTE ACCENT = GREEK SMALL LETTER OMEGA WITH TONOS */
+	{ 0x0418, 0x0306, 0x0419 }, /* CYRILLIC CAPITAL LETTER I + COMBINING BREVE = CYRILLIC CAPITAL LETTER SHORT I */
+	{ 0x0423, 0x0306, 0x040E }, /* CYRILLIC CAPITAL LETTER U + COMBINING BREVE = CYRILLIC CAPITAL LETTER SHORT U */
+	{ 0x0438, 0x0306, 0x0439 }, /* CYRILLIC SMALL LETTER I + COMBINING BREVE = CYRILLIC SMALL LETTER SHORT I */
+	{ 0x0443, 0x0306, 0x045E }, /* CYRILLIC SMALL LETTER U + COMBINING BREVE = CYRILLIC SMALL LETTER SHORT U */
+};
+
+/*
+ * Boundary values for quick rejection
+ * These are calculated by analyzing the table during generation
+ */
+#define MIN_BASE_CHAR       0x0041
+#define MAX_BASE_CHAR       0x0443
+#define MIN_COMBINING_CHAR  0x0300
+#define MAX_COMBINING_CHAR  0x0327
+
+struct compare_key {
+	uint16_t base;
+	uint16_t combining;
+};
+
+static int recomposition_compare(const void *key, const void *element)
+{
+	const struct compare_key *search_key = key;
+	const struct recomposition *table_entry = element;
+
+	/* Compare base character first */
+	if (search_key->base < table_entry->base)
+		return -1;
+	if (search_key->base > table_entry->base)
+		return 1;
+
+	/* Base characters match, now compare combining character */
+	if (search_key->combining < table_entry->combining)
+		return -1;
+	if (search_key->combining > table_entry->combining)
+		return 1;
+
+	/* Both match */
+	return 0;
+}
+
+/**
+ * ucs_recompose() - Attempt to recompose two Unicode characters into a single character.
+ * @base: Base Unicode code point (UCS-4)
+ * @combining: Combining mark Unicode code point (UCS-4)
+ *
+ * Return: Recomposed Unicode code point, or 0 if no recomposition is possible
+ */
+uint32_t ucs_recompose(uint32_t base, uint32_t combining)
+{
+	/* Check if characters are within the range of our table */
+	if (base < MIN_BASE_CHAR || base > MAX_BASE_CHAR ||
+	    combining < MIN_COMBINING_CHAR || combining > MAX_COMBINING_CHAR)
+		return 0;
+
+	struct compare_key key = { base, combining };
+
+	const struct recomposition *result =
+		__inline_bsearch(&key, recomposition_table,
+				 ARRAY_SIZE(recomposition_table),
+				 sizeof(*recomposition_table),
+				 recomposition_compare);
+
+	return result ? result->recomposed : 0;
+}
diff --git a/include/linux/consolemap.h b/include/linux/consolemap.h
index b3a9118666..4d3a34c288 100644
--- a/include/linux/consolemap.h
+++ b/include/linux/consolemap.h
@@ -30,6 +30,7 @@ int conv_uni_to_8bit(u32 uni);
 void console_map_init(void);
 bool ucs_is_double_width(uint32_t cp);
 bool ucs_is_zero_width(uint32_t cp);
+uint32_t ucs_recompose(uint32_t base, uint32_t combining);
 #else
 static inline u16 inverse_translate(const struct vc_data *conp, u16 glyph,
 		bool use_unicode)
@@ -69,6 +70,11 @@ static inline bool ucs_is_zero_width(uint32_t cp)
 {
 	return false;
 }
+
+static inline uint32_t ucs_recompose(uint32_t base, uint32_t combining)
+{
+	return 0;
+}
 #endif /* CONFIG_CONSOLE_TRANSLATIONS */
 
 #endif /* __LINUX_CONSOLEMAP_H__ */
-- 
2.49.0



* [PATCH 08/11] vt: support Unicode recomposition
  2025-04-10  1:13 [PATCH 00/11] vt: implement proper Unicode handling Nicolas Pitre
                   ` (6 preceding siblings ...)
  2025-04-10  1:13 ` [PATCH 07/11] vt: create ucs_recompose.c using gen_ucs_recompose.py Nicolas Pitre
@ 2025-04-10  1:14 ` Nicolas Pitre
  2025-04-10  1:14 ` [PATCH 09/11] vt: update gen_ucs_width.py to produce more space efficient tables Nicolas Pitre
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: Nicolas Pitre @ 2025-04-10  1:14 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Jiri Slaby
  Cc: Nicolas Pitre, Dave Mielke, linux-serial, linux-kernel

From: Nicolas Pitre <npitre@baylibre.com>

Try replacing any decomposed Unicode sequence with the corresponding
recomposed code point. Code point to glyph correspondence works best
after recomposition, and this applies mostly to single-width code points,
so we can't preserve them in their decomposed form anyway.

With all the infrastructure in place this is now trivial to do.
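
The behaviour described above can be modelled in user space with a short
Python sketch. This is only an approximation under stated assumptions:
recompose() stands in for ucs_recompose() and uses NFC normalization
rather than the kernel table, is_zero_width() only checks for nonspacing
marks, and write_cells() is a made-up model of the write path that
ignores double-width handling and variation selectors.

```python
import unicodedata

def recompose(prev, cur):
    """Stand-in for ucs_recompose(): 0 if the pair has no single-code-point form."""
    composed = unicodedata.normalize("NFC", chr(prev) + chr(cur))
    return ord(composed) if len(composed) == 1 else 0

def is_zero_width(cp):
    """Rough stand-in for ucs_is_zero_width(): nonspacing marks only."""
    return unicodedata.category(chr(cp)) == "Mn"

def write_cells(text):
    """Return the code points that end up in screen cells, one per column."""
    cells = []
    for ch in text:
        c = ord(ch)
        if is_zero_width(c):
            merged = recompose(cells[-1], c) if cells else 0
            if merged:
                cells[-1] = merged  # rewind one cell and overwrite it
            # otherwise the zero-width code point is ignored
            continue
        cells.append(c)
    return cells

# "e" followed by COMBINING ACUTE ACCENT occupies a single cell holding U+00E9.
assert write_cells("cafe\u0301") == [ord(ch) for ch in "caf\u00e9"]
# No precomposed form exists for x + acute, so the mark is dropped.
assert write_cells("x\u0301y") == [ord("x"), ord("y")]
```

In both cases the number of cells matches the number of columns a
Unicode-aware program expects the text to take.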

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
---
 drivers/tty/vt/vt.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
index 5d53feeb5d..e3d35c4f92 100644
--- a/drivers/tty/vt/vt.c
+++ b/drivers/tty/vt/vt.c
@@ -2953,8 +2953,15 @@ static int vc_con_write_normal(struct vc_data *vc, int tc, int c,
 				 * double-width.
 				 */
 			} else {
-				/* Otherwise zero-width code points are ignored */
-				goto out;
+				/* try recomposition */
+				prev_c = ucs_recompose(prev_c, c);
+				if (prev_c != 0) {
+					vc_con_rewind(vc);
+					c = prev_c;
+				} else {
+					/* Otherwise zero-width code points are ignored */
+					goto out;
+				}
 			}
 		}
 	}
-- 
2.49.0



* [PATCH 09/11] vt: update gen_ucs_width.py to produce more space efficient tables
  2025-04-10  1:13 [PATCH 00/11] vt: implement proper Unicode handling Nicolas Pitre
                   ` (7 preceding siblings ...)
  2025-04-10  1:14 ` [PATCH 08/11] vt: support Unicode recomposition Nicolas Pitre
@ 2025-04-10  1:14 ` Nicolas Pitre
  2025-04-14  7:14   ` Jiri Slaby
  2025-04-10  1:14 ` [PATCH 10/11] vt: update ucs_width.c following latest gen_ucs_width.py Nicolas Pitre
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 27+ messages in thread
From: Nicolas Pitre @ 2025-04-10  1:14 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Jiri Slaby
  Cc: Nicolas Pitre, Dave Mielke, linux-serial, linux-kernel

From: Nicolas Pitre <npitre@baylibre.com>

Split table ranges into BMP (16-bit) and non-BMP (above 16-bit).
This reduces the corresponding text size by 20-25%.
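
To make the size argument concrete, here is a small Python sketch of the
split together with a byte count. The ranges below are made up for the
example and are not real width data, and table_bytes() is a hypothetical
helper that assumes two bounds per interval with no padding. The actual
saving depends on how many ranges fall within the BMP.

```python
def split_ranges_by_size(ranges):
    """Split (start, end) ranges into BMP and non-BMP lists."""
    bmp, non_bmp = [], []
    for start, end in ranges:
        if end <= 0xFFFF:
            bmp.append((start, end))
        elif start > 0xFFFF:
            non_bmp.append((start, end))
        else:
            # The range straddles the BMP boundary: split it at 0xFFFF.
            bmp.append((start, 0xFFFF))
            non_bmp.append((0x10000, end))
    return bmp, non_bmp

def table_bytes(ranges, bytes_per_bound):
    """Size of a table holding two bounds per range."""
    return len(ranges) * 2 * bytes_per_bound

# Made-up ranges, including one that straddles the BMP boundary.
ranges = [(0x0300, 0x036F), (0x200B, 0x200F),
          (0xFFF0, 0x1000F), (0x1F300, 0x1F64F)]
bmp, non_bmp = split_ranges_by_size(ranges)

before = table_bytes(ranges, 4)                           # all uint32_t bounds
after = table_bytes(bmp, 2) + table_bytes(non_bmp, 4)     # uint16_t + uint32_t

assert bmp == [(0x0300, 0x036F), (0x200B, 0x200F), (0xFFF0, 0xFFFF)]
assert non_bmp == [(0x10000, 0x1000F), (0x1F300, 0x1F64F)]
assert (before, after) == (32, 28)
```

Each BMP interval shrinks from 8 to 4 bytes, while a range that
straddles the boundary costs one extra entry, so the gain comes from
tables dominated by BMP ranges.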

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
---
 drivers/tty/vt/gen_ucs_width.py | 154 +++++++++++++++++++++++---------
 1 file changed, 113 insertions(+), 41 deletions(-)

diff --git a/drivers/tty/vt/gen_ucs_width.py b/drivers/tty/vt/gen_ucs_width.py
index 41997fe001..c6cbc93e83 100755
--- a/drivers/tty/vt/gen_ucs_width.py
+++ b/drivers/tty/vt/gen_ucs_width.py
@@ -132,13 +132,49 @@ def generate_ucs_width():
         ranges.append((start, prev))
         return ranges
 
+    # Function to split ranges into BMP (16-bit) and non-BMP (above 16-bit)
+    def split_ranges_by_size(ranges):
+        bmp_ranges = []
+        non_bmp_ranges = []
+
+        for start, end in ranges:
+            if end <= 0xFFFF:
+                bmp_ranges.append((start, end))
+            elif start > 0xFFFF:
+                non_bmp_ranges.append((start, end))
+            else:
+                # Split the range at 0xFFFF
+                bmp_ranges.append((start, 0xFFFF))
+                non_bmp_ranges.append((0x10000, end))
+
+        return bmp_ranges, non_bmp_ranges
+
     # Extract ranges for each width
     zero_width_ranges = ranges_optimize(width_map, 0)
     double_width_ranges = ranges_optimize(width_map, 2)
 
+    # Split ranges into BMP and non-BMP
+    zero_width_bmp, zero_width_non_bmp = split_ranges_by_size(zero_width_ranges)
+    double_width_bmp, double_width_non_bmp = split_ranges_by_size(double_width_ranges)
+
     # Get Unicode version information
     unicode_version = unicodedata.unidata_version
 
+    # Function to generate code point description comments
+    def get_code_point_comment(start, end):
+        try:
+            start_char_desc = unicodedata.name(chr(start))
+            if start == end:
+                return f"/* {start_char_desc} */"
+            else:
+                end_char_desc = unicodedata.name(chr(end))
+                return f"/* {start_char_desc} - {end_char_desc} */"
+        except ValueError:
+            if start == end:
+                return f"/* U+{start:04X} */"
+            else:
+                return f"/* U+{start:04X} - U+{end:04X} */"
+
     # Generate C implementation file
     with open(c_file, 'w') as f:
         f.write(f"""\
@@ -156,62 +192,77 @@ def generate_ucs_width():
 #include <linux/bsearch.h>
 #include <linux/consolemap.h>
 
-struct interval {{
+struct interval16 {{
+	uint16_t first;
+	uint16_t last;
+}};
+
+struct interval32 {{
 	uint32_t first;
 	uint32_t last;
 }};
 
-/* Zero-width character ranges */
-static const struct interval zero_width_ranges[] = {{
+/* Zero-width character ranges (BMP - Basic Multilingual Plane, U+0000 to U+FFFF) */
+static const struct interval16 zero_width_bmp[] = {{
 """)
 
-        for start, end in zero_width_ranges:
-            try:
-                start_char_desc = unicodedata.name(chr(start)) if start < 0x10000 else f"U+{start:05X}"
-                if start == end:
-                    comment = f"/* {start_char_desc} */"
-                else:
-                    end_char_desc = unicodedata.name(chr(end)) if end < 0x10000 else f"U+{end:05X}"
-                    comment = f"/* {start_char_desc} - {end_char_desc} */"
-            except:
-                if start == end:
-                    comment = f"/* U+{start:05X} */"
-                else:
-                    comment = f"/* U+{start:05X} - U+{end:05X} */"
+        for start, end in zero_width_bmp:
+            comment = get_code_point_comment(start, end)
+            f.write(f"\t{{ 0x{start:04X}, 0x{end:04X} }}, {comment}\n")
+
+        f.write("""\
+};
 
+/* Zero-width character ranges (non-BMP, U+10000 and above) */
+static const struct interval32 zero_width_non_bmp[] = {
+""")
+
+        for start, end in zero_width_non_bmp:
+            comment = get_code_point_comment(start, end)
             f.write(f"\t{{ 0x{start:05X}, 0x{end:05X} }}, {comment}\n")
 
         f.write("""\
 };
 
-/* Double-width character ranges */
-static const struct interval double_width_ranges[] = {
+/* Double-width character ranges (BMP - Basic Multilingual Plane, U+0000 to U+FFFF) */
+static const struct interval16 double_width_bmp[] = {
 """)
 
-        for start, end in double_width_ranges:
-            try:
-                start_char_desc = unicodedata.name(chr(start)) if start < 0x10000 else f"U+{start:05X}"
-                if start == end:
-                    comment = f"/* {start_char_desc} */"
-                else:
-                    end_char_desc = unicodedata.name(chr(end)) if end < 0x10000 else f"U+{end:05X}"
-                    comment = f"/* {start_char_desc} - {end_char_desc} */"
-            except:
-                if start == end:
-                    comment = f"/* U+{start:05X} */"
-                else:
-                    comment = f"/* U+{start:05X} - U+{end:05X} */"
+        for start, end in double_width_bmp:
+            comment = get_code_point_comment(start, end)
+            f.write(f"\t{{ 0x{start:04X}, 0x{end:04X} }}, {comment}\n")
+
+        f.write("""\
+};
 
+/* Double-width character ranges (non-BMP, U+10000 and above) */
+static const struct interval32 double_width_non_bmp[] = {
+""")
+
+        for start, end in double_width_non_bmp:
+            comment = get_code_point_comment(start, end)
             f.write(f"\t{{ 0x{start:05X}, 0x{end:05X} }}, {comment}\n")
 
         f.write("""\
 };
 
 
-static int ucs_cmp(const void *key, const void *element)
+static int ucs_cmp16(const void *key, const void *element)
+{
+	uint16_t cp = *(uint16_t *)key;
+	const struct interval16 *e = element;
+
+	if (cp > e->last)
+		return 1;
+	if (cp < e->first)
+		return -1;
+	return 0;
+}
+
+static int ucs_cmp32(const void *key, const void *element)
 {
 	uint32_t cp = *(uint32_t *)key;
-	const struct interval *e = element;
+	const struct interval32 *e = element;
 
 	if (cp > e->last)
 		return 1;
@@ -220,13 +271,22 @@ static int ucs_cmp(const void *key, const void *element)
 	return 0;
 }
 
-static bool is_in_interval(uint32_t cp, const struct interval *intervals, size_t count)
+static bool is_in_interval16(uint16_t cp, const struct interval16 *intervals, size_t count)
 {
 	if (cp < intervals[0].first || cp > intervals[count - 1].last)
 		return false;
 
 	return __inline_bsearch(&cp, intervals, count,
-				sizeof(*intervals), ucs_cmp) != NULL;
+				sizeof(*intervals), ucs_cmp16) != NULL;
+}
+
+static bool is_in_interval32(uint32_t cp, const struct interval32 *intervals, size_t count)
+{
+	if (cp < intervals[0].first || cp > intervals[count - 1].last)
+		return false;
+
+	return __inline_bsearch(&cp, intervals, count,
+				sizeof(*intervals), ucs_cmp32) != NULL;
 }
 
 /**
@@ -237,7 +297,9 @@ static bool is_in_interval(uint32_t cp, const struct interval *intervals, size_t
  */
 bool ucs_is_zero_width(uint32_t cp)
 {
-	return is_in_interval(cp, zero_width_ranges, ARRAY_SIZE(zero_width_ranges));
+	return (cp <= 0xFFFF)
+	       ? is_in_interval16(cp, zero_width_bmp, ARRAY_SIZE(zero_width_bmp))
+	       : is_in_interval32(cp, zero_width_non_bmp, ARRAY_SIZE(zero_width_non_bmp));
 }
 
 /**
@@ -248,17 +310,27 @@ bool ucs_is_zero_width(uint32_t cp)
  */
 bool ucs_is_double_width(uint32_t cp)
 {
-	return is_in_interval(cp, double_width_ranges, ARRAY_SIZE(double_width_ranges));
+	return (cp <= 0xFFFF)
+	       ? is_in_interval16(cp, double_width_bmp, ARRAY_SIZE(double_width_bmp))
+	       : is_in_interval32(cp, double_width_non_bmp, ARRAY_SIZE(double_width_non_bmp));
 }
 """)
 
     # Print summary
-    zero_width_count = sum(end - start + 1 for start, end in zero_width_ranges)
-    double_width_count = sum(end - start + 1 for start, end in double_width_ranges)
+    zero_width_bmp_count = sum(end - start + 1 for start, end in zero_width_bmp)
+    zero_width_non_bmp_count = sum(end - start + 1 for start, end in zero_width_non_bmp)
+    double_width_bmp_count = sum(end - start + 1 for start, end in double_width_bmp)
+    double_width_non_bmp_count = sum(end - start + 1 for start, end in double_width_non_bmp)
+
+    total_zero_width = zero_width_bmp_count + zero_width_non_bmp_count
+    total_double_width = double_width_bmp_count + double_width_non_bmp_count
 
     print(f"Generated {c_file} with:")
-    print(f"- {len(zero_width_ranges)} zero-width ranges covering ~{zero_width_count} code points")
-    print(f"- {len(double_width_ranges)} double-width ranges covering ~{double_width_count} code points")
+    print(f"- {len(zero_width_bmp)} zero-width BMP ranges (16-bit) covering ~{zero_width_bmp_count} code points")
+    print(f"- {len(zero_width_non_bmp)} zero-width non-BMP ranges (32-bit) covering ~{zero_width_non_bmp_count} code points")
+    print(f"- {len(double_width_bmp)} double-width BMP ranges (16-bit) covering ~{double_width_bmp_count} code points")
+    print(f"- {len(double_width_non_bmp)} double-width non-BMP ranges (32-bit) covering ~{double_width_non_bmp_count} code points")
+    print(f"Total: {len(zero_width_bmp) + len(zero_width_non_bmp) + len(double_width_bmp) + len(double_width_non_bmp)} ranges covering ~{total_zero_width + total_double_width} code points")
 
 if __name__ == "__main__":
     generate_ucs_width()
-- 
2.49.0
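
The lookup path generated by the patch above (pick the 16-bit or 32-bit table based on the code point, then binary-search it) can be modelled in a few lines of Python. This is a rough sketch for experimentation only: the function names in_intervals and is_zero_width are hypothetical Python stand-ins, the tables are tiny excerpts of ranges that appear in this thread, and bisect replaces the kernel's __inline_bsearch().

```python
from bisect import bisect_right

# Small excerpts of zero-width ranges quoted in this thread (illustration only).
ZERO_WIDTH_BMP = [(0x00AD, 0x00AD), (0x0300, 0x036F), (0x0483, 0x0489)]
ZERO_WIDTH_NON_BMP = [(0x101FD, 0x101FD), (0x1F3FB, 0x1F3FF), (0xE0100, 0xE01EF)]


def in_intervals(cp, intervals):
    """Binary search over sorted, non-overlapping inclusive ranges."""
    # Same early exit as the C helpers: outside the table's overall span.
    if not intervals or cp < intervals[0][0] or cp > intervals[-1][1]:
        return False
    starts = [first for first, _ in intervals]
    # Index of the last range whose start is <= cp.
    i = bisect_right(starts, cp) - 1
    first, last = intervals[i]
    return first <= cp <= last


def is_zero_width(cp):
    """Dispatch on the BMP boundary, then search the matching table."""
    if cp <= 0xFFFF:
        return in_intervals(cp, ZERO_WIDTH_BMP)
    return in_intervals(cp, ZERO_WIDTH_NON_BMP)


for cp in (0x0041, 0x0301, 0x1F3FD, 0x1F600):
    print(f"U+{cp:04X}: zero-width = {is_zero_width(cp)}")
```

The dispatch costs one extra comparison per lookup, and in exchange each search runs over a shorter table.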



* [PATCH 10/11] vt: update ucs_width.c following latest gen_ucs_width.py
  2025-04-10  1:13 [PATCH 00/11] vt: implement proper Unicode handling Nicolas Pitre
                   ` (8 preceding siblings ...)
  2025-04-10  1:14 ` [PATCH 09/11] vt: update gen_ucs_width.py to produce more space efficient tables Nicolas Pitre
@ 2025-04-10  1:14 ` Nicolas Pitre
  2025-04-14  7:17   ` Jiri Slaby
  2025-04-10  1:14 ` [PATCH 11/11] vt: pad double-width code points with a zero-white-space Nicolas Pitre
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 27+ messages in thread
From: Nicolas Pitre @ 2025-04-10  1:14 UTC
  To: Greg Kroah-Hartman, Jiri Slaby
  Cc: Nicolas Pitre, Dave Mielke, linux-serial, linux-kernel

From: Nicolas Pitre <npitre@baylibre.com>

Split the table ranges into BMP (code points up to U+FFFF, stored as
16-bit values) and non-BMP (code points above U+FFFF, stored as 32-bit
values). This reduces the corresponding text size by 20-25%.

Note: scripts/checkpatch.pl complains about "... exceeds 100 columns".
      Please ignore.

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
---
 drivers/tty/vt/ucs_width.c | 902 +++++++++++++++++++------------------
 1 file changed, 470 insertions(+), 432 deletions(-)
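
To get a feel for where the saving comes from, the table data can be estimated with simple arithmetic. The sketch below assumes that struct interval16 occupies 4 bytes and struct interval32 occupies 8 bytes (two fields each, no padding). The entry counts are round illustrative numbers, not measurements from this patch, and the helper table_bytes is hypothetical.

```python
def table_bytes(n_bmp, n_non_bmp, split):
    """Estimated bytes of table data for the given entry counts."""
    if split:
        # BMP entries as two uint16_t, non-BMP entries as two uint32_t.
        return n_bmp * 4 + n_non_bmp * 8
    # Single table: every entry is two uint32_t.
    return (n_bmp + n_non_bmp) * 8


# Illustrative counts only (assumed, not taken from the generated file).
n_bmp, n_non_bmp = 200, 140
before = table_bytes(n_bmp, n_non_bmp, split=False)
after = table_bytes(n_bmp, n_non_bmp, split=True)
saving = 100 * (before - after) / before
print(f"{before} -> {after} bytes of table data ({saving:.1f}% smaller)")
```

This models the table data only. It ignores the second comparison function and the dispatch added to the lookup code, so it need not match the figure quoted in the commit message.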

diff --git a/drivers/tty/vt/ucs_width.c b/drivers/tty/vt/ucs_width.c
index 47b22583bd..060aa8ae7f 100644
--- a/drivers/tty/vt/ucs_width.c
+++ b/drivers/tty/vt/ucs_width.c
@@ -12,452 +12,477 @@
 #include <linux/bsearch.h>
 #include <linux/consolemap.h>
 
-struct interval {
+struct interval16 {
+	uint16_t first;
+	uint16_t last;
+};
+
+struct interval32 {
 	uint32_t first;
 	uint32_t last;
 };
 
-/* Zero-width character ranges */
-static const struct interval zero_width_ranges[] = {
-	{ 0x000AD, 0x000AD }, /* SOFT HYPHEN */
-	{ 0x00300, 0x0036F }, /* COMBINING GRAVE ACCENT - COMBINING LATIN SMALL LETTER X */
-	{ 0x00483, 0x00489 }, /* COMBINING CYRILLIC TITLO - COMBINING CYRILLIC MILLIONS SIGN */
-	{ 0x00591, 0x005BD }, /* HEBREW ACCENT ETNAHTA - HEBREW POINT METEG */
-	{ 0x005BF, 0x005BF }, /* HEBREW POINT RAFE */
-	{ 0x005C1, 0x005C2 }, /* HEBREW POINT SHIN DOT - HEBREW POINT SIN DOT */
-	{ 0x005C4, 0x005C5 }, /* HEBREW MARK UPPER DOT - HEBREW MARK LOWER DOT */
-	{ 0x005C7, 0x005C7 }, /* HEBREW POINT QAMATS QATAN */
-	{ 0x00600, 0x00605 }, /* ARABIC NUMBER SIGN - ARABIC NUMBER MARK ABOVE */
-	{ 0x00610, 0x0061A }, /* ARABIC SIGN SALLALLAHOU ALAYHE WASSALLAM - ARABIC SMALL KASRA */
-	{ 0x0064B, 0x0065F }, /* ARABIC FATHATAN - ARABIC WAVY HAMZA BELOW */
-	{ 0x00670, 0x00670 }, /* ARABIC LETTER SUPERSCRIPT ALEF */
-	{ 0x006D6, 0x006DC }, /* ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA - ARABIC SMALL HIGH SEEN */
-	{ 0x006DF, 0x006E4 }, /* ARABIC SMALL HIGH ROUNDED ZERO - ARABIC SMALL HIGH MADDA */
-	{ 0x006E7, 0x006E8 }, /* ARABIC SMALL HIGH YEH - ARABIC SMALL HIGH NOON */
-	{ 0x006EA, 0x006ED }, /* ARABIC EMPTY CENTRE LOW STOP - ARABIC SMALL LOW MEEM */
-	{ 0x00711, 0x00711 }, /* SYRIAC LETTER SUPERSCRIPT ALAPH */
-	{ 0x00730, 0x0074A }, /* SYRIAC PTHAHA ABOVE - SYRIAC BARREKH */
-	{ 0x007A6, 0x007B0 }, /* THAANA ABAFILI - THAANA SUKUN */
-	{ 0x007EB, 0x007F3 }, /* NKO COMBINING SHORT HIGH TONE - NKO COMBINING DOUBLE DOT ABOVE */
-	{ 0x007FD, 0x007FD }, /* NKO DANTAYALAN */
-	{ 0x00816, 0x00819 }, /* SAMARITAN MARK IN - SAMARITAN MARK DAGESH */
-	{ 0x0081B, 0x00823 }, /* SAMARITAN MARK EPENTHETIC YUT - SAMARITAN VOWEL SIGN A */
-	{ 0x00825, 0x00827 }, /* SAMARITAN VOWEL SIGN SHORT A - SAMARITAN VOWEL SIGN U */
-	{ 0x00829, 0x0082D }, /* SAMARITAN VOWEL SIGN LONG I - SAMARITAN MARK NEQUDAA */
-	{ 0x00859, 0x0085B }, /* MANDAIC AFFRICATION MARK - MANDAIC GEMINATION MARK */
-	{ 0x00890, 0x00891 }, /* ARABIC POUND MARK ABOVE - ARABIC PIASTRE MARK ABOVE */
-	{ 0x00897, 0x0089F }, /* ARABIC PEPET - ARABIC HALF MADDA OVER MADDA */
-	{ 0x008CA, 0x00903 }, /* ARABIC SMALL HIGH FARSI YEH - DEVANAGARI SIGN VISARGA */
-	{ 0x0093A, 0x0093C }, /* DEVANAGARI VOWEL SIGN OE - DEVANAGARI SIGN NUKTA */
-	{ 0x0093E, 0x0094F }, /* DEVANAGARI VOWEL SIGN AA - DEVANAGARI VOWEL SIGN AW */
-	{ 0x00951, 0x00957 }, /* DEVANAGARI STRESS SIGN UDATTA - DEVANAGARI VOWEL SIGN UUE */
-	{ 0x00962, 0x00963 }, /* DEVANAGARI VOWEL SIGN VOCALIC L - DEVANAGARI VOWEL SIGN VOCALIC LL */
-	{ 0x00981, 0x00983 }, /* BENGALI SIGN CANDRABINDU - BENGALI SIGN VISARGA */
-	{ 0x009BC, 0x009BC }, /* BENGALI SIGN NUKTA */
-	{ 0x009BE, 0x009C4 }, /* BENGALI VOWEL SIGN AA - BENGALI VOWEL SIGN VOCALIC RR */
-	{ 0x009C7, 0x009C8 }, /* BENGALI VOWEL SIGN E - BENGALI VOWEL SIGN AI */
-	{ 0x009CB, 0x009CD }, /* BENGALI VOWEL SIGN O - BENGALI SIGN VIRAMA */
-	{ 0x009D7, 0x009D7 }, /* BENGALI AU LENGTH MARK */
-	{ 0x009E2, 0x009E3 }, /* BENGALI VOWEL SIGN VOCALIC L - BENGALI VOWEL SIGN VOCALIC LL */
-	{ 0x009FE, 0x009FE }, /* BENGALI SANDHI MARK */
-	{ 0x00A01, 0x00A03 }, /* GURMUKHI SIGN ADAK BINDI - GURMUKHI SIGN VISARGA */
-	{ 0x00A3C, 0x00A3C }, /* GURMUKHI SIGN NUKTA */
-	{ 0x00A3E, 0x00A42 }, /* GURMUKHI VOWEL SIGN AA - GURMUKHI VOWEL SIGN UU */
-	{ 0x00A47, 0x00A48 }, /* GURMUKHI VOWEL SIGN EE - GURMUKHI VOWEL SIGN AI */
-	{ 0x00A4B, 0x00A4D }, /* GURMUKHI VOWEL SIGN OO - GURMUKHI SIGN VIRAMA */
-	{ 0x00A51, 0x00A51 }, /* GURMUKHI SIGN UDAAT */
-	{ 0x00A70, 0x00A71 }, /* GURMUKHI TIPPI - GURMUKHI ADDAK */
-	{ 0x00A75, 0x00A75 }, /* GURMUKHI SIGN YAKASH */
-	{ 0x00A81, 0x00A83 }, /* GUJARATI SIGN CANDRABINDU - GUJARATI SIGN VISARGA */
-	{ 0x00ABC, 0x00ABC }, /* GUJARATI SIGN NUKTA */
-	{ 0x00ABE, 0x00AC5 }, /* GUJARATI VOWEL SIGN AA - GUJARATI VOWEL SIGN CANDRA E */
-	{ 0x00AC7, 0x00AC9 }, /* GUJARATI VOWEL SIGN E - GUJARATI VOWEL SIGN CANDRA O */
-	{ 0x00ACB, 0x00ACD }, /* GUJARATI VOWEL SIGN O - GUJARATI SIGN VIRAMA */
-	{ 0x00AE2, 0x00AE3 }, /* GUJARATI VOWEL SIGN VOCALIC L - GUJARATI VOWEL SIGN VOCALIC LL */
-	{ 0x00AFA, 0x00AFF }, /* GUJARATI SIGN SUKUN - GUJARATI SIGN TWO-CIRCLE NUKTA ABOVE */
-	{ 0x00B01, 0x00B03 }, /* ORIYA SIGN CANDRABINDU - ORIYA SIGN VISARGA */
-	{ 0x00B3C, 0x00B3C }, /* ORIYA SIGN NUKTA */
-	{ 0x00B3E, 0x00B44 }, /* ORIYA VOWEL SIGN AA - ORIYA VOWEL SIGN VOCALIC RR */
-	{ 0x00B47, 0x00B48 }, /* ORIYA VOWEL SIGN E - ORIYA VOWEL SIGN AI */
-	{ 0x00B4B, 0x00B4D }, /* ORIYA VOWEL SIGN O - ORIYA SIGN VIRAMA */
-	{ 0x00B55, 0x00B57 }, /* ORIYA SIGN OVERLINE - ORIYA AU LENGTH MARK */
-	{ 0x00B62, 0x00B63 }, /* ORIYA VOWEL SIGN VOCALIC L - ORIYA VOWEL SIGN VOCALIC LL */
-	{ 0x00B82, 0x00B82 }, /* TAMIL SIGN ANUSVARA */
-	{ 0x00BBE, 0x00BC2 }, /* TAMIL VOWEL SIGN AA - TAMIL VOWEL SIGN UU */
-	{ 0x00BC6, 0x00BC8 }, /* TAMIL VOWEL SIGN E - TAMIL VOWEL SIGN AI */
-	{ 0x00BCA, 0x00BCD }, /* TAMIL VOWEL SIGN O - TAMIL SIGN VIRAMA */
-	{ 0x00BD7, 0x00BD7 }, /* TAMIL AU LENGTH MARK */
-	{ 0x00C00, 0x00C04 }, /* TELUGU SIGN COMBINING CANDRABINDU ABOVE - TELUGU SIGN COMBINING ANUSVARA ABOVE */
-	{ 0x00C3C, 0x00C3C }, /* TELUGU SIGN NUKTA */
-	{ 0x00C3E, 0x00C44 }, /* TELUGU VOWEL SIGN AA - TELUGU VOWEL SIGN VOCALIC RR */
-	{ 0x00C46, 0x00C48 }, /* TELUGU VOWEL SIGN E - TELUGU VOWEL SIGN AI */
-	{ 0x00C4A, 0x00C4D }, /* TELUGU VOWEL SIGN O - TELUGU SIGN VIRAMA */
-	{ 0x00C55, 0x00C56 }, /* TELUGU LENGTH MARK - TELUGU AI LENGTH MARK */
-	{ 0x00C62, 0x00C63 }, /* TELUGU VOWEL SIGN VOCALIC L - TELUGU VOWEL SIGN VOCALIC LL */
-	{ 0x00C81, 0x00C83 }, /* KANNADA SIGN CANDRABINDU - KANNADA SIGN VISARGA */
-	{ 0x00CBC, 0x00CBC }, /* KANNADA SIGN NUKTA */
-	{ 0x00CBE, 0x00CC4 }, /* KANNADA VOWEL SIGN AA - KANNADA VOWEL SIGN VOCALIC RR */
-	{ 0x00CC6, 0x00CC8 }, /* KANNADA VOWEL SIGN E - KANNADA VOWEL SIGN AI */
-	{ 0x00CCA, 0x00CCD }, /* KANNADA VOWEL SIGN O - KANNADA SIGN VIRAMA */
-	{ 0x00CD5, 0x00CD6 }, /* KANNADA LENGTH MARK - KANNADA AI LENGTH MARK */
-	{ 0x00CE2, 0x00CE3 }, /* KANNADA VOWEL SIGN VOCALIC L - KANNADA VOWEL SIGN VOCALIC LL */
-	{ 0x00CF3, 0x00CF3 }, /* KANNADA SIGN COMBINING ANUSVARA ABOVE RIGHT */
-	{ 0x00D00, 0x00D03 }, /* MALAYALAM SIGN COMBINING ANUSVARA ABOVE - MALAYALAM SIGN VISARGA */
-	{ 0x00D3B, 0x00D3C }, /* MALAYALAM SIGN VERTICAL BAR VIRAMA - MALAYALAM SIGN CIRCULAR VIRAMA */
-	{ 0x00D3E, 0x00D44 }, /* MALAYALAM VOWEL SIGN AA - MALAYALAM VOWEL SIGN VOCALIC RR */
-	{ 0x00D46, 0x00D48 }, /* MALAYALAM VOWEL SIGN E - MALAYALAM VOWEL SIGN AI */
-	{ 0x00D4A, 0x00D4D }, /* MALAYALAM VOWEL SIGN O - MALAYALAM SIGN VIRAMA */
-	{ 0x00D57, 0x00D57 }, /* MALAYALAM AU LENGTH MARK */
-	{ 0x00D62, 0x00D63 }, /* MALAYALAM VOWEL SIGN VOCALIC L - MALAYALAM VOWEL SIGN VOCALIC LL */
-	{ 0x00D81, 0x00D83 }, /* SINHALA SIGN CANDRABINDU - SINHALA SIGN VISARGAYA */
-	{ 0x00DCA, 0x00DCA }, /* SINHALA SIGN AL-LAKUNA */
-	{ 0x00DCF, 0x00DD4 }, /* SINHALA VOWEL SIGN AELA-PILLA - SINHALA VOWEL SIGN KETTI PAA-PILLA */
-	{ 0x00DD6, 0x00DD6 }, /* SINHALA VOWEL SIGN DIGA PAA-PILLA */
-	{ 0x00DD8, 0x00DDF }, /* SINHALA VOWEL SIGN GAETTA-PILLA - SINHALA VOWEL SIGN GAYANUKITTA */
-	{ 0x00DF2, 0x00DF3 }, /* SINHALA VOWEL SIGN DIGA GAETTA-PILLA - SINHALA VOWEL SIGN DIGA GAYANUKITTA */
-	{ 0x00E31, 0x00E31 }, /* THAI CHARACTER MAI HAN-AKAT */
-	{ 0x00E34, 0x00E3A }, /* THAI CHARACTER SARA I - THAI CHARACTER PHINTHU */
-	{ 0x00E47, 0x00E4E }, /* THAI CHARACTER MAITAIKHU - THAI CHARACTER YAMAKKAN */
-	{ 0x00EB1, 0x00EB1 }, /* LAO VOWEL SIGN MAI KAN */
-	{ 0x00EB4, 0x00EBC }, /* LAO VOWEL SIGN I - LAO SEMIVOWEL SIGN LO */
-	{ 0x00EC8, 0x00ECE }, /* LAO TONE MAI EK - LAO YAMAKKAN */
-	{ 0x00F18, 0x00F19 }, /* TIBETAN ASTROLOGICAL SIGN -KHYUD PA - TIBETAN ASTROLOGICAL SIGN SDONG TSHUGS */
-	{ 0x00F35, 0x00F35 }, /* TIBETAN MARK NGAS BZUNG NYI ZLA */
-	{ 0x00F37, 0x00F37 }, /* TIBETAN MARK NGAS BZUNG SGOR RTAGS */
-	{ 0x00F39, 0x00F39 }, /* TIBETAN MARK TSA -PHRU */
-	{ 0x00F3E, 0x00F3F }, /* TIBETAN SIGN YAR TSHES - TIBETAN SIGN MAR TSHES */
-	{ 0x00F71, 0x00F84 }, /* TIBETAN VOWEL SIGN AA - TIBETAN MARK HALANTA */
-	{ 0x00F86, 0x00F87 }, /* TIBETAN SIGN LCI RTAGS - TIBETAN SIGN YANG RTAGS */
-	{ 0x00F8D, 0x00F97 }, /* TIBETAN SUBJOINED SIGN LCE TSA CAN - TIBETAN SUBJOINED LETTER JA */
-	{ 0x00F99, 0x00FBC }, /* TIBETAN SUBJOINED LETTER NYA - TIBETAN SUBJOINED LETTER FIXED-FORM RA */
-	{ 0x00FC6, 0x00FC6 }, /* TIBETAN SYMBOL PADMA GDAN */
-	{ 0x0102B, 0x0103E }, /* MYANMAR VOWEL SIGN TALL AA - MYANMAR CONSONANT SIGN MEDIAL HA */
-	{ 0x01056, 0x01059 }, /* MYANMAR VOWEL SIGN VOCALIC R - MYANMAR VOWEL SIGN VOCALIC LL */
-	{ 0x0105E, 0x01060 }, /* MYANMAR CONSONANT SIGN MON MEDIAL NA - MYANMAR CONSONANT SIGN MON MEDIAL LA */
-	{ 0x01062, 0x01064 }, /* MYANMAR VOWEL SIGN SGAW KAREN EU - MYANMAR TONE MARK SGAW KAREN KE PHO */
-	{ 0x01067, 0x0106D }, /* MYANMAR VOWEL SIGN WESTERN PWO KAREN EU - MYANMAR SIGN WESTERN PWO KAREN TONE-5 */
-	{ 0x01071, 0x01074 }, /* MYANMAR VOWEL SIGN GEBA KAREN I - MYANMAR VOWEL SIGN KAYAH EE */
-	{ 0x01082, 0x0108D }, /* MYANMAR CONSONANT SIGN SHAN MEDIAL WA - MYANMAR SIGN SHAN COUNCIL EMPHATIC TONE */
-	{ 0x0108F, 0x0108F }, /* MYANMAR SIGN RUMAI PALAUNG TONE-5 */
-	{ 0x0109A, 0x0109D }, /* MYANMAR SIGN KHAMTI TONE-1 - MYANMAR VOWEL SIGN AITON AI */
-	{ 0x0135D, 0x0135F }, /* ETHIOPIC COMBINING GEMINATION AND VOWEL LENGTH MARK - ETHIOPIC COMBINING GEMINATION MARK */
-	{ 0x01712, 0x01715 }, /* TAGALOG VOWEL SIGN I - TAGALOG SIGN PAMUDPOD */
-	{ 0x01732, 0x01734 }, /* HANUNOO VOWEL SIGN I - HANUNOO SIGN PAMUDPOD */
-	{ 0x01752, 0x01753 }, /* BUHID VOWEL SIGN I - BUHID VOWEL SIGN U */
-	{ 0x01772, 0x01773 }, /* TAGBANWA VOWEL SIGN I - TAGBANWA VOWEL SIGN U */
-	{ 0x017B4, 0x017D3 }, /* KHMER VOWEL INHERENT AQ - KHMER SIGN BATHAMASAT */
-	{ 0x017DD, 0x017DD }, /* KHMER SIGN ATTHACAN */
-	{ 0x0180B, 0x0180D }, /* MONGOLIAN FREE VARIATION SELECTOR ONE - MONGOLIAN FREE VARIATION SELECTOR THREE */
-	{ 0x0180F, 0x0180F }, /* MONGOLIAN FREE VARIATION SELECTOR FOUR */
-	{ 0x01885, 0x01886 }, /* MONGOLIAN LETTER ALI GALI BALUDA - MONGOLIAN LETTER ALI GALI THREE BALUDA */
-	{ 0x018A9, 0x018A9 }, /* MONGOLIAN LETTER ALI GALI DAGALGA */
-	{ 0x01920, 0x0192B }, /* LIMBU VOWEL SIGN A - LIMBU SUBJOINED LETTER WA */
-	{ 0x01930, 0x0193B }, /* LIMBU SMALL LETTER KA - LIMBU SIGN SA-I */
-	{ 0x01A17, 0x01A1B }, /* BUGINESE VOWEL SIGN I - BUGINESE VOWEL SIGN AE */
-	{ 0x01A55, 0x01A5E }, /* TAI THAM CONSONANT SIGN MEDIAL RA - TAI THAM CONSONANT SIGN SA */
-	{ 0x01A60, 0x01A7C }, /* TAI THAM SIGN SAKOT - TAI THAM SIGN KHUEN-LUE KARAN */
-	{ 0x01A7F, 0x01A7F }, /* TAI THAM COMBINING CRYPTOGRAMMIC DOT */
-	{ 0x01AB0, 0x01ACE }, /* COMBINING DOUBLED CIRCUMFLEX ACCENT - COMBINING LATIN SMALL LETTER INSULAR T */
-	{ 0x01B00, 0x01B04 }, /* BALINESE SIGN ULU RICEM - BALINESE SIGN BISAH */
-	{ 0x01B34, 0x01B44 }, /* BALINESE SIGN REREKAN - BALINESE ADEG ADEG */
-	{ 0x01B6B, 0x01B73 }, /* BALINESE MUSICAL SYMBOL COMBINING TEGEH - BALINESE MUSICAL SYMBOL COMBINING GONG */
-	{ 0x01B80, 0x01B82 }, /* SUNDANESE SIGN PANYECEK - SUNDANESE SIGN PANGWISAD */
-	{ 0x01BA1, 0x01BAD }, /* SUNDANESE CONSONANT SIGN PAMINGKAL - SUNDANESE CONSONANT SIGN PASANGAN WA */
-	{ 0x01BE6, 0x01BF3 }, /* BATAK SIGN TOMPI - BATAK PANONGONAN */
-	{ 0x01C24, 0x01C37 }, /* LEPCHA SUBJOINED LETTER YA - LEPCHA SIGN NUKTA */
-	{ 0x01CD0, 0x01CD2 }, /* VEDIC TONE KARSHANA - VEDIC TONE PRENKHA */
-	{ 0x01CD4, 0x01CE8 }, /* VEDIC SIGN YAJURVEDIC MIDLINE SVARITA - VEDIC SIGN VISARGA ANUDATTA WITH TAIL */
-	{ 0x01CED, 0x01CED }, /* VEDIC SIGN TIRYAK */
-	{ 0x01CF4, 0x01CF4 }, /* VEDIC TONE CANDRA ABOVE */
-	{ 0x01CF7, 0x01CF9 }, /* VEDIC SIGN ATIKRAMA - VEDIC TONE DOUBLE RING ABOVE */
-	{ 0x01DC0, 0x01DFF }, /* COMBINING DOTTED GRAVE ACCENT - COMBINING RIGHT ARROWHEAD AND DOWN ARROWHEAD BELOW */
-	{ 0x0200B, 0x0200E }, /* ZERO WIDTH SPACE - LEFT-TO-RIGHT MARK */
-	{ 0x0202A, 0x0202D }, /* LEFT-TO-RIGHT EMBEDDING - LEFT-TO-RIGHT OVERRIDE */
-	{ 0x02060, 0x02064 }, /* WORD JOINER - INVISIBLE PLUS */
-	{ 0x0206A, 0x0206F }, /* INHIBIT SYMMETRIC SWAPPING - NOMINAL DIGIT SHAPES */
-	{ 0x020D0, 0x020F0 }, /* COMBINING LEFT HARPOON ABOVE - COMBINING ASTERISK ABOVE */
-	{ 0x02640, 0x02640 }, /* FEMALE SIGN */
-	{ 0x02642, 0x02642 }, /* MALE SIGN */
-	{ 0x026A7, 0x026A7 }, /* MALE WITH STROKE AND MALE AND FEMALE SIGN */
-	{ 0x02CEF, 0x02CF1 }, /* COPTIC COMBINING NI ABOVE - COPTIC COMBINING SPIRITUS LENIS */
-	{ 0x02D7F, 0x02D7F }, /* TIFINAGH CONSONANT JOINER */
-	{ 0x02DE0, 0x02DFF }, /* COMBINING CYRILLIC LETTER BE - COMBINING CYRILLIC LETTER IOTIFIED BIG YUS */
-	{ 0x0302A, 0x0302F }, /* IDEOGRAPHIC LEVEL TONE MARK - HANGUL DOUBLE DOT TONE MARK */
-	{ 0x03099, 0x0309A }, /* COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK - COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK */
-	{ 0x0A66F, 0x0A672 }, /* COMBINING CYRILLIC VZMET - COMBINING CYRILLIC THOUSAND MILLIONS SIGN */
-	{ 0x0A674, 0x0A67D }, /* COMBINING CYRILLIC LETTER UKRAINIAN IE - COMBINING CYRILLIC PAYEROK */
-	{ 0x0A69E, 0x0A69F }, /* COMBINING CYRILLIC LETTER EF - COMBINING CYRILLIC LETTER IOTIFIED E */
-	{ 0x0A6F0, 0x0A6F1 }, /* BAMUM COMBINING MARK KOQNDON - BAMUM COMBINING MARK TUKWENTIS */
-	{ 0x0A802, 0x0A802 }, /* SYLOTI NAGRI SIGN DVISVARA */
-	{ 0x0A806, 0x0A806 }, /* SYLOTI NAGRI SIGN HASANTA */
-	{ 0x0A80B, 0x0A80B }, /* SYLOTI NAGRI SIGN ANUSVARA */
-	{ 0x0A823, 0x0A827 }, /* SYLOTI NAGRI VOWEL SIGN A - SYLOTI NAGRI VOWEL SIGN OO */
-	{ 0x0A82C, 0x0A82C }, /* SYLOTI NAGRI SIGN ALTERNATE HASANTA */
-	{ 0x0A880, 0x0A881 }, /* SAURASHTRA SIGN ANUSVARA - SAURASHTRA SIGN VISARGA */
-	{ 0x0A8B4, 0x0A8C5 }, /* SAURASHTRA CONSONANT SIGN HAARU - SAURASHTRA SIGN CANDRABINDU */
-	{ 0x0A8E0, 0x0A8F1 }, /* COMBINING DEVANAGARI DIGIT ZERO - COMBINING DEVANAGARI SIGN AVAGRAHA */
-	{ 0x0A8FF, 0x0A8FF }, /* DEVANAGARI VOWEL SIGN AY */
-	{ 0x0A926, 0x0A92D }, /* KAYAH LI VOWEL UE - KAYAH LI TONE CALYA PLOPHU */
-	{ 0x0A947, 0x0A953 }, /* REJANG VOWEL SIGN I - REJANG VIRAMA */
-	{ 0x0A980, 0x0A983 }, /* JAVANESE SIGN PANYANGGA - JAVANESE SIGN WIGNYAN */
-	{ 0x0A9B3, 0x0A9C0 }, /* JAVANESE SIGN CECAK TELU - JAVANESE PANGKON */
-	{ 0x0A9E5, 0x0A9E5 }, /* MYANMAR SIGN SHAN SAW */
-	{ 0x0AA29, 0x0AA36 }, /* CHAM VOWEL SIGN AA - CHAM CONSONANT SIGN WA */
-	{ 0x0AA43, 0x0AA43 }, /* CHAM CONSONANT SIGN FINAL NG */
-	{ 0x0AA4C, 0x0AA4D }, /* CHAM CONSONANT SIGN FINAL M - CHAM CONSONANT SIGN FINAL H */
-	{ 0x0AA7B, 0x0AA7D }, /* MYANMAR SIGN PAO KAREN TONE - MYANMAR SIGN TAI LAING TONE-5 */
-	{ 0x0AAB0, 0x0AAB0 }, /* TAI VIET MAI KANG */
-	{ 0x0AAB2, 0x0AAB4 }, /* TAI VIET VOWEL I - TAI VIET VOWEL U */
-	{ 0x0AAB7, 0x0AAB8 }, /* TAI VIET MAI KHIT - TAI VIET VOWEL IA */
-	{ 0x0AABE, 0x0AABF }, /* TAI VIET VOWEL AM - TAI VIET TONE MAI EK */
-	{ 0x0AAC1, 0x0AAC1 }, /* TAI VIET TONE MAI THO */
-	{ 0x0AAEB, 0x0AAEF }, /* MEETEI MAYEK VOWEL SIGN II - MEETEI MAYEK VOWEL SIGN AAU */
-	{ 0x0AAF5, 0x0AAF6 }, /* MEETEI MAYEK VOWEL SIGN VISARGA - MEETEI MAYEK VIRAMA */
-	{ 0x0ABE3, 0x0ABEA }, /* MEETEI MAYEK VOWEL SIGN ONAP - MEETEI MAYEK VOWEL SIGN NUNG */
-	{ 0x0ABEC, 0x0ABED }, /* MEETEI MAYEK LUM IYEK - MEETEI MAYEK APUN IYEK */
-	{ 0x0FB1E, 0x0FB1E }, /* HEBREW POINT JUDEO-SPANISH VARIKA */
-	{ 0x0FE00, 0x0FE0F }, /* VARIATION SELECTOR-1 - VARIATION SELECTOR-16 */
-	{ 0x0FE20, 0x0FE2F }, /* COMBINING LIGATURE LEFT HALF - COMBINING CYRILLIC TITLO RIGHT HALF */
-	{ 0x0FEFF, 0x0FEFF }, /* ZERO WIDTH NO-BREAK SPACE */
-	{ 0x0FFF9, 0x0FFFB }, /* INTERLINEAR ANNOTATION ANCHOR - INTERLINEAR ANNOTATION TERMINATOR */
-	{ 0x101FD, 0x101FD }, /* U+101FD */
-	{ 0x102E0, 0x102E0 }, /* U+102E0 */
-	{ 0x10376, 0x1037A }, /* U+10376 - U+1037A */
-	{ 0x10A01, 0x10A03 }, /* U+10A01 - U+10A03 */
-	{ 0x10A05, 0x10A06 }, /* U+10A05 - U+10A06 */
-	{ 0x10A0C, 0x10A0F }, /* U+10A0C - U+10A0F */
-	{ 0x10A38, 0x10A3A }, /* U+10A38 - U+10A3A */
-	{ 0x10A3F, 0x10A3F }, /* U+10A3F */
-	{ 0x10AE5, 0x10AE6 }, /* U+10AE5 - U+10AE6 */
-	{ 0x10D24, 0x10D27 }, /* U+10D24 - U+10D27 */
-	{ 0x10D69, 0x10D6D }, /* U+10D69 - U+10D6D */
-	{ 0x10EAB, 0x10EAC }, /* U+10EAB - U+10EAC */
-	{ 0x10EFC, 0x10EFF }, /* U+10EFC - U+10EFF */
-	{ 0x10F46, 0x10F50 }, /* U+10F46 - U+10F50 */
-	{ 0x10F82, 0x10F85 }, /* U+10F82 - U+10F85 */
-	{ 0x11000, 0x11002 }, /* U+11000 - U+11002 */
-	{ 0x11038, 0x11046 }, /* U+11038 - U+11046 */
-	{ 0x11070, 0x11070 }, /* U+11070 */
-	{ 0x11073, 0x11074 }, /* U+11073 - U+11074 */
-	{ 0x1107F, 0x11082 }, /* U+1107F - U+11082 */
-	{ 0x110B0, 0x110BA }, /* U+110B0 - U+110BA */
-	{ 0x110BD, 0x110BD }, /* U+110BD */
-	{ 0x110C2, 0x110C2 }, /* U+110C2 */
-	{ 0x110CD, 0x110CD }, /* U+110CD */
-	{ 0x11100, 0x11102 }, /* U+11100 - U+11102 */
-	{ 0x11127, 0x11134 }, /* U+11127 - U+11134 */
-	{ 0x11145, 0x11146 }, /* U+11145 - U+11146 */
-	{ 0x11173, 0x11173 }, /* U+11173 */
-	{ 0x11180, 0x11182 }, /* U+11180 - U+11182 */
-	{ 0x111B3, 0x111C0 }, /* U+111B3 - U+111C0 */
-	{ 0x111C9, 0x111CC }, /* U+111C9 - U+111CC */
-	{ 0x111CE, 0x111CF }, /* U+111CE - U+111CF */
-	{ 0x1122C, 0x11237 }, /* U+1122C - U+11237 */
-	{ 0x1123E, 0x1123E }, /* U+1123E */
-	{ 0x11241, 0x11241 }, /* U+11241 */
-	{ 0x112DF, 0x112EA }, /* U+112DF - U+112EA */
-	{ 0x11300, 0x11303 }, /* U+11300 - U+11303 */
-	{ 0x1133B, 0x1133C }, /* U+1133B - U+1133C */
-	{ 0x1133E, 0x11344 }, /* U+1133E - U+11344 */
-	{ 0x11347, 0x11348 }, /* U+11347 - U+11348 */
-	{ 0x1134B, 0x1134D }, /* U+1134B - U+1134D */
-	{ 0x11357, 0x11357 }, /* U+11357 */
-	{ 0x11362, 0x11363 }, /* U+11362 - U+11363 */
-	{ 0x11366, 0x1136C }, /* U+11366 - U+1136C */
-	{ 0x11370, 0x11374 }, /* U+11370 - U+11374 */
-	{ 0x113B8, 0x113C0 }, /* U+113B8 - U+113C0 */
-	{ 0x113C2, 0x113C2 }, /* U+113C2 */
-	{ 0x113C5, 0x113C5 }, /* U+113C5 */
-	{ 0x113C7, 0x113CA }, /* U+113C7 - U+113CA */
-	{ 0x113CC, 0x113D0 }, /* U+113CC - U+113D0 */
-	{ 0x113D2, 0x113D2 }, /* U+113D2 */
-	{ 0x113E1, 0x113E2 }, /* U+113E1 - U+113E2 */
-	{ 0x11435, 0x11446 }, /* U+11435 - U+11446 */
-	{ 0x1145E, 0x1145E }, /* U+1145E */
-	{ 0x114B0, 0x114C3 }, /* U+114B0 - U+114C3 */
-	{ 0x115AF, 0x115B5 }, /* U+115AF - U+115B5 */
-	{ 0x115B8, 0x115C0 }, /* U+115B8 - U+115C0 */
-	{ 0x115DC, 0x115DD }, /* U+115DC - U+115DD */
-	{ 0x11630, 0x11640 }, /* U+11630 - U+11640 */
-	{ 0x116AB, 0x116B7 }, /* U+116AB - U+116B7 */
-	{ 0x1171D, 0x1172B }, /* U+1171D - U+1172B */
-	{ 0x1182C, 0x1183A }, /* U+1182C - U+1183A */
-	{ 0x11930, 0x11935 }, /* U+11930 - U+11935 */
-	{ 0x11937, 0x11938 }, /* U+11937 - U+11938 */
-	{ 0x1193B, 0x1193E }, /* U+1193B - U+1193E */
-	{ 0x11940, 0x11940 }, /* U+11940 */
-	{ 0x11942, 0x11943 }, /* U+11942 - U+11943 */
-	{ 0x119D1, 0x119D7 }, /* U+119D1 - U+119D7 */
-	{ 0x119DA, 0x119E0 }, /* U+119DA - U+119E0 */
-	{ 0x119E4, 0x119E4 }, /* U+119E4 */
-	{ 0x11A01, 0x11A0A }, /* U+11A01 - U+11A0A */
-	{ 0x11A33, 0x11A39 }, /* U+11A33 - U+11A39 */
-	{ 0x11A3B, 0x11A3E }, /* U+11A3B - U+11A3E */
-	{ 0x11A47, 0x11A47 }, /* U+11A47 */
-	{ 0x11A51, 0x11A5B }, /* U+11A51 - U+11A5B */
-	{ 0x11A8A, 0x11A99 }, /* U+11A8A - U+11A99 */
-	{ 0x11C2F, 0x11C36 }, /* U+11C2F - U+11C36 */
-	{ 0x11C38, 0x11C3F }, /* U+11C38 - U+11C3F */
-	{ 0x11C92, 0x11CA7 }, /* U+11C92 - U+11CA7 */
-	{ 0x11CA9, 0x11CB6 }, /* U+11CA9 - U+11CB6 */
-	{ 0x11D31, 0x11D36 }, /* U+11D31 - U+11D36 */
-	{ 0x11D3A, 0x11D3A }, /* U+11D3A */
-	{ 0x11D3C, 0x11D3D }, /* U+11D3C - U+11D3D */
-	{ 0x11D3F, 0x11D45 }, /* U+11D3F - U+11D45 */
-	{ 0x11D47, 0x11D47 }, /* U+11D47 */
-	{ 0x11D8A, 0x11D8E }, /* U+11D8A - U+11D8E */
-	{ 0x11D90, 0x11D91 }, /* U+11D90 - U+11D91 */
-	{ 0x11D93, 0x11D97 }, /* U+11D93 - U+11D97 */
-	{ 0x11EF3, 0x11EF6 }, /* U+11EF3 - U+11EF6 */
-	{ 0x11F00, 0x11F01 }, /* U+11F00 - U+11F01 */
-	{ 0x11F03, 0x11F03 }, /* U+11F03 */
-	{ 0x11F34, 0x11F3A }, /* U+11F34 - U+11F3A */
-	{ 0x11F3E, 0x11F42 }, /* U+11F3E - U+11F42 */
-	{ 0x11F5A, 0x11F5A }, /* U+11F5A */
-	{ 0x13430, 0x13440 }, /* U+13430 - U+13440 */
-	{ 0x13447, 0x13455 }, /* U+13447 - U+13455 */
-	{ 0x1611E, 0x1612F }, /* U+1611E - U+1612F */
-	{ 0x16AF0, 0x16AF4 }, /* U+16AF0 - U+16AF4 */
-	{ 0x16B30, 0x16B36 }, /* U+16B30 - U+16B36 */
-	{ 0x16F4F, 0x16F4F }, /* U+16F4F */
-	{ 0x16F51, 0x16F87 }, /* U+16F51 - U+16F87 */
-	{ 0x16F8F, 0x16F92 }, /* U+16F8F - U+16F92 */
-	{ 0x16FE4, 0x16FE4 }, /* U+16FE4 */
-	{ 0x16FF0, 0x16FF1 }, /* U+16FF0 - U+16FF1 */
-	{ 0x1BC9D, 0x1BC9E }, /* U+1BC9D - U+1BC9E */
-	{ 0x1BCA0, 0x1BCA3 }, /* U+1BCA0 - U+1BCA3 */
-	{ 0x1CF00, 0x1CF2D }, /* U+1CF00 - U+1CF2D */
-	{ 0x1CF30, 0x1CF46 }, /* U+1CF30 - U+1CF46 */
-	{ 0x1D165, 0x1D169 }, /* U+1D165 - U+1D169 */
-	{ 0x1D16D, 0x1D182 }, /* U+1D16D - U+1D182 */
-	{ 0x1D185, 0x1D18B }, /* U+1D185 - U+1D18B */
-	{ 0x1D1AA, 0x1D1AD }, /* U+1D1AA - U+1D1AD */
-	{ 0x1D242, 0x1D244 }, /* U+1D242 - U+1D244 */
-	{ 0x1DA00, 0x1DA36 }, /* U+1DA00 - U+1DA36 */
-	{ 0x1DA3B, 0x1DA6C }, /* U+1DA3B - U+1DA6C */
-	{ 0x1DA75, 0x1DA75 }, /* U+1DA75 */
-	{ 0x1DA84, 0x1DA84 }, /* U+1DA84 */
-	{ 0x1DA9B, 0x1DA9F }, /* U+1DA9B - U+1DA9F */
-	{ 0x1DAA1, 0x1DAAF }, /* U+1DAA1 - U+1DAAF */
-	{ 0x1E000, 0x1E006 }, /* U+1E000 - U+1E006 */
-	{ 0x1E008, 0x1E018 }, /* U+1E008 - U+1E018 */
-	{ 0x1E01B, 0x1E021 }, /* U+1E01B - U+1E021 */
-	{ 0x1E023, 0x1E024 }, /* U+1E023 - U+1E024 */
-	{ 0x1E026, 0x1E02A }, /* U+1E026 - U+1E02A */
-	{ 0x1E08F, 0x1E08F }, /* U+1E08F */
-	{ 0x1E130, 0x1E136 }, /* U+1E130 - U+1E136 */
-	{ 0x1E2AE, 0x1E2AE }, /* U+1E2AE */
-	{ 0x1E2EC, 0x1E2EF }, /* U+1E2EC - U+1E2EF */
-	{ 0x1E4EC, 0x1E4EF }, /* U+1E4EC - U+1E4EF */
-	{ 0x1E5EE, 0x1E5EF }, /* U+1E5EE - U+1E5EF */
-	{ 0x1E8D0, 0x1E8D6 }, /* U+1E8D0 - U+1E8D6 */
-	{ 0x1E944, 0x1E94A }, /* U+1E944 - U+1E94A */
-	{ 0x1F3FB, 0x1F3FF }, /* U+1F3FB - U+1F3FF */
-	{ 0x1F9B0, 0x1F9B3 }, /* U+1F9B0 - U+1F9B3 */
-	{ 0xE0001, 0xE0001 }, /* U+E0001 */
-	{ 0xE0020, 0xE007F }, /* U+E0020 - U+E007F */
-	{ 0xE0100, 0xE01EF }, /* U+E0100 - U+E01EF */
+/* Zero-width character ranges (BMP - Basic Multilingual Plane, U+0000 to U+FFFF) */
+static const struct interval16 zero_width_bmp[] = {
+	{ 0x00AD, 0x00AD }, /* SOFT HYPHEN */
+	{ 0x0300, 0x036F }, /* COMBINING GRAVE ACCENT - COMBINING LATIN SMALL LETTER X */
+	{ 0x0483, 0x0489 }, /* COMBINING CYRILLIC TITLO - COMBINING CYRILLIC MILLIONS SIGN */
+	{ 0x0591, 0x05BD }, /* HEBREW ACCENT ETNAHTA - HEBREW POINT METEG */
+	{ 0x05BF, 0x05BF }, /* HEBREW POINT RAFE */
+	{ 0x05C1, 0x05C2 }, /* HEBREW POINT SHIN DOT - HEBREW POINT SIN DOT */
+	{ 0x05C4, 0x05C5 }, /* HEBREW MARK UPPER DOT - HEBREW MARK LOWER DOT */
+	{ 0x05C7, 0x05C7 }, /* HEBREW POINT QAMATS QATAN */
+	{ 0x0600, 0x0605 }, /* ARABIC NUMBER SIGN - ARABIC NUMBER MARK ABOVE */
+	{ 0x0610, 0x061A }, /* ARABIC SIGN SALLALLAHOU ALAYHE WASSALLAM - ARABIC SMALL KASRA */
+	{ 0x064B, 0x065F }, /* ARABIC FATHATAN - ARABIC WAVY HAMZA BELOW */
+	{ 0x0670, 0x0670 }, /* ARABIC LETTER SUPERSCRIPT ALEF */
+	{ 0x06D6, 0x06DC }, /* ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA - ARABIC SMALL HIGH SEEN */
+	{ 0x06DF, 0x06E4 }, /* ARABIC SMALL HIGH ROUNDED ZERO - ARABIC SMALL HIGH MADDA */
+	{ 0x06E7, 0x06E8 }, /* ARABIC SMALL HIGH YEH - ARABIC SMALL HIGH NOON */
+	{ 0x06EA, 0x06ED }, /* ARABIC EMPTY CENTRE LOW STOP - ARABIC SMALL LOW MEEM */
+	{ 0x0711, 0x0711 }, /* SYRIAC LETTER SUPERSCRIPT ALAPH */
+	{ 0x0730, 0x074A }, /* SYRIAC PTHAHA ABOVE - SYRIAC BARREKH */
+	{ 0x07A6, 0x07B0 }, /* THAANA ABAFILI - THAANA SUKUN */
+	{ 0x07EB, 0x07F3 }, /* NKO COMBINING SHORT HIGH TONE - NKO COMBINING DOUBLE DOT ABOVE */
+	{ 0x07FD, 0x07FD }, /* NKO DANTAYALAN */
+	{ 0x0816, 0x0819 }, /* SAMARITAN MARK IN - SAMARITAN MARK DAGESH */
+	{ 0x081B, 0x0823 }, /* SAMARITAN MARK EPENTHETIC YUT - SAMARITAN VOWEL SIGN A */
+	{ 0x0825, 0x0827 }, /* SAMARITAN VOWEL SIGN SHORT A - SAMARITAN VOWEL SIGN U */
+	{ 0x0829, 0x082D }, /* SAMARITAN VOWEL SIGN LONG I - SAMARITAN MARK NEQUDAA */
+	{ 0x0859, 0x085B }, /* MANDAIC AFFRICATION MARK - MANDAIC GEMINATION MARK */
+	{ 0x0890, 0x0891 }, /* ARABIC POUND MARK ABOVE - ARABIC PIASTRE MARK ABOVE */
+	{ 0x0897, 0x089F }, /* ARABIC PEPET - ARABIC HALF MADDA OVER MADDA */
+	{ 0x08CA, 0x0903 }, /* ARABIC SMALL HIGH FARSI YEH - DEVANAGARI SIGN VISARGA */
+	{ 0x093A, 0x093C }, /* DEVANAGARI VOWEL SIGN OE - DEVANAGARI SIGN NUKTA */
+	{ 0x093E, 0x094F }, /* DEVANAGARI VOWEL SIGN AA - DEVANAGARI VOWEL SIGN AW */
+	{ 0x0951, 0x0957 }, /* DEVANAGARI STRESS SIGN UDATTA - DEVANAGARI VOWEL SIGN UUE */
+	{ 0x0962, 0x0963 }, /* DEVANAGARI VOWEL SIGN VOCALIC L - DEVANAGARI VOWEL SIGN VOCALIC LL */
+	{ 0x0981, 0x0983 }, /* BENGALI SIGN CANDRABINDU - BENGALI SIGN VISARGA */
+	{ 0x09BC, 0x09BC }, /* BENGALI SIGN NUKTA */
+	{ 0x09BE, 0x09C4 }, /* BENGALI VOWEL SIGN AA - BENGALI VOWEL SIGN VOCALIC RR */
+	{ 0x09C7, 0x09C8 }, /* BENGALI VOWEL SIGN E - BENGALI VOWEL SIGN AI */
+	{ 0x09CB, 0x09CD }, /* BENGALI VOWEL SIGN O - BENGALI SIGN VIRAMA */
+	{ 0x09D7, 0x09D7 }, /* BENGALI AU LENGTH MARK */
+	{ 0x09E2, 0x09E3 }, /* BENGALI VOWEL SIGN VOCALIC L - BENGALI VOWEL SIGN VOCALIC LL */
+	{ 0x09FE, 0x09FE }, /* BENGALI SANDHI MARK */
+	{ 0x0A01, 0x0A03 }, /* GURMUKHI SIGN ADAK BINDI - GURMUKHI SIGN VISARGA */
+	{ 0x0A3C, 0x0A3C }, /* GURMUKHI SIGN NUKTA */
+	{ 0x0A3E, 0x0A42 }, /* GURMUKHI VOWEL SIGN AA - GURMUKHI VOWEL SIGN UU */
+	{ 0x0A47, 0x0A48 }, /* GURMUKHI VOWEL SIGN EE - GURMUKHI VOWEL SIGN AI */
+	{ 0x0A4B, 0x0A4D }, /* GURMUKHI VOWEL SIGN OO - GURMUKHI SIGN VIRAMA */
+	{ 0x0A51, 0x0A51 }, /* GURMUKHI SIGN UDAAT */
+	{ 0x0A70, 0x0A71 }, /* GURMUKHI TIPPI - GURMUKHI ADDAK */
+	{ 0x0A75, 0x0A75 }, /* GURMUKHI SIGN YAKASH */
+	{ 0x0A81, 0x0A83 }, /* GUJARATI SIGN CANDRABINDU - GUJARATI SIGN VISARGA */
+	{ 0x0ABC, 0x0ABC }, /* GUJARATI SIGN NUKTA */
+	{ 0x0ABE, 0x0AC5 }, /* GUJARATI VOWEL SIGN AA - GUJARATI VOWEL SIGN CANDRA E */
+	{ 0x0AC7, 0x0AC9 }, /* GUJARATI VOWEL SIGN E - GUJARATI VOWEL SIGN CANDRA O */
+	{ 0x0ACB, 0x0ACD }, /* GUJARATI VOWEL SIGN O - GUJARATI SIGN VIRAMA */
+	{ 0x0AE2, 0x0AE3 }, /* GUJARATI VOWEL SIGN VOCALIC L - GUJARATI VOWEL SIGN VOCALIC LL */
+	{ 0x0AFA, 0x0AFF }, /* GUJARATI SIGN SUKUN - GUJARATI SIGN TWO-CIRCLE NUKTA ABOVE */
+	{ 0x0B01, 0x0B03 }, /* ORIYA SIGN CANDRABINDU - ORIYA SIGN VISARGA */
+	{ 0x0B3C, 0x0B3C }, /* ORIYA SIGN NUKTA */
+	{ 0x0B3E, 0x0B44 }, /* ORIYA VOWEL SIGN AA - ORIYA VOWEL SIGN VOCALIC RR */
+	{ 0x0B47, 0x0B48 }, /* ORIYA VOWEL SIGN E - ORIYA VOWEL SIGN AI */
+	{ 0x0B4B, 0x0B4D }, /* ORIYA VOWEL SIGN O - ORIYA SIGN VIRAMA */
+	{ 0x0B55, 0x0B57 }, /* ORIYA SIGN OVERLINE - ORIYA AU LENGTH MARK */
+	{ 0x0B62, 0x0B63 }, /* ORIYA VOWEL SIGN VOCALIC L - ORIYA VOWEL SIGN VOCALIC LL */
+	{ 0x0B82, 0x0B82 }, /* TAMIL SIGN ANUSVARA */
+	{ 0x0BBE, 0x0BC2 }, /* TAMIL VOWEL SIGN AA - TAMIL VOWEL SIGN UU */
+	{ 0x0BC6, 0x0BC8 }, /* TAMIL VOWEL SIGN E - TAMIL VOWEL SIGN AI */
+	{ 0x0BCA, 0x0BCD }, /* TAMIL VOWEL SIGN O - TAMIL SIGN VIRAMA */
+	{ 0x0BD7, 0x0BD7 }, /* TAMIL AU LENGTH MARK */
+	{ 0x0C00, 0x0C04 }, /* TELUGU SIGN COMBINING CANDRABINDU ABOVE - TELUGU SIGN COMBINING ANUSVARA ABOVE */
+	{ 0x0C3C, 0x0C3C }, /* TELUGU SIGN NUKTA */
+	{ 0x0C3E, 0x0C44 }, /* TELUGU VOWEL SIGN AA - TELUGU VOWEL SIGN VOCALIC RR */
+	{ 0x0C46, 0x0C48 }, /* TELUGU VOWEL SIGN E - TELUGU VOWEL SIGN AI */
+	{ 0x0C4A, 0x0C4D }, /* TELUGU VOWEL SIGN O - TELUGU SIGN VIRAMA */
+	{ 0x0C55, 0x0C56 }, /* TELUGU LENGTH MARK - TELUGU AI LENGTH MARK */
+	{ 0x0C62, 0x0C63 }, /* TELUGU VOWEL SIGN VOCALIC L - TELUGU VOWEL SIGN VOCALIC LL */
+	{ 0x0C81, 0x0C83 }, /* KANNADA SIGN CANDRABINDU - KANNADA SIGN VISARGA */
+	{ 0x0CBC, 0x0CBC }, /* KANNADA SIGN NUKTA */
+	{ 0x0CBE, 0x0CC4 }, /* KANNADA VOWEL SIGN AA - KANNADA VOWEL SIGN VOCALIC RR */
+	{ 0x0CC6, 0x0CC8 }, /* KANNADA VOWEL SIGN E - KANNADA VOWEL SIGN AI */
+	{ 0x0CCA, 0x0CCD }, /* KANNADA VOWEL SIGN O - KANNADA SIGN VIRAMA */
+	{ 0x0CD5, 0x0CD6 }, /* KANNADA LENGTH MARK - KANNADA AI LENGTH MARK */
+	{ 0x0CE2, 0x0CE3 }, /* KANNADA VOWEL SIGN VOCALIC L - KANNADA VOWEL SIGN VOCALIC LL */
+	{ 0x0CF3, 0x0CF3 }, /* KANNADA SIGN COMBINING ANUSVARA ABOVE RIGHT */
+	{ 0x0D00, 0x0D03 }, /* MALAYALAM SIGN COMBINING ANUSVARA ABOVE - MALAYALAM SIGN VISARGA */
+	{ 0x0D3B, 0x0D3C }, /* MALAYALAM SIGN VERTICAL BAR VIRAMA - MALAYALAM SIGN CIRCULAR VIRAMA */
+	{ 0x0D3E, 0x0D44 }, /* MALAYALAM VOWEL SIGN AA - MALAYALAM VOWEL SIGN VOCALIC RR */
+	{ 0x0D46, 0x0D48 }, /* MALAYALAM VOWEL SIGN E - MALAYALAM VOWEL SIGN AI */
+	{ 0x0D4A, 0x0D4D }, /* MALAYALAM VOWEL SIGN O - MALAYALAM SIGN VIRAMA */
+	{ 0x0D57, 0x0D57 }, /* MALAYALAM AU LENGTH MARK */
+	{ 0x0D62, 0x0D63 }, /* MALAYALAM VOWEL SIGN VOCALIC L - MALAYALAM VOWEL SIGN VOCALIC LL */
+	{ 0x0D81, 0x0D83 }, /* SINHALA SIGN CANDRABINDU - SINHALA SIGN VISARGAYA */
+	{ 0x0DCA, 0x0DCA }, /* SINHALA SIGN AL-LAKUNA */
+	{ 0x0DCF, 0x0DD4 }, /* SINHALA VOWEL SIGN AELA-PILLA - SINHALA VOWEL SIGN KETTI PAA-PILLA */
+	{ 0x0DD6, 0x0DD6 }, /* SINHALA VOWEL SIGN DIGA PAA-PILLA */
+	{ 0x0DD8, 0x0DDF }, /* SINHALA VOWEL SIGN GAETTA-PILLA - SINHALA VOWEL SIGN GAYANUKITTA */
+	{ 0x0DF2, 0x0DF3 }, /* SINHALA VOWEL SIGN DIGA GAETTA-PILLA - SINHALA VOWEL SIGN DIGA GAYANUKITTA */
+	{ 0x0E31, 0x0E31 }, /* THAI CHARACTER MAI HAN-AKAT */
+	{ 0x0E34, 0x0E3A }, /* THAI CHARACTER SARA I - THAI CHARACTER PHINTHU */
+	{ 0x0E47, 0x0E4E }, /* THAI CHARACTER MAITAIKHU - THAI CHARACTER YAMAKKAN */
+	{ 0x0EB1, 0x0EB1 }, /* LAO VOWEL SIGN MAI KAN */
+	{ 0x0EB4, 0x0EBC }, /* LAO VOWEL SIGN I - LAO SEMIVOWEL SIGN LO */
+	{ 0x0EC8, 0x0ECE }, /* LAO TONE MAI EK - LAO YAMAKKAN */
+	{ 0x0F18, 0x0F19 }, /* TIBETAN ASTROLOGICAL SIGN -KHYUD PA - TIBETAN ASTROLOGICAL SIGN SDONG TSHUGS */
+	{ 0x0F35, 0x0F35 }, /* TIBETAN MARK NGAS BZUNG NYI ZLA */
+	{ 0x0F37, 0x0F37 }, /* TIBETAN MARK NGAS BZUNG SGOR RTAGS */
+	{ 0x0F39, 0x0F39 }, /* TIBETAN MARK TSA -PHRU */
+	{ 0x0F3E, 0x0F3F }, /* TIBETAN SIGN YAR TSHES - TIBETAN SIGN MAR TSHES */
+	{ 0x0F71, 0x0F84 }, /* TIBETAN VOWEL SIGN AA - TIBETAN MARK HALANTA */
+	{ 0x0F86, 0x0F87 }, /* TIBETAN SIGN LCI RTAGS - TIBETAN SIGN YANG RTAGS */
+	{ 0x0F8D, 0x0F97 }, /* TIBETAN SUBJOINED SIGN LCE TSA CAN - TIBETAN SUBJOINED LETTER JA */
+	{ 0x0F99, 0x0FBC }, /* TIBETAN SUBJOINED LETTER NYA - TIBETAN SUBJOINED LETTER FIXED-FORM RA */
+	{ 0x0FC6, 0x0FC6 }, /* TIBETAN SYMBOL PADMA GDAN */
+	{ 0x102B, 0x103E }, /* MYANMAR VOWEL SIGN TALL AA - MYANMAR CONSONANT SIGN MEDIAL HA */
+	{ 0x1056, 0x1059 }, /* MYANMAR VOWEL SIGN VOCALIC R - MYANMAR VOWEL SIGN VOCALIC LL */
+	{ 0x105E, 0x1060 }, /* MYANMAR CONSONANT SIGN MON MEDIAL NA - MYANMAR CONSONANT SIGN MON MEDIAL LA */
+	{ 0x1062, 0x1064 }, /* MYANMAR VOWEL SIGN SGAW KAREN EU - MYANMAR TONE MARK SGAW KAREN KE PHO */
+	{ 0x1067, 0x106D }, /* MYANMAR VOWEL SIGN WESTERN PWO KAREN EU - MYANMAR SIGN WESTERN PWO KAREN TONE-5 */
+	{ 0x1071, 0x1074 }, /* MYANMAR VOWEL SIGN GEBA KAREN I - MYANMAR VOWEL SIGN KAYAH EE */
+	{ 0x1082, 0x108D }, /* MYANMAR CONSONANT SIGN SHAN MEDIAL WA - MYANMAR SIGN SHAN COUNCIL EMPHATIC TONE */
+	{ 0x108F, 0x108F }, /* MYANMAR SIGN RUMAI PALAUNG TONE-5 */
+	{ 0x109A, 0x109D }, /* MYANMAR SIGN KHAMTI TONE-1 - MYANMAR VOWEL SIGN AITON AI */
+	{ 0x135D, 0x135F }, /* ETHIOPIC COMBINING GEMINATION AND VOWEL LENGTH MARK - ETHIOPIC COMBINING GEMINATION MARK */
+	{ 0x1712, 0x1715 }, /* TAGALOG VOWEL SIGN I - TAGALOG SIGN PAMUDPOD */
+	{ 0x1732, 0x1734 }, /* HANUNOO VOWEL SIGN I - HANUNOO SIGN PAMUDPOD */
+	{ 0x1752, 0x1753 }, /* BUHID VOWEL SIGN I - BUHID VOWEL SIGN U */
+	{ 0x1772, 0x1773 }, /* TAGBANWA VOWEL SIGN I - TAGBANWA VOWEL SIGN U */
+	{ 0x17B4, 0x17D3 }, /* KHMER VOWEL INHERENT AQ - KHMER SIGN BATHAMASAT */
+	{ 0x17DD, 0x17DD }, /* KHMER SIGN ATTHACAN */
+	{ 0x180B, 0x180D }, /* MONGOLIAN FREE VARIATION SELECTOR ONE - MONGOLIAN FREE VARIATION SELECTOR THREE */
+	{ 0x180F, 0x180F }, /* MONGOLIAN FREE VARIATION SELECTOR FOUR */
+	{ 0x1885, 0x1886 }, /* MONGOLIAN LETTER ALI GALI BALUDA - MONGOLIAN LETTER ALI GALI THREE BALUDA */
+	{ 0x18A9, 0x18A9 }, /* MONGOLIAN LETTER ALI GALI DAGALGA */
+	{ 0x1920, 0x192B }, /* LIMBU VOWEL SIGN A - LIMBU SUBJOINED LETTER WA */
+	{ 0x1930, 0x193B }, /* LIMBU SMALL LETTER KA - LIMBU SIGN SA-I */
+	{ 0x1A17, 0x1A1B }, /* BUGINESE VOWEL SIGN I - BUGINESE VOWEL SIGN AE */
+	{ 0x1A55, 0x1A5E }, /* TAI THAM CONSONANT SIGN MEDIAL RA - TAI THAM CONSONANT SIGN SA */
+	{ 0x1A60, 0x1A7C }, /* TAI THAM SIGN SAKOT - TAI THAM SIGN KHUEN-LUE KARAN */
+	{ 0x1A7F, 0x1A7F }, /* TAI THAM COMBINING CRYPTOGRAMMIC DOT */
+	{ 0x1AB0, 0x1ACE }, /* COMBINING DOUBLED CIRCUMFLEX ACCENT - COMBINING LATIN SMALL LETTER INSULAR T */
+	{ 0x1B00, 0x1B04 }, /* BALINESE SIGN ULU RICEM - BALINESE SIGN BISAH */
+	{ 0x1B34, 0x1B44 }, /* BALINESE SIGN REREKAN - BALINESE ADEG ADEG */
+	{ 0x1B6B, 0x1B73 }, /* BALINESE MUSICAL SYMBOL COMBINING TEGEH - BALINESE MUSICAL SYMBOL COMBINING GONG */
+	{ 0x1B80, 0x1B82 }, /* SUNDANESE SIGN PANYECEK - SUNDANESE SIGN PANGWISAD */
+	{ 0x1BA1, 0x1BAD }, /* SUNDANESE CONSONANT SIGN PAMINGKAL - SUNDANESE CONSONANT SIGN PASANGAN WA */
+	{ 0x1BE6, 0x1BF3 }, /* BATAK SIGN TOMPI - BATAK PANONGONAN */
+	{ 0x1C24, 0x1C37 }, /* LEPCHA SUBJOINED LETTER YA - LEPCHA SIGN NUKTA */
+	{ 0x1CD0, 0x1CD2 }, /* VEDIC TONE KARSHANA - VEDIC TONE PRENKHA */
+	{ 0x1CD4, 0x1CE8 }, /* VEDIC SIGN YAJURVEDIC MIDLINE SVARITA - VEDIC SIGN VISARGA ANUDATTA WITH TAIL */
+	{ 0x1CED, 0x1CED }, /* VEDIC SIGN TIRYAK */
+	{ 0x1CF4, 0x1CF4 }, /* VEDIC TONE CANDRA ABOVE */
+	{ 0x1CF7, 0x1CF9 }, /* VEDIC SIGN ATIKRAMA - VEDIC TONE DOUBLE RING ABOVE */
+	{ 0x1DC0, 0x1DFF }, /* COMBINING DOTTED GRAVE ACCENT - COMBINING RIGHT ARROWHEAD AND DOWN ARROWHEAD BELOW */
+	{ 0x200B, 0x200E }, /* ZERO WIDTH SPACE - LEFT-TO-RIGHT MARK */
+	{ 0x202A, 0x202D }, /* LEFT-TO-RIGHT EMBEDDING - LEFT-TO-RIGHT OVERRIDE */
+	{ 0x2060, 0x2064 }, /* WORD JOINER - INVISIBLE PLUS */
+	{ 0x206A, 0x206F }, /* INHIBIT SYMMETRIC SWAPPING - NOMINAL DIGIT SHAPES */
+	{ 0x20D0, 0x20F0 }, /* COMBINING LEFT HARPOON ABOVE - COMBINING ASTERISK ABOVE */
+	{ 0x2640, 0x2640 }, /* FEMALE SIGN */
+	{ 0x2642, 0x2642 }, /* MALE SIGN */
+	{ 0x26A7, 0x26A7 }, /* MALE WITH STROKE AND MALE AND FEMALE SIGN */
+	{ 0x2CEF, 0x2CF1 }, /* COPTIC COMBINING NI ABOVE - COPTIC COMBINING SPIRITUS LENIS */
+	{ 0x2D7F, 0x2D7F }, /* TIFINAGH CONSONANT JOINER */
+	{ 0x2DE0, 0x2DFF }, /* COMBINING CYRILLIC LETTER BE - COMBINING CYRILLIC LETTER IOTIFIED BIG YUS */
+	{ 0x302A, 0x302F }, /* IDEOGRAPHIC LEVEL TONE MARK - HANGUL DOUBLE DOT TONE MARK */
+	{ 0x3099, 0x309A }, /* COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK - COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK */
+	{ 0xA66F, 0xA672 }, /* COMBINING CYRILLIC VZMET - COMBINING CYRILLIC THOUSAND MILLIONS SIGN */
+	{ 0xA674, 0xA67D }, /* COMBINING CYRILLIC LETTER UKRAINIAN IE - COMBINING CYRILLIC PAYEROK */
+	{ 0xA69E, 0xA69F }, /* COMBINING CYRILLIC LETTER EF - COMBINING CYRILLIC LETTER IOTIFIED E */
+	{ 0xA6F0, 0xA6F1 }, /* BAMUM COMBINING MARK KOQNDON - BAMUM COMBINING MARK TUKWENTIS */
+	{ 0xA802, 0xA802 }, /* SYLOTI NAGRI SIGN DVISVARA */
+	{ 0xA806, 0xA806 }, /* SYLOTI NAGRI SIGN HASANTA */
+	{ 0xA80B, 0xA80B }, /* SYLOTI NAGRI SIGN ANUSVARA */
+	{ 0xA823, 0xA827 }, /* SYLOTI NAGRI VOWEL SIGN A - SYLOTI NAGRI VOWEL SIGN OO */
+	{ 0xA82C, 0xA82C }, /* SYLOTI NAGRI SIGN ALTERNATE HASANTA */
+	{ 0xA880, 0xA881 }, /* SAURASHTRA SIGN ANUSVARA - SAURASHTRA SIGN VISARGA */
+	{ 0xA8B4, 0xA8C5 }, /* SAURASHTRA CONSONANT SIGN HAARU - SAURASHTRA SIGN CANDRABINDU */
+	{ 0xA8E0, 0xA8F1 }, /* COMBINING DEVANAGARI DIGIT ZERO - COMBINING DEVANAGARI SIGN AVAGRAHA */
+	{ 0xA8FF, 0xA8FF }, /* DEVANAGARI VOWEL SIGN AY */
+	{ 0xA926, 0xA92D }, /* KAYAH LI VOWEL UE - KAYAH LI TONE CALYA PLOPHU */
+	{ 0xA947, 0xA953 }, /* REJANG VOWEL SIGN I - REJANG VIRAMA */
+	{ 0xA980, 0xA983 }, /* JAVANESE SIGN PANYANGGA - JAVANESE SIGN WIGNYAN */
+	{ 0xA9B3, 0xA9C0 }, /* JAVANESE SIGN CECAK TELU - JAVANESE PANGKON */
+	{ 0xA9E5, 0xA9E5 }, /* MYANMAR SIGN SHAN SAW */
+	{ 0xAA29, 0xAA36 }, /* CHAM VOWEL SIGN AA - CHAM CONSONANT SIGN WA */
+	{ 0xAA43, 0xAA43 }, /* CHAM CONSONANT SIGN FINAL NG */
+	{ 0xAA4C, 0xAA4D }, /* CHAM CONSONANT SIGN FINAL M - CHAM CONSONANT SIGN FINAL H */
+	{ 0xAA7B, 0xAA7D }, /* MYANMAR SIGN PAO KAREN TONE - MYANMAR SIGN TAI LAING TONE-5 */
+	{ 0xAAB0, 0xAAB0 }, /* TAI VIET MAI KANG */
+	{ 0xAAB2, 0xAAB4 }, /* TAI VIET VOWEL I - TAI VIET VOWEL U */
+	{ 0xAAB7, 0xAAB8 }, /* TAI VIET MAI KHIT - TAI VIET VOWEL IA */
+	{ 0xAABE, 0xAABF }, /* TAI VIET VOWEL AM - TAI VIET TONE MAI EK */
+	{ 0xAAC1, 0xAAC1 }, /* TAI VIET TONE MAI THO */
+	{ 0xAAEB, 0xAAEF }, /* MEETEI MAYEK VOWEL SIGN II - MEETEI MAYEK VOWEL SIGN AAU */
+	{ 0xAAF5, 0xAAF6 }, /* MEETEI MAYEK VOWEL SIGN VISARGA - MEETEI MAYEK VIRAMA */
+	{ 0xABE3, 0xABEA }, /* MEETEI MAYEK VOWEL SIGN ONAP - MEETEI MAYEK VOWEL SIGN NUNG */
+	{ 0xABEC, 0xABED }, /* MEETEI MAYEK LUM IYEK - MEETEI MAYEK APUN IYEK */
+	{ 0xFB1E, 0xFB1E }, /* HEBREW POINT JUDEO-SPANISH VARIKA */
+	{ 0xFE00, 0xFE0F }, /* VARIATION SELECTOR-1 - VARIATION SELECTOR-16 */
+	{ 0xFE20, 0xFE2F }, /* COMBINING LIGATURE LEFT HALF - COMBINING CYRILLIC TITLO RIGHT HALF */
+	{ 0xFEFF, 0xFEFF }, /* ZERO WIDTH NO-BREAK SPACE */
+	{ 0xFFF9, 0xFFFB }, /* INTERLINEAR ANNOTATION ANCHOR - INTERLINEAR ANNOTATION TERMINATOR */
+};
+
+/* Zero-width character ranges (non-BMP, U+10000 and above) */
+static const struct interval32 zero_width_non_bmp[] = {
+	{ 0x101FD, 0x101FD }, /* PHAISTOS DISC SIGN COMBINING OBLIQUE STROKE */
+	{ 0x102E0, 0x102E0 }, /* COPTIC EPACT THOUSANDS MARK */
+	{ 0x10376, 0x1037A }, /* COMBINING OLD PERMIC LETTER AN - COMBINING OLD PERMIC LETTER SII */
+	{ 0x10A01, 0x10A03 }, /* KHAROSHTHI VOWEL SIGN I - KHAROSHTHI VOWEL SIGN VOCALIC R */
+	{ 0x10A05, 0x10A06 }, /* KHAROSHTHI VOWEL SIGN E - KHAROSHTHI VOWEL SIGN O */
+	{ 0x10A0C, 0x10A0F }, /* KHAROSHTHI VOWEL LENGTH MARK - KHAROSHTHI SIGN VISARGA */
+	{ 0x10A38, 0x10A3A }, /* KHAROSHTHI SIGN BAR ABOVE - KHAROSHTHI SIGN DOT BELOW */
+	{ 0x10A3F, 0x10A3F }, /* KHAROSHTHI VIRAMA */
+	{ 0x10AE5, 0x10AE6 }, /* MANICHAEAN ABBREVIATION MARK ABOVE - MANICHAEAN ABBREVIATION MARK BELOW */
+	{ 0x10D24, 0x10D27 }, /* HANIFI ROHINGYA SIGN HARBAHAY - HANIFI ROHINGYA SIGN TASSI */
+	{ 0x10D69, 0x10D6D }, /* GARAY VOWEL SIGN E - GARAY CONSONANT NASALIZATION MARK */
+	{ 0x10EAB, 0x10EAC }, /* YEZIDI COMBINING HAMZA MARK - YEZIDI COMBINING MADDA MARK */
+	{ 0x10EFC, 0x10EFF }, /* ARABIC COMBINING ALEF OVERLAY - ARABIC SMALL LOW WORD MADDA */
+	{ 0x10F46, 0x10F50 }, /* SOGDIAN COMBINING DOT BELOW - SOGDIAN COMBINING STROKE BELOW */
+	{ 0x10F82, 0x10F85 }, /* OLD UYGHUR COMBINING DOT ABOVE - OLD UYGHUR COMBINING TWO DOTS BELOW */
+	{ 0x11000, 0x11002 }, /* BRAHMI SIGN CANDRABINDU - BRAHMI SIGN VISARGA */
+	{ 0x11038, 0x11046 }, /* BRAHMI VOWEL SIGN AA - BRAHMI VIRAMA */
+	{ 0x11070, 0x11070 }, /* BRAHMI SIGN OLD TAMIL VIRAMA */
+	{ 0x11073, 0x11074 }, /* BRAHMI VOWEL SIGN OLD TAMIL SHORT E - BRAHMI VOWEL SIGN OLD TAMIL SHORT O */
+	{ 0x1107F, 0x11082 }, /* BRAHMI NUMBER JOINER - KAITHI SIGN VISARGA */
+	{ 0x110B0, 0x110BA }, /* KAITHI VOWEL SIGN AA - KAITHI SIGN NUKTA */
+	{ 0x110BD, 0x110BD }, /* KAITHI NUMBER SIGN */
+	{ 0x110C2, 0x110C2 }, /* KAITHI VOWEL SIGN VOCALIC R */
+	{ 0x110CD, 0x110CD }, /* KAITHI NUMBER SIGN ABOVE */
+	{ 0x11100, 0x11102 }, /* CHAKMA SIGN CANDRABINDU - CHAKMA SIGN VISARGA */
+	{ 0x11127, 0x11134 }, /* CHAKMA VOWEL SIGN A - CHAKMA MAAYYAA */
+	{ 0x11145, 0x11146 }, /* CHAKMA VOWEL SIGN AA - CHAKMA VOWEL SIGN EI */
+	{ 0x11173, 0x11173 }, /* MAHAJANI SIGN NUKTA */
+	{ 0x11180, 0x11182 }, /* SHARADA SIGN CANDRABINDU - SHARADA SIGN VISARGA */
+	{ 0x111B3, 0x111C0 }, /* SHARADA VOWEL SIGN AA - SHARADA SIGN VIRAMA */
+	{ 0x111C9, 0x111CC }, /* SHARADA SANDHI MARK - SHARADA EXTRA SHORT VOWEL MARK */
+	{ 0x111CE, 0x111CF }, /* SHARADA VOWEL SIGN PRISHTHAMATRA E - SHARADA SIGN INVERTED CANDRABINDU */
+	{ 0x1122C, 0x11237 }, /* KHOJKI VOWEL SIGN AA - KHOJKI SIGN SHADDA */
+	{ 0x1123E, 0x1123E }, /* KHOJKI SIGN SUKUN */
+	{ 0x11241, 0x11241 }, /* KHOJKI VOWEL SIGN VOCALIC R */
+	{ 0x112DF, 0x112EA }, /* KHUDAWADI SIGN ANUSVARA - KHUDAWADI SIGN VIRAMA */
+	{ 0x11300, 0x11303 }, /* GRANTHA SIGN COMBINING ANUSVARA ABOVE - GRANTHA SIGN VISARGA */
+	{ 0x1133B, 0x1133C }, /* COMBINING BINDU BELOW - GRANTHA SIGN NUKTA */
+	{ 0x1133E, 0x11344 }, /* GRANTHA VOWEL SIGN AA - GRANTHA VOWEL SIGN VOCALIC RR */
+	{ 0x11347, 0x11348 }, /* GRANTHA VOWEL SIGN EE - GRANTHA VOWEL SIGN AI */
+	{ 0x1134B, 0x1134D }, /* GRANTHA VOWEL SIGN OO - GRANTHA SIGN VIRAMA */
+	{ 0x11357, 0x11357 }, /* GRANTHA AU LENGTH MARK */
+	{ 0x11362, 0x11363 }, /* GRANTHA VOWEL SIGN VOCALIC L - GRANTHA VOWEL SIGN VOCALIC LL */
+	{ 0x11366, 0x1136C }, /* COMBINING GRANTHA DIGIT ZERO - COMBINING GRANTHA DIGIT SIX */
+	{ 0x11370, 0x11374 }, /* COMBINING GRANTHA LETTER A - COMBINING GRANTHA LETTER PA */
+	{ 0x113B8, 0x113C0 }, /* TULU-TIGALARI VOWEL SIGN AA - TULU-TIGALARI VOWEL SIGN VOCALIC LL */
+	{ 0x113C2, 0x113C2 }, /* TULU-TIGALARI VOWEL SIGN EE */
+	{ 0x113C5, 0x113C5 }, /* TULU-TIGALARI VOWEL SIGN AI */
+	{ 0x113C7, 0x113CA }, /* TULU-TIGALARI VOWEL SIGN OO - TULU-TIGALARI SIGN CANDRA ANUNASIKA */
+	{ 0x113CC, 0x113D0 }, /* TULU-TIGALARI SIGN ANUSVARA - TULU-TIGALARI CONJOINER */
+	{ 0x113D2, 0x113D2 }, /* TULU-TIGALARI GEMINATION MARK */
+	{ 0x113E1, 0x113E2 }, /* TULU-TIGALARI VEDIC TONE SVARITA - TULU-TIGALARI VEDIC TONE ANUDATTA */
+	{ 0x11435, 0x11446 }, /* NEWA VOWEL SIGN AA - NEWA SIGN NUKTA */
+	{ 0x1145E, 0x1145E }, /* NEWA SANDHI MARK */
+	{ 0x114B0, 0x114C3 }, /* TIRHUTA VOWEL SIGN AA - TIRHUTA SIGN NUKTA */
+	{ 0x115AF, 0x115B5 }, /* SIDDHAM VOWEL SIGN AA - SIDDHAM VOWEL SIGN VOCALIC RR */
+	{ 0x115B8, 0x115C0 }, /* SIDDHAM VOWEL SIGN E - SIDDHAM SIGN NUKTA */
+	{ 0x115DC, 0x115DD }, /* SIDDHAM VOWEL SIGN ALTERNATE U - SIDDHAM VOWEL SIGN ALTERNATE UU */
+	{ 0x11630, 0x11640 }, /* MODI VOWEL SIGN AA - MODI SIGN ARDHACANDRA */
+	{ 0x116AB, 0x116B7 }, /* TAKRI SIGN ANUSVARA - TAKRI SIGN NUKTA */
+	{ 0x1171D, 0x1172B }, /* AHOM CONSONANT SIGN MEDIAL LA - AHOM SIGN KILLER */
+	{ 0x1182C, 0x1183A }, /* DOGRA VOWEL SIGN AA - DOGRA SIGN NUKTA */
+	{ 0x11930, 0x11935 }, /* DIVES AKURU VOWEL SIGN AA - DIVES AKURU VOWEL SIGN E */
+	{ 0x11937, 0x11938 }, /* DIVES AKURU VOWEL SIGN AI - DIVES AKURU VOWEL SIGN O */
+	{ 0x1193B, 0x1193E }, /* DIVES AKURU SIGN ANUSVARA - DIVES AKURU VIRAMA */
+	{ 0x11940, 0x11940 }, /* DIVES AKURU MEDIAL YA */
+	{ 0x11942, 0x11943 }, /* DIVES AKURU MEDIAL RA - DIVES AKURU SIGN NUKTA */
+	{ 0x119D1, 0x119D7 }, /* NANDINAGARI VOWEL SIGN AA - NANDINAGARI VOWEL SIGN VOCALIC RR */
+	{ 0x119DA, 0x119E0 }, /* NANDINAGARI VOWEL SIGN E - NANDINAGARI SIGN VIRAMA */
+	{ 0x119E4, 0x119E4 }, /* NANDINAGARI VOWEL SIGN PRISHTHAMATRA E */
+	{ 0x11A01, 0x11A0A }, /* ZANABAZAR SQUARE VOWEL SIGN I - ZANABAZAR SQUARE VOWEL LENGTH MARK */
+	{ 0x11A33, 0x11A39 }, /* ZANABAZAR SQUARE FINAL CONSONANT MARK - ZANABAZAR SQUARE SIGN VISARGA */
+	{ 0x11A3B, 0x11A3E }, /* ZANABAZAR SQUARE CLUSTER-FINAL LETTER YA - ZANABAZAR SQUARE CLUSTER-FINAL LETTER VA */
+	{ 0x11A47, 0x11A47 }, /* ZANABAZAR SQUARE SUBJOINER */
+	{ 0x11A51, 0x11A5B }, /* SOYOMBO VOWEL SIGN I - SOYOMBO VOWEL LENGTH MARK */
+	{ 0x11A8A, 0x11A99 }, /* SOYOMBO FINAL CONSONANT SIGN G - SOYOMBO SUBJOINER */
+	{ 0x11C2F, 0x11C36 }, /* BHAIKSUKI VOWEL SIGN AA - BHAIKSUKI VOWEL SIGN VOCALIC L */
+	{ 0x11C38, 0x11C3F }, /* BHAIKSUKI VOWEL SIGN E - BHAIKSUKI SIGN VIRAMA */
+	{ 0x11C92, 0x11CA7 }, /* MARCHEN SUBJOINED LETTER KA - MARCHEN SUBJOINED LETTER ZA */
+	{ 0x11CA9, 0x11CB6 }, /* MARCHEN SUBJOINED LETTER YA - MARCHEN SIGN CANDRABINDU */
+	{ 0x11D31, 0x11D36 }, /* MASARAM GONDI VOWEL SIGN AA - MASARAM GONDI VOWEL SIGN VOCALIC R */
+	{ 0x11D3A, 0x11D3A }, /* MASARAM GONDI VOWEL SIGN E */
+	{ 0x11D3C, 0x11D3D }, /* MASARAM GONDI VOWEL SIGN AI - MASARAM GONDI VOWEL SIGN O */
+	{ 0x11D3F, 0x11D45 }, /* MASARAM GONDI VOWEL SIGN AU - MASARAM GONDI VIRAMA */
+	{ 0x11D47, 0x11D47 }, /* MASARAM GONDI RA-KARA */
+	{ 0x11D8A, 0x11D8E }, /* GUNJALA GONDI VOWEL SIGN AA - GUNJALA GONDI VOWEL SIGN UU */
+	{ 0x11D90, 0x11D91 }, /* GUNJALA GONDI VOWEL SIGN EE - GUNJALA GONDI VOWEL SIGN AI */
+	{ 0x11D93, 0x11D97 }, /* GUNJALA GONDI VOWEL SIGN OO - GUNJALA GONDI VIRAMA */
+	{ 0x11EF3, 0x11EF6 }, /* MAKASAR VOWEL SIGN I - MAKASAR VOWEL SIGN O */
+	{ 0x11F00, 0x11F01 }, /* KAWI SIGN CANDRABINDU - KAWI SIGN ANUSVARA */
+	{ 0x11F03, 0x11F03 }, /* KAWI SIGN VISARGA */
+	{ 0x11F34, 0x11F3A }, /* KAWI VOWEL SIGN AA - KAWI VOWEL SIGN VOCALIC R */
+	{ 0x11F3E, 0x11F42 }, /* KAWI VOWEL SIGN E - KAWI CONJOINER */
+	{ 0x11F5A, 0x11F5A }, /* KAWI SIGN NUKTA */
+	{ 0x13430, 0x13440 }, /* EGYPTIAN HIEROGLYPH VERTICAL JOINER - EGYPTIAN HIEROGLYPH MIRROR HORIZONTALLY */
+	{ 0x13447, 0x13455 }, /* EGYPTIAN HIEROGLYPH MODIFIER DAMAGED AT TOP START - EGYPTIAN HIEROGLYPH MODIFIER DAMAGED */
+	{ 0x1611E, 0x1612F }, /* GURUNG KHEMA VOWEL SIGN AA - GURUNG KHEMA SIGN THOLHOMA */
+	{ 0x16AF0, 0x16AF4 }, /* BASSA VAH COMBINING HIGH TONE - BASSA VAH COMBINING HIGH-LOW TONE */
+	{ 0x16B30, 0x16B36 }, /* PAHAWH HMONG MARK CIM TUB - PAHAWH HMONG MARK CIM TAUM */
+	{ 0x16F4F, 0x16F4F }, /* MIAO SIGN CONSONANT MODIFIER BAR */
+	{ 0x16F51, 0x16F87 }, /* MIAO SIGN ASPIRATION - MIAO VOWEL SIGN UI */
+	{ 0x16F8F, 0x16F92 }, /* MIAO TONE RIGHT - MIAO TONE BELOW */
+	{ 0x16FE4, 0x16FE4 }, /* KHITAN SMALL SCRIPT FILLER */
+	{ 0x16FF0, 0x16FF1 }, /* VIETNAMESE ALTERNATE READING MARK CA - VIETNAMESE ALTERNATE READING MARK NHAY */
+	{ 0x1BC9D, 0x1BC9E }, /* DUPLOYAN THICK LETTER SELECTOR - DUPLOYAN DOUBLE MARK */
+	{ 0x1BCA0, 0x1BCA3 }, /* SHORTHAND FORMAT LETTER OVERLAP - SHORTHAND FORMAT UP STEP */
+	{ 0x1CF00, 0x1CF2D }, /* ZNAMENNY COMBINING MARK GORAZDO NIZKO S KRYZHEM ON LEFT - ZNAMENNY COMBINING MARK KRYZH ON LEFT */
+	{ 0x1CF30, 0x1CF46 }, /* ZNAMENNY COMBINING TONAL RANGE MARK MRACHNO - ZNAMENNY PRIZNAK MODIFIER ROG */
+	{ 0x1D165, 0x1D169 }, /* MUSICAL SYMBOL COMBINING STEM - MUSICAL SYMBOL COMBINING TREMOLO-3 */
+	{ 0x1D16D, 0x1D182 }, /* MUSICAL SYMBOL COMBINING AUGMENTATION DOT - MUSICAL SYMBOL COMBINING LOURE */
+	{ 0x1D185, 0x1D18B }, /* MUSICAL SYMBOL COMBINING DOIT - MUSICAL SYMBOL COMBINING TRIPLE TONGUE */
+	{ 0x1D1AA, 0x1D1AD }, /* MUSICAL SYMBOL COMBINING DOWN BOW - MUSICAL SYMBOL COMBINING SNAP PIZZICATO */
+	{ 0x1D242, 0x1D244 }, /* COMBINING GREEK MUSICAL TRISEME - COMBINING GREEK MUSICAL PENTASEME */
+	{ 0x1DA00, 0x1DA36 }, /* SIGNWRITING HEAD RIM - SIGNWRITING AIR SUCKING IN */
+	{ 0x1DA3B, 0x1DA6C }, /* SIGNWRITING MOUTH CLOSED NEUTRAL - SIGNWRITING EXCITEMENT */
+	{ 0x1DA75, 0x1DA75 }, /* SIGNWRITING UPPER BODY TILTING FROM HIP JOINTS */
+	{ 0x1DA84, 0x1DA84 }, /* SIGNWRITING LOCATION HEAD NECK */
+	{ 0x1DA9B, 0x1DA9F }, /* SIGNWRITING FILL MODIFIER-2 - SIGNWRITING FILL MODIFIER-6 */
+	{ 0x1DAA1, 0x1DAAF }, /* SIGNWRITING ROTATION MODIFIER-2 - SIGNWRITING ROTATION MODIFIER-16 */
+	{ 0x1E000, 0x1E006 }, /* COMBINING GLAGOLITIC LETTER AZU - COMBINING GLAGOLITIC LETTER ZHIVETE */
+	{ 0x1E008, 0x1E018 }, /* COMBINING GLAGOLITIC LETTER ZEMLJA - COMBINING GLAGOLITIC LETTER HERU */
+	{ 0x1E01B, 0x1E021 }, /* COMBINING GLAGOLITIC LETTER SHTA - COMBINING GLAGOLITIC LETTER YATI */
+	{ 0x1E023, 0x1E024 }, /* COMBINING GLAGOLITIC LETTER YU - COMBINING GLAGOLITIC LETTER SMALL YUS */
+	{ 0x1E026, 0x1E02A }, /* COMBINING GLAGOLITIC LETTER YO - COMBINING GLAGOLITIC LETTER FITA */
+	{ 0x1E08F, 0x1E08F }, /* COMBINING CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I */
+	{ 0x1E130, 0x1E136 }, /* NYIAKENG PUACHUE HMONG TONE-B - NYIAKENG PUACHUE HMONG TONE-D */
+	{ 0x1E2AE, 0x1E2AE }, /* TOTO SIGN RISING TONE */
+	{ 0x1E2EC, 0x1E2EF }, /* WANCHO TONE TUP - WANCHO TONE KOINI */
+	{ 0x1E4EC, 0x1E4EF }, /* NAG MUNDARI SIGN MUHOR - NAG MUNDARI SIGN SUTUH */
+	{ 0x1E5EE, 0x1E5EF }, /* OL ONAL SIGN MU - OL ONAL SIGN IKIR */
+	{ 0x1E8D0, 0x1E8D6 }, /* MENDE KIKAKUI COMBINING NUMBER TEENS - MENDE KIKAKUI COMBINING NUMBER MILLIONS */
+	{ 0x1E944, 0x1E94A }, /* ADLAM ALIF LENGTHENER - ADLAM NUKTA */
+	{ 0x1F3FB, 0x1F3FF }, /* EMOJI MODIFIER FITZPATRICK TYPE-1-2 - EMOJI MODIFIER FITZPATRICK TYPE-6 */
+	{ 0x1F9B0, 0x1F9B3 }, /* EMOJI COMPONENT RED HAIR - EMOJI COMPONENT WHITE HAIR */
+	{ 0xE0001, 0xE0001 }, /* LANGUAGE TAG */
+	{ 0xE0020, 0xE007F }, /* TAG SPACE - CANCEL TAG */
+	{ 0xE0100, 0xE01EF }, /* VARIATION SELECTOR-17 - VARIATION SELECTOR-256 */
+};
+
+/* Double-width character ranges (BMP - Basic Multilingual Plane, U+0000 to U+FFFF) */
+static const struct interval16 double_width_bmp[] = {
+	{ 0x1100, 0x115F }, /* HANGUL CHOSEONG KIYEOK - HANGUL CHOSEONG FILLER */
+	{ 0x231A, 0x231B }, /* WATCH - HOURGLASS */
+	{ 0x2329, 0x232A }, /* LEFT-POINTING ANGLE BRACKET - RIGHT-POINTING ANGLE BRACKET */
+	{ 0x23E9, 0x23EC }, /* BLACK RIGHT-POINTING DOUBLE TRIANGLE - BLACK DOWN-POINTING DOUBLE TRIANGLE */
+	{ 0x23F0, 0x23F0 }, /* ALARM CLOCK */
+	{ 0x23F3, 0x23F3 }, /* HOURGLASS WITH FLOWING SAND */
+	{ 0x25FD, 0x25FE }, /* WHITE MEDIUM SMALL SQUARE - BLACK MEDIUM SMALL SQUARE */
+	{ 0x2614, 0x2615 }, /* UMBRELLA WITH RAIN DROPS - HOT BEVERAGE */
+	{ 0x2630, 0x2637 }, /* TRIGRAM FOR HEAVEN - TRIGRAM FOR EARTH */
+	{ 0x2648, 0x2653 }, /* ARIES - PISCES */
+	{ 0x267F, 0x267F }, /* WHEELCHAIR SYMBOL */
+	{ 0x268A, 0x268F }, /* MONOGRAM FOR YANG - DIGRAM FOR GREATER YIN */
+	{ 0x2693, 0x2693 }, /* ANCHOR */
+	{ 0x26A1, 0x26A1 }, /* HIGH VOLTAGE SIGN */
+	{ 0x26AA, 0x26AB }, /* MEDIUM WHITE CIRCLE - MEDIUM BLACK CIRCLE */
+	{ 0x26BD, 0x26BE }, /* SOCCER BALL - BASEBALL */
+	{ 0x26C4, 0x26C5 }, /* SNOWMAN WITHOUT SNOW - SUN BEHIND CLOUD */
+	{ 0x26CE, 0x26CE }, /* OPHIUCHUS */
+	{ 0x26D4, 0x26D4 }, /* NO ENTRY */
+	{ 0x26EA, 0x26EA }, /* CHURCH */
+	{ 0x26F2, 0x26F3 }, /* FOUNTAIN - FLAG IN HOLE */
+	{ 0x26F5, 0x26F5 }, /* SAILBOAT */
+	{ 0x26FA, 0x26FA }, /* TENT */
+	{ 0x26FD, 0x26FD }, /* FUEL PUMP */
+	{ 0x2705, 0x2705 }, /* WHITE HEAVY CHECK MARK */
+	{ 0x270A, 0x270B }, /* RAISED FIST - RAISED HAND */
+	{ 0x2728, 0x2728 }, /* SPARKLES */
+	{ 0x274C, 0x274C }, /* CROSS MARK */
+	{ 0x274E, 0x274E }, /* NEGATIVE SQUARED CROSS MARK */
+	{ 0x2753, 0x2755 }, /* BLACK QUESTION MARK ORNAMENT - WHITE EXCLAMATION MARK ORNAMENT */
+	{ 0x2757, 0x2757 }, /* HEAVY EXCLAMATION MARK SYMBOL */
+	{ 0x2795, 0x2797 }, /* HEAVY PLUS SIGN - HEAVY DIVISION SIGN */
+	{ 0x27B0, 0x27B0 }, /* CURLY LOOP */
+	{ 0x27BF, 0x27BF }, /* DOUBLE CURLY LOOP */
+	{ 0x2B1B, 0x2B1C }, /* BLACK LARGE SQUARE - WHITE LARGE SQUARE */
+	{ 0x2B50, 0x2B50 }, /* WHITE MEDIUM STAR */
+	{ 0x2B55, 0x2B55 }, /* HEAVY LARGE CIRCLE */
+	{ 0x2E80, 0x2E99 }, /* CJK RADICAL REPEAT - CJK RADICAL RAP */
+	{ 0x2E9B, 0x2EF3 }, /* CJK RADICAL CHOKE - CJK RADICAL C-SIMPLIFIED TURTLE */
+	{ 0x2F00, 0x2FD5 }, /* KANGXI RADICAL ONE - KANGXI RADICAL FLUTE */
+	{ 0x2FF0, 0x3029 }, /* IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT - HANGZHOU NUMERAL NINE */
+	{ 0x3030, 0x303E }, /* WAVY DASH - IDEOGRAPHIC VARIATION INDICATOR */
+	{ 0x3041, 0x3096 }, /* HIRAGANA LETTER SMALL A - HIRAGANA LETTER SMALL KE */
+	{ 0x309B, 0x30FF }, /* KATAKANA-HIRAGANA VOICED SOUND MARK - KATAKANA DIGRAPH KOTO */
+	{ 0x3105, 0x312F }, /* BOPOMOFO LETTER B - BOPOMOFO LETTER NN */
+	{ 0x3131, 0x318E }, /* HANGUL LETTER KIYEOK - HANGUL LETTER ARAEAE */
+	{ 0x3190, 0x31E5 }, /* IDEOGRAPHIC ANNOTATION LINKING MARK - CJK STROKE SZP */
+	{ 0x31EF, 0x321E }, /* IDEOGRAPHIC DESCRIPTION CHARACTER SUBTRACTION - PARENTHESIZED KOREAN CHARACTER O HU */
+	{ 0x3220, 0x3247 }, /* PARENTHESIZED IDEOGRAPH ONE - CIRCLED IDEOGRAPH KOTO */
+	{ 0x3250, 0xA48C }, /* PARTNERSHIP SIGN - YI SYLLABLE YYR */
+	{ 0xA490, 0xA4C6 }, /* YI RADICAL QOT - YI RADICAL KE */
+	{ 0xA960, 0xA97C }, /* HANGUL CHOSEONG TIKEUT-MIEUM - HANGUL CHOSEONG SSANGYEORINHIEUH */
+	{ 0xAC00, 0xD7A3 }, /* HANGUL SYLLABLE GA - HANGUL SYLLABLE HIH */
+	{ 0xF900, 0xFAFF }, /* U+F900 - U+FAFF */
+	{ 0xFE10, 0xFE19 }, /* PRESENTATION FORM FOR VERTICAL COMMA - PRESENTATION FORM FOR VERTICAL HORIZONTAL ELLIPSIS */
+	{ 0xFE30, 0xFE52 }, /* PRESENTATION FORM FOR VERTICAL TWO DOT LEADER - SMALL FULL STOP */
+	{ 0xFE54, 0xFE66 }, /* SMALL SEMICOLON - SMALL EQUALS SIGN */
+	{ 0xFE68, 0xFE6B }, /* SMALL REVERSE SOLIDUS - SMALL COMMERCIAL AT */
+	{ 0xFF01, 0xFF60 }, /* FULLWIDTH EXCLAMATION MARK - FULLWIDTH RIGHT WHITE PARENTHESIS */
+	{ 0xFFE0, 0xFFE6 }, /* FULLWIDTH CENT SIGN - FULLWIDTH WON SIGN */
 };
 
-/* Double-width character ranges */
-static const struct interval double_width_ranges[] = {
-	{ 0x01100, 0x0115F }, /* HANGUL CHOSEONG KIYEOK - HANGUL CHOSEONG FILLER */
-	{ 0x0231A, 0x0231B }, /* WATCH - HOURGLASS */
-	{ 0x02329, 0x0232A }, /* LEFT-POINTING ANGLE BRACKET - RIGHT-POINTING ANGLE BRACKET */
-	{ 0x023E9, 0x023EC }, /* BLACK RIGHT-POINTING DOUBLE TRIANGLE - BLACK DOWN-POINTING DOUBLE TRIANGLE */
-	{ 0x023F0, 0x023F0 }, /* ALARM CLOCK */
-	{ 0x023F3, 0x023F3 }, /* HOURGLASS WITH FLOWING SAND */
-	{ 0x025FD, 0x025FE }, /* WHITE MEDIUM SMALL SQUARE - BLACK MEDIUM SMALL SQUARE */
-	{ 0x02614, 0x02615 }, /* UMBRELLA WITH RAIN DROPS - HOT BEVERAGE */
-	{ 0x02630, 0x02637 }, /* TRIGRAM FOR HEAVEN - TRIGRAM FOR EARTH */
-	{ 0x02648, 0x02653 }, /* ARIES - PISCES */
-	{ 0x0267F, 0x0267F }, /* WHEELCHAIR SYMBOL */
-	{ 0x0268A, 0x0268F }, /* MONOGRAM FOR YANG - DIGRAM FOR GREATER YIN */
-	{ 0x02693, 0x02693 }, /* ANCHOR */
-	{ 0x026A1, 0x026A1 }, /* HIGH VOLTAGE SIGN */
-	{ 0x026AA, 0x026AB }, /* MEDIUM WHITE CIRCLE - MEDIUM BLACK CIRCLE */
-	{ 0x026BD, 0x026BE }, /* SOCCER BALL - BASEBALL */
-	{ 0x026C4, 0x026C5 }, /* SNOWMAN WITHOUT SNOW - SUN BEHIND CLOUD */
-	{ 0x026CE, 0x026CE }, /* OPHIUCHUS */
-	{ 0x026D4, 0x026D4 }, /* NO ENTRY */
-	{ 0x026EA, 0x026EA }, /* CHURCH */
-	{ 0x026F2, 0x026F3 }, /* FOUNTAIN - FLAG IN HOLE */
-	{ 0x026F5, 0x026F5 }, /* SAILBOAT */
-	{ 0x026FA, 0x026FA }, /* TENT */
-	{ 0x026FD, 0x026FD }, /* FUEL PUMP */
-	{ 0x02705, 0x02705 }, /* WHITE HEAVY CHECK MARK */
-	{ 0x0270A, 0x0270B }, /* RAISED FIST - RAISED HAND */
-	{ 0x02728, 0x02728 }, /* SPARKLES */
-	{ 0x0274C, 0x0274C }, /* CROSS MARK */
-	{ 0x0274E, 0x0274E }, /* NEGATIVE SQUARED CROSS MARK */
-	{ 0x02753, 0x02755 }, /* BLACK QUESTION MARK ORNAMENT - WHITE EXCLAMATION MARK ORNAMENT */
-	{ 0x02757, 0x02757 }, /* HEAVY EXCLAMATION MARK SYMBOL */
-	{ 0x02795, 0x02797 }, /* HEAVY PLUS SIGN - HEAVY DIVISION SIGN */
-	{ 0x027B0, 0x027B0 }, /* CURLY LOOP */
-	{ 0x027BF, 0x027BF }, /* DOUBLE CURLY LOOP */
-	{ 0x02B1B, 0x02B1C }, /* BLACK LARGE SQUARE - WHITE LARGE SQUARE */
-	{ 0x02B50, 0x02B50 }, /* WHITE MEDIUM STAR */
-	{ 0x02B55, 0x02B55 }, /* HEAVY LARGE CIRCLE */
-	{ 0x02E80, 0x02E99 }, /* CJK RADICAL REPEAT - CJK RADICAL RAP */
-	{ 0x02E9B, 0x02EF3 }, /* CJK RADICAL CHOKE - CJK RADICAL C-SIMPLIFIED TURTLE */
-	{ 0x02F00, 0x02FD5 }, /* KANGXI RADICAL ONE - KANGXI RADICAL FLUTE */
-	{ 0x02FF0, 0x03029 }, /* IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT - HANGZHOU NUMERAL NINE */
-	{ 0x03030, 0x0303E }, /* WAVY DASH - IDEOGRAPHIC VARIATION INDICATOR */
-	{ 0x03041, 0x03096 }, /* HIRAGANA LETTER SMALL A - HIRAGANA LETTER SMALL KE */
-	{ 0x0309B, 0x030FF }, /* KATAKANA-HIRAGANA VOICED SOUND MARK - KATAKANA DIGRAPH KOTO */
-	{ 0x03105, 0x0312F }, /* BOPOMOFO LETTER B - BOPOMOFO LETTER NN */
-	{ 0x03131, 0x0318E }, /* HANGUL LETTER KIYEOK - HANGUL LETTER ARAEAE */
-	{ 0x03190, 0x031E5 }, /* IDEOGRAPHIC ANNOTATION LINKING MARK - CJK STROKE SZP */
-	{ 0x031EF, 0x0321E }, /* IDEOGRAPHIC DESCRIPTION CHARACTER SUBTRACTION - PARENTHESIZED KOREAN CHARACTER O HU */
-	{ 0x03220, 0x03247 }, /* PARENTHESIZED IDEOGRAPH ONE - CIRCLED IDEOGRAPH KOTO */
-	{ 0x03250, 0x0A48C }, /* PARTNERSHIP SIGN - YI SYLLABLE YYR */
-	{ 0x0A490, 0x0A4C6 }, /* YI RADICAL QOT - YI RADICAL KE */
-	{ 0x0A960, 0x0A97C }, /* HANGUL CHOSEONG TIKEUT-MIEUM - HANGUL CHOSEONG SSANGYEORINHIEUH */
-	{ 0x0AC00, 0x0D7A3 }, /* HANGUL SYLLABLE GA - HANGUL SYLLABLE HIH */
-	{ 0x0F900, 0x0FAFF }, /* U+0F900 - U+0FAFF */
-	{ 0x0FE10, 0x0FE19 }, /* PRESENTATION FORM FOR VERTICAL COMMA - PRESENTATION FORM FOR VERTICAL HORIZONTAL ELLIPSIS */
-	{ 0x0FE30, 0x0FE52 }, /* PRESENTATION FORM FOR VERTICAL TWO DOT LEADER - SMALL FULL STOP */
-	{ 0x0FE54, 0x0FE66 }, /* SMALL SEMICOLON - SMALL EQUALS SIGN */
-	{ 0x0FE68, 0x0FE6B }, /* SMALL REVERSE SOLIDUS - SMALL COMMERCIAL AT */
-	{ 0x0FF01, 0x0FF60 }, /* FULLWIDTH EXCLAMATION MARK - FULLWIDTH RIGHT WHITE PARENTHESIS */
-	{ 0x0FFE0, 0x0FFE6 }, /* FULLWIDTH CENT SIGN - FULLWIDTH WON SIGN */
-	{ 0x16FE0, 0x16FE3 }, /* U+16FE0 - U+16FE3 */
+/* Double-width character ranges (non-BMP, U+10000 and above) */
+static const struct interval32 double_width_non_bmp[] = {
+	{ 0x16FE0, 0x16FE3 }, /* TANGUT ITERATION MARK - OLD CHINESE ITERATION MARK */
 	{ 0x17000, 0x187F7 }, /* U+17000 - U+187F7 */
-	{ 0x18800, 0x18CD5 }, /* U+18800 - U+18CD5 */
+	{ 0x18800, 0x18CD5 }, /* TANGUT COMPONENT-001 - KHITAN SMALL SCRIPT CHARACTER-18CD5 */
 	{ 0x18CFF, 0x18D08 }, /* U+18CFF - U+18D08 */
-	{ 0x1AFF0, 0x1AFF3 }, /* U+1AFF0 - U+1AFF3 */
-	{ 0x1AFF5, 0x1AFFB }, /* U+1AFF5 - U+1AFFB */
-	{ 0x1AFFD, 0x1AFFE }, /* U+1AFFD - U+1AFFE */
-	{ 0x1B000, 0x1B122 }, /* U+1B000 - U+1B122 */
-	{ 0x1B132, 0x1B132 }, /* U+1B132 */
-	{ 0x1B150, 0x1B152 }, /* U+1B150 - U+1B152 */
-	{ 0x1B155, 0x1B155 }, /* U+1B155 */
-	{ 0x1B164, 0x1B167 }, /* U+1B164 - U+1B167 */
-	{ 0x1B170, 0x1B2FB }, /* U+1B170 - U+1B2FB */
-	{ 0x1D300, 0x1D356 }, /* U+1D300 - U+1D356 */
-	{ 0x1D360, 0x1D376 }, /* U+1D360 - U+1D376 */
+	{ 0x1AFF0, 0x1AFF3 }, /* KATAKANA LETTER MINNAN TONE-2 - KATAKANA LETTER MINNAN TONE-5 */
+	{ 0x1AFF5, 0x1AFFB }, /* KATAKANA LETTER MINNAN TONE-7 - KATAKANA LETTER MINNAN NASALIZED TONE-5 */
+	{ 0x1AFFD, 0x1AFFE }, /* KATAKANA LETTER MINNAN NASALIZED TONE-7 - KATAKANA LETTER MINNAN NASALIZED TONE-8 */
+	{ 0x1B000, 0x1B122 }, /* KATAKANA LETTER ARCHAIC E - KATAKANA LETTER ARCHAIC WU */
+	{ 0x1B132, 0x1B132 }, /* HIRAGANA LETTER SMALL KO */
+	{ 0x1B150, 0x1B152 }, /* HIRAGANA LETTER SMALL WI - HIRAGANA LETTER SMALL WO */
+	{ 0x1B155, 0x1B155 }, /* KATAKANA LETTER SMALL KO */
+	{ 0x1B164, 0x1B167 }, /* KATAKANA LETTER SMALL WI - KATAKANA LETTER SMALL N */
+	{ 0x1B170, 0x1B2FB }, /* NUSHU CHARACTER-1B170 - NUSHU CHARACTER-1B2FB */
+	{ 0x1D300, 0x1D356 }, /* MONOGRAM FOR EARTH - TETRAGRAM FOR FOSTERING */
+	{ 0x1D360, 0x1D376 }, /* COUNTING ROD UNIT DIGIT ONE - IDEOGRAPHIC TALLY MARK FIVE */
 	{ 0x1F000, 0x1F02F }, /* U+1F000 - U+1F02F */
 	{ 0x1F0A0, 0x1F0FF }, /* U+1F0A0 - U+1F0FF */
-	{ 0x1F18E, 0x1F18E }, /* U+1F18E */
-	{ 0x1F191, 0x1F19A }, /* U+1F191 - U+1F19A */
-	{ 0x1F200, 0x1F202 }, /* U+1F200 - U+1F202 */
-	{ 0x1F210, 0x1F23B }, /* U+1F210 - U+1F23B */
-	{ 0x1F240, 0x1F248 }, /* U+1F240 - U+1F248 */
-	{ 0x1F250, 0x1F251 }, /* U+1F250 - U+1F251 */
-	{ 0x1F260, 0x1F265 }, /* U+1F260 - U+1F265 */
-	{ 0x1F300, 0x1F3FA }, /* U+1F300 - U+1F3FA */
-	{ 0x1F400, 0x1F64F }, /* U+1F400 - U+1F64F */
-	{ 0x1F680, 0x1F9AF }, /* U+1F680 - U+1F9AF */
+	{ 0x1F18E, 0x1F18E }, /* NEGATIVE SQUARED AB */
+	{ 0x1F191, 0x1F19A }, /* SQUARED CL - SQUARED VS */
+	{ 0x1F200, 0x1F202 }, /* SQUARE HIRAGANA HOKA - SQUARED KATAKANA SA */
+	{ 0x1F210, 0x1F23B }, /* SQUARED CJK UNIFIED IDEOGRAPH-624B - SQUARED CJK UNIFIED IDEOGRAPH-914D */
+	{ 0x1F240, 0x1F248 }, /* TORTOISE SHELL BRACKETED CJK UNIFIED IDEOGRAPH-672C - TORTOISE SHELL BRACKETED CJK UNIFIED IDEOGRAPH-6557 */
+	{ 0x1F250, 0x1F251 }, /* CIRCLED IDEOGRAPH ADVANTAGE - CIRCLED IDEOGRAPH ACCEPT */
+	{ 0x1F260, 0x1F265 }, /* ROUNDED SYMBOL FOR FU - ROUNDED SYMBOL FOR CAI */
+	{ 0x1F300, 0x1F3FA }, /* CYCLONE - AMPHORA */
+	{ 0x1F400, 0x1F64F }, /* RAT - PERSON WITH FOLDED HANDS */
+	{ 0x1F680, 0x1F9AF }, /* ROCKET - PROBING CANE */
 	{ 0x1F9B4, 0x1FAFF }, /* U+1F9B4 - U+1FAFF */
 	{ 0x20000, 0x2FFFD }, /* U+20000 - U+2FFFD */
 	{ 0x30000, 0x3FFFD }, /* U+30000 - U+3FFFD */
 };
 
 
-static int ucs_cmp(const void *key, const void *element)
+static int ucs_cmp16(const void *key, const void *element)
+{
+	uint16_t cp = *(uint16_t *)key;
+	const struct interval16 *e = element;
+
+	if (cp > e->last)
+		return 1;
+	if (cp < e->first)
+		return -1;
+	return 0;
+}
+
+static int ucs_cmp32(const void *key, const void *element)
 {
 	uint32_t cp = *(uint32_t *)key;
-	const struct interval *e = element;
+	const struct interval32 *e = element;
 
 	if (cp > e->last)
 		return 1;
@@ -466,13 +491,22 @@ static int ucs_cmp(const void *key, const void *element)
 	return 0;
 }
 
-static bool is_in_interval(uint32_t cp, const struct interval *intervals, size_t count)
+static bool is_in_interval16(uint16_t cp, const struct interval16 *intervals, size_t count)
+{
+	if (cp < intervals[0].first || cp > intervals[count - 1].last)
+		return false;
+
+	return __inline_bsearch(&cp, intervals, count,
+				sizeof(*intervals), ucs_cmp16) != NULL;
+}
+
+static bool is_in_interval32(uint32_t cp, const struct interval32 *intervals, size_t count)
 {
 	if (cp < intervals[0].first || cp > intervals[count - 1].last)
 		return false;
 
 	return __inline_bsearch(&cp, intervals, count,
-				sizeof(*intervals), ucs_cmp) != NULL;
+				sizeof(*intervals), ucs_cmp32) != NULL;
 }
 
 /**
@@ -483,7 +517,9 @@ static bool is_in_interval(uint32_t cp, const struct interval *intervals, size_t
  */
 bool ucs_is_zero_width(uint32_t cp)
 {
-	return is_in_interval(cp, zero_width_ranges, ARRAY_SIZE(zero_width_ranges));
+	return (cp <= 0xFFFF)
+	       ? is_in_interval16(cp, zero_width_bmp, ARRAY_SIZE(zero_width_bmp))
+	       : is_in_interval32(cp, zero_width_non_bmp, ARRAY_SIZE(zero_width_non_bmp));
 }
 
 /**
@@ -494,5 +530,7 @@ bool ucs_is_zero_width(uint32_t cp)
  */
 bool ucs_is_double_width(uint32_t cp)
 {
-	return is_in_interval(cp, double_width_ranges, ARRAY_SIZE(double_width_ranges));
+	return (cp <= 0xFFFF)
+	       ? is_in_interval16(cp, double_width_bmp, ARRAY_SIZE(double_width_bmp))
+	       : is_in_interval32(cp, double_width_non_bmp, ARRAY_SIZE(double_width_non_bmp));
 }
-- 
2.49.0
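
[Editor's sketch, not part of the patch.] The hunk above splits each width table into a BMP table with 16-bit bounds and a non-BMP table with 32-bit bounds, then dispatches on `cp <= 0xFFFF`. The following is a minimal Python model of that dispatch, assuming sorted, non-overlapping intervals. The function names `in_intervals` and `is_double_width` and the miniature tables are illustrative only; the rows are a handful copied from the diff, not the full generated data.

```python
from bisect import bisect_right

# Illustrative subsets of the generated tables (rows copied from the diff).
DOUBLE_WIDTH_BMP = [(0x2E80, 0x2E99), (0x3041, 0x3096), (0xAC00, 0xD7A3)]
DOUBLE_WIDTH_NON_BMP = [(0x16FE0, 0x16FE3), (0x1F300, 0x1F3FA), (0x20000, 0x2FFFD)]


def in_intervals(cp, intervals):
    """Binary-search sorted, non-overlapping (first, last) pairs."""
    # Same early-out as the C helpers: reject anything outside the table span.
    if not intervals or cp < intervals[0][0] or cp > intervals[-1][1]:
        return False
    # Index of the last interval whose 'first' is <= cp.
    i = bisect_right(intervals, (cp, float("inf"))) - 1
    first, last = intervals[i]
    return first <= cp <= last


def is_double_width(cp):
    """Pick the table by plane, mirroring the (cp <= 0xFFFF) test in the patch."""
    if cp <= 0xFFFF:
        return in_intervals(cp, DOUBLE_WIDTH_BMP)
    return in_intervals(cp, DOUBLE_WIDTH_NON_BMP)
```

With 16-bit bounds a BMP row needs half the storage of a 32-bit row, which is presumably the motivation for the split; the lookup logic itself is unchanged.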



* [PATCH 11/11] vt: pad double-width code points with a zero-white-space
  2025-04-10  1:13 [PATCH 00/11] vt: implement proper Unicode handling Nicolas Pitre
                   ` (9 preceding siblings ...)
  2025-04-10  1:14 ` [PATCH 10/11] vt: update ucs_width.c following latest gen_ucs_width.py Nicolas Pitre
@ 2025-04-10  1:14 ` Nicolas Pitre
  2025-04-14  7:18   ` Jiri Slaby
  2025-04-10 19:38 ` [PATCH 12/11] vt: remove zero-white-space handling from conv_uni_to_pc() Nicolas Pitre
  2025-04-11 14:49 ` [PATCH 00/11] vt: implement proper Unicode handling Greg Kroah-Hartman
  12 siblings, 1 reply; 27+ messages in thread
From: Nicolas Pitre @ 2025-04-10  1:14 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Jiri Slaby
  Cc: Nicolas Pitre, Dave Mielke, linux-serial, linux-kernel

From: Nicolas Pitre <npitre@baylibre.com>

In the Unicode screen buffer, we follow double-width code points with a
space to maintain proper column alignment. This, however, creates
semantic problems when e.g. using cut and paste or selection.

Let's use a better code point for the column padding's purpose, i.e. a
zero-width space (U+200B, called "zero-white-space" in the code comments)
rather than a full space.

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
---
 drivers/tty/vt/vt.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
index e3d35c4f92..dc84f9c6b7 100644
--- a/drivers/tty/vt/vt.c
+++ b/drivers/tty/vt/vt.c
@@ -2937,12 +2937,13 @@ static int vc_con_write_normal(struct vc_data *vc, int tc, int c,
 			width = 2;
 		} else if (ucs_is_zero_width(c)) {
 			prev_c = vc_uniscr_getc(vc, -1);
-			if (prev_c == ' ' &&
+			if (prev_c == 0x200B &&
 			    ucs_is_double_width(vc_uniscr_getc(vc, -2))) {
 				/*
 				 * Let's merge this zero-width code point with
 				 * the preceding double-width code point by
-				 * replacing the existing whitespace padding.
+				 * replacing the existing zero-white-space
+				 * padding.
 				 */
 				vc_con_rewind(vc);
 			} else if (c == 0xfe0f && prev_c != 0) {
@@ -3040,7 +3041,11 @@ static int vc_con_write_normal(struct vc_data *vc, int tc, int c,
 		tc = conv_uni_to_pc(vc, ' ');
 		if (tc < 0)
 			tc = ' ';
-		next_c = ' ';
+		/*
+		 * Store a zero-white-space in the Unicode screen given that
+		 * the previous code point is semantically double-width.
+		 */
+		next_c = 0x200B;
 	}
 
 out:
-- 
2.49.0



* [PATCH 12/11] vt: remove zero-white-space handling from conv_uni_to_pc()
  2025-04-10  1:13 [PATCH 00/11] vt: implement proper Unicode handling Nicolas Pitre
                   ` (10 preceding siblings ...)
  2025-04-10  1:14 ` [PATCH 11/11] vt: pad double-width code points with a zero-white-space Nicolas Pitre
@ 2025-04-10 19:38 ` Nicolas Pitre
  2025-04-11 14:49 ` [PATCH 00/11] vt: implement proper Unicode handling Greg Kroah-Hartman
  12 siblings, 0 replies; 27+ messages in thread
From: Nicolas Pitre @ 2025-04-10 19:38 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Jiri Slaby
  Cc: Nicolas Pitre, Dave Mielke, linux-serial, linux-kernel

From: Nicolas Pitre <npitre@baylibre.com>

This is now taken care of by ucs_is_zero_width(). And in the case where
we do want padding from some zero-width code point, we should also
give the legacy displays a space character to work with.

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
---

This is a fix for a small issue discovered during everyday usage.
I didn't think it was worth resending the whole series for this, but
if you prefer otherwise, please let me know.

diff --git a/drivers/tty/vt/consolemap.c b/drivers/tty/vt/consolemap.c
index 82d70083fe..bb4bb272eb 100644
--- a/drivers/tty/vt/consolemap.c
+++ b/drivers/tty/vt/consolemap.c
@@ -870,8 +870,6 @@ int conv_uni_to_pc(struct vc_data *conp, long ucs)
 		return -4;		/* Not found */
 	else if (ucs < 0x20)
 		return -1;		/* Not a printable character */
-	else if (ucs == 0xfeff || (ucs >= 0x200b && ucs <= 0x200f))
-		return -2;			/* Zero-width space */
 	/*
 	 * UNI_DIRECT_BASE indicates the start of the region in the User Zone
 	 * which always has a 1:1 mapping to the currently loaded font.  The
diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
index dc84f9c6b7..0d1d663c78 100644
--- a/drivers/tty/vt/vt.c
+++ b/drivers/tty/vt/vt.c
@@ -2964,13 +2964,15 @@ static int vc_con_write_normal(struct vc_data *vc, int tc, int c,
 					goto out;
 				}
 			}
+			/* padding for the legacy display like done below */
+			tc = ' ';
 		}
 	}
 
 	/* Now try to find out how to display it */
 	tc = conv_uni_to_pc(vc, tc);
 	if (tc & ~charmask) {
-		if (tc == -1 || tc == -2)
+		if (tc == -1)
 			return -1; /* nothing to display */
 
 		/* Glyph not found */


* Re: [PATCH 05/11] vt: update ucs_width.c using gen_ucs_width.py
  2025-04-10  1:13 ` [PATCH 05/11] vt: update ucs_width.c using gen_ucs_width.py Nicolas Pitre
@ 2025-04-11  3:47   ` kernel test robot
  0 siblings, 0 replies; 27+ messages in thread
From: kernel test robot @ 2025-04-11  3:47 UTC (permalink / raw)
  To: Nicolas Pitre, Greg Kroah-Hartman, Jiri Slaby
  Cc: oe-kbuild-all, Nicolas Pitre, Dave Mielke, linux-serial,
	linux-kernel

Hi Nicolas,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tty/tty-testing]
[also build test WARNING on tty/tty-next tty/tty-linus linus/master v6.15-rc1 next-20250410]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Nicolas-Pitre/vt-minor-cleanup-to-vc_translate_unicode/20250410-092318
base:   https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty.git tty-testing
patch link:    https://lore.kernel.org/r/20250410011839.64418-6-nico%40fluxnic.net
patch subject: [PATCH 05/11] vt: update ucs_width.c using gen_ucs_width.py
config: csky-randconfig-001-20250411 (https://download.01.org/0day-ci/archive/20250411/202504111036.YH1iEqBR-lkp@intel.com/config)
compiler: csky-linux-gcc (GCC) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250411/202504111036.YH1iEqBR-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202504111036.YH1iEqBR-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> drivers/tty/vt/ucs_width.c:485: warning: Function parameter or struct member 'cp' not described in 'ucs_is_zero_width'
>> drivers/tty/vt/ucs_width.c:485: warning: expecting prototype for Determine if a Unicode code point is zero(). Prototype was for ucs_is_zero_width() instead
>> drivers/tty/vt/ucs_width.c:496: warning: Function parameter or struct member 'cp' not described in 'ucs_is_double_width'
>> drivers/tty/vt/ucs_width.c:496: warning: expecting prototype for Determine if a Unicode code point is double(). Prototype was for ucs_is_double_width() instead


vim +485 drivers/tty/vt/ucs_width.c

   477	
   478	/**
   479	 * Determine if a Unicode code point is zero-width.
   480	 *
   481	 * @param ucs: Unicode code point (UCS-4)
   482	 * Return: true if the character is zero-width, false otherwise
   483	 */
   484	bool ucs_is_zero_width(uint32_t cp)
 > 485	{
   486		return is_in_interval(cp, zero_width_ranges, ARRAY_SIZE(zero_width_ranges));
   487	}
   488	
   489	/**
   490	 * Determine if a Unicode code point is double-width.
   491	 *
   492	 * @param ucs: Unicode code point (UCS-4)
   493	 * Return: true if the character is double-width, false otherwise
   494	 */
   495	bool ucs_is_double_width(uint32_t cp)
 > 496	{

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH 07/11] vt: create ucs_recompose.c using gen_ucs_recompose.py
  2025-04-10  1:13 ` [PATCH 07/11] vt: create ucs_recompose.c using gen_ucs_recompose.py Nicolas Pitre
@ 2025-04-11  6:00   ` kernel test robot
  0 siblings, 0 replies; 27+ messages in thread
From: kernel test robot @ 2025-04-11  6:00 UTC (permalink / raw)
  To: Nicolas Pitre, Greg Kroah-Hartman, Jiri Slaby
  Cc: oe-kbuild-all, Nicolas Pitre, Dave Mielke, linux-serial,
	linux-kernel

Hi Nicolas,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tty/tty-testing]
[also build test WARNING on tty/tty-next tty/tty-linus linus/master v6.15-rc1 next-20250410]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Nicolas-Pitre/vt-minor-cleanup-to-vc_translate_unicode/20250410-092318
base:   https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty.git tty-testing
patch link:    https://lore.kernel.org/r/20250410011839.64418-8-nico%40fluxnic.net
patch subject: [PATCH 07/11] vt: create ucs_recompose.c using gen_ucs_recompose.py
config: csky-randconfig-001-20250411 (https://download.01.org/0day-ci/archive/20250411/202504111359.urXWyzvQ-lkp@intel.com/config)
compiler: csky-linux-gcc (GCC) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250411/202504111359.urXWyzvQ-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202504111359.urXWyzvQ-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> drivers/tty/vt/ucs_recompose.c:148: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
    * Attempt to recompose two Unicode characters into a single character.


vim +148 drivers/tty/vt/ucs_recompose.c

   146	
   147	/**
 > 148	 * Attempt to recompose two Unicode characters into a single character.

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH 00/11] vt: implement proper Unicode handling
  2025-04-10  1:13 [PATCH 00/11] vt: implement proper Unicode handling Nicolas Pitre
                   ` (11 preceding siblings ...)
  2025-04-10 19:38 ` [PATCH 12/11] vt: remove zero-white-space handling from conv_uni_to_pc() Nicolas Pitre
@ 2025-04-11 14:49 ` Greg Kroah-Hartman
  12 siblings, 0 replies; 27+ messages in thread
From: Greg Kroah-Hartman @ 2025-04-11 14:49 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Jiri Slaby, Nicolas Pitre, Dave Mielke, linux-serial,
	linux-kernel

On Wed, Apr 09, 2025 at 09:13:52PM -0400, Nicolas Pitre wrote:
> The Linux VT console has many problems with regards to proper Unicode
> handling:

Wow, very nice work, thanks for doing all of this.  I'll go queue it up
now, the kernel test robot warnings for comments can be fixed up later
if you want to.

greg k-h


* Re: [PATCH 02/11] vt: move unicode processing to a separate file
  2025-04-10  1:13 ` [PATCH 02/11] vt: move unicode processing to a separate file Nicolas Pitre
@ 2025-04-14  6:47   ` Jiri Slaby
  2025-04-15 19:03     ` Nicolas Pitre
  0 siblings, 1 reply; 27+ messages in thread
From: Jiri Slaby @ 2025-04-14  6:47 UTC (permalink / raw)
  To: Nicolas Pitre, Greg Kroah-Hartman
  Cc: Nicolas Pitre, Dave Mielke, linux-serial, linux-kernel

On 10. 04. 25, 3:13, Nicolas Pitre wrote:
> From: Nicolas Pitre <npitre@baylibre.com>
> 
> This will make it easier to maintain. Also make it depend on
> CONFIG_CONSOLE_TRANSLATIONS.
...
> --- a/include/linux/consolemap.h
> +++ b/include/linux/consolemap.h
...
> @@ -57,6 +58,11 @@ static inline int conv_uni_to_8bit(u32 uni)
>   }
>   
>   static inline void console_map_init(void) { }
> +
> +static inline bool ucs_is_double_width(uint32_t cp)
> +{
> +	return false;
> +}

Is this inline necessary? I assume ucs_is_double_width() won't be called 
outside CONFIG_CONSOLE_TRANSLATIONS?

thanks,
-- 
js
suse labs


* Re: [PATCH 03/11] vt: properly support zero-width Unicode code points
  2025-04-10  1:13 ` [PATCH 03/11] vt: properly support zero-width Unicode code points Nicolas Pitre
@ 2025-04-14  6:51   ` Jiri Slaby
  2025-04-15 19:06     ` Nicolas Pitre
  0 siblings, 1 reply; 27+ messages in thread
From: Jiri Slaby @ 2025-04-14  6:51 UTC (permalink / raw)
  To: Nicolas Pitre, Greg Kroah-Hartman
  Cc: Nicolas Pitre, Dave Mielke, linux-serial, linux-kernel

On 10. 04. 25, 3:13, Nicolas Pitre wrote:
> From: Nicolas Pitre <npitre@baylibre.com>
> 
> Zero-width Unicode code points are causing misalignment in vertically
> aligned content, disrupting the visual layout. Let's handle zero-width
> code points more intelligently.
...
> --- a/drivers/tty/vt/vt.c
> +++ b/drivers/tty/vt/vt.c
> @@ -443,6 +443,15 @@ static void vc_uniscr_scroll(struct vc_data *vc, unsigned int top,
>   	}
>   }
>   
> +static u32 vc_uniscr_getc(struct vc_data *vc, int relative_pos)
> +{
> +	int pos = vc->state.x + vc->vc_need_wrap + relative_pos;
> +
> +	if (vc->vc_uni_lines && pos >= 0 && pos < vc->vc_cols)

So that is:
   in_range(pos, 0, vc->vc_cols)
right?

> +		return vc->vc_uni_lines[vc->state.y][pos];
> +	return 0;
> +}
> +
>   static void vc_uniscr_copy_area(u32 **dst_lines,
>   				unsigned int dst_cols,
>   				unsigned int dst_rows,
> @@ -2905,18 +2914,49 @@ static bool vc_is_control(struct vc_data *vc, int tc, int c)
>   	return false;
>   }
>   
> +static void vc_con_rewind(struct vc_data *vc)
> +{
> +	if (vc->state.x && !vc->vc_need_wrap) {
> +		vc->vc_pos -= 2;
> +		vc->state.x--;
> +	}
> +	vc->vc_need_wrap = 0;
> +}
> +
>   static int vc_con_write_normal(struct vc_data *vc, int tc, int c,
>   		struct vc_draw_region *draw)
>   {
> -	int next_c;
> +	int next_c, prev_c;
>   	unsigned char vc_attr = vc->vc_attr;
>   	u16 himask = vc->vc_hi_font_mask, charmask = himask ? 0x1ff : 0xff;
>   	u8 width = 1;
>   	bool inverse = false;
>   
>   	if (vc->vc_utf && !vc->vc_disp_ctrl) {
> -		if (ucs_is_double_width(c))
> +		if (ucs_is_double_width(c)) {
>   			width = 2;
> +		} else if (ucs_is_zero_width(c)) {
> +			prev_c = vc_uniscr_getc(vc, -1);
> +			if (prev_c == ' ' &&
> +			    ucs_is_double_width(vc_uniscr_getc(vc, -2))) {
> +				/*
> +				 * Let's merge this zero-width code point with
> +				 * the preceding double-width code point by
> +				 * replacing the existing whitespace padding.
> +				 */
> +				vc_con_rewind(vc);
> +			} else if (c == 0xfe0f && prev_c != 0) {
> +				/*
> +				 * VS16 (U+FE0F) is special. Let it have a
> +				 * width of 1 when preceded by a single-width
> +				 * code point effectively making the later
> +				 * double-width.
> +				 */
> +			} else {
> +				/* Otherwise zero-width code points are ignored */
> +				goto out;
> +			}
> +		}

Please, extract this width evaluation to a separate function.

...
> --- a/include/linux/consolemap.h
> +++ b/include/linux/consolemap.h
...
> @@ -63,6 +68,11 @@ static inline bool ucs_is_double_width(uint32_t cp)
>   {
>   	return false;
>   }
> +
> +static inline bool ucs_is_zero_width(uint32_t cp)
> +{
> +	return false;
> +}

Again, is this necessary?

thanks,
-- 
js
suse labs
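
[Editor's sketch, not part of the thread.] One possible shape for the extraction requested above, modelled in Python rather than kernel C. It takes the current code point plus the two preceding cells and returns a width and an action, following the branches of the quoted hunk. The name `classify`, the action strings, and the toy `DOUBLE`/`ZERO` sets are hypothetical stand-ins for `ucs_is_double_width()` and `ucs_is_zero_width()`.

```python
VS16 = 0xFE0F
PAD = ord(" ")          # padding cell value used by this patch

# Toy stand-ins for the real width tables.
DOUBLE = {0x65E5}           # one CJK ideograph
ZERO = {0x0301, VS16}       # combining acute accent, variation selector 16


def classify(c, prev_c, prev_prev_c):
    """Return (width, action) for code point c.

    action is "write" (store normally), "rewind" (overwrite the padding
    cell of the preceding double-width code point) or "ignore" (drop c).
    """
    if c in DOUBLE:
        return 2, "write"
    if c in ZERO:
        if prev_c == PAD and prev_prev_c in DOUBLE:
            return 1, "rewind"
        if c == VS16 and prev_c != 0:
            # VS16 after a single-width code point takes one cell itself.
            return 1, "write"
        return 0, "ignore"
    return 1, "write"
```

Keeping the decision separate from the side effects (`vc_con_rewind()`, `goto out`) leaves `vc_con_write_normal()` with a single dispatch on the returned action.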


* Re: [PATCH 04/11] vt: introduce gen_ucs_width.py to create ucs_width.c
  2025-04-10  1:13 ` [PATCH 04/11] vt: introduce gen_ucs_width.py to create ucs_width.c Nicolas Pitre
@ 2025-04-14  7:04   ` Jiri Slaby
  2025-04-15 19:13     ` Nicolas Pitre
  0 siblings, 1 reply; 27+ messages in thread
From: Jiri Slaby @ 2025-04-14  7:04 UTC (permalink / raw)
  To: Nicolas Pitre, Greg Kroah-Hartman
  Cc: Nicolas Pitre, Dave Mielke, linux-serial, linux-kernel

On 10. 04. 25, 3:13, Nicolas Pitre wrote:
> From: Nicolas Pitre <npitre@baylibre.com>
> 
> The table in the current ucs_width.c is terribly out of date and
> incomplete. We also need a second table to store zero-width code points.
> Properly maintaining those tables manually is impossible. So here's a
> script to automatically generate them.
> 
> Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
> ---
>   drivers/tty/vt/gen_ucs_width.py | 264 ++++++++++++++++++++++++++++++++
>   1 file changed, 264 insertions(+)
>   create mode 100755 drivers/tty/vt/gen_ucs_width.py
> 
> diff --git a/drivers/tty/vt/gen_ucs_width.py b/drivers/tty/vt/gen_ucs_width.py
> new file mode 100755
> index 0000000000..41997fe001
> --- /dev/null
> +++ b/drivers/tty/vt/gen_ucs_width.py
> @@ -0,0 +1,264 @@
> +#!/usr/bin/env python3
> +# SPDX-License-Identifier: GPL-2.0
> +#
> +# This script uses Python's unicodedata module to generate ucs_width.c

That is obvious, no need for the comment, IMO :).

> +import unicodedata
> +import sys
> +
> +def generate_ucs_width():
> +    # Output file name
> +    c_file = "ucs_width.c"
> +
> +    # Width data mapping
> +    width_map = {}  # Maps code points to width (0, 1, 2)
> +
> +    # Define emoji modifiers and components that should have zero width
> +    emoji_zero_width = [
> +        # Skin tone modifiers
> +        (0x1F3FB, 0x1F3FF),  # Emoji modifiers (skin tones)
> +
> +        # Variation selectors (note: VS16 is treated specially in vt.c)
> +        (0xFE00, 0xFE0F),    # Variation Selectors 1-16
> +
> +        # Gender and hair style modifiers
> +        (0x2640, 0x2640),    # Female sign
> +        (0x2642, 0x2642),    # Male sign
> +        (0x26A7, 0x26A7),    # Transgender symbol
> +        (0x1F9B0, 0x1F9B3),  # Hair components (red, curly, white, bald)
> +
> +        # Tag characters
> +        (0xE0020, 0xE007E),  # Tags
> +    ]
> +
> +    # Mark these emoji modifiers as zero-width
> +    for start, end in emoji_zero_width:
> +        for cp in range(start, end + 1):
> +            try:
> +                width_map[cp] = 0
> +            except (ValueError, OverflowError):

When can this happen and why is it not fatal?

> +                continue
> +
> +    # Mark all regional indicators as single-width as they are usually paired
> +    # providing a combined with of 2.

s/with/width/

> +    regional_indicators = (0x1F1E6, 0x1F1FF)  # Regional indicator symbols A-Z
> +    start, end = regional_indicators
> +    for cp in range(start, end + 1):
> +        try:
> +            width_map[cp] = 1
> +        except (ValueError, OverflowError):
> +            continue
> +
> +    # Process all assigned Unicode code points (Basic Multilingual Plane + Supplementary Planes)
> +    # Range 0x0 to 0x10FFFF (the full Unicode range)
> +    for block_start in range(0, 0x110000, 0x1000):
> +        block_end = block_start + 0x1000
> +        for cp in range(block_start, block_end):
> +            try:
> +                char = chr(cp)
> +
> +                # Skip if already processed
> +                if cp in width_map:
> +                    continue
> +
> +                # Check if the character is a combining mark
> +                category = unicodedata.category(char)
> +
> +                # Combining marks, format characters, zero-width characters
> +                if (category.startswith('M') or  # Mark (combining)
> +                    (category == 'Cf' and cp not in (0x061C, 0x06DD, 0x070F, 0x180E, 0x200F, 0x202E, 0x2066, 0x2067, 0x2068, 0x2069)) or
> +                    cp in (0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF)):  # Known zero-width characters

Convert this if to a function.

> +                    width_map[cp] = 0
> +                    continue
> +
> +                # Use East Asian Width property
> +                eaw = unicodedata.east_asian_width(char)
> +
> +                if eaw in ('F', 'W'):  # Fullwidth or Wide
> +                    width_map[cp] = 2
> +                elif eaw in ('Na', 'H', 'N', 'A'):  # Narrow, Halfwidth, Neutral, Ambiguous
> +                    width_map[cp] = 1
> +                else:
> +                    # Default to single-width for unknown
> +                    width_map[cp] = 1
> +
> +            except (ValueError, OverflowError):
> +                # Skip invalid code points
> +                continue
> +
> +    # Process Emoji - generally double-width
> +    # Ranges according to Unicode Emoji standard

No capital in "ranges".

"to the Unicode Emoji standard"

> +    emoji_ranges = [
> +        (0x1F000, 0x1F02F),  # Mahjong Tiles
> +        (0x1F0A0, 0x1F0FF),  # Playing Cards
> +        (0x1F300, 0x1F5FF),  # Miscellaneous Symbols and Pictographs
> +        (0x1F600, 0x1F64F),  # Emoticons
> +        (0x1F680, 0x1F6FF),  # Transport and Map Symbols
> +        (0x1F700, 0x1F77F),  # Alchemical Symbols
> +        (0x1F780, 0x1F7FF),  # Geometric Shapes Extended
> +        (0x1F800, 0x1F8FF),  # Supplemental Arrows-C
> +        (0x1F900, 0x1F9FF),  # Supplemental Symbols and Pictographs
> +        (0x1FA00, 0x1FA6F),  # Chess Symbols
> +        (0x1FA70, 0x1FAFF),  # Symbols and Pictographs Extended-A
> +    ]
> +
> +    for start, end in emoji_ranges:
> +        for cp in range(start, end + 1):
> +            if cp not in width_map or width_map[cp] != 0:  # Don't override zero-width
> +                try:
> +                    char = chr(cp)
> +                    width_map[cp] = 2
> +                except (ValueError, OverflowError):
> +                    continue
> +
> +    # Optimize to create range tables
> +    def ranges_optimize(width_data, target_width):
> +        points = sorted([cp for cp, width in width_data.items() if width == target_width])
> +        if not points:
> +            return []
> +
> +        # Group consecutive code points into ranges
> +        ranges = []
> +        start = points[0]
> +        prev = start
> +
> +        for cp in points[1:]:
> +            if cp > prev + 1:
> +                ranges.append((start, prev))
> +                start = cp
> +            prev = cp
> +
> +        # Add the last range
> +        ranges.append((start, prev))
> +        return ranges
> +
> +    # Extract ranges for each width
> +    zero_width_ranges = ranges_optimize(width_map, 0)
> +    double_width_ranges = ranges_optimize(width_map, 2)
> +
> +    # Get Unicode version information
> +    unicode_version = unicodedata.unidata_version
> +
> +    # Generate C implementation file
> +    with open(c_file, 'w') as f:
> +        f.write(f"""\

Why this backslash?

> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * ucs_width.c - Unicode character width lookup
> + *
> + * Auto-generated by gen_ucs_width.py
> + *
> + * Unicode Version: {unicode_version}
> + */
> +
> +#include <linux/types.h>
> +#include <linux/array_size.h>
> +#include <linux/bsearch.h>
> +#include <linux/consolemap.h>

Pls sort includes alphabetically.

> +
> +struct interval {{
> +	uint32_t first;
> +	uint32_t last;
> +}};
> +
> +/* Zero-width character ranges */
> +static const struct interval zero_width_ranges[] = {{
> +""")
> +
> +        for start, end in zero_width_ranges:
> +            try:
> +                start_char_desc = unicodedata.name(chr(start)) if start < 0x10000 else f"U+{start:05X}"
> +                if start == end:
> +                    comment = f"/* {start_char_desc} */"
> +                else:
> +                    end_char_desc = unicodedata.name(chr(end)) if end < 0x10000 else f"U+{end:05X}"
> +                    comment = f"/* {start_char_desc} - {end_char_desc} */"
> +            except:
> +                if start == end:
> +                    comment = f"/* U+{start:05X} */"
> +                else:
> +                    comment = f"/* U+{start:05X} - U+{end:05X} */"
> +
> +            f.write(f"\t{{ 0x{start:05X}, 0x{end:05X} }}, {comment}\n")
> +
> +        f.write("""\
> +};
> +
> +/* Double-width character ranges */
> +static const struct interval double_width_ranges[] = {
> +""")
> +
> +        for start, end in double_width_ranges:
> +            try:
> +                start_char_desc = unicodedata.name(chr(start)) if start < 0x10000 else f"U+{start:05X}"
> +                if start == end:
> +                    comment = f"/* {start_char_desc} */"
> +                else:
> +                    end_char_desc = unicodedata.name(chr(end)) if end < 0x10000 else f"U+{end:05X}"
> +                    comment = f"/* {start_char_desc} - {end_char_desc} */"
> +            except:
> +                if start == end:
> +                    comment = f"/* U+{start:05X} */"
> +                else:
> +                    comment = f"/* U+{start:05X} - U+{end:05X} */"
> +
> +            f.write(f"\t{{ 0x{start:05X}, 0x{end:05X} }}, {comment}\n")
> +
> +        f.write("""\
> +};
> +
> +
> +static int ucs_cmp(const void *key, const void *element)
> +{
> +	uint32_t cp = *(uint32_t *)key;
> +	const struct interval *e = element;
> +
> +	if (cp > e->last)
> +		return 1;
> +	if (cp < e->first)
> +		return -1;
> +	return 0;
> +}
> +
> +static bool is_in_interval(uint32_t cp, const struct interval *intervals, size_t count)
> +{
> +	if (cp < intervals[0].first || cp > intervals[count - 1].last)
> +		return false;
> +
> +	return __inline_bsearch(&cp, intervals, count,
> +				sizeof(*intervals), ucs_cmp) != NULL;
> +}
> +
> +/**
> + * Determine if a Unicode code point is zero-width.
> + *
> + * @param ucs: Unicode code point (UCS-4)
> + * Return: true if the character is zero-width, false otherwise
> + */
> +bool ucs_is_zero_width(uint32_t cp)
> +{
> +	return is_in_interval(cp, zero_width_ranges, ARRAY_SIZE(zero_width_ranges));
> +}
> +
> +/**
> + * Determine if a Unicode code point is double-width.
> + *
> + * @param ucs: Unicode code point (UCS-4)
> + * Return: true if the character is double-width, false otherwise
> + */
> +bool ucs_is_double_width(uint32_t cp)
> +{
> +	return is_in_interval(cp, double_width_ranges, ARRAY_SIZE(double_width_ranges));
> +}
> +""")
> +
> +    # Print summary
> +    zero_width_count = sum(end - start + 1 for start, end in zero_width_ranges)
> +    double_width_count = sum(end - start + 1 for start, end in double_width_ranges)
> +
> +    print(f"Generated {c_file} with:")
> +    print(f"- {len(zero_width_ranges)} zero-width ranges covering ~{zero_width_count} code points")
> +    print(f"- {len(double_width_ranges)} double-width ranges covering ~{double_width_count} code points")
> +
> +if __name__ == "__main__":

Will this be a lib at some point?

> +    generate_ucs_width()


I wonder whether you could generate only zero_width_ranges[] into some 
generated.c and maintain the C functions in the kernel the standard 
way, including that generated.c. That is, not having C functions in a 
Python script.

thanks,
-- 
js
suse labs
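
The split suggested in this review could look roughly like the following in user space. This is a minimal sketch, not the kernel code. The table stands in for what the script would emit into a generated file, and its three ranges are a hand-picked excerpt. The standard bsearch() replaces the kernel's __inline_bsearch().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct interval {
	uint32_t first;
	uint32_t last;
};

/*
 * Stand-in for the generated part. In the proposed layout only this table
 * would come out of the script; the ranges here are an illustrative excerpt.
 */
static const struct interval zero_width_ranges[] = {
	{ 0x0300, 0x036F },	/* combining diacritical marks */
	{ 0x200B, 0x200F },	/* zero width space .. right-to-left mark */
	{ 0xFE00, 0xFE0F },	/* variation selectors 1-16 */
};

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hand-maintained part: comparison callback for bsearch(). */
static int ucs_cmp(const void *key, const void *element)
{
	uint32_t cp = *(const uint32_t *)key;
	const struct interval *e = element;

	if (cp > e->last)
		return 1;
	if (cp < e->first)
		return -1;
	return 0;
}

static bool is_in_interval(uint32_t cp, const struct interval *intervals,
			   size_t count)
{
	assert(count > 0);

	/* Quick rejection before searching the sorted table. */
	if (cp < intervals[0].first || cp > intervals[count - 1].last)
		return false;

	return bsearch(&cp, intervals, count,
		       sizeof(*intervals), ucs_cmp) != NULL;
}

bool ucs_is_zero_width(uint32_t cp)
{
	return is_in_interval(cp, zero_width_ranges,
			      ARRAY_SIZE(zero_width_ranges));
}
```

With this layout, regenerating the table for a new Unicode version would not touch the lookup functions at all.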


* Re: [PATCH 06/11] vt: introduce gen_ucs_recompose.py to create ucs_recompose.c
  2025-04-10  1:13 ` [PATCH 06/11] vt: introduce gen_ucs_recompose.py to create ucs_recompose.c Nicolas Pitre
@ 2025-04-14  7:08   ` Jiri Slaby
  0 siblings, 0 replies; 27+ messages in thread
From: Jiri Slaby @ 2025-04-14  7:08 UTC (permalink / raw)
  To: Nicolas Pitre, Greg Kroah-Hartman
  Cc: Nicolas Pitre, Dave Mielke, linux-serial, linux-kernel

On 10. 04. 25, 3:13, Nicolas Pitre wrote:
> From: Nicolas Pitre <npitre@baylibre.com>
> 
> The generated code includes a table that maps base character + combining
> mark pairs to their precomposed equivalents using Python's unicodedata
> module. It also provides the ucs_recompose() function to query that
> table.
> 
> The default script behavior is to create a table with most commonly used
> Latin, Greek, and Cyrillic recomposition pairs only. It is much smaller
> than the table with all possible recomposition pairs (71 entries vs 1000
> entries). But if one needs/wants the full table then simply running the
> script with the --full argument will generate it.
> 
> Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
> ---
>   drivers/tty/vt/gen_ucs_recompose.py | 321 ++++++++++++++++++++++++++++
>   1 file changed, 321 insertions(+)
>   create mode 100755 drivers/tty/vt/gen_ucs_recompose.py
> 
> diff --git a/drivers/tty/vt/gen_ucs_recompose.py b/drivers/tty/vt/gen_ucs_recompose.py
> new file mode 100755
> index 0000000000..64418803e4
> --- /dev/null
> +++ b/drivers/tty/vt/gen_ucs_recompose.py
...
> +struct compare_key {{
> +	uint16_t base;
> +	uint16_t combining;
> +}};
> +
> +static int recomposition_compare(const void *key, const void *element)
> +{{
> +	const struct compare_key *search_key = key;
> +	const struct recomposition *table_entry = element;
> +
> +	/* Compare base character first */
> +	if (search_key->base < table_entry->base)
> +		return -1;
> +	if (search_key->base > table_entry->base)
> +		return 1;
> +
> +	/* Base characters match, now compare combining character */
> +	if (search_key->combining < table_entry->combining)
> +		return -1;
> +	if (search_key->combining > table_entry->combining)
> +		return 1;
> +
> +	/* Both match */
> +	return 0;
> +}}
> +
> +/**
> + * Attempt to recompose two Unicode characters into a single character.
> + *
> + * @param previous: Previous Unicode code point (UCS-4)
> + * @param current: Current Unicode code point (UCS-4)
> + * Return: Recomposed Unicode code point, or 0 if no recomposition is possible
> + */
> +uint32_t ucs_recompose(uint32_t base, uint32_t combining)
> +{{
> +	/* Check if characters are within the range of our table */
> +	if (base < MIN_BASE_CHAR || base > MAX_BASE_CHAR ||
> +	    combining < MIN_COMBINING_CHAR || combining > MAX_COMBINING_CHAR)
> +		return 0;
> +
> +	struct compare_key key = {{ base, combining }};
> +
> +	struct recomposition *result =
> +		__inline_bsearch(&key, recomposition_table,
> +				 ARRAY_SIZE(recomposition_table),
> +				 sizeof(*recomposition_table),
> +				 recomposition_compare);
> +
> +	return result ? result->recomposed : 0;
> +}}

Again, I see no reason to maintain C functions in a Python script.

thanks,
-- 
js
suse labs
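
For reference, the lookup quoted in this review can be exercised in user space as follows. This is only a sketch under stated assumptions: the four table entries are an illustrative excerpt rather than the generated table, the MIN/MAX bounds are derived by hand from that excerpt, the field types of struct recomposition are guessed, and the standard bsearch() stands in for __inline_bsearch().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct recomposition {
	uint16_t base;
	uint16_t combining;
	uint16_t recomposed;
};

/* Illustrative excerpt, sorted by base and then by combining mark. */
static const struct recomposition recomposition_table[] = {
	{ 0x0041, 0x0300, 0x00C0 },	/* A + grave     -> A with grave */
	{ 0x0041, 0x0301, 0x00C1 },	/* A + acute     -> A with acute */
	{ 0x0065, 0x0301, 0x00E9 },	/* e + acute     -> e with acute */
	{ 0x006F, 0x0308, 0x00F6 },	/* o + diaeresis -> o with diaeresis */
};

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Bounds of the excerpt above, used for quick rejection. */
#define MIN_BASE_CHAR		0x0041
#define MAX_BASE_CHAR		0x006F
#define MIN_COMBINING_CHAR	0x0300
#define MAX_COMBINING_CHAR	0x0308

struct compare_key {
	uint16_t base;
	uint16_t combining;
};

static int recomposition_compare(const void *key, const void *element)
{
	const struct compare_key *search_key = key;
	const struct recomposition *table_entry = element;

	/* Compare the base character first, then the combining mark. */
	if (search_key->base != table_entry->base)
		return search_key->base < table_entry->base ? -1 : 1;
	if (search_key->combining != table_entry->combining)
		return search_key->combining < table_entry->combining ? -1 : 1;
	return 0;
}

/* Returns the precomposed code point, or 0 if the pair is not in the table. */
uint32_t ucs_recompose(uint32_t base, uint32_t combining)
{
	struct compare_key key;
	const struct recomposition *result;

	if (base < MIN_BASE_CHAR || base > MAX_BASE_CHAR ||
	    combining < MIN_COMBINING_CHAR || combining > MAX_COMBINING_CHAR)
		return 0;

	/* The range check above guarantees both values fit in 16 bits. */
	key.base = (uint16_t)base;
	key.combining = (uint16_t)combining;

	result = bsearch(&key, recomposition_table,
			 ARRAY_SIZE(recomposition_table),
			 sizeof(*recomposition_table), recomposition_compare);

	return result ? result->recomposed : 0;
}
```

Nothing in these functions depends on the table contents beyond the sort order and the four bounds, which is what makes them easy to keep in a regular C file.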


* Re: [PATCH 09/11] vt: update gen_ucs_width.py to produce more space efficient tables
  2025-04-10  1:14 ` [PATCH 09/11] vt: update gen_ucs_width.py to produce more space efficient tables Nicolas Pitre
@ 2025-04-14  7:14   ` Jiri Slaby
  2025-04-15 19:16     ` Nicolas Pitre
  0 siblings, 1 reply; 27+ messages in thread
From: Jiri Slaby @ 2025-04-14  7:14 UTC (permalink / raw)
  To: Nicolas Pitre, Greg Kroah-Hartman
  Cc: Nicolas Pitre, Dave Mielke, linux-serial, linux-kernel

On 10. 04. 25, 3:14, Nicolas Pitre wrote:
> From: Nicolas Pitre <npitre@baylibre.com>
> 
> Split table ranges into BMP (16-bit) and non-BMP (above 16-bit).
> This reduces the corresponding text size by 20-25%.

I like this!

> -struct interval {{
> +struct interval16 {{
> +	uint16_t first;
> +	uint16_t last;
> +}};
> +
> +struct interval32 {{
>   	uint32_t first;
>   	uint32_t last;

Actually, why not use u16 and u32, respectively?

thanks,
-- 
js
suse labs


* Re: [PATCH 10/11] vt: update ucs_width.c following latest gen_ucs_width.py
  2025-04-10  1:14 ` [PATCH 10/11] vt: update ucs_width.c following latest gen_ucs_width.py Nicolas Pitre
@ 2025-04-14  7:17   ` Jiri Slaby
  0 siblings, 0 replies; 27+ messages in thread
From: Jiri Slaby @ 2025-04-14  7:17 UTC (permalink / raw)
  To: Nicolas Pitre, Greg Kroah-Hartman
  Cc: Nicolas Pitre, Dave Mielke, linux-serial, linux-kernel

On 10. 04. 25, 3:14, Nicolas Pitre wrote:
> From: Nicolas Pitre <npitre@baylibre.com>
> 
> Split table ranges into BMP (16-bit) and non-BMP (above 16-bit).
> This reduces the corresponding text size by 20-25%.
...
> @@ -483,7 +517,9 @@ static bool is_in_interval(uint32_t cp, const struct interval *intervals, size_t
>    */
>   bool ucs_is_zero_width(uint32_t cp)
>   {
> -	return is_in_interval(cp, zero_width_ranges, ARRAY_SIZE(zero_width_ranges));
> +	return (cp <= 0xFFFF)

This calls for an is_bmp() helper.

And then the classic way:
if (is_bmp())
   return is_in_interval16();

return is_in_interval32();

thanks,
-- 
js
suse labs
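
The helper suggested in this review might be sketched as below, in user space. The names is_bmp(), is_in_interval16() and is_in_interval32() come from the review comment. Everything else is an assumption made for illustration: the table names, the few ranges in them, and the use of the standard bsearch().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint16_t u16;	/* user-space stand-ins for the kernel types */
typedef uint32_t u32;

struct interval16 { u16 first; u16 last; };
struct interval32 { u32 first; u32 last; };

/* Illustrative excerpts only, not the generated tables. */
static const struct interval16 double_width_bmp_ranges[] = {
	{ 0x1100, 0x115F },	/* Hangul Jamo initial consonants */
	{ 0x4E00, 0x9FFF },	/* CJK unified ideographs */
	{ 0xAC00, 0xD7A3 },	/* Hangul syllables */
};

static const struct interval32 double_width_non_bmp_ranges[] = {
	{ 0x1F600, 0x1F64F },	/* emoticons */
	{ 0x20000, 0x2FFFD },	/* supplementary ideographic plane */
};

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

static int cmp16(const void *key, const void *element)
{
	u16 cp = *(const u16 *)key;
	const struct interval16 *e = element;

	return cp > e->last ? 1 : cp < e->first ? -1 : 0;
}

static int cmp32(const void *key, const void *element)
{
	u32 cp = *(const u32 *)key;
	const struct interval32 *e = element;

	return cp > e->last ? 1 : cp < e->first ? -1 : 0;
}

static bool is_in_interval16(u16 cp, const struct interval16 *intervals,
			     size_t count)
{
	return bsearch(&cp, intervals, count, sizeof(*intervals), cmp16) != NULL;
}

static bool is_in_interval32(u32 cp, const struct interval32 *intervals,
			     size_t count)
{
	return bsearch(&cp, intervals, count, sizeof(*intervals), cmp32) != NULL;
}

/* True for code points in the Basic Multilingual Plane. */
static bool is_bmp(u32 cp)
{
	return cp <= 0xFFFF;
}

bool ucs_is_double_width(u32 cp)
{
	if (is_bmp(cp))
		return is_in_interval16((u16)cp, double_width_bmp_ranges,
					ARRAY_SIZE(double_width_bmp_ranges));

	return is_in_interval32(cp, double_width_non_bmp_ranges,
				ARRAY_SIZE(double_width_non_bmp_ranges));
}
```

The cast to u16 is safe only because it happens after the is_bmp() test, which is one argument for giving that test a name.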


* Re: [PATCH 11/11] vt: pad double-width code points with a zero-white-space
  2025-04-10  1:14 ` [PATCH 11/11] vt: pad double-width code points with a zero-white-space Nicolas Pitre
@ 2025-04-14  7:18   ` Jiri Slaby
  0 siblings, 0 replies; 27+ messages in thread
From: Jiri Slaby @ 2025-04-14  7:18 UTC (permalink / raw)
  To: Nicolas Pitre, Greg Kroah-Hartman
  Cc: Nicolas Pitre, Dave Mielke, linux-serial, linux-kernel

On 10. 04. 25, 3:14, Nicolas Pitre wrote:
> From: Nicolas Pitre <npitre@baylibre.com>
> 
> In the Unicode screen buffer, we follow double-width code points with a
> space to maintain proper column alignment. This, however, creates
> semantic problems when e.g. using cut and paste or selection.
> 
> Let's use a better code point for the column padding's purpose i.e. a
> zero-white-space rather than a full space.
> 
> Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
> ---
>   drivers/tty/vt/vt.c | 11 ++++++++---
>   1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
> index e3d35c4f92..dc84f9c6b7 100644
> --- a/drivers/tty/vt/vt.c
> +++ b/drivers/tty/vt/vt.c
> @@ -2937,12 +2937,13 @@ static int vc_con_write_normal(struct vc_data *vc, int tc, int c,
>   			width = 2;
>   		} else if (ucs_is_zero_width(c)) {
>   			prev_c = vc_uniscr_getc(vc, -1);
> -			if (prev_c == ' ' &&
> +			if (prev_c == 0x200B &&

Then introduce a NAME (macro) for this.

thanks,
-- 
js
suse labs
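
A named constant for the padding code point could be as small as the sketch below. Both the macro name and the helper are hypothetical; the thread only asks for a name and does not say which one was chosen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name for U+200B ZERO WIDTH SPACE. */
#define UCS_ZERO_WIDTH_SPACE	0x200B

/*
 * Hypothetical helper: true if prev_c is the padding stored after a
 * double-width code point in the Unicode screen buffer.
 */
bool is_double_width_padding(uint32_t prev_c)
{
	return prev_c == UCS_ZERO_WIDTH_SPACE;
}
```

The point of the name is that the condition in vc_con_write_normal() then states what it looks for instead of showing a bare 0x200B.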


* Re: [PATCH 02/11] vt: move unicode processing to a separate file
  2025-04-14  6:47   ` Jiri Slaby
@ 2025-04-15 19:03     ` Nicolas Pitre
  0 siblings, 0 replies; 27+ messages in thread
From: Nicolas Pitre @ 2025-04-15 19:03 UTC (permalink / raw)
  To: Jiri Slaby; +Cc: Greg Kroah-Hartman, linux-serial, linux-kernel

On Mon, 14 Apr 2025, Jiri Slaby wrote:

> On 10. 04. 25, 3:13, Nicolas Pitre wrote:
> > From: Nicolas Pitre <npitre@baylibre.com>
> > 
> > This will make it easier to maintain. Also make it depend on
> > CONFIG_CONSOLE_TRANSLATIONS.
> ...
> > --- a/include/linux/consolemap.h
> > +++ b/include/linux/consolemap.h
> ...
> > @@ -57,6 +58,11 @@ static inline int conv_uni_to_8bit(u32 uni)
> >   }
> >   
> >   static inline void console_map_init(void) { }
> > +
> > +static inline bool ucs_is_double_width(uint32_t cp)
> > +{
> > +	return false;
> > +}
> 
> Is this inline necessary? I assume ucs_is_double_width() won't be called
> outside CONFIG_CONSOLE_TRANSLATIONS?

It is called outside CONFIG_CONSOLE_TRANSLATIONS, as are the other 
functions stubbed out in this header file.



* Re: [PATCH 03/11] vt: properly support zero-width Unicode code points
  2025-04-14  6:51   ` Jiri Slaby
@ 2025-04-15 19:06     ` Nicolas Pitre
  0 siblings, 0 replies; 27+ messages in thread
From: Nicolas Pitre @ 2025-04-15 19:06 UTC (permalink / raw)
  To: Jiri Slaby; +Cc: Greg Kroah-Hartman, linux-serial, linux-kernel

On Mon, 14 Apr 2025, Jiri Slaby wrote:

> On 10. 04. 25, 3:13, Nicolas Pitre wrote:
> > From: Nicolas Pitre <npitre@baylibre.com>
> > 
> > Zero-width Unicode code points are causing misalignment in vertically
> > aligned content, disrupting the visual layout. Let's handle zero-width
> > code points more intelligently.
> ...
> > --- a/drivers/tty/vt/vt.c
> > +++ b/drivers/tty/vt/vt.c
> > @@ -443,6 +443,15 @@ static void vc_uniscr_scroll(struct vc_data *vc,
> > unsigned int top,
> >   	}
> >   }
> >   
> > +static u32 vc_uniscr_getc(struct vc_data *vc, int relative_pos)
> > +{
> > +	int pos = vc->state.x + vc->vc_need_wrap + relative_pos;
> > +
> > +	if (vc->vc_uni_lines && pos >= 0 && pos < vc->vc_cols)
> 
> So that is:
>   in_range(pos, 0, vc->vc_cols)
> right?

Good idea. Didn't know about that one.
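
For readers who have not met that helper either, the idea can be sketched in user space as follows. in_range_u32() is a stand-in written for this example and modeled on the kernel's in_range(val, start, len), which tests start <= val < start + len. The two pos_is_valid functions are hypothetical wrappers, and cols is assumed to be non-negative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True when start <= val < start + len. The unsigned subtraction wraps
 * around for val < start, so a single comparison covers both bounds.
 */
bool in_range_u32(uint32_t val, uint32_t start, uint32_t len)
{
	return (uint32_t)(val - start) < len;
}

/* The open-coded test from the quoted patch, for comparison. */
bool pos_is_valid_open_coded(int pos, int cols)
{
	return pos >= 0 && pos < cols;
}

/* Same test through the range helper; a negative pos wraps to a huge value. */
bool pos_is_valid(int pos, int cols)
{
	return in_range_u32((uint32_t)pos, 0, (uint32_t)cols);
}
```

Both forms agree for any non-negative column count, so the change is purely about readability.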

> >   	if (vc->vc_utf && !vc->vc_disp_ctrl) {
> > -		if (ucs_is_double_width(c))
> > +		if (ucs_is_double_width(c)) {
> >   			width = 2;
> > +		} else if (ucs_is_zero_width(c)) {
> > +			prev_c = vc_uniscr_getc(vc, -1);
> > +			if (prev_c == ' ' &&
> > +			    ucs_is_double_width(vc_uniscr_getc(vc, -2))) {
> > +				/*
> > +				 * Let's merge this zero-width code point with
> > +				 * the preceding double-width code point by
> > +				 * replacing the existing whitespace padding.
> > +				 */
> > +				vc_con_rewind(vc);
> > +			} else if (c == 0xfe0f && prev_c != 0) {
> > +				/*
> > +				 * VS16 (U+FE0F) is special. Let it have a
> > +				 * width of 1 when preceded by a single-width
> > +				 * code point effectively making the later
> > +				 * double-width.
> > +				 */
> > +			} else {
> > +				/* Otherwise zero-width code points are
> > ignored */
> > +				goto out;
> > +			}
> > +		}
> 
> Please, extract this width evaluation to a separate function.

Done.
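
The thread does not show the extracted function, so the following is only a guess at its shape, written as a user-space sketch. The function name, the result codes and the two tiny lookup stand-ins are all hypothetical. The decision logic mirrors the quoted hunk, and 0 is assumed to mean that no previous code point is available.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Tiny stand-ins for the table lookups, just enough for the example. */
static bool ucs_is_double_width(uint32_t c)
{
	return c >= 0x4E00 && c <= 0x9FFF;	/* CJK unified ideographs */
}

static bool ucs_is_zero_width(uint32_t c)
{
	return (c >= 0x0300 && c <= 0x036F) ||	/* combining diacritics */
	       (c >= 0xFE00 && c <= 0xFE0F);	/* variation selectors */
}

/* Hypothetical result codes for the extracted helper. */
enum ucs_width_action {
	UCS_WIDTH_IGNORE = -1,	/* drop the code point */
	UCS_WIDTH_MERGE = 0,	/* rewind one cell, then overwrite the padding */
	UCS_WIDTH_SINGLE = 1,	/* occupies one column */
	UCS_WIDTH_DOUBLE = 2,	/* occupies two columns */
};

/*
 * prev_c is the code point one cell behind the cursor and prev2_c the one
 * two cells behind it.
 */
int ucs_classify_width(uint32_t c, uint32_t prev_c, uint32_t prev2_c)
{
	if (ucs_is_double_width(c))
		return UCS_WIDTH_DOUBLE;

	if (!ucs_is_zero_width(c))
		return UCS_WIDTH_SINGLE;

	/* Zero-width after a padded double-width code point: reuse the padding. */
	if (prev_c == ' ' && ucs_is_double_width(prev2_c))
		return UCS_WIDTH_MERGE;

	/* VS16 after some other code point is given a column of its own. */
	if (c == 0xFE0F && prev_c != 0)
		return UCS_WIDTH_SINGLE;

	return UCS_WIDTH_IGNORE;
}
```

Passing the two previous code points in as arguments keeps the helper free of console state, which also makes it easy to test in isolation.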


* Re: [PATCH 04/11] vt: introduce gen_ucs_width.py to create ucs_width.c
  2025-04-14  7:04   ` Jiri Slaby
@ 2025-04-15 19:13     ` Nicolas Pitre
  0 siblings, 0 replies; 27+ messages in thread
From: Nicolas Pitre @ 2025-04-15 19:13 UTC (permalink / raw)
  To: Jiri Slaby; +Cc: Greg Kroah-Hartman, linux-serial, linux-kernel

On Mon, 14 Apr 2025, Jiri Slaby wrote:

> On 10. 04. 25, 3:13, Nicolas Pitre wrote:
> > From: Nicolas Pitre <npitre@baylibre.com>
> > 
> > The table in the current ucs_width.c is terribly out of date and
> > incomplete. We also need a second table to store zero-width code points.
> > Properly maintaining those tables manually is impossible. So here's a
> > script to automatically generate them.
> > 
> > Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
> > ---
> >   drivers/tty/vt/gen_ucs_width.py | 264 ++++++++++++++++++++++++++++++++
> >   1 file changed, 264 insertions(+)
> >   create mode 100755 drivers/tty/vt/gen_ucs_width.py
> > 
> > diff --git a/drivers/tty/vt/gen_ucs_width.py
> > b/drivers/tty/vt/gen_ucs_width.py
> > new file mode 100755
> > index 0000000000..41997fe001
> > --- /dev/null
> > +++ b/drivers/tty/vt/gen_ucs_width.py
[...]
> > +    # Mark these emoji modifiers as zero-width
> > +    for start, end in emoji_zero_width:
> > +        for cp in range(start, end + 1):
> > +            try:
> > +                width_map[cp] = 0
> > +            except (ValueError, OverflowError):
> 
> When can this happen and why is it not fatal?

These are bogus leftovers. That assignment cannot fail.

The scripts have since been significantly cleaned up.

> > +    with open(c_file, 'w') as f:
> > +        f.write(f"""\
> 
> Why this backslash?

To suppress the newline that follows the opening triple quote. Otherwise 
the file would start with an empty line. Same reason elsewhere: to 
prevent spurious empty lines.

> I wonder whether you could generate only zero_width_ranges[] into some
> generated.c and maintain the C functions in the kernel the standard way,
> including that generated.c. That is, not having C functions in a Python script.

Yes, I did that. Easier to maintain in the end.


Nicolas


* Re: [PATCH 09/11] vt: update gen_ucs_width.py to produce more space efficient tables
  2025-04-14  7:14   ` Jiri Slaby
@ 2025-04-15 19:16     ` Nicolas Pitre
  0 siblings, 0 replies; 27+ messages in thread
From: Nicolas Pitre @ 2025-04-15 19:16 UTC (permalink / raw)
  To: Jiri Slaby; +Cc: Greg Kroah-Hartman, linux-serial, linux-kernel

On Mon, 14 Apr 2025, Jiri Slaby wrote:

> On 10. 04. 25, 3:14, Nicolas Pitre wrote:
> > From: Nicolas Pitre <npitre@baylibre.com>
> > 
> > Split table ranges into BMP (16-bit) and non-BMP (above 16-bit).
> > This reduces the corresponding text size by 20-25%.
> 
> I like this!
> 
> > -struct interval {{
> > +struct interval16 {{
> > +	uint16_t first;
> > +	uint16_t last;
> > +}};
> > +
> > +struct interval32 {{
> >    uint32_t first;
> >    uint32_t last;
> 
> Actually, why not use u16 and u32, respectively?

No particular reason. The kernel uses both, so I picked the one that made 
prototyping in user space easier. It is u16+u32 now.


Nicolas
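
To see where the table savings come from, the two layouts can be compared in user space. This is a rough sketch: the u16/u32 typedefs stand in for the kernel types, the two sizing helpers are hypothetical, and only table storage is counted, so the result is not the 20-25% text size figure quoted in the patch description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t u16;	/* user-space stand-ins for the kernel types */
typedef uint32_t u32;

struct interval16 {
	u16 first;
	u16 last;
};

struct interval32 {
	u32 first;
	u32 last;
};

/* Bytes needed when every range is stored as a pair of u32. */
size_t table_bytes_flat(size_t bmp_ranges, size_t non_bmp_ranges)
{
	return (bmp_ranges + non_bmp_ranges) * sizeof(struct interval32);
}

/* Bytes needed when BMP ranges are stored as pairs of u16 instead. */
size_t table_bytes_split(size_t bmp_ranges, size_t non_bmp_ranges)
{
	return bmp_ranges * sizeof(struct interval16) +
	       non_bmp_ranges * sizeof(struct interval32);
}
```

Each BMP range shrinks from 8 to 4 bytes, so the gain depends on how many of the ranges fall below U+10000.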



Thread overview: 27+ messages
2025-04-10  1:13 [PATCH 00/11] vt: implement proper Unicode handling Nicolas Pitre
2025-04-10  1:13 ` [PATCH 01/11] vt: minor cleanup to vc_translate_unicode() Nicolas Pitre
2025-04-10  1:13 ` [PATCH 02/11] vt: move unicode processing to a separate file Nicolas Pitre
2025-04-14  6:47   ` Jiri Slaby
2025-04-15 19:03     ` Nicolas Pitre
2025-04-10  1:13 ` [PATCH 03/11] vt: properly support zero-width Unicode code points Nicolas Pitre
2025-04-14  6:51   ` Jiri Slaby
2025-04-15 19:06     ` Nicolas Pitre
2025-04-10  1:13 ` [PATCH 04/11] vt: introduce gen_ucs_width.py to create ucs_width.c Nicolas Pitre
2025-04-14  7:04   ` Jiri Slaby
2025-04-15 19:13     ` Nicolas Pitre
2025-04-10  1:13 ` [PATCH 05/11] vt: update ucs_width.c using gen_ucs_width.py Nicolas Pitre
2025-04-11  3:47   ` kernel test robot
2025-04-10  1:13 ` [PATCH 06/11] vt: introduce gen_ucs_recompose.py to create ucs_recompose.c Nicolas Pitre
2025-04-14  7:08   ` Jiri Slaby
2025-04-10  1:13 ` [PATCH 07/11] vt: create ucs_recompose.c using gen_ucs_recompose.py Nicolas Pitre
2025-04-11  6:00   ` kernel test robot
2025-04-10  1:14 ` [PATCH 08/11] vt: support Unicode recomposition Nicolas Pitre
2025-04-10  1:14 ` [PATCH 09/11] vt: update gen_ucs_width.py to produce more space efficient tables Nicolas Pitre
2025-04-14  7:14   ` Jiri Slaby
2025-04-15 19:16     ` Nicolas Pitre
2025-04-10  1:14 ` [PATCH 10/11] vt: update ucs_width.c following latest gen_ucs_width.py Nicolas Pitre
2025-04-14  7:17   ` Jiri Slaby
2025-04-10  1:14 ` [PATCH 11/11] vt: pad double-width code points with a zero-white-space Nicolas Pitre
2025-04-14  7:18   ` Jiri Slaby
2025-04-10 19:38 ` [PATCH 12/11] vt: remove zero-white-space handling from conv_uni_to_pc() Nicolas Pitre
2025-04-11 14:49 ` [PATCH 00/11] vt: implement proper Unicode handling Greg Kroah-Hartman
