public inbox for linux-serial@vger.kernel.org
* [PATCH 00/11] vt: implement proper Unicode handling
@ 2025-04-10  1:13 Nicolas Pitre
From: Nicolas Pitre @ 2025-04-10  1:13 UTC
  To: Greg Kroah-Hartman, Jiri Slaby
  Cc: Nicolas Pitre, Dave Mielke, linux-serial, linux-kernel

The Linux VT console has many problems with regard to proper Unicode
handling:

- All new double-width Unicode code points which have been introduced since
  Unicode 5.0 are not recognized as such (we're at Unicode 16.0 now).

- Zero-width code points are not recognized at all. If you try to edit files
  containing a lot of emojis, you will see the rendering issues. When a line
  contains many zero-width characters (like "variation selectors"), the
  console wraps it, but any Unicode-aware editor assumes the line fits and
  its redraw logic falls out of sync with what is actually displayed.
  Combine this with tmux or screen, and the terminal becomes a huge mess.

- Text that uses combining diacritics suffers in the same way as text with
  zero-width characters, because programs expect those characters to take
  fewer columns than the console actually gives them.
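
To make the column mismatch concrete, here is a minimal Python sketch. It
is an illustration only, not code from this series: the columns() helper
and its zero_width_aware flag are hypothetical, and treating the general
categories Mn, Me and Cf as zero width is an approximation.

```python
import unicodedata

# Assumption: these general categories approximate "zero width".
ZERO_WIDTH_CATEGORIES = ("Mn", "Me", "Cf")

def columns(text, zero_width_aware=True):
    """Count the display columns text is expected to occupy.

    With zero_width_aware=False, zero-width code points are charged one
    column each, which models a console that does not recognize them.
    """
    total = 0
    for ch in text:
        if unicodedata.category(ch) in ZERO_WIDTH_CATEGORIES:
            total += 0 if zero_width_aware else 1
        elif unicodedata.east_asian_width(ch) in ("W", "F"):
            total += 2  # wide and fullwidth code points take two columns
        else:
            total += 1
    return total

accented = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
print(columns(accented), columns(accented, zero_width_aware=False))
```

An editor computing the first number while the console advances by the
second is exactly the disagreement described above; it grows with every
zero-width code point on the line.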

Some may argue that the Linux VT console is unmaintained and/or no longer
used much, and that one should use a user space terminal instead. But every
such alternative that is at least as well maintained as the Linux VT console
requires a full, heavy graphical environment, which is the exact antithesis
of what the Linux console is meant to be.

Furthermore, a significant part of the Linux console user base consists of
blind users (of whom I am one), for whom the alternatives are far more
cumbersome to use, reducing our productivity. So it has to stay and
be maintained to the best of our abilities.

That being said...

This patch series is about fixing all the above issues. This is accomplished
with some Python scripts leveraging Python's unicodedata module to generate
C code with lookup tables that is suitable for the kernel. In summary:

- The double-width code point table is updated to the latest Unicode version
  and the table itself is optimized to reduce its size.

- A zero-width code point table is created and the console code is modified
  to properly use it.

- A table with base character + combining mark pairs is created to convert
  them into their precomposed equivalents when they're encountered.
  By default the generated table contains most commonly used Latin, Greek,
  and Cyrillic recomposition pairs only, but one can execute the provided
  script with the --full argument to create a table that covers all
  possibilities. Combining marks that are not listed in the table are simply
  treated like zero-width code points and properly ignored.

- All those tables plus related lookup code require about 3500 additional
  bytes of text which is not very significant these days. Yet, one
  can still set CONFIG_CONSOLE_TRANSLATIONS=n to configure this all out
  if need be.
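
For readers who want a feel for the approach, the table generation can be
sketched as follows. This is a simplified illustration and not the content
of gen_ucs_width.py or gen_ucs_recompose.py: every function name, the
"struct ucs_range" type and the category choices are assumptions, and
Unicode composition exclusions are ignored.

```python
import unicodedata

def ranges(predicate, limit=0x110000):
    """Collapse the code points matching predicate into (first, last) pairs."""
    out, start = [], None
    for cp in range(limit):
        if predicate(chr(cp)):
            if start is None:
                start = cp
        elif start is not None:
            out.append((start, cp - 1))
            start = None
    if start is not None:
        out.append((start, limit - 1))
    return out

def is_double_width(ch):
    return unicodedata.east_asian_width(ch) in ("W", "F")

def is_zero_width(ch):
    # Assumption: these categories approximate zero-width code points.
    return unicodedata.category(ch) in ("Mn", "Me", "Cf")

def recomposition_pairs(limit=0x110000):
    """Map (base, combining mark) to the precomposed code point.

    Only canonical two-code-point decompositions are used; composition
    exclusions are not handled in this sketch.
    """
    pairs = {}
    for cp in range(limit):
        decomp = unicodedata.decomposition(chr(cp))
        if not decomp or decomp.startswith("<"):
            continue  # no decomposition, or a compatibility one
        parts = [int(p, 16) for p in decomp.split()]
        if len(parts) == 2 and unicodedata.combining(chr(parts[1])):
            pairs[(parts[0], parts[1])] = cp
    return pairs

def emit_c_ranges(name, table):
    """Render a range table as C source ("struct ucs_range" is hypothetical)."""
    lines = [f"static const struct ucs_range {name}[] = {{"]
    lines += [f"\t{{ 0x{a:04X}, 0x{b:04X} }}," for a, b in table]
    lines.append("};")
    return "\n".join(lines)

print(emit_c_ranges("zero_width_sample", ranges(is_zero_width, limit=0x400)))
```

The resulting tables depend on the Unicode version bundled with the Python
interpreter that runs the script, so the interpreter has to be recent
enough to know the targeted Unicode version.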

Note: The generated C code makes scripts/checkpatch.pl complain about
      "... exceeds 100 columns" because the inserted comments with code
      point names make some lines exceed 100 columns. Please make an
      exception for those files and disregard those warnings. checkpatch.pl
      does not complain when it is run directly on those files with -f.

This series was tested on top of v6.15-rc1.

diffstat:

 drivers/tty/vt/Makefile             |   3 +-
 drivers/tty/vt/gen_ucs_recompose.py | 321 ++++++++++++++++++
 drivers/tty/vt/gen_ucs_width.py     | 336 +++++++++++++++++++
 drivers/tty/vt/ucs_recompose.c      | 170 ++++++++++
 drivers/tty/vt/ucs_width.c          | 536 ++++++++++++++++++++++++++++++
 drivers/tty/vt/vt.c                 | 111 ++++---
 include/linux/consolemap.h          |  18 +
 7 files changed, 1448 insertions(+), 47 deletions(-)



Thread overview: 27+ messages
2025-04-10  1:13 [PATCH 00/11] vt: implement proper Unicode handling Nicolas Pitre
2025-04-10  1:13 ` [PATCH 01/11] vt: minor cleanup to vc_translate_unicode() Nicolas Pitre
2025-04-10  1:13 ` [PATCH 02/11] vt: move unicode processing to a separate file Nicolas Pitre
2025-04-14  6:47   ` Jiri Slaby
2025-04-15 19:03     ` Nicolas Pitre
2025-04-10  1:13 ` [PATCH 03/11] vt: properly support zero-width Unicode code points Nicolas Pitre
2025-04-14  6:51   ` Jiri Slaby
2025-04-15 19:06     ` Nicolas Pitre
2025-04-10  1:13 ` [PATCH 04/11] vt: introduce gen_ucs_width.py to create ucs_width.c Nicolas Pitre
2025-04-14  7:04   ` Jiri Slaby
2025-04-15 19:13     ` Nicolas Pitre
2025-04-10  1:13 ` [PATCH 05/11] vt: update ucs_width.c using gen_ucs_width.py Nicolas Pitre
2025-04-11  3:47   ` kernel test robot
2025-04-10  1:13 ` [PATCH 06/11] vt: introduce gen_ucs_recompose.py to create ucs_recompose.c Nicolas Pitre
2025-04-14  7:08   ` Jiri Slaby
2025-04-10  1:13 ` [PATCH 07/11] vt: create ucs_recompose.c using gen_ucs_recompose.py Nicolas Pitre
2025-04-11  6:00   ` kernel test robot
2025-04-10  1:14 ` [PATCH 08/11] vt: support Unicode recomposition Nicolas Pitre
2025-04-10  1:14 ` [PATCH 09/11] vt: update gen_ucs_width.py to produce more space efficient tables Nicolas Pitre
2025-04-14  7:14   ` Jiri Slaby
2025-04-15 19:16     ` Nicolas Pitre
2025-04-10  1:14 ` [PATCH 10/11] vt: update ucs_width.c following latest gen_ucs_width.py Nicolas Pitre
2025-04-14  7:17   ` Jiri Slaby
2025-04-10  1:14 ` [PATCH 11/11] vt: pad double-width code points with a zero-white-space Nicolas Pitre
2025-04-14  7:18   ` Jiri Slaby
2025-04-10 19:38 ` [PATCH 12/11] vt: remove zero-white-space handling from conv_uni_to_pc() Nicolas Pitre
2025-04-11 14:49 ` [PATCH 00/11] vt: implement proper Unicode handling Greg Kroah-Hartman
