[PATCH 00/11] vt: implement proper Unicode handling

Nicolas Pitre posted 11 patches 8 months, 1 week ago
There is a newer version of this series
[PATCH 00/11] vt: implement proper Unicode handling
Posted by Nicolas Pitre 8 months, 1 week ago
The Linux VT console has many problems with regard to proper Unicode
handling:

- Double-width code points introduced after Unicode 5.0 are not recognized
  as such (we're at Unicode 16.0 now).

- Zero-width code points are not recognized at all. If you try to edit files
  containing a lot of emojis, you will see the rendering issues. When a line
  contains many zero-width characters (like "variation selectors"), it gets
  wrapped too early, but any Unicode-aware editor assumes that the content
  was rendered properly, so its own rendering logic goes badly out of sync
  with the screen. Combine this with tmux or screen, and the terminal turns
  into a huge mess.

- Text that uses combining diacritics suffers from the same problem as text
  with zero-width characters: programs expect the characters to take fewer
  columns than they actually do.
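
To illustrate the mismatch described in the list above, here is a minimal
sketch of how a Unicode-aware program might count columns. It is an
approximation built on Python's unicodedata module, not the logic used by
the scripts in this series; the function names cell_width() and
expected_columns() are made up for this example.

```python
import unicodedata


def cell_width(ch: str) -> int:
    """Approximate number of columns a Unicode-aware program expects for ch."""
    # Nonspacing marks, enclosing marks and format characters (this covers
    # combining diacritics and variation selectors) occupy no column.
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    # East Asian Wide and Fullwidth code points occupy two columns.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def expected_columns(text: str) -> int:
    """Total columns the text should take if widths are handled properly."""
    return sum(cell_width(ch) for ch in text)


# 'e' + combining acute accent, heart + variation selector-16, a CJK ideograph
sample = "e\u0301\u2764\ufe0f\u4e2d"
print("code points:", len(sample))
print("expected columns:", expected_columns(sample))
```

A console that gives every code point at least one column will advance the
cursor further than expected_columns() predicts, which is how an editor and
the screen end up disagreeing about where a line wraps.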

Some may argue that the Linux VT console is unmaintained and/or not used
much any longer and that one should consider a user space terminal
alternative instead. But every such alternative that is at least as well
maintained as the Linux VT console requires a full, heavy graphical
environment, which is the exact antithesis of what the Linux console is
meant to be.

Furthermore, there is a significant Linux console user base represented by
blind users (which I'm a member of) for whom the alternatives are way more
cumbersome to use, reducing our productivity. So it has to stay and
be maintained to the best of our abilities.

That being said...

This patch series is about fixing all the above issues. This is accomplished
with some Python scripts leveraging Python's unicodedata module to generate
C code with lookup tables that is suitable for the kernel. In summary:

- The double-width code point table is updated to the latest Unicode version
  and the table itself is optimized to reduce its size.

- A zero-width code point table is created and the console code is modified
  to properly use it.

- A table with base character + combining mark pairs is created to convert
  them into their precomposed equivalents when they're encountered.
  By default the generated table contains only the most commonly used
  Latin, Greek, and Cyrillic recomposition pairs, but one can execute the
  provided script with the --full argument to create a table that covers
  all possibilities. Combining marks that are not listed in the table are
  simply treated like zero-width code points and properly ignored.

- All those tables plus related lookup code require about 3500 additional
  bytes of text, which is not very significant these days. Yet, one
  can still set CONFIG_CONSOLE_TRANSLATIONS=n to configure this all out
  if need be.
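
As a rough illustration of the table-driven approach summarized above, the
sketch below derives a compact range table and a recomposition-pair table
from Python's unicodedata module. It is not the code of gen_ucs_width.py or
gen_ucs_recompose.py: the helper names, the (first, last) range layout and
the U+0000..U+04FF scan limit (a stand-in for "commonly used Latin, Greek,
and Cyrillic") are all assumptions made for this example.

```python
import unicodedata
from bisect import bisect_right


def to_ranges(code_points):
    """Merge code points into sorted, inclusive (first, last) ranges."""
    ranges = []
    for cp in sorted(code_points):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], cp)   # extend the current range
        else:
            ranges.append((cp, cp))            # start a new range
    return ranges


def in_ranges(ranges, cp):
    """Binary search: is cp inside one of the sorted, disjoint ranges?"""
    i = bisect_right(ranges, (cp, 0x110000)) - 1
    return i >= 0 and ranges[i][0] <= cp <= ranges[i][1]


def recomposition_pairs(first=0x0000, last=0x04FF):
    """Map (base, mark) pairs to their precomposed character."""
    pairs = {}
    for cp in range(first, last + 1):
        ch = chr(cp)
        parts = unicodedata.decomposition(ch).split()
        # Skip characters with no decomposition, compatibility mappings
        # (they start with a "<tag>") and singleton mappings.
        if len(parts) != 2 or parts[0].startswith("<"):
            continue
        base, mark = (chr(int(p, 16)) for p in parts)
        # Keep only pairs that NFC normalization really recomposes.
        if unicodedata.normalize("NFC", base + mark) == ch:
            pairs[(base, mark)] = ch
    return pairs


# Zero-width table for a small slice of the code space, as ranges.
zero_width = to_ranges(
    cp for cp in range(0x0300, 0x0400)
    if unicodedata.category(chr(cp)) in ("Mn", "Me", "Cf")
)
pairs = recomposition_pairs()
print(len(zero_width), "zero-width range(s),", len(pairs), "recomposition pairs")
```

Storing ranges instead of individual code points is one common way to keep
such tables small, and a sorted table lends itself to a binary search at
lookup time. A mark found in no (base, mark) pair would simply fall through
to the zero-width check.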

Note: The generated C code makes scripts/checkpatch.pl complain about
      "... exceeds 100 columns" because the inserted comments with code
      point names, well, make some lines exceed 100 columns. Please make
      an exception for those files and disregard those warnings. When
      checkpatch.pl is used on those files directly with -f then it doesn't
      complain.

This series was tested on top of v6.15-rc1.

diffstat:

 drivers/tty/vt/Makefile             |   3 +-
 drivers/tty/vt/gen_ucs_recompose.py | 321 ++++++++++++++++++
 drivers/tty/vt/gen_ucs_width.py     | 336 +++++++++++++++++++
 drivers/tty/vt/ucs_recompose.c      | 170 ++++++++++
 drivers/tty/vt/ucs_width.c          | 536 ++++++++++++++++++++++++++++++
 drivers/tty/vt/vt.c                 | 111 ++++---
 include/linux/consolemap.h          |  18 +
 7 files changed, 1448 insertions(+), 47 deletions(-)
Re: [PATCH 00/11] vt: implement proper Unicode handling
Posted by Greg Kroah-Hartman 8 months, 1 week ago
On Wed, Apr 09, 2025 at 09:13:52PM -0400, Nicolas Pitre wrote:
> The Linux VT console has many problems with regards to proper Unicode
> handling:

Wow, very nice work, thanks for doing all of this.  I'll go queue it up
now, the kernel test robot warnings for comments can be fixed up later
if you want to.

greg k-h