[PATCH v2 0/2] Fix misaligned output of git repo structure

From: Jiang Xin <worldhello.net@gmail.com>
To: Junio C Hamano <gitster@pobox.com>,
	Git List <git@vger.kernel.org>,
	Justin Tobler <jltobler@gmail.com>
Cc: Jiang Xin <worldhello.net@gmail.com>
Subject: [PATCH v2 0/2] Fix misaligned output of git repo structure
Date: Sat, 15 Nov 2025 08:36:09 -0500
Message-ID: <cover.1763213290.git.worldhello.net@gmail.com>
In-Reply-To: <cover.1763098804.git.worldhello.net@gmail.com>

While localizing Git 2.52.0, I noticed that the table printed by "git
repo structure" becomes misaligned when its cells contain multi-byte
UTF-8 characters such as CJK text. For example:

    | 仓库结构   | 值  |
    | -------------- | ---- |

The previous implementation padded each column with printf() width
specifiers, which count bytes rather than display columns. A CJK
character typically occupies three bytes in UTF-8 but only two display
columns, so translated cells received too little padding and the table
columns no longer lined up.

This series changes stats_table_print_structure() to pad cells with
strbuf_utf8_align() instead of printf() width specifiers, so that
columns are aligned by display width rather than by byte count.
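
To illustrate the byte-count versus display-width difference described
above, here is a minimal standalone sketch. It is not the code from this
series: sketch_is_wide(), sketch_display_width() and
sketch_pad_left_aligned() are hypothetical names, and the width rule (a
few common CJK ranges count as two columns, everything else as one) is a
deliberate simplification of what utf8_strwidth() computes.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: treat a few common CJK code point ranges as
 * two columns wide. This is a simplification for illustration only.
 */
static int sketch_is_wide(unsigned int cp)
{
	return (cp >= 0x1100 && cp <= 0x115f) ||	/* Hangul Jamo */
	       (cp >= 0x2e80 && cp <= 0xa4cf) ||	/* CJK, Kana, ... */
	       (cp >= 0xac00 && cp <= 0xd7a3) ||	/* Hangul syllables */
	       (cp >= 0xf900 && cp <= 0xfaff) ||	/* CJK compatibility */
	       (cp >= 0xff00 && cp <= 0xff60);		/* Fullwidth forms */
}

/*
 * Hypothetical helper: decode UTF-8 and sum the display columns.
 * Malformed sequences are not rejected; they count as one column.
 */
static int sketch_display_width(const char *s)
{
	const unsigned char *p = (const unsigned char *)s;
	int width = 0;

	while (*p) {
		unsigned int cp;
		int len, i;

		if (*p < 0x80) {
			cp = *p;
			len = 1;
		} else if ((*p & 0xe0) == 0xc0) {
			cp = *p & 0x1f;
			len = 2;
		} else if ((*p & 0xf0) == 0xe0) {
			cp = *p & 0x0f;
			len = 3;
		} else {
			cp = *p & 0x07;
			len = 4;
		}

		/* Stop early at a NUL or any non-continuation byte. */
		for (i = 1; i < len && (p[i] & 0xc0) == 0x80; i++)
			cp = (cp << 6) | (p[i] & 0x3f);

		width += sketch_is_wide(cp) ? 2 : 1;
		p += i;
	}
	return width;
}

/*
 * Hypothetical helper: left-align "s" in a field of "cols" display
 * columns. The string is never truncated if it is already wider.
 */
static void sketch_pad_left_aligned(char *out, size_t outsz,
				    const char *s, int cols)
{
	int pad = cols - sketch_display_width(s);

	if (pad < 0)
		pad = 0;
	snprintf(out, outsz, "%s%*s", s, pad, "");
}
```

With a 14-column name field, printf("%-*s", 14, "仓库结构") appends only
2 spaces because the string is already 12 bytes long, which yields 10
display columns. The sketch appends 6 spaces, which yields the full 14.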

Jiang Xin (2):
  t/unit-tests: add UTF-8 width tests for CJK chars
  builtin/repo: fix table alignment for UTF-8 characters

 Makefile                    |   1 +
 builtin/repo.c              |  21 ++++--
 t/meson.build               |   1 +
 t/unit-tests/u-utf8-width.c | 134 ++++++++++++++++++++++++++++++++++++
 4 files changed, 153 insertions(+), 4 deletions(-)
 create mode 100644 t/unit-tests/u-utf8-width.c


## Range-diff vs v1:

1:  53c1e5219b ! 1:  72e73484d2 t/unit-tests: add UTF-8 width tests for CJK chars
    @@ Metadata
      ## Commit message ##
         t/unit-tests: add UTF-8 width tests for CJK chars
     
    -    This commit adds a new test suite (u-utf8-width.c) to test the UTF-8
    -    width functions in Git, particularly focusing on multi-byte characters
    -    from East Asian languages like Chinese, Japanese, and Korean that
    -    typically require 2 display columns per character.
    -
    -    The test suite includes:
    -    - Tests for utf8_strnwidth with Chinese strings
    -    - Tests for utf8_strwidth with Chinese strings
    -    - Tests for Japanese and Korean characters
    -    - Edge case tests with invalid UTF-8 sequences
    -    - Proper test function naming following the Clar framework convention
    +    The file "builtin/repo.c" uses utf8_strwidth() to calculate the display
    +    width of UTF-8 characters in a table, but the resulting output is still
    +    misaligned. Add test cases for both utf8_strwidth and utf8_strnwidth to
    +    verify that they correctly compute the display width for UTF-8
    +    characters.
     
         Also updated the build configuration in Makefile and meson.build to
         include the new test suite in the build process.
     
    -    Co-developed-by: Claude <noreply@anthropic.com>
         Signed-off-by: Jiang Xin <worldhello.net@gmail.com>
     
      ## Makefile ##
    @@ t/unit-tests/u-utf8-width.c (new)
     + */
     +void test_utf8_width__strnwidth_chinese(void)
     +{
    -+	const char *ansi_test;
     +	const char *str;
     +
     +	/* Test basic ASCII - each character should have width 1 */
    -+	cl_assert_equal_i(5, utf8_strnwidth("hello", 5, 0));
    -+	cl_assert_equal_i(5, utf8_strnwidth("hello", 5, 1));  /* skip_ansi = 1 */
    ++	cl_assert_equal_i(5, utf8_strnwidth("Hello", 5, 0));
    ++	/* skip_ansi = 1 */
    ++	cl_assert_equal_i(5, utf8_strnwidth("Hello", 5, 1));
     +
     +	/* Test simple Chinese characters - each should have width 2 */
    -+	cl_assert_equal_i(4, utf8_strnwidth("你好", 6, 0));  /* "你好" is 6 bytes (3 bytes per char in UTF-8), 4 display columns */
    ++	/* "你好" is 6 bytes (3 bytes per char in UTF-8), 4 display columns */
    ++	cl_assert_equal_i(4, utf8_strnwidth("你好", 6, 0));
     +
     +	/* Test mixed ASCII and Chinese - ASCII = 1 column, Chinese = 2 columns */
    -+	cl_assert_equal_i(6, utf8_strnwidth("hi你好", 8, 0));  /* "h"(1) + "i"(1) + "你"(2) + "好"(2) = 6 */
    ++	/* "h"(1) + "i"(1) + "你"(2) + "好"(2) = 6 */
    ++	cl_assert_equal_i(6, utf8_strnwidth("Hi你好", 8, 0));
     +
     +	/* Test longer Chinese string */
    -+	cl_assert_equal_i(10, utf8_strnwidth("你好世界！", 15, 0));  /* 5 Chinese chars = 10 display columns */
    -+
    -+	/* Test with skip_ansi = 1 to make sure it works with escape sequences */
    -+	ansi_test = "\033[31m你好\033[0m";
    -+	cl_assert_equal_i(4, utf8_strnwidth(ansi_test, strlen(ansi_test), 1));  /* Skip escape sequences, just count "你好" which should be 4 columns */
    ++	/* 5 Chinese chars = 10 display columns */
    ++	cl_assert_equal_i(10, utf8_strnwidth("你好世界！", 15, 0));
     +
     +	/* Test individual Chinese character width */
    -+	cl_assert_equal_i(2, utf8_strnwidth("中", 3, 0));  /* Single Chinese char should be 2 columns */
    ++	cl_assert_equal_i(2, utf8_strnwidth("中", 3, 0));
     +
     +	/* Test empty string */
     +	cl_assert_equal_i(0, utf8_strnwidth("", 0, 0));
     +
     +	/* Test length limiting */
     +	str = "你好世界";
    -+	cl_assert_equal_i(2, utf8_strnwidth(str, 3, 0));  /* Only first char "你"(2 columns) within 3 bytes */
    -+	cl_assert_equal_i(4, utf8_strnwidth(str, 6, 0));  /* First two chars "你好"(4 columns) in 6 bytes */
    ++	/* Only first char "你"(2 columns) within 3 bytes */
    ++	cl_assert_equal_i(2, utf8_strnwidth(str, 3, 0));
    ++	/* First two chars "你好"(4 columns) in 6 bytes */
    ++	cl_assert_equal_i(4, utf8_strnwidth(str, 6, 0));
     +}
     +
     +/*
    @@ t/unit-tests/u-utf8-width.c (new)
     +void test_utf8_width__strwidth_chinese(void)
     +{
     +	/* Test basic ASCII */
    -+	cl_assert_equal_i(5, utf8_strwidth("hello"));
    ++	cl_assert_equal_i(5, utf8_strwidth("Hello"));
     +
     +	/* Test Chinese characters */
    -+	cl_assert_equal_i(4, utf8_strwidth("你好"));  /* 2 Chinese chars = 4 display columns */
    ++	/* 2 Chinese chars = 4 display columns */
    ++	cl_assert_equal_i(4, utf8_strwidth("你好"));
    ++
    ++	/* Test longer Chinese string */
    ++	/* 5 Chinese chars = 10 display columns */
    ++	cl_assert_equal_i(10, utf8_strwidth("你好世界！"));
     +
     +	/* Test mixed ASCII and Chinese */
    -+	cl_assert_equal_i(9, utf8_strwidth("hello世界"));  /* 5 ASCII (5 cols) + 2 Chinese (4 cols) = 9 */
    -+	cl_assert_equal_i(7, utf8_strwidth("hi世界!"));   /* 2 ASCII (2 cols) + 2 Chinese (4 cols) + 1 ASCII (1 col) = 7 */
    ++	/* 5 ASCII (5 cols) + 2 Chinese (4 cols) = 9 */
    ++	cl_assert_equal_i(9, utf8_strwidth("Hello世界"));
    ++	/* 2 ASCII (2 cols) + 2 Chinese (4 cols) + 1 ASCII (1 col) = 7 */
    ++	cl_assert_equal_i(7, utf8_strwidth("Hi世界!"));
     +}
     +
     +/*
    @@ t/unit-tests/u-utf8-width.c (new)
     +void test_utf8_width__strnwidth_japanese_korean(void)
     +{
     +	/* Japanese characters (should also be 2 columns each) */
    -+	cl_assert_equal_i(10, utf8_strnwidth("こんにちは", 15, 0));  /* 5 Japanese chars @ 2 cols each = 10 display columns */
    ++	/* 5 Japanese chars x 2 cols each = 10 display columns */
    ++	cl_assert_equal_i(10, utf8_strnwidth("こんにちは", 15, 0));
     +
     +	/* Korean characters (should also be 2 columns each) */
    -+	cl_assert_equal_i(10, utf8_strnwidth("안녕하세요", 15, 0));  /* 5 Korean chars @ 2 cols each = 10 display columns */
    ++	/* 5 Korean chars x 2 cols each = 10 display columns */
    ++	cl_assert_equal_i(10, utf8_strnwidth("안녕하세요", 15, 0));
     +}
     +
     +/*
    -+ * Test edge cases with partial UTF-8 sequences
    ++ * Test utf8_strnwidth with CJK strings and ANSI sequences
     + */
    -+void test_utf8_width__strnwidth_edge_cases(void)
    ++void test_utf8_width__strnwidth_cjk_with_ansi(void)
     +{
    -+	const char *invalid;
    -+	unsigned char truncated_bytes[] = {0xe4, 0xbd, 0x00};  /* First 2 bytes of "中" + null */
    -+
    -+	/* Test invalid UTF-8 - should fall back to byte count */
    -+	invalid = "\xff\xfe";  /* Invalid UTF-8 sequence */
    -+	cl_assert_equal_i(2, utf8_strnwidth(invalid, 2, 0));  /* Should return length if invalid UTF-8 */
    -+
    -+	/* Test partial UTF-8 character (truncated) */
    -+	cl_assert_equal_i(2, utf8_strnwidth((const char*)truncated_bytes, 2, 0));  /* Invalid UTF-8, returns byte count */
    ++	/* Test CJK with ANSI sequences */
    ++	const char *ansi_test = "\033[1m你好\033[0m";
    ++	int width = utf8_strnwidth(ansi_test, strlen(ansi_test), 1);
    ++	/* Should skip ANSI sequences and count "你好" as 4 columns */
    ++	cl_assert_equal_i(4, width);
    ++
    ++	/* Test mixed ASCII, CJK, and ANSI */
    ++	ansi_test = "Hello\033[32m世界\033[0m!";
    ++	width = utf8_strnwidth(ansi_test, strlen(ansi_test), 1);
    ++	/* "Hello"(5) + "世界"(4) + "!"(1) = 10 */
    ++	cl_assert_equal_i(10, width);
     +}
2:  65efad527f ! 2:  d0975427c9 builtin/repo: fix table alignment for UTF-8 characters
    @@ Commit message
             | -------------- | ---- |
             | * 引用       |      |
             |   * 计数     |   67 |
    -        |     * 分支   |    6 |
    -        |     * 标签   |   30 |
    -        |     * 远程   |   19 |
    -        |     * 其它   |   12 |
    -        |                |      |
    -        | * 可达对象 |      |
    -        |   * 计数     | 2217 |
    -        |     * 提交   |  279 |
    -        |     * 树      |  740 |
    -        |     * 数据对象 | 1168 |
    -        |     * 标签   |   30 |
     
         The previous implementation used simple width formatting with printf()
         which didn't properly handle multi-byte UTF-8 characters, causing
    @@ Commit message
         ensures proper column alignment regardless of the character encoding of
         the content being displayed.
     
    -    Co-developed-by: Gemini <noreply@developers.google.com>
    +    Also add test cases for strbuf_utf8_align(), a function newly used
    +    in "builtin/repo.c".
    +
         Signed-off-by: Jiang Xin <worldhello.net@gmail.com>
     
      ## builtin/repo.c ##
    @@ builtin/repo.c: static void stats_table_print_structure(const struct stats_table
     +	strbuf_utf8_align(&buf, ALIGN_LEFT, value_col_width, value_col_title);
     +	strbuf_addstr(&buf, " |");
     +	printf("%s\n", buf.buf);
    -+	strbuf_reset(&buf);
     +
      	printf("| ");
      	for (int i = 0; i < name_col_width; i++)
    @@ builtin/repo.c: static void stats_table_print_structure(const struct stats_table
      }
      
      static void stats_table_clear(struct stats_table *table)
    +
    + ## t/unit-tests/u-utf8-width.c ##
    +@@ t/unit-tests/u-utf8-width.c: void test_utf8_width__strnwidth_cjk_with_ansi(void)
    + 	/* "Hello"(5) + "世界"(4) + "!"(1) = 10 */
    + 	cl_assert_equal_i(10, width);
    + }
    ++
    ++/*
    ++ * Test the strbuf_utf8_align function with CJK characters
    ++ */
    ++void test_utf8_width__strbuf_utf8_align(void)
    ++{
    ++	struct strbuf buf = STRBUF_INIT;
    ++
    ++	/* Test left alignment with CJK */
    ++	strbuf_utf8_align(&buf, ALIGN_LEFT, 10, "你好");
    ++	/* Since "你好" is 4 display columns, we need 6 more spaces to reach 10 */
    ++	cl_assert_equal_s("你好      ", buf.buf);
    ++	strbuf_reset(&buf);
    ++
    ++	/* Test right alignment with CJK */
    ++	strbuf_utf8_align(&buf, ALIGN_RIGHT, 8, "世界");
    ++	/* "世界" is 4 display columns, so we need 4 leading spaces */
    ++	cl_assert_equal_s("    世界", buf.buf);
    ++	strbuf_reset(&buf);
    ++
    ++	/* Test center alignment with CJK */
    ++	strbuf_utf8_align(&buf, ALIGN_MIDDLE, 10, "中");
    ++	/* "中" is 2 display columns, so (10-2)/2 = 4 spaces on left, 4 on right */
    ++	cl_assert_equal_s("    中    ", buf.buf);
    ++	strbuf_reset(&buf);
    ++
    ++	strbuf_utf8_align(&buf, ALIGN_MIDDLE, 5, "中");
    ++	/* "中" is 2 display columns, so (5-2)/2 = 1 spaces on left, 2 on right */
    ++	cl_assert_equal_s(" 中  ", buf.buf);
    ++	strbuf_reset(&buf);
    ++
    ++	/* Test alignment that is smaller than string width */
    ++	strbuf_utf8_align(&buf, ALIGN_LEFT, 2, "你好");
    ++	/* Since "你好" is 4 display columns, it should not be truncated */
    ++	cl_assert_equal_s("你好", buf.buf);
    ++	strbuf_release(&buf);
    ++}

--
Jiang Xin

Thread overview: 22+ messages
2025-11-14  5:52 [PATCH 0/2] Fix misaligned output of git repo structure Jiang Xin
2025-11-14  5:52 ` [PATCH 1/2] t/unit-tests: add UTF-8 width tests for CJK chars Jiang Xin
2025-11-14 20:17   ` Junio C Hamano
2025-11-15 12:38     ` Jiang Xin
2025-11-14  5:52 ` [PATCH 2/2] builtin/repo: fix table alignment for UTF-8 characters Jiang Xin
2025-11-14 17:50   ` Justin Tobler
2025-11-15 12:41     ` Jiang Xin
2025-11-14 20:00   ` Junio C Hamano
2025-11-15 12:54     ` Jiang Xin
2025-11-15 16:36       ` Junio C Hamano
2025-11-16 13:32         ` Jiang Xin
2025-11-16 16:51           ` Junio C Hamano
2025-11-14  7:41 ` [PATCH 0/2] Fix misaligned output of git repo structure Kristoffer Haugsbakk
2025-11-14  9:52   ` Jiang Xin
2025-11-14 19:22     ` Junio C Hamano
2025-11-15 12:25       ` Jiang Xin
2025-11-14 16:13 ` Junio C Hamano
2025-11-15 13:36 ` Jiang Xin [this message]
2025-11-15 13:36   ` [PATCH v2 1/2] t/unit-tests: add UTF-8 width tests for CJK chars Jiang Xin
2025-11-15 13:36   ` [PATCH v2 2/2] builtin/repo: fix table alignment for UTF-8 characters Jiang Xin
2025-11-15 15:04     ` Phillip Wood
2025-11-15 16:49       ` Junio C Hamano
