* [PATCH 1/2] t/unit-tests: add UTF-8 width tests for CJK chars
2025-11-14 5:52 [PATCH 0/2] Fix misaligned output of git repo structure Jiang Xin
@ 2025-11-14 5:52 ` Jiang Xin
2025-11-14 20:17 ` Junio C Hamano
2025-11-14 5:52 ` [PATCH 2/2] builtin/repo: fix table alignment for UTF-8 characters Jiang Xin
` (3 subsequent siblings)
4 siblings, 1 reply; 22+ messages in thread
From: Jiang Xin @ 2025-11-14 5:52 UTC (permalink / raw)
To: Junio C Hamano, Git List, Justin Tobler
Cc: Jiang Xin, Alexander Shopov, Mikel Forcada, Ralf Thielow,
Jean-Noël Avila, Bagas Sanjaya, Dimitriy Ryazantcev,
Peter Krefting, Emir SARI, Arkadii Yakovets,
Vũ Tiến Hưng, Teng Long, Yi-Jyun Pan, Claude
This commit adds a new test suite (u-utf8-width.c) to test the UTF-8
width functions in Git, particularly focusing on multi-byte characters
from East Asian languages like Chinese, Japanese, and Korean that
typically require 2 display columns per character.
The test suite includes:
- Tests for utf8_strnwidth with Chinese strings
- Tests for utf8_strwidth with Chinese strings
- Tests for Japanese and Korean characters
- Edge case tests with invalid UTF-8 sequences
- Proper test function naming following the Clar framework convention
Also updated the build configuration in Makefile and meson.build to
include the new test suite in the build process.
Co-developed-by: Claude <noreply@anthropic.com>
Signed-off-by: Jiang Xin <worldhello.net@gmail.com>
---
Makefile | 1 +
t/meson.build | 1 +
t/unit-tests/u-utf8-width.c | 85 +++++++++++++++++++++++++++++++++++++
3 files changed, 87 insertions(+)
create mode 100644 t/unit-tests/u-utf8-width.c
diff --git a/Makefile b/Makefile
index 7e0f77e298..2a67546154 100644
--- a/Makefile
+++ b/Makefile
@@ -1525,6 +1525,7 @@ CLAR_TEST_SUITES += u-string-list
CLAR_TEST_SUITES += u-strvec
CLAR_TEST_SUITES += u-trailer
CLAR_TEST_SUITES += u-urlmatch-normalization
+CLAR_TEST_SUITES += u-utf8-width
CLAR_TEST_PROG = $(UNIT_TEST_BIN)/unit-tests$(X)
CLAR_TEST_OBJS = $(patsubst %,$(UNIT_TEST_DIR)/%.o,$(CLAR_TEST_SUITES))
CLAR_TEST_OBJS += $(UNIT_TEST_DIR)/clar/clar.o
diff --git a/t/meson.build b/t/meson.build
index a5531df415..dc43d69636 100644
--- a/t/meson.build
+++ b/t/meson.build
@@ -24,6 +24,7 @@ clar_test_suites = [
'unit-tests/u-strvec.c',
'unit-tests/u-trailer.c',
'unit-tests/u-urlmatch-normalization.c',
+ 'unit-tests/u-utf8-width.c',
]
clar_sources = [
diff --git a/t/unit-tests/u-utf8-width.c b/t/unit-tests/u-utf8-width.c
new file mode 100644
index 0000000000..455294ca90
--- /dev/null
+++ b/t/unit-tests/u-utf8-width.c
@@ -0,0 +1,85 @@
+#include "unit-test.h"
+#include "utf8.h"
+#include "strbuf.h"
+
+/*
+ * Test utf8_strnwidth with various Chinese strings
+ * Chinese characters typically have a width of 2 columns when displayed
+ */
+void test_utf8_width__strnwidth_chinese(void)
+{
+ const char *ansi_test;
+ const char *str;
+
+ /* Test basic ASCII - each character should have width 1 */
+ cl_assert_equal_i(5, utf8_strnwidth("hello", 5, 0));
+ cl_assert_equal_i(5, utf8_strnwidth("hello", 5, 1)); /* skip_ansi = 1 */
+
+ /* Test simple Chinese characters - each should have width 2 */
+ cl_assert_equal_i(4, utf8_strnwidth("你好", 6, 0)); /* "你好" is 6 bytes (3 bytes per char in UTF-8), 4 display columns */
+
+ /* Test mixed ASCII and Chinese - ASCII = 1 column, Chinese = 2 columns */
+ cl_assert_equal_i(6, utf8_strnwidth("hi你好", 8, 0)); /* "h"(1) + "i"(1) + "你"(2) + "好"(2) = 6 */
+
+ /* Test longer Chinese string */
+ cl_assert_equal_i(10, utf8_strnwidth("你好世界!", 15, 0)); /* 4 Chinese chars + fullwidth "!" = 15 bytes, 10 display columns */
+
+ /* Test with skip_ansi = 1 to make sure it works with escape sequences */
+ ansi_test = "\033[31m你好\033[0m";
+ cl_assert_equal_i(4, utf8_strnwidth(ansi_test, strlen(ansi_test), 1)); /* Skip escape sequences, just count "你好" which should be 4 columns */
+
+ /* Test individual Chinese character width */
+ cl_assert_equal_i(2, utf8_strnwidth("中", 3, 0)); /* Single Chinese char should be 2 columns */
+
+ /* Test empty string */
+ cl_assert_equal_i(0, utf8_strnwidth("", 0, 0));
+
+ /* Test length limiting */
+ str = "你好世界";
+ cl_assert_equal_i(2, utf8_strnwidth(str, 3, 0)); /* Only first char "你"(2 columns) within 3 bytes */
+ cl_assert_equal_i(4, utf8_strnwidth(str, 6, 0)); /* First two chars "你好"(4 columns) in 6 bytes */
+}
+
+/*
+ * Tests for utf8_strwidth (simpler version without length limit)
+ */
+void test_utf8_width__strwidth_chinese(void)
+{
+ /* Test basic ASCII */
+ cl_assert_equal_i(5, utf8_strwidth("hello"));
+
+ /* Test Chinese characters */
+ cl_assert_equal_i(4, utf8_strwidth("你好")); /* 2 Chinese chars = 4 display columns */
+
+ /* Test mixed ASCII and Chinese */
+ cl_assert_equal_i(9, utf8_strwidth("hello世界")); /* 5 ASCII (5 cols) + 2 Chinese (4 cols) = 9 */
+ cl_assert_equal_i(7, utf8_strwidth("hi世界!")); /* 2 ASCII (2 cols) + 2 Chinese (4 cols) + 1 ASCII (1 col) = 7 */
+}
+
+/*
+ * Additional tests with other East Asian characters
+ */
+void test_utf8_width__strnwidth_japanese_korean(void)
+{
+ /* Japanese characters (should also be 2 columns each) */
+ cl_assert_equal_i(10, utf8_strnwidth("こんにちは", 15, 0)); /* 5 Japanese chars @ 2 cols each = 10 display columns */
+
+ /* Korean characters (should also be 2 columns each) */
+ cl_assert_equal_i(10, utf8_strnwidth("안녕하세요", 15, 0)); /* 5 Korean chars @ 2 cols each = 10 display columns */
+}
+
+/*
+ * Test edge cases with partial UTF-8 sequences
+ */
+void test_utf8_width__strnwidth_edge_cases(void)
+{
+ const char *invalid;
+ unsigned char truncated_bytes[] = {0xe4, 0xbd, 0x00}; /* First 2 bytes of "你" + null */
+
+ /* Test invalid UTF-8 - should fall back to byte count */
+ invalid = "\xff\xfe"; /* Invalid UTF-8 sequence */
+ cl_assert_equal_i(2, utf8_strnwidth(invalid, 2, 0)); /* Should return length if invalid UTF-8 */
+
+ /* Test partial UTF-8 character (truncated) */
+ cl_assert_equal_i(2, utf8_strnwidth((const char*)truncated_bytes, 2, 0)); /* Invalid UTF-8, returns byte count */
+}
--
2.52.0.rc2.5.g4c20a63325.dirty
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH 1/2] t/unit-tests: add UTF-8 width tests for CJK chars
2025-11-14 5:52 ` [PATCH 1/2] t/unit-tests: add UTF-8 width tests for CJK chars Jiang Xin
@ 2025-11-14 20:17 ` Junio C Hamano
2025-11-15 12:38 ` Jiang Xin
0 siblings, 1 reply; 22+ messages in thread
From: Junio C Hamano @ 2025-11-14 20:17 UTC (permalink / raw)
To: Jiang Xin
Cc: Git List, Justin Tobler, Alexander Shopov, Mikel Forcada,
Ralf Thielow, Jean-Noël Avila, Bagas Sanjaya,
Dimitriy Ryazantcev, Peter Krefting, Emir SARI, Arkadii Yakovets,
Vũ Tiến Hưng, Teng Long, Yi-Jyun Pan
Jiang Xin <worldhello.net@gmail.com> writes:
[jc: the same question about the choice of Cc addresses applies]
> This commit adds a new test suite (u-utf8-width.c) to test the UTF-8
> width functions in Git, particularly focusing on multi-byte characters
> from East Asian languages like Chinese, Japanese, and Korean that
> typically require 2 display columns per character.
>
> The test suite includes:
> - Tests for utf8_strnwidth with Chinese strings
> - Tests for utf8_strwidth with Chinese strings
> - Tests for Japanese and Korean characters
> - Edge case tests with invalid UTF-8 sequences
> - Proper test function naming following the Clar framework convention
>
> Also updated the build configuration in Makefile and meson.build to
> include the new test suite in the build process.
The usual way to compose a log message of this project is to
- Give an observation on how the current system works in the
present tense (so no need to say "Currently X is Y", or
"Previously X was Y" to describe the state before your change;
just "X is Y" is enough), and discuss what you perceive as a
problem in it.
- Propose a solution (optional---often, problem description
trivially leads to an obvious solution in readers' minds).
- Give commands to somebody editing the codebase to "make it so",
instead of saying "This commit does X".
in this order.
> + /* Test length limiting */
> + str = "你好世界";
> + cl_assert_equal_i(2, utf8_strnwidth(str, 3, 0)); /* Only first char "你"(2 columns) within 3 bytes */
> + cl_assert_equal_i(4, utf8_strnwidth(str, 6, 0)); /* First two chars "你好"(4 columns) in 6 bytes */
We also should test utf8_strwidth() on the same string here.
> +/*
> + * Test edge cases with partial UTF-8 sequences
> + */
All tests before these make sense, but I am not sure if we want to
hold utf8_strnwidth() to the requirement that it will tolerate "len"
to end in the middle of a single character, as such a requirement by
itself does not do the application any good.
A caller may have "你好世界" in str, learn that the first 4 bytes
would only need two display columns to show (i.e., 3-byte "你" plus
a single garbage byte, that would make UTF-8 encoded "好" if the
remaining two bytes were included), and may want to learn how to
show only enough to fill the two display columns. But there is not
enough information given back by utf8_strnwidth() for such a caller
to figure out that it needs to feed only the first three bytes (not
four) of str to printf() to do so.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/2] t/unit-tests: add UTF-8 width tests for CJK chars
2025-11-14 20:17 ` Junio C Hamano
@ 2025-11-15 12:38 ` Jiang Xin
0 siblings, 0 replies; 22+ messages in thread
From: Jiang Xin @ 2025-11-15 12:38 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Git List, Justin Tobler
On Sat, Nov 15, 2025 at 4:17 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Jiang Xin <worldhello.net@gmail.com> writes:
>
> [jc: the same question about the choice of Cc addresses applies]
>
> > This commit adds a new test suite (u-utf8-width.c) to test the UTF-8
> > width functions in Git, particularly focusing on multi-byte characters
> > from East Asian languages like Chinese, Japanese, and Korean that
> > typically require 2 display columns per character.
> >
> > The test suite includes:
> > - Tests for utf8_strnwidth with Chinese strings
> > - Tests for utf8_strwidth with Chinese strings
> > - Tests for Japanese and Korean characters
> > - Edge case tests with invalid UTF-8 sequences
> > - Proper test function naming following the Clar framework convention
> >
> > Also updated the build configuration in Makefile and meson.build to
> > include the new test suite in the build process.
>
> The usual way to compose a log message of this project is to
>
> - Give an observation on how the current system works in the
> present tense (so no need to say "Currently X is Y", or
> "Previously X was Y" to describe the state before your change;
> just "X is Y" is enough), and discuss what you perceive as a
> problem in it.
>
> - Propose a solution (optional---often, problem description
> trivially leads to an obvious solution in readers' minds).
>
> - Give commands to somebody editing the codebase to "make it so",
> instead of saying "This commit does X".
>
> in this order.
Will document the purpose in commit message of next reroll.
> > +/*
> > + * Test edge cases with partial UTF-8 sequences
> > + */
>
> All tests before these make sense, but I am not sure if we want to
> hold utf8_strnwidth() to the requirement that it will tolerate "len"
> to end in the middle of a single character, as such a requirement by
> itself does not do the application any good.
Will remove unnecessary test cases.
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 2/2] builtin/repo: fix table alignment for UTF-8 characters
2025-11-14 5:52 [PATCH 0/2] Fix misaligned output of git repo structure Jiang Xin
2025-11-14 5:52 ` [PATCH 1/2] t/unit-tests: add UTF-8 width tests for CJK chars Jiang Xin
@ 2025-11-14 5:52 ` Jiang Xin
2025-11-14 17:50 ` Justin Tobler
2025-11-14 20:00 ` Junio C Hamano
2025-11-14 7:41 ` [PATCH 0/2] Fix misaligned output of git repo structure Kristoffer Haugsbakk
` (2 subsequent siblings)
4 siblings, 2 replies; 22+ messages in thread
From: Jiang Xin @ 2025-11-14 5:52 UTC (permalink / raw)
To: Junio C Hamano, Git List, Justin Tobler
Cc: Jiang Xin, Alexander Shopov, Mikel Forcada, Ralf Thielow,
Jean-Noël Avila, Bagas Sanjaya, Dimitriy Ryazantcev,
Peter Krefting, Emir SARI, Arkadii Yakovets,
Vũ Tiến Hưng, Teng Long, Yi-Jyun Pan, Gemini
The output table from "git repo structure" is misaligned when displaying
UTF-8 characters (e.g., non-ASCII glyphs). E.g.:
| 仓库结构 | 值 |
| -------------- | ---- |
| * 引用 | |
| * 计数 | 67 |
| * 分支 | 6 |
| * 标签 | 30 |
| * 远程 | 19 |
| * 其它 | 12 |
| | |
| * 可达对象 | |
| * 计数 | 2217 |
| * 提交 | 279 |
| * 树 | 740 |
| * 数据对象 | 1168 |
| * 标签 | 30 |
The previous implementation used simple width formatting with printf()
which didn't properly handle multi-byte UTF-8 characters, causing
misaligned table columns when displaying repository structure
information.
This change modifies the stats_table_print_structure function to use
strbuf_utf8_align() instead of basic printf width specifiers. This
ensures proper column alignment regardless of the character encoding of
the content being displayed.
Co-developed-by: Gemini <noreply@developers.google.com>
Signed-off-by: Jiang Xin <worldhello.net@gmail.com>
---
builtin/repo.c | 22 ++++++++++++++++++----
1 file changed, 18 insertions(+), 4 deletions(-)
diff --git a/builtin/repo.c b/builtin/repo.c
index 9d4749f79b..d0b4a060b1 100644
--- a/builtin/repo.c
+++ b/builtin/repo.c
@@ -292,14 +292,21 @@ static void stats_table_print_structure(const struct stats_table *table)
int name_col_width = utf8_strwidth(name_col_title);
int value_col_width = utf8_strwidth(value_col_title);
struct string_list_item *item;
+ struct strbuf buf = STRBUF_INIT;
if (table->name_col_width > name_col_width)
name_col_width = table->name_col_width;
if (table->value_col_width > value_col_width)
value_col_width = table->value_col_width;
- printf("| %-*s | %-*s |\n", name_col_width, name_col_title,
- value_col_width, value_col_title);
+ strbuf_addstr(&buf, "| ");
+ strbuf_utf8_align(&buf, ALIGN_LEFT, name_col_width, name_col_title);
+ strbuf_addstr(&buf, " | ");
+ strbuf_utf8_align(&buf, ALIGN_LEFT, value_col_width, value_col_title);
+ strbuf_addstr(&buf, " |");
+ printf("%s\n", buf.buf);
+ strbuf_reset(&buf);
+
printf("| ");
for (int i = 0; i < name_col_width; i++)
putchar('-');
@@ -317,9 +324,16 @@ static void stats_table_print_structure(const struct stats_table *table)
value = entry->value;
}
- printf("| %-*s | %*s |\n", name_col_width, item->string,
- value_col_width, value);
+ strbuf_reset(&buf);
+ strbuf_addstr(&buf, "| ");
+ strbuf_utf8_align(&buf, ALIGN_LEFT, name_col_width, item->string);
+ strbuf_addstr(&buf, " | ");
+ strbuf_utf8_align(&buf, ALIGN_RIGHT, value_col_width, value);
+ strbuf_addstr(&buf, " |");
+ printf("%s\n", buf.buf);
}
+
+ strbuf_release(&buf);
}
static void stats_table_clear(struct stats_table *table)
--
2.52.0.rc2.5.g4c20a63325.dirty
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH 2/2] builtin/repo: fix table alignment for UTF-8 characters
2025-11-14 5:52 ` [PATCH 2/2] builtin/repo: fix table alignment for UTF-8 characters Jiang Xin
@ 2025-11-14 17:50 ` Justin Tobler
2025-11-15 12:41 ` Jiang Xin
2025-11-14 20:00 ` Junio C Hamano
1 sibling, 1 reply; 22+ messages in thread
From: Justin Tobler @ 2025-11-14 17:50 UTC (permalink / raw)
To: Jiang Xin
Cc: Junio C Hamano, Git List, Alexander Shopov, Mikel Forcada,
Ralf Thielow, Jean-Noël Avila, Bagas Sanjaya,
Dimitriy Ryazantcev, Peter Krefting, Emir SARI, Arkadii Yakovets,
Vũ Tiến Hưng, Teng Long, Yi-Jyun Pan, Gemini
On 25/11/14 12:52AM, Jiang Xin wrote:
> The output table from "git repo structure" is misaligned when displaying
> UTF-8 characters (e.g., non-ASCII glyphs). E.g.:
>
> | 仓库结构 | 值 |
> | -------------- | ---- |
> | * 引用 | |
> | * 计数 | 67 |
> | * 分支 | 6 |
> | * 标签 | 30 |
> | * 远程 | 19 |
> | * 其它 | 12 |
> | | |
> | * 可达对象 | |
> | * 计数 | 2217 |
> | * 提交 | 279 |
> | * 树 | 740 |
> | * 数据对象 | 1168 |
> | * 标签 | 30 |
>
> The previous implementation used simple width formatting with printf()
> which didn't properly handle multi-byte UTF-8 characters, causing
> misaligned table columns when displaying repository structure
> information.
Thanks for finding this issue and submitting a fix! I failed to consider
the fact that the printf() format specifier width would be counting
bytes. This causes the overall line width to fall short in some
scenarios with multi-byte UTF-8 characters.
> This change modifies the stats_table_print_structure function to use
> strbuf_utf8_align() instead of basic printf width specifiers. This
> ensures proper column alignment regardless of the character encoding of
> the content being displayed.
Makes sense.
> Co-developed-by: Gemini <noreply@developers.google.com>
> Signed-off-by: Jiang Xin <worldhello.net@gmail.com>
> ---
> builtin/repo.c | 22 ++++++++++++++++++----
> 1 file changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/builtin/repo.c b/builtin/repo.c
> index 9d4749f79b..d0b4a060b1 100644
> --- a/builtin/repo.c
> +++ b/builtin/repo.c
> @@ -292,14 +292,21 @@ static void stats_table_print_structure(const struct stats_table *table)
> int name_col_width = utf8_strwidth(name_col_title);
> int value_col_width = utf8_strwidth(value_col_title);
> struct string_list_item *item;
> + struct strbuf buf = STRBUF_INIT;
>
> if (table->name_col_width > name_col_width)
> name_col_width = table->name_col_width;
> if (table->value_col_width > value_col_width)
> value_col_width = table->value_col_width;
>
> - printf("| %-*s | %-*s |\n", name_col_width, name_col_title,
> - value_col_width, value_col_title);
> + strbuf_addstr(&buf, "| ");
> + strbuf_utf8_align(&buf, ALIGN_LEFT, name_col_width, name_col_title);
> + strbuf_addstr(&buf, " | ");
> + strbuf_utf8_align(&buf, ALIGN_LEFT, value_col_width, value_col_title);
> + strbuf_addstr(&buf, " |");
> + printf("%s\n", buf.buf);
Ok, using strbuf_utf8_align() compensates the line width when using
multi-byte UTF-8 characters to ensure the correct length. Looks good.
> + strbuf_reset(&buf);
Do we need to reset the buffer here? In the following loop we reset it
at the start of each iteration.
> +
> printf("| ");
> for (int i = 0; i < name_col_width; i++)
> putchar('-');
> @@ -317,9 +324,16 @@ static void stats_table_print_structure(const struct stats_table *table)
> value = entry->value;
> }
>
> - printf("| %-*s | %*s |\n", name_col_width, item->string,
> - value_col_width, value);
> + strbuf_reset(&buf);
> + strbuf_addstr(&buf, "| ");
> + strbuf_utf8_align(&buf, ALIGN_LEFT, name_col_width, item->string);
> + strbuf_addstr(&buf, " | ");
> + strbuf_utf8_align(&buf, ALIGN_RIGHT, value_col_width, value);
> + strbuf_addstr(&buf, " |");
> + printf("%s\n", buf.buf);
Here we do the same thing for the values column. Looks good to me.
Thanks,
-Justin
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/2] builtin/repo: fix table alignment for UTF-8 characters
2025-11-14 17:50 ` Justin Tobler
@ 2025-11-15 12:41 ` Jiang Xin
0 siblings, 0 replies; 22+ messages in thread
From: Jiang Xin @ 2025-11-15 12:41 UTC (permalink / raw)
To: Justin Tobler; +Cc: Junio C Hamano, Git List
On Sat, Nov 15, 2025 at 1:50 AM Justin Tobler <jltobler@gmail.com> wrote:
> Ok, using strbuf_utf8_align() compensates the line width when using
> multi-byte UTF-8 characters to ensure the correct length. Looks good.
>
> > + strbuf_reset(&buf);
>
> Do we need to reset the buffer here? In the following loop we reset it
> at the start of each iteration.
Will remove this line in next reroll.
--
Jiang Xin
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/2] builtin/repo: fix table alignment for UTF-8 characters
2025-11-14 5:52 ` [PATCH 2/2] builtin/repo: fix table alignment for UTF-8 characters Jiang Xin
2025-11-14 17:50 ` Justin Tobler
@ 2025-11-14 20:00 ` Junio C Hamano
2025-11-15 12:54 ` Jiang Xin
1 sibling, 1 reply; 22+ messages in thread
From: Junio C Hamano @ 2025-11-14 20:00 UTC (permalink / raw)
To: Jiang Xin
Cc: Git List, Justin Tobler, Alexander Shopov, Mikel Forcada,
Ralf Thielow, Jean-Noël Avila, Bagas Sanjaya,
Dimitriy Ryazantcev, Peter Krefting, Emir SARI, Arkadii Yakovets,
Vũ Tiến Hưng, Teng Long, Yi-Jyun Pan
Jiang Xin <worldhello.net@gmail.com> writes:
Not about the contents of the patch, but how was the list of
addresses on CC produced? Do they all have enough stakes in the
code being updated that they do not mind getting spammed like this?
Also, you had a non-address "Gemini <noreply@developers.google.com>",
which forced me and anybody who will respond to the patch to edit the Cc
address list (or suffer bounces). Please don't.
> The output table from "git repo structure" is misaligned when displaying
> UTF-8 characters (e.g., non-ASCII glyphs). E.g.:
>
> | 仓库结构 | 值 |
> | -------------- | ---- |
> | * 引用 | |
> | * 计数 | 67 |
> | * 分支 | 6 |
> | * 标签 | 30 |
> | * 远程 | 19 |
> | * 其它 | 12 |
> | | |
> | * 可达对象 | |
> | * 计数 | 2217 |
> | * 提交 | 279 |
> | * 树 | 740 |
> | * 数据对象 | 1168 |
> | * 标签 | 30 |
As there is a concrete reproduction sample from a specific tool, ...
> builtin/repo.c | 22 ++++++++++++++++++----
> 1 file changed, 18 insertions(+), 4 deletions(-)
... it is a good idea to protect the change with a new test or two
to verify the expected alignment in the output.
Thanks.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/2] builtin/repo: fix table alignment for UTF-8 characters
2025-11-14 20:00 ` Junio C Hamano
@ 2025-11-15 12:54 ` Jiang Xin
2025-11-15 16:36 ` Junio C Hamano
0 siblings, 1 reply; 22+ messages in thread
From: Jiang Xin @ 2025-11-15 12:54 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Git List, Justin Tobler
On Sat, Nov 15, 2025 at 4:00 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Jiang Xin <worldhello.net@gmail.com> writes:
>
> Not about the contents of the patch, but how was the list of
> addresses on CC produced? Do they all have enough stakes in the
> code being updated that they do not mind getting spammed like this?
I’m cc’ing this patch series to all Git l10n team leads to inform them
that the issue has been identified and will be fixed.
> Also, you had a non-address "Gemini <noreply@developers.google.com>",
> which forced me and anybody who will respond to the patch to edit the Cc
> address list (or suffer bounces). Please don't.
Will remove this trailer.
> > | * 提交 | 279 |
> > | * 树 | 740 |
> > | * 数据对象 | 1168 |
> > | * 标签 | 30 |
>
> As there is a concrete reproduction sample from a specific tool, ...
This output of the `git repo structure` command is based on the
Chinese translation for Git 2.52. The next reroll will retain only the
table header, which is sufficient to demonstrate the issue.
>
> > builtin/repo.c | 22 ++++++++++++++++++----
> > 1 file changed, 18 insertions(+), 4 deletions(-)
>
> ... it is a good idea to protect the change with a new test or two
> to verify the expected alignment in the output.
Will add test cases for strbuf_utf8_align(), a function newly
introduced in builtin/repo.c.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/2] builtin/repo: fix table alignment for UTF-8 characters
2025-11-15 12:54 ` Jiang Xin
@ 2025-11-15 16:36 ` Junio C Hamano
2025-11-16 13:32 ` Jiang Xin
0 siblings, 1 reply; 22+ messages in thread
From: Junio C Hamano @ 2025-11-15 16:36 UTC (permalink / raw)
To: Jiang Xin; +Cc: Git List, Justin Tobler
Jiang Xin <worldhello.net@gmail.com> writes:
>> > builtin/repo.c | 22 ++++++++++++++++++----
>> > 1 file changed, 18 insertions(+), 4 deletions(-)
>>
>> ... it is a good idea to protect the change with a new test or two
>> to verify the expected alignment in the output.
>
> Will add test cases for strbuf_utf8_align(), a function newly
> introduced in builtin/repo.c.
Unit tests are nice to make sure that building blocks like this
helper function work as expected. To ensure that the application
uses the building blocks correctly, you'd also need an end-to-end test,
getting output out of the tool ("repo struct"?) and checking it.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/2] builtin/repo: fix table alignment for UTF-8 characters
2025-11-15 16:36 ` Junio C Hamano
@ 2025-11-16 13:32 ` Jiang Xin
2025-11-16 16:51 ` Junio C Hamano
0 siblings, 1 reply; 22+ messages in thread
From: Jiang Xin @ 2025-11-16 13:32 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Git List, Justin Tobler
On Sun, Nov 16, 2025 at 12:36 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Jiang Xin <worldhello.net@gmail.com> writes:
>
> >> > builtin/repo.c | 22 ++++++++++++++++++----
> >> > 1 file changed, 18 insertions(+), 4 deletions(-)
> >>
> >> ... it is a good idea to protect the change with a new test or two
> >> to verify the expected alignment in the output.
> >
> > Will add test cases for strbuf_utf8_align(), a function newly
> > introduced in builtin/repo.c.
>
> Unit tests are nice to make sure that building blocks like this
> helper function work as expected. To ensure that the application
> uses the building blocks correctly, you'd also need an end-to-end test,
> getting output out of the tool ("repo struct"?) and checking it.
t1901 already includes test cases to safeguard the output of the
"git repo structure" command. I could add a new test case to
validate the output when localized in Chinese (as shown below),
but such a test would be inherently unstable, because it risks
breaking at the end of every release cycle whenever translations
change.
Therefore, I feel it's better to fix the issue by using strbuf_utf8_align()
and adding dedicated unit tests for it, rather than relying on
fragile end-to-end localization tests.
-------- 8< --------
diff --git a/t/t1901-repo-structure.sh b/t/t1901-repo-structure.sh
index 36a71a144e..fdab0a3d29 100755
--- a/t/t1901-repo-structure.sh
+++ b/t/t1901-repo-structure.sh
@@ -34,6 +34,37 @@ test_expect_success 'empty repository' '
)
'
+test_expect_success 'output repo structure in non-ASCII glyphs' '
+ test_when_finished "rm -rf repo" &&
+ git init repo &&
+ (
+ cd repo &&
+ cat >expect <<-\EOF &&
+ | 仓库结构 | 值 |
+ | -------------- | -- |
+ | * 引用 | |
+ | * 计数 | 0 |
+ | * 分支 | 0 |
+ | * 标签 | 0 |
+ | * 远程 | 0 |
+ | * 其它 | 0 |
+ | | |
+ | * 可达的对象 | |
+ | * 计数 | 0 |
+ | * 提交 | 0 |
+ | * 树 | 0 |
+ | * 数据对象 | 0 |
+ | * 标签 | 0 |
+ EOF
+
+ env LC_ALL=zh_CN.utf-8 \
+ git repo structure >out 2>err &&
+
+ test_cmp expect out &&
+ test_line_count = 0 err
+ )
+'
+
test_expect_success 'repository with references and objects' '
test_when_finished "rm -rf repo" &&
git init repo &&
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH 2/2] builtin/repo: fix table alignment for UTF-8 characters
2025-11-16 13:32 ` Jiang Xin
@ 2025-11-16 16:51 ` Junio C Hamano
0 siblings, 0 replies; 22+ messages in thread
From: Junio C Hamano @ 2025-11-16 16:51 UTC (permalink / raw)
To: Jiang Xin; +Cc: Git List, Justin Tobler
Jiang Xin <worldhello.net@gmail.com> writes:
> t1901 already includes test cases to safeguard the output of the
> "git repo structure" command. I could add a new test case to
> validate the output when localized in Chinese (as shown below),
> but such a test would be inherently unstable, because it risks
> breaking at the end of every release cycle whenever translations
> change.
I haven't considered the i18n aspect. We already compare program
output with expected output, so a change in a message has to be
updated together with the test that covers the code path, but po/
updates tend to come too late for test updates, so the problem is
much more serious.
OK. Let's omit this feature from end-to-end testing at least for
now. Thanks.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 0/2] Fix misaligned output of git repo structure
2025-11-14 5:52 [PATCH 0/2] Fix misaligned output of git repo structure Jiang Xin
2025-11-14 5:52 ` [PATCH 1/2] t/unit-tests: add UTF-8 width tests for CJK chars Jiang Xin
2025-11-14 5:52 ` [PATCH 2/2] builtin/repo: fix table alignment for UTF-8 characters Jiang Xin
@ 2025-11-14 7:41 ` Kristoffer Haugsbakk
2025-11-14 9:52 ` Jiang Xin
2025-11-14 16:13 ` Junio C Hamano
2025-11-15 13:36 ` [PATCH v2 " Jiang Xin
4 siblings, 1 reply; 22+ messages in thread
From: Kristoffer Haugsbakk @ 2025-11-14 7:41 UTC (permalink / raw)
To: Jiang Xin, Junio C Hamano, Git List, Justin Tobler
Cc: Alexander Shopov, Mikel Forcada, Ralf Thielow,
Jean-Noël AVILA, Bagas Sanjaya, Dimitriy Ryazantcev,
Peter Krefting, Emir SARI, Arkadii Yakovets,
Vũ Tiến Hưng, Teng Long, Yi-Jyun Pan
On Fri, Nov 14, 2025, at 06:52, Jiang Xin wrote:
> While localizing Git 2.52.0, I noticed that the output table from git
> repo structure becomes misaligned when displaying UTF-8 characters. For
> example:
>
>[snip]
>
> BTW, I used two AI coding tools (Claude Code and Gemini-CLI) to generate
> the commits, and added the "Co-developed-by" trailers in the commit
> messages by using one of my open-source projects:
Is `Co-developed-by` supposed to have a different meaning than the more
common `Co-authored-by`?
https://lore.kernel.org/git/xmqq1pq7re7q.fsf@gitster.g/
>
> - https://github.com/ai-coding-workshop/commit-msg
>
>
> ## Changes
>
>[snip]
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 0/2] Fix misaligned output of git repo structure
2025-11-14 7:41 ` [PATCH 0/2] Fix misaligned output of git repo structure Kristoffer Haugsbakk
@ 2025-11-14 9:52 ` Jiang Xin
2025-11-14 19:22 ` Junio C Hamano
0 siblings, 1 reply; 22+ messages in thread
From: Jiang Xin @ 2025-11-14 9:52 UTC (permalink / raw)
To: Kristoffer Haugsbakk
Cc: Junio C Hamano, Git List, Justin Tobler, Alexander Shopov,
Mikel Forcada, Ralf Thielow, Jean-Noël AVILA, Bagas Sanjaya,
Dimitriy Ryazantcev, Peter Krefting, Emir SARI, Arkadii Yakovets,
Vũ Tiến Hưng, Teng Long, Yi-Jyun Pan
On Fri, Nov 14, 2025 at 3:41 PM Kristoffer Haugsbakk
<kristofferhaugsbakk@fastmail.com> wrote:
>
> On Fri, Nov 14, 2025, at 06:52, Jiang Xin wrote:
> > While localizing Git 2.52.0, I noticed that the output table from git
> > repo structure becomes misaligned when displaying UTF-8 characters. For
> > example:
> >
> >[snip]
> >
> > BTW, I used two AI coding tools (Claude Code and Gemini-CLI) to generate
> > the commits, and added the "Co-developed-by" trailers in the commit
> > messages by using one of my open-source projects:
>
> Is `Co-developed-by` supposed to have a different meaning than the more
> common `Co-authored-by`?
This is a very good question.
**Background**
At Alibaba Cloud, our development team uses a variety of AI coding tools,
including Cursor, Claude Code, Gemini-CLI, Lingma, and Qoder. To
measure adoption—specifically, how many developers are using AI coding
tools and how much code is AI-generated—we needed a unified tracking
mechanism compatible with all these tools. I chose to implement a git
commit-msg hook that automatically detects the AI coding tool responsible
for a commit based on environment variables at commit time.
**Why choose the Co-developed-by trailer for AI developer?**
Git repositories already use the Co-authored-by trailer to credit human
collaborators. Since any human developer, including co-authors, may use
AI coding tools to assist their work, introducing a distinct trailer
like Co-developed-by allows us to clearly differentiate between human
contributors and the AI tools they used. For example, the following
commit trailers indicate two human engineers and the respective AI
coding tools they employed:
Co-developed-by: Cursor <noreply@cursor.com>
Co-authored-by: Real Person <real.person@example.com>
Co-developed-by: Gemini <noreply@developers.google.com>
Signed-off-by: Me <me@example.com>
I noticed that Sasha Levin (NVIDIA) previously proposed adopting
the Co-developed-by trailer for the Linux kernel as well.
- https://ostechnix.com/linux-kernel-ai-coding-assistants-rules-proposal/
--
Jiang Xin
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 0/2] Fix misaligned output of git repo structure
2025-11-14 9:52 ` Jiang Xin
@ 2025-11-14 19:22 ` Junio C Hamano
2025-11-15 12:25 ` Jiang Xin
0 siblings, 1 reply; 22+ messages in thread
From: Junio C Hamano @ 2025-11-14 19:22 UTC (permalink / raw)
To: Jiang Xin
Cc: Kristoffer Haugsbakk, Git List, Justin Tobler, Alexander Shopov,
Mikel Forcada, Ralf Thielow, Jean-Noël AVILA, Bagas Sanjaya,
Dimitriy Ryazantcev, Peter Krefting, Emir SARI, Arkadii Yakovets,
Vũ Tiến Hưng, Teng Long, Yi-Jyun Pan
Jiang Xin <worldhello.net@gmail.com> writes:
>> Is `Co-developed-by` supposed to have a different meaning than the more
>> common `Co-authored-by`?
>
> This is a very good question.
>
> **Background**
>
> At Alibaba Cloud, our development team uses a variety of AI coding tools,
> including Cursor, Claude Code, Gemini-CLI, Lingma, and Qoder, etc. To
> measure adoption—specifically, how many developers are using AI coding
> tools and how much code is AI-generated—we needed a unified tracking
> mechanism compatible with all these tools. I chose to implement a git
> commit-msg hook that automatically detects the AI coding tool responsible
> for a commit based on environment variables at commit time.
In other words, the addition of this trailer is solely to help
corporations like Alibaba measure which AI tools are used (and what
correlation there is between the success rate of patches and the
tools that generated them, etc.).
What is in it for us? What benefit are we getting in exchange for
tolerating these additional trailer lines in our log messages?
A few random thoughts about generated contents:
* Disclosing the tools that were used during the development of a
patch is a good practice in principle, but this is not limited to
use of AI tools. We have fixes for issues found with existing
Coccinelle checks, sanitizers, static checkers, and it is the
usual practice for the patches that fix them to disclose how the
author discovered the issue. When making mechanical replacement
changes en masse, it is the usual practice for the patches to
describe what scripts were used to make the changes in them. But
we do not dedicate a trailer line for such a disclosure, and
there is no reason why AI tools have to be treated specially here.
Instead of "Co-developed-by" that only tells what tool was used,
why not disclose what prompts (again, somehow AI tools are
treated specially here, too---we call the input to these tools
"scripts" when the changes were made with sed or perl or
coccinelle) were used?
* Whether some or all contents in a submitted patch were generated
by tools, it does not change the obligation of the person who
submits the patch. They need to make sure that the changes are
reviewable, that their goal and implementation are described
appropriately in the proposed log message, and that the updated
code does what the log message claims. They need to make sure
that they have the right to contribute the patch under DCO, and
sign off their patch accordingly.
* What is made more difficult for a submitter with AI tools is that
it is often not obvious to the human developer how much of the
tools' generated output is parroting what the tools saw during
their training session, and what the licensing terms of these
training materials are. Even if a hypothetical AI tool were
trained only with BSD licensed material, the output from such a
tool is likely to place you under certain obligations, such as
including the original copyright notice, but unless the tool
discloses its sources to you, the human developer, you do not
even know whose copyright notice to include.
* Worse yet, the above difficulty is only for the submitter of such
a patch, not the project that, trusting what the sign-off of the
submitter certifies, reviews and accepts such a patch. It does
not make any difference if the original submitter copied and
pasted proprietary code of their employer in the patch, or
included code that AI tools "borrowed" from elsewhere without
following proper procedure to honor the licensing terms. In
either case, the project may have accepted what was stolen
without knowing, and it is very likely that the submitter, not
the project, is primarily held liable. In a sense, the project
would be better off if the patch does not say it was generated
with AI tools---if the project does not know, it cannot possibly
be held liable for it, even though the project will have to waste
engineering resources to rewrite or remove the remnant from such
a faulty contribution.
* Re: [PATCH 0/2] Fix misaligned output of git repo structure
2025-11-14 19:22 ` Junio C Hamano
@ 2025-11-15 12:25 ` Jiang Xin
0 siblings, 0 replies; 22+ messages in thread
From: Jiang Xin @ 2025-11-15 12:25 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Git List, Justin Tobler
On Sat, Nov 15, 2025 at 3:22 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Jiang Xin <worldhello.net@gmail.com> writes:
>
> >> Is `Co-developed-by` supposed to have a different meaning than the more
> >> common `Co-authored-by`?
> >
> > This is a very good question.
> >
> > **Background**
> >
> > At Alibaba Cloud, our development team uses a variety of AI coding tools,
> > including Cursor, Claude Code, Gemini-CLI, Lingma, and Qoder, etc. To
> > measure adoption—specifically, how many developers are using AI coding
> > tools and how much code is AI-generated—we needed a unified tracking
> > mechanism compatible with all these tools. I chose to implement a git
> > commit-msg hook that automatically detects the AI coding tool responsible
> > for a commit based on environment variables at commit time.
>
> In other words, addition of this is solely to help corporations like
> Alibaba to measure which AI tools are used (and what correlation
> there are between success rate of the patches and the tools that
> generated them, etc..
>
> What is in it for us? What benefit are we getting in exchange for
> tolerating these additional trailer lines in our log messages?
>
> A few random thoughts about generated contents:
Regardless of whether the trailer used in commits to identify AI
coding tools was leaked intentionally or unintentionally, the
following insights are extremely valuable—thank you!
I’ll add the appropriate configuration to disable the commit-msg
hook’s automatic modification of commit messages for git.git
repositories, and I’ll review any AI-generated code more carefully,
if present.
--
Jiang Xin
* Re: [PATCH 0/2] Fix misaligned output of git repo structure
2025-11-14 5:52 [PATCH 0/2] Fix misaligned output of git repo structure Jiang Xin
` (2 preceding siblings ...)
2025-11-14 7:41 ` [PATCH 0/2] Fix misaligned output of git repo structure Kristoffer Haugsbakk
@ 2025-11-14 16:13 ` Junio C Hamano
2025-11-15 13:36 ` [PATCH v2 " Jiang Xin
4 siblings, 0 replies; 22+ messages in thread
From: Junio C Hamano @ 2025-11-14 16:13 UTC (permalink / raw)
To: Jiang Xin
Cc: Git List, Justin Tobler, Alexander Shopov, Mikel Forcada,
Ralf Thielow, Jean-Noël Avila, Bagas Sanjaya,
Dimitriy Ryazantcev, Peter Krefting, Emir SARI, Arkadii Yakovets,
Vũ Tiến Hưng, Teng Long, Yi-Jyun Pan
Jiang Xin <worldhello.net@gmail.com> writes:
> BTW, I used two AI coding tools (Claude Code and Gemini-CLI) to generate
> the commits, and added the "Co-developed-by" trailers in the commit
> messages by using one of my opensource project:
We had a mini-thread on this recently.
https://lore.kernel.org/git/xmqqo6p9zo8f.fsf@gitster.g/
* [PATCH v2 0/2] Fix misaligned output of git repo structure
2025-11-14 5:52 [PATCH 0/2] Fix misaligned output of git repo structure Jiang Xin
` (3 preceding siblings ...)
2025-11-14 16:13 ` Junio C Hamano
@ 2025-11-15 13:36 ` Jiang Xin
2025-11-15 13:36 ` [PATCH v2 1/2] t/unit-tests: add UTF-8 width tests for CJK chars Jiang Xin
2025-11-15 13:36 ` [PATCH v2 2/2] builtin/repo: fix table alignment for UTF-8 characters Jiang Xin
4 siblings, 2 replies; 22+ messages in thread
From: Jiang Xin @ 2025-11-15 13:36 UTC (permalink / raw)
To: Junio C Hamano, Git List, Justin Tobler; +Cc: Jiang Xin
While localizing Git 2.52.0, I noticed that the output table from git
repo structure becomes misaligned when displaying UTF-8 characters. For
example:
| 仓库结构 | 值 |
| -------------- | ---- |
The previous implementation used simple width formatting with printf()
which didn't properly handle multi-byte UTF-8 characters, causing
misaligned table columns when displaying repository structure
information.
This change modifies the stats_table_print_structure function to use
strbuf_utf8_align() instead of basic printf width specifiers. This
ensures proper column alignment regardless of the character encoding of
the content being displayed.
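To illustrate why the printf() width specifier misaligns such rows, here
is a minimal standalone sketch (not code from this series). The C
standard counts the field width of %s in bytes, so a 6-byte, 4-column
string such as "你好" receives fewer padding spaces than a 4-byte ASCII
string of the same display width. The helper names pad_with_printf() and
trailing_spaces() are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Left-align s in a field of `width` using printf-style padding.
 * Returns the number of bytes written to out (excluding the NUL).
 * Note: printf counts the width in bytes, not display columns.
 */
int pad_with_printf(char *out, size_t outsz, const char *s, int width)
{
	return snprintf(out, outsz, "%-*s", width, s);
}

/* Count the spaces at the end of an already padded string. */
int trailing_spaces(const char *padded)
{
	size_t len = strlen(padded);
	int n = 0;

	while (len > 0 && padded[len - 1] == ' ') {
		n++;
		len--;
	}
	return n;
}
```

With a field width of 8, "abcd" gets 4 trailing spaces (8 columns in
total), while "你好" gets only 2 (6 columns in total), which is the kind
of shift visible in the table above.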
Jiang Xin (2):
t/unit-tests: add UTF-8 width tests for CJK chars
builtin/repo: fix table alignment for UTF-8 characters
Makefile | 1 +
builtin/repo.c | 21 ++++--
t/meson.build | 1 +
t/unit-tests/u-utf8-width.c | 134 ++++++++++++++++++++++++++++++++++++
4 files changed, 153 insertions(+), 4 deletions(-)
create mode 100644 t/unit-tests/u-utf8-width.c
## Range-diff vs v1:
1: 53c1e5219b ! 1: 72e73484d2 t/unit-tests: add UTF-8 width tests for CJK chars
@@ Metadata
## Commit message ##
t/unit-tests: add UTF-8 width tests for CJK chars
- This commit adds a new test suite (u-utf8-width.c) to test the UTF-8
- width functions in Git, particularly focusing on multi-byte characters
- from East Asian languages like Chinese, Japanese, and Korean that
- typically require 2 display columns per character.
-
- The test suite includes:
- - Tests for utf8_strnwidth with Chinese strings
- - Tests for utf8_strwidth with Chinese strings
- - Tests for Japanese and Korean characters
- - Edge case tests with invalid UTF-8 sequences
- - Proper test function naming following the Clar framework convention
+ The file "builtin/repo.c" uses utf8_strwidth() to calculate the display
+ width of UTF-8 characters in a table, but the resulting output is still
+ misaligned. Add test cases for both utf8_strwidth and utf8_strnwidth to
+ verify that they correctly compute the display width for UTF-8
+ characters.
Also updated the build configuration in Makefile and meson.build to
include the new test suite in the build process.
- Co-developed-by: Claude <noreply@anthropic.com>
Signed-off-by: Jiang Xin <worldhello.net@gmail.com>
## Makefile ##
@@ t/unit-tests/u-utf8-width.c (new)
+ */
+void test_utf8_width__strnwidth_chinese(void)
+{
-+ const char *ansi_test;
+ const char *str;
+
+ /* Test basic ASCII - each character should have width 1 */
-+ cl_assert_equal_i(5, utf8_strnwidth("hello", 5, 0));
-+ cl_assert_equal_i(5, utf8_strnwidth("hello", 5, 1)); /* skip_ansi = 1 */
++ cl_assert_equal_i(5, utf8_strnwidth("Hello", 5, 0));
++ /* skip_ansi = 1 */
++ cl_assert_equal_i(5, utf8_strnwidth("Hello", 5, 1));
+
+ /* Test simple Chinese characters - each should have width 2 */
-+ cl_assert_equal_i(4, utf8_strnwidth("你好", 6, 0)); /* "你好" is 6 bytes (3 bytes per char in UTF-8), 4 display columns */
++ /* "你好" is 6 bytes (3 bytes per char in UTF-8), 4 display columns */
++ cl_assert_equal_i(4, utf8_strnwidth("你好", 6, 0));
+
+ /* Test mixed ASCII and Chinese - ASCII = 1 column, Chinese = 2 columns */
-+ cl_assert_equal_i(6, utf8_strnwidth("hi你好", 8, 0)); /* "h"(1) + "i"(1) + "你"(2) + "好"(2) = 6 */
++ /* "h"(1) + "i"(1) + "你"(2) + "好"(2) = 6 */
++ cl_assert_equal_i(6, utf8_strnwidth("Hi你好", 8, 0));
+
+ /* Test longer Chinese string */
-+ cl_assert_equal_i(10, utf8_strnwidth("你好世界!", 15, 0)); /* 5 Chinese chars = 10 display columns */
-+
-+ /* Test with skip_ansi = 1 to make sure it works with escape sequences */
-+ ansi_test = "\033[31m你好\033[0m";
-+ cl_assert_equal_i(4, utf8_strnwidth(ansi_test, strlen(ansi_test), 1)); /* Skip escape sequences, just count "你好" which should be 4 columns */
++ /* 5 Chinese chars = 10 display columns */
++ cl_assert_equal_i(10, utf8_strnwidth("你好世界!", 15, 0));
+
+ /* Test individual Chinese character width */
-+ cl_assert_equal_i(2, utf8_strnwidth("中", 3, 0)); /* Single Chinese char should be 2 columns */
++ cl_assert_equal_i(2, utf8_strnwidth("中", 3, 0));
+
+ /* Test empty string */
+ cl_assert_equal_i(0, utf8_strnwidth("", 0, 0));
+
+ /* Test length limiting */
+ str = "你好世界";
-+ cl_assert_equal_i(2, utf8_strnwidth(str, 3, 0)); /* Only first char "你"(2 columns) within 3 bytes */
-+ cl_assert_equal_i(4, utf8_strnwidth(str, 6, 0)); /* First two chars "你好"(4 columns) in 6 bytes */
++ /* Only first char "你"(2 columns) within 3 bytes */
++ cl_assert_equal_i(2, utf8_strnwidth(str, 3, 0));
++ /* First two chars "你好"(4 columns) in 6 bytes */
++ cl_assert_equal_i(4, utf8_strnwidth(str, 6, 0));
+}
+
+/*
@@ t/unit-tests/u-utf8-width.c (new)
+void test_utf8_width__strwidth_chinese(void)
+{
+ /* Test basic ASCII */
-+ cl_assert_equal_i(5, utf8_strwidth("hello"));
++ cl_assert_equal_i(5, utf8_strwidth("Hello"));
+
+ /* Test Chinese characters */
-+ cl_assert_equal_i(4, utf8_strwidth("你好")); /* 2 Chinese chars = 4 display columns */
++ /* 2 Chinese chars = 4 display columns */
++ cl_assert_equal_i(4, utf8_strwidth("你好"));
++
++ /* Test longer Chinese string */
++ /* 5 Chinese chars = 10 display columns */
++ cl_assert_equal_i(10, utf8_strwidth("你好世界!"));
+
+ /* Test mixed ASCII and Chinese */
-+ cl_assert_equal_i(9, utf8_strwidth("hello世界")); /* 5 ASCII (5 cols) + 2 Chinese (4 cols) = 9 */
-+ cl_assert_equal_i(7, utf8_strwidth("hi世界!")); /* 2 ASCII (2 cols) + 2 Chinese (4 cols) + 1 ASCII (1 col) = 7 */
++ /* 5 ASCII (5 cols) + 2 Chinese (4 cols) = 9 */
++ cl_assert_equal_i(9, utf8_strwidth("Hello世界"));
++ /* 2 ASCII (2 cols) + 2 Chinese (4 cols) + 1 ASCII (1 col) = 7 */
++ cl_assert_equal_i(7, utf8_strwidth("Hi世界!"));
+}
+
+/*
@@ t/unit-tests/u-utf8-width.c (new)
+void test_utf8_width__strnwidth_japanese_korean(void)
+{
+ /* Japanese characters (should also be 2 columns each) */
-+ cl_assert_equal_i(10, utf8_strnwidth("こんにちは", 15, 0)); /* 5 Japanese chars @ 2 cols each = 10 display columns */
++ /* 5 Japanese chars x 2 cols each = 10 display columns */
++ cl_assert_equal_i(10, utf8_strnwidth("こんにちは", 15, 0));
+
+ /* Korean characters (should also be 2 columns each) */
-+ cl_assert_equal_i(10, utf8_strnwidth("안녕하세요", 15, 0)); /* 5 Korean chars @ 2 cols each = 10 display columns */
++ /* 5 Korean chars x 2 cols each = 10 display columns */
++ cl_assert_equal_i(10, utf8_strnwidth("안녕하세요", 15, 0));
+}
+
+/*
-+ * Test edge cases with partial UTF-8 sequences
++ * Test utf8_strnwidth with CJK strings and ANSI sequences
+ */
-+void test_utf8_width__strnwidth_edge_cases(void)
++void test_utf8_width__strnwidth_cjk_with_ansi(void)
+{
-+ const char *invalid;
-+ unsigned char truncated_bytes[] = {0xe4, 0xbd, 0x00}; /* First 2 bytes of "中" + null */
-+
-+ /* Test invalid UTF-8 - should fall back to byte count */
-+ invalid = "\xff\xfe"; /* Invalid UTF-8 sequence */
-+ cl_assert_equal_i(2, utf8_strnwidth(invalid, 2, 0)); /* Should return length if invalid UTF-8 */
-+
-+ /* Test partial UTF-8 character (truncated) */
-+ cl_assert_equal_i(2, utf8_strnwidth((const char*)truncated_bytes, 2, 0)); /* Invalid UTF-8, returns byte count */
++ /* Test CJK with ANSI sequences */
++ const char *ansi_test = "\033[1m你好\033[0m";
++ int width = utf8_strnwidth(ansi_test, strlen(ansi_test), 1);
++ /* Should skip ANSI sequences and count "你好" as 4 columns */
++ cl_assert_equal_i(4, width);
++
++ /* Test mixed ASCII, CJK, and ANSI */
++ ansi_test = "Hello\033[32m世界\033[0m!";
++ width = utf8_strnwidth(ansi_test, strlen(ansi_test), 1);
++ /* "Hello"(5) + "世界"(4) + "!"(1) = 10 */
++ cl_assert_equal_i(10, width);
+}
2: 65efad527f ! 2: d0975427c9 builtin/repo: fix table alignment for UTF-8 characters
@@ Commit message
| -------------- | ---- |
| * 引用 | |
| * 计数 | 67 |
- | * 分支 | 6 |
- | * 标签 | 30 |
- | * 远程 | 19 |
- | * 其它 | 12 |
- | | |
- | * 可达对象 | |
- | * 计数 | 2217 |
- | * 提交 | 279 |
- | * 树 | 740 |
- | * 数据对象 | 1168 |
- | * 标签 | 30 |
The previous implementation used simple width formatting with printf()
which didn't properly handle multi-byte UTF-8 characters, causing
@@ Commit message
ensures proper column alignment regardless of the character encoding of
the content being displayed.
- Co-developed-by: Gemini <noreply@developers.google.com>
+ Also add test cases for strbuf_utf8_align(), a function newly introduced
+ in "builtin/repo.c".
+
Signed-off-by: Jiang Xin <worldhello.net@gmail.com>
## builtin/repo.c ##
@@ builtin/repo.c: static void stats_table_print_structure(const struct stats_table
+ strbuf_utf8_align(&buf, ALIGN_LEFT, value_col_width, value_col_title);
+ strbuf_addstr(&buf, " |");
+ printf("%s\n", buf.buf);
-+ strbuf_reset(&buf);
+
printf("| ");
for (int i = 0; i < name_col_width; i++)
@@ builtin/repo.c: static void stats_table_print_structure(const struct stats_table
}
static void stats_table_clear(struct stats_table *table)
+
+ ## t/unit-tests/u-utf8-width.c ##
+@@ t/unit-tests/u-utf8-width.c: void test_utf8_width__strnwidth_cjk_with_ansi(void)
+ /* "Hello"(5) + "世界"(4) + "!"(1) = 10 */
+ cl_assert_equal_i(10, width);
+ }
++
++/*
++ * Test the strbuf_utf8_align function with CJK characters
++ */
++void test_utf8_width__strbuf_utf8_align(void)
++{
++ struct strbuf buf = STRBUF_INIT;
++
++ /* Test left alignment with CJK */
++ strbuf_utf8_align(&buf, ALIGN_LEFT, 10, "你好");
++ /* Since "你好" is 4 display columns, we need 6 more spaces to reach 10 */
++ cl_assert_equal_s("你好      ", buf.buf);
++ strbuf_reset(&buf);
++
++ /* Test right alignment with CJK */
++ strbuf_utf8_align(&buf, ALIGN_RIGHT, 8, "世界");
++ /* "世界" is 4 display columns, so we need 4 leading spaces */
++ cl_assert_equal_s("    世界", buf.buf);
++ strbuf_reset(&buf);
++
++ /* Test center alignment with CJK */
++ strbuf_utf8_align(&buf, ALIGN_MIDDLE, 10, "中");
++ /* "中" is 2 display columns, so (10-2)/2 = 4 spaces on left, 4 on right */
++ cl_assert_equal_s("    中    ", buf.buf);
++ strbuf_reset(&buf);
++
++ strbuf_utf8_align(&buf, ALIGN_MIDDLE, 5, "中");
++ /* "中" is 2 display columns, so (5-2)/2 = 1 spaces on left, 2 on right */
++ cl_assert_equal_s(" 中  ", buf.buf);
++ strbuf_reset(&buf);
++
++ /* Test alignment that is smaller than string width */
++ strbuf_utf8_align(&buf, ALIGN_LEFT, 2, "你好");
++ /* Since "你好" is 4 display columns, it should not be truncated */
++ cl_assert_equal_s("你好", buf.buf);
++ strbuf_release(&buf);
++}
--
Jiang Xin
* [PATCH v2 1/2] t/unit-tests: add UTF-8 width tests for CJK chars
2025-11-15 13:36 ` [PATCH v2 " Jiang Xin
@ 2025-11-15 13:36 ` Jiang Xin
2025-11-15 13:36 ` [PATCH v2 2/2] builtin/repo: fix table alignment for UTF-8 characters Jiang Xin
1 sibling, 0 replies; 22+ messages in thread
From: Jiang Xin @ 2025-11-15 13:36 UTC (permalink / raw)
To: Junio C Hamano, Git List, Justin Tobler; +Cc: Jiang Xin
The file "builtin/repo.c" uses utf8_strwidth() to calculate the display
width of UTF-8 characters in a table, but the resulting output is still
misaligned. Add test cases for both utf8_strwidth and utf8_strnwidth to
verify that they correctly compute the display width for UTF-8
characters.
Also updated the build configuration in Makefile and meson.build to
include the new test suite in the build process.
Signed-off-by: Jiang Xin <worldhello.net@gmail.com>
---
Makefile | 1 +
t/meson.build | 1 +
t/unit-tests/u-utf8-width.c | 97 +++++++++++++++++++++++++++++++++++++
3 files changed, 99 insertions(+)
create mode 100644 t/unit-tests/u-utf8-width.c
diff --git a/Makefile b/Makefile
index 7e0f77e298..2a67546154 100644
--- a/Makefile
+++ b/Makefile
@@ -1525,6 +1525,7 @@ CLAR_TEST_SUITES += u-string-list
CLAR_TEST_SUITES += u-strvec
CLAR_TEST_SUITES += u-trailer
CLAR_TEST_SUITES += u-urlmatch-normalization
+CLAR_TEST_SUITES += u-utf8-width
CLAR_TEST_PROG = $(UNIT_TEST_BIN)/unit-tests$(X)
CLAR_TEST_OBJS = $(patsubst %,$(UNIT_TEST_DIR)/%.o,$(CLAR_TEST_SUITES))
CLAR_TEST_OBJS += $(UNIT_TEST_DIR)/clar/clar.o
diff --git a/t/meson.build b/t/meson.build
index a5531df415..dc43d69636 100644
--- a/t/meson.build
+++ b/t/meson.build
@@ -24,6 +24,7 @@ clar_test_suites = [
'unit-tests/u-strvec.c',
'unit-tests/u-trailer.c',
'unit-tests/u-urlmatch-normalization.c',
+ 'unit-tests/u-utf8-width.c',
]
clar_sources = [
diff --git a/t/unit-tests/u-utf8-width.c b/t/unit-tests/u-utf8-width.c
new file mode 100644
index 0000000000..3766f19726
--- /dev/null
+++ b/t/unit-tests/u-utf8-width.c
@@ -0,0 +1,97 @@
+#include "unit-test.h"
+#include "utf8.h"
+#include "strbuf.h"
+
+/*
+ * Test utf8_strnwidth with various Chinese strings
+ * Chinese characters typically have a width of 2 columns when displayed
+ */
+void test_utf8_width__strnwidth_chinese(void)
+{
+ const char *str;
+
+ /* Test basic ASCII - each character should have width 1 */
+ cl_assert_equal_i(5, utf8_strnwidth("Hello", 5, 0));
+ /* skip_ansi = 1 */
+ cl_assert_equal_i(5, utf8_strnwidth("Hello", 5, 1));
+
+ /* Test simple Chinese characters - each should have width 2 */
+ /* "你好" is 6 bytes (3 bytes per char in UTF-8), 4 display columns */
+ cl_assert_equal_i(4, utf8_strnwidth("你好", 6, 0));
+
+ /* Test mixed ASCII and Chinese - ASCII = 1 column, Chinese = 2 columns */
+ /* "h"(1) + "i"(1) + "你"(2) + "好"(2) = 6 */
+ cl_assert_equal_i(6, utf8_strnwidth("Hi你好", 8, 0));
+
+ /* Test longer Chinese string */
+ /* 5 Chinese chars = 10 display columns */
+ cl_assert_equal_i(10, utf8_strnwidth("你好世界!", 15, 0));
+
+ /* Test individual Chinese character width */
+ cl_assert_equal_i(2, utf8_strnwidth("中", 3, 0));
+
+ /* Test empty string */
+ cl_assert_equal_i(0, utf8_strnwidth("", 0, 0));
+
+ /* Test length limiting */
+ str = "你好世界";
+ /* Only first char "你"(2 columns) within 3 bytes */
+ cl_assert_equal_i(2, utf8_strnwidth(str, 3, 0));
+ /* First two chars "你好"(4 columns) in 6 bytes */
+ cl_assert_equal_i(4, utf8_strnwidth(str, 6, 0));
+}
+
+/*
+ * Tests for utf8_strwidth (simpler version without length limit)
+ */
+void test_utf8_width__strwidth_chinese(void)
+{
+ /* Test basic ASCII */
+ cl_assert_equal_i(5, utf8_strwidth("Hello"));
+
+ /* Test Chinese characters */
+ /* 2 Chinese chars = 4 display columns */
+ cl_assert_equal_i(4, utf8_strwidth("你好"));
+
+ /* Test longer Chinese string */
+ /* 5 Chinese chars = 10 display columns */
+ cl_assert_equal_i(10, utf8_strwidth("你好世界!"));
+
+ /* Test mixed ASCII and Chinese */
+ /* 5 ASCII (5 cols) + 2 Chinese (4 cols) = 9 */
+ cl_assert_equal_i(9, utf8_strwidth("Hello世界"));
+ /* 2 ASCII (2 cols) + 2 Chinese (4 cols) + 1 ASCII (1 col) = 7 */
+ cl_assert_equal_i(7, utf8_strwidth("Hi世界!"));
+}
+
+/*
+ * Additional tests with other East Asian characters
+ */
+void test_utf8_width__strnwidth_japanese_korean(void)
+{
+ /* Japanese characters (should also be 2 columns each) */
+ /* 5 Japanese chars x 2 cols each = 10 display columns */
+ cl_assert_equal_i(10, utf8_strnwidth("こんにちは", 15, 0));
+
+ /* Korean characters (should also be 2 columns each) */
+ /* 5 Korean chars x 2 cols each = 10 display columns */
+ cl_assert_equal_i(10, utf8_strnwidth("안녕하세요", 15, 0));
+}
+
+/*
+ * Test utf8_strnwidth with CJK strings and ANSI sequences
+ */
+void test_utf8_width__strnwidth_cjk_with_ansi(void)
+{
+ /* Test CJK with ANSI sequences */
+ const char *ansi_test = "\033[1m你好\033[0m";
+ int width = utf8_strnwidth(ansi_test, strlen(ansi_test), 1);
+ /* Should skip ANSI sequences and count "你好" as 4 columns */
+ cl_assert_equal_i(4, width);
+
+ /* Test mixed ASCII, CJK, and ANSI */
+ ansi_test = "Hello\033[32m世界\033[0m!";
+ width = utf8_strnwidth(ansi_test, strlen(ansi_test), 1);
+ /* "Hello"(5) + "世界"(4) + "!"(1) = 10 */
+ cl_assert_equal_i(10, width);
+}
--
2.52.0.rc2.5.g4c20a63325.dirty
* [PATCH v2 2/2] builtin/repo: fix table alignment for UTF-8 characters
2025-11-15 13:36 ` [PATCH v2 " Jiang Xin
2025-11-15 13:36 ` [PATCH v2 1/2] t/unit-tests: add UTF-8 width tests for CJK chars Jiang Xin
@ 2025-11-15 13:36 ` Jiang Xin
2025-11-15 15:04 ` Phillip Wood
1 sibling, 1 reply; 22+ messages in thread
From: Jiang Xin @ 2025-11-15 13:36 UTC (permalink / raw)
To: Junio C Hamano, Git List, Justin Tobler; +Cc: Jiang Xin
The output table from "git repo structure" is misaligned when displaying
UTF-8 characters (e.g., non-ASCII glyphs). E.g.:
| 仓库结构 | 值 |
| -------------- | ---- |
| * 引用 | |
| * 计数 | 67 |
The previous implementation used simple width formatting with printf()
which didn't properly handle multi-byte UTF-8 characters, causing
misaligned table columns when displaying repository structure
information.
This change modifies the stats_table_print_structure function to use
strbuf_utf8_align() instead of basic printf width specifiers. This
ensures proper column alignment regardless of the character encoding of
the content being displayed.
Also add test cases for strbuf_utf8_align(), a function newly introduced
in "builtin/repo.c".
Signed-off-by: Jiang Xin <worldhello.net@gmail.com>
---
builtin/repo.c | 21 +++++++++++++++++----
t/unit-tests/u-utf8-width.c | 37 +++++++++++++++++++++++++++++++++++++
2 files changed, 54 insertions(+), 4 deletions(-)
diff --git a/builtin/repo.c b/builtin/repo.c
index 9d4749f79b..e3adb353a2 100644
--- a/builtin/repo.c
+++ b/builtin/repo.c
@@ -292,14 +292,20 @@ static void stats_table_print_structure(const struct stats_table *table)
int name_col_width = utf8_strwidth(name_col_title);
int value_col_width = utf8_strwidth(value_col_title);
struct string_list_item *item;
+ struct strbuf buf = STRBUF_INIT;
if (table->name_col_width > name_col_width)
name_col_width = table->name_col_width;
if (table->value_col_width > value_col_width)
value_col_width = table->value_col_width;
- printf("| %-*s | %-*s |\n", name_col_width, name_col_title,
- value_col_width, value_col_title);
+ strbuf_addstr(&buf, "| ");
+ strbuf_utf8_align(&buf, ALIGN_LEFT, name_col_width, name_col_title);
+ strbuf_addstr(&buf, " | ");
+ strbuf_utf8_align(&buf, ALIGN_LEFT, value_col_width, value_col_title);
+ strbuf_addstr(&buf, " |");
+ printf("%s\n", buf.buf);
+
printf("| ");
for (int i = 0; i < name_col_width; i++)
putchar('-');
@@ -317,9 +323,16 @@ static void stats_table_print_structure(const struct stats_table *table)
value = entry->value;
}
- printf("| %-*s | %*s |\n", name_col_width, item->string,
- value_col_width, value);
+ strbuf_reset(&buf);
+ strbuf_addstr(&buf, "| ");
+ strbuf_utf8_align(&buf, ALIGN_LEFT, name_col_width, item->string);
+ strbuf_addstr(&buf, " | ");
+ strbuf_utf8_align(&buf, ALIGN_RIGHT, value_col_width, value);
+ strbuf_addstr(&buf, " |");
+ printf("%s\n", buf.buf);
}
+
+ strbuf_release(&buf);
}
static void stats_table_clear(struct stats_table *table)
diff --git a/t/unit-tests/u-utf8-width.c b/t/unit-tests/u-utf8-width.c
index 3766f19726..86e09c3574 100644
--- a/t/unit-tests/u-utf8-width.c
+++ b/t/unit-tests/u-utf8-width.c
@@ -95,3 +95,40 @@ void test_utf8_width__strnwidth_cjk_with_ansi(void)
/* "Hello"(5) + "世界"(4) + "!"(1) = 10 */
cl_assert_equal_i(10, width);
}
+
+/*
+ * Test the strbuf_utf8_align function with CJK characters
+ */
+void test_utf8_width__strbuf_utf8_align(void)
+{
+ struct strbuf buf = STRBUF_INIT;
+
+ /* Test left alignment with CJK */
+ strbuf_utf8_align(&buf, ALIGN_LEFT, 10, "你好");
+ /* Since "你好" is 4 display columns, we need 6 more spaces to reach 10 */
+ cl_assert_equal_s("你好      ", buf.buf);
+ strbuf_reset(&buf);
+
+ /* Test right alignment with CJK */
+ strbuf_utf8_align(&buf, ALIGN_RIGHT, 8, "世界");
+ /* "世界" is 4 display columns, so we need 4 leading spaces */
+ cl_assert_equal_s("    世界", buf.buf);
+ strbuf_reset(&buf);
+
+ /* Test center alignment with CJK */
+ strbuf_utf8_align(&buf, ALIGN_MIDDLE, 10, "中");
+ /* "中" is 2 display columns, so (10-2)/2 = 4 spaces on left, 4 on right */
+ cl_assert_equal_s("    中    ", buf.buf);
+ strbuf_reset(&buf);
+
+ strbuf_utf8_align(&buf, ALIGN_MIDDLE, 5, "中");
+ /* "中" is 2 display columns, so (5-2)/2 = 1 spaces on left, 2 on right */
+ cl_assert_equal_s(" 中  ", buf.buf);
+ strbuf_reset(&buf);
+
+ /* Test alignment that is smaller than string width */
+ strbuf_utf8_align(&buf, ALIGN_LEFT, 2, "你好");
+ /* Since "你好" is 4 display columns, it should not be truncated */
+ cl_assert_equal_s("你好", buf.buf);
+ strbuf_release(&buf);
+}
--
2.52.0.rc2.5.g4c20a63325.dirty
* Re: [PATCH v2 2/2] builtin/repo: fix table alignment for UTF-8 characters
2025-11-15 13:36 ` [PATCH v2 2/2] builtin/repo: fix table alignment for UTF-8 characters Jiang Xin
@ 2025-11-15 15:04 ` Phillip Wood
2025-11-15 16:49 ` Junio C Hamano
0 siblings, 1 reply; 22+ messages in thread
From: Phillip Wood @ 2025-11-15 15:04 UTC (permalink / raw)
To: Jiang Xin, Junio C Hamano, Git List, Justin Tobler
Hi Jiang
On 15/11/2025 13:36, Jiang Xin wrote:
> The output table from "git repo structure" is misaligned when displaying
> UTF-8 characters (e.g., non-ASCII glyphs). E.g.:
>
> | 仓库结构 | 值 |
> | -------------- | ---- |
> | * 引用 | |
> | * 计数 | 67 |
>
> The previous implementation used simple width formatting with printf()
> which didn't properly handle multi-byte UTF-8 characters, causing
> misaligned table columns when displaying repository structure
> information.
>
> This change modifies the stats_table_print_structure function to use
> strbuf_utf8_align() instead of basic printf width specifiers. This
> ensures proper column alignment regardless of the character encoding of
> the content being displayed.
How does it ensure proper column alignment for non-utf8 encodings? I
don't see how it is possible to calculate the display width without
knowing the encoding.
> Also add test cases for strbuf_utf8_align(), a function newly introduced
> in "builtin/repo.c".
Nice.
Using strbuf_utf8_align ends up being quite verbose. An alternative
would be to keep using printf() but calculate the padding ourselves as
shown below. Either way we end up calling utf8_strwidth() twice on the
same string which is a bit of a shame but probably doesn't matter too
much in the grand scheme of things.
Thanks
Phillip
---- 8< ----
builtin/repo.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/builtin/repo.c b/builtin/repo.c
index 9d4749f79be..1b139b89672 100644
--- a/builtin/repo.c
+++ b/builtin/repo.c
@@ -298,8 +298,9 @@ static void stats_table_print_structure(const struct stats_table *table)
if (table->value_col_width > value_col_width)
value_col_width = table->value_col_width;
- printf("| %-*s | %-*s |\n", name_col_width, name_col_title,
- value_col_width, value_col_title);
+ printf("| %s%*s | %s%*s |\n",
+ name_col_title, name_col_width - utf8_strwidth(name_col_title), "",
+ value_col_title, value_col_width - utf8_strwidth(value_col_title), "");
printf("| ");
for (int i = 0; i < name_col_width; i++)
putchar('-');
@@ -317,8 +318,9 @@ static void stats_table_print_structure(const struct stats_table *table)
value = entry->value;
}
- printf("| %-*s | %*s |\n", name_col_width, item->string,
- value_col_width, value);
+ printf("| %s%*s | %*s%s |\n",
+ item->string, name_col_width - utf8_strwidth(item->string), "",
+ value_col_width - utf8_strwidth(value), "", value);
}
}
* Re: [PATCH v2 2/2] builtin/repo: fix table alignment for UTF-8 characters
2025-11-15 15:04 ` Phillip Wood
@ 2025-11-15 16:49 ` Junio C Hamano
0 siblings, 0 replies; 22+ messages in thread
From: Junio C Hamano @ 2025-11-15 16:49 UTC (permalink / raw)
To: Phillip Wood; +Cc: Jiang Xin, Git List, Justin Tobler
Phillip Wood <phillip.wood123@gmail.com> writes:
> How does it ensure proper column alignment for non-utf8 encodings? I
> don't see how it is possible to calculate the display width without
> knowing the encoding.
Correct. But for Git, pretty much the ship has sailed, I am afraid.
All tools that rely on utf8_strwidth() are "broken" in that way if
you feed latin-1 or ISO/IEC 2022, and that includes "diff --stat"
with pathnames in non-UTF8 encodings (I do not remember if we fully
fixed the codepath for UTF-8---it used to be broken even for UTF-8).
> Using strbuf_utf8_align ends up being quite verbose. An alternative
> would be to keep using printf() but calculate the padding ourselves as
> shown below.
I think that has been the preferred way to do this, utf8_strwidth()
to measure and decide how wide each column can be, then for each row,
we measure and make printf() fit, or truncate when the column we decide
to allocate cannot accommodate the data on a particular row that is
overly long.
Thanks.
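The measure-then-pad approach described above can be sketched as
follows. This is a self-contained illustration under stated assumptions,
not Git's implementation: display_width() is a simplified stand-in for
utf8_strwidth() that treats only a few East Asian ranges as two columns
wide and ignores combining and zero-width characters, and decode_utf8(),
is_wide() and format_row() are hypothetical helpers. Truncation of
overly long cells is left out.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Decode one UTF-8 sequence from s (len bytes remaining, len >= 1).
 * Store the code point in *cp and return the bytes consumed. Invalid
 * or truncated input is consumed as a single byte.
 */
size_t decode_utf8(const unsigned char *s, size_t len, unsigned long *cp)
{
	size_t need, i;

	if (s[0] < 0x80) {
		*cp = s[0];
		return 1;
	} else if ((s[0] & 0xe0) == 0xc0) {
		*cp = s[0] & 0x1f;
		need = 1;
	} else if ((s[0] & 0xf0) == 0xe0) {
		*cp = s[0] & 0x0f;
		need = 2;
	} else if ((s[0] & 0xf8) == 0xf0) {
		*cp = s[0] & 0x07;
		need = 3;
	} else {
		*cp = s[0];
		return 1;
	}
	if (need >= len) {
		*cp = s[0];
		return 1;
	}
	for (i = 1; i <= need; i++) {
		if ((s[i] & 0xc0) != 0x80) {
			*cp = s[0];
			return 1;
		}
		*cp = (*cp << 6) | (s[i] & 0x3f);
	}
	return need + 1;
}

/* Simplified assumption: only these ranges occupy two columns. */
int is_wide(unsigned long cp)
{
	return (cp >= 0x1100 && cp <= 0x115f) ||	/* Hangul Jamo */
	       (cp >= 0x2e80 && cp <= 0xa4cf) ||	/* CJK, kana, Yi */
	       (cp >= 0xac00 && cp <= 0xd7a3) ||	/* Hangul syllables */
	       (cp >= 0xf900 && cp <= 0xfaff) ||	/* CJK compatibility */
	       (cp >= 0xff00 && cp <= 0xff60) ||	/* fullwidth forms */
	       (cp >= 0x20000 && cp <= 0x3fffd);	/* CJK extensions */
}

/* Sum the column widths of all characters in a NUL-terminated string. */
int display_width(const char *str)
{
	const unsigned char *s = (const unsigned char *)str;
	size_t len = strlen(str);
	int width = 0;

	while (len > 0) {
		unsigned long cp;
		size_t used = decode_utf8(s, len, &cp);

		width += is_wide(cp) ? 2 : 1;
		s += used;
		len -= used;
	}
	return width;
}

/* Left-align name and right-align value, padding by display columns. */
int format_row(char *out, size_t outsz, const char *name, int name_w,
	       const char *value, int value_w)
{
	int name_pad = name_w - display_width(name);
	int value_pad = value_w - display_width(value);

	/* A negative width would flip printf's alignment, so clamp it. */
	if (name_pad < 0)
		name_pad = 0;
	if (value_pad < 0)
		value_pad = 0;
	return snprintf(out, outsz, "| %s%*s | %*s%s |",
			name, name_pad, "", value_pad, "", value);
}
```

For example, format_row(buf, sizeof(buf), "你好", 6, "2217", 4) pads the
name with two spaces because the string already occupies four of the six
columns, so the row ends up as wide as one built from an ASCII name.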