* [GSoC PATCH 1/1] diff: improve scaling of filenames in diffstat to handle UTF-8 chars
@ 2026-01-14 22:27 LorenzoPegorari
2026-01-14 22:50 ` Junio C Hamano
2026-01-16 0:04 ` [GSoC PATCH v2 0/2] " LorenzoPegorari
0 siblings, 2 replies; 7+ messages in thread
From: LorenzoPegorari @ 2026-01-14 22:27 UTC (permalink / raw)
To: git; +Cc: Junio C Hamano
The `show_stats()` function tries to scale the filenames in the diffstat to
ensure they don't exceed the given `name-width`. It does so by calculating
the "display width" of the characters to be dropped, but then advances the
filename pointer by that number of bytes.
However, the "display width" of a character is not always equal to its byte
count. The result is that sometimes, when displaying UTF-8 characters,
filenames exceed the given `name-width`, and frequently the bytes of the
UTF-8 characters are truncated.
The following is an example of the issue, where the 2 files are "HelloHi" and
"Hello你好", and `name-width=6`:
...oHi | 0
...<BD><A0>好 | 0
Make the filename pointer move by the actual number of bytes of the
characters to drop from the filename, rather than their display width, using
the `utf8_width()` function.
Signed-off-by: LorenzoPegorari <lorenzo.pegorari2002@gmail.com>
---
diff.c | 15 ++++-----------
1 file changed, 4 insertions(+), 11 deletions(-)
diff --git a/diff.c b/diff.c
index a68ddd2168..271ace5728 100644
--- a/diff.c
+++ b/diff.c
@@ -2859,17 +2859,10 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
char *slash;
prefix = "...";
len -= 3;
- /*
- * NEEDSWORK: (name_len - len) counts the display
- * width, which would be shorter than the byte
- * length of the corresponding substring.
- * Advancing "name" by that number of bytes does
- * *NOT* skip over that many columns, so it is
- * very likely that chomping the pathname at the
- * slash we will find starting from "name" will
- * leave the resulting string still too long.
- */
- name += name_len - len;
+
+ while (name_len > len)
+ name_len -= utf8_width((const char**)&name, NULL);
+
slash = strchr(name, '/');
if (slash)
name = slash;
--
2.43.0
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [GSoC PATCH 1/1] diff: improve scaling of filenames in diffstat to handle UTF-8 chars
2026-01-14 22:27 [GSoC PATCH 1/1] diff: improve scaling of filenames in diffstat to handle UTF-8 chars LorenzoPegorari
@ 2026-01-14 22:50 ` Junio C Hamano
2026-01-16 0:00 ` Lorenzo Pegorari
2026-01-16 0:04 ` [GSoC PATCH v2 0/2] " LorenzoPegorari
1 sibling, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2026-01-14 22:50 UTC (permalink / raw)
To: LorenzoPegorari; +Cc: git
LorenzoPegorari <lorenzo.pegorari2002@gmail.com> writes:
> The `show_stats()` function tries to scale the filenames in the diffstat to
> ensure they don't exceed the given `name-width`. It does so by calculating
> the "display width" of the characters to be dropped, but then advances the
> filename pointer by that number of bytes.
>
> However, the "display width" of a character is not always equal to its byte
> count. The result is that sometimes, when displaying UTF-8 characters,
> filenames exceed the given `name-width`, and frequently the bytes of the
> UTF-8 characters are truncated.
>
> The following is an example of the issue, where the 2 files are "HelloHi" and
> "Hello你好", and `name-width=6`:
>
> ...oHi | 0
> ...<BD><A0>好 | 0
>
> Make the filename pointer move by the actual number of bytes of the
> characters to drop from the filename, rather than their display width, using
> the `utf8_width()` function.
>
> Signed-off-by: LorenzoPegorari <lorenzo.pegorari2002@gmail.com>
> ---
> diff.c | 15 ++++-----------
> 1 file changed, 4 insertions(+), 11 deletions(-)
Two comments and a half.
* The change needed for this is surprisingly simple.
* You already know about samples that may exhibit the issue you are
addressing. Can we add it as a test case somewhere in t/
directory?
* The NEEDSWORK item addressed by this patch is one of the two
NEEDSWORK items added by ce8529b2 (diff: leave NEEDWORK notes in
show_stats() function, 2022-10-21). Makes me wonder how involved
the changes would need to be to solve the other one?
Thanks.
> diff --git a/diff.c b/diff.c
> index a68ddd2168..271ace5728 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -2859,17 +2859,10 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
> char *slash;
> prefix = "...";
> len -= 3;
> - /*
> - * NEEDSWORK: (name_len - len) counts the display
> - * width, which would be shorter than the byte
> - * length of the corresponding substring.
> - * Advancing "name" by that number of bytes does
> - * *NOT* skip over that many columns, so it is
> - * very likely that chomping the pathname at the
> - * slash we will find starting from "name" will
> - * leave the resulting string still too long.
> - */
> - name += name_len - len;
> +
> + while (name_len > len)
> + name_len -= utf8_width((const char**)&name, NULL);
> +
> slash = strchr(name, '/');
> if (slash)
> name = slash;
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [GSoC PATCH 1/1] diff: improve scaling of filenames in diffstat to handle UTF-8 chars
2026-01-14 22:50 ` Junio C Hamano
@ 2026-01-16 0:00 ` Lorenzo Pegorari
0 siblings, 0 replies; 7+ messages in thread
From: Lorenzo Pegorari @ 2026-01-16 0:00 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Wed, Jan 14, 2026 at 02:50:02PM -0800, Junio C Hamano wrote:
> LorenzoPegorari <lorenzo.pegorari2002@gmail.com> writes:
>
> > The `show_stats()` function tries to scale the filenames in the diffstat to
> > ensure they don't exceed the given `name-width`. It does so by calculating
> > the "display width" of the characters to be dropped, but then advances the
> > filename pointer by that number of bytes.
> >
> > However, the "display width" of a character is not always equal to its byte
> > count. The result is that sometimes, when displaying UTF-8 characters,
> > filenames exceed the given `name-width`, and frequently the bytes of the
> > UTF-8 characters are truncated.
> >
> > The following is an example of the issue, where the 2 files are "HelloHi" and
> > "Hello你好", and `name-width=6`:
> >
> > ...oHi | 0
> > ...<BD><A0>好 | 0
> >
> > Make the filename pointer move by the actual number of bytes of the
> > characters to drop from the filename, rather than their display width, using
> > the `utf8_width()` function.
> >
> > Signed-off-by: LorenzoPegorari <lorenzo.pegorari2002@gmail.com>
> > ---
> > diff.c | 15 ++++-----------
> > 1 file changed, 4 insertions(+), 11 deletions(-)
>
> Two comments and a half.
>
> * The change needed for this is surprisingly simple.
It is indeed surprisingly simple, I agree!
> * You already know about samples that may exhibit the issue you are
> addressing. Can we add it as a test case somewhere in t/
> directory?
Yeah, we should add a test case. I will do it in the next reroll.
> * The NEEDSWORK item addressed by this patch is one of the two
> NEEDSWORK items added by ce8529b2 (diff: leave NEEDWORK notes in
> show_stats() function, 2022-10-21). Makes me wonder how involved
> the changes would need to be to solve the other one?
Mmh, I see. I'll take a closer look, but at a first glance it doesn't
seem too involved.
>
> Thanks.
>
Thank you!
>
> > diff --git a/diff.c b/diff.c
> > index a68ddd2168..271ace5728 100644
> > --- a/diff.c
> > +++ b/diff.c
> > @@ -2859,17 +2859,10 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
> > char *slash;
> > prefix = "...";
> > len -= 3;
> > - /*
> > - * NEEDSWORK: (name_len - len) counts the display
> > - * width, which would be shorter than the byte
> > - * length of the corresponding substring.
> > - * Advancing "name" by that number of bytes does
> > - * *NOT* skip over that many columns, so it is
> > - * very likely that chomping the pathname at the
> > - * slash we will find starting from "name" will
> > - * leave the resulting string still too long.
> > - */
> > - name += name_len - len;
> > +
> > + while (name_len > len)
> > + name_len -= utf8_width((const char**)&name, NULL);
> > +
> > slash = strchr(name, '/');
> > if (slash)
> > name = slash;
^ permalink raw reply [flat|nested] 7+ messages in thread
* [GSoC PATCH v2 0/2] diff: improve scaling of filenames in diffstat to handle UTF-8 chars
2026-01-14 22:27 [GSoC PATCH 1/1] diff: improve scaling of filenames in diffstat to handle UTF-8 chars LorenzoPegorari
2026-01-14 22:50 ` Junio C Hamano
@ 2026-01-16 0:04 ` LorenzoPegorari
2026-01-16 0:05 ` [GSoC PATCH v2 1/2] " LorenzoPegorari
2026-01-16 0:05 ` [GSoC PATCH v2 2/2] t4073: add test for diffstat paths length when containing " LorenzoPegorari
1 sibling, 2 replies; 7+ messages in thread
From: LorenzoPegorari @ 2026-01-16 0:04 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Jeff King, Toon Claes, Justin Tobler,
Niels Glodny, Patrick Steinhardt
Added a test (as Junio Hamano suggested) to check how the generated
diffstat handles UTF-8 characters when given various `name-width`s.
This allowed me to notice a bug where, if the given `name-width` was 2
or less, the `len` variable would become negative, entering an infinite
loop. So I also fixed this bug.
LorenzoPegorari (2):
diff: improve scaling of filenames in diffstat to handle UTF-8 chars
t4073: add test for diffstat paths length when containing UTF-8 chars
diff.c | 17 ++++-----
t/meson.build | 1 +
t/t4073-diff-stat-name-width.sh | 61 +++++++++++++++++++++++++++++++++
3 files changed, 68 insertions(+), 11 deletions(-)
create mode 100755 t/t4073-diff-stat-name-width.sh
Range-diff against v1:
1: 63e73122d1 ! 1: abeb8d3439 diff: improve scaling of filenames in diffstat to handle UTF-8 chars
@@ Commit message
characters to drop from the filename, rather than their display width, using
the `utf8_width()` function.
+ Force `len` to not be less than 0 (this happens if the given `name-width` is
+ 2 or less), otherwise an infinite loop is entered.
+
Signed-off-by: LorenzoPegorari <lorenzo.pegorari2002@gmail.com>
## diff.c ##
@@ diff.c: static void show_stats(struct diffstat_t *data, struct diff_options *opt
- * leave the resulting string still too long.
- */
- name += name_len - len;
++ if (len < 0)
++ len = 0;
+
+ while (name_len > len)
+ name_len -= utf8_width((const char**)&name, NULL);
-: ---------- > 2: ee088ea6ef t4073: add test for diffstat paths length when containing UTF-8 chars
--
2.43.0
^ permalink raw reply [flat|nested] 7+ messages in thread
* [GSoC PATCH v2 1/2] diff: improve scaling of filenames in diffstat to handle UTF-8 chars
2026-01-16 0:04 ` [GSoC PATCH v2 0/2] " LorenzoPegorari
@ 2026-01-16 0:05 ` LorenzoPegorari
2026-01-16 0:05 ` [GSoC PATCH v2 2/2] t4073: add test for diffstat paths length when containing " LorenzoPegorari
1 sibling, 0 replies; 7+ messages in thread
From: LorenzoPegorari @ 2026-01-16 0:05 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Jeff King, Toon Claes, Justin Tobler,
Niels Glodny, Patrick Steinhardt
The `show_stats()` function tries to scale the filenames in the diffstat to
ensure they don't exceed the given `name-width`. It does so by calculating
the "display width" of the characters to be dropped, but then advances the
filename pointer by that number of bytes.
However, the "display width" of a character is not always equal to its byte
count. The result is that sometimes, when displaying UTF-8 characters,
filenames exceed the given `name-width`, and frequently the bytes of the
UTF-8 characters are truncated.
The following is an example of the issue, where the 2 files are "HelloHi" and
"Hello你好", and `name-width=6`:
...oHi | 0
...<BD><A0>好 | 0
Make the filename pointer move by the actual number of bytes of the
characters to drop from the filename, rather than their display width, using
the `utf8_width()` function.
Force `len` to not be less than 0 (this happens if the given `name-width` is
2 or less), otherwise an infinite loop is entered.
Signed-off-by: LorenzoPegorari <lorenzo.pegorari2002@gmail.com>
---
diff.c | 17 ++++++-----------
1 file changed, 6 insertions(+), 11 deletions(-)
diff --git a/diff.c b/diff.c
index a68ddd2168..452fc69775 100644
--- a/diff.c
+++ b/diff.c
@@ -2859,17 +2859,12 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
char *slash;
prefix = "...";
len -= 3;
- /*
- * NEEDSWORK: (name_len - len) counts the display
- * width, which would be shorter than the byte
- * length of the corresponding substring.
- * Advancing "name" by that number of bytes does
- * *NOT* skip over that many columns, so it is
- * very likely that chomping the pathname at the
- * slash we will find starting from "name" will
- * leave the resulting string still too long.
- */
- name += name_len - len;
+ if (len < 0)
+ len = 0;
+
+ while (name_len > len)
+ name_len -= utf8_width((const char**)&name, NULL);
+
slash = strchr(name, '/');
if (slash)
name = slash;
--
2.43.0
^ permalink raw reply related [flat|nested] 7+ messages in thread
* [GSoC PATCH v2 2/2] t4073: add test for diffstat paths length when containing UTF-8 chars
2026-01-16 0:04 ` [GSoC PATCH v2 0/2] " LorenzoPegorari
2026-01-16 0:05 ` [GSoC PATCH v2 1/2] " LorenzoPegorari
@ 2026-01-16 0:05 ` LorenzoPegorari
2026-01-17 17:52 ` Junio C Hamano
1 sibling, 1 reply; 7+ messages in thread
From: LorenzoPegorari @ 2026-01-16 0:05 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Jeff King, Toon Claes, Justin Tobler,
Niels Glodny, Patrick Steinhardt
Add test checking the length of filepaths containing UTF-8 chars when
generating a diffstat with various `name-width`s.
Signed-off-by: LorenzoPegorari <lorenzo.pegorari2002@gmail.com>
---
t/meson.build | 1 +
t/t4073-diff-stat-name-width.sh | 61 +++++++++++++++++++++++++++++++++
2 files changed, 62 insertions(+)
create mode 100755 t/t4073-diff-stat-name-width.sh
diff --git a/t/meson.build b/t/meson.build
index 459c52a489..f2ad6d2f12 100644
--- a/t/meson.build
+++ b/t/meson.build
@@ -498,6 +498,7 @@ integration_tests = [
't4070-diff-pairs.sh',
't4071-diff-minimal.sh',
't4072-diff-max-depth.sh',
+ 't4073-diff-stat.sh',
't4100-apply-stat.sh',
't4101-apply-nonl.sh',
't4102-apply-rename.sh',
diff --git a/t/t4073-diff-stat-name-width.sh b/t/t4073-diff-stat-name-width.sh
new file mode 100755
index 0000000000..ec5d3c3c1f
--- /dev/null
+++ b/t/t4073-diff-stat-name-width.sh
@@ -0,0 +1,61 @@
+#!/bin/sh
+
+test_description='git-diff check diffstat filepaths length when containing UTF-8 chars'
+
+. ./test-lib.sh
+
+
+create_files () {
+ mkdir -p "d你好" &&
+ touch "d你好/f再见"
+}
+
+test_expect_success 'setup' '
+ git init &&
+ git config core.quotepath off &&
+ git commit -m "Initial commit" --allow-empty &&
+ create_files &&
+ git add . &&
+ git commit -m "Added files"
+'
+
+test_expect_success 'test name-width long enough for filepath' '
+ git diff HEAD~1 HEAD --stat --stat-name-width=12 >out &&
+ grep "d你好/f再见 |" out &&
+ git diff HEAD~1 HEAD --stat --stat-name-width=11 >out &&
+ grep "d你好/f再见 |" out
+'
+
+test_expect_success 'test name-width not long enough for dir name' '
+ git diff HEAD~1 HEAD --stat --stat-name-width=10 >out &&
+ grep ".../f再见 |" out &&
+ git diff HEAD~1 HEAD --stat --stat-name-width=9 >out &&
+ grep ".../f再见 |" out
+'
+
+test_expect_success 'test name-width not long enough for slash' '
+ git diff HEAD~1 HEAD --stat --stat-name-width=8 >out &&
+ grep "...f再见 |" out
+'
+
+test_expect_success 'test name-width not long enough for file name' '
+ git diff HEAD~1 HEAD --stat --stat-name-width=7 >out &&
+ grep "...再见 |" out &&
+ git diff HEAD~1 HEAD --stat --stat-name-width=6 >out &&
+ grep "...见 |" out &&
+ git diff HEAD~1 HEAD --stat --stat-name-width=5 >out &&
+ grep "...见 |" out &&
+ git diff HEAD~1 HEAD --stat --stat-name-width=4 >out &&
+ grep "... |" out
+'
+
+test_expect_success 'test name-width minimum length' '
+ git diff HEAD~1 HEAD --stat --stat-name-width=3 >out &&
+ grep "... |" out &&
+ git diff HEAD~1 HEAD --stat --stat-name-width=2 >out &&
+ grep "... |" out &&
+ git diff HEAD~1 HEAD --stat --stat-name-width=1 >out &&
+ grep "... |" out
+'
+
+test_done
--
2.43.0
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [GSoC PATCH v2 2/2] t4073: add test for diffstat paths length when containing UTF-8 chars
2026-01-16 0:05 ` [GSoC PATCH v2 2/2] t4073: add test for diffstat paths length when containing " LorenzoPegorari
@ 2026-01-17 17:52 ` Junio C Hamano
0 siblings, 0 replies; 7+ messages in thread
From: Junio C Hamano @ 2026-01-17 17:52 UTC (permalink / raw)
To: LorenzoPegorari
Cc: git, Jeff King, Toon Claes, Justin Tobler, Niels Glodny,
Patrick Steinhardt
LorenzoPegorari <lorenzo.pegorari2002@gmail.com> writes:
> diff --git a/t/meson.build b/t/meson.build
> index 459c52a489..f2ad6d2f12 100644
> --- a/t/meson.build
> +++ b/t/meson.build
> @@ -498,6 +498,7 @@ integration_tests = [
> 't4070-diff-pairs.sh',
> 't4071-diff-minimal.sh',
> 't4072-diff-max-depth.sh',
> + 't4073-diff-stat.sh',
This name ...
> 't4100-apply-stat.sh',
> 't4101-apply-nonl.sh',
> 't4102-apply-rename.sh',
> diff --git a/t/t4073-diff-stat-name-width.sh b/t/t4073-diff-stat-name-width.sh
> new file mode 100755
... must match this one. I already locally updated the former to
match, so no need to resend, but if you need to reroll the patch in
the future, please make sure to correct this part. Thanks.
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2026-01-17 17:52 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-01-14 22:27 [GSoC PATCH 1/1] diff: improve scaling of filenames in diffstat to handle UTF-8 chars LorenzoPegorari
2026-01-14 22:50 ` Junio C Hamano
2026-01-16 0:00 ` Lorenzo Pegorari
2026-01-16 0:04 ` [GSoC PATCH v2 0/2] " LorenzoPegorari
2026-01-16 0:05 ` [GSoC PATCH v2 1/2] " LorenzoPegorari
2026-01-16 0:05 ` [GSoC PATCH v2 2/2] t4073: add test for diffstat paths length when containing " LorenzoPegorari
2026-01-17 17:52 ` Junio C Hamano
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox