From: "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: LorenzoPegorari <lorenzo.pegorari2002@gmail.com>,
Elijah Newren <newren@gmail.com>,
Elijah Newren <newren@gmail.com>,
Elijah Newren <newren@gmail.com>
Subject: [PATCH v2] diff: fix out-of-bounds reads and NULL deref in diffstat UTF-8 truncation
Date: Fri, 17 Apr 2026 22:45:10 +0000 [thread overview]
Message-ID: <pull.2093.v2.git.1776465910538.gitgitgadget@gmail.com> (raw)
In-Reply-To: <pull.2093.git.1776443163041.gitgitgadget@gmail.com>
From: Elijah Newren <newren@gmail.com>
f85b49f3d4a (diff: improve scaling of filenames in diffstat to handle
UTF-8 chars, 2026-01-16) introduced a loop in show_stats() that calls
utf8_width() repeatedly to skip leading characters until the displayed
width fits. However, utf8_width() can return problematic values:
- For invalid UTF-8 sequences, pick_one_utf8_char() sets the name
pointer to NULL and utf8_width() returns 0. Since name_len does
not change, the loop iterates once more and pick_one_utf8_char()
dereferences the NULL pointer, crashing.
- For control characters, utf8_width() returns -1, so name_len
grows when it is expected to shrink. This can cause the loop to
consume more characters than the string contains, reading past
the trailing NUL.
By default, fill_print_name() will C-quotes filenames which escapes
control characters and invalid bytes to printable text. That avoids
this bug from being triggered; however, with core.quotePath=false,
raw bytes can reach this code.
Add tests exercising both failure modes with core.quotePath=false and
a narrow --stat-name-width to force truncation: one with a bare 0xC0
byte (invalid UTF-8 lead byte, triggers NULL deref) and one with a
0x01 byte (control character, causes the loop to read past the end
of the string).
Fix both issues by introducing utf8_ish_width(), a thin wrapper
around utf8_width() that guarantees the pointer always advances and
the returned width is never negative:
- On invalid UTF-8 it restores the pointer, advances by one byte,
and returns width 1 (matching the strlen()-based fallback used
by utf8_strwidth()).
- On a control character it returns 0 (matching utf8_strnwidth()
which skips them).
Also add a "&& *name" guard to the while-loop condition so it
terminates at end-of-string even when utf8_strwidth()'s strlen()
fallback causes name_len to exceed the sum of per-character widths.
Signed-off-by: Elijah Newren <newren@gmail.com>
---
diff: fix out-of-bounds reads and NULL deref in diffstat UTF-8
truncation
Changes since v1:
* Simplified the loop to almost what we had before via a wrapper
function that always succeeds in advancing the string and never
returns a negative width. (Which, as a consequence, treats invalid
UTF-8 and control characters the roughly the same, unlike v1.)
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-2093%2Fnewren%2Ffix%2Fdiffstat-utf8-loop-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-2093/newren/fix/diffstat-utf8-loop-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/2093
Range-diff vs v1:
1: fcd44d6cf8 ! 1: 4a72647ce2 diff: fix out-of-bounds reads and NULL deref in diffstat UTF-8 truncation
@@ Commit message
diff: fix out-of-bounds reads and NULL deref in diffstat UTF-8 truncation
f85b49f3d4a (diff: improve scaling of filenames in diffstat to handle
- UTF-8 chars, 2024-10-27) introduced a loop in show_stats() that calls
+ UTF-8 chars, 2026-01-16) introduced a loop in show_stats() that calls
utf8_width() repeatedly to skip leading characters until the displayed
width fits. However, utf8_width() can return problematic values:
@@ Commit message
0x01 byte (control character, causes the loop to read past the end
of the string).
- Fix the bug by:
- - Adding a *name check to terminate the loop at end-of-string
- - Detecting the NULL pointer from invalid UTF-8 and falling back to
- showing the full untruncated name
- - Breaking on negative width (control characters)
+ Fix both issues by introducing utf8_ish_width(), a thin wrapper
+ around utf8_width() that guarantees the pointer always advances and
+ the returned width is never negative:
+
+ - On invalid UTF-8 it restores the pointer, advances by one byte,
+ and returns width 1 (matching the strlen()-based fallback used
+ by utf8_strwidth()).
+ - On a control character it returns 0 (matching utf8_strnwidth()
+ which skips them).
+
+ Also add a "&& *name" guard to the while-loop condition so it
+ terminates at end-of-string even when utf8_strwidth()'s strlen()
+ fallback causes name_len to exceed the sum of per-character widths.
Signed-off-by: Elijah Newren <newren@gmail.com>
## diff.c ##
+@@ diff.c: void print_stat_summary(FILE *fp, int files,
+ print_stat_summary_inserts_deletes(&o, files, insertions, deletions);
+ }
+
++/*
++ * Like utf8_width(), but guaranteed safe for use in loops that subtract
++ * per-character widths:
++ *
++ * - utf8_width() sets *start to NULL on invalid UTF-8 and returns 0;
++ * we restore the pointer and advance by one byte, returning width 1
++ * (matching the strlen()-based fallback in utf8_strwidth()).
++ *
++ * - utf8_width() returns -1 for control characters; we return 0
++ * (matching utf8_strnwidth() which skips them).
++ */
++static int utf8_ish_width(const char **start)
++{
++ const char *old = *start;
++ int w = utf8_width(start, NULL);
++ if (!*start) {
++ *start = old + 1;
++ return 1;
++ }
++ return (w < 0) ? 0 : w;
++}
++
+ static void show_stats(struct diffstat_t *data, struct diff_options *options)
+ {
+ int i, len, add, del, adds = 0, dels = 0;
@@ diff.c: static void show_stats(struct diffstat_t *data, struct diff_options *options)
if (len < 0)
len = 0;
- while (name_len > len)
- name_len -= utf8_width((const char**)&name, NULL);
-+ while (name_len > len && *name) {
-+ int w = utf8_width((const char **)&name, NULL);
-+ if (!name) { /* Invalid UTF-8 */
-+ name = file->print_name;
-+ name_len = utf8_strwidth(name);
-+ break;
-+ }
-+ if (w < 0) /* control character */
-+ break;
-+ name_len -= w;
-+ }
++ while (name_len > len && *name)
++ name_len -= utf8_ish_width((const char**)&name);
slash = strchr(name, '/');
if (slash)
diff.c | 26 ++++++++++++++++++++++++--
t/t4052-stat-output.sh | 25 +++++++++++++++++++++++++
2 files changed, 49 insertions(+), 2 deletions(-)
diff --git a/diff.c b/diff.c
index 397e38b41c..1a3b19f71f 100644
--- a/diff.c
+++ b/diff.c
@@ -2927,6 +2927,28 @@ void print_stat_summary(FILE *fp, int files,
print_stat_summary_inserts_deletes(&o, files, insertions, deletions);
}
+/*
+ * Like utf8_width(), but guaranteed safe for use in loops that subtract
+ * per-character widths:
+ *
+ * - utf8_width() sets *start to NULL on invalid UTF-8 and returns 0;
+ * we restore the pointer and advance by one byte, returning width 1
+ * (matching the strlen()-based fallback in utf8_strwidth()).
+ *
+ * - utf8_width() returns -1 for control characters; we return 0
+ * (matching utf8_strnwidth() which skips them).
+ */
+static int utf8_ish_width(const char **start)
+{
+ const char *old = *start;
+ int w = utf8_width(start, NULL);
+ if (!*start) {
+ *start = old + 1;
+ return 1;
+ }
+ return (w < 0) ? 0 : w;
+}
+
static void show_stats(struct diffstat_t *data, struct diff_options *options)
{
int i, len, add, del, adds = 0, dels = 0;
@@ -3093,8 +3115,8 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
if (len < 0)
len = 0;
- while (name_len > len)
- name_len -= utf8_width((const char**)&name, NULL);
+ while (name_len > len && *name)
+ name_len -= utf8_ish_width((const char**)&name);
slash = strchr(name, '/');
if (slash)
diff --git a/t/t4052-stat-output.sh b/t/t4052-stat-output.sh
index 7c749062e2..84c53c1a51 100755
--- a/t/t4052-stat-output.sh
+++ b/t/t4052-stat-output.sh
@@ -445,4 +445,29 @@ test_expect_success 'diffstat where line_prefix contains ANSI escape codes is co
test_grep "<RED>|<RESET> ${FILENAME_TRIMMED} | 0" out
'
+test_expect_success 'diffstat truncation with invalid UTF-8 does not crash' '
+ empty_blob=$(git hash-object -w --stdin </dev/null) &&
+ printf "100644 blob $empty_blob\taaa-\300-aaa\n" |
+ git mktree >tree_file &&
+ tree=$(cat tree_file) &&
+ empty_tree=$(git mktree </dev/null) &&
+ c1=$(git commit-tree -m before $empty_tree) &&
+ c2=$(git commit-tree -m after -p $c1 $tree) &&
+ git -c core.quotepath=false diff --stat --stat-name-width=5 $c1..$c2 >output &&
+ test_grep "| 0" output
+'
+
+test_expect_success FUNNYNAMES 'diffstat truncation with control chars does not crash' '
+ FNAME=$(printf "aaa-\x01-aaa") &&
+ git commit --allow-empty -m setup &&
+ >$FNAME &&
+ git add -- $FNAME &&
+ git commit -m "add file with control char name" &&
+ git -c core.quotepath=false diff --stat --stat-name-width=5 HEAD~1..HEAD >output &&
+ test_grep "| 0" output &&
+ rm -- $FNAME &&
+ git rm -- $FNAME &&
+ git commit -m "remove test file"
+'
+
test_done
base-commit: 9f223ef1c026d91c7ac68cc0211bde255dda6199
--
gitgitgadget
next prev parent reply other threads:[~2026-04-17 22:45 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-17 16:26 [PATCH] diff: fix out-of-bounds reads and NULL deref in diffstat UTF-8 truncation Elijah Newren via GitGitGadget
2026-04-17 19:21 ` Junio C Hamano
2026-04-17 22:00 ` Elijah Newren
2026-04-17 22:21 ` Junio C Hamano
2026-04-17 22:45 ` Elijah Newren via GitGitGadget [this message]
2026-04-19 23:52 ` [PATCH v2] " Lorenzo Pegorari
2026-04-20 14:51 ` Elijah Newren
2026-04-20 15:42 ` [PATCH v3] " Elijah Newren via GitGitGadget
2026-04-20 16:41 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=pull.2093.v2.git.1776465910538.gitgitgadget@gmail.com \
--to=gitgitgadget@gmail.com \
--cc=git@vger.kernel.org \
--cc=lorenzo.pegorari2002@gmail.com \
--cc=newren@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox