[PATCH v2] diff: fix out-of-bounds reads and NULL deref in diffstat UTF-8 truncation

Git development
 help / color / mirror / Atom feed

From: "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: LorenzoPegorari <lorenzo.pegorari2002@gmail.com>,
	Elijah Newren <newren@gmail.com>,
	Elijah Newren <newren@gmail.com>,
	Elijah Newren <newren@gmail.com>
Subject: [PATCH v2] diff: fix out-of-bounds reads and NULL deref in diffstat UTF-8 truncation
Date: Fri, 17 Apr 2026 22:45:10 +0000	[thread overview]
Message-ID: <pull.2093.v2.git.1776465910538.gitgitgadget@gmail.com> (raw)
In-Reply-To: <pull.2093.git.1776443163041.gitgitgadget@gmail.com>

From: Elijah Newren <newren@gmail.com>

f85b49f3d4a (diff: improve scaling of filenames in diffstat to handle
UTF-8 chars, 2026-01-16) introduced a loop in show_stats() that calls
utf8_width() repeatedly to skip leading characters until the displayed
width fits.  However, utf8_width() can return problematic values:

  - For invalid UTF-8 sequences, pick_one_utf8_char() sets the name
    pointer to NULL and utf8_width() returns 0.  Since name_len does
    not change, the loop iterates once more and pick_one_utf8_char()
    dereferences the NULL pointer, crashing.

  - For control characters, utf8_width() returns -1, so name_len
    grows when it is expected to shrink.  This can cause the loop to
    consume more characters than the string contains, reading past
    the trailing NUL.

By default, fill_print_name() will C-quotes filenames which escapes
control characters and invalid bytes to printable text.  That avoids
this bug from being triggered; however, with core.quotePath=false,
raw bytes can reach this code.

Add tests exercising both failure modes with core.quotePath=false and
a narrow --stat-name-width to force truncation: one with a bare 0xC0
byte (invalid UTF-8 lead byte, triggers NULL deref) and one with a
0x01 byte (control character, causes the loop to read past the end
of the string).

Fix both issues by introducing utf8_ish_width(), a thin wrapper
around utf8_width() that guarantees the pointer always advances and
the returned width is never negative:

  - On invalid UTF-8 it restores the pointer, advances by one byte,
    and returns width 1 (matching the strlen()-based fallback used
    by utf8_strwidth()).
  - On a control character it returns 0 (matching utf8_strnwidth()
    which skips them).

Also add a "&& *name" guard to the while-loop condition so it
terminates at end-of-string even when utf8_strwidth()'s strlen()
fallback causes name_len to exceed the sum of per-character widths.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
    diff: fix out-of-bounds reads and NULL deref in diffstat UTF-8
    truncation
    
    Changes since v1:
    
     * Simplified the loop to almost what we had before via a wrapper
       function that always succeeds in advancing the string and never
       returns a negative width. (Which, as a consequence, treats invalid
       UTF-8 and control characters the roughly the same, unlike v1.)

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-2093%2Fnewren%2Ffix%2Fdiffstat-utf8-loop-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-2093/newren/fix/diffstat-utf8-loop-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/2093

Range-diff vs v1:

 1:  fcd44d6cf8 ! 1:  4a72647ce2 diff: fix out-of-bounds reads and NULL deref in diffstat UTF-8 truncation
     @@ Commit message
          diff: fix out-of-bounds reads and NULL deref in diffstat UTF-8 truncation
      
          f85b49f3d4a (diff: improve scaling of filenames in diffstat to handle
     -    UTF-8 chars, 2024-10-27) introduced a loop in show_stats() that calls
     +    UTF-8 chars, 2026-01-16) introduced a loop in show_stats() that calls
          utf8_width() repeatedly to skip leading characters until the displayed
          width fits.  However, utf8_width() can return problematic values:
      
     @@ Commit message
          0x01 byte (control character, causes the loop to read past the end
          of the string).
      
     -    Fix the bug by:
     -      - Adding a *name check to terminate the loop at end-of-string
     -      - Detecting the NULL pointer from invalid UTF-8 and falling back to
     -        showing the full untruncated name
     -      - Breaking on negative width (control characters)
     +    Fix both issues by introducing utf8_ish_width(), a thin wrapper
     +    around utf8_width() that guarantees the pointer always advances and
     +    the returned width is never negative:
     +
     +      - On invalid UTF-8 it restores the pointer, advances by one byte,
     +        and returns width 1 (matching the strlen()-based fallback used
     +        by utf8_strwidth()).
     +      - On a control character it returns 0 (matching utf8_strnwidth()
     +        which skips them).
     +
     +    Also add a "&& *name" guard to the while-loop condition so it
     +    terminates at end-of-string even when utf8_strwidth()'s strlen()
     +    fallback causes name_len to exceed the sum of per-character widths.
      
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## diff.c ##
     +@@ diff.c: void print_stat_summary(FILE *fp, int files,
     + 	print_stat_summary_inserts_deletes(&o, files, insertions, deletions);
     + }
     + 
     ++/*
     ++ * Like utf8_width(), but guaranteed safe for use in loops that subtract
     ++ * per-character widths:
     ++ *
     ++ *   - utf8_width() sets *start to NULL on invalid UTF-8 and returns 0;
     ++ *     we restore the pointer and advance by one byte, returning width 1
     ++ *     (matching the strlen()-based fallback in utf8_strwidth()).
     ++ *
     ++ *   - utf8_width() returns -1 for control characters; we return 0
     ++ *     (matching utf8_strnwidth() which skips them).
     ++ */
     ++static int utf8_ish_width(const char **start)
     ++{
     ++	const char *old = *start;
     ++	int w = utf8_width(start, NULL);
     ++	if (!*start) {
     ++		*start = old + 1;
     ++		return 1;
     ++	}
     ++	return (w < 0) ? 0 : w;
     ++}
     ++
     + static void show_stats(struct diffstat_t *data, struct diff_options *options)
     + {
     + 	int i, len, add, del, adds = 0, dels = 0;
      @@ diff.c: static void show_stats(struct diffstat_t *data, struct diff_options *options)
       			if (len < 0)
       				len = 0;
       
      -			while (name_len > len)
      -				name_len -= utf8_width((const char**)&name, NULL);
     -+			while (name_len > len && *name) {
     -+				int w = utf8_width((const char **)&name, NULL);
     -+				if (!name) { /* Invalid UTF-8 */
     -+					name = file->print_name;
     -+					name_len = utf8_strwidth(name);
     -+					break;
     -+				}
     -+				if (w < 0)  /* control character */
     -+					break;
     -+				name_len -= w;
     -+			}
     ++			while (name_len > len && *name)
     ++				name_len -= utf8_ish_width((const char**)&name);
       
       			slash = strchr(name, '/');
       			if (slash)


 diff.c                 | 26 ++++++++++++++++++++++++--
 t/t4052-stat-output.sh | 25 +++++++++++++++++++++++++
 2 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/diff.c b/diff.c
index 397e38b41c..1a3b19f71f 100644
--- a/diff.c
+++ b/diff.c
@@ -2927,6 +2927,28 @@ void print_stat_summary(FILE *fp, int files,
 	print_stat_summary_inserts_deletes(&o, files, insertions, deletions);
 }
 
+/*
+ * Like utf8_width(), but guaranteed safe for use in loops that subtract
+ * per-character widths:
+ *
+ *   - utf8_width() sets *start to NULL on invalid UTF-8 and returns 0;
+ *     we restore the pointer and advance by one byte, returning width 1
+ *     (matching the strlen()-based fallback in utf8_strwidth()).
+ *
+ *   - utf8_width() returns -1 for control characters; we return 0
+ *     (matching utf8_strnwidth() which skips them).
+ */
+static int utf8_ish_width(const char **start)
+{
+	const char *old = *start;
+	int w = utf8_width(start, NULL);
+	if (!*start) {
+		*start = old + 1;
+		return 1;
+	}
+	return (w < 0) ? 0 : w;
+}
+
 static void show_stats(struct diffstat_t *data, struct diff_options *options)
 {
 	int i, len, add, del, adds = 0, dels = 0;
@@ -3093,8 +3115,8 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 			if (len < 0)
 				len = 0;
 
-			while (name_len > len)
-				name_len -= utf8_width((const char**)&name, NULL);
+			while (name_len > len && *name)
+				name_len -= utf8_ish_width((const char**)&name);
 
 			slash = strchr(name, '/');
 			if (slash)
diff --git a/t/t4052-stat-output.sh b/t/t4052-stat-output.sh
index 7c749062e2..84c53c1a51 100755
--- a/t/t4052-stat-output.sh
+++ b/t/t4052-stat-output.sh
@@ -445,4 +445,29 @@ test_expect_success 'diffstat where line_prefix contains ANSI escape codes is co
 	test_grep "<RED>|<RESET>  ${FILENAME_TRIMMED} | 0" out
 '
 
+test_expect_success 'diffstat truncation with invalid UTF-8 does not crash' '
+	empty_blob=$(git hash-object -w --stdin </dev/null) &&
+	printf "100644 blob $empty_blob\taaa-\300-aaa\n" |
+	git mktree >tree_file &&
+	tree=$(cat tree_file) &&
+	empty_tree=$(git mktree </dev/null) &&
+	c1=$(git commit-tree -m before $empty_tree) &&
+	c2=$(git commit-tree -m after -p $c1 $tree) &&
+	git -c core.quotepath=false diff --stat --stat-name-width=5 $c1..$c2 >output &&
+	test_grep "| 0" output
+'
+
+test_expect_success FUNNYNAMES 'diffstat truncation with control chars does not crash' '
+	FNAME=$(printf "aaa-\x01-aaa") &&
+	git commit --allow-empty -m setup &&
+	>$FNAME &&
+	git add -- $FNAME &&
+	git commit -m "add file with control char name" &&
+	git -c core.quotepath=false diff --stat --stat-name-width=5 HEAD~1..HEAD >output &&
+	test_grep "| 0" output &&
+	rm -- $FNAME &&
+	git rm -- $FNAME &&
+	git commit -m "remove test file"
+'
+
 test_done

base-commit: 9f223ef1c026d91c7ac68cc0211bde255dda6199
-- 
gitgitgadget

next prev parent reply	other threads:[~2026-04-17 22:45 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-17 16:26 [PATCH] diff: fix out-of-bounds reads and NULL deref in diffstat UTF-8 truncation Elijah Newren via GitGitGadget
2026-04-17 19:21 ` Junio C Hamano
2026-04-17 22:00   ` Elijah Newren
2026-04-17 22:21     ` Junio C Hamano
2026-04-17 22:45 ` Elijah Newren via GitGitGadget [this message]
2026-04-19 23:52   ` [PATCH v2] " Lorenzo Pegorari
2026-04-20 14:51     ` Elijah Newren
2026-04-20 15:42   ` [PATCH v3] " Elijah Newren via GitGitGadget
2026-04-20 16:41     ` Junio C Hamano

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:397e38b41 dfblob:1a3b19f71 dfblob:7c749062e dfblob:84c53c1a5 )
 OR (
bs:"[PATCH v2] diff: fix out-of-bounds reads and NULL deref in diffstat UTF-8 truncation" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=pull.2093.v2.git.1776465910538.gitgitgadget@gmail.com \
    --to=gitgitgadget@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=lorenzo.pegorari2002@gmail.com \
    --cc=newren@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox