* [PATCH] ignores: handle non UTF-8 exclude files
From: Matthieu Beauchamp-Boulay via GitGitGadget @ 2026-01-03 22:16 UTC (permalink / raw)
  To: git
  Cc: Matheus Tavares, Johannes Schindelin, Matthieu Beauchamp-Boulay,
	Matthieu Beauchamp-Boulay

From: Matthieu Beauchamp-Boulay <matthieu.beauchamp.boulay@gmail.com>

When reading exclude files, Git assumes they are encoded in UTF-8 and
fails to apply their patterns if they are not. The failure is silent: no
warning or error is shown to the user. It can take a while to diagnose,
because many users will not think of checking the file's encoding and may
believe their patterns are wrong instead. Users may also accidentally
commit undesired files.

On Windows, this happens when a user creates the file with Windows
PowerShell, which produces a UTF-16LE file with a BOM. The issue was
discussed at https://github.com/git-for-windows/git/issues/3329. An example
of a user confused by an exclude file that did not work is at
https://github.com/git-for-windows/git/issues/3227.

A minimal fix should at least warn the user if git cannot properly decode
the exclude file. Ideally, git would handle any given Unicode file.

First, check if a BOM is present. If it is, decode the file to UTF-8.
If no BOM is detected, then try to parse the file as UTF-8. If that fails,
attempt to decode the file using the working tree encoding of the file,
if any. If that fails, print a warning to tell the user that the exclude
file could not be decoded and skip the file.
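
The decision order above can be sketched in standalone C. This is only an
illustration under stated assumptions, not the code in this patch: the names
bom_encoding(), valid_utf8_no_nul() and pick_source_encoding() are
hypothetical, and the working tree encoding is passed in as a plain string
instead of being looked up through the attributes machinery.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: name the encoding implied by a leading BOM, or NULL. */
const char *bom_encoding(const unsigned char *p, size_t n)
{
	static const unsigned char utf32be[] = {0x00, 0x00, 0xFE, 0xFF};
	static const unsigned char utf32le[] = {0xFF, 0xFE, 0x00, 0x00};
	static const unsigned char utf16be[] = {0xFE, 0xFF};
	static const unsigned char utf16le[] = {0xFF, 0xFE};

	/* UTF-32LE is tested before UTF-16LE: FF FE 00 00 starts with FF FE. */
	if (n >= 4 && !memcmp(p, utf32be, 4))
		return "UTF-32BE";
	if (n >= 4 && !memcmp(p, utf32le, 4))
		return "UTF-32LE";
	if (n >= 2 && !memcmp(p, utf16be, 2))
		return "UTF-16BE";
	if (n >= 2 && !memcmp(p, utf16le, 2))
		return "UTF-16LE";
	return NULL;
}

/* Hypothetical helper: 1 if the buffer is well-formed UTF-8 with no NUL byte. */
int valid_utf8_no_nul(const unsigned char *p, size_t n)
{
	size_t i = 0;

	while (i < n) {
		unsigned char c = p[i];
		unsigned long cp, min;
		size_t extra, k;

		if (c == 0x00)
			return 0;
		if (c < 0x80) {
			i++;
			continue;
		}
		if ((c & 0xE0) == 0xC0) {
			extra = 1; cp = c & 0x1F; min = 0x80;
		} else if ((c & 0xF0) == 0xE0) {
			extra = 2; cp = c & 0x0F; min = 0x800;
		} else if ((c & 0xF8) == 0xF0) {
			extra = 3; cp = c & 0x07; min = 0x10000;
		} else {
			return 0; /* stray continuation byte or invalid lead byte */
		}
		if (n - i <= extra)
			return 0; /* truncated sequence */
		for (k = 1; k <= extra; k++) {
			if ((p[i + k] & 0xC0) != 0x80)
				return 0;
			cp = (cp << 6) | (p[i + k] & 0x3F);
		}
		/* Reject overlong forms, surrogates and out-of-range values. */
		if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
			return 0;
		i += extra + 1;
	}
	return 1;
}

/*
 * Returns the encoding to decode from, or NULL when the caller should
 * warn and skip the file. wt_encoding may be NULL (none configured).
 */
const char *pick_source_encoding(const unsigned char *buf, size_t len,
				 const char *wt_encoding)
{
	const char *enc = bom_encoding(buf, len);

	if (enc)
		return enc;
	if (valid_utf8_no_nul(buf, len))
		return "UTF-8";
	return wt_encoding;
}
```

A file that starts with a UTF-8 BOM needs no special case in this sketch,
because EF BB BF is itself valid UTF-8 and falls through to the second step.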

This raises an issue: if the entire tree is encoded in, for example,
UTF-16BE (no BOM), then .gitattributes itself is encoded that way, so Git
cannot decode the exclude file even if its encoding is given there. I
believe this is still acceptable, since a warning will be emitted for the
file (it has no BOM, is not valid UTF-8, and no working tree encoding
could be found).

One case that isn't handled is when a wrong encoding is given in the
attributes and the exclude file has no BOM and is not UTF-8. Using
iconv to convert a UTF-16BE file to UTF-8 while specifying UTF-16LE
yields gibberish without an error, so this case is a silent failure
where no patterns will match.
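
To illustrate why a decoder has nothing to report in that case, here is a
small standalone sketch. It does not use iconv; utf16_unit() and
utf16_errors() are hypothetical helpers that only check UTF-16 structure
(even length, correctly paired surrogates). ASCII text stored as UTF-16BE
and read as UTF-16LE turns each character into a different but perfectly
valid BMP code point, so a structural check finds no error.

```c
#include <stddef.h>

/* Read one 16-bit code unit at p in the requested byte order. */
unsigned utf16_unit(const unsigned char *p, int big_endian)
{
	return big_endian ? (((unsigned)p[0] << 8) | p[1])
			  : (((unsigned)p[1] << 8) | p[0]);
}

/*
 * Count structural errors (odd length, unpaired surrogates) when the
 * buffer is interpreted as UTF-16 in the given byte order.
 */
size_t utf16_errors(const unsigned char *buf, size_t len, int big_endian)
{
	size_t errors = len % 2;
	size_t i = 0;

	while (i + 1 < len) {
		unsigned u = utf16_unit(buf + i, big_endian);

		i += 2;
		if (u >= 0xD800 && u <= 0xDBFF) {
			/* High surrogate: must be followed by a low surrogate. */
			if (i + 1 < len) {
				unsigned lo = utf16_unit(buf + i, big_endian);

				if (lo >= 0xDC00 && lo <= 0xDFFF) {
					i += 2;
					continue;
				}
			}
			errors++;
		} else if (u >= 0xDC00 && u <= 0xDFFF) {
			errors++; /* lone low surrogate */
		}
	}
	return errors;
}
```

For the bytes of "excluded\n" in UTF-16BE, both byte orders pass this check;
read as little-endian, the first unit is 0x6500 instead of 0x0065, which is
why the resulting patterns never match.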

Signed-off-by: Matthieu Beauchamp-Boulay <matthieu.beauchamp.boulay@gmail.com>
---
    ignores: handle non UTF-8 exclude files

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-2157%2FMatthieu-Beauchamp%2Funicode-support-gitignore-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-2157/Matthieu-Beauchamp/unicode-support-gitignore-v1
Pull-Request: https://github.com/git/git/pull/2157

 dir.c              |  29 +++++++++++++
 t/lib-encoding.sh  |   8 ++++
 t/t0008-ignores.sh | 102 +++++++++++++++++++++++++++++++++++++++++++++
 utf8.c             |  29 +++++++++++++
 utf8.h             |  12 ++++++
 5 files changed, 180 insertions(+)

diff --git a/dir.c b/dir.c
index b00821f294..895d476253 100644
--- a/dir.c
+++ b/dir.c
@@ -1154,7 +1154,9 @@ static int add_patterns(const char *fname, const char *base, int baselen,
 	int r;
 	int fd;
 	size_t size = 0;
+	size_t reencoded_size = 0;
 	char *buf;
+	char *reencoded = NULL;
 
 	if (flags & PATTERN_NOFOLLOW)
 		fd = open_nofollow(fname, O_RDONLY);
@@ -1190,7 +1192,34 @@ static int add_patterns(const char *fname, const char *base, int baselen,
 			close(fd);
 			return -1;
 		}
+
+		if (!try_reencode_to_utf8(buf, size, &reencoded, &reencoded_size) && !is_valid_utf8(buf, size)) {
+			struct conv_attrs ca;
+			convert_attrs(istate, &ca, fname);
+
+			if (ca.working_tree_encoding) {
+				reencoded = reencode_string_len(buf, size, "UTF-8", ca.working_tree_encoding, &reencoded_size);
+				if (!reencoded) {
+					warning(_("Failed to decode exclude file %s from encoding %s"), pl->src, ca.working_tree_encoding);
+					free(buf);
+					return -1;
+				}
+			} else {
+				warning(_("Ignoring exclude file with unknown encoding: %s"), pl->src);
+				free(buf);
+				return -1;
+			}
+		}
+
+		if (reencoded) {
+			size = reencoded_size;
+			free(buf);
+			buf = xmallocz(size);
+			memcpy(buf, reencoded, size);
+			free(reencoded);
+		}
 		buf[size++] = '\n';
+
 		close(fd);
 		if (oid_stat) {
 			int pos;
diff --git a/t/lib-encoding.sh b/t/lib-encoding.sh
index 2dabc8c73e..1b1cc357ba 100644
--- a/t/lib-encoding.sh
+++ b/t/lib-encoding.sh
@@ -23,3 +23,11 @@ write_utf32 () {
 	fi &&
 	iconv -f UTF-8 -t UTF-32
 }
+
+write_encoded () {
+  iconv -f UTF-8 -t "$1"
+}
+
+write_bom () {
+  echo "$@" | perl -pe 's/\s+//g; $_=pack("H*", $_)'
+}
\ No newline at end of file
diff --git a/t/t0008-ignores.sh b/t/t0008-ignores.sh
index db8bde280e..d5a3002ffb 100755
--- a/t/t0008-ignores.sh
+++ b/t/t0008-ignores.sh
@@ -4,6 +4,7 @@ test_description=check-ignore
 
 TEST_CREATE_REPO_NO_TEMPLATE=1
 . ./test-lib.sh
+. "$TEST_DIRECTORY/lib-encoding.sh"
 
 init_vars () {
 	global_excludes="global-excludes"
@@ -963,4 +964,105 @@ test_expect_success EXPENSIVE 'large exclude file ignored in tree' '
 	test_cmp expect err
 '
 
+############################################################################
+#
+# test handling of Unicode for .gitignore when a BOM is present or a working tree encoding is set for the file
+
+supports_encoding () {
+	encoding="$1" &&
+	d="support-bom-$encoding" &&
+	shift &&
+	mkdir -p "$d" &&
+	touch "$d/file" "$d/excluded" &&
+	write_bom "$@" > "$d/.gitignore" &&
+	echo excluded | write_encoded "$encoding" >> "$d/.gitignore" &&
+	git check-ignore "$d/excluded" > "$d/actual" &&
+	echo "$d/excluded" > expect &&
+	test_cmp expect "$d/actual"
+}
+
+test_expect_success ICONV 'Can read gitignore in UTF-8 with BOM' '
+  supports_encoding "UTF-8" EF BB BF
+'
+
+test_expect_success ICONV 'Can read gitignore in UTF-16LE when given a BOM' '
+  supports_encoding "UTF-16LE" FF FE
+'
+
+test_expect_success ICONV 'Can read gitignore in UTF-16BE when given a BOM' '
+  supports_encoding "UTF-16BE" FE FF
+'
+
+test_expect_success ICONV 'Can read gitignore in UTF-32LE when given a BOM' '
+  supports_encoding "UTF-32LE" FF FE 00 00
+'
+
+test_expect_success ICONV 'Can read gitignore in UTF-32BE when given a BOM' '
+  supports_encoding "UTF-32BE" 00 00 FE FF
+'
+
+supports_reading_ignore_in_working_tree_encoding () {
+	encoding="$1" &&
+	d="support-wt-$encoding" &&
+	mkdir -p "$d" &&
+	touch "$d/file" "$d/excluded" &&
+	echo ".gitignore		text working-tree-encoding=$encoding" > "$d/.gitattributes" &&
+	echo excluded | write_encoded "$encoding" > "$d/.gitignore" &&
+	git check-ignore "$d/excluded" > "$d/actual" &&
+	echo "$d/excluded" > expect &&
+	test_cmp expect "$d/actual"
+}
+
+test_expect_success ICONV 'Can read gitignore in UTF-8 when it is set as working tree encoding' '
+  supports_reading_ignore_in_working_tree_encoding "UTF-8"
+'
+
+test_expect_success ICONV 'Can read gitignore in UTF-16LE when it is set as working tree encoding' '
+  supports_reading_ignore_in_working_tree_encoding "UTF-16LE"
+'
+
+test_expect_success ICONV 'Can read gitignore in UTF-16BE when it is set as working tree encoding' '
+  supports_reading_ignore_in_working_tree_encoding "UTF-16BE"
+'
+
+test_expect_success ICONV 'Can read gitignore in UTF-32LE when it is set as working tree encoding' '
+  supports_reading_ignore_in_working_tree_encoding "UTF-32LE"
+'
+
+test_expect_success ICONV 'Can read gitignore in UTF-32BE when it is set as working tree encoding' '
+  supports_reading_ignore_in_working_tree_encoding "UTF-32BE"
+'
+
+test_expect_success ICONV 'Issues a warning if encoding cannot be deduced' '
+  d="warn-unknown-encoding" &&
+	mkdir -p "$d" &&
+	touch "$d/file" "$d/excluded" &&
+	echo excluded | write_encoded "UTF-16BE" > "$d/.gitignore" &&
+	test_must_fail git check-ignore "$d/excluded" > actual 2>&1 &&
+	echo "warning: Ignoring exclude file with unknown encoding: $d/.gitignore" > expect &&
+	test_cmp expect actual
+'
+
+test_expect_success ICONV 'Warns if the exclude file cannot be decoded due to encoded attributes' '
+  d="warn-cant-decode-attributes" &&
+	mkdir -p "$d" &&
+	touch "$d/file" "$d/excluded" &&
+	echo excluded | write_encoded "UTF-16BE" > "$d/.gitignore" &&
+  echo ".gitignore		text working-tree-encoding=UTF-16BE" | write_encoded "UTF-16BE" > "$d/.gitattributes" &&
+	test_must_fail git check-ignore "$d/excluded" > actual 2>&1 &&
+	echo "warning: Ignoring exclude file with unknown encoding: $d/.gitignore" > expect &&
+	test_cmp expect actual
+'
+
+test_expect_failure ICONV 'Issues a warning if the wrong encoding is given' '
+  d="warn-wrong-encoding" &&
+	mkdir -p "$d" &&
+	touch "$d/file" "$d/excluded" &&
+	echo excluded | write_encoded "UTF-16BE" > "$d/.gitignore" &&
+  echo ".gitignore		text working-tree-encoding=UTF-16LE" > "$d/.gitattributes" &&
+	test_must_fail git check-ignore "$d/excluded" > actual 2>&1 &&
+	echo "warning: Ignoring exclude file with unknown encoding: $d/.gitignore" > expect &&
+	test_cmp expect actual
+'
+
 test_done
diff --git a/utf8.c b/utf8.c
index 35a0251939..904cf20f97 100644
--- a/utf8.c
+++ b/utf8.c
@@ -252,6 +252,17 @@ int is_utf8(const char *text)
 	return 1;
 }
 
+int is_valid_utf8(const char *text, size_t len)
+{
+	while (text && len > 0) {
+		ucs_char_t ch = pick_one_utf8_char(&text, &len);
+		if (text && ch == 0 && len > 0)
+			return 0;
+	}
+
+	return text != NULL;
+}
+
 static void strbuf_add_indented_text(struct strbuf *buf, const char *text,
 				     int indent, int indent2)
 {
@@ -643,6 +654,24 @@ int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
 	);
 }
 
+int try_reencode_to_utf8(const char *text, size_t len, char **out_text, size_t *out_len)
+{
+	const char *in_encoding;
+	if (has_bom_prefix(text, len, utf32_be_bom, sizeof(utf32_be_bom)))
+		in_encoding = "UTF-32BE";
+	else if (has_bom_prefix(text, len, utf32_le_bom, sizeof(utf32_le_bom)))
+		in_encoding = "UTF-32LE";
+	else if (has_bom_prefix(text, len, utf16_be_bom, sizeof(utf16_be_bom)))
+		in_encoding = "UTF-16BE";
+	else if (has_bom_prefix(text, len, utf16_le_bom, sizeof(utf16_le_bom)))
+		in_encoding = "UTF-16LE";
+	else
+		return 0;
+
+	*out_text = reencode_string_len(text, len, "UTF-8", in_encoding, out_len);
+	return *out_text != NULL;
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index cf8ecb0f21..e52e4f7c06 100644
--- a/utf8.h
+++ b/utf8.h
@@ -10,6 +10,12 @@ int utf8_width(const char **start, size_t *remainder_p);
 int utf8_strnwidth(const char *string, size_t len, int skip_ansi);
 int utf8_strwidth(const char *string);
 int is_utf8(const char *text);
+
+/*
+ * Checks that the string is valid UTF-8 that does not contain the null byte
+ * except at the end of the string
+ */
+int is_valid_utf8(const char *text, size_t len);
 int is_encoding_utf8(const char *name);
 int same_encoding(const char *, const char *);
 __attribute__((format (printf, 2, 3)))
@@ -48,6 +54,12 @@ static inline char *reencode_string(const char *in,
 				   NULL);
 }
 
+/*
+ * Returns true if a Unicode BOM is detected and the string can be reencoded to UTF-8.
+ * In that case the string is reencoded to UTF-8 in *out_text.
+ */
+int try_reencode_to_utf8(const char *text, size_t len, char **out_text, size_t *out_len);
+
 int mbs_chrlen(const char **text, size_t *remainder_p, const char *encoding);
 
 /*

base-commit: c4a0c8845e2426375ad257b6c221a3a7d92ecfda
-- 
gitgitgadget


Thread overview: 13+ messages
2026-01-03 22:16 [PATCH] ignores: handle non UTF-8 exclude files Matthieu Beauchamp-Boulay via GitGitGadget
2026-01-04  2:54 ` Junio C Hamano
2026-01-06 19:52   ` Matthieu Beauchamp
2026-01-04 17:35 ` Torsten Bögershausen
2026-01-06 20:32   ` Matthieu Beauchamp
2026-01-07 14:36     ` Phillip Wood
2026-01-04 19:40 ` brian m. carlson
2026-01-06 20:45   ` Matthieu Beauchamp
2026-01-06 23:22     ` brian m. carlson
2026-01-07  1:35       ` Collin Funk
2026-01-07 14:28         ` Phillip Wood
2026-01-07 23:38         ` brian m. carlson
2026-01-08  1:13           ` Collin Funk
