[PATCH 0/9] asan bonanza

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/9] asan bonanza
@ 2025-11-12  7:55 Jeff King
  2025-11-12  7:56 ` [PATCH 1/9] compat/mmap: mark unused argument in git_munmap() Jeff King
                   ` (10 more replies)
  0 siblings, 11 replies; 62+ messages in thread
From: Jeff King @ 2025-11-12  7:55 UTC (permalink / raw)
  To: git; +Cc: correctmost, Taylor Blau

This series fixes a handful of issues that ASan finds in our test suite
if we tweak a few options to let it look deeper.

The cache-tree one was reported to the security list. It's a real bug,
but I don't think is an interesting vulnerability (it's a benign read
off the end of an mmap'd file that is local and not generally under
attacker control).

The bitmap bug is also a real bug in new code that I think is not well
exercised yet (+cc Taylor for that one).

The fsck changes are for false positives in ASan, but I think it is
reasonable for it to complain about this sketchy code. ;) I hope the
result is nicer to read and reason about, but whether it is worth the
churn may be debatable.

Along the way we can turn a few knobs that will potentially help us find
more problems down the road (but ordered so that "make SANITIZE=address"
passes at each step of the series).

  [1/9]: compat/mmap: mark unused argument in git_munmap()
  [2/9]: pack-bitmap: handle name-hash lookups in incremental bitmaps
  [3/9]: Makefile: turn on NO_MMAP when building with ASan
  [4/9]: cache-tree: avoid strtol() on non-string buffer
  [5/9]: fsck: assert newline presence in fsck_ident()
  [6/9]: fsck: avoid strcspn() in fsck_ident()
  [7/9]: fsck: remove redundant date timestamp check
  [8/9]: fsck: avoid parse_timestamp() on buffer that isn't NUL-terminated
  [9/9]: t: enable ASan's strict_string_checks option

 Makefile      |  1 +
 cache-tree.c  | 45 ++++++++++++++++++++++----------
 compat/mmap.c |  2 +-
 fsck.c        | 71 ++++++++++++++++++++++++++++++++++++---------------
 pack-bitmap.c | 27 +++++++++++++++++---
 t/test-lib.sh |  1 +
 6 files changed, 107 insertions(+), 40 deletions(-)

-Peff

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH 1/9] compat/mmap: mark unused argument in git_munmap()
  2025-11-12  7:55 [PATCH 0/9] asan bonanza Jeff King
@ 2025-11-12  7:56 ` Jeff King
  2025-11-12  8:01 ` [PATCH 2/9] pack-bitmap: handle name-hash lookups in incremental bitmaps Jeff King
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 62+ messages in thread
From: Jeff King @ 2025-11-12  7:56 UTC (permalink / raw)
  To: git; +Cc: correctmost, Taylor Blau

Our mmap compat code emulates mapping by using malloc/free. Our
git_munmap() must take a "length" parameter to match the interface of
munmap(), but we don't use it (it is up to the allocator to know how big
the block is in free()).

Let's mark it as UNUSED to avoid complaints from -Wunused-parameter.
Otherwise you cannot build with "make DEVELOPER=1 NO_MMAP=1".

Signed-off-by: Jeff King <peff@peff.net>
---
This made me wonder if nobody is using NO_MMAP at all. But it may just
be that platforms which need it are not using -Werror in the first
place.

 compat/mmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/compat/mmap.c b/compat/mmap.c
index 2fe1c7732e..1a118711f7 100644
--- a/compat/mmap.c
+++ b/compat/mmap.c
@@ -38,7 +38,7 @@ void *git_mmap(void *start, size_t length, int prot, int flags, int fd, off_t of
 	return start;
 }
 
-int git_munmap(void *start, size_t length)
+int git_munmap(void *start, size_t length UNUSED)
 {
 	free(start);
 	return 0;
-- 
2.52.0.rc1.260.g3e4993586f


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 2/9] pack-bitmap: handle name-hash lookups in incremental bitmaps
  2025-11-12  7:55 [PATCH 0/9] asan bonanza Jeff King
  2025-11-12  7:56 ` [PATCH 1/9] compat/mmap: mark unused argument in git_munmap() Jeff King
@ 2025-11-12  8:01 ` Jeff King
  2025-11-12 11:25   ` Patrick Steinhardt
  2025-11-13  2:55   ` Taylor Blau
  2025-11-12  8:02 ` [PATCH 3/9] Makefile: turn on NO_MMAP when building with ASan Jeff King
                   ` (8 subsequent siblings)
  10 siblings, 2 replies; 62+ messages in thread
From: Jeff King @ 2025-11-12  8:01 UTC (permalink / raw)
  To: git; +Cc: correctmost, Taylor Blau

If a bitmap has a name-hash cache, it is an array of 32-bit integers,
one per entry in the bitmap, which we've mmap'd from the .bitmap file.
We access it directly like this:

    if (bitmap_git->hashes)
            hash = get_be32(bitmap_git->hashes + index_pos);

That works for both regular pack bitmaps and for non-incremental midx
bitmaps. There is one bitmap_index with one "hashes" array, and
index_pos is within its bounds (we do the bounds-checking when we load
the bitmap).

But for an incremental midx bitmap, we have a linked list of
bitmap_index structs, and each one has only its own small slice of the
name-hash array. If index_pos refers to an object that is not in the
first bitmap_git of the chain, then we'll access memory outside of the
bounds of its "hashes" array, and often outside of the mmap.

Instead, we should walk through the list until we find the bitmap_index
which serves our index_pos, and use its hash (after adjusting index_pos
to make it relative to the slice we found). This is exactly what we do
elsewhere for incremental midx lookups (like the pack_pos_to_midx() call
a few lines above). But we can't use existing helpers like
midx_for_object() here, because we're walking through the chain of
bitmap_index structs (each of which refers to a midx), not the chain of
incremental multi_pack_index structs themselves.

The problem is triggered in the test suite, but we don't get a segfault
because the out-of-bounds index is too small. The OS typically rounds
our mmap up to the nearest page size, so we just end up accessing some
extra zero'd memory. Nor do we catch it with ASan, since it doesn't seem
to instrument mmaps at all. But if we build with NO_MMAP, then our maps
are replaced with heap allocations, which ASan does check. And so:

  make NO_MMAP=1 SANITIZE=address
  cd t
  ./t5334-incremental-multi-pack-index.sh

does show the problem (and this patch makes it go away).

Signed-off-by: Jeff King <peff@peff.net>
---
As always with the midx and bitmap code, I am left unsure of which
ordering it is correct to use (pseudo-pack order, or lexical oid order,
or how each splits across incremental files). I _think_ this is right
because it's matching the ordering that is already used for a single
midx. But clearly this area is under-tested, since even when we did not
go off the end of the array we were probably passing back junk
name-hashes (either from the .bitmap file's trailing checksum, or
zero-padding at the end of the mapped page).

So it might be worth adding more tests here, but I know this incremental
bitmap code is a big work in progress. So I contented myself with the
reproduction above, and anything else can go onto the incremental todo
pile. :)

 pack-bitmap.c | 27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 291e1a9cf4..710b86a451 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -213,6 +213,26 @@ static uint32_t bitmap_num_objects(struct bitmap_index *index)
 	return index->pack->num_objects;
 }
 
+static uint32_t bitmap_name_hash(struct bitmap_index *index, uint32_t pos)
+{
+	if (bitmap_is_midx(index)) {
+		while (index && pos < index->midx->num_objects_in_base)
+			index = index->base;
+
+		if (!index)
+			BUG("NULL base bitmap for object position: %"PRIu32, pos);
+
+		pos -= index->midx->num_objects_in_base;
+		if (pos >= index->midx->num_objects)
+			BUG("out-of-bounds midx bitmap object at %"PRIu32, pos);
+	}
+
+	if (!index->hashes)
+		return 0;
+
+	return get_be32(index->hashes + pos);
+}
+
 static struct repository *bitmap_repo(struct bitmap_index *bitmap_git)
 {
 	if (bitmap_is_midx(bitmap_git))
@@ -1724,8 +1744,7 @@ static void show_objects_for_type(
 				pack = bitmap_git->pack;
 			}
 
-			if (bitmap_git->hashes)
-				hash = get_be32(bitmap_git->hashes + index_pos);
+			hash = bitmap_name_hash(bitmap_git, index_pos);
 
 			show_reach(&oid, object_type, 0, hash, pack, ofs, payload);
 		}
@@ -3124,8 +3143,8 @@ uint32_t *create_bitmap_mapping(struct bitmap_index *bitmap_git,
 
 		if (oe) {
 			reposition[i] = oe_in_pack_pos(mapping, oe) + 1;
-			if (bitmap_git->hashes && !oe->hash)
-				oe->hash = get_be32(bitmap_git->hashes + index_pos);
+			if (!oe->hash)
+				oe->hash = bitmap_name_hash(bitmap_git, index_pos);
 		}
 	}
 
-- 
2.52.0.rc1.260.g3e4993586f


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 3/9] Makefile: turn on NO_MMAP when building with ASan
  2025-11-12  7:55 [PATCH 0/9] asan bonanza Jeff King
  2025-11-12  7:56 ` [PATCH 1/9] compat/mmap: mark unused argument in git_munmap() Jeff King
  2025-11-12  8:01 ` [PATCH 2/9] pack-bitmap: handle name-hash lookups in incremental bitmaps Jeff King
@ 2025-11-12  8:02 ` Jeff King
  2025-11-12  8:17   ` Collin Funk
  2025-11-12 11:26   ` Patrick Steinhardt
  2025-11-12  8:05 ` [PATCH 4/9] cache-tree: avoid strtol() on non-string buffer Jeff King
                   ` (7 subsequent siblings)
  10 siblings, 2 replies; 62+ messages in thread
From: Jeff King @ 2025-11-12  8:02 UTC (permalink / raw)
  To: git; +Cc: correctmost, Taylor Blau

Git often uses mmap() to access on-disk files. This leaves a blind spot
in our SANITIZE=address builds, since ASan does not seem to handle mmap
at all. Nor does the OS notice most out-of-bounds access, since it tends
to round up to the nearest page size (so depending on how big the map
is, you might have to overrun it by up to 4095 bytes to trigger a
segfault).

The previous commit demonstrates a memory bug that we missed. We could
have made a new test where the out-of-bounds access was much larger, or
where the mapped file ended closer to a page boundary. But the point of
running the test suite with sanitizers is to catch these problems
without having to construct specific tests.

Let's enable NO_MMAP for our ASan builds by default, which should give
us better coverage. This does increase the memory usage of Git, since
we're copying from the filesystem into heap. But the repositories in the
test suite tend to be small, so the overhead isn't really noticeable
(and ASan already has quite a performance penalty).

There are a few other known bugs that this patch will help flush out.
However, they aren't directly triggered in the test suite (yet). So
it's safe to turn this on now without breaking the test suite, which
will help us add new tests to demonstrate those other bugs as we fix
them.

Signed-off-by: Jeff King <peff@peff.net>
---
 Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Makefile b/Makefile
index 7e0f77e298..0f44268405 100644
--- a/Makefile
+++ b/Makefile
@@ -1587,6 +1587,7 @@ SANITIZE_LEAK = YesCompiledWithIt
 endif
 ifneq ($(filter address,$(SANITIZERS)),)
 NO_REGEX = NeededForASAN
+NO_MMAP = NeededForASAN
 SANITIZE_ADDRESS = YesCompiledWithIt
 endif
 endif
-- 
2.52.0.rc1.260.g3e4993586f


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 4/9] cache-tree: avoid strtol() on non-string buffer
  2025-11-12  7:55 [PATCH 0/9] asan bonanza Jeff King
                   ` (2 preceding siblings ...)
  2025-11-12  8:02 ` [PATCH 3/9] Makefile: turn on NO_MMAP when building with ASan Jeff King
@ 2025-11-12  8:05 ` Jeff King
  2025-11-12 11:26   ` Patrick Steinhardt
  2025-11-12  8:06 ` [PATCH 5/9] fsck: assert newline presence in fsck_ident() Jeff King
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 62+ messages in thread
From: Jeff King @ 2025-11-12  8:05 UTC (permalink / raw)
  To: git; +Cc: correctmost, Taylor Blau

A cache-tree extension entry in the index looks like this:

  <name> NUL <entry_nr> SPACE <subtree_nr> NEWLINE <binary_oid>

where the "_nr" items are human-readable base-10 ASCII. We parse them
with strtol(), even though we do not have a NUL-terminated string (we'd
generally have an mmap() of the on-disk index file). For a well-formed
entry, this is not a problem; strtol() will stop when it sees the
newline. But there are two problems:

  1. A corrupted entry could omit the newline, causing us to read
     further. You'd mostly get stopped by seeing non-digits in the oid
     field (and if it is likewise truncated, there will still be 20 or
     more bytes of the index checksum). So it's possible, though
     unlikely, to see read off the end of the mmap'd buffer. Of course a
     malicious index file can fake the oid and the index checksum to all
     (ASCII) 0's.

     This is further complicated by the fact that mmap'd buffers tend to
     be zero-padded up to the page boundary. So to run off the end, the
     index size also has to be a multiple of the page size. This is also
     unlikely, though you can construct a malicious index file that
     matches this.

     The security implications aren't too interesting. The index file is
     a local file anyway (so you can't attack somebody by cloning, but
     only if you convince them to operate in a .git directory you made,
     at which point attacking .git/config is much easier). And it's just
     a read overflow via strtol(), which is unlikely to buy you much
     beyond a crash.

  2. ASan has a strict_string_checks option, which tells it to make sure
     that options to string functions (like strtol) have some eventual
     NUL, without regard to what the function would actually do (like
     stopping at a newline here). This option sometimes has false
     positives, but it can point to sketchy areas (like this one) where
     the input we use doesn't exhibit a problem, but different input
     _could_ cause us to misbehave.

Let's fix it by just parsing the values ourselves with a helper function
that is careful not to go past the end of the buffer. There are a few
behavior changes here that should not matter:

  - We do not consider overflow, as strtol() would. But nor did the
    original code. However, we don't trust the value we get from the
    on-disk file, and if it says to read 2^30 entries, we would notice
    that we do not have that many and bail before reading off the end of
    the buffer.

  - Our helper does not skip past extra leading whitespace as strtol()
    would, but according to gitformat-index(5) there should not be any.

  - The original quit parsing at a newline or a NUL byte, but now we
    insist on a newline (which is what the documentation says, and what
    Git has always produced).

Since we are providing our own helper function, we can tweak the
interface a bit to make our lives easier. The original code does not use
strtol's "end" pointer to find the end of the parsed data, but rather
uses a separate loop to advance our "buf" pointer to the trailing
newline. We can instead provide a helper that advances "buf" as it
parses, letting us read strictly left-to-right through the buffer.

I didn't add a new test here. It's surprisingly difficult to construct
an index of exactly the right size due to the way we pad entries. But it
is easy to trigger the problem in existing tests when using ASan's
strict string checking, coupled with a recent change to use NO_MMAP with
ASan builds. So:

  make SANITIZE=address
  cd t
  ASAN_OPTIONS=strict_string_checks=1 ./t0090-cache-tree.sh

triggers it reliably. Technically it is not deterministic because there
is ~8% chance (it's 1-(255/256)^20, or ^32 for sha256) that the trailing
checksum hash has a NUL byte in it. But we compute enough cache-trees in
the course of that script that we are very likely to hit the problem in
one of them.

We can look at making strict_string_checks the default for ASan builds,
but there are some other cases we'd want to fix first.

Reported-by: correctmost <cmlists@sent.com>
Signed-off-by: Jeff King <peff@peff.net>
---
It feels gross to reimplement strtol(), just because there are so many
weird corner cases (signs, leading whitespace, overflow, and so on).
The alternative is reading into a buffer, NUL-terminating it, and
calling strtol() there. But that has its own hazards (you have to decide
when to stop reading, which means reimplementing those same rules to
soak up whitespace, etc).

We can take some shortcuts here because we know what our one caller
looks like. It might be worth having a carefully written "strntol()"
that can be used everywhere. But there aren't _that_ many call sites
that would use it, so maybe this ad-hoc approach is better?

 cache-tree.c | 45 +++++++++++++++++++++++++++++++--------------
 1 file changed, 31 insertions(+), 14 deletions(-)

diff --git a/cache-tree.c b/cache-tree.c
index 2aba47060e..ab20ffe863 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -548,12 +548,36 @@ void cache_tree_write(struct strbuf *sb, struct cache_tree *root)
 	trace2_region_leave("cache_tree", "write", the_repository);
 }
 
+static long parse_long(const char **ptr, unsigned long *len_p)
+{
+	const char *s = *ptr;
+	unsigned long len = *len_p;
+	long ret = 0;
+	int sign = 1;
+
+	while (len && *s == '-') {
+		sign *= -1;
+		s++;
+		len--;
+	}
+
+	while (len) {
+		if (!isdigit(*s))
+			break;
+		ret *= 10;
+		ret += *s - '0';
+		s++;
+		len--;
+	}
+	*ptr = s;
+	*len_p = len;
+	return sign * ret;
+}
+
 static struct cache_tree *read_one(const char **buffer, unsigned long *size_p)
 {
 	const char *buf = *buffer;
 	unsigned long size = *size_p;
-	const char *cp;
-	char *ep;
 	struct cache_tree *it;
 	int i, subtree_nr;
 	const unsigned rawsz = the_hash_algo->rawsz;
@@ -569,19 +593,12 @@ static struct cache_tree *read_one(const char **buffer, unsigned long *size_p)
 	buf++; size--;
 	it = cache_tree();
 
-	cp = buf;
-	it->entry_count = strtol(cp, &ep, 10);
-	if (cp == ep)
+	it->entry_count = parse_long(&buf, &size);
+	if (!size || *buf != ' ')
 		goto free_return;
-	cp = ep;
-	subtree_nr = strtol(cp, &ep, 10);
-	if (cp == ep)
-		goto free_return;
-	while (size && *buf && *buf != '\n') {
-		size--;
-		buf++;
-	}
-	if (!size)
+	buf++; size--;
+	subtree_nr = parse_long(&buf, &size);
+	if (!size || *buf != '\n')
 		goto free_return;
 	buf++; size--;
 	if (0 <= it->entry_count) {
-- 
2.52.0.rc1.260.g3e4993586f


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 5/9] fsck: assert newline presence in fsck_ident()
  2025-11-12  7:55 [PATCH 0/9] asan bonanza Jeff King
                   ` (3 preceding siblings ...)
  2025-11-12  8:05 ` [PATCH 4/9] cache-tree: avoid strtol() on non-string buffer Jeff King
@ 2025-11-12  8:06 ` Jeff King
  2025-11-12  8:06 ` [PATCH 6/9] fsck: avoid strcspn() " Jeff King
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 62+ messages in thread
From: Jeff King @ 2025-11-12  8:06 UTC (permalink / raw)
  To: git; +Cc: correctmost, Taylor Blau

The fsck code purports to handle buffers that are not NUL-terminated,
but fsck_ident() uses some string functions. This works OK in practice,
as explained in 8e4309038f (fsck: do not assume NUL-termination of
buffers, 2023-01-19). Before calling fsck_ident() we'll have called
verify_headers(), which makes sure we have at least a trailing newline.
And none of our string-like functions will walk past that newline.

However, that makes this code at the top of fsck_ident() very confusing:

    *ident = strchrnul(*ident, '\n');
    if (**ident == '\n')
            (*ident)++;

We should always see that newline, or our memory safety assumptions have
been violated! Further, using strchrnul() is weird, since the whole
point is that if the newline is not there, we don't necessarily have a
NUL at all, and might read off the end of the buffer.

So let's have callers pass in the boundary of our buffer, which lets us
safely find the newline with memchr(). And if it is not there, this is a
BUG(), because it means our caller did not validate the input with
verify_headers() as it was supposed to (and we are better off bailing
rather than having memory-safety problems).

Signed-off-by: Jeff King <peff@peff.net>
---
 fsck.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/fsck.c b/fsck.c
index 341e100d24..8991f04943 100644
--- a/fsck.c
+++ b/fsck.c
@@ -860,16 +860,18 @@ static int verify_headers(const void *data, unsigned long size,
 		FSCK_MSG_UNTERMINATED_HEADER, "unterminated header");
 }
 
-static int fsck_ident(const char **ident,
+static int fsck_ident(const char **ident, const char *ident_end,
 		      const struct object_id *oid, enum object_type type,
 		      struct fsck_options *options)
 {
 	const char *p = *ident;
+	const char *nl;
 	char *end;
 
-	*ident = strchrnul(*ident, '\n');
-	if (**ident == '\n')
-		(*ident)++;
+	nl = memchr(p, '\n', ident_end - p);
+	if (!nl)
+		BUG("verify_headers() should have made sure we have a newline");
+	*ident = nl + 1;
 
 	if (*p == '<')
 		return report(options, oid, type, FSCK_MSG_MISSING_NAME_BEFORE_EMAIL, "invalid author/committer line - missing space before email");
@@ -958,7 +960,7 @@ static int fsck_commit(const struct object_id *oid,
 	author_count = 0;
 	while (buffer < buffer_end && skip_prefix(buffer, "author ", &buffer)) {
 		author_count++;
-		err = fsck_ident(&buffer, oid, OBJ_COMMIT, options);
+		err = fsck_ident(&buffer, buffer_end, oid, OBJ_COMMIT, options);
 		if (err)
 			return err;
 	}
@@ -970,7 +972,7 @@ static int fsck_commit(const struct object_id *oid,
 		return err;
 	if (buffer >= buffer_end || !skip_prefix(buffer, "committer ", &buffer))
 		return report(options, oid, OBJ_COMMIT, FSCK_MSG_MISSING_COMMITTER, "invalid format - expected 'committer' line");
-	err = fsck_ident(&buffer, oid, OBJ_COMMIT, options);
+	err = fsck_ident(&buffer, buffer_end, oid, OBJ_COMMIT, options);
 	if (err)
 		return err;
 	if (memchr(buffer_begin, '\0', size)) {
@@ -1065,7 +1067,7 @@ int fsck_tag_standalone(const struct object_id *oid, const char *buffer,
 			goto done;
 	}
 	else
-		ret = fsck_ident(&buffer, oid, OBJ_TAG, options);
+		ret = fsck_ident(&buffer, buffer_end, oid, OBJ_TAG, options);
 
 	if (buffer < buffer_end && (skip_prefix(buffer, "gpgsig ", &buffer) || skip_prefix(buffer, "gpgsig-sha256 ", &buffer))) {
 		eol = memchr(buffer, '\n', buffer_end - buffer);
-- 
2.52.0.rc1.260.g3e4993586f


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 6/9] fsck: avoid strcspn() in fsck_ident()
  2025-11-12  7:55 [PATCH 0/9] asan bonanza Jeff King
                   ` (4 preceding siblings ...)
  2025-11-12  8:06 ` [PATCH 5/9] fsck: assert newline presence in fsck_ident() Jeff King
@ 2025-11-12  8:06 ` Jeff King
  2025-11-12  8:06 ` [PATCH 7/9] fsck: remove redundant date timestamp check Jeff King
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 62+ messages in thread
From: Jeff King @ 2025-11-12  8:06 UTC (permalink / raw)
  To: git; +Cc: correctmost, Taylor Blau

We may be operating on a buffer that is not NUL-terminated, but we use
strcspn() to parse it. This is OK in practice, as discussed in
8e4309038f (fsck: do not assume NUL-termination of buffers, 2023-01-19),
because we know there is at least a trailing newline in our buffer, and
we always pass "\n" to strcspn(). So we know it will stop before running
off the end of the buffer.

But this is a subtle point to hang our memory safety hat on. And it
confuses ASan's strict_string_checks mode, even though it is technically
a false positive (that mode complains that we have no NUL, which is
true, but it does not know that we have verified the presence of the
newline already).

Let's instead open-code the loop. As a bonus, this makes the logic more
obvious (to my mind, anyway). The current code skips forward with
strcspn until it hits "<", ">", or "\n". But then it must check which it
saw to decide if that was what we expected or not, duplicating some
logic between what's in the strcspn() and what's in the domain logic.
Instead, we can just check each character as we loop and act on it
immediately.

Signed-off-by: Jeff King <peff@peff.net>
---
 fsck.c | 32 ++++++++++++++++++++++----------
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/fsck.c b/fsck.c
index 8991f04943..2ee72d573d 100644
--- a/fsck.c
+++ b/fsck.c
@@ -875,18 +875,30 @@ static int fsck_ident(const char **ident, const char *ident_end,
 
 	if (*p == '<')
 		return report(options, oid, type, FSCK_MSG_MISSING_NAME_BEFORE_EMAIL, "invalid author/committer line - missing space before email");
-	p += strcspn(p, "<>\n");
-	if (*p == '>')
-		return report(options, oid, type, FSCK_MSG_BAD_NAME, "invalid author/committer line - bad name");
-	if (*p != '<')
-		return report(options, oid, type, FSCK_MSG_MISSING_EMAIL, "invalid author/committer line - missing email");
+	for (;;) {
+		if (p >= ident_end || *p == '\n')
+			return report(options, oid, type, FSCK_MSG_MISSING_EMAIL, "invalid author/committer line - missing email");
+		if (*p == '>')
+			return report(options, oid, type, FSCK_MSG_BAD_NAME, "invalid author/committer line - bad name");
+		if (*p == '<')
+			break; /* end of name, beginning of email */
+
+		/* otherwise, skip past arbitrary name char */
+		p++;
+	}
 	if (p[-1] != ' ')
 		return report(options, oid, type, FSCK_MSG_MISSING_SPACE_BEFORE_EMAIL, "invalid author/committer line - missing space before email");
-	p++;
-	p += strcspn(p, "<>\n");
-	if (*p != '>')
-		return report(options, oid, type, FSCK_MSG_BAD_EMAIL, "invalid author/committer line - bad email");
-	p++;
+	p++; /* skip past '<' we found */
+	for (;;) {
+		if (p >= ident_end || *p == '<' || *p == '\n')
+			return report(options, oid, type, FSCK_MSG_BAD_EMAIL, "invalid author/committer line - bad email");
+		if (*p == '>')
+			break; /* end of email */
+
+		/* otherwise, skip past arbitrary email char */
+		p++;
+	}
+	p++; /* skip past '>' we found */
 	if (*p != ' ')
 		return report(options, oid, type, FSCK_MSG_MISSING_SPACE_BEFORE_DATE, "invalid author/committer line - missing space before date");
 	p++;
-- 
2.52.0.rc1.260.g3e4993586f


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 7/9] fsck: remove redundant date timestamp check
  2025-11-12  7:55 [PATCH 0/9] asan bonanza Jeff King
                   ` (5 preceding siblings ...)
  2025-11-12  8:06 ` [PATCH 6/9] fsck: avoid strcspn() " Jeff King
@ 2025-11-12  8:06 ` Jeff King
  2025-11-12  8:10 ` [PATCH 8/9] fsck: avoid parse_timestamp() on buffer that isn't NUL-terminated Jeff King
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 62+ messages in thread
From: Jeff King @ 2025-11-12  8:06 UTC (permalink / raw)
  To: git; +Cc: correctmost, Taylor Blau

After calling "parse_timestamp(p, &end, 10)", we complain if "p == end",
which would imply that we did not see any digits at all. But we know
this cannot be the case, since we would have bailed already if we did
not see any digits, courtesy of extra checks added by 8e4309038f (fsck:
do not assume NUL-termination of buffers, 2023-01-19). Since then,
checking "p == end" is redundant and we can drop it.

This will make our lives a little easier as we refactor further.

Signed-off-by: Jeff King <peff@peff.net>
---
 fsck.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fsck.c b/fsck.c
index 2ee72d573d..266c965cec 100644
--- a/fsck.c
+++ b/fsck.c
@@ -920,7 +920,7 @@ static int fsck_ident(const char **ident, const char *ident_end,
 		return report(options, oid, type, FSCK_MSG_ZERO_PADDED_DATE, "invalid author/committer line - zero-padded date");
 	if (date_overflows(parse_timestamp(p, &end, 10)))
 		return report(options, oid, type, FSCK_MSG_BAD_DATE_OVERFLOW, "invalid author/committer line - date causes integer overflow");
-	if ((end == p || *end != ' '))
+	if (*end != ' ')
 		return report(options, oid, type, FSCK_MSG_BAD_DATE, "invalid author/committer line - bad date");
 	p = end + 1;
 	if ((*p != '+' && *p != '-') ||
-- 
2.52.0.rc1.260.g3e4993586f


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 8/9] fsck: avoid parse_timestamp() on buffer that isn't NUL-terminated
  2025-11-12  7:55 [PATCH 0/9] asan bonanza Jeff King
                   ` (6 preceding siblings ...)
  2025-11-12  8:06 ` [PATCH 7/9] fsck: remove redundant date timestamp check Jeff King
@ 2025-11-12  8:10 ` Jeff King
  2025-11-12 11:25   ` Patrick Steinhardt
  2025-11-12  8:10 ` [PATCH 9/9] t: enable ASan's strict_string_checks option Jeff King
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 62+ messages in thread
From: Jeff King @ 2025-11-12  8:10 UTC (permalink / raw)
  To: git; +Cc: correctmost, Taylor Blau

In fsck_ident(), we parse the timestamp with parse_timestamp(), which is
really an alias for strtoumax(). But since our buffer may not be
NUL-terminated, this can trigger a complaint from ASan's
strict_string_checks mode. This is a false positive, since we know that
the buffer contains a trailing newline (which we checked earlier in the
function), and that strtoumax() would stop there.

But it is worth working around ASan's complaint. One is because that
will let us turn on strict_string_checks by default, which has helped
catch other real problems. And two is that the safety of the current
code is very hard to reason about (it subtly depends on distant code
which could change).

One option here is to just parse the number left-to-right ourselves. But
we care about the size of a timestamp_t and detecting overflow, since
that's part of the point of these checks. And doing that correctly is
tricky. So we'll instead just pull the digits into a separate,
NUL-terminated buffer, and use that to call parse_timestamp().

Signed-off-by: Jeff King <peff@peff.net>
---
There's one step we could take after this commit, which is to annotate
all of the spots that look at "p" to do:

  -if (*p != ' ')
  +if (p >= ident_end || *p != ' ')
	return report(..., "expected space");

or similar. And then I think it would be safe to call fsck_ident() with
a buffer that is not NUL-terminated (and does not have our "safety"
newline). I stopped short for this series because we are just trying to
appease ASan (and not fixing real bugs), and because the result gets
rather unwieldy. But it might be worth it in the long run. We could
always do it later on top.

Again, I find the strto*() wrapping to be gross. Here we use the
extra-buffer trick. But we still don't get away with avoiding custom
logic (for example, if we ever want to support negative timestamps, the
fsck code will have to recognize "-" signs). But it feels like the best
we can do for now.

 fsck.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/fsck.c b/fsck.c
index 266c965cec..8e8083e7c6 100644
--- a/fsck.c
+++ b/fsck.c
@@ -860,13 +860,28 @@ static int verify_headers(const void *data, unsigned long size,
 		FSCK_MSG_UNTERMINATED_HEADER, "unterminated header");
 }
 
+static timestamp_t parse_timestamp_from_buf(const char **start, const char *end)
+{
+	const char *p = *start;
+	char buf[24]; /* big enough for 2^64 */
+	size_t i = 0;
+
+	while (p < end && isdigit(*p)) {
+		if (i >= ARRAY_SIZE(buf) - 1)
+			return TIME_MAX;
+		buf[i++] = *p++;
+	}
+	buf[i] = '\0';
+	*start = p;
+	return parse_timestamp(buf, NULL, 10);
+}
+
 static int fsck_ident(const char **ident, const char *ident_end,
 		      const struct object_id *oid, enum object_type type,
 		      struct fsck_options *options)
 {
 	const char *p = *ident;
 	const char *nl;
-	char *end;
 
 	nl = memchr(p, '\n', ident_end - p);
 	if (!nl)
@@ -918,11 +933,11 @@ static int fsck_ident(const char **ident, const char *ident_end,
 			      "invalid author/committer line - bad date");
 	if (*p == '0' && p[1] != ' ')
 		return report(options, oid, type, FSCK_MSG_ZERO_PADDED_DATE, "invalid author/committer line - zero-padded date");
-	if (date_overflows(parse_timestamp(p, &end, 10)))
+	if (date_overflows(parse_timestamp_from_buf(&p, ident_end)))
 		return report(options, oid, type, FSCK_MSG_BAD_DATE_OVERFLOW, "invalid author/committer line - date causes integer overflow");
-	if (*end != ' ')
+	if (*p != ' ')
 		return report(options, oid, type, FSCK_MSG_BAD_DATE, "invalid author/committer line - bad date");
-	p = end + 1;
+	p++;
 	if ((*p != '+' && *p != '-') ||
 	    !isdigit(p[1]) ||
 	    !isdigit(p[2]) ||
-- 
2.52.0.rc1.260.g3e4993586f


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 9/9] t: enable ASan's strict_string_checks option
  2025-11-12  7:55 [PATCH 0/9] asan bonanza Jeff King
                   ` (7 preceding siblings ...)
  2025-11-12  8:10 ` [PATCH 8/9] fsck: avoid parse_timestamp() on buffer that isn't NUL-terminated Jeff King
@ 2025-11-12  8:10 ` Jeff King
  2025-11-13  3:17 ` [PATCH 0/9] asan bonanza Taylor Blau
  2025-11-18  9:11 ` [PATCH v2 " Jeff King
  10 siblings, 0 replies; 62+ messages in thread
From: Jeff King @ 2025-11-12  8:10 UTC (permalink / raw)
  To: git; +Cc: correctmost, Taylor Blau

ASan has an option to enable strict string checking, where any pointer
passed to a function that expects a NUL-terminated string will be
checked for that NUL termination. This can sometimes produce false
positives. E.g., it is not wrong to pass a buffer with { '1', '2', '\n' }
into strtoul(). Even though it is not NUL-terminated, it will stop at
the newline.

But in trying it out, it identified two problematic spots in our test
suite (which have now been adjusted):

  1. The strtol() parsing in cache-tree.c was a real potential problem,
     which would have been very hard to find otherwise (since it
     required constructing a very specific broken index file).

  2. The use of string functions in fsck_ident() were false positives,
     because we knew that there was always a trailing newline which
     would stop the functions from reading off the end of the buffer.
     But the reasoning behind that is somewhat fragile, and silencing
     those complaints made the code easier to reason about.

So even though this did not find any earth-shattering bugs, and even had
a few false positives, I'm sufficiently convinced that its complaints
are more helpful than hurtful. Let's turn it on by default (since the
test suite now runs cleanly with it) and see if it ever turns up any
other instances.

Signed-off-by: Jeff King <peff@peff.net>
---
 t/test-lib.sh | 1 +
 1 file changed, 1 insertion(+)

diff --git a/t/test-lib.sh b/t/test-lib.sh
index ef0ab7ec2d..0fb76f7d11 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -77,6 +77,7 @@ prepend_var GIT_SAN_OPTIONS : strip_path_prefix="$GIT_BUILD_DIR/"
 # want that one to complain to stderr).
 prepend_var ASAN_OPTIONS : $GIT_SAN_OPTIONS
 prepend_var ASAN_OPTIONS : detect_leaks=0
+prepend_var ASAN_OPTIONS : strict_string_checks=1
 export ASAN_OPTIONS

 prepend_var LSAN_OPTIONS : $GIT_SAN_OPTIONS
-- 
2.52.0.rc1.260.g3e4993586f

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH 3/9] Makefile: turn on NO_MMAP when building with ASan
  2025-11-12  8:02 ` [PATCH 3/9] Makefile: turn on NO_MMAP when building with ASan Jeff King
@ 2025-11-12  8:17   ` Collin Funk
  2025-11-12 10:31     ` Jeff King
  2025-11-12 11:26   ` Patrick Steinhardt
  1 sibling, 1 reply; 62+ messages in thread
From: Collin Funk @ 2025-11-12  8:17 UTC (permalink / raw)
  To: Jeff King; +Cc: git, correctmost, Taylor Blau

Jeff King <peff@peff.net> writes:

> Git often uses mmap() to access on-disk files. This leaves a blind spot
> in our SANITIZE=address builds, since ASan does not seem to handle mmap
> at all. Nor does the OS notice most out-of-bounds access, since it tends
> to round up to the nearest page size (so depending on how big the map
> is, you might have to overrun it by up to 4095 bytes to trigger a
> segfault).
>
> The previous commit demonstrates a memory bug that we missed. We could
> have made a new test where the out-of-bounds access was much larger, or
> where the mapped file ended closer to a page boundary. But the point of
> running the test suite with sanitizers is to catch these problems
> without having to construct specific tests.
>
> Let's enable NO_MMAP for our ASan builds by default, which should give
> us better coverage. This does increase the memory usage of Git, since
> we're copying from the filesystem into heap. But the repositories in the
> test suite tend to be small, so the overhead isn't really noticeable
> (and ASan already has quite a performance penalty).
>
> There are a few other known bugs that this patch will help flush out.
> However, they aren't directly triggered in the test suite (yet). So
> it's safe to turn this on now without breaking the test suite, which
> will help us add new tests to demonstrate those other bugs as we fix
> them.
>
> Signed-off-by: Jeff King <peff@peff.net>

I see that an interceptor was added in 2023 [1]. Maybe your compiler is
older than that?

On my system:

    $ cat main.c 
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>
    int
    main (void)
    {
      char *ptr = mmap (NULL, getpagesize (), PROT_READ | PROT_WRITE,
    		    MAP_ANONYMOUS, -1, 0);
      if (ptr == NULL)
        abort ();
      ptr[getpagesize () + 1] = 'a';
      return 0;
    }
    $ gcc --version | head -n 1
    gcc (GCC) 15.2.1 20251022 (Red Hat 15.2.1-3)
    $ clang --version | head -n 1
    clang version 21.1.4 (Fedora 21.1.4-1.fc43)
    $ gcc -fsanitize=address main.c && ./a.out 2>&1 | grep ^SUMMARY:
    SUMMARY: AddressSanitizer: SEGV (/home/collin/a.out+0x400554) (BuildId: 1b7a82189bfffb3f73d420e138b9859add25901a) in main
    $ clang -fsanitize=address main.c && ./a.out 2>&1 | grep ^SUMMARY:
    SUMMARY: AddressSanitizer: SEGV (/home/collin/a.out+0x4e9ee6) (BuildId: aca1d168eacebaa239082d8a45ab74c8470f4b31) in main


Collin

[1] https://github.com/llvm/llvm-project/commit/a34e702aa16fde4cc76e9360d985a64e008e0b23

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 3/9] Makefile: turn on NO_MMAP when building with ASan
  2025-11-12  8:17   ` Collin Funk
@ 2025-11-12 10:31     ` Jeff King
  2025-11-12 20:06       ` Collin Funk
  0 siblings, 1 reply; 62+ messages in thread
From: Jeff King @ 2025-11-12 10:31 UTC (permalink / raw)
  To: Collin Funk; +Cc: git, correctmost, Taylor Blau

On Wed, Nov 12, 2025 at 12:17:24AM -0800, Collin Funk wrote:

> I see that an interceptor was added in 2023 [1]. Maybe your compiler is
> older than that?

No, I'm using gcc 15.2.0 (from Debian unstable).

But I'm not sure if the linked code does anything useful for us.

One, it's not clear to me if it is even kicking in or not. It only does
anything if the region is "sanitizer managed", according to the details
at https://reviews.llvm.org/D154659. I'm not sure what that means
exactly, because I'm fuzzy on how the shadow map works.

But even when it does do something, it seems to round up to the nearest
page size. But we really want to know if we go even one byte over the
requested length, because if we touch the 1235th byte of a 1234-byte
buffer (which is going to be a NUL because of mmap rounding up the
pages), then there's probably another test case somewhere where we
access the 4097th byte of a 4096-byte buffer (which is going to
segfault).

>       char *ptr = mmap (NULL, getpagesize (), PROT_READ | PROT_WRITE,
>     		    MAP_ANONYMOUS, -1, 0);
>       if (ptr == NULL)
>         abort ();

I think you want to check for MAP_FAILED here, not NULL. And I think we
always get that, because MAP_ANONYMOUS needs to be OR-ed into MAP_SHARED
or MAP_PRIVATE. So here:

>     $ gcc -fsanitize=address main.c && ./a.out 2>&1 | grep ^SUMMARY:
>     SUMMARY: AddressSanitizer: SEGV (/home/collin/a.out+0x400554) (BuildId: 1b7a82189bfffb3f73d420e138b9859add25901a) in main
>     $ clang -fsanitize=address main.c && ./a.out 2>&1 | grep ^SUMMARY:
>     SUMMARY: AddressSanitizer: SEGV (/home/collin/a.out+0x4e9ee6) (BuildId: aca1d168eacebaa239082d8a45ab74c8470f4b31) in main

I don't think this is ASan finding a problem. It is just telling us that
we segfaulted for other reasons. And the fault here is because the
broken mmap() invocation returned MAP_FAILED, and we tried to access
that garbage pointer.

>       ptr[getpagesize () + 1] = 'a';

This is also making a map that is a multiple of the page size, and then
touching a byte that's on the next page. That's the easy-ish case that
we can often already find, even without ASan (though it depends on what
comes after the mapped memory; it might be a valid page).

A more interesting test for Git is to actually map a file, like:

  $ cat main.c
  #include <unistd.h>
  #include <fcntl.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <stdio.h>
  static void die(const char *msg)
  {
  	perror(msg);
  	exit(1);
  }
  int main (int argc, const char **argv)
  {
  	struct stat st;
  	int fd;
  	char *ptr;

  	fd = open(argv[1], O_RDONLY);
  	if (fd < 0)
  		die("open");
  	if (fstat(fd, &st) < 0)
  		die("fstat");
  	ptr = mmap (NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
  	if (ptr == MAP_FAILED)
  		die("mmap");
  	printf("last byte: %d\n", ptr[st.st_size-1]);
  	printf("one byte after: %d\n", ptr[st.st_size]);
  	return 0;
  }
  $ yes | head -c 4096 >big
  $ yes | head -c 372 >small

And ASan does often detect the problem for the "big" page-sized file,
but not consistently! If I do:

  gcc -fsanitize=address main.c
  while ./a.out big; do echo ok; done

I may get output like:

  last byte: 10
  one byte after: 127
  ok
  last byte: 10
  one byte after: 0
  ok
  last byte: 10
  one byte after: 0
  ok
  last byte: 10
  =================================================================
  ==988617==ERROR: AddressSanitizer: unknown-crash on address 0x7efd40b9f000 at pc 0x564fe77b64eb bp 0x7ffff49e8160 sp 0x7ffff49e8158
  READ of size 1 at 0x7efd40b9f000 thread T0
      #0 0x564fe77b64ea in main (/home/peff/a.out+0x14ea) (BuildId: 8db121bb5c048cb336f8be729e8cefebd6f059a3)
      #1 0x7efd41233ca7 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
      #2 0x7efd41233d64 in __libc_start_main_impl ../csu/libc-start.c:360
      #3 0x564fe77b6150 in _start (/home/peff/a.out+0x1150) (BuildId: 8db121bb5c048cb336f8be729e8cefebd6f059a3)

  Address 0x7efd40b9f000 is a wild pointer inside of access range of size 0x000000000001.

So it worked three times without ASan noticing the problem (producing
two different outputs), and then ASan finally crashed. But it didn't
give us the usual information we get for a malloc overflow. It's just an
"unknown crash" from a "wild pointer". So I'm not sure if it's even
finding these through its own poisoning, and not just catching an
unlucky segfault.

If we switch to the small file, then ASan never reports anything! The OS
gives us a page-sized chunk, so we consistently read a "0" in from the
byte after our requested size.

If we swap out the mmap for:

  ptr = malloc(st.st_size);
  read(fd, ptr, st.st_size);

(which is roughly what our NO_MMAP wrapper is doing behind the scenes),
then ASan does catch it consistently, even for the "small" file:

  $ ./a.out small
  =================================================================
  ==1008630==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x7c7d11fe01b4 at pc 0x55cf89b1b4c8 bp 0x7fff471b84e0 sp 0x7fff471b84d8
  READ of size 1 at 0x7c7d11fe01b4 thread T0
      #0 0x55cf89b1b4c7 in main (/home/peff/a.out+0x14c7) (BuildId: 2cebfcd0a00064eaaed750af010fcecdae2f5666)
      #1 0x7f4d12e33ca7 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
      #2 0x7f4d12e33d64 in __libc_start_main_impl ../csu/libc-start.c:360
      #3 0x55cf89b1b150 in _start (/home/peff/a.out+0x1150) (BuildId: 2cebfcd0a00064eaaed750af010fcecdae2f5666)

  0x7c7d11fe01b4 is located 0 bytes after 372-byte region [0x7c7d11fe0040,0x7c7d11fe01b4)
  allocated by thread T0 here:
      #0 0x7f4d1311a0ab in malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:67
      #1 0x55cf89b1b3a0 in main (/home/peff/a.out+0x13a0) (BuildId: 2cebfcd0a00064eaaed750af010fcecdae2f5666)
      #2 0x7f4d12e33ca7 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

-Peff

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 2/9] pack-bitmap: handle name-hash lookups in incremental bitmaps
  2025-11-12  8:01 ` [PATCH 2/9] pack-bitmap: handle name-hash lookups in incremental bitmaps Jeff King
@ 2025-11-12 11:25   ` Patrick Steinhardt
  2025-11-13  2:55   ` Taylor Blau
  1 sibling, 0 replies; 62+ messages in thread
From: Patrick Steinhardt @ 2025-11-12 11:25 UTC (permalink / raw)
  To: Jeff King; +Cc: git, correctmost, Taylor Blau

On Wed, Nov 12, 2025 at 03:01:51AM -0500, Jeff King wrote:
> diff --git a/pack-bitmap.c b/pack-bitmap.c
> index 291e1a9cf4..710b86a451 100644
> --- a/pack-bitmap.c
> +++ b/pack-bitmap.c
> @@ -213,6 +213,26 @@ static uint32_t bitmap_num_objects(struct bitmap_index *index)
>  	return index->pack->num_objects;
>  }
>  
> +static uint32_t bitmap_name_hash(struct bitmap_index *index, uint32_t pos)
> +{
> +	if (bitmap_is_midx(index)) {
> +		while (index && pos < index->midx->num_objects_in_base)
> +			index = index->base;

So we first find the MIDX that is supposed to contain the position.

> +		if (!index)
> +			BUG("NULL base bitmap for object position: %"PRIu32, pos);
> +
> +		pos -= index->midx->num_objects_in_base;

We then subtract the number of objects from all of our predeceding
layers from the position. This should result in the position relative to
the current layer.

> +		if (pos >= index->midx->num_objects)
> +			BUG("out-of-bounds midx bitmap object at %"PRIu32, pos);
> +	}
> +
> +	if (!index->hashes)
> +		return 0;
> +
> +	return get_be32(index->hashes + pos);

And we then return the value at that given position. Makes sense to me.

Patrick

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 8/9] fsck: avoid parse_timestamp() on buffer that isn't NUL-terminated
  2025-11-12  8:10 ` [PATCH 8/9] fsck: avoid parse_timestamp() on buffer that isn't NUL-terminated Jeff King
@ 2025-11-12 11:25   ` Patrick Steinhardt
  2025-11-12 19:36     ` Junio C Hamano
  2025-11-15  2:12     ` Jeff King
  0 siblings, 2 replies; 62+ messages in thread
From: Patrick Steinhardt @ 2025-11-12 11:25 UTC (permalink / raw)
  To: Jeff King; +Cc: git, correctmost, Taylor Blau

On Wed, Nov 12, 2025 at 03:10:40AM -0500, Jeff King wrote:
> In fsck_ident(), we parse the timestamp with parse_timestamp(), which is
> really an alias for strtoumax(). But since our buffer may not be
> NUL-terminated, this can trigger a complaint from ASan's
> strict_string_checks mode. This is a false positive, since we know that
> the buffer contains a trailing newline (which we checked earlier in the
> function), and that strtoumax() would stop there.
> 
> But it is worth working around ASan's complaint. One is because that
> will let us turn on strict_string_checks by default, which has helped
> catch other real problems. And two is that the safety of the current
> code is very hard to reason about (it subtly depends on distant code
> which could change).
> 
> One option here is to just parse the number left-to-right ourselves. But
> we care about the size of a timestamp_t and detecting overflow, since
> that's part of the point of these checks. And doing that correctly is
> tricky. So we'll instead just pull the digits into a separate,
> NUL-terminated buffer, and use that to call parse_timestamp().

So this is another site that would benefit from having something like
`git_parse_int()` with an extra parameter indicating the number of
bytes available for parsing (and a way to disable unit factors).

Patrick

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 4/9] cache-tree: avoid strtol() on non-string buffer
  2025-11-12  8:05 ` [PATCH 4/9] cache-tree: avoid strtol() on non-string buffer Jeff King
@ 2025-11-12 11:26   ` Patrick Steinhardt
  2025-11-13  3:09     ` Taylor Blau
  2025-11-18  8:38     ` Jeff King
  0 siblings, 2 replies; 62+ messages in thread
From: Patrick Steinhardt @ 2025-11-12 11:26 UTC (permalink / raw)
  To: Jeff King; +Cc: git, correctmost, Taylor Blau

On Wed, Nov 12, 2025 at 03:05:37AM -0500, Jeff King wrote:
> A cache-tree extension entry in the index looks like this:
> 
>   <name> NUL <entry_nr> SPACE <subtree_nr> NEWLINE <binary_oid>
> 
> where the "_nr" items are human-readable base-10 ASCII. We parse them
> with strtol(), even though we do not have a NUL-terminated string (we'd
> generally have an mmap() of the on-disk index file). For a well-formed
> entry, this is not a problem; strtol() will stop when it sees the
> newline. But there are two problems:
> 
>   1. A corrupted entry could omit the newline, causing us to read
>      further. You'd mostly get stopped by seeing non-digits in the oid
>      field (and if it is likewise truncated, there will still be 20 or
>      more bytes of the index checksum). So it's possible, though
>      unlikely, to see read off the end of the mmap'd buffer. Of course a

s/see read/read/

>      malicious index file can fake the oid and the index checksum to all
>      (ASCII) 0's.
> 
>      This is further complicated by the fact that mmap'd buffers tend to
>      be zero-padded up to the page boundary. So to run off the end, the
>      index size also has to be a multiple of the page size. This is also
>      unlikely, though you can construct a malicious index file that
>      matches this.
> 
>      The security implications aren't too interesting. The index file is
>      a local file anyway (so you can't attack somebody by cloning, but
>      only if you convince them to operate in a .git directory you made,
>      at which point attacking .git/config is much easier). And it's just
>      a read overflow via strtol(), which is unlikely to buy you much
>      beyond a crash.

Agreed. Good to fix it regardless.

> diff --git a/cache-tree.c b/cache-tree.c
> index 2aba47060e..ab20ffe863 100644
> --- a/cache-tree.c
> +++ b/cache-tree.c
> @@ -548,12 +548,36 @@ void cache_tree_write(struct strbuf *sb, struct cache_tree *root)
>  	trace2_region_leave("cache_tree", "write", the_repository);
>  }
>  
> +static long parse_long(const char **ptr, unsigned long *len_p)
> +{
> +	const char *s = *ptr;
> +	unsigned long len = *len_p;
> +	long ret = 0;
> +	int sign = 1;
> +
> +	while (len && *s == '-') {
> +		sign *= -1;
> +		s++;
> +		len--;
> +	}
> +
> +	while (len) {
> +		if (!isdigit(*s))
> +			break;
> +		ret *= 10;
> +		ret += *s - '0';
> +		s++;
> +		len--;
> +	}
> +	*ptr = s;
> +	*len_p = len;
> +	return sign * ret;
> +}

Hm. I'm not a huge fan of not having any error handling at all. It just
feels way too fragile for my taste:

  - As you mention we don't detect overflows, as we would detect them at
    a later point in time when trying to access index entries at invalid
    offsets. But if the input is crafted in a way that the overflow ends
    up with a reasonable index entry we might just as well _not_ detect
    that an overflow has happened and end up using the wrong index
    entry.

  - We don't verify that we even have a number in the first place. We'd
    simply return "0" in that case and not advance the pointer. This is
    fine though as we verify that the returned size is non-zero, so we'd
    detect this case.

I'd much rather prefer to have an interface similar to `git_parse_int()`
and related functions, which are way easier to use compared to the likes
of `stroi()`.

Otherwise, the next time we want to have something similar like you
introduce here we might not be aware of this existing helper and may not
know to generalize it, so we'd introduce another ad-hoc helper.

Anyway. I won't object if you want to go with your version anyway, as it
at least fixes one class of errors.

Thanks!

Patrick

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 3/9] Makefile: turn on NO_MMAP when building with ASan
  2025-11-12  8:02 ` [PATCH 3/9] Makefile: turn on NO_MMAP when building with ASan Jeff King
  2025-11-12  8:17   ` Collin Funk
@ 2025-11-12 11:26   ` Patrick Steinhardt
  2025-11-13  3:12     ` Taylor Blau
  2025-11-13 16:30     ` Junio C Hamano
  1 sibling, 2 replies; 62+ messages in thread
From: Patrick Steinhardt @ 2025-11-12 11:26 UTC (permalink / raw)
  To: Jeff King; +Cc: git, correctmost, Taylor Blau

On Wed, Nov 12, 2025 at 03:02:15AM -0500, Jeff King wrote:
> diff --git a/Makefile b/Makefile
> index 7e0f77e298..0f44268405 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -1587,6 +1587,7 @@ SANITIZE_LEAK = YesCompiledWithIt
>  endif
>  ifneq ($(filter address,$(SANITIZERS)),)
>  NO_REGEX = NeededForASAN
> +NO_MMAP = NeededForASAN
>  SANITIZE_ADDRESS = YesCompiledWithIt
>  endif
>  endif

Let's also apply this to Meson. Thanks!

Patrick

diff --git a/meson.build b/meson.build
index ad4eb2c4fa..668f8769d2 100644
--- a/meson.build
+++ b/meson.build
@@ -1408,12 +1408,18 @@ if host_machine.system() == 'windows'
   libgit_c_args += '-DUSE_WIN32_MMAP'
 else
   checkfuncs += {
-    'mmap' : ['mmap.c'],
     # provided by compat/mingw.c.
     'unsetenv' : ['unsetenv.c'],
     # provided by compat/mingw.c.
     'getpagesize' : [],
   }
+
+  if get_option('b_sanitize').contains('address')
+    libgit_c_args += '-DNO_MMAP'
+    libgit_sources += 'compat/mmap.c'
+  else
+    checkfuncs += { 'mmap': ['mmap.c'] }
+  endif
 endif
 
 foreach func, impls : checkfuncs

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH 8/9] fsck: avoid parse_timestamp() on buffer that isn't NUL-terminated
  2025-11-12 11:25   ` Patrick Steinhardt
@ 2025-11-12 19:36     ` Junio C Hamano
  2025-11-15  2:12     ` Jeff King
  1 sibling, 0 replies; 62+ messages in thread
From: Junio C Hamano @ 2025-11-12 19:36 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Jeff King, git, correctmost, Taylor Blau

Patrick Steinhardt <ps@pks.im> writes:

> So this is another site that would benefit from having something like
> `git_parse_int()` with an extra parameter indicating the number of
> bytes available for parsing (and a way to disable unit factors).

It is certainly an interesting approach that would work well.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 3/9] Makefile: turn on NO_MMAP when building with ASan
  2025-11-12 10:31     ` Jeff King
@ 2025-11-12 20:06       ` Collin Funk
  0 siblings, 0 replies; 62+ messages in thread
From: Collin Funk @ 2025-11-12 20:06 UTC (permalink / raw)
  To: Jeff King; +Cc: git, correctmost, Taylor Blau

Jeff King <peff@peff.net> writes:

> On Wed, Nov 12, 2025 at 12:17:24AM -0800, Collin Funk wrote:
>
>> I see that an interceptor was added in 2023 [1]. Maybe your compiler is
>> older than that?
>
> No, I'm using gcc 15.2.0 (from Debian unstable).
>
> But I'm not sure if the linked code does anything useful for us.
>
> One, it's not clear to me if it is even kicking in or not. It only does
> anything if the region is "sanitizer managed", according to the details
> at https://reviews.llvm.org/D154659. I'm not sure what that means
> exactly, because I'm fuzzy on how the shadow map works.
>
> But even when it does do something, it seems to round up to the nearest
> page size. But we really want to know if we go even one byte over the
> requested length, because if we touch the 1235th byte of a 1234-byte
> buffer (which is going to be a NUL because of mmap rounding up the
> pages), then there's probably another test case somewhere where we
> access the 4097th byte of a 4096-byte buffer (which is going to
> segfault).

The glibc docs say that the length is rounded up to the nearest page
size [1]:

    Thus, addresses for mapping must be page-aligned, and length values
    will be rounded up.

This wording in POSIX makes me think that all systems will round up to
the nearest page size [2]:

    Thus, while the parameter len need not meet a size or alignment
    constraint, the system shall include, in any mapping operation, any
    partial page specified by the address range starting at pa and
    continuing for len bytes.

>>       char *ptr = mmap (NULL, getpagesize (), PROT_READ | PROT_WRITE,
>>     		    MAP_ANONYMOUS, -1, 0);
>>       if (ptr == NULL)
>>         abort ();
>
> I think you want to check for MAP_FAILED here, not NULL. And I think we
> always get that, because MAP_ANONYMOUS needs to be OR-ed into MAP_SHARED
> or MAP_PRIVATE. So here:
>
>>     $ gcc -fsanitize=address main.c && ./a.out 2>&1 | grep ^SUMMARY:
>>     SUMMARY: AddressSanitizer: SEGV (/home/collin/a.out+0x400554) (BuildId: 1b7a82189bfffb3f73d420e138b9859add25901a) in main
>>     $ clang -fsanitize=address main.c && ./a.out 2>&1 | grep ^SUMMARY:
>>     SUMMARY: AddressSanitizer: SEGV (/home/collin/a.out+0x4e9ee6) (BuildId: aca1d168eacebaa239082d8a45ab74c8470f4b31) in main
>
> I don't think this is ASan finding a problem. It is just telling us that
> we segfaulted for other reasons. And the fault here is because the
> broken mmap() invocation returned MAP_FAILED, and we tried to access
> that garbage pointer.
>
>>       ptr[getpagesize () + 1] = 'a';
>
> This is also making a map that is a multiple of the page size, and then
> touching a byte that's on the next page. That's the easy-ish case that
> we can often already find, even without ASan (though it depends on what
> comes after the mapped memory; it might be a valid page).

Oops. I clearly don't use mmap much. :)

> A more interesting test for Git is to actually map a file, like:
>
>   $ cat main.c
>   #include <unistd.h>
>   #include <fcntl.h>
>   #include <sys/mman.h>
>   #include <sys/stat.h>
>   #include <stdio.h>
>   static void die(const char *msg)
>   {
>   	perror(msg);
>   	exit(1);
>   }
>   int main (int argc, const char **argv)
>   {
>   	struct stat st;
>   	int fd;
>   	char *ptr;
>   
>   	fd = open(argv[1], O_RDONLY);
>   	if (fd < 0)
>   		die("open");
>   	if (fstat(fd, &st) < 0)
>   		die("fstat");
>   	ptr = mmap (NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
>   	if (ptr == MAP_FAILED)
>   		die("mmap");
>   	printf("last byte: %d\n", ptr[st.st_size-1]);
>   	printf("one byte after: %d\n", ptr[st.st_size]);
>   	return 0;
>   }
>   $ yes | head -c 4096 >big
>   $ yes | head -c 372 >small
>
> And ASan does often detect the problem for the "big" page-sized file,
> but not consistently! If I do:
>
>   gcc -fsanitize=address main.c
>   while ./a.out big; do echo ok; done
>
> I may get output like:
>
>   last byte: 10
>   one byte after: 127
>   ok
>   last byte: 10
>   one byte after: 0
>   ok
>   last byte: 10
>   one byte after: 0
>   ok
>   last byte: 10
>   =================================================================
>   ==988617==ERROR: AddressSanitizer: unknown-crash on address 0x7efd40b9f000 at pc 0x564fe77b64eb bp 0x7ffff49e8160 sp 0x7ffff49e8158
>   READ of size 1 at 0x7efd40b9f000 thread T0
>       #0 0x564fe77b64ea in main (/home/peff/a.out+0x14ea) (BuildId: 8db121bb5c048cb336f8be729e8cefebd6f059a3)
>       #1 0x7efd41233ca7 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
>       #2 0x7efd41233d64 in __libc_start_main_impl ../csu/libc-start.c:360
>       #3 0x564fe77b6150 in _start (/home/peff/a.out+0x1150) (BuildId: 8db121bb5c048cb336f8be729e8cefebd6f059a3)
>   
>   Address 0x7efd40b9f000 is a wild pointer inside of access range of size 0x000000000001.
>
> So it worked three times without ASan noticing the problem (producing
> two different outputs), and then ASan finally crashed. But it didn't
> give us the usual information we get for a malloc overflow. It's just an
> "unknown crash" from a "wild pointer". So I'm not sure if it's even
> finding these through its own poisoning, and not just catching an
> unlucky segfault.
>
> If we switch to the small file, then ASan never reports anything! The OS
> gives us a page-sized chunk, so we consistently read a "0" in from the
> byte after our requested size.
>
> If we swap out the mmap for:
>
>   ptr = malloc(st.st_size);
>   read(fd, ptr, st.st_size);
>
> (which is roughly what our NO_MMAP wrapper is doing behind the scenes),
> then ASan does catch it consistently, even for the "small" file:
>
>   $ ./a.out small
>   =================================================================
>   ==1008630==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x7c7d11fe01b4 at pc 0x55cf89b1b4c8 bp 0x7fff471b84e0 sp 0x7fff471b84d8
>   READ of size 1 at 0x7c7d11fe01b4 thread T0
>       #0 0x55cf89b1b4c7 in main (/home/peff/a.out+0x14c7) (BuildId: 2cebfcd0a00064eaaed750af010fcecdae2f5666)
>       #1 0x7f4d12e33ca7 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
>       #2 0x7f4d12e33d64 in __libc_start_main_impl ../csu/libc-start.c:360
>       #3 0x55cf89b1b150 in _start (/home/peff/a.out+0x1150) (BuildId: 2cebfcd0a00064eaaed750af010fcecdae2f5666)
>   
>   0x7c7d11fe01b4 is located 0 bytes after 372-byte region [0x7c7d11fe0040,0x7c7d11fe01b4)
>   allocated by thread T0 here:
>       #0 0x7f4d1311a0ab in malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:67
>       #1 0x55cf89b1b3a0 in main (/home/peff/a.out+0x13a0) (BuildId: 2cebfcd0a00064eaaed750af010fcecdae2f5666)
>       #2 0x7f4d12e33ca7 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

Cool, thanks for the actual working example. Your patch makes perfect
sense now.

Collin

[1] https://www.gnu.org/software/libc/manual/html_node/Memory_002dmapped-I_002fO.html
[2] https://pubs.opengroup.org/onlinepubs/9799919799/functions/mmap.html

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 2/9] pack-bitmap: handle name-hash lookups in incremental bitmaps
  2025-11-12  8:01 ` [PATCH 2/9] pack-bitmap: handle name-hash lookups in incremental bitmaps Jeff King
  2025-11-12 11:25   ` Patrick Steinhardt
@ 2025-11-13  2:55   ` Taylor Blau
  2025-11-18  8:59     ` Jeff King
  1 sibling, 1 reply; 62+ messages in thread
From: Taylor Blau @ 2025-11-13  2:55 UTC (permalink / raw)
  To: Jeff King; +Cc: git, correctmost

On Wed, Nov 12, 2025 at 03:01:51AM -0500, Jeff King wrote:
> As always with the midx and bitmap code, I am left unsure of which
> ordering it is correct to use (pseudo-pack order, or lexical oid order,
> or how each splits across incremental files). I _think_ this is right
> because it's matching the ordering that is already used for a single
> midx. But clearly this area is under-tested, since even when we did not
> go off the end of the array we were probably passing back junk
> name-hashes (either from the .bitmap file's trailing checksum, or
> zero-padding at the end of the mapped page).

Yeah, this is the right order. "index_pos" is a good hint that this is
in lexical order. bitmap_writer_finish() has some oid_pos() lookups that
use index directly without sorting, so bitmap_writer_finish() expects
this array in lexical order.

Commit c528e17966 (pack-bitmap: write multi-pack bitmaps, 2021-08-31)
has a comment in (what is now) midx-write.c explaining this assumption
in bitmap_writer_finish(), but it should probably be documented
explicitly in pack-bitmap.h.

> So it might be worth adding more tests here, but I know this incremental
> bitmap code is a big work in progress. So I contented myself with the
> reproduction above, and anything else can go onto the incremental todo
> pile. :)

Yeah, I agree. The only hash-cache test that I could think of is from
t5326, which tests that we can propagate existing name-hash values from
a pack bitmap in to a MIDX one. We probably need an equivalent for when
writing an incremental MIDX/bitmap too. #leftoverbits

>  pack-bitmap.c | 27 +++++++++++++++++++++++----
>  1 file changed, 23 insertions(+), 4 deletions(-)
>
> diff --git a/pack-bitmap.c b/pack-bitmap.c
> index 291e1a9cf4..710b86a451 100644
> --- a/pack-bitmap.c
> +++ b/pack-bitmap.c
> @@ -213,6 +213,26 @@ static uint32_t bitmap_num_objects(struct bitmap_index *index)
>  	return index->pack->num_objects;
>  }
>
> +static uint32_t bitmap_name_hash(struct bitmap_index *index, uint32_t pos)
> +{
> +	if (bitmap_is_midx(index)) {
> +		while (index && pos < index->midx->num_objects_in_base)
> +			index = index->base;

Looks good. It's too bad that we have to reimplement something very
similar to midx_for_object(), but I agree with what you wrote in the
patch message and this faithfully captures that. It might be worth doing
something like:

    while (index && pos < index->midx->num_objects_in_base) {
        ASSERT(bitmap_is_midx(index));
        index = index->base;
    }

, which should never trigger, but is a good sanity check. Definitely not
worth re-rolling IMHO.

> +
> +		if (!index)
> +			BUG("NULL base bitmap for object position: %"PRIu32, pos);
> +
> +		pos -= index->midx->num_objects_in_base;
> +		if (pos >= index->midx->num_objects)
> +			BUG("out-of-bounds midx bitmap object at %"PRIu32, pos);

midx_for_object() spells this portion slightly differently, but what you
have here is still good.

> +	}
> +
> +	if (!index->hashes)
> +		return 0;
> +
> +	return get_be32(index->hashes + pos);

We *could* double check that that offset is within bounds of
index->map_size, and I think that is ultimately worth doing at some
point. But I think that stopping where you did makes sense, since it
does the minimal thing to fix this bug.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 4/9] cache-tree: avoid strtol() on non-string buffer
  2025-11-12 11:26   ` Patrick Steinhardt
@ 2025-11-13  3:09     ` Taylor Blau
  2025-11-18  8:40       ` Jeff King
  2025-11-18  8:38     ` Jeff King
  1 sibling, 1 reply; 62+ messages in thread
From: Taylor Blau @ 2025-11-13  3:09 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Jeff King, git, correctmost

On Wed, Nov 12, 2025 at 12:26:06PM +0100, Patrick Steinhardt wrote:
> Hm. I'm not a huge fan of not having any error handling at all. It just
> feels way too fragile for my taste:
>
>   - As you mention we don't detect overflows, as we would detect them at
>     a later point in time when trying to access index entries at invalid
>     offsets. But if the input is crafted in a way that the overflow ends
>     up with a reasonable index entry we might just as well _not_ detect
>     that an overflow has happened and end up using the wrong index
>     entry.
>
>   - We don't verify that we even have a number in the first place. We'd
>     simply return "0" in that case and not advance the pointer. This is
>     fine though as we verify that the returned size is non-zero, so we'd
>     detect this case.
>
> I'd much rather prefer to have an interface similar to `git_parse_int()`
> and related functions, which are way easier to use compared to the likes
> of `stroi()`.

Those git_parse_XYZ() functions all end up calling either
git_parse_signed() or git_parse_unsigned() under the hood, which bolts
on our k/m/g suffixes, which we probably don't want here when parsing an
on-disk format.

I don't have a strong opinion here, though I tend to agree with
Patrick's thinking above (with the exception of the suffix thing that I
pointed to earlier).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 3/9] Makefile: turn on NO_MMAP when building with ASan
  2025-11-12 11:26   ` Patrick Steinhardt
@ 2025-11-13  3:12     ` Taylor Blau
  2025-11-13  6:34       ` Patrick Steinhardt
  2025-11-18  8:49       ` Jeff King
  2025-11-13 16:30     ` Junio C Hamano
  1 sibling, 2 replies; 62+ messages in thread
From: Taylor Blau @ 2025-11-13  3:12 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Jeff King, git, correctmost

On Wed, Nov 12, 2025 at 12:26:13PM +0100, Patrick Steinhardt wrote:
> On Wed, Nov 12, 2025 at 03:02:15AM -0500, Jeff King wrote:
> > diff --git a/Makefile b/Makefile
> > index 7e0f77e298..0f44268405 100644
> > --- a/Makefile
> > +++ b/Makefile
> > @@ -1587,6 +1587,7 @@ SANITIZE_LEAK = YesCompiledWithIt
> >  endif
> >  ifneq ($(filter address,$(SANITIZERS)),)
> >  NO_REGEX = NeededForASAN
> > +NO_MMAP = NeededForASAN
> >  SANITIZE_ADDRESS = YesCompiledWithIt
> >  endif
> >  endif
>
> Let's also apply this to Meson. Thanks!

Not to derail us too far off topic, but... ;-)

I wonder what (if anything) our policy should be for keeping the
Makefile and Meson build scripts in sync. On the one hand, I do not want
the two of them to drift (too far) apart. But on the other, I am not
sure that everyone who may be touching the Makefile are necessarily
familiar enough to make the equivalent changes to the Meson build files.

I genuinely don't have a very strong opinion here or even really a clear
sense of what the right thing to do is. Just something that crossed my
mind while reading and figured I'd write down in case others had similar
thoughts.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 0/9] asan bonanza
  2025-11-12  7:55 [PATCH 0/9] asan bonanza Jeff King
                   ` (8 preceding siblings ...)
  2025-11-12  8:10 ` [PATCH 9/9] t: enable ASan's strict_string_checks option Jeff King
@ 2025-11-13  3:17 ` Taylor Blau
  2025-11-18  9:11 ` [PATCH v2 " Jeff King
  10 siblings, 0 replies; 62+ messages in thread
From: Taylor Blau @ 2025-11-13  3:17 UTC (permalink / raw)
  To: Jeff King; +Cc: git, correctmost

On Wed, Nov 12, 2025 at 02:55:22AM -0500, Jeff King wrote:
>   [1/9]: compat/mmap: mark unused argument in git_munmap()
>   [2/9]: pack-bitmap: handle name-hash lookups in incremental bitmaps
>   [3/9]: Makefile: turn on NO_MMAP when building with ASan
>   [4/9]: cache-tree: avoid strtol() on non-string buffer
>   [5/9]: fsck: assert newline presence in fsck_ident()
>   [6/9]: fsck: avoid strcspn() in fsck_ident()
>   [7/9]: fsck: remove redundant date timestamp check
>   [8/9]: fsck: avoid parse_timestamp() on buffer that isn't NUL-terminated
>   [9/9]: t: enable ASan's strict_string_checks option

Naturally I focused on the first two patches more than the others, but
the rest look good to me. I left one minor comment that you might
consider if you end up re-rolling, but I don't feel strongly about it.

I like Patrick's suggestion to use an interface similar to
git_parse_int() instead of introducing parse_long() as a strtol()
replacement. That may be worth a re-roll, especially because there are
two spots that would benefit from that style of interface. But I don't
feel strongly about it either way.

Like I mentioned earlier, I mostly glossed over the fsck patches, but
they all look reasonable to me.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 3/9] Makefile: turn on NO_MMAP when building with ASan
  2025-11-13  3:12     ` Taylor Blau
@ 2025-11-13  6:34       ` Patrick Steinhardt
  2025-11-18  8:49       ` Jeff King
  1 sibling, 0 replies; 62+ messages in thread
From: Patrick Steinhardt @ 2025-11-13  6:34 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jeff King, git, correctmost

On Wed, Nov 12, 2025 at 10:12:02PM -0500, Taylor Blau wrote:
> On Wed, Nov 12, 2025 at 12:26:13PM +0100, Patrick Steinhardt wrote:
> > On Wed, Nov 12, 2025 at 03:02:15AM -0500, Jeff King wrote:
> > > diff --git a/Makefile b/Makefile
> > > index 7e0f77e298..0f44268405 100644
> > > --- a/Makefile
> > > +++ b/Makefile
> > > @@ -1587,6 +1587,7 @@ SANITIZE_LEAK = YesCompiledWithIt
> > >  endif
> > >  ifneq ($(filter address,$(SANITIZERS)),)
> > >  NO_REGEX = NeededForASAN
> > > +NO_MMAP = NeededForASAN
> > >  SANITIZE_ADDRESS = YesCompiledWithIt
> > >  endif
> > >  endif
> >
> > Let's also apply this to Meson. Thanks!
> 
> Not to derail us too far off topic, but... ;-)
> 
> I wonder what (if anything) our policy should be for keeping the
> Makefile and Meson build scripts in sync. On the one hand, I do not want
> the two of them to drift (too far) apart. But on the other, I am not
> sure that everyone who may be touching the Makefile are necessarily
> familiar enough to make the equivalent changes to the Meson build files.

They might not be, true. That's basically the reason why I'm always on
the lookout for such changes and try to help them with diffs that they
can simply apply.

> I genuinely don't have a very strong opinion here or even really a clear
> sense of what the right thing to do is. Just something that crossed my
> mind while reading and figured I'd write down in case others had similar

I don't think there needs to be a policy any stricter than "Meson
pipelines need to stay green". So new code files and tests need to be
added, but that's easy enough to do even for somebody who isn't familiar
with Meson, I think.

Other changes like this one here are not really _that_ important in most
cases. Git compiles just fine without such a change, and any real-world
user is likely to not care. So I think it's fine to handle these on a
best effort basis.

Thanks!

Patrick

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 3/9] Makefile: turn on NO_MMAP when building with ASan
  2025-11-12 11:26   ` Patrick Steinhardt
  2025-11-13  3:12     ` Taylor Blau
@ 2025-11-13 16:30     ` Junio C Hamano
  2025-11-14  7:00       ` Patrick Steinhardt
  1 sibling, 1 reply; 62+ messages in thread
From: Junio C Hamano @ 2025-11-13 16:30 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Jeff King, git, correctmost, Taylor Blau

Patrick Steinhardt <ps@pks.im> writes:

> On Wed, Nov 12, 2025 at 03:02:15AM -0500, Jeff King wrote:
>> diff --git a/Makefile b/Makefile
>> index 7e0f77e298..0f44268405 100644
>> --- a/Makefile
>> +++ b/Makefile
>> @@ -1587,6 +1587,7 @@ SANITIZE_LEAK = YesCompiledWithIt
>>  endif
>>  ifneq ($(filter address,$(SANITIZERS)),)
>>  NO_REGEX = NeededForASAN
>> +NO_MMAP = NeededForASAN
>>  SANITIZE_ADDRESS = YesCompiledWithIt
>>  endif
>>  endif
>
> Let's also apply this to Meson. Thanks!
>
> Patrick

Do you two want me to squash this into the Makefile patch?

>
> diff --git a/meson.build b/meson.build
> index ad4eb2c4fa..668f8769d2 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -1408,12 +1408,18 @@ if host_machine.system() == 'windows'
>    libgit_c_args += '-DUSE_WIN32_MMAP'
>  else
>    checkfuncs += {
> -    'mmap' : ['mmap.c'],
>      # provided by compat/mingw.c.
>      'unsetenv' : ['unsetenv.c'],
>      # provided by compat/mingw.c.
>      'getpagesize' : [],
>    }
> +
> +  if get_option('b_sanitize').contains('address')
> +    libgit_c_args += '-DNO_MMAP'
> +    libgit_sources += 'compat/mmap.c'
> +  else
> +    checkfuncs += { 'mmap': ['mmap.c'] }
> +  endif
>  endif
>  
>  foreach func, impls : checkfuncs

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 3/9] Makefile: turn on NO_MMAP when building with ASan
  2025-11-13 16:30     ` Junio C Hamano
@ 2025-11-14  7:00       ` Patrick Steinhardt
  2025-11-15  2:13         ` Jeff King
  0 siblings, 1 reply; 62+ messages in thread
From: Patrick Steinhardt @ 2025-11-14  7:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git, correctmost, Taylor Blau

On Thu, Nov 13, 2025 at 08:30:03AM -0800, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > On Wed, Nov 12, 2025 at 03:02:15AM -0500, Jeff King wrote:
> >> diff --git a/Makefile b/Makefile
> >> index 7e0f77e298..0f44268405 100644
> >> --- a/Makefile
> >> +++ b/Makefile
> >> @@ -1587,6 +1587,7 @@ SANITIZE_LEAK = YesCompiledWithIt
> >>  endif
> >>  ifneq ($(filter address,$(SANITIZERS)),)
> >>  NO_REGEX = NeededForASAN
> >> +NO_MMAP = NeededForASAN
> >>  SANITIZE_ADDRESS = YesCompiledWithIt
> >>  endif
> >>  endif
> >
> > Let's also apply this to Meson. Thanks!
> >
> > Patrick
> 
> Do you two want me to squash this into the Makefile patch?

I feel like there's going to be a revised version of this series anyway,
so that's probably not necessary. Thanks!

Patrick

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 8/9] fsck: avoid parse_timestamp() on buffer that isn't NUL-terminated
  2025-11-12 11:25   ` Patrick Steinhardt
  2025-11-12 19:36     ` Junio C Hamano
@ 2025-11-15  2:12     ` Jeff King
  1 sibling, 0 replies; 62+ messages in thread
From: Jeff King @ 2025-11-15  2:12 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, correctmost, Taylor Blau

On Wed, Nov 12, 2025 at 12:25:59PM +0100, Patrick Steinhardt wrote:

> On Wed, Nov 12, 2025 at 03:10:40AM -0500, Jeff King wrote:
> > In fsck_ident(), we parse the timestamp with parse_timestamp(), which is
> > really an alias for strtoumax(). But since our buffer may not be
> > NUL-terminated, this can trigger a complaint from ASan's
> > strict_string_checks mode. This is a false positive, since we know that
> > the buffer contains a trailing newline (which we checked earlier in the
> > function), and that strtoumax() would stop there.
> > 
> > But it is worth working around ASan's complaint. One is because that
> > will let us turn on strict_string_checks by default, which has helped
> > catch other real problems. And two is that the safety of the current
> > code is very hard to reason about (it subtly depends on distant code
> > which could change).
> > 
> > One option here is to just parse the number left-to-right ourselves. But
> > we care about the size of a timestamp_t and detecting overflow, since
> > that's part of the point of these checks. And doing that correctly is
> > tricky. So we'll instead just pull the digits into a separate,
> > NUL-terminated buffer, and use that to call parse_timestamp().
> 
> So this is another site that would benefit from having something like
> `git_parse_int()` with an extra parameter indicating the number of
> bytes available for parsing (and a way to disable unit factors).

Yes, but also no.

Yes, in the sense that if we had a robust global function to parse an
integer from a buf we could use it here, as well as in cache-tree.

But there are lots of no's:

  - We could not have one such function, because the implementation
    would differ based on signedness and size of the integer type. And
    cache-tree is a signed long, whereas this is a uintmax_t. We can
    factor out some of the work with a helper that takes a max
    parameter, but you can see we duplicate a bunch of code between
    git_parse_signed() and git_parse_unsigned().

  - The interface for git_parse_int() isn't quite a match. It wants to
    parse every byte in the provided string and complains if there is
    any extra cruft, rather than aiding in progressive parsing of a
    buffer. So if you have a string "10 20\0", it will not just parse
    "10" and then tell you how far it parsed; it will barf on the space.
    If we added a new parameter for "this is how many bytes we have",
    and you fed it the buffer "10 20" and the length 5, it would have
    the same problem.

    So there's a fundamental mismatch between "parse this as an integer
    and complain if there is anything else" versus "parse an integer,
    advance the pointer, and we'll keep going".

  - The implementation for git_parse_int() isn't a match either. It is
    just asking strtoimax() to do all of the work, which is the very
    thing we need to avoid.

So I think one _could_ write a strtoimax() replacement that handled
everything we wanted, and then you could probably build
git_parse_signed() etc around that. But it would be a lot more work than
what's there (like checking overflow progressively as we multiply and
add), and there are some decisions to be made (like handling leading
whitespace or how +/- work on unsigned integers). I'd probably err more
on the side of simplicity and strictness than strtol() does, but that
also means that plugging in the new function would change user-visible
behavior. Sometimes in a good or OK way (stricter parsing), but maybe
sometimes in a bad or confusing one (rejecting inputs that used to
work).

-Peff

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 3/9] Makefile: turn on NO_MMAP when building with ASan
  2025-11-14  7:00       ` Patrick Steinhardt
@ 2025-11-15  2:13         ` Jeff King
  0 siblings, 0 replies; 62+ messages in thread
From: Jeff King @ 2025-11-15  2:13 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Junio C Hamano, git, correctmost, Taylor Blau

On Fri, Nov 14, 2025 at 08:00:48AM +0100, Patrick Steinhardt wrote:

> > Do you two want me to squash this into the Makefile patch?
> 
> I feel like there's going to be a revised version of this series anyway,
> so that's probably not necessary. Thanks!

Yeah, I'll re-roll (but probably not tonight) and will squash it in.

-Peff

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 4/9] cache-tree: avoid strtol() on non-string buffer
  2025-11-12 11:26   ` Patrick Steinhardt
  2025-11-13  3:09     ` Taylor Blau
@ 2025-11-18  8:38     ` Jeff King
  1 sibling, 0 replies; 62+ messages in thread
From: Jeff King @ 2025-11-18  8:38 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, correctmost, Taylor Blau

On Wed, Nov 12, 2025 at 12:26:06PM +0100, Patrick Steinhardt wrote:

> > +static long parse_long(const char **ptr, unsigned long *len_p)
> > +{
> > +	const char *s = *ptr;
> > +	unsigned long len = *len_p;
> > +	long ret = 0;
> > +	int sign = 1;
> > +
> > +	while (len && *s == '-') {
> > +		sign *= -1;
> > +		s++;
> > +		len--;
> > +	}
> > +
> > +	while (len) {
> > +		if (!isdigit(*s))
> > +			break;
> > +		ret *= 10;
> > +		ret += *s - '0';
> > +		s++;
> > +		len--;
> > +	}
> > +	*ptr = s;
> > +	*len_p = len;
> > +	return sign * ret;
> > +}
> 
> Hm. I'm not a huge fan of not having any error handling at all. It just
> feels way too fragile for my taste:
> 
>   - As you mention we don't detect overflows, as we would detect them at
>     a later point in time when trying to access index entries at invalid
>     offsets. But if the input is crafted in a way that the overflow ends
>     up with a reasonable index entry we might just as well _not_ detect
>     that an overflow has happened and end up using the wrong index
>     entry.

Yes, but this is true of the original code as well. It does not bother
to check if strtol() saw overflow (and in fact, it is using "int" and
not "long", so it would need to do its own overflow check on top).

But see below.

>   - We don't verify that we even have a number in the first place. We'd
>     simply return "0" in that case and not advance the pointer. This is
>     fine though as we verify that the returned size is non-zero, so we'd
>     detect this case.

I think "0" is accepted at least for it->entry_count. The original did
check that strtol() advanced "ep", and I dropped that. As with the
overflow, it's not a memory safety issue, but it changes how we'd react
to garbage input.

For v2 I've tweaked the interface to return an error code from the
helper, and it will complain if there's no input at all. I didn't add in
overflow checks, as they weren't there originally and are a little
tricky to get right (and I wanted to focus on the memory safety issue).
But they could be easily added to the helper on top of my patches.

-Peff

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 4/9] cache-tree: avoid strtol() on non-string buffer
  2025-11-13  3:09     ` Taylor Blau
@ 2025-11-18  8:40       ` Jeff King
  0 siblings, 0 replies; 62+ messages in thread
From: Jeff King @ 2025-11-18  8:40 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Patrick Steinhardt, git, correctmost

On Wed, Nov 12, 2025 at 10:09:22PM -0500, Taylor Blau wrote:

> On Wed, Nov 12, 2025 at 12:26:06PM +0100, Patrick Steinhardt wrote:
> > Hm. I'm not a huge fan of not having any error handling at all. It just
> > feels way too fragile for my taste:
> >
> >   - As you mention we don't detect overflows, as we would detect them at
> >     a later point in time when trying to access index entries at invalid
> >     offsets. But if the input is crafted in a way that the overflow ends
> >     up with a reasonable index entry we might just as well _not_ detect
> >     that an overflow has happened and end up using the wrong index
> >     entry.
> >
> >   - We don't verify that we even have a number in the first place. We'd
> >     simply return "0" in that case and not advance the pointer. This is
> >     fine though as we verify that the returned size is non-zero, so we'd
> >     detect this case.
> >
> > I'd much rather prefer to have an interface similar to `git_parse_int()`
> > and related functions, which are way easier to use compared to the likes
> > of `stroi()`.
> 
> Those git_parse_XYZ() functions all end up calling either
> git_parse_signed() or git_parse_unsigned() under the hood, which bolts
> on our k/m/g suffixes, which we probably don't want here when parsing an
> on-disk format.

It's much worse than that. They are just wrappers around strtoimax(),
etc, themselves. So we cannot use them for a non-string buffer, and have
to start from scratch (see my other reply).

-Peff

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 3/9] Makefile: turn on NO_MMAP when building with ASan
  2025-11-13  3:12     ` Taylor Blau
  2025-11-13  6:34       ` Patrick Steinhardt
@ 2025-11-18  8:49       ` Jeff King
  1 sibling, 0 replies; 62+ messages in thread
From: Jeff King @ 2025-11-18  8:49 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Patrick Steinhardt, git, correctmost

On Wed, Nov 12, 2025 at 10:12:02PM -0500, Taylor Blau wrote:

> I wonder what (if anything) our policy should be for keeping the
> Makefile and Meson build scripts in sync. On the one hand, I do not want
> the two of them to drift (too far) apart. But on the other, I am not
> sure that everyone who may be touching the Makefile are necessarily
> familiar enough to make the equivalent changes to the Meson build files.
> 
> I genuinely don't have a very strong opinion here or even really a clear
> sense of what the right thing to do is. Just something that crossed my
> mind while reading and figured I'd write down in case others had similar
> thoughts.

My personal preference is for people who care about meson to just post
their own meson patches adding the same features. Either as a separate
series (collecting several such features as appropriate), or as a
complete patch that the maintainer can pick up on top of the series in
question.

Posting something squashable (as Patrick did here) is almost as good
(and certainly better than tasking random folks with figuring out how
meson works). But I'm hesitant for reviews of random topics to include
"you should re-roll with my proposed meson changes". It's extra work for
contributors who don't care about meson, and it risks de-railing the
topic if the changes are non-trivial.

I'll admit I'm possibly biased and being selfish there, because I do not
care about meson myself and have mostly found its addition to the
project to be a hassle.

-Peff

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 2/9] pack-bitmap: handle name-hash lookups in incremental bitmaps
  2025-11-13  2:55   ` Taylor Blau
@ 2025-11-18  8:59     ` Jeff King
  0 siblings, 0 replies; 62+ messages in thread
From: Jeff King @ 2025-11-18  8:59 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, correctmost

On Wed, Nov 12, 2025 at 09:55:03PM -0500, Taylor Blau wrote:

> On Wed, Nov 12, 2025 at 03:01:51AM -0500, Jeff King wrote:
> > As always with the midx and bitmap code, I am left unsure of which
> > ordering it is correct to use (pseudo-pack order, or lexical oid order,
> > or how each splits across incremental files). I _think_ this is right
> > because it's matching the ordering that is already used for a single
> > midx. But clearly this area is under-tested, since even when we did not
> > go off the end of the array we were probably passing back junk
> > name-hashes (either from the .bitmap file's trailing checksum, or
> > zero-padding at the end of the mapped page).
> 
> Yeah, this is the right order. "index_pos" is a good hint that this is
> in lexical order. bitmap_writer_finish() has some oid_pos() lookups that
> use index directly without sorting, so bitmap_writer_finish() expects
> this array in lexical order.

OK, that matches my analysis. I guess I was just a little surprised that
the name hash is in lexical index order, and not pack order. But it
definitely is according to the documentation and the implementation. I
guess in the end it doesn't really matter that much either way, as you
tend to reverse the pack/bit position into a lexical index position
anyway to get the oid. So there is no situation where you don't have
both anyway.

> Commit c528e17966 (pack-bitmap: write multi-pack bitmaps, 2021-08-31)
> has a comment in (what is now) midx-write.c explaining this assumption
> in bitmap_writer_finish(), but it should probably be documented
> explicitly in pack-bitmap.h.

Maybe, but I think I may just have been overly paranoid that I got it
wrong.

> > +static uint32_t bitmap_name_hash(struct bitmap_index *index, uint32_t pos)
> > +{
> > +	if (bitmap_is_midx(index)) {
> > +		while (index && pos < index->midx->num_objects_in_base)
> > +			index = index->base;
> 
> Looks good. It's too bad that we have to reimplement something very
> similar to midx_for_object(), but I agree with what you wrote in the
> patch message and this faithfully captures that. It might be worth doing
> something like:
> 
>     while (index && pos < index->midx->num_objects_in_base) {
>         ASSERT(bitmap_is_midx(index));
>         index = index->base;
>     }
> 
> , which should never trigger, but is a good sanity check. Definitely not
> worth re-rolling IMHO.

Yeah, I wondered the same thing while writing it. It would be a pretty
horrid bug to have mixed entries in the linked list. But that is also
what assertions are there for. ;) I added it for v2.

> > +		if (!index)
> > +			BUG("NULL base bitmap for object position: %"PRIu32, pos);
> > +
> > +		pos -= index->midx->num_objects_in_base;
> > +		if (pos >= index->midx->num_objects)
> > +			BUG("out-of-bounds midx bitmap object at %"PRIu32, pos);
> 
> midx_for_object() spells this portion slightly differently, but what you
> have here is still good.

Yes, there it's a die(). But elsewhere, like in pack_pos_to_midx() and
its reverse, the same situation is a BUG(). It's not clear to me we get
a bit or index position that is out of bounds here (is it truly a bug or
programming error, or might we get it from a corrupt on-disk file). So I
think it's mostly academic, at least until somebody can generate a real
corrupted case.

> > +	if (!index->hashes)
> > +		return 0;
> > +
> > +	return get_be32(index->hashes + pos);
> 
> We *could* double check that that offset is within bounds of
> index->map_size, and I think that is ultimately worth doing at some
> point. But I think that stopping where you did makes sense, since it
> does the minimal thing to fix this bug.

I don't think we need to. When we open the bitmap, we check that its
hash-cache size matches our expectation based on the number of objects
covered by the bitmap (using bitmap_num_objects(), so either from the
pack's count or the midx slice's count).

-Peff

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH v2 0/9] asan bonanza
  2025-11-12  7:55 [PATCH 0/9] asan bonanza Jeff King
                   ` (9 preceding siblings ...)
  2025-11-13  3:17 ` [PATCH 0/9] asan bonanza Taylor Blau
@ 2025-11-18  9:11 ` Jeff King
  2025-11-18  9:11   ` [PATCH v2 1/9] compat/mmap: mark unused argument in git_munmap() Jeff King
                     ` (9 more replies)
  10 siblings, 10 replies; 62+ messages in thread
From: Jeff King @ 2025-11-18  9:11 UTC (permalink / raw)
  To: git; +Cc: Patrick Steinhardt, correctmost, Taylor Blau

On Wed, Nov 12, 2025 at 02:55:22AM -0500, Jeff King wrote:

> This series fixes a handful of issues that ASan finds in our test suite
> if we tweak a few options to let it look deeper.

Here's a v2 based on feedback:

  - added the extra assertion in the midx code

  - meson changes are squashed into patch 3

  - The cache-tree integer parsing is more robust around total garbage
    inputs (with no digits at all). I agree with the reviewers that it
    would be nice to have a robust, reusable integer parsing function.
    But I think it's non-trivial to do (and I left more comments in the
    thread). I'd like to stick here to just fixing the memory issues
    without making anything worse (which I think this version does).

    Note that since the new helper takes an out-parameter, we have to
    match the type more strictly to what the callers have. So it is now
    parse_int(), and not parse_long().

Range diff is below.

  [1/9]: compat/mmap: mark unused argument in git_munmap()
  [2/9]: pack-bitmap: handle name-hash lookups in incremental bitmaps
  [3/9]: Makefile: turn on NO_MMAP when building with ASan
  [4/9]: cache-tree: avoid strtol() on non-string buffer
  [5/9]: fsck: assert newline presence in fsck_ident()
  [6/9]: fsck: avoid strcspn() in fsck_ident()
  [7/9]: fsck: remove redundant date timestamp check
  [8/9]: fsck: avoid parse_timestamp() on buffer that isn't NUL-terminated
  [9/9]: t: enable ASan's strict_string_checks option

 Makefile      |  1 +
 cache-tree.c  | 50 ++++++++++++++++++++++++++----------
 compat/mmap.c |  2 +-
 fsck.c        | 71 ++++++++++++++++++++++++++++++++++++---------------
 meson.build   |  8 +++++-
 pack-bitmap.c | 29 ++++++++++++++++++---
 t/test-lib.sh |  1 +
 7 files changed, 122 insertions(+), 40 deletions(-)

 1:  e24015d41b =  1:  3ce5bd39b5 compat/mmap: mark unused argument in git_munmap()
 2:  e217fb0e3b !  2:  9908283c33 pack-bitmap: handle name-hash lookups in incremental bitmaps
    @@ pack-bitmap.c: static uint32_t bitmap_num_objects(struct bitmap_index *index)
     +static uint32_t bitmap_name_hash(struct bitmap_index *index, uint32_t pos)
     +{
     +	if (bitmap_is_midx(index)) {
    -+		while (index && pos < index->midx->num_objects_in_base)
    ++		while (index && pos < index->midx->num_objects_in_base) {
    ++			ASSERT(bitmap_is_midx(index));
     +			index = index->base;
    ++		}
     +
     +		if (!index)
     +			BUG("NULL base bitmap for object position: %"PRIu32, pos);
 3:  8c85dad3c5 <  -:  ---------- Makefile: turn on NO_MMAP when building with ASan
 -:  ---------- >  3:  fe3421f6ec Makefile: turn on NO_MMAP when building with ASan
 4:  38d42984da !  4:  5e228f2c90 cache-tree: avoid strtol() on non-string buffer
    @@ Commit message
              further. You'd mostly get stopped by seeing non-digits in the oid
              field (and if it is likewise truncated, there will still be 20 or
              more bytes of the index checksum). So it's possible, though
    -         unlikely, to see read off the end of the mmap'd buffer. Of course a
    +         unlikely, to read off the end of the mmap'd buffer. Of course a
              malicious index file can fake the oid and the index checksum to all
              (ASCII) 0's.
     
    @@ cache-tree.c: void cache_tree_write(struct strbuf *sb, struct cache_tree *root)
      	trace2_region_leave("cache_tree", "write", the_repository);
      }
      
    -+static long parse_long(const char **ptr, unsigned long *len_p)
    ++static int parse_int(const char **ptr, unsigned long *len_p, int *out)
     +{
     +	const char *s = *ptr;
     +	unsigned long len = *len_p;
    -+	long ret = 0;
    ++	int ret = 0;
     +	int sign = 1;
     +
     +	while (len && *s == '-') {
    @@ cache-tree.c: void cache_tree_write(struct strbuf *sb, struct cache_tree *root)
     +		s++;
     +		len--;
     +	}
    ++
    ++	if (s == *ptr)
    ++		return -1;
    ++
     +	*ptr = s;
     +	*len_p = len;
    -+	return sign * ret;
    ++	*out = sign * ret;
    ++	return 0;
     +}
     +
      static struct cache_tree *read_one(const char **buffer, unsigned long *size_p)
    @@ cache-tree.c: static struct cache_tree *read_one(const char **buffer, unsigned l
     -	cp = buf;
     -	it->entry_count = strtol(cp, &ep, 10);
     -	if (cp == ep)
    -+	it->entry_count = parse_long(&buf, &size);
    -+	if (!size || *buf != ' ')
    ++	if (parse_int(&buf, &size, &it->entry_count) < 0)
      		goto free_return;
     -	cp = ep;
     -	subtree_nr = strtol(cp, &ep, 10);
     -	if (cp == ep)
    --		goto free_return;
    ++	if (!size || *buf != ' ')
    + 		goto free_return;
     -	while (size && *buf && *buf != '\n') {
     -		size--;
     -		buf++;
     -	}
     -	if (!size)
     +	buf++; size--;
    -+	subtree_nr = parse_long(&buf, &size);
    ++	if (parse_int(&buf, &size, &subtree_nr) < 0)
    ++		goto free_return;
     +	if (!size || *buf != '\n')
      		goto free_return;
      	buf++; size--;
 5:  73e921a34e =  5:  1d6814233c fsck: assert newline presence in fsck_ident()
 6:  95e8961df9 =  6:  8cf8152449 fsck: avoid strcspn() in fsck_ident()
 7:  34baa85dae =  7:  563c3006e4 fsck: remove redundant date timestamp check
 8:  f5ff2dc8ef =  8:  6f88309d76 fsck: avoid parse_timestamp() on buffer that isn't NUL-terminated
 9:  1b5c0e7ce7 =  9:  ad1a1f6a82 t: enable ASan's strict_string_checks option

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH v2 1/9] compat/mmap: mark unused argument in git_munmap()
  2025-11-18  9:11 ` [PATCH v2 " Jeff King
@ 2025-11-18  9:11   ` Jeff King
  2025-11-18  9:12   ` [PATCH v2 2/9] pack-bitmap: handle name-hash lookups in incremental bitmaps Jeff King
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 62+ messages in thread
From: Jeff King @ 2025-11-18  9:11 UTC (permalink / raw)
  To: git; +Cc: Patrick Steinhardt, correctmost, Taylor Blau

Our mmap compat code emulates mapping by using malloc/free. Our
git_munmap() must take a "length" parameter to match the interface of
munmap(), but we don't use it (it is up to the allocator to know how big
the block is in free()).

Let's mark it as UNUSED to avoid complaints from -Wunused-parameter.
Otherwise you cannot build with "make DEVELOPER=1 NO_MMAP=1".

Signed-off-by: Jeff King <peff@peff.net>
---
 compat/mmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/compat/mmap.c b/compat/mmap.c
index 2fe1c7732e..1a118711f7 100644
--- a/compat/mmap.c
+++ b/compat/mmap.c
@@ -38,7 +38,7 @@ void *git_mmap(void *start, size_t length, int prot, int flags, int fd, off_t of
 	return start;
 }
 
-int git_munmap(void *start, size_t length)
+int git_munmap(void *start, size_t length UNUSED)
 {
 	free(start);
 	return 0;
-- 
2.52.0.278.gadc6434dc3


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v2 2/9] pack-bitmap: handle name-hash lookups in incremental bitmaps
  2025-11-18  9:11 ` [PATCH v2 " Jeff King
  2025-11-18  9:11   ` [PATCH v2 1/9] compat/mmap: mark unused argument in git_munmap() Jeff King
@ 2025-11-18  9:12   ` Jeff King
  2025-11-18  9:12   ` [PATCH v2 3/9] Makefile: turn on NO_MMAP when building with ASan Jeff King
                     ` (7 subsequent siblings)
  9 siblings, 0 replies; 62+ messages in thread
From: Jeff King @ 2025-11-18  9:12 UTC (permalink / raw)
  To: git; +Cc: Patrick Steinhardt, correctmost, Taylor Blau

If a bitmap has a name-hash cache, it is an array of 32-bit integers,
one per entry in the bitmap, which we've mmap'd from the .bitmap file.
We access it directly like this:

    if (bitmap_git->hashes)
            hash = get_be32(bitmap_git->hashes + index_pos);

That works for both regular pack bitmaps and for non-incremental midx
bitmaps. There is one bitmap_index with one "hashes" array, and
index_pos is within its bounds (we do the bounds-checking when we load
the bitmap).

But for an incremental midx bitmap, we have a linked list of
bitmap_index structs, and each one has only its own small slice of the
name-hash array. If index_pos refers to an object that is not in the
first bitmap_git of the chain, then we'll access memory outside of the
bounds of its "hashes" array, and often outside of the mmap.

Instead, we should walk through the list until we find the bitmap_index
which serves our index_pos, and use its hash (after adjusting index_pos
to make it relative to the slice we found). This is exactly what we do
elsewhere for incremental midx lookups (like the pack_pos_to_midx() call
a few lines above). But we can't use existing helpers like
midx_for_object() here, because we're walking through the chain of
bitmap_index structs (each of which refers to a midx), not the chain of
incremental multi_pack_index structs themselves.

The problem is triggered in the test suite, but we don't get a segfault
because the out-of-bounds index is too small. The OS typically rounds
our mmap up to the nearest page size, so we just end up accessing some
extra zero'd memory. Nor do we catch it with ASan, since it doesn't seem
to instrument mmaps at all. But if we build with NO_MMAP, then our maps
are replaced with heap allocations, which ASan does check. And so:

  make NO_MMAP=1 SANITIZE=address
  cd t
  ./t5334-incremental-multi-pack-index.sh

does show the problem (and this patch makes it go away).

Signed-off-by: Jeff King <peff@peff.net>
---
 pack-bitmap.c | 29 +++++++++++++++++++++++++----
 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 291e1a9cf4..8ca79725b1 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -213,6 +213,28 @@ static uint32_t bitmap_num_objects(struct bitmap_index *index)
 	return index->pack->num_objects;
 }

+static uint32_t bitmap_name_hash(struct bitmap_index *index, uint32_t pos)
+{
+	if (bitmap_is_midx(index)) {
+		while (index && pos < index->midx->num_objects_in_base) {
+			ASSERT(bitmap_is_midx(index));
+			index = index->base;
+		}
+
+		if (!index)
+			BUG("NULL base bitmap for object position: %"PRIu32, pos);
+
+		pos -= index->midx->num_objects_in_base;
+		if (pos >= index->midx->num_objects)
+			BUG("out-of-bounds midx bitmap object at %"PRIu32, pos);
+	}
+
+	if (!index->hashes)
+		return 0;
+
+	return get_be32(index->hashes + pos);
+}
+
 static struct repository *bitmap_repo(struct bitmap_index *bitmap_git)
 {
 	if (bitmap_is_midx(bitmap_git))
@@ -1724,8 +1746,7 @@ static void show_objects_for_type(
 				pack = bitmap_git->pack;
 			}

-			if (bitmap_git->hashes)
-				hash = get_be32(bitmap_git->hashes + index_pos);
+			hash = bitmap_name_hash(bitmap_git, index_pos);

 			show_reach(&oid, object_type, 0, hash, pack, ofs, payload);
 		}
@@ -3124,8 +3145,8 @@ uint32_t *create_bitmap_mapping(struct bitmap_index *bitmap_git,

 		if (oe) {
 			reposition[i] = oe_in_pack_pos(mapping, oe) + 1;
-			if (bitmap_git->hashes && !oe->hash)
-				oe->hash = get_be32(bitmap_git->hashes + index_pos);
+			if (!oe->hash)
+				oe->hash = bitmap_name_hash(bitmap_git, index_pos);
 		}
 	}

-- 
2.52.0.278.gadc6434dc3

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v2 3/9] Makefile: turn on NO_MMAP when building with ASan
  2025-11-18  9:11 ` [PATCH v2 " Jeff King
  2025-11-18  9:11   ` [PATCH v2 1/9] compat/mmap: mark unused argument in git_munmap() Jeff King
  2025-11-18  9:12   ` [PATCH v2 2/9] pack-bitmap: handle name-hash lookups in incremental bitmaps Jeff King
@ 2025-11-18  9:12   ` Jeff King
  2025-11-18  9:12   ` [PATCH v2 4/9] cache-tree: avoid strtol() on non-string buffer Jeff King
                     ` (6 subsequent siblings)
  9 siblings, 0 replies; 62+ messages in thread
From: Jeff King @ 2025-11-18  9:12 UTC (permalink / raw)
  To: git; +Cc: Patrick Steinhardt, correctmost, Taylor Blau

Git often uses mmap() to access on-disk files. This leaves a blind spot
in our SANITIZE=address builds, since ASan does not seem to handle mmap
at all. Nor does the OS notice most out-of-bounds access, since it tends
to round up to the nearest page size (so depending on how big the map
is, you might have to overrun it by up to 4095 bytes to trigger a
segfault).

The previous commit demonstrates a memory bug that we missed. We could
have made a new test where the out-of-bounds access was much larger, or
where the mapped file ended closer to a page boundary. But the point of
running the test suite with sanitizers is to catch these problems
without having to construct specific tests.

Let's enable NO_MMAP for our ASan builds by default, which should give
us better coverage. This does increase the memory usage of Git, since
we're copying from the filesystem into heap. But the repositories in the
test suite tend to be small, so the overhead isn't really noticeable
(and ASan already has quite a performance penalty).

There are a few other known bugs that this patch will help flush out.
However, they aren't directly triggered in the test suite (yet). So
it's safe to turn this on now without breaking the test suite, which
will help us add new tests to demonstrate those other bugs as we fix
them.

Signed-off-by: Jeff King <peff@peff.net>
---
 Makefile    | 1 +
 meson.build | 8 +++++++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index 7e0f77e298..0f44268405 100644
--- a/Makefile
+++ b/Makefile
@@ -1587,6 +1587,7 @@ SANITIZE_LEAK = YesCompiledWithIt
 endif
 ifneq ($(filter address,$(SANITIZERS)),)
 NO_REGEX = NeededForASAN
+NO_MMAP = NeededForASAN
 SANITIZE_ADDRESS = YesCompiledWithIt
 endif
 endif
diff --git a/meson.build b/meson.build
index 1f95a06edb..f1b3615659 100644
--- a/meson.build
+++ b/meson.build
@@ -1411,12 +1411,18 @@ if host_machine.system() == 'windows'
   libgit_c_args += '-DUSE_WIN32_MMAP'
 else
   checkfuncs += {
-    'mmap' : ['mmap.c'],
     # provided by compat/mingw.c.
     'unsetenv' : ['unsetenv.c'],
     # provided by compat/mingw.c.
     'getpagesize' : [],
   }
+
+  if get_option('b_sanitize').contains('address')
+    libgit_c_args += '-DNO_MMAP'
+    libgit_sources += 'compat/mmap.c'
+  else
+    checkfuncs += { 'mmap': ['mmap.c'] }
+  endif
 endif

 foreach func, impls : checkfuncs
-- 
2.52.0.278.gadc6434dc3

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v2 4/9] cache-tree: avoid strtol() on non-string buffer
  2025-11-18  9:11 ` [PATCH v2 " Jeff King
                     ` (2 preceding siblings ...)
  2025-11-18  9:12   ` [PATCH v2 3/9] Makefile: turn on NO_MMAP when building with ASan Jeff King
@ 2025-11-18  9:12   ` Jeff King
  2025-11-18 14:30     ` Phillip Wood
  2025-11-18  9:12   ` [PATCH v2 5/9] fsck: assert newline presence in fsck_ident() Jeff King
                     ` (5 subsequent siblings)
  9 siblings, 1 reply; 62+ messages in thread
From: Jeff King @ 2025-11-18  9:12 UTC (permalink / raw)
  To: git; +Cc: Patrick Steinhardt, correctmost, Taylor Blau

A cache-tree extension entry in the index looks like this:

  <name> NUL <entry_nr> SPACE <subtree_nr> NEWLINE <binary_oid>

where the "_nr" items are human-readable base-10 ASCII. We parse them
with strtol(), even though we do not have a NUL-terminated string (we'd
generally have an mmap() of the on-disk index file). For a well-formed
entry, this is not a problem; strtol() will stop when it sees the
newline. But there are two problems:

  1. A corrupted entry could omit the newline, causing us to read
     further. You'd mostly get stopped by seeing non-digits in the oid
     field (and if it is likewise truncated, there will still be 20 or
     more bytes of the index checksum). So it's possible, though
     unlikely, to read off the end of the mmap'd buffer. Of course a
     malicious index file can fake the oid and the index checksum to all
     (ASCII) 0's.

     This is further complicated by the fact that mmap'd buffers tend to
     be zero-padded up to the page boundary. So to run off the end, the
     index size also has to be a multiple of the page size. This is also
     unlikely, though you can construct a malicious index file that
     matches this.

     The security implications aren't too interesting. The index file is
     a local file anyway (so you can't attack somebody by cloning, but
     only if you convince them to operate in a .git directory you made,
     at which point attacking .git/config is much easier). And it's just
     a read overflow via strtol(), which is unlikely to buy you much
     beyond a crash.

  2. ASan has a strict_string_checks option, which tells it to make sure
     that options to string functions (like strtol) have some eventual
     NUL, without regard to what the function would actually do (like
     stopping at a newline here). This option sometimes has false
     positives, but it can point to sketchy areas (like this one) where
     the input we use doesn't exhibit a problem, but different input
     _could_ cause us to misbehave.

Let's fix it by just parsing the values ourselves with a helper function
that is careful not to go past the end of the buffer. There are a few
behavior changes here that should not matter:

  - We do not consider overflow, as strtol() would. But nor did the
    original code. However, we don't trust the value we get from the
    on-disk file, and if it says to read 2^30 entries, we would notice
    that we do not have that many and bail before reading off the end of
    the buffer.

  - Our helper does not skip past extra leading whitespace as strtol()
    would, but according to gitformat-index(5) there should not be any.

  - The original quit parsing at a newline or a NUL byte, but now we
    insist on a newline (which is what the documentation says, and what
    Git has always produced).

Since we are providing our own helper function, we can tweak the
interface a bit to make our lives easier. The original code does not use
strtol's "end" pointer to find the end of the parsed data, but rather
uses a separate loop to advance our "buf" pointer to the trailing
newline. We can instead provide a helper that advances "buf" as it
parses, letting us read strictly left-to-right through the buffer.

I didn't add a new test here. It's surprisingly difficult to construct
an index of exactly the right size due to the way we pad entries. But it
is easy to trigger the problem in existing tests when using ASan's
strict string checking, coupled with a recent change to use NO_MMAP with
ASan builds. So:

  make SANITIZE=address
  cd t
  ASAN_OPTIONS=strict_string_checks=1 ./t0090-cache-tree.sh

triggers it reliably. Technically it is not deterministic because there
is ~8% chance (it's 1-(255/256)^20, or ^32 for sha256) that the trailing
checksum hash has a NUL byte in it. But we compute enough cache-trees in
the course of that script that we are very likely to hit the problem in
one of them.

We can look at making strict_string_checks the default for ASan builds,
but there are some other cases we'd want to fix first.

Reported-by: correctmost <cmlists@sent.com>
Signed-off-by: Jeff King <peff@peff.net>
---
 cache-tree.c | 50 +++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 37 insertions(+), 13 deletions(-)

diff --git a/cache-tree.c b/cache-tree.c
index 2aba47060e..2d8947b518 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -548,12 +548,41 @@ void cache_tree_write(struct strbuf *sb, struct cache_tree *root)
 	trace2_region_leave("cache_tree", "write", the_repository);
 }
 
+static int parse_int(const char **ptr, unsigned long *len_p, int *out)
+{
+	const char *s = *ptr;
+	unsigned long len = *len_p;
+	int ret = 0;
+	int sign = 1;
+
+	while (len && *s == '-') {
+		sign *= -1;
+		s++;
+		len--;
+	}
+
+	while (len) {
+		if (!isdigit(*s))
+			break;
+		ret *= 10;
+		ret += *s - '0';
+		s++;
+		len--;
+	}
+
+	if (s == *ptr)
+		return -1;
+
+	*ptr = s;
+	*len_p = len;
+	*out = sign * ret;
+	return 0;
+}
+
 static struct cache_tree *read_one(const char **buffer, unsigned long *size_p)
 {
 	const char *buf = *buffer;
 	unsigned long size = *size_p;
-	const char *cp;
-	char *ep;
 	struct cache_tree *it;
 	int i, subtree_nr;
 	const unsigned rawsz = the_hash_algo->rawsz;
@@ -569,19 +598,14 @@ static struct cache_tree *read_one(const char **buffer, unsigned long *size_p)
 	buf++; size--;
 	it = cache_tree();
 
-	cp = buf;
-	it->entry_count = strtol(cp, &ep, 10);
-	if (cp == ep)
+	if (parse_int(&buf, &size, &it->entry_count) < 0)
 		goto free_return;
-	cp = ep;
-	subtree_nr = strtol(cp, &ep, 10);
-	if (cp == ep)
+	if (!size || *buf != ' ')
 		goto free_return;
-	while (size && *buf && *buf != '\n') {
-		size--;
-		buf++;
-	}
-	if (!size)
+	buf++; size--;
+	if (parse_int(&buf, &size, &subtree_nr) < 0)
+		goto free_return;
+	if (!size || *buf != '\n')
 		goto free_return;
 	buf++; size--;
 	if (0 <= it->entry_count) {
-- 
2.52.0.278.gadc6434dc3


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v2 5/9] fsck: assert newline presence in fsck_ident()
  2025-11-18  9:11 ` [PATCH v2 " Jeff King
                     ` (3 preceding siblings ...)
  2025-11-18  9:12   ` [PATCH v2 4/9] cache-tree: avoid strtol() on non-string buffer Jeff King
@ 2025-11-18  9:12   ` Jeff King
  2025-11-18  9:12   ` [PATCH v2 6/9] fsck: avoid strcspn() " Jeff King
                     ` (4 subsequent siblings)
  9 siblings, 0 replies; 62+ messages in thread
From: Jeff King @ 2025-11-18  9:12 UTC (permalink / raw)
  To: git; +Cc: Patrick Steinhardt, correctmost, Taylor Blau

The fsck code purports to handle buffers that are not NUL-terminated,
but fsck_ident() uses some string functions. This works OK in practice,
as explained in 8e4309038f (fsck: do not assume NUL-termination of
buffers, 2023-01-19). Before calling fsck_ident() we'll have called
verify_headers(), which makes sure we have at least a trailing newline.
And none of our string-like functions will walk past that newline.

However, that makes this code at the top of fsck_ident() very confusing:

    *ident = strchrnul(*ident, '\n');
    if (**ident == '\n')
            (*ident)++;

We should always see that newline, or our memory safety assumptions have
been violated! Further, using strchrnul() is weird, since the whole
point is that if the newline is not there, we don't necessarily have a
NUL at all, and might read off the end of the buffer.

So let's have callers pass in the boundary of our buffer, which lets us
safely find the newline with memchr(). And if it is not there, this is a
BUG(), because it means our caller did not validate the input with
verify_headers() as it was supposed to (and we are better off bailing
rather than having memory-safety problems).

Signed-off-by: Jeff King <peff@peff.net>
---
 fsck.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/fsck.c b/fsck.c
index 341e100d24..8991f04943 100644
--- a/fsck.c
+++ b/fsck.c
@@ -860,16 +860,18 @@ static int verify_headers(const void *data, unsigned long size,
 		FSCK_MSG_UNTERMINATED_HEADER, "unterminated header");
 }
 
-static int fsck_ident(const char **ident,
+static int fsck_ident(const char **ident, const char *ident_end,
 		      const struct object_id *oid, enum object_type type,
 		      struct fsck_options *options)
 {
 	const char *p = *ident;
+	const char *nl;
 	char *end;
 
-	*ident = strchrnul(*ident, '\n');
-	if (**ident == '\n')
-		(*ident)++;
+	nl = memchr(p, '\n', ident_end - p);
+	if (!nl)
+		BUG("verify_headers() should have made sure we have a newline");
+	*ident = nl + 1;
 
 	if (*p == '<')
 		return report(options, oid, type, FSCK_MSG_MISSING_NAME_BEFORE_EMAIL, "invalid author/committer line - missing space before email");
@@ -958,7 +960,7 @@ static int fsck_commit(const struct object_id *oid,
 	author_count = 0;
 	while (buffer < buffer_end && skip_prefix(buffer, "author ", &buffer)) {
 		author_count++;
-		err = fsck_ident(&buffer, oid, OBJ_COMMIT, options);
+		err = fsck_ident(&buffer, buffer_end, oid, OBJ_COMMIT, options);
 		if (err)
 			return err;
 	}
@@ -970,7 +972,7 @@ static int fsck_commit(const struct object_id *oid,
 		return err;
 	if (buffer >= buffer_end || !skip_prefix(buffer, "committer ", &buffer))
 		return report(options, oid, OBJ_COMMIT, FSCK_MSG_MISSING_COMMITTER, "invalid format - expected 'committer' line");
-	err = fsck_ident(&buffer, oid, OBJ_COMMIT, options);
+	err = fsck_ident(&buffer, buffer_end, oid, OBJ_COMMIT, options);
 	if (err)
 		return err;
 	if (memchr(buffer_begin, '\0', size)) {
@@ -1065,7 +1067,7 @@ int fsck_tag_standalone(const struct object_id *oid, const char *buffer,
 			goto done;
 	}
 	else
-		ret = fsck_ident(&buffer, oid, OBJ_TAG, options);
+		ret = fsck_ident(&buffer, buffer_end, oid, OBJ_TAG, options);
 
 	if (buffer < buffer_end && (skip_prefix(buffer, "gpgsig ", &buffer) || skip_prefix(buffer, "gpgsig-sha256 ", &buffer))) {
 		eol = memchr(buffer, '\n', buffer_end - buffer);
-- 
2.52.0.278.gadc6434dc3


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v2 6/9] fsck: avoid strcspn() in fsck_ident()
  2025-11-18  9:11 ` [PATCH v2 " Jeff King
                     ` (4 preceding siblings ...)
  2025-11-18  9:12   ` [PATCH v2 5/9] fsck: assert newline presence in fsck_ident() Jeff King
@ 2025-11-18  9:12   ` Jeff King
  2025-11-18  9:12   ` [PATCH v2 7/9] fsck: remove redundant date timestamp check Jeff King
                     ` (3 subsequent siblings)
  9 siblings, 0 replies; 62+ messages in thread
From: Jeff King @ 2025-11-18  9:12 UTC (permalink / raw)
  To: git; +Cc: Patrick Steinhardt, correctmost, Taylor Blau

We may be operating on a buffer that is not NUL-terminated, but we use
strcspn() to parse it. This is OK in practice, as discussed in
8e4309038f (fsck: do not assume NUL-termination of buffers, 2023-01-19),
because we know there is at least a trailing newline in our buffer, and
we always pass "\n" to strcspn(). So we know it will stop before running
off the end of the buffer.

But this is a subtle point to hang our memory safety hat on. And it
confuses ASan's strict_string_checks mode, even though it is technically
a false positive (that mode complains that we have no NUL, which is
true, but it does not know that we have verified the presence of the
newline already).

Let's instead open-code the loop. As a bonus, this makes the logic more
obvious (to my mind, anyway). The current code skips forward with
strcspn until it hits "<", ">", or "\n". But then it must check which it
saw to decide if that was what we expected or not, duplicating some
logic between what's in the strcspn() and what's in the domain logic.
Instead, we can just check each character as we loop and act on it
immediately.

Signed-off-by: Jeff King <peff@peff.net>
---
 fsck.c | 32 ++++++++++++++++++++++----------
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/fsck.c b/fsck.c
index 8991f04943..2ee72d573d 100644
--- a/fsck.c
+++ b/fsck.c
@@ -875,18 +875,30 @@ static int fsck_ident(const char **ident, const char *ident_end,
 
 	if (*p == '<')
 		return report(options, oid, type, FSCK_MSG_MISSING_NAME_BEFORE_EMAIL, "invalid author/committer line - missing space before email");
-	p += strcspn(p, "<>\n");
-	if (*p == '>')
-		return report(options, oid, type, FSCK_MSG_BAD_NAME, "invalid author/committer line - bad name");
-	if (*p != '<')
-		return report(options, oid, type, FSCK_MSG_MISSING_EMAIL, "invalid author/committer line - missing email");
+	for (;;) {
+		if (p >= ident_end || *p == '\n')
+			return report(options, oid, type, FSCK_MSG_MISSING_EMAIL, "invalid author/committer line - missing email");
+		if (*p == '>')
+			return report(options, oid, type, FSCK_MSG_BAD_NAME, "invalid author/committer line - bad name");
+		if (*p == '<')
+			break; /* end of name, beginning of email */
+
+		/* otherwise, skip past arbitrary name char */
+		p++;
+	}
 	if (p[-1] != ' ')
 		return report(options, oid, type, FSCK_MSG_MISSING_SPACE_BEFORE_EMAIL, "invalid author/committer line - missing space before email");
-	p++;
-	p += strcspn(p, "<>\n");
-	if (*p != '>')
-		return report(options, oid, type, FSCK_MSG_BAD_EMAIL, "invalid author/committer line - bad email");
-	p++;
+	p++; /* skip past '<' we found */
+	for (;;) {
+		if (p >= ident_end || *p == '<' || *p == '\n')
+			return report(options, oid, type, FSCK_MSG_BAD_EMAIL, "invalid author/committer line - bad email");
+		if (*p == '>')
+			break; /* end of email */
+
+		/* otherwise, skip past arbitrary email char */
+		p++;
+	}
+	p++; /* skip past '>' we found */
 	if (*p != ' ')
 		return report(options, oid, type, FSCK_MSG_MISSING_SPACE_BEFORE_DATE, "invalid author/committer line - missing space before date");
 	p++;
-- 
2.52.0.278.gadc6434dc3


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v2 7/9] fsck: remove redundant date timestamp check
  2025-11-18  9:11 ` [PATCH v2 " Jeff King
                     ` (5 preceding siblings ...)
  2025-11-18  9:12   ` [PATCH v2 6/9] fsck: avoid strcspn() " Jeff King
@ 2025-11-18  9:12   ` Jeff King
  2025-11-18  9:12   ` [PATCH v2 8/9] fsck: avoid parse_timestamp() on buffer that isn't NUL-terminated Jeff King
                     ` (2 subsequent siblings)
  9 siblings, 0 replies; 62+ messages in thread
From: Jeff King @ 2025-11-18  9:12 UTC (permalink / raw)
  To: git; +Cc: Patrick Steinhardt, correctmost, Taylor Blau

After calling "parse_timestamp(p, &end, 10)", we complain if "p == end",
which would imply that we did not see any digits at all. But we know
this cannot be the case, since we would have bailed already if we did
not see any digits, courtesy of extra checks added by 8e4309038f (fsck:
do not assume NUL-termination of buffers, 2023-01-19). Since then,
checking "p == end" is redundant and we can drop it.

This will make our lives a little easier as we refactor further.

Signed-off-by: Jeff King <peff@peff.net>
---
 fsck.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fsck.c b/fsck.c
index 2ee72d573d..266c965cec 100644
--- a/fsck.c
+++ b/fsck.c
@@ -920,7 +920,7 @@ static int fsck_ident(const char **ident, const char *ident_end,
 		return report(options, oid, type, FSCK_MSG_ZERO_PADDED_DATE, "invalid author/committer line - zero-padded date");
 	if (date_overflows(parse_timestamp(p, &end, 10)))
 		return report(options, oid, type, FSCK_MSG_BAD_DATE_OVERFLOW, "invalid author/committer line - date causes integer overflow");
-	if ((end == p || *end != ' '))
+	if (*end != ' ')
 		return report(options, oid, type, FSCK_MSG_BAD_DATE, "invalid author/committer line - bad date");
 	p = end + 1;
 	if ((*p != '+' && *p != '-') ||
-- 
2.52.0.278.gadc6434dc3


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v2 8/9] fsck: avoid parse_timestamp() on buffer that isn't NUL-terminated
  2025-11-18  9:11 ` [PATCH v2 " Jeff King
                     ` (6 preceding siblings ...)
  2025-11-18  9:12   ` [PATCH v2 7/9] fsck: remove redundant date timestamp check Jeff King
@ 2025-11-18  9:12   ` Jeff King
  2025-11-18  9:12   ` [PATCH v2 9/9] t: enable ASan's strict_string_checks option Jeff King
  2025-11-23  5:49   ` [PATCH v2 0/9] asan bonanza Junio C Hamano
  9 siblings, 0 replies; 62+ messages in thread
From: Jeff King @ 2025-11-18  9:12 UTC (permalink / raw)
  To: git; +Cc: Patrick Steinhardt, correctmost, Taylor Blau

In fsck_ident(), we parse the timestamp with parse_timestamp(), which is
really an alias for strtoumax(). But since our buffer may not be
NUL-terminated, this can trigger a complaint from ASan's
strict_string_checks mode. This is a false positive, since we know that
the buffer contains a trailing newline (which we checked earlier in the
function), and that strtoumax() would stop there.

But it is worth working around ASan's complaint. One is because that
will let us turn on strict_string_checks by default, which has helped
catch other real problems. And two is that the safety of the current
code is very hard to reason about (it subtly depends on distant code
which could change).

One option here is to just parse the number left-to-right ourselves. But
we care about the size of a timestamp_t and detecting overflow, since
that's part of the point of these checks. And doing that correctly is
tricky. So we'll instead just pull the digits into a separate,
NUL-terminated buffer, and use that to call parse_timestamp().

Signed-off-by: Jeff King <peff@peff.net>
---
 fsck.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/fsck.c b/fsck.c
index 266c965cec..8e8083e7c6 100644
--- a/fsck.c
+++ b/fsck.c
@@ -860,13 +860,28 @@ static int verify_headers(const void *data, unsigned long size,
 		FSCK_MSG_UNTERMINATED_HEADER, "unterminated header");
 }
 
+static timestamp_t parse_timestamp_from_buf(const char **start, const char *end)
+{
+	const char *p = *start;
+	char buf[24]; /* big enough for 2^64 */
+	size_t i = 0;
+
+	while (p < end && isdigit(*p)) {
+		if (i >= ARRAY_SIZE(buf) - 1)
+			return TIME_MAX;
+		buf[i++] = *p++;
+	}
+	buf[i] = '\0';
+	*start = p;
+	return parse_timestamp(buf, NULL, 10);
+}
+
 static int fsck_ident(const char **ident, const char *ident_end,
 		      const struct object_id *oid, enum object_type type,
 		      struct fsck_options *options)
 {
 	const char *p = *ident;
 	const char *nl;
-	char *end;
 
 	nl = memchr(p, '\n', ident_end - p);
 	if (!nl)
@@ -918,11 +933,11 @@ static int fsck_ident(const char **ident, const char *ident_end,
 			      "invalid author/committer line - bad date");
 	if (*p == '0' && p[1] != ' ')
 		return report(options, oid, type, FSCK_MSG_ZERO_PADDED_DATE, "invalid author/committer line - zero-padded date");
-	if (date_overflows(parse_timestamp(p, &end, 10)))
+	if (date_overflows(parse_timestamp_from_buf(&p, ident_end)))
 		return report(options, oid, type, FSCK_MSG_BAD_DATE_OVERFLOW, "invalid author/committer line - date causes integer overflow");
-	if (*end != ' ')
+	if (*p != ' ')
 		return report(options, oid, type, FSCK_MSG_BAD_DATE, "invalid author/committer line - bad date");
-	p = end + 1;
+	p++;
 	if ((*p != '+' && *p != '-') ||
 	    !isdigit(p[1]) ||
 	    !isdigit(p[2]) ||
-- 
2.52.0.278.gadc6434dc3


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v2 9/9] t: enable ASan's strict_string_checks option
  2025-11-18  9:11 ` [PATCH v2 " Jeff King
                     ` (7 preceding siblings ...)
  2025-11-18  9:12   ` [PATCH v2 8/9] fsck: avoid parse_timestamp() on buffer that isn't NUL-terminated Jeff King
@ 2025-11-18  9:12   ` Jeff King
  2025-11-23  5:49   ` [PATCH v2 0/9] asan bonanza Junio C Hamano
  9 siblings, 0 replies; 62+ messages in thread
From: Jeff King @ 2025-11-18  9:12 UTC (permalink / raw)
  To: git; +Cc: Patrick Steinhardt, correctmost, Taylor Blau

ASan has an option to enable strict string checking, where any pointer
passed to a function that expects a NUL-terminated string will be
checked for that NUL termination. This can sometimes produce false
positives. E.g., it is not wrong to pass a buffer with { '1', '2', '\n' }
into strtoul(). Even though it is not NUL-terminated, it will stop at
the newline.

But in trying it out, it identified two problematic spots in our test
suite (which have now been adjusted):

  1. The strtol() parsing in cache-tree.c was a real potential problem,
     which would have been very hard to find otherwise (since it
     required constructing a very specific broken index file).

  2. The use of string functions in fsck_ident() were false positives,
     because we knew that there was always a trailing newline which
     would stop the functions from reading off the end of the buffer.
     But the reasoning behind that is somewhat fragile, and silencing
     those complaints made the code easier to reason about.

So even though this did not find any earth-shattering bugs, and even had
a few false positives, I'm sufficiently convinced that its complaints
are more helpful than hurtful. Let's turn it on by default (since the
test suite now runs cleanly with it) and see if it ever turns up any
other instances.

Signed-off-by: Jeff King <peff@peff.net>
---
 t/test-lib.sh | 1 +
 1 file changed, 1 insertion(+)

diff --git a/t/test-lib.sh b/t/test-lib.sh
index ef0ab7ec2d..0fb76f7d11 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -77,6 +77,7 @@ prepend_var GIT_SAN_OPTIONS : strip_path_prefix="$GIT_BUILD_DIR/"
 # want that one to complain to stderr).
 prepend_var ASAN_OPTIONS : $GIT_SAN_OPTIONS
 prepend_var ASAN_OPTIONS : detect_leaks=0
+prepend_var ASAN_OPTIONS : strict_string_checks=1
 export ASAN_OPTIONS

 prepend_var LSAN_OPTIONS : $GIT_SAN_OPTIONS
-- 
2.52.0.278.gadc6434dc3

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 4/9] cache-tree: avoid strtol() on non-string buffer
  2025-11-18  9:12   ` [PATCH v2 4/9] cache-tree: avoid strtol() on non-string buffer Jeff King
@ 2025-11-18 14:30     ` Phillip Wood
  2025-11-23  6:19       ` Junio C Hamano
  0 siblings, 1 reply; 62+ messages in thread
From: Phillip Wood @ 2025-11-18 14:30 UTC (permalink / raw)
  To: Jeff King, git; +Cc: Patrick Steinhardt, correctmost, Taylor Blau

Hi Peff

On 18/11/2025 09:12, Jeff King wrote:
> Let's fix it by just parsing the values ourselves with a helper function
> that is careful not to go past the end of the buffer. There are a few
> behavior changes here that should not matter:
> 
>    - We do not consider overflow, as strtol() would. But nor did the
>      original code. However, we don't trust the value we get from the
>      on-disk file, and if it says to read 2^30 entries, we would notice
>      that we do not have that many and bail before reading off the end of
>      the buffer.
> 
>    - Our helper does not skip past extra leading whitespace as strtol()
>      would, but according to gitformat-index(5) there should not be any.
> 
>    - The original quit parsing at a newline or a NUL byte, but now we
>      insist on a newline (which is what the documentation says, and what
>      Git has always produced).

I think that sounds reasonable, I've left a couple of comments below.

> +static int parse_int(const char **ptr, unsigned long *len_p, int *out)
> +{
> +	const char *s = *ptr;
> +	unsigned long len = *len_p;
> +	int ret = 0;

This is signed which means that any overflow is undefined. While the 
existing code does not check for overflow I think it is well defined in 
the presence of overflow. It also means parsing INT_MIN is undefined as 
we parse the value as unsigned and then multiply by -1 if we saw a 
leading '-'. We shouldn't see any negative values apart from "-1" but 
given we're changing this code to be more robust in handling malformed 
input it would be nice if parsing INT_MIN was well defined.

> +	int sign = 1;
> +
> +	while (len && *s == '-') {
> +		sign *= -1;
> +		s++;
> +		len--;
> +	}

This accepts any number of '-' signs but I believe strtol() only accepts 
a single sign (the standard says "optionally preceded by a plus or minus 
sign") so this is a change in behavior from the existing code. I'm not 
sure we really need to be that accommodating here.

> +	while (len) {
> +		if (!isdigit(*s))
> +			break;
> +		ret *= 10;
> +		ret += *s - '0';
> +		s++;
> +		len--;
> +	}
> +
> +	if (s == *ptr)
> +		return -1;

This accepts "-" as a valid input, as we're tightening up our parsing it 
would be nice to require a digit after any '-' sign.

 > [...]> +	buf++; size--;
> +	if (parse_int(&buf, &size, &subtree_nr) < 0)
> +		goto free_return;

This isn't a new problem but if subtree_nr is negative we end up trying 
to allocate a huge chunk of memory. If that somehow succeeds we then end 
up calling die("cache-tree: internal error"). The existing code looks 
safe but it would be nice to die() a bit earlier if subtree_nr is negative.

Thanks

Phillip
  > +	if (!size || *buf != '\n')
>   		goto free_return;
>   	buf++; size--;
>   	if (0 <= it->entry_count) {

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 0/9] asan bonanza
  2025-11-18  9:11 ` [PATCH v2 " Jeff King
                     ` (8 preceding siblings ...)
  2025-11-18  9:12   ` [PATCH v2 9/9] t: enable ASan's strict_string_checks option Jeff King
@ 2025-11-23  5:49   ` Junio C Hamano
  9 siblings, 0 replies; 62+ messages in thread
From: Junio C Hamano @ 2025-11-23  5:49 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Patrick Steinhardt, correctmost, Taylor Blau

Jeff King <peff@peff.net> writes:

> Here's a v2 based on feedback:
>
>   - added the extra assertion in the midx code
>
>   - meson changes are squashed into patch 3
>
>   - The cache-tree integer parsing is more robust around total garbage
>     inputs (with no digits at all). I agree with the reviewers that it
>     would be nice to have a robust, reusable integer parsing function.
>     But I think it's non-trivial to do (and I left more comments in the
>     thread). I'd like to stick here to just fixing the memory issues
>     without making anything worse (which I think this version does).
>
>     Note that since the new helper takes an out-parameter, we have to
>     match the type more strictly to what the callers have. So it is now
>     parse_int(), and not parse_long().
>
> Range diff is below.
>
>   [1/9]: compat/mmap: mark unused argument in git_munmap()
>   [2/9]: pack-bitmap: handle name-hash lookups in incremental bitmaps
>   [3/9]: Makefile: turn on NO_MMAP when building with ASan
>   [4/9]: cache-tree: avoid strtol() on non-string buffer
>   [5/9]: fsck: assert newline presence in fsck_ident()
>   [6/9]: fsck: avoid strcspn() in fsck_ident()
>   [7/9]: fsck: remove redundant date timestamp check
>   [8/9]: fsck: avoid parse_timestamp() on buffer that isn't NUL-terminated
>   [9/9]: t: enable ASan's strict_string_checks option

Aside from the comment on strtol() replacement, this iteration did
not see any comments.  What do we want to do next with this series?



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 4/9] cache-tree: avoid strtol() on non-string buffer
  2025-11-18 14:30     ` Phillip Wood
@ 2025-11-23  6:19       ` Junio C Hamano
  2025-11-23 15:51         ` Phillip Wood
  2025-11-24 22:30         ` Jeff King
  0 siblings, 2 replies; 62+ messages in thread
From: Junio C Hamano @ 2025-11-23  6:19 UTC (permalink / raw)
  To: Phillip Wood; +Cc: Jeff King, git, Patrick Steinhardt, correctmost, Taylor Blau

Phillip Wood <phillip.wood123@gmail.com> writes:

>> +	while (len && *s == '-') {
>> +		sign *= -1;
>> +		s++;
>> +		len--;
>> +	}
>
> This accepts any number of '-' signs but I believe strtol() only accepts 
> a single sign (the standard says "optionally preceded by a plus or minus 
> sign") so this is a change in behavior from the existing code. I'm not 
> sure we really need to be that accommodating here.

That is true, but at the same time I do not think we really need to
make it more strict with extra code.

>> +	while (len) {
>> +		if (!isdigit(*s))
>> +			break;
>> +		ret *= 10;
>> +		ret += *s - '0';
>> +		s++;
>> +		len--;
>> +	}
>> +
>> +	if (s == *ptr)
>> +		return -1;
>
> This accepts "-" as a valid input, as we're tightening up our parsing it 
> would be nice to require a digit after any '-' sign.

Ditto.



We could try to be more careful, but it quickly became messy when I
tried.  Here is an unfinished attempt of mine.


static int parse_int(const char **ptr, unsigned long *len_p, int *out)
{
	const char *s = *ptr;
	unsigned long len = *len_p;
	unsigned val = 0;
	bool negate = false;
	int saw_digits = 0;

	while (len && isspace(*s)) {
		len--;
		s++;
	}
	if (!len)
		return -1;
	switch (*s) {
	case '-':
		negate = true;
		/* fallthru */
	case '+':
		s++;
		len--;
		break;
	default:
		break;
	}

	while (len) {
		unsigned next;
		if (!isdigit(*s))
			break;
		next = val * 10 + *s - '0';
		if (next < val)
			return -1;
		val = next;
		s++;
		len--;
		saw_digits = 1;
	}
	if (!saw_digits ||
	    (!negate && INT_MAX <= val) || 
	    (negate && INT_MAX < val))
		return -1;

	*ptr = s;
	*len_p = len;
	*out = negate ? (0 - val) : val;
	return 0;
}

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 4/9] cache-tree: avoid strtol() on non-string buffer
  2025-11-23  6:19       ` Junio C Hamano
@ 2025-11-23 15:51         ` Phillip Wood
  2025-11-23 18:06           ` Junio C Hamano
  2025-11-24 22:30         ` Jeff King
  1 sibling, 1 reply; 62+ messages in thread
From: Phillip Wood @ 2025-11-23 15:51 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, git, Patrick Steinhardt, correctmost, Taylor Blau

On 23/11/2025 06:19, Junio C Hamano wrote:
> Phillip Wood <phillip.wood123@gmail.com> writes:
> 
>>> +	while (len && *s == '-') {
>>> +		sign *= -1;
>>> +		s++;
>>> +		len--;
>>> +	}
>>
>> This accepts any number of '-' signs but I believe strtol() only accepts
>> a single sign (the standard says "optionally preceded by a plus or minus
>> sign") so this is a change in behavior from the existing code. I'm not
>> sure we really need to be that accommodating here.
> 
> That is true, but at the same time I do not think we really need to
> make it more strict with extra code.

All we need to do to accept a single minus sign is s/while/if/

>>> +	while (len) {
>>> +		if (!isdigit(*s))
>>> +			break;
>>> +		ret *= 10;
>>> +		ret += *s - '0';
>>> +		s++;
>>> +		len--;
>>> +	}
>>> +
>>> +	if (s == *ptr)
>>> +		return -1;
>>
>> This accepts "-" as a valid input, as we're tightening up our parsing it
>> would be nice to require a digit after any '-' sign.
> 
> Ditto.

If we limit ourselves to accepting a single minus sign then this can become
	if (s == *ptr + (sign == -1))

so we need very little in the way of extra code.

> We could try to be more careful, but it quickly became messy when I
> tried.  Here is an unfinished attempt of mine.

A generic helper to replace strtol() that takes a length rather than 
assuming the input is NUL terminated could be useful elsewhere but I'm 
not sure we need something that complicated here. I do like the fact 
that overflow does not cause undefined behavior though. Changing ret for 
"int" to "unsigned" in peff's patch should fix that.

Thanks

Phillip

> 
> static int parse_int(const char **ptr, unsigned long *len_p, int *out)
> {
> 	const char *s = *ptr;
> 	unsigned long len = *len_p;
> 	unsigned val = 0;
> 	bool negate = false;
> 	int saw_digits = 0;
> 
> 	while (len && isspace(*s)) {
> 		len--;
> 		s++;
> 	}
> 	if (!len)
> 		return -1;
> 	switch (*s) {
> 	case '-':
> 		negate = true;
> 		/* fallthru */
> 	case '+':
> 		s++;
> 		len--;
> 		break;
> 	default:
> 		break;
> 	}
> 
> 	while (len) {
> 		unsigned next;
> 		if (!isdigit(*s))
> 			break;
> 		next = val * 10 + *s - '0';
> 		if (next < val)
> 			return -1;
> 		val = next;
> 		s++;
> 		len--;
> 		saw_digits = 1;
> 	}
> 	if (!saw_digits ||
> 	    (!negate && INT_MAX <= val) ||
> 	    (negate && INT_MAX < val))
> 		return -1;
> 
> 	*ptr = s;
> 	*len_p = len;
> 	*out = negate ? (0 - val) : val;
> 	return 0;
> }


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 4/9] cache-tree: avoid strtol() on non-string buffer
  2025-11-23 15:51         ` Phillip Wood
@ 2025-11-23 18:06           ` Junio C Hamano
  0 siblings, 0 replies; 62+ messages in thread
From: Junio C Hamano @ 2025-11-23 18:06 UTC (permalink / raw)
  To: Phillip Wood; +Cc: Jeff King, git, Patrick Steinhardt, correctmost, Taylor Blau

Phillip Wood <phillip.wood123@gmail.com> writes:

> All we need to do to accept a single minus sign is s/while/if/
> ...
> If we limit ourselves to accepting a single minus sign then this can become
> 	if (s == *ptr + (sign == -1))
>
> so we need very little in the way of extra code.
> ...
> A generic helper to replace strtol() that takes a length rather than 
> assuming the input is NUL terminated could be useful elsewhere but I'm 
> not sure we need something that complicated here. I do like the fact 
> that overflow does not cause undefined behavior though. Changing ret for 
> "int" to "unsigned" in peff's patch should fix that.
> 
> Thanks

Perhaps.

By the way, an interesting tangent is this.

The only reason why these fields under discussion are stored in
textual decimal is pretty much the same as the reason why the object
header expresses the byte-length of the payload in textual decimal,
i.e., to be independent from the platform natural implementation of
"int" type (e.g., endiannness and width), but unlike object files,
the index is a local matter (we are prepared for the same directory
accessed over NFS from two platforms with different endianness, but
we do not recommend network access to a repository in the first
place).  And a lot more importantly, the total number of the index
entries contained within an index file is capped to 2^32-1 (the
header has 32-bit count in the network byte order).  The total
number of subdirectories within a directory or the total number of
entries for a level of directory hierarchy that would form a tree
object from a slice of the index cannot exceed that number anyway.

And thanks to the design that made cache-tree an optional index
extension, we can make cache-tree version 2 where the in-core
representation is exactly the same as the current one, but only uses
different serialization when writing to and reading from the index
file.  The new serialization can use 32-bit network byte order
integers, or use our own varint.{c,h,rs}, to record these numbers.

A version of Git that knows about that extension could be taught to
read from the current cache-tree and convert to a new version, but
better yet, it can simply ignore the current cache-tree data in the
file, and write the new version when we do need to write the index
out with a cache-tree.  When such a transparent auto conversion
happens, one single invocation of write_index_as_tree() would become
more expensive than usual (because the last invocation of the
current Git left cache-tree data in the index and usually the next
invocation of Git would take advantage of it when it writes a tree,
but a new version of Git that uses the v2 format would behave as if
there is no cache-tree data in the index and build the tree from
scratch.  After that happens, the cache-tree data in the new format
will be reused and things will continue to work.  You could use an
older version of Git on such an index file and the same transparent
auto conversion will take care of the transition.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 4/9] cache-tree: avoid strtol() on non-string buffer
  2025-11-23  6:19       ` Junio C Hamano
  2025-11-23 15:51         ` Phillip Wood
@ 2025-11-24 22:30         ` Jeff King
  2025-11-24 23:09           ` Junio C Hamano
  1 sibling, 1 reply; 62+ messages in thread
From: Jeff King @ 2025-11-24 22:30 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Phillip Wood, git, Patrick Steinhardt, correctmost, Taylor Blau

On Sat, Nov 22, 2025 at 10:19:22PM -0800, Junio C Hamano wrote:

> We could try to be more careful, but it quickly became messy when I
> tried.  Here is an unfinished attempt of mine.

So yeah, I was hoping to avoid jumping into this rabbit hole of
messiness and just do the bare minimum to give us memory safety. But it
seems nobody is quite happy with the result. :(

Looking over what you wrote below, it seems pretty reasonable to me.
What do you consider unfinished in it? I'm wondering if we should swap
it into what my patch is doing (or do it on top if you prefer).

Another option is to scrap this approach entirely, and copy up until the
trailing newline into a separate buffer, NUL-terminate it, and parse
from that buffer. That feels a little dirty to me, but I suspect it is
pretty performant in practice, and it pushes all of the complexity back
onto strtol().

Another variant of that is: parse up to the trailing newline, making
sure it's there, and then leave the rest of the code as-is. We know that
strtol() will do the right thing in that case, but it does mean we
cannot use ASan's strict_string_checks (it would still yield a false
positive, because it does not know we've checked for the newline).

-Peff

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 4/9] cache-tree: avoid strtol() on non-string buffer
  2025-11-24 22:30         ` Jeff King
@ 2025-11-24 23:09           ` Junio C Hamano
  2025-11-26 15:09             ` Jeff King
  0 siblings, 1 reply; 62+ messages in thread
From: Junio C Hamano @ 2025-11-24 23:09 UTC (permalink / raw)
  To: Jeff King; +Cc: Phillip Wood, git, Patrick Steinhardt, correctmost, Taylor Blau

Jeff King <peff@peff.net> writes:

> Looking over what you wrote below, it seems pretty reasonable to me.
> What do you consider unfinished in it?

Two things I am unhappy about are that (1) parsing the digit
sequence that represents abs(x) into unsigned int while catching
wraparound and (2) checking if 'val' that has abs(x) would fit in a
signed int when 'negate' is applied.  For both of them, there ought
to be a better way to write, and perhaps there may be a clean way to
do both at the same time that is easier reason about.

> Another option is to scrap this approach entirely, and copy up until the
> trailing newline into a separate buffer, NUL-terminate it, and parse
> from that buffer. That feels a little dirty to me, but I suspect it is
> pretty performant in practice, and it pushes all of the complexity back
> onto strtol().
>
> Another variant of that is: parse up to the trailing newline, making
> sure it's there, and then leave the rest of the code as-is. We know that
> strtol() will do the right thing in that case, but it does mean we
> cannot use ASan's strict_string_checks (it would still yield a false
> positive, because it does not know we've checked for the newline).

Or perhaps introduce cache-tree-version-2 index extension.  If there
are other things we may want to fix while we are at it, that would
be a better way to spend our engineering resource, but I offhand do
not know of anything gravely lacking there that we may want to fix
(there are little things like how the pathnames are sorted that I
regret the way it was implemented, but that does not motivate me
enough).

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 4/9] cache-tree: avoid strtol() on non-string buffer
  2025-11-24 23:09           ` Junio C Hamano
@ 2025-11-26 15:09             ` Jeff King
  2025-11-26 17:22               ` Junio C Hamano
  0 siblings, 1 reply; 62+ messages in thread
From: Jeff King @ 2025-11-26 15:09 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Phillip Wood, git, Patrick Steinhardt, correctmost, Taylor Blau

On Mon, Nov 24, 2025 at 03:09:47PM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Looking over what you wrote below, it seems pretty reasonable to me.
> > What do you consider unfinished in it?
> 
> Two things I am unhappy about are that (1) parsing the digit
> sequence that represents abs(x) into unsigned int while catching
> wraparound and (2) checking if 'val' that has abs(x) would fit in a
> signed int when 'negate' is applied.  For both of them, there ought
> to be a better way to write, and perhaps there may be a clean way to
> do both at the same time that is easier reason about.

Hmm, I thought both of those things were reasonably clever. The other
obvious way to do it, AFAICT, is to used checked-operation intrinsics or
add unsigned_add_overflows() before every operation.

It is true that for the general case of: "x = y + z" or "x = y * z", you
cannot determine overflow strictly from checking that x < y. But I think
given that we know "z" must be small, it works in this case.

It looks like you merged what I had into 'next'. Where do you want to go
from there? I am mostly content to let it be, but we can also try to
replace with something like your version. Or even, I guess, work on a
global strntoi() that could be used everywhere, if we think it is robust
enough. (Though technically that name is reserved by the standard, which
is a shame, because that is really what this thing is).

> Or perhaps introduce cache-tree-version-2 index extension.  If there
> are other things we may want to fix while we are at it, that would
> be a better way to spend our engineering resource, but I offhand do
> not know of anything gravely lacking there that we may want to fix
> (there are little things like how the pathnames are sorted that I
> regret the way it was implemented, but that does not motivate me
> enough).

I read your other email laying out the v2 concept, and I didn't disagree
with anything. It just feels like a bigger engineering effort and a
bigger risk that the transition does not go as smoothly as we expect for
solving a very small implementation problem. But like you say, I do not
have a laundry list of cache-tree things I'd like to fix either. I think
the transition being worth it would depend on that kind of list.

-Peff

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 4/9] cache-tree: avoid strtol() on non-string buffer
  2025-11-26 15:09             ` Jeff King
@ 2025-11-26 17:22               ` Junio C Hamano
  2025-11-30 13:13                 ` [PATCH 0/4] more robust functions for parsing int from buf Jeff King
  0 siblings, 1 reply; 62+ messages in thread
From: Junio C Hamano @ 2025-11-26 17:22 UTC (permalink / raw)
  To: Jeff King; +Cc: Phillip Wood, git, Patrick Steinhardt, correctmost, Taylor Blau

Jeff King <peff@peff.net> writes:

> Hmm, I thought both of those things were reasonably clever. The other
> obvious way to do it, AFAICT, is to used checked-operation intrinsics or
> add unsigned_add_overflows() before every operation.

Yup, but the thing is, I didn't want something "clever".  I prefer
"clean and obvious" if we add extra code for safety.

> It is true that for the general case of: "x = y + z" or "x = y * z", you
> cannot determine overflow strictly from checking that x < y. But I think
> given that we know "z" must be small, it works in this case.
>
> It looks like you merged what I had into 'next'. Where do you want to go
> from there? I am mostly content to let it be, but we can also try to
> replace with something like your version.

That is my preference.  While the topic is still in 'next', or after
the topic graduates to 'master'.  Either is fine.  And it is fine if
such an update did not come, too.  After all, this is to deal with
contents in a locally generated file (.git/index), so a maliciously
corrupt string that lack the expected whitespace character after the
digit string is a sign that you are trying to burn yourself and you
have only yourself to blame, isn't it?  An attacker that can put
garbage in your .git/index has better ways to fool you by updating
your .git/config file that sits next to it.  Or teach the sanitizer
that this code path is already OK somehow?

> Or even, I guess, work on a
> global strntoi() that could be used everywhere, if we think it is robust
> enough. (Though technically that name is reserved by the standard, which
> is a shame, because that is really what this thing is).

Well, we already use plenty of names beginning with 'str' followed
by a lowercase letter, like strbuf_foo() and string_list_init().

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH 0/4] more robust functions for parsing int from buf
  2025-11-26 17:22               ` Junio C Hamano
@ 2025-11-30 13:13                 ` Jeff King
  2025-11-30 13:14                   ` [PATCH 1/4] parse: prefer bool to int for boolean returns Jeff King
                                     ` (3 more replies)
  0 siblings, 4 replies; 62+ messages in thread
From: Jeff King @ 2025-11-30 13:13 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Phillip Wood, git, Patrick Steinhardt, correctmost, Taylor Blau

On Wed, Nov 26, 2025 at 09:22:38AM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Hmm, I thought both of those things were reasonably clever. The other
> > obvious way to do it, AFAICT, is to used checked-operation intrinsics or
> > add unsigned_add_overflows() before every operation.
> 
> Yup, but the thing is, I didn't want something "clever".  I prefer
> "clean and obvious" if we add extra code for safety.

Yeah, that's fair. It turns out that one half of that is easy: checking
for overflow as we compute the number). And one half is hard. If you
don't assume a twos-complement style range where the "min = -max - 1",
then you are stuck using INT_MIN. Which is OK for "int", but not for
arbitrary types. We already make the same assumption in git_parse_int(),
etc.

So I went with that approach here, but it is at least documented
clearly.

> > It looks like you merged what I had into 'next'. Where do you want to go
> > from there? I am mostly content to let it be, but we can also try to
> > replace with something like your version.
> 
> That is my preference.  While the topic is still in 'next', or after
> the topic graduates to 'master'.  Either is fine.  And it is fine if
> such an update did not come, too.  After all, this is to deal with
> contents in a locally generated file (.git/index), so a maliciously
> corrupt string that lack the expected whitespace character after the
> digit string is a sign that you are trying to burn yourself and you
> have only yourself to blame, isn't it?  An attacker that can put
> garbage in your .git/index has better ways to fool you by updating
> your .git/config file that sits next to it.  Or teach the sanitizer
> that this code path is already OK somehow?

Yeah, I agree the stakes are low here. Though they were somewhat low to
begin with for the same reason! But I was grossed out enough by the
whole thing that I tried to put together a decent helper for parsing
integers from buffers, and converted both sites here.

I suspect it could be used in other places, too, but I didn't convert
any.

> > Or even, I guess, work on a
> > global strntoi() that could be used everywhere, if we think it is robust
> > enough. (Though technically that name is reserved by the standard, which
> > is a shame, because that is really what this thing is).
> 
> Well, we already use plenty of names beginning with 'str' followed
> by a lowercase letter, like strbuf_foo() and string_list_init().

In the end it was sufficiently different from strtoi() that I decided
not to use that name. It was but one of many bike-sheddable decisions,
which I tried to document. So I guess let the flaming commence. ;)

This is built on top of jk/asan-bonanza.

  [1/4]: parse: prefer bool to int for boolean returns
  [2/4]: parse: add functions for parsing from non-string buffers
  [3/4]: cache-tree: use parse_int_from_buf()
  [4/4]: fsck: use parse_unsigned_from_buf() for parsing timestamp

 Makefile                   |   1 +
 cache-tree.c               |  28 ++-----
 compat/posix.h             |   2 +
 fsck.c                     |  20 +----
 parse.c                    | 162 +++++++++++++++++++++++++++++--------
 parse.h                    |  31 +++++--
 t/meson.build              |   1 +
 t/unit-tests/u-parse-int.c |  98 ++++++++++++++++++++++
 8 files changed, 263 insertions(+), 80 deletions(-)
 create mode 100644 t/unit-tests/u-parse-int.c

-Peff

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH 1/4] parse: prefer bool to int for boolean returns
  2025-11-30 13:13                 ` [PATCH 0/4] more robust functions for parsing int from buf Jeff King
@ 2025-11-30 13:14                   ` Jeff King
  2025-12-04 11:23                     ` Patrick Steinhardt
  2025-11-30 13:15                   ` [PATCH 2/4] parse: add functions for parsing from non-string buffers Jeff King
                                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 62+ messages in thread
From: Jeff King @ 2025-11-30 13:14 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Phillip Wood, git, Patrick Steinhardt, correctmost, Taylor Blau

All of the integer parsing functions in parse.[ch] return an int that is
"0" for failure or "1" for success. Since most of the other functions in
Git use "0" for success and "-1" for failure, this can be confusing.
Let's switch the return types to bool to make it clear that we are using
this other convention. Callers should not need to update at all.

Signed-off-by: Jeff King <peff@peff.net>
---
Obviously not strictly necessary for this series, but I think a good
idea regardless of the rest of it.

 parse.c | 66 ++++++++++++++++++++++++++++-----------------------------
 parse.h | 14 ++++++------
 2 files changed, 40 insertions(+), 40 deletions(-)

diff --git a/parse.c b/parse.c
index 48313571aa..f626846def 100644
--- a/parse.c
+++ b/parse.c
@@ -15,7 +15,7 @@ static uintmax_t get_unit_factor(const char *end)
 	return 0;
 }
 
-int git_parse_signed(const char *value, intmax_t *ret, intmax_t max)
+bool git_parse_signed(const char *value, intmax_t *ret, intmax_t max)
 {
 	if (value && *value) {
 		char *end;
@@ -28,30 +28,30 @@ int git_parse_signed(const char *value, intmax_t *ret, intmax_t max)
 		errno = 0;
 		val = strtoimax(value, &end, 0);
 		if (errno == ERANGE)
-			return 0;
+			return false;
 		if (end == value) {
 			errno = EINVAL;
-			return 0;
+			return false;
 		}
 		factor = get_unit_factor(end);
 		if (!factor) {
 			errno = EINVAL;
-			return 0;
+			return false;
 		}
 		if ((val < 0 && (-max - 1) / factor > val) ||
 		    (val > 0 && max / factor < val)) {
 			errno = ERANGE;
-			return 0;
+			return false;
 		}
 		val *= factor;
 		*ret = val;
-		return 1;
+		return true;
 	}
 	errno = EINVAL;
-	return 0;
+	return false;
 }
 
-int git_parse_unsigned(const char *value, uintmax_t *ret, uintmax_t max)
+bool git_parse_unsigned(const char *value, uintmax_t *ret, uintmax_t max)
 {
 	if (value && *value) {
 		char *end;
@@ -61,97 +61,97 @@ int git_parse_unsigned(const char *value, uintmax_t *ret, uintmax_t max)
 		/* negative values would be accepted by strtoumax */
 		if (strchr(value, '-')) {
 			errno = EINVAL;
-			return 0;
+			return false;
 		}
 		errno = 0;
 		val = strtoumax(value, &end, 0);
 		if (errno == ERANGE)
-			return 0;
+			return false;
 		if (end == value) {
 			errno = EINVAL;
-			return 0;
+			return false;
 		}
 		factor = get_unit_factor(end);
 		if (!factor) {
 			errno = EINVAL;
-			return 0;
+			return false;
 		}
 		if (unsigned_mult_overflows(factor, val) ||
 		    factor * val > max) {
 			errno = ERANGE;
-			return 0;
+			return false;
 		}
 		val *= factor;
 		*ret = val;
-		return 1;
+		return true;
 	}
 	errno = EINVAL;
-	return 0;
+	return false;
 }
 
-int git_parse_int(const char *value, int *ret)
+bool git_parse_int(const char *value, int *ret)
 {
 	intmax_t tmp;
 	if (!git_parse_signed(value, &tmp, maximum_signed_value_of_type(int)))
-		return 0;
+		return false;
 	*ret = tmp;
-	return 1;
+	return true;
 }
 
-int git_parse_int64(const char *value, int64_t *ret)
+bool git_parse_int64(const char *value, int64_t *ret)
 {
 	intmax_t tmp;
 	if (!git_parse_signed(value, &tmp, maximum_signed_value_of_type(int64_t)))
-		return 0;
+		return false;
 	*ret = tmp;
-	return 1;
+	return true;
 }
 
-int git_parse_ulong(const char *value, unsigned long *ret)
+bool git_parse_ulong(const char *value, unsigned long *ret)
 {
 	uintmax_t tmp;
 	if (!git_parse_unsigned(value, &tmp, maximum_unsigned_value_of_type(long)))
-		return 0;
+		return false;
 	*ret = tmp;
-	return 1;
+	return true;
 }
 
-int git_parse_ssize_t(const char *value, ssize_t *ret)
+bool git_parse_ssize_t(const char *value, ssize_t *ret)
 {
 	intmax_t tmp;
 	if (!git_parse_signed(value, &tmp, maximum_signed_value_of_type(ssize_t)))
-		return 0;
+		return false;
 	*ret = tmp;
-	return 1;
+	return true;
 }
 
-int git_parse_double(const char *value, double *ret)
+bool git_parse_double(const char *value, double *ret)
 {
 	char *end;
 	double val;
 	uintmax_t factor;
 
 	if (!value || !*value) {
 		errno = EINVAL;
-		return 0;
+		return false;
 	}
 
 	errno = 0;
 	val = strtod(value, &end);
 	if (errno == ERANGE)
-		return 0;
+		return false;
 	if (end == value) {
 		errno = EINVAL;
-		return 0;
+		return false;
 	}
 	factor = get_unit_factor(end);
 	if (!factor) {
 		errno = EINVAL;
-		return 0;
+		return false;
 	}
 	val *= factor;
 	*ret = val;
-	return 1;
+	return true;
 }
 
 int git_parse_maybe_bool_text(const char *value)
diff --git a/parse.h b/parse.h
index ea32de9a91..f80cc5b9fd 100644
--- a/parse.h
+++ b/parse.h
@@ -1,13 +1,13 @@
 #ifndef PARSE_H
 #define PARSE_H
 
-int git_parse_signed(const char *value, intmax_t *ret, intmax_t max);
-int git_parse_unsigned(const char *value, uintmax_t *ret, uintmax_t max);
-int git_parse_ssize_t(const char *, ssize_t *);
-int git_parse_ulong(const char *, unsigned long *);
-int git_parse_int(const char *value, int *ret);
-int git_parse_int64(const char *value, int64_t *ret);
-int git_parse_double(const char *value, double *ret);
+bool git_parse_signed(const char *value, intmax_t *ret, intmax_t max);
+bool git_parse_unsigned(const char *value, uintmax_t *ret, uintmax_t max);
+bool git_parse_ssize_t(const char *, ssize_t *);
+bool git_parse_ulong(const char *, unsigned long *);
+bool git_parse_int(const char *value, int *ret);
+bool git_parse_int64(const char *value, int64_t *ret);
+bool git_parse_double(const char *value, double *ret);
 
 /**
  * Same as `git_config_bool`, except that it returns -1 on error rather
-- 
2.52.0.413.gf695cdb9bd


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 2/4] parse: add functions for parsing from non-string buffers
  2025-11-30 13:13                 ` [PATCH 0/4] more robust functions for parsing int from buf Jeff King
  2025-11-30 13:14                   ` [PATCH 1/4] parse: prefer bool to int for boolean returns Jeff King
@ 2025-11-30 13:15                   ` Jeff King
  2025-11-30 13:46                     ` my complaints with clar Jeff King
                                       ` (2 more replies)
  2025-11-30 13:15                   ` [PATCH 3/4] cache-tree: use parse_int_from_buf() Jeff King
  2025-11-30 13:16                   ` [PATCH 4/4] fsck: use parse_unsigned_from_buf() for parsing timestamp Jeff King
  3 siblings, 3 replies; 62+ messages in thread
From: Jeff King @ 2025-11-30 13:15 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Phillip Wood, git, Patrick Steinhardt, correctmost, Taylor Blau

If you have a buffer that is not NUL-terminated but want to parse an
integer, there aren't many good options. If you use strtol() and
friends, you risk running off the end of the buffer if there is no
non-digit terminating character. And even if you carefully make sure
that there is such a character, ASan's strict-string-check mode will
still complain.

You can copy bytes into a temporary buffer, terminate it, and then call
strtol(), but doing so adds some pitfalls (like making sure you soak up
whitespace and leading +/- signs, and reporting overflow for overly long
input). Or you can hand-parse the digits, but then you need to take some
care to handle overflow (and again, whitespace and +/- signs).

These things aren't impossible to do right, but it's error-prone to have
to do them in every spot that wants to do such parsing. So let's add
some functions which can be used across the code base.

There are a few choices regarding the interface and the implementation.

First, the implementation:

  - I went with with parsing the digits (rather than buffering and
    passing to libc functions). It ends up being a similar amount of
    code because we have to do some parsing either way. And likewise
    overflow detection depends on the exact type the caller wants, so we
    either have to do it by hand or write a separate wrapper for
    strtol(), strtoumax(), and so on.

  - Unsigned overflow detection is done using the same techniques as in
    unsigned_add_overflows(), etc. We can't use those macros directly
    because our core function is type-agnostic (so the caller passes in
    the max value, rather than us deriving it on the fly). This is
    similar to how git_parse_int(), etc, work.

  - Signed overflow detection assumes that we can express a negative
    value with magnitude one larger than our maximum positive value
    (e.g., -128..127 for a signed 8-bit value). I doubt this is
    guaranteed by the standard, but it should hold in practice, and we
    make the same assumption in git_parse_int(), etc. The nice thing
    about this is that we can derive the range from the number of bits
    in the type. For ints, you obviously could use INT_MIN..INT_MAX, but
    for an arbitrary type, we can use maximum_signed_value_of_type().

  - I didn't bother with handling bases other than 10. It would
    complicate the code, and I suspect it won't be needed. We could
    probably retro-fit it later without too much work, if need be.

For the interface:

  - What do we call it? We have git_parse_int() and friends, which aim
    to make parsing less error-prone. And in some ways, these are just
    buffer (rather than string) versions of those functions. But not
    entirely. Those functions are aimed at parsing a single user-facing
    value. So they accept a unit prefix (e.g., "10k"), which we won't
    always want. And they insist that the whole string is consumed
    (rather than passing back an "end" pointer).

    We also have strtol_i() and strtoul_ui() wrappers, which try to make
    error handling simpler (especially around overflow), but mostly
    behave like their libc counterparts. These also don't pass out an
    end pointer, though.

    So I started a new namespace, "parse_<type>_from_buf".

  - Like those other functions above, we use an out-parameter to store
    the result, which lets us return an error code directly. This avoids
    the complicated errno dance for detecting overflow that you get with
    strtol().

    What should the error code look like? git_parse_int() uses a bool
    for success/failure. But strtol_ui() uses the syscall-like "0 is
    success, -1 is error" convention.

    I went with the bool approach here. Since the names are closest to
    those functions, I thought it would cause the least confusion.

  - Unlike git_parse_signed() and friends, we do not insist that the
    entire buffer be consumed. For parsing a specific standalone string
    that makes sense, but within an unterminated buffer you are much
    more likely to be parsing multiple fields from a larger data set.

    We pass out an "end" pointer the same way strtol() does. Another
    option is to accept the input as an in-out parameter and advance the
    pointer ourselves (and likewise shrink the length pointer). That
    would let you do something like:

       if (!parse_int_from_buf(&p, &len, &out))
               return error(...);
       /* "p" and "len" were adjusted automatically */
       if (!len || *p++ != ' ')
               return error(...);

    That saves a few lines of code in some spots, but requires a few
    more in others (depending on whether the caller has a length in the
    first place or is using an end pointer). Of the two callers I intend
    to immediately convert, we have one of each type!

    I went with the strtol() approach as flexible and time-tested.

  - We could likewise take the input buffer as two pointers (start and
    end) rather than a pointer and a length. That again makes life
    easier for some callers and harder for others. I stuck with pointer
    and length as the more usual interface.

  - What happens when a caller passes in a NULL end pointer? This is
    allowed by strtol(). But I think it's often a sign of a lurking bug,
    because there's no way to know how much was consumed (and even if a
    caller wants to assume everything is consumed, you have no way to
    verify it). So it is simply an error in this interface (you'd get a
    segfault).

    I am tempted to say that if the end pointer is NULL the functions
    could confirm that the entire buffer was consumed, as a convenience.
    But that felt a bit magical and surprising.

Like git_parse_*(), there is a generic signed/unsigned helper, and then
we can add type-specific helpers on top. I've added an int helper here
to start, and we'll add more as we convert callers.

Signed-off-by: Jeff King <peff@peff.net>
---
Sorry for the long message, but I tried to lay out my thinking for all
of it, since there were a lot of arbitrary decisions.

 Makefile                   |  1 +
 parse.c                    | 96 +++++++++++++++++++++++++++++++++++++
 parse.h                    | 17 +++++++
 t/meson.build              |  1 +
 t/unit-tests/u-parse-int.c | 98 ++++++++++++++++++++++++++++++++++++++
 5 files changed, 213 insertions(+)
 create mode 100644 t/unit-tests/u-parse-int.c

diff --git a/Makefile b/Makefile
index 237b56fc9d..751bd40a9f 100644
--- a/Makefile
+++ b/Makefile
@@ -1510,6 +1510,7 @@ CLAR_TEST_SUITES += u-mem-pool
 CLAR_TEST_SUITES += u-oid-array
 CLAR_TEST_SUITES += u-oidmap
 CLAR_TEST_SUITES += u-oidtree
+CLAR_TEST_SUITES += u-parse-int
 CLAR_TEST_SUITES += u-prio-queue
 CLAR_TEST_SUITES += u-reftable-basics
 CLAR_TEST_SUITES += u-reftable-block
diff --git a/parse.c b/parse.c
index f626846def..1dcbcf64a1 100644
--- a/parse.c
+++ b/parse.c
@@ -209,3 +209,99 @@ unsigned long git_env_ulong(const char *k, unsigned long val)
 		die(_("failed to parse %s"), k);
 	return val;
 }
+
+/*
+ * Helper that handles both signed/unsigned cases. If "negate" is NULL,
+ * negative values are disallowed. If not NULL and the input is negative,
+ * the value is range-checked but the caller is responsible for actually doing
+ * the negatiion. You probably don't want to use this! Use one of
+ * parse_signed_from_buf() or parse_unsigned_from_buf() below.
+ */
+static bool parse_from_buf_internal(const char *buf, size_t len,
+				    const char **ep, bool *negate,
+				    uintmax_t *ret, uintmax_t max)
+{
+	const char *end = buf + len;
+	uintmax_t val = 0;
+
+	while (buf < end && isspace(*buf))
+		buf++;
+
+	if (negate)
+		*negate = false;
+	if (buf < end && *buf == '-') {
+		if (!negate) {
+			errno = EINVAL;
+			return false;
+		}
+		buf++;
+		*negate = true;
+		/* Assume negative range is always one larger than positive. */
+		max = max + 1;
+	} else if (buf < end && *buf == '+') {
+		buf++;
+	}
+
+	if (buf == end || !isdigit(*buf)) {
+		errno = EINVAL;
+		return false;
+	}
+
+	while (buf < end && isdigit(*buf)) {
+		int digit = *buf - '0';
+
+		if (val > max / 10) {
+			errno = ERANGE;
+			return false;
+		}
+		val *= 10;
+		if (val > max - digit) {
+			errno = ERANGE;
+			return false;
+		}
+		val += digit;
+
+		buf++;
+	}
+
+	*ep = buf;
+	*ret = val;
+	return true;
+}
+
+bool parse_unsigned_from_buf(const char *buf, size_t len, const char **ep,
+			     uintmax_t *ret, uintmax_t max)
+{
+	return parse_from_buf_internal(buf, len, ep, NULL, ret, max);
+}
+
+bool parse_signed_from_buf(const char *buf, size_t len, const char **ep,
+			   intmax_t *ret, intmax_t max)
+{
+	uintmax_t u_ret;
+	bool negate;
+
+	if (!parse_from_buf_internal(buf, len, ep, &negate, &u_ret, max))
+		return false;
+	/*
+	 * Range already checked internally, but we must apply negation
+	 * ourselves since only we have the signed integer type.
+	 */
+	if (negate) {
+		*ret = u_ret;
+		*ret = -*ret;
+	} else {
+		*ret = u_ret;
+	}
+	return true;
+}
+
+bool parse_int_from_buf(const char *buf, size_t len, const char **ep, int *ret)
+{
+	intmax_t tmp;
+	if (!parse_signed_from_buf(buf, len, ep, &tmp,
+				   maximum_signed_value_of_type(int)))
+		return false;
+	*ret = tmp;
+	return true;
+}
diff --git a/parse.h b/parse.h
index f80cc5b9fd..53663c8939 100644
--- a/parse.h
+++ b/parse.h
@@ -19,4 +19,21 @@ int git_parse_maybe_bool_text(const char *value);
 int git_env_bool(const char *, int);
 unsigned long git_env_ulong(const char *, unsigned long);
 
+/*
+ * These functions parse an integer from a buffer that does not need to be
+ * NUL-terminated. They return true on success, or false if no integer is found
+ * (in which case errno is set to EINVAL) or if the integer is out of the
+ * allowable range (in which case errno is ERANGE).
+ *
+ * You must pass in a non-NULL value for "ep", which returns a pointer to the
+ * next character in the buf (similar to strtol(), etc).
+ *
+ * These functions always parse in base 10 (and do not allow input like "0xff"
+ * to switch to base 16). They do not allow unit suffixes like git_parse_int(),
+ * above.
+ */
+bool parse_unsigned_from_buf(const char *buf, size_t len, const char **ep, uintmax_t *ret, uintmax_t max);
+bool parse_signed_from_buf(const char *buf, size_t len, const char **ep, intmax_t *ret, intmax_t max);
+bool parse_int_from_buf(const char *buf, size_t len, const char **ep, int *ret);
+
 #endif /* PARSE_H */
diff --git a/t/meson.build b/t/meson.build
index 7c994d4643..1289614545 100644
--- a/t/meson.build
+++ b/t/meson.build
@@ -8,6 +8,7 @@ clar_test_suites = [
   'unit-tests/u-oid-array.c',
   'unit-tests/u-oidmap.c',
   'unit-tests/u-oidtree.c',
+  'unit-tests/u-parse-int.c',
   'unit-tests/u-prio-queue.c',
   'unit-tests/u-reftable-basics.c',
   'unit-tests/u-reftable-block.c',
diff --git a/t/unit-tests/u-parse-int.c b/t/unit-tests/u-parse-int.c
new file mode 100644
index 0000000000..a1601bb16b
--- /dev/null
+++ b/t/unit-tests/u-parse-int.c
@@ -0,0 +1,98 @@
+#include "unit-test.h"
+#include "parse.h"
+
+static void check_int(const char *buf, size_t len,
+		      size_t expect_ep_ofs, int expect_errno,
+		      int expect_result)
+{
+	const char *ep;
+	int result;
+	bool ok = parse_int_from_buf(buf, len, &ep, &result);
+
+	if (expect_errno) {
+		cl_assert(!ok);
+		cl_assert_equal_i(expect_errno, errno);
+		return;
+	}
+
+	cl_assert(ok);
+	cl_assert_equal_i(expect_result, result);
+	cl_assert_equal_i(expect_ep_ofs, ep - buf);
+}
+
+static void check_int_str(const char *buf, size_t ofs, int err, int res)
+{
+	check_int(buf, strlen(buf), ofs, err, res);
+}
+
+static void check_int_full(const char *buf, int res)
+{
+	check_int_str(buf, strlen(buf), 0, res);
+}
+
+static void check_int_err(const char *buf, int err)
+{
+	check_int(buf, strlen(buf), 0, err, 0);
+}
+
+void test_parse_int__basic(void)
+{
+	cl_invoke(check_int_full("0", 0));
+	cl_invoke(check_int_full("11", 11));
+	cl_invoke(check_int_full("-23", -23));
+	cl_invoke(check_int_full("+23", 23));
+
+	cl_invoke(check_int_str("  31337  ", 7, 0, 31337));
+
+	cl_invoke(check_int_err("  garbage", EINVAL));
+	cl_invoke(check_int_err("", EINVAL));
+	cl_invoke(check_int_err("-", EINVAL));
+
+	cl_invoke(check_int("123", 2, 2, 0, 12));
+}
+
+void test_parse_int__range(void)
+{
+	/*
+	 * These assume a 32-bit int. We could avoid that with some
+	 * conditionals, but it's probably better for the test to
+	 * fail noisily and we can decide how to handle it then.
+	 */
+	cl_invoke(check_int_full("2147483647", 2147483647));
+	cl_invoke(check_int_err("2147483648", ERANGE));
+	cl_invoke(check_int_full("-2147483647", -2147483647));
+	cl_invoke(check_int_full("-2147483648", -2147483648));
+	cl_invoke(check_int_err("-2147483649", ERANGE));
+}
+
+static void check_unsigned(const char *buf, uintmax_t max,
+			   int expect_errno, uintmax_t expect_result)
+{
+	const char *ep;
+	uintmax_t result;
+	bool ok = parse_unsigned_from_buf(buf, strlen(buf), &ep, &result, max);
+
+	if (expect_errno) {
+		cl_assert(!ok);
+		cl_assert_equal_i(expect_errno, errno);
+		return;
+	}
+
+	cl_assert(ok);
+	cl_assert_equal_s(ep, "");
+	/*
+	 * Do not use cl_assert_equal_i_fmt(..., PRIuMAX) here. The macro
+	 * casts to int under the hood, corrupting the values.
+	 */
+	clar__assert_equal(CLAR_CURRENT_FILE, CLAR_CURRENT_FUNC,
+			   CLAR_CURRENT_LINE,
+			   "expect_result != result", 1,
+			   "%"PRIuMAX, expect_result, result);
+}
+
+void test_parse_int__unsigned(void)
+{
+	cl_invoke(check_unsigned("4294967295", UINT_MAX, 0, 4294967295U));
+	cl_invoke(check_unsigned("1053", 1000, ERANGE, 0));
+	cl_invoke(check_unsigned("-17", UINT_MAX, EINVAL, 0));
+}
-- 
2.52.0.413.gf695cdb9bd


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 3/4] cache-tree: use parse_int_from_buf()
  2025-11-30 13:13                 ` [PATCH 0/4] more robust functions for parsing int from buf Jeff King
  2025-11-30 13:14                   ` [PATCH 1/4] parse: prefer bool to int for boolean returns Jeff King
  2025-11-30 13:15                   ` [PATCH 2/4] parse: add functions for parsing from non-string buffers Jeff King
@ 2025-11-30 13:15                   ` Jeff King
  2025-11-30 13:16                   ` [PATCH 4/4] fsck: use parse_unsigned_from_buf() for parsing timestamp Jeff King
  3 siblings, 0 replies; 62+ messages in thread
From: Jeff King @ 2025-11-30 13:15 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Phillip Wood, git, Patrick Steinhardt, correctmost, Taylor Blau

In c4c9089584 (cache-tree: avoid strtol() on non-string buffer,
2025-11-18) we wrote an ad-hoc integer parser which did not detect
overflow. This wasn't too big a problem, since the original use of
strtol() did not do so either. But now that we have a more robust
parsing function, let's use that. It reduces the amount of code and
should catch more cases of malformed entries.

I kept our local parse_int() wrapper here, since it handles management
of our ptr/len pair (rather than doing it inline in the entry parser of
read_one()).

Signed-off-by: Jeff King <peff@peff.net>
---
 cache-tree.c | 28 +++++-----------------------
 1 file changed, 5 insertions(+), 23 deletions(-)

diff --git a/cache-tree.c b/cache-tree.c
index 2d8947b518..f8fb290443 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -16,6 +16,7 @@
 #include "promisor-remote.h"
 #include "trace.h"
 #include "trace2.h"
+#include "parse.h"
 
 #ifndef DEBUG_CACHE_TREE
 #define DEBUG_CACHE_TREE 0
@@ -550,32 +551,13 @@ void cache_tree_write(struct strbuf *sb, struct cache_tree *root)
 
 static int parse_int(const char **ptr, unsigned long *len_p, int *out)
 {
-	const char *s = *ptr;
-	unsigned long len = *len_p;
-	int ret = 0;
-	int sign = 1;
-
-	while (len && *s == '-') {
-		sign *= -1;
-		s++;
-		len--;
-	}
-
-	while (len) {
-		if (!isdigit(*s))
-			break;
-		ret *= 10;
-		ret += *s - '0';
-		s++;
-		len--;
-	}
+	const char *ep;
 
-	if (s == *ptr)
+	if (!parse_int_from_buf(*ptr, *len_p, &ep, out))
 		return -1;
 
-	*ptr = s;
-	*len_p = len;
-	*out = sign * ret;
+	*len_p -= ep - *ptr;
+	*ptr = ep;
 	return 0;
 }
 
-- 
2.52.0.413.gf695cdb9bd


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 4/4] fsck: use parse_unsigned_from_buf() for parsing timestamp
  2025-11-30 13:13                 ` [PATCH 0/4] more robust functions for parsing int from buf Jeff King
                                     ` (2 preceding siblings ...)
  2025-11-30 13:15                   ` [PATCH 3/4] cache-tree: use parse_int_from_buf() Jeff King
@ 2025-11-30 13:16                   ` Jeff King
  3 siblings, 0 replies; 62+ messages in thread
From: Jeff King @ 2025-11-30 13:16 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Phillip Wood, git, Patrick Steinhardt, correctmost, Taylor Blau

In 5a993593b2 (fsck: avoid parse_timestamp() on buffer that isn't
NUL-terminated, 2025-11-18), we added a wrapper that copies the
timestamp into a buffer before calling parse_timestamp().

Now that we have a more robust helper for parsing from a buffer, we can
drop our wrapper and switch to that. We could just do so inline, but the
choice of "unsigned" vs "signed" depends on the typedef of timestamp_t.
So we'll wrap that in a macro that is defined alongside the rest of the
timestamp abstraction.

The resulting function is almost a drop-in replacement, but the new
interface means we need to hold the result in a separate timestamp_t,
rather than returning it directly from one function into the parameter
of another. The old one did still detect overflow errors by returning
TIME_MAX, since date_overflows() checks for that, but now we'll see it
more directly from the return of parse_timestamp_from_buf(). The
behavior should be the same.

Signed-off-by: Jeff King <peff@peff.net>
---
 compat/posix.h |  2 ++
 fsck.c         | 20 +++-----------------
 2 files changed, 5 insertions(+), 17 deletions(-)

diff --git a/compat/posix.h b/compat/posix.h
index 067a00f33b..d2dbc3e2a5 100644
--- a/compat/posix.h
+++ b/compat/posix.h
@@ -253,6 +253,8 @@ char *gitdirname(char *);
 typedef uintmax_t timestamp_t;
 #define PRItime PRIuMAX
 #define parse_timestamp strtoumax
+#define parse_timestamp_from_buf(buf, len, ep, result) \
+	parse_unsigned_from_buf((buf), (len), (ep), (result), TIME_MAX)
 #define TIME_MAX UINTMAX_MAX
 #define TIME_MIN 0
 
diff --git a/fsck.c b/fsck.c
index 8e8083e7c6..68a23ae628 100644
--- a/fsck.c
+++ b/fsck.c
@@ -860,28 +860,13 @@ static int verify_headers(const void *data, unsigned long size,
 		FSCK_MSG_UNTERMINATED_HEADER, "unterminated header");
 }
 
-static timestamp_t parse_timestamp_from_buf(const char **start, const char *end)
-{
-	const char *p = *start;
-	char buf[24]; /* big enough for 2^64 */
-	size_t i = 0;
-
-	while (p < end && isdigit(*p)) {
-		if (i >= ARRAY_SIZE(buf) - 1)
-			return TIME_MAX;
-		buf[i++] = *p++;
-	}
-	buf[i] = '\0';
-	*start = p;
-	return parse_timestamp(buf, NULL, 10);
-}
-
 static int fsck_ident(const char **ident, const char *ident_end,
 		      const struct object_id *oid, enum object_type type,
 		      struct fsck_options *options)
 {
 	const char *p = *ident;
 	const char *nl;
+	timestamp_t timestamp;
 
 	nl = memchr(p, '\n', ident_end - p);
 	if (!nl)
@@ -933,7 +918,8 @@ static int fsck_ident(const char **ident, const char *ident_end,
 			      "invalid author/committer line - bad date");
 	if (*p == '0' && p[1] != ' ')
 		return report(options, oid, type, FSCK_MSG_ZERO_PADDED_DATE, "invalid author/committer line - zero-padded date");
-	if (date_overflows(parse_timestamp_from_buf(&p, ident_end)))
+	if (!parse_timestamp_from_buf(p, ident_end - p, &p, &timestamp) ||
+	    date_overflows(timestamp))
 		return report(options, oid, type, FSCK_MSG_BAD_DATE_OVERFLOW, "invalid author/committer line - date causes integer overflow");
 	if (*p != ' ')
 		return report(options, oid, type, FSCK_MSG_BAD_DATE, "invalid author/committer line - bad date");
-- 
2.52.0.413.gf695cdb9bd

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* my complaints with clar
  2025-11-30 13:15                   ` [PATCH 2/4] parse: add functions for parsing from non-string buffers Jeff King
@ 2025-11-30 13:46                     ` Jeff King
  2025-12-01 14:16                       ` Phillip Wood
  2025-12-04 11:23                     ` [PATCH 2/4] parse: add functions for parsing from non-string buffers Patrick Steinhardt
  2025-12-05 16:11                     ` Phillip Wood
  2 siblings, 1 reply; 62+ messages in thread
From: Jeff King @ 2025-11-30 13:46 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Phillip Wood, Patrick Steinhardt, Taylor Blau

> --- /dev/null
> +++ b/t/unit-tests/u-parse-int.c

This was my first time writing unit tests with clar, so I thought I'd
document a few rough edges I found. It's possible I'm holding it wrong
in some areas.

> +static void check_int(const char *buf, size_t len,
> +		      size_t expect_ep_ofs, int expect_errno,
> +		      int expect_result)
> +{
> +	const char *ep;
> +	int result;
> +	bool ok = parse_int_from_buf(buf, len, &ep, &result);
> +
> +	if (expect_errno) {
> +		cl_assert(!ok);
> +		cl_assert_equal_i(expect_errno, errno);
> +		return;
> +	}
> +
> +	cl_assert(ok);
> +	cl_assert_equal_i(expect_result, result);
> +	cl_assert_equal_i(expect_ep_ofs, ep - buf);
> +}

The error messages I got on failure from this function were not super
informative. Naively, if you do something like:

  check_int_full("0", 0);
  check_int_full("11", 11);
  check_int_full("-23", -23);
  check_int_full("+23", 23);

and it fails, you'll get not much beyond "expected ok, but it's not
true" with a line number in the helper, but no idea which input failed.
So OK, we have cl_invoke for that, and:

  cl_invoke(check_int_full("0", 0));
  cl_invoke(check_int_full("11", 11));
  cl_invoke(check_int_full("-23", -23));
  cl_invoke(check_int_full("+23", 23));

gives you the line number in the caller. Better, but there's a lot of
cross-referencing the line numbers (plus sprinkling cl_invoke everywhere
is ugly).

What I really would have liked is some notion of "context". If the
helper could have done:

  cl_context("input: %.*s", (int)len, buf);

or similar, and failed assertions print that context, then that would
have made the failing part of the test easy to see, even without using
cl_invoke() at all.

Alternatively, I kind of wonder if cl_invoke() could just stringify the
entire macro argument and shove that into the context. That helps for:

  cl_invoke(check_int_full("11", 11));

though not if parameters are opaque in that line, like:

  cl_invoke(check_int_full(str, expect));

But I think boilerplate-saving helpers tend to be written more like the
first way.

In a more general sense, what I'd really have loved is an automatic
backtrace, but I suspect getting a readable one is impossible. Even if
we knew the called function and the parameters, a generic backtracer
cannot know the meaning of "buf" and "len" enough to show what was in
the buffer.

And of course the more obvious way to avoid that is to break this:

> +void test_parse_int__basic(void)
> +{
> +	cl_invoke(check_int_full("0", 0));
> +	cl_invoke(check_int_full("11", 11));
> +	cl_invoke(check_int_full("-23", -23));
> +	cl_invoke(check_int_full("+23", 23));
> +	cl_invoke(check_int_str("  31337  ", 7, 0, 31337));
> +
> +	cl_invoke(check_int_err("  garbage", EINVAL));
> +	cl_invoke(check_int_err("", EINVAL));
> +	cl_invoke(check_int_err("-", EINVAL));
> +
> +	cl_invoke(check_int("123", 2, 2, 0, 12));
> +}

into a series of nine separate tests, each of which gets a name. But
each of those tests is at least five lines of boilerplate, which sucks
(plus you have to come up with syntactically valid C names for them).

> +	/*
> +	 * Do not use cl_assert_equal_i_fmt(..., PRIuMAX) here. The macro
> +	 * casts to int under the hood, corrupting the values.
> +	 */
> +	clar__assert_equal(CLAR_CURRENT_FILE, CLAR_CURRENT_FUNC,
> +			   CLAR_CURRENT_LINE,
> +			   "expect_result != result", 1,
> +			   "%"PRIuMAX, expect_result, result);
> +}

This was an exciting bug to track down. If you use i_fmt() here, you get
some neat undefined behavior. It worked for gcc, but failed with clang
(but only with -O2!).

Obviously this was me using it wrong, and the "i" in the macro should
have been a hint. But this invocation is kind of ugly, with the explicit
mentions of internal CLAR variables. clar__assert_equal() understands
PRIuMAX as a comparator, but there doesn't appear to be any macro to use
it nicely.

Should there be a generic cl_assert_equal() that fills in the first
few parameters but is otherwise type-agnostic?

It also looked error-prone to me that if you pass in an unknown format
specifier, clar__assert_equal() will assume you want integers. So a
typo, or using an unknown-but-equivalent specifier will give you weird
undefined behavior bugs. These are just tests, so we can perhaps a bit
more loose, but these kinds of things can be hard to track down
(especially if they trigger only in certain compiler combos via CI,
which is what happened to me).

And that brings me to my final complaint. ;)

When the unit-tests fail in CI, you get very little useful feedback,
because the output is eaten by "prove", and the unit tests don't
understand --verbose-log at all. And then to make it more exciting, the
output that clar produces actually chokes prove. For example, if I do
this:

diff --git a/t/unit-tests/u-parse-int.c b/t/unit-tests/u-parse-int.c
index a1601bb16b..da706d5840 100644
--- a/t/unit-tests/u-parse-int.c
+++ b/t/unit-tests/u-parse-int.c
@@ -38,7 +38,7 @@ static void check_int_err(const char *buf, int err)
 void test_parse_int__basic(void)
 {
 	cl_invoke(check_int_full("0", 0));
-	cl_invoke(check_int_full("11", 11));
+	cl_invoke(check_int_full("11", 10));
 	cl_invoke(check_int_full("-23", -23));
 	cl_invoke(check_int_full("+23", 23));

then running t/unit-tests/bin/unit-tests produces:

  # start of suite 10: parse_int
  not ok 59 - parse_int::basic
      ---
      reason: |
        expect_result != result
        10 != 11
      at:
        file: 't/unit-tests/u-parse-int.c'
        line: 41
        function: 'test_parse_int__basic'
      ---

OK, but "prove t/unit-tests/bin/unit-tests" gives me:

  t/unit-tests/bin/unit-tests .. Failed 1/59 subtests

  Test Summary Report
  -------------------
  t/unit-tests/bin/unit-tests (Wstat: (none) Tests: 59 Failed: 1)
    Failed test:  59
    Parse errors: Badly formed hash line: '---' at /usr/share/perl/5.40/TAP/Parser/YAMLish/Reader.pm line 244.

Yuck. It actually does have what I need (that test 59 was the failure),
so the extra parse error is mostly a red herring (though it does prevent
us finding any further failures). I think in TAP that arbitrary output
is supposed to be prefixed with a "#". In test-lib.sh, we solve this by
only allowing "--verbose-log", not regular "-v", under a TAP harness.

I kind of wonder if we should have t0011-unit-tests.sh that simply runs
unit-tests and filters the output into stdout and stderr. But maybe it's
too ugly. I think --verbose-log works because we know in the test code
when we are outputting TAP on stdout, and everything else goes to the
log. But because it's all generated by the unit-tests bin, we'd end up
having to parse its output ourselves and redirect some to stdout and
some to the log. It might be less work to implement --verbose-log in
our clar harness.

Anyway. Those are all of my complaints. For now. ;) I don't know if I'll
work on any of them or not, and maybe people more familiar with clar can
offer suggestions. But I thought it worth documenting the experience.

-Peff

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: my complaints with clar
  2025-11-30 13:46                     ` my complaints with clar Jeff King
@ 2025-12-01 14:16                       ` Phillip Wood
  2025-12-04 11:09                         ` Patrick Steinhardt
  0 siblings, 1 reply; 62+ messages in thread
From: Phillip Wood @ 2025-12-01 14:16 UTC (permalink / raw)
  To: Jeff King, git; +Cc: Junio C Hamano, Patrick Steinhardt, Taylor Blau

Hi Peff

On 30/11/2025 13:46, Jeff King wrote:
> 
>    cl_invoke(check_int_full("0", 0));
>    cl_invoke(check_int_full("11", 11));
>    cl_invoke(check_int_full("-23", -23));
>    cl_invoke(check_int_full("+23", 23));
> 
> gives you the line number in the caller. Better, but there's a lot of
> cross-referencing the line numbers (plus sprinkling cl_invoke everywhere
> is ugly).

The README for the old unit-testing framework recommended wrapping 
helper functions in a macro to pass the file and line numbers from the 
calling site. Perhaps we should do the same with clar

#define check_int_full(input, expect) cl_invoke(check_int_full(input, 
expect))

> What I really would have liked is some notion of "context". If the
> helper could have done:
> 
>    cl_context("input: %.*s", (int)len, buf);
> 
> or similar, and failed assertions print that context, then that would
> have made the failing part of the test easy to see, even without using
> cl_invoke() at all.

If you're writing a helper function you might want to use cl_failf() 
instead of cl_assert_* to provide more context but it's a pain that you 
can't just use the builtin assertions. I've not used them but there are 
assertions named cl_assert_*_ which I think let you add some context. 
One of the features of the conversion of our unit tests from the old 
framework to clar has been a degradation of the diagnostic messages when 
a test fails.

>> +void test_parse_int__basic(void)
>> +{
>> +	cl_invoke(check_int_full("0", 0));
>> +	cl_invoke(check_int_full("11", 11));
>> +	cl_invoke(check_int_full("-23", -23));
>> +	cl_invoke(check_int_full("+23", 23));
>> +	cl_invoke(check_int_str("  31337  ", 7, 0, 31337));
>> +
>> +	cl_invoke(check_int_err("  garbage", EINVAL));
>> +	cl_invoke(check_int_err("", EINVAL));
>> +	cl_invoke(check_int_err("-", EINVAL));
>> +
>> +	cl_invoke(check_int("123", 2, 2, 0, 12));
>> +}
> 
> into a series of nine separate tests, each of which gets a name. But
> each of those tests is at least five lines of boilerplate, which sucks
> (plus you have to come up with syntactically valid C names for them).

Yes that's a pain. One of the nice things about the old framework was 
the the TEST() macro just took an expression and created a test case out 
of it which worked well for tests like this and meant you could have 
table driven tests where each entry in the table was a separate test case.

>> +	/*
>> +	 * Do not use cl_assert_equal_i_fmt(..., PRIuMAX) here. The macro
>> +	 * casts to int under the hood, corrupting the values.
>> +	 */
>> +	clar__assert_equal(CLAR_CURRENT_FILE, CLAR_CURRENT_FUNC,
>> +			   CLAR_CURRENT_LINE,
>> +			   "expect_result != result", 1,
>> +			   "%"PRIuMAX, expect_result, result);
>> +}
> 
> This was an exciting bug to track down. If you use i_fmt() here, you get
> some neat undefined behavior. It worked for gcc, but failed with clang
> (but only with -O2!).
> 
> Obviously this was me using it wrong, and the "i" in the macro should
> have been a hint. But this invocation is kind of ugly, with the explicit
> mentions of internal CLAR variables. clar__assert_equal() understands
> PRIuMAX as a comparator, but there doesn't appear to be any macro to use
> it nicely.
> 
> Should there be a generic cl_assert_equal() that fills in the first
> few parameters but is otherwise type-agnostic?

Patrick's got a PR open for that at 
https://github.com/clar-test/clar/pull/117 it seems to have got stuck 
because of a lack of review.

>    # start of suite 10: parse_int
>    not ok 59 - parse_int::basic
>        ---
>        reason: |
>          expect_result != result
>          10 != 11
>        at:
>          file: 't/unit-tests/u-parse-int.c'
>          line: 41
>          function: 'test_parse_int__basic'
>        ---
> 
> OK, but "prove t/unit-tests/bin/unit-tests" gives me:
> 
>    t/unit-tests/bin/unit-tests .. Failed 1/59 subtests
>    
>    Test Summary Report
>    -------------------
>    t/unit-tests/bin/unit-tests (Wstat: (none) Tests: 59 Failed: 1)
>      Failed test:  59
>      Parse errors: Badly formed hash line: '---' at /usr/share/perl/5.40/TAP/Parser/YAMLish/Reader.pm line 244.
> 
> Yuck. It actually does have what I need (that test 59 was the failure),
> so the extra parse error is mostly a red herring (though it does prevent
> us finding any further failures). I think in TAP that arbitrary output
> is supposed to be prefixed with a "#".

TAP also allows you to embed YAML and unfortunately that's what clar 
tries to do but that last "    ---" line should be "    ...". With the 
diff below (which I'm afraid thunderbird will probably mangle) prove 
parses the output correctly but still does not print the error message. 
I'll update clar's self tests and open a PR later this week.

---- 8< ----
diff --git a/t/unit-tests/clar/clar/print.h b/t/unit-tests/clar/clar/print.h
index 89b66591d75..6a2321b399d 100644
--- a/t/unit-tests/clar/clar/print.h
+++ b/t/unit-tests/clar/clar/print.h
@@ -164,7 +164,7 @@ static void clar_print_tap_ontest(const char 
*suite_name, const char *test_name,
                          printf("      file: '"); 
print_escaped(error->file); printf("'\n");
                          printf("      line: %" PRIuMAX "\n", 
error->line_number);
                          printf("      function: '%s'\n", error->function);
-                        printf("    ---\n");
+                        printf("    ...\n");
                  }

                  break;
---- >8 ----

> In test-lib.sh, we solve this by
> only allowing "--verbose-log", not regular "-v", under a TAP harness.
> 
> I kind of wonder if we should have t0011-unit-tests.sh that simply runs
> unit-tests and filters the output into stdout and stderr.

I don't have a strong opinion on this but now that we don't run the unit 
tests in parallel because clar links them all into a single executable 
there is less reason to use prove.


Thanks

Phillip


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: my complaints with clar
  2025-12-01 14:16                       ` Phillip Wood
@ 2025-12-04 11:09                         ` Patrick Steinhardt
  2025-12-05 18:30                           ` Jeff King
  0 siblings, 1 reply; 62+ messages in thread
From: Patrick Steinhardt @ 2025-12-04 11:09 UTC (permalink / raw)
  To: phillip.wood; +Cc: Jeff King, git, Junio C Hamano, Taylor Blau

On Mon, Dec 01, 2025 at 02:16:13PM +0000, Phillip Wood wrote:
> On 30/11/2025 13:46, Jeff King wrote:
> > > +	/*
> > > +	 * Do not use cl_assert_equal_i_fmt(..., PRIuMAX) here. The macro
> > > +	 * casts to int under the hood, corrupting the values.
> > > +	 */
> > > +	clar__assert_equal(CLAR_CURRENT_FILE, CLAR_CURRENT_FUNC,
> > > +			   CLAR_CURRENT_LINE,
> > > +			   "expect_result != result", 1,
> > > +			   "%"PRIuMAX, expect_result, result);
> > > +}
> > 
> > This was an exciting bug to track down. If you use i_fmt() here, you get
> > some neat undefined behavior. It worked for gcc, but failed with clang
> > (but only with -O2!).
> > 
> > Obviously this was me using it wrong, and the "i" in the macro should
> > have been a hint. But this invocation is kind of ugly, with the explicit
> > mentions of internal CLAR variables. clar__assert_equal() understands
> > PRIuMAX as a comparator, but there doesn't appear to be any macro to use
> > it nicely.
> > 
> > Should there be a generic cl_assert_equal() that fills in the first
> > few parameters but is otherwise type-agnostic?
> 
> Patrick's got a PR open for that at
> https://github.com/clar-test/clar/pull/117 it seems to have got stuck
> because of a lack of review.

Yeah, this is indeed a long-standing issue. As Phillip mentioned I've
already had the fix pending, but I lost track and just never merged it.
Phillip now left a review, and I've polished the PR a bit. I'll wait a
few more days before merging it, and then these issues will be a thing
of the past :)

> >    # start of suite 10: parse_int
> >    not ok 59 - parse_int::basic
> >        ---
> >        reason: |
> >          expect_result != result
> >          10 != 11
> >        at:
> >          file: 't/unit-tests/u-parse-int.c'
> >          line: 41
> >          function: 'test_parse_int__basic'
> >        ---
> > 
> > OK, but "prove t/unit-tests/bin/unit-tests" gives me:
> > 
> >    t/unit-tests/bin/unit-tests .. Failed 1/59 subtests
> >    Test Summary Report
> >    -------------------
> >    t/unit-tests/bin/unit-tests (Wstat: (none) Tests: 59 Failed: 1)
> >      Failed test:  59
> >      Parse errors: Badly formed hash line: '---' at /usr/share/perl/5.40/TAP/Parser/YAMLish/Reader.pm line 244.
> > 
> > Yuck. It actually does have what I need (that test 59 was the failure),
> > so the extra parse error is mostly a red herring (though it does prevent
> > us finding any further failures). I think in TAP that arbitrary output
> > is supposed to be prefixed with a "#".
> 
> TAP also allows you to embed YAML and unfortunately that's what clar tries
> to do but that last "    ---" line should be "    ...". With the diff below
> (which I'm afraid thunderbird will probably mangle) prove parses the output
> correctly but still does not print the error message. I'll update clar's
> self tests and open a PR later this week.
> 
> ---- 8< ----
> diff --git a/t/unit-tests/clar/clar/print.h b/t/unit-tests/clar/clar/print.h
> index 89b66591d75..6a2321b399d 100644
> --- a/t/unit-tests/clar/clar/print.h
> +++ b/t/unit-tests/clar/clar/print.h
> @@ -164,7 +164,7 @@ static void clar_print_tap_ontest(const char
> *suite_name, const char *test_name,
>                          printf("      file: '");
> print_escaped(error->file); printf("'\n");
>                          printf("      line: %" PRIuMAX "\n",
> error->line_number);
>                          printf("      function: '%s'\n", error->function);
> -                        printf("    ---\n");
> +                        printf("    ...\n");
>                  }
> 
>                  break;
> ---- >8 ----

Indeed, this was a plain bug. I've merged your upstream PR, thanks!

I'll send a pull request soonish to update our own version of clar.

Thanks for the feedback! Hope that the pending changes will improve the
status quo.

Patrick

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 2/4] parse: add functions for parsing from non-string buffers
  2025-11-30 13:15                   ` [PATCH 2/4] parse: add functions for parsing from non-string buffers Jeff King
  2025-11-30 13:46                     ` my complaints with clar Jeff King
@ 2025-12-04 11:23                     ` Patrick Steinhardt
  2025-12-05 16:11                     ` Phillip Wood
  2 siblings, 0 replies; 62+ messages in thread
From: Patrick Steinhardt @ 2025-12-04 11:23 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Phillip Wood, git, correctmost, Taylor Blau

On Sun, Nov 30, 2025 at 08:15:37AM -0500, Jeff King wrote:
[snip]
> For the interface:
> 
>   - What do we call it? We have git_parse_int() and friends, which aim
>     to make parsing less error-prone. And in some ways, these are just
>     buffer (rather than string) versions of those functions. But not
>     entirely. Those functions are aimed at parsing a single user-facing
>     value. So they accept a unit prefix (e.g., "10k"), which we won't
>     always want. And they insist that the whole string is consumed
>     (rather than passing back an "end" pointer).
> 
>     We also have strtol_i() and strtoul_ui() wrappers, which try to make
>     error handling simpler (especially around overflow), but mostly
>     behave like their libc counterparts. These also don't pass out an
>     end pointer, though.
> 
>     So I started a new namespace, "parse_<type>_from_buf".

I think it would be nice if we could eventually converge towards a
common namespace here. E.g. `strotol_i()` would then become
`parse_<type>()`, without the `_from_buf()` suffix. That would make it a
bit more discoverable.

Similarly, `git_parse_int()` could become `parse_<type>_with_units()`
eventually.

That certainly doesn't have to be part of this series though.

>   - Like those other functions above, we use an out-parameter to store
>     the result, which lets us return an error code directly. This avoids
>     the complicated errno dance for detecting overflow that you get with
>     strtol().
> 
>     What should the error code look like? git_parse_int() uses a bool
>     for success/failure. But strtol_ui() uses the syscall-like "0 is
>     success, -1 is error" convention.
> 
>     I went with the bool approach here. Since the names are closest to
>     those functions, I thought it would cause the least confusion.

I think that's a sensible choice.

> diff --git a/Makefile b/Makefile
> index 237b56fc9d..751bd40a9f 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -1510,6 +1510,7 @@ CLAR_TEST_SUITES += u-mem-pool
>  CLAR_TEST_SUITES += u-oid-array
>  CLAR_TEST_SUITES += u-oidmap
>  CLAR_TEST_SUITES += u-oidtree
> +CLAR_TEST_SUITES += u-parse-int
>  CLAR_TEST_SUITES += u-prio-queue
>  CLAR_TEST_SUITES += u-reftable-basics
>  CLAR_TEST_SUITES += u-reftable-block
> diff --git a/parse.c b/parse.c
> index f626846def..1dcbcf64a1 100644
> --- a/parse.c
> +++ b/parse.c
> @@ -209,3 +209,99 @@ unsigned long git_env_ulong(const char *k, unsigned long val)
>  		die(_("failed to parse %s"), k);
>  	return val;
>  }
> +
> +/*
> + * Helper that handles both signed/unsigned cases. If "negate" is NULL,
> + * negative values are disallowed. If not NULL and the input is negative,
> + * the value is range-checked but the caller is responsible for actually doing
> + * the negatiion. You probably don't want to use this! Use one of
> + * parse_signed_from_buf() or parse_unsigned_from_buf() below.
> + */
> +static bool parse_from_buf_internal(const char *buf, size_t len,
> +				    const char **ep, bool *negate,
> +				    uintmax_t *ret, uintmax_t max)
> +{
> +	const char *end = buf + len;
> +	uintmax_t val = 0;
> +
> +	while (buf < end && isspace(*buf))
> +		buf++;

Hm. Do we really want to retain the behaviour of skipping leading
spaces? I think it's a rather weird edge case of `strtol()` and friends,
and if we can avoid it I'd prefer to not replicate this behaviour.

> diff --git a/t/unit-tests/u-parse-int.c b/t/unit-tests/u-parse-int.c
> new file mode 100644
> index 0000000000..a1601bb16b
> --- /dev/null
> +++ b/t/unit-tests/u-parse-int.c
> @@ -0,0 +1,98 @@
[snip]
> +void test_parse_int__basic(void)
> +{
> +	cl_invoke(check_int_full("0", 0));
> +	cl_invoke(check_int_full("11", 11));
> +	cl_invoke(check_int_full("-23", -23));
> +	cl_invoke(check_int_full("+23", 23));
> +
> +	cl_invoke(check_int_str("  31337  ", 7, 0, 31337));
> +
> +	cl_invoke(check_int_err("  garbage", EINVAL));
> +	cl_invoke(check_int_err("", EINVAL));
> +	cl_invoke(check_int_err("-", EINVAL));
> +
> +	cl_invoke(check_int("123", 2, 2, 0, 12));
> +}

As Phillip suggested, it might make sense to wrap these `cl_invoke()`
calls into a macro.

Patrick

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 1/4] parse: prefer bool to int for boolean returns
  2025-11-30 13:14                   ` [PATCH 1/4] parse: prefer bool to int for boolean returns Jeff King
@ 2025-12-04 11:23                     ` Patrick Steinhardt
  0 siblings, 0 replies; 62+ messages in thread
From: Patrick Steinhardt @ 2025-12-04 11:23 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Phillip Wood, git, correctmost, Taylor Blau

On Sun, Nov 30, 2025 at 08:14:41AM -0500, Jeff King wrote:
> All of the integer parsing functions in parse.[ch] return an int that is
> "0" for failure or "1" for success. Since most of the other functions in
> Git use "0" for success and "-1" for failure, this can be confusing.
> Let's switch the return types to bool to make it clear that we are using
> this other convention. Callers should not need to update at all.
> 
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> Obviously not strictly necessary for this series, but I think a good
> idea regardless of the rest of it.

Agreed, I think this is a sensible change as it helps guide users. I
know that I was confused by the API several times already.

> diff --git a/parse.c b/parse.c
> index 48313571aa..f626846def 100644
> --- a/parse.c
> +++ b/parse.c
>  int git_parse_maybe_bool_text(const char *value)
> diff --git a/parse.h b/parse.h
> index ea32de9a91..f80cc5b9fd 100644
> --- a/parse.h
> +++ b/parse.h
> @@ -1,13 +1,13 @@
>  #ifndef PARSE_H
>  #define PARSE_H
>  
> -int git_parse_signed(const char *value, intmax_t *ret, intmax_t max);
> -int git_parse_unsigned(const char *value, uintmax_t *ret, uintmax_t max);
> -int git_parse_ssize_t(const char *, ssize_t *);
> -int git_parse_ulong(const char *, unsigned long *);
> -int git_parse_int(const char *value, int *ret);
> -int git_parse_int64(const char *value, int64_t *ret);
> -int git_parse_double(const char *value, double *ret);
> +bool git_parse_signed(const char *value, intmax_t *ret, intmax_t max);
> +bool git_parse_unsigned(const char *value, uintmax_t *ret, uintmax_t max);
> +bool git_parse_ssize_t(const char *, ssize_t *);
> +bool git_parse_ulong(const char *, unsigned long *);
> +bool git_parse_int(const char *value, int *ret);
> +bool git_parse_int64(const char *value, int64_t *ret);
> +bool git_parse_double(const char *value, double *ret);

Should we maybe add a comment to these functions while at it to document
their behaviour?

Patrick

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 2/4] parse: add functions for parsing from non-string buffers
  2025-11-30 13:15                   ` [PATCH 2/4] parse: add functions for parsing from non-string buffers Jeff King
  2025-11-30 13:46                     ` my complaints with clar Jeff King
  2025-12-04 11:23                     ` [PATCH 2/4] parse: add functions for parsing from non-string buffers Patrick Steinhardt
@ 2025-12-05 16:11                     ` Phillip Wood
  2 siblings, 0 replies; 62+ messages in thread
From: Phillip Wood @ 2025-12-05 16:11 UTC (permalink / raw)
  To: Jeff King, Junio C Hamano
  Cc: git, Patrick Steinhardt, correctmost, Taylor Blau

On 30/11/2025 13:15, Jeff King wrote:
> If you have a buffer that is not NUL-terminated but want to parse an
> integer, there aren't many good options. If you use strtol() and
> friends, you risk running off the end of the buffer if there is no
> non-digit terminating character. And even if you carefully make sure
> that there is such a character, ASan's strict-string-check mode will
> still complain.
> 
> You can copy bytes into a temporary buffer, terminate it, and then call
> strtol(), but doing so adds some pitfalls (like making sure you soak up
> whitespace and leading +/- signs, and reporting overflow for overly long
> input). Or you can hand-parse the digits, but then you need to take some
> care to handle overflow (and again, whitespace and +/- signs).
> 
> These things aren't impossible to do right, but it's error-prone to have
> to do them in every spot that wants to do such parsing. So let's add
> some functions which can be used across the code base.
> 
> There are a few choices regarding the interface and the implementation.
> 
> First, the implementation:
> 
>    - I went with with parsing the digits (rather than buffering and
>      passing to libc functions). It ends up being a similar amount of
>      code because we have to do some parsing either way. And likewise
>      overflow detection depends on the exact type the caller wants, so we
>      either have to do it by hand or write a separate wrapper for
>      strtol(), strtoumax(), and so on.
> 
>    - Unsigned overflow detection is done using the same techniques as in
>      unsigned_add_overflows(), etc. We can't use those macros directly
>      because our core function is type-agnostic (so the caller passes in
>      the max value, rather than us deriving it on the fly). This is
>      similar to how git_parse_int(), etc, work.
> 
>    - Signed overflow detection assumes that we can express a negative
>      value with magnitude one larger than our maximum positive value
>      (e.g., -128..127 for a signed 8-bit value). I doubt this is
>      guaranteed by the standard, but it should hold in practice, and we
>      make the same assumption in git_parse_int(), etc. The nice thing
>      about this is that we can derive the range from the number of bits
>      in the type. For ints, you obviously could use INT_MIN..INT_MAX, but
>      for an arbitrary type, we can use maximum_signed_value_of_type().
> 
>    - I didn't bother with handling bases other than 10. It would
>      complicate the code, and I suspect it won't be needed. We could
>      probably retro-fit it later without too much work, if need be.

This all sounds sensible to me and an does the interface description.

> +bool parse_unsigned_from_buf(const char *buf, size_t len, const char **ep,
> +			     uintmax_t *ret, uintmax_t max)
> +{
> +	return parse_from_buf_internal(buf, len, ep, NULL, ret, max);
> +}
> +
> +bool parse_signed_from_buf(const char *buf, size_t len, const char **ep,
> +			   intmax_t *ret, intmax_t max)
> +{
> +	uintmax_t u_ret;
> +	bool negate;
> +
> +	if (!parse_from_buf_internal(buf, len, ep, &negate, &u_ret, max))
> +		return false;
> +	/*
> +	 * Range already checked internally, but we must apply negation
> +	 * ourselves since only we have the signed integer type.
> +	 */
> +	if (negate) {
> +		*ret = u_ret;
> +		*ret = -*ret;

If we're parsing INTMAX_MIN then this negation tries to calculate 
-INTMAX_MIN which is undefined (I've added some tests for parsing 
INTMAX_MAX and INTMAX_MIN at [1] and verified that UBSAN is triggered 
when parsing INTMAX_MIN). We could do

		*ret = u_ret;
		if (*ret != INTMAX_MIN)
			*ret = -*ret;

but I think it might be easier to alter parse_from_buf_internal() to 
make "negate" a local variable, change the function argument to "bool 
allow_negative" and do

		*ret = negate ? 0u - val : val;

Then parse_signed_from_buf() can do "*ret = *u_ret;" to convert the 
output of parse_from_buf_internal() to a signed value.

> diff --git a/t/unit-tests/u-parse-int.c b/t/unit-tests/u-parse-int.c
> new file mode 100644
> index 0000000000..a1601bb16b
> --- /dev/null
> +++ b/t/unit-tests/u-parse-int.c
> @@ -0,0 +1,98 @@
> +#include "unit-test.h"
> +#include "parse.h"
> +
> +static void check_int(const char *buf, size_t len,
> +		      size_t expect_ep_ofs, int expect_errno,
> +		      int expect_result)
> +{
> +	const char *ep;
> +	int result;

Do we want to set errno=0 here so that we can be sure it has been set by 
parse_int_from_buf() when we check it below?

Thanks

Phillip

[1] 
https://github.com/phillipwood/git/commit/e061e3e640db01d4fcf54d265d33352235151973

> +	bool ok = parse_int_from_buf(buf, len, &ep, &result);
> +
> +	if (expect_errno) {
> +		cl_assert(!ok);
> +		cl_assert_equal_i(expect_errno, errno);
> +		return;
> +	}
> +
> +	cl_assert(ok);
> +	cl_assert_equal_i(expect_result, result);
> +	cl_assert_equal_i(expect_ep_ofs, ep - buf);
> +}
> +
> +static void check_int_str(const char *buf, size_t ofs, int err, int res)
> +{
> +	check_int(buf, strlen(buf), ofs, err, res);
> +}
> +
> +static void check_int_full(const char *buf, int res)
> +{
> +	check_int_str(buf, strlen(buf), 0, res);
> +}
> +
> +static void check_int_err(const char *buf, int err)
> +{
> +	check_int(buf, strlen(buf), 0, err, 0);
> +}
> +
> +void test_parse_int__basic(void)
> +{
> +	cl_invoke(check_int_full("0", 0));
> +	cl_invoke(check_int_full("11", 11));
> +	cl_invoke(check_int_full("-23", -23));
> +	cl_invoke(check_int_full("+23", 23));
> +
> +	cl_invoke(check_int_str("  31337  ", 7, 0, 31337));
> +
> +	cl_invoke(check_int_err("  garbage", EINVAL));
> +	cl_invoke(check_int_err("", EINVAL));
> +	cl_invoke(check_int_err("-", EINVAL));
> +
> +	cl_invoke(check_int("123", 2, 2, 0, 12));
> +}
> +
> +void test_parse_int__range(void)
> +{
> +	/*
> +	 * These assume a 32-bit int. We could avoid that with some
> +	 * conditionals, but it's probably better for the test to
> +	 * fail noisily and we can decide how to handle it then.
> +	 */
> +	cl_invoke(check_int_full("2147483647", 2147483647));
> +	cl_invoke(check_int_err("2147483648", ERANGE));
> +	cl_invoke(check_int_full("-2147483647", -2147483647));
> +	cl_invoke(check_int_full("-2147483648", -2147483648));
> +	cl_invoke(check_int_err("-2147483649", ERANGE));
> +}
> +
> +static void check_unsigned(const char *buf, uintmax_t max,
> +			   int expect_errno, uintmax_t expect_result)
> +{
> +	const char *ep;
> +	uintmax_t result;
> +	bool ok = parse_unsigned_from_buf(buf, strlen(buf), &ep, &result, max);
> +
> +	if (expect_errno) {
> +		cl_assert(!ok);
> +		cl_assert_equal_i(expect_errno, errno);
> +		return;
> +	}
> +
> +	cl_assert(ok);
> +	cl_assert_equal_s(ep, "");
> +	/*
> +	 * Do not use cl_assert_equal_i_fmt(..., PRIuMAX) here. The macro
> +	 * casts to int under the hood, corrupting the values.
> +	 */
> +	clar__assert_equal(CLAR_CURRENT_FILE, CLAR_CURRENT_FUNC,
> +			   CLAR_CURRENT_LINE,
> +			   "expect_result != result", 1,
> +			   "%"PRIuMAX, expect_result, result);
> +}
> +
> +void test_parse_int__unsigned(void)
> +{
> +	cl_invoke(check_unsigned("4294967295", UINT_MAX, 0, 4294967295U));
> +	cl_invoke(check_unsigned("1053", 1000, ERANGE, 0));
> +	cl_invoke(check_unsigned("-17", UINT_MAX, EINVAL, 0));
> +}


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: my complaints with clar
  2025-12-04 11:09                         ` Patrick Steinhardt
@ 2025-12-05 18:30                           ` Jeff King
  0 siblings, 0 replies; 62+ messages in thread
From: Jeff King @ 2025-12-05 18:30 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: phillip.wood, git, Junio C Hamano, Taylor Blau

On Thu, Dec 04, 2025 at 12:09:55PM +0100, Patrick Steinhardt wrote:

> > Patrick's got a PR open for that at
> > https://github.com/clar-test/clar/pull/117 it seems to have got stuck
> > because of a lack of review.
> 
> Yeah, this is indeed a long-standing issue. As Phillip mentioned I've
> already had the fix pending, but I lost track and just never merged it.
> Phillip now left a review, and I've polished the PR a bit. I'll wait a
> few more days before merging it, and then these issues will be a thing
> of the past :)

Great, thanks.

> > ---- 8< ----
> > diff --git a/t/unit-tests/clar/clar/print.h b/t/unit-tests/clar/clar/print.h
> > index 89b66591d75..6a2321b399d 100644
> > --- a/t/unit-tests/clar/clar/print.h
> > +++ b/t/unit-tests/clar/clar/print.h
> > @@ -164,7 +164,7 @@ static void clar_print_tap_ontest(const char
> > *suite_name, const char *test_name,
> >                          printf("      file: '");
> > print_escaped(error->file); printf("'\n");
> >                          printf("      line: %" PRIuMAX "\n",
> > error->line_number);
> >                          printf("      function: '%s'\n", error->function);
> > -                        printf("    ---\n");
> > +                        printf("    ...\n");
> >                  }
> > 
> >                  break;
> > ---- >8 ----
> 
> Indeed, this was a plain bug. I've merged your upstream PR, thanks!
> 
> I'll send a pull request soonish to update our own version of clar.
> 
> Thanks for the feedback! Hope that the pending changes will improve the
> status quo.

It is certainly better for the TAP parser not to choke on the output,
but the more fundamental issue remains that the output is never stored
or relayed anywhere in our CI runs.

I looked at implementing --verbose-log in our unit-tests clar wrapper,
but it's tricky. For one thing, we have to know where to store the
test-results (so understanding $TEST_OUTPUT_DIRECTORY and how it may be
set). Plus, we would then need to duplicate much of the output, going to
both the log and to stdout. And all of the output routines are inside
clar. So we'd need hooks there.

The regular test suite uses "tee" to duplicate the output without the
script having to worry about it. We could use the same trick here.

And an easy way to solve both issues is to just call it from the test
suite, letting it handle --verbose-log itself!

Something like this seems to work:

  #!/bin/sh
  test_description='run clar unit tests'
  . ./test-lib.sh
  exec "$TEST_DIRECTORY/unit-tests/bin/unit-tests" ${immediate:+-i}

We have to "exec" there so that we skip out on the exit handler that
checks we called test_done. And we obviously cannot call that, because
it outputs extra TAP.

But the exec means we also miss some cleanup, like removing our trash
directory.

Probably we'd want a mode in the shell test suite that says "this thing
is going to generate TAP, just stay out of its way". We used to have
test_external, but it was removed in 5beca49a0b (test-lib: simplify by
removing test_external, 2022-07-28). The "modern" alternative is:

  test_expect_success 'run unit-tests' '
	unit-tests/bin/unit-tests
  '

which feels worse (the outer layer of TAP output is all-or-nothing, and
the fact that the unit tests produce TAP is irrelevant). I think it
roughly accomplishes the same thing, but it might be worth looking again
at the reasons for dropping test_external in the first place.

-Peff

^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2025-12-05 18:30 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-12  7:55 [PATCH 0/9] asan bonanza Jeff King
2025-11-12  7:56 ` [PATCH 1/9] compat/mmap: mark unused argument in git_munmap() Jeff King
2025-11-12  8:01 ` [PATCH 2/9] pack-bitmap: handle name-hash lookups in incremental bitmaps Jeff King
2025-11-12 11:25   ` Patrick Steinhardt
2025-11-13  2:55   ` Taylor Blau
2025-11-18  8:59     ` Jeff King
2025-11-12  8:02 ` [PATCH 3/9] Makefile: turn on NO_MMAP when building with ASan Jeff King
2025-11-12  8:17   ` Collin Funk
2025-11-12 10:31     ` Jeff King
2025-11-12 20:06       ` Collin Funk
2025-11-12 11:26   ` Patrick Steinhardt
2025-11-13  3:12     ` Taylor Blau
2025-11-13  6:34       ` Patrick Steinhardt
2025-11-18  8:49       ` Jeff King
2025-11-13 16:30     ` Junio C Hamano
2025-11-14  7:00       ` Patrick Steinhardt
2025-11-15  2:13         ` Jeff King
2025-11-12  8:05 ` [PATCH 4/9] cache-tree: avoid strtol() on non-string buffer Jeff King
2025-11-12 11:26   ` Patrick Steinhardt
2025-11-13  3:09     ` Taylor Blau
2025-11-18  8:40       ` Jeff King
2025-11-18  8:38     ` Jeff King
2025-11-12  8:06 ` [PATCH 5/9] fsck: assert newline presence in fsck_ident() Jeff King
2025-11-12  8:06 ` [PATCH 6/9] fsck: avoid strcspn() " Jeff King
2025-11-12  8:06 ` [PATCH 7/9] fsck: remove redundant date timestamp check Jeff King
2025-11-12  8:10 ` [PATCH 8/9] fsck: avoid parse_timestamp() on buffer that isn't NUL-terminated Jeff King
2025-11-12 11:25   ` Patrick Steinhardt
2025-11-12 19:36     ` Junio C Hamano
2025-11-15  2:12     ` Jeff King
2025-11-12  8:10 ` [PATCH 9/9] t: enable ASan's strict_string_checks option Jeff King
2025-11-13  3:17 ` [PATCH 0/9] asan bonanza Taylor Blau
2025-11-18  9:11 ` [PATCH v2 " Jeff King
2025-11-18  9:11   ` [PATCH v2 1/9] compat/mmap: mark unused argument in git_munmap() Jeff King
2025-11-18  9:12   ` [PATCH v2 2/9] pack-bitmap: handle name-hash lookups in incremental bitmaps Jeff King
2025-11-18  9:12   ` [PATCH v2 3/9] Makefile: turn on NO_MMAP when building with ASan Jeff King
2025-11-18  9:12   ` [PATCH v2 4/9] cache-tree: avoid strtol() on non-string buffer Jeff King
2025-11-18 14:30     ` Phillip Wood
2025-11-23  6:19       ` Junio C Hamano
2025-11-23 15:51         ` Phillip Wood
2025-11-23 18:06           ` Junio C Hamano
2025-11-24 22:30         ` Jeff King
2025-11-24 23:09           ` Junio C Hamano
2025-11-26 15:09             ` Jeff King
2025-11-26 17:22               ` Junio C Hamano
2025-11-30 13:13                 ` [PATCH 0/4] more robust functions for parsing int from buf Jeff King
2025-11-30 13:14                   ` [PATCH 1/4] parse: prefer bool to int for boolean returns Jeff King
2025-12-04 11:23                     ` Patrick Steinhardt
2025-11-30 13:15                   ` [PATCH 2/4] parse: add functions for parsing from non-string buffers Jeff King
2025-11-30 13:46                     ` my complaints with clar Jeff King
2025-12-01 14:16                       ` Phillip Wood
2025-12-04 11:09                         ` Patrick Steinhardt
2025-12-05 18:30                           ` Jeff King
2025-12-04 11:23                     ` [PATCH 2/4] parse: add functions for parsing from non-string buffers Patrick Steinhardt
2025-12-05 16:11                     ` Phillip Wood
2025-11-30 13:15                   ` [PATCH 3/4] cache-tree: use parse_int_from_buf() Jeff King
2025-11-30 13:16                   ` [PATCH 4/4] fsck: use parse_unsigned_from_buf() for parsing timestamp Jeff King
2025-11-18  9:12   ` [PATCH v2 5/9] fsck: assert newline presence in fsck_ident() Jeff King
2025-11-18  9:12   ` [PATCH v2 6/9] fsck: avoid strcspn() " Jeff King
2025-11-18  9:12   ` [PATCH v2 7/9] fsck: remove redundant date timestamp check Jeff King
2025-11-18  9:12   ` [PATCH v2 8/9] fsck: avoid parse_timestamp() on buffer that isn't NUL-terminated Jeff King
2025-11-18  9:12   ` [PATCH v2 9/9] t: enable ASan's strict_string_checks option Jeff King
2025-11-23  5:49   ` [PATCH v2 0/9] asan bonanza Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).