[PATCH 0/4] pack-check: fix verification of large objects

public inbox for git@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/4] pack-check: fix verification of large objects
@ 2026-02-23  9:50 Patrick Steinhardt
  2026-02-23  9:50 ` [PATCH 1/4] t/helper: improve "genrandom" test helper Patrick Steinhardt
                   ` (4 more replies)
  0 siblings, 5 replies; 26+ messages in thread
From: Patrick Steinhardt @ 2026-02-23  9:50 UTC (permalink / raw)
  To: git; +Cc: brian m. carlson, Junio C Hamano, Jeff King

Hi,

this small patch series addresses the bug reported by brian in [1].
Thanks!

Patrick

[1]: <20260222183710.2963424-1-sandals@crustytoothpaste.net>

---
Patrick Steinhardt (4):
      t/helper: improve "genrandom" test helper
      object-file: adapt `stream_object_signature()` to take a stream
      packfile: expose function to read object stream for an offset
      pack-check: fix verification of large objects

 object-file.c                         | 10 +++-------
 object-file.h                         |  4 +++-
 object.c                              | 15 ++++++++++++---
 pack-check.c                          | 12 +++++++++---
 packfile.c                            | 36 +++++++++++++++++++----------------
 packfile.h                            |  4 ++++
 t/helper/test-genrandom.c             |  5 ++++-
 t/t1006-cat-file.sh                   |  2 +-
 t/t1050-large.sh                      |  6 +++---
 t/t1450-fsck.sh                       | 17 ++++++++++++++++-
 t/t5301-sliding-window.sh             |  2 +-
 t/t5310-pack-bitmaps.sh               |  2 +-
 t/t5710-promisor-remote-capability.sh |  4 ++--
 t/t7700-repack.sh                     |  6 +++---
 14 files changed, 82 insertions(+), 43 deletions(-)


---
base-commit: 7c02d39fc2ed2702223c7674f73150d9a7e61ba4
change-id: 20260223-pks-fsck-fix-aa8a18a223c8


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/4] t/helper: improve "genrandom" test helper
  2026-02-23  9:50 [PATCH 0/4] pack-check: fix verification of large objects Patrick Steinhardt
@ 2026-02-23  9:50 ` Patrick Steinhardt
  2026-02-23 11:13   ` Jeff King
  2026-02-23 14:01   ` Eric Sunshine
  2026-02-23  9:50 ` [PATCH 2/4] object-file: adapt `stream_object_signature()` to take a stream Patrick Steinhardt
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 26+ messages in thread
From: Patrick Steinhardt @ 2026-02-23  9:50 UTC (permalink / raw)
  To: git; +Cc: brian m. carlson, Junio C Hamano, Jeff King

The `test-tool genrandom` test helper can be used to generate random
data, either as an infinite stream or with a specified number of bytes.
The way we handle parsing the number of bytes is lacking though:

  - We don't have good error handling, so if the caller for example uses
    `test-tool genrandom 200xyz` then we'll end up generating 200 bytes
    of random data successfully.

  - Many callers want to generate e.g. 1 kilobyte or megabyte of data,
    but they have to either use unwieldy numbers like 1048576, or they
    have to precompute them.

Fix both of these issues by using `git_parse_ulong()` to parse the
argumemnt. This function has better error handling, and it knows to
handle unit suffixes.

Adapt a couple of our tests to use suffixes instead of manual
computations.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 t/helper/test-genrandom.c             | 5 ++++-
 t/t1006-cat-file.sh                   | 2 +-
 t/t1050-large.sh                      | 6 +++---
 t/t1450-fsck.sh                       | 2 +-
 t/t5301-sliding-window.sh             | 2 +-
 t/t5310-pack-bitmaps.sh               | 2 +-
 t/t5710-promisor-remote-capability.sh | 4 ++--
 t/t7700-repack.sh                     | 6 +++---
 8 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/t/helper/test-genrandom.c b/t/helper/test-genrandom.c
index 51b67f2f87..77dc31a315 100644
--- a/t/helper/test-genrandom.c
+++ b/t/helper/test-genrandom.c
@@ -6,6 +6,7 @@
 
 #include "test-tool.h"
 #include "git-compat-util.h"
+#include "parse.h"
 
 int cmd__genrandom(int argc, const char **argv)
 {
@@ -22,7 +23,9 @@ int cmd__genrandom(int argc, const char **argv)
 		next = next * 11 + *c;
 	} while (*c++);
 
-	count = (argc == 3) ? strtoul(argv[2], NULL, 0) : ULONG_MAX;
+	count = ULONG_MAX;
+	if (argc == 3 && git_parse_ulong(argv[2], &count) < 0)
+		return error_errno("cannot parse argument '%s'", argv[2]);
 
 	while (count--) {
 		next = next * 1103515245 + 12345;
diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
index 0eee3bb878..5499be8dc9 100755
--- a/t/t1006-cat-file.sh
+++ b/t/t1006-cat-file.sh
@@ -643,7 +643,7 @@ test_expect_success 'object reference via commit text search' '
 '
 
 test_expect_success 'setup blobs which are likely to delta' '
-	test-tool genrandom foo 10240 >foo &&
+	test-tool genrandom foo 10k >foo &&
 	{ cat foo && echo plus; } >foo-plus &&
 	git add foo foo-plus &&
 	git commit -m foo &&
diff --git a/t/t1050-large.sh b/t/t1050-large.sh
index 5be273611a..7d40d08521 100755
--- a/t/t1050-large.sh
+++ b/t/t1050-large.sh
@@ -104,9 +104,9 @@ test_expect_success 'packsize limit' '
 		# mid1 and mid2 will fit within 256k limit but
 		# appending mid3 will bust the limit and will
 		# result in a separate packfile.
-		test-tool genrandom "a" $(( 66 * 1024 )) >mid1 &&
-		test-tool genrandom "b" $(( 80 * 1024 )) >mid2 &&
-		test-tool genrandom "c" $(( 128 * 1024 )) >mid3 &&
+		test-tool genrandom "a" 66k >mid1 &&
+		test-tool genrandom "b" 80k >mid2 &&
+		test-tool genrandom "c" 128k >mid3 &&
 		git add mid1 mid2 mid3 &&
 
 		count=0 &&
diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh
index 3fae05f9d9..8fb79b3e5d 100755
--- a/t/t1450-fsck.sh
+++ b/t/t1450-fsck.sh
@@ -918,7 +918,7 @@ test_expect_success 'fsck detects trailing loose garbage (large blob)' '
 test_expect_success 'fsck detects truncated loose object' '
 	# make it big enough that we know we will truncate in the data
 	# portion, not the header
-	test-tool genrandom truncate 4096 >file &&
+	test-tool genrandom truncate 4k >file &&
 	blob=$(git hash-object -w file) &&
 	file=$(sha1_file $blob) &&
 	test_when_finished "remove_object $blob" &&
diff --git a/t/t5301-sliding-window.sh b/t/t5301-sliding-window.sh
index ff6b5159a3..3c3666b278 100755
--- a/t/t5301-sliding-window.sh
+++ b/t/t5301-sliding-window.sh
@@ -12,7 +12,7 @@ test_expect_success 'setup' '
 	for i in a b c
 	do
 	echo $i >$i &&
-	test-tool genrandom "$i" 32768 >>$i &&
+	test-tool genrandom "$i" 32k >>$i &&
 	git update-index --add $i || return 1
 	done &&
 	echo d >d && cat c >>d && git update-index --add d &&
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 6718fb98c0..3e3366f57d 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -242,7 +242,7 @@ test_bitmap_cases () {
 	'
 
 	test_expect_success 'splitting packs does not generate bogus bitmaps' '
-		test-tool genrandom foo $((1024 * 1024)) >rand &&
+		test-tool genrandom foo 1m >rand &&
 		git add rand &&
 		git commit -m "commit with big file" &&
 		git -c pack.packSizeLimit=500k repack -adb &&
diff --git a/t/t5710-promisor-remote-capability.sh b/t/t5710-promisor-remote-capability.sh
index 023735d6a8..66af84cd56 100755
--- a/t/t5710-promisor-remote-capability.sh
+++ b/t/t5710-promisor-remote-capability.sh
@@ -20,7 +20,7 @@ test_expect_success 'setup: create "template" repository' '
 	test_commit -C template 1 &&
 	test_commit -C template 2 &&
 	test_commit -C template 3 &&
-	test-tool genrandom foo 10240 >template/foo &&
+	test-tool genrandom foo 10k >template/foo &&
 	git -C template add foo &&
 	git -C template commit -m foo
 '
@@ -376,7 +376,7 @@ test_expect_success "clone with promisor.advertise set to 'true' but don't delet
 
 test_expect_success "setup for subsequent fetches" '
 	# Generate new commit with large blob
-	test-tool genrandom bar 10240 >template/bar &&
+	test-tool genrandom bar 10k >template/bar &&
 	git -C template add bar &&
 	git -C template commit -m bar &&
 
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index 73b78bdd88..439ab24d23 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -319,7 +319,7 @@ test_expect_success 'no bitmaps created if .keep files present' '
 
 test_expect_success 'auto-bitmaps do not complain if unavailable' '
 	test_config -C bare.git pack.packSizeLimit 1M &&
-	blob=$(test-tool genrandom big $((1024*1024)) |
+	blob=$(test-tool genrandom big 1m |
 	       git -C bare.git hash-object -w --stdin) &&
 	git -C bare.git update-ref refs/tags/big $blob &&
 
@@ -495,9 +495,9 @@ test_expect_success '--filter works with --max-pack-size' '
 		cd max-pack-size &&
 		test_commit base &&
 		# two blobs which exceed the maximum pack size
-		test-tool genrandom foo 1048576 >foo &&
+		test-tool genrandom foo 1m >foo &&
 		git hash-object -w foo &&
-		test-tool genrandom bar 1048576 >bar &&
+		test-tool genrandom bar 1m >bar &&
 		git hash-object -w bar &&
 		git add foo bar &&
 		git commit -m "adding foo and bar"

-- 
2.53.0.414.gf7e9f6c205.dirty


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/4] object-file: adapt `stream_object_signature()` to take a stream
  2026-02-23  9:50 [PATCH 0/4] pack-check: fix verification of large objects Patrick Steinhardt
  2026-02-23  9:50 ` [PATCH 1/4] t/helper: improve "genrandom" test helper Patrick Steinhardt
@ 2026-02-23  9:50 ` Patrick Steinhardt
  2026-02-23 10:49   ` Jeff King
  2026-02-23  9:50 ` [PATCH 3/4] packfile: expose function to read object stream for an offset Patrick Steinhardt
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 26+ messages in thread
From: Patrick Steinhardt @ 2026-02-23  9:50 UTC (permalink / raw)
  To: git; +Cc: brian m. carlson, Junio C Hamano, Jeff King

The function `stream_object_signature()` is responsible for verifying
whether the given object ID matches the actual hash of the object's
contents. In contrast to `check_object_signature()` it does so in a
streaming fashion so that we don't have to load the full object into
memory.

In a subsequent commit we'll want to adapt one of its callsites to pass
a preconstructed stream. Prepare for this by accepting a stream as input
that the caller needs to assemble.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 object-file.c | 10 +++-------
 object-file.h |  4 +++-
 object.c      | 15 ++++++++++++---
 pack-check.c  | 12 +++++++++---
 4 files changed, 27 insertions(+), 14 deletions(-)

diff --git a/object-file.c b/object-file.c
index 1b62996ef0..ca2c4dddf3 100644
--- a/object-file.c
+++ b/object-file.c
@@ -129,18 +129,15 @@ int check_object_signature(struct repository *r, const struct object_id *oid,
 	return !oideq(oid, &real_oid) ? -1 : 0;
 }
 
-int stream_object_signature(struct repository *r, const struct object_id *oid)
+int stream_object_signature(struct repository *r,
+			    struct odb_read_stream *st,
+			    const struct object_id *oid)
 {
 	struct object_id real_oid;
-	struct odb_read_stream *st;
 	struct git_hash_ctx c;
 	char hdr[MAX_HEADER_LEN];
 	int hdrlen;
 
-	st = odb_read_stream_open(r->objects, oid, NULL);
-	if (!st)
-		return -1;
-
 	/* Generate the header */
 	hdrlen = format_object_header(hdr, sizeof(hdr), st->type, st->size);
 
@@ -160,7 +157,6 @@ int stream_object_signature(struct repository *r, const struct object_id *oid)
 		git_hash_update(&c, buf, readlen);
 	}
 	git_hash_final_oid(&real_oid, &c);
-	odb_read_stream_close(st);
 	return !oideq(oid, &real_oid) ? -1 : 0;
 }
 
diff --git a/object-file.h b/object-file.h
index a62d0de394..733d232309 100644
--- a/object-file.h
+++ b/object-file.h
@@ -164,7 +164,9 @@ int check_object_signature(struct repository *r, const struct object_id *oid,
  * Try reading the object named with "oid" using
  * the streaming interface and rehash it to do the same.
  */
-int stream_object_signature(struct repository *r, const struct object_id *oid);
+int stream_object_signature(struct repository *r,
+			    struct odb_read_stream *stream,
+			    const struct object_id *oid);
 
 enum finalize_object_file_flags {
 	FOF_SKIP_COLLISION_CHECK = 1,
diff --git a/object.c b/object.c
index 4669b8d65e..56d79d77b4 100644
--- a/object.c
+++ b/object.c
@@ -6,6 +6,7 @@
 #include "object.h"
 #include "replace-object.h"
 #include "object-file.h"
+#include "odb/streaming.h"
 #include "blob.h"
 #include "statinfo.h"
 #include "tree.h"
@@ -330,9 +331,17 @@ struct object *parse_object_with_flags(struct repository *r,
 
 	if ((!obj || obj->type == OBJ_NONE || obj->type == OBJ_BLOB) &&
 	    odb_read_object_info(r->objects, oid, NULL) == OBJ_BLOB) {
-		if (!skip_hash && stream_object_signature(r, repl) < 0) {
-			error(_("hash mismatch %s"), oid_to_hex(oid));
-			return NULL;
+		if (!skip_hash) {
+			struct odb_read_stream *stream = odb_read_stream_open(r->objects, oid, NULL);
+			if (!stream || stream_object_signature(r, stream, repl) < 0) {
+				error(_("hash mismatch %s"), oid_to_hex(oid));
+				if (stream)
+					odb_read_stream_close(stream);
+				return NULL;
+			}
+
+			if (stream)
+				odb_read_stream_close(stream);
 		}
 		parse_blob_buffer(lookup_blob(r, oid));
 		return lookup_object(r, oid);
diff --git a/pack-check.c b/pack-check.c
index 67cb2cf72f..46782a29d5 100644
--- a/pack-check.c
+++ b/pack-check.c
@@ -9,6 +9,7 @@
 #include "packfile.h"
 #include "object-file.h"
 #include "odb.h"
+#include "odb/streaming.h"
 
 struct idx_entry {
 	off_t                offset;
@@ -104,6 +105,7 @@ static int verify_packfile(struct repository *r,
 	QSORT(entries, nr_objects, compare_entries);
 
 	for (i = 0; i < nr_objects; i++) {
+		struct odb_read_stream *stream = NULL;
 		void *data;
 		struct object_id oid;
 		enum object_type type;
@@ -152,7 +154,9 @@ static int verify_packfile(struct repository *r,
 							type) < 0)
 			err = error("packed %s from %s is corrupt",
 				    oid_to_hex(&oid), p->pack_name);
-		else if (!data && stream_object_signature(r, &oid) < 0)
+		else if (!data &&
+			 (!(stream = odb_read_stream_open(r->objects, &oid, NULL)) ||
+			  stream_object_signature(r, stream, &oid) < 0))
 			err = error("packed %s from %s is corrupt",
 				    oid_to_hex(&oid), p->pack_name);
 		else if (fn) {
@@ -163,12 +167,14 @@ static int verify_packfile(struct repository *r,
 		}
 		if (((base_count + i) & 1023) == 0)
 			display_progress(progress, base_count + i);
-		free(data);
 
+		if (stream)
+			odb_read_stream_close(stream);
+		free(data);
 	}
+
 	display_progress(progress, base_count + i);
 	free(entries);
-
 	return err;
 }
 

-- 
2.53.0.414.gf7e9f6c205.dirty


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 3/4] packfile: expose function to read object stream for an offset
  2026-02-23  9:50 [PATCH 0/4] pack-check: fix verification of large objects Patrick Steinhardt
  2026-02-23  9:50 ` [PATCH 1/4] t/helper: improve "genrandom" test helper Patrick Steinhardt
  2026-02-23  9:50 ` [PATCH 2/4] object-file: adapt `stream_object_signature()` to take a stream Patrick Steinhardt
@ 2026-02-23  9:50 ` Patrick Steinhardt
  2026-02-23 11:07   ` Jeff King
  2026-02-23  9:50 ` [PATCH 4/4] pack-check: fix verification of large objects Patrick Steinhardt
  2026-02-23 16:00 ` [PATCH v2 0/4] " Patrick Steinhardt
  4 siblings, 1 reply; 26+ messages in thread
From: Patrick Steinhardt @ 2026-02-23  9:50 UTC (permalink / raw)
  To: git; +Cc: brian m. carlson, Junio C Hamano, Jeff King

The function `packfile_store_read_object_stream()` takes as input an
object ID and then constructs a `struct odb_read_stream` from it. In a
subsequent commit we'll want to create an object stream for a given
combination of packfile and offset though, which is not something that
can currently be done.

Extract a new function `packfile_read_object_stream()` that makes this
functionality available.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 packfile.c | 36 ++++++++++++++++++++----------------
 packfile.h |  4 ++++
 2 files changed, 24 insertions(+), 16 deletions(-)

diff --git a/packfile.c b/packfile.c
index 402c3b5dc7..9d795a671f 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2553,29 +2553,21 @@ static int close_istream_pack_non_delta(struct odb_read_stream *_st)
 	return 0;
 }
 
-int packfile_store_read_object_stream(struct odb_read_stream **out,
-				      struct packfile_store *store,
-				      const struct object_id *oid)
+int packfile_read_object_stream(struct odb_read_stream **out,
+				struct packed_git *pack,
+				off_t offset)
 {
 	struct odb_packed_read_stream *stream;
 	struct pack_window *window = NULL;
-	struct object_info oi = OBJECT_INFO_INIT;
 	enum object_type in_pack_type;
 	unsigned long size;
 
-	oi.sizep = &size;
+	in_pack_type = unpack_object_header(pack, &window, &offset, &size);
+	unuse_pack(&window);
 
-	if (packfile_store_read_object_info(store, oid, &oi, 0) ||
-	    oi.u.packed.type == PACKED_OBJECT_TYPE_REF_DELTA ||
-	    oi.u.packed.type == PACKED_OBJECT_TYPE_OFS_DELTA ||
-	    repo_settings_get_big_file_threshold(store->source->odb->repo) >= size)
+	if (repo_settings_get_big_file_threshold(pack->repo) >= size)
 		return -1;
 
-	in_pack_type = unpack_object_header(oi.u.packed.pack,
-					    &window,
-					    &oi.u.packed.offset,
-					    &size);
-	unuse_pack(&window);
 	switch (in_pack_type) {
 	default:
 		return -1; /* we do not do deltas for now */
@@ -2592,10 +2584,22 @@ int packfile_store_read_object_stream(struct odb_read_stream **out,
 	stream->base.type = in_pack_type;
 	stream->base.size = size;
 	stream->z_state = ODB_PACKED_READ_STREAM_UNINITIALIZED;
-	stream->pack = oi.u.packed.pack;
-	stream->pos = oi.u.packed.offset;
+	stream->pack = pack;
+	stream->pos = offset;
 
 	*out = &stream->base;
 
 	return 0;
 }
+
+int packfile_store_read_object_stream(struct odb_read_stream **out,
+				      struct packfile_store *store,
+				      const struct object_id *oid)
+{
+	struct pack_entry e;
+
+	if (!find_pack_entry(store, oid, &e))
+		return -1;
+
+	return packfile_read_object_stream(out, e.p, e.offset);
+}
diff --git a/packfile.h b/packfile.h
index acc5c55ad5..67d5750140 100644
--- a/packfile.h
+++ b/packfile.h
@@ -436,6 +436,10 @@ off_t get_delta_base(struct packed_git *p, struct pack_window **w_curs,
 		     off_t *curpos, enum object_type type,
 		     off_t delta_obj_offset);
 
+int packfile_read_object_stream(struct odb_read_stream **out,
+				struct packed_git *pack,
+				off_t offset);
+
 void release_pack_memory(size_t);
 
 /* global flag to enable extra checks when accessing packed objects */

-- 
2.53.0.414.gf7e9f6c205.dirty


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 4/4] pack-check: fix verification of large objects
  2026-02-23  9:50 [PATCH 0/4] pack-check: fix verification of large objects Patrick Steinhardt
                   ` (2 preceding siblings ...)
  2026-02-23  9:50 ` [PATCH 3/4] packfile: expose function to read object stream for an offset Patrick Steinhardt
@ 2026-02-23  9:50 ` Patrick Steinhardt
  2026-02-23 11:11   ` Jeff King
  2026-02-23 20:35   ` Junio C Hamano
  2026-02-23 16:00 ` [PATCH v2 0/4] " Patrick Steinhardt
  4 siblings, 2 replies; 26+ messages in thread
From: Patrick Steinhardt @ 2026-02-23  9:50 UTC (permalink / raw)
  To: git; +Cc: brian m. carlson, Junio C Hamano, Jeff King

It was reported [1] that git-fsck(1) may sometimes run into an infinite
loop when processing packfiles. This bug was bisected to c31bad4f7d
(packfile: track packs via the MRU list exclusively, 2025-10-30), which
refactored our lsit of packfiles to only be tracked via an MRU list,
exclusively. This isn't entirely surprising: any caller that iterates
through the list of packfiles and then hits `find_pack_entry()`, for
example because they read an object from it, may cause the MRU list to
be updated. And if the caller is unlucky, this may cause the mentioned
infinite loop.

While this mechanism is somewhat fragile, it is still surprising that we
encounter it when verifying the packfile. We iterate through objects in
a given pack one by one and then read them via their offset, and doing
this shouldn't ever end up in `find_pack_entry()`.

But there is an edge case here: when the object in question is a blob
bigger than "core.largeFileThreshold", then we will be careful to not
read it into memory. Instead, we read it via an object stream by calling
`odb_read_object_stream()`, and that function will perform an object
lookup via `odb_read_object_info()`. So in the case where there are at
least two blobs in two different packfiles, and both of these blobs
exceed "core.largeFileThreshold", then we'll run into an infinite loop
because we'll always update the MRU.

We could fix this by improving `repo_for_each_pack()` to not update the
MRU, and this would address the issue. But the fun part is that using
`odb_read_object_stream()` is the wrong thing to do in the first place:
it may open _any_ instance of this object, so we ultimately cannot be
sure that we even verified the object in our given packfile.

Fix this bug by creating the object stream for the packed object
directly via `packfile_read_object_stream()`. Add a test that would have
caused the infinite loop.

[1]: <20260222183710.2963424-1-sandals@crustytoothpaste.net>

Reported-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 pack-check.c    |  2 +-
 t/t1450-fsck.sh | 15 +++++++++++++++
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/pack-check.c b/pack-check.c
index 46782a29d5..6149567060 100644
--- a/pack-check.c
+++ b/pack-check.c
@@ -155,7 +155,7 @@ static int verify_packfile(struct repository *r,
 			err = error("packed %s from %s is corrupt",
 				    oid_to_hex(&oid), p->pack_name);
 		else if (!data &&
-			 (!(stream = odb_read_stream_open(r->objects, &oid, NULL)) ||
+			 (packfile_read_object_stream(&stream, p, entries[i].offset) < 0 ||
 			  stream_object_signature(r, stream, &oid) < 0))
 			err = error("packed %s from %s is corrupt",
 				    oid_to_hex(&oid), p->pack_name);
diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh
index 8fb79b3e5d..ec68397ea3 100755
--- a/t/t1450-fsck.sh
+++ b/t/t1450-fsck.sh
@@ -852,6 +852,21 @@ test_expect_success 'fsck errors in packed objects' '
 	! grep corrupt out
 '
 
+test_expect_success 'fsck handles multiple packfiles with big blobs' '
+	test_when_finished "rm -rf repo" &&
+	git init repo &&
+	(
+		cd repo &&
+		blob_one=$(test-tool genrandom one 200k | git hash-object -t blob -w --stdin) &&
+		blob_two=$(test-tool genrandom two 200k | git hash-object -t blob -w --stdin) &&
+		printf "%s\n" "$blob_one" | git pack-objects .git/objects/pack/pack &&
+		printf "%s\n" "$blob_two" | git pack-objects .git/objects/pack/pack &&
+		remove_object "$blob_one" &&
+		remove_object "$blob_two" &&
+		git -c core.bigFileThreshold=100k fsck
+	)
+'
+
 test_expect_success 'fsck fails on corrupt packfile' '
 	hsh=$(git commit-tree -m mycommit HEAD^{tree}) &&
 	pack=$(echo $hsh | git pack-objects .git/objects/pack/pack) &&

-- 
2.53.0.414.gf7e9f6c205.dirty


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/4] object-file: adapt `stream_object_signature()` to take a stream
  2026-02-23  9:50 ` [PATCH 2/4] object-file: adapt `stream_object_signature()` to take a stream Patrick Steinhardt
@ 2026-02-23 10:49   ` Jeff King
  2026-02-23 12:21     ` Patrick Steinhardt
  0 siblings, 1 reply; 26+ messages in thread
From: Jeff King @ 2026-02-23 10:49 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, brian m. carlson, Junio C Hamano

On Mon, Feb 23, 2026 at 10:50:41AM +0100, Patrick Steinhardt wrote:

> The function `stream_object_signature()` is responsible for verifying
> whether the given object ID matches the actual hash of the object's
> contents. In contrast to `check_object_signature()` it does so in a
> streaming fashion so that we don't have to load the full object into
> memory.

Makes sense, and is the first step I expected to see.

The code looks OK, though a few small comments below.

> -int stream_object_signature(struct repository *r, const struct object_id *oid)
> +int stream_object_signature(struct repository *r,
> +			    struct odb_read_stream *st,
> +			    const struct object_id *oid)
>  {
>  	struct object_id real_oid;
> -	struct odb_read_stream *st;
>  	struct git_hash_ctx c;
>  	char hdr[MAX_HEADER_LEN];
>  	int hdrlen;
>  
> -	st = odb_read_stream_open(r->objects, oid, NULL);
> -	if (!st)
> -		return -1;
> -
>  	/* Generate the header */
>  	hdrlen = format_object_header(hdr, sizeof(hdr), st->type, st->size);
>  
> @@ -160,7 +157,6 @@ int stream_object_signature(struct repository *r, const struct object_id *oid)
>  		git_hash_update(&c, buf, readlen);
>  	}
>  	git_hash_final_oid(&real_oid, &c);
> -	odb_read_stream_close(st);
>  	return !oideq(oid, &real_oid) ? -1 : 0;
>  }

The minimal change for callers would be to give a wrapper like
stream_object_signature_from_oid() or similar, that included the
open/close. But it's not too many lines, and of the two callers, one is
the caller we are trying to split apart anyway. So inlining these bits
in the callers makes sense.

> @@ -330,9 +331,17 @@ struct object *parse_object_with_flags(struct repository *r,
>  
>  	if ((!obj || obj->type == OBJ_NONE || obj->type == OBJ_BLOB) &&
>  	    odb_read_object_info(r->objects, oid, NULL) == OBJ_BLOB) {
> -		if (!skip_hash && stream_object_signature(r, repl) < 0) {
> -			error(_("hash mismatch %s"), oid_to_hex(oid));
> -			return NULL;
> +		if (!skip_hash) {
> +			struct odb_read_stream *stream = odb_read_stream_open(r->objects, oid, NULL);
> +			if (!stream || stream_object_signature(r, stream, repl) < 0) {
> +				error(_("hash mismatch %s"), oid_to_hex(oid));
> +				if (stream)
> +					odb_read_stream_close(stream);
> +				return NULL;
> +			}
> +
> +			if (stream)
> +				odb_read_stream_close(stream);
>  		}

This final "if (stream)" is a noop; we'd have exited the function
already if "!stream".

In the earlier conditional:

  if (!stream || stream_object_signature(r, stream, repl) < 0)

we're combining two error checks: opening the stream and actually
hashing it. That matches the existing code (since it all happened in a
single function), but should we take this opportunity to give more
accurate error messages? I.e., to do:

  if (!stream) {
	error(_("unable to open object stream for %s"), oid_to_hex(oid));
	return NULL;
  }
  if (stream_object_signature(r, stream, repl) < 0) {
	error(_("hash mismatch %s"), oid_to_hex(oid));
	odb_read_stream_close(stream);
	return NULL;
  }
  odb_read_stream_close(stream);

I dunno. It should be quite uncommon to see either of these messages,
but that is sometimes the moment when details are most important.

Also, as an aside, I found it curious that we still need to pass the
repository struct to stream_object_signature(). That's because it needs
to know the correct hash_algo. I wondered if the stream struct itself
might know about that, but it doesn't seem to (it doesn't know anything
about where it came from). So it's unavoidable that we'd need to retain
it.

> @@ -104,6 +105,7 @@ static int verify_packfile(struct repository *r,
>  	QSORT(entries, nr_objects, compare_entries);
>  
>  	for (i = 0; i < nr_objects; i++) {
> +		struct odb_read_stream *stream = NULL;
>  		void *data;
>  		struct object_id oid;
>  		enum object_type type;
> @@ -152,7 +154,9 @@ static int verify_packfile(struct repository *r,
>  							type) < 0)
>  			err = error("packed %s from %s is corrupt",
>  				    oid_to_hex(&oid), p->pack_name);
> -		else if (!data && stream_object_signature(r, &oid) < 0)
> +		else if (!data &&
> +			 (!(stream = odb_read_stream_open(r->objects, &oid, NULL)) ||
> +			  stream_object_signature(r, stream, &oid) < 0))
>  			err = error("packed %s from %s is corrupt",
>  				    oid_to_hex(&oid), p->pack_name);
>  		else if (fn) {
> @@ -163,12 +167,14 @@ static int verify_packfile(struct repository *r,
>  		}
>  		if (((base_count + i) & 1023) == 0)
>  			display_progress(progress, base_count + i);
> -		free(data);
>  
> +		if (stream)
> +			odb_read_stream_close(stream);
> +		free(data);
>  	}

OK, and in this case the final "if" is important, because we just set
"err" and continue through the function. This would be much simpler with
a wrapper, but of course this is the very caller we're planning to fix.

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/4] packfile: expose function to read object stream for an offset
  2026-02-23  9:50 ` [PATCH 3/4] packfile: expose function to read object stream for an offset Patrick Steinhardt
@ 2026-02-23 11:07   ` Jeff King
  2026-02-23 12:21     ` Patrick Steinhardt
  0 siblings, 1 reply; 26+ messages in thread
From: Jeff King @ 2026-02-23 11:07 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, brian m. carlson, Junio C Hamano

On Mon, Feb 23, 2026 at 10:50:42AM +0100, Patrick Steinhardt wrote:

> The function `packfile_store_read_object_stream()` takes as input an
> object ID and then constructs a `struct odb_read_stream` from it. In a
> subsequent commit we'll want to create an object stream for a given
> combination of packfile and offset though, which is not something that
> can currently be done.
> 
> Extract a new function `packfile_read_object_stream()` that makes this
> functionality available.

Yup, makes sense. There's one part that puzzled me at first (but I
figured out), and one part I'm not quite sure of.

> -int packfile_store_read_object_stream(struct odb_read_stream **out,
> -				      struct packfile_store *store,
> -				      const struct object_id *oid)
> +int packfile_read_object_stream(struct odb_read_stream **out,
> +				struct packed_git *pack,
> +				off_t offset)
>  {
>  	struct odb_packed_read_stream *stream;
>  	struct pack_window *window = NULL;
> -	struct object_info oi = OBJECT_INFO_INIT;
>  	enum object_type in_pack_type;
>  	unsigned long size;
>  
> -	oi.sizep = &size;
> +	in_pack_type = unpack_object_header(pack, &window, &offset, &size);
> +	unuse_pack(&window);
>  
> -	if (packfile_store_read_object_info(store, oid, &oi, 0) ||
> -	    oi.u.packed.type == PACKED_OBJECT_TYPE_REF_DELTA ||
> -	    oi.u.packed.type == PACKED_OBJECT_TYPE_OFS_DELTA ||
> -	    repo_settings_get_big_file_threshold(store->source->odb->repo) >= size)
> +	if (repo_settings_get_big_file_threshold(pack->repo) >= size)
>  		return -1;
>  
> -	in_pack_type = unpack_object_header(oi.u.packed.pack,
> -					    &window,
> -					    &oi.u.packed.offset,
> -					    &size);
> -	unuse_pack(&window);

Before we were checking big_file_threshold up front, and now we must
call unpack_object_header() first. But that's because we got the size
for "free" as part of the object_info call that found our pack entry.

Now our caller is responsible for finding the entry. Our wrapper _could_
continue to provide us with the size, but I don't think there is any
efficiency to be gained. Once we have the pack/offset pair, both code
paths will call unpack_object_header() to find it cheaply. And the new
code is even a little more efficient.

But what about the checks for deltas? We've dropped them completely. I
think that's OK, though, because later we have:

>  	switch (in_pack_type) {
>  	default:
>  		return -1; /* we do not do deltas for now */

So they were somewhat redundant in the first place, and just avoided
calling unpack_object_header() for cases where we knew we could not use
the result (which again, was already filled by packed_object_info() in
the same way).

Good.

> +int packfile_store_read_object_stream(struct odb_read_stream **out,
> +				      struct packfile_store *store,
> +				      const struct object_id *oid)
> +{
> +	struct pack_entry e;
> +
> +	if (!find_pack_entry(store, oid, &e))
> +		return -1;
> +
> +	return packfile_read_object_stream(out, e.p, e.offset);
> +}

OK. The original read via packfile_store_read_object_info(), which does
a bit more work. It called packed_object_info() and if necessary would
trigger mark_bad_packed_object(). But now that we are leaving it to
packfile_read_object_stream() to look at the header, we don't need to
load any object info, and we have no error code to check.

It does make me wonder, though, if we are missing out on marking bad
objects here. The idea is that we'd usually do something like:

  1. some code wants to access $OID

  2. we find $OID in pack $P

  3. that turns out to be broken for some reason, so we mark it as bad

  4. we try again, skipping $P and finding it in some other pack

But now I wonder if code that tries to stream will skip step 3, and then
in step 4 we'll find the same broken $P over and over.

But I suspect if that is possible, it was already true. We were only
asking for the type and size, so any content-level corruption wouldn't
be caught here and we'd have the same issue. I think the right thing is
probably for the streaming code to know about the pack/oid pair it's
trying to read, and to mark it as bad if it hits an error.

So your patch here might be making the problem a tiny bit worse, but not
in a material way. I think we can ignore it for now.

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] pack-check: fix verification of large objects
  2026-02-23  9:50 ` [PATCH 4/4] pack-check: fix verification of large objects Patrick Steinhardt
@ 2026-02-23 11:11   ` Jeff King
  2026-02-23 11:30     ` Patrick Steinhardt
  2026-02-23 20:35   ` Junio C Hamano
  1 sibling, 1 reply; 26+ messages in thread
From: Jeff King @ 2026-02-23 11:11 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, brian m. carlson, Junio C Hamano

On Mon, Feb 23, 2026 at 10:50:43AM +0100, Patrick Steinhardt wrote:

> diff --git a/pack-check.c b/pack-check.c
> index 46782a29d5..6149567060 100644
> --- a/pack-check.c
> +++ b/pack-check.c
> @@ -155,7 +155,7 @@ static int verify_packfile(struct repository *r,
>  			err = error("packed %s from %s is corrupt",
>  				    oid_to_hex(&oid), p->pack_name);
>  		else if (!data &&
> -			 (!(stream = odb_read_stream_open(r->objects, &oid, NULL)) ||
> +			 (packfile_read_object_stream(&stream, p, entries[i].offset) < 0 ||

And now this change is delightfully simple.

> +test_expect_success 'fsck handles multiple packfiles with big blobs' '
> +	test_when_finished "rm -rf repo" &&
> +	git init repo &&
> +	(
> +		cd repo &&
> +		blob_one=$(test-tool genrandom one 200k | git hash-object -t blob -w --stdin) &&
> +		blob_two=$(test-tool genrandom two 200k | git hash-object -t blob -w --stdin) &&
> +		printf "%s\n" "$blob_one" | git pack-objects .git/objects/pack/pack &&
> +		printf "%s\n" "$blob_two" | git pack-objects .git/objects/pack/pack &&
> +		remove_object "$blob_one" &&
> +		remove_object "$blob_two" &&
> +		git -c core.bigFileThreshold=100k fsck
> +	)
> +'

I like seeing this much-more-specific test case. It does sort of become
a noop if we fix the iteration problem, though.

A more concrete test would probably be something like:

   1. Two packs, $X and $Y, both contain the same object.

   2. The object is corrupt in $X but not in $Y.

   3. Running fsck detects that one copy is corrupt but the other is
      not.

Right now it may or may not fail depending on the ordering of the packs
in the MRU list (which we might be able to tweak via mtimes). But
hopefully in the "after" state it should deterministically complain
about $X.

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/4] t/helper: improve "genrandom" test helper
  2026-02-23  9:50 ` [PATCH 1/4] t/helper: improve "genrandom" test helper Patrick Steinhardt
@ 2026-02-23 11:13   ` Jeff King
  2026-02-23 12:20     ` Patrick Steinhardt
  2026-02-23 14:01   ` Eric Sunshine
  1 sibling, 1 reply; 26+ messages in thread
From: Jeff King @ 2026-02-23 11:13 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, brian m. carlson, Junio C Hamano

On Mon, Feb 23, 2026 at 10:50:40AM +0100, Patrick Steinhardt wrote:

> Fix both of these issues by using `git_parse_ulong()` to parse the
> argumemnt. This function has better error handling, and it knows to
> handle unit suffixes.

Makes sense, but...

> @@ -22,7 +23,9 @@ int cmd__genrandom(int argc, const char **argv)
>  		next = next * 11 + *c;
>  	} while (*c++);
>  
> -	count = (argc == 3) ? strtoul(argv[2], NULL, 0) : ULONG_MAX;
> +	count = ULONG_MAX;
> +	if (argc == 3 && git_parse_ulong(argv[2], &count) < 0)
> +		return error_errno("cannot parse argument '%s'", argv[2]);

...I think the return value of git_parse_ulong() is boolean 0/1, not
0/negative.

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] pack-check: fix verification of large objects
  2026-02-23 11:11   ` Jeff King
@ 2026-02-23 11:30     ` Patrick Steinhardt
  2026-02-23 12:58       ` Jeff King
  0 siblings, 1 reply; 26+ messages in thread
From: Patrick Steinhardt @ 2026-02-23 11:30 UTC (permalink / raw)
  To: Jeff King; +Cc: git, brian m. carlson, Junio C Hamano

On Mon, Feb 23, 2026 at 06:11:20AM -0500, Jeff King wrote:
> On Mon, Feb 23, 2026 at 10:50:43AM +0100, Patrick Steinhardt wrote:
> 
> > diff --git a/pack-check.c b/pack-check.c
> > index 46782a29d5..6149567060 100644
> > --- a/pack-check.c
> > +++ b/pack-check.c
> > @@ -155,7 +155,7 @@ static int verify_packfile(struct repository *r,
> >  			err = error("packed %s from %s is corrupt",
> >  				    oid_to_hex(&oid), p->pack_name);
> >  		else if (!data &&
> > -			 (!(stream = odb_read_stream_open(r->objects, &oid, NULL)) ||
> > +			 (packfile_read_object_stream(&stream, p, entries[i].offset) < 0 ||
> 
> And now this change is delightfully simple.
> 
> > +test_expect_success 'fsck handles multiple packfiles with big blobs' '
> > +	test_when_finished "rm -rf repo" &&
> > +	git init repo &&
> > +	(
> > +		cd repo &&
> > +		blob_one=$(test-tool genrandom one 200k | git hash-object -t blob -w --stdin) &&
> > +		blob_two=$(test-tool genrandom two 200k | git hash-object -t blob -w --stdin) &&
> > +		printf "%s\n" "$blob_one" | git pack-objects .git/objects/pack/pack &&
> > +		printf "%s\n" "$blob_two" | git pack-objects .git/objects/pack/pack &&
> > +		remove_object "$blob_one" &&
> > +		remove_object "$blob_two" &&
> > +		git -c core.bigFileThreshold=100k fsck
> > +	)
> > +'
> 
> I like seeing this much-more-specific test case. It does sort of become
> a noop if we fix the iteration problem, though.
> 
> A more concrete test would probably be something like:
> 
>    1. Two packs, $X and $Y, both contain the same object.
> 
>    2. The object is corrupt in $X but not in $Y.
> 
>    3. Running fsck detects that one copy is corrupt but the other is
>       not.
> 
> Right now it may or may not fail depending on the ordering of the packs
> in the MRU list (which we might be able to tweak via mtimes). But
> hopefully in the "after" state it should deterministically complain
> about $X.

Yeah. The problem I had here is that I'm not sure whether we have any
tools to reliably create a corrupted object, e.g. with a hash mismatch.
I'll have a look for v2.

Thanks!

Patrick

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/4] t/helper: improve "genrandom" test helper
  2026-02-23 11:13   ` Jeff King
@ 2026-02-23 12:20     ` Patrick Steinhardt
  0 siblings, 0 replies; 26+ messages in thread
From: Patrick Steinhardt @ 2026-02-23 12:20 UTC (permalink / raw)
  To: Jeff King; +Cc: git, brian m. carlson, Junio C Hamano

On Mon, Feb 23, 2026 at 06:13:46AM -0500, Jeff King wrote:
> On Mon, Feb 23, 2026 at 10:50:40AM +0100, Patrick Steinhardt wrote:
> > @@ -22,7 +23,9 @@ int cmd__genrandom(int argc, const char **argv)
> >  		next = next * 11 + *c;
> >  	} while (*c++);
> >  
> > -	count = (argc == 3) ? strtoul(argv[2], NULL, 0) : ULONG_MAX;
> > +	count = ULONG_MAX;
> > +	if (argc == 3 && git_parse_ulong(argv[2], &count) < 0)
> > +		return error_errno("cannot parse argument '%s'", argv[2]);
> 
> ...I think the return value of git_parse_ulong() is boolean 0/1, not
> 0/negative.

Ugh, right. Will fix.

Patrick

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/4] object-file: adapt `stream_object_signature()` to take a stream
  2026-02-23 10:49   ` Jeff King
@ 2026-02-23 12:21     ` Patrick Steinhardt
  2026-02-23 12:59       ` Jeff King
  0 siblings, 1 reply; 26+ messages in thread
From: Patrick Steinhardt @ 2026-02-23 12:21 UTC (permalink / raw)
  To: Jeff King; +Cc: git, brian m. carlson, Junio C Hamano

On Mon, Feb 23, 2026 at 05:49:15AM -0500, Jeff King wrote:
> On Mon, Feb 23, 2026 at 10:50:41AM +0100, Patrick Steinhardt wrote:
> 
> > The function `stream_object_signature()` is responsible for verifying
> > whether the given object ID matches the actual hash of the object's
> > contents. In contrast to `check_object_signature()` it does so in a
> > streaming fashion so that we don't have to load the full object into
> > memory.
> 
> Makes sense, and is the first step I expected to see.
> 
> The code looks OK, though a few small comments below.
> 
> > -int stream_object_signature(struct repository *r, const struct object_id *oid)
> > +int stream_object_signature(struct repository *r,
> > +			    struct odb_read_stream *st,
> > +			    const struct object_id *oid)
> >  {
> >  	struct object_id real_oid;
> > -	struct odb_read_stream *st;
> >  	struct git_hash_ctx c;
> >  	char hdr[MAX_HEADER_LEN];
> >  	int hdrlen;
> >  
> > -	st = odb_read_stream_open(r->objects, oid, NULL);
> > -	if (!st)
> > -		return -1;
> > -
> >  	/* Generate the header */
> >  	hdrlen = format_object_header(hdr, sizeof(hdr), st->type, st->size);
> >  
> > @@ -160,7 +157,6 @@ int stream_object_signature(struct repository *r, const struct object_id *oid)
> >  		git_hash_update(&c, buf, readlen);
> >  	}
> >  	git_hash_final_oid(&real_oid, &c);
> > -	odb_read_stream_close(st);
> >  	return !oideq(oid, &real_oid) ? -1 : 0;
> >  }
> 
> The minimal change for callers would be to give a wrapper like
> stream_object_signature_from_oid() or similar, that included the
> open/close. But it's not too many lines, and of the two callers, one is
> the caller we are trying to split apart anyway. So inlining these bits
> in the callers makes sense.

Yeah. Had there been more callers then I'd have done that, but as you
noted there's only two, and one of them will need to be open coded
anyway.

> > @@ -330,9 +331,17 @@ struct object *parse_object_with_flags(struct repository *r,
> >  
> >  	if ((!obj || obj->type == OBJ_NONE || obj->type == OBJ_BLOB) &&
> >  	    odb_read_object_info(r->objects, oid, NULL) == OBJ_BLOB) {
> > -		if (!skip_hash && stream_object_signature(r, repl) < 0) {
> > -			error(_("hash mismatch %s"), oid_to_hex(oid));
> > -			return NULL;
> > +		if (!skip_hash) {
> > +			struct odb_read_stream *stream = odb_read_stream_open(r->objects, oid, NULL);
> > +			if (!stream || stream_object_signature(r, stream, repl) < 0) {
> > +				error(_("hash mismatch %s"), oid_to_hex(oid));
> > +				if (stream)
> > +					odb_read_stream_close(stream);
> > +				return NULL;
> > +			}
> > +
> > +			if (stream)
> > +				odb_read_stream_close(stream);
> >  		}
> 
> This final "if (stream)" is a noop; we'd have exited the function
> already if "!stream".
> 
> In the earlier conditional:
> 
>   if (!stream || stream_object_signature(r, stream, repl) < 0)
> 
> we're combining two error checks: opening the stream and actually
> hashing it.

Ah, right.

> That matches the existing code (since it all happened in a
> single function), but should we take this opportunity to give more
> accurate error messages? I.e., to do:
> 
>   if (!stream) {
> 	error(_("unable to open object stream for %s"), oid_to_hex(oid));
> 	return NULL;
>   }
>   if (stream_object_signature(r, stream, repl) < 0) {
> 	error(_("hash mismatch %s"), oid_to_hex(oid));
> 	odb_read_stream_close(stream);
> 	return NULL;
>   }
>   odb_read_stream_close(stream);
> 
> I dunno. It should be quite uncommon to see either of these messages,
> but that is sometimes the moment when details are most important.

Agreed, that feels like a sensible change indeed. Also makes the code
flow easier to follow in my opinion.

> Also, as an aside, I found it curious that we still need to pass the
> repository struct to stream_object_signature(). That's because it needs
> to know the correct hash_algo. I wondered if the stream struct itself
> might know about that, but it doesn't seem to (it doesn't know anything
> about where it came from). So it's unavoidable that we'd need to retain
> it.

Yeah, agreed. I wondered whether we should eventually extend `struct
odb_read_stream` to have a pointer to the owning object source, and in
that case we could've avoided the extra repository parameter. But I
decided it was out of scope for this patch series, also because I don't
want to cause conflicts with other stuff I'm working on in this vicinity
:)

Patrick

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/4] packfile: expose function to read object stream for an offset
  2026-02-23 11:07   ` Jeff King
@ 2026-02-23 12:21     ` Patrick Steinhardt
  2026-02-23 13:12       ` Jeff King
  0 siblings, 1 reply; 26+ messages in thread
From: Patrick Steinhardt @ 2026-02-23 12:21 UTC (permalink / raw)
  To: Jeff King; +Cc: git, brian m. carlson, Junio C Hamano

On Mon, Feb 23, 2026 at 06:07:22AM -0500, Jeff King wrote:
> On Mon, Feb 23, 2026 at 10:50:42AM +0100, Patrick Steinhardt wrote:
> > +int packfile_store_read_object_stream(struct odb_read_stream **out,
> > +				      struct packfile_store *store,
> > +				      const struct object_id *oid)
> > +{
> > +	struct pack_entry e;
> > +
> > +	if (!find_pack_entry(store, oid, &e))
> > +		return -1;
> > +
> > +	return packfile_read_object_stream(out, e.p, e.offset);
> > +}
> 
> OK. The original read via packfile_store_read_object_info(), which does
> a bit more work. It called packed_object_info() and if necessary would
> trigger mark_bad_packed_object(). But now that we are leaving it to
> packfile_read_object_stream() to look at the header, we don't need to
> load any object info, and we have no error code to check.
> 
> It does make me wonder, though, if we are missing out on marking bad
> objects here. The idea is that we'd usually do something like:
> 
>   1. some code wants to access $OID
> 
>   2. we find $OID in pack $P
> 
>   3. that turns out to be broken for some reason, so we mark it as bad
> 
>   4. we try again, skipping $P and finding it in some other pack
> 
> But now I wonder if code that tries to stream will skip step 3, and then
> in step 4 we'll find the same broken $P over and over.
> 
> But I suspect if that is possible, it was already true. We were only
> asking for the type and size, so any content-level corruption wouldn't
> be caught here and we'd have the same issue. I think the right thing is
> probably for the streaming code to know about the pack/oid pair it's
> trying to read, and to mark it as bad if it hits an error.
> 
> So your patch here might be making the problem a tiny bit worse, but not
> in a material way. I think we can ignore it for now.

I guess the "tiny bit worse" part is that we don't handle the case
anymore where `unpack_object_header()` returns `OBJ_BAD`. As you say, we
previously didn't fully parse the object anyway, so we couldn't have
detected all kinds of corruptions. But we definitely handled the case
where `unpack_object_header()` failed.

So maybe we should do something like the below patch?

Patrick

diff --git a/packfile.c b/packfile.c
index 9d795a671f..3e61176128 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2554,6 +2554,7 @@ static int close_istream_pack_non_delta(struct odb_read_stream *_st)
 }
 
 int packfile_read_object_stream(struct odb_read_stream **out,
+				const struct object_id *oid,
 				struct packed_git *pack,
 				off_t offset)
 {
@@ -2571,6 +2572,9 @@ int packfile_read_object_stream(struct odb_read_stream **out,
 	switch (in_pack_type) {
 	default:
 		return -1; /* we do not do deltas for now */
+	case OBJ_BAD:
+		mark_bad_packed_object(pack, oid);
+		return -1;
 	case OBJ_COMMIT:
 	case OBJ_TREE:
 	case OBJ_BLOB:
@@ -2601,5 +2605,5 @@ int packfile_store_read_object_stream(struct odb_read_stream **out,
 	if (!find_pack_entry(store, oid, &e))
 		return -1;
 
-	return packfile_read_object_stream(out, e.p, e.offset);
+	return packfile_read_object_stream(out, oid, e.p, e.offset);
 }
diff --git a/packfile.h b/packfile.h
index 67d5750140..b9f5f1c18c 100644
--- a/packfile.h
+++ b/packfile.h
@@ -437,6 +437,7 @@ off_t get_delta_base(struct packed_git *p, struct pack_window **w_curs,
 		     off_t delta_obj_offset);
 
 int packfile_read_object_stream(struct odb_read_stream **out,
+				const struct object_id *oid,
 				struct packed_git *pack,
 				off_t offset);
 

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] pack-check: fix verification of large objects
  2026-02-23 11:30     ` Patrick Steinhardt
@ 2026-02-23 12:58       ` Jeff King
  2026-02-23 15:48         ` Patrick Steinhardt
  0 siblings, 1 reply; 26+ messages in thread
From: Jeff King @ 2026-02-23 12:58 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, brian m. carlson, Junio C Hamano

On Mon, Feb 23, 2026 at 12:30:35PM +0100, Patrick Steinhardt wrote:

> > A more concrete test would probably be something like:
> > 
> >    1. Two packs, $X and $Y, both contain the same object.
> > 
> >    2. The object is corrupt in $X but not in $Y.
> > 
> >    3. Running fsck detects that one copy is corrupt but the other is
> >       not.
> > 
> > Right now it may or may not fail depending on the ordering of the packs
> > in the MRU list (which we might be able to tweak via mtimes). But
> > hopefully in the "after" state it should deterministically complain
> > about $X.
> 
> Yeah. The problem I had here is that I'm not sure whether we have any
> tools to reliably create a corrupted object, e.g. with a hash mismatch.
> I'll have a look for v2.

You can see how do_corrupt_object() in t5303 does it. It's basically
finding an offset via show-index and then writing a zero over it with
dd.

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/4] object-file: adapt `stream_object_signature()` to take a stream
  2026-02-23 12:21     ` Patrick Steinhardt
@ 2026-02-23 12:59       ` Jeff King
  0 siblings, 0 replies; 26+ messages in thread
From: Jeff King @ 2026-02-23 12:59 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, brian m. carlson, Junio C Hamano

On Mon, Feb 23, 2026 at 01:21:00PM +0100, Patrick Steinhardt wrote:

> > That matches the existing code (since it all happened in a
> > single function), but should we take this opportunity to give more
> > accurate error messages? I.e., to do:
> > 
> >   if (!stream) {
> > 	error(_("unable to open object stream for %s"), oid_to_hex(oid));
> > 	return NULL;
> >   }
> >   if (stream_object_signature(r, stream, repl) < 0) {
> > 	error(_("hash mismatch %s"), oid_to_hex(oid));
> > 	odb_read_stream_close(stream);
> > 	return NULL;
> >   }
> >   odb_read_stream_close(stream);
> > 
> > I dunno. It should be quite uncommon to see either of these messages,
> > but that is sometimes the moment when details are most important.
> 
> Agreed, that feels like a sensible change indeed. Also makes the code
> flow easier to follow in my opinion.

Yeah, the readability was actually what got me thinking on it in the
first place.

> > Also, as an aside, I found it curious that we still need to pass the
> > repository struct to stream_object_signature(). That's because it needs
> > to know the correct hash_algo. I wondered if the stream struct itself
> > might know about that, but it doesn't seem to (it doesn't know anything
> > about where it came from). So it's unavoidable that we'd need to retain
> > it.
> 
> Yeah, agreed. I wondered whether we should eventually extend `struct
> odb_read_stream` to have a pointer to the owning object source, and in
> that case we could've avoided the extra repository parameter. But I
> decided it was out of scope for this patch series, also because I don't
> want to cause conflicts with other stuff I'm working on in this vicinity
> :)

Yes, definitely out of scope for this series.

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/4] packfile: expose function to read object stream for an offset
  2026-02-23 12:21     ` Patrick Steinhardt
@ 2026-02-23 13:12       ` Jeff King
  2026-02-23 15:59         ` Patrick Steinhardt
  0 siblings, 1 reply; 26+ messages in thread
From: Jeff King @ 2026-02-23 13:12 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, brian m. carlson, Junio C Hamano

On Mon, Feb 23, 2026 at 01:21:06PM +0100, Patrick Steinhardt wrote:

> > So your patch here might be making the problem a tiny bit worse, but not
> > in a material way. I think we can ignore it for now.
> 
> I guess the "tiny bit worse" part is that we don't handle the case
> anymore where `unpack_object_header()` returns `OBJ_BAD`. As you say, we
> previously didn't fully parse the object anyway, so we couldn't have
> detected all kinds of corruptions. But we definitely handled the case
> where `unpack_object_header()` failed.

Yeah, I think that would cover it. Technically packed_object_info()
could error on more cases (e.g., errors chasing delta bases for
type/size info). But we would bail on trying to stream those anyway, so
presumably any errors would be found via the non-streaming code paths in
those cases.

> So maybe we should do something like the below patch?
> [...]
> @@ -2571,6 +2572,9 @@ int packfile_read_object_stream(struct odb_read_stream **out,
>  	switch (in_pack_type) {
>  	default:
>  		return -1; /* we do not do deltas for now */
> +	case OBJ_BAD:
> +		mark_bad_packed_object(pack, oid);
> +		return -1;
>  	case OBJ_COMMIT:
>  	case OBJ_TREE:
>  	case OBJ_BLOB:

I think that restores the original behavior. But I'm not sure it's even
worth it. We are still missing the much more likely case of a bit error
in the actual zlib stream, which would not be caught until much later.

So yeah, if you want to feel better about making sure your patch keeps
the behavior as identical as possible, I don't mind adding this. But it
feels like the tip of the iceberg, and I'd be OK leaving it for later
(or never).

My biggest objection is not the two lines above (which I actually think
clarify what is going on) but rather this interface change:

>  int packfile_read_object_stream(struct odb_read_stream **out,
> +				const struct object_id *oid,
>  				struct packed_git *pack,
>  				off_t offset);

Now we are back to taking an oid, except we don't ever use it to look up
the object! So it's a little misleading that it's there at all. It may
be the best we can do, though.

The only other way I could think of is for packfile_read_object_stream()
to return a more detailed error: one of "success", "chose not to
stream", or "broken object". And then the caller can call
mark_bad_packed_object() as appropriate. In this case, I think
packfile_store_read_object_stream() would do so, but verify_pack()
probably would not choose to (it is not interested in fallbacks at all
but is going through an individual pack).

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/4] t/helper: improve "genrandom" test helper
  2026-02-23  9:50 ` [PATCH 1/4] t/helper: improve "genrandom" test helper Patrick Steinhardt
  2026-02-23 11:13   ` Jeff King
@ 2026-02-23 14:01   ` Eric Sunshine
  1 sibling, 0 replies; 26+ messages in thread
From: Eric Sunshine @ 2026-02-23 14:01 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, brian m. carlson, Junio C Hamano, Jeff King

On Mon, Feb 23, 2026 at 4:51 AM Patrick Steinhardt <ps@pks.im> wrote:
> The `test-tool genrandom` test helper can be used to generate random
> data, either as an infinite stream or with a specified number of bytes.
> The way we handle parsing the number of bytes is lacking though:
>
>   - We don't have good error handling, so if the caller for example uses
>     `test-tool genrandom 200xyz` then we'll end up generating 200 bytes
>     of random data successfully.
>
>   - Many callers want to generate e.g. 1 kilobyte or megabyte of data,
>     but they have to either use unwieldy numbers like 1048576, or they
>     have to precompute them.
>
> Fix both of these issues by using `git_parse_ulong()` to parse the
> argumemnt. This function has better error handling, and it knows to
> handle unit suffixes.

s/argumemnt/argument/

> Adapt a couple of our tests to use suffixes instead of manual
> computations.
>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] pack-check: fix verification of large objects
  2026-02-23 12:58       ` Jeff King
@ 2026-02-23 15:48         ` Patrick Steinhardt
  0 siblings, 0 replies; 26+ messages in thread
From: Patrick Steinhardt @ 2026-02-23 15:48 UTC (permalink / raw)
  To: Jeff King; +Cc: git, brian m. carlson, Junio C Hamano

On Mon, Feb 23, 2026 at 07:58:43AM -0500, Jeff King wrote:
> On Mon, Feb 23, 2026 at 12:30:35PM +0100, Patrick Steinhardt wrote:
> 
> > > A more concrete test would probably be something like:
> > > 
> > >    1. Two packs, $X and $Y, both contain the same object.
> > > 
> > >    2. The object is corrupt in $X but not in $Y.
> > > 
> > >    3. Running fsck detects that one copy is corrupt but the other is
> > >       not.
> > > 
> > > Right now it may or may not fail depending on the ordering of the packs
> > > in the MRU list (which we might be able to tweak via mtimes). But
> > > hopefully in the "after" state it should deterministically complain
> > > about $X.
> > 
> > Yeah. The problem I had here is that I'm not sure whether we have any
> > tools to reliably create a corrupted object, e.g. with a hash mismatch.
> > I'll have a look for v2.
> 
> You can see how do_corrupt_object() in t5303 does it. It's basically
> finding an offset via show-index and then writing a zero over it with
> dd.

Yeah, that's what I ended up doing indeed. I spotted such a test in t1450.

Patrick

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/4] packfile: expose function to read object stream for an offset
  2026-02-23 13:12       ` Jeff King
@ 2026-02-23 15:59         ` Patrick Steinhardt
  0 siblings, 0 replies; 26+ messages in thread
From: Patrick Steinhardt @ 2026-02-23 15:59 UTC (permalink / raw)
  To: Jeff King; +Cc: git, brian m. carlson, Junio C Hamano

On Mon, Feb 23, 2026 at 08:12:01AM -0500, Jeff King wrote:
> On Mon, Feb 23, 2026 at 01:21:06PM +0100, Patrick Steinhardt wrote:
> 
> > > So your patch here might be making the problem a tiny bit worse, but not
> > > in a material way. I think we can ignore it for now.
> > 
> > I guess the "tiny bit worse" part is that we don't handle the case
> > anymore where `unpack_object_header()` returns `OBJ_BAD`. As you say, we
> > previously didn't fully parse the object anyway, so we couldn't have
> > detected all kinds of corruptions. But we definitely handled the case
> > where `unpack_object_header()` failed.
> 
> Yeah, I think that would cover it. Technically packed_object_info()
> could error on more cases (e.g., errors chasing delta bases for
> type/size info). But we would bail on trying to stream those anyway, so
> presumably any errors would be found via the non-streaming code paths in
> those cases.
> 
> > So maybe we should do something like the below patch?
> > [...]
> > @@ -2571,6 +2572,9 @@ int packfile_read_object_stream(struct odb_read_stream **out,
> >  	switch (in_pack_type) {
> >  	default:
> >  		return -1; /* we do not do deltas for now */
> > +	case OBJ_BAD:
> > +		mark_bad_packed_object(pack, oid);
> > +		return -1;
> >  	case OBJ_COMMIT:
> >  	case OBJ_TREE:
> >  	case OBJ_BLOB:
> 
> I think that restores the original behavior. But I'm not sure it's even
> worth it. We are still missing the much more likely case of a bit error
> in the actual zlib stream, which would not be caught until much later.
> 
> So yeah, if you want to feel better about making sure your patch keeps
> the behavior as identical as possible, I don't mind adding this. But it
> feels like the tip of the iceberg, and I'd be OK leaving it for later
> (or never).
> 
> My biggest objection is not the two lines above (which I actually think
> clarify what is going on) but rather this interface change:
> 
> >  int packfile_read_object_stream(struct odb_read_stream **out,
> > +				const struct object_id *oid,
> >  				struct packed_git *pack,
> >  				off_t offset);
> 
> Now we are back to taking an oid, except we don't ever use it to look up
> the object! So it's a little misleading that it's there at all. It may
> be the best we can do, though.

Yeah, I agree it's a bit ugly. But the function is not likely to gain a
lot of additional callers anyway, as it is an implementation detail of
the packfile store. So overall I think it's okayish.

> The only other way I could think of is for packfile_read_object_stream()
> to return a more detailed error: one of "success", "chose not to
> stream", or "broken object". And then the caller can call
> mark_bad_packed_object() as appropriate. In this case, I think
> packfile_store_read_object_stream() would do so, but verify_pack()
> probably would not choose to (it is not interested in fallbacks at all
> but is going through an individual pack).

We could of course have it return the `enum object_type` directly, which
gives us enough context to do this. But on the other hand it'd mean that
we might now miss adding calls to `mark_bad_packed_object()` in the
future, and it causes a bit of repetition across callsites.

So I think I lean towards my proposed patch.

Thanks!

Patrick

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v2 0/4] pack-check: fix verification of large objects
  2026-02-23  9:50 [PATCH 0/4] pack-check: fix verification of large objects Patrick Steinhardt
                   ` (3 preceding siblings ...)
  2026-02-23  9:50 ` [PATCH 4/4] pack-check: fix verification of large objects Patrick Steinhardt
@ 2026-02-23 16:00 ` Patrick Steinhardt
  2026-02-23 16:00   ` [PATCH v2 1/4] t/helper: improve "genrandom" test helper Patrick Steinhardt
                     ` (3 more replies)
  4 siblings, 4 replies; 26+ messages in thread
From: Patrick Steinhardt @ 2026-02-23 16:00 UTC (permalink / raw)
  To: git; +Cc: brian m. carlson, Junio C Hamano, Jeff King, Eric Sunshine

Hi,

this small patch series addresses the bug reported by brian in [1].
Thanks!

Changes in v2:
  - Extend the test to verify that we actually find corrupted objects in
    both packs, even in the case where a non-corrupt version exists in
    another pack.
  - Reinstate `mark_packed_object_bad()`.
  - Fix error checking for `git_parse_ulong()`.
  - Disambiguate error conditions in `parse_object_with_flags()`.
  - Link to v1: https://lore.kernel.org/r/20260223-pks-fsck-fix-v1-0-c29036832b6e@pks.im

Patrick

[1]: <20260222183710.2963424-1-sandals@crustytoothpaste.net>

---
Patrick Steinhardt (4):
      t/helper: improve "genrandom" test helper
      object-file: adapt `stream_object_signature()` to take a stream
      packfile: expose function to read object stream for an offset
      pack-check: fix verification of large objects

 object-file.c                         | 10 +++------
 object-file.h                         |  4 +++-
 object.c                              | 19 ++++++++++++++---
 pack-check.c                          | 12 ++++++++---
 packfile.c                            | 40 +++++++++++++++++++++--------------
 packfile.h                            |  5 +++++
 t/helper/test-genrandom.c             |  5 ++++-
 t/t1006-cat-file.sh                   |  2 +-
 t/t1050-large.sh                      |  6 +++---
 t/t1450-fsck.sh                       | 40 ++++++++++++++++++++++++++++++++++-
 t/t5301-sliding-window.sh             |  2 +-
 t/t5310-pack-bitmaps.sh               |  2 +-
 t/t5710-promisor-remote-capability.sh |  4 ++--
 t/t7700-repack.sh                     |  6 +++---
 14 files changed, 114 insertions(+), 43 deletions(-)

Range-diff versus v1:

1:  1b1283e837 ! 1:  daf895aef6 t/helper: improve "genrandom" test helper
    @@ Commit message
             have to precompute them.
     
         Fix both of these issues by using `git_parse_ulong()` to parse the
    -    argumemnt. This function has better error handling, and it knows to
    +    argument. This function has better error handling, and it knows to
         handle unit suffixes.
     
         Adapt a couple of our tests to use suffixes instead of manual
    @@ t/helper/test-genrandom.c: int cmd__genrandom(int argc, const char **argv)
      
     -	count = (argc == 3) ? strtoul(argv[2], NULL, 0) : ULONG_MAX;
     +	count = ULONG_MAX;
    -+	if (argc == 3 && git_parse_ulong(argv[2], &count) < 0)
    ++	if (argc == 3 && !git_parse_ulong(argv[2], &count))
     +		return error_errno("cannot parse argument '%s'", argv[2]);
      
      	while (count--) {
2:  9f25ed1a4b ! 2:  ebca9efaec object-file: adapt `stream_object_signature()` to take a stream
    @@ Commit message
         a preconstructed stream. Prepare for this by accepting a stream as input
         that the caller needs to assemble.
     
    +    While at it, improve the error reporting in `parse_object_with_flags()`
    +    to tell apart the two failure modes.
    +
    +    Helped-by: Jeff King <peff@peff.net>
         Signed-off-by: Patrick Steinhardt <ps@pks.im>
     
      ## object-file.c ##
    @@ object.c: struct object *parse_object_with_flags(struct repository *r,
     -			return NULL;
     +		if (!skip_hash) {
     +			struct odb_read_stream *stream = odb_read_stream_open(r->objects, oid, NULL);
    -+			if (!stream || stream_object_signature(r, stream, repl) < 0) {
    -+				error(_("hash mismatch %s"), oid_to_hex(oid));
    -+				if (stream)
    -+					odb_read_stream_close(stream);
    ++
    ++			if (!stream) {
    ++				error(_("unable to open object stream for %s"), oid_to_hex(oid));
     +				return NULL;
     +			}
     +
    -+			if (stream)
    ++			if (stream_object_signature(r, stream, repl) < 0) {
    ++				error(_("hash mismatch %s"), oid_to_hex(oid));
     +				odb_read_stream_close(stream);
    ++				return NULL;
    ++			}
    ++
    ++			odb_read_stream_close(stream);
      		}
      		parse_blob_buffer(lookup_blob(r, oid));
      		return lookup_object(r, oid);
3:  9a28867564 ! 3:  bead96797e packfile: expose function to read object stream for an offset
    @@ packfile.c: static int close_istream_pack_non_delta(struct odb_read_stream *_st)
     -				      struct packfile_store *store,
     -				      const struct object_id *oid)
     +int packfile_read_object_stream(struct odb_read_stream **out,
    ++				const struct object_id *oid,
     +				struct packed_git *pack,
     +				off_t offset)
      {
    @@ packfile.c: static int close_istream_pack_non_delta(struct odb_read_stream *_st)
      	switch (in_pack_type) {
      	default:
      		return -1; /* we do not do deltas for now */
    ++	case OBJ_BAD:
    ++		mark_bad_packed_object(pack, oid);
    ++		return -1;
    + 	case OBJ_COMMIT:
    + 	case OBJ_TREE:
    + 	case OBJ_BLOB:
     @@ packfile.c: int packfile_store_read_object_stream(struct odb_read_stream **out,
      	stream->base.type = in_pack_type;
      	stream->base.size = size;
    @@ packfile.c: int packfile_store_read_object_stream(struct odb_read_stream **out,
     +	if (!find_pack_entry(store, oid, &e))
     +		return -1;
     +
    -+	return packfile_read_object_stream(out, e.p, e.offset);
    ++	return packfile_read_object_stream(out, oid, e.p, e.offset);
     +}
     
      ## packfile.h ##
    @@ packfile.h: off_t get_delta_base(struct packed_git *p, struct pack_window **w_cu
      		     off_t delta_obj_offset);
      
     +int packfile_read_object_stream(struct odb_read_stream **out,
    ++				const struct object_id *oid,
     +				struct packed_git *pack,
     +				off_t offset);
     +
4:  4eaf958e57 ! 4:  6b69624d81 pack-check: fix verification of large objects
    @@ pack-check.c: static int verify_packfile(struct repository *r,
      				    oid_to_hex(&oid), p->pack_name);
      		else if (!data &&
     -			 (!(stream = odb_read_stream_open(r->objects, &oid, NULL)) ||
    -+			 (packfile_read_object_stream(&stream, p, entries[i].offset) < 0 ||
    ++			 (packfile_read_object_stream(&stream, &oid, p, entries[i].offset) < 0 ||
      			  stream_object_signature(r, stream, &oid) < 0))
      			err = error("packed %s from %s is corrupt",
      				    oid_to_hex(&oid), p->pack_name);
    @@ t/t1450-fsck.sh: test_expect_success 'fsck errors in packed objects' '
     +	git init repo &&
     +	(
     +		cd repo &&
    ++
    ++		# We construct two packfiles with two objects in common and one
    ++		# object not in common. The objects in common can then be
    ++		# corrupted in one of the packfiles, respectively. The other
    ++		# objects that are unique to the packs are merely used to not
    ++		# have both packs contain the same data.
     +		blob_one=$(test-tool genrandom one 200k | git hash-object -t blob -w --stdin) &&
     +		blob_two=$(test-tool genrandom two 200k | git hash-object -t blob -w --stdin) &&
    -+		printf "%s\n" "$blob_one" | git pack-objects .git/objects/pack/pack &&
    -+		printf "%s\n" "$blob_two" | git pack-objects .git/objects/pack/pack &&
    -+		remove_object "$blob_one" &&
    -+		remove_object "$blob_two" &&
    -+		git -c core.bigFileThreshold=100k fsck
    ++		blob_three=$(test-tool genrandom three 200k | git hash-object -t blob -w --stdin) &&
    ++		blob_four=$(test-tool genrandom four 200k | git hash-object -t blob -w --stdin) &&
    ++		pack_one=$(printf "%s\n" "$blob_one" "$blob_two" "$blob_three" | git pack-objects .git/objects/pack/pack) &&
    ++		pack_two=$(printf "%s\n" "$blob_two" "$blob_three" "$blob_four" | git pack-objects .git/objects/pack/pack) &&
    ++		chmod a+w .git/objects/pack/pack-*.pack &&
    ++
    ++		# Corrupt blob two in the first pack.
    ++		git verify-pack -v .git/objects/pack/pack-$pack_one >objects &&
    ++		offset_one=$(sed <objects -n "s/^$blob_two .* \(.*\)$/\1/p") &&
    ++		printf "\0" | dd of=.git/objects/pack/pack-$pack_one.pack bs=1 conv=notrunc seek=$offset_one &&
    ++
    ++		# Corrupt blob three in the second pack.
    ++		git verify-pack -v .git/objects/pack/pack-$pack_two >objects &&
    ++		offset_two=$(sed <objects -n "s/^$blob_three .* \(.*\)$/\1/p") &&
    ++		printf "\0" | dd of=.git/objects/pack/pack-$pack_two.pack bs=1 conv=notrunc seek=$offset_two &&
    ++
    ++		# We now expect to see two failures for the corrupted objects,
    ++		# even though they exist in a non-corrupted form in the
    ++		# respective other pack.
    ++		test_must_fail git -c core.bigFileThreshold=100k fsck 2>err &&
    ++		test_grep "unknown object type 0 at offset $offset_one in .git/objects/pack/pack-$pack_one.pack" err &&
    ++		test_grep "unknown object type 0 at offset $offset_two in .git/objects/pack/pack-$pack_two.pack" err
     +	)
     +'
     +

---
base-commit: 7c02d39fc2ed2702223c7674f73150d9a7e61ba4
change-id: 20260223-pks-fsck-fix-aa8a18a223c8


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v2 1/4] t/helper: improve "genrandom" test helper
  2026-02-23 16:00 ` [PATCH v2 0/4] " Patrick Steinhardt
@ 2026-02-23 16:00   ` Patrick Steinhardt
  2026-02-23 16:00   ` [PATCH v2 2/4] object-file: adapt `stream_object_signature()` to take a stream Patrick Steinhardt
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 26+ messages in thread
From: Patrick Steinhardt @ 2026-02-23 16:00 UTC (permalink / raw)
  To: git; +Cc: brian m. carlson, Junio C Hamano, Jeff King, Eric Sunshine

The `test-tool genrandom` test helper can be used to generate random
data, either as an infinite stream or with a specified number of bytes.
The way we handle parsing the number of bytes is lacking though:

  - We don't have good error handling, so if the caller for example uses
    `test-tool genrandom 200xyz` then we'll end up generating 200 bytes
    of random data successfully.

  - Many callers want to generate e.g. 1 kilobyte or megabyte of data,
    but they have to either use unwieldy numbers like 1048576, or they
    have to precompute them.

Fix both of these issues by using `git_parse_ulong()` to parse the
argument. This function has better error handling, and it knows to
handle unit suffixes.

Adapt a couple of our tests to use suffixes instead of manual
computations.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 t/helper/test-genrandom.c             | 5 ++++-
 t/t1006-cat-file.sh                   | 2 +-
 t/t1050-large.sh                      | 6 +++---
 t/t1450-fsck.sh                       | 2 +-
 t/t5301-sliding-window.sh             | 2 +-
 t/t5310-pack-bitmaps.sh               | 2 +-
 t/t5710-promisor-remote-capability.sh | 4 ++--
 t/t7700-repack.sh                     | 6 +++---
 8 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/t/helper/test-genrandom.c b/t/helper/test-genrandom.c
index 51b67f2f87..d681961abb 100644
--- a/t/helper/test-genrandom.c
+++ b/t/helper/test-genrandom.c
@@ -6,6 +6,7 @@
 
 #include "test-tool.h"
 #include "git-compat-util.h"
+#include "parse.h"
 
 int cmd__genrandom(int argc, const char **argv)
 {
@@ -22,7 +23,9 @@ int cmd__genrandom(int argc, const char **argv)
 		next = next * 11 + *c;
 	} while (*c++);
 
-	count = (argc == 3) ? strtoul(argv[2], NULL, 0) : ULONG_MAX;
+	count = ULONG_MAX;
+	if (argc == 3 && !git_parse_ulong(argv[2], &count))
+		return error_errno("cannot parse argument '%s'", argv[2]);
 
 	while (count--) {
 		next = next * 1103515245 + 12345;
diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
index 0eee3bb878..5499be8dc9 100755
--- a/t/t1006-cat-file.sh
+++ b/t/t1006-cat-file.sh
@@ -643,7 +643,7 @@ test_expect_success 'object reference via commit text search' '
 '
 
 test_expect_success 'setup blobs which are likely to delta' '
-	test-tool genrandom foo 10240 >foo &&
+	test-tool genrandom foo 10k >foo &&
 	{ cat foo && echo plus; } >foo-plus &&
 	git add foo foo-plus &&
 	git commit -m foo &&
diff --git a/t/t1050-large.sh b/t/t1050-large.sh
index 5be273611a..7d40d08521 100755
--- a/t/t1050-large.sh
+++ b/t/t1050-large.sh
@@ -104,9 +104,9 @@ test_expect_success 'packsize limit' '
 		# mid1 and mid2 will fit within 256k limit but
 		# appending mid3 will bust the limit and will
 		# result in a separate packfile.
-		test-tool genrandom "a" $(( 66 * 1024 )) >mid1 &&
-		test-tool genrandom "b" $(( 80 * 1024 )) >mid2 &&
-		test-tool genrandom "c" $(( 128 * 1024 )) >mid3 &&
+		test-tool genrandom "a" 66k >mid1 &&
+		test-tool genrandom "b" 80k >mid2 &&
+		test-tool genrandom "c" 128k >mid3 &&
 		git add mid1 mid2 mid3 &&
 
 		count=0 &&
diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh
index 3fae05f9d9..8fb79b3e5d 100755
--- a/t/t1450-fsck.sh
+++ b/t/t1450-fsck.sh
@@ -918,7 +918,7 @@ test_expect_success 'fsck detects trailing loose garbage (large blob)' '
 test_expect_success 'fsck detects truncated loose object' '
 	# make it big enough that we know we will truncate in the data
 	# portion, not the header
-	test-tool genrandom truncate 4096 >file &&
+	test-tool genrandom truncate 4k >file &&
 	blob=$(git hash-object -w file) &&
 	file=$(sha1_file $blob) &&
 	test_when_finished "remove_object $blob" &&
diff --git a/t/t5301-sliding-window.sh b/t/t5301-sliding-window.sh
index ff6b5159a3..3c3666b278 100755
--- a/t/t5301-sliding-window.sh
+++ b/t/t5301-sliding-window.sh
@@ -12,7 +12,7 @@ test_expect_success 'setup' '
 	for i in a b c
 	do
 	echo $i >$i &&
-	test-tool genrandom "$i" 32768 >>$i &&
+	test-tool genrandom "$i" 32k >>$i &&
 	git update-index --add $i || return 1
 	done &&
 	echo d >d && cat c >>d && git update-index --add d &&
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 6718fb98c0..3e3366f57d 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -242,7 +242,7 @@ test_bitmap_cases () {
 	'
 
 	test_expect_success 'splitting packs does not generate bogus bitmaps' '
-		test-tool genrandom foo $((1024 * 1024)) >rand &&
+		test-tool genrandom foo 1m >rand &&
 		git add rand &&
 		git commit -m "commit with big file" &&
 		git -c pack.packSizeLimit=500k repack -adb &&
diff --git a/t/t5710-promisor-remote-capability.sh b/t/t5710-promisor-remote-capability.sh
index 023735d6a8..66af84cd56 100755
--- a/t/t5710-promisor-remote-capability.sh
+++ b/t/t5710-promisor-remote-capability.sh
@@ -20,7 +20,7 @@ test_expect_success 'setup: create "template" repository' '
 	test_commit -C template 1 &&
 	test_commit -C template 2 &&
 	test_commit -C template 3 &&
-	test-tool genrandom foo 10240 >template/foo &&
+	test-tool genrandom foo 10k >template/foo &&
 	git -C template add foo &&
 	git -C template commit -m foo
 '
@@ -376,7 +376,7 @@ test_expect_success "clone with promisor.advertise set to 'true' but don't delet
 
 test_expect_success "setup for subsequent fetches" '
 	# Generate new commit with large blob
-	test-tool genrandom bar 10240 >template/bar &&
+	test-tool genrandom bar 10k >template/bar &&
 	git -C template add bar &&
 	git -C template commit -m bar &&
 
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index 73b78bdd88..439ab24d23 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -319,7 +319,7 @@ test_expect_success 'no bitmaps created if .keep files present' '
 
 test_expect_success 'auto-bitmaps do not complain if unavailable' '
 	test_config -C bare.git pack.packSizeLimit 1M &&
-	blob=$(test-tool genrandom big $((1024*1024)) |
+	blob=$(test-tool genrandom big 1m |
 	       git -C bare.git hash-object -w --stdin) &&
 	git -C bare.git update-ref refs/tags/big $blob &&
 
@@ -495,9 +495,9 @@ test_expect_success '--filter works with --max-pack-size' '
 		cd max-pack-size &&
 		test_commit base &&
 		# two blobs which exceed the maximum pack size
-		test-tool genrandom foo 1048576 >foo &&
+		test-tool genrandom foo 1m >foo &&
 		git hash-object -w foo &&
-		test-tool genrandom bar 1048576 >bar &&
+		test-tool genrandom bar 1m >bar &&
 		git hash-object -w bar &&
 		git add foo bar &&
 		git commit -m "adding foo and bar"

-- 
2.53.0.536.g309c995771.dirty


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 2/4] object-file: adapt `stream_object_signature()` to take a stream
  2026-02-23 16:00 ` [PATCH v2 0/4] " Patrick Steinhardt
  2026-02-23 16:00   ` [PATCH v2 1/4] t/helper: improve "genrandom" test helper Patrick Steinhardt
@ 2026-02-23 16:00   ` Patrick Steinhardt
  2026-02-23 16:00   ` [PATCH v2 3/4] packfile: expose function to read object stream for an offset Patrick Steinhardt
  2026-02-23 16:00   ` [PATCH v2 4/4] pack-check: fix verification of large objects Patrick Steinhardt
  3 siblings, 0 replies; 26+ messages in thread
From: Patrick Steinhardt @ 2026-02-23 16:00 UTC (permalink / raw)
  To: git; +Cc: brian m. carlson, Junio C Hamano, Jeff King, Eric Sunshine

The function `stream_object_signature()` is responsible for verifying
whether the given object ID matches the actual hash of the object's
contents. In contrast to `check_object_signature()` it does so in a
streaming fashion so that we don't have to load the full object into
memory.

In a subsequent commit we'll want to adapt one of its callsites to pass
a preconstructed stream. Prepare for this by accepting a stream as input
that the caller needs to assemble.

While at it, improve the error reporting in `parse_object_with_flags()`
to tell apart the two failure modes.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 object-file.c | 10 +++-------
 object-file.h |  4 +++-
 object.c      | 19 ++++++++++++++++---
 pack-check.c  | 12 +++++++++---
 4 files changed, 31 insertions(+), 14 deletions(-)

diff --git a/object-file.c b/object-file.c
index 1b62996ef0..ca2c4dddf3 100644
--- a/object-file.c
+++ b/object-file.c
@@ -129,18 +129,15 @@ int check_object_signature(struct repository *r, const struct object_id *oid,
 	return !oideq(oid, &real_oid) ? -1 : 0;
 }
 
-int stream_object_signature(struct repository *r, const struct object_id *oid)
+int stream_object_signature(struct repository *r,
+			    struct odb_read_stream *st,
+			    const struct object_id *oid)
 {
 	struct object_id real_oid;
-	struct odb_read_stream *st;
 	struct git_hash_ctx c;
 	char hdr[MAX_HEADER_LEN];
 	int hdrlen;
 
-	st = odb_read_stream_open(r->objects, oid, NULL);
-	if (!st)
-		return -1;
-
 	/* Generate the header */
 	hdrlen = format_object_header(hdr, sizeof(hdr), st->type, st->size);
 
@@ -160,7 +157,6 @@ int stream_object_signature(struct repository *r, const struct object_id *oid)
 		git_hash_update(&c, buf, readlen);
 	}
 	git_hash_final_oid(&real_oid, &c);
-	odb_read_stream_close(st);
 	return !oideq(oid, &real_oid) ? -1 : 0;
 }
 
diff --git a/object-file.h b/object-file.h
index a62d0de394..733d232309 100644
--- a/object-file.h
+++ b/object-file.h
@@ -164,7 +164,9 @@ int check_object_signature(struct repository *r, const struct object_id *oid,
  * Try reading the object named with "oid" using
  * the streaming interface and rehash it to do the same.
  */
-int stream_object_signature(struct repository *r, const struct object_id *oid);
+int stream_object_signature(struct repository *r,
+			    struct odb_read_stream *stream,
+			    const struct object_id *oid);
 
 enum finalize_object_file_flags {
 	FOF_SKIP_COLLISION_CHECK = 1,
diff --git a/object.c b/object.c
index 4669b8d65e..9d2c676b16 100644
--- a/object.c
+++ b/object.c
@@ -6,6 +6,7 @@
 #include "object.h"
 #include "replace-object.h"
 #include "object-file.h"
+#include "odb/streaming.h"
 #include "blob.h"
 #include "statinfo.h"
 #include "tree.h"
@@ -330,9 +331,21 @@ struct object *parse_object_with_flags(struct repository *r,
 
 	if ((!obj || obj->type == OBJ_NONE || obj->type == OBJ_BLOB) &&
 	    odb_read_object_info(r->objects, oid, NULL) == OBJ_BLOB) {
-		if (!skip_hash && stream_object_signature(r, repl) < 0) {
-			error(_("hash mismatch %s"), oid_to_hex(oid));
-			return NULL;
+		if (!skip_hash) {
+			struct odb_read_stream *stream = odb_read_stream_open(r->objects, oid, NULL);
+
+			if (!stream) {
+				error(_("unable to open object stream for %s"), oid_to_hex(oid));
+				return NULL;
+			}
+
+			if (stream_object_signature(r, stream, repl) < 0) {
+				error(_("hash mismatch %s"), oid_to_hex(oid));
+				odb_read_stream_close(stream);
+				return NULL;
+			}
+
+			odb_read_stream_close(stream);
 		}
 		parse_blob_buffer(lookup_blob(r, oid));
 		return lookup_object(r, oid);
diff --git a/pack-check.c b/pack-check.c
index 67cb2cf72f..46782a29d5 100644
--- a/pack-check.c
+++ b/pack-check.c
@@ -9,6 +9,7 @@
 #include "packfile.h"
 #include "object-file.h"
 #include "odb.h"
+#include "odb/streaming.h"
 
 struct idx_entry {
 	off_t                offset;
@@ -104,6 +105,7 @@ static int verify_packfile(struct repository *r,
 	QSORT(entries, nr_objects, compare_entries);
 
 	for (i = 0; i < nr_objects; i++) {
+		struct odb_read_stream *stream = NULL;
 		void *data;
 		struct object_id oid;
 		enum object_type type;
@@ -152,7 +154,9 @@ static int verify_packfile(struct repository *r,
 							type) < 0)
 			err = error("packed %s from %s is corrupt",
 				    oid_to_hex(&oid), p->pack_name);
-		else if (!data && stream_object_signature(r, &oid) < 0)
+		else if (!data &&
+			 (!(stream = odb_read_stream_open(r->objects, &oid, NULL)) ||
+			  stream_object_signature(r, stream, &oid) < 0))
 			err = error("packed %s from %s is corrupt",
 				    oid_to_hex(&oid), p->pack_name);
 		else if (fn) {
@@ -163,12 +167,14 @@ static int verify_packfile(struct repository *r,
 		}
 		if (((base_count + i) & 1023) == 0)
 			display_progress(progress, base_count + i);
-		free(data);
 
+		if (stream)
+			odb_read_stream_close(stream);
+		free(data);
 	}
+
 	display_progress(progress, base_count + i);
 	free(entries);
-
 	return err;
 }
 

-- 
2.53.0.536.g309c995771.dirty


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 3/4] packfile: expose function to read object stream for an offset
  2026-02-23 16:00 ` [PATCH v2 0/4] " Patrick Steinhardt
  2026-02-23 16:00   ` [PATCH v2 1/4] t/helper: improve "genrandom" test helper Patrick Steinhardt
  2026-02-23 16:00   ` [PATCH v2 2/4] object-file: adapt `stream_object_signature()` to take a stream Patrick Steinhardt
@ 2026-02-23 16:00   ` Patrick Steinhardt
  2026-02-23 16:00   ` [PATCH v2 4/4] pack-check: fix verification of large objects Patrick Steinhardt
  3 siblings, 0 replies; 26+ messages in thread
From: Patrick Steinhardt @ 2026-02-23 16:00 UTC (permalink / raw)
  To: git; +Cc: brian m. carlson, Junio C Hamano, Jeff King, Eric Sunshine

The function `packfile_store_read_object_stream()` takes as input an
object ID and then constructs a `struct odb_read_stream` from it. In a
subsequent commit we'll want to create an object stream for a given
combination of packfile and offset though, which is not something that
can currently be done.

Extract a new function `packfile_read_object_stream()` that makes this
functionality available.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 packfile.c | 40 ++++++++++++++++++++++++----------------
 packfile.h |  5 +++++
 2 files changed, 29 insertions(+), 16 deletions(-)

diff --git a/packfile.c b/packfile.c
index 402c3b5dc7..3e61176128 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2553,32 +2553,28 @@ static int close_istream_pack_non_delta(struct odb_read_stream *_st)
 	return 0;
 }
 
-int packfile_store_read_object_stream(struct odb_read_stream **out,
-				      struct packfile_store *store,
-				      const struct object_id *oid)
+int packfile_read_object_stream(struct odb_read_stream **out,
+				const struct object_id *oid,
+				struct packed_git *pack,
+				off_t offset)
 {
 	struct odb_packed_read_stream *stream;
 	struct pack_window *window = NULL;
-	struct object_info oi = OBJECT_INFO_INIT;
 	enum object_type in_pack_type;
 	unsigned long size;
 
-	oi.sizep = &size;
+	in_pack_type = unpack_object_header(pack, &window, &offset, &size);
+	unuse_pack(&window);
 
-	if (packfile_store_read_object_info(store, oid, &oi, 0) ||
-	    oi.u.packed.type == PACKED_OBJECT_TYPE_REF_DELTA ||
-	    oi.u.packed.type == PACKED_OBJECT_TYPE_OFS_DELTA ||
-	    repo_settings_get_big_file_threshold(store->source->odb->repo) >= size)
+	if (repo_settings_get_big_file_threshold(pack->repo) >= size)
 		return -1;
 
-	in_pack_type = unpack_object_header(oi.u.packed.pack,
-					    &window,
-					    &oi.u.packed.offset,
-					    &size);
-	unuse_pack(&window);
 	switch (in_pack_type) {
 	default:
 		return -1; /* we do not do deltas for now */
+	case OBJ_BAD:
+		mark_bad_packed_object(pack, oid);
+		return -1;
 	case OBJ_COMMIT:
 	case OBJ_TREE:
 	case OBJ_BLOB:
@@ -2592,10 +2588,22 @@ int packfile_store_read_object_stream(struct odb_read_stream **out,
 	stream->base.type = in_pack_type;
 	stream->base.size = size;
 	stream->z_state = ODB_PACKED_READ_STREAM_UNINITIALIZED;
-	stream->pack = oi.u.packed.pack;
-	stream->pos = oi.u.packed.offset;
+	stream->pack = pack;
+	stream->pos = offset;
 
 	*out = &stream->base;
 
 	return 0;
 }
+
+int packfile_store_read_object_stream(struct odb_read_stream **out,
+				      struct packfile_store *store,
+				      const struct object_id *oid)
+{
+	struct pack_entry e;
+
+	if (!find_pack_entry(store, oid, &e))
+		return -1;
+
+	return packfile_read_object_stream(out, oid, e.p, e.offset);
+}
diff --git a/packfile.h b/packfile.h
index acc5c55ad5..b9f5f1c18c 100644
--- a/packfile.h
+++ b/packfile.h
@@ -436,6 +436,11 @@ off_t get_delta_base(struct packed_git *p, struct pack_window **w_curs,
 		     off_t *curpos, enum object_type type,
 		     off_t delta_obj_offset);
 
+int packfile_read_object_stream(struct odb_read_stream **out,
+				const struct object_id *oid,
+				struct packed_git *pack,
+				off_t offset);
+
 void release_pack_memory(size_t);
 
 /* global flag to enable extra checks when accessing packed objects */

-- 
2.53.0.536.g309c995771.dirty


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 4/4] pack-check: fix verification of large objects
  2026-02-23 16:00 ` [PATCH v2 0/4] " Patrick Steinhardt
                     ` (2 preceding siblings ...)
  2026-02-23 16:00   ` [PATCH v2 3/4] packfile: expose function to read object stream for an offset Patrick Steinhardt
@ 2026-02-23 16:00   ` Patrick Steinhardt
  3 siblings, 0 replies; 26+ messages in thread
From: Patrick Steinhardt @ 2026-02-23 16:00 UTC (permalink / raw)
  To: git; +Cc: brian m. carlson, Junio C Hamano, Jeff King, Eric Sunshine

It was reported [1] that git-fsck(1) may sometimes run into an infinite
loop when processing packfiles. This bug was bisected to c31bad4f7d
(packfile: track packs via the MRU list exclusively, 2025-10-30), which
refactored our lsit of packfiles to only be tracked via an MRU list,
exclusively. This isn't entirely surprising: any caller that iterates
through the list of packfiles and then hits `find_pack_entry()`, for
example because they read an object from it, may cause the MRU list to
be updated. And if the caller is unlucky, this may cause the mentioned
infinite loop.

While this mechanism is somewhat fragile, it is still surprising that we
encounter it when verifying the packfile. We iterate through objects in
a given pack one by one and then read them via their offset, and doing
this shouldn't ever end up in `find_pack_entry()`.

But there is an edge case here: when the object in question is a blob
bigger than "core.largeFileThreshold", then we will be careful to not
read it into memory. Instead, we read it via an object stream by calling
`odb_read_object_stream()`, and that function will perform an object
lookup via `odb_read_object_info()`. So in the case where there are at
least two blobs in two different packfiles, and both of these blobs
exceed "core.largeFileThreshold", then we'll run into an infinite loop
because we'll always update the MRU.

We could fix this by improving `repo_for_each_pack()` to not update the
MRU, and this would address the issue. But the fun part is that using
`odb_read_object_stream()` is the wrong thing to do in the first place:
it may open _any_ instance of this object, so we ultimately cannot be
sure that we even verified the object in our given packfile.

Fix this bug by creating the object stream for the packed object
directly via `packfile_read_object_stream()`. Add a test that would have
caused the infinite loop.

[1]: <20260222183710.2963424-1-sandals@crustytoothpaste.net>

Reported-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 pack-check.c    |  2 +-
 t/t1450-fsck.sh | 38 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/pack-check.c b/pack-check.c
index 46782a29d5..7378c80730 100644
--- a/pack-check.c
+++ b/pack-check.c
@@ -155,7 +155,7 @@ static int verify_packfile(struct repository *r,
 			err = error("packed %s from %s is corrupt",
 				    oid_to_hex(&oid), p->pack_name);
 		else if (!data &&
-			 (!(stream = odb_read_stream_open(r->objects, &oid, NULL)) ||
+			 (packfile_read_object_stream(&stream, &oid, p, entries[i].offset) < 0 ||
 			  stream_object_signature(r, stream, &oid) < 0))
 			err = error("packed %s from %s is corrupt",
 				    oid_to_hex(&oid), p->pack_name);
diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh
index 8fb79b3e5d..54e81c2636 100755
--- a/t/t1450-fsck.sh
+++ b/t/t1450-fsck.sh
@@ -852,6 +852,44 @@ test_expect_success 'fsck errors in packed objects' '
 	! grep corrupt out
 '

+test_expect_success 'fsck handles multiple packfiles with big blobs' '
+	test_when_finished "rm -rf repo" &&
+	git init repo &&
+	(
+		cd repo &&
+
+		# We construct two packfiles with two objects in common and one
+		# object not in common. The objects in common can then be
+		# corrupted in one of the packfiles, respectively. The other
+		# objects that are unique to the packs are merely used to not
+		# have both packs contain the same data.
+		blob_one=$(test-tool genrandom one 200k | git hash-object -t blob -w --stdin) &&
+		blob_two=$(test-tool genrandom two 200k | git hash-object -t blob -w --stdin) &&
+		blob_three=$(test-tool genrandom three 200k | git hash-object -t blob -w --stdin) &&
+		blob_four=$(test-tool genrandom four 200k | git hash-object -t blob -w --stdin) &&
+		pack_one=$(printf "%s\n" "$blob_one" "$blob_two" "$blob_three" | git pack-objects .git/objects/pack/pack) &&
+		pack_two=$(printf "%s\n" "$blob_two" "$blob_three" "$blob_four" | git pack-objects .git/objects/pack/pack) &&
+		chmod a+w .git/objects/pack/pack-*.pack &&
+
+		# Corrupt blob two in the first pack.
+		git verify-pack -v .git/objects/pack/pack-$pack_one >objects &&
+		offset_one=$(sed <objects -n "s/^$blob_two .* \(.*\)$/\1/p") &&
+		printf "\0" | dd of=.git/objects/pack/pack-$pack_one.pack bs=1 conv=notrunc seek=$offset_one &&
+
+		# Corrupt blob three in the second pack.
+		git verify-pack -v .git/objects/pack/pack-$pack_two >objects &&
+		offset_two=$(sed <objects -n "s/^$blob_three .* \(.*\)$/\1/p") &&
+		printf "\0" | dd of=.git/objects/pack/pack-$pack_two.pack bs=1 conv=notrunc seek=$offset_two &&
+
+		# We now expect to see two failures for the corrupted objects,
+		# even though they exist in a non-corrupted form in the
+		# respective other pack.
+		test_must_fail git -c core.bigFileThreshold=100k fsck 2>err &&
+		test_grep "unknown object type 0 at offset $offset_one in .git/objects/pack/pack-$pack_one.pack" err &&
+		test_grep "unknown object type 0 at offset $offset_two in .git/objects/pack/pack-$pack_two.pack" err
+	)
+'
+
 test_expect_success 'fsck fails on corrupt packfile' '
 	hsh=$(git commit-tree -m mycommit HEAD^{tree}) &&
 	pack=$(echo $hsh | git pack-objects .git/objects/pack/pack) &&

-- 
2.53.0.536.g309c995771.dirty

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] pack-check: fix verification of large objects
  2026-02-23  9:50 ` [PATCH 4/4] pack-check: fix verification of large objects Patrick Steinhardt
  2026-02-23 11:11   ` Jeff King
@ 2026-02-23 20:35   ` Junio C Hamano
  2026-02-24  6:26     ` Patrick Steinhardt
  1 sibling, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2026-02-23 20:35 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, brian m. carlson, Jeff King

Patrick Steinhardt <ps@pks.im> writes:

> It was reported [1] that git-fsck(1) may sometimes run into an infinite
> loop when processing packfiles. This bug was bisected to c31bad4f7d
> (packfile: track packs via the MRU list exclusively, 2025-10-30), which
> refactored our lsit of packfiles to only be tracked via an MRU list,

"lsit of" -> "list of"

> exclusively. This isn't entirely surprising: any caller that iterates
> through the list of packfiles and then hits `find_pack_entry()`, for
> example because they read an object from it, may cause the MRU list to
> be updated. And if the caller is unlucky, this may cause the mentioned
> infinite loop.
>
> While this mechanism is somewhat fragile, it is still surprising that we
> encounter it when verifying the packfile. We iterate through objects in
> a given pack one by one and then read them via their offset, and doing
> this shouldn't ever end up in `find_pack_entry()`.
>
> But there is an edge case here: when the object in question is a blob
> bigger than "core.largeFileThreshold", then we will be careful to not
> read it into memory. Instead, we read it via an object stream by calling
> `odb_read_object_stream()`, and that function will perform an object
> lookup via `odb_read_object_info()`. So in the case where there are at
> least two blobs in two different packfiles, and both of these blobs
> exceed "core.largeFileThreshold", then we'll run into an infinite loop
> because we'll always update the MRU.

Good find, and it is not surprising.  What is surprising is that we
do not see this kind of breakage more often.  The mechanism does
sound fragile, not just "somewhat" X-<.

> We could fix this by improving `repo_for_each_pack()` to not update the
> MRU, and this would address the issue. But the fun part is that using
> `odb_read_object_stream()` is the wrong thing to do in the first place:
> it may open _any_ instance of this object, so we ultimately cannot be
> sure that we even verified the object in our given packfile.

Again, very good reasoning.

> Fix this bug by creating the object stream for the packed object
> directly via `packfile_read_object_stream()`. Add a test that would have
> caused the infinite loop.

Curious that we have a completely different test.  I've locally
applied (without committing or amending) t1050 update from brian's
patch and with this series, fsck there does not seem to get stuck.
Of course, the new test added here doesn't either ;-).

>
> [1]: <20260222183710.2963424-1-sandals@crustytoothpaste.net>
>
> Reported-by: brian m. carlson <sandals@crustytoothpaste.net>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  pack-check.c    |  2 +-
>  t/t1450-fsck.sh | 15 +++++++++++++++
>  2 files changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/pack-check.c b/pack-check.c
> index 46782a29d5..6149567060 100644
> --- a/pack-check.c
> +++ b/pack-check.c
> @@ -155,7 +155,7 @@ static int verify_packfile(struct repository *r,
>  			err = error("packed %s from %s is corrupt",
>  				    oid_to_hex(&oid), p->pack_name);
>  		else if (!data &&
> -			 (!(stream = odb_read_stream_open(r->objects, &oid, NULL)) ||
> +			 (packfile_read_object_stream(&stream, p, entries[i].offset) < 0 ||
>  			  stream_object_signature(r, stream, &oid) < 0))
>  			err = error("packed %s from %s is corrupt",
>  				    oid_to_hex(&oid), p->pack_name);
> diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh
> index 8fb79b3e5d..ec68397ea3 100755
> --- a/t/t1450-fsck.sh
> +++ b/t/t1450-fsck.sh
> @@ -852,6 +852,21 @@ test_expect_success 'fsck errors in packed objects' '
>  	! grep corrupt out
>  '
>  
> +test_expect_success 'fsck handles multiple packfiles with big blobs' '
> +	test_when_finished "rm -rf repo" &&
> +	git init repo &&
> +	(
> +		cd repo &&
> +		blob_one=$(test-tool genrandom one 200k | git hash-object -t blob -w --stdin) &&
> +		blob_two=$(test-tool genrandom two 200k | git hash-object -t blob -w --stdin) &&
> +		printf "%s\n" "$blob_one" | git pack-objects .git/objects/pack/pack &&
> +		printf "%s\n" "$blob_two" | git pack-objects .git/objects/pack/pack &&
> +		remove_object "$blob_one" &&
> +		remove_object "$blob_two" &&
> +		git -c core.bigFileThreshold=100k fsck
> +	)
> +'
> +
>  test_expect_success 'fsck fails on corrupt packfile' '
>  	hsh=$(git commit-tree -m mycommit HEAD^{tree}) &&
>  	pack=$(echo $hsh | git pack-objects .git/objects/pack/pack) &&

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] pack-check: fix verification of large objects
  2026-02-23 20:35   ` Junio C Hamano
@ 2026-02-24  6:26     ` Patrick Steinhardt
  0 siblings, 0 replies; 26+ messages in thread
From: Patrick Steinhardt @ 2026-02-24  6:26 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, brian m. carlson, Jeff King

On Mon, Feb 23, 2026 at 12:35:48PM -0800, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> > Fix this bug by creating the object stream for the packed object
> > directly via `packfile_read_object_stream()`. Add a test that would have
> > caused the infinite loop.
> 
> Curious that we have a completely different test.  I've locally
> applied (without committing or amending) t1050 update from brian's
> patch and with this series, fsck there does not seem to get stuck.
> Of course, the new test added here doesn't either ;-).

Yeah, the fact that the test in t1050 hit the bug was pure coincidence,
and I wanted to have something a bit more reliable that also allowed
myself to prove what exact circumstances cause this to fail.

Patrick

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2026-02-24  6:26 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-02-23  9:50 [PATCH 0/4] pack-check: fix verification of large objects Patrick Steinhardt
2026-02-23  9:50 ` [PATCH 1/4] t/helper: improve "genrandom" test helper Patrick Steinhardt
2026-02-23 11:13   ` Jeff King
2026-02-23 12:20     ` Patrick Steinhardt
2026-02-23 14:01   ` Eric Sunshine
2026-02-23  9:50 ` [PATCH 2/4] object-file: adapt `stream_object_signature()` to take a stream Patrick Steinhardt
2026-02-23 10:49   ` Jeff King
2026-02-23 12:21     ` Patrick Steinhardt
2026-02-23 12:59       ` Jeff King
2026-02-23  9:50 ` [PATCH 3/4] packfile: expose function to read object stream for an offset Patrick Steinhardt
2026-02-23 11:07   ` Jeff King
2026-02-23 12:21     ` Patrick Steinhardt
2026-02-23 13:12       ` Jeff King
2026-02-23 15:59         ` Patrick Steinhardt
2026-02-23  9:50 ` [PATCH 4/4] pack-check: fix verification of large objects Patrick Steinhardt
2026-02-23 11:11   ` Jeff King
2026-02-23 11:30     ` Patrick Steinhardt
2026-02-23 12:58       ` Jeff King
2026-02-23 15:48         ` Patrick Steinhardt
2026-02-23 20:35   ` Junio C Hamano
2026-02-24  6:26     ` Patrick Steinhardt
2026-02-23 16:00 ` [PATCH v2 0/4] " Patrick Steinhardt
2026-02-23 16:00   ` [PATCH v2 1/4] t/helper: improve "genrandom" test helper Patrick Steinhardt
2026-02-23 16:00   ` [PATCH v2 2/4] object-file: adapt `stream_object_signature()` to take a stream Patrick Steinhardt
2026-02-23 16:00   ` [PATCH v2 3/4] packfile: expose function to read object stream for an offset Patrick Steinhardt
2026-02-23 16:00   ` [PATCH v2 4/4] pack-check: fix verification of large objects Patrick Steinhardt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox