[PATCH 0/9] some more chunk-file bounds-checks fixes

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/9] some more chunk-file bounds-checks fixes
@ 2023-11-09  7:03 Jeff King
  2023-11-09  7:09 ` [PATCH 1/9] commit-graph: handle overflow in chunk_size checks Jeff King
                   ` (10 more replies)
  0 siblings, 11 replies; 39+ messages in thread
From: Jeff King @ 2023-11-09  7:03 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

This is a follow-up to the series from:

  https://lore.kernel.org/git/20231009205544.GA3281950@coredump.intra.peff.net/

which was merged to master as jk/chunk-bounds. There were a few issues
left open by that series and its review:

  1. the midx code didn't check fanout ordering

  2. whether we needed to sprinkle some more st_mult() on it

  3. improving some of the error messages (translations, some
     consistency, maybe more details)

  4. possible refactoring with a pair_chunk_expect() API (Taylor posted
     a series in that direction, which is currently in limbo)

The patches here fix the remaining correctness issues (points 1 and 2,
along with a few other small issues I found). There's some improvement
on 3, but I stopped short of adding lots more details. Partially because
the series was already getting big, and partially because going further
may depend on what we do with 4.

With regards to the pair_chunk_expect() thing, this series is
incompatible (textually, but not conceptually) with what Taylor posted
earlier, just because I'm moving some checks into the chunk-reader
callbacks. Because it's fixing user-visible bugs, I think we'd want to
do this first, and then (possibly) rebase Taylor's series on top. But I
also think some of the things I noticed around overflow (especially
patches 1 and 6) may inform how we'd want the pair_chunk_expect() API to
look.

This is a continuation of the jk/chunk-bounds topic, which is new in the
v2.43 cycle. But it should be OK to leave this until after the release.
Nothing here is fixing a regression in the 2.43 release candidates; it's
just a few bits that were incomplete. That said, I did try to float the
correctness bits to the first two patches just in case. ;)

  [1/9]: commit-graph: handle overflow in chunk_size checks
  [2/9]: midx: check consistency of fanout table
  [3/9]: commit-graph: drop redundant call to "lite" verification
  [4/9]: commit-graph: clarify missing-chunk error messages
  [5/9]: commit-graph: abort as soon as we see a bogus chunk
  [6/9]: commit-graph: use fanout value for graph size
  [7/9]: commit-graph: check order while reading fanout chunk
  [8/9]: commit-graph: drop verify_commit_graph_lite()
  [9/9]: commit-graph: mark chunk error messages for translation

 commit-graph.c              | 94 +++++++++++++------------------------
 midx.c                      | 20 ++++----
 t/t5318-commit-graph.sh     | 16 ++++---
 t/t5319-multi-pack-index.sh | 14 ++++++
 4 files changed, 67 insertions(+), 77 deletions(-)

-Peff

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 1/9] commit-graph: handle overflow in chunk_size checks
  2023-11-09  7:03 [PATCH 0/9] some more chunk-file bounds-checks fixes Jeff King
@ 2023-11-09  7:09 ` Jeff King
  2023-11-09 21:13   ` Taylor Blau
  2023-11-09  7:12 ` [PATCH 2/9] midx: check consistency of fanout table Jeff King
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 39+ messages in thread
From: Jeff King @ 2023-11-09  7:09 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

We check the size of chunks with fixed records by multiplying the width
of each record by the number of commits in the file. Like:

  if (chunk_size != g->num_commits * GRAPH_DATA_WIDTH)

If this multiplication overflows, we may not notice a chunk is too small
(which could later lead to out-of-bound reads).

In the current code this is only possible for the CDAT chunk, but the
reasons are quite subtle. We compute g->num_commits by dividing the size
of the OIDL chunk by the hash length (since it consists of a bunch of
hashes). So we know that any size_t multiplication that uses a value
smaller than the hash length cannot overflow. And the CDAT records are
the only ones that are larger (the others are just 4-byte records). So
it's worth fixing all of these, to make it clear that they're not
subject to overflow (without having to reason about seemingly unrelated
code).

The obvious thing to do is add an st_mult(), like:

  if (chunk_size != st_mult(g->num_commits, GRAPH_DATA_WIDTH))

And that certainly works, but it has one downside: if we detect an
overflow, we'll immediately die(). But the commit graph is an optional
file; if we run into other problems loading it, we'll generally return
an error and fall back to accessing the full objects. Using st_mult()
means a malformed file will abort the whole process.

So instead, we can do a division like this:

  if (chunk_size / GRAPH_DATA_WIDTH != g->num_commits)

where there's no possibility of overflow. We do lose a little bit of
precision; due to integer division truncation we'd allow up to an extra
GRAPH_DATA_WIDTH-1 bytes of data in the chunk. That's OK. Our main goal
here is making sure we don't have too _few_ bytes, which would cause an
out-of-bounds read (we could actually replace our "!=" with "<", but I
think it's worth being a little pedantic, as a large mismatch could be a
sign of other problems).

I didn't add a test here. We'd need to generate a very large graph file
in order to get g->num_commits large enough to cause an overflow. And a
later patch in this series will use this same division technique in a
way that is much easier to trigger in the tests.

Signed-off-by: Jeff King <peff@peff.net>
---
There are still several st_mult() calls within commit-graph.c, unrelated
to my jk/chunk-bounds series. I suspect they're all redundant, as the
chunk-size checks give a stricter bound. But checking and removing them
is a separate topic.

I think the midx code has similar st_mult() problems, but it's quite
happy to die() on any error already. So I've left auditing that for
another day.

Another alternative to the division is introducing an st_mult() variant
like:

  if (!st_mult(&out, &a, &b))
        return error(...);

We may want to go there in the long run as part of lib-ification, but I
didn't want to get into it for this series.

 commit-graph.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index ee66098e07..5d7d7a89e5 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -344,7 +344,7 @@ static int graph_read_commit_data(const unsigned char *chunk_start,
 				  size_t chunk_size, void *data)
 {
 	struct commit_graph *g = data;
-	if (chunk_size != g->num_commits * GRAPH_DATA_WIDTH)
+	if (chunk_size / GRAPH_DATA_WIDTH != g->num_commits)
 		return error("commit-graph commit data chunk is wrong size");
 	g->chunk_commit_data = chunk_start;
 	return 0;
@@ -354,7 +354,7 @@ static int graph_read_generation_data(const unsigned char *chunk_start,
 				      size_t chunk_size, void *data)
 {
 	struct commit_graph *g = data;
-	if (chunk_size != g->num_commits * sizeof(uint32_t))
+	if (chunk_size / sizeof(uint32_t) != g->num_commits)
 		return error("commit-graph generations chunk is wrong size");
 	g->chunk_generation_data = chunk_start;
 	return 0;
@@ -364,7 +364,7 @@ static int graph_read_bloom_index(const unsigned char *chunk_start,
 				  size_t chunk_size, void *data)
 {
 	struct commit_graph *g = data;
-	if (chunk_size != g->num_commits * 4) {
+	if (chunk_size / 4 != g->num_commits) {
 		warning("commit-graph changed-path index chunk is too small");
 		return -1;
 	}
-- 
2.43.0.rc1.572.g273fc7bed6

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 2/9] midx: check consistency of fanout table
  2023-11-09  7:03 [PATCH 0/9] some more chunk-file bounds-checks fixes Jeff King
  2023-11-09  7:09 ` [PATCH 1/9] commit-graph: handle overflow in chunk_size checks Jeff King
@ 2023-11-09  7:12 ` Jeff King
  2023-11-09  7:13 ` [PATCH 3/9] commit-graph: drop redundant call to "lite" verification Jeff King
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Jeff King @ 2023-11-09  7:12 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

The commit-graph, midx, and pack idx on-disk formats all have oid fanout
tables which are fed to bsearch_hash(). If these tables do not increase
monotonically, then the binary search may not only produce bogus values,
it may cause out of bounds reads.

We fixed this for commit graphs in 4169d89645 (commit-graph: check
consistency of fanout table, 2023-10-09). That commit argued that we did
not need to do the same for midx and pack idx files, because they
already did this check. However, that is wrong. We _do_ check the fanout
table for pack idx files when we load them, but we only do so for midx
files when running "git multi-pack-index verify". So it is possible to
get an out-of-bounds read by running a normal command with a specially
crafted midx file.

Let's fix this using the same solution (and roughly the same test) we
did for the commit-graph in 4169d89645. This replaces the same check
from "multi-pack-index verify", because verify uses the same read
routines, we'd bail on reading the midx much sooner now. So let's make
sure to copy its verbose error message.

Signed-off-by: Jeff King <peff@peff.net>
---
 midx.c                      | 20 +++++++++++---------
 t/t5319-multi-pack-index.sh | 14 ++++++++++++++
 2 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/midx.c b/midx.c
index 2f3863c936..1d14661dad 100644
--- a/midx.c
+++ b/midx.c
@@ -64,13 +64,24 @@ void get_midx_rev_filename(struct strbuf *out, struct multi_pack_index *m)
 static int midx_read_oid_fanout(const unsigned char *chunk_start,
 				size_t chunk_size, void *data)
 {
+	int i;
 	struct multi_pack_index *m = data;
 	m->chunk_oid_fanout = (uint32_t *)chunk_start;
 
 	if (chunk_size != 4 * 256) {
 		error(_("multi-pack-index OID fanout is of the wrong size"));
 		return 1;
 	}
+	for (i = 0; i < 255; i++) {
+		uint32_t oid_fanout1 = ntohl(m->chunk_oid_fanout[i]);
+		uint32_t oid_fanout2 = ntohl(m->chunk_oid_fanout[i+1]);
+
+		if (oid_fanout1 > oid_fanout2) {
+			error(_("oid fanout out of order: fanout[%d] = %"PRIx32" > %"PRIx32" = fanout[%d]"),
+			      i, oid_fanout1, oid_fanout2, i + 1);
+			return 1;
+		}
+	}
 	m->num_objects = ntohl(m->chunk_oid_fanout[255]);
 	return 0;
 }
@@ -1782,15 +1793,6 @@ int verify_midx_file(struct repository *r, const char *object_dir, unsigned flag
 	}
 	stop_progress(&progress);
 
-	for (i = 0; i < 255; i++) {
-		uint32_t oid_fanout1 = ntohl(m->chunk_oid_fanout[i]);
-		uint32_t oid_fanout2 = ntohl(m->chunk_oid_fanout[i + 1]);
-
-		if (oid_fanout1 > oid_fanout2)
-			midx_report(_("oid fanout out of order: fanout[%d] = %"PRIx32" > %"PRIx32" = fanout[%d]"),
-				    i, oid_fanout1, oid_fanout2, i + 1);
-	}
-
 	if (m->num_objects == 0) {
 		midx_report(_("the midx contains no oid"));
 		/*
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index c4c6060cee..c20aafe99a 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -1157,4 +1157,18 @@ test_expect_success 'reader notices too-small revindex chunk' '
 	test_cmp expect.err err
 '
 
+test_expect_success 'reader notices out-of-bounds fanout' '
+	# This is similar to the out-of-bounds fanout test in t5318. The values
+	# in adjacent entries should be large but not identical (they
+	# are used as hi/lo starts for a binary search, which would then abort
+	# immediately).
+	corrupt_chunk OIDF 0 $(printf "%02x000000" $(test_seq 0 254)) &&
+	test_must_fail git log 2>err &&
+	cat >expect <<-\EOF &&
+	error: oid fanout out of order: fanout[254] = fe000000 > 5c = fanout[255]
+	fatal: multi-pack-index required OID fanout chunk missing or corrupted
+	EOF
+	test_cmp expect err
+'
+
 test_done
-- 
2.43.0.rc1.572.g273fc7bed6


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 3/9] commit-graph: drop redundant call to "lite" verification
  2023-11-09  7:03 [PATCH 0/9] some more chunk-file bounds-checks fixes Jeff King
  2023-11-09  7:09 ` [PATCH 1/9] commit-graph: handle overflow in chunk_size checks Jeff King
  2023-11-09  7:12 ` [PATCH 2/9] midx: check consistency of fanout table Jeff King
@ 2023-11-09  7:13 ` Jeff King
  2023-11-09  7:14 ` [PATCH 4/9] commit-graph: clarify missing-chunk error messages Jeff King
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Jeff King @ 2023-11-09  7:13 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

The idea of verify_commit_graph_lite() is to have cheap verification
checks both for everyday use of the graph files (to avoid out of bounds
reads, etc) as well as for doing a full check via "commit-graph verify"
(which will also check the hash, etc).

But the expensive verification checks operate on a commit_graph struct,
which we get by using the normal everyday-reader code! So any problem
we'd find by calling it would have been found before we even got to the
verify_one_commit_graph() function.

Removing it simplifies the code a bit, but also frees us up to move the
"lite" verification steps around within that everyday-reader code.

Signed-off-by: Jeff King <peff@peff.net>
---
 commit-graph.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 5d7d7a89e5..d9fc08de86 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -2690,10 +2690,6 @@ static int verify_one_commit_graph(struct repository *r,
 	struct commit *seen_gen_zero = NULL;
 	struct commit *seen_gen_non_zero = NULL;

-	verify_commit_graph_error = verify_commit_graph_lite(g);
-	if (verify_commit_graph_error)
-		return verify_commit_graph_error;
-
 	if (!commit_graph_checksum_valid(g)) {
 		graph_report(_("the commit-graph file has incorrect checksum and is likely corrupt"));
 		verify_commit_graph_error = VERIFY_COMMIT_GRAPH_ERROR_HASH;
-- 
2.43.0.rc1.572.g273fc7bed6

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 4/9] commit-graph: clarify missing-chunk error messages
  2023-11-09  7:03 [PATCH 0/9] some more chunk-file bounds-checks fixes Jeff King
                   ` (2 preceding siblings ...)
  2023-11-09  7:13 ` [PATCH 3/9] commit-graph: drop redundant call to "lite" verification Jeff King
@ 2023-11-09  7:14 ` Jeff King
  2023-11-09  7:17 ` [PATCH 5/9] commit-graph: abort as soon as we see a bogus chunk Jeff King
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Jeff King @ 2023-11-09  7:14 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

When a required commit-graph chunk cannot be loaded, we leave its entry
in the struct NULL, and then later complain that it is missing. But
that's just one reason we might not have loaded it, as we also do some
data quality checks.

Let's switch these messages to say "missing or corrupted", which is
exactly what the midx code says for the same cases. Likewise, we'll use
the same phrasing and capitalization as those for consistency. And while
we're here, we can mark them for translation (just like the midx ones).

Signed-off-by: Jeff King <peff@peff.net>
---
I went with "corrupted" here for consistency with the others (versus
"corrupt"). If we think there's a reason to prefer one over the other,
I'm happy to take a patch on top fixing all of them.

 commit-graph.c          |  6 +++---
 t/t5318-commit-graph.sh | 10 +++++-----
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index d9fc08de86..989ebbe816 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -292,15 +292,15 @@ static int verify_commit_graph_lite(struct commit_graph *g)
 	 * itself.
 	 */
 	if (!g->chunk_oid_fanout) {
-		error("commit-graph is missing the OID Fanout chunk");
+		error(_("commit-graph required OID fanout chunk missing or corrupted"));
 		return 1;
 	}
 	if (!g->chunk_oid_lookup) {
-		error("commit-graph is missing the OID Lookup chunk");
+		error(_("commit-graph required OID lookup chunk missing or corrupted"));
 		return 1;
 	}
 	if (!g->chunk_commit_data) {
-		error("commit-graph is missing the Commit Data chunk");
+		error(_("commit-graph required commit data chunk missing or corrupted"));
 		return 1;
 	}
 
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index d4fc65e078..affb959d64 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -540,17 +540,17 @@ test_expect_success 'detect low chunk count' '
 
 test_expect_success 'detect missing OID fanout chunk' '
 	corrupt_graph_and_verify $GRAPH_BYTE_OID_FANOUT_ID "\0" \
-		"missing the OID Fanout chunk"
+		"commit-graph required OID fanout chunk missing or corrupted"
 '
 
 test_expect_success 'detect missing OID lookup chunk' '
 	corrupt_graph_and_verify $GRAPH_BYTE_OID_LOOKUP_ID "\0" \
-		"missing the OID Lookup chunk"
+		"commit-graph required OID lookup chunk missing or corrupted"
 '
 
 test_expect_success 'detect missing commit data chunk' '
 	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_DATA_ID "\0" \
-		"missing the Commit Data chunk"
+		"commit-graph required commit data chunk missing or corrupted"
 '
 
 test_expect_success 'detect incorrect fanout' '
@@ -842,7 +842,7 @@ test_expect_success 'reader notices too-small oid fanout chunk' '
 	check_corrupt_chunk OIDF clear $(printf "000000%02x" $(test_seq 250)) &&
 	cat >expect.err <<-\EOF &&
 	error: commit-graph oid fanout chunk is wrong size
-	error: commit-graph is missing the OID Fanout chunk
+	error: commit-graph required OID fanout chunk missing or corrupted
 	EOF
 	test_cmp expect.err err
 '
@@ -874,7 +874,7 @@ test_expect_success 'reader notices too-small commit data chunk' '
 	check_corrupt_chunk CDAT clear 00000000 &&
 	cat >expect.err <<-\EOF &&
 	error: commit-graph commit data chunk is wrong size
-	error: commit-graph is missing the Commit Data chunk
+	error: commit-graph required commit data chunk missing or corrupted
 	EOF
 	test_cmp expect.err err
 '
-- 
2.43.0.rc1.572.g273fc7bed6


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 5/9] commit-graph: abort as soon as we see a bogus chunk
  2023-11-09  7:03 [PATCH 0/9] some more chunk-file bounds-checks fixes Jeff King
                   ` (3 preceding siblings ...)
  2023-11-09  7:14 ` [PATCH 4/9] commit-graph: clarify missing-chunk error messages Jeff King
@ 2023-11-09  7:17 ` Jeff King
  2023-11-09 21:18   ` Taylor Blau
  2023-11-09  7:24 ` [PATCH 6/9] commit-graph: use fanout value for graph size Jeff King
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 39+ messages in thread
From: Jeff King @ 2023-11-09  7:17 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

The code to read commit-graph files tries to read all of the required
chunks, but doesn't abort if we can't find one (or if it's corrupted).
It's only at the end of reading the file that we then do some sanity
checks for NULL entries. But it's preferable to detect the errors and
bail immediately, for a few reasons:

  1. It's less error-prone. It's easy in the reader functions to flag an
     error but still end up setting some struct fields (an error I in
     fact made while working on this patch series).

  2. It's safer. Since verifying some chunks depends on the values of
     other chunks, we may be depending on not-yet-verified data. I don't
     know offhand of any case where this can cause problems, but it's
     one less subtle thing to worry about in the reader code.

  3. It prevents the user from seeing nonsense errors. If we're missing
     an OIDL chunk, then g->num_commits will be zero. And so we may
     complain that the size of our CDAT chunk (which should have a
     fixed-size record for each commit) is wrong unless it's also zero.
     But that's misleading; the problem is the missing OIDL chunk; the
     CDAT one might be fine!

So let's just check the return value from read_chunk(). This is exactly
how the midx chunk-reading code does it.

Signed-off-by: Jeff King <peff@peff.net>
---
 commit-graph.c | 29 +++++++++++++----------------
 1 file changed, 13 insertions(+), 16 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 989ebbe816..374575b484 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -291,19 +291,6 @@ static int verify_commit_graph_lite(struct commit_graph *g)
 	 * over g->num_commits, or runs a checksum on the commit-graph
 	 * itself.
 	 */
-	if (!g->chunk_oid_fanout) {
-		error(_("commit-graph required OID fanout chunk missing or corrupted"));
-		return 1;
-	}
-	if (!g->chunk_oid_lookup) {
-		error(_("commit-graph required OID lookup chunk missing or corrupted"));
-		return 1;
-	}
-	if (!g->chunk_commit_data) {
-		error(_("commit-graph required commit data chunk missing or corrupted"));
-		return 1;
-	}
-
 	for (i = 0; i < 255; i++) {
 		uint32_t oid_fanout1 = ntohl(g->chunk_oid_fanout[i]);
 		uint32_t oid_fanout2 = ntohl(g->chunk_oid_fanout[i + 1]);
@@ -462,9 +449,19 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 				   GRAPH_HEADER_SIZE, graph->num_chunks, 1))
 		goto free_and_return;
 
-	read_chunk(cf, GRAPH_CHUNKID_OIDFANOUT, graph_read_oid_fanout, graph);
-	read_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP, graph_read_oid_lookup, graph);
-	read_chunk(cf, GRAPH_CHUNKID_DATA, graph_read_commit_data, graph);
+	if (read_chunk(cf, GRAPH_CHUNKID_OIDFANOUT, graph_read_oid_fanout, graph)) {
+		error(_("commit-graph required OID fanout chunk missing or corrupted"));
+		goto free_and_return;
+	}
+	if (read_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP, graph_read_oid_lookup, graph)) {
+		error(_("commit-graph required OID lookup chunk missing or corrupted"));
+		goto free_and_return;
+	}
+	if (read_chunk(cf, GRAPH_CHUNKID_DATA, graph_read_commit_data, graph)) {
+		error(_("commit-graph required commit data chunk missing or corrupted"));
+		goto free_and_return;
+	}
+
 	pair_chunk(cf, GRAPH_CHUNKID_EXTRAEDGES, &graph->chunk_extra_edges,
 		   &graph->chunk_extra_edges_size);
 	pair_chunk(cf, GRAPH_CHUNKID_BASE, &graph->chunk_base_graphs,
-- 
2.43.0.rc1.572.g273fc7bed6


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 6/9] commit-graph: use fanout value for graph size
  2023-11-09  7:03 [PATCH 0/9] some more chunk-file bounds-checks fixes Jeff King
                   ` (4 preceding siblings ...)
  2023-11-09  7:17 ` [PATCH 5/9] commit-graph: abort as soon as we see a bogus chunk Jeff King
@ 2023-11-09  7:24 ` Jeff King
  2023-11-09 21:20   ` Taylor Blau
  2023-11-09  7:25 ` [PATCH 7/9] commit-graph: check order while reading fanout chunk Jeff King
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 39+ messages in thread
From: Jeff King @ 2023-11-09  7:24 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

Commit-graph, midx, and pack idx files all have both a lookup table of
oids and an oid fanout table. In midx and pack idx files, we take the
final entry of the fanout table as the source of truth for the number of
entries, and then verify that the size of the lookup table matches that.
But for commit-graph files, we do the opposite: we use the size of the
lookup table as the source of truth, and then check the final fanout
entry against it.

As noted in 4169d89645 (commit-graph: check consistency of fanout
table, 2023-10-09), either is correct. But there are a few reasons to
prefer the fanout table as the source of truth:

  1. The fanout entries are 32-bits on disk, and that defines the
     maximum number of entries we can store. But since the size of the
     lookup table is only bounded by the filesystem, it can be much
     larger. And hence computing it as the commit-graph does means that
     we may truncate the result when storing it in a uint32_t.

  2. We read the fanout first, then the lookup table. If we're verifying
     the chunks as we read them, then we'd want to take the fanout as
     truth (we have nothing yet to check it against) and then we can
     check that the lookup table matches what we already know.

  3. It is pointlessly inconsistent with the midx and pack idx code.
     Since the three have to do similar size and bounds checks, it is
     easier to reason about all three if they use the same approach.

So this patch moves the assignment of g->num_commits to the fanout
parser, and then we can check the size of the lookup chunk as soon as we
try to load it.

There's already a test covering this situation, which munges the final
fanout entry to 2^32-1. In the current code we complain that it does not
agree with the table size. But now that we treat the munged value as the
source of truth, we'll complain that the lookup table is the wrong size
(again, either is correct). So we'll have to update the message we
expect (and likewise for an earlier test which does similar munging).

There's a similar test for this situation on the midx side, but rather
than making a very-large fanout value, it just truncates the lookup
table. We could do that here, too, but the very-large fanout value
actually shows an interesting corner case. On a 32-bit system,
multiplying to find the expected table size would cause an integer
overflow. Using st_mult() would detect that, but cause us to die()
rather than falling back to the non-graph code path. Checking the size
using division (as we do with existing chunk-size checks) avoids the
overflow entirely, and the test demonstrates this when run on a 32-bit
system.

Signed-off-by: Jeff King <peff@peff.net>
---
 commit-graph.c          | 8 +++-----
 t/t5318-commit-graph.sh | 5 +++--
 2 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 374575b484..094814c2ba 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -300,10 +300,6 @@ static int verify_commit_graph_lite(struct commit_graph *g)
 			return 1;
 		}
 	}
-	if (ntohl(g->chunk_oid_fanout[255]) != g->num_commits) {
-		error("commit-graph oid table and fanout disagree on size");
-		return 1;
-	}
 
 	return 0;
 }
@@ -315,6 +311,7 @@ static int graph_read_oid_fanout(const unsigned char *chunk_start,
 	if (chunk_size != 256 * sizeof(uint32_t))
 		return error("commit-graph oid fanout chunk is wrong size");
 	g->chunk_oid_fanout = (const uint32_t *)chunk_start;
+	g->num_commits = ntohl(g->chunk_oid_fanout[255]);
 	return 0;
 }
 
@@ -323,7 +320,8 @@ static int graph_read_oid_lookup(const unsigned char *chunk_start,
 {
 	struct commit_graph *g = data;
 	g->chunk_oid_lookup = chunk_start;
-	g->num_commits = chunk_size / g->hash_len;
+	if (chunk_size / g->hash_len != g->num_commits)
+		return error(_("commit-graph OID lookup chunk is the wrong size"));
 	return 0;
 }
 
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index affb959d64..8bd7fcefb5 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -560,7 +560,7 @@ test_expect_success 'detect incorrect fanout' '
 
 test_expect_success 'detect incorrect fanout final value' '
 	corrupt_graph_and_verify $GRAPH_BYTE_FANOUT2 "\01" \
-		"oid table and fanout disagree on size"
+		"OID lookup chunk is the wrong size"
 '
 
 test_expect_success 'detect incorrect OID order' '
@@ -850,7 +850,8 @@ test_expect_success 'reader notices too-small oid fanout chunk' '
 test_expect_success 'reader notices fanout/lookup table mismatch' '
 	check_corrupt_chunk OIDF 1020 "FFFFFFFF" &&
 	cat >expect.err <<-\EOF &&
-	error: commit-graph oid table and fanout disagree on size
+	error: commit-graph OID lookup chunk is the wrong size
+	error: commit-graph required OID lookup chunk missing or corrupted
 	EOF
 	test_cmp expect.err err
 '
-- 
2.43.0.rc1.572.g273fc7bed6


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 7/9] commit-graph: check order while reading fanout chunk
  2023-11-09  7:03 [PATCH 0/9] some more chunk-file bounds-checks fixes Jeff King
                   ` (5 preceding siblings ...)
  2023-11-09  7:24 ` [PATCH 6/9] commit-graph: use fanout value for graph size Jeff King
@ 2023-11-09  7:25 ` Jeff King
  2023-11-09  7:25 ` [PATCH 8/9] commit-graph: drop verify_commit_graph_lite() Jeff King
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Jeff King @ 2023-11-09  7:25 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

We read the fanout chunk, storing a pointer to it, but only confirm that
the entries are monotonic in a final "lite" verification step. Let's
move that into the actual OIDF chunk callback, so that we can report
problems immediately (for all the reasons given in the previous
"commit-graph: abort as soon as we see a bogus chunk" commit).

Signed-off-by: Jeff King <peff@peff.net>
---
 commit-graph.c          | 25 +++++++++++++------------
 t/t5318-commit-graph.sh |  1 +
 2 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 094814c2ba..4ba523cd15 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -277,8 +277,6 @@ struct commit_graph *load_commit_graph_one_fd_st(struct repository *r,
 
 static int verify_commit_graph_lite(struct commit_graph *g)
 {
-	int i;
-
 	/*
 	 * Basic validation shared between parse_commit_graph()
 	 * which'll be called every time the graph is used, and the
@@ -291,27 +289,30 @@ static int verify_commit_graph_lite(struct commit_graph *g)
 	 * over g->num_commits, or runs a checksum on the commit-graph
 	 * itself.
 	 */
-	for (i = 0; i < 255; i++) {
-		uint32_t oid_fanout1 = ntohl(g->chunk_oid_fanout[i]);
-		uint32_t oid_fanout2 = ntohl(g->chunk_oid_fanout[i + 1]);
-
-		if (oid_fanout1 > oid_fanout2) {
-			error("commit-graph fanout values out of order");
-			return 1;
-		}
-	}
-
 	return 0;
 }
 
 static int graph_read_oid_fanout(const unsigned char *chunk_start,
 				 size_t chunk_size, void *data)
 {
 	struct commit_graph *g = data;
+	int i;
+
 	if (chunk_size != 256 * sizeof(uint32_t))
 		return error("commit-graph oid fanout chunk is wrong size");
 	g->chunk_oid_fanout = (const uint32_t *)chunk_start;
 	g->num_commits = ntohl(g->chunk_oid_fanout[255]);
+
+	for (i = 0; i < 255; i++) {
+		uint32_t oid_fanout1 = ntohl(g->chunk_oid_fanout[i]);
+		uint32_t oid_fanout2 = ntohl(g->chunk_oid_fanout[i + 1]);
+
+		if (oid_fanout1 > oid_fanout2) {
+			error("commit-graph fanout values out of order");
+			return 1;
+		}
+	}
+
 	return 0;
 }
 
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 8bd7fcefb5..7fe7c72a87 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -867,6 +867,7 @@ test_expect_success 'reader notices out-of-bounds fanout' '
 	check_corrupt_chunk OIDF 0 $(printf "%02x000000" $(test_seq 0 254)) &&
 	cat >expect.err <<-\EOF &&
 	error: commit-graph fanout values out of order
+	error: commit-graph required OID fanout chunk missing or corrupted
 	EOF
 	test_cmp expect.err err
 '
-- 
2.43.0.rc1.572.g273fc7bed6


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 8/9] commit-graph: drop verify_commit_graph_lite()
  2023-11-09  7:03 [PATCH 0/9] some more chunk-file bounds-checks fixes Jeff King
                   ` (6 preceding siblings ...)
  2023-11-09  7:25 ` [PATCH 7/9] commit-graph: check order while reading fanout chunk Jeff King
@ 2023-11-09  7:25 ` Jeff King
  2023-11-09  7:26 ` [PATCH 9/9] commit-graph: mark chunk error messages for translation Jeff King
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Jeff King @ 2023-11-09  7:25 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

As we've moved all of the checks from this function directly into the
chunk-reading code used by the caller (and there is only one caller), we
can just drop it entirely.

Signed-off-by: Jeff King <peff@peff.net>
---
 commit-graph.c | 20 --------------------
 1 file changed, 20 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 4ba523cd15..6fbfe4c68e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -275,23 +275,6 @@ struct commit_graph *load_commit_graph_one_fd_st(struct repository *r,
 	return ret;
 }
 
-static int verify_commit_graph_lite(struct commit_graph *g)
-{
-	/*
-	 * Basic validation shared between parse_commit_graph()
-	 * which'll be called every time the graph is used, and the
-	 * much more expensive verify_commit_graph() used by
-	 * "commit-graph verify".
-	 *
-	 * There should only be very basic checks here to ensure that
-	 * we don't e.g. segfault in fill_commit_in_graph(), but
-	 * because this is a very hot codepath nothing that e.g. loops
-	 * over g->num_commits, or runs a checksum on the commit-graph
-	 * itself.
-	 */
-	return 0;
-}
-
 static int graph_read_oid_fanout(const unsigned char *chunk_start,
 				 size_t chunk_size, void *data)
 {
@@ -495,9 +478,6 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 
 	oidread(&graph->oid, graph->data + graph->data_len - graph->hash_len);
 
-	if (verify_commit_graph_lite(graph))
-		goto free_and_return;
-
 	free_chunkfile(cf);
 	return graph;
 
-- 
2.43.0.rc1.572.g273fc7bed6


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 9/9] commit-graph: mark chunk error messages for translation
  2023-11-09  7:03 [PATCH 0/9] some more chunk-file bounds-checks fixes Jeff King
                   ` (7 preceding siblings ...)
  2023-11-09  7:25 ` [PATCH 8/9] commit-graph: drop verify_commit_graph_lite() Jeff King
@ 2023-11-09  7:26 ` Jeff King
  2023-11-09 21:22 ` [PATCH 0/9] some more chunk-file bounds-checks fixes Taylor Blau
  2023-11-09 22:34 ` [PATCH 0/7] chunk-format: introduce `pair_chunk_expect()` Taylor Blau
  10 siblings, 0 replies; 39+ messages in thread
From: Jeff King @ 2023-11-09  7:26 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

The patches from f32af12cee (Merge branch 'jk/chunk-bounds', 2023-10-23)
added many new untranslated error messages. While it's unlikely for most
users to see these messages at all, most of the other commit-graph error
messages are translated (and likewise for the matching midx messages).

Let's mark them all for consistency (and to help any poor unfortunate
user who does manage to find a broken graph file).

Signed-off-by: Jeff King <peff@peff.net>
---
The "wrong size" ones may be dropped eventually if we have a generic
pair_chunk_expect() API, but it seemed easier to just fix them all
mechanically in one go.

 commit-graph.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 6fbfe4c68e..acac9bf6e1 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -282,7 +282,7 @@ static int graph_read_oid_fanout(const unsigned char *chunk_start,
 	int i;
 
 	if (chunk_size != 256 * sizeof(uint32_t))
-		return error("commit-graph oid fanout chunk is wrong size");
+		return error(_("commit-graph oid fanout chunk is wrong size"));
 	g->chunk_oid_fanout = (const uint32_t *)chunk_start;
 	g->num_commits = ntohl(g->chunk_oid_fanout[255]);
 
@@ -291,7 +291,7 @@ static int graph_read_oid_fanout(const unsigned char *chunk_start,
 		uint32_t oid_fanout2 = ntohl(g->chunk_oid_fanout[i + 1]);
 
 		if (oid_fanout1 > oid_fanout2) {
-			error("commit-graph fanout values out of order");
+			error(_("commit-graph fanout values out of order"));
 			return 1;
 		}
 	}
@@ -314,7 +314,7 @@ static int graph_read_commit_data(const unsigned char *chunk_start,
 {
 	struct commit_graph *g = data;
 	if (chunk_size / GRAPH_DATA_WIDTH != g->num_commits)
-		return error("commit-graph commit data chunk is wrong size");
+		return error(_("commit-graph commit data chunk is wrong size"));
 	g->chunk_commit_data = chunk_start;
 	return 0;
 }
@@ -324,7 +324,7 @@ static int graph_read_generation_data(const unsigned char *chunk_start,
 {
 	struct commit_graph *g = data;
 	if (chunk_size / sizeof(uint32_t) != g->num_commits)
-		return error("commit-graph generations chunk is wrong size");
+		return error(_("commit-graph generations chunk is wrong size"));
 	g->chunk_generation_data = chunk_start;
 	return 0;
 }
@@ -334,7 +334,7 @@ static int graph_read_bloom_index(const unsigned char *chunk_start,
 {
 	struct commit_graph *g = data;
 	if (chunk_size / 4 != g->num_commits) {
-		warning("commit-graph changed-path index chunk is too small");
+		warning(_("commit-graph changed-path index chunk is too small"));
 		return -1;
 	}
 	g->chunk_bloom_indexes = chunk_start;
@@ -348,8 +348,8 @@ static int graph_read_bloom_data(const unsigned char *chunk_start,
 	uint32_t hash_version;
 
 	if (chunk_size < BLOOMDATA_CHUNK_HEADER_SIZE) {
-		warning("ignoring too-small changed-path chunk"
-			" (%"PRIuMAX" < %"PRIuMAX") in commit-graph file",
+		warning(_("ignoring too-small changed-path chunk"
+			" (%"PRIuMAX" < %"PRIuMAX") in commit-graph file"),
 			(uintmax_t)chunk_size,
 			(uintmax_t)BLOOMDATA_CHUNK_HEADER_SIZE);
 		return -1;
@@ -605,7 +605,7 @@ int open_commit_graph_chain(const char *chain_file,
 			/* treat empty files the same as missing */
 			errno = ENOENT;
 		} else {
-			warning("commit-graph chain file too small");
+			warning(_("commit-graph chain file too small"));
 			errno = EINVAL;
 		}
 		return 0;
@@ -946,7 +946,7 @@ static int fill_commit_in_graph(struct repository *r,
 	parent_data_pos = edge_value & GRAPH_EDGE_LAST_MASK;
 	do {
 		if (g->chunk_extra_edges_size / sizeof(uint32_t) <= parent_data_pos) {
-			error("commit-graph extra-edges pointer out of bounds");
+			error(_("commit-graph extra-edges pointer out of bounds"));
 			free_commit_list(item->parents);
 			item->parents = NULL;
 			item->object.parsed = 0;
-- 
2.43.0.rc1.572.g273fc7bed6

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/9] commit-graph: handle overflow in chunk_size checks
  2023-11-09  7:09 ` [PATCH 1/9] commit-graph: handle overflow in chunk_size checks Jeff King
@ 2023-11-09 21:13   ` Taylor Blau
  2023-11-09 21:27     ` Jeff King
  0 siblings, 1 reply; 39+ messages in thread
From: Taylor Blau @ 2023-11-09 21:13 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Thu, Nov 09, 2023 at 02:09:48AM -0500, Jeff King wrote:
> So instead, we can do a division like this:
>
>   if (chunk_size / GRAPH_DATA_WIDTH != g->num_commits)
>
> where there's no possibility of overflow. We do lose a little bit of
> precision; due to integer division truncation we'd allow up to an extra
> GRAPH_DATA_WIDTH-1 bytes of data in the chunk. That's OK. Our main goal
> here is making sure we don't have too _few_ bytes, which would cause an
> out-of-bounds read (we could actually replace our "!=" with "<", but I
> think it's worth being a little pedantic, as a large mismatch could be a
> sign of other problems).

This is wonderfully explained, and the patch below follows trivially
from what you wrote here.

So everything in this patch makes sense and looks good to me. It does
make me think about the pair_chunk_expect() function that I proposed
elsewhere. I haven't yet read the rest of the series, so it may be a
totally useless direction by the end of this series ;-).

But I wonder if the interface we want is something like:

    int pair_chunk_expect(struct chunkfile *cf, uint32_t chunk_id,
                          const unsigned char **p,
                          size_t record_size, size_t record_nr);

So we can then grab the size of the chunk, divide it by "record_size"
and ensure that end up with "record_nr" as a result.

Again, this is perhaps totally useless by the end of your series, but
just having looked at the first patch I wonder if this is a productive
direction to consider...

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/9] commit-graph: abort as soon as we see a bogus chunk
  2023-11-09  7:17 ` [PATCH 5/9] commit-graph: abort as soon as we see a bogus chunk Jeff King
@ 2023-11-09 21:18   ` Taylor Blau
  0 siblings, 0 replies; 39+ messages in thread
From: Taylor Blau @ 2023-11-09 21:18 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Thu, Nov 09, 2023 at 02:17:11AM -0500, Jeff King wrote:
> The code to read commit-graph files tries to read all of the required
> chunks, but doesn't abort if we can't find one (or if it's corrupted).
> It's only at the end of reading the file that we then do some sanity
> checks for NULL entries. But it's preferable to detect the errors and
> bail immediately, for a few reasons:
>
>   1. It's less error-prone. It's easy in the reader functions to flag an
>      error but still end up setting some struct fields (an error I in
>      fact made while working on this patch series).
>
>   2. It's safer. Since verifying some chunks depends on the values of
>      other chunks, we may be depending on not-yet-verified data. I don't
>      know offhand of any case where this can cause problems, but it's
>      one less subtle thing to worry about in the reader code.
>
>   3. It prevents the user from seeing nonsense errors. If we're missing
>      an OIDL chunk, then g->num_commits will be zero. And so we may
>      complain that the size of our CDAT chunk (which should have a
>      fixed-size record for each commit) is wrong unless it's also zero.
>      But that's misleading; the problem is the missing OIDL chunk; the
>      CDAT one might be fine!
>
> So let's just check the return value from read_chunk(). This is exactly
> how the midx chunk-reading code does it.

All very well explained. I hit that same snag as you did when I was
working on the few patches I proposed we put on top of your earlier
chunk-format hardening series.

I'm glad to see this getting cleaned up, and I'm very happy with the
post-image of this patch.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 6/9] commit-graph: use fanout value for graph size
  2023-11-09  7:24 ` [PATCH 6/9] commit-graph: use fanout value for graph size Jeff King
@ 2023-11-09 21:20   ` Taylor Blau
  2023-11-09 21:38     ` Jeff King
  0 siblings, 1 reply; 39+ messages in thread
From: Taylor Blau @ 2023-11-09 21:20 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Thu, Nov 09, 2023 at 02:24:35AM -0500, Jeff King wrote:
> @@ -323,7 +320,8 @@ static int graph_read_oid_lookup(const unsigned char *chunk_start,
>  {
>  	struct commit_graph *g = data;
>  	g->chunk_oid_lookup = chunk_start;
> -	g->num_commits = chunk_size / g->hash_len;
> +	if (chunk_size / g->hash_len != g->num_commits)
> +		return error(_("commit-graph OID lookup chunk is the wrong size"));
>  	return 0;
>  }

My understanding is that you need this error message to come from
graph_read_oid_lookup() in order to pass the "detect incorrect fanout
final value" test, but I wish that we didn't have to, since having the
more-or-less duplicate error messages in the latter "reader notices
fanout/lookup table mismatch" is somewhat unfortunate.

> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index affb959d64..8bd7fcefb5 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -560,7 +560,7 @@ test_expect_success 'detect incorrect fanout' '
>
>  test_expect_success 'detect incorrect fanout final value' '
>  	corrupt_graph_and_verify $GRAPH_BYTE_FANOUT2 "\01" \
> -		"oid table and fanout disagree on size"
> +		"OID lookup chunk is the wrong size"
>  '
>
>  test_expect_success 'detect incorrect OID order' '
> @@ -850,7 +850,8 @@ test_expect_success 'reader notices too-small oid fanout chunk' '
>  test_expect_success 'reader notices fanout/lookup table mismatch' '
>  	check_corrupt_chunk OIDF 1020 "FFFFFFFF" &&
>  	cat >expect.err <<-\EOF &&
> -	error: commit-graph oid table and fanout disagree on size
> +	error: commit-graph OID lookup chunk is the wrong size
> +	error: commit-graph required OID lookup chunk missing or corrupted
>  	EOF
>  	test_cmp expect.err err
>  '
> --
> 2.43.0.rc1.572.g273fc7bed6
>
Thanks,
Taylor

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/9] some more chunk-file bounds-checks fixes
  2023-11-09  7:03 [PATCH 0/9] some more chunk-file bounds-checks fixes Jeff King
                   ` (8 preceding siblings ...)
  2023-11-09  7:26 ` [PATCH 9/9] commit-graph: mark chunk error messages for translation Jeff King
@ 2023-11-09 21:22 ` Taylor Blau
  2023-11-09 22:34 ` [PATCH 0/7] chunk-format: introduce `pair_chunk_expect()` Taylor Blau
  10 siblings, 0 replies; 39+ messages in thread
From: Taylor Blau @ 2023-11-09 21:22 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Thu, Nov 09, 2023 at 02:03:10AM -0500, Jeff King wrote:
> This is a follow-up to the series from:
>
>   https://lore.kernel.org/git/20231009205544.GA3281950@coredump.intra.peff.net/
>
> which was merged to master as jk/chunk-bounds. There were a few issues
> left open by that series and its review:
>
>   1. the midx code didn't check fanout ordering
>
>   2. whether we needed to sprinkle some more st_mult() on it
>
>   3. improving some of the error messages (translations, some
>      consistency, maybe more details)
>
>   4. possible refactoring with a pair_chunk_expect() API (Taylor posted
>      a series in that direction, which is currently in limbo)

I read this series thoroughly and was very pleased with the result.
Thanks for patching these up.

I think that I am still of the mind that it would be useful to have some
kind of pair_chunk_expect() function, so I'll try and rebase/rewrite my
patches on top of your new ones here.

In the meantime, this series LGTM.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/9] commit-graph: handle overflow in chunk_size checks
  2023-11-09 21:13   ` Taylor Blau
@ 2023-11-09 21:27     ` Jeff King
  0 siblings, 0 replies; 39+ messages in thread
From: Jeff King @ 2023-11-09 21:27 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git

On Thu, Nov 09, 2023 at 04:13:17PM -0500, Taylor Blau wrote:

> So everything in this patch makes sense and looks good to me. It does
> make me think about the pair_chunk_expect() function that I proposed
> elsewhere. I haven't yet read the rest of the series, so it may be a
> totally useless direction by the end of this series ;-).

Nope, it's not useless. But I do think it affects what we'd want the
interface to look like, and...

> But I wonder if the interface we want is something like:
> 
>     int pair_chunk_expect(struct chunkfile *cf, uint32_t chunk_id,
>                           const unsigned char **p,
>                           size_t record_size, size_t record_nr);
> 
> So we can then grab the size of the chunk, divide it by "record_size"
> and ensure that end up with "record_nr" as a result.

...this is exactly the direction I was thinking it would go. So if you
got to the same place after reading my explanation, then hopefully
something went right. ;)

-Peff

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 6/9] commit-graph: use fanout value for graph size
  2023-11-09 21:20   ` Taylor Blau
@ 2023-11-09 21:38     ` Jeff King
  2023-11-09 22:15       ` Taylor Blau
  0 siblings, 1 reply; 39+ messages in thread
From: Jeff King @ 2023-11-09 21:38 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git

On Thu, Nov 09, 2023 at 04:20:53PM -0500, Taylor Blau wrote:

> On Thu, Nov 09, 2023 at 02:24:35AM -0500, Jeff King wrote:
> > @@ -323,7 +320,8 @@ static int graph_read_oid_lookup(const unsigned char *chunk_start,
> >  {
> >  	struct commit_graph *g = data;
> >  	g->chunk_oid_lookup = chunk_start;
> > -	g->num_commits = chunk_size / g->hash_len;
> > +	if (chunk_size / g->hash_len != g->num_commits)
> > +		return error(_("commit-graph OID lookup chunk is the wrong size"));
> >  	return 0;
> >  }
> 
> My understanding is that you need this error message to come from
> graph_read_oid_lookup() in order to pass the "detect incorrect fanout
> final value" test, but I wish that we didn't have to, since having the
> more-or-less duplicate error messages in the latter "reader notices
> fanout/lookup table mismatch" is somewhat unfortunate.

I'm not sure what you mean here. We notice the problem in the reading
code, which is used in turn by the verify code. So both of these tests:

> >  test_expect_success 'detect incorrect fanout final value' '
> >  	corrupt_graph_and_verify $GRAPH_BYTE_FANOUT2 "\01" \
> > -		"oid table and fanout disagree on size"
> > +		"OID lookup chunk is the wrong size"
> >  '
> >
> >  test_expect_success 'detect incorrect OID order' '
> > @@ -850,7 +850,8 @@ test_expect_success 'reader notices too-small oid fanout chunk' '
> >  test_expect_success 'reader notices fanout/lookup table mismatch' '
> >  	check_corrupt_chunk OIDF 1020 "FFFFFFFF" &&
> >  	cat >expect.err <<-\EOF &&
> > -	error: commit-graph oid table and fanout disagree on size
> > +	error: commit-graph OID lookup chunk is the wrong size
> > +	error: commit-graph required OID lookup chunk missing or corrupted
> >  	EOF
> >  	test_cmp expect.err err
> >  '

will see a message regardless of where we put it. Or by "duplicate" did
you mean the two-line:

  > > +   error: commit-graph OID lookup chunk is the wrong size
  > > +   error: commit-graph required OID lookup chunk missing or corrupted

output. That's because our chunk of the return value from read_chunk()
is a bit lazy, and does not distinguish "missing chunk" (where we should
write a string saying so) from errors produced by the callback (where a
more specific error message has already been written). That's true of
all of the read_chunk() callers, including the midx ones.

This is one of the ways that pair_chunk_expect() could do better without
adding a lot of code, because it can check the return value from
read_chunk(). It doesn't help the other cases (like OIDF) that still
have to call read_chunk() themselves, though. Possibly read_chunk()
should just take a flag to indicate that it should complain when the
chunk is missing. And then callers could just do:

  if (read_chunk(cf, id, our_callback, CHUNK_REQUIRED)
	return -1; /* no need to complain; either our_callback() did, or
	              read_chunk() itself */

-Peff

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 6/9] commit-graph: use fanout value for graph size
  2023-11-09 21:38     ` Jeff King
@ 2023-11-09 22:15       ` Taylor Blau
  2023-11-10 21:52         ` Jeff King
  0 siblings, 1 reply; 39+ messages in thread
From: Taylor Blau @ 2023-11-09 22:15 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Thu, Nov 09, 2023 at 04:38:13PM -0500, Jeff King wrote:
> Or by "duplicate" did you mean the two-line:
>
>   > > +   error: commit-graph OID lookup chunk is the wrong size
>   > > +   error: commit-graph required OID lookup chunk missing or corrupted
>
> output. That's because our chunk of the return value from read_chunk()
> is a bit lazy, and does not distinguish "missing chunk" (where we should
> write a string saying so) from errors produced by the callback (where a
> more specific error message has already been written). That's true of
> all of the read_chunk() callers, including the midx ones.

Yeah, I meant duplicate in the sense above.

> This is one of the ways that pair_chunk_expect() could do better without
> adding a lot of code, because it can check the return value from
> read_chunk(). It doesn't help the other cases (like OIDF) that still
> have to call read_chunk() themselves, though. Possibly read_chunk()
> should just take a flag to indicate that it should complain when the
> chunk is missing. And then callers could just do:
>
>   if (read_chunk(cf, id, our_callback, CHUNK_REQUIRED)
> 	return -1; /* no need to complain; either our_callback() did, or
> 	              read_chunk() itself */

We do return CHUNK_NOT_FOUND when we have a missing chunk, which we
could check for individually. But TBH, I don't find the first error all
that useful. In this instance, there's really only one way for the OIDL
chunk to be corrupt, which is that it has the wrong size.

And short of making the error much more robust, e.g.:

    error: commit-graph OID lookup chunk is the wrong size (got: $X, wanted: $Y)

and dropping the generic "missing or corrupt" error, I think that just
the generic error is fine here.

> -Peff

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 0/7] chunk-format: introduce `pair_chunk_expect()`
  2023-11-09  7:03 [PATCH 0/9] some more chunk-file bounds-checks fixes Jeff King
                   ` (9 preceding siblings ...)
  2023-11-09 21:22 ` [PATCH 0/9] some more chunk-file bounds-checks fixes Taylor Blau
@ 2023-11-09 22:34 ` Taylor Blau
  2023-11-09 22:34   ` [PATCH 1/7] chunk-format: introduce `pair_chunk_expect()` helper Taylor Blau
                     ` (6 more replies)
  10 siblings, 7 replies; 39+ messages in thread
From: Taylor Blau @ 2023-11-09 22:34 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano

This series is based on the latest patches in what I am assuming will
become "jk/chunk-bounds", found here [1].

This is a rework of my earlier round of patches in [2], which introduces
a new `pair_chunk_expect()` to replace some of the trivial callbacks
given to `read_chunk()` that assert on the size of the expected chunk.

Let me know what you think, and thanks in advance for your review!

[1]: https://lore.kernel.org/git/20231109070310.GA2697602@coredump.intra.peff.net/
[2]: https://lore.kernel.org/git/cover.1697225110.git.me@ttaylorr.com/

Taylor Blau (7):
  chunk-format: introduce `pair_chunk_expect()` helper
  commit-graph: read `OIDL` chunk with `pair_chunk_expect()`
  commit-graph: read `CDAT` chunk with `pair_chunk_expect()`
  commit-graph: read `GDAT` chunk with `pair_chunk_expect()`
  commit-graph: read `BIDX` chunk with `pair_chunk_expect()`
  midx: read `OIDL` chunk with `pair_chunk_expect()`
  midx: read `OOFF` chunk with `pair_chunk_expect()`

 chunk-format.c              | 29 ++++++++++++++++
 chunk-format.h              | 13 ++++++-
 commit-graph.c              | 67 ++++++++++---------------------------
 midx.c                      | 33 +++---------------
 t/t5318-commit-graph.sh     |  6 ++--
 t/t5319-multi-pack-index.sh |  2 --
 6 files changed, 65 insertions(+), 85 deletions(-)


base-commit: 8be77c5de65442b331a28d63802c7a3b94a06c5a
prerequisite-patch-id: 507f1dfd74fae351883612048d334ed750db8b8c
prerequisite-patch-id: c5eef290abc1d28950b3ee8729ea86d2e1773027
prerequisite-patch-id: 0853baab4862833faee8ade3b1b63ee3aa406224
prerequisite-patch-id: 6dd32f90fd87aa92f7d0661414cdaab257bd14b7
prerequisite-patch-id: b0e1617c501a011c703605e59dd5eba89f8a3573
prerequisite-patch-id: 906426f565c470dc86d7e7379d25ecfbd2bc30ce
prerequisite-patch-id: 970cd79d1911bde36b88c340f001266d7e8d843b
prerequisite-patch-id: 083a24abc73a83345bd9e4ca714577990dd9b08b
prerequisite-patch-id: 210371a338dfe9b6f905d8821501aa9c235c8722
-- 
2.43.0.rc0.39.g44bd344727

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 1/7] chunk-format: introduce `pair_chunk_expect()` helper
  2023-11-09 22:34 ` [PATCH 0/7] chunk-format: introduce `pair_chunk_expect()` Taylor Blau
@ 2023-11-09 22:34   ` Taylor Blau
  2023-11-10  4:55     ` Junio C Hamano
                       ` (2 more replies)
  2023-11-09 22:34   ` [PATCH 2/7] commit-graph: read `OIDL` chunk with `pair_chunk_expect()` Taylor Blau
                     ` (5 subsequent siblings)
  6 siblings, 3 replies; 39+ messages in thread
From: Taylor Blau @ 2023-11-09 22:34 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano

In 570b8b8836 (chunk-format: note that pair_chunk() is unsafe,
2023-10-09), the pair_chunk() interface grew a required "size" pointer,
so that the caller is forced to at least have a handle on the actual
size of the given chunk.

Many callers were converted to the new interface. A handful were instead
converted from the unsafe version of pair_chunk() to read_chunk() so
that they could check their expected size.

This led to a lot of code like:

    static int graph_read_oid_lookup(const unsigned char *chunk_start,
                                     size_t chunk_size, void *data)
    {
      struct commit_graph *g = data;
      g->chunk_oid_lookup = chunk_start;
      if (chunk_size / g->hash_len != g->num_commits)
        return error(_("commit-graph OID lookup chunk is the wrong size"));
      return 0;
    }

, leaving the caller to use read_chunk(), like so:

    read_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP, graph_read_oid_lookup, graph);

The callback to read_chunk() (in the above, `graph_read_oid_lookup()`)
does nothing more than (a) assign a pointer to the location of the start
of the chunk in the mmap'd file, and (b) assert that it has the correct
size.

For callers that know the expected size of their chunk(s) up-front (most
often because they are made up of a known number of fixed-size records),
we can simplify this by teaching the chunk-format API itself to validate
the expected size for us.

This is wrapped in a new function, called `pair_chunk_expect()` which
takes a pair of "size_t"s (corresponding to the record size and count),
instead of a "size_t *", and validates that the given chunk matches the
expected size as given.

This will allow us to reduce the number of lines of code it takes to
perform these basic read_chunk() operations, by taking the above and
replacing it with something like:

    if (pair_chunk_expect(cf, GRAPH_CHUNKID_OIDLOOKUP,
                          &graph->chunk_oid_lookup,
                          graph->hash_len, graph->num_commits))
      error(_("commit-graph oid lookup chunk is wrong size"));

We will perform those transformations in the following commits.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 chunk-format.c | 29 +++++++++++++++++++++++++++++
 chunk-format.h | 13 ++++++++++++-
 2 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/chunk-format.c b/chunk-format.c
index cdc7f39b70..be078dcca8 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -163,6 +163,10 @@ int read_table_of_contents(struct chunkfile *cf,
 struct pair_chunk_data {
 	const unsigned char **p;
 	size_t *size;
+
+	/* for pair_chunk_expect() only */
+	size_t record_size;
+	size_t record_nr;
 };
 
 static int pair_chunk_fn(const unsigned char *chunk_start,
@@ -175,6 +179,17 @@ static int pair_chunk_fn(const unsigned char *chunk_start,
 	return 0;
 }
 
+static int pair_chunk_expect_fn(const unsigned char *chunk_start,
+				size_t chunk_size,
+				void *data)
+{
+	struct pair_chunk_data *pcd = data;
+	if (chunk_size / pcd->record_size != pcd->record_nr)
+		return -1;
+	*pcd->p = chunk_start;
+	return 0;
+}
+
 int pair_chunk(struct chunkfile *cf,
 	       uint32_t chunk_id,
 	       const unsigned char **p,
@@ -184,6 +199,20 @@ int pair_chunk(struct chunkfile *cf,
 	return read_chunk(cf, chunk_id, pair_chunk_fn, &pcd);
 }
 
+int pair_chunk_expect(struct chunkfile *cf,
+		      uint32_t chunk_id,
+		      const unsigned char **p,
+		      size_t record_size,
+		      size_t record_nr)
+{
+	struct pair_chunk_data pcd = {
+		.p = p,
+		.record_size = record_size,
+		.record_nr = record_nr,
+	};
+	return read_chunk(cf, chunk_id, pair_chunk_expect_fn, &pcd);
+}
+
 int read_chunk(struct chunkfile *cf,
 	       uint32_t chunk_id,
 	       chunk_read_fn fn,
diff --git a/chunk-format.h b/chunk-format.h
index 14b76180ef..10806d7a9a 100644
--- a/chunk-format.h
+++ b/chunk-format.h
@@ -17,7 +17,8 @@ struct chunkfile;
  *
  * If reading a file, use a NULL 'struct hashfile *' and then call
  * read_table_of_contents(). Supply the memory-mapped data to the
- * pair_chunk() or read_chunk() methods, as appropriate.
+ * pair_chunk(), pair_chunk_expect(), or read_chunk() methods, as
+ * appropriate.
  *
  * DO NOT MIX THESE MODES. Use different 'struct chunkfile' instances
  * for reading and writing.
@@ -54,6 +55,16 @@ int pair_chunk(struct chunkfile *cf,
 	       const unsigned char **p,
 	       size_t *size);
 
+/*
+ * Similar to 'pair_chunk', but used for callers who are reading a chunk
+ * with a known number of fixed-width records.
+ */
+int pair_chunk_expect(struct chunkfile *cf,
+		      uint32_t chunk_id,
+		      const unsigned char **p,
+		      size_t record_size,
+		      size_t record_nr);
+
 typedef int (*chunk_read_fn)(const unsigned char *chunk_start,
 			     size_t chunk_size, void *data);
 /*
-- 
2.43.0.rc0.39.g44bd344727


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 2/7] commit-graph: read `OIDL` chunk with `pair_chunk_expect()`
  2023-11-09 22:34 ` [PATCH 0/7] chunk-format: introduce `pair_chunk_expect()` Taylor Blau
  2023-11-09 22:34   ` [PATCH 1/7] chunk-format: introduce `pair_chunk_expect()` helper Taylor Blau
@ 2023-11-09 22:34   ` Taylor Blau
  2023-11-10 22:10     ` Jeff King
  2023-11-09 22:34   ` [PATCH 3/7] commit-graph: read `CDAT` " Taylor Blau
                     ` (4 subsequent siblings)
  6 siblings, 1 reply; 39+ messages in thread
From: Taylor Blau @ 2023-11-09 22:34 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano

The `OIDL` chunk can benefit from the new chunk-format API function
described in the previous commit. Convert it to `pair_chunk_expect()`
accordingly.

While here, clean up some of the duplicate error messages to simplify
the output when we are missing or have a corrupt OIDL chunk.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 commit-graph.c          | 16 ++++------------
 t/t5318-commit-graph.sh |  3 +--
 2 files changed, 5 insertions(+), 14 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 93cf690565..6072c2a17f 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -299,16 +299,6 @@ static int graph_read_oid_fanout(const unsigned char *chunk_start,
 	return 0;
 }
 
-static int graph_read_oid_lookup(const unsigned char *chunk_start,
-				 size_t chunk_size, void *data)
-{
-	struct commit_graph *g = data;
-	g->chunk_oid_lookup = chunk_start;
-	if (chunk_size / g->hash_len != g->num_commits)
-		return error(_("commit-graph OID lookup chunk is the wrong size"));
-	return 0;
-}
-
 static int graph_read_commit_data(const unsigned char *chunk_start,
 				  size_t chunk_size, void *data)
 {
@@ -435,8 +425,10 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 		error(_("commit-graph required OID fanout chunk missing or corrupted"));
 		goto free_and_return;
 	}
-	if (read_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP, graph_read_oid_lookup, graph)) {
-		error(_("commit-graph required OID lookup chunk missing or corrupted"));
+	if (pair_chunk_expect(cf, GRAPH_CHUNKID_OIDLOOKUP,
+			      &graph->chunk_oid_lookup, graph->hash_len,
+			      graph->num_commits)) {
+		error(_("commit-graph OID lookup chunk is the wrong size"));
 		goto free_and_return;
 	}
 	if (read_chunk(cf, GRAPH_CHUNKID_DATA, graph_read_commit_data, graph)) {
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index b0d436a6f0..b3e8af018d 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -545,7 +545,7 @@ test_expect_success 'detect missing OID fanout chunk' '
 
 test_expect_success 'detect missing OID lookup chunk' '
 	corrupt_graph_and_verify $GRAPH_BYTE_OID_LOOKUP_ID "\0" \
-		"commit-graph required OID lookup chunk missing or corrupted"
+		"commit-graph OID lookup chunk is the wrong size"
 '
 
 test_expect_success 'detect missing commit data chunk' '
@@ -851,7 +851,6 @@ test_expect_success 'reader notices fanout/lookup table mismatch' '
 	check_corrupt_chunk OIDF 1020 "FFFFFFFF" &&
 	cat >expect.err <<-\EOF &&
 	error: commit-graph OID lookup chunk is the wrong size
-	error: commit-graph required OID lookup chunk missing or corrupted
 	EOF
 	test_cmp expect.err err
 '
-- 
2.43.0.rc0.39.g44bd344727


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 3/7] commit-graph: read `CDAT` chunk with `pair_chunk_expect()`
  2023-11-09 22:34 ` [PATCH 0/7] chunk-format: introduce `pair_chunk_expect()` Taylor Blau
  2023-11-09 22:34   ` [PATCH 1/7] chunk-format: introduce `pair_chunk_expect()` helper Taylor Blau
  2023-11-09 22:34   ` [PATCH 2/7] commit-graph: read `OIDL` chunk with `pair_chunk_expect()` Taylor Blau
@ 2023-11-09 22:34   ` Taylor Blau
  2023-11-09 22:34   ` [PATCH 4/7] commit-graph: read `GDAT` " Taylor Blau
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 39+ messages in thread
From: Taylor Blau @ 2023-11-09 22:34 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano

The `CDAT` chunk can benefit from the new chunk-format API function
described in the previous commit. Convert it to `pair_chunk_expect()`
accordingly.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 commit-graph.c          | 15 +++------------
 t/t5318-commit-graph.sh |  3 +--
 2 files changed, 4 insertions(+), 14 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 6072c2a17f..67ab0f2448 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -299,16 +299,6 @@ static int graph_read_oid_fanout(const unsigned char *chunk_start,
 	return 0;
 }
 
-static int graph_read_commit_data(const unsigned char *chunk_start,
-				  size_t chunk_size, void *data)
-{
-	struct commit_graph *g = data;
-	if (chunk_size / GRAPH_DATA_WIDTH != g->num_commits)
-		return error(_("commit-graph commit data chunk is wrong size"));
-	g->chunk_commit_data = chunk_start;
-	return 0;
-}
-
 static int graph_read_generation_data(const unsigned char *chunk_start,
 				      size_t chunk_size, void *data)
 {
@@ -431,8 +421,9 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 		error(_("commit-graph OID lookup chunk is the wrong size"));
 		goto free_and_return;
 	}
-	if (read_chunk(cf, GRAPH_CHUNKID_DATA, graph_read_commit_data, graph)) {
-		error(_("commit-graph required commit data chunk missing or corrupted"));
+	if (pair_chunk_expect(cf, GRAPH_CHUNKID_DATA, &graph->chunk_commit_data,
+			      GRAPH_DATA_WIDTH, graph->num_commits)) {
+		error(_("commit-graph commit data chunk is wrong size"));
 		goto free_and_return;
 	}
 
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index b3e8af018d..fd5165bd07 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -550,7 +550,7 @@ test_expect_success 'detect missing OID lookup chunk' '
 
 test_expect_success 'detect missing commit data chunk' '
 	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_DATA_ID "\0" \
-		"commit-graph required commit data chunk missing or corrupted"
+		"commit-graph commit data chunk is wrong size"
 '
 
 test_expect_success 'detect incorrect fanout' '
@@ -875,7 +875,6 @@ test_expect_success 'reader notices too-small commit data chunk' '
 	check_corrupt_chunk CDAT clear 00000000 &&
 	cat >expect.err <<-\EOF &&
 	error: commit-graph commit data chunk is wrong size
-	error: commit-graph required commit data chunk missing or corrupted
 	EOF
 	test_cmp expect.err err
 '
-- 
2.43.0.rc0.39.g44bd344727


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 4/7] commit-graph: read `GDAT` chunk with `pair_chunk_expect()`
  2023-11-09 22:34 ` [PATCH 0/7] chunk-format: introduce `pair_chunk_expect()` Taylor Blau
                     ` (2 preceding siblings ...)
  2023-11-09 22:34   ` [PATCH 3/7] commit-graph: read `CDAT` " Taylor Blau
@ 2023-11-09 22:34   ` Taylor Blau
  2023-11-09 22:34   ` [PATCH 5/7] commit-graph: read `BIDX` " Taylor Blau
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 39+ messages in thread
From: Taylor Blau @ 2023-11-09 22:34 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano

The `GDAT` chunk can benefit from the new chunk-format API function
described in the previous commit. Convert it to `pair_chunk_expect()`
accordingly.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 commit-graph.c | 17 +++++------------
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 67ab0f2448..3b456d199f 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -299,16 +299,6 @@ static int graph_read_oid_fanout(const unsigned char *chunk_start,
 	return 0;
 }
 
-static int graph_read_generation_data(const unsigned char *chunk_start,
-				      size_t chunk_size, void *data)
-{
-	struct commit_graph *g = data;
-	if (chunk_size / sizeof(uint32_t) != g->num_commits)
-		return error(_("commit-graph generations chunk is wrong size"));
-	g->chunk_generation_data = chunk_start;
-	return 0;
-}
-
 static int graph_read_bloom_index(const unsigned char *chunk_start,
 				  size_t chunk_size, void *data)
 {
@@ -433,8 +423,11 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 		   &graph->chunk_base_graphs_size);
 
 	if (s->commit_graph_generation_version >= 2) {
-		read_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA,
-			   graph_read_generation_data, graph);
+		if (pair_chunk_expect(cf, GRAPH_CHUNKID_GENERATION_DATA,
+				      &graph->chunk_generation_data,
+				      sizeof(uint32_t),
+				      graph->num_commits))
+			error(_("commit-graph generations chunk is wrong size"));
 		pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
 			   &graph->chunk_generation_data_overflow,
 			   &graph->chunk_generation_data_overflow_size);
-- 
2.43.0.rc0.39.g44bd344727


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 5/7] commit-graph: read `BIDX` chunk with `pair_chunk_expect()`
  2023-11-09 22:34 ` [PATCH 0/7] chunk-format: introduce `pair_chunk_expect()` Taylor Blau
                     ` (3 preceding siblings ...)
  2023-11-09 22:34   ` [PATCH 4/7] commit-graph: read `GDAT` " Taylor Blau
@ 2023-11-09 22:34   ` Taylor Blau
  2023-11-09 22:34   ` [PATCH 6/7] midx: read `OIDL` " Taylor Blau
  2023-11-09 22:34   ` [PATCH 7/7] midx: read `OOFF` " Taylor Blau
  6 siblings, 0 replies; 39+ messages in thread
From: Taylor Blau @ 2023-11-09 22:34 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano

The `BIDX` chunk can benefit from the new chunk-format API function
described in the previous commit. Convert it to `pair_chunk_expect()`
accordingly.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 commit-graph.c | 19 +++++--------------
 1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 3b456d199f..76d220508b 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -299,18 +299,6 @@ static int graph_read_oid_fanout(const unsigned char *chunk_start,
 	return 0;
 }
 
-static int graph_read_bloom_index(const unsigned char *chunk_start,
-				  size_t chunk_size, void *data)
-{
-	struct commit_graph *g = data;
-	if (chunk_size / 4 != g->num_commits) {
-		warning(_("commit-graph changed-path index chunk is too small"));
-		return -1;
-	}
-	g->chunk_bloom_indexes = chunk_start;
-	return 0;
-}
-
 static int graph_read_bloom_data(const unsigned char *chunk_start,
 				  size_t chunk_size, void *data)
 {
@@ -437,8 +425,11 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
 	}
 
 	if (s->commit_graph_read_changed_paths) {
-		read_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
-			   graph_read_bloom_index, graph);
+		int res = pair_chunk_expect(cf, GRAPH_CHUNKID_BLOOMINDEXES,
+				      &graph->chunk_bloom_indexes,
+				      sizeof(uint32_t), graph->num_commits);
+		if (res && res != CHUNK_NOT_FOUND)
+			warning(_("commit-graph changed-path index chunk is too small"));
 		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
 			   graph_read_bloom_data, graph);
 	}
-- 
2.43.0.rc0.39.g44bd344727


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 6/7] midx: read `OIDL` chunk with `pair_chunk_expect()`
  2023-11-09 22:34 ` [PATCH 0/7] chunk-format: introduce `pair_chunk_expect()` Taylor Blau
                     ` (4 preceding siblings ...)
  2023-11-09 22:34   ` [PATCH 5/7] commit-graph: read `BIDX` " Taylor Blau
@ 2023-11-09 22:34   ` Taylor Blau
  2023-11-09 22:34   ` [PATCH 7/7] midx: read `OOFF` " Taylor Blau
  6 siblings, 0 replies; 39+ messages in thread
From: Taylor Blau @ 2023-11-09 22:34 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano

The `OIDL` chunk can benefit from the new chunk-format API function
described in the previous commit. Convert it to `pair_chunk_expect()`
accordingly.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c                      | 16 ++--------------
 t/t5319-multi-pack-index.sh |  1 -
 2 files changed, 2 insertions(+), 15 deletions(-)

diff --git a/midx.c b/midx.c
index 1d14661dad..66f34ed1a3 100644
--- a/midx.c
+++ b/midx.c
@@ -86,19 +86,6 @@ static int midx_read_oid_fanout(const unsigned char *chunk_start,
 	return 0;
 }
 
-static int midx_read_oid_lookup(const unsigned char *chunk_start,
-				size_t chunk_size, void *data)
-{
-	struct multi_pack_index *m = data;
-	m->chunk_oid_lookup = chunk_start;
-
-	if (chunk_size != st_mult(m->hash_len, m->num_objects)) {
-		error(_("multi-pack-index OID lookup chunk is the wrong size"));
-		return 1;
-	}
-	return 0;
-}
-
 static int midx_read_object_offsets(const unsigned char *chunk_start,
 				    size_t chunk_size, void *data)
 {
@@ -186,7 +173,8 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 		die(_("multi-pack-index required pack-name chunk missing or corrupted"));
 	if (read_chunk(cf, MIDX_CHUNKID_OIDFANOUT, midx_read_oid_fanout, m))
 		die(_("multi-pack-index required OID fanout chunk missing or corrupted"));
-	if (read_chunk(cf, MIDX_CHUNKID_OIDLOOKUP, midx_read_oid_lookup, m))
+	if (pair_chunk_expect(cf, MIDX_CHUNKID_OIDLOOKUP, &m->chunk_oid_lookup,
+			      m->hash_len, m->num_objects))
 		die(_("multi-pack-index required OID lookup chunk missing or corrupted"));
 	if (read_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS, midx_read_object_offsets, m))
 		die(_("multi-pack-index required object offsets chunk missing or corrupted"));
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 313496c0cf..2d68616c59 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -1077,7 +1077,6 @@ test_expect_success 'reader notices too-small oid lookup chunk' '
 	corrupt_chunk OIDL clear 00000000 &&
 	test_must_fail git log 2>err &&
 	cat >expect <<-\EOF &&
-	error: multi-pack-index OID lookup chunk is the wrong size
 	fatal: multi-pack-index required OID lookup chunk missing or corrupted
 	EOF
 	test_cmp expect err
-- 
2.43.0.rc0.39.g44bd344727


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 7/7] midx: read `OOFF` chunk with `pair_chunk_expect()`
  2023-11-09 22:34 ` [PATCH 0/7] chunk-format: introduce `pair_chunk_expect()` Taylor Blau
                     ` (5 preceding siblings ...)
  2023-11-09 22:34   ` [PATCH 6/7] midx: read `OIDL` " Taylor Blau
@ 2023-11-09 22:34   ` Taylor Blau
  6 siblings, 0 replies; 39+ messages in thread
From: Taylor Blau @ 2023-11-09 22:34 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano

The `OOFF` chunk can benefit from the new chunk-format API function
described in the previous commit. Convert it to `pair_chunk_expect()`
accordingly.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c                      | 17 +++--------------
 t/t5319-multi-pack-index.sh |  1 -
 2 files changed, 3 insertions(+), 15 deletions(-)

diff --git a/midx.c b/midx.c
index 66f34ed1a3..ca41748b74 100644
--- a/midx.c
+++ b/midx.c
@@ -86,19 +86,6 @@ static int midx_read_oid_fanout(const unsigned char *chunk_start,
 	return 0;
 }
 
-static int midx_read_object_offsets(const unsigned char *chunk_start,
-				    size_t chunk_size, void *data)
-{
-	struct multi_pack_index *m = data;
-	m->chunk_object_offsets = chunk_start;
-
-	if (chunk_size != st_mult(m->num_objects, MIDX_CHUNK_OFFSET_WIDTH)) {
-		error(_("multi-pack-index object offset chunk is the wrong size"));
-		return 1;
-	}
-	return 0;
-}
-
 struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local)
 {
 	struct multi_pack_index *m = NULL;
@@ -176,7 +163,9 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 	if (pair_chunk_expect(cf, MIDX_CHUNKID_OIDLOOKUP, &m->chunk_oid_lookup,
 			      m->hash_len, m->num_objects))
 		die(_("multi-pack-index required OID lookup chunk missing or corrupted"));
-	if (read_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS, midx_read_object_offsets, m))
+	if (pair_chunk_expect(cf, MIDX_CHUNKID_OBJECTOFFSETS,
+			      &m->chunk_object_offsets, MIDX_CHUNK_OFFSET_WIDTH,
+			      m->num_objects))
 		die(_("multi-pack-index required object offsets chunk missing or corrupted"));
 
 	pair_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS, &m->chunk_large_offsets,
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 2d68616c59..f1f6764efe 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -1111,7 +1111,6 @@ test_expect_success 'reader notices too-small object offset chunk' '
 	corrupt_chunk OOFF clear 00000000 &&
 	test_must_fail git log 2>err &&
 	cat >expect <<-\EOF &&
-	error: multi-pack-index object offset chunk is the wrong size
 	fatal: multi-pack-index required object offsets chunk missing or corrupted
 	EOF
 	test_cmp expect err
-- 
2.43.0.rc0.39.g44bd344727

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/7] chunk-format: introduce `pair_chunk_expect()` helper
  2023-11-09 22:34   ` [PATCH 1/7] chunk-format: introduce `pair_chunk_expect()` helper Taylor Blau
@ 2023-11-10  4:55     ` Junio C Hamano
  2023-11-10 16:27       ` Taylor Blau
  2023-11-10 21:57       ` Jeff King
  2023-11-10 22:08     ` Jeff King
  2024-01-15 22:31     ` Linus Arver
  2 siblings, 2 replies; 39+ messages in thread
From: Junio C Hamano @ 2023-11-10  4:55 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King

Taylor Blau <me@ttaylorr.com> writes:

> +static int pair_chunk_expect_fn(const unsigned char *chunk_start,
> +				size_t chunk_size,
> +				void *data)
> +{
> +	struct pair_chunk_data *pcd = data;
> +	if (chunk_size / pcd->record_size != pcd->record_nr)
> +		return -1;
> +	*pcd->p = chunk_start;
> +	return 0;
> +}

I know one of the original places did the "divide the whole by
per-record size and see if it matches the number of records", the
same as we see above, but the check in the above could also be

	if (chunk_size != st_mult(pcd->record_size, pcd->record_nr))
		return -1;

which would also catch the case where chunk_size is not a multiple
of the record size.  Your conversion of OOFF in midx.c loses this
protection as the original uses the multiplication-and-compare, but
the rewrite to call pair_chunk_expect would call the above and
checks with the truncating-divide-and-compare.

Does the distinction matter?  I dunno.  If the record/chunk
alignment is asserted elsewhere, then the distinction should not
matter, but even if it were, seeing a truncating division used in
any validation makes my skin tingle.

Other than that, the series was a pleasant read.

Thanks.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/7] chunk-format: introduce `pair_chunk_expect()` helper
  2023-11-10  4:55     ` Junio C Hamano
@ 2023-11-10 16:27       ` Taylor Blau
  2023-11-10 22:01         ` Jeff King
  2023-11-10 23:38         ` Junio C Hamano
  2023-11-10 21:57       ` Jeff King
  1 sibling, 2 replies; 39+ messages in thread
From: Taylor Blau @ 2023-11-10 16:27 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King

On Fri, Nov 10, 2023 at 01:55:48PM +0900, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > +static int pair_chunk_expect_fn(const unsigned char *chunk_start,
> > +				size_t chunk_size,
> > +				void *data)
> > +{
> > +	struct pair_chunk_data *pcd = data;
> > +	if (chunk_size / pcd->record_size != pcd->record_nr)
> > +		return -1;
> > +	*pcd->p = chunk_start;
> > +	return 0;
> > +}
>
> I know one of the original places did the "divide the whole by
> per-record size and see if it matches the number of records", the
> same as we see above, but the check in the above could also be
>
> 	if (chunk_size != st_mult(pcd->record_size, pcd->record_nr))
> 		return -1;
>
> which would also catch the case where chunk_size is not a multiple
> of the record size.  Your conversion of OOFF in midx.c loses this
> protection as the original uses the multiplication-and-compare, but
> the rewrite to call pair_chunk_expect would call the above and
> checks with the truncating-divide-and-compare.

Hmm. I was thinking of Peff's "commit-graph: handle overflow in
chunk_size checks", but I think that I was overly eager in applying the
same reasoning to the MIDX code.

The important piece of the rationale in that patch is as follows:

    In the current code this is only possible for the CDAT chunk, but
    the reasons are quite subtle. We compute g->num_commits by dividing
    the size of the OIDL chunk by the hash length (since it consists of
    a bunch of hashes). So we know that any size_t multiplication that
    uses a value smaller than the hash length cannot overflow. And the
    CDAT records are the only ones that are larger (the others are just
    4-byte records). So it's worth fixing all of these, to make it clear
    that they're not subject to overflow (without having to reason about
    seemingly unrelated code).

In particular, that g->num_commits is computed by dividing the length of
the OIDL chunk by the hash length, thus any size_t multiplication of
g->num_commits with a value smaller than that hash length cannot
overflow.

But I don't think we enjoy the same benefits in the MIDX scenario. In
this case, the num_objects field is just:

    m->num_objects = ntohl(m->chunk_oid_fanout[255])

so I don't think we can make the same guarantees about what is and isn't
save to compute under size_t arithmetic.

I'd be curious what Peff has to say, but if he agrees with me, I'd
recommend taking the first five patches, and dropping the two
MIDX-related ones.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 6/9] commit-graph: use fanout value for graph size
  2023-11-09 22:15       ` Taylor Blau
@ 2023-11-10 21:52         ` Jeff King
  0 siblings, 0 replies; 39+ messages in thread
From: Jeff King @ 2023-11-10 21:52 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git

On Thu, Nov 09, 2023 at 05:15:05PM -0500, Taylor Blau wrote:

> > This is one of the ways that pair_chunk_expect() could do better without
> > adding a lot of code, because it can check the return value from
> > read_chunk(). It doesn't help the other cases (like OIDF) that still
> > have to call read_chunk() themselves, though. Possibly read_chunk()
> > should just take a flag to indicate that it should complain when the
> > chunk is missing. And then callers could just do:
> >
> >   if (read_chunk(cf, id, our_callback, CHUNK_REQUIRED)
> > 	return -1; /* no need to complain; either our_callback() did, or
> > 	              read_chunk() itself */
> 
> We do return CHUNK_NOT_FOUND when we have a missing chunk, which we
> could check for individually. But TBH, I don't find the first error all
> that useful. In this instance, there's really only one way for the OIDL
> chunk to be corrupt, which is that it has the wrong size.

But aren't there two ways it can go wrong? It can be the wrong size, or
it can be missing. In the first we say:

  error: wrong size
  error: missing or corrupted

and in the second we say:

  error: missing or corrupted

Which is why I think issuing a message from the callback has value. Of
course the ideal would be:

  error: wrong size

and:

  error: missing

but I didn't think it was worth making the code much longer for an error
condition we don't really expect anybody to see in practice.

And also...

> And short of making the error much more robust, e.g.:
> 
>     error: commit-graph OID lookup chunk is the wrong size (got: $X, wanted: $Y)

...yes, I think that would be worth doing, especially if you are
centralizing the error messages in pair_chunk_expect(). But your series
goes the opposite direction.

> and dropping the generic "missing or corrupt" error, I think that just
> the generic error is fine here.

If you drop that error, then who will warn about the missing-chunk case?
The user would see nothing at all.

-Peff

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/7] chunk-format: introduce `pair_chunk_expect()` helper
  2023-11-10  4:55     ` Junio C Hamano
  2023-11-10 16:27       ` Taylor Blau
@ 2023-11-10 21:57       ` Jeff King
  2023-11-10 22:09         ` Jeff King
  1 sibling, 1 reply; 39+ messages in thread
From: Jeff King @ 2023-11-10 21:57 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git

On Fri, Nov 10, 2023 at 01:55:48PM +0900, Junio C Hamano wrote:

> Taylor Blau <me@ttaylorr.com> writes:
> 
> > +static int pair_chunk_expect_fn(const unsigned char *chunk_start,
> > +				size_t chunk_size,
> > +				void *data)
> > +{
> > +	struct pair_chunk_data *pcd = data;
> > +	if (chunk_size / pcd->record_size != pcd->record_nr)
> > +		return -1;
> > +	*pcd->p = chunk_start;
> > +	return 0;
> > +}
> 
> I know one of the original places did the "divide the whole by
> per-record size and see if it matches the number of records", the
> same as we see above, but the check in the above could also be
> 
> 	if (chunk_size != st_mult(pcd->record_size, pcd->record_nr))
> 		return -1;
> 
> which would also catch the case where chunk_size is not a multiple
> of the record size.  Your conversion of OOFF in midx.c loses this
> protection as the original uses the multiplication-and-compare, but
> the rewrite to call pair_chunk_expect would call the above and
> checks with the truncating-divide-and-compare.
> 
> Does the distinction matter?  I dunno.  If the record/chunk
> alignment is asserted elsewhere, then the distinction should not
> matter, but even if it were, seeing a truncating division used in
> any validation makes my skin tingle.

Yes, the distinction does matter. If pair_chunk_expect_fn() used
st_mult(), then it would die on overflow, rather than returning an
error. For commit-graph files this is propagated up, and we continue the
operation by falling back to the non-graph code. There's a test in t5318
that will fail in this case. See patches 1 and 6 in my series for more
discussion.

One of those patches calls out the truncating division issue, but to
summarize: IMHO this is OK, as what we really want to know is "is it big
enough that we can always ask for NR records of size ELEM", which
division gives us. If we do want to be more precise, but also avoid
die(), we'd need a variant of st_mult() that returns a boolean. I didn't
think it was worth it for this case (but arguably it is something that
would be useful to have in general).

-Peff

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/7] chunk-format: introduce `pair_chunk_expect()` helper
  2023-11-10 16:27       ` Taylor Blau
@ 2023-11-10 22:01         ` Jeff King
  2023-11-10 23:39           ` Junio C Hamano
  2023-11-10 23:38         ` Junio C Hamano
  1 sibling, 1 reply; 39+ messages in thread
From: Jeff King @ 2023-11-10 22:01 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Junio C Hamano, git

On Fri, Nov 10, 2023 at 11:27:41AM -0500, Taylor Blau wrote:

> Hmm. I was thinking of Peff's "commit-graph: handle overflow in
> chunk_size checks", but I think that I was overly eager in applying the
> same reasoning to the MIDX code.
> 
> The important piece of the rationale in that patch is as follows:
> 
>     In the current code this is only possible for the CDAT chunk, but
>     the reasons are quite subtle. We compute g->num_commits by dividing
>     the size of the OIDL chunk by the hash length (since it consists of
>     a bunch of hashes). So we know that any size_t multiplication that
>     uses a value smaller than the hash length cannot overflow. And the
>     CDAT records are the only ones that are larger (the others are just
>     4-byte records). So it's worth fixing all of these, to make it clear
>     that they're not subject to overflow (without having to reason about
>     seemingly unrelated code).
> 
> In particular, that g->num_commits is computed by dividing the length of
> the OIDL chunk by the hash length, thus any size_t multiplication of
> g->num_commits with a value smaller than that hash length cannot
> overflow.
> 
> But I don't think we enjoy the same benefits in the MIDX scenario. In
> this case, the num_objects field is just:
> 
>     m->num_objects = ntohl(m->chunk_oid_fanout[255])
> 
> so I don't think we can make the same guarantees about what is and isn't
> save to compute under size_t arithmetic.

Yes, the logic does not apply to the midx code (just like the graph code
after we switch to using the fanout value later in my series). But that
paragraph was just explaining why only the CDAT chunk was _currently_
vulnerable.

The more interesting question is the paragraphs after that, about
whether it is OK to die() or not when we see overflow (and IMHO it is
not for commit-graphs).

The situation _is_ different there for midx's, because we immediately
die() if we see a bogus midx anyway. But I think that is wrong, and
something we may want to change in the long run. Both because it's the
wrong thing for lib-ification, but also because we can easily keep going
if the midx is broken; we can still use the individual pack idx files.

> I'd be curious what Peff has to say, but if he agrees with me, I'd
> recommend taking the first five patches, and dropping the two
> MIDX-related ones.

I think dropping those is a bad direction. The point of adding
pair_chunk_expect() is that we could use it consistently.

-Peff

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/7] chunk-format: introduce `pair_chunk_expect()` helper
  2023-11-09 22:34   ` [PATCH 1/7] chunk-format: introduce `pair_chunk_expect()` helper Taylor Blau
  2023-11-10  4:55     ` Junio C Hamano
@ 2023-11-10 22:08     ` Jeff King
  2024-01-15 22:31     ` Linus Arver
  2 siblings, 0 replies; 39+ messages in thread
From: Jeff King @ 2023-11-10 22:08 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Junio C Hamano

On Thu, Nov 09, 2023 at 05:34:11PM -0500, Taylor Blau wrote:

> +static int pair_chunk_expect_fn(const unsigned char *chunk_start,
> +				size_t chunk_size,
> +				void *data)
> +{
> +	struct pair_chunk_data *pcd = data;
> +	if (chunk_size / pcd->record_size != pcd->record_nr)
> +		return -1;
> +	*pcd->p = chunk_start;
> +	return 0;
> +}

I think this is taking us backwards in terms of the output the user sees
(see the message I just sent elsewhere in the thread). The user won't be
able to tell the difference between a missing chunk and one with the
wrong size.

And we miss the opportunity to give more details about the expected and
detected sizes of the chunks.

If the two-line error message really bothers you, then...

> +int pair_chunk_expect(struct chunkfile *cf,
> +		      uint32_t chunk_id,
> +		      const unsigned char **p,
> +		      size_t record_size,
> +		      size_t record_nr)
> +{
> +	struct pair_chunk_data pcd = {
> +		.p = p,
> +		.record_size = record_size,
> +		.record_nr = record_nr,
> +	};
> +	return read_chunk(cf, chunk_id, pair_chunk_expect_fn, &pcd);
> +}

...this is where we could take a CHUNK_REQUIRED flag, and so:

  ret = read_chunk(...);
  /* pair_chunk_expect_fn() wrote an error already for other cases */
  if (ret == CHUNK_MISSING)
	error("chunk not found");
  return ret;

And then the callers become a very nice:

  if (pair_chunk_expect(cf, id, &ptr, size, nr, CHUNK_REQUIRED))
	return -1;

You probably would need to pass some kind of string giving more context
for the error messages, though. We can show the chunk id, but the
context of "commit-graph" vs "midx" is important.

-Peff

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/7] chunk-format: introduce `pair_chunk_expect()` helper
  2023-11-10 21:57       ` Jeff King
@ 2023-11-10 22:09         ` Jeff King
  0 siblings, 0 replies; 39+ messages in thread
From: Jeff King @ 2023-11-10 22:09 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git

On Fri, Nov 10, 2023 at 04:57:47PM -0500, Jeff King wrote:

> One of those patches calls out the truncating division issue, but to
> summarize: IMHO this is OK, as what we really want to know is "is it big
> enough that we can always ask for NR records of size ELEM", which
> division gives us. If we do want to be more precise, but also avoid
> die(), we'd need a variant of st_mult() that returns a boolean. I didn't
> think it was worth it for this case (but arguably it is something that
> would be useful to have in general).

Oh, and obviously there is another option here if we want to be more
careful but don't want to introduce an st_mult() variant: we can use "%"
to check for divisibility ourselves.

I don't think it's worth doing that in every individual size-check, but
maybe it would be in a central pair_chunk_expect().

-Peff

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/7] commit-graph: read `OIDL` chunk with `pair_chunk_expect()`
  2023-11-09 22:34   ` [PATCH 2/7] commit-graph: read `OIDL` chunk with `pair_chunk_expect()` Taylor Blau
@ 2023-11-10 22:10     ` Jeff King
  0 siblings, 0 replies; 39+ messages in thread
From: Jeff King @ 2023-11-10 22:10 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Junio C Hamano

On Thu, Nov 09, 2023 at 05:34:14PM -0500, Taylor Blau wrote:

> @@ -435,8 +425,10 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s,
>  		error(_("commit-graph required OID fanout chunk missing or corrupted"));
>  		goto free_and_return;
>  	}
> -	if (read_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP, graph_read_oid_lookup, graph)) {
> -		error(_("commit-graph required OID lookup chunk missing or corrupted"));
> +	if (pair_chunk_expect(cf, GRAPH_CHUNKID_OIDLOOKUP,
> +			      &graph->chunk_oid_lookup, graph->hash_len,
> +			      graph->num_commits)) {
> +		error(_("commit-graph OID lookup chunk is the wrong size"));
>  		goto free_and_return;

I know the original message was vague, but I think the new one is
actively misleading in the case of a missing chunk. We'll say "wrong
size", but it was not present at all!

-Peff

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/7] chunk-format: introduce `pair_chunk_expect()` helper
  2023-11-10 16:27       ` Taylor Blau
  2023-11-10 22:01         ` Jeff King
@ 2023-11-10 23:38         ` Junio C Hamano
  1 sibling, 0 replies; 39+ messages in thread
From: Junio C Hamano @ 2023-11-10 23:38 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King

Taylor Blau <me@ttaylorr.com> writes:

> But I don't think we enjoy the same benefits in the MIDX scenario. In
> this case, the num_objects field is just:
>
>     m->num_objects = ntohl(m->chunk_oid_fanout[255])
>
> so I don't think we can make the same guarantees about what is and isn't
> save to compute under size_t arithmetic.

So ..., from the correctness's point of view, if we do not mind
st_mult() dying, the "multiply-and-compare" should give us much more
robust results.  If we do mind st_mult() dying, we could pair the
"truncating-division-and-compare" you wrote with "ensure that chunk
size is a multiple of record size" to achieve the same, I would
think.  I.e.,

	if (chunk_size % pcd->record_size ||
	    chunk_size / pcd->record_size != pcd->record_nr)
		return -1;

or something like that.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/7] chunk-format: introduce `pair_chunk_expect()` helper
  2023-11-10 22:01         ` Jeff King
@ 2023-11-10 23:39           ` Junio C Hamano
  0 siblings, 0 replies; 39+ messages in thread
From: Junio C Hamano @ 2023-11-10 23:39 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git

Jeff King <peff@peff.net> writes:

> I think dropping those is a bad direction. The point of adding
> pair_chunk_expect() is that we could use it consistently.

I think so, too.  If we are adding a helper to avoid common
mistakes, and if it cannot be used in some situations, then the
helper needs to be improved.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/7] chunk-format: introduce `pair_chunk_expect()` helper
  2023-11-09 22:34   ` [PATCH 1/7] chunk-format: introduce `pair_chunk_expect()` helper Taylor Blau
  2023-11-10  4:55     ` Junio C Hamano
  2023-11-10 22:08     ` Jeff King
@ 2024-01-15 22:31     ` Linus Arver
  2024-01-15 22:53       ` Linus Arver
  2024-01-16 15:10       ` Jeff King
  2 siblings, 2 replies; 39+ messages in thread
From: Linus Arver @ 2024-01-15 22:31 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: Jeff King, Junio C Hamano

Taylor Blau <me@ttaylorr.com> writes:

> +static int pair_chunk_expect_fn(const unsigned char *chunk_start,
> +				size_t chunk_size,
> +				void *data)
> +{
> +	struct pair_chunk_data *pcd = data;
> +	if (chunk_size / pcd->record_size != pcd->record_nr)
> +		return -1;
> +	*pcd->p = chunk_start;
> +	return 0;
> +}
> +

I don't think this function should assume anything about the inputs
(chunk_size, record_size, nor record_nr). If we are asking the helper to
be the common tool for multiple locations, it should assume even less
about what these inputs could look like.

For example, if record_size is 0 this will lead to a
divide-by-zero. Likewise, if record_nr is zero it doesn't make
sense to check if chunk_size fits into record_size * record_nr.

And even if there are no (unexpected) zero-values involved, shouldn't we
also check for nonsensical comparisons, such as if chunk_size is smaller
than record_size?

I think there are several possibilities:

CHUNK_MISSING   (chunk_size == 0)
CHUNK_TOO_SMALL (chunk_size < record_size)
CHUNK_OK        (chunk_size == record_nr * record_size)
CHUNK_TOO_BIG   (chunk_size > record_size, record_nr * record_size < chunk_size)

And for the CHUNK_TOO_BIG case there are two possibilities --- the
leftover parts of chunk_size that are not accounted by "record_nr *
record_size" are (or are not) "aligned" such that increasing the
record_size by some positive increment could exactly match the
chunk_size. For example, chunk_size = 128, record_size = 8, record_nr =
8. In this case the records account for 64 bytes so we have 128 - 64 =
64 bytes left over, and simply increasing record_nr to 16 could account
for every bytes in chunk_size. If chunk_size in this example was 120 or
130, there would be bytes left over that cannot be accounted for by
adjusting record_size. This divisibility-of-leftover-bytes check could
be done with '%' as mentioned already by others.

So in summary there appear to be the following possibilities:

CHUNK_MISSING
CHUNK_TOO_SMALL
CHUNK_OK
CHUNK_TOO_BIG_ALIGNED
CHUNK_TOO_BIG_MISALIGNED

(on top of the cases where record_* inputs are zero).

And it seems prudent to treat each of the not-OK cases differently
(including how we report error messages) once we know which category we
fall into.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/7] chunk-format: introduce `pair_chunk_expect()` helper
  2024-01-15 22:31     ` Linus Arver
@ 2024-01-15 22:53       ` Linus Arver
  2024-01-16 15:10       ` Jeff King
  1 sibling, 0 replies; 39+ messages in thread
From: Linus Arver @ 2024-01-15 22:53 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: Jeff King, Junio C Hamano

Linus Arver <linusa@google.com> writes:

> So in summary there appear to be the following possibilities:
>
> CHUNK_MISSING
> CHUNK_TOO_SMALL
> CHUNK_OK
> CHUNK_TOO_BIG_ALIGNED
> CHUNK_TOO_BIG_MISALIGNED

On second thought, it appears that CHUNK_TOO_SMALL has three cases:

(1) chunk_size < record_size
(2) chunk_size >= record_size, but chunk_size < record_size * record_nr
    and decreasing record_nr can make chunk_size "fit"
(3) chunk_size >= record_size, but chunk_size < record_size * record_nr
    and decreasing record_nr cannot make chunk_size "fit"

where (2) and (3) are just like the *_(MIS)ALIGNED cases above when
chunk_size is too big.

My default position is that these additional cases might need to be
treated differently, although maybe it's overkill also.

Thanks.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/7] chunk-format: introduce `pair_chunk_expect()` helper
  2024-01-15 22:31     ` Linus Arver
  2024-01-15 22:53       ` Linus Arver
@ 2024-01-16 15:10       ` Jeff King
  2024-01-18 23:59         ` Linus Arver
  1 sibling, 1 reply; 39+ messages in thread
From: Jeff King @ 2024-01-16 15:10 UTC (permalink / raw)
  To: Linus Arver; +Cc: Taylor Blau, git, Junio C Hamano

On Mon, Jan 15, 2024 at 02:31:12PM -0800, Linus Arver wrote:

> > +static int pair_chunk_expect_fn(const unsigned char *chunk_start,
> > +				size_t chunk_size,
> > +				void *data)
> > +{
> > +	struct pair_chunk_data *pcd = data;
> > +	if (chunk_size / pcd->record_size != pcd->record_nr)
> > +		return -1;
> > +	*pcd->p = chunk_start;
> > +	return 0;
> > +}
> > +
> 
> I don't think this function should assume anything about the inputs
> (chunk_size, record_size, nor record_nr). If we are asking the helper to
> be the common tool for multiple locations, it should assume even less
> about what these inputs could look like.
>
> For example, if record_size is 0 this will lead to a
> divide-by-zero. Likewise, if record_nr is zero it doesn't make
> sense to check if chunk_size fits into record_size * record_nr.

I'm not sure that divide-by-zero is a big deal, because 0-length
fixed-size records do not make any sense to ask about. And even if
somebody accidentally passed 0, even though it won't be caught by the
compiler, it would barf on any input, so even rudimentary testing would
catch it.

A zero record_nr is a perfectly reasonable thing to ask about. If you
have a chunk file with zero entries for some entity, then we are
checking that the chunk is the expected zero length as well.

> And even if there are no (unexpected) zero-values involved, shouldn't we
> also check for nonsensical comparisons, such as if chunk_size is smaller
> than record_size?

Aren't we checking that already? If chunk_size is less than record_size,
then the division will result in "0". If the expected number of records
is not also 0, then that's a bogus file.

What we really care about here is that we won't walk off the end of the
buffer while looking at N fixed-size records (in that sense, the "too
big" test is overly careful, but it's easy to include as a sanity
check).

> So in summary there appear to be the following possibilities:
> 
> CHUNK_MISSING
> CHUNK_TOO_SMALL
> CHUNK_OK
> CHUNK_TOO_BIG_ALIGNED
> CHUNK_TOO_BIG_MISALIGNED

So yes, I agree all these cases exist. We detect them all except the
"misaligned" case (which I think was discussed earlier in the thread as
not worth caring too much about).

But...

> (on top of the cases where record_* inputs are zero).
> 
> And it seems prudent to treat each of the not-OK cases differently
> (including how we report error messages) once we know which category we
> fall into.

I'm not sure it is worth caring too much about the different cases. From
the caller's perspective the end result is always to avoid using the
chunk/file. From the user's perspective, any error of the form "we
expected N bytes and got X" is plenty. Especially since record_nr
typically comes from another part of the file, we cannot tell if our
chunk is too big/small, or if another chunk gave us a bogus record_nr.

-Peff

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/7] chunk-format: introduce `pair_chunk_expect()` helper
  2024-01-16 15:10       ` Jeff King
@ 2024-01-18 23:59         ` Linus Arver
  0 siblings, 0 replies; 39+ messages in thread
From: Linus Arver @ 2024-01-18 23:59 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git, Junio C Hamano

Jeff King <peff@peff.net> writes:

> On Mon, Jan 15, 2024 at 02:31:12PM -0800, Linus Arver wrote:
>
>> > +static int pair_chunk_expect_fn(const unsigned char *chunk_start,
>> > +				size_t chunk_size,
>> > +				void *data)
>> > +{
>> > +	struct pair_chunk_data *pcd = data;
>> > +	if (chunk_size / pcd->record_size != pcd->record_nr)
>> > +		return -1;
>> > +	*pcd->p = chunk_start;
>> > +	return 0;
>> > +}
>> > +
>> 
>> I don't think this function should assume anything about the inputs
>> (chunk_size, record_size, nor record_nr). If we are asking the helper to
>> be the common tool for multiple locations, it should assume even less
>> about what these inputs could look like.
>>
>> For example, if record_size is 0 this will lead to a
>> divide-by-zero. Likewise, if record_nr is zero it doesn't make
>> sense to check if chunk_size fits into record_size * record_nr.
>
> I'm not sure that divide-by-zero is a big deal, because 0-length
> fixed-size records do not make any sense to ask about.

So why not make this function check for this? While it may be true that
0-length fixed-size records are impossible currently, nothing guarantees
they will always be that way all the time in the future.

> And even if
> somebody accidentally passed 0, even though it won't be caught by the
> compiler, it would barf on any input, so even rudimentary testing would
> catch it.

If somebody is accidentally passing an invalid value to a function, then
it feels right for that function to be able to handle it instead of
crashing (or doing any other undefined behavior) with division-by-zero.

Taking a step back though, maybe I am being overly defensive about
possible failure modes. I don't know the surrounding area well so I
might be overreacting.

> A zero record_nr is a perfectly reasonable thing to ask about. If you
> have a chunk file with zero entries for some entity, then we are
> checking that the chunk is the expected zero length as well.

Right.

>> And even if there are no (unexpected) zero-values involved, shouldn't we
>> also check for nonsensical comparisons, such as if chunk_size is smaller
>> than record_size?
>
> Aren't we checking that already? If chunk_size is less than record_size,
> then the division will result in "0". If the expected number of records
> is not also 0, then that's a bogus file.

I was thinking of an early return like

    if (chunk_size < record_size)
        return CHUNK_TOO_SMALL

before evaluating (chunk_size / pcd->record_size != pcd->record_nr).
You're correct that the division will result in "0" if chunk_size is
less than record_size. But I didn't like having the extra mental load
for reading and understanding the correctness of "if (chunk_size /
pcd->record_size != pcd->record_nr)" for that case. IOW, the more early
returns we have for known-bad cases, by the time we get to "if
(chunk_size / pcd->record_size != pcd->record_nr)" it would be that much
easier to understand that code.

> What we really care about here is that we won't walk off the end of the
> buffer while looking at N fixed-size records ...

Ah, I see. This sort of insight would be great to have as a comment in
the code.

> ... (in that sense, the "too
> big" test is overly careful, but it's easy to include as a sanity
> check).

OK.

>> So in summary there appear to be the following possibilities:
>> 
>> CHUNK_MISSING
>> CHUNK_TOO_SMALL
>> CHUNK_OK
>> CHUNK_TOO_BIG_ALIGNED
>> CHUNK_TOO_BIG_MISALIGNED
>
> So yes, I agree all these cases exist. We detect them all except the
> "misaligned" case (which I think was discussed earlier in the thread as
> not worth caring too much about).

OK.

> But...
>
>> (on top of the cases where record_* inputs are zero).
>> 
>> And it seems prudent to treat each of the not-OK cases differently
>> (including how we report error messages) once we know which category we
>> fall into.
>
> I'm not sure it is worth caring too much about the different cases. From
> the caller's perspective the end result is always to avoid using the
> chunk/file.

Ah OK. Then yes, it does seem like caring about the different cases is
too much from the callers' perspective.

But I do think that checking the different cases with early returns will
at least help readability (and as a bonus assure future Git developers
that divide-by-zero errors are impossible).

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2024-01-18 23:59 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-11-09  7:03 [PATCH 0/9] some more chunk-file bounds-checks fixes Jeff King
2023-11-09  7:09 ` [PATCH 1/9] commit-graph: handle overflow in chunk_size checks Jeff King
2023-11-09 21:13   ` Taylor Blau
2023-11-09 21:27     ` Jeff King
2023-11-09  7:12 ` [PATCH 2/9] midx: check consistency of fanout table Jeff King
2023-11-09  7:13 ` [PATCH 3/9] commit-graph: drop redundant call to "lite" verification Jeff King
2023-11-09  7:14 ` [PATCH 4/9] commit-graph: clarify missing-chunk error messages Jeff King
2023-11-09  7:17 ` [PATCH 5/9] commit-graph: abort as soon as we see a bogus chunk Jeff King
2023-11-09 21:18   ` Taylor Blau
2023-11-09  7:24 ` [PATCH 6/9] commit-graph: use fanout value for graph size Jeff King
2023-11-09 21:20   ` Taylor Blau
2023-11-09 21:38     ` Jeff King
2023-11-09 22:15       ` Taylor Blau
2023-11-10 21:52         ` Jeff King
2023-11-09  7:25 ` [PATCH 7/9] commit-graph: check order while reading fanout chunk Jeff King
2023-11-09  7:25 ` [PATCH 8/9] commit-graph: drop verify_commit_graph_lite() Jeff King
2023-11-09  7:26 ` [PATCH 9/9] commit-graph: mark chunk error messages for translation Jeff King
2023-11-09 21:22 ` [PATCH 0/9] some more chunk-file bounds-checks fixes Taylor Blau
2023-11-09 22:34 ` [PATCH 0/7] chunk-format: introduce `pair_chunk_expect()` Taylor Blau
2023-11-09 22:34   ` [PATCH 1/7] chunk-format: introduce `pair_chunk_expect()` helper Taylor Blau
2023-11-10  4:55     ` Junio C Hamano
2023-11-10 16:27       ` Taylor Blau
2023-11-10 22:01         ` Jeff King
2023-11-10 23:39           ` Junio C Hamano
2023-11-10 23:38         ` Junio C Hamano
2023-11-10 21:57       ` Jeff King
2023-11-10 22:09         ` Jeff King
2023-11-10 22:08     ` Jeff King
2024-01-15 22:31     ` Linus Arver
2024-01-15 22:53       ` Linus Arver
2024-01-16 15:10       ` Jeff King
2024-01-18 23:59         ` Linus Arver
2023-11-09 22:34   ` [PATCH 2/7] commit-graph: read `OIDL` chunk with `pair_chunk_expect()` Taylor Blau
2023-11-10 22:10     ` Jeff King
2023-11-09 22:34   ` [PATCH 3/7] commit-graph: read `CDAT` " Taylor Blau
2023-11-09 22:34   ` [PATCH 4/7] commit-graph: read `GDAT` " Taylor Blau
2023-11-09 22:34   ` [PATCH 5/7] commit-graph: read `BIDX` " Taylor Blau
2023-11-09 22:34   ` [PATCH 6/7] midx: read `OIDL` " Taylor Blau
2023-11-09 22:34   ` [PATCH 7/7] midx: read `OOFF` " Taylor Blau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).