Git development

Git development
 help / color / mirror / Atom feed

* [PATCH v2 4/5] reftable/writer: fix writing multi-level indices
From: Patrick Steinhardt @ 2024-02-01  7:52 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Toon Claes, Kristoffer Haugsbakk
In-Reply-To: <cover.1706773842.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 4607 bytes --]

When finishing a section we will potentially write an index that makes
it more efficient to look up relevant blocks. The index records written
will encode, for each block of the indexed section, what the offset of
that block is as well as the last key of that block. Thus, the reader
would iterate through the index records to find the first key larger or
equal to the wanted key and then use the encoded offset to look up the
desired block.

When there are a lot of blocks to index though we may end up writing
multiple index blocks, too. To not require a linear search across all
index blocks we instead end up writing a multi-level index. Instead of
referring to the block we are after, an index record may point to
another index block. The reader will then access the highest-level index
and follow down the chain of index blocks until it hits the sought-after
block.

It has been observed though that it is impossible to seek ref records of
the last ref block when using a multi-level index. While the multi-level
index exists and looks fine for most of the part, the highest-level
index was missing an index record pointing to the last block of the next
index. Thus, every additional level made more refs become unseekable at
the end of the ref section.

The root cause is that we are not flushing the last block of the current
level once done writing the level. Consequently, it wasn't recorded in
the blocks that need to be indexed by the next-higher level and thus we
forgot about it.

Fix this bug by flushing blocks after we have written all index records.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/readwrite_test.c | 56 +++++++++++++++++++++++++++++++++++++++
 reftable/writer.c         |  8 +++---
 2 files changed, 60 insertions(+), 4 deletions(-)

diff --git a/reftable/readwrite_test.c b/reftable/readwrite_test.c
index 6b99daeaf2..853923397e 100644
--- a/reftable/readwrite_test.c
+++ b/reftable/readwrite_test.c
@@ -866,6 +866,61 @@ static void test_write_multiple_indices(void)
 	strbuf_release(&buf);
 }

+static void test_write_multi_level_index(void)
+{
+	struct reftable_write_options opts = {
+		.block_size = 100,
+	};
+	struct strbuf writer_buf = STRBUF_INIT, buf = STRBUF_INIT;
+	struct reftable_block_source source = { 0 };
+	struct reftable_iterator it = { 0 };
+	const struct reftable_stats *stats;
+	struct reftable_writer *writer;
+	struct reftable_reader *reader;
+	int err;
+
+	writer = reftable_new_writer(&strbuf_add_void, &noop_flush, &writer_buf, &opts);
+	reftable_writer_set_limits(writer, 1, 1);
+	for (size_t i = 0; i < 200; i++) {
+		struct reftable_ref_record ref = {
+			.update_index = 1,
+			.value_type = REFTABLE_REF_VAL1,
+			.value.val1 = {i},
+		};
+
+		strbuf_reset(&buf);
+		strbuf_addf(&buf, "refs/heads/%03" PRIuMAX, (uintmax_t)i);
+		ref.refname = buf.buf,
+
+		err = reftable_writer_add_ref(writer, &ref);
+		EXPECT_ERR(err);
+	}
+	reftable_writer_close(writer);
+
+	/*
+	 * The written refs should be sufficiently large to result in a
+	 * multi-level index.
+	 */
+	stats = reftable_writer_stats(writer);
+	EXPECT(stats->ref_stats.max_index_level == 2);
+
+	block_source_from_strbuf(&source, &writer_buf);
+	err = reftable_new_reader(&reader, &source, "filename");
+	EXPECT_ERR(err);
+
+	/*
+	 * Seeking the last ref should work as expected.
+	 */
+	err = reftable_reader_seek_ref(reader, &it, "refs/heads/199");
+	EXPECT_ERR(err);
+
+	reftable_iterator_destroy(&it);
+	reftable_writer_free(writer);
+	reftable_reader_free(reader);
+	strbuf_release(&writer_buf);
+	strbuf_release(&buf);
+}
+
 static void test_corrupt_table_empty(void)
 {
 	struct strbuf buf = STRBUF_INIT;
@@ -916,5 +971,6 @@ int readwrite_test_main(int argc, const char *argv[])
 	RUN_TEST(test_write_object_id_length);
 	RUN_TEST(test_write_object_id_min_length);
 	RUN_TEST(test_write_multiple_indices);
+	RUN_TEST(test_write_multi_level_index);
 	return 0;
 }
diff --git a/reftable/writer.c b/reftable/writer.c
index b5bcce0292..0349094d29 100644
--- a/reftable/writer.c
+++ b/reftable/writer.c
@@ -418,15 +418,15 @@ static int writer_finish_section(struct reftable_writer *w)
 				return err;
 		}

+		err = writer_flush_block(w);
+		if (err < 0)
+			return err;
+
 		for (i = 0; i < idx_len; i++)
 			strbuf_release(&idx[i].last_key);
 		reftable_free(idx);
 	}

-	err = writer_flush_block(w);
-	if (err < 0)
-		return err;
-
 	writer_clear_index(w);

 	bstats = writer_reftable_block_stats(w, typ);
-- 
2.43.GIT

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v2 3/5] reftable/writer: simplify writing index records
From: Patrick Steinhardt @ 2024-02-01  7:52 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Toon Claes, Kristoffer Haugsbakk
In-Reply-To: <cover.1706773842.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 1428 bytes --]

When finishing the current section some index records might be written
for the section to the table. The logic that adds these records to the
writer duplicates what we already have in `writer_add_record()`, making
this more complicated than it really has to be.

Simplify the code by using `writer_add_record()` instead. While at it,
drop the unneeded braces around a loop to make the code conform to our
code style better.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/writer.c | 18 +++---------------
 1 file changed, 3 insertions(+), 15 deletions(-)

diff --git a/reftable/writer.c b/reftable/writer.c
index 5a0b87b406..b5bcce0292 100644
--- a/reftable/writer.c
+++ b/reftable/writer.c
@@ -412,26 +412,14 @@ static int writer_finish_section(struct reftable_writer *w)
 					.idx = idx[i],
 				},
 			};
-			if (block_writer_add(w->block_writer, &rec) == 0) {
-				continue;
-			}
 
-			err = writer_flush_block(w);
+			err = writer_add_record(w, &rec);
 			if (err < 0)
 				return err;
-
-			writer_reinit_block_writer(w, BLOCK_TYPE_INDEX);
-
-			err = block_writer_add(w->block_writer, &rec);
-			if (err != 0) {
-				/* write into fresh block should always succeed
-				 */
-				abort();
-			}
 		}
-		for (i = 0; i < idx_len; i++) {
+
+		for (i = 0; i < idx_len; i++)
 			strbuf_release(&idx[i].last_key);
-		}
 		reftable_free(idx);
 	}
 
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v2 2/5] reftable/writer: use correct type to iterate through index entries
From: Patrick Steinhardt @ 2024-02-01  7:52 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Toon Claes, Kristoffer Haugsbakk
In-Reply-To: <cover.1706773842.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 1718 bytes --]

The reftable writer is tracking the number of blocks it has to index via
the `index_len` variable. But while this variable is of type `size_t`,
some sites use an `int` to loop through the index entries.

Convert the code to consistently use `size_t`.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/writer.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/reftable/writer.c b/reftable/writer.c
index 92935baa70..5a0b87b406 100644
--- a/reftable/writer.c
+++ b/reftable/writer.c
@@ -379,20 +379,21 @@ int reftable_writer_add_logs(struct reftable_writer *w,
 
 static int writer_finish_section(struct reftable_writer *w)
 {
+	struct reftable_block_stats *bstats = NULL;
 	uint8_t typ = block_writer_type(w->block_writer);
 	uint64_t index_start = 0;
 	int max_level = 0;
-	int threshold = w->opts.unpadded ? 1 : 3;
+	size_t threshold = w->opts.unpadded ? 1 : 3;
 	int before_blocks = w->stats.idx_stats.blocks;
-	int err = writer_flush_block(w);
-	int i = 0;
-	struct reftable_block_stats *bstats = NULL;
+	int err;
+
+	err = writer_flush_block(w);
 	if (err < 0)
 		return err;
 
 	while (w->index_len > threshold) {
 		struct reftable_index_record *idx = NULL;
-		int idx_len = 0;
+		size_t i, idx_len;
 
 		max_level++;
 		index_start = w->next;
@@ -630,11 +631,8 @@ int reftable_writer_close(struct reftable_writer *w)
 
 static void writer_clear_index(struct reftable_writer *w)
 {
-	int i = 0;
-	for (i = 0; i < w->index_len; i++) {
+	for (size_t i = 0; i < w->index_len; i++)
 		strbuf_release(&w->index[i].last_key);
-	}
-
 	FREE_AND_NULL(w->index);
 	w->index_len = 0;
 	w->index_cap = 0;
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v2 1/5] reftable/reader: be more careful about errors in indexed seeks
From: Patrick Steinhardt @ 2024-02-01  7:51 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Toon Claes, Kristoffer Haugsbakk
In-Reply-To: <cover.1706773842.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 1003 bytes --]

When doing an indexed seek we first need to do a linear seek in order to
find the index block for our wanted key. We do not check the returned
error of the linear seek though. This is likely not an issue because the
next call to `table_iter_next()` would return error, too. But it very
much is a code smell when an error variable is being assigned to without
actually checking it.

Safeguard the code by checking for errors.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/reader.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/reftable/reader.c b/reftable/reader.c
index 64dc366fb1..278f727a3d 100644
--- a/reftable/reader.c
+++ b/reftable/reader.c
@@ -509,6 +509,9 @@ static int reader_seek_indexed(struct reftable_reader *r,
 		goto done;
 
 	err = reader_seek_linear(&index_iter, &want_index);
+	if (err < 0)
+		goto done;
+
 	while (1) {
 		err = table_iter_next(&index_iter, &index_result);
 		table_iter_block_done(&index_iter);
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v2 0/5] reftable: fix writing multi-level indices
From: Patrick Steinhardt @ 2024-02-01  7:51 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Toon Claes, Kristoffer Haugsbakk
In-Reply-To: <cover.1706263918.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 2874 bytes --]

Hi,

this is the second version of my patch series that fixes writing of
multi-level indices. There are two minor changes compared to v1:

  - Slightly rephrased a commit message.

  - Dropped an added newline that resulted in a new hunk.

The patch series continues to build on top of jc/reftable-core-fsync.

Thanks!

Patrick

Patrick Steinhardt (5):
  reftable/reader: be more careful about errors in indexed seeks
  reftable/writer: use correct type to iterate through index entries
  reftable/writer: simplify writing index records
  reftable/writer: fix writing multi-level indices
  reftable: document reading and writing indices

 reftable/reader.c         | 30 +++++++++++++++++++
 reftable/readwrite_test.c | 56 ++++++++++++++++++++++++++++++++++
 reftable/writer.c         | 63 ++++++++++++++++++++++-----------------
 3 files changed, 122 insertions(+), 27 deletions(-)

Range-diff against v1:
1:  ecf834a299 = 1:  ecf834a299 reftable/reader: be more careful about errors in indexed seeks
2:  88541d03be = 2:  88541d03be reftable/writer: use correct type to iterate through index entries
3:  b0982baacf ! 3:  b3de0b7f3b reftable/writer: simplify writing index records
    @@ Metadata
      ## Commit message ##
         reftable/writer: simplify writing index records
     
    -    When finishing the current section we may end up writing index records
    -    for the section to the table. The logic to do so essentially copies what
    -    we already have in `writer_add_record()`, making this more complicated
    -    than it really has to be.
    +    When finishing the current section some index records might be written
    +    for the section to the table. The logic that adds these records to the
    +    writer duplicates what we already have in `writer_add_record()`, making
    +    this more complicated than it really has to be.
     
    -    Simplify the code by using `writer_add_record()` instead.
    +    Simplify the code by using `writer_add_record()` instead. While at it,
    +    drop the unneeded braces around a loop to make the code conform to our
    +    code style better.
     
         Signed-off-by: Patrick Steinhardt <ps@pks.im>
     
      ## reftable/writer.c ##
    -@@ reftable/writer.c: static int writer_finish_section(struct reftable_writer *w)
    - 		w->index = NULL;
    - 		w->index_len = 0;
    - 		w->index_cap = 0;
    -+
    - 		for (i = 0; i < idx_len; i++) {
    - 			struct reftable_record rec = {
    - 				.type = BLOCK_TYPE_INDEX,
     @@ reftable/writer.c: static int writer_finish_section(struct reftable_writer *w)
      					.idx = idx[i],
      				},
4:  9c6622c409 = 4:  89a88cf83e reftable/writer: fix writing multi-level indices
5:  7850e65878 = 5:  c3492bbd42 reftable: document reading and writing indices
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* [PATCH v2 9/9] reftable/record: improve semantics when initializing records
From: Patrick Steinhardt @ 2024-02-01  7:33 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano
In-Reply-To: <cover.1706772591.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 7146 bytes --]

According to our usual coding style, the `reftable_new_record()`
function would indicate that it is allocating a new record. This is not
the case though as the function merely initializes records without
allocating any memory.

Replace `reftable_new_record()` with a new `reftable_record_init()`
function that takes a record pointer as input and initializes it
accordingly.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/block.c       | 18 +++++++++---------
 reftable/merged.c      |  8 +++++---
 reftable/reader.c      |  4 ++--
 reftable/record.c      | 43 ++++++++++--------------------------------
 reftable/record.h      | 10 +++++-----
 reftable/record_test.c |  4 ++--
 6 files changed, 33 insertions(+), 54 deletions(-)

diff --git a/reftable/block.c b/reftable/block.c
index 6952d0facf..9ad220747e 100644
--- a/reftable/block.c
+++ b/reftable/block.c
@@ -380,23 +380,23 @@ int block_reader_seek(struct block_reader *br, struct block_iter *it,
 		.key = *want,
 		.r = br,
 	};
-	struct reftable_record rec = reftable_new_record(block_reader_type(br));
-	int err = 0;
 	struct block_iter next = BLOCK_ITER_INIT;
+	struct reftable_record rec;
+	int err = 0, i;
 
-	int i = binsearch(br->restart_count, &restart_key_less, &args);
 	if (args.error) {
 		err = REFTABLE_FORMAT_ERROR;
 		goto done;
 	}
 
-	it->br = br;
-	if (i > 0) {
-		i--;
-		it->next_off = block_reader_restart_offset(br, i);
-	} else {
+	i = binsearch(br->restart_count, &restart_key_less, &args);
+	if (i > 0)
+		it->next_off = block_reader_restart_offset(br, i - 1);
+	else
 		it->next_off = br->header_off + 4;
-	}
+	it->br = br;
+
+	reftable_record_init(&rec, block_reader_type(br));
 
 	/* We're looking for the last entry less/equal than the wanted key, so
 	   we have to go one entry too far and then back up.
diff --git a/reftable/merged.c b/reftable/merged.c
index 0e60e2a39b..a0f222e07b 100644
--- a/reftable/merged.c
+++ b/reftable/merged.c
@@ -21,11 +21,11 @@ static int merged_iter_init(struct merged_iter *mi)
 {
 	for (size_t i = 0; i < mi->stack_len; i++) {
 		struct pq_entry e = {
-			.rec = reftable_new_record(mi->typ),
 			.index = i,
 		};
 		int err;
 
+		reftable_record_init(&e.rec, mi->typ);
 		err = iterator_next(&mi->stack[i], &e.rec);
 		if (err < 0)
 			return err;
@@ -57,10 +57,12 @@ static int merged_iter_advance_nonnull_subiter(struct merged_iter *mi,
 					       size_t idx)
 {
 	struct pq_entry e = {
-		.rec = reftable_new_record(mi->typ),
 		.index = idx,
 	};
-	int err = iterator_next(&mi->stack[idx], &e.rec);
+	int err;
+
+	reftable_record_init(&e.rec, mi->typ);
+	err = iterator_next(&mi->stack[idx], &e.rec);
 	if (err < 0)
 		return err;
 
diff --git a/reftable/reader.c b/reftable/reader.c
index 3e0de5e8ad..5e6c8f30a1 100644
--- a/reftable/reader.c
+++ b/reftable/reader.c
@@ -444,13 +444,13 @@ static int reader_start(struct reftable_reader *r, struct table_iter *ti,
 static int reader_seek_linear(struct table_iter *ti,
 			      struct reftable_record *want)
 {
-	struct reftable_record rec =
-		reftable_new_record(reftable_record_type(want));
 	struct strbuf want_key = STRBUF_INIT;
 	struct strbuf got_key = STRBUF_INIT;
 	struct table_iter next = TABLE_ITER_INIT;
+	struct reftable_record rec;
 	int err = -1;
 
+	reftable_record_init(&rec, reftable_record_type(want));
 	reftable_record_key(want, &want_key);
 
 	while (1) {
diff --git a/reftable/record.c b/reftable/record.c
index 5c3fbb7b2a..8b84b8157f 100644
--- a/reftable/record.c
+++ b/reftable/record.c
@@ -1257,45 +1257,22 @@ reftable_record_vtable(struct reftable_record *rec)
 	abort();
 }
 
-struct reftable_record reftable_new_record(uint8_t typ)
+void reftable_record_init(struct reftable_record *rec, uint8_t typ)
 {
-	struct reftable_record clean = {
-		.type = typ,
-	};
+	memset(rec, 0, sizeof(*rec));
+	rec->type = typ;
 
-	/* the following is involved, but the naive solution (just return
-	 * `clean` as is, except for BLOCK_TYPE_INDEX), returns a garbage
-	 * clean.u.obj.offsets pointer on Windows VS CI.  Go figure.
-	 */
 	switch (typ) {
-	case BLOCK_TYPE_OBJ:
-	{
-		struct reftable_obj_record obj = { 0 };
-		clean.u.obj = obj;
-		break;
-	}
-	case BLOCK_TYPE_INDEX:
-	{
-		struct reftable_index_record idx = {
-			.last_key = STRBUF_INIT,
-		};
-		clean.u.idx = idx;
-		break;
-	}
 	case BLOCK_TYPE_REF:
-	{
-		struct reftable_ref_record ref = { 0 };
-		clean.u.ref = ref;
-		break;
-	}
 	case BLOCK_TYPE_LOG:
-	{
-		struct reftable_log_record log = { 0 };
-		clean.u.log = log;
-		break;
-	}
+	case BLOCK_TYPE_OBJ:
+		return;
+	case BLOCK_TYPE_INDEX:
+		strbuf_init(&rec->u.idx.last_key, 0);
+		return;
+	default:
+		BUG("unhandled record type");
 	}
-	return clean;
 }
 
 void reftable_record_print(struct reftable_record *rec, int hash_size)
diff --git a/reftable/record.h b/reftable/record.h
index fd80cd451d..e64ed30c80 100644
--- a/reftable/record.h
+++ b/reftable/record.h
@@ -69,9 +69,6 @@ struct reftable_record_vtable {
 /* returns true for recognized block types. Block start with the block type. */
 int reftable_is_block_type(uint8_t typ);
 
-/* return an initialized record for the given type */
-struct reftable_record reftable_new_record(uint8_t typ);
-
 /* Encode `key` into `dest`. Sets `is_restart` to indicate a restart. Returns
  * number of bytes written. */
 int reftable_encode_key(int *is_restart, struct string_view dest,
@@ -100,8 +97,8 @@ struct reftable_obj_record {
 /* record is a generic wrapper for different types of records. It is normally
  * created on the stack, or embedded within another struct. If the type is
  * known, a fresh instance can be initialized explicitly. Otherwise, use
- * reftable_new_record() to initialize generically (as the index_record is not
- * valid as 0-initialized structure)
+ * `reftable_record_init()` to initialize generically (as the index_record is
+ * not valid as 0-initialized structure)
  */
 struct reftable_record {
 	uint8_t type;
@@ -113,6 +110,9 @@ struct reftable_record {
 	} u;
 };
 
+/* Initialize the reftable record for the given type */
+void reftable_record_init(struct reftable_record *rec, uint8_t typ);
+
 /* see struct record_vtable */
 int reftable_record_equal(struct reftable_record *a, struct reftable_record *b, int hash_size);
 void reftable_record_print(struct reftable_record *rec, int hash_size);
diff --git a/reftable/record_test.c b/reftable/record_test.c
index 999366ad46..a86cff5526 100644
--- a/reftable/record_test.c
+++ b/reftable/record_test.c
@@ -16,11 +16,11 @@
 
 static void test_copy(struct reftable_record *rec)
 {
-	struct reftable_record copy = { 0 };
+	struct reftable_record copy;
 	uint8_t typ;
 
 	typ = reftable_record_type(rec);
-	copy = reftable_new_record(typ);
+	reftable_record_init(&copy, typ);
 	reftable_record_copy_from(&copy, rec, GIT_SHA1_RAWSZ);
 	/* do it twice to catch memory leaks */
 	reftable_record_copy_from(&copy, rec, GIT_SHA1_RAWSZ);
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v2 8/9] reftable/merged: refactor initialization of iterators
From: Patrick Steinhardt @ 2024-02-01  7:33 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano
In-Reply-To: <cover.1706772591.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 1412 bytes --]

Refactor the initialization of the merged iterator to fit our code style
better. This refactoring prepares the code for a refactoring of how
records are being initialized.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/merged.c | 27 +++++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/reftable/merged.c b/reftable/merged.c
index 0abcda26e8..0e60e2a39b 100644
--- a/reftable/merged.c
+++ b/reftable/merged.c
@@ -19,24 +19,23 @@ license that can be found in the LICENSE file or at
 
 static int merged_iter_init(struct merged_iter *mi)
 {
-	int i = 0;
-	for (i = 0; i < mi->stack_len; i++) {
-		struct reftable_record rec = reftable_new_record(mi->typ);
-		int err = iterator_next(&mi->stack[i], &rec);
-		if (err < 0) {
+	for (size_t i = 0; i < mi->stack_len; i++) {
+		struct pq_entry e = {
+			.rec = reftable_new_record(mi->typ),
+			.index = i,
+		};
+		int err;
+
+		err = iterator_next(&mi->stack[i], &e.rec);
+		if (err < 0)
 			return err;
-		}
-
 		if (err > 0) {
 			reftable_iterator_destroy(&mi->stack[i]);
-			reftable_record_release(&rec);
-		} else {
-			struct pq_entry e = {
-				.rec = rec,
-				.index = i,
-			};
-			merged_iter_pqueue_add(&mi->pq, &e);
+			reftable_record_release(&e.rec);
+			continue;
 		}
+
+		merged_iter_pqueue_add(&mi->pq, &e);
 	}
 
 	return 0;
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v2 7/9] reftable/merged: refactor seeking of records
From: Patrick Steinhardt @ 2024-02-01  7:33 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano
In-Reply-To: <cover.1706772591.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 2640 bytes --]

The code to seek reftable records in the merged table code is quite hard
to read and does not conform to our coding style in multiple ways:

  - We have multiple exit paths where we release resources even though
    that is not really necessary.

  - We use a scoped error variable `e` which is hard to reason about.
    This variable is not required at all.

  - We allocate memory in the variable declarations, which is easy to
    miss.

Refactor the function so that it becomes more maintainable in the
future.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/merged.c | 54 ++++++++++++++++++-----------------------------
 1 file changed, 21 insertions(+), 33 deletions(-)

diff --git a/reftable/merged.c b/reftable/merged.c
index e2c6253324..0abcda26e8 100644
--- a/reftable/merged.c
+++ b/reftable/merged.c
@@ -238,50 +238,38 @@ static int merged_table_seek_record(struct reftable_merged_table *mt,
 				    struct reftable_iterator *it,
 				    struct reftable_record *rec)
 {
-	struct reftable_iterator *iters = reftable_calloc(
-		mt->stack_len, sizeof(*iters));
 	struct merged_iter merged = {
-		.stack = iters,
 		.typ = reftable_record_type(rec),
 		.hash_id = mt->hash_id,
 		.suppress_deletions = mt->suppress_deletions,
 		.key = STRBUF_INIT,
 		.entry_key = STRBUF_INIT,
 	};
-	int n = 0;
-	int err = 0;
-	int i = 0;
-	for (i = 0; i < mt->stack_len && err == 0; i++) {
-		int e = reftable_table_seek_record(&mt->stack[i], &iters[n],
-						   rec);
-		if (e < 0) {
-			err = e;
-		}
-		if (e == 0) {
-			n++;
-		}
-	}
-	if (err < 0) {
-		int i = 0;
-		for (i = 0; i < n; i++) {
-			reftable_iterator_destroy(&iters[i]);
-		}
-		reftable_free(iters);
-		return err;
+	struct merged_iter *p;
+	int err;
+
+	REFTABLE_CALLOC_ARRAY(merged.stack, mt->stack_len);
+	for (size_t i = 0; i < mt->stack_len; i++) {
+		err = reftable_table_seek_record(&mt->stack[i],
+						 &merged.stack[merged.stack_len], rec);
+		if (err < 0)
+			goto out;
+		if (!err)
+			merged.stack_len++;
 	}
 
-	merged.stack_len = n;
 	err = merged_iter_init(&merged);
-	if (err < 0) {
+	if (err < 0)
+		goto out;
+
+	p = reftable_malloc(sizeof(struct merged_iter));
+	*p = merged;
+	iterator_from_merged_iter(it, p);
+
+out:
+	if (err < 0)
 		merged_iter_close(&merged);
-		return err;
-	} else {
-		struct merged_iter *p =
-			reftable_malloc(sizeof(struct merged_iter));
-		*p = merged;
-		iterator_from_merged_iter(it, p);
-	}
-	return 0;
+	return err;
 }
 
 int reftable_merged_table_seek_ref(struct reftable_merged_table *mt,
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v2 6/9] reftable/stack: use `size_t` to track stack length
From: Patrick Steinhardt @ 2024-02-01  7:33 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano
In-Reply-To: <cover.1706772591.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 6804 bytes --]

While the stack length is already stored as `size_t`, we frequently use
`int`s to refer to those stacks throughout the reftable library. Convert
those cases to use `size_t` instead to make things consistent.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/basics.c          |  7 +++----
 reftable/basics.h          |  2 +-
 reftable/merged.c          | 11 +++++------
 reftable/merged_test.c     | 14 ++++++--------
 reftable/reftable-merged.h |  2 +-
 reftable/stack.c           | 21 ++++++++++-----------
 6 files changed, 26 insertions(+), 31 deletions(-)

diff --git a/reftable/basics.c b/reftable/basics.c
index af9004cec2..0785aff941 100644
--- a/reftable/basics.c
+++ b/reftable/basics.c
@@ -64,12 +64,11 @@ void free_names(char **a)
 	reftable_free(a);
 }
 
-int names_length(char **names)
+size_t names_length(char **names)
 {
 	char **p = names;
-	for (; *p; p++) {
-		/* empty */
-	}
+	while (*p)
+		p++;
 	return p - names;
 }
 
diff --git a/reftable/basics.h b/reftable/basics.h
index 4c3ac963a3..91f3533efe 100644
--- a/reftable/basics.h
+++ b/reftable/basics.h
@@ -44,7 +44,7 @@ void parse_names(char *buf, int size, char ***namesp);
 int names_equal(char **a, char **b);
 
 /* returns the array size of a NULL-terminated array of strings. */
-int names_length(char **names);
+size_t names_length(char **names);
 
 /* Allocation routines; they invoke the functions set through
  * reftable_set_alloc() */
diff --git a/reftable/merged.c b/reftable/merged.c
index 2031fd51b4..e2c6253324 100644
--- a/reftable/merged.c
+++ b/reftable/merged.c
@@ -45,11 +45,10 @@ static int merged_iter_init(struct merged_iter *mi)
 static void merged_iter_close(void *p)
 {
 	struct merged_iter *mi = p;
-	int i = 0;
+
 	merged_iter_pqueue_release(&mi->pq);
-	for (i = 0; i < mi->stack_len; i++) {
+	for (size_t i = 0; i < mi->stack_len; i++)
 		reftable_iterator_destroy(&mi->stack[i]);
-	}
 	reftable_free(mi->stack);
 	strbuf_release(&mi->key);
 	strbuf_release(&mi->entry_key);
@@ -168,14 +167,14 @@ static void iterator_from_merged_iter(struct reftable_iterator *it,
 }
 
 int reftable_new_merged_table(struct reftable_merged_table **dest,
-			      struct reftable_table *stack, int n,
+			      struct reftable_table *stack, size_t n,
 			      uint32_t hash_id)
 {
 	struct reftable_merged_table *m = NULL;
 	uint64_t last_max = 0;
 	uint64_t first_min = 0;
-	int i = 0;
-	for (i = 0; i < n; i++) {
+
+	for (size_t i = 0; i < n; i++) {
 		uint64_t min = reftable_table_min_update_index(&stack[i]);
 		uint64_t max = reftable_table_max_update_index(&stack[i]);
 
diff --git a/reftable/merged_test.c b/reftable/merged_test.c
index e233a9d581..442917cc83 100644
--- a/reftable/merged_test.c
+++ b/reftable/merged_test.c
@@ -88,18 +88,17 @@ static struct reftable_merged_table *
 merged_table_from_records(struct reftable_ref_record **refs,
 			  struct reftable_block_source **source,
 			  struct reftable_reader ***readers, int *sizes,
-			  struct strbuf *buf, int n)
+			  struct strbuf *buf, size_t n)
 {
-	int i = 0;
 	struct reftable_merged_table *mt = NULL;
-	int err;
 	struct reftable_table *tabs;
+	int err;
 
 	REFTABLE_CALLOC_ARRAY(tabs, n);
 	REFTABLE_CALLOC_ARRAY(*readers, n);
 	REFTABLE_CALLOC_ARRAY(*source, n);
 
-	for (i = 0; i < n; i++) {
+	for (size_t i = 0; i < n; i++) {
 		write_test_table(&buf[i], refs[i], sizes[i]);
 		block_source_from_strbuf(&(*source)[i], &buf[i]);
 
@@ -263,18 +262,17 @@ static struct reftable_merged_table *
 merged_table_from_log_records(struct reftable_log_record **logs,
 			      struct reftable_block_source **source,
 			      struct reftable_reader ***readers, int *sizes,
-			      struct strbuf *buf, int n)
+			      struct strbuf *buf, size_t n)
 {
-	int i = 0;
 	struct reftable_merged_table *mt = NULL;
-	int err;
 	struct reftable_table *tabs;
+	int err;
 
 	REFTABLE_CALLOC_ARRAY(tabs, n);
 	REFTABLE_CALLOC_ARRAY(*readers, n);
 	REFTABLE_CALLOC_ARRAY(*source, n);
 
-	for (i = 0; i < n; i++) {
+	for (size_t i = 0; i < n; i++) {
 		write_test_log_table(&buf[i], logs[i], sizes[i], i + 1);
 		block_source_from_strbuf(&(*source)[i], &buf[i]);
 
diff --git a/reftable/reftable-merged.h b/reftable/reftable-merged.h
index 1a6d16915a..c91a2d83a2 100644
--- a/reftable/reftable-merged.h
+++ b/reftable/reftable-merged.h
@@ -33,7 +33,7 @@ struct reftable_table;
    the stack array.
 */
 int reftable_new_merged_table(struct reftable_merged_table **dest,
-			      struct reftable_table *stack, int n,
+			      struct reftable_table *stack, size_t n,
 			      uint32_t hash_id);
 
 /* returns an iterator positioned just before 'name' */
diff --git a/reftable/stack.c b/reftable/stack.c
index c1f8cf1cef..079ba7fde8 100644
--- a/reftable/stack.c
+++ b/reftable/stack.c
@@ -202,18 +202,18 @@ static struct reftable_reader **stack_copy_readers(struct reftable_stack *st,
 static int reftable_stack_reload_once(struct reftable_stack *st, char **names,
 				      int reuse_open)
 {
-	int cur_len = !st->merged ? 0 : st->merged->stack_len;
+	size_t cur_len = !st->merged ? 0 : st->merged->stack_len;
 	struct reftable_reader **cur = stack_copy_readers(st, cur_len);
-	int err = 0;
-	int names_len = names_length(names);
+	size_t names_len = names_length(names);
 	struct reftable_reader **new_readers =
 		reftable_calloc(names_len, sizeof(*new_readers));
 	struct reftable_table *new_tables =
 		reftable_calloc(names_len, sizeof(*new_tables));
-	int new_readers_len = 0;
+	size_t new_readers_len = 0;
 	struct reftable_merged_table *new_merged = NULL;
 	struct strbuf table_path = STRBUF_INIT;
-	int i;
+	int err = 0;
+	size_t i;
 
 	while (*names) {
 		struct reftable_reader *rd = NULL;
@@ -221,11 +221,10 @@ static int reftable_stack_reload_once(struct reftable_stack *st, char **names,
 
 		/* this is linear; we assume compaction keeps the number of
 		   tables under control so this is not quadratic. */
-		int j = 0;
-		for (j = 0; reuse_open && j < cur_len; j++) {
-			if (cur[j] && 0 == strcmp(cur[j]->name, name)) {
-				rd = cur[j];
-				cur[j] = NULL;
+		for (i = 0; reuse_open && i < cur_len; i++) {
+			if (cur[i] && 0 == strcmp(cur[i]->name, name)) {
+				rd = cur[i];
+				cur[i] = NULL;
 				break;
 			}
 		}
@@ -870,7 +869,7 @@ static int stack_write_compact(struct reftable_stack *st,
 			       size_t first, size_t last,
 			       struct reftable_log_expiry_config *config)
 {
-	int subtabs_len = last - first + 1;
+	size_t subtabs_len = last - first + 1;
 	struct reftable_table *subtabs = reftable_calloc(
 		last - first + 1, sizeof(*subtabs));
 	struct reftable_merged_table *mt = NULL;
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v2 5/9] reftable/stack: use `size_t` to track stack slices during compaction
From: Patrick Steinhardt @ 2024-02-01  7:33 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano
In-Reply-To: <cover.1706772591.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 4072 bytes --]

We use `int`s to track reftable slices when compacting the reftable
stack, which is considered to be a code smell in the Git project.
Convert the code to use `size_t` instead.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/stack.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/reftable/stack.c b/reftable/stack.c
index 2be3d1e4ba..c1f8cf1cef 100644
--- a/reftable/stack.c
+++ b/reftable/stack.c
@@ -24,7 +24,8 @@ static int stack_try_add(struct reftable_stack *st,
 					    void *arg),
 			 void *arg);
 static int stack_write_compact(struct reftable_stack *st,
-			       struct reftable_writer *wr, int first, int last,
+			       struct reftable_writer *wr,
+			       size_t first, size_t last,
 			       struct reftable_log_expiry_config *config);
 static int stack_check_addition(struct reftable_stack *st,
 				const char *new_tab_name);
@@ -820,7 +821,8 @@ uint64_t reftable_stack_next_update_index(struct reftable_stack *st)
 	return 1;
 }
 
-static int stack_compact_locked(struct reftable_stack *st, int first, int last,
+static int stack_compact_locked(struct reftable_stack *st,
+				size_t first, size_t last,
 				struct strbuf *temp_tab,
 				struct reftable_log_expiry_config *config)
 {
@@ -864,22 +866,21 @@ static int stack_compact_locked(struct reftable_stack *st, int first, int last,
 }
 
 static int stack_write_compact(struct reftable_stack *st,
-			       struct reftable_writer *wr, int first, int last,
+			       struct reftable_writer *wr,
+			       size_t first, size_t last,
 			       struct reftable_log_expiry_config *config)
 {
 	int subtabs_len = last - first + 1;
 	struct reftable_table *subtabs = reftable_calloc(
 		last - first + 1, sizeof(*subtabs));
 	struct reftable_merged_table *mt = NULL;
-	int err = 0;
 	struct reftable_iterator it = { NULL };
 	struct reftable_ref_record ref = { NULL };
 	struct reftable_log_record log = { NULL };
-
 	uint64_t entries = 0;
+	int err = 0;
 
-	int i = 0, j = 0;
-	for (i = first, j = 0; i <= last; i++) {
+	for (size_t i = first, j = 0; i <= last; i++) {
 		struct reftable_reader *t = st->readers[i];
 		reftable_table_from_reader(&subtabs[j++], t);
 		st->stats.bytes += t->size;
@@ -963,7 +964,8 @@ static int stack_write_compact(struct reftable_stack *st,
 }
 
 /* <  0: error. 0 == OK, > 0 attempt failed; could retry. */
-static int stack_compact_range(struct reftable_stack *st, int first, int last,
+static int stack_compact_range(struct reftable_stack *st,
+			       size_t first, size_t last,
 			       struct reftable_log_expiry_config *expiry)
 {
 	char **delete_on_success = NULL, **subtable_locks = NULL, **listp = NULL;
@@ -972,12 +974,10 @@ static int stack_compact_range(struct reftable_stack *st, int first, int last,
 	struct strbuf lock_file_name = STRBUF_INIT;
 	struct strbuf ref_list_contents = STRBUF_INIT;
 	struct strbuf new_table_path = STRBUF_INIT;
+	size_t i, j, compact_count;
 	int err = 0;
 	int have_lock = 0;
 	int lock_file_fd = -1;
-	int compact_count;
-	int i = 0;
-	int j = 0;
 	int is_empty_table = 0;
 
 	if (first > last || (!expiry && first == last)) {
@@ -1172,17 +1172,17 @@ static int stack_compact_range(struct reftable_stack *st, int first, int last,
 int reftable_stack_compact_all(struct reftable_stack *st,
 			       struct reftable_log_expiry_config *config)
 {
-	return stack_compact_range(st, 0, st->merged->stack_len - 1, config);
+	return stack_compact_range(st, 0, st->merged->stack_len ?
+			st->merged->stack_len - 1 : 0, config);
 }
 
-static int stack_compact_range_stats(struct reftable_stack *st, int first,
-				     int last,
+static int stack_compact_range_stats(struct reftable_stack *st,
+				     size_t first, size_t last,
 				     struct reftable_log_expiry_config *config)
 {
 	int err = stack_compact_range(st, first, last, config);
-	if (err > 0) {
+	if (err > 0)
 		st->stats.failures++;
-	}
 	return err;
 }
 
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v2 4/9] reftable/stack: index segments with `size_t`
From: Patrick Steinhardt @ 2024-02-01  7:33 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano
In-Reply-To: <cover.1706772591.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 3751 bytes --]

We use `int`s to index into arrays of segments and track the length of
them, which is considered to be a code smell in the Git project. Convert
the code to use `size_t` instead.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/stack.c      | 25 +++++++++++--------------
 reftable/stack.h      |  6 +++---
 reftable/stack_test.c |  7 +++----
 3 files changed, 17 insertions(+), 21 deletions(-)

diff --git a/reftable/stack.c b/reftable/stack.c
index b6b24c90bf..2be3d1e4ba 100644
--- a/reftable/stack.c
+++ b/reftable/stack.c
@@ -1202,12 +1202,11 @@ int fastlog2(uint64_t sz)
 	return l - 1;
 }
 
-struct segment *sizes_to_segments(int *seglen, uint64_t *sizes, int n)
+struct segment *sizes_to_segments(size_t *seglen, uint64_t *sizes, size_t n)
 {
 	struct segment *segs = reftable_calloc(n, sizeof(*segs));
-	int next = 0;
 	struct segment cur = { 0 };
-	int i = 0;
+	size_t next = 0, i;
 
 	if (n == 0) {
 		*seglen = 0;
@@ -1233,29 +1232,27 @@ struct segment *sizes_to_segments(int *seglen, uint64_t *sizes, int n)
 	return segs;
 }
 
-struct segment suggest_compaction_segment(uint64_t *sizes, int n)
+struct segment suggest_compaction_segment(uint64_t *sizes, size_t n)
 {
-	int seglen = 0;
-	struct segment *segs = sizes_to_segments(&seglen, sizes, n);
 	struct segment min_seg = {
 		.log = 64,
 	};
-	int i = 0;
+	struct segment *segs;
+	size_t seglen = 0, i;
+
+	segs = sizes_to_segments(&seglen, sizes, n);
 	for (i = 0; i < seglen; i++) {
-		if (segment_size(&segs[i]) == 1) {
+		if (segment_size(&segs[i]) == 1)
 			continue;
-		}
 
-		if (segs[i].log < min_seg.log) {
+		if (segs[i].log < min_seg.log)
 			min_seg = segs[i];
-		}
 	}
 
 	while (min_seg.start > 0) {
-		int prev = min_seg.start - 1;
-		if (fastlog2(min_seg.bytes) < fastlog2(sizes[prev])) {
+		size_t prev = min_seg.start - 1;
+		if (fastlog2(min_seg.bytes) < fastlog2(sizes[prev]))
 			break;
-		}
 
 		min_seg.start = prev;
 		min_seg.bytes += sizes[prev];
diff --git a/reftable/stack.h b/reftable/stack.h
index c1e3efa899..d919455669 100644
--- a/reftable/stack.h
+++ b/reftable/stack.h
@@ -32,13 +32,13 @@ struct reftable_stack {
 int read_lines(const char *filename, char ***lines);
 
 struct segment {
-	int start, end;
+	size_t start, end;
 	int log;
 	uint64_t bytes;
 };
 
 int fastlog2(uint64_t sz);
-struct segment *sizes_to_segments(int *seglen, uint64_t *sizes, int n);
-struct segment suggest_compaction_segment(uint64_t *sizes, int n);
+struct segment *sizes_to_segments(size_t *seglen, uint64_t *sizes, size_t n);
+struct segment suggest_compaction_segment(uint64_t *sizes, size_t n);
 
 #endif
diff --git a/reftable/stack_test.c b/reftable/stack_test.c
index 289e902146..2d5b24e5c5 100644
--- a/reftable/stack_test.c
+++ b/reftable/stack_test.c
@@ -711,7 +711,7 @@ static void test_sizes_to_segments(void)
 	uint64_t sizes[] = { 2, 3, 4, 5, 7, 9 };
 	/* .................0  1  2  3  4  5 */
 
-	int seglen = 0;
+	size_t seglen = 0;
 	struct segment *segs =
 		sizes_to_segments(&seglen, sizes, ARRAY_SIZE(sizes));
 	EXPECT(segs[2].log == 3);
@@ -726,7 +726,7 @@ static void test_sizes_to_segments(void)
 
 static void test_sizes_to_segments_empty(void)
 {
-	int seglen = 0;
+	size_t seglen = 0;
 	struct segment *segs = sizes_to_segments(&seglen, NULL, 0);
 	EXPECT(seglen == 0);
 	reftable_free(segs);
@@ -735,8 +735,7 @@ static void test_sizes_to_segments_empty(void)
 static void test_sizes_to_segments_all_equal(void)
 {
 	uint64_t sizes[] = { 5, 5 };
-
-	int seglen = 0;
+	size_t seglen = 0;
 	struct segment *segs =
 		sizes_to_segments(&seglen, sizes, ARRAY_SIZE(sizes));
 	EXPECT(seglen == 1);
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v2 3/9] reftable/stack: fix parameter validation when compacting range
From: Patrick Steinhardt @ 2024-02-01  7:33 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano
In-Reply-To: <cover.1706772591.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 2892 bytes --]

The `stack_compact_range()` function receives a "first" and "last" index
that indicates which tables of the reftable stack should be compacted.
Naturally, "first" must be smaller than "last" in order to identify a
proper range of tables to compress, which we indeed also assert in the
function. But the validations happens after we have already allocated
arrays with a size of `last - first + 1`, leading to an underflow and
thus an invalid allocation size.

Fix this by reordering the array allocations to happen after we have
validated parameters. While at it, convert the array allocations to use
the newly introduced macros.

Note that the relevant variables pointing into arrays should also be
converted to use `size_t` instead of `int`. This is left for a later
commit in this series.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/stack.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/reftable/stack.c b/reftable/stack.c
index d084823a92..b6b24c90bf 100644
--- a/reftable/stack.c
+++ b/reftable/stack.c
@@ -966,6 +966,7 @@ static int stack_write_compact(struct reftable_stack *st,
 static int stack_compact_range(struct reftable_stack *st, int first, int last,
 			       struct reftable_log_expiry_config *expiry)
 {
+	char **delete_on_success = NULL, **subtable_locks = NULL, **listp = NULL;
 	struct strbuf temp_tab_file_name = STRBUF_INIT;
 	struct strbuf new_table_name = STRBUF_INIT;
 	struct strbuf lock_file_name = STRBUF_INIT;
@@ -974,12 +975,7 @@ static int stack_compact_range(struct reftable_stack *st, int first, int last,
 	int err = 0;
 	int have_lock = 0;
 	int lock_file_fd = -1;
-	int compact_count = last - first + 1;
-	char **listp = NULL;
-	char **delete_on_success =
-		reftable_calloc(compact_count + 1, sizeof(*delete_on_success));
-	char **subtable_locks =
-		reftable_calloc(compact_count + 1, sizeof(*subtable_locks));
+	int compact_count;
 	int i = 0;
 	int j = 0;
 	int is_empty_table = 0;
@@ -989,6 +985,10 @@ static int stack_compact_range(struct reftable_stack *st, int first, int last,
 		goto done;
 	}
 
+	compact_count = last - first + 1;
+	REFTABLE_CALLOC_ARRAY(delete_on_success, compact_count + 1);
+	REFTABLE_CALLOC_ARRAY(subtable_locks, compact_count + 1);
+
 	st->stats.attempts++;
 
 	strbuf_reset(&lock_file_name);
@@ -1146,12 +1146,14 @@ static int stack_compact_range(struct reftable_stack *st, int first, int last,
 done:
 	free_names(delete_on_success);
 
-	listp = subtable_locks;
-	while (*listp) {
-		unlink(*listp);
-		listp++;
+	if (subtable_locks) {
+		listp = subtable_locks;
+		while (*listp) {
+			unlink(*listp);
+			listp++;
+		}
+		free_names(subtable_locks);
 	}
-	free_names(subtable_locks);
 	if (lock_file_fd >= 0) {
 		close(lock_file_fd);
 		lock_file_fd = -1;
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v2 2/9] reftable: introduce macros to allocate arrays
From: Patrick Steinhardt @ 2024-02-01  7:32 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano
In-Reply-To: <cover.1706772591.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 16607 bytes --]

Similar to the preceding commit, let's carry over macros to allocate
arrays with `REFTABLE_ALLOC_ARRAY()` and `REFTABLE_CALLOC_ARRAY()`. This
requires us to change the signature of `reftable_calloc()`, which only
takes a single argument right now and thus puts the burden on the caller
to calculate the final array's size. This is a net improvement though as
it means that we can now provide proper overflow checks when multiplying
the array size with the member size.

Convert callsites of `reftable_calloc()` to the new signature, using the
new macros where possible.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/basics.h         |  4 +++-
 reftable/block_test.c     |  2 +-
 reftable/blocksource.c    |  4 ++--
 reftable/iter.c           |  3 +--
 reftable/merged.c         |  4 ++--
 reftable/merged_test.c    | 22 +++++++++++++---------
 reftable/publicbasics.c   |  3 ++-
 reftable/reader.c         |  8 +++-----
 reftable/readwrite_test.c |  8 +++++---
 reftable/record_test.c    |  4 ++--
 reftable/refname.c        |  4 ++--
 reftable/stack.c          | 26 ++++++++++++--------------
 reftable/tree.c           |  4 ++--
 reftable/writer.c         |  7 +++----
 14 files changed, 53 insertions(+), 50 deletions(-)

diff --git a/reftable/basics.h b/reftable/basics.h
index 2f855cd724..4c3ac963a3 100644
--- a/reftable/basics.h
+++ b/reftable/basics.h
@@ -51,8 +51,10 @@ int names_length(char **names);
 void *reftable_malloc(size_t sz);
 void *reftable_realloc(void *p, size_t sz);
 void reftable_free(void *p);
-void *reftable_calloc(size_t sz);
+void *reftable_calloc(size_t nelem, size_t elsize);
 
+#define REFTABLE_ALLOC_ARRAY(x, alloc) (x) = reftable_malloc(st_mult(sizeof(*(x)), (alloc)))
+#define REFTABLE_CALLOC_ARRAY(x, alloc) (x) = reftable_calloc((alloc), sizeof(*(x)))
 #define REFTABLE_REALLOC_ARRAY(x, alloc) (x) = reftable_realloc((x), st_mult(sizeof(*(x)), (alloc)))
 #define REFTABLE_ALLOC_GROW(x, nr, alloc) \
 	do { \
diff --git a/reftable/block_test.c b/reftable/block_test.c
index dedb05c7d8..e162c6e33f 100644
--- a/reftable/block_test.c
+++ b/reftable/block_test.c
@@ -36,7 +36,7 @@ static void test_block_read_write(void)
 	int j = 0;
 	struct strbuf want = STRBUF_INIT;
 
-	block.data = reftable_calloc(block_size);
+	REFTABLE_CALLOC_ARRAY(block.data, block_size);
 	block.len = block_size;
 	block.source = malloc_block_source();
 	block_writer_init(&bw, BLOCK_TYPE_REF, block.data, block_size,
diff --git a/reftable/blocksource.c b/reftable/blocksource.c
index 8c41e3c70f..eeed254ba9 100644
--- a/reftable/blocksource.c
+++ b/reftable/blocksource.c
@@ -29,7 +29,7 @@ static int strbuf_read_block(void *v, struct reftable_block *dest, uint64_t off,
 {
 	struct strbuf *b = v;
 	assert(off + size <= b->len);
-	dest->data = reftable_calloc(size);
+	REFTABLE_CALLOC_ARRAY(dest->data, size);
 	memcpy(dest->data, b->buf + off, size);
 	dest->len = size;
 	return size;
@@ -132,7 +132,7 @@ int reftable_block_source_from_file(struct reftable_block_source *bs,
 		return REFTABLE_IO_ERROR;
 	}
 
-	p = reftable_calloc(sizeof(*p));
+	REFTABLE_CALLOC_ARRAY(p, 1);
 	p->size = st.st_size;
 	p->data = xmmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
 	close(fd);
diff --git a/reftable/iter.c b/reftable/iter.c
index a8d174c040..8b5ebf6183 100644
--- a/reftable/iter.c
+++ b/reftable/iter.c
@@ -160,8 +160,7 @@ int new_indexed_table_ref_iter(struct indexed_table_ref_iter **dest,
 			       int oid_len, uint64_t *offsets, int offset_len)
 {
 	struct indexed_table_ref_iter empty = INDEXED_TABLE_REF_ITER_INIT;
-	struct indexed_table_ref_iter *itr =
-		reftable_calloc(sizeof(struct indexed_table_ref_iter));
+	struct indexed_table_ref_iter *itr = reftable_calloc(1, sizeof(*itr));
 	int err = 0;
 
 	*itr = empty;
diff --git a/reftable/merged.c b/reftable/merged.c
index c258ce953e..2031fd51b4 100644
--- a/reftable/merged.c
+++ b/reftable/merged.c
@@ -190,7 +190,7 @@ int reftable_new_merged_table(struct reftable_merged_table **dest,
 		}
 	}
 
-	m = reftable_calloc(sizeof(struct reftable_merged_table));
+	REFTABLE_CALLOC_ARRAY(m, 1);
 	m->stack = stack;
 	m->stack_len = n;
 	m->min = first_min;
@@ -240,7 +240,7 @@ static int merged_table_seek_record(struct reftable_merged_table *mt,
 				    struct reftable_record *rec)
 {
 	struct reftable_iterator *iters = reftable_calloc(
-		sizeof(struct reftable_iterator) * mt->stack_len);
+		mt->stack_len, sizeof(*iters));
 	struct merged_iter merged = {
 		.stack = iters,
 		.typ = reftable_record_type(rec),
diff --git a/reftable/merged_test.c b/reftable/merged_test.c
index e05351e035..e233a9d581 100644
--- a/reftable/merged_test.c
+++ b/reftable/merged_test.c
@@ -93,10 +93,12 @@ merged_table_from_records(struct reftable_ref_record **refs,
 	int i = 0;
 	struct reftable_merged_table *mt = NULL;
 	int err;
-	struct reftable_table *tabs =
-		reftable_calloc(n * sizeof(struct reftable_table));
-	*readers = reftable_calloc(n * sizeof(struct reftable_reader *));
-	*source = reftable_calloc(n * sizeof(**source));
+	struct reftable_table *tabs;
+
+	REFTABLE_CALLOC_ARRAY(tabs, n);
+	REFTABLE_CALLOC_ARRAY(*readers, n);
+	REFTABLE_CALLOC_ARRAY(*source, n);
+
 	for (i = 0; i < n; i++) {
 		write_test_table(&buf[i], refs[i], sizes[i]);
 		block_source_from_strbuf(&(*source)[i], &buf[i]);
@@ -266,10 +268,12 @@ merged_table_from_log_records(struct reftable_log_record **logs,
 	int i = 0;
 	struct reftable_merged_table *mt = NULL;
 	int err;
-	struct reftable_table *tabs =
-		reftable_calloc(n * sizeof(struct reftable_table));
-	*readers = reftable_calloc(n * sizeof(struct reftable_reader *));
-	*source = reftable_calloc(n * sizeof(**source));
+	struct reftable_table *tabs;
+
+	REFTABLE_CALLOC_ARRAY(tabs, n);
+	REFTABLE_CALLOC_ARRAY(*readers, n);
+	REFTABLE_CALLOC_ARRAY(*source, n);
+
 	for (i = 0; i < n; i++) {
 		write_test_log_table(&buf[i], logs[i], sizes[i], i + 1);
 		block_source_from_strbuf(&(*source)[i], &buf[i]);
@@ -412,7 +416,7 @@ static void test_default_write_opts(void)
 	};
 	int err;
 	struct reftable_block_source source = { NULL };
-	struct reftable_table *tab = reftable_calloc(sizeof(*tab) * 1);
+	struct reftable_table *tab = reftable_calloc(1, sizeof(*tab));
 	uint32_t hash_id;
 	struct reftable_reader *rd = NULL;
 	struct reftable_merged_table *merged = NULL;
diff --git a/reftable/publicbasics.c b/reftable/publicbasics.c
index bcb82530d6..44b84a125e 100644
--- a/reftable/publicbasics.c
+++ b/reftable/publicbasics.c
@@ -37,8 +37,9 @@ void reftable_free(void *p)
 		free(p);
 }
 
-void *reftable_calloc(size_t sz)
+void *reftable_calloc(size_t nelem, size_t elsize)
 {
+	size_t sz = st_mult(nelem, elsize);
 	void *p = reftable_malloc(sz);
 	memset(p, 0, sz);
 	return p;
diff --git a/reftable/reader.c b/reftable/reader.c
index 64dc366fb1..3e0de5e8ad 100644
--- a/reftable/reader.c
+++ b/reftable/reader.c
@@ -539,8 +539,7 @@ static int reader_seek_indexed(struct reftable_reader *r,
 
 	if (err == 0) {
 		struct table_iter empty = TABLE_ITER_INIT;
-		struct table_iter *malloced =
-			reftable_calloc(sizeof(struct table_iter));
+		struct table_iter *malloced = reftable_calloc(1, sizeof(*malloced));
 		*malloced = empty;
 		table_iter_copy_from(malloced, &next);
 		iterator_from_table_iter(it, malloced);
@@ -635,8 +634,7 @@ void reader_close(struct reftable_reader *r)
 int reftable_new_reader(struct reftable_reader **p,
 			struct reftable_block_source *src, char const *name)
 {
-	struct reftable_reader *rd =
-		reftable_calloc(sizeof(struct reftable_reader));
+	struct reftable_reader *rd = reftable_calloc(1, sizeof(*rd));
 	int err = init_reader(rd, src, name);
 	if (err == 0) {
 		*p = rd;
@@ -711,7 +709,7 @@ static int reftable_reader_refs_for_unindexed(struct reftable_reader *r,
 					      uint8_t *oid)
 {
 	struct table_iter ti_empty = TABLE_ITER_INIT;
-	struct table_iter *ti = reftable_calloc(sizeof(struct table_iter));
+	struct table_iter *ti = reftable_calloc(1, sizeof(*ti));
 	struct filtering_ref_iterator *filter = NULL;
 	struct filtering_ref_iterator empty = FILTERING_REF_ITERATOR_INIT;
 	int oid_len = hash_size(r->hash_id);
diff --git a/reftable/readwrite_test.c b/reftable/readwrite_test.c
index b8a3224016..91ea7a10ef 100644
--- a/reftable/readwrite_test.c
+++ b/reftable/readwrite_test.c
@@ -56,7 +56,9 @@ static void write_table(char ***names, struct strbuf *buf, int N,
 	int i = 0, n;
 	struct reftable_log_record log = { NULL };
 	const struct reftable_stats *stats = NULL;
-	*names = reftable_calloc(sizeof(char *) * (N + 1));
+
+	REFTABLE_CALLOC_ARRAY(*names, N + 1);
+
 	reftable_writer_set_limits(w, update_index, update_index);
 	for (i = 0; i < N; i++) {
 		char name[100];
@@ -188,7 +190,7 @@ static void test_log_overflow(void)
 static void test_log_write_read(void)
 {
 	int N = 2;
-	char **names = reftable_calloc(sizeof(char *) * (N + 1));
+	char **names = reftable_calloc(N + 1, sizeof(*names));
 	int err;
 	struct reftable_write_options opts = {
 		.block_size = 256,
@@ -519,7 +521,7 @@ static void test_table_read_write_seek_index(void)
 static void test_table_refs_for(int indexed)
 {
 	int N = 50;
-	char **want_names = reftable_calloc(sizeof(char *) * (N + 1));
+	char **want_names = reftable_calloc(N + 1, sizeof(*want_names));
 	int want_names_len = 0;
 	uint8_t want_hash[GIT_SHA1_RAWSZ];
 
diff --git a/reftable/record_test.c b/reftable/record_test.c
index 2876db7d27..999366ad46 100644
--- a/reftable/record_test.c
+++ b/reftable/record_test.c
@@ -231,8 +231,8 @@ static void test_reftable_log_record_roundtrip(void)
 				.value_type = REFTABLE_LOG_UPDATE,
 				.value = {
 					.update = {
-						.new_hash = reftable_calloc(GIT_SHA1_RAWSZ),
-						.old_hash = reftable_calloc(GIT_SHA1_RAWSZ),
+						.new_hash = reftable_calloc(GIT_SHA1_RAWSZ, 1),
+						.old_hash = reftable_calloc(GIT_SHA1_RAWSZ, 1),
 						.name = xstrdup("old name"),
 						.email = xstrdup("old@email"),
 						.message = xstrdup("old message"),
diff --git a/reftable/refname.c b/reftable/refname.c
index 9573496932..7570e4acf9 100644
--- a/reftable/refname.c
+++ b/reftable/refname.c
@@ -140,8 +140,8 @@ int validate_ref_record_addition(struct reftable_table tab,
 {
 	struct modification mod = {
 		.tab = tab,
-		.add = reftable_calloc(sizeof(char *) * sz),
-		.del = reftable_calloc(sizeof(char *) * sz),
+		.add = reftable_calloc(sz, sizeof(*mod.add)),
+		.del = reftable_calloc(sz, sizeof(*mod.del)),
 	};
 	int i = 0;
 	int err = 0;
diff --git a/reftable/stack.c b/reftable/stack.c
index 1dfab99e96..d084823a92 100644
--- a/reftable/stack.c
+++ b/reftable/stack.c
@@ -50,8 +50,7 @@ static ssize_t reftable_fd_write(void *arg, const void *data, size_t sz)
 int reftable_new_stack(struct reftable_stack **dest, const char *dir,
 		       struct reftable_write_options config)
 {
-	struct reftable_stack *p =
-		reftable_calloc(sizeof(struct reftable_stack));
+	struct reftable_stack *p = reftable_calloc(1, sizeof(*p));
 	struct strbuf list_file_name = STRBUF_INIT;
 	int err = 0;
 
@@ -114,7 +113,7 @@ int read_lines(const char *filename, char ***namesp)
 	int err = 0;
 	if (fd < 0) {
 		if (errno == ENOENT) {
-			*namesp = reftable_calloc(sizeof(char *));
+			REFTABLE_CALLOC_ARRAY(*namesp, 1);
 			return 0;
 		}
 
@@ -191,8 +190,7 @@ void reftable_stack_destroy(struct reftable_stack *st)
 static struct reftable_reader **stack_copy_readers(struct reftable_stack *st,
 						   int cur_len)
 {
-	struct reftable_reader **cur =
-		reftable_calloc(sizeof(struct reftable_reader *) * cur_len);
+	struct reftable_reader **cur = reftable_calloc(cur_len, sizeof(*cur));
 	int i = 0;
 	for (i = 0; i < cur_len; i++) {
 		cur[i] = st->readers[i];
@@ -208,9 +206,9 @@ static int reftable_stack_reload_once(struct reftable_stack *st, char **names,
 	int err = 0;
 	int names_len = names_length(names);
 	struct reftable_reader **new_readers =
-		reftable_calloc(sizeof(struct reftable_reader *) * names_len);
+		reftable_calloc(names_len, sizeof(*new_readers));
 	struct reftable_table *new_tables =
-		reftable_calloc(sizeof(struct reftable_table) * names_len);
+		reftable_calloc(names_len, sizeof(*new_tables));
 	int new_readers_len = 0;
 	struct reftable_merged_table *new_merged = NULL;
 	struct strbuf table_path = STRBUF_INIT;
@@ -344,7 +342,7 @@ static int reftable_stack_reload_maybe_reuse(struct reftable_stack *st,
 				goto out;
 			}
 
-			names = reftable_calloc(sizeof(char *));
+			REFTABLE_CALLOC_ARRAY(names, 1);
 		} else {
 			err = fd_read_lines(fd, &names);
 			if (err < 0)
@@ -686,7 +684,7 @@ int reftable_stack_new_addition(struct reftable_addition **dest,
 {
 	int err = 0;
 	struct reftable_addition empty = REFTABLE_ADDITION_INIT;
-	*dest = reftable_calloc(sizeof(**dest));
+	REFTABLE_CALLOC_ARRAY(*dest, 1);
 	**dest = empty;
 	err = reftable_stack_init_addition(*dest, st);
 	if (err) {
@@ -871,7 +869,7 @@ static int stack_write_compact(struct reftable_stack *st,
 {
 	int subtabs_len = last - first + 1;
 	struct reftable_table *subtabs = reftable_calloc(
-		sizeof(struct reftable_table) * (last - first + 1));
+		last - first + 1, sizeof(*subtabs));
 	struct reftable_merged_table *mt = NULL;
 	int err = 0;
 	struct reftable_iterator it = { NULL };
@@ -979,9 +977,9 @@ static int stack_compact_range(struct reftable_stack *st, int first, int last,
 	int compact_count = last - first + 1;
 	char **listp = NULL;
 	char **delete_on_success =
-		reftable_calloc(sizeof(char *) * (compact_count + 1));
+		reftable_calloc(compact_count + 1, sizeof(*delete_on_success));
 	char **subtable_locks =
-		reftable_calloc(sizeof(char *) * (compact_count + 1));
+		reftable_calloc(compact_count + 1, sizeof(*subtable_locks));
 	int i = 0;
 	int j = 0;
 	int is_empty_table = 0;
@@ -1204,7 +1202,7 @@ int fastlog2(uint64_t sz)
 
 struct segment *sizes_to_segments(int *seglen, uint64_t *sizes, int n)
 {
-	struct segment *segs = reftable_calloc(sizeof(struct segment) * n);
+	struct segment *segs = reftable_calloc(n, sizeof(*segs));
 	int next = 0;
 	struct segment cur = { 0 };
 	int i = 0;
@@ -1268,7 +1266,7 @@ struct segment suggest_compaction_segment(uint64_t *sizes, int n)
 static uint64_t *stack_table_sizes_for_compaction(struct reftable_stack *st)
 {
 	uint64_t *sizes =
-		reftable_calloc(sizeof(uint64_t) * st->merged->stack_len);
+		reftable_calloc(st->merged->stack_len, sizeof(*sizes));
 	int version = (st->config.hash_id == GIT_SHA1_FORMAT_ID) ? 1 : 2;
 	int overhead = header_size(version) - 1;
 	int i = 0;
diff --git a/reftable/tree.c b/reftable/tree.c
index a5bf880985..528f33ae38 100644
--- a/reftable/tree.c
+++ b/reftable/tree.c
@@ -20,8 +20,8 @@ struct tree_node *tree_search(void *key, struct tree_node **rootp,
 		if (!insert) {
 			return NULL;
 		} else {
-			struct tree_node *n =
-				reftable_calloc(sizeof(struct tree_node));
+			struct tree_node *n;
+			REFTABLE_CALLOC_ARRAY(n, 1);
 			n->key = key;
 			*rootp = n;
 			return *rootp;
diff --git a/reftable/writer.c b/reftable/writer.c
index 4483bb21c3..47b3448132 100644
--- a/reftable/writer.c
+++ b/reftable/writer.c
@@ -49,7 +49,7 @@ static int padded_write(struct reftable_writer *w, uint8_t *data, size_t len,
 {
 	int n = 0;
 	if (w->pending_padding > 0) {
-		uint8_t *zeroed = reftable_calloc(w->pending_padding);
+		uint8_t *zeroed = reftable_calloc(w->pending_padding, sizeof(*zeroed));
 		int n = w->write(w->write_arg, zeroed, w->pending_padding);
 		if (n < 0)
 			return n;
@@ -123,8 +123,7 @@ struct reftable_writer *
 reftable_new_writer(ssize_t (*writer_func)(void *, const void *, size_t),
 		    void *writer_arg, struct reftable_write_options *opts)
 {
-	struct reftable_writer *wp =
-		reftable_calloc(sizeof(struct reftable_writer));
+	struct reftable_writer *wp = reftable_calloc(1, sizeof(*wp));
 	strbuf_init(&wp->block_writer_data.last_key, 0);
 	options_set_defaults(opts);
 	if (opts->block_size >= (1 << 24)) {
@@ -132,7 +131,7 @@ reftable_new_writer(ssize_t (*writer_func)(void *, const void *, size_t),
 		abort();
 	}
 	wp->last_key = reftable_empty_strbuf;
-	wp->block = reftable_calloc(opts->block_size);
+	REFTABLE_CALLOC_ARRAY(wp->block, opts->block_size);
 	wp->write = writer_func;
 	wp->write_arg = writer_arg;
 	wp->opts = *opts;
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v2 1/9] reftable: introduce macros to grow arrays
From: Patrick Steinhardt @ 2024-02-01  7:32 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano
In-Reply-To: <cover.1706772591.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 10447 bytes --]

Throughout the reftable library we have many cases where we need to grow
arrays. In order to avoid too many reallocations, we roughly double the
capacity of the array on each iteration. The resulting code pattern is
duplicated across many sites.

We have similar patterns in our main codebase, which is why we have
eventually introduced an `ALLOC_GROW()` macro to abstract it away and
avoid some code duplication. We cannot easily reuse this macro here
though because `ALLOC_GROW()` uses `REALLOC_ARRAY()`, which in turn will
call realloc(3P) to grow the array. The reftable code is structured as a
library though (even if the boundaries are fuzzy), and one property this
brings with it is that it is possible to plug in your own allocators. So
instead of using realloc(3P), we need to use `reftable_realloc()` that
knows to use the user-provided implementation.

So let's introduce two new macros `REFTABLE_REALLOC_ARRAY()` and
`REFTABLE_ALLOC_GROW()` that mirror what we do in our main codebase,
with two modifications:

  - They use `reftable_realloc()`, as explained above.

  - They use a different growth factor of `2 * cap + 1` instead of `(cap
    + 16) * 3 / 2`.

The second change is because we know a bit more about the allocation
patterns in the reftable library. In most cases, we end up only having a
handful of items in the array and don't end up growing them. The initial
capacity that our normal growth factor uses (which is 24) would thus end
up over-allocating in a lot of code paths. This effect is measurable:

  - Before change:

      HEAP SUMMARY:
          in use at exit: 671,983 bytes in 152 blocks
        total heap usage: 3,843,446 allocs, 3,843,294 frees, 223,761,402 bytes allocated

  - After change with a growth factor of `(2 * alloc + 1)`:

      HEAP SUMMARY:
          in use at exit: 671,983 bytes in 152 blocks
        total heap usage: 3,843,446 allocs, 3,843,294 frees, 223,761,410 bytes allocated

  - After change with a growth factor of `(alloc + 16)* 2 / 3`:

      HEAP SUMMARY:
          in use at exit: 671,983 bytes in 152 blocks
        total heap usage: 3,833,673 allocs, 3,833,521 frees, 4,728,251,742 bytes allocated

While the total heap usage is roughly the same, we do end up allocating
significantly more bytes with our usual growth factor (in fact, roughly
21 times as many).

Convert the reftable library to use these new macros.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/basics.c      |  8 ++------
 reftable/basics.h      | 11 +++++++++++
 reftable/block.c       |  7 +------
 reftable/merged_test.c | 20 ++++++--------------
 reftable/pq.c          |  8 ++------
 reftable/stack.c       | 29 ++++++++++++-----------------
 reftable/writer.c      | 14 ++------------
 7 files changed, 36 insertions(+), 61 deletions(-)

diff --git a/reftable/basics.c b/reftable/basics.c
index f761e48028..af9004cec2 100644
--- a/reftable/basics.c
+++ b/reftable/basics.c
@@ -89,17 +89,13 @@ void parse_names(char *buf, int size, char ***namesp)
 			next = end;
 		}
 		if (p < next) {
-			if (names_len == names_cap) {
-				names_cap = 2 * names_cap + 1;
-				names = reftable_realloc(
-					names, names_cap * sizeof(*names));
-			}
+			REFTABLE_ALLOC_GROW(names, names_len + 1, names_cap);
 			names[names_len++] = xstrdup(p);
 		}
 		p = next + 1;
 	}
 
-	names = reftable_realloc(names, (names_len + 1) * sizeof(*names));
+	REFTABLE_REALLOC_ARRAY(names, names_len + 1);
 	names[names_len] = NULL;
 	*namesp = names;
 }
diff --git a/reftable/basics.h b/reftable/basics.h
index 096b36862b..2f855cd724 100644
--- a/reftable/basics.h
+++ b/reftable/basics.h
@@ -53,6 +53,17 @@ void *reftable_realloc(void *p, size_t sz);
 void reftable_free(void *p);
 void *reftable_calloc(size_t sz);
 
+#define REFTABLE_REALLOC_ARRAY(x, alloc) (x) = reftable_realloc((x), st_mult(sizeof(*(x)), (alloc)))
+#define REFTABLE_ALLOC_GROW(x, nr, alloc) \
+	do { \
+		if ((nr) > alloc) { \
+			alloc = 2 * (alloc) + 1; \
+			if (alloc < (nr)) \
+				alloc = (nr); \
+			REFTABLE_REALLOC_ARRAY(x, alloc); \
+		} \
+	} while (0)
+
 /* Find the longest shared prefix size of `a` and `b` */
 struct strbuf;
 int common_prefix_size(struct strbuf *a, struct strbuf *b);
diff --git a/reftable/block.c b/reftable/block.c
index 1df3d8a0f0..6952d0facf 100644
--- a/reftable/block.c
+++ b/reftable/block.c
@@ -51,12 +51,7 @@ static int block_writer_register_restart(struct block_writer *w, int n,
 	if (2 + 3 * rlen + n > w->block_size - w->next)
 		return -1;
 	if (is_restart) {
-		if (w->restart_len == w->restart_cap) {
-			w->restart_cap = w->restart_cap * 2 + 1;
-			w->restarts = reftable_realloc(
-				w->restarts, sizeof(uint32_t) * w->restart_cap);
-		}
-
+		REFTABLE_ALLOC_GROW(w->restarts, w->restart_len + 1, w->restart_cap);
 		w->restarts[w->restart_len++] = w->next;
 	}
 
diff --git a/reftable/merged_test.c b/reftable/merged_test.c
index 46908f738f..e05351e035 100644
--- a/reftable/merged_test.c
+++ b/reftable/merged_test.c
@@ -231,14 +231,10 @@ static void test_merged(void)
 	while (len < 100) { /* cap loops/recursion. */
 		struct reftable_ref_record ref = { NULL };
 		int err = reftable_iterator_next_ref(&it, &ref);
-		if (err > 0) {
+		if (err > 0)
 			break;
-		}
-		if (len == cap) {
-			cap = 2 * cap + 1;
-			out = reftable_realloc(
-				out, sizeof(struct reftable_ref_record) * cap);
-		}
+
+		REFTABLE_ALLOC_GROW(out, len + 1, cap);
 		out[len++] = ref;
 	}
 	reftable_iterator_destroy(&it);
@@ -368,14 +364,10 @@ static void test_merged_logs(void)
 	while (len < 100) { /* cap loops/recursion. */
 		struct reftable_log_record log = { NULL };
 		int err = reftable_iterator_next_log(&it, &log);
-		if (err > 0) {
+		if (err > 0)
 			break;
-		}
-		if (len == cap) {
-			cap = 2 * cap + 1;
-			out = reftable_realloc(
-				out, sizeof(struct reftable_log_record) * cap);
-		}
+
+		REFTABLE_ALLOC_GROW(out, len + 1, cap);
 		out[len++] = log;
 	}
 	reftable_iterator_destroy(&it);
diff --git a/reftable/pq.c b/reftable/pq.c
index dcefeb793a..2461daf5ff 100644
--- a/reftable/pq.c
+++ b/reftable/pq.c
@@ -75,13 +75,9 @@ void merged_iter_pqueue_add(struct merged_iter_pqueue *pq, const struct pq_entry
 {
 	int i = 0;
 
-	if (pq->len == pq->cap) {
-		pq->cap = 2 * pq->cap + 1;
-		pq->heap = reftable_realloc(pq->heap,
-					    pq->cap * sizeof(struct pq_entry));
-	}
-
+	REFTABLE_ALLOC_GROW(pq->heap, pq->len + 1, pq->cap);
 	pq->heap[pq->len++] = *e;
+
 	i = pq->len - 1;
 	while (i > 0) {
 		int j = (i - 1) / 2;
diff --git a/reftable/stack.c b/reftable/stack.c
index bf3869ce70..1dfab99e96 100644
--- a/reftable/stack.c
+++ b/reftable/stack.c
@@ -551,7 +551,7 @@ struct reftable_addition {
 	struct reftable_stack *stack;
 
 	char **new_tables;
-	int new_tables_len;
+	size_t new_tables_len, new_tables_cap;
 	uint64_t next_update_index;
 };
 
@@ -602,8 +602,9 @@ static int reftable_stack_init_addition(struct reftable_addition *add,
 
 static void reftable_addition_close(struct reftable_addition *add)
 {
-	int i = 0;
 	struct strbuf nm = STRBUF_INIT;
+	size_t i;
+
 	for (i = 0; i < add->new_tables_len; i++) {
 		stack_filename(&nm, add->stack, add->new_tables[i]);
 		unlink(nm.buf);
@@ -613,6 +614,7 @@ static void reftable_addition_close(struct reftable_addition *add)
 	reftable_free(add->new_tables);
 	add->new_tables = NULL;
 	add->new_tables_len = 0;
+	add->new_tables_cap = 0;
 
 	delete_tempfile(&add->lock_file);
 	strbuf_release(&nm);
@@ -631,8 +633,8 @@ int reftable_addition_commit(struct reftable_addition *add)
 {
 	struct strbuf table_list = STRBUF_INIT;
 	int lock_file_fd = get_tempfile_fd(add->lock_file);
-	int i = 0;
 	int err = 0;
+	size_t i;
 
 	if (add->new_tables_len == 0)
 		goto done;
@@ -660,12 +662,12 @@ int reftable_addition_commit(struct reftable_addition *add)
 	}
 
 	/* success, no more state to clean up. */
-	for (i = 0; i < add->new_tables_len; i++) {
+	for (i = 0; i < add->new_tables_len; i++)
 		reftable_free(add->new_tables[i]);
-	}
 	reftable_free(add->new_tables);
 	add->new_tables = NULL;
 	add->new_tables_len = 0;
+	add->new_tables_cap = 0;
 
 	err = reftable_stack_reload_maybe_reuse(add->stack, 1);
 	if (err)
@@ -792,11 +794,9 @@ int reftable_addition_add(struct reftable_addition *add,
 		goto done;
 	}
 
-	add->new_tables = reftable_realloc(add->new_tables,
-					   sizeof(*add->new_tables) *
-						   (add->new_tables_len + 1));
-	add->new_tables[add->new_tables_len] = strbuf_detach(&next_name, NULL);
-	add->new_tables_len++;
+	REFTABLE_ALLOC_GROW(add->new_tables, add->new_tables_len + 1,
+			    add->new_tables_cap);
+	add->new_tables[add->new_tables_len++] = strbuf_detach(&next_name, NULL);
 done:
 	if (tab_fd > 0) {
 		close(tab_fd);
@@ -1367,17 +1367,12 @@ static int stack_check_addition(struct reftable_stack *st,
 	while (1) {
 		struct reftable_ref_record ref = { NULL };
 		err = reftable_iterator_next_ref(&it, &ref);
-		if (err > 0) {
+		if (err > 0)
 			break;
-		}
 		if (err < 0)
 			goto done;
 
-		if (len >= cap) {
-			cap = 2 * cap + 1;
-			refs = reftable_realloc(refs, cap * sizeof(refs[0]));
-		}
-
+		REFTABLE_ALLOC_GROW(refs, len + 1, cap);
 		refs[len++] = ref;
 	}
 
diff --git a/reftable/writer.c b/reftable/writer.c
index ee4590e20f..4483bb21c3 100644
--- a/reftable/writer.c
+++ b/reftable/writer.c
@@ -200,12 +200,7 @@ static void writer_index_hash(struct reftable_writer *w, struct strbuf *hash)
 		return;
 	}
 
-	if (key->offset_len == key->offset_cap) {
-		key->offset_cap = 2 * key->offset_cap + 1;
-		key->offsets = reftable_realloc(
-			key->offsets, sizeof(uint64_t) * key->offset_cap);
-	}
-
+	REFTABLE_ALLOC_GROW(key->offsets, key->offset_len + 1, key->offset_cap);
 	key->offsets[key->offset_len++] = off;
 }
 
@@ -674,12 +669,7 @@ static int writer_flush_nonempty_block(struct reftable_writer *w)
 	if (err < 0)
 		return err;
 
-	if (w->index_cap == w->index_len) {
-		w->index_cap = 2 * w->index_cap + 1;
-		w->index = reftable_realloc(
-			w->index,
-			sizeof(struct reftable_index_record) * w->index_cap);
-	}
+	REFTABLE_ALLOC_GROW(w->index, w->index_len + 1, w->index_cap);
 
 	ir.offset = w->next;
 	strbuf_reset(&ir.last_key);
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v2 0/9] reftable: code style improvements
From: Patrick Steinhardt @ 2024-02-01  7:32 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano
In-Reply-To: <cover.1706687982.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 6284 bytes --]

Hi,

this is the second version of my patch series that aims to improve the
code style of the reftable library.

The only changes compared to v1 are improvements for the commit
messages. The edit for commit 1 should hopefully make it a much more
compelling argument why we don't want to use the default growth factor
used by our own `ALLOC_GROW()` macro.

Thanks!

Patrick

Patrick Steinhardt (9):
  reftable: introduce macros to grow arrays
  reftable: introduce macros to allocate arrays
  reftable/stack: fix parameter validation when compacting range
  reftable/stack: index segments with `size_t`
  reftable/stack: use `size_t` to track stack slices during compaction
  reftable/stack: use `size_t` to track stack length
  reftable/merged: refactor seeking of records
  reftable/merged: refactor initialization of iterators
  reftable/record: improve semantics when initializing records

 reftable/basics.c          |  15 ++--
 reftable/basics.h          |  17 ++++-
 reftable/block.c           |  25 +++---
 reftable/block_test.c      |   2 +-
 reftable/blocksource.c     |   4 +-
 reftable/iter.c            |   3 +-
 reftable/merged.c          | 100 +++++++++++-------------
 reftable/merged_test.c     |  52 ++++++-------
 reftable/pq.c              |   8 +-
 reftable/publicbasics.c    |   3 +-
 reftable/reader.c          |  12 ++-
 reftable/readwrite_test.c  |   8 +-
 reftable/record.c          |  43 +++--------
 reftable/record.h          |  10 +--
 reftable/record_test.c     |   8 +-
 reftable/refname.c         |   4 +-
 reftable/reftable-merged.h |   2 +-
 reftable/stack.c           | 151 +++++++++++++++++--------------------
 reftable/stack.h           |   6 +-
 reftable/stack_test.c      |   7 +-
 reftable/tree.c            |   4 +-
 reftable/writer.c          |  21 ++----
 22 files changed, 221 insertions(+), 284 deletions(-)

Range-diff against v1:
 1:  0597e6944a !  1:  12bd721ddf reftable: introduce macros to grow arrays
    @@ Commit message
         Throughout the reftable library we have many cases where we need to grow
         arrays. In order to avoid too many reallocations, we roughly double the
         capacity of the array on each iteration. The resulting code pattern is
    -    thus duplicated across many sites.
    +    duplicated across many sites.
     
         We have similar patterns in our main codebase, which is why we have
         eventually introduced an `ALLOC_GROW()` macro to abstract it away and
    @@ Commit message
             + 16) * 3 / 2`.
     
         The second change is because we know a bit more about the allocation
    -    patterns in the reftable library. For In most cases, we end up only having a
    -    single item in the array, so the initial capacity that our global growth
    -    factor uses (which is 24), significantly overallocates in a lot of code
    -    paths. This effect is indeed measurable:
    +    patterns in the reftable library. In most cases, we end up only having a
    +    handful of items in the array and don't end up growing them. The initial
    +    capacity that our normal growth factor uses (which is 24) would thus end
    +    up over-allocating in a lot of code paths. This effect is measurable:
     
    -      Benchmark 1: update-ref: create many refs (growth factor = 2 * cap + 1)
    -        Time (mean ± σ):      4.834 s ±  0.020 s    [User: 2.219 s, System: 2.614 s]
    -        Range (min … max):    4.793 s …  4.871 s    20 runs
    +      - Before change:
     
    -      Benchmark 2: update-ref: create many refs (growth factor = (cap + 16) * 3 + 2)
    -        Time (mean ± σ):      4.933 s ±  0.021 s    [User: 2.325 s, System: 2.607 s]
    -        Range (min … max):    4.889 s …  4.962 s    20 runs
    +          HEAP SUMMARY:
    +              in use at exit: 671,983 bytes in 152 blocks
    +            total heap usage: 3,843,446 allocs, 3,843,294 frees, 223,761,402 bytes allocated
     
    -      Summary
    -        update-ref: create many refs (growth factor = 2 * cap + 1) ran
    -          1.02 ± 0.01 times faster than update-ref: create many refs (growth factor = (cap + 16) * 3 + 2)
    +      - After change with a growth factor of `(2 * alloc + 1)`:
    +
    +          HEAP SUMMARY:
    +              in use at exit: 671,983 bytes in 152 blocks
    +            total heap usage: 3,843,446 allocs, 3,843,294 frees, 223,761,410 bytes allocated
    +
    +      - After change with a growth factor of `(alloc + 16)* 2 / 3`:
    +
    +          HEAP SUMMARY:
    +              in use at exit: 671,983 bytes in 152 blocks
    +            total heap usage: 3,833,673 allocs, 3,833,521 frees, 4,728,251,742 bytes allocated
    +
    +    While the total heap usage is roughly the same, we do end up allocating
    +    significantly more bytes with our usual growth factor (in fact, roughly
    +    21 times as many).
     
         Convert the reftable library to use these new macros.
     
 2:  eee6580c84 =  2:  2dde581a02 reftable: introduce macros to allocate arrays
 3:  e2d05f7c38 =  3:  f134702dc5 reftable/stack: fix parameter validation when compacting range
 4:  b8b2cce742 =  4:  50dac904e8 reftable/stack: index segments with `size_t`
 5:  2f6a69aa14 =  5:  a5ffbf09dd reftable/stack: use `size_t` to track stack slices during compaction
 6:  1ee9a4477f =  6:  55605fb53b reftable/stack: use `size_t` to track stack length
 7:  00aeaeee63 !  7:  80cf2fd272 reftable/merged: refactor seeking of records
    @@ Commit message
         to read and does not conform to our coding style in multiple ways:
     
           - We have multiple exit paths where we release resources even though
    -        that is not really necessary
    +        that is not really necessary.
     
           - We use a scoped error variable `e` which is hard to reason about.
             This variable is not required at all.
 8:  31864740e9 =  8:  8c1be2b159 reftable/merged: refactor initialization of iterators
 9:  fd8a1ce99d =  9:  c39d7e30e7 reftable/record: improve semantics when initializing records

base-commit: bc7ee2e5e16f0d1e710ef8fab3db59ab11f2bbe7
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH 1/9] reftable: introduce macros to grow arrays
From: Patrick Steinhardt @ 2024-02-01  7:29 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <xmqqmssl44wo.fsf@gitster.g>

[-- Attachment #1: Type: text/plain, Size: 3127 bytes --]

On Wed, Jan 31, 2024 at 12:35:51PM -0800, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > patterns in the reftable library. For In most cases, we end up only having a
> > single item in the array, so the initial capacity that our global growth
> > factor uses (which is 24), significantly overallocates in a lot of code
> > paths. 
> 
> You need to know not just that you very often initially have only
> one but you rarely grow it beyond 3, or something like that to
> explain "significantly overallocates", though.

True.

> > This effect is indeed measurable:
> 
> And measuring is very good, but I somehow expected that you would
> measure not the time (if you often under-allocate and end up
> reallocating too many times, it might consume more time, though) but
> the peak memory usage.  I cannot quite tell what to think of that 2%
> time difference.

Very good point indeed. I don't think peak memory usage is really all
that helpful either because the problem is not that we are allocating
arrays that we keep around all the time, but many small arrays which are
short lived. So what is telling is the total number of bytes we end up
allocating:

    Before change:

        HEAP SUMMARY:
            in use at exit: 671,983 bytes in 152 blocks
          total heap usage: 3,843,446 allocs, 3,843,294 frees, 223,761,402 bytes allocated

    Growth factor (alloc * 2 + 1):

        HEAP SUMMARY:
            in use at exit: 671,983 bytes in 152 blocks
          total heap usage: 3,843,446 allocs, 3,843,294 frees, 223,761,410 bytes allocated

    Growth factor (alloc + 16) * 2 / 3:

        HEAP SUMMARY:
            in use at exit: 671,983 bytes in 152 blocks
          total heap usage: 3,833,673 allocs, 3,833,521 frees, 4,728,251,742 bytes allocated

Allocating 21 times as many bytes with our default growth factor should
be a much more compelling argument why we don't actually want to use it
compared to the 2% speedup.

It's somewhat amazing though that this huge difference only makes for a
2% speedup.

> > Convert the reftable library to use these new macros.
> 
> In any case, the conversion shortens the code and is a good thing.
> I wish we had a way to tell these macros that we are actually using
> the same single allocator (which we are doing in our code when
> linking the reftable thing to us anyway), which would have made this
> even simpler because you did not have to introduce separate macros..

Yeah, I wasn't a huge fan of this, either. I initially just wanted to
reuse our usual macros, but when I noticed the resulting difference in
allocated bytes I already had two arguments against this: the fact that
we have pluggable allocators in the reftable library and the growth
factor. While we could make our macros more flexible so that they can
accommodate for both, I don't think that the result would be pretty.

So at that point I decided to just duplicate the code. It still ends up
removing a lot of code duplication in the reftable library itself, so I
don't think it is too bad.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* [PATCH v4 2/2] index-pack: --fsck-objects to take an optional argument for fsck msgs
From: John Cai via GitGitGadget @ 2024-02-01  1:38 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Patrick Steinhardt, John Cai, John Cai
In-Reply-To: <pull.1658.v4.git.git.1706751483.gitgitgadget@gmail.com>

From: John Cai <johncai86@gmail.com>

git-index-pack has a --strict option that can take an optional argument
to provide a list of fsck issues to change their severity.
--fsck-objects does not have such a utility, which would be useful if
one would like to be more lenient or strict on data integrity in a
repository.

Like --strict, allow --fsck-objects to also take a list of fsck msgs to
change the severity.

Remove the "For internal use only" note for --fsck-objects, and document
the option. This won't often be used by the normal end user, but it
turns out it is useful for Git forges like GitLab.

Reviewed-by: Christian Couder <christian.couder@gmail.com>
Signed-off-by: John Cai <johncai86@gmail.com>
---
 Documentation/git-index-pack.txt | 17 +++++++++++------
 builtin/index-pack.c             |  5 +++--
 t/t5300-pack-object.sh           | 29 ++++++++++++++++++++++++-----
 3 files changed, 38 insertions(+), 13 deletions(-)

diff --git a/Documentation/git-index-pack.txt b/Documentation/git-index-pack.txt
index 694bb9409bf..5a20deefd5f 100644
--- a/Documentation/git-index-pack.txt
+++ b/Documentation/git-index-pack.txt
@@ -96,13 +96,18 @@ default and "Indexing objects" when `--stdin` is specified.
 --check-self-contained-and-connected::
 	Die if the pack contains broken links. For internal use only.
 
---fsck-objects::
-	For internal use only.
+--fsck-objects[=<msg-id>=<severity>...]::
+	Die if the pack contains broken objects, but unlike `--strict`, don't
+	choke on broken links. If the pack contains a tree pointing to a
+	.gitmodules blob that does not exist, prints the hash of that blob
+	(for the caller to check) after the hash that goes into the name of the
+	pack/idx file (see "Notes").
 +
-Die if the pack contains broken objects. If the pack contains a tree
-pointing to a .gitmodules blob that does not exist, prints the hash of
-that blob (for the caller to check) after the hash that goes into the
-name of the pack/idx file (see "Notes").
+An optional comma-separated list of `<msg-id>=<severity>` can be passed to
+change the severity of some possible issues, e.g.,
+`--fsck-objects="missingEmail=ignore,badTagName=ignore"`. See the entry for the
+`fsck.<msg-id>` configuration options in linkgit:git-fsck[1] for more
+information on the possible values of `<msg-id>` and `<severity>`.
 
 --threads=<n>::
 	Specifies the number of threads to spawn when resolving
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 240c7021168..a3a37bd215d 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -24,7 +24,7 @@
 #include "setup.h"
 
 static const char index_pack_usage[] =
-"git index-pack [-v] [-o <index-file>] [--keep | --keep=<msg>] [--[no-]rev-index] [--verify] [--strict[=<msg-id>=<severity>...]] (<pack-file> | --stdin [--fix-thin] [<pack-file>])";
+"git index-pack [-v] [-o <index-file>] [--keep | --keep=<msg>] [--[no-]rev-index] [--verify] [--strict[=<msg-id>=<severity>...]] [--fsck-objects[=<msg-id>=<severity>...]] (<pack-file> | --stdin [--fix-thin] [<pack-file>])";
 
 struct object_entry {
 	struct pack_idx_entry idx;
@@ -1785,8 +1785,9 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 			} else if (!strcmp(arg, "--check-self-contained-and-connected")) {
 				strict = 1;
 				check_self_contained_and_connected = 1;
-			} else if (!strcmp(arg, "--fsck-objects")) {
+			} else if (skip_to_optional_arg(arg, "--fsck-objects", &arg)) {
 				do_fsck_object = 1;
+				fsck_set_msg_types(&fsck_options, arg);
 			} else if (!strcmp(arg, "--verify")) {
 				verify = 1;
 			} else if (!strcmp(arg, "--verify-stat")) {
diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index 496fffa0f8a..a58f91035d1 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -441,8 +441,7 @@ test_expect_success 'index-pack with --strict' '
 	)
 '
 
-test_expect_success 'index-pack with --strict downgrading fsck msgs' '
-	test_when_finished rm -rf strict &&
+test_expect_success 'setup for --strict and --fsck-objects downgrading fsck msgs' '
 	git init strict &&
 	(
 		cd strict &&
@@ -457,12 +456,32 @@ test_expect_success 'index-pack with --strict downgrading fsck msgs' '
 
 		EOF
 		git hash-object --literally -t commit -w --stdin <commit >commit_list &&
-		PACK=$(git pack-objects test <commit_list) &&
-		test_must_fail git index-pack --strict "test-$PACK.pack" &&
-		git index-pack --strict="missingEmail=ignore" "test-$PACK.pack"
+		git pack-objects test <commit_list >pack-name
 	)
 '
 
+test_with_bad_commit () {
+	must_fail_arg="$1" &&
+	must_pass_arg="$2" &&
+	(
+		cd strict &&
+		test_expect_fail git index-pack "$must_fail_arg" "test-$(cat pack-name).pack"
+		git index-pack "$must_pass_arg" "test-$(cat pack-name).pack"
+	)
+}
+
+test_expect_success 'index-pack with --strict downgrading fsck msgs' '
+	test_with_bad_commit --strict --strict="missingEmail=ignore"
+'
+
+test_expect_success 'index-pack with --fsck-objects downgrading fsck msgs' '
+	test_with_bad_commit --fsck-objects --fsck-objects="missingEmail=ignore"
+'
+
+test_expect_success 'cleanup for --strict and --fsck-objects downgrading fsck msgs' '
+	rm -rf strict
+'
+
 test_expect_success 'honor pack.packSizeLimit' '
 	git config pack.packSizeLimit 3m &&
 	packname_10=$(git pack-objects test-10 <obj-list) &&
-- 
gitgitgadget

^ permalink raw reply related

* [PATCH v4 1/2] index-pack: test and document --strict=<msg-id>=<severity>...
From: John Cai via GitGitGadget @ 2024-02-01  1:38 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Patrick Steinhardt, John Cai, John Cai
In-Reply-To: <pull.1658.v4.git.git.1706751483.gitgitgadget@gmail.com>

From: John Cai <johncai86@gmail.com>

5d477a334a (fsck (receive-pack): allow demoting errors to warnings,
2015-06-22) allowed a list of fsck msg to downgrade to be passed to
--strict. However this is a hidden argument that was not documented nor
tested. Though it is true that most users would not call this option
directly, (nor use index-pack for that matter) it is still useful to
document and test this feature.

Reviewed-by: Christian Couder <christian.couder@gmail.com>
Signed-off-by: John Cai <johncai86@gmail.com>
---
 Documentation/git-index-pack.txt |  9 +++++++--
 builtin/index-pack.c             |  2 +-
 t/t5300-pack-object.sh           | 22 ++++++++++++++++++++++
 3 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-index-pack.txt b/Documentation/git-index-pack.txt
index 6486620c3d8..694bb9409bf 100644
--- a/Documentation/git-index-pack.txt
+++ b/Documentation/git-index-pack.txt
@@ -79,8 +79,13 @@ OPTIONS
 	to force the version for the generated pack index, and to force
 	64-bit index entries on objects located above the given offset.
 
---strict::
-	Die, if the pack contains broken objects or links.
+--strict[=<msg-id>=<severity>...]::
+	Die, if the pack contains broken objects or links. An optional
+	comma-separated list of `<msg-id>=<severity>` can be passed to change
+	the severity of some possible issues, e.g.,
+	 `--strict="missingEmail=ignore,badTagName=error"`. See the entry for the
+	`fsck.<msg-id>` configuration options in linkgit:git-fsck[1] for more
+	information on the possible values of `<msg-id>` and `<severity>`.
 
 --progress-title::
 	For internal use only.
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 1ea87e01f29..240c7021168 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -24,7 +24,7 @@
 #include "setup.h"
 
 static const char index_pack_usage[] =
-"git index-pack [-v] [-o <index-file>] [--keep | --keep=<msg>] [--[no-]rev-index] [--verify] [--strict] (<pack-file> | --stdin [--fix-thin] [<pack-file>])";
+"git index-pack [-v] [-o <index-file>] [--keep | --keep=<msg>] [--[no-]rev-index] [--verify] [--strict[=<msg-id>=<severity>...]] (<pack-file> | --stdin [--fix-thin] [<pack-file>])";
 
 struct object_entry {
 	struct pack_idx_entry idx;
diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index d402ec18b79..496fffa0f8a 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -441,6 +441,28 @@ test_expect_success 'index-pack with --strict' '
 	)
 '
 
+test_expect_success 'index-pack with --strict downgrading fsck msgs' '
+	test_when_finished rm -rf strict &&
+	git init strict &&
+	(
+		cd strict &&
+		test_commit first hello &&
+		cat >commit <<-EOF &&
+		tree $(git rev-parse HEAD^{tree})
+		parent $(git rev-parse HEAD)
+		author A U Thor
+		committer A U Thor
+
+		commit: this is a commit with bad emails
+
+		EOF
+		git hash-object --literally -t commit -w --stdin <commit >commit_list &&
+		PACK=$(git pack-objects test <commit_list) &&
+		test_must_fail git index-pack --strict "test-$PACK.pack" &&
+		git index-pack --strict="missingEmail=ignore" "test-$PACK.pack"
+	)
+'
+
 test_expect_success 'honor pack.packSizeLimit' '
 	git config pack.packSizeLimit 3m &&
 	packname_10=$(git pack-objects test-10 <obj-list) &&
-- 
gitgitgadget


^ permalink raw reply related

* [PATCH v4 0/2] index-pack: fsck honor checks
From: John Cai via GitGitGadget @ 2024-02-01  1:38 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Patrick Steinhardt, John Cai
In-Reply-To: <pull.1658.v3.git.git.1706302749.gitgitgadget@gmail.com>

git-index-pack has a --strict mode that can take an optional argument to
provide a list of fsck issues to change their severity. --fsck-objects does
not have such a utility, which would be useful if one would like to be more
lenient or strict on data integrity in a repository.

Like --strict, Allow --fsck-objects to also take a list of fsck msgs to
change the severity.

Changes since V3:

 * clarification of --fsck-objects documentation wording

Changes since V2:

 * fixed some typos in the documentation
 * added commit trailers

Change since V1:

 * edited commit messages
 * clarified formatting in documentation for --strict= and --fsck-objects=

John Cai (2):
  index-pack: test and document --strict=<msg-id>=<severity>...
  index-pack: --fsck-objects to take an optional argument for fsck msgs

 Documentation/git-index-pack.txt | 26 +++++++++++++-------
 builtin/index-pack.c             |  5 ++--
 t/t5300-pack-object.sh           | 41 ++++++++++++++++++++++++++++++++
 3 files changed, 62 insertions(+), 10 deletions(-)


base-commit: 186b115d3062e6230ee296d1ddaa0c4b72a464b5
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1658%2Fjohn-cai%2Fjc%2Findex-pack-fsck-honor-checks-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1658/john-cai/jc/index-pack-fsck-honor-checks-v4
Pull-Request: https://github.com/git/git/pull/1658

Range-diff vs v3:

 1:  cdf7fc7fe8a = 1:  cdf7fc7fe8a index-pack: test and document --strict=<msg-id>=<severity>...
 2:  a2b9adb93d8 ! 2:  f29ab9136fb index-pack: --fsck-objects to take an optional argument for fsck msgs
     @@ Documentation/git-index-pack.txt: default and "Indexing objects" when `--stdin`
      ---fsck-objects::
      -	For internal use only.
      +--fsck-objects[=<msg-id>=<severity>...]::
     -+	Die if the pack contains broken objects. If the pack contains a tree
     -+	pointing to a .gitmodules blob that does not exist, prints the hash of
     -+	that blob (for the caller to check) after the hash that goes into the
     -+	name of the pack/idx file (see "Notes").
     ++	Die if the pack contains broken objects, but unlike `--strict`, don't
     ++	choke on broken links. If the pack contains a tree pointing to a
     ++	.gitmodules blob that does not exist, prints the hash of that blob
     ++	(for the caller to check) after the hash that goes into the name of the
     ++	pack/idx file (see "Notes").
       +
      -Die if the pack contains broken objects. If the pack contains a tree
      -pointing to a .gitmodules blob that does not exist, prints the hash of
      -that blob (for the caller to check) after the hash that goes into the
      -name of the pack/idx file (see "Notes").
     -+Unlike `--strict` however, don't choke on broken links. An optional
     -+comma-separated list of `<msg-id>=<severity>` can be passed to change the
     -+severity of some possible issues, e.g.,
     ++An optional comma-separated list of `<msg-id>=<severity>` can be passed to
     ++change the severity of some possible issues, e.g.,
      +`--fsck-objects="missingEmail=ignore,badTagName=ignore"`. See the entry for the
      +`fsck.<msg-id>` configuration options in linkgit:git-fsck[1] for more
      +information on the possible values of `<msg-id>` and `<severity>`.

-- 
gitgitgadget

^ permalink raw reply

* Re: [PATCH v3 0/2] index-pack: fsck honor checks
From: John Cai @ 2024-02-01  1:34 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Junio C Hamano, John Cai via GitGitGadget, git
In-Reply-To: <20240131223032.4065897-1-jonathantanmy@google.com>

Hi Jonathan,

On 31 Jan 2024, at 17:30, Jonathan Tan wrote:

> John Cai <johncai86@gmail.com> writes:
>> Hi Jonathan,
>>
>> On 26 Jan 2024, at 17:13, Jonathan Tan wrote:
>>
>>> Junio C Hamano <gitster@pobox.com> writes:
>>>>>      ++--fsck-objects[=<msg-id>=<severity>...]::
>>>>>      ++	Die if the pack contains broken objects. If the pack contains a tree
>>>>>      ++	pointing to a .gitmodules blob that does not exist, prints the hash of
>>>>>      ++	that blob (for the caller to check) after the hash that goes into the
>>>>>      ++	name of the pack/idx file (see "Notes").
>
>> Thanks for clarifying! Would you mind providing a patch to revise the wording
>> here to make it clearer? I would try but I feel like I might get the wording
>> wrong.
>
> I think the wording there is already mostly correct, except maybe make
> everything plural (a tree -> trees, a .gitmodules blob -> .gitmodules
> blobs, hash of that blob -> hashes of those blobs). We might also need
> to modify a test to show that the current code indeed handles the plural
> situation correctly. I don't have time right now to get to this, so
> hopefully someone could pick this up.

Thanks! It sounds like we may want to tackle this as part of another patch.

John

^ permalink raw reply

* Re: [PATCH v3 03/10] trailer: unify trailer formatting machinery
From: Junio C Hamano @ 2024-02-01  1:07 UTC (permalink / raw)
  To: Linus Arver
  Cc: Linus Arver via GitGitGadget, git, Christian Couder,
	Emily Shaffer, Josh Steadmon, Randall S. Becker
In-Reply-To: <owlyv879106s.fsf@fine.c.googlers.com>

Linus Arver <linusa@google.com> writes:

>> In any case, the updated code does try to call unfold_value() in
>> format_trailers() on "val" that has already been unfolded in
>> parse_trailers().
>
> Correct. But this was already the case before this series. IOW the
> existing code assumes that this function is idempotent: we call
> unfold_value() in parse_trailers() and again in format_trailer_info().

But not in format_trailers(), which is the theme of this step.  In
other words, the behaviour before and after this step of the
function are not the same (modulo that the new version stores the
output in a strbuf), and as the way the changes are presented here,
it is almost impossible to make sure that we are not introducing
regressions, without making such assumptions like "unfold_value() is
idempotent and we already rely on that fact elsewhere in the code".

It is not just "unfold", which was the first thing I happened to
notice, by the way.  There are many more lines in the new lines of
that function, doing things that the original version did not appear
to do, or doing in very different ways.

> I could just grow this series with another ~22 patches to include those
> additional refactors, but I am hesitant about doing so, simply due to
> the sheer number of them.

I actually do not mind at all if you started with a preliminary
clean-up series, and stopped the first batch somewhere in the middle
of the 20+ patches before even reaching any of these 10 patches we
see here, if that gives us a readable set of bite-sized changes that
prepare a solid foundation to rebuild things on top.  I am having a
feeling that not even a single person has reviewed them on list even
though we are already at the third iteration, which is quite
frustrating (and I would imagine that it would be frustrating for
you, too), and I suspect that the step like [v3 03/10] that makes
too large a change with too little explanation (and perhaps a bit of
"trust me, this does not change the behaviour") is one contributing
factor why people are afraid of touching it.

^ permalink raw reply

* Re: [PATCH v3 03/10] trailer: unify trailer formatting machinery
From: Linus Arver @ 2024-02-01  0:46 UTC (permalink / raw)
  To: Junio C Hamano, Linus Arver via GitGitGadget
  Cc: git, Christian Couder, Emily Shaffer, Josh Steadmon,
	Randall S. Becker
In-Reply-To: <xmqqa5ol409k.fsf@gitster.g>

Junio C Hamano <gitster@pobox.com> writes:

> Junio C Hamano <gitster@pobox.com> writes:
>
>>> +				if (opts->unfold)
>>> +					unfold_value(&val);
>>
>> So, ... how would this affect an invocation of
>>
>>     $ git interpret-trailers --unfold
>>
>> which would have set opts.unfold to true in cmd_interpret_trailers()
>> and eventually called process_trailers() with it, which eventually
>> called print_all(), which conditionally called print_tok_val() but
>> never gave the val to unfold_value()?
>>
>>  ... goes and digs a bit more ...
>>
>> Ahhhh, original process_trailers() did this by calling
>> parse_trailers() that munged the value.  So its print_all()
>> codepath only had to print what has already been munged.
>>
>> But then in this new code, do we unfold only here?  I thought that
>> in the new code you moved around, the caller, whose name is now
>> interpret_trailers(), still calls parse_trailers() and that calls
>> the unfold_value().  So, are we doing redundant work that may merely
>> be unneeded (if we are lucky) or could be harmful (if things, other
>> than the unfold thing I just noticed, that are done both in
>> parse_trailers() and this function are not idempotent)?
>
> I was too confused by the code flow and resorted to tracing what
> happens when "git interpret-trailers --unfold" is run in a way
> similar to how t7513 does in gdb.  Shame shame.

In my larger series I've made the parser and formater much simpler
(removing nesting, calling helper functions, _only parsing once and only
once_, etc) to make both parsing and formatting much easier to
understand. While I am biased as the author of these refactors, I do
think they make the code simpler.

> In any case, the updated code does try to call unfold_value() in
> format_trailers() on "val" that has already been unfolded in
> parse_trailers().

Correct. But this was already the case before this series. IOW the
existing code assumes that this function is idempotent: we call
unfold_value() in parse_trailers() and again in format_trailer_info().

In my tests if I remove the redundant call to unfold_value() in
the new formatter (so that unfolding only happens during
parse_trailers()), all trailer-related tests still pass:

    prove -j47 t{7513,3507,7502,4205,3511,3428,6300}*.sh

FWIW in the larger series we prohibit the parser from making mutations
to the input (unfolding is one such mutation), and allow multiline
(folded) strings to be stored as is in "val", only unfolding the value
if we need to during formatting.

> This may not produce an incorrect result if
> unfold_value() is truly idempotent (I do not know), but even if it
> is correct for its handling of .unfold bit, this episode lowered my
> confidence level on the new code significantly, as I do not know
> what unintended effects [*] all the other new things done in this
> function have, or if none of these unintended effects are
> error-free.

I think dropping the redundant call to unfold_value() would help address
some of your concerns. Of course, all of the existing test cases pass
(with and without the redundant call). As for addressing _all_ (not just
some) concerns, I am not confident I can deliver that as an additional
patch or two to this series because ...

>> It could be that a bit more preparatory refactoring would clarify.
>> I dunno.

... this is basically what I've done in the larger series, because there
are many areas that could be simplified (not to mention addressing some
bugs some bugs are lurking around). Granted, those are more aggressive
refactorings, and are not "preparatory refactoring" at all.

I could just grow this series with another ~22 patches to include those
additional refactors, but I am hesitant about doing so, simply due to
the sheer number of them.

How would you like me to proceed, other than deleting the redundant
unfold_value() call in the next reroll?

For reference, here's a list of those 22 commits that build on this
series as a preview (roughly grouped on themes):

    trailer formatter: deprecate format_trailers()
    trailer formatter: introduce format_trailer_block()
    trailer parser: parse_trailer_v2() for '--trailer <...>' CLI arguments
    trailer parser: parse_trailer_v2() for configuration
    trailer parser: parse_trailer_v2() for trailer blocks
    trailer parser: introduce parse_trailer_v2()
    interpret-trailers: preserve trailers coming from the input
    trailer_block: remove trailer_strings member
    trailer_block: parse trailers in one place
    trailer_block: prepare to remove trailer_strings member
    trailer: make find_same_and_apply_arg() do one thing
    trailer: unify global configuration
    interpret-trailers: print error if given unrecognized arguments
    trailer: make entire iterator struct private
    trailer: rename terms for consistency
    trailer: use create_new_target_for() everywhere
    trailer: capture behavior of addIfDifferentNeighbor
    trailer: EXISTS_REPLACE preserves original position
    trailer: capture replacement behavior for multiple matches
    trailer: split template application into 2 steps
    trailer: free templates in one place
    trailer: template's final value does not depend on in_tok

and "trailer formatter: introduce format_trailer_block()" is the one
that refactors the unified formatter introduced in this patch to have
multiple smaller helper functions.

^ permalink raw reply

* Re: [PATCH v3 03/10] trailer: unify trailer formatting machinery
From: Linus Arver @ 2024-01-31 23:29 UTC (permalink / raw)
  To: Junio C Hamano, Linus Arver via GitGitGadget
  Cc: git, Christian Couder, Emily Shaffer, Josh Steadmon,
	Randall S. Becker
In-Reply-To: <xmqqy1c545y0.fsf@gitster.g>

Junio C Hamano <gitster@pobox.com> writes:

> "Linus Arver via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> From: Linus Arver <linusa@google.com>
>>
>> Currently have two functions for formatting trailers exposed in
>> trailer.h:
>>
>>     void format_trailers(FILE *outfile, struct list_head *head,
>>                         const struct process_trailer_options *opts);
>>
>>     void format_trailers_from_commit(struct strbuf *out, const char *msg,
>>                                     const struct process_trailer_options *opts);
>>
>> and previously these functions, although similar enough (even taking the
>> same process_trailer_options struct pointer), did not build on each
>> other.
>>
>> Make format_trailers_from_commit() rely on format_trailers(). Teach
>> format_trailers() to process trailers with the additional
>> process_trailer_options fields like opts->key_only which is only used by
>> format_trailers_from_commit() and not builtin/interpret-trailers.c.
>> While we're at it, reorder parameters to put the trailer processing
>> options first, and the out parameter (strbuf we write into) at the end.
>>
>> This unification will allow us to delete the format_trailer_info() and
>> print_tok_val() functions in the next patch. They are not deleted here
>> in order to keep the diff small.
>>
>> Helped-by: Junio C Hamano <gitster@pobox.com>
>> Signed-off-by: Linus Arver <linusa@google.com>
>> ---
>
> I am not sure how I helped this particularly (say, compared to all
> the other patches on the list from everybody), though.

You made a suggestion to put trailer processing options first for
function parameters, which I've adopted in this series (but also in the
larger series).

^ permalink raw reply

* Re: [PATCH v3 03/10] trailer: unify trailer formatting machinery
From: Linus Arver @ 2024-01-31 23:21 UTC (permalink / raw)
  To: Josh Steadmon, Linus Arver via GitGitGadget
  Cc: git, Christian Couder, Junio C Hamano, Emily Shaffer,
	Randall S. Becker
In-Reply-To: <ZbqnSNPjIQW3Durz@google.com>

Josh Steadmon <steadmon@google.com> writes:

> On 2024.01.31 01:22, Linus Arver via GitGitGadget wrote:
>> This unification will allow us to delete the format_trailer_info() and
>> print_tok_val() functions in the next patch. They are not deleted here
>> in order to keep the diff small.
>
> Needs to be removed after squashing v2 patch 4 :)

Oops. Will update in next reroll, thanks.

^ permalink raw reply

* Re: [PATCH v3 02/10] trailer: move interpret_trailers() to interpret-trailers.c
From: Linus Arver @ 2024-01-31 23:20 UTC (permalink / raw)
  To: Junio C Hamano, Linus Arver via GitGitGadget
  Cc: git, Christian Couder, Emily Shaffer, Josh Steadmon,
	Randall S. Becker
In-Reply-To: <xmqqy1c55o6a.fsf@gitster.g>

Junio C Hamano <gitster@pobox.com> writes:

> "Linus Arver via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> From: Linus Arver <linusa@google.com>
>>
>> The interpret-trailers.c builtin is the only place we need to call
>> interpret_trailers(), so move its definition there.
>
> A few helper functions that are only called by interpret_trailers()
> are also moved, naturally.  I would have less surprised to see
> the addtion near the beginning of builtin/interpret-trailers.c if
> this part said:
>
>     ..., so move its definition there, together with a few helper
>     functions called only by it, and remove its external declaration
>     from <trailer.h>.

Will add in next reroll.

>> Delete the corresponding declaration from trailer.h, which then forces
>> us to expose the working innards of that function.
>
> This was a bit confusing, at least to me.  The reason why several
> other helper functions that are called by interpret_trailers() need
> to be declared in trailer.h is not because the declaration of
> interpret_trailers() is deleted from trailer.h but that is how I
> read the above.
>
>     Several helper functions that are called by interpret_trailers()
>     remain in trailer.c because other callers in the same file still
>     call them, so add declaration for them to <trailer.h>.

I've adopted most of this language in the next reroll to make the
reasoning a bit easier to follow. Thanks.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox