Git development
 help / color / mirror / Atom feed
* [PATCH v3 06/11] reftable/stack: reuse buffers when reloading stack
From: Patrick Steinhardt @ 2023-12-11  9:07 UTC (permalink / raw)
  To: git; +Cc: Han-Wen Nienhuys, Jonathan Nieder, Taylor Blau, Eric Sunshine
In-Reply-To: <cover.1702285387.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 2436 bytes --]

In `reftable_stack_reload_once()` we iterate over all the tables added
to the stack in order to figure out whether any of the tables needs to
be reloaded. We use a set of buffers in this context to compute the
paths of these tables, but discard those buffers on every iteration.
This is quite wasteful given that we do not need to transfer ownership
of the allocated buffer outside of the loop.

Refactor the code to instead reuse the buffers to reduce the number of
allocations we need to do. Note that we do not have to manually reset
the buffer because `stack_filename()` does this for us already.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/stack.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/reftable/stack.c b/reftable/stack.c
index f5d18a842a..2dd2373360 100644
--- a/reftable/stack.c
+++ b/reftable/stack.c
@@ -204,6 +204,7 @@ static int reftable_stack_reload_once(struct reftable_stack *st, char **names,
 		reftable_calloc(sizeof(struct reftable_table) * names_len);
 	int new_readers_len = 0;
 	struct reftable_merged_table *new_merged = NULL;
+	struct strbuf table_path = STRBUF_INIT;
 	int i;
 
 	while (*names) {
@@ -223,13 +224,10 @@ static int reftable_stack_reload_once(struct reftable_stack *st, char **names,
 
 		if (!rd) {
 			struct reftable_block_source src = { NULL };
-			struct strbuf table_path = STRBUF_INIT;
 			stack_filename(&table_path, st, name);
 
 			err = reftable_block_source_from_file(&src,
 							      table_path.buf);
-			strbuf_release(&table_path);
-
 			if (err < 0)
 				goto done;
 
@@ -267,16 +265,13 @@ static int reftable_stack_reload_once(struct reftable_stack *st, char **names,
 	for (i = 0; i < cur_len; i++) {
 		if (cur[i]) {
 			const char *name = reader_name(cur[i]);
-			struct strbuf filename = STRBUF_INIT;
-			stack_filename(&filename, st, name);
+			stack_filename(&table_path, st, name);
 
 			reader_close(cur[i]);
 			reftable_reader_free(cur[i]);
 
 			/* On Windows, can only unlink after closing. */
-			unlink(filename.buf);
-
-			strbuf_release(&filename);
+			unlink(table_path.buf);
 		}
 	}
 
@@ -288,6 +283,7 @@ static int reftable_stack_reload_once(struct reftable_stack *st, char **names,
 	reftable_free(new_readers);
 	reftable_free(new_tables);
 	reftable_free(cur);
+	strbuf_release(&table_path);
 	return err;
 }
 
-- 
2.43.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v3 07/11] reftable/stack: fix stale lock when dying
From: Patrick Steinhardt @ 2023-12-11  9:07 UTC (permalink / raw)
  To: git; +Cc: Han-Wen Nienhuys, Jonathan Nieder, Taylor Blau, Eric Sunshine
In-Reply-To: <cover.1702285387.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 4525 bytes --]

When starting a transaction via `reftable_stack_init_addition()`, we
create a lockfile for the reftable stack itself which we'll write the
new list of tables to. But if we terminate abnormally e.g. via a call to
`die()`, then we do not remove the lockfile. Subsequent executions of
Git which try to modify references will thus fail with an out-of-date
error.

Fix this bug by registering the lock as a `struct tempfile`, which
ensures automatic cleanup for us.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/stack.c | 47 +++++++++++++++--------------------------------
 1 file changed, 15 insertions(+), 32 deletions(-)

diff --git a/reftable/stack.c b/reftable/stack.c
index 2dd2373360..0c235724e2 100644
--- a/reftable/stack.c
+++ b/reftable/stack.c
@@ -17,6 +17,8 @@ license that can be found in the LICENSE file or at
 #include "reftable-merged.h"
 #include "writer.h"
 
+#include "tempfile.h"
+
 static int stack_try_add(struct reftable_stack *st,
 			 int (*write_table)(struct reftable_writer *wr,
 					    void *arg),
@@ -440,8 +442,7 @@ static void format_name(struct strbuf *dest, uint64_t min, uint64_t max)
 }
 
 struct reftable_addition {
-	int lock_file_fd;
-	struct strbuf lock_file_name;
+	struct tempfile *lock_file;
 	struct reftable_stack *stack;
 
 	char **new_tables;
@@ -449,24 +450,19 @@ struct reftable_addition {
 	uint64_t next_update_index;
 };
 
-#define REFTABLE_ADDITION_INIT                \
-	{                                     \
-		.lock_file_name = STRBUF_INIT \
-	}
+#define REFTABLE_ADDITION_INIT {0}
 
 static int reftable_stack_init_addition(struct reftable_addition *add,
 					struct reftable_stack *st)
 {
+	struct strbuf lock_file_name = STRBUF_INIT;
 	int err = 0;
 	add->stack = st;
 
-	strbuf_reset(&add->lock_file_name);
-	strbuf_addstr(&add->lock_file_name, st->list_file);
-	strbuf_addstr(&add->lock_file_name, ".lock");
+	strbuf_addf(&lock_file_name, "%s.lock", st->list_file);
 
-	add->lock_file_fd = open(add->lock_file_name.buf,
-				 O_EXCL | O_CREAT | O_WRONLY, 0666);
-	if (add->lock_file_fd < 0) {
+	add->lock_file = create_tempfile(lock_file_name.buf);
+	if (!add->lock_file) {
 		if (errno == EEXIST) {
 			err = REFTABLE_LOCK_ERROR;
 		} else {
@@ -475,7 +471,7 @@ static int reftable_stack_init_addition(struct reftable_addition *add,
 		goto done;
 	}
 	if (st->config.default_permissions) {
-		if (chmod(add->lock_file_name.buf, st->config.default_permissions) < 0) {
+		if (chmod(add->lock_file->filename.buf, st->config.default_permissions) < 0) {
 			err = REFTABLE_IO_ERROR;
 			goto done;
 		}
@@ -495,6 +491,7 @@ static int reftable_stack_init_addition(struct reftable_addition *add,
 	if (err) {
 		reftable_addition_close(add);
 	}
+	strbuf_release(&lock_file_name);
 	return err;
 }
 
@@ -512,15 +509,7 @@ static void reftable_addition_close(struct reftable_addition *add)
 	add->new_tables = NULL;
 	add->new_tables_len = 0;
 
-	if (add->lock_file_fd > 0) {
-		close(add->lock_file_fd);
-		add->lock_file_fd = 0;
-	}
-	if (add->lock_file_name.len > 0) {
-		unlink(add->lock_file_name.buf);
-		strbuf_release(&add->lock_file_name);
-	}
-
+	delete_tempfile(&add->lock_file);
 	strbuf_release(&nm);
 }
 
@@ -536,8 +525,10 @@ void reftable_addition_destroy(struct reftable_addition *add)
 int reftable_addition_commit(struct reftable_addition *add)
 {
 	struct strbuf table_list = STRBUF_INIT;
+	int lock_file_fd = get_tempfile_fd(add->lock_file);
 	int i = 0;
 	int err = 0;
+
 	if (add->new_tables_len == 0)
 		goto done;
 
@@ -550,28 +541,20 @@ int reftable_addition_commit(struct reftable_addition *add)
 		strbuf_addstr(&table_list, "\n");
 	}
 
-	err = write_in_full(add->lock_file_fd, table_list.buf, table_list.len);
+	err = write_in_full(lock_file_fd, table_list.buf, table_list.len);
 	strbuf_release(&table_list);
 	if (err < 0) {
 		err = REFTABLE_IO_ERROR;
 		goto done;
 	}
 
-	err = close(add->lock_file_fd);
-	add->lock_file_fd = 0;
-	if (err < 0) {
-		err = REFTABLE_IO_ERROR;
-		goto done;
-	}
-
-	err = rename(add->lock_file_name.buf, add->stack->list_file);
+	err = rename_tempfile(&add->lock_file, add->stack->list_file);
 	if (err < 0) {
 		err = REFTABLE_IO_ERROR;
 		goto done;
 	}
 
 	/* success, no more state to clean up. */
-	strbuf_release(&add->lock_file_name);
 	for (i = 0; i < add->new_tables_len; i++) {
 		reftable_free(add->new_tables[i]);
 	}
-- 
2.43.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v3 08/11] reftable/stack: fix use of unseeded randomness
From: Patrick Steinhardt @ 2023-12-11  9:07 UTC (permalink / raw)
  To: git; +Cc: Han-Wen Nienhuys, Jonathan Nieder, Taylor Blau, Eric Sunshine
In-Reply-To: <cover.1702285387.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 2117 bytes --]

When writing a new reftable stack, Git will first create the stack with
a random suffix so that concurrent updates will not try to write to the
same file. This random suffix is computed via a call to rand(3P). But we
never seed the function via srand(3P), which means that the suffix is in
fact always the same.

Fix this bug by using `git_rand()` instead, which does not need to be
initialized. While this function is likely going to be slower depending
on the platform, this slowness should not matter in practice as we only
use it when writing a new reftable stack.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/readwrite_test.c | 6 +++---
 reftable/stack.c          | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/reftable/readwrite_test.c b/reftable/readwrite_test.c
index 469ab79a5a..278663f22d 100644
--- a/reftable/readwrite_test.c
+++ b/reftable/readwrite_test.c
@@ -141,8 +141,8 @@ static void test_log_buffer_size(void)
 	*/
 	uint8_t hash1[GIT_SHA1_RAWSZ], hash2[GIT_SHA1_RAWSZ];
 	for (i = 0; i < GIT_SHA1_RAWSZ; i++) {
-		hash1[i] = (uint8_t)(rand() % 256);
-		hash2[i] = (uint8_t)(rand() % 256);
+		hash1[i] = (uint8_t)(git_rand() % 256);
+		hash2[i] = (uint8_t)(git_rand() % 256);
 	}
 	log.value.update.old_hash = hash1;
 	log.value.update.new_hash = hash2;
@@ -320,7 +320,7 @@ static void test_log_zlib_corruption(void)
 	};
 
 	for (i = 0; i < sizeof(message) - 1; i++)
-		message[i] = (uint8_t)(rand() % 64 + ' ');
+		message[i] = (uint8_t)(git_rand() % 64 + ' ');
 
 	reftable_writer_set_limits(w, 1, 1);
 
diff --git a/reftable/stack.c b/reftable/stack.c
index 0c235724e2..16bab82063 100644
--- a/reftable/stack.c
+++ b/reftable/stack.c
@@ -434,7 +434,7 @@ int reftable_stack_add(struct reftable_stack *st,
 static void format_name(struct strbuf *dest, uint64_t min, uint64_t max)
 {
 	char buf[100];
-	uint32_t rnd = (uint32_t)rand();
+	uint32_t rnd = (uint32_t)git_rand();
 	snprintf(buf, sizeof(buf), "0x%012" PRIx64 "-0x%012" PRIx64 "-%08x",
 		 min, max, rnd);
 	strbuf_reset(dest);
-- 
2.43.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v3 09/11] reftable/merged: reuse buffer to compute record keys
From: Patrick Steinhardt @ 2023-12-11  9:08 UTC (permalink / raw)
  To: git; +Cc: Han-Wen Nienhuys, Jonathan Nieder, Taylor Blau, Eric Sunshine
In-Reply-To: <cover.1702285387.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 3361 bytes --]

When iterating over entries in the merged iterator's queue, we compute
the key of each of the entries and write it into a buffer. We do not
reuse the buffer though and thus re-allocate it on every iteration,
which is wasteful given that we never transfer ownership of the
allocated bytes outside of the loop.

Refactor the code to reuse the buffer. This also fixes a potential
memory leak when `merged_iter_advance_subiter()` returns an error.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/merged.c | 31 ++++++++++++++++---------------
 reftable/merged.h |  2 ++
 2 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/reftable/merged.c b/reftable/merged.c
index 5ded470c08..556bb5c556 100644
--- a/reftable/merged.c
+++ b/reftable/merged.c
@@ -52,6 +52,8 @@ static void merged_iter_close(void *p)
 		reftable_iterator_destroy(&mi->stack[i]);
 	}
 	reftable_free(mi->stack);
+	strbuf_release(&mi->key);
+	strbuf_release(&mi->entry_key);
 }
 
 static int merged_iter_advance_nonnull_subiter(struct merged_iter *mi,
@@ -85,7 +87,6 @@ static int merged_iter_advance_subiter(struct merged_iter *mi, size_t idx)
 static int merged_iter_next_entry(struct merged_iter *mi,
 				  struct reftable_record *rec)
 {
-	struct strbuf entry_key = STRBUF_INIT;
 	struct pq_entry entry = { 0 };
 	int err = 0;
 
@@ -105,33 +106,31 @@ static int merged_iter_next_entry(struct merged_iter *mi,
 	  such a deployment, the loop below must be changed to collect all
 	  entries for the same key, and return new the newest one.
 	*/
-	reftable_record_key(&entry.rec, &entry_key);
+	reftable_record_key(&entry.rec, &mi->entry_key);
 	while (!merged_iter_pqueue_is_empty(mi->pq)) {
 		struct pq_entry top = merged_iter_pqueue_top(mi->pq);
-		struct strbuf k = STRBUF_INIT;
-		int err = 0, cmp = 0;
+		int cmp = 0;
 
-		reftable_record_key(&top.rec, &k);
+		reftable_record_key(&top.rec, &mi->key);
 
-		cmp = strbuf_cmp(&k, &entry_key);
-		strbuf_release(&k);
-
-		if (cmp > 0) {
+		cmp = strbuf_cmp(&mi->key, &mi->entry_key);
+		if (cmp > 0)
 			break;
-		}
 
 		merged_iter_pqueue_remove(&mi->pq);
 		err = merged_iter_advance_subiter(mi, top.index);
-		if (err < 0) {
-			return err;
-		}
+		if (err < 0)
+			goto done;
 		reftable_record_release(&top.rec);
 	}
 
 	reftable_record_copy_from(rec, &entry.rec, hash_size(mi->hash_id));
+
+done:
 	reftable_record_release(&entry.rec);
-	strbuf_release(&entry_key);
-	return 0;
+	strbuf_release(&mi->entry_key);
+	strbuf_release(&mi->key);
+	return err;
 }
 
 static int merged_iter_next(struct merged_iter *mi, struct reftable_record *rec)
@@ -248,6 +247,8 @@ static int merged_table_seek_record(struct reftable_merged_table *mt,
 		.typ = reftable_record_type(rec),
 		.hash_id = mt->hash_id,
 		.suppress_deletions = mt->suppress_deletions,
+		.key = STRBUF_INIT,
+		.entry_key = STRBUF_INIT,
 	};
 	int n = 0;
 	int err = 0;
diff --git a/reftable/merged.h b/reftable/merged.h
index 7d9f95d27e..d5b39dfe7f 100644
--- a/reftable/merged.h
+++ b/reftable/merged.h
@@ -31,6 +31,8 @@ struct merged_iter {
 	uint8_t typ;
 	int suppress_deletions;
 	struct merged_iter_pqueue pq;
+	struct strbuf key;
+	struct strbuf entry_key;
 };
 
 void merged_table_release(struct reftable_merged_table *mt);
-- 
2.43.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v3 10/11] reftable/block: introduce macro to initialize `struct block_iter`
From: Patrick Steinhardt @ 2023-12-11  9:08 UTC (permalink / raw)
  To: git; +Cc: Han-Wen Nienhuys, Jonathan Nieder, Taylor Blau, Eric Sunshine
In-Reply-To: <cover.1702285387.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 3434 bytes --]

There are a bunch of locations where we initialize members of `struct
block_iter`, which makes it harder than necessary to expand this struct
to have additional members. Unify the logic via a new `BLOCK_ITER_INIT`
macro that initializes all members.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/block.c      | 4 +---
 reftable/block.h      | 4 ++++
 reftable/block_test.c | 4 ++--
 reftable/iter.h       | 8 ++++----
 reftable/reader.c     | 7 +++----
 5 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/reftable/block.c b/reftable/block.c
index 34d4d07369..8c6a8c77fc 100644
--- a/reftable/block.c
+++ b/reftable/block.c
@@ -389,9 +389,7 @@ int block_reader_seek(struct block_reader *br, struct block_iter *it,
 	struct reftable_record rec = reftable_new_record(block_reader_type(br));
 	struct strbuf key = STRBUF_INIT;
 	int err = 0;
-	struct block_iter next = {
-		.last_key = STRBUF_INIT,
-	};
+	struct block_iter next = BLOCK_ITER_INIT;
 
 	int i = binsearch(br->restart_count, &restart_key_less, &args);
 	if (args.error) {
diff --git a/reftable/block.h b/reftable/block.h
index 87c77539b5..51699af233 100644
--- a/reftable/block.h
+++ b/reftable/block.h
@@ -86,6 +86,10 @@ struct block_iter {
 	struct strbuf last_key;
 };
 
+#define BLOCK_ITER_INIT { \
+	.last_key = STRBUF_INIT, \
+}
+
 /* initializes a block reader. */
 int block_reader_init(struct block_reader *br, struct reftable_block *bl,
 		      uint32_t header_off, uint32_t table_block_size,
diff --git a/reftable/block_test.c b/reftable/block_test.c
index cb88af4a56..c00bbc8aed 100644
--- a/reftable/block_test.c
+++ b/reftable/block_test.c
@@ -32,7 +32,7 @@ static void test_block_read_write(void)
 	int i = 0;
 	int n;
 	struct block_reader br = { 0 };
-	struct block_iter it = { .last_key = STRBUF_INIT };
+	struct block_iter it = BLOCK_ITER_INIT;
 	int j = 0;
 	struct strbuf want = STRBUF_INIT;
 
@@ -87,7 +87,7 @@ static void test_block_read_write(void)
 	block_iter_close(&it);
 
 	for (i = 0; i < N; i++) {
-		struct block_iter it = { .last_key = STRBUF_INIT };
+		struct block_iter it = BLOCK_ITER_INIT;
 		strbuf_reset(&want);
 		strbuf_addstr(&want, names[i]);
 
diff --git a/reftable/iter.h b/reftable/iter.h
index 09eb0cbfa5..47d67d84df 100644
--- a/reftable/iter.h
+++ b/reftable/iter.h
@@ -53,10 +53,10 @@ struct indexed_table_ref_iter {
 	int is_finished;
 };
 
-#define INDEXED_TABLE_REF_ITER_INIT                                     \
-	{                                                               \
-		.cur = { .last_key = STRBUF_INIT }, .oid = STRBUF_INIT, \
-	}
+#define INDEXED_TABLE_REF_ITER_INIT { \
+	.cur = BLOCK_ITER_INIT, \
+	.oid = STRBUF_INIT, \
+}
 
 void iterator_from_indexed_table_ref_iter(struct reftable_iterator *it,
 					  struct indexed_table_ref_iter *itr);
diff --git a/reftable/reader.c b/reftable/reader.c
index b4db23ce18..9de64f50b4 100644
--- a/reftable/reader.c
+++ b/reftable/reader.c
@@ -224,10 +224,9 @@ struct table_iter {
 	struct block_iter bi;
 	int is_finished;
 };
-#define TABLE_ITER_INIT                          \
-	{                                        \
-		.bi = {.last_key = STRBUF_INIT } \
-	}
+#define TABLE_ITER_INIT { \
+	.bi = BLOCK_ITER_INIT \
+}
 
 static void table_iter_copy_from(struct table_iter *dest,
 				 struct table_iter *src)
-- 
2.43.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* [PATCH v3 11/11] reftable/block: reuse buffer to compute record keys
From: Patrick Steinhardt @ 2023-12-11  9:08 UTC (permalink / raw)
  To: git; +Cc: Han-Wen Nienhuys, Jonathan Nieder, Taylor Blau, Eric Sunshine
In-Reply-To: <cover.1702285387.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 3135 bytes --]

When iterating over entries in the block iterator we compute the key of
each of the entries and write it into a buffer. We do not reuse the
buffer though and thus re-allocate it on every iteration, which is
wasteful.

Refactor the code to reuse the buffer.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/block.c | 19 ++++++++-----------
 reftable/block.h |  2 ++
 2 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/reftable/block.c b/reftable/block.c
index 8c6a8c77fc..1df3d8a0f0 100644
--- a/reftable/block.c
+++ b/reftable/block.c
@@ -323,30 +323,28 @@ int block_iter_next(struct block_iter *it, struct reftable_record *rec)
 		.len = it->br->block_len - it->next_off,
 	};
 	struct string_view start = in;
-	struct strbuf key = STRBUF_INIT;
 	uint8_t extra = 0;
 	int n = 0;
 
 	if (it->next_off >= it->br->block_len)
 		return 1;
 
-	n = reftable_decode_key(&key, &extra, it->last_key, in);
+	n = reftable_decode_key(&it->key, &extra, it->last_key, in);
 	if (n < 0)
 		return -1;
 
-	if (!key.len)
+	if (!it->key.len)
 		return REFTABLE_FORMAT_ERROR;
 
 	string_view_consume(&in, n);
-	n = reftable_record_decode(rec, key, extra, in, it->br->hash_size);
+	n = reftable_record_decode(rec, it->key, extra, in, it->br->hash_size);
 	if (n < 0)
 		return -1;
 	string_view_consume(&in, n);
 
 	strbuf_reset(&it->last_key);
-	strbuf_addbuf(&it->last_key, &key);
+	strbuf_addbuf(&it->last_key, &it->key);
 	it->next_off += start.len - in.len;
-	strbuf_release(&key);
 	return 0;
 }
 
@@ -377,6 +375,7 @@ int block_iter_seek(struct block_iter *it, struct strbuf *want)
 void block_iter_close(struct block_iter *it)
 {
 	strbuf_release(&it->last_key);
+	strbuf_release(&it->key);
 }
 
 int block_reader_seek(struct block_reader *br, struct block_iter *it,
@@ -387,7 +386,6 @@ int block_reader_seek(struct block_reader *br, struct block_iter *it,
 		.r = br,
 	};
 	struct reftable_record rec = reftable_new_record(block_reader_type(br));
-	struct strbuf key = STRBUF_INIT;
 	int err = 0;
 	struct block_iter next = BLOCK_ITER_INIT;
 
@@ -414,8 +412,8 @@ int block_reader_seek(struct block_reader *br, struct block_iter *it,
 		if (err < 0)
 			goto done;
 
-		reftable_record_key(&rec, &key);
-		if (err > 0 || strbuf_cmp(&key, want) >= 0) {
+		reftable_record_key(&rec, &it->key);
+		if (err > 0 || strbuf_cmp(&it->key, want) >= 0) {
 			err = 0;
 			goto done;
 		}
@@ -424,8 +422,7 @@ int block_reader_seek(struct block_reader *br, struct block_iter *it,
 	}
 
 done:
-	strbuf_release(&key);
-	strbuf_release(&next.last_key);
+	block_iter_close(&next);
 	reftable_record_release(&rec);
 
 	return err;
diff --git a/reftable/block.h b/reftable/block.h
index 51699af233..17481e6331 100644
--- a/reftable/block.h
+++ b/reftable/block.h
@@ -84,10 +84,12 @@ struct block_iter {
 
 	/* key for last entry we read. */
 	struct strbuf last_key;
+	struct strbuf key;
 };
 
 #define BLOCK_ITER_INIT { \
 	.last_key = STRBUF_INIT, \
+	.key = STRBUF_INIT, \
 }
 
 /* initializes a block reader. */
-- 
2.43.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* Re: [PATCH v2 02/11] reftable: handle interrupted reads
From: Patrick Steinhardt @ 2023-12-11  9:08 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Han-Wen Nienhuys, Jonathan Nieder
In-Reply-To: <ZXOK2o0jTgLmxPWZ@nand.local>

[-- Attachment #1: Type: text/plain, Size: 658 bytes --]

On Fri, Dec 08, 2023 at 04:30:02PM -0500, Taylor Blau wrote:
> On Fri, Dec 08, 2023 at 03:53:02PM +0100, Patrick Steinhardt wrote:
> > There are calls to pread(3P) and read(3P) where we don't properly handle
> > interrupts. Convert them to use `pread_in_full()` and `read_in_full()`,
> 
> Just checking... do you mean "interrupt" in the kernel sense? Or are you
> referring to the possibility of short reads/writes (in later patches)?

Both. The callsites I'm converting are explicitly checking that they get
the exact number of requested bytes. That means that we'll have to loop
both around EINTR/EAGAIN, but also around short reads.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v2 04/11] reftable/stack: verify that `reftable_stack_add()` uses auto-compaction
From: Patrick Steinhardt @ 2023-12-11  9:08 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Taylor Blau, git, Han-Wen Nienhuys, Jonathan Nieder
In-Reply-To: <CAPig+cRGZvyhSs9=3-tkBKRZDjDUsb-VDs+dzOaZof__qyBjbA@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1910 bytes --]

On Fri, Dec 08, 2023 at 06:46:33PM -0500, Eric Sunshine wrote:
> On Fri, Dec 8, 2023 at 4:35 PM Taylor Blau <me@ttaylorr.com> wrote:
> > On Fri, Dec 08, 2023 at 03:53:10PM +0100, Patrick Steinhardt wrote:
> > > +static void test_reftable_stack_add_performs_auto_compaction(void)
> > > +{
> > > +             char name[100];
> > > +             snprintf(name, sizeof(name), "branch%04d", i);
> > > +             ref.refname = name;
> >
> > Is there a reason that we have to use snprintf() here and not a strbuf?
> >
> > I would have expected to see something like:
> >
> >     struct strbuf buf = STRBUF_INIT;
> >     /* ... */
> >     strbuf_addf(&buf, "branch%04d", i);
> >     ref.refname = strbuf_detach(&buf, NULL);
> 
> If I'm reading the code correctly, this use of strbuf would leak each
> time through the loop.
> 
> > I guess it doesn't matter too much, but I think if we can avoid using
> > snprintf(), it's worth doing. If we must use snprintf() here, we should
> > probably use Git's xsnprintf() instead.
> 
> xstrfmt() from strbuf.h would be even simpler if the intention is to
> allocate a new string which will be freed later.
> 
> In this case, though, assuming I understand the intent, I think the
> more common and safe idiom in this codebase is something like this:
> 
>     struct strbuf name = STRBUF_INIT;
>     strbuf_addstr(&name, "branch");
>     size_t len = name.len;
>     for (...) {
>         strbuf_setlen(&name, len);
>         strbuf_addf(&name, "%04d", i);
>         ref.refname = name.buf;
>         ...
>     }
>     strbuf_release(&name);

Yeah, I'll convert this to use a `struct strbuf` instead. But instead of
tracking the length I'll just use a `strbuf_reset()` followed by
`strbuf_addf("branch-%04d")`. It's simpler to read and we don't need to
squeeze every last drop of performance out of this loop anyway.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v2 05/11] reftable/stack: perform auto-compaction with transactional interface
From: Patrick Steinhardt @ 2023-12-11  9:08 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Han-Wen Nienhuys, Jonathan Nieder
In-Reply-To: <ZXOVVWGyUtrGWYMB@nand.local>

[-- Attachment #1: Type: text/plain, Size: 546 bytes --]

On Fri, Dec 08, 2023 at 05:14:45PM -0500, Taylor Blau wrote:
> On Fri, Dec 08, 2023 at 03:53:14PM +0100, Patrick Steinhardt wrote:
> > [...] It can thus happen quite fast that the stack grows very long, which
> > results in performance issues when trying to read records.
> 
> This sentence was a little confusing to read at first glance. Perhaps
> instead:
> 
>     The stack can grow to become quite long rather quickly, leading to
>     performance issues when trying to read records.

Thanks, this reads better indeed.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v2 06/11] reftable/stack: reuse buffers when reloading stack
From: Patrick Steinhardt @ 2023-12-11  9:08 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Han-Wen Nienhuys, Jonathan Nieder
In-Reply-To: <ZXOV8TCqaH0xXRnS@nand.local>

[-- Attachment #1: Type: text/plain, Size: 1631 bytes --]

On Fri, Dec 08, 2023 at 05:17:21PM -0500, Taylor Blau wrote:
> On Fri, Dec 08, 2023 at 03:53:18PM +0100, Patrick Steinhardt wrote:
> > In `reftable_stack_reload_once()` we iterate over all the tables added
> > to the stack in order to figure out whether any of the tables needs to
> > be reloaded. We use a set of buffers in this context to compute the
> > paths of these tables, but discard those buffers on every iteration.
> > This is quite wasteful given that we do not need to transfer ownership
> > of the allocated buffer outside of the loop.
> >
> > Refactor the code to instead reuse the buffers to reduce the number of
> > allocations we need to do.
> 
> > @@ -267,16 +265,13 @@ static int reftable_stack_reload_once(struct reftable_stack *st, char **names,
> >  	for (i = 0; i < cur_len; i++) {
> >  		if (cur[i]) {
> >  			const char *name = reader_name(cur[i]);
> > -			struct strbuf filename = STRBUF_INIT;
> > -			stack_filename(&filename, st, name);
> > +			stack_filename(&table_path, st, name);
> 
> This initially caught me by surprise, but on closer inspection I agree
> that this is OK, since stack_filename() calls strbuf_reset() before
> adjusting the buffer contents.
> 
> (As a side-note, I do find the side-effect of stack_filename() to be a
> little surprising, but that's not the fault of this series and not worth
> changing here.)

Agreed, I also found this to be a bit confusing at first. I'll amend the
commit message with "Note that we do not have to manually reset the
buffer because `stack_filename()` does this for us already." to help
future readers.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v2 07/11] reftable/stack: fix stale lock when dying
From: Patrick Steinhardt @ 2023-12-11  9:08 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Han-Wen Nienhuys, Jonathan Nieder
In-Reply-To: <ZXOXpQstse8CdI7J@nand.local>

[-- Attachment #1: Type: text/plain, Size: 1471 bytes --]

On Fri, Dec 08, 2023 at 05:24:37PM -0500, Taylor Blau wrote:
> On Fri, Dec 08, 2023 at 03:53:23PM +0100, Patrick Steinhardt wrote:
> > When starting a transaction via `reftable_stack_init_addition()`, we
> > create a lockfile for the reftable stack itself which we'll write the
> > new list of tables to. But if we terminate abnormally e.g. via a call to
> > `die()`, then we do not remove the lockfile. Subsequent executions of
> > Git which try to modify references will thus fail with an out-of-date
> > error.
> >
> > Fix this bug by registering the lock as a `struct tempfile`, which
> > ensures automatic cleanup for us.
> 
> Makes sense.
> 
> > @@ -475,7 +471,7 @@ static int reftable_stack_init_addition(struct reftable_addition *add,
> >  		goto done;
> >  	}
> >  	if (st->config.default_permissions) {
> > -		if (chmod(add->lock_file_name.buf, st->config.default_permissions) < 0) {
> > +		if (chmod(lock_file_name.buf, st->config.default_permissions) < 0) {
> 
> Hmm. Would we want to use add->lock_file->filename.buf here instead? I
> don't think that it matters (other than that the lockfile's pathname is
> absolute). But it arguably makes it clearer to readers that we're
> touching calling chmod() on the lockfile here, and hardens us against
> the contents of the temporary strbuf changing.

It doesn't make much of a difference, but I can see how this might make
things a bit clearer to the reader. Will change.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v2 04/11] reftable/stack: verify that `reftable_stack_add()` uses auto-compaction
From: Eric Sunshine @ 2023-12-11  9:36 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Taylor Blau, git, Han-Wen Nienhuys, Jonathan Nieder
In-Reply-To: <ZXbRkOiD80zT7tC5@tanuki>

On Mon, Dec 11, 2023 at 4:08 AM Patrick Steinhardt <ps@pks.im> wrote:
> On Fri, Dec 08, 2023 at 06:46:33PM -0500, Eric Sunshine wrote:
> > In this case, though, assuming I understand the intent, I think the
> > more common and safe idiom in this codebase is something like this:
> >
> >     struct strbuf name = STRBUF_INIT;
> >     strbuf_addstr(&name, "branch");
> >     size_t len = name.len;
> >     for (...) {
> >         strbuf_setlen(&name, len);
> >         strbuf_addf(&name, "%04d", i);
> >         ref.refname = name.buf;
> >         ...
> >     }
> >     strbuf_release(&name);
>
> Yeah, I'll convert this to use a `struct strbuf` instead. But instead of
> tracking the length I'll just use a `strbuf_reset()` followed by
> `strbuf_addf("branch-%04d")`. It's simpler to read and we don't need to
> squeeze every last drop of performance out of this loop anyway.

Sounds perfectly reasonable to me, and I agree with the reasoning,
especially the lower cognitive load with strbuf_reset(). Thanks.

^ permalink raw reply

* Re: [PATCH 0/7] clone: fix init of refdb with wrong object format
From: Patrick Steinhardt @ 2023-12-11 11:34 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, brian m. carlson
In-Reply-To: <xmqq7clmn3w1.fsf@gitster.g>

[-- Attachment #1: Type: text/plain, Size: 3953 bytes --]

On Sat, Dec 09, 2023 at 07:16:30PM -0800, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > While at it I noticed that this actually fixes a bug with bundle URIs
> > when the object formats diverge in this way.
> > ...
> > This patch series is actually the last incompatibility for the reftable
> > backend that I have found. All tests except for the files-backend
> > specific ones pass now with the current state I have at [1], which is
> > currently at e6f2f592b7 (t: skip tests which are incompatible with
> > reftable, 2023-11-24)
> 
> An existing test
> 
>     $ make && cd t && GIT_TEST_DEFAULT_HASH=sha256 sh t5550-http-fetch-dumb.sh
> 
> passes with vanilla Git 2.43, but with these patches applied, it
> fails the "7 - empty dumb HTTP" step.

Indeed -- now that the GitLab CI definitions have landed in your master
branch I can also see this failure as part of our CI job [1]. And with
the NixOS httpd refactorings I'm actually able to run these tests in the
first place. Really happy to see that things come together like this, as
it means that we'll detect such issues before we send the series to the
mailing list from now on.

Anyway, regarding the test itself. I think the refactorings in this
patch series uncover a preexisting and already-known issue with empty
repositories when using the dumb HTTP protocol: there is no proper way
to know about the hash function that the remote repository uses [2].

Before my refactorings we used to fall back to the local default hash
format with which we've already initialized the repository, which is
wrong. Now we use the hash format we detected via the remote, which we
cannot detect because the remote is empty and does not advertise the
hash function, so we fall back to SHA1 and thus also do the wrong thing.
The only correct thing here would be to use the actual hash function
that the remote repository uses, but we have no to do so. So we're kind
of stuck here and can only choose between two different wrong ways to
pick the hash function.

We can restore the previous wrong behaviour by honoring GIT_DEFAULT_HASH
in git-remote-curl(1) in the case where we do not have a repository set
up yet. So something similar in spirit to the following:

```
diff --git a/remote-curl.c b/remote-curl.c
index fc29757b65..7e97c9c2e9 100644
--- a/remote-curl.c
+++ b/remote-curl.c
@@ -27,6 +27,7 @@
 static struct remote *remote;
 /* always ends with a trailing slash */
 static struct strbuf url = STRBUF_INIT;
+static int nongit;
 
 struct options {
 	int verbosity;
@@ -275,8 +276,30 @@ static const struct git_hash_algo *detect_hash_algo(struct discovery *heads)
 {
 	const char *p = memchr(heads->buf, '\t', heads->len);
 	int algo;
-	if (!p)
-		return the_hash_algo;
+
+	if (!p) {
+		const char *env;
+
+		if (!nongit)
+			return the_hash_algo;
+
+		/*
+		 * In case the remote neither advertises the hash format nor
+		 * any references we have no way to detect the correct hash
+		 * format. We can thus only guess what the remote is using,
+		 * where the best guess is to fall back to the default hash.
+		 */
+		env = getenv("GIT_DEFAULT_HASH");
+		if (env) {
+			algo = hash_algo_by_name(env);
+			if (algo == GIT_HASH_UNKNOWN)
+				die(_("unknown hash algorithm '%s'"), env);
+		} else {
+			algo = GIT_HASH_SHA1;
+		}
+
+		return &hash_algos[algo];
+	}
 
 	algo = hash_algo_by_length((p - heads->buf) / 2);
 	if (algo == GIT_HASH_UNKNOWN)
@@ -1521,7 +1544,6 @@ static int stateless_connect(const char *service_name)
 int cmd_main(int argc, const char **argv)
 {
 	struct strbuf buf = STRBUF_INIT;
-	int nongit;
 	int ret = 1;
 
 	setup_git_directory_gently(&nongit);
```

+Cc brian, as he's the author of [2].

Patrick

[1]: https://gitlab.com/gitlab-org/git/-/jobs/5723052108
[2]: ac093d0790 (remote-curl: detect algorithm for dumb HTTP by size, 2020-06-19)

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* Re: [PATCH 1/7] setup: extract function to create the refdb
From: Patrick Steinhardt @ 2023-12-11 11:34 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Karthik Nayak, git
In-Reply-To: <xmqq1qbw1f0j.fsf@gitster.g>

[-- Attachment #1: Type: text/plain, Size: 1787 bytes --]

On Sat, Dec 09, 2023 at 07:54:52AM +0900, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > On Wed, Dec 06, 2023 at 10:10:37PM +0100, Karthik Nayak wrote:
> >> On Wed, Dec 6, 2023 at 1:40 PM Patrick Steinhardt <ps@pks.im> wrote:
> >> > +static void create_reference_database(const char *initial_branch, int quiet)
> >> > +{
> >> > +       struct strbuf err = STRBUF_INIT;
> >> > +       int reinit = is_reinit();
> >> > +
> >> > +       /*
> >> > +        * We need to create a "refs" dir in any case so that older
> >> > +        * versions of git can tell that this is a repository.
> >> > +        */
> >> 
> >> How does this work though, even if an earlier version of git can tell
> >> that this is a repository, it still won't be able to read the reftable
> >> backend. In that sense, what do we achieve here?
> >
> > This is a good question, and there is related ongoing discussion about
> > this topic in the thread starting at [1]. There are a few benefits to
> > letting clients discover such repos even if they don't understand the
> > new reference backend format:
> >
> >   - They know to stop walking up the parent-directory chain. Otherwise a
> >     client might end up detecting a Git repository in the parent dir.
> >
> >   - The user gets a proper error message why the repository cannot be
> >     accessed. Instead of failing to detect the repository altogether we
> >     instead say that we don't understand the "extensions.refFormat"
> >     extension.
> 
> Yup, both are very good reasons.  Would it help to sneak a condensed
> version of it in the in-code comment, perhaps?

Sure, let's do so. I failed to condense this meaningfully, but hope that
the result will be okay regardless of that.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH 3/7] remote-curl: rediscover repository when fetching refs
From: Patrick Steinhardt @ 2023-12-11 11:35 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <xmqqmsukz3yr.fsf@gitster.g>

[-- Attachment #1: Type: text/plain, Size: 2173 bytes --]

On Sat, Dec 09, 2023 at 08:09:32AM +0900, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > We're about to change git-clone(1) so that we set up the reference
> > database at a later point. This change will cause git-remote-curl(1) to
> > not detect the repository anymore due to "HEAD" not having been created
> > yet at the time it spawns, and thus causes it to error out once it is
> > asked to fetch the references.
> >
> > We can address this issue by trying to re-discover the Git repository in
> > case none was detected at startup time. With this change, the clone will
> > look as following:
> >
> >   1. git-clone(1) sets up the initial repository, excluding the
> >      reference database.
> >
> >   2. git-clone(1) spawns git-remote-curl(1), which will be unable to
> >      detect the repository due to a missing "HEAD".
> >
> >   3. git-clone(1) asks git-remote-curl(1) to list remote references.
> >      This works just fine as this step does not require a local
> >      repository
> >
> >   4. git-clone(1) creates the reference database as it has now learned
> >      about the object format.
> 
> Sorry, but I am not sure I understand this step.  I assume you mean
> by "the object format" what hash function is used to index the
> objects (which can be learned from the remote "origin" in step 2 and
> we can choose to use the one they use), not what ref backend is used
> (which is purely a local matter and we do not need to know what is
> used at the "origin").

Yes, exactly. I'm never quite sure whether I should be saying "hash
function" or "object format". I'll convert the message to say "hash
function" instead to clarify.

> Why do we need to wait initializing ref backend until we learn what
> hash is being in use?

This is because of the reftable backend. With the files backend it never
mattered much, because we do not encode the object format anywhere. But
with the reftable backend we do indeed encode the object format in the
tables' header, so it's important to initialize it with the correct
format right from the start.

I'll amend the commit message.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: Re* [BUG] git-bisect man page description of terms command doesn't mention old/new support
From: Matthieu Moy @ 2023-12-11 12:34 UTC (permalink / raw)
  To: Junio C Hamano, Britton Kerin; +Cc: git, Christian Couder
In-Reply-To: <xmqqzfyjmk02.fsf@gitster.g>

On 12/9/23 17:13, Junio C Hamano wrote:

> so you could read that
> 
> 	git bisect terms --term-good
> 	git bisect terms --term-old
> 
> are the same thing, and when you squint your eyes, you can probably
> guess that
> 
> 	git bisect terms --term-bad
> 	git bisect terms --term-new
> 
> are the same.  But I agree that the documentation should not force
> you to guess.

Agreed.
> --- c/Documentation/git-bisect.txt
> +++ w/Documentation/git-bisect.txt
> @@ -20,7 +20,7 @@ on the subcommand:
>   		  [--no-checkout] [--first-parent] [<bad> [<good>...]] [--] [<paths>...]
>    git bisect (bad|new|<term-new>) [<rev>]
>    git bisect (good|old|<term-old>) [<rev>...]
> - git bisect terms [--term-good | --term-bad]
> + git bisect terms [--term-(good|old) | --term-(bad|new)]

Nit: just above we have the description for `bisect start` saying:

--term-{new,bad}=<term> --term-{old,good}=<term>

it probably makes sense to make both homogeneous (start with the same 
alternative, and make the {...,...} vs (...|...) notations consistent. 
The (...|...) notation seems the most common).

In any case, the patch looks good to me, thanks.

-- 
Matthieu Moy
https://matthieu-moy.fr/

^ permalink raw reply

* Re: [PATCH 0/7] clone: fix init of refdb with wrong object format
From: Junio C Hamano @ 2023-12-11 14:57 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, brian m. carlson
In-Reply-To: <ZXbzzlyWC3HTUyDA@tanuki>

Patrick Steinhardt <ps@pks.im> writes:

>> An existing test
>> 
>>     $ make && cd t && GIT_TEST_DEFAULT_HASH=sha256 sh t5550-http-fetch-dumb.sh
>> 
>> passes with vanilla Git 2.43, but with these patches applied, it
>> fails the "7 - empty dumb HTTP" step.
> ...
> Before my refactorings we used to fall back to the local default hash
> format with which we've already initialized the repository, which is
> wrong. Now we use the hash format we detected via the remote, which we
> cannot detect because the remote is empty and does not advertise the
> hash function, so we fall back to SHA1 and thus also do the wrong thing.

Yeah, that is why I did *not* say "the series *breaks* existing
tests".  It triggers a failure, and in this case, a test failure
does not mean the behaviour is broken because there is no correct
answer.  ... oh, wait.  There isn't?

I wonder if the right thing to do is to advertise the hash function
even from an absolutely empty repository.  There are no objects in
such a repository, but it should already know what hash function to
use when it creates its first object (determined at the repository
creation time), so that _could_ be advertised.  

> The only correct thing here would be to use the actual hash function
> that the remote repository uses, but we have no to do so.

We have "no way to do so"?  We have "not done so"?

It is possible for the client side to download the $GIT_DIR/config
file from the remote to learn what value extensions.objectFormat is
in use over there instead, I think, but at the same time, I highly
suspect that dumb HTTP outlived its usefulness to warrant such an
additional investment of engineering resource.

The simplest "fix" might be to leave what happens in this narrow
case (i.e. cloning over dumb HTTP from an empty repository)
undefined by not testing (or not insisting on one particular
outcome), but ...

> +Cc brian, as he's the author of [2].

... of course I trust Brian more than I trust myself in this area ;-)

> Patrick
>
> [1]: https://gitlab.com/gitlab-org/git/-/jobs/5723052108
> [2]: ac093d0790 (remote-curl: detect algorithm for dumb HTTP by size, 2020-06-19)

Thanks.

^ permalink raw reply

* Re: Re* [BUG] git-bisect man page description of terms command doesn't mention old/new support
From: Junio C Hamano @ 2023-12-11 15:05 UTC (permalink / raw)
  To: Matthieu Moy; +Cc: Britton Kerin, git, Christian Couder
In-Reply-To: <24a42fa6-7bc4-4a3b-8bf4-a0ef85dc457a@matthieu-moy.fr>

Matthieu Moy <git@matthieu-moy.fr> writes:

> Nit: just above we have the description for `bisect start` saying:
>
> --term-{new,bad}=<term> --term-{old,good}=<term>
>
> it probably makes sense to make both homogeneous (start with the same
> alternative, and make the {...,...} vs (...|...) notations
> consistent. The (...|...) notation seems the most common).

Thanks for noticing; I think it has already been fixed a few months
ago, but apparently what I sent predates 3f02785d (doc/git-bisect:
clarify `git bisect run` syntax, 2023-10-23).  So when the patch
gets merged, it will fix itself ;-)

> In any case, the patch looks good to me, thanks.

Thanks.

^ permalink raw reply

* Re: [PATCH 0/7] clone: fix init of refdb with wrong object format
From: Patrick Steinhardt @ 2023-12-11 15:32 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, brian m. carlson
In-Reply-To: <xmqqmsugvlbe.fsf@gitster.g>

[-- Attachment #1: Type: text/plain, Size: 4580 bytes --]

On Mon, Dec 11, 2023 at 06:57:25AM -0800, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> >> An existing test
> >> 
> >>     $ make && cd t && GIT_TEST_DEFAULT_HASH=sha256 sh t5550-http-fetch-dumb.sh
> >> 
> >> passes with vanilla Git 2.43, but with these patches applied, it
> >> fails the "7 - empty dumb HTTP" step.
> > ...
> > Before my refactorings we used to fall back to the local default hash
> > format with which we've already initialized the repository, which is
> > wrong. Now we use the hash format we detected via the remote, which we
> > cannot detect because the remote is empty and does not advertise the
> > hash function, so we fall back to SHA1 and thus also do the wrong thing.
> 
> Yeah, that is why I did *not* say "the series *breaks* existing
> tests".  It triggers a failure, and in this case, a test failure
> does not mean the behaviour is broken because there is no correct
> answer.  ... oh, wait.  There isn't?
> 
> I wonder if the right thing to do is to advertise the hash function
> even from an absolutely empty repository.  There are no objects in
> such a repository, but it should already know what hash function to
> use when it creates its first object (determined at the repository
> creation time), so that _could_ be advertised.  

For the smart HTTP and SSH protocols definitely, and we already do. But
it's a different story for dumb HTTP, unfortunately, where there is no
CGI-like thing sitting between the client and the repository's data.

> > The only correct thing here would be to use the actual hash function
> > that the remote repository uses, but we have no to do so.
> 
> We have "no way to do so"?  We have "not done so"?

We have not done so until now, and we have no easy way to change this on
the server-side as the server is not controlled by us in the first
place. That leaves two options I can think of:

  - Try harder on the client-side, e.g. by trying to download the
    gitconfig as you propose further down. I wonder whether admins would
    typically block access to the config, but doubt they do.

  - Change the format of `info/refs` to include the hash format, as this
    file _is_ controlled by us on the server-side. Doesn't help though
    in an empty repository, where the file is likely to never have been
    generated in the first place.

So it seems like downloading the gitconfig is the only viable option
that I can think of right now.

It does increase the potential attack surface though because we would
start to unconditionally parse a config file from an untrusted source,
and we did hit issues in our config parser in the past already. You
could argue that we already parse untrusted configs via `.gitmodules`,
but these require opt-in to actually be used by anything if I'm not
mistaken.

So... I dunno.

> It is possible for the client side to download the $GIT_DIR/config
> file from the remote to learn what value extensions.objectFormat is
> in use over there instead, I think, but at the same time, I highly
> suspect that dumb HTTP outlived its usefulness to warrant such an
> additional investment of engineering resource.

Fair enough. All of this feels like an edge case (admin that uses dumb
HTTP) in an edge case (the cloned repository uses SHA256) in an edge
case (the remote repository is empty). Sure, SHA256 is likely to gain in
popularity eventually. But at the same time I'd expect that dumb HTTP
will become increasingly rare.

Taken together, chances for this to happen should be fairly low.

> The simplest "fix" might be to leave what happens in this narrow
> case (i.e. cloning over dumb HTTP from an empty repository)
> undefined by not testing (or not insisting on one particular
> outcome), but ...

I would be fine with that outcome, as well. It's not like the current
behaviour is correct in all cases either. The only benefit of that
behaviour is that a user can in fact work around the broken cases by
setting `GIT_HASH_DEFAULT` to the correct hash, and that benefit would
be retained by the diff I sent that made remote-curl.c aware of this
environment variable.

One additional solution would be to print a user-visible warning a la
"warning: failed to detect hash function of empty remote repository" and
then call it a day, potentially pointing out that a user can correct it
by re-cloning with `GIT_HASH_DEFAULT`. But the warning may not be
actionable by the user, because they may not know what hash function the
remote uses, either.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v6] subtree: fix split processing with multiple subtrees present
From: Zach FettersMoore @ 2023-12-11 15:39 UTC (permalink / raw)
  To: Christian Couder; +Cc: Zach FettersMoore via GitGitGadget, git
In-Reply-To: <CAP8UFD3FzP6QW4dJ9yiG1BAytLcsk+zGE+CBeArRJBJ8gsaDMQ@mail.gmail.com>

>>
>> From: Zach FettersMoore <zach.fetters@apollographql.com>
>>
>> When there are multiple subtrees present in a repository and they are
>> all using 'git subtree split', the 'split' command can take a
>> significant (and constantly growing) amount of time to run even when
>> using the '--rejoin' flag. This is due to the fact that when processing
>> commits to determine the last known split to start from when looking
>> for changes, if there has been a split/merge done from another subtree
>> there will be 2 split commits, one mainline and one subtree, for the
>> second subtree that are part of the processing. The non-mainline
>> subtree split commit will cause the processing to always need to search
>> the entire history of the given subtree as part of its processing even
>> though those commits are totally irrelevant to the current subtree
>> split being run.
>>
>> To see this in practice you can use the open source GitHub repo
>> 'apollo-ios-dev' and do the following in order:
>>
>> -Make a changes to a file in 'apollo-ios' and 'apollo-ios-codegen'
>> directories
>> -Create a commit containing these changes
>> -Do a split on apollo-ios-codegen
>> - Do a fetch on the subtree repo
>> - git fetch git@github.com:apollographql/apollo-ios-codegen.git
>> - git subtree split --prefix=apollo-ios-codegen --squash --rejoin

> Now I get the following without your patch at this step:
>
> $ git subtree split --prefix=apollo-ios-codegen --squash --rejoin
> [...]/libexec/git-core/git-subtree: 318: Maximum function recursion
> depth (1000) reached
>
> Line 318 in git-subtree.sh contains the following:
>
> missed=$(cache_miss "$@") || exit $?
>
> With your patch it seems to work:
>
> $ git subtree split --prefix=apollo-ios-codegen --squash --rejoin
> Merge made by the 'ort' strategy.
> e274aed3ba6d0659fb4cc014587cf31c1e8df7f4

Looking into this some it looks like it could be a bash config
difference? My machine always runs it all the way through vs
failing for recursion depth. Although that would also be an issue
which is solved by this fix.

>> - Depending on the current state of the 'apollo-ios-dev' repo
>> you may see the issue at this point if the last split was on
>> apollo-ios

> I guess I see it, but it seems a bit different for me than what you describe.
>
> Otherwise your patch looks good to me now.

Yea I hadn't accounted for/realized that some folks may see a recursion
depth error vs it just taking a long time like it does for me. Also what
I was saying with the apollo-ios-dev repo is you may not need all the steps
to see the issue, because its possible the state of the repo is already
in a position to display the issue just by doing a split on
apollo-ios-codegen.

Great! Thanks again for all the feedback and guidance! Is there anything
else I need to do to get this across the finish line and merged in?

^ permalink raw reply

* Re: [PATCH 1/7] setup: extract function to create the refdb
From: Karthik Nayak @ 2023-12-11 15:50 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git
In-Reply-To: <ZXFy0_T1AZLh058g@tanuki>

Patrick Steinhardt <ps@pks.im> writes:

> On Wed, Dec 06, 2023 at 10:10:37PM +0100, Karthik Nayak wrote:
>> On Wed, Dec 6, 2023 at 1:40 PM Patrick Steinhardt <ps@pks.im> wrote:
>> > +       /*
>> > +        * We need to create a "refs" dir in any case so that older
>> > +        * versions of git can tell that this is a repository.
>> > +        */
>>
>> How does this work though, even if an earlier version of git can tell
>> that this is a repository, it still won't be able to read the reftable
>> backend. In that sense, what do we achieve here?
>
> This is a good question, and there is related ongoing discussion about
> this topic in the thread starting at [1]. There are a few benefits to
> letting clients discover such repos even if they don't understand the
> new reference backend format:
>
>   - They know to stop walking up the parent-directory chain. Otherwise a
>     client might end up detecting a Git repository in the parent dir.
>
>   - The user gets a proper error message why the repository cannot be
>     accessed. Instead of failing to detect the repository altogether we
>     instead say that we don't understand the "extensions.refFormat"
>     extension.
>
> Maybe there are other cases I can't think of right now.
> [1]: <ZWcOvjGPVS_CMUAk@tanuki>

Thank Patrick, this does indeed make a lot of sense now. +1 that this
would be super useful as a comment here.

^ permalink raw reply

* Re: [PATCH 09/24] repack: implement `--extend-disjoint` mode
From: Taylor Blau @ 2023-12-11 19:59 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano
In-Reply-To: <ZXbF2O4qjIr2L7b8@tanuki>

On Mon, Dec 11, 2023 at 09:18:32AM +0100, Patrick Steinhardt wrote:
> > If you have the bulk of your repositories in the alternate, then I think
> > you might want to consider how we combine the two.
>
> Yes, in the general case the bulk of objects is indeed contained in the
> alternate.

I thought about this for a while this morning, and think that while it
is possible, I'm not sure I can think of a compelling use-case where
you'd want to reuse objects from packs across an alternate boundary.

On the "I think it's possible front":

The challenge is making sure that the set of disjoint packs among each
object store is globally disjoint in one direction along the alternate
path. I think the rule would require you to honor the disjointed-ness of
any packs in alternate(s) you might have when constructing new disjoint
packs.

So if repository fork.git is an alternate of network.git (and both had
long-lived MIDXs), network.git is free to make any set of disjoint packs
it chooses, and fork.git can only create disjoint packs which are
disjoint with respect to (a) the other disjoint packs in fork.git (if
any), and (b) the disjoint packs in network.git (and recursively for any
repositories that network.git is an alternate of in the general case).

Now on the "I can't think of a compelling use-case front":

I think the only reason you'd want to be able to reuse objects from
MIDXs across the alternates boundary is if you have MIDX bitmaps in both
the repository and its alternate. Indeed, the only time that we kick in
pack-reuse in general is when we have a bitmap, so in order to reuse
objects from both the repo and its alternate, you'd have to have a
bitmap in both repositories.

But having a MIDX bitmap means that any packs in the MIDX for which
you're generating a bitmap have to be closed over object reachability.
So unless the repository and its alternate have totally distinct lines
of history (in which case, I'm not sure you would want to share objects
between the two in the first place), any pack you bitmap in the child
repository fundamentally couldn't be disjoint with respect to its
parent.

This is because if it were to be disjoint, it would have to be repacked
with '-l' (or some equivalent '-l'-like flag that only ignores non-local
packs which are marked as disjoint). But if you exclude those objects
and any one (or more) of them is reachable from some object(s) you
didn't exclude, you wouldn't be able to generate a bitmap in the first
place.

It's very possible that there's something about your setup that I'm not
fully grokking, but I don't think in general this is something that we'd
want to do (even if it is theoretically possible).

> > Whether or not this is a feature that you/others need, I definitely
> > think we should leave it out of this series, since I am (a) fairly
> > certain that this is possible to do, and (b) already feel like this
> > series on its own is complicated enough.
>
> I'm perfectly fine if we say that the benefits of your patch series
> cannot yet be applied to repositories with alternates. But from my point
> of view it's a requirement that this patch series does not silently
> break this usecase due to Git starting to generate disjointed packs
> where it cannot ensure that the disjointedness property holds.

I think one thing you could reasonably do is use *only* the non-local
MIDX bitmaps when doing pack reuse.

Currently we'll use the first MIDX we find, which is guaranteed to be
the local one, if it exists. This was the case before this series, and
this series does not change that behavior. Unless you had a pack bitmap
in the alternated repository (which I think is unlikely, since it would
require a full reachability closure, thus eliminating any de-duplication
benefits you'd otherwise get when using alternates), you'd be find
before and after this series.

> As I haven't yet read through this whole patch series, the question is
> boils down to whether the end result is opt-in or opt-out. If it was
> opt-out then I could see the above usecase breaking silently. If it was
> opt-in then things should be fine and we can address this ommission in a
> follow up patch series. We at GitLab would definitely be interested in
> helping out with this given that it directly affects us and that the
> demonstrated savings seem very promising.

The end result is opt-in, you have to change the `pack.allowPackReuse`
configuration from its default value of "true" (or the alternate
spelling taught in this series, "single") to "multi" in order to enable
the new behavior.

Thanks,
Taylor

^ permalink raw reply

* Re: [PATCH v3 04/11] reftable/stack: verify that `reftable_stack_add()` uses auto-compaction
From: Taylor Blau @ 2023-12-11 20:15 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Han-Wen Nienhuys, Jonathan Nieder, Eric Sunshine
In-Reply-To: <5e27d0a5566d90969734e92984cfafe6048924f4.1702285387.git.ps@pks.im>

On Mon, Dec 11, 2023 at 10:07:42AM +0100, Patrick Steinhardt wrote:
> While we have several tests that check whether we correctly perform
> auto-compaction when manually calling `reftable_stack_auto_compact()`,
> we don't have any tests that verify whether `reftable_stack_add()` does
> call it automatically. Add one.
>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  reftable/stack_test.c | 49 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 49 insertions(+)
>
> diff --git a/reftable/stack_test.c b/reftable/stack_test.c
> index 0644c8ad2e..52b4dc3b14 100644
> --- a/reftable/stack_test.c
> +++ b/reftable/stack_test.c
> @@ -850,6 +850,54 @@ static void test_reftable_stack_auto_compaction(void)
>  	clear_dir(dir);
>  }
>
> +static void test_reftable_stack_add_performs_auto_compaction(void)
> +{
> +	struct reftable_write_options cfg = { 0 };
> +	struct reftable_stack *st = NULL;
> +	struct strbuf refname = STRBUF_INIT;
> +	char *dir = get_tmp_dir(__LINE__);
> +	int err, i, n = 20;
> +
> +	err = reftable_new_stack(&st, dir, cfg);
> +	EXPECT_ERR(err);
> +
> +	for (i = 0; i <= n; i++) {
> +		struct reftable_ref_record ref = {
> +			.update_index = reftable_stack_next_update_index(st),
> +			.value_type = REFTABLE_REF_SYMREF,
> +			.value.symref = "master",
> +		};
> +
> +		/*
> +		 * Disable auto-compaction for all but the last runs. Like this
> +		 * we can ensure that we indeed honor this setting and have
> +		 * better control over when exactly auto compaction runs.
> +		 */
> +		st->disable_auto_compact = i != n;
> +
> +		strbuf_reset(&refname);
> +		strbuf_addf(&refname, "branch-%04d", i);
> +		ref.refname = refname.buf;

Does the reftable backend take ownership of the "refname" field? If so,
then I think we'd want to use strbuf_detach() here to avoid a
double-free() when you call strbuf_release() below.

Thanks,
Taylor

^ permalink raw reply

* Re: [PATCH v3 00/11] reftable: small set of fixes
From: Taylor Blau @ 2023-12-11 20:16 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Han-Wen Nienhuys, Jonathan Nieder, Eric Sunshine
In-Reply-To: <cover.1702285387.git.ps@pks.im>

On Mon, Dec 11, 2023 at 10:07:25AM +0100, Patrick Steinhardt wrote:
>  reftable/block.c          |  23 ++++----
>  reftable/block.h          |   6 +++
>  reftable/block_test.c     |   4 +-
>  reftable/blocksource.c    |   2 +-
>  reftable/iter.h           |   8 +--
>  reftable/merged.c         |  31 +++++------
>  reftable/merged.h         |   2 +
>  reftable/reader.c         |   7 ++-
>  reftable/readwrite_test.c |   6 +--
>  reftable/stack.c          |  73 +++++++++++---------------
>  reftable/stack_test.c     | 107 +++++++++++++++++++++++++++++++++++++-
>  reftable/test_framework.h |  58 ++++++++++++---------
>  12 files changed, 213 insertions(+), 114 deletions(-)
>
> Range-diff against v2:

I had one small question on the new version of the fourth patch, but
otherwise this version LGTM.

Thanks,
Taylor

^ permalink raw reply

* Re: [PATCH 0/7] clone: fix init of refdb with wrong object format
From: brian m. carlson @ 2023-12-11 22:17 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Junio C Hamano, git
In-Reply-To: <ZXcrpGQhH121AQ7y@tanuki>

[-- Attachment #1: Type: text/plain, Size: 2025 bytes --]

On 2023-12-11 at 15:32:52, Patrick Steinhardt wrote:
> On Mon, Dec 11, 2023 at 06:57:25AM -0800, Junio C Hamano wrote:
> > We have "no way to do so"?  We have "not done so"?
> 
> We have not done so until now, and we have no easy way to change this on
> the server-side as the server is not controlled by us in the first
> place. That leaves two options I can think of:
> 
>   - Try harder on the client-side, e.g. by trying to download the
>     gitconfig as you propose further down. I wonder whether admins would
>     typically block access to the config, but doubt they do.
> 
>   - Change the format of `info/refs` to include the hash format, as this
>     file _is_ controlled by us on the server-side. Doesn't help though
>     in an empty repository, where the file is likely to never have been
>     generated in the first place.
> 
> So it seems like downloading the gitconfig is the only viable option
> that I can think of right now.

I mean, we can add an `info/capabilities` file with capabilities and
assume the repository is SHA-1 without it.  I'm fine with that approach
as well, and it can be implemented as part of `git update-server-info`
pretty easily.

But yes, absent that approach or parsing the config file, we'll have to
just use the default settings.

> > The simplest "fix" might be to leave what happens in this narrow
> > case (i.e. cloning over dumb HTTP from an empty repository)
> > undefined by not testing (or not insisting on one particular
> > outcome), but ...
> 
> I would be fine with that outcome, as well. It's not like the current
> behaviour is correct in all cases either. The only benefit of that
> behaviour is that a user can in fact work around the broken cases by
> setting `GIT_HASH_DEFAULT` to the correct hash, and that benefit would
> be retained by the diff I sent that made remote-curl.c aware of this
> environment variable.

That would also be fine with me.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox