git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/5] When fetching from a promisor remote, repack local objects referenced
@ 2024-10-24 18:08 Jonathan Tan
  2024-10-24 18:08 ` [PATCH 1/5] pack-objects: make variable non-static Jonathan Tan
                   ` (7 more replies)
  0 siblings, 8 replies; 37+ messages in thread
From: Jonathan Tan @ 2024-10-24 18:08 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, calvinwan, hanyang.tony

This is a polished version of [1], also with all the test failures
debugged and addressed.

The first 4 patches are cleanups and addressing issues with tests, and
the last patch contains the actual change.

This aims to solve the same problem as [2]. Some issues with it have
been brought up in [3] (e.g. not being able to identify if an object is
missing due to repo corruption or legitimately missing because it's been
promised, and also GC not removing any local object); these patches do
not have those issues. (Admittedly, these patches may have other issues
- mainly, more work needs to be done during fetch, and that work may
result in duplicate objects on disk, but I think that both the work and
the disk space used will be minimal, and the extra disk space used will
go away after a GC.)

[1] https://lore.kernel.org/git/cover.1729549127.git.jonathantanmy@google.com/
[2] https://lore.kernel.org/git/20240925072021.77078-1-hanyang.tony@bytedance.com/
[3] https://lore.kernel.org/git/a5e3322d-4e63-4b8c-84af-6578fe257cad@gmail.com/

Jonathan Tan (5):
  pack-objects: make variable non-static
  t0410: make test description clearer
  t0410: use from-scratch server
  t5300: move --window clamp test next to unclamped
  index-pack: repack local links into promisor packs

 Documentation/git-index-pack.txt |   5 ++
 builtin/index-pack.c             | 110 ++++++++++++++++++++++++++++++-
 builtin/pack-objects.c           |  31 ++++++++-
 t/t0410-partial-clone.sh         |   6 +-
 t/t5300-pack-object.sh           |  10 +--
 t/t5616-partial-clone.sh         |  30 +++++++++
 6 files changed, 180 insertions(+), 12 deletions(-)

-- 
2.47.0.163.g1226f6d8fa-goog


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 1/5] pack-objects: make variable non-static
  2024-10-24 18:08 [PATCH 0/5] When fetching from a promisor remote, repack local objects referenced Jonathan Tan
@ 2024-10-24 18:08 ` Jonathan Tan
  2024-10-28  0:30   ` Taylor Blau
  2024-10-24 18:08 ` [PATCH 2/5] t0410: make test description clearer Jonathan Tan
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 37+ messages in thread
From: Jonathan Tan @ 2024-10-24 18:08 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, calvinwan, hanyang.tony

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/pack-objects.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 0fc0680b40..e15fbaeb21 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -238,8 +238,6 @@ static enum {
 } write_bitmap_index;
 static uint16_t write_bitmap_options = BITMAP_OPT_HASH_CACHE;
 
-static int exclude_promisor_objects;
-
 static int use_delta_islands;
 
 static unsigned long delta_cache_size = 0;
@@ -4327,6 +4325,7 @@ int cmd_pack_objects(int argc,
 	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
 	struct list_objects_filter_options filter_options =
 		LIST_OBJECTS_FILTER_INIT;
+	int exclude_promisor_objects = 0;
 
 	struct option pack_objects_options[] = {
 		OPT_CALLBACK_F('q', "quiet", &progress, NULL,
-- 
2.47.0.163.g1226f6d8fa-goog


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 2/5] t0410: make test description clearer
  2024-10-24 18:08 [PATCH 0/5] When fetching from a promisor remote, repack local objects referenced Jonathan Tan
  2024-10-24 18:08 ` [PATCH 1/5] pack-objects: make variable non-static Jonathan Tan
@ 2024-10-24 18:08 ` Jonathan Tan
  2024-10-24 18:08 ` [PATCH 3/5] t0410: use from-scratch server Jonathan Tan
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 37+ messages in thread
From: Jonathan Tan @ 2024-10-24 18:08 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, calvinwan, hanyang.tony

Commit 9a4c507886 (t0410: test fetching from many promisor remotes,
2019-06-25) adds some tests that demonstrate not the automatic fetching
of missing objects, but the direct fetching from another promisor remote
(configured explicitly in one test and implicitly via --filter on the
"git fetch" CLI invocation in the other test) - thus demonstrating
support for multiple promisor remotes, as described in the commit
message.

Change the test descriptions accordingly to make this clearer.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 t/t0410-partial-clone.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index 818700fbec..eadb69473f 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -241,7 +241,7 @@ test_expect_success 'fetching of missing objects works with ref-in-want enabled'
 	grep "fetch< fetch=.*ref-in-want" trace
 '
 
-test_expect_success 'fetching of missing objects from another promisor remote' '
+test_expect_success 'fetching from another promisor remote' '
 	git clone "file://$(pwd)/server" server2 &&
 	test_commit -C server2 bar &&
 	git -C server2 repack -a -d --write-bitmap-index &&
@@ -264,7 +264,7 @@ test_expect_success 'fetching of missing objects from another promisor remote' '
 	grep "$HASH2" out
 '
 
-test_expect_success 'fetching of missing objects configures a promisor remote' '
+test_expect_success 'fetching with --filter configures a promisor remote' '
 	git clone "file://$(pwd)/server" server3 &&
 	test_commit -C server3 baz &&
 	git -C server3 repack -a -d --write-bitmap-index &&
-- 
2.47.0.163.g1226f6d8fa-goog


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 3/5] t0410: use from-scratch server
  2024-10-24 18:08 [PATCH 0/5] When fetching from a promisor remote, repack local objects referenced Jonathan Tan
  2024-10-24 18:08 ` [PATCH 1/5] pack-objects: make variable non-static Jonathan Tan
  2024-10-24 18:08 ` [PATCH 2/5] t0410: make test description clearer Jonathan Tan
@ 2024-10-24 18:08 ` Jonathan Tan
  2024-10-24 18:08 ` [PATCH 4/5] t5300: move --window clamp test next to unclamped Jonathan Tan
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 37+ messages in thread
From: Jonathan Tan @ 2024-10-24 18:08 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, calvinwan, hanyang.tony

A subsequent commit will add functionality: when fetching from a
promisor remote, existing non-promisor objects that are ancestors of any
fetched object will be repacked into promisor packs (since if a promisor
remote has an object, it also has all its ancestors).

This means that sometimes, a fetch from a promisor remote results in 2
new promisor packs (instead of the 1 that you would expect). There is a
test that fetches a descendant of a local object from a promisor remote,
but also specifically tests that there is exactly 1 promisor pack as
a result of the fetch. This means that this test will fail when the
subsequent commit is added.

Since the ancestry of the fetched object is not the concern of this
test, make the fetched objects have no ancestry in common with the
objets in the client repo. This is done by making the server from
scratch, instead of using an existing repo that has objects in common
with the client.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 t/t0410-partial-clone.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index eadb69473f..e2b317db65 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -265,7 +265,7 @@ test_expect_success 'fetching from another promisor remote' '
 '
 
 test_expect_success 'fetching with --filter configures a promisor remote' '
-	git clone "file://$(pwd)/server" server3 &&
+	test_create_repo server3 &&
 	test_commit -C server3 baz &&
 	git -C server3 repack -a -d --write-bitmap-index &&
 	HASH3=$(git -C server3 rev-parse baz) &&
-- 
2.47.0.163.g1226f6d8fa-goog


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 4/5] t5300: move --window clamp test next to unclamped
  2024-10-24 18:08 [PATCH 0/5] When fetching from a promisor remote, repack local objects referenced Jonathan Tan
                   ` (2 preceding siblings ...)
  2024-10-24 18:08 ` [PATCH 3/5] t0410: use from-scratch server Jonathan Tan
@ 2024-10-24 18:08 ` Jonathan Tan
  2024-10-24 18:08 ` [PATCH 5/5] index-pack: repack local links into promisor packs Jonathan Tan
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 37+ messages in thread
From: Jonathan Tan @ 2024-10-24 18:08 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, calvinwan, hanyang.tony

A subsequent commit will change the behavior of "git index-pack
--promisor", which is exercised in "build pack index for an existing
pack", causing the unclamped and clamped versions of the --window
test to exhibit different behavior. Move the clamp test closer to the
unclamped test that it references.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 t/t5300-pack-object.sh | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index 3b9dae331a..aff164ddf8 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -156,6 +156,11 @@ test_expect_success 'pack without delta' '
 	check_deltas stderr = 0
 '
 
+test_expect_success 'negative window clamps to 0' '
+	git pack-objects --progress --window=-1 neg-window <obj-list 2>stderr &&
+	check_deltas stderr = 0
+'
+
 test_expect_success 'pack-objects with bogus arguments' '
 	test_must_fail git pack-objects --window=0 test-1 blah blah <obj-list
 '
@@ -630,11 +635,6 @@ test_expect_success 'prefetch objects' '
 	test_line_count = 1 donelines
 '
 
-test_expect_success 'negative window clamps to 0' '
-	git pack-objects --progress --window=-1 neg-window <obj-list 2>stderr &&
-	check_deltas stderr = 0
-'
-
 for hash in sha1 sha256
 do
 	test_expect_success "verify-pack with $hash packfile" '
-- 
2.47.0.163.g1226f6d8fa-goog


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 5/5] index-pack: repack local links into promisor packs
  2024-10-24 18:08 [PATCH 0/5] When fetching from a promisor remote, repack local objects referenced Jonathan Tan
                   ` (3 preceding siblings ...)
  2024-10-24 18:08 ` [PATCH 4/5] t5300: move --window clamp test next to unclamped Jonathan Tan
@ 2024-10-24 18:08 ` Jonathan Tan
  2024-10-30 22:29   ` Josh Steadmon
  2024-10-25  6:04 ` [External] [PATCH 0/5] When fetching from a promisor remote, repack local objects referenced Han Young
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 37+ messages in thread
From: Jonathan Tan @ 2024-10-24 18:08 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, calvinwan, hanyang.tony

Teach index-pack to, when processing the objects in a pack with
--promisor specified on the CLI, repack local objects (and the local
objects that they refer to, recursively) referenced by these objects
into promisor packs.

This prevents the situation in which, when fetching from a promisor
remote, we end up with promisor objects (newly fetched) referring
to non-promisor objects (locally created prior to the fetch). This
situation may arise if the client had previously pushed objects to the
remote, for example. One issue that arises in this situation is that,
if the non-promisor objects become inaccessible except through promisor
objects (for example, if the branch pointing to them has moved to
point to the promisor object that refers to them), then GC will garbage
collect them. There are other ways to solve this, but the simplest
seems to be to enforce the invariant that we don't have promisor objects
referring to non-promisor objects.

This repacking is done from index-pack to minimize the performance
impact. During a fetch, the only time most objects are fully inflated
in memory is when their object ID is computed, so we also scan the
objects (to see which objects they refer to) during this time.

Also to minimize the performance impact, an object is calculated to be
local if it's a loose object or present in a non-promisor pack. (If it's
also in a promisor pack or referred to by an object in a promisor pack,
it is technically already a promisor object. But a misidentification
of a promisor object as a non-promisor object is relatively benign
here - we will thus repack that promisor object into a promisor pack,
duplicating it in the object store, but there is no correctness issue,
just an issue of inefficiency.)

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/git-index-pack.txt |   5 ++
 builtin/index-pack.c             | 110 ++++++++++++++++++++++++++++++-
 builtin/pack-objects.c           |  28 ++++++++
 t/t5616-partial-clone.sh         |  30 +++++++++
 4 files changed, 171 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-index-pack.txt b/Documentation/git-index-pack.txt
index 5a20deefd5..4be09e58e7 100644
--- a/Documentation/git-index-pack.txt
+++ b/Documentation/git-index-pack.txt
@@ -139,6 +139,11 @@ include::object-format-disclaimer.txt[]
 	written. If a `<message>` is provided, then that content will be
 	written to the .promisor file for future reference. See
 	link:technical/partial-clone.html[partial clone] for more information.
++
+Also, if there are objects in the given pack that references non-promisor
+objects (in the repo), repacks those non-promisor objects into a promisor
+pack. This avoids a situation in which a repo has non-promisor objects that are
+accessible through promisor objects.
 
 NOTES
 -----
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 9d23b41b3a..e4afd6725f 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -9,6 +9,7 @@
 #include "csum-file.h"
 #include "blob.h"
 #include "commit.h"
+#include "tag.h"
 #include "tree.h"
 #include "progress.h"
 #include "fsck.h"
@@ -20,9 +21,14 @@
 #include "object-file.h"
 #include "object-store-ll.h"
 #include "oid-array.h"
+#include "oidset.h"
+#include "path.h"
 #include "replace-object.h"
+#include "tree-walk.h"
 #include "promisor-remote.h"
+#include "run-command.h"
 #include "setup.h"
+#include "strvec.h"
 
 static const char index_pack_usage[] =
 "git index-pack [-v] [-o <index-file>] [--keep | --keep=<msg>] [--[no-]rev-index] [--verify] [--strict[=<msg-id>=<severity>...]] [--fsck-objects[=<msg-id>=<severity>...]] (<pack-file> | --stdin [--fix-thin] [<pack-file>])";
@@ -148,6 +154,13 @@ static uint32_t input_crc32;
 static int input_fd, output_fd;
 static const char *curr_pack;
 
+/*
+ * local_links is guarded by read_mutex, and record_local_links is read-only in
+ * a thread.
+ */
+static struct oidset local_links = OIDSET_INIT;
+static int record_local_links;
+
 static struct thread_local *thread_data;
 static int nr_dispatched;
 static int threads_active;
@@ -799,6 +812,44 @@ static int check_collison(struct object_entry *entry)
 	return 0;
 }
 
+static void record_if_local_object(const struct object_id *oid)
+{
+	struct object_info info = OBJECT_INFO_INIT;
+	if (oid_object_info_extended(the_repository, oid, &info, 0))
+		/* Missing; assume it is a promisor object */
+		return;
+	if (info.whence == OI_PACKED && info.u.packed.pack->pack_promisor)
+		return;
+	oidset_insert(&local_links, oid);
+}
+
+static void do_record_local_links(struct object *obj)
+{
+	if (obj->type == OBJ_TREE) {
+		struct tree *tree = (struct tree *)obj;
+		struct tree_desc desc;
+		struct name_entry entry;
+		if (init_tree_desc_gently(&desc, &tree->object.oid,
+					  tree->buffer, tree->size, 0))
+			/*
+			 * Error messages are given when packs are
+			 * verified, so do not print any here.
+			 */
+			return;
+		while (tree_entry_gently(&desc, &entry))
+			record_if_local_object(&entry.oid);
+	} else if (obj->type == OBJ_COMMIT) {
+		struct commit *commit = (struct commit *) obj;
+		struct commit_list *parents = commit->parents;
+
+		for (; parents; parents = parents->next)
+			record_if_local_object(&parents->item->object.oid);
+	} else if (obj->type == OBJ_TAG) {
+		struct tag *tag = (struct tag *) obj;
+		record_if_local_object(get_tagged_oid(tag));
+	}
+}
+
 static void sha1_object(const void *data, struct object_entry *obj_entry,
 			unsigned long size, enum object_type type,
 			const struct object_id *oid)
@@ -845,7 +896,7 @@ static void sha1_object(const void *data, struct object_entry *obj_entry,
 		free(has_data);
 	}
 
-	if (strict || do_fsck_object) {
+	if (strict || do_fsck_object || record_local_links) {
 		read_lock();
 		if (type == OBJ_BLOB) {
 			struct blob *blob = lookup_blob(the_repository, oid);
@@ -877,6 +928,8 @@ static void sha1_object(const void *data, struct object_entry *obj_entry,
 				die(_("fsck error in packed object"));
 			if (strict && fsck_walk(obj, NULL, &fsck_options))
 				die(_("Not all child objects of %s are reachable"), oid_to_hex(&obj->oid));
+			if (record_local_links)
+				do_record_local_links(obj);
 
 			if (obj->type == OBJ_TREE) {
 				struct tree *item = (struct tree *) obj;
@@ -1719,6 +1772,57 @@ static void show_pack_info(int stat_only)
 	free(chain_histogram);
 }
 
+static void repack_local_links(void)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	FILE *out;
+	struct strbuf line = STRBUF_INIT;
+	struct oidset_iter iter;
+	struct object_id *oid;
+	char *base_name;
+
+	if (!oidset_size(&local_links))
+		return;
+
+	base_name = mkpathdup("%s/pack/pack", repo_get_object_directory(the_repository));
+
+	strvec_push(&cmd.args, "pack-objects");
+	strvec_push(&cmd.args, "--exclude-promisor-objects-best-effort");
+	strvec_push(&cmd.args, base_name);
+	cmd.git_cmd = 1;
+	cmd.in = -1;
+	cmd.out = -1;
+	if (start_command(&cmd))
+		die(_("could not start pack-objects to repack local links"));
+
+	oidset_iter_init(&local_links, &iter);
+	while ((oid = oidset_iter_next(&iter))) {
+		if (write_in_full(cmd.in, oid_to_hex(oid), the_hash_algo->hexsz) < 0 ||
+		    write_in_full(cmd.in, "\n", 1) < 0)
+			die(_("failed to feed local object to pack-objects"));
+	}
+	close(cmd.in);
+
+	out = xfdopen(cmd.out, "r");
+	while (strbuf_getline_lf(&line, out) != EOF) {
+		unsigned char binary[GIT_MAX_RAWSZ];
+		if (line.len != the_hash_algo->hexsz ||
+		    !hex_to_bytes(binary, line.buf, line.len))
+			die(_("index-pack: Expecting full hex object ID lines only from pack-objects."));
+
+		/*
+		 * pack-objects creates the .pack and .idx files, but not the
+		 * .promisor file. Create the .promisor file, which is empty.
+		 */
+		write_special_file("promisor", "", NULL, binary, NULL);
+	}
+
+	fclose(out);
+	if (finish_command(&cmd))
+		die(_("could not finish pack-objects to repack local links"));
+	strbuf_release(&line);
+}
+
 int cmd_index_pack(int argc,
 		   const char **argv,
 		   const char *prefix,
@@ -1794,7 +1898,7 @@ int cmd_index_pack(int argc,
 			} else if (skip_to_optional_arg(arg, "--keep", &keep_msg)) {
 				; /* nothing to do */
 			} else if (skip_to_optional_arg(arg, "--promisor", &promisor_msg)) {
-				; /* already parsed */
+				record_local_links = 1;
 			} else if (starts_with(arg, "--threads=")) {
 				char *end;
 				nr_threads = strtoul(arg+10, &end, 0);
@@ -1970,6 +2074,8 @@ int cmd_index_pack(int argc,
 		free((void *) curr_index);
 	free(curr_rev_index);
 
+	repack_local_links();
+
 	/*
 	 * Let the caller know this pack is not self contained
 	 */
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index e15fbaeb21..a565ab9b40 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4310,6 +4310,18 @@ static int option_parse_cruft_expiration(const struct option *opt UNUSED,
 	return 0;
 }
 
+static int should_include_obj(struct object *obj, void *data UNUSED)
+{
+	struct object_info info = OBJECT_INFO_INIT;
+	if (oid_object_info_extended(the_repository, &obj->oid, &info, 0))
+		BUG("should_include_obj should only be called on existing objects");
+	return info.whence != OI_PACKED || !info.u.packed.pack->pack_promisor;
+}
+
+static int should_include(struct commit *commit, void *data) {
+	return should_include_obj((struct object *) commit, data);
+}
+
 int cmd_pack_objects(int argc,
 		     const char **argv,
 		     const char *prefix,
@@ -4326,6 +4338,7 @@ int cmd_pack_objects(int argc,
 	struct list_objects_filter_options filter_options =
 		LIST_OBJECTS_FILTER_INIT;
 	int exclude_promisor_objects = 0;
+	int exclude_promisor_objects_best_effort = 0;
 
 	struct option pack_objects_options[] = {
 		OPT_CALLBACK_F('q', "quiet", &progress, NULL,
@@ -4423,6 +4436,9 @@ int cmd_pack_objects(int argc,
 		  option_parse_missing_action),
 		OPT_BOOL(0, "exclude-promisor-objects", &exclude_promisor_objects,
 			 N_("do not pack objects in promisor packfiles")),
+		OPT_BOOL(0, "exclude-promisor-objects-best-effort",
+			 &exclude_promisor_objects_best_effort,
+			 N_("implies --missing=allow-any")),
 		OPT_BOOL(0, "delta-islands", &use_delta_islands,
 			 N_("respect islands during delta compression")),
 		OPT_STRING_LIST(0, "uri-protocol", &uri_protocols,
@@ -4503,10 +4519,18 @@ int cmd_pack_objects(int argc,
 		strvec_push(&rp, "--unpacked");
 	}
 
+	if (exclude_promisor_objects && exclude_promisor_objects_best_effort)
+		die(_("options '%s' and '%s' cannot be used together"),
+		    "--exclude-promisor-objects", "--exclude-promisor-objects-best-effort");
 	if (exclude_promisor_objects) {
 		use_internal_rev_list = 1;
 		fetch_if_missing = 0;
 		strvec_push(&rp, "--exclude-promisor-objects");
+	} else if (exclude_promisor_objects_best_effort) {
+		use_internal_rev_list = 1;
+		fetch_if_missing = 0;
+		option_parse_missing_action(NULL, "allow-any", 0);
+		/* revs configured below */
 	}
 	if (unpack_unreachable || keep_unreachable || pack_loose_unreachable)
 		use_internal_rev_list = 1;
@@ -4626,6 +4650,10 @@ int cmd_pack_objects(int argc,
 
 		repo_init_revisions(the_repository, &revs, NULL);
 		list_objects_filter_copy(&revs.filter, &filter_options);
+		if (exclude_promisor_objects_best_effort) {
+			revs.include_check = should_include;
+			revs.include_check_obj = should_include_obj;
+		}
 		get_object_list(&revs, rp.nr, rp.v);
 		release_revisions(&revs);
 	}
diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
index c53e93be2f..2e67f59f89 100755
--- a/t/t5616-partial-clone.sh
+++ b/t/t5616-partial-clone.sh
@@ -694,6 +694,36 @@ test_expect_success 'lazy-fetch in submodule succeeds' '
 	git -C client restore --recurse-submodules --source=HEAD^ :/
 '
 
+test_expect_success 'after fetching descendants of non-promisor commits, gc works' '
+	# Setup
+	git init full &&
+	git -C full config uploadpack.allowfilter 1 &&
+ 	git -C full config uploadpack.allowanysha1inwant 1 &&
+	touch full/foo &&
+	git -C full add foo &&
+	git -C full commit -m "commit 1" &&
+	git -C full checkout --detach &&
+
+	# Partial clone and push commit to remote
+	git clone "file://$(pwd)/full" --filter=blob:none partial &&
+	echo "hello" > partial/foo &&
+	git -C partial commit -a -m "commit 2" &&
+	git -C partial push &&
+
+	# gc in partial repo
+	git -C partial gc --prune=now &&
+
+	# Create another commit in normal repo
+	git -C full checkout main &&
+	echo " world" >> full/foo &&
+	git -C full commit -a -m "commit 3" &&
+
+	# Pull from remote in partial repo, and run gc again
+	git -C partial pull &&
+	git -C partial gc --prune=now
+'
+
+
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd
 
-- 
2.47.0.163.g1226f6d8fa-goog


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [External] [PATCH 0/5] When fetching from a promisor remote, repack local objects referenced
  2024-10-24 18:08 [PATCH 0/5] When fetching from a promisor remote, repack local objects referenced Jonathan Tan
                   ` (4 preceding siblings ...)
  2024-10-24 18:08 ` [PATCH 5/5] index-pack: repack local links into promisor packs Jonathan Tan
@ 2024-10-25  6:04 ` Han Young
  2024-10-25 21:07   ` Taylor Blau
  2024-10-25 21:07 ` Taylor Blau
  2024-11-01 20:11 ` [PATCH v2 0/4] " Jonathan Tan
  7 siblings, 1 reply; 37+ messages in thread
From: Han Young @ 2024-10-25  6:04 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, calvinwan

On Fri, Oct 25, 2024 at 2:09 AM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> This is a polished version of [1], also with all the test failures
> debugged and addressed.

Thanks! I think I can drop my "repack all" patches now. :)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [External] [PATCH 0/5] When fetching from a promisor remote, repack local objects referenced
  2024-10-25  6:04 ` [External] [PATCH 0/5] When fetching from a promisor remote, repack local objects referenced Han Young
@ 2024-10-25 21:07   ` Taylor Blau
  2024-11-02 10:38     ` Junio C Hamano
  0 siblings, 1 reply; 37+ messages in thread
From: Taylor Blau @ 2024-10-25 21:07 UTC (permalink / raw)
  To: Han Young; +Cc: Jonathan Tan, git, calvinwan

On Fri, Oct 25, 2024 at 02:04:21PM +0800, Han Young wrote:
> On Fri, Oct 25, 2024 at 2:09 AM Jonathan Tan <jonathantanmy@google.com> wrote:
> >
> > This is a polished version of [1], also with all the test failures
> > debugged and addressed.
>
> Thanks! I think I can drop my "repack all" patches now. :)

Thanks for saying so, I dropped the 'hy/partial-repack-fix' branch from
my tree.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/5] When fetching from a promisor remote, repack local objects referenced
  2024-10-24 18:08 [PATCH 0/5] When fetching from a promisor remote, repack local objects referenced Jonathan Tan
                   ` (5 preceding siblings ...)
  2024-10-25  6:04 ` [External] [PATCH 0/5] When fetching from a promisor remote, repack local objects referenced Han Young
@ 2024-10-25 21:07 ` Taylor Blau
  2024-11-01 20:11 ` [PATCH v2 0/4] " Jonathan Tan
  7 siblings, 0 replies; 37+ messages in thread
From: Taylor Blau @ 2024-10-25 21:07 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, calvinwan, hanyang.tony

On Thu, Oct 24, 2024 at 11:08:39AM -0700, Jonathan Tan wrote:
> Jonathan Tan (5):
>   pack-objects: make variable non-static
>   t0410: make test description clearer
>   t0410: use from-scratch server
>   t5300: move --window clamp test next to unclamped
>   index-pack: repack local links into promisor packs

Thanks, will queue.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/5] pack-objects: make variable non-static
  2024-10-24 18:08 ` [PATCH 1/5] pack-objects: make variable non-static Jonathan Tan
@ 2024-10-28  0:30   ` Taylor Blau
  2024-10-28 19:34     ` Jonathan Tan
  0 siblings, 1 reply; 37+ messages in thread
From: Taylor Blau @ 2024-10-28  0:30 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, calvinwan, hanyang.tony, Derrick Stolee

On Thu, Oct 24, 2024 at 11:08:40AM -0700, Jonathan Tan wrote:
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
>  builtin/pack-objects.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 0fc0680b40..e15fbaeb21 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -238,8 +238,6 @@ static enum {
>  } write_bitmap_index;
>  static uint16_t write_bitmap_options = BITMAP_OPT_HASH_CACHE;
>
> -static int exclude_promisor_objects;
> -
>  static int use_delta_islands;
>
>  static unsigned long delta_cache_size = 0;
> @@ -4327,6 +4325,7 @@ int cmd_pack_objects(int argc,
>  	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
>  	struct list_objects_filter_options filter_options =
>  		LIST_OBJECTS_FILTER_INIT;
> +	int exclude_promisor_objects = 0;
>
>  	struct option pack_objects_options[] = {
>  		OPT_CALLBACK_F('q', "quiet", &progress, NULL,
> --
> 2.47.0.163.g1226f6d8fa-goog

This patch appears to conflict with ds/path-walk, which wants to read
the exclude_promisor_objects variable from outside of cmd_pack_objects()
(but elsewhere within the builtin/pack-objects.c compilation unit).

Is this refactoring a necessary step, or just cleanup? If the former, it
may be good for you and Stolee (CC'd) to work together to figure out how
to eliminate the conflict from your two series. If the latter, it may be
worth dropping this patch.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/5] pack-objects: make variable non-static
  2024-10-28  0:30   ` Taylor Blau
@ 2024-10-28 19:34     ` Jonathan Tan
  2024-10-28 19:50       ` Taylor Blau
  0 siblings, 1 reply; 37+ messages in thread
From: Jonathan Tan @ 2024-10-28 19:34 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jonathan Tan, git, calvinwan, hanyang.tony, Derrick Stolee

Taylor Blau <me@ttaylorr.com> writes:
> This patch appears to conflict with ds/path-walk, which wants to read
> the exclude_promisor_objects variable from outside of cmd_pack_objects()
> (but elsewhere within the builtin/pack-objects.c compilation unit).
> 
> Is this refactoring a necessary step, or just cleanup? If the former, it
> may be good for you and Stolee (CC'd) to work together to figure out how
> to eliminate the conflict from your two series. If the latter, it may be
> worth dropping this patch.
> 
> Thanks,
> Taylor

It's just cleanup. I've dropped this patch in my local copy but will
wait for reviews before sending the next one (probably not worth sending
it now since it's a relatively trivial change).

I've also looked briefly at ds/path-walk - will reply with a few
comments on that email thread.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/5] pack-objects: make variable non-static
  2024-10-28 19:34     ` Jonathan Tan
@ 2024-10-28 19:50       ` Taylor Blau
  2024-10-28 23:04         ` Jonathan Tan
  0 siblings, 1 reply; 37+ messages in thread
From: Taylor Blau @ 2024-10-28 19:50 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, calvinwan, hanyang.tony, Derrick Stolee

On Mon, Oct 28, 2024 at 12:34:09PM -0700, Jonathan Tan wrote:
> Taylor Blau <me@ttaylorr.com> writes:
> > This patch appears to conflict with ds/path-walk, which wants to read
> > the exclude_promisor_objects variable from outside of cmd_pack_objects()
> > (but elsewhere within the builtin/pack-objects.c compilation unit).
> >
> > Is this refactoring a necessary step, or just cleanup? If the former, it
> > may be good for you and Stolee (CC'd) to work together to figure out how
> > to eliminate the conflict from your two series. If the latter, it may be
> > worth dropping this patch.
> >
> > Thanks,
> > Taylor
>
> It's just cleanup. I've dropped this patch in my local copy but will
> wait for reviews before sending the next one (probably not worth sending
> it now since it's a relatively trivial change).
>
> I've also looked briefly at ds/path-walk - will reply with a few
> comments on that email thread.

Great, thanks on both.

Thanks,
Taylor

P.S.: it's good to see you back on the list again :-).

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/5] pack-objects: make variable non-static
  2024-10-28 19:50       ` Taylor Blau
@ 2024-10-28 23:04         ` Jonathan Tan
  0 siblings, 0 replies; 37+ messages in thread
From: Jonathan Tan @ 2024-10-28 23:04 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jonathan Tan, git, calvinwan, hanyang.tony, Derrick Stolee

Taylor Blau <me@ttaylorr.com> writes:
> Great, thanks on both.
> 
> Thanks,
> Taylor

Thanks also for your work in coordinating the patches from various
authors.

> P.S.: it's good to see you back on the list again :-).

Thank you :)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 5/5] index-pack: repack local links into promisor packs
  2024-10-24 18:08 ` [PATCH 5/5] index-pack: repack local links into promisor packs Jonathan Tan
@ 2024-10-30 22:29   ` Josh Steadmon
  2024-11-01 20:14     ` Jonathan Tan
  0 siblings, 1 reply; 37+ messages in thread
From: Josh Steadmon @ 2024-10-30 22:29 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, calvinwan, hanyang.tony

Mostly looks good, just one point of confusion on my part, and one nit.
Thanks for the patch! (comments inline below)


On 2024.10.24 11:08, Jonathan Tan wrote:
> Teach index-pack to, when processing the objects in a pack with
> --promisor specified on the CLI, repack local objects (and the local
> objects that they refer to, recursively) referenced by these objects
> into promisor packs.
> 
> This prevents the situation in which, when fetching from a promisor
> remote, we end up with promisor objects (newly fetched) referring
> to non-promisor objects (locally created prior to the fetch). This
> situation may arise if the client had previously pushed objects to the
> remote, for example. One issue that arises in this situation is that,
> if the non-promisor objects become inaccessible except through promisor
> objects (for example, if the branch pointing to them has moved to
> point to the promisor object that refers to them), then GC will garbage
> collect them. There are other ways to solve this, but the simplest
> seems to be to enforce the invariant that we don't have promisor objects
> referring to non-promisor objects.
> 
> This repacking is done from index-pack to minimize the performance
> impact. During a fetch, the only time most objects are fully inflated
> in memory is when their object ID is computed, so we also scan the
> objects (to see which objects they refer to) during this time.
> 
> Also to minimize the performance impact, an object is calculated to be
> local if it's a loose object or present in a non-promisor pack. (If it's
> also in a promisor pack or referred to by an object in a promisor pack,
> it is technically already a promisor object. But a misidentification
> of a promisor object as a non-promisor object is relatively benign
> here - we will thus repack that promisor object into a promisor pack,
> duplicating it in the object store, but there is no correctness issue,
> just an issue of inefficiency.)
> 
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
>  Documentation/git-index-pack.txt |   5 ++
>  builtin/index-pack.c             | 110 ++++++++++++++++++++++++++++++-
>  builtin/pack-objects.c           |  28 ++++++++
>  t/t5616-partial-clone.sh         |  30 +++++++++
>  4 files changed, 171 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/git-index-pack.txt b/Documentation/git-index-pack.txt
> index 5a20deefd5..4be09e58e7 100644
> --- a/Documentation/git-index-pack.txt
> +++ b/Documentation/git-index-pack.txt
> @@ -139,6 +139,11 @@ include::object-format-disclaimer.txt[]
>  	written. If a `<message>` is provided, then that content will be
>  	written to the .promisor file for future reference. See
>  	link:technical/partial-clone.html[partial clone] for more information.
> ++
> +Also, if there are objects in the given pack that references non-promisor
> +objects (in the repo), repacks those non-promisor objects into a promisor
> +pack. This avoids a situation in which a repo has non-promisor objects that are
> +accessible through promisor objects.
>  
>  NOTES
>  -----
> diff --git a/builtin/index-pack.c b/builtin/index-pack.c
> index 9d23b41b3a..e4afd6725f 100644
> --- a/builtin/index-pack.c
> +++ b/builtin/index-pack.c
> @@ -9,6 +9,7 @@
>  #include "csum-file.h"
>  #include "blob.h"
>  #include "commit.h"
> +#include "tag.h"
>  #include "tree.h"
>  #include "progress.h"
>  #include "fsck.h"
> @@ -20,9 +21,14 @@
>  #include "object-file.h"
>  #include "object-store-ll.h"
>  #include "oid-array.h"
> +#include "oidset.h"
> +#include "path.h"
>  #include "replace-object.h"
> +#include "tree-walk.h"
>  #include "promisor-remote.h"
> +#include "run-command.h"
>  #include "setup.h"
> +#include "strvec.h"
>  
>  static const char index_pack_usage[] =
>  "git index-pack [-v] [-o <index-file>] [--keep | --keep=<msg>] [--[no-]rev-index] [--verify] [--strict[=<msg-id>=<severity>...]] [--fsck-objects[=<msg-id>=<severity>...]] (<pack-file> | --stdin [--fix-thin] [<pack-file>])";
> @@ -148,6 +154,13 @@ static uint32_t input_crc32;
>  static int input_fd, output_fd;
>  static const char *curr_pack;
>  
> +/*
> + * local_links is guarded by read_mutex, and record_local_links is read-only in
> + * a thread.
> + */
> +static struct oidset local_links = OIDSET_INIT;
> +static int record_local_links;
> +
>  static struct thread_local *thread_data;
>  static int nr_dispatched;
>  static int threads_active;
> @@ -799,6 +812,44 @@ static int check_collison(struct object_entry *entry)
>  	return 0;
>  }
>  
> +static void record_if_local_object(const struct object_id *oid)
> +{
> +	struct object_info info = OBJECT_INFO_INIT;
> +	if (oid_object_info_extended(the_repository, oid, &info, 0))
> +		/* Missing; assume it is a promisor object */
> +		return;
> +	if (info.whence == OI_PACKED && info.u.packed.pack->pack_promisor)
> +		return;
> +	oidset_insert(&local_links, oid);
> +}
> +
> +static void do_record_local_links(struct object *obj)
> +{
> +	if (obj->type == OBJ_TREE) {
> +		struct tree *tree = (struct tree *)obj;
> +		struct tree_desc desc;
> +		struct name_entry entry;
> +		if (init_tree_desc_gently(&desc, &tree->object.oid,
> +					  tree->buffer, tree->size, 0))

This part confused me for a bit, since I didn't see how simply casting
a `struct object *` to a `struct tree *` could possibly guarantee that
`tree->buffer` or `tree->size` pointed to valid memory. But the object
pointers in question have all been created via `parse_object_buffer` and
allocated as the appropriate blob/tree/commit/tag structs, and were
previously cast to `struct object *` by that function. So all looks good
here.


> +			/*
> +			 * Error messages are given when packs are
> +			 * verified, so do not print any here.
> +			 */
> +			return;
> +		while (tree_entry_gently(&desc, &entry))
> +			record_if_local_object(&entry.oid);
> +	} else if (obj->type == OBJ_COMMIT) {
> +		struct commit *commit = (struct commit *) obj;
> +		struct commit_list *parents = commit->parents;
> +
> +		for (; parents; parents = parents->next)
> +			record_if_local_object(&parents->item->object.oid);
> +	} else if (obj->type == OBJ_TAG) {
> +		struct tag *tag = (struct tag *) obj;
> +		record_if_local_object(get_tagged_oid(tag));
> +	}

We only ever `record_if_local_object()` on things referenced by `obj`
here but not on `obj` itself. We should only be calling this on objects
that are inside a promisor pack, so all good here I think.

> +}
> +
>  static void sha1_object(const void *data, struct object_entry *obj_entry,
>  			unsigned long size, enum object_type type,
>  			const struct object_id *oid)
> @@ -845,7 +896,7 @@ static void sha1_object(const void *data, struct object_entry *obj_entry,
>  		free(has_data);
>  	}
>  
> -	if (strict || do_fsck_object) {
> +	if (strict || do_fsck_object || record_local_links) {
>  		read_lock();
>  		if (type == OBJ_BLOB) {
>  			struct blob *blob = lookup_blob(the_repository, oid);
> @@ -877,6 +928,8 @@ static void sha1_object(const void *data, struct object_entry *obj_entry,
>  				die(_("fsck error in packed object"));
>  			if (strict && fsck_walk(obj, NULL, &fsck_options))
>  				die(_("Not all child objects of %s are reachable"), oid_to_hex(&obj->oid));
> +			if (record_local_links)
> +				do_record_local_links(obj);
>  
>  			if (obj->type == OBJ_TREE) {
>  				struct tree *item = (struct tree *) obj;
> @@ -1719,6 +1772,57 @@ static void show_pack_info(int stat_only)
>  	free(chain_histogram);
>  }
>  
> +static void repack_local_links(void)
> +{
> +	struct child_process cmd = CHILD_PROCESS_INIT;
> +	FILE *out;
> +	struct strbuf line = STRBUF_INIT;
> +	struct oidset_iter iter;
> +	struct object_id *oid;
> +	char *base_name;
> +
> +	if (!oidset_size(&local_links))
> +		return;
> +
> +	base_name = mkpathdup("%s/pack/pack", repo_get_object_directory(the_repository));
> +
> +	strvec_push(&cmd.args, "pack-objects");
> +	strvec_push(&cmd.args, "--exclude-promisor-objects-best-effort");
> +	strvec_push(&cmd.args, base_name);
> +	cmd.git_cmd = 1;
> +	cmd.in = -1;
> +	cmd.out = -1;
> +	if (start_command(&cmd))
> +		die(_("could not start pack-objects to repack local links"));
> +
> +	oidset_iter_init(&local_links, &iter);
> +	while ((oid = oidset_iter_next(&iter))) {
> +		if (write_in_full(cmd.in, oid_to_hex(oid), the_hash_algo->hexsz) < 0 ||
> +		    write_in_full(cmd.in, "\n", 1) < 0)
> +			die(_("failed to feed local object to pack-objects"));
> +	}
> +	close(cmd.in);
> +
> +	out = xfdopen(cmd.out, "r");
> +	while (strbuf_getline_lf(&line, out) != EOF) {
> +		unsigned char binary[GIT_MAX_RAWSZ];
> +		if (line.len != the_hash_algo->hexsz ||
> +		    !hex_to_bytes(binary, line.buf, line.len))
> +			die(_("index-pack: Expecting full hex object ID lines only from pack-objects."));

I'm not sure why we check the pack-objects output here, is this just to
detect errors? Could we instead just check the exit status of
pack-objects, and discard the output?


> +
> +		/*
> +		 * pack-objects creates the .pack and .idx files, but not the
> +		 * .promisor file. Create the .promisor file, which is empty.
> +		 */
> +		write_special_file("promisor", "", NULL, binary, NULL);
> +	}
> +
> +	fclose(out);
> +	if (finish_command(&cmd))
> +		die(_("could not finish pack-objects to repack local links"));
> +	strbuf_release(&line);
> +}
> +
>  int cmd_index_pack(int argc,
>  		   const char **argv,
>  		   const char *prefix,
> @@ -1794,7 +1898,7 @@ int cmd_index_pack(int argc,
>  			} else if (skip_to_optional_arg(arg, "--keep", &keep_msg)) {
>  				; /* nothing to do */
>  			} else if (skip_to_optional_arg(arg, "--promisor", &promisor_msg)) {
> -				; /* already parsed */
> +				record_local_links = 1;
>  			} else if (starts_with(arg, "--threads=")) {
>  				char *end;
>  				nr_threads = strtoul(arg+10, &end, 0);
> @@ -1970,6 +2074,8 @@ int cmd_index_pack(int argc,
>  		free((void *) curr_index);
>  	free(curr_rev_index);
>  
> +	repack_local_links();
> +
>  	/*
>  	 * Let the caller know this pack is not self contained
>  	 */
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index e15fbaeb21..a565ab9b40 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -4310,6 +4310,18 @@ static int option_parse_cruft_expiration(const struct option *opt UNUSED,
>  	return 0;
>  }
>  
> +static int should_include_obj(struct object *obj, void *data UNUSED)
> +{
> +	struct object_info info = OBJECT_INFO_INIT;
> +	if (oid_object_info_extended(the_repository, &obj->oid, &info, 0))
> +		BUG("should_include_obj should only be called on existing objects");
> +	return info.whence != OI_PACKED || !info.u.packed.pack->pack_promisor;
> +}
> +
> +static int should_include(struct commit *commit, void *data) {
> +	return should_include_obj((struct object *) commit, data);
> +}
> +

Nit: these two functions could be named a bit more descriptively.


>  int cmd_pack_objects(int argc,
>  		     const char **argv,
>  		     const char *prefix,
> @@ -4326,6 +4338,7 @@ int cmd_pack_objects(int argc,
>  	struct list_objects_filter_options filter_options =
>  		LIST_OBJECTS_FILTER_INIT;
>  	int exclude_promisor_objects = 0;
> +	int exclude_promisor_objects_best_effort = 0;
>  
>  	struct option pack_objects_options[] = {
>  		OPT_CALLBACK_F('q', "quiet", &progress, NULL,
> @@ -4423,6 +4436,9 @@ int cmd_pack_objects(int argc,
>  		  option_parse_missing_action),
>  		OPT_BOOL(0, "exclude-promisor-objects", &exclude_promisor_objects,
>  			 N_("do not pack objects in promisor packfiles")),
> +		OPT_BOOL(0, "exclude-promisor-objects-best-effort",
> +			 &exclude_promisor_objects_best_effort,
> +			 N_("implies --missing=allow-any")),
>  		OPT_BOOL(0, "delta-islands", &use_delta_islands,
>  			 N_("respect islands during delta compression")),
>  		OPT_STRING_LIST(0, "uri-protocol", &uri_protocols,
> @@ -4503,10 +4519,18 @@ int cmd_pack_objects(int argc,
>  		strvec_push(&rp, "--unpacked");
>  	}
>  
> +	if (exclude_promisor_objects && exclude_promisor_objects_best_effort)
> +		die(_("options '%s' and '%s' cannot be used together"),
> +		    "--exclude-promisor-objects", "--exclude-promisor-objects-best-effort");
>  	if (exclude_promisor_objects) {
>  		use_internal_rev_list = 1;
>  		fetch_if_missing = 0;
>  		strvec_push(&rp, "--exclude-promisor-objects");
> +	} else if (exclude_promisor_objects_best_effort) {
> +		use_internal_rev_list = 1;
> +		fetch_if_missing = 0;
> +		option_parse_missing_action(NULL, "allow-any", 0);
> +		/* revs configured below */
>  	}
>  	if (unpack_unreachable || keep_unreachable || pack_loose_unreachable)
>  		use_internal_rev_list = 1;
> @@ -4626,6 +4650,10 @@ int cmd_pack_objects(int argc,
>  
>  		repo_init_revisions(the_repository, &revs, NULL);
>  		list_objects_filter_copy(&revs.filter, &filter_options);
> +		if (exclude_promisor_objects_best_effort) {
> +			revs.include_check = should_include;
> +			revs.include_check_obj = should_include_obj;
> +		}
>  		get_object_list(&revs, rp.nr, rp.v);
>  		release_revisions(&revs);
>  	}
> diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
> index c53e93be2f..2e67f59f89 100755
> --- a/t/t5616-partial-clone.sh
> +++ b/t/t5616-partial-clone.sh
> @@ -694,6 +694,36 @@ test_expect_success 'lazy-fetch in submodule succeeds' '
>  	git -C client restore --recurse-submodules --source=HEAD^ :/
>  '
>  
> +test_expect_success 'after fetching descendants of non-promisor commits, gc works' '
> +	# Setup
> +	git init full &&
> +	git -C full config uploadpack.allowfilter 1 &&
> + 	git -C full config uploadpack.allowanysha1inwant 1 &&
> +	touch full/foo &&
> +	git -C full add foo &&
> +	git -C full commit -m "commit 1" &&
> +	git -C full checkout --detach &&
> +
> +	# Partial clone and push commit to remote
> +	git clone "file://$(pwd)/full" --filter=blob:none partial &&
> +	echo "hello" > partial/foo &&
> +	git -C partial commit -a -m "commit 2" &&
> +	git -C partial push &&
> +
> +	# gc in partial repo
> +	git -C partial gc --prune=now &&
> +
> +	# Create another commit in normal repo
> +	git -C full checkout main &&
> +	echo " world" >> full/foo &&
> +	git -C full commit -a -m "commit 3" &&
> +
> +	# Pull from remote in partial repo, and run gc again
> +	git -C partial pull &&
> +	git -C partial gc --prune=now
> +'
> +
> +
>  . "$TEST_DIRECTORY"/lib-httpd.sh
>  start_httpd
>  
> -- 
> 2.47.0.163.g1226f6d8fa-goog
> 
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v2 0/4] When fetching from a promisor remote, repack local objects referenced
  2024-10-24 18:08 [PATCH 0/5] When fetching from a promisor remote, repack local objects referenced Jonathan Tan
                   ` (6 preceding siblings ...)
  2024-10-25 21:07 ` Taylor Blau
@ 2024-11-01 20:11 ` Jonathan Tan
  2024-11-01 20:11   ` [PATCH v2 1/4] t0410: make test description clearer Jonathan Tan
                     ` (4 more replies)
  7 siblings, 5 replies; 37+ messages in thread
From: Jonathan Tan @ 2024-11-01 20:11 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, steadmon, hanyang.tony, me

Thanks everyone for looking at it. Here's version 2.

Jonathan Tan (4):
  t0410: make test description clearer
  t0410: use from-scratch server
  t5300: move --window clamp test next to unclamped
  index-pack: repack local links into promisor packs

 Documentation/git-index-pack.txt |   5 ++
 builtin/index-pack.c             | 110 ++++++++++++++++++++++++++++++-
 builtin/pack-objects.c           |  28 ++++++++
 t/t0410-partial-clone.sh         |   6 +-
 t/t5300-pack-object.sh           |  10 +--
 t/t5616-partial-clone.sh         |  30 +++++++++
 6 files changed, 179 insertions(+), 10 deletions(-)

Range-diff against v1:
1:  b2c76c207d < -:  ---------- pack-objects: make variable non-static
2:  c220e77ccf = 1:  f405c9c9aa t0410: make test description clearer
3:  08750988e0 = 2:  ce9d5af42a t0410: use from-scratch server
4:  85fc3fa77e = 3:  1526a59e2d t5300: move --window clamp test next to unclamped
5:  5dd7fdc16d ! 4:  c51fac33fb index-pack: repack local links into promisor packs
    @@ builtin/index-pack.c: int cmd_index_pack(int argc,
      	 */
     
      ## builtin/pack-objects.c ##
    +@@ builtin/pack-objects.c: static enum {
    + static uint16_t write_bitmap_options = BITMAP_OPT_HASH_CACHE;
    + 
    + static int exclude_promisor_objects;
    ++static int exclude_promisor_objects_best_effort;
    + 
    + static int use_delta_islands;
    + 
     @@ builtin/pack-objects.c: static int option_parse_cruft_expiration(const struct option *opt UNUSED,
      	return 0;
      }
      
    -+static int should_include_obj(struct object *obj, void *data UNUSED)
    ++static int is_not_in_promisor_pack_obj(struct object *obj, void *data UNUSED)
     +{
     +	struct object_info info = OBJECT_INFO_INIT;
     +	if (oid_object_info_extended(the_repository, &obj->oid, &info, 0))
    @@ builtin/pack-objects.c: static int option_parse_cruft_expiration(const struct op
     +	return info.whence != OI_PACKED || !info.u.packed.pack->pack_promisor;
     +}
     +
    -+static int should_include(struct commit *commit, void *data) {
    -+	return should_include_obj((struct object *) commit, data);
    ++static int is_not_in_promisor_pack(struct commit *commit, void *data) {
    ++	return is_not_in_promisor_pack_obj((struct object *) commit, data);
     +}
     +
      int cmd_pack_objects(int argc,
      		     const char **argv,
      		     const char *prefix,
    -@@ builtin/pack-objects.c: int cmd_pack_objects(int argc,
    - 	struct list_objects_filter_options filter_options =
    - 		LIST_OBJECTS_FILTER_INIT;
    - 	int exclude_promisor_objects = 0;
    -+	int exclude_promisor_objects_best_effort = 0;
    - 
    - 	struct option pack_objects_options[] = {
    - 		OPT_CALLBACK_F('q', "quiet", &progress, NULL,
     @@ builtin/pack-objects.c: int cmd_pack_objects(int argc,
      		  option_parse_missing_action),
      		OPT_BOOL(0, "exclude-promisor-objects", &exclude_promisor_objects,
    @@ builtin/pack-objects.c: int cmd_pack_objects(int argc,
      		repo_init_revisions(the_repository, &revs, NULL);
      		list_objects_filter_copy(&revs.filter, &filter_options);
     +		if (exclude_promisor_objects_best_effort) {
    -+			revs.include_check = should_include;
    -+			revs.include_check_obj = should_include_obj;
    ++			revs.include_check = is_not_in_promisor_pack;
    ++			revs.include_check_obj = is_not_in_promisor_pack_obj;
     +		}
      		get_object_list(&revs, rp.nr, rp.v);
      		release_revisions(&revs);
-- 
2.47.0.163.g1226f6d8fa-goog


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v2 1/4] t0410: make test description clearer
  2024-11-01 20:11 ` [PATCH v2 0/4] " Jonathan Tan
@ 2024-11-01 20:11   ` Jonathan Tan
  2024-11-01 20:11   ` [PATCH v2 2/4] t0410: use from-scratch server Jonathan Tan
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 37+ messages in thread
From: Jonathan Tan @ 2024-11-01 20:11 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, steadmon, hanyang.tony, me

Commit 9a4c507886 (t0410: test fetching from many promisor remotes,
2019-06-25) adds some tests that demonstrate not the automatic fetching
of missing objects, but the direct fetching from another promisor remote
(configured explicitly in one test and implicitly via --filter on the
"git fetch" CLI invocation in the other test) - thus demonstrating
support for multiple promisor remotes, as described in the commit
message.

Change the test descriptions accordingly to make this clearer.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 t/t0410-partial-clone.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index 818700fbec..eadb69473f 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -241,7 +241,7 @@ test_expect_success 'fetching of missing objects works with ref-in-want enabled'
 	grep "fetch< fetch=.*ref-in-want" trace
 '
 
-test_expect_success 'fetching of missing objects from another promisor remote' '
+test_expect_success 'fetching from another promisor remote' '
 	git clone "file://$(pwd)/server" server2 &&
 	test_commit -C server2 bar &&
 	git -C server2 repack -a -d --write-bitmap-index &&
@@ -264,7 +264,7 @@ test_expect_success 'fetching of missing objects from another promisor remote' '
 	grep "$HASH2" out
 '
 
-test_expect_success 'fetching of missing objects configures a promisor remote' '
+test_expect_success 'fetching with --filter configures a promisor remote' '
 	git clone "file://$(pwd)/server" server3 &&
 	test_commit -C server3 baz &&
 	git -C server3 repack -a -d --write-bitmap-index &&
-- 
2.47.0.163.g1226f6d8fa-goog


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 2/4] t0410: use from-scratch server
  2024-11-01 20:11 ` [PATCH v2 0/4] " Jonathan Tan
  2024-11-01 20:11   ` [PATCH v2 1/4] t0410: make test description clearer Jonathan Tan
@ 2024-11-01 20:11   ` Jonathan Tan
  2024-11-01 20:11   ` [PATCH v2 3/4] t5300: move --window clamp test next to unclamped Jonathan Tan
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 37+ messages in thread
From: Jonathan Tan @ 2024-11-01 20:11 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, steadmon, hanyang.tony, me

A subsequent commit will add functionality: when fetching from a
promisor remote, existing non-promisor objects that are ancestors of any
fetched object will be repacked into promisor packs (since if a promisor
remote has an object, it also has all its ancestors).

This means that sometimes, a fetch from a promisor remote results in 2
new promisor packs (instead of the 1 that you would expect). There is a
test that fetches a descendant of a local object from a promisor remote,
but also specifically tests that there is exactly 1 promisor pack as
a result of the fetch. This means that this test will fail when the
subsequent commit is added.

Since the ancestry of the fetched object is not the concern of this
test, make the fetched objects have no ancestry in common with the
objets in the client repo. This is done by making the server from
scratch, instead of using an existing repo that has objects in common
with the client.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 t/t0410-partial-clone.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index eadb69473f..e2b317db65 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -265,7 +265,7 @@ test_expect_success 'fetching from another promisor remote' '
 '
 
 test_expect_success 'fetching with --filter configures a promisor remote' '
-	git clone "file://$(pwd)/server" server3 &&
+	test_create_repo server3 &&
 	test_commit -C server3 baz &&
 	git -C server3 repack -a -d --write-bitmap-index &&
 	HASH3=$(git -C server3 rev-parse baz) &&
-- 
2.47.0.163.g1226f6d8fa-goog


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 3/4] t5300: move --window clamp test next to unclamped
  2024-11-01 20:11 ` [PATCH v2 0/4] " Jonathan Tan
  2024-11-01 20:11   ` [PATCH v2 1/4] t0410: make test description clearer Jonathan Tan
  2024-11-01 20:11   ` [PATCH v2 2/4] t0410: use from-scratch server Jonathan Tan
@ 2024-11-01 20:11   ` Jonathan Tan
  2024-11-13  7:35     ` Jeff King
  2024-11-01 20:11   ` [PATCH v2 4/4] index-pack: repack local links into promisor packs Jonathan Tan
  2024-11-04  0:22   ` [PATCH v2 0/4] When fetching from a promisor remote, repack local objects referenced Junio C Hamano
  4 siblings, 1 reply; 37+ messages in thread
From: Jonathan Tan @ 2024-11-01 20:11 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, steadmon, hanyang.tony, me

A subsequent commit will change the behavior of "git index-pack
--promisor", which is exercised in "build pack index for an existing
pack", causing the unclamped and clamped versions of the --window
test to exhibit different behavior. Move the clamp test closer to the
unclamped test that it references.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 t/t5300-pack-object.sh | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index 3b9dae331a..aff164ddf8 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -156,6 +156,11 @@ test_expect_success 'pack without delta' '
 	check_deltas stderr = 0
 '
 
+test_expect_success 'negative window clamps to 0' '
+	git pack-objects --progress --window=-1 neg-window <obj-list 2>stderr &&
+	check_deltas stderr = 0
+'
+
 test_expect_success 'pack-objects with bogus arguments' '
 	test_must_fail git pack-objects --window=0 test-1 blah blah <obj-list
 '
@@ -630,11 +635,6 @@ test_expect_success 'prefetch objects' '
 	test_line_count = 1 donelines
 '
 
-test_expect_success 'negative window clamps to 0' '
-	git pack-objects --progress --window=-1 neg-window <obj-list 2>stderr &&
-	check_deltas stderr = 0
-'
-
 for hash in sha1 sha256
 do
 	test_expect_success "verify-pack with $hash packfile" '
-- 
2.47.0.163.g1226f6d8fa-goog


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 4/4] index-pack: repack local links into promisor packs
  2024-11-01 20:11 ` [PATCH v2 0/4] " Jonathan Tan
                     ` (2 preceding siblings ...)
  2024-11-01 20:11   ` [PATCH v2 3/4] t5300: move --window clamp test next to unclamped Jonathan Tan
@ 2024-11-01 20:11   ` Jonathan Tan
  2024-11-04  0:22   ` [PATCH v2 0/4] When fetching from a promisor remote, repack local objects referenced Junio C Hamano
  4 siblings, 0 replies; 37+ messages in thread
From: Jonathan Tan @ 2024-11-01 20:11 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, steadmon, hanyang.tony, me

Teach index-pack to, when processing the objects in a pack with
--promisor specified on the CLI, repack local objects (and the local
objects that they refer to, recursively) referenced by these objects
into promisor packs.

This prevents the situation in which, when fetching from a promisor
remote, we end up with promisor objects (newly fetched) referring
to non-promisor objects (locally created prior to the fetch). This
situation may arise if the client had previously pushed objects to the
remote, for example. One issue that arises in this situation is that,
if the non-promisor objects become inaccessible except through promisor
objects (for example, if the branch pointing to them has moved to
point to the promisor object that refers to them), then GC will garbage
collect them. There are other ways to solve this, but the simplest
seems to be to enforce the invariant that we don't have promisor objects
referring to non-promisor objects.

This repacking is done from index-pack to minimize the performance
impact. During a fetch, the only time most objects are fully inflated
in memory is when their object ID is computed, so we also scan the
objects (to see which objects they refer to) during this time.

Also to minimize the performance impact, an object is calculated to be
local if it's a loose object or present in a non-promisor pack. (If it's
also in a promisor pack or referred to by an object in a promisor pack,
it is technically already a promisor object. But a misidentification
of a promisor object as a non-promisor object is relatively benign
here - we will thus repack that promisor object into a promisor pack,
duplicating it in the object store, but there is no correctness issue,
just an issue of inefficiency.)

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/git-index-pack.txt |   5 ++
 builtin/index-pack.c             | 110 ++++++++++++++++++++++++++++++-
 builtin/pack-objects.c           |  28 ++++++++
 t/t5616-partial-clone.sh         |  30 +++++++++
 4 files changed, 171 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-index-pack.txt b/Documentation/git-index-pack.txt
index 5a20deefd5..4be09e58e7 100644
--- a/Documentation/git-index-pack.txt
+++ b/Documentation/git-index-pack.txt
@@ -139,6 +139,11 @@ include::object-format-disclaimer.txt[]
 	written. If a `<message>` is provided, then that content will be
 	written to the .promisor file for future reference. See
 	link:technical/partial-clone.html[partial clone] for more information.
++
+Also, if there are objects in the given pack that references non-promisor
+objects (in the repo), repacks those non-promisor objects into a promisor
+pack. This avoids a situation in which a repo has non-promisor objects that are
+accessible through promisor objects.
 
 NOTES
 -----
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 9d23b41b3a..e4afd6725f 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -9,6 +9,7 @@
 #include "csum-file.h"
 #include "blob.h"
 #include "commit.h"
+#include "tag.h"
 #include "tree.h"
 #include "progress.h"
 #include "fsck.h"
@@ -20,9 +21,14 @@
 #include "object-file.h"
 #include "object-store-ll.h"
 #include "oid-array.h"
+#include "oidset.h"
+#include "path.h"
 #include "replace-object.h"
+#include "tree-walk.h"
 #include "promisor-remote.h"
+#include "run-command.h"
 #include "setup.h"
+#include "strvec.h"
 
 static const char index_pack_usage[] =
 "git index-pack [-v] [-o <index-file>] [--keep | --keep=<msg>] [--[no-]rev-index] [--verify] [--strict[=<msg-id>=<severity>...]] [--fsck-objects[=<msg-id>=<severity>...]] (<pack-file> | --stdin [--fix-thin] [<pack-file>])";
@@ -148,6 +154,13 @@ static uint32_t input_crc32;
 static int input_fd, output_fd;
 static const char *curr_pack;
 
+/*
+ * local_links is guarded by read_mutex, and record_local_links is read-only in
+ * a thread.
+ */
+static struct oidset local_links = OIDSET_INIT;
+static int record_local_links;
+
 static struct thread_local *thread_data;
 static int nr_dispatched;
 static int threads_active;
@@ -799,6 +812,44 @@ static int check_collison(struct object_entry *entry)
 	return 0;
 }
 
+static void record_if_local_object(const struct object_id *oid)
+{
+	struct object_info info = OBJECT_INFO_INIT;
+	if (oid_object_info_extended(the_repository, oid, &info, 0))
+		/* Missing; assume it is a promisor object */
+		return;
+	if (info.whence == OI_PACKED && info.u.packed.pack->pack_promisor)
+		return;
+	oidset_insert(&local_links, oid);
+}
+
+static void do_record_local_links(struct object *obj)
+{
+	if (obj->type == OBJ_TREE) {
+		struct tree *tree = (struct tree *)obj;
+		struct tree_desc desc;
+		struct name_entry entry;
+		if (init_tree_desc_gently(&desc, &tree->object.oid,
+					  tree->buffer, tree->size, 0))
+			/*
+			 * Error messages are given when packs are
+			 * verified, so do not print any here.
+			 */
+			return;
+		while (tree_entry_gently(&desc, &entry))
+			record_if_local_object(&entry.oid);
+	} else if (obj->type == OBJ_COMMIT) {
+		struct commit *commit = (struct commit *) obj;
+		struct commit_list *parents = commit->parents;
+
+		for (; parents; parents = parents->next)
+			record_if_local_object(&parents->item->object.oid);
+	} else if (obj->type == OBJ_TAG) {
+		struct tag *tag = (struct tag *) obj;
+		record_if_local_object(get_tagged_oid(tag));
+	}
+}
+
 static void sha1_object(const void *data, struct object_entry *obj_entry,
 			unsigned long size, enum object_type type,
 			const struct object_id *oid)
@@ -845,7 +896,7 @@ static void sha1_object(const void *data, struct object_entry *obj_entry,
 		free(has_data);
 	}
 
-	if (strict || do_fsck_object) {
+	if (strict || do_fsck_object || record_local_links) {
 		read_lock();
 		if (type == OBJ_BLOB) {
 			struct blob *blob = lookup_blob(the_repository, oid);
@@ -877,6 +928,8 @@ static void sha1_object(const void *data, struct object_entry *obj_entry,
 				die(_("fsck error in packed object"));
 			if (strict && fsck_walk(obj, NULL, &fsck_options))
 				die(_("Not all child objects of %s are reachable"), oid_to_hex(&obj->oid));
+			if (record_local_links)
+				do_record_local_links(obj);
 
 			if (obj->type == OBJ_TREE) {
 				struct tree *item = (struct tree *) obj;
@@ -1719,6 +1772,57 @@ static void show_pack_info(int stat_only)
 	free(chain_histogram);
 }
 
+static void repack_local_links(void)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	FILE *out;
+	struct strbuf line = STRBUF_INIT;
+	struct oidset_iter iter;
+	struct object_id *oid;
+	char *base_name;
+
+	if (!oidset_size(&local_links))
+		return;
+
+	base_name = mkpathdup("%s/pack/pack", repo_get_object_directory(the_repository));
+
+	strvec_push(&cmd.args, "pack-objects");
+	strvec_push(&cmd.args, "--exclude-promisor-objects-best-effort");
+	strvec_push(&cmd.args, base_name);
+	cmd.git_cmd = 1;
+	cmd.in = -1;
+	cmd.out = -1;
+	if (start_command(&cmd))
+		die(_("could not start pack-objects to repack local links"));
+
+	oidset_iter_init(&local_links, &iter);
+	while ((oid = oidset_iter_next(&iter))) {
+		if (write_in_full(cmd.in, oid_to_hex(oid), the_hash_algo->hexsz) < 0 ||
+		    write_in_full(cmd.in, "\n", 1) < 0)
+			die(_("failed to feed local object to pack-objects"));
+	}
+	close(cmd.in);
+
+	out = xfdopen(cmd.out, "r");
+	while (strbuf_getline_lf(&line, out) != EOF) {
+		unsigned char binary[GIT_MAX_RAWSZ];
+		if (line.len != the_hash_algo->hexsz ||
+		    !hex_to_bytes(binary, line.buf, line.len))
+			die(_("index-pack: Expecting full hex object ID lines only from pack-objects."));
+
+		/*
+		 * pack-objects creates the .pack and .idx files, but not the
+		 * .promisor file. Create the .promisor file, which is empty.
+		 */
+		write_special_file("promisor", "", NULL, binary, NULL);
+	}
+
+	fclose(out);
+	if (finish_command(&cmd))
+		die(_("could not finish pack-objects to repack local links"));
+	strbuf_release(&line);
+}
+
 int cmd_index_pack(int argc,
 		   const char **argv,
 		   const char *prefix,
@@ -1794,7 +1898,7 @@ int cmd_index_pack(int argc,
 			} else if (skip_to_optional_arg(arg, "--keep", &keep_msg)) {
 				; /* nothing to do */
 			} else if (skip_to_optional_arg(arg, "--promisor", &promisor_msg)) {
-				; /* already parsed */
+				record_local_links = 1;
 			} else if (starts_with(arg, "--threads=")) {
 				char *end;
 				nr_threads = strtoul(arg+10, &end, 0);
@@ -1970,6 +2074,8 @@ int cmd_index_pack(int argc,
 		free((void *) curr_index);
 	free(curr_rev_index);
 
+	repack_local_links();
+
 	/*
 	 * Let the caller know this pack is not self contained
 	 */
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 0fc0680b40..51d2ffe490 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -239,6 +239,7 @@ static enum {
 static uint16_t write_bitmap_options = BITMAP_OPT_HASH_CACHE;
 
 static int exclude_promisor_objects;
+static int exclude_promisor_objects_best_effort;
 
 static int use_delta_islands;
 
@@ -4312,6 +4313,18 @@ static int option_parse_cruft_expiration(const struct option *opt UNUSED,
 	return 0;
 }
 
+static int is_not_in_promisor_pack_obj(struct object *obj, void *data UNUSED)
+{
+	struct object_info info = OBJECT_INFO_INIT;
+	if (oid_object_info_extended(the_repository, &obj->oid, &info, 0))
+		BUG("should_include_obj should only be called on existing objects");
+	return info.whence != OI_PACKED || !info.u.packed.pack->pack_promisor;
+}
+
+static int is_not_in_promisor_pack(struct commit *commit, void *data) {
+	return is_not_in_promisor_pack_obj((struct object *) commit, data);
+}
+
 int cmd_pack_objects(int argc,
 		     const char **argv,
 		     const char *prefix,
@@ -4424,6 +4437,9 @@ int cmd_pack_objects(int argc,
 		  option_parse_missing_action),
 		OPT_BOOL(0, "exclude-promisor-objects", &exclude_promisor_objects,
 			 N_("do not pack objects in promisor packfiles")),
+		OPT_BOOL(0, "exclude-promisor-objects-best-effort",
+			 &exclude_promisor_objects_best_effort,
+			 N_("implies --missing=allow-any")),
 		OPT_BOOL(0, "delta-islands", &use_delta_islands,
 			 N_("respect islands during delta compression")),
 		OPT_STRING_LIST(0, "uri-protocol", &uri_protocols,
@@ -4504,10 +4520,18 @@ int cmd_pack_objects(int argc,
 		strvec_push(&rp, "--unpacked");
 	}
 
+	if (exclude_promisor_objects && exclude_promisor_objects_best_effort)
+		die(_("options '%s' and '%s' cannot be used together"),
+		    "--exclude-promisor-objects", "--exclude-promisor-objects-best-effort");
 	if (exclude_promisor_objects) {
 		use_internal_rev_list = 1;
 		fetch_if_missing = 0;
 		strvec_push(&rp, "--exclude-promisor-objects");
+	} else if (exclude_promisor_objects_best_effort) {
+		use_internal_rev_list = 1;
+		fetch_if_missing = 0;
+		option_parse_missing_action(NULL, "allow-any", 0);
+		/* revs configured below */
 	}
 	if (unpack_unreachable || keep_unreachable || pack_loose_unreachable)
 		use_internal_rev_list = 1;
@@ -4627,6 +4651,10 @@ int cmd_pack_objects(int argc,
 
 		repo_init_revisions(the_repository, &revs, NULL);
 		list_objects_filter_copy(&revs.filter, &filter_options);
+		if (exclude_promisor_objects_best_effort) {
+			revs.include_check = is_not_in_promisor_pack;
+			revs.include_check_obj = is_not_in_promisor_pack_obj;
+		}
 		get_object_list(&revs, rp.nr, rp.v);
 		release_revisions(&revs);
 	}
diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
index c53e93be2f..2e67f59f89 100755
--- a/t/t5616-partial-clone.sh
+++ b/t/t5616-partial-clone.sh
@@ -694,6 +694,36 @@ test_expect_success 'lazy-fetch in submodule succeeds' '
 	git -C client restore --recurse-submodules --source=HEAD^ :/
 '
 
+test_expect_success 'after fetching descendants of non-promisor commits, gc works' '
+	# Setup
+	git init full &&
+	git -C full config uploadpack.allowfilter 1 &&
+ 	git -C full config uploadpack.allowanysha1inwant 1 &&
+	touch full/foo &&
+	git -C full add foo &&
+	git -C full commit -m "commit 1" &&
+	git -C full checkout --detach &&
+
+	# Partial clone and push commit to remote
+	git clone "file://$(pwd)/full" --filter=blob:none partial &&
+	echo "hello" > partial/foo &&
+	git -C partial commit -a -m "commit 2" &&
+	git -C partial push &&
+
+	# gc in partial repo
+	git -C partial gc --prune=now &&
+
+	# Create another commit in normal repo
+	git -C full checkout main &&
+	echo " world" >> full/foo &&
+	git -C full commit -a -m "commit 3" &&
+
+	# Pull from remote in partial repo, and run gc again
+	git -C partial pull &&
+	git -C partial gc --prune=now
+'
+
+
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd
 
-- 
2.47.0.163.g1226f6d8fa-goog


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 5/5] index-pack: repack local links into promisor packs
  2024-10-30 22:29   ` Josh Steadmon
@ 2024-11-01 20:14     ` Jonathan Tan
  0 siblings, 0 replies; 37+ messages in thread
From: Jonathan Tan @ 2024-11-01 20:14 UTC (permalink / raw)
  To: Josh Steadmon; +Cc: Jonathan Tan, git, calvinwan, hanyang.tony

Josh Steadmon <steadmon@google.com> writes:
> > @@ -1719,6 +1772,57 @@ static void show_pack_info(int stat_only)
> >  	free(chain_histogram);
> >  }
> >  
> > +static void repack_local_links(void)
> > +{
> > +	struct child_process cmd = CHILD_PROCESS_INIT;
> > +	FILE *out;
> > +	struct strbuf line = STRBUF_INIT;
> > +	struct oidset_iter iter;
> > +	struct object_id *oid;
> > +	char *base_name;
> > +
> > +	if (!oidset_size(&local_links))
> > +		return;
> > +
> > +	base_name = mkpathdup("%s/pack/pack", repo_get_object_directory(the_repository));
> > +
> > +	strvec_push(&cmd.args, "pack-objects");
> > +	strvec_push(&cmd.args, "--exclude-promisor-objects-best-effort");
> > +	strvec_push(&cmd.args, base_name);
> > +	cmd.git_cmd = 1;
> > +	cmd.in = -1;
> > +	cmd.out = -1;
> > +	if (start_command(&cmd))
> > +		die(_("could not start pack-objects to repack local links"));
> > +
> > +	oidset_iter_init(&local_links, &iter);
> > +	while ((oid = oidset_iter_next(&iter))) {
> > +		if (write_in_full(cmd.in, oid_to_hex(oid), the_hash_algo->hexsz) < 0 ||
> > +		    write_in_full(cmd.in, "\n", 1) < 0)
> > +			die(_("failed to feed local object to pack-objects"));
> > +	}
> > +	close(cmd.in);
> > +
> > +	out = xfdopen(cmd.out, "r");
> > +	while (strbuf_getline_lf(&line, out) != EOF) {
> > +		unsigned char binary[GIT_MAX_RAWSZ];
> > +		if (line.len != the_hash_algo->hexsz ||
> > +		    !hex_to_bytes(binary, line.buf, line.len))
> > +			die(_("index-pack: Expecting full hex object ID lines only from pack-objects."));
> 
> I'm not sure why we check the pack-objects output here, is this just to
> detect errors? Could we instead just check the exit status of
> pack-objects, and discard the output?

The output is a hex object ID that tells us what we need to name
the .promisor file. (Later in this function, "binary" is used.) So we
can't discard it.

> > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> > index e15fbaeb21..a565ab9b40 100644
> > --- a/builtin/pack-objects.c
> > +++ b/builtin/pack-objects.c
> > @@ -4310,6 +4310,18 @@ static int option_parse_cruft_expiration(const struct option *opt UNUSED,
> >  	return 0;
> >  }
> >  
> > +static int should_include_obj(struct object *obj, void *data UNUSED)
> > +{
> > +	struct object_info info = OBJECT_INFO_INIT;
> > +	if (oid_object_info_extended(the_repository, &obj->oid, &info, 0))
> > +		BUG("should_include_obj should only be called on existing objects");
> > +	return info.whence != OI_PACKED || !info.u.packed.pack->pack_promisor;
> > +}
> > +
> > +static int should_include(struct commit *commit, void *data) {
> > +	return should_include_obj((struct object *) commit, data);
> > +}
> > +
> 
> Nit: these two functions could be named a bit more descriptively.

OK, done.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [External] [PATCH 0/5] When fetching from a promisor remote, repack local objects referenced
  2024-10-25 21:07   ` Taylor Blau
@ 2024-11-02 10:38     ` Junio C Hamano
  0 siblings, 0 replies; 37+ messages in thread
From: Junio C Hamano @ 2024-11-02 10:38 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Han Young, Jonathan Tan, git, calvinwan

Taylor Blau <me@ttaylorr.com> writes:

> On Fri, Oct 25, 2024 at 02:04:21PM +0800, Han Young wrote:
>> On Fri, Oct 25, 2024 at 2:09 AM Jonathan Tan <jonathantanmy@google.com> wrote:
>> >
>> > This is a polished version of [1], also with all the test failures
>> > debugged and addressed.
>>
>> Thanks! I think I can drop my "repack all" patches now. :)
>
> Thanks for saying so, I dropped the 'hy/partial-repack-fix' branch from
> my tree.

I'll drop the topic, together with its "Need review." comment, from
the "What's Cooking" draft, then.  Thanks, all.

JTan's 4-patch series should be queued instead, I presume, which
I'll look into next.




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 0/4] When fetching from a promisor remote, repack local objects referenced
  2024-11-01 20:11 ` [PATCH v2 0/4] " Jonathan Tan
                     ` (3 preceding siblings ...)
  2024-11-01 20:11   ` [PATCH v2 4/4] index-pack: repack local links into promisor packs Jonathan Tan
@ 2024-11-04  0:22   ` Junio C Hamano
  2024-11-04  2:05     ` Junio C Hamano
  4 siblings, 1 reply; 37+ messages in thread
From: Junio C Hamano @ 2024-11-04  0:22 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, steadmon, hanyang.tony, me

Jonathan Tan <jonathantanmy@google.com> writes:

> Thanks everyone for looking at it. Here's version 2.
>
> Jonathan Tan (4):
>   t0410: make test description clearer
>   t0410: use from-scratch server
>   t5300: move --window clamp test next to unclamped
>   index-pack: repack local links into promisor packs
>
>  Documentation/git-index-pack.txt |   5 ++
>  builtin/index-pack.c             | 110 ++++++++++++++++++++++++++++++-
>  builtin/pack-objects.c           |  28 ++++++++
>  t/t0410-partial-clone.sh         |   6 +-
>  t/t5300-pack-object.sh           |  10 +--
>  t/t5616-partial-clone.sh         |  30 +++++++++
>  6 files changed, 179 insertions(+), 10 deletions(-)

The reasoning behind each commit seems to be very well described.

Thanks, queued.

You may already have noticed this, but with this topic merged,
'seen' seems to start failing leak checker jobs e.g.

https://github.com/git/git/actions/runs/11642458705
  https://github.com/git/git/actions/runs/11642458705/job/32422074222#step:4:1964
  https://github.com/git/git/actions/runs/11642458705/job/32422074132#step:4:1963


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 0/4] When fetching from a promisor remote, repack local objects referenced
  2024-11-04  0:22   ` [PATCH v2 0/4] When fetching from a promisor remote, repack local objects referenced Junio C Hamano
@ 2024-11-04  2:05     ` Junio C Hamano
  0 siblings, 0 replies; 37+ messages in thread
From: Junio C Hamano @ 2024-11-04  2:05 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, steadmon, hanyang.tony, me

Junio C Hamano <gitster@pobox.com> writes:

> You may already have noticed this, but with this topic merged,
> 'seen' seems to start failing leak checker jobs e.g.
>
> https://github.com/git/git/actions/runs/11642458705
>   https://github.com/git/git/actions/runs/11642458705/job/32422074222#step:4:1964
>   https://github.com/git/git/actions/runs/11642458705/job/32422074132#step:4:1963

With this patch, t5300 no longer seems to leak (when the topic is
tested alone, at least).

 builtin/index-pack.c | 1 +
 1 file changed, 1 insertion(+)

diff --git c/builtin/index-pack.c w/builtin/index-pack.c
index e4afd6725f..08b340552f 100644
--- c/builtin/index-pack.c
+++ w/builtin/index-pack.c
@@ -1821,6 +1821,7 @@ static void repack_local_links(void)
 	if (finish_command(&cmd))
 		die(_("could not finish pack-objects to repack local links"));
 	strbuf_release(&line);
+	free(base_name);
 }
 
 int cmd_index_pack(int argc,

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 3/4] t5300: move --window clamp test next to unclamped
  2024-11-01 20:11   ` [PATCH v2 3/4] t5300: move --window clamp test next to unclamped Jonathan Tan
@ 2024-11-13  7:35     ` Jeff King
  2024-11-13 18:26       ` Jonathan Tan
  2024-11-14  0:59       ` [PATCH v2 3/4] t5300: move --window clamp test next to unclamped Jeff King
  0 siblings, 2 replies; 37+ messages in thread
From: Jeff King @ 2024-11-13  7:35 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, steadmon, hanyang.tony, me

On Fri, Nov 01, 2024 at 01:11:47PM -0700, Jonathan Tan wrote:

> A subsequent commit will change the behavior of "git index-pack
> --promisor", which is exercised in "build pack index for an existing
> pack", causing the unclamped and clamped versions of the --window
> test to exhibit different behavior. Move the clamp test closer to the
> unclamped test that it references.

Hmm. The change in patch 4 broke another similar --window test I had in
a topic in flight. I can probably move it to match what you've done
here, but I feel like this may be papering over a bigger issue.

The reason these window tests are broken is that the earlier "build pack
index for an existing pack" is now finding and storing deltas in a new
pack when it does this:

  git index-pack --promisor=message test-3.pack &&

But that command is indexing a pack that is not even in the repository's
object store at all! Yet it triggers a call to pack-objects that repacks
within that object store.

Here's an even more extreme version. You do not need to have a
repository at all to run index-pack. So doing:

  mkdir /tmp/foo
  cd /tmp/foo
  cp /some/repo/.git/objects/pack/*.pack .
  for i in *.pack; do
    git index-pack -v --promisor=foo $i
  done

used to work, but with your patches will segfault (because the repo
pointer is NULL). Granted it's odd to pass --promisor when you are not
in a repo, but certainly we should never segfault.

So I think at the very least that index-pack should not try to modify
the repository's object database unless we are indexing a pack that is
within it, which would fix both of those issues.

I'd guess in the real world, we'd only pass that option when indexing
packs that we just fetched. But as a bystander to this feature, it feels
quite odd to me that index-pack, which I generally consider a "read
only" operation except for the index it was asked to write, would be
creating a new pack like this. I didn't follow the topic closely enough
to comment more intelligently, but would it be possible for the caller
of index-pack to trigger the repack as an independent step?

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 3/4] t5300: move --window clamp test next to unclamped
  2024-11-13  7:35     ` Jeff King
@ 2024-11-13 18:26       ` Jonathan Tan
  2024-11-14  0:56         ` Jeff King
  2024-11-14  0:59       ` [PATCH v2 3/4] t5300: move --window clamp test next to unclamped Jeff King
  1 sibling, 1 reply; 37+ messages in thread
From: Jonathan Tan @ 2024-11-13 18:26 UTC (permalink / raw)
  To: Jeff King; +Cc: Jonathan Tan, git, steadmon, hanyang.tony, me

Jeff King <peff@peff.net> writes:
> On Fri, Nov 01, 2024 at 01:11:47PM -0700, Jonathan Tan wrote:
> 
> > A subsequent commit will change the behavior of "git index-pack
> > --promisor", which is exercised in "build pack index for an existing
> > pack", causing the unclamped and clamped versions of the --window
> > test to exhibit different behavior. Move the clamp test closer to the
> > unclamped test that it references.
> 
> Hmm. The change in patch 4 broke another similar --window test I had in
> a topic in flight. I can probably move it to match what you've done
> here, but I feel like this may be papering over a bigger issue.
> 
> The reason these window tests are broken is that the earlier "build pack
> index for an existing pack" is now finding and storing deltas in a new
> pack when it does this:
> 
>   git index-pack --promisor=message test-3.pack &&
> 
> But that command is indexing a pack that is not even in the repository's
> object store at all! Yet it triggers a call to pack-objects that repacks
> within that object store.

As far as I know, index-pack, when run as part of fetch, indexes a pack
that's not in the repository's object store; it indexes a packfile in a
temp directory. (So I don't think this is a strange thing to do.)

> Here's an even more extreme version. You do not need to have a
> repository at all to run index-pack. So doing:
> 
>   mkdir /tmp/foo
>   cd /tmp/foo
>   cp /some/repo/.git/objects/pack/*.pack .
>   for i in *.pack; do
>     git index-pack -v --promisor=foo $i
>   done
> 
> used to work, but with your patches will segfault (because the repo
> pointer is NULL). Granted it's odd to pass --promisor when you are not
> in a repo, but certainly we should never segfault.

Ah, good catch.

> So I think at the very least that index-pack should not try to modify
> the repository's object database unless we are indexing a pack that is
> within it, which would fix both of those issues.
> 
> I'd guess in the real world, we'd only pass that option when indexing
> packs that we just fetched. But as a bystander to this feature, it feels
> quite odd to me that index-pack, which I generally consider a "read
> only" operation except for the index it was asked to write, would be
> creating a new pack like this. I didn't follow the topic closely enough
> to comment more intelligently, but would it be possible for the caller
> of index-pack to trigger the repack as an independent step?
> 
> -Peff

I thought of that, but as far as I know, during a fetch, index-pack is
the only time in which the objects in the fetched pack are uncompressed
in memory. There have been concerns about the performance of various
ways of solving the promisor-object-and-GC bug, so I took an approach
that minimizes the performance hit as much as possible, by avoiding yet
another uncompression (we need to uncompress the objects to find their
outgoing links, so that we know what to repack).

We definitely should prevent the segfault, but I think that's better
done by making --promisor only work if we run index-pack from within
a repo. I don't think we can restrict the repacking to run only if
we're indexing a pack within the repo, because in our fetch case, we're
indexing a new pack - not one within the repo.

Maybe we could conceptualize "index-pack --promisor" as the pack giving
"testimony" about objects that its objects link to, so we can update our
own records.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 3/4] t5300: move --window clamp test next to unclamped
  2024-11-13 18:26       ` Jonathan Tan
@ 2024-11-14  0:56         ` Jeff King
  2024-11-14  6:41           ` Junio C Hamano
  2024-11-15 19:55           ` Jonathan Tan
  0 siblings, 2 replies; 37+ messages in thread
From: Jeff King @ 2024-11-14  0:56 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, steadmon, hanyang.tony, me

On Wed, Nov 13, 2024 at 10:26:56AM -0800, Jonathan Tan wrote:

> > The reason these window tests are broken is that the earlier "build pack
> > index for an existing pack" is now finding and storing deltas in a new
> > pack when it does this:
> > 
> >   git index-pack --promisor=message test-3.pack &&
> > 
> > But that command is indexing a pack that is not even in the repository's
> > object store at all! Yet it triggers a call to pack-objects that repacks
> > within that object store.
> 
> As far as I know, index-pack, when run as part of fetch, indexes a pack
> that's not in the repository's object store; it indexes a packfile in a
> temp directory. (So I don't think this is a strange thing to do.)

When fetching (or receiving a push), we use "index-pack --stdin" and do
write the resulting pack into the repository (and the command will
complain if there is no repository).

So I think what the test is doing above (using --promisor on a random
packfile) is questionable. It goes back to 1f52cdfacb (index-pack:
document and test the --promisor option, 2022-03-09). I wonder if that
test is actually valuable now that the --promisor option is actually
triggered via fetch.

If we restricted --promisor to work only with --stdin, that would deal
with both of the issues I saw. And I suspect (but didn't dig deeply or
test) that would be sufficient for how it is called within git. (I
wondered briefly if bundles might index in-place, but they seem to use
--stdin also).


Where it gets weirder to me is with quarantine directories (and maybe
this is what you meant above). On receiving a push, we "index --stdin"
into a temporary quarantine directory. If that kicks off a pack-objects
run, where does that pack-objects put its new pack? Within the
quarantined index-pack we set GIT_OBJECT_DIRECTORY to the quarantine and
add the original repo as an alternate. So I _think_ both the pushed-up
pack and the repacked promisor pack would go into the quarantine dir,
and then we'd migrate both (or neither) when we commit to the push.

Which is OK, but I don't know that I thought that far ahead when writing
the quarantine stuff long ago.

It's probably somewhat academic right now, as I'm not sure if you can
even push reliably into a promisor repo (and it doesn't look like
receive-pack knows about passing --promisor anyway). We don't quarantine
on fetch right now, though we have discussed it in the past (and I think
we should consider doing it).

So this may become more real in the future. I wonder if there is a way
to add a test to future-proof against changes to how the quarantine
system works. The theoretical problem case is if we did quarantine
fetches, but accidentally wrote the new promisor pack into the main
repo instead of the quarantine, and then a fetch rejected the incoming
pack (because of a hook, failed connectivity check, etc). Then we'd end
up with the new promisor pack when we shouldn't, which I guess could
move objects from that incoming pack that we rejected into the main
repo, despite the quarantine?

I can't think of a way to test that now, without the quarantine-on-fetch
feature existing.

> > I'd guess in the real world, we'd only pass that option when indexing
> > packs that we just fetched. But as a bystander to this feature, it feels
> > quite odd to me that index-pack, which I generally consider a "read
> > only" operation except for the index it was asked to write, would be
> > creating a new pack like this. I didn't follow the topic closely enough
> > to comment more intelligently, but would it be possible for the caller
> > of index-pack to trigger the repack as an independent step?
> 
> I thought of that, but as far as I know, during a fetch, index-pack is
> the only time in which the objects in the fetched pack are uncompressed
> in memory. There have been concerns about the performance of various
> ways of solving the promisor-object-and-GC bug, so I took an approach
> that minimizes the performance hit as much as possible, by avoiding yet
> another uncompression (we need to uncompress the objects to find their
> outgoing links, so that we know what to repack).

Hmm, yeah. I can see the appeal of doing the processing there. Kicking
off a pack-objects can be similarly expensive in the worst case, but in
practice it should only be dealing with a few new objects (and we want
to cheaply find out what those objects are, if there are even any at
all).

> We definitely should prevent the segfault, but I think that's better
> done by making --promisor only work if we run index-pack from within a
> repo. I don't think we can restrict the repacking to run only if we're
> indexing a pack within the repo, because in our fetch case, we're
> indexing a new pack - not one within the repo.

I think the "--stdin" thing above neatly solves this.

> Maybe we could conceptualize "index-pack --promisor" as the pack
> giving "testimony" about objects that its objects link to, so we can
> update our own records.

Yeah, I guess the fundamental thing here is that anybody who isn't
passing "--promisor" is not going to be affected, so that at least
limits the opportunity for surprise.

The quarantine discussion above is an example of how there could be
unexpected consequences. I _think_ it's OK based on what I wrote, but
hopefully that explains my general feeling of surprise. I dunno. It
still may be the least bad thing.

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 3/4] t5300: move --window clamp test next to unclamped
  2024-11-13  7:35     ` Jeff King
  2024-11-13 18:26       ` Jonathan Tan
@ 2024-11-14  0:59       ` Jeff King
  1 sibling, 0 replies; 37+ messages in thread
From: Jeff King @ 2024-11-14  0:59 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, steadmon, hanyang.tony, me

On Wed, Nov 13, 2024 at 02:35:00AM -0500, Jeff King wrote:

> On Fri, Nov 01, 2024 at 01:11:47PM -0700, Jonathan Tan wrote:
> 
> > A subsequent commit will change the behavior of "git index-pack
> > --promisor", which is exercised in "build pack index for an existing
> > pack", causing the unclamped and clamped versions of the --window
> > test to exhibit different behavior. Move the clamp test closer to the
> > unclamped test that it references.
> 
> Hmm. The change in patch 4 broke another similar --window test I had in
> a topic in flight. I can probably move it to match what you've done
> here, but I feel like this may be papering over a bigger issue.
> 
> The reason these window tests are broken is that the earlier "build pack
> index for an existing pack" is now finding and storing deltas in a new
> pack when it does this:
> 
>   git index-pack --promisor=message test-3.pack &&

BTW, an alternate fix instead of moving the test is below. But maybe not
worth revisiting since it's already in next.

-- >8 --
Subject: [PATCH] t5300: use --no-reuse-delta for --window test

In the test added for 953aa54e1a (pack-objects: clamp negative window
size to 0, 2021-05-01), we expect that dropping the --window parameter
will mean the resulting pack does not have any deltas. But this
expectation would not hold if there are deltas from an on-disk pack that
are reused.

This makes the test fragile with respect to the existing repository
state. It works reliably now, but changes to earlier tests could produce
packs that violate the assumption.

We can make the test more reliable by passing --no-reuse-delta, meaning
we will only output deltas we find in the current run (using --window).

Signed-off-by: Jeff King <peff@peff.net>
---
 t/t5300-pack-object.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index 3b9dae331a..392d6a4d41 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -631,7 +631,7 @@ test_expect_success 'prefetch objects' '
 '
 
 test_expect_success 'negative window clamps to 0' '
-	git pack-objects --progress --window=-1 neg-window <obj-list 2>stderr &&
+	git pack-objects --progress --no-reuse-delta --window=-1 neg-window <obj-list 2>stderr &&
 	check_deltas stderr = 0
 '
 
-- 
2.47.0.527.gfb211c7f3b


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 3/4] t5300: move --window clamp test next to unclamped
  2024-11-14  0:56         ` Jeff King
@ 2024-11-14  6:41           ` Junio C Hamano
  2024-11-15  9:52             ` Jeff King
  2024-11-15 19:55           ` Jonathan Tan
  1 sibling, 1 reply; 37+ messages in thread
From: Junio C Hamano @ 2024-11-14  6:41 UTC (permalink / raw)
  To: Jeff King; +Cc: Jonathan Tan, git, steadmon, hanyang.tony, me

Jeff King <peff@peff.net> writes:

>> As far as I know, index-pack, when run as part of fetch, indexes a pack
>> that's not in the repository's object store; it indexes a packfile in a
>> temp directory. (So I don't think this is a strange thing to do.)
>
> When fetching (or receiving a push), we use "index-pack --stdin" and do
> write the resulting pack into the repository (and the command will
> complain if there is no repository).
> ...
>> We definitely should prevent the segfault, but I think that's better
>> done by making --promisor only work if we run index-pack from within a
>> repo. I don't think we can restrict the repacking to run only if we're
>> indexing a pack within the repo, because in our fetch case, we're
>> indexing a new pack - not one within the repo.
>
> I think the "--stdin" thing above neatly solves this.
> ...
> Yeah, I guess the fundamental thing here is that anybody who isn't
> passing "--promisor" is not going to be affected, so that at least
> limits the opportunity for surprise.
>
> The quarantine discussion above is an example of how there could be
> unexpected consequences. I _think_ it's OK based on what I wrote, but
> hopefully that explains my general feeling of surprise. I dunno. It
> still may be the least bad thing.

Tying this extra processing to the use of "--stdin" is not exactly
intuitive, in that a "--stdin" user is not necessarily doing a fetch
(even though a fetch may always use "--stdin"), but I guess it is a
good enough approximation (and the best one easily available to us)
if we want to safeguard the use of this "--promisor" logic only to
fetch client.

As to future potential mis-interaction between quarantined fetch and
the effect of this "repack local objects that can be reached by
objects in a promisor pack" feature, I do not offhand think of a
good way to future-proof it with tests.

Thanks for the discussion, both of you.



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 3/4] t5300: move --window clamp test next to unclamped
  2024-11-14  6:41           ` Junio C Hamano
@ 2024-11-15  9:52             ` Jeff King
  0 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2024-11-15  9:52 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, git, steadmon, hanyang.tony, me

On Thu, Nov 14, 2024 at 03:41:08PM +0900, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> >> As far as I know, index-pack, when run as part of fetch, indexes a pack
> >> that's not in the repository's object store; it indexes a packfile in a
> >> temp directory. (So I don't think this is a strange thing to do.)
> >
> > When fetching (or receiving a push), we use "index-pack --stdin" and do
> > write the resulting pack into the repository (and the command will
> > complain if there is no repository).
> > ...
> >> We definitely should prevent the segfault, but I think that's better
> >> done by making --promisor only work if we run index-pack from within a
> >> repo. I don't think we can restrict the repacking to run only if we're
> >> indexing a pack within the repo, because in our fetch case, we're
> >> indexing a new pack - not one within the repo.
> >
> > I think the "--stdin" thing above neatly solves this.
> [...]
> 
> Tying this extra processing to the use of "--stdin" is not exactly
> intuitive, in that a "--stdin" user is not necessarily doing a fetch
> (even though a fetch may always use "--stdin"), but I guess it is a
> good enough approximation (and the best one easily available to us)
> if we want to safeguard the use of this "--promisor" logic only to
> fetch client.

I think the "--stdin" thing is a bit more general than that. Even though
we expect to use it with --promisor only under fetch, the real rule is
more like: only do the extra --promisor repacking when indexing a pack
that will be made available in the repository. With --stdin, we know
the result will be available because index-pack itself will write the
pack into the repository. And that is true whether it is fetch, push, or
some other script driving it.

When being fed a path to a pack on the command line, then "is it
available in the repo" would involve some path comparisons to see if
it's in the object database directory. Probably not that hard, but not
entirely trivial (due to normalization, etc). But since we care only
about fetch and --stdin, it seemed like an easy cheat to simply disallow
the other case for now, erring on the conservative side.

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 3/4] t5300: move --window clamp test next to unclamped
  2024-11-14  0:56         ` Jeff King
  2024-11-14  6:41           ` Junio C Hamano
@ 2024-11-15 19:55           ` Jonathan Tan
  2024-11-16  3:23             ` Jeff King
  1 sibling, 1 reply; 37+ messages in thread
From: Jonathan Tan @ 2024-11-15 19:55 UTC (permalink / raw)
  To: Jeff King; +Cc: Jonathan Tan, git, steadmon, hanyang.tony, me

Jeff King <peff@peff.net> writes:
> Where it gets weirder to me is with quarantine directories (and maybe
> this is what you meant above). On receiving a push, we "index --stdin"
> into a temporary quarantine directory. If that kicks off a pack-objects
> run, where does that pack-objects put its new pack? Within the
> quarantined index-pack we set GIT_OBJECT_DIRECTORY to the quarantine and
> add the original repo as an alternate. So I _think_ both the pushed-up
> pack and the repacked promisor pack would go into the quarantine dir,
> and then we'd migrate both (or neither) when we commit to the push.
> 
> Which is OK, but I don't know that I thought that far ahead when writing
> the quarantine stuff long ago.
> 
> It's probably somewhat academic right now, as I'm not sure if you can
> even push reliably into a promisor repo (and it doesn't look like
> receive-pack knows about passing --promisor anyway).

Thanks for this description. Such a push would be an "I am pushing a
pack with missing objects to you, and you can later get those missing
objects from me" situation. Not completely implausible, but doesn't seem
high-priority to me.

> We don't quarantine
> on fetch right now, though we have discussed it in the past (and I think
> we should consider doing it).
> 
> So this may become more real in the future. I wonder if there is a way
> to add a test to future-proof against changes to how the quarantine
> system works. The theoretical problem case is if we did quarantine
> fetches, but accidentally wrote the new promisor pack into the main
> repo instead of the quarantine, and then a fetch rejected the incoming
> pack (because of a hook, failed connectivity check, etc). Then we'd end
> up with the new promisor pack when we shouldn't, which I guess could
> move objects from that incoming pack that we rejected into the main
> repo, despite the quarantine?
> 
> I can't think of a way to test that now, without the quarantine-on-fetch
> feature existing.

Quarantine on fetch does seem like a good idea. I also can't think of
a way to test that now. Although, for the fetch case, my patch set is
not the first time that an extra packfile (that is, a packfile not in
the "packfile" section of the fetch response) could be written during
a fetch: packfile-uris and bundle-uris already exist. So I would hope
that the implementor of the fetch quarantine feature would be aware of
at least one of these extra features, and design the test to check that
absolutely no packfiles are written if the fetch is rejected. (So I
don't think the future needs to be "proofed" so much.)


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 3/4] t5300: move --window clamp test next to unclamped
  2024-11-15 19:55           ` Jonathan Tan
@ 2024-11-16  3:23             ` Jeff King
  2024-11-18 19:02               ` [PATCH] index-pack: teach --promisor to require --stdin Jonathan Tan
  2024-11-19 20:10               ` [PATCH v2] index-pack: teach --promisor to forbid pack name Jonathan Tan
  0 siblings, 2 replies; 37+ messages in thread
From: Jeff King @ 2024-11-16  3:23 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, steadmon, hanyang.tony, me

On Fri, Nov 15, 2024 at 11:55:03AM -0800, Jonathan Tan wrote:

> > So this may become more real in the future. I wonder if there is a way
> > to add a test to future-proof against changes to how the quarantine
> > system works. The theoretical problem case is if we did quarantine
> > fetches, but accidentally wrote the new promisor pack into the main
> > repo instead of the quarantine, and then a fetch rejected the incoming
> > pack (because of a hook, failed connectivity check, etc). Then we'd end
> > up with the new promisor pack when we shouldn't, which I guess could
> > move objects from that incoming pack that we rejected into the main
> > repo, despite the quarantine?
> > 
> > I can't think of a way to test that now, without the quarantine-on-fetch
> > feature existing.
> 
> Quarantine on fetch does seem like a good idea. I also can't think of
> a way to test that now. Although, for the fetch case, my patch set is
> not the first time that an extra packfile (that is, a packfile not in
> the "packfile" section of the fetch response) could be written during
> a fetch: packfile-uris and bundle-uris already exist. So I would hope
> that the implementor of the fetch quarantine feature would be aware of
> at least one of these extra features, and design the test to check that
> absolutely no packfiles are written if the fetch is rejected. (So I
> don't think the future needs to be "proofed" so much.)

Good point. I think we have to just leave it until that hypothetical
future and hope that person is careful. :)

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] index-pack: teach --promisor to require --stdin
  2024-11-16  3:23             ` Jeff King
@ 2024-11-18 19:02               ` Jonathan Tan
  2024-11-19  3:29                 ` Junio C Hamano
  2024-11-19 18:53                 ` Jeff King
  2024-11-19 20:10               ` [PATCH v2] index-pack: teach --promisor to forbid pack name Jonathan Tan
  1 sibling, 2 replies; 37+ messages in thread
From: Jonathan Tan @ 2024-11-18 19:02 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, peff, stolee

Currently,

 - Running "index-pack --promisor" outside a repo segfaults.
 - It may be confusing to a user that running "index-pack --promisor"
   within a repo may make changes to the repo's object DB, especially
   since the packs indexed by the index-pack invocation may not even be
   related to the repo.

As discussed in [1], teaching --promisor to require --stdin and forbid a
packfile name solves both these problems. This combination of arguments
requires a repo (since we are writing the resulting .pack and .idx to
it) and it is clear that the files are related to the repo.

Currently, Git uses "index-pack --promisor" only when fetching into
a repo, so it could be argued that we should teach "index-pack" a new
argument (say, "--fetching-mode") instead of tying --promisor to a
generic argument like "--stdin". However, this --promisor feature could
conceivably be used whenever we have a packfile that is known to come
from the promisor remote (whether obtained through Git's fetch protocol
or through other means) so it seems reasonable to use --stdin here -
one could envision a user-made script obtaining a packfile and then
running "index-pack --promisor --stdin", for example. In fact, it might
be possible to relax the restriction further (say, by also allowing
--promisor when indexing a packfile that is in the object DB), but
relaxing the restriction is backwards-compatible so we can revisit that
later.

One thing to watch out for is the possibility of a future Git feature
that indexes a pack in the context of a repo, but does not necessarily
write the resulting pack to it (and does not necessarily desire to
make any changes to the object DB). One such feature would be fetch
quarantine, which might need the repo context in order to detect
hash collisions, but would also need to ensure that the object DB
is undisturbed in case the fetch fails for whatever reason, even if
the reason occurs only after the indexing is complete. It may not be
obvious to the implementer of such a feature that "index-pack" could
sometimes write packs other than the indexed pack to the object DB,
but there are already other ways that "fetch" could write to the object
DB (in particular, packfile URIs and bundle URIs), so hopefully the
implementation of this future feature would already include a test that
the object DB be undisturbed.

This change requires the change to t5300 by 1f52cdfacb (index-pack:
document and test the --promisor option, 2022-03-09) to be undone.
(--promisor is already tested indirectly, so we don't need the explicit
test here any more.)

[1] https://lore.kernel.org/git/20241114005652.GC1140565@coredump.intra.peff.net/

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
This is on jt/repack-local-promisor.

Looking into it further, I think that we also need to require no
packfile name to be given (so that we are writing the file to the
repository). Therefore, I've added that requirement both in the code and
in the documentation.

I've tried to summarize our conversation in the commit message - if you
notice anything missing or incorrect, feel free to let me know.
---
 Documentation/git-index-pack.txt | 2 ++
 builtin/index-pack.c             | 4 ++++
 t/t5300-pack-object.sh           | 4 +---
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-index-pack.txt b/Documentation/git-index-pack.txt
index 4be09e58e7..ac96935d73 100644
--- a/Documentation/git-index-pack.txt
+++ b/Documentation/git-index-pack.txt
@@ -144,6 +144,8 @@ Also, if there are objects in the given pack that references non-promisor
 objects (in the repo), repacks those non-promisor objects into a promisor
 pack. This avoids a situation in which a repo has non-promisor objects that are
 accessible through promisor objects.
++
+Requires --stdin, and requires <pack-file> to not be specified.
 
 NOTES
 -----
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 08b340552f..c46b6e4061 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1970,6 +1970,10 @@ int cmd_index_pack(int argc,
 		usage(index_pack_usage);
 	if (fix_thin_pack && !from_stdin)
 		die(_("the option '%s' requires '%s'"), "--fix-thin", "--stdin");
+	if (promisor_msg && !from_stdin)
+		die(_("the option '%s' requires '%s'"), "--promisor", "--stdin");
+	if (promisor_msg && pack_name)
+		die(_("--promisor cannot be used with a pack name"));
 	if (from_stdin && !startup_info->have_repository)
 		die(_("--stdin requires a git repository"));
 	if (from_stdin && hash_algo)
diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index aff164ddf8..c53f355e48 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -332,10 +332,8 @@ test_expect_success 'build pack index for an existing pack' '
 	git index-pack -o tmp.idx test-3.pack &&
 	cmp tmp.idx test-1-${packname_1}.idx &&
 
-	git index-pack --promisor=message test-3.pack &&
+	git index-pack test-3.pack &&
 	cmp test-3.idx test-1-${packname_1}.idx &&
-	echo message >expect &&
-	test_cmp expect test-3.promisor &&
 
 	cat test-2-${packname_2}.pack >test-3.pack &&
 	git index-pack -o tmp.idx test-2-${packname_2}.pack &&
-- 
2.47.0.338.g60cca15819-goog


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH] index-pack: teach --promisor to require --stdin
  2024-11-18 19:02               ` [PATCH] index-pack: teach --promisor to require --stdin Jonathan Tan
@ 2024-11-19  3:29                 ` Junio C Hamano
  2024-11-19 18:53                 ` Jeff King
  1 sibling, 0 replies; 37+ messages in thread
From: Junio C Hamano @ 2024-11-19  3:29 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peff, stolee

Jonathan Tan <jonathantanmy@google.com> writes:

> Currently,
>
>  - Running "index-pack --promisor" outside a repo segfaults.
>  - It may be confusing to a user that running "index-pack --promisor"
>    within a repo may make changes to the repo's object DB, especially
>    since the packs indexed by the index-pack invocation may not even be
>    related to the repo.
>
> As discussed in [1], teaching --promisor to require --stdin and forbid a
> packfile name solves both these problems. This combination of arguments
> requires a repo (since we are writing the resulting .pack and .idx to
> it) and it is clear that the files are related to the repo.

Makes sense.

> This change requires the change to t5300 by 1f52cdfacb (index-pack:
> document and test the --promisor option, 2022-03-09) to be undone.
> (--promisor is already tested indirectly, so we don't need the explicit
> test here any more.)

OK.

> This is on jt/repack-local-promisor.
> diff --git a/Documentation/git-index-pack.txt b/Documentation/git-index-pack.txt
> index 4be09e58e7..ac96935d73 100644
> --- a/Documentation/git-index-pack.txt
> +++ b/Documentation/git-index-pack.txt
> @@ -144,6 +144,8 @@ Also, if there are objects in the given pack that references non-promisor
>  objects (in the repo), repacks those non-promisor objects into a promisor
>  pack. This avoids a situation in which a repo has non-promisor objects that are
>  accessible through promisor objects.
> ++
> +Requires --stdin, and requires <pack-file> to not be specified.
>  
>  NOTES
>  -----
> diff --git a/builtin/index-pack.c b/builtin/index-pack.c
> index 08b340552f..c46b6e4061 100644
> --- a/builtin/index-pack.c
> +++ b/builtin/index-pack.c
> @@ -1970,6 +1970,10 @@ int cmd_index_pack(int argc,
>  		usage(index_pack_usage);
>  	if (fix_thin_pack && !from_stdin)
>  		die(_("the option '%s' requires '%s'"), "--fix-thin", "--stdin");
> +	if (promisor_msg && !from_stdin)
> +		die(_("the option '%s' requires '%s'"), "--promisor", "--stdin");
> +	if (promisor_msg && pack_name)
> +		die(_("--promisor cannot be used with a pack name"));
>  	if (from_stdin && !startup_info->have_repository)
>  		die(_("--stdin requires a git repository"));

OK.  Thanks, will queue.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] index-pack: teach --promisor to require --stdin
  2024-11-18 19:02               ` [PATCH] index-pack: teach --promisor to require --stdin Jonathan Tan
  2024-11-19  3:29                 ` Junio C Hamano
@ 2024-11-19 18:53                 ` Jeff King
  2024-11-20  1:34                   ` Junio C Hamano
  1 sibling, 1 reply; 37+ messages in thread
From: Jeff King @ 2024-11-19 18:53 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, stolee

On Mon, Nov 18, 2024 at 11:02:06AM -0800, Jonathan Tan wrote:

> Currently, Git uses "index-pack --promisor" only when fetching into
> a repo, so it could be argued that we should teach "index-pack" a new
> argument (say, "--fetching-mode") instead of tying --promisor to a
> generic argument like "--stdin". However, this --promisor feature could
> conceivably be used whenever we have a packfile that is known to come
> from the promisor remote (whether obtained through Git's fetch protocol
> or through other means) so it seems reasonable to use --stdin here -
> one could envision a user-made script obtaining a packfile and then
> running "index-pack --promisor --stdin", for example. In fact, it might
> be possible to relax the restriction further (say, by also allowing
> --promisor when indexing a packfile that is in the object DB), but
> relaxing the restriction is backwards-compatible so we can revisit that
> later.

Yeah, I agree with this summary.

> This change requires the change to t5300 by 1f52cdfacb (index-pack:
> document and test the --promisor option, 2022-03-09) to be undone.
> (--promisor is already tested indirectly, so we don't need the explicit
> test here any more.)

OK, I think this is reasonable.

> Looking into it further, I think that we also need to require no
> packfile name to be given (so that we are writing the file to the
> repository). Therefore, I've added that requirement both in the code and
> in the documentation.

Hmm. I didn't realize that you could specify a pack name _and_ --stdin,
but I guess it makes sense if you wanted to write the result to a
non-standard location (though curiously --stdin requires a repo, which
feels overly restrictive if you give a pack name).

But I think that makes the --stdin check redundant. I.e., here:

> diff --git a/builtin/index-pack.c b/builtin/index-pack.c
> index 08b340552f..c46b6e4061 100644
> --- a/builtin/index-pack.c
> +++ b/builtin/index-pack.c
> @@ -1970,6 +1970,10 @@ int cmd_index_pack(int argc,
>  		usage(index_pack_usage);
>  	if (fix_thin_pack && !from_stdin)
>  		die(_("the option '%s' requires '%s'"), "--fix-thin", "--stdin");
> +	if (promisor_msg && !from_stdin)
> +		die(_("the option '%s' requires '%s'"), "--promisor", "--stdin");
> +	if (promisor_msg && pack_name)
> +		die(_("--promisor cannot be used with a pack name"));

...just the second one would be sufficient, because the context just
above this has:

	if (!pack_name && !from_stdin)
		usage(index_pack_usage);

So if there isn't a pack name then from_stdin must be set anyway.

What you've written won't behave incorrectly, but I wonder if this means
we can explain the rule in a more simple way:

  - the --promisor option requires that we be indexing a pack in the
    object database

  - when not given a pack name on the command line, we know this is true
    (because we generate the name ourselves internally)

  - when given a pack name on the command line, we _could_ check that it
    is inside the object directory, but we don't currently do so and
    just bail. That could be changed in the future.

And then there is no mention of --stdin at all (though of course it is
an implication of the second point, since we have to get input somehow).

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v2] index-pack: teach --promisor to forbid pack name
  2024-11-16  3:23             ` Jeff King
  2024-11-18 19:02               ` [PATCH] index-pack: teach --promisor to require --stdin Jonathan Tan
@ 2024-11-19 20:10               ` Jonathan Tan
  2024-11-20  6:29                 ` Jeff King
  1 sibling, 1 reply; 37+ messages in thread
From: Jonathan Tan @ 2024-11-19 20:10 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, peff

Currently,

 - Running "index-pack --promisor" outside a repo segfaults.
 - It may be confusing to a user that running "index-pack --promisor"
   within a repo may make changes to the repo's object DB, especially
   since the packs indexed by the index-pack invocation may not even be
   related to the repo.

As discussed in [1] and [2], teaching --promisor to forbid a packfile
name solves both these problems. This combination of arguments requires
a repo (since we are writing the resulting .pack and .idx to it) and it
is clear that the files are related to the repo.

Currently, Git uses "index-pack --promisor" only when fetching into
a repo, so it could be argued that we should teach "index-pack" a
new argument (say, "--fetching-mode") instead of tying --promisor to
a generic argument like the packfile name. However, this --promisor
feature could conceivably be used whenever we have a packfile that is
known to come from the promisor remote (whether obtained through Git's
fetch protocol or through other means) so not using a new argument seems
reasonable - one could envision a user-made script obtaining a packfile
and then running "index-pack --promisor --stdin", for example. In fact,
it might be possible to relax the restriction further (say, by also
allowing --promisor when indexing a packfile that is in the object DB),
but relaxing the restriction is backwards-compatible so we can revisit
that later.

One thing to watch out for is the possibility of a future Git feature
that indexes a pack in the context of a repo, but does not necessarily
write the resulting pack to it (and does not necessarily desire to
make any changes to the object DB). One such feature would be fetch
quarantine, which might need the repo context in order to detect
hash collisions, but would also need to ensure that the object DB
is undisturbed in case the fetch fails for whatever reason, even if
the reason occurs only after the indexing is complete. It may not be
obvious to the implementer of such a feature that "index-pack" could
sometimes write packs other than the indexed pack to the object DB,
but there are already other ways that "fetch" could write to the object
DB (in particular, packfile URIs and bundle URIs), so hopefully the
implementation of this future feature would already include a test that
the object DB be undisturbed.

This change requires the change to t5300 by 1f52cdfacb (index-pack:
document and test the --promisor option, 2022-03-09) to be undone.
(--promisor is already tested indirectly, so we don't need the explicit
test here any more.)

[1] https://lore.kernel.org/git/20241114005652.GC1140565@coredump.intra.peff.net/
[2] https://lore.kernel.org/git/20241119185345.GB15723@coredump.intra.peff.net/

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
This is on jt/repack-local-promisor.

Thanks, Peff, for the catch. Here's an updated patch, with an updated
commit message.
---
 Documentation/git-index-pack.txt | 2 ++
 builtin/index-pack.c             | 2 ++
 t/t5300-pack-object.sh           | 4 +---
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-index-pack.txt b/Documentation/git-index-pack.txt
index 4be09e58e7..58dd5b5f0e 100644
--- a/Documentation/git-index-pack.txt
+++ b/Documentation/git-index-pack.txt
@@ -144,6 +144,8 @@ Also, if there are objects in the given pack that references non-promisor
 objects (in the repo), repacks those non-promisor objects into a promisor
 pack. This avoids a situation in which a repo has non-promisor objects that are
 accessible through promisor objects.
++
+Requires <pack-file> to not be specified.
 
 NOTES
 -----
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 08b340552f..05758a2f3e 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1970,6 +1970,8 @@ int cmd_index_pack(int argc,
 		usage(index_pack_usage);
 	if (fix_thin_pack && !from_stdin)
 		die(_("the option '%s' requires '%s'"), "--fix-thin", "--stdin");
+	if (promisor_msg && pack_name)
+		die(_("--promisor cannot be used with a pack name"));
 	if (from_stdin && !startup_info->have_repository)
 		die(_("--stdin requires a git repository"));
 	if (from_stdin && hash_algo)
diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index aff164ddf8..c53f355e48 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -332,10 +332,8 @@ test_expect_success 'build pack index for an existing pack' '
 	git index-pack -o tmp.idx test-3.pack &&
 	cmp tmp.idx test-1-${packname_1}.idx &&
 
-	git index-pack --promisor=message test-3.pack &&
+	git index-pack test-3.pack &&
 	cmp test-3.idx test-1-${packname_1}.idx &&
-	echo message >expect &&
-	test_cmp expect test-3.promisor &&
 
 	cat test-2-${packname_2}.pack >test-3.pack &&
 	git index-pack -o tmp.idx test-2-${packname_2}.pack &&

Range-diff against v1:
1:  b5a0012531 ! 1:  226a627c25 index-pack: teach --promisor to require --stdin
    @@ Metadata
     Author: Jonathan Tan <jonathantanmy@google.com>
     
      ## Commit message ##
    -    index-pack: teach --promisor to require --stdin
    +    index-pack: teach --promisor to forbid pack name
     
         Currently,
     
    @@ Commit message
            since the packs indexed by the index-pack invocation may not even be
            related to the repo.
     
    -    As discussed in [1], teaching --promisor to require --stdin and forbid a
    -    packfile name solves both these problems. This combination of arguments
    -    requires a repo (since we are writing the resulting .pack and .idx to
    -    it) and it is clear that the files are related to the repo.
    +    As discussed in [1] and [2], teaching --promisor to forbid a packfile
    +    name solves both these problems. This combination of arguments requires
    +    a repo (since we are writing the resulting .pack and .idx to it) and it
    +    is clear that the files are related to the repo.
     
         Currently, Git uses "index-pack --promisor" only when fetching into
    -    a repo, so it could be argued that we should teach "index-pack" a new
    -    argument (say, "--fetching-mode") instead of tying --promisor to a
    -    generic argument like "--stdin". However, this --promisor feature could
    -    conceivably be used whenever we have a packfile that is known to come
    -    from the promisor remote (whether obtained through Git's fetch protocol
    -    or through other means) so it seems reasonable to use --stdin here -
    -    one could envision a user-made script obtaining a packfile and then
    -    running "index-pack --promisor --stdin", for example. In fact, it might
    -    be possible to relax the restriction further (say, by also allowing
    -    --promisor when indexing a packfile that is in the object DB), but
    -    relaxing the restriction is backwards-compatible so we can revisit that
    -    later.
    +    a repo, so it could be argued that we should teach "index-pack" a
    +    new argument (say, "--fetching-mode") instead of tying --promisor to
    +    a generic argument like the packfile name. However, this --promisor
    +    feature could conceivably be used whenever we have a packfile that is
    +    known to come from the promisor remote (whether obtained through Git's
    +    fetch protocol or through other means) so not using a new argument seems
    +    reasonable - one could envision a user-made script obtaining a packfile
    +    and then running "index-pack --promisor --stdin", for example. In fact,
    +    it might be possible to relax the restriction further (say, by also
    +    allowing --promisor when indexing a packfile that is in the object DB),
    +    but relaxing the restriction is backwards-compatible so we can revisit
    +    that later.
     
         One thing to watch out for is the possibility of a future Git feature
         that indexes a pack in the context of a repo, but does not necessarily
    @@ Commit message
         test here any more.)
     
         [1] https://lore.kernel.org/git/20241114005652.GC1140565@coredump.intra.peff.net/
    +    [2] https://lore.kernel.org/git/20241119185345.GB15723@coredump.intra.peff.net/
     
         Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
         ---
         This is on jt/repack-local-promisor.
     
    -    Looking into it further, I think that we also need to require no
    -    packfile name to be given (so that we are writing the file to the
    -    repository). Therefore, I've added that requirement both in the code and
    -    in the documentation.
    -
    -    I've tried to summarize our conversation in the commit message - if you
    -    notice anything missing or incorrect, feel free to let me know.
    +    Thanks, Peff, for the catch. Here's an updated patch, with an updated
    +    commit message.
     
      ## Documentation/git-index-pack.txt ##
     @@ Documentation/git-index-pack.txt: Also, if there are objects in the given pack that references non-promisor
    @@ Documentation/git-index-pack.txt: Also, if there are objects in the given pack t
      pack. This avoids a situation in which a repo has non-promisor objects that are
      accessible through promisor objects.
     ++
    -+Requires --stdin, and requires <pack-file> to not be specified.
    ++Requires <pack-file> to not be specified.
      
      NOTES
      -----
    @@ builtin/index-pack.c: int cmd_index_pack(int argc,
      		usage(index_pack_usage);
      	if (fix_thin_pack && !from_stdin)
      		die(_("the option '%s' requires '%s'"), "--fix-thin", "--stdin");
    -+	if (promisor_msg && !from_stdin)
    -+		die(_("the option '%s' requires '%s'"), "--promisor", "--stdin");
     +	if (promisor_msg && pack_name)
     +		die(_("--promisor cannot be used with a pack name"));
      	if (from_stdin && !startup_info->have_repository)
-- 
2.47.0.338.g60cca15819-goog


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH] index-pack: teach --promisor to require --stdin
  2024-11-19 18:53                 ` Jeff King
@ 2024-11-20  1:34                   ` Junio C Hamano
  0 siblings, 0 replies; 37+ messages in thread
From: Junio C Hamano @ 2024-11-20  1:34 UTC (permalink / raw)
  To: Jeff King; +Cc: Jonathan Tan, git, stolee

Jeff King <peff@peff.net> writes:

> But I think that makes the --stdin check redundant. I.e., here:
>
>> diff --git a/builtin/index-pack.c b/builtin/index-pack.c
>> index 08b340552f..c46b6e4061 100644
>> --- a/builtin/index-pack.c
>> +++ b/builtin/index-pack.c
>> @@ -1970,6 +1970,10 @@ int cmd_index_pack(int argc,
>>  		usage(index_pack_usage);
>>  	if (fix_thin_pack && !from_stdin)
>>  		die(_("the option '%s' requires '%s'"), "--fix-thin", "--stdin");
>> +	if (promisor_msg && !from_stdin)
>> +		die(_("the option '%s' requires '%s'"), "--promisor", "--stdin");
>> +	if (promisor_msg && pack_name)
>> +		die(_("--promisor cannot be used with a pack name"));
>
> ...just the second one would be sufficient, because the context just
> above this has:
>
> 	if (!pack_name && !from_stdin)
> 		usage(index_pack_usage);
>
> So if there isn't a pack name then from_stdin must be set anyway.

Nice findings that leads to ... 

> What you've written won't behave incorrectly, but I wonder if this means
> we can explain the rule in a more simple way:
>
>   - the --promisor option requires that we be indexing a pack in the
>     object database
>
>   - when not given a pack name on the command line, we know this is true
>     (because we generate the name ourselves internally)
>
>   - when given a pack name on the command line, we _could_ check that it
>     is inside the object directory, but we don't currently do so and
>     just bail. That could be changed in the future.
>
> And then there is no mention of --stdin at all (though of course it is
> an implication of the second point, since we have to get input somehow).

... a good simplification.  Not of the implementation---as it is
already simple enough---but of the concept, and simplification of
the latter counts a lot more ;-)

Thanks, both, for working on this.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] index-pack: teach --promisor to forbid pack name
  2024-11-19 20:10               ` [PATCH v2] index-pack: teach --promisor to forbid pack name Jonathan Tan
@ 2024-11-20  6:29                 ` Jeff King
  0 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2024-11-20  6:29 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Junio C Hamano, git

On Tue, Nov 19, 2024 at 12:10:15PM -0800, Jonathan Tan wrote:

> Thanks, Peff, for the catch. Here's an updated patch, with an updated
> commit message.

This looks good to me, thanks.

> Range-diff against v1:
> [...]
>     @@ Commit message
>          test here any more.)
>      
>          [1] https://lore.kernel.org/git/20241114005652.GC1140565@coredump.intra.peff.net/
>     +    [2] https://lore.kernel.org/git/20241119185345.GB15723@coredump.intra.peff.net/
>      
>          Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
>          ---
>          This is on jt/repack-local-promisor.
>      
>     -    Looking into it further, I think that we also need to require no
>     -    packfile name to be given (so that we are writing the file to the
>     -    repository). Therefore, I've added that requirement both in the code and
>     -    in the documentation.
>     -
>     -    I've tried to summarize our conversation in the commit message - if you
>     -    notice anything missing or incorrect, feel free to let me know.
>     +    Thanks, Peff, for the catch. Here's an updated patch, with an updated
>     +    commit message.

Heh, I guess you stick your notes directly into the commit message. ;) I
do that sometimes, too. A long time ago I had a patch that would let you
write "---" in the commit message editor and then auto-convert that into
actual notes. Probably not that big a deal, though.

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2024-11-20  6:29 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-24 18:08 [PATCH 0/5] When fetching from a promisor remote, repack local objects referenced Jonathan Tan
2024-10-24 18:08 ` [PATCH 1/5] pack-objects: make variable non-static Jonathan Tan
2024-10-28  0:30   ` Taylor Blau
2024-10-28 19:34     ` Jonathan Tan
2024-10-28 19:50       ` Taylor Blau
2024-10-28 23:04         ` Jonathan Tan
2024-10-24 18:08 ` [PATCH 2/5] t0410: make test description clearer Jonathan Tan
2024-10-24 18:08 ` [PATCH 3/5] t0410: use from-scratch server Jonathan Tan
2024-10-24 18:08 ` [PATCH 4/5] t5300: move --window clamp test next to unclamped Jonathan Tan
2024-10-24 18:08 ` [PATCH 5/5] index-pack: repack local links into promisor packs Jonathan Tan
2024-10-30 22:29   ` Josh Steadmon
2024-11-01 20:14     ` Jonathan Tan
2024-10-25  6:04 ` [External] [PATCH 0/5] When fetching from a promisor remote, repack local objects referenced Han Young
2024-10-25 21:07   ` Taylor Blau
2024-11-02 10:38     ` Junio C Hamano
2024-10-25 21:07 ` Taylor Blau
2024-11-01 20:11 ` [PATCH v2 0/4] " Jonathan Tan
2024-11-01 20:11   ` [PATCH v2 1/4] t0410: make test description clearer Jonathan Tan
2024-11-01 20:11   ` [PATCH v2 2/4] t0410: use from-scratch server Jonathan Tan
2024-11-01 20:11   ` [PATCH v2 3/4] t5300: move --window clamp test next to unclamped Jonathan Tan
2024-11-13  7:35     ` Jeff King
2024-11-13 18:26       ` Jonathan Tan
2024-11-14  0:56         ` Jeff King
2024-11-14  6:41           ` Junio C Hamano
2024-11-15  9:52             ` Jeff King
2024-11-15 19:55           ` Jonathan Tan
2024-11-16  3:23             ` Jeff King
2024-11-18 19:02               ` [PATCH] index-pack: teach --promisor to require --stdin Jonathan Tan
2024-11-19  3:29                 ` Junio C Hamano
2024-11-19 18:53                 ` Jeff King
2024-11-20  1:34                   ` Junio C Hamano
2024-11-19 20:10               ` [PATCH v2] index-pack: teach --promisor to forbid pack name Jonathan Tan
2024-11-20  6:29                 ` Jeff King
2024-11-14  0:59       ` [PATCH v2 3/4] t5300: move --window clamp test next to unclamped Jeff King
2024-11-01 20:11   ` [PATCH v2 4/4] index-pack: repack local links into promisor packs Jonathan Tan
2024-11-04  0:22   ` [PATCH v2 0/4] When fetching from a promisor remote, repack local objects referenced Junio C Hamano
2024-11-04  2:05     ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).