git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/5] PATH WALK III: Add 'git backfill' command
@ 2024-12-06 20:07 Derrick Stolee via GitGitGadget
  2024-12-06 20:07 ` [PATCH 1/5] backfill: add builtin boilerplate Derrick Stolee via GitGitGadget
                   ` (6 more replies)
  0 siblings, 7 replies; 39+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-12-06 20:07 UTC (permalink / raw)
  To: git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Derrick Stolee

This is based on v3 of ds/path-walk-1 [1] and an earlier version was part of
my initial path-walk RFC [2].

[1]
https://lore.kernel.org/git/pull.1818.v3.git.1733514358.gitgitgadget@gmail.com/

[2]
https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@gmail.com/

This series adds a new 'git backfill' command that uses the path-walk API to
download missing blobs in a blobless partial clone. Users can specify
interaction with the sparse-checkout using '--[no-]sparse' but the
'--sparse' option is implied by the existence of a sparse-checkout.

The reason to use the path-walk API is to make sure that the missing objects
are grouped by a common path, giving a reasonable process for batching
requests and expecting the server to compress the resulting packfile nicely
together.

I first prototyped this feature in June 2024 as an exploration and created
the path-walk algorithm for this purpose. It was only my intuition that led
me to believe that batching by path would lead to better packfiles. This has
been proven out as a very important feature due to recent investigations to
compressing full repositories by doing a better job of grouping objects by
path. See the --name-hash-version series [3] or the 'git pack-objects
--path-walk' series [4] (currently on hold as it conflicts with the
--name-hash-version series).

[3]
https://lore.kernel.org/git/pull.1823.v2.git.1733181682.gitgitgadget@gmail.com/

[4]
https://lore.kernel.org/git/pull.1813.v2.git.1729431810.gitgitgadget@gmail.com/

This idea can be further demonstrated by the evidence in testing this
feature: by downloading objects in small batch sizes, the client can force
the server to repack things more efficiently than a full repack.

The example repository I have used in multiple places is the
microsoft/fluentui repo [5] as it has many CHANGELOG.md files that cause
name hash collisions that make the full repack inefficient.

[5] https://github.com/microsoft/fluentui

If we create a blobless clone of the fluentui repo, then this downloads 105
MB across two packfiles (the commits and trees pack, followed by the blobs
needed for an initial checkout). Running 'git backfill --batch-size=' for
different sizes leads to some interesting results:

| Batch Size      | Pack Count | Pack Size | Time   |
|-----------------|------------|-----------|--------|
| (Initial clone) | 2          | 105 MB    |        |
| 5K              | 53         | 348 MB    | 2m 26s |
| 10K             | 28         | 365 MB    | 2m 22s |
| 15K             | 19         | 407 MB    | 2m 21s |
| 20K             | 15         | 393 MB    | 2m 28s |
| 25K             | 13         | 417 MB    | 2m 06s |
| 50K             | 8          | 509 MB    | 1m 34s |
| 100K            | 5          | 535 MB    | 1m 56s |
| 250K            | 4          | 698 MB    | 1m 33s |
| 500K            | 3          | 696 MB    | 1m 42s |


The smaller batches cause the server to realize that the existing deltas
cannot be reused and it finds better deltas. This takes some extra time for
the small batches, but halves the size of the repo. Even in the 500K batch
size, we get less data than the 738 MB of a full clone.

Implementing the --sparse feature is best done by augmenting the path-walk
API to be aware of a pattern list. This works for both cone and non-cone
mode sparse-checkouts.

There are future directions we could take this command, especially to run
the command with a user-specified pathspec. The tricky case for that
additional feature is trying to make the path-walk more efficient by
skipping tree paths that would not lead to a match of the pathspec. It would
likely need optimization in a small subset of pathspec features (such as
prefix matches) to work as efficiently as possible. I did prototype a
version that puts the pathspec match in the callback function within
builtin/backfill.c, but I found that uninspiring and unnecessary for now.

Thanks, -Stolee

Derrick Stolee (5):
  backfill: add builtin boilerplate
  backfill: basic functionality and tests
  backfill: add --batch-size=<n> option
  backfill: add --sparse option
  backfill: assume --sparse when sparse-checkout is enabled

 .gitignore                                |   1 +
 Documentation/git-backfill.txt            |  60 ++++++++
 Documentation/technical/api-path-walk.txt |  11 +-
 Makefile                                  |   1 +
 builtin.h                                 |   1 +
 builtin/backfill.c                        | 147 ++++++++++++++++++
 command-list.txt                          |   1 +
 dir.c                                     |  10 +-
 dir.h                                     |   3 +
 git.c                                     |   1 +
 path-walk.c                               |  18 +++
 path-walk.h                               |  11 ++
 t/helper/test-path-walk.c                 |  22 ++-
 t/t5620-backfill.sh                       | 178 ++++++++++++++++++++++
 t/t6601-path-walk.sh                      |  32 ++++
 15 files changed, 488 insertions(+), 9 deletions(-)
 create mode 100644 Documentation/git-backfill.txt
 create mode 100644 builtin/backfill.c
 create mode 100755 t/t5620-backfill.sh


base-commit: e716672c041473dd2bf257b7532b86696fef32a0
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1820%2Fderrickstolee%2Fbackfill-upstream-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1820/derrickstolee/backfill-upstream-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1820
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 1/5] backfill: add builtin boilerplate
  2024-12-06 20:07 [PATCH 0/5] PATH WALK III: Add 'git backfill' command Derrick Stolee via GitGitGadget
@ 2024-12-06 20:07 ` Derrick Stolee via GitGitGadget
  2025-01-16 10:11   ` Patrick Steinhardt
  2024-12-06 20:07 ` [PATCH 2/5] backfill: basic functionality and tests Derrick Stolee via GitGitGadget
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 39+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-12-06 20:07 UTC (permalink / raw)
  To: git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

In anticipation of implementing 'git backfill', populate the necessary files
with the boilerplate of a new builtin.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 .gitignore                     |  1 +
 Documentation/git-backfill.txt | 23 +++++++++++++++++++++++
 Makefile                       |  1 +
 builtin.h                      |  1 +
 builtin/backfill.c             | 29 +++++++++++++++++++++++++++++
 command-list.txt               |  1 +
 git.c                          |  1 +
 7 files changed, 57 insertions(+)
 create mode 100644 Documentation/git-backfill.txt
 create mode 100644 builtin/backfill.c

diff --git a/.gitignore b/.gitignore
index 6687bd6db4c..0f9e7de2ec3 100644
--- a/.gitignore
+++ b/.gitignore
@@ -20,6 +20,7 @@
 /git-apply
 /git-archimport
 /git-archive
+/git-backfill
 /git-bisect
 /git-blame
 /git-branch
diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
new file mode 100644
index 00000000000..640144187d3
--- /dev/null
+++ b/Documentation/git-backfill.txt
@@ -0,0 +1,23 @@
+git-backfill(1)
+===============
+
+NAME
+----
+git-backfill - Download missing objects in a partial clone
+
+
+SYNOPSIS
+--------
+[verse]
+'git backfill' [<options>]
+
+DESCRIPTION
+-----------
+
+SEE ALSO
+--------
+linkgit:git-clone[1].
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/Makefile b/Makefile
index 50413d96492..e18e0f4e447 100644
--- a/Makefile
+++ b/Makefile
@@ -1203,6 +1203,7 @@ BUILTIN_OBJS += builtin/am.o
 BUILTIN_OBJS += builtin/annotate.o
 BUILTIN_OBJS += builtin/apply.o
 BUILTIN_OBJS += builtin/archive.o
+BUILTIN_OBJS += builtin/backfill.o
 BUILTIN_OBJS += builtin/bisect.o
 BUILTIN_OBJS += builtin/blame.o
 BUILTIN_OBJS += builtin/branch.o
diff --git a/builtin.h b/builtin.h
index f7b166b3348..89928ccf92f 100644
--- a/builtin.h
+++ b/builtin.h
@@ -120,6 +120,7 @@ int cmd_am(int argc, const char **argv, const char *prefix, struct repository *r
 int cmd_annotate(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_apply(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_archive(int argc, const char **argv, const char *prefix, struct repository *repo);
+int cmd_backfill(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_bisect(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_blame(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_branch(int argc, const char **argv, const char *prefix, struct repository *repo);
diff --git a/builtin/backfill.c b/builtin/backfill.c
new file mode 100644
index 00000000000..38e6aaeaa03
--- /dev/null
+++ b/builtin/backfill.c
@@ -0,0 +1,29 @@
+#include "builtin.h"
+#include "config.h"
+#include "parse-options.h"
+#include "repository.h"
+#include "object.h"
+
+static const char * const builtin_backfill_usage[] = {
+	N_("git backfill [<options>]"),
+	NULL
+};
+
+int cmd_backfill(int argc, const char **argv, const char *prefix, struct repository *repo)
+{
+	struct option options[] = {
+		OPT_END(),
+	};
+
+	if (argc == 2 && !strcmp(argv[1], "-h"))
+		usage_with_options(builtin_backfill_usage, options);
+
+	argc = parse_options(argc, argv, prefix, options, builtin_backfill_usage,
+			     0);
+
+	repo_config(repo, git_default_config, NULL);
+
+	die(_("not implemented"));
+
+	return 0;
+}
diff --git a/command-list.txt b/command-list.txt
index e0bb87b3b5c..c537114b468 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -60,6 +60,7 @@ git-annotate                            ancillaryinterrogators
 git-apply                               plumbingmanipulators            complete
 git-archimport                          foreignscminterface
 git-archive                             mainporcelain
+git-backfill                            mainporcelain           history
 git-bisect                              mainporcelain           info
 git-blame                               ancillaryinterrogators          complete
 git-branch                              mainporcelain           history
diff --git a/git.c b/git.c
index 2fbea24ec92..00d9b3ec8a9 100644
--- a/git.c
+++ b/git.c
@@ -509,6 +509,7 @@ static struct cmd_struct commands[] = {
 	{ "annotate", cmd_annotate, RUN_SETUP },
 	{ "apply", cmd_apply, RUN_SETUP_GENTLY },
 	{ "archive", cmd_archive, RUN_SETUP_GENTLY },
+	{ "backfill", cmd_backfill, RUN_SETUP },
 	{ "bisect", cmd_bisect, RUN_SETUP },
 	{ "blame", cmd_blame, RUN_SETUP },
 	{ "branch", cmd_branch, RUN_SETUP | DELAY_PAGER_CONFIG },
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 2/5] backfill: basic functionality and tests
  2024-12-06 20:07 [PATCH 0/5] PATH WALK III: Add 'git backfill' command Derrick Stolee via GitGitGadget
  2024-12-06 20:07 ` [PATCH 1/5] backfill: add builtin boilerplate Derrick Stolee via GitGitGadget
@ 2024-12-06 20:07 ` Derrick Stolee via GitGitGadget
  2024-12-16  8:01   ` Patrick Steinhardt
  2024-12-06 20:07 ` [PATCH 3/5] backfill: add --batch-size=<n> option Derrick Stolee via GitGitGadget
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 39+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-12-06 20:07 UTC (permalink / raw)
  To: git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The default behavior of 'git backfill' is to fetch all missing blobs that
are reachable from HEAD. Document and test this behavior.

The implementation is a very simple use of the path-walk API, initializing
the revision walk at HEAD to start the path-walk from all commits reachable
from HEAD. Ignore the object arrays that correspond to tree entries,
assuming that they are all present already.

The path-walk API provides lists of objects in batches according to a
common path, but that list could be very small. We want to balance the
number of requests to the server with the ability to have the process
interrupted with minimal repeated work to catch up in the next run.
Based on some experiments (detailed in the next change) a minimum batch
size of 50,000 is selected for the default.

This batch size is a _minimum_. As the path-walk API emits lists of blob
IDs, they are collected into a list of objects for a request to the
server. When that list is at least the minimum batch size, then the
request is sent to the server for the new objects. However, the list of
blob IDs from the path-walk API could be much longer than the batch
size. At this moment, it is unclear if there is a benefit to split the
list when there are too many objects at the same path.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/git-backfill.txt            |  24 +++++
 Documentation/technical/api-path-walk.txt |   3 +-
 builtin/backfill.c                        | 104 +++++++++++++++++++++-
 t/t5620-backfill.sh                       |  94 +++++++++++++++++++
 4 files changed, 221 insertions(+), 4 deletions(-)
 create mode 100755 t/t5620-backfill.sh

diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
index 640144187d3..0e10f066fef 100644
--- a/Documentation/git-backfill.txt
+++ b/Documentation/git-backfill.txt
@@ -14,6 +14,30 @@ SYNOPSIS
 DESCRIPTION
 -----------
 
+Blobless partial clones are created using `git clone --filter=blob:none`
+and then configure the local repository such that the Git client avoids
+downloading blob objects unless they are required for a local operation.
+This initially means that the clone and later fetches download reachable
+commits and trees but no blobs. Later operations that change the `HEAD`
+pointer, such as `git checkout` or `git merge`, may need to download
+missing blobs in order to complete their operation.
+
+In the worst cases, commands that compute blob diffs, such as `git blame`,
+become very slow as they download the missing blobs in single-blob
+requests to satisfy the missing object as the Git command needs it. This
+leads to multiple download requests and no ability for the Git server to
+provide delta compression across those objects.
+
+The `git backfill` command provides a way for the user to request that
+Git downloads the missing blobs (with optional filters) such that the
+missing blobs representing historical versions of files can be downloaded
+in batches. The `backfill` command attempts to optimize the request by
+grouping blobs that appear at the same path, hopefully leading to good
+delta compression in the packfile sent by the server.
+
+By default, `git backfill` downloads all blobs reachable from the `HEAD`
+commit. This set can be restricted or expanded using various options.
+
 SEE ALSO
 --------
 linkgit:git-clone[1].
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index 7075d0d5ab5..1fba0ce04cb 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -60,4 +60,5 @@ Examples
 --------
 
 See example usages in:
-	`t/helper/test-path-walk.c`
+	`t/helper/test-path-walk.c`,
+	`builtin/backfill.c`
diff --git a/builtin/backfill.c b/builtin/backfill.c
index 38e6aaeaa03..e5f2000d5e0 100644
--- a/builtin/backfill.c
+++ b/builtin/backfill.c
@@ -1,16 +1,116 @@
 #include "builtin.h"
+#include "git-compat-util.h"
 #include "config.h"
 #include "parse-options.h"
 #include "repository.h"
+#include "commit.h"
+#include "hex.h"
+#include "tree.h"
+#include "tree-walk.h"
 #include "object.h"
+#include "object-store-ll.h"
+#include "oid-array.h"
+#include "oidset.h"
+#include "promisor-remote.h"
+#include "strmap.h"
+#include "string-list.h"
+#include "revision.h"
+#include "trace2.h"
+#include "progress.h"
+#include "packfile.h"
+#include "path-walk.h"
 
 static const char * const builtin_backfill_usage[] = {
 	N_("git backfill [<options>]"),
 	NULL
 };
 
+struct backfill_context {
+	struct repository *repo;
+	struct oid_array current_batch;
+	size_t batch_size;
+};
+
+static void clear_backfill_context(struct backfill_context *ctx)
+{
+	oid_array_clear(&ctx->current_batch);
+}
+
+static void download_batch(struct backfill_context *ctx)
+{
+	promisor_remote_get_direct(ctx->repo,
+				   ctx->current_batch.oid,
+				   ctx->current_batch.nr);
+	oid_array_clear(&ctx->current_batch);
+
+	/*
+	 * We likely have a new packfile. Add it to the packed list to
+	 * avoid possible duplicate downloads of the same objects.
+	 */
+	reprepare_packed_git(ctx->repo);
+}
+
+static int fill_missing_blobs(const char *path UNUSED,
+			      struct oid_array *list,
+			      enum object_type type,
+			      void *data)
+{
+	struct backfill_context *ctx = data;
+
+	if (type != OBJ_BLOB)
+		return 0;
+
+	for (size_t i = 0; i < list->nr; i++) {
+		off_t size = 0;
+		struct object_info info = OBJECT_INFO_INIT;
+		info.disk_sizep = &size;
+		if (oid_object_info_extended(ctx->repo,
+					     &list->oid[i],
+					     &info,
+					     OBJECT_INFO_FOR_PREFETCH) ||
+		    !size)
+			oid_array_append(&ctx->current_batch, &list->oid[i]);
+	}
+
+	if (ctx->current_batch.nr >= ctx->batch_size)
+		download_batch(ctx);
+
+	return 0;
+}
+
+static int do_backfill(struct backfill_context *ctx)
+{
+	struct rev_info revs;
+	struct path_walk_info info = PATH_WALK_INFO_INIT;
+	int ret;
+
+	repo_init_revisions(ctx->repo, &revs, "");
+	handle_revision_arg("HEAD", &revs, 0, 0);
+
+	info.blobs = 1;
+	info.tags = info.commits = info.trees = 0;
+
+	info.revs = &revs;
+	info.path_fn = fill_missing_blobs;
+	info.path_fn_data = ctx;
+
+	ret = walk_objects_by_path(&info);
+
+	/* Download the objects that did not fill a batch. */
+	if (!ret)
+		download_batch(ctx);
+
+	clear_backfill_context(ctx);
+	return ret;
+}
+
 int cmd_backfill(int argc, const char **argv, const char *prefix, struct repository *repo)
 {
+	struct backfill_context ctx = {
+		.repo = repo,
+		.current_batch = OID_ARRAY_INIT,
+		.batch_size = 50000,
+	};
 	struct option options[] = {
 		OPT_END(),
 	};
@@ -23,7 +123,5 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
 
 	repo_config(repo, git_default_config, NULL);
 
-	die(_("not implemented"));
-
-	return 0;
+	return do_backfill(&ctx);
 }
diff --git a/t/t5620-backfill.sh b/t/t5620-backfill.sh
new file mode 100755
index 00000000000..64326362d80
--- /dev/null
+++ b/t/t5620-backfill.sh
@@ -0,0 +1,94 @@
+#!/bin/sh
+
+test_description='git backfill on partial clones'
+
+GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
+export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
+
+. ./test-lib.sh
+
+# We create objects in the 'src' repo.
+test_expect_success 'setup repo for object creation' '
+	echo "{print \$1}" >print_1.awk &&
+	echo "{print \$2}" >print_2.awk &&
+
+	git init src &&
+
+	mkdir -p src/a/b/c &&
+	mkdir -p src/d/e &&
+
+	for i in 1 2
+	do
+		for n in 1 2 3 4
+		do
+			echo "Version $i of file $n" > src/file.$n.txt &&
+			echo "Version $i of file a/$n" > src/a/file.$n.txt &&
+			echo "Version $i of file a/b/$n" > src/a/b/file.$n.txt &&
+			echo "Version $i of file a/b/c/$n" > src/a/b/c/file.$n.txt &&
+			echo "Version $i of file d/$n" > src/d/file.$n.txt &&
+			echo "Version $i of file d/e/$n" > src/d/e/file.$n.txt &&
+			git -C src add . &&
+			git -C src commit -m "Iteration $n" || return 1
+		done
+	done
+'
+
+# Clone 'src' into 'srv.bare' so we have a bare repo to be our origin
+# server for the partial clone.
+test_expect_success 'setup bare clone for server' '
+	git clone --bare "file://$(pwd)/src" srv.bare &&
+	git -C srv.bare config --local uploadpack.allowfilter 1 &&
+	git -C srv.bare config --local uploadpack.allowanysha1inwant 1
+'
+
+# do basic partial clone from "srv.bare"
+test_expect_success 'do partial clone 1, backfill gets all objects' '
+	git clone --no-checkout --filter=blob:none	\
+		--single-branch --branch=main 		\
+		"file://$(pwd)/srv.bare" backfill1 &&
+
+	# Backfill with no options gets everything reachable from HEAD.
+	GIT_TRACE2_EVENT="$(pwd)/backfill-file-trace" git \
+		-C backfill1 backfill &&
+
+	# We should have engaged the partial clone machinery
+	test_trace2_data promisor fetch_count 48 <backfill-file-trace &&
+
+	# No more missing objects!
+	git -C backfill1 rev-list --quiet --objects --missing=print HEAD >revs2 &&
+	test_line_count = 0 revs2
+'
+
+. "$TEST_DIRECTORY"/lib-httpd.sh
+start_httpd
+
+test_expect_success 'create a partial clone over HTTP' '
+	SERVER="$HTTPD_DOCUMENT_ROOT_PATH/server" &&
+	rm -rf "$SERVER" repo &&
+	git clone --bare "file://$(pwd)/src" "$SERVER" &&
+	test_config -C "$SERVER" uploadpack.allowfilter 1 &&
+	test_config -C "$SERVER" uploadpack.allowanysha1inwant 1 &&
+
+	git clone --no-checkout --filter=blob:none \
+		"$HTTPD_URL/smart/server" backfill-http
+'
+
+test_expect_success 'backfilling over HTTP succeeds' '
+	GIT_TRACE2_EVENT="$(pwd)/backfill-http-trace" git \
+		-C backfill-http backfill &&
+
+	# We should have engaged the partial clone machinery
+	test_trace2_data promisor fetch_count 48 <backfill-http-trace &&
+
+	# Confirm all objects are present, none missing.
+	git -C backfill-http rev-list --objects --all >rev-list-out &&
+	awk "{print \$1;}" <rev-list-out >oids &&
+	GIT_TRACE2_EVENT="$(pwd)/walk-trace" git -C backfill-http \
+		cat-file --batch-check <oids >batch-out &&
+	! grep missing batch-out
+'
+
+# DO NOT add non-httpd-specific tests here, because the last part of this
+# test script is only executed when httpd is available and enabled.
+
+test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 3/5] backfill: add --batch-size=<n> option
  2024-12-06 20:07 [PATCH 0/5] PATH WALK III: Add 'git backfill' command Derrick Stolee via GitGitGadget
  2024-12-06 20:07 ` [PATCH 1/5] backfill: add builtin boilerplate Derrick Stolee via GitGitGadget
  2024-12-06 20:07 ` [PATCH 2/5] backfill: basic functionality and tests Derrick Stolee via GitGitGadget
@ 2024-12-06 20:07 ` Derrick Stolee via GitGitGadget
  2024-12-16  8:01   ` Patrick Steinhardt
  2025-01-19 17:57   ` Jean-Noël AVILA
  2024-12-06 20:07 ` [PATCH 4/5] backfill: add --sparse option Derrick Stolee via GitGitGadget
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 39+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-12-06 20:07 UTC (permalink / raw)
  To: git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

Users may want to specify a minimum batch size for their needs. This is only
a minimum: the path-walk API provides a list of OIDs that correspond to the
same path, and thus it is optimal to allow delta compression across those
objects in a single server request.

We could consider limiting the request to have a maximum batch size in the
future. For now, we let the path-walk API batches determine the
boundaries.

To get a feeling for the value of specifying the --batch-size parameter,
I tested a number of open source repositories available on GitHub. The
procedure was generally:

 1. git clone --filter=blob:none <url>
 2. git backfill

Checking the number of packfiles and the size of the .git/objects/pack
directory helps to identify the effects of different batch sizes.

For the Git repository, we get these results:

| Batch Size      | Pack Count | Pack Size | Time  |
|-----------------|------------|-----------|-------|
| (Initial clone) | 2          | 119 MB    |       |
| 25K             | 8          | 290 MB    | 24s   |
| 50K             | 5          | 290 MB    | 24s   |
| 100K            | 4          | 290 MB    | 29s   |

Other than the packfile counts decreasing as we need fewer batches, the
size and time required is not changing much for this small example.

For the nodejs/node repository, we see these results:

| Batch Size      | Pack Count | Pack Size | Time   |
|-----------------|------------|-----------|--------|
| (Initial clone) | 2          | 330 MB    |        |
| 25K             | 19         | 1,222 MB  | 1m 22s |
| 50K             | 11         | 1,221 MB  | 1m 24s |
| 100K            | 7          | 1,223 MB  | 1m 40s |
| 250K            | 4          | 1,224 MB  | 2m 23s |
| 500K            | 3          | 1,216 MB  | 4m 38s |

Here, we don't have much difference in the size of the repo, though the
500K batch size results in a few MB gained. That comes at a cost of a
much longer time. This extra time is due to server-side delta
compression happening as the on-disk deltas don't appear to be reusable
all the time. But for smaller batch sizes, the server is able to find
reasonable deltas partly because we are asking for objects that appear
in the same region of the directory tree and include all versions of a
file at a specific path.

To contrast this example, I tested the microsoft/fluentui repo, which
has been known to have inefficient packing due to name hash collisions.
These results are found before GitHub had the opportunity to repack the
server with more advanced name hash versions:

| Batch Size      | Pack Count | Pack Size | Time   |
|-----------------|------------|-----------|--------|
| (Initial clone) | 2          | 105 MB    |        |
| 5K              | 53         | 348 MB    | 2m 26s |
| 10K             | 28         | 365 MB    | 2m 22s |
| 15K             | 19         | 407 MB    | 2m 21s |
| 20K             | 15         | 393 MB    | 2m 28s |
| 25K             | 13         | 417 MB    | 2m 06s |
| 50K             | 8          | 509 MB    | 1m 34s |
| 100K            | 5          | 535 MB    | 1m 56s |
| 250K            | 4          | 698 MB    | 1m 33s |
| 500K            | 3          | 696 MB    | 1m 42s |

Here, a larger variety of batch sizes were chosen because of the great
variation in results. By asking the server to download small batches
corresponding to fewer paths at a time, the server is able to provide
better compression for these batches than it would for a regular clone.
A typical full clone for this repository would require 738 MB.

This example justifies the choice to batch requests by path name,
leading to improved communication with a server that is not optimally
packed.

Finally, the same experiment for the Linux repository had these results:

| Batch Size      | Pack Count | Pack Size | Time    |
|-----------------|------------|-----------|---------|
| (Initial clone) | 2          | 2,153 MB  |         |
| 25K             | 63         | 6,380 MB  | 14m 08s |
| 50K             | 58         | 6,126 MB  | 15m 11s |
| 100K            | 30         | 6,135 MB  | 18m 11s |
| 250K            | 14         | 6,146 MB  | 18m 22s |
| 500K            | 8          | 6,143 MB  | 33m 29s |

Even in this example, where the default name hash algorithm leads to
decent compression of the Linux kernel repository, there is value for
selecting a smaller batch size, to a limit. The 25K batch size has the
fastest time, but uses 250 MB more than the 50K batch size. The 500K
batch size took much more time due to server compression time and thus
we should avoid large batch sizes like this.

Based on these experiments, a batch size of 50,000 was chosen as the
default value.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/git-backfill.txt | 10 +++++++++-
 builtin/backfill.c             |  4 +++-
 t/t5620-backfill.sh            | 18 ++++++++++++++++++
 3 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
index 0e10f066fef..9b0bae04e9d 100644
--- a/Documentation/git-backfill.txt
+++ b/Documentation/git-backfill.txt
@@ -9,7 +9,7 @@ git-backfill - Download missing objects in a partial clone
 SYNOPSIS
 --------
 [verse]
-'git backfill' [<options>]
+'git backfill' [--batch-size=<n>]
 
 DESCRIPTION
 -----------
@@ -38,6 +38,14 @@ delta compression in the packfile sent by the server.
 By default, `git backfill` downloads all blobs reachable from the `HEAD`
 commit. This set can be restricted or expanded using various options.
 
+OPTIONS
+-------
+
+--batch-size=<n>::
+	Specify a minimum size for a batch of missing objects to request
+	from the server. This size may be exceeded by the last set of
+	blobs seen at a given path. Default batch size is 16,000.
+
 SEE ALSO
 --------
 linkgit:git-clone[1].
diff --git a/builtin/backfill.c b/builtin/backfill.c
index e5f2000d5e0..127333daef8 100644
--- a/builtin/backfill.c
+++ b/builtin/backfill.c
@@ -21,7 +21,7 @@
 #include "path-walk.h"
 
 static const char * const builtin_backfill_usage[] = {
-	N_("git backfill [<options>]"),
+	N_("git backfill [--batch-size=<n>]"),
 	NULL
 };
 
@@ -112,6 +112,8 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
 		.batch_size = 50000,
 	};
 	struct option options[] = {
+		OPT_INTEGER(0, "batch-size", &ctx.batch_size,
+			    N_("Minimun number of objects to request at a time")),
 		OPT_END(),
 	};
 
diff --git a/t/t5620-backfill.sh b/t/t5620-backfill.sh
index 64326362d80..32e2bb1c132 100755
--- a/t/t5620-backfill.sh
+++ b/t/t5620-backfill.sh
@@ -59,6 +59,24 @@ test_expect_success 'do partial clone 1, backfill gets all objects' '
 	test_line_count = 0 revs2
 '
 
+test_expect_success 'do partial clone 2, backfill batch size' '
+	git clone --no-checkout --filter=blob:none	\
+		--single-branch --branch=main 		\
+		"file://$(pwd)/srv.bare" backfill2 &&
+
+	GIT_TRACE2_EVENT="$(pwd)/batch-trace" git \
+		-C backfill2 backfill --batch-size=20 &&
+
+	# Batches were used
+	test_trace2_data promisor fetch_count 20 <batch-trace >matches &&
+	test_line_count = 2 matches &&
+	test_trace2_data promisor fetch_count 8 <batch-trace &&
+
+	# No more missing objects!
+	git -C backfill2 rev-list --quiet --objects --missing=print HEAD >revs2 &&
+	test_line_count = 0 revs2
+'
+
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 4/5] backfill: add --sparse option
  2024-12-06 20:07 [PATCH 0/5] PATH WALK III: Add 'git backfill' command Derrick Stolee via GitGitGadget
                   ` (2 preceding siblings ...)
  2024-12-06 20:07 ` [PATCH 3/5] backfill: add --batch-size=<n> option Derrick Stolee via GitGitGadget
@ 2024-12-06 20:07 ` Derrick Stolee via GitGitGadget
  2024-12-16  8:01   ` Patrick Steinhardt
  2024-12-06 20:07 ` [PATCH 5/5] backfill: assume --sparse when sparse-checkout is enabled Derrick Stolee via GitGitGadget
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 39+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-12-06 20:07 UTC (permalink / raw)
  To: git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

One way to significantly reduce the cost of a Git clone and later fetches is
to use a blobless partial clone and combine that with a sparse-checkout that
reduces the paths that need to be populated in the working directory. Not
only does this reduce the cost of clones and fetches, the sparse-checkout
reduces the number of objects needed to download from a promisor remote.

However, history investigations can be expensie as computing blob diffs will
trigger promisor remote requests for one object at a time. This can be
avoided by downloading the blobs needed for the given sparse-checkout using
'git backfill' and its new '--sparse' mode, at a time that the user is
willing to pay that extra cost.

Note that this is distinctly different from the '--filter=sparse:<oid>'
option, as this assumes that the partial clone has all reachable trees and
we are using client-side logic to avoid downloading blobs outside of the
sparse-checkout cone. This avoids the server-side cost of walking trees
while also achieving a similar goal. It also downloads in batches based on
similar path names, presenting a resumable download if things are
interrupted.

This augments the path-walk API to have a possibly-NULL 'pl' member that may
point to a 'struct pattern_list'. This could be more general than the
sparse-checkout definition at HEAD, but 'git backfill --sparse' is currently
the only consumer.

Be sure to test this in both cone mode and not cone mode. Cone mode has the
benefit that the path-walk can skip certain paths once they would expand
beyond the sparse-checkout.

To test this, we can create a blobless sparse clone, expand the
sparse-checkout slightly, and then run 'git backfill --sparse' to see
how much data is downloaded. The general steps are

 1. git clone --filter=blob:none --sparse <url>
 2. git sparse-checkout set <dir1> ... <dirN>
 3. git backfill --sparse

For the Git repository with the 'builtin' directory in the
sparse-checkout, we get these results for various batch sizes:

| Batch Size      | Pack Count | Pack Size | Time  |
|-----------------|------------|-----------|-------|
| (Initial clone) | 3          | 110 MB    |       |
| 10K             | 12         | 192 MB    | 17.2s |
| 15K             | 9          | 192 MB    | 15.5s |
| 20K             | 8          | 192 MB    | 15.5s |
| 25K             | 7          | 192 MB    | 14.7s |

This case matters less because a full clone of the Git repository from
GitHub is currently at 277 MB.

Using a copy of the Linux repository with the 'kernel/' directory in the
sparse-checkout, we get these results:

| Batch Size      | Pack Count | Pack Size | Time |
|-----------------|------------|-----------|------|
| (Initial clone) | 2          | 1,876 MB  |      |
| 10K             | 11         | 2,187 MB  | 46s  |
| 25K             | 7          | 2,188 MB  | 43s  |
| 50K             | 5          | 2,194 MB  | 44s  |
| 100K            | 4          | 2,194 MB  | 48s  |

This case is more meaningful because a full clone of the Linux
repository is currently over 6 GB, so this is a valuable way to download
a fraction of the repository and no longer need network access for all
reachable objects within the sparse-checkout.

Choosing a batch size will depend on a lot of factors, including the
user's network speed or reliability, the repository's file structure,
and how many versions there are of the file within the sparse-checkout
scope. There will not be a one-size-fits-all solution.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/git-backfill.txt            |  6 ++-
 Documentation/technical/api-path-walk.txt |  8 ++++
 builtin/backfill.c                        | 13 +++++-
 dir.c                                     | 10 ++---
 dir.h                                     |  3 ++
 path-walk.c                               | 18 ++++++++
 path-walk.h                               | 11 +++++
 t/helper/test-path-walk.c                 | 22 ++++++++-
 t/t5620-backfill.sh                       | 55 +++++++++++++++++++++++
 t/t6601-path-walk.sh                      | 32 +++++++++++++
 10 files changed, 168 insertions(+), 10 deletions(-)

diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
index 9b0bae04e9d..ecf2ac428ce 100644
--- a/Documentation/git-backfill.txt
+++ b/Documentation/git-backfill.txt
@@ -9,7 +9,7 @@ git-backfill - Download missing objects in a partial clone
 SYNOPSIS
 --------
 [verse]
-'git backfill' [--batch-size=<n>]
+'git backfill' [--batch-size=<n>] [--[no-]sparse]
 
 DESCRIPTION
 -----------
@@ -46,6 +46,10 @@ OPTIONS
 	from the server. This size may be exceeded by the last set of
 	blobs seen at a given path. Default batch size is 16,000.
 
+--[no-]sparse::
+	Only download objects if they appear at a path that matches the
+	current sparse-checkout.
+
 SEE ALSO
 --------
 linkgit:git-clone[1].
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index 1fba0ce04cb..3e089211fb4 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -56,6 +56,14 @@ better off using the revision walk API instead.
 	the revision walk so that the walk emits commits marked with the
 	`UNINTERESTING` flag.
 
+`pl`::
+	This pattern list pointer allows focusing the path-walk search to
+	a set of patterns, only emitting paths that match the given
+	patterns. See linkgit:gitignore[5] or
+	linkgit:git-sparse-checkout[1] for details about pattern lists.
+	When the pattern list uses cone-mode patterns, then the path-walk
+	API can prune the set of paths it walks to improve performance.
+
 Examples
 --------
 
diff --git a/builtin/backfill.c b/builtin/backfill.c
index 127333daef8..cdee87e38af 100644
--- a/builtin/backfill.c
+++ b/builtin/backfill.c
@@ -4,6 +4,7 @@
 #include "parse-options.h"
 #include "repository.h"
 #include "commit.h"
+#include "dir.h"
 #include "hex.h"
 #include "tree.h"
 #include "tree-walk.h"
@@ -21,7 +22,7 @@
 #include "path-walk.h"
 
 static const char * const builtin_backfill_usage[] = {
-	N_("git backfill [--batch-size=<n>]"),
+	N_("git backfill [--batch-size=<n>] [--[no-]sparse]"),
 	NULL
 };
 
@@ -29,6 +30,7 @@ struct backfill_context {
 	struct repository *repo;
 	struct oid_array current_batch;
 	size_t batch_size;
+	int sparse;
 };
 
 static void clear_backfill_context(struct backfill_context *ctx)
@@ -84,6 +86,12 @@ static int do_backfill(struct backfill_context *ctx)
 	struct path_walk_info info = PATH_WALK_INFO_INIT;
 	int ret;
 
+	if (ctx->sparse) {
+		CALLOC_ARRAY(info.pl, 1);
+		if (get_sparse_checkout_patterns(info.pl))
+			return error(_("problem loading sparse-checkout"));
+	}
+
 	repo_init_revisions(ctx->repo, &revs, "");
 	handle_revision_arg("HEAD", &revs, 0, 0);
 
@@ -110,10 +118,13 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
 		.repo = repo,
 		.current_batch = OID_ARRAY_INIT,
 		.batch_size = 50000,
+		.sparse = 0,
 	};
 	struct option options[] = {
 		OPT_INTEGER(0, "batch-size", &ctx.batch_size,
 			    N_("Minimun number of objects to request at a time")),
+		OPT_BOOL(0, "sparse", &ctx.sparse,
+			 N_("Restrict the missing objects to the current sparse-checkout")),
 		OPT_END(),
 	};
 
diff --git a/dir.c b/dir.c
index c43b5e30813..32af7ee294d 100644
--- a/dir.c
+++ b/dir.c
@@ -1088,10 +1088,6 @@ static void invalidate_directory(struct untracked_cache *uc,
 		dir->dirs[i]->recurse = 0;
 }
 
-static int add_patterns_from_buffer(char *buf, size_t size,
-				    const char *base, int baselen,
-				    struct pattern_list *pl);
-
 /* Flags for add_patterns() */
 #define PATTERN_NOFOLLOW (1<<0)
 
@@ -1181,9 +1177,9 @@ static int add_patterns(const char *fname, const char *base, int baselen,
 	return 0;
 }
 
-static int add_patterns_from_buffer(char *buf, size_t size,
-				    const char *base, int baselen,
-				    struct pattern_list *pl)
+int add_patterns_from_buffer(char *buf, size_t size,
+			     const char *base, int baselen,
+			     struct pattern_list *pl)
 {
 	char *orig = buf;
 	int i, lineno = 1;
diff --git a/dir.h b/dir.h
index a3a2f00f5d9..6cfef5df660 100644
--- a/dir.h
+++ b/dir.h
@@ -467,6 +467,9 @@ void add_patterns_from_file(struct dir_struct *, const char *fname);
 int add_patterns_from_blob_to_list(struct object_id *oid,
 				   const char *base, int baselen,
 				   struct pattern_list *pl);
+int add_patterns_from_buffer(char *buf, size_t size,
+			     const char *base, int baselen,
+			     struct pattern_list *pl);
 void parse_path_pattern(const char **string, int *patternlen, unsigned *flags, int *nowildcardlen);
 void add_pattern(const char *string, const char *base,
 		 int baselen, struct pattern_list *pl, int srcpos);
diff --git a/path-walk.c b/path-walk.c
index b31924df52e..47809f8c315 100644
--- a/path-walk.c
+++ b/path-walk.c
@@ -12,6 +12,7 @@
 #include "object.h"
 #include "oid-array.h"
 #include "prio-queue.h"
+#include "repository.h"
 #include "revision.h"
 #include "string-list.h"
 #include "strmap.h"
@@ -166,6 +167,23 @@ static int add_children(struct path_walk_context *ctx,
 		if (type == OBJ_TREE)
 			strbuf_addch(&path, '/');
 
+		if (ctx->info->pl) {
+			int dtype;
+			enum pattern_match_result match;
+			match = path_matches_pattern_list(path.buf, path.len,
+							  path.buf + base_len, &dtype,
+							  ctx->info->pl,
+							  ctx->repo->index);
+
+			if (ctx->info->pl->use_cone_patterns &&
+			    match == NOT_MATCHED)
+				continue;
+			else if (!ctx->info->pl->use_cone_patterns &&
+				 type == OBJ_BLOB &&
+				 match != MATCHED)
+				continue;
+		}
+
 		if (!(list = strmap_get(&ctx->paths_to_lists, path.buf))) {
 			CALLOC_ARRAY(list, 1);
 			list->type = type;
diff --git a/path-walk.h b/path-walk.h
index de0db007dc9..0961f67bc9d 100644
--- a/path-walk.h
+++ b/path-walk.h
@@ -6,6 +6,7 @@
 
 struct rev_info;
 struct oid_array;
+struct pattern_list;
 
 /**
  * The type of a function pointer for the method that is called on a list of
@@ -47,6 +48,16 @@ struct path_walk_info {
 	 * walk the children of such trees.
 	 */
 	int prune_all_uninteresting;
+
+	/**
+	 * Specify a sparse-checkout definition to match our paths to. Do not
+	 * walk outside of this sparse definition. If the patterns are in
+	 * cone mode, then the search may prune directories that are outside
+	 * of the cone. If not in cone mode, then all tree paths will be
+	 * explored but the path_fn will only be called when the path matches
+	 * the sparse-checkout patterns.
+	 */
+	struct pattern_list *pl;
 };
 
 #define PATH_WALK_INFO_INIT {   \
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
index 7f2d409c5bc..61e845e5ec2 100644
--- a/t/helper/test-path-walk.c
+++ b/t/helper/test-path-walk.c
@@ -1,6 +1,7 @@
 #define USE_THE_REPOSITORY_VARIABLE
 
 #include "test-tool.h"
+#include "dir.h"
 #include "environment.h"
 #include "hex.h"
 #include "object-name.h"
@@ -9,6 +10,7 @@
 #include "revision.h"
 #include "setup.h"
 #include "parse-options.h"
+#include "strbuf.h"
 #include "path-walk.h"
 #include "oid-array.h"
 
@@ -65,7 +67,7 @@ static int emit_block(const char *path, struct oid_array *oids,
 
 int cmd__path_walk(int argc, const char **argv)
 {
-	int res;
+	int res, stdin_pl = 0;
 	struct rev_info revs = REV_INFO_INIT;
 	struct path_walk_info info = PATH_WALK_INFO_INIT;
 	struct path_walk_test_data data = { 0 };
@@ -80,6 +82,8 @@ int cmd__path_walk(int argc, const char **argv)
 			 N_("toggle inclusion of tree objects")),
 		OPT_BOOL(0, "prune", &info.prune_all_uninteresting,
 			 N_("toggle pruning of uninteresting paths")),
+		OPT_BOOL(0, "stdin-pl", &stdin_pl,
+			 N_("read a pattern list over stdin")),
 		OPT_END(),
 	};
 
@@ -99,6 +103,17 @@ int cmd__path_walk(int argc, const char **argv)
 	info.path_fn = emit_block;
 	info.path_fn_data = &data;
 
+	if (stdin_pl) {
+		struct strbuf in = STRBUF_INIT;
+		CALLOC_ARRAY(info.pl, 1);
+
+		info.pl->use_cone_patterns = 1;
+
+		strbuf_fread(&in, 2048, stdin);
+		add_patterns_from_buffer(in.buf, in.len, "", 0, info.pl);
+		strbuf_release(&in);
+	}
+
 	res = walk_objects_by_path(&info);
 
 	printf("commits:%" PRIuMAX "\n"
@@ -107,6 +122,11 @@ int cmd__path_walk(int argc, const char **argv)
 	       "tags:%" PRIuMAX "\n",
 	       data.commit_nr, data.tree_nr, data.blob_nr, data.tag_nr);
 
+	if (info.pl) {
+		clear_pattern_list(info.pl);
+		free(info.pl);
+	}
+
 	release_revisions(&revs);
 	return res;
 }
diff --git a/t/t5620-backfill.sh b/t/t5620-backfill.sh
index 32e2bb1c132..c2acd1339bd 100755
--- a/t/t5620-backfill.sh
+++ b/t/t5620-backfill.sh
@@ -77,6 +77,61 @@ test_expect_success 'do partial clone 2, backfill batch size' '
 	test_line_count = 0 revs2
 '
 
+test_expect_success 'backfill --sparse' '
+	git clone --sparse --filter=blob:none		\
+		--single-branch --branch=main 		\
+		"file://$(pwd)/srv.bare" backfill3 &&
+
+	# Initial checkout includes four files at root.
+	git -C backfill3 rev-list --quiet --objects --missing=print HEAD >missing &&
+	test_line_count = 44 missing &&
+
+	# Initial sparse-checkout is just the files at root, so we get the
+	# older versions of the four files at tip.
+	GIT_TRACE2_EVENT="$(pwd)/sparse-trace1" git \
+		-C backfill3 backfill --sparse &&
+	test_trace2_data promisor fetch_count 4 <sparse-trace1 &&
+	test_trace2_data path-walk paths 5 <sparse-trace1 &&
+	git -C backfill3 rev-list --quiet --objects --missing=print HEAD >missing &&
+	test_line_count = 40 missing &&
+
+	# Expand the sparse-checkout to include 'd' recursively. This
+	# engages the algorithm to skip the trees for 'a'. Note that
+	# the "sparse-checkout set" command downloads the objects at tip
+	# to satisfy the current checkout.
+	git -C backfill3 sparse-checkout set d &&
+	GIT_TRACE2_EVENT="$(pwd)/sparse-trace2" git \
+		-C backfill3 backfill --sparse &&
+	test_trace2_data promisor fetch_count 8 <sparse-trace2 &&
+	test_trace2_data path-walk paths 15 <sparse-trace2 &&
+	git -C backfill3 rev-list --quiet --objects --missing=print HEAD >missing &&
+	test_line_count = 24 missing
+'
+
+test_expect_success 'backfill --sparse without cone mode' '
+	git clone --no-checkout --filter=blob:none		\
+		--single-branch --branch=main 		\
+		"file://$(pwd)/srv.bare" backfill4 &&
+
+	# No blobs yet
+	git -C backfill4 rev-list --quiet --objects --missing=print HEAD >missing &&
+	test_line_count = 48 missing &&
+
+	# Define sparse-checkout by filename regardless of parent directory.
+	# This downloads 6 blobs to satisfy the checkout.
+	git -C backfill4 sparse-checkout set --no-cone "**/file.1.txt" &&
+	git -C backfill4 checkout main &&
+
+	GIT_TRACE2_EVENT="$(pwd)/no-cone-trace1" git \
+		-C backfill4 backfill --sparse &&
+	test_trace2_data promisor fetch_count 6 <no-cone-trace1 &&
+
+	# This walk needed to visit all directories to search for these paths.
+	test_trace2_data path-walk paths 12 <no-cone-trace1 &&
+	git -C backfill4 rev-list --quiet --objects --missing=print HEAD >missing &&
+	test_line_count = 36 missing
+'
+
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd
 
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
index 7d765ffe907..538e2ed297f 100755
--- a/t/t6601-path-walk.sh
+++ b/t/t6601-path-walk.sh
@@ -176,6 +176,38 @@ test_expect_success 'branches and indexed objects mix well' '
 	test_cmp_sorted expect out
 '
 
+test_expect_success 'base & topic, sparse' '
+	cat >patterns <<-EOF &&
+	/*
+	!/*/
+	/left/
+	EOF
+
+	test-tool path-walk --stdin-pl -- base topic <patterns >out &&
+
+	cat >expect <<-EOF &&
+	0:commit::$(git rev-parse topic)
+	0:commit::$(git rev-parse base)
+	0:commit::$(git rev-parse base~1)
+	0:commit::$(git rev-parse base~2)
+	1:tree::$(git rev-parse topic^{tree})
+	1:tree::$(git rev-parse base^{tree})
+	1:tree::$(git rev-parse base~1^{tree})
+	1:tree::$(git rev-parse base~2^{tree})
+	2:blob:a:$(git rev-parse base~2:a)
+	3:tree:left/:$(git rev-parse base:left)
+	3:tree:left/:$(git rev-parse base~2:left)
+	4:blob:left/b:$(git rev-parse base~2:left/b)
+	4:blob:left/b:$(git rev-parse base:left/b)
+	blobs:3
+	commits:4
+	tags:0
+	trees:6
+	EOF
+
+	test_cmp_sorted expect out
+'
+
 test_expect_success 'topic only' '
 	test-tool path-walk -- topic >out &&
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 5/5] backfill: assume --sparse when sparse-checkout is enabled
  2024-12-06 20:07 [PATCH 0/5] PATH WALK III: Add 'git backfill' command Derrick Stolee via GitGitGadget
                   ` (3 preceding siblings ...)
  2024-12-06 20:07 ` [PATCH 4/5] backfill: add --sparse option Derrick Stolee via GitGitGadget
@ 2024-12-06 20:07 ` Derrick Stolee via GitGitGadget
  2024-12-08 10:53 ` [PATCH 0/5] PATH WALK III: Add 'git backfill' command Junio C Hamano
  2024-12-20 16:29 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  6 siblings, 0 replies; 39+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-12-06 20:07 UTC (permalink / raw)
  To: git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The previous change introduced the '--[no-]sparse' option for the 'git
backfill' command, but did not assume it as enabled by default. However,
this is likely the behavior that users will most often want to happen.
Without this default, users with a small sparse-checkout may be confused
when 'git backfill' downloads every version of every object in the full
history.

However, this is left as a separate change so this decision can be reviewed
independently of the value of the '--[no-]sparse' option.

Add a test of adding the '--sparse' option to a repo without sparse-checkout
to make it clear that supplying it without a sparse-checkout is an error.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/git-backfill.txt |  3 ++-
 builtin/backfill.c             |  7 +++++++
 t/t5620-backfill.sh            | 13 ++++++++++++-
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
index ecf2ac428ce..066ec6b161a 100644
--- a/Documentation/git-backfill.txt
+++ b/Documentation/git-backfill.txt
@@ -48,7 +48,8 @@ OPTIONS
 
 --[no-]sparse::
 	Only download objects if they appear at a path that matches the
-	current sparse-checkout.
+	current sparse-checkout. If the sparse-checkout feature is enabled,
+	then `--sparse` is assumed and can be disabled with `--no-sparse`.
 
 SEE ALSO
 --------
diff --git a/builtin/backfill.c b/builtin/backfill.c
index cdee87e38af..225764f17e8 100644
--- a/builtin/backfill.c
+++ b/builtin/backfill.c
@@ -1,3 +1,6 @@
+/* We need this macro to access core_apply_sparse_checkout */
+#define USE_THE_REPOSITORY_VARIABLE
+
 #include "builtin.h"
 #include "git-compat-util.h"
 #include "config.h"
@@ -5,6 +8,7 @@
 #include "repository.h"
 #include "commit.h"
 #include "dir.h"
+#include "environment.h"
 #include "hex.h"
 #include "tree.h"
 #include "tree-walk.h"
@@ -136,5 +140,8 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
 
 	repo_config(repo, git_default_config, NULL);
 
+	if (ctx.sparse < 0)
+		ctx.sparse = core_apply_sparse_checkout;
+
 	return do_backfill(&ctx);
 }
diff --git a/t/t5620-backfill.sh b/t/t5620-backfill.sh
index c2acd1339bd..eecf03d5199 100755
--- a/t/t5620-backfill.sh
+++ b/t/t5620-backfill.sh
@@ -77,6 +77,12 @@ test_expect_success 'do partial clone 2, backfill batch size' '
 	test_line_count = 0 revs2
 '
 
+test_expect_success 'backfill --sparse without sparse-checkout fails' '
+	git init not-sparse &&
+	test_must_fail git -C not-sparse backfill --sparse 2>err &&
+	grep "problem loading sparse-checkout" err
+'
+
 test_expect_success 'backfill --sparse' '
 	git clone --sparse --filter=blob:none		\
 		--single-branch --branch=main 		\
@@ -105,7 +111,12 @@ test_expect_success 'backfill --sparse' '
 	test_trace2_data promisor fetch_count 8 <sparse-trace2 &&
 	test_trace2_data path-walk paths 15 <sparse-trace2 &&
 	git -C backfill3 rev-list --quiet --objects --missing=print HEAD >missing &&
-	test_line_count = 24 missing
+	test_line_count = 24 missing &&
+
+	# Disabling the --sparse option (on by default) will download everything
+	git -C backfill3 backfill --no-sparse &&
+	git -C backfill3 rev-list --quiet --objects --missing=print HEAD >missing &&
+	test_line_count = 0 missing
 '
 
 test_expect_success 'backfill --sparse without cone mode' '
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/5] PATH WALK III: Add 'git backfill' command
  2024-12-06 20:07 [PATCH 0/5] PATH WALK III: Add 'git backfill' command Derrick Stolee via GitGitGadget
                   ` (4 preceding siblings ...)
  2024-12-06 20:07 ` [PATCH 5/5] backfill: assume --sparse when sparse-checkout is enabled Derrick Stolee via GitGitGadget
@ 2024-12-08 10:53 ` Junio C Hamano
  2024-12-09  0:34   ` Junio C Hamano
  2024-12-20 16:29 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  6 siblings, 1 reply; 39+ messages in thread
From: Junio C Hamano @ 2024-12-08 10:53 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, johannes.schindelin, peff, ps, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Derrick Stolee

It seems that the result of calling init_revisions() from backfill
is leaked?

https://github.com/git/git/actions/runs/12218430154/job/34083929479 

I did not dig further but the below is from my local leaksanitizer
run.

Thanks.

=================================================================
==git==182342==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 576 byte(s) in 1 object(s) allocated from:
    #0 0x55d4b0e42915 in __interceptor_realloc (git+0x84915) (BuildId: c861e65ec43b0a3ef46b9555a81d6ddfc2358e8e)
    #1 0x55d4b119ce6d in xrealloc wrapper.c:140:8
    #2 0x55d4b11204d4 in add_rev_cmdline revision.c:1563:2
    #3 0x55d4b11187a1 in handle_revision_arg_1 revision.c:2263:2
    #4 0x55d4b1118398 in handle_revision_arg revision.c:2275:12
    #5 0x55d4b0e5233b in do_backfill builtin/backfill.c:100:2
    #6 0x55d4b0e52253 in cmd_backfill builtin/backfill.c:146:9
    #7 0x55d4b0e46b80 in run_builtin git.c:480:11
    #8 0x55d4b0e45502 in handle_builtin git.c:741:9
    #9 0x55d4b0e4649c in run_argv git.c:808:4
    #10 0x55d4b0e45274 in cmd_main git.c:948:19
    #11 0x55d4b0f6da8a in main common-main.c:9:11
    #12 0x7ff25f97ac89 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #13 0x7ff25f97ad44 in __libc_start_main csu/../csu/libc-start.c:360:3
    #14 0x55d4b0e171e0 in _start (git+0x591e0) (BuildId: c861e65ec43b0a3ef46b9555a81d6ddfc2358e8e)

DEDUP_TOKEN: __interceptor_realloc--xrealloc--add_rev_cmdline--handle_revision_arg_1--handle_revision_arg--do_backfill--cmd_backfill--run_builtin--handle_builtin--run_argv--cmd_main--main--__libc_start_call_main--__libc_start_main--_start
Indirect leak of 5 byte(s) in 1 object(s) allocated from:
    #0 0x55d4b0e424b6 in __interceptor_malloc (git+0x844b6) (BuildId: c861e65ec43b0a3ef46b9555a81d6ddfc2358e8e)
    #1 0x7ff25f9f34f9 in strdup string/strdup.c:42:15
    #2 0x55d4b119cb14 in xstrdup wrapper.c:43:14
    #3 0x55d4b1120506 in add_rev_cmdline revision.c:1565:23
    #4 0x55d4b11187a1 in handle_revision_arg_1 revision.c:2263:2
    #5 0x55d4b1118398 in handle_revision_arg revision.c:2275:12
    #6 0x55d4b0e5233b in do_backfill builtin/backfill.c:100:2
    #7 0x55d4b0e52253 in cmd_backfill builtin/backfill.c:146:9
    #8 0x55d4b0e46b80 in run_builtin git.c:480:11
    #9 0x55d4b0e45502 in handle_builtin git.c:741:9
    #10 0x55d4b0e4649c in run_argv git.c:808:4
    #11 0x55d4b0e45274 in cmd_main git.c:948:19
    #12 0x55d4b0f6da8a in main common-main.c:9:11
    #13 0x7ff25f97ac89 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #14 0x7ff25f97ad44 in __libc_start_main csu/../csu/libc-start.c:360:3
    #15 0x55d4b0e171e0 in _start (git+0x591e0) (BuildId: c861e65ec43b0a3ef46b9555a81d6ddfc2358e8e)

DEDUP_TOKEN: __interceptor_malloc--strdup--xstrdup--add_rev_cmdline--handle_revision_arg_1--handle_revision_arg--do_backfill--cmd_backfill--run_builtin--handle_builtin--run_argv--cmd_main--main--__libc_start_call_main--__libc_start_main--_start
SUMMARY: LeakSanitizer: 581 byte(s) leaked in 2 allocation(s).
Our logs revealed a memory leak...
++ rmdir /home/gitster/w/git.git/t/test-results/t5620-backfill.leak
++ :
++ exit 134
++ eval_ret=134

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/5] PATH WALK III: Add 'git backfill' command
  2024-12-08 10:53 ` [PATCH 0/5] PATH WALK III: Add 'git backfill' command Junio C Hamano
@ 2024-12-09  0:34   ` Junio C Hamano
  0 siblings, 0 replies; 39+ messages in thread
From: Junio C Hamano @ 2024-12-09  0:34 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, johannes.schindelin, peff, ps, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Derrick Stolee

Junio C Hamano <gitster@pobox.com> writes:

> It seems that the result of calling init_revisions() from backfill
> is leaked?
>
> https://github.com/git/git/actions/runs/12218430154/job/34083929479 
>
> I did not dig further but the below is from my local leaksanitizer
> run.

The attached patch seems to plug the leaks observed by your backfill
tests.  If you agree with the implementation of the change, you are
welcome to squash it in.  I may be missing a better mechanism in the
path-walk API that I could use to plug the leaks, in which case, of
course a fix using such a better mechanism is very much welcomed.

A few things I noticed about path-walk API during my poking are:

 - struct path_walk_info has PATH_WALK_INFO_INIT() macro, but not a
   matching path_walk_info_init() helper function to help initialize
   any heap-allocated instances.  In general, anything that requires
   _INIT() macro by definition wants to be initialized with more
   than nul-initialized (otherwise, it would be fine to leave it
   uninitialized in the BSS, clear with = {0} initialization on the
   stack), so lack of matching path_walk_info_init() raises an
   eyebrow.

 - struct path_walk_info seems to lack a destructor.  In general,
   anything complex enough to require initializer should have one.

 - lack of destructor for path_walk_info requires users to release
   resources held by "struct pattern_list" instance at pwi.pl; if we
   had a destructor, we wouldn't have 2 of the three leaks we had to
   fix here.

Thanks.


diff --git c/builtin/backfill.c w/builtin/backfill.c
index 225764f17e..1cf2df9303 100644
--- c/builtin/backfill.c
+++ w/builtin/backfill.c
@@ -92,8 +92,11 @@ static int do_backfill(struct backfill_context *ctx)
 
 	if (ctx->sparse) {
 		CALLOC_ARRAY(info.pl, 1);
-		if (get_sparse_checkout_patterns(info.pl))
+		if (get_sparse_checkout_patterns(info.pl)) {
+			clear_pattern_list(info.pl);
+			free(info.pl);
 			return error(_("problem loading sparse-checkout"));
+		}
 	}
 
 	repo_init_revisions(ctx->repo, &revs, "");
@@ -113,6 +116,11 @@ static int do_backfill(struct backfill_context *ctx)
 		download_batch(ctx);
 
 	clear_backfill_context(ctx);
+	release_revisions(&revs);
+	if (info.pl) {
+		clear_pattern_list(info.pl);
+		free(info.pl);
+	}
 	return ret;
 }
 


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/5] backfill: add --batch-size=<n> option
  2024-12-06 20:07 ` [PATCH 3/5] backfill: add --batch-size=<n> option Derrick Stolee via GitGitGadget
@ 2024-12-16  8:01   ` Patrick Steinhardt
  2024-12-18 15:09     ` Derrick Stolee
  2025-01-19 17:57   ` Jean-Noël AVILA
  1 sibling, 1 reply; 39+ messages in thread
From: Patrick Steinhardt @ 2024-12-16  8:01 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, johannes.schindelin, peff, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Derrick Stolee, Derrick Stolee

On Fri, Dec 06, 2024 at 08:07:16PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <derrickstolee@github.com>
> 
> Users may want to specify a minimum batch size for their needs. This is only
> a minimum: the path-walk API provides a list of OIDs that correspond to the
> same path, and thus it is optimal to allow delta compression across those
> objects in a single server request.

Okay, here you explicitly say that this is a minimum batch size, so this
is by design and with a proper reason. Good.

> We could consider limiting the request to have a maximum batch size in the
> future. For now, we let the path-walk API batches determine the
> boundaries.

Should we maybe rename `--batch-size` to `--min-batch-size` so that it
does not become awkward if we ever want to have a maximum batch size, as
well? Also helps to set expectations with the user.

[snip]
> Based on these experiments, a batch size of 50,000 was chosen as the
> default value.

Thanks for all the data, this is really helpful!

> diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
> index 0e10f066fef..9b0bae04e9d 100644
> --- a/Documentation/git-backfill.txt
> +++ b/Documentation/git-backfill.txt
> @@ -38,6 +38,14 @@ delta compression in the packfile sent by the server.
>  By default, `git backfill` downloads all blobs reachable from the `HEAD`
>  commit. This set can be restricted or expanded using various options.
>  
> +OPTIONS
> +-------
> +
> +--batch-size=<n>::
> +	Specify a minimum size for a batch of missing objects to request
> +	from the server. This size may be exceeded by the last set of
> +	blobs seen at a given path. Default batch size is 16,000.

This is stale: s/16,000/50,000/

> diff --git a/builtin/backfill.c b/builtin/backfill.c
> index e5f2000d5e0..127333daef8 100644
> --- a/builtin/backfill.c
> +++ b/builtin/backfill.c
> @@ -112,6 +112,8 @@ int cmd_backfill(int argc, const char **argv, const char *prefix,
> struct reposit
>                 .batch_size = 50000,
>         };
>         struct option options[] = {
> +               OPT_INTEGER(0, "batch-size", &ctx.batch_size,
> +                           N_("Minimun number of objects to request at a time")),

s/Minimun/Minimum

Patrick

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/5] backfill: basic functionality and tests
  2024-12-06 20:07 ` [PATCH 2/5] backfill: basic functionality and tests Derrick Stolee via GitGitGadget
@ 2024-12-16  8:01   ` Patrick Steinhardt
  2024-12-18 15:03     ` Derrick Stolee
  0 siblings, 1 reply; 39+ messages in thread
From: Patrick Steinhardt @ 2024-12-16  8:01 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, johannes.schindelin, peff, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Derrick Stolee, Derrick Stolee

On Fri, Dec 06, 2024 at 08:07:15PM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
> index 640144187d3..0e10f066fef 100644
> --- a/Documentation/git-backfill.txt
> +++ b/Documentation/git-backfill.txt
> @@ -14,6 +14,30 @@ SYNOPSIS
>  DESCRIPTION
>  -----------
>  
> +Blobless partial clones are created using `git clone --filter=blob:none`
> +and then configure the local repository such that the Git client avoids
> +downloading blob objects unless they are required for a local operation.
> +This initially means that the clone and later fetches download reachable
> +commits and trees but no blobs. Later operations that change the `HEAD`
> +pointer, such as `git checkout` or `git merge`, may need to download
> +missing blobs in order to complete their operation.

Okay.

> +In the worst cases, commands that compute blob diffs, such as `git blame`,
> +become very slow as they download the missing blobs in single-blob
> +requests to satisfy the missing object as the Git command needs it. This
> +leads to multiple download requests and no ability for the Git server to
> +provide delta compression across those objects.
> +
> +The `git backfill` command provides a way for the user to request that
> +Git downloads the missing blobs (with optional filters) such that the
> +missing blobs representing historical versions of files can be downloaded
> +in batches. The `backfill` command attempts to optimize the request by
> +grouping blobs that appear at the same path, hopefully leading to good
> +delta compression in the packfile sent by the server.

Hm. So we're asking the user to fix a usability issue of git-blame(1),
don't we? Ideally, git-blame(1) itself should know to transparently
batch the blobs it requires to compute its output, shouldn't it? That
usecase alone doesn't yet convince me that git-backfill(1) is a good
idea as I'd think we should rather fix the underlying issue.

So are there other usecases for git-backfill(1)? I can imagine that it
might be helpful in the context of scripts that know they'll operate on
a bunch of blobs.

> diff --git a/builtin/backfill.c b/builtin/backfill.c
> index 38e6aaeaa03..e5f2000d5e0 100644
> --- a/builtin/backfill.c
> +++ b/builtin/backfill.c
> @@ -1,16 +1,116 @@
>  #include "builtin.h"
> +#include "git-compat-util.h"
>  #include "config.h"
>  #include "parse-options.h"
>  #include "repository.h"
> +#include "commit.h"
> +#include "hex.h"
> +#include "tree.h"
> +#include "tree-walk.h"
>  #include "object.h"
> +#include "object-store-ll.h"
> +#include "oid-array.h"
> +#include "oidset.h"
> +#include "promisor-remote.h"
> +#include "strmap.h"
> +#include "string-list.h"
> +#include "revision.h"
> +#include "trace2.h"
> +#include "progress.h"
> +#include "packfile.h"
> +#include "path-walk.h"
>  
>  static const char * const builtin_backfill_usage[] = {
>  	N_("git backfill [<options>]"),
>  	NULL
>  };
>  
> +struct backfill_context {
> +	struct repository *repo;
> +	struct oid_array current_batch;
> +	size_t batch_size;
> +};
> +
> +static void clear_backfill_context(struct backfill_context *ctx)
> +{
> +	oid_array_clear(&ctx->current_batch);
> +}

Nit: our style guide says that this should rather be
`backfill_context_clear()`.

> +static void download_batch(struct backfill_context *ctx)
> +{
> +	promisor_remote_get_direct(ctx->repo,
> +				   ctx->current_batch.oid,
> +				   ctx->current_batch.nr);
> +	oid_array_clear(&ctx->current_batch);
> +
> +	/*
> +	 * We likely have a new packfile. Add it to the packed list to
> +	 * avoid possible duplicate downloads of the same objects.
> +	 */
> +	reprepare_packed_git(ctx->repo);
> +}
> +
> +static int fill_missing_blobs(const char *path UNUSED,
> +			      struct oid_array *list,
> +			      enum object_type type,
> +			      void *data)
> +{
> +	struct backfill_context *ctx = data;
> +
> +	if (type != OBJ_BLOB)
> +		return 0;
> +
> +	for (size_t i = 0; i < list->nr; i++) {
> +		off_t size = 0;
> +		struct object_info info = OBJECT_INFO_INIT;
> +		info.disk_sizep = &size;
> +		if (oid_object_info_extended(ctx->repo,
> +					     &list->oid[i],
> +					     &info,
> +					     OBJECT_INFO_FOR_PREFETCH) ||
> +		    !size)
> +			oid_array_append(&ctx->current_batch, &list->oid[i]);
> +	}
> +
> +	if (ctx->current_batch.nr >= ctx->batch_size)
> +		download_batch(ctx);

Okay, so the batch size is just "best effort". If we walk a tree that
makes us exceed the batch size then we wouldn't issue a fetch during the
tree walk. Is there any specific reason for this behaviour?

In any case, as long as this is properly documented I think this should
be fine in general.

> +	return 0;
> +}
> +
> +static int do_backfill(struct backfill_context *ctx)
> +{
> +	struct rev_info revs;
> +	struct path_walk_info info = PATH_WALK_INFO_INIT;
> +	int ret;
> +
> +	repo_init_revisions(ctx->repo, &revs, "");
> +	handle_revision_arg("HEAD", &revs, 0, 0);
> +
> +	info.blobs = 1;
> +	info.tags = info.commits = info.trees = 0;
> +
> +	info.revs = &revs;
> +	info.path_fn = fill_missing_blobs;
> +	info.path_fn_data = ctx;
> +
> +	ret = walk_objects_by_path(&info);
> +
> +	/* Download the objects that did not fill a batch. */
> +	if (!ret)
> +		download_batch(ctx);
> +
> +	clear_backfill_context(ctx);

Are we leaking `revs` and `info`?

> +	return ret;
> +}
> +
>  int cmd_backfill(int argc, const char **argv, const char *prefix, struct repository *repo)
>  {
> +	struct backfill_context ctx = {
> +		.repo = repo,
> +		.current_batch = OID_ARRAY_INIT,
> +		.batch_size = 50000,
> +	};
>  	struct option options[] = {
>  		OPT_END(),
>  	};
> @@ -23,7 +123,5 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
>  
>  	repo_config(repo, git_default_config, NULL);
>  
> -	die(_("not implemented"));
> -
> -	return 0;
> +	return do_backfill(&ctx);
>  }

The current iteration only backfills blobs as far as I can see. Do we
maybe want to keep the door open for future changes in git-backfill(1)
by implementing this via a "blob" subcommand?

Patrick

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/5] backfill: add --sparse option
  2024-12-06 20:07 ` [PATCH 4/5] backfill: add --sparse option Derrick Stolee via GitGitGadget
@ 2024-12-16  8:01   ` Patrick Steinhardt
  0 siblings, 0 replies; 39+ messages in thread
From: Patrick Steinhardt @ 2024-12-16  8:01 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, johannes.schindelin, peff, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Derrick Stolee, Derrick Stolee

On Fri, Dec 06, 2024 at 08:07:17PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <derrickstolee@github.com>
> 
> One way to significantly reduce the cost of a Git clone and later fetches is
> to use a blobless partial clone and combine that with a sparse-checkout that
> reduces the paths that need to be populated in the working directory. Not
> only does this reduce the cost of clones and fetches, the sparse-checkout
> reduces the number of objects needed to download from a promisor remote.
> 
> However, history investigations can be expensie as computing blob diffs will

s/expensie/expensive

Patrick

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/5] backfill: basic functionality and tests
  2024-12-16  8:01   ` Patrick Steinhardt
@ 2024-12-18 15:03     ` Derrick Stolee
  0 siblings, 0 replies; 39+ messages in thread
From: Derrick Stolee @ 2024-12-18 15:03 UTC (permalink / raw)
  To: Patrick Steinhardt, Derrick Stolee via GitGitGadget
  Cc: git, gitster, johannes.schindelin, peff, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Derrick Stolee

On 12/16/24 3:01 AM, Patrick Steinhardt wrote:
 > On Fri, Dec 06, 2024 at 08:07:15PM +0000, Derrick Stolee via GitGitGadget wrote:

 >> +The `git backfill` command provides a way for the user to request that
 >> +Git downloads the missing blobs (with optional filters) such that the
 >> +missing blobs representing historical versions of files can be downloaded
 >> +in batches. The `backfill` command attempts to optimize the request by
 >> +grouping blobs that appear at the same path, hopefully leading to good
 >> +delta compression in the packfile sent by the server.
 >
 > Hm. So we're asking the user to fix a usability issue of git-blame(1),
 > don't we? Ideally, git-blame(1) itself should know to transparently
 > batch the blobs it requires to compute its output, shouldn't it? That
 > usecase alone doesn't yet convince me that git-backfill(1) is a good
 > idea as I'd think we should rather fix the underlying issue.

I've looked into making this change for 'git blame' and it is a
nontrivial change. I'm not sure on the timeline that we could expect
'git blame' to be improved.

But you're right that this is not enough justification on its own. It's
an example of a kind of command that has these problems, including 'git
log [-p|-L]'.

 > So are there other usecases for git-backfill(1)? I can imagine that it
 > might be helpful in the context of scripts that know they'll operate on
 > a bunch of blobs.

One major motivation is that it can essentially provide a way to do
resumable clones: get the commits and trees in one go with a blobless
clone, then download the blobs in batches. If something interrupts the
'git backfill' command, then restarting it will only repeat the most
recent batch.

Finally, in a later patch we can see that the --sparse option allows a
user to operate as if they have a full clone but where they only include
blob data within their sparse-checkout, providing reduced disk space and
network time to get to that state.

 >> +	if (ctx->current_batch.nr >= ctx->batch_size)
 >> +		download_batch(ctx);
 >
 > Okay, so the batch size is just "best effort". If we walk a tree that
 > makes us exceed the batch size then we wouldn't issue a fetch during the
 > tree walk. Is there any specific reason for this behaviour?
 >
 > In any case, as long as this is properly documented I think this should
 > be fine in general.

The main reason is that we expect the server to return a packfile that
has many delta relationships within the objects at a given path. If we
split the batch in the middle of a path batch, then we are artificially
introducing breaks in the delta chains that could be wasteful.

However, this batching pattern could be problematic if there are a
million versions of a single file and the batch is too large to download
efficiently. This "absolute max batch size" is currently left as a
future extension.

 >> +	clear_backfill_context(ctx);
 >
 > Are we leaking `revs` and `info`?

At the moment. Will fix.

 >> +	return ret;
 >> +}
 >> +
 >>   int cmd_backfill(int argc, const char **argv, const char *prefix, struct 
repository *repo)
 >>   {
 >> +	struct backfill_context ctx = {
 >> +		.repo = repo,
 >> +		.current_batch = OID_ARRAY_INIT,
 >> +		.batch_size = 50000,
 >> +	};
 >>   	struct option options[] = {
 >>   		OPT_END(),
 >>   	};
 >> @@ -23,7 +123,5 @@ int cmd_backfill(int argc, const char **argv, const char 
*prefix, struct reposit
 >>
 >>   	repo_config(repo, git_default_config, NULL);
 >>
 >> -	die(_("not implemented"));
 >> -
 >> -	return 0;
 >> +	return do_backfill(&ctx);
 >>   }
 >
 > The current iteration only backfills blobs as far as I can see. Do we
 > maybe want to keep the door open for future changes in git-backfill(1)
 > by implementing this via a "blob" subcommand?

I think that one tricky part is to ask "what could be missing?". With
Git's partial clone, it seems that we could have treeless or depth-
based tree restrictions. Technically, there could also be clones that
restrict to a set of sparse-checkout patterns, but I'm not aware of any
server that will respect those kinds of clones.  At the moment, the tree
walk would fault in any missing trees as they are seen, but this is
extremely inefficient.

I think that the path-walk API could be adjusted to be more careful to
check for the existence of an object before automatically loading it.
That would allow for batched downloads of missing trees, though a
second scan would be required to get the next layer of objects.

I'm not sure a subcommand is the right way to solve for this potential
future, but instead later we could adjust the logic to be better for
these treeless or tree-restricted clones.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/5] backfill: add --batch-size=<n> option
  2024-12-16  8:01   ` Patrick Steinhardt
@ 2024-12-18 15:09     ` Derrick Stolee
  0 siblings, 0 replies; 39+ messages in thread
From: Derrick Stolee @ 2024-12-18 15:09 UTC (permalink / raw)
  To: Patrick Steinhardt, Derrick Stolee via GitGitGadget
  Cc: git, gitster, johannes.schindelin, peff, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Derrick Stolee

On 12/16/24 3:01 AM, Patrick Steinhardt wrote:
 > On Fri, Dec 06, 2024 at 08:07:16PM +0000, Derrick Stolee via GitGitGadget wrote:
 >> From: Derrick Stolee <derrickstolee@github.com>
 >>
 >> Users may want to specify a minimum batch size for their needs. This is only
 >> a minimum: the path-walk API provides a list of OIDs that correspond to the
 >> same path, and thus it is optimal to allow delta compression across those
 >> objects in a single server request.
 >
 > Okay, here you explicitly say that this is a minimum batch size, so this
 > is by design and with a proper reason. Good.
 >
 >> We could consider limiting the request to have a maximum batch size in the
 >> future. For now, we let the path-walk API batches determine the
 >> boundaries.
 >
 > Should we maybe rename `--batch-size` to `--min-batch-size` so that it
 > does not become awkward if we ever want to have a maximum batch size, as
 > well? Also helps to set expectations with the user.

Good idea. Will do.

 >> diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
 >> index 0e10f066fef..9b0bae04e9d 100644
 >> --- a/Documentation/git-backfill.txt
 >> +++ b/Documentation/git-backfill.txt
 >> @@ -38,6 +38,14 @@ delta compression in the packfile sent by the server.
 >>   By default, `git backfill` downloads all blobs reachable from the `HEAD`
 >>   commit. This set can be restricted or expanded using various options.
 >>
 >> +OPTIONS
 >> +-------
 >> +
 >> +--batch-size=<n>::
 >> +	Specify a minimum size for a batch of missing objects to request
 >> +	from the server. This size may be exceeded by the last set of
 >> +	blobs seen at a given path. Default batch size is 16,000.
 >
 > This is stale: s/16,000/50,000/

Thanks!

 >> diff --git a/builtin/backfill.c b/builtin/backfill.c
 >> index e5f2000d5e0..127333daef8 100644
 >> --- a/builtin/backfill.c
 >> +++ b/builtin/backfill.c
 >> @@ -112,6 +112,8 @@ int cmd_backfill(int argc, const char **argv, const char 
*prefix,
 >> struct reposit
 >>                  .batch_size = 50000,
 >>          };
 >>          struct option options[] = {
 >> +               OPT_INTEGER(0, "batch-size", &ctx.batch_size,
 >> +                           N_("Minimun number of objects to request at a 
time")),
 >
 > s/Minimun/Minimum

Thanks for the careful eye for detail.

-Stolee

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 0/5] PATH WALK III: Add 'git backfill' command
  2024-12-06 20:07 [PATCH 0/5] PATH WALK III: Add 'git backfill' command Derrick Stolee via GitGitGadget
                   ` (5 preceding siblings ...)
  2024-12-08 10:53 ` [PATCH 0/5] PATH WALK III: Add 'git backfill' command Junio C Hamano
@ 2024-12-20 16:29 ` Derrick Stolee via GitGitGadget
  2024-12-20 16:29   ` [PATCH v2 1/5] backfill: add builtin boilerplate Derrick Stolee via GitGitGadget
                     ` (6 more replies)
  6 siblings, 7 replies; 39+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-12-20 16:29 UTC (permalink / raw)
  To: git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Derrick Stolee

This is based on v4 of ds/path-walk-1 [1] and an earlier version was part of
my initial path-walk RFC [2].

[1]
https://lore.kernel.org/git/pull.1818.v3.git.1733514358.gitgitgadget@gmail.com/

[2]
https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@gmail.com/

This series adds a new 'git backfill' command that uses the path-walk API to
download missing blobs in a blobless partial clone. Users can specify
interaction with the sparse-checkout using '--[no-]sparse' but the
'--sparse' option is implied by the existence of a sparse-checkout.

The reason to use the path-walk API is to make sure that the missing objects
are grouped by a common path, giving a reasonable process for batching
requests and expecting the server to compress the resulting packfile nicely
together.

I first prototyped this feature in June 2024 as an exploration and created
the path-walk algorithm for this purpose. It was only my intuition that led
me to believe that batching by path would lead to better packfiles. This has
been proven out as a very important feature due to recent investigations to
compressing full repositories by doing a better job of grouping objects by
path. See the --name-hash-version series [3] or the 'git pack-objects
--path-walk' series [4] (currently on hold as it conflicts with the
--name-hash-version series).

[3]
https://lore.kernel.org/git/pull.1823.v2.git.1733181682.gitgitgadget@gmail.com/

[4]
https://lore.kernel.org/git/pull.1813.v2.git.1729431810.gitgitgadget@gmail.com/

This idea can be further demonstrated by the evidence in testing this
feature: by downloading objects in small batch sizes, the client can force
the server to repack things more efficiently than a full repack.

The example repository I have used in multiple places is the
microsoft/fluentui repo [5] as it has many CHANGELOG.md files that cause
name hash collisions that make the full repack inefficient.

[5] https://github.com/microsoft/fluentui

If we create a blobless clone of the fluentui repo, then this downloads 105
MB across two packfiles (the commits and trees pack, followed by the blobs
needed for an initial checkout). Running 'git backfill --batch-size=' for
different sizes leads to some interesting results:

| Batch Size      | Pack Count | Pack Size | Time   |
|-----------------|------------|-----------|--------|
| (Initial clone) | 2          | 105 MB    |        |
| 5K              | 53         | 348 MB    | 2m 26s |
| 10K             | 28         | 365 MB    | 2m 22s |
| 15K             | 19         | 407 MB    | 2m 21s |
| 20K             | 15         | 393 MB    | 2m 28s |
| 25K             | 13         | 417 MB    | 2m 06s |
| 50K             | 8          | 509 MB    | 1m 34s |
| 100K            | 5          | 535 MB    | 1m 56s |
| 250K            | 4          | 698 MB    | 1m 33s |
| 500K            | 3          | 696 MB    | 1m 42s |


The smaller batches cause the server to realize that the existing deltas
cannot be reused and it finds better deltas. This takes some extra time for
the small batches, but halves the size of the repo. Even in the 500K batch
size, we get less data than the 738 MB of a full clone.

Implementing the --sparse feature is best done by augmenting the path-walk
API to be aware of a pattern list. This works for both cone and non-cone
mode sparse-checkouts.

There are future directions we could take this command, especially to run
the command with a user-specified pathspec. The tricky case for that
additional feature is trying to make the path-walk more efficient by
skipping tree paths that would not lead to a match of the pathspec. It would
likely need optimization in a small subset of pathspec features (such as
prefix matches) to work as efficiently as possible. I did prototype a
version that puts the pathspec match in the callback function within
builtin/backfill.c, but I found that uninspiring and unnecessary for now.


Updates in v2
=============

Thanks for the review on v1.

 * Memory leaks should be plugged now due to the introduction of a
   destructor in v4 of the path-walk API and its extension here to clear the
   pattern list.

 * Documentation is expanded to demonstrate that 'git backfill' can also
   approximate incremental/resumable clones of repositories.

 * Method renames to better match style.

 * --batch-size is renamed to --min-batch-size for clarity and future
   extensibility.

Thanks, -Stolee

Derrick Stolee (5):
  backfill: add builtin boilerplate
  backfill: basic functionality and tests
  backfill: add --min-batch-size=<n> option
  backfill: add --sparse option
  backfill: assume --sparse when sparse-checkout is enabled

 .gitignore                                |   1 +
 Documentation/git-backfill.txt            |  68 +++++++++
 Documentation/technical/api-path-walk.txt |  11 +-
 Makefile                                  |   1 +
 builtin.h                                 |   1 +
 builtin/backfill.c                        | 151 ++++++++++++++++++
 command-list.txt                          |   1 +
 dir.c                                     |  10 +-
 dir.h                                     |   3 +
 git.c                                     |   1 +
 path-walk.c                               |  28 +++-
 path-walk.h                               |  11 ++
 t/helper/test-path-walk.c                 |  22 ++-
 t/t5620-backfill.sh                       | 178 ++++++++++++++++++++++
 t/t6601-path-walk.sh                      |  32 ++++
 15 files changed, 505 insertions(+), 14 deletions(-)
 create mode 100644 Documentation/git-backfill.txt
 create mode 100644 builtin/backfill.c
 create mode 100755 t/t5620-backfill.sh


base-commit: ef543429ed9cb92e209900e7c7fc4f8e0698a306
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1820%2Fderrickstolee%2Fbackfill-upstream-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1820/derrickstolee/backfill-upstream-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/1820

Range-diff vs v1:

 1:  0300aa1b8c3 = 1:  ac86510417a backfill: add builtin boilerplate
 2:  5728dd27021 ! 2:  e4e88794cae backfill: basic functionality and tests
     @@ Documentation/git-backfill.txt: SYNOPSIS
      +grouping blobs that appear at the same path, hopefully leading to good
      +delta compression in the packfile sent by the server.
      +
     ++In this way, `git backfill` provides a mechanism to break a large clone
     ++into smaller chunks. Starting with a blobless partial clone with `git
     ++clone --filter=blob:none` and then running `git backfill` in the local
     ++repository provides a way to download all reachable objects in several
     ++smaller network calls than downloading the entire repository at clone
     ++time.
     ++
      +By default, `git backfill` downloads all blobs reachable from the `HEAD`
      +commit. This set can be restricted or expanded using various options.
      +
     @@ builtin/backfill.c
      +	size_t batch_size;
      +};
      +
     -+static void clear_backfill_context(struct backfill_context *ctx)
     ++static void backfill_context_clear(struct backfill_context *ctx)
      +{
      +	oid_array_clear(&ctx->current_batch);
      +}
     @@ builtin/backfill.c
      +	if (!ret)
      +		download_batch(ctx);
      +
     -+	clear_backfill_context(ctx);
     ++	backfill_context_clear(ctx);
     ++	path_walk_info_clear(&info);
     ++	release_revisions(&revs);
      +	return ret;
      +}
      +
 3:  3cfd23073a0 ! 3:  3fa32822dab backfill: add --batch-size=<n> option
     @@ Metadata
      Author: Derrick Stolee <derrickstolee@github.com>
      
       ## Commit message ##
     -    backfill: add --batch-size=<n> option
     +    backfill: add --min-batch-size=<n> option
      
          Users may want to specify a minimum batch size for their needs. This is only
          a minimum: the path-walk API provides a list of OIDs that correspond to the
     @@ Documentation/git-backfill.txt: git-backfill - Download missing objects in a par
       
       DESCRIPTION
       -----------
     -@@ Documentation/git-backfill.txt: delta compression in the packfile sent by the server.
     +@@ Documentation/git-backfill.txt: time.
       By default, `git backfill` downloads all blobs reachable from the `HEAD`
       commit. This set can be restricted or expanded using various options.
       
      +OPTIONS
      +-------
      +
     -+--batch-size=<n>::
     ++--min-batch-size=<n>::
      +	Specify a minimum size for a batch of missing objects to request
      +	from the server. This size may be exceeded by the last set of
     -+	blobs seen at a given path. Default batch size is 16,000.
     ++	blobs seen at a given path. The default minimum batch size is
     ++	50,000.
      +
       SEE ALSO
       --------
     @@ builtin/backfill.c
       	NULL
       };
       
     + struct backfill_context {
     + 	struct repository *repo;
     + 	struct oid_array current_batch;
     +-	size_t batch_size;
     ++	size_t min_batch_size;
     + };
     + 
     + static void backfill_context_clear(struct backfill_context *ctx)
     +@@ builtin/backfill.c: static int fill_missing_blobs(const char *path UNUSED,
     + 			oid_array_append(&ctx->current_batch, &list->oid[i]);
     + 	}
     + 
     +-	if (ctx->current_batch.nr >= ctx->batch_size)
     ++	if (ctx->current_batch.nr >= ctx->min_batch_size)
     + 		download_batch(ctx);
     + 
     + 	return 0;
      @@ builtin/backfill.c: int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
     - 		.batch_size = 50000,
     + 	struct backfill_context ctx = {
     + 		.repo = repo,
     + 		.current_batch = OID_ARRAY_INIT,
     +-		.batch_size = 50000,
     ++		.min_batch_size = 50000,
       	};
       	struct option options[] = {
     -+		OPT_INTEGER(0, "batch-size", &ctx.batch_size,
     -+			    N_("Minimun number of objects to request at a time")),
     ++		OPT_INTEGER(0, "min-batch-size", &ctx.min_batch_size,
     ++			    N_("Minimum number of objects to request at a time")),
       		OPT_END(),
       	};
       
     @@ t/t5620-backfill.sh: test_expect_success 'do partial clone 1, backfill gets all
       	test_line_count = 0 revs2
       '
       
     -+test_expect_success 'do partial clone 2, backfill batch size' '
     ++test_expect_success 'do partial clone 2, backfill min batch size' '
      +	git clone --no-checkout --filter=blob:none	\
      +		--single-branch --branch=main 		\
      +		"file://$(pwd)/srv.bare" backfill2 &&
      +
      +	GIT_TRACE2_EVENT="$(pwd)/batch-trace" git \
     -+		-C backfill2 backfill --batch-size=20 &&
     ++		-C backfill2 backfill --min-batch-size=20 &&
      +
      +	# Batches were used
      +	test_trace2_data promisor fetch_count 20 <batch-trace >matches &&
 4:  19a8efebbad ! 4:  2723143afb3 backfill: add --sparse option
     @@ Commit message
          only does this reduce the cost of clones and fetches, the sparse-checkout
          reduces the number of objects needed to download from a promisor remote.
      
     -    However, history investigations can be expensie as computing blob diffs will
     -    trigger promisor remote requests for one object at a time. This can be
     +    However, history investigations can be expensive as computing blob diffs
     +    will trigger promisor remote requests for one object at a time. This can be
          avoided by downloading the blobs needed for the given sparse-checkout using
          'git backfill' and its new '--sparse' mode, at a time that the user is
          willing to pay that extra cost.
     @@ Documentation/git-backfill.txt: git-backfill - Download missing objects in a par
       DESCRIPTION
       -----------
      @@ Documentation/git-backfill.txt: OPTIONS
     - 	from the server. This size may be exceeded by the last set of
     - 	blobs seen at a given path. Default batch size is 16,000.
     + 	blobs seen at a given path. The default minimum batch size is
     + 	50,000.
       
      +--[no-]sparse::
      +	Only download objects if they appear at a path that matches the
     @@ builtin/backfill.c
      @@ builtin/backfill.c: struct backfill_context {
       	struct repository *repo;
       	struct oid_array current_batch;
     - 	size_t batch_size;
     + 	size_t min_batch_size;
      +	int sparse;
       };
       
     - static void clear_backfill_context(struct backfill_context *ctx)
     + static void backfill_context_clear(struct backfill_context *ctx)
      @@ builtin/backfill.c: static int do_backfill(struct backfill_context *ctx)
       	struct path_walk_info info = PATH_WALK_INFO_INIT;
       	int ret;
       
      +	if (ctx->sparse) {
      +		CALLOC_ARRAY(info.pl, 1);
     -+		if (get_sparse_checkout_patterns(info.pl))
     ++		if (get_sparse_checkout_patterns(info.pl)) {
     ++			path_walk_info_clear(&info);
      +			return error(_("problem loading sparse-checkout"));
     ++		}
      +	}
      +
       	repo_init_revisions(ctx->repo, &revs, "");
     @@ builtin/backfill.c: static int do_backfill(struct backfill_context *ctx)
      @@ builtin/backfill.c: int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
       		.repo = repo,
       		.current_batch = OID_ARRAY_INIT,
     - 		.batch_size = 50000,
     + 		.min_batch_size = 50000,
      +		.sparse = 0,
       	};
       	struct option options[] = {
     - 		OPT_INTEGER(0, "batch-size", &ctx.batch_size,
     - 			    N_("Minimun number of objects to request at a time")),
     + 		OPT_INTEGER(0, "min-batch-size", &ctx.min_batch_size,
     + 			    N_("Minimum number of objects to request at a time")),
      +		OPT_BOOL(0, "sparse", &ctx.sparse,
      +			 N_("Restrict the missing objects to the current sparse-checkout")),
       		OPT_END(),
     @@ path-walk.c
       #include "revision.h"
       #include "string-list.h"
       #include "strmap.h"
     -@@ path-walk.c: static int add_children(struct path_walk_context *ctx,
     +@@ path-walk.c: static int add_tree_entries(struct path_walk_context *ctx,
       		if (type == OBJ_TREE)
       			strbuf_addch(&path, '/');
       
     @@ path-walk.c: static int add_children(struct path_walk_context *ctx,
       		if (!(list = strmap_get(&ctx->paths_to_lists, path.buf))) {
       			CALLOC_ARRAY(list, 1);
       			list->type = type;
     +@@ path-walk.c: void path_walk_info_init(struct path_walk_info *info)
     + 	memcpy(info, &empty, sizeof(empty));
     + }
     + 
     +-void path_walk_info_clear(struct path_walk_info *info UNUSED)
     ++void path_walk_info_clear(struct path_walk_info *info)
     + {
     +-	/*
     +-	 * This destructor is empty for now, as info->revs
     +-	 * is not owned by 'struct path_walk_info'.
     +-	 */
     ++	if (info->pl) {
     ++		clear_pattern_list(info->pl);
     ++		free(info->pl);
     ++	}
     + }
      
       ## path-walk.h ##
      @@
     @@ t/helper/test-path-walk.c: int cmd__path_walk(int argc, const char **argv)
       }
      
       ## t/t5620-backfill.sh ##
     -@@ t/t5620-backfill.sh: test_expect_success 'do partial clone 2, backfill batch size' '
     +@@ t/t5620-backfill.sh: test_expect_success 'do partial clone 2, backfill min batch size' '
       	test_line_count = 0 revs2
       '
       
 5:  35a7e4ec4d3 ! 5:  1f765409eaf backfill: assume --sparse when sparse-checkout is enabled
     @@ builtin/backfill.c: int cmd_backfill(int argc, const char **argv, const char *pr
       }
      
       ## t/t5620-backfill.sh ##
     -@@ t/t5620-backfill.sh: test_expect_success 'do partial clone 2, backfill batch size' '
     +@@ t/t5620-backfill.sh: test_expect_success 'do partial clone 2, backfill min batch size' '
       	test_line_count = 0 revs2
       '
       

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 1/5] backfill: add builtin boilerplate
  2024-12-20 16:29 ` [PATCH v2 " Derrick Stolee via GitGitGadget
@ 2024-12-20 16:29   ` Derrick Stolee via GitGitGadget
  2024-12-20 16:29   ` [PATCH v2 2/5] backfill: basic functionality and tests Derrick Stolee via GitGitGadget
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 39+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-12-20 16:29 UTC (permalink / raw)
  To: git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

In anticipation of implementing 'git backfill', populate the necessary files
with the boilerplate of a new builtin.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 .gitignore                     |  1 +
 Documentation/git-backfill.txt | 23 +++++++++++++++++++++++
 Makefile                       |  1 +
 builtin.h                      |  1 +
 builtin/backfill.c             | 29 +++++++++++++++++++++++++++++
 command-list.txt               |  1 +
 git.c                          |  1 +
 7 files changed, 57 insertions(+)
 create mode 100644 Documentation/git-backfill.txt
 create mode 100644 builtin/backfill.c

diff --git a/.gitignore b/.gitignore
index 6687bd6db4c..0f9e7de2ec3 100644
--- a/.gitignore
+++ b/.gitignore
@@ -20,6 +20,7 @@
 /git-apply
 /git-archimport
 /git-archive
+/git-backfill
 /git-bisect
 /git-blame
 /git-branch
diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
new file mode 100644
index 00000000000..640144187d3
--- /dev/null
+++ b/Documentation/git-backfill.txt
@@ -0,0 +1,23 @@
+git-backfill(1)
+===============
+
+NAME
+----
+git-backfill - Download missing objects in a partial clone
+
+
+SYNOPSIS
+--------
+[verse]
+'git backfill' [<options>]
+
+DESCRIPTION
+-----------
+
+SEE ALSO
+--------
+linkgit:git-clone[1].
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/Makefile b/Makefile
index 50413d96492..e18e0f4e447 100644
--- a/Makefile
+++ b/Makefile
@@ -1203,6 +1203,7 @@ BUILTIN_OBJS += builtin/am.o
 BUILTIN_OBJS += builtin/annotate.o
 BUILTIN_OBJS += builtin/apply.o
 BUILTIN_OBJS += builtin/archive.o
+BUILTIN_OBJS += builtin/backfill.o
 BUILTIN_OBJS += builtin/bisect.o
 BUILTIN_OBJS += builtin/blame.o
 BUILTIN_OBJS += builtin/branch.o
diff --git a/builtin.h b/builtin.h
index f7b166b3348..89928ccf92f 100644
--- a/builtin.h
+++ b/builtin.h
@@ -120,6 +120,7 @@ int cmd_am(int argc, const char **argv, const char *prefix, struct repository *r
 int cmd_annotate(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_apply(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_archive(int argc, const char **argv, const char *prefix, struct repository *repo);
+int cmd_backfill(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_bisect(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_blame(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_branch(int argc, const char **argv, const char *prefix, struct repository *repo);
diff --git a/builtin/backfill.c b/builtin/backfill.c
new file mode 100644
index 00000000000..38e6aaeaa03
--- /dev/null
+++ b/builtin/backfill.c
@@ -0,0 +1,29 @@
+#include "builtin.h"
+#include "config.h"
+#include "parse-options.h"
+#include "repository.h"
+#include "object.h"
+
+static const char * const builtin_backfill_usage[] = {
+	N_("git backfill [<options>]"),
+	NULL
+};
+
+int cmd_backfill(int argc, const char **argv, const char *prefix, struct repository *repo)
+{
+	struct option options[] = {
+		OPT_END(),
+	};
+
+	if (argc == 2 && !strcmp(argv[1], "-h"))
+		usage_with_options(builtin_backfill_usage, options);
+
+	argc = parse_options(argc, argv, prefix, options, builtin_backfill_usage,
+			     0);
+
+	repo_config(repo, git_default_config, NULL);
+
+	die(_("not implemented"));
+
+	return 0;
+}
diff --git a/command-list.txt b/command-list.txt
index e0bb87b3b5c..c537114b468 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -60,6 +60,7 @@ git-annotate                            ancillaryinterrogators
 git-apply                               plumbingmanipulators            complete
 git-archimport                          foreignscminterface
 git-archive                             mainporcelain
+git-backfill                            mainporcelain           history
 git-bisect                              mainporcelain           info
 git-blame                               ancillaryinterrogators          complete
 git-branch                              mainporcelain           history
diff --git a/git.c b/git.c
index 2fbea24ec92..00d9b3ec8a9 100644
--- a/git.c
+++ b/git.c
@@ -509,6 +509,7 @@ static struct cmd_struct commands[] = {
 	{ "annotate", cmd_annotate, RUN_SETUP },
 	{ "apply", cmd_apply, RUN_SETUP_GENTLY },
 	{ "archive", cmd_archive, RUN_SETUP_GENTLY },
+	{ "backfill", cmd_backfill, RUN_SETUP },
 	{ "bisect", cmd_bisect, RUN_SETUP },
 	{ "blame", cmd_blame, RUN_SETUP },
 	{ "branch", cmd_branch, RUN_SETUP | DELAY_PAGER_CONFIG },
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v2 2/5] backfill: basic functionality and tests
  2024-12-20 16:29 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  2024-12-20 16:29   ` [PATCH v2 1/5] backfill: add builtin boilerplate Derrick Stolee via GitGitGadget
@ 2024-12-20 16:29   ` Derrick Stolee via GitGitGadget
  2025-01-16 10:01     ` Patrick Steinhardt
  2024-12-20 16:29   ` [PATCH v2 3/5] backfill: add --min-batch-size=<n> option Derrick Stolee via GitGitGadget
                     ` (4 subsequent siblings)
  6 siblings, 1 reply; 39+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-12-20 16:29 UTC (permalink / raw)
  To: git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The default behavior of 'git backfill' is to fetch all missing blobs that
are reachable from HEAD. Document and test this behavior.

The implementation is a very simple use of the path-walk API, initializing
the revision walk at HEAD to start the path-walk from all commits reachable
from HEAD. Ignore the object arrays that correspond to tree entries,
assuming that they are all present already.

The path-walk API provides lists of objects in batches according to a
common path, but that list could be very small. We want to balance the
number of requests to the server with the ability to have the process
interrupted with minimal repeated work to catch up in the next run.
Based on some experiments (detailed in the next change) a minimum batch
size of 50,000 is selected for the default.

This batch size is a _minimum_. As the path-walk API emits lists of blob
IDs, they are collected into a list of objects for a request to the
server. When that list is at least the minimum batch size, then the
request is sent to the server for the new objects. However, the list of
blob IDs from the path-walk API could be much longer than the batch
size. At this moment, it is unclear if there is a benefit to split the
list when there are too many objects at the same path.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/git-backfill.txt            |  31 +++++++
 Documentation/technical/api-path-walk.txt |   3 +-
 builtin/backfill.c                        | 106 +++++++++++++++++++++-
 t/t5620-backfill.sh                       |  94 +++++++++++++++++++
 4 files changed, 230 insertions(+), 4 deletions(-)
 create mode 100755 t/t5620-backfill.sh

diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
index 640144187d3..ece887831f6 100644
--- a/Documentation/git-backfill.txt
+++ b/Documentation/git-backfill.txt
@@ -14,6 +14,37 @@ SYNOPSIS
 DESCRIPTION
 -----------
 
+Blobless partial clones are created using `git clone --filter=blob:none`
+and then configure the local repository such that the Git client avoids
+downloading blob objects unless they are required for a local operation.
+This initially means that the clone and later fetches download reachable
+commits and trees but no blobs. Later operations that change the `HEAD`
+pointer, such as `git checkout` or `git merge`, may need to download
+missing blobs in order to complete their operation.
+
+In the worst cases, commands that compute blob diffs, such as `git blame`,
+become very slow as they download the missing blobs in single-blob
+requests to satisfy the missing object as the Git command needs it. This
+leads to multiple download requests and no ability for the Git server to
+provide delta compression across those objects.
+
+The `git backfill` command provides a way for the user to request that
+Git downloads the missing blobs (with optional filters) such that the
+missing blobs representing historical versions of files can be downloaded
+in batches. The `backfill` command attempts to optimize the request by
+grouping blobs that appear at the same path, hopefully leading to good
+delta compression in the packfile sent by the server.
+
+In this way, `git backfill` provides a mechanism to break a large clone
+into smaller chunks. Starting with a blobless partial clone with `git
+clone --filter=blob:none` and then running `git backfill` in the local
+repository provides a way to download all reachable objects in several
+smaller network calls than downloading the entire repository at clone
+time.
+
+By default, `git backfill` downloads all blobs reachable from the `HEAD`
+commit. This set can be restricted or expanded using various options.
+
 SEE ALSO
 --------
 linkgit:git-clone[1].
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index 7075d0d5ab5..1fba0ce04cb 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -60,4 +60,5 @@ Examples
 --------
 
 See example usages in:
-	`t/helper/test-path-walk.c`
+	`t/helper/test-path-walk.c`,
+	`builtin/backfill.c`
diff --git a/builtin/backfill.c b/builtin/backfill.c
index 38e6aaeaa03..177fd4286c7 100644
--- a/builtin/backfill.c
+++ b/builtin/backfill.c
@@ -1,16 +1,118 @@
 #include "builtin.h"
+#include "git-compat-util.h"
 #include "config.h"
 #include "parse-options.h"
 #include "repository.h"
+#include "commit.h"
+#include "hex.h"
+#include "tree.h"
+#include "tree-walk.h"
 #include "object.h"
+#include "object-store-ll.h"
+#include "oid-array.h"
+#include "oidset.h"
+#include "promisor-remote.h"
+#include "strmap.h"
+#include "string-list.h"
+#include "revision.h"
+#include "trace2.h"
+#include "progress.h"
+#include "packfile.h"
+#include "path-walk.h"
 
 static const char * const builtin_backfill_usage[] = {
 	N_("git backfill [<options>]"),
 	NULL
 };
 
+struct backfill_context {
+	struct repository *repo;
+	struct oid_array current_batch;
+	size_t batch_size;
+};
+
+static void backfill_context_clear(struct backfill_context *ctx)
+{
+	oid_array_clear(&ctx->current_batch);
+}
+
+static void download_batch(struct backfill_context *ctx)
+{
+	promisor_remote_get_direct(ctx->repo,
+				   ctx->current_batch.oid,
+				   ctx->current_batch.nr);
+	oid_array_clear(&ctx->current_batch);
+
+	/*
+	 * We likely have a new packfile. Add it to the packed list to
+	 * avoid possible duplicate downloads of the same objects.
+	 */
+	reprepare_packed_git(ctx->repo);
+}
+
+static int fill_missing_blobs(const char *path UNUSED,
+			      struct oid_array *list,
+			      enum object_type type,
+			      void *data)
+{
+	struct backfill_context *ctx = data;
+
+	if (type != OBJ_BLOB)
+		return 0;
+
+	for (size_t i = 0; i < list->nr; i++) {
+		off_t size = 0;
+		struct object_info info = OBJECT_INFO_INIT;
+		info.disk_sizep = &size;
+		if (oid_object_info_extended(ctx->repo,
+					     &list->oid[i],
+					     &info,
+					     OBJECT_INFO_FOR_PREFETCH) ||
+		    !size)
+			oid_array_append(&ctx->current_batch, &list->oid[i]);
+	}
+
+	if (ctx->current_batch.nr >= ctx->batch_size)
+		download_batch(ctx);
+
+	return 0;
+}
+
+static int do_backfill(struct backfill_context *ctx)
+{
+	struct rev_info revs;
+	struct path_walk_info info = PATH_WALK_INFO_INIT;
+	int ret;
+
+	repo_init_revisions(ctx->repo, &revs, "");
+	handle_revision_arg("HEAD", &revs, 0, 0);
+
+	info.blobs = 1;
+	info.tags = info.commits = info.trees = 0;
+
+	info.revs = &revs;
+	info.path_fn = fill_missing_blobs;
+	info.path_fn_data = ctx;
+
+	ret = walk_objects_by_path(&info);
+
+	/* Download the objects that did not fill a batch. */
+	if (!ret)
+		download_batch(ctx);
+
+	backfill_context_clear(ctx);
+	path_walk_info_clear(&info);
+	release_revisions(&revs);
+	return ret;
+}
+
 int cmd_backfill(int argc, const char **argv, const char *prefix, struct repository *repo)
 {
+	struct backfill_context ctx = {
+		.repo = repo,
+		.current_batch = OID_ARRAY_INIT,
+		.batch_size = 50000,
+	};
 	struct option options[] = {
 		OPT_END(),
 	};
@@ -23,7 +125,5 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
 
 	repo_config(repo, git_default_config, NULL);
 
-	die(_("not implemented"));
-
-	return 0;
+	return do_backfill(&ctx);
 }
diff --git a/t/t5620-backfill.sh b/t/t5620-backfill.sh
new file mode 100755
index 00000000000..64326362d80
--- /dev/null
+++ b/t/t5620-backfill.sh
@@ -0,0 +1,94 @@
+#!/bin/sh
+
+test_description='git backfill on partial clones'
+
+GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
+export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
+
+. ./test-lib.sh
+
+# We create objects in the 'src' repo.
+test_expect_success 'setup repo for object creation' '
+	echo "{print \$1}" >print_1.awk &&
+	echo "{print \$2}" >print_2.awk &&
+
+	git init src &&
+
+	mkdir -p src/a/b/c &&
+	mkdir -p src/d/e &&
+
+	for i in 1 2
+	do
+		for n in 1 2 3 4
+		do
+			echo "Version $i of file $n" > src/file.$n.txt &&
+			echo "Version $i of file a/$n" > src/a/file.$n.txt &&
+			echo "Version $i of file a/b/$n" > src/a/b/file.$n.txt &&
+			echo "Version $i of file a/b/c/$n" > src/a/b/c/file.$n.txt &&
+			echo "Version $i of file d/$n" > src/d/file.$n.txt &&
+			echo "Version $i of file d/e/$n" > src/d/e/file.$n.txt &&
+			git -C src add . &&
+			git -C src commit -m "Iteration $n" || return 1
+		done
+	done
+'
+
+# Clone 'src' into 'srv.bare' so we have a bare repo to be our origin
+# server for the partial clone.
+test_expect_success 'setup bare clone for server' '
+	git clone --bare "file://$(pwd)/src" srv.bare &&
+	git -C srv.bare config --local uploadpack.allowfilter 1 &&
+	git -C srv.bare config --local uploadpack.allowanysha1inwant 1
+'
+
+# do basic partial clone from "srv.bare"
+test_expect_success 'do partial clone 1, backfill gets all objects' '
+	git clone --no-checkout --filter=blob:none	\
+		--single-branch --branch=main 		\
+		"file://$(pwd)/srv.bare" backfill1 &&
+
+	# Backfill with no options gets everything reachable from HEAD.
+	GIT_TRACE2_EVENT="$(pwd)/backfill-file-trace" git \
+		-C backfill1 backfill &&
+
+	# We should have engaged the partial clone machinery
+	test_trace2_data promisor fetch_count 48 <backfill-file-trace &&
+
+	# No more missing objects!
+	git -C backfill1 rev-list --quiet --objects --missing=print HEAD >revs2 &&
+	test_line_count = 0 revs2
+'
+
+. "$TEST_DIRECTORY"/lib-httpd.sh
+start_httpd
+
+test_expect_success 'create a partial clone over HTTP' '
+	SERVER="$HTTPD_DOCUMENT_ROOT_PATH/server" &&
+	rm -rf "$SERVER" repo &&
+	git clone --bare "file://$(pwd)/src" "$SERVER" &&
+	test_config -C "$SERVER" uploadpack.allowfilter 1 &&
+	test_config -C "$SERVER" uploadpack.allowanysha1inwant 1 &&
+
+	git clone --no-checkout --filter=blob:none \
+		"$HTTPD_URL/smart/server" backfill-http
+'
+
+test_expect_success 'backfilling over HTTP succeeds' '
+	GIT_TRACE2_EVENT="$(pwd)/backfill-http-trace" git \
+		-C backfill-http backfill &&
+
+	# We should have engaged the partial clone machinery
+	test_trace2_data promisor fetch_count 48 <backfill-http-trace &&
+
+	# Confirm all objects are present, none missing.
+	git -C backfill-http rev-list --objects --all >rev-list-out &&
+	awk "{print \$1;}" <rev-list-out >oids &&
+	GIT_TRACE2_EVENT="$(pwd)/walk-trace" git -C backfill-http \
+		cat-file --batch-check <oids >batch-out &&
+	! grep missing batch-out
+'
+
+# DO NOT add non-httpd-specific tests here, because the last part of this
+# test script is only executed when httpd is available and enabled.
+
+test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v2 3/5] backfill: add --min-batch-size=<n> option
  2024-12-20 16:29 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  2024-12-20 16:29   ` [PATCH v2 1/5] backfill: add builtin boilerplate Derrick Stolee via GitGitGadget
  2024-12-20 16:29   ` [PATCH v2 2/5] backfill: basic functionality and tests Derrick Stolee via GitGitGadget
@ 2024-12-20 16:29   ` Derrick Stolee via GitGitGadget
  2025-01-16 10:01     ` Patrick Steinhardt
  2024-12-20 16:29   ` [PATCH v2 4/5] backfill: add --sparse option Derrick Stolee via GitGitGadget
                     ` (3 subsequent siblings)
  6 siblings, 1 reply; 39+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-12-20 16:29 UTC (permalink / raw)
  To: git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

Users may want to specify a minimum batch size for their needs. This is only
a minimum: the path-walk API provides a list of OIDs that correspond to the
same path, and thus it is optimal to allow delta compression across those
objects in a single server request.

We could consider limiting the request to have a maximum batch size in the
future. For now, we let the path-walk API batches determine the
boundaries.

To get a feeling for the value of specifying the --batch-size parameter,
I tested a number of open source repositories available on GitHub. The
procedure was generally:

 1. git clone --filter=blob:none <url>
 2. git backfill

Checking the number of packfiles and the size of the .git/objects/pack
directory helps to identify the effects of different batch sizes.

For the Git repository, we get these results:

| Batch Size      | Pack Count | Pack Size | Time  |
|-----------------|------------|-----------|-------|
| (Initial clone) | 2          | 119 MB    |       |
| 25K             | 8          | 290 MB    | 24s   |
| 50K             | 5          | 290 MB    | 24s   |
| 100K            | 4          | 290 MB    | 29s   |

Other than the packfile counts decreasing as we need fewer batches, the
size and time required is not changing much for this small example.

For the nodejs/node repository, we see these results:

| Batch Size      | Pack Count | Pack Size | Time   |
|-----------------|------------|-----------|--------|
| (Initial clone) | 2          | 330 MB    |        |
| 25K             | 19         | 1,222 MB  | 1m 22s |
| 50K             | 11         | 1,221 MB  | 1m 24s |
| 100K            | 7          | 1,223 MB  | 1m 40s |
| 250K            | 4          | 1,224 MB  | 2m 23s |
| 500K            | 3          | 1,216 MB  | 4m 38s |

Here, we don't have much difference in the size of the repo, though the
500K batch size results in a few MB gained. That comes at a cost of a
much longer time. This extra time is due to server-side delta
compression happening as the on-disk deltas don't appear to be reusable
all the time. But for smaller batch sizes, the server is able to find
reasonable deltas partly because we are asking for objects that appear
in the same region of the directory tree and include all versions of a
file at a specific path.

To contrast this example, I tested the microsoft/fluentui repo, which
has been known to have inefficient packing due to name hash collisions.
These results are found before GitHub had the opportunity to repack the
server with more advanced name hash versions:

| Batch Size      | Pack Count | Pack Size | Time   |
|-----------------|------------|-----------|--------|
| (Initial clone) | 2          | 105 MB    |        |
| 5K              | 53         | 348 MB    | 2m 26s |
| 10K             | 28         | 365 MB    | 2m 22s |
| 15K             | 19         | 407 MB    | 2m 21s |
| 20K             | 15         | 393 MB    | 2m 28s |
| 25K             | 13         | 417 MB    | 2m 06s |
| 50K             | 8          | 509 MB    | 1m 34s |
| 100K            | 5          | 535 MB    | 1m 56s |
| 250K            | 4          | 698 MB    | 1m 33s |
| 500K            | 3          | 696 MB    | 1m 42s |

Here, a larger variety of batch sizes were chosen because of the great
variation in results. By asking the server to download small batches
corresponding to fewer paths at a time, the server is able to provide
better compression for these batches than it would for a regular clone.
A typical full clone for this repository would require 738 MB.

This example justifies the choice to batch requests by path name,
leading to improved communication with a server that is not optimally
packed.

Finally, the same experiment for the Linux repository had these results:

| Batch Size      | Pack Count | Pack Size | Time    |
|-----------------|------------|-----------|---------|
| (Initial clone) | 2          | 2,153 MB  |         |
| 25K             | 63         | 6,380 MB  | 14m 08s |
| 50K             | 58         | 6,126 MB  | 15m 11s |
| 100K            | 30         | 6,135 MB  | 18m 11s |
| 250K            | 14         | 6,146 MB  | 18m 22s |
| 500K            | 8          | 6,143 MB  | 33m 29s |

Even in this example, where the default name hash algorithm leads to
decent compression of the Linux kernel repository, there is value for
selecting a smaller batch size, to a limit. The 25K batch size has the
fastest time, but uses 250 MB more than the 50K batch size. The 500K
batch size took much more time due to server compression time and thus
we should avoid large batch sizes like this.

Based on these experiments, a batch size of 50,000 was chosen as the
default value.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/git-backfill.txt | 11 ++++++++++-
 builtin/backfill.c             | 10 ++++++----
 t/t5620-backfill.sh            | 18 ++++++++++++++++++
 3 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
index ece887831f6..e392517869c 100644
--- a/Documentation/git-backfill.txt
+++ b/Documentation/git-backfill.txt
@@ -9,7 +9,7 @@ git-backfill - Download missing objects in a partial clone
 SYNOPSIS
 --------
 [verse]
-'git backfill' [<options>]
+'git backfill' [--batch-size=<n>]
 
 DESCRIPTION
 -----------
@@ -45,6 +45,15 @@ time.
 By default, `git backfill` downloads all blobs reachable from the `HEAD`
 commit. This set can be restricted or expanded using various options.
 
+OPTIONS
+-------
+
+--min-batch-size=<n>::
+	Specify a minimum size for a batch of missing objects to request
+	from the server. This size may be exceeded by the last set of
+	blobs seen at a given path. The default minimum batch size is
+	50,000.
+
 SEE ALSO
 --------
 linkgit:git-clone[1].
diff --git a/builtin/backfill.c b/builtin/backfill.c
index 177fd4286c7..ddccececc36 100644
--- a/builtin/backfill.c
+++ b/builtin/backfill.c
@@ -21,14 +21,14 @@
 #include "path-walk.h"
 
 static const char * const builtin_backfill_usage[] = {
-	N_("git backfill [<options>]"),
+	N_("git backfill [--batch-size=<n>]"),
 	NULL
 };
 
 struct backfill_context {
 	struct repository *repo;
 	struct oid_array current_batch;
-	size_t batch_size;
+	size_t min_batch_size;
 };
 
 static void backfill_context_clear(struct backfill_context *ctx)
@@ -72,7 +72,7 @@ static int fill_missing_blobs(const char *path UNUSED,
 			oid_array_append(&ctx->current_batch, &list->oid[i]);
 	}
 
-	if (ctx->current_batch.nr >= ctx->batch_size)
+	if (ctx->current_batch.nr >= ctx->min_batch_size)
 		download_batch(ctx);
 
 	return 0;
@@ -111,9 +111,11 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
 	struct backfill_context ctx = {
 		.repo = repo,
 		.current_batch = OID_ARRAY_INIT,
-		.batch_size = 50000,
+		.min_batch_size = 50000,
 	};
 	struct option options[] = {
+		OPT_INTEGER(0, "min-batch-size", &ctx.min_batch_size,
+			    N_("Minimum number of objects to request at a time")),
 		OPT_END(),
 	};
 
diff --git a/t/t5620-backfill.sh b/t/t5620-backfill.sh
index 64326362d80..36107a51c54 100755
--- a/t/t5620-backfill.sh
+++ b/t/t5620-backfill.sh
@@ -59,6 +59,24 @@ test_expect_success 'do partial clone 1, backfill gets all objects' '
 	test_line_count = 0 revs2
 '
 
+test_expect_success 'do partial clone 2, backfill min batch size' '
+	git clone --no-checkout --filter=blob:none	\
+		--single-branch --branch=main 		\
+		"file://$(pwd)/srv.bare" backfill2 &&
+
+	GIT_TRACE2_EVENT="$(pwd)/batch-trace" git \
+		-C backfill2 backfill --min-batch-size=20 &&
+
+	# Batches were used
+	test_trace2_data promisor fetch_count 20 <batch-trace >matches &&
+	test_line_count = 2 matches &&
+	test_trace2_data promisor fetch_count 8 <batch-trace &&
+
+	# No more missing objects!
+	git -C backfill2 rev-list --quiet --objects --missing=print HEAD >revs2 &&
+	test_line_count = 0 revs2
+'
+
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v2 4/5] backfill: add --sparse option
  2024-12-20 16:29 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (2 preceding siblings ...)
  2024-12-20 16:29   ` [PATCH v2 3/5] backfill: add --min-batch-size=<n> option Derrick Stolee via GitGitGadget
@ 2024-12-20 16:29   ` Derrick Stolee via GitGitGadget
  2025-01-16 10:01     ` Patrick Steinhardt
  2024-12-20 16:29   ` [PATCH v2 5/5] backfill: assume --sparse when sparse-checkout is enabled Derrick Stolee via GitGitGadget
                     ` (2 subsequent siblings)
  6 siblings, 1 reply; 39+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-12-20 16:29 UTC (permalink / raw)
  To: git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

One way to significantly reduce the cost of a Git clone and later fetches is
to use a blobless partial clone and combine that with a sparse-checkout that
reduces the paths that need to be populated in the working directory. Not
only does this reduce the cost of clones and fetches, the sparse-checkout
reduces the number of objects needed to download from a promisor remote.

However, history investigations can be expensive as computing blob diffs
will trigger promisor remote requests for one object at a time. This can be
avoided by downloading the blobs needed for the given sparse-checkout using
'git backfill' and its new '--sparse' mode, at a time that the user is
willing to pay that extra cost.

Note that this is distinctly different from the '--filter=sparse:<oid>'
option, as this assumes that the partial clone has all reachable trees and
we are using client-side logic to avoid downloading blobs outside of the
sparse-checkout cone. This avoids the server-side cost of walking trees
while also achieving a similar goal. It also downloads in batches based on
similar path names, presenting a resumable download if things are
interrupted.

This augments the path-walk API to have a possibly-NULL 'pl' member that may
point to a 'struct pattern_list'. This could be more general than the
sparse-checkout definition at HEAD, but 'git backfill --sparse' is currently
the only consumer.

Be sure to test this in both cone mode and not cone mode. Cone mode has the
benefit that the path-walk can skip certain paths once they would expand
beyond the sparse-checkout.

To test this, we can create a blobless sparse clone, expand the
sparse-checkout slightly, and then run 'git backfill --sparse' to see
how much data is downloaded. The general steps are

 1. git clone --filter=blob:none --sparse <url>
 2. git sparse-checkout set <dir1> ... <dirN>
 3. git backfill --sparse

For the Git repository with the 'builtin' directory in the
sparse-checkout, we get these results for various batch sizes:

| Batch Size      | Pack Count | Pack Size | Time  |
|-----------------|------------|-----------|-------|
| (Initial clone) | 3          | 110 MB    |       |
| 10K             | 12         | 192 MB    | 17.2s |
| 15K             | 9          | 192 MB    | 15.5s |
| 20K             | 8          | 192 MB    | 15.5s |
| 25K             | 7          | 192 MB    | 14.7s |

This case matters less because a full clone of the Git repository from
GitHub is currently at 277 MB.

Using a copy of the Linux repository with the 'kernel/' directory in the
sparse-checkout, we get these results:

| Batch Size      | Pack Count | Pack Size | Time |
|-----------------|------------|-----------|------|
| (Initial clone) | 2          | 1,876 MB  |      |
| 10K             | 11         | 2,187 MB  | 46s  |
| 25K             | 7          | 2,188 MB  | 43s  |
| 50K             | 5          | 2,194 MB  | 44s  |
| 100K            | 4          | 2,194 MB  | 48s  |

This case is more meaningful because a full clone of the Linux
repository is currently over 6 GB, so this is a valuable way to download
a fraction of the repository and no longer need network access for all
reachable objects within the sparse-checkout.

Choosing a batch size will depend on a lot of factors, including the
user's network speed or reliability, the repository's file structure,
and how many versions there are of the file within the sparse-checkout
scope. There will not be a one-size-fits-all solution.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/git-backfill.txt            |  6 ++-
 Documentation/technical/api-path-walk.txt |  8 ++++
 builtin/backfill.c                        | 15 ++++++-
 dir.c                                     | 10 ++---
 dir.h                                     |  3 ++
 path-walk.c                               | 28 +++++++++---
 path-walk.h                               | 11 +++++
 t/helper/test-path-walk.c                 | 22 ++++++++-
 t/t5620-backfill.sh                       | 55 +++++++++++++++++++++++
 t/t6601-path-walk.sh                      | 32 +++++++++++++
 10 files changed, 175 insertions(+), 15 deletions(-)

diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
index e392517869c..4710e2c12e3 100644
--- a/Documentation/git-backfill.txt
+++ b/Documentation/git-backfill.txt
@@ -9,7 +9,7 @@ git-backfill - Download missing objects in a partial clone
 SYNOPSIS
 --------
 [verse]
-'git backfill' [--batch-size=<n>]
+'git backfill' [--batch-size=<n>] [--[no-]sparse]
 
 DESCRIPTION
 -----------
@@ -54,6 +54,10 @@ OPTIONS
 	blobs seen at a given path. The default minimum batch size is
 	50,000.
 
+--[no-]sparse::
+	Only download objects if they appear at a path that matches the
+	current sparse-checkout.
+
 SEE ALSO
 --------
 linkgit:git-clone[1].
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index 1fba0ce04cb..3e089211fb4 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -56,6 +56,14 @@ better off using the revision walk API instead.
 	the revision walk so that the walk emits commits marked with the
 	`UNINTERESTING` flag.
 
+`pl`::
+	This pattern list pointer allows focusing the path-walk search to
+	a set of patterns, only emitting paths that match the given
+	patterns. See linkgit:gitignore[5] or
+	linkgit:git-sparse-checkout[1] for details about pattern lists.
+	When the pattern list uses cone-mode patterns, then the path-walk
+	API can prune the set of paths it walks to improve performance.
+
 Examples
 --------
 
diff --git a/builtin/backfill.c b/builtin/backfill.c
index ddccececc36..b9f1cc98501 100644
--- a/builtin/backfill.c
+++ b/builtin/backfill.c
@@ -4,6 +4,7 @@
 #include "parse-options.h"
 #include "repository.h"
 #include "commit.h"
+#include "dir.h"
 #include "hex.h"
 #include "tree.h"
 #include "tree-walk.h"
@@ -21,7 +22,7 @@
 #include "path-walk.h"
 
 static const char * const builtin_backfill_usage[] = {
-	N_("git backfill [--batch-size=<n>]"),
+	N_("git backfill [--batch-size=<n>] [--[no-]sparse]"),
 	NULL
 };
 
@@ -29,6 +30,7 @@ struct backfill_context {
 	struct repository *repo;
 	struct oid_array current_batch;
 	size_t min_batch_size;
+	int sparse;
 };
 
 static void backfill_context_clear(struct backfill_context *ctx)
@@ -84,6 +86,14 @@ static int do_backfill(struct backfill_context *ctx)
 	struct path_walk_info info = PATH_WALK_INFO_INIT;
 	int ret;
 
+	if (ctx->sparse) {
+		CALLOC_ARRAY(info.pl, 1);
+		if (get_sparse_checkout_patterns(info.pl)) {
+			path_walk_info_clear(&info);
+			return error(_("problem loading sparse-checkout"));
+		}
+	}
+
 	repo_init_revisions(ctx->repo, &revs, "");
 	handle_revision_arg("HEAD", &revs, 0, 0);
 
@@ -112,10 +122,13 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
 		.repo = repo,
 		.current_batch = OID_ARRAY_INIT,
 		.min_batch_size = 50000,
+		.sparse = 0,
 	};
 	struct option options[] = {
 		OPT_INTEGER(0, "min-batch-size", &ctx.min_batch_size,
 			    N_("Minimum number of objects to request at a time")),
+		OPT_BOOL(0, "sparse", &ctx.sparse,
+			 N_("Restrict the missing objects to the current sparse-checkout")),
 		OPT_END(),
 	};
 
diff --git a/dir.c b/dir.c
index c43b5e30813..32af7ee294d 100644
--- a/dir.c
+++ b/dir.c
@@ -1088,10 +1088,6 @@ static void invalidate_directory(struct untracked_cache *uc,
 		dir->dirs[i]->recurse = 0;
 }
 
-static int add_patterns_from_buffer(char *buf, size_t size,
-				    const char *base, int baselen,
-				    struct pattern_list *pl);
-
 /* Flags for add_patterns() */
 #define PATTERN_NOFOLLOW (1<<0)
 
@@ -1181,9 +1177,9 @@ static int add_patterns(const char *fname, const char *base, int baselen,
 	return 0;
 }
 
-static int add_patterns_from_buffer(char *buf, size_t size,
-				    const char *base, int baselen,
-				    struct pattern_list *pl)
+int add_patterns_from_buffer(char *buf, size_t size,
+			     const char *base, int baselen,
+			     struct pattern_list *pl)
 {
 	char *orig = buf;
 	int i, lineno = 1;
diff --git a/dir.h b/dir.h
index a3a2f00f5d9..6cfef5df660 100644
--- a/dir.h
+++ b/dir.h
@@ -467,6 +467,9 @@ void add_patterns_from_file(struct dir_struct *, const char *fname);
 int add_patterns_from_blob_to_list(struct object_id *oid,
 				   const char *base, int baselen,
 				   struct pattern_list *pl);
+int add_patterns_from_buffer(char *buf, size_t size,
+			     const char *base, int baselen,
+			     struct pattern_list *pl);
 void parse_path_pattern(const char **string, int *patternlen, unsigned *flags, int *nowildcardlen);
 void add_pattern(const char *string, const char *base,
 		 int baselen, struct pattern_list *pl, int srcpos);
diff --git a/path-walk.c b/path-walk.c
index 136ec08fb0e..c7456a9c1c0 100644
--- a/path-walk.c
+++ b/path-walk.c
@@ -12,6 +12,7 @@
 #include "object.h"
 #include "oid-array.h"
 #include "prio-queue.h"
+#include "repository.h"
 #include "revision.h"
 #include "string-list.h"
 #include "strmap.h"
@@ -173,6 +174,23 @@ static int add_tree_entries(struct path_walk_context *ctx,
 		if (type == OBJ_TREE)
 			strbuf_addch(&path, '/');
 
+		if (ctx->info->pl) {
+			int dtype;
+			enum pattern_match_result match;
+			match = path_matches_pattern_list(path.buf, path.len,
+							  path.buf + base_len, &dtype,
+							  ctx->info->pl,
+							  ctx->repo->index);
+
+			if (ctx->info->pl->use_cone_patterns &&
+			    match == NOT_MATCHED)
+				continue;
+			else if (!ctx->info->pl->use_cone_patterns &&
+				 type == OBJ_BLOB &&
+				 match != MATCHED)
+				continue;
+		}
+
 		if (!(list = strmap_get(&ctx->paths_to_lists, path.buf))) {
 			CALLOC_ARRAY(list, 1);
 			list->type = type;
@@ -583,10 +601,10 @@ void path_walk_info_init(struct path_walk_info *info)
 	memcpy(info, &empty, sizeof(empty));
 }
 
-void path_walk_info_clear(struct path_walk_info *info UNUSED)
+void path_walk_info_clear(struct path_walk_info *info)
 {
-	/*
-	 * This destructor is empty for now, as info->revs
-	 * is not owned by 'struct path_walk_info'.
-	 */
+	if (info->pl) {
+		clear_pattern_list(info->pl);
+		free(info->pl);
+	}
 }
diff --git a/path-walk.h b/path-walk.h
index 414d6db23c2..473ee9d361c 100644
--- a/path-walk.h
+++ b/path-walk.h
@@ -6,6 +6,7 @@
 
 struct rev_info;
 struct oid_array;
+struct pattern_list;
 
 /**
  * The type of a function pointer for the method that is called on a list of
@@ -48,6 +49,16 @@ struct path_walk_info {
 	 * walk the children of such trees.
 	 */
 	int prune_all_uninteresting;
+
+	/**
+	 * Specify a sparse-checkout definition to match our paths to. Do not
+	 * walk outside of this sparse definition. If the patterns are in
+	 * cone mode, then the search may prune directories that are outside
+	 * of the cone. If not in cone mode, then all tree paths will be
+	 * explored but the path_fn will only be called when the path matches
+	 * the sparse-checkout patterns.
+	 */
+	struct pattern_list *pl;
 };
 
 #define PATH_WALK_INFO_INIT {   \
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
index 7f2d409c5bc..61e845e5ec2 100644
--- a/t/helper/test-path-walk.c
+++ b/t/helper/test-path-walk.c
@@ -1,6 +1,7 @@
 #define USE_THE_REPOSITORY_VARIABLE
 
 #include "test-tool.h"
+#include "dir.h"
 #include "environment.h"
 #include "hex.h"
 #include "object-name.h"
@@ -9,6 +10,7 @@
 #include "revision.h"
 #include "setup.h"
 #include "parse-options.h"
+#include "strbuf.h"
 #include "path-walk.h"
 #include "oid-array.h"
 
@@ -65,7 +67,7 @@ static int emit_block(const char *path, struct oid_array *oids,
 
 int cmd__path_walk(int argc, const char **argv)
 {
-	int res;
+	int res, stdin_pl = 0;
 	struct rev_info revs = REV_INFO_INIT;
 	struct path_walk_info info = PATH_WALK_INFO_INIT;
 	struct path_walk_test_data data = { 0 };
@@ -80,6 +82,8 @@ int cmd__path_walk(int argc, const char **argv)
 			 N_("toggle inclusion of tree objects")),
 		OPT_BOOL(0, "prune", &info.prune_all_uninteresting,
 			 N_("toggle pruning of uninteresting paths")),
+		OPT_BOOL(0, "stdin-pl", &stdin_pl,
+			 N_("read a pattern list over stdin")),
 		OPT_END(),
 	};
 
@@ -99,6 +103,17 @@ int cmd__path_walk(int argc, const char **argv)
 	info.path_fn = emit_block;
 	info.path_fn_data = &data;
 
+	if (stdin_pl) {
+		struct strbuf in = STRBUF_INIT;
+		CALLOC_ARRAY(info.pl, 1);
+
+		info.pl->use_cone_patterns = 1;
+
+		strbuf_fread(&in, 2048, stdin);
+		add_patterns_from_buffer(in.buf, in.len, "", 0, info.pl);
+		strbuf_release(&in);
+	}
+
 	res = walk_objects_by_path(&info);
 
 	printf("commits:%" PRIuMAX "\n"
@@ -107,6 +122,11 @@ int cmd__path_walk(int argc, const char **argv)
 	       "tags:%" PRIuMAX "\n",
 	       data.commit_nr, data.tree_nr, data.blob_nr, data.tag_nr);
 
+	if (info.pl) {
+		clear_pattern_list(info.pl);
+		free(info.pl);
+	}
+
 	release_revisions(&revs);
 	return res;
 }
diff --git a/t/t5620-backfill.sh b/t/t5620-backfill.sh
index 36107a51c54..f87a471c221 100755
--- a/t/t5620-backfill.sh
+++ b/t/t5620-backfill.sh
@@ -77,6 +77,61 @@ test_expect_success 'do partial clone 2, backfill min batch size' '
 	test_line_count = 0 revs2
 '
 
+test_expect_success 'backfill --sparse' '
+	git clone --sparse --filter=blob:none		\
+		--single-branch --branch=main 		\
+		"file://$(pwd)/srv.bare" backfill3 &&
+
+	# Initial checkout includes four files at root.
+	git -C backfill3 rev-list --quiet --objects --missing=print HEAD >missing &&
+	test_line_count = 44 missing &&
+
+	# Initial sparse-checkout is just the files at root, so we get the
+	# older versions of the four files at tip.
+	GIT_TRACE2_EVENT="$(pwd)/sparse-trace1" git \
+		-C backfill3 backfill --sparse &&
+	test_trace2_data promisor fetch_count 4 <sparse-trace1 &&
+	test_trace2_data path-walk paths 5 <sparse-trace1 &&
+	git -C backfill3 rev-list --quiet --objects --missing=print HEAD >missing &&
+	test_line_count = 40 missing &&
+
+	# Expand the sparse-checkout to include 'd' recursively. This
+	# engages the algorithm to skip the trees for 'a'. Note that
+	# the "sparse-checkout set" command downloads the objects at tip
+	# to satisfy the current checkout.
+	git -C backfill3 sparse-checkout set d &&
+	GIT_TRACE2_EVENT="$(pwd)/sparse-trace2" git \
+		-C backfill3 backfill --sparse &&
+	test_trace2_data promisor fetch_count 8 <sparse-trace2 &&
+	test_trace2_data path-walk paths 15 <sparse-trace2 &&
+	git -C backfill3 rev-list --quiet --objects --missing=print HEAD >missing &&
+	test_line_count = 24 missing
+'
+
+test_expect_success 'backfill --sparse without cone mode' '
+	git clone --no-checkout --filter=blob:none		\
+		--single-branch --branch=main 		\
+		"file://$(pwd)/srv.bare" backfill4 &&
+
+	# No blobs yet
+	git -C backfill4 rev-list --quiet --objects --missing=print HEAD >missing &&
+	test_line_count = 48 missing &&
+
+	# Define sparse-checkout by filename regardless of parent directory.
+	# This downloads 6 blobs to satisfy the checkout.
+	git -C backfill4 sparse-checkout set --no-cone "**/file.1.txt" &&
+	git -C backfill4 checkout main &&
+
+	GIT_TRACE2_EVENT="$(pwd)/no-cone-trace1" git \
+		-C backfill4 backfill --sparse &&
+	test_trace2_data promisor fetch_count 6 <no-cone-trace1 &&
+
+	# This walk needed to visit all directories to search for these paths.
+	test_trace2_data path-walk paths 12 <no-cone-trace1 &&
+	git -C backfill4 rev-list --quiet --objects --missing=print HEAD >missing &&
+	test_line_count = 36 missing
+'
+
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd
 
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
index 5f04acb8a2f..c89b0f1e19d 100755
--- a/t/t6601-path-walk.sh
+++ b/t/t6601-path-walk.sh
@@ -176,6 +176,38 @@ test_expect_success 'branches and indexed objects mix well' '
 	test_cmp_sorted expect out
 '
 
+test_expect_success 'base & topic, sparse' '
+	cat >patterns <<-EOF &&
+	/*
+	!/*/
+	/left/
+	EOF
+
+	test-tool path-walk --stdin-pl -- base topic <patterns >out &&
+
+	cat >expect <<-EOF &&
+	0:commit::$(git rev-parse topic)
+	0:commit::$(git rev-parse base)
+	0:commit::$(git rev-parse base~1)
+	0:commit::$(git rev-parse base~2)
+	1:tree::$(git rev-parse topic^{tree})
+	1:tree::$(git rev-parse base^{tree})
+	1:tree::$(git rev-parse base~1^{tree})
+	1:tree::$(git rev-parse base~2^{tree})
+	2:blob:a:$(git rev-parse base~2:a)
+	3:tree:left/:$(git rev-parse base:left)
+	3:tree:left/:$(git rev-parse base~2:left)
+	4:blob:left/b:$(git rev-parse base~2:left/b)
+	4:blob:left/b:$(git rev-parse base:left/b)
+	blobs:3
+	commits:4
+	tags:0
+	trees:6
+	EOF
+
+	test_cmp_sorted expect out
+'
+
 test_expect_success 'topic only' '
 	test-tool path-walk -- topic >out &&
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v2 5/5] backfill: assume --sparse when sparse-checkout is enabled
  2024-12-20 16:29 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (3 preceding siblings ...)
  2024-12-20 16:29   ` [PATCH v2 4/5] backfill: add --sparse option Derrick Stolee via GitGitGadget
@ 2024-12-20 16:29   ` Derrick Stolee via GitGitGadget
  2025-01-16 10:00   ` [PATCH v2 0/5] PATH WALK III: Add 'git backfill' command Patrick Steinhardt
  2025-02-03 17:11   ` [PATCH v3 " Derrick Stolee via GitGitGadget
  6 siblings, 0 replies; 39+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-12-20 16:29 UTC (permalink / raw)
  To: git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The previous change introduced the '--[no-]sparse' option for the 'git
backfill' command, but did not assume it as enabled by default. However,
this is likely the behavior that users will most often want to happen.
Without this default, users with a small sparse-checkout may be confused
when 'git backfill' downloads every version of every object in the full
history.

However, this is left as a separate change so this decision can be reviewed
independently of the value of the '--[no-]sparse' option.

Add a test of adding the '--sparse' option to a repo without sparse-checkout
to make it clear that supplying it without a sparse-checkout is an error.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/git-backfill.txt |  3 ++-
 builtin/backfill.c             |  7 +++++++
 t/t5620-backfill.sh            | 13 ++++++++++++-
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
index 4710e2c12e3..9eecc6210c0 100644
--- a/Documentation/git-backfill.txt
+++ b/Documentation/git-backfill.txt
@@ -56,7 +56,8 @@ OPTIONS
 
 --[no-]sparse::
 	Only download objects if they appear at a path that matches the
-	current sparse-checkout.
+	current sparse-checkout. If the sparse-checkout feature is enabled,
+	then `--sparse` is assumed and can be disabled with `--no-sparse`.
 
 SEE ALSO
 --------
diff --git a/builtin/backfill.c b/builtin/backfill.c
index b9f1cc98501..d7c300dbe67 100644
--- a/builtin/backfill.c
+++ b/builtin/backfill.c
@@ -1,3 +1,6 @@
+/* We need this macro to access core_apply_sparse_checkout */
+#define USE_THE_REPOSITORY_VARIABLE
+
 #include "builtin.h"
 #include "git-compat-util.h"
 #include "config.h"
@@ -5,6 +8,7 @@
 #include "repository.h"
 #include "commit.h"
 #include "dir.h"
+#include "environment.h"
 #include "hex.h"
 #include "tree.h"
 #include "tree-walk.h"
@@ -140,5 +144,8 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
 
 	repo_config(repo, git_default_config, NULL);
 
+	if (ctx.sparse < 0)
+		ctx.sparse = core_apply_sparse_checkout;
+
 	return do_backfill(&ctx);
 }
diff --git a/t/t5620-backfill.sh b/t/t5620-backfill.sh
index f87a471c221..3fafcf99b58 100755
--- a/t/t5620-backfill.sh
+++ b/t/t5620-backfill.sh
@@ -77,6 +77,12 @@ test_expect_success 'do partial clone 2, backfill min batch size' '
 	test_line_count = 0 revs2
 '
 
+test_expect_success 'backfill --sparse without sparse-checkout fails' '
+	git init not-sparse &&
+	test_must_fail git -C not-sparse backfill --sparse 2>err &&
+	grep "problem loading sparse-checkout" err
+'
+
 test_expect_success 'backfill --sparse' '
 	git clone --sparse --filter=blob:none		\
 		--single-branch --branch=main 		\
@@ -105,7 +111,12 @@ test_expect_success 'backfill --sparse' '
 	test_trace2_data promisor fetch_count 8 <sparse-trace2 &&
 	test_trace2_data path-walk paths 15 <sparse-trace2 &&
 	git -C backfill3 rev-list --quiet --objects --missing=print HEAD >missing &&
-	test_line_count = 24 missing
+	test_line_count = 24 missing &&
+
+	# Disabling the --sparse option (on by default) will download everything
+	git -C backfill3 backfill --no-sparse &&
+	git -C backfill3 rev-list --quiet --objects --missing=print HEAD >missing &&
+	test_line_count = 0 missing
 '
 
 test_expect_success 'backfill --sparse without cone mode' '
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 0/5] PATH WALK III: Add 'git backfill' command
  2024-12-20 16:29 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (4 preceding siblings ...)
  2024-12-20 16:29   ` [PATCH v2 5/5] backfill: assume --sparse when sparse-checkout is enabled Derrick Stolee via GitGitGadget
@ 2025-01-16 10:00   ` Patrick Steinhardt
  2025-01-17 22:37     ` Junio C Hamano
  2025-02-03 17:11   ` [PATCH v3 " Derrick Stolee via GitGitGadget
  6 siblings, 1 reply; 39+ messages in thread
From: Patrick Steinhardt @ 2025-01-16 10:00 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, johannes.schindelin, peff, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Derrick Stolee

On Fri, Dec 20, 2024 at 04:29:48PM +0000, Derrick Stolee via GitGitGadget wrote:
> This is based on v4 of ds/path-walk-1 [1] and an earlier version was part of
> my initial path-walk RFC [2].

I've got a couple more nits, but other than that I think that this
series looks quite nice.

I was wondering whether we might want to mark the new command as
experimental at first to allow us to iterate, but the last set of
commands where we have done so are still experimental many years after
they have been introduced. So... probably not a good idea.

Patrick

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] backfill: basic functionality and tests
  2024-12-20 16:29   ` [PATCH v2 2/5] backfill: basic functionality and tests Derrick Stolee via GitGitGadget
@ 2025-01-16 10:01     ` Patrick Steinhardt
  2025-02-03 14:44       ` Derrick Stolee
  0 siblings, 1 reply; 39+ messages in thread
From: Patrick Steinhardt @ 2025-01-16 10:01 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, johannes.schindelin, peff, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Derrick Stolee, Derrick Stolee

On Fri, Dec 20, 2024 at 04:29:50PM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/builtin/backfill.c b/builtin/backfill.c
> index 38e6aaeaa03..177fd4286c7 100644
> --- a/builtin/backfill.c
> +++ b/builtin/backfill.c
[snip]
> +static int fill_missing_blobs(const char *path UNUSED,
> +			      struct oid_array *list,
> +			      enum object_type type,
> +			      void *data)
> +{
> +	struct backfill_context *ctx = data;
> +
> +	if (type != OBJ_BLOB)
> +		return 0;
> +
> +	for (size_t i = 0; i < list->nr; i++) {
> +		off_t size = 0;
> +		struct object_info info = OBJECT_INFO_INIT;
> +		info.disk_sizep = &size;
> +		if (oid_object_info_extended(ctx->repo,
> +					     &list->oid[i],
> +					     &info,
> +					     OBJECT_INFO_FOR_PREFETCH) ||
> +		    !size)

So this is the object existence test? Is there a reason why we don't use
`repo_has_object_file()`, or its `_with_flags()` variant if we need to
pass `OBJECT_INFO_FOR_PREFETCH`?

> +			oid_array_append(&ctx->current_batch, &list->oid[i]);
> +	}
> +
> +	if (ctx->current_batch.nr >= ctx->batch_size)
> +		download_batch(ctx);
> +
> +	return 0;
> +}
> +
> +static int do_backfill(struct backfill_context *ctx)
> +{
> +	struct rev_info revs;
> +	struct path_walk_info info = PATH_WALK_INFO_INIT;
> +	int ret;
> +
> +	repo_init_revisions(ctx->repo, &revs, "");
> +	handle_revision_arg("HEAD", &revs, 0, 0);
> +
> +	info.blobs = 1;
> +	info.tags = info.commits = info.trees = 0;

Nit: this should be unnecessary as PATH_WALK_INFO_INIT already
initialized those fields for us, right?

> +	info.revs = &revs;
> +	info.path_fn = fill_missing_blobs;
> +	info.path_fn_data = ctx;
> +
> +	ret = walk_objects_by_path(&info);
> +
> +	/* Download the objects that did not fill a batch. */
> +	if (!ret)
> +		download_batch(ctx);
> +
> +	backfill_context_clear(ctx);

Nit: I think it's a bit funny that we're cleaning up the context over
here rather than in the caller.

Patrick

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 3/5] backfill: add --min-batch-size=<n> option
  2024-12-20 16:29   ` [PATCH v2 3/5] backfill: add --min-batch-size=<n> option Derrick Stolee via GitGitGadget
@ 2025-01-16 10:01     ` Patrick Steinhardt
  0 siblings, 0 replies; 39+ messages in thread
From: Patrick Steinhardt @ 2025-01-16 10:01 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, johannes.schindelin, peff, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Derrick Stolee, Derrick Stolee

On Fri, Dec 20, 2024 at 04:29:51PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <derrickstolee@github.com>
> 
> Users may want to specify a minimum batch size for their needs. This is only
> a minimum: the path-walk API provides a list of OIDs that correspond to the
> same path, and thus it is optimal to allow delta compression across those
> objects in a single server request.
> 
> We could consider limiting the request to have a maximum batch size in the
> future. For now, we let the path-walk API batches determine the
> boundaries.
> 
> To get a feeling for the value of specifying the --batch-size parameter,

This should say `--min-batch-size`.

> diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
> index ece887831f6..e392517869c 100644
> --- a/Documentation/git-backfill.txt
> +++ b/Documentation/git-backfill.txt
> @@ -9,7 +9,7 @@ git-backfill - Download missing objects in a partial clone
>  SYNOPSIS
>  --------
>  [verse]
> -'git backfill' [<options>]
> +'git backfill' [--batch-size=<n>]
>  
>  DESCRIPTION
>  -----------

Here, as well.

> diff --git a/builtin/backfill.c b/builtin/backfill.c
> index 177fd4286c7..ddccececc36 100644
> --- a/builtin/backfill.c
> +++ b/builtin/backfill.c
> @@ -21,14 +21,14 @@
>  #include "path-walk.h"
>  
>  static const char * const builtin_backfill_usage[] = {
> -	N_("git backfill [<options>]"),
> +	N_("git backfill [--batch-size=<n>]"),

And here.

> @@ -111,9 +111,11 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
>  	struct backfill_context ctx = {
>  		.repo = repo,
>  		.current_batch = OID_ARRAY_INIT,
> -		.batch_size = 50000,
> +		.min_batch_size = 50000,
>  	};

Nit: it would be nice to adjust the name of this variable in the
preceding commit already so that it doesn't have to change again over
here.

Patrick

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/5] backfill: add --sparse option
  2024-12-20 16:29   ` [PATCH v2 4/5] backfill: add --sparse option Derrick Stolee via GitGitGadget
@ 2025-01-16 10:01     ` Patrick Steinhardt
  2025-02-03 15:11       ` Derrick Stolee
  0 siblings, 1 reply; 39+ messages in thread
From: Patrick Steinhardt @ 2025-01-16 10:01 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, johannes.schindelin, peff, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Derrick Stolee, Derrick Stolee

On Fri, Dec 20, 2024 at 04:29:52PM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/path-walk.c b/path-walk.c
> index 136ec08fb0e..c7456a9c1c0 100644
> --- a/path-walk.c
> +++ b/path-walk.c
> @@ -12,6 +12,7 @@
>  #include "object.h"
>  #include "oid-array.h"
>  #include "prio-queue.h"
> +#include "repository.h"
>  #include "revision.h"
>  #include "string-list.h"
>  #include "strmap.h"
> @@ -173,6 +174,23 @@ static int add_tree_entries(struct path_walk_context *ctx,
>  		if (type == OBJ_TREE)
>  			strbuf_addch(&path, '/');
>  
> +		if (ctx->info->pl) {
> +			int dtype;
> +			enum pattern_match_result match;
> +			match = path_matches_pattern_list(path.buf, path.len,
> +							  path.buf + base_len, &dtype,
> +							  ctx->info->pl,
> +							  ctx->repo->index);
> +
> +			if (ctx->info->pl->use_cone_patterns &&
> +			    match == NOT_MATCHED)
> +				continue;
> +			else if (!ctx->info->pl->use_cone_patterns &&
> +				 type == OBJ_BLOB &&
> +				 match != MATCHED)

For my own understanding: is there as pecific reason why one of the
branches uses `== NOT_MATCHED` whereas the other one uses `!= MATCHED`?

> diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
> index 7f2d409c5bc..61e845e5ec2 100644
> --- a/t/helper/test-path-walk.c
> +++ b/t/helper/test-path-walk.c
> @@ -65,7 +67,7 @@ static int emit_block(const char *path, struct oid_array *oids,
>  
>  int cmd__path_walk(int argc, const char **argv)
>  {
> -	int res;
> +	int res, stdin_pl = 0;
>  	struct rev_info revs = REV_INFO_INIT;
>  	struct path_walk_info info = PATH_WALK_INFO_INIT;
>  	struct path_walk_test_data data = { 0 };
> @@ -80,6 +82,8 @@ int cmd__path_walk(int argc, const char **argv)
>  			 N_("toggle inclusion of tree objects")),
>  		OPT_BOOL(0, "prune", &info.prune_all_uninteresting,
>  			 N_("toggle pruning of uninteresting paths")),
> +		OPT_BOOL(0, "stdin-pl", &stdin_pl,
> +			 N_("read a pattern list over stdin")),
>  		OPT_END(),
>  	};
>  

I was about to suggest giving this a more descriptive name, as it might
be confusing for anybody not intimately familiar with the code. But then
I noticed that this is part of the test helper, only, so it doesn't
matter as much. So feel free to ignore.

Patrick

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/5] backfill: add builtin boilerplate
  2024-12-06 20:07 ` [PATCH 1/5] backfill: add builtin boilerplate Derrick Stolee via GitGitGadget
@ 2025-01-16 10:11   ` Patrick Steinhardt
  2025-01-16 17:52     ` Junio C Hamano
  0 siblings, 1 reply; 39+ messages in thread
From: Patrick Steinhardt @ 2025-01-16 10:11 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, johannes.schindelin, peff, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Derrick Stolee, Derrick Stolee

On Fri, Dec 06, 2024 at 08:07:14PM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
> new file mode 100644
> index 00000000000..640144187d3
> --- /dev/null
> +++ b/Documentation/git-backfill.txt
> @@ -0,0 +1,23 @@
> +git-backfill(1)
> +===============
> +
> +NAME
> +----
> +git-backfill - Download missing objects in a partial clone
> +
> +
> +SYNOPSIS
> +--------
> +[verse]
> +'git backfill' [<options>]

Ah, one thing I forgot about: this could use the new `[synopsis]` style,
which removes some need for formatting directives.

Patrick

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/5] backfill: add builtin boilerplate
  2025-01-16 10:11   ` Patrick Steinhardt
@ 2025-01-16 17:52     ` Junio C Hamano
  2025-02-03 14:38       ` Derrick Stolee
  0 siblings, 1 reply; 39+ messages in thread
From: Junio C Hamano @ 2025-01-16 17:52 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: Derrick Stolee via GitGitGadget, git, johannes.schindelin, peff,
	me, johncai86, newren, christian.couder, kristofferhaugsbakk,
	jonathantanmy, karthik.188, Derrick Stolee, Derrick Stolee

Patrick Steinhardt <ps@pks.im> writes:

> On Fri, Dec 06, 2024 at 08:07:14PM +0000, Derrick Stolee via GitGitGadget wrote:
>> diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
>> new file mode 100644
>> index 00000000000..640144187d3
>> --- /dev/null
>> +++ b/Documentation/git-backfill.txt
>> @@ -0,0 +1,23 @@
>> +git-backfill(1)
>> +===============
>> +
>> +NAME
>> +----
>> +git-backfill - Download missing objects in a partial clone
>> +
>> +
>> +SYNOPSIS
>> +--------
>> +[verse]
>> +'git backfill' [<options>]
>
> Ah, one thing I forgot about: this could use the new `[synopsis]` style,
> which removes some need for formatting directives.

Yeah, I thought it was more or less simultaneous development and it
was OK to convert after the dust settles, but it seems to predate
the series by 3 months.

$ git show -s --format=reference 029eff9e34f 375852e20
029eff9e34 (doc: update the guidelines to reflect the current formatting rules, 2024-09-24)
375852e20f (backfill: add builtin boilerplate, 2024-12-20)

ds/backfill:Documentation/CodingGuidelines does tell us '[synopsis]'
is available, even ;-)

Thanks for noticing.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 0/5] PATH WALK III: Add 'git backfill' command
  2025-01-16 10:00   ` [PATCH v2 0/5] PATH WALK III: Add 'git backfill' command Patrick Steinhardt
@ 2025-01-17 22:37     ` Junio C Hamano
  0 siblings, 0 replies; 39+ messages in thread
From: Junio C Hamano @ 2025-01-17 22:37 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: Derrick Stolee via GitGitGadget, git, johannes.schindelin, peff,
	me, johncai86, newren, christian.couder, kristofferhaugsbakk,
	jonathantanmy, karthik.188, Derrick Stolee

Patrick Steinhardt <ps@pks.im> writes:

> I was wondering whether we might want to mark the new command as
> experimental at first to allow us to iterate, but the last set of
> commands where we have done so are still experimental many years after
> they have been introduced. So... probably not a good idea.

I am not sure if I agree.  Anybody who wants any of these commands
that are marked as experimental to be declared stable can propose
to do so with a reasonable timeline attached.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/5] backfill: add --batch-size=<n> option
  2024-12-06 20:07 ` [PATCH 3/5] backfill: add --batch-size=<n> option Derrick Stolee via GitGitGadget
  2024-12-16  8:01   ` Patrick Steinhardt
@ 2025-01-19 17:57   ` Jean-Noël AVILA
  1 sibling, 0 replies; 39+ messages in thread
From: Jean-Noël AVILA @ 2025-01-19 17:57 UTC (permalink / raw)
  To: git, Derrick Stolee via GitGitGadget
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Derrick Stolee, Derrick Stolee

On Friday, 6 December 2024 21:07:16 UTC+1 Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <derrickstolee@github.com>
> 

> @@ -38,6 +38,14 @@ delta compression in the packfile sent by the server.
>  By default, `git backfill` downloads all blobs reachable from the `HEAD`
>  commit. This set can be restricted or expanded using various options.
>  
> +OPTIONS
> +-------
> +
> +--batch-size=<n>::

Please format automatically using backticks:  `--batch-size=<n>`::

> +	Specify a minimum size for a batch of missing objects to request
> +	from the server. This size may be exceeded by the last set of
> +	blobs seen at a given path. Default batch size is 16,000.
> +
>  SEE ALSO
>  --------
>  linkgit:git-clone[1].





^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/5] backfill: add builtin boilerplate
  2025-01-16 17:52     ` Junio C Hamano
@ 2025-02-03 14:38       ` Derrick Stolee
  0 siblings, 0 replies; 39+ messages in thread
From: Derrick Stolee @ 2025-02-03 14:38 UTC (permalink / raw)
  To: Junio C Hamano, Patrick Steinhardt
  Cc: Derrick Stolee via GitGitGadget, git, johannes.schindelin, peff,
	me, johncai86, newren, christian.couder, kristofferhaugsbakk,
	jonathantanmy, karthik.188, Derrick Stolee

On 1/16/25 12:52 PM, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
>> On Fri, Dec 06, 2024 at 08:07:14PM +0000, Derrick Stolee via GitGitGadget wrote:
>>> diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
>>> new file mode 100644
>>> index 00000000000..640144187d3
>>> --- /dev/null
>>> +++ b/Documentation/git-backfill.txt
>>> @@ -0,0 +1,23 @@
>>> +git-backfill(1)
>>> +===============
>>> +
>>> +NAME
>>> +----
>>> +git-backfill - Download missing objects in a partial clone
>>> +
>>> +
>>> +SYNOPSIS
>>> +--------
>>> +[verse]
>>> +'git backfill' [<options>]
>>
>> Ah, one thing I forgot about: this could use the new `[synopsis]` style,
>> which removes some need for formatting directives.
> 
> Yeah, I thought it was more or less simultaneous development and it
> was OK to convert after the dust settles, but it seems to predate
> the series by 3 months.
> 
> $ git show -s --format=reference 029eff9e34f 375852e20
> 029eff9e34 (doc: update the guidelines to reflect the current formatting rules, 2024-09-24)
> 375852e20f (backfill: add builtin boilerplate, 2024-12-20)
> 
> ds/backfill:Documentation/CodingGuidelines does tell us '[synopsis]'
> is available, even ;-)

Thanks. I was not aware of this new style. Will be fixed in the next version.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/5] backfill: basic functionality and tests
  2025-01-16 10:01     ` Patrick Steinhardt
@ 2025-02-03 14:44       ` Derrick Stolee
  0 siblings, 0 replies; 39+ messages in thread
From: Derrick Stolee @ 2025-02-03 14:44 UTC (permalink / raw)
  To: Patrick Steinhardt, Derrick Stolee via GitGitGadget
  Cc: git, gitster, johannes.schindelin, peff, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Derrick Stolee

On 1/16/25 5:01 AM, Patrick Steinhardt wrote:
> On Fri, Dec 20, 2024 at 04:29:50PM +0000, Derrick Stolee via GitGitGadget wrote:
>> diff --git a/builtin/backfill.c b/builtin/backfill.c
>> index 38e6aaeaa03..177fd4286c7 100644
>> --- a/builtin/backfill.c
>> +++ b/builtin/backfill.c
> [snip]
>> +static int fill_missing_blobs(const char *path UNUSED,
>> +			      struct oid_array *list,
>> +			      enum object_type type,
>> +			      void *data)
>> +{
>> +	struct backfill_context *ctx = data;
>> +
>> +	if (type != OBJ_BLOB)
>> +		return 0;
>> +
>> +	for (size_t i = 0; i < list->nr; i++) {
>> +		off_t size = 0;
>> +		struct object_info info = OBJECT_INFO_INIT;
>> +		info.disk_sizep = &size;
>> +		if (oid_object_info_extended(ctx->repo,
>> +					     &list->oid[i],
>> +					     &info,
>> +					     OBJECT_INFO_FOR_PREFETCH) ||
>> +		    !size)
> 
> So this is the object existence test? Is there a reason why we don't use
> `repo_has_object_file()`, or its `_with_flags()` variant if we need to
> pass `OBJECT_INFO_FOR_PREFETCH`?

You make a good point, but I also notice that repo_has_object_file() has
the following comment:

  * These functions can be removed once all callers have migrated to
  * has_object() and/or oid_object_info_extended().

so I'll use has_object().

>> +			oid_array_append(&ctx->current_batch, &list->oid[i]);
>> +	}
>> +
>> +	if (ctx->current_batch.nr >= ctx->batch_size)
>> +		download_batch(ctx);
>> +
>> +	return 0;
>> +}
>> +
>> +static int do_backfill(struct backfill_context *ctx)
>> +{
>> +	struct rev_info revs;
>> +	struct path_walk_info info = PATH_WALK_INFO_INIT;
>> +	int ret;
>> +
>> +	repo_init_revisions(ctx->repo, &revs, "");
>> +	handle_revision_arg("HEAD", &revs, 0, 0);
>> +
>> +	info.blobs = 1;
>> +	info.tags = info.commits = info.trees = 0;
> 
> Nit: this should be unnecessary as PATH_WALK_INFO_INIT already
> initialized those fields for us, right?

The info.blobs is redundant, but is helpful for context. The
other line is necessary as PATH_WALK_INFO_INIT is defined as:

#define PATH_WALK_INFO_INIT {   \
	.blobs = 1,		\
	.trees = 1,		\
	.commits = 1,		\
	.tags = 1,		\
}

>> +	info.revs = &revs;
>> +	info.path_fn = fill_missing_blobs;
>> +	info.path_fn_data = ctx;
>> +
>> +	ret = walk_objects_by_path(&info);
>> +
>> +	/* Download the objects that did not fill a batch. */
>> +	if (!ret)
>> +		download_batch(ctx);
>> +
>> +	backfill_context_clear(ctx);
> 
> Nit: I think it's a bit funny that we're cleaning up the context over
> here rather than in the caller.
Cleaning up in the caller makes the "return do_backfill(&ctx);" line
slightly more complicated, but you are right that we shouldn't be
cleaning up something that the method "doesn't own".

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/5] backfill: add --sparse option
  2025-01-16 10:01     ` Patrick Steinhardt
@ 2025-02-03 15:11       ` Derrick Stolee
  0 siblings, 0 replies; 39+ messages in thread
From: Derrick Stolee @ 2025-02-03 15:11 UTC (permalink / raw)
  To: Patrick Steinhardt, Derrick Stolee via GitGitGadget
  Cc: git, gitster, johannes.schindelin, peff, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Derrick Stolee

On 1/16/25 5:01 AM, Patrick Steinhardt wrote:
> On Fri, Dec 20, 2024 at 04:29:52PM +0000, Derrick Stolee via GitGitGadget wrote:

>> +		if (ctx->info->pl) {
>> +			int dtype;
>> +			enum pattern_match_result match;
>> +			match = path_matches_pattern_list(path.buf, path.len,
>> +							  path.buf + base_len, &dtype,
>> +							  ctx->info->pl,
>> +							  ctx->repo->index);
>> +
>> +			if (ctx->info->pl->use_cone_patterns &&
>> +			    match == NOT_MATCHED)
>> +				continue;
>> +			else if (!ctx->info->pl->use_cone_patterns &&
>> +				 type == OBJ_BLOB &&
>> +				 match != MATCHED)
> 
> For my own understanding: is there as pecific reason why one of the
> branches uses `== NOT_MATCHED` whereas the other one uses `!= MATCHED`?

With cone mode sparse-checkout, 'match' could equal MATCHED,
MATCHED_RECURSIVE, or UNDECIDED, which we want to be considered all the
same case: continue along this path.

When not in cone mode, we can't decide to filter by trees (hence the
OBJ_BLOB restriction) and then the result can be MATCHED, NOT_MATCHED,
and UNDECIDED. This rule matches the following realization:

  * MATCHED if there is a positive pattern that matches the path.
  * NOT_MATCHED if there is a negative pattern that matches the path.
  * UNDECIDED if no pattern matches the path.

This is subtle, but switching this to "match == NOT_MATCHED" will
result in the test failing (and the test is right).

I will make note of this in my commit message in the next version, as
well as adding a test that has nested positive and negative patterns.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v3 0/5] PATH WALK III: Add 'git backfill' command
  2024-12-20 16:29 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (5 preceding siblings ...)
  2025-01-16 10:00   ` [PATCH v2 0/5] PATH WALK III: Add 'git backfill' command Patrick Steinhardt
@ 2025-02-03 17:11   ` Derrick Stolee via GitGitGadget
  2025-02-03 17:11     ` [PATCH v3 1/5] backfill: add builtin boilerplate Derrick Stolee via GitGitGadget
                       ` (5 more replies)
  6 siblings, 6 replies; 39+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-02-03 17:11 UTC (permalink / raw)
  To: git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Jean-Noël AVILA, Derrick Stolee

This is based on 'master' now that ds/path-walk [1] is merged. An earlier
version was part of my initial path-walk RFC [2].

[1]
https://lore.kernel.org/git/pull.1818.v3.git.1733514358.gitgitgadget@gmail.com/

[2]
https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@gmail.com/

This series adds a new 'git backfill' command that uses the path-walk API to
download missing blobs in a blobless partial clone. Users can specify
interaction with the sparse-checkout using '--[no-]sparse' but the
'--sparse' option is implied by the existence of a sparse-checkout.

The reason to use the path-walk API is to make sure that the missing objects
are grouped by a common path, giving a reasonable process for batching
requests and expecting the server to compress the resulting packfile nicely
together.

I first prototyped this feature in June 2024 as an exploration and created
the path-walk algorithm for this purpose. It was only my intuition that led
me to believe that batching by path would lead to better packfiles. This has
been proven out as a very important feature due to recent investigations to
compressing full repositories by doing a better job of grouping objects by
path. See the --name-hash-version series [3] or the 'git pack-objects
--path-walk' series [4] (currently on hold as it conflicts with the
--name-hash-version series).

[3]
https://lore.kernel.org/git/pull.1823.v2.git.1733181682.gitgitgadget@gmail.com/

[4]
https://lore.kernel.org/git/pull.1813.v2.git.1729431810.gitgitgadget@gmail.com/

This idea can be further demonstrated by the evidence in testing this
feature: by downloading objects in small batch sizes, the client can force
the server to repack things more efficiently than a full repack.

The example repository I have used in multiple places is the
microsoft/fluentui repo [5] as it has many CHANGELOG.md files that cause
name hash collisions that make the full repack inefficient.

[5] https://github.com/microsoft/fluentui

If we create a blobless clone of the fluentui repo, then this downloads 105
MB across two packfiles (the commits and trees pack, followed by the blobs
needed for an initial checkout). Running 'git backfill --batch-size=' for
different sizes leads to some interesting results:

| Batch Size      | Pack Count | Pack Size | Time   |
|-----------------|------------|-----------|--------|
| (Initial clone) | 2          | 105 MB    |        |
| 5K              | 53         | 348 MB    | 2m 26s |
| 10K             | 28         | 365 MB    | 2m 22s |
| 15K             | 19         | 407 MB    | 2m 21s |
| 20K             | 15         | 393 MB    | 2m 28s |
| 25K             | 13         | 417 MB    | 2m 06s |
| 50K             | 8          | 509 MB    | 1m 34s |
| 100K            | 5          | 535 MB    | 1m 56s |
| 250K            | 4          | 698 MB    | 1m 33s |
| 500K            | 3          | 696 MB    | 1m 42s |


The smaller batches cause the server to realize that the existing deltas
cannot be reused and it finds better deltas. This takes some extra time for
the small batches, but halves the size of the repo. Even in the 500K batch
size, we get less data than the 738 MB of a full clone.

Implementing the --sparse feature is best done by augmenting the path-walk
API to be aware of a pattern list. This works for both cone and non-cone
mode sparse-checkouts.

There are future directions we could take this command, especially to run
the command with a user-specified pathspec. The tricky case for that
additional feature is trying to make the path-walk more efficient by
skipping tree paths that would not lead to a match of the pathspec. It would
likely need optimization in a small subset of pathspec features (such as
prefix matches) to work as efficiently as possible. I did prototype a
version that puts the pathspec match in the callback function within
builtin/backfill.c, but I found that uninspiring and unnecessary for now.


Updates in v2
=============

Thanks for the review on v1.

 * Memory leaks should be plugged now due to the introduction of a
   destructor in v4 of the path-walk API and its extension here to clear the
   pattern list.

 * Documentation is expanded to demonstrate that 'git backfill' can also
   approximate incremental/resumable clones of repositories.

 * Method renames to better match style.

 * --batch-size is renamed to --min-batch-size for clarity and future
   extensibility.


Updates in v3
=============

 * Rebased onto 'master' now that the path-walk API is merged.

 * New builtin boilerplate is updated with new standards, including:

 * Doc formatting uses [synopsis] formatting.
 * Add builtin/backfill.c to meson.build.
 * Add Documentation/git-backfill.txt to Documentation/meson.build.
 * Add t/t5620-backfill.sh to t/meson.build.
 * Update handling of -h due to f66d1423f5 (builtin: send usage() help text
   to standard output, 2025-01-16).

 * Doc formatting is updated to use back-ticks on options and mark the
   builtin as experimental.

 * The batch_size member of 'struct backfill_context' is now named
   'min_batch_size' in all patches.

 * Some mentions of '--batch-size' are updated to '--min-batch-size'.

 * An additional test is included for non-cone-mode sparse-checkout patterns
   to further check the return values of path_matches_pattern_list() within
   the path-walk API with sparse mode.

 * A use of oid_object_info_extended() is replaced with has_object().

 * The backfill_context_clear() method is called by the proper owner of the
   struct.

Thanks, -Stolee

Derrick Stolee (5):
  backfill: add builtin boilerplate
  backfill: basic functionality and tests
  backfill: add --min-batch-size=<n> option
  backfill: add --sparse option
  backfill: assume --sparse when sparse-checkout is enabled

 .gitignore                                |   1 +
 Documentation/git-backfill.txt            |  71 ++++++++
 Documentation/meson.build                 |   1 +
 Documentation/technical/api-path-walk.txt |  11 +-
 Makefile                                  |   1 +
 builtin.h                                 |   1 +
 builtin/backfill.c                        | 146 +++++++++++++++
 command-list.txt                          |   1 +
 dir.c                                     |  10 +-
 dir.h                                     |   3 +
 git.c                                     |   1 +
 meson.build                               |   1 +
 path-walk.c                               |  28 ++-
 path-walk.h                               |  11 ++
 t/helper/test-path-walk.c                 |  22 ++-
 t/meson.build                             |   1 +
 t/t5620-backfill.sh                       | 211 ++++++++++++++++++++++
 t/t6601-path-walk.sh                      |  32 ++++
 18 files changed, 539 insertions(+), 14 deletions(-)
 create mode 100644 Documentation/git-backfill.txt
 create mode 100644 builtin/backfill.c
 create mode 100755 t/t5620-backfill.sh


base-commit: 58b5801aa94ad5031978f8e42c1be1230b3d352f
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1820%2Fderrickstolee%2Fbackfill-upstream-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1820/derrickstolee/backfill-upstream-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/1820

Range-diff vs v2:

 1:  ac86510417a ! 1:  1612ad92455 backfill: add builtin boilerplate
     @@ Commit message
          backfill: add builtin boilerplate
      
          In anticipation of implementing 'git backfill', populate the necessary files
     -    with the boilerplate of a new builtin.
     +    with the boilerplate of a new builtin. Mark the builtin as experimental at
     +    this time, allowing breaking changes in the near future, if necessary.
      
          Signed-off-by: Derrick Stolee <stolee@gmail.com>
      
     @@ Documentation/git-backfill.txt (new)
      +
      +SYNOPSIS
      +--------
     -+[verse]
     -+'git backfill' [<options>]
     ++[synopsis]
     ++git backfill [<options>]
      +
      +DESCRIPTION
      +-----------
      +
     ++THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR MAY CHANGE IN THE FUTURE.
     ++
      +SEE ALSO
      +--------
      +linkgit:git-clone[1].
     @@ Documentation/git-backfill.txt (new)
      +---
      +Part of the linkgit:git[1] suite
      
     + ## Documentation/meson.build ##
     +@@ Documentation/meson.build: manpages = {
     +   'git-apply.txt' : 1,
     +   'git-archimport.txt' : 1,
     +   'git-archive.txt' : 1,
     ++  'git-backfill.txt' : 1,
     +   'git-bisect.txt' : 1,
     +   'git-blame.txt' : 1,
     +   'git-branch.txt' : 1,
     +
       ## Makefile ##
      @@ Makefile: BUILTIN_OBJS += builtin/am.o
       BUILTIN_OBJS += builtin/annotate.o
     @@ builtin/backfill.c (new)
      +		OPT_END(),
      +	};
      +
     -+	if (argc == 2 && !strcmp(argv[1], "-h"))
     -+		usage_with_options(builtin_backfill_usage, options);
     ++	show_usage_if_asked(argc, argv, builtin_backfill_usage[0]);
      +
      +	argc = parse_options(argc, argv, prefix, options, builtin_backfill_usage,
      +			     0);
     @@ git.c: static struct cmd_struct commands[] = {
       	{ "bisect", cmd_bisect, RUN_SETUP },
       	{ "blame", cmd_blame, RUN_SETUP },
       	{ "branch", cmd_branch, RUN_SETUP | DELAY_PAGER_CONFIG },
     +
     + ## meson.build ##
     +@@ meson.build: builtin_sources = [
     +   'builtin/annotate.c',
     +   'builtin/apply.c',
     +   'builtin/archive.c',
     ++  'builtin/backfill.c',
     +   'builtin/bisect.c',
     +   'builtin/blame.c',
     +   'builtin/branch.c',
 2:  e4e88794cae ! 2:  d0cb9cb9af5 backfill: basic functionality and tests
     @@ Commit message
          Signed-off-by: Derrick Stolee <stolee@gmail.com>
      
       ## Documentation/git-backfill.txt ##
     -@@ Documentation/git-backfill.txt: SYNOPSIS
     +@@ Documentation/git-backfill.txt: git backfill [<options>]
       DESCRIPTION
       -----------
       
     @@ Documentation/git-backfill.txt: SYNOPSIS
      +By default, `git backfill` downloads all blobs reachable from the `HEAD`
      +commit. This set can be restricted or expanded using various options.
      +
     + THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR MAY CHANGE IN THE FUTURE.
     + 
       SEE ALSO
     - --------
     - linkgit:git-clone[1].
      
       ## Documentation/technical/api-path-walk.txt ##
      @@ Documentation/technical/api-path-walk.txt: Examples
     @@ builtin/backfill.c
      +struct backfill_context {
      +	struct repository *repo;
      +	struct oid_array current_batch;
     -+	size_t batch_size;
     ++	size_t min_batch_size;
      +};
      +
      +static void backfill_context_clear(struct backfill_context *ctx)
     @@ builtin/backfill.c
      +		return 0;
      +
      +	for (size_t i = 0; i < list->nr; i++) {
     -+		off_t size = 0;
     -+		struct object_info info = OBJECT_INFO_INIT;
     -+		info.disk_sizep = &size;
     -+		if (oid_object_info_extended(ctx->repo,
     -+					     &list->oid[i],
     -+					     &info,
     -+					     OBJECT_INFO_FOR_PREFETCH) ||
     -+		    !size)
     ++		if (!has_object(ctx->repo, &list->oid[i],
     ++				OBJECT_INFO_FOR_PREFETCH))
      +			oid_array_append(&ctx->current_batch, &list->oid[i]);
      +	}
      +
     -+	if (ctx->current_batch.nr >= ctx->batch_size)
     ++	if (ctx->current_batch.nr >= ctx->min_batch_size)
      +		download_batch(ctx);
      +
      +	return 0;
     @@ builtin/backfill.c
      +	if (!ret)
      +		download_batch(ctx);
      +
     -+	backfill_context_clear(ctx);
      +	path_walk_info_clear(&info);
      +	release_revisions(&revs);
      +	return ret;
     @@ builtin/backfill.c
      +
       int cmd_backfill(int argc, const char **argv, const char *prefix, struct repository *repo)
       {
     ++	int result;
      +	struct backfill_context ctx = {
      +		.repo = repo,
      +		.current_batch = OID_ARRAY_INIT,
     -+		.batch_size = 50000,
     ++		.min_batch_size = 50000,
      +	};
       	struct option options[] = {
       		OPT_END(),
     @@ builtin/backfill.c: int cmd_backfill(int argc, const char **argv, const char *pr
      -	die(_("not implemented"));
      -
      -	return 0;
     -+	return do_backfill(&ctx);
     ++	result = do_backfill(&ctx);
     ++	backfill_context_clear(&ctx);
     ++	return result;
       }
      
     + ## t/meson.build ##
     +@@ t/meson.build: integration_tests = [
     +   't5617-clone-submodules-remote.sh',
     +   't5618-alternate-refs.sh',
     +   't5619-clone-local-ambiguous-transport.sh',
     ++  't5620-backfill.sh',
     +   't5700-protocol-v1.sh',
     +   't5701-git-serve.sh',
     +   't5702-protocol-v2.sh',
     +
       ## t/t5620-backfill.sh (new) ##
      @@
      +#!/bin/sh
 3:  3fa32822dab ! 3:  b35c2f06b59 backfill: add --min-batch-size=<n> option
     @@ Commit message
          future. For now, we let the path-walk API batches determine the
          boundaries.
      
     -    To get a feeling for the value of specifying the --batch-size parameter,
     +    To get a feeling for the value of specifying the --min-batch-size parameter,
          I tested a number of open source repositories available on GitHub. The
          procedure was generally:
      
     @@ Documentation/git-backfill.txt
      @@ Documentation/git-backfill.txt: git-backfill - Download missing objects in a partial clone
       SYNOPSIS
       --------
     - [verse]
     --'git backfill' [<options>]
     -+'git backfill' [--batch-size=<n>]
     + [synopsis]
     +-git backfill [<options>]
     ++git backfill [--min-batch-size=<n>]
       
       DESCRIPTION
       -----------
     -@@ Documentation/git-backfill.txt: time.
     - By default, `git backfill` downloads all blobs reachable from the `HEAD`
     - commit. This set can be restricted or expanded using various options.
     +@@ Documentation/git-backfill.txt: commit. This set can be restricted or expanded using various options.
       
     + THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR MAY CHANGE IN THE FUTURE.
     + 
     ++
      +OPTIONS
      +-------
      +
     -+--min-batch-size=<n>::
     ++`--min-batch-size=<n>`::
      +	Specify a minimum size for a batch of missing objects to request
      +	from the server. This size may be exceeded by the last set of
      +	blobs seen at a given path. The default minimum batch size is
     @@ builtin/backfill.c
       
       static const char * const builtin_backfill_usage[] = {
      -	N_("git backfill [<options>]"),
     -+	N_("git backfill [--batch-size=<n>]"),
     ++	N_("git backfill [--min-batch-size=<n>]"),
       	NULL
       };
       
     - struct backfill_context {
     - 	struct repository *repo;
     - 	struct oid_array current_batch;
     --	size_t batch_size;
     -+	size_t min_batch_size;
     - };
     - 
     - static void backfill_context_clear(struct backfill_context *ctx)
     -@@ builtin/backfill.c: static int fill_missing_blobs(const char *path UNUSED,
     - 			oid_array_append(&ctx->current_batch, &list->oid[i]);
     - 	}
     - 
     --	if (ctx->current_batch.nr >= ctx->batch_size)
     -+	if (ctx->current_batch.nr >= ctx->min_batch_size)
     - 		download_batch(ctx);
     - 
     - 	return 0;
      @@ builtin/backfill.c: int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
     - 	struct backfill_context ctx = {
     - 		.repo = repo,
     - 		.current_batch = OID_ARRAY_INIT,
     --		.batch_size = 50000,
     -+		.min_batch_size = 50000,
     + 		.min_batch_size = 50000,
       	};
       	struct option options[] = {
      +		OPT_INTEGER(0, "min-batch-size", &ctx.min_batch_size,
 4:  2723143afb3 ! 4:  f1c5a5c5fde backfill: add --sparse option
     @@ Commit message
      
          Be sure to test this in both cone mode and not cone mode. Cone mode has the
          benefit that the path-walk can skip certain paths once they would expand
     -    beyond the sparse-checkout.
     +    beyond the sparse-checkout. Non-cone mode can describe the included files
     +    using both positive and negative patterns, which changes the possible return
     +    values of path_matches_pattern_list(). Test both kinds of matches for
     +    increased coverage.
      
          To test this, we can create a blobless sparse clone, expand the
          sparse-checkout slightly, and then run 'git backfill --sparse' to see
     @@ Documentation/git-backfill.txt
      @@ Documentation/git-backfill.txt: git-backfill - Download missing objects in a partial clone
       SYNOPSIS
       --------
     - [verse]
     --'git backfill' [--batch-size=<n>]
     -+'git backfill' [--batch-size=<n>] [--[no-]sparse]
     + [synopsis]
     +-git backfill [--min-batch-size=<n>]
     ++git backfill [--min-batch-size=<n>] [--[no-]sparse]
       
       DESCRIPTION
       -----------
     @@ Documentation/git-backfill.txt: OPTIONS
       	blobs seen at a given path. The default minimum batch size is
       	50,000.
       
     -+--[no-]sparse::
     ++`--[no-]sparse`::
      +	Only download objects if they appear at a path that matches the
      +	current sparse-checkout.
      +
     @@ builtin/backfill.c
       #include "path-walk.h"
       
       static const char * const builtin_backfill_usage[] = {
     --	N_("git backfill [--batch-size=<n>]"),
     -+	N_("git backfill [--batch-size=<n>] [--[no-]sparse]"),
     +-	N_("git backfill [--min-batch-size=<n>]"),
     ++	N_("git backfill [--min-batch-size=<n>] [--[no-]sparse]"),
       	NULL
       };
       
     @@ t/t5620-backfill.sh: test_expect_success 'do partial clone 2, backfill min batch
      +	test_line_count = 24 missing
      +'
      +
     -+test_expect_success 'backfill --sparse without cone mode' '
     ++test_expect_success 'backfill --sparse without cone mode (positive)' '
      +	git clone --no-checkout --filter=blob:none		\
      +		--single-branch --branch=main 		\
      +		"file://$(pwd)/srv.bare" backfill4 &&
     @@ t/t5620-backfill.sh: test_expect_success 'do partial clone 2, backfill min batch
      +	git -C backfill4 sparse-checkout set --no-cone "**/file.1.txt" &&
      +	git -C backfill4 checkout main &&
      +
     ++	# Track new blob count
     ++	git -C backfill4 rev-list --quiet --objects --missing=print HEAD >missing &&
     ++	test_line_count = 42 missing &&
     ++
      +	GIT_TRACE2_EVENT="$(pwd)/no-cone-trace1" git \
      +		-C backfill4 backfill --sparse &&
      +	test_trace2_data promisor fetch_count 6 <no-cone-trace1 &&
     @@ t/t5620-backfill.sh: test_expect_success 'do partial clone 2, backfill min batch
      +	git -C backfill4 rev-list --quiet --objects --missing=print HEAD >missing &&
      +	test_line_count = 36 missing
      +'
     ++
     ++test_expect_success 'backfill --sparse without cone mode (negative)' '
     ++	git clone --no-checkout --filter=blob:none		\
     ++		--single-branch --branch=main 		\
     ++		"file://$(pwd)/srv.bare" backfill5 &&
     ++
     ++	# No blobs yet
     ++	git -C backfill5 rev-list --quiet --objects --missing=print HEAD >missing &&
     ++	test_line_count = 48 missing &&
     ++
     ++	# Define sparse-checkout by filename regardless of parent directory.
     ++	# This downloads 18 blobs to satisfy the checkout
     ++	git -C backfill5 sparse-checkout set --no-cone "**/file*" "!**/file.1.txt" &&
     ++	git -C backfill5 checkout main &&
     ++
     ++	# Track new blob count
     ++	git -C backfill5 rev-list --quiet --objects --missing=print HEAD >missing &&
     ++	test_line_count = 30 missing &&
     ++
     ++	GIT_TRACE2_EVENT="$(pwd)/no-cone-trace2" git \
     ++		-C backfill5 backfill --sparse &&
     ++	test_trace2_data promisor fetch_count 18 <no-cone-trace2 &&
     ++
     ++	# This walk needed to visit all directories to search for these paths, plus
     ++	# 12 extra "file.?.txt" paths than the previous test.
     ++	test_trace2_data path-walk paths 24 <no-cone-trace2 &&
     ++	git -C backfill5 rev-list --quiet --objects --missing=print HEAD >missing &&
     ++	test_line_count = 12 missing
     ++'
      +
       . "$TEST_DIRECTORY"/lib-httpd.sh
       start_httpd
 5:  1f765409eaf ! 5:  f22cf8b3485 backfill: assume --sparse when sparse-checkout is enabled
     @@ Commit message
       ## Documentation/git-backfill.txt ##
      @@ Documentation/git-backfill.txt: OPTIONS
       
     - --[no-]sparse::
     + `--[no-]sparse`::
       	Only download objects if they appear at a path that matches the
      -	current sparse-checkout.
      +	current sparse-checkout. If the sparse-checkout feature is enabled,
     @@ builtin/backfill.c: int cmd_backfill(int argc, const char **argv, const char *pr
      +	if (ctx.sparse < 0)
      +		ctx.sparse = core_apply_sparse_checkout;
      +
     - 	return do_backfill(&ctx);
     - }
     + 	result = do_backfill(&ctx);
     + 	backfill_context_clear(&ctx);
     + 	return result;
      
       ## t/t5620-backfill.sh ##
      @@ t/t5620-backfill.sh: test_expect_success 'do partial clone 2, backfill min batch size' '
     @@ t/t5620-backfill.sh: test_expect_success 'backfill --sparse' '
      +	test_line_count = 0 missing
       '
       
     - test_expect_success 'backfill --sparse without cone mode' '
     + test_expect_success 'backfill --sparse without cone mode (positive)' '

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v3 1/5] backfill: add builtin boilerplate
  2025-02-03 17:11   ` [PATCH v3 " Derrick Stolee via GitGitGadget
@ 2025-02-03 17:11     ` Derrick Stolee via GitGitGadget
  2025-02-03 17:11     ` [PATCH v3 2/5] backfill: basic functionality and tests Derrick Stolee via GitGitGadget
                       ` (4 subsequent siblings)
  5 siblings, 0 replies; 39+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-02-03 17:11 UTC (permalink / raw)
  To: git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Jean-Noël AVILA, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

In anticipation of implementing 'git backfill', populate the necessary files
with the boilerplate of a new builtin. Mark the builtin as experimental at
this time, allowing breaking changes in the near future, if necessary.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 .gitignore                     |  1 +
 Documentation/git-backfill.txt | 25 +++++++++++++++++++++++++
 Documentation/meson.build      |  1 +
 Makefile                       |  1 +
 builtin.h                      |  1 +
 builtin/backfill.c             | 28 ++++++++++++++++++++++++++++
 command-list.txt               |  1 +
 git.c                          |  1 +
 meson.build                    |  1 +
 9 files changed, 60 insertions(+)
 create mode 100644 Documentation/git-backfill.txt
 create mode 100644 builtin/backfill.c

diff --git a/.gitignore b/.gitignore
index e82aa19df03..95cd94c5044 100644
--- a/.gitignore
+++ b/.gitignore
@@ -19,6 +19,7 @@
 /git-apply
 /git-archimport
 /git-archive
+/git-backfill
 /git-bisect
 /git-blame
 /git-branch
diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
new file mode 100644
index 00000000000..ab384dad6e4
--- /dev/null
+++ b/Documentation/git-backfill.txt
@@ -0,0 +1,25 @@
+git-backfill(1)
+===============
+
+NAME
+----
+git-backfill - Download missing objects in a partial clone
+
+
+SYNOPSIS
+--------
+[synopsis]
+git backfill [<options>]
+
+DESCRIPTION
+-----------
+
+THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR MAY CHANGE IN THE FUTURE.
+
+SEE ALSO
+--------
+linkgit:git-clone[1].
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/Documentation/meson.build b/Documentation/meson.build
index 2a26fa8a5fe..5e9e3e19c5c 100644
--- a/Documentation/meson.build
+++ b/Documentation/meson.build
@@ -6,6 +6,7 @@ manpages = {
   'git-apply.txt' : 1,
   'git-archimport.txt' : 1,
   'git-archive.txt' : 1,
+  'git-backfill.txt' : 1,
   'git-bisect.txt' : 1,
   'git-blame.txt' : 1,
   'git-branch.txt' : 1,
diff --git a/Makefile b/Makefile
index 2f6e2d52955..efa199d390c 100644
--- a/Makefile
+++ b/Makefile
@@ -1203,6 +1203,7 @@ BUILTIN_OBJS += builtin/am.o
 BUILTIN_OBJS += builtin/annotate.o
 BUILTIN_OBJS += builtin/apply.o
 BUILTIN_OBJS += builtin/archive.o
+BUILTIN_OBJS += builtin/backfill.o
 BUILTIN_OBJS += builtin/bisect.o
 BUILTIN_OBJS += builtin/blame.o
 BUILTIN_OBJS += builtin/branch.o
diff --git a/builtin.h b/builtin.h
index f7b166b3348..89928ccf92f 100644
--- a/builtin.h
+++ b/builtin.h
@@ -120,6 +120,7 @@ int cmd_am(int argc, const char **argv, const char *prefix, struct repository *r
 int cmd_annotate(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_apply(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_archive(int argc, const char **argv, const char *prefix, struct repository *repo);
+int cmd_backfill(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_bisect(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_blame(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_branch(int argc, const char **argv, const char *prefix, struct repository *repo);
diff --git a/builtin/backfill.c b/builtin/backfill.c
new file mode 100644
index 00000000000..58d0866c0fc
--- /dev/null
+++ b/builtin/backfill.c
@@ -0,0 +1,28 @@
+#include "builtin.h"
+#include "config.h"
+#include "parse-options.h"
+#include "repository.h"
+#include "object.h"
+
+static const char * const builtin_backfill_usage[] = {
+	N_("git backfill [<options>]"),
+	NULL
+};
+
+int cmd_backfill(int argc, const char **argv, const char *prefix, struct repository *repo)
+{
+	struct option options[] = {
+		OPT_END(),
+	};
+
+	show_usage_if_asked(argc, argv, builtin_backfill_usage[0]);
+
+	argc = parse_options(argc, argv, prefix, options, builtin_backfill_usage,
+			     0);
+
+	repo_config(repo, git_default_config, NULL);
+
+	die(_("not implemented"));
+
+	return 0;
+}
diff --git a/command-list.txt b/command-list.txt
index e0bb87b3b5c..c537114b468 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -60,6 +60,7 @@ git-annotate                            ancillaryinterrogators
 git-apply                               plumbingmanipulators            complete
 git-archimport                          foreignscminterface
 git-archive                             mainporcelain
+git-backfill                            mainporcelain           history
 git-bisect                              mainporcelain           info
 git-blame                               ancillaryinterrogators          complete
 git-branch                              mainporcelain           history
diff --git a/git.c b/git.c
index a94dab37702..a9b45fcef79 100644
--- a/git.c
+++ b/git.c
@@ -506,6 +506,7 @@ static struct cmd_struct commands[] = {
 	{ "annotate", cmd_annotate, RUN_SETUP },
 	{ "apply", cmd_apply, RUN_SETUP_GENTLY },
 	{ "archive", cmd_archive, RUN_SETUP_GENTLY },
+	{ "backfill", cmd_backfill, RUN_SETUP },
 	{ "bisect", cmd_bisect, RUN_SETUP },
 	{ "blame", cmd_blame, RUN_SETUP },
 	{ "branch", cmd_branch, RUN_SETUP | DELAY_PAGER_CONFIG },
diff --git a/meson.build b/meson.build
index 548eac62b28..527c015acfa 100644
--- a/meson.build
+++ b/meson.build
@@ -487,6 +487,7 @@ builtin_sources = [
   'builtin/annotate.c',
   'builtin/apply.c',
   'builtin/archive.c',
+  'builtin/backfill.c',
   'builtin/bisect.c',
   'builtin/blame.c',
   'builtin/branch.c',
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v3 2/5] backfill: basic functionality and tests
  2025-02-03 17:11   ` [PATCH v3 " Derrick Stolee via GitGitGadget
  2025-02-03 17:11     ` [PATCH v3 1/5] backfill: add builtin boilerplate Derrick Stolee via GitGitGadget
@ 2025-02-03 17:11     ` Derrick Stolee via GitGitGadget
  2025-02-03 17:11     ` [PATCH v3 3/5] backfill: add --min-batch-size=<n> option Derrick Stolee via GitGitGadget
                       ` (3 subsequent siblings)
  5 siblings, 0 replies; 39+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-02-03 17:11 UTC (permalink / raw)
  To: git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Jean-Noël AVILA, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The default behavior of 'git backfill' is to fetch all missing blobs that
are reachable from HEAD. Document and test this behavior.

The implementation is a very simple use of the path-walk API, initializing
the revision walk at HEAD to start the path-walk from all commits reachable
from HEAD. Ignore the object arrays that correspond to tree entries,
assuming that they are all present already.

The path-walk API provides lists of objects in batches according to a
common path, but that list could be very small. We want to balance the
number of requests to the server with the ability to have the process
interrupted with minimal repeated work to catch up in the next run.
Based on some experiments (detailed in the next change) a minimum batch
size of 50,000 is selected for the default.

This batch size is a _minimum_. As the path-walk API emits lists of blob
IDs, they are collected into a list of objects for a request to the
server. When that list is at least the minimum batch size, then the
request is sent to the server for the new objects. However, the list of
blob IDs from the path-walk API could be much longer than the batch
size. At this moment, it is unclear if there is a benefit to split the
list when there are too many objects at the same path.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/git-backfill.txt            |  31 +++++++
 Documentation/technical/api-path-walk.txt |   3 +-
 builtin/backfill.c                        | 102 +++++++++++++++++++++-
 t/meson.build                             |   1 +
 t/t5620-backfill.sh                       |  94 ++++++++++++++++++++
 5 files changed, 227 insertions(+), 4 deletions(-)
 create mode 100755 t/t5620-backfill.sh

diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
index ab384dad6e4..56cbb9ffd82 100644
--- a/Documentation/git-backfill.txt
+++ b/Documentation/git-backfill.txt
@@ -14,6 +14,37 @@ git backfill [<options>]
 DESCRIPTION
 -----------
 
+Blobless partial clones are created using `git clone --filter=blob:none`
+and then configure the local repository such that the Git client avoids
+downloading blob objects unless they are required for a local operation.
+This initially means that the clone and later fetches download reachable
+commits and trees but no blobs. Later operations that change the `HEAD`
+pointer, such as `git checkout` or `git merge`, may need to download
+missing blobs in order to complete their operation.
+
+In the worst cases, commands that compute blob diffs, such as `git blame`,
+become very slow as they download the missing blobs in single-blob
+requests to satisfy the missing object as the Git command needs it. This
+leads to multiple download requests and no ability for the Git server to
+provide delta compression across those objects.
+
+The `git backfill` command provides a way for the user to request that
+Git downloads the missing blobs (with optional filters) such that the
+missing blobs representing historical versions of files can be downloaded
+in batches. The `backfill` command attempts to optimize the request by
+grouping blobs that appear at the same path, hopefully leading to good
+delta compression in the packfile sent by the server.
+
+In this way, `git backfill` provides a mechanism to break a large clone
+into smaller chunks. Starting with a blobless partial clone with `git
+clone --filter=blob:none` and then running `git backfill` in the local
+repository provides a way to download all reachable objects in several
+smaller network calls than downloading the entire repository at clone
+time.
+
+By default, `git backfill` downloads all blobs reachable from the `HEAD`
+commit. This set can be restricted or expanded using various options.
+
 THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR MAY CHANGE IN THE FUTURE.
 
 SEE ALSO
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index 7075d0d5ab5..1fba0ce04cb 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -60,4 +60,5 @@ Examples
 --------
 
 See example usages in:
-	`t/helper/test-path-walk.c`
+	`t/helper/test-path-walk.c`,
+	`builtin/backfill.c`
diff --git a/builtin/backfill.c b/builtin/backfill.c
index 58d0866c0fc..0eca175a7fe 100644
--- a/builtin/backfill.c
+++ b/builtin/backfill.c
@@ -1,16 +1,112 @@
 #include "builtin.h"
+#include "git-compat-util.h"
 #include "config.h"
 #include "parse-options.h"
 #include "repository.h"
+#include "commit.h"
+#include "hex.h"
+#include "tree.h"
+#include "tree-walk.h"
 #include "object.h"
+#include "object-store-ll.h"
+#include "oid-array.h"
+#include "oidset.h"
+#include "promisor-remote.h"
+#include "strmap.h"
+#include "string-list.h"
+#include "revision.h"
+#include "trace2.h"
+#include "progress.h"
+#include "packfile.h"
+#include "path-walk.h"
 
 static const char * const builtin_backfill_usage[] = {
 	N_("git backfill [<options>]"),
 	NULL
 };
 
+struct backfill_context {
+	struct repository *repo;
+	struct oid_array current_batch;
+	size_t min_batch_size;
+};
+
+static void backfill_context_clear(struct backfill_context *ctx)
+{
+	oid_array_clear(&ctx->current_batch);
+}
+
+static void download_batch(struct backfill_context *ctx)
+{
+	promisor_remote_get_direct(ctx->repo,
+				   ctx->current_batch.oid,
+				   ctx->current_batch.nr);
+	oid_array_clear(&ctx->current_batch);
+
+	/*
+	 * We likely have a new packfile. Add it to the packed list to
+	 * avoid possible duplicate downloads of the same objects.
+	 */
+	reprepare_packed_git(ctx->repo);
+}
+
+static int fill_missing_blobs(const char *path UNUSED,
+			      struct oid_array *list,
+			      enum object_type type,
+			      void *data)
+{
+	struct backfill_context *ctx = data;
+
+	if (type != OBJ_BLOB)
+		return 0;
+
+	for (size_t i = 0; i < list->nr; i++) {
+		if (!has_object(ctx->repo, &list->oid[i],
+				OBJECT_INFO_FOR_PREFETCH))
+			oid_array_append(&ctx->current_batch, &list->oid[i]);
+	}
+
+	if (ctx->current_batch.nr >= ctx->min_batch_size)
+		download_batch(ctx);
+
+	return 0;
+}
+
+static int do_backfill(struct backfill_context *ctx)
+{
+	struct rev_info revs;
+	struct path_walk_info info = PATH_WALK_INFO_INIT;
+	int ret;
+
+	repo_init_revisions(ctx->repo, &revs, "");
+	handle_revision_arg("HEAD", &revs, 0, 0);
+
+	info.blobs = 1;
+	info.tags = info.commits = info.trees = 0;
+
+	info.revs = &revs;
+	info.path_fn = fill_missing_blobs;
+	info.path_fn_data = ctx;
+
+	ret = walk_objects_by_path(&info);
+
+	/* Download the objects that did not fill a batch. */
+	if (!ret)
+		download_batch(ctx);
+
+	path_walk_info_clear(&info);
+	release_revisions(&revs);
+	return ret;
+}
+
 int cmd_backfill(int argc, const char **argv, const char *prefix, struct repository *repo)
 {
+	int result;
+	struct backfill_context ctx = {
+		.repo = repo,
+		.current_batch = OID_ARRAY_INIT,
+		.min_batch_size = 50000,
+	};
 	struct option options[] = {
 		OPT_END(),
 	};
@@ -22,7 +118,7 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
 
 	repo_config(repo, git_default_config, NULL);
 
-	die(_("not implemented"));
-
-	return 0;
+	result = do_backfill(&ctx);
+	backfill_context_clear(&ctx);
+	return result;
 }
diff --git a/t/meson.build b/t/meson.build
index 35f25ca4a1d..af53e8ee583 100644
--- a/t/meson.build
+++ b/t/meson.build
@@ -721,6 +721,7 @@ integration_tests = [
   't5617-clone-submodules-remote.sh',
   't5618-alternate-refs.sh',
   't5619-clone-local-ambiguous-transport.sh',
+  't5620-backfill.sh',
   't5700-protocol-v1.sh',
   't5701-git-serve.sh',
   't5702-protocol-v2.sh',
diff --git a/t/t5620-backfill.sh b/t/t5620-backfill.sh
new file mode 100755
index 00000000000..64326362d80
--- /dev/null
+++ b/t/t5620-backfill.sh
@@ -0,0 +1,94 @@
+#!/bin/sh
+
+test_description='git backfill on partial clones'
+
+GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
+export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
+
+. ./test-lib.sh
+
+# We create objects in the 'src' repo.
+test_expect_success 'setup repo for object creation' '
+	echo "{print \$1}" >print_1.awk &&
+	echo "{print \$2}" >print_2.awk &&
+
+	git init src &&
+
+	mkdir -p src/a/b/c &&
+	mkdir -p src/d/e &&
+
+	for i in 1 2
+	do
+		for n in 1 2 3 4
+		do
+			echo "Version $i of file $n" > src/file.$n.txt &&
+			echo "Version $i of file a/$n" > src/a/file.$n.txt &&
+			echo "Version $i of file a/b/$n" > src/a/b/file.$n.txt &&
+			echo "Version $i of file a/b/c/$n" > src/a/b/c/file.$n.txt &&
+			echo "Version $i of file d/$n" > src/d/file.$n.txt &&
+			echo "Version $i of file d/e/$n" > src/d/e/file.$n.txt &&
+			git -C src add . &&
+			git -C src commit -m "Iteration $n" || return 1
+		done
+	done
+'
+
+# Clone 'src' into 'srv.bare' so we have a bare repo to be our origin
+# server for the partial clone.
+test_expect_success 'setup bare clone for server' '
+	git clone --bare "file://$(pwd)/src" srv.bare &&
+	git -C srv.bare config --local uploadpack.allowfilter 1 &&
+	git -C srv.bare config --local uploadpack.allowanysha1inwant 1
+'
+
+# do basic partial clone from "srv.bare"
+test_expect_success 'do partial clone 1, backfill gets all objects' '
+	git clone --no-checkout --filter=blob:none	\
+		--single-branch --branch=main 		\
+		"file://$(pwd)/srv.bare" backfill1 &&
+
+	# Backfill with no options gets everything reachable from HEAD.
+	GIT_TRACE2_EVENT="$(pwd)/backfill-file-trace" git \
+		-C backfill1 backfill &&
+
+	# We should have engaged the partial clone machinery
+	test_trace2_data promisor fetch_count 48 <backfill-file-trace &&
+
+	# No more missing objects!
+	git -C backfill1 rev-list --quiet --objects --missing=print HEAD >revs2 &&
+	test_line_count = 0 revs2
+'
+
+. "$TEST_DIRECTORY"/lib-httpd.sh
+start_httpd
+
+test_expect_success 'create a partial clone over HTTP' '
+	SERVER="$HTTPD_DOCUMENT_ROOT_PATH/server" &&
+	rm -rf "$SERVER" repo &&
+	git clone --bare "file://$(pwd)/src" "$SERVER" &&
+	test_config -C "$SERVER" uploadpack.allowfilter 1 &&
+	test_config -C "$SERVER" uploadpack.allowanysha1inwant 1 &&
+
+	git clone --no-checkout --filter=blob:none \
+		"$HTTPD_URL/smart/server" backfill-http
+'
+
+test_expect_success 'backfilling over HTTP succeeds' '
+	GIT_TRACE2_EVENT="$(pwd)/backfill-http-trace" git \
+		-C backfill-http backfill &&
+
+	# We should have engaged the partial clone machinery
+	test_trace2_data promisor fetch_count 48 <backfill-http-trace &&
+
+	# Confirm all objects are present, none missing.
+	git -C backfill-http rev-list --objects --all >rev-list-out &&
+	awk "{print \$1;}" <rev-list-out >oids &&
+	GIT_TRACE2_EVENT="$(pwd)/walk-trace" git -C backfill-http \
+		cat-file --batch-check <oids >batch-out &&
+	! grep missing batch-out
+'
+
+# DO NOT add non-httpd-specific tests here, because the last part of this
+# test script is only executed when httpd is available and enabled.
+
+test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v3 3/5] backfill: add --min-batch-size=<n> option
  2025-02-03 17:11   ` [PATCH v3 " Derrick Stolee via GitGitGadget
  2025-02-03 17:11     ` [PATCH v3 1/5] backfill: add builtin boilerplate Derrick Stolee via GitGitGadget
  2025-02-03 17:11     ` [PATCH v3 2/5] backfill: basic functionality and tests Derrick Stolee via GitGitGadget
@ 2025-02-03 17:11     ` Derrick Stolee via GitGitGadget
  2025-02-03 17:11     ` [PATCH v3 4/5] backfill: add --sparse option Derrick Stolee via GitGitGadget
                       ` (2 subsequent siblings)
  5 siblings, 0 replies; 39+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-02-03 17:11 UTC (permalink / raw)
  To: git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Jean-Noël AVILA, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

Users may want to specify a minimum batch size for their needs. This is only
a minimum: the path-walk API provides a list of OIDs that correspond to the
same path, and thus it is optimal to allow delta compression across those
objects in a single server request.

We could consider limiting the request to have a maximum batch size in the
future. For now, we let the path-walk API batches determine the
boundaries.

To get a feeling for the value of specifying the --min-batch-size parameter,
I tested a number of open source repositories available on GitHub. The
procedure was generally:

 1. git clone --filter=blob:none <url>
 2. git backfill

Checking the number of packfiles and the size of the .git/objects/pack
directory helps to identify the effects of different batch sizes.

For the Git repository, we get these results:

| Batch Size      | Pack Count | Pack Size | Time  |
|-----------------|------------|-----------|-------|
| (Initial clone) | 2          | 119 MB    |       |
| 25K             | 8          | 290 MB    | 24s   |
| 50K             | 5          | 290 MB    | 24s   |
| 100K            | 4          | 290 MB    | 29s   |

Other than the packfile counts decreasing as we need fewer batches, the
size and time required is not changing much for this small example.

For the nodejs/node repository, we see these results:

| Batch Size      | Pack Count | Pack Size | Time   |
|-----------------|------------|-----------|--------|
| (Initial clone) | 2          | 330 MB    |        |
| 25K             | 19         | 1,222 MB  | 1m 22s |
| 50K             | 11         | 1,221 MB  | 1m 24s |
| 100K            | 7          | 1,223 MB  | 1m 40s |
| 250K            | 4          | 1,224 MB  | 2m 23s |
| 500K            | 3          | 1,216 MB  | 4m 38s |

Here, we don't have much difference in the size of the repo, though the
500K batch size results in a few MB gained. That comes at a cost of a
much longer time. This extra time is due to server-side delta
compression happening as the on-disk deltas don't appear to be reusable
all the time. But for smaller batch sizes, the server is able to find
reasonable deltas partly because we are asking for objects that appear
in the same region of the directory tree and include all versions of a
file at a specific path.

To contrast this example, I tested the microsoft/fluentui repo, which
has been known to have inefficient packing due to name hash collisions.
These results are found before GitHub had the opportunity to repack the
server with more advanced name hash versions:

| Batch Size      | Pack Count | Pack Size | Time   |
|-----------------|------------|-----------|--------|
| (Initial clone) | 2          | 105 MB    |        |
| 5K              | 53         | 348 MB    | 2m 26s |
| 10K             | 28         | 365 MB    | 2m 22s |
| 15K             | 19         | 407 MB    | 2m 21s |
| 20K             | 15         | 393 MB    | 2m 28s |
| 25K             | 13         | 417 MB    | 2m 06s |
| 50K             | 8          | 509 MB    | 1m 34s |
| 100K            | 5          | 535 MB    | 1m 56s |
| 250K            | 4          | 698 MB    | 1m 33s |
| 500K            | 3          | 696 MB    | 1m 42s |

Here, a larger variety of batch sizes were chosen because of the great
variation in results. By asking the server to download small batches
corresponding to fewer paths at a time, the server is able to provide
better compression for these batches than it would for a regular clone.
A typical full clone for this repository would require 738 MB.

This example justifies the choice to batch requests by path name,
leading to improved communication with a server that is not optimally
packed.

Finally, the same experiment for the Linux repository had these results:

| Batch Size      | Pack Count | Pack Size | Time    |
|-----------------|------------|-----------|---------|
| (Initial clone) | 2          | 2,153 MB  |         |
| 25K             | 63         | 6,380 MB  | 14m 08s |
| 50K             | 58         | 6,126 MB  | 15m 11s |
| 100K            | 30         | 6,135 MB  | 18m 11s |
| 250K            | 14         | 6,146 MB  | 18m 22s |
| 500K            | 8          | 6,143 MB  | 33m 29s |

Even in this example, where the default name hash algorithm leads to
decent compression of the Linux kernel repository, there is value for
selecting a smaller batch size, to a limit. The 25K batch size has the
fastest time, but uses 250 MB more than the 50K batch size. The 500K
batch size took much more time due to server compression time and thus
we should avoid large batch sizes like this.

Based on these experiments, a batch size of 50,000 was chosen as the
default value.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/git-backfill.txt | 12 +++++++++++-
 builtin/backfill.c             |  4 +++-
 t/t5620-backfill.sh            | 18 ++++++++++++++++++
 3 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
index 56cbb9ffd82..136a1f1d294 100644
--- a/Documentation/git-backfill.txt
+++ b/Documentation/git-backfill.txt
@@ -9,7 +9,7 @@ git-backfill - Download missing objects in a partial clone
 SYNOPSIS
 --------
 [synopsis]
-git backfill [<options>]
+git backfill [--min-batch-size=<n>]
 
 DESCRIPTION
 -----------
@@ -47,6 +47,16 @@ commit. This set can be restricted or expanded using various options.
 
 THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR MAY CHANGE IN THE FUTURE.
 
+
+OPTIONS
+-------
+
+`--min-batch-size=<n>`::
+	Specify a minimum size for a batch of missing objects to request
+	from the server. This size may be exceeded by the last set of
+	blobs seen at a given path. The default minimum batch size is
+	50,000.
+
 SEE ALSO
 --------
 linkgit:git-clone[1].
diff --git a/builtin/backfill.c b/builtin/backfill.c
index 0eca175a7fe..cfebee6e17b 100644
--- a/builtin/backfill.c
+++ b/builtin/backfill.c
@@ -21,7 +21,7 @@
 #include "path-walk.h"
 
 static const char * const builtin_backfill_usage[] = {
-	N_("git backfill [<options>]"),
+	N_("git backfill [--min-batch-size=<n>]"),
 	NULL
 };
 
@@ -108,6 +108,8 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
 		.min_batch_size = 50000,
 	};
 	struct option options[] = {
+		OPT_INTEGER(0, "min-batch-size", &ctx.min_batch_size,
+			    N_("Minimum number of objects to request at a time")),
 		OPT_END(),
 	};
 
diff --git a/t/t5620-backfill.sh b/t/t5620-backfill.sh
index 64326362d80..36107a51c54 100755
--- a/t/t5620-backfill.sh
+++ b/t/t5620-backfill.sh
@@ -59,6 +59,24 @@ test_expect_success 'do partial clone 1, backfill gets all objects' '
 	test_line_count = 0 revs2
 '
 
+test_expect_success 'do partial clone 2, backfill min batch size' '
+	git clone --no-checkout --filter=blob:none	\
+		--single-branch --branch=main 		\
+		"file://$(pwd)/srv.bare" backfill2 &&
+
+	GIT_TRACE2_EVENT="$(pwd)/batch-trace" git \
+		-C backfill2 backfill --min-batch-size=20 &&
+
+	# Batches were used
+	test_trace2_data promisor fetch_count 20 <batch-trace >matches &&
+	test_line_count = 2 matches &&
+	test_trace2_data promisor fetch_count 8 <batch-trace &&
+
+	# No more missing objects!
+	git -C backfill2 rev-list --quiet --objects --missing=print HEAD >revs2 &&
+	test_line_count = 0 revs2
+'
+
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v3 4/5] backfill: add --sparse option
  2025-02-03 17:11   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (2 preceding siblings ...)
  2025-02-03 17:11     ` [PATCH v3 3/5] backfill: add --min-batch-size=<n> option Derrick Stolee via GitGitGadget
@ 2025-02-03 17:11     ` Derrick Stolee via GitGitGadget
  2025-02-03 17:11     ` [PATCH v3 5/5] backfill: assume --sparse when sparse-checkout is enabled Derrick Stolee via GitGitGadget
  2025-02-04  0:18     ` [PATCH v3 0/5] PATH WALK III: Add 'git backfill' command Junio C Hamano
  5 siblings, 0 replies; 39+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-02-03 17:11 UTC (permalink / raw)
  To: git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Jean-Noël AVILA, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

One way to significantly reduce the cost of a Git clone and later fetches is
to use a blobless partial clone and combine that with a sparse-checkout that
reduces the paths that need to be populated in the working directory. Not
only does this reduce the cost of clones and fetches, the sparse-checkout
reduces the number of objects needed to download from a promisor remote.

However, history investigations can be expensive as computing blob diffs
will trigger promisor remote requests for one object at a time. This can be
avoided by downloading the blobs needed for the given sparse-checkout using
'git backfill' and its new '--sparse' mode, at a time that the user is
willing to pay that extra cost.

Note that this is distinctly different from the '--filter=sparse:<oid>'
option, as this assumes that the partial clone has all reachable trees and
we are using client-side logic to avoid downloading blobs outside of the
sparse-checkout cone. This avoids the server-side cost of walking trees
while also achieving a similar goal. It also downloads in batches based on
similar path names, presenting a resumable download if things are
interrupted.

This augments the path-walk API to have a possibly-NULL 'pl' member that may
point to a 'struct pattern_list'. This could be more general than the
sparse-checkout definition at HEAD, but 'git backfill --sparse' is currently
the only consumer.

Be sure to test this in both cone mode and not cone mode. Cone mode has the
benefit that the path-walk can skip certain paths once they would expand
beyond the sparse-checkout. Non-cone mode can describe the included files
using both positive and negative patterns, which changes the possible return
values of path_matches_pattern_list(). Test both kinds of matches for
increased coverage.

To test this, we can create a blobless sparse clone, expand the
sparse-checkout slightly, and then run 'git backfill --sparse' to see
how much data is downloaded. The general steps are

 1. git clone --filter=blob:none --sparse <url>
 2. git sparse-checkout set <dir1> ... <dirN>
 3. git backfill --sparse

For the Git repository with the 'builtin' directory in the
sparse-checkout, we get these results for various batch sizes:

| Batch Size      | Pack Count | Pack Size | Time  |
|-----------------|------------|-----------|-------|
| (Initial clone) | 3          | 110 MB    |       |
| 10K             | 12         | 192 MB    | 17.2s |
| 15K             | 9          | 192 MB    | 15.5s |
| 20K             | 8          | 192 MB    | 15.5s |
| 25K             | 7          | 192 MB    | 14.7s |

This case matters less because a full clone of the Git repository from
GitHub is currently at 277 MB.

Using a copy of the Linux repository with the 'kernel/' directory in the
sparse-checkout, we get these results:

| Batch Size      | Pack Count | Pack Size | Time |
|-----------------|------------|-----------|------|
| (Initial clone) | 2          | 1,876 MB  |      |
| 10K             | 11         | 2,187 MB  | 46s  |
| 25K             | 7          | 2,188 MB  | 43s  |
| 50K             | 5          | 2,194 MB  | 44s  |
| 100K            | 4          | 2,194 MB  | 48s  |

This case is more meaningful because a full clone of the Linux
repository is currently over 6 GB, so this is a valuable way to download
a fraction of the repository and no longer need network access for all
reachable objects within the sparse-checkout.

Choosing a batch size will depend on a lot of factors, including the
user's network speed or reliability, the repository's file structure,
and how many versions there are of the file within the sparse-checkout
scope. There will not be a one-size-fits-all solution.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/git-backfill.txt            |  6 +-
 Documentation/technical/api-path-walk.txt |  8 +++
 builtin/backfill.c                        | 15 +++-
 dir.c                                     | 10 +--
 dir.h                                     |  3 +
 path-walk.c                               | 28 ++++++--
 path-walk.h                               | 11 +++
 t/helper/test-path-walk.c                 | 22 +++++-
 t/t5620-backfill.sh                       | 88 +++++++++++++++++++++++
 t/t6601-path-walk.sh                      | 32 +++++++++
 10 files changed, 208 insertions(+), 15 deletions(-)

diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
index 136a1f1d294..a28678983e3 100644
--- a/Documentation/git-backfill.txt
+++ b/Documentation/git-backfill.txt
@@ -9,7 +9,7 @@ git-backfill - Download missing objects in a partial clone
 SYNOPSIS
 --------
 [synopsis]
-git backfill [--min-batch-size=<n>]
+git backfill [--min-batch-size=<n>] [--[no-]sparse]
 
 DESCRIPTION
 -----------
@@ -57,6 +57,10 @@ OPTIONS
 	blobs seen at a given path. The default minimum batch size is
 	50,000.
 
+`--[no-]sparse`::
+	Only download objects if they appear at a path that matches the
+	current sparse-checkout.
+
 SEE ALSO
 --------
 linkgit:git-clone[1].
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index 1fba0ce04cb..3e089211fb4 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -56,6 +56,14 @@ better off using the revision walk API instead.
 	the revision walk so that the walk emits commits marked with the
 	`UNINTERESTING` flag.
 
+`pl`::
+	This pattern list pointer allows focusing the path-walk search to
+	a set of patterns, only emitting paths that match the given
+	patterns. See linkgit:gitignore[5] or
+	linkgit:git-sparse-checkout[1] for details about pattern lists.
+	When the pattern list uses cone-mode patterns, then the path-walk
+	API can prune the set of paths it walks to improve performance.
+
 Examples
 --------
 
diff --git a/builtin/backfill.c b/builtin/backfill.c
index cfebee6e17b..d7b997fd6f7 100644
--- a/builtin/backfill.c
+++ b/builtin/backfill.c
@@ -4,6 +4,7 @@
 #include "parse-options.h"
 #include "repository.h"
 #include "commit.h"
+#include "dir.h"
 #include "hex.h"
 #include "tree.h"
 #include "tree-walk.h"
@@ -21,7 +22,7 @@
 #include "path-walk.h"
 
 static const char * const builtin_backfill_usage[] = {
-	N_("git backfill [--min-batch-size=<n>]"),
+	N_("git backfill [--min-batch-size=<n>] [--[no-]sparse]"),
 	NULL
 };
 
@@ -29,6 +30,7 @@ struct backfill_context {
 	struct repository *repo;
 	struct oid_array current_batch;
 	size_t min_batch_size;
+	int sparse;
 };
 
 static void backfill_context_clear(struct backfill_context *ctx)
@@ -78,6 +80,14 @@ static int do_backfill(struct backfill_context *ctx)
 	struct path_walk_info info = PATH_WALK_INFO_INIT;
 	int ret;
 
+	if (ctx->sparse) {
+		CALLOC_ARRAY(info.pl, 1);
+		if (get_sparse_checkout_patterns(info.pl)) {
+			path_walk_info_clear(&info);
+			return error(_("problem loading sparse-checkout"));
+		}
+	}
+
 	repo_init_revisions(ctx->repo, &revs, "");
 	handle_revision_arg("HEAD", &revs, 0, 0);
 
@@ -106,10 +116,13 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
 		.repo = repo,
 		.current_batch = OID_ARRAY_INIT,
 		.min_batch_size = 50000,
+		.sparse = 0,
 	};
 	struct option options[] = {
 		OPT_INTEGER(0, "min-batch-size", &ctx.min_batch_size,
 			    N_("Minimum number of objects to request at a time")),
+		OPT_BOOL(0, "sparse", &ctx.sparse,
+			 N_("Restrict the missing objects to the current sparse-checkout")),
 		OPT_END(),
 	};
 
diff --git a/dir.c b/dir.c
index 5b2181e5899..16ccfe7e4e8 100644
--- a/dir.c
+++ b/dir.c
@@ -1093,10 +1093,6 @@ static void invalidate_directory(struct untracked_cache *uc,
 		dir->dirs[i]->recurse = 0;
 }
 
-static int add_patterns_from_buffer(char *buf, size_t size,
-				    const char *base, int baselen,
-				    struct pattern_list *pl);
-
 /* Flags for add_patterns() */
 #define PATTERN_NOFOLLOW (1<<0)
 
@@ -1186,9 +1182,9 @@ static int add_patterns(const char *fname, const char *base, int baselen,
 	return 0;
 }
 
-static int add_patterns_from_buffer(char *buf, size_t size,
-				    const char *base, int baselen,
-				    struct pattern_list *pl)
+int add_patterns_from_buffer(char *buf, size_t size,
+			     const char *base, int baselen,
+			     struct pattern_list *pl)
 {
 	char *orig = buf;
 	int i, lineno = 1;
diff --git a/dir.h b/dir.h
index a3a2f00f5d9..6cfef5df660 100644
--- a/dir.h
+++ b/dir.h
@@ -467,6 +467,9 @@ void add_patterns_from_file(struct dir_struct *, const char *fname);
 int add_patterns_from_blob_to_list(struct object_id *oid,
 				   const char *base, int baselen,
 				   struct pattern_list *pl);
+int add_patterns_from_buffer(char *buf, size_t size,
+			     const char *base, int baselen,
+			     struct pattern_list *pl);
 void parse_path_pattern(const char **string, int *patternlen, unsigned *flags, int *nowildcardlen);
 void add_pattern(const char *string, const char *base,
 		 int baselen, struct pattern_list *pl, int srcpos);
diff --git a/path-walk.c b/path-walk.c
index 9715a5550ef..341bdd2ba4e 100644
--- a/path-walk.c
+++ b/path-walk.c
@@ -12,6 +12,7 @@
 #include "object.h"
 #include "oid-array.h"
 #include "prio-queue.h"
+#include "repository.h"
 #include "revision.h"
 #include "string-list.h"
 #include "strmap.h"
@@ -172,6 +173,23 @@ static int add_tree_entries(struct path_walk_context *ctx,
 		if (type == OBJ_TREE)
 			strbuf_addch(&path, '/');
 
+		if (ctx->info->pl) {
+			int dtype;
+			enum pattern_match_result match;
+			match = path_matches_pattern_list(path.buf, path.len,
+							  path.buf + base_len, &dtype,
+							  ctx->info->pl,
+							  ctx->repo->index);
+
+			if (ctx->info->pl->use_cone_patterns &&
+			    match == NOT_MATCHED)
+				continue;
+			else if (!ctx->info->pl->use_cone_patterns &&
+				 type == OBJ_BLOB &&
+				 match != MATCHED)
+				continue;
+		}
+
 		if (!(list = strmap_get(&ctx->paths_to_lists, path.buf))) {
 			CALLOC_ARRAY(list, 1);
 			list->type = type;
@@ -582,10 +600,10 @@ void path_walk_info_init(struct path_walk_info *info)
 	memcpy(info, &empty, sizeof(empty));
 }
 
-void path_walk_info_clear(struct path_walk_info *info UNUSED)
+void path_walk_info_clear(struct path_walk_info *info)
 {
-	/*
-	 * This destructor is empty for now, as info->revs
-	 * is not owned by 'struct path_walk_info'.
-	 */
+	if (info->pl) {
+		clear_pattern_list(info->pl);
+		free(info->pl);
+	}
 }
diff --git a/path-walk.h b/path-walk.h
index 414d6db23c2..473ee9d361c 100644
--- a/path-walk.h
+++ b/path-walk.h
@@ -6,6 +6,7 @@
 
 struct rev_info;
 struct oid_array;
+struct pattern_list;
 
 /**
  * The type of a function pointer for the method that is called on a list of
@@ -48,6 +49,16 @@ struct path_walk_info {
 	 * walk the children of such trees.
 	 */
 	int prune_all_uninteresting;
+
+	/**
+	 * Specify a sparse-checkout definition to match our paths to. Do not
+	 * walk outside of this sparse definition. If the patterns are in
+	 * cone mode, then the search may prune directories that are outside
+	 * of the cone. If not in cone mode, then all tree paths will be
+	 * explored but the path_fn will only be called when the path matches
+	 * the sparse-checkout patterns.
+	 */
+	struct pattern_list *pl;
 };
 
 #define PATH_WALK_INFO_INIT {   \
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
index 7f2d409c5bc..61e845e5ec2 100644
--- a/t/helper/test-path-walk.c
+++ b/t/helper/test-path-walk.c
@@ -1,6 +1,7 @@
 #define USE_THE_REPOSITORY_VARIABLE
 
 #include "test-tool.h"
+#include "dir.h"
 #include "environment.h"
 #include "hex.h"
 #include "object-name.h"
@@ -9,6 +10,7 @@
 #include "revision.h"
 #include "setup.h"
 #include "parse-options.h"
+#include "strbuf.h"
 #include "path-walk.h"
 #include "oid-array.h"
 
@@ -65,7 +67,7 @@ static int emit_block(const char *path, struct oid_array *oids,
 
 int cmd__path_walk(int argc, const char **argv)
 {
-	int res;
+	int res, stdin_pl = 0;
 	struct rev_info revs = REV_INFO_INIT;
 	struct path_walk_info info = PATH_WALK_INFO_INIT;
 	struct path_walk_test_data data = { 0 };
@@ -80,6 +82,8 @@ int cmd__path_walk(int argc, const char **argv)
 			 N_("toggle inclusion of tree objects")),
 		OPT_BOOL(0, "prune", &info.prune_all_uninteresting,
 			 N_("toggle pruning of uninteresting paths")),
+		OPT_BOOL(0, "stdin-pl", &stdin_pl,
+			 N_("read a pattern list over stdin")),
 		OPT_END(),
 	};
 
@@ -99,6 +103,17 @@ int cmd__path_walk(int argc, const char **argv)
 	info.path_fn = emit_block;
 	info.path_fn_data = &data;
 
+	if (stdin_pl) {
+		struct strbuf in = STRBUF_INIT;
+		CALLOC_ARRAY(info.pl, 1);
+
+		info.pl->use_cone_patterns = 1;
+
+		strbuf_fread(&in, 2048, stdin);
+		add_patterns_from_buffer(in.buf, in.len, "", 0, info.pl);
+		strbuf_release(&in);
+	}
+
 	res = walk_objects_by_path(&info);
 
 	printf("commits:%" PRIuMAX "\n"
@@ -107,6 +122,11 @@ int cmd__path_walk(int argc, const char **argv)
 	       "tags:%" PRIuMAX "\n",
 	       data.commit_nr, data.tree_nr, data.blob_nr, data.tag_nr);
 
+	if (info.pl) {
+		clear_pattern_list(info.pl);
+		free(info.pl);
+	}
+
 	release_revisions(&revs);
 	return res;
 }
diff --git a/t/t5620-backfill.sh b/t/t5620-backfill.sh
index 36107a51c54..6b72e9d0e31 100755
--- a/t/t5620-backfill.sh
+++ b/t/t5620-backfill.sh
@@ -77,6 +77,94 @@ test_expect_success 'do partial clone 2, backfill min batch size' '
 	test_line_count = 0 revs2
 '
 
+test_expect_success 'backfill --sparse' '
+	git clone --sparse --filter=blob:none		\
+		--single-branch --branch=main 		\
+		"file://$(pwd)/srv.bare" backfill3 &&
+
+	# Initial checkout includes four files at root.
+	git -C backfill3 rev-list --quiet --objects --missing=print HEAD >missing &&
+	test_line_count = 44 missing &&
+
+	# Initial sparse-checkout is just the files at root, so we get the
+	# older versions of the four files at tip.
+	GIT_TRACE2_EVENT="$(pwd)/sparse-trace1" git \
+		-C backfill3 backfill --sparse &&
+	test_trace2_data promisor fetch_count 4 <sparse-trace1 &&
+	test_trace2_data path-walk paths 5 <sparse-trace1 &&
+	git -C backfill3 rev-list --quiet --objects --missing=print HEAD >missing &&
+	test_line_count = 40 missing &&
+
+	# Expand the sparse-checkout to include 'd' recursively. This
+	# engages the algorithm to skip the trees for 'a'. Note that
+	# the "sparse-checkout set" command downloads the objects at tip
+	# to satisfy the current checkout.
+	git -C backfill3 sparse-checkout set d &&
+	GIT_TRACE2_EVENT="$(pwd)/sparse-trace2" git \
+		-C backfill3 backfill --sparse &&
+	test_trace2_data promisor fetch_count 8 <sparse-trace2 &&
+	test_trace2_data path-walk paths 15 <sparse-trace2 &&
+	git -C backfill3 rev-list --quiet --objects --missing=print HEAD >missing &&
+	test_line_count = 24 missing
+'
+
+test_expect_success 'backfill --sparse without cone mode (positive)' '
+	git clone --no-checkout --filter=blob:none		\
+		--single-branch --branch=main 		\
+		"file://$(pwd)/srv.bare" backfill4 &&
+
+	# No blobs yet
+	git -C backfill4 rev-list --quiet --objects --missing=print HEAD >missing &&
+	test_line_count = 48 missing &&
+
+	# Define sparse-checkout by filename regardless of parent directory.
+	# This downloads 6 blobs to satisfy the checkout.
+	git -C backfill4 sparse-checkout set --no-cone "**/file.1.txt" &&
+	git -C backfill4 checkout main &&
+
+	# Track new blob count
+	git -C backfill4 rev-list --quiet --objects --missing=print HEAD >missing &&
+	test_line_count = 42 missing &&
+
+	GIT_TRACE2_EVENT="$(pwd)/no-cone-trace1" git \
+		-C backfill4 backfill --sparse &&
+	test_trace2_data promisor fetch_count 6 <no-cone-trace1 &&
+
+	# This walk needed to visit all directories to search for these paths.
+	test_trace2_data path-walk paths 12 <no-cone-trace1 &&
+	git -C backfill4 rev-list --quiet --objects --missing=print HEAD >missing &&
+	test_line_count = 36 missing
+'
+
+test_expect_success 'backfill --sparse without cone mode (negative)' '
+	git clone --no-checkout --filter=blob:none		\
+		--single-branch --branch=main 		\
+		"file://$(pwd)/srv.bare" backfill5 &&
+
+	# No blobs yet
+	git -C backfill5 rev-list --quiet --objects --missing=print HEAD >missing &&
+	test_line_count = 48 missing &&
+
+	# Define sparse-checkout by filename regardless of parent directory.
+	# This downloads 18 blobs to satisfy the checkout
+	git -C backfill5 sparse-checkout set --no-cone "**/file*" "!**/file.1.txt" &&
+	git -C backfill5 checkout main &&
+
+	# Track new blob count
+	git -C backfill5 rev-list --quiet --objects --missing=print HEAD >missing &&
+	test_line_count = 30 missing &&
+
+	GIT_TRACE2_EVENT="$(pwd)/no-cone-trace2" git \
+		-C backfill5 backfill --sparse &&
+	test_trace2_data promisor fetch_count 18 <no-cone-trace2 &&
+
+	# This walk needed to visit all directories to search for these paths, plus
+	# 12 extra "file.?.txt" paths than the previous test.
+	test_trace2_data path-walk paths 24 <no-cone-trace2 &&
+	git -C backfill5 rev-list --quiet --objects --missing=print HEAD >missing &&
+	test_line_count = 12 missing
+'
+
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd
 
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
index 5f04acb8a2f..c89b0f1e19d 100755
--- a/t/t6601-path-walk.sh
+++ b/t/t6601-path-walk.sh
@@ -176,6 +176,38 @@ test_expect_success 'branches and indexed objects mix well' '
 	test_cmp_sorted expect out
 '
 
+test_expect_success 'base & topic, sparse' '
+	cat >patterns <<-EOF &&
+	/*
+	!/*/
+	/left/
+	EOF
+
+	test-tool path-walk --stdin-pl -- base topic <patterns >out &&
+
+	cat >expect <<-EOF &&
+	0:commit::$(git rev-parse topic)
+	0:commit::$(git rev-parse base)
+	0:commit::$(git rev-parse base~1)
+	0:commit::$(git rev-parse base~2)
+	1:tree::$(git rev-parse topic^{tree})
+	1:tree::$(git rev-parse base^{tree})
+	1:tree::$(git rev-parse base~1^{tree})
+	1:tree::$(git rev-parse base~2^{tree})
+	2:blob:a:$(git rev-parse base~2:a)
+	3:tree:left/:$(git rev-parse base:left)
+	3:tree:left/:$(git rev-parse base~2:left)
+	4:blob:left/b:$(git rev-parse base~2:left/b)
+	4:blob:left/b:$(git rev-parse base:left/b)
+	blobs:3
+	commits:4
+	tags:0
+	trees:6
+	EOF
+
+	test_cmp_sorted expect out
+'
+
 test_expect_success 'topic only' '
 	test-tool path-walk -- topic >out &&
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v3 5/5] backfill: assume --sparse when sparse-checkout is enabled
  2025-02-03 17:11   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (3 preceding siblings ...)
  2025-02-03 17:11     ` [PATCH v3 4/5] backfill: add --sparse option Derrick Stolee via GitGitGadget
@ 2025-02-03 17:11     ` Derrick Stolee via GitGitGadget
  2025-02-04  0:18     ` [PATCH v3 0/5] PATH WALK III: Add 'git backfill' command Junio C Hamano
  5 siblings, 0 replies; 39+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2025-02-03 17:11 UTC (permalink / raw)
  To: git
  Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Jean-Noël AVILA, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The previous change introduced the '--[no-]sparse' option for the 'git
backfill' command, but did not assume it as enabled by default. However,
this is likely the behavior that users will most often want to happen.
Without this default, users with a small sparse-checkout may be confused
when 'git backfill' downloads every version of every object in the full
history.

However, this is left as a separate change so this decision can be reviewed
independently of the value of the '--[no-]sparse' option.

Add a test of adding the '--sparse' option to a repo without sparse-checkout
to make it clear that supplying it without a sparse-checkout is an error.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/git-backfill.txt |  3 ++-
 builtin/backfill.c             |  7 +++++++
 t/t5620-backfill.sh            | 13 ++++++++++++-
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
index a28678983e3..95623051f78 100644
--- a/Documentation/git-backfill.txt
+++ b/Documentation/git-backfill.txt
@@ -59,7 +59,8 @@ OPTIONS
 
 `--[no-]sparse`::
 	Only download objects if they appear at a path that matches the
-	current sparse-checkout.
+	current sparse-checkout. If the sparse-checkout feature is enabled,
+	then `--sparse` is assumed and can be disabled with `--no-sparse`.
 
 SEE ALSO
 --------
diff --git a/builtin/backfill.c b/builtin/backfill.c
index d7b997fd6f7..d7ee84692f3 100644
--- a/builtin/backfill.c
+++ b/builtin/backfill.c
@@ -1,3 +1,6 @@
+/* We need this macro to access core_apply_sparse_checkout */
+#define USE_THE_REPOSITORY_VARIABLE
+
 #include "builtin.h"
 #include "git-compat-util.h"
 #include "config.h"
@@ -5,6 +8,7 @@
 #include "repository.h"
 #include "commit.h"
 #include "dir.h"
+#include "environment.h"
 #include "hex.h"
 #include "tree.h"
 #include "tree-walk.h"
@@ -133,6 +137,9 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
 
 	repo_config(repo, git_default_config, NULL);
 
+	if (ctx.sparse < 0)
+		ctx.sparse = core_apply_sparse_checkout;
+
 	result = do_backfill(&ctx);
 	backfill_context_clear(&ctx);
 	return result;
diff --git a/t/t5620-backfill.sh b/t/t5620-backfill.sh
index 6b72e9d0e31..58c81556e72 100755
--- a/t/t5620-backfill.sh
+++ b/t/t5620-backfill.sh
@@ -77,6 +77,12 @@ test_expect_success 'do partial clone 2, backfill min batch size' '
 	test_line_count = 0 revs2
 '
 
+test_expect_success 'backfill --sparse without sparse-checkout fails' '
+	git init not-sparse &&
+	test_must_fail git -C not-sparse backfill --sparse 2>err &&
+	grep "problem loading sparse-checkout" err
+'
+
 test_expect_success 'backfill --sparse' '
 	git clone --sparse --filter=blob:none		\
 		--single-branch --branch=main 		\
@@ -105,7 +111,12 @@ test_expect_success 'backfill --sparse' '
 	test_trace2_data promisor fetch_count 8 <sparse-trace2 &&
 	test_trace2_data path-walk paths 15 <sparse-trace2 &&
 	git -C backfill3 rev-list --quiet --objects --missing=print HEAD >missing &&
-	test_line_count = 24 missing
+	test_line_count = 24 missing &&
+
+	# Disabling the --sparse option (on by default) will download everything
+	git -C backfill3 backfill --no-sparse &&
+	git -C backfill3 rev-list --quiet --objects --missing=print HEAD >missing &&
+	test_line_count = 0 missing
 '
 
 test_expect_success 'backfill --sparse without cone mode (positive)' '
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 0/5] PATH WALK III: Add 'git backfill' command
  2025-02-03 17:11   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (4 preceding siblings ...)
  2025-02-03 17:11     ` [PATCH v3 5/5] backfill: assume --sparse when sparse-checkout is enabled Derrick Stolee via GitGitGadget
@ 2025-02-04  0:18     ` Junio C Hamano
  2025-02-05  7:15       ` Patrick Steinhardt
  5 siblings, 1 reply; 39+ messages in thread
From: Junio C Hamano @ 2025-02-04  0:18 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, johannes.schindelin, peff, ps, me, johncai86, newren,
	christian.couder, kristofferhaugsbakk, jonathantanmy, karthik.188,
	Jean-Noël AVILA, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Updates in v3
> =============
>
>  * Rebased onto 'master' now that the path-walk API is merged.

I was going to object to this rebase, as the same path-walk was
contained already while building the base of the series for the
previous rounds.  IOW, "now that the path-walk API is merged" is not
a good excuse to rebase onto 'master'.

But then I forgot that there are other topics, like 'meson based
build' and 'synopsis formatting', that were in flight at the same
time that have been merged to 'master'.  They are good reasons why
we may want to rebase the updated version to 'master'.

IOW ...

>  * New builtin boilerplate is updated with new standards, including:
>
>  * Doc formatting uses [synopsis] formatting.
>  * Add builtin/backfill.c to meson.build.
>  * Add Documentation/git-backfill.txt to Documentation/meson.build.
>  * Add t/t5620-backfill.sh to t/meson.build.
>  * Update handling of -h due to f66d1423f5 (builtin: send usage() help text
>    to standard output, 2025-01-16).

... these are all good reasons, even if path-walk were still cooking
in 'next' (in which case, we'd prepare a custom base by merging path-walk
into 'master' and then apply these patches).

>  * Doc formatting is updated to use back-ticks on options and mark the
>    builtin as experimental.
>
>  * The batch_size member of 'struct backfill_context' is now named
>    'min_batch_size' in all patches.
>
>  * Some mentions of '--batch-size' are updated to '--min-batch-size'.
>
>  * An additional test is included for non-cone-mode sparse-checkout patterns
>    to further check the return values of path_matches_pattern_list() within
>    the path-walk API with sparse mode.
>
>  * A use of oid_object_info_extended() is replaced with has_object().
>
>  * The backfill_context_clear() method is called by the proper owner of the
>    struct.
>
> Thanks, -Stolee

Everything looked great from a quick look.  I'll have a more
detailed look later, but this round looks quite promising.

Thanks.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 0/5] PATH WALK III: Add 'git backfill' command
  2025-02-04  0:18     ` [PATCH v3 0/5] PATH WALK III: Add 'git backfill' command Junio C Hamano
@ 2025-02-05  7:15       ` Patrick Steinhardt
  2025-02-05 17:07         ` Junio C Hamano
  0 siblings, 1 reply; 39+ messages in thread
From: Patrick Steinhardt @ 2025-02-05  7:15 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, git, johannes.schindelin, peff,
	me, johncai86, newren, christian.couder, kristofferhaugsbakk,
	jonathantanmy, karthik.188, Jean-Noël AVILA, Derrick Stolee

On Mon, Feb 03, 2025 at 04:18:44PM -0800, Junio C Hamano wrote:
> Everything looked great from a quick look.  I'll have a more
> detailed look later, but this round looks quite promising.

I didn't do another deep dive, but mostly had a look at my last comments
and the range-diff. As far as I can see my comments have been addressed,
so this version looks good to me.

Thanks!

Patrick

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 0/5] PATH WALK III: Add 'git backfill' command
  2025-02-05  7:15       ` Patrick Steinhardt
@ 2025-02-05 17:07         ` Junio C Hamano
  0 siblings, 0 replies; 39+ messages in thread
From: Junio C Hamano @ 2025-02-05 17:07 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: Derrick Stolee via GitGitGadget, git, johannes.schindelin, peff,
	me, johncai86, newren, christian.couder, kristofferhaugsbakk,
	jonathantanmy, karthik.188, Jean-Noël AVILA, Derrick Stolee

Patrick Steinhardt <ps@pks.im> writes:

> On Mon, Feb 03, 2025 at 04:18:44PM -0800, Junio C Hamano wrote:
>> Everything looked great from a quick look.  I'll have a more
>> detailed look later, but this round looks quite promising.
>
> I didn't do another deep dive, but mostly had a look at my last comments
> and the range-diff. As far as I can see my comments have been addressed,
> so this version looks good to me.

Thanks.

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2025-02-05 17:07 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-12-06 20:07 [PATCH 0/5] PATH WALK III: Add 'git backfill' command Derrick Stolee via GitGitGadget
2024-12-06 20:07 ` [PATCH 1/5] backfill: add builtin boilerplate Derrick Stolee via GitGitGadget
2025-01-16 10:11   ` Patrick Steinhardt
2025-01-16 17:52     ` Junio C Hamano
2025-02-03 14:38       ` Derrick Stolee
2024-12-06 20:07 ` [PATCH 2/5] backfill: basic functionality and tests Derrick Stolee via GitGitGadget
2024-12-16  8:01   ` Patrick Steinhardt
2024-12-18 15:03     ` Derrick Stolee
2024-12-06 20:07 ` [PATCH 3/5] backfill: add --batch-size=<n> option Derrick Stolee via GitGitGadget
2024-12-16  8:01   ` Patrick Steinhardt
2024-12-18 15:09     ` Derrick Stolee
2025-01-19 17:57   ` Jean-Noël AVILA
2024-12-06 20:07 ` [PATCH 4/5] backfill: add --sparse option Derrick Stolee via GitGitGadget
2024-12-16  8:01   ` Patrick Steinhardt
2024-12-06 20:07 ` [PATCH 5/5] backfill: assume --sparse when sparse-checkout is enabled Derrick Stolee via GitGitGadget
2024-12-08 10:53 ` [PATCH 0/5] PATH WALK III: Add 'git backfill' command Junio C Hamano
2024-12-09  0:34   ` Junio C Hamano
2024-12-20 16:29 ` [PATCH v2 " Derrick Stolee via GitGitGadget
2024-12-20 16:29   ` [PATCH v2 1/5] backfill: add builtin boilerplate Derrick Stolee via GitGitGadget
2024-12-20 16:29   ` [PATCH v2 2/5] backfill: basic functionality and tests Derrick Stolee via GitGitGadget
2025-01-16 10:01     ` Patrick Steinhardt
2025-02-03 14:44       ` Derrick Stolee
2024-12-20 16:29   ` [PATCH v2 3/5] backfill: add --min-batch-size=<n> option Derrick Stolee via GitGitGadget
2025-01-16 10:01     ` Patrick Steinhardt
2024-12-20 16:29   ` [PATCH v2 4/5] backfill: add --sparse option Derrick Stolee via GitGitGadget
2025-01-16 10:01     ` Patrick Steinhardt
2025-02-03 15:11       ` Derrick Stolee
2024-12-20 16:29   ` [PATCH v2 5/5] backfill: assume --sparse when sparse-checkout is enabled Derrick Stolee via GitGitGadget
2025-01-16 10:00   ` [PATCH v2 0/5] PATH WALK III: Add 'git backfill' command Patrick Steinhardt
2025-01-17 22:37     ` Junio C Hamano
2025-02-03 17:11   ` [PATCH v3 " Derrick Stolee via GitGitGadget
2025-02-03 17:11     ` [PATCH v3 1/5] backfill: add builtin boilerplate Derrick Stolee via GitGitGadget
2025-02-03 17:11     ` [PATCH v3 2/5] backfill: basic functionality and tests Derrick Stolee via GitGitGadget
2025-02-03 17:11     ` [PATCH v3 3/5] backfill: add --min-batch-size=<n> option Derrick Stolee via GitGitGadget
2025-02-03 17:11     ` [PATCH v3 4/5] backfill: add --sparse option Derrick Stolee via GitGitGadget
2025-02-03 17:11     ` [PATCH v3 5/5] backfill: assume --sparse when sparse-checkout is enabled Derrick Stolee via GitGitGadget
2025-02-04  0:18     ` [PATCH v3 0/5] PATH WALK III: Add 'git backfill' command Junio C Hamano
2025-02-05  7:15       ` Patrick Steinhardt
2025-02-05 17:07         ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).