From: Patrick Steinhardt <ps@pks.im>
To: Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, gitster@pobox.com,
johannes.schindelin@gmx.de, peff@peff.net, me@ttaylorr.com,
johncai86@gmail.com, newren@gmail.com,
christian.couder@gmail.com, kristofferhaugsbakk@fastmail.com,
jonathantanmy@google.com, karthik.188@gmail.com,
Derrick Stolee <stolee@gmail.com>,
Derrick Stolee <derrickstolee@github.com>
Subject: Re: [PATCH 2/5] backfill: basic functionality and tests
Date: Mon, 16 Dec 2024 09:01:21 +0100 [thread overview]
Message-ID: <Z1_eUSGN3B13uGmM@pks.im> (raw)
In-Reply-To: <5728dd2702195b7ba3a208859f114e40ba2b6bbd.1733515638.git.gitgitgadget@gmail.com>
On Fri, Dec 06, 2024 at 08:07:15PM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
> index 640144187d3..0e10f066fef 100644
> --- a/Documentation/git-backfill.txt
> +++ b/Documentation/git-backfill.txt
> @@ -14,6 +14,30 @@ SYNOPSIS
> DESCRIPTION
> -----------
>
> +Blobless partial clones are created using `git clone --filter=blob:none`
> +and then configure the local repository such that the Git client avoids
> +downloading blob objects unless they are required for a local operation.
> +This initially means that the clone and later fetches download reachable
> +commits and trees but no blobs. Later operations that change the `HEAD`
> +pointer, such as `git checkout` or `git merge`, may need to download
> +missing blobs in order to complete their operation.
Okay.
> +In the worst cases, commands that compute blob diffs, such as `git blame`,
> +become very slow as they download the missing blobs in single-blob
> +requests to satisfy the missing object as the Git command needs it. This
> +leads to multiple download requests and no ability for the Git server to
> +provide delta compression across those objects.
> +
> +The `git backfill` command provides a way for the user to request that
> +Git downloads the missing blobs (with optional filters) such that the
> +missing blobs representing historical versions of files can be downloaded
> +in batches. The `backfill` command attempts to optimize the request by
> +grouping blobs that appear at the same path, hopefully leading to good
> +delta compression in the packfile sent by the server.
Hm. So we're asking the user to fix a usability issue of git-blame(1),
don't we? Ideally, git-blame(1) itself should know to transparently
batch the blobs it requires to compute its output, shouldn't it? That
usecase alone doesn't yet convince me that git-backfill(1) is a good
idea as I'd think we should rather fix the underlying issue.
So are there other usecases for git-backfill(1)? I can imagine that it
might be helpful in the context of scripts that know they'll operate on
a bunch of blobs.
> diff --git a/builtin/backfill.c b/builtin/backfill.c
> index 38e6aaeaa03..e5f2000d5e0 100644
> --- a/builtin/backfill.c
> +++ b/builtin/backfill.c
> @@ -1,16 +1,116 @@
> #include "builtin.h"
> +#include "git-compat-util.h"
> #include "config.h"
> #include "parse-options.h"
> #include "repository.h"
> +#include "commit.h"
> +#include "hex.h"
> +#include "tree.h"
> +#include "tree-walk.h"
> #include "object.h"
> +#include "object-store-ll.h"
> +#include "oid-array.h"
> +#include "oidset.h"
> +#include "promisor-remote.h"
> +#include "strmap.h"
> +#include "string-list.h"
> +#include "revision.h"
> +#include "trace2.h"
> +#include "progress.h"
> +#include "packfile.h"
> +#include "path-walk.h"
>
> static const char * const builtin_backfill_usage[] = {
> N_("git backfill [<options>]"),
> NULL
> };
>
> +struct backfill_context {
> + struct repository *repo;
> + struct oid_array current_batch;
> + size_t batch_size;
> +};
> +
> +static void clear_backfill_context(struct backfill_context *ctx)
> +{
> + oid_array_clear(&ctx->current_batch);
> +}
Nit: our style guide says that this should rather be
`backfill_context_clear()`.
> +static void download_batch(struct backfill_context *ctx)
> +{
> + promisor_remote_get_direct(ctx->repo,
> + ctx->current_batch.oid,
> + ctx->current_batch.nr);
> + oid_array_clear(&ctx->current_batch);
> +
> + /*
> + * We likely have a new packfile. Add it to the packed list to
> + * avoid possible duplicate downloads of the same objects.
> + */
> + reprepare_packed_git(ctx->repo);
> +}
> +
> +static int fill_missing_blobs(const char *path UNUSED,
> + struct oid_array *list,
> + enum object_type type,
> + void *data)
> +{
> + struct backfill_context *ctx = data;
> +
> + if (type != OBJ_BLOB)
> + return 0;
> +
> + for (size_t i = 0; i < list->nr; i++) {
> + off_t size = 0;
> + struct object_info info = OBJECT_INFO_INIT;
> + info.disk_sizep = &size;
> + if (oid_object_info_extended(ctx->repo,
> + &list->oid[i],
> + &info,
> + OBJECT_INFO_FOR_PREFETCH) ||
> + !size)
> + oid_array_append(&ctx->current_batch, &list->oid[i]);
> + }
> +
> + if (ctx->current_batch.nr >= ctx->batch_size)
> + download_batch(ctx);
Okay, so the batch size is just "best effort". If we walk a tree that
makes us exceed the batch size then we wouldn't issue a fetch during the
tree walk. Is there any specific reason for this behaviour?
In any case, as long as this is properly documented I think this should
be fine in general.
> + return 0;
> +}
> +
> +static int do_backfill(struct backfill_context *ctx)
> +{
> + struct rev_info revs;
> + struct path_walk_info info = PATH_WALK_INFO_INIT;
> + int ret;
> +
> + repo_init_revisions(ctx->repo, &revs, "");
> + handle_revision_arg("HEAD", &revs, 0, 0);
> +
> + info.blobs = 1;
> + info.tags = info.commits = info.trees = 0;
> +
> + info.revs = &revs;
> + info.path_fn = fill_missing_blobs;
> + info.path_fn_data = ctx;
> +
> + ret = walk_objects_by_path(&info);
> +
> + /* Download the objects that did not fill a batch. */
> + if (!ret)
> + download_batch(ctx);
> +
> + clear_backfill_context(ctx);
Are we leaking `revs` and `info`?
> + return ret;
> +}
> +
> int cmd_backfill(int argc, const char **argv, const char *prefix, struct repository *repo)
> {
> + struct backfill_context ctx = {
> + .repo = repo,
> + .current_batch = OID_ARRAY_INIT,
> + .batch_size = 50000,
> + };
> struct option options[] = {
> OPT_END(),
> };
> @@ -23,7 +123,5 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
>
> repo_config(repo, git_default_config, NULL);
>
> - die(_("not implemented"));
> -
> - return 0;
> + return do_backfill(&ctx);
> }
The current iteration only backfills blobs as far as I can see. Do we
maybe want to keep the door open for future changes in git-backfill(1)
by implementing this via a "blob" subcommand?
Patrick
next prev parent reply other threads:[~2024-12-16 8:01 UTC|newest]
Thread overview: 39+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-12-06 20:07 [PATCH 0/5] PATH WALK III: Add 'git backfill' command Derrick Stolee via GitGitGadget
2024-12-06 20:07 ` [PATCH 1/5] backfill: add builtin boilerplate Derrick Stolee via GitGitGadget
2025-01-16 10:11 ` Patrick Steinhardt
2025-01-16 17:52 ` Junio C Hamano
2025-02-03 14:38 ` Derrick Stolee
2024-12-06 20:07 ` [PATCH 2/5] backfill: basic functionality and tests Derrick Stolee via GitGitGadget
2024-12-16 8:01 ` Patrick Steinhardt [this message]
2024-12-18 15:03 ` Derrick Stolee
2024-12-06 20:07 ` [PATCH 3/5] backfill: add --batch-size=<n> option Derrick Stolee via GitGitGadget
2024-12-16 8:01 ` Patrick Steinhardt
2024-12-18 15:09 ` Derrick Stolee
2025-01-19 17:57 ` Jean-Noël AVILA
2024-12-06 20:07 ` [PATCH 4/5] backfill: add --sparse option Derrick Stolee via GitGitGadget
2024-12-16 8:01 ` Patrick Steinhardt
2024-12-06 20:07 ` [PATCH 5/5] backfill: assume --sparse when sparse-checkout is enabled Derrick Stolee via GitGitGadget
2024-12-08 10:53 ` [PATCH 0/5] PATH WALK III: Add 'git backfill' command Junio C Hamano
2024-12-09 0:34 ` Junio C Hamano
2024-12-20 16:29 ` [PATCH v2 " Derrick Stolee via GitGitGadget
2024-12-20 16:29 ` [PATCH v2 1/5] backfill: add builtin boilerplate Derrick Stolee via GitGitGadget
2024-12-20 16:29 ` [PATCH v2 2/5] backfill: basic functionality and tests Derrick Stolee via GitGitGadget
2025-01-16 10:01 ` Patrick Steinhardt
2025-02-03 14:44 ` Derrick Stolee
2024-12-20 16:29 ` [PATCH v2 3/5] backfill: add --min-batch-size=<n> option Derrick Stolee via GitGitGadget
2025-01-16 10:01 ` Patrick Steinhardt
2024-12-20 16:29 ` [PATCH v2 4/5] backfill: add --sparse option Derrick Stolee via GitGitGadget
2025-01-16 10:01 ` Patrick Steinhardt
2025-02-03 15:11 ` Derrick Stolee
2024-12-20 16:29 ` [PATCH v2 5/5] backfill: assume --sparse when sparse-checkout is enabled Derrick Stolee via GitGitGadget
2025-01-16 10:00 ` [PATCH v2 0/5] PATH WALK III: Add 'git backfill' command Patrick Steinhardt
2025-01-17 22:37 ` Junio C Hamano
2025-02-03 17:11 ` [PATCH v3 " Derrick Stolee via GitGitGadget
2025-02-03 17:11 ` [PATCH v3 1/5] backfill: add builtin boilerplate Derrick Stolee via GitGitGadget
2025-02-03 17:11 ` [PATCH v3 2/5] backfill: basic functionality and tests Derrick Stolee via GitGitGadget
2025-02-03 17:11 ` [PATCH v3 3/5] backfill: add --min-batch-size=<n> option Derrick Stolee via GitGitGadget
2025-02-03 17:11 ` [PATCH v3 4/5] backfill: add --sparse option Derrick Stolee via GitGitGadget
2025-02-03 17:11 ` [PATCH v3 5/5] backfill: assume --sparse when sparse-checkout is enabled Derrick Stolee via GitGitGadget
2025-02-04 0:18 ` [PATCH v3 0/5] PATH WALK III: Add 'git backfill' command Junio C Hamano
2025-02-05 7:15 ` Patrick Steinhardt
2025-02-05 17:07 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Z1_eUSGN3B13uGmM@pks.im \
--to=ps@pks.im \
--cc=christian.couder@gmail.com \
--cc=derrickstolee@github.com \
--cc=git@vger.kernel.org \
--cc=gitgitgadget@gmail.com \
--cc=gitster@pobox.com \
--cc=johannes.schindelin@gmx.de \
--cc=johncai86@gmail.com \
--cc=jonathantanmy@google.com \
--cc=karthik.188@gmail.com \
--cc=kristofferhaugsbakk@fastmail.com \
--cc=me@ttaylorr.com \
--cc=newren@gmail.com \
--cc=peff@peff.net \
--cc=stolee@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.