From: Patrick Steinhardt <ps@pks.im>
To: Trieu Huynh <vikingtc4@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: [RFC GSoC PATCH] backfill: skip downloading for empty batches
Date: Wed, 1 Apr 2026 13:50:55 +0200 [thread overview]
Message-ID: <ac0GnzQgZMfu8aGL@pks.im> (raw)
In-Reply-To: <20260331121204.787826-1-vikingtc4@gmail.com>
On Tue, Mar 31, 2026 at 09:12:04PM +0900, Trieu Huynh wrote:
> When git backfill finishes its object walk, it unconditionally calls
> download_batch to process any remaining objects. If the repository
> is already up-to-date (no missing objects found), this call still
> performs an unnecessary directory scan via odb_reprepare.
>
> Fix it by adding a check in do_backfill to ensure download_batch is only
> called if the current batch actually contains objects (nr > 0).
>
> To facilitate testing and provide better telemetry, add a trace2 data
> event for batches_requested. This allows us to verify that no batches
> are processed when the command is run on an up-to-date repository.
>
> Add a test case in t5620-backfill.sh to ensure silence and efficiency
> when no objects are missing.
>
> Signed-off-by: Trieu Huynh <vikingtc4@gmail.com>
> ---
> Need discussion:
> 1. Is adding trace2_data_intmax() the preferred way to verify this
> behavior in our test suite, or should we rely on redirection of
> stderr to check for progress messages when the progress option
> is supported?
I think adding a call to trace2 only for the test itself doesn't make a
lot of sense if we already have another way to verify. But would we
actually see any progress messages? `promisor_remote_get_direct()` knows
to bail out early in case there is nothing to be downloaded, so the only
difference really is the call to `odb_reprepare()`.
Or is it? This part here...
> diff --git a/builtin/backfill.c b/builtin/backfill.c
> index 0f31844ce7..67f9f28daf 100644
> --- a/builtin/backfill.c
> +++ b/builtin/backfill.c
> @@ -58,6 +58,7 @@ static void download_batch(struct backfill_context *ctx)
> */
> odb_reprepare(ctx->repo->objects);
> display_progress(ctx->progress, ++ctx->batches_requested);
> + trace2_data_intmax("backfill", ctx->repo, "batches_requested", ctx->batches_requested);
> }
>
> static int fill_missing_blobs(const char *path UNUSED,
... looks different. What commit is this patch based on? There is no
call to `display_progress()` on "master", and you didn't mention any
other dependency in your cover letter. Please note such dependencies
when you post a patch that has any requirements.
But in any case, this here would cause us to print "batches_requested"
events repeatedly, which doesn't make a lot of sense.
> @@ -109,7 +110,7 @@ static int do_backfill(struct backfill_context *ctx)
> ret = walk_objects_by_path(&info);
>
> /* Download the objects that did not fill a batch. */
> - if (!ret)
> + if ( (!ret) && (ctx->current_batch.nr > 0) )
> download_batch(ctx);
>
> path_walk_info_clear(&info);
Please pay attention to our coding guidlines, see
"Documentation/CodingGuidelines".
I guess a more robust fix would add the check in `download_batch()`
itself, but I guess both alternatives work. But overall, it's sensible
to avoid repreparing the ODB in case we know nothing has changed.
> diff --git a/t/t5620-backfill.sh b/t/t5620-backfill.sh
> index a1a8d736db..d3cc4022bf 100755
> --- a/t/t5620-backfill.sh
> +++ b/t/t5620-backfill.sh
> @@ -221,6 +221,22 @@ test_expect_success 'backfill --sparse without cone mode (negative)' '
> test_line_count = 12 missing
> '
>
> +test_expect_success 'backfill does not request batches when up-to-date' '
> + git clone --no-checkout --filter=blob:none \
> + --single-branch --branch=main \
> + "file://$(pwd)/srv.bare" backfill-up-to-date &&
> +
> + # First trigger to have a full download
> + git -C backfill-up-to-date backfill &&
> +
> + # Second trigger to verify when already have a full download previously
> + GIT_TRACE2_EVENT="$(pwd)/up-to-date-trace" git \
> + -C backfill-up-to-date backfill &&
> +
> + # Verify no batches_request occurr
> + test_grep ! "batches_requested" up-to-date-trace
> +'
I'm ultimately not sure whether this change even needs a test. We're not
changing any user-visible behaviour, we're simply skipping some
pointless busywork that doesn't do much, but that shouldn't really hurt
much, either.
Patrick
next prev parent reply other threads:[~2026-04-01 11:51 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-03-31 12:12 [RFC GSoC PATCH] backfill: skip downloading for empty batches Trieu Huynh
2026-04-01 11:50 ` Patrick Steinhardt [this message]
2026-04-01 19:44 ` Trieu Huynh
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ac0GnzQgZMfu8aGL@pks.im \
--to=ps@pks.im \
--cc=git@vger.kernel.org \
--cc=vikingtc4@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox