Git development
 help / color / mirror / Atom feed
From: Patrick Steinhardt <ps@pks.im>
To: Trieu Huynh <vikingtc4@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: [RFC GSoC PATCH] backfill: skip downloading for empty batches
Date: Wed, 1 Apr 2026 13:50:55 +0200	[thread overview]
Message-ID: <ac0GnzQgZMfu8aGL@pks.im> (raw)
In-Reply-To: <20260331121204.787826-1-vikingtc4@gmail.com>

On Tue, Mar 31, 2026 at 09:12:04PM +0900, Trieu Huynh wrote:
> When git backfill finishes its object walk, it unconditionally calls
> download_batch to process any remaining objects. If the repository
> is already up-to-date (no missing objects found), this call still
> performs an unnecessary directory scan via odb_reprepare.
> 
> Fix it by adding a check in do_backfill to ensure download_batch is only
> called if the current batch actually contains objects (nr > 0).
> 
> To facilitate testing and provide better telemetry, add a trace2 data
> event for batches_requested. This allows us to verify that no batches
> are processed when the command is run on an up-to-date repository.
> 
> Add a test case in t5620-backfill.sh to ensure silence and efficiency
> when no objects are missing.
> 
> Signed-off-by: Trieu Huynh <vikingtc4@gmail.com>
> ---
> Need discussion:
> 1. Is adding trace2_data_intmax() the preferred way to verify this 
>    behavior in our test suite, or should we rely on redirection of 
>    stderr to check for progress messages when the progress option
>    is supported?

I think adding a call to trace2 only for the test itself doesn't make a
lot of sense if we already have another way to verify. But would we
actually see any progress messages? `promisor_remote_get_direct()` knows
to bail out early in case there is nothing to be downloaded, so the only
difference really is the call to `odb_reprepare()`.

Or is it? This part here...

> diff --git a/builtin/backfill.c b/builtin/backfill.c
> index 0f31844ce7..67f9f28daf 100644
> --- a/builtin/backfill.c
> +++ b/builtin/backfill.c
> @@ -58,6 +58,7 @@ static void download_batch(struct backfill_context *ctx)
>  	 */
>  	odb_reprepare(ctx->repo->objects);
>  	display_progress(ctx->progress, ++ctx->batches_requested);
> +	trace2_data_intmax("backfill", ctx->repo, "batches_requested", ctx->batches_requested);
>  }
>  
>  static int fill_missing_blobs(const char *path UNUSED,

... looks different. What commit is this patch based on? There is no
call to `display_progress()` on "master", and you didn't mention any
other dependency in your cover letter. Please note such dependencies
when you post a patch that has any requirements.

But in any case, this here would cause us to print "batches_requested"
events repeatedly, which doesn't make a lot of sense.

> @@ -109,7 +110,7 @@ static int do_backfill(struct backfill_context *ctx)
>  	ret = walk_objects_by_path(&info);
>  
>  	/* Download the objects that did not fill a batch. */
> -	if (!ret)
> +	if ( (!ret) && (ctx->current_batch.nr > 0) )
>  		download_batch(ctx);
>  
>  	path_walk_info_clear(&info);

Please pay attention to our coding guidlines, see
"Documentation/CodingGuidelines".

I guess a more robust fix would add the check in `download_batch()`
itself, but I guess both alternatives work. But overall, it's sensible
to avoid repreparing the ODB in case we know nothing has changed.

> diff --git a/t/t5620-backfill.sh b/t/t5620-backfill.sh
> index a1a8d736db..d3cc4022bf 100755
> --- a/t/t5620-backfill.sh
> +++ b/t/t5620-backfill.sh
> @@ -221,6 +221,22 @@ test_expect_success 'backfill --sparse without cone mode (negative)' '
>  	test_line_count = 12 missing
>  '
>  
> +test_expect_success 'backfill does not request batches when up-to-date' '
> +	git clone --no-checkout --filter=blob:none \
> +		--single-branch --branch=main \
> +		"file://$(pwd)/srv.bare" backfill-up-to-date &&
> +
> +	# First trigger to have a full download
> +	git -C backfill-up-to-date backfill &&
> +
> +	# Second trigger to verify when already have a full download previously
> +	GIT_TRACE2_EVENT="$(pwd)/up-to-date-trace" git \
> +		-C backfill-up-to-date backfill &&
> +
> +	# Verify no  batches_request occurr
> +	test_grep ! "batches_requested" up-to-date-trace
> +'

I'm ultimately not sure whether this change even needs a test. We're not
changing any user-visible behaviour, we're simply skipping some
pointless busywork that doesn't do much, but that shouldn't really hurt
much, either.

Patrick

  reply	other threads:[~2026-04-01 11:51 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-31 12:12 [RFC GSoC PATCH] backfill: skip downloading for empty batches Trieu Huynh
2026-04-01 11:50 ` Patrick Steinhardt [this message]
2026-04-01 19:44   ` Trieu Huynh

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ac0GnzQgZMfu8aGL@pks.im \
    --to=ps@pks.im \
    --cc=git@vger.kernel.org \
    --cc=vikingtc4@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox