From: Patrick Steinhardt <ps@pks.im>
To: Karthik Nayak <karthik.188@gmail.com>
Cc: git@vger.kernel.org, toon@iotcl.com
Subject: Re: [PATCH 1/3] fetch: use batched reference updates
Date: Wed, 14 May 2025 14:31:16 +0200	[thread overview]
Message-ID: <aCSNFMeh3WMav_Rn@pks.im> (raw)
In-Reply-To: <20250514-501-update-git-fetch-1-to-use-partial-transactions-v1-1-7c65f46493d4@gmail.com>

On Wed, May 14, 2025 at 11:03:47AM +0200, Karthik Nayak wrote:
> The reference updates performed as a part of 'git-fetch(1)', take place

s/,//

> one at a time. For each reference update, a new transaction is created
> and committed. This is necessary to ensure we can allow individual
> updates to fail without failing the entire command. The command also
> supports an '--atomic' mode, which uses a single transaction to update
> all of the references. But this mode has an all-or-nothing approach,
> where if a single update fails, all updates would fail.
> 
> In 23fc8e4f61 (refs: implement batch reference update support,
> 2025-04-08), we introduced a new mechanism to batch reference updates.
> Under the hood, this uses a single transaction to perform a batch of
> reference updates, while allowing only individual updates to fail.
> Utilize this newly introduced batch update mechanism in 'git-fetch(1)'.
> This provides a significant bump in performance, especially when dealing
> with repositories with large number of references.
> 
> Adding support for batched updates is simply modifying the flow to also
> create a batch update transaction in the non-atomic flow.
> 
> With the reftable backend there is a 22x performance improvement, when
> performing 'git-fetch(1)' with 10000 refs:
> 
>   Benchmark 1: fetch: many refs (refformat = reftable, refcount = 10000, revision = master)
>     Time (mean ± σ):      3.403 s ±  0.775 s    [User: 1.875 s, System: 1.417 s]
>     Range (min … max):    2.454 s …  4.529 s    10 runs
> 
>   Benchmark 2: fetch: many refs (refformat = reftable, refcount = 10000, revision = HEAD)
>     Time (mean ± σ):     154.3 ms ±  17.6 ms    [User: 102.5 ms, System: 56.1 ms]
>     Range (min … max):   145.2 ms … 220.5 ms    18 runs
> 
>   Summary
>     fetch: many refs (refformat = reftable, refcount = 10000, revision = HEAD) ran
>      22.06 ± 5.62 times faster than fetch: many refs (refformat = reftable, refcount = 10000, revision = master)

Nice. The speedup is larger than I had originally anticipated, but I
certainly won't complain about that :) For a good part, the speedup
should come from us no longer having to do 10000 auto-compactions, one
for each of the updates, as well as not having to write 10000 new
tables. Instead, we write a single new table and perform compaction
only a single time.
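
To spell the difference out for readers following along, here is a rough,
untested sketch (not the actual code in builtin/fetch.c; error handling
trimmed, helper names made up): the old flow effectively ran one
begin/update/commit cycle per reference, while the new flow queues all
updates into a single transaction opened with REF_TRANSACTION_ALLOW_FAILURE:

    #include "git-compat-util.h"
    #include "hash.h"
    #include "refs.h"
    #include "strbuf.h"

    /* Old flow: one transaction, and thus one new reftable table plus
     * one auto-compaction, per updated reference. */
    static void update_one_by_one(struct ref_store *refs, const char **names,
                                  const struct object_id *oids, size_t nr)
    {
            struct strbuf err = STRBUF_INIT;

            for (size_t i = 0; i < nr; i++) {
                    struct ref_transaction *tx =
                            ref_store_transaction_begin(refs, 0, &err);

                    if (!tx ||
                        ref_transaction_update(tx, names[i], &oids[i], NULL,
                                               NULL, NULL, 0, "fetch", &err) ||
                        ref_transaction_commit(tx, &err))
                            error("%s", err.buf);

                    ref_transaction_free(tx);
                    strbuf_reset(&err);
            }
            strbuf_release(&err);
    }

    /* New flow: a single transaction for the whole batch; individual
     * updates may still be rejected and are reported via the rejection
     * callback instead of failing the whole transaction. */
    static void update_batched(struct ref_store *refs, const char **names,
                               const struct object_id *oids, size_t nr)
    {
            struct strbuf err = STRBUF_INIT;
            struct ref_transaction *tx =
                    ref_store_transaction_begin(refs, REF_TRANSACTION_ALLOW_FAILURE,
                                                &err);

            if (!tx) {
                    error("%s", err.buf);
                    goto out;
            }

            for (size_t i = 0; i < nr; i++)
                    if (ref_transaction_update(tx, names[i], &oids[i], NULL,
                                               NULL, NULL, 0, "fetch", &err))
                            error("%s", err.buf);

            /* One commit: one new table, one compaction. */
            if (ref_transaction_commit(tx, &err))
                    error("%s", err.buf);
    out:
            ref_transaction_free(tx);
            strbuf_release(&err);
    }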

> In similar conditions, the files backend sees a 1.25x performance
> improvement:
> 
>   Benchmark 1: fetch: many refs (refformat = files, refcount = 10000, revision = master)
>     Time (mean ± σ):     605.5 ms ±   9.4 ms    [User: 117.8 ms, System: 483.3 ms]
>     Range (min … max):   595.6 ms … 621.5 ms    10 runs
> 
>   Benchmark 2: fetch: many refs (refformat = files, refcount = 10000, revision = HEAD)
>     Time (mean ± σ):     485.8 ms ±   4.3 ms    [User: 91.1 ms, System: 396.7 ms]
>     Range (min … max):   477.6 ms … 494.3 ms    10 runs
> 
>   Summary
>     fetch: many refs (refformat = files, refcount = 10000, revision = HEAD) ran
>       1.25 ± 0.02 times faster than fetch: many refs (refformat = files, refcount = 10000, revision = master)

Yup, this makes sense as well. We shed a bunch of overhead by no longer
creating separate transactions for each of the updates, but the slow
path is still that we have to create 10000 files, one for each of the
references. So it's expected to see a small performance improvement,
but nothing game-changing.
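
Grossly simplified illustration of where the remaining cost goes (a plain
POSIX sketch, not the actual files-backend code): every updated ref still
means one lock file written and one rename, batched or not:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Write "<oid>\n" to "<path>.lock", then rename it into place. */
    static int write_loose_ref(const char *path, const char *hex_oid)
    {
            char lock[4096], buf[128];
            int fd;

            snprintf(lock, sizeof(lock), "%s.lock", path);
            fd = open(lock, O_WRONLY | O_CREAT | O_EXCL, 0666);
            if (fd < 0)
                    return -1;

            snprintf(buf, sizeof(buf), "%s\n", hex_oid);
            if (write(fd, buf, strlen(buf)) < 0) {
                    close(fd);
                    return -1;
            }
            if (close(fd) < 0)
                    return -1;

            /* One rename per reference, no matter how updates are batched. */
            return rename(lock, path);
    }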

> diff --git a/builtin/fetch.c b/builtin/fetch.c
> index 5279997c96..1558f6d1e8 100644
> --- a/builtin/fetch.c
> +++ b/builtin/fetch.c
> @@ -1688,6 +1644,32 @@ static int set_head(const struct ref *remote_refs, struct remote *remote)
>  	return result;
>  }
>  
> +struct ref_rejection_data {
> +	int *retcode;
> +	int conflict_msg_shown;
> +	const char *remote_name;
> +};
> +
> +static void ref_transaction_rejection_handler(const char *refname UNUSED,
> +					      const struct object_id *old_oid UNUSED,
> +					      const struct object_id *new_oid UNUSED,
> +					      const char *old_target UNUSED,
> +					      const char *new_target UNUSED,
> +					      enum ref_transaction_error err,
> +					      void *cb_data)
> +{
> +	struct ref_rejection_data *data = (struct ref_rejection_data *)cb_data;
> +
> +	if (err == REF_TRANSACTION_ERROR_NAME_CONFLICT && !data->conflict_msg_shown) {
> +		error(_("some local refs could not be updated; try running\n"
> +			" 'git remote prune %s' to remove any old, conflicting "
> +			"branches"), data->remote_name);
> +		data->conflict_msg_shown = 1;
> +	}
> +
> +	*data->retcode = 1;
> +}
> +
>  static int do_fetch(struct transport *transport,
>  		    struct refspec *rs,
>  		    const struct fetch_config *config)

Okay, so we now handle errors here. Is the handled error the only error
that we may see, or do we also accept other errors like D/F conflicts
now? If the latter, wouldn't that mean we don't print any error message
for those other failures at all? That might be quite confusing.
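
Something along these lines would at least surface those other failures
(untested sketch on top of this patch, reusing the struct and includes
already in builtin/fetch.c; describe_rejection() is a made-up placeholder,
not an existing helper):

    static const char *describe_rejection(enum ref_transaction_error err);

    static void ref_transaction_rejection_handler(const char *refname,
                                                  const struct object_id *old_oid UNUSED,
                                                  const struct object_id *new_oid UNUSED,
                                                  const char *old_target UNUSED,
                                                  const char *new_target UNUSED,
                                                  enum ref_transaction_error err,
                                                  void *cb_data)
    {
            struct ref_rejection_data *data = (struct ref_rejection_data *)cb_data;

            switch (err) {
            case REF_TRANSACTION_ERROR_NAME_CONFLICT:
                    if (!data->conflict_msg_shown) {
                            error(_("some local refs could not be updated; try running\n"
                                    " 'git remote prune %s' to remove any old, conflicting "
                                    "branches"), data->remote_name);
                            data->conflict_msg_shown = 1;
                    }
                    break;
            default:
                    /* Don't silently drop other rejections on the floor. */
                    error(_("failed to update ref '%s': %s"),
                          refname, describe_rejection(err));
                    break;
            }

            *data->retcode = 1;
    }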

> @@ -1808,6 +1790,20 @@ static int do_fetch(struct transport *transport,
>  			retcode = 1;
>  	}
>  
> +	/*
> +	 * If not atomic, we can still use batched updates, which would be much
> +	 * more performent. We don't initiate the transaction before pruning,

s/performent/performant/

> +	 * since pruning must be an independent step, to avoid F/D conflicts.
> +	 */
> +	if (!transaction) {
> +		transaction = ref_store_transaction_begin(get_main_ref_store(the_repository),
> +							  REF_TRANSACTION_ALLOW_FAILURE, &err);
> +		if (!transaction) {
> +			retcode = -1;
> +			goto cleanup;
> +		}
> +	}
> +
>  	if (fetch_and_consume_refs(&display_state, transport, transaction, ref_map,
>  				   &fetch_head, config)) {
>  		retcode = 1;

Don't transactions handle D/F conflicts for us? Isn't that the sole
reason why, for example, `refs_verify_refname_available()` accepts an
"extras" parameter that is supposed to contain refs that are about to
be deleted?

Patrick

Thread overview: 41+ messages
2025-05-14  9:03 [PATCH 0/3] fetch/receive: use batched reference updates Karthik Nayak
2025-05-14  9:03 ` [PATCH 1/3] fetch: " Karthik Nayak
2025-05-14 12:31   ` Patrick Steinhardt [this message]
2025-05-15 11:13     ` Karthik Nayak
2025-05-15 11:30       ` Patrick Steinhardt
2025-05-15 11:36         ` Karthik Nayak
2025-05-14 17:36   ` Junio C Hamano
2025-05-14  9:03 ` [PATCH 2/3] send-pack: fix memory leak around duplicate refs Karthik Nayak
2025-05-14 17:46   ` Junio C Hamano
2025-05-15 11:23     ` Karthik Nayak
2025-05-14  9:03 ` [PATCH 3/3] receive-pack: use batched reference updates Karthik Nayak
2025-05-14 12:31   ` Patrick Steinhardt
2025-05-14 19:00     ` Junio C Hamano
2025-05-15 11:30     ` Karthik Nayak
2025-05-15 14:07 ` [PATCH v2 0/4] fetch/receive: " Karthik Nayak
2025-05-15 14:07   ` [PATCH v2 1/4] refs: add function to translate errors to strings Karthik Nayak
2025-05-15 19:11     ` Jeff King
2025-05-16  9:11       ` Karthik Nayak
2025-05-15 20:26     ` Junio C Hamano
2025-05-16  9:12       ` Karthik Nayak
2025-05-15 14:07   ` [PATCH v2 2/4] fetch: use batched reference updates Karthik Nayak
2025-05-16  5:40     ` Patrick Steinhardt
2025-05-16  9:53       ` Karthik Nayak
2025-05-16 10:00         ` Patrick Steinhardt
2025-05-18 11:30           ` Karthik Nayak
2025-05-15 14:07   ` [PATCH v2 3/4] send-pack: fix memory leak around duplicate refs Karthik Nayak
2025-05-15 14:07   ` [PATCH v2 4/4] receive-pack: use batched reference updates Karthik Nayak
2025-05-15 18:55     ` Jeff King
2025-05-15 19:09       ` Jeff King
2025-05-16 19:49         ` Karthik Nayak
2025-05-19  9:58 ` [PATCH v3 0/4] fetch/receive: " Karthik Nayak
2025-05-19  9:58   ` [PATCH v3 1/4] refs: add function to translate errors to strings Karthik Nayak
2025-05-19  9:58   ` [PATCH v3 2/4] fetch: use batched reference updates Karthik Nayak
2025-05-19  9:58   ` [PATCH v3 3/4] send-pack: fix memory leak around duplicate refs Karthik Nayak
2025-05-19  9:58   ` [PATCH v3 4/4] receive-pack: use batched reference updates Karthik Nayak
2025-05-19 18:14   ` [PATCH v3 0/4] fetch/receive: " Junio C Hamano
2025-05-20  9:05     ` Karthik Nayak
2025-05-21 13:14       ` Junio C Hamano
2025-05-22  6:00       ` Jeff King
2025-05-22  8:50         ` Karthik Nayak
2025-05-22 15:31           ` Jeff King
