public inbox for git@vger.kernel.org
 help / color / mirror / Atom feed
From: Justin Tobler <jltobler@gmail.com>
To: Patrick Steinhardt <ps@pks.im>
Cc: git@vger.kernel.org, Matt Smiley <msmiley@gitlab.com>
Subject: Re: [PATCH 1/3] commit: avoid parsing non-commits in `lookup_commit_reference_gently()`
Date: Wed, 18 Feb 2026 12:26:21 -0600	[thread overview]
Message-ID: <aZX4w8C5In9gEF33@denethor> (raw)
In-Reply-To: <20260216-b4-pks-receive-pack-optimize-shallow-v1-1-e98886daff2b@pks.im>

On 26/02/16 04:38PM, Patrick Steinhardt wrote:
> The function `lookup_commit_reference_gently()` can be used to look up a
> committish by object ID. As such, the function knows to peel for example
> tag objects so that we eventually end up with the commit.
> 
> The function is used quite a lot throughout our tree. One such user is
> "shallow.c" via `assign_shallow_commits_to_refs()`. The intent of this
> function is to figure out whether a shallow push is missing any objects
> that are required to satisfy the ref updates, and if so, which of the
> ref updates is missing objects.
> 
> This is done by painting the tree with `UNINTERESTING`. We start
> painting by calling `refs_for_each_ref()` so that we can mark all
> existing referenced objects as the boundary of objects that we already
> have, and which are supposed to be fully connected. The reference tips
> are then parsed via `lookup_commit_reference_gently()`, and the commit
> commit is then marked as uninteresting.

s/commit commit/commit/

> But references may not necessarily point to a committish, and if a lot
> of them aren't then this step takes a lot of time. This is mostly due to
> the way that `lookup_commit_reference_gently()` is implemented: before
> we learn about the type of the object we already call `parse_object()`
> on the object ID. This has two consequences:
> 
>   - We parse all objects, including trees and blobs, even though we
>     don't even need the contents of them.
> 
>   - More importantly though, `parse_object()` will cause us to check
>     whether the object ID matches its contents.
> 
> Combined this means that we deflate and hash every non-committish
> object, and that of course ends up being both CPU- and memory-intensive.

IIUC, the `mark_uninteresting()` callback used by `refs_for_each_ref()`
is resulting in resolved/peeled object being parsed regardless of its
type. This is wasteful though because we should only care about commits.
This makes sense.

> Improve the logic so that we first use `peel_object()`. This function
> won't parse the object for us, and thus it allows us to learn about the
> object's type before we parse and return it.
> 
> The following benchmark pushes a single object from a shallow clone into
> a repository that has 100,000 refs. These refs were created by listing
> all objects via `git rev-list(1) --objects --all` and creating refs for
> a subset of them, so lots of those refs will cover non-commit objects.
> 
>   Benchmark 1: git-receive-pack (rev = HEAD~)
>     Time (mean ± σ):     62.571 s ±  0.413 s    [User: 58.331 s, System: 4.053 s]
>     Range (min … max):   62.191 s … 63.010 s    3 runs
> 
>   Benchmark 2: git-receive-pack (rev = HEAD)
>     Time (mean ± σ):     38.339 s ±  0.192 s    [User: 36.220 s, System: 1.992 s]
>     Range (min … max):   38.176 s … 38.551 s    3 runs
> 
>   Summary
>     git-receive-pack . </tmp/input (rev = HEAD) ran
>       1.63 ± 0.01 times faster than git-receive-pack . </tmp/input (rev = HEAD~)
> 
> This leads to a sizeable speedup as we now skip reading and parsing
> non-commit objects. Before this change we spent around 40% of the time
> in `assign_shallow_commits_to_refs()`, after the change we only spend
> around 1.2% of the time in there. Almost the entire remainder of the
> time is spent in git-rev-list(1) to perform the connectivity checks.

Nice!

> Despite the speedup though, this also leads to a massive reduction in
> allocations. Before:
> 
>   HEAP SUMMARY:
>       in use at exit: 352,480,441 bytes in 97,185 blocks
>     total heap usage: 2,793,820 allocs, 2,696,635 frees, 67,271,456,983 bytes allocated
> 
> And after:
> 
>   HEAP SUMMARY:
>       in use at exit: 17,524,978 bytes in 22,393 blocks
>     total heap usage: 33,313 allocs, 10,920 frees, 407,774,251 bytes allocated
> 
> Note that when all references refer to commits performance stays roughly
> the same, as expected. The following benchmark was executed with 600k
> commits:
> 
>   Benchmark 1: git-receive-pack (rev = HEAD~)
>     Time (mean ± σ):      9.101 s ±  0.006 s    [User: 8.800 s, System: 0.520 s]
>     Range (min … max):    9.095 s …  9.106 s    3 runs
> 
>   Benchmark 2: git-receive-pack (rev = HEAD)
>     Time (mean ± σ):      9.128 s ±  0.094 s    [User: 8.820 s, System: 0.522 s]
>     Range (min … max):    9.019 s …  9.188 s    3 runs
> 
>   Summary
>     git-receive-pack (rev = HEAD~) ran
>       1.00 ± 0.01 times faster than git-receive-pack (rev = HEAD)
> 
> This will be improved in the next commit.
> 
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  commit.c | 32 +++++++++++++++++++++++++++-----
>  object.c | 23 ++++++++++++++++++-----
>  object.h |  5 +++++
>  3 files changed, 50 insertions(+), 10 deletions(-)
> 
> diff --git a/commit.c b/commit.c
> index 9bb471d217..b7c4ec2eb5 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -43,13 +43,35 @@ const char *commit_type = "commit";
>  struct commit *lookup_commit_reference_gently(struct repository *r,
>  		const struct object_id *oid, int quiet)
>  {
> -	struct object *obj = deref_tag(r,
> -				       parse_object(r, oid),
> -				       NULL, 0);

Here we drop the use of `deref_tag()` which required us to
unconfitionally parse the object.

> +	const struct object_id *maybe_peeled;
> +	struct object_id peeled_oid;
> +	struct object *object;
> +	enum object_type type;
>  
> -	if (!obj)
> +	switch (peel_object_ext(r, oid, &peeled_oid, 0, &type)) {

Here we peel the object and also get its type.

> +	case PEEL_NON_TAG:
> +		maybe_peeled = oid;
> +		break;
> +	case PEEL_PEELED:
> +		maybe_peeled = &peeled_oid;
> +		break;
> +	default:
>  		return NULL;
> -	return object_as_type(obj, OBJ_COMMIT, quiet);
> +	}
> +
> +	if (type != OBJ_COMMIT) {
> +		if (!quiet)
> +			error(_("object %s is a %s, not a %s"),
> +			      oid_to_hex(oid), type_name(type),
> +			      type_name(OBJ_COMMIT));
> +		return NULL;
> +	}
> +
> +	object = parse_object(r, maybe_peeled);
> +	if (!object)
> +		return NULL;

Now we only parse the object if it is a commit.

> +
> +	return object_as_type(object, OBJ_COMMIT, quiet);
>  }
>  
>  struct commit *lookup_commit_reference(struct repository *r, const struct object_id *oid)
> diff --git a/object.c b/object.c
> index 4669b8d65e..99b6df3780 100644
> --- a/object.c
> +++ b/object.c
> @@ -207,10 +207,11 @@ struct object *lookup_object_by_type(struct repository *r,
>  	}
>  }
>  
> -enum peel_status peel_object(struct repository *r,
> -			     const struct object_id *name,
> -			     struct object_id *oid,
> -			     unsigned flags)
> +enum peel_status peel_object_ext(struct repository *r,
> +				 const struct object_id *name,
> +				 struct object_id *oid,
> +				 unsigned flags,
> +				 enum object_type *typep)
>  {
>  	struct object *o = lookup_unknown_object(r, name);
>  
> @@ -220,8 +221,10 @@ enum peel_status peel_object(struct repository *r,
>  			return PEEL_INVALID;
>  	}
>  
> -	if (o->type != OBJ_TAG)
> +	if (o->type != OBJ_TAG) {
> +		*typep = o->type;
>  		return PEEL_NON_TAG;
> +	}
>  
>  	while (o && o->type == OBJ_TAG) {
>  		o = parse_object(r, &o->oid);
> @@ -241,9 +244,19 @@ enum peel_status peel_object(struct repository *r,
>  		return PEEL_INVALID;
>  
>  	oidcpy(oid, &o->oid);
> +	*typep = o->type;

Here we are modifying the existing `peel_object()` to also provide
object type info back to the caller. With this, we can avoid the
unconditional object parsing in `lookup_commit_reference_gently()`.
Looks good.

-Justin

  reply	other threads:[~2026-02-18 18:26 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-16 15:38 [PATCH 0/3] git-receive-pack(1): optimize `assign_shallow_commits_to_refs()` Patrick Steinhardt
2026-02-16 15:38 ` [PATCH 1/3] commit: avoid parsing non-commits in `lookup_commit_reference_gently()` Patrick Steinhardt
2026-02-18 18:26   ` Justin Tobler [this message]
2026-02-19 17:35     ` Junio C Hamano
2026-02-16 15:38 ` [PATCH 2/3] commit: make `repo_parse_commit_no_graph()` more robust Patrick Steinhardt
2026-02-18 19:23   ` Justin Tobler
2026-02-19  7:18     ` Patrick Steinhardt
2026-02-16 15:38 ` [PATCH 3/3] commit: use commit graph in `lookup_commit_reference_gently()` Patrick Steinhardt
2026-02-18 19:34   ` Justin Tobler

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aZX4w8C5In9gEF33@denethor \
    --to=jltobler@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=msmiley@gitlab.com \
    --cc=ps@pks.im \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox