From: Aaron Plattner <aplattner@nvidia.com>
To: Jeff King <peff@peff.net>
Cc: git@vger.kernel.org
Subject: Re: [PATCH] packfile: skip decompressing and hashing blobs in add_promisor_object()
Date: Fri, 5 Dec 2025 13:56:23 -0800
Message-ID: <4bd18399-26b3-44cd-93a7-8d2d32bef709@nvidia.com>
In-Reply-To: <20251205212839.GA35153@coredump.intra.peff.net>
On 12/5/25 1:28 PM, Jeff King wrote:
> On Fri, Dec 05, 2025 at 10:50:02AM -0800, Aaron Plattner wrote:
>
>> Unfortunately, setting that flag doesn't seem to improve performance for me
>> because in parse_object_with_flags(), lookup_object() returns an obj pointer
>> with obj->parsed == 0 and obj->type == OBJ_NONE. So it skips this block and
>> ends up inflating the object anyway:
>>
>> 	if ((!obj || obj->type == OBJ_BLOB) &&
>> 	    odb_read_object_info(r->objects, oid, NULL) == OBJ_BLOB) {
>> 		if (!skip_hash && stream_object_signature(r, repl) < 0) {
>> 			error(_("hash mismatch %s"), oid_to_hex(oid));
>> 			return NULL;
>> 		}
>> 		parse_blob_buffer(lookup_blob(r, oid));
>> 		return lookup_object(r, oid);
>> 	}
>>
>> I was confused about why the check was structured that way, but reading the
>> description of commit 8db2dad7a045e376b9c4f51ddd33da43c962e3a4 cleared that
>> up. Thank you for thoroughly documenting that!
>>
>> Are OBJ_NONE objects expected here? Should the check be
>>
>> 	if ((!obj || obj->type == OBJ_NONE || obj->type == OBJ_BLOB) &&
>> 	    odb_read_object_info(r->objects, oid, NULL) == OBJ_BLOB) {
>> 		[...]
>> 	}
>>
>> ?
>
> Yeah, that feels like a bug to me. The idea of that conditional is
> "could it be a blob?" and obviously OBJ_NONE does not rule that out.
>
> I do wonder how you end up with OBJ_NONE, though. That implies somebody
> created the "struct object" but without knowing which type it was
> supposed to be, and then did not follow up by actually parsing it.
If I'm understanding correctly, this loop creates a dummy struct object
for every object in the promisor packs:

	if (revs->exclude_promisor_objects) {
		for_each_packed_object(revs->repo, mark_uninteresting, revs,
				       FOR_EACH_OBJECT_PROMISOR_ONLY);
	}
Backtrace for one such object:

	#0  create_object
	#1  lookup_unknown_object
	#2  mark_uninteresting
	#3  for_each_object_in_pack
	#4  for_each_packed_object
	#5  prepare_revision_walk
	#6  cmd_rev_list
	#7  run_builtin
	#8  handle_builtin
	#9  cmd_main
	#10 main

Then the is_promisor_object() loop finds these dummy objects when it
loops over all the objects again.
> That's probably immaterial to what parse_object() should be doing, but
> it is certainly a curiosity. And I'm also not sure why I got good
> results from my rev-list invocation, but you did not. Weird.

Yeah, that's still a mystery.
> I think we could probably proceed without satisfying our curiosity here,
> but if you felt like it, it would be interesting to find such an object
> that is fed with OBJ_NONE to parse_object(), then run the command in a
> debugger trying to break on the original create_object() call that
> matches that oid. (Or if you want to be fancy use a reverse debugger
> like rr). I might play around with it and see if I can stimulate it.
>
>> If I make that change combined with your PARSE_OBJECT_SKIP_HASH_CHECK change
>> then the time drops to 1:58, so that's great!
>
> Cool, though I think that's about the same that you got with your patch?

It's a bit better than my patch, and yours seems cleaner. I'll put this
together as a v2 of this patch.
> I was hoping for a little bit more from skipping the hash checks and
> commits, but maybe:
>
> 1. Your commit/tree structure is dominated much more by the blobs than
> the linux.git I used for testing. So there's not much extra gain to
> be had.

That's certainly possible. Let's just say this codebase has a lot of
cruft in it. :)
> 2. You didn't have a commit-graph built.

This repository came from "scalar clone", and then I created a worktree
and disabled sparse checkout. I didn't do anything special to enable or
disable the commit-graph.

What I do notice is that usually, a `git pull` from the server this
repository is hosted on is fast, but occasionally it hits this
pathological case. I was using git-rev-list as a proxy for what git-pull
was getting stuck on. Is it possible that having a working commit-graph
is what avoids the problem in the first place? I'll admit to not having
a great understanding of how the commit graph is used during a normal pull.
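For reference (this is not from the thread, just standard git commands): you can check whether a repository has a commit-graph and build one by hand. The sketch below creates a throwaway repo so the commands are self-contained; in a real repo you would run only the last three commands.

```shell
# Set up a disposable repository for demonstration purposes.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m 'demo commit'

# A commit-graph lives under .git/objects/info/ (a single "commit-graph"
# file, or a "commit-graphs/" chain when written with --split).
ls .git/objects/info/commit-graph* 2>/dev/null || echo 'no commit-graph yet'

# Build one covering all reachable commits:
git commit-graph write --reachable

# Keep it up to date on every fetch:
git config fetch.writeCommitGraph true

ls .git/objects/info/commit-graph
```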
> -Peff
Thread overview: 10+ messages
2025-12-04 17:21 [PATCH] packfile: skip decompressing and hashing blobs in add_promisor_object() Aaron Plattner
2025-12-05 12:36 ` Patrick Steinhardt
2025-12-05 16:55 ` Aaron Plattner
2025-12-05 17:59 ` Jeff King
2025-12-05 17:48 ` Jeff King
2025-12-05 18:01 ` Jeff King
2025-12-05 18:50 ` Aaron Plattner
2025-12-05 21:28 ` Jeff King
2025-12-05 21:56 ` Aaron Plattner [this message]
2025-12-06 1:58 ` Jeff King