From: Aaron Plattner <aplattner@nvidia.com>
To: Jeff King <peff@peff.net>
Cc: git@vger.kernel.org
Subject: Re: [PATCH] packfile: skip decompressing and hashing blobs in add_promisor_object()
Date: Fri, 5 Dec 2025 13:56:23 -0800
Message-ID: <4bd18399-26b3-44cd-93a7-8d2d32bef709@nvidia.com>
In-Reply-To: <20251205212839.GA35153@coredump.intra.peff.net>
On 12/5/25 1:28 PM, Jeff King wrote:
> On Fri, Dec 05, 2025 at 10:50:02AM -0800, Aaron Plattner wrote:
>
>> Unfortunately, setting that flag doesn't seem to improve performance for me
>> because in parse_object_with_flags(), lookup_object() returns an obj pointer
>> with obj->parsed == 0 and obj->type == OBJ_NONE. So it skips this block and
>> ends up inflating the object anyway:
>>
>> 	if ((!obj || obj->type == OBJ_BLOB) &&
>> 	    odb_read_object_info(r->objects, oid, NULL) == OBJ_BLOB) {
>> 		if (!skip_hash && stream_object_signature(r, repl) < 0) {
>> 			error(_("hash mismatch %s"), oid_to_hex(oid));
>> 			return NULL;
>> 		}
>> 		parse_blob_buffer(lookup_blob(r, oid));
>> 		return lookup_object(r, oid);
>> 	}
>>
>> I was confused about why the check was structured that way, but reading the
>> description of commit 8db2dad7a045e376b9c4f51ddd33da43c962e3a4 cleared that
>> up. Thank you for thoroughly documenting that!
>>
>> Are OBJ_NONE objects expected here? Should the check be
>>
>> 	if ((!obj || obj->type == OBJ_NONE || obj->type == OBJ_BLOB) &&
>> 	    odb_read_object_info(r->objects, oid, NULL) == OBJ_BLOB) {
>> 		[...]
>> 	}
>>
>> ?
>
> Yeah, that feels like a bug to me. The idea of that conditional is
> "could it be a blob?" and obviously OBJ_NONE does not rule that out.
>
> I do wonder how you end up with OBJ_NONE, though. That implies somebody
> created the "struct object" but without knowing which type it was
> supposed to be, and then did not follow up by actually parsing it.
If I'm understanding correctly, this loop creates a dummy struct object
for every object in the promisor packs:

	if (revs->exclude_promisor_objects) {
		for_each_packed_object(revs->repo, mark_uninteresting, revs,
				       FOR_EACH_OBJECT_PROMISOR_ONLY);
	}
Backtrace for one such object:

	#0  create_object
	#1  lookup_unknown_object
	#2  mark_uninteresting
	#3  for_each_object_in_pack
	#4  for_each_packed_object
	#5  prepare_revision_walk
	#6  cmd_rev_list
	#7  run_builtin
	#8  handle_builtin
	#9  cmd_main
	#10 main

Then the is_promisor_object() loop finds these dummy objects when it
loops over all the objects again.
> That's probably immaterial to what parse_object() should be doing, but
> it is certainly a curiosity. And I'm also not sure why I got good
> results from my rev-list invocation, but you did not. Weird.

Yeah, that's still a mystery.
> I think we could probably proceed without satisfying our curiosity here,
> but if you felt like it, it would be interesting to find such an object
> that is fed with OBJ_NONE to parse_object(), then run the command in a
> debugger trying to break on the original create_object() call that
> matches that oid. (Or if you want to be fancy use a reverse debugger
> like rr). I might play around with it and see if I can stimulate it.
>
>> If I make that change combined with your PARSE_OBJECT_SKIP_HASH_CHECK change
>> then the time drops to 1:58, so that's great!
>
> Cool, though I think that's about the same that you got with your patch?

It's a bit better than my patch, and yours seems cleaner. I'll put this
together as a v2 of this patch.
> I was hoping for a little bit more from skipping the hash checks and
> commits, but maybe:
>
> 1. Your commit/tree structure is dominated much more by the blobs than
> the linux.git I used for testing. So there's not much extra gain to
> be had.

That's certainly possible. Let's just say this codebase has a lot of
cruft in it. :)
> 2. You didn't have a commit-graph built.

This repository came from "scalar clone", and then I created a worktree
and disabled sparse checkout. I didn't do anything special to enable or
disable the commit-graph.

What I do notice is that usually, a `git pull` from the server this
repository is hosted on is fast, but occasionally it hits this
pathological case. I was using git-rev-list as a proxy for what git-pull
was getting stuck on. Is it possible that having a working commit-graph
is what avoids the problem in the first place? I'll admit to not having
a great understanding of how the commit graph is used during a normal pull.
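For reference (this is not from the thread, just standard git commands): you can check whether a repository has a commit-graph and build one by hand. The sketch below creates a throwaway repo so the commands are self-contained; in a real repo you would run only the last three commands.

```shell
# Set up a disposable repository for demonstration purposes.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m 'demo commit'

# A commit-graph lives under .git/objects/info/ (a single "commit-graph"
# file, or a "commit-graphs/" chain when written with --split).
ls .git/objects/info/commit-graph* 2>/dev/null || echo 'no commit-graph yet'

# Build one covering all reachable commits:
git commit-graph write --reachable

# Keep it up to date on every fetch:
git config fetch.writeCommitGraph true

ls .git/objects/info/commit-graph
```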
> -Peff
Thread overview: 10+ messages
2025-12-04 17:21 [PATCH] packfile: skip decompressing and hashing blobs in add_promisor_object() Aaron Plattner
2025-12-05 12:36 ` Patrick Steinhardt
2025-12-05 16:55 ` Aaron Plattner
2025-12-05 17:59 ` Jeff King
2025-12-05 17:48 ` Jeff King
2025-12-05 18:01 ` Jeff King
2025-12-05 18:50 ` Aaron Plattner
2025-12-05 21:28 ` Jeff King
2025-12-05 21:56 ` Aaron Plattner [this message]
2025-12-06 1:58 ` Jeff King