Re: [PATCH] packfile: skip decompressing and hashing blobs in add_promisor_object()

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Aaron Plattner <aplattner@nvidia.com>
To: Jeff King <peff@peff.net>
Cc: git@vger.kernel.org
Subject: Re: [PATCH] packfile: skip decompressing and hashing blobs in add_promisor_object()
Date: Fri, 5 Dec 2025 13:56:23 -0800	[thread overview]
Message-ID: <4bd18399-26b3-44cd-93a7-8d2d32bef709@nvidia.com> (raw)
In-Reply-To: <20251205212839.GA35153@coredump.intra.peff.net>

On 12/5/25 1:28 PM, Jeff King wrote:
> On Fri, Dec 05, 2025 at 10:50:02AM -0800, Aaron Plattner wrote:
> 
>> Unfortunately, setting that flag doesn't seem to improve performance for me
>> because in parse_object_with_flags(), lookup_object() returns an obj pointer
>> with obj->parsed == 0 and obj->type == OBJ_NONE. So it skips this block and
>> ends up inflating the object anyway:
>>
>> 	if ((!obj || obj->type == OBJ_BLOB) &&
>> 	    odb_read_object_info(r->objects, oid, NULL) == OBJ_BLOB) {
>> 		if (!skip_hash && stream_object_signature(r, repl) < 0) {
>> 			error(_("hash mismatch %s"), oid_to_hex(oid));
>> 			return NULL;
>> 		}
>> 		parse_blob_buffer(lookup_blob(r, oid));
>> 		return lookup_object(r, oid);
>> 	}
>>
>> I was confused about why the check was structured that way, but reading the
>> description of commit 8db2dad7a045e376b9c4f51ddd33da43c962e3a4 cleared that
>> up. Thank you for thoroughly documenting that!
>>
>> Are OBJ_NONE objects expected here? Should the check be
>>
>> 	if ((!obj || obj->type == OBJ_NONE || obj->type == OBJ_BLOB) &&
>> 	    odb_read_object_info(r->objects, oid, NULL) == OBJ_BLOB) {
>> 	    [...]
>> 	}
>>
>> ?
> 
> Yeah, that feels like a bug to me. The idea of that conditional is
> "could it be a blob?" and obviously OBJ_NONE does not rule that out.
> 
> I do wonder how you end up with OBJ_NONE, though. That implies somebody
> created the "struct object" but without knowing which type it was
> supposed to be, and then did not follow up by actually parsing it.

If I'm understanding correctly, this loop creates a dummy struct object 
for every object in the promisor packs:

	if (revs->exclude_promisor_objects) {
		for_each_packed_object(revs->repo, mark_uninteresting, revs,
				       FOR_EACH_OBJECT_PROMISOR_ONLY);
	}

Backtrace for one such object:

#0   create_object
#1   lookup_unknown_object
#2   mark_uninteresting
#3   for_each_object_in_pack
#4   for_each_packed_object
#5   prepare_revision_walk
#6   cmd_rev_list
#7   run_builtin
#8   handle_builtin
#9   cmd_main
#10  main

Then the is_promisor_object() loop finds these dummy objects when it 
loops over all the objects again.

> That's probably immaterial to what parse_object() should be doing, but
> it is certainly a curiosity. And I'm also not sure why I got good
> results from my rev-list invocation, but you did not. Weird.

Yeah, that's still a mystery.

> I think we could probably proceed without satisfying our curiosity here,
> but if you felt like it, it would be interesting to find such an object
> that is fed with OBJ_NONE to parse_object(), then run the command in a
> debugger trying to break on the original create_object() call that
> matches that oid. (Or if you want to be fancy use a reverse debugger
> like rr). I might play around with it and see if I can stimulate it.
> 
>> If I make that change combined with your PARSE_OBJECT_SKIP_HASH_CHECK change
>> then the time drops to 1:58, so that's great!
> 
> Cool, though I think that's about the same that you got with your patch?

It's a bit better than my patch and yours seems cleaner. I'll put this 
together as a v2 of this patch.

> I was hoping for a little bit more from skipping the hash checks and
> commits, but maybe:
> 
>    1. Your commit/tree structure is dominated much more by the blobs than
>       the linux.git I used for testing. So there's not much extra gain to
>       be had.

That's certainly possible. Let's just say this codebase has a lot of 
cruft in it. :)

>    2. You didn't have a commit-graph built.

This repository came from "scalar clone" and then I created a worktree 
and disabled sparse checkout. I didn't do anything special to enable or 
disable commit-graph.

What I do notice is that usually, a `git pull` from the server this 
repository is hosted on is fast, but occasionally it hits this 
pathological case. I was using git-rev-list as a proxy for what git-pull 
was getting stuck on. Is it possible that having a working commit-graph 
is what avoids the problem in the first place? I'll admit to not having 
a great understanding of how the commit graph is used during a normal pull.

> -Peff

next prev parent reply	other threads:[~2025-12-05 21:56 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-12-04 17:21 [PATCH] packfile: skip decompressing and hashing blobs in add_promisor_object() Aaron Plattner
2025-12-05 12:36 ` Patrick Steinhardt
2025-12-05 16:55   ` Aaron Plattner
2025-12-05 17:59     ` Jeff King
2025-12-05 17:48 ` Jeff King
2025-12-05 18:01   ` Jeff King
2025-12-05 18:50     ` Aaron Plattner
2025-12-05 21:28       ` Jeff King
2025-12-05 21:56         ` Aaron Plattner [this message]
2025-12-06  1:58           ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4bd18399-26b3-44cd-93a7-8d2d32bef709@nvidia.com \
    --to=aplattner@nvidia.com \
    --cc=git@vger.kernel.org \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.