[PATCH] packfile: skip decompressing and hashing blobs in add_promisor_object()
From: Aaron Plattner @ 2025-12-04 17:21 UTC
To: git; +Cc: Aaron Plattner

When is_promisor_object() is called for the first time, it lazily
initializes a set of all promisor objects by iterating through all
objects in promisor packs. For each object, add_promisor_object() calls
parse_object(), which decompresses and hashes the entire object.

For repositories with large pack files, this can take an extremely long
time. For example, on a production repository with a 176 GB promisor
pack:

$ time ~/git/git/git-rev-list --objects --all --exclude-promisor-objects --quiet
________________________________________________________
Executed in   76.10 mins      fish          external
   usr time   72.10 mins     1.83 millis   72.10 mins
   sys time    3.56 mins     0.17 millis    3.56 mins

add_promisor_object() needs the full object for trees, commits, and
tags. But blobs contain no references to other objects, so the function
can just insert their oids into the set and move on.

For objects that weren't already parsed, use odb_read_object_info() to
query the object type. If it's a blob, just insert it into the oidset
without parsing it. This improves performance for very large pack files
significantly:

$ time ~/git/git/git-rev-list --objects --all --exclude-promisor-objects --quiet
________________________________________________________
Executed in  118.76 secs      fish          external
   usr time   50.88 secs    11.02 millis   50.87 secs
   sys time   36.31 secs     0.08 millis   36.31 secs

Signed-off-by: Aaron Plattner <aplattner@nvidia.com>
---
 packfile.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/packfile.c b/packfile.c
index 9cc11b6dc5..563fd14f0e 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2309,6 +2309,17 @@ static int add_promisor_object(const struct object_id *oid,
         if (obj && obj->parsed) {
                we_parsed_object = 0;
        } else {
+               /*
+                * Blobs don't reference other objects, so skip parsing them
+                * to save time.
+                */
+               enum object_type type;
+               type = odb_read_object_info(pack->repo->objects, oid, NULL);
+               if (type == OBJ_BLOB) {
+                       oidset_insert(set, oid);
+                       return 0;
+               }
+
                we_parsed_object = 1;
                obj = parse_object(pack->repo, oid);
        }
--
2.52.0

Re: [PATCH] packfile: skip decompressing and hashing blobs in add_promisor_object()
From: Patrick Steinhardt @ 2025-12-05 12:36 UTC
To: Aaron Plattner; +Cc: git

On Thu, Dec 04, 2025 at 09:21:29AM -0800, Aaron Plattner wrote:
> When is_promisor_object() is called for the first time, it lazily
> initializes a set of all promisor objects by iterating through all
> objects in promisor packs. For each object, add_promisor_object() calls
> parse_object(), which decompresses and hashes the entire object.
> [...]
> @@ -2309,6 +2309,17 @@ static int add_promisor_object(const struct object_id *oid,
>         if (obj && obj->parsed) {
>                 we_parsed_object = 0;
>         } else {
> +               /*
> +                * Blobs don't reference other objects, so skip parsing them
> +                * to save time.
> +                */
> +               enum object_type type;
> +               type = odb_read_object_info(pack->repo->objects, oid, NULL);
> +               if (type == OBJ_BLOB) {
> +                       oidset_insert(set, oid);
> +                       return 0;
> +               }
> +
>                 we_parsed_object = 1;
>                 obj = parse_object(pack->repo, oid);
>         }

This looks like an obvious improvement to me. While looking at the
function I noticed that we also free tree buffers here, which makes
sense to reduce memory usage. I was wondering whether we want to do the
same for commit buffers, which may reduce memory usage even further.

See for example the below patch. Might be worth it to test in your
large repository to see whether it has an effect on the maximum RSS.

Thanks!

Patrick

diff --git a/packfile.c b/packfile.c
index 9cc11b6dc5..89227ada98 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2296,12 +2296,17 @@ int for_each_packed_object(struct repository *repo, each_packed_object_fn cb,
        return r ? r : pack_errors;
 }
 
+struct add_promisor_object_data {
+       struct repository *repo;
+       struct oidset promisor_objects;
+};
+
 static int add_promisor_object(const struct object_id *oid,
                               struct packed_git *pack,
                               uint32_t pos UNUSED,
-                              void *set_)
+                              void *cb_data)
 {
-       struct oidset *set = set_;
+       struct add_promisor_object_data *data = cb_data;
        struct object *obj;
        int we_parsed_object;
 
@@ -2316,7 +2321,7 @@ static int add_promisor_object(const struct object_id *oid,
        if (!obj)
                return 1;
 
-       oidset_insert(set, oid);
+       oidset_insert(&data->promisor_objects, oid);
 
        /*
         * If this is a tree, commit, or tag, the objects it refers
@@ -2334,38 +2339,45 @@ static int add_promisor_object(const struct object_id *oid,
                         */
                        return 0;
                while (tree_entry_gently(&desc, &entry))
-                       oidset_insert(set, &entry.oid);
+                       oidset_insert(&data->promisor_objects, &entry.oid);
                if (we_parsed_object)
                        free_tree_buffer(tree);
        } else if (obj->type == OBJ_COMMIT) {
                struct commit *commit = (struct commit *) obj;
                struct commit_list *parents = commit->parents;
 
-               oidset_insert(set, get_commit_tree_oid(commit));
+               oidset_insert(&data->promisor_objects, get_commit_tree_oid(commit));
                for (; parents; parents = parents->next)
-                       oidset_insert(set, &parents->item->object.oid);
+                       oidset_insert(&data->promisor_objects, &parents->item->object.oid);
+
+               if (we_parsed_object)
+                       free_commit_buffer(data->repo->parsed_objects, commit);
        } else if (obj->type == OBJ_TAG) {
                struct tag *tag = (struct tag *) obj;
-               oidset_insert(set, get_tagged_oid(tag));
+               oidset_insert(&data->promisor_objects, get_tagged_oid(tag));
        }
        return 0;
 }
 
 int is_promisor_object(struct repository *r, const struct object_id *oid)
 {
-       static struct oidset promisor_objects;
+       static struct add_promisor_object_data data = {
+               .promisor_objects = OIDSET_INIT,
+       };
        static int promisor_objects_prepared;
 
        if (!promisor_objects_prepared) {
+               data.repo = r;
                if (repo_has_promisor_remote(r)) {
                        for_each_packed_object(r, add_promisor_object,
-                                              &promisor_objects,
+                                              &data,
                                               FOR_EACH_OBJECT_PROMISOR_ONLY |
                                               FOR_EACH_OBJECT_PACK_ORDER);
                }
                promisor_objects_prepared = 1;
+               data.repo = NULL;
        }
-       return oidset_contains(&promisor_objects, oid);
+       return oidset_contains(&data.promisor_objects, oid);
 }
 
 int parse_pack_header_option(const char *in, unsigned char *out, unsigned int *len)

Re: [PATCH] packfile: skip decompressing and hashing blobs in add_promisor_object()
From: Aaron Plattner @ 2025-12-05 16:55 UTC
To: Patrick Steinhardt; +Cc: git

On 12/5/25 4:36 AM, Patrick Steinhardt wrote:
> This looks like an obvious improvement to me. While looking at the
> function I noticed that we also free tree buffers here, which makes
> sense to reduce memory usage. I was wondering whether we want to do the
> same for commit buffers, which may reduce memory usage even further.
>
> See for example the below patch. Might be worth it to test in your
> large repository to see whether it has an effect on the maximum RSS.
> [...]
> +
> +               if (we_parsed_object)
> +                       free_commit_buffer(data->repo->parsed_objects, commit);

Isn't data->repo the same as pack->repo here? I think this patch can be
simplified to just

diff --git a/packfile.c b/packfile.c
index 563fd14f0e..ac31077d0b 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2355,6 +2355,9 @@ static int add_promisor_object(const struct object_id *oid,
                oidset_insert(set, get_commit_tree_oid(commit));
                for (; parents; parents = parents->next)
                        oidset_insert(set, &parents->item->object.oid);
+
+               if (we_parsed_object)
+                       free_commit_buffer(pack->repo->parsed_objects, commit);
        } else if (obj->type == OBJ_TAG) {
                struct tag *tag = (struct tag *) obj;
                oidset_insert(set, get_tagged_oid(tag));
--

That said, the memory footprint improvement seems pretty minimal with
this change:

Without free_commit_buffer():

$ /usr/bin/time ~/git/git/git-rev-list --objects --all --exclude-promisor-objects --quiet
66.19user 38.97system 2:17.46elapsed 76%CPU (0avgtext+0avgdata 8171072maxresident)k
307985728inputs+0outputs (151871major+1067727minor)pagefaults 0swaps

With free_commit_buffer():

$ /usr/bin/time ~/git/git/git-rev-list --objects --all --exclude-promisor-objects --quiet
66.47user 40.08system 2:18.72elapsed 76%CPU (0avgtext+0avgdata 8135640maxresident)k
307820424inputs+0outputs (100432major+1065152minor)pagefaults 0swaps

I'm inclined not to worry about it for now.

-- Aaron

Re: [PATCH] packfile: skip decompressing and hashing blobs in add_promisor_object()
From: Jeff King @ 2025-12-05 17:59 UTC
To: Aaron Plattner; +Cc: Patrick Steinhardt, git

On Fri, Dec 05, 2025 at 08:55:19AM -0800, Aaron Plattner wrote:
> +               if (we_parsed_object)
> +                       free_commit_buffer(pack->repo->parsed_objects, commit);
> [...]
> That said, the memory footprint improvement seems pretty minimal with
> this change:
>
> Without free_commit_buffer():
>
> $ /usr/bin/time ~/git/git/git-rev-list --objects --all --exclude-promisor-objects --quiet
> 66.19user 38.97system 2:17.46elapsed 76%CPU (0avgtext+0avgdata 8171072maxresident)k
>
> With free_commit_buffer():
>
> $ /usr/bin/time ~/git/git/git-rev-list --objects --all --exclude-promisor-objects --quiet
> 66.47user 40.08system 2:18.72elapsed 76%CPU (0avgtext+0avgdata 8135640maxresident)k
>
> I'm inclined not to worry about it for now.

I don't think it will make a difference for that command because we
already turn off the "save_commit_buffer" global in rev-list, unless we
are going to show the contents. Perhaps:

  git rev-list --objects --all --format=%s

would show the difference.

If we take the skip-hash suggestion I wrote elsewhere in the thread,
then we would usually not load the commit contents in the first place.
But it could still be worth adding this free_commit_buffer() to catch
commits not covered by the commit-graph.

Another way to do it is to temporarily unset save_commit_buffer in
is_promisor_object() when we start walking the objects. That's how we
do it in other places, like 359b01ca84 (ref-filter: disable
save_commit_buffer while traversing, 2022-07-11). But in this case
since the traversal code is custom it's pretty easy to just free
immediately afterwards.
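
For illustration, the save_commit_buffer variant would be something
like this untested sketch, following the same pattern as 359b01ca84
(the function body is as it looks before Patrick's patch):

int is_promisor_object(struct repository *r, const struct object_id *oid)
{
        static struct oidset promisor_objects;
        static int promisor_objects_prepared;

        if (!promisor_objects_prepared) {
                int saved = save_commit_buffer;

                /* don't retain commit buffers while we walk every object */
                save_commit_buffer = 0;
                if (repo_has_promisor_remote(r)) {
                        for_each_packed_object(r, add_promisor_object,
                                               &promisor_objects,
                                               FOR_EACH_OBJECT_PROMISOR_ONLY |
                                               FOR_EACH_OBJECT_PACK_ORDER);
                }
                save_commit_buffer = saved;
                promisor_objects_prepared = 1;
        }
        return oidset_contains(&promisor_objects, oid);
}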

-Peff

Re: [PATCH] packfile: skip decompressing and hashing blobs in add_promisor_object()
From: Jeff King @ 2025-12-05 17:48 UTC
To: Aaron Plattner; +Cc: git

On Thu, Dec 04, 2025 at 09:21:29AM -0800, Aaron Plattner wrote:
> For repositories with large pack files, this can take an extremely long
> time. For example, on a production repository with a 176 GB promisor
> pack:
>
> $ time ~/git/git/git-rev-list --objects --all --exclude-promisor-objects --quiet
> ________________________________________________________
> Executed in   76.10 mins      fish          external
> [...]

FWIW, I had a hard time replicating the results with this command,
because it won't necessarily call is_promisor_object(). It only does so
when it finds a missing object. But it also marks everything in the
promisor pack as UNINTERESTING from the start, so you need a
non-promisor commit that points to an object excluded from the narrow
clone.

An easier way to trigger it is with a fake oid like:

  rev-list --objects --all --exclude-promisor-objects $(perl -e 'print "1" x 40')

Then we have to check is_promisor_object() to know that the 111... oid
isn't really a promisor mentioned somewhere.

> For objects that weren't already parsed, use odb_read_object_info() to
> query the object type. If it's a blob, just insert it into the oidset
> without parsing it. This improves performance for very large pack files
> significantly:
>
> $ time ~/git/git/git-rev-list --objects --all --exclude-promisor-objects --quiet
> ________________________________________________________
> Executed in  118.76 secs      fish          external
> [...]

Yeah, this is obviously a good idea. This all seemed eerily familiar,
and I wondered if we weren't doing this already. But it looks like it
came up as "maybe we should do this" along with some other
optimizations, but we never did it.

Your 176GB (!) repository is obviously a good way to show off the
change. But I think we can see it even in a fresh clone of linux.git,
which (with my command above) goes from ~7.5 minutes to under 2 minutes
with your patch.

But I have an idea that makes your patch a little simpler and should
give us another little speed bump.

> +               /*
> +                * Blobs don't reference other objects, so skip parsing them
> +                * to save time.
> +                */
> +               enum object_type type;
> +               type = odb_read_object_info(pack->repo->objects, oid, NULL);
> +               if (type == OBJ_BLOB) {
> +                       oidset_insert(set, oid);
> +                       return 0;
> +               }

OK, so we are checking the type up front and then skipping
parse_object() if we can. But there is already some logic inside
parse_object() for these kinds of optimizations. If we tell it we are
not interested in checking the hash of the objects, then it knows it
can skip loading the blob entirely.

But it can _also_ use that flag for other things, like using the
commit-graph rather than loading individual commit objects.
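
The relevant bit of parse_object_with_flags() in object.c looks roughly
like this (abridged from memory here, so the exact shape may differ
between versions):

        int skip_hash = !!(flags & PARSE_OBJECT_SKIP_HASH_CHECK);
        ...
        if (skip_hash) {
                /*
                 * With the hash check waived, a commit can be served
                 * straight out of the commit-graph, without ever
                 * loading the object contents.
                 */
                struct commit *commit = lookup_commit_in_graph(r, repl);
                if (commit)
                        return &commit->object;
        }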

So doing this:

diff --git a/packfile.c b/packfile.c
index 9cc11b6dc5..01b992a4e1 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2310,7 +2310,8 @@ static int add_promisor_object(const struct object_id *oid,
                we_parsed_object = 0;
        } else {
                we_parsed_object = 1;
-               obj = parse_object(pack->repo, oid);
+               obj = parse_object_with_flags(pack->repo, oid,
+                                             PARSE_OBJECT_SKIP_HASH_CHECK);
        }
 
        if (!obj)

drops my linux.git case down to 49s. It's skipping the blobs (with no
need for your patch) and loading the commits out of the graph file.
Note that you may need to "git commit-graph write --reachable" to see
the effect (I think we do generate graphs by default in git-gc these
days, but I'm not sure if we do so right after cloning).

-Peff

Re: [PATCH] packfile: skip decompressing and hashing blobs in add_promisor_object()
From: Jeff King @ 2025-12-05 18:01 UTC
To: Aaron Plattner; +Cc: git

On Fri, Dec 05, 2025 at 12:48:54PM -0500, Jeff King wrote:
> OK, so we are checking the type up front and then skipping
> parse_object() if we can. But there is already some logic inside
> parse_object() for these kinds of optimizations. If we tell it we are
> not interested in checking the hash of the objects, then it knows it
> can skip loading the blob entirely.
> [...]
> drops my linux.git case down to 49s. It's skipping the blobs (with no
> need for your patch) and loading the commits out of the graph file.

Oh, and obviously it is skipping the hash computation on the objects,
too. That's probably not as important as avoiding the object loads in
the first place, but it may also be making a measurable difference on
the ones we do load (notably trees here).

-Peff

Re: [PATCH] packfile: skip decompressing and hashing blobs in add_promisor_object()
From: Aaron Plattner @ 2025-12-05 18:50 UTC
To: Jeff King; +Cc: git

On 12/5/25 10:01 AM, Jeff King wrote:
> On Fri, Dec 05, 2025 at 12:48:54PM -0500, Jeff King wrote:
>> OK, so we are checking the type up front and then skipping
>> parse_object() if we can. But there is already some logic inside
>> parse_object() for these kinds of optimizations. If we tell it we are
>> not interested in checking the hash of the objects, then it knows it
>> can skip loading the blob entirely.
>>
>> [...]
>> -               obj = parse_object(pack->repo, oid);
>> +               obj = parse_object_with_flags(pack->repo, oid,
>> +                                             PARSE_OBJECT_SKIP_HASH_CHECK);
>> [...]
>> drops my linux.git case down to 49s. It's skipping the blobs (with no
>> need for your patch) and loading the commits out of the graph file.
>> Note that you may need to "git commit-graph write --reachable" to see
>> the effect (I think we do generate graphs by default in git-gc these
>> days, but I'm not sure if we do so right after cloning).

Thanks! I had looked at PARSE_OBJECT_SKIP_HASH_CHECK but it wasn't
obvious to me whether it could be used here or not.

Unfortunately, setting that flag doesn't seem to improve performance
for me because in parse_object_with_flags(), lookup_object() returns an
obj pointer with obj->parsed == 0 and obj->type == OBJ_NONE. So it
skips this block and ends up inflating the object anyway:

        if ((!obj || obj->type == OBJ_BLOB) &&
            odb_read_object_info(r->objects, oid, NULL) == OBJ_BLOB) {
                if (!skip_hash && stream_object_signature(r, repl) < 0) {
                        error(_("hash mismatch %s"), oid_to_hex(oid));
                        return NULL;
                }
                parse_blob_buffer(lookup_blob(r, oid));
                return lookup_object(r, oid);
        }

I was confused about why the check was structured that way, but reading
the description of commit 8db2dad7a045e376b9c4f51ddd33da43c962e3a4
cleared that up. Thank you for thoroughly documenting that!

Are OBJ_NONE objects expected here? Should the check be

        if ((!obj || obj->type == OBJ_NONE || obj->type == OBJ_BLOB) &&
            odb_read_object_info(r->objects, oid, NULL) == OBJ_BLOB) {
                [...]
        }

?

If I make that change combined with your PARSE_OBJECT_SKIP_HASH_CHECK
change then the time drops to 1:58, so that's great!

Re: [PATCH] packfile: skip decompressing and hashing blobs in add_promisor_object()
From: Jeff King @ 2025-12-05 21:28 UTC
To: Aaron Plattner; +Cc: git

On Fri, Dec 05, 2025 at 10:50:02AM -0800, Aaron Plattner wrote:
> Unfortunately, setting that flag doesn't seem to improve performance
> for me because in parse_object_with_flags(), lookup_object() returns
> an obj pointer with obj->parsed == 0 and obj->type == OBJ_NONE. So it
> skips this block and ends up inflating the object anyway:
> [...]
> Are OBJ_NONE objects expected here? Should the check be
>
>         if ((!obj || obj->type == OBJ_NONE || obj->type == OBJ_BLOB) &&
>             odb_read_object_info(r->objects, oid, NULL) == OBJ_BLOB) {
>                 [...]
>         }
>
> ?

Yeah, that feels like a bug to me. The idea of that conditional is
"could it be a blob?" and obviously OBJ_NONE does not rule that out.

I do wonder how you end up with OBJ_NONE, though. That implies somebody
created the "struct object" but without knowing which type it was
supposed to be, and then did not follow up by actually parsing it.

That's probably immaterial to what parse_object() should be doing, but
it is certainly a curiosity. And I'm also not sure why I got good
results from my rev-list invocation, but you did not. Weird.

I think we could probably proceed without satisfying our curiosity
here, but if you felt like it, it would be interesting to find such an
object that is fed with OBJ_NONE to parse_object(), then run the
command in a debugger trying to break on the original create_object()
call that matches that oid. (Or if you want to be fancy use a reverse
debugger like rr). I might play around with it and see if I can
stimulate it.
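
With plain gdb that might look something like this (untested; it
assumes create_object() is not inlined, and "<oid>" stands in for the
hex of an object you saw come through as OBJ_NONE):

  $ gdb --args ./git rev-list --objects --all --exclude-promisor-objects \
        $(perl -e 'print "1" x 40')
  (gdb) break create_object if $_streq(oid_to_hex(oid), "<oid>")
  (gdb) run
  (gdb) bt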

> If I make that change combined with your PARSE_OBJECT_SKIP_HASH_CHECK
> change then the time drops to 1:58, so that's great!

Cool, though I think that's about the same that you got with your
patch? I was hoping for a little bit more from skipping the hash checks
and commits, but maybe:

  1. Your commit/tree structure is dominated much more by the blobs
     than the linux.git I used for testing. So there's not much extra
     gain to be had.

  2. You didn't have a commit-graph built.

-Peff

Re: [PATCH] packfile: skip decompressing and hashing blobs in add_promisor_object()
From: Aaron Plattner @ 2025-12-05 21:56 UTC
To: Jeff King; +Cc: git

On 12/5/25 1:28 PM, Jeff King wrote:
> I do wonder how you end up with OBJ_NONE, though. That implies somebody
> created the "struct object" but without knowing which type it was
> supposed to be, and then did not follow up by actually parsing it.

If I'm understanding correctly, this loop creates a dummy struct object
for every object in the promisor packs:

        if (revs->exclude_promisor_objects) {
                for_each_packed_object(revs->repo, mark_uninteresting, revs,
                                       FOR_EACH_OBJECT_PROMISOR_ONLY);
        }

Backtrace for one such object:

  #0  create_object
  #1  lookup_unknown_object
  #2  mark_uninteresting
  #3  for_each_object_in_pack
  #4  for_each_packed_object
  #5  prepare_revision_walk
  #6  cmd_rev_list
  #7  run_builtin
  #8  handle_builtin
  #9  cmd_main
  #10 main

Then the is_promisor_object() loop finds these dummy objects when it
loops over all the objects again.
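
(lookup_unknown_object() is what leaves them typeless; roughly, from
object.c -- paraphrased, so details may vary by version:)

struct object *lookup_unknown_object(struct repository *r,
                                     const struct object_id *oid)
{
        struct object *obj = lookup_object(r, oid);
        if (!obj)
                /*
                 * Freshly allocated object nodes are zeroed, so the
                 * new object starts out with type == OBJ_NONE.
                 */
                obj = create_object(r, oid, alloc_object_node(r));
        return obj;
}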

> That's probably immaterial to what parse_object() should be doing, but
> it is certainly a curiosity. And I'm also not sure why I got good
> results from my rev-list invocation, but you did not. Weird.

Yeah, that's still a mystery.

> I think we could probably proceed without satisfying our curiosity
> here, but if you felt like it, it would be interesting to find such an
> object that is fed with OBJ_NONE to parse_object(), then run the
> command in a debugger trying to break on the original create_object()
> call that matches that oid. (Or if you want to be fancy use a reverse
> debugger like rr). I might play around with it and see if I can
> stimulate it.
>
>> If I make that change combined with your PARSE_OBJECT_SKIP_HASH_CHECK
>> change then the time drops to 1:58, so that's great!
>
> Cool, though I think that's about the same that you got with your
> patch?

It's a bit better than my patch and yours seems cleaner. I'll put this
together as a v2 of this patch.

> I was hoping for a little bit more from skipping the hash checks and
> commits, but maybe:
>
>   1. Your commit/tree structure is dominated much more by the blobs
>      than the linux.git I used for testing. So there's not much extra
>      gain to be had.

That's certainly possible. Let's just say this codebase has a lot of
cruft in it. :)

>   2. You didn't have a commit-graph built.

This repository came from "scalar clone" and then I created a worktree
and disabled sparse checkout. I didn't do anything special to enable or
disable commit-graph.

What I do notice is that usually, a `git pull` from the server this
repository is hosted on is fast, but occasionally it hits this
pathological case. I was using git-rev-list as a proxy for what
git-pull was getting stuck on. Is it possible that having a working
commit-graph is what avoids the problem in the first place? I'll admit
to not having a great understanding of how the commit graph is used
during a normal pull.

Re: [PATCH] packfile: skip decompressing and hashing blobs in add_promisor_object()
From: Jeff King @ 2025-12-06 1:58 UTC
To: Aaron Plattner; +Cc: git

On Fri, Dec 05, 2025 at 01:56:23PM -0800, Aaron Plattner wrote:
>> I do wonder how you end up with OBJ_NONE, though. That implies somebody
>> created the "struct object" but without knowing which type it was
>> supposed to be, and then did not follow up by actually parsing it.
>
> If I'm understanding correctly, this loop creates a dummy struct object
> for every object in the promisor packs:
> [...]
> Then the is_promisor_object() loop finds these dummy objects when it
> loops over all the objects again.

Ah, of course. That makes sense (and I don't think there's any other
way to do it, as we need the object struct to store the flags). And
that also explains this bit:

>> That's probably immaterial to what parse_object() should be doing, but
>> it is certainly a curiosity. And I'm also not sure why I got good
>> results from my rev-list invocation, but you did not. Weird.
>
> Yeah, that's still a mystery.

It's because in the command I used:

  git rev-list --objects --exclude-promisor-objects $(perl -e 'print "1" x 40')

we call into is_promisor_object() _before_ we hit that part of
prepare_revision_walk() that marks everything uninteresting. In my
invocation above, we'd notice the missing object in get_reference() as
we try to load the initial tips for the walk, and then check it against
is_promisor_object() immediately.

And when I tried something more like your command:

  git rev-list --objects --all --exclude-promisor-objects

it did mark them all uninteresting, but because I had no objects that
were missing (and not simply marked uninteresting), it never needed to
call into is_promisor_object(). So good, mystery resolved.

> This repository came from "scalar clone" and then I created a worktree
> and disabled sparse checkout. I didn't do anything special to enable or
> disable commit-graph.
> [...]
> Is it possible that having a working commit-graph is what avoids the
> problem in the first place?

I'd expect scalar to create commit-graphs. We can leave it be, but if
you're curious you can double-check that .git/objects/info has either a
commit-graph file or a commit-graphs/ directory. If not, then running
"git commit-graph write --reachable" should generate one, and you can
see if that changes the timings at all.

-Peff