[PATCH] packfile: skip decompressing and hashing blobs in add_promisor_object()
From: Aaron Plattner @ 2025-12-04 17:21 UTC
To: git; +Cc: Aaron Plattner

When is_promisor_object() is called for the first time, it lazily
initializes a set of all promisor objects by iterating through all
objects in promisor packs. For each object, add_promisor_object() calls
parse_object(), which decompresses and hashes the entire object.

For repositories with large pack files, this can take an extremely long
time. For example, on a production repository with a 176 GB promisor
pack:

$ time ~/git/git/git-rev-list --objects --all --exclude-promisor-objects --quiet
________________________________________________________
Executed in   76.10 mins      fish          external
   usr time   72.10 mins     1.83 millis   72.10 mins
   sys time    3.56 mins     0.17 millis    3.56 mins

add_promisor_object() needs the full object for trees, commits, and
tags. But blobs contain no references to other objects, so the function
can just insert their oids into the set and move on.

For objects that weren't already parsed, use odb_read_object_info() to
query the object type. If it's a blob, just insert it into the oidset
without parsing it. This improves performance for very large pack files
significantly:

$ time ~/git/git/git-rev-list --objects --all --exclude-promisor-objects --quiet
________________________________________________________
Executed in  118.76 secs      fish          external
   usr time   50.88 secs    11.02 millis   50.87 secs
   sys time   36.31 secs     0.08 millis   36.31 secs

Signed-off-by: Aaron Plattner <aplattner@nvidia.com>
---
 packfile.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/packfile.c b/packfile.c
index 9cc11b6dc5..563fd14f0e 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2309,6 +2309,17 @@ static int add_promisor_object(const struct object_id *oid,
         if (obj && obj->parsed) {
                we_parsed_object = 0;
        } else {
+               /*
+                * Blobs don't reference other objects, so skip parsing them
+                * to save time.
+                */
+               enum object_type type;
+               type = odb_read_object_info(pack->repo->objects, oid, NULL);
+               if (type == OBJ_BLOB) {
+                       oidset_insert(set, oid);
+                       return 0;
+               }
+
                we_parsed_object = 1;
                obj = parse_object(pack->repo, oid);
        }
--
2.52.0

Re: [PATCH] packfile: skip decompressing and hashing blobs in add_promisor_object()
From: Patrick Steinhardt @ 2025-12-05 12:36 UTC
To: Aaron Plattner; +Cc: git

On Thu, Dec 04, 2025 at 09:21:29AM -0800, Aaron Plattner wrote:
> When is_promisor_object() is called for the first time, it lazily
> initializes a set of all promisor objects by iterating through all
> objects in promisor packs. For each object, add_promisor_object() calls
> parse_object(), which decompresses and hashes the entire object.
> [...]
> @@ -2309,6 +2309,17 @@ static int add_promisor_object(const struct object_id *oid,
>         if (obj && obj->parsed) {
>                 we_parsed_object = 0;
>         } else {
> +               /*
> +                * Blobs don't reference other objects, so skip parsing them
> +                * to save time.
> +                */
> +               enum object_type type;
> +               type = odb_read_object_info(pack->repo->objects, oid, NULL);
> +               if (type == OBJ_BLOB) {
> +                       oidset_insert(set, oid);
> +                       return 0;
> +               }
> +
>                 we_parsed_object = 1;
>                 obj = parse_object(pack->repo, oid);
>         }

This looks like an obvious improvement to me. While looking at the
function I noticed that we also free tree buffers here, which makes
sense to reduce memory usage. I was wondering whether we want to do the
same for commit buffers, which may reduce memory usage even further.

See for example the below patch. Might be worth it to test in your
large repository to see whether it has an effect on the maximum RSS.

Thanks!

Patrick

diff --git a/packfile.c b/packfile.c
index 9cc11b6dc5..89227ada98 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2296,12 +2296,17 @@ int for_each_packed_object(struct repository *repo, each_packed_object_fn cb,
        return r ? r : pack_errors;
 }
 
+struct add_promisor_object_data {
+       struct repository *repo;
+       struct oidset promisor_objects;
+};
+
 static int add_promisor_object(const struct object_id *oid,
                               struct packed_git *pack,
                               uint32_t pos UNUSED,
-                              void *set_)
+                              void *cb_data)
 {
-       struct oidset *set = set_;
+       struct add_promisor_object_data *data = cb_data;
        struct object *obj;
        int we_parsed_object;
 
@@ -2316,7 +2321,7 @@ static int add_promisor_object(const struct object_id *oid,
        if (!obj)
                return 1;
 
-       oidset_insert(set, oid);
+       oidset_insert(&data->promisor_objects, oid);
 
        /*
         * If this is a tree, commit, or tag, the objects it refers
@@ -2334,38 +2339,45 @@ static int add_promisor_object(const struct object_id *oid,
                         */
                        return 0;
                while (tree_entry_gently(&desc, &entry))
-                       oidset_insert(set, &entry.oid);
+                       oidset_insert(&data->promisor_objects, &entry.oid);
                if (we_parsed_object)
                        free_tree_buffer(tree);
        } else if (obj->type == OBJ_COMMIT) {
                struct commit *commit = (struct commit *) obj;
                struct commit_list *parents = commit->parents;
 
-               oidset_insert(set, get_commit_tree_oid(commit));
+               oidset_insert(&data->promisor_objects, get_commit_tree_oid(commit));
                for (; parents; parents = parents->next)
-                       oidset_insert(set, &parents->item->object.oid);
+                       oidset_insert(&data->promisor_objects, &parents->item->object.oid);
+
+               if (we_parsed_object)
+                       free_commit_buffer(data->repo->parsed_objects, commit);
        } else if (obj->type == OBJ_TAG) {
                struct tag *tag = (struct tag *) obj;
-               oidset_insert(set, get_tagged_oid(tag));
+               oidset_insert(&data->promisor_objects, get_tagged_oid(tag));
        }
        return 0;
 }
 
 int is_promisor_object(struct repository *r, const struct object_id *oid)
 {
-       static struct oidset promisor_objects;
+       static struct add_promisor_object_data data = {
+               .promisor_objects = OIDSET_INIT,
+       };
        static int promisor_objects_prepared;
 
        if (!promisor_objects_prepared) {
+               data.repo = r;
                if (repo_has_promisor_remote(r)) {
                        for_each_packed_object(r, add_promisor_object,
-                                              &promisor_objects,
+                                              &data,
                                               FOR_EACH_OBJECT_PROMISOR_ONLY |
                                               FOR_EACH_OBJECT_PACK_ORDER);
                }
                promisor_objects_prepared = 1;
+               data.repo = NULL;
        }
-       return oidset_contains(&promisor_objects, oid);
+       return oidset_contains(&data.promisor_objects, oid);
 }
 
 int parse_pack_header_option(const char *in, unsigned char *out, unsigned int *len)

Re: [PATCH] packfile: skip decompressing and hashing blobs in add_promisor_object()
From: Aaron Plattner @ 2025-12-05 16:55 UTC
To: Patrick Steinhardt; +Cc: git

On 12/5/25 4:36 AM, Patrick Steinhardt wrote:
> This looks like an obvious improvement to me. While looking at the
> function I noticed that we also free tree buffers here, which makes
> sense to reduce memory usage. I was wondering whether we want to do the
> same for commit buffers, which may reduce memory usage even further.
>
> See for example the below patch. Might be worth it to test in your
> large repository to see whether it has an effect on the maximum RSS.
> [...]
> +
> +               if (we_parsed_object)
> +                       free_commit_buffer(data->repo->parsed_objects, commit);

Isn't data->repo the same as pack->repo here? I think this patch can be
simplified to just

diff --git a/packfile.c b/packfile.c
index 563fd14f0e..ac31077d0b 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2355,6 +2355,9 @@ static int add_promisor_object(const struct object_id *oid,
                oidset_insert(set, get_commit_tree_oid(commit));
                for (; parents; parents = parents->next)
                        oidset_insert(set, &parents->item->object.oid);
+
+               if (we_parsed_object)
+                       free_commit_buffer(pack->repo->parsed_objects, commit);
        } else if (obj->type == OBJ_TAG) {
                struct tag *tag = (struct tag *) obj;
                oidset_insert(set, get_tagged_oid(tag));
--

That said, the memory footprint improvement seems pretty minimal with
this change:

Without free_commit_buffer():

$ /usr/bin/time ~/git/git/git-rev-list --objects --all --exclude-promisor-objects --quiet
66.19user 38.97system 2:17.46elapsed 76%CPU (0avgtext+0avgdata 8171072maxresident)k
307985728inputs+0outputs (151871major+1067727minor)pagefaults 0swaps

With free_commit_buffer():

$ /usr/bin/time ~/git/git/git-rev-list --objects --all --exclude-promisor-objects --quiet
66.47user 40.08system 2:18.72elapsed 76%CPU (0avgtext+0avgdata 8135640maxresident)k
307820424inputs+0outputs (100432major+1065152minor)pagefaults 0swaps

I'm inclined not to worry about it for now.

-- Aaron

Re: [PATCH] packfile: skip decompressing and hashing blobs in add_promisor_object()
From: Jeff King @ 2025-12-05 17:59 UTC
To: Aaron Plattner; +Cc: Patrick Steinhardt, git

On Fri, Dec 05, 2025 at 08:55:19AM -0800, Aaron Plattner wrote:
> +               if (we_parsed_object)
> +                       free_commit_buffer(pack->repo->parsed_objects, commit);
> [...]
> That said, the memory footprint improvement seems pretty minimal with
> this change:
>
> Without free_commit_buffer():
>
> $ /usr/bin/time ~/git/git/git-rev-list --objects --all --exclude-promisor-objects --quiet
> 66.19user 38.97system 2:17.46elapsed 76%CPU (0avgtext+0avgdata 8171072maxresident)k
>
> With free_commit_buffer():
>
> $ /usr/bin/time ~/git/git/git-rev-list --objects --all --exclude-promisor-objects --quiet
> 66.47user 40.08system 2:18.72elapsed 76%CPU (0avgtext+0avgdata 8135640maxresident)k
>
> I'm inclined not to worry about it for now.

I don't think it will make a difference for that command because we
already turn off the "save_commit_buffer" global in rev-list, unless we
are going to show the contents. Perhaps:

  git rev-list --objects --all --format=%s

would show the difference.

If we take the skip-hash suggestion I wrote elsewhere in the thread,
then we would usually not load the commit contents in the first place.
But it could still be worth adding this free_commit_buffer() to catch
commits not covered by the commit-graph.

Another way to do it is to temporarily unset save_commit_buffer in
is_promisor_object() when we start walking the objects. That's how we
do it in other places, like 359b01ca84 (ref-filter: disable
save_commit_buffer while traversing, 2022-07-11). But in this case
since the traversal code is custom it's pretty easy to just free
immediately afterwards.
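
For illustration, the save_commit_buffer variant would be something
like this untested sketch, following the same pattern as 359b01ca84
(the function body is as it looks before Patrick's patch):

int is_promisor_object(struct repository *r, const struct object_id *oid)
{
        static struct oidset promisor_objects;
        static int promisor_objects_prepared;

        if (!promisor_objects_prepared) {
                int saved = save_commit_buffer;

                /* don't retain commit buffers while we walk every object */
                save_commit_buffer = 0;
                if (repo_has_promisor_remote(r)) {
                        for_each_packed_object(r, add_promisor_object,
                                               &promisor_objects,
                                               FOR_EACH_OBJECT_PROMISOR_ONLY |
                                               FOR_EACH_OBJECT_PACK_ORDER);
                }
                save_commit_buffer = saved;
                promisor_objects_prepared = 1;
        }
        return oidset_contains(&promisor_objects, oid);
}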

-Peff

Re: [PATCH] packfile: skip decompressing and hashing blobs in add_promisor_object()
From: Jeff King @ 2025-12-05 17:48 UTC
To: Aaron Plattner; +Cc: git

On Thu, Dec 04, 2025 at 09:21:29AM -0800, Aaron Plattner wrote:
> For repositories with large pack files, this can take an extremely long
> time. For example, on a production repository with a 176 GB promisor
> pack:
>
> $ time ~/git/git/git-rev-list --objects --all --exclude-promisor-objects --quiet
> ________________________________________________________
> Executed in   76.10 mins      fish          external
> [...]

FWIW, I had a hard time replicating the results with this command,
because it won't necessarily call is_promisor_object(). It only does so
when it finds a missing object. But it also marks everything in the
promisor pack as UNINTERESTING from the start, so you need a
non-promisor commit that points to an object excluded from the narrow
clone.

An easier way to trigger it is with a fake oid like:

  rev-list --objects --all --exclude-promisor-objects $(perl -e 'print "1" x 40')

Then we have to check is_promisor_object() to know that the 111... oid
isn't really a promisor mentioned somewhere.

> For objects that weren't already parsed, use odb_read_object_info() to
> query the object type. If it's a blob, just insert it into the oidset
> without parsing it. This improves performance for very large pack files
> significantly:
>
> $ time ~/git/git/git-rev-list --objects --all --exclude-promisor-objects --quiet
> ________________________________________________________
> Executed in  118.76 secs      fish          external
> [...]

Yeah, this is obviously a good idea. This all seemed eerily familiar,
and I wondered if we weren't doing this already. But it looks like it
came up as "maybe we should do this" along with some other
optimizations, but we never did it.

Your 176GB (!) repository is obviously a good way to show off the
change. But I think we can see it even in a fresh clone of linux.git,
which (with my command above) goes from ~7.5 minutes to under 2 minutes
with your patch.

But I have an idea that makes your patch a little simpler and should
give us another little speed bump.

> +               /*
> +                * Blobs don't reference other objects, so skip parsing them
> +                * to save time.
> +                */
> +               enum object_type type;
> +               type = odb_read_object_info(pack->repo->objects, oid, NULL);
> +               if (type == OBJ_BLOB) {
> +                       oidset_insert(set, oid);
> +                       return 0;
> +               }

OK, so we are checking the type up front and then skipping
parse_object() if we can. But there is already some logic inside
parse_object() for these kinds of optimizations. If we tell it we are
not interested in checking the hash of the objects, then it knows it
can skip loading the blob entirely.

But it can _also_ use that flag for other things, like using the
commit-graph rather than loading individual commit objects.
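
The relevant bit of parse_object_with_flags() in object.c looks roughly
like this (abridged from memory here, so the exact shape may differ
between versions):

        int skip_hash = !!(flags & PARSE_OBJECT_SKIP_HASH_CHECK);
        ...
        if (skip_hash) {
                /*
                 * With the hash check waived, a commit can be served
                 * straight out of the commit-graph, without ever
                 * loading the object contents.
                 */
                struct commit *commit = lookup_commit_in_graph(r, repl);
                if (commit)
                        return &commit->object;
        }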

So doing this:

diff --git a/packfile.c b/packfile.c
index 9cc11b6dc5..01b992a4e1 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2310,7 +2310,8 @@ static int add_promisor_object(const struct object_id *oid,
                we_parsed_object = 0;
        } else {
                we_parsed_object = 1;
-               obj = parse_object(pack->repo, oid);
+               obj = parse_object_with_flags(pack->repo, oid,
+                                             PARSE_OBJECT_SKIP_HASH_CHECK);
        }
 
        if (!obj)

drops my linux.git case down to 49s. It's skipping the blobs (with no
need for your patch) and loading the commits out of the graph file.
Note that you may need to "git commit-graph write --reachable" to see
the effect (I think we do generate graphs by default in git-gc these
days, but I'm not sure if we do so right after cloning).

-Peff

Re: [PATCH] packfile: skip decompressing and hashing blobs in add_promisor_object()
From: Jeff King @ 2025-12-05 18:01 UTC
To: Aaron Plattner; +Cc: git

On Fri, Dec 05, 2025 at 12:48:54PM -0500, Jeff King wrote:
> OK, so we are checking the type up front and then skipping
> parse_object() if we can. But there is already some logic inside
> parse_object() for these kinds of optimizations. If we tell it we are
> not interested in checking the hash of the objects, then it knows it
> can skip loading the blob entirely.
> [...]
> drops my linux.git case down to 49s. It's skipping the blobs (with no
> need for your patch) and loading the commits out of the graph file.

Oh, and obviously it is skipping the hash computation on the objects,
too. That's probably not as important as avoiding the object loads in
the first place, but it may also be making a measurable difference on
the ones we do load (notably trees here).

-Peff

Re: [PATCH] packfile: skip decompressing and hashing blobs in add_promisor_object()
From: Aaron Plattner @ 2025-12-05 18:50 UTC
To: Jeff King; +Cc: git

On 12/5/25 10:01 AM, Jeff King wrote:
> On Fri, Dec 05, 2025 at 12:48:54PM -0500, Jeff King wrote:
>> OK, so we are checking the type up front and then skipping
>> parse_object() if we can. But there is already some logic inside
>> parse_object() for these kinds of optimizations. If we tell it we are
>> not interested in checking the hash of the objects, then it knows it
>> can skip loading the blob entirely.
>>
>> [...]
>> -               obj = parse_object(pack->repo, oid);
>> +               obj = parse_object_with_flags(pack->repo, oid,
>> +                                             PARSE_OBJECT_SKIP_HASH_CHECK);
>> [...]
>> drops my linux.git case down to 49s. It's skipping the blobs (with no
>> need for your patch) and loading the commits out of the graph file.
>> Note that you may need to "git commit-graph write --reachable" to see
>> the effect (I think we do generate graphs by default in git-gc these
>> days, but I'm not sure if we do so right after cloning).

Thanks! I had looked at PARSE_OBJECT_SKIP_HASH_CHECK but it wasn't
obvious to me whether it could be used here or not.

Unfortunately, setting that flag doesn't seem to improve performance
for me because in parse_object_with_flags(), lookup_object() returns an
obj pointer with obj->parsed == 0 and obj->type == OBJ_NONE. So it
skips this block and ends up inflating the object anyway:

        if ((!obj || obj->type == OBJ_BLOB) &&
            odb_read_object_info(r->objects, oid, NULL) == OBJ_BLOB) {
                if (!skip_hash && stream_object_signature(r, repl) < 0) {
                        error(_("hash mismatch %s"), oid_to_hex(oid));
                        return NULL;
                }
                parse_blob_buffer(lookup_blob(r, oid));
                return lookup_object(r, oid);
        }

I was confused about why the check was structured that way, but reading
the description of commit 8db2dad7a045e376b9c4f51ddd33da43c962e3a4
cleared that up. Thank you for thoroughly documenting that!

Are OBJ_NONE objects expected here? Should the check be

        if ((!obj || obj->type == OBJ_NONE || obj->type == OBJ_BLOB) &&
            odb_read_object_info(r->objects, oid, NULL) == OBJ_BLOB) {
                [...]
        }

?

If I make that change combined with your PARSE_OBJECT_SKIP_HASH_CHECK
change then the time drops to 1:58, so that's great!

Re: [PATCH] packfile: skip decompressing and hashing blobs in add_promisor_object()
From: Jeff King @ 2025-12-05 21:28 UTC
To: Aaron Plattner; +Cc: git

On Fri, Dec 05, 2025 at 10:50:02AM -0800, Aaron Plattner wrote:
> Unfortunately, setting that flag doesn't seem to improve performance
> for me because in parse_object_with_flags(), lookup_object() returns
> an obj pointer with obj->parsed == 0 and obj->type == OBJ_NONE. So it
> skips this block and ends up inflating the object anyway:
> [...]
> Are OBJ_NONE objects expected here? Should the check be
>
>         if ((!obj || obj->type == OBJ_NONE || obj->type == OBJ_BLOB) &&
>             odb_read_object_info(r->objects, oid, NULL) == OBJ_BLOB) {
>                 [...]
>         }
>
> ?

Yeah, that feels like a bug to me. The idea of that conditional is
"could it be a blob?" and obviously OBJ_NONE does not rule that out.

I do wonder how you end up with OBJ_NONE, though. That implies somebody
created the "struct object" but without knowing which type it was
supposed to be, and then did not follow up by actually parsing it.

That's probably immaterial to what parse_object() should be doing, but
it is certainly a curiosity. And I'm also not sure why I got good
results from my rev-list invocation, but you did not. Weird.

I think we could probably proceed without satisfying our curiosity
here, but if you felt like it, it would be interesting to find such an
object that is fed with OBJ_NONE to parse_object(), then run the
command in a debugger trying to break on the original create_object()
call that matches that oid. (Or if you want to be fancy use a reverse
debugger like rr). I might play around with it and see if I can
stimulate it.
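
With plain gdb that might look something like this (untested; it
assumes create_object() is not inlined, and "<oid>" stands in for the
hex of an object you saw come through as OBJ_NONE):

  $ gdb --args ./git rev-list --objects --all --exclude-promisor-objects \
        $(perl -e 'print "1" x 40')
  (gdb) break create_object if $_streq(oid_to_hex(oid), "<oid>")
  (gdb) run
  (gdb) bt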

> If I make that change combined with your PARSE_OBJECT_SKIP_HASH_CHECK
> change then the time drops to 1:58, so that's great!

Cool, though I think that's about the same that you got with your
patch? I was hoping for a little bit more from skipping the hash checks
and commits, but maybe:

  1. Your commit/tree structure is dominated much more by the blobs
     than the linux.git I used for testing. So there's not much extra
     gain to be had.

  2. You didn't have a commit-graph built.

-Peff

Re: [PATCH] packfile: skip decompressing and hashing blobs in add_promisor_object()
From: Aaron Plattner @ 2025-12-05 21:56 UTC
To: Jeff King; +Cc: git

On 12/5/25 1:28 PM, Jeff King wrote:
> I do wonder how you end up with OBJ_NONE, though. That implies somebody
> created the "struct object" but without knowing which type it was
> supposed to be, and then did not follow up by actually parsing it.

If I'm understanding correctly, this loop creates a dummy struct object
for every object in the promisor packs:

        if (revs->exclude_promisor_objects) {
                for_each_packed_object(revs->repo, mark_uninteresting, revs,
                                       FOR_EACH_OBJECT_PROMISOR_ONLY);
        }

Backtrace for one such object:

  #0  create_object
  #1  lookup_unknown_object
  #2  mark_uninteresting
  #3  for_each_object_in_pack
  #4  for_each_packed_object
  #5  prepare_revision_walk
  #6  cmd_rev_list
  #7  run_builtin
  #8  handle_builtin
  #9  cmd_main
  #10 main

Then the is_promisor_object() loop finds these dummy objects when it
loops over all the objects again.
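
(lookup_unknown_object() is what leaves them typeless; roughly, from
object.c -- paraphrased, so details may vary by version:)

struct object *lookup_unknown_object(struct repository *r,
                                     const struct object_id *oid)
{
        struct object *obj = lookup_object(r, oid);
        if (!obj)
                /*
                 * Freshly allocated object nodes are zeroed, so the
                 * new object starts out with type == OBJ_NONE.
                 */
                obj = create_object(r, oid, alloc_object_node(r));
        return obj;
}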

> That's probably immaterial to what parse_object() should be doing, but
> it is certainly a curiosity. And I'm also not sure why I got good
> results from my rev-list invocation, but you did not. Weird.

Yeah, that's still a mystery.

> I think we could probably proceed without satisfying our curiosity
> here, but if you felt like it, it would be interesting to find such an
> object that is fed with OBJ_NONE to parse_object(), then run the
> command in a debugger trying to break on the original create_object()
> call that matches that oid. (Or if you want to be fancy use a reverse
> debugger like rr). I might play around with it and see if I can
> stimulate it.
>
>> If I make that change combined with your PARSE_OBJECT_SKIP_HASH_CHECK
>> change then the time drops to 1:58, so that's great!
>
> Cool, though I think that's about the same that you got with your
> patch?

It's a bit better than my patch and yours seems cleaner. I'll put this
together as a v2 of this patch.

> I was hoping for a little bit more from skipping the hash checks and
> commits, but maybe:
>
>   1. Your commit/tree structure is dominated much more by the blobs
>      than the linux.git I used for testing. So there's not much extra
>      gain to be had.

That's certainly possible. Let's just say this codebase has a lot of
cruft in it. :)

>   2. You didn't have a commit-graph built.

This repository came from "scalar clone" and then I created a worktree
and disabled sparse checkout. I didn't do anything special to enable or
disable commit-graph.

What I do notice is that usually, a `git pull` from the server this
repository is hosted on is fast, but occasionally it hits this
pathological case. I was using git-rev-list as a proxy for what
git-pull was getting stuck on. Is it possible that having a working
commit-graph is what avoids the problem in the first place? I'll admit
to not having a great understanding of how the commit graph is used
during a normal pull.

Re: [PATCH] packfile: skip decompressing and hashing blobs in add_promisor_object()
From: Jeff King @ 2025-12-06 1:58 UTC
To: Aaron Plattner; +Cc: git

On Fri, Dec 05, 2025 at 01:56:23PM -0800, Aaron Plattner wrote:
>> I do wonder how you end up with OBJ_NONE, though. That implies somebody
>> created the "struct object" but without knowing which type it was
>> supposed to be, and then did not follow up by actually parsing it.
>
> If I'm understanding correctly, this loop creates a dummy struct object
> for every object in the promisor packs:
> [...]
> Then the is_promisor_object() loop finds these dummy objects when it
> loops over all the objects again.

Ah, of course. That makes sense (and I don't think there's any other
way to do it, as we need the object struct to store the flags). And
that also explains this bit:

>> That's probably immaterial to what parse_object() should be doing, but
>> it is certainly a curiosity. And I'm also not sure why I got good
>> results from my rev-list invocation, but you did not. Weird.
>
> Yeah, that's still a mystery.

It's because in the command I used:

  git rev-list --objects --exclude-promisor-objects $(perl -e 'print "1" x 40')

we call into is_promisor_object() _before_ we hit that part of
prepare_revision_walk() that marks everything uninteresting. In my
invocation above, we'd notice the missing object in get_reference() as
we try to load the initial tips for the walk, and then check it against
is_promisor_object() immediately.

And when I tried something more like your command:

  git rev-list --objects --all --exclude-promisor-objects

it did mark them all uninteresting, but because I had no objects that
were missing (and not simply marked uninteresting), it never needed to
call into is_promisor_object(). So good, mystery resolved.

> This repository came from "scalar clone" and then I created a worktree
> and disabled sparse checkout. I didn't do anything special to enable or
> disable commit-graph.
> [...]
> Is it possible that having a working commit-graph is what avoids the
> problem in the first place?

I'd expect scalar to create commit-graphs. We can leave it be, but if
you're curious you can double-check that .git/objects/info has either a
commit-graph file or a commit-graphs/ directory. If not, then running
"git commit-graph write --reachable" should generate one, and you can
see if that changes the timings at all.

-Peff