From: Phillip Wood <phillip.wood123@gmail.com>
To: Elijah Newren via GitGitGadget <gitgitgadget@gmail.com>,
git@vger.kernel.org
Cc: Elijah Newren <newren@gmail.com>
Subject: Re: [PATCH v2 2/3] builtin/log: prefetch necessary blobs for `git cherry`
Date: Sun, 19 Apr 2026 15:04:47 +0100 [thread overview]
Message-ID: <a010a4ad-403a-4b6f-9a92-a33323eca0f2@gmail.com> (raw)
In-Reply-To: <a705852723fbe88e94ad3de1daba548dbce32211.1776472347.git.gitgitgadget@gmail.com>
Hi Elijah
On 18/04/2026 01:32, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>
>
> In partial clones, `git cherry` fetches necessary blobs on-demand one
> at a time, which can be very slow. We would like to prefetch all
> necessary blobs upfront. To do so, we need to be able to first figure
> out which blobs are needed.
"git rebase" without "--reapply-cherry-picks" suffers from this problem
as well as it does the equivalent of "git log --cherry-pick". Is there
any way to share prefetch_cherry_blobs() with the cherry-pick detection
in revision.c?
Thanks
Phillip
> `git cherry` does its work in a two-phase approach: first computing
> header-only IDs (based on file paths and modes), then falling back to
> full content-based IDs only when header-only IDs collide -- or, more
> accurately, whenever the oidhash() of the header-only object_ids
> collide.
>
> patch-ids.c handles this by creating an ids->patches hashmap that has
> all the data we need, but the problem is that any attempt to query the
> hashmap will invoke the patch_id_neq() function on any colliding objects,
> which causes the on-demand fetching.
>
> Insert a new prefetch_cherry_blobs() function before checking for
> collisions. Use a temporary replacement on the ids->patches.cmpfn
> in order to enumerate the blobs that would be needed without yet
> fetching them, and then fetch them all at once, then restore the old
> ids->patches.cmpfn.
>
> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
> builtin/log.c | 125 ++++++++++++++++++++++++++++++++++++++++++++++
> t/t3500-cherry.sh | 18 +++++++
> 2 files changed, 143 insertions(+)
>
> diff --git a/builtin/log.c b/builtin/log.c
> index 8c0939dd42..df19876be6 100644
> --- a/builtin/log.c
> +++ b/builtin/log.c
> @@ -21,10 +21,12 @@
> #include "color.h"
> #include "commit.h"
> #include "diff.h"
> +#include "diffcore.h"
> #include "diff-merges.h"
> #include "revision.h"
> #include "log-tree.h"
> #include "oid-array.h"
> +#include "oidset.h"
> #include "tag.h"
> #include "reflog-walk.h"
> #include "patch-ids.h"
> @@ -43,9 +45,11 @@
> #include "utf8.h"
>
> #include "commit-reach.h"
> +#include "promisor-remote.h"
> #include "range-diff.h"
> #include "tmp-objdir.h"
> #include "tree.h"
> +#include "userdiff.h"
> #include "write-or-die.h"
>
> #define MAIL_DEFAULT_WRAP 72
> @@ -2602,6 +2606,125 @@ static void print_commit(char sign, struct commit *commit, int verbose,
> }
> }
>
> +/*
> + * Enumerate blob OIDs from a single commit's diff, inserting them into blobs.
> + * Skips files whose userdiff driver explicitly declares binary status
> + * (drv->binary > 0), since patch-ID uses oid_to_hex() for those and
> + * never reads blob content. Use userdiff_find_by_path() since
> + * diff_filespec_load_driver() is static in diff.c.
> + *
> + * Clean up with diff_queue_clear() (from diffcore.h).
> + */
> +static void collect_diff_blob_oids(struct commit *commit,
> + struct diff_options *opts,
> + struct oidset *blobs)
> +{
> + struct diff_queue_struct *q;
> +
> + /*
> + * Merge commits are filtered out by patch_id_defined() in patch-ids.c,
> + * so we'll never be called with one.
> + */
> + assert(!commit->parents || !commit->parents->next);
> +
> + if (commit->parents)
> + diff_tree_oid(&commit->parents->item->object.oid,
> + &commit->object.oid, "", opts);
> + else
> + diff_root_tree_oid(&commit->object.oid, "", opts);
> + diffcore_std(opts);
> +
> + q = &diff_queued_diff;
> + for (int i = 0; i < q->nr; i++) {
> + struct diff_filepair *p = q->queue[i];
> + struct userdiff_driver *drv;
> +
> + /* Skip binary files */
> + drv = userdiff_find_by_path(opts->repo->index, p->one->path);
> + if (drv && drv->binary > 0)
> + continue;
> +
> + if (DIFF_FILE_VALID(p->one))
> + oidset_insert(blobs, &p->one->oid);
> + if (DIFF_FILE_VALID(p->two))
> + oidset_insert(blobs, &p->two->oid);
> + }
> + diff_queue_clear(q);
> +}
> +
> +static int always_match(const void *cmp_data UNUSED,
> + const struct hashmap_entry *entry1 UNUSED,
> + const struct hashmap_entry *entry2 UNUSED,
> + const void *keydata UNUSED)
> +{
> + return 0;
> +}
> +
> +/*
> + * Prefetch blobs for git cherry in partial clones.
> + *
> + * Called between the revision walk (which builds the head-side
> + * commit list) and the has_commit_patch_id() comparison loop.
> + *
> + * Uses a cmpfn-swap trick to avoid reading blobs: temporarily
> + * replaces the hashmap's comparison function with a trivial
> + * always-match function, so hashmap_get()/hashmap_get_next() match
> + * any entry with the same oidhash bucket. These are the set of oids
> + * that would trigger patch_id_neq() during normal lookup and cause
> + * blobs to be read on demand, and we want to prefetch them all at
> + * once instead.
> + */
> +static void prefetch_cherry_blobs(struct repository *repo,
> + struct commit_list *list,
> + struct patch_ids *ids)
> +{
> + struct oidset blobs = OIDSET_INIT;
> + hashmap_cmp_fn original_cmpfn;
> +
> + /* Exit if we're not in a partial clone */
> + if (!repo_has_promisor_remote(repo))
> + return;
> +
> + /* Save original cmpfn, replace with always_match */
> + original_cmpfn = ids->patches.cmpfn;
> + ids->patches.cmpfn = always_match;
> +
> + /* Find header-only collisions, gather blobs from those commits */
> + for (struct commit_list *l = list; l; l = l->next) {
> + struct commit *c = l->item;
> + bool match_found = false;
> + for (struct patch_id *cur = patch_id_iter_first(c, ids);
> + cur;
> + cur = patch_id_iter_next(cur, ids)) {
> + match_found = true;
> + collect_diff_blob_oids(cur->commit, &ids->diffopts,
> + &blobs);
> + }
> + if (match_found)
> + collect_diff_blob_oids(c, &ids->diffopts, &blobs);
> + }
> +
> + /* Restore original cmpfn */
> + ids->patches.cmpfn = original_cmpfn;
> +
> + /* If we have any blobs to fetch, fetch them */
> + if (oidset_size(&blobs)) {
> + struct oid_array to_fetch = OID_ARRAY_INIT;
> + struct oidset_iter iter;
> + const struct object_id *oid;
> +
> + oidset_iter_init(&blobs, &iter);
> + while ((oid = oidset_iter_next(&iter)))
> + oid_array_append(&to_fetch, oid);
> +
> + promisor_remote_get_direct(repo, to_fetch.oid, to_fetch.nr);
> +
> + oid_array_clear(&to_fetch);
> + }
> +
> + oidset_clear(&blobs);
> +}
> +
> int cmd_cherry(int argc,
> const char **argv,
> const char *prefix,
> @@ -2673,6 +2796,8 @@ int cmd_cherry(int argc,
> commit_list_insert(commit, &list);
> }
>
> + prefetch_cherry_blobs(the_repository, list, &ids);
> +
> for (struct commit_list *l = list; l; l = l->next) {
> char sign = '+';
>
> diff --git a/t/t3500-cherry.sh b/t/t3500-cherry.sh
> index 78c3eac54b..17507d9a28 100755
> --- a/t/t3500-cherry.sh
> +++ b/t/t3500-cherry.sh
> @@ -78,4 +78,22 @@ test_expect_success 'cherry ignores whitespace' '
> test_cmp expect actual
> '
>
> +# Reuse the expect file from the previous test, in a partial clone
> +test_expect_success 'cherry in partial clone does bulk prefetch' '
> + test_config uploadpack.allowfilter 1 &&
> + test_config uploadpack.allowanysha1inwant 1 &&
> + test_when_finished "rm -rf copy" &&
> +
> + git clone --bare --filter=blob:none file://"$(pwd)" copy &&
> + (
> + cd copy &&
> + GIT_TRACE2_EVENT="$(pwd)/trace.output" git cherry upstream-with-space feature-without-space >actual &&
> + test_cmp ../expect actual &&
> +
> + grep "child_start.*fetch.negotiationAlgorithm" trace.output >fetches &&
> + test_line_count = 1 fetches &&
> + test_trace2_data promisor fetch_count 4 <trace.output
> + )
> +'
> +
> test_done
next prev parent reply other threads:[~2026-04-19 14:04 UTC|newest]
Thread overview: 25+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-16 22:48 [PATCH 0/3] Batch prefetching Elijah Newren via GitGitGadget
2026-04-16 22:48 ` [PATCH 1/3] patch-ids.h: add missing trailing parenthesis in documentation comment Elijah Newren via GitGitGadget
2026-04-16 22:48 ` [PATCH 2/3] builtin/log: prefetch necessary blobs for `git cherry` Elijah Newren via GitGitGadget
2026-04-17 21:42 ` Junio C Hamano
2026-04-17 22:02 ` Elijah Newren
2026-04-16 22:48 ` [PATCH 3/3] grep: prefetch necessary blobs Elijah Newren via GitGitGadget
2026-04-18 0:32 ` [PATCH v2 0/3] Batch prefetching Elijah Newren via GitGitGadget
2026-04-18 0:32 ` [PATCH v2 1/3] patch-ids.h: add missing trailing parenthesis in documentation comment Elijah Newren via GitGitGadget
2026-04-18 0:32 ` [PATCH v2 2/3] builtin/log: prefetch necessary blobs for `git cherry` Elijah Newren via GitGitGadget
2026-04-19 14:04 ` Phillip Wood [this message]
2026-04-21 21:28 ` Elijah Newren
2026-04-23 15:15 ` Phillip Wood
2026-04-23 17:38 ` Elijah Newren
2026-04-27 13:16 ` Derrick Stolee
2026-05-11 2:51 ` Junio C Hamano
2026-05-11 17:45 ` Elijah Newren
2026-05-13 23:17 ` Elijah Newren
2026-04-18 0:32 ` [PATCH v2 3/3] grep: prefetch necessary blobs Elijah Newren via GitGitGadget
2026-04-27 12:59 ` Derrick Stolee
2026-05-13 19:21 ` Elijah Newren
2026-05-14 16:25 ` [PATCH v3 0/4] Batch prefetching Elijah Newren via GitGitGadget
2026-05-14 16:25 ` [PATCH v3 1/4] promisor-remote: document caller filtering contract Elijah Newren via GitGitGadget
2026-05-14 16:25 ` [PATCH v3 2/4] patch-ids.h: add missing trailing parenthesis in documentation comment Elijah Newren via GitGitGadget
2026-05-14 16:25 ` [PATCH v3 3/4] builtin/log: prefetch necessary blobs for `git cherry` Elijah Newren via GitGitGadget
2026-05-14 16:25 ` [PATCH v3 4/4] grep: prefetch necessary blobs Elijah Newren via GitGitGadget
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=a010a4ad-403a-4b6f-9a92-a33323eca0f2@gmail.com \
--to=phillip.wood123@gmail.com \
--cc=git@vger.kernel.org \
--cc=gitgitgadget@gmail.com \
--cc=newren@gmail.com \
--cc=phillip.wood@dunelm.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox