* [PATCH 0/3] Batch prefetching
@ 2026-04-16 22:48 Elijah Newren via GitGitGadget
2026-04-16 22:48 ` [PATCH 1/3] patch-ids.h: add missing trailing parenthesis in documentation comment Elijah Newren via GitGitGadget
` (3 more replies)
0 siblings, 4 replies; 25+ messages in thread
From: Elijah Newren via GitGitGadget @ 2026-04-16 22:48 UTC (permalink / raw)
To: git; +Cc: Elijah Newren
Partial clones provide a trade-off for users: avoid downloading blobs
upfront, at the expense of needing to download them later as other
commands require them. This trade-off can sometimes incur a more severe
cost than expected, particularly when needed blobs are only discovered
as they are accessed, resulting in blobs being downloaded one at a
time. Some commands, like checkout, diff, and merge, do batch
prefetches of the necessary blobs, since that can dramatically reduce
the pain of on-demand loading. Extend this ability to two more
commands: cherry and grep.
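For illustration, here is a minimal local reproduction of that
on-demand behavior (repository names and contents are made up for the
demo; requires a git new enough to support partial clone):

```shell
# Create a tiny "server" repository with one blob.
git init -q server
echo hello >server/f
git -C server add f
git -C server -c user.name=demo -c user.email=demo@example.com \
	commit -q -m "add f"
git -C server config uploadpack.allowfilter true
git -C server config uploadpack.allowanysha1inwant true

# Blobless clone: the commit and tree come over, the blob does not.
git clone -q --no-checkout --filter=blob:none "file://$PWD/server" copy

# Exactly one object (the blob for f) is missing locally.
git -C copy rev-list --objects --missing=print HEAD | grep -c "^?"

# Accessing the blob triggers a single on-demand fetch.
git -C copy cat-file -p HEAD:f
```

With many blobs, that last step is where the one-fetch-per-blob cost
shows up, and where batching helps.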
This series was spurred by a report of git cherry jobs that were each
doing hundreds of single-blob fetches, at a cost of 3s each. Batching
those downloads should dramatically speed up such jobs. (And I decided
to fix up git grep similarly while at it.)
I'll also note that git backfill, with support for revisions and/or
pathspecs, could also improve things for these users. However,
backfill is a manual command that users would have to remember to run,
and it requires them to figure out which data is needed (a challenge
in the case of cherry), so it still makes sense to provide smarter
behavior for folks who don't choose to run backfill manually.
Also, correct a documentation typo I noticed in patch-ids.h (related to code
I was using for the git cherry fixes) as a preparatory fixup.
Elijah Newren (3):
patch-ids.h: add missing trailing parenthesis in documentation comment
builtin/log: prefetch necessary blobs for `git cherry`
grep: prefetch necessary blobs
builtin/grep.c | 142 ++++++++++++
builtin/log.c | 125 +++++++++++
investigations/cherry-prefetch-design-spec.md | 210 ++++++++++++++++++
patch-ids.h | 2 +-
t/t3500-cherry.sh | 18 ++
t/t7810-grep.sh | 35 +++
6 files changed, 531 insertions(+), 1 deletion(-)
create mode 100644 investigations/cherry-prefetch-design-spec.md
base-commit: 9f223ef1c026d91c7ac68cc0211bde255dda6199
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-2089%2Fnewren%2Fbatch-prefetching-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-2089/newren/batch-prefetching-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/2089
--
gitgitgadget
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 1/3] patch-ids.h: add missing trailing parenthesis in documentation comment
2026-04-16 22:48 [PATCH 0/3] Batch prefetching Elijah Newren via GitGitGadget
@ 2026-04-16 22:48 ` Elijah Newren via GitGitGadget
2026-04-16 22:48 ` [PATCH 2/3] builtin/log: prefetch necessary blobs for `git cherry` Elijah Newren via GitGitGadget
` (2 subsequent siblings)
3 siblings, 0 replies; 25+ messages in thread
From: Elijah Newren via GitGitGadget @ 2026-04-16 22:48 UTC (permalink / raw)
To: git; +Cc: Elijah Newren, Elijah Newren
From: Elijah Newren <newren@gmail.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
---
patch-ids.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/patch-ids.h b/patch-ids.h
index 490d739371..57534ee722 100644
--- a/patch-ids.h
+++ b/patch-ids.h
@@ -37,7 +37,7 @@ int has_commit_patch_id(struct commit *commit, struct patch_ids *);
* struct patch_id *cur;
* for (cur = patch_id_iter_first(commit, ids);
* cur;
- * cur = patch_id_iter_next(cur, ids) {
+ * cur = patch_id_iter_next(cur, ids)) {
* ... look at cur->commit
* }
*/
--
gitgitgadget
* [PATCH 2/3] builtin/log: prefetch necessary blobs for `git cherry`
2026-04-16 22:48 [PATCH 0/3] Batch prefetching Elijah Newren via GitGitGadget
2026-04-16 22:48 ` [PATCH 1/3] patch-ids.h: add missing trailing parenthesis in documentation comment Elijah Newren via GitGitGadget
@ 2026-04-16 22:48 ` Elijah Newren via GitGitGadget
2026-04-17 21:42 ` Junio C Hamano
2026-04-16 22:48 ` [PATCH 3/3] grep: prefetch necessary blobs Elijah Newren via GitGitGadget
2026-04-18 0:32 ` [PATCH v2 0/3] Batch prefetching Elijah Newren via GitGitGadget
3 siblings, 1 reply; 25+ messages in thread
From: Elijah Newren via GitGitGadget @ 2026-04-16 22:48 UTC (permalink / raw)
To: git; +Cc: Elijah Newren, Elijah Newren
From: Elijah Newren <newren@gmail.com>
In partial clones, `git cherry` fetches necessary blobs on-demand one
at a time, which can be very slow. We would like to prefetch all
necessary blobs upfront. To do so, we first need to figure out which
blobs are needed.
`git cherry` does its work in a two-phase approach: first computing
header-only IDs (based on file paths and modes), then falling back to
full content-based IDs only when header-only IDs collide -- or, more
accurately, whenever the oidhash() values of the header-only
object_ids collide.
patch-ids.c handles this by creating an ids->patches hashmap that has
all the data we need, but the problem is that any attempt to query the
hashmap will invoke the patch_id_neq() function on any colliding objects,
which causes the on-demand fetching.
Insert a call to a new prefetch_cherry_blobs() function before
checking for collisions. It temporarily replaces ids->patches.cmpfn
in order to enumerate the blobs that would be needed without yet
fetching them, fetches them all at once, and then restores the
original ids->patches.cmpfn.
Signed-off-by: Elijah Newren <newren@gmail.com>
---
builtin/log.c | 125 +++++++++++
investigations/cherry-prefetch-design-spec.md | 210 ++++++++++++++++++
t/t3500-cherry.sh | 18 ++
3 files changed, 353 insertions(+)
create mode 100644 investigations/cherry-prefetch-design-spec.md
diff --git a/builtin/log.c b/builtin/log.c
index 8c0939dd42..df19876be6 100644
--- a/builtin/log.c
+++ b/builtin/log.c
@@ -21,10 +21,12 @@
#include "color.h"
#include "commit.h"
#include "diff.h"
+#include "diffcore.h"
#include "diff-merges.h"
#include "revision.h"
#include "log-tree.h"
#include "oid-array.h"
+#include "oidset.h"
#include "tag.h"
#include "reflog-walk.h"
#include "patch-ids.h"
@@ -43,9 +45,11 @@
#include "utf8.h"
#include "commit-reach.h"
+#include "promisor-remote.h"
#include "range-diff.h"
#include "tmp-objdir.h"
#include "tree.h"
+#include "userdiff.h"
#include "write-or-die.h"
#define MAIL_DEFAULT_WRAP 72
@@ -2602,6 +2606,125 @@ static void print_commit(char sign, struct commit *commit, int verbose,
}
}
+/*
+ * Enumerate blob OIDs from a single commit's diff, inserting them into blobs.
+ * Skips files whose userdiff driver explicitly declares binary status
+ * (drv->binary > 0), since patch-ID uses oid_to_hex() for those and
+ * never reads blob content. Use userdiff_find_by_path() since
+ * diff_filespec_load_driver() is static in diff.c.
+ *
+ * Clean up with diff_queue_clear() (from diffcore.h).
+ */
+static void collect_diff_blob_oids(struct commit *commit,
+ struct diff_options *opts,
+ struct oidset *blobs)
+{
+ struct diff_queue_struct *q;
+
+ /*
+ * Merge commits are filtered out by patch_id_defined() in patch-ids.c,
+ * so we'll never be called with one.
+ */
+ assert(!commit->parents || !commit->parents->next);
+
+ if (commit->parents)
+ diff_tree_oid(&commit->parents->item->object.oid,
+ &commit->object.oid, "", opts);
+ else
+ diff_root_tree_oid(&commit->object.oid, "", opts);
+ diffcore_std(opts);
+
+ q = &diff_queued_diff;
+ for (int i = 0; i < q->nr; i++) {
+ struct diff_filepair *p = q->queue[i];
+ struct userdiff_driver *drv;
+
+ /* Skip binary files */
+ drv = userdiff_find_by_path(opts->repo->index, p->one->path);
+ if (drv && drv->binary > 0)
+ continue;
+
+ if (DIFF_FILE_VALID(p->one))
+ oidset_insert(blobs, &p->one->oid);
+ if (DIFF_FILE_VALID(p->two))
+ oidset_insert(blobs, &p->two->oid);
+ }
+ diff_queue_clear(q);
+}
+
+static int always_match(const void *cmp_data UNUSED,
+ const struct hashmap_entry *entry1 UNUSED,
+ const struct hashmap_entry *entry2 UNUSED,
+ const void *keydata UNUSED)
+{
+ return 0;
+}
+
+/*
+ * Prefetch blobs for git cherry in partial clones.
+ *
+ * Called between the revision walk (which builds the head-side
+ * commit list) and the has_commit_patch_id() comparison loop.
+ *
+ * Uses a cmpfn-swap trick to avoid reading blobs: temporarily
+ * replaces the hashmap's comparison function with a trivial
+ * always-match function, so hashmap_get()/hashmap_get_next() match
+ * any entry with the same oidhash bucket. These are the set of oids
+ * that would trigger patch_id_neq() during normal lookup and cause
+ * blobs to be read on demand, and we want to prefetch them all at
+ * once instead.
+ */
+static void prefetch_cherry_blobs(struct repository *repo,
+ struct commit_list *list,
+ struct patch_ids *ids)
+{
+ struct oidset blobs = OIDSET_INIT;
+ hashmap_cmp_fn original_cmpfn;
+
+ /* Exit if we're not in a partial clone */
+ if (!repo_has_promisor_remote(repo))
+ return;
+
+ /* Save original cmpfn, replace with always_match */
+ original_cmpfn = ids->patches.cmpfn;
+ ids->patches.cmpfn = always_match;
+
+ /* Find header-only collisions, gather blobs from those commits */
+ for (struct commit_list *l = list; l; l = l->next) {
+ struct commit *c = l->item;
+ bool match_found = false;
+ for (struct patch_id *cur = patch_id_iter_first(c, ids);
+ cur;
+ cur = patch_id_iter_next(cur, ids)) {
+ match_found = true;
+ collect_diff_blob_oids(cur->commit, &ids->diffopts,
+ &blobs);
+ }
+ if (match_found)
+ collect_diff_blob_oids(c, &ids->diffopts, &blobs);
+ }
+
+ /* Restore original cmpfn */
+ ids->patches.cmpfn = original_cmpfn;
+
+ /* If we have any blobs to fetch, fetch them */
+ if (oidset_size(&blobs)) {
+ struct oid_array to_fetch = OID_ARRAY_INIT;
+ struct oidset_iter iter;
+ const struct object_id *oid;
+
+ oidset_iter_init(&blobs, &iter);
+ while ((oid = oidset_iter_next(&iter)))
+ oid_array_append(&to_fetch, oid);
+
+ promisor_remote_get_direct(repo, to_fetch.oid, to_fetch.nr);
+
+ oid_array_clear(&to_fetch);
+ }
+
+ oidset_clear(&blobs);
+}
+
int cmd_cherry(int argc,
const char **argv,
const char *prefix,
@@ -2673,6 +2796,8 @@ int cmd_cherry(int argc,
commit_list_insert(commit, &list);
}
+ prefetch_cherry_blobs(the_repository, list, &ids);
+
for (struct commit_list *l = list; l; l = l->next) {
char sign = '+';
diff --git a/investigations/cherry-prefetch-design-spec.md b/investigations/cherry-prefetch-design-spec.md
new file mode 100644
index 0000000000..a499d9538e
--- /dev/null
+++ b/investigations/cherry-prefetch-design-spec.md
@@ -0,0 +1,210 @@
+# Design Spec: Batch Blob Prefetch for `git cherry` in Partial Clones
+
+## Problem
+
+In a partial clone with `--filter=blob:none`, `git cherry` compares
+commits using patch IDs. Patch IDs are computed in two phases:
+
+1. Header-only: hashes file paths and mode changes only (no blob reads)
+2. Full: hashes actual diff content (requires reading blobs)
+
+Phase 2 only runs when two commits have matching header-only IDs
+(i.e. they modify the same set of files with the same modes). This
+is common — any two commits touching the same file(s) will collide.
+
+When phase 2 needs a blob that isn't local, it triggers an on-demand
+promisor fetch. Each fetch is a separate network round-trip. With
+many collisions, this means many sequential fetches.
+
+## Solution Overview
+
+Add a preparatory pass before the existing comparison loop in
+`cmd_cherry()` that:
+
+1. Identifies which commit pairs will collide on header-only IDs
+2. Collects all blob OIDs those commits will need
+3. Batch-prefetches them in one fetch
+
+After this pass, the existing comparison loop runs as before, but
+all needed blobs are already local, so no on-demand fetches occur.
+
+## Detailed Design
+
+### 1. No struct changes to patch_id
+
+The existing `struct patch_id` and `patch_id_neq()` are not
+modified. `is_null_oid()` remains the sentinel for "full ID not
+yet computed". No `has_full_patch_id` boolean, no extra fields.
+
+Key insight: `init_patch_id_entry()` stores only `oidhash()` (the
+first 4 bytes of the header-only ID) in the hashmap bucket key.
+The real `patch_id_neq()` comparison function is invoked only when
+`hashmap_get()` or `hashmap_get_next()` finds entries with a
+matching oidhash — and that comparison triggers blob reads.
+
+The prefetch needs to detect exactly those oidhash collisions
+*without* triggering blob reads. We achieve this by temporarily
+swapping the hashmap's comparison function.
+
+### 2. The prefetch function (in builtin/log.c)
+
+This function takes the repository, the head-side commit list (as
+built by the existing revision walk in `cmd_cherry()`), and the
+patch_ids structure (which contains the upstream entries).
+
+#### 2.1 Early exit
+
+If the repository has no promisor remote, return immediately.
+Use `repo_has_promisor_remote()` from promisor-remote.h.
+
+#### 2.2 Swap in a trivial comparison function
+
+Save `ids->patches.cmpfn` (the real `patch_id_neq`) and replace
+it with a trivial function that always returns 0 ("equal").
+
+```
+static int patch_id_match(const void *unused_cmpfn_data,
+ const struct hashmap_entry *a,
+ const struct hashmap_entry *b,
+ const void *unused_keydata)
+{
+ return 0;
+}
+```
+
+With this cmpfn in place, `hashmap_get()` and `hashmap_get_next()`
+will match every entry in the same oidhash bucket — exactly the
+same set that would trigger `patch_id_neq()` during normal lookup.
+No blob reads occur because we never call the real comparison
+function.
+
+#### 2.3 For each head-side commit, probe for collisions
+
+For each commit in the head-side list:
+
+- Use `patch_id_iter_first(commit, ids)` to probe the upstream
+ hashmap. This handles `init_patch_id_entry()` + hashmap lookup
+ internally. With our swapped cmpfn, it returns any upstream
+ entry whose oidhash matches — i.e. any entry that *would*
+ trigger `patch_id_neq()` during the real comparison loop.
+ (Merge commits are already handled — `patch_id_iter_first()`
+ returns NULL for them via `patch_id_defined()`.)
+- If there's a match: collect blob OIDs from the head-side commit
+ (see section 3).
+- Then walk `patch_id_iter_next()` to find ALL upstream entries
+ in the same bucket. For each, collect blob OIDs from that
+ upstream commit too. (Multiple upstream commits can share the
+ same oidhash bucket.)
+- Collect blob OIDs from the first upstream match too (from
+ `patch_id_iter_first()`).
+
+We need blobs from BOTH sides because `patch_id_neq()` computes
+full patch IDs for both the upstream and head-side commit when
+comparing.
+
+#### 2.4 Restore the original comparison function
+
+Set `ids->patches.cmpfn` back to the saved value (patch_id_neq).
+This MUST happen before returning — the subsequent
+`has_commit_patch_id()` loop needs the real comparison function.
+
+#### 2.5 Batch prefetch
+
+If the oidset is non-empty, populate an oid_array from it using
+`oidset_iter_first()`/`oidset_iter_next()`, then call
+`promisor_remote_get_direct(repo, oid_array.oid, oid_array.nr)`.
+
+This is a single network round-trip regardless of how many blobs.
+
+#### 2.6 Cleanup
+
+Free the oid_array and the oidset.
+
+### 3. Collecting blob OIDs from a commit (helper function)
+
+Given a commit, enumerate the blobs its diff touches. Takes an
+oidset to insert into (provides automatic dedup — consecutive
+commits often share blob OIDs, e.g. B:foo == C^:foo when C's
+parent is B).
+
+- Compute the diff: `diff_tree_oid()` for commits with a parent,
+ `diff_root_tree_oid()` for root commits. Then `diffcore_std()`.
+- These populate the global `diff_queued_diff` queue.
+- For each filepair in the queue:
+ - Check the userdiff driver for the file path. If the driver
+ explicitly declares the file as binary (`drv->binary > 0`),
+ skip it. Reason: patch-ID uses `oid_to_hex()` for binary
+ files (see diff.c around line 6652) and never reads the blob.
+ Use `userdiff_find_by_path()` (NOT `diff_filespec_load_driver`
+ which is static in diff.c).
+ - For both sides of the filepair (p->one and p->two): if the
+ side is valid (`DIFF_FILE_VALID`) and has a non-null OID,
+ check the dedup oidset — `oidset_insert()` handles dedup
+ automatically (returns 1 if already present, 0 if newly added).
+- Clear the diff queue with `diff_queue_clear()` (from diffcore.h,
+ not diff.h).
+
+Note on `drv->binary`: The value -1 means "not set" (auto-detect
+at read time by reading the blob); 0 means explicitly text (will
+be diffed, blob reads needed); positive means explicitly binary
+(patch-ID uses `oid_to_hex()`, no blob read needed).
+
+The correct skip condition is `drv && drv->binary > 0` — skip
+only known-binary files. Do NOT use `drv->binary != -1`, which
+would also skip explicitly-text files that DO need blob reads.
+(The copilot reference implementation uses `!= -1`, which is
+technically wrong but harmless in practice since explicit text
+attributes are rare.)
+
+### 4. Call site in cmd_cherry()
+
+Insert the call between the revision walk loop (which builds the
+head-side commit list) and the comparison loop (which calls
+`has_commit_patch_id()`).
+
+### 5. Required includes in builtin/log.c
+
+- promisor-remote.h (for repo_has_promisor_remote,
+ promisor_remote_get_direct)
+- userdiff.h (for userdiff_find_by_path)
+- oidset.h (for oidset used in blob OID dedup)
+- diffcore.h (for diff_queue_clear)
+
+## Edge Cases
+
+- No promisor remote: early return, zero overhead
+- No collisions: probes the hashmap for each head-side commit but
+ finds no bucket matches, no blobs collected, no fetch issued
+- Merge commits in head-side list: skipped (no patch ID defined)
+- Root commits (no parent): use diff_root_tree_oid instead of
+ diff_tree_oid
+- Binary files (explicit driver): skipped, patch-ID doesn't read
+ them
+- The cmpfn swap approach matches at oidhash granularity (4 bytes),
+ which is exactly what the hashmap itself uses to trigger
+ patch_id_neq(). This means we prefetch for every case the real
+ code would trigger, plus rare false-positive oidhash collisions
+ (harmless: we fetch a few extra blobs that won't end up being
+ compared). No under-fetching is possible.
+
+## Testing
+
+See t/t3500-cherry.sh on the copilot-faster-partial-clones branch
+for two tests:
+
+Test 5: "cherry batch-prefetches blobs in partial clone"
+ - Creates server with 3 upstream + 3 head-side commits modifying
+ the same file (guarantees collisions)
+ - Clones with --filter=blob:none
+ - Runs `git cherry` with GIT_TRACE2_PERF
+ - Asserts exactly 1 fetch (batch) instead of 6 (individual)
+
+Test 6: "cherry prefetch omits blobs for cherry-picked commits"
+ - Creates a cherry-pick scenario (divergent branches, shared
+ commit cherry-picked to head side)
+ - Verifies `git cherry` correctly identifies the cherry-picked
+ commit as "-" and head-only commits as "+"
+ - Important: the head side must diverge before the cherry-pick
+ so the cherry-pick creates a distinct commit object (otherwise
+ the commit hash is identical and it's in the symmetric
+ difference, not needing patch-ID comparison at all)
diff --git a/t/t3500-cherry.sh b/t/t3500-cherry.sh
index 78c3eac54b..17507d9a28 100755
--- a/t/t3500-cherry.sh
+++ b/t/t3500-cherry.sh
@@ -78,4 +78,22 @@ test_expect_success 'cherry ignores whitespace' '
test_cmp expect actual
'
+# Reuse the expect file from the previous test, in a partial clone
+test_expect_success 'cherry in partial clone does bulk prefetch' '
+ test_config uploadpack.allowfilter 1 &&
+ test_config uploadpack.allowanysha1inwant 1 &&
+ test_when_finished "rm -rf copy" &&
+
+ git clone --bare --filter=blob:none file://"$(pwd)" copy &&
+ (
+ cd copy &&
+ GIT_TRACE2_EVENT="$(pwd)/trace.output" git cherry upstream-with-space feature-without-space >actual &&
+ test_cmp ../expect actual &&
+
+ grep "child_start.*fetch.negotiationAlgorithm" trace.output >fetches &&
+ test_line_count = 1 fetches &&
+ test_trace2_data promisor fetch_count 4 <trace.output
+ )
+'
+
test_done
--
gitgitgadget
* [PATCH 3/3] grep: prefetch necessary blobs
2026-04-16 22:48 [PATCH 0/3] Batch prefetching Elijah Newren via GitGitGadget
2026-04-16 22:48 ` [PATCH 1/3] patch-ids.h: add missing trailing parenthesis in documentation comment Elijah Newren via GitGitGadget
2026-04-16 22:48 ` [PATCH 2/3] builtin/log: prefetch necessary blobs for `git cherry` Elijah Newren via GitGitGadget
@ 2026-04-16 22:48 ` Elijah Newren via GitGitGadget
2026-04-18 0:32 ` [PATCH v2 0/3] Batch prefetching Elijah Newren via GitGitGadget
3 siblings, 0 replies; 25+ messages in thread
From: Elijah Newren via GitGitGadget @ 2026-04-16 22:48 UTC (permalink / raw)
To: git; +Cc: Elijah Newren, Elijah Newren
From: Elijah Newren <newren@gmail.com>
In partial clones, `git grep` fetches necessary blobs on-demand one
at a time, which can be very slow. Add an extra preliminary walk over
the tree, similar to the one in grep_tree(), which collects the blobs
of interest and then prefetches them in one batch.
Signed-off-by: Elijah Newren <newren@gmail.com>
---
builtin/grep.c | 142 ++++++++++++++++++++++++++++++++++++++++++++++++
t/t7810-grep.sh | 35 ++++++++++++
2 files changed, 177 insertions(+)
diff --git a/builtin/grep.c b/builtin/grep.c
index e33285e5e6..d559c48d94 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -28,9 +28,12 @@
#include "object-file.h"
#include "object-name.h"
#include "odb.h"
+#include "oid-array.h"
+#include "oidset.h"
#include "packfile.h"
#include "pager.h"
#include "path.h"
+#include "promisor-remote.h"
#include "read-cache-ll.h"
#include "write-or-die.h"
@@ -692,6 +695,143 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
return hit;
}
+static void collect_blob_oids_for_tree(struct repository *repo,
+ const struct pathspec *pathspec,
+ struct tree_desc *tree,
+ struct strbuf *base,
+ int tn_len,
+ struct oidset *blob_oids)
+{
+ struct name_entry entry;
+ int old_baselen = base->len;
+ struct strbuf name = STRBUF_INIT;
+ enum interesting match = entry_not_interesting;
+
+ while (tree_entry(tree, &entry)) {
+ if (match != all_entries_interesting) {
+ strbuf_addstr(&name, base->buf + tn_len);
+ match = tree_entry_interesting(repo->index,
+ &entry, &name,
+ pathspec);
+ strbuf_reset(&name);
+
+ if (match == all_entries_not_interesting)
+ break;
+ if (match == entry_not_interesting)
+ continue;
+ }
+
+ strbuf_add(base, entry.path, tree_entry_len(&entry));
+
+ if (S_ISREG(entry.mode)) {
+ oidset_insert(blob_oids, &entry.oid);
+ } else if (S_ISDIR(entry.mode)) {
+ enum object_type type;
+ struct tree_desc sub_tree;
+ void *data;
+ unsigned long size;
+
+ data = odb_read_object(repo->objects, &entry.oid,
+ &type, &size);
+ if (!data)
+ die(_("unable to read tree (%s)"),
+ oid_to_hex(&entry.oid));
+
+ strbuf_addch(base, '/');
+ init_tree_desc(&sub_tree, &entry.oid, data, size);
+ collect_blob_oids_for_tree(repo, pathspec, &sub_tree,
+ base, tn_len, blob_oids);
+ free(data);
+ }
+ /*
+ * ...no else clause for S_ISGITLINK: submodules have their
+ * own promisor configuration and would need separate fetches
+ * anyway.
+ */
+
+ strbuf_setlen(base, old_baselen);
+ }
+
+ strbuf_release(&name);
+}
+
+static void collect_blob_oids_for_treeish(struct grep_opt *opt,
+ const struct pathspec *pathspec,
+ const struct object_id *tree_ish_oid,
+ const char *name,
+ struct oidset *blob_oids)
+{
+ struct tree_desc tree;
+ void *data;
+ unsigned long size;
+ struct strbuf base = STRBUF_INIT;
+ int len;
+
+ data = odb_read_object_peeled(opt->repo->objects, tree_ish_oid,
+ OBJ_TREE, &size, NULL);
+
+ if (!data)
+ return;
+
+ len = name ? strlen(name) : 0;
+ if (len) {
+ strbuf_add(&base, name, len);
+ strbuf_addch(&base, ':');
+ }
+ init_tree_desc(&tree, tree_ish_oid, data, size);
+
+ collect_blob_oids_for_tree(opt->repo, pathspec, &tree,
+ &base, base.len, blob_oids);
+
+ strbuf_release(&base);
+ free(data);
+}
+
+static void prefetch_grep_blobs(struct grep_opt *opt,
+ const struct pathspec *pathspec,
+ const struct object_array *list)
+{
+ struct oidset blob_oids = OIDSET_INIT;
+
+ /* Exit if we're not in a partial clone */
+ if (!repo_has_promisor_remote(opt->repo))
+ return;
+
+ /* For each tree, gather the blobs in it */
+ for (int i = 0; i < list->nr; i++) {
+ struct object *real_obj;
+
+ obj_read_lock();
+ real_obj = deref_tag(opt->repo, list->objects[i].item,
+ NULL, 0);
+ obj_read_unlock();
+
+ if (real_obj &&
+ (real_obj->type == OBJ_COMMIT ||
+ real_obj->type == OBJ_TREE))
+ collect_blob_oids_for_treeish(opt, pathspec,
+ &real_obj->oid,
+ list->objects[i].name,
+ &blob_oids);
+ }
+
+ /* Prefetch the blobs we found */
+ if (oidset_size(&blob_oids)) {
+ struct oid_array to_fetch = OID_ARRAY_INIT;
+ struct oidset_iter iter;
+ const struct object_id *oid;
+
+ oidset_iter_init(&blob_oids, &iter);
+ while ((oid = oidset_iter_next(&iter)))
+ oid_array_append(&to_fetch, oid);
+
+ promisor_remote_get_direct(opt->repo, to_fetch.oid, to_fetch.nr);
+
+ oid_array_clear(&to_fetch);
+ }
+ oidset_clear(&blob_oids);
+}
+
static int grep_object(struct grep_opt *opt, const struct pathspec *pathspec,
struct object *obj, const char *name, const char *path)
{
@@ -732,6 +872,8 @@ static int grep_objects(struct grep_opt *opt, const struct pathspec *pathspec,
int hit = 0;
const unsigned int nr = list->nr;
+ prefetch_grep_blobs(opt, pathspec, list);
+
for (i = 0; i < nr; i++) {
struct object *real_obj;
diff --git a/t/t7810-grep.sh b/t/t7810-grep.sh
index 64ac4f04ee..1f484502fe 100755
--- a/t/t7810-grep.sh
+++ b/t/t7810-grep.sh
@@ -1929,4 +1929,39 @@ test_expect_success 'grep does not report i-t-a and assume unchanged with -L' '
test_cmp expected actual
'
+test_expect_success 'grep of revision in partial clone does bulk prefetch' '
+ test_when_finished "rm -rf grep-partial-src grep-partial" &&
+
+ git init grep-partial-src &&
+ (
+ cd grep-partial-src &&
+ git config uploadpack.allowfilter 1 &&
+ git config uploadpack.allowanysha1inwant 1 &&
+ echo "needle in haystack" >searchme &&
+ echo "no match here" >other &&
+ mkdir subdir &&
+ echo "needle again" >subdir/deep &&
+ git add . &&
+ git commit -m "initial"
+ ) &&
+
+ git clone --no-checkout --filter=blob:none \
+ "file://$(pwd)/grep-partial-src" grep-partial &&
+
+ # All blobs should be missing after a blobless clone.
+ git -C grep-partial rev-list --quiet --objects \
+ --missing=print HEAD >missing &&
+ test_line_count = 3 missing &&
+
+ # grep HEAD should batch-prefetch all blobs in one request.
+ GIT_TRACE2_EVENT="$(pwd)/grep-trace" \
+ git -C grep-partial grep -c "needle" HEAD >result &&
+
+ # Should find matches in two files.
+ test_line_count = 2 result &&
+
+ # Should have prefetched all 3 objects at once
+ test_trace2_data promisor fetch_count 3 <grep-trace
+'
+
test_done
--
gitgitgadget
* Re: [PATCH 2/3] builtin/log: prefetch necessary blobs for `git cherry`
2026-04-16 22:48 ` [PATCH 2/3] builtin/log: prefetch necessary blobs for `git cherry` Elijah Newren via GitGitGadget
@ 2026-04-17 21:42 ` Junio C Hamano
2026-04-17 22:02 ` Elijah Newren
0 siblings, 1 reply; 25+ messages in thread
From: Junio C Hamano @ 2026-04-17 21:42 UTC (permalink / raw)
To: Elijah Newren via GitGitGadget; +Cc: git, Elijah Newren
"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Elijah Newren <newren@gmail.com>
>
> In partial clones, `git cherry` fetches necessary blobs on-demand one
> at a time, which can be very slow. We would like to prefetch all
> necessary blobs upfront. To do so, we first need to figure out which
> blobs are needed.
>
> `git cherry` does its work in a two-phase approach: first computing
> header-only IDs (based on file paths and modes), then falling back to
> full content-based IDs only when header-only IDs collide -- or, more
> accurately, whenever the oidhash() values of the header-only
> object_ids collide.
>
> patch-ids.c handles this by creating an ids->patches hashmap that has
> all the data we need, but the problem is that any attempt to query the
> hashmap will invoke the patch_id_neq() function on any colliding objects,
> which causes the on-demand fetching.
>
> Insert a call to a new prefetch_cherry_blobs() function before
> checking for collisions. It temporarily replaces ids->patches.cmpfn
> in order to enumerate the blobs that would be needed without yet
> fetching them, fetches them all at once, and then restores the
> original ids->patches.cmpfn.
>
> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
> builtin/log.c | 125 +++++++++++
> investigations/cherry-prefetch-design-spec.md | 210 ++++++++++++++++++
Did you mean to add this file to the project? As a document to
describe how "git cherry" works, it is vastly lacking, and once this
series lands, I am not sure how others would benefit from being able
to read it. Much of the material in there seems like what would
typically be given in the log message, but not to this degree of
detail, so I am not sure where it belongs.
> t/t3500-cherry.sh | 18 ++
> 3 files changed, 353 insertions(+)
> create mode 100644 investigations/cherry-prefetch-design-spec.md
>
> diff --git a/builtin/log.c b/builtin/log.c
> index 8c0939dd42..df19876be6 100644
> --- a/builtin/log.c
> +++ b/builtin/log.c
> @@ -21,10 +21,12 @@
> #include "color.h"
> #include "commit.h"
> #include "diff.h"
> +#include "diffcore.h"
> #include "diff-merges.h"
> #include "revision.h"
> #include "log-tree.h"
> #include "oid-array.h"
> +#include "oidset.h"
> #include "tag.h"
> #include "reflog-walk.h"
> #include "patch-ids.h"
> @@ -43,9 +45,11 @@
> #include "utf8.h"
>
> #include "commit-reach.h"
> +#include "promisor-remote.h"
> #include "range-diff.h"
> #include "tmp-objdir.h"
> #include "tree.h"
> +#include "userdiff.h"
> #include "write-or-die.h"
>
> #define MAIL_DEFAULT_WRAP 72
> @@ -2602,6 +2606,125 @@ static void print_commit(char sign, struct commit *commit, int verbose,
> }
> }
>
> +/*
> + * Enumerate blob OIDs from a single commit's diff, inserting them into blobs.
> + * Skips files whose userdiff driver explicitly declares binary status
> + * (drv->binary > 0), since patch-ID uses oid_to_hex() for those and
> + * never reads blob content. Use userdiff_find_by_path() since
> + * diff_filespec_load_driver() is static in diff.c.
> + *
> + * Clean up with diff_queue_clear() (from diffcore.h).
> + */
> +static void collect_diff_blob_oids(struct commit *commit,
> + struct diff_options *opts,
> + struct oidset *blobs)
> +{
> + struct diff_queue_struct *q;
> +
> + /*
> + * Merge commits are filtered out by patch_id_defined() in patch-ids.c,
> + * so we'll never be called with one.
> + */
> + assert(!commit->parents || !commit->parents->next);
> +
> + if (commit->parents)
> + diff_tree_oid(&commit->parents->item->object.oid,
> + &commit->object.oid, "", opts);
> + else
> + diff_root_tree_oid(&commit->object.oid, "", opts);
> + diffcore_std(opts);
> +
> + q = &diff_queued_diff;
> + for (int i = 0; i < q->nr; i++) {
> + struct diff_filepair *p = q->queue[i];
> + struct userdiff_driver *drv;
> +
> + /* Skip binary files */
> + drv = userdiff_find_by_path(opts->repo->index, p->one->path);
> + if (drv && drv->binary > 0)
> + continue;
> +
> + if (DIFF_FILE_VALID(p->one))
> + oidset_insert(blobs, &p->one->oid);
> + if (DIFF_FILE_VALID(p->two))
> + oidset_insert(blobs, &p->two->oid);
> + }
> + diff_queue_clear(q);
> +}
> +
> +static int always_match(const void *cmp_data UNUSED,
> + const struct hashmap_entry *entry1 UNUSED,
> + const struct hashmap_entry *entry2 UNUSED,
> + const void *keydata UNUSED)
> +{
> + return 0;
> +}
> +
> +/*
> + * Prefetch blobs for git cherry in partial clones.
> + *
> + * Called between the revision walk (which builds the head-side
> + * commit list) and the has_commit_patch_id() comparison loop.
> + *
> + * Uses a cmpfn-swap trick to avoid reading blobs: temporarily
> + * replaces the hashmap's comparison function with a trivial
> + * always-match function, so hashmap_get()/hashmap_get_next() match
> + * any entry with the same oidhash bucket. These are the set of oids
> + * that would trigger patch_id_neq() during normal lookup and cause
> + * blobs to be read on demand, and we want to prefetch them all at
> + * once instead.
> + */
> +static void prefetch_cherry_blobs(struct repository *repo,
> + struct commit_list *list,
> + struct patch_ids *ids)
> +{
> + struct oidset blobs = OIDSET_INIT;
> + hashmap_cmp_fn original_cmpfn;
> +
> + /* Exit if we're not in a partial clone */
> + if (!repo_has_promisor_remote(repo))
> + return;
> +
> + /* Save original cmpfn, replace with always_match */
> + original_cmpfn = ids->patches.cmpfn;
> + ids->patches.cmpfn = always_match;
> +
> + /* Find header-only collisions, gather blobs from those commits */
> + for (struct commit_list *l = list; l; l = l->next) {
> + struct commit *c = l->item;
> + bool match_found = false;
> + for (struct patch_id *cur = patch_id_iter_first(c, ids);
> + cur;
> + cur = patch_id_iter_next(cur, ids)) {
> + match_found = true;
> + collect_diff_blob_oids(cur->commit, &ids->diffopts,
> + &blobs);
> + }
> + if (match_found)
> + collect_diff_blob_oids(c, &ids->diffopts, &blobs);
> + }
> +
> + /* Restore original cmpfn */
> + ids->patches.cmpfn = original_cmpfn;
> +
> + /* If we have any blobs to fetch, fetch them */
> + if (oidset_size(&blobs)) {
> + struct oid_array to_fetch = OID_ARRAY_INIT;
> + struct oidset_iter iter;
> + const struct object_id *oid;
> +
> + oidset_iter_init(&blobs, &iter);
> + while ((oid = oidset_iter_next(&iter)))
> + oid_array_append(&to_fetch, oid);
> +
> + promisor_remote_get_direct(repo, to_fetch.oid, to_fetch.nr);
> +
> + oid_array_clear(&to_fetch);
> + }
> +
> + oidset_clear(&blobs);
> +}
> +
> int cmd_cherry(int argc,
> const char **argv,
> const char *prefix,
> @@ -2673,6 +2796,8 @@ int cmd_cherry(int argc,
> commit_list_insert(commit, &list);
> }
>
> + prefetch_cherry_blobs(the_repository, list, &ids);
> +
> for (struct commit_list *l = list; l; l = l->next) {
> char sign = '+';
>
> diff --git a/investigations/cherry-prefetch-design-spec.md b/investigations/cherry-prefetch-design-spec.md
> new file mode 100644
> index 0000000000..a499d9538e
> --- /dev/null
> +++ b/investigations/cherry-prefetch-design-spec.md
> @@ -0,0 +1,210 @@
> +# Design Spec: Batch Blob Prefetch for `git cherry` in Partial Clones
> +
> +## Problem
> +
> +In a partial clone with `--filter=blob:none`, `git cherry` compares
> +commits using patch IDs. Patch IDs are computed in two phases:
> +
> +1. Header-only: hashes file paths and mode changes only (no blob reads)
> +2. Full: hashes actual diff content (requires reading blobs)
> +
> +Phase 2 only runs when two commits have matching header-only IDs
> +(i.e. they modify the same set of files with the same modes). This
> +is common — any two commits touching the same file(s) will collide.
> +
> +When phase 2 needs a blob that isn't local, it triggers an on-demand
> +promisor fetch. Each fetch is a separate network round-trip. With
> +many collisions, this means many sequential fetches.
> +
> +## Solution Overview
> +
> +Add a preparatory pass before the existing comparison loop in
> +`cmd_cherry()` that:
> +
> +1. Identifies which commit pairs will collide on header-only IDs
> +2. Collects all blob OIDs those commits will need
> +3. Batch-prefetches them in one fetch
> +
> +After this pass, the existing comparison loop runs as before, but
> +all needed blobs are already local, so no on-demand fetches occur.
> +
> +## Detailed Design
> +
> +### 1. No struct changes to patch_id
> +
> +The existing `struct patch_id` and `patch_id_neq()` are not
> +modified. `is_null_oid()` remains the sentinel for "full ID not
> +yet computed". No `has_full_patch_id` boolean, no extra fields.
> +
> +Key insight: `init_patch_id_entry()` stores only `oidhash()` (the
> +first 4 bytes of the header-only ID) in the hashmap bucket key.
> +The real `patch_id_neq()` comparison function is invoked only when
> +`hashmap_get()` or `hashmap_get_next()` finds entries with a
> +matching oidhash — and that comparison triggers blob reads.
> +
> +The prefetch needs to detect exactly those oidhash collisions
> +*without* triggering blob reads. We achieve this by temporarily
> +swapping the hashmap's comparison function.
> +
> +### 2. The prefetch function (in builtin/log.c)
> +
> +This function takes the repository, the head-side commit list (as
> +built by the existing revision walk in `cmd_cherry()`), and the
> +patch_ids structure (which contains the upstream entries).
> +
> +#### 2.1 Early exit
> +
> +If the repository has no promisor remote, return immediately.
> +Use `repo_has_promisor_remote()` from promisor-remote.h.
> +
> +#### 2.2 Swap in a trivial comparison function
> +
> +Save `ids->patches.cmpfn` (the real `patch_id_neq`) and replace
> +it with a trivial function that always returns 0 ("equal").
> +
> +```
> +static int patch_id_match(const void *unused_cmpfn_data,
> + const struct hashmap_entry *a,
> + const struct hashmap_entry *b,
> + const void *unused_keydata)
> +{
> + return 0;
> +}
> +```
> +
> +With this cmpfn in place, `hashmap_get()` and `hashmap_get_next()`
> +will match every entry in the same oidhash bucket — exactly the
> +same set that would trigger `patch_id_neq()` during normal lookup.
> +No blob reads occur because we never call the real comparison
> +function.
> +
> +#### 2.3 For each head-side commit, probe for collisions
> +
> +For each commit in the head-side list:
> +
> +- Use `patch_id_iter_first(commit, ids)` to probe the upstream
> + hashmap. This handles `init_patch_id_entry()` + hashmap lookup
> + internally. With our swapped cmpfn, it returns any upstream
> + entry whose oidhash matches — i.e. any entry that *would*
> + trigger `patch_id_neq()` during the real comparison loop.
> + (Merge commits are already handled — `patch_id_iter_first()`
> + returns NULL for them via `patch_id_defined()`.)
> +- If there's a match: collect blob OIDs from the head-side commit
> + (see section 3).
> +- Then walk `patch_id_iter_next()` to find ALL upstream entries
> + in the same bucket. For each, collect blob OIDs from that
> + upstream commit too. (Multiple upstream commits can share the
> + same oidhash bucket.)
> +- Collect blob OIDs from the first upstream match too (from
> + `patch_id_iter_first()`).
> +
> +We need blobs from BOTH sides because `patch_id_neq()` computes
> +full patch IDs for both the upstream and head-side commit when
> +comparing.
> +
> +#### 2.4 Restore the original comparison function
> +
> +Set `ids->patches.cmpfn` back to the saved value (patch_id_neq).
> +This MUST happen before returning — the subsequent
> +`has_commit_patch_id()` loop needs the real comparison function.
> +
> +#### 2.5 Batch prefetch
> +
> +If the oidset is non-empty, populate an oid_array from it using
> +`oidset_iter_first()`/`oidset_iter_next()`, then call
> +`promisor_remote_get_direct(repo, oid_array.oid, oid_array.nr)`.
> +
> +This is a single network round-trip regardless of how many blobs.
> +
> +#### 2.6 Cleanup
> +
> +Free the oid_array and the oidset.
> +
> +### 3. Collecting blob OIDs from a commit (helper function)
> +
> +Given a commit, enumerate the blobs its diff touches. Takes an
> +oidset to insert into (provides automatic dedup — consecutive
> +commits often share blob OIDs, e.g. B:foo == C^:foo when C's
> +parent is B).
> +
> +- Compute the diff: `diff_tree_oid()` for commits with a parent,
> + `diff_root_tree_oid()` for root commits. Then `diffcore_std()`.
> +- These populate the global `diff_queued_diff` queue.
> +- For each filepair in the queue:
> + - Check the userdiff driver for the file path. If the driver
> + explicitly declares the file as binary (`drv->binary > 0`),
> + skip it. Reason: patch-ID uses `oid_to_hex()` for binary
> + files (see diff.c around line 6652) and never reads the blob.
> + Use `userdiff_find_by_path()` (NOT `diff_filespec_load_driver`
> + which is static in diff.c).
> + - For both sides of the filepair (p->one and p->two): if the
> + side is valid (`DIFF_FILE_VALID`) and has a non-null OID,
> + check the dedup oidset — `oidset_insert()` handles dedup
> + automatically (returns 1 if newly inserted, 0 if duplicate).
> +- Clear the diff queue with `diff_queue_clear()` (from diffcore.h,
> + not diff.h).
> +
> +Note on `drv->binary`: The value -1 means "not set" (auto-detect
> +at read time by reading the blob); 0 means explicitly text (will
> +be diffed, blob reads needed); positive means explicitly binary
> +(patch-ID uses `oid_to_hex()`, no blob read needed).
> +
> +The correct skip condition is `drv && drv->binary > 0` — skip
> +only known-binary files. Do NOT use `drv->binary != -1`, which
> +would also skip explicitly-text files that DO need blob reads.
> +(The copilot reference implementation uses `!= -1`, which is
> +technically wrong but harmless in practice since explicit text
> +attributes are rare.)
> +
> +### 4. Call site in cmd_cherry()
> +
> +Insert the call between the revision walk loop (which builds the
> +head-side commit list) and the comparison loop (which calls
> +`has_commit_patch_id()`).
> +
> +### 5. Required includes in builtin/log.c
> +
> +- promisor-remote.h (for repo_has_promisor_remote,
> + promisor_remote_get_direct)
> +- userdiff.h (for userdiff_find_by_path)
> +- oidset.h (for oidset used in blob OID dedup)
> +- diffcore.h (for diff_queue_clear)
> +
> +## Edge Cases
> +
> +- No promisor remote: early return, zero overhead
> +- No collisions: probes the hashmap for each head-side commit but
> + finds no bucket matches, no blobs collected, no fetch issued
> +- Merge commits in head-side list: skipped (no patch ID defined)
> +- Root commits (no parent): use diff_root_tree_oid instead of
> + diff_tree_oid
> +- Binary files (explicit driver): skipped, patch-ID doesn't read
> + them
> +- The cmpfn swap approach matches at oidhash granularity (4 bytes),
> + which is exactly what the hashmap itself uses to trigger
> + patch_id_neq(). This means we prefetch for every case the real
> + code would trigger, plus rare false-positive oidhash collisions
> + (harmless: we fetch a few extra blobs that won't end up being
> + compared). No under-fetching is possible.
> +
> +## Testing
> +
> +See t/t3500-cherry.sh on the copilot-faster-partial-clones branch
> +for two tests:
> +
> +Test 5: "cherry batch-prefetches blobs in partial clone"
> + - Creates server with 3 upstream + 3 head-side commits modifying
> + the same file (guarantees collisions)
> + - Clones with --filter=blob:none
> + - Runs `git cherry` with GIT_TRACE2_PERF
> + - Asserts exactly 1 fetch (batch) instead of 6 (individual)
> +
> +Test 6: "cherry prefetch omits blobs for cherry-picked commits"
> + - Creates a cherry-pick scenario (divergent branches, shared
> + commit cherry-picked to head side)
> + - Verifies `git cherry` correctly identifies the cherry-picked
> + commit as "-" and head-only commits as "+"
> + - Important: the head side must diverge before the cherry-pick
> + so the cherry-pick creates a distinct commit object (otherwise
> + the commit hash is identical and it's in the symmetric
> + difference, not needing patch-ID comparison at all)
> diff --git a/t/t3500-cherry.sh b/t/t3500-cherry.sh
> index 78c3eac54b..17507d9a28 100755
> --- a/t/t3500-cherry.sh
> +++ b/t/t3500-cherry.sh
> @@ -78,4 +78,22 @@ test_expect_success 'cherry ignores whitespace' '
> test_cmp expect actual
> '
>
> +# Reuse the expect file from the previous test, in a partial clone
> +test_expect_success 'cherry in partial clone does bulk prefetch' '
> + test_config uploadpack.allowfilter 1 &&
> + test_config uploadpack.allowanysha1inwant 1 &&
> + test_when_finished "rm -rf copy" &&
> +
> + git clone --bare --filter=blob:none file://"$(pwd)" copy &&
> + (
> + cd copy &&
> + GIT_TRACE2_EVENT="$(pwd)/trace.output" git cherry upstream-with-space feature-without-space >actual &&
> + test_cmp ../expect actual &&
> +
> + grep "child_start.*fetch.negotiationAlgorithm" trace.output >fetches &&
> + test_line_count = 1 fetches &&
> + test_trace2_data promisor fetch_count 4 <trace.output
> + )
> +'
> +
> test_done
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 2/3] builtin/log: prefetch necessary blobs for `git cherry`
2026-04-17 21:42 ` Junio C Hamano
@ 2026-04-17 22:02 ` Elijah Newren
0 siblings, 0 replies; 25+ messages in thread
From: Elijah Newren @ 2026-04-17 22:02 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Elijah Newren via GitGitGadget, git
On Fri, Apr 17, 2026 at 2:42 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> > investigations/cherry-prefetch-design-spec.md | 210 ++++++++++++++++++
>
> Did you mean to add this file to the project? As a document to
> describe how "git cherry" works, it is vastly lacking, and once this
> series lands, I am not sure how others would benefit from being able
> to read it. Many of the materials in there seem to typically be given
> in the log message, but not to this degree of details, so I am not
> sure where it belongs.
Ugh, no, sorry.
* [PATCH v2 0/3] Batch prefetching
2026-04-16 22:48 [PATCH 0/3] Batch prefetching Elijah Newren via GitGitGadget
` (2 preceding siblings ...)
2026-04-16 22:48 ` [PATCH 3/3] grep: prefetch necessary blobs Elijah Newren via GitGitGadget
@ 2026-04-18 0:32 ` Elijah Newren via GitGitGadget
2026-04-18 0:32 ` [PATCH v2 1/3] patch-ids.h: add missing trailing parenthesis in documentation comment Elijah Newren via GitGitGadget
` (3 more replies)
3 siblings, 4 replies; 25+ messages in thread
From: Elijah Newren via GitGitGadget @ 2026-04-18 0:32 UTC (permalink / raw)
To: git; +Cc: Elijah Newren, Elijah Newren
Changes since v1:
* Remove stray file that should have never been added. So embarrassing that
I didn't catch that before submitting.
Partial clones provide a trade-off for users: avoid downloading blobs
upfront, at the expense of needing to download them later as they run other
commands. This trade-off can sometimes incur a more severe cost than
expected, particularly if needed blobs are discovered as they are accessed,
resulting in downloading blobs one at a time. Some commands like checkout,
diff, and merge do batch prefetches of necessary blobs, since that can
dramatically reduce the pain of on-demand loading. Extend this ability to
two more commands: cherry and grep.
This series was spurred by a report where git cherry jobs were each doing
hundreds of single-blob fetches, at a cost of 3s each. Batching those
downloads should dramatically speed up their jobs. (And I decided to fix up
git grep similarly while at it.)
I'll also note that git backfill with revisions and/or pathspecs could
improve things for these users, but backfill is a manual command: users
would have to remember to run it and figure out for themselves which data
is needed (a challenge in the case of cherry). So it still makes sense to
provide smarter behavior for folks who don't choose to run backfill by hand.
Also, correct a documentation typo I noticed in patch-ids.h (related to code
I was using for the git cherry fixes) as a preparatory fixup.
Elijah Newren (3):
patch-ids.h: add missing trailing parenthesis in documentation comment
builtin/log: prefetch necessary blobs for `git cherry`
grep: prefetch necessary blobs
builtin/grep.c | 142 ++++++++++++++++++++++++++++++++++++++++++++++
builtin/log.c | 125 ++++++++++++++++++++++++++++++++++++++++
patch-ids.h | 2 +-
t/t3500-cherry.sh | 18 ++++++
t/t7810-grep.sh | 35 ++++++++++++
5 files changed, 321 insertions(+), 1 deletion(-)
base-commit: 9f223ef1c026d91c7ac68cc0211bde255dda6199
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-2089%2Fnewren%2Fbatch-prefetching-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-2089/newren/batch-prefetching-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/2089
Range-diff vs v1:
1: 7f5ac5942e = 1: 663816a344 patch-ids.h: add missing trailing parenthesis in documentation comment
2: 610be2a49a ! 2: a705852723 builtin/log: prefetch necessary blobs for `git cherry`
@@ builtin/log.c: int cmd_cherry(int argc,
char sign = '+';
- ## investigations/cherry-prefetch-design-spec.md (new) ##
-@@
-+# Design Spec: Batch Blob Prefetch for `git cherry` in Partial Clones
-+
-+## Problem
-+
-+In a partial clone with `--filter=blob:none`, `git cherry` compares
-+commits using patch IDs. Patch IDs are computed in two phases:
-+
-+1. Header-only: hashes file paths and mode changes only (no blob reads)
-+2. Full: hashes actual diff content (requires reading blobs)
-+
-+Phase 2 only runs when two commits have matching header-only IDs
-+(i.e. they modify the same set of files with the same modes). This
-+is common — any two commits touching the same file(s) will collide.
-+
-+When phase 2 needs a blob that isn't local, it triggers an on-demand
-+promisor fetch. Each fetch is a separate network round-trip. With
-+many collisions, this means many sequential fetches.
-+
-+## Solution Overview
-+
-+Add a preparatory pass before the existing comparison loop in
-+`cmd_cherry()` that:
-+
-+1. Identifies which commit pairs will collide on header-only IDs
-+2. Collects all blob OIDs those commits will need
-+3. Batch-prefetches them in one fetch
-+
-+After this pass, the existing comparison loop runs as before, but
-+all needed blobs are already local, so no on-demand fetches occur.
-+
-+## Detailed Design
-+
-+### 1. No struct changes to patch_id
-+
-+The existing `struct patch_id` and `patch_id_neq()` are not
-+modified. `is_null_oid()` remains the sentinel for "full ID not
-+yet computed". No `has_full_patch_id` boolean, no extra fields.
-+
-+Key insight: `init_patch_id_entry()` stores only `oidhash()` (the
-+first 4 bytes of the header-only ID) in the hashmap bucket key.
-+The real `patch_id_neq()` comparison function is invoked only when
-+`hashmap_get()` or `hashmap_get_next()` finds entries with a
-+matching oidhash — and that comparison triggers blob reads.
-+
-+The prefetch needs to detect exactly those oidhash collisions
-+*without* triggering blob reads. We achieve this by temporarily
-+swapping the hashmap's comparison function.
-+
-+### 2. The prefetch function (in builtin/log.c)
-+
-+This function takes the repository, the head-side commit list (as
-+built by the existing revision walk in `cmd_cherry()`), and the
-+patch_ids structure (which contains the upstream entries).
-+
-+#### 2.1 Early exit
-+
-+If the repository has no promisor remote, return immediately.
-+Use `repo_has_promisor_remote()` from promisor-remote.h.
-+
-+#### 2.2 Swap in a trivial comparison function
-+
-+Save `ids->patches.cmpfn` (the real `patch_id_neq`) and replace
-+it with a trivial function that always returns 0 ("equal").
-+
-+```
-+static int patch_id_match(const void *unused_cmpfn_data,
-+ const struct hashmap_entry *a,
-+ const struct hashmap_entry *b,
-+ const void *unused_keydata)
-+{
-+ return 0;
-+}
-+```
-+
-+With this cmpfn in place, `hashmap_get()` and `hashmap_get_next()`
-+will match every entry in the same oidhash bucket — exactly the
-+same set that would trigger `patch_id_neq()` during normal lookup.
-+No blob reads occur because we never call the real comparison
-+function.
-+
-+#### 2.3 For each head-side commit, probe for collisions
-+
-+For each commit in the head-side list:
-+
-+- Use `patch_id_iter_first(commit, ids)` to probe the upstream
-+ hashmap. This handles `init_patch_id_entry()` + hashmap lookup
-+ internally. With our swapped cmpfn, it returns any upstream
-+ entry whose oidhash matches — i.e. any entry that *would*
-+ trigger `patch_id_neq()` during the real comparison loop.
-+ (Merge commits are already handled — `patch_id_iter_first()`
-+ returns NULL for them via `patch_id_defined()`.)
-+- If there's a match: collect blob OIDs from the head-side commit
-+ (see section 3).
-+- Then walk `patch_id_iter_next()` to find ALL upstream entries
-+ in the same bucket. For each, collect blob OIDs from that
-+ upstream commit too. (Multiple upstream commits can share the
-+ same oidhash bucket.)
-+- Collect blob OIDs from the first upstream match too (from
-+ `patch_id_iter_first()`).
-+
-+We need blobs from BOTH sides because `patch_id_neq()` computes
-+full patch IDs for both the upstream and head-side commit when
-+comparing.
-+
-+#### 2.4 Restore the original comparison function
-+
-+Set `ids->patches.cmpfn` back to the saved value (patch_id_neq).
-+This MUST happen before returning — the subsequent
-+`has_commit_patch_id()` loop needs the real comparison function.
-+
-+#### 2.5 Batch prefetch
-+
-+If the oidset is non-empty, populate an oid_array from it using
-+`oidset_iter_first()`/`oidset_iter_next()`, then call
-+`promisor_remote_get_direct(repo, oid_array.oid, oid_array.nr)`.
-+
-+This is a single network round-trip regardless of how many blobs.
-+
-+#### 2.6 Cleanup
-+
-+Free the oid_array and the oidset.
-+
-+### 3. Collecting blob OIDs from a commit (helper function)
-+
-+Given a commit, enumerate the blobs its diff touches. Takes an
-+oidset to insert into (provides automatic dedup — consecutive
-+commits often share blob OIDs, e.g. B:foo == C^:foo when C's
-+parent is B).
-+
-+- Compute the diff: `diff_tree_oid()` for commits with a parent,
-+ `diff_root_tree_oid()` for root commits. Then `diffcore_std()`.
-+- These populate the global `diff_queued_diff` queue.
-+- For each filepair in the queue:
-+ - Check the userdiff driver for the file path. If the driver
-+ explicitly declares the file as binary (`drv->binary > 0`),
-+ skip it. Reason: patch-ID uses `oid_to_hex()` for binary
-+ files (see diff.c around line 6652) and never reads the blob.
-+ Use `userdiff_find_by_path()` (NOT `diff_filespec_load_driver`
-+ which is static in diff.c).
-+ - For both sides of the filepair (p->one and p->two): if the
-+ side is valid (`DIFF_FILE_VALID`) and has a non-null OID,
-+ check the dedup oidset — `oidset_insert()` handles dedup
-+ automatically (returns 1 if newly inserted, 0 if duplicate).
-+- Clear the diff queue with `diff_queue_clear()` (from diffcore.h,
-+ not diff.h).
-+
-+Note on `drv->binary`: The value -1 means "not set" (auto-detect
-+at read time by reading the blob); 0 means explicitly text (will
-+be diffed, blob reads needed); positive means explicitly binary
-+(patch-ID uses `oid_to_hex()`, no blob read needed).
-+
-+The correct skip condition is `drv && drv->binary > 0` — skip
-+only known-binary files. Do NOT use `drv->binary != -1`, which
-+would also skip explicitly-text files that DO need blob reads.
-+(The copilot reference implementation uses `!= -1`, which is
-+technically wrong but harmless in practice since explicit text
-+attributes are rare.)
-+
-+### 4. Call site in cmd_cherry()
-+
-+Insert the call between the revision walk loop (which builds the
-+head-side commit list) and the comparison loop (which calls
-+`has_commit_patch_id()`).
-+
-+### 5. Required includes in builtin/log.c
-+
-+- promisor-remote.h (for repo_has_promisor_remote,
-+ promisor_remote_get_direct)
-+- userdiff.h (for userdiff_find_by_path)
-+- oidset.h (for oidset used in blob OID dedup)
-+- diffcore.h (for diff_queue_clear)
-+
-+## Edge Cases
-+
-+- No promisor remote: early return, zero overhead
-+- No collisions: probes the hashmap for each head-side commit but
-+ finds no bucket matches, no blobs collected, no fetch issued
-+- Merge commits in head-side list: skipped (no patch ID defined)
-+- Root commits (no parent): use diff_root_tree_oid instead of
-+ diff_tree_oid
-+- Binary files (explicit driver): skipped, patch-ID doesn't read
-+ them
-+- The cmpfn swap approach matches at oidhash granularity (4 bytes),
-+ which is exactly what the hashmap itself uses to trigger
-+ patch_id_neq(). This means we prefetch for every case the real
-+ code would trigger, plus rare false-positive oidhash collisions
-+ (harmless: we fetch a few extra blobs that won't end up being
-+ compared). No under-fetching is possible.
-+
-+## Testing
-+
-+See t/t3500-cherry.sh on the copilot-faster-partial-clones branch
-+for two tests:
-+
-+Test 5: "cherry batch-prefetches blobs in partial clone"
-+ - Creates server with 3 upstream + 3 head-side commits modifying
-+ the same file (guarantees collisions)
-+ - Clones with --filter=blob:none
-+ - Runs `git cherry` with GIT_TRACE2_PERF
-+ - Asserts exactly 1 fetch (batch) instead of 6 (individual)
-+
-+Test 6: "cherry prefetch omits blobs for cherry-picked commits"
-+ - Creates a cherry-pick scenario (divergent branches, shared
-+ commit cherry-picked to head side)
-+ - Verifies `git cherry` correctly identifies the cherry-picked
-+ commit as "-" and head-only commits as "+"
-+ - Important: the head side must diverge before the cherry-pick
-+ so the cherry-pick creates a distinct commit object (otherwise
-+ the commit hash is identical and it's in the symmetric
-+ difference, not needing patch-ID comparison at all)
-
## t/t3500-cherry.sh ##
@@ t/t3500-cherry.sh: test_expect_success 'cherry ignores whitespace' '
test_cmp expect actual
3: 6dbfc7608b = 3: 8fbfe69bc4 grep: prefetch necessary blobs
--
gitgitgadget
* [PATCH v2 1/3] patch-ids.h: add missing trailing parenthesis in documentation comment
2026-04-18 0:32 ` [PATCH v2 0/3] Batch prefetching Elijah Newren via GitGitGadget
@ 2026-04-18 0:32 ` Elijah Newren via GitGitGadget
2026-04-18 0:32 ` [PATCH v2 2/3] builtin/log: prefetch necessary blobs for `git cherry` Elijah Newren via GitGitGadget
` (2 subsequent siblings)
3 siblings, 0 replies; 25+ messages in thread
From: Elijah Newren via GitGitGadget @ 2026-04-18 0:32 UTC (permalink / raw)
To: git; +Cc: Elijah Newren, Elijah Newren, Elijah Newren
From: Elijah Newren <newren@gmail.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
---
patch-ids.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/patch-ids.h b/patch-ids.h
index 490d739371..57534ee722 100644
--- a/patch-ids.h
+++ b/patch-ids.h
@@ -37,7 +37,7 @@ int has_commit_patch_id(struct commit *commit, struct patch_ids *);
* struct patch_id *cur;
* for (cur = patch_id_iter_first(commit, ids);
* cur;
- * cur = patch_id_iter_next(cur, ids) {
+ * cur = patch_id_iter_next(cur, ids)) {
* ... look at cur->commit
* }
*/
--
gitgitgadget
* [PATCH v2 2/3] builtin/log: prefetch necessary blobs for `git cherry`
2026-04-18 0:32 ` [PATCH v2 0/3] Batch prefetching Elijah Newren via GitGitGadget
2026-04-18 0:32 ` [PATCH v2 1/3] patch-ids.h: add missing trailing parenthesis in documentation comment Elijah Newren via GitGitGadget
@ 2026-04-18 0:32 ` Elijah Newren via GitGitGadget
2026-04-19 14:04 ` Phillip Wood
2026-04-27 13:16 ` Derrick Stolee
2026-04-18 0:32 ` [PATCH v2 3/3] grep: prefetch necessary blobs Elijah Newren via GitGitGadget
2026-05-14 16:25 ` [PATCH v3 0/4] Batch prefetching Elijah Newren via GitGitGadget
3 siblings, 2 replies; 25+ messages in thread
From: Elijah Newren via GitGitGadget @ 2026-04-18 0:32 UTC (permalink / raw)
To: git; +Cc: Elijah Newren, Elijah Newren, Elijah Newren
From: Elijah Newren <newren@gmail.com>
In partial clones, `git cherry` fetches necessary blobs on-demand one
at a time, which can be very slow. We would like to prefetch all
necessary blobs upfront. To do so, we first need to figure out
which blobs are needed.
`git cherry` does its work in two phases: it first computes
header-only IDs (based on file paths and modes), then falls back to
full content-based IDs only when the header-only IDs collide -- or,
more accurately, whenever the oidhash() values of the header-only
object_ids collide.
patch-ids.c handles this by creating an ids->patches hashmap that has
all the data we need, but the problem is that any attempt to query the
hashmap will invoke the patch_id_neq() function on any colliding objects,
which causes the on-demand fetching.
Insert a new prefetch_cherry_blobs() function before checking for
collisions. It temporarily replaces ids->patches.cmpfn so that the
blobs that would be needed can be enumerated without fetching them,
fetches them all in one batch, and then restores the original
ids->patches.cmpfn.
Signed-off-by: Elijah Newren <newren@gmail.com>
---
builtin/log.c | 125 ++++++++++++++++++++++++++++++++++++++++++++++
t/t3500-cherry.sh | 18 +++++++
2 files changed, 143 insertions(+)
diff --git a/builtin/log.c b/builtin/log.c
index 8c0939dd42..df19876be6 100644
--- a/builtin/log.c
+++ b/builtin/log.c
@@ -21,10 +21,12 @@
#include "color.h"
#include "commit.h"
#include "diff.h"
+#include "diffcore.h"
#include "diff-merges.h"
#include "revision.h"
#include "log-tree.h"
#include "oid-array.h"
+#include "oidset.h"
#include "tag.h"
#include "reflog-walk.h"
#include "patch-ids.h"
@@ -43,9 +45,11 @@
#include "utf8.h"
#include "commit-reach.h"
+#include "promisor-remote.h"
#include "range-diff.h"
#include "tmp-objdir.h"
#include "tree.h"
+#include "userdiff.h"
#include "write-or-die.h"
#define MAIL_DEFAULT_WRAP 72
@@ -2602,6 +2606,125 @@ static void print_commit(char sign, struct commit *commit, int verbose,
}
}
+/*
+ * Enumerate blob OIDs from a single commit's diff, inserting them into
+ * blobs.  Files whose userdiff driver explicitly declares them binary
+ * (drv->binary > 0) are skipped, since patch-ID uses oid_to_hex() for
+ * those and never reads blob content.
+ *
+ * (userdiff_find_by_path() is used here because
+ * diff_filespec_load_driver() is static in diff.c.)
+ */
+static void collect_diff_blob_oids(struct commit *commit,
+ struct diff_options *opts,
+ struct oidset *blobs)
+{
+ struct diff_queue_struct *q;
+
+ /*
+ * Merge commits are filtered out by patch_id_defined() in patch-ids.c,
+ * so we'll never be called with one.
+ */
+ assert(!commit->parents || !commit->parents->next);
+
+ if (commit->parents)
+ diff_tree_oid(&commit->parents->item->object.oid,
+ &commit->object.oid, "", opts);
+ else
+ diff_root_tree_oid(&commit->object.oid, "", opts);
+ diffcore_std(opts);
+
+ q = &diff_queued_diff;
+ for (int i = 0; i < q->nr; i++) {
+ struct diff_filepair *p = q->queue[i];
+ struct userdiff_driver *drv;
+
+ /* Skip binary files */
+ drv = userdiff_find_by_path(opts->repo->index, p->one->path);
+ if (drv && drv->binary > 0)
+ continue;
+
+ if (DIFF_FILE_VALID(p->one))
+ oidset_insert(blobs, &p->one->oid);
+ if (DIFF_FILE_VALID(p->two))
+ oidset_insert(blobs, &p->two->oid);
+ }
+ diff_queue_clear(q);
+}
+
+static int always_match(const void *cmp_data UNUSED,
+ const struct hashmap_entry *entry1 UNUSED,
+ const struct hashmap_entry *entry2 UNUSED,
+ const void *keydata UNUSED)
+{
+ return 0;
+}
+
+/*
+ * Prefetch blobs for git cherry in partial clones.
+ *
+ * Called between the revision walk (which builds the head-side
+ * commit list) and the has_commit_patch_id() comparison loop.
+ *
+ * Uses a cmpfn-swap trick to avoid reading blobs: temporarily
+ * replaces the hashmap's comparison function with a trivial
+ * always-match function, so hashmap_get()/hashmap_get_next() match
+ * any entry with the same oidhash bucket. These are the set of oids
+ * that would trigger patch_id_neq() during normal lookup and cause
+ * blobs to be read on demand, and we want to prefetch them all at
+ * once instead.
+ */
+static void prefetch_cherry_blobs(struct repository *repo,
+ struct commit_list *list,
+ struct patch_ids *ids)
+{
+ struct oidset blobs = OIDSET_INIT;
+ hashmap_cmp_fn original_cmpfn;
+
+ /* Exit if we're not in a partial clone */
+ if (!repo_has_promisor_remote(repo))
+ return;
+
+ /* Save original cmpfn, replace with always_match */
+ original_cmpfn = ids->patches.cmpfn;
+ ids->patches.cmpfn = always_match;
+
+ /* Find header-only collisions, gather blobs from those commits */
+ for (struct commit_list *l = list; l; l = l->next) {
+ struct commit *c = l->item;
+ bool match_found = false;
+ for (struct patch_id *cur = patch_id_iter_first(c, ids);
+ cur;
+ cur = patch_id_iter_next(cur, ids)) {
+ match_found = true;
+ collect_diff_blob_oids(cur->commit, &ids->diffopts,
+ &blobs);
+ }
+ if (match_found)
+ collect_diff_blob_oids(c, &ids->diffopts, &blobs);
+ }
+
+ /* Restore original cmpfn */
+ ids->patches.cmpfn = original_cmpfn;
+
+ /* If we have any blobs to fetch, fetch them */
+ if (oidset_size(&blobs)) {
+ struct oid_array to_fetch = OID_ARRAY_INIT;
+ struct oidset_iter iter;
+ const struct object_id *oid;
+
+ oidset_iter_init(&blobs, &iter);
+ while ((oid = oidset_iter_next(&iter)))
+ oid_array_append(&to_fetch, oid);
+
+ promisor_remote_get_direct(repo, to_fetch.oid, to_fetch.nr);
+
+ oid_array_clear(&to_fetch);
+ }
+
+ oidset_clear(&blobs);
+}
+
int cmd_cherry(int argc,
const char **argv,
const char *prefix,
@@ -2673,6 +2796,8 @@ int cmd_cherry(int argc,
commit_list_insert(commit, &list);
}
+ prefetch_cherry_blobs(the_repository, list, &ids);
+
for (struct commit_list *l = list; l; l = l->next) {
char sign = '+';
diff --git a/t/t3500-cherry.sh b/t/t3500-cherry.sh
index 78c3eac54b..17507d9a28 100755
--- a/t/t3500-cherry.sh
+++ b/t/t3500-cherry.sh
@@ -78,4 +78,22 @@ test_expect_success 'cherry ignores whitespace' '
test_cmp expect actual
'
+# Reuse the expect file from the previous test, in a partial clone
+test_expect_success 'cherry in partial clone does bulk prefetch' '
+ test_config uploadpack.allowfilter 1 &&
+ test_config uploadpack.allowanysha1inwant 1 &&
+ test_when_finished "rm -rf copy" &&
+
+ git clone --bare --filter=blob:none file://"$(pwd)" copy &&
+ (
+ cd copy &&
+ GIT_TRACE2_EVENT="$(pwd)/trace.output" git cherry upstream-with-space feature-without-space >actual &&
+ test_cmp ../expect actual &&
+
+ grep "child_start.*fetch.negotiationAlgorithm" trace.output >fetches &&
+ test_line_count = 1 fetches &&
+ test_trace2_data promisor fetch_count 4 <trace.output
+ )
+'
+
test_done
--
gitgitgadget
^ permalink raw reply related [flat|nested] 25+ messages in thread
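The cmpfn-swap trick used by prefetch_cherry_blobs() above can be illustrated with a standalone miniature. This is a toy sketch, not git's actual hashmap API: a bucket is modeled as a plain linked chain, and swapping in an always-match comparison makes a lookup report every entry in the same bucket, just as the patch uses always_match() to enumerate all oidhash collisions without reading blob contents.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy miniature of git's hashmap cmpfn mechanism: entries in the same
 * bucket are chained together, and a lookup walks the chain calling
 * cmpfn to decide which entries match the key.
 */
typedef int (*cmp_fn)(const void *a, const void *b);

struct entry {
	unsigned int hash;	/* bucket selector (same for a whole chain) */
	const char *payload;
	struct entry *next;	/* same-bucket chain */
};

/* Normal comparison: only an exact payload match counts. */
static int strict_cmp(const void *a, const void *b)
{
	return strcmp(((const struct entry *)a)->payload,
		      ((const struct entry *)b)->payload);
}

/* Always report a match, like always_match() in the patch. */
static int always_match_cmp(const void *a, const void *b)
{
	(void)a;
	(void)b;
	return 0;
}

/* Count chain entries that cmpfn reports as equal to key. */
static int count_matches(struct entry *chain, struct entry *key, cmp_fn cmpfn)
{
	int n = 0;
	for (struct entry *e = chain; e; e = e->next)
		if (!cmpfn(e, key))
			n++;
	return n;
}
```

With strict comparison only the true match is found; with the swapped-in comparison every same-bucket entry is reported, which is exactly the set whose blobs would otherwise be fetched one at a time by patch_id_neq().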
* [PATCH v2 3/3] grep: prefetch necessary blobs
2026-04-18 0:32 ` [PATCH v2 0/3] Batch prefetching Elijah Newren via GitGitGadget
2026-04-18 0:32 ` [PATCH v2 1/3] patch-ids.h: add missing trailing parenthesis in documentation comment Elijah Newren via GitGitGadget
2026-04-18 0:32 ` [PATCH v2 2/3] builtin/log: prefetch necessary blobs for `git cherry` Elijah Newren via GitGitGadget
@ 2026-04-18 0:32 ` Elijah Newren via GitGitGadget
2026-04-27 12:59 ` Derrick Stolee
2026-05-14 16:25 ` [PATCH v3 0/4] Batch prefetching Elijah Newren via GitGitGadget
3 siblings, 1 reply; 25+ messages in thread
From: Elijah Newren via GitGitGadget @ 2026-04-18 0:32 UTC (permalink / raw)
To: git; +Cc: Elijah Newren
From: Elijah Newren <newren@gmail.com>
In partial clones, `git grep` fetches necessary blobs on-demand one
at a time, which can be very slow. Avoid this by adding an extra
preliminary walk over the tree, similar to grep_tree(), which collects
the blobs of interest and then prefetches them.
Signed-off-by: Elijah Newren <newren@gmail.com>
---
builtin/grep.c | 142 ++++++++++++++++++++++++++++++++++++++++++++++++
t/t7810-grep.sh | 35 ++++++++++++
2 files changed, 177 insertions(+)
diff --git a/builtin/grep.c b/builtin/grep.c
index e33285e5e6..d559c48d94 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -28,9 +28,12 @@
#include "object-file.h"
#include "object-name.h"
#include "odb.h"
+#include "oid-array.h"
+#include "oidset.h"
#include "packfile.h"
#include "pager.h"
#include "path.h"
+#include "promisor-remote.h"
#include "read-cache-ll.h"
#include "write-or-die.h"
@@ -692,6 +695,143 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
return hit;
}
+static void collect_blob_oids_for_tree(struct repository *repo,
+ const struct pathspec *pathspec,
+ struct tree_desc *tree,
+ struct strbuf *base,
+ int tn_len,
+ struct oidset *blob_oids)
+{
+ struct name_entry entry;
+ int old_baselen = base->len;
+ struct strbuf name = STRBUF_INIT;
+ enum interesting match = entry_not_interesting;
+
+ while (tree_entry(tree, &entry)) {
+ if (match != all_entries_interesting) {
+ strbuf_addstr(&name, base->buf + tn_len);
+ match = tree_entry_interesting(repo->index,
+ &entry, &name,
+ pathspec);
+ strbuf_reset(&name);
+
+ if (match == all_entries_not_interesting)
+ break;
+ if (match == entry_not_interesting)
+ continue;
+ }
+
+ strbuf_add(base, entry.path, tree_entry_len(&entry));
+
+ if (S_ISREG(entry.mode)) {
+ oidset_insert(blob_oids, &entry.oid);
+ } else if (S_ISDIR(entry.mode)) {
+ enum object_type type;
+ struct tree_desc sub_tree;
+ void *data;
+ unsigned long size;
+
+ data = odb_read_object(repo->objects, &entry.oid,
+ &type, &size);
+ if (!data)
+ die(_("unable to read tree (%s)"),
+ oid_to_hex(&entry.oid));
+
+ strbuf_addch(base, '/');
+ init_tree_desc(&sub_tree, &entry.oid, data, size);
+ collect_blob_oids_for_tree(repo, pathspec, &sub_tree,
+ base, tn_len, blob_oids);
+ free(data);
+ }
+ /*
+ * ...no else clause for S_ISGITLINK: submodules have their
+ * own promisor configuration and would need separate fetches
+ * anyway.
+ */
+
+ strbuf_setlen(base, old_baselen);
+ }
+
+ strbuf_release(&name);
+}
+
+static void collect_blob_oids_for_treeish(struct grep_opt *opt,
+ const struct pathspec *pathspec,
+ const struct object_id *tree_ish_oid,
+ const char *name,
+ struct oidset *blob_oids)
+{
+ struct tree_desc tree;
+ void *data;
+ unsigned long size;
+ struct strbuf base = STRBUF_INIT;
+ int len;
+
+ data = odb_read_object_peeled(opt->repo->objects, tree_ish_oid,
+ OBJ_TREE, &size, NULL);
+
+ if (!data)
+ return;
+
+ len = name ? strlen(name) : 0;
+ if (len) {
+ strbuf_add(&base, name, len);
+ strbuf_addch(&base, ':');
+ }
+ init_tree_desc(&tree, tree_ish_oid, data, size);
+
+ collect_blob_oids_for_tree(opt->repo, pathspec, &tree,
+ &base, base.len, blob_oids);
+
+ strbuf_release(&base);
+ free(data);
+}
+
+static void prefetch_grep_blobs(struct grep_opt *opt,
+ const struct pathspec *pathspec,
+ const struct object_array *list)
+{
+ struct oidset blob_oids = OIDSET_INIT;
+
+ /* Exit if we're not in a partial clone */
+ if (!repo_has_promisor_remote(opt->repo))
+ return;
+
+ /* For each tree, gather the blobs in it */
+ for (int i = 0; i < list->nr; i++) {
+ struct object *real_obj;
+
+ obj_read_lock();
+ real_obj = deref_tag(opt->repo, list->objects[i].item,
+ NULL, 0);
+ obj_read_unlock();
+
+ if (real_obj &&
+ (real_obj->type == OBJ_COMMIT ||
+ real_obj->type == OBJ_TREE))
+ collect_blob_oids_for_treeish(opt, pathspec,
+ &real_obj->oid,
+ list->objects[i].name,
+ &blob_oids);
+ }
+
+ /* Prefetch the blobs we found */
+ if (oidset_size(&blob_oids)) {
+ struct oid_array to_fetch = OID_ARRAY_INIT;
+ struct oidset_iter iter;
+ const struct object_id *oid;
+
+ oidset_iter_init(&blob_oids, &iter);
+ while ((oid = oidset_iter_next(&iter)))
+ oid_array_append(&to_fetch, oid);
+
+ promisor_remote_get_direct(opt->repo, to_fetch.oid, to_fetch.nr);
+
+ oid_array_clear(&to_fetch);
+ }
+ oidset_clear(&blob_oids);
+}
+
static int grep_object(struct grep_opt *opt, const struct pathspec *pathspec,
struct object *obj, const char *name, const char *path)
{
@@ -732,6 +872,8 @@ static int grep_objects(struct grep_opt *opt, const struct pathspec *pathspec,
int hit = 0;
const unsigned int nr = list->nr;
+ prefetch_grep_blobs(opt, pathspec, list);
+
for (i = 0; i < nr; i++) {
struct object *real_obj;
diff --git a/t/t7810-grep.sh b/t/t7810-grep.sh
index 64ac4f04ee..1f484502fe 100755
--- a/t/t7810-grep.sh
+++ b/t/t7810-grep.sh
@@ -1929,4 +1929,39 @@ test_expect_success 'grep does not report i-t-a and assume unchanged with -L' '
test_cmp expected actual
'
+test_expect_success 'grep of revision in partial clone does bulk prefetch' '
+ test_when_finished "rm -rf grep-partial-src grep-partial" &&
+
+ git init grep-partial-src &&
+ (
+ cd grep-partial-src &&
+ git config uploadpack.allowfilter 1 &&
+ git config uploadpack.allowanysha1inwant 1 &&
+ echo "needle in haystack" >searchme &&
+ echo "no match here" >other &&
+ mkdir subdir &&
+ echo "needle again" >subdir/deep &&
+ git add . &&
+ git commit -m "initial"
+ ) &&
+
+ git clone --no-checkout --filter=blob:none \
+ "file://$(pwd)/grep-partial-src" grep-partial &&
+
+ # All blobs should be missing after a blobless clone.
+ git -C grep-partial rev-list --quiet --objects \
+ --missing=print HEAD >missing &&
+ test_line_count = 3 missing &&
+
+ # grep HEAD should batch-prefetch all blobs in one request.
+ GIT_TRACE2_EVENT="$(pwd)/grep-trace" \
+ git -C grep-partial grep -c "needle" HEAD >result &&
+
+ # Should find matches in two files.
+ test_line_count = 2 result &&
+
+ # Should have prefetched all 3 objects at once
+ test_trace2_data promisor fetch_count 3 <grep-trace
+'
+
test_done
--
gitgitgadget
^ permalink raw reply related [flat|nested] 25+ messages in thread
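The pruning behavior of the collect_blob_oids_for_tree() walk above follows the tree_entry_interesting() state machine: the match state is cached so that once a pathspec matches a whole subtree, per-entry checks stop, and once nothing later can match, the walk ends early. A toy model of that control flow (illustrative enum names and values, not git's actual API):

```c
#include <assert.h>

/* Illustrative stand-ins for git's "enum interesting" values. */
enum toy_interesting {
	TOY_ALL_NOT_INTERESTING = -1,	/* nothing later can match: stop */
	TOY_NOT_INTERESTING = 0,	/* skip this entry only */
	TOY_INTERESTING = 1,		/* process this entry */
	TOY_ALL_INTERESTING = 2		/* everything from here on matches */
};

static int checks_done;	/* how many per-entry pathspec checks ran */

/*
 * Walk entries[], caching the match state like the collector does,
 * and return how many entries were actually visited/collected.
 */
static int walk(const enum toy_interesting *entries, int n)
{
	enum toy_interesting match = TOY_NOT_INTERESTING;
	int visited = 0;

	for (int i = 0; i < n; i++) {
		if (match != TOY_ALL_INTERESTING) {
			checks_done++;
			match = entries[i];
			if (match == TOY_ALL_NOT_INTERESTING)
				break;		/* prune the rest */
			if (match == TOY_NOT_INTERESTING)
				continue;	/* skip this entry */
		}
		visited++;	/* would collect the blob / recurse here */
	}
	return visited;
}
```

Once TOY_ALL_INTERESTING is seen, later entries are visited without any further pathspec checks, which is why the prefetch walk stays cheap even for broad pathspecs.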
* Re: [PATCH v2 2/3] builtin/log: prefetch necessary blobs for `git cherry`
2026-04-18 0:32 ` [PATCH v2 2/3] builtin/log: prefetch necessary blobs for `git cherry` Elijah Newren via GitGitGadget
@ 2026-04-19 14:04 ` Phillip Wood
2026-04-21 21:28 ` Elijah Newren
2026-04-27 13:16 ` Derrick Stolee
1 sibling, 1 reply; 25+ messages in thread
From: Phillip Wood @ 2026-04-19 14:04 UTC (permalink / raw)
To: Elijah Newren via GitGitGadget, git; +Cc: Elijah Newren
Hi Elijah
On 18/04/2026 01:32, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>
>
> In partial clones, `git cherry` fetches necessary blobs on-demand one
> at a time, which can be very slow. We would like to prefetch all
> necessary blobs upfront. To do so, we need to be able to first figure
> out which blobs are needed.
"git rebase" without "--reapply-cherry-picks" suffers from this problem
as well, since it does the equivalent of "git log --cherry-pick". Is there
any way to share prefetch_cherry_blobs() with the cherry-pick detection
in revision.c?
Thanks
Phillip
> `git cherry` does its work in a two-phase approach: first computing
> header-only IDs (based on file paths and modes), then falling back to
> full content-based IDs only when header-only IDs collide -- or, more
> accurately, whenever the oidhash() of the header-only object_ids
> collide.
>
> patch-ids.c handles this by creating an ids->patches hashmap that has
> all the data we need, but the problem is that any attempt to query the
> hashmap will invoke the patch_id_neq() function on any colliding objects,
> which causes the on-demand fetching.
>
> Insert a new prefetch_cherry_blobs() function before checking for
> collisions. Use a temporary replacement on the ids->patches.cmpfn
> in order to enumerate the blobs that would be needed without yet
> fetching them, and then fetch them all at once, then restore the old
> ids->patches.cmpfn.
>
> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
> builtin/log.c | 125 ++++++++++++++++++++++++++++++++++++++++++++++
> t/t3500-cherry.sh | 18 +++++++
> 2 files changed, 143 insertions(+)
>
> diff --git a/builtin/log.c b/builtin/log.c
> index 8c0939dd42..df19876be6 100644
> --- a/builtin/log.c
> +++ b/builtin/log.c
> @@ -21,10 +21,12 @@
> #include "color.h"
> #include "commit.h"
> #include "diff.h"
> +#include "diffcore.h"
> #include "diff-merges.h"
> #include "revision.h"
> #include "log-tree.h"
> #include "oid-array.h"
> +#include "oidset.h"
> #include "tag.h"
> #include "reflog-walk.h"
> #include "patch-ids.h"
> @@ -43,9 +45,11 @@
> #include "utf8.h"
>
> #include "commit-reach.h"
> +#include "promisor-remote.h"
> #include "range-diff.h"
> #include "tmp-objdir.h"
> #include "tree.h"
> +#include "userdiff.h"
> #include "write-or-die.h"
>
> #define MAIL_DEFAULT_WRAP 72
> @@ -2602,6 +2606,125 @@ static void print_commit(char sign, struct commit *commit, int verbose,
> }
> }
>
> +/*
> + * Enumerate blob OIDs from a single commit's diff, inserting them into blobs.
> + * Skips files whose userdiff driver explicitly declares binary status
> + * (drv->binary > 0), since patch-ID uses oid_to_hex() for those and
> + * never reads blob content. Use userdiff_find_by_path() since
> + * diff_filespec_load_driver() is static in diff.c.
> + *
> + * Clean up with diff_queue_clear() (from diffcore.h).
> + */
> +static void collect_diff_blob_oids(struct commit *commit,
> + struct diff_options *opts,
> + struct oidset *blobs)
> +{
> + struct diff_queue_struct *q;
> +
> + /*
> + * Merge commits are filtered out by patch_id_defined() in patch-ids.c,
> + * so we'll never be called with one.
> + */
> + assert(!commit->parents || !commit->parents->next);
> +
> + if (commit->parents)
> + diff_tree_oid(&commit->parents->item->object.oid,
> + &commit->object.oid, "", opts);
> + else
> + diff_root_tree_oid(&commit->object.oid, "", opts);
> + diffcore_std(opts);
> +
> + q = &diff_queued_diff;
> + for (int i = 0; i < q->nr; i++) {
> + struct diff_filepair *p = q->queue[i];
> + struct userdiff_driver *drv;
> +
> + /* Skip binary files */
> + drv = userdiff_find_by_path(opts->repo->index, p->one->path);
> + if (drv && drv->binary > 0)
> + continue;
> +
> + if (DIFF_FILE_VALID(p->one))
> + oidset_insert(blobs, &p->one->oid);
> + if (DIFF_FILE_VALID(p->two))
> + oidset_insert(blobs, &p->two->oid);
> + }
> + diff_queue_clear(q);
> +}
> +
> +static int always_match(const void *cmp_data UNUSED,
> + const struct hashmap_entry *entry1 UNUSED,
> + const struct hashmap_entry *entry2 UNUSED,
> + const void *keydata UNUSED)
> +{
> + return 0;
> +}
> +
> +/*
> + * Prefetch blobs for git cherry in partial clones.
> + *
> + * Called between the revision walk (which builds the head-side
> + * commit list) and the has_commit_patch_id() comparison loop.
> + *
> + * Uses a cmpfn-swap trick to avoid reading blobs: temporarily
> + * replaces the hashmap's comparison function with a trivial
> + * always-match function, so hashmap_get()/hashmap_get_next() match
> + * any entry with the same oidhash bucket. These are the set of oids
> + * that would trigger patch_id_neq() during normal lookup and cause
> + * blobs to be read on demand, and we want to prefetch them all at
> + * once instead.
> + */
> +static void prefetch_cherry_blobs(struct repository *repo,
> + struct commit_list *list,
> + struct patch_ids *ids)
> +{
> + struct oidset blobs = OIDSET_INIT;
> + hashmap_cmp_fn original_cmpfn;
> +
> + /* Exit if we're not in a partial clone */
> + if (!repo_has_promisor_remote(repo))
> + return;
> +
> + /* Save original cmpfn, replace with always_match */
> + original_cmpfn = ids->patches.cmpfn;
> + ids->patches.cmpfn = always_match;
> +
> + /* Find header-only collisions, gather blobs from those commits */
> + for (struct commit_list *l = list; l; l = l->next) {
> + struct commit *c = l->item;
> + bool match_found = false;
> + for (struct patch_id *cur = patch_id_iter_first(c, ids);
> + cur;
> + cur = patch_id_iter_next(cur, ids)) {
> + match_found = true;
> + collect_diff_blob_oids(cur->commit, &ids->diffopts,
> + &blobs);
> + }
> + if (match_found)
> + collect_diff_blob_oids(c, &ids->diffopts, &blobs);
> + }
> +
> + /* Restore original cmpfn */
> + ids->patches.cmpfn = original_cmpfn;
> +
> + /* If we have any blobs to fetch, fetch them */
> + if (oidset_size(&blobs)) {
> + struct oid_array to_fetch = OID_ARRAY_INIT;
> + struct oidset_iter iter;
> + const struct object_id *oid;
> +
> + oidset_iter_init(&blobs, &iter);
> + while ((oid = oidset_iter_next(&iter)))
> + oid_array_append(&to_fetch, oid);
> +
> + promisor_remote_get_direct(repo, to_fetch.oid, to_fetch.nr);
> +
> + oid_array_clear(&to_fetch);
> + }
> +
> + oidset_clear(&blobs);
> +}
> +
> int cmd_cherry(int argc,
> const char **argv,
> const char *prefix,
> @@ -2673,6 +2796,8 @@ int cmd_cherry(int argc,
> commit_list_insert(commit, &list);
> }
>
> + prefetch_cherry_blobs(the_repository, list, &ids);
> +
> for (struct commit_list *l = list; l; l = l->next) {
> char sign = '+';
>
> diff --git a/t/t3500-cherry.sh b/t/t3500-cherry.sh
> index 78c3eac54b..17507d9a28 100755
> --- a/t/t3500-cherry.sh
> +++ b/t/t3500-cherry.sh
> @@ -78,4 +78,22 @@ test_expect_success 'cherry ignores whitespace' '
> test_cmp expect actual
> '
>
> +# Reuse the expect file from the previous test, in a partial clone
> +test_expect_success 'cherry in partial clone does bulk prefetch' '
> + test_config uploadpack.allowfilter 1 &&
> + test_config uploadpack.allowanysha1inwant 1 &&
> + test_when_finished "rm -rf copy" &&
> +
> + git clone --bare --filter=blob:none file://"$(pwd)" copy &&
> + (
> + cd copy &&
> + GIT_TRACE2_EVENT="$(pwd)/trace.output" git cherry upstream-with-space feature-without-space >actual &&
> + test_cmp ../expect actual &&
> +
> + grep "child_start.*fetch.negotiationAlgorithm" trace.output >fetches &&
> + test_line_count = 1 fetches &&
> + test_trace2_data promisor fetch_count 4 <trace.output
> + )
> +'
> +
> test_done
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 2/3] builtin/log: prefetch necessary blobs for `git cherry`
2026-04-19 14:04 ` Phillip Wood
@ 2026-04-21 21:28 ` Elijah Newren
2026-04-23 15:15 ` Phillip Wood
0 siblings, 1 reply; 25+ messages in thread
From: Elijah Newren @ 2026-04-21 21:28 UTC (permalink / raw)
To: phillip.wood; +Cc: Elijah Newren via GitGitGadget, git
Hi Phillip,
On Sun, Apr 19, 2026 at 7:04 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>
> Hi Elijah
>
> On 18/04/2026 01:32, Elijah Newren via GitGitGadget wrote:
> > From: Elijah Newren <newren@gmail.com>
> >
> > In partial clones, `git cherry` fetches necessary blobs on-demand one
> > at a time, which can be very slow. We would like to prefetch all
> > necessary blobs upfront. To do so, we need to be able to first figure
> > out which blobs are needed.
>
> "git rebase" without "--reapply-cherry-picks" suffers from this problem
> as well as it does the equivalent of "git log --cherry-pick". Is there
> any way to share prefetch_cherry_blobs() with the cherry-pick detection
> in revision.c?
Yes, you're right; git rebase without --reapply-cherry-picks and git
log --cherry-pick both go through cherry_pick_list() in revision.c,
which has exactly the same shape as the patch-ids loop in
cmd_cherry(): build a hashmap of one side via add_commit_patch_id(),
then look up the other side via patch_id_iter_first(). The on-demand
blob fetches come from the same patch_id_neq() callback.
After poking around, I think the approximate scope of the fix would
be: Move collect_diff_blob_oids(), always_match(), and
prefetch_cherry_blobs() from builtin/log.c to patch-ids.c and expose
the last one in patch-ids.h. In cherry_pick_list(), between the
add_commit_patch_id loop and the comparison loop, build a temporary
list of just the lookup-side commits (filtering by
SYMMETRIC_LEFT/BOUNDARY as the existing loop already does) and call
prefetch_cherry_blobs() on it.
That said, I'd rather leave this out of the current series. The bigger
picture is that I have reservations about expanding partial-clone
support further into this area. git cherry, git log --cherry-pick, and
the default cherry-pick detection in git rebase all exist to answer
"has this patch already landed upstream?" -- a question that, in
repositories large enough to need partial clones, I feel is rarely
worth the cost of computing patch-ids across arbitrary amounts of
history. The honest guidance I would probably give for users on a
large repo is "pass --reapply-cherry-picks (with rebase) and skip this
entirely" or to narrow the range under consideration. The omission of
a --no-reapply-cherry-picks option in git-replay wasn't a lack of
effort or oversight, but a deliberate choice where I'd rather hold off
(possibly indefinitely) on implementing it. So I'm a bit reluctant to
make the performance hazard less visible without also asking whether
we should even be doing that piece of the operation.
I only implemented the git cherry fix because of a specific customer
situation where the operation was already baked into tooling, and
prefetching at least makes the worst case tolerable. I don't want to
hold myself to doing the same for the cherry_pick_list() path, but I'm
fairly confident the code here can be re-used for those other cases
and I'd help review a patch from anyone who wants to carry it forward.
Anyway, you are making the right connection, it's just that my
personal answer is to let some other interested individual do it.
^ permalink raw reply [flat|nested] 25+ messages in thread
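The refactor sketched in the reply above -- build a temporary list of just the lookup-side commits before batch-prefetching -- can be modeled with a toy filter. Types and the flag name here are illustrative stand-ins (TOY_SYMMETRIC_LEFT plays the role of SYMMETRIC_LEFT), not git's actual structures:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_SYMMETRIC_LEFT 0x1	/* stand-in for the lookup-side flag */

struct toy_commit {
	unsigned flags;
	struct toy_commit *next;
};

/*
 * Return a newly allocated chain holding copies of the commits whose
 * flags include TOY_SYMMETRIC_LEFT, preserving order, so a single
 * prefetch pass can cover exactly the lookup side.
 */
static struct toy_commit *filter_lookup_side(const struct toy_commit *list)
{
	struct toy_commit *head = NULL, **tail = &head;

	for (const struct toy_commit *c = list; c; c = c->next) {
		if (!(c->flags & TOY_SYMMETRIC_LEFT))
			continue;
		*tail = calloc(1, sizeof(**tail));
		(*tail)->flags = c->flags;
		tail = &(*tail)->next;
	}
	return head;
}

static int toy_count(const struct toy_commit *list)
{
	int n = 0;
	for (; list; list = list->next)
		n++;
	return n;
}
```

In cherry_pick_list() the real version would apply the same SYMMETRIC_LEFT/BOUNDARY filtering the existing comparison loop already does, then hand the sublist to prefetch_cherry_blobs().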
* Re: [PATCH v2 2/3] builtin/log: prefetch necessary blobs for `git cherry`
2026-04-21 21:28 ` Elijah Newren
@ 2026-04-23 15:15 ` Phillip Wood
2026-04-23 17:38 ` Elijah Newren
0 siblings, 1 reply; 25+ messages in thread
From: Phillip Wood @ 2026-04-23 15:15 UTC (permalink / raw)
To: Elijah Newren, phillip.wood; +Cc: Elijah Newren via GitGitGadget, git
Hi Elijah
On 21/04/2026 22:28, Elijah Newren wrote:
> On Sun, Apr 19, 2026 at 7:04 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>> On 18/04/2026 01:32, Elijah Newren via GitGitGadget wrote:
>>> From: Elijah Newren <newren@gmail.com>
>>>
>>> In partial clones, `git cherry` fetches necessary blobs on-demand one
>>> at a time, which can be very slow. We would like to prefetch all
>>> necessary blobs upfront. To do so, we need to be able to first figure
>>> out which blobs are needed.
>>
>> "git rebase" without "--reapply-cherry-picks" suffers from this problem
>> as well as it does the equivalent of "git log --cherry-pick". Is there
>> any way to share prefetch_cherry_blobs() with the cherry-pick detection
>> in revision.c?
>
> Yes, you're right; git rebase without --reapply-cherry-picks and git
> log --cherry-pick both go through cherry_pick_list() in revision.c,
> which has exactly the same shape as the patch-ids loop in
> cmd_cherry(): build a hashmap of one side via add_commit_patch_id(),
> then look up the other side via patch_id_iter_first(). The on-demand
> blob fetches come from the same patch_id_neq() callback.
>
> After poking around, I think the approximate scope of the fix would
> be: Move collect_diff_blob_oids(), always_match(), and
> prefetch_cherry_blobs() from builtin/log.c to patch-ids.c and expose
> the last one in patch-ids.h. In cherry_pick_list(), between the
> add_commit_patch_id loop and the comparison loop, build a temporary
> list of just the lookup-side commits (filtering by
> SYMMETRIC_LEFT/BOUNDARY as the existing loop already does) and call
> prefetch_cherry_blobs() on it.
Thanks for taking a look
> That said, I'd rather leave this out of the current series. The bigger
> picture is that I have reservations about expanding partial-clone
> support further into this area. git cherry, git log --cherry-pick, and
> the default cherry-pick detection in git rebase all exist to answer
> "has this patch already landed upstream?" -- a question that, in
> repositories large enough to need partial clones, I feel is rarely
> worth the cost of computing patch-ids across arbitrary amounts of
> history. The honest guidance I would probably give for users on a
> large repo is "pass --reapply-cherry-picks (with rebase) and skip this
> entirely" or to narrow the range under consideration.
"--reapply-cherry-picks --empty=drop" is certainly more efficient. When
we're computing patch ids, do we do it for every upstream commit or just
the ones that modify the set of paths that are modified in the branch
we're rebasing?
It is a shame that we don't have a config setting for
"--reapply-cherry-picks", as it is easy to forget to pass that option.
Unfortunately it is not supported by the apply backend, which makes such
a setting potentially confusing.
> The omission of
> a --no-reapply-cherry-picks option in git-replay wasn't a lack of
> effort or oversight, but a deliberate choice where I'd rather hold off
> (possibly indefinitely) on implementing it. So I'm a bit reluctant to
> make the performance hazard less visible without also asking whether
> we should even be doing that piece of the operation.
>
> I only implemented the git cherry fix because of a specific customer
> situation where the operation was already baked into tooling, and
> prefetching at least makes the worst case tolerable.
I'm a bit surprised customers aren't complaining about tools that use
"git rebase" being slow.
> I don't want to
> hold myself to doing the same for the cherry_pick_list() path, but I'm
> fairly confident the code here can be re-used for those other cases
> and I'd help review a patch from anyone who wants to carry it forward.
>
> Anyway, you are making the right connection, it's just that my
> personal answer is to let some other interested individual do it.
Fair enough
Thanks
Phillip
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 2/3] builtin/log: prefetch necessary blobs for `git cherry`
2026-04-23 15:15 ` Phillip Wood
@ 2026-04-23 17:38 ` Elijah Newren
0 siblings, 0 replies; 25+ messages in thread
From: Elijah Newren @ 2026-04-23 17:38 UTC (permalink / raw)
To: phillip.wood; +Cc: Elijah Newren via GitGitGadget, git
Hi Phillip,
On Thu, Apr 23, 2026 at 8:15 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
> On 21/04/2026 22:28, Elijah Newren wrote:
> > On Sun, Apr 19, 2026 at 7:04 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
> >> On 18/04/2026 01:32, Elijah Newren via GitGitGadget wrote:
> "--reapply-cherry-picks --empty=drop" is certainly more efficient. When
> we're computing patch ids do we do it for every upstream commit or just
> the ones that modify the set of paths that are modified in the branch
> we're rebasing?
You are correct that the patch id computations won't look at file
contents of commits unless they modify the same set of files as one of
the commits in our topic branch, but in order to determine the set of
commits which modify the same paths as commits in the branch we're
rebasing, we have to walk the upstream commits and do a tree-diff for
every one of them. Yes, commits and trees tend to be much smaller
than blobs, but the number of trees/commits we have to look at may be
far larger than the number of blobs. The biggest repositories are
constantly pushing so many commits that they are at a size where even
a merge-base operation can start to feel expensive.
> It is a shame that we don't have a config setting for
> "-reapply-cherry-picks" as it is easy to forget to pass that option.
> Unfortunately it is not supported by the apply backend which makes such
> a setting potentially confusing.
Indeed.
> > The omission of
> > a --no-reapply-cherry-picks option in git-replay wasn't a lack of
> > effort or oversight, but a deliberate choice where I'd rather hold off
> > (possibly indefinitely) on implementing it. So I'm a bit reluctant to
> > make the performance hazard less visible without also asking whether
> > we should even be doing that piece of the operation.
> >
> > I only implemented the git cherry fix because of a specific customer
> > situation where the operation was already baked into tooling, and
> > prefetching at least makes the worst case tolerable.
>
> I'm a bit surprised customers aren't complaining about tools that use
> "git rebase" being slow.
Are you sure they aren't complaining?
The merging parts of a rebase operation do have batch prefetching
already (up to 3 batches per commit; done that way to minimize the
number of objects downloaded because sometimes 2 or more of those
batches can be skipped entirely and trying to combine them into a
single batch would only be doable by downloading far more than
needed). But, as you're alluding to, the --no-reapply-cherry-picks
part does not.
I'll note that GitHub tends to focus far more on the server side; it's
just that in this particular case with a special customer, they had me
dig a little closer to their client side operations. In their case,
they were using git-replay rather than git-rebase, so they'd have no
reason to complain about rebase. git-replay shares the same batch
prefetching for merge operations that rebase has, and doesn't have a
--no-reapply-cherry-picks behavior that can even be selected.
Honestly, I think the main reason this customer was also using
git-cherry was that I didn't get the drop-commits-that-become-empty
logic into the early versions of git-replay. You added that to
git-replay (thanks again!), but after they had already built their
tooling. This is only a guess on my part; they may have other reasons
for actively wanting git-cherry, but I think it might be worthwhile
for me to ask them if they can upgrade git versions (to get your fixes
for empty commits in replay) and then drop the calls to git-cherry.
However, I didn't want it to sound like I was pushing them to change
their workflows at my convenience, and hence this patch so that things
can be fast even if they keep the git-cherry in there.
> > I don't want to
> > hold myself to doing the same for the cherry_pick_list() path, but I'm
> > fairly confident the code here can be re-used for those other cases
> > and I'd help review a patch from anyone who wants to carry it forward.
> >
> > Anyway, you are making the right connection, it's just that my
> > personal answer is to let some other interested individual do it.
>
> Fair enough
Thanks for taking a look and asking interesting questions.
Elijah
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 3/3] grep: prefetch necessary blobs
2026-04-18 0:32 ` [PATCH v2 3/3] grep: prefetch necessary blobs Elijah Newren via GitGitGadget
@ 2026-04-27 12:59 ` Derrick Stolee
2026-05-13 19:21 ` Elijah Newren
0 siblings, 1 reply; 25+ messages in thread
From: Derrick Stolee @ 2026-04-27 12:59 UTC (permalink / raw)
To: Elijah Newren via GitGitGadget, git; +Cc: Elijah Newren
On 4/17/2026 8:32 PM, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>
>
> In partial clones, `git grep` fetches necessary blobs on-demand one
> at a time, which can be very slow. In partial clones, add an extra
> preliminary walk over the tree similar to grep_tree() which collects
> the blobs of interest, and then prefetches them.
A lot of the code is about walking trees to find blobs matching
the input pathspec, with this being the core method:
> +static void collect_blob_oids_for_tree(struct repository *repo,
> + const struct pathspec *pathspec,
> + struct tree_desc *tree,
> + struct strbuf *base,
> + int tn_len,
> + struct oidset *blob_oids)
And in your test, you set up a repo to have three blobs with
matches in two of the files:
> +test_expect_success 'grep of revision in partial clone does bulk prefetch' '
> + test_when_finished "rm -rf grep-partial-src grep-partial" &&
> +
> + git init grep-partial-src &&
> + (
> + cd grep-partial-src &&
> + git config uploadpack.allowfilter 1 &&
> + git config uploadpack.allowanysha1inwant 1 &&
> + echo "needle in haystack" >searchme &&
> + echo "no match here" >other &&
> + mkdir subdir &&
> + echo "needle again" >subdir/deep &&
> + git add . &&
> + git commit -m "initial"
> + ) &&
But then the command downloads all of the blobs, not using a
pathspec:
> + # grep HEAD should batch-prefetch all blobs in one request.
> + GIT_TRACE2_EVENT="$(pwd)/grep-trace" \
> + git -C grep-partial grep -c "needle" HEAD >result &&
> +
> + # Should find matches in two files.
> + test_line_count = 2 result &&
> +
> + # Should have prefetched all 3 objects at once
> + test_trace2_data promisor fetch_count 3 <grep-trace
> +'
I think your code is correct, but I'd like to see a test
here that demonstrates a pathspec filter on the 'grep'
command to help filter out a blob that has a matching string.
Perhaps something like:
* matches.txt (has needle)
* nomatch.txt (does not have needle)
* matches.md (has needle)
and then 'git grep -c "needle" HEAD -- *.txt' would
download two blobs and find one match. A second run without
the pathspec would download one blob and find two matches.
Does that make sense as a test?
Thanks,
-Stolee
^ permalink raw reply [flat|nested] 25+ messages in thread
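The single-request behavior that the tests in this thread check (one fetch, N objects) comes from the collect-then-batch pattern shared by both patches: gather candidate oids into a deduplicating set, then issue one batched fetch instead of N on-demand fetches. A toy model with plain ints standing in for object ids and a counter standing in for promisor_remote_get_direct():

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OIDS 64

/* Tiny deduplicating set, mirroring what oidset_insert() provides. */
struct toy_oidset {
	int oids[MAX_OIDS];
	int nr;
};

static void toy_oidset_insert(struct toy_oidset *s, int oid)
{
	for (int i = 0; i < s->nr; i++)
		if (s->oids[i] == oid)
			return;		/* already present: dedupe */
	s->oids[s->nr++] = oid;
}

static int fetch_calls;		/* number of round trips made */
static int fetched_objects;	/* total objects requested */

/* Stand-in for promisor_remote_get_direct(): one call, many objects. */
static void toy_batch_fetch(const int *oids, int nr)
{
	(void)oids;
	fetch_calls++;
	fetched_objects += nr;
}

/* Collect wanted oids (duplicates allowed) and fetch in one batch. */
static void prefetch(const int *wanted, int n)
{
	struct toy_oidset set = { {0}, 0 };

	for (int i = 0; i < n; i++)
		toy_oidset_insert(&set, wanted[i]);
	if (set.nr)
		toy_batch_fetch(set.oids, set.nr);
}
```

Deduplication matters because the same blob oid can appear on both sides of several diffs; the batch should request each missing object exactly once.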
* Re: [PATCH v2 2/3] builtin/log: prefetch necessary blobs for `git cherry`
2026-04-18 0:32 ` [PATCH v2 2/3] builtin/log: prefetch necessary blobs for `git cherry` Elijah Newren via GitGitGadget
2026-04-19 14:04 ` Phillip Wood
@ 2026-04-27 13:16 ` Derrick Stolee
2026-05-11 2:51 ` Junio C Hamano
2026-05-13 23:17 ` Elijah Newren
1 sibling, 2 replies; 25+ messages in thread
From: Derrick Stolee @ 2026-04-27 13:16 UTC (permalink / raw)
To: Elijah Newren via GitGitGadget, git; +Cc: Elijah Newren
On 4/17/2026 8:32 PM, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>
(I'm sorry that I'm reviewing out of order. This reply includes my
feelings about patch 3 after reading both.)
> +/*
> + * Enumerate blob OIDs from a single commit's diff, inserting them into blobs.
> + * Skips files whose userdiff driver explicitly declares binary status
> + * (drv->binary > 0), since patch-ID uses oid_to_hex() for those and
> + * never reads blob content. Use userdiff_find_by_path() since
> + * diff_filespec_load_driver() is static in diff.c.
> + *
> + * Clean up with diff_queue_clear() (from diffcore.h).
> + */
> +static void collect_diff_blob_oids(struct commit *commit,
> + struct diff_options *opts,
> + struct oidset *blobs)
I think that this is generally a good idea, though I worry that
having this hidden in builtin/log.c may not be the right long-
term home.
I expect that we'll find more and more examples where we want to
prefetch blobs in different operations, those that exist now and
those that may be created in the future. It would be preferred if
they could automatically take advantage of the logic already in
diff_queued_diff_prefetch() within diffcore_std() in diff.c.
Ultimately, _this_ patch cares about a diff. Could we perform a
"diff prep" computation using the core diff library instead of
inventing a second queue of results for diffing?
Patch 3 cares about a "scan prep" which cares about loading all
blobs for a given tree with respect to a pathspec. This is very
similar to what a checkout would do, though it ultimately uses
a form of diff to find out what change should be applied to the
working directory. Perhaps 'git archive' is a better matching
example.
I don't mean to make your series more complicated. I value what
you're doing and can see how your current attention can be used
to make further improvements later. By implementing things in a
common location, we can have later integrations add to the
confidence in the feature through tests covering each user-facing
use.
I'm not sure if it makes sense to attempt to create a universal
library method that would be used by builtin/log.c _and_ diff.c,
at least not right now. I'm most interested in having this logic
be more reusable in the future without needing to move code
across files.
Thanks,
-Stolee
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 2/3] builtin/log: prefetch necessary blobs for `git cherry`
2026-04-27 13:16 ` Derrick Stolee
@ 2026-05-11 2:51 ` Junio C Hamano
2026-05-11 17:45 ` Elijah Newren
2026-05-13 23:17 ` Elijah Newren
1 sibling, 1 reply; 25+ messages in thread
From: Junio C Hamano @ 2026-05-11 2:51 UTC (permalink / raw)
To: Derrick Stolee; +Cc: Elijah Newren via GitGitGadget, git, Elijah Newren
Derrick Stolee <stolee@gmail.com> writes:
> Ultimately, _this_ patch cares about a diff. Could we perform a
> "diff prep" computation using the core diff library instead of
> inventing a second queue of results for diffing?
>
> Patch 3 cares about a "scan prep" which cares about loading all
> blobs for a given tree with respect to a pathspec. This is very
> similar to what a checkout would do, though it ultimately uses
> a form of diff to find out what change should be applied to the
> working directory. Perhaps 'git archive' is a better matching
> example.
>
> I don't mean to make your series more complicated. I value what
> you're doing and can see how your current attention can be used
> to make further improvements later. By implementing things in a
> common location, then we can have later integrations add to the
> confidence in the feature through tests covering each user-facing
> use.
>
> I'm not sure if it makes sense to attempt to create a universal
> library method that would be used by builtin/log.c _and_ diff.c,
> at least not right now. I'm most interested in having this logic
> be more reusable in the future without needing to move code
> across files.
The points raised in the message I am responding to here, together with
the ones in <31763514-2602-4d8e-ac25-70590f090947@gmail.com>, remain
unanswered.
Should I still keep these patches in my tree, hoping that responses
may come some day? I will mark the topic as "Expecting review
responses" in the draft "What's cooking" report I work from for now,
but it has been quite a while since we looked at the patches, so...?
Thanks.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 2/3] builtin/log: prefetch necessary blobs for `git cherry`
2026-05-11 2:51 ` Junio C Hamano
@ 2026-05-11 17:45 ` Elijah Newren
0 siblings, 0 replies; 25+ messages in thread
From: Elijah Newren @ 2026-05-11 17:45 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Derrick Stolee, Elijah Newren via GitGitGadget, git
On Sun, May 10, 2026 at 7:51 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Derrick Stolee <stolee@gmail.com> writes:
>
> > Ultimately, _this_ patch cares about a diff. Could we perform a
> > "diff prep" computation using the core diff library instead of
> > inventing a second queue of results for diffing?
> >
> > Patch 3 cares about a "scan prep" which cares about loading all
> > blobs for a given tree with respect to a pathspec. This is very
> > similar to what a checkout would do, though it ultimately uses
> > a form of diff to find out what change should be applied to the
> > working directory. Perhaps 'git archive' is a better matching
> > example.
> >
> > I don't mean to make your series more complicated. I value what
> > you're doing and can see how your current attention can be used
> > to make further improvements later. By implementing things in a
> > common location, then we can have later integrations add to the
> > confidence in the feature through tests covering each user-facing
> > use.
> >
> > I'm not sure if it makes sense to attempt to create a universal
> > library method that would be used by builtin/log.c _and_ diff.c,
> > at least not right now. I'm most interested in having this logic
> > be more reusable in the future without needing to move code
> > across files.
>
> The points raised in the message I am responding to here, together with
> the ones in <31763514-2602-4d8e-ac25-70590f090947@gmail.com>, remain
> unanswered.
>
> Should I still keep these patches in my tree, hoping that responses
> may come some day? I will mark the topic as "Expecting review
> responses" in the draft "What's cooking" report I work from for now,
> but it has been quite a while since we looked at the patches, so...?
Sorry for not responding sooner. There have been a number of
incidents at work (including a big problematic one the day Derrick
sent his email), and I was pulled into both firefighting and
remediation duties which have sucked up all my time. I owe responses
to Derrick, Patrick, Johannes, Phillip, and Taylor on a variety of
topics.
For this particular series, maybe mark as expecting a re-roll, since
Stolee suggested adding a test on patch 3?
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 3/3] grep: prefetch necessary blobs
2026-04-27 12:59 ` Derrick Stolee
@ 2026-05-13 19:21 ` Elijah Newren
0 siblings, 0 replies; 25+ messages in thread
From: Elijah Newren @ 2026-05-13 19:21 UTC (permalink / raw)
To: Derrick Stolee; +Cc: Elijah Newren via GitGitGadget, git
On Mon, Apr 27, 2026 at 5:59 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 4/17/2026 8:32 PM, Elijah Newren via GitGitGadget wrote:
> > From: Elijah Newren <newren@gmail.com>
> >
> > In partial clones, `git grep` fetches necessary blobs on-demand one
> > at a time, which can be very slow. Add an extra
> > preliminary walk over the tree similar to grep_tree() which collects
> > the blobs of interest, and then prefetches them.
>
> A lot of the code is about walking trees to find blobs matching
> the input pathspec, with this being the core method:
>
> > +static void collect_blob_oids_for_tree(struct repository *repo,
> > + const struct pathspec *pathspec,
> > + struct tree_desc *tree,
> > + struct strbuf *base,
> > + int tn_len,
> > + struct oidset *blob_oids)
>
> And in your test, you set up a repo to have three blobs with
> matches in two of the files:
>
> > +test_expect_success 'grep of revision in partial clone does bulk prefetch' '
> > + test_when_finished "rm -rf grep-partial-src grep-partial" &&
> > +
> > + git init grep-partial-src &&
> > + (
> > + cd grep-partial-src &&
> > + git config uploadpack.allowfilter 1 &&
> > + git config uploadpack.allowanysha1inwant 1 &&
> > + echo "needle in haystack" >searchme &&
> > + echo "no match here" >other &&
> > + mkdir subdir &&
> > + echo "needle again" >subdir/deep &&
> > + git add . &&
> > + git commit -m "initial"
> > + ) &&
>
> But then the command downloads all of the blobs, not using a
> pathspec:
>
> > + # grep HEAD should batch-prefetch all blobs in one request.
> > + GIT_TRACE2_EVENT="$(pwd)/grep-trace" \
> > + git -C grep-partial grep -c "needle" HEAD >result &&
> > +
> > + # Should find matches in two files.
> > + test_line_count = 2 result &&
> > +
> > + # Should have prefetched all 3 objects at once
> > + test_trace2_data promisor fetch_count 3 <grep-trace
> > +'
> I think your code is correct, but I'd like to see a test
> here that demonstrates a pathspec filter on the 'grep'
> command to help filter out a blob that has a matching string.
>
> Perhaps something like:
>
> * matches.txt (has needle)
> * nomatch.txt (does not have needle)
> * matches.md (has needle)
>
> and then 'git grep -c "needle" HEAD -- *.txt' would
> download two blobs and find one match. A second run without
> the pathspec would download one blob and find two matches.
>
> Does that make sense as a test?
Yes, absolutely. And thanks for suggesting it; although I was
handling pathspecs correctly, I discovered that I was unconditionally
requesting to download whatever objects matched the pathspecs (or all
blobs in the commit if no pathspec given), even if the blobs were
already local. I'll send an updated test in v2, along with the fix.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 2/3] builtin/log: prefetch necessary blobs for `git cherry`
2026-04-27 13:16 ` Derrick Stolee
2026-05-11 2:51 ` Junio C Hamano
@ 2026-05-13 23:17 ` Elijah Newren
1 sibling, 0 replies; 25+ messages in thread
From: Elijah Newren @ 2026-05-13 23:17 UTC (permalink / raw)
To: Derrick Stolee; +Cc: Elijah Newren via GitGitGadget, git
Hi,
Sorry for the long delay. Lots of firefighting of incidents kept me
away for a bit...
On Mon, Apr 27, 2026 at 6:17 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 4/17/2026 8:32 PM, Elijah Newren via GitGitGadget wrote:
> > From: Elijah Newren <newren@gmail.com>
>
> (I'm sorry that I'm reviewing out of order. This reply includes my
> feelings about patch 3 after reading both.)
Thanks for taking a look! And I have no problems with reviewing out
of order (unless the comments on later patches don't make sense due to
the reviewer being unaware of previous patches, which isn't the case
here).
> > +/*
> > + * Enumerate blob OIDs from a single commit's diff, inserting them into blobs.
> > + * Skips files whose userdiff driver explicitly declares binary status
> > + * (drv->binary > 0), since patch-ID uses oid_to_hex() for those and
> > + * never reads blob content. Use userdiff_find_by_path() since
> > + * diff_filespec_load_driver() is static in diff.c.
> > + *
> > + * Clean up with diff_queue_clear() (from diffcore.h).
> > + */
> > +static void collect_diff_blob_oids(struct commit *commit,
> > + struct diff_options *opts,
> > + struct oidset *blobs)
>
> I think that this is generally a good idea, though I worry that
> having this hidden in builtin/log.c may not be the right long-
> term home.
>
> I expect that we'll find more and more examples where we want to
> prefetch blobs in different operations, those that exist now and
> those that may be created in the future. It would be preferred if
> they could automatically take advantage of the logic already in
> diff_queued_diff_prefetch() within diffcore_std() in diff.c.
>
> Ultimately, _this_ patch cares about a diff.
I read this patch a bit differently -- could you say more about what
you have in mind?
The body of collect_diff_blob_oids() really is just diff_tree_oid() +
diffcore_std() + process each pair, so at the per-commit level I am
already leaning on the diff library. One of the things this patch
adds is accumulation across many commits: the containing loop (in
prefetch_cherry_blobs) is over a commit range, not over a single diff.
Concretely, the motivating case was a patch touching a few files where
upstream had tens of thousands of commits in <limit>..<head>, several
hundred of which modified the same set of files. A per-diff prefetch
like diff.c uses would turn that into hundreds of small fetches of 1-3
blobs each; what this series gives you is one fetch. So the win
really does live above the diff library, not inside it.
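To put rough numbers on that (the 3s per-request figure comes from the cover letter; the commit count is hypothetical):

```shell
fetch_overhead_s=3        # per-request cost reported in the cover letter
matching_commits=300      # hypothetical commits touching the relevant files

# One small fetch per matching commit's diff:
echo $((matching_commits * fetch_overhead_s))  # 900 seconds
# One accumulated fetch for the whole range:
echo $((1 * fetch_overhead_s))                 # 3 seconds
```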
There are two further wrinkles in cherry that are filters layered on
top of the cross-commit accumulation, and they're cherry-specific in a
way that I don't think belongs in the diff library:
1. For most commits in <limit>..<head>, cherry doesn't care about
the diff at all -- if the list of files modified doesn't exactly match
that of the commit of interest, the commit is skipped before patch-id is even
computed. Prefetching for those would be wasted.
2. We skip prefetching content for binary files (because patch-id
uses oid_to_hex() for such files instead of the diff contents).
> Could we perform a
> "diff prep" computation using the core diff library instead of
> inventing a second queue of results for diffing?
To check this concretely I looked at each of the existing
promisor_remote_get_direct() callsites for a similar producer. The
closest cousin of collect_diff_blob_oids() (the only part of this
patch that looks like it might be close to the right shape to put in a
core diff library) is diff.c's diff_queued_diff_prefetch() -- but it
operates on the already-populated global diff_queued_diff and fetches
immediately, rather than setting up the diff itself and returning an
oidset for the caller to accumulate. Reshaping it to match cherry's
needs would either break its current caller in diffcore_std() or
introduce a parallel function whose only consumer is cherry. None of
the other sites (path-walk in backfill, index walk in read-cache,
three-way state in merge-ort, etc.) do anything resembling "diff two
trees and harvest oids."
And even if we did factor a helper out, cherry's filter is
patch-id-specific: commit_patch_id() substitutes oid_to_hex() for
files marked binary by their userdiff driver, so we deliberately skip
prefetching those. That isn't a generic "diff prep" consideration --
it only makes sense because the caller is patch-id. We could express
it as a predicate parameter, but with one caller that would feel to me
like it's just pushing cherry's policy across an API boundary for no
gain.
> Patch 3 cares about a "scan prep" which cares about loading all
> blobs for a given tree with respect to a pathspec. This is very
> similar to what a checkout would do, though it ultimately uses
> a form of diff to find out what change should be applied to the
> working directory. Perhaps 'git archive' is a better matching
> example.
Agreed that archive is the closer analog -- both grep and archive do a
pathspec-filtered single-tree walk, whereas checkout's prefetch is
tied to the index and optimizes to the subset of paths that are
different since the previous version checked out. Retrofitting that
to grep would mean materializing an index for the target revision just
to throw it away, which feels like more machinery to bridge the
abstractions than the walk itself would take.
> By implementing things in a
> common location, we can have later integrations add to the
> confidence in the feature through tests covering each user-facing
> use.
Sounds great...but what common user-facing uses exist?
Looking at the existing 11 callsites of promisor_remote_get_direct()
after this series [1], each has pretty specialized data needs --
index-driven (read-cache), index-pack & pack-objects internals,
path-walk batches (backfill), merge-ort's three-way logic,
diffcore-rename's two independent rename-detection paths, plain old
diffs, collection across a subset of commits (cherry),
pathspec-filtered tree walk (grep), and
on-demand-single-blob-at-a-time (odb.c) -- so I don't see a natural
shared layer above the primitive itself (which is already
promisor_remote_get_direct).
archive, if it had prefetch logic, would be the first match. But it's
not clear where the shared logic between grep and archive would live,
if archive even had any prefetch logic to share.
So I'm inclined to leave both new producers local to their builtins
for now, and factor a tree-walk helper when archive (or a third
caller) actually wants one. But I'm happy to be told I've missed the
boat.
Thanks,
Elijah
[1] builtin/backfill.c, builtin/grep.c, builtin/index-pack.c,
builtin/log.c, builtin/pack-objects.c, diff.c, diffcore-rename.c (two
callsites), merge-ort.c, odb.c, read-cache.c
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH v3 0/4] Batch prefetching
2026-04-18 0:32 ` [PATCH v2 0/3] Batch prefetching Elijah Newren via GitGitGadget
` (2 preceding siblings ...)
2026-04-18 0:32 ` [PATCH v2 3/3] grep: prefetch necessary blobs Elijah Newren via GitGitGadget
@ 2026-05-14 16:25 ` Elijah Newren via GitGitGadget
2026-05-14 16:25 ` [PATCH v3 1/4] promisor-remote: document caller filtering contract Elijah Newren via GitGitGadget
` (3 more replies)
3 siblings, 4 replies; 25+ messages in thread
From: Elijah Newren via GitGitGadget @ 2026-05-14 16:25 UTC (permalink / raw)
To: git; +Cc: Elijah Newren, Phillip Wood, Derrick Stolee, Elijah Newren
Changes since v2:
* Modified the final patch as suggested by Stolee to include pathspec usage
in the testcase
* Modified the last two patches to not re-download blobs we already have
locally, and adjusted the tests to verify
* Inserted a new first patch, containing a documentation addition that
would have helped me avoid making the above mistake in the first place.
Note: Stolee also suggested some code sharing or code movement in his review
of v2 2/3, but possibly based on a misunderstanding of v2 2/3 (that patch
isn't about a diff), and it's not clear to me what could be shared or moved,
so that's not part of this round.
Changes since v1:
* Remove stray file that should have never been added. So embarrassing that
I didn't catch that before submitting.
Original cover-letter:
======================
Partial clones provide a trade-off for users: avoid downloading blobs
upfront, at the expense of needing to download them later as they run other
commands. This trade-off can sometimes incur a more severe cost than
expected, particularly if needed blobs are discovered as they are accessed,
resulting in downloading blobs one at a time. Some commands like checkout,
diff, and merge do batch prefetches of necessary blobs, since that can
dramatically reduce the pain of on-demand loading. Extend this ability to
two more commands: cherry and grep.
This series was spurred by a report where git cherry jobs were each doing
hundreds of single-blob fetches, at a cost of 3s each. Batching those
downloads should dramatically speed up their jobs. (And I decided to fix up
git grep similarly while at it.)
I'll also note that git backfill with revisions and/or pathspecs could
improve things for these users, but since backfill is a manual command users
would have to run, and requires them to figure out which data is
needed (a challenge in the case of cherry), it still makes sense to provide
smarter behavior for folks who don't choose to manually run backfill.
Also, correct a documentation typo I noticed in patch-ids.h (related to code
I was using for the git cherry fixes) as a preparatory fixup.
Elijah Newren (4):
promisor-remote: document caller filtering contract
patch-ids.h: add missing trailing parenthesis in documentation comment
builtin/log: prefetch necessary blobs for `git cherry`
grep: prefetch necessary blobs
builtin/grep.c | 143 ++++++++++++++++++++++++++++++++++++++++++++++
builtin/log.c | 131 ++++++++++++++++++++++++++++++++++++++++++
patch-ids.h | 2 +-
promisor-remote.h | 11 ++++
t/t3500-cherry.sh | 27 +++++++++
t/t7810-grep.sh | 58 +++++++++++++++++++
6 files changed, 371 insertions(+), 1 deletion(-)
base-commit: 9f223ef1c026d91c7ac68cc0211bde255dda6199
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-2089%2Fnewren%2Fbatch-prefetching-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-2089/newren/batch-prefetching-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/2089
Range-diff vs v2:
-: ---------- > 1: 6ad11e2c28 promisor-remote: document caller filtering contract
1: 663816a344 = 2: 08a2c6517b patch-ids.h: add missing trailing parenthesis in documentation comment
2: a705852723 ! 3: c0655e5d41 builtin/log: prefetch necessary blobs for `git cherry`
@@ builtin/log.c: static void print_commit(char sign, struct commit *commit, int ve
+ if (drv && drv->binary > 0)
+ continue;
+
-+ if (DIFF_FILE_VALID(p->one))
++ if (DIFF_FILE_VALID(p->one) &&
++ odb_read_object_info_extended(opts->repo->objects,
++ &p->one->oid, NULL,
++ OBJECT_INFO_FOR_PREFETCH))
+ oidset_insert(blobs, &p->one->oid);
-+ if (DIFF_FILE_VALID(p->two))
++ if (DIFF_FILE_VALID(p->two) &&
++ odb_read_object_info_extended(opts->repo->objects,
++ &p->two->oid, NULL,
++ OBJECT_INFO_FOR_PREFETCH))
+ oidset_insert(blobs, &p->two->oid);
+ }
+ diff_queue_clear(q);
@@ t/t3500-cherry.sh: test_expect_success 'cherry ignores whitespace' '
+
+ grep "child_start.*fetch.negotiationAlgorithm" trace.output >fetches &&
+ test_line_count = 1 fetches &&
-+ test_trace2_data promisor fetch_count 4 <trace.output
++ test_trace2_data promisor fetch_count 4 <trace.output &&
++
++ # A second invocation should not refetch any blobs, since
++ # the prefetch is expected to filter out OIDs that are
++ # already present locally.
++ GIT_TRACE2_EVENT="$(pwd)/trace2.output" git cherry upstream-with-space feature-without-space >actual &&
++ test_cmp ../expect actual &&
++
++ ! grep "child_start.*fetch.negotiationAlgorithm" trace2.output &&
++ ! grep "\"key\":\"fetch_count\"" trace2.output
+ )
+'
+
3: 8fbfe69bc4 ! 4: 75d4ca7cff grep: prefetch necessary blobs
@@ builtin/grep.c: static int grep_tree(struct grep_opt *opt, const struct pathspec
+ strbuf_add(base, entry.path, tree_entry_len(&entry));
+
+ if (S_ISREG(entry.mode)) {
-+ oidset_insert(blob_oids, &entry.oid);
++ if (!odb_has_object(repo->objects, &entry.oid, 0))
++ oidset_insert(blob_oids, &entry.oid);
+ } else if (S_ISDIR(entry.mode)) {
+ enum object_type type;
+ struct tree_desc sub_tree;
@@ t/t7810-grep.sh: test_expect_success 'grep does not report i-t-a and assume unch
test_cmp expected actual
'
-+test_expect_success 'grep of revision in partial clone does bulk prefetch' '
++test_expect_success 'grep of revision in partial clone batches prefetch and honors pathspec' '
+ test_when_finished "rm -rf grep-partial-src grep-partial" &&
+
+ git init grep-partial-src &&
@@ t/t7810-grep.sh: test_expect_success 'grep does not report i-t-a and assume unch
+ cd grep-partial-src &&
+ git config uploadpack.allowfilter 1 &&
+ git config uploadpack.allowanysha1inwant 1 &&
-+ echo "needle in haystack" >searchme &&
-+ echo "no match here" >other &&
-+ mkdir subdir &&
-+ echo "needle again" >subdir/deep &&
++ mkdir a b &&
++ echo "needle in haystack" >a/matches.txt &&
++ echo "nothing to see here" >a/nomatch.txt &&
++ echo "needle again" >b/matches.md &&
+ git add . &&
+ git commit -m "initial"
+ ) &&
@@ t/t7810-grep.sh: test_expect_success 'grep does not report i-t-a and assume unch
+ git clone --no-checkout --filter=blob:none \
+ "file://$(pwd)/grep-partial-src" grep-partial &&
+
-+ # All blobs should be missing after a blobless clone.
++ # All three blobs are missing immediately after a blobless clone.
+ git -C grep-partial rev-list --quiet --objects \
+ --missing=print HEAD >missing &&
+ test_line_count = 3 missing &&
+
-+ # grep HEAD should batch-prefetch all blobs in one request.
-+ GIT_TRACE2_EVENT="$(pwd)/grep-trace" \
++ # A pathspec-limited grep should prefetch only the two blobs
++ # in a/. It should fetch both blobs in one batched request.
++ GIT_TRACE2_EVENT="$(pwd)/grep-trace-pathspec" \
++ git -C grep-partial grep -c "needle" HEAD -- "a/*.txt" >result &&
++
++ # Only a/matches.txt contains "needle" among the matched paths.
++ test_line_count = 1 result &&
++
++ # Exactly the two a/*.txt blobs should have been requested, and
++ # the server packed those two objects in the response.
++ test_trace2_data promisor fetch_count 2 <grep-trace-pathspec &&
++ test_trace2_data pack-objects written 2 <grep-trace-pathspec &&
++
++ # b/matches.md should still be missing locally.
++ git -C grep-partial rev-list --quiet --objects \
++ --missing=print HEAD >missing &&
++ test_line_count = 1 missing &&
++
++ # A second grep without a pathspec must recurse into both
++ # subdirectories, but should request only the still-missing blob
++ # from the promisor.
++ GIT_TRACE2_EVENT="$(pwd)/grep-trace-all" \
+ git -C grep-partial grep -c "needle" HEAD >result &&
+
-+ # Should find matches in two files.
+ test_line_count = 2 result &&
++ test_trace2_data promisor fetch_count 1 <grep-trace-all &&
++ test_trace2_data pack-objects written 1 <grep-trace-all &&
+
-+ # Should have prefetched all 3 objects at once
-+ test_trace2_data promisor fetch_count 3 <grep-trace
++ # Everything is local now.
++ git -C grep-partial rev-list --quiet --objects \
++ --missing=print HEAD >missing &&
++ test_line_count = 0 missing
+'
+
test_done
--
gitgitgadget
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH v3 1/4] promisor-remote: document caller filtering contract
2026-05-14 16:25 ` [PATCH v3 0/4] Batch prefetching Elijah Newren via GitGitGadget
@ 2026-05-14 16:25 ` Elijah Newren via GitGitGadget
2026-05-14 16:25 ` [PATCH v3 2/4] patch-ids.h: add missing trailing parenthesis in documentation comment Elijah Newren via GitGitGadget
` (2 subsequent siblings)
3 siblings, 0 replies; 25+ messages in thread
From: Elijah Newren via GitGitGadget @ 2026-05-14 16:25 UTC (permalink / raw)
To: git
Cc: Elijah Newren, Phillip Wood, Derrick Stolee, Elijah Newren,
Elijah Newren
From: Elijah Newren <newren@gmail.com>
promisor_remote_get_direct() does not, on its happy path, filter out
OIDs that are already present in the local object store: every OID
the caller supplies is written to the fetch subprocess's stdin and
ends up in the response pack. The only filtering it performs is in
remove_fetched_oids(), and that only runs after a fetch failure when
falling back to a different configured promisor remote.
Almost every existing caller already filters locally-present OIDs out
itself (typically with odb_read_object_info_extended() and
OBJECT_INFO_FOR_PREFETCH, or odb_has_object() with no fetch flag). But
the existing API comment does not state this expectation, so a new
caller is easy to write incorrectly (I missed this originally and wrote
two problematic callers). Omitting the filter still "works" in the
sense that the desired objects end up local, but it silently makes the
fetch request -- and the response pack -- larger than necessary,
defeating part of the point of batching.
Spell the contract out so future callers know to filter (and
deduplicate) themselves, and point them at the helpers they should
use to check local presence without accidentally triggering a lazy
fetch.
Signed-off-by: Elijah Newren <newren@gmail.com>
---
promisor-remote.h | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/promisor-remote.h b/promisor-remote.h
index 3d4d2de018..301f5ac5cb 100644
--- a/promisor-remote.h
+++ b/promisor-remote.h
@@ -29,6 +29,17 @@ int repo_has_promisor_remote(struct repository *r);
* Fetches all requested objects from all promisor remotes, trying them one at
* a time until all objects are fetched.
*
+ * Callers are responsible for filtering out OIDs that are already present
+ * locally before calling this function: every supplied OID is sent in the
+ * fetch request, even if the object already exists in the local object
+ * store. (Only after a fetch failure does this function fall back to
+ * stripping already-present OIDs from the list before trying the next
+ * configured promisor remote.) Callers should also deduplicate the OIDs.
+ *
+ * To test for local presence without triggering a lazy fetch (which would
+ * defeat the purpose of batching), use odb_has_object(..., 0) or
+ * odb_read_object_info_extended() with OBJECT_INFO_FOR_PREFETCH.
+ *
* If oid_nr is 0, this function returns immediately.
*/
void promisor_remote_get_direct(struct repository *repo,
--
gitgitgadget
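Read together with the helpers it names, the new comment amounts to a caller-side pattern like the following (a non-runnable sketch against git's internal APIs, not code from the series; `want`, `to_fetch`, and `repo` are illustrative names):

```c
/* Sketch: batch-prefetch only the objects not already present locally. */
struct oidset to_fetch = OIDSET_INIT;
struct oidset_iter iter;
const struct object_id *oid;

oidset_iter_init(&want, &iter);
while ((oid = oidset_iter_next(&iter)))
	/* A no-flags odb_has_object() never triggers a lazy fetch. */
	if (!odb_has_object(repo->objects, oid, 0))
		oidset_insert(&to_fetch, oid); /* the oidset deduplicates */

if (oidset_size(&to_fetch)) {
	struct oid_array oids = OID_ARRAY_INIT;

	oidset_iter_init(&to_fetch, &iter);
	while ((oid = oidset_iter_next(&iter)))
		oid_array_append(&oids, oid);
	/* One batched request for everything still missing. */
	promisor_remote_get_direct(repo, oids.oid, oids.nr);
	oid_array_clear(&oids);
}
oidset_clear(&to_fetch);
```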
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH v3 2/4] patch-ids.h: add missing trailing parenthesis in documentation comment
2026-05-14 16:25 ` [PATCH v3 0/4] Batch prefetching Elijah Newren via GitGitGadget
2026-05-14 16:25 ` [PATCH v3 1/4] promisor-remote: document caller filtering contract Elijah Newren via GitGitGadget
@ 2026-05-14 16:25 ` Elijah Newren via GitGitGadget
2026-05-14 16:25 ` [PATCH v3 3/4] builtin/log: prefetch necessary blobs for `git cherry` Elijah Newren via GitGitGadget
2026-05-14 16:25 ` [PATCH v3 4/4] grep: prefetch necessary blobs Elijah Newren via GitGitGadget
3 siblings, 0 replies; 25+ messages in thread
From: Elijah Newren via GitGitGadget @ 2026-05-14 16:25 UTC (permalink / raw)
To: git
Cc: Elijah Newren, Phillip Wood, Derrick Stolee, Elijah Newren,
Elijah Newren
From: Elijah Newren <newren@gmail.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
---
patch-ids.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/patch-ids.h b/patch-ids.h
index 490d739371..57534ee722 100644
--- a/patch-ids.h
+++ b/patch-ids.h
@@ -37,7 +37,7 @@ int has_commit_patch_id(struct commit *commit, struct patch_ids *);
* struct patch_id *cur;
* for (cur = patch_id_iter_first(commit, ids);
* cur;
- * cur = patch_id_iter_next(cur, ids) {
+ * cur = patch_id_iter_next(cur, ids)) {
* ... look at cur->commit
* }
*/
--
gitgitgadget
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH v3 3/4] builtin/log: prefetch necessary blobs for `git cherry`
2026-05-14 16:25 ` [PATCH v3 0/4] Batch prefetching Elijah Newren via GitGitGadget
2026-05-14 16:25 ` [PATCH v3 1/4] promisor-remote: document caller filtering contract Elijah Newren via GitGitGadget
2026-05-14 16:25 ` [PATCH v3 2/4] patch-ids.h: add missing trailing parenthesis in documentation comment Elijah Newren via GitGitGadget
@ 2026-05-14 16:25 ` Elijah Newren via GitGitGadget
2026-05-14 16:25 ` [PATCH v3 4/4] grep: prefetch necessary blobs Elijah Newren via GitGitGadget
3 siblings, 0 replies; 25+ messages in thread
From: Elijah Newren via GitGitGadget @ 2026-05-14 16:25 UTC (permalink / raw)
To: git
Cc: Elijah Newren, Phillip Wood, Derrick Stolee, Elijah Newren,
Elijah Newren
From: Elijah Newren <newren@gmail.com>
In partial clones, `git cherry` fetches necessary blobs on demand, one
at a time, which can be very slow. We would like to prefetch all
necessary blobs upfront. To do so, we first need to figure out which
blobs are needed.

`git cherry` does its work in a two-phase approach: first computing
header-only IDs (based on file paths and modes), then falling back to
full content-based IDs only when header-only IDs collide -- or, more
accurately, whenever the oidhash() values of the header-only object_ids
collide.

patch-ids.c handles this by creating an ids->patches hashmap that has
all the data we need. The problem is that any attempt to query the
hashmap will invoke patch_id_neq() on any colliding objects, which
triggers the on-demand fetching.
Insert a new prefetch_cherry_blobs() function, called before checking
for collisions. Temporarily replace ids->patches.cmpfn in order to
enumerate the blobs that would be needed without yet fetching them,
fetch them all in a single batch, and then restore the original
ids->patches.cmpfn.
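To illustrate the cmpfn-swap trick described above outside of Git's own
code, here is a minimal sketch in plain Python (not Git's actual hashmap
API; all names below are hypothetical). Swapping in an always-match
comparison makes a lookup enumerate every entry in the colliding hash
bucket without ever invoking the expensive comparison that would fetch
blob contents:

```python
class Hashmap:
    def __init__(self, cmpfn, nbuckets=16):
        self.cmpfn = cmpfn
        self.buckets = [[] for _ in range(nbuckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, payload):
        self._bucket(key).append((key, payload))

    def lookup_all(self, key):
        # Yield every payload in key's bucket that cmpfn accepts.
        for k, payload in self._bucket(key):
            if self.cmpfn(k, key):
                yield payload

def expensive_equal(a, b):
    # Stand-in for patch_id_neq(): comparing for real would read blob
    # contents and trigger an on-demand fetch in a partial clone.
    raise RuntimeError("would fetch blobs on demand")

hm = Hashmap(expensive_equal, nbuckets=1)  # one bucket: all keys collide
hm.insert("commit-a", "patch-id-a")
hm.insert("commit-b", "patch-id-b")

# Swap in an always-match cmpfn so the lookup enumerates every entry in
# the colliding bucket without invoking the expensive comparison, then
# restore the original.
saved = hm.cmpfn
hm.cmpfn = lambda a, b: True
candidates = list(hm.lookup_all("commit-c"))
hm.cmpfn = saved
```

The enumerated candidates are exactly the entries that a normal lookup
would have compared one by one, so their blobs can be requested in one
batch before the real comparison loop runs.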
Signed-off-by: Elijah Newren <newren@gmail.com>
---
builtin/log.c | 131 ++++++++++++++++++++++++++++++++++++++++++++++
t/t3500-cherry.sh | 27 ++++++++++
2 files changed, 158 insertions(+)
diff --git a/builtin/log.c b/builtin/log.c
index 8c0939dd42..e464b30af4 100644
--- a/builtin/log.c
+++ b/builtin/log.c
@@ -21,10 +21,12 @@
#include "color.h"
#include "commit.h"
#include "diff.h"
+#include "diffcore.h"
#include "diff-merges.h"
#include "revision.h"
#include "log-tree.h"
#include "oid-array.h"
+#include "oidset.h"
#include "tag.h"
#include "reflog-walk.h"
#include "patch-ids.h"
@@ -43,9 +45,11 @@
#include "utf8.h"
#include "commit-reach.h"
+#include "promisor-remote.h"
#include "range-diff.h"
#include "tmp-objdir.h"
#include "tree.h"
+#include "userdiff.h"
#include "write-or-die.h"
#define MAIL_DEFAULT_WRAP 72
@@ -2602,6 +2606,131 @@ static void print_commit(char sign, struct commit *commit, int verbose,
}
}
+/*
+ * Enumerate blob OIDs from a single commit's diff, inserting them into blobs.
+ * Skips files whose userdiff driver explicitly declares binary status
+ * (drv->binary > 0), since patch-ID uses oid_to_hex() for those and
+ * never reads blob content. Use userdiff_find_by_path() since
+ * diff_filespec_load_driver() is static in diff.c.
+ *
+ * Clean up with diff_queue_clear() (from diffcore.h).
+ */
+static void collect_diff_blob_oids(struct commit *commit,
+ struct diff_options *opts,
+ struct oidset *blobs)
+{
+ struct diff_queue_struct *q;
+
+ /*
+ * Merge commits are filtered out by patch_id_defined() in patch-ids.c,
+ * so we'll never be called with one.
+ */
+ assert(!commit->parents || !commit->parents->next);
+
+ if (commit->parents)
+ diff_tree_oid(&commit->parents->item->object.oid,
+ &commit->object.oid, "", opts);
+ else
+ diff_root_tree_oid(&commit->object.oid, "", opts);
+ diffcore_std(opts);
+
+ q = &diff_queued_diff;
+ for (int i = 0; i < q->nr; i++) {
+ struct diff_filepair *p = q->queue[i];
+ struct userdiff_driver *drv;
+
+ /* Skip binary files */
+ drv = userdiff_find_by_path(opts->repo->index, p->one->path);
+ if (drv && drv->binary > 0)
+ continue;
+
+ if (DIFF_FILE_VALID(p->one) &&
+ odb_read_object_info_extended(opts->repo->objects,
+ &p->one->oid, NULL,
+ OBJECT_INFO_FOR_PREFETCH))
+ oidset_insert(blobs, &p->one->oid);
+ if (DIFF_FILE_VALID(p->two) &&
+ odb_read_object_info_extended(opts->repo->objects,
+ &p->two->oid, NULL,
+ OBJECT_INFO_FOR_PREFETCH))
+ oidset_insert(blobs, &p->two->oid);
+ }
+ diff_queue_clear(q);
+}
+
+static int always_match(const void *cmp_data UNUSED,
+ const struct hashmap_entry *entry1 UNUSED,
+ const struct hashmap_entry *entry2 UNUSED,
+ const void *keydata UNUSED)
+{
+ return 0;
+}
+
+/*
+ * Prefetch blobs for git cherry in partial clones.
+ *
+ * Called between the revision walk (which builds the head-side
+ * commit list) and the has_commit_patch_id() comparison loop.
+ *
+ * Uses a cmpfn-swap trick to avoid reading blobs: temporarily
+ * replaces the hashmap's comparison function with a trivial
+ * always-match function, so hashmap_get()/hashmap_get_next() match
+ * any entry with the same oidhash bucket. These are the set of oids
+ * that would trigger patch_id_neq() during normal lookup and cause
+ * blobs to be read on demand, and we want to prefetch them all at
+ * once instead.
+ */
+static void prefetch_cherry_blobs(struct repository *repo,
+ struct commit_list *list,
+ struct patch_ids *ids)
+{
+ struct oidset blobs = OIDSET_INIT;
+ hashmap_cmp_fn original_cmpfn;
+
+ /* Exit if we're not in a partial clone */
+ if (!repo_has_promisor_remote(repo))
+ return;
+
+ /* Save original cmpfn, replace with always_match */
+ original_cmpfn = ids->patches.cmpfn;
+ ids->patches.cmpfn = always_match;
+
+ /* Find header-only collisions, gather blobs from those commits */
+ for (struct commit_list *l = list; l; l = l->next) {
+ struct commit *c = l->item;
+ bool match_found = false;
+ for (struct patch_id *cur = patch_id_iter_first(c, ids);
+ cur;
+ cur = patch_id_iter_next(cur, ids)) {
+ match_found = true;
+ collect_diff_blob_oids(cur->commit, &ids->diffopts,
+ &blobs);
+ }
+ if (match_found)
+ collect_diff_blob_oids(c, &ids->diffopts, &blobs);
+ }
+
+ /* Restore original cmpfn */
+ ids->patches.cmpfn = original_cmpfn;
+
+ /* If we have any blobs to fetch, fetch them */
+ if (oidset_size(&blobs)) {
+ struct oid_array to_fetch = OID_ARRAY_INIT;
+ struct oidset_iter iter;
+ const struct object_id *oid;
+
+ oidset_iter_init(&blobs, &iter);
+ while ((oid = oidset_iter_next(&iter)))
+ oid_array_append(&to_fetch, oid);
+
+ promisor_remote_get_direct(repo, to_fetch.oid, to_fetch.nr);
+
+ oid_array_clear(&to_fetch);
+ }
+
+ oidset_clear(&blobs);
+}
+
int cmd_cherry(int argc,
const char **argv,
const char *prefix,
@@ -2673,6 +2802,8 @@ int cmd_cherry(int argc,
commit_list_insert(commit, &list);
}
+ prefetch_cherry_blobs(the_repository, list, &ids);
+
for (struct commit_list *l = list; l; l = l->next) {
char sign = '+';
diff --git a/t/t3500-cherry.sh b/t/t3500-cherry.sh
index 78c3eac54b..3e66827d76 100755
--- a/t/t3500-cherry.sh
+++ b/t/t3500-cherry.sh
@@ -78,4 +78,31 @@ test_expect_success 'cherry ignores whitespace' '
test_cmp expect actual
'
+# Reuse the expect file from the previous test, in a partial clone
+test_expect_success 'cherry in partial clone does bulk prefetch' '
+ test_config uploadpack.allowfilter 1 &&
+ test_config uploadpack.allowanysha1inwant 1 &&
+ test_when_finished "rm -rf copy" &&
+
+ git clone --bare --filter=blob:none file://"$(pwd)" copy &&
+ (
+ cd copy &&
+ GIT_TRACE2_EVENT="$(pwd)/trace.output" git cherry upstream-with-space feature-without-space >actual &&
+ test_cmp ../expect actual &&
+
+ grep "child_start.*fetch.negotiationAlgorithm" trace.output >fetches &&
+ test_line_count = 1 fetches &&
+ test_trace2_data promisor fetch_count 4 <trace.output &&
+
+ # A second invocation should not refetch any blobs, since
+ # the prefetch is expected to filter out OIDs that are
+ # already present locally.
+ GIT_TRACE2_EVENT="$(pwd)/trace2.output" git cherry upstream-with-space feature-without-space >actual &&
+ test_cmp ../expect actual &&
+
+ ! grep "child_start.*fetch.negotiationAlgorithm" trace2.output &&
+ ! grep "\"key\":\"fetch_count\"" trace2.output
+ )
+'
+
test_done
--
gitgitgadget
* [PATCH v3 4/4] grep: prefetch necessary blobs
2026-05-14 16:25 ` [PATCH v3 0/4] Batch prefetching Elijah Newren via GitGitGadget
` (2 preceding siblings ...)
2026-05-14 16:25 ` [PATCH v3 3/4] builtin/log: prefetch necessary blobs for `git cherry` Elijah Newren via GitGitGadget
@ 2026-05-14 16:25 ` Elijah Newren via GitGitGadget
3 siblings, 0 replies; 25+ messages in thread
From: Elijah Newren via GitGitGadget @ 2026-05-14 16:25 UTC (permalink / raw)
To: git
Cc: Elijah Newren, Phillip Wood, Derrick Stolee, Elijah Newren,
Elijah Newren
From: Elijah Newren <newren@gmail.com>
In partial clones, `git grep` fetches necessary blobs on demand, one
at a time, which can be very slow. Add an extra preliminary walk over
the tree, similar to grep_tree(), which collects the blobs of interest
and then prefetches them in a single batch.
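The shape of that preliminary walk can be sketched in plain Python (not
Git's actual API; all names below are hypothetical): recurse through
the tree, note every blob that is absent locally, and issue one batched
request instead of many single-blob fetches:

```python
local_store = {"blob-1"}  # pretend only one blob is present locally

# A nested dict stands in for tree objects: directories map to dicts,
# regular files map to blob IDs.
tree = {
    "a": {"matches.txt": "blob-1", "nomatch.txt": "blob-2"},
    "b": {"matches.md": "blob-3"},
}

def collect_missing_blobs(tree, missing):
    for entry in tree.values():
        if isinstance(entry, dict):       # subdirectory: recurse
            collect_missing_blobs(entry, missing)
        elif entry not in local_store:    # regular file: note if absent
            missing.add(entry)

def batch_fetch(oids):
    # Stand-in for promisor_remote_get_direct(): one request, all oids.
    local_store.update(oids)
    return len(oids)

missing = set()
collect_missing_blobs(tree, missing)
fetched = batch_fetch(sorted(missing))   # one request instead of two
```

The real implementation additionally honors the pathspec while walking,
so only blobs that grep would actually read get requested.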
Signed-off-by: Elijah Newren <newren@gmail.com>
---
builtin/grep.c | 143 ++++++++++++++++++++++++++++++++++++++++++++++++
t/t7810-grep.sh | 58 ++++++++++++++++++++
2 files changed, 201 insertions(+)
diff --git a/builtin/grep.c b/builtin/grep.c
index e33285e5e6..85656d8d3f 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -28,9 +28,12 @@
#include "object-file.h"
#include "object-name.h"
#include "odb.h"
+#include "oid-array.h"
+#include "oidset.h"
#include "packfile.h"
#include "pager.h"
#include "path.h"
+#include "promisor-remote.h"
#include "read-cache-ll.h"
#include "write-or-die.h"
@@ -692,6 +695,144 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
return hit;
}
+static void collect_blob_oids_for_tree(struct repository *repo,
+ const struct pathspec *pathspec,
+ struct tree_desc *tree,
+ struct strbuf *base,
+ int tn_len,
+ struct oidset *blob_oids)
+{
+ struct name_entry entry;
+ int old_baselen = base->len;
+ struct strbuf name = STRBUF_INIT;
+ enum interesting match = entry_not_interesting;
+
+ while (tree_entry(tree, &entry)) {
+ if (match != all_entries_interesting) {
+ strbuf_addstr(&name, base->buf + tn_len);
+ match = tree_entry_interesting(repo->index,
+ &entry, &name,
+ pathspec);
+ strbuf_reset(&name);
+
+ if (match == all_entries_not_interesting)
+ break;
+ if (match == entry_not_interesting)
+ continue;
+ }
+
+ strbuf_add(base, entry.path, tree_entry_len(&entry));
+
+ if (S_ISREG(entry.mode)) {
+ if (!odb_has_object(repo->objects, &entry.oid, 0))
+ oidset_insert(blob_oids, &entry.oid);
+ } else if (S_ISDIR(entry.mode)) {
+ enum object_type type;
+ struct tree_desc sub_tree;
+ void *data;
+ unsigned long size;
+
+ data = odb_read_object(repo->objects, &entry.oid,
+ &type, &size);
+ if (!data)
+ die(_("unable to read tree (%s)"),
+ oid_to_hex(&entry.oid));
+
+ strbuf_addch(base, '/');
+ init_tree_desc(&sub_tree, &entry.oid, data, size);
+ collect_blob_oids_for_tree(repo, pathspec, &sub_tree,
+ base, tn_len, blob_oids);
+ free(data);
+ }
+ /*
+ * ...no else clause for S_ISGITLINK: submodules have their
+ * own promisor configuration and would need separate fetches
+ * anyway.
+ */
+
+ strbuf_setlen(base, old_baselen);
+ }
+
+ strbuf_release(&name);
+}
+
+static void collect_blob_oids_for_treeish(struct grep_opt *opt,
+ const struct pathspec *pathspec,
+ const struct object_id *tree_ish_oid,
+ const char *name,
+ struct oidset *blob_oids)
+{
+ struct tree_desc tree;
+ void *data;
+ unsigned long size;
+ struct strbuf base = STRBUF_INIT;
+ int len;
+
+ data = odb_read_object_peeled(opt->repo->objects, tree_ish_oid,
+ OBJ_TREE, &size, NULL);
+
+ if (!data)
+ return;
+
+ len = name ? strlen(name) : 0;
+ if (len) {
+ strbuf_add(&base, name, len);
+ strbuf_addch(&base, ':');
+ }
+ init_tree_desc(&tree, tree_ish_oid, data, size);
+
+ collect_blob_oids_for_tree(opt->repo, pathspec, &tree,
+ &base, base.len, blob_oids);
+
+ strbuf_release(&base);
+ free(data);
+}
+
+static void prefetch_grep_blobs(struct grep_opt *opt,
+ const struct pathspec *pathspec,
+ const struct object_array *list)
+{
+ struct oidset blob_oids = OIDSET_INIT;
+
+ /* Exit if we're not in a partial clone */
+ if (!repo_has_promisor_remote(opt->repo))
+ return;
+
+ /* For each tree, gather the blobs in it */
+ for (int i = 0; i < list->nr; i++) {
+ struct object *real_obj;
+
+ obj_read_lock();
+ real_obj = deref_tag(opt->repo, list->objects[i].item,
+ NULL, 0);
+ obj_read_unlock();
+
+ if (real_obj &&
+ (real_obj->type == OBJ_COMMIT ||
+ real_obj->type == OBJ_TREE))
+ collect_blob_oids_for_treeish(opt, pathspec,
+ &real_obj->oid,
+ list->objects[i].name,
+ &blob_oids);
+ }
+
+ /* Prefetch the blobs we found */
+ if (oidset_size(&blob_oids)) {
+ struct oid_array to_fetch = OID_ARRAY_INIT;
+ struct oidset_iter iter;
+ const struct object_id *oid;
+
+ oidset_iter_init(&blob_oids, &iter);
+ while ((oid = oidset_iter_next(&iter)))
+ oid_array_append(&to_fetch, oid);
+
+ promisor_remote_get_direct(opt->repo, to_fetch.oid, to_fetch.nr);
+
+ oid_array_clear(&to_fetch);
+ }
+ oidset_clear(&blob_oids);
+}
+
static int grep_object(struct grep_opt *opt, const struct pathspec *pathspec,
struct object *obj, const char *name, const char *path)
{
@@ -732,6 +873,8 @@ static int grep_objects(struct grep_opt *opt, const struct pathspec *pathspec,
int hit = 0;
const unsigned int nr = list->nr;
+ prefetch_grep_blobs(opt, pathspec, list);
+
for (i = 0; i < nr; i++) {
struct object *real_obj;
diff --git a/t/t7810-grep.sh b/t/t7810-grep.sh
index 64ac4f04ee..3d08fd2a0c 100755
--- a/t/t7810-grep.sh
+++ b/t/t7810-grep.sh
@@ -1929,4 +1929,62 @@ test_expect_success 'grep does not report i-t-a and assume unchanged with -L' '
test_cmp expected actual
'
+test_expect_success 'grep of revision in partial clone batches prefetch and honors pathspec' '
+ test_when_finished "rm -rf grep-partial-src grep-partial" &&
+
+ git init grep-partial-src &&
+ (
+ cd grep-partial-src &&
+ git config uploadpack.allowfilter 1 &&
+ git config uploadpack.allowanysha1inwant 1 &&
+ mkdir a b &&
+ echo "needle in haystack" >a/matches.txt &&
+ echo "nothing to see here" >a/nomatch.txt &&
+ echo "needle again" >b/matches.md &&
+ git add . &&
+ git commit -m "initial"
+ ) &&
+
+ git clone --no-checkout --filter=blob:none \
+ "file://$(pwd)/grep-partial-src" grep-partial &&
+
+ # All three blobs are missing immediately after a blobless clone.
+ git -C grep-partial rev-list --quiet --objects \
+ --missing=print HEAD >missing &&
+ test_line_count = 3 missing &&
+
+ # A pathspec-limited grep should prefetch only the two blobs
+ # in a/. It should fetch both blobs in one batched request.
+ GIT_TRACE2_EVENT="$(pwd)/grep-trace-pathspec" \
+ git -C grep-partial grep -c "needle" HEAD -- "a/*.txt" >result &&
+
+ # Only a/matches.txt contains "needle" among the matched paths.
+ test_line_count = 1 result &&
+
+ # Exactly the two a/*.txt blobs should have been requested, and
+ # the server packed those two objects in the response.
+ test_trace2_data promisor fetch_count 2 <grep-trace-pathspec &&
+ test_trace2_data pack-objects written 2 <grep-trace-pathspec &&
+
+ # b/matches.md should still be missing locally.
+ git -C grep-partial rev-list --quiet --objects \
+ --missing=print HEAD >missing &&
+ test_line_count = 1 missing &&
+
+ # A second grep without a pathspec must recurse into both
+ # subdirectories, but should request only the still-missing blob
+ # from the promisor.
+ GIT_TRACE2_EVENT="$(pwd)/grep-trace-all" \
+ git -C grep-partial grep -c "needle" HEAD >result &&
+
+ test_line_count = 2 result &&
+ test_trace2_data promisor fetch_count 1 <grep-trace-all &&
+ test_trace2_data pack-objects written 1 <grep-trace-all &&
+
+ # Everything is local now.
+ git -C grep-partial rev-list --quiet --objects \
+ --missing=print HEAD >missing &&
+ test_line_count = 0 missing
+'
+
test_done
--
gitgitgadget
Thread overview: 25+ messages
2026-04-16 22:48 [PATCH 0/3] Batch prefetching Elijah Newren via GitGitGadget
2026-04-16 22:48 ` [PATCH 1/3] patch-ids.h: add missing trailing parenthesis in documentation comment Elijah Newren via GitGitGadget
2026-04-16 22:48 ` [PATCH 2/3] builtin/log: prefetch necessary blobs for `git cherry` Elijah Newren via GitGitGadget
2026-04-17 21:42 ` Junio C Hamano
2026-04-17 22:02 ` Elijah Newren
2026-04-16 22:48 ` [PATCH 3/3] grep: prefetch necessary blobs Elijah Newren via GitGitGadget
2026-04-18 0:32 ` [PATCH v2 0/3] Batch prefetching Elijah Newren via GitGitGadget
2026-04-18 0:32 ` [PATCH v2 1/3] patch-ids.h: add missing trailing parenthesis in documentation comment Elijah Newren via GitGitGadget
2026-04-18 0:32 ` [PATCH v2 2/3] builtin/log: prefetch necessary blobs for `git cherry` Elijah Newren via GitGitGadget
2026-04-19 14:04 ` Phillip Wood
2026-04-21 21:28 ` Elijah Newren
2026-04-23 15:15 ` Phillip Wood
2026-04-23 17:38 ` Elijah Newren
2026-04-27 13:16 ` Derrick Stolee
2026-05-11 2:51 ` Junio C Hamano
2026-05-11 17:45 ` Elijah Newren
2026-05-13 23:17 ` Elijah Newren
2026-04-18 0:32 ` [PATCH v2 3/3] grep: prefetch necessary blobs Elijah Newren via GitGitGadget
2026-04-27 12:59 ` Derrick Stolee
2026-05-13 19:21 ` Elijah Newren
2026-05-14 16:25 ` [PATCH v3 0/4] Batch prefetching Elijah Newren via GitGitGadget
2026-05-14 16:25 ` [PATCH v3 1/4] promisor-remote: document caller filtering contract Elijah Newren via GitGitGadget
2026-05-14 16:25 ` [PATCH v3 2/4] patch-ids.h: add missing trailing parenthesis in documentation comment Elijah Newren via GitGitGadget
2026-05-14 16:25 ` [PATCH v3 3/4] builtin/log: prefetch necessary blobs for `git cherry` Elijah Newren via GitGitGadget
2026-05-14 16:25 ` [PATCH v3 4/4] grep: prefetch necessary blobs Elijah Newren via GitGitGadget