From: "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: Elijah Newren <newren@gmail.com>,
Phillip Wood <phillip.wood123@gmail.com>,
Derrick Stolee <stolee@gmail.com>,
Elijah Newren <newren@gmail.com>
Subject: [PATCH v3 0/4] Batch prefetching
Date: Thu, 14 May 2026 16:25:24 +0000 [thread overview]
Message-ID: <pull.2089.v3.git.1778775928.gitgitgadget@gmail.com> (raw)
In-Reply-To: <pull.2089.v2.git.1776472347.gitgitgadget@gmail.com>
Changes since v2:
* Modified the final patch as suggested by Stolee to include pathspec usage
in the testcase
* Modified the last two patches to not re-download blobs we already have
locally, and adjusted the tests to verify
* Inserted a new first patch, containing a documentation addition that
would have helped me avoid making the above mistake in the first place.
Note: Stolee also suggest some code sharing or code movement in his review
of v2 2/3, but possibly based on a misunderstanding of v2 2/3 (that patch
isn't about a diff) and it's not clear to me what could be shared or moved,
so that's not part of this round.
Changes since v1:
* Remove stray file that should have never been added. So embarrassing that
I didn't catch that before submitting.
Original cover-letter:
======================
Partial clones provide a trade-off for users: avoid downloading blobs
upfront, at the expense of needing to download them later as they run other
commands. This tradeoff can sometimes incur a more severe cost than
expected, particularly if needed blobs are discovered as they are accessed,
resulting in downloading blobs one at a time. Some commands like checkout,
diff, and merge do batch prefetches of necessary blobs, since that can
dramatically reduce the pain of on-demand loading. Extend this ability to
two more commands: cherry and grep.
This series was spurred by a report where git cherry jobs were each doing
hundreds of single-blob fetches, at a cost of 3s each. Batching those
downloads should dramatically speed up their jobs. (And I decided to fix up
git grep similarly while at it.)
I'll also note that git backfill with revisions and/or pathspecs could also
improve things for these users, but since backfill is a manual command users
would have to run and requires users to try to figure out which data is
needed (a challenge in the case of cherry), it still makes sense to provide
smarter behavior for folks who don't choose to manually run backfill.
Also, correct a documentation typo I noticed in patch-ids.h (related to code
I was using for the git cherry fixes) as a preparatory fixup.
Elijah Newren (4):
promisor-remote: document caller filtering contract
patch-ids.h: add missing trailing parenthesis in documentation comment
builtin/log: prefetch necessary blobs for `git cherry`
grep: prefetch necessary blobs
builtin/grep.c | 143 ++++++++++++++++++++++++++++++++++++++++++++++
builtin/log.c | 131 ++++++++++++++++++++++++++++++++++++++++++
patch-ids.h | 2 +-
promisor-remote.h | 11 ++++
t/t3500-cherry.sh | 27 +++++++++
t/t7810-grep.sh | 58 +++++++++++++++++++
6 files changed, 371 insertions(+), 1 deletion(-)
base-commit: 9f223ef1c026d91c7ac68cc0211bde255dda6199
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-2089%2Fnewren%2Fbatch-prefetching-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-2089/newren/batch-prefetching-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/2089
Range-diff vs v2:
-: ---------- > 1: 6ad11e2c28 promisor-remote: document caller filtering contract
1: 663816a344 = 2: 08a2c6517b patch-ids.h: add missing trailing parenthesis in documentation comment
2: a705852723 ! 3: c0655e5d41 builtin/log: prefetch necessary blobs for `git cherry`
@@ builtin/log.c: static void print_commit(char sign, struct commit *commit, int ve
+ if (drv && drv->binary > 0)
+ continue;
+
-+ if (DIFF_FILE_VALID(p->one))
++ if (DIFF_FILE_VALID(p->one) &&
++ odb_read_object_info_extended(opts->repo->objects,
++ &p->one->oid, NULL,
++ OBJECT_INFO_FOR_PREFETCH))
+ oidset_insert(blobs, &p->one->oid);
-+ if (DIFF_FILE_VALID(p->two))
++ if (DIFF_FILE_VALID(p->two) &&
++ odb_read_object_info_extended(opts->repo->objects,
++ &p->two->oid, NULL,
++ OBJECT_INFO_FOR_PREFETCH))
+ oidset_insert(blobs, &p->two->oid);
+ }
+ diff_queue_clear(q);
@@ t/t3500-cherry.sh: test_expect_success 'cherry ignores whitespace' '
+
+ grep "child_start.*fetch.negotiationAlgorithm" trace.output >fetches &&
+ test_line_count = 1 fetches &&
-+ test_trace2_data promisor fetch_count 4 <trace.output
++ test_trace2_data promisor fetch_count 4 <trace.output &&
++
++ # A second invocation should not refetch any blobs, since
++ # the prefetch is expected to filter out OIDs that are
++ # already present locally.
++ GIT_TRACE2_EVENT="$(pwd)/trace2.output" git cherry upstream-with-space feature-without-space >actual &&
++ test_cmp ../expect actual &&
++
++ ! grep "child_start.*fetch.negotiationAlgorithm" trace2.output &&
++ ! grep "\"key\":\"fetch_count\"" trace2.output
+ )
+'
+
3: 8fbfe69bc4 ! 4: 75d4ca7cff grep: prefetch necessary blobs
@@ builtin/grep.c: static int grep_tree(struct grep_opt *opt, const struct pathspec
+ strbuf_add(base, entry.path, tree_entry_len(&entry));
+
+ if (S_ISREG(entry.mode)) {
-+ oidset_insert(blob_oids, &entry.oid);
++ if (!odb_has_object(repo->objects, &entry.oid, 0))
++ oidset_insert(blob_oids, &entry.oid);
+ } else if (S_ISDIR(entry.mode)) {
+ enum object_type type;
+ struct tree_desc sub_tree;
@@ t/t7810-grep.sh: test_expect_success 'grep does not report i-t-a and assume unch
test_cmp expected actual
'
-+test_expect_success 'grep of revision in partial clone does bulk prefetch' '
++test_expect_success 'grep of revision in partial clone batches prefetch and honors pathspec' '
+ test_when_finished "rm -rf grep-partial-src grep-partial" &&
+
+ git init grep-partial-src &&
@@ t/t7810-grep.sh: test_expect_success 'grep does not report i-t-a and assume unch
+ cd grep-partial-src &&
+ git config uploadpack.allowfilter 1 &&
+ git config uploadpack.allowanysha1inwant 1 &&
-+ echo "needle in haystack" >searchme &&
-+ echo "no match here" >other &&
-+ mkdir subdir &&
-+ echo "needle again" >subdir/deep &&
++ mkdir a b &&
++ echo "needle in haystack" >a/matches.txt &&
++ echo "nothing to see here" >a/nomatch.txt &&
++ echo "needle again" >b/matches.md &&
+ git add . &&
+ git commit -m "initial"
+ ) &&
@@ t/t7810-grep.sh: test_expect_success 'grep does not report i-t-a and assume unch
+ git clone --no-checkout --filter=blob:none \
+ "file://$(pwd)/grep-partial-src" grep-partial &&
+
-+ # All blobs should be missing after a blobless clone.
++ # All three blobs are missing immediately after a blobless clone.
+ git -C grep-partial rev-list --quiet --objects \
+ --missing=print HEAD >missing &&
+ test_line_count = 3 missing &&
+
-+ # grep HEAD should batch-prefetch all blobs in one request.
-+ GIT_TRACE2_EVENT="$(pwd)/grep-trace" \
++ # A pathspec-limited grep should prefetch only the two blobs
++ # in a/. It should fetch both blobs in one batched request.
++ GIT_TRACE2_EVENT="$(pwd)/grep-trace-pathspec" \
++ git -C grep-partial grep -c "needle" HEAD -- "a/*.txt" >result &&
++
++ # Only a/matches.txt contains "needle" among the matched paths.
++ test_line_count = 1 result &&
++
++ # Exactly the two a/*.txt blobs should have been requested, and
++ # the server packed those two objects in the response.
++ test_trace2_data promisor fetch_count 2 <grep-trace-pathspec &&
++ test_trace2_data pack-objects written 2 <grep-trace-pathspec &&
++
++ # b/matches.md should still be missing locally.
++ git -C grep-partial rev-list --quiet --objects \
++ --missing=print HEAD >missing &&
++ test_line_count = 1 missing &&
++
++ # A second grep without a pathspec must recurse into both
++ # subdirectories, but should request only the still-missing blob
++ # from the promisor.
++ GIT_TRACE2_EVENT="$(pwd)/grep-trace-all" \
+ git -C grep-partial grep -c "needle" HEAD >result &&
+
-+ # Should find matches in two files.
+ test_line_count = 2 result &&
++ test_trace2_data promisor fetch_count 1 <grep-trace-all &&
++ test_trace2_data pack-objects written 1 <grep-trace-all &&
+
-+ # Should have prefetched all 3 objects at once
-+ test_trace2_data promisor fetch_count 3 <grep-trace
++ # Everything is local now.
++ git -C grep-partial rev-list --quiet --objects \
++ --missing=print HEAD >missing &&
++ test_line_count = 0 missing
+'
+
test_done
--
gitgitgadget
next prev parent reply other threads:[~2026-05-14 16:25 UTC|newest]
Thread overview: 25+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-16 22:48 [PATCH 0/3] Batch prefetching Elijah Newren via GitGitGadget
2026-04-16 22:48 ` [PATCH 1/3] patch-ids.h: add missing trailing parenthesis in documentation comment Elijah Newren via GitGitGadget
2026-04-16 22:48 ` [PATCH 2/3] builtin/log: prefetch necessary blobs for `git cherry` Elijah Newren via GitGitGadget
2026-04-17 21:42 ` Junio C Hamano
2026-04-17 22:02 ` Elijah Newren
2026-04-16 22:48 ` [PATCH 3/3] grep: prefetch necessary blobs Elijah Newren via GitGitGadget
2026-04-18 0:32 ` [PATCH v2 0/3] Batch prefetching Elijah Newren via GitGitGadget
2026-04-18 0:32 ` [PATCH v2 1/3] patch-ids.h: add missing trailing parenthesis in documentation comment Elijah Newren via GitGitGadget
2026-04-18 0:32 ` [PATCH v2 2/3] builtin/log: prefetch necessary blobs for `git cherry` Elijah Newren via GitGitGadget
2026-04-19 14:04 ` Phillip Wood
2026-04-21 21:28 ` Elijah Newren
2026-04-23 15:15 ` Phillip Wood
2026-04-23 17:38 ` Elijah Newren
2026-04-27 13:16 ` Derrick Stolee
2026-05-11 2:51 ` Junio C Hamano
2026-05-11 17:45 ` Elijah Newren
2026-05-13 23:17 ` Elijah Newren
2026-04-18 0:32 ` [PATCH v2 3/3] grep: prefetch necessary blobs Elijah Newren via GitGitGadget
2026-04-27 12:59 ` Derrick Stolee
2026-05-13 19:21 ` Elijah Newren
2026-05-14 16:25 ` Elijah Newren via GitGitGadget [this message]
2026-05-14 16:25 ` [PATCH v3 1/4] promisor-remote: document caller filtering contract Elijah Newren via GitGitGadget
2026-05-14 16:25 ` [PATCH v3 2/4] patch-ids.h: add missing trailing parenthesis in documentation comment Elijah Newren via GitGitGadget
2026-05-14 16:25 ` [PATCH v3 3/4] builtin/log: prefetch necessary blobs for `git cherry` Elijah Newren via GitGitGadget
2026-05-14 16:25 ` [PATCH v3 4/4] grep: prefetch necessary blobs Elijah Newren via GitGitGadget
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=pull.2089.v3.git.1778775928.gitgitgadget@gmail.com \
--to=gitgitgadget@gmail.com \
--cc=git@vger.kernel.org \
--cc=newren@gmail.com \
--cc=phillip.wood123@gmail.com \
--cc=stolee@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox