Re: [PATCH v2 2/3] builtin/log: prefetch necessary blobs for `git cherry`

Git development
 help / color / mirror / Atom feed

From: Derrick Stolee <stolee@gmail.com>
To: Elijah Newren <newren@gmail.com>
Cc: Elijah Newren via GitGitGadget <gitgitgadget@gmail.com>,
	git@vger.kernel.org
Subject: Re: [PATCH v2 2/3] builtin/log: prefetch necessary blobs for `git cherry`
Date: Mon, 18 May 2026 08:14:03 -0400	[thread overview]
Message-ID: <9ff558a4-1be8-4a77-999a-b32c1812e4de@gmail.com> (raw)
In-Reply-To: <CABPp-BGpXgDfJeDEB91U-h092-8L6Q_MLrzSLFg9HotPDZ-m-g@mail.gmail.com>

On 5/13/2026 7:17 PM, Elijah Newren wrote:
> On Mon, Apr 27, 2026 at 6:17 AM Derrick Stolee <stolee@gmail.com> wrote:
>>
>> On 4/17/2026 8:32 PM, Elijah Newren via GitGitGadget wrote:
>>> From: Elijah Newren <newren@gmail.com>
>>> +static void collect_diff_blob_oids(struct commit *commit,
>>> +                                struct diff_options *opts,
>>> +                                struct oidset *blobs)
>>
>> I think that this is generally a good idea, though I worry that
>> having this hidden in builtin/log.c may not be the right long-
>> term home.
>>
>> I expect that we'll find more and more examples where we want to
>> prefetch blobs in different operations, those that exist now and
>> those that may be created in the future. It would be preferred if
>> they could automatically take advantage of the logic already in
>> diff_queued_diff_prefetch() within diffcore_std() in diff.c.
>>
>> Ultimately, _this_ patch cares about a diff.
> 
> I read this patch a bit differently -- could you say more about what
> you have in mind?
> 
> The body of collect_diff_blob_oids() really is just diff_tree_oid() +
> diffcore_std() + process each pair, so at the per-commit level I am
> already leaning on the diff library.  One of the things this patch
> adds is accumulation across many commits: the containing loop (in
> prefetch_cherry_blobs) is over a commit range, not over a single diff.
> 
> Concretely, the motivating case was a patch touching a few files where
> upstream had tens of thousands of commits in <limit>..<head>, several
> hundred of which modified the same set of files.  A per-diff prefetch
> like diff.c uses would turn that into hundreds of small fetches of 1-3
> blobs each; what this series gives you is one fetch.  So the win
> really does live above the diff library, not inside it.

My initial thought was about finding what we can abstract into the
diff API for later reuse. Upon rereading, it's clear that this is
tied very closely with the --cherry feature and wouldn't make a lot
of sense in the API layer.

> There are two further wrinkles in cherry that are filters layered on
> top of the cross-commit accumulation, and they're cherry-specific in a
> way that I don't think belongs in the diff library:
> 
>    1. For most commits in <limit>..<head>, cherry doesn't care about
> the diff at all -- if the list of files modified doesn't exactly match
> the commit of interest, the commit is skipped before patch-id is even
> computed.  Prefetching for those would be wasted.
> 
>    2. We skip prefetching content for binary files (because patch-id
> uses oid_to_hex() for such files instead of the diff contents).
> 
>> Could we compute a
>> "diff prep" computation using the core diff library instead of
>> inventing a second queue of results for diffing?
> 
> To check this concretely I looked at each of the existing
> promisor_remote_get_direct() callsites for a similar producer.  The
> closest cousin of collect_diff_blob_oids() (the only part of this
> patch that looks like it might be close to the right shape to put in a
> core diff library) is diff.c's diff_queued_diff_prefetch() -- but it
> operates on the already-populated global diff_queued_diff and fetches
> immediately, rather than setting up the diff itself and returning an
> oidset for the caller to accumulate.  Reshaping it to match cherry's
> needs would either break its current caller in diffcore_std() or
> introduce a parallel function whose only consumer is cherry.  None of
> the other sites (path-walk in backfill, index walk in read-cache,
> three-way state in merge-ort, etc.) do anything resembling "diff two
> trees and harvest oids."
> 
> And even if we did factor a helper out, cherry's filter is
> patch-id-specific: commit_patch_id() substitutes oid_to_hex() for
> files marked binary by their userdiff driver, so we deliberately skip
> prefetching those.  That isn't a generic "diff prep" consideration --
> it only makes sense because the caller is patch-id.  We could express
> it as a predicate parameter, but with one caller that would feel to me
> like it's just pushing cherry's policy across an API boundary for no
> gain.

Thanks for the additional context. I agree with your assessment.

>> Patch 3 cares about a "scan prep" which cares about loading all
>> blobs for a given tree with respect to a pathspec. This is very
>> similar to what a checkout would do, though it ultimately uses
>> a form of diff to find out what change should be applied to the
>> working directory. Perhaps 'git archive' is a better matching
>> example.
> 
> Agreed that archive is the closer analog -- both grep and archive do a
> pathspec-filtered single-tree walk, whereas checkout's prefetch is
> tied to the index and optimizes to the subset of paths that are
> different since the previous version checked out.  Retrofitting that
> to grep would mean materializing an index for the target revision just
> to throw it away, which feels like more machinery to bridge the
> abstractions than the walk itself would take.

Makes sense.

>> By implementing things in a
>> common location, then we can have later integrations add to the
>> confidence in the feature through tests covering each user-facing
>> use.
> 
> Sounds great...but what common user-facing uses exist?
> 
> Looking at the existing 11 callsites of promisor_remote_get_direct()
> after this series [1], each has pretty specialized data needs --
> index-driven (read-cache), index-pack & pack-objects internals,
> path-walk batches (backfill), merge-ort's three-way logic,
> diffcore-rename's two independent rename-detection paths, plain old
> diffs, collection across a subset of commits (cherry),
> pathspec-filtered tree walk (grep), and
> on-demand-single-blob-at-a-time (odb.c) -- so I don't see a natural
> shared layer above the primitive itself (which is already
> promisor_remote_get_direct).
> 
> archive, if it had prefetch logic, would be the first match.  But it's
> not clear where the shared logic between grep and archive would live,
> if archive even had any prefetch logic to share.
> 
> So I'm inclined to leave both new producers local to their builtins
> for now, and factor a tree-walk helper when archive (or a third
> caller) actually wants one.  But I'm happy to be told I've missed the
> boat.

No, clearly I missed the boat. Thanks for giving me insight to your
deep understanding of this area. You executed on a good design based
on the right amount of specialization required for this need.

Thanks,
-Stolee

next prev parent reply	other threads:[~2026-05-18 12:14 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-16 22:48 [PATCH 0/3] Batch prefetching Elijah Newren via GitGitGadget
2026-04-16 22:48 ` [PATCH 1/3] patch-ids.h: add missing trailing parenthesis in documentation comment Elijah Newren via GitGitGadget
2026-04-16 22:48 ` [PATCH 2/3] builtin/log: prefetch necessary blobs for `git cherry` Elijah Newren via GitGitGadget
2026-04-17 21:42   ` Junio C Hamano
2026-04-17 22:02     ` Elijah Newren
2026-04-16 22:48 ` [PATCH 3/3] grep: prefetch necessary blobs Elijah Newren via GitGitGadget
2026-04-18  0:32 ` [PATCH v2 0/3] Batch prefetching Elijah Newren via GitGitGadget
2026-04-18  0:32   ` [PATCH v2 1/3] patch-ids.h: add missing trailing parenthesis in documentation comment Elijah Newren via GitGitGadget
2026-04-18  0:32   ` [PATCH v2 2/3] builtin/log: prefetch necessary blobs for `git cherry` Elijah Newren via GitGitGadget
2026-04-19 14:04     ` Phillip Wood
2026-04-21 21:28       ` Elijah Newren
2026-04-23 15:15         ` Phillip Wood
2026-04-23 17:38           ` Elijah Newren
2026-04-27 13:16     ` Derrick Stolee
2026-05-11  2:51       ` Junio C Hamano
2026-05-11 17:45         ` Elijah Newren
2026-05-13 23:17       ` Elijah Newren
2026-05-18 12:14         ` Derrick Stolee [this message]
2026-04-18  0:32   ` [PATCH v2 3/3] grep: prefetch necessary blobs Elijah Newren via GitGitGadget
2026-04-27 12:59     ` Derrick Stolee
2026-05-13 19:21       ` Elijah Newren
2026-05-14 16:25   ` [PATCH v3 0/4] Batch prefetching Elijah Newren via GitGitGadget
2026-05-14 16:25     ` [PATCH v3 1/4] promisor-remote: document caller filtering contract Elijah Newren via GitGitGadget
2026-05-14 16:25     ` [PATCH v3 2/4] patch-ids.h: add missing trailing parenthesis in documentation comment Elijah Newren via GitGitGadget
2026-05-14 16:25     ` [PATCH v3 3/4] builtin/log: prefetch necessary blobs for `git cherry` Elijah Newren via GitGitGadget
2026-05-14 16:25     ` [PATCH v3 4/4] grep: prefetch necessary blobs Elijah Newren via GitGitGadget
2026-05-18 12:17     ` [PATCH v3 0/4] Batch prefetching Derrick Stolee
2026-05-18 12:44       ` Junio C Hamano
2026-05-18 16:55       ` Elijah Newren

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9ff558a4-1be8-4a77-999a-b32c1812e4de@gmail.com \
    --to=stolee@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=newren@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox