Re: [GSoC Proposal] Implement promisor remote fetch ordering

From: Lorenzo Pegorari <lorenzo.pegorari2002@gmail.com>
To: Christian Couder <christian.couder@gmail.com>
Cc: git@vger.kernel.org, Karthik Nayak <karthik.188@gmail.com>,
	Justin Tobler <jltobler@gmail.com>,
	Siddharth Asthana <siddharthasthana31@gmail.com>,
	Ayush Chandekar <ayu.chandekar@gmail.com>,
	Junio C Hamano <gitster@pobox.com>
Subject: Re: [GSoC Proposal] Implement promisor remote fetch ordering
Date: Wed, 18 Mar 2026 17:29:06 +0100
Message-ID: <abrS0q_Oc3kn_T3Y@lorenzo-VM>
In-Reply-To: <CAP8UFD1=Ow6NNFKK6y5csmneVaS0J+e5z9pGjFmaVoJ2g1OPFg@mail.gmail.com>

On Sat, Mar 14, 2026 at 06:30:57PM +0100, Christian Couder wrote:
> On Tue, Mar 10, 2026 at 7:25 PM Lorenzo Pegorari
> <lorenzo.pegorari2002@gmail.com> wrote:
> >
> > The following is my proposal for the GSoC'26 for the project "Implement
> > promisor remote fetch ordering".
> 
> Thank you for your interest in Git and this project.

Thank you for reading and giving me feedback on my proposal!

> > As soon as the contributor application period begins, I will submit
> > the proposal in PDF format to the official GSoC website.
> 
> Good idea.

I will send v2 and upload it pretty soon.

> For the patches that are merged to master, it could help if you could
> give the object ID of the merge commit that merged your commits into
> master, or alternatively the object ID of all your commits.

Ack.

> >  * [GSoC PATCH v3] doc: improve gitprotocol-pack
> >    * Link: https://lore.kernel.org/git/cover.1772502209.git.lorenzo.pegorari2002@gmail.com
> >    * Description: Improved the `gitprotocol-pack` documentation.
> >    * Status: Will merge to `master`.
> 
> Yeah, this has been merged to master after your email.

Ack.

> > Partial clones avoid this issue during `clone` and `fetch` operations by
> > passing all the objects to download through a `--filter=<filter-spec>`
> > specified by the user, which will limit the number of blobs and trees
> > that actually get downloaded. The `<filter-spec>`, can, for example, be:
> >  * `blob:none`, which will filter out all blobs.
> >  * `tree:0`, which will filter out all trees.
> >  * `blob:limit=5k`, which will filter out all blobs whose size is greater
> >    than $5$kB.
> 
> Why are there '$' signs above?

Oops. I wrote the proposal in Markdown with LaTeX support, where text
between "$" signs is treated as LaTeX. I forgot to remove them when
sending the email. My fault.

> > The filtered out objects will be lazily downloaded when the user runs a
> > command that requires those missing data.
> >
> > This mechanism works with the following steps:
> >  * When the client wants to fetch some objects from the server using a
> >    filter, the client, after sending a list of capabilities it wants to
> >    be in effect, sends the `filter: <filter-spec>` capability, followed
> >    by a request for the objects that the client wants to retrieve. The
> >    following is an example of a request (extracted using
> >    `GIT_TRACE_PACKET=1`) made by a client to a server to fetch 1 object
> >    using the `<filter-spec>=blob:none`:
> >
> >    ```
> >    [...]
> >    pkt-line.c:85           packet:        fetch< 0000  # "flush-pkt"
> >    pkt-line.c:85           packet:        fetch> command=fetch  # Execute fetch
> >    pkt-line.c:85           packet:        fetch> agent=git/2.43.0
> >    pkt-line.c:85           packet:        fetch> object-format=sha1
> >    pkt-line.c:85           packet:        fetch> 0001  # "delim-pkt"
> >    pkt-line.c:85           packet:        fetch> thin-pack  # Capability
> >    pkt-line.c:85           packet:        fetch> no-progress  # Capability
> >    pkt-line.c:85           packet:        fetch> ofs-delta  # Capability
> >    pkt-line.c:85           packet:        fetch> filter blob:none  # Filter capability
> >    # OID of the object the client wants to retrieve
> >    pkt-line.c:85           packet:        fetch> want 394ca7a7b5e75a57e736040480f685c8b71844eb
> >    pkt-line.c:85           packet:        fetch> done  # End fetch
> >    pkt-line.c:85           packet:        fetch> 0000  # "flush-pkt"
> >    [...]
> >    ```
> 
> I think when lazy fetching like this, the filter is always blob:none.
> It's not really used anyway because the objects that the client wants
> are specified explicitly.

Oh, I didn't know that. Makes sense.

> The filter is important when initially cloning or fetching from the
> server to specify which objects are initially excluded, even if some
> of these objects will be lazy fetched soon. For example the checkout
> part of a clone might need objects that were initially excluded, so it
> might lazy fetch some.

Ooh ok, with this comment it finally clicks. Looking back at the
`GIT_TRACE_PACKET` output, I can now follow almost all of it. So the
initial partial clone fetches (usually) `HEAD` while excluding the
filtered-out objects, whereas a lazy fetch asks directly for the missing
objects when they are needed, so the filter plays no real role there.
Got it!
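
To check my understanding, here is a minimal Python sketch of how a lazy
fetch request like the one in the trace could be assembled as pkt-lines.
The helper names (`pkt_line`, `lazy_fetch_request`) are made up for
illustration and are not Git functions, and I am assuming every payload
ends with a newline. The capability list is simply copied from the trace:

```python
FLUSH_PKT = b"0000"  # "flush-pkt": ends the request
DELIM_PKT = b"0001"  # "delim-pkt": separates capabilities from arguments


def pkt_line(payload):
    """Encode one pkt-line: 4 hex digits of total length, then the data.

    The length counts the 4 length bytes themselves.
    """
    data = payload.encode() + b"\n"
    return b"%04x" % (len(data) + 4) + data


def lazy_fetch_request(oids, agent="git/2.43.0"):
    """Build a protocol v2 fetch request that names the wanted objects.

    The filter line is always `blob:none` here; it has no real effect
    because every wanted object is listed explicitly.
    """
    parts = [
        pkt_line("command=fetch"),
        pkt_line("agent=" + agent),
        pkt_line("object-format=sha1"),
        DELIM_PKT,
        pkt_line("thin-pack"),
        pkt_line("no-progress"),
        pkt_line("ofs-delta"),
        pkt_line("filter blob:none"),
    ]
    parts += [pkt_line("want " + oid) for oid in oids]
    parts += [pkt_line("done"), FLUSH_PKT]
    return b"".join(parts)


request = lazy_fetch_request(["394ca7a7b5e75a57e736040480f685c8b71844eb"])
```

The point is that the `want` lines carry all the information, which is why
the filter does not matter for a lazy fetch.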

> >  * The server will apply the requested `<filter-spec>` as it creates the
> >    "promisor packfile" of the requested objects.
> 
> This is important during an initial clone or fetch, not when lazy fetching.

Got it. I will revisit all the places where I confused lazy fetching
with initial cloning/fetching. Thank you so much for your explanation,
Christian!

> > A packfile is a binary
> >    file that is used to compress many "loose objects", and it does so by
> >    containing the most recent versions of the stored objects and deltas
> >    of the previous versions of those objects. A promisor packfile is a
> >    filtered packfile, where the unwanted objects are not present. The
> >    promisor packfile is sent to the client.
> 
> 
> > I created a minimal example setup, mostly based on the test
> > `t/t5710-promisor-remote-capability` added by `4602676` ("Add
> > 'promisor-remote' capability to protocol v2", 2025-02-18), to experiment
> > with multiple promisor remotes, in order to not simply rely on the
> > documentation, but to actually get hands-on experience. The example setup
> > creates a `server`, a `lopm` ("Large Object Promisor medium") for blobs
> > larger than 5kB, a `lopl` ("Large Object Promisor large") for blobs
> > larger than 50kB, and a `client` that interfaces with all of these
> > remotes. It is created in the following way:
> 
> [...]
> 
> > Now, with this setup, by slightly tweaking the configurations of each
> > repository, it is possible to deeply test how multiple promisor remotes
> > are handled in various situations, and actually see what is described in
> > the documentation.
> 
> Yeah, it's quite complex to set up.

Yep. The complexity of the tests is the reason I decided to describe
them in depth in the proposal.

> > ## Testing Promisor Remotes Advertisement
> >
> > An important thing to test is the promisor remotes advertisement feature.
> > This feature is dependent on 2 main configuration options: the
> > server-side option `promisor.advertise`, which enables the server to
> > advertise the promisor remotes it is using to the client, and the
> > client-side option `promisor.acceptFromServer`, which describes how the
> > client should handle the promisor remotes advertised:
> >
> >  * If `promisor.advertise=false`, when the `client` wants to fetch an
> >    object that the `server` does not have,
> 
> I don't think it depends on the client fetching an object the server
> does not have. It depends on the client using a filter because the
> promisor-remote capability only makes sense in the case of partial
> clones (or fetches).

Ok yeah, I should have explained this better. Of course this depends on
the client using a filter. Thanks for the feedback.

> > the `server` will not
> >    advertise the `promisor-remote` capability, and so it has no other
> >    choice than to first fetch the object from `lopl` and/or `lopm`, and
> >    then give it to the `client`. This can be checked by doing `git -C
> >    server rev-list --objects --all --missing=print`, and seeing that the
> >    previously missing large blobs are now present inside the `server`, or
> >    by directly looking into the `GIT_TRACE_PACKET` output, and seeing
> >    that there is no reference to the `promisor-remote` capability.
> >
> >  * If `promisor.advertise=true`, when the `client` wants to fetch an
> >    object that the `server` does not have,
> 
> Same as above, it doesn't depend on the client fetching an object the
> server does not have. It depends on the client using a filter because
> the promisor-remote capability only makes sense in the case of partial
> clones (or fetches).

Ack. Same as above.

> > the `server` will advertise
> >    its promisor remotes, as seen by the `GIT_TRACE_PACKET` output, which
> >    will contain:
> >
> >    ```
> >    [...]
> >    packet: upload-pack> promisor-remote= \
> >        name=lopl,url=file://$(pwd)/lopl; \  # Adv lopl
> >        name=lopm,url=file://$(pwd)/lopm  # Adv lopm
> >    [...]
> >    ```
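
As a side note, this is roughly how I read the structure of the
advertised value: remotes are separated by ";", fields by ",", and each
field is a `name=value` pair whose value I assume to be URL-encoded. A
minimal Python sketch (the function name and the `/tmp` paths are
hypothetical, not taken from Git):

```python
from urllib.parse import unquote


def parse_promisor_remote(value):
    """Split a `promisor-remote=` value into one dict per remote."""
    remotes = []
    for pr_fields in value.split(";"):
        fields = {}
        for field in pr_fields.split(","):
            # Split on the first "=" only, so values may contain "=".
            name, _, raw = field.partition("=")
            fields[name] = unquote(raw)
        remotes.append(fields)
    return remotes


# Hypothetical advertisement, written on a single line.
advertised = "name=lopl,url=file:///tmp/lopl;name=lopm,url=file:///tmp/lopm"
remotes = parse_promisor_remote(advertised)
```

The list keeps the order in which the server advertised the remotes,
which matters for the fetch ordering discussed later.
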
> 
> [...]
> 
> > Recently, with the patch series "Implement `promisor.storeFields` and
> > `--filter=auto`" [5], the new client-side configuration variable
> > `promisor.storeFields` was added. It contains a list of field names
> > (`partialCloneFilter` and/or `token`), and the values of these fields,
> > when transmitted by the server, will be stored in the local configuration
> > on the client.
> >
> > ## Testing Multiple Promisor Remotes Fetch Order
> 
> Yeah, I think this is the most relevant for the project.

Agreed.

> > Finally, the last mechanism that is fundamental to understand is the
> > fetch order when multiple promisor remotes are defined:
> >
> >  * When multiple remotes are configured, they are tried one after the
> >    other in the order in which they appear in the configuration, until
> >    all objects are fetched.
> 
> Right, but there is the exception of a remote configured with
> `extensions.partialClone` that will be tried last. You mention it
> later though.

Yep, will mention it also here.

> > This can be easily seen from the output of
> >    `GIT_TRACE`, which initially tries to fetch the objects from `lopl`,
> >    and then from `lopm`:
> >
> >    ```
> >    [...]
> >    trace: built-in: git fetch lopl [...] --filter=blob:none [...]
> >    [...]
> >    trace: built-in: git fetch lopm [...] --filter=blob:none [...]
> >    [...]
> >    ```
> >
> >    While, if we make it so that we first define `lopm` in the `client`
> >    configuration, then initially `lopm` will be used to fetch the
> >    objects, and `lopl` will not be used at all (because `lopm` contains
> >    all required objects):
> >
> >    ```
> >    [...]
> >    trace: built-in: git fetch lopm [...] --filter=blob:none [...]
> >    [...]
> >    ```
> 
> Yeah, when all the needed objects have been lazy fetched, there is no
> point in further fetching from any remote.

Yeah, and so `lopl` is not tried at all.
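
A tiny Python model of this behavior, as I understand it (the function
and the object names are invented for illustration, and each remote is
reduced to the set of objects it can provide):

```python
def lazy_fetch(missing, remotes):
    """Try remotes in order until nothing is missing.

    `remotes` is an ordered list of (name, available_objects) pairs.
    Returns the names of the remotes actually tried and whatever is
    still missing at the end.
    """
    remaining = set(missing)
    tried = []
    for name, available in remotes:
        if not remaining:
            break  # everything was fetched, no point in going on
        tried.append(name)
        remaining -= available
    return tried, remaining


lopl = ("lopl", {"blob-60k"})             # only blobs larger than 50kB
lopm = ("lopm", {"blob-60k", "blob-8k"})  # all blobs larger than 5kB
needed = {"blob-60k", "blob-8k"}

lopl_first = lazy_fetch(needed, [lopl, lopm])
lopm_first = lazy_fetch(needed, [lopm, lopl])
```

With `lopl` first, both remotes are tried. With `lopm` first, the loop
stops after `lopm`, which matches the `GIT_TRACE` output above.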

> >  * If the configuration option `extensions.partialClone` is present, the
> >    promisor remote that it specifies will always be the last one tried
> >    when fetching objects.
> >
> > ------------------------------
> >
> > # "Implement promisor remote fetch ordering"
> >
> > ## Project Goal
> >
> > This project aims to improve Git by implementing a fetch ordering
> > mechanism for multiple promisor remotes, that can be:
> >
> >  * Configured locally by the client.
> >  * Advertised by servers through the `promisor-remote` protocol.
> >
> > ## Approach
> >
> > The bulk of the project will be the creation of a system that allows to
> > define the order with which the promisor remotes will be tried when
> > fetching an object.
> >
> > The first goal will be the creation of a `remote.<name>.promisorPriority`
> 
> Yeah, or just `remote.<name>.priority`. The name is to be discussed.

Ack.

> > configuration option, which will hold a number between 1 and `UCHAR_MAX`,
> 
> UCHAR_MAX could be system dependent. It might be better to have
> configurations work in the same way on all machines though. So perhaps
> a fixed range like 1 to 100 would be better. Or are there other ranges
> of values used for similar things in Git or other well known software
> that could be reused?

Mmh true. A fixed range might be better, I agree.
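
Something like the following Python sketch is what I have in mind for the
validation, assuming the 1 to 100 range you suggested (the range and the
names are still to be discussed, and the real code would of course be C):

```python
PRIORITY_MIN = 1    # assumed lower bound, to be discussed
PRIORITY_MAX = 100  # assumed upper bound, to be discussed


def parse_priority(value):
    """Return the priority as an int, or None if missing or invalid."""
    if value is None:
        return None
    try:
        number = int(value, 10)
    except ValueError:
        return None
    if PRIORITY_MIN <= number <= PRIORITY_MAX:
        return number
    return None
```

Returning `None` for both missing and invalid values keeps the two cases
in the same group when ordering the remotes.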

> > and which defines the priority of that promisor remote in the fetch
> > order. This means that the order in which the promisor remotes are tried
> > will be the following:
> >
> >  * All promisor remotes that have a valid `remote.<name>.promisorPriority`,
> >    starting from the one with higher priority (the lower `promisorPriority`
> >    value). If 2 or more promisor remotes have the same priority, they will be
> >    tried following the order in which they appear in the configuration file.
> >
> >  * All promisor remotes that don't have or have an invalid
> >    `remote.<name>.promisorPriority` configuration option. If 2 or more
> >    promisor remotes don't define any priority, or have an invalid priority,
> >    they will be tried following the order in which they appear in the
> >    configuration file.
> >
> >  * The promisor remote defined in `extensions.partialClone`, no
> >    matter its priority (which will be ignored if present). This is
> >    necessary for backward compatibility.
> 
> Yeah, I think something like what you describe makes sense.

Nice! :-)
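
To make the three groups concrete, here is a minimal Python model of the
proposed ordering. It is only a sketch of the semantics, not of the
implementation: `fetch_order` is a made-up name, and a priority of `None`
stands for "missing or invalid":

```python
def fetch_order(remotes, partial_clone=None):
    """Return remote names in the order they would be tried.

    `remotes` is a list of (name, priority) pairs in configuration
    order. `partial_clone` is the remote named by
    `extensions.partialClone`, if any.
    """
    prioritized = [r for r in remotes
                   if r[0] != partial_clone and r[1] is not None]
    unprioritized = [r for r in remotes
                     if r[0] != partial_clone and r[1] is None]
    last = [r for r in remotes if r[0] == partial_clone]

    # sort() is stable, so equal priorities keep configuration order.
    prioritized.sort(key=lambda r: r[1])
    return [name for name, _ in prioritized + unprioritized + last]


configured = [("a", None), ("b", 5), ("c", 1), ("d", 5), ("origin", 1)]
order = fetch_order(configured, partial_clone="origin")
```

Here `order` is `["c", "b", "d", "a", "origin"]`: `c` has the lowest
value and goes first, `b` and `d` tie and keep their configuration
order, `a` has no priority, and `origin` goes last even though it has
priority 1.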

> > Having already taken a look at the code, I have a general idea of th
> 
> s/of th/of the/

Ack.

> > major steps to take to actually introduce the
> > `remote.<name>.promisorPriority` configuration option:
> 
> [...]
> 
> > # Possible Issues
> >
> > From my understanding, the project as it is proposed will handle all
> > possible cases, except for one. Let's imagine the following situation:
> >
> >  * `server1` and `server2` both use the promisor remotes `lop1` and `lop2`.
> >  * `client` has both `server1` and `server2` as remotes.
> >
> > In this situation, the `client` has no way to specifically say that when
> > fetching from `server1`, it wants to first try `lop1` and then `lop2`, while
> > when fetching from `server2`, it wants to first try `lop2` and then `lop1`.
> 
> Right, but lazy fetching does not only happen as part of a clone or
> fetch from a server. It happens when for some reason (like a git show
> or a git blame for example) the user needs some objects it doesn't
> have locally, and when that happens, this is not related to a single
> server.
> 
> So global priorities are likely the most useful ones to have.
> 
> > One way to solve this very specific (and maybe unusual) issue is to
> > introduce a way to associate a `promisorPriority` to a specific remote.
> 
> Yeah, but I don't think it would be used a lot. We can perhaps think
> of some cases where it could be useful, but in practice it is likely
> that if there is an optimal order for one server, it will be optimal
> for all other servers too.

I agree. I should have pointed out clearly that, to me, this unusual
situation doesn't seem worth the effort.

> [...]
> 
> Thanks!

Thank you Christian!
