* [GSOC Proposal] Complete and extend the remote-object-info command for git cat-file
@ 2026-03-05 20:48 SoutrikDas
2026-03-15 10:11 ` SoutrikDas
` (3 more replies)
0 siblings, 4 replies; 14+ messages in thread
From: SoutrikDas @ 2026-03-05 20:48 UTC (permalink / raw)
To: git
Cc: christian.couder, karthik.188, jltobler, ayu.chandekar,
siddharthasthana31, chandrapratap3519
Hi!
This is my project proposal for GSOC 2026
I am interested in the project idea : "Complete and extend the
remote-object-info command for git cat-file"
# Complete and extend the remote-object-info command for git cat-file
## Contact
- Name: Soutrik Das
- E-mail: valusoutrik@gmail.com
- Github: https://github.com/SoutrikDas
- LinkedIn: https://www.linkedin.com/in/soutrik-das/
## About Me
My name is Soutrik Das, I am a developer and CS bachelor from Indian
Institute of Technology, Dhanbad. Currently I am pursuing a master's
degree in AI from Indian Institute of Technology, Bhubaneswar.
I don't really have much experience in contributing to something as
large as git, but I would love to learn anything and everything I can
gain from this experience. I have experience in C/C++ from my
Btech coursework and participating in codeforces contests.
## Pre GSOC
I started exploring Git's codebase around February 2026 and sent my first patch
as a docfix, followed by a microproject of modernizing tests
- [PATCH] doc: fix repo_config documentation reference [1]
status: merged to master
Merge Commit: 94336d77bcbf4360b67a9454d8bf2e84b3d88ae7
Description: Replace the path for the repo_config() documentation
from 'Documentation/technical/api-config.h' to 'config.h'.
- [GSOC PATCH] t7003: modernize path existence checks using test helpers [2]
status: merged to master
Merge Commit: 11294bb0fa540d214d071b32cf74b1ed37b3bbbd
Description: Replace direct uses of 'test -f' and 'test -d' with
git's helper functions 'test_path_is_file' ,'test_path_is_missing'
and 'test_path_is_dir'
I have read through most of Eric Ju's [4] work and some of Calvin Wan's [5]
work. I am still finding more things to understand from each thread, but
I feel I have grasped the basics.
My work in this project would be focused on implementing the changes
suggested at the end of Eric Ju's [Patch v11].
I wouldn't say I understand every bit of discussion from that thread,
but in general my understanding is :
Calvin Wan and Eric Ju has already implemented a client side command
called get_remote_info but its designed for being batched to reduce
multiple network trips to get a single object's data.
I have added Eric Ju's patch series to an old master commit (2d2a71ce85)
since I could not find a base commit for Eric's patch series. The patch
was properly applied and I also played around and added a very rough
but working "%(objecttype)" code, i.e. it now prints like this:
29658341f39210201ff7f72a4be83937cf2288c5 14 blob
## Project : Complete and extend the remote-object-info command for git cat-file
Currently in the case of a partial clone, the user cannot retrieve all
object data without fetching the object beforehand. To solve this problem
Calvin Wan and Eric Ju had designed a patch sreies that can solve that,
by utilising protocolv2 servers capabilities.
This was done in the form of "remote-object-info".
But only the %(objectsize) was implemented, and that patch was not merged.
This project has two goals
1: To Rebase and finalize Calvin Wan and Eric Ju's Work by addressing
the feedback on Eric Ju's Patch v11
2: To add support for objecttype in remote-object-info
3: To discuss other information type like objectsize:disk and deltabase.
Project Duration : 12 week approx
## Timeline
Mar 6-31 : Refine Proposal
If possible I would like to submit small patches... but first I will
have to rebase Eric Ju's Patches ... I am not sure if I can do this
before GSOC...
If not, I plan to contribute to git in other areas.
May 1-24 : Community Bonding
1-7 : Understand relevant underlying/ helper functions
8-24 : Ask about any design related problems/decisions
May 25 - Jun 14 : Start a Patch Series to rebase Calvin Wan and Eric Ju's work
and keep refining
Jun 15 - Aug 15 : Start and keep refining Patch Series to add support for
object type information
Aug 16 - Aug 24 : Discuss and Implement other object information if possible
Concurrently I shall make a report for all the work done.
## Availability
My current semester is ending in the first week of April, so I will be
able to contribute 7-8 hours per day, totalling around 35-40 hrs a week
on the project.
Total weeks = 12 , total hours = 35*12 = 420
This leaves a lot of room to accommodate any unforeseen circumstances
that may arise during the project.
## RFC
I have a few ideas but do not know if they are worth pursuing, so I will
leave them here in the first draft
- Addition of a remote-object-info outside of batchmode :
Yes it should be optimally used in batch mode .. but if user wants
only one objects size or type then should they be able to just
`git cat-file -r origin <oid>`
and get the size and type ? or something similar , I am not sure if
the way I have depicted it conforms to git's design.
- Addition of commands for common user behaviour :
I dont know if its going to be a common user behaviour but what about
`git cat-file -r --all-absent`
Or inside "git cat-file --batch-command="<format> remote-object-info
--all-absent --type=tree <remote>"
which would basically fill in remote-object-info with all the blobs
that are currently absent from the worktree ?
No need to fill them if its for a common enough use case.
- Sort according to size :
Maybe a user would want to check whats the largest file they dont
have yet.
- Get total missing blob size :
Use case would be when someone wants to know how much exactly there
is to download, before starting the download.
Thank you for your time in reviewing my proposal as well as considering
my application. I am excited to learn everything I can from git.
Thanks and Regards,
Soutrik
[1] : pull.2187.git.git.1770293021383.gitgitgadget@gmail.com
[2] : 20260209172445.39536-1-valusoutrik@gmail.com
[3] : 20260225190306.39358-1-valusoutrik@gmail.com
[4] : 20240628190503.67389-1-eric.peijian@gmail.com
[5] : 20220728230210.2952731-1-calvinwan@google.com
* Re: [GSOC Proposal] Complete and extend the remote-object-info command for git cat-file
2026-03-05 20:48 [GSOC Proposal] Complete and extend the remote-object-info command for git cat-file SoutrikDas
@ 2026-03-15 10:11 ` SoutrikDas
2026-03-16 12:08 ` Christian Couder
` (2 subsequent siblings)
3 siblings, 0 replies; 14+ messages in thread
From: SoutrikDas @ 2026-03-15 10:11 UTC (permalink / raw)
To: valusoutrik
Cc: ayu.chandekar, chandrapratap3519, christian.couder, git, jltobler,
karthik.188, siddharthasthana31
Hi, I was wondering if I could get some feedback on this.
Thanks.
* Re: [GSOC Proposal] Complete and extend the remote-object-info command for git cat-file
2026-03-05 20:48 [GSOC Proposal] Complete and extend the remote-object-info command for git cat-file SoutrikDas
2026-03-15 10:11 ` SoutrikDas
@ 2026-03-16 12:08 ` Christian Couder
2026-03-17 13:06 ` SoutrikDas
2026-03-16 20:46 ` Karthik Nayak
2026-03-20 13:12 ` [GSoC Proposal v2] " SoutrikDas
3 siblings, 1 reply; 14+ messages in thread
From: Christian Couder @ 2026-03-16 12:08 UTC (permalink / raw)
To: SoutrikDas
Cc: git, karthik.188, jltobler, ayu.chandekar, siddharthasthana31,
chandrapratap3519
Hi,
Sorry for the late feedback.
On Thu, Mar 5, 2026 at 9:48 PM SoutrikDas <valusoutrik@gmail.com> wrote:
> I have read through most of Eric Ju's [4] work and some of Calvin Wan's [5]
> work. I am still finding more things to understand from each thread, but
> I feel I have grasped the basics.
>
> My work in this project would be focused on implementing the changes
> suggested at the end of Eric Ju's [Patch v11].
>
> I wouldn't say I understand every bit of discussion from that thread,
> but in general my understanding is :
>
> Calvin Wan and Eric Ju has already implemented a client side command
s/has/have/
> called get_remote_info but its designed for being batched to reduce
s/its/it's/
> multiple network trips to get a single object's data.
The `git cat-file` command has a `--batch-command[=<format>]` option
to enter a command mode. In this command mode some special commands
and arguments can be passed via stdin to `git cat-file` to request
information.
[...]
> ## Project : Complete and extend the remote-object-info command for git cat-file
>
> Currently in the case of a partial clone, the user cannot retrieve all
> object data without fetching the object beforehand. To solve this problem
> Calvin Wan and Eric Ju had designed a patch sreies that can solve that,
s/sreies/series/
> by utilising protocolv2 servers capabilities.
>
> This was done in the form of "remote-object-info".
>
> But only the %(objectsize) was implemented, and that patch was not merged.
> This project has two goals
>
> 1: To Rebase and finalize Calvin Wan and Eric Ju's Work by addressing
> the feedback on Eric Ju's Patch v11
>
> 2: To add support for objecttype in remote-object-info
>
> 3: To discuss other information type like objectsize:disk and deltabase.
s/type/types/
But anyway I think "information type" is not a good wording for these
things, because we already talk about "type" for Git object types.
Please try to find a better wording.
> ## Timeline
>
> Mar 6-31 : Refine Proposal
>
> If possible I would like to submit small patches... but first I will
> have to rebase Eric Ju's Patches ... I am not sure if I can do this
> before GSOC...
You can try a rebase to see which issues would need to be resolved to
complete a rebase, and talk a bit about these issues in your proposal,
but otherwise applicants shouldn't start working on a project before
they have been accepted.
> If not, I plan to contribute to git in other areas.
>
> May 1-24 : Community Bonding
> 1-7 : Understand relevant underlying/ helper functions
> 8-24 : Ask about any design related problems/decisions
>
> May 25 - Jun 14 : Start a Patch Series to rebase Calvin Wan and Eric Ju's work
> and keep refining
>
> Jun 15 - Aug 15 : Start and keep refining Patch Series to add support for
> object type information
Would you implement both the client and the server side in the same
patch series or do it separately?
> Aug 16 - Aug 24 : Discuss and Implement other object information if possible
> Concurrently I shall make a report for all the work done.
>
> ## Availability
>
> My current semester is ending in the first week of April, so I will be
> able to contribute 7-8 hours per day, totalling around 35-40 hrs a week
> on the project.
Do you have another semester starting after the current one?
> Total weeks = 12 , total hours = 35*12 = 420
> It leaves with a lot more room to accomodate any unforeseen circumstances
> that may arise during the project.
>
> ## RFC
>
> I have a few ideas but do not know if they are worth pursuing, so I will
> leave them here in the first draft
>
> - Addition of a remote-object-info outside of batchmode :
> Yes it should be optimally used in batch mode .. but if user wants
> only one objects size or type then should they be able to just
> `git cat-file -r origin <oid>`
> and get the size and type ? or something similar , I am not sure if
> the way I have depicted it conforms to git's design.
Not sure if that would be very useful first. Also that might be better
in a different command than `cat-file`.
> - Addition of commands for common user behaviour :
> I dont know if its going to be a common user behaviour but what about
> `git cat-file -r --all-absent`
> Or inside "git cat-file --batch-command="<format> remote-object-info
> --all-absent --type=tree <remote>"
> which would basically fill in remote-object-info with all the blobs
> that are currently absent from the worktree ?
There are other ways to do this, like using:
git rev-list --objects --all --missing=print
Thanks for your proposal.
Best.
* Re: [GSOC Proposal] Complete and extend the remote-object-info command for git cat-file
2026-03-16 12:08 ` Christian Couder
@ 2026-03-17 13:06 ` SoutrikDas
0 siblings, 0 replies; 14+ messages in thread
From: SoutrikDas @ 2026-03-17 13:06 UTC (permalink / raw)
To: christian.couder
Cc: ayu.chandekar, chandrapratap3519, git, jltobler, karthik.188,
siddharthasthana31, valusoutrik
Hi there,
> s/has/have/
> s/its/it's/
> s/sreies/series/
> s/type/types/
I will correct all the spelling mistakes.
> > multiple network trips to get a single object's data.
>
> The `git cat-file` command has a `--batch-command[=<format>]` option
> to enter a command mode. In this command mode some special commands
> and arguments can be passed via stdin to `git cat-file` to request
> information.
Will correct that.
> But anyway I think "information type" is not a good wording for these
> things, because we already talk about "type" for Git object types.
> Please try to find a better wording.
How about object property or object attribute or object field?
I feel like object fields may be a bit more technically correct.
> You can try a rebase to see which issues would need to be resolved to
> complete a rebase, and talk a bit about these issues in your proposal,
> but otherwise applicants shouldn't start working on a project before
> they have been accepted.
I tried a rebase on the current master, and there were indeed conflicts.
I will include this part in my v2.
> Would you implement both the client and the server side in the same
> patch series or do it separately?
I am not sure actually... since Eric Ju did everything in one patch series.
But personally I feel like doing one series for server side first and another
for client side would be a bit more focused. But I am not sure if it would
cost more time for everyone involved, like giving feedback and all that?
> > My current semester is ending in the first week of April, so I will be
> > able to contribute 7-8 hours per day, totalling around 35-40 hrs a week
> > on the project.
>
> Do you have another semester starting after the current one?
Actually I made a mistake, it's ending in the first week of May. But no,
after this semester we have a summer break, so I will update this part.
> Not sure if that would be very useful first. Also that might be better
> in a different command than `cat-file`.
Alright. I will ask that as a question before my final GSoC proposal
submission, so that if it's approved I can add it to my GSoC tasks.
> There are other ways to do this, like using:
>
> git rev-list --objects --all --missing=print
I did not know that, but that's great! I will remove this from the proposal.
Thanks for the feedback.
* Re: [GSOC Proposal] Complete and extend the remote-object-info command for git cat-file
2026-03-05 20:48 [GSOC Proposal] Complete and extend the remote-object-info command for git cat-file SoutrikDas
2026-03-15 10:11 ` SoutrikDas
2026-03-16 12:08 ` Christian Couder
@ 2026-03-16 20:46 ` Karthik Nayak
2026-03-17 15:13 ` SoutrikDas
2026-03-20 13:12 ` [GSoC Proposal v2] " SoutrikDas
3 siblings, 1 reply; 14+ messages in thread
From: Karthik Nayak @ 2026-03-16 20:46 UTC (permalink / raw)
To: SoutrikDas, git
Cc: christian.couder, jltobler, ayu.chandekar, siddharthasthana31,
chandrapratap3519
SoutrikDas <valusoutrik@gmail.com> writes:
Hello,
[snip]
> ## Pre GSOC
>
> I started exploring Git's codebase around February 2026 and sent my first patch
> as a docfix, followed by a microproject of modernizing tests
>
> - [PATCH] doc: fix repo_config documentation reference [1]
> status: merged to master
> Merge Commit: 94336d77bcbf4360b67a9454d8bf2e84b3d88ae7
> Description: Replace the path for the repo_config() documentation
> from 'Documentation/technical/api-config.h' to 'config.h'.
>
> - [GSOC PATCH] t7003: modernize path existence checks using test helpers [2]
> status: merged to master
> Merge Commit: 11294bb0fa540d214d071b32cf74b1ed37b3bbbd
> Description: Replace direct uses of 'test -f' and 'test -d' with
> git's helper functions 'test_path_is_file' ,'test_path_is_missing'
> and 'test_path_is_dir'
>
>
> I have read through most of Eric Ju's [4] work and some of Calvin Wan's [5]
> work. I am still finding more things to understand from each thread, but
> I feel I have grasped the basics.
>
> My work in this project would be focused on implementing the changes
> suggested at the end of Eric Ju's [Patch v11].
>
> I wouldn't say I understand every bit of discussion from that thread,
> but in general my understanding is :
>
I do agree that there is a lot to unpack there.
> Calvin Wan and Eric Ju has already implemented a client side command
> called get_remote_info but its designed for being batched to reduce
> multiple network trips to get a single object's data.
>
As far as I can recall, the command allowed users to enter multiple OIDs
in a single line to reduce the to-fro with the server. But you could
still fetch single OID info.
> I have added Eric Ju's patch series to an old master commit (2d2a71ce85)
> since I could not find a base commit for Eric's patch series. The patch
> was properly applied and I also played around and added a very rough
> but workin "%(objecttype)" code , ie now it prints like this :
>
> 29658341f39210201ff7f72a4be83937cf2288c5 14 blob
>
Nice, have you tried with a more recent 'master'? I assume there are
merge conflicts?
>
> ## Project : Complete and extend the remote-object-info command for git cat-file
>
> Currently in the case of a partial clone, the user cannot retrieve all
> object data without fetching the object beforehand. To solve this problem
> Calvin Wan and Eric Ju had designed a patch sreies that can solve that,
> by utilising protocolv2 servers capabilities.
>
> This was done in the form of "remote-object-info".
>
> But only the %(objectsize) was implemented, and that patch was not merged.
> This project has two goals
>
> 1: To Rebase and finalize Calvin Wan and Eric Ju's Work by addressing
> the feedback on Eric Ju's Patch v11
>
Any idea how much work is left post v11?
> 2: To add support for objecttype in remote-object-info
>
> 3: To discuss other information type like objectsize:disk and deltabase.
>
> Project Duration : 12 week approx
>
> ## Timeline
>
> Mar 6-31 : Refine Proposal
>
> If possible I would like to submit small patches... but first I will
> have to rebase Eric Ju's Patches ... I am not sure if I can do this
> before GSOC...
>
As per the guidelines, it says
Any work done on the Project prior to acceptance of the Project
Proposal will not be considered for Evaluations.
> If not, I plan to contribute to git in other areas.
>
> May 1-24 : Community Bonding
> 1-7 : Understand relevant underlying/ helper functions
> 8-24 : Ask about any design related problems/decisions
>
> May 25 - Jun 14 : Start a Patch Series to rebase Calvin Wan and Eric Ju's work
> and keep refining
>
> Jun 15 - Aug 15 : Start and keep refining Patch Series to add support for
> object type information
>
> Aug 16 - Aug 24 : Discuss and Implement other object information if possible
> Concurrently I shall make a report for all the work done.
How will you manage reviews, considering generally they take a long
time?
>
> ## Availability
>
> My current semester is ending in the first week of April, so I will be
> able to contribute 7-8 hours per day, totalling around 35-40 hrs a week
> on the project.
>
> Total weeks = 12 , total hours = 35*12 = 420
> It leaves with a lot more room to accomodate any unforeseen circumstances
> that may arise during the project.
>
> ## RFC
>
> I have a few ideas but do not know if they are worth pursuing, so I will
> leave them here in the first draft
>
> - Addition of a remote-object-info outside of batchmode :
> Yes it should be optimally used in batch mode .. but if user wants
> only one objects size or type then should they be able to just
> `git cat-file -r origin <oid>`
> and get the size and type ? or something similar , I am not sure if
> the way I have depicted it conforms to git's design.
>
I do agree that something like that would be useful indeed, I'm not sure
of what that design looks like though.
> - Addition of commands for common user behaviour :
> I dont know if its going to be a common user behaviour but what about
> `git cat-file -r --all-absent`
> Or inside "git cat-file --batch-command="<format> remote-object-info
> --all-absent --type=tree <remote>"
> which would basically fill in remote-object-info with all the blobs
> that are currently absent from the worktree ?
> No need to fill them if its for a common enough use case.
I do see benefits of this too. But I do wonder if 'git rev-list' is a
better command for something like this.
> - Sort according to size :
> Maybe a user would want to check whats the largest file they dont
> have yet.
>
Same here.
> - Get total missing blob size :
> Use case would be when someone wants to know how much exactly there
> is to download, before starting the download.
>
This could probably go into 'git backfill' ? Interesting ideas
nevertheless!
> Thank you for your time in revewing my proposal as well as considering
> my application. I am excited to learn everything I can from git.
>
> Thanks and Regards,
> Soutrik
>
What I missed from the proposal:
1. Where did the work from Eric and Calvin stop at, what review comments
need to be addressed.
2. How do you plan to handle reviews and iterations taking time.
Regards,
Karthik
>
> [1] : pull.2187.git.git.1770293021383.gitgitgadget@gmail.com
> [2] : 20260209172445.39536-1-valusoutrik@gmail.com
> [3] : 20260225190306.39358-1-valusoutrik@gmail.com
> [4] : 20240628190503.67389-1-eric.peijian@gmail.com
> [5] : 20220728230210.2952731-1-calvinwan@google.com
* Re: [GSOC Proposal] Complete and extend the remote-object-info command for git cat-file
2026-03-16 20:46 ` Karthik Nayak
@ 2026-03-17 15:13 ` SoutrikDas
0 siblings, 0 replies; 14+ messages in thread
From: SoutrikDas @ 2026-03-17 15:13 UTC (permalink / raw)
To: karthik.188
Cc: ayu.chandekar, chandrapratap3519, christian.couder, git, jltobler,
siddharthasthana31, valusoutrik
Hi there,
> As far as I can recall, the command allowed users to enter multiple OIDs
> in a single line to reduce the to-fro with the server. But you could
> still fetch single OID info.
Yeah, that was what I meant, but from Christian Couder's feedback I realized
that cat-file is not a good home for such a subcommand.
> Nice, have you tried with a more recent 'master'? I assume there are
> merge conflicts?
Yup, I will add these issues in my proposal v2.
> Any idea how much work is left post v11?
From the v11 thread
- several style fixes, such as comment alignment and blank lines
- the max remote obj info logic is a bit wrong as Junio pointed out [1]
- one test case for max obj limit
- use of size_t for looping
- the placeholder check, i.e. even with only objectsize supported, the
checking of the format string is a bit incorrect [2]
- Implementing an allow list for placeholders
- print empty string for unsupported placeholders, ie those not on the
allow list
- remove usage of split_cmdline since neither url nor oid will have spaces
in them, so a strchr would suffice, I think ?
The above is just for part 1, i.e. getting Eric Ju's patch series accepted.
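To illustrate the last item, here is a minimal sketch of splitting a `<remote> <oid> <oid> ...` argument string on single spaces with strchr(). The helper name `split_on_space` is hypothetical and is not taken from the series; it only shows that no quoting-aware parser is needed when no token can contain a space.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of git: split "line" in place on single
 * spaces and store up to "max" token pointers in "out".
 * Returns the number of tokens stored.
 */
static size_t split_on_space(char *line, char **out, size_t max)
{
	size_t n = 0;

	while (*line && n < max) {
		char *sp = strchr(line, ' ');

		out[n++] = line;
		if (!sp)
			break;
		*sp = '\0';      /* terminate the current token */
		line = sp + 1;   /* continue after the space */
	}
	return n;
}
```

Note that consecutive spaces would produce empty tokens here; whether that should be an error is a design question for the real code.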
> As per the guidelines, it says
>
> Any work done on the Project prior to acceptance of the Project
> Proposal will not be considered for Evaluations.
I meant the May 1-24 period, which is after the acceptance
of the project (April 30) but before coding officially begins (May 25).
This is the timeline on GSoC's page [3]:
> April 30 - 18:00 UTC
> Accepted GSoC contributor projects announced
> May 1 - 24
> Community Bonding Period | GSoC contributors get to know mentors,
> read documentation, get up to speed to begin working on their projects
> May 25
> Coding officially begins!
I was planning to also ask design questions in this period.
> How will you manage reviews, considering generally they take a long
> time?
I will adjust the timeline to give more time to rebase previously done work.
I was wondering... I cannot start on part 2 ie adding support for more object
fields without first integrating old work ... so about 50% of time will go to
rebasing and 30% to adding new fields ? and 20% for emergency or any mishap.
> I do agree that something like that would be useful indeed, I'm not sure
> of what that design looks like though.
> I do see benefits of this too. But I do wonder if 'git rev-list' is a
> better command for something like this.
I will clarify questions at the beginning of gsoc duration.
> What I missed from the proposal:
> 1. Where did the work from Eric and Calvin stop at, what review comments
> need to be addressed.
> 2. How do you plan to handle reviews and iterations taking time.
Will update the timeline as well as mention the current outstanding tasks,
as far as I have understood them.
Thank you for your feedback.
[1] : xmqqo6yr3wc4.fsf@gitster.g/
[2] : 20250224234720.GC729825@coredump.intra.peff.net/
[3] : https://developers.google.com/open-source/gsoc/timeline
* [GSoC Proposal v2] Complete and extend the remote-object-info command for git cat-file
2026-03-05 20:48 [GSOC Proposal] Complete and extend the remote-object-info command for git cat-file SoutrikDas
` (2 preceding siblings ...)
2026-03-16 20:46 ` Karthik Nayak
@ 2026-03-20 13:12 ` SoutrikDas
3 siblings, 0 replies; 14+ messages in thread
From: SoutrikDas @ 2026-03-20 13:12 UTC (permalink / raw)
To: valusoutrik
Cc: ayu.chandekar, chandrapratap3519, christian.couder, git, jltobler,
karthik.188, siddharthasthana31
Hi everyone,
Thank you for the feedback Christian and Karthik.
I have not made a doc version of this yet. I will link it from v3.
I understand that in this proposal I have not explained my own plans
thoroughly enough; I am working on that for v3.
Changes from v1 :
- Correct spelling mistakes
- Address how much work is remaining after Eric Ju's Patch v11
- Increase Time in Timeline for Reviews
- Add a section for rebasing problems
---
This is the second version of my project proposal for GSoC 2026
I am interested in the project idea : "Complete and extend the
remote-object-info command for git cat-file"
# Complete and extend the remote-object-info command for git cat-file
## Contact
- Name: Soutrik Das
- E-mail: valusoutrik@gmail.com
- Github: https://github.com/SoutrikDas
- LinkedIn: https://www.linkedin.com/in/soutrik-das/
## About Me
My name is Soutrik Das, I am a developer. I did my B.Tech in CS from
IIT Dhanbad. Currently I am pursuing a M.Tech degree in AI from IIT
Bhubaneswar.
I don't really have much experience in contributing to something as
large as git, but I would like to learn as much as possible from this
experience. I have experience in C/C++ from my B.Tech coursework and
participating in Codeforces contests.
## Pre GSoC
I started exploring Git's codebase around February 2026 and sent my first patch
as a docfix, followed by a microproject of modernizing tests
- [PATCH] doc: fix repo_config documentation reference [1]
status: merged to master
Merge Commit: 94336d77bcbf4360b67a9454d8bf2e84b3d88ae7
Merge Date : 13 Feb 2026
Description: Replace the path for the repo_config() documentation
from 'Documentation/technical/api-config.h' to 'config.h'.
- [GSoC PATCH] t7003: modernize path existence checks using test helpers [2]
status: merged to master
Merge Commit: 11294bb0fa540d214d071b32cf74b1ed37b3bbbd
Merge Date : 17 Feb 2026
Description: Replace direct uses of 'test -f' and 'test -d' with
git's helper functions 'test_path_is_file' ,'test_path_is_missing'
and 'test_path_is_dir'
## Eric Ju and Calvin Wan's work
In this section I want to talk about the work already done and what
feedback the community had on the last version sent, i.e. v11.
This is my understanding of the patch series:
Patch 1/8 : git-compat-util: add strtoul_ul()
Helper function addition
Patch 2/8 : cat-file: add declaration of variable i inside for loop
Small refactoring
Patch 3/8 : t1006: split test utility functions into new "lib-cat-file.sh"
Move the `echo_without_newline`, `echo_without_newline_nul` and
`strlen` functions from `t1006-cat-file.sh` to `lib-cat-file.sh` so that
they can be reused in the future.
When I rebased the patch series onto a recent master (March 5,
795c338de725e13bd361214c6b768019fc45a2c1), only one other
file (t1007-hash-object.sh) had a duplicate definition.
Patch 4/8 : fetch-pack: refactor packet writing
Generalized write_command_and_capabilities so that it now takes in
a command instead of hardcoding "fetch". It was also moved from
`fetch-pack.c` to `connect.c`
Patch 5/8 : fetch-pack: move fetch initialization
Before this patch, the state machine of do_fetch_pack_v2() assumed
that the starting state is FETCH_CHECK_LOCAL, so it initialized
certain variables like `use_sideband=2` inside the FETCH_CHECK_LOCAL
case. For remote-object-info we do not want to go through those
extra steps, so we enter the state machine directly at
FETCH_SEND_REQUEST. We don't need to figure out what to fetch,
because the user/machine gives it explicitly.
Patch 6/8 : serve: advertise object-info feature
Makes the server advertise that it supports the "size" feature of
the object-info command.
Patch 7/8 : transport: add client support for object-info
Adds `fetch_object_info`, which checks that the protocol is v2
and then sends the object-info request. After getting the result,
it parses the output.
Also sets `state=FETCH_SEND_REQUEST` when object-info is used.
Not related to above patch , but on the server side this request is
caught by serve.c and then handled by cap_object_info in protocol-caps.c
Patch 8/8 : cat-file: add remote-object-info to batch-command
Adds the subcommands and relevant tests.
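
The initialization move described under Patch 5/8 can be pictured with a toy state machine. This is only a sketch under stated assumptions: the state names mirror the ones mentioned above (the intermediate states are omitted), while `struct toy_fetch` and `toy_run` are hypothetical and do not exist in git.

```c
/*
 * Toy model only: the state names mirror the ones discussed above,
 * everything else (struct toy_fetch, toy_run) is hypothetical.
 */
enum fetch_state {
	FETCH_CHECK_LOCAL,
	FETCH_SEND_REQUEST,
	FETCH_DONE
};

struct toy_fetch {
	int use_sideband;             /* must be set before a request is sent */
	int checked_local;            /* set only if FETCH_CHECK_LOCAL ran */
	int sideband_seen_by_request; /* value observed in FETCH_SEND_REQUEST */
};

static void toy_run(struct toy_fetch *f, enum fetch_state state)
{
	/* Initialization lives outside the switch, so every entry state sees it. */
	f->use_sideband = 2;

	while (state != FETCH_DONE) {
		switch (state) {
		case FETCH_CHECK_LOCAL:
			f->checked_local = 1;
			state = FETCH_SEND_REQUEST;
			break;
		case FETCH_SEND_REQUEST:
			f->sideband_seen_by_request = f->use_sideband;
			state = FETCH_DONE;
			break;
		case FETCH_DONE:
			break;
		}
	}
}
```

Entering at FETCH_SEND_REQUEST skips the local check but still observes `use_sideband == 2`, which is the property the patch needs.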
To summarize, this patch series has added the subcommand, and all of
the needed functions to make one object info field work. But a few problems
were left to be addressed. Once those are addressed, adding new object
info fields will be much easier.
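
As a rough illustration of the client-side parsing step, here is a minimal sketch that assumes each response line has the form `<oid> SP <size>`. Both function names are hypothetical; `parse_ulong` is written in the spirit of a strict strtoul() wrapper such as strtoul_ul(), but the real helper in the series may differ.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sketch of a strict strtoul() wrapper; not the helper from the series.
 * Returns 0 on success and -1 on any parse error.
 */
static int parse_ulong(const char *s, int base, unsigned long *result)
{
	char *end;
	unsigned long ul;

	if (!*s || strchr(s, '-'))   /* strtoul() would accept negatives */
		return -1;
	errno = 0;
	ul = strtoul(s, &end, base);
	if (errno || *end || end == s)
		return -1;
	*result = ul;
	return 0;
}

/*
 * Hypothetical: parse one "<oid> SP <size>" response line.
 * On success, *oid_len is the length of the oid prefix in "line".
 */
static int parse_info_line(const char *line, size_t *oid_len,
			   unsigned long *size)
{
	const char *sp = strchr(line, ' ');

	if (!sp || sp == line)
		return -1;
	*oid_len = (size_t)(sp - line);
	return parse_ulong(sp + 1, 10, size);
}
```

Adding a new object info field would extend the line format, so a parser like this would have to be driven by the attribute list the server returns rather than by a fixed layout.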
## Problems faced during rebasing
I applied the patches onto an old master (2d2a71ce85) and then rebased
to a recent master (795c338de7)
Patch 1/8: Auto / No Merge Conflict
Patch 2/8: Auto / No Merge Conflict
Patch 3/8: add/add conflict
Patch 4/8: Confirming movement of function `write_command_and_capabilities`
Patch 5/8: Auto / No Merge Conflict
Patch 6/8: Auto / No Merge Conflict
Patch 7/8: Makefile merge conflict, but when opened in vscode it shows
no conflicts.
Patch 8/8: add/add conflict for object-store.c and modify/delete
conflict for object-store-ll.h
According to 68cd492a3e
> object-store: merge "object-store-ll.h" and "object-store.h"
And according to 8f49151763
> object-store: rename files to "odb.{c,h}"
Therefore I have added the function signature that was supposed to go to
object-store-ll.h to odb.h
## Work remaining to get v11 patch accepted
Almost all of it is focused on patch 8
- Fix multi-line comment formatting - closing */ on own line
- Add blank lines between macro definitions
- Split overly-long MAX_REMOTE_OBJ_INFO_LINE definition across lines
- Change loop variable from size_t i to int i (since argc is int)
- Rearrange if/else to put smaller body first: if (!gtransport->smart_options)
before else
- Fix the logic of maximum line size for the remote-object-info.
- Adding an allow list of object info fields
- Handling what happens if an unsupported object info field is given in
format string.
In this case we send the request as if such an object info field were
not there at all, and when printing the result we simply print an empty
string on the client side. No extra payload on the network.
- Add tests.
- Update Documentation
## Project : Complete and extend the remote-object-info command for git cat-file
Currently, in a partial clone, the user cannot retrieve information about
an object without fetching the object beforehand. To solve this problem,
Calvin Wan and Eric Ju designed a patch series that uses the
capabilities of protocol v2 servers.
This was done in the form of "remote-object-info".
But only %(objectsize) was implemented, and that patch series was not merged.
This project has two goals:
1: To rebase and finalize Calvin Wan and Eric Ju's work by addressing
the feedback on Eric Ju's patch v11. Work for this part is discussed
in the section above.
2: To discuss with the community and add support in `remote-object-info`
for other relevant object info fields like `objecttype`,
`objectsize:disk` and `deltabase`.
Project Duration : approx. 13 weeks
## Timeline
### Phase 1 :
May 1-24 : Community Bonding + Start Design discussions on
Logic of allow list implementation
Logic of maximum size of the remote-object-info command
Which object info fields should be supported
Week 1 (May 25 - 31) :
Open Patch Series 1 for Eric Ju's patch, after
solving all remaining problems, using the ideas/solutions agreed on in the
discussions above. Both client and server side work would be in the same patch
series. Since this is just a rebase of previous work, what remains is to
address the changes suggested after v11.
Week 2 (June 1 - 7) : Continue discussion, review feedback and refine.
Week 3 (June 8 - 14) : Review feedback and refine
Week 4 (June 15 - 21) : Review feedback and refine + Update Documentation
and Tests
Week 5 (June 22 - 28) : By now all tasks regarding merging Eric Ju's
patch should be finished. But since reviewing may take more time,
I am adding buffer weeks.
Week 6 (June 29 - July 5) : Polish everything + Midterm report
Week 7 (July 6 - 12) : Midterm evaluation ( July 7-11)
Week 8 (July 13 - 19) : Start Patch Series 2 for adding other object info
fields as per the discussion started in Week 1.
Week 9 (July 20 - 26) : Review feedback and refine.
Week 10 (July 27 - August 2) : Review feedback and refine.
Week 11 (August 3 - 9) : Finalize all tests and Doc changes.
Week 12 (August 10 - 16) : Prepare Final report.
Week 13 (August 17 - 23) : Final Evaluation ( Aug 18-24 )
## Availability
My current semester ends in the first week of May, so I will be
able to contribute 7-8 hours per day, totalling around 35-40 hrs a week
on the project.
Total weeks = 13, total hours = 35*13 = 455
This leaves plenty of room to accommodate any unforeseen circumstances
that may arise during the project.
## RFC
Hi Christian and Karthik !
I still feel that getting remote info for a single object might be useful,
and I think this might be where I can add this functionality:
When someone does `GIT_NO_LAZY_FETCH=0 git cat-file -s <oid>`
and the oid is of a blob that is not available locally, git simply fetches
the blob and reruns git cat-file -s.
But if someone does `GIT_NO_LAZY_FETCH=1 git cat-file -s <oid>`
and the blob is not available locally, it exits with the following error:
> if (git_env_bool(NO_LAZY_FETCH_ENVIRONMENT, 0)) {
> static int warning_shown;
> if (!warning_shown) {
> warning_shown = 1;
> warning(_("lazy fetching disabled; some objects may not be available"));
> }
> return -1;
> }
Would it be useful behaviour if, instead of exiting with an error, it sent
a remote-object-info request for that single object?
Thank you for your time in reviewing my proposal as well as considering
my application. I am excited to learn everything I can from git.
Thanks and Regards,
Soutrik
[1] : pull.2187.git.git.1770293021383.gitgitgadget@gmail.com
[2] : 20260209172445.39536-1-valusoutrik@gmail.com
[3] : 20260225190306.39358-1-valusoutrik@gmail.com
[4] : 20240628190503.67389-1-eric.peijian@gmail.com
[5] : 20220728230210.2952731-1-calvinwan@google.com
* [GSoC] Proposal: Complete and extend the remote-object-info command for git cat-file
@ 2026-03-13 10:17 Pablo
2026-03-14 5:58 ` Chandra Pratap
0 siblings, 1 reply; 14+ messages in thread
From: Pablo @ 2026-03-13 10:17 UTC (permalink / raw)
To: git, christian.couder, karthik nayak, jltobler, Ayush Chandekar,
Siddharth Asthana, Chandra Pratap
## Synopsis
This project finishes Eric Ju's work on `remote-object-info` for `git
cat-file --batch-command` [1], resolves the pending feedback from
Junio Hamano [2] and Jeff King [3] [4] [5], and extends support for
`%(objecttype)`.
Expected project size: 350 hours (Medium)
## About Me and Contact
Name: Pablo Sabater Jiménez (he/him)
Age: 19
Education: Currently in my second year of Computer Science at the
University of Murcia, Spain
Location: Murcia, Spain (CET, UTC+1)
Languages: C (solid), shell(bash) (good)
Tools: git(proficient)
I've checked that I'm eligible for GSoC 2026.
Email: pabloosabaterr@gmail.com
GitHub: https://github.com/pabloosabaterr
## Relevant Projects
- 16 bit CPU emulator. Good example of C programming.
cpu: https://github.com/pabloosabaterr/CPU16
- Compiler. Good example of working on bigger projects.
compiler: https://github.com/pabloosabaterr/Orn
## Pre-GSoC Work
### Introduction
**[GSoC] Introduction Pablo Sabater**
https://lore.kernel.org/git/CAN5EUNR0KJ4VeuOF_bVupaTuGKGaeTKa0SMRAUoBPo5wWi8YGA@mail.gmail.com
A mailing list thread where I introduced myself to the git community.
### Microproject
**[GSoC PATCH v4] t9200: replace test -f/-d with modern path helpers**
https://lore.kernel.org/git/20260312173305.15112-1-pabloosabaterr@gmail.com/
Merged to `next` on 2026-03-12 at 8500bdf172. Replaces `test -f` with
helper `test_path_is_file`, which makes debugging failing tests easier
with better reporting.
Done as one of the suggested microprojects.
### Other contributions
**[GSoC PATCH v2] test-lib: print escape sequence names**
https://lore.kernel.org/git/20260311031442.11942-1-pabloosabaterr@gmail.com/
Will merge to `next`. In the output of failed expected/actual checks,
escape sequences were shown as their octal codes. This patch prints
the actual escape sequence name instead, adds tests, and updates the
expected output.
**[GSoC PATCH] t9200: handle missing CVS with skip_all**
https://lore.kernel.org/git/20260311194002.190195-1-pabloosabaterr@gmail.com/
Merged to `next` on 2026-03-12 at 8500bdf172. Wraps CVS setup in a
skip_all for clearer failure reporting and moves Git initialization
into its own test_expect_success.
**[GSoC] Re: [PATCH v11 8/8] cat-file: add remote-object-info to batch-command**
https://lore.kernel.org/git/20260312214154.89120-1-pabloosabaterr@gmail.com/
While testing Eric's v11 I found and reported a new bug: when
`remote-object-info` is preceded by a local query, `data->type`
isn't cleared, causing it to return the wrong type.
I have also studied the documentation provided and Eric Ju's work from
v0 to v11, including all the feedback he got up to March 2025 from
Junio Hamano and Jeff King, taking notes about what's left to be done
and what else I can contribute to the already proposed project. That's
how I identified everything that I will address in the Problem,
Solution and Timeline sections.
I built Eric Ju's v11 and tested the bugs reported against his patch [5].
I confirmed the segfault and the `die()`, and found a new one:
- When a local `info` runs before `remote-object-info` sharing the
same format string, `data->type` isn't cleared. When a blob is queried
remotely after a local commit, `data->type` for the blob stays 'commit'
with no error. I reported it on the mailing list [6].
I tried rebasing Eric Ju's v11 onto master and got conflicts
in 4 of the 8 commits:
- `d04cf85ece` t1006: split test utility functions into new "lib-cat-file.sh".
- `t/t1006-cat-file.sh`
- `d918f720d8` fetch-pack: refactor packet writing.
- `fetch-pack.c`
- `2daf9ed803` transport: add client support for object-info.
- `Makefile`
- `c3ba4afaf6` cat-file: add remote-object-info to batch-command.
- `object-file.c`, `object-store-ll.h` (deleted).
I'm active on the mailing list, learning Git's workflow, and learning
from the feedback I've received from the maintainer (Junio) on
my patches.
Following the project guidelines, I haven't done anything on the
project that could step on other candidates' work before being
accepted, and instead I'm focusing on understanding the project and
its needs, and independent patches that will make the Git project more
familiar and understandable to me.
## Availability
My classes end the first week of May. From then until September I
won't have any classes which leaves me free to fully focus on the
project. I can dedicate 8+ hours each day, and for sure 40 hours a
week.
## The Problem
Git's partial clone allows cloning repositories without downloading
all objects (blobs, trees, ...). These objects are fetched on demand
from the remote when needed. However, when a user needs metadata about
these remote objects (size, type, hash, ...), Git has no efficient way
of doing this without downloading all the object content.
The server side support for `object-info` protocol was implemented by
Calvin Wan in 2021. Eric Ju built the client-side `remote-object-info`
for `cat-file --batch-command`. Eric Ju's work remains unmerged after
v11 because of these issues:
- The format validation uses `strstr()`, which only checks for
`%(objectsize)`. This causes two different errors:
- For atoms that `expand_atom()` recognizes but the remote doesn't
support (`objecttype`, `deltabase`, ...), `expand_atom()` returns 1, but
`data->type` contains only garbage, so accessing it causes a segfault,
as Jeff King noted [3].
- For atoms unknown to `expand_atom()`, it returns 0, so
`expand_format()` calls `strbuf_expand_bad_format()`, which calls `die()`,
as Jeff King found [3].
Both cases block the command, including local `info` queries if the
same format string is shared. Unsupported remote placeholders should
return an empty string, matching how `for-each-ref` returns empty for
known but inapplicable atoms like `%(tagger)` on non-tags [4] [5].
- When local and remote queries are mixed, `data->type` is not
cleared between commands. `remote-object-info` returns stale type
data left over from a previous local query [6].
- Style and code issues marked by Junio Hamano [2] and Jeff King [3]
[5] are still unaddressed:
- comment style.
- `#define` formatting.
- line length.
- misleading error messages.
- missing `count > MAX_ALLOWED_OBJ_LIMIT` check after `split_cmdline()`.
- if/else inversion in `get_remote_info()`.
- `%(objecttype)` is not yet supported on either client or server side.
## The Solution
There are two main goals:
### Goal 1: Rebase and finish Eric's work
Starting from where Eric Ju left off, I will rebase it on top of the
current `master` branch and address the remaining feedback:
- Fix style in comments, `#define` formatting and line length.
- Fix misleading error message in the overflow check.
- Add missing `count > MAX_ALLOWED_OBJ_LIMIT` check after `split_cmdline()`.
- Invert if/else on `get_remote_info()` to keep the small block first
(the error one) as Junio suggested.
#### Replace `strstr()` format validation with allow_list in `expand_atom()`
`strstr()` isn't enough to fully validate the placeholders: it only
searches for `%(objectsize)`, and unsupported placeholders cause
segfaults. The fix is to replace the validation with an allow_list in
`expand_atom()`. But why `expand_atom()` when Jeff King suggested
`expand_atom()` or `expand_format()` [4]?
- There are two failure cases: the first happens inside `expand_atom()`
before it returns (segfault), and the second is the `die()` call made
when `expand_atom()` returns 0.
Placing the `allow_list` at the top of `expand_atom()` prevents both:
in remote mode, for atoms outside the list it appends nothing to `sb`
and returns 1, so `data->type` is never accessed (no segfault) and
`expand_format()` never reaches `die()`.
As extra safety, initializing `data->type` to `OBJ_BAD` and checking
for `NULL` from `type_name()` ensures that, even without the `allow_list`,
uninitialized data doesn't cause a segfault.
In Goal 1, only `%(objectname)` and `%(objectsize)` will be in the
allow_list. Goal 2 will bring `%(objecttype)` support.
### Goal 2: Adding `%(objecttype)`
Following what Calvin Wan did in 2021 for `%(objectsize)`, the v2 protocol
needs to be extended on the server side to support the new
`%(objecttype)` placeholder:
- extend `object_info_advertise()` at `serve.c`
- add .type to `requested_info` struct at `serve.c`
- support `type` in `cap_object_info()` at `protocol-caps.c`
- look for type at `send_info()` at `protocol-caps.c`
Following the object-info protocol docs [7], it should look like:
```
attrs = "size" SP "type"
obj-type = "blob" | "tree" | "commit" | "tag"
obj-info = obj-id SP obj-size SP obj-type
info = PKT-LINE(attrs LF)
*PKT-LINE(obj-info LF)
```
`%(objecttype)` needs to be added to the `allow_list`. The client side
needs to learn to ask for `%(objecttype)` from the remote, parse what has
been received and fill `expand_data` with the actual type. This makes
it return the object type instead of the empty string returned while
it was unsupported.
Default format evolves to `%(objectname) %(objecttype) %(objectsize)`.
Test and document new placeholder support and server side extension.
#### Backward Compatibility
There are four possible scenarios between client and server:
1. The server doesn't know type (new client but old server):
After receiving the server capabilities, a client will only request
what the server advertises. The `allow_list` would handle this,
returning an empty string when the server doesn't support it.
2. The server knows type but the client doesn't (new server but old client):
Following `gitprotocol-v2.adoc`, "Clients must ignore all unknown
keys", it will ignore type, and request only the known capabilities.
3. Both know type (new client and new server):
The server advertises type, the client requests it and gets the type data.
4. Both know type but protocol middleware doesn't (new client, new
server but old middleware):
If a server advertises type but the client doesn't receive it, the
client won't ask for anything unadvertised. If a client asks for type
but the server doesn't receive the request, the server will only
return the known capabilities.
**Performance considerations**
To get an object type, we only have to look at the header. To get the
size, `oid_object_info()` in `object-file.c` is already being called, and
it returns the object type in the same call. Sending the type string
costs, in the worst case, 6 bytes for the "commit" string.
## Timeline
I've designed this timeline with enough slack that the actual work may
finish earlier than stated here.
May 1-24: Community Bonding
- Meet with my assigned mentor to get feedback about my proposal,
agree on how I will report my progress apart from the code
submitted and possible blogs, and pick up tips and tricks for working
better on Git.
- Confirm with mentor that the `allow_list` approach is still the best option.
- Draft commits structure.
Week 1-2: (May 26 - June 8)
- Rebase Eric Ju's v11 on top of current `master`.
- Work on style fixes: comments, `#define` formatting, line length.
- Fix the wrong error message in the overflow check.
- Add missing check `count > MAX_ALLOWED_OBJ_LIMIT` after `split_cmdline()`.
- Invert if/else in `get_remote_info()`.
- Send first patch.
Week 3-4: (June 9 - June 22)
- Implement `allow_list` in `expand_atom()` using `is_atom()` in remote-mode.
- Initialize `data->type` to `OBJ_BAD` and add null check at `type_name()`.
- Implement empty string return for unsupported placeholders.
- Tests for supported placeholders, unsupported, mix, and the intermix
case `info` + `remote-object-info` with the same format string.
- Work with feedback from the first patch.
Week 5-6: (June 23 - July 6):
- Continue with review feedback.
- Goal 1 should be polished or close to the final form.
- Prepare the midterm report.
Midterm evaluation (July 7 - 11) as specified in the GSoC timeline docs
- Goal 1 submitted; keep working on feedback.
Week 7-8: (July 14 - July 27)
- Begin Goal 2.
- Extend server side v2 protocol to serve `%(objecttype)`, following
`%(objectsize)` structure.
- Test server side.
Week 9-10: (July 28 - August 10)
- Add `%(objecttype)` to the `allow_list` from Goal 1.
- Extend client side to ask for `%(objecttype)` from remote on `object-info`.
- Parse server answer and fill `expand_data` with the actual type.
- End to end tests and documentation.
- Default format becomes `%(objectname) %(objecttype) %(objectsize)`.
- Send patch series.
Week 11-12: (August 11 - August 24)
- Work with Goal 2 feedback from the patches.
- Polish everything, all tests pass, good test coverage, no
style/comment mistakes.
- Final documentation review.
- Prepare for final evaluation.
Final evaluation (August 18-24) as specified on GSoC timeline docs
### Additional objectives
If there is enough time, or as future work after the project, I have
some ideas on how this could evolve:
#### More placeholders support
I've checked that Eric's v11 patch only supports `%(objectsize)` on
the server side, but on the client side there are other placeholders that
can be added too. With the `allow_list` and Goal 2 implemented,
adding more placeholders becomes straightforward.
- `%(objectsize:disk)`: Returns the size on disk (compressed or as
a delta) instead of the uncompressed size that `%(objectsize)`
returns. To do this, the server would need to send the actual
on-disk size.
- `%(deltabase)`: Returns the delta base object OID. Non-delta objects
return the zero OID, as they do locally.
#### Returning missing blobs from a tree, ordered by size
In a partial clone, someone might want to know what blobs are missing
inside a concrete tree and their size before fetching them.
The idea is to build on top of `remote-object-info`:
Given a tree hash, return the missing blobs (inside that tree) ordered by size.
Thanks for reading my proposal and considering my application. I'm
very excited about this opportunity,
Pablo
[1]: https://lore.kernel.org/git/20250221190451.12536-1-eric.peijian@gmail.com/
"Eric Ju's v11 patch"
[2]: https://lore.kernel.org/git/xmqqo6yr3wc4.fsf@gitster.g/ "Junio
Hamano feedback"
[3]: https://lore.kernel.org/git/20250224234720.GC729825@coredump.intra.peff.net/
"Jeff King feedback"
[4]: https://lore.kernel.org/git/20250313060250.GH94015@coredump.intra.peff.net/
"options for strstr() by Jeff King"
[5]: https://lore.kernel.org/git/20250324033922.GB690093@coredump.intra.peff.net/
"Jeff King follow-up"
[6]: https://lore.kernel.org/git/20260312214154.89120-1-pabloosabaterr@gmail.com/
"data->type not being cleared bug"
[7]: https://github.com/git/git/blob/master/Documentation/gitprotocol-v2.adoc#object-info
"object-info protocol docs"
* Re: [GSoC] Proposal: Complete and extend the remote-object-info command for git cat-file
2026-03-13 10:17 [GSoC] Proposal: " Pablo
@ 2026-03-14 5:58 ` Chandra Pratap
2026-03-14 18:31 ` Pablo
0 siblings, 1 reply; 14+ messages in thread
From: Chandra Pratap @ 2026-03-14 5:58 UTC (permalink / raw)
To: Pablo
Cc: git, christian.couder, karthik nayak, jltobler, Ayush Chandekar,
Siddharth Asthana
Hi Pablo,
On Fri, 13 Mar 2026 at 15:47, Pablo <pabloosabaterr@gmail.com> wrote:
>
> ## Synopsis
>
> This project finishes Eric Ju's work on `remote-object-info` for `git
> cat-file --batch-command` [1], resolves the pending feedback from
> Junio Hamano [2] and Jeff King [3] [4] [5], and extends support for
> `%(objecttype)`.
>
> Expected project size: 350 hours (Medium)
> ## About Me and Contact
>
> Name: Pablo Sabater Jiménez (he/him)
>
> Age: 19
>
> Education: Currently on my second Computer Science year at University
> of Murcia, Spain
>
> Location: Murcia, Spain (CET, UTC+1)
>
> Languages: C (solid), shell(bash) (good)
>
> Tools: git(proficient)
>
> I've checked that I'm eligible for GSoC 2026.
>
> Email: pabloosabaterr@gmail.com
> GitHub: https://github.com/pabloosabaterr
>
> ## Relevant Projects
>
> - 16 bit CPU emulator. Good example of C programming.
>
> cpu: https://github.com/pabloosabaterr/CPU16
>
> - Compiler. Good example of working on bigger projects.
>
> compiler: https://github.com/pabloosabaterr/Orn
>
Thanks for your interest in contributing to Git this GSoC!
> ## Pre-GSoC Work
>
> ### Introduction
>
> **[GSoC] Introduction Pablo Sabater**
>
> https://lore.kernel.org/git/CAN5EUNR0KJ4VeuOF_bVupaTuGKGaeTKa0SMRAUoBPo5wWi8YGA@mail.gmail.com
>
> A mailing list thread where I introduced myself to the git community.
Nit: Could use a newline here.
> ### Microproject
>
> **[GSoC PATCH v4] t9200: replace test -f/-d with modern path helpers**
>
> https://lore.kernel.org/git/20260312173305.15112-1-pabloosabaterr@gmail.com/
>
> Merged to `next` on 2026-03-12 at 8500bdf172. Replaces `test -f` with
> helper `test_path_is_file`, which makes debugging failing tests easier
> with better reporting.
> As suggested as microproject.
>
> ### Other contributions
>
> **[GSoC PATCH v2] test-lib: print escape sequence names**
>
> https://lore.kernel.org/git/20260311031442.11942-1-pabloosabaterr@gmail.com/
>
> Will merge to `next`, in failed expected/actual checks printing, the
> escape sequences were shown as their octal code. This patch fixes that
> to print the actual escape sequence name, adds tests, and updates the
> expected output.
>
> **[GSoC PATCH] t9200: handle missing CVS with skip_all**
>
> https://lore.kernel.org/git/20260311194002.190195-1-pabloosabaterr@gmail.com/
>
> Merged to `next` on 2026-03-12 at 8500bdf172, wraps CVS setup in a
> skip_all for clearer failure reporting and moves Git initialization
> into its own test_expect_success.
>
> **[GSoC] Re: [PATCH v11 8/8] cat-file: add remote-object-info to batch-command**
>
> https://lore.kernel.org/git/20260312214154.89120-1-pabloosabaterr@gmail.com/
>
> While testing Eric's v11 I've found and reported a new bug. On
> `remote-object-info` when it's preceded by a local query, `data->type`
> isn't being cleared. Causing it to return the wrong type.
>
> I have also studied the documentation provided and Eric Ju's work from
> v0 to v11 including all the feedback he got up to March 2025, the
> feedback he got from Junio Hamano and Jeff King, taking notes about
> what's left to be done and what else I can contribute to the already
> proposed project. That's how I've identified everything that I will
> address on the Problem, Solution and Timeline sections.
>
> I built Eric Ju's v11 and tested the bugs reported to his patch [5],
> I've confirmed the segfault and the `die()`, and found a new one:
> - When a local `info` runs before `remote-object-info` sharing the
> same format string, `data->type` isn't being cleared. A blob queried
> remotely after a local commit, `data->type` for blob becomes 'commit'
> with no error. I reported it on the mailing list [6].
>
> I attempted to test rebasing Eric Ju's v11 to master and got conflicts
> on 4 out of the 8 commits:
> - `d04cf85ece` t1006: split test utility functions into new "lib-cat-file.sh".
> - `t/t1006-cat-file.sh`
> - `d918f720d8` fetch-pack: refactor packet writing.
> - `fetch-pack.c`
> - `2daf9ed803` transport: add client support for object-info.
> - `Makefile`
> - `c3ba4afaf6` cat-file: add remote-object-info to batch-command.
> - `object-file.c`, `object-store-ll.h` (deleted).
>
> I'm being active on the mailing list and learning the Git flow of work
> and from the feedback I've received from the maintainers (Junio) from
> my patches.
>
> Following the project guidelines, I haven't done anything on the
> project that could step on other candidates' work before being
> accepted, and instead I'm focusing on understanding the project and
> its needs, and independent patches that will make the Git project more
> familiar and understandable to me.
Great work! It would help if you could split the description of your patches
into Status, Description, Comments, etc. It helps a lot when reviewing the
proposal.
>
> ## Availability
>
> My classes end the first week of May. From then until September I
> won't have any classes which leaves me free to fully focus on the
> project. I can dedicate 8+ hours each day, and for sure 40 hours a
> week.
>
> ## The Problem
>
> Git's partial clone allows cloning repositories without downloading
> all objects (blobs, trees, ...). These objects are fetched on demand
> from the remote when needed. However, when a user needs metadata about
> these remote objects (size, type, hash, ...), Git has no efficient way
> of doing this without downloading all the object content.
>
> The server side support for `object-info` protocol was implemented by
> Calvin Wan in 2021. Eric Ju built the client-side `remote-object-info`
> for `cat-file --batch-command`.
This part is likely more relevant in the 'Synopsis' section up top. It provides
important context that helps the reader tune their expectations for the rest
of the proposal.
From my experience, a good rule of thumb when writing a proposal is to
assume the reader doesn't know anything about the project or the problem
it tackles beforehand.
> Eric Ju's work remains unmerged after
> v11 because of these issues:
>
> - The format validation uses `strstr()` which only checks for
> `%(objectsize)`. This causes two different errors:
> - Atoms that `expand_atom()` recognizes but the remote doesn't
> (`objecttype`,`deltabase`, ...), `expand_atom()` returns 1, but when
> accessing `data->type` it only contains garbage, causing segfault. as
> Jeff King noted [3].
Grammar nit: should be 'garbage causing segfault, as Jeff King noted[3].'
The sentence could also use some restructuring for better clarity.
It is great that you've referenced the relevant discussion thread here.
> - Unknown atoms by `expand_atom()`, returns 0, calling
> `strbuf_expand_bad_format` on `expand_format()`, which calls `die()`,
> as Jeff King found [3].
> Both cases block the command, including local `info` queries if the
> same format string is shared. Unsupported remote placeholders should
> return an empty string, matching how `for-each-ref` returns empty for
> known, but inapplicable atoms like `%(tagger)` on non-tags [4] [5].
>
> - When local and remote queries are mixed, `data->type` is not being
> cleared between commands. `remote-object-info` returns the wrong type
> data from a previous local query [6].
>
You've mentioned the outstanding issues and their implications for the end user.
Good work.
> - Style and code issues marked by Junio Hamano [2] and Jeff King [3]
> [5] are still undone.
> - comment style.
> - `#define` formatting.
> - line length.
> - misleading error messages.
> - missing `count > MAX_ALLOWED_OBJ_LIMIT` check at `split_cmdline().`
> - if/else invert at `get_remote_info()`.
> - `%(objecttype)` is not yet supported on either client or server side.
>
> ## The Solution
>
> There are two main goals:
>
> ### Goal 1: Rebase and finish Eric's work
>
> Starting from where Eric Ju left off, I will rebase it on top of the
> current `master` branch and address the feedback left to do:
> - Fix style in comments, `#define` formatting and line length.
> - Fix misleading error message in the overflow check.
> - Add missing `count > MAX_ALLOWED_OBJ_LIMIT` check after `split_cmdline()`.
> - Invert if/else on `get_remote_info()` to keep the small block first
> (the error one) as Junio suggested.
> #### Replace `strstr()` format validation with allow_list in `expand_atom()`
Nit: Could use a newline here.
>
> `strstr()` isn't enough to fully validate the placeholders, it only
> searches for `%(objectsize)` and unsupported placeholders cause
> segfaults. The fix is to refactor the validation with an allow_list in
> `expand_atom()`.
It is great if this is your idea, but if not, it would help to credit the
person who suggested this and link to the relevant discussion, if
applicable.
> But why `expand_atom()` when Jeff King suggested
> `expand_atom()` or `expand_format()` [4] ?
> - There are two cases, first, inside `expand_atom()` before returning
> (segfault) and second, calls `die()` when `expand_atom()` returns 0.
> Placing the `allow_list` at the top of `expand_atom()` prevents both
> errors, on remote mode, append nothing to `sb` and return 1, accessing
> `data->type` won't cause segfault and prevents `expand_format()` from
> reaching `die()`.
> As extra safety, initializing `data->type` to `OBJ_BAD` and check
> for `NULL` from `type_name()` makes it that even without `allow_list`,
> uninitialized data doesn't cause a segfault.
> At Goal 1, only `%(objectname)` and `%(objectsize)` will be in the
> allow_list. Goal 2 will bring `%(objecttype)` support.
> ### Goal 2: Adding `%(objecttype)`
Nit: Newline here as well.
>
> following what Calvin Wan did in 2021 for `%(objectsize)`, v2 protocol
Grammar nit: [F]ollowing.
> needs to be extended on the server side to support the new
> `%(objecttype)` placeholder:
> - extend `object_info_advertise()` at `serve.c`
> - add .type to `requested_info` struct at `serve.c`
> - support `type` in `cap_object_info()` at `protocol-caps.c`
> - look for type at `send_info()` at `protocol-caps.c`
>
> following object-info protocol docs [7] it should look like:
Here as well.
> ```
> attrs = "size" SP "type"
> obj-type = "blob" | "tree" | "commit" | "tag"
> obj-info = obj-id SP obj-size SP obj-type
> info = PKT-LINE(attrs LF)
> *PKT-LINE(obj-info LF)
> ```
>
> `%(objecttype)` needs to be added to the `allow_list`. Client side
> needs to learn to ask for `%(objecttype)` from remote, parse what has
> been received and fill `expand_data` with the actual type. This makes
> it return the object type instead of the empty string returned while
> it was unsupported.
>
> Default format evolves to `%(objectname) %(objecttype) %(objectsize)`.
> Test and document new placeholder support and server side extension.
>
Makes sense.
> #### Backward Compatibility
>
> There are four possible scenarios to happen between client and server:
> 1. The server doesn't know type (new client but old server):
>
> After receiving the server capabilities, a client will only request
> what the server advertises. The `allow_list` would handle this,
> returning an empty string when the server doesn't support it.
> 2. The server knows type but the client doesn't (new server but old client):
>
> Following `gitprotocol-v2.adoc`, "Clients must ignore all unknown
> keys", it will ignore type, and request only the known capabilities.
> 3. Both know type (new client and new server):
>
> Server advertises type, client requests it and gets the type data.
> 4. Both know type but protocol middleware doesn't (new client, new
> server but old middleware):
>
> If a server advertises type but client doesn't receive type, a
> client won't ask for anything unadvertised, if a client asks for type
> but the server doesn't receive it, it will only return the known
> capabilities.
>
This section makes sense as well, could use better formatting though.
> **performance considerations**
>
> To get an object type, we have to look only at the header, to get the
> size `oid_object_info()` at `object-file.c` is being called which
> already returns the object type in the same call. Sending the string
> with the type will only be, worst case scenario 6 bytes for the
> "commit" string.
> ## Timeline
>
Nit: newline.
> I've designed this to work with enough time so final work can be
> shorter than what's said here
>
> May 1-24: Community Bonding
> - Talk and meet with mentor that I'm assigned with, to get feedback
> about my proposal, how I will report my progress apart from the code
> submitted and possible blogs, and tips and tricks to work better at
> Git.
> - Confirm with mentor that the `allow_list` approach is still the best option.
> - Draft commits structure.
It would also be helpful if you continue working on your patches that haven't
been merged yet from your pre-GSoC efforts. The goal of Community
Bonding Period is to interact with the wider community as much as possible,
and what better way to do that other than engaging through patches.
Also, GSoC/Git requires you to write weekly blog posts detailing your work,
what's holding you back, etc. So it's good if you use this time to set up your
blog, if you don't have one already.
>
> Week 1-2: (May 26 - June 8)
> - Rebase Eric Ju's v11 on top of current `master`.
> - Work on style fixes: comments, `#define` formatting, line length.
> - Fix the wrong error message in the overflow check.
> - Add missing check `count > MAX_ALLOWED_OBJ_LIMIT` after `split_cmdline()`.
> - Invert if/else in `get_remote_info()`.
These four points are specifics of how you're going to tackle the
'Style Issues' problem you mentioned above. I don't think there's any
benefit in reiterating them here.
A single 'Fix the style and code issues.' or something similar would be better.
> - Send first patch.
>
> Week 3-4: (June 9 - June 22)
> - Implement `allow_list` in `expand_atom()` using `is_atom()` in remote-mode.
> - Initialize `data->type` to `OBJ_BAD` and add null check at `type_name()`.
> - Implement empty string return for unsupported placeholders.
> - Tests for supported placeholders, unsupported, mix, and the intermix
> case `info` + `remote-object-info` with the same format string.
> - Work with feedback from the first patch.
Again, specifics of the implementation plan don't need reiteration.
>
> Week 5-6: (June 23 - July 6):
> - Continue with review feedback.
> - Goal 1 should be polished or close to the final form.
> - Prepare the midterm report.
>
> Midterm evaluation (July 7 - 11) as specified on GSoC timeline docs
> - Goal 1 submitted and keep work with feedback.
You could probably dedicate this time to start working on Goal 2.
Addressing feedback is something that occurs spontaneously and
doesn't need dedicated slots in your timeline.
> Week 7-8: (July 14 - July 27)
> - Begin Goal 2.
> - Extend server side v2 protocol to serve `%(objecttype)`, following
> `%(objectsize)` structure.
> - Test server side.
>
> Week 9-10: (July 28 - August 10)
> - Add `%(objecttype)` to the `allow_list` from Goal 1.
> - Extend client side to ask for `%(objecttype)` from remote on `object-info`.
> - Parse server answer and fill `expand_data` with the actual type.
> - End to end tests and documentation.
> - Default format becomes `%(objectname) %(objecttype) %(objectsize)`.
> - Send patch series.
>
> Week 11-12: (August 11 - August 24)
> - Work with Goal 2 feedback from the patches.
> - Polish everything, all tests pass, good test coverage, no
> style/comment mistakes.
> - Final documentation review.
> - Prepare for final evaluation.
>
> Final evaluation (August 18-24) as specified on GSoC timeline docs
>
> ### Additional objectives
>
> If there is enough time, or for future work after the project. I've
> some ideas on how this could evolve:
> #### More placeholders support
> I've checked that Eric's v11 patch only supports `%(objectsize)` on
> server side, but on the client side there are other placeholders that
> can be added too. with the `allow_list` and having Goal 2 implemented
> adding more placeholders becomes trivial.
>
> - `%(objectsize:disk)`: Returns the size on the disk (compressed or as
> a delta) instead of returning the uncompressed size that
> `%(objectsize)` does. To do this, the server would need to send what's
> the actual size on disk data.
>
> - `%(deltabase)`: Returns the delta base object OID. non delta objects
> return zero OID as it does on local.
>
> #### Returning missing blobs from a tree ordered
> In a partial clone, someone might want to know what blobs are missing
> inside a concrete tree and their size before fetching them.
> The idea is to build on top of `remote-object-info`:
> Given a tree hash, return the missing blobs (inside that tree) ordered by size.
>
> Thanks for reading my proposal and considering my application. I'm
> very excited about this opportunity,
> Pablo
>
> [1]: https://lore.kernel.org/git/20250221190451.12536-1-eric.peijian@gmail.com/
> "Eric Ju's v11 patch"
>
> [2]: https://lore.kernel.org/git/xmqqo6yr3wc4.fsf@gitster.g/ "Junio
> Hamano feedback"
>
> [3]: https://lore.kernel.org/git/20250224234720.GC729825@coredump.intra.peff.net/
> "Jeff King feedback"
>
> [4]: https://lore.kernel.org/git/20250313060250.GH94015@coredump.intra.peff.net/
> "options for strstr() by Jeff King"
>
> [5]: https://lore.kernel.org/git/20250324033922.GB690093@coredump.intra.peff.net/
> "Jeff King follow-up"
>
> [6]: https://lore.kernel.org/git/20260312214154.89120-1-pabloosabaterr@gmail.com/
> "data->type not being cleared bug"
>
> [7]: https://github.com/git/git/blob/master/Documentation/gitprotocol-v2.adoc#object-info
> "object-info protocol docs"
Overall, great work on the proposal so far! Other than a few stylistic
mishaps, the proposal looks pretty strong already.
You should upload your proposal on the GSoC website and add the link to it here.
The proposal can be then updated later as many times as you like.
Regards,
Chandra.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [GSoC] Proposal: Complete and extend the remote-object-info command for git cat-file
2026-03-14 5:58 ` Chandra Pratap
@ 2026-03-14 18:31 ` Pablo
2026-03-15 9:20 ` Chandra Pratap
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Pablo @ 2026-03-14 18:31 UTC (permalink / raw)
To: Chandra Pratap
Cc: git, christian.couder, karthik nayak, jltobler, Ayush Chandekar,
Siddharth Asthana
Hi Chandra, thanks a lot for the feedback! :)
> You should upload your proposal on the GSoC website and add the link to it here.
> The proposal can be then updated later as many times as you like.
GSoC proposal submission opens on March 16th, so for now I'll send my
v2 here. As soon as I can, I'll switch to the GSoC website and send the
link to the thread.
To save you from rereading everything, this is what changed since v1:
- Moved the context explanation from The Problem to Synopsis, and moved
Availability below About Me and Contact.
- Split the Pre-GSoC patches into status (for code patches) and
description to improve readability.
- Added a code review and the proposal thread to the Pre-GSoC section.
- Added new lines where noted and fixed capitalization.
- Correctly credited Jeff King for the allow_list idea and added a new
reference [8] for Calvin Wan's work.
- Community bonding now includes continuing patches and setting up a blog.
- Removed most of the detail that the Timeline duplicated from The
Problem (it feels a bit empty now, though).
Here is my v2 with the requested changes:
## Synopsis
Git's partial clone allows cloning repositories without downloading
all objects (blobs, trees, ...). These objects are fetched on demand
from the remote when needed. However, when a user needs metadata about
these remote objects (size, type, hash, ...), Git has no efficient way
of doing this without downloading all the object content.
The server side support for `object-info` protocol was implemented by
Calvin Wan in 2021 [8]. Eric Ju built the client-side
`remote-object-info` for `cat-file --batch-command`.
This project finishes Eric Ju's work on `remote-object-info` for `git
cat-file --batch-command` [1], resolves the pending feedback from
Junio Hamano [2] and Jeff King [3] [4] [5], and extends support for
`%(objecttype)`.
Expected project size: 350 hours (Medium)
## About Me and Contact
Name: Pablo Sabater Jiménez (he/him)
Age: 19
Education: Currently in my second year of Computer Science at the
University of Murcia, Spain
Location: Murcia, Spain (CET, UTC+1)
Languages: C (solid), shell (bash) (good)
Tools: git (proficient)
I've checked that I'm eligible for GSoC 2026.
Email: pabloosabaterr@gmail.com
GitHub: https://github.com/pabloosabaterr
## Availability
My classes end the first week of May. From then until September I
won't have any classes, which leaves me free to fully focus on the
project. I can dedicate 8+ hours a day, and at least 40 hours a week.
## Relevant Projects
- 16 bit CPU emulator. Good example of C programming.
cpu: https://github.com/pabloosabaterr/CPU16
- Compiler. Good example of working on bigger projects.
compiler: https://github.com/pabloosabaterr/Orn
## Pre-GSoC Work
### Introduction
**[GSoC] Introduction Pablo Sabater**
https://lore.kernel.org/git/CAN5EUNR0KJ4VeuOF_bVupaTuGKGaeTKa0SMRAUoBPo5wWi8YGA@mail.gmail.com
**Description**: A mailing list thread where I introduced myself to
the git community.
### Microproject
**[GSoC PATCH v4] t9200: replace test -f/-d with modern path helpers**
https://lore.kernel.org/git/20260312173305.15112-1-pabloosabaterr@gmail.com/
**Status**: Merged to `next` on 2026-03-12 at `8500bdf172`.
**Description**: Replaces `test -f` with the `test_path_is_file` helper,
which reports failures more clearly and makes failing tests easier to
debug. Taken from the suggested microprojects.
### Other contributions
**[GSoC PATCH v2] test-lib: print escape sequence names**
https://lore.kernel.org/git/20260311031442.11942-1-pabloosabaterr@gmail.com/
**Status**: Will merge to `next`.
**Description**: When an expected/actual check failed, escape
sequences were printed as their octal codes. This patch prints the
actual escape sequence name instead, adds tests, and updates the
expected output.
**[GSoC PATCH] t9200: handle missing CVS with skip_all**
https://lore.kernel.org/git/20260311194002.190195-1-pabloosabaterr@gmail.com/
**Status**: Merged to `next` on 2026-03-12 at `8500bdf172`.
**Description**: Wraps CVS setup in a skip_all for clearer failure
reporting and moves Git initialization into its own
test_expect_success.
**Re: [PATCH] gc: add git maintenance list command**
https://lore.kernel.org/git/20260313115932.15259-1-pabloosabaterr@gmail.com/
**Description**: Code review of a submitted patch.
**[GSoC] Proposal: Complete and extend remote-object-info for git cat-file**
https://lore.kernel.org/git/CAN5EUNQKv-LCkbY+5scn6pk6fL8kpmjNR=66rjeY=NqKbqRkhA@mail.gmail.com/
**Description**: Proposal draft thread.
**[GSoC] Re: [PATCH v11 8/8] cat-file: add remote-object-info to batch-command**
https://lore.kernel.org/git/20260312214154.89120-1-pabloosabaterr@gmail.com/
**Description**: While testing Eric's v11 I found and reported a new
bug. When `remote-object-info` is preceded by a local query,
`data->type` isn't cleared, so it returns the wrong type.
I have also studied the provided documentation and Eric Ju's work from
v0 to v11, including all the feedback he got up to March 2025 from
Junio Hamano and Jeff King, taking notes on what's left to be done and
what else I can contribute to the already proposed project. That's how
I identified everything that I will address in the Problem, Solution
and Timeline sections.
I built Eric Ju's v11 and tested the bugs reported against his patch [5].
I confirmed the segfault and the `die()`, and found a new one:
- When a local `info` runs before `remote-object-info` and both share
the same format string, `data->type` isn't cleared. If a blob is
queried remotely after a local commit, `data->type` still holds
'commit', so the blob is reported as a commit with no error. I
reported it on the mailing list [6].
I did a trial rebase of Eric Ju's v11 onto master and got conflicts in
4 of the 8 commits (conflicting files listed under each commit):
- `d04cf85ece` t1006: split test utility functions into new "lib-cat-file.sh".
  - `t/t1006-cat-file.sh`
- `d918f720d8` fetch-pack: refactor packet writing.
  - `fetch-pack.c`
- `2daf9ed803` transport: add client support for object-info.
  - `Makefile`
- `c3ba4afaf6` cat-file: add remote-object-info to batch-command.
  - `object-file.c`, `object-store-ll.h` (deleted).
I'm staying active on the mailing list, learning Git's workflow, and
learning from the feedback the maintainer (Junio) has given on my
patches.
Following the project guidelines, I haven't done anything on the
project that could step on other candidates' work before being
accepted, and instead I'm focusing on understanding the project and
its needs, and independent patches that will make the Git project more
familiar and understandable to me.
## The Problem
Eric Ju's work remains unmerged after v11 because of these issues:
- The format validation uses `strstr()`, which only checks for
`%(objectsize)`. This causes two different errors:
  - For atoms that `expand_atom()` recognizes but the remote doesn't
  support (`objecttype`, `deltabase`, ...), `expand_atom()` returns 1,
  but `data->type` contains only garbage when it is accessed, causing
  a segfault, as Jeff King noted [3].
  - For atoms that `expand_atom()` doesn't know, it returns 0, so
  `expand_format()` calls `strbuf_expand_bad_format()`, which calls
  `die()`, as Jeff King found [3].
Both cases block the command, including local `info` queries if the
same format string is shared. Unsupported remote placeholders should
return an empty string, matching how `for-each-ref` returns empty for
known but inapplicable atoms like `%(tagger)` on non-tags [4] [5].
- When local and remote queries are mixed, `data->type` is not being
cleared between commands. `remote-object-info` returns the wrong type
data from a previous local query [6].
- Style and code issues raised by Junio Hamano [2] and Jeff King [3]
[5] are still unaddressed:
  - comment style.
  - `#define` formatting.
  - line length.
  - misleading error messages.
  - missing `count > MAX_ALLOWED_OBJ_LIMIT` check after `split_cmdline()`.
  - if/else inversion in `get_remote_info()`.
- `%(objecttype)` is not yet supported on either client or server side.
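To make the stale `data->type` issue from the list above concrete, here
is a minimal standalone sketch of the reset-plus-NULL-check idea. It is
not code from Eric's series: `demo_data`, `demo_reset()` and the other
`demo_*` names are hypothetical stand-ins for `expand_data`, `OBJ_BAD`
and `type_name()`.

```c
#include <stddef.h>

/* Hypothetical stand-in for Git's object type enum. */
enum demo_type { DEMO_BAD = -1, DEMO_COMMIT, DEMO_TREE, DEMO_BLOB, DEMO_TAG };

/* Hypothetical stand-in for the per-object fields of expand_data. */
struct demo_data {
	enum demo_type type;
	unsigned long size;
};

/* Returns NULL for an unknown type, like the NULL case of type_name(). */
static const char *demo_type_name(enum demo_type type)
{
	static const char *const names[] = { "commit", "tree", "blob", "tag" };

	if (type < DEMO_COMMIT || type > DEMO_TAG)
		return NULL;
	return names[type];
}

/* Clear per-object state so one command cannot leak into the next. */
static void demo_reset(struct demo_data *data)
{
	data->type = DEMO_BAD;
	data->size = 0;
}

/* NULL-safe accessor: an unset type expands to the empty string. */
static const char *demo_type_or_empty(const struct demo_data *data)
{
	const char *name = demo_type_name(data->type);

	return name ? name : "";
}
```

Calling `demo_reset()` before every command means a remote query that
never fills in the type prints nothing instead of the previous
command's "commit".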
## The Solution
There are two main goals:
### Goal 1: Rebase and finish Eric's work
Starting from where Eric Ju left off, I will rebase his series on top
of the current `master` branch and address the remaining feedback:
- Fix style in comments, `#define` formatting and line length.
- Fix misleading error message in the overflow check.
- Add missing `count > MAX_ALLOWED_OBJ_LIMIT` check after `split_cmdline()`.
- Invert if/else on `get_remote_info()` to keep the small block first
(the error one) as Junio suggested.
#### Replace `strstr()` format validation with allow_list in `expand_atom()`
`strstr()` isn't enough to validate the placeholders: it only searches
for `%(objectsize)`, and unsupported placeholders cause segfaults. Jeff
King noted [4] that the fix is to refactor the validation into an
allow_list in `expand_atom()` or `expand_format()`.
The best option is to place the validation in `expand_atom()`, because
the two failures happen at different points: the segfault happens
inside `expand_atom()` before it returns, and the `die()` happens in
`expand_format()` after `expand_atom()` returns 0. Checking the
`allow_list` at the top of `expand_atom()` prevents both: in remote
mode, an atom that isn't on the list appends nothing to `sb` and
returns 1, so `data->type` is never accessed and `expand_format()`
never reaches `die()`.
As extra safety, initializing `data->type` to `OBJ_BAD` and checking
for `NULL` from `type_name()` means that, even without the
`allow_list`, uninitialized data doesn't cause a segfault.
In Goal 1, only `%(objectname)` and `%(objectsize)` will be in the
allow_list. Goal 2 will bring `%(objecttype)` support.
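A rough standalone sketch of the allow_list check, under the
assumption that it runs before any per-atom handling. This is not the
real `expand_atom()`: `expand_atom_sketch()`, `remote_atom_is_allowed()`
and the plain `char` buffer (instead of a `strbuf`) are simplifications
made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical allow-list for remote mode; Goal 1 has only these two. */
static const char *const remote_allowed_atoms[] = {
	"objectname",
	"objectsize",
};

/* Return 1 if the atom (not NUL-terminated, `len` bytes) is allowed. */
static int remote_atom_is_allowed(const char *atom, size_t len)
{
	size_t i;
	size_t nr = sizeof(remote_allowed_atoms) / sizeof(remote_allowed_atoms[0]);

	for (i = 0; i < nr; i++) {
		const char *name = remote_allowed_atoms[i];

		if (strlen(name) == len && !memcmp(name, atom, len))
			return 1;
	}
	return 0;
}

/*
 * Simplified stand-in for expand_atom(). Returns 1 when the atom was
 * handled (possibly expanding to nothing) and 0 when it is unknown,
 * which is the case where the real caller would die().
 */
static int expand_atom_sketch(char *out, size_t outsz,
			      const char *atom, size_t len,
			      int remote_mode,
			      const char *oid_hex, unsigned long size)
{
	out[0] = '\0';

	/* Allow-list first: unsupported remote atoms expand to "". */
	if (remote_mode && !remote_atom_is_allowed(atom, len))
		return 1;

	if (len == 10 && !memcmp(atom, "objectname", 10))
		snprintf(out, outsz, "%s", oid_hex);
	else if (len == 10 && !memcmp(atom, "objectsize", 10))
		snprintf(out, outsz, "%lu", size);
	else
		return 0;
	return 1;
}
```

Because the check comes first, a remote `%(objecttype)` never reaches
code that reads unfilled fields, and an unknown atom in remote mode
never produces the 0 return that leads to `die()`.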
### Goal 2: Adding `%(objecttype)`
Following what Calvin Wan did in 2021 [8] for `%(objectsize)`, v2
protocol needs to be extended on the server side to support the new
`%(objecttype)` placeholder:
- extend `object_info_advertise()` at `serve.c`
- add .type to `requested_info` struct at `serve.c`
- support `type` in `cap_object_info()` at `protocol-caps.c`
- look for type at `send_info()` at `protocol-caps.c`
Following object-info protocol docs [7] it should look like:
```
attrs = "size" SP "type"
obj-type = "blob" | "tree" | "commit" | "tag"
obj-info = obj-id SP obj-size SP obj-type
info = PKT-LINE(attrs LF)
*PKT-LINE(obj-info LF)
```
`%(objecttype)` needs to be added to the `allow_list`. Client side
needs to learn to ask for `%(objecttype)` from remote, parse what has
been received and fill `expand_data` with the actual type. This makes
it return the object type instead of the empty string returned while
it was unsupported.
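As an illustration of the client-side parsing step, here is a
standalone sketch that parses one `obj-info` payload in the proposed
`obj-id SP obj-size SP obj-type` shape. It assumes the pkt-line framing
and the trailing LF were already stripped. The names `sketch_info`,
`parse_obj_info()` and the `SK_*` enum are hypothetical, not Git's.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for Git's object type enum. */
enum sketch_type { SK_BAD = -1, SK_COMMIT, SK_TREE, SK_BLOB, SK_TAG };

struct sketch_info {
	char oid[65];          /* room for a SHA-256 hex string plus NUL */
	unsigned long size;
	enum sketch_type type;
};

static enum sketch_type sketch_type_from_name(const char *name)
{
	if (!strcmp(name, "commit"))
		return SK_COMMIT;
	if (!strcmp(name, "tree"))
		return SK_TREE;
	if (!strcmp(name, "blob"))
		return SK_BLOB;
	if (!strcmp(name, "tag"))
		return SK_TAG;
	return SK_BAD;
}

/*
 * Parse "<oid> SP <size>" or, when `with_type` is set,
 * "<oid> SP <size> SP <type>". Returns 0 on success, -1 if malformed.
 */
static int parse_obj_info(const char *line, int with_type,
			  struct sketch_info *out)
{
	const char *sp = strchr(line, ' ');
	char *end;
	size_t oid_len;

	if (!sp)
		return -1;
	oid_len = (size_t)(sp - line);
	if (!oid_len || oid_len >= sizeof(out->oid))
		return -1;
	memcpy(out->oid, line, oid_len);
	out->oid[oid_len] = '\0';

	/* Require a digit so strtoul() cannot accept a sign or spaces. */
	if (sp[1] < '0' || sp[1] > '9')
		return -1;
	errno = 0;
	out->size = strtoul(sp + 1, &end, 10);
	if (errno)
		return -1;

	out->type = SK_BAD;
	if (!with_type)
		return *end ? -1 : 0;
	if (*end != ' ')
		return -1;
	out->type = sketch_type_from_name(end + 1);
	return out->type == SK_BAD ? -1 : 0;
}
```

The `with_type` flag would come from the `attrs` line the server sends
first, so a response from a server that only sends `size` still parses.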
Default format evolves to `%(objectname) %(objecttype) %(objectsize)`.
Test and document new placeholder support and server side extension.
#### Backward Compatibility
There are four possible scenarios between client and server:
1. **The server doesn't know type (new client but old server)**:
After receiving the server capabilities, a client will only request
what the server advertises. The `allow_list` would handle this,
returning an empty string when the server doesn't support it.
2. **The server knows type but the client doesn't (new server but old client)**:
Per `gitprotocol-v2.adoc` ("Clients must ignore all unknown keys"), the
client ignores type and requests only the capabilities it knows.
3. **Both know type (new client and new server)**:
Server advertises type, client requests it and gets the type data.
4. **Both know type but protocol middleware doesn't (new client, new
server but old middleware)**:
If a server advertises type but the client doesn't receive it, the
client won't ask for anything unadvertised. If a client asks for type
but the server doesn't receive the request for it, the server will only
return the known capabilities.
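The rule behind all four scenarios is "only request what was
advertised". A minimal sketch of that filter follows; the function
names and the NULL-terminated string arrays are assumptions for
illustration, not how the transport code stores capabilities.

```c
#include <string.h>

/* Return 1 if `attr` appears in the NULL-terminated `advertised` list. */
static int server_advertises(const char *const *advertised, const char *attr)
{
	for (; *advertised; advertised++)
		if (!strcmp(*advertised, attr))
			return 1;
	return 0;
}

/*
 * Copy into `request` every wanted attribute the server advertised,
 * up to `max` entries. Returns the number of attributes to request.
 */
static size_t build_request(const char *const *wanted,
			    const char *const *advertised,
			    const char **request, size_t max)
{
	size_t n = 0;

	for (; *wanted && n < max; wanted++)
		if (server_advertises(advertised, *wanted))
			request[n++] = *wanted;
	return n;
}
```

With an old server that advertises only `size`, a new client wanting
`size` and `type` ends up requesting just `size`, and `%(objecttype)`
then expands to the empty string.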
**Performance considerations**
Getting an object's type only requires looking at its header. To get
the size, `oid_object_info()` in `object-file.c` is already being
called, and it returns the object type in the same call. Sending the
type string costs at most 6 bytes per object (the "commit" string).
## Timeline
I've designed this timeline with some slack, so the actual work may
finish sooner than what's stated here.
May 1-24: Community Bonding
- Keep working on my ongoing patches and new ones.
- Meet with my assigned mentor to get feedback on my proposal, agree on
how I will report progress beyond the submitted code (e.g. blog posts),
and pick up tips for working effectively on Git.
- Confirm with mentor that the `allow_list` approach is still the best option.
- Draft the commit structure.
- Set up a blog to track how GSoC at Git is going.
Week 1-2: (May 26 - June 8)
- Start Goal 1 fixes.
- Fix style and code issues.
Week 3-4: (June 9 - June 22)
- Start with Goal 1 implementations (allow_list approach).
Week 5-6: (June 23 - July 6):
- Goal 1 should be polished or close to the final form.
- Send patch series for Goal 1.
- Start Goal 2.
- Prepare the midterm report.
**Midterm evaluation** (July 7 - 11) as specified on GSoC timeline docs
- Goal 1 submitted.
Week 7-8: (July 14 - July 27)
- Start with server side v2 protocol extension (`%(objecttype)`).
Week 9-10: (July 28 - August 10)
- Add `%(objecttype)` to the `allow_list` from Goal 1.
- Client side extension.
- End to end tests and documentation.
- Default format becomes `%(objectname) %(objecttype) %(objectsize)`.
- Send patch series.
Week 11-12: (August 11 - August 24)
- Goal 2 should be close to done.
- Polish everything, all tests pass, good test coverage, no
style/comment issues.
- Final documentation review.
- Prepare for final evaluation.
**Final evaluation** (August 18-24) as specified on GSoC timeline docs
### Additional objectives
If there is enough time, or as future work after the project, I have
some ideas on how this could evolve:
#### More placeholders support
I've checked that Eric's v11 patch only supports `%(objectsize)` on
server side, but on the client side there are other placeholders that
can be added too. With the `allow_list` and having Goal 2 implemented,
adding more placeholders becomes trivial.
- `%(objectsize:disk)`: Returns the size on disk (compressed or as
a delta) instead of the uncompressed size that `%(objectsize)`
returns. To do this, the server would need to send the actual on-disk
size.
- `%(deltabase)`: Returns the delta base object OID. Non-delta objects
return the zero OID, as they do locally.
#### Returning missing blobs from a tree, ordered by size
In a partial clone, someone might want to know what blobs are missing
inside a concrete tree and their size before fetching them.
The idea is to build on top of `remote-object-info`:
Given a tree hash, return the missing blobs (inside that tree) ordered by size.
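The ordering step could be as simple as the sketch below, assuming the
missing blobs and their sizes were already collected through
`remote-object-info`. `struct missing_blob` and both functions are
hypothetical, and ascending order is an arbitrary choice since the
direction isn't decided yet.

```c
#include <stdlib.h>

/* Hypothetical record for one blob missing from a partial clone. */
struct missing_blob {
	const char *oid;       /* hex object ID */
	unsigned long size;    /* size reported by the remote */
};

/* qsort() comparator: ascending by size, without overflow tricks. */
static int cmp_by_size(const void *a, const void *b)
{
	const struct missing_blob *x = a;
	const struct missing_blob *y = b;

	if (x->size < y->size)
		return -1;
	if (x->size > y->size)
		return 1;
	return 0;
}

/* Sort `nr` entries in place, smallest blob first. */
static void sort_missing_blobs(struct missing_blob *blobs, size_t nr)
{
	qsort(blobs, nr, sizeof(*blobs), cmp_by_size);
}
```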
Thanks for reading my proposal and considering my application. I'm
very excited about this opportunity,
Pablo
[1]: https://lore.kernel.org/git/20250221190451.12536-1-eric.peijian@gmail.com/
"Eric Ju's v11 patch"
[2]: https://lore.kernel.org/git/xmqqo6yr3wc4.fsf@gitster.g/ "Junio
Hamano feedback"
[3]: https://lore.kernel.org/git/20250224234720.GC729825@coredump.intra.peff.net/
"Jeff King feedback"
[4]: https://lore.kernel.org/git/20250313060250.GH94015@coredump.intra.peff.net/
"options for strstr() by Jeff King"
[5]: https://lore.kernel.org/git/20250324033922.GB690093@coredump.intra.peff.net/
"Jeff King follow-up"
[6]: https://lore.kernel.org/git/20260312214154.89120-1-pabloosabaterr@gmail.com/
"data->type not being cleared bug"
[7]: https://github.com/git/git/blob/master/Documentation/gitprotocol-v2.adoc#object-info
"object-info protocol docs"
[8]: https://lore.kernel.org/git/20220728230210.2952731-1-calvinwan@google.com/#t
"Calvin Wan's patch series"
---
Again, thanks a lot for the feedback.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [GSoC] Proposal: Complete and extend the remote-object-info command for git cat-file
2026-03-14 18:31 ` Pablo
@ 2026-03-15 9:20 ` Chandra Pratap
2026-03-16 11:21 ` Christian Couder
2026-03-16 21:38 ` Karthik Nayak
2 siblings, 0 replies; 14+ messages in thread
From: Chandra Pratap @ 2026-03-15 9:20 UTC (permalink / raw)
To: Pablo
Cc: git, christian.couder, karthik nayak, jltobler, Ayush Chandekar,
Siddharth Asthana
On Sun, 15 Mar 2026 at 00:01, Pablo <pabloosabaterr@gmail.com> wrote:
>
> Hi Chandra, thanks a lot for the feedback! :)
>
> > You should upload your proposal on the GSoC website and add the link to it here.
> > The proposal can be then updated later as many times as you like.
>
> GSoC proposals opens March 16th, for now I'll send my v2 here and as
> soon as I can I'll swap to GSoC website and send the link to the
> thread.
I don't think you need to do this, just make sure you include the link when
you send your revised proposals in the future.
> To avoid having you reread everything again this is what I've done from v1:
>
> Moved context explanation from The Problem to Synopsis and
> Availability below About Me and Contact.
> Split Pre-GSoC patches into status (for code patches) and
> description to improve readability.
> Added a code review and proposal thread to the Pre-GSoC section.
> Added new lines where noted and fixed capitalization.
> Correctly credited Jeff King for the allow_list idea and added new
> [8] for Calvin Wan's work.
> Community bonding now includes continuing patches and setting up a blog.
Quickly skimmed over the new proposal and it definitely looks better
now. Great job!
> Removed most of the duplicated iteration on the Timeline from The
> Problem. (feels a bit empty now tho).
This is fine because you've already discussed the relevant details in earlier
sections.
You could think of fleshing it out with new information, but duplicating details
just for the sake of a 'fuller' proposal waters down the impact of the rest of
your work. There isn't a word count requirement after all :)
>
> I paste here my v2 with the requested changes:
>
> ## Synopsis
>
> Git's partial clone allows cloning repositories without downloading
> all objects (blobs, trees, ...). These objects are fetched on demand
> from the remote when needed. However, when a user needs metadata about
> these remote objects (size, type, hash, ...), Git has no efficient way
> of doing this without downloading all the object content.
>
> The server side support for `object-info` protocol was implemented by
> Calvin Wan in 2021 [8]. Eric Ju built the client-side
> `remote-object-info` for `cat-file --batch-command`.
>
> This project finishes Eric Ju's work on `remote-object-info` for `git
> cat-file --batch-command` [1], resolves the pending feedback from
> Junio Hamano [2] and Jeff King [3] [4] [5], and extends support for
> `%(objecttype)`.
>
> Expected project size: 350 hours (Medium)
>
> ## About Me and Contact
>
> Name: Pablo Sabater Jiménez (he/him)
>
> Age: 19
>
> Education: Currently on my second Computer Science year at University
> of Murcia, Spain
>
> Location: Murcia, Spain (CET, UTC+1)
>
> Languages: C (solid), shell(bash) (good)
>
> Tools: git(proficient)
>
> I've checked that I'm eligible for GSoC 2026.
>
> Email: pabloosabaterr@gmail.com
> GitHub: https://github.com/pabloosabaterr
>
> ## Availability
>
> My classes end the first week of May. From then until September I
> won't have any classes which leaves me free to fully focus on the
> project. I can dedicate 8+ hours each day, and for sure 40 hours a
> week.
>
> ## Relevant Projects
>
> - 16 bit CPU emulator. Good example of C programming.
>
> cpu: https://github.com/pabloosabaterr/CPU16
>
> - Compiler. Good example of working on bigger projects.
>
> compiler: https://github.com/pabloosabaterr/Orn
>
> ## Pre-GSoC Work
>
> ### Introduction
>
> **[GSoC] Introduction Pablo Sabater**
>
> https://lore.kernel.org/git/CAN5EUNR0KJ4VeuOF_bVupaTuGKGaeTKa0SMRAUoBPo5wWi8YGA@mail.gmail.com
>
> **Description**: A mailing list thread where I introduced myself to
> the git community.
>
> ### Microproject
>
> **[GSoC PATCH v4] t9200: replace test -f/-d with modern path helpers**
>
> https://lore.kernel.org/git/20260312173305.15112-1-pabloosabaterr@gmail.com/
>
> **Status**: Merged to `next` on 2026-03-12 at `8500bdf172`.
>
> **Description**: Replaces `test -f` with helper `test_path_is_file`,
> which makes debugging failing tests easier with better reporting.
> As suggested as microproject.
>
> ### Other contributions
>
> **[GSoC PATCH v2] test-lib: print escape sequence names**
>
> https://lore.kernel.org/git/20260311031442.11942-1-pabloosabaterr@gmail.com/
>
> **Status**: Will merge to `next`.
>
> **Description**: In failed expected/actual checks printing, the escape
> sequences were shown as their octal code. This patch fixes that to
> print the actual escape sequence name, adds tests, and updates the
> expected output.
>
> **[GSoC PATCH] t9200: handle missing CVS with skip_all**
>
> https://lore.kernel.org/git/20260311194002.190195-1-pabloosabaterr@gmail.com/
>
> **Status**: Merged to `next` on 2026-03-12 at `8500bdf172`.
>
> **Description**: wraps CVS setup in a skip_all for clearer failure
> reporting and moves Git initialization into its own
> test_expect_success.
>
> **Re: [PATCH] gc: add git maintenance list command**
>
> https://lore.kernel.org/git/20260313115932.15259-1-pabloosabaterr@gmail.com/
>
> **Description**: code review for a patch sent.
>
> **[GSoC] Proposal: Complete and extend remote-object-info for git cat-file**
>
> https://lore.kernel.org/git/CAN5EUNQKv-LCkbY+5scn6pk6fL8kpmjNR=66rjeY=NqKbqRkhA@mail.gmail.com/
>
> **Description**: Proposal draft thread.
>
> **[GSoC] Re: [PATCH v11 8/8] cat-file: add remote-object-info to batch-command**
>
> https://lore.kernel.org/git/20260312214154.89120-1-pabloosabaterr@gmail.com/
>
> **Description**: While testing Eric's v11 I've found and reported a
> new bug. On `remote-object-info` when it's preceded by a local query,
> `data->type` isn't being cleared. Causing it to return the wrong type.
>
> I have also studied the documentation provided and Eric Ju's work from
> v0 to v11 including all the feedback he got up to March 2025, the
> feedback he got from Junio Hamano and Jeff King, taking notes about
> what's left to be done and what else I can contribute to the already
> proposed project. That's how I've identified everything that I will
> address on the Problem, Solution and Timeline sections.
>
> I built Eric Ju's v11 and tested the bugs reported to his patch [5],
> I've confirmed the segfault and the `die()`, and found a new one:
> - When a local `info` runs before `remote-object-info` sharing the
> same format string, `data->type` isn't being cleared. A blob queried
> remotely after a local commit, `data->type` for blob becomes 'commit'
> with no error. I reported it on the mailing list [6].
>
> I attempted to test rebasing Eric Ju's v11 to master and got conflicts
> on 4 out of the 8 commits:
> - `d04cf85ece` t1006: split test utility functions into new "lib-cat-file.sh".
> - `t/t1006-cat-file.sh`
> - `d918f720d8` fetch-pack: refactor packet writing.
> - `fetch-pack.c`
> - `2daf9ed803` transport: add client support for object-info.
> - `Makefile`
> - `c3ba4afaf6` cat-file: add remote-object-info to batch-command.
> - `object-file.c`, `object-store-ll.h` (deleted).
>
> I'm being active on the mailing list and learning the Git flow of work
> and from the feedback I've received from the maintainers (Junio) from
> my patches.
>
> Following the project guidelines, I haven't done anything on the
> project that could step on other candidates' work before being
> accepted, and instead I'm focusing on understanding the project and
> its needs, and independent patches that will make the Git project more
> familiar and understandable to me.
>
> ## The Problem
>
> Eric Ju's work remains unmerged after v11 because of these issues:
>
> - The format validation uses `strstr()` which only checks for
> `%(objectsize)`. This causes two different errors:
> - Atoms that `expand_atom()` recognizes but the remote doesn't
> (`objecttype`,`deltabase`, ...), `expand_atom()` returns 1, but when
> accessing `data->type` it only contains garbage, causing segfault, as
> Jeff King noted [3].
> - Unknown atoms by `expand_atom()`, returns 0, calling
> `strbuf_expand_bad_format` on `expand_format()`, which calls `die()`,
> as Jeff King found [3].
> Both cases block the command, including local `info` queries if the
> same format string is shared. Unsupported remote placeholders should
> return an empty string, matching how `for-each-ref` returns empty for
> known, but inapplicable atoms like `%(tagger)` on non-tags [4] [5].
>
> - When local and remote queries are mixed, `data->type` is not being
> cleared between commands. `remote-object-info` returns the wrong type
> data from a previous local query [6].
>
> - Style and code issues marked by Junio Hamano [2] and Jeff King [3]
> [5] are still undone.
> - comment style.
> - `#define` formatting.
> - line length.
> - misleading error messages.
> - missing `count > MAX_ALLOWED_OBJ_LIMIT` check at `split_cmdline().`
> - if/else invert at `get_remote_info()`.
> - `%(objecttype)` is not yet supported on either client or server side.
>
> ## The Solution
>
> There are two main goals:
>
> ### Goal 1: Rebase and finish Eric's work
>
> Starting from where Eric Ju left off, I will rebase it on top of the
> current `master` branch and address the feedback left to do:
> - Fix style in comments, `#define` formatting and line length.
> - Fix misleading error message in the overflow check.
> - Add missing `count > MAX_ALLOWED_OBJ_LIMIT` check after `split_cmdline()`.
> - Invert if/else on `get_remote_info()` to keep the small block first
> (the error one) as Junio suggested.
>
> #### Replace `strstr()` format validation with allow_list in `expand_atom()`
>
> `strstr()` isn't enough to fully validate the placeholders: it only
> searches for `%(objectsize)`, and unsupported placeholders cause
> segfaults. Jeff King noted [4] that the fix was to refactor the
> validation with an allow_list in `expand_atom()` or `expand_format()`.
> The best option is to place the validation in `expand_atom()`. Why
> `expand_atom()`?
> - There are two failure cases: the first happens inside
> `expand_atom()` before it returns (the segfault), and the second
> happens when `expand_atom()` returns 0 and `die()` is called.
> Placing the `allow_list` check at the top of `expand_atom()` prevents
> both: in remote mode, an atom outside the list appends nothing to
> `sb` and returns 1, so `data->type` is never accessed (no segfault)
> and `expand_format()` never reaches `die()`.
> As extra safety, initializing `data->type` to `OBJ_BAD` and checking
> for `NULL` from `type_name()` ensures that, even without the
> `allow_list`, uninitialized data doesn't cause a segfault.
> At Goal 1, only `%(objectname)` and `%(objectsize)` will be in the
> allow_list. Goal 2 will bring `%(objecttype)` support.
>
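A minimal sketch of the `allow_list` control flow described above, not
Git's actual code: `remote_allow_list`, `remote_atom_allowed()` and
`expand_atom_sketch()` are hypothetical names, and a plain `char` buffer
stands in for the `sb` strbuf. It only illustrates that, in remote mode,
an atom outside the list expands to the empty string and still returns 1,
so neither the uninitialized read nor the `die()` path is reached.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical Goal 1 allow_list: atoms the remote can answer. */
static const char *const remote_allow_list[] = { "objectname", "objectsize" };

/* Return 1 if `atom` is in the allow_list, 0 otherwise. */
static int remote_atom_allowed(const char *atom)
{
	size_t i;
	size_t n = sizeof(remote_allow_list) / sizeof(remote_allow_list[0]);

	for (i = 0; i < n; i++)
		if (!strcmp(atom, remote_allow_list[i]))
			return 1;
	return 0;
}

/*
 * Simplified stand-in for expand_atom(). Writes the expansion to `out`
 * and returns 1 if the atom was handled, 0 if it is unknown (the real
 * caller would then die()).
 */
static int expand_atom_sketch(char *out, size_t outlen, const char *atom,
			      int is_remote, const char *oid,
			      unsigned long size, const char *type)
{
	assert(out && outlen > 0 && atom);
	out[0] = '\0';

	/* Remote mode: anything outside the allow_list expands to "". */
	if (is_remote && !remote_atom_allowed(atom))
		return 1;

	if (!strcmp(atom, "objectname"))
		snprintf(out, outlen, "%s", oid);
	else if (!strcmp(atom, "objectsize"))
		snprintf(out, outlen, "%lu", size);
	else if (!strcmp(atom, "objecttype"))
		snprintf(out, outlen, "%s", type);
	else
		return 0;
	return 1;
}
```

With this shape, asking for `objecttype` in remote mode leaves the output
empty and returns 1, while the same atom in local mode expands normally
and a truly unknown atom in local mode still returns 0.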
> ### Goal 2: Adding `%(objecttype)`
>
> Following what Calvin Wan did in 2021 [8] for `%(objectsize)`, v2
> protocol needs to be extended on the server side to support the new
> `%(objecttype)` placeholder:
> - extend `object_info_advertise()` at `serve.c`
> - add .type to `requested_info` struct at `serve.c`
> - support `type` in `cap_object_info()` at `protocol-caps.c`
> - look for type at `send_info()` at `protocol-caps.c`
>
> Following object-info protocol docs [7] it should look like:
> ```
> attrs = "size" SP "type"
> obj-type = "blob" | "tree" | "commit" | "tag"
> obj-info = obj-id SP obj-size SP obj-type
> info = PKT-LINE(attrs LF)
> *PKT-LINE(obj-info LF)
> ```
>
> `%(objecttype)` needs to be added to the `allow_list`. Client side
> needs to learn to ask for `%(objecttype)` from remote, parse what has
> been received and fill `expand_data` with the actual type. This makes
> it return the object type instead of the empty string returned while
> it was unsupported.
>
> Default format evolves to `%(objectname) %(objecttype) %(objectsize)`.
> Test and document new placeholder support and server side extension.
>
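To make the proposed `obj-id SP obj-size SP obj-type` layout concrete,
here is a rough sketch of how a client could parse one such line. All
names (`sk_type`, `sk_obj_info`, `parse_obj_info()`) are hypothetical,
the layout is the one proposed above rather than what the protocol
documents today, and real code would rely on Git's own pkt-line reader
and parsing helpers instead of raw libc calls.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for Git's object type enum and expand_data. */
enum sk_type { SK_BAD = -1, SK_COMMIT, SK_TREE, SK_BLOB, SK_TAG };

struct sk_obj_info {
	char oid[65];		/* hex id; 64 chars fits SHA-256 */
	unsigned long size;
	enum sk_type type;
};

static enum sk_type sk_type_from_name(const char *name)
{
	if (!strcmp(name, "commit"))
		return SK_COMMIT;
	if (!strcmp(name, "tree"))
		return SK_TREE;
	if (!strcmp(name, "blob"))
		return SK_BLOB;
	if (!strcmp(name, "tag"))
		return SK_TAG;
	return SK_BAD;
}

/*
 * Parse one "obj-id SP obj-size SP obj-type" line (LF already removed).
 * Returns 0 on success, -1 on a malformed line.
 */
static int parse_obj_info(const char *line, struct sk_obj_info *out)
{
	const char *sp;
	char *end;
	size_t oid_len;

	assert(line && out);

	sp = strchr(line, ' ');
	if (!sp)
		return -1;
	oid_len = (size_t)(sp - line);
	if (!oid_len || oid_len >= sizeof(out->oid))
		return -1;
	memcpy(out->oid, line, oid_len);
	out->oid[oid_len] = '\0';

	/* Reject signs and whitespace that strtoul() would accept. */
	if (!isdigit((unsigned char)sp[1]))
		return -1;
	errno = 0;
	out->size = strtoul(sp + 1, &end, 10);
	if (errno || *end != ' ')
		return -1;

	out->type = sk_type_from_name(end + 1);
	return out->type == SK_BAD ? -1 : 0;
}
```

A line such as `abc123 42 blob` fills all three fields, while a line
missing the type (as an old server would send) is reported as malformed,
so the caller can decide how to fall back.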
> #### Backward Compatibility
>
> There are four possible scenarios to happen between client and server:
>
> 1. **The server doesn't know type (new client but old server)**:
>
> After receiving the server capabilities, a client will only request
> what the server advertises. The `allow_list` would handle this,
> returning an empty string when the server doesn't support it.
>
> 2. **The server knows type but the client doesn't (new server but old client)**:
>
> Following `gitprotocol-v2.adoc`, "Clients must ignore all unknown
> keys", it will ignore type, and request only the known capabilities.
>
> 3. **Both know type (new client and new server)**:
>
> Server advertises type, client requests it and gets the type data.
>
> 4. **Both know type but protocol middleware doesn't (new client, new
> server but old middleware)**:
>
> If the server advertises type but the client doesn't receive the
> advertisement, the client won't ask for anything unadvertised. If the
> client asks for type but the server doesn't receive the request, the
> server will only return the known capabilities.
>
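The four scenarios above reduce to one rule: request only the
intersection of what the format string wants and what the server
advertised, ignoring unknown keys. A small sketch of that rule follows,
assuming the advertised attributes arrive as a list of strings; the
`SK_ATTR_*` bits and both function names are hypothetical, and how the
advertisement is actually transmitted is left out. What gets reported to
the user for an attribute that was dropped is a separate decision that
this sketch does not make.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SK_ATTR_SIZE (1u << 0)
#define SK_ATTR_TYPE (1u << 1)

/* Map an attribute name to its bit; unknown names map to 0. */
static unsigned sk_attr_bit(const char *name)
{
	if (!strcmp(name, "size"))
		return SK_ATTR_SIZE;
	if (!strcmp(name, "type"))
		return SK_ATTR_TYPE;
	return 0;
}

/*
 * Given what the format string wants and what the server advertised,
 * return the attributes the client should actually request.
 */
static unsigned sk_attrs_to_request(unsigned wanted,
				    const char *const *advertised, size_t n)
{
	unsigned offered = 0;
	size_t i;

	assert(advertised || !n);
	for (i = 0; i < n; i++)
		offered |= sk_attr_bit(advertised[i]);	/* unknown keys ignored */
	return wanted & offered;
}
```

A new client talking to an old server that advertises only `size` ends
up requesting just `SK_ATTR_SIZE`; against a new server it requests both
bits, and any attribute name it does not recognize is skipped.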
> **performance considerations**
>
> Getting an object's type only requires reading its header. To get the
> size, `oid_object_info()` at `object-file.c` is already being called,
> and it returns the object type in the same call. Sending the type
> string costs, in the worst case, only 6 bytes for the "commit" string.
>
> ## Timeline
>
> I've planned this with enough slack that the actual work may take
> less time than stated here.
>
> May 1-24: Community Bonding
> - Keep working on my ongoing patches and new ones.
> - Meet with my assigned mentor to get feedback on my proposal, agree
> on how I will report my progress beyond the submitted code (possibly
> through blog posts), and pick up tips and tricks for working better
> on Git.
> - Confirm with mentor that the `allow_list` approach is still the best option.
> - Draft commits structure.
> - Set up a blog to keep track of how GSoC at Git is going.
>
> Week 1-2: (May 26 - June 8)
> - Start Goal 1 fixes.
> - Fix style and code issues.
>
> Week 3-4: (June 9 - June 22)
> - Start with Goal 1 implementations (allow_list approach).
>
> Week 5-6: (June 23 - July 6):
> - Goal 1 should be polished or close to the final form.
> - Send patch series for Goal 1.
> - Start Goal 2.
> - Prepare the midterm report.
>
> **Midterm evaluation** (July 7 - 11) as specified on GSoC timeline docs
> - Goal 1 submitted.
>
> Week 7-8: (July 14 - July 27)
> - Start with server side v2 protocol extension (`%(objecttype)`).
>
> Week 9-10: (July 28 - August 10)
> - Add `%(objecttype)` to the `allow_list` from Goal 1.
> - Client side extension.
> - End to end tests and documentation.
> - Default format becomes `%(objectname) %(objecttype) %(objectsize)`.
> - Send patch series.
>
> Week 11-12: (August 11 - August 24)
> - Goal 2 should be close to done.
> - Polish everything, all tests pass, good test coverage, no
> style/comment issues.
> - Final documentation review.
> - Prepare for final evaluation.
>
> **Final evaluation** (August 18-24) as specified on GSoC timeline docs
>
> ### Additional objectives
>
> If there is enough time, or as future work after the project, I have
> some ideas on how this could evolve:
>
> #### More placeholders support
>
> I've checked that Eric's v11 patch only supports `%(objectsize)` on
> server side, but on the client side there are other placeholders that
> can be added too. With the `allow_list` and having Goal 2 implemented,
> adding more placeholders becomes trivial.
>
> - `%(objectsize:disk)`: Returns the size on disk (compressed or as a
> delta) instead of the uncompressed size that `%(objectsize)`
> returns. To do this, the server would need to send the actual
> on-disk size.
>
> - `%(deltabase)`: Returns the delta base object OID. Non-delta objects
> return the zero OID, as they do locally.
>
> #### Returning missing blobs from a tree ordered
>
> In a partial clone, someone might want to know what blobs are missing
> inside a concrete tree and their size before fetching them.
> The idea is to build on top of `remote-object-info`:
> Given a tree hash, return the missing blobs (inside that tree) ordered by size.
>
> Thanks for reading my proposal and considering my application. I'm
> very excited about this opportunity,
> Pablo
>
> [1]: https://lore.kernel.org/git/20250221190451.12536-1-eric.peijian@gmail.com/
> "Eric Ju's v11 patch"
>
> [2]: https://lore.kernel.org/git/xmqqo6yr3wc4.fsf@gitster.g/ "Junio
> Hamano feedback"
>
> [3]: https://lore.kernel.org/git/20250224234720.GC729825@coredump.intra.peff.net/
> "Jeff King feedback"
>
> [4]: https://lore.kernel.org/git/20250313060250.GH94015@coredump.intra.peff.net/
> "options for strstr() by Jeff King"
>
> [5]: https://lore.kernel.org/git/20250324033922.GB690093@coredump.intra.peff.net/
> "Jeff King follow-up"
>
> [6]: https://lore.kernel.org/git/20260312214154.89120-1-pabloosabaterr@gmail.com/
> "data->type not being cleared bug"
>
> [7]: https://github.com/git/git/blob/master/Documentation/gitprotocol-v2.adoc#object-info
> "object-info protocol docs"
>
> [8]: https://lore.kernel.org/git/20220728230210.2952731-1-calvinwan@google.com/#t
> "Calvin Wan's patch series"
>
> ---
>
> Again, thanks a lot for the feedback.
* Re: [GSoC] Proposal: Complete and extend the remote-object-info command for git cat-file
2026-03-14 18:31 ` Pablo
2026-03-15 9:20 ` Chandra Pratap
@ 2026-03-16 11:21 ` Christian Couder
2026-03-16 21:38 ` Karthik Nayak
2 siblings, 0 replies; 14+ messages in thread
From: Christian Couder @ 2026-03-16 11:21 UTC (permalink / raw)
To: Pablo
Cc: Chandra Pratap, git, karthik nayak, jltobler, Ayush Chandekar,
Siddharth Asthana
Hi Pablo,
On Sat, Mar 14, 2026 at 7:31 PM Pablo <pabloosabaterr@gmail.com> wrote:
> #### Backward Compatibility
>
> There are four possible scenarios to happen between client and server:
>
> 1. **The server doesn't know type (new client but old server)**:
>
> After receiving the server capabilities, a client will only request
> what the server advertises. The `allow_list` would handle this,
> returning an empty string when the server doesn't support it.
This is not very clear and maybe answering the following questions
could help clarify:
1) What is returning an empty string? Is it the `allow_list`, the
client, the server or something else?
2) And what is actually reported to the user (an error, a warning, nothing)?
3) Also is it what is implemented in Eric's v11, or what you suggest
implementing?
> 2. **The server knows type but the client doesn't (new server but old client)**:
>
> Following `gitprotocol-v2.adoc`, "Clients must ignore all unknown
> keys", it will ignore type, and request only the known capabilities.
Questions 2) and 3) above might be relevant here too.
> 3. **Both know type (new client and new server)**:
>
> Server advertises type, client requests it and gets the type data.
>
> 4. **Both know type but protocol middleware doesn't (new client, new
> server but old middleware)**:
>
> If the server advertises type but the client doesn't receive the
> advertisement, the client won't ask for anything unadvertised. If the
> client asks for type but the server doesn't receive the request, the
> server will only return the known capabilities.
Questions 2) and 3) above might be relevant here too.
[...]
> Thanks for reading my proposal and considering my application. I'm
> very excited about this opportunity,
Thanks for your proposal.
Best.
* Re: [GSoC] Proposal: Complete and extend the remote-object-info command for git cat-file
2026-03-14 18:31 ` Pablo
2026-03-15 9:20 ` Chandra Pratap
2026-03-16 11:21 ` Christian Couder
@ 2026-03-16 21:38 ` Karthik Nayak
2026-03-18 10:45 ` Pablo
2 siblings, 1 reply; 14+ messages in thread
From: Karthik Nayak @ 2026-03-16 21:38 UTC (permalink / raw)
To: Pablo, Chandra Pratap
Cc: git, christian.couder, jltobler, Ayush Chandekar,
Siddharth Asthana
Pablo <pabloosabaterr@gmail.com> writes:
> Hi Chandra, thanks a lot for the feedback! :)
>
>> You should upload your proposal on the GSoC website and add the link to it here.
>> The proposal can be then updated later as many times as you like.
>
> GSoC proposal submissions open on March 16th. For now I'll send my v2
> here, and as soon as I can I'll switch to the GSoC website and send
> the link to the thread.
>
> To avoid having you reread everything again this is what I've done from v1:
>
> Moved context explanation from The Problem to Synopsis and
> Availability below About Me and Contact.
> Split Pre-GSoC patches into status (for code patches) and
> description to improve readability.
> Added a code review and proposal thread to the Pre-GSoC section.
> Added new lines where noted and fixed capitalization.
> Correctly credited Jeff King for the allow_list idea and added new
> [8] for Calvin Wan's work.
> Community bonding now includes continuing patches and setting up a blog.
> Removed most of the duplicated iteration on the Timeline from The
> Problem. (feels a bit empty now tho).
>
Perhaps a diff would be a good addition for next time? :)
> I paste here my v2 with the requested changes:
>
> ## Synopsis
>
> Git's partial clone allows cloning repositories without downloading
> all objects (blobs, trees, ...). These objects are fetched on demand
> from the remote when needed. However, when a user needs metadata about
> these remote objects (size, type, hash, ...), Git has no efficient way
> of doing this without downloading all the object content.
>
> The server side support for `object-info` protocol was implemented by
> Calvin Wan in 2021 [8]. Eric Ju built the client-side
> `remote-object-info` for `cat-file --batch-command`.
>
> This project finishes Eric Ju's work on `remote-object-info` for `git
> cat-file --batch-command` [1], resolves the pending feedback from
> Junio Hamano [2] and Jeff King [3] [4] [5], and extends support for
> `%(objecttype)`.
>
Nice to see that you've linked in the relevant resources.
> Expected project size: 350 hours (Medium)
>
> ## About Me and Contact
>
> Name: Pablo Sabater Jiménez (he/him)
>
> Age: 19
>
> Education: Currently in my second year of Computer Science at the
> University of Murcia, Spain
>
> Location: Murcia, Spain (CET, UTC+1)
>
> Languages: C (solid), shell(bash) (good)
>
> Tools: git(proficient)
>
> I've checked that I'm eligible for GSoC 2026.
>
> Email: pabloosabaterr@gmail.com
> GitHub: https://github.com/pabloosabaterr
>
> ## Availability
>
> My classes end the first week of May. From then until September I
> won't have any classes which leaves me free to fully focus on the
> project. I can dedicate 8+ hours each day, and for sure 40 hours a
> week.
>
> ## Relevant Projects
>
> - 16 bit CPU emulator. Good example of C programming.
>
> cpu: https://github.com/pabloosabaterr/CPU16
>
> - Compiler. Good example of working on bigger projects.
>
> compiler: https://github.com/pabloosabaterr/Orn
>
> ## Pre-GSoC Work
>
> ### Introduction
>
> **[GSoC] Introduction Pablo Sabater**
>
> https://lore.kernel.org/git/CAN5EUNR0KJ4VeuOF_bVupaTuGKGaeTKa0SMRAUoBPo5wWi8YGA@mail.gmail.com
>
> **Description**: A mailing list thread where I introduced myself to
> the git community.
>
> ### Microproject
>
> **[GSoC PATCH v4] t9200: replace test -f/-d with modern path helpers**
>
> https://lore.kernel.org/git/20260312173305.15112-1-pabloosabaterr@gmail.com/
>
> **Status**: Merged to `next` on 2026-03-12 at `8500bdf172`.
>
> **Description**: Replaces `test -f` with the helper `test_path_is_file`,
> which makes debugging failing tests easier with better reporting, as
> suggested for the microproject.
>
> ### Other contributions
>
> **[GSoC PATCH v2] test-lib: print escape sequence names**
>
> https://lore.kernel.org/git/20260311031442.11942-1-pabloosabaterr@gmail.com/
>
> **Status**: Will merge to `next`.
>
> **Description**: In failed expected/actual checks printing, the escape
> sequences were shown as their octal code. This patch fixes that to
> print the actual escape sequence name, adds tests, and updates the
> expected output.
>
> **[GSoC PATCH] t9200: handle missing CVS with skip_all**
>
> https://lore.kernel.org/git/20260311194002.190195-1-pabloosabaterr@gmail.com/
>
> **Status**: Merged to `next` on 2026-03-12 at `8500bdf172`.
>
> **Description**: wraps CVS setup in a skip_all for clearer failure
> reporting and moves Git initialization into its own
> test_expect_success.
>
> **Re: [PATCH] gc: add git maintenance list command**
>
> https://lore.kernel.org/git/20260313115932.15259-1-pabloosabaterr@gmail.com/
>
> **Description**: code review for a patch sent.
>
> **[GSoC] Proposal: Complete and extend remote-object-info for git cat-file**
>
> https://lore.kernel.org/git/CAN5EUNQKv-LCkbY+5scn6pk6fL8kpmjNR=66rjeY=NqKbqRkhA@mail.gmail.com/
>
> **Description**: Proposal draft thread.
>
> **[GSoC] Re: [PATCH v11 8/8] cat-file: add remote-object-info to batch-command**
>
> https://lore.kernel.org/git/20260312214154.89120-1-pabloosabaterr@gmail.com/
>
> **Description**: While testing Eric's v11 I found and reported a new
> bug. When `remote-object-info` is preceded by a local query,
> `data->type` isn't cleared, causing it to return the wrong type.
>
Nice to see that you're proactive and already testing out the branch.
> I have also studied the documentation provided and Eric Ju's work from
> v0 to v11 including all the feedback he got up to March 2025, the
> feedback he got from Junio Hamano and Jeff King, taking notes about
> what's left to be done and what else I can contribute to the already
> proposed project. That's how I've identified everything that I will
> address on the Problem, Solution and Timeline sections.
>
> I built Eric Ju's v11 and tested the bugs reported to his patch [5],
> I've confirmed the segfault and the `die()`, and found a new one:
> - When a local `info` runs before `remote-object-info` sharing the
> same format string, `data->type` isn't being cleared. For a blob
> queried remotely after a local commit, `data->type` for the blob
> becomes 'commit' with no error. I reported it on the mailing list [6].
>
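The stale-type behaviour described above can be illustrated with a small
sketch, assuming a simplified `expand_data`-like struct. Every name here
(`sk_expand_data`, `sk_reset_expand_data()`, `sk_type_or_empty()`,
`sk_fill_from_remote()`) is hypothetical and only mirrors the idea of
clearing per-object state before each command and treating an unset type
as the empty string; it is not how the v11 series is written.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for Git's object type enum. */
enum sk_otype {
	SK_OBJ_BAD = -1,
	SK_OBJ_COMMIT = 1,
	SK_OBJ_TREE,
	SK_OBJ_BLOB,
	SK_OBJ_TAG
};

/* Hypothetical, reduced version of the per-object expansion state. */
struct sk_expand_data {
	unsigned long size;
	enum sk_otype type;
};

/* Clear per-object fields so nothing leaks from the previous command. */
static void sk_reset_expand_data(struct sk_expand_data *data)
{
	assert(data);
	data->size = 0;
	data->type = SK_OBJ_BAD;
}

/* Return the type name, or "" when the type is unset or invalid. */
static const char *sk_type_or_empty(enum sk_otype type)
{
	switch (type) {
	case SK_OBJ_COMMIT:
		return "commit";
	case SK_OBJ_TREE:
		return "tree";
	case SK_OBJ_BLOB:
		return "blob";
	case SK_OBJ_TAG:
		return "tag";
	default:
		return "";
	}
}

/* Store a remote reply that only carries a size. */
static void sk_fill_from_remote(struct sk_expand_data *data,
				unsigned long size)
{
	sk_reset_expand_data(data);	/* drop any earlier local result */
	data->size = size;
}
```

If a local query left `type` set to commit, calling
`sk_fill_from_remote()` for the next object resets it first, so the type
expands to the empty string rather than the earlier object's type.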
> I attempted to test rebasing Eric Ju's v11 to master and got conflicts
> on 4 out of the 8 commits:
> - `d04cf85ece` t1006: split test utility functions into new "lib-cat-file.sh".
> - `t/t1006-cat-file.sh`
> - `d918f720d8` fetch-pack: refactor packet writing.
> - `fetch-pack.c`
> - `2daf9ed803` transport: add client support for object-info.
> - `Makefile`
> - `c3ba4afaf6` cat-file: add remote-object-info to batch-command.
> - `object-file.c`, `object-store-ll.h` (deleted).
It's been a while, so this is expected. I guess the first week[s] would
mostly be spent getting this series up to date.
>
> I'm staying active on the mailing list, learning Git's workflow, and
> learning from the feedback I've received from the maintainer (Junio)
> on my patches.
>
> Following the project guidelines, I haven't done anything on the
> project that could step on other candidates' work before being
> accepted, and instead I'm focusing on understanding the project and
> its needs, and independent patches that will make the Git project more
> familiar and understandable to me.
I know this is the silent expectation, but nice to see it listed out.
>
> ## The Problem
>
> Eric Ju's work remains unmerged after v11 because of these issues:
>
> - The format validation uses `strstr()`, which only checks for
> `%(objectsize)`. This causes two different errors:
> - For atoms that `expand_atom()` recognizes but the remote doesn't
> support (`objecttype`, `deltabase`, ...), `expand_atom()` returns 1,
> but `data->type` contains only garbage when accessed, causing a
> segfault, as Jeff King noted [3].
> - For atoms unknown to `expand_atom()`, it returns 0, so
> `expand_format()` calls `strbuf_expand_bad_format`, which calls
> `die()`, as Jeff King found [3].
> Both cases block the command, including local `info` queries if the
> same format string is shared. Unsupported remote placeholders should
> return an empty string, matching how `for-each-ref` returns empty for
> known but inapplicable atoms like `%(tagger)` on non-tags [4] [5].
>
> - When local and remote queries are mixed, `data->type` is not being
> cleared between commands. `remote-object-info` returns the wrong type
> data from a previous local query [6].
>
> - Style and code issues marked by Junio Hamano [2] and Jeff King [3]
> [5] are still undone.
> - comment style.
> - `#define` formatting.
> - line length.
> - misleading error messages.
> - missing `count > MAX_ALLOWED_OBJ_LIMIT` check at `split_cmdline()`.
> - if/else invert at `get_remote_info()`.
> - `%(objecttype)` is not yet supported on either client or server side.
>
Again, well done on the research. It is always nice to see the
requirements being listed out clearly which makes the objective clearer.
> ## The Solution
>
> There are two main goals:
>
> ### Goal 1: Rebase and finish Eric's work
>
> Starting from where Eric Ju left off, I will rebase it on top of the
> current `master` branch and address the feedback left to do:
> - Fix style in comments, `#define` formatting and line length.
> - Fix misleading error message in the overflow check.
> - Add missing `count > MAX_ALLOWED_OBJ_LIMIT` check after `split_cmdline()`.
> - Invert if/else on `get_remote_info()` to keep the small block first
> (the error one) as Junio suggested.
>
> #### Replace `strstr()` format validation with allow_list in `expand_atom()`
>
> `strstr()` isn't enough to fully validate the placeholders: it only
> searches for `%(objectsize)`, and unsupported placeholders cause
> segfaults. Jeff King noted [4] that the fix was to refactor the
> validation with an allow_list in `expand_atom()` or `expand_format()`.
> The best option is to place the validation in `expand_atom()`. Why
> `expand_atom()`?
> - There are two failure cases: the first happens inside
> `expand_atom()` before it returns (the segfault), and the second
> happens when `expand_atom()` returns 0 and `die()` is called.
> Placing the `allow_list` check at the top of `expand_atom()` prevents
> both: in remote mode, an atom outside the list appends nothing to
> `sb` and returns 1, so `data->type` is never accessed (no segfault)
> and `expand_format()` never reaches `die()`.
> As extra safety, initializing `data->type` to `OBJ_BAD` and checking
> for `NULL` from `type_name()` ensures that, even without the
> `allow_list`, uninitialized data doesn't cause a segfault.
> At Goal 1, only `%(objectname)` and `%(objectsize)` will be in the
> allow_list. Goal 2 will bring `%(objecttype)` support.
>
> ### Goal 2: Adding `%(objecttype)`
>
> Following what Calvin Wan did in 2021 [8] for `%(objectsize)`, v2
> protocol needs to be extended on the server side to support the new
> `%(objecttype)` placeholder:
> - extend `object_info_advertise()` at `serve.c`
> - add .type to `requested_info` struct at `serve.c`
> - support `type` in `cap_object_info()` at `protocol-caps.c`
> - look for type at `send_info()` at `protocol-caps.c`
>
> Following object-info protocol docs [7] it should look like:
> ```
> attrs = "size" SP "type"
> obj-type = "blob" | "tree" | "commit" | "tag"
> obj-info = obj-id SP obj-size SP obj-type
> info = PKT-LINE(attrs LF)
> *PKT-LINE(obj-info LF)
> ```
>
> `%(objecttype)` needs to be added to the `allow_list`. Client side
> needs to learn to ask for `%(objecttype)` from remote, parse what has
> been received and fill `expand_data` with the actual type. This makes
> it return the object type instead of the empty string returned while
> it was unsupported.
>
> Default format evolves to `%(objectname) %(objecttype) %(objectsize)`.
> Test and document new placeholder support and server side extension.
>
> #### Backward Compatibility
>
> There are four possible scenarios to happen between client and server:
>
> 1. **The server doesn't know type (new client but old server)**:
>
> After receiving the server capabilities, a client will only request
> what the server advertises. The `allow_list` would handle this,
> returning an empty string when the server doesn't support it.
>
> 2. **The server knows type but the client doesn't (new server but old client)**:
>
> Following `gitprotocol-v2.adoc`, "Clients must ignore all unknown
> keys", it will ignore type, and request only the known capabilities.
>
> 3. **Both know type (new client and new server)**:
>
> Server advertises type, client requests it and gets the type data.
>
> 4. **Both know type but protocol middleware doesn't (new client, new
> server but old middleware)**:
>
> If the server advertises type but the client doesn't receive the
> advertisement, the client won't ask for anything unadvertised. If the
> client asks for type but the server doesn't receive the request, the
> server will only return the known capabilities.
>
> **performance considerations**
>
> Getting an object's type only requires reading its header. To get the
> size, `oid_object_info()` at `object-file.c` is already being called,
> and it returns the object type in the same call. Sending the type
> string costs, in the worst case, only 6 bytes for the "commit" string.
>
> ## Timeline
>
> I've planned this with enough slack that the actual work may take
> less time than stated here.
>
> May 1-24: Community Bonding
> - Keep working on my ongoing patches and new ones.
> - Meet with my assigned mentor to get feedback on my proposal, agree
> on how I will report my progress beyond the submitted code (possibly
> through blog posts), and pick up tips and tricks for working better
> on Git.
> - Confirm with mentor that the `allow_list` approach is still the best option.
> - Draft commits structure.
> - Set up a blog to keep track of how GSoC at Git is going.
>
> Week 1-2: (May 26 - June 8)
> - Start Goal 1 fixes.
> - Fix style and code issues.
>
> Week 3-4: (June 9 - June 22)
> - Start with Goal 1 implementations (allow_list approach).
>
> Week 5-6: (June 23 - July 6):
> - Goal 1 should be polished or close to the final form.
> - Send patch series for Goal 1.
> - Start Goal 2.
> - Prepare the midterm report.
>
> **Midterm evaluation** (July 7 - 11) as specified on GSoC timeline docs
> - Goal 1 submitted.
>
> Week 7-8: (July 14 - July 27)
> - Start with server side v2 protocol extension (`%(objecttype)`).
>
> Week 9-10: (July 28 - August 10)
> - Add `%(objecttype)` to the `allow_list` from Goal 1.
> - Client side extension.
> - End to end tests and documentation.
> - Default format becomes `%(objectname) %(objecttype) %(objectsize)`.
> - Send patch series.
>
> Week 11-12: (August 11 - August 24)
> - Goal 2 should be close to done.
> - Polish everything, all tests pass, good test coverage, no
> style/comment issues.
> - Final documentation review.
> - Prepare for final evaluation.
>
> **Final evaluation** (August 18-24) as specified on GSoC timeline docs
>
> ### Additional objectives
>
> If there is enough time, or as future work after the project, I have
> some ideas on how this could evolve:
>
> #### More placeholders support
>
> I've checked that Eric's v11 patch only supports `%(objectsize)` on
> server side, but on the client side there are other placeholders that
> can be added too. With the `allow_list` and having Goal 2 implemented,
> adding more placeholders becomes trivial.
>
> - `%(objectsize:disk)`: Returns the size on disk (compressed or as a
> delta) instead of the uncompressed size that `%(objectsize)`
> returns. To do this, the server would need to send the actual
> on-disk size.
>
> - `%(deltabase)`: Returns the delta base object OID. Non-delta objects
> return the zero OID, as they do locally.
>
> #### Returning missing blobs from a tree ordered
>
> In a partial clone, someone might want to know what blobs are missing
> inside a concrete tree and their size before fetching them.
> The idea is to build on top of `remote-object-info`:
> Given a tree hash, return the missing blobs (inside that tree) ordered by size.
>
You might want to look at 'git-backfill(1)'. I recall there were some
thoughts on extending that command to do something similar. But I don't
remember off the top of my head.
> Thanks for reading my proposal and considering my application. I'm
> very excited about this opportunity,
> Pablo
>
> [1]: https://lore.kernel.org/git/20250221190451.12536-1-eric.peijian@gmail.com/
> "Eric Ju's v11 patch"
>
> [2]: https://lore.kernel.org/git/xmqqo6yr3wc4.fsf@gitster.g/ "Junio
> Hamano feedback"
>
> [3]: https://lore.kernel.org/git/20250224234720.GC729825@coredump.intra.peff.net/
> "Jeff King feedback"
>
> [4]: https://lore.kernel.org/git/20250313060250.GH94015@coredump.intra.peff.net/
> "options for strstr() by Jeff King"
>
> [5]: https://lore.kernel.org/git/20250324033922.GB690093@coredump.intra.peff.net/
> "Jeff King follow-up"
>
> [6]: https://lore.kernel.org/git/20260312214154.89120-1-pabloosabaterr@gmail.com/
> "data->type not being cleared bug"
>
> [7]: https://github.com/git/git/blob/master/Documentation/gitprotocol-v2.adoc#object-info
> "object-info protocol docs"
>
> [8]: https://lore.kernel.org/git/20220728230210.2952731-1-calvinwan@google.com/#t
> "Calvin Wan's patch series"
>
> ---
>
> Again, thanks a lot for the feedback.
Regards,
Karthik
* Re: [GSoC] Proposal: Complete and extend the remote-object-info command for git cat-file
2026-03-16 21:38 ` Karthik Nayak
@ 2026-03-18 10:45 ` Pablo
0 siblings, 0 replies; 14+ messages in thread
From: Pablo @ 2026-03-18 10:45 UTC (permalink / raw)
To: Karthik Nayak
Cc: Chandra Pratap, git, christian.couder, jltobler, Ayush Chandekar,
Siddharth Asthana
Karthik Nayak (<karthik.188@gmail.com>) writes:
> Perhaps a diff would be a good addition for next time? :)
Yes, I'll add a diff from now on.
> It's been a while, so this is expected. I guess the first week[s] would
> mostly be spent getting this series up to date.
Yes, it's mentioned in The Solution section, but I'll make it clearer
by adding it explicitly to the Timeline as the first thing to do.
> You might want to look at 'git-backfill(1)'. I recall there were some
> thoughts on extending that command to do something similar. But I don't
> remember off the top of my head.
Thanks, I didn't know about that. From what I've found about the
'git-backfill' extension that Stolee is working on [1], it's similar,
but (correct me if I'm wrong) 'git-backfill' fetches the branch/path.
This idea would only bring the metadata asked for in a format string,
e.g. "%(objectname) %(objectsize) %(objecttype)", leveraging what has
been done in Goal 1 and Goal 2. I'll add a clarification to the
proposal about this.
This would complement the 'git-backfill' extension by querying the
metadata from a branch first and then fetching it with 'git-backfill'.
Thanks for the feedback and compliments,
Pablo
[1]: https://lore.kernel.org/git/pull.2070.git.1773707361.gitgitgadget@gmail.com/
"Stolee 'git-backfill' extension"
Thread overview: 14+ messages
2026-03-05 20:48 [GSOC Proposal] Complete and extend the remote-object-info command for git cat-file SoutrikDas
2026-03-15 10:11 ` SoutrikDas
2026-03-16 12:08 ` Christian Couder
2026-03-17 13:06 ` SoutrikDas
2026-03-16 20:46 ` Karthik Nayak
2026-03-17 15:13 ` SoutrikDas
2026-03-20 13:12 ` [GSoC Proposal v2] " SoutrikDas
-- strict thread matches above, loose matches on Subject: below --
2026-03-13 10:17 [GSoC] Proposal: " Pablo
2026-03-14 5:58 ` Chandra Pratap
2026-03-14 18:31 ` Pablo
2026-03-15 9:20 ` Chandra Pratap
2026-03-16 11:21 ` Christian Couder
2026-03-16 21:38 ` Karthik Nayak
2026-03-18 10:45 ` Pablo