public inbox for git@vger.kernel.org
* [GSOC Proposal] Complete and extend the remote-object-info command for git cat-file
@ 2026-03-05 20:48 SoutrikDas
  2026-03-15 10:11 ` SoutrikDas
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: SoutrikDas @ 2026-03-05 20:48 UTC (permalink / raw)
  To: git
  Cc: christian.couder, karthik.188, jltobler, ayu.chandekar,
	siddharthasthana31, chandrapratap3519


Hi!

This is my project proposal for GSoC 2026.

I am interested in the project idea "Complete and extend the
remote-object-info command for git cat-file".


# Complete and extend the remote-object-info command for git cat-file

## Contact

- Name: Soutrik Das
- E-mail: valusoutrik@gmail.com
- Github: https://github.com/SoutrikDas
- LinkedIn: https://www.linkedin.com/in/soutrik-das/

## About Me

My name is Soutrik Das. I am a developer with a bachelor's degree in CS
from the Indian Institute of Technology, Dhanbad. I am currently pursuing
a master's degree in AI at the Indian Institute of Technology, Bhubaneswar.

I don't have much experience contributing to something as large as Git,
but I would love to learn everything I can from this experience. I have
experience in C/C++ from my B.Tech coursework and from participating in
Codeforces contests.


## Pre GSOC

I started exploring Git's codebase around February 2026 and sent my first
patch as a doc fix, followed by a microproject modernizing tests:

- [PATCH] doc: fix repo_config documentation reference [1]
    status: merged to master 
    Merge Commit: 94336d77bcbf4360b67a9454d8bf2e84b3d88ae7
    Description: Replace the path for the repo_config() documentation 
    from 'Documentation/technical/api-config.h' to 'config.h'.

- [GSOC PATCH] t7003: modernize path existence checks using test helpers [2]
    status: merged to master 
    Merge Commit: 11294bb0fa540d214d071b32cf74b1ed37b3bbbd
    Description: Replace direct uses of 'test -f' and 'test -d' with
    git's helper functions 'test_path_is_file', 'test_path_is_missing'
    and 'test_path_is_dir'.


I have read through most of Eric Ju's [4] work and some of Calvin Wan's [5]
work. I am still finding more things to understand in each thread, but
I feel I have grasped the basics.

My work in this project would focus on implementing the changes
suggested at the end of Eric Ju's [Patch v11].

I wouldn't say I understand every bit of the discussion in that thread,
but my general understanding is:

Calvin Wan and Eric Ju have already implemented a client-side command
called get_remote_info, but it is designed to be batched, to reduce
the number of network trips needed to get each object's data.

I applied Eric Ju's patch series on top of an old master commit
(2d2a71ce85), since I could not find a base commit for the series. The
patches applied cleanly, and I also experimented with a very rough but
working "%(objecttype)" implementation, so it now prints output like this:

29658341f39210201ff7f72a4be83937cf2288c5 14 blob


## Project : Complete and extend the remote-object-info command for git cat-file

Currently, in a partial clone, the user cannot retrieve information
about an object without fetching the object itself first. To solve this
problem, Calvin Wan and Eric Ju designed a patch series that uses the
capabilities of protocol v2 servers.

This was done in the form of "remote-object-info".

However, only %(objectsize) was implemented, and that series was not
merged. This project has three goals:

1: To rebase and finalize Calvin Wan and Eric Ju's work by addressing
    the feedback on Eric Ju's Patch v11.

2: To add support for objecttype in remote-object-info.

3: To discuss other information types like objectsize:disk and deltabase.

Project duration: approx. 12 weeks

## Timeline 

Mar 6-31 : Refine Proposal

    If possible I would like to submit small patches... but first I will
    have to rebase Eric Ju's Patches ... I am not sure if I can do this
    before GSOC...

    If not, I plan to contribute to git in other areas.

May 1-24 : Community Bonding 
    1-7  : Understand relevant underlying/ helper functions
    8-24 : Ask about any design related problems/decisions

May 25 - Jun 14 : Start a Patch Series to rebase Calvin Wan and Eric Ju's work
    and keep refining

Jun 15 - Aug 15 : Start and keep refining Patch Series to add support for
    object type information

Aug 16 - Aug 24 : Discuss and Implement other object information if possible
    Concurrently I shall make a report for all the work done.

## Availability

My current semester ends in the first week of April, so I will be
able to contribute 7-8 hours per day, totalling around 35-40 hours a
week on the project.

Total weeks = 12, total hours = 35*12 = 420.
This leaves plenty of room to accommodate any unforeseen circumstances
that may arise during the project.

## RFC 

I have a few ideas but do not know if they are worth pursuing, so I will
leave them here in the first draft 

- Addition of a remote-object-info outside of batch mode:
    Yes, it is best used in batch mode, but if a user wants only one
    object's size or type, should they be able to just run
    `git cat-file -r origin <oid>`
    and get the size and type? Or something similar; I am not sure if
    the way I have depicted it conforms to Git's design.

- Addition of commands for common user behaviour:
    I don't know if it's going to be a common user behaviour, but what
    about
    `git cat-file -r --all-absent`
    Or inside "git cat-file --batch-command="<format> remote-object-info 
    --all-absent --type=tree <remote>"
    which would basically fill in remote-object-info with all the blobs
    that are currently absent from the worktree?
    The user would not need to fill them in by hand if this is a common
    enough use case.

- Sort according to size:
    Maybe a user would want to check what the largest file they don't
    have yet is.

- Get total missing blob size:
    The use case would be when someone wants to know exactly how much
    there is to download before starting the download.
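
The last two ideas can be illustrated with a small standalone sketch.
It is not Git code: `struct missing_blob`, `cmp_size_desc()` and
`sort_and_total()` are hypothetical names, and it assumes the sizes of
the missing blobs have already been collected (for example from
remote-object-info output).

```c
#include <stdlib.h>

/* Hypothetical record for one blob that is absent locally. */
struct missing_blob {
    const char *oid;     /* hex object name */
    unsigned long size;  /* size as reported by the remote */
};

/* qsort() comparator: largest size first. */
static int cmp_size_desc(const void *a, const void *b)
{
    const struct missing_blob *x = a;
    const struct missing_blob *y = b;

    if (x->size < y->size)
        return 1;
    if (x->size > y->size)
        return -1;
    return 0;
}

/*
 * Sort `blobs` by size (descending) and return the sum of all sizes,
 * i.e. how much there would be to download.
 */
static unsigned long long sort_and_total(struct missing_blob *blobs, size_t n)
{
    unsigned long long total = 0;
    size_t i;

    for (i = 0; i < n; i++)
        total += blobs[i].size;
    if (n)
        qsort(blobs, n, sizeof(*blobs), cmp_size_desc);
    return total;
}
```

After the call, the first array element is the largest missing blob and
the return value is the total download size in bytes.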
    
Thank you for your time in reviewing my proposal and for considering
my application. I am excited to learn everything I can from Git.

Thanks and Regards,
Soutrik


[1] : pull.2187.git.git.1770293021383.gitgitgadget@gmail.com
[2] : 20260209172445.39536-1-valusoutrik@gmail.com
[3] : 20260225190306.39358-1-valusoutrik@gmail.com
[4] : 20240628190503.67389-1-eric.peijian@gmail.com
[5] : 20220728230210.2952731-1-calvinwan@google.com

* [GSoC] Proposal: Complete and extend the remote-object-info command for git cat-file
@ 2026-03-13 10:17 Pablo
  2026-03-14  5:58 ` Chandra Pratap
  0 siblings, 1 reply; 14+ messages in thread
From: Pablo @ 2026-03-13 10:17 UTC (permalink / raw)
  To: git, christian.couder, karthik nayak, jltobler, Ayush Chandekar,
	Siddharth Asthana, Chandra Pratap

## Synopsis

This project finishes Eric Ju's work on `remote-object-info` for `git
cat-file --batch-command` [1], resolves the pending feedback from
Junio Hamano [2] and Jeff King [3] [4] [5], and extends support for
`%(objecttype)`.

Expected project size: 350 hours (Medium)

## About Me and Contact

Name: Pablo Sabater Jiménez (he/him)

Age: 19

Education: Currently in my second year of Computer Science at the
University of Murcia, Spain

Location: Murcia, Spain (CET, UTC+1)

Languages: C (solid), shell(bash) (good)

Tools: git(proficient)

I've checked that I'm eligible for GSoC 2026.

Email: pabloosabaterr@gmail.com
GitHub: https://github.com/pabloosabaterr

## Relevant Projects

- 16 bit CPU emulator. Good example of C programming.

  cpu: https://github.com/pabloosabaterr/CPU16

- Compiler. Good example of working on bigger projects.

  compiler: https://github.com/pabloosabaterr/Orn

## Pre-GSoC Work

### Introduction

**[GSoC] Introduction Pablo Sabater**

https://lore.kernel.org/git/CAN5EUNR0KJ4VeuOF_bVupaTuGKGaeTKa0SMRAUoBPo5wWi8YGA@mail.gmail.com

A mailing list thread where I introduced myself to the git community.

### Microproject

**[GSoC PATCH v4] t9200: replace test -f/-d with modern path helpers**

https://lore.kernel.org/git/20260312173305.15112-1-pabloosabaterr@gmail.com/

Merged to `next` on 2026-03-12 at 8500bdf172. Replaces `test -f` with
the helper `test_path_is_file`, which makes debugging failing tests
easier thanks to better reporting.
This was one of the suggested microprojects.

### Other contributions

**[GSoC PATCH v2] test-lib: print escape sequence names**

https://lore.kernel.org/git/20260311031442.11942-1-pabloosabaterr@gmail.com/

Will merge to `next`. When an expected/actual check failed, escape
sequences were printed as their octal codes. This patch prints the
actual escape sequence name instead, adds tests, and updates the
expected output.

**[GSoC PATCH] t9200: handle missing CVS with skip_all**

https://lore.kernel.org/git/20260311194002.190195-1-pabloosabaterr@gmail.com/

Merged to `next` on 2026-03-12 at 8500bdf172. Wraps the CVS setup in a
skip_all for clearer failure reporting and moves Git initialization
into its own test_expect_success.

**[GSoC] Re: [PATCH v11 8/8] cat-file: add remote-object-info to batch-command**

https://lore.kernel.org/git/20260312214154.89120-1-pabloosabaterr@gmail.com/

While testing Eric's v11 I found and reported a new bug: when
`remote-object-info` is preceded by a local query, `data->type`
isn't cleared, so it returns the wrong type.

I have also studied the provided documentation and Eric Ju's work from
v0 to v11, including all the feedback he got up to March 2025 from
Junio Hamano and Jeff King, taking notes about what's left to be done
and what else I can contribute to the already proposed project. That's
how I identified everything that I will address in the Problem,
Solution and Timeline sections.

I built Eric Ju's v11 and tested the bugs reported against his patch [5].
I confirmed the segfault and the `die()`, and found a new one:
- When a local `info` runs before `remote-object-info` with the same
format string, `data->type` isn't cleared. For a blob queried remotely
after a local commit query, `data->type` becomes 'commit', with no
error. I reported it on the mailing list [6].

I attempted to test rebasing Eric Ju's v11 to master and got conflicts
on 4 out of the 8 commits:
- `d04cf85ece` t1006: split test utility functions into new "lib-cat-file.sh".
        - `t/t1006-cat-file.sh`
- `d918f720d8` fetch-pack: refactor packet writing.
        - `fetch-pack.c`
- `2daf9ed803` transport: add client support for object-info.
        - `Makefile`
- `c3ba4afaf6` cat-file: add remote-object-info to batch-command.
        - `object-file.c`, `object-store-ll.h` (deleted).

I'm active on the mailing list, learning Git's workflow and learning
from the feedback I've received from the maintainer (Junio) on my
patches.

Following the project guidelines, I haven't done anything on the
project that could step on other candidates' work before being
accepted, and instead I'm focusing on understanding the project and
its needs, and independent patches that will make the Git project more
familiar and understandable to me.

## Availability

My classes end the first week of May. From then until September I
won't have any classes which leaves me free to fully focus on the
project. I can dedicate 8+ hours each day, and for sure 40 hours a
week.

## The Problem

Git's partial clone allows cloning repositories without downloading
all objects (blobs, trees, ...). These objects are fetched on demand
from the remote when needed. However, when a user needs metadata about
these remote objects (size, type, hash, ...), Git has no efficient way
of doing this without downloading all the object content.

 The server side support for `object-info` protocol was implemented by
Calvin Wan in 2021. Eric Ju built the client-side `remote-object-info`
for `cat-file --batch-command`. Eric Ju's work remains unmerged after
v11 because of these issues:

 - The format validation uses `strstr()`, which only checks for
`%(objectsize)`. This leads to two different failures:
   - For atoms that `expand_atom()` recognizes but the remote doesn't
support (`objecttype`, `deltabase`, ...), `expand_atom()` returns 1, but
`data->type` contains only garbage when it is accessed, causing a
segfault, as Jeff King noted [3].
   - For atoms unknown to `expand_atom()`, it returns 0, so
`expand_format()` calls `strbuf_expand_bad_format()`, which calls
`die()`, as Jeff King found [3].
   Both cases block the command, including local `info` queries if the
same format string is shared. Unsupported remote placeholders should
return an empty string, matching how `for-each-ref` returns empty for
known but inapplicable atoms like `%(tagger)` on non-tags [4] [5].

 - When local and remote queries are mixed, `data->type` is not being
cleared between commands. `remote-object-info` returns the wrong type
data from a previous local query [6].

 - Style and code issues raised by Junio Hamano [2] and Jeff King [3]
[5] are still unaddressed:
   - comment style.
   - `#define` formatting.
   - line length.
   - misleading error messages.
   - missing `count > MAX_ALLOWED_OBJ_LIMIT` check at `split_cmdline()`.
   - if/else inversion in `get_remote_info()`.
 - `%(objecttype)` is not yet supported on either client or server side.
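
The stale `data->type` problem from the list above can be pictured with
a small standalone sketch. It does not use Git's real structures:
`struct query_data`, the `TYPE_*` values and the helper functions are
hypothetical stand-ins, and the remote path is assumed to fill in only
the size. The point is that resetting the shared state before every
query keeps a previous local answer from leaking into a remote one.

```c
/* Hypothetical object types; TYPE_UNSET marks "no answer available". */
enum sketch_type {
    TYPE_UNSET = -1,
    TYPE_COMMIT = 1,
    TYPE_TREE,
    TYPE_BLOB,
    TYPE_TAG
};

/* Hypothetical state shared by local and remote queries. */
struct query_data {
    enum sketch_type type;
    unsigned long size;
};

/* Clear everything a previous query may have left behind. */
static void query_data_reset(struct query_data *d)
{
    d->type = TYPE_UNSET;
    d->size = 0;
}

/* A local query knows both the type and the size. */
static void local_info(struct query_data *d, enum sketch_type t,
                       unsigned long size)
{
    query_data_reset(d);
    d->type = t;
    d->size = size;
}

/* A remote query (in this sketch) only learns the size. */
static void remote_info(struct query_data *d, unsigned long size)
{
    query_data_reset(d);
    d->size = size;
}

/* An unset type prints as the empty string rather than a stale value. */
static const char *sketch_type_name(enum sketch_type t)
{
    switch (t) {
    case TYPE_COMMIT: return "commit";
    case TYPE_TREE:   return "tree";
    case TYPE_BLOB:   return "blob";
    case TYPE_TAG:    return "tag";
    default:          return "";
    }
}
```

Without the reset in `remote_info()`, a remote query that follows
`local_info(&d, TYPE_COMMIT, ...)` would still report "commit".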

## The Solution

There are two main goals:

### Goal 1: Rebase and finish Eric's work

Starting from where Eric Ju left off, I will rebase his series on top
of the current `master` branch and address the remaining feedback:
- Fix style in comments, `#define` formatting and line length.
- Fix misleading error message in the overflow check.
- Add missing `count > MAX_ALLOWED_OBJ_LIMIT` check after `split_cmdline()`.
- Invert if/else in `get_remote_info()` to keep the small block first
(the error one) as Junio suggested.

#### Replace `strstr()` format validation with allow_list in `expand_atom()`

`strstr()` isn't enough to fully validate the placeholders: it only
searches for `%(objectsize)`, and unsupported placeholders cause
segfaults. The fix is to refactor the validation with an allow_list in
`expand_atom()`. But why `expand_atom()`, when Jeff King suggested
either `expand_atom()` or `expand_format()` [4]?
- There are two failure cases: the first happens inside `expand_atom()`
before it returns (the segfault), and the second is the `die()` call
made when `expand_atom()` returns 0.
  Placing the `allow_list` at the top of `expand_atom()` prevents both:
in remote mode, an atom outside the list appends nothing to `sb` and
returns 1, so `data->type` is never accessed (no segfault) and
`expand_format()` never reaches `die()`.
  As extra safety, initializing `data->type` to `OBJ_BAD` and checking
for `NULL` from `type_name()` ensures that, even without the
`allow_list`, uninitialized data doesn't cause a segfault.
  In Goal 1, only `%(objectname)` and `%(objectsize)` will be in the
allow_list. Goal 2 will bring `%(objecttype)` support.

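
A minimal standalone sketch of the allow-list behaviour described here.
This is not Git's `expand_atom()`: `struct remote_obj`,
`remote_atom_allowed()` and `remote_expand_format()` are hypothetical
names, and a fixed-size buffer stands in for a `strbuf`. Any placeholder
outside the allow-list expands to the empty string instead of reading
unset data or aborting.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for what a remote query fills in. */
struct remote_obj {
    const char *oid;     /* hex object name */
    unsigned long size;  /* size reported by the server */
};

/* Atoms the remote path can answer in Goal 1 (per the proposal). */
static const char *const remote_allow_list[] = { "objectname", "objectsize" };

static int remote_atom_allowed(const char *atom, size_t len)
{
    size_t n = sizeof(remote_allow_list) / sizeof(remote_allow_list[0]);
    size_t i;

    for (i = 0; i < n; i++) {
        const char *a = remote_allow_list[i];
        if (strlen(a) == len && !memcmp(a, atom, len))
            return 1;
    }
    return 0;
}

/*
 * Expand "%(atom)" placeholders from `fmt` into `out`.
 * Returns 0 on success, -1 if `out` is too small or a placeholder
 * is not terminated by ')'.
 */
static int remote_expand_format(const char *fmt, const struct remote_obj *obj,
                                char *out, size_t outsz)
{
    size_t used = 0;

    if (!outsz)
        return -1;
    while (*fmt) {
        const char *end;
        const char *val = "";   /* default: unsupported atom, empty */
        char tmp[32];
        size_t len, vlen;

        if (fmt[0] != '%' || fmt[1] != '(') {
            if (used + 1 >= outsz)
                return -1;
            out[used++] = *fmt++;   /* copy literal text */
            continue;
        }
        end = strchr(fmt + 2, ')');
        if (!end)
            return -1;
        len = (size_t)(end - (fmt + 2));
        if (remote_atom_allowed(fmt + 2, len)) {
            if (!memcmp(fmt + 2, "objectname", len)) {
                val = obj->oid;
            } else {
                snprintf(tmp, sizeof(tmp), "%lu", obj->size);
                val = tmp;
            }
        }
        vlen = strlen(val);
        if (used + vlen >= outsz)
            return -1;
        memcpy(out + used, val, vlen);
        used += vlen;
        fmt = end + 1;
    }
    out[used] = '\0';
    return 0;
}
```

With this sketch, a format such as `%(objectsize) [%(objecttype)]`
yields the size followed by empty brackets, which mirrors the
"empty string for unsupported placeholders" behaviour.
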
### Goal 2: Adding `%(objecttype)`

Following what Calvin Wan did in 2021 for `%(objectsize)`, the v2
protocol needs to be extended on the server side to support the new
`%(objecttype)` placeholder:
- extend `object_info_advertise()` at `serve.c`
- add `.type` to the `requested_info` struct at `serve.c`
- support `type` in `cap_object_info()` at `protocol-caps.c`
- look for type in `send_info()` at `protocol-caps.c`

Following the object-info protocol docs [7], it should look like:
```
  attrs = "size" SP "type"
  obj-type = "blob" | "tree" | "commit" | "tag"
  obj-info = obj-id SP obj-size SP obj-type
  info = PKT-LINE(attrs LF)
        *PKT-LINE(obj-info LF)
```

`%(objecttype)` needs to be added to the `allow_list`. Client side
needs to learn to ask for `%(objecttype)` from remote, parse what has
been received and fill `expand_data` with the actual type. This makes
it return the object type instead of the empty string returned while
it was unsupported.
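
As an illustration of the client-side parsing step, here is a standalone
sketch that parses one `obj-info` line in the shape proposed above
(`obj-id SP obj-size SP obj-type`). It is not Git code: `struct obj_info`
and `parse_obj_info()` are hypothetical, and pkt-line framing is assumed
to have been removed already.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parsed form of one obj-info line. */
struct obj_info {
    char oid[65];        /* 40 hex chars (SHA-1) or 64 (SHA-256) + NUL */
    unsigned long size;
    char type[8];        /* "blob", "tree", "commit" or "tag" */
};

static int is_known_type(const char *s)
{
    return !strcmp(s, "blob") || !strcmp(s, "tree") ||
           !strcmp(s, "commit") || !strcmp(s, "tag");
}

/*
 * Parse "<oid> SP <size> SP <type>[LF]" into `out`.
 * Returns 0 on success, -1 on malformed input (in which case `out`
 * may be partially written and must not be used).
 */
static int parse_obj_info(const char *line, struct obj_info *out)
{
    const char *sp, *type, *rest;
    char *end;
    size_t oid_len, type_len, i;

    sp = strchr(line, ' ');
    if (!sp)
        return -1;
    oid_len = (size_t)(sp - line);
    if (oid_len != 40 && oid_len != 64)
        return -1;
    for (i = 0; i < oid_len; i++)
        if (!isxdigit((unsigned char)line[i]))
            return -1;
    memcpy(out->oid, line, oid_len);
    out->oid[oid_len] = '\0';

    /* strtoul() would accept signs and spaces, so require a digit. */
    if (!isdigit((unsigned char)sp[1]))
        return -1;
    errno = 0;
    out->size = strtoul(sp + 1, &end, 10);
    if (errno || *end != ' ')
        return -1;

    type = end + 1;
    type_len = strcspn(type, "\n");
    if (type_len >= sizeof(out->type))
        return -1;
    memcpy(out->type, type, type_len);
    out->type[type_len] = '\0';

    rest = type + type_len;
    if (*rest == '\n')
        rest++;
    if (*rest)
        return -1;  /* trailing garbage after the optional LF */
    return is_known_type(out->type) ? 0 : -1;
}
```

Rejecting unknown type strings here keeps a misbehaving server from
injecting arbitrary text into the `%(objecttype)` output.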

Default format evolves to `%(objectname) %(objecttype) %(objectsize)`.
Test and document new placeholder support and server side extension.

#### Backward Compatibility

There are four possible scenarios between client and server:
1. The server doesn't know type (new client but old server):

   After receiving the server capabilities, a client will only request
what the server advertises. The `allow_list` would handle this,
returning an empty string when the server doesn't support it.
2. The server knows type but the client doesn't (new server but old client):

   Following `gitprotocol-v2.adoc`, "Clients must ignore all unknown
keys", it will ignore type, and request only the known capabilities.
3. Both know type (new client and new server):

   Server advertises type, client requests it and gets the type data.
4. Both know type but protocol middleware doesn't (new client, new
server but old middleware):

   If the server advertises type but the advertisement never reaches
the client, the client won't ask for anything unadvertised. If the
client asks for type but the request never reaches the server, the
server will only return the known capabilities.
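
The rule shared by these scenarios ("only request what was advertised")
can be sketched as a small standalone function. The names
`attr_advertised()` and `select_attrs()` are hypothetical, and how the
advertised attribute list is obtained from the server is left out as an
assumption of this sketch.

```c
#include <string.h>

/* Is `attr` among the attributes the server advertised? */
static int attr_advertised(const char *const *advertised, size_t n_adv,
                           const char *attr)
{
    size_t i;

    for (i = 0; i < n_adv; i++)
        if (!strcmp(advertised[i], attr))
            return 1;
    return 0;
}

/*
 * Copy into `request` every wanted attribute that the server
 * advertised, preserving the order of `wanted`. `request` must have
 * room for `n_wanted` entries. Returns the number of entries written.
 */
static size_t select_attrs(const char *const *advertised, size_t n_adv,
                           const char *const *wanted, size_t n_wanted,
                           const char **request)
{
    size_t i, n = 0;

    for (i = 0; i < n_wanted; i++)
        if (attr_advertised(advertised, n_adv, wanted[i]))
            request[n++] = wanted[i];
    return n;
}
```

A new client talking to an old server would end up requesting only
`size`, and the `%(objecttype)` placeholder would then expand to the
empty string.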

**Performance considerations**

Getting an object's type only requires looking at its header. To get
the size, `oid_object_info()` at `object-file.c` is already being
called, and it returns the object type in the same call. Sending the
type string costs, in the worst case, 6 extra bytes (for the "commit"
string).

## Timeline

I've planned this with some slack, so the actual work may take less
time than stated here.

May 1-24: Community Bonding
- Meet with my assigned mentor to get feedback on my proposal, agree on
how I will report my progress beyond the submitted code (possibly
through blog posts), and get tips on working effectively on Git.
- Confirm with mentor that the `allow_list` approach is still the best option.
- Draft commits structure.

Week 1-2: (May 26 - June 8)
- Rebase Eric Ju's  v11 on top of current `master`.
- Work on style fixes: comments, `#define` formatting, line length.
- Fix the wrong error message in the overflow check.
- Add missing check `count > MAX_ALLOWED_OBJ_LIMIT` after `split_cmdline()`.
- Invert if/else in `get_remote_info()`.
- Send first patch.

Week 3-4: (June 9 - June 22)
- Implement `allow_list` in `expand_atom()` using `is_atom()` in remote-mode.
- Initialize `data->type` to `OBJ_BAD` and add null check at `type_name()`.
- Implement empty string return for unsupported placeholders.
- Tests for supported placeholders, unsupported, mix, and the intermix
case `info` + `remote-object-info` with the same format string.
- Work with feedback from the first patch.

Week 5-6: (June 23 - July 6)
- Continue with review feedback.
- Goal 1 should be polished or close to the final form.
- Prepare the midterm report.

Midterm evaluation (July 7 - 11) as specified in the GSoC timeline docs
- Goal 1 submitted; keep working on feedback.

Week 7-8: (July 14 - July 27)
- Begin Goal 2.
- Extend server side v2 protocol to serve `%(objecttype)`, following
`%(objectsize)` structure.
- Test server side.

Week 9-10: (July 28 - August 10)
- Add `%(objecttype)` to the `allow_list` from Goal 1.
- Extend client side to ask for `%(objecttype)` from remote on `object-info`.
- Parse server answer and fill `expand_data` with the actual type.
- End to end tests and documentation.
- Default format becomes `%(objectname) %(objecttype) %(objectsize)`.
- Send patch series.

Week 11-12: (August 11 - August 24)
- Work with Goal 2 feedback from the patches.
- Polish everything, all tests pass, good test coverage, no
style/comment mistakes.
- Final documentation review.
- Prepare for final evaluation.

Final evaluation (August 18-24) as specified on GSoC timeline docs

### Additional objectives

If there is enough time, or as future work after the project, I have
some ideas on how this could evolve:

#### More placeholders support

I've checked that Eric's v11 patch only supports `%(objectsize)` on the
server side, but on the client side there are other placeholders that
can be added too. With the `allow_list` and Goal 2 in place, adding
more placeholders becomes much simpler.

- `%(objectsize:disk)`: Returns the size on disk (compressed or as
a delta) instead of the uncompressed size that `%(objectsize)`
returns. To do this, the server would need to send the actual on-disk
size.

- `%(deltabase)`: Returns the delta base object OID. Non-delta objects
return the zero OID, as they do locally.

#### Returning missing blobs from a tree, ordered by size

In a partial clone, someone might want to know which blobs are missing
inside a specific tree, and their sizes, before fetching them.
The idea is to build on top of `remote-object-info`:
Given a tree hash, return the missing blobs (inside that tree) ordered by size.

Thanks for reading my proposal and considering my application. I'm
very excited about this opportunity.
Pablo

[1]: https://lore.kernel.org/git/20250221190451.12536-1-eric.peijian@gmail.com/
"Eric Ju's v11 patch"

[2]: https://lore.kernel.org/git/xmqqo6yr3wc4.fsf@gitster.g/ "Junio
Hamano feedback"

[3]: https://lore.kernel.org/git/20250224234720.GC729825@coredump.intra.peff.net/
"Jeff King feedback"

[4]: https://lore.kernel.org/git/20250313060250.GH94015@coredump.intra.peff.net/
"options for strstr() by Jeff King"

[5]: https://lore.kernel.org/git/20250324033922.GB690093@coredump.intra.peff.net/
"Jeff King follow-up"

[6]: https://lore.kernel.org/git/20260312214154.89120-1-pabloosabaterr@gmail.com/
"data->type not being cleared bug"

[7]: https://github.com/git/git/blob/master/Documentation/gitprotocol-v2.adoc#object-info
"object-info protocol docs"


end of thread, other threads:[~2026-03-20 13:12 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-05 20:48 [GSOC Proposal] Complete and extend the remote-object-info command for git cat-file SoutrikDas
2026-03-15 10:11 ` SoutrikDas
2026-03-16 12:08 ` Christian Couder
2026-03-17 13:06   ` SoutrikDas
2026-03-16 20:46 ` Karthik Nayak
2026-03-17 15:13   ` SoutrikDas
2026-03-20 13:12 ` [GSoC Proposal v2] " SoutrikDas
  -- strict thread matches above, loose matches on Subject: below --
2026-03-13 10:17 [GSoC] Proposal: " Pablo
2026-03-14  5:58 ` Chandra Pratap
2026-03-14 18:31   ` Pablo
2026-03-15  9:20     ` Chandra Pratap
2026-03-16 11:21     ` Christian Couder
2026-03-16 21:38     ` Karthik Nayak
2026-03-18 10:45       ` Pablo
