* [GSoC v3] Project Proposal: Machine-Readable Repository Information Query Tool
@ 2025-04-07 19:18 Lucas Seiki Oshiro
2025-04-07 19:39 ` Lucas Seiki Oshiro
2025-04-08 11:37 ` Karthik Nayak
0 siblings, 2 replies; 5+ messages in thread
From: Lucas Seiki Oshiro @ 2025-04-07 19:18 UTC (permalink / raw)
To: git; +Cc: ps, karthik.188, shyamthakkar001
Hello again!
I'm sending this v3, which is basically v2 with some polish.
You can skip the v2 and review this directly. So, the changes compared
to v1 are:
- Detailing the status of my patches;
- Focusing on `rev-parse` features, listing all that Patrick suggested and
searching for a few more that may be useful;
- Given it is more focused on rev-parse, the parts about surveying users
and projects for features were reduced but not entirely removed, as I'll
still need to find __where__ and __why__ this new command will be important.
But this will not be as important as I thought it would be;
- Changed the expected output, which is now much smaller than what I
thought it would be;
- Making the name of the new command explicit;
- Making it clear that the JSON-related functionality will be placed in a
separate file;
- Detailing a little bit more some examples of functions that I expect to use
for retrieving data to populate the JSON that will be outputted;
- Breaking the schedule into 6 steps, bringing part of the development of
the JSON serializer to the beginning.
---
# Machine-Readable Repository Information Query Tool
## Contact info
- Name: Lucas Seiki Oshiro
- Timezone: GMT-3 (America/São Paulo)
- IRC: lucasoshiro
- Personal page: https://lucasoshiro.github.io/en/
- GitHub: https://github.com/lucasoshiro
- LinkedIn: https://www.linkedin.com/in/lucasseikioshiro/
## About me
My name is Lucas Oshiro. I'm a developer with a bachelor's degree in CS
from São Paulo, Brazil. I'm currently pursuing a master's degree in CS at
the University of São Paulo. My interest in Git dates back years, and I
even submitted a patch to its codebase in the past, though I couldn't
complete it due to scheduling conflicts with my capstone project.
Having experience in academia, industry and FLOSS, I highly value
code quality, code readability, well-maintained Git histories, unit tests
and documentation.
### Previous experience with Git
Before this year, I hadn't been directly involved with the Git community;
however, I kept my interest in Git alive by:
- Translating the "Git Internals" chapter of Pro Git to Brazilian
Portuguese: https://github.com/progit/progit2-pt-br/pull/81;
- Writing some blog posts about Git, for example:
- one explaining how Git can be used as a debugging tool:
https://lucasoshiro.github.io/posts-en/2023-02-13-git-debug/;
- another explaining how Git merges submodules:
https://lucasoshiro.github.io/posts-en/2022-03-12-merge-submodule/;
- Writing a compatible subset of Git in Haskell from scratch:
https://github.com/lucasoshiro/oshit;
- Helping organize a Git Introductory Workshop at my university:
https://flusp.ime.usp.br/events/git-introductory-workshop/;
- Presenting some lectures about Git at a company where I worked some
years ago, covering Git internals (objects, references, packfiles)
and Git tools for debugging and archaeology (blame, bisect,
pickaxe, ls-files, etc.).
### Previous experience with C and open-source
I also have experience with C and some C++. During my CS
course, C was one of the primary languages that I used. I also
worked with C/C++, for example, in:
- Writing an AMQP message broker from scratch:
https://github.com/lucasoshiro/amqp_broker;
- Contributing simple patches to the IIO subsystem of the Linux
kernel: https://lucasoshiro.github.io/floss-en/2020-06-06-kernel_linux/;
- Contributing to the Marlin firmware for 3D printers:
https://lucasoshiro.github.io/floss-en/2020-06-30-Marlin/;
- Writing a module for the ns-3 network simulator, dealing with both C
and C++ codebases (currently under development, I plan to write a
paper and make the code available soon).
During my CS course I was also a member of FLUSP
(https://flusp.ime.usp.br), a group at my university focused on FLOSS
contributions, and of Hardware Livre USP
(https://hardwarelivreusp.org), another group that was focused on
working with open-source hardware.
As a master's student, I'm one of the Open Science Ambassadors of my
University (https://cienciaaberta.usp.br/sobre-o-projeto/, in
Portuguese), promoting the Open Science principles, which include
open-source software, in the unit where I study.
I also contributed to some other free/open-source software, which I list
here: https://lucasoshiro.github.io/floss-en/
### Activity in the Git community in 2025
Since I decided to submit a proposal for GSoC, I have sent some patches
to the Git codebase and git.github.io:
- My microproject, replacing some `test -f` with `test_path_is_file`:
https://lore.kernel.org/git/20250208165731.78804-1-lucasseikioshiro@gmail.com/,
merged to master;
- Adding a paragraph to the merge-strategies documentation describing how
Git merges submodules (based on the blog post that I mentioned
before):
https://lore.kernel.org/git/20250227014406.20527-1-lucasseikioshiro@gmail.com/,
merged to master;
- A patchset adding a new `--subject-extra-prefix` flag for `git format-patch`,
allowing the user to quickly prepend tags like [GSoC], [Newbie] or [Outreachy]
to the beginning of the subject:
https://lore.kernel.org/git/20250303220029.10716-1-lucasseikioshiro@gmail.com/.
This patchset was rejected in favor of just using `--subject-prefix='GSoC
PATCH'` or similar;
- Given the feedback on the rejected patchset above, I opened a Pull
Request on git.github.io replacing the occurrences of `[GSoC][PATCH]`
with `[GSoC PATCH]`, merged to master;
- Adding a new userdiff driver for INI files, initially targeted at
gitconfig files:
https://lore.kernel.org/git/20250331031309.94682-1-lucasseikioshiro@gmail.com/.
It is currently still under review.
Beyond contributions, I also helped people on the mailing list who
needed assistance with Git features and documentation.
## Project Proposal
Based on the information provided in
https://git.github.io/SoC-2025-Ideas/, the goal of this project is to
create a new Git command for querying information from a repository and
returning it in a semi-structured data format, as JSON output.
A first idea for the command's name is `git metadata`.
In the scope of this project, the JSON output will only include data
that can currently be retrieved through existing Git commands. The main
idea is to centralize data from `git rev-parse`, which currently is
overloaded with features that don't fit its main purpose.
Some of the data that we expect to retrieve and centralize are:
- The hashing algorithm (i.e. `sha1` or `sha256`), which currently can
be retrieved using `git rev-parse --show-object-format`;
- The Git directory of the repository, currently retrieved by running
`git rev-parse --git-dir`;
- The common Git directory, currently retrieved by running
`git rev-parse --git-common-dir`;
- The top level directory of the repository, currently retrieved by using
`git rev-parse --show-toplevel`;
- The reference database format (i.e. currently, `files` or `reftable`),
currently retrieved by running `git rev-parse --show-ref-format`;
- The absolute path of the superproject, currently retrieved by running
`git rev-parse --show-superproject-working-tree`;
- Whether this is a bare repository, currently retrieved by running
`git rev-parse --is-bare-repository`;
- Whether this is a shallow repository, currently retrieved by running
`git rev-parse --is-shallow-repository`.
Given that the information we want to compile is currently accessible
only through different sets of flags, a user who wants to read it needs
advanced knowledge of Git. Once the repository details are consolidated
in a single command, the user will be able to quickly retrieve what they
need without navigating a complex combination of commands and flags.
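To illustrate the point above, here is a rough sketch in C of what a
consumer has to do today. `git rev-parse` accepts several of these flags
in one invocation and prints one value per line, so the caller must
remember which line belongs to which flag. The helper name `split_lines`
and the sample invocation in the comment are made up for this example;
this is not code from Git.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Git): split, in place, the
 * line-oriented output of an invocation such as
 *
 *   git rev-parse --show-object-format --git-dir --is-bare-repository
 *
 * Each '\n' in `buf` is replaced by '\0' and `lines[i]` is set to the
 * start of the i-th line. Returns the number of lines stored (at most
 * `max`). Nothing in the output says which flag produced which line.
 */
static size_t split_lines(char *buf, char **lines, size_t max)
{
	size_t n = 0;

	while (*buf && n < max) {
		char *eol = strchr(buf, '\n');

		lines[n++] = buf;
		if (!eol)
			break;
		*eol = '\0';
		buf = eol + 1;
	}
	return n;
}
```

With keyed output, the position of a value would stop mattering, and
adding or reordering fields would not break existing callers.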
A side effect is reducing the responsibilities of `rev-parse`, as
`git switch` and `git restore` did for `git checkout`.
### Use cases
Some use cases that will benefit from `git metadata` are:
- CLI tools that display formatted information about a Git repository,
for example, OneFetch (https://github.com/o2sh/onefetch);
- Text editors, IDEs and plugins that have front-ends for Git, such as
Magit (https://magit.vc) or GitLens
(https://www.gitkraken.com/gitlens);
- FLOSS repository tracking software, for example,
kworkflow (https://github.com/kworkflow),
ctracker (https://github.com/quic/contribution-tracker);
- Any other tool that integrates with Git and currently relies on
`rev-parse` to get information about the repository.
### Planned features
`git metadata` consists of one big feature: producing JSON with
repository metadata. At first, this should be populated with the data
that currently can only be retrieved through `git rev-parse`, like the
items listed before.
Other data may be added depending on the demands of the Git community
and the user base.
For now, it's not possible to decide precisely how this command will
work without further discussion. But a first draft of how it would
be invoked and the output it would produce is:
~~~
$ git metadata
{
  "object-format": "sha1",
  "git-dir": ".git",
  "common-dir": ".",
  "toplevel": "/home/user/my_repo",
  "ref-format": "files",
  "superproject-working-tree": "/home/user/my_super_repo",
  "bare-repository": true,
  "shallow-repository": false
}
~~~
This first draft will be sent to the mailing list in order to get
feedback from the Git community.
### Development plan
Since this is a new command that is not directly related to any specific
existing command, its main code will probably be placed in a new file
`builtin/metadata.c`.
Given that this project will add JSON-related functionality to Git, a
new `json.c` file in the top level directory of the codebase will be
created, and it will be available to `git metadata` and to any other
command that wants to reuse the JSON features introduced here.
The functionality of `git metadata` can be divided into two categories:
1. **Data gathering**: retrieving data from different sources, calling
existing functions and reading data structures declared in other
files;
2. **Data serialization**: formatting the gathered data as JSON.
This presents two challenges: generating the JSON itself
and designing the schema for how the desired data will be presented.
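As a rough illustration of the "generating the JSON itself" part, a
string-only serializer could look like the sketch below. Everything in
it is hypothetical: `struct json_buf`, `json_addstr`, `json_add_quoted`
and `json_add_string_member` are names invented for this example, and
the fixed-size buffer only keeps the sketch self-contained. It escapes
double quotes, backslashes and control characters, which JSON requires
inside string literals.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical fixed-size output buffer, used only by this sketch. */
struct json_buf {
	char data[1024];
	size_t len;
};

/* Append `s` verbatim; text that does not fit is silently dropped. */
static void json_addstr(struct json_buf *b, const char *s)
{
	size_t n = strlen(s);

	if (b->len + n >= sizeof(b->data))
		return;
	memcpy(b->data + b->len, s, n + 1);
	b->len += n;
}

/* Append `s` as a JSON string literal, escaping where required. */
static void json_add_quoted(struct json_buf *b, const char *s)
{
	char tmp[8];

	json_addstr(b, "\"");
	for (; *s; s++) {
		unsigned char c = (unsigned char)*s;

		if (c == '"' || c == '\\')
			snprintf(tmp, sizeof(tmp), "\\%c", c);
		else if (c < 0x20)
			snprintf(tmp, sizeof(tmp), "\\u%04x", (unsigned int)c);
		else
			snprintf(tmp, sizeof(tmp), "%c", c);
		json_addstr(b, tmp);
	}
	json_addstr(b, "\"");
}

/* Append one "key":"value" member, preceded by a comma when needed. */
static void json_add_string_member(struct json_buf *b, const char *key,
				   const char *value)
{
	if (b->len && b->data[b->len - 1] != '{')
		json_addstr(b, ",");
	json_add_quoted(b, key);
	json_addstr(b, ":");
	json_add_quoted(b, value);
}
```

A caller would append `{`, then one member per field, then `}`. Keeping
the escaping in a single function means the non-string types planned for
later only need new "append value" helpers.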
Since the exported data is already provided by other Git commands, the
data-gathering side won't be difficult to implement. The main task there
will be to inspect the existing codebase and find the functions and data
structures that will feed our output. For the data already mentioned,
for example:
- Hashing algorithm: `git rev-parse` reads from the field
`the_repository->hash_algo->name` (aka `the_hash_algo->name`);
- The Git directory, the common Git directory and top level directory:
`git rev-parse` formats those paths using `print_path`, based on the
`prefix` parameter and on fields of `the_repository` such as
`commondir`;
- Reference database format: `git rev-parse` retrieves it from
`ref_storage_format_to_name`;
- Absolute path of the superproject: `git rev-parse` retrieves it from
the function `get_superproject_working_tree`;
- Whether the repository is bare: `git rev-parse` retrieves it from the
function `is_bare_repository`;
- Whether the repository is shallow: `git rev-parse` retrieves it from
the function `is_repository_shallow`.
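One possible shape for the structure that holds the gathered data is
sketched below. It is only an assumption about how this could be laid
out: `struct repo_metadata`, `string_fields` and
`repo_metadata_get_string` are hypothetical names, the NULL conventions
are invented for the example, and the sketch is standalone C that does
not call into Git. The comments name the sources from the list above.
The key-to-field table is one way to later support flags that select
individual fields.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: one field per key of the draft output. */
struct repo_metadata {
	const char *object_format; /* the_hash_algo->name */
	const char *git_dir;
	const char *common_dir;
	const char *toplevel;      /* assumed NULL without a working tree */
	const char *ref_format;    /* ref_storage_format_to_name() */
	const char *superproject_working_tree; /* assumed NULL if none */
	bool is_bare;              /* is_bare_repository() */
	bool is_shallow;           /* is_repository_shallow() */
};

/* Maps an output key to the string field that stores its value. */
struct string_field {
	const char *key;
	size_t offset;
};

static const struct string_field string_fields[] = {
	{ "object-format", offsetof(struct repo_metadata, object_format) },
	{ "git-dir", offsetof(struct repo_metadata, git_dir) },
	{ "common-dir", offsetof(struct repo_metadata, common_dir) },
	{ "toplevel", offsetof(struct repo_metadata, toplevel) },
	{ "ref-format", offsetof(struct repo_metadata, ref_format) },
	{ "superproject-working-tree",
	  offsetof(struct repo_metadata, superproject_working_tree) },
};

/* Returns the value stored for `key`, or NULL if unset or unknown. */
static const char *repo_metadata_get_string(const struct repo_metadata *m,
					    const char *key)
{
	size_t i;

	for (i = 0; i < sizeof(string_fields) / sizeof(string_fields[0]); i++)
		if (!strcmp(string_fields[i].key, key))
			return *(const char *const *)
				((const char *)m + string_fields[i].offset);
	return NULL;
}
```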
Designing the schema, however, requires special planning, as the
flexibility of semi-structured data like JSON may lead to early
bad decisions. A solution may emerge by analysing other software that
exports JSON as metadata.
### Schedule
1. **Now -- May 5th**: Requirements gathering
- Present the proposal to the Git community, asking which features
it needs;
- Inspect codebases that use Git (especially `rev-parse`) as a data
source;
- Study the codebase more deeply, especially the `rev-parse` source
code;
2. **May 6th -- June 1st**: Community bonding
- Getting in touch with the mentors;
- Present to the community a first proposal of the JSON schema;
- Receive feedback from the community about the schema;
- Present a first proposal on the command line interface;
- Receive feedback from the community about the command line
interface;
3. **June 2nd -- June 30th**: First coding round
- Decide how the CLI should behave, that is, what should be exported
by default, what should be output only when requested by flags and
what should be suppressed by disabling flags;
- Write a minimal JSON serializer, focusing on exporting only the
top-level object with only string values;
- Define a simple data structure that holds only one field of the
data that we want to export;
- Introduce a first version of the command, outputting the first data
structure using the first JSON serializer;
- Write test cases in different scenarios comparing the output of the
new command with the outputs of the existing commands;
- Write a minimal documentation file for the new command, making it
explicit that the command is experimental and still under
development;
4. **July 1st -- July 14th**: Second coding round
- Fill the data structure with all other string fields that should be
exported;
- Add flags for filtering the data output;
- Write tests for each one of the new fields and flags;
- Improve the documentation file, explaining what the command does,
its output format and what the flags that were implemented so far
do;
5. **July 15th -- August 15th**: Third coding round
- Improve the JSON serializer, allowing other data types: array,
number, boolean, null and nested objects;
- Fill the data structure with the remaining non-string fields;
- Write flags, tests and documentation for the types that were
implemented in this round;
6. **August 16th -- August 25th**: Fourth coding round
- Polish the documentation, providing examples on how to use it and
notes on why this command was created;
- Finish any remaining work from the previous rounds that still needs
to be concluded.
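For the third coding round, the non-string scalar types could be
represented with a small tagged union, as in the sketch below. This is
only one possible design: `enum json_kind`, `struct json_value` and
`json_format_scalar` are hypothetical names, arrays and nested objects
are left out, and string values are assumed to need no escaping here.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical tag for the scalar JSON types. */
enum json_kind {
	JSON_NULL,
	JSON_BOOL,
	JSON_NUMBER,
	JSON_STRING
};

struct json_value {
	enum json_kind kind;
	union {
		bool boolean;
		long number;
		const char *string; /* assumed to need no escaping */
	} u;
};

/*
 * Format one scalar into `out`. Returns what snprintf() returns, that
 * is, the length the text needs, or -1 for an unknown kind.
 */
static int json_format_scalar(char *out, size_t size,
			      const struct json_value *v)
{
	switch (v->kind) {
	case JSON_NULL:
		return snprintf(out, size, "null");
	case JSON_BOOL:
		return snprintf(out, size, "%s",
				v->u.boolean ? "true" : "false");
	case JSON_NUMBER:
		return snprintf(out, size, "%ld", v->u.number);
	case JSON_STRING:
		return snprintf(out, size, "\"%s\"", v->u.string);
	}
	return -1;
}
```

With a type like this, fields such as `bare-repository` and
`shallow-repository` can be emitted as JSON booleans rather than as the
strings `"true"` and `"false"`.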
### Availability
2025 is the last year of my master's degree. Currently, I'm not attending
any classes and I am focused on developing the software for my
research, performing experiments and writing scientific articles and my
thesis. Since my advisor is aware that I'm proposing a GSoC project, it
will be possible to work on Git while working on my master's tasks.
* Re: [GSoC v3] Project Proposal: Machine-Readable Repository Information Query Tool
2025-04-07 19:18 [GSoC v3] Project Proposal: Machine-Readable Repository Information Query Tool Lucas Seiki Oshiro
@ 2025-04-07 19:39 ` Lucas Seiki Oshiro
2025-04-08 2:38 ` Lucas Seiki Oshiro
2025-04-08 11:37 ` Karthik Nayak
1 sibling, 1 reply; 5+ messages in thread
From: Lucas Seiki Oshiro @ 2025-04-07 19:39 UTC (permalink / raw)
To: git; +Cc: ps, karthik.188, shyamthakkar001
PS:
Just a quick update: the userdiff driver is now merged to
next :-)
* Re: [GSoC v3] Project Proposal: Machine-Readable Repository Information Query Tool
2025-04-07 19:39 ` Lucas Seiki Oshiro
@ 2025-04-08 2:38 ` Lucas Seiki Oshiro
0 siblings, 0 replies; 5+ messages in thread
From: Lucas Seiki Oshiro @ 2025-04-08 2:38 UTC (permalink / raw)
To: git; +Cc: ps, karthik.188, shyamthakkar001
PS2:
Given that I didn't receive more feedback after v2, I just submitted my
proposal!
The proposal is still open for changes, so I'll be keeping an eye on
this thread for any late changes.
Thank you!
* Re: [GSoC v3] Project Proposal: Machine-Readable Repository Information Query Tool
2025-04-07 19:18 [GSoC v3] Project Proposal: Machine-Readable Repository Information Query Tool Lucas Seiki Oshiro
2025-04-07 19:39 ` Lucas Seiki Oshiro
@ 2025-04-08 11:37 ` Karthik Nayak
2025-04-08 15:27 ` Lucas Seiki Oshiro
1 sibling, 1 reply; 5+ messages in thread
From: Karthik Nayak @ 2025-04-08 11:37 UTC (permalink / raw)
To: Lucas Seiki Oshiro, git; +Cc: ps, shyamthakkar001
Lucas Seiki Oshiro <lucasseikioshiro@gmail.com> writes:
> Hello again!
>
> I'm sending this v3 which is basically the v2 with some polishments.
> You can skip the v2 and review this directly. So, the changes compared
> to v1 are:
Thanks Lucas, for your proposal. I would recommend in-lining new
versions in the same thread, it makes it easier to review and also to
note what comments were left in previous versions. That said, I just
went through v1 and skipped v2 as you suggested. Reading on.
>
> - Detailing the status of my patches;
> - Focusing in `rev-parse` features, listing all that Patrick suggested and
> searching for a few more that may be useful;
> - Given it is more focused on rev-parse, the parts about surveying users
> and projects for features was reduced but not entirely suppressed, as I'll
> need to find __where__ and __why__ this new command will be important. But
> now this will not be as important I as thought it would be;
> - Changed the expected output, which becomes very smaller compared to what I
> thought it would be;
> - Making the name of the new command explicit;
> - Making it clear that the JSON-related functionality will be placed in a
> separated file;
> - Detailing a little bit more some examples of functions that I expect to use
> for retrieving data to populate the JSON that will be outputted;
> - Breaking the schedule into 6 steps, bringing part the development of the
> JSON serializer to the beginning.
>
> ---
>
> # Machine-Readable Repository Information Query Tool
>
> ## Contact info
>
> - Name: Lucas Seiki Oshiro
> - Timezone: GMT-3 (America/São Paulo)
> - IRC: lucasoshiro
> - Personal page: https://lucasoshiro.github.io/en/
> - GitHub: https://github.com/lucasoshiro
> - LinkedIn: https://www.linkedin.com/in/lucasseikioshiro/
>
> ## About me
>
> My name is Lucas Oshiro, I'm a developer and CS bachelor from São Paulo,
> Brazil. Currently I'm pursuing a master degree in CS at University of São
> Paulo. My interest in Git dates from years ago and I even submitted a
> patch to its codebase in the past, though I couldn't complete it due to
> scheduling conflicts with my capstone project.
>
> Having experience in the academia, industry and FLOSS, I highly value
> code quality, code legibility, well-maintained Git histories, unit tests
> and documentation.
>
> ### Previous experience with Git
>
> Before this year, I haven't been involved directly with Git community,
> however, I kept my interest in Git alive by:
>
> - Translating the "Git Internals" chapter of Pro Git to Brazilian
> Portuguese: https://github.com/progit/progit2-pt-br/pull/81;
>
> - Writing some blog posts about Git, for example:
> - one explaining how Git can be used as a debugging tool:
> https://lucasoshiro.github.io/posts-en/2023-02-13-git-debug/;
>
This was a good read! Really nice!
> - other explaining how Git merge submodules:
> https://lucasoshiro.github.io/posts-en/2022-03-12-merge-submodule/;
>
> - Writing a compatible subset of Git in Haskell from scratch:
> https://github.com/lucasoshiro/oshit;
>
> - Helping organizing a Git Introductory Workshop at my University:
> https://flusp.ime.usp.br/events/git-introductory-workshop/;
>
> - Presenting some lectures about Git in a company that I worked some
> years ago, covering the Git internals (objects, references, packfile)
> and debugging and archaeology related Git tools (blame, bisect,
> pickaxe, ls-files, etc).
>
> ### Previous experience with C and open-source
>
> I also have experience with C and some C++. During my CS
> course, C was one of the primary languages that I used. I also
> worked with C/C++, for example, in:
>
> - Writing an AMQP message broker from scratch:
> https://github.com/lucasoshiro/amqp_broker;
>
> - Contributing with simple patches to the IIO subsystem of the Linux
> kernel: https://lucasoshiro.github.io/floss-en/2020-06-06-kernel_linux/;
>
> - Contributing to the Marlin firmware for 3D printers:
> https://lucasoshiro.github.io/floss-en/2020-06-30-Marlin/;
>
> - Writing a module for the ns-3 network simulator, dealing with both C
> and C++ codebases (currently under development, I plan to write a
> paper and make the code available soon);
>
> During my CS course I also was member of FLUSP
> (https://flusp.ime.usp.br), a group in my university focused on FLOSS
> contributions and from Hardware Livre USP
> (https://hardwarelivreusp.org), another group that was focused on
> working with open-source hardware.
>
> As a master's student, I'm one of the Open Science Ambassadors of my
> University (https://cienciaaberta.usp.br/sobre-o-projeto/, in
> Portuguese), promoting the Open Science principles, which include
> open-source software, in the unit where I study.
>
> I also contributed to some other free/open-source software, which I list
> here: https://lucasoshiro.github.io/floss-en/
>
> ### Activity in the Git community in 2025
>
> Since when I decided to submit a proposal for GSoC, I sent some patches
> to the Git codebase and git.github.io:
>
> - My microproject, replacing some `test -f` by `test_path_is_file`:
> https://lore.kernel.org/git/20250208165731.78804-1-lucasseikioshiro@gmail.com/,
> merged to master;
>
> - Adding a paragraph to the merge-strategies documentation describing how
> Git merges submodules (based on the blog post that I mentioned
> before):
> https://lore.kernel.org/git/20250227014406.20527-1-lucasseikioshiro@gmail.com/,
> merge to master;
>
> - A patchset adding a new `--subject-extra-prefix` flag for `git format-patch`,
> allowing the user to quickly prepend tags like [GSoC], [Newbie] or [Outreachy]
> to the beginning of the subject:
> https://lore.kernel.org/git/20250303220029.10716-1-lucasseikioshiro@gmail.com/.
> This patchset was rejected in favor of just using `--subject-prefix='GSoC
> PATCH'` or similar;
>
> - Given the feedback on the previous rejected patchset, I opened a Pull
> Request on git.github.io replacing the occurrences of `[GSoC][PATCH]`
> by `[GSoC PATCH]`, merged to master;
>
> - Adding a new userdiff driver for INI files, initially target for
> gitconfig files:
> https://lore.kernel.org/git/20250331031309.94682-1-lucasseikioshiro@gmail.com/.
> Currently it is still under revision.
>
> Beyond contributions, I also helped people on the mailing list that
> needed assistance on Git features and documentation.
>
> ## Project Proposal
>
> Based on the information provided in
> https://git.github.io/SoC-2025-Ideas/, the goal of this project is to
> create a new Git command for querying information from a repository and
> returning it as a semi-structured data format as a JSON output.
>
> A first idea on how this command would be named is `git metadata`.
One thing to keep in mind is we already have:
- git status
- git describe
Both of them are used to provide a summary of the repository in some
sense. How do we differentiate between these two and the new command?
Rhetorical:
- Does 'git metadata' differentiate itself enough to imply what it does?
- Does it convey that it should be used to retrieve repository
information?
- Should we consider 'git repo-info', 'git info', 'git context', 'git
repo-query'?
It would be nice to add some thought here, perhaps justifying why this
name was chosen.
>
> In the scope of this project, the JSON output will only include data
> that can currently be retrieved through existing Git commands. The main
> idea is to centralize data from `git rev-parse`, which currently is
> overloaded with features that doesn't fit its main purpose.
>
We know that 'git rev-parse' outputs in a human-readable format; would
that be the default here too?
> Some of the data that we expect to retrieve and centralize are:
>
There are a lot more options under the 'Options for Files' section of
the 'git rev-parse' manpage; it would be nice to highlight that this is
mostly what we're looking at.
> - The hashing algorithm (i.e. `sha1` or `sha256`), which currently can
> be retrieved using `git rev-parse --show-object-format`;
>
> - The Git directory of the repository, currently retrieved by running
> `git rev-parse --git-dir`;
>
> - The common Git directory, currently retrieved by running
> `--git-common-dir`;
>
> - The top level directory of the repository, currently retrieved by using
> `git rev-parse --show-toplevel`;
>
> - The reference database format (i.e. currently, `files` or `reftable`),
> currently retrieved by running `git rev-parse --show-ref-format`;
>
> - The absolute path of the superproject, currently retrieved by running
> `git rev-parse --show-superproject-working-tree`;
>
> - Whether this is a bare repository, currently retrieved by running
> `git --is-bare-repository`;
>
> - Whether this is a shallow repository, currently retrieved by running
> `git --is-shallow-repository`.
>
> Given that the information that we want to compile are currently
> accessible with different sets of flags, the user that wants to read
> them needs to have an advanced knowledge on Git. Once having the
> repository details consolidated in a single command, the user will be
> able to quickly retrieve what it desires without navigating a complex
> combination of commands and flags.
>
> A side effect is decreasing the reponsibility of `rev-parse`, like
> `git switch` and `git restore` did for `git checkout`.
>
> ### Use cases
>
> Some use cases that will be benefited of `git metadata` will be:
>
> - CLI tools that display formatted information about a Git repository,
> for example, OneFetch (https://github.com/o2sh/onefetch);
>
> - Text editors, IDEs and plugins that have front-ends for Git, such as
> Magit (https://magit.vc) or GitLens
> (https://www.gitkraken.com/gitlens);
>
> - FLOSS repository tracking software, for example,
> kworkflow (https://github.com/kworkflow),
> ctracker (https://github.com/quic/contribution-tracker);
>
> - Any other tool that integrates with Git and currently relies on
> `rev-parse` to get information about the repository;
>
> ### Planned features
>
> `git metadata` consists of one big feature: produce the JSON with
> repository metadata. At first, this should be populated with the data
> that currently can only be retrieved through `git rev-parse`, like the
> ones listed before.
>
> Other data may be added depending on the demands of the Git community
> and the user base.
>
> By now, it's not possible to decide precisely how this command would
> work without it being more discussed. But a first draft on how it would
> be invoked and the output that it will produce is:
>
> ~~~
> $ git metadata
>
> {
> "object-format": "sha1",
> "git-dir": ".git",
> "common-dir": ".",
> "toplevel": "/home/user/my_repo",
> "ref-format": "files",
> "superproject-working-tree": "/home/user/my_super_repo",
> "bare-repository": true,
> "shallow-repository": false
> }
It would be nice to add subsections maybe:
'{"refs": {...}, ..., "objects": {...}}'
> ~~~
>
> This first draft will be sent to the mailing list in order to get
> feedback from the Git community.
>
> ### Development plan
>
> Since this is a new command that is not directly related to any specific
> existent command, its main code will probably be placed in a new file
> `builtin/metadata.c`.
>
> Given that this project will give JSON-related functionality to Git, a
> new `json.c` file in the top level directory of the codebase will be
> created and it will be available for being used by `git metadata` and
> any other command that want to reuse the JSON features introduced here.
>
> The functionality `git metadata` can be divided into two categories:
>
> 1. **Data gathering**: retrieving data from different sources, calling
> existent functions and reading data structures declared in other
> files;
>
> 2. **Data serialization**: formatting the gathered data in a JSON
> format. This represents two challenges: generating the JSON itself
> and designing the schema for how the desired data will be presented.
>
> Since the exported data is already provided by other Git commands, it
> won't be difficult to implement this side of the functionality. The main
> task for gathering data will be inspect the existing codebase and find
> the functions and data structures that will feed our output. For
> example, the already mentioned data:
>
> - Hashing algorithm: `git rev-parse` reads from the field
> `the_repository->hash_algo->name` (aka `the_hash_algo->name`);
>
> - The Git directory, the common Git directory and top level directory:
> `git rev-parse` formats those paths using `print_path`, based on the
> `prefix` parameter and the fields `commondir` and from
> `the_repository`;
>
> - Reference database format: `git rev-parse` retrieves it from
> `ref_storage_format_to_name`;
>
> - Absolute path of the superproject: `git rev-parse` retrieves it from
> the function `get_superproject_working_tree`;
>
> - Whether the repository is bare: `git rev-parse` retrieves it from the
> function `is_bare_repository`;
> - Whether the repository is shallow: `git rev-parse` retrieves it from
> the function `is_repository_shallow`.
>
> Designing the schema, however, requires special planning, as the
> flexibility of semi-structured data like JSON may lead to early
> bad decisions. A solution may emerge by analysing other software that
> export JSON as metadata.
>
> ### Schedule
>
> 1. **Now -- May 5th**: Requirements gathering
> - Present to the Git community the proposal, asking what are the
> features that it demands;
Nice, an RFC before adding the command would go a long way!
> - Inspect codebases that uses Git (specially `rev-parse`) as data
> source;
> - Studying more deeply the codebase, specially the `rev-parse` source
> code;
>
> 2. **May 6th -- June 1st**: Community bonding
> - Getting in touch with the mentors;
> - Present to the community a first proposal of the JSON schema;
> - Receive feedback from the community about the schema;
> - Present a first proposal on the command line interface;
> - Receive feedback from the community about the command line
> interface;
>
> 3. **June 2nd -- June 31th**: First coding round
> - Decide how the CLI should behave, that is, what should be exported
> by default, what should be outputted only by using flags and what
> should not be outputted by using disabling flags;
> - Write a minimal JSON serializer, focusing on export only the
> top-level object with only string values;
> - Define a simple data structure that holds only one field of the
> data that we want to export;
> - Introduce a first version of the command, outputting the first data
> structure using the first JSON serializer;
> - Write test cases in different scenarios comparing the output of the
> new command with the outputs of the existing commands;
> - Write a minimal documentation file for the new command, making it
> explicitly that the command is experimental and it's still under
> development;
>
> 4. **July 1st -- July 14th**: Second coding round
> - Fill the data structure with all other string fields that should be
> exported;
> - Add flags for filtering the data output;
> - Write tests for each one of the new fields and flags;
> - Improve the documentation file, explaining what the command does,
> its output format and what the flags that were implemented so far
> do;
>
> 5. **July 15th -- August 15th**: Third coding round
> - Improve the JSON serializer, allowing other data types: array,
> number, boolean, null and nested objects;
> - Fill the data structure with the remaining non-string fields;
> - Write flags, tests and documentation for the types that were
> implemented in this round;
>
> 6. **August 16th -- August 25th**: Fourth coding round
> - Polish the documentation, providing examples on how to use it and
> notes on why this command was created;
> - Finish remaining work from the previous rounds that still needs to
> be concluded;
>
> ### Availability
>
> 2025 is my last year in my master's degree. Currently, I'm not attending
> any classes and I am more focused on developing the software of my
> research, performing experiments and writing scientific articles and my
> thesis. Since my advisor is aware that I'm proposing a GSoC project, it
> will be possible to work on Git while working on my master's tasks.
>
One thing I found missing is the project size and what you think about
it.
Thanks
- Karthik
* Re: [GSoC v3] Project Proposal: Machine-Readable Repository Information Query Tool
2025-04-08 11:37 ` Karthik Nayak
@ 2025-04-08 15:27 ` Lucas Seiki Oshiro
0 siblings, 0 replies; 5+ messages in thread
From: Lucas Seiki Oshiro @ 2025-04-08 15:27 UTC (permalink / raw)
To: Karthik Nayak; +Cc: git, ps, shyamthakkar001
> Thanks Lucas, for your proposal. I would recommend in-lining new
> versions in the same thread, it makes it easier to review and also to
> note what comments were left in previous versions.
Oops, sorry :-(
> This was a good read! Really nice!
Thanks!
> One thing to keep in mind is we already have:
> - git status
> - git describe
> Both of them are used to provide summary about the repository in some
> sense. How do we differentiate between these two and the new command?
> Rhetorical:
> - Does 'git metadata' differentiate itself enough to imply what it does?
For now, the main idea is to bring some functionality from rev-parse,
although I think that there's room for placing other information,
like remotes, submodules, packfiles and so on.
I don't think that it is related to status or describe, as status is
more related to the working tree and index status and describe is more
related to the history itself. By "metadata" I mean the data about
the repository itself.
> - Does it convey that it should be used to retrieve repository
> information?
> - Should we consider 'git repo-info', 'git info', 'git context', 'git
> repo-query'?
Perhaps `git repo-info` would be a better name, as "metadata" could
be too generic: it is in some ways a synonym of the names of the
other commands.
> We know that 'git rev-parse' outputs in a human-readable format; would
> that be the default here too?
The original idea (from https://git.github.io/SoC-2025-Ideas/) was
to format it as a JSON output. My only concern here is that if
I used the JSON format in some situations and a plain string format
in other situations it would lead to a lack of consistency on what
it is expected to output.
I'd rather keep in mind the Unix philosophy of "do one thing and
do it well", and avoid doing too much and becoming another rev-parse.
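A rough sketch of what a consistently JSON-only output could look like,
prototyped in Python on top of existing `git rev-parse` flags. The key
names (`git_dir`, `is_bare`, and so on) and the helper functions are
hypothetical, since the schema is still undecided; only the rev-parse
flags themselves are existing ones. All values are kept as strings, in
line with the string-only scope of the first coding round:

```python
import json
import subprocess

# Hypothetical mapping from JSON key to an existing `git rev-parse` flag.
# The key names are illustrative only; the real schema is still undecided.
FIELDS = {
    "git_dir": "--git-dir",
    "is_bare": "--is-bare-repository",
    "is_shallow": "--is-shallow-repository",
    "object_format": "--show-object-format",
}


def run_rev_parse(flag, cwd="."):
    """Run `git rev-parse <flag>` and return its output without the newline."""
    result = subprocess.run(
        ["git", "rev-parse", flag],
        cwd=cwd, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def repo_info(query=run_rev_parse):
    """Collect every field as a string and serialize one flat JSON object."""
    data = {key: query(flag) for key, flag in FIELDS.items()}
    return json.dumps(data, sort_keys=True)
```

Calling `print(repo_info())` inside a repository would print a single
flat object. The `query` parameter is only there so the collection step
can be replaced by a stub when comparing against expected values.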
> There are a lot more options under the 'Options for Files' section of
> the 'git rev-parse' manpage, it would be nice to highlight that this is
> mostly what we're looking at.
Nice idea! I didn't extensively list what should be displayed in this
JSON because I think there's still room for it to be decided, but having
a reference on what the potential features are makes it a lot clearer.
I'll include that.
> It would be nice to add subsections maybe:
> '{"refs": {...}, ..., "objects": {...}}'
At first, my idea was to bring more information to this JSON, e.g. the
hashes of branches, tags and remote branches, the submodule remote and
so on. But after Patrick's review it's more focused on the features of
rev-parse that are not exactly related to "rev parsing" but somehow are
placed there.
> Thanks
> - Karthik
Thanks for your review, Karthik! There are a few hours left, and I'll try to
do my best considering what you told me here. Very good insights, I
must say.