git.vger.kernel.org archive mirror
* [GSoC] Project Proposal: Machine-Readable Repository Information Query Tool
@ 2025-04-02 18:22 Lucas Seiki Oshiro
  2025-04-03 10:14 ` Patrick Steinhardt
  0 siblings, 1 reply; 5+ messages in thread
From: Lucas Seiki Oshiro @ 2025-04-02 18:22 UTC (permalink / raw)
  To: git; +Cc: ps, karthik.188, shyamthakkar001

Hi!

As you may have noticed from my interactions here, I'm going to send a
proposal for GSoC 2025!

I'm interested in the project idea currently entitled "Project Proposal:
Machine-Readable Repository Information Query Tool". My main motivation
for choosing this idea is that I think it will be useful for
infrastructure teams and FLOSS researchers.

I'm sending here the first version of my proposal. I'll be grateful if
you send me feedback on it! In this proposal I'm presenting myself again,
the possible use cases of this feature, a first idea of how it would
work and an activity schedule.

Thanks!

---


# Machine-Readable Repository Information Query Tool

## Contact info

- Name: Lucas Seiki Oshiro
- Timezone: GMT-3
- IRC:
- GitHub: https://github.com/lucasoshiro
- LinkedIn: https://www.linkedin.com/in/lucasseikioshiro/

## About me

My name is Lucas Oshiro. I'm a developer with a bachelor's degree in CS,
from São Paulo, Brazil. Currently I'm pursuing a master's degree in CS at
the University of São Paulo. My interest in Git dates from years ago and
I even submitted a patch to its codebase in the past, though I couldn't
complete it due to scheduling conflicts with my capstone project.

Having experience in academia, industry and FLOSS, I highly value
code quality, code legibility, well-maintained Git histories, unit tests
and documentation.

### Previous experience with Git

Before this year, I hadn't been directly involved with the Git
community. However, I kept my interest in Git alive by:

- Translating the "Git Internals" chapter of Pro Git to Brazilian
  Portuguese: https://github.com/progit/progit2-pt-br/pull/81;

- Writing some blog posts about Git, for example:
  - one explaining how Git can be used as a debugging tool:
    https://lucasoshiro.github.io/posts-en/2023-02-13-git-debug/;

  - another explaining how Git merges submodules:
    https://lucasoshiro.github.io/posts-en/2022-03-12-merge-submodule/;

- Writing a compatible subset of Git in Haskell from scratch:
  https://github.com/lucasoshiro/oshit;

- Helping organize a Git Introductory Workshop at my University:
  https://flusp.ime.usp.br/events/git-introductory-workshop/;

- Presenting some lectures about Git at a company where I worked some
  years ago, covering the Git internals (objects, references, packfile)
  and debugging and archaeology related Git tools (blame, bisect,
  pickaxe, ls-files, etc).

### Previous experience with C and open-source

I also have experience with C and some C++. During my CS
course, C was one of the primary languages that I used. I also
worked with C/C++, for example, in:

- Writing an AMQP message broker from scratch: 
  https://github.com/lucasoshiro/amqp_broker;

- Contributing with simple patches to the IIO subsystem of the Linux
  kernel: https://lucasoshiro.github.io/floss-en/2020-06-06-kernel_linux/;

- Contributing to the Marlin firmware for 3D printers:
  https://lucasoshiro.github.io/floss-en/2020-06-30-Marlin/;

- Writing a module for the ns-3 network simulator, dealing with both C
  and C++ codebases (currently under development, I plan to write a
  paper and make the code available soon);

During my CS course I was also a member of FLUSP
(https://flusp.ime.usp.br), a group at my university focused on FLOSS
contributions, and of Hardware Livre USP
(https://hardwarelivreusp.org), another group that was focused on
working with open-source hardware.

As a master's student, I'm one of the Open Science Ambassadors of my
University (https://cienciaaberta.usp.br/sobre-o-projeto/, in
Portuguese), promoting the Open Science principles, which include
open-source software, in the unit where I study.

I also contributed to some other free/open-source software, which I list
here: https://lucasoshiro.github.io/floss-en/

### Activity in the Git community in 2025

Since I decided to submit a proposal for GSoC, I have sent some patches
to the Git codebase and git.github.io:

- My microproject, replacing some `test -f` with `test_path_is_file`:
  https://lore.kernel.org/git/20250208165731.78804-1-lucasseikioshiro@gmail.com/;

- Adding a paragraph to the merge-strategies documentation describing how
  Git merges submodules (based on the blog post that I mentioned
  before):
  https://lore.kernel.org/git/20250227014406.20527-1-lucasseikioshiro@gmail.com/;
  
- A patchset adding a new `--subject-extra-prefix` flag for `git
  format-patch`, allowing the user to quickly prepend tags like [GSoC],
  [Newbie] or [Outreachy] to the beginning of the subject. This patchset
  was rejected in favor of just using `--subject-prefix='GSoC PATCH'` or
  similar. It can be seen here:
  https://lore.kernel.org/git/20250303220029.10716-1-lucasseikioshiro@gmail.com/;

- Given the feedback on the previously rejected patchset, I opened a Pull
  Request on git.github.io replacing the occurrences of `[GSoC][PATCH]`
  with `[GSoC PATCH]`;
  
- Adding a new userdiff driver for INI files, initially targeted at
  gitconfig files. It is currently still under review:
  https://lore.kernel.org/git/20250331031309.94682-1-lucasseikioshiro@gmail.com/.

Beyond contributions, I also helped people on the mailing list who
needed assistance with Git documentation.

## Project Proposal

Based on the information provided in
https://git.github.io/SoC-2025-Ideas/, the goal of this project is to
create a new Git command for querying information from a repository and
returning it in a semi-structured data format, namely JSON.

In the scope of this project, the JSON output will only include data
that can currently be retrieved through existing Git commands, for
example:

- `git branch`: information about branches, such as the commit that each
  branch currently references and their upstreams;

- `git tag`: information about the tags, such as the author or commit
  date and the messages they hold (in the case of annotated tags);

- `git remote`: the URL of each remote;

- `git log`: statistics about the commit history, such as the
  distribution of commits over time and by author, and the distribution
  of lines changed by each author;

- `git submodule`: information about the submodules, mainly the commits
  that they are referencing and their remote URLs;

- `git rev-parse`: the current branch name, the current commit, the path
  of the repository top level directory, whether the repository is a
  bare repository and whether the repository is under bisection.

Given that the information that we want to compile is currently
accessible only through different commands with different sets of flags,
users who want to read it need advanced knowledge of Git. Once the
repository details are consolidated in a single command, users will be
able to quickly retrieve what they need without navigating a complex
combination of commands and flags.
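
To illustrate the point, here is a minimal Python sketch of what a script
has to do today to collect just a few of these details. It is only an
illustration, not part of the proposed implementation: the helper names
(`run_git`, `gather_repo_info`) and the dictionary keys are hypothetical,
while the Git subcommands and flags are existing ones.

```python
import subprocess


def run_git(args, cwd="."):
    """Run a Git command and return its stdout without the trailing newline."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, capture_output=True, text=True, check=True
    )
    return result.stdout.rstrip("\n")


def gather_repo_info(run=run_git):
    """Collect a few repository details, one Git invocation per detail.

    `run` is injectable so the function can be exercised without a repository.
    """
    branches = []
    # %(refname:short), %(objectname) and %(upstream:short) are for-each-ref
    # atoms; %09 is a tab used here as the field separator.
    fmt = "%(refname:short)%09%(objectname)%09%(upstream:short)"
    for line in run(["for-each-ref", "--format=" + fmt, "refs/heads"]).splitlines():
        name, commit_id, upstream = line.split("\t")
        branches.append(
            {"name": name, "commit_id": commit_id, "upstream": upstream or None}
        )

    # One extra call per remote to learn its URL.
    remotes = [
        {"name": name, "url": run(["remote", "get-url", name])}
        for name in run(["remote"]).splitlines()
    ]

    return {
        "current_branch": run(["rev-parse", "--abbrev-ref", "HEAD"]),
        "is_bare": run(["rev-parse", "--is-bare-repository"]) == "true",
        "branches": branches,
        "remotes": remotes,
    }
```

Inside a repository, `json.dumps(gather_repo_info(), indent=2)` would
print the result as JSON. Even this small subset needs three different
subcommands, each with its own flags and output format.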

### Use cases

Some use cases that would benefit from this feature are:

- CLI tools that display formatted information about a Git repository,
  for example, OneFetch (https://github.com/o2sh/onefetch);

- Text editors, IDEs and plugins that have front-ends for Git, such as
  Magit (https://magit.vc) or GitLens (https://www.gitkraken.com/gitlens);

- FLOSS repository tracking software, for example,
  kworkflow (https://github.com/kworkflow),
  ctracker (https://github.com/quic/contribution-tracker);

- Academic researchers on FLOSS projects that need statistics on the
  repositories that they are querying;

- Continuous integration workflows that perform checks on the
  repository before allowing a branch to be merged into another or
  before a deploy;

- Code quality tools that will be able to inspect the health of the
  commit history.

### Planned features

Since the features haven't been defined yet, they will need to be
planned after surveying people and projects that will potentially use
this command:

- Searching on code hosting tools (e.g. GitHub, GitLab) for open-source
  software that retrieves data from Git, to see what it does with it;
  
- Contacting people in academia who use Git repositories as data
  sources for their research, to find out what valuable information
  this command can provide them;
  
- Contacting people from the industry, especially in infrastructure teams,
  to understand the challenges they face when retrieving data from Git.
  
Given that I have worked in an infrastructure team and that I have
colleagues and professors at the university that currently research
FLOSS software and communities, I have contacts that can provide input
on what should be considered when developing this new command.

For now, it's not possible to decide exactly how this command would work,
but a first draft is this (supposing that `metadata` is the name of the
command and `--submodule` is a flag that enables the submodule metadata):

~~~
$ git metadata --submodule

{
  "symbolic_refs": {
    "HEAD": "main"
  },
  "branches": [
    {
      "name": "main",
      "commit_id": "ac72c22f3c8a9280c81171ccc6cedff3171344cf",
      "remote": "origin/main"
    },
    {
      "name": "feature",
      "commit_id": "1e373e02767337bd6b996da6598eed822a805878",
      "remote": "fork/feature"
    }
  ],
  "tags": [
    {
      "name": "v1.0",
      "message": "First version",
      "author_timestamp": "1743554265",
      "committer_timestamp": "1743554265"
    }
  ],
  "remotes": [
    {
      "name": "origin",
      "url": "https://example.com/foo"
    },
    {
      "name": "fork",
      "url": "user@example.com/foo"
    }
  ],
  "submodules": [
    {
      "path": "my_dir/my_submodule_dir",
      "url": "https://example.com/bar",
      "commit_id": "94436069f106c0014897b1c93e8fc3e49c8fc156"
    }
  ]
}
~~~
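
As an illustration of how a tool could consume this draft output, the
Python sketch below parses a trimmed copy of the example and resolves the
URL of the remote that each branch tracks. Since the `metadata` command
does not exist yet, the sample is embedded as a string, and the helper
name `branch_remote_urls` is hypothetical.

```python
import json

# Trimmed copy of the draft output above; the `git metadata` command itself
# does not exist yet, so the sample is embedded instead of being generated.
SAMPLE = """
{
  "symbolic_refs": {"HEAD": "main"},
  "branches": [
    {"name": "main",
     "commit_id": "ac72c22f3c8a9280c81171ccc6cedff3171344cf",
     "remote": "origin/main"},
    {"name": "feature",
     "commit_id": "1e373e02767337bd6b996da6598eed822a805878",
     "remote": "fork/feature"}
  ],
  "remotes": [
    {"name": "origin", "url": "https://example.com/foo"},
    {"name": "fork", "url": "user@example.com/foo"}
  ]
}
"""


def branch_remote_urls(metadata):
    """Map each branch name to the URL of the remote it tracks."""
    urls = {remote["name"]: remote["url"] for remote in metadata["remotes"]}
    result = {}
    for branch in metadata["branches"]:
        # "origin/main" -> remote "origin"; this assumes that remote names
        # contain no "/", which the draft schema does not guarantee.
        remote_name = branch["remote"].split("/", 1)[0]
        result[branch["name"]] = urls.get(remote_name)
    return result


metadata = json.loads(SAMPLE)
print(metadata["symbolic_refs"]["HEAD"])
print(branch_remote_urls(metadata))
```

The need to split `"origin/main"` by hand suggests that the schema might
store the remote name and the remote branch as separate fields, which is
the kind of decision that the community feedback should settle.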

### Development plan

Since this is a new command that is not directly related to any specific
existing command, it will probably be placed in a new file inside the
`builtin` directory.

The functionality of this command can be divided into two categories:

1. **Data gathering**: retrieving data from different sources, calling
   existing functions and reading data structures declared in other
   files;

2. **Data serialization**: formatting the gathered data as JSON. This
   presents two challenges: generating the JSON itself and designing
   the schema for how the desired data will be presented.
   
Since the exported data is already provided by other Git commands, the
data-gathering side probably won't be difficult to implement. The main
task would be inspecting the existing codebase and finding the
functions and data structures that will feed our output.

Designing the schema, however, requires special planning, as the
flexibility of semi-structured data like JSON may lead to bad early
decisions. A solution may emerge from analysing other software that
exports JSON as metadata.
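
The split between the two categories can be sketched as follows. This is
written in Python only for brevity (the real implementation would be in
C, inside Git), and every type and field name here is hypothetical,
mirroring the draft output shown earlier.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Branch:
    name: str
    commit_id: str
    remote: str


@dataclass
class Remote:
    name: str
    url: str


@dataclass
class RepoMetadata:
    """In-memory structures that mirror the JSON schema one to one."""
    branches: list = field(default_factory=list)
    remotes: list = field(default_factory=list)


def serialize(metadata):
    """Data serialization: turn the gathered structures into JSON text."""
    return json.dumps(asdict(metadata), indent=2, sort_keys=True)


# Data gathering is faked here with literal values from the draft output.
meta = RepoMetadata(
    branches=[
        Branch("main", "ac72c22f3c8a9280c81171ccc6cedff3171344cf", "origin/main")
    ],
    remotes=[Remote("origin", "https://example.com/foo")],
)
print(serialize(meta))
```

Keeping the structures separate from the serializer means that the
schema can be reviewed and changed without touching the code that
gathers the data.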

### Schedule

1. **Now -- May 5th**: Requirements gathering
   - Inspect codebases that use Git as a data source;
   - Contact academic researchers on FLOSS;
   - Contact industry infrastructure professionals;

2. **May 6th -- June 1st**: Community bonding
   - Get in touch with the mentors;
   - Present to the community a first proposal of the JSON schema;
   - Receive feedback from the community about the schema;
   - Present a first proposal on the command line interface;
   - Receive feedback from the community about the command line
     interface;

3. **June 2nd -- July 14th**: First coding round
   - Write data structures that correspond to the presented JSON schema;
   - Fill the data structures with data obtained from routines of the
     existing codebase;

4. **July 15th -- August 25th**: Second coding round
   - Implement the command line interface option handlers;
   - Write the JSON serializer.

### Availability

2025 is my last year in my master's degree. Currently, I'm not attending
any classes and I am more focused on developing the software of my
research, performing experiments and writing scientific articles and my
thesis. Since my advisor is aware that I'm proposing a GSoC project, it
will be possible to work on Git while working on my master's tasks.



* Re: [GSoC] Project Proposal: Machine-Readable Repository Information Query Tool
  2025-04-02 18:22 [GSoC] Project Proposal: Machine-Readable Repository Information Query Tool Lucas Seiki Oshiro
@ 2025-04-03 10:14 ` Patrick Steinhardt
  2025-04-03 18:02   ` Lucas Seiki Oshiro
  0 siblings, 1 reply; 5+ messages in thread
From: Patrick Steinhardt @ 2025-04-03 10:14 UTC (permalink / raw)
  To: Lucas Seiki Oshiro; +Cc: git, karthik.188, shyamthakkar001

On Wed, Apr 02, 2025 at 03:22:11PM -0300, Lucas Seiki Oshiro wrote:
> ### Activity in the Git community in 2025
> 
> Since when I decided to submit a proposal for GSoC, I sent some patches
> to the Git codebase and git.github.io:
> 
> - My microproject, replacing some `test -f` by `test_path_is_file`:
>   https://lore.kernel.org/git/20250208165731.78804-1-lucasseikioshiro@gmail.com/;
> 
> - Adding a paragraph to the merge-strategies documentation describing how
>   Git merges submodules (based on the blog post that I mentioned
>   before):
>   https://lore.kernel.org/git/20250227014406.20527-1-lucasseikioshiro@gmail.com/;
>   
> - A patchset adding a new `--subject-extra-prefix` flag for `git
>   format-patch`, allowing the user to quickly prepend tags like [GSoC],
>   [Newbie] or [Outreachy] to the beginning of the subject. This patchset
>   was rejected in favor of just using `--subject-prefix='GSoC PATCH'` or
>   similar. It can be seen here:
>   https://lore.kernel.org/git/20250303220029.10716-1-lucasseikioshiro@gmail.com/;
> 
> - Given the feedback on the previous rejected patchset, I opened a Pull
>   Request on git.github.io replacing the occurrences of `[GSoC][PATCH]`
>   by `[GSoC PATCH]`;
>   
> - Adding a new userdiff driver for INI files, initially target for
>   gitconfig files. Currently it is still under revision:
>   https://lore.kernel.org/git/20250331031309.94682-1-lucasseikioshiro@gmail.com/.
> 
> Beyond contributions, I also helped people on the mailing list that
> needed assistance on Git documentation.

Could you please also amend the status (merged to master, merged to
next, under discussion) for each of these items?

> ## Project Proposal
> 
> Based on the information provided in
> https://git.github.io/SoC-2025-Ideas/, the goal of this project is to
> create a new Git command for querying information from a repository and
> returning it as a semi-structured data format as a JSON output.
> 
> In the scope of this project, the JSON output will only include data
> that can currently be retrieved through existing Git commands, for
> example:
> 
> - `git branch`: information about branches, such as the commit that each
>   branch currently references and their upstreams;
> 
> - `git tag`: information about the tags, such as the author or commit
>   date and the messages they hold (in the case of annotated tags);
> 
> - `git remote`: the URL of each remote;
> 
> - `git log`: statistics about the commit history, such of the
>   distribution of commits over time and by author, the distribution of
>   lines changed by each author;
> 
> - `git submodule`: information about the submodules, mainly the commits
>   that they are referencing and their remote URLs;
> 
> - `git rev-parse`: the current branch name, the current commit, the path
>   of the repository top level directory, if the repository is a bare
>   repository or if the repository is under bisection.
> 
> Given that the information that we want to compile are currently
> accessible only through different commands with different sets of flags,
> the user that wants to read them needs to have an advanced knowledge on
> Git. Once having the repository details consolidated in a single
> command, the user will be able to quickly retrieve what it desires
> without navigating a complex combination of commands and flags.

I already noticed this in another proposal, but it seems like the idea
is a bit underspecced. The idea isn't to make _all_ information about the
repository accessible. It's rather that we want to give a better home to
information about the underlying repository itself. To clarify further,
I'm talking about information like:

  - Which object hash does the repository use?
  - What is the ref database format?
  - Where is the Git directory?
  - Where is the common directory?
  - What is the top-level directory?

This kind of information is exposed via git-rev-parse(1) already, see
the section "Options for Files". But git-rev-parse(1) is not really a
good match at all given that its main intent is to parse revisions. Over
time though it developed into a kind of grab-bag of different unrelated
functionality that we didn't really have a nice home for elsewhere.
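
For reference, a minimal Python sketch of how the questions above map onto
existing git-rev-parse(1) options today. The helper names (`rev_parse`,
`repo_layout`) and the dictionary keys are made up for illustration; the
options are the ones documented in git-rev-parse(1), and
`--show-ref-format` is only available in recent Git versions.

```python
import subprocess

# Existing git-rev-parse(1) options answering each question above.
# The dictionary keys are hypothetical names chosen for this sketch.
QUERIES = {
    "object_format": "--show-object-format",
    "ref_format": "--show-ref-format",
    "git_dir": "--git-dir",
    "common_dir": "--git-common-dir",
    "toplevel": "--show-toplevel",
}


def rev_parse(option, cwd="."):
    """Run `git rev-parse <option>` and return the single line it prints."""
    result = subprocess.run(
        ["git", "rev-parse", option],
        cwd=cwd, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def repo_layout(query=rev_parse):
    """Answer all the questions, one rev-parse call per item."""
    return {key: query(option) for key, option in QUERIES.items()}
```

Note that `--show-toplevel` fails in a bare repository, so a caller of
`repo_layout()` would have to handle `subprocess.CalledProcessError`
there.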

> ### Development plan
> 
> Since this is a new command that is not directly related to any specific
> existent command, it will probably be placed in a new file inside the
> `builtin` directory.
> 
> The functionality of this command can be divided into two categories:
> 
> 1. **Data gathering**: retrieving data from different sources, calling
>    existent functions and reading data structures declared in other
>    files;
> 
> 2. **Data serialization**: formatting the gathered data in a JSON
>    format. This represents two challenges: generating the JSON itself
>    and designing the schema for how the desired data will be presented.
>    
> Since the exported data is already provided by other Git commands, it
> probably won't be difficult to implement this side of the
> functionality. The main task would be inspecting the existing codebase
> and find the functions and data structures that will feed our output.
> 
> Designing the schema, however, requires special planning, as the
> flexibility of semi-structured data like JSON may lead to early
> bad decisions. A solution may emerge by analysing other software that
> export JSON as metadata.
> 
> ### Schedule
> 
> 1. **Now -- May 5th**: Requirements gathering
>    - Inspect codebases that uses Git as data sources; 
>    - Contacting academic researchers on FLOSS;
>    - Contacting industry infrastructure professionals;
> 
> 2. **May 6th -- June 1st**: Community bonding
>    - Getting in touch with the mentors;
>    - Present to the community a first proposal of the JSON schema;
>    - Receive feedback from the community about the schema;
>    - Present a first proposal on the command line interface;
>    - Receive feedback from the community about the command line
>      interface;
> 
> 3. **June 2nd -- July 14th**: First coding round
>    - Write data structures that correspond to the presented JSON schema;
>    - Fill the data structures with data obtained from routines of the
>      existing codebase;
> 
> 4. **July 15th -- August 25th**: Second coding round
>    - Implementing the command line interface option handlers;
>    - Write the JSON serializer.

I generally recommend students to take on smaller batches of work that
can be submitted individually. The way it is structured now means that
you will end up with a single deliverable at the end of your project.
But structuring the project like that introduces a high risk that you
won't be able to land anything until the end of your project in case
there is a bigger discussion around parts of these patches.

Instead, it would make sense to identify smaller batches of work that
are self-contained enough to be submitted upstream. This ensures that
you get early feedback and that you can iterate on your design as early
as possible in the project.

Patrick


* Re: [GSoC] Project Proposal: Machine-Readable Repository Information Query Tool
  2025-04-03 10:14 ` Patrick Steinhardt
@ 2025-04-03 18:02   ` Lucas Seiki Oshiro
  2025-04-04  9:09     ` Patrick Steinhardt
  0 siblings, 1 reply; 5+ messages in thread
From: Lucas Seiki Oshiro @ 2025-04-03 18:02 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, karthik.188, shyamthakkar001


> Could you please also amend the status (merged to master, merged to
> next, under discussion) for each of these items?

Ok! This may change before the GSoC submission deadline, but for now:

- Microproject: merged to master
- Merge Documentation: merged to master
- Extra prefix flag: closed
- Fix on git.github.io: merged to master
- Driver for INI files: under discussion

> This kind of information is exposed via git-rev-parse(1) already, see
> the section "Options for Files".

Thanks for your clarification! Still, I was discussing this with people
here at my university who deal directly with research on FLOSS
repositories. Is it worth finding other use cases for this new command?

> I generally recommend students to take on smaller batches of work that
> can be submitted individually. The way it is structured now means that
> you will end up with a single deliverable at the end of your project.
> But structuring the project like that introduces a high risk that you
> won't be able to land anything until the end of your project in case
> there is a bigger discussion around parts of these patches.

I see... After sending this first version I was thinking about it, and
it would be hard to test the functionality without having the JSON
serializer minimally working. I'll send another version, proposing
smaller batches of work and writing a minimal serializer at the
beginning of the project, then improving it as new features are added.

> Patrick

Thanks for your time!



* Re: [GSoC] Project Proposal: Machine-Readable Repository Information Query Tool
  2025-04-03 18:02   ` Lucas Seiki Oshiro
@ 2025-04-04  9:09     ` Patrick Steinhardt
  2025-04-04 15:01       ` Lucas Seiki Oshiro
  0 siblings, 1 reply; 5+ messages in thread
From: Patrick Steinhardt @ 2025-04-04  9:09 UTC (permalink / raw)
  To: Lucas Seiki Oshiro; +Cc: git, karthik.188, shyamthakkar001

On Thu, Apr 03, 2025 at 03:02:51PM -0300, Lucas Seiki Oshiro wrote:
> > This kind of information is exposed via git-rev-parse(1) already, see
> > the section "Options for Files".
> 
> Thanks for your clarification! But still, I was discussing with people
> here at my university who deals directly with research on FLOSS
> repositories, is it worth to find other uses cases for this new command?

It doesn't hurt to think about additional use cases, sure. But one of the
things that I want to caution against is that we now create the next
"grab bag" of unrelated features under a common name. After all, we are
trying to fix exactly this state in git-rev-parse(1). So the intent of
the command should be clearly defined to avoid this.

Patrick


* Re: [GSoC] Project Proposal: Machine-Readable Repository Information Query Tool
  2025-04-04  9:09     ` Patrick Steinhardt
@ 2025-04-04 15:01       ` Lucas Seiki Oshiro
  0 siblings, 0 replies; 5+ messages in thread
From: Lucas Seiki Oshiro @ 2025-04-04 15:01 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, karthik.188, shyamthakkar001


> So the intent of the command should be clearly defined to avoid this.

Thanks, Patrick! I'll send a more focused v2 soon.

