From: Lucas Seiki Oshiro <lucasseikioshiro@gmail.com>
To: git@vger.kernel.org
Cc: ps@pks.im, karthik.188@gmail.com, shyamthakkar001@gmail.com
Subject: [GSoC] Project Proposal: Machine-Readable Repository Information Query Tool
Date: Wed, 2 Apr 2025 15:22:11 -0300
Message-ID: <7EB151DA-0BDB-4D54-BBB8-CEE69F51F13A@gmail.com>
Hi!
As you may have noticed from my interactions here, I'm going to submit a
proposal for GSoC 2025!
I'm interested in the project idea currently entitled "Machine-Readable
Repository Information Query Tool". I chose this idea mainly because I
think it will be useful for infrastructure teams and FLOSS researchers.
Here is the first version of my proposal. I'd be grateful for any
feedback on it! In this proposal I introduce myself again, describe the
possible use cases of this feature, give a first idea of how it would
work and present an activity schedule.
Thanks!
---
# Machine-Readable Repository Information Query Tool
## Contact info
- Name: Lucas Seiki Oshiro
- Timezone: GMT-3
- IRC:
- GitHub: https://github.com/lucasoshiro
- LinkedIn: https://www.linkedin.com/in/lucasseikioshiro/
## About me
My name is Lucas Oshiro. I'm a developer and CS graduate from São Paulo,
Brazil. Currently I'm pursuing a master's degree in CS at the University
of São Paulo. My interest in Git dates back years, and I even submitted
a patch to its codebase in the past, though I couldn't complete it due
to scheduling conflicts with my capstone project.
Having experience in academia, industry and FLOSS, I highly value
code quality, code legibility, well-maintained Git histories, unit tests
and documentation.
### Previous experience with Git
Before this year, I hadn't been directly involved with the Git
community. However, I kept my interest in Git alive by:
- Translating the "Git Internals" chapter of Pro Git to Brazilian
Portuguese: https://github.com/progit/progit2-pt-br/pull/81;
- Writing some blog posts about Git, for example:
  - one explaining how Git can be used as a debugging tool:
    https://lucasoshiro.github.io/posts-en/2023-02-13-git-debug/;
  - another explaining how Git merges submodules:
    https://lucasoshiro.github.io/posts-en/2022-03-12-merge-submodule/;
- Writing a compatible subset of Git in Haskell from scratch:
https://github.com/lucasoshiro/oshit;
- Helping organize a Git Introductory Workshop at my university:
https://flusp.ime.usp.br/events/git-introductory-workshop/;
- Presenting some lectures about Git at a company where I worked some
years ago, covering the Git internals (objects, references, packfile)
and Git tools for debugging and archaeology (blame, bisect, pickaxe,
ls-files, etc).
### Previous experience with C and open-source
I also have experience with C and some C++. During my CS
course, C was one of the primary languages that I used. I also
worked with C/C++, for example, in:
- Writing an AMQP message broker from scratch:
https://github.com/lucasoshiro/amqp_broker;
- Contributing simple patches to the IIO subsystem of the Linux
kernel: https://lucasoshiro.github.io/floss-en/2020-06-06-kernel_linux/;
- Contributing to the Marlin firmware for 3D printers:
https://lucasoshiro.github.io/floss-en/2020-06-30-Marlin/;
- Writing a module for the ns-3 network simulator, dealing with both C
and C++ codebases (currently under development; I plan to write a
paper and make the code available soon).
During my CS course I was also a member of FLUSP
(https://flusp.ime.usp.br), a group at my university focused on FLOSS
contributions, and of Hardware Livre USP
(https://hardwarelivreusp.org), another group that was focused on
working with open-source hardware.
As a master's student, I'm one of the Open Science Ambassadors of my
University (https://cienciaaberta.usp.br/sobre-o-projeto/, in
Portuguese), promoting the Open Science principles, which include
open-source software, in the unit where I study.
I have also contributed to other free/open-source software, which I list
here: https://lucasoshiro.github.io/floss-en/
### Activity in the Git community in 2025
Since I decided to submit a proposal for GSoC, I have sent some patches
to the Git codebase and git.github.io:
- My microproject, replacing some `test -f` with `test_path_is_file`:
https://lore.kernel.org/git/20250208165731.78804-1-lucasseikioshiro@gmail.com/;
- Adding a paragraph to the merge-strategies documentation describing how
Git merges submodules (based on the blog post that I mentioned
before):
https://lore.kernel.org/git/20250227014406.20527-1-lucasseikioshiro@gmail.com/;
- A patchset adding a new `--subject-extra-prefix` flag for `git
format-patch`, allowing the user to quickly prepend tags like [GSoC],
[Newbie] or [Outreachy] to the beginning of the subject. This patchset
was rejected in favor of just using `--subject-prefix='GSoC PATCH'` or
similar. It can be seen here:
https://lore.kernel.org/git/20250303220029.10716-1-lucasseikioshiro@gmail.com/;
- Given the feedback on the previous rejected patchset, I opened a Pull
Request on git.github.io replacing the occurrences of `[GSoC][PATCH]`
with `[GSoC PATCH]`;
- Adding a new userdiff driver for INI files, initially targeted at
gitconfig files. It is currently still under review:
https://lore.kernel.org/git/20250331031309.94682-1-lucasseikioshiro@gmail.com/.
Beyond contributions, I also helped people on the mailing list who
needed assistance with Git documentation.
## Project Proposal
Based on the information provided in
https://git.github.io/SoC-2025-Ideas/, the goal of this project is to
create a new Git command for querying information from a repository and
returning it in a semi-structured data format, namely JSON.
In the scope of this project, the JSON output will only include data
that can currently be retrieved through existing Git commands, for
example:
- `git branch`: information about branches, such as the commit that each
branch currently references and their upstreams;
- `git tag`: information about the tags, such as the author or commit
date and the messages they hold (in the case of annotated tags);
- `git remote`: the URL of each remote;
- `git log`: statistics about the commit history, such as the
distribution of commits over time and by author, and the distribution
of lines changed by each author;
- `git submodule`: information about the submodules, mainly the commits
that they are referencing and their remote URLs;
- `git rev-parse`: the current branch name, the current commit, the path
of the repository top level directory, whether the repository is a bare
repository and whether the repository is under bisection.
Given that the information that we want to compile is currently
accessible only through different commands with different sets of flags,
users who want to read it need advanced knowledge of Git. With the
repository details consolidated in a single command, users will be able
to quickly retrieve what they need without navigating a complex
combination of commands and flags.
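To illustrate the current situation, here is a minimal Python sketch of what a script has to do today to collect only the current branch, the top-level path, the local branches and the remote URLs. The helper names `run_git` and `gather` and the layout of the returned dictionary are hypothetical and exist only for this example. The Git invocations are existing ones, and the sketch assumes a non-bare repository.

```python
import subprocess
from typing import Callable, Sequence


def run_git(args: Sequence[str]) -> str:
    """Run `git <args>` in the current directory and return its stdout."""
    result = subprocess.run(
        ["git", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def gather(run: Callable[[Sequence[str]], str] = run_git) -> dict:
    """Collect a few repository facts by combining existing Git commands.

    The dictionary layout is a hypothetical one chosen for this sketch.
    """
    # One line per local branch: "<name> <commit> <upstream-or-empty>".
    ref_format = "%(refname:short) %(objectname) %(upstream:short)"
    branches = []
    listing = run(["for-each-ref", f"--format={ref_format}", "refs/heads"])
    for line in listing.splitlines():
        name, commit_id, *upstream = line.split()
        branches.append({
            "name": name,
            "commit_id": commit_id,
            "upstream": upstream[0] if upstream else None,
        })

    # `git remote` prints remote names only; each URL needs its own call.
    remotes = [
        {"name": name, "url": run(["remote", "get-url", name]).strip()}
        for name in run(["remote"]).split()
    ]

    return {
        "head": run(["rev-parse", "--abbrev-ref", "HEAD"]).strip(),
        "top_level": run(["rev-parse", "--show-toplevel"]).strip(),
        "branches": branches,
        "remotes": remotes,
    }
```

Calling `gather()` inside a repository spawns at least four Git processes and relies on three different commands, each with its own flags and output format. Passing a different `run` callable makes it possible to exercise the parsing without a real repository.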
### Use cases
Some use cases that would benefit from this feature are:
- CLI tools that display formatted information about a Git repository,
for example, OneFetch (https://github.com/o2sh/onefetch);
- Text editors, IDEs and plugins that have front-ends for Git, such as
Magit (https://magit.vc) or GitLens (https://www.gitkraken.com/gitlens);
- FLOSS repository tracking software, for example,
kworkflow (https://github.com/kworkflow),
ctracker (https://github.com/quic/contribution-tracker);
- Academic researchers on FLOSS projects that need statistics on the
repositories that they are querying;
- Continuous integration workflows that perform checks on the
repository before allowing a branch to be merged into another or
before a deploy;
- Code quality tools that will be able to inspect the health of the
commit history.
### Planned features
Since the features haven't been defined yet, they will need to be
planned after surveying people and projects that could potentially use
this command:
- Searching on code hosting tools (e.g. GitHub, GitLab) for open-source
software that retrieves data from Git and finding out what it does with
that data;
- Contacting people in academia who use Git repositories as data
sources for their research to find out what valuable information
this command can provide them;
- Contacting people from the industry, especially in infrastructure
teams, to understand the challenges they face when retrieving data from
Git.
Given that I have worked in an infrastructure team and that I have
colleagues and professors at the university who currently research
FLOSS software and communities, I have contacts that can provide input
on what should be considered when developing this new command.
For now, it's not possible to decide exactly how this command would
work, but a first draft is this (supposing that `metadata` is the name
of the command and `--submodule` is a flag that enables the submodule
metadata):
~~~
$ git metadata --submodule
{
  "symbolic_refs": {
    "HEAD": "main"
  },
  "branches": [
    {
      "name": "main",
      "commit_id": "ac72c22f3c8a9280c81171ccc6cedff3171344cf",
      "remote": "origin/main"
    },
    {
      "name": "feature",
      "commit_id": "1e373e02767337bd6b996da6598eed822a805878",
      "remote": "fork/feature"
    }
  ],
  "tags": [
    {
      "name": "v1.0",
      "message": "First version",
      "author_timestamp": "1743554265",
      "committer_timestamp": "1743554265"
    }
  ],
  "remotes": [
    {
      "name": "origin",
      "url": "https://example.com/foo"
    },
    {
      "name": "fork",
      "url": "user@example.com/foo"
    }
  ],
  "submodules": [
    {
      "path": "my_dir/my_submodule_dir",
      "url": "https://example.com/bar",
      "commit_id": "94436069f106c0014897b1c93e8fc3e49c8fc156"
    }
  ]
}
~~~
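As an illustration of how a tool could consume such output, here is a minimal Python sketch that parses a trimmed copy of the draft and answers two typical questions: which commit HEAD points to, and which URL a branch's upstream lives at. The function names `head_commit` and `upstream_url` are hypothetical, and the key names are only those of the draft, so they may change once the schema is discussed.

```python
import json

# Trimmed copy of the draft output; in practice this text would come from
# the stdout of the (still hypothetical) `git metadata` command.
DRAFT_OUTPUT = """
{
  "symbolic_refs": {"HEAD": "main"},
  "branches": [
    {"name": "main",
     "commit_id": "ac72c22f3c8a9280c81171ccc6cedff3171344cf",
     "remote": "origin/main"},
    {"name": "feature",
     "commit_id": "1e373e02767337bd6b996da6598eed822a805878",
     "remote": "fork/feature"}
  ],
  "remotes": [
    {"name": "origin", "url": "https://example.com/foo"},
    {"name": "fork", "url": "user@example.com/foo"}
  ]
}
"""


def head_commit(metadata: dict) -> str:
    """Return the commit ID of the branch that HEAD points to."""
    head = metadata["symbolic_refs"]["HEAD"]
    by_name = {branch["name"]: branch for branch in metadata["branches"]}
    return by_name[head]["commit_id"]


def upstream_url(metadata: dict, branch_name: str) -> str:
    """Return the URL of the remote that a branch tracks."""
    by_name = {branch["name"]: branch for branch in metadata["branches"]}
    # "origin/main" -> "origin"; assumes remote names contain no slash.
    remote_name = by_name[branch_name]["remote"].split("/", 1)[0]
    urls = {remote["name"]: remote["url"] for remote in metadata["remotes"]}
    return urls[remote_name]


metadata = json.loads(DRAFT_OUTPUT)
print(head_commit(metadata))              # ac72c22f3c8a9280c81171ccc6cedff3171344cf
print(upstream_url(metadata, "feature"))  # user@example.com/foo
```

The need to split `"origin/main"` to find the remote name hints at a schema question worth raising during review: whether the remote and the upstream branch should be separate fields.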
### Development plan
Since this is a new command that is not directly related to any specific
existing command, it will probably be placed in a new file inside the
`builtin` directory.
The functionality of this command can be divided into two categories:
1. **Data gathering**: retrieving data from different sources, calling
existing functions and reading data structures declared in other
files;
2. **Data serialization**: formatting the gathered data as JSON. This
presents two challenges: generating the JSON itself and designing the
schema for how the desired data will be presented.
Since the exported data is already provided by other Git commands, it
probably won't be difficult to implement the data gathering side of the
functionality. The main task would be inspecting the existing codebase
and finding the functions and data structures that will feed our output.
Designing the schema, however, requires special planning, as the
flexibility of semi-structured data like JSON may lead to bad early
decisions. A solution may emerge by analysing other software that
exports JSON as metadata.
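The split between the two categories can be sketched as follows. This is only an illustration in Python, kept short for readability; the real command would be written in C inside Git. The names `Branch`, `RepoInfo` and `serialize` are hypothetical, and the `schema_version` field is an assumption of this sketch that does not appear in the draft output. It is one possible way of letting consumers detect schema changes.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Branch:
    name: str
    commit_id: str
    remote: Optional[str] = None  # None when the branch tracks nothing


@dataclass
class RepoInfo:
    # Hypothetical field, not in the draft: lets consumers detect changes.
    schema_version: int = 1
    branches: List[Branch] = field(default_factory=list)


def serialize(info: RepoInfo) -> str:
    """Data serialization: turn the gathered structures into JSON text."""
    # sort_keys keeps the output stable across runs, which helps scripts.
    return json.dumps(asdict(info), indent=2, sort_keys=True)


# Data gathering, faked here with literal values taken from the draft.
info = RepoInfo(
    branches=[
        Branch(
            "main",
            "ac72c22f3c8a9280c81171ccc6cedff3171344cf",
            "origin/main",
        )
    ]
)
print(serialize(info))
```

Keeping the gathered data in plain structures and serializing them in a single place means the schema is defined once, so changing a key name or adding a field would not touch the gathering code.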
### Schedule
1. **Now -- May 5th**: Requirements gathering
   - Inspect codebases that use Git as a data source;
   - Contact academic researchers on FLOSS;
   - Contact industry infrastructure professionals;
2. **May 6th -- June 1st**: Community bonding
   - Get in touch with the mentors;
   - Present to the community a first proposal of the JSON schema;
   - Receive feedback from the community about the schema;
   - Present a first proposal on the command line interface;
   - Receive feedback from the community about the command line
     interface;
3. **June 2nd -- July 14th**: First coding round
   - Write data structures that correspond to the presented JSON schema;
   - Fill the data structures with data obtained from routines of the
     existing codebase;
4. **July 15th -- August 25th**: Second coding round
   - Implement the command line interface option handlers;
   - Write the JSON serializer.
### Availability
2025 is the last year of my master's degree. Currently, I'm not attending
any classes and I am more focused on developing the software of my
research, performing experiments and writing scientific articles and my
thesis. Since my advisor is aware that I'm proposing a GSoC project, it
will be possible to work on Git while working on my master's tasks.