From: Patrick Steinhardt <ps@pks.im>
To: Lucas Seiki Oshiro <lucasseikioshiro@gmail.com>
Cc: git@vger.kernel.org, karthik.188@gmail.com, shyamthakkar001@gmail.com
Subject: Re: [GSoC] Project Proposal: Machine-Readable Repository Information Query Tool
Date: Thu, 3 Apr 2025 12:14:02 +0200
Message-ID: <Z-5famP3CgaSfDc2@pks.im>
In-Reply-To: <7EB151DA-0BDB-4D54-BBB8-CEE69F51F13A@gmail.com>

On Wed, Apr 02, 2025 at 03:22:11PM -0300, Lucas Seiki Oshiro wrote:
> ### Activity in the Git community in 2025
> 
> Since I decided to submit a proposal for GSoC, I have sent some patches
> to the Git codebase and git.github.io:
> 
> - My microproject, replacing some `test -f` by `test_path_is_file`:
>   https://lore.kernel.org/git/20250208165731.78804-1-lucasseikioshiro@gmail.com/;
> 
> - Adding a paragraph to the merge-strategies documentation describing how
>   Git merges submodules (based on the blog post that I mentioned
>   before):
>   https://lore.kernel.org/git/20250227014406.20527-1-lucasseikioshiro@gmail.com/;
>   
> - A patchset adding a new `--subject-extra-prefix` flag for `git
>   format-patch`, allowing the user to quickly prepend tags like [GSoC],
>   [Newbie] or [Outreachy] to the beginning of the subject. This patchset
>   was rejected in favor of just using `--subject-prefix='GSoC PATCH'` or
>   similar. It can be seen here:
>   https://lore.kernel.org/git/20250303220029.10716-1-lucasseikioshiro@gmail.com/;
> 
> - Given the feedback on the previously rejected patchset, I opened a
>   Pull Request on git.github.io replacing the occurrences of
>   `[GSoC][PATCH]` with `[GSoC PATCH]`;
>   
> - Adding a new userdiff driver for INI files, initially targeted at
>   gitconfig files. It is currently still under review:
>   https://lore.kernel.org/git/20250331031309.94682-1-lucasseikioshiro@gmail.com/.
> 
> Beyond contributions, I also helped people on the mailing list who
> needed assistance with the Git documentation.

Could you please also add the status (merged to master, merged to
next, under discussion) of each of these items?

> ## Project Proposal
> 
> Based on the information provided in
> https://git.github.io/SoC-2025-Ideas/, the goal of this project is to
> create a new Git command for querying information from a repository and
> returning it in a semi-structured data format, namely JSON.
> 
> In the scope of this project, the JSON output will only include data
> that can currently be retrieved through existing Git commands, for
> example:
> 
> - `git branch`: information about branches, such as the commit that each
>   branch currently references and their upstreams;
> 
> - `git tag`: information about the tags, such as the author or commit
>   date and the messages they hold (in the case of annotated tags);
> 
> - `git remote`: the URL of each remote;
> 
> - `git log`: statistics about the commit history, such as the
>   distribution of commits over time and by author, and the distribution
>   of lines changed by each author;
> 
> - `git submodule`: information about the submodules, mainly the commits
>   that they are referencing and their remote URLs;
> 
> - `git rev-parse`: the current branch name, the current commit, the path
>   of the repository top-level directory, whether the repository is a
>   bare repository, and whether the repository is under bisection.
> 
> Given that the information that we want to compile is currently
> accessible only through different commands with different sets of flags,
> a user who wants to read it needs advanced knowledge of Git. With the
> repository details consolidated in a single command, users will be able
> to quickly retrieve what they need without navigating a complex
> combination of commands and flags.

I already noticed this in another proposal, but it seems like the idea
is a bit underspecified. The idea isn't to make _all_ information about
the repository accessible. Rather, we want to give a better home to
information about the underlying repository itself. To clarify further,
I'm talking about information like:

  - Which object hash does the repository use?
  - What is the ref database format?
  - Where is the Git directory?
  - Where is the common directory?
  - What is the top-level directory?

This kind of information is already exposed via git-rev-parse(1), see
the section "Options for Files". But git-rev-parse(1) is not really a
good fit, given that its main purpose is to parse revisions. Over time,
though, it has developed into a kind of grab-bag of unrelated
functionality that we didn't really have a nice home for elsewhere.
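
To illustrate the status quo, here is a minimal sketch (in Python, purely
for illustration) of how a script could collect the items listed above
from git-rev-parse(1) today and print them as JSON. The JSON key names and
the helper functions are hypothetical and not a proposed schema. The
sketch assumes a reasonably recent Git that knows `--show-ref-format`, a
non-bare repository (`--show-toplevel` fails otherwise), and paths that do
not contain newlines.

```python
import json
import subprocess

# Hypothetical JSON keys mapped to the git-rev-parse(1) options that
# currently expose the corresponding value. The key names are illustrative.
REPO_INFO_OPTIONS = {
    "object_format": "--show-object-format",
    "ref_format": "--show-ref-format",
    "git_dir": "--git-dir",
    "common_dir": "--git-common-dir",
    "toplevel": "--show-toplevel",
}


def parse_rev_parse_output(output, keys):
    """Pair each requested key with the matching line of output.

    git-rev-parse(1) prints one line per option, in the order in which
    the options were given on the command line.
    """
    lines = output.splitlines()
    if len(lines) != len(keys):
        raise ValueError(
            f"expected {len(keys)} lines of output, got {len(lines)}"
        )
    return dict(zip(keys, lines))


def query_repo_info(repo_path="."):
    """Run git-rev-parse(1) once and return the values as a dict."""
    keys = list(REPO_INFO_OPTIONS)
    argv = ["git", "-C", repo_path, "rev-parse"]
    argv += [REPO_INFO_OPTIONS[key] for key in keys]
    result = subprocess.run(
        argv, check=True, capture_output=True, text=True
    )
    return parse_rev_parse_output(result.stdout, keys)


if __name__ == "__main__":
    print(json.dumps(query_repo_info(), indent=2))
```

Note that the caller has to know five unrelated options and has to rely on
the order of the output lines. That is the kind of friction a dedicated
command could remove.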

> ### Development plan
> 
> Since this is a new command that is not directly related to any specific
> existing command, it will probably be placed in a new file inside the
> `builtin` directory.
> 
> The functionality of this command can be divided into two categories:
> 
> 1. **Data gathering**: retrieving data from different sources, calling
>    existing functions and reading data structures declared in other
>    files;
> 
> 2. **Data serialization**: formatting the gathered data as JSON. This
>    presents two challenges: generating the JSON itself and designing
>    the schema for how the desired data will be presented.
>    
> Since the exported data is already provided by other Git commands, it
> probably won't be difficult to implement the data-gathering side of the
> functionality. The main task would be inspecting the existing codebase
> and finding the functions and data structures that will feed our output.
> 
> Designing the schema, however, requires special planning, as the
> flexibility of semi-structured data like JSON may lead to bad early
> decisions. A solution may emerge from analysing other software that
> exports JSON as metadata.
> 
> ### Schedule
> 
> 1. **Now -- May 5th**: Requirements gathering
>    - Inspect codebases that use Git as a data source;
>    - Contacting academic researchers on FLOSS;
>    - Contacting industry infrastructure professionals;
> 
> 2. **May 6th -- June 1st**: Community bonding
>    - Getting in touch with the mentors;
>    - Present to the community a first proposal of the JSON schema;
>    - Receive feedback from the community about the schema;
>    - Present a first proposal on the command line interface;
>    - Receive feedback from the community about the command line
>      interface;
> 
> 3. **June 2nd -- July 14th**: First coding round
>    - Write data structures that correspond to the presented JSON schema;
>    - Fill the data structures with data obtained from routines of the
>      existing codebase;
> 
> 4. **July 15th -- August 25th**: Second coding round
>    - Implementing the command line interface option handlers;
>    - Write the JSON serializer.

I generally recommend that students take on smaller batches of work that
can be submitted individually. The way it is structured now means that
you will end up with a single deliverable at the end of your project.
Structuring the project like that introduces a high risk that you won't
be able to land anything before the project ends if there is a bigger
discussion around parts of these patches.

Instead, it would make sense to identify smaller batches of work that
are self-contained enough to be submitted upstream. This ensures that
you get early feedback and that you can iterate on your design as early
as possible in the project.

Patrick
