git.vger.kernel.org archive mirror
* Discuss GSoC: Implement consistency checks for refs
@ 2024-03-06 13:20 shejialuo
  2024-03-06 14:45 ` Patrick Steinhardt
  0 siblings, 1 reply; 4+ messages in thread
From: shejialuo @ 2024-03-06 13:20 UTC (permalink / raw)
  To: git; +Cc: Patrick Steinhardt

Hi All,

I am interested in the "Implement consistency checks for refs" GSoC
idea. However, implementing a feature is much harder, so I would like to
ask some questions to help me work on it.

As [1] shows, I think the idea is easy to understand. We need to ensure
the consistency of the refs. The current `git-fsck` only checks the
connectivity from a ref to the object file. There is a possibility that
a ref itself could be corrupted, and we should avoid that through this
project.

I have read some of the source code. Based on what I have learned, I
know there are two backends: one is "files" and the other is "reftable".
I know nothing about reftable at the moment, so for now I will focus on
the files backend.

I think the principle behind `git-fsck` is that it traverses every
object file, reads its content, hashes the content with SHA-1, and
compares the value with the stored ref value. So if we want to add
consistency checks for refs, we may need to add a new file to store the
last commit state (and not only the last commit state: do we also need
to consider the stash state?). However, from my perspective, it is a bad
idea to use a file to store the refs' states, and we cannot use the
object files to check whether a ref is corrupted.

So this is my first question: what mechanism should we use to provide
consistency, and to what extent? I also think this mechanism should be
general for both text-based and binary-based refs.

I also have a more general question. I think I need to understand
`fsck.c` and of course the reftable format. However, I am unsure whether
I need to understand the ref internals. Could you please provide more
information to make this idea clearer?

Thanks,
Jialuo

[1] https://lore.kernel.org/git/ZakIPEytlxHGCB9Y@tanuki/

* Re: Discuss GSoC: Implement consistency checks for refs
  2024-03-06 13:20 shejialuo
@ 2024-03-06 14:45 ` Patrick Steinhardt
  0 siblings, 0 replies; 4+ messages in thread
From: Patrick Steinhardt @ 2024-03-06 14:45 UTC (permalink / raw)
  To: shejialuo; +Cc: git

On Wed, Mar 06, 2024 at 09:20:36PM +0800, shejialuo wrote:
> Hi All,
> 
> I am interested in the "Implement consistency checks for refs" GSoC
> idea. However, implementing a feature is much harder, so I would like to
> ask some questions to help me work on it.

Sure!

> As [1] shows, I think the idea is easy to understand. We need to ensure
> the consistency of the refs. The current `git-fsck` only checks the
> connectivity from a ref to the object file. There is a possibility that
> a ref itself could be corrupted, and we should avoid that through this
> project.

I know this is splitting hairs, but git-fsck(1) doesn't give us the
tools to avoid corruption. It only gives us the tools to detect it after
the fact.

> I have read some of the source code. Based on what I have learned, I
> know there are two backends: one is "files" and the other is "reftable".
> I know nothing about reftable at the moment, so for now I will focus on
> the files backend.

Yeah, the "reftable" backend is new in the Git v2.45 release cycle, so
it's totally expected that most people have no idea about it. It's also
part of the motivation for this project though. Because as you noted, it
is a binary format that is thus not as readily parseable by a human as
the old "files" backend. This makes it much more important to provide
the tooling to detect whether things look as expected.

> I think the principle behind `git-fsck` is that it traverses every
> object file, reads its content, hashes the content with SHA-1, and
> compares the value with the stored ref value. So if we want to add
> consistency checks for refs, we may need to add a new file to store the
> last commit state (and not only the last commit state: do we also need
> to consider the stash state?). However, from my perspective, it is a bad
> idea to use a file to store the refs' states, and we cannot use the
> object files to check whether a ref is corrupted.

I agree 100%: tracking ref states in a secondary database is not a
good idea.

> So this is my first question: what mechanism should we use to provide
> consistency, and to what extent? I also think this mechanism should be
> general for both text-based and binary-based refs.

The exact extent will need some discussion. What's clear is that it does
not need to be perfect from the beginning, and we are sure to discover
more checks over time that may make sense.

Some ideas off the top of my head:

  - generic
    - Ensure that all ref names are conformant.
    - Ensure that there are no directory/file conflicts for the ref
      names.
  - files
    - Ensure that "packed-refs" is well-formatted.
    - Ensure that refs in "packed-refs" are ordered lexicographically.
    - Check for corrupted loose refs in "refs/".
  - reftable
    - Ensure that there are no garbage files in "reftable/".
    - Ensure that "tables.list" is well-formatted.
    - Ensure that each table is well-formatted.
    - Ensure that refs in each table are ordered correctly.

This list is not exhaustive; there may of course be other checks that
make sense. Any additional ideas from you or other interested students
are welcome.
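
To make two of the "files" ideas above concrete, here is a minimal
Python sketch of a "packed-refs" check covering line format and ordering.
It is an illustration only, not Git's implementation: the function name
`check_packed_refs` is hypothetical, every line starting with "#" is
treated as a header or comment (a simplification), and ordering is
checked with a plain string comparison of the ref names.

```python
import re

# "<hex object id> <refname>"; 40 hex digits for SHA-1, 64 for SHA-256.
ENTRY = re.compile(r"^([0-9a-f]{40}|[0-9a-f]{64}) (\S+)$")
# "^<hex object id>": the peeled value of the tag on the preceding line.
PEELED = re.compile(r"^\^([0-9a-f]{40}|[0-9a-f]{64})$")


def check_packed_refs(text):
    """Return a list of problems found in the text of a packed-refs file."""
    problems = []
    previous = None  # name of the last well-formed ref entry seen
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line.startswith("#"):
            continue  # assumption: header/comment lines are skipped
        if PEELED.match(line):
            if previous is None:
                problems.append(
                    f"line {lineno}: peeled line without a preceding ref")
            continue
        match = ENTRY.match(line)
        if not match:
            problems.append(f"line {lineno}: malformed entry")
            continue
        name = match.group(2)
        if previous is not None and name <= previous:
            problems.append(
                f"line {lineno}: '{name}' is not sorted after '{previous}'")
        previous = name
    return problems
```

An empty result means the text passed both checks. Each problem string
carries the line number, so a caller can report several issues from one
pass instead of stopping at the first.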

For what it's worth, not all of the checks need to be implemented as
part of GSoC. At a minimum, it should result in the infra to allow for
backend-specific checks and a couple of checks for at least one of the
backends.

> I also have a more general question. I think I need to understand
> `fsck.c` and of course the reftable format. However, I am unsure whether
> I need to understand the ref internals. Could you please provide more
> information to make this idea clearer?

You will certainly need to learn about ref internals a bit. There are
some common rules and restrictions that are important in order to figure
out what we want to check in the first place. Understanding the
"reftable" format would be great, but you may also get away with only
implementing generic or "files"-backend specific consistency checks.
This depends on the scope you are aiming for.
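
As an example of the kind of common rule meant here, the sketch below
checks a ref name against a subset of the restrictions documented in
git-check-ref-format(1). It is a partial illustration in Python, not
Git's own code: `ref_name_problems` is a hypothetical helper, and rules
such as the one-level-name restriction are deliberately left out.

```python
# Characters that may not appear anywhere in a ref name:
# space, ~ ^ : ? * [ and backslash.
FORBIDDEN_CHARS = set(" ~^:?*[\\")


def ref_name_problems(name):
    """Return a list of problems with a ref name; empty means it passed."""
    problems = []
    if name == "@":
        problems.append("name is the single character '@'")
    if ".." in name:
        problems.append("contains '..'")
    if "@{" in name:
        problems.append("contains '@{'")
    if name.endswith("."):
        problems.append("ends with '.'")
    # ASCII control characters (below 0x20, or DEL) are also forbidden.
    if any(ord(c) < 0x20 or ord(c) == 0x7F or c in FORBIDDEN_CHARS
           for c in name):
        problems.append("contains a forbidden character")
    for component in name.split("/"):
        if component == "":
            # Covers a leading slash, a trailing slash, and "//".
            problems.append("empty path component")
        elif component.startswith("."):
            problems.append("component starts with '.'")
        elif component.endswith(".lock"):
            problems.append("component ends with '.lock'")
    return problems
```

For instance, `ref_name_problems("refs/heads/main")` returns an empty
list, while `ref_name_problems("refs/heads/foo.lock")` reports the
".lock" suffix.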

Patrick

> Thanks,
> Jialuo
> 
> [1] https://lore.kernel.org/git/ZakIPEytlxHGCB9Y@tanuki/


* Re: Discuss GSoC: Implement consistency checks for refs
@ 2024-03-10 10:01 shejialuo
  2024-03-14  3:38 ` Kaartic Sivaraam
  0 siblings, 1 reply; 4+ messages in thread
From: shejialuo @ 2024-03-10 10:01 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git

Thanks for your help. I'm sorry for the delay in responding to your
email; it was due to my internship.

> I know this is splitting hairs, but git-fsck(1) doesn't give us the
> tools to avoid corruption. It only gives us the tools to detect it after
> the fact.

I DID misunderstand `git-fsck(1)`.

This time, I have read more of the source code for `git-fsck` and the
ref internals, so I would like to discuss how the infrastructure could
be implemented.

I am inspired by `refs-internal.h`. This file declares `ref_storage_be`,
and every backend should implement interfaces such as
`ref_store_init_fn`, `ref_init_db_fn`, etc. `refs.h` then provides the
interfaces to other modules.

Based on this idea, I think we could create files in the `refs`
directory, including a header called `ref-check.h` in which we design
the interfaces for the different backends.

After that, we could compose this structure into `ref_storage_be` and
call these interfaces from `fsck.c`. If some backends need different
interfaces, we could downcast to the specific type to call the
backend-specific functions. (Actually, I have learned a lot about how
OOP is implemented in C.)
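
The design described above, a per-backend table of callbacks that a
generic caller dispatches through, can be sketched roughly as follows.
This is a Python illustration under stated assumptions, not Git code:
`RefStorageBackend`, its `fsck` slot, `fsck_refs`, and the in-memory
`state` dictionaries are all hypothetical stand-ins for a C struct of
function pointers and for on-disk data.

```python
from dataclasses import dataclass
from typing import Callable, List

HEX = set("0123456789abcdef")


@dataclass(frozen=True)
class RefStorageBackend:
    """Hypothetical analogue of a backend vtable with one extra slot."""
    name: str
    fsck: Callable[[dict], List[str]]  # state -> list of problems


def files_fsck(state):
    # Assumption: state["loose_refs"] maps ref names to the file content
    # with the trailing newline already stripped.
    problems = []
    for refname, value in sorted(state.get("loose_refs", {}).items()):
        is_symref = value.startswith("ref: ")
        is_oid = len(value) in (40, 64) and set(value) <= HEX
        if not (is_symref or is_oid):
            problems.append(f"{refname}: corrupted loose ref")
    return problems


def reftable_fsck(state):
    # Assumption: state["files"] lists table files found in "reftable/",
    # and state["tables_list"] holds the entries of "tables.list".
    listed = set(state.get("tables_list", []))
    return [f"{name}: not listed in tables.list"
            for name in sorted(state.get("files", []))
            if name not in listed]


BACKENDS = {
    backend.name: backend
    for backend in (RefStorageBackend("files", files_fsck),
                    RefStorageBackend("reftable", reftable_fsck))
}


def fsck_refs(backend_name, state):
    """Generic entry point: look up the backend and run its checks."""
    return BACKENDS[backend_name].fsck(state)
```

The caller never needs to know which backend it is talking to, so adding
a new backend or a new check only touches that backend's own function.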

> For what it's worth, not all of the checks need to be implemented as
> part of GSoC. At a minimum, it should result in the infra to allow for
> backend-specific checks and a couple of checks for at least one of the
> backends.

I think using the above idea, we could provide an infrastructure to allow
more checks later.

> You will certainly need to learn about ref internals a bit. There are
> some common rules and restrictions that are important in order to figure
> out what we want to check in the first place. Understanding the
> "reftable" format would be great, but you may also get away with only
> implementing generic or "files"-backend specific consistency checks.
> This depends on the scope you are aiming for.

I think I will at least implement the generic part and the files-backend
consistency checks. I will then read the reftable specs and source code.
If there is sufficient time available, I think I could implement all of
them. However, I am currently interning remotely, so my responses may be
slow.

* Re: Discuss GSoC: Implement consistency checks for refs
  2024-03-10 10:01 Discuss GSoC: Implement consistency checks for refs shejialuo
@ 2024-03-14  3:38 ` Kaartic Sivaraam
  0 siblings, 0 replies; 4+ messages in thread
From: Kaartic Sivaraam @ 2024-03-14  3:38 UTC (permalink / raw)
  To: ZeiBfVyTCHUywliI, shejialuo, Patrick Steinhardt; +Cc: git

Hi Jialuo,

Just wanted to chime in to mention one thing.

On 10 March 2024 3:31:35 pm IST, shejialuo <shejialuo@gmail.com> wrote:
>
>I think I will at least implement the generic part and the files-backend
>consistency checks. I will then read the reftable specs and source code.
>If there is sufficient time available, I think I could implement all of
>them. However, I am currently interning remotely, so my responses may be
>slow.
>

Thanks for mentioning this. If your current internship would overlap
with the GSoC period, kindly clarify the same in your proposal. Also,
if it does overlap kindly clarify the amount of time you'll be able to
allocate for the GSoC project in the proposal.

This would be helpful to set the expectations right.

-- 
Sivaraam

Sent from my Android device with K-9 Mail. Please excuse my brevity.
