git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* reftable: How to represent deleted referees of symrefs in the reflog?
@ 2023-11-21 13:46 Patrick Steinhardt
  2023-12-20 19:03 ` Han-Wen Nienhuys
  0 siblings, 1 reply; 3+ messages in thread
From: Patrick Steinhardt @ 2023-11-21 13:46 UTC (permalink / raw)
  To: git; +Cc: Han-Wen Nienhuys, Jonathan Nieder, Terry Parker

[-- Attachment #1: Type: text/plain, Size: 5004 bytes --]

Hi,

I'm currently trying to fully understand how exactly reflogs and reflog
entries are supposed to be deleted in the reftable backend. For ref
records it's easy enough, as explained in the technical documentation
for reftables that is part of our tree:

> Deletion of any reference can be explicitly stored by setting the type
> to 0x0 and omitting the value field of the ref_record. This serves as
> a tombstone, overriding any assertions about the existence of the
> reference from earlier files in the stack.

So you simply create a new ref record with type 0x0 and are done.

For log records it seems to be a bit of a different story though.
Again, quoting the technical documentation:

> The log_type = 0x0 is mostly useful for git stash drop, removing an
> entry from the reflog of refs/stash in a transaction file (below),
> without needing to rewrite larger files. Readers reading a stack of
> reflogs must treat this as a deletion.

To me it seems like deletions in this case only delete a particular log
entry instead of the complete log for a particular reference. And some
older discussion [1] seems to confirm my hunch that a complete reflog is
deleted not with `log_type = 0x0`, but instead by writing the null
object ID as new ID.

So: a single entry is deleted with `log_type = 0x0`, the complete reflog
entry is deleted with the null object ID as new ID. Fair enough, even
though the documentation could be updated to make this easier to
understand. I'll happily do so if my understanding is correct here.

In any case though, this proposed behaviour is not sufficient to cover
all cases that the files-based reflog supports. The following case may
be weird, but we do have tests for it in t1400:

```
 $ git init repo
Initialized empty Git repository in /tmp/repo/.git/
 $ cd repo/
 $ git commit --allow-empty --message initial-commit
[main (root-commit) 9f10b3f] initial-commit
 # Everything looks as expected up to this point.
 $ git reflog show HEAD
9f10b3f (HEAD -> main) HEAD@{0}: commit (initial): initial-commit

 # This behaviour is a bit more on the weird side. We delete the
 # referee, and that causes the files backend to claim that the reflog
 # for HEAD is gone, as well. The reflog still exists though, as
 # demonstrated in the next command.
 $ git update-ref -m delete-main -d refs/heads/main
 $ git reflog show HEAD
fatal: ambiguous argument 'HEAD': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'

 # We now re-create the referee, which revives the reflog _including_
 # the old entry.
 $ git update-ref -m recreate-main refs/heads/main 9f10b3f
 $ git reflog show HEAD
9f10b3f (HEAD -> main) HEAD@{0}: recreate-main
9f10b3f (HEAD -> main) HEAD@{2}: commit (initial): initial-commit

 $ cat .git/logs/HEAD
0000000000000000000000000000000000000000 9f10b3f9b20962690fdeff76cd592722fbf57deb Patrick Steinhardt <ps@pks.im> 1700573003 +0100	commit (initial): initial-commit
9f10b3f9b20962690fdeff76cd592722fbf57deb 0000000000000000000000000000000000000000 Patrick Steinhardt <ps@pks.im> 1700573060 +0100	delete-main
0000000000000000000000000000000000000000 9f10b3f9b20962690fdeff76cd592722fbf57deb Patrick Steinhardt <ps@pks.im> 1700573078 +0100	recreate-main
```

It kind of feels like the second step in the files backend where the
reflog is claimed to not exist is buggy -- I'd have expected to still
see the reflog, as the HEAD reference exists quite alright and has never
stopped to exist. And in the third step, I'd have expected to see three
reflog entries, including the deletion that is indeed still present in
the on-disk logfile.

But with the reftable backend the problem becomes worse: we cannot even
represent the fact that the reflog still exists, but that the deletion
of the referee has caused the HEAD to point to the null OID, because the
null OID indicates complete deletion of the reflog. Consequentially, if
we wrote the null OID, we'd only be able to see the last log entry here.

It may totally be that I'm missing something obvious here. But if not,
it leaves us with a couple of options for how to fix it:

    1. Disregard this edge case and accept that the reftable backend
       does not do exactly the same thing as the files backend in very
       specific situations like this.

    2. Change the reftable format so that it can discern these cases,
       e.g. by indicating deletion via a new log type.

    3. Change the files backend to adapt.

None of these alternatives feel particularly great to me. In my opinion
(2) is likely the best option, but would require us to change the format
format that's already in use by Google in the context of multiple
projects. So I'm not quite sure how thrilled you folks at Google and
other users of the reftable library are with this prospect.

Anyway, happy to hear about alternative takes or corrections.

Patrick

[1]: <CAFQ2z_P-cf38yR-VyvfDSgPUO_d38mgsi32UkRSPWMZKJOmjZg@mail.gmail.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: reftable: How to represent deleted referees of symrefs in the reflog?
  2023-11-21 13:46 reftable: How to represent deleted referees of symrefs in the reflog? Patrick Steinhardt
@ 2023-12-20 19:03 ` Han-Wen Nienhuys
  2023-12-28  8:14   ` Patrick Steinhardt
  0 siblings, 1 reply; 3+ messages in thread
From: Han-Wen Nienhuys @ 2023-12-20 19:03 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Jonathan Nieder, Terry Parker, Luca Milanesio

On Tue, Nov 21, 2023 at 2:46 PM Patrick Steinhardt <ps@pks.im> wrote:
> To me it seems like deletions in this case only delete a particular log
> entry instead of the complete log for a particular reference. And some
> older discussion [1] seems to confirm my hunch that a complete reflog is
> deleted not with `log_type = 0x0`, but instead by writing the null
> object ID as new ID.

No, writing a null OID (more precisely a transition from $SHA1 =>
$ZERO) means that a branch was at $SHA1, and then was deleted. The
reflog continues to exist, and new entries may be added by reviving
the branch. That would add a $ZERO => $NEWSHA transition, but the
history of the branch prior to its deletion is retained.

>  # This behaviour is a bit more on the weird side. We delete the
>  # referee, and that causes the files backend to claim that the reflog
>  # for HEAD is gone, as well. The reflog still exists though, as
>  # demonstrated in the next command.
>  $ git update-ref -m delete-main -d refs/heads/main
>  $ git reflog show HEAD
> fatal: ambiguous argument 'HEAD': unknown revision or path not in the working tree.

This looks wrong to me. HEAD has a history, and that history didn't go
away because the current branch happened to be deleted.

> It kind of feels like the second step in the files backend where the
> reflog is claimed to not exist is buggy -- I'd have expected to still

right, I agree.

> see the reflog, as the HEAD reference exists quite alright and has never
> stopped to exist. And in the third step, I'd have expected to see three
> reflog entries, including the deletion that is indeed still present in
> the on-disk logfile.
>
> But with the reftable backend the problem becomes worse: we cannot even
> represent the fact that the reflog still exists, but that the deletion
> of the referee has caused the HEAD to point to the null OID, because the
> null OID indicates complete deletion of the reflog.

This doesn't match my recollection. See
https://github.com/git/git/pull/1215, or more precisely
https://github.com/git/git/blob/3b2c5ddb28fa42a8c944373bea2ca756b1723908/refs/reftable-backend.c#L1582

Removing the entire reflog means removing all the individual entries
using tombstones.

> Consequentially, if
> we wrote the null OID, we'd only be able to see the last log entry here.
>
> It may totally be that I'm missing something obvious here. But if not,
> it leaves us with a couple of options for how to fix it:
>
>     1. Disregard this edge case and accept that the reftable backend
>        does not do exactly the same thing as the files backend in very
>        specific situations like this.

I remember discussing with Jun that it would be acceptable to have
slight semantic differences if unavoidable for the reflogs, and there
should be a record of this in the list.  I think there will always be
some differences: for example, dropping reflogs because of a dir/file
conflict seems like a bug rather than a feature.

>     2. Change the reftable format so that it can discern these cases,
>        e.g. by indicating deletion via a new log type.

This will be a bit messy, because it means that every read of the
reflog has to special case the "deletion marker" to make sure it is
absent, and on every reflog write, you have to create a tombstone for
the deletion marker, to make sure the reflog exists. Also, what if you
have something that looks like:

refs/main@1 : 0000... => 345....
refs/main@0xfffffff:  DELETION

the reflog doesn't exists ("DELETION"), and now it is recreated, ie.
the following is added:

refs/main@0xfffffff:  TOMBSTONE DELETION (make sure reflog exists)
refs/main@2 : 0000... => abc.... (the entry we want to write)

the result is that the reflog would also surface the first entry (@1 ,
000... => 345... ) again. To prevent that from happening, you have to
write tombstones for all entries if you "delete the reflog", but then
do you really need a separate existence marker?

> None of these alternatives feel particularly great to me. In my opinion
> (2) is likely the best option, but would require us to change the format
> format that's already in use by Google in the context of multiple
> projects. So I'm not quite sure how thrilled you folks at Google and
> other users of the reftable library are with this prospect.

Google uses reftable for datacenter-local serving. The reflog is
stored in a database, because it's never involved in serving traffic.
When I implemented reftable-on-filesystem, I found a few bugs in the
reflog code because it was never exercised at Google. Terry/Jonathan
may correct me if my knowledge is outdated.

Luca manages large Gerrit installations. He might know if anyone has
been using reftable in the wild.

HTH.

-- 
Han-Wen Nienhuys - hanwenn@gmail.com - http://www.xs4all.nl/~hanwen

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: reftable: How to represent deleted referees of symrefs in the reflog?
  2023-12-20 19:03 ` Han-Wen Nienhuys
@ 2023-12-28  8:14   ` Patrick Steinhardt
  0 siblings, 0 replies; 3+ messages in thread
From: Patrick Steinhardt @ 2023-12-28  8:14 UTC (permalink / raw)
  To: Han-Wen Nienhuys; +Cc: git, Jonathan Nieder, Terry Parker, Luca Milanesio

[-- Attachment #1: Type: text/plain, Size: 3802 bytes --]

On Wed, Dec 20, 2023 at 08:03:00PM +0100, Han-Wen Nienhuys wrote:
> On Tue, Nov 21, 2023 at 2:46 PM Patrick Steinhardt <ps@pks.im> wrote:
> > To me it seems like deletions in this case only delete a particular log
> > entry instead of the complete log for a particular reference. And some
> > older discussion [1] seems to confirm my hunch that a complete reflog is
> > deleted not with `log_type = 0x0`, but instead by writing the null
> > object ID as new ID.
> 
> No, writing a null OID (more precisely a transition from $SHA1 =>
> $ZERO) means that a branch was at $SHA1, and then was deleted. The
> reflog continues to exist, and new entries may be added by reviving
> the branch. That would add a $ZERO => $NEWSHA transition, but the
> history of the branch prior to its deletion is retained.
> 
> >  # This behaviour is a bit more on the weird side. We delete the
> >  # referee, and that causes the files backend to claim that the reflog
> >  # for HEAD is gone, as well. The reflog still exists though, as
> >  # demonstrated in the next command.
> >  $ git update-ref -m delete-main -d refs/heads/main
> >  $ git reflog show HEAD
> > fatal: ambiguous argument 'HEAD': unknown revision or path not in the working tree.
> 
> This looks wrong to me. HEAD has a history, and that history didn't go
> away because the current branch happened to be deleted.
> 
> > It kind of feels like the second step in the files backend where the
> > reflog is claimed to not exist is buggy -- I'd have expected to still
> 
> right, I agree.
> 
> > see the reflog, as the HEAD reference exists quite alright and has never
> > stopped to exist. And in the third step, I'd have expected to see three
> > reflog entries, including the deletion that is indeed still present in
> > the on-disk logfile.
> >
> > But with the reftable backend the problem becomes worse: we cannot even
> > represent the fact that the reflog still exists, but that the deletion
> > of the referee has caused the HEAD to point to the null OID, because the
> > null OID indicates complete deletion of the reflog.
> 
> This doesn't match my recollection. See
> https://github.com/git/git/pull/1215, or more precisely
> https://github.com/git/git/blob/3b2c5ddb28fa42a8c944373bea2ca756b1723908/refs/reftable-backend.c#L1582
> 
> Removing the entire reflog means removing all the individual entries
> using tombstones.

Yeah, indeed. I was misinterpreting older discussions here, and now use
individual tombstones which does work as expected. It's a bit
unfortunate that deletion of the reflog thus scales with its size, but
for now I think that's okay. We can still iterate on this if this ever
proves to become a problem.

> > Consequentially, if
> > we wrote the null OID, we'd only be able to see the last log entry here.
> >
> > It may totally be that I'm missing something obvious here. But if not,
> > it leaves us with a couple of options for how to fix it:
> >
> >     1. Disregard this edge case and accept that the reftable backend
> >        does not do exactly the same thing as the files backend in very
> >        specific situations like this.
> 
> I remember discussing with Jun that it would be acceptable to have
> slight semantic differences if unavoidable for the reflogs, and there
> should be a record of this in the list.  I think there will always be
> some differences: for example, dropping reflogs because of a dir/file
> conflict seems like a bug rather than a feature.

Initially I'm aiming for matching semantics, because this will make it a
ton easier to land the initial implementation of the backend. But I do
agree that eventually we may want to iterate and allow for diverging
behaviour between the backends.

Thanks!

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2023-12-28  8:14 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-11-21 13:46 reftable: How to represent deleted referees of symrefs in the reflog? Patrick Steinhardt
2023-12-20 19:03 ` Han-Wen Nienhuys
2023-12-28  8:14   ` Patrick Steinhardt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).