[PATCH] fsck: snapshot default refs before object walk

public inbox for git@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] fsck: snapshot default refs before object walk
@ 2025-12-29 19:12 Elijah Newren via GitGitGadget
  2025-12-30  0:45 ` Junio C Hamano
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Elijah Newren via GitGitGadget @ 2025-12-29 19:12 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Matthew John Cheetham

From: Matthew John Cheetham <mjcheetham@outlook.com>

Fsck has a race when operating on live repositories; consider the
following simple script that writes new commits as fsck runs:

    #!/bin/bash
    git fsck &
    PID=$!

    while ps -p $PID >/dev/null; do
        sleep 3
        git commit -q --allow-empty -m "Another commit"
    done

Since fsck reads refs at the beginning, walks those for connectivity,
and then reads the refs again at the end to check, this can cause fsck
to get confused and think that the new refs refer to missing commits and
that new reflog entries are invalid.  Running the above script in a
clone of git.git results in the following (output ellipsized to remove
additional errors of the same type):

    $ ./fsck-while-writing.sh
    Checking ref database: 100% (1/1), done.
    Checking object directories: 100% (256/256), done.
    warning in tag d6602ec5194c87b0fc87103ca4d67251c76f233a: missingTaggerEntry: invalid format - expected 'tagger' line
    Checking objects: 100% (835091/835091), done.
    error: HEAD: invalid reflog entry 2aac9f9286e2164fbf8e4f1d1df53044ace2b310
    error: HEAD: invalid reflog entry 2aac9f9286e2164fbf8e4f1d1df53044ace2b310
    error: HEAD: invalid reflog entry da0f5b80d61844a6f0ad2ddfd57e4fdfa246ea68
    error: HEAD: invalid reflog entry da0f5b80d61844a6f0ad2ddfd57e4fdfa246ea68
    [...]
    error: HEAD: invalid reflog entry 87c8a5c2f6b79d9afa9e941590b9a097b6f7ac09
    error: HEAD: invalid reflog entry d80887a48865e6ad165274b152cbbbed29f8a55a
    error: HEAD: invalid reflog entry d80887a48865e6ad165274b152cbbbed29f8a55a
    error: HEAD: invalid reflog entry 6724f2dfede88bfa9445a333e06e78536c0c6c0d
    error: refs/heads/mybranch invalid reflog entry 2aac9f9286e2164fbf8e4f1d1df53044ace2b310
    error: refs/heads/mybranch: invalid reflog entry 2aac9f9286e2164fbf8e4f1d1df53044ace2b310
    error: refs/heads/mybranch: invalid reflog entry da0f5b80d61844a6f0ad2ddfd57e4fdfa246ea68
    error: refs/heads/mybranch: invalid reflog entry da0f5b80d61844a6f0ad2ddfd57e4fdfa246ea68
    [...]
    error: refs/heads/mybranch: invalid reflog entry 87c8a5c2f6b79d9afa9e941590b9a097b6f7ac09
    error: refs/heads/mybranch: invalid reflog entry d80887a48865e6ad165274b152cbbbed29f8a55a
    error: refs/heads/mybranch: invalid reflog entry d80887a48865e6ad165274b152cbbbed29f8a55a
    error: refs/heads/mybranch: invalid reflog entry 6724f2dfede88bfa9445a333e06e78536c0c6c0d
    Checking connectivity: 833846, done.
    missing commit 6724f2dfede88bfa9445a333e06e78536c0c6c0d
    Verifying commits in commit graph: 100% (242243/242243), done.

This problem doesn't occur when refs are specified on the command line
for us to check, since we use those specified refs for both walking and
checking.  Using the same refs for walking and checking seems to just
make sense, so modify the existing code to do the same when refs aren't
specified.  Snapshot the refs at the beginning, and also ignore all
reflog entries since the time of our snapshot (while this technically
means we could ignore a reflog entry created before the fsck process
if the local clock is weird, since reflogs are local-only there are not
concerns about differences between clocks on different machines).  This
combination of changes modifies the output of running the above script
to:

    $ ./fsck-while-writing.sh
    Checking ref database: 100% (1/1), done.
    Checking object directories: 100% (256/256), done.
    warning in tag d6602ec5194c87b0fc87103ca4d67251c76f233a: missingTaggerEntry: invalid format - expected 'tagger' line
    Checking objects: 100% (835091/835091), done.
    Checking connectivity: 833846, done.
    Verifying commits in commit graph: 100% (242243/242243), done.

While worries about live updates while running fsck is likely of most
interest for forge operators, it will likely also benefit those with
automated jobs (such as git maintenance) or even casual users who want
to do other work in their clone while fsck is running.

Signed-off-by: Matthew John Cheetham <mjcheetham@outlook.com>
Co-authored-by: Elijah Newren <newren@gmail.com>
[en: several changes:
  * adjusted for upstream refactorings to refs callback call signatures
  * handle reflogs as well
  * free recorded snapshot of refs when done
  * default to snapshotting instead of making it a non-default option
  * provide reproducible testcase in commit message and rewrite commit
    message around it
]
Signed-off-by: Elijah Newren <newren@gmail.com>
---
    fsck: snapshot default refs before object walk

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-2026%2Fnewren%2Ffsck-snapshot-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-2026/newren/fsck-snapshot-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/2026

 builtin/fsck.c | 74 +++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 70 insertions(+), 4 deletions(-)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index c489582faa..8d20505f5d 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -51,6 +51,7 @@ static int show_progress = -1;
 static int show_dangling = 1;
 static int name_objects;
 static int check_references = 1;
+static timestamp_t now;
 #define ERROR_OBJECT 01
 #define ERROR_REACHABLE 02
 #define ERROR_PACK 04
@@ -509,6 +510,9 @@ static int fsck_handle_reflog_ent(const char *refname,
 				  timestamp_t timestamp, int tz UNUSED,
 				  const char *message UNUSED, void *cb_data UNUSED)
 {
+	if (now && timestamp > now)
+		return 0;
+
 	if (verbose)
 		fprintf_ln(stderr, _("Checking reflog %s->%s"),
 			   oid_to_hex(ooid), oid_to_hex(noid));
@@ -567,14 +571,53 @@ static int fsck_head_link(const char *head_ref_name,
 			  const char **head_points_at,
 			  struct object_id *head_oid);
 
-static void get_default_heads(void)
+struct ref_snapshot {
+	size_t nr;
+	size_t name_alloc;
+	size_t oid_alloc;
+	char **refname;
+	struct object_id *oid;
+};
+
+static int snapshot_refs(const struct reference *ref, void *cb_data)
+{
+	struct ref_snapshot *refs = cb_data;
+
+	ALLOC_GROW(refs->refname, refs->nr + 1, refs->name_alloc);
+	ALLOC_GROW(refs->oid, refs->nr + 1, refs->oid_alloc);
+
+	refs->refname[refs->nr] = xstrdup(ref->name);
+	oidcpy(&refs->oid[refs->nr], ref->oid);
+	refs->nr++;
+
+	return 0;
+}
+
+static void free_snapshot_refs(struct ref_snapshot *snapshot)
+{
+	for (size_t i = 0; i < snapshot->nr; i++)
+		free(snapshot->refname[i]);
+	free(snapshot->refname);
+	free(snapshot->oid);
+}
+
+static void get_default_heads(struct ref_snapshot *the_refs)
 {
 	struct worktree **worktrees, **p;
 	const char *head_points_at;
 	struct object_id head_oid;
 
-	refs_for_each_rawref(get_main_ref_store(the_repository),
-			     fsck_handle_ref, NULL);
+	if (the_refs)
+		for (size_t i = 0; i < the_refs->nr; i++) {
+			struct reference ref = {
+				.name = the_refs->refname[i],
+				.oid = &the_refs->oid[i],
+			};
+			fsck_handle_ref(&ref, NULL);
+		}
+	else
+		refs_for_each_rawref(get_main_ref_store(the_repository),
+				     fsck_handle_ref, NULL);
 
 	worktrees = get_worktrees();
 	for (p = worktrees; *p; p++) {
@@ -964,6 +1007,14 @@ int cmd_fsck(int argc,
 {
 	int i;
 	struct odb_source *source;
+	struct ref_snapshot default_refs_snapshot = {
+		.nr = 0,
+		.name_alloc = 0,
+		.oid_alloc = 0,
+		.refname = NULL,
+		.oid = NULL
+	};
+	bool use_snapshot;
 
 	/* fsck knows how to handle missing promisor objects */
 	fetch_if_missing = 0;
@@ -999,6 +1050,19 @@ int cmd_fsck(int argc,
 	if (check_references)
 		fsck_refs(the_repository);
 
+	/*
+	 * Take a snapshot of the refs before walking objects to avoid looking
+	 * at a set of refs that may be changed by the user while we are walking
+	 * objects. We can still walk over new objects that are added during the
+	 * execution of fsck but won't miss any objects that were reachable.
+	 */
+	use_snapshot = !argc;
+	if (use_snapshot) {
+		now = time(NULL);
+		refs_for_each_rawref(get_main_ref_store(the_repository),
+				     snapshot_refs, &default_refs_snapshot);
+	}
+
 	if (connectivity_only) {
 		for_each_loose_object(the_repository->objects,
 				      mark_loose_for_connectivity, NULL, 0);
@@ -1071,7 +1135,7 @@ int cmd_fsck(int argc,
 	 * in this case (ie this implies --cache).
 	 */
 	if (!argc) {
-		get_default_heads();
+		get_default_heads(use_snapshot ? &default_refs_snapshot : NULL);
 		keep_cache_objects = 1;
 	}
 
@@ -1148,5 +1212,7 @@ int cmd_fsck(int argc,
 		}
 	}
 
+	if (use_snapshot)
+		free_snapshot_refs(&default_refs_snapshot);
 	return errors_found;
 }

base-commit: b31ab939fe8e3cbe8be48dddd1c6ac0265991f45
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] fsck: snapshot default refs before object walk
  2025-12-29 19:12 [PATCH] fsck: snapshot default refs before object walk Elijah Newren via GitGitGadget
@ 2025-12-30  0:45 ` Junio C Hamano
  2026-01-06 23:19   ` Elijah Newren
  2026-01-02  5:49 ` Jeff King
  2026-01-07  1:29 ` [PATCH v2] " Elijah Newren via GitGitGadget
  2 siblings, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2025-12-30  0:45 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget; +Cc: git, Elijah Newren, Matthew John Cheetham

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This problem doesn't occur when refs are specified on the command line
> for us to check, since we use those specified refs for both walking and
> checking.  Using the same refs for walking and checking seems to just
> make sense, so modify the existing code to do the same when refs aren't
> specified.

Excellent analysis and good approach.

> Snapshot the refs at the beginning, and also ignore all
> reflog entries since the time of our snapshot (while this technically
> means we could ignore a reflog entry created before the fsck process
> if the local clock is weird, since reflogs are local-only there are not
> concerns about differences between clocks on different machines).

Repository on a network filesystem being accessed by hosts with
broken clock?

I do not think our reflog API has (1) give me some token to mark
your current state (2) here is the token you gave me earlier, now
iterate and yield entries but ignore entries added after you gave me
that token, so going by the reflog timestamp is probably the best we
could do.  Any approach may get confused when the user tries to be
cute and issues "reflog delete" or "reflog expire" in the middle
anyway, I suspect ;-)

> While worries about live updates while running fsck is likely of most
> interest for forge operators, it will likely also benefit those with
> automated jobs (such as git maintenance) or even casual users who want
> to do other work in their clone while fsck is running.

Great.  Will queue.  Thanks.

> @@ -509,6 +510,9 @@ static int fsck_handle_reflog_ent(const char *refname,
>  				  timestamp_t timestamp, int tz UNUSED,
>  				  const char *message UNUSED, void *cb_data UNUSED)
>  {
> +	if (now && timestamp > now)
> +		return 0;
> +
>  	if (verbose)
>  		fprintf_ln(stderr, _("Checking reflog %s->%s"),
>  			   oid_to_hex(ooid), oid_to_hex(noid));
> @@ -567,14 +571,53 @@ static int fsck_head_link(const char *head_ref_name,
>  			  const char **head_points_at,
>  			  struct object_id *head_oid);
>  
> -static void get_default_heads(void)
> +struct ref_snapshot {
> +	size_t nr;
> +	size_t name_alloc;
> +	size_t oid_alloc;
> +	char **refname;
> +	struct object_id *oid;
> +};

This data structure is somewhat unexpected.  Instead of a struct
that holds two arrays, I would have rather expected an array of
"struct { refname, oid }", with the possiblity to add a "token to
mark the latest reflog entry" to the mix I alluded to earlier when
such an API function materializes.

[Footnote]

We could call refs_for_each_reflog_ent_reverse(), grab the
parameters that each_reflog_ent_fn receives as that "token" for the
latest reflog entry and stop.  That way, we will learn the value of
<old,new,committer,timestamp,tz,msg>, which should be a robust
enough unique key.

After that when iterating over the reflog, we know we should stop
after processing the reflog entry that holds the recorded value.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] fsck: snapshot default refs before object walk
  2025-12-30  0:45 ` Junio C Hamano
@ 2026-01-06 23:19   ` Elijah Newren
  2026-01-09 13:11     ` Matthew John Cheetham
  0 siblings, 1 reply; 11+ messages in thread
From: Elijah Newren @ 2026-01-06 23:19 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Elijah Newren via GitGitGadget, git, Matthew John Cheetham

On Mon, Dec 29, 2025 at 4:46 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > This problem doesn't occur when refs are specified on the command line
> > for us to check, since we use those specified refs for both walking and
> > checking.  Using the same refs for walking and checking seems to just
> > make sense, so modify the existing code to do the same when refs aren't
> > specified.
>
> Excellent analysis and good approach.
>
> > Snapshot the refs at the beginning, and also ignore all
> > reflog entries since the time of our snapshot (while this technically
> > means we could ignore a reflog entry created before the fsck process
> > if the local clock is weird, since reflogs are local-only there are not
> > concerns about differences between clocks on different machines).
>
> Repository on a network filesystem being accessed by hosts with
> broken clock?

Oh, indeed.

> I do not think our reflog API has (1) give me some token to mark
> your current state (2) here is the token you gave me earlier, now
> iterate and yield entries but ignore entries added after you gave me
> that token, so going by the reflog timestamp is probably the best we
> could do.  Any approach may get confused when the user tries to be
> cute and issues "reflog delete" or "reflog expire" in the middle
> anyway, I suspect ;-)
>
> > While worries about live updates while running fsck is likely of most
> > interest for forge operators, it will likely also benefit those with
> > automated jobs (such as git maintenance) or even casual users who want
> > to do other work in their clone while fsck is running.
>
> Great.  Will queue.  Thanks.
>
> > @@ -509,6 +510,9 @@ static int fsck_handle_reflog_ent(const char *refname,
> >                                 timestamp_t timestamp, int tz UNUSED,
> >                                 const char *message UNUSED, void *cb_data UNUSED)
> >  {
> > +     if (now && timestamp > now)
> > +             return 0;
> > +
> >       if (verbose)
> >               fprintf_ln(stderr, _("Checking reflog %s->%s"),
> >                          oid_to_hex(ooid), oid_to_hex(noid));
> > @@ -567,14 +571,53 @@ static int fsck_head_link(const char *head_ref_name,
> >                         const char **head_points_at,
> >                         struct object_id *head_oid);
> >
> > -static void get_default_heads(void)
> > +struct ref_snapshot {
> > +     size_t nr;
> > +     size_t name_alloc;
> > +     size_t oid_alloc;
> > +     char **refname;
> > +     struct object_id *oid;
> > +};
>
> This data structure is somewhat unexpected.  Instead of a struct
> that holds two arrays, I would have rather expected an array of
> "struct { refname, oid }", with the possiblity to add a "token to
> mark the latest reflog entry" to the mix I alluded to earlier when
> such an API function materializes.

Yeah, that makes sense.  It'll mean that there won't be anything left
of Matthew's original patch that I was trying to upstream (especially
with the further changes Peff highlighted elsewhere in this thread),
but I can just take the authorship and note Matthew's contribution in
a trailer.

> [Footnote]
>
> We could call refs_for_each_reflog_ent_reverse(), grab the
> parameters that each_reflog_ent_fn receives as that "token" for the
> latest reflog entry and stop.  That way, we will learn the value of
> <old,new,committer,timestamp,tz,msg>, which should be a robust
> enough unique key.
>
> After that when iterating over the reflog, we know we should stop
> after processing the reflog entry that holds the recorded value.

Interesting.  The global timestamp for reflogs seems good enough for
me (network filesystems with a broken clock feel niche to me), but I
can leave a TODO in the code for those that want to pursue improving
the reflog handling further.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] fsck: snapshot default refs before object walk
  2026-01-06 23:19   ` Elijah Newren
@ 2026-01-09 13:11     ` Matthew John Cheetham
  0 siblings, 0 replies; 11+ messages in thread
From: Matthew John Cheetham @ 2026-01-09 13:11 UTC (permalink / raw)
  To: Elijah Newren, Junio C Hamano; +Cc: Elijah Newren via GitGitGadget, git

On 2026-01-06 23:19, Elijah Newren wrote:

> Yeah, that makes sense.  It'll mean that there won't be anything left
> of Matthew's original patch that I was trying to upstream (especially
> with the further changes Peff highlighted elsewhere in this thread),
> but I can just take the authorship and note Matthew's contribution in
> a trailer.

Please don't bother including any attributions on this. It was something
I was unfortunately abrubptly tasked with when I was at GitHub, which I 
no longer am.

Thanks,
Matthew


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] fsck: snapshot default refs before object walk
  2025-12-29 19:12 [PATCH] fsck: snapshot default refs before object walk Elijah Newren via GitGitGadget
  2025-12-30  0:45 ` Junio C Hamano
@ 2026-01-02  5:49 ` Jeff King
  2026-01-06 23:36   ` Elijah Newren
  2026-01-07  1:29 ` [PATCH v2] " Elijah Newren via GitGitGadget
  2 siblings, 1 reply; 11+ messages in thread
From: Jeff King @ 2026-01-02  5:49 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget; +Cc: git, Elijah Newren, Matthew John Cheetham

On Mon, Dec 29, 2025 at 07:12:29PM +0000, Elijah Newren via GitGitGadget wrote:

> Fsck has a race when operating on live repositories; consider the
> following simple script that writes new commits as fsck runs:
> 
>     #!/bin/bash
>     git fsck &
>     PID=$!
> 
>     while ps -p $PID >/dev/null; do
>         sleep 3
>         git commit -q --allow-empty -m "Another commit"
>     done
> 
> Since fsck reads refs at the beginning, walks those for connectivity,
> and then reads the refs again at the end to check, this can cause fsck
> to get confused and think that the new refs refer to missing commits and
> that new reflog entries are invalid.

I'm not sure if this is entirely accurate. Does fsck read refs at the
beginning? I think it walks over everything in the object database
without regard to connectivity, marking each with the HAS_OBJ flag. And
then we use the refs to find the reachable objects, making sure any that
are reachable also have HAS_OBJ set (otherwise they are missing).

So the race is really:

  1. We look at all of the packs and loose objects to set HAS_OBJ.

  2. Somebody else simultaneously adds a new object, which is missed by
     our step 1, and updates a ref to point to it.

  3. We look at the refs for reachability, see the new object, but
     think it is missing because we never saw it in step 1.

And your fix is to snapshot the refs as a "step 0", and use that
snapshot in step 3. So any new objects that are introduced after step 1
will never be referenced, since we are using the snapshot values.

Which makes sense as long as we assume objects are only added to the
repository. I think we'd now have the opposite direction race:

  0. We snapshot the refs.

  1. Somebody else deletes a ref, and then does a pruning git-gc which
     deletes the object it pointed to.

  2. We look at all of the objects and mark them as HAS_OBJ. We do not
     include the now-deleted object.

  3. We do a connectivity check with the snapshot, and are dismayed to
     find that the deleted object (which we believe is still referenced)
     has gone away.

I think you could argue that this is a much more preferable race,
though. A busy server will see lots of new objects introduced and refs
updated, and you do not want to have a stop-the-world lock that prevents
pushes. But it is much less common to do a pruning gc, and it is
probably OK to have a mutually exclusive lock between fsck and gc.

> This problem doesn't occur when refs are specified on the command line
> for us to check, since we use those specified refs for both walking and
> checking.  Using the same refs for walking and checking seems to just
> make sense, so modify the existing code to do the same when refs aren't
> specified.

So I don't think this part is quite right either, then. We're not using
the command-line arguments collect the set of objects in the repo. That
still happens by walking over the odb itself. So if I do:

  git fsck --no-dangling HEAD

in git.git, and while it is running, do this in another terminal:

  git commit --allow-empty -m foo

then I get:

  error: 49a90c19d0cd010ed00fbb1e4256cbefaa8b83e2: object missing

even with your patch. So I think that names given on the command-line
could benefit from this type of snapshot, because they suffer from the
same race. You want to lock in the ref resolution (whether from
iterating or from names on the command-line) before you start walking
over the odb.

  Side note: I do not think I have ever run fsck with refs on the
  command-line. It is not like it saves you any time! Most of the
  expense comes from opening up and verifying the objects in the first
  step, not from looking at ref reachability.

And one final note on the overall direction of the patch. We are
assuming that if we look at the refs first and then the odb second, that
we will be getting a "fresh" view of the odb in that second step. But
that isn't necessarily so, as we might have loaded the set of packs
earlier in the process. I don't know if it is possible to trigger that
during fsck or not, but certainly it is relying on a subtle assumption.
It probably is worth calling odb_reprepare() after taking the snapshot
to ensure we are not getting any results cached from before the snapshot
was taken.

> +struct ref_snapshot {
> +	size_t nr;
> +	size_t name_alloc;
> +	size_t oid_alloc;
> +	char **refname;
> +	struct object_id *oid;
> +};

Minor nit, but: why keep two arrays and not a single struct with both?
After all, you even end up sticking them back in a struct at the only
point of use:

> +	if (the_refs)
> +		for (size_t i = 0; i < the_refs->nr; i++) {
> +			struct reference ref = {
> +				.name = the_refs->refname[i],
> +				.oid = &the_refs->oid[i],
> +			};
> +			fsck_handle_ref(&ref, NULL);
> +		}

So this could really just be an array of "struct reference". You can't
just hold onto the "struct reference" passed in to the snapshot_refs()
callback (because it gets reused as we iterate), but you could do a deep
copy.

That did make me wonder a bit about the other fields in "struct
reference" (which your snapshot just throws away). But it looks like
fsck_handle_ref() only cares about the name and oid, so it is OK.

> @@ -999,6 +1050,19 @@ int cmd_fsck(int argc,
>  	if (check_references)
>  		fsck_refs(the_repository);
>  
> +	/*
> +	 * Take a snapshot of the refs before walking objects to avoid looking
> +	 * at a set of refs that may be changed by the user while we are walking
> +	 * objects. We can still walk over new objects that are added during the
> +	 * execution of fsck but won't miss any objects that were reachable.
> +	 */
> +	use_snapshot = !argc;
> +	if (use_snapshot) {
> +		now = time(NULL);
> +		refs_for_each_rawref(get_main_ref_store(the_repository),
> +				     snapshot_refs, &default_refs_snapshot);
> +	}

BTW, one of the reasons I started looking at this is that Coverity
complained about this segment of code. We set use_snapshot if and only
if we don't have any argc arguments. And then later...

>  	if (!argc) {
> -		get_default_heads();
> +		get_default_heads(use_snapshot ? &default_refs_snapshot : NULL);
>  		keep_cache_objects = 1;
>  	}

...we enter this block only if argc is zero. So we know that
use_snapshot will be true here, and the NULL path (and thus the fallback
code in get_default_heads()) will never be used.

That's not wrong exactly, as it's "just" dead code. But it was what led
me to thinking about whether the case of non-zero argc would benefit
from the snapshot, too.

There are a few other related interesting cases, too:

  - We may use the index file for connectivity, as well. It suffers from
    the same race, and would benefit from a snapshot.

  - In get_default_heads() we also look at worktree HEADs. Those have
    the same race (their normal refs we don't consider here, because
    they were already handled by the overall ref iteration).

I know that neither of those is of particular interest to you as a bare
server repo would have neither. And it may be OK not to handle them, if
the complexity doesn't merit it. But it might be worth documenting the
short-coming.

-Peff

PS The other reason I looked at your patch is that I got deja vu from
   all of this. I thought we had discussed ref snapshotting for fsck
   before, but I couldn't find anything on the list. It may have been
   internal GitHub discussions.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] fsck: snapshot default refs before object walk
  2026-01-02  5:49 ` Jeff King
@ 2026-01-06 23:36   ` Elijah Newren
  2026-01-14 22:23     ` Jeff King
  0 siblings, 1 reply; 11+ messages in thread
From: Elijah Newren @ 2026-01-06 23:36 UTC (permalink / raw)
  To: Jeff King; +Cc: Elijah Newren via GitGitGadget, git, Matthew John Cheetham

On Thu, Jan 1, 2026 at 9:49 PM Jeff King <peff@peff.net> wrote:
>
> On Mon, Dec 29, 2025 at 07:12:29PM +0000, Elijah Newren via GitGitGadget wrote:
>
> > Fsck has a race when operating on live repositories; consider the
> > following simple script that writes new commits as fsck runs:
> >
> >     #!/bin/bash
> >     git fsck &
> >     PID=$!
> >
> >     while ps -p $PID >/dev/null; do
> >         sleep 3
> >         git commit -q --allow-empty -m "Another commit"
> >     done
> >
> > Since fsck reads refs at the beginning, walks those for connectivity,
> > and then reads the refs again at the end to check, this can cause fsck
> > to get confused and think that the new refs refer to missing commits and
> > that new reflog entries are invalid.
>
> I'm not sure if this is entirely accurate. Does fsck read refs at the
> beginning? I think it walks over everything in the object database
> without regard to connectivity, marking each with the HAS_OBJ flag. And
> then we use the refs to find the reachable objects, making sure any that
> are reachable also have HAS_OBJ set (otherwise they are missing).

Oh, indeed; I got that wrong.

> So the race is really:
>
>   1. We look at all of the packs and loose objects to set HAS_OBJ.
>
>   2. Somebody else simultaneously adds a new object, which is missed by
>      our step 1, and updates a ref to point to it.
>
>   3. We look at the refs for reachability, see the new object, but
>      think it is missing because we never saw it in step 1.

Yep.

> And your fix is to snapshot the refs as a "step 0", and use that
> snapshot in step 3. So any new objects that are introduced after step 1
> will never be referenced, since we are using the snapshot values.
>
> Which makes sense as long as we assume objects are only added to the
> repository. I think we'd now have the opposite direction race:
>
>   0. We snapshot the refs.
>
>   1. Somebody else deletes a ref, and then does a pruning git-gc which
>      deletes the object it pointed to.
>
>   2. We look at all of the objects and mark them as HAS_OBJ. We do not
>      include the now-deleted object.
>
>   3. We do a connectivity check with the snapshot, and are dismayed to
>      find that the deleted object (which we believe is still referenced)
>      has gone away.
>
> I think you could argue that this is a much more preferable race,
> though. A busy server will see lots of new objects introduced and refs
> updated, and you do not want to have a stop-the-world lock that prevents
> pushes. But it is much less common to do a pruning gc, and it is
> probably OK to have a mutually exclusive lock between fsck and gc.

Indeed.  I'll add some comments to that effect.

> > This problem doesn't occur when refs are specified on the command line
> > for us to check, since we use those specified refs for both walking and
> > checking.  Using the same refs for walking and checking seems to just
> > make sense, so modify the existing code to do the same when refs aren't
> > specified.
>
> So I don't think this part is quite right either, then. We're not using
> the command-line arguments collect the set of objects in the repo. That
> still happens by walking over the odb itself. So if I do:
>
>   git fsck --no-dangling HEAD
>
> in git.git, and while it is running, do this in another terminal:
>
>   git commit --allow-empty -m foo
>
> then I get:
>
>   error: 49a90c19d0cd010ed00fbb1e4256cbefaa8b83e2: object missing
>
> even with your patch. So I think that names given on the command-line
> could benefit from this type of snapshot, because they suffer from the
> same race. You want to lock in the ref resolution (whether from
> iterating or from names on the command-line) before you start walking
> over the odb.

Good point.  I added this snapshotting in v2.

>   Side note: I do not think I have ever run fsck with refs on the
>   command-line. It is not like it saves you any time! Most of the
>   expense comes from opening up and verifying the objects in the first
>   step, not from looking at ref reachability.

Not to mention it produces spurious "dangling" object warnings,
because while the objects might be reachable, they aren't necessarily
reachable from the particular subset you specified on the command
line.  I wonder if no one ever noticed that because it's such a
useless mode; I only noticed it because you pointed out how Matthew
and I overlooked races with command-line arguments.

> And one final note on the overall direction of the patch. We are
> assuming that if we look at the refs first and then the odb second, that
> we will be getting a "fresh" view of the odb in that second step. But
> that isn't necessarily so, as we might have loaded the set of packs
> earlier in the process. I don't know if it is possible to trigger that
> during fsck or not, but certainly it is relying on a subtle assumption.
> It probably is worth calling odb_reprepare() after taking the snapshot
> to ensure we are not getting any results cached from before the snapshot
> was taken.

Will add.

> > +struct ref_snapshot {
> > +     size_t nr;
> > +     size_t name_alloc;
> > +     size_t oid_alloc;
> > +     char **refname;
> > +     struct object_id *oid;
> > +};
>
> Minor nit, but: why keep two arrays and not a single struct with both?
> After all, you even end up sticking them back in a struct at the only
> point of use:
>
> > +     if (the_refs)
> > +             for (size_t i = 0; i < the_refs->nr; i++) {
> > +                     struct reference ref = {
> > +                             .name = the_refs->refname[i],
> > +                             .oid = &the_refs->oid[i],
> > +                     };
> > +                     fsck_handle_ref(&ref, NULL);
> > +             }
>
> So this could really just be an array of "struct reference". You can't
> just hold onto the "struct reference" passed in to the snapshot_refs()
> callback (because it gets reused as we iterate), but you could do a deep
> copy.
>
> That did make me wonder a bit about the other fields in "struct
> reference" (which your snapshot just throws away). But it looks like
> fsck_handle_ref() only cares about the name and oid, so it is OK.

Junio suggested the same thing, although he also suggested we might
want to snapshot some reflog information at the same time, which then
wouldn't make sense to be using a struct reference.  Even though I'm
not implementing per-reflog snapshotting, I left a comment in the code
about it so I think it made sense to just create my own data structure
with just the name and oid.

> > @@ -999,6 +1050,19 @@ int cmd_fsck(int argc,
> >       if (check_references)
> >               fsck_refs(the_repository);
> >
> > +     /*
> > +      * Take a snapshot of the refs before walking objects to avoid looking
> > +      * at a set of refs that may be changed by the user while we are walking
> > +      * objects. We can still walk over new objects that are added during the
> > +      * execution of fsck but won't miss any objects that were reachable.
> > +      */
> > +     use_snapshot = !argc;
> > +     if (use_snapshot) {
> > +             now = time(NULL);
> > +             refs_for_each_rawref(get_main_ref_store(the_repository),
> > +                                  snapshot_refs, &default_refs_snapshot);
> > +     }
>
> BTW, one of the reasons I started looking at this is that Coverity
> complained about this segment of code. We set use_snapshot if and only
> if we don't have any argc arguments. And then later...
>
> >       if (!argc) {
> > -             get_default_heads();
> > +             get_default_heads(use_snapshot ? &default_refs_snapshot : NULL);
> >               keep_cache_objects = 1;
> >       }
>
> ...we enter this block only if argc is zero. So we know that
> use_snapshot will be true here, and the NULL path (and thus the fallback
> code in get_default_heads()) will never be used.
>
> That's not wrong exactly, as it's "just" dead code. But it was what led
> me to thinking about whether the case of non-zero argc would benefit
> from the snapshot, too.

Yeah, I needed to restructure this anyway to handle snapshotting
command line arguments, so I just cleaned it all up.  Thanks for
reading carefully.

> There are a few other related interesting cases, too:
>
>   - We may use the index file for connectivity, as well. It suffers from
>     the same race, and would benefit from a snapshot.

I left a couple TODO comments about this, so that those who are
interested/motivated can extend the snapshotting further.

>   - In get_default_heads() we also look at worktree HEADs. Those have
>     the same race (their normal refs we don't consider here, because
>     they were already handled by the overall ref iteration).

I handled these in my newer version, since handling them is pretty
similar to handling command line arguments.

> I know that neither of those is of particular interest to you as a bare
> server repo would have neither. And it may be OK not to handle them, if
> the complexity doesn't merit it. But it might be worth documenting the
> short-coming.

Yep, absolutely.  Thanks for reading so carefully.

> -Peff
>
> PS The other reason I looked at your patch is that I got deja vu from
>    all of this. I thought we had discussed ref snapshotting for fsck
>    before, but I couldn't find anything on the list. It may have been
>    internal GitHub discussions.

It likely was.  This was based on a patch that's in GitHub's fork of
git, which tripped up Michael Haggerty recently -- in particular, he
thought this fsck race had been "fixed", but was tripped up both by
the fact that it was a non-default option and that it was only in our
fork.  I volunteered to try to fix both issues, and heavily overhauled
the patch in v1, and will have completely rewritten the original by
v2.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] fsck: snapshot default refs before object walk
  2026-01-06 23:36   ` Elijah Newren
@ 2026-01-14 22:23     ` Jeff King
  0 siblings, 0 replies; 11+ messages in thread
From: Jeff King @ 2026-01-14 22:23 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Elijah Newren via GitGitGadget, git, Matthew John Cheetham

On Tue, Jan 06, 2026 at 03:36:57PM -0800, Elijah Newren wrote:

> >   Side note: I do not think I have ever run fsck with refs on the
> >   command-line. It is not like it saves you any time! Most of the
> >   expense comes from opening up and verifying the objects in the first
> >   step, not from looking at ref reachability.
> 
> Not to mention it produces spurious "dangling" object warnings,
> because while the objects might be reachable, they aren't necessarily
> reachable from the particular subset you specified on the command
> line.  I wonder if no one ever noticed that because it's such a
> useless mode; I only noticed it because you pointed out how Matthew
> and I overlooked races with command-line arguments.

Yep. It would be nice to get rid of it, but of course I worry about
missing some obscure use case that somebody relies on. Anyway, way out
of scope for this patch.

> > That did make me wonder a bit about the other fields in "struct
> > reference" (which your snapshot just throws away). But it looks like
> > fsck_handle_ref() only cares about the name and oid, so it is OK.
> 
> Junio suggested the same thing, although he also suggested we might
> want to snapshot some reflog information at the same time, which then
> wouldn't make sense to be using a struct reference.  Even though I'm
> not implementing per-reflog snapshotting, I left a comment in the code
> about it so I think it made sense to just create my own data structure
> with just the name and oid.

I do wonder how expensive it would be to just snapshot all of the
reflogs. It is a shame, because 99% of the time entry N will be
reachable from N+1 (and thus does not need to be stored), but we don't
know that until we traverse.

So we are potentially storing the oid of every reachable object in the
repository in memory as a snapshot. But...isn't that exactly what we'll
do anyway when we traverse, in the form of "struct object"?

Anyway, I can live with your timestamp-based hack for snapshotting the
reflogs.

> > There are a few other related interesting cases, too:
> >
> >   - We may use the index file for connectivity, as well. It suffers from
> >     the same race, and would benefit from a snapshot.
> 
> I left a couple TODO comments about this, so that those who are
> interested/motivated can extend the snapshotting further.
> 
> >   - In get_default_heads() we also look at worktree HEADs. Those have
> >     the same race (their normal refs we don't consider here, because
> >     they were already handled by the overall ref iteration).
> 
> I handled these in my newer version, since handling them is pretty
> similar to handling command line arguments.

Both sound like reasonable directions to me. I'll take a look at your
latest version.

-Peff

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v2] fsck: snapshot default refs before object walk
  2025-12-29 19:12 [PATCH] fsck: snapshot default refs before object walk Elijah Newren via GitGitGadget
  2025-12-30  0:45 ` Junio C Hamano
  2026-01-02  5:49 ` Jeff King
@ 2026-01-07  1:29 ` Elijah Newren via GitGitGadget
  2026-01-09 17:49   ` [PATCH v3] " Elijah Newren via GitGitGadget
  2 siblings, 1 reply; 11+ messages in thread
From: Elijah Newren via GitGitGadget @ 2026-01-07  1:29 UTC (permalink / raw)
  To: git
  Cc: Matthew John Cheetham, Jeff King, Elijah Newren, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

Fsck has a race when operating on live repositories; consider the
following simple script that writes new commits as fsck runs:

    #!/bin/bash
    git fsck &
    PID=$!

    while ps -p $PID >/dev/null; do
        sleep 3
        git commit -q --allow-empty -m "Another commit"
    done

Since fsck walks objects for connectivity and then reads the refs at the
end to check, this can cause fsck to get confused and think that the new
refs refer to missing commits and that new reflog entries are invalid.
Running the above script in a clone of git.git results in the following
(output ellipsized to remove additional errors of the same type):

    $ ./fsck-while-writing.sh
    Checking ref database: 100% (1/1), done.
    Checking object directories: 100% (256/256), done.
    warning in tag d6602ec5194c87b0fc87103ca4d67251c76f233a: missingTaggerEntry: invalid format - expected 'tagger' line
    Checking objects: 100% (835091/835091), done.
    error: HEAD: invalid reflog entry 2aac9f9286e2164fbf8e4f1d1df53044ace2b310
    error: HEAD: invalid reflog entry 2aac9f9286e2164fbf8e4f1d1df53044ace2b310
    error: HEAD: invalid reflog entry da0f5b80d61844a6f0ad2ddfd57e4fdfa246ea68
    error: HEAD: invalid reflog entry da0f5b80d61844a6f0ad2ddfd57e4fdfa246ea68
    [...]
    error: HEAD: invalid reflog entry 87c8a5c2f6b79d9afa9e941590b9a097b6f7ac09
    error: HEAD: invalid reflog entry d80887a48865e6ad165274b152cbbbed29f8a55a
    error: HEAD: invalid reflog entry d80887a48865e6ad165274b152cbbbed29f8a55a
    error: HEAD: invalid reflog entry 6724f2dfede88bfa9445a333e06e78536c0c6c0d
    error: refs/heads/mybranch invalid reflog entry 2aac9f9286e2164fbf8e4f1d1df53044ace2b310
    error: refs/heads/mybranch: invalid reflog entry 2aac9f9286e2164fbf8e4f1d1df53044ace2b310
    error: refs/heads/mybranch: invalid reflog entry da0f5b80d61844a6f0ad2ddfd57e4fdfa246ea68
    error: refs/heads/mybranch: invalid reflog entry da0f5b80d61844a6f0ad2ddfd57e4fdfa246ea68
    [...]
    error: refs/heads/mybranch: invalid reflog entry 87c8a5c2f6b79d9afa9e941590b9a097b6f7ac09
    error: refs/heads/mybranch: invalid reflog entry d80887a48865e6ad165274b152cbbbed29f8a55a
    error: refs/heads/mybranch: invalid reflog entry d80887a48865e6ad165274b152cbbbed29f8a55a
    error: refs/heads/mybranch: invalid reflog entry 6724f2dfede88bfa9445a333e06e78536c0c6c0d
    Checking connectivity: 833846, done.
    missing commit 6724f2dfede88bfa9445a333e06e78536c0c6c0d
    Verifying commits in commit graph: 100% (242243/242243), done.

We can minimize the race opportunities by taking a snapshot of refs at
program invocation, doing the connectivity check, and then checking the
snapshotted refs afterward.  This avoids races with regular refs between
fsck and adding objects to the database, though it still leaves a race
between a gc and fsck.  We are less concerned about folks simultaneously
running gc with fsck; though, if it becomes an issue, we could lock fsck
during gc.  We definitely do not want to lock fsck during operations
that may add objects to the object store; that would be problematic for
forges.

Note that refs aren't the only problem, though; reflog entries and index
entries could be problematic as well.  For now we punt on index entries
just leaving a TODO comment, and for reflogs we use a coarse solution of
taking the time at the beginning of the program and ignoring reflog
entries newer than that time.  That may be imperfect if dealing with a
network filesystem, so we leave TODO comment for those that want to
improve that handling as well.

As a high level overview:
  * In addition to fsck_handle_ref(), which now is only a few lines long
    to process a ref, there's also a snapshot_ref() which is called
    early in the program for each ref and takes all the error checking
    logic.
  * The iterating over refs that used to be in get_default_heads() plus
    a loop over the arguments now appears in shapshot_refs().
  * There's a new process_refs() as well that kind of looks like the old
    get_default_heads() though it is streamlined due to the work done by
    snapshot_refs().

This combination of changes modifies the output of running the script
(from the beginning of this commit message) to:

    $ ./fsck-while-writing.sh
    Checking ref database: 100% (1/1), done.
    Checking object directories: 100% (256/256), done.
    warning in tag d6602ec5194c87b0fc87103ca4d67251c76f233a: missingTaggerEntry: invalid format - expected 'tagger' line
    Checking objects: 100% (835091/835091), done.
    Checking connectivity: 833846, done.
    Verifying commits in commit graph: 100% (242243/242243), done.

While worries about live updates while running fsck is likely of most
interest for forge operators, it may also benefit those with
automated jobs (such as git maintenance) or even casual users who want
to do other work in their clone while fsck is running.

Originally-based-on-a-patch-by: Matthew John Cheetham <mjcheetham@outlook.com>
Helped-by: Junio C Hamano <gitster@pobox.com>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Elijah Newren <newren@gmail.com>
---
    fsck: snapshot default refs before object walk
    
    Changes in v2, thanks to feedback & help from Peff & Junio:
    
     * Fixed errors in commit message
     * Changed to use a refname, oid struct and have an array of those
     * Snapshot command line arguments and worktree HEADs too
     * Add TODO items for snapshotting index entries, and for possibly
       improved reflog handling
     * Since nothing from Matthew's original patch in GitHub's fork of git
       remains in this patch by v2 (only a little of it remained in v1), I
       changed authorship to myself and gave Matthew an
       Originally-based-on-a-patch-by trailer.

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-2026%2Fnewren%2Ffsck-snapshot-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-2026/newren/fsck-snapshot-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/2026

Range-diff vs v1:

 1:  394b84bb70 ! 1:  7af3e9b92e fsck: snapshot default refs before object walk
     @@
       ## Metadata ##
     -Author: Matthew John Cheetham <mjcheetham@outlook.com>
     +Author: Elijah Newren <newren@gmail.com>
      
       ## Commit message ##
          fsck: snapshot default refs before object walk
     @@ Commit message
                  git commit -q --allow-empty -m "Another commit"
              done
      
     -    Since fsck reads refs at the beginning, walks those for connectivity,
     -    and then reads the refs again at the end to check, this can cause fsck
     -    to get confused and think that the new refs refer to missing commits and
     -    that new reflog entries are invalid.  Running the above script in a
     -    clone of git.git results in the following (output ellipsized to remove
     -    additional errors of the same type):
     +    Since fsck walks objects for connectivity and then reads the refs at the
     +    end to check, this can cause fsck to get confused and think that the new
     +    refs refer to missing commits and that new reflog entries are invalid.
     +    Running the above script in a clone of git.git results in the following
     +    (output ellipsized to remove additional errors of the same type):
      
              $ ./fsck-while-writing.sh
              Checking ref database: 100% (1/1), done.
     @@ Commit message
              missing commit 6724f2dfede88bfa9445a333e06e78536c0c6c0d
              Verifying commits in commit graph: 100% (242243/242243), done.
      
     -    This problem doesn't occur when refs are specified on the command line
     -    for us to check, since we use those specified refs for both walking and
     -    checking.  Using the same refs for walking and checking seems to just
     -    make sense, so modify the existing code to do the same when refs aren't
     -    specified.  Snapshot the refs at the beginning, and also ignore all
     -    reflog entries since the time of our snapshot (while this technically
     -    means we could ignore a reflog entry created before the fsck process
     -    if the local clock is weird, since reflogs are local-only there are not
     -    concerns about differences between clocks on different machines).  This
     -    combination of changes modifies the output of running the above script
     -    to:
     +    We can minimize the race opportunities by taking a snapshot of refs at
     +    program invocation, doing the connectivity check, and then checking the
     +    snapshotted refs afterward.  This avoids races with regular refs between
     +    fsck and adding objects to the database, though it still leaves a race
     +    between a gc and fsck.  We are less concerned about folks simultaneously
     +    running gc with fsck; though, if it becomes an issue, we could lock fsck
     +    during gc.  We definitely do not want to lock fsck during operations
     +    that may add objects to the object store; that would be problematic for
     +    forges.
     +
     +    Note that refs aren't the only problem, though; reflog entries and index
     +    entries could be problematic as well.  For now we punt on index entries
     +    just leaving a TODO comment, and for reflogs we use a coarse solution of
     +    taking the time at the beginning of the program and ignoring reflog
     +    entries newer than that time.  That may be imperfect if dealing with a
     +    network filesystem, so we leave TODO comment for those that want to
     +    improve that handling as well.
     +
     +    As a high level overview:
     +      * In addition to fsck_handle_ref(), which now is only a few lines long
     +        to process a ref, there's also a snapshot_ref() which is called
     +        early in the program for each ref and takes all the error checking
     +        logic.
     +      * The iterating over refs that used to be in get_default_heads() plus
     +        a loop over the arguments now appears in shapshot_refs().
     +      * There's a new process_refs() as well that kind of looks like the old
     +        get_default_heads() though it is streamlined due to the work done by
     +        snapshot_refs().
     +
     +    This combination of changes modifies the output of running the script
     +    (from the beginning of this commit message) to:
      
              $ ./fsck-while-writing.sh
              Checking ref database: 100% (1/1), done.
     @@ Commit message
              Verifying commits in commit graph: 100% (242243/242243), done.
      
          While worries about live updates while running fsck is likely of most
     -    interest for forge operators, it will likely also benefit those with
     +    interest for forge operators, it may also benefit those with
          automated jobs (such as git maintenance) or even casual users who want
          to do other work in their clone while fsck is running.
      
     -    Signed-off-by: Matthew John Cheetham <mjcheetham@outlook.com>
     -    Co-authored-by: Elijah Newren <newren@gmail.com>
     -    [en: several changes:
     -      * adjusted for upstream refactorings to refs callback call signatures
     -      * handle reflogs as well
     -      * free recorded snapshot of refs when done
     -      * default to snapshotting instead of making it a non-default option
     -      * provide reproducible testcase in commit message and rewrite commit
     -        message around it
     -    ]
     +    Originally-based-on-a-patch-by: Matthew John Cheetham <mjcheetham@outlook.com>
     +    Helped-by: Junio C Hamano <gitster@pobox.com>
     +    Helped-by: Jeff King <peff@peff.net>
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## builtin/fsck.c ##
     @@ builtin/fsck.c: static int fsck_handle_reflog_ent(const char *refname,
       	if (verbose)
       		fprintf_ln(stderr, _("Checking reflog %s->%s"),
       			   oid_to_hex(ooid), oid_to_hex(noid));
     -@@ builtin/fsck.c: static int fsck_head_link(const char *head_ref_name,
     - 			  const char **head_points_at,
     - 			  struct object_id *head_oid);
     +@@ builtin/fsck.c: static int fsck_handle_reflog(const char *logname, void *cb_data)
     + 	return 0;
     + }
       
     --static void get_default_heads(void)
     +-static int fsck_handle_ref(const struct reference *ref, void *cb_data UNUSED)
      +struct ref_snapshot {
     -+	size_t nr;
     -+	size_t name_alloc;
     -+	size_t oid_alloc;
     -+	char **refname;
     -+	struct object_id *oid;
     ++	char *refname;
     ++	struct object_id oid;
     ++	/* TODO: Maybe supplement with latest reflog entry info too? */
      +};
      +
     -+static int snapshot_refs(const struct reference *ref, void *cb_data)
     -+{
     -+	struct ref_snapshot *refs = cb_data;
     ++struct snapshot {
     ++	size_t nr;
     ++	size_t alloc;
     ++	struct ref_snapshot *ref;
     ++	/* TODO: Consider also snapshotting the index of each worktree. */
     ++};
      +
     -+	ALLOC_GROW(refs->refname, refs->nr + 1, refs->name_alloc);
     -+	ALLOC_GROW(refs->oid, refs->nr + 1, refs->oid_alloc);
     ++static int snapshot_ref(const struct reference *ref, void *cb_data)
     + {
     ++	struct snapshot *snap = cb_data;
     + 	struct object *obj;
     + 
     + 	obj = parse_object(the_repository, ref->oid);
     +@@ builtin/fsck.c: static int fsck_handle_ref(const struct reference *ref, void *cb_data UNUSED)
     + 		errors_found |= ERROR_REFS;
     + 	}
     + 	default_refs++;
      +
     -+	refs->refname[refs->nr] = xstrdup(ref->name);
     -+	oidcpy(&refs->oid[refs->nr], ref->oid);
     -+	refs->nr++;
     ++	ALLOC_GROW(snap->ref, snap->nr + 1, snap->alloc);
     ++	snap->ref[snap->nr].refname = xstrdup(ref->name);
     ++	oidcpy(&snap->ref[snap->nr].oid, ref->oid);
     ++	snap->nr++;
      +
      +	return 0;
      +}
      +
     -+static void free_snapshot_refs(struct ref_snapshot *snapshot)
     ++static int fsck_handle_ref(const struct reference *ref, void *cb_data UNUSED)
      +{
     -+	for (size_t i = 0; i < snapshot->nr; i++)
     -+		free(snapshot->refname[i]);
     -+	free(snapshot->refname);
     -+	free(snapshot->oid);
     -+}
     ++	struct object *obj;
      +
     -+static void get_default_heads(struct ref_snapshot *the_refs)
     ++	obj = parse_object(the_repository, ref->oid);
     + 	obj->flags |= USED;
     + 	fsck_put_object_name(&fsck_walk_options,
     + 			     ref->oid, "%s", ref->name);
     +@@ builtin/fsck.c: static int fsck_head_link(const char *head_ref_name,
     + 			  const char **head_points_at,
     + 			  struct object_id *head_oid);
     + 
     +-static void get_default_heads(void)
     ++static void snapshot_refs(struct snapshot *snap, int argc, const char **argv)
       {
       	struct worktree **worktrees, **p;
       	const char *head_points_at;
       	struct object_id head_oid;
       
     --	refs_for_each_rawref(get_main_ref_store(the_repository),
     --			     fsck_handle_ref, NULL);
     -+	if (the_refs)
     -+		for (size_t i = 0; i < the_refs->nr; i++) {
     ++	for (int i = 0; i < argc; i++) {
     ++		const char *arg = argv[i];
     ++		struct object_id oid;
     ++		if (!repo_get_oid(the_repository, arg, &oid)) {
      +			struct reference ref = {
     -+				.name = the_refs->refname[i],
     -+				.oid = &the_refs->oid[i],
     ++				.name = arg,
     ++				.oid = &oid,
      +			};
     -+			fsck_handle_ref(&ref, NULL);
     ++
     ++			snapshot_ref(&ref, snap);
     ++			continue;
      +		}
     -+	else
     -+		refs_for_each_rawref(get_main_ref_store(the_repository),
     -+				     fsck_handle_ref, NULL);
     ++		error(_("invalid parameter: expected sha1, got '%s'"), arg);
     ++		errors_found |= ERROR_OBJECT;
     ++	}
     ++
     ++	if (argc) {
     ++		include_reflogs = 0;
     ++		return;
     ++	}
     ++
     + 	refs_for_each_rawref(get_main_ref_store(the_repository),
     +-			     fsck_handle_ref, NULL);
     ++			     snapshot_ref, snap);
       
       	worktrees = get_worktrees();
       	for (p = worktrees; *p; p++) {
     +@@ builtin/fsck.c: static void get_default_heads(void)
     + 				.oid = &head_oid,
     + 			};
     + 
     +-			fsck_handle_ref(&ref, NULL);
     ++			snapshot_ref(&ref, snap);
     + 		}
     + 		strbuf_release(&refname);
     + 
     +-		if (include_reflogs)
     ++		/*
     ++		 * TODO: Could use refs_for_each_reflog(...) to find
     ++		 * latest entry instead of using a global 'now' for that
     ++		 * purpose.
     ++		 */
     ++	}
     ++	free_worktrees(worktrees);
     ++
     ++	/* Ignore reflogs newer than now */
     ++	now = time(NULL);
     ++}
     ++
     ++
     ++static void free_snapshot_refs(struct snapshot *snap)
     ++{
     ++	for (size_t i = 0; i < snap->nr; i++)
     ++		free(snap->ref[i].refname);
     ++	free(snap->ref);
     ++}
     ++
     ++static void process_refs(struct snapshot *snap)
     ++{
     ++	struct worktree **worktrees, **p;
     ++
     ++	for (size_t i = 0; i < snap->nr; i++) {
     ++		struct reference ref = {
     ++			.name = snap->ref[i].refname,
     ++			.oid = &snap->ref[i].oid,
     ++		};
     ++		fsck_handle_ref(&ref, NULL);
     ++	}
     ++
     ++	if (include_reflogs) {
     ++		worktrees = get_worktrees();
     ++		for (p = worktrees; *p; p++) {
     ++			struct worktree *wt = *p;
     ++
     + 			refs_for_each_reflog(get_worktree_ref_store(wt),
     + 					     fsck_handle_reflog, wt);
     ++		}
     ++		free_worktrees(worktrees);
     + 	}
     +-	free_worktrees(worktrees);
     + 
     + 	/*
     + 	 * Not having any default heads isn't really fatal, but
      @@ builtin/fsck.c: int cmd_fsck(int argc,
     + 	     const char *prefix,
     + 	     struct repository *repo UNUSED)
       {
     - 	int i;
     +-	int i;
       	struct odb_source *source;
     -+	struct ref_snapshot default_refs_snapshot = {
     ++	struct snapshot snap = {
      +		.nr = 0,
     -+		.name_alloc = 0,
     -+		.oid_alloc = 0,
     -+		.refname = NULL,
     -+		.oid = NULL
     ++		.alloc = 0,
     ++		.ref = NULL
      +	};
     -+	bool use_snapshot;
       
       	/* fsck knows how to handle missing promisor objects */
       	fetch_if_missing = 0;
     @@ builtin/fsck.c: int cmd_fsck(int argc,
      +	 * objects. We can still walk over new objects that are added during the
      +	 * execution of fsck but won't miss any objects that were reachable.
      +	 */
     -+	use_snapshot = !argc;
     -+	if (use_snapshot) {
     -+		now = time(NULL);
     -+		refs_for_each_rawref(get_main_ref_store(the_repository),
     -+				     snapshot_refs, &default_refs_snapshot);
     -+	}
     ++	snapshot_refs(&snap, argc, argv);
     ++
     ++	/* Ensure we get a "fresh" view of the odb */
     ++	odb_reprepare(the_repository->objects);
      +
       	if (connectivity_only) {
       		for_each_loose_object(the_repository->objects,
       				      mark_loose_for_connectivity, NULL, 0);
      @@ builtin/fsck.c: int cmd_fsck(int argc,
     - 	 * in this case (ie this implies --cache).
     - 	 */
     - 	if (!argc) {
     + 			errors_found |= ERROR_OBJECT;
     + 	}
     + 
     +-	for (i = 0; i < argc; i++) {
     +-		const char *arg = argv[i];
     +-		struct object_id oid;
     +-		if (!repo_get_oid(the_repository, arg, &oid)) {
     +-			struct object *obj = lookup_object(the_repository,
     +-							   &oid);
     +-
     +-			if (!obj || !(obj->flags & HAS_OBJ)) {
     +-				if (is_promisor_object(the_repository, &oid))
     +-					continue;
     +-				error(_("%s: object missing"), oid_to_hex(&oid));
     +-				errors_found |= ERROR_OBJECT;
     +-				continue;
     +-			}
     +-
     +-			obj->flags |= USED;
     +-			fsck_put_object_name(&fsck_walk_options, &oid,
     +-					     "%s", arg);
     +-			mark_object_reachable(obj);
     +-			continue;
     +-		}
     +-		error(_("invalid parameter: expected sha1, got '%s'"), arg);
     +-		errors_found |= ERROR_OBJECT;
     +-	}
     ++	/* Process the snapshotted refs and the reflogs. */
     ++	process_refs(&snap);
     + 
     +-	/*
     +-	 * If we've not been given any explicit head information, do the
     +-	 * default ones from .git/refs. We also consider the index file
     +-	 * in this case (ie this implies --cache).
     +-	 */
     +-	if (!argc) {
      -		get_default_heads();
     -+		get_default_heads(use_snapshot ? &default_refs_snapshot : NULL);
     ++	/* If not given any explicit objects, process index files too. */
     ++	if (!argc)
       		keep_cache_objects = 1;
     - 	}
     +-	}
     +-
     + 	if (keep_cache_objects) {
     ++		/*
     ++		 * TODO: Consider first walking these indexes in snapshot_refs,
     ++		 * to snapshot where the index entries used to point, and then
     ++		 * check those snapshotted locations here.
     ++		 */
     + 		struct worktree **worktrees, **p;
       
     + 		verify_index_checksum = 1;
      @@ builtin/fsck.c: int cmd_fsck(int argc,
       		}
       	}
       
     -+	if (use_snapshot)
     -+		free_snapshot_refs(&default_refs_snapshot);
     ++	free_snapshot_refs(&snap);
       	return errors_found;
       }


 builtin/fsck.c | 162 +++++++++++++++++++++++++++++++++++++------------
 1 file changed, 122 insertions(+), 40 deletions(-)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index c489582faa..eec4626bfa 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -51,6 +51,7 @@ static int show_progress = -1;
 static int show_dangling = 1;
 static int name_objects;
 static int check_references = 1;
+static timestamp_t now;
 #define ERROR_OBJECT 01
 #define ERROR_REACHABLE 02
 #define ERROR_PACK 04
@@ -509,6 +510,9 @@ static int fsck_handle_reflog_ent(const char *refname,
 				  timestamp_t timestamp, int tz UNUSED,
 				  const char *message UNUSED, void *cb_data UNUSED)
 {
+	if (now && timestamp > now)
+		return 0;
+
 	if (verbose)
 		fprintf_ln(stderr, _("Checking reflog %s->%s"),
 			   oid_to_hex(ooid), oid_to_hex(noid));
@@ -530,8 +534,22 @@ static int fsck_handle_reflog(const char *logname, void *cb_data)
 	return 0;
 }
 
-static int fsck_handle_ref(const struct reference *ref, void *cb_data UNUSED)
+struct ref_snapshot {
+	char *refname;
+	struct object_id oid;
+	/* TODO: Maybe supplement with latest reflog entry info too? */
+};
+
+struct snapshot {
+	size_t nr;
+	size_t alloc;
+	struct ref_snapshot *ref;
+	/* TODO: Consider also snapshotting the index of each worktree. */
+};
+
+static int snapshot_ref(const struct reference *ref, void *cb_data)
 {
+	struct snapshot *snap = cb_data;
 	struct object *obj;
 
 	obj = parse_object(the_repository, ref->oid);
@@ -555,6 +573,20 @@ static int fsck_handle_ref(const struct reference *ref, void *cb_data UNUSED)
 		errors_found |= ERROR_REFS;
 	}
 	default_refs++;
+
+	ALLOC_GROW(snap->ref, snap->nr + 1, snap->alloc);
+	snap->ref[snap->nr].refname = xstrdup(ref->name);
+	oidcpy(&snap->ref[snap->nr].oid, ref->oid);
+	snap->nr++;
+
+	return 0;
+}
+
+static int fsck_handle_ref(const struct reference *ref, void *cb_data UNUSED)
+{
+	struct object *obj;
+
+	obj = parse_object(the_repository, ref->oid);
 	obj->flags |= USED;
 	fsck_put_object_name(&fsck_walk_options,
 			     ref->oid, "%s", ref->name);
@@ -567,14 +599,35 @@ static int fsck_head_link(const char *head_ref_name,
 			  const char **head_points_at,
 			  struct object_id *head_oid);
 
-static void get_default_heads(void)
+static void snapshot_refs(struct snapshot *snap, int argc, const char **argv)
 {
 	struct worktree **worktrees, **p;
 	const char *head_points_at;
 	struct object_id head_oid;
 
+	for (int i = 0; i < argc; i++) {
+		const char *arg = argv[i];
+		struct object_id oid;
+		if (!repo_get_oid(the_repository, arg, &oid)) {
+			struct reference ref = {
+				.name = arg,
+				.oid = &oid,
+			};
+
+			snapshot_ref(&ref, snap);
+			continue;
+		}
+		error(_("invalid parameter: expected sha1, got '%s'"), arg);
+		errors_found |= ERROR_OBJECT;
+	}
+
+	if (argc) {
+		include_reflogs = 0;
+		return;
+	}
+
 	refs_for_each_rawref(get_main_ref_store(the_repository),
-			     fsck_handle_ref, NULL);
+			     snapshot_ref, snap);
 
 	worktrees = get_worktrees();
 	for (p = worktrees; *p; p++) {
@@ -589,15 +642,52 @@ static void get_default_heads(void)
 				.oid = &head_oid,
 			};
 
-			fsck_handle_ref(&ref, NULL);
+			snapshot_ref(&ref, snap);
 		}
 		strbuf_release(&refname);
 
-		if (include_reflogs)
+		/*
+		 * TODO: Could use refs_for_each_reflog(...) to find
+		 * latest entry instead of using a global 'now' for that
+		 * purpose.
+		 */
+	}
+	free_worktrees(worktrees);
+
+	/* Ignore reflogs newer than now */
+	now = time(NULL);
+}
+
+
+static void free_snapshot_refs(struct snapshot *snap)
+{
+	for (size_t i = 0; i < snap->nr; i++)
+		free(snap->ref[i].refname);
+	free(snap->ref);
+}
+
+static void process_refs(struct snapshot *snap)
+{
+	struct worktree **worktrees, **p;
+
+	for (size_t i = 0; i < snap->nr; i++) {
+		struct reference ref = {
+			.name = snap->ref[i].refname,
+			.oid = &snap->ref[i].oid,
+		};
+		fsck_handle_ref(&ref, NULL);
+	}
+
+	if (include_reflogs) {
+		worktrees = get_worktrees();
+		for (p = worktrees; *p; p++) {
+			struct worktree *wt = *p;
+
 			refs_for_each_reflog(get_worktree_ref_store(wt),
 					     fsck_handle_reflog, wt);
+		}
+		free_worktrees(worktrees);
 	}
-	free_worktrees(worktrees);
 
 	/*
 	 * Not having any default heads isn't really fatal, but
@@ -962,8 +1052,12 @@ int cmd_fsck(int argc,
 	     const char *prefix,
 	     struct repository *repo UNUSED)
 {
-	int i;
 	struct odb_source *source;
+	struct snapshot snap = {
+		.nr = 0,
+		.alloc = 0,
+		.ref = NULL
+	};
 
 	/* fsck knows how to handle missing promisor objects */
 	fetch_if_missing = 0;
@@ -999,6 +1093,17 @@ int cmd_fsck(int argc,
 	if (check_references)
 		fsck_refs(the_repository);
 
+	/*
+	 * Take a snapshot of the refs before walking objects to avoid looking
+	 * at a set of refs that may be changed by the user while we are walking
+	 * objects. We can still walk over new objects that are added during the
+	 * execution of fsck but won't miss any objects that were reachable.
+	 */
+	snapshot_refs(&snap, argc, argv);
+
+	/* Ensure we get a "fresh" view of the odb */
+	odb_reprepare(the_repository->objects);
+
 	if (connectivity_only) {
 		for_each_loose_object(the_repository->objects,
 				      mark_loose_for_connectivity, NULL, 0);
@@ -1040,42 +1145,18 @@ int cmd_fsck(int argc,
 			errors_found |= ERROR_OBJECT;
 	}
 
-	for (i = 0; i < argc; i++) {
-		const char *arg = argv[i];
-		struct object_id oid;
-		if (!repo_get_oid(the_repository, arg, &oid)) {
-			struct object *obj = lookup_object(the_repository,
-							   &oid);
-
-			if (!obj || !(obj->flags & HAS_OBJ)) {
-				if (is_promisor_object(the_repository, &oid))
-					continue;
-				error(_("%s: object missing"), oid_to_hex(&oid));
-				errors_found |= ERROR_OBJECT;
-				continue;
-			}
-
-			obj->flags |= USED;
-			fsck_put_object_name(&fsck_walk_options, &oid,
-					     "%s", arg);
-			mark_object_reachable(obj);
-			continue;
-		}
-		error(_("invalid parameter: expected sha1, got '%s'"), arg);
-		errors_found |= ERROR_OBJECT;
-	}
+	/* Process the snapshotted refs and the reflogs. */
+	process_refs(&snap);
 
-	/*
-	 * If we've not been given any explicit head information, do the
-	 * default ones from .git/refs. We also consider the index file
-	 * in this case (ie this implies --cache).
-	 */
-	if (!argc) {
-		get_default_heads();
+	/* If not given any explicit objects, process index files too. */
+	if (!argc)
 		keep_cache_objects = 1;
-	}
-
 	if (keep_cache_objects) {
+		/*
+		 * TODO: Consider first walking these indexes in snapshot_refs,
+		 * to snapshot where the index entries used to point, and then
+		 * check those snapshotted locations here.
+		 */
 		struct worktree **worktrees, **p;
 
 		verify_index_checksum = 1;
@@ -1148,5 +1229,6 @@ int cmd_fsck(int argc,
 		}
 	}
 
+	free_snapshot_refs(&snap);
 	return errors_found;
 }

base-commit: b31ab939fe8e3cbe8be48dddd1c6ac0265991f45
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v3] fsck: snapshot default refs before object walk
  2026-01-07  1:29 ` [PATCH v2] " Elijah Newren via GitGitGadget
@ 2026-01-09 17:49   ` Elijah Newren via GitGitGadget
  2026-01-11  4:01     ` Junio C Hamano
  2026-01-14 22:34     ` Jeff King
  0 siblings, 2 replies; 11+ messages in thread
From: Elijah Newren via GitGitGadget @ 2026-01-09 17:49 UTC (permalink / raw)
  To: git
  Cc: Matthew John Cheetham, Jeff King, Elijah Newren, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

Fsck has a race when operating on live repositories; consider the
following simple script that writes new commits as fsck runs:

    #!/bin/bash
    git fsck &
    PID=$!

    while ps -p $PID >/dev/null; do
        sleep 3
        git commit -q --allow-empty -m "Another commit"
    done

Since fsck walks objects for connectivity and then reads the refs at the
end to check, this can cause fsck to get confused and think that the new
refs refer to missing commits and that new reflog entries are invalid.
Running the above script in a clone of git.git results in the following
(output ellipsized to remove additional errors of the same type):

    $ ./fsck-while-writing.sh
    Checking ref database: 100% (1/1), done.
    Checking object directories: 100% (256/256), done.
    warning in tag d6602ec5194c87b0fc87103ca4d67251c76f233a: missingTaggerEntry: invalid format - expected 'tagger' line
    Checking objects: 100% (835091/835091), done.
    error: HEAD: invalid reflog entry 2aac9f9286e2164fbf8e4f1d1df53044ace2b310
    error: HEAD: invalid reflog entry 2aac9f9286e2164fbf8e4f1d1df53044ace2b310
    error: HEAD: invalid reflog entry da0f5b80d61844a6f0ad2ddfd57e4fdfa246ea68
    error: HEAD: invalid reflog entry da0f5b80d61844a6f0ad2ddfd57e4fdfa246ea68
    [...]
    error: HEAD: invalid reflog entry 87c8a5c2f6b79d9afa9e941590b9a097b6f7ac09
    error: HEAD: invalid reflog entry d80887a48865e6ad165274b152cbbbed29f8a55a
    error: HEAD: invalid reflog entry d80887a48865e6ad165274b152cbbbed29f8a55a
    error: HEAD: invalid reflog entry 6724f2dfede88bfa9445a333e06e78536c0c6c0d
    error: refs/heads/mybranch invalid reflog entry 2aac9f9286e2164fbf8e4f1d1df53044ace2b310
    error: refs/heads/mybranch: invalid reflog entry 2aac9f9286e2164fbf8e4f1d1df53044ace2b310
    error: refs/heads/mybranch: invalid reflog entry da0f5b80d61844a6f0ad2ddfd57e4fdfa246ea68
    error: refs/heads/mybranch: invalid reflog entry da0f5b80d61844a6f0ad2ddfd57e4fdfa246ea68
    [...]
    error: refs/heads/mybranch: invalid reflog entry 87c8a5c2f6b79d9afa9e941590b9a097b6f7ac09
    error: refs/heads/mybranch: invalid reflog entry d80887a48865e6ad165274b152cbbbed29f8a55a
    error: refs/heads/mybranch: invalid reflog entry d80887a48865e6ad165274b152cbbbed29f8a55a
    error: refs/heads/mybranch: invalid reflog entry 6724f2dfede88bfa9445a333e06e78536c0c6c0d
    Checking connectivity: 833846, done.
    missing commit 6724f2dfede88bfa9445a333e06e78536c0c6c0d
    Verifying commits in commit graph: 100% (242243/242243), done.

We can minimize the race opportunities by taking a snapshot of refs at
program invocation, doing the connectivity check, and then checking the
snapshotted refs afterward.  This avoids races with regular refs between
fsck and adding objects to the database, though it still leaves a race
between a gc and fsck.  We are less concerned about folks simultaneously
running gc with fsck; though, if it becomes an issue, we could lock fsck
during gc.  We definitely do not want to lock fsck during operations
that may add objects to the object store; that would be problematic for
forges.

Note that refs aren't the only problem, though; reflog entries and index
entries could be problematic as well.  For now we punt on index entries
just leaving a TODO comment, and for reflogs we use a coarse solution of
taking the time at the beginning of the program and ignoring reflog
entries newer than that time.  That may be imperfect if dealing with a
network filesystem, so we leave TODO comment for those that want to
improve that handling as well.

As a high level overview:
  * In addition to fsck_handle_ref(), which now is only a few lines long
    to process a ref, there's also a snapshot_ref() which is called
    early in the program for each ref and takes all the error checking
    logic.
  * The iterating over refs that used to be in get_default_heads() plus
    a loop over the arguments now appears in shapshot_refs().
  * There's a new process_refs() as well that kind of looks like the old
    get_default_heads() though it is streamlined due to the work done by
    snapshot_refs().

This combination of changes modifies the output of running the script
(from the beginning of this commit message) to:

    $ ./fsck-while-writing.sh
    Checking ref database: 100% (1/1), done.
    Checking object directories: 100% (256/256), done.
    warning in tag d6602ec5194c87b0fc87103ca4d67251c76f233a: missingTaggerEntry: invalid format - expected 'tagger' line
    Checking objects: 100% (835091/835091), done.
    Checking connectivity: 833846, done.
    Verifying commits in commit graph: 100% (242243/242243), done.

While worries about live updates while running fsck is likely of most
interest for forge operators, it may also benefit those with
automated jobs (such as git maintenance) or even casual users who want
to do other work in their clone while fsck is running.

Helped-by: Junio C Hamano <gitster@pobox.com>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Elijah Newren <newren@gmail.com>
---
    fsck: snapshot default refs before object walk
    
    Changes in v3:
    
     * Removed the attribution for Matthew, as per his request.
    
    Changes in v2, thanks to feedback & help from Peff & Junio:
    
     * Fixed errors in commit message
     * Changed to use a refname, oid struct and have an array of those
     * Snapshot command line arguments and worktree HEADs too
     * Add TODO items for snapshotting index entries, and for possibly
       improved reflog handling
     * Since nothing from Matthew's original patch in GitHub's fork of git
       remains in this patch by v2 (only a little of it remained in v1), I
       changed authorship to myself and gave Matthew an
       Originally-based-on-a-patch-by trailer.

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-2026%2Fnewren%2Ffsck-snapshot-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-2026/newren/fsck-snapshot-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/2026

Range-diff vs v2:

 1:  7af3e9b92e ! 1:  46b3ae9380 fsck: snapshot default refs before object walk
     @@ Commit message
          automated jobs (such as git maintenance) or even casual users who want
          to do other work in their clone while fsck is running.
      
     -    Originally-based-on-a-patch-by: Matthew John Cheetham <mjcheetham@outlook.com>
          Helped-by: Junio C Hamano <gitster@pobox.com>
          Helped-by: Jeff King <peff@peff.net>
          Signed-off-by: Elijah Newren <newren@gmail.com>


 builtin/fsck.c | 162 +++++++++++++++++++++++++++++++++++++------------
 1 file changed, 122 insertions(+), 40 deletions(-)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index c489582faa..eec4626bfa 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -51,6 +51,7 @@ static int show_progress = -1;
 static int show_dangling = 1;
 static int name_objects;
 static int check_references = 1;
+static timestamp_t now;
 #define ERROR_OBJECT 01
 #define ERROR_REACHABLE 02
 #define ERROR_PACK 04
@@ -509,6 +510,9 @@ static int fsck_handle_reflog_ent(const char *refname,
 				  timestamp_t timestamp, int tz UNUSED,
 				  const char *message UNUSED, void *cb_data UNUSED)
 {
+	if (now && timestamp > now)
+		return 0;
+
 	if (verbose)
 		fprintf_ln(stderr, _("Checking reflog %s->%s"),
 			   oid_to_hex(ooid), oid_to_hex(noid));
@@ -530,8 +534,22 @@ static int fsck_handle_reflog(const char *logname, void *cb_data)
 	return 0;
 }
 
-static int fsck_handle_ref(const struct reference *ref, void *cb_data UNUSED)
+struct ref_snapshot {
+	char *refname;
+	struct object_id oid;
+	/* TODO: Maybe supplement with latest reflog entry info too? */
+};
+
+struct snapshot {
+	size_t nr;
+	size_t alloc;
+	struct ref_snapshot *ref;
+	/* TODO: Consider also snapshotting the index of each worktree. */
+};
+
+static int snapshot_ref(const struct reference *ref, void *cb_data)
 {
+	struct snapshot *snap = cb_data;
 	struct object *obj;
 
 	obj = parse_object(the_repository, ref->oid);
@@ -555,6 +573,20 @@ static int fsck_handle_ref(const struct reference *ref, void *cb_data UNUSED)
 		errors_found |= ERROR_REFS;
 	}
 	default_refs++;
+
+	ALLOC_GROW(snap->ref, snap->nr + 1, snap->alloc);
+	snap->ref[snap->nr].refname = xstrdup(ref->name);
+	oidcpy(&snap->ref[snap->nr].oid, ref->oid);
+	snap->nr++;
+
+	return 0;
+}
+
+static int fsck_handle_ref(const struct reference *ref, void *cb_data UNUSED)
+{
+	struct object *obj;
+
+	obj = parse_object(the_repository, ref->oid);
 	obj->flags |= USED;
 	fsck_put_object_name(&fsck_walk_options,
 			     ref->oid, "%s", ref->name);
@@ -567,14 +599,35 @@ static int fsck_head_link(const char *head_ref_name,
 			  const char **head_points_at,
 			  struct object_id *head_oid);
 
-static void get_default_heads(void)
+static void snapshot_refs(struct snapshot *snap, int argc, const char **argv)
 {
 	struct worktree **worktrees, **p;
 	const char *head_points_at;
 	struct object_id head_oid;
 
+	for (int i = 0; i < argc; i++) {
+		const char *arg = argv[i];
+		struct object_id oid;
+		if (!repo_get_oid(the_repository, arg, &oid)) {
+			struct reference ref = {
+				.name = arg,
+				.oid = &oid,
+			};
+
+			snapshot_ref(&ref, snap);
+			continue;
+		}
+		error(_("invalid parameter: expected sha1, got '%s'"), arg);
+		errors_found |= ERROR_OBJECT;
+	}
+
+	if (argc) {
+		include_reflogs = 0;
+		return;
+	}
+
 	refs_for_each_rawref(get_main_ref_store(the_repository),
-			     fsck_handle_ref, NULL);
+			     snapshot_ref, snap);
 
 	worktrees = get_worktrees();
 	for (p = worktrees; *p; p++) {
@@ -589,15 +642,52 @@ static void get_default_heads(void)
 				.oid = &head_oid,
 			};
 
-			fsck_handle_ref(&ref, NULL);
+			snapshot_ref(&ref, snap);
 		}
 		strbuf_release(&refname);
 
-		if (include_reflogs)
+		/*
+		 * TODO: Could use refs_for_each_reflog(...) to find
+		 * latest entry instead of using a global 'now' for that
+		 * purpose.
+		 */
+	}
+	free_worktrees(worktrees);
+
+	/* Ignore reflogs newer than now */
+	now = time(NULL);
+}
+
+
+static void free_snapshot_refs(struct snapshot *snap)
+{
+	for (size_t i = 0; i < snap->nr; i++)
+		free(snap->ref[i].refname);
+	free(snap->ref);
+}
+
+static void process_refs(struct snapshot *snap)
+{
+	struct worktree **worktrees, **p;
+
+	for (size_t i = 0; i < snap->nr; i++) {
+		struct reference ref = {
+			.name = snap->ref[i].refname,
+			.oid = &snap->ref[i].oid,
+		};
+		fsck_handle_ref(&ref, NULL);
+	}
+
+	if (include_reflogs) {
+		worktrees = get_worktrees();
+		for (p = worktrees; *p; p++) {
+			struct worktree *wt = *p;
+
 			refs_for_each_reflog(get_worktree_ref_store(wt),
 					     fsck_handle_reflog, wt);
+		}
+		free_worktrees(worktrees);
 	}
-	free_worktrees(worktrees);
 
 	/*
 	 * Not having any default heads isn't really fatal, but
@@ -962,8 +1052,12 @@ int cmd_fsck(int argc,
 	     const char *prefix,
 	     struct repository *repo UNUSED)
 {
-	int i;
 	struct odb_source *source;
+	struct snapshot snap = {
+		.nr = 0,
+		.alloc = 0,
+		.ref = NULL
+	};
 
 	/* fsck knows how to handle missing promisor objects */
 	fetch_if_missing = 0;
@@ -999,6 +1093,17 @@ int cmd_fsck(int argc,
 	if (check_references)
 		fsck_refs(the_repository);
 
+	/*
+	 * Take a snapshot of the refs before walking objects to avoid looking
+	 * at a set of refs that may be changed by the user while we are walking
+	 * objects. We can still walk over new objects that are added during the
+	 * execution of fsck but won't miss any objects that were reachable.
+	 */
+	snapshot_refs(&snap, argc, argv);
+
+	/* Ensure we get a "fresh" view of the odb */
+	odb_reprepare(the_repository->objects);
+
 	if (connectivity_only) {
 		for_each_loose_object(the_repository->objects,
 				      mark_loose_for_connectivity, NULL, 0);
@@ -1040,42 +1145,18 @@ int cmd_fsck(int argc,
 			errors_found |= ERROR_OBJECT;
 	}
 
-	for (i = 0; i < argc; i++) {
-		const char *arg = argv[i];
-		struct object_id oid;
-		if (!repo_get_oid(the_repository, arg, &oid)) {
-			struct object *obj = lookup_object(the_repository,
-							   &oid);
-
-			if (!obj || !(obj->flags & HAS_OBJ)) {
-				if (is_promisor_object(the_repository, &oid))
-					continue;
-				error(_("%s: object missing"), oid_to_hex(&oid));
-				errors_found |= ERROR_OBJECT;
-				continue;
-			}
-
-			obj->flags |= USED;
-			fsck_put_object_name(&fsck_walk_options, &oid,
-					     "%s", arg);
-			mark_object_reachable(obj);
-			continue;
-		}
-		error(_("invalid parameter: expected sha1, got '%s'"), arg);
-		errors_found |= ERROR_OBJECT;
-	}
+	/* Process the snapshotted refs and the reflogs. */
+	process_refs(&snap);
 
-	/*
-	 * If we've not been given any explicit head information, do the
-	 * default ones from .git/refs. We also consider the index file
-	 * in this case (ie this implies --cache).
-	 */
-	if (!argc) {
-		get_default_heads();
+	/* If not given any explicit objects, process index files too. */
+	if (!argc)
 		keep_cache_objects = 1;
-	}
-
 	if (keep_cache_objects) {
+		/*
+		 * TODO: Consider first walking these indexes in snapshot_refs,
+		 * to snapshot where the index entries used to point, and then
+		 * check those snapshotted locations here.
+		 */
 		struct worktree **worktrees, **p;
 
 		verify_index_checksum = 1;
@@ -1148,5 +1229,6 @@ int cmd_fsck(int argc,
 		}
 	}
 
+	free_snapshot_refs(&snap);
 	return errors_found;
 }

base-commit: b31ab939fe8e3cbe8be48dddd1c6ac0265991f45
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v3] fsck: snapshot default refs before object walk
  2026-01-09 17:49   ` [PATCH v3] " Elijah Newren via GitGitGadget
@ 2026-01-11  4:01     ` Junio C Hamano
  2026-01-14 22:34     ` Jeff King
  1 sibling, 0 replies; 11+ messages in thread
From: Junio C Hamano @ 2026-01-11  4:01 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Matthew John Cheetham, Jeff King, Elijah Newren

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

>     fsck: snapshot default refs before object walk
>     
>     Changes in v3:
>     
>      * Removed the attribution for Matthew, as per his request.
>     
>     Changes in v2, thanks to feedback & help from Peff & Junio:
>     
>      * Fixed errors in commit message
>      * Changed to use a refname, oid struct and have an array of those
>      * Snapshot command line arguments and worktree HEADs too
>      * Add TODO items for snapshotting index entries, and for possibly
>        improved reflog handling
>      * Since nothing from Matthew's original patch in GitHub's fork of git
>        remains in this patch by v2 (only a little of it remained in v1), I
>        changed authorship to myself and gave Matthew an
>        Originally-based-on-a-patch-by trailer.

Thanks.  Let's mark it for 'next'.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3] fsck: snapshot default refs before object walk
  2026-01-09 17:49   ` [PATCH v3] " Elijah Newren via GitGitGadget
  2026-01-11  4:01     ` Junio C Hamano
@ 2026-01-14 22:34     ` Jeff King
  1 sibling, 0 replies; 11+ messages in thread
From: Jeff King @ 2026-01-14 22:34 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget; +Cc: git, Matthew John Cheetham, Elijah Newren

On Fri, Jan 09, 2026 at 05:49:13PM +0000, Elijah Newren via GitGitGadget wrote:

> -static int fsck_handle_ref(const struct reference *ref, void *cb_data UNUSED)
> +struct ref_snapshot {
> +	char *refname;
> +	struct object_id oid;
> +	/* TODO: Maybe supplement with latest reflog entry info too? */
> +};

I think this is an OK place to stop for now, but just a few thoughts for
a possible future (that I hope neither of us ever needs to follow up on ;) ).

I wonder if we need to record individual reflog entries at all, or if it
would be sufficient to just keep an oidset of reachable tips. In fact, I
wondered if we even needed these ref_snapshot structs at all, and
couldn't just get away with an oidset. But I guess if we do find that
something reachable is missing, we want some better way of pointing to
the culprit.

It _might_ be possible to keep a more compact representation like an
oidset for the happy path, and then only if we find that something is
unreachable, go back and find the culprit. But maybe that gets too
complicated and/or racy.

I also wondered if we could just instantiate a "struct object" for each
ref tip, and marking it with a reachable flag. That's even _more_
expensive than the object_id you're storing here, but eventually becomes
cheaper, since we're going to instantiate all of those objects as we
traverse. IIRC fsck traditionally relied on the existence of an object
struct as a signal that we saw the object somewhere, but I don't recall
if that is still true (a long time ago, I think I tried to convert that
into an explicit flag to prevent subtle confusion, but I don't remember
how completely I succeeded).

Anyway, all of that is for another time.

> +static int fsck_handle_ref(const struct reference *ref, void *cb_data UNUSED)
> +{
> +	struct object *obj;
> +
> +	obj = parse_object(the_repository, ref->oid);
>  	obj->flags |= USED;
>  	fsck_put_object_name(&fsck_walk_options,
>  			     ref->oid, "%s", ref->name);

I wonder if this parse_object() can ever fail (say, in a corrupted repo)
and we'd segfault here? I wouldn't be surprised if that is already a
possibility before your patch, either. ;)

> [...]

I gave a fairly quick read to the rest of it but didn't see anything
questionable.

-Peff

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2026-01-14 22:34 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-12-29 19:12 [PATCH] fsck: snapshot default refs before object walk Elijah Newren via GitGitGadget
2025-12-30  0:45 ` Junio C Hamano
2026-01-06 23:19   ` Elijah Newren
2026-01-09 13:11     ` Matthew John Cheetham
2026-01-02  5:49 ` Jeff King
2026-01-06 23:36   ` Elijah Newren
2026-01-14 22:23     ` Jeff King
2026-01-07  1:29 ` [PATCH v2] " Elijah Newren via GitGitGadget
2026-01-09 17:49   ` [PATCH v3] " Elijah Newren via GitGitGadget
2026-01-11  4:01     ` Junio C Hamano
2026-01-14 22:34     ` Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox