[PATCH] builtin/gc: provide hint when maintenance hits a stale schedule lock

Git development
 help / color / mirror / Atom feed

* [PATCH] builtin/gc: provide hint when maintenance hits a stale schedule lock
@ 2024-11-19 10:48 Patrick Steinhardt
  2024-11-19 17:17 ` Justin Tobler
  2024-11-22 15:30 ` Jeff King
  0 siblings, 2 replies; 5+ messages in thread
From: Patrick Steinhardt @ 2024-11-19 10:48 UTC (permalink / raw)
  To: git; +Cc: Miguel Rincon Barahona, Kev Kloss

When running scheduled maintenance via `git maintenance start`, we
acquire a lockfile to ensure that no other scheduled maintenance task is
running in the repository concurrently. If so, we do provide an error to
the user hinting that another process seems to be running in this repo.

There are two important cases why such a lockfile may exist:

  - An actual git-maintenance(1) process is still running in this
    repository.

  - An earlier process may have crashed or was interrupted part way
    through and has left a stale lockfile behind.

In c95547a394 (builtin/gc: fix crash when running `git maintenance
start`, 2024-10-10), we have fixed an issue where git-maintenance(1)
would crash with the "start" subcommand, and the underlying bug causes
the second scenario to trigger quite often now.

Most users don't know how to get out of that situation again though.
Ideally, we'd be removing the stale lock for our users automatically.
But in the context of repository maintenance this is rather risky, as it
can easily run for hours or even days. So finding a clear point where we
know that the old process has exited is basically impossible.

We have the same issue in other subsystems, e.g. when locking refs. Our
lockfile interfaces thus provide the `unable_to_lock_message()` function
for exactly this purpose: it provides a nice hint to the user that
explains what is going on and how to get out of that situation again by
manually removing the file.

Adapt git-maintenance(1) to print a similar hint. While we could use the
above function, we can provide a bit more context as we know exactly
what kind of process would create the lockfile.

Reported-by: Miguel Rincon Barahona <mrincon@gitlab.com>
Reported-by: Kev Kloss <kkloss@gitlab.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
Hi,

this issue was reported to me internally at GitLab with the suggestion
of providing a hint for how to get out of the situation again.

Patrick
---
 builtin/gc.c           | 11 ++++++++++-
 t/t7900-maintenance.sh |  8 ++++++++
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index d52735354c9f87ba4e8acb593dd11aa0482223e1..34848626e47c777112994e5b5c558b65952ac8c2 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -2890,8 +2890,17 @@ static int update_background_schedule(const struct maintenance_start_opts *opts,
 	char *lock_path = xstrfmt("%s/schedule", the_repository->objects->odb->path);

 	if (hold_lock_file_for_update(&lk, lock_path, LOCK_NO_DEREF) < 0) {
+		if (errno == EEXIST)
+			error(_("unable to create '%s.lock': %s.\n\n"
+			    "Another scheduled git-maintenance(1) process seems to be running in this\n"
+			    "repository. Please make sure no other maintenance processes are running and\n"
+			    "then try again. If it still fails, a git-maintenance(1) process may have\n"
+			    "crashed in this repository earlier: remove the file manually to continue."),
+			    absolute_path(lock_path), strerror(errno));
+		else
+			error_errno(_("cannot acquire lock for scheduled background maintenance"));
 		free(lock_path);
-		return error(_("another process is scheduling background maintenance"));
+		return -1;
 	}

 	for (i = 1; i < ARRAY_SIZE(scheduler_fn); i++) {
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index c224c8450c85f567bc29258e18b4a59fe6682f0a..6d6ffaaf376bdbadecdb23a460994af1d218dc19 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -1011,4 +1011,12 @@ test_expect_success 'repacking loose objects is quiet' '
 	)
 '

+test_expect_success 'maintenance aborts with existing lock file' '
+	test_when_finished "rm -rf repo" &&
+	git init repo &&
+	: >repo/.git/objects/schedule.lock &&
+	test_must_fail git -C repo maintenance start 2>err &&
+	test_grep "Another scheduled git-maintenance(1) process seems to be running" err
+'
+
 test_done

---
base-commit: 090d24e9af6e9f59c3f7bee97c42bb1ae3c7f559
change-id: 20241119-pks-maintenance-hint-with-stale-lock-35e5417d632b

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] builtin/gc: provide hint when maintenance hits a stale schedule lock
  2024-11-19 10:48 [PATCH] builtin/gc: provide hint when maintenance hits a stale schedule lock Patrick Steinhardt
@ 2024-11-19 17:17 ` Justin Tobler
  2024-11-22 15:30 ` Jeff King
  1 sibling, 0 replies; 5+ messages in thread
From: Justin Tobler @ 2024-11-19 17:17 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Miguel Rincon Barahona, Kev Kloss

On 24/11/19 11:48AM, Patrick Steinhardt wrote:
> When running scheduled maintenance via `git maintenance start`, we
> acquire a lockfile to ensure that no other scheduled maintenance task is
> running in the repository concurrently. If so, we do provide an error to
> the user hinting that another process seems to be running in this repo.
> 
> There are two important cases why such a lockfile may exist:
> 
>   - An actual git-maintenance(1) process is still running in this
>     repository.
> 
>   - An earlier process may have crashed or was interrupted part way
>     through and has left a stale lockfile behind.
> 
> In c95547a394 (builtin/gc: fix crash when running `git maintenance
> start`, 2024-10-10), we have fixed an issue where git-maintenance(1)
> would crash with the "start" subcommand, and the underlying bug causes
> the second scenario to trigger quite often now.
> 
> Most users don't know how to get out of that situation again though.
> Ideally, we'd be removing the stale lock for our users automatically.
> But in the context of repository maintenance this is rather risky, as it
> can easily run for hours or even days. So finding a clear point where we
> know that the old process has exited is basically impossible.

Indeed, providing a hint to help affected users here make sense to me.

> 
> We have the same issue in other subsystems, e.g. when locking refs. Our
> lockfile interfaces thus provide the `unable_to_lock_message()` function
> for exactly this purpose: it provides a nice hint to the user that
> explains what is going on and how to get out of that situation again by
> manually removing the file.
> 
> Adapt git-maintenance(1) to print a similar hint. While we could use the
> above function, we can provide a bit more context as we know exactly
> what kind of process would create the lockfile.

The added context provided by having a more specific message seems
appropriate. Looking at the message provided through
`unable_to_lock_message()`, I could see the generic example it provides
being a bit confusing here.

> 
> Reported-by: Miguel Rincon Barahona <mrincon@gitlab.com>
> Reported-by: Kev Kloss <kkloss@gitlab.com>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
> Hi,
> 
> this issue was reported to me internally at GitLab with the suggestion
> of providing a hint for how to get out of the situation again.
> 
> Patrick
> ---
>  builtin/gc.c           | 11 ++++++++++-
>  t/t7900-maintenance.sh |  8 ++++++++
>  2 files changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/builtin/gc.c b/builtin/gc.c
> index d52735354c9f87ba4e8acb593dd11aa0482223e1..34848626e47c777112994e5b5c558b65952ac8c2 100644
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -2890,8 +2890,17 @@ static int update_background_schedule(const struct maintenance_start_opts *opts,
>  	char *lock_path = xstrfmt("%s/schedule", the_repository->objects->odb->path);
>  
>  	if (hold_lock_file_for_update(&lk, lock_path, LOCK_NO_DEREF) < 0) {
> +		if (errno == EEXIST)
> +			error(_("unable to create '%s.lock': %s.\n\n"
> +			    "Another scheduled git-maintenance(1) process seems to be running in this\n"
> +			    "repository. Please make sure no other maintenance processes are running and\n"
> +			    "then try again. If it still fails, a git-maintenance(1) process may have\n"
> +			    "crashed in this repository earlier: remove the file manually to continue."),
> +			    absolute_path(lock_path), strerror(errno));
> +		else
> +			error_errno(_("cannot acquire lock for scheduled background maintenance"));
>  		free(lock_path);
> -		return error(_("another process is scheduling background maintenance"));
> +		return -1;
>  	}
>  
>  	for (i = 1; i < ARRAY_SIZE(scheduler_fn); i++) {
> diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
> index c224c8450c85f567bc29258e18b4a59fe6682f0a..6d6ffaaf376bdbadecdb23a460994af1d218dc19 100755
> --- a/t/t7900-maintenance.sh
> +++ b/t/t7900-maintenance.sh
> @@ -1011,4 +1011,12 @@ test_expect_success 'repacking loose objects is quiet' '
>  	)
>  '
>  
> +test_expect_success 'maintenance aborts with existing lock file' '
> +	test_when_finished "rm -rf repo" &&
> +	git init repo &&
> +	: >repo/.git/objects/schedule.lock &&
> +	test_must_fail git -C repo maintenance start 2>err &&
> +	test_grep "Another scheduled git-maintenance(1) process seems to be running" err
> +'
> +
>  test_done
> 
> ---
> base-commit: 090d24e9af6e9f59c3f7bee97c42bb1ae3c7f559
> change-id: 20241119-pks-maintenance-hint-with-stale-lock-35e5417d632b

The changes in this patch are straightforward and look good to me :)

-Justin

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] builtin/gc: provide hint when maintenance hits a stale schedule lock
  2024-11-19 10:48 [PATCH] builtin/gc: provide hint when maintenance hits a stale schedule lock Patrick Steinhardt
  2024-11-19 17:17 ` Justin Tobler
@ 2024-11-22 15:30 ` Jeff King
  2024-11-25  5:33   ` [PATCH] t7900: fix host-dependent behaviour when testing git-maintenance(1) Patrick Steinhardt
  1 sibling, 1 reply; 5+ messages in thread
From: Jeff King @ 2024-11-22 15:30 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Miguel Rincon Barahona, Kev Kloss

On Tue, Nov 19, 2024 at 11:48:43AM +0100, Patrick Steinhardt wrote:

> diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
> index c224c8450c85f567bc29258e18b4a59fe6682f0a..6d6ffaaf376bdbadecdb23a460994af1d218dc19 100755
> --- a/t/t7900-maintenance.sh
> +++ b/t/t7900-maintenance.sh
> @@ -1011,4 +1011,12 @@ test_expect_success 'repacking loose objects is quiet' '
>  	)
>  '
>  
> +test_expect_success 'maintenance aborts with existing lock file' '
> +	test_when_finished "rm -rf repo" &&
> +	git init repo &&
> +	: >repo/.git/objects/schedule.lock &&
> +	test_must_fail git -C repo maintenance start 2>err &&
> +	test_grep "Another scheduled git-maintenance(1) process seems to be running" err
> +'

This failed for me in all of the macos jobs of GitHub CI. Looks like the
err file contains:

  fatal: launchctl scheduler is not available

So maybe it's a platform image issue?

-Peff

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH] t7900: fix host-dependent behaviour when testing git-maintenance(1)
  2024-11-22 15:30 ` Jeff King
@ 2024-11-25  5:33   ` Patrick Steinhardt
  2024-11-25 11:52     ` Jeff King
  0 siblings, 1 reply; 5+ messages in thread
From: Patrick Steinhardt @ 2024-11-25  5:33 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Miguel Rincon Barahona, Kev Kloss, Junio C Hamano

We have recently added a new test to t7900 that exercises whether
git-maintenance(1) fails as expected when the "schedule.lock" file
exists. The test depends on whether or not the host has the required
executables present to schedule maintenance tasks in the first place,
like systemd or launchctl -- if not, the test fails with an unrelated
error before even checking for the lock file. This fails for example in
our CI systems, where macOS images do not have launchctl available.

Fix this issue by creating a stub systemctl(1) binary and using the
systemd scheduler.

Reported-by: Jeff King <peff@peff.net>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---

Oh, true, thanks for reporting. This here should fix it.

Patrick

 t/t7900-maintenance.sh | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index 6d6ffaaf37..4bbc171958 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -1012,10 +1012,15 @@ test_expect_success 'repacking loose objects is quiet' '
 '
 
 test_expect_success 'maintenance aborts with existing lock file' '
-	test_when_finished "rm -rf repo" &&
+	test_when_finished "rm -rf repo script" &&
+	mkdir script &&
+	write_script script/systemctl <<-\EOF &&
+	true
+	EOF
+
 	git init repo &&
 	: >repo/.git/objects/schedule.lock &&
-	test_must_fail git -C repo maintenance start 2>err &&
+	test_must_fail env PATH="$PWD/script:$PATH" git -C repo maintenance start --scheduler=systemd 2>err &&
 	test_grep "Another scheduled git-maintenance(1) process seems to be running" err
 '
 

base-commit: 2f2daa412d02135a1f1f00da3f424217467f8d16
-- 
2.47.0.274.g962d0b743d.dirty


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] t7900: fix host-dependent behaviour when testing git-maintenance(1)
  2024-11-25  5:33   ` [PATCH] t7900: fix host-dependent behaviour when testing git-maintenance(1) Patrick Steinhardt
@ 2024-11-25 11:52     ` Jeff King
  0 siblings, 0 replies; 5+ messages in thread
From: Jeff King @ 2024-11-25 11:52 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Miguel Rincon Barahona, Kev Kloss, Junio C Hamano

On Mon, Nov 25, 2024 at 06:33:41AM +0100, Patrick Steinhardt wrote:

> We have recently added a new test to t7900 that exercises whether
> git-maintenance(1) fails as expected when the "schedule.lock" file
> exists. The test depends on whether or not the host has the required
> executables present to schedule maintenance tasks in the first place,
> like systemd or launchctl -- if not, the test fails with an unrelated
> error before even checking for the lock file. This fails for example in
> our CI systems, where macOS images do not have launchctl available.
> 
> Fix this issue by creating a stub systemctl(1) binary and using the
> systemd scheduler.

Thanks, this explanation makes sense and clears up the CI issues I saw.

>  test_expect_success 'maintenance aborts with existing lock file' '
> -	test_when_finished "rm -rf repo" &&
> +	test_when_finished "rm -rf repo script" &&
> +	mkdir script &&
> +	write_script script/systemctl <<-\EOF &&
> +	true
> +	EOF
> +
>  	git init repo &&
>  	: >repo/.git/objects/schedule.lock &&
> -	test_must_fail git -C repo maintenance start 2>err &&
> +	test_must_fail env PATH="$PWD/script:$PATH" git -C repo maintenance start --scheduler=systemd 2>err &&
>  	test_grep "Another scheduled git-maintenance(1) process seems to be running" err
>  '

As always, I am never sure whether to use $PWD or $(pwd) for the benefit
of Windows. But digging up this message:

  https://lore.kernel.org/git/2b69d098-92ef-77b0-367a-516e9edbe257@kdbg.org/

I think $PWD is right here.

-Peff

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2024-11-25 11:52 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-19 10:48 [PATCH] builtin/gc: provide hint when maintenance hits a stale schedule lock Patrick Steinhardt
2024-11-19 17:17 ` Justin Tobler
2024-11-22 15:30 ` Jeff King
2024-11-25  5:33   ` [PATCH] t7900: fix host-dependent behaviour when testing git-maintenance(1) Patrick Steinhardt
2024-11-25 11:52     ` Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox