[PATCH] config: retry acquiring config.lock for 100ms

Git development
 help / color / mirror / Atom feed

* [PATCH] config: retry acquiring config.lock for 100ms
@ 2026-04-03 10:01 Joerg Thalheim
  2026-04-03 17:53 ` Junio C Hamano
  2026-04-08 10:34 ` Patrick Steinhardt
  0 siblings, 2 replies; 11+ messages in thread
From: Joerg Thalheim @ 2026-04-03 10:01 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Patrick Steinhardt, Jörg Thalheim

From: Jörg Thalheim <joerg@thalheim.io>

When multiple processes write to a config file concurrently, they
contend on its ".lock" file, which is acquired via open(O_EXCL) with
no retry. The losers fail immediately with "could not lock config
file". Two processes writing unrelated keys (say, "branch.a.remote"
and "branch.b.remote") have no semantic conflict, yet one of them
fails for a purely mechanical reason.

This bites in practice when running `git worktree add -b` concurrently
against the same repository. Each invocation makes several writes to
".git/config" to set up branch tracking, and tooling that creates
worktrees in parallel sees intermittent failures. Worse, `git worktree
add` does not propagate the failed config write to its exit code: the
worktree is created and the command exits 0, but tracking
configuration is silently dropped.

The lock is held only for the duration of rewriting a small file, so
retrying for 100 ms papers over any realistic contention while still
failing fast if a stale lock has been left behind by a crashed
process. This mirrors what we already do for individual reference
locks (4ff0f01cb7 (refs: retry acquiring reference locks for 100ms,
2017-08-21)).

Signed-off-by: Jörg Thalheim <joerg@thalheim.io>
---
 config.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/config.c b/config.c
index 156f2a24fa..f7aff8725d 100644
--- a/config.c
+++ b/config.c
@@ -2903,6 +2903,14 @@ char *git_config_prepare_comment_string(const char *comment)
 	return prepared;
 }

+/*
+ * How long to retry acquiring config.lock when another process holds it.
+ * The lock is held only for the duration of rewriting a small file, so
+ * 100 ms covers any realistic contention while still failing fast if
+ * a stale lock has been left behind by a crashed process.
+ */
+#define CONFIG_LOCK_TIMEOUT_MS 100
+
 static void validate_comment_string(const char *comment)
 {
 	size_t leading_blanks;
@@ -2986,7 +2994,8 @@ int repo_config_set_multivar_in_file_gently(struct repository *r,
 	 * The lock serves a purpose in addition to locking: the new
 	 * contents of .git/config will be written into it.
 	 */
-	fd = hold_lock_file_for_update(&lock, config_filename, 0);
+	fd = hold_lock_file_for_update_timeout(&lock, config_filename, 0,
+					       CONFIG_LOCK_TIMEOUT_MS);
 	if (fd < 0) {
 		error_errno(_("could not lock config file %s"), config_filename);
 		ret = CONFIG_NO_LOCK;
@@ -3331,7 +3340,8 @@ static int repo_config_copy_or_rename_section_in_file(
 	if (!config_filename)
 		config_filename = filename_buf = repo_git_path(r, "config");

-	out_fd = hold_lock_file_for_update(&lock, config_filename, 0);
+	out_fd = hold_lock_file_for_update_timeout(&lock, config_filename, 0,
+						   CONFIG_LOCK_TIMEOUT_MS);
 	if (out_fd < 0) {
 		ret = error(_("could not lock config file %s"), config_filename);
 		goto out;
-- 
2.53.0

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] config: retry acquiring config.lock for 100ms
  2026-04-03 10:01 [PATCH] config: retry acquiring config.lock for 100ms Joerg Thalheim
@ 2026-04-03 17:53 ` Junio C Hamano
  2026-04-08 10:34 ` Patrick Steinhardt
  1 sibling, 0 replies; 11+ messages in thread
From: Junio C Hamano @ 2026-04-03 17:53 UTC (permalink / raw)
  To: Joerg Thalheim; +Cc: git, Patrick Steinhardt

Joerg Thalheim <joerg@thalheim.io> writes:

> From: Jörg Thalheim <joerg@thalheim.io>
>
> When multiple processes write to a config file concurrently, they
> contend on its ".lock" file, which is acquired via open(O_EXCL) with
> no retry. The losers fail immediately with "could not lock config
> file". Two processes writing unrelated keys (say, "branch.a.remote"
> and "branch.b.remote") have no semantic conflict, yet one of them
> fails for a purely mechanical reason.
>
> This bites in practice when running `git worktree add -b` concurrently
> against the same repository. Each invocation makes several writes to
> ".git/config" to set up branch tracking, and tooling that creates
> worktrees in parallel sees intermittent failures. Worse, `git worktree
> add` does not propagate the failed config write to its exit code: the
> worktree is created and the command exits 0, but tracking
> configuration is silently dropped.
>
> The lock is held only for the duration of rewriting a small file, so
> retrying for 100 ms papers over any realistic contention while still
> failing fast if a stale lock has been left behind by a crashed
> process. This mirrors what we already do for individual reference
> locks (4ff0f01cb7 (refs: retry acquiring reference locks for 100ms,
> 2017-08-21)).
>
> Signed-off-by: Jörg Thalheim <joerg@thalheim.io>
> ---
>  config.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)

OK.  This is consistent with the default used for a ref update with
files backend.  Various subsystems use their own value randomly
chosen out of thin air, it seems.  credential-store uses 1000ms, gc
uses 150ms to interact with launchctl, refs have their own per
backend, etc.

> diff --git a/config.c b/config.c
> index 156f2a24fa..f7aff8725d 100644
> --- a/config.c
> +++ b/config.c
> @@ -2903,6 +2903,14 @@ char *git_config_prepare_comment_string(const char *comment)
>  	return prepared;
>  }
>  
> +/*
> + * How long to retry acquiring config.lock when another process holds it.
> + * The lock is held only for the duration of rewriting a small file, so
> + * 100 ms covers any realistic contention while still failing fast if
> + * a stale lock has been left behind by a crashed process.
> + */
> +#define CONFIG_LOCK_TIMEOUT_MS 100
> +

Making this configurable would make a cute chicken-and-egg problem?

;-)  No, no need for that---just kidding.

Will queue.  Thanks.

> @@ -2986,7 +2994,8 @@ int repo_config_set_multivar_in_file_gently(struct repository *r,
>  	 * The lock serves a purpose in addition to locking: the new
>  	 * contents of .git/config will be written into it.
>  	 */
> -	fd = hold_lock_file_for_update(&lock, config_filename, 0);
> +	fd = hold_lock_file_for_update_timeout(&lock, config_filename, 0,
> +					       CONFIG_LOCK_TIMEOUT_MS);
>  	if (fd < 0) {
>  		error_errno(_("could not lock config file %s"), config_filename);
>  		ret = CONFIG_NO_LOCK;
> @@ -3331,7 +3340,8 @@ static int repo_config_copy_or_rename_section_in_file(
>  	if (!config_filename)
>  		config_filename = filename_buf = repo_git_path(r, "config");
>  
> -	out_fd = hold_lock_file_for_update(&lock, config_filename, 0);
> +	out_fd = hold_lock_file_for_update_timeout(&lock, config_filename, 0,
> +						   CONFIG_LOCK_TIMEOUT_MS);
>  	if (out_fd < 0) {
>  		ret = error(_("could not lock config file %s"), config_filename);
>  		goto out;

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] config: retry acquiring config.lock for 100ms
  2026-04-03 10:01 [PATCH] config: retry acquiring config.lock for 100ms Joerg Thalheim
  2026-04-03 17:53 ` Junio C Hamano
@ 2026-04-08 10:34 ` Patrick Steinhardt
  2026-05-11  2:32   ` Junio C Hamano
  1 sibling, 1 reply; 11+ messages in thread
From: Patrick Steinhardt @ 2026-04-08 10:34 UTC (permalink / raw)
  To: Joerg Thalheim; +Cc: git, Junio C Hamano

On Fri, Apr 03, 2026 at 12:01:35PM +0200, Joerg Thalheim wrote:
> From: Jörg Thalheim <joerg@thalheim.io>
> 
> When multiple processes write to a config file concurrently, they
> contend on its ".lock" file, which is acquired via open(O_EXCL) with
> no retry. The losers fail immediately with "could not lock config
> file". Two processes writing unrelated keys (say, "branch.a.remote"
> and "branch.b.remote") have no semantic conflict, yet one of them
> fails for a purely mechanical reason.
> 
> This bites in practice when running `git worktree add -b` concurrently
> against the same repository. Each invocation makes several writes to
> ".git/config" to set up branch tracking, and tooling that creates
> worktrees in parallel sees intermittent failures. Worse, `git worktree
> add` does not propagate the failed config write to its exit code: the
> worktree is created and the command exits 0, but tracking
> configuration is silently dropped.

This very much sounds like a bug that is worth fixing independently.

> The lock is held only for the duration of rewriting a small file, so
> retrying for 100 ms papers over any realistic contention while still
> failing fast if a stale lock has been left behind by a crashed
> process. This mirrors what we already do for individual reference
> locks (4ff0f01cb7 (refs: retry acquiring reference locks for 100ms,
> 2017-08-21)).

Famous last words :) Experience tells me that any timeout value that
isn't excessive will eventually be hit in some production system. Which
raises the question whether we want to make the timeout configurable,
similar to "core.filesRefLockTimeout" and "core.packedRefsTimeout".

> diff --git a/config.c b/config.c
> index 156f2a24fa..f7aff8725d 100644
> --- a/config.c
> +++ b/config.c
> @@ -2903,6 +2903,14 @@ char *git_config_prepare_comment_string(const char *comment)
>  	return prepared;
>  }
>  
> +/*
> + * How long to retry acquiring config.lock when another process holds it.
> + * The lock is held only for the duration of rewriting a small file, so
> + * 100 ms covers any realistic contention while still failing fast if
> + * a stale lock has been left behind by a crashed process.
> + */
> +#define CONFIG_LOCK_TIMEOUT_MS 100
> +
>  static void validate_comment_string(const char *comment)
>  {
>  	size_t leading_blanks;
> @@ -2986,7 +2994,8 @@ int repo_config_set_multivar_in_file_gently(struct repository *r,
>  	 * The lock serves a purpose in addition to locking: the new
>  	 * contents of .git/config will be written into it.
>  	 */
> -	fd = hold_lock_file_for_update(&lock, config_filename, 0);
> +	fd = hold_lock_file_for_update_timeout(&lock, config_filename, 0,
> +					       CONFIG_LOCK_TIMEOUT_MS);
>  	if (fd < 0) {
>  		error_errno(_("could not lock config file %s"), config_filename);
>  		ret = CONFIG_NO_LOCK;
> @@ -3331,7 +3340,8 @@ static int repo_config_copy_or_rename_section_in_file(
>  	if (!config_filename)
>  		config_filename = filename_buf = repo_git_path(r, "config");
>  
> -	out_fd = hold_lock_file_for_update(&lock, config_filename, 0);
> +	out_fd = hold_lock_file_for_update_timeout(&lock, config_filename, 0,
> +						   CONFIG_LOCK_TIMEOUT_MS);
>  	if (out_fd < 0) {
>  		ret = error(_("could not lock config file %s"), config_filename);
>  		goto out;

Okay, so we now handle this more gracefully with concurrent writers. But
even if we handle that more gracefully, we still don't really have any
guarantee that the outcome will be reasonable, or even close.

The main difference to the locking timeouts we have for refs is that we
have an easy way to check for conflicts there: we simpliy verify that
the refs we want to update haven't moved from their expected value. We
have no such thing for our configuration though, so two processes that
race with one another will happily overwrite each other's changes.

Honestly though, I'm not really sure what to make with this. We could
of course also add some validation that the configuration we want to set
hasn't been modified meanwhile. But that would now lead to a situation
where we have to update every single caller in our tree to make use of
the new mechanism, which would be a bunch of work.

And adding the timeout doesn't really change the status quo, either. We
already have the case that we'll happily overwrite changes made by
concurrent processes. The only thing that changes is that we make it
more likely for concurrent changes to succeed.

So I'm a bit cautious, but I think overall I'm fine with this?

Thanks!

Patrick

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] config: retry acquiring config.lock for 100ms
  2026-04-08 10:34 ` Patrick Steinhardt
@ 2026-05-11  2:32   ` Junio C Hamano
  2026-05-11  7:33     ` Patrick Steinhardt
  2026-05-11  9:06     ` Jörg Thalheim
  0 siblings, 2 replies; 11+ messages in thread
From: Junio C Hamano @ 2026-05-11  2:32 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Joerg Thalheim, git

Patrick Steinhardt <ps@pks.im> writes:

>> This bites in practice when running `git worktree add -b` concurrently
>> against the same repository. Each invocation makes several writes to
>> ".git/config" to set up branch tracking, and tooling that creates
>> worktrees in parallel sees intermittent failures. Worse, `git worktree
>> add` does not propagate the failed config write to its exit code: the
>> worktree is created and the command exits 0, but tracking
>> configuration is silently dropped.
>
> This very much sounds like a bug that is worth fixing independently.
>
>> The lock is held only for the duration of rewriting a small file, so
>> retrying for 100 ms papers over any realistic contention while still
>> failing fast if a stale lock has been left behind by a crashed
>> process. This mirrors what we already do for individual reference
>> locks (4ff0f01cb7 (refs: retry acquiring reference locks for 100ms,
>> 2017-08-21)).
>
> Famous last words :) Experience tells me that any timeout value that
> isn't excessive will eventually be hit in some production system. Which
> raises the question whether we want to make the timeout configurable,
> similar to "core.filesRefLockTimeout" and "core.packedRefsTimeout".
> ...
> Honestly though, I'm not really sure what to make with this. We could
> of course also add some validation that the configuration we want to set
> hasn't been modified meanwhile. But that would now lead to a situation
> where we have to update every single caller in our tree to make use of
> the new mechanism, which would be a bunch of work.
>
> And adding the timeout doesn't really change the status quo, either. We
> already have the case that we'll happily overwrite changes made by
> concurrent processes. The only thing that changes is that we make it
> more likely for concurrent changes to succeed.

We haven't heard any response to these points raised in the message
I am responding to.  Should I still keep the patch in my tree,
hoping that a responses may come some day?  I am tempted to discard
the topic as it has been quite a while since we last looked at it.

Thanks.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] config: retry acquiring config.lock for 100ms
  2026-05-11  2:32   ` Junio C Hamano
@ 2026-05-11  7:33     ` Patrick Steinhardt
  2026-05-11  9:06     ` Jörg Thalheim
  1 sibling, 0 replies; 11+ messages in thread
From: Patrick Steinhardt @ 2026-05-11  7:33 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Joerg Thalheim, git

On Mon, May 11, 2026 at 11:32:33AM +0900, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> >> This bites in practice when running `git worktree add -b` concurrently
> >> against the same repository. Each invocation makes several writes to
> >> ".git/config" to set up branch tracking, and tooling that creates
> >> worktrees in parallel sees intermittent failures. Worse, `git worktree
> >> add` does not propagate the failed config write to its exit code: the
> >> worktree is created and the command exits 0, but tracking
> >> configuration is silently dropped.
> >
> > This very much sounds like a bug that is worth fixing independently.
> >
> >> The lock is held only for the duration of rewriting a small file, so
> >> retrying for 100 ms papers over any realistic contention while still
> >> failing fast if a stale lock has been left behind by a crashed
> >> process. This mirrors what we already do for individual reference
> >> locks (4ff0f01cb7 (refs: retry acquiring reference locks for 100ms,
> >> 2017-08-21)).
> >
> > Famous last words :) Experience tells me that any timeout value that
> > isn't excessive will eventually be hit in some production system. Which
> > raises the question whether we want to make the timeout configurable,
> > similar to "core.filesRefLockTimeout" and "core.packedRefsTimeout".
> > ...
> > Honestly though, I'm not really sure what to make with this. We could
> > of course also add some validation that the configuration we want to set
> > hasn't been modified meanwhile. But that would now lead to a situation
> > where we have to update every single caller in our tree to make use of
> > the new mechanism, which would be a bunch of work.
> >
> > And adding the timeout doesn't really change the status quo, either. We
> > already have the case that we'll happily overwrite changes made by
> > concurrent processes. The only thing that changes is that we make it
> > more likely for concurrent changes to succeed.
> 
> We haven't heard any response to these points raised in the message
> I am responding to.  Should I still keep the patch in my tree,
> hoping that a responses may come some day?  I am tempted to discard
> the topic as it has been quite a while since we last looked at it.

Same here, let's discard it for now as it can be easily added back in
once a new version was posted or the feedback has been addressed.

Patrick

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] config: retry acquiring config.lock for 100ms
  2026-05-11  2:32   ` Junio C Hamano
  2026-05-11  7:33     ` Patrick Steinhardt
@ 2026-05-11  9:06     ` Jörg Thalheim
  2026-05-11 10:01       ` Patrick Steinhardt
  1 sibling, 1 reply; 11+ messages in thread
From: Jörg Thalheim @ 2026-05-11  9:06 UTC (permalink / raw)
  To: Junio C Hamano, Patrick Steinhardt; +Cc: git

I am not really sure what you want me to do here.
I don't see how git can have this value configurable, given it's about reading the configuration itself.
Is the user supposed via command line?

Meanwhile the project I was facing this issue, added the required file lock in its own code,
which has since then worked perfectly to fix my use case: https://github.com/raine/workmux/issues/116



May 11, 2026 at 4:32 AM, "Junio C Hamano" <gitster@pobox.com mailto:gitster@pobox.com?to=%22Junio%20C%20Hamano%22%20%3Cgitster%40pobox.com%3E > wrote:


> 
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > 
> > > 
> > > This bites in practice when running `git worktree add -b` concurrently
> > >  against the same repository. Each invocation makes several writes to
> > >  ".git/config" to set up branch tracking, and tooling that creates
> > >  worktrees in parallel sees intermittent failures. Worse, `git worktree
> > >  add` does not propagate the failed config write to its exit code: the
> > >  worktree is created and the command exits 0, but tracking
> > >  configuration is silently dropped.
> > > 
> >  This very much sounds like a bug that is worth fixing independently.
> > 
> > > 
> > > The lock is held only for the duration of rewriting a small file, so
> > >  retrying for 100 ms papers over any realistic contention while still
> > >  failing fast if a stale lock has been left behind by a crashed
> > >  process. This mirrors what we already do for individual reference
> > >  locks (4ff0f01cb7 (refs: retry acquiring reference locks for 100ms,
> > >  2017-08-21)).
> > > 
> >  Famous last words :) Experience tells me that any timeout value that
> >  isn't excessive will eventually be hit in some production system. Which
> >  raises the question whether we want to make the timeout configurable,
> >  similar to "core.filesRefLockTimeout" and "core.packedRefsTimeout".
> >  ...
> >  Honestly though, I'm not really sure what to make with this. We could
> >  of course also add some validation that the configuration we want to set
> >  hasn't been modified meanwhile. But that would now lead to a situation
> >  where we have to update every single caller in our tree to make use of
> >  the new mechanism, which would be a bunch of work.
> > 
> >  And adding the timeout doesn't really change the status quo, either. We
> >  already have the case that we'll happily overwrite changes made by
> >  concurrent processes. The only thing that changes is that we make it
> >  more likely for concurrent changes to succeed.
> > 
> We haven't heard any response to these points raised in the message
> I am responding to. Should I still keep the patch in my tree,
> hoping that a responses may come some day? I am tempted to discard
> the topic as it has been quite a while since we last looked at it.
> 
> Thanks.
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] config: retry acquiring config.lock for 100ms
  2026-05-11  9:06     ` Jörg Thalheim
@ 2026-05-11 10:01       ` Patrick Steinhardt
  2026-05-17 11:27         ` Johannes Schindelin
  0 siblings, 1 reply; 11+ messages in thread
From: Patrick Steinhardt @ 2026-05-11 10:01 UTC (permalink / raw)
  To: Jörg Thalheim; +Cc: Junio C Hamano, git

On Mon, May 11, 2026 at 09:06:00AM +0000, Jörg Thalheim wrote:
> May 11, 2026 at 4:32 AM, "Junio C Hamano" <gitster@pobox.com mailto:gitster@pobox.com?to=%22Junio%20C%20Hamano%22%20%3Cgitster%40pobox.com%3E > wrote:
> > Patrick Steinhardt <ps@pks.im> writes:
> > > > This bites in practice when running `git worktree add -b` concurrently
> > > >  against the same repository. Each invocation makes several writes to
> > > >  ".git/config" to set up branch tracking, and tooling that creates
> > > >  worktrees in parallel sees intermittent failures. Worse, `git worktree
> > > >  add` does not propagate the failed config write to its exit code: the
> > > >  worktree is created and the command exits 0, but tracking
> > > >  configuration is silently dropped.
> > > > 
> > >  This very much sounds like a bug that is worth fixing independently.
> > > 
> > > > 
> > > > The lock is held only for the duration of rewriting a small file, so
> > > >  retrying for 100 ms papers over any realistic contention while still
> > > >  failing fast if a stale lock has been left behind by a crashed
> > > >  process. This mirrors what we already do for individual reference
> > > >  locks (4ff0f01cb7 (refs: retry acquiring reference locks for 100ms,
> > > >  2017-08-21)).
> > > > 
> > >  Famous last words :) Experience tells me that any timeout value that
> > >  isn't excessive will eventually be hit in some production system. Which
> > >  raises the question whether we want to make the timeout configurable,
> > >  similar to "core.filesRefLockTimeout" and "core.packedRefsTimeout".
> > >  ...
> > >  Honestly though, I'm not really sure what to make with this. We could
> > >  of course also add some validation that the configuration we want to set
> > >  hasn't been modified meanwhile. But that would now lead to a situation
> > >  where we have to update every single caller in our tree to make use of
> > >  the new mechanism, which would be a bunch of work.
> > > 
> > >  And adding the timeout doesn't really change the status quo, either. We
> > >  already have the case that we'll happily overwrite changes made by
> > >  concurrent processes. The only thing that changes is that we make it
> > >  more likely for concurrent changes to succeed.
> > > 
> > We haven't heard any response to these points raised in the message
> > I am responding to. Should I still keep the patch in my tree,
> > hoping that a responses may come some day? I am tempted to discard
> > the topic as it has been quite a while since we last looked at it.
> 
> I am not really sure what you want me to do here.

In general, the idea here is to engage in a discussion that can
ultimately lead to one of two outcomes:

  - The discussion surfaces an area the author hasn't thought about, so
    the patch is adapted accordingly.

  - The discussion shows that the author already did think about the
    issue, but hasn't documented the assumptions. In this case, it
    should be the commit message that gets adapted.

> I don't see how git can have this value configurable, given it's about
> reading the configuration itself. Is the user supposed via command
> line?

This is a fair point indeed. But if it's not possible to change via the
configuration itself, then the next-best thing might be to introduce an
environment variable that allows configuring it.

The other aspect that wasn't discussed in the commit message is how
concurrent writes are handled, both when they are non-conflicting
(updating different keys) and when they are conflicting (updating the
same key). After spending some more time in the code I think it's
ultimately nothing we have to worry about too much, as we only start
reading the configuration after we've locked it.

So in the semantically non-conflicting case there isn't really much of a
race, because things already work as expected. But in the semantically
conflicting case it's a bit different, as the latter writer will
overwrite the result of the former one. In theory it would be possible
to detect such conflicts by:

  - Reading the configuration file.

  - Taking the lock.

  - Rereading the configuration to check for conflicts.

But even that is racy as the first writer might have succeeded before we
read the configuration the first time. So I'm not sure whether we can do
anything about that in the first place, as the race basically exists in
the outer loop controlled by the caller.

So there probably isn't much we can do about that, and unless I missed
something I think your timeout is sensible. But ideally, such nuances
would be discussed as part of the commit message so that reviewers and
future readers are made aware of them.

Thanks!

Patrick

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] config: retry acquiring config.lock for 100ms
  2026-05-11 10:01       ` Patrick Steinhardt
@ 2026-05-17 11:27         ` Johannes Schindelin
  2026-05-17 13:21           ` [PATCH v2] config: retry acquiring config.lock, configurable via core.configLockTimeout Joerg Thalheim
  0 siblings, 1 reply; 11+ messages in thread
From: Johannes Schindelin @ 2026-05-17 11:27 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Jörg Thalheim, Junio C Hamano, git

[-- Attachment #1: Type: text/plain, Size: 6855 bytes --]

Hi Patrick & Jörg,

On Mon, 11 May 2026, Patrick Steinhardt wrote:

> On Mon, May 11, 2026 at 09:06:00AM +0000, Jörg Thalheim wrote:
> > May 11, 2026 at 4:32 AM, "Junio C Hamano" <gitster@pobox.com
> > mailto:gitster@pobox.com?to=%22Junio%20C%20Hamano%22%20%3Cgitster%40pobox.com%3E
> > > wrote:
> > > Patrick Steinhardt <ps@pks.im> writes:
> > > > > This bites in practice when running `git worktree add -b` concurrently
> > > > >  against the same repository. Each invocation makes several writes to
> > > > >  ".git/config" to set up branch tracking, and tooling that creates
> > > > >  worktrees in parallel sees intermittent failures. Worse, `git worktree
> > > > >  add` does not propagate the failed config write to its exit code: the
> > > > >  worktree is created and the command exits 0, but tracking
> > > > >  configuration is silently dropped.
> > > > > 
> > > >  This very much sounds like a bug that is worth fixing independently.
> > > > 
> > > > > 
> > > > > The lock is held only for the duration of rewriting a small file, so
> > > > >  retrying for 100 ms papers over any realistic contention while still
> > > > >  failing fast if a stale lock has been left behind by a crashed
> > > > >  process. This mirrors what we already do for individual reference
> > > > >  locks (4ff0f01cb7 (refs: retry acquiring reference locks for 100ms,
> > > > >  2017-08-21)).
> > > > > 
> > > >  Famous last words :) Experience tells me that any timeout value that
> > > >  isn't excessive will eventually be hit in some production system. Which
> > > >  raises the question whether we want to make the timeout configurable,
> > > >  similar to "core.filesRefLockTimeout" and "core.packedRefsTimeout".
> > > >  ...
> > > >  Honestly though, I'm not really sure what to make with this. We could
> > > >  of course also add some validation that the configuration we want to set
> > > >  hasn't been modified meanwhile. But that would now lead to a situation
> > > >  where we have to update every single caller in our tree to make use of
> > > >  the new mechanism, which would be a bunch of work.
> > > > 
> > > >  And adding the timeout doesn't really change the status quo, either. We
> > > >  already have the case that we'll happily overwrite changes made by
> > > >  concurrent processes. The only thing that changes is that we make it
> > > >  more likely for concurrent changes to succeed.
> > > > 
> > > We haven't heard any response to these points raised in the message
> > > I am responding to. Should I still keep the patch in my tree,
> > > hoping that a responses may come some day? I am tempted to discard
> > > the topic as it has been quite a while since we last looked at it.
> > 
> > I am not really sure what you want me to do here.
> 
> In general, the idea here is to engage in a discussion that can
> ultimately lead to one of two outcomes:
> 
>   - The discussion surfaces an area the author hasn't thought about, so
>     the patch is adapted accordingly.
> 
>   - The discussion shows that the author already did think about the
>     issue, but hasn't documented the assumptions. In this case, it
>     should be the commit message that gets adapted.

For what it's worth, I meant to chime in earlier, but obligations kept
preventing me from setting aside the time to do so. Well, better late than
never.

> > I don't see how git can have this value configurable, given it's about
> > reading the configuration itself. Is the user supposed via command
> > line?
> 
> This is a fair point indeed. But if it's not possible to change via the
> configuration itself, then the next-best thing might be to introduce an
> environment variable that allows configuring it.

Well, given that the config is read first before it's written, it is
totally possible to configure a timeout via the config, and I have some
real-world proof that this works as intended (see below).

> The other aspect that wasn't discussed in the commit message is how
> concurrent writes are handled, both when they are non-conflicting
> (updating different keys) and when they are conflicting (updating the
> same key). After spending some more time in the code I think it's
> ultimately nothing we have to worry about too much, as we only start
> reading the configuration after we've locked it.

Correct. I had performed this analysis myself when writing a similar patch
to fix problems in Scalar's Functional Test suite, which wants to register
_many_ Scalar repositories with ~/.gitconfig concurrently. The current
iteration of the patch can be found here:

https://github.com/microsoft/git/commit/a1c2d97cb61bc3697086d1749de848586df2ec54

It does include the config setting, leaving the default as "off" (but I
missed the separate code path to rename sections, which has _independent_
code that also wants to lock the config file, which your patch did not
miss). The subsequent child commit

https://github.com/microsoft/git/commit/5d365c1f332b8d2214ae9c44970d6370ed9caffc

configures it to 150ms in Scalar repositories only. This is notably larger
than the 100ms you suggested, and it is rooted in the fact that NTFS I/O
characteristics are unfortunately in need of a wider margin. In other
words, the optimal value depends on the operating system (and the CPU
load, as Junio had pointed out).

For the record, feel free to adopt whatever you want from my patches for
your next iteration (but also feel free to ignore all of it).

> So in the semantically non-conflicting case there isn't really much of a
> race, because things already work as expected. But in the semantically
> conflicting case it's a bit different, as the latter writer will
> overwrite the result of the former one. In theory it would be possible
> to detect such conflicts by:
> 
>   - Reading the configuration file.
> 
>   - Taking the lock.
> 
>   - Rereading the configuration to check for conflicts.
> 
> But even that is racy as the first writer might have succeeded before we
> read the configuration the first time. So I'm not sure whether we can do
> anything about that in the first place, as the race basically exists in
> the outer loop controlled by the caller.
> 
> So there probably isn't much we can do about that, and unless I missed
> something I think your timeout is sensible. But ideally, such nuances
> would be discussed as part of the commit message so that reviewers and
> future readers are made aware of them.

I agree. Complex cases like this would require a sort of transactional
support to be added to `git config`, and that would in and of itself open
a can of worms I'm not sure we should open unless there is a concrete use
case that bites enough real-world scenarios to require acting upon.

Ciao,
Johannes

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v2] config: retry acquiring config.lock, configurable via core.configLockTimeout
  2026-05-17 11:27         ` Johannes Schindelin
@ 2026-05-17 13:21           ` Joerg Thalheim
  2026-05-18  0:46             ` Junio C Hamano
  0 siblings, 1 reply; 11+ messages in thread
From: Joerg Thalheim @ 2026-05-17 13:21 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Patrick Steinhardt, Johannes Schindelin,
	Jörg Thalheim

From: Jörg Thalheim <joerg@thalheim.io>

Concurrent config writers race for the ".lock" file, which is taken
with open(O_EXCL) and no retry, so the losers fail right away with
"could not lock config file".

This shows up with parallel "git worktree add -b" against the same
repository: each one writes a couple of branch.* keys and the losers
fail at random. Worse, "git worktree add" doesn't propagate that
failure to its exit code, so the tracking config is silently dropped.
(The swallowed error is a separate bug.)

Retry instead of giving up on the first EEXIST. The lock is only held
while rewriting a small file, so the loser only has to wait out the
other writers. Same approach as 4ff0f01cb7 (refs: retry acquiring
reference locks for 100ms, 2017-08-21).

On the semantics: the on-disk config is read only after the lock is
taken, so writers touching different keys can't lose each other's
change. Writers touching the same key still get last-writer-wins, but
that is already the case today and would need a compare-and-swap config
API to fix. The retry only turns hard failures into successes.

Default to 1000ms, like core.packedRefsTimeout: same shape of problem,
one shared file everyone serializes through. A larger timeout only
costs anything when a stale lock is left behind by a crash, which is
rare; a smaller one fails spuriously on slow filesystems (NTFS has
been seen needing more than 100ms). Make it configurable as
core.configLockTimeout. There is no chicken-and-egg problem: we read
the config before we lock it.

microsoft/git carries a similar patch (core.configWriteLockTimeoutMS,
default off) for Scalar's tests. Defaulting to non-zero here because
the worktree case fails silently.

Helped-by: Patrick Steinhardt <ps@pks.im>
Helped-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Jörg Thalheim <joerg@thalheim.io>
---
Thanks for the review and for poking me, this had fallen off my radar.

v1 -> v2:

- added core.configLockTimeout. Johannes is right that there is no
  chicken-and-egg problem (config is read before the lock), so no env
  var needed.
- default bumped to 1000ms; packed-refs is the closer precedent and it
  keeps NTFS out of trouble.
- commit message now covers the read-after-lock / last-writer-wins
  semantics Patrick asked about.
- added tests; existing stale-lock tests in t3200/t5505 now pass
  -c core.configLockTimeout=0 so they still fail fast.

I matched the core.filesRefLockTimeout naming rather than reusing
microsoft/git's core.configWriteLockTimeoutMS, but can switch if the
downstream compat matters more.

 Documentation/config/core.adoc |  8 ++++++++
 config.c                       | 24 ++++++++++++++++++++++--
 t/t1300-config.sh              | 17 +++++++++++++++++
 t/t3200-branch.sh              |  6 ++++--
 t/t5505-remote.sh              |  3 ++-
 5 files changed, 53 insertions(+), 5 deletions(-)

diff --git a/Documentation/config/core.adoc b/Documentation/config/core.adoc
index a0ebf03e2e..340329edc3 100644
--- a/Documentation/config/core.adoc
+++ b/Documentation/config/core.adoc
@@ -589,6 +589,14 @@ core.packedRefsTimeout::
 	all; -1 means to try indefinitely. Default is 1000 (i.e.,
 	retry for 1 second).
 
+core.configLockTimeout::
+	The length of time, in milliseconds, to retry when trying to
+	lock a configuration file for writing. Value 0 means not to
+	retry at all; -1 means to try indefinitely. Default is 1000
+	(i.e., retry for 1 second). This is read from the configuration
+	that is already on disk before the lock is taken, so it can be
+	set persistently like any other option.
+
 core.pager::
 	Text viewer for use by Git commands (e.g., 'less').  The value
 	is meant to be interpreted by the shell.  The order of preference
diff --git a/config.c b/config.c
index 156f2a24fa..ac9563781b 100644
--- a/config.c
+++ b/config.c
@@ -2903,6 +2903,24 @@ char *git_config_prepare_comment_string(const char *comment)
 	return prepared;
 }
 
+/*
+ * How long to retry acquiring config.lock when another process holds
+ * it. Default matches core.packedRefsTimeout; override via
+ * core.configLockTimeout.
+ */
+static long config_lock_timeout_ms(struct repository *r)
+{
+	static int configured;
+	static int timeout_ms = 1000;
+
+	if (!configured) {
+		repo_config_get_int(r, "core.configlocktimeout", &timeout_ms);
+		configured = 1;
+	}
+
+	return timeout_ms;
+}
+
 static void validate_comment_string(const char *comment)
 {
 	size_t leading_blanks;
@@ -2986,7 +3004,8 @@ int repo_config_set_multivar_in_file_gently(struct repository *r,
 	 * The lock serves a purpose in addition to locking: the new
 	 * contents of .git/config will be written into it.
 	 */
-	fd = hold_lock_file_for_update(&lock, config_filename, 0);
+	fd = hold_lock_file_for_update_timeout(&lock, config_filename, 0,
+					       config_lock_timeout_ms(r));
 	if (fd < 0) {
 		error_errno(_("could not lock config file %s"), config_filename);
 		ret = CONFIG_NO_LOCK;
@@ -3331,7 +3350,8 @@ static int repo_config_copy_or_rename_section_in_file(
 	if (!config_filename)
 		config_filename = filename_buf = repo_git_path(r, "config");
 
-	out_fd = hold_lock_file_for_update(&lock, config_filename, 0);
+	out_fd = hold_lock_file_for_update_timeout(&lock, config_filename, 0,
+						   config_lock_timeout_ms(r));
 	if (out_fd < 0) {
 		ret = error(_("could not lock config file %s"), config_filename);
 		goto out;
diff --git a/t/t1300-config.sh b/t/t1300-config.sh
index 128971ee12..12a1cb8580 100755
--- a/t/t1300-config.sh
+++ b/t/t1300-config.sh
@@ -2939,4 +2939,21 @@ test_expect_success 'writing value with trailing CR not stripped on read' '
 	test_cmp expect actual
 '
 
+test_expect_success 'writing config fails immediately with core.configLockTimeout=0' '
+	test_when_finished "rm -f .git/config.lock" &&
+	>.git/config.lock &&
+	test_must_fail git -c core.configLockTimeout=0 config foo.bar baz 2>err &&
+	test_grep "could not lock config file" err
+'
+
+test_expect_success 'writing config retries until lock is released' '
+	test_when_finished "rm -f .git/config.lock" &&
+	>.git/config.lock &&
+	{
+		( sleep 1 && rm -f .git/config.lock ) &
+	} &&
+	git -c core.configLockTimeout=5000 config retried.key value &&
+	test "$(git config retried.key)" = value
+'
+
 test_done
diff --git a/t/t3200-branch.sh b/t/t3200-branch.sh
index e7829c2c4b..5f8a31c21d 100755
--- a/t/t3200-branch.sh
+++ b/t/t3200-branch.sh
@@ -1037,7 +1037,8 @@ test_expect_success '--set-upstream-to fails on locked config' '
 	test_when_finished "rm -f .git/config.lock" &&
 	>.git/config.lock &&
 	git branch locked &&
-	test_must_fail git branch --set-upstream-to locked 2>err &&
+	test_must_fail git -c core.configLockTimeout=0 \
+		branch --set-upstream-to locked 2>err &&
 	test_grep "could not lock config file .git/config" err
 '
 
@@ -1068,7 +1069,8 @@ test_expect_success '--unset-upstream should fail if config is locked' '
 	test_when_finished "rm -f .git/config.lock" &&
 	git branch --set-upstream-to locked &&
 	>.git/config.lock &&
-	test_must_fail git branch --unset-upstream 2>err &&
+	test_must_fail git -c core.configLockTimeout=0 \
+		branch --unset-upstream 2>err &&
 	test_grep "could not lock config file .git/config" err
 '
 
diff --git a/t/t5505-remote.sh b/t/t5505-remote.sh
index e592c0bcde..aea9222649 100755
--- a/t/t5505-remote.sh
+++ b/t/t5505-remote.sh
@@ -1327,7 +1327,8 @@ test_expect_success 'remote set-url with locked config' '
 	test_when_finished "rm -f .git/config.lock" &&
 	git config --get-all remote.someremote.url >expect &&
 	>.git/config.lock &&
-	test_must_fail git remote set-url someremote baz &&
+	test_must_fail git -c core.configLockTimeout=0 \
+		remote set-url someremote baz &&
 	git config --get-all remote.someremote.url >actual &&
 	cmp expect actual
 '
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] config: retry acquiring config.lock, configurable via core.configLockTimeout
  2026-05-17 13:21           ` [PATCH v2] config: retry acquiring config.lock, configurable via core.configLockTimeout Joerg Thalheim
@ 2026-05-18  0:46             ` Junio C Hamano
  2026-05-18  8:07               ` Patrick Steinhardt
  0 siblings, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2026-05-18  0:46 UTC (permalink / raw)
  To: Joerg Thalheim; +Cc: git, Patrick Steinhardt, Johannes Schindelin

Joerg Thalheim <joerg@thalheim.io> writes:

> +/*
> + * How long to retry acquiring config.lock when another process holds
> + * it. Default matches core.packedRefsTimeout; override via
> + * core.configLockTimeout.
> + */
> +static long config_lock_timeout_ms(struct repository *r)
> +{
> +	static int configured;
> +	static int timeout_ms = 1000;
> +
> +	if (!configured) {
> +		repo_config_get_int(r, "core.configlocktimeout", &timeout_ms);
> +		configured = 1;
> +	}
> +
> +	return timeout_ms;
> +}

The above design means whichever repository happens to be passed for
the first time as "r" to this call will fix the return value from
the function for the rest of the system, meaning that the lock timeout
is a per-process property and the repository parameter passed to the
function does not matter all that much.

It may make sense to admit that this is not a per-repository
property (due to the use of local caching), have the function take
no parameter and use the_repository to the config_get call.  That
would make the intention more clear.

Of course the other end of the spectrum is to get rid of the
"configured" caching here, and ask the config system to make a
hashtable look-up every time the function is called.  That will keep
the lock timeout per-repository, which is closer to what the current
function signature suggests.

I dunno.  My gut feeling is that there aren't valid reasons why you
would want to specifically set different timeout values per
repository, so the simplicity of using the_repository (i.e. the
primary repository instance this process deals with) sounds like a
better way to go.

Thanks.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] config: retry acquiring config.lock, configurable via core.configLockTimeout
  2026-05-18  0:46             ` Junio C Hamano
@ 2026-05-18  8:07               ` Patrick Steinhardt
  0 siblings, 0 replies; 11+ messages in thread
From: Patrick Steinhardt @ 2026-05-18  8:07 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Joerg Thalheim, git, Johannes Schindelin

On Mon, May 18, 2026 at 09:46:05AM +0900, Junio C Hamano wrote:
> Joerg Thalheim <joerg@thalheim.io> writes:
> 
> > +/*
> > + * How long to retry acquiring config.lock when another process holds
> > + * it. Default matches core.packedRefsTimeout; override via
> > + * core.configLockTimeout.
> > + */
> > +static long config_lock_timeout_ms(struct repository *r)
> > +{
> > +	static int configured;
> > +	static int timeout_ms = 1000;
> > +
> > +	if (!configured) {
> > +		repo_config_get_int(r, "core.configlocktimeout", &timeout_ms);
> > +		configured = 1;
> > +	}
> > +
> > +	return timeout_ms;
> > +}
> 
> The above design means whichever repository happens to be passed for
> the first time as "r" to this call will fix the return value from
> the function for the rest of the system, meaning that the lock timeout
> is a per-process property and the repository parameter passed to the
> function does not matter all that much.
> 
> It may make sense to admit that this is not a per-repository
> property (due to the use of local caching), have the function take
> no parameter and use the_repository to the config_get call.  That
> would make the intention more clear.
> 
> Of course the other end of the spectrum is to get rid of the
> "configured" caching here, and ask the config system to make a
> hashtable look-up every time the function is called.  That will keep
> the lock timeout per-repository, which is closer to what the current
> function signature suggests.
> 
> I dunno.  My gut feeling is that there aren't valid reasons why you
> would want to specifically set different timeout values per
> repository, so the simplicity of using the_repository (i.e. the
> primary repository instance this process deals with) sounds like a
> better way to go.

There probably is no reason to have different values per repo. But to
me the question is whether there even are any use cases where we have to
lock the config file so often in quick succession that the caching
mechanism even matters. My gut feeling says no, also because parsing the
value from the configuration is going to be drowned out by actually
writing the lockfile and renaming it into place.

So I'd rather lean towards dropping the cache and keeping the repository
parameter.

Patrick

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2026-05-18  8:07 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-03 10:01 [PATCH] config: retry acquiring config.lock for 100ms Joerg Thalheim
2026-04-03 17:53 ` Junio C Hamano
2026-04-08 10:34 ` Patrick Steinhardt
2026-05-11  2:32   ` Junio C Hamano
2026-05-11  7:33     ` Patrick Steinhardt
2026-05-11  9:06     ` Jörg Thalheim
2026-05-11 10:01       ` Patrick Steinhardt
2026-05-17 11:27         ` Johannes Schindelin
2026-05-17 13:21           ` [PATCH v2] config: retry acquiring config.lock, configurable via core.configLockTimeout Joerg Thalheim
2026-05-18  0:46             ` Junio C Hamano
2026-05-18  8:07               ` Patrick Steinhardt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox