* [PATCH] Retry acquiring reference locks for 100ms
@ 2017-08-21 11:51 Michael Haggerty
From: Michael Haggerty @ 2017-08-21 11:51 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Jeff King, git, Michael Haggerty
The philosophy of reference locking has been, "if another process is
changing a reference, then whatever I'm trying to do to it will
probably fail anyway because my old-SHA-1 value is probably no longer
current". But this argument falls down if the other process has locked
the reference to do something that doesn't actually change the value
of the reference, such as `pack-refs` or `reflog expire`. There
actually *is* a decent chance that a planned reference update will
still be able to go through after the other process has released the
lock.
So when trying to lock an individual reference (e.g., when creating
"refs/heads/master.lock"), if it is already locked, then retry the
lock acquisition for approximately 100 ms before giving up. This
should eliminate some unnecessary lock conflicts without wasting a lot
of time.
Add a configuration setting, `core.filesRefLockTimeout`, to allow this
setting to be tweaked.
Note: the function `get_files_ref_lock_timeout_ms()` cannot be private
to the files backend because it is also used by `write_pseudoref()`
and `delete_pseudoref()`, which are defined in `refs.c` so that they
can be used by other reference backends.
Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
This patch applies to master, but (perhaps surprisingly) doesn't
conflict with the mh/packed-ref-store changes. It can also be obtained
from my Git fork [1] as branch "ref-lock-retry".
Michael
[1] https://github.com/mhagger/git
Documentation/config.txt | 6 ++++++
refs.c | 24 +++++++++++++++++++++---
refs/files-backend.c | 8 ++++++--
refs/refs-internal.h | 6 ++++++
4 files changed, 39 insertions(+), 5 deletions(-)
diff --git a/Documentation/config.txt b/Documentation/config.txt
index d5c9c4cab6..2c04b9dfb4 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -776,6 +776,12 @@ core.commentChar::
If set to "auto", `git-commit` would select a character that is not
the beginning character of any line in existing commit messages.
+core.filesRefLockTimeout::
+ The length of time, in milliseconds, to retry when trying to
+ lock an individual reference. Value 0 means not to retry at
+ all; -1 means to try indefinitely. Default is 100 (i.e.,
+ retry for 100ms).
+
core.packedRefsTimeout::
The length of time, in milliseconds, to retry when trying to
lock the `packed-refs` file. Value 0 means not to retry at
diff --git a/refs.c b/refs.c
index fe4c59aa8b..29dbb9b610 100644
--- a/refs.c
+++ b/refs.c
@@ -561,6 +561,21 @@ enum ref_type ref_type(const char *refname)
return REF_TYPE_NORMAL;
}
+long get_files_ref_lock_timeout_ms(void)
+{
+ static int configured = 0;
+
+ /* The default timeout is 100 ms: */
+ static int timeout_ms = 100;
+
+ if (!configured) {
+ git_config_get_int("core.filesreflocktimeout", &timeout_ms);
+ configured = 1;
+ }
+
+ return timeout_ms;
+}
+
static int write_pseudoref(const char *pseudoref, const unsigned char *sha1,
const unsigned char *old_sha1, struct strbuf *err)
{
@@ -573,7 +588,9 @@ static int write_pseudoref(const char *pseudoref, const unsigned char *sha1,
strbuf_addf(&buf, "%s\n", sha1_to_hex(sha1));
filename = git_path("%s", pseudoref);
- fd = hold_lock_file_for_update(&lock, filename, LOCK_DIE_ON_ERROR);
+ fd = hold_lock_file_for_update_timeout(&lock, filename,
+ LOCK_DIE_ON_ERROR,
+ get_files_ref_lock_timeout_ms());
if (fd < 0) {
strbuf_addf(err, "could not open '%s' for writing: %s",
filename, strerror(errno));
@@ -616,8 +633,9 @@ static int delete_pseudoref(const char *pseudoref, const unsigned char *old_sha1
int fd;
unsigned char actual_old_sha1[20];
- fd = hold_lock_file_for_update(&lock, filename,
- LOCK_DIE_ON_ERROR);
+ fd = hold_lock_file_for_update_timeout(
+ &lock, filename, LOCK_DIE_ON_ERROR,
+ get_files_ref_lock_timeout_ms());
if (fd < 0)
die_errno(_("Could not open '%s' for writing"), filename);
if (read_ref(pseudoref, actual_old_sha1))
diff --git a/refs/files-backend.c b/refs/files-backend.c
index 0404f2c233..d611e0f7d7 100644
--- a/refs/files-backend.c
+++ b/refs/files-backend.c
@@ -855,7 +855,9 @@ static int lock_raw_ref(struct files_ref_store *refs,
if (!lock->lk)
lock->lk = xcalloc(1, sizeof(struct lock_file));
- if (hold_lock_file_for_update(lock->lk, ref_file.buf, LOCK_NO_DEREF) < 0) {
+ if (hold_lock_file_for_update_timeout(
+ lock->lk, ref_file.buf, LOCK_NO_DEREF,
+ get_files_ref_lock_timeout_ms()) < 0) {
if (errno == ENOENT && --attempts_remaining > 0) {
/*
* Maybe somebody just deleted one of the
@@ -1181,7 +1183,9 @@ static int create_reflock(const char *path, void *cb)
{
struct lock_file *lk = cb;
- return hold_lock_file_for_update(lk, path, LOCK_NO_DEREF) < 0 ? -1 : 0;
+ return hold_lock_file_for_update_timeout(
+ lk, path, LOCK_NO_DEREF,
+ get_files_ref_lock_timeout_ms()) < 0 ? -1 : 0;
}
/*
diff --git a/refs/refs-internal.h b/refs/refs-internal.h
index 192f9f85c9..9977fea98b 100644
--- a/refs/refs-internal.h
+++ b/refs/refs-internal.h
@@ -61,6 +61,12 @@
*/
#define REF_DELETED_LOOSE 0x200
+/*
+ * Return the length of time to retry acquiring a loose reference lock
+ * before giving up, in milliseconds:
+ */
+long get_files_ref_lock_timeout_ms(void);
+
/*
* Return true iff refname is minimally safe. "Safe" here means that
* deleting a loose reference by this name will not do any damage, for
--
2.11.0
* Re: [PATCH] Retry acquiring reference locks for 100ms
From: Junio C Hamano @ 2017-08-23 21:55 UTC (permalink / raw)
To: Michael Haggerty; +Cc: Jeff King, git
Michael Haggerty <mhagger@alum.mit.edu> writes:
> The philosophy of reference locking has been, "if another process is
> changing a reference, then whatever I'm trying to do to it will
> probably fail anyway because my old-SHA-1 value is probably no longer
> current". But this argument falls down if the other process has locked
> the reference to do something that doesn't actually change the value
> of the reference, such as `pack-refs` or `reflog expire`. There
> actually *is* a decent chance that a planned reference update will
> still be able to go through after the other process has released the
> lock.
The reason these 'read-only' operations take locks is that
they want to ensure that other people will not mess with the
anchoring points of the history they base their operation on while
they do their work, right?
> So when trying to lock an individual reference (e.g., when creating
> "refs/heads/master.lock"), if it is already locked, then retry the
> lock acquisition for approximately 100 ms before giving up. This
> should eliminate some unnecessary lock conflicts without wasting a lot
> of time.
>
> Add a configuration setting, `core.filesRefLockTimeout`, to allow this
> setting to be tweaked.
I suspect that this came from real-life needs of a server operator.
What numbers should I be asking to justify this change? ;-)
"Without this change, 0.4% of pushes used to fail due to losing the
race against periodic GC, but with this, the rate went down to 0.2%,
which is 50% improvement!" or something like that?
* Re: [PATCH] Retry acquiring reference locks for 100ms
From: Michael Haggerty @ 2017-08-24 7:01 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Jeff King, Git Mailing List
On Wed, Aug 23, 2017 at 11:55 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Michael Haggerty <mhagger@alum.mit.edu> writes:
>
>> The philosophy of reference locking has been, "if another process is
>> changing a reference, then whatever I'm trying to do to it will
>> probably fail anyway because my old-SHA-1 value is probably no longer
>> current". But this argument falls down if the other process has locked
>> the reference to do something that doesn't actually change the value
>> of the reference, such as `pack-refs` or `reflog expire`. There
>> actually *is* a decent chance that a planned reference update will
>> still be able to go through after the other process has released the
>> lock.
>
> The reason why these 'read-only' operations take locks is because
> they want to ensure that other people will not mess with the
> anchoring points of the history they base their operation on while
> they do their work, right?
In the case of `pack-refs`, after it makes packed versions of the
loose references, it needs to lock each loose reference before pruning
it, so that it can verify in a non-racy way that the loose reference
still has the same value as the one it just packed.
In the case of `reflog expire`, it locks the reference because that
implies a lock on the reflog file, which it needs to rewrite. (Reflog
files don't have their own locks.) Otherwise it could inadvertently
overwrite a new reflog entry that is added by another process while it
is rewriting the file.
>> So when trying to lock an individual reference (e.g., when creating
>> "refs/heads/master.lock"), if it is already locked, then retry the
>> lock acquisition for approximately 100 ms before giving up. This
>> should eliminate some unnecessary lock conflicts without wasting a lot
>> of time.
>>
>> Add a configuration setting, `core.filesRefLockTimeout`, to allow this
>> setting to be tweaked.
>
> I suspect that this came from real-life needs of a server operator.
> What numbers should I be asking to justify this change? ;-)
>
> "Without this change, 0.4% of pushes used to fail due to losing the
> race against periodic GC, but with this, the rate went down to 0.2%,
> which is 50% improvement!" or something like that?
We've had a patch like this deployed to our servers for quite some
time, so I don't remember too accurately. But I think we were seeing
maybe 10-50 such errors every day across our whole infrastructure
(we're talking like literally one in a million updates). That number
went basically to zero after retries were added.
It's also not particularly serious when the race happens: the
reference update is rejected, but it is rejected cleanly.
So it's definitely a rare race, and probably only of any interest at
all on a high-traffic Git server. OTOH the cure is pretty simple, so
it seems worth fixing.
Michael
* Re: [PATCH] Retry acquiring reference locks for 100ms
From: Jeff King @ 2017-08-24 14:44 UTC (permalink / raw)
To: Michael Haggerty; +Cc: Junio C Hamano, git
On Mon, Aug 21, 2017 at 01:51:34PM +0200, Michael Haggerty wrote:
> The philosophy of reference locking has been, "if another process is
> changing a reference, then whatever I'm trying to do to it will
> probably fail anyway because my old-SHA-1 value is probably no longer
> current". But this argument falls down if the other process has locked
> the reference to do something that doesn't actually change the value
> of the reference, such as `pack-refs` or `reflog expire`. There
> actually *is* a decent chance that a planned reference update will
> still be able to go through after the other process has released the
> lock.
>
> So when trying to lock an individual reference (e.g., when creating
> "refs/heads/master.lock"), if it is already locked, then retry the
> lock acquisition for approximately 100 ms before giving up. This
> should eliminate some unnecessary lock conflicts without wasting a lot
> of time.
It will probably come as little surprise to others on the list that
I agree with the intent of this patch. This version is cleaned up a
little from what we're running at GitHub right now, so I'll try to
review with fresh eyes.
> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index d5c9c4cab6..2c04b9dfb4 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -776,6 +776,12 @@ core.commentChar::
> If set to "auto", `git-commit` would select a character that is not
> the beginning character of any line in existing commit messages.
>
> +core.filesRefLockTimeout::
> + The length of time, in milliseconds, to retry when trying to
> + lock an individual reference. Value 0 means not to retry at
> + all; -1 means to try indefinitely. Default is 100 (i.e.,
> + retry for 100ms).
> +
> core.packedRefsTimeout::
> The length of time, in milliseconds, to retry when trying to
> lock the `packed-refs` file. Value 0 means not to retry at
Do we need a separate config from the `core.packedRefsTimeout` shown
in the context above? I guess so, since the default values are
different. And
rightfully so, I think, since writing a packed-refs file is potentially
a much larger operation (being O(n) in the number of refs rather than a
constant 40 bytes).
It probably doesn't matter all that much either way, as I wouldn't
expect people to need to tweak either of these in practice.
At some point when we have another ref backend that needs to take a
global lock, we'd probably have a third timeout. If we were starting
from scratch, I'd suggest that these might be part of refstorage.files.*
instead of core.*. And then eventually we might have
refstorage.reftable.timeout, refstorage.lmdb.timeout, etc.
But since that ship has already sailed with `core.packedRefsTimeout`,
I'm not sure it's
worth caring about.
> diff --git a/refs.c b/refs.c
> index fe4c59aa8b..29dbb9b610 100644
> --- a/refs.c
> +++ b/refs.c
> @@ -561,6 +561,21 @@ enum ref_type ref_type(const char *refname)
> return REF_TYPE_NORMAL;
> }
>
> +long get_files_ref_lock_timeout_ms(void)
> +{
> + static int configured = 0;
> +
> + /* The default timeout is 100 ms: */
> + static int timeout_ms = 100;
> +
> + if (!configured) {
> + git_config_get_int("core.filesreflocktimeout", &timeout_ms);
> + configured = 1;
> + }
> +
> + return timeout_ms;
> +}
This reads the config value into a static local that gets cached for the
rest of the program run, and there's no way to invalidate the cache. I
think that should be OK, as we would never hit the ref code until we've
actually initialized the repository, at which point we're fairly well
committed.
(I'm a little gun-shy because I used a similar lazy-load pattern for
core.sharedrepository a while back, and those corner cases bit us. But
there it was heavily used by git-init, which wants to "switch" repos
after having loaded some values).
> @@ -573,7 +588,9 @@ static int write_pseudoref(const char *pseudoref, const unsigned char *sha1,
> strbuf_addf(&buf, "%s\n", sha1_to_hex(sha1));
>
> filename = git_path("%s", pseudoref);
> - fd = hold_lock_file_for_update(&lock, filename, LOCK_DIE_ON_ERROR);
> + fd = hold_lock_file_for_update_timeout(&lock, filename,
> + LOCK_DIE_ON_ERROR,
> + get_files_ref_lock_timeout_ms());
> if (fd < 0) {
> strbuf_addf(err, "could not open '%s' for writing: %s",
> filename, strerror(errno));
The rest of the patch looks obviously correct to me. The one thing I
didn't audit for was whether there were any calls that _should_ be
changed that you missed. But I'm pretty sure based on our previous
off-list discussions that you just did that audit.
Obviously new ref-locking calls would want to use the timeout variant
here, too, and we need to remember to do so. But I don't think there
should be many, as most sites should be going through create_reflock()
or lock_raw_ref(), which you touch here.
-Peff