From: Kumar Kartikeya Dwivedi <memxor@gmail.com>
To: bpf@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Barret Rhoden <brho@google.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Peter Zijlstra <peterz@infradead.org>,
Waiman Long <llong@redhat.com>,
Alexei Starovoitov <ast@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Martin KaFai Lau <martin.lau@kernel.org>,
Eduard Zingerman <eddyz87@gmail.com>,
"Paul E. McKenney" <paulmck@kernel.org>,
Tejun Heo <tj@kernel.org>, Josh Don <joshdon@google.com>,
Dohyun Kim <dohyunkim@google.com>,
kernel-team@meta.com
Subject: [PATCH bpf-next v1 09/22] rqspinlock: Protect waiters in queue from stalls
Date: Tue, 7 Jan 2025 05:59:51 -0800 [thread overview]
Message-ID: <20250107140004.2732830-10-memxor@gmail.com> (raw)
In-Reply-To: <20250107140004.2732830-1-memxor@gmail.com>
Implement the wait queue cleanup algorithm for rqspinlock. There are
three forms of waiters in the original queued spin lock algorithm. The
first is the waiter which acquires the pending bit and spins on the lock
word without forming a wait queue. The second is the head waiter that is
the first waiter heading the wait queue. The third form is of all the
non-head waiters queued behind the head, waiting to be signalled through
their MCS node to overtake the responsibility of the head.
In this commit, we are concerned with the second and third kind. First,
we augment the waiting loop of the head of the wait queue with a
timeout. When this timeout happens, all waiters part of the wait queue
will abort their lock acquisition attempts. This happens in three steps.
First, the head breaks out of its loop waiting for pending and locked
bits to turn to 0, and non-head waiters break out of their MCS node spin
(more on that later). Next, every waiter (head or non-head) attempts to
check whether they are also the tail waiter, in such a case they attempt
to zero out the tail word and allow a new queue to be built up for this
lock. If they succeed, they have no one to signal next in the queue to
stop spinning. Otherwise, they signal the MCS node of the next waiter to
break out of its spin and try resetting the tail word back to 0. This
goes on until the tail waiter is found. In case of races, the new tail
will be responsible for performing the same task, as the old tail will
then fail to reset the tail word and wait for its next pointer to be
updated before it signals the new tail to do the same.
Lastly, all of these waiters release the rqnode and return to the
caller. This patch underscores the point that rqspinlock's timeout does
not apply to each waiter individually, and cannot be relied upon as an
upper bound. It is possible for the rqspinlock waiters to return early
from a failed lock acquisition attempt as soon as stalls are detected.
The head waiter cannot directly WRITE_ONCE the tail to zero, as it may
race with a concurrent xchg and a non-head waiter linking its MCS node
to the head's MCS node through 'prev->next' assignment.
Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
kernel/locking/rqspinlock.c | 42 +++++++++++++++++++++++++++++---
kernel/locking/rqspinlock.h | 48 +++++++++++++++++++++++++++++++++++++
2 files changed, 87 insertions(+), 3 deletions(-)
create mode 100644 kernel/locking/rqspinlock.h
diff --git a/kernel/locking/rqspinlock.c b/kernel/locking/rqspinlock.c
index dd305573db13..f712fe4b1f38 100644
--- a/kernel/locking/rqspinlock.c
+++ b/kernel/locking/rqspinlock.c
@@ -77,6 +77,8 @@ struct rqspinlock_timeout {
u16 spin;
};
+#define RES_TIMEOUT_VAL 2
+
static noinline int check_timeout(struct rqspinlock_timeout *ts)
{
u64 time = ktime_get_mono_fast_ns();
@@ -305,12 +307,18 @@ int __lockfunc resilient_queued_spin_lock_slowpath(struct qspinlock *lock, u32 v
* head of the waitqueue.
*/
if (old & _Q_TAIL_MASK) {
+ int val;
+
prev = decode_tail(old, qnodes);
/* Link @node into the waitqueue. */
WRITE_ONCE(prev->next, node);
- arch_mcs_spin_lock_contended(&node->locked);
+ val = arch_mcs_spin_lock_contended(&node->locked);
+ if (val == RES_TIMEOUT_VAL) {
+ ret = -EDEADLK;
+ goto waitq_timeout;
+ }
/*
* While waiting for the MCS lock, the next pointer may have
@@ -334,7 +342,35 @@ int __lockfunc resilient_queued_spin_lock_slowpath(struct qspinlock *lock, u32 v
* sequentiality; this is because the set_locked() function below
* does not imply a full barrier.
*/
- val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
+ RES_RESET_TIMEOUT(ts);
+ val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK) ||
+ RES_CHECK_TIMEOUT(ts, ret));
+
+waitq_timeout:
+ if (ret) {
+ /*
+ * If the tail is still pointing to us, then we are the final waiter,
+ * and are responsible for resetting the tail back to 0. Otherwise, if
+ * the cmpxchg operation fails, we signal the next waiter to take exit
+ * and try the same. For a waiter with tail node 'n':
+ *
+ * n,*,* -> 0,*,*
+ *
+ * When performing cmpxchg for the whole word (NR_CPUS > 16k), it is
+ * possible locked/pending bits keep changing and we see failures even
+ * when we remain the head of wait queue. However, eventually, for the
+ * case without corruption, pending bit owner will unset the pending
+ * bit, and new waiters will queue behind us. This will leave the lock
+ * owner in charge, and it will eventually either set locked bit to 0,
+ * or leave it as 1, allowing us to make progress.
+ */
+ if (!try_cmpxchg_tail(lock, tail, 0)) {
+ next = smp_cond_load_relaxed(&node->next, VAL);
+ WRITE_ONCE(next->locked, RES_TIMEOUT_VAL);
+ }
+ lockevent_inc(rqspinlock_lock_timeout);
+ goto release;
+ }
/*
* claim the lock:
@@ -379,6 +415,6 @@ int __lockfunc resilient_queued_spin_lock_slowpath(struct qspinlock *lock, u32 v
* release the node
*/
__this_cpu_dec(qnodes[0].mcs.count);
- return 0;
+ return ret;
}
EXPORT_SYMBOL(resilient_queued_spin_lock_slowpath);
diff --git a/kernel/locking/rqspinlock.h b/kernel/locking/rqspinlock.h
new file mode 100644
index 000000000000..3cec3a0f2d7e
--- /dev/null
+++ b/kernel/locking/rqspinlock.h
@@ -0,0 +1,48 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Resilient Queued Spin Lock defines
+ *
+ * (C) Copyright 2024 Meta Platforms, Inc. and affiliates.
+ *
+ * Authors: Kumar Kartikeya Dwivedi <memxor@gmail.com>
+ */
+#ifndef __LINUX_RQSPINLOCK_H
+#define __LINUX_RQSPINLOCK_H
+
+#include "qspinlock.h"
+
+/*
+ * try_cmpxchg_tail - Return result of cmpxchg of tail word with a new value
+ * @lock: Pointer to queued spinlock structure
+ * @tail: The tail to compare against
+ * @new_tail: The new queue tail code word
+ * Return: Bool to indicate whether the cmpxchg operation succeeded
+ *
+ * This is used by the head of the wait queue to clean up the queue.
+ * Provides relaxed ordering, since observers only rely on initialized
+ * state of the node which was made visible through the xchg_tail operation,
+ * i.e. through the smp_wmb preceding xchg_tail.
+ *
+ * We avoid using 16-bit cmpxchg, which is not available on all architectures.
+ */
+static __always_inline bool try_cmpxchg_tail(struct qspinlock *lock, u32 tail, u32 new_tail)
+{
+ u32 old, new;
+
+ old = atomic_read(&lock->val);
+ do {
+ /*
+ * Is the tail part we compare to already stale? Fail.
+ */
+ if ((old & _Q_TAIL_MASK) != tail)
+ return false;
+ /*
+ * Encode latest locked/pending state for new tail.
+ */
+ new = (old & _Q_LOCKED_PENDING_MASK) | new_tail;
+ } while (!atomic_try_cmpxchg_relaxed(&lock->val, &old, new));
+
+ return true;
+}
+
+#endif /* __LINUX_RQSPINLOCK_H */
--
2.43.5
next prev parent reply other threads:[~2025-01-07 14:00 UTC|newest]
Thread overview: 63+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-01-07 13:59 [PATCH bpf-next v1 00/22] Resilient Queued Spin Lock Kumar Kartikeya Dwivedi
2025-01-07 13:59 ` [PATCH bpf-next v1 01/22] locking: Move MCS struct definition to public header Kumar Kartikeya Dwivedi
2025-01-07 13:59 ` [PATCH bpf-next v1 02/22] locking: Move common qspinlock helpers to a private header Kumar Kartikeya Dwivedi
2025-01-07 13:59 ` [PATCH bpf-next v1 03/22] locking: Allow obtaining result of arch_mcs_spin_lock_contended Kumar Kartikeya Dwivedi
2025-01-07 13:59 ` [PATCH bpf-next v1 04/22] locking: Copy out qspinlock.c to rqspinlock.c Kumar Kartikeya Dwivedi
2025-01-07 13:59 ` [PATCH bpf-next v1 05/22] rqspinlock: Add rqspinlock.h header Kumar Kartikeya Dwivedi
2025-01-07 13:59 ` [PATCH bpf-next v1 06/22] rqspinlock: Drop PV and virtualization support Kumar Kartikeya Dwivedi
2025-01-07 13:59 ` [PATCH bpf-next v1 07/22] rqspinlock: Add support for timeouts Kumar Kartikeya Dwivedi
2025-01-07 14:50 ` Peter Zijlstra
2025-01-07 17:14 ` Kumar Kartikeya Dwivedi
2025-01-07 13:59 ` [PATCH bpf-next v1 08/22] rqspinlock: Protect pending bit owners from stalls Kumar Kartikeya Dwivedi
2025-01-07 14:51 ` Peter Zijlstra
2025-01-07 17:14 ` Kumar Kartikeya Dwivedi
2025-01-07 19:17 ` Peter Zijlstra
2025-01-07 19:22 ` Peter Zijlstra
2025-01-07 19:54 ` Kumar Kartikeya Dwivedi
2025-01-08 2:19 ` Waiman Long
2025-01-08 20:13 ` Kumar Kartikeya Dwivedi
2025-01-07 13:59 ` Kumar Kartikeya Dwivedi [this message]
2025-01-08 3:38 ` [PATCH bpf-next v1 09/22] rqspinlock: Protect waiters in queue " Waiman Long
2025-01-08 20:42 ` Kumar Kartikeya Dwivedi
2025-01-07 13:59 ` [PATCH bpf-next v1 10/22] rqspinlock: Protect waiters in trylock fallback " Kumar Kartikeya Dwivedi
2025-01-07 13:59 ` [PATCH bpf-next v1 11/22] rqspinlock: Add deadlock detection and recovery Kumar Kartikeya Dwivedi
2025-01-08 16:06 ` Waiman Long
2025-01-08 20:19 ` Kumar Kartikeya Dwivedi
2025-01-09 0:32 ` Waiman Long
2025-01-07 13:59 ` [PATCH bpf-next v1 12/22] rqspinlock: Add basic support for CONFIG_PARAVIRT Kumar Kartikeya Dwivedi
2025-01-08 16:27 ` Waiman Long
2025-01-08 20:32 ` Kumar Kartikeya Dwivedi
2025-01-09 0:48 ` Waiman Long
2025-01-09 2:42 ` Alexei Starovoitov
2025-01-09 2:58 ` Waiman Long
2025-01-09 3:37 ` Alexei Starovoitov
2025-01-09 3:46 ` Waiman Long
2025-01-09 3:53 ` Alexei Starovoitov
2025-01-09 3:58 ` Waiman Long
2025-01-07 13:59 ` [PATCH bpf-next v1 13/22] rqspinlock: Add helper to print a splat on timeout or deadlock Kumar Kartikeya Dwivedi
2025-01-07 13:59 ` [PATCH bpf-next v1 14/22] rqspinlock: Add macros for rqspinlock usage Kumar Kartikeya Dwivedi
2025-01-08 16:55 ` Waiman Long
2025-01-08 20:41 ` Kumar Kartikeya Dwivedi
2025-01-09 1:11 ` Waiman Long
2025-01-09 3:30 ` Alexei Starovoitov
2025-01-09 4:09 ` Waiman Long
2025-01-07 13:59 ` [PATCH bpf-next v1 15/22] rqspinlock: Add locktorture support Kumar Kartikeya Dwivedi
2025-01-07 13:59 ` [PATCH bpf-next v1 16/22] rqspinlock: Add entry to Makefile, MAINTAINERS Kumar Kartikeya Dwivedi
2025-01-07 13:59 ` [PATCH bpf-next v1 17/22] bpf: Convert hashtab.c to rqspinlock Kumar Kartikeya Dwivedi
2025-01-07 14:00 ` [PATCH bpf-next v1 18/22] bpf: Convert percpu_freelist.c " Kumar Kartikeya Dwivedi
2025-01-07 14:00 ` [PATCH bpf-next v1 19/22] bpf: Convert lpm_trie.c " Kumar Kartikeya Dwivedi
2025-01-07 14:00 ` [PATCH bpf-next v1 20/22] bpf: Introduce rqspinlock kfuncs Kumar Kartikeya Dwivedi
2025-01-08 10:23 ` kernel test robot
2025-01-08 10:23 ` kernel test robot
2025-01-08 10:44 ` kernel test robot
2025-01-07 14:00 ` [PATCH bpf-next v1 21/22] bpf: Implement verifier support for rqspinlock Kumar Kartikeya Dwivedi
2025-01-07 14:00 ` [PATCH bpf-next v1 22/22] selftests/bpf: Add tests " Kumar Kartikeya Dwivedi
2025-01-07 23:54 ` [PATCH bpf-next v1 00/22] Resilient Queued Spin Lock Linus Torvalds
2025-01-08 9:18 ` Peter Zijlstra
2025-01-08 20:12 ` Kumar Kartikeya Dwivedi
2025-01-08 20:30 ` Linus Torvalds
2025-01-08 21:06 ` Kumar Kartikeya Dwivedi
2025-01-08 21:30 ` Paul E. McKenney
2025-01-09 13:59 ` Waiman Long
2025-01-09 21:13 ` Kumar Kartikeya Dwivedi
2025-01-09 21:18 ` Waiman Long
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250107140004.2732830-10-memxor@gmail.com \
--to=memxor@gmail.com \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=brho@google.com \
--cc=daniel@iogearbox.net \
--cc=dohyunkim@google.com \
--cc=eddyz87@gmail.com \
--cc=joshdon@google.com \
--cc=kernel-team@meta.com \
--cc=linux-kernel@vger.kernel.org \
--cc=llong@redhat.com \
--cc=martin.lau@kernel.org \
--cc=paulmck@kernel.org \
--cc=peterz@infradead.org \
--cc=tj@kernel.org \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox