From: Kumar Kartikeya Dwivedi <memxor@gmail.com>
To: bpf@vger.kernel.org
Cc: Alexei Starovoitov <ast@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Martin KaFai Lau <martin.lau@kernel.org>,
Eduard Zingerman <eddyz87@gmail.com>,
Ritesh Oedayrajsingh Varma <ritesh@superluminal.eu>,
Jelle van der Beek <jelle@superluminal.eu>,
kkd@meta.com, kernel-team@meta.com
Subject: [PATCH bpf-next v2 5/6] rqspinlock: Precede non-head waiter queueing with AA check
Date: Fri, 28 Nov 2025 23:28:01 +0000
Message-ID: <20251128232802.1031906-6-memxor@gmail.com>
In-Reply-To: <20251128232802.1031906-1-memxor@gmail.com>
While previous commits sufficiently address the deadlocks, there are
still scenarios where queueing of waiters in NMIs can exacerbate the
possibility of timeouts.
Consider the case below:
  CPU 0:
    <NMI>
    res_spin_lock(A) -> becomes non-head waiter
    </NMI>
    ... lock owner in CS, or pending waiter spinning ...

  CPU 1:
    res_spin_lock(A) -> head waiter spinning on owner/pending bits
In such a scenario, the non-head waiter in the NMI on CPU 0 will not
poll for deadlocks or time out, since it simply queues behind the
previous waiter (the head on CPU 1); it will not enter the trylock
fallback either, since no rqspinlock queue waiter is active on CPU 0.
The transaction initiated by the head waiter on CPU 1 will eventually
time out, signalling the NMI and ending the cyclic dependency, but at
the cost of 250 ms.
Instead, the NMI on CPU 0 could simply check for the presence of an AA
deadlock and proceed with queueing only if none is detected. Add such a
check right before any form of queueing is initiated.
The AA deadlock check is not gated on in_nmi() because a similar case
could occur due to a reentrant path in the owner's critical section,
and unconditionally checking for AA before entering the queueing path
avoids expensive timeouts. Non-NMI reentrancy only happens at
controlled points in the slow path (specific tracepoints which do not
impede the forward progress of a waiter loop), or in the owner's CS,
while NMIs can land anywhere.
While this check is only needed for non-head waiter queueing,
determining whether we are the head is racy without xchg_tail, and
after that point we are already queued; hence, for simplicity, invoke
the check unconditionally.
Note that a more contrived case could still be constructed using two
locks, with non-head waiters of each lock interrupting the progress of
the other lock's owner, in ABBA fashion; that case would still not be
covered by the current set of checks and conditions. It would still
lead to a timeout, though, and not a deadlock. An ABBA check cannot be
done optimistically before queueing, since it would be racy; it needs
to happen continuously during the waiting period, which would then
require an unlinking step for queued NMI/reentrant waiters. This is
beyond the scope of this patch.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
kernel/bpf/rqspinlock.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/kernel/bpf/rqspinlock.c b/kernel/bpf/rqspinlock.c
index e35b06fcf9ee..f7d0c8d4644e 100644
--- a/kernel/bpf/rqspinlock.c
+++ b/kernel/bpf/rqspinlock.c
@@ -437,6 +437,19 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
* queuing.
*/
queue:
+ /*
+ * Do not queue if we're a waiter and someone is attempting this lock on
+ * the same CPU. In case of NMIs, this prevents long timeouts where we
+ * interrupt the pending waiter, and the owner, that will eventually
+ * signal the head of our queue, both of which are logically but not
+ * physically part of the queue, hence outside the scope of the idx > 0
+ * check above for the trylock fallback.
+ */
+ if (check_deadlock_AA(lock)) {
+ ret = -EDEADLK;
+ goto err_release_entry;
+ }
+
lockevent_inc(lock_slowpath);
/* Deadlock detection entry already held after failing fast path. */
node = this_cpu_ptr(&rqnodes[0].mcs);
--
2.51.0