From: Waiman Long <llong@redhat.com>
To: Kumar Kartikeya Dwivedi <memxor@gmail.com>,
bpf@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
Peter Zijlstra <peterz@infradead.org>,
Waiman Long <llong@redhat.com>,
Alexei Starovoitov <ast@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Martin KaFai Lau <martin.lau@kernel.org>,
Eduard Zingerman <eddyz87@gmail.com>,
"Paul E. McKenney" <paulmck@kernel.org>,
Tejun Heo <tj@kernel.org>, Barret Rhoden <brho@google.com>,
Josh Don <joshdon@google.com>, Dohyun Kim <dohyunkim@google.com>,
kernel-team@meta.com
Subject: Re: [PATCH bpf-next v1 11/22] rqspinlock: Add deadlock detection and recovery
Date: Wed, 8 Jan 2025 11:06:26 -0500 [thread overview]
Message-ID: <2402fa3e-bd43-47a5-ab8c-bd05877831ff@redhat.com> (raw)
In-Reply-To: <20250107140004.2732830-12-memxor@gmail.com>
On 1/7/25 8:59 AM, Kumar Kartikeya Dwivedi wrote:
> While the timeout logic provides guarantees for the waiter's forward
> progress, the time until a stalling waiter unblocks can still be long.
> The default timeout of 1/2 sec can be excessively long for some use
> cases. Additionally, custom timeouts may exacerbate recovery time.
>
> Introduce logic to detect common cases of deadlocks and perform quicker
> recovery. This is done by dividing the time from entry into the locking
> slow path until the timeout into intervals of 1 ms. Then, after each
> interval elapses, deadlock detection is performed, while also polling
> the lock word to ensure we can quickly break out of the detection logic
> and proceed with lock acquisition.
>
> A 'held_locks' table is maintained per-CPU where the entry at the bottom
> denotes a lock being waited for or already taken. Entries coming before
> it denote locks that are already held. The current CPU's table can thus
> be looked at to detect AA deadlocks. The tables from other CPUs can be
> looked at to discover ABBA situations. Finally, when a matching entry
> for the lock being taken on the current CPU is found on some other CPU,
> a deadlock situation is detected. This function can take a long time,
> therefore the lock word is constantly polled in each loop iteration to
> ensure we can preempt detection and proceed with lock acquisition, using
> the is_lock_released check.
>
> We set the 'spin' member of the rqspinlock_timeout struct to 0 to trigger
> deadlock checks immediately and perform faster recovery.
>
> Note: Extending lock word size by 4 bytes to record owner CPU can allow
> faster detection for ABBA. It is typically the owner which participates
> in an ABBA situation. However, to keep compatibility with existing lock
> words in the kernel (struct qspinlock), and given deadlocks are a rare
> event triggered by bugs, we choose to favor compatibility over faster
> detection.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
> include/asm-generic/rqspinlock.h | 56 +++++++++-
> kernel/locking/rqspinlock.c | 178 ++++++++++++++++++++++++++++---
> 2 files changed, 220 insertions(+), 14 deletions(-)
>
> diff --git a/include/asm-generic/rqspinlock.h b/include/asm-generic/rqspinlock.h
> index 5c996a82e75f..c7e33ccc57a6 100644
> --- a/include/asm-generic/rqspinlock.h
> +++ b/include/asm-generic/rqspinlock.h
> @@ -11,14 +11,68 @@
>
> #include <linux/types.h>
> #include <vdso/time64.h>
> +#include <linux/percpu.h>
>
> struct qspinlock;
>
> +extern int resilient_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val, u64 timeout);
> +
> /*
> * Default timeout for waiting loops is 0.5 seconds
> */
> #define RES_DEF_TIMEOUT (NSEC_PER_SEC / 2)
>
> -extern int resilient_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val, u64 timeout);
> +#define RES_NR_HELD 32
> +
> +struct rqspinlock_held {
> + int cnt;
> + void *locks[RES_NR_HELD];
> +};
> +
> +DECLARE_PER_CPU_ALIGNED(struct rqspinlock_held, rqspinlock_held_locks);
> +
> +static __always_inline void grab_held_lock_entry(void *lock)
> +{
> + int cnt = this_cpu_inc_return(rqspinlock_held_locks.cnt);
> +
> + if (unlikely(cnt > RES_NR_HELD)) {
> + /* Still keep the inc so we decrement later. */
> + return;
> + }
> +
> + /*
> + * Implied compiler barrier in per-CPU operations; otherwise we can have
> + * the compiler reorder inc with write to table, allowing interrupts to
> + * overwrite and erase our write to the table (as on interrupt exit it
> + * will be reset to NULL).
> + */
> + this_cpu_write(rqspinlock_held_locks.locks[cnt - 1], lock);
> +}
> +
> +/*
> + * It is possible to run into misdetection scenarios of AA deadlocks on the same
> + * CPU, and missed ABBA deadlocks on remote CPUs when this function pops entries
> + * out of order (due to lock A, lock B, unlock A, unlock B) pattern. The correct
> + * logic to preserve right entries in the table would be to walk the array of
> + * held locks and swap and clear out-of-order entries, but that's too
> + * complicated and we don't have a compelling use case for out of order unlocking.
Maybe we can pass the lock being released into release_held_lock_entry()
and print a warning if an out-of-order unlock is being done.
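To make the suggestion concrete, here is a rough userspace sketch (all names mirror the patch but the per-CPU table is modeled as a plain array, so this is an illustration only, not the kernel implementation). The release side takes the lock pointer and flags a mismatch against the top entry; in the kernel the flag would presumably be a WARN_ON_ONCE() instead of a return value:

```c
#include <assert.h>
#include <stddef.h>

#define RES_NR_HELD 32

/* Userspace model of the per-CPU held-locks table (hypothetical
 * stand-in: a single global array instead of this_cpu data). */
struct rqspinlock_held {
	int cnt;
	void *locks[RES_NR_HELD];
};

static struct rqspinlock_held rqh;

static void grab_held_lock_entry(void *lock)
{
	int cnt = ++rqh.cnt;

	if (cnt > RES_NR_HELD)
		return; /* table full; keep the count so release balances */
	rqh.locks[cnt - 1] = lock;
}

/*
 * Suggested variant: compare the lock being released against the top
 * entry, catching the unsupported out-of-order pattern
 * (lock A, lock B, unlock A, unlock B).
 */
static int release_held_lock_entry(void *lock)
{
	int out_of_order = 0;

	if (rqh.cnt <= RES_NR_HELD) {
		if (rqh.locks[rqh.cnt - 1] != lock)
			out_of_order = 1; /* would be WARN_ON_ONCE() */
		rqh.locks[rqh.cnt - 1] = NULL;
	}
	rqh.cnt--;
	return out_of_order;
}
```

In-order release sequences pass silently; releasing a lock that is not the most recent entry trips the flag.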
> + *
> + * Therefore, we simply don't support such cases and keep the logic simple here.
> + */
> +static __always_inline void release_held_lock_entry(void)
> +{
> + struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);
> +
> + if (unlikely(rqh->cnt > RES_NR_HELD))
> + goto dec;
> + smp_store_release(&rqh->locks[rqh->cnt - 1], NULL);
> + /*
> + * Overwrite of NULL should appear before our decrement of the count to
> + * other CPUs, otherwise we have the issue of a stale non-NULL entry being
> + * visible in the array, leading to misdetection during deadlock detection.
> + */
> +dec:
> + this_cpu_dec(rqspinlock_held_locks.cnt);
AFAIU, smp_store_release() only guarantees memory ordering before it,
not after. That shouldn't be a problem here, though: if the decrement is
observed before the entry is cleared, the stale non-NULL entry sits above
the observed count and won't be checked anyway.
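To illustrate why that ordering is benign, here is a sketch of how a detection-side scan might look at a remote table (hypothetical userspace model; the snapshot struct stands in for whatever values of cnt and locks[] the scanning CPU happens to observe):

```c
#include <assert.h>
#include <stddef.h>

#define RES_NR_HELD 32

/* Snapshot of a remote CPU's table as the detector observes it. */
struct rqspinlock_held {
	int cnt;
	void *locks[RES_NR_HELD];
};

/*
 * The scan only inspects entries below the observed count. So if the
 * decrement of cnt becomes visible before the NULL store, the stale
 * non-NULL slot sits at index >= cnt and is never examined, which is
 * why the lack of ordering *after* smp_store_release() is harmless.
 */
static int table_contains(const struct rqspinlock_held *rqh, void *lock)
{
	int cnt = rqh->cnt < RES_NR_HELD ? rqh->cnt : RES_NR_HELD;

	for (int i = 0; i < cnt; i++)
		if (rqh->locks[i] == lock)
			return 1;
	return 0;
}
```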
> +}
>
> #endif /* __ASM_GENERIC_RQSPINLOCK_H */
> diff --git a/kernel/locking/rqspinlock.c b/kernel/locking/rqspinlock.c
> index b63f92bd43b1..b7c86127d288 100644
> --- a/kernel/locking/rqspinlock.c
> +++ b/kernel/locking/rqspinlock.c
> @@ -30,6 +30,7 @@
> * Include queued spinlock definitions and statistics code
> */
> #include "qspinlock.h"
> +#include "rqspinlock.h"
> #include "qspinlock_stat.h"
>
> /*
> @@ -74,16 +75,141 @@
> struct rqspinlock_timeout {
> u64 timeout_end;
> u64 duration;
> + u64 cur;
> u16 spin;
> };
>
> #define RES_TIMEOUT_VAL 2
>
> -static noinline int check_timeout(struct rqspinlock_timeout *ts)
> +DEFINE_PER_CPU_ALIGNED(struct rqspinlock_held, rqspinlock_held_locks);
> +
> +static bool is_lock_released(struct qspinlock *lock, u32 mask, struct rqspinlock_timeout *ts)
> +{
> + if (!(atomic_read_acquire(&lock->val) & (mask)))
> + return true;
> + return false;
> +}
> +
> +static noinline int check_deadlock_AA(struct qspinlock *lock, u32 mask,
> + struct rqspinlock_timeout *ts)
> +{
> + struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);
> + int cnt = min(RES_NR_HELD, rqh->cnt);
> +
> + /*
> + * Return an error if we hold the lock we are attempting to acquire.
> + * We'll iterate over max 32 locks; no need to do is_lock_released.
> + */
> + for (int i = 0; i < cnt - 1; i++) {
> + if (rqh->locks[i] == lock)
> + return -EDEADLK;
> + }
> + return 0;
> +}
> +
> +static noinline int check_deadlock_ABBA(struct qspinlock *lock, u32 mask,
> + struct rqspinlock_timeout *ts)
> +{
I think you should note that the ABBA check here is not exhaustive. It
covers only the most common two-CPU case, and corner cases (e.g. cycles
spanning more than two CPUs) will be missed.
Cheers,
Longman