BPF List
* [PATCH bpf v1] rqspinlock: Enclose lock/unlock within lock entry acquisitions
@ 2025-11-25 20:32 Kumar Kartikeya Dwivedi
  2025-11-26  2:46 ` Alexei Starovoitov
  0 siblings, 1 reply; 4+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-11-25 20:32 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Ritesh Oedayrajsingh Varma,
	Jelle van der Beek, kkd, kernel-team

We currently have a tiny window between the fast-path cmpxchg and the
grabbing of the lock entry where an NMI could land, attempt the same
lock that was just acquired, and end up timing out. This is not ideal.
Instead, move the lock entry acquisition from the fast path to before
the cmpxchg, and remove the grabbing of the lock entry in the slow path,
assuming it was already taken by the fast path.
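
For illustration, a sketch of the window described above (ordering
before this commit):

lock A
   atomic_try_cmpxchg_acquire(A) succeeds
	<NMI>
	lock A
	   cmpxchg fails, slow path: no entry for A in the table yet,
	   so the AA check cannot fire, spins until timeout
	</NMI>
   grab entry A   (too late for the NMI)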

There is a similar case when unlocking the lock. If the NMI lands
between the WRITE_ONCE and smp_store_release, it is possible that we end
up in a situation where the NMI fails to diagnose the AA condition,
leading to a timeout.
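
Again as a sketch, with the ordering before this commit:

unlock A
   WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL)
	<NMI>
	lock A
	   cmpxchg fails (A is still locked), slow path: A is no longer
	   in the table, so the AA check misses, spins until timeout
	</NMI>
   smp_store_release(A->locked, 0)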

The TAS fallback is invoked directly, without being preceded by the
typical fast path; therefore, we must continue to grab the deadlock
detection entry in that case.

Note the changes to the comments in release_held_lock_entry and
res_spin_unlock. They describe the following scenario, which this
commit introduces and which was previously avoided by placing the
smp_store_release after the WRITE_ONCE (the ordering before this commit):

grab entry A
lock A
grab entry B
lock B
unlock B
   smp_store_release(B->locked, 0)
							grab entry B
							lock B
							grab entry A
							lock A
							! <detect ABBA>
   WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL)

If the store release were placed after the WRITE_ONCE, the other CPU
would not observe B in the table of the CPU unlocking the lock B.
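
That is, with the old ordering:

   WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL)   (B removed from our table)
   smp_store_release(B->locked, 0)              (only now can the other CPU
						  acquire B and look at our table)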

Avoiding this while it was convenient was a prudent choice, but since it
leads to missed diagnosis of AA deadlocks in the presence of NMIs, it no
longer makes sense to keep that ordering. Moreover, while this
particular schedule is a misdiagnosis, the CPUs are clearly
participating in an ABBA deadlock otherwise, and previously only avoided
an error due to the fortunate timing of the race described above.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/asm-generic/rqspinlock.h | 66 ++++++++++++++++++--------------
 kernel/bpf/rqspinlock.c          | 15 +++-----
 2 files changed, 43 insertions(+), 38 deletions(-)

diff --git a/include/asm-generic/rqspinlock.h b/include/asm-generic/rqspinlock.h
index 6d4244d643df..2da3f1391914 100644
--- a/include/asm-generic/rqspinlock.h
+++ b/include/asm-generic/rqspinlock.h
@@ -129,8 +129,8 @@ static __always_inline void release_held_lock_entry(void)
 	 * <error> for lock B
 	 * release_held_lock_entry
 	 *
-	 * try_cmpxchg_acquire for lock A
 	 * grab_held_lock_entry
+	 * try_cmpxchg_acquire for lock A
 	 *
 	 * Lack of any ordering means reordering may occur such that dec, inc
 	 * are done before entry is overwritten. This permits a remote lock
@@ -139,13 +139,8 @@ static __always_inline void release_held_lock_entry(void)
 	 * CPU holds a lock it is attempting to acquire, leading to false ABBA
 	 * diagnosis).
 	 *
-	 * In case of unlock, we will always do a release on the lock word after
-	 * releasing the entry, ensuring that other CPUs cannot hold the lock
-	 * (and make conclusions about deadlocks) until the entry has been
-	 * cleared on the local CPU, preventing any anomalies. Reordering is
-	 * still possible there, but a remote CPU cannot observe a lock in our
-	 * table which it is already holding, since visibility entails our
-	 * release store for the said lock has not retired.
+	 * The case of unlock is treated differently due to NMI reentrancy, see
+	 * comments in res_spin_unlock.
 	 *
 	 * In theory we don't have a problem if the dec and WRITE_ONCE above get
 	 * reordered with each other, we either notice an empty NULL entry on
@@ -175,10 +170,16 @@ static __always_inline int res_spin_lock(rqspinlock_t *lock)
 {
 	int val = 0;

-	if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL))) {
-		grab_held_lock_entry(lock);
+	/*
+	 * Grab the deadlock detection entry before doing the cmpxchg, so that
+	 * reentrancy due to NMIs between the succeeding cmpxchg and creation of
+	 * held lock entry can correctly detect an acquisition attempt in the
+	 * interrupted context.
+	 */
+	grab_held_lock_entry(lock);
+
+	if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
 		return 0;
-	}
 	return resilient_queued_spin_lock_slowpath(lock, val);
 }

@@ -192,28 +193,35 @@ static __always_inline void res_spin_unlock(rqspinlock_t *lock)
 {
 	struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);

-	if (unlikely(rqh->cnt > RES_NR_HELD))
-		goto unlock;
-	WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL);
-unlock:
 	/*
-	 * Release barrier, ensures correct ordering. See release_held_lock_entry
-	 * for details.  Perform release store instead of queued_spin_unlock,
-	 * since we use this function for test-and-set fallback as well. When we
-	 * have CONFIG_QUEUED_SPINLOCKS=n, we clear the full 4-byte lockword.
+	 * Release barrier, ensures correct ordering. Perform release store
+	 * instead of queued_spin_unlock, since we use this function for the TAS
+	 * fallback as well. When we have CONFIG_QUEUED_SPINLOCKS=n, we clear
+	 * the full 4-byte lockword.
+	 */
+	smp_store_release(&lock->locked, 0);
+	if (likely(rqh->cnt <= RES_NR_HELD))
+		WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL);
+	/*
+	 * Unlike release_held_lock_entry, we do the lock word release before
+	 * rewriting the entry back to NULL, and place no ordering between the
+	 * WRITE_ONCE and dec, and possible reordering with grabbing an entry.
+	 *
+	 * This opens up a window where another CPU could acquire this lock, and
+	 * then observe it in our table on the current CPU, leading to possible
+	 * misdiagnosis of ABBA when we get reordered with a
+	 * grab_held_lock_entry's writes (see the case described in
+	 * release_held_lock_entry comments).
 	 *
-	 * Like release_held_lock_entry, we can do the release before the dec.
-	 * We simply care about not seeing the 'lock' in our table from a remote
-	 * CPU once the lock has been released, which doesn't rely on the dec.
+	 * This could be avoided if we did the smp_store_release right before
+	 * the dec, ensuring that the remote CPU could only acquire this lock
+	 * and never observe this lock in our table.
 	 *
-	 * Unlike smp_wmb(), release is not a two way fence, hence it is
-	 * possible for a inc to move up and reorder with our clearing of the
-	 * entry. This isn't a problem however, as for a misdiagnosis of ABBA,
-	 * the remote CPU needs to hold this lock, which won't be released until
-	 * the store below is done, which would ensure the entry is overwritten
-	 * to NULL, etc.
+	 * However, that opens up a window where reentrant NMIs on this same
+	 * CPU could have their AA heuristics fail to fire if they land between
+	 * the WRITE_ONCE and unlock release store, which would result in a
+	 * timeout.
 	 */
-	smp_store_release(&lock->locked, 0);
 	this_cpu_dec(rqspinlock_held_locks.cnt);
 }

diff --git a/kernel/bpf/rqspinlock.c b/kernel/bpf/rqspinlock.c
index 3cc23d79a9fc..878d641719da 100644
--- a/kernel/bpf/rqspinlock.c
+++ b/kernel/bpf/rqspinlock.c
@@ -275,6 +275,10 @@ int __lockfunc resilient_tas_spin_lock(rqspinlock_t *lock)
 	int val, ret = 0;

 	RES_INIT_TIMEOUT(ts);
+	/*
+	 * The fast path is not invoked for the TAS fallback, so we must grab
+	 * the deadlock detection entry here.
+	 */
 	grab_held_lock_entry(lock);

 	/*
@@ -397,10 +401,7 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
 		goto queue;
 	}

-	/*
-	 * Grab an entry in the held locks array, to enable deadlock detection.
-	 */
-	grab_held_lock_entry(lock);
+	/* Deadlock detection entry already held after failing fast path. */

 	/*
 	 * We're pending, wait for the owner to go away.
@@ -448,11 +449,7 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
 	 */
 queue:
 	lockevent_inc(lock_slowpath);
-	/*
-	 * Grab deadlock detection entry for the queue path.
-	 */
-	grab_held_lock_entry(lock);
-
+	/* Deadlock detection entry already held after failing fast path. */
 	node = this_cpu_ptr(&rqnodes[0].mcs);
 	idx = node->count++;
 	tail = encode_tail(smp_processor_id(), idx);
--
2.51.0



* Re: [PATCH bpf v1] rqspinlock: Enclose lock/unlock within lock entry acquisitions
  2025-11-25 20:32 [PATCH bpf v1] rqspinlock: Enclose lock/unlock within lock entry acquisitions Kumar Kartikeya Dwivedi
@ 2025-11-26  2:46 ` Alexei Starovoitov
  2025-11-26  3:51   ` Alexei Starovoitov
  0 siblings, 1 reply; 4+ messages in thread
From: Alexei Starovoitov @ 2025-11-26  2:46 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Ritesh Oedayrajsingh Varma,
	Jelle van der Beek, kkd, Kernel Team

On Tue, Nov 25, 2025 at 12:33 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> We currently have a tiny window between the fast-path cmpxchg and the
> grabbing of the lock entry where an NMI could land, attempt the same
> lock that was just acquired, and end up timing out. This is not ideal.
> Instead, move the lock entry acquisition from the fast path to before
> the cmpxchg, and remove the grabbing of the lock entry in the slow path,
> assuming it was already taken by the fast path.
>
> There is a similar case when unlocking the lock. If the NMI lands
> between the WRITE_ONCE and smp_store_release, it is possible that we end
> up in a situation where the NMI fails to diagnose the AA condition,
> leading to a timeout.
>
> The TAS fallback is invoked directly, without being preceded by the
> typical fast path; therefore, we must continue to grab the deadlock
> detection entry in that case.
>
> Note the changes to the comments in release_held_lock_entry and
> res_spin_unlock. They describe the following scenario, which this
> commit introduces and which was previously avoided by placing the
> smp_store_release after the WRITE_ONCE (the ordering before this commit):
>
> grab entry A
> lock A
> grab entry B
> lock B
> unlock B
>    smp_store_release(B->locked, 0)
>                                                         grab entry B
>                                                         lock B
>                                                         grab entry A
>                                                         lock A
>                                                         ! <detect ABBA>
>    WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL)
>
> If the store release were placed after the WRITE_ONCE, the other CPU
> would not observe B in the table of the CPU unlocking the lock B.

I think the above should be reworded. It sounds like an excuse instead
of explaining why it's done this way. I would say it like:

This patch changes the order to: smp_store_release(); WRITE_ONCE()
to avoid missing detection of AA deadlock in case of:
... new diagram...

The reverse order: WRITE_ONCE(); smp_store_release() was there
to prevent misdiagnosing ABBA in the following scenario...
... your diagram...
but CPUs are actually participating in ABBA deadlock, so it wasn't
exactly a misdiagnosis.

> Avoiding this while it was convenient was a prudent choice, but since it
> leads to missed diagnosis of AA deadlocks in the presence of NMIs, it no
> longer makes sense to keep that ordering. Moreover, while this
> particular schedule is a misdiagnosis, the CPUs are clearly
> participating in an ABBA deadlock otherwise, and previously only avoided
> an error due to the fortunate timing of the race described above.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

Fixes tag is missing.

> ---
>  include/asm-generic/rqspinlock.h | 66 ++++++++++++++++++--------------
>  kernel/bpf/rqspinlock.c          | 15 +++-----
>  2 files changed, 43 insertions(+), 38 deletions(-)
>
> diff --git a/include/asm-generic/rqspinlock.h b/include/asm-generic/rqspinlock.h
> index 6d4244d643df..2da3f1391914 100644
> --- a/include/asm-generic/rqspinlock.h
> +++ b/include/asm-generic/rqspinlock.h
> @@ -129,8 +129,8 @@ static __always_inline void release_held_lock_entry(void)
>          * <error> for lock B
>          * release_held_lock_entry
>          *
> -        * try_cmpxchg_acquire for lock A
>          * grab_held_lock_entry
> +        * try_cmpxchg_acquire for lock A
>          *
>          * Lack of any ordering means reordering may occur such that dec, inc
>          * are done before entry is overwritten. This permits a remote lock
> @@ -139,13 +139,8 @@ static __always_inline void release_held_lock_entry(void)
>          * CPU holds a lock it is attempting to acquire, leading to false ABBA
>          * diagnosis).
>          *
> -        * In case of unlock, we will always do a release on the lock word after
> -        * releasing the entry, ensuring that other CPUs cannot hold the lock
> -        * (and make conclusions about deadlocks) until the entry has been
> -        * cleared on the local CPU, preventing any anomalies. Reordering is
> -        * still possible there, but a remote CPU cannot observe a lock in our
> -        * table which it is already holding, since visibility entails our
> -        * release store for the said lock has not retired.
> +        * The case of unlock is treated differently due to NMI reentrancy, see
> +        * comments in res_spin_unlock.
>          *
>          * In theory we don't have a problem if the dec and WRITE_ONCE above get
>          * reordered with each other, we either notice an empty NULL entry on
> @@ -175,10 +170,16 @@ static __always_inline int res_spin_lock(rqspinlock_t *lock)
>  {
>         int val = 0;
>
> -       if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL))) {
> -               grab_held_lock_entry(lock);
> +       /*
> +        * Grab the deadlock detection entry before doing the cmpxchg, so that
> +        * reentrancy due to NMIs between the succeeding cmpxchg and creation of
> +        * held lock entry can correctly detect an acquisition attempt in the
> +        * interrupted context.

I would add AA diagram here to the comment as well.

> +        */
> +       grab_held_lock_entry(lock);
> +
> +       if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
>                 return 0;
> -       }
>         return resilient_queued_spin_lock_slowpath(lock, val);
>  }
>
> @@ -192,28 +193,35 @@ static __always_inline void res_spin_unlock(rqspinlock_t *lock)
>  {
>         struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);
>
> -       if (unlikely(rqh->cnt > RES_NR_HELD))
> -               goto unlock;
> -       WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL);
> -unlock:
>         /*
> -        * Release barrier, ensures correct ordering. See release_held_lock_entry
> -        * for details.  Perform release store instead of queued_spin_unlock,
> -        * since we use this function for test-and-set fallback as well. When we
> -        * have CONFIG_QUEUED_SPINLOCKS=n, we clear the full 4-byte lockword.
> +        * Release barrier, ensures correct ordering. Perform release store
> +        * instead of queued_spin_unlock, since we use this function for the TAS
> +        * fallback as well. When we have CONFIG_QUEUED_SPINLOCKS=n, we clear
> +        * the full 4-byte lockword.
> +        */
> +       smp_store_release(&lock->locked, 0);
> +       if (likely(rqh->cnt <= RES_NR_HELD))
> +               WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL);
> +       /*
> +        * Unlike release_held_lock_entry, we do the lock word release before
> +        * rewriting the entry back to NULL, and place no ordering between the
> +        * WRITE_ONCE and dec, and possible reordering with grabbing an entry.

The above is quite misleading.
"Unlike release_held_lock_entry" applies to the 2nd part of the sentence
"possible reordering" and doesn't apply at all to "lock word release".
There is no lock word release in release_held_lock_entry().

> +        *
> +        * This opens up a window where another CPU could acquire this lock, and
> +        * then observe it in our table on the current CPU, leading to possible
> +        * misdiagnosis of ABBA when we get reordered with a
> +        * grab_held_lock_entry's writes (see the case described in
> +        * release_held_lock_entry comments).

Just drop this part. "Misdiagnosis of ABBA" is not quite correct.
As you said CPUs are participating in ABBA, so it's real ABBA.
The commit log description is enough.

>          *
> -        * Like release_held_lock_entry, we can do the release before the dec.
> -        * We simply care about not seeing the 'lock' in our table from a remote
> -        * CPU once the lock has been released, which doesn't rely on the dec.
> +        * This could be avoided if we did the smp_store_release right before
> +        * the dec, ensuring that the remote CPU could only acquire this lock
> +        * and never observe this lock in our table.

Another unnecessary comment. Let's not reflect over old code.
The comment should describe what the code does. For history there is git.

>          *
> -        * Unlike smp_wmb(), release is not a two way fence, hence it is
> -        * possible for a inc to move up and reorder with our clearing of the
> -        * entry. This isn't a problem however, as for a misdiagnosis of ABBA,
> -        * the remote CPU needs to hold this lock, which won't be released until
> -        * the store below is done, which would ensure the entry is overwritten
> -        * to NULL, etc.
> +        * However, that opens up a window where reentrant NMIs on this same
> +        * CPU could have their AA heuristics fail to fire if they land between
> +        * the WRITE_ONCE and unlock release store, which would result in a
> +        * timeout.
>          */
> -       smp_store_release(&lock->locked, 0);
>         this_cpu_dec(rqspinlock_held_locks.cnt);
>  }
>
> diff --git a/kernel/bpf/rqspinlock.c b/kernel/bpf/rqspinlock.c
> index 3cc23d79a9fc..878d641719da 100644
> --- a/kernel/bpf/rqspinlock.c
> +++ b/kernel/bpf/rqspinlock.c
> @@ -275,6 +275,10 @@ int __lockfunc resilient_tas_spin_lock(rqspinlock_t *lock)
>         int val, ret = 0;
>
>         RES_INIT_TIMEOUT(ts);
> +       /*
> +        * The fast path is not invoked for the TAS fallback, so we must grab
> +        * the deadlock detection entry here.
> +        */
>         grab_held_lock_entry(lock);
>
>         /*
> @@ -397,10 +401,7 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
>                 goto queue;
>         }
>
> -       /*
> -        * Grab an entry in the held locks array, to enable deadlock detection.
> -        */
> -       grab_held_lock_entry(lock);
> +       /* Deadlock detection entry already held after failing fast path. */
>
>         /*
>          * We're pending, wait for the owner to go away.
> @@ -448,11 +449,7 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
>          */
>  queue:
>         lockevent_inc(lock_slowpath);
> -       /*
> -        * Grab deadlock detection entry for the queue path.
> -        */
> -       grab_held_lock_entry(lock);
> -
> +       /* Deadlock detection entry already held after failing fast path. */

Overall all makes sense to me, but I thought the patch will fix
these messages from stress test:

[   12.636716] INFO: NMI handler (perf_event_nmi_handler) took too
long to run: 12.473 msecs
[   12.785373] INFO: NMI handler (perf_event_nmi_handler) took too
long to run: 261.095 msecs
[   21.455161]    >= 251ms: total 5 (normal 0, nmi 5)

but the stats seem to be the same before and after the patch
when I played with the patch in bpf-next.

I suspect there is more here to discover.
Since we're running out of time, bpf-next is fine for respin.

pw-bot: cr


* Re: [PATCH bpf v1] rqspinlock: Enclose lock/unlock within lock entry acquisitions
  2025-11-26  2:46 ` Alexei Starovoitov
@ 2025-11-26  3:51   ` Alexei Starovoitov
  2025-11-27  0:44     ` Ritesh Oedayrajsingh Varma
  0 siblings, 1 reply; 4+ messages in thread
From: Alexei Starovoitov @ 2025-11-26  3:51 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Ritesh Oedayrajsingh Varma,
	Jelle van der Beek, kkd, Kernel Team

On Tue, Nov 25, 2025 at 6:46 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> Overall all makes sense to me, but I thought the patch will fix
> these messages from stress test:
>
> [   12.636716] INFO: NMI handler (perf_event_nmi_handler) took too
> long to run: 12.473 msecs
> [   12.785373] INFO: NMI handler (perf_event_nmi_handler) took too
> long to run: 261.095 msecs
> [   21.455161]    >= 251ms: total 5 (normal 0, nmi 5)
>
> but the stats seem to be the same before and after the patch
> when I played with the patch in bpf-next.
>
> I suspect there is more here to discover.

I tried:
        if (unlikely(idx >= _Q_MAX_NODES || in_nmi())) {
                lockevent_inc(lock_no_node);
-               RES_RESET_TIMEOUT(ts, RES_DEF_TIMEOUT);
+               RES_RESET_TIMEOUT(ts, RES_DEF_TIMEOUT * 16);

and saw that it hits the 4 sec timeout just as well, while the stats
show that lock acquisition is unfair:
cpu4: total 20 (normal 13, nmi 7)
cpu5: total 319 (normal 160, nmi 159)
cpu6: total 470 (normal 238, nmi 232)
cpu7: total 25 (normal 13, nmi 12)

which, I think, means that queued_spin_trylock() in NMI has no chance
of competing with the queued lock logic. Other CPUs from !in_nmi()
context keep grabbing and releasing the same lock multiple times while
the in_nmi waiter is hopelessly spinning on trylock.
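
The loop in question is roughly of this shape (paraphrasing from
rqspinlock.c; the timeout-check arguments and error label are
approximated):

	while (!queued_spin_trylock(lock)) {
		if (RES_CHECK_TIMEOUT(ts, ret, ~0u))
			goto err; /* times out here */
		cpu_relax();
	}

so the NMI waiter only makes progress if it happens to win a raw
trylock race against waiters holding a place in the queue.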

I think we should consider:
        if (unlikely(idx >= _Q_MAX_NODES || in_nmi())) {
                lockevent_inc(lock_no_node);
-               RES_RESET_TIMEOUT(ts, RES_DEF_TIMEOUT);
+               RES_RESET_TIMEOUT(ts, NSEC_PER_MSEC);

until proper queueing logic is implemented for in_nmi.


* Re: [PATCH bpf v1] rqspinlock: Enclose lock/unlock within lock entry acquisitions
  2025-11-26  3:51   ` Alexei Starovoitov
@ 2025-11-27  0:44     ` Ritesh Oedayrajsingh Varma
  0 siblings, 0 replies; 4+ messages in thread
From: Ritesh Oedayrajsingh Varma @ 2025-11-27  0:44 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kumar Kartikeya Dwivedi, bpf, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman,
	Jelle van der Beek, kkd, Kernel Team

On Wed, Nov 26, 2025 at 4:51 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Nov 25, 2025 at 6:46 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > Overall all makes sense to me, but I thought the patch will fix
> > these messages from stress test:
> >
> > [   12.636716] INFO: NMI handler (perf_event_nmi_handler) took too
> > long to run: 12.473 msecs
> > [   12.785373] INFO: NMI handler (perf_event_nmi_handler) took too
> > long to run: 261.095 msecs
> > [   21.455161]    >= 251ms: total 5 (normal 0, nmi 5)
> >
> > but the stats seem to be the same before and after the patch
> > when I played with the patch in bpf-next.
> >
> > I suspect there is more here to discover.
>
> I tried:
>         if (unlikely(idx >= _Q_MAX_NODES || in_nmi())) {
>                 lockevent_inc(lock_no_node);
> -               RES_RESET_TIMEOUT(ts, RES_DEF_TIMEOUT);
> +               RES_RESET_TIMEOUT(ts, RES_DEF_TIMEOUT * 16);
>
> and saw that it hits the 4 sec timeout just as well, while the stats
> show that lock acquisition is unfair:
> cpu4: total 20 (normal 13, nmi 7)
> cpu5: total 319 (normal 160, nmi 159)
> cpu6: total 470 (normal 238, nmi 232)
> cpu7: total 25 (normal 13, nmi 12)
>
> which, I think, means that queued_spin_trylock() in NMI has no chance
> of competing with the queued lock logic. Other CPUs from !in_nmi()
> context keep grabbing and releasing the same lock multiple times while
> the in_nmi waiter is hopelessly spinning on trylock.
>
> I think we should consider:
>         if (unlikely(idx >= _Q_MAX_NODES || in_nmi())) {
>                 lockevent_inc(lock_no_node);
> -               RES_RESET_TIMEOUT(ts, RES_DEF_TIMEOUT);
> +               RES_RESET_TIMEOUT(ts, NSEC_PER_MSEC);
>
> until proper queueing logic is implemented for in_nmi.

Thanks both! I've tested the AA fix and the timeout change with a
locally compiled kernel and posted my results to the original thread.

Let me know if you have any questions or if there are any further
tests you'd like me to run.

