* [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
2026-01-01 16:34 [PATCH -next 0/8] RCU updates from me for next merge window Joel Fernandes
@ 2026-01-01 16:34 ` Joel Fernandes
2026-01-02 17:28 ` Steven Rostedt
2026-01-07 23:14 ` Frederic Weisbecker
2026-01-01 16:34 ` [PATCH -next 2/8] srcu: Use suitable gfp_flags for the init_srcu_struct_nodes() Joel Fernandes
` (7 subsequent siblings)
8 siblings, 2 replies; 38+ messages in thread
From: Joel Fernandes @ 2026-01-01 16:34 UTC (permalink / raw)
To: Paul E . McKenney, Boqun Feng, rcu
Cc: Frederic Weisbecker, Neeraj Upadhyay, Josh Triplett,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Shuah Khan, linux-kernel, linux-kselftest,
Yao Kai, Tengda Wu, Joel Fernandes
From: Yao Kai <yaokai34@huawei.com>
Commit 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in
__rcu_read_unlock()") removed the recursion-protection code from
__rcu_read_unlock(). As a result, with ftrace enabled, the following
deadloop can be triggered in raise_softirq_irqoff():
WARNING: CPU: 0 PID: 0 at kernel/trace/trace.c:3021 __ftrace_trace_stack.constprop.0+0x172/0x180
Modules linked in: my_irq_work(O)
CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G O 6.18.0-rc7-dirty #23 PREEMPT(full)
Tainted: [O]=OOT_MODULE
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:__ftrace_trace_stack.constprop.0+0x172/0x180
RSP: 0018:ffffc900000034a8 EFLAGS: 00010002
RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000000
RDX: 0000000000000003 RSI: ffffffff826d7b87 RDI: ffffffff826e9329
RBP: 0000000000090009 R08: 0000000000000005 R09: ffffffff82afbc4c
R10: 0000000000000008 R11: 0000000000011d7a R12: 0000000000000000
R13: ffff888003874100 R14: 0000000000000003 R15: ffff8880038c1054
FS: 0000000000000000(0000) GS:ffff8880fa8ea000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055b31fa7f540 CR3: 00000000078f4005 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
<IRQ>
trace_buffer_unlock_commit_regs+0x6d/0x220
trace_event_buffer_commit+0x5c/0x260
trace_event_raw_event_softirq+0x47/0x80
raise_softirq_irqoff+0x6e/0xa0
rcu_read_unlock_special+0xb1/0x160
unwind_next_frame+0x203/0x9b0
__unwind_start+0x15d/0x1c0
arch_stack_walk+0x62/0xf0
stack_trace_save+0x48/0x70
__ftrace_trace_stack.constprop.0+0x144/0x180
trace_buffer_unlock_commit_regs+0x6d/0x220
trace_event_buffer_commit+0x5c/0x260
trace_event_raw_event_softirq+0x47/0x80
raise_softirq_irqoff+0x6e/0xa0
rcu_read_unlock_special+0xb1/0x160
unwind_next_frame+0x203/0x9b0
__unwind_start+0x15d/0x1c0
arch_stack_walk+0x62/0xf0
stack_trace_save+0x48/0x70
__ftrace_trace_stack.constprop.0+0x144/0x180
trace_buffer_unlock_commit_regs+0x6d/0x220
trace_event_buffer_commit+0x5c/0x260
trace_event_raw_event_softirq+0x47/0x80
raise_softirq_irqoff+0x6e/0xa0
rcu_read_unlock_special+0xb1/0x160
unwind_next_frame+0x203/0x9b0
__unwind_start+0x15d/0x1c0
arch_stack_walk+0x62/0xf0
stack_trace_save+0x48/0x70
__ftrace_trace_stack.constprop.0+0x144/0x180
trace_buffer_unlock_commit_regs+0x6d/0x220
trace_event_buffer_commit+0x5c/0x260
trace_event_raw_event_softirq+0x47/0x80
raise_softirq_irqoff+0x6e/0xa0
rcu_read_unlock_special+0xb1/0x160
__is_insn_slot_addr+0x54/0x70
kernel_text_address+0x48/0xc0
__kernel_text_address+0xd/0x40
unwind_get_return_address+0x1e/0x40
arch_stack_walk+0x9c/0xf0
stack_trace_save+0x48/0x70
__ftrace_trace_stack.constprop.0+0x144/0x180
trace_buffer_unlock_commit_regs+0x6d/0x220
trace_event_buffer_commit+0x5c/0x260
trace_event_raw_event_softirq+0x47/0x80
__raise_softirq_irqoff+0x61/0x80
__flush_smp_call_function_queue+0x115/0x420
__sysvec_call_function_single+0x17/0xb0
sysvec_call_function_single+0x8c/0xc0
</IRQ>
Commit b41642c87716 ("rcu: Fix rcu_read_unlock() deadloop due to IRQ work")
fixed the infinite loop in rcu_read_unlock_special() for IRQ work by
setting a flag before calling irq_work_queue_on(). Fix this issue by
setting the same flag before calling raise_softirq_irqoff(), and rename
the flag to defer_qs_pending to reflect its more general role.
Fixes: 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in __rcu_read_unlock()")
Reported-by: Tengda Wu <wutengda2@huawei.com>
Signed-off-by: Yao Kai <yaokai34@huawei.com>
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
kernel/rcu/tree.h | 2 +-
kernel/rcu/tree_plugin.h | 15 +++++++++------
2 files changed, 10 insertions(+), 7 deletions(-)
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index b8bbe7960cda..2265b9c2906e 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -203,7 +203,7 @@ struct rcu_data {
/* during and after the last grace */
/* period it is aware of. */
struct irq_work defer_qs_iw; /* Obtain later scheduler attention. */
- int defer_qs_iw_pending; /* Scheduler attention pending? */
+ int defer_qs_pending; /* irqwork or softirq pending? */
struct work_struct strict_work; /* Schedule readers for strict GPs. */
/* 2) batch handling */
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index dbe2d02be824..95ad967adcf3 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -487,8 +487,8 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
union rcu_special special;
rdp = this_cpu_ptr(&rcu_data);
- if (rdp->defer_qs_iw_pending == DEFER_QS_PENDING)
- rdp->defer_qs_iw_pending = DEFER_QS_IDLE;
+ if (rdp->defer_qs_pending == DEFER_QS_PENDING)
+ rdp->defer_qs_pending = DEFER_QS_IDLE;
/*
* If RCU core is waiting for this CPU to exit its critical section,
@@ -645,7 +645,7 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
* 5. Deferred QS reporting does not happen.
*/
if (rcu_preempt_depth() > 0)
- WRITE_ONCE(rdp->defer_qs_iw_pending, DEFER_QS_IDLE);
+ WRITE_ONCE(rdp->defer_qs_pending, DEFER_QS_IDLE);
}
/*
@@ -747,7 +747,10 @@ static void rcu_read_unlock_special(struct task_struct *t)
// Using softirq, safe to awaken, and either the
// wakeup is free or there is either an expedited
// GP in flight or a potential need to deboost.
- raise_softirq_irqoff(RCU_SOFTIRQ);
+ if (rdp->defer_qs_pending != DEFER_QS_PENDING) {
+ rdp->defer_qs_pending = DEFER_QS_PENDING;
+ raise_softirq_irqoff(RCU_SOFTIRQ);
+ }
} else {
// Enabling BH or preempt does reschedule, so...
// Also if no expediting and no possible deboosting,
@@ -755,11 +758,11 @@ static void rcu_read_unlock_special(struct task_struct *t)
// tick enabled.
set_need_resched_current();
if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
- needs_exp && rdp->defer_qs_iw_pending != DEFER_QS_PENDING &&
+ needs_exp && rdp->defer_qs_pending != DEFER_QS_PENDING &&
cpu_online(rdp->cpu)) {
// Get scheduler to re-evaluate and call hooks.
// If !IRQ_WORK, FQS scan will eventually IPI.
- rdp->defer_qs_iw_pending = DEFER_QS_PENDING;
+ rdp->defer_qs_pending = DEFER_QS_PENDING;
irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
}
}
--
2.34.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
2026-01-01 16:34 ` [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq Joel Fernandes
@ 2026-01-02 17:28 ` Steven Rostedt
2026-01-02 17:30 ` Steven Rostedt
2026-01-07 23:14 ` Frederic Weisbecker
1 sibling, 1 reply; 38+ messages in thread
From: Steven Rostedt @ 2026-01-02 17:28 UTC (permalink / raw)
To: Joel Fernandes
Cc: Paul E . McKenney, Boqun Feng, rcu, Frederic Weisbecker,
Neeraj Upadhyay, Josh Triplett, Uladzislau Rezki,
Mathieu Desnoyers, Lai Jiangshan, Zqiang, Shuah Khan,
linux-kernel, linux-kselftest, Yao Kai, Tengda Wu
On Thu, 1 Jan 2026 11:34:10 -0500
Joel Fernandes <joelagnelf@nvidia.com> wrote:
> trace_buffer_unlock_commit_regs+0x6d/0x220
> trace_event_buffer_commit+0x5c/0x260
> trace_event_raw_event_softirq+0x47/0x80
> raise_softirq_irqoff+0x6e/0xa0
> rcu_read_unlock_special+0xb1/0x160
> unwind_next_frame+0x203/0x9b0
> __unwind_start+0x15d/0x1c0
> arch_stack_walk+0x62/0xf0
> stack_trace_save+0x48/0x70
> __ftrace_trace_stack.constprop.0+0x144/0x180
> trace_buffer_unlock_commit_regs+0x6d/0x220
> trace_event_buffer_commit+0x5c/0x260
> trace_event_raw_event_softirq+0x47/0x80
> raise_softirq_irqoff+0x6e/0xa0
> rcu_read_unlock_special+0xb1/0x160
> unwind_next_frame+0x203/0x9b0
> __unwind_start+0x15d/0x1c0
> arch_stack_walk+0x62/0xf0
> stack_trace_save+0x48/0x70
> __ftrace_trace_stack.constprop.0+0x144/0x180
Stacktrace should have recursion protection too.
Can you try this patch to see if it would have fixed the problem too?
-- Steve
diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index ae04054a1be3..e6ca052b2a85 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -34,6 +34,13 @@ enum {
TRACE_INTERNAL_SIRQ_BIT,
TRACE_INTERNAL_TRANSITION_BIT,
+ /* Internal event use recursion bits */
+ TRACE_INTERNAL_EVENT_BIT,
+ TRACE_INTERNAL_EVENT_NMI_BIT,
+ TRACE_INTERNAL_EVENT_IRQ_BIT,
+ TRACE_INTERNAL_EVENT_SIRQ_BIT,
+ TRACE_INTERNAL_EVENT_TRANSITION_BIT,
+
TRACE_BRANCH_BIT,
/*
* Abuse of the trace_recursion.
@@ -58,6 +65,8 @@ enum {
#define TRACE_LIST_START TRACE_INTERNAL_BIT
+#define TRACE_EVENT_START TRACE_INTERNAL_EVENT_BIT
+
#define TRACE_CONTEXT_MASK ((1 << (TRACE_LIST_START + TRACE_CONTEXT_BITS)) - 1)
/*
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2d387d56dcd4..e145d1c7f604 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3013,6 +3013,11 @@ static void __ftrace_trace_stack(struct trace_array *tr,
struct ftrace_stack *fstack;
struct stack_entry *entry;
int stackidx;
+ int bit;
+
+ bit = trace_test_and_set_recursion(_THIS_IP_, _RET_IP_, TRACE_EVENT_START);
+ if (bit < 0)
+ return;
/*
* Add one, for this function and the call to save_stack_trace()
@@ -3081,6 +3086,7 @@ static void __ftrace_trace_stack(struct trace_array *tr,
/* Again, don't let gcc optimize things here */
barrier();
__this_cpu_dec(ftrace_stack_reserve);
+ trace_clear_recursion(bit);
}
static inline void ftrace_trace_stack(struct trace_array *tr,
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
2026-01-02 17:28 ` Steven Rostedt
@ 2026-01-02 17:30 ` Steven Rostedt
2026-01-02 19:51 ` Paul E. McKenney
0 siblings, 1 reply; 38+ messages in thread
From: Steven Rostedt @ 2026-01-02 17:30 UTC (permalink / raw)
To: Joel Fernandes
Cc: Paul E . McKenney, Boqun Feng, rcu, Frederic Weisbecker,
Neeraj Upadhyay, Josh Triplett, Uladzislau Rezki,
Mathieu Desnoyers, Lai Jiangshan, Zqiang, Shuah Khan,
linux-kernel, linux-kselftest, Yao Kai, Tengda Wu
On Fri, 2 Jan 2026 12:28:07 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:
> Stacktrace should have recursion protection too.
>
> Can you try this patch to see if it would have fixed the problem too?
As I believe the recursion protection should be in the tracing
infrastructure more than in RCU. As RCU is used as an active participant in
the kernel whereas tracing is supposed to be only an observer.
If tracing is the culprit, it should be the one that is fixed.
Thanks,
-- Steve
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
2026-01-02 17:30 ` Steven Rostedt
@ 2026-01-02 19:51 ` Paul E. McKenney
2026-01-03 0:41 ` Joel Fernandes
0 siblings, 1 reply; 38+ messages in thread
From: Paul E. McKenney @ 2026-01-02 19:51 UTC (permalink / raw)
To: Steven Rostedt
Cc: Joel Fernandes, Boqun Feng, rcu, Frederic Weisbecker,
Neeraj Upadhyay, Josh Triplett, Uladzislau Rezki,
Mathieu Desnoyers, Lai Jiangshan, Zqiang, Shuah Khan,
linux-kernel, linux-kselftest, Yao Kai, Tengda Wu
On Fri, Jan 02, 2026 at 12:30:09PM -0500, Steven Rostedt wrote:
> On Fri, 2 Jan 2026 12:28:07 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > Stacktrace should have recursion protection too.
> >
> > Can you try this patch to see if it would have fixed the problem too?
>
> As I believe the recursion protection should be in the tracing
> infrastructure more than in RCU. As RCU is used as an active participant in
> the kernel whereas tracing is supposed to be only an observer.
>
> If tracing is the culprit, it should be the one that is fixed.
Makes sense to me! But then it would... ;-)
Thanx, Paul
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
2026-01-02 19:51 ` Paul E. McKenney
@ 2026-01-03 0:41 ` Joel Fernandes
2026-01-04 3:20 ` Yao Kai
2026-01-04 10:00 ` Boqun Feng
0 siblings, 2 replies; 38+ messages in thread
From: Joel Fernandes @ 2026-01-03 0:41 UTC (permalink / raw)
To: paulmck, Steven Rostedt
Cc: Boqun Feng, rcu, Frederic Weisbecker, Neeraj Upadhyay,
Josh Triplett, Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan,
Zqiang, Shuah Khan, linux-kernel, linux-kselftest, Yao Kai,
Tengda Wu
On 1/2/2026 2:51 PM, Paul E. McKenney wrote:
> On Fri, Jan 02, 2026 at 12:30:09PM -0500, Steven Rostedt wrote:
>> On Fri, 2 Jan 2026 12:28:07 -0500
>> Steven Rostedt <rostedt@goodmis.org> wrote:
>>
>>> Stacktrace should have recursion protection too.
>>>
>>> Can you try this patch to see if it would have fixed the problem too?
>>
>> As I believe the recursion protection should be in the tracing
>> infrastructure more than in RCU. As RCU is used as an active participant in
>> the kernel whereas tracing is supposed to be only an observer.
>>
>> If tracing is the culprit, it should be the one that is fixed.
>
> Makes sense to me! But then it would... ;-)
>
Could we fix it in both? (RCU and tracing). The patch just adds 3 more net lines
to RCU code. It'd be good to have a guard rail against softirq recursion in RCU
read unlock path, as much as the existing guard rail we already have with
irq_work? After all, both paths attempt to do deferred work when it is safer to
do so.
Yao, if you could test Steve's patch and reply whether it fixes it too?
thanks,
- Joel
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
2026-01-03 0:41 ` Joel Fernandes
@ 2026-01-04 3:20 ` Yao Kai
2026-01-05 17:16 ` Steven Rostedt
2026-01-04 10:00 ` Boqun Feng
1 sibling, 1 reply; 38+ messages in thread
From: Yao Kai @ 2026-01-04 3:20 UTC (permalink / raw)
To: Joel Fernandes, paulmck, Steven Rostedt
Cc: Boqun Feng, rcu, Frederic Weisbecker, Neeraj Upadhyay,
Josh Triplett, Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan,
Zqiang, Shuah Khan, linux-kernel, linux-kselftest, Tengda Wu,
liuyongqiang13, yujiacheng3
On 1/3/2026 8:41 AM, Joel Fernandes wrote:
>
>
> On 1/2/2026 2:51 PM, Paul E. McKenney wrote:
>> On Fri, Jan 02, 2026 at 12:30:09PM -0500, Steven Rostedt wrote:
>>> On Fri, 2 Jan 2026 12:28:07 -0500
>>> Steven Rostedt <rostedt@goodmis.org> wrote:
>>>
>>>> Stacktrace should have recursion protection too.
>>>>
>>>> Can you try this patch to see if it would have fixed the problem too?
>>>
>>> As I believe the recursion protection should be in the tracing
>>> infrastructure more than in RCU. As RCU is used as an active participant in
>>> the kernel whereas tracing is supposed to be only an observer.
>>>
>>> If tracing is the culprit, it should be the one that is fixed.
>>
>> Makes sense to me! But then it would... ;-)
>>
> Could we fix it in both? (RCU and tracing). The patch just adds 3 more net lines
> to RCU code. It'd be good to have a guard rail against softirq recursion in RCU
> read unlock path, as much as the existing guard rail we already have with
> irq_work? After all, both paths attempt to do deferred work when it is safer to
> do so.
>
> Yao, if you could test Steve's patch and reply whether it fixes it too?
>
> thanks,
>
> - Joel
>
>
>
>
Yes, I tested Steve's patch. It fixes the issue too.
Tested-by: Yao Kai <yaokai34@huawei.com>
- Yao
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
2026-01-04 3:20 ` Yao Kai
@ 2026-01-05 17:16 ` Steven Rostedt
2026-01-09 16:38 ` Steven Rostedt
0 siblings, 1 reply; 38+ messages in thread
From: Steven Rostedt @ 2026-01-05 17:16 UTC (permalink / raw)
To: Yao Kai
Cc: Joel Fernandes, paulmck, Boqun Feng, rcu, Frederic Weisbecker,
Neeraj Upadhyay, Josh Triplett, Uladzislau Rezki,
Mathieu Desnoyers, Lai Jiangshan, Zqiang, Shuah Khan,
linux-kernel, linux-kselftest, Tengda Wu, liuyongqiang13,
yujiacheng3
On Sun, 4 Jan 2026 11:20:07 +0800
Yao Kai <yaokai34@huawei.com> wrote:
> Yes, I tested Steve's patch. It fixes the issue too.
>
> Tested-by: Yao Kai <yaokai34@huawei.com>
Thanks for testing. I'll send out a formal patch.
And yes, I agree we should do both.
-- Steve
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
2026-01-05 17:16 ` Steven Rostedt
@ 2026-01-09 16:38 ` Steven Rostedt
0 siblings, 0 replies; 38+ messages in thread
From: Steven Rostedt @ 2026-01-09 16:38 UTC (permalink / raw)
To: Yao Kai
Cc: Joel Fernandes, paulmck, Boqun Feng, rcu, Frederic Weisbecker,
Neeraj Upadhyay, Josh Triplett, Uladzislau Rezki,
Mathieu Desnoyers, Lai Jiangshan, Zqiang, Shuah Khan,
linux-kernel, linux-kselftest, Tengda Wu, liuyongqiang13,
yujiacheng3
On Mon, 5 Jan 2026 12:16:11 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Sun, 4 Jan 2026 11:20:07 +0800
> Yao Kai <yaokai34@huawei.com> wrote:
>
> > Yes, I tested Steve's patch. It fixes the issue too.
> >
> > Tested-by: Yao Kai <yaokai34@huawei.com>
>
> Thanks for testing. I'll send out a formal patch.
>
> And yes, I agree we should do both.
FYI, the tracing recursion protection made it to mainline:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5f1ef0dfcb5b7f4a91a9b0e0ba533efd9f7e2cdb
-- Steve
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
2026-01-03 0:41 ` Joel Fernandes
2026-01-04 3:20 ` Yao Kai
@ 2026-01-04 10:00 ` Boqun Feng
1 sibling, 0 replies; 38+ messages in thread
From: Boqun Feng @ 2026-01-04 10:00 UTC (permalink / raw)
To: Joel Fernandes
Cc: paulmck, Steven Rostedt, rcu, Frederic Weisbecker,
Neeraj Upadhyay, Josh Triplett, Uladzislau Rezki,
Mathieu Desnoyers, Lai Jiangshan, Zqiang, Shuah Khan,
linux-kernel, linux-kselftest, Yao Kai, Tengda Wu
On Fri, Jan 02, 2026 at 07:41:38PM -0500, Joel Fernandes wrote:
>
>
> On 1/2/2026 2:51 PM, Paul E. McKenney wrote:
> > On Fri, Jan 02, 2026 at 12:30:09PM -0500, Steven Rostedt wrote:
> >> On Fri, 2 Jan 2026 12:28:07 -0500
> >> Steven Rostedt <rostedt@goodmis.org> wrote:
> >>
> >>> Stacktrace should have recursion protection too.
> >>>
> >>> Can you try this patch to see if it would have fixed the problem too?
> >>
> >> As I believe the recursion protection should be in the tracing
> >> infrastructure more than in RCU. As RCU is used as an active participant in
> >> the kernel whereas tracing is supposed to be only an observer.
> >>
> >> If tracing is the culprit, it should be the one that is fixed.
> >
> > Makes sense to me! But then it would... ;-)
> >
> Could we fix it in both? (RCU and tracing). The patch just adds 3 more net lines
> to RCU code. It'd be good to have a guard rail against softirq recursion in RCU
> read unlock path, as much as the existing guard rail we already have with
> irq_work? After all, both paths attempt to do deferred work when it is safer to
> do so.
>
Agreed. First it's crucial that RCU itself can prevent indefinitely
entering rcu_read_unlock_special(), because although unlikely, any RCU
reader in raise_softirq_irqoff() would cause a similar infinite loop.
Second, with solely the tracing fix, there still exists a call chain:
rcu_read_unlock_special():
raise_softirq_irqoff():
trace_softirq_raise():
rcu_read_unlock_special():
raise_softirq_irqoff():
trace_softirq_raise(); // <- recursion ends here
while with the RCU fix added, the call chain ends at the second
rcu_read_unlock_special():
rcu_read_unlock_special():
raise_softirq_irqoff():
trace_softirq_raise():
rcu_read_unlock_special(); // <- recursion ends here
which would slightly improve performance because of the fewer calls.
I'm going to include this in the RCU PR for 7.0 if no one objects.
Thanks!
Regards,
Boqun
> Yao, if you could test Steve's patch and reply whether it fixes it too?
>
> thanks,
>
> - Joel
>
>
>
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
2026-01-01 16:34 ` [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq Joel Fernandes
2026-01-02 17:28 ` Steven Rostedt
@ 2026-01-07 23:14 ` Frederic Weisbecker
2026-01-08 1:02 ` Joel Fernandes
1 sibling, 1 reply; 38+ messages in thread
From: Frederic Weisbecker @ 2026-01-07 23:14 UTC (permalink / raw)
To: Joel Fernandes
Cc: Paul E . McKenney, Boqun Feng, rcu, Neeraj Upadhyay,
Josh Triplett, Uladzislau Rezki, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, Zqiang, Shuah Khan,
linux-kernel, linux-kselftest, Yao Kai, Tengda Wu
Le Thu, Jan 01, 2026 at 11:34:10AM -0500, Joel Fernandes a écrit :
> From: Yao Kai <yaokai34@huawei.com>
>
> Commit 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in
> __rcu_read_unlock()") removed the recursion-protection code from
> __rcu_read_unlock(). As a result, with ftrace enabled, the following
> deadloop can be triggered in raise_softirq_irqoff():
>
> WARNING: CPU: 0 PID: 0 at kernel/trace/trace.c:3021 __ftrace_trace_stack.constprop.0+0x172/0x180
> Modules linked in: my_irq_work(O)
> CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G O 6.18.0-rc7-dirty #23 PREEMPT(full)
> Tainted: [O]=OOT_MODULE
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> RIP: 0010:__ftrace_trace_stack.constprop.0+0x172/0x180
> RSP: 0018:ffffc900000034a8 EFLAGS: 00010002
> RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000000
> RDX: 0000000000000003 RSI: ffffffff826d7b87 RDI: ffffffff826e9329
> RBP: 0000000000090009 R08: 0000000000000005 R09: ffffffff82afbc4c
> R10: 0000000000000008 R11: 0000000000011d7a R12: 0000000000000000
> R13: ffff888003874100 R14: 0000000000000003 R15: ffff8880038c1054
> FS: 0000000000000000(0000) GS:ffff8880fa8ea000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 000055b31fa7f540 CR3: 00000000078f4005 CR4: 0000000000770ef0
> PKRU: 55555554
> Call Trace:
> <IRQ>
> trace_buffer_unlock_commit_regs+0x6d/0x220
> trace_event_buffer_commit+0x5c/0x260
> trace_event_raw_event_softirq+0x47/0x80
> raise_softirq_irqoff+0x6e/0xa0
> rcu_read_unlock_special+0xb1/0x160
> unwind_next_frame+0x203/0x9b0
> __unwind_start+0x15d/0x1c0
> arch_stack_walk+0x62/0xf0
> stack_trace_save+0x48/0x70
> __ftrace_trace_stack.constprop.0+0x144/0x180
> trace_buffer_unlock_commit_regs+0x6d/0x220
> trace_event_buffer_commit+0x5c/0x260
> trace_event_raw_event_softirq+0x47/0x80
> raise_softirq_irqoff+0x6e/0xa0
> rcu_read_unlock_special+0xb1/0x160
> unwind_next_frame+0x203/0x9b0
> __unwind_start+0x15d/0x1c0
> arch_stack_walk+0x62/0xf0
> stack_trace_save+0x48/0x70
> __ftrace_trace_stack.constprop.0+0x144/0x180
> trace_buffer_unlock_commit_regs+0x6d/0x220
> trace_event_buffer_commit+0x5c/0x260
> trace_event_raw_event_softirq+0x47/0x80
> raise_softirq_irqoff+0x6e/0xa0
> rcu_read_unlock_special+0xb1/0x160
> unwind_next_frame+0x203/0x9b0
> __unwind_start+0x15d/0x1c0
> arch_stack_walk+0x62/0xf0
> stack_trace_save+0x48/0x70
> __ftrace_trace_stack.constprop.0+0x144/0x180
> trace_buffer_unlock_commit_regs+0x6d/0x220
> trace_event_buffer_commit+0x5c/0x260
> trace_event_raw_event_softirq+0x47/0x80
> raise_softirq_irqoff+0x6e/0xa0
> rcu_read_unlock_special+0xb1/0x160
> __is_insn_slot_addr+0x54/0x70
> kernel_text_address+0x48/0xc0
> __kernel_text_address+0xd/0x40
> unwind_get_return_address+0x1e/0x40
> arch_stack_walk+0x9c/0xf0
> stack_trace_save+0x48/0x70
> __ftrace_trace_stack.constprop.0+0x144/0x180
> trace_buffer_unlock_commit_regs+0x6d/0x220
> trace_event_buffer_commit+0x5c/0x260
> trace_event_raw_event_softirq+0x47/0x80
> __raise_softirq_irqoff+0x61/0x80
> __flush_smp_call_function_queue+0x115/0x420
> __sysvec_call_function_single+0x17/0xb0
> sysvec_call_function_single+0x8c/0xc0
> </IRQ>
>
> Commit b41642c87716 ("rcu: Fix rcu_read_unlock() deadloop due to IRQ work")
> fixed the infinite loop in rcu_read_unlock_special() for IRQ work by
> setting a flag before calling irq_work_queue_on(). Fix this issue by
> setting the same flag before calling raise_softirq_irqoff(), and rename
> the flag to defer_qs_pending to reflect its more general role.
>
> Fixes: 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in __rcu_read_unlock()")
> Reported-by: Tengda Wu <wutengda2@huawei.com>
> Signed-off-by: Yao Kai <yaokai34@huawei.com>
> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Looks good but, BTW, what happens if rcu_qs() is called
before rcu_preempt_deferred_qs() had a chance to be called?
current->rcu_read_unlock_special.b.need_qs is reset by rcu_qs()
so subsequent calls to rcu_read_unlock() won't issue rcu_read_unlock_special()
(unless the task is blocked). And further calls to rcu_preempt_deferred_qs()
through rcu_core() will be ignored as well.
But rdp->defer_qs_pending will remain in the DEFER_QS_PENDING state until
the next grace period. And if rcu_read_unlock_special() is called again
during the next GP at an unfortunate place needing a deferred QS, the state
machine will spuriously assume that either rcu_core or the irq_work is
pending, when neither is anymore.
The state should be reset by rcu_qs().
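A minimal sketch of the suggested reset might look like the following (untested; it assumes the defer_qs_pending rename from this patch is applied, and the exact placement inside rcu_qs() is a guess):

```diff
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ static void rcu_qs(void)
 {
+	/* A deferred-QS request is moot once this CPU's QS is recorded. */
+	if (READ_ONCE(this_cpu_ptr(&rcu_data)->defer_qs_pending) == DEFER_QS_PENDING)
+		WRITE_ONCE(this_cpu_ptr(&rcu_data)->defer_qs_pending, DEFER_QS_IDLE);
```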
Thanks.
--
Frederic Weisbecker
SUSE Labs
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
2026-01-07 23:14 ` Frederic Weisbecker
@ 2026-01-08 1:02 ` Joel Fernandes
2026-01-08 1:35 ` Joel Fernandes
2026-01-08 15:25 ` Frederic Weisbecker
0 siblings, 2 replies; 38+ messages in thread
From: Joel Fernandes @ 2026-01-08 1:02 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Joel Fernandes, Paul E McKenney, Boqun Feng, rcu, Neeraj Upadhyay,
Josh Triplett, Uladzislau Rezki, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, Zqiang, Shuah Khan,
linux-kernel, linux-kselftest, Kai Yao, Tengda Wu
> On Jan 7, 2026, at 6:15 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
>
> Le Thu, Jan 01, 2026 at 11:34:10AM -0500, Joel Fernandes a écrit :
>> From: Yao Kai <yaokai34@huawei.com>
>>
>> Commit 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in
>> __rcu_read_unlock()") removed the recursion-protection code from
>> __rcu_read_unlock(). As a result, with ftrace enabled, the following
>> deadloop can be triggered in raise_softirq_irqoff():
>>
>> WARNING: CPU: 0 PID: 0 at kernel/trace/trace.c:3021 __ftrace_trace_stack.constprop.0+0x172/0x180
>> Modules linked in: my_irq_work(O)
>> CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G O 6.18.0-rc7-dirty #23 PREEMPT(full)
>> Tainted: [O]=OOT_MODULE
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
>> RIP: 0010:__ftrace_trace_stack.constprop.0+0x172/0x180
>> RSP: 0018:ffffc900000034a8 EFLAGS: 00010002
>> RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000000
>> RDX: 0000000000000003 RSI: ffffffff826d7b87 RDI: ffffffff826e9329
>> RBP: 0000000000090009 R08: 0000000000000005 R09: ffffffff82afbc4c
>> R10: 0000000000000008 R11: 0000000000011d7a R12: 0000000000000000
>> R13: ffff888003874100 R14: 0000000000000003 R15: ffff8880038c1054
>> FS: 0000000000000000(0000) GS:ffff8880fa8ea000(0000) knlGS:0000000000000000
>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 000055b31fa7f540 CR3: 00000000078f4005 CR4: 0000000000770ef0
>> PKRU: 55555554
>> Call Trace:
>> <IRQ>
>> trace_buffer_unlock_commit_regs+0x6d/0x220
>> trace_event_buffer_commit+0x5c/0x260
>> trace_event_raw_event_softirq+0x47/0x80
>> raise_softirq_irqoff+0x6e/0xa0
>> rcu_read_unlock_special+0xb1/0x160
>> unwind_next_frame+0x203/0x9b0
>> __unwind_start+0x15d/0x1c0
>> arch_stack_walk+0x62/0xf0
>> stack_trace_save+0x48/0x70
>> __ftrace_trace_stack.constprop.0+0x144/0x180
>> trace_buffer_unlock_commit_regs+0x6d/0x220
>> trace_event_buffer_commit+0x5c/0x260
>> trace_event_raw_event_softirq+0x47/0x80
>> raise_softirq_irqoff+0x6e/0xa0
>> rcu_read_unlock_special+0xb1/0x160
>> unwind_next_frame+0x203/0x9b0
>> __unwind_start+0x15d/0x1c0
>> arch_stack_walk+0x62/0xf0
>> stack_trace_save+0x48/0x70
>> __ftrace_trace_stack.constprop.0+0x144/0x180
>> trace_buffer_unlock_commit_regs+0x6d/0x220
>> trace_event_buffer_commit+0x5c/0x260
>> trace_event_raw_event_softirq+0x47/0x80
>> raise_softirq_irqoff+0x6e/0xa0
>> rcu_read_unlock_special+0xb1/0x160
>> unwind_next_frame+0x203/0x9b0
>> __unwind_start+0x15d/0x1c0
>> arch_stack_walk+0x62/0xf0
>> stack_trace_save+0x48/0x70
>> __ftrace_trace_stack.constprop.0+0x144/0x180
>> trace_buffer_unlock_commit_regs+0x6d/0x220
>> trace_event_buffer_commit+0x5c/0x260
>> trace_event_raw_event_softirq+0x47/0x80
>> raise_softirq_irqoff+0x6e/0xa0
>> rcu_read_unlock_special+0xb1/0x160
>> __is_insn_slot_addr+0x54/0x70
>> kernel_text_address+0x48/0xc0
>> __kernel_text_address+0xd/0x40
>> unwind_get_return_address+0x1e/0x40
>> arch_stack_walk+0x9c/0xf0
>> stack_trace_save+0x48/0x70
>> __ftrace_trace_stack.constprop.0+0x144/0x180
>> trace_buffer_unlock_commit_regs+0x6d/0x220
>> trace_event_buffer_commit+0x5c/0x260
>> trace_event_raw_event_softirq+0x47/0x80
>> __raise_softirq_irqoff+0x61/0x80
>> __flush_smp_call_function_queue+0x115/0x420
>> __sysvec_call_function_single+0x17/0xb0
>> sysvec_call_function_single+0x8c/0xc0
>> </IRQ>
>>
>> Commit b41642c87716 ("rcu: Fix rcu_read_unlock() deadloop due to IRQ work")
>> fixed the infinite loop in rcu_read_unlock_special() for IRQ work by
>> setting a flag before calling irq_work_queue_on(). Fix this issue by
>> setting the same flag before calling raise_softirq_irqoff(), and rename
>> the flag to defer_qs_pending to reflect its more general role.
>>
>> Fixes: 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in __rcu_read_unlock()")
>> Reported-by: Tengda Wu <wutengda2@huawei.com>
>> Signed-off-by: Yao Kai <yaokai34@huawei.com>
>> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
>> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
>
> Looks good but, BTW, what happens if rcu_qs() is called
> before rcu_preempt_deferred_qs() had a chance to be called?
Could you provide an example of when that can happen?
If rcu_qs() results in reporting of a quiescent state up the node tree before the deferred reporting work had a chance to act, then indeed we should be clearing the flag (after canceling the pending raise_softirq_irqoff()).
As far as I can see, even if that were to happen, which I think you are right it can happen, we will still go through the path to report deferred quiescent states and cancel the pending work (reset the flag).
> current->rcu_read_unlock_special.b.need_qs is reset by rcu_qs()
> so subsequent calls to rcu_read_unlock() won't issue rcu_read_unlock_special()
> (unless the task is blocked). And further calls to rcu_preempt_deferred_qs()
> through rcu_core() will be ignored as well.
I am not sure this implies that the deferred quiescent state gets cancelled just because we have already called unlock once. We still have to go through the deferred quiescent state path on all subsequent quiescent-state reporting, even if need_qs is reset. How else would the GP complete?
>
> But rdp->defer_qs_pending will remain in the DEFER_QS_PENDING state until
> the next grace period. And if rcu_read_unlock_special() is called again
> during the next GP at an unfortunate place needing a deferred qs, the state machine
> will spuriously assume that either rcu_core or the irq_work are pending, when
> none are anymore.
>
> The state should be reset by rcu_qs().
In fact, I would say that if a deferred QS is pending, we should absolutely not reset its state from rcu_qs().
Maybe we should reset it from rcu_report_qs_rdp/rnp?
Unfortunately, all of this is coming from me being on a phone and not at a computer, so I will revise my response, but probably tomorrow, because today the human body is not cooperating.
thanks,
- Joel
>
> Thanks.
>
> --
> Frederic Weisbecker
> SUSE Labs
>
* Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
2026-01-08 1:02 ` Joel Fernandes
@ 2026-01-08 1:35 ` Joel Fernandes
2026-01-08 3:35 ` Joel Fernandes
2026-01-08 15:25 ` Frederic Weisbecker
1 sibling, 1 reply; 38+ messages in thread
From: Joel Fernandes @ 2026-01-08 1:35 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Joel Fernandes, Paul E McKenney, Boqun Feng, rcu, Neeraj Upadhyay,
Josh Triplett, Uladzislau Rezki, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, Zqiang, Shuah Khan,
linux-kernel, linux-kselftest, Kai Yao, Tengda Wu
> On Jan 7, 2026, at 8:02 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
>
>
>
>> On Jan 7, 2026, at 6:15 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
>>
>> Le Thu, Jan 01, 2026 at 11:34:10AM -0500, Joel Fernandes a écrit :
>>> From: Yao Kai <yaokai34@huawei.com>
>>>
>>> Commit 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in
>>> __rcu_read_unlock()") removed the recursion-protection code from
>>> __rcu_read_unlock(). As a result, we can end up in an endless loop through
>>> raise_softirq_irqoff() when ftrace is enabled, as follows:
>>>
>>> WARNING: CPU: 0 PID: 0 at kernel/trace/trace.c:3021 __ftrace_trace_stack.constprop.0+0x172/0x180
>>> Modules linked in: my_irq_work(O)
>>> CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G O 6.18.0-rc7-dirty #23 PREEMPT(full)
>>> Tainted: [O]=OOT_MODULE
>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
>>> RIP: 0010:__ftrace_trace_stack.constprop.0+0x172/0x180
>>> RSP: 0018:ffffc900000034a8 EFLAGS: 00010002
>>> RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000000
>>> RDX: 0000000000000003 RSI: ffffffff826d7b87 RDI: ffffffff826e9329
>>> RBP: 0000000000090009 R08: 0000000000000005 R09: ffffffff82afbc4c
>>> R10: 0000000000000008 R11: 0000000000011d7a R12: 0000000000000000
>>> R13: ffff888003874100 R14: 0000000000000003 R15: ffff8880038c1054
>>> FS: 0000000000000000(0000) GS:ffff8880fa8ea000(0000) knlGS:0000000000000000
>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 000055b31fa7f540 CR3: 00000000078f4005 CR4: 0000000000770ef0
>>> PKRU: 55555554
>>> Call Trace:
>>> <IRQ>
>>> trace_buffer_unlock_commit_regs+0x6d/0x220
>>> trace_event_buffer_commit+0x5c/0x260
>>> trace_event_raw_event_softirq+0x47/0x80
>>> raise_softirq_irqoff+0x6e/0xa0
>>> rcu_read_unlock_special+0xb1/0x160
>>> unwind_next_frame+0x203/0x9b0
>>> __unwind_start+0x15d/0x1c0
>>> arch_stack_walk+0x62/0xf0
>>> stack_trace_save+0x48/0x70
>>> __ftrace_trace_stack.constprop.0+0x144/0x180
>>> trace_buffer_unlock_commit_regs+0x6d/0x220
>>> trace_event_buffer_commit+0x5c/0x260
>>> trace_event_raw_event_softirq+0x47/0x80
>>> raise_softirq_irqoff+0x6e/0xa0
>>> rcu_read_unlock_special+0xb1/0x160
>>> unwind_next_frame+0x203/0x9b0
>>> __unwind_start+0x15d/0x1c0
>>> arch_stack_walk+0x62/0xf0
>>> stack_trace_save+0x48/0x70
>>> __ftrace_trace_stack.constprop.0+0x144/0x180
>>> trace_buffer_unlock_commit_regs+0x6d/0x220
>>> trace_event_buffer_commit+0x5c/0x260
>>> trace_event_raw_event_softirq+0x47/0x80
>>> raise_softirq_irqoff+0x6e/0xa0
>>> rcu_read_unlock_special+0xb1/0x160
>>> unwind_next_frame+0x203/0x9b0
>>> __unwind_start+0x15d/0x1c0
>>> arch_stack_walk+0x62/0xf0
>>> stack_trace_save+0x48/0x70
>>> __ftrace_trace_stack.constprop.0+0x144/0x180
>>> trace_buffer_unlock_commit_regs+0x6d/0x220
>>> trace_event_buffer_commit+0x5c/0x260
>>> trace_event_raw_event_softirq+0x47/0x80
>>> raise_softirq_irqoff+0x6e/0xa0
>>> rcu_read_unlock_special+0xb1/0x160
>>> __is_insn_slot_addr+0x54/0x70
>>> kernel_text_address+0x48/0xc0
>>> __kernel_text_address+0xd/0x40
>>> unwind_get_return_address+0x1e/0x40
>>> arch_stack_walk+0x9c/0xf0
>>> stack_trace_save+0x48/0x70
>>> __ftrace_trace_stack.constprop.0+0x144/0x180
>>> trace_buffer_unlock_commit_regs+0x6d/0x220
>>> trace_event_buffer_commit+0x5c/0x260
>>> trace_event_raw_event_softirq+0x47/0x80
>>> __raise_softirq_irqoff+0x61/0x80
>>> __flush_smp_call_function_queue+0x115/0x420
>>> __sysvec_call_function_single+0x17/0xb0
>>> sysvec_call_function_single+0x8c/0xc0
>>> </IRQ>
>>>
>>> Commit b41642c87716 ("rcu: Fix rcu_read_unlock() deadloop due to IRQ work")
>>> fixed the infinite loop in rcu_read_unlock_special() for IRQ work by
>>> setting a flag before calling irq_work_queue_on(). We fix this issue by
>>> setting the same flag before calling raise_softirq_irqoff() and renaming the
>>> flag to defer_qs_pending to make it more generic.
>>>
>>> Fixes: 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in __rcu_read_unlock()")
>>> Reported-by: Tengda Wu <wutengda2@huawei.com>
>>> Signed-off-by: Yao Kai <yaokai34@huawei.com>
>>> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
>>> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
>>
>> Looks good but, BTW, what happens if rcu_qs() is called
>> before rcu_preempt_deferred_qs() had a chance to be called?
>
> Could you provide an example of when that can happen?
>
> If rcu_qs() results in reporting of a quiescent state up the node tree before the deferred reporting work had a chance to act, then indeed we should be clearing the flag (after canceling the pending raise_softirq_irqoff()).
>
> As far as I can see, even if that were to happen (and I think you are right that it can), we will still go through the path to report deferred quiescent states and cancel the pending work (reset the flag).
>
>> current->rcu_read_unlock_special.b.need_qs is reset by rcu_qs()
>> so subsequent calls to rcu_read_unlock() won't issue rcu_read_unlock_special()
>> (unless the task is blocked). And further calls to rcu_preempt_deferred_qs()
>> through rcu_core() will be ignored as well.
>
> I am not sure this implies that the deferred quiescent state gets cancelled just because we have already called unlock once. We still have to go through the deferred quiescent state path on all subsequent quiescent state reporting, even if need_qs is reset. How else would the GP complete?
>>
>> But rdp->defer_qs_pending will remain in the DEFER_QS_PENDING state until
>> the next grace period. And if rcu_read_unlock_special() is called again
>> during the next GP at an unfortunate place needing a deferred qs, the state machine
>> will spuriously assume that either rcu_core or the irq_work are pending, when
>> none are anymore.
>>
>> The state should be reset by rcu_qs().
>
> In fact, I would say that if a deferred QS is pending, we should absolutely not reset its state from rcu_qs().
>
> Maybe we should reset it from rcu_report_qs_rdp/rnp?
>
> Unfortunately, all of this is coming from me being on a phone and not at a computer, so I will revise my response, but probably tomorrow, because today the human body is not cooperating.
>
> thanks,
>
> - Joel
>
>
By the way, when I last tried to do it from rcu_qs(), it did not fix the original bug with the IRQ work recursion.
I found that it always reset the flag. But rcu_qs() is probably not even the right place to do it in the first place.
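For reference, the guard pattern under discussion can be modeled in userspace. The names (defer_qs_pending, DEFER_QS_PENDING) follow the patch being reviewed, but all functions below are stubs, so this is only a sketch of why setting the flag before raising the softirq breaks the ftrace-induced recursion shown in the call trace:

```c
#include <assert.h>

/*
 * Userspace model of the deferred-QS guard.  Stubbed, illustrative
 * code only -- not the kernel implementation.
 */
enum defer_qs_state { DEFER_QS_IDLE, DEFER_QS_PENDING };

static enum defer_qs_state defer_qs_pending = DEFER_QS_IDLE;
static int raise_count;

static void rcu_read_unlock_special_model(void);

/*
 * In the buggy scenario, tracing the softirq raise walks the stack,
 * which takes rcu_read_lock()/rcu_read_unlock() and re-enters the
 * unlock slowpath -- modeled here as a direct recursive call.
 */
static void raise_softirq_irqoff_model(void)
{
	raise_count++;
	rcu_read_unlock_special_model();
}

static void rcu_read_unlock_special_model(void)
{
	/* Guard: deferred-QS work is already queued, nothing to do. */
	if (defer_qs_pending == DEFER_QS_PENDING)
		return;
	/* Set the flag BEFORE raising, so the recursive entry bails out. */
	defer_qs_pending = DEFER_QS_PENDING;
	raise_softirq_irqoff_model();
}
```

With the flag set first, the re-entrant call returns immediately and the softirq is raised exactly once; without it, the model would recurse without bound, which is exactly the reported deadloop.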
Thanks,
- Joel
>
> - Joel
>
>
>>
>> Thanks.
>>
>> --
>> Frederic Weisbecker
>> SUSE Labs
>>
* Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
2026-01-08 1:35 ` Joel Fernandes
@ 2026-01-08 3:35 ` Joel Fernandes
2026-01-08 15:39 ` Frederic Weisbecker
0 siblings, 1 reply; 38+ messages in thread
From: Joel Fernandes @ 2026-01-08 3:35 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Joel Fernandes, Paul E McKenney, Boqun Feng, rcu, Neeraj Upadhyay,
Josh Triplett, Uladzislau Rezki, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, Zqiang, Shuah Khan,
linux-kernel, linux-kselftest, Kai Yao, Tengda Wu
> On Jan 7, 2026, at 8:35 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
>
>
>
>> On Jan 7, 2026, at 8:02 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
>>
>>
>>
>>>> On Jan 7, 2026, at 6:15 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
>>>
>>> Le Thu, Jan 01, 2026 at 11:34:10AM -0500, Joel Fernandes a écrit :
>>>> From: Yao Kai <yaokai34@huawei.com>
>>>>
>>>> Commit 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in
>>>> __rcu_read_unlock()") removed the recursion-protection code from
>>>> __rcu_read_unlock(). As a result, we can end up in an endless loop through
>>>> raise_softirq_irqoff() when ftrace is enabled, as follows:
>>>>
>>>> WARNING: CPU: 0 PID: 0 at kernel/trace/trace.c:3021 __ftrace_trace_stack.constprop.0+0x172/0x180
>>>> Modules linked in: my_irq_work(O)
>>>> CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G O 6.18.0-rc7-dirty #23 PREEMPT(full)
>>>> Tainted: [O]=OOT_MODULE
>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
>>>> RIP: 0010:__ftrace_trace_stack.constprop.0+0x172/0x180
>>>> RSP: 0018:ffffc900000034a8 EFLAGS: 00010002
>>>> RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000000
>>>> RDX: 0000000000000003 RSI: ffffffff826d7b87 RDI: ffffffff826e9329
>>>> RBP: 0000000000090009 R08: 0000000000000005 R09: ffffffff82afbc4c
>>>> R10: 0000000000000008 R11: 0000000000011d7a R12: 0000000000000000
>>>> R13: ffff888003874100 R14: 0000000000000003 R15: ffff8880038c1054
>>>> FS: 0000000000000000(0000) GS:ffff8880fa8ea000(0000) knlGS:0000000000000000
>>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> CR2: 000055b31fa7f540 CR3: 00000000078f4005 CR4: 0000000000770ef0
>>>> PKRU: 55555554
>>>> Call Trace:
>>>> <IRQ>
>>>> trace_buffer_unlock_commit_regs+0x6d/0x220
>>>> trace_event_buffer_commit+0x5c/0x260
>>>> trace_event_raw_event_softirq+0x47/0x80
>>>> raise_softirq_irqoff+0x6e/0xa0
>>>> rcu_read_unlock_special+0xb1/0x160
>>>> unwind_next_frame+0x203/0x9b0
>>>> __unwind_start+0x15d/0x1c0
>>>> arch_stack_walk+0x62/0xf0
>>>> stack_trace_save+0x48/0x70
>>>> __ftrace_trace_stack.constprop.0+0x144/0x180
>>>> trace_buffer_unlock_commit_regs+0x6d/0x220
>>>> trace_event_buffer_commit+0x5c/0x260
>>>> trace_event_raw_event_softirq+0x47/0x80
>>>> raise_softirq_irqoff+0x6e/0xa0
>>>> rcu_read_unlock_special+0xb1/0x160
>>>> unwind_next_frame+0x203/0x9b0
>>>> __unwind_start+0x15d/0x1c0
>>>> arch_stack_walk+0x62/0xf0
>>>> stack_trace_save+0x48/0x70
>>>> __ftrace_trace_stack.constprop.0+0x144/0x180
>>>> trace_buffer_unlock_commit_regs+0x6d/0x220
>>>> trace_event_buffer_commit+0x5c/0x260
>>>> trace_event_raw_event_softirq+0x47/0x80
>>>> raise_softirq_irqoff+0x6e/0xa0
>>>> rcu_read_unlock_special+0xb1/0x160
>>>> unwind_next_frame+0x203/0x9b0
>>>> __unwind_start+0x15d/0x1c0
>>>> arch_stack_walk+0x62/0xf0
>>>> stack_trace_save+0x48/0x70
>>>> __ftrace_trace_stack.constprop.0+0x144/0x180
>>>> trace_buffer_unlock_commit_regs+0x6d/0x220
>>>> trace_event_buffer_commit+0x5c/0x260
>>>> trace_event_raw_event_softirq+0x47/0x80
>>>> raise_softirq_irqoff+0x6e/0xa0
>>>> rcu_read_unlock_special+0xb1/0x160
>>>> __is_insn_slot_addr+0x54/0x70
>>>> kernel_text_address+0x48/0xc0
>>>> __kernel_text_address+0xd/0x40
>>>> unwind_get_return_address+0x1e/0x40
>>>> arch_stack_walk+0x9c/0xf0
>>>> stack_trace_save+0x48/0x70
>>>> __ftrace_trace_stack.constprop.0+0x144/0x180
>>>> trace_buffer_unlock_commit_regs+0x6d/0x220
>>>> trace_event_buffer_commit+0x5c/0x260
>>>> trace_event_raw_event_softirq+0x47/0x80
>>>> __raise_softirq_irqoff+0x61/0x80
>>>> __flush_smp_call_function_queue+0x115/0x420
>>>> __sysvec_call_function_single+0x17/0xb0
>>>> sysvec_call_function_single+0x8c/0xc0
>>>> </IRQ>
>>>>
>>>> Commit b41642c87716 ("rcu: Fix rcu_read_unlock() deadloop due to IRQ work")
>>>> fixed the infinite loop in rcu_read_unlock_special() for IRQ work by
>>>> setting a flag before calling irq_work_queue_on(). We fix this issue by
>>>> setting the same flag before calling raise_softirq_irqoff() and renaming the
>>>> flag to defer_qs_pending to make it more generic.
>>>>
>>>> Fixes: 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in __rcu_read_unlock()")
>>>> Reported-by: Tengda Wu <wutengda2@huawei.com>
>>>> Signed-off-by: Yao Kai <yaokai34@huawei.com>
>>>> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
>>>> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
>>>
>>> Looks good but, BTW, what happens if rcu_qs() is called
>>> before rcu_preempt_deferred_qs() had a chance to be called?
>>
>> Could you provide an example of when that can happen?
>>
>> If rcu_qs() results in reporting of a quiescent state up the node tree before the deferred reporting work had a chance to act, then indeed we should be clearing the flag (after canceling the pending raise_softirq_irqoff()).
>>
>> As far as I can see, even if that were to happen (and I think you are right that it can), we will still go through the path to report deferred quiescent states and cancel the pending work (reset the flag).
>>
>>> current->rcu_read_unlock_special.b.need_qs is reset by rcu_qs()
>>> so subsequent calls to rcu_read_unlock() won't issue rcu_read_unlock_special()
>>> (unless the task is blocked). And further calls to rcu_preempt_deferred_qs()
>>> through rcu_core() will be ignored as well.
>>
>> I am not sure this implies that the deferred quiescent state gets cancelled just because we have already called unlock once. We still have to go through the deferred quiescent state path on all subsequent quiescent state reporting, even if need_qs is reset. How else would the GP complete?
>>>
>>> But rdp->defer_qs_pending will remain in the DEFER_QS_PENDING state until
>>> the next grace period. And if rcu_read_unlock_special() is called again
>>> during the next GP at an unfortunate place needing a deferred qs, the state machine
>>> will spuriously assume that either rcu_core or the irq_work are pending, when
>>> none are anymore.
>>>
>>> The state should be reset by rcu_qs().
>>
>> In fact, I would say that if a deferred QS is pending, we should absolutely not reset its state from rcu_qs().
>>
>> Maybe we should reset it from rcu_report_qs_rdp/rnp?
>>
>> Unfortunately, all of this is coming from me being on a phone and not at a computer, so I will revise my response, but probably tomorrow, because today the human body is not cooperating.
>>
>> thanks,
>>
>> - Joel
>>
>>
>
>
> By the way, when I last tried to do it from rcu_qs(), it did not fix the original bug with the IRQ work recursion.
>
> I found that it always reset the flag. But rcu_qs() is probably not even the right place to do it in the first place.
I think we need to reset the flag in rcu_report_exp_rdp() as well, if exp_hint is set and we reported an exp qs.
I am working on a series to cover all cases and will send an RFC soon. However, the patch we are
reviewing can go in for this merge window, and the rest (further improvements) can follow in the next merge window, if that sounds good.
Thanks!
- Joel
>
> Thanks,
>
> - Joel
>
>
>
>
>
>
>
>
>
>
>>
>> - Joel
>>
>>
>>>
>>> Thanks.
>>>
>>> --
>>> Frederic Weisbecker
>>> SUSE Labs
>>>
* Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
2026-01-08 3:35 ` Joel Fernandes
@ 2026-01-08 15:39 ` Frederic Weisbecker
2026-01-08 15:57 ` Mathieu Desnoyers
0 siblings, 1 reply; 38+ messages in thread
From: Frederic Weisbecker @ 2026-01-08 15:39 UTC (permalink / raw)
To: Joel Fernandes
Cc: Joel Fernandes, Paul E McKenney, Boqun Feng, rcu, Neeraj Upadhyay,
Josh Triplett, Uladzislau Rezki, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, Zqiang, Shuah Khan,
linux-kernel, linux-kselftest, Kai Yao, Tengda Wu
Le Wed, Jan 07, 2026 at 10:35:44PM -0500, Joel Fernandes a écrit :
> >
> > By the way, when I last tried to do it from rcu_qs, it was not fixing the original bug with the IRQ work recursion.
> >
> > I found that it was always resetting the flag. But probably it is not even the right place to do it in the first place.
>
> I think we need to reset the flag in rcu_report_exp_rdp() as well if exp_hint
> is set and we reported exp qs.
To avoid needlessly reaching the rcu_read_unlock() slowpath whenever the exp QS has
already been reported, yes indeed.
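A minimal model of the reset being agreed on here: clearing the pending-defer state once the expedited QS has been reported, so a later deferral can queue fresh work instead of being spuriously skipped. Field and function names (defer_qs_pending, exp_hint, rcu_report_exp_rdp) follow the thread; the logic is a sketch, not the kernel implementation:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative state only -- not the kernel's actual fields. */
static bool defer_qs_pending;
static int queued;

/* Deferred-QS request: queues work unless the state machine
 * believes work is already pending. */
static void request_deferred_qs(void)
{
	if (defer_qs_pending)
		return;
	defer_qs_pending = true;
	queued++;
}

/* Proposed: reporting the expedited QS also clears the deferral
 * state, so the stale DEFER_QS_PENDING cannot leak into the next GP. */
static void rcu_report_exp_rdp_model(void)
{
	defer_qs_pending = false;
}
```

Without the reset, the second request in the sequence below would be dropped on stale state; with it, each grace period can queue its own deferred work.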
> I am working on a series to cover all cases and will send RFC soon. However this patch we are
> reviewing can go in for this merge window and the rest I am preparing (for
> further improvement) for the next merge window, if that sounds good.
Ok.
--
Frederic Weisbecker
SUSE Labs
* Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
2026-01-08 15:39 ` Frederic Weisbecker
@ 2026-01-08 15:57 ` Mathieu Desnoyers
0 siblings, 0 replies; 38+ messages in thread
From: Mathieu Desnoyers @ 2026-01-08 15:57 UTC (permalink / raw)
To: Frederic Weisbecker, Joel Fernandes
Cc: Joel Fernandes, Paul E McKenney, Boqun Feng, rcu, Neeraj Upadhyay,
Josh Triplett, Uladzislau Rezki, Steven Rostedt, Lai Jiangshan,
Zqiang, Shuah Khan, linux-kernel, linux-kselftest, Kai Yao,
Tengda Wu
On 2026-01-08 10:39, Frederic Weisbecker wrote:
> Le Wed, Jan 07, 2026 at 10:35:44PM -0500, Joel Fernandes a écrit :
>>>
>>> By the way, when I last tried to do it from rcu_qs, it was not fixing the original bug with the IRQ work recursion.
>>>
>>> I found that it was always resetting the flag. But probably it is not even the right place to do it in the first place.
>>
>> I think we need to reset the flag in rcu_report_exp_rdp() as well if exp_hint
>> is set and we reported exp qs.
>
> To avoid needlessly reaching the rcu_read_unlock() slowpath whenever the exp QS has
> already been reported, yes indeed.
This seems related to:
https://lore.kernel.org/lkml/6c96dbb5-bffc-423f-bb6a-3072abb5f711@efficios.com/
Is it the same issue?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
2026-01-08 1:02 ` Joel Fernandes
2026-01-08 1:35 ` Joel Fernandes
@ 2026-01-08 15:25 ` Frederic Weisbecker
2026-01-09 1:12 ` Joel Fernandes
1 sibling, 1 reply; 38+ messages in thread
From: Frederic Weisbecker @ 2026-01-08 15:25 UTC (permalink / raw)
To: Joel Fernandes
Cc: Joel Fernandes, Paul E McKenney, Boqun Feng, rcu, Neeraj Upadhyay,
Josh Triplett, Uladzislau Rezki, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, Zqiang, Shuah Khan,
linux-kernel, linux-kselftest, Kai Yao, Tengda Wu
Le Wed, Jan 07, 2026 at 08:02:43PM -0500, Joel Fernandes a écrit :
>
>
> > On Jan 7, 2026, at 6:15 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
> >
> > Le Thu, Jan 01, 2026 at 11:34:10AM -0500, Joel Fernandes a écrit :
> >> From: Yao Kai <yaokai34@huawei.com>
> >>
> >> Commit 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in
> >> __rcu_read_unlock()") removed the recursion-protection code from
> >> __rcu_read_unlock(). As a result, we can end up in an endless loop through
> >> raise_softirq_irqoff() when ftrace is enabled, as follows:
> >>
> >> WARNING: CPU: 0 PID: 0 at kernel/trace/trace.c:3021 __ftrace_trace_stack.constprop.0+0x172/0x180
> >> Modules linked in: my_irq_work(O)
> >> CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G O 6.18.0-rc7-dirty #23 PREEMPT(full)
> >> Tainted: [O]=OOT_MODULE
> >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> >> RIP: 0010:__ftrace_trace_stack.constprop.0+0x172/0x180
> >> RSP: 0018:ffffc900000034a8 EFLAGS: 00010002
> >> RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000000
> >> RDX: 0000000000000003 RSI: ffffffff826d7b87 RDI: ffffffff826e9329
> >> RBP: 0000000000090009 R08: 0000000000000005 R09: ffffffff82afbc4c
> >> R10: 0000000000000008 R11: 0000000000011d7a R12: 0000000000000000
> >> R13: ffff888003874100 R14: 0000000000000003 R15: ffff8880038c1054
> >> FS: 0000000000000000(0000) GS:ffff8880fa8ea000(0000) knlGS:0000000000000000
> >> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> CR2: 000055b31fa7f540 CR3: 00000000078f4005 CR4: 0000000000770ef0
> >> PKRU: 55555554
> >> Call Trace:
> >> <IRQ>
> >> trace_buffer_unlock_commit_regs+0x6d/0x220
> >> trace_event_buffer_commit+0x5c/0x260
> >> trace_event_raw_event_softirq+0x47/0x80
> >> raise_softirq_irqoff+0x6e/0xa0
> >> rcu_read_unlock_special+0xb1/0x160
> >> unwind_next_frame+0x203/0x9b0
> >> __unwind_start+0x15d/0x1c0
> >> arch_stack_walk+0x62/0xf0
> >> stack_trace_save+0x48/0x70
> >> __ftrace_trace_stack.constprop.0+0x144/0x180
> >> trace_buffer_unlock_commit_regs+0x6d/0x220
> >> trace_event_buffer_commit+0x5c/0x260
> >> trace_event_raw_event_softirq+0x47/0x80
> >> raise_softirq_irqoff+0x6e/0xa0
> >> rcu_read_unlock_special+0xb1/0x160
> >> unwind_next_frame+0x203/0x9b0
> >> __unwind_start+0x15d/0x1c0
> >> arch_stack_walk+0x62/0xf0
> >> stack_trace_save+0x48/0x70
> >> __ftrace_trace_stack.constprop.0+0x144/0x180
> >> trace_buffer_unlock_commit_regs+0x6d/0x220
> >> trace_event_buffer_commit+0x5c/0x260
> >> trace_event_raw_event_softirq+0x47/0x80
> >> raise_softirq_irqoff+0x6e/0xa0
> >> rcu_read_unlock_special+0xb1/0x160
> >> unwind_next_frame+0x203/0x9b0
> >> __unwind_start+0x15d/0x1c0
> >> arch_stack_walk+0x62/0xf0
> >> stack_trace_save+0x48/0x70
> >> __ftrace_trace_stack.constprop.0+0x144/0x180
> >> trace_buffer_unlock_commit_regs+0x6d/0x220
> >> trace_event_buffer_commit+0x5c/0x260
> >> trace_event_raw_event_softirq+0x47/0x80
> >> raise_softirq_irqoff+0x6e/0xa0
> >> rcu_read_unlock_special+0xb1/0x160
> >> __is_insn_slot_addr+0x54/0x70
> >> kernel_text_address+0x48/0xc0
> >> __kernel_text_address+0xd/0x40
> >> unwind_get_return_address+0x1e/0x40
> >> arch_stack_walk+0x9c/0xf0
> >> stack_trace_save+0x48/0x70
> >> __ftrace_trace_stack.constprop.0+0x144/0x180
> >> trace_buffer_unlock_commit_regs+0x6d/0x220
> >> trace_event_buffer_commit+0x5c/0x260
> >> trace_event_raw_event_softirq+0x47/0x80
> >> __raise_softirq_irqoff+0x61/0x80
> >> __flush_smp_call_function_queue+0x115/0x420
> >> __sysvec_call_function_single+0x17/0xb0
> >> sysvec_call_function_single+0x8c/0xc0
> >> </IRQ>
> >>
> >> Commit b41642c87716 ("rcu: Fix rcu_read_unlock() deadloop due to IRQ work")
> >> fixed the infinite loop in rcu_read_unlock_special() for IRQ work by
> >> setting a flag before calling irq_work_queue_on(). We fix this issue by
> >> setting the same flag before calling raise_softirq_irqoff() and renaming the
> >> flag to defer_qs_pending to make it more generic.
> >>
> >> Fixes: 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in __rcu_read_unlock()")
> >> Reported-by: Tengda Wu <wutengda2@huawei.com>
> >> Signed-off-by: Yao Kai <yaokai34@huawei.com>
> >> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
> >> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> >
> > Looks good but, BTW, what happens if rcu_qs() is called
> > before rcu_preempt_deferred_qs() had a chance to be called?
>
> Could you provide an example of when that can happen?
It can happen because rcu_qs() is called before rcu_preempt_deferred_qs()
in rcu_softirq_qs(). Inverting the calls could help but IRQs must be disabled
to ensure there is no read side between rcu_preempt_deferred_qs() and rcu_qs().
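The inversion described above can be sketched with stubs that just record call order: report any deferred QS first, then the regular QS, inside a single IRQ-disabled section so no read-side critical section can slip in between. This is a model of the proposal, not kernel code:

```c
#include <assert.h>

/* Call-order recorder for the two QS reports. */
static char order[8];
static int n;

static void local_irq_save_model(void)    { /* IRQs off: nothing can intervene */ }
static void local_irq_restore_model(void) { /* IRQs back on */ }
static void rcu_preempt_deferred_qs_model(void) { order[n++] = 'd'; }
static void rcu_qs_model(void)                  { order[n++] = 'q'; }

/* Proposed ordering for rcu_softirq_qs(): deferred QS before rcu_qs(),
 * both under the same IRQ-disabled section. */
static void rcu_softirq_qs_model(void)
{
	local_irq_save_model();
	rcu_preempt_deferred_qs_model();	/* deferred QS reported first ... */
	rcu_qs_model();				/* ... then the regular QS */
	local_irq_restore_model();
}
```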
I'm not aware of other ways to trigger that, except perhaps this:
https://lore.kernel.org/rcu/20251230004124.438070-1-joelagnelf@nvidia.com/T/#u
Either we fix those sites and make sure that rcu_preempt_deferred_qs() is always
called before rcu_qs() within the same IRQ-disabled section (or other fields
are set in ->rcu_read_unlock_special for later clearing). If we do that, we
must WARN_ON_ONCE(rdp->defer_qs_pending == DEFER_QS_PENDING) in rcu_qs().
Or we reset rdp->defer_qs_pending from rcu_qs(), which sounds more robust.
Ah, an alternative is to make rdp::defer_qs_pending a field in union rcu_special,
which, sadly, would then need to be expanded to a u64.
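To illustrate why the expansion would be needed: the current union rcu_special packs four u8 fields (blocked, need_qs, exp_hint, need_mb) into a u32, so adding a deferral field forces the whole-word view to grow. The layout below is purely illustrative, not a proposal for the actual kernel structure:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical expanded rcu_special carrying the deferred-QS state. */
union rcu_special_model {
	struct {
		uint8_t blocked;
		uint8_t need_qs;
		uint8_t exp_hint;
		uint8_t need_mb;
		uint8_t defer_qs_pending;	/* new: deferral state */
		uint8_t pad[3];
	} b;
	uint64_t s;	/* single-load snapshot / single-store clear */
};
```

The appeal of the union form is that a single store to `.s` clears every per-task flag, deferral state included, atomically with respect to the other fields.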
Thanks.
--
Frederic Weisbecker
SUSE Labs
* Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
2026-01-08 15:25 ` Frederic Weisbecker
@ 2026-01-09 1:12 ` Joel Fernandes
2026-01-09 14:23 ` Frederic Weisbecker
0 siblings, 1 reply; 38+ messages in thread
From: Joel Fernandes @ 2026-01-09 1:12 UTC (permalink / raw)
To: Frederic Weisbecker, Joel Fernandes
Cc: Paul E McKenney, Boqun Feng, rcu, Neeraj Upadhyay, Josh Triplett,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Shuah Khan, linux-kernel, linux-kselftest,
Kai Yao, Tengda Wu
Hi Frederic,
On 1/8/2026 10:25 AM, Frederic Weisbecker wrote:
> Le Wed, Jan 07, 2026 at 08:02:43PM -0500, Joel Fernandes a écrit :
>>
>>
>>> On Jan 7, 2026, at 6:15 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
>>>
>>> Le Thu, Jan 01, 2026 at 11:34:10AM -0500, Joel Fernandes a écrit :
>>>> From: Yao Kai <yaokai34@huawei.com>
>>>>
>>>> Commit 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in
>>>> __rcu_read_unlock()") removed the recursion-protection code from
>>>> __rcu_read_unlock(). As a result, we can end up in an endless loop through
>>>> raise_softirq_irqoff() when ftrace is enabled, as follows:
>>>>
>>>> WARNING: CPU: 0 PID: 0 at kernel/trace/trace.c:3021 __ftrace_trace_stack.constprop.0+0x172/0x180
>>>> Modules linked in: my_irq_work(O)
>>>> CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G O 6.18.0-rc7-dirty #23 PREEMPT(full)
>>>> Tainted: [O]=OOT_MODULE
>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
>>>> RIP: 0010:__ftrace_trace_stack.constprop.0+0x172/0x180
>>>> RSP: 0018:ffffc900000034a8 EFLAGS: 00010002
>>>> RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000000
>>>> RDX: 0000000000000003 RSI: ffffffff826d7b87 RDI: ffffffff826e9329
>>>> RBP: 0000000000090009 R08: 0000000000000005 R09: ffffffff82afbc4c
>>>> R10: 0000000000000008 R11: 0000000000011d7a R12: 0000000000000000
>>>> R13: ffff888003874100 R14: 0000000000000003 R15: ffff8880038c1054
>>>> FS: 0000000000000000(0000) GS:ffff8880fa8ea000(0000) knlGS:0000000000000000
>>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> CR2: 000055b31fa7f540 CR3: 00000000078f4005 CR4: 0000000000770ef0
>>>> PKRU: 55555554
>>>> Call Trace:
>>>> <IRQ>
>>>> trace_buffer_unlock_commit_regs+0x6d/0x220
>>>> trace_event_buffer_commit+0x5c/0x260
>>>> trace_event_raw_event_softirq+0x47/0x80
>>>> raise_softirq_irqoff+0x6e/0xa0
>>>> rcu_read_unlock_special+0xb1/0x160
>>>> unwind_next_frame+0x203/0x9b0
>>>> __unwind_start+0x15d/0x1c0
>>>> arch_stack_walk+0x62/0xf0
>>>> stack_trace_save+0x48/0x70
>>>> __ftrace_trace_stack.constprop.0+0x144/0x180
>>>> trace_buffer_unlock_commit_regs+0x6d/0x220
>>>> trace_event_buffer_commit+0x5c/0x260
>>>> trace_event_raw_event_softirq+0x47/0x80
>>>> raise_softirq_irqoff+0x6e/0xa0
>>>> rcu_read_unlock_special+0xb1/0x160
>>>> unwind_next_frame+0x203/0x9b0
>>>> __unwind_start+0x15d/0x1c0
>>>> arch_stack_walk+0x62/0xf0
>>>> stack_trace_save+0x48/0x70
>>>> __ftrace_trace_stack.constprop.0+0x144/0x180
>>>> trace_buffer_unlock_commit_regs+0x6d/0x220
>>>> trace_event_buffer_commit+0x5c/0x260
>>>> trace_event_raw_event_softirq+0x47/0x80
>>>> raise_softirq_irqoff+0x6e/0xa0
>>>> rcu_read_unlock_special+0xb1/0x160
>>>> unwind_next_frame+0x203/0x9b0
>>>> __unwind_start+0x15d/0x1c0
>>>> arch_stack_walk+0x62/0xf0
>>>> stack_trace_save+0x48/0x70
>>>> __ftrace_trace_stack.constprop.0+0x144/0x180
>>>> trace_buffer_unlock_commit_regs+0x6d/0x220
>>>> trace_event_buffer_commit+0x5c/0x260
>>>> trace_event_raw_event_softirq+0x47/0x80
>>>> raise_softirq_irqoff+0x6e/0xa0
>>>> rcu_read_unlock_special+0xb1/0x160
>>>> __is_insn_slot_addr+0x54/0x70
>>>> kernel_text_address+0x48/0xc0
>>>> __kernel_text_address+0xd/0x40
>>>> unwind_get_return_address+0x1e/0x40
>>>> arch_stack_walk+0x9c/0xf0
>>>> stack_trace_save+0x48/0x70
>>>> __ftrace_trace_stack.constprop.0+0x144/0x180
>>>> trace_buffer_unlock_commit_regs+0x6d/0x220
>>>> trace_event_buffer_commit+0x5c/0x260
>>>> trace_event_raw_event_softirq+0x47/0x80
>>>> __raise_softirq_irqoff+0x61/0x80
>>>> __flush_smp_call_function_queue+0x115/0x420
>>>> __sysvec_call_function_single+0x17/0xb0
>>>> sysvec_call_function_single+0x8c/0xc0
>>>> </IRQ>
>>>>
>>>> Commit b41642c87716 ("rcu: Fix rcu_read_unlock() deadloop due to IRQ work")
>>>> fixed the infinite loop in rcu_read_unlock_special() for IRQ work by
>>>> setting a flag before calling irq_work_queue_on(). We fix this issue by
>>>> setting the same flag before calling raise_softirq_irqoff(), and rename the
>>>> flag to defer_qs_pending to make it more generic.
>>>>
>>>> Fixes: 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in __rcu_read_unlock()")
>>>> Reported-by: Tengda Wu <wutengda2@huawei.com>
>>>> Signed-off-by: Yao Kai <yaokai34@huawei.com>
>>>> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
>>>> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
>>>
>>> Looks good but, BTW, what happens if rcu_qs() is called
>>> before rcu_preempt_deferred_qs() had a chance to be called?
>>
>> Could you provide an example of when that can happen?
>
> It can happen because rcu_qs() is called before rcu_preempt_deferred_qs()
> in rcu_softirq_qs(). Inverting the calls could help but IRQs must be disabled
> to ensure there is no read side between rcu_preempt_deferred_qs() and rcu_qs().
Ah, the rcu_softirq_qs() path. Indeed, I see what you're saying now. Not sure
how to trigger it, but yeah, good catch. It would delay the reset of the flag.
> I'm not aware of other ways to trigger that, except perhaps this:
>
> https://lore.kernel.org/rcu/20251230004124.438070-1-joelagnelf@nvidia.com/T/#u
>
> Either we fix those sites and make sure that rcu_preempt_deferred_qs() is always
> called before rcu_qs() in the same IRQ disabled section (or there are other
> fields set in ->rcu_read_unlock_special for later clearance). If we do that we
> must WARN_ON_ONCE(rdp->defer_qs_pending == DEFER_QS_PENDING) in rcu_qs().
>
> Or we reset rdp->defer_qs_pending from rcu_qs(), which sounds more robust.
If we did that, can the following not happen? I believe I tried that and it
did not fix the IRQ work recursion. Suppose you have a timer interrupt and an
IRQ that triggers BPF on exit. Both are pending on the CPU's IRQ controller.
First the non-timer interrupt does this:
irq_exit()
__irq_exit_rcu()
/* in_hardirq() returns false after this */
preempt_count_sub(HARDIRQ_OFFSET)
tick_irq_exit()
tick_nohz_irq_exit()
tick_nohz_stop_sched_tick()
trace_tick_stop() /* a bpf prog is hooked on this trace point */
__bpf_trace_tick_stop()
bpf_trace_run2()
rcu_read_unlock_special()
/* will send an IPI to itself */
irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
<timer interrupt runs>
The timer interrupt runs, and does the clean up that the IRQ work was supposed
to do.
<IPI now runs for the IRQ work>
->irq_exit()
... recursion since IRQ work issued again.
Maybe it is unlikely to happen, but it feels a bit fragile still. All it
takes is one call to rcu_qs() after the IRQ work was queued and before it
ran, coupled with an RCU reader that somehow always enters the slow-path.
> Ah an alternative is to make rdp::defer_qs_pending a field in union rcu_special
> which, sadly, would need to be expanded as a u64.
I was thinking maybe the most robust approach is something like the following.
We _have_ to go through the node tree to report a QS; once we "defer the QS",
there's no other way out of that. That path is guaranteed to be taken in order
to end the GP. So just unconditionally clear the flag there and in all such
places, something like the following, which passes light rcutorture on all
scenarios.
Once we issue an IRQ work or raise a softirq, we don't need to do that again
for the same CPU until the GP ends.
(EDIT: actually rcu_disable_urgency_upon_qs() or its callsites might just be
the place, since it is present in (almost?) all call sites where we report
up the node tree).
Thoughts? I need to double-check if there are any possibilities of requiring IRQ
work more than once during the same GP on the same CPU. I don't think
so, though.
---8<-----------------------
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index b7c818cabe44..81c3af5d1f67 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -729,6 +729,12 @@ static void rcu_disable_urgency_upon_qs(struct rcu_data *rdp)
}
}
+static void rcu_defer_qs_clear_pending(struct rcu_data *rdp)
+{
+ if (READ_ONCE(rdp->defer_qs_pending) == DEFER_QS_PENDING)
+ WRITE_ONCE(rdp->defer_qs_pending, DEFER_QS_IDLE);
+}
+
/**
* rcu_is_watching - RCU read-side critical sections permitted on current CPU?
*
@@ -2483,6 +2490,8 @@ rcu_report_qs_rdp(struct rcu_data *rdp)
}
rcu_disable_urgency_upon_qs(rdp);
+ rcu_defer_qs_clear_pending(rdp);
+
rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
/* ^^^ Released rnp->lock */
}
@@ -2767,6 +2776,12 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp))
if (ret > 0) {
mask |= rdp->grpmask;
rcu_disable_urgency_upon_qs(rdp);
+ /*
+ * Clear any stale defer_qs_pending for idle/offline
+ * CPUs reporting QS. This can happen if a CPU went
+ * idle after raising softirq but before it ran.
+ */
+ rcu_defer_qs_clear_pending(rdp);
}
if (ret < 0)
rsmask |= rdp->grpmask;
@@ -4373,6 +4388,7 @@ void rcutree_report_cpu_starting(unsigned int cpu)
local_irq_save(flags);
rcu_disable_urgency_upon_qs(rdp);
+ rcu_defer_qs_clear_pending(rdp);
/* Report QS -after- changing ->qsmaskinitnext! */
rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
} else {
@@ -4432,6 +4448,7 @@ void rcutree_report_cpu_dead(void)
if (rnp->qsmask & mask) { /* RCU waiting on outgoing CPU? */
/* Report quiescent state -before- changing ->qsmaskinitnext! */
rcu_disable_urgency_upon_qs(rdp);
+ rcu_defer_qs_clear_pending(rdp);
rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
raw_spin_lock_irqsave_rcu_node(rnp, flags);
}
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 96c49c56fc14..7f2af0e45883 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -272,6 +272,10 @@ static void rcu_report_exp_rdp(struct rcu_data *rdp)
raw_spin_lock_irqsave_rcu_node(rnp, flags);
WRITE_ONCE(rdp->cpu_no_qs.b.exp, false);
ASSERT_EXCLUSIVE_WRITER(rdp->cpu_no_qs.b.exp);
+
+ /* Expedited QS reported. TODO: what happens if we deferred both exp and normal QS (and vice versa for the other callsites)? */
+ rcu_defer_qs_clear_pending(rdp);
+
rcu_report_exp_cpu_mult(rnp, flags, rdp->grpmask, true);
}
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 6c86c7b96c63..d706daea021f 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -487,8 +487,6 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
union rcu_special special;
rdp = this_cpu_ptr(&rcu_data);
- if (rdp->defer_qs_pending == DEFER_QS_PENDING)
- rdp->defer_qs_pending = DEFER_QS_IDLE;
/*
* If RCU core is waiting for this CPU to exit its critical section,
^ permalink raw reply related [flat|nested] 38+ messages in thread

* Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
2026-01-09 1:12 ` Joel Fernandes
@ 2026-01-09 14:23 ` Frederic Weisbecker
0 siblings, 0 replies; 38+ messages in thread
From: Frederic Weisbecker @ 2026-01-09 14:23 UTC (permalink / raw)
To: Joel Fernandes
Cc: Paul E McKenney, Boqun Feng, rcu, Neeraj Upadhyay, Josh Triplett,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Shuah Khan, linux-kernel, linux-kselftest,
Kai Yao, Tengda Wu
On Thu, Jan 08, 2026 at 08:12:56PM -0500, Joel Fernandes wrote:
> Hi Frederic,
>
> On 1/8/2026 10:25 AM, Frederic Weisbecker wrote:
> > On Wed, Jan 07, 2026 at 08:02:43PM -0500, Joel Fernandes wrote:
> >>
> >>
> >>> On Jan 7, 2026, at 6:15 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
> >>>
> >>> On Thu, Jan 01, 2026 at 11:34:10AM -0500, Joel Fernandes wrote:
> >>>> From: Yao Kai <yaokai34@huawei.com>
> >>>>
> >>>> Commit 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in
> >>>> __rcu_read_unlock()") removes the recursion-protection code from
> >>>> __rcu_read_unlock(). Therefore, we could invoke the deadloop in
> >>>> raise_softirq_irqoff() with ftrace enabled as follows:
> >>>>
> >>>> [ register dump and recursive call trace trimmed; identical to the one above ]
> >>>>
> >>>> Commit b41642c87716 ("rcu: Fix rcu_read_unlock() deadloop due to IRQ work")
> >>>> fixed the infinite loop in rcu_read_unlock_special() for IRQ work by
> >>>> setting a flag before calling irq_work_queue_on(). We fix this issue by
> >>>> setting the same flag before calling raise_softirq_irqoff(), and rename the
> >>>> flag to defer_qs_pending to make it more generic.
> >>>>
> >>>> Fixes: 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in __rcu_read_unlock()")
> >>>> Reported-by: Tengda Wu <wutengda2@huawei.com>
> >>>> Signed-off-by: Yao Kai <yaokai34@huawei.com>
> >>>> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
> >>>> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> >>>
> >>> Looks good but, BTW, what happens if rcu_qs() is called
> >>> before rcu_preempt_deferred_qs() had a chance to be called?
> >>
> >> Could you provide an example of when that can happen?
> >
> > It can happen because rcu_qs() is called before rcu_preempt_deferred_qs()
> > in rcu_softirq_qs(). Inverting the calls could help but IRQs must be disabled
> > to ensure there is no read side between rcu_preempt_deferred_qs() and rcu_qs().
>
> Ah, the rcu_softirq_qs() path. Indeed, I see what you're saying now. Not sure
> how to trigger it, but yeah, good catch. It would delay the reset of the flag.
>
> > I'm not aware of other ways to trigger that, except perhaps this:
> >
> > https://lore.kernel.org/rcu/20251230004124.438070-1-joelagnelf@nvidia.com/T/#u
> >
> > Either we fix those sites and make sure that rcu_preempt_deferred_qs() is always
> > called before rcu_qs() in the same IRQ disabled section (or there are other
> > fields set in ->rcu_read_unlock_special for later clearance). If we do that we
> > must WARN_ON_ONCE(rdp->defer_qs_pending == DEFER_QS_PENDING) in rcu_qs().
> >
> > Or we reset rdp->defer_qs_pending from rcu_qs(), which sounds more robust.
>
> If we did that, can the following not happen? I believe I tried that and it
> did not fix the IRQ work recursion. Suppose you have a timer interrupt and an
> IRQ that triggers BPF on exit. Both are pending on the CPU's IRQ controller.
>
> First the non-timer interrupt does this:
>
> irq_exit()
> __irq_exit_rcu()
> /* in_hardirq() returns false after this */
> preempt_count_sub(HARDIRQ_OFFSET)
> tick_irq_exit()
> tick_nohz_irq_exit()
> tick_nohz_stop_sched_tick()
> trace_tick_stop() /* a bpf prog is hooked on this trace point */
> __bpf_trace_tick_stop()
> bpf_trace_run2()
> rcu_read_unlock_special()
> /* will send an IPI to itself */
> irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
>
> <timer interrupt runs>
>
> The timer interrupt runs, and does the clean up that the IRQ work was supposed
> to do.
>
> <IPI now runs for the IRQ work>
> ->irq_exit()
> ... recursion since IRQ work issued again.
If defer_qs_pending is only cleared when rcu_qs() or rcu_report_exp_rdp() is
called, I don't think it can happen, but I could be missing something...
>
> Maybe it is unlikely to happen, but it feels a bit fragile still. All it
> takes is one call to rcu_qs() after the IRQ work was queued and before it
> ran, coupled with an RCU reader that somehow always enters the slow-path.
But if rcu_qs() or rcu_report_exp_rdp() has been called, there is no more need
to enter rcu_read_unlock_special(), right? Unless the task is still blocked
but I'm not sure it could recurse...
>
> > Ah an alternative is to make rdp::defer_qs_pending a field in union rcu_special
> > which, sadly, would need to be expanded as a u64.
>
> I was thinking maybe the most robust approach is something like the following.
> We _have_ to go through the node tree to report a QS; once we "defer the QS",
> there's no other way out of that. That path is guaranteed to be taken in order
> to end the GP. So just unconditionally clear the flag there and in all such
> places, something like the following, which passes light rcutorture on all
> scenarios.
>
> Once we issue an IRQ work or raise a softirq, we don't need to do that again
> for the same CPU until the GP ends.
>
> (EDIT: actually rcu_disable_urgency_upon_qs() or its callsites might just be
> the place, since it is present in (almost?) all call sites where we report
> up the node tree).
>
> Thoughts? I need to double-check if there are any possibilities of requiring IRQ
> work more than once during the same GP on the same CPU. I don't think
> so, though.
>
> [ quoted patch trimmed; identical to the diff above ]
Looks like a possible direction.
Thanks.
--
Frederic Weisbecker
SUSE Labs
^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH -next 2/8] srcu: Use suitable gfp_flags for the init_srcu_struct_nodes()
2026-01-01 16:34 [PATCH -next 0/8] RCU updates from me for next merge window Joel Fernandes
2026-01-01 16:34 ` [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq Joel Fernandes
@ 2026-01-01 16:34 ` Joel Fernandes
2026-01-01 16:34 ` [PATCH -next 3/8] rcu/nocb: Remove unnecessary WakeOvfIsDeferred wake path Joel Fernandes
` (6 subsequent siblings)
8 siblings, 0 replies; 38+ messages in thread
From: Joel Fernandes @ 2026-01-01 16:34 UTC (permalink / raw)
To: Paul E . McKenney, Boqun Feng, rcu
Cc: Frederic Weisbecker, Neeraj Upadhyay, Josh Triplett,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Shuah Khan, linux-kernel, linux-kselftest,
Joel Fernandes
From: Zqiang <qiang.zhang@linux.dev>
When init_srcu_struct*() is used to initialize an srcu_struct, the
structure's ->srcu_sup and sda are allocated with GFP_KERNEL flags.
Similarly, when SRCU_SIZING_INIT is set, the srcu_sup's ->node can
also be allocated with GFP_KERNEL; there is no need to always use
GFP_ATOMIC.
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
kernel/rcu/srcutree.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index ea3f128de06f..c469c708fdd6 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -262,7 +262,7 @@ static int init_srcu_struct_fields(struct srcu_struct *ssp, bool is_static)
ssp->srcu_sup->srcu_gp_seq_needed_exp = SRCU_GP_SEQ_INITIAL_VAL;
ssp->srcu_sup->srcu_last_gp_end = ktime_get_mono_fast_ns();
if (READ_ONCE(ssp->srcu_sup->srcu_size_state) == SRCU_SIZE_SMALL && SRCU_SIZING_IS_INIT()) {
- if (!init_srcu_struct_nodes(ssp, GFP_ATOMIC))
+ if (!init_srcu_struct_nodes(ssp, is_static ? GFP_ATOMIC : GFP_KERNEL))
goto err_free_sda;
WRITE_ONCE(ssp->srcu_sup->srcu_size_state, SRCU_SIZE_BIG);
}
--
2.34.1
^ permalink raw reply related [flat|nested] 38+ messages in thread

* [PATCH -next 3/8] rcu/nocb: Remove unnecessary WakeOvfIsDeferred wake path
2026-01-01 16:34 [PATCH -next 0/8] RCU updates from me for next merge window Joel Fernandes
2026-01-01 16:34 ` [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq Joel Fernandes
2026-01-01 16:34 ` [PATCH -next 2/8] srcu: Use suitable gfp_flags for the init_srcu_struct_nodes() Joel Fernandes
@ 2026-01-01 16:34 ` Joel Fernandes
2026-01-08 15:57 ` Frederic Weisbecker
2026-01-01 16:34 ` [PATCH -next 4/8] rcu/nocb: Add warning if no rcuog wake up attempt happened during overload Joel Fernandes
` (5 subsequent siblings)
8 siblings, 1 reply; 38+ messages in thread
From: Joel Fernandes @ 2026-01-01 16:34 UTC (permalink / raw)
To: Paul E . McKenney, Boqun Feng, rcu
Cc: Frederic Weisbecker, Neeraj Upadhyay, Josh Triplett,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Shuah Khan, linux-kernel, linux-kselftest,
Joel Fernandes
The WakeOvfIsDeferred code path in __call_rcu_nocb_wake() attempts to
wake rcuog when the callback count exceeds qhimark and callbacks aren't
done with their GP (newly queued or awaiting GP). However, extensive
testing shows this wake is always redundant.
In the flooding case, rcuog is always waiting for a GP to finish. So
waking up the rcuog thread is pointless. The timer wakeup adds overhead:
rcuog simply wakes up and goes back to sleep, achieving nothing.
This path also adds a full memory barrier and unnecessary timer expiry
modifications.
The root cause is that WakeOvfIsDeferred fires when
!rcu_segcblist_ready_cbs() (GP not complete), but waking rcuog cannot
accelerate GP completion.
This commit therefore removes this path.
Tested with rcutorture scenarios: TREE01, TREE05, TREE08 (all NOCB
configurations) - all pass. Also stress tested using a kernel module
that floods call_rcu() to trigger the overload conditions, and the
observations confirmed the findings.
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
kernel/rcu/tree.h | 1 -
kernel/rcu/tree_nocb.h | 35 +++++++++--------------------------
2 files changed, 9 insertions(+), 27 deletions(-)
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 2265b9c2906e..653fb4ba5852 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -301,7 +301,6 @@ struct rcu_data {
#define RCU_NOCB_WAKE_BYPASS 1
#define RCU_NOCB_WAKE_LAZY 2
#define RCU_NOCB_WAKE 3
-#define RCU_NOCB_WAKE_FORCE 4
#define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
/* For jiffies_till_first_fqs and */
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index e6cd56603cad..daff2756cd90 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -518,10 +518,8 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
}
/*
- * Awaken the no-CBs grace-period kthread if needed, either due to it
- * legitimately being asleep or due to overload conditions.
- *
- * If warranted, also wake up the kthread servicing this CPUs queues.
+ * Awaken the no-CBs grace-period kthread if needed due to it legitimately
+ * being asleep.
*/
static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
unsigned long flags)
@@ -533,7 +531,6 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
long lazy_len;
long len;
struct task_struct *t;
- struct rcu_data *rdp_gp = rdp->nocb_gp_rdp;
// If we are being polled or there is no kthread, just leave.
t = READ_ONCE(rdp->nocb_gp_kthread);
@@ -549,22 +546,22 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
lazy_len = READ_ONCE(rdp->lazy_len);
if (was_alldone) {
rdp->qlen_last_fqs_check = len;
+ rcu_nocb_unlock(rdp);
// Only lazy CBs in bypass list
if (lazy_len && bypass_len == lazy_len) {
- rcu_nocb_unlock(rdp);
wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_LAZY,
TPS("WakeLazy"));
} else if (!irqs_disabled_flags(flags)) {
/* ... if queue was empty ... */
- rcu_nocb_unlock(rdp);
wake_nocb_gp(rdp, false);
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
TPS("WakeEmpty"));
} else {
- rcu_nocb_unlock(rdp);
wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE,
TPS("WakeEmptyIsDeferred"));
}
+
+ return;
} else if (len > rdp->qlen_last_fqs_check + qhimark) {
/* ... or if many callbacks queued. */
rdp->qlen_last_fqs_check = len;
@@ -575,21 +572,10 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
rcu_advance_cbs_nowake(rdp->mynode, rdp);
rdp->nocb_gp_adv_time = j;
}
- smp_mb(); /* Enqueue before timer_pending(). */
- if ((rdp->nocb_cb_sleep ||
- !rcu_segcblist_ready_cbs(&rdp->cblist)) &&
- !timer_pending(&rdp_gp->nocb_timer)) {
- rcu_nocb_unlock(rdp);
- wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_FORCE,
- TPS("WakeOvfIsDeferred"));
- } else {
- rcu_nocb_unlock(rdp);
- trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot"));
- }
- } else {
- rcu_nocb_unlock(rdp);
- trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot"));
}
+
+ rcu_nocb_unlock(rdp);
+ trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot"));
}
static void call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *head,
@@ -966,7 +952,6 @@ static bool do_nocb_deferred_wakeup_common(struct rcu_data *rdp_gp,
unsigned long flags)
__releases(rdp_gp->nocb_gp_lock)
{
- int ndw;
int ret;
if (!rcu_nocb_need_deferred_wakeup(rdp_gp, level)) {
@@ -974,8 +959,7 @@ static bool do_nocb_deferred_wakeup_common(struct rcu_data *rdp_gp,
return false;
}
- ndw = rdp_gp->nocb_defer_wakeup;
- ret = __wake_nocb_gp(rdp_gp, rdp, ndw == RCU_NOCB_WAKE_FORCE, flags);
+ ret = __wake_nocb_gp(rdp_gp, rdp, false, flags);
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("DeferredWake"));
return ret;
@@ -991,7 +975,6 @@ static void do_nocb_deferred_wakeup_timer(struct timer_list *t)
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Timer"));
raw_spin_lock_irqsave(&rdp->nocb_gp_lock, flags);
- smp_mb__after_spinlock(); /* Timer expire before wakeup. */
do_nocb_deferred_wakeup_common(rdp, rdp, RCU_NOCB_WAKE_BYPASS, flags);
}
--
2.34.1
^ permalink raw reply related [flat|nested] 38+ messages in thread

* Re: [PATCH -next 3/8] rcu/nocb: Remove unnecessary WakeOvfIsDeferred wake path
2026-01-01 16:34 ` [PATCH -next 3/8] rcu/nocb: Remove unnecessary WakeOvfIsDeferred wake path Joel Fernandes
@ 2026-01-08 15:57 ` Frederic Weisbecker
2026-01-09 1:39 ` Joel Fernandes
0 siblings, 1 reply; 38+ messages in thread
From: Frederic Weisbecker @ 2026-01-08 15:57 UTC (permalink / raw)
To: Joel Fernandes
Cc: Paul E . McKenney, Boqun Feng, rcu, Neeraj Upadhyay,
Josh Triplett, Uladzislau Rezki, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, Zqiang, Shuah Khan,
linux-kernel, linux-kselftest
On Thu, Jan 01, 2026 at 11:34:12AM -0500, Joel Fernandes wrote:
> @@ -974,8 +959,7 @@ static bool do_nocb_deferred_wakeup_common(struct rcu_data *rdp_gp,
> return false;
> }
>
> - ndw = rdp_gp->nocb_defer_wakeup;
> - ret = __wake_nocb_gp(rdp_gp, rdp, ndw == RCU_NOCB_WAKE_FORCE, flags);
> + ret = __wake_nocb_gp(rdp_gp, rdp, false, flags);
The force parameter can now be removed, right? (same applies to wake_nocb_gp()).
Other than that:
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
--
Frederic Weisbecker
SUSE Labs
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -next 3/8] rcu/nocb: Remove unnecessary WakeOvfIsDeferred wake path
2026-01-08 15:57 ` Frederic Weisbecker
@ 2026-01-09 1:39 ` Joel Fernandes
2026-01-09 10:32 ` Boqun Feng
0 siblings, 1 reply; 38+ messages in thread
From: Joel Fernandes @ 2026-01-09 1:39 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Joel Fernandes, Paul E . McKenney, Boqun Feng, rcu,
Neeraj Upadhyay, Josh Triplett, Uladzislau Rezki, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, Zqiang, Shuah Khan,
linux-kernel, linux-kselftest
On Thu, Jan 08, 2026 at 04:57:26PM +0100, Frederic Weisbecker wrote:
> On Thu, Jan 01, 2026 at 11:34:12AM -0500, Joel Fernandes wrote:
> > @@ -974,8 +959,7 @@ static bool do_nocb_deferred_wakeup_common(struct rcu_data *rdp_gp,
> > return false;
> > }
> >
> > - ndw = rdp_gp->nocb_defer_wakeup;
> > - ret = __wake_nocb_gp(rdp_gp, rdp, ndw == RCU_NOCB_WAKE_FORCE, flags);
> > + ret = __wake_nocb_gp(rdp_gp, rdp, false, flags);
>
> The force parameter can now be removed, right? (same applies to wake_nocb_gp()).
>
> Other than that:
>
> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Ah true! Thanks, so the following hunk needs to be squashed into the patch
then, with the review tag. Boqun, if you want to do that, please do, or I can
send it again for the next merge window.
---8<-----------------------
From: "Joel Fernandes" <joelagnelf@nvidia.com>
Subject: [PATCH] fixup! rcu/nocb: Remove unnecessary WakeOvfIsDeferred wake
path
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
kernel/rcu/tree.c | 2 +-
kernel/rcu/tree.h | 2 +-
kernel/rcu/tree_nocb.h | 14 +++++++-------
3 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 293bbd9ac3f4..2921ffb19939 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3769,7 +3769,7 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
}
rcu_nocb_unlock(rdp);
if (wake_nocb)
- wake_nocb_gp(rdp, false);
+ wake_nocb_gp(rdp);
smp_store_release(&rdp->barrier_seq_snap, gseq);
}
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 653fb4ba5852..7dfc57e9adb1 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -499,7 +499,7 @@ static void zero_cpu_stall_ticks(struct rcu_data *rdp);
static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp);
static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq);
static void rcu_init_one_nocb(struct rcu_node *rnp);
-static bool wake_nocb_gp(struct rcu_data *rdp, bool force);
+static bool wake_nocb_gp(struct rcu_data *rdp);
static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
unsigned long j, bool lazy);
static void call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *head,
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index daff2756cd90..c6f1ddecc2d8 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -192,7 +192,7 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
static bool __wake_nocb_gp(struct rcu_data *rdp_gp,
struct rcu_data *rdp,
- bool force, unsigned long flags)
+ unsigned long flags)
__releases(rdp_gp->nocb_gp_lock)
{
bool needwake = false;
@@ -225,13 +225,13 @@ static bool __wake_nocb_gp(struct rcu_data *rdp_gp,
/*
* Kick the GP kthread for this NOCB group.
*/
-static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
+static bool wake_nocb_gp(struct rcu_data *rdp)
{
unsigned long flags;
struct rcu_data *rdp_gp = rdp->nocb_gp_rdp;
raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
- return __wake_nocb_gp(rdp_gp, rdp, force, flags);
+ return __wake_nocb_gp(rdp_gp, rdp, flags);
}
#ifdef CONFIG_RCU_LAZY
@@ -553,7 +553,7 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
TPS("WakeLazy"));
} else if (!irqs_disabled_flags(flags)) {
/* ... if queue was empty ... */
- wake_nocb_gp(rdp, false);
+ wake_nocb_gp(rdp);
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
TPS("WakeEmpty"));
} else {
@@ -959,7 +959,7 @@ static bool do_nocb_deferred_wakeup_common(struct rcu_data *rdp_gp,
return false;
}
- ret = __wake_nocb_gp(rdp_gp, rdp, false, flags);
+ ret = __wake_nocb_gp(rdp_gp, rdp, flags);
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("DeferredWake"));
return ret;
@@ -1255,7 +1255,7 @@ lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
}
rcu_nocb_try_flush_bypass(rdp, jiffies);
rcu_nocb_unlock_irqrestore(rdp, flags);
- wake_nocb_gp(rdp, false);
+ wake_nocb_gp(rdp);
sc->nr_to_scan -= _count;
count += _count;
if (sc->nr_to_scan <= 0)
@@ -1640,7 +1640,7 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
{
}
-static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
+static bool wake_nocb_gp(struct rcu_data *rdp)
{
return false;
}
--
2.34.1
^ permalink raw reply related [flat|nested] 38+ messages in thread

* Re: [PATCH -next 3/8] rcu/nocb: Remove unnecessary WakeOvfIsDeferred wake path
2026-01-09 1:39 ` Joel Fernandes
@ 2026-01-09 10:32 ` Boqun Feng
2026-01-09 11:20 ` Joel Fernandes
0 siblings, 1 reply; 38+ messages in thread
From: Boqun Feng @ 2026-01-09 10:32 UTC (permalink / raw)
To: Joel Fernandes
Cc: Frederic Weisbecker, Joel Fernandes, Paul E . McKenney, rcu,
Neeraj Upadhyay, Josh Triplett, Uladzislau Rezki, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, Zqiang, Shuah Khan,
linux-kernel, linux-kselftest
On Thu, Jan 08, 2026 at 08:39:11PM -0500, Joel Fernandes wrote:
> On Thu, Jan 08, 2026 at 04:57:26PM +0100, Frederic Weisbecker wrote:
> > Le Thu, Jan 01, 2026 at 11:34:12AM -0500, Joel Fernandes a écrit :
> > > @@ -974,8 +959,7 @@ static bool do_nocb_deferred_wakeup_common(struct rcu_data *rdp_gp,
> > > return false;
> > > }
> > >
> > > - ndw = rdp_gp->nocb_defer_wakeup;
> > > - ret = __wake_nocb_gp(rdp_gp, rdp, ndw == RCU_NOCB_WAKE_FORCE, flags);
> > > + ret = __wake_nocb_gp(rdp_gp, rdp, false, flags);
> >
> > The force parameter can now be removed, right? (same applies to wake_nocb_gp()).
> >
> > Other than that:
> >
> > Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
>
> Ah true! Thanks, so the following hunk needs to be squashed into the patch
> then, with the review tag. Boqun, if you want to do that please do, or I can
> send it again for the next merge window.
>
We still have time for this merge window, but I see review is still
ongoing for the other patches. Maybe you could resend these 3 patches
once we reach agreement, and then we can decide which merge window
they target. Thoughts?
Regards,
Boqun
> ---8<-----------------------
>
> From: "Joel Fernandes" <joelagnelf@nvidia.com>
> Subject: [PATCH] fixup! rcu/nocb: Remove unnecessary WakeOvfIsDeferred wake
> path
>
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> ---
> kernel/rcu/tree.c | 2 +-
> kernel/rcu/tree.h | 2 +-
> kernel/rcu/tree_nocb.h | 14 +++++++-------
> 3 files changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 293bbd9ac3f4..2921ffb19939 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3769,7 +3769,7 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
> }
> rcu_nocb_unlock(rdp);
> if (wake_nocb)
> - wake_nocb_gp(rdp, false);
> + wake_nocb_gp(rdp);
> smp_store_release(&rdp->barrier_seq_snap, gseq);
> }
>
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index 653fb4ba5852..7dfc57e9adb1 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -499,7 +499,7 @@ static void zero_cpu_stall_ticks(struct rcu_data *rdp);
> static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp);
> static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq);
> static void rcu_init_one_nocb(struct rcu_node *rnp);
> -static bool wake_nocb_gp(struct rcu_data *rdp, bool force);
> +static bool wake_nocb_gp(struct rcu_data *rdp);
> static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> unsigned long j, bool lazy);
> static void call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *head,
> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> index daff2756cd90..c6f1ddecc2d8 100644
> --- a/kernel/rcu/tree_nocb.h
> +++ b/kernel/rcu/tree_nocb.h
> @@ -192,7 +192,7 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
>
> static bool __wake_nocb_gp(struct rcu_data *rdp_gp,
> struct rcu_data *rdp,
> - bool force, unsigned long flags)
> + unsigned long flags)
> __releases(rdp_gp->nocb_gp_lock)
> {
> bool needwake = false;
> @@ -225,13 +225,13 @@ static bool __wake_nocb_gp(struct rcu_data *rdp_gp,
> /*
> * Kick the GP kthread for this NOCB group.
> */
> -static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
> +static bool wake_nocb_gp(struct rcu_data *rdp)
> {
> unsigned long flags;
> struct rcu_data *rdp_gp = rdp->nocb_gp_rdp;
>
> raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
> - return __wake_nocb_gp(rdp_gp, rdp, force, flags);
> + return __wake_nocb_gp(rdp_gp, rdp, flags);
> }
>
> #ifdef CONFIG_RCU_LAZY
> @@ -553,7 +553,7 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
> TPS("WakeLazy"));
> } else if (!irqs_disabled_flags(flags)) {
> /* ... if queue was empty ... */
> - wake_nocb_gp(rdp, false);
> + wake_nocb_gp(rdp);
> trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
> TPS("WakeEmpty"));
> } else {
> @@ -959,7 +959,7 @@ static bool do_nocb_deferred_wakeup_common(struct rcu_data *rdp_gp,
> return false;
> }
>
> - ret = __wake_nocb_gp(rdp_gp, rdp, false, flags);
> + ret = __wake_nocb_gp(rdp_gp, rdp, flags);
> trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("DeferredWake"));
>
> return ret;
> @@ -1255,7 +1255,7 @@ lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> }
> rcu_nocb_try_flush_bypass(rdp, jiffies);
> rcu_nocb_unlock_irqrestore(rdp, flags);
> - wake_nocb_gp(rdp, false);
> + wake_nocb_gp(rdp);
> sc->nr_to_scan -= _count;
> count += _count;
> if (sc->nr_to_scan <= 0)
> @@ -1640,7 +1640,7 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
> {
> }
>
> -static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
> +static bool wake_nocb_gp(struct rcu_data *rdp)
> {
> return false;
> }
> --
> 2.34.1
>
* Re: [PATCH -next 3/8] rcu/nocb: Remove unnecessary WakeOvfIsDeferred wake path
2026-01-09 10:32 ` Boqun Feng
@ 2026-01-09 11:20 ` Joel Fernandes
2026-01-11 12:14 ` Boqun Feng
0 siblings, 1 reply; 38+ messages in thread
From: Joel Fernandes @ 2026-01-09 11:20 UTC (permalink / raw)
To: Boqun Feng
Cc: Frederic Weisbecker, Joel Fernandes, Paul E McKenney, rcu,
Neeraj Upadhyay, Josh Triplett, Uladzislau Rezki, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, Zqiang, Shuah Khan,
linux-kernel, linux-kselftest
> On Jan 9, 2026, at 5:32 AM, Boqun Feng <boqun.feng@gmail.com> wrote:
>
> On Thu, Jan 08, 2026 at 08:39:11PM -0500, Joel Fernandes wrote:
>>> On Thu, Jan 08, 2026 at 04:57:26PM +0100, Frederic Weisbecker wrote:
>>> Le Thu, Jan 01, 2026 at 11:34:12AM -0500, Joel Fernandes a écrit :
>>>> @@ -974,8 +959,7 @@ static bool do_nocb_deferred_wakeup_common(struct rcu_data *rdp_gp,
>>>> return false;
>>>> }
>>>>
>>>> - ndw = rdp_gp->nocb_defer_wakeup;
>>>> - ret = __wake_nocb_gp(rdp_gp, rdp, ndw == RCU_NOCB_WAKE_FORCE, flags);
>>>> + ret = __wake_nocb_gp(rdp_gp, rdp, false, flags);
>>>
>>> The force parameter can now be removed, right? (same applies to wake_nocb_gp()).
>>>
>>> Other than that:
>>>
>>> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
>>
>> Ah true! Thanks, so the following hunk needs to be squashed into the patch
>> then, with the review tag. Boqun, if you want to do that please do, or I can
>> send it again for the next merge window.
>>
>
> We still have time for this merge window, but I see there is still
> reviewing going on for other patches, maybe you could resend these 3
> patches once we reach agreement, and then we can decide which merge
> window. Thoughts?
Yes, or let us drop these 3 for this merge window; since I am handling the
next merge window, I will include them once we reach agreement. I have 3
more patches coming up in this area as well.
So I will re-send all of the nocb patches together.
That will also make it easier for you and Frederic.
If by chance we conclude review and reach agreement in time for this
window, you could add them too; for now it is OK to drop the nocb ones.
Thanks!
- Joel
>
> Regards,
> Boqun
>
>> ---8<-----------------------
>>
>> From: "Joel Fernandes" <joelagnelf@nvidia.com>
>> Subject: [PATCH] fixup! rcu/nocb: Remove unnecessary WakeOvfIsDeferred wake
>> path
>>
>> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
>> ---
>> kernel/rcu/tree.c | 2 +-
>> kernel/rcu/tree.h | 2 +-
>> kernel/rcu/tree_nocb.h | 14 +++++++-------
>> 3 files changed, 9 insertions(+), 9 deletions(-)
>>
>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>> index 293bbd9ac3f4..2921ffb19939 100644
>> --- a/kernel/rcu/tree.c
>> +++ b/kernel/rcu/tree.c
>> @@ -3769,7 +3769,7 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
>> }
>> rcu_nocb_unlock(rdp);
>> if (wake_nocb)
>> - wake_nocb_gp(rdp, false);
>> + wake_nocb_gp(rdp);
>> smp_store_release(&rdp->barrier_seq_snap, gseq);
>> }
>>
>> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
>> index 653fb4ba5852..7dfc57e9adb1 100644
>> --- a/kernel/rcu/tree.h
>> +++ b/kernel/rcu/tree.h
>> @@ -499,7 +499,7 @@ static void zero_cpu_stall_ticks(struct rcu_data *rdp);
>> static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp);
>> static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq);
>> static void rcu_init_one_nocb(struct rcu_node *rnp);
>> -static bool wake_nocb_gp(struct rcu_data *rdp, bool force);
>> +static bool wake_nocb_gp(struct rcu_data *rdp);
>> static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>> unsigned long j, bool lazy);
>> static void call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *head,
>> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
>> index daff2756cd90..c6f1ddecc2d8 100644
>> --- a/kernel/rcu/tree_nocb.h
>> +++ b/kernel/rcu/tree_nocb.h
>> @@ -192,7 +192,7 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
>>
>> static bool __wake_nocb_gp(struct rcu_data *rdp_gp,
>> struct rcu_data *rdp,
>> - bool force, unsigned long flags)
>> + unsigned long flags)
>> __releases(rdp_gp->nocb_gp_lock)
>> {
>> bool needwake = false;
>> @@ -225,13 +225,13 @@ static bool __wake_nocb_gp(struct rcu_data *rdp_gp,
>> /*
>> * Kick the GP kthread for this NOCB group.
>> */
>> -static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
>> +static bool wake_nocb_gp(struct rcu_data *rdp)
>> {
>> unsigned long flags;
>> struct rcu_data *rdp_gp = rdp->nocb_gp_rdp;
>>
>> raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
>> - return __wake_nocb_gp(rdp_gp, rdp, force, flags);
>> + return __wake_nocb_gp(rdp_gp, rdp, flags);
>> }
>>
>> #ifdef CONFIG_RCU_LAZY
>> @@ -553,7 +553,7 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
>> TPS("WakeLazy"));
>> } else if (!irqs_disabled_flags(flags)) {
>> /* ... if queue was empty ... */
>> - wake_nocb_gp(rdp, false);
>> + wake_nocb_gp(rdp);
>> trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
>> TPS("WakeEmpty"));
>> } else {
>> @@ -959,7 +959,7 @@ static bool do_nocb_deferred_wakeup_common(struct rcu_data *rdp_gp,
>> return false;
>> }
>>
>> - ret = __wake_nocb_gp(rdp_gp, rdp, false, flags);
>> + ret = __wake_nocb_gp(rdp_gp, rdp, flags);
>> trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("DeferredWake"));
>>
>> return ret;
>> @@ -1255,7 +1255,7 @@ lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
>> }
>> rcu_nocb_try_flush_bypass(rdp, jiffies);
>> rcu_nocb_unlock_irqrestore(rdp, flags);
>> - wake_nocb_gp(rdp, false);
>> + wake_nocb_gp(rdp);
>> sc->nr_to_scan -= _count;
>> count += _count;
>> if (sc->nr_to_scan <= 0)
>> @@ -1640,7 +1640,7 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
>> {
>> }
>>
>> -static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
>> +static bool wake_nocb_gp(struct rcu_data *rdp)
>> {
>> return false;
>> }
>> --
>> 2.34.1
>>
* Re: [PATCH -next 3/8] rcu/nocb: Remove unnecessary WakeOvfIsDeferred wake path
2026-01-09 11:20 ` Joel Fernandes
@ 2026-01-11 12:14 ` Boqun Feng
0 siblings, 0 replies; 38+ messages in thread
From: Boqun Feng @ 2026-01-11 12:14 UTC (permalink / raw)
To: Joel Fernandes
Cc: Frederic Weisbecker, Joel Fernandes, Paul E McKenney, rcu,
Neeraj Upadhyay, Josh Triplett, Uladzislau Rezki, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, Zqiang, Shuah Khan,
linux-kernel, linux-kselftest
On Fri, Jan 09, 2026 at 06:20:20AM -0500, Joel Fernandes wrote:
>
[...]
> >>>
> >>> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
> >>
> >> Ah true! Thanks, so the following hunk needs to be squashed into the patch
> >> then, with the review tag. Boqun, if you want to do that please do, or I can
> >> send it again for the next merge window.
> >>
> >
> > We still have time for this merge window, but I see there is still
> > reviewing going on for other patches, maybe you could resend these 3
> > patches once we reach agreement, and then we can decide which merge
> > window. Thoughts?
>
> Yes, or let us drop these 3 for this merge window and since I am doing the next
> merge window, I will include these after agreement. I have 3 more patches
> as well in this area coming up.
>
> So I will re-send all of them together for nocb.
>
> That will also make it easier for you and Frederic.
>
> If by chance, we conclude review and agreement in time for this window, you could add them too, for now ok to drop nocb ones.
>
Given that we still have some ongoing discussion on other nocb patches,
I dropped the 3 nocb patches for now. I think we can still pick them up
if we finalize next week. Thanks!
Regards,
Boqun
* [PATCH -next 4/8] rcu/nocb: Add warning if no rcuog wake up attempt happened during overload
2026-01-01 16:34 [PATCH -next 0/8] RCU updates from me for next merge window Joel Fernandes
` (2 preceding siblings ...)
2026-01-01 16:34 ` [PATCH -next 3/8] rcu/nocb: Remove unnecessary WakeOvfIsDeferred wake path Joel Fernandes
@ 2026-01-01 16:34 ` Joel Fernandes
2026-01-08 17:22 ` Frederic Weisbecker
2026-01-01 16:34 ` [PATCH -next 5/8] rcu/nocb: Add warning to detect if overload advancement is ever useful Joel Fernandes
` (4 subsequent siblings)
8 siblings, 1 reply; 38+ messages in thread
From: Joel Fernandes @ 2026-01-01 16:34 UTC (permalink / raw)
To: Paul E . McKenney, Boqun Feng, rcu
Cc: Frederic Weisbecker, Neeraj Upadhyay, Josh Triplett,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Shuah Khan, linux-kernel, linux-kselftest,
Joel Fernandes
To be sure that no rcuog wake-ups are lost, add a warning to cover
the case where the rdp is overloaded with callbacks but no wake-up
was attempted.
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
kernel/rcu/tree.c | 4 ++++
kernel/rcu/tree.h | 1 +
kernel/rcu/tree_nocb.h | 6 +++++-
3 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 293bbd9ac3f4..78c045a5ef03 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3767,6 +3767,10 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
debug_rcu_head_unqueue(&rdp->barrier_head);
rcu_barrier_trace(TPS("IRQNQ"), -1, rcu_state.barrier_sequence);
}
+#ifdef CONFIG_RCU_NOCB_CPU
+ if (wake_nocb)
+ rdp->nocb_gp_wake_attempt = true;
+#endif
rcu_nocb_unlock(rdp);
if (wake_nocb)
wake_nocb_gp(rdp, false);
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 653fb4ba5852..74bd6a2a2f84 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -257,6 +257,7 @@ struct rcu_data {
unsigned long nocb_gp_loops; /* # passes through wait code. */
struct swait_queue_head nocb_gp_wq; /* For nocb kthreads to sleep on. */
bool nocb_cb_sleep; /* Is the nocb CB thread asleep? */
+ bool nocb_gp_wake_attempt; /* Was a rcuog wakeup attempted? */
struct task_struct *nocb_cb_kthread;
struct list_head nocb_head_rdp; /*
* Head of rcu_data list in wakeup chain,
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index daff2756cd90..7e9d465c8ab1 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -546,6 +546,7 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
lazy_len = READ_ONCE(rdp->lazy_len);
if (was_alldone) {
rdp->qlen_last_fqs_check = len;
+ rdp->nocb_gp_wake_attempt = true;
rcu_nocb_unlock(rdp);
// Only lazy CBs in bypass list
if (lazy_len && bypass_len == lazy_len) {
@@ -563,7 +564,8 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
return;
} else if (len > rdp->qlen_last_fqs_check + qhimark) {
- /* ... or if many callbacks queued. */
+ /* Callback overload condition. */
+ WARN_ON_ONCE(!rdp->nocb_gp_wake_attempt);
rdp->qlen_last_fqs_check = len;
j = jiffies;
if (j != rdp->nocb_gp_adv_time &&
@@ -688,6 +690,7 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
bypass_ncbs > 2 * qhimark)) {
flush_bypass = true;
} else if (!bypass_ncbs && rcu_segcblist_empty(&rdp->cblist)) {
+ rdp->nocb_gp_wake_attempt = false;
rcu_nocb_unlock_irqrestore(rdp, flags);
continue; /* No callbacks here, try next. */
}
@@ -1254,6 +1257,7 @@ lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
continue;
}
rcu_nocb_try_flush_bypass(rdp, jiffies);
+ rdp->nocb_gp_wake_attempt = true;
rcu_nocb_unlock_irqrestore(rdp, flags);
wake_nocb_gp(rdp, false);
sc->nr_to_scan -= _count;
--
2.34.1
* Re: [PATCH -next 4/8] rcu/nocb: Add warning if no rcuog wake up attempt happened during overload
2026-01-01 16:34 ` [PATCH -next 4/8] rcu/nocb: Add warning if no rcuog wake up attempt happened during overload Joel Fernandes
@ 2026-01-08 17:22 ` Frederic Weisbecker
2026-01-09 3:49 ` Joel Fernandes
0 siblings, 1 reply; 38+ messages in thread
From: Frederic Weisbecker @ 2026-01-08 17:22 UTC (permalink / raw)
To: Joel Fernandes
Cc: Paul E . McKenney, Boqun Feng, rcu, Neeraj Upadhyay,
Josh Triplett, Uladzislau Rezki, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, Zqiang, Shuah Khan,
linux-kernel, linux-kselftest
Le Thu, Jan 01, 2026 at 11:34:13AM -0500, Joel Fernandes a écrit :
> To be sure that no rcuog wake-ups are lost, add a warning to cover
> the case where the rdp is overloaded with callbacks but no wake-up
> was attempted.
>
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> ---
> kernel/rcu/tree.c | 4 ++++
> kernel/rcu/tree.h | 1 +
> kernel/rcu/tree_nocb.h | 6 +++++-
> 3 files changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 293bbd9ac3f4..78c045a5ef03 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3767,6 +3767,10 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
> debug_rcu_head_unqueue(&rdp->barrier_head);
> rcu_barrier_trace(TPS("IRQNQ"), -1, rcu_state.barrier_sequence);
> }
> +#ifdef CONFIG_RCU_NOCB_CPU
> + if (wake_nocb)
> + rdp->nocb_gp_wake_attempt = true;
> +#endif
entrain only queues a callback if the list is non-empty. And if it's
non-empty, rdp->nocb_gp_wake_attempt should be true already.
> rcu_nocb_unlock(rdp);
> if (wake_nocb)
> wake_nocb_gp(rdp, false);
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index 653fb4ba5852..74bd6a2a2f84 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -257,6 +257,7 @@ struct rcu_data {
> unsigned long nocb_gp_loops; /* # passes through wait code. */
> struct swait_queue_head nocb_gp_wq; /* For nocb kthreads to sleep on. */
> bool nocb_cb_sleep; /* Is the nocb CB thread asleep? */
> + bool nocb_gp_wake_attempt; /* Was a rcuog wakeup attempted? */
How about nocb_gp_handling ?
> struct task_struct *nocb_cb_kthread;
> struct list_head nocb_head_rdp; /*
> * Head of rcu_data list in wakeup chain,
> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> index daff2756cd90..7e9d465c8ab1 100644
> --- a/kernel/rcu/tree_nocb.h
> +++ b/kernel/rcu/tree_nocb.h
> @@ -546,6 +546,7 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
> lazy_len = READ_ONCE(rdp->lazy_len);
> if (was_alldone) {
> rdp->qlen_last_fqs_check = len;
> + rdp->nocb_gp_wake_attempt = true;
> rcu_nocb_unlock(rdp);
> // Only lazy CBs in bypass list
> if (lazy_len && bypass_len == lazy_len) {
> @@ -563,7 +564,8 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
>
> return;
> } else if (len > rdp->qlen_last_fqs_check + qhimark) {
> - /* ... or if many callbacks queued. */
> + /* Callback overload condition. */
> + WARN_ON_ONCE(!rdp->nocb_gp_wake_attempt);
> rdp->qlen_last_fqs_check = len;
> j = jiffies;
> if (j != rdp->nocb_gp_adv_time &&
> @@ -688,6 +690,7 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
> bypass_ncbs > 2 * qhimark)) {
> flush_bypass = true;
> } else if (!bypass_ncbs && rcu_segcblist_empty(&rdp->cblist)) {
> + rdp->nocb_gp_wake_attempt = false;
This is when nocb_cb_wait() is done with callbacks but nocb_gp_wait() is done
with them sooner, when the grace period is done for all pending callbacks.
Something like this would perhaps be more accurate:
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index e6cd56603cad..52010cbeaa76 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -746,6 +746,8 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
needwait_gp = true;
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
TPS("NeedWaitGP"));
+ } else if (!rcu_cblist_n_cbs(&rdp->nocb_bypass)) {
+ rdp->nocb_gp_wake_attempt = false;
}
if (rcu_segcblist_ready_cbs(&rdp->cblist)) {
needwake = rdp->nocb_cb_sleep;
> rcu_nocb_unlock_irqrestore(rdp, flags);
> continue; /* No callbacks here, try next. */
> }
> @@ -1254,6 +1257,7 @@ lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> continue;
> }
> rcu_nocb_try_flush_bypass(rdp, jiffies);
> + rdp->nocb_gp_wake_attempt = true;
Same here, we should expect rdp->nocb_gp_wake_attempt to be already on since
there are lazy callbacks. That's a good opportunity to test the related assertion
though.
Thanks.
> rcu_nocb_unlock_irqrestore(rdp, flags);
> wake_nocb_gp(rdp, false);
> sc->nr_to_scan -= _count;
> --
> 2.34.1
>
--
Frederic Weisbecker
SUSE Labs
* Re: [PATCH -next 4/8] rcu/nocb: Add warning if no rcuog wake up attempt happened during overload
2026-01-08 17:22 ` Frederic Weisbecker
@ 2026-01-09 3:49 ` Joel Fernandes
2026-01-09 14:36 ` Frederic Weisbecker
0 siblings, 1 reply; 38+ messages in thread
From: Joel Fernandes @ 2026-01-09 3:49 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Joel Fernandes, Paul E . McKenney, Boqun Feng, rcu,
Neeraj Upadhyay, Josh Triplett, Uladzislau Rezki, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, Zqiang, Shuah Khan,
linux-kernel, linux-kselftest
Hi Frederic,
On Thu, Jan 08, 2026 at 06:22:45PM +0100, Frederic Weisbecker wrote:
> Le Thu, Jan 01, 2026 at 11:34:13AM -0500, Joel Fernandes a écrit :
> > To be sure that no rcuog wake-ups are lost, add a warning to cover
> > the case where the rdp is overloaded with callbacks but no wake-up
> > was attempted.
> >
> > Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> > ---
> > kernel/rcu/tree.c | 4 ++++
> > kernel/rcu/tree.h | 1 +
> > kernel/rcu/tree_nocb.h | 6 +++++-
> > 3 files changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 293bbd9ac3f4..78c045a5ef03 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3767,6 +3767,10 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
> > debug_rcu_head_unqueue(&rdp->barrier_head);
> > rcu_barrier_trace(TPS("IRQNQ"), -1, rcu_state.barrier_sequence);
> > }
> > +#ifdef CONFIG_RCU_NOCB_CPU
> > + if (wake_nocb)
> > + rdp->nocb_gp_wake_attempt = true;
> > +#endif
>
> entrain only queues a callback if the list is non-empty. And if it's
> non-empty, rdp->nocb_gp_wake_attempt should be true already.
This is true; we don't need to track this wake-up. I will replace it with a
WARN.
> > rcu_nocb_unlock(rdp);
> > if (wake_nocb)
> > wake_nocb_gp(rdp, false);
> > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > index 653fb4ba5852..74bd6a2a2f84 100644
> > --- a/kernel/rcu/tree.h
> > +++ b/kernel/rcu/tree.h
> > @@ -257,6 +257,7 @@ struct rcu_data {
> > unsigned long nocb_gp_loops; /* # passes through wait code. */
> > struct swait_queue_head nocb_gp_wq; /* For nocb kthreads to sleep on. */
> > bool nocb_cb_sleep; /* Is the nocb CB thread asleep? */
> > + bool nocb_gp_wake_attempt; /* Was a rcuog wakeup attempted? */
>
> How about nocb_gp_handling ?
This is a better name indeed, considering that we also track this for
deferred wakeups of the GP thread.
> > struct task_struct *nocb_cb_kthread;
> > struct list_head nocb_head_rdp; /*
> > * Head of rcu_data list in wakeup chain,
> > diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> > index daff2756cd90..7e9d465c8ab1 100644
> > --- a/kernel/rcu/tree_nocb.h
> > +++ b/kernel/rcu/tree_nocb.h
> > @@ -546,6 +546,7 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
> > lazy_len = READ_ONCE(rdp->lazy_len);
> > if (was_alldone) {
> > rdp->qlen_last_fqs_check = len;
> > + rdp->nocb_gp_wake_attempt = true;
> > rcu_nocb_unlock(rdp);
> > // Only lazy CBs in bypass list
> > if (lazy_len && bypass_len == lazy_len) {
> > @@ -563,7 +564,8 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
> >
> > return;
> > } else if (len > rdp->qlen_last_fqs_check + qhimark) {
> > - /* ... or if many callbacks queued. */
> > + /* Callback overload condition. */
> > + WARN_ON_ONCE(!rdp->nocb_gp_wake_attempt);
> > rdp->qlen_last_fqs_check = len;
> > j = jiffies;
> > if (j != rdp->nocb_gp_adv_time &&
> > @@ -688,6 +690,7 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
> > bypass_ncbs > 2 * qhimark)) {
> > flush_bypass = true;
> > } else if (!bypass_ncbs && rcu_segcblist_empty(&rdp->cblist)) {
> > + rdp->nocb_gp_wake_attempt = false;
>
> This is when nocb_cb_wait() is done with callbacks but nocb_gp_wait() is done
> with them sooner, when the grace period is done for all pending callbacks.
>
> Something like this would perhaps be more accurate:
>
> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> index e6cd56603cad..52010cbeaa76 100644
> --- a/kernel/rcu/tree_nocb.h
> +++ b/kernel/rcu/tree_nocb.h
> @@ -746,6 +746,8 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
> needwait_gp = true;
> trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
> TPS("NeedWaitGP"));
> + } else if (!rcu_cblist_n_cbs(&rdp->nocb_bypass)) {
> + rdp->nocb_gp_wake_attempt = false;
> }
Hmm, I am trying to understand why this suggestion is better than what I
already have. It is one extra line and adds another conditional.
Also shouldn't it be:
} else if (!rcu_cblist_n_cbs(&rdp->nocb_bypass) &&
rcu_segcblist_empty(&rdp->cblist)) {
rdp->nocb_gp_wake_attempt = false;
}
?
My goal was to mark wake_attempt as false when ALL callbacks on the rdp were
drained. IOW, the GP thread is done with the rdp.
> if (rcu_segcblist_ready_cbs(&rdp->cblist)) {
> needwake = rdp->nocb_cb_sleep;
>
>
> > rcu_nocb_unlock_irqrestore(rdp, flags);
> > continue; /* No callbacks here, try next. */
> > }
> > @@ -1254,6 +1257,7 @@ lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > continue;
> > }
> > rcu_nocb_try_flush_bypass(rdp, jiffies);
> > + rdp->nocb_gp_wake_attempt = true;
>
> Same here, we should expect rdp->nocb_gp_wake_attempt to be already on since
> there are lazy callbacks. That's a good opportunity to test the related assertion
> though.
Good point! I will turn it into a WARN.
Btw, I have more patches coming to simplify nocb_gp_wait()... it is quite long :)
thanks,
- Joel
* Re: [PATCH -next 4/8] rcu/nocb: Add warning if no rcuog wake up attempt happened during overload
2026-01-09 3:49 ` Joel Fernandes
@ 2026-01-09 14:36 ` Frederic Weisbecker
2026-01-09 21:20 ` Joel Fernandes
0 siblings, 1 reply; 38+ messages in thread
From: Frederic Weisbecker @ 2026-01-09 14:36 UTC (permalink / raw)
To: Joel Fernandes
Cc: Joel Fernandes, Paul E . McKenney, Boqun Feng, rcu,
Neeraj Upadhyay, Josh Triplett, Uladzislau Rezki, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, Zqiang, Shuah Khan,
linux-kernel, linux-kselftest
Le Thu, Jan 08, 2026 at 10:49:30PM -0500, Joel Fernandes a écrit :
> > > @@ -688,6 +690,7 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
> > > bypass_ncbs > 2 * qhimark)) {
> > > flush_bypass = true;
> > > } else if (!bypass_ncbs && rcu_segcblist_empty(&rdp->cblist)) {
> > > + rdp->nocb_gp_wake_attempt = false;
> >
> > This is when nocb_cb_wait() is done with callbacks but nocb_gp_wait() is done
> > with them sooner, when the grace period is done for all pending callbacks.
> >
> > Something like this would perhaps be more accurate:
> >
> > diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> > index e6cd56603cad..52010cbeaa76 100644
> > --- a/kernel/rcu/tree_nocb.h
> > +++ b/kernel/rcu/tree_nocb.h
> > @@ -746,6 +746,8 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
> > needwait_gp = true;
> > trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
> > TPS("NeedWaitGP"));
> > + } else if (!rcu_cblist_n_cbs(&rdp->nocb_bypass)) {
> > + rdp->nocb_gp_wake_attempt = false;
> > }
>
> Hmm, I am trying to understand why this suggestion is better than what I
> already have. It is one extra line and adds another conditional.
>
> Also shouldn't it be:
>
> } else if (!rcu_cblist_n_cbs(&rdp->nocb_bypass) &&
> rcu_segcblist_empty(&rdp->cblist)) {
> rdp->nocb_gp_wake_attempt = false;
> }
>
> ?
This else already means that rcu_segcblist_nextgp() returned false because there
are no pending callbacks. rcu_segcblist_empty() is different because it also tests
done callbacks.
>
> My goal was to mark wake_attempt as false when ALL callbacks on the rdp were
> drained. IOW, the GP thread is done with the rdp.
So nocb_gp_wait (the rcuog kthreads) handle the pending callbacks, advancing
them throughout grace periods until they are moved to the done state.
But once they are in the done state, the callbacks become the responsibility of
nocb_cb_wait() (the rcuo kthreads) and nocb_gp_wait() stops paying attention
to that rdp if there are no more pending callbacks.
So with your initial patch, you still have rdp->nocb_gp_wake_attempt == true
even when there are only done callbacks. But without an appropriate wake-up
after a subsequent enqueue, nocb_gp_wait() may ignore new callbacks, even though
rdp->nocb_gp_wake_attempt == true suggests otherwise.
> Btw, I have more patches coming to simplify nocb_gp_wait()... it is quite long
> :)
Cool :)
Thanks.
--
Frederic Weisbecker
SUSE Labs
* Re: [PATCH -next 4/8] rcu/nocb: Add warning if no rcuog wake up attempt happened during overload
2026-01-09 14:36 ` Frederic Weisbecker
@ 2026-01-09 21:20 ` Joel Fernandes
0 siblings, 0 replies; 38+ messages in thread
From: Joel Fernandes @ 2026-01-09 21:20 UTC (permalink / raw)
To: Frederic Weisbecker, Joel Fernandes
Cc: Paul E . McKenney, Boqun Feng, rcu, Neeraj Upadhyay,
Josh Triplett, Uladzislau Rezki, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, Zqiang, Shuah Khan,
linux-kernel, linux-kselftest
On 1/9/2026 9:36 AM, Frederic Weisbecker wrote:
> Le Thu, Jan 08, 2026 at 10:49:30PM -0500, Joel Fernandes a écrit :
>>>> @@ -688,6 +690,7 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
>>>> bypass_ncbs > 2 * qhimark)) {
>>>> flush_bypass = true;
>>>> } else if (!bypass_ncbs && rcu_segcblist_empty(&rdp->cblist)) {
>>>> + rdp->nocb_gp_wake_attempt = false;
>>>
>>> This is when nocb_cb_wait() is done with callbacks but nocb_gp_wait() is done
>>> with them sooner, when the grace period is done for all pending callbacks.
>>>
>>> Something like this would perhaps be more accurate:
>>>
>>> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
>>> index e6cd56603cad..52010cbeaa76 100644
>>> --- a/kernel/rcu/tree_nocb.h
>>> +++ b/kernel/rcu/tree_nocb.h
>>> @@ -746,6 +746,8 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
>>> needwait_gp = true;
>>> trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
>>> TPS("NeedWaitGP"));
>>> + } else if (!rcu_cblist_n_cbs(&rdp->nocb_bypass)) {
>>> + rdp->nocb_gp_wake_attempt = false;
>>> }
>>
>> Hmm, I am trying to understand why this suggestion is better than what I
>> already have. It is one extra line and adds another conditional.
>>
>> Also shouldn't it be:
>>
>> } else if (!rcu_cblist_n_cbs(&rdp->nocb_bypass) &&
>> rcu_segcblist_empty(&rdp->cblist)) {
>> rdp->nocb_gp_wake_attempt = false;
>> }
>>
>> ?
>
> This else already means that rcu_segcblist_nextgp() returned false because there
> are no pending callbacks. rcu_segcblist_empty() is different because it also tests
> done callbacks.
>
>>
>> My goal was to mark wake_attempt as false when ALL callbacks on the rdp were
>> drained. IOW, the GP thread is done with the rdp.
>
> So nocb_gp_wait() (the rcuog kthreads) handles the pending callbacks, advancing
> them through grace periods until they are moved to the done state.
>
> But once they are marked as done, the callbacks are the responsibility of
> nocb_cb_wait() (the rcuo kthreads), and nocb_gp_wait() stops paying attention
> to that rdp if there are no more pending callbacks.
>
> So with your initial patch, you still have rdp->nocb_gp_wake_attempt == true
> even when there are only done callbacks. But without an appropriate wake-up
> after a subsequent enqueue, nocb_gp_wait() may ignore new callbacks, even though
> rdp->nocb_gp_wake_attempt == true suggests otherwise.
Ah, got it! I was conflating the act of waking up rcuog with that of waking up
both the rcuog and rcuop/s threads. Your suggestion is more accurate, so I will
do it that way instead. Thanks for the explanations!
- Joel
^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH -next 5/8] rcu/nocb: Add warning to detect if overload advancement is ever useful
2026-01-01 16:34 [PATCH -next 0/8] RCU updates from me for next merge window Joel Fernandes
` (3 preceding siblings ...)
2026-01-01 16:34 ` [PATCH -next 4/8] rcu/nocb: Add warning if no rcuog wake up attempt happened during overload Joel Fernandes
@ 2026-01-01 16:34 ` Joel Fernandes
2026-01-14 1:09 ` Joel Fernandes
2026-01-01 16:34 ` [PATCH -next 6/8] rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early Joel Fernandes
` (3 subsequent siblings)
8 siblings, 1 reply; 38+ messages in thread
From: Joel Fernandes @ 2026-01-01 16:34 UTC (permalink / raw)
To: Paul E . McKenney, Boqun Feng, rcu
Cc: Frederic Weisbecker, Neeraj Upadhyay, Josh Triplett,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Shuah Khan, linux-kernel, linux-kselftest,
Joel Fernandes
During callback overload, the NOCB code attempts an opportunistic
advancement via rcu_advance_cbs_nowake().
Analysis via tracing with a flood of 300,000 callbacks shows this
optimization is likely dead code:
- 30 overload conditions triggered
- 0 advancements actually occurred
- 100% of the time, no advancement because the current GP was not done
I also ran TREE05 and TREE08 for 2 hours and could not trigger it.
When callbacks overflow (exceed qhimark), they are waiting for a grace
period that hasn't completed yet. The optimization requires the GP to be
complete to advance callbacks, but the overload condition itself is
caused by callbacks piling up faster than GPs can complete. This creates
a logical contradiction where the advancement cannot happen.
In *theory* this might be possible: the GP could complete just in the nick
of time as we hit the overload. But this is so rare that it can be
considered impossible, given that we cannot even hit it with synthetic
callback flooding. Trying to advance is therefore a waste of cycles, let
alone useful, and it is maintenance complexity we don't need.
I suggest deletion. However, out of extreme caution, add a WARN_ON_ONCE for
a merge window or two and delete it afterward.
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
kernel/rcu/tree_nocb.h | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index 7e9d465c8ab1..d3e6a0e77210 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -571,8 +571,20 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
if (j != rdp->nocb_gp_adv_time &&
rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq) &&
rcu_seq_done(&rdp->mynode->gp_seq, cur_gp_seq)) {
+ long done_before = rcu_segcblist_get_seglen(&rdp->cblist, RCU_DONE_TAIL);
+
rcu_advance_cbs_nowake(rdp->mynode, rdp);
rdp->nocb_gp_adv_time = j;
+
+ /*
+ * The advance_cbs call above is not useful. Under an
+ * overload condition, nocb_gp_wait() is always waiting
+ * for GP completion, so nothing can be moved from
+ * WAIT to DONE in the list. WARN if an advancement
+ * happened (the next step is deletion of the advance).
+ */
+ WARN_ON_ONCE(rcu_segcblist_get_seglen(&rdp->cblist,
+ RCU_DONE_TAIL) > done_before);
}
}
--
2.34.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH -next 5/8] rcu/nocb: Add warning to detect if overload advancement is ever useful
2026-01-01 16:34 ` [PATCH -next 5/8] rcu/nocb: Add warning to detect if overload advancement is ever useful Joel Fernandes
@ 2026-01-14 1:09 ` Joel Fernandes
0 siblings, 0 replies; 38+ messages in thread
From: Joel Fernandes @ 2026-01-14 1:09 UTC (permalink / raw)
To: Paul E . McKenney, Boqun Feng, rcu
Cc: Frederic Weisbecker, Neeraj Upadhyay, Josh Triplett,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Shuah Khan, linux-kernel, linux-kselftest
Since I am resubmitting the nocb patches in this series (3 of them from this
series) for the next merge window, I thought I'd replace this particular patch
with just a deletion of the rcu_advance_cbs_nowake() call itself instead of
bloating the code path with warnings and comments.
linux-next and many days of testing on my side are also looking good.
Thoughts? Once I get any opinions, I'll change this patch to do the deletion.
Also I am adding one other (trivial) patch to this series:
https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=nocb-7.0&id=84669d678b9cb28ff8774a3b6457186a4a187c75
Running overnight tests on all 4 patches now...
thanks,
- Joel
On 1/1/2026 11:34 AM, Joel Fernandes wrote:
> During callback overload, the NOCB code attempts an opportunistic
> advancement via rcu_advance_cbs_nowake().
>
> Analysis via tracing with a flood of 300,000 callbacks shows this
> optimization is likely dead code:
> - 30 overload conditions triggered
> - 0 advancements actually occurred
> - 100% of the time, no advancement because the current GP was not done
>
> I also ran TREE05 and TREE08 for 2 hours and could not trigger it.
>
> When callbacks overflow (exceed qhimark), they are waiting for a grace
> period that hasn't completed yet. The optimization requires the GP to be
> complete to advance callbacks, but the overload condition itself is
> caused by callbacks piling up faster than GPs can complete. This creates
> a logical contradiction where the advancement cannot happen.
>
> In *theory* this might be possible: the GP could complete just in the nick
> of time as we hit the overload. But this is so rare that it can be
> considered impossible, given that we cannot even hit it with synthetic
> callback flooding. Trying to advance is therefore a waste of cycles, let
> alone useful, and it is maintenance complexity we don't need.
>
> I suggest deletion. However, out of extreme caution, add a WARN_ON_ONCE for
> a merge window or two and delete it afterward.
>
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> ---
> kernel/rcu/tree_nocb.h | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> index 7e9d465c8ab1..d3e6a0e77210 100644
> --- a/kernel/rcu/tree_nocb.h
> +++ b/kernel/rcu/tree_nocb.h
> @@ -571,8 +571,20 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
> if (j != rdp->nocb_gp_adv_time &&
> rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq) &&
> rcu_seq_done(&rdp->mynode->gp_seq, cur_gp_seq)) {
> + long done_before = rcu_segcblist_get_seglen(&rdp->cblist, RCU_DONE_TAIL);
> +
> rcu_advance_cbs_nowake(rdp->mynode, rdp);
> rdp->nocb_gp_adv_time = j;
> +
> + /*
> + * The advance_cbs call above is not useful. Under an
> + * overload condition, nocb_gp_wait() is always waiting
> + * for GP completion, so nothing can be moved from
> + * WAIT to DONE in the list. WARN if an advancement
> + * happened (the next step is deletion of the advance).
> + */
> + WARN_ON_ONCE(rcu_segcblist_get_seglen(&rdp->cblist,
> + RCU_DONE_TAIL) > done_before);
> }
> }
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH -next 6/8] rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early
2026-01-01 16:34 [PATCH -next 0/8] RCU updates from me for next merge window Joel Fernandes
` (4 preceding siblings ...)
2026-01-01 16:34 ` [PATCH -next 5/8] rcu/nocb: Add warning to detect if overload advancement is ever useful Joel Fernandes
@ 2026-01-01 16:34 ` Joel Fernandes
2026-01-01 16:34 ` [PATCH -next 7/8] rcutorture: Prevent concurrent kvm.sh runs on same source tree Joel Fernandes
` (2 subsequent siblings)
8 siblings, 0 replies; 38+ messages in thread
From: Joel Fernandes @ 2026-01-01 16:34 UTC (permalink / raw)
To: Paul E . McKenney, Boqun Feng, rcu
Cc: Frederic Weisbecker, Neeraj Upadhyay, Josh Triplett,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Shuah Khan, linux-kernel, linux-kselftest,
Joel Fernandes
The RCU grace period mechanism uses a two-phase FQS (Force Quiescent
State) design where the first FQS saves dyntick-idle snapshots and
the second FQS compares them. This results in long, unnecessary latency
for synchronize_rcu() on idle systems (two FQS waits of ~3ms each at
1000HZ) even when one FQS wait would have sufficed.
Investigation showed that the GP kthread's CPU is often the holdout CPU
after the first FQS: it cannot be detected as "idle" because it is
actively running the FQS scan in the GP kthread.
Therefore, at the end of rcu_gp_init(), immediately report a quiescent
state for the GP kthread's CPU using rcu_qs() + rcu_report_qs_rdp(). The
GP kthread cannot be in an RCU read-side critical section while running
GP initialization, so this is safe and results in significant latency
improvements.
The following tests were performed:
(1) synchronize_rcu() benchmarking
100 synchronize_rcu() calls with 32 CPUs, 10 runs each (default fqs
jiffies settings):
Baseline (without fix):
| Run | Mean | Min | Max |
|-----|-----------|----------|-----------|
| 1 | 10.088 ms | 9.989 ms | 18.848 ms |
| 2 | 10.064 ms | 9.982 ms | 16.470 ms |
| 3 | 10.051 ms | 9.988 ms | 15.113 ms |
| 4 | 10.125 ms | 9.929 ms | 22.411 ms |
| 5 | 8.695 ms | 5.996 ms | 15.471 ms |
| 6 | 10.157 ms | 9.977 ms | 25.723 ms |
| 7 | 10.102 ms | 9.990 ms | 20.224 ms |
| 8 | 8.050 ms | 5.985 ms | 10.007 ms |
| 9 | 10.059 ms | 9.978 ms | 15.934 ms |
| 10 | 10.077 ms | 9.984 ms | 17.703 ms |
With fix:
| Run | Mean | Min | Max |
|-----|----------|----------|-----------|
| 1 | 6.027 ms | 5.915 ms | 8.589 ms |
| 2 | 6.032 ms | 5.984 ms | 9.241 ms |
| 3 | 6.010 ms | 5.986 ms | 7.004 ms |
| 4 | 6.076 ms | 5.993 ms | 10.001 ms |
| 5 | 6.084 ms | 5.893 ms | 10.250 ms |
| 6 | 6.034 ms | 5.908 ms | 9.456 ms |
| 7 | 6.051 ms | 5.993 ms | 10.000 ms |
| 8 | 6.057 ms | 5.941 ms | 10.001 ms |
| 9 | 6.016 ms | 5.927 ms | 7.540 ms |
| 10 | 6.036 ms | 5.993 ms | 9.579 ms |
Summary:
- Mean latency: 9.75 ms -> 6.04 ms (38% improvement)
- Max latency: 25.72 ms -> 10.25 ms (60% improvement)
(2) Bridge setup/teardown latency (Uladzislau Rezki)
x86_64 with 64 CPUs, 100 iterations of bridge add/configure/delete:
real time
1 - default: 24.221s
2 - this patch: 20.754s (14% faster)
3 - this patch + wake_from_gp: 15.895s (34% faster)
4 - wake_from_gp only: 18.947s (22% faster)
Per-synchronize_rcu() latency (in usec):
1 2 3 4
median: 37249.5 31540.5 15765 22480
min: 7881 7918 9803 7857
max: 63651 55639 31861 32040
This patch combined with rcu_normal_wake_from_gp reduces bridge
setup/teardown time from 24 seconds to 16 seconds.
(3) CPU overhead verification (Uladzislau Rezki)
System CPU time across 5 runs showed no measurable increase:
default: 1.698s - 1.937s
this patch: 1.667s - 1.930s
Conclusion: variations are within noise, no CPU overhead regression.
(4) rcutorture
Tested TREE and SRCU configurations - no regressions.
Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org>
Tested-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
kernel/rcu/tree.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 78c045a5ef03..b7c818cabe44 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -160,6 +160,7 @@ static void rcu_report_qs_rnp(unsigned long mask, struct rcu_node *rnp,
unsigned long gps, unsigned long flags);
static void invoke_rcu_core(void);
static void rcu_report_exp_rdp(struct rcu_data *rdp);
+static void rcu_report_qs_rdp(struct rcu_data *rdp);
static void check_cb_ovld_locked(struct rcu_data *rdp, struct rcu_node *rnp);
static bool rcu_rdp_is_offloaded(struct rcu_data *rdp);
static bool rcu_rdp_cpu_online(struct rcu_data *rdp);
@@ -1983,6 +1984,17 @@ static noinline_for_stack bool rcu_gp_init(void)
if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD))
on_each_cpu(rcu_strict_gp_boundary, NULL, 0);
+ /*
+ * Immediately report QS for the GP kthread's CPU. The GP kthread
+ * cannot be in an RCU read-side critical section while running
+ * the FQS scan. This eliminates the need for a second FQS wait
+ * when all CPUs are idle.
+ */
+ preempt_disable();
+ rcu_qs();
+ rcu_report_qs_rdp(this_cpu_ptr(&rcu_data));
+ preempt_enable();
+
return true;
}
--
2.34.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH -next 7/8] rcutorture: Prevent concurrent kvm.sh runs on same source tree
2026-01-01 16:34 [PATCH -next 0/8] RCU updates from me for next merge window Joel Fernandes
` (5 preceding siblings ...)
2026-01-01 16:34 ` [PATCH -next 6/8] rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early Joel Fernandes
@ 2026-01-01 16:34 ` Joel Fernandes
2026-01-01 16:34 ` [PATCH -next 8/8] rcutorture: Add --kill-previous option to terminate previous kvm.sh runs Joel Fernandes
2026-01-04 10:55 ` [PATCH -next 0/8] RCU updates from me for next merge window Boqun Feng
8 siblings, 0 replies; 38+ messages in thread
From: Joel Fernandes @ 2026-01-01 16:34 UTC (permalink / raw)
To: Paul E . McKenney, Boqun Feng, rcu
Cc: Frederic Weisbecker, Neeraj Upadhyay, Josh Triplett,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Shuah Khan, linux-kernel, linux-kselftest,
Joel Fernandes
Add flock-based locking to kvm.sh to prevent multiple instances from
running concurrently on the same source tree. This prevents build
failures caused by one instance's "make clean" deleting generated files
while another instance is building.
The lock file is placed in the rcutorture directory and added to
.gitignore.
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
tools/testing/selftests/rcutorture/.gitignore | 1 +
tools/testing/selftests/rcutorture/bin/kvm.sh | 17 +++++++++++++++++
2 files changed, 18 insertions(+)
diff --git a/tools/testing/selftests/rcutorture/.gitignore b/tools/testing/selftests/rcutorture/.gitignore
index f6cbce77460b..b8fd42547a6e 100644
--- a/tools/testing/selftests/rcutorture/.gitignore
+++ b/tools/testing/selftests/rcutorture/.gitignore
@@ -3,3 +3,4 @@ initrd
b[0-9]*
res
*.swp
+.kvm.sh.lock
diff --git a/tools/testing/selftests/rcutorture/bin/kvm.sh b/tools/testing/selftests/rcutorture/bin/kvm.sh
index fff15821c44c..d1fbd092e22a 100755
--- a/tools/testing/selftests/rcutorture/bin/kvm.sh
+++ b/tools/testing/selftests/rcutorture/bin/kvm.sh
@@ -275,6 +275,23 @@ do
shift
done
+# Prevent concurrent kvm.sh runs on the same source tree. The flock
+# is automatically released when the script exits, even if killed.
+TORTURE_LOCK="$RCUTORTURE/.kvm.sh.lock"
+if test -z "$dryrun"
+then
+ # Create a file descriptor and flock it, so that when kvm.sh (and its
+ # children) exit, the flock is released by the kernel automatically.
+ exec 9>"$TORTURE_LOCK"
+ if ! flock -n 9
+ then
+ echo "ERROR: Another kvm.sh instance is already running on this tree."
+ echo " Lock file: $TORTURE_LOCK"
+ echo " To run kvm.sh, kill all existing kvm.sh runs first."
+ exit 1
+ fi
+fi
+
if test -n "$dryrun" || test -z "$TORTURE_INITRD" || tools/testing/selftests/rcutorture/bin/mkinitrd.sh
then
:
--
2.34.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH -next 8/8] rcutorture: Add --kill-previous option to terminate previous kvm.sh runs
2026-01-01 16:34 [PATCH -next 0/8] RCU updates from me for next merge window Joel Fernandes
` (6 preceding siblings ...)
2026-01-01 16:34 ` [PATCH -next 7/8] rcutorture: Prevent concurrent kvm.sh runs on same source tree Joel Fernandes
@ 2026-01-01 16:34 ` Joel Fernandes
2026-01-01 22:48 ` Paul E. McKenney
2026-01-04 10:55 ` [PATCH -next 0/8] RCU updates from me for next merge window Boqun Feng
8 siblings, 1 reply; 38+ messages in thread
From: Joel Fernandes @ 2026-01-01 16:34 UTC (permalink / raw)
To: Paul E . McKenney, Boqun Feng, rcu
Cc: Frederic Weisbecker, Neeraj Upadhyay, Josh Triplett,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Shuah Khan, linux-kernel, linux-kselftest,
Joel Fernandes
When kvm.sh is killed, its child processes (make, gcc, qemu, etc.) may
continue running while still holding the flock. This prevents new kvm.sh
instances from starting even though the parent is gone.
Add a --kill-previous option that uses fuser(1) to terminate all
processes holding the flock file before attempting to acquire it. This
provides a clean way to recover from stale/zombie kvm.sh runs, which
may leave many qemu and compiler processes still running and disturbing
the system.
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
tools/testing/selftests/rcutorture/bin/kvm.sh | 25 ++++++++++++++++++-
1 file changed, 24 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/rcutorture/bin/kvm.sh b/tools/testing/selftests/rcutorture/bin/kvm.sh
index d1fbd092e22a..65b04b832733 100755
--- a/tools/testing/selftests/rcutorture/bin/kvm.sh
+++ b/tools/testing/selftests/rcutorture/bin/kvm.sh
@@ -80,6 +80,7 @@ usage () {
echo " --kasan"
echo " --kconfig Kconfig-options"
echo " --kcsan"
+ echo " --kill-previous"
echo " --kmake-arg kernel-make-arguments"
echo " --mac nn:nn:nn:nn:nn:nn"
echo " --memory megabytes|nnnG"
@@ -206,6 +207,9 @@ do
--kcsan)
TORTURE_KCONFIG_KCSAN_ARG="$debuginfo CONFIG_KCSAN=y CONFIG_KCSAN_STRICT=y CONFIG_KCSAN_REPORT_ONCE_IN_MS=100000 CONFIG_KCSAN_VERBOSE=y CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y"; export TORTURE_KCONFIG_KCSAN_ARG
;;
+ --kill-previous)
+ TORTURE_KILL_PREVIOUS=1
+ ;;
--kmake-arg|--kmake-args)
checkarg --kmake-arg "(kernel make arguments)" $# "$2" '.*' '^error$'
TORTURE_KMAKE_ARG="`echo "$TORTURE_KMAKE_ARG $2" | sed -e 's/^ *//' -e 's/ *$//'`"
@@ -278,6 +282,25 @@ done
# Prevent concurrent kvm.sh runs on the same source tree. The flock
# is automatically released when the script exits, even if killed.
TORTURE_LOCK="$RCUTORTURE/.kvm.sh.lock"
+
+# Terminate any processes holding the lock file, if requested.
+if test -n "$TORTURE_KILL_PREVIOUS"
+then
+ if test -e "$TORTURE_LOCK"
+ then
+ echo "Killing processes holding $TORTURE_LOCK..."
+ if fuser -k "$TORTURE_LOCK" >/dev/null 2>&1
+ then
+ sleep 2
+ echo "Previous kvm.sh processes killed."
+ else
+ echo "No processes were holding the lock."
+ fi
+ else
+ echo "No lock file exists, nothing to kill."
+ fi
+fi
+
if test -z "$dryrun"
then
# Create a file descriptor and flock it, so that when kvm.sh (and its
@@ -287,7 +310,7 @@ then
then
echo "ERROR: Another kvm.sh instance is already running on this tree."
echo " Lock file: $TORTURE_LOCK"
- echo " To run kvm.sh, kill all existing kvm.sh runs first."
+ echo " To run kvm.sh, kill all existing kvm.sh runs first (--kill-previous)."
exit 1
fi
fi
--
2.34.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH -next 8/8] rcutorture: Add --kill-previous option to terminate previous kvm.sh runs
2026-01-01 16:34 ` [PATCH -next 8/8] rcutorture: Add --kill-previous option to terminate previous kvm.sh runs Joel Fernandes
@ 2026-01-01 22:48 ` Paul E. McKenney
0 siblings, 0 replies; 38+ messages in thread
From: Paul E. McKenney @ 2026-01-01 22:48 UTC (permalink / raw)
To: Joel Fernandes
Cc: Boqun Feng, rcu, Frederic Weisbecker, Neeraj Upadhyay,
Josh Triplett, Uladzislau Rezki, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, Zqiang, Shuah Khan,
linux-kernel, linux-kselftest
On Thu, Jan 01, 2026 at 11:34:17AM -0500, Joel Fernandes wrote:
> When kvm.sh is killed, its child processes (make, gcc, qemu, etc.) may
> continue running while still holding the flock. This prevents new kvm.sh
> instances from starting even though the parent is gone.
>
> Add a --kill-previous option that uses fuser(1) to terminate all
> processes holding the flock file before attempting to acquire it. This
> provides a clean way to recover from stale/zombie kvm.sh runs, which
> may leave many qemu and compiler processes still running and disturbing
> the system.
>
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
For #7 and #8:
Tested-by: Paul E. McKenney <paulmck@kernel.org>
One way to kill the current/previous run without starting a new one is
to use the --dryrun argument.
Thanx, Paul
> ---
> tools/testing/selftests/rcutorture/bin/kvm.sh | 25 ++++++++++++++++++-
> 1 file changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/rcutorture/bin/kvm.sh b/tools/testing/selftests/rcutorture/bin/kvm.sh
> index d1fbd092e22a..65b04b832733 100755
> --- a/tools/testing/selftests/rcutorture/bin/kvm.sh
> +++ b/tools/testing/selftests/rcutorture/bin/kvm.sh
> @@ -80,6 +80,7 @@ usage () {
> echo " --kasan"
> echo " --kconfig Kconfig-options"
> echo " --kcsan"
> + echo " --kill-previous"
> echo " --kmake-arg kernel-make-arguments"
> echo " --mac nn:nn:nn:nn:nn:nn"
> echo " --memory megabytes|nnnG"
> @@ -206,6 +207,9 @@ do
> --kcsan)
> TORTURE_KCONFIG_KCSAN_ARG="$debuginfo CONFIG_KCSAN=y CONFIG_KCSAN_STRICT=y CONFIG_KCSAN_REPORT_ONCE_IN_MS=100000 CONFIG_KCSAN_VERBOSE=y CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y"; export TORTURE_KCONFIG_KCSAN_ARG
> ;;
> + --kill-previous)
> + TORTURE_KILL_PREVIOUS=1
> + ;;
> --kmake-arg|--kmake-args)
> checkarg --kmake-arg "(kernel make arguments)" $# "$2" '.*' '^error$'
> TORTURE_KMAKE_ARG="`echo "$TORTURE_KMAKE_ARG $2" | sed -e 's/^ *//' -e 's/ *$//'`"
> @@ -278,6 +282,25 @@ done
> # Prevent concurrent kvm.sh runs on the same source tree. The flock
> # is automatically released when the script exits, even if killed.
> TORTURE_LOCK="$RCUTORTURE/.kvm.sh.lock"
> +
> +# Terminate any processes holding the lock file, if requested.
> +if test -n "$TORTURE_KILL_PREVIOUS"
> +then
> + if test -e "$TORTURE_LOCK"
> + then
> + echo "Killing processes holding $TORTURE_LOCK..."
> + if fuser -k "$TORTURE_LOCK" >/dev/null 2>&1
> + then
> + sleep 2
> + echo "Previous kvm.sh processes killed."
> + else
> + echo "No processes were holding the lock."
> + fi
> + else
> + echo "No lock file exists, nothing to kill."
> + fi
> +fi
> +
> if test -z "$dryrun"
> then
> # Create a file descriptor and flock it, so that when kvm.sh (and its
> @@ -287,7 +310,7 @@ then
> then
> echo "ERROR: Another kvm.sh instance is already running on this tree."
> echo " Lock file: $TORTURE_LOCK"
> - echo " To run kvm.sh, kill all existing kvm.sh runs first."
> + echo " To run kvm.sh, kill all existing kvm.sh runs first (--kill-previous)."
> exit 1
> fi
> fi
> --
> 2.34.1
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -next 0/8] RCU updates from me for next merge window
2026-01-01 16:34 [PATCH -next 0/8] RCU updates from me for next merge window Joel Fernandes
` (7 preceding siblings ...)
2026-01-01 16:34 ` [PATCH -next 8/8] rcutorture: Add --kill-previous option to terminate previous kvm.sh runs Joel Fernandes
@ 2026-01-04 10:55 ` Boqun Feng
8 siblings, 0 replies; 38+ messages in thread
From: Boqun Feng @ 2026-01-04 10:55 UTC (permalink / raw)
To: Joel Fernandes
Cc: Paul E . McKenney, rcu, Frederic Weisbecker, Neeraj Upadhyay,
Josh Triplett, Uladzislau Rezki, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, Zqiang, Shuah Khan,
linux-kernel, linux-kselftest
On Thu, Jan 01, 2026 at 11:34:09AM -0500, Joel Fernandes wrote:
> This series contains RCU fixes and improvements intended for the next
> merge window. The nocb patches have had one round of review but still
> need review tags.
>
Queued for more tests and reviews, thanks!
Regards,
Boqun
> - Updated commit messages for clarity based on review feedback
>
> - Testing:
> All rcutorture scenarios tested successfully for 2 hours on:
> 144-core ARM64 NVIDIA Grace (aarch64)
> 128-core AMD EPYC (x86_64)
>
> Link to tag:
> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/tag/?h=rcu-next-v1-20260101
>
[...]
^ permalink raw reply [flat|nested] 38+ messages in thread