[PATCH -rcu -next 1/7] smp: Document preemption and stop

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH -rcu -next 1/7] smp: Document preemption and stop_machine() mutual exclusion
@ 2025-07-08 14:22 Joel Fernandes
  2025-07-08 14:22 ` [PATCH -rcu -next 2/7] rcu: Fix rcu_read_unlock() deadloop due to IRQ work Joel Fernandes
                   ` (5 more replies)
  0 siblings, 6 replies; 17+ messages in thread
From: Joel Fernandes @ 2025-07-08 14:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes, Andrea Righi, Paul E . McKenney,
	Frederic Weisbecker, Peter Zijlstra, rcu

Recently while revising RCU's cpu online checks, there was some discussion
around how IPIs synchronize with hotplug.

Add comments explaining how preemption disable creates mutual exclusion with
CPU hotplug's stop_machine mechanism. The key insight is that stop_machine()
atomically updates CPU masks and flushes IPIs with interrupts disabled, and
cannot proceed while any CPU (including the IPI sender) has preemption
disabled.

[ Apply peterz feedback. ]

Cc: Andrea Righi <arighi@nvidia.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: rcu@vger.kernel.org
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Co-developed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
Just reduced LOC of that first paragraph by indent, and some minor nits from
Peter. Removed the comments in stop machine and kept few comments in smp.c.

 kernel/smp.c | 26 +++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 974f3a3962e8..23d51a8e582d 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -86,13 +86,15 @@ int smpcfd_dead_cpu(unsigned int cpu)
 int smpcfd_dying_cpu(unsigned int cpu)
 {
 	/*
-	 * The IPIs for the smp-call-function callbacks queued by other
-	 * CPUs might arrive late, either due to hardware latencies or
-	 * because this CPU disabled interrupts (inside stop-machine)
-	 * before the IPIs were sent. So flush out any pending callbacks
-	 * explicitly (without waiting for the IPIs to arrive), to
-	 * ensure that the outgoing CPU doesn't go offline with work
-	 * still pending.
+	 * The IPIs for the smp-call-function callbacks queued by other CPUs
+	 * might arrive late, either due to hardware latencies or because this
+	 * CPU disabled interrupts (inside stop-machine) before the IPIs were
+	 * sent. So flush out any pending callbacks explicitly (without waiting
+	 * for the IPIs to arrive), to ensure that the outgoing CPU doesn't go
+	 * offline with work still pending.
+	 *
+	 * This runs with interrupts disabled inside the stopper task invoked by
+	 * stop_machine(), ensuring mutually exclusive CPU offlining and IPI flush.
 	 */
 	__flush_smp_call_function_queue(false);
 	irq_work_run();
@@ -418,6 +420,10 @@ void __smp_call_single_queue(int cpu, struct llist_node *node)
  */
 static int generic_exec_single(int cpu, call_single_data_t *csd)
 {
+	/*
+	 * Preemption already disabled here so stopper cannot run on this CPU,
+	 * ensuring mutually exclusive CPU offlining and last IPI flush.
+	 */
 	if (cpu == smp_processor_id()) {
 		smp_call_func_t func = csd->func;
 		void *info = csd->info;
@@ -638,8 +644,10 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
 	int err;
 
 	/*
-	 * prevent preemption and reschedule on another processor,
-	 * as well as CPU removal
+	 * Prevent preemption and reschedule on another CPU, as well as CPU
+	 * removal. This prevents stopper from running on this CPU, thus
+	 * providing mutual exclusion of the below cpu_online() check and
+	 * IPI sending ensuring IPI are not missed by CPU going offline.
 	 */
 	this_cpu = get_cpu();
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH -rcu -next 2/7] rcu: Fix rcu_read_unlock() deadloop due to IRQ work
  2025-07-08 14:22 [PATCH -rcu -next 1/7] smp: Document preemption and stop_machine() mutual exclusion Joel Fernandes
@ 2025-07-08 14:22 ` Joel Fernandes
  2025-07-09  0:00   ` Paul E. McKenney
  2025-07-08 14:22 ` [PATCH -rcu -next 3/7] rcu: Refactor expedited handling check in rcu_read_unlock_special() Joel Fernandes
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Joel Fernandes @ 2025-07-08 14:22 UTC (permalink / raw)
  To: linux-kernel, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang
  Cc: Xiongfeng Wang, Qi Xi, rcu, bpf

During rcu_read_unlock_special(), if this happens during irq_exit(), we
can lockup if an IPI is issued. This is because the IPI itself triggers
the irq_exit() path causing a recursive lock up.

This is precisely what Xiongfeng found when invoking a BPF program on
the trace_tick_stop() tracepoint As shown in the trace below. Fix by
managing the irq_work state correctly.

irq_exit()
  __irq_exit_rcu()
    /* in_hardirq() returns false after this */
    preempt_count_sub(HARDIRQ_OFFSET)
    tick_irq_exit()
      tick_nohz_irq_exit()
	    tick_nohz_stop_sched_tick()
	      trace_tick_stop()  /* a bpf prog is hooked on this trace point */
		   __bpf_trace_tick_stop()
		      bpf_trace_run2()
			    rcu_read_unlock_special()
                              /* will send a IPI to itself */
			      irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);

A simple reproducer can also be obtained by doing the following in
tick_irq_exit(). It will hang on boot without the patch:

  static inline void tick_irq_exit(void)
  {
 +	rcu_read_lock();
 +	WRITE_ONCE(current->rcu_read_unlock_special.b.need_qs, true);
 +	rcu_read_unlock();
 +

Reported-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Closes: https://lore.kernel.org/all/9acd5f9f-6732-7701-6880-4b51190aa070@huawei.com/
Tested-by: Qi Xi <xiqi2@huawei.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/rcu/tree.h        | 11 ++++++++++-
 kernel/rcu/tree_plugin.h | 23 +++++++++++++++++++----
 2 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 3830c19cf2f6..f8f612269e6e 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -174,6 +174,15 @@ struct rcu_snap_record {
 	unsigned long   jiffies;	/* Track jiffies value */
 };
 
+/*
+ * The IRQ work (deferred_qs_iw) is used by RCU to get scheduler's attention.
+ * It can be in one of the following states:
+ * - DEFER_QS_IDLE: An IRQ work was never scheduled.
+ * - DEFER_QS_PENDING: An IRQ work was scheduler but never run.
+ */
+#define DEFER_QS_IDLE		0
+#define DEFER_QS_PENDING	1
+
 /* Per-CPU data for read-copy update. */
 struct rcu_data {
 	/* 1) quiescent-state and grace-period handling : */
@@ -192,7 +201,7 @@ struct rcu_data {
 					/*  during and after the last grace */
 					/* period it is aware of. */
 	struct irq_work defer_qs_iw;	/* Obtain later scheduler attention. */
-	bool defer_qs_iw_pending;	/* Scheduler attention pending? */
+	int defer_qs_iw_pending;	/* Scheduler attention pending? */
 	struct work_struct strict_work;	/* Schedule readers for strict GPs. */
 
 	/* 2) batch handling */
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index dd1c156c1759..fa7b0d854833 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -486,13 +486,16 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
 	struct rcu_node *rnp;
 	union rcu_special special;
 
+	rdp = this_cpu_ptr(&rcu_data);
+	if (rdp->defer_qs_iw_pending == DEFER_QS_PENDING)
+		rdp->defer_qs_iw_pending = DEFER_QS_IDLE;
+
 	/*
 	 * If RCU core is waiting for this CPU to exit its critical section,
 	 * report the fact that it has exited.  Because irqs are disabled,
 	 * t->rcu_read_unlock_special cannot change.
 	 */
 	special = t->rcu_read_unlock_special;
-	rdp = this_cpu_ptr(&rcu_data);
 	if (!special.s && !rdp->cpu_no_qs.b.exp) {
 		local_irq_restore(flags);
 		return;
@@ -628,7 +631,18 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
 
 	rdp = container_of(iwp, struct rcu_data, defer_qs_iw);
 	local_irq_save(flags);
-	rdp->defer_qs_iw_pending = false;
+
+	/*
+	 * Requeue the IRQ work on next unlock in following situation:
+	 * 1. rcu_read_unlock() queues IRQ work (state -> DEFER_QS_PENDING)
+	 * 2. CPU enters new rcu_read_lock()
+	 * 3. IRQ work runs but cannot report QS due to rcu_preempt_depth() > 0
+	 * 4. rcu_read_unlock() does not re-queue work (state still PENDING)
+	 * 5. Deferred QS reporting does not happen.
+	 */
+	if (rcu_preempt_depth() > 0)
+		WRITE_ONCE(rdp->defer_qs_iw_pending, DEFER_QS_IDLE);
+
 	local_irq_restore(flags);
 }
 
@@ -675,7 +689,8 @@ static void rcu_read_unlock_special(struct task_struct *t)
 			set_tsk_need_resched(current);
 			set_preempt_need_resched();
 			if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
-			    expboost && !rdp->defer_qs_iw_pending && cpu_online(rdp->cpu)) {
+			    expboost && rdp->defer_qs_iw_pending != DEFER_QS_PENDING &&
+			    cpu_online(rdp->cpu)) {
 				// Get scheduler to re-evaluate and call hooks.
 				// If !IRQ_WORK, FQS scan will eventually IPI.
 				if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) &&
@@ -685,7 +700,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
 				else
 					init_irq_work(&rdp->defer_qs_iw,
 						      rcu_preempt_deferred_qs_handler);
-				rdp->defer_qs_iw_pending = true;
+				rdp->defer_qs_iw_pending = DEFER_QS_PENDING;
 				irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
 			}
 		}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH -rcu -next 2/7] rcu: Fix rcu_read_unlock() deadloop due to IRQ work
  2025-07-08 14:22 ` [PATCH -rcu -next 2/7] rcu: Fix rcu_read_unlock() deadloop due to IRQ work Joel Fernandes
@ 2025-07-09  0:00   ` Paul E. McKenney
  0 siblings, 0 replies; 17+ messages in thread
From: Paul E. McKenney @ 2025-07-09  0:00 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Frederic Weisbecker, Neeraj Upadhyay, Josh Triplett,
	Boqun Feng, Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Xiongfeng Wang, Qi Xi, rcu, bpf

On Tue, Jul 08, 2025 at 10:22:19AM -0400, Joel Fernandes wrote:
> During rcu_read_unlock_special(), if this happens during irq_exit(), we
> can lockup if an IPI is issued. This is because the IPI itself triggers
> the irq_exit() path causing a recursive lock up.
> 
> This is precisely what Xiongfeng found when invoking a BPF program on
> the trace_tick_stop() tracepoint As shown in the trace below. Fix by
> managing the irq_work state correctly.
> 
> irq_exit()
>   __irq_exit_rcu()
>     /* in_hardirq() returns false after this */
>     preempt_count_sub(HARDIRQ_OFFSET)
>     tick_irq_exit()
>       tick_nohz_irq_exit()
> 	    tick_nohz_stop_sched_tick()
> 	      trace_tick_stop()  /* a bpf prog is hooked on this trace point */
> 		   __bpf_trace_tick_stop()
> 		      bpf_trace_run2()
> 			    rcu_read_unlock_special()
>                               /* will send a IPI to itself */
> 			      irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
> 
> A simple reproducer can also be obtained by doing the following in
> tick_irq_exit(). It will hang on boot without the patch:
> 
>   static inline void tick_irq_exit(void)
>   {
>  +	rcu_read_lock();
>  +	WRITE_ONCE(current->rcu_read_unlock_special.b.need_qs, true);
>  +	rcu_read_unlock();
>  +
> 
> Reported-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
> Closes: https://lore.kernel.org/all/9acd5f9f-6732-7701-6880-4b51190aa070@huawei.com/
> Tested-by: Qi Xi <xiqi2@huawei.com>
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

> ---
>  kernel/rcu/tree.h        | 11 ++++++++++-
>  kernel/rcu/tree_plugin.h | 23 +++++++++++++++++++----
>  2 files changed, 29 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index 3830c19cf2f6..f8f612269e6e 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -174,6 +174,15 @@ struct rcu_snap_record {
>  	unsigned long   jiffies;	/* Track jiffies value */
>  };
>  
> +/*
> + * The IRQ work (deferred_qs_iw) is used by RCU to get scheduler's attention.
> + * It can be in one of the following states:
> + * - DEFER_QS_IDLE: An IRQ work was never scheduled.
> + * - DEFER_QS_PENDING: An IRQ work was scheduler but never run.
> + */
> +#define DEFER_QS_IDLE		0
> +#define DEFER_QS_PENDING	1
> +
>  /* Per-CPU data for read-copy update. */
>  struct rcu_data {
>  	/* 1) quiescent-state and grace-period handling : */
> @@ -192,7 +201,7 @@ struct rcu_data {
>  					/*  during and after the last grace */
>  					/* period it is aware of. */
>  	struct irq_work defer_qs_iw;	/* Obtain later scheduler attention. */
> -	bool defer_qs_iw_pending;	/* Scheduler attention pending? */
> +	int defer_qs_iw_pending;	/* Scheduler attention pending? */
>  	struct work_struct strict_work;	/* Schedule readers for strict GPs. */
>  
>  	/* 2) batch handling */
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index dd1c156c1759..fa7b0d854833 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -486,13 +486,16 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
>  	struct rcu_node *rnp;
>  	union rcu_special special;
>  
> +	rdp = this_cpu_ptr(&rcu_data);
> +	if (rdp->defer_qs_iw_pending == DEFER_QS_PENDING)
> +		rdp->defer_qs_iw_pending = DEFER_QS_IDLE;
> +
>  	/*
>  	 * If RCU core is waiting for this CPU to exit its critical section,
>  	 * report the fact that it has exited.  Because irqs are disabled,
>  	 * t->rcu_read_unlock_special cannot change.
>  	 */
>  	special = t->rcu_read_unlock_special;
> -	rdp = this_cpu_ptr(&rcu_data);
>  	if (!special.s && !rdp->cpu_no_qs.b.exp) {
>  		local_irq_restore(flags);
>  		return;
> @@ -628,7 +631,18 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
>  
>  	rdp = container_of(iwp, struct rcu_data, defer_qs_iw);
>  	local_irq_save(flags);
> -	rdp->defer_qs_iw_pending = false;
> +
> +	/*
> +	 * Requeue the IRQ work on next unlock in following situation:
> +	 * 1. rcu_read_unlock() queues IRQ work (state -> DEFER_QS_PENDING)
> +	 * 2. CPU enters new rcu_read_lock()
> +	 * 3. IRQ work runs but cannot report QS due to rcu_preempt_depth() > 0
> +	 * 4. rcu_read_unlock() does not re-queue work (state still PENDING)
> +	 * 5. Deferred QS reporting does not happen.
> +	 */
> +	if (rcu_preempt_depth() > 0)
> +		WRITE_ONCE(rdp->defer_qs_iw_pending, DEFER_QS_IDLE);
> +
>  	local_irq_restore(flags);
>  }
>  
> @@ -675,7 +689,8 @@ static void rcu_read_unlock_special(struct task_struct *t)
>  			set_tsk_need_resched(current);
>  			set_preempt_need_resched();
>  			if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
> -			    expboost && !rdp->defer_qs_iw_pending && cpu_online(rdp->cpu)) {
> +			    expboost && rdp->defer_qs_iw_pending != DEFER_QS_PENDING &&
> +			    cpu_online(rdp->cpu)) {
>  				// Get scheduler to re-evaluate and call hooks.
>  				// If !IRQ_WORK, FQS scan will eventually IPI.
>  				if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) &&
> @@ -685,7 +700,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
>  				else
>  					init_irq_work(&rdp->defer_qs_iw,
>  						      rcu_preempt_deferred_qs_handler);
> -				rdp->defer_qs_iw_pending = true;
> +				rdp->defer_qs_iw_pending = DEFER_QS_PENDING;
>  				irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
>  			}
>  		}
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH -rcu -next 3/7] rcu: Refactor expedited handling check in rcu_read_unlock_special()
  2025-07-08 14:22 [PATCH -rcu -next 1/7] smp: Document preemption and stop_machine() mutual exclusion Joel Fernandes
  2025-07-08 14:22 ` [PATCH -rcu -next 2/7] rcu: Fix rcu_read_unlock() deadloop due to IRQ work Joel Fernandes
@ 2025-07-08 14:22 ` Joel Fernandes
  2025-07-09  0:06   ` Paul E. McKenney
  2025-07-08 14:22 ` [PATCH -rcu -next 4/7] rcu: Remove redundant check for irq state during unlock Joel Fernandes
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Joel Fernandes @ 2025-07-08 14:22 UTC (permalink / raw)
  To: linux-kernel, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Sebastian Andrzej Siewior, Clark Williams
  Cc: rcu, linux-rt-devel

Extract the complex expedited handling condition in rcu_read_unlock_special()
into a separate function rcu_unlock_needs_exp_handling() with detailed
comments explaining each condition.

This improves code readability. No functional change intended.

Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/rcu/tree_plugin.h | 83 +++++++++++++++++++++++++++++++++++-----
 1 file changed, 74 insertions(+), 9 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index fa7b0d854833..e20c17163c13 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -646,6 +646,75 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
 	local_irq_restore(flags);
 }
 
+/*
+ * Check if expedited grace period processing during unlock is needed.
+ *
+ * This function determines whether expedited handling is required based on:
+ * 1. Task blocking an expedited grace period (based on a heuristic, could be
+ *    false-positive, see below.)
+ * 2. CPU participating in an expedited grace period
+ * 3. Strict grace period mode requiring expedited handling
+ * 4. RCU priority deboosting needs when interrupts were disabled
+ *
+ * @t: The task being checked
+ * @rdp: The per-CPU RCU data
+ * @rnp: The RCU node for this CPU
+ * @irqs_were_disabled: Whether interrupts were disabled before rcu_read_unlock()
+ *
+ * Returns true if expedited processing of the rcu_read_unlock() is needed.
+ */
+static bool rcu_unlock_needs_exp_handling(struct task_struct *t,
+				      struct rcu_data *rdp,
+				      struct rcu_node *rnp,
+				      bool irqs_were_disabled)
+{
+	/*
+	 * Check if this task is blocking an expedited grace period. If the
+	 * task was preempted within an RCU read-side critical section and is
+	 * on the expedited grace period blockers list (exp_tasks), we need
+	 * expedited handling to unblock the expedited GP. This is not an exact
+	 * check because 't' might not be on the exp_tasks list at all - its
+	 * just a fast heuristic that can be false-positive sometimes.
+	 */
+	if (t->rcu_blocked_node && READ_ONCE(t->rcu_blocked_node->exp_tasks))
+		return true;
+
+	/*
+	 * Check if this CPU is participating in an expedited grace period.
+	 * The expmask bitmap tracks which CPUs need to check in for the
+	 * current expedited GP. If our CPU's bit is set, we need expedited
+	 * handling to help complete the expedited GP.
+	 */
+	if (rdp->grpmask & READ_ONCE(rnp->expmask))
+		return true;
+
+	/*
+	 * In CONFIG_RCU_STRICT_GRACE_PERIOD=y kernels, all grace periods
+	 * are treated as short for testing purposes even if that means
+	 * disturbing the system more. Check if either:
+	 * - This CPU has not yet reported a quiescent state, or
+	 * - This task was preempted within an RCU critical section
+	 * In either case, require expedited handling for strict GP mode.
+	 */
+	if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) &&
+	    ((rdp->grpmask & READ_ONCE(rnp->qsmask)) || t->rcu_blocked_node))
+		return true;
+
+	/*
+	 * RCU priority boosting case: If a task is subject to RCU priority
+	 * boosting and exits an RCU read-side critical section with interrupts
+	 * disabled, we need expedited handling to ensure timely deboosting.
+	 * Without this, a low-priority task could incorrectly run at high
+	 * real-time priority for an extended period degrading real-time
+	 * responsiveness. This applies to all CONFIG_RCU_BOOST=y kernels,
+	 * not just to PREEMPT_RT.
+	 */
+	if (IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled && t->rcu_blocked_node)
+		return true;
+
+	return false;
+}
+
 /*
  * Handle special cases during rcu_read_unlock(), such as needing to
  * notify RCU core processing or task having blocked during the RCU
@@ -665,18 +734,14 @@ static void rcu_read_unlock_special(struct task_struct *t)
 	local_irq_save(flags);
 	irqs_were_disabled = irqs_disabled_flags(flags);
 	if (preempt_bh_were_disabled || irqs_were_disabled) {
-		bool expboost; // Expedited GP in flight or possible boosting.
+		bool needs_exp; // Expedited handling needed.
 		struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
 		struct rcu_node *rnp = rdp->mynode;
 
-		expboost = (t->rcu_blocked_node && READ_ONCE(t->rcu_blocked_node->exp_tasks)) ||
-			   (rdp->grpmask & READ_ONCE(rnp->expmask)) ||
-			   (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) &&
-			   ((rdp->grpmask & READ_ONCE(rnp->qsmask)) || t->rcu_blocked_node)) ||
-			   (IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled &&
-			    t->rcu_blocked_node);
+		needs_exp = rcu_unlock_needs_exp_handling(t, rdp, rnp, irqs_were_disabled);
+	
 		// Need to defer quiescent state until everything is enabled.
-		if (use_softirq && (in_hardirq() || (expboost && !irqs_were_disabled))) {
+		if (use_softirq && (in_hardirq() || (needs_exp && !irqs_were_disabled))) {
 			// Using softirq, safe to awaken, and either the
 			// wakeup is free or there is either an expedited
 			// GP in flight or a potential need to deboost.
@@ -689,7 +754,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
 			set_tsk_need_resched(current);
 			set_preempt_need_resched();
 			if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
-			    expboost && rdp->defer_qs_iw_pending != DEFER_QS_PENDING &&
+			    needs_exp && rdp->defer_qs_iw_pending != DEFER_QS_PENDING &&
 			    cpu_online(rdp->cpu)) {
 				// Get scheduler to re-evaluate and call hooks.
 				// If !IRQ_WORK, FQS scan will eventually IPI.
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH -rcu -next 3/7] rcu: Refactor expedited handling check in rcu_read_unlock_special()
  2025-07-08 14:22 ` [PATCH -rcu -next 3/7] rcu: Refactor expedited handling check in rcu_read_unlock_special() Joel Fernandes
@ 2025-07-09  0:06   ` Paul E. McKenney
  0 siblings, 0 replies; 17+ messages in thread
From: Paul E. McKenney @ 2025-07-09  0:06 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Frederic Weisbecker, Neeraj Upadhyay, Josh Triplett,
	Boqun Feng, Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Sebastian Andrzej Siewior, Clark Williams,
	rcu, linux-rt-devel

On Tue, Jul 08, 2025 at 10:22:20AM -0400, Joel Fernandes wrote:
> Extract the complex expedited handling condition in rcu_read_unlock_special()
> into a separate function rcu_unlock_needs_exp_handling() with detailed
> comments explaining each condition.
> 
> This improves code readability. No functional change intended.
> 
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

> ---
>  kernel/rcu/tree_plugin.h | 83 +++++++++++++++++++++++++++++++++++-----
>  1 file changed, 74 insertions(+), 9 deletions(-)
> 
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index fa7b0d854833..e20c17163c13 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -646,6 +646,75 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
>  	local_irq_restore(flags);
>  }
>  
> +/*
> + * Check if expedited grace period processing during unlock is needed.
> + *
> + * This function determines whether expedited handling is required based on:
> + * 1. Task blocking an expedited grace period (based on a heuristic, could be
> + *    false-positive, see below.)
> + * 2. CPU participating in an expedited grace period
> + * 3. Strict grace period mode requiring expedited handling
> + * 4. RCU priority deboosting needs when interrupts were disabled
> + *
> + * @t: The task being checked
> + * @rdp: The per-CPU RCU data
> + * @rnp: The RCU node for this CPU
> + * @irqs_were_disabled: Whether interrupts were disabled before rcu_read_unlock()
> + *
> + * Returns true if expedited processing of the rcu_read_unlock() is needed.
> + */
> +static bool rcu_unlock_needs_exp_handling(struct task_struct *t,
> +				      struct rcu_data *rdp,
> +				      struct rcu_node *rnp,
> +				      bool irqs_were_disabled)
> +{
> +	/*
> +	 * Check if this task is blocking an expedited grace period. If the
> +	 * task was preempted within an RCU read-side critical section and is
> +	 * on the expedited grace period blockers list (exp_tasks), we need
> +	 * expedited handling to unblock the expedited GP. This is not an exact
> +	 * check because 't' might not be on the exp_tasks list at all - its
> +	 * just a fast heuristic that can be false-positive sometimes.
> +	 */
> +	if (t->rcu_blocked_node && READ_ONCE(t->rcu_blocked_node->exp_tasks))
> +		return true;
> +
> +	/*
> +	 * Check if this CPU is participating in an expedited grace period.
> +	 * The expmask bitmap tracks which CPUs need to check in for the
> +	 * current expedited GP. If our CPU's bit is set, we need expedited
> +	 * handling to help complete the expedited GP.
> +	 */
> +	if (rdp->grpmask & READ_ONCE(rnp->expmask))
> +		return true;
> +
> +	/*
> +	 * In CONFIG_RCU_STRICT_GRACE_PERIOD=y kernels, all grace periods
> +	 * are treated as short for testing purposes even if that means
> +	 * disturbing the system more. Check if either:
> +	 * - This CPU has not yet reported a quiescent state, or
> +	 * - This task was preempted within an RCU critical section
> +	 * In either case, require expedited handling for strict GP mode.
> +	 */
> +	if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) &&
> +	    ((rdp->grpmask & READ_ONCE(rnp->qsmask)) || t->rcu_blocked_node))
> +		return true;
> +
> +	/*
> +	 * RCU priority boosting case: If a task is subject to RCU priority
> +	 * boosting and exits an RCU read-side critical section with interrupts
> +	 * disabled, we need expedited handling to ensure timely deboosting.
> +	 * Without this, a low-priority task could incorrectly run at high
> +	 * real-time priority for an extended period degrading real-time
> +	 * responsiveness. This applies to all CONFIG_RCU_BOOST=y kernels,
> +	 * not just to PREEMPT_RT.
> +	 */
> +	if (IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled && t->rcu_blocked_node)
> +		return true;
> +
> +	return false;
> +}
> +
>  /*
>   * Handle special cases during rcu_read_unlock(), such as needing to
>   * notify RCU core processing or task having blocked during the RCU
> @@ -665,18 +734,14 @@ static void rcu_read_unlock_special(struct task_struct *t)
>  	local_irq_save(flags);
>  	irqs_were_disabled = irqs_disabled_flags(flags);
>  	if (preempt_bh_were_disabled || irqs_were_disabled) {
> -		bool expboost; // Expedited GP in flight or possible boosting.
> +		bool needs_exp; // Expedited handling needed.
>  		struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
>  		struct rcu_node *rnp = rdp->mynode;
>  
> -		expboost = (t->rcu_blocked_node && READ_ONCE(t->rcu_blocked_node->exp_tasks)) ||
> -			   (rdp->grpmask & READ_ONCE(rnp->expmask)) ||
> -			   (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) &&
> -			   ((rdp->grpmask & READ_ONCE(rnp->qsmask)) || t->rcu_blocked_node)) ||
> -			   (IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled &&
> -			    t->rcu_blocked_node);
> +		needs_exp = rcu_unlock_needs_exp_handling(t, rdp, rnp, irqs_were_disabled);
> +	
>  		// Need to defer quiescent state until everything is enabled.
> -		if (use_softirq && (in_hardirq() || (expboost && !irqs_were_disabled))) {
> +		if (use_softirq && (in_hardirq() || (needs_exp && !irqs_were_disabled))) {
>  			// Using softirq, safe to awaken, and either the
>  			// wakeup is free or there is either an expedited
>  			// GP in flight or a potential need to deboost.
> @@ -689,7 +754,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
>  			set_tsk_need_resched(current);
>  			set_preempt_need_resched();
>  			if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
> -			    expboost && rdp->defer_qs_iw_pending != DEFER_QS_PENDING &&
> +			    needs_exp && rdp->defer_qs_iw_pending != DEFER_QS_PENDING &&
>  			    cpu_online(rdp->cpu)) {
>  				// Get scheduler to re-evaluate and call hooks.
>  				// If !IRQ_WORK, FQS scan will eventually IPI.
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH -rcu -next 4/7] rcu: Remove redundant check for irq state during unlock
  2025-07-08 14:22 [PATCH -rcu -next 1/7] smp: Document preemption and stop_machine() mutual exclusion Joel Fernandes
  2025-07-08 14:22 ` [PATCH -rcu -next 2/7] rcu: Fix rcu_read_unlock() deadloop due to IRQ work Joel Fernandes
  2025-07-08 14:22 ` [PATCH -rcu -next 3/7] rcu: Refactor expedited handling check in rcu_read_unlock_special() Joel Fernandes
@ 2025-07-08 14:22 ` Joel Fernandes
  2025-07-11  0:00   ` Paul E. McKenney
  2025-07-08 14:22 ` [PATCH -rcu -next 5/7] rcu: Document GP init vs hotplug-scan ordering requirements Joel Fernandes
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Joel Fernandes @ 2025-07-08 14:22 UTC (permalink / raw)
  To: linux-kernel, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang
  Cc: rcu

The check for irqs_were_disabled is redundant in
rcu_unlock_needs_exp_handling() as the caller already checks for this.
This includes the boost case as well. Just remove the redundant check.

This is a first win for the refactor of the needs_exp (formerly
expboost) condition into a new rcu_unlock_needs_exp_handling() function,
as the conditions became more easier to read.

Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/rcu/tree_plugin.h | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index e20c17163c13..7d03d81e45f6 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -659,14 +659,12 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
  * @t: The task being checked
  * @rdp: The per-CPU RCU data
  * @rnp: The RCU node for this CPU
- * @irqs_were_disabled: Whether interrupts were disabled before rcu_read_unlock()
  *
  * Returns true if expedited processing of the rcu_read_unlock() is needed.
  */
 static bool rcu_unlock_needs_exp_handling(struct task_struct *t,
 				      struct rcu_data *rdp,
-				      struct rcu_node *rnp,
-				      bool irqs_were_disabled)
+				      struct rcu_node *rnp)
 {
 	/*
 	 * Check if this task is blocking an expedited grace period. If the
@@ -702,14 +700,14 @@ static bool rcu_unlock_needs_exp_handling(struct task_struct *t,
 
 	/*
 	 * RCU priority boosting case: If a task is subject to RCU priority
-	 * boosting and exits an RCU read-side critical section with interrupts
-	 * disabled, we need expedited handling to ensure timely deboosting.
-	 * Without this, a low-priority task could incorrectly run at high
-	 * real-time priority for an extended period degrading real-time
+	 * boosting and exits an RCU read-side critical section, we need
+	 * expedited handling to ensure timely deboosting. Without this,
+	 * a low-priority task could incorrectly run at high real-time
+	 * priority for an extended period degrading real-time
 	 * responsiveness. This applies to all CONFIG_RCU_BOOST=y kernels,
 	 * not just to PREEMPT_RT.
 	 */
-	if (IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled && t->rcu_blocked_node)
+	if (IS_ENABLED(CONFIG_RCU_BOOST) && t->rcu_blocked_node)
 		return true;
 
 	return false;
@@ -738,7 +736,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
 		struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
 		struct rcu_node *rnp = rdp->mynode;
 
-		needs_exp = rcu_unlock_needs_exp_handling(t, rdp, rnp, irqs_were_disabled);
+		needs_exp = rcu_unlock_needs_exp_handling(t, rdp, rnp);
 	
 		// Need to defer quiescent state until everything is enabled.
 		if (use_softirq && (in_hardirq() || (needs_exp && !irqs_were_disabled))) {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH -rcu -next 4/7] rcu: Remove redundant check for irq state during unlock
  2025-07-08 14:22 ` [PATCH -rcu -next 4/7] rcu: Remove redundant check for irq state during unlock Joel Fernandes
@ 2025-07-11  0:00   ` Paul E. McKenney
  2025-07-11 16:30     ` Joel Fernandes
  0 siblings, 1 reply; 17+ messages in thread
From: Paul E. McKenney @ 2025-07-11  0:00 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Frederic Weisbecker, Neeraj Upadhyay, Josh Triplett,
	Boqun Feng, Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, rcu

On Tue, Jul 08, 2025 at 10:22:21AM -0400, Joel Fernandes wrote:
> The check for irqs_were_disabled is redundant in
> rcu_unlock_needs_exp_handling() as the caller already checks for this.
> This includes the boost case as well. Just remove the redundant check.

But in the very last "if" statement in the context of the last hunk of
this patch, isn't it instead checking for !irqs_were_disabled?

Or is there something that I am missing that makes this work out OK?

							Thanx, Paul

> This is a first win for the refactor of the needs_exp (formerly
> expboost) condition into a new rcu_unlock_needs_exp_handling() function,
> as the conditions became more easier to read.
> 
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> ---
>  kernel/rcu/tree_plugin.h | 16 +++++++---------
>  1 file changed, 7 insertions(+), 9 deletions(-)
> 
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index e20c17163c13..7d03d81e45f6 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -659,14 +659,12 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
>   * @t: The task being checked
>   * @rdp: The per-CPU RCU data
>   * @rnp: The RCU node for this CPU
> - * @irqs_were_disabled: Whether interrupts were disabled before rcu_read_unlock()
>   *
>   * Returns true if expedited processing of the rcu_read_unlock() is needed.
>   */
>  static bool rcu_unlock_needs_exp_handling(struct task_struct *t,
>  				      struct rcu_data *rdp,
> -				      struct rcu_node *rnp,
> -				      bool irqs_were_disabled)
> +				      struct rcu_node *rnp)
>  {
>  	/*
>  	 * Check if this task is blocking an expedited grace period. If the
> @@ -702,14 +700,14 @@ static bool rcu_unlock_needs_exp_handling(struct task_struct *t,
>  
>  	/*
>  	 * RCU priority boosting case: If a task is subject to RCU priority
> -	 * boosting and exits an RCU read-side critical section with interrupts
> -	 * disabled, we need expedited handling to ensure timely deboosting.
> -	 * Without this, a low-priority task could incorrectly run at high
> -	 * real-time priority for an extended period degrading real-time
> +	 * boosting and exits an RCU read-side critical section, we need
> +	 * expedited handling to ensure timely deboosting. Without this,
> +	 * a low-priority task could incorrectly run at high real-time
> +	 * priority for an extended period degrading real-time
>  	 * responsiveness. This applies to all CONFIG_RCU_BOOST=y kernels,
>  	 * not just to PREEMPT_RT.
>  	 */
> -	if (IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled && t->rcu_blocked_node)
> +	if (IS_ENABLED(CONFIG_RCU_BOOST) && t->rcu_blocked_node)
>  		return true;
>  
>  	return false;
> @@ -738,7 +736,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
>  		struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
>  		struct rcu_node *rnp = rdp->mynode;
>  
> -		needs_exp = rcu_unlock_needs_exp_handling(t, rdp, rnp, irqs_were_disabled);
> +		needs_exp = rcu_unlock_needs_exp_handling(t, rdp, rnp);
>  	
>  		// Need to defer quiescent state until everything is enabled.
>  		if (use_softirq && (in_hardirq() || (needs_exp && !irqs_were_disabled))) {
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH -rcu -next 4/7] rcu: Remove redundant check for irq state during unlock
  2025-07-11  0:00   ` Paul E. McKenney
@ 2025-07-11 16:30     ` Joel Fernandes
  2025-07-11 16:39       ` Joel Fernandes
  2025-07-11 17:18       ` Paul E. McKenney
  0 siblings, 2 replies; 17+ messages in thread
From: Joel Fernandes @ 2025-07-11 16:30 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, Frederic Weisbecker, Neeraj Upadhyay, Josh Triplett,
	Boqun Feng, Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, rcu



On 7/10/2025 8:00 PM, Paul E. McKenney wrote:
> On Tue, Jul 08, 2025 at 10:22:21AM -0400, Joel Fernandes wrote:
>> The check for irqs_were_disabled is redundant in
>> rcu_unlock_needs_exp_handling() as the caller already checks for this.
>> This includes the boost case as well. Just remove the redundant check.
> 
> But in the very last "if" statement in the context of the last hunk of
> this patch, isn't it instead checking for !irqs_were_disabled?
> 
> Or is there something that I am missing that makes this work out OK?

You are right, after going over all the cases I introduced a behavioral change.

Say, irqs_were_disabled was false. And say we are RCU-boosted. needs_exp might
return true now, but previously it was returning false. Further say, we are not
in hardirq.

New code will raise softirq, old code would go to the ELSE and just set:
                set_tsk_need_resched(current);
set_preempt_need_resched();

But preempt_bh_were_disabled that's why we're here.

So we need to only set resched and not raise softirq, because the preempt enable
would reschedule.

Sorry I missed this, but unless this behavior is correct or needs further work,
lets drop this patch. Thanks and kudos on the catch!

 - Joel

> 
> 							Thanx, Paul
> 
>> This is a first win for the refactor of the needs_exp (formerly
>> expboost) condition into a new rcu_unlock_needs_exp_handling() function,
>> as the conditions became more easier to read.
>>
>> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
>> ---
>>  kernel/rcu/tree_plugin.h | 16 +++++++---------
>>  1 file changed, 7 insertions(+), 9 deletions(-)
>>
>> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
>> index e20c17163c13..7d03d81e45f6 100644
>> --- a/kernel/rcu/tree_plugin.h
>> +++ b/kernel/rcu/tree_plugin.h
>> @@ -659,14 +659,12 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
>>   * @t: The task being checked
>>   * @rdp: The per-CPU RCU data
>>   * @rnp: The RCU node for this CPU
>> - * @irqs_were_disabled: Whether interrupts were disabled before rcu_read_unlock()
>>   *
>>   * Returns true if expedited processing of the rcu_read_unlock() is needed.
>>   */
>>  static bool rcu_unlock_needs_exp_handling(struct task_struct *t,
>>  				      struct rcu_data *rdp,
>> -				      struct rcu_node *rnp,
>> -				      bool irqs_were_disabled)
>> +				      struct rcu_node *rnp)
>>  {
>>  	/*
>>  	 * Check if this task is blocking an expedited grace period. If the
>> @@ -702,14 +700,14 @@ static bool rcu_unlock_needs_exp_handling(struct task_struct *t,
>>  
>>  	/*
>>  	 * RCU priority boosting case: If a task is subject to RCU priority
>> -	 * boosting and exits an RCU read-side critical section with interrupts
>> -	 * disabled, we need expedited handling to ensure timely deboosting.
>> -	 * Without this, a low-priority task could incorrectly run at high
>> -	 * real-time priority for an extended period degrading real-time
>> +	 * boosting and exits an RCU read-side critical section, we need
>> +	 * expedited handling to ensure timely deboosting. Without this,
>> +	 * a low-priority task could incorrectly run at high real-time
>> +	 * priority for an extended period degrading real-time
>>  	 * responsiveness. This applies to all CONFIG_RCU_BOOST=y kernels,
>>  	 * not just to PREEMPT_RT.
>>  	 */
>> -	if (IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled && t->rcu_blocked_node)
>> +	if (IS_ENABLED(CONFIG_RCU_BOOST) && t->rcu_blocked_node)
>>  		return true;
>>  
>>  	return false;
>> @@ -738,7 +736,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
>>  		struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
>>  		struct rcu_node *rnp = rdp->mynode;
>>  
>> -		needs_exp = rcu_unlock_needs_exp_handling(t, rdp, rnp, irqs_were_disabled);
>> +		needs_exp = rcu_unlock_needs_exp_handling(t, rdp, rnp);
>>  	
>>  		// Need to defer quiescent state until everything is enabled.
>>  		if (use_softirq && (in_hardirq() || (needs_exp && !irqs_were_disabled))) {
>> -- 
>> 2.34.1
>>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH -rcu -next 4/7] rcu: Remove redundant check for irq state during unlock
  2025-07-11 16:30     ` Joel Fernandes
@ 2025-07-11 16:39       ` Joel Fernandes
  2025-07-11 17:18       ` Paul E. McKenney
  1 sibling, 0 replies; 17+ messages in thread
From: Joel Fernandes @ 2025-07-11 16:39 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, Frederic Weisbecker, Neeraj Upadhyay, Josh Triplett,
	Boqun Feng, Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, rcu



On 7/11/2025 12:30 PM, Joel Fernandes wrote:
> 
> 
> On 7/10/2025 8:00 PM, Paul E. McKenney wrote:
>> On Tue, Jul 08, 2025 at 10:22:21AM -0400, Joel Fernandes wrote:
>>> The check for irqs_were_disabled is redundant in
>>> rcu_unlock_needs_exp_handling() as the caller already checks for this.
>>> This includes the boost case as well. Just remove the redundant check.
>>
>> But in the very last "if" statement in the context of the last hunk of
>> this patch, isn't it instead checking for !irqs_were_disabled?
>>
>> Or is there something that I am missing that makes this work out OK?
> 
> You are right, after going over all the cases I introduced a behavioral change.
> 
> Say, irqs_were_disabled was false. And say we are RCU-boosted. needs_exp might
> return true now, but previously it was returning false. Further say, we are not
> in hardirq.
> 
> New code will raise softirq, old code would go to the ELSE and just set:
>                 set_tsk_need_resched(current);
> set_preempt_need_resched();
> 
> But preempt_bh_were_disabled that's why we're here.
> 
> So we need to only set resched and not raise softirq, because the preempt enable
> would reschedule.
> 
> Sorry I missed this, but unless this behavior is correct or needs further work,
> lets drop this patch. Thanks and kudos on the catch!
Btw, Neeraj, the related patch 3/7 should still be good for applying (which has
Paul's Review tag).

Thanks.



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH -rcu -next 4/7] rcu: Remove redundant check for irq state during unlock
  2025-07-11 16:30     ` Joel Fernandes
  2025-07-11 16:39       ` Joel Fernandes
@ 2025-07-11 17:18       ` Paul E. McKenney
  2025-07-11 18:19         ` Joel Fernandes
  1 sibling, 1 reply; 17+ messages in thread
From: Paul E. McKenney @ 2025-07-11 17:18 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Frederic Weisbecker, Neeraj Upadhyay, Josh Triplett,
	Boqun Feng, Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, rcu

On Fri, Jul 11, 2025 at 12:30:08PM -0400, Joel Fernandes wrote:
> 
> 
> On 7/10/2025 8:00 PM, Paul E. McKenney wrote:
> > On Tue, Jul 08, 2025 at 10:22:21AM -0400, Joel Fernandes wrote:
> >> The check for irqs_were_disabled is redundant in
> >> rcu_unlock_needs_exp_handling() as the caller already checks for this.
> >> This includes the boost case as well. Just remove the redundant check.
> > 
> > But in the very last "if" statement in the context of the last hunk of
> > this patch, isn't it instead checking for !irqs_were_disabled?
> > 
> > Or is there something that I am missing that makes this work out OK?
> 
> You are right, after going over all the cases I introduced a behavioral change.
> 
> Say, irqs_were_disabled was false. And say we are RCU-boosted. needs_exp might
> return true now, but previously it was returning false. Further say, we are not
> in hardirq.
> 
> New code will raise softirq, old code would go to the ELSE and just set:
>                 set_tsk_need_resched(current);
> set_preempt_need_resched();
> 
> But preempt_bh_were_disabled that's why we're here.
> 
> So we need to only set resched and not raise softirq, because the preempt enable
> would reschedule.
> 
> Sorry I missed this, but unless this behavior is correct or needs further work,
> lets drop this patch. Thanks and kudos on the catch!

Not a problem!

You can use cbmc to check these sorts of transformations, as described
here: https://paulmck.livejournal.com/38997.html

							Thanx, Paul

>  - Joel
> 
> > 
> > 							Thanx, Paul
> > 
> >> This is a first win for the refactor of the needs_exp (formerly
> >> expboost) condition into a new rcu_unlock_needs_exp_handling() function,
> >> as the conditions became more easier to read.
> >>
> >> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> >> ---
> >>  kernel/rcu/tree_plugin.h | 16 +++++++---------
> >>  1 file changed, 7 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> >> index e20c17163c13..7d03d81e45f6 100644
> >> --- a/kernel/rcu/tree_plugin.h
> >> +++ b/kernel/rcu/tree_plugin.h
> >> @@ -659,14 +659,12 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
> >>   * @t: The task being checked
> >>   * @rdp: The per-CPU RCU data
> >>   * @rnp: The RCU node for this CPU
> >> - * @irqs_were_disabled: Whether interrupts were disabled before rcu_read_unlock()
> >>   *
> >>   * Returns true if expedited processing of the rcu_read_unlock() is needed.
> >>   */
> >>  static bool rcu_unlock_needs_exp_handling(struct task_struct *t,
> >>  				      struct rcu_data *rdp,
> >> -				      struct rcu_node *rnp,
> >> -				      bool irqs_were_disabled)
> >> +				      struct rcu_node *rnp)
> >>  {
> >>  	/*
> >>  	 * Check if this task is blocking an expedited grace period. If the
> >> @@ -702,14 +700,14 @@ static bool rcu_unlock_needs_exp_handling(struct task_struct *t,
> >>  
> >>  	/*
> >>  	 * RCU priority boosting case: If a task is subject to RCU priority
> >> -	 * boosting and exits an RCU read-side critical section with interrupts
> >> -	 * disabled, we need expedited handling to ensure timely deboosting.
> >> -	 * Without this, a low-priority task could incorrectly run at high
> >> -	 * real-time priority for an extended period degrading real-time
> >> +	 * boosting and exits an RCU read-side critical section, we need
> >> +	 * expedited handling to ensure timely deboosting. Without this,
> >> +	 * a low-priority task could incorrectly run at high real-time
> >> +	 * priority for an extended period degrading real-time
> >>  	 * responsiveness. This applies to all CONFIG_RCU_BOOST=y kernels,
> >>  	 * not just to PREEMPT_RT.
> >>  	 */
> >> -	if (IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled && t->rcu_blocked_node)
> >> +	if (IS_ENABLED(CONFIG_RCU_BOOST) && t->rcu_blocked_node)
> >>  		return true;
> >>  
> >>  	return false;
> >> @@ -738,7 +736,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
> >>  		struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
> >>  		struct rcu_node *rnp = rdp->mynode;
> >>  
> >> -		needs_exp = rcu_unlock_needs_exp_handling(t, rdp, rnp, irqs_were_disabled);
> >> +		needs_exp = rcu_unlock_needs_exp_handling(t, rdp, rnp);
> >>  	
> >>  		// Need to defer quiescent state until everything is enabled.
> >>  		if (use_softirq && (in_hardirq() || (needs_exp && !irqs_were_disabled))) {
> >> -- 
> >> 2.34.1
> >>
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH -rcu -next 4/7] rcu: Remove redundant check for irq state during unlock
  2025-07-11 17:18       ` Paul E. McKenney
@ 2025-07-11 18:19         ` Joel Fernandes
  2025-07-11 18:41           ` Paul E. McKenney
  0 siblings, 1 reply; 17+ messages in thread
From: Joel Fernandes @ 2025-07-11 18:19 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, Frederic Weisbecker, Neeraj Upadhyay, Josh Triplett,
	Boqun Feng, Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, rcu



On 7/11/2025 1:18 PM, Paul E. McKenney wrote:
> On Fri, Jul 11, 2025 at 12:30:08PM -0400, Joel Fernandes wrote:
>>
>>
>> On 7/10/2025 8:00 PM, Paul E. McKenney wrote:
>>> On Tue, Jul 08, 2025 at 10:22:21AM -0400, Joel Fernandes wrote:
>>>> The check for irqs_were_disabled is redundant in
>>>> rcu_unlock_needs_exp_handling() as the caller already checks for this.
>>>> This includes the boost case as well. Just remove the redundant check.
>>>
>>> But in the very last "if" statement in the context of the last hunk of
>>> this patch, isn't it instead checking for !irqs_were_disabled?
>>>
>>> Or is there something that I am missing that makes this work out OK?
>>
>> You are right, after going over all the cases I introduced a behavioral change.
>>
>> Say, irqs_were_disabled was false. And say we are RCU-boosted. needs_exp might
>> return true now, but previously it was returning false. Further say, we are not
>> in hardirq.
>>
>> New code will raise softirq, old code would go to the ELSE and just set:
>>                 set_tsk_need_resched(current);
>> set_preempt_need_resched();
>>
>> But preempt_bh_were_disabled that's why we're here.
>>
>> So we need to only set resched and not raise softirq, because the preempt enable
>> would reschedule.
>>
>> Sorry I missed this, but unless this behavior is correct or needs further work,
>> lets drop this patch. Thanks and kudos on the catch!
> 
> Not a problem!
> 
> You can use cbmc to check these sorts of transformations, as described
> here: https://paulmck.livejournal.com/38997.html
Much appreciated! Does the tool automatically create stubs for dependent
structures or is that upto the CBMC user? I see in your example in the blog you
did create an rcu_node :). I wonder if for large number of dependencies on the
code base, how it does in terms of overhead for the user.

But perhaps for simpler examples (such as perhaps the above), manually mocking
data structures and dependent code is reasonable?

We'd have to stub irq_work and raise_softirq and functions too and also the
CONFIG option values.

Thanks!

 - Joel





^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH -rcu -next 4/7] rcu: Remove redundant check for irq state during unlock
  2025-07-11 18:19         ` Joel Fernandes
@ 2025-07-11 18:41           ` Paul E. McKenney
  0 siblings, 0 replies; 17+ messages in thread
From: Paul E. McKenney @ 2025-07-11 18:41 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Frederic Weisbecker, Neeraj Upadhyay, Josh Triplett,
	Boqun Feng, Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, rcu

On Fri, Jul 11, 2025 at 02:19:47PM -0400, Joel Fernandes wrote:
> 
> 
> On 7/11/2025 1:18 PM, Paul E. McKenney wrote:
> > On Fri, Jul 11, 2025 at 12:30:08PM -0400, Joel Fernandes wrote:
> >>
> >>
> >> On 7/10/2025 8:00 PM, Paul E. McKenney wrote:
> >>> On Tue, Jul 08, 2025 at 10:22:21AM -0400, Joel Fernandes wrote:
> >>>> The check for irqs_were_disabled is redundant in
> >>>> rcu_unlock_needs_exp_handling() as the caller already checks for this.
> >>>> This includes the boost case as well. Just remove the redundant check.
> >>>
> >>> But in the very last "if" statement in the context of the last hunk of
> >>> this patch, isn't it instead checking for !irqs_were_disabled?
> >>>
> >>> Or is there something that I am missing that makes this work out OK?
> >>
> >> You are right, after going over all the cases I introduced a behavioral change.
> >>
> >> Say, irqs_were_disabled was false. And say we are RCU-boosted. needs_exp might
> >> return true now, but previously it was returning false. Further say, we are not
> >> in hardirq.
> >>
> >> New code will raise softirq, old code would go to the ELSE and just set:
> >>                 set_tsk_need_resched(current);
> >> set_preempt_need_resched();
> >>
> >> But preempt_bh_were_disabled that's why we're here.
> >>
> >> So we need to only set resched and not raise softirq, because the preempt enable
> >> would reschedule.
> >>
> >> Sorry I missed this, but unless this behavior is correct or needs further work,
> >> lets drop this patch. Thanks and kudos on the catch!
> > 
> > Not a problem!
> > 
> > You can use cbmc to check these sorts of transformations, as described
> > here: https://paulmck.livejournal.com/38997.html
> 
> Much appreciated! Does the tool automatically create stubs for dependent
> structures or is that upto the CBMC user? I see in your example in the blog you
> did create an rcu_node :). I wonder if for large number of dependencies on the
> code base, how it does in terms of overhead for the user.

You do need to create the stubs.  But you also have the option of
simplifying the code, for example, by removing structure fields that
you are not using and providing static instances of the needed structures.

> But perhaps for simpler examples (such as perhaps the above), manually mocking
> data structures and dependent code is reasonable?

That is what I do.  ;-)

> We'd have to stub irq_work and raise_softirq and functions too and also the
> CONFIG option values.

Agreed.  But in this case, we only need to know whether or not they
were called, so a very simple stub suffices.

You can also speed things up by replacing the Kconfig options with
variables, thus allowing full exploration in one run.  And dispensing
with the dazzling implementation of IS_ENABLED().  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH -rcu -next 5/7] rcu: Document GP init vs hotplug-scan ordering requirements
  2025-07-08 14:22 [PATCH -rcu -next 1/7] smp: Document preemption and stop_machine() mutual exclusion Joel Fernandes
                   ` (2 preceding siblings ...)
  2025-07-08 14:22 ` [PATCH -rcu -next 4/7] rcu: Remove redundant check for irq state during unlock Joel Fernandes
@ 2025-07-08 14:22 ` Joel Fernandes
  2025-07-11  0:01   ` Paul E. McKenney
  2025-07-08 14:22 ` [PATCH -rcu -next 6/7] rcu: Document separation of rcu_state and rnp's gp_seq Joel Fernandes
  2025-07-08 14:22 ` [PATCH -rcu -next 7/7] rcu: Document concurrent quiescent state reporting for offline CPUs Joel Fernandes
  5 siblings, 1 reply; 17+ messages in thread
From: Joel Fernandes @ 2025-07-08 14:22 UTC (permalink / raw)
  To: linux-kernel, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Jonathan Corbet
  Cc: rcu, linux-doc

Add detailed comments explaining the critical ordering constraints
during RCU grace period initialization, based on discussions with
Frederic.

Co-developed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 .../RCU/Design/Requirements/Requirements.rst  | 41 +++++++++++++++++++
 kernel/rcu/tree.c                             |  8 ++++
 2 files changed, 49 insertions(+)

diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst
index 6125e7068d2c..771a1ce4c84d 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.rst
+++ b/Documentation/RCU/Design/Requirements/Requirements.rst
@@ -1970,6 +1970,47 @@ corresponding CPU's leaf node lock is held. This avoids race conditions
 between RCU's hotplug notifier hooks, the grace period initialization
 code, and the FQS loop, all of which refer to or modify this bookkeeping.
 
+Note that grace period initialization (rcu_gp_init()) must carefully sequence
+CPU hotplug scanning with grace period state changes. For example, the
+following race could occur in rcu_gp_init() if rcu_seq_start() were to happen
+after the CPU hotplug scanning.
+
+.. code-block:: none
+
+   CPU0 (rcu_gp_init)                   CPU1                          CPU2
+   ---------------------                ----                          ----
+   // Hotplug scan first (WRONG ORDER)
+   rcu_for_each_leaf_node(rnp) {
+       rnp->qsmaskinit = rnp->qsmaskinitnext;
+   }
+                                        rcutree_report_cpu_starting()
+                                            rnp->qsmaskinitnext |= mask;
+                                        rcu_read_lock()
+                                        r0 = *X;
+                                                                      r1 = *X;
+                                                                      X = NULL;
+                                                                      cookie = get_state_synchronize_rcu();
+                                                                      // cookie = 8 (future GP)
+   rcu_seq_start(&rcu_state.gp_seq);
+   // gp_seq = 5
+
+   // CPU1 now invisible to this GP!
+   rcu_for_each_node_breadth_first() {
+       rnp->qsmask = rnp->qsmaskinit;
+       // CPU1 not included!
+   }
+
+   // GP completes without CPU1
+   rcu_seq_end(&rcu_state.gp_seq);
+   // gp_seq = 8
+                                                                      poll_state_synchronize_rcu(cookie);
+                                                                      // Returns true!
+                                                                      kfree(r1);
+                                        r2 = *r0; // USE-AFTER-FREE!
+
+By incrementing gp_seq first, CPU1's RCU read-side critical section
+is guaranteed to not be missed by CPU2.
+
 Scheduler and RCU
 ~~~~~~~~~~~~~~~~~
 
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 327848f436e7..32fdb66e9c9f 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1835,6 +1835,14 @@ static noinline_for_stack bool rcu_gp_init(void)
 	start_new_poll = rcu_sr_normal_gp_init();
 	/* Record GP times before starting GP, hence rcu_seq_start(). */
 	old_gp_seq = rcu_state.gp_seq;
+	/*
+	 * Critical ordering: rcu_seq_start() must happen BEFORE the CPU hotplug
+	 * scan below. Otherwise we risk a race where a newly onlining CPU could
+	 * be missed by the current grace period, potentially leading to
+	 * use-after-free errors. For a detailed explanation of this race, see
+	 * Documentation/RCU/Design/Requirements/Requirements.rst in the
+	 * "Hotplug CPU" section.
+	 */
 	rcu_seq_start(&rcu_state.gp_seq);
 	/* Ensure that rcu_seq_done_exact() guardband doesn't give false positives. */
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_PROVE_RCU) &&
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH -rcu -next 5/7] rcu: Document GP init vs hotplug-scan ordering requirements
  2025-07-08 14:22 ` [PATCH -rcu -next 5/7] rcu: Document GP init vs hotplug-scan ordering requirements Joel Fernandes
@ 2025-07-11  0:01   ` Paul E. McKenney
  0 siblings, 0 replies; 17+ messages in thread
From: Paul E. McKenney @ 2025-07-11  0:01 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Frederic Weisbecker, Neeraj Upadhyay, Josh Triplett,
	Boqun Feng, Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Jonathan Corbet, rcu, linux-doc

On Tue, Jul 08, 2025 at 10:22:22AM -0400, Joel Fernandes wrote:
> Add detailed comments explaining the critical ordering constraints
> during RCU grace period initialization, based on discussions with
> Frederic.
> 
> Co-developed-by: Frederic Weisbecker <frederic@kernel.org>
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

> ---
>  .../RCU/Design/Requirements/Requirements.rst  | 41 +++++++++++++++++++
>  kernel/rcu/tree.c                             |  8 ++++
>  2 files changed, 49 insertions(+)
> 
> diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst
> index 6125e7068d2c..771a1ce4c84d 100644
> --- a/Documentation/RCU/Design/Requirements/Requirements.rst
> +++ b/Documentation/RCU/Design/Requirements/Requirements.rst
> @@ -1970,6 +1970,47 @@ corresponding CPU's leaf node lock is held. This avoids race conditions
>  between RCU's hotplug notifier hooks, the grace period initialization
>  code, and the FQS loop, all of which refer to or modify this bookkeeping.
>  
> +Note that grace period initialization (rcu_gp_init()) must carefully sequence
> +CPU hotplug scanning with grace period state changes. For example, the
> +following race could occur in rcu_gp_init() if rcu_seq_start() were to happen
> +after the CPU hotplug scanning.
> +
> +.. code-block:: none
> +
> +   CPU0 (rcu_gp_init)                   CPU1                          CPU2
> +   ---------------------                ----                          ----
> +   // Hotplug scan first (WRONG ORDER)
> +   rcu_for_each_leaf_node(rnp) {
> +       rnp->qsmaskinit = rnp->qsmaskinitnext;
> +   }
> +                                        rcutree_report_cpu_starting()
> +                                            rnp->qsmaskinitnext |= mask;
> +                                        rcu_read_lock()
> +                                        r0 = *X;
> +                                                                      r1 = *X;
> +                                                                      X = NULL;
> +                                                                      cookie = get_state_synchronize_rcu();
> +                                                                      // cookie = 8 (future GP)
> +   rcu_seq_start(&rcu_state.gp_seq);
> +   // gp_seq = 5
> +
> +   // CPU1 now invisible to this GP!
> +   rcu_for_each_node_breadth_first() {
> +       rnp->qsmask = rnp->qsmaskinit;
> +       // CPU1 not included!
> +   }
> +
> +   // GP completes without CPU1
> +   rcu_seq_end(&rcu_state.gp_seq);
> +   // gp_seq = 8
> +                                                                      poll_state_synchronize_rcu(cookie);
> +                                                                      // Returns true!
> +                                                                      kfree(r1);
> +                                        r2 = *r0; // USE-AFTER-FREE!
> +
> +By incrementing gp_seq first, CPU1's RCU read-side critical section
> +is guaranteed to not be missed by CPU2.
> +
>  Scheduler and RCU
>  ~~~~~~~~~~~~~~~~~
>  
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 327848f436e7..32fdb66e9c9f 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1835,6 +1835,14 @@ static noinline_for_stack bool rcu_gp_init(void)
>  	start_new_poll = rcu_sr_normal_gp_init();
>  	/* Record GP times before starting GP, hence rcu_seq_start(). */
>  	old_gp_seq = rcu_state.gp_seq;
> +	/*
> +	 * Critical ordering: rcu_seq_start() must happen BEFORE the CPU hotplug
> +	 * scan below. Otherwise we risk a race where a newly onlining CPU could
> +	 * be missed by the current grace period, potentially leading to
> +	 * use-after-free errors. For a detailed explanation of this race, see
> +	 * Documentation/RCU/Design/Requirements/Requirements.rst in the
> +	 * "Hotplug CPU" section.
> +	 */
>  	rcu_seq_start(&rcu_state.gp_seq);
>  	/* Ensure that rcu_seq_done_exact() guardband doesn't give false positives. */
>  	WARN_ON_ONCE(IS_ENABLED(CONFIG_PROVE_RCU) &&
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH -rcu -next 6/7] rcu: Document separation of rcu_state and rnp's gp_seq
  2025-07-08 14:22 [PATCH -rcu -next 1/7] smp: Document preemption and stop_machine() mutual exclusion Joel Fernandes
                   ` (3 preceding siblings ...)
  2025-07-08 14:22 ` [PATCH -rcu -next 5/7] rcu: Document GP init vs hotplug-scan ordering requirements Joel Fernandes
@ 2025-07-08 14:22 ` Joel Fernandes
  2025-07-11  0:02   ` Paul E. McKenney
  2025-07-08 14:22 ` [PATCH -rcu -next 7/7] rcu: Document concurrent quiescent state reporting for offline CPUs Joel Fernandes
  5 siblings, 1 reply; 17+ messages in thread
From: Joel Fernandes @ 2025-07-08 14:22 UTC (permalink / raw)
  To: linux-kernel, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Jonathan Corbet
  Cc: rcu, linux-doc

The details of this are subtle and was discussed recently. Add a
quick-quiz about this and refer to it from the code, for more clarity.

Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 .../Data-Structures/Data-Structures.rst       | 32 +++++++++++++++++++
 kernel/rcu/tree.c                             |  4 +++
 2 files changed, 36 insertions(+)

diff --git a/Documentation/RCU/Design/Data-Structures/Data-Structures.rst b/Documentation/RCU/Design/Data-Structures/Data-Structures.rst
index 04e16775c752..930535f076b4 100644
--- a/Documentation/RCU/Design/Data-Structures/Data-Structures.rst
+++ b/Documentation/RCU/Design/Data-Structures/Data-Structures.rst
@@ -286,6 +286,38 @@ in order to detect the beginnings and ends of grace periods in a
 distributed fashion. The values flow from ``rcu_state`` to ``rcu_node``
 (down the tree from the root to the leaves) to ``rcu_data``.
 
++-----------------------------------------------------------------------+
+| **Quick Quiz**:                                                       |
++-----------------------------------------------------------------------+
+| Given that the root rcu_node structure has a gp_seq field,            |
+| why does RCU maintain a separate gp_seq in the rcu_state structure?   |
+| Why not just use the root rcu_node's gp_seq as the official record    |
+| and update it directly when starting a new grace period?              |
++-----------------------------------------------------------------------+
+| **Answer**:                                                           |
++-----------------------------------------------------------------------+
+| On single-node RCU trees (where the root node is also a leaf),        |
+| updating the root node's gp_seq immediately would create unnecessary  |
+| lock contention. Here's why:                                          |
+|                                                                       |
+| If we did rcu_seq_start() directly on the root node's gp_seq:         |
+| 1. All CPUs would immediately see their node's gp_seq from their rdp's|
+|    gp_seq, in rcu_pending(). They would all then invoke the RCU-core. |
+| 2. Which calls note_gp_changes() and try to acquire the node lock.    |
+| 3. But rnp->qsmask isn't initialized yet (happens later in            |
+|    rcu_gp_init())                                                     |
+| 4. So each CPU would acquire the lock, find it can't determine if it  |
+|    needs to report quiescent state (no qsmask), update rdp->gp_seq,   |
+|    and release the lock.                                              |
+| 5. Result: Lots of lock acquisitions with no grace period progress    |
+|                                                                       |
+| By having a separate rcu_state.gp_seq, we can increment the official  |
+| grace period counter without immediately affecting what CPUs see in   |
+| their nodes. The hierarchical propagation in rcu_gp_init() then       |
+| updates the root node's gp_seq and qsmask together under the same lock|
+| acquisition, avoiding this useless contention.                        |
++-----------------------------------------------------------------------+
+
 Miscellaneous
 '''''''''''''
 
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 32fdb66e9c9f..c31b85e62310 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1842,6 +1842,10 @@ static noinline_for_stack bool rcu_gp_init(void)
 	 * use-after-free errors. For a detailed explanation of this race, see
 	 * Documentation/RCU/Design/Requirements/Requirements.rst in the
 	 * "Hotplug CPU" section.
+	 *
+	 * Also note that the root rnp's gp_seq is kept separate from, and lags,
+	 * the rcu_state's gp_seq, for a reason. See the Quick-Quiz on
+	 * Single-node systems for more details (in Data-Structures.rst).
 	 */
 	rcu_seq_start(&rcu_state.gp_seq);
 	/* Ensure that rcu_seq_done_exact() guardband doesn't give false positives. */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH -rcu -next 6/7] rcu: Document separation of rcu_state and rnp's gp_seq
  2025-07-08 14:22 ` [PATCH -rcu -next 6/7] rcu: Document separation of rcu_state and rnp's gp_seq Joel Fernandes
@ 2025-07-11  0:02   ` Paul E. McKenney
  0 siblings, 0 replies; 17+ messages in thread
From: Paul E. McKenney @ 2025-07-11  0:02 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Frederic Weisbecker, Neeraj Upadhyay, Josh Triplett,
	Boqun Feng, Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Jonathan Corbet, rcu, linux-doc

On Tue, Jul 08, 2025 at 10:22:23AM -0400, Joel Fernandes wrote:
> The details of this are subtle and was discussed recently. Add a
> quick-quiz about this and refer to it from the code, for more clarity.
> 
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

> ---
>  .../Data-Structures/Data-Structures.rst       | 32 +++++++++++++++++++
>  kernel/rcu/tree.c                             |  4 +++
>  2 files changed, 36 insertions(+)
> 
> diff --git a/Documentation/RCU/Design/Data-Structures/Data-Structures.rst b/Documentation/RCU/Design/Data-Structures/Data-Structures.rst
> index 04e16775c752..930535f076b4 100644
> --- a/Documentation/RCU/Design/Data-Structures/Data-Structures.rst
> +++ b/Documentation/RCU/Design/Data-Structures/Data-Structures.rst
> @@ -286,6 +286,38 @@ in order to detect the beginnings and ends of grace periods in a
>  distributed fashion. The values flow from ``rcu_state`` to ``rcu_node``
>  (down the tree from the root to the leaves) to ``rcu_data``.
>  
> ++-----------------------------------------------------------------------+
> +| **Quick Quiz**:                                                       |
> ++-----------------------------------------------------------------------+
> +| Given that the root rcu_node structure has a gp_seq field,            |
> +| why does RCU maintain a separate gp_seq in the rcu_state structure?   |
> +| Why not just use the root rcu_node's gp_seq as the official record    |
> +| and update it directly when starting a new grace period?              |
> ++-----------------------------------------------------------------------+
> +| **Answer**:                                                           |
> ++-----------------------------------------------------------------------+
> +| On single-node RCU trees (where the root node is also a leaf),        |
> +| updating the root node's gp_seq immediately would create unnecessary  |
> +| lock contention. Here's why:                                          |
> +|                                                                       |
> +| If we did rcu_seq_start() directly on the root node's gp_seq:         |
> +| 1. All CPUs would immediately see their node's gp_seq from their rdp's|
> +|    gp_seq, in rcu_pending(). They would all then invoke the RCU-core. |
> +| 2. Which calls note_gp_changes() and try to acquire the node lock.    |
> +| 3. But rnp->qsmask isn't initialized yet (happens later in            |
> +|    rcu_gp_init())                                                     |
> +| 4. So each CPU would acquire the lock, find it can't determine if it  |
> +|    needs to report quiescent state (no qsmask), update rdp->gp_seq,   |
> +|    and release the lock.                                              |
> +| 5. Result: Lots of lock acquisitions with no grace period progress    |
> +|                                                                       |
> +| By having a separate rcu_state.gp_seq, we can increment the official  |
> +| grace period counter without immediately affecting what CPUs see in   |
> +| their nodes. The hierarchical propagation in rcu_gp_init() then       |
> +| updates the root node's gp_seq and qsmask together under the same lock|
> +| acquisition, avoiding this useless contention.                        |
> ++-----------------------------------------------------------------------+
> +
>  Miscellaneous
>  '''''''''''''
>  
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 32fdb66e9c9f..c31b85e62310 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1842,6 +1842,10 @@ static noinline_for_stack bool rcu_gp_init(void)
>  	 * use-after-free errors. For a detailed explanation of this race, see
>  	 * Documentation/RCU/Design/Requirements/Requirements.rst in the
>  	 * "Hotplug CPU" section.
> +	 *
> +	 * Also note that the root rnp's gp_seq is kept separate from, and lags,
> +	 * the rcu_state's gp_seq, for a reason. See the Quick-Quiz on
> +	 * Single-node systems for more details (in Data-Structures.rst).
>  	 */
>  	rcu_seq_start(&rcu_state.gp_seq);
>  	/* Ensure that rcu_seq_done_exact() guardband doesn't give false positives. */
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH -rcu -next 7/7] rcu: Document concurrent quiescent state reporting for offline CPUs
  2025-07-08 14:22 [PATCH -rcu -next 1/7] smp: Document preemption and stop_machine() mutual exclusion Joel Fernandes
                   ` (4 preceding siblings ...)
  2025-07-08 14:22 ` [PATCH -rcu -next 6/7] rcu: Document separation of rcu_state and rnp's gp_seq Joel Fernandes
@ 2025-07-08 14:22 ` Joel Fernandes
  5 siblings, 0 replies; 17+ messages in thread
From: Joel Fernandes @ 2025-07-08 14:22 UTC (permalink / raw)
  To: linux-kernel, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Jonathan Corbet
  Cc: rcu, linux-doc

The synchronization of CPU offlining with GP initialization is confusing
to put it mildly (rightfully so as the issue it deals with is complex).
Recent discussions brought up a question -- what prevents the
rcu_implicit_dyntick_qs() from warning about QS reports for offline
CPUs (missing QS reports for offline CPUs causing indefinite hangs).

QS reporting for now-offline CPUs should only happen from:
- gp_init()
- rcutree_cpu_report_dead()

Add some documentation on this and refer to it from comments in the code
explaining how QS reporting is not missed when these functions are
concurrently running.

I referred heavily to this post [1] about the need for the ofl_lock.
[1] https://lore.kernel.org/all/20180924164443.GF4222@linux.ibm.com/

[ Applied paulmck feedback on moving documentation to Requirements.rst ]

Link: https://lore.kernel.org/all/01b4d228-9416-43f8-a62e-124b92e8741a@paulmck-laptop/
Co-developed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 .../RCU/Design/Requirements/Requirements.rst  | 87 +++++++++++++++++++
 kernel/rcu/tree.c                             | 19 +++-
 2 files changed, 105 insertions(+), 1 deletion(-)

diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst
index 771a1ce4c84d..841326d9358d 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.rst
+++ b/Documentation/RCU/Design/Requirements/Requirements.rst
@@ -2011,6 +2011,93 @@ after the CPU hotplug scanning.
 By incrementing gp_seq first, CPU1's RCU read-side critical section
 is guaranteed to not be missed by CPU2.
 
+**Concurrent Quiescent State Reporting for Offline CPUs**
+
+RCU must ensure that CPUs going offline report quiescent states to avoid
+blocking grace periods. This requires careful synchronization to handle
+race conditions
+
+**Race condition causing Offline CPU to hang GP**
+
+A race between CPU offlining and new GP initialization (gp_init) may occur
+because `rcu_report_qs_rnp()` in `rcutree_report_cpu_dead()` must temporarily
+release the `rcu_node` lock to wake the RCU grace-period kthread:
+
+.. code-block:: none
+
+   CPU1 (going offline)                 CPU0 (GP kthread)
+   --------------------                 -----------------
+   rcutree_report_cpu_dead()
+     rcu_report_qs_rnp()
+       // Must release rnp->lock to wake GP kthread
+       raw_spin_unlock_irqrestore_rcu_node()
+                                        // Wakes up and starts new GP
+                                        rcu_gp_init()
+                                          // First loop:
+                                          copies qsmaskinitnext->qsmaskinit
+                                          // CPU1 still in qsmaskinitnext!
+                                          
+                                          // Second loop:
+                                          rnp->qsmask = rnp->qsmaskinit
+                                          mask = rnp->qsmask & ~rnp->qsmaskinitnext
+                                          // mask is 0! CPU1 still in both masks
+       // Reacquire lock (but too late)
+     rnp->qsmaskinitnext &= ~mask       // Finally clears bit
+
+Without `ofl_lock`, the new grace period includes the offline CPU and waits
+forever for its quiescent state causing a GP hang.
+
+**A solution with ofl_lock**
+
+The `ofl_lock` (offline lock) prevents `rcu_gp_init()` from running during
+the vulnerable window when `rcu_report_qs_rnp()` has released `rnp->lock`:
+
+.. code-block:: none
+
+   CPU0 (rcu_gp_init)                   CPU1 (rcutree_report_cpu_dead)
+   ------------------                   ------------------------------
+   rcu_for_each_leaf_node(rnp) {
+       arch_spin_lock(&ofl_lock) -----> arch_spin_lock(&ofl_lock) [BLOCKED]
+       
+       // Safe: CPU1 can't interfere
+       rnp->qsmaskinit = rnp->qsmaskinitnext
+       
+       arch_spin_unlock(&ofl_lock) ---> // Now CPU1 can proceed
+   }                                    // But snapshot already taken
+
+**Another race causing GP hangs in rcu_gpu_init(): Reporting QS for Now-offline CPUs**
+
+After the first loop takes an atomic snapshot of online CPUs, as shown above,
+the second loop in `rcu_gp_init()` detects CPUs that went offline between
+releasing `ofl_lock` and acquiring the per-node `rnp->lock`. This detection is
+crucial because:
+
+1. The CPU might have gone offline after the snapshot but before the second loop
+2. The offline CPU cannot report its own QS if it's already dead
+3. Without this detection, the grace period would wait forever for CPUs that
+   are now offline.
+
+The second loop performs this detection safely:
+
+.. code-block:: none
+
+   rcu_for_each_node_breadth_first(rnp) {
+       raw_spin_lock_irqsave_rcu_node(rnp, flags);
+       rnp->qsmask = rnp->qsmaskinit;  // Apply the snapshot
+       
+       // Detect CPUs offline after snapshot
+       mask = rnp->qsmask & ~rnp->qsmaskinitnext;
+       
+       if (mask && rcu_is_leaf_node(rnp))
+           rcu_report_qs_rnp(mask, ...)  // Report QS for offline CPUs
+   }
+
+This approach ensures atomicity: quiescent state reporting for offline CPUs
+happens either in `rcu_gp_init()` (second loop) or in `rcutree_report_cpu_dead()`,
+never both and never neither. The `rnp->lock` held throughout the sequence
+prevents races - `rcutree_report_cpu_dead()` also acquires this lock when
+clearing `qsmaskinitnext`, ensuring mutual exclusion.
+
 Scheduler and RCU
 ~~~~~~~~~~~~~~~~~
 
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c31b85e62310..f669f9b45e80 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1883,6 +1883,10 @@ static noinline_for_stack bool rcu_gp_init(void)
 	/* Exclude CPU hotplug operations. */
 	rcu_for_each_leaf_node(rnp) {
 		local_irq_disable();
+		/*
+		 * Serialize with CPU offline. See Requirements.rst > Hotplug CPU >
+		 * Concurrent Quiescent State Reporting for Offline CPUs.
+		 */
 		arch_spin_lock(&rcu_state.ofl_lock);
 		raw_spin_lock_rcu_node(rnp);
 		if (rnp->qsmaskinit == rnp->qsmaskinitnext &&
@@ -1957,7 +1961,12 @@ static noinline_for_stack bool rcu_gp_init(void)
 		trace_rcu_grace_period_init(rcu_state.name, rnp->gp_seq,
 					    rnp->level, rnp->grplo,
 					    rnp->grphi, rnp->qsmask);
-		/* Quiescent states for tasks on any now-offline CPUs. */
+		/*
+		 * Quiescent states for tasks on any now-offline CPUs. Since we
+		 * released the ofl and rnp lock before this loop, CPUs might
+		 * have gone offline and we have to report QS on their behalf.
+		 * See Requirements.rst > Hotplug CPU > Concurrent QS Reporting.
+		 */
 		mask = rnp->qsmask & ~rnp->qsmaskinitnext;
 		rnp->rcu_gp_init_mask = mask;
 		if ((mask || rnp->wait_blkd_tasks) && rcu_is_leaf_node(rnp))
@@ -4388,6 +4397,13 @@ void rcutree_report_cpu_dead(void)
 
 	/* Remove outgoing CPU from mask in the leaf rcu_node structure. */
 	mask = rdp->grpmask;
+
+	/*
+	 * Hold the ofl_lock and rnp lock to avoid races between CPU going
+	 * offline and doing a QS report (as below), versus rcu_gp_init().
+	 * See Requirements.rst > Hotplug CPU > Concurrent QS Reporting section
+	 * for more details.
+	 */
 	arch_spin_lock(&rcu_state.ofl_lock);
 	raw_spin_lock_irqsave_rcu_node(rnp, flags); /* Enforce GP memory-order guarantee. */
 	rdp->rcu_ofl_gp_seq = READ_ONCE(rcu_state.gp_seq);
@@ -4398,6 +4414,7 @@ void rcutree_report_cpu_dead(void)
 		rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
 		raw_spin_lock_irqsave_rcu_node(rnp, flags);
 	}
+	/* Clear from ->qsmaskinitnext to mark offline. */
 	WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext & ~mask);
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 	arch_spin_unlock(&rcu_state.ofl_lock);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2025-07-11 18:41 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-08 14:22 [PATCH -rcu -next 1/7] smp: Document preemption and stop_machine() mutual exclusion Joel Fernandes
2025-07-08 14:22 ` [PATCH -rcu -next 2/7] rcu: Fix rcu_read_unlock() deadloop due to IRQ work Joel Fernandes
2025-07-09  0:00   ` Paul E. McKenney
2025-07-08 14:22 ` [PATCH -rcu -next 3/7] rcu: Refactor expedited handling check in rcu_read_unlock_special() Joel Fernandes
2025-07-09  0:06   ` Paul E. McKenney
2025-07-08 14:22 ` [PATCH -rcu -next 4/7] rcu: Remove redundant check for irq state during unlock Joel Fernandes
2025-07-11  0:00   ` Paul E. McKenney
2025-07-11 16:30     ` Joel Fernandes
2025-07-11 16:39       ` Joel Fernandes
2025-07-11 17:18       ` Paul E. McKenney
2025-07-11 18:19         ` Joel Fernandes
2025-07-11 18:41           ` Paul E. McKenney
2025-07-08 14:22 ` [PATCH -rcu -next 5/7] rcu: Document GP init vs hotplug-scan ordering requirements Joel Fernandes
2025-07-11  0:01   ` Paul E. McKenney
2025-07-08 14:22 ` [PATCH -rcu -next 6/7] rcu: Document separation of rcu_state and rnp's gp_seq Joel Fernandes
2025-07-11  0:02   ` Paul E. McKenney
2025-07-08 14:22 ` [PATCH -rcu -next 7/7] rcu: Document concurrent quiescent state reporting for offline CPUs Joel Fernandes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).