* [PATCH rcu 0/5] Expedited RCU changes for v6.17
@ 2025-07-09 10:39 neeraj.upadhyay
  2025-07-09 10:39 ` [PATCH rcu 1/5] rcu/exp: Protect against early QS report neeraj.upadhyay
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: neeraj.upadhyay @ 2025-07-09 10:39 UTC
  To: rcu
  Cc: linux-kernel, paulmck, joelagnelf, frederic, boqun.feng, urezki,
	rostedt, mathieu.desnoyers, jiangshanlai, qiang.zhang1211,
	neeraj.iitr10, neeraj.upadhyay, Neeraj Upadhyay (AMD)

From: "Neeraj Upadhyay (AMD)" <neeraj.upadhyay@kernel.org>

Hello,

This patch series contains the following updates to the expedited RCU
code (rebased on v6.16-rc3):

Frederic Weisbecker (5):
  rcu/exp: Protect against early QS report
  rcu/exp: Remove confusing needless full barrier on task unblock
  rcu/exp: Remove needless CPU up quiescent state report
  rcu/exp: Warn on QS requested on dying CPU
  rcu/exp: Warn on CPU lagging for too long within hotplug IPI's
    blindspot

 kernel/rcu/tree.c        |  8 +++--
 kernel/rcu/tree_exp.h    | 69 +++++++++++-----------------------------
 kernel/rcu/tree_plugin.h |  1 -
 3 files changed, 25 insertions(+), 53 deletions(-)

-- 
2.40.1


* [PATCH rcu 1/5] rcu/exp: Protect against early QS report
  2025-07-09 10:39 [PATCH rcu 0/5] Expedited RCU changes for v6.17 neeraj.upadhyay
@ 2025-07-09 10:39 ` neeraj.upadhyay
  2025-07-09 10:39 ` [PATCH rcu 2/5] rcu/exp: Remove confusing needless full barrier on task unblock neeraj.upadhyay
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: neeraj.upadhyay @ 2025-07-09 10:39 UTC
  To: rcu
  Cc: linux-kernel, paulmck, joelagnelf, frederic, boqun.feng, urezki,
	rostedt, mathieu.desnoyers, jiangshanlai, qiang.zhang1211,
	neeraj.iitr10, neeraj.upadhyay, Neeraj Upadhyay (AMD)

From: Frederic Weisbecker <frederic@kernel.org>

When a grace period is started, the ->expmask of each node is set up
by sync_exp_reset_tree(). Only later does each leaf node also
initialize its ->exp_tasks pointer.

This means that the initialization of a node's quiescent state mask
and the initialization of its blocked tasks list happen within a gap
during which the node is unlocked.

This happens to be fine because nothing is expected to report an exp
quiescent state within this gap, since no IPIs have been issued yet and
every rdp's ->cpu_no_qs.b.exp should be false.

However, if it were to happen by accident, the quiescent state could be
reported and propagated while ignoring tasks that blocked _before_ the
start of the grace period.

Prevent such trouble from happening in the future by initializing both
the quiescent state mask to report and the blocked tasks list head
within the same node-locked block.

If a task blocks within an RCU read-side critical section before
sync_exp_reset_tree() is called and is then unblocked between
sync_exp_reset_tree() and __sync_rcu_exp_select_node_cpus(), the QS
won't be reported because no RCU exp IPI has been issued to request it
through the setting of rdp->cpu_no_qs.b.exp.
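
For illustration, a sketch of the old window (hypothetical
interleaving, simplified from the functions named above):

	sync_exp_reset_tree():
		lock(rnp)
		WRITE_ONCE(rnp->expmask, rnp->expmaskinit)
		unlock(rnp)
			/* gap: rnp unlocked, ->exp_tasks still unset; */
			/* an accidental QS report here would ignore   */
			/* tasks that blocked before the GP started    */
	__sync_rcu_exp_select_node_cpus():
		lock(rnp)
		WRITE_ONCE(rnp->exp_tasks, rnp->blkd_tasks.next)
		unlock(rnp)

With this patch, both writes happen within the same locked section in
sync_exp_reset_tree(), so no such window remains.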

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
---
 kernel/rcu/tree_exp.h | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index c36c7d5575ca..2fa7aa9155bd 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -141,6 +141,13 @@ static void __maybe_unused sync_exp_reset_tree(void)
 		raw_spin_lock_irqsave_rcu_node(rnp, flags);
 		WARN_ON_ONCE(rnp->expmask);
 		WRITE_ONCE(rnp->expmask, rnp->expmaskinit);
+		/*
+		 * Need to wait for any blocked tasks as well.	Note that
+		 * additional blocking tasks will also block the expedited GP
+		 * until such time as the ->expmask bits are cleared.
+		 */
+		if (rcu_is_leaf_node(rnp) && rcu_preempt_has_tasks(rnp))
+			WRITE_ONCE(rnp->exp_tasks, rnp->blkd_tasks.next);
 		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 	}
 }
@@ -393,13 +400,6 @@ static void __sync_rcu_exp_select_node_cpus(struct rcu_exp_work *rewp)
 	}
 	mask_ofl_ipi = rnp->expmask & ~mask_ofl_test;
 
-	/*
-	 * Need to wait for any blocked tasks as well.	Note that
-	 * additional blocking tasks will also block the expedited GP
-	 * until such time as the ->expmask bits are cleared.
-	 */
-	if (rcu_preempt_has_tasks(rnp))
-		WRITE_ONCE(rnp->exp_tasks, rnp->blkd_tasks.next);
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 
 	/* IPI the remaining CPUs for expedited quiescent state. */
-- 
2.40.1


* [PATCH rcu 2/5] rcu/exp: Remove confusing needless full barrier on task unblock
  2025-07-09 10:39 [PATCH rcu 0/5] Expedited RCU changes for v6.17 neeraj.upadhyay
  2025-07-09 10:39 ` [PATCH rcu 1/5] rcu/exp: Protect against early QS report neeraj.upadhyay
@ 2025-07-09 10:39 ` neeraj.upadhyay
  2025-07-09 10:39 ` [PATCH rcu 3/5] rcu/exp: Remove needless CPU up quiescent state report neeraj.upadhyay
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: neeraj.upadhyay @ 2025-07-09 10:39 UTC
  To: rcu
  Cc: linux-kernel, paulmck, joelagnelf, frederic, boqun.feng, urezki,
	rostedt, mathieu.desnoyers, jiangshanlai, qiang.zhang1211,
	neeraj.iitr10, neeraj.upadhyay, Neeraj Upadhyay (AMD)

From: Frederic Weisbecker <frederic@kernel.org>

A full memory barrier in the RCU-PREEMPT task unblock path purports
to order the context switch (or rather the accesses prior to
rcu_read_unlock()) against the expedited grace period fastpath.

However, the grace period cannot complete without rcu_report_exp_rnp()
being called with the node locked. This reports the quiescent state in
a fully ordered fashion against the updater's accesses, thanks to:

1) The READ-SIDE smp_mb__after_unlock_lock() barrier across node
   locking while propagating the QS up to the root.

2) The UPDATE-SIDE smp_mb__after_unlock_lock() barrier while holding
   the root rnp to wait/check for the GP completion.

3) The (perhaps redundant, given steps 1) and 2)) smp_mb() in
   rcu_seq_end() before the grace period completes.

This makes the explicit barrier here superfluous. Therefore remove it,
as it is only confusing.
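
As an illustration, a simplified sketch of that ordering chain
(hypothetical timeline, not literal code):

	unblocking task                      exp GP updater
	---------------                      --------------
	accesses in read-side c-s
	lock(rnp)
	smp_mb__after_unlock_lock() /* 1) */
	report QS, propagate to root
	(relock + barrier per level)
	unlock(rnp)
	                                     lock(root rnp)
	                                     smp_mb__after_unlock_lock() /* 2) */
	                                     observe GP completion
	                                     rcu_seq_end() -> smp_mb() /* 3) */

Updater accesses after the grace period are therefore already ordered
after the read-side critical section, without the removed smp_mb().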

Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
---
 kernel/rcu/tree_plugin.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 0b0f56f6abc8..0532a13cb75e 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -534,7 +534,6 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
 		WARN_ON_ONCE(rnp->completedqs == rnp->gp_seq &&
 			     (!empty_norm || rnp->qsmask));
 		empty_exp = sync_rcu_exp_done(rnp);
-		smp_mb(); /* ensure expedited fastpath sees end of RCU c-s. */
 		np = rcu_next_node_entry(t, rnp);
 		list_del_init(&t->rcu_node_entry);
 		t->rcu_blocked_node = NULL;
-- 
2.40.1


* [PATCH rcu 3/5] rcu/exp: Remove needless CPU up quiescent state report
  2025-07-09 10:39 [PATCH rcu 0/5] Expedited RCU changes for v6.17 neeraj.upadhyay
  2025-07-09 10:39 ` [PATCH rcu 1/5] rcu/exp: Protect against early QS report neeraj.upadhyay
  2025-07-09 10:39 ` [PATCH rcu 2/5] rcu/exp: Remove confusing needless full barrier on task unblock neeraj.upadhyay
@ 2025-07-09 10:39 ` neeraj.upadhyay
  2025-07-09 10:39 ` [PATCH rcu 4/5] rcu/exp: Warn on QS requested on dying CPU neeraj.upadhyay
  2025-07-09 10:39 ` [PATCH rcu 5/5] rcu/exp: Warn on CPU lagging for too long within hotplug IPI's blindspot neeraj.upadhyay
  4 siblings, 0 replies; 6+ messages in thread
From: neeraj.upadhyay @ 2025-07-09 10:39 UTC
  To: rcu
  Cc: linux-kernel, paulmck, joelagnelf, frederic, boqun.feng, urezki,
	rostedt, mathieu.desnoyers, jiangshanlai, qiang.zhang1211,
	neeraj.iitr10, neeraj.upadhyay, Neeraj Upadhyay (AMD)

From: Frederic Weisbecker <frederic@kernel.org>

A CPU coming online checks for an ongoing grace period and reports
a quiescent state accordingly if needed. This special treatment, which
shortcuts the expedited IPI, originated as an optimization in the
following commit:

	338b0f760e84 ("rcu: Better hotplug handling for synchronize_sched_expedited()")

The point is to avoid an IPI while waiting for a CPU to become online
or to fail becoming offline.

However, this is pointless and even error-prone, for several reasons:

* If the CPU has been seen offline in the first round of scanning
  offline and idle CPUs, no IPI is even tried and the quiescent state
  is reported on behalf of the CPU.

* This means that if the IPI fails, the CPU has just become offline.
  It is therefore unlikely to become online right away, unless the CPU
  hotplug operation failed and rolled back, which is a rare event that
  can afford to wait a jiffy for a new IPI to be issued.

* Moreover, the "optimization" for a failed CPU hotplug-down operation
  only applies to !PREEMPT_RCU.

* This forcibly reports a quiescent state even if ->cpu_no_qs.b.exp is
  not set. As a result it can race with remote QS reports on the same
  rdp. This happens to be fine today, but it is an accident waiting to
  happen.

For all those reasons, remove this optimization, which is not worth
keeping around.
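
A sketch of the hazard from the last point (hypothetical interleaving,
for illustration only):

	CPU-up cleanup path                  remote exp QS report
	-------------------                  --------------------
	reports QS for the CPU even
	though ->cpu_no_qs.b.exp is
	not set                              reports QS for the same rdp
	                                     => two reporters racing on the
	                                        same quiescent state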

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
---
 kernel/rcu/tree.c     |  2 --
 kernel/rcu/tree_exp.h | 45 ++-----------------------------------------
 2 files changed, 2 insertions(+), 45 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 14d4499c6fc3..0bda23fec690 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -160,7 +160,6 @@ static void rcu_report_qs_rnp(unsigned long mask, struct rcu_node *rnp,
 			      unsigned long gps, unsigned long flags);
 static void invoke_rcu_core(void);
 static void rcu_report_exp_rdp(struct rcu_data *rdp);
-static void sync_sched_exp_online_cleanup(int cpu);
 static void check_cb_ovld_locked(struct rcu_data *rdp, struct rcu_node *rnp);
 static bool rcu_rdp_is_offloaded(struct rcu_data *rdp);
 static bool rcu_rdp_cpu_online(struct rcu_data *rdp);
@@ -4268,7 +4267,6 @@ int rcutree_online_cpu(unsigned int cpu)
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 	if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE)
 		return 0; /* Too early in boot for scheduler work. */
-	sync_sched_exp_online_cleanup(cpu);
 
 	// Stop-machine done, so allow nohz_full to disable tick.
 	tick_dep_clear(TICK_DEP_BIT_RCU);
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 2fa7aa9155bd..6058a734090c 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -751,12 +751,8 @@ static void rcu_exp_handler(void *unused)
 	struct task_struct *t = current;
 
 	/*
-	 * First, is there no need for a quiescent state from this CPU,
-	 * or is this CPU already looking for a quiescent state for the
-	 * current grace period?  If either is the case, just leave.
-	 * However, this should not happen due to the preemptible
-	 * sync_sched_exp_online_cleanup() implementation being a no-op,
-	 * so warn if this does happen.
+	 * WARN if the CPU is unexpectedly already looking for a
+	 * QS or has already reported one.
 	 */
 	ASSERT_EXCLUSIVE_WRITER_SCOPED(rdp->cpu_no_qs.b.exp);
 	if (WARN_ON_ONCE(!(READ_ONCE(rnp->expmask) & rdp->grpmask) ||
@@ -803,11 +799,6 @@ static void rcu_exp_handler(void *unused)
 	WARN_ON_ONCE(1);
 }
 
-/* PREEMPTION=y, so no PREEMPTION=n expedited grace period to clean up after. */
-static void sync_sched_exp_online_cleanup(int cpu)
-{
-}
-
 /*
  * Scan the current list of tasks blocked within RCU read-side critical
  * sections, printing out the tid of each that is blocking the current
@@ -885,38 +876,6 @@ static void rcu_exp_handler(void *unused)
 	rcu_exp_need_qs();
 }
 
-/* Send IPI for expedited cleanup if needed at end of CPU-hotplug operation. */
-static void sync_sched_exp_online_cleanup(int cpu)
-{
-	unsigned long flags;
-	int my_cpu;
-	struct rcu_data *rdp;
-	int ret;
-	struct rcu_node *rnp;
-
-	rdp = per_cpu_ptr(&rcu_data, cpu);
-	rnp = rdp->mynode;
-	my_cpu = get_cpu();
-	/* Quiescent state either not needed or already requested, leave. */
-	if (!(READ_ONCE(rnp->expmask) & rdp->grpmask) ||
-	    READ_ONCE(rdp->cpu_no_qs.b.exp)) {
-		put_cpu();
-		return;
-	}
-	/* Quiescent state needed on current CPU, so set it up locally. */
-	if (my_cpu == cpu) {
-		local_irq_save(flags);
-		rcu_exp_need_qs();
-		local_irq_restore(flags);
-		put_cpu();
-		return;
-	}
-	/* Quiescent state needed on some other CPU, send IPI. */
-	ret = smp_call_function_single(cpu, rcu_exp_handler, NULL, 0);
-	put_cpu();
-	WARN_ON_ONCE(ret);
-}
-
 /*
  * Because preemptible RCU does not exist, we never have to check for
  * tasks blocked within RCU read-side critical sections that are
-- 
2.40.1


* [PATCH rcu 4/5] rcu/exp: Warn on QS requested on dying CPU
  2025-07-09 10:39 [PATCH rcu 0/5] Expedited RCU changes for v6.17 neeraj.upadhyay
                   ` (2 preceding siblings ...)
  2025-07-09 10:39 ` [PATCH rcu 3/5] rcu/exp: Remove needless CPU up quiescent state report neeraj.upadhyay
@ 2025-07-09 10:39 ` neeraj.upadhyay
  2025-07-09 10:39 ` [PATCH rcu 5/5] rcu/exp: Warn on CPU lagging for too long within hotplug IPI's blindspot neeraj.upadhyay
  4 siblings, 0 replies; 6+ messages in thread
From: neeraj.upadhyay @ 2025-07-09 10:39 UTC
  To: rcu
  Cc: linux-kernel, paulmck, joelagnelf, frederic, boqun.feng, urezki,
	rostedt, mathieu.desnoyers, jiangshanlai, qiang.zhang1211,
	neeraj.iitr10, neeraj.upadhyay, Neeraj Upadhyay (AMD)

From: Frederic Weisbecker <frederic@kernel.org>

It is not possible to send an IPI to a dying CPU that has passed the
CPUHP_TEARDOWN_CPU stage. Remaining unhandled IPIs are handled later,
at the CPUHP_AP_SMPCFD_DYING stage, by stop machine. This is the last
opportunity for the RCU exp handler to request an expedited quiescent
state. And the upcoming final context switch between stop machine and
idle must have reported that requested quiescent state.

Therefore, it should not be possible to observe a pending requested
expedited quiescent state when RCU finally stops watching the outgoing
CPU. Once IPIs are no longer possible, the QS for the target CPU will
be reported on its behalf by the RCU exp kworker.

Provide an assertion to verify those expectations.
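
For reference, a simplified sketch of the teardown sequence described
above:

	CPUHP_TEARDOWN_CPU         /* past this point, no IPI can be sent  */
	CPUHP_AP_SMPCFD_DYING      /* stop machine handles remaining IPIs: */
	                           /* last chance for rcu_exp_handler()    */
	stop machine -> idle       /* final context switch reports the QS  */
	rcutree_report_cpu_dead()  /* hence the new assertion:             */
	                           /* WARN_ON_ONCE(rdp->cpu_no_qs.b.exp)   */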

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
---
 kernel/rcu/tree.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 0bda23fec690..00c182b3f978 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4356,6 +4356,12 @@ void rcutree_report_cpu_dead(void)
 	 * may introduce a new READ-side while it is actually off the QS masks.
 	 */
 	lockdep_assert_irqs_disabled();
+	/*
+	 * CPUHP_AP_SMPCFD_DYING was the last call for rcu_exp_handler() execution.
+	 * The requested QS must have been reported on the last context switch
+	 * from stop machine to idle.
+	 */
+	WARN_ON_ONCE(rdp->cpu_no_qs.b.exp);
 	// Do any dangling deferred wakeups.
 	do_nocb_deferred_wakeup(rdp);
 
-- 
2.40.1


* [PATCH rcu 5/5] rcu/exp: Warn on CPU lagging for too long within hotplug IPI's blindspot
  2025-07-09 10:39 [PATCH rcu 0/5] Expedited RCU changes for v6.17 neeraj.upadhyay
                   ` (3 preceding siblings ...)
  2025-07-09 10:39 ` [PATCH rcu 4/5] rcu/exp: Warn on QS requested on dying CPU neeraj.upadhyay
@ 2025-07-09 10:39 ` neeraj.upadhyay
  4 siblings, 0 replies; 6+ messages in thread
From: neeraj.upadhyay @ 2025-07-09 10:39 UTC
  To: rcu
  Cc: linux-kernel, paulmck, joelagnelf, frederic, boqun.feng, urezki,
	rostedt, mathieu.desnoyers, jiangshanlai, qiang.zhang1211,
	neeraj.iitr10, neeraj.upadhyay, Neeraj Upadhyay (AMD)

From: Frederic Weisbecker <frederic@kernel.org>

A CPU in the middle of a hotplug operation can leave the RCU exp
kworker lagging if:

* The dying CPU is running after CPUHP_TEARDOWN_CPU but before
  rcutree_report_cpu_dead(). It is too late to send an IPI but RCU is
  still watching the CPU. Therefore the exp kworker can only wait for
  the target to reach rcutree_report_cpu_dead().

* The booting CPU is running after rcutree_report_cpu_starting() but
  before set_cpu_online(). RCU is watching the CPU but it is too early
  to be able to send an IPI. Therefore the exp kworker can only wait
  until it observes the CPU as officially online.

Such a lag is expected to be very short. However, #VMEXITs and other
hazards can get in the way. Report long delays; 50 jiffies is already
considered a high threshold.
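
A sketch of the two blindspot windows described above (simplified):

	CPU going down:
		CPUHP_TEARDOWN_CPU ... rcutree_report_cpu_dead()
		/* too late for an IPI, RCU still watching: */
		/* the exp kworker can only wait and retry  */

	CPU coming up:
		rcutree_report_cpu_starting() ... set_cpu_online()
		/* RCU watching, too early for an IPI:     */
		/* the exp kworker can only wait and retry */

The new warning fires once either window exceeds 50 retries (about 50
jiffies).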

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
[neeraj: Change max retries to 50 jiffies]
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
---
 kernel/rcu/tree_exp.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 6058a734090c..076ad61e42f4 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -406,8 +406,18 @@ static void __sync_rcu_exp_select_node_cpus(struct rcu_exp_work *rewp)
 	for_each_leaf_node_cpu_mask(rnp, cpu, mask_ofl_ipi) {
 		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
 		unsigned long mask = rdp->grpmask;
+		int nr_retries = 0;
 
 retry_ipi:
+		/*
+		 * In case of retrying, the CPU is lagging either:
+		 *
+		 * - between CPUHP_TEARDOWN_CPU and rcutree_report_cpu_dead()
+		 * or:
+		 * - between rcutree_report_cpu_starting() and set_cpu_online()
+		 */
+		WARN_ON_ONCE(nr_retries++ > 50);
+
 		if (rcu_watching_snap_stopped_since(rdp, rdp->exp_watching_snap)) {
 			mask_ofl_test |= mask;
 			continue;
-- 
2.40.1

