rcu.vger.kernel.org archive mirror
* [PATCH 2/5] rcu/exp: Remove confusing needless full barrier on task unblock
  2025-03-14 14:36 [PATCH 0/5 v2] rcu/exp updates Frederic Weisbecker
@ 2025-03-14 14:36 ` Frederic Weisbecker
  2025-03-18 17:18   ` Paul E. McKenney
  0 siblings, 1 reply; 13+ messages in thread
From: Frederic Weisbecker @ 2025-03-14 14:36 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
	Paul E . McKenney, Uladzislau Rezki, Zqiang, rcu

A full memory barrier in the RCU-PREEMPT task unblock path is
advertised to order the context switch (or rather the accesses prior to
rcu_read_unlock()) with the expedited grace period fastpath.

However the grace period cannot complete without rcu_report_exp_rnp()
being called with the node locked. This reports the quiescent
state in a fully ordered fashion against the updater's accesses thanks to:

1) The READ-SIDE smp_mb__after_unlock_lock() barrier across nodes
   locking while propagating the QS up to the root.

2) The UPDATE-SIDE smp_mb__after_unlock_lock() barrier while holding
   the root rnp to wait/check for the GP completion.

3) The (perhaps redundant given steps 1) and 2)) smp_mb() in rcu_seq_end()
   before the grace period completes.

This makes the explicit barrier in this place superfluous. Therefore
remove it as it is confusing.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/rcu/tree_plugin.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 3c0bbbbb686f..d51cc7a5dfc7 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -534,7 +534,6 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
 		WARN_ON_ONCE(rnp->completedqs == rnp->gp_seq &&
 			     (!empty_norm || rnp->qsmask));
 		empty_exp = sync_rcu_exp_done(rnp);
-		smp_mb(); /* ensure expedited fastpath sees end of RCU c-s. */
 		np = rcu_next_node_entry(t, rnp);
 		list_del_init(&t->rcu_node_entry);
 		t->rcu_blocked_node = NULL;
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/5] rcu/exp: Remove confusing needless full barrier on task unblock
  2025-03-14 14:36 ` [PATCH 2/5] rcu/exp: Remove confusing needless full barrier on task unblock Frederic Weisbecker
@ 2025-03-18 17:18   ` Paul E. McKenney
  2025-03-19  9:01     ` Frederic Weisbecker
  0 siblings, 1 reply; 13+ messages in thread
From: Paul E. McKenney @ 2025-03-18 17:18 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
	Uladzislau Rezki, Zqiang, rcu

On Fri, Mar 14, 2025 at 03:36:39PM +0100, Frederic Weisbecker wrote:
> A full memory barrier in the RCU-PREEMPT task unblock path is
> advertised to order the context switch (or rather the accesses prior to
> rcu_read_unlock()) with the expedited grace period fastpath.
> 
> However the grace period cannot complete without rcu_report_exp_rnp()
> being called with the node locked. This reports the quiescent
> state in a fully ordered fashion against the updater's accesses thanks to:
> 
> 1) The READ-SIDE smp_mb__after_unlock_lock() barrier across nodes
>    locking while propagating the QS up to the root.
> 
> 2) The UPDATE-SIDE smp_mb__after_unlock_lock() barrier while holding
>    the root rnp to wait/check for the GP completion.
> 
> 3) The (perhaps redundant given steps 1) and 2)) smp_mb() in rcu_seq_end()
>    before the grace period completes.
> 
> This makes the explicit barrier in this place superfluous. Therefore
> remove it as it is confusing.
> 
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>

Still cannot see a problem with this, but still a bit nervous.

Acked-by: Paul E. McKenney <paulmck@kernel.org>

> ---
>  kernel/rcu/tree_plugin.h | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 3c0bbbbb686f..d51cc7a5dfc7 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -534,7 +534,6 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
>  		WARN_ON_ONCE(rnp->completedqs == rnp->gp_seq &&
>  			     (!empty_norm || rnp->qsmask));
>  		empty_exp = sync_rcu_exp_done(rnp);
> -		smp_mb(); /* ensure expedited fastpath sees end of RCU c-s. */
>  		np = rcu_next_node_entry(t, rnp);
>  		list_del_init(&t->rcu_node_entry);
>  		t->rcu_blocked_node = NULL;
> -- 
> 2.48.1
> 


* Re: [PATCH 2/5] rcu/exp: Remove confusing needless full barrier on task unblock
  2025-03-18 17:18   ` Paul E. McKenney
@ 2025-03-19  9:01     ` Frederic Weisbecker
  2025-03-19 14:03       ` Paul E. McKenney
  0 siblings, 1 reply; 13+ messages in thread
From: Frederic Weisbecker @ 2025-03-19  9:01 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
	Uladzislau Rezki, Zqiang, rcu

Le Tue, Mar 18, 2025 at 10:18:12AM -0700, Paul E. McKenney a écrit :
> On Fri, Mar 14, 2025 at 03:36:39PM +0100, Frederic Weisbecker wrote:
> > A full memory barrier in the RCU-PREEMPT task unblock path is
> > advertised to order the context switch (or rather the accesses prior to
> > rcu_read_unlock()) with the expedited grace period fastpath.
> > 
> > However the grace period cannot complete without rcu_report_exp_rnp()
> > being called with the node locked. This reports the quiescent
> > state in a fully ordered fashion against the updater's accesses thanks to:
> > 
> > 1) The READ-SIDE smp_mb__after_unlock_lock() barrier across nodes
> >    locking while propagating the QS up to the root.
> > 
> > 2) The UPDATE-SIDE smp_mb__after_unlock_lock() barrier while holding
> >    the root rnp to wait/check for the GP completion.
> > 
> > 3) The (perhaps redundant given steps 1) and 2)) smp_mb() in rcu_seq_end()
> >    before the grace period completes.
> > 
> > This makes the explicit barrier in this place superfluous. Therefore
> > remove it as it is confusing.
> > 
> > Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> 
> Still cannot see a problem with this, but still a bit nervous.

Where is the challenge in life if we manage to fall asleep within a minute
at bedtime?

> 
> Acked-by: Paul E. McKenney <paulmck@kernel.org>

Thanks!


* Re: [PATCH 2/5] rcu/exp: Remove confusing needless full barrier on task unblock
  2025-03-19  9:01     ` Frederic Weisbecker
@ 2025-03-19 14:03       ` Paul E. McKenney
  0 siblings, 0 replies; 13+ messages in thread
From: Paul E. McKenney @ 2025-03-19 14:03 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
	Uladzislau Rezki, Zqiang, rcu

On Wed, Mar 19, 2025 at 10:01:36AM +0100, Frederic Weisbecker wrote:
> Le Tue, Mar 18, 2025 at 10:18:12AM -0700, Paul E. McKenney a écrit :
> > On Fri, Mar 14, 2025 at 03:36:39PM +0100, Frederic Weisbecker wrote:
> > > A full memory barrier in the RCU-PREEMPT task unblock path is
> > > advertised to order the context switch (or rather the accesses prior to
> > > rcu_read_unlock()) with the expedited grace period fastpath.
> > > 
> > > However the grace period cannot complete without rcu_report_exp_rnp()
> > > being called with the node locked. This reports the quiescent
> > > state in a fully ordered fashion against the updater's accesses thanks to:
> > > 
> > > 1) The READ-SIDE smp_mb__after_unlock_lock() barrier across nodes
> > >    locking while propagating the QS up to the root.
> > > 
> > > 2) The UPDATE-SIDE smp_mb__after_unlock_lock() barrier while holding
> > >    the root rnp to wait/check for the GP completion.
> > > 
> > > 3) The (perhaps redundant given steps 1) and 2)) smp_mb() in rcu_seq_end()
> > >    before the grace period completes.
> > > 
> > > This makes the explicit barrier in this place superfluous. Therefore
> > > remove it as it is confusing.
> > > 
> > > Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> > 
> > Still cannot see a problem with this, but still a bit nervous.
> 
> Where is the challenge in life if we manage to fall asleep within a minute
> at bedtime?

;-) ;-) ;-)

Suppose that there was an issue with this that we are somehow not spotting.
How would you go about debugging it?

							Thanx, Paul

> > Acked-by: Paul E. McKenney <paulmck@kernel.org>
> 
> Thanks!


* [PATCH 0/5 v3] rcu/exp updates
@ 2025-04-29 13:42 Frederic Weisbecker
  2025-04-29 13:43 ` [PATCH 1/5] rcu/exp: Protect against early QS report Frederic Weisbecker
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Frederic Weisbecker @ 2025-04-29 13:42 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
	Paul E . McKenney, Uladzislau Rezki, Zqiang, rcu

Hi,

No real updates in this set. Just collected a few tags.

Thanks.

Frederic Weisbecker (5):
  rcu/exp: Protect against early QS report
  rcu/exp: Remove confusing needless full barrier on task unblock
  rcu/exp: Remove needless CPU up quiescent state report
  rcu/exp: Warn on QS requested on dying CPU
  rcu/exp: Warn on CPU lagging for too long within hotplug IPI's
    blindspot

 kernel/rcu/tree.c        |  8 +++--
 kernel/rcu/tree_exp.h    | 69 +++++++++++-----------------------------
 kernel/rcu/tree_plugin.h |  1 -
 3 files changed, 25 insertions(+), 53 deletions(-)

-- 
2.48.1



* [PATCH 1/5] rcu/exp: Protect against early QS report
  2025-04-29 13:42 [PATCH 0/5 v3] rcu/exp updates Frederic Weisbecker
@ 2025-04-29 13:43 ` Frederic Weisbecker
  2025-04-29 13:43 ` [PATCH 2/5] rcu/exp: Remove confusing needless full barrier on task unblock Frederic Weisbecker
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Frederic Weisbecker @ 2025-04-29 13:43 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
	Paul E . McKenney, Uladzislau Rezki, Zqiang, rcu

When a grace period is started, the ->expmask of each node is set up
from sync_exp_reset_tree(). Then later on each leaf node also
initializes its ->exp_tasks pointer.

This means that the initialization of the quiescent state of a node and
the initialization of its blocking tasks happen with an unlocked-node
gap in between.

It happens to be fine because nothing is expected to report an exp
quiescent state within this gap, since no IPIs have been issued yet and
every rdp's ->cpu_no_qs.b.exp should be false.

However if it were to happen by accident, the quiescent state could be
reported and propagated while ignoring tasks that blocked _before_ the
start of the grace period.

Prevent such trouble from happening in the future by initializing both
the quiescent state mask to report and the blocked tasks head from
within the same locked-node block.

If a task blocks within an RCU read-side critical section before
sync_exp_reset_tree() is called and is then unblocked between
sync_exp_reset_tree() and __sync_rcu_exp_select_node_cpus(), the QS
won't be reported because no RCU exp IPI has been issued to request it
through the setting of srdp->cpu_no_qs.b.exp.

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/rcu/tree_exp.h | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index c36c7d5575ca..2fa7aa9155bd 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -141,6 +141,13 @@ static void __maybe_unused sync_exp_reset_tree(void)
 		raw_spin_lock_irqsave_rcu_node(rnp, flags);
 		WARN_ON_ONCE(rnp->expmask);
 		WRITE_ONCE(rnp->expmask, rnp->expmaskinit);
+		/*
+		 * Need to wait for any blocked tasks as well.	Note that
+		 * additional blocking tasks will also block the expedited GP
+		 * until such time as the ->expmask bits are cleared.
+		 */
+		if (rcu_is_leaf_node(rnp) && rcu_preempt_has_tasks(rnp))
+			WRITE_ONCE(rnp->exp_tasks, rnp->blkd_tasks.next);
 		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 	}
 }
@@ -393,13 +400,6 @@ static void __sync_rcu_exp_select_node_cpus(struct rcu_exp_work *rewp)
 	}
 	mask_ofl_ipi = rnp->expmask & ~mask_ofl_test;
 
-	/*
-	 * Need to wait for any blocked tasks as well.	Note that
-	 * additional blocking tasks will also block the expedited GP
-	 * until such time as the ->expmask bits are cleared.
-	 */
-	if (rcu_preempt_has_tasks(rnp))
-		WRITE_ONCE(rnp->exp_tasks, rnp->blkd_tasks.next);
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 
 	/* IPI the remaining CPUs for expedited quiescent state. */
-- 
2.48.1



* [PATCH 2/5] rcu/exp: Remove confusing needless full barrier on task unblock
  2025-04-29 13:42 [PATCH 0/5 v3] rcu/exp updates Frederic Weisbecker
  2025-04-29 13:43 ` [PATCH 1/5] rcu/exp: Protect against early QS report Frederic Weisbecker
@ 2025-04-29 13:43 ` Frederic Weisbecker
  2025-04-29 13:43 ` [PATCH 3/5] rcu/exp: Remove needless CPU up quiescent state report Frederic Weisbecker
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Frederic Weisbecker @ 2025-04-29 13:43 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
	Paul E . McKenney, Uladzislau Rezki, Zqiang, rcu

A full memory barrier in the RCU-PREEMPT task unblock path is
advertised to order the context switch (or rather the accesses prior to
rcu_read_unlock()) with the expedited grace period fastpath.

However the grace period cannot complete without rcu_report_exp_rnp()
being called with the node locked. This reports the quiescent
state in a fully ordered fashion against the updater's accesses thanks to:

1) The READ-SIDE smp_mb__after_unlock_lock() barrier across nodes
   locking while propagating the QS up to the root.

2) The UPDATE-SIDE smp_mb__after_unlock_lock() barrier while holding
   the root rnp to wait/check for the GP completion.

3) The (perhaps redundant given steps 1) and 2)) smp_mb() in rcu_seq_end()
   before the grace period completes.

This makes the explicit barrier in this place superfluous. Therefore
remove it as it is confusing.

Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/rcu/tree_plugin.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 3c0bbbbb686f..d51cc7a5dfc7 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -534,7 +534,6 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
 		WARN_ON_ONCE(rnp->completedqs == rnp->gp_seq &&
 			     (!empty_norm || rnp->qsmask));
 		empty_exp = sync_rcu_exp_done(rnp);
-		smp_mb(); /* ensure expedited fastpath sees end of RCU c-s. */
 		np = rcu_next_node_entry(t, rnp);
 		list_del_init(&t->rcu_node_entry);
 		t->rcu_blocked_node = NULL;
-- 
2.48.1



* [PATCH 3/5] rcu/exp: Remove needless CPU up quiescent state report
  2025-04-29 13:42 [PATCH 0/5 v3] rcu/exp updates Frederic Weisbecker
  2025-04-29 13:43 ` [PATCH 1/5] rcu/exp: Protect against early QS report Frederic Weisbecker
  2025-04-29 13:43 ` [PATCH 2/5] rcu/exp: Remove confusing needless full barrier on task unblock Frederic Weisbecker
@ 2025-04-29 13:43 ` Frederic Weisbecker
  2025-04-29 13:43 ` [PATCH 4/5] rcu/exp: Warn on QS requested on dying CPU Frederic Weisbecker
  2025-04-29 13:43 ` [PATCH 5/5] rcu/exp: Warn on CPU lagging for too long within hotplug IPI's blindspot Frederic Weisbecker
  4 siblings, 0 replies; 13+ messages in thread
From: Frederic Weisbecker @ 2025-04-29 13:43 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
	Paul E . McKenney, Uladzislau Rezki, Zqiang, rcu

A CPU coming online checks for an ongoing grace period and reports
a quiescent state accordingly if needed. This special treatment, which
shortcuts the expedited IPI, originated as an optimization in the
following commit:

	338b0f760e84 ("rcu: Better hotplug handling for synchronize_sched_expedited()")

The point is to avoid an IPI while waiting for a CPU to become online
or to fail to go offline.

However this is pointless and even error prone for several reasons:

* If the CPU has been seen offline in the first round scanning offline
  and idle CPUs, no IPI is even tried and the quiescent state is
  reported on behalf of the CPU.

* This means that if the IPI fails, the CPU just became offline. So
  it's unlikely to become online right away, unless the CPU hotplug
  operation failed and rolled back, which is a rare event that can
  afford to wait a jiffy for a new IPI to be issued.

* Moreover, the "optimization" for a failing CPU hotplug down operation
  only applies to !PREEMPT_RCU.

* This forcibly reports a quiescent state even if ->cpu_no_qs.b.exp is
  not set. As a result it can race with remote QS reports on the same
  rdp. Fortunately it happens to be OK, but an accident is waiting to
  happen.

For all those reasons, remove this optimization, which does not look
worth keeping around.

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/rcu/tree.c     |  2 --
 kernel/rcu/tree_exp.h | 45 ++-----------------------------------------
 2 files changed, 2 insertions(+), 45 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 90d0214c05c7..5be292667385 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -160,7 +160,6 @@ static void rcu_report_qs_rnp(unsigned long mask, struct rcu_node *rnp,
 			      unsigned long gps, unsigned long flags);
 static void invoke_rcu_core(void);
 static void rcu_report_exp_rdp(struct rcu_data *rdp);
-static void sync_sched_exp_online_cleanup(int cpu);
 static void check_cb_ovld_locked(struct rcu_data *rdp, struct rcu_node *rnp);
 static bool rcu_rdp_is_offloaded(struct rcu_data *rdp);
 static bool rcu_rdp_cpu_online(struct rcu_data *rdp);
@@ -4264,7 +4263,6 @@ int rcutree_online_cpu(unsigned int cpu)
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 	if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE)
 		return 0; /* Too early in boot for scheduler work. */
-	sync_sched_exp_online_cleanup(cpu);
 
 	// Stop-machine done, so allow nohz_full to disable tick.
 	tick_dep_clear(TICK_DEP_BIT_RCU);
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 2fa7aa9155bd..6058a734090c 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -751,12 +751,8 @@ static void rcu_exp_handler(void *unused)
 	struct task_struct *t = current;
 
 	/*
-	 * First, is there no need for a quiescent state from this CPU,
-	 * or is this CPU already looking for a quiescent state for the
-	 * current grace period?  If either is the case, just leave.
-	 * However, this should not happen due to the preemptible
-	 * sync_sched_exp_online_cleanup() implementation being a no-op,
-	 * so warn if this does happen.
+	 * WARN if the CPU is unexpectedly already looking for a
+	 * QS or has already reported one.
 	 */
 	ASSERT_EXCLUSIVE_WRITER_SCOPED(rdp->cpu_no_qs.b.exp);
 	if (WARN_ON_ONCE(!(READ_ONCE(rnp->expmask) & rdp->grpmask) ||
@@ -803,11 +799,6 @@ static void rcu_exp_handler(void *unused)
 	WARN_ON_ONCE(1);
 }
 
-/* PREEMPTION=y, so no PREEMPTION=n expedited grace period to clean up after. */
-static void sync_sched_exp_online_cleanup(int cpu)
-{
-}
-
 /*
  * Scan the current list of tasks blocked within RCU read-side critical
  * sections, printing out the tid of each that is blocking the current
@@ -885,38 +876,6 @@ static void rcu_exp_handler(void *unused)
 	rcu_exp_need_qs();
 }
 
-/* Send IPI for expedited cleanup if needed at end of CPU-hotplug operation. */
-static void sync_sched_exp_online_cleanup(int cpu)
-{
-	unsigned long flags;
-	int my_cpu;
-	struct rcu_data *rdp;
-	int ret;
-	struct rcu_node *rnp;
-
-	rdp = per_cpu_ptr(&rcu_data, cpu);
-	rnp = rdp->mynode;
-	my_cpu = get_cpu();
-	/* Quiescent state either not needed or already requested, leave. */
-	if (!(READ_ONCE(rnp->expmask) & rdp->grpmask) ||
-	    READ_ONCE(rdp->cpu_no_qs.b.exp)) {
-		put_cpu();
-		return;
-	}
-	/* Quiescent state needed on current CPU, so set it up locally. */
-	if (my_cpu == cpu) {
-		local_irq_save(flags);
-		rcu_exp_need_qs();
-		local_irq_restore(flags);
-		put_cpu();
-		return;
-	}
-	/* Quiescent state needed on some other CPU, send IPI. */
-	ret = smp_call_function_single(cpu, rcu_exp_handler, NULL, 0);
-	put_cpu();
-	WARN_ON_ONCE(ret);
-}
-
 /*
  * Because preemptible RCU does not exist, we never have to check for
  * tasks blocked within RCU read-side critical sections that are
-- 
2.48.1



* [PATCH 4/5] rcu/exp: Warn on QS requested on dying CPU
  2025-04-29 13:42 [PATCH 0/5 v3] rcu/exp updates Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2025-04-29 13:43 ` [PATCH 3/5] rcu/exp: Remove needless CPU up quiescent state report Frederic Weisbecker
@ 2025-04-29 13:43 ` Frederic Weisbecker
  2025-04-29 13:43 ` [PATCH 5/5] rcu/exp: Warn on CPU lagging for too long within hotplug IPI's blindspot Frederic Weisbecker
  4 siblings, 0 replies; 13+ messages in thread
From: Frederic Weisbecker @ 2025-04-29 13:43 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
	Paul E . McKenney, Uladzislau Rezki, Zqiang, rcu

It is not possible to send an IPI to a dying CPU that has passed the
CPUHP_TEARDOWN_CPU stage. Remaining unhandled IPIs are handled later at
the CPUHP_AP_SMPCFD_DYING stage by stop machine. This is the last
opportunity for the RCU exp handler to request an expedited quiescent
state. And the upcoming final context switch between stop machine and
idle must have reported the requested quiescent state.

Therefore, it should not be possible to observe a pending requested
expedited quiescent state when RCU finally stops watching the outgoing
CPU. Once IPIs aren't possible anymore, the QS for the target CPU will
be reported on its behalf by the RCU exp kworker.

Provide an assertion to verify those expectations.

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/rcu/tree.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 5be292667385..e14bdbb658a9 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4352,6 +4352,12 @@ void rcutree_report_cpu_dead(void)
 	 * may introduce a new READ-side while it is actually off the QS masks.
 	 */
 	lockdep_assert_irqs_disabled();
+	/*
+	 * CPUHP_AP_SMPCFD_DYING was the last call for rcu_exp_handler() execution.
+	 * The requested QS must have been reported on the last context switch
+	 * from stop machine to idle.
+	 */
+	WARN_ON_ONCE(rdp->cpu_no_qs.b.exp);
 	// Do any dangling deferred wakeups.
 	do_nocb_deferred_wakeup(rdp);
 
-- 
2.48.1



* [PATCH 5/5] rcu/exp: Warn on CPU lagging for too long within hotplug IPI's blindspot
  2025-04-29 13:42 [PATCH 0/5 v3] rcu/exp updates Frederic Weisbecker
                   ` (3 preceding siblings ...)
  2025-04-29 13:43 ` [PATCH 4/5] rcu/exp: Warn on QS requested on dying CPU Frederic Weisbecker
@ 2025-04-29 13:43 ` Frederic Weisbecker
  2025-04-30  2:20   ` Joel Fernandes
  4 siblings, 1 reply; 13+ messages in thread
From: Frederic Weisbecker @ 2025-04-29 13:43 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
	Paul E . McKenney, Uladzislau Rezki, Zqiang, rcu

A CPU undergoing a hotplug operation can leave the RCU exp kworker
lagging if:

* The dying CPU is running after CPUHP_TEARDOWN_CPU but before
  rcutree_report_cpu_dead(). It is too late to send an IPI but RCU is
  still watching the CPU. Therefore the exp kworker can only wait for
  the target to reach rcutree_report_cpu_dead().

* The booting CPU is running after rcutree_report_cpu_starting() but
  before set_cpu_online(). RCU is watching the CPU but it is too early
  to be able to send an IPI. Therefore the exp kworker can only wait
  until it observes the CPU as officially online.

Such a lag is expected to be very short. However #VMEXIT and other
hazards can get in the way. Report long delays; 10 jiffies is already
considered a high threshold.

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/rcu/tree_exp.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 6058a734090c..87a44423927d 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -406,8 +406,18 @@ static void __sync_rcu_exp_select_node_cpus(struct rcu_exp_work *rewp)
 	for_each_leaf_node_cpu_mask(rnp, cpu, mask_ofl_ipi) {
 		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
 		unsigned long mask = rdp->grpmask;
+		int nr_retries = 0;
 
 retry_ipi:
+		/*
+		 * If retrying, the CPU is lagging either:
+		 *
+		 * - between CPUHP_TEARDOWN_CPU and rcutree_report_cpu_dead()
+		 * or:
+		 * - between rcutree_report_cpu_starting() and set_cpu_online()
+		 */
+		WARN_ON_ONCE(nr_retries++ > 10);
+
 		if (rcu_watching_snap_stopped_since(rdp, rdp->exp_watching_snap)) {
 			mask_ofl_test |= mask;
 			continue;
-- 
2.48.1



* Re: [PATCH 5/5] rcu/exp: Warn on CPU lagging for too long within hotplug IPI's blindspot
  2025-04-29 13:43 ` [PATCH 5/5] rcu/exp: Warn on CPU lagging for too long within hotplug IPI's blindspot Frederic Weisbecker
@ 2025-04-30  2:20   ` Joel Fernandes
  2025-05-01 21:49     ` Frederic Weisbecker
  0 siblings, 1 reply; 13+ messages in thread
From: Joel Fernandes @ 2025-04-30  2:20 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Boqun Feng, Neeraj Upadhyay, Paul E McKenney,
	Uladzislau Rezki, Zqiang, rcu, Frederic Weisbecker



> On Apr 29, 2025, at 9:44 AM, Frederic Weisbecker <frederic@kernel.org> wrote:

Hi Frederic,
These all look good to me. Do you wish for
these to go into the upcoming merge window or
can I push them to a for-Neeraj branch as he
is doing the merge window after the next?

Thanks,

 - Joel


> 
> A CPU undergoing a hotplug operation can leave the RCU exp kworker lagging if:
> 
> * The dying CPU is running after CPUHP_TEARDOWN_CPU but before
>  rcutree_report_cpu_dead(). It is too late to send an IPI but RCU is
>  still watching the CPU. Therefore the exp kworker can only wait for
>  the target to reach rcutree_report_cpu_dead().
> 
> * The booting CPU is running after rcutree_report_cpu_starting() but
>  before set_cpu_online(). RCU is watching the CPU but it is too early
>  to be able to send an IPI. Therefore the exp kworker can only wait
>  until it observes the CPU as officially online.
> 
> Such a lag is expected to be very short. However #VMEXIT and other
> hazards can get in the way. Report long delays; 10 jiffies is already
> considered a high threshold.
> 
> Reported-by: Paul E. McKenney <paulmck@kernel.org>
> Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
> kernel/rcu/tree_exp.h | 10 ++++++++++
> 1 file changed, 10 insertions(+)
> 
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index 6058a734090c..87a44423927d 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -406,8 +406,18 @@ static void __sync_rcu_exp_select_node_cpus(struct rcu_exp_work *rewp)
>    for_each_leaf_node_cpu_mask(rnp, cpu, mask_ofl_ipi) {
>        struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
>        unsigned long mask = rdp->grpmask;
> +        int nr_retries = 0;
> 
> retry_ipi:
> +        /*
> +         * If retrying, the CPU is lagging either:
> +         *
> +         * - between CPUHP_TEARDOWN_CPU and rcutree_report_cpu_dead()
> +         * or:
> +         * - between rcutree_report_cpu_starting() and set_cpu_online()
> +         */
> +        WARN_ON_ONCE(nr_retries++ > 10);
> +
>        if (rcu_watching_snap_stopped_since(rdp, rdp->exp_watching_snap)) {
>            mask_ofl_test |= mask;
>            continue;
> --
> 2.48.1
> 


* Re: [PATCH 5/5] rcu/exp: Warn on CPU lagging for too long within hotplug IPI's blindspot
  2025-04-30  2:20   ` Joel Fernandes
@ 2025-05-01 21:49     ` Frederic Weisbecker
  2025-05-02 15:08       ` Joel Fernandes
  0 siblings, 1 reply; 13+ messages in thread
From: Frederic Weisbecker @ 2025-05-01 21:49 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: LKML, Boqun Feng, Neeraj Upadhyay, Paul E McKenney,
	Uladzislau Rezki, Zqiang, rcu

Le Wed, Apr 30, 2025 at 02:20:31AM +0000, Joel Fernandes a écrit :
> 
> 
> > On Apr 29, 2025, at 9:44 AM, Frederic Weisbecker <frederic@kernel.org> wrote:
> 
> Hi Frederic,
> These all look good to me. Do you wish for
> these to go into the upcoming merge window or
> can I push them to a for-Neeraj branch as he
> is doing the merge window after the next?

Your call :-)

If that can help, there is nothing urgent in there.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs


* Re: [PATCH 5/5] rcu/exp: Warn on CPU lagging for too long within hotplug IPI's blindspot
  2025-05-01 21:49     ` Frederic Weisbecker
@ 2025-05-02 15:08       ` Joel Fernandes
  0 siblings, 0 replies; 13+ messages in thread
From: Joel Fernandes @ 2025-05-02 15:08 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Boqun Feng, Neeraj Upadhyay, Paul E McKenney,
	Uladzislau Rezki, Zqiang, rcu



On 5/1/2025 5:49 PM, Frederic Weisbecker wrote:
> Le Wed, Apr 30, 2025 at 02:20:31AM +0000, Joel Fernandes a écrit :
>>
>>
>>> On Apr 29, 2025, at 9:44 AM, Frederic Weisbecker <frederic@kernel.org> wrote:
>>
>> Hi Frederic,
>> These all look good to me. Do you wish for
>> these to go into the upcoming merge window or
>> can I push them to a for-Neeraj branch as he
>> is doing the merge window after the next?
> 
> Your call :-)
> 
> If that can help, there is nothing urgent in there.
> 
> Thanks.
> 

Since it's not urgent, I will put these in a new branch for Neeraj :-). Worst
case we get even more eyes on it ;-)

thanks,

 - Joel


end of thread, other threads:[~2025-05-02 15:08 UTC | newest]

Thread overview: 13+ messages
2025-04-29 13:42 [PATCH 0/5 v3] rcu/exp updates Frederic Weisbecker
2025-04-29 13:43 ` [PATCH 1/5] rcu/exp: Protect against early QS report Frederic Weisbecker
2025-04-29 13:43 ` [PATCH 2/5] rcu/exp: Remove confusing needless full barrier on task unblock Frederic Weisbecker
2025-04-29 13:43 ` [PATCH 3/5] rcu/exp: Remove needless CPU up quiescent state report Frederic Weisbecker
2025-04-29 13:43 ` [PATCH 4/5] rcu/exp: Warn on QS requested on dying CPU Frederic Weisbecker
2025-04-29 13:43 ` [PATCH 5/5] rcu/exp: Warn on CPU lagging for too long within hotplug IPI's blindspot Frederic Weisbecker
2025-04-30  2:20   ` Joel Fernandes
2025-05-01 21:49     ` Frederic Weisbecker
2025-05-02 15:08       ` Joel Fernandes
  -- strict thread matches above, loose matches on Subject: below --
2025-03-14 14:36 [PATCH 0/5 v2] rcu/exp updates Frederic Weisbecker
2025-03-14 14:36 ` [PATCH 2/5] rcu/exp: Remove confusing needless full barrier on task unblock Frederic Weisbecker
2025-03-18 17:18   ` Paul E. McKenney
2025-03-19  9:01     ` Frederic Weisbecker
2025-03-19 14:03       ` Paul E. McKenney
