* [PATCH 0/3] rcu/exp updates
@ 2025-02-13 23:25 Frederic Weisbecker
2025-02-13 23:25 ` [PATCH 1/3] rcu/exp: Protect against early QS report Frederic Weisbecker
` (2 more replies)
0 siblings, 3 replies; 25+ messages in thread
From: Frederic Weisbecker @ 2025-02-13 23:25 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
Paul E . McKenney, Uladzislau Rezki, Zqiang, rcu
Here are a few updates for expedited RCU. Some were inspired by debates
with Paul while he was investigating the issue that was eventually
fixed by "rcu: Fix get_state_synchronize_rcu_full() GP-start detection".
Frederic Weisbecker (3):
rcu/exp: Protect against early QS report
rcu/exp: Remove confusing needless full barrier on task unblock
rcu/exp: Remove needless CPU up quiescent state report
kernel/rcu/tree.c | 2 --
kernel/rcu/tree_exp.h | 59 ++++++----------------------------------
kernel/rcu/tree_plugin.h | 1 -
3 files changed, 9 insertions(+), 53 deletions(-)
--
2.46.0
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 1/3] rcu/exp: Protect against early QS report
2025-02-13 23:25 [PATCH 0/3] rcu/exp updates Frederic Weisbecker
@ 2025-02-13 23:25 ` Frederic Weisbecker
2025-02-14 9:10 ` Paul E. McKenney
2025-02-13 23:25 ` [PATCH 2/3] rcu/exp: Remove confusing needless full barrier on task unblock Frederic Weisbecker
2025-02-13 23:25 ` [PATCH 3/3] rcu/exp: Remove needless CPU up quiescent state report Frederic Weisbecker
2 siblings, 1 reply; 25+ messages in thread
From: Frederic Weisbecker @ 2025-02-13 23:25 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
Paul E . McKenney, Uladzislau Rezki, Zqiang, rcu
When a grace period is started, the ->expmask of each node is set up
from sync_exp_reset_tree(). Then later on, each leaf node also initializes
its ->exp_tasks pointer.
This means that the initialization of a node's quiescent state mask and
the initialization of its blocked tasks happen in two distinct node-locked
sections, with an unlocked gap in between.
This happens to be fine because nothing is expected to report an exp
quiescent state within this gap, since no IPIs have been issued yet and
every rdp's ->cpu_no_qs.b.exp should be false.
However if it were to happen by accident, the quiescent state could be
reported and propagated while ignoring tasks that blocked _before_ the
start of the grace period.
Prevent such trouble from happening in the future by initializing both
the quiescent-state mask to report and the blocked-tasks list head from
within the same node-locked block.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/rcu/tree_exp.h | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 8d4895c854c5..caff16e441d1 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -141,6 +141,13 @@ static void __maybe_unused sync_exp_reset_tree(void)
raw_spin_lock_irqsave_rcu_node(rnp, flags);
WARN_ON_ONCE(rnp->expmask);
WRITE_ONCE(rnp->expmask, rnp->expmaskinit);
+ /*
+ * Need to wait for any blocked tasks as well. Note that
+ * additional blocking tasks will also block the expedited GP
+ * until such time as the ->expmask bits are cleared.
+ */
+ if (rcu_is_leaf_node(rnp) && rcu_preempt_has_tasks(rnp))
+ WRITE_ONCE(rnp->exp_tasks, rnp->blkd_tasks.next);
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
}
}
@@ -393,13 +400,6 @@ static void __sync_rcu_exp_select_node_cpus(struct rcu_exp_work *rewp)
}
mask_ofl_ipi = rnp->expmask & ~mask_ofl_test;
- /*
- * Need to wait for any blocked tasks as well. Note that
- * additional blocking tasks will also block the expedited GP
- * until such time as the ->expmask bits are cleared.
- */
- if (rcu_preempt_has_tasks(rnp))
- WRITE_ONCE(rnp->exp_tasks, rnp->blkd_tasks.next);
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
/* IPI the remaining CPUs for expedited quiescent state. */
--
2.46.0
* [PATCH 2/3] rcu/exp: Remove confusing needless full barrier on task unblock
2025-02-13 23:25 [PATCH 0/3] rcu/exp updates Frederic Weisbecker
2025-02-13 23:25 ` [PATCH 1/3] rcu/exp: Protect against early QS report Frederic Weisbecker
@ 2025-02-13 23:25 ` Frederic Weisbecker
2025-02-25 21:59 ` Joel Fernandes
2025-02-13 23:25 ` [PATCH 3/3] rcu/exp: Remove needless CPU up quiescent state report Frederic Weisbecker
2 siblings, 1 reply; 25+ messages in thread
From: Frederic Weisbecker @ 2025-02-13 23:25 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
Paul E . McKenney, Uladzislau Rezki, Zqiang, rcu
A full memory barrier in the RCU-PREEMPT task unblock path is advertised
to order the context switch (or rather the accesses prior to
rcu_read_unlock()) against the expedited grace-period fastpath.
However the grace period cannot complete without rcu_report_exp_rnp()
being called with the node locked. This reports the quiescent
state in a fully ordered fashion against the updater's accesses thanks to:
1) The READ-SIDE smp_mb__after_unlock_lock() barrier across node
locking while propagating the QS up to the root.
2) The UPDATE-SIDE smp_mb__after_unlock_lock() barrier while holding the
root rnp to wait/check for the GP completion.
3) The (perhaps redundant given step 1) and 2)) smp_mb() in rcu_seq_end()
before the grace period completes.
This makes the explicit barrier in this place superfluous. Therefore
remove it, as it is confusing.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/rcu/tree_plugin.h | 1 -
1 file changed, 1 deletion(-)
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 3c0bbbbb686f..d51cc7a5dfc7 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -534,7 +534,6 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
WARN_ON_ONCE(rnp->completedqs == rnp->gp_seq &&
(!empty_norm || rnp->qsmask));
empty_exp = sync_rcu_exp_done(rnp);
- smp_mb(); /* ensure expedited fastpath sees end of RCU c-s. */
np = rcu_next_node_entry(t, rnp);
list_del_init(&t->rcu_node_entry);
t->rcu_blocked_node = NULL;
--
2.46.0
* [PATCH 3/3] rcu/exp: Remove needless CPU up quiescent state report
2025-02-13 23:25 [PATCH 0/3] rcu/exp updates Frederic Weisbecker
2025-02-13 23:25 ` [PATCH 1/3] rcu/exp: Protect against early QS report Frederic Weisbecker
2025-02-13 23:25 ` [PATCH 2/3] rcu/exp: Remove confusing needless full barrier on task unblock Frederic Weisbecker
@ 2025-02-13 23:25 ` Frederic Weisbecker
2025-02-14 9:01 ` Paul E. McKenney
2025-03-03 20:10 ` Paul E. McKenney
2 siblings, 2 replies; 25+ messages in thread
From: Frederic Weisbecker @ 2025-02-13 23:25 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
Paul E . McKenney, Uladzislau Rezki, Zqiang, rcu
A CPU coming online checks for an ongoing grace period and reports
a quiescent state accordingly if needed. This special treatment, which
shortcuts the expedited IPI, originated as an optimization in the
following commit:
338b0f760e84 ("rcu: Better hotplug handling for synchronize_sched_expedited()")
The point is to avoid an IPI while waiting for a CPU to become online
or failing to become offline.
However this is pointless and even error-prone, for several reasons:
* If the CPU has been seen offline in the first round scanning offline
and idle CPUs, no IPI is even tried and the quiescent state is
reported on behalf of the CPU.
* This means that if the IPI fails, the CPU has just become offline. So
it's unlikely to become online right away, unless the CPU hotplug
operation failed and rolled back, which is a rare event that can
wait a jiffy for a new IPI to be issued.
* But then the "optimization" applying to a failing CPU hotplug down only
applies to !PREEMPT_RCU.
* This forcibly reports a quiescent state even if ->cpu_no_qs.b.exp is not
set. As a result it can race with remote QS reports on the same rdp.
Fortunately it happens to be OK, but an accident is waiting to happen.
For all those reasons, remove this optimization, which doesn't look
worth keeping around.
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/rcu/tree.c | 2 --
kernel/rcu/tree_exp.h | 45 ++-----------------------------------------
2 files changed, 2 insertions(+), 45 deletions(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 8625f616c65a..86935fe00397 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -151,7 +151,6 @@ static void rcu_report_qs_rnp(unsigned long mask, struct rcu_node *rnp,
unsigned long gps, unsigned long flags);
static void invoke_rcu_core(void);
static void rcu_report_exp_rdp(struct rcu_data *rdp);
-static void sync_sched_exp_online_cleanup(int cpu);
static void check_cb_ovld_locked(struct rcu_data *rdp, struct rcu_node *rnp);
static bool rcu_rdp_is_offloaded(struct rcu_data *rdp);
static bool rcu_rdp_cpu_online(struct rcu_data *rdp);
@@ -4259,7 +4258,6 @@ int rcutree_online_cpu(unsigned int cpu)
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE)
return 0; /* Too early in boot for scheduler work. */
- sync_sched_exp_online_cleanup(cpu);
// Stop-machine done, so allow nohz_full to disable tick.
tick_dep_clear(TICK_DEP_BIT_RCU);
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index caff16e441d1..a3f962eb7057 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -751,12 +751,8 @@ static void rcu_exp_handler(void *unused)
struct task_struct *t = current;
/*
- * First, is there no need for a quiescent state from this CPU,
- * or is this CPU already looking for a quiescent state for the
- * current grace period? If either is the case, just leave.
- * However, this should not happen due to the preemptible
- * sync_sched_exp_online_cleanup() implementation being a no-op,
- * so warn if this does happen.
+ * WARN if the CPU is unexpectedly already looking for a
+ * QS or has already reported one.
*/
ASSERT_EXCLUSIVE_WRITER_SCOPED(rdp->cpu_no_qs.b.exp);
if (WARN_ON_ONCE(!(READ_ONCE(rnp->expmask) & rdp->grpmask) ||
@@ -803,11 +799,6 @@ static void rcu_exp_handler(void *unused)
WARN_ON_ONCE(1);
}
-/* PREEMPTION=y, so no PREEMPTION=n expedited grace period to clean up after. */
-static void sync_sched_exp_online_cleanup(int cpu)
-{
-}
-
/*
* Scan the current list of tasks blocked within RCU read-side critical
* sections, printing out the tid of each that is blocking the current
@@ -885,38 +876,6 @@ static void rcu_exp_handler(void *unused)
rcu_exp_need_qs();
}
-/* Send IPI for expedited cleanup if needed at end of CPU-hotplug operation. */
-static void sync_sched_exp_online_cleanup(int cpu)
-{
- unsigned long flags;
- int my_cpu;
- struct rcu_data *rdp;
- int ret;
- struct rcu_node *rnp;
-
- rdp = per_cpu_ptr(&rcu_data, cpu);
- rnp = rdp->mynode;
- my_cpu = get_cpu();
- /* Quiescent state either not needed or already requested, leave. */
- if (!(READ_ONCE(rnp->expmask) & rdp->grpmask) ||
- READ_ONCE(rdp->cpu_no_qs.b.exp)) {
- put_cpu();
- return;
- }
- /* Quiescent state needed on current CPU, so set it up locally. */
- if (my_cpu == cpu) {
- local_irq_save(flags);
- rcu_exp_need_qs();
- local_irq_restore(flags);
- put_cpu();
- return;
- }
- /* Quiescent state needed on some other CPU, send IPI. */
- ret = smp_call_function_single(cpu, rcu_exp_handler, NULL, 0);
- put_cpu();
- WARN_ON_ONCE(ret);
-}
-
/*
* Because preemptible RCU does not exist, we never have to check for
* tasks blocked within RCU read-side critical sections that are
--
2.46.0
* Re: [PATCH 3/3] rcu/exp: Remove needless CPU up quiescent state report
2025-02-13 23:25 ` [PATCH 3/3] rcu/exp: Remove needless CPU up quiescent state report Frederic Weisbecker
@ 2025-02-14 9:01 ` Paul E. McKenney
2025-02-14 12:10 ` Frederic Weisbecker
2025-03-03 20:10 ` Paul E. McKenney
1 sibling, 1 reply; 25+ messages in thread
From: Paul E. McKenney @ 2025-02-14 9:01 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: LKML, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
Uladzislau Rezki, Zqiang, rcu
On Fri, Feb 14, 2025 at 12:25:59AM +0100, Frederic Weisbecker wrote:
> A CPU coming online checks for an ongoing grace period and reports
> a quiescent state accordingly if needed. This special treatment that
> shortcuts the expedited IPI finds its origin as an optimization purpose
> on the following commit:
>
> 338b0f760e84 (rcu: Better hotplug handling for synchronize_sched_expedited()
>
> The point is to avoid an IPI while waiting for a CPU to become online
> or failing to become offline.
>
> However this is pointless and even error prone for several reasons:
>
> * If the CPU has been seen offline in the first round scanning offline
> and idle CPUs, no IPI is even tried and the quiescent state is
> reported on behalf of the CPU.
>
> * This means that if the IPI fails, the CPU just became offline. So
> it's unlikely to become online right away, unless the cpu hotplug
> operation failed and rolled back, which is a rare event that can
> wait a jiffy for a new IPI to be issued.
>
> * But then the "optimization" applying on failing CPU hotplug down only
> applies to !PREEMPT_RCU.
>
> * This force reports a quiescent state even if ->cpu_no_qs.b.exp is not
> set. As a result it can race with remote QS reports on the same rdp.
> Fortunately it happens to be OK but an accident is waiting to happen.
>
> For all those reasons, remove this optimization that doesn't look worthy
> to keep around.
Thank you for digging into this!
When I ran tests that removed the call to sync_sched_exp_online_cleanup()
a few months ago, I got grace-period hangs [1]. Has something changed
to make this safe?
Thanx, Paul
[1] https://docs.google.com/document/d/1-JQ4QYF1qid0TWSLa76O1kusdhER2wgm0dYdwFRUzU8/edit?usp=sharing
> Reported-by: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
> kernel/rcu/tree.c | 2 --
> kernel/rcu/tree_exp.h | 45 ++-----------------------------------------
> 2 files changed, 2 insertions(+), 45 deletions(-)
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 8625f616c65a..86935fe00397 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -151,7 +151,6 @@ static void rcu_report_qs_rnp(unsigned long mask, struct rcu_node *rnp,
> unsigned long gps, unsigned long flags);
> static void invoke_rcu_core(void);
> static void rcu_report_exp_rdp(struct rcu_data *rdp);
> -static void sync_sched_exp_online_cleanup(int cpu);
> static void check_cb_ovld_locked(struct rcu_data *rdp, struct rcu_node *rnp);
> static bool rcu_rdp_is_offloaded(struct rcu_data *rdp);
> static bool rcu_rdp_cpu_online(struct rcu_data *rdp);
> @@ -4259,7 +4258,6 @@ int rcutree_online_cpu(unsigned int cpu)
> raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE)
> return 0; /* Too early in boot for scheduler work. */
> - sync_sched_exp_online_cleanup(cpu);
>
> // Stop-machine done, so allow nohz_full to disable tick.
> tick_dep_clear(TICK_DEP_BIT_RCU);
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index caff16e441d1..a3f962eb7057 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -751,12 +751,8 @@ static void rcu_exp_handler(void *unused)
> struct task_struct *t = current;
>
> /*
> - * First, is there no need for a quiescent state from this CPU,
> - * or is this CPU already looking for a quiescent state for the
> - * current grace period? If either is the case, just leave.
> - * However, this should not happen due to the preemptible
> - * sync_sched_exp_online_cleanup() implementation being a no-op,
> - * so warn if this does happen.
> + * WARN if the CPU is unexpectedly already looking for a
> + * QS or has already reported one.
> */
> ASSERT_EXCLUSIVE_WRITER_SCOPED(rdp->cpu_no_qs.b.exp);
> if (WARN_ON_ONCE(!(READ_ONCE(rnp->expmask) & rdp->grpmask) ||
> @@ -803,11 +799,6 @@ static void rcu_exp_handler(void *unused)
> WARN_ON_ONCE(1);
> }
>
> -/* PREEMPTION=y, so no PREEMPTION=n expedited grace period to clean up after. */
> -static void sync_sched_exp_online_cleanup(int cpu)
> -{
> -}
> -
> /*
> * Scan the current list of tasks blocked within RCU read-side critical
> * sections, printing out the tid of each that is blocking the current
> @@ -885,38 +876,6 @@ static void rcu_exp_handler(void *unused)
> rcu_exp_need_qs();
> }
>
> -/* Send IPI for expedited cleanup if needed at end of CPU-hotplug operation. */
> -static void sync_sched_exp_online_cleanup(int cpu)
> -{
> - unsigned long flags;
> - int my_cpu;
> - struct rcu_data *rdp;
> - int ret;
> - struct rcu_node *rnp;
> -
> - rdp = per_cpu_ptr(&rcu_data, cpu);
> - rnp = rdp->mynode;
> - my_cpu = get_cpu();
> - /* Quiescent state either not needed or already requested, leave. */
> - if (!(READ_ONCE(rnp->expmask) & rdp->grpmask) ||
> - READ_ONCE(rdp->cpu_no_qs.b.exp)) {
> - put_cpu();
> - return;
> - }
> - /* Quiescent state needed on current CPU, so set it up locally. */
> - if (my_cpu == cpu) {
> - local_irq_save(flags);
> - rcu_exp_need_qs();
> - local_irq_restore(flags);
> - put_cpu();
> - return;
> - }
> - /* Quiescent state needed on some other CPU, send IPI. */
> - ret = smp_call_function_single(cpu, rcu_exp_handler, NULL, 0);
> - put_cpu();
> - WARN_ON_ONCE(ret);
> -}
> -
> /*
> * Because preemptible RCU does not exist, we never have to check for
> * tasks blocked within RCU read-side critical sections that are
> --
> 2.46.0
>
* Re: [PATCH 1/3] rcu/exp: Protect against early QS report
2025-02-13 23:25 ` [PATCH 1/3] rcu/exp: Protect against early QS report Frederic Weisbecker
@ 2025-02-14 9:10 ` Paul E. McKenney
2025-03-13 16:40 ` Frederic Weisbecker
0 siblings, 1 reply; 25+ messages in thread
From: Paul E. McKenney @ 2025-02-14 9:10 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: LKML, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
Uladzislau Rezki, Zqiang, rcu
On Fri, Feb 14, 2025 at 12:25:57AM +0100, Frederic Weisbecker wrote:
> When a grace period is started, the ->expmask of each node is set up
> from sync_exp_reset_tree(). Then later on each leaf node also initialize
> its ->exp_tasks pointer.
>
> This means that the initialization of the quiescent state of a node and
> the initialization of its blocking tasks happen with an unlocked node
> gap in-between.
>
> It happens to be fine because nothing is expected to report an exp
> quiescent state within this gap, since no IPI have been issued yet and
> every rdp's ->cpu_no_qs.b.exp should be false.
>
> However if it were to happen by accident, the quiescent state could be
> reported and propagated while ignoring tasks that blocked _before_ the
> start of the grace period.
>
> Prevent such trouble to happen in the future and initialize both the
> quiescent states mask to report and the blocked tasks head from the same
> node locked block.
>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Thank you for looking into this!
One question: What happens if a CPU has tasks pending during the
call to sync_exp_reset_tree(), but all of these tasks complete
their RCU read-side critical sections before execution reaches
__sync_rcu_exp_select_node_cpus()?
(My guess is that all is well, but even if so, it would be good to record
why in the commit log.)
Thanx, Paul
> ---
> kernel/rcu/tree_exp.h | 14 +++++++-------
> 1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index 8d4895c854c5..caff16e441d1 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -141,6 +141,13 @@ static void __maybe_unused sync_exp_reset_tree(void)
> raw_spin_lock_irqsave_rcu_node(rnp, flags);
> WARN_ON_ONCE(rnp->expmask);
> WRITE_ONCE(rnp->expmask, rnp->expmaskinit);
> + /*
> + * Need to wait for any blocked tasks as well. Note that
> + * additional blocking tasks will also block the expedited GP
> + * until such time as the ->expmask bits are cleared.
> + */
> + if (rcu_is_leaf_node(rnp) && rcu_preempt_has_tasks(rnp))
> + WRITE_ONCE(rnp->exp_tasks, rnp->blkd_tasks.next);
> raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> }
> }
> @@ -393,13 +400,6 @@ static void __sync_rcu_exp_select_node_cpus(struct rcu_exp_work *rewp)
> }
> mask_ofl_ipi = rnp->expmask & ~mask_ofl_test;
>
> - /*
> - * Need to wait for any blocked tasks as well. Note that
> - * additional blocking tasks will also block the expedited GP
> - * until such time as the ->expmask bits are cleared.
> - */
> - if (rcu_preempt_has_tasks(rnp))
> - WRITE_ONCE(rnp->exp_tasks, rnp->blkd_tasks.next);
> raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>
> /* IPI the remaining CPUs for expedited quiescent state. */
> --
> 2.46.0
>
* Re: [PATCH 3/3] rcu/exp: Remove needless CPU up quiescent state report
2025-02-14 9:01 ` Paul E. McKenney
@ 2025-02-14 12:10 ` Frederic Weisbecker
2025-02-15 10:38 ` Paul E. McKenney
0 siblings, 1 reply; 25+ messages in thread
From: Frederic Weisbecker @ 2025-02-14 12:10 UTC (permalink / raw)
To: Paul E. McKenney
Cc: LKML, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
Uladzislau Rezki, Zqiang, rcu
On Fri, Feb 14, 2025 at 01:01:56AM -0800, Paul E. McKenney wrote:
> On Fri, Feb 14, 2025 at 12:25:59AM +0100, Frederic Weisbecker wrote:
> > A CPU coming online checks for an ongoing grace period and reports
> > a quiescent state accordingly if needed. This special treatment that
> > shortcuts the expedited IPI finds its origin as an optimization purpose
> > on the following commit:
> >
> > 338b0f760e84 (rcu: Better hotplug handling for synchronize_sched_expedited()
> >
> > The point is to avoid an IPI while waiting for a CPU to become online
> > or failing to become offline.
> >
> > However this is pointless and even error prone for several reasons:
> >
> > * If the CPU has been seen offline in the first round scanning offline
> > and idle CPUs, no IPI is even tried and the quiescent state is
> > reported on behalf of the CPU.
> >
> > * This means that if the IPI fails, the CPU just became offline. So
> > it's unlikely to become online right away, unless the cpu hotplug
> > operation failed and rolled back, which is a rare event that can
> > wait a jiffy for a new IPI to be issued.
> >
> > * But then the "optimization" applying on failing CPU hotplug down only
> > applies to !PREEMPT_RCU.
> >
> > * This force reports a quiescent state even if ->cpu_no_qs.b.exp is not
> > set. As a result it can race with remote QS reports on the same rdp.
> > Fortunately it happens to be OK but an accident is waiting to happen.
> >
> > For all those reasons, remove this optimization that doesn't look worthy
> > to keep around.
>
> Thank you for digging into this!
>
> When I ran tests that removed the call to sync_sched_exp_online_cleanup()
> a few months ago, I got grace-period hangs [1]. Has something changed
> to make this safe?
Hmm, but was it before or after "rcu: Fix get_state_synchronize_rcu_full()
GP-start detection"?
And if after, do we know why?
Thanks.
* Re: [PATCH 3/3] rcu/exp: Remove needless CPU up quiescent state report
2025-02-14 12:10 ` Frederic Weisbecker
@ 2025-02-15 10:38 ` Paul E. McKenney
2025-02-15 22:23 ` Frederic Weisbecker
0 siblings, 1 reply; 25+ messages in thread
From: Paul E. McKenney @ 2025-02-15 10:38 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: LKML, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
Uladzislau Rezki, Zqiang, rcu
On Fri, Feb 14, 2025 at 01:10:52PM +0100, Frederic Weisbecker wrote:
> Le Fri, Feb 14, 2025 at 01:01:56AM -0800, Paul E. McKenney a écrit :
> > On Fri, Feb 14, 2025 at 12:25:59AM +0100, Frederic Weisbecker wrote:
> > > A CPU coming online checks for an ongoing grace period and reports
> > > a quiescent state accordingly if needed. This special treatment that
> > > shortcuts the expedited IPI finds its origin as an optimization purpose
> > > on the following commit:
> > >
> > > 338b0f760e84 (rcu: Better hotplug handling for synchronize_sched_expedited()
> > >
> > > The point is to avoid an IPI while waiting for a CPU to become online
> > > or failing to become offline.
> > >
> > > However this is pointless and even error prone for several reasons:
> > >
> > > * If the CPU has been seen offline in the first round scanning offline
> > > and idle CPUs, no IPI is even tried and the quiescent state is
> > > reported on behalf of the CPU.
> > >
> > > * This means that if the IPI fails, the CPU just became offline. So
> > > it's unlikely to become online right away, unless the cpu hotplug
> > > operation failed and rolled back, which is a rare event that can
> > > wait a jiffy for a new IPI to be issued.
But the expedited grace period might be preempted for an arbitrarily
long period, especially if a hypervisor is in play. And we do drop
that lock midway through...
> > > * But then the "optimization" applying on failing CPU hotplug down only
> > > applies to !PREEMPT_RCU.
Yes, definitely only non-preemptible RCU.
> > > * This force reports a quiescent state even if ->cpu_no_qs.b.exp is not
> > > set. As a result it can race with remote QS reports on the same rdp.
> > > Fortunately it happens to be OK but an accident is waiting to happen.
To your point, I did in fact incorrectly decide that this was a bug. ;-)
> > > For all those reasons, remove this optimization that doesn't look worthy
> > > to keep around.
> >
> > Thank you for digging into this!
> >
> > When I ran tests that removed the call to sync_sched_exp_online_cleanup()
> > a few months ago, I got grace-period hangs [1]. Has something changed
> > to make this safe?
>
> Hmm, but was it before or after "rcu: Fix get_state_synchronize_rcu_full()
> GP-start detection" ?
Before. There was also some buggy debug code in play. Also, to get the
failure, it was necessary to make TREE03 disable preemption, as stock
TREE03 has an empty sync_sched_exp_online_cleanup() function.
I am rerunning the test with a WARN_ON_ONCE() after the early exit from
the sync_sched_exp_online_cleanup(). Of course, lack of a failure does
not necessarily indicate
> And if after do we know why?
Here are some (possibly bogus) possibilities that came to mind:
1. There is some coming-online race that deprives the incoming
CPU of an IPI, but nevertheless marks that CPU as blocking the
current grace period.
2. Some strange scenario involves the CPU going offline for just a
little bit, so that the IPI gets wasted on the outgoing due to
neither of the "if" conditions in rcu_exp_handler() being true.
The outgoing CPU just says "I need a QS", then leaves and
comes back. (The expedited grace period doesn't retry because
it believes that it already sent that IPI.)
3. Your ideas here! ;-)
Thanx, Paul
* Re: [PATCH 3/3] rcu/exp: Remove needless CPU up quiescent state report
2025-02-15 10:38 ` Paul E. McKenney
@ 2025-02-15 22:23 ` Frederic Weisbecker
2025-02-19 14:58 ` Paul E. McKenney
0 siblings, 1 reply; 25+ messages in thread
From: Frederic Weisbecker @ 2025-02-15 22:23 UTC (permalink / raw)
To: Paul E. McKenney
Cc: LKML, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
Uladzislau Rezki, Zqiang, rcu
On Sat, Feb 15, 2025 at 02:38:04AM -0800, Paul E. McKenney wrote:
> On Fri, Feb 14, 2025 at 01:10:52PM +0100, Frederic Weisbecker wrote:
> > Le Fri, Feb 14, 2025 at 01:01:56AM -0800, Paul E. McKenney a écrit :
> > > On Fri, Feb 14, 2025 at 12:25:59AM +0100, Frederic Weisbecker wrote:
> > > > A CPU coming online checks for an ongoing grace period and reports
> > > > a quiescent state accordingly if needed. This special treatment that
> > > > shortcuts the expedited IPI finds its origin as an optimization purpose
> > > > on the following commit:
> > > >
> > > > 338b0f760e84 (rcu: Better hotplug handling for synchronize_sched_expedited()
> > > >
> > > > The point is to avoid an IPI while waiting for a CPU to become online
> > > > or failing to become offline.
> > > >
> > > > However this is pointless and even error prone for several reasons:
> > > >
> > > > * If the CPU has been seen offline in the first round scanning offline
> > > > and idle CPUs, no IPI is even tried and the quiescent state is
> > > > reported on behalf of the CPU.
> > > >
> > > > * This means that if the IPI fails, the CPU just became offline. So
> > > > it's unlikely to become online right away, unless the cpu hotplug
> > > > operation failed and rolled back, which is a rare event that can
> > > > wait a jiffy for a new IPI to be issued.
>
> But the expedited grace period might be preempted for an arbitrarily
> long period, especially if a hypervisor is in play. And we do drop
> that lock midway through...
Well, then that delays the expedited grace period as a whole anyway...
> > > > For all those reasons, remove this optimization that doesn't look worthy
> > > > to keep around.
> > >
> > > Thank you for digging into this!
> > >
> > > When I ran tests that removed the call to sync_sched_exp_online_cleanup()
> > > a few months ago, I got grace-period hangs [1]. Has something changed
> > > to make this safe?
> >
> > Hmm, but was it before or after "rcu: Fix get_state_synchronize_rcu_full()
> > GP-start detection" ?
>
> Before. There was also some buggy debug code in play. Also, to get the
> failure, it was necessary to make TREE03 disable preemption, as stock
> TREE03 has an empty sync_sched_exp_online_cleanup() function.
>
> I am rerunning the test with a WARN_ON_ONCE() after the early exit from
> the sync_sched_exp_online_cleanup(). Of course, lack of a failure does
> not necessairly indicate
Cool, thanks!
>
> > And if after do we know why?
>
> Here are some (possibly bogus) possibilities that came to mind:
>
> 1. There is some coming-online race that deprives the incoming
> CPU of an IPI, but nevertheless marks that CPU as blocking the
> current grace period.
Arguably there is a tiny window between rcutree_report_cpu_starting()
and set_cpu_online() that could make ->qsmaskinitnext visible before
cpu_online() and therefore delay the IPI a bit. But I don't expect
more than a jiffy to fill up the gap. And if that's relevant, note that
only !PREEMPT_RCU is then "fixed" by sync_sched_exp_online_cleanup() here.
>
> 2. Some strange scenario involves the CPU going offline for just a
> little bit, so that the IPI gets wasted on the outgoing due to
> neither of the "if" conditions in rcu_exp_handler() being true.
> The outgoing CPU just says "I need a QS", then leaves and
> comes back. (The expedited grace period doesn't retry because
> it believes that it already sent that IPI.)
I don't think this is possible. Once the CPU enters CPUHP_TEARDOWN_CPU with
stop_machine, no more IPIs can be issued. The remaining ones are executed
at CPUHP_AP_SMPCFD_DYING, still in stop_machine. So this is the last call
for rcu_exp_handler() execution. And this last call has to be followed
by rcu_note_context_switch() between stop_machine and the final schedule to
idle. And that rcu_note_context_switch() must report the rdp exp context
switch.
One easy way to assert that is:
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 86935fe00397..40d6090a33f5 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4347,6 +4347,12 @@ void rcutree_report_cpu_dead(void)
* may introduce a new READ-side while it is actually off the QS masks.
*/
lockdep_assert_irqs_disabled();
+ /*
+ * CPUHP_AP_SMPCFD_DYING was the last call for rcu_exp_handler() execution.
+ * The requested QS must have been reported on the last context switch
+ * from stop machine to idle.
+ */
+ WARN_ON_ONCE(rdp->cpu_no_qs.b.exp);
// Do any dangling deferred wakeups.
do_nocb_deferred_wakeup(rdp);
>
> 3. Your ideas here! ;-)
:-)
>
> Thanx, Paul
* Re: [PATCH 3/3] rcu/exp: Remove needless CPU up quiescent state report
2025-02-15 22:23 ` Frederic Weisbecker
@ 2025-02-19 14:58 ` Paul E. McKenney
2025-02-19 15:55 ` Paul E. McKenney
2025-02-21 15:52 ` Frederic Weisbecker
0 siblings, 2 replies; 25+ messages in thread
From: Paul E. McKenney @ 2025-02-19 14:58 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: LKML, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
Uladzislau Rezki, Zqiang, rcu
On Sat, Feb 15, 2025 at 11:23:45PM +0100, Frederic Weisbecker wrote:
> Le Sat, Feb 15, 2025 at 02:38:04AM -0800, Paul E. McKenney a écrit :
> > On Fri, Feb 14, 2025 at 01:10:52PM +0100, Frederic Weisbecker wrote:
> > > Le Fri, Feb 14, 2025 at 01:01:56AM -0800, Paul E. McKenney a écrit :
> > > > On Fri, Feb 14, 2025 at 12:25:59AM +0100, Frederic Weisbecker wrote:
> > > > > A CPU coming online checks for an ongoing grace period and reports
> > > > > a quiescent state accordingly if needed. This special treatment, which
> > > > > shortcuts the expedited IPI, originated as an optimization in the
> > > > > following commit:
> > > > >
> > > > > 338b0f760e84 ("rcu: Better hotplug handling for synchronize_sched_expedited()")
> > > > >
> > > > > The point is to avoid an IPI while waiting for a CPU to become online
> > > > > or failing to become offline.
> > > > >
> > > > > However, this is pointless and even error-prone for several reasons:
> > > > >
> > > > > * If the CPU has been seen offline in the first round scanning offline
> > > > > and idle CPUs, no IPI is even tried and the quiescent state is
> > > > > reported on behalf of the CPU.
> > > > >
> > > > > * This means that if the IPI fails, the CPU just became offline. So
> > > > > it's unlikely to become online right away, unless the cpu hotplug
> > > > > operation failed and rolled back, which is a rare event that can
> > > > > wait a jiffy for a new IPI to be issued.
> >
> > But the expedited grace period might be preempted for an arbitrarily
> > long period, especially if a hypervisor is in play. And we do drop
> > that lock midway through...
>
> Well, then that delays the expedited grace period as a whole anyway...
Fair enough. Part of this is the paranoia that has served me so well.
But which can also cause the occasional problem. On the other hand,
we really do occasionally lose things during CPU hotplug operations...
> > > > > For all those reasons, remove this optimization that doesn't look worth
> > > > > keeping around.
> > > >
> > > > Thank you for digging into this!
> > > >
> > > > When I ran tests that removed the call to sync_sched_exp_online_cleanup()
> > > > a few months ago, I got grace-period hangs [1]. Has something changed
> > > > to make this safe?
> > >
> > > Hmm, but was it before or after "rcu: Fix get_state_synchronize_rcu_full()
> > > GP-start detection" ?
> >
> > Before. There was also some buggy debug code in play. Also, to get the
> > failure, it was necessary to make TREE03 disable preemption, as stock
> > TREE03 has an empty sync_sched_exp_online_cleanup() function.
> >
> > I am rerunning the test with a WARN_ON_ONCE() after the early exit from
> > the sync_sched_exp_online_cleanup(). Of course, lack of a failure does
> > not necessarily indicate
>
> Cool, thanks!
No failures. But might it be wise to put this WARN_ON_ONCE() in,
let things go for a year or two, and complete the removal if it never
triggers? Or is the lack-of-forward-progress warning enough?
> > > And if after do we know why?
> >
> > Here are some (possibly bogus) possibilities that came to mind:
> >
> > 1. There is some coming-online race that deprives the incoming
> > CPU of an IPI, but nevertheless marks that CPU as blocking the
> > current grace period.
>
> Arguably there is a tiny window between rcutree_report_cpu_starting()
> and set_cpu_online() that could make ->qsmaskinitnext visible before
> cpu_online() and therefore delay the IPI a bit. But I don't expect
> more than a jiffy to fill up the gap. And if that's relevant, note that
> only !PREEMPT_RCU is then "fixed" by sync_sched_exp_online_cleanup() here.
Agreed. And I vaguely recall that there was some difference due to
preemptible RCU's ability to clean up at the next rcu_read_unlock(),
though more recently, possibly deferred.
> > 2. Some strange scenario involves the CPU going offline for just a
> > little bit, so that the IPI gets wasted on the outgoing due to
> > neither of the "if" conditions in rcu_exp_handler() being true.
> > The outgoing CPU just says "I need a QS", then leaves and
> > comes back. (The expedited grace period doesn't retry because
> > it believes that it already sent that IPI.)
>
> I don't think this is possible. Once the CPU enters CPUHP_TEARDOWN_CPU with
> stop_machine, no more IPIs can be issued. The remaining ones are executed
> at CPUHP_AP_SMPCFD_DYING, still in stop_machine. So this is the last call
> for rcu_exp_handler() execution. And this last call has to be followed
> by rcu_note_context_switch() between stop_machine and the final schedule to
> idle. And that rcu_note_context_switch() must report the rdp exp context
> switch.
Makes sense to me.
> One easy way to assert that is:
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 86935fe00397..40d6090a33f5 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -4347,6 +4347,12 @@ void rcutree_report_cpu_dead(void)
> * may introduce a new READ-side while it is actually off the QS masks.
> */
> lockdep_assert_irqs_disabled();
> + /*
> + * CPUHP_AP_SMPCFD_DYING was the last call for rcu_exp_handler() execution.
> + * The requested QS must have been reported on the last context switch
> + * from stop machine to idle.
> + */
> + WARN_ON_ONCE(rdp->cpu_no_qs.b.exp);
> // Do any dangling deferred wakeups.
> do_nocb_deferred_wakeup(rdp);
I fired off a 30-minute run of 100*TREE03 with this change also.
> > 3. Your ideas here! ;-)
>
> :-)
Thanx, Paul
* Re: [PATCH 3/3] rcu/exp: Remove needless CPU up quiescent state report
2025-02-19 14:58 ` Paul E. McKenney
@ 2025-02-19 15:55 ` Paul E. McKenney
2025-02-21 15:31 ` Frederic Weisbecker
2025-02-21 15:52 ` Frederic Weisbecker
1 sibling, 1 reply; 25+ messages in thread
From: Paul E. McKenney @ 2025-02-19 15:55 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: LKML, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
Uladzislau Rezki, Zqiang, rcu
On Wed, Feb 19, 2025 at 06:58:36AM -0800, Paul E. McKenney wrote:
> On Sat, Feb 15, 2025 at 11:23:45PM +0100, Frederic Weisbecker wrote:
> > Le Sat, Feb 15, 2025 at 02:38:04AM -0800, Paul E. McKenney a écrit :
> > > On Fri, Feb 14, 2025 at 01:10:52PM +0100, Frederic Weisbecker wrote:
> > > > Le Fri, Feb 14, 2025 at 01:01:56AM -0800, Paul E. McKenney a écrit :
> > > > > On Fri, Feb 14, 2025 at 12:25:59AM +0100, Frederic Weisbecker wrote:
> > > > > > A CPU coming online checks for an ongoing grace period and reports
> > > > > > a quiescent state accordingly if needed. This special treatment, which
> > > > > > shortcuts the expedited IPI, originated as an optimization in the
> > > > > > following commit:
> > > > > >
> > > > > > 338b0f760e84 ("rcu: Better hotplug handling for synchronize_sched_expedited()")
> > > > > >
> > > > > > The point is to avoid an IPI while waiting for a CPU to become online
> > > > > > or failing to become offline.
> > > > > >
> > > > > > However, this is pointless and even error-prone for several reasons:
> > > > > >
> > > > > > * If the CPU has been seen offline in the first round scanning offline
> > > > > > and idle CPUs, no IPI is even tried and the quiescent state is
> > > > > > reported on behalf of the CPU.
> > > > > >
> > > > > > * This means that if the IPI fails, the CPU just became offline. So
> > > > > > it's unlikely to become online right away, unless the cpu hotplug
> > > > > > operation failed and rolled back, which is a rare event that can
> > > > > > wait a jiffy for a new IPI to be issued.
> > >
> > > But the expedited grace period might be preempted for an arbitrarily
> > > long period, especially if a hypervisor is in play. And we do drop
> > > that lock midway through...
> >
> > Well, then that delays the expedited grace period as a whole anyway...
>
> Fair enough. Part of this is the paranoia that has served me so well.
> But which can also cause the occasional problem. On the other hand,
> we really do occasionally lose things during CPU hotplug operations...
>
> > > > > > For all those reasons, remove this optimization that doesn't look worth
> > > > > > keeping around.
> > > > >
> > > > > Thank you for digging into this!
> > > > >
> > > > > When I ran tests that removed the call to sync_sched_exp_online_cleanup()
> > > > > a few months ago, I got grace-period hangs [1]. Has something changed
> > > > > to make this safe?
> > > >
> > > > Hmm, but was it before or after "rcu: Fix get_state_synchronize_rcu_full()
> > > > GP-start detection" ?
> > >
> > > Before. There was also some buggy debug code in play. Also, to get the
> > > failure, it was necessary to make TREE03 disable preemption, as stock
> > > TREE03 has an empty sync_sched_exp_online_cleanup() function.
> > >
> > > I am rerunning the test with a WARN_ON_ONCE() after the early exit from
> > > the sync_sched_exp_online_cleanup(). Of course, lack of a failure does
> > > not necessarily indicate
> >
> > Cool, thanks!
>
> No failures. But might it be wise to put this WARN_ON_ONCE() in,
> let things go for a year or two, and complete the removal if it never
> triggers? Or is the lack of forward progress warning enough?
>
> > > > And if after do we know why?
> > >
> > > Here are some (possibly bogus) possibilities that came to mind:
> > >
> > > 1. There is some coming-online race that deprives the incoming
> > > CPU of an IPI, but nevertheless marks that CPU as blocking the
> > > current grace period.
> >
> > Arguably there is a tiny window between rcutree_report_cpu_starting()
> > and set_cpu_online() that could make ->qsmaskinitnext visible before
> > cpu_online() and therefore delay the IPI a bit. But I don't expect
> > more than a jiffy to fill up the gap. And if that's relevant, note that
> > only !PREEMPT_RCU is then "fixed" by sync_sched_exp_online_cleanup() here.
>
> Agreed. And I vaguely recall that there was some difference due to
> preemptible RCU's ability to clean up at the next rcu_read_unlock(),
> though more recently, possibly deferred.
>
> > > 2. Some strange scenario involves the CPU going offline for just a
> > > little bit, so that the IPI gets wasted on the outgoing due to
> > > neither of the "if" conditions in rcu_exp_handler() being true.
> > > The outgoing CPU just says "I need a QS", then leaves and
> > > comes back. (The expedited grace period doesn't retry because
> > > it believes that it already sent that IPI.)
> >
> > I don't think this is possible. Once the CPU enters CPUHP_TEARDOWN_CPU with
> > stop_machine, no more IPIs can be issued. The remaining ones are executed
> > at CPUHP_AP_SMPCFD_DYING, still in stop_machine. So this is the last call
> > for rcu_exp_handler() execution. And this last call has to be followed
> > by rcu_note_context_switch() between stop_machine and the final schedule to
> > idle. And that rcu_note_context_switch() must report the rdp exp context
> > switch.
>
> Makes sense to me.
>
> > One easy way to assert that is:
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 86935fe00397..40d6090a33f5 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -4347,6 +4347,12 @@ void rcutree_report_cpu_dead(void)
> > * may introduce a new READ-side while it is actually off the QS masks.
> > */
> > lockdep_assert_irqs_disabled();
> > + /*
> > + * CPUHP_AP_SMPCFD_DYING was the last call for rcu_exp_handler() execution.
> > + * The requested QS must have been reported on the last context switch
> > + * from stop machine to idle.
> > + */
> > + WARN_ON_ONCE(rdp->cpu_no_qs.b.exp);
> > // Do any dangling deferred wakeups.
> > do_nocb_deferred_wakeup(rdp);
>
> I fired off a 30-minute run of 100*TREE03 with this change also.
And no failures!
Thanx, Paul
> > > 3. Your ideas here! ;-)
> >
> > :-)
>
> Thanx, Paul
* Re: [PATCH 3/3] rcu/exp: Remove needless CPU up quiescent state report
2025-02-19 15:55 ` Paul E. McKenney
@ 2025-02-21 15:31 ` Frederic Weisbecker
0 siblings, 0 replies; 25+ messages in thread
From: Frederic Weisbecker @ 2025-02-21 15:31 UTC (permalink / raw)
To: Paul E. McKenney
Cc: LKML, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
Uladzislau Rezki, Zqiang, rcu
Le Wed, Feb 19, 2025 at 07:55:05AM -0800, Paul E. McKenney a écrit :
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 86935fe00397..40d6090a33f5 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -4347,6 +4347,12 @@ void rcutree_report_cpu_dead(void)
> > > * may introduce a new READ-side while it is actually off the QS masks.
> > > */
> > > lockdep_assert_irqs_disabled();
> > > + /*
> > > + * CPUHP_AP_SMPCFD_DYING was the last call for rcu_exp_handler() execution.
> > > + * The requested QS must have been reported on the last context switch
> > > + * from stop machine to idle.
> > > + */
> > > + WARN_ON_ONCE(rdp->cpu_no_qs.b.exp);
> > > // Do any dangling deferred wakeups.
> > > do_nocb_deferred_wakeup(rdp);
> >
> > I fired off a 30-minute run of 100*TREE03 with this change also.
>
> And no failures!
OK, I'm adding that one to my queue.
>
> Thanx, Paul
>
> > > > 3. Your ideas here! ;-)
> > >
> > > :-)
> >
> > Thanx, Paul
* Re: [PATCH 3/3] rcu/exp: Remove needless CPU up quiescent state report
2025-02-19 14:58 ` Paul E. McKenney
2025-02-19 15:55 ` Paul E. McKenney
@ 2025-02-21 15:52 ` Frederic Weisbecker
2025-02-26 0:00 ` Paul E. McKenney
1 sibling, 1 reply; 25+ messages in thread
From: Frederic Weisbecker @ 2025-02-21 15:52 UTC (permalink / raw)
To: Paul E. McKenney
Cc: LKML, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
Uladzislau Rezki, Zqiang, rcu
Le Wed, Feb 19, 2025 at 06:58:36AM -0800, Paul E. McKenney a écrit :
> On Sat, Feb 15, 2025 at 11:23:45PM +0100, Frederic Weisbecker wrote:
> > > Before. There was also some buggy debug code in play. Also, to get the
> > > failure, it was necessary to make TREE03 disable preemption, as stock
> > > TREE03 has an empty sync_sched_exp_online_cleanup() function.
> > >
> > > I am rerunning the test with a WARN_ON_ONCE() after the early exit from
> > > the sync_sched_exp_online_cleanup(). Of course, lack of a failure does
> > > not necessarily indicate
> >
> > Cool, thanks!
>
> No failures. But might it be wise to put this WARN_ON_ONCE() in,
> let things go for a year or two, and complete the removal if it never
> triggers? Or is the lack of forward progress warning enough?
Hmm, what prevents a WARN_ON_ONCE() after the early exit of
sync_sched_exp_online_cleanup() from triggering?
All it takes is for sync_sched_exp_online_cleanup() to execute after
sync_exp_reset_tree() but before __sync_rcu_exp_select_node_cpus() manages
to send an IPI.
But we can warn about the lack of forward progress after a few iterations
of the retry_ipi label in __sync_rcu_exp_select_node_cpus().
>
> > > > And if after do we know why?
> > >
> > > Here are some (possibly bogus) possibilities that came to mind:
> > >
> > > 1. There is some coming-online race that deprives the incoming
> > > CPU of an IPI, but nevertheless marks that CPU as blocking the
> > > current grace period.
> >
> > Arguably there is a tiny window between rcutree_report_cpu_starting()
> > and set_cpu_online() that could make ->qsmaskinitnext visible before
> > cpu_online() and therefore delay the IPI a bit. But I don't expect
> > more than a jiffy to fill up the gap. And if that's relevant, note that
> > only !PREEMPT_RCU is then "fixed" by sync_sched_exp_online_cleanup() here.
>
> Agreed. And I vaguely recall that there was some difference due to
> preemptible RCU's ability to clean up at the next rcu_read_unlock(),
> though more recently, possibly deferred.
Perhaps at the time, but today at least I can't find any.
Thanks.
* Re: [PATCH 2/3] rcu/exp: Remove confusing needless full barrier on task unblock
2025-02-13 23:25 ` [PATCH 2/3] rcu/exp: Remove confusing needless full barrier on task unblock Frederic Weisbecker
@ 2025-02-25 21:59 ` Joel Fernandes
2025-02-26 0:08 ` Paul E. McKenney
2025-02-26 12:52 ` Frederic Weisbecker
0 siblings, 2 replies; 25+ messages in thread
From: Joel Fernandes @ 2025-02-25 21:59 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: LKML, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
Paul E . McKenney, Uladzislau Rezki, Zqiang, rcu
On Fri, Feb 14, 2025 at 12:25:58AM +0100, Frederic Weisbecker wrote:
> A full memory barrier in the RCU-PREEMPT task unblock path advertises
> that it orders the context switch (or rather the accesses prior to
> rcu_read_unlock()) against the expedited grace period fastpath.
>
> However, the grace period cannot complete without rcu_report_exp_rnp()
> being called with the node locked. This reports the quiescent
> state in a fully ordered fashion against the updater's accesses thanks to:
>
> 1) The READ-SIDE smp_mb__after_unlock_lock() barrier across node
> locking while propagating the QS up to the root.
>
> 2) The UPDATE-SIDE smp_mb__after_unlock_lock() barrier while holding
> the root rnp to wait/check for the GP completion.
>
> 3) The (perhaps redundant given steps 1) and 2)) smp_mb() in rcu_seq_end()
> before the grace period completes.
>
> This makes the explicit barrier in this place superfluous. Therefore
> remove it, as it is confusing.
>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
> kernel/rcu/tree_plugin.h | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 3c0bbbbb686f..d51cc7a5dfc7 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -534,7 +534,6 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
> WARN_ON_ONCE(rnp->completedqs == rnp->gp_seq &&
> (!empty_norm || rnp->qsmask));
> empty_exp = sync_rcu_exp_done(rnp);
> - smp_mb(); /* ensure expedited fastpath sees end of RCU c-s. */
I was wondering though, this is a slow path and the smp_mb() has been there
since 2009 or so. Not sure if it is super valuable to remove it at this
point. But we/I should definitely understand it.
I was wondering if you could also point to the fastpath that this is racing
with; it is not immediately clear (to me) what this smp_mb() is pairing with :(
thanks,
- Joel
> np = rcu_next_node_entry(t, rnp);
> list_del_init(&t->rcu_node_entry);
> t->rcu_blocked_node = NULL;
> --
> 2.46.0
>
* Re: [PATCH 3/3] rcu/exp: Remove needless CPU up quiescent state report
2025-02-21 15:52 ` Frederic Weisbecker
@ 2025-02-26 0:00 ` Paul E. McKenney
0 siblings, 0 replies; 25+ messages in thread
From: Paul E. McKenney @ 2025-02-26 0:00 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: LKML, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
Uladzislau Rezki, Zqiang, rcu
On Fri, Feb 21, 2025 at 04:52:39PM +0100, Frederic Weisbecker wrote:
> Le Wed, Feb 19, 2025 at 06:58:36AM -0800, Paul E. McKenney a écrit :
> > On Sat, Feb 15, 2025 at 11:23:45PM +0100, Frederic Weisbecker wrote:
> > > > Before. There was also some buggy debug code in play. Also, to get the
> > > > failure, it was necessary to make TREE03 disable preemption, as stock
> > > > TREE03 has an empty sync_sched_exp_online_cleanup() function.
> > > >
> > > > I am rerunning the test with a WARN_ON_ONCE() after the early exit from
> > > > the sync_sched_exp_online_cleanup(). Of course, lack of a failure does
> > > > not necessarily indicate
> > >
> > > Cool, thanks!
> >
> > No failures. But might it be wise to put this WARN_ON_ONCE() in,
> > let things go for a year or two, and complete the removal if it never
> > triggers? Or is the lack of forward progress warning enough?
>
> Hmm, what prevents a WARN_ON_ONCE() after the early exit of
> sync_sched_exp_online_cleanup() from triggering?
>
> All it takes is for sync_sched_exp_online_cleanup() to execute after
> sync_exp_reset_tree() but before __sync_rcu_exp_select_node_cpus() manages
> to send an IPI.
You are right, that would do it!
> But we can warn about the lack of forward progress after a few iterations
> of the retry_ipi label in __sync_rcu_exp_select_node_cpus().
Agreed, that would make more sense.
> > > > > And if after do we know why?
> > > >
> > > > Here are some (possibly bogus) possibilities that came to mind:
> > > >
> > > > 1. There is some coming-online race that deprives the incoming
> > > > CPU of an IPI, but nevertheless marks that CPU as blocking the
> > > > current grace period.
> > >
> > > Arguably there is a tiny window between rcutree_report_cpu_starting()
> > > and set_cpu_online() that could make ->qsmaskinitnext visible before
> > > cpu_online() and therefore delay the IPI a bit. But I don't expect
> > > more than a jiffy to fill up the gap. And if that's relevant, note that
> > > only !PREEMPT_RCU is then "fixed" by sync_sched_exp_online_cleanup() here.
> >
> > Agreed. And I vaguely recall that there was some difference due to
> > preemptible RCU's ability to clean up at the next rcu_read_unlock(),
> > though more recently, possibly deferred.
>
> Perhaps at the time but today at least I can't find any.
And maybe not even back then. ;-)
Thanx, Paul
* Re: [PATCH 2/3] rcu/exp: Remove confusing needless full barrier on task unblock
2025-02-25 21:59 ` Joel Fernandes
@ 2025-02-26 0:08 ` Paul E. McKenney
2025-02-26 12:52 ` Frederic Weisbecker
1 sibling, 0 replies; 25+ messages in thread
From: Paul E. McKenney @ 2025-02-26 0:08 UTC (permalink / raw)
To: Joel Fernandes
Cc: Frederic Weisbecker, LKML, Boqun Feng, Joel Fernandes,
Neeraj Upadhyay, Uladzislau Rezki, Zqiang, rcu
On Tue, Feb 25, 2025 at 04:59:08PM -0500, Joel Fernandes wrote:
> On Fri, Feb 14, 2025 at 12:25:58AM +0100, Frederic Weisbecker wrote:
> > A full memory barrier in the RCU-PREEMPT task unblock path advertises
> > that it orders the context switch (or rather the accesses prior to
> > rcu_read_unlock()) against the expedited grace period fastpath.
> >
> > However, the grace period cannot complete without rcu_report_exp_rnp()
> > being called with the node locked. This reports the quiescent
> > state in a fully ordered fashion against the updater's accesses thanks to:
> >
> > 1) The READ-SIDE smp_mb__after_unlock_lock() barrier across node
> > locking while propagating the QS up to the root.
> >
> > 2) The UPDATE-SIDE smp_mb__after_unlock_lock() barrier while holding
> > the root rnp to wait/check for the GP completion.
> >
> > 3) The (perhaps redundant given steps 1) and 2)) smp_mb() in rcu_seq_end()
> > before the grace period completes.
> >
> > This makes the explicit barrier in this place superfluous. Therefore
> > remove it, as it is confusing.
> >
> > Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> > ---
> > kernel/rcu/tree_plugin.h | 1 -
> > 1 file changed, 1 deletion(-)
> >
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index 3c0bbbbb686f..d51cc7a5dfc7 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -534,7 +534,6 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
> > WARN_ON_ONCE(rnp->completedqs == rnp->gp_seq &&
> > (!empty_norm || rnp->qsmask));
> > empty_exp = sync_rcu_exp_done(rnp);
> > - smp_mb(); /* ensure expedited fastpath sees end of RCU c-s. */
>
> I was wondering though, this is a slow path and the smp_mb() has been there
> since 2009 or so. Not sure if it is super valuable to remove it at this
> point. But we/I should definitely understand it.
>
> I was wondering if you could also point to the fastpath that this is racing
> with, it is not immediately clear (to me) what this smp_mb() is pairing with :(
My guess was one of the lock acquisitions or dyntick checks in
__sync_rcu_exp_select_node_cpus(), but I am not seeing anything there.
In this context, "fastpath" would be one of the early exits, for example,
the "continue" statements in the second for_each_leaf_node_cpu_mask()
loop.
But again, I am not seeing anything that appears to need that smp_mb().
As you say, that smp_mb() is not on a fastpath, so we need to check
carefully before removing it.
Thanx, Paul
> thanks,
>
> - Joel
>
>
>
>
>
> > np = rcu_next_node_entry(t, rnp);
> > list_del_init(&t->rcu_node_entry);
> > t->rcu_blocked_node = NULL;
> > --
> > 2.46.0
> >
* Re: [PATCH 2/3] rcu/exp: Remove confusing needless full barrier on task unblock
2025-02-25 21:59 ` Joel Fernandes
2025-02-26 0:08 ` Paul E. McKenney
@ 2025-02-26 12:52 ` Frederic Weisbecker
2025-02-26 15:04 ` Paul E. McKenney
1 sibling, 1 reply; 25+ messages in thread
From: Frederic Weisbecker @ 2025-02-26 12:52 UTC (permalink / raw)
To: Joel Fernandes
Cc: LKML, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
Paul E . McKenney, Uladzislau Rezki, Zqiang, rcu
Le Tue, Feb 25, 2025 at 04:59:08PM -0500, Joel Fernandes a écrit :
> On Fri, Feb 14, 2025 at 12:25:58AM +0100, Frederic Weisbecker wrote:
> > A full memory barrier in the RCU-PREEMPT task unblock path advertises
> > that it orders the context switch (or rather the accesses prior to
> > rcu_read_unlock()) against the expedited grace period fastpath.
> >
> > However, the grace period cannot complete without rcu_report_exp_rnp()
> > being called with the node locked. This reports the quiescent
> > state in a fully ordered fashion against the updater's accesses thanks to:
> >
> > 1) The READ-SIDE smp_mb__after_unlock_lock() barrier across node
> > locking while propagating the QS up to the root.
> >
> > 2) The UPDATE-SIDE smp_mb__after_unlock_lock() barrier while holding
> > the root rnp to wait/check for the GP completion.
> >
> > 3) The (perhaps redundant given steps 1) and 2)) smp_mb() in rcu_seq_end()
> > before the grace period completes.
> >
> > This makes the explicit barrier in this place superfluous. Therefore
> > remove it, as it is confusing.
> >
> > Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> > ---
> > kernel/rcu/tree_plugin.h | 1 -
> > 1 file changed, 1 deletion(-)
> >
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index 3c0bbbbb686f..d51cc7a5dfc7 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -534,7 +534,6 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
> > WARN_ON_ONCE(rnp->completedqs == rnp->gp_seq &&
> > (!empty_norm || rnp->qsmask));
> > empty_exp = sync_rcu_exp_done(rnp);
> > - smp_mb(); /* ensure expedited fastpath sees end of RCU c-s. */
>
> I was wondering though, this is a slow path and the smp_mb() has been there
> since 2009 or so. Not sure if it is super valuable to remove it at this
> point. But we/I should definitely understand it.
The point is indeed not to improve performance, because this is a slowpath
(although...). The main goal is to maintain a clear picture of the ordering
without needless barriers that leave reviewers in doubt.
> I was wondering if you could also point to the fastpath that this is racing
> with, it is not immediately clear (to me) what this smp_mb() is pairing with
> :(
It is supposed to pair with the barrier in sync_exp_work_done() but then again
this is already enforced by the smp_mb__after_unlock_lock() chained through
rnp locking.
Thanks.
>
> thanks,
>
> - Joel
>
>
>
>
>
> > np = rcu_next_node_entry(t, rnp);
> > list_del_init(&t->rcu_node_entry);
> > t->rcu_blocked_node = NULL;
> > --
> > 2.46.0
> >
* Re: [PATCH 2/3] rcu/exp: Remove confusing needless full barrier on task unblock
2025-02-26 12:52 ` Frederic Weisbecker
@ 2025-02-26 15:04 ` Paul E. McKenney
2025-02-26 15:26 ` Joel Fernandes
0 siblings, 1 reply; 25+ messages in thread
From: Paul E. McKenney @ 2025-02-26 15:04 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Joel Fernandes, LKML, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
Uladzislau Rezki, Zqiang, rcu
On Wed, Feb 26, 2025 at 01:52:09PM +0100, Frederic Weisbecker wrote:
> Le Tue, Feb 25, 2025 at 04:59:08PM -0500, Joel Fernandes a écrit :
> > On Fri, Feb 14, 2025 at 12:25:58AM +0100, Frederic Weisbecker wrote:
> > > A full memory barrier in the RCU-PREEMPT task unblock path advertises
> > > that it orders the context switch (or rather the accesses prior to
> > > rcu_read_unlock()) against the expedited grace period fastpath.
> > >
> > > However, the grace period cannot complete without rcu_report_exp_rnp()
> > > being called with the node locked. This reports the quiescent
> > > state in a fully ordered fashion against the updater's accesses thanks to:
> > >
> > > 1) The READ-SIDE smp_mb__after_unlock_lock() barrier across node
> > > locking while propagating the QS up to the root.
> > >
> > > 2) The UPDATE-SIDE smp_mb__after_unlock_lock() barrier while holding
> > > the root rnp to wait/check for the GP completion.
> > >
> > > 3) The (perhaps redundant given steps 1) and 2)) smp_mb() in rcu_seq_end()
> > > before the grace period completes.
> > >
> > > This makes the explicit barrier in this place superfluous. Therefore
> > > remove it, as it is confusing.
> > >
> > > Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> > > ---
> > > kernel/rcu/tree_plugin.h | 1 -
> > > 1 file changed, 1 deletion(-)
> > >
> > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > index 3c0bbbbb686f..d51cc7a5dfc7 100644
> > > --- a/kernel/rcu/tree_plugin.h
> > > +++ b/kernel/rcu/tree_plugin.h
> > > @@ -534,7 +534,6 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
> > > WARN_ON_ONCE(rnp->completedqs == rnp->gp_seq &&
> > > (!empty_norm || rnp->qsmask));
> > > empty_exp = sync_rcu_exp_done(rnp);
> > > - smp_mb(); /* ensure expedited fastpath sees end of RCU c-s. */
> >
> > I was wondering though, this is a slow path and the smp_mb() has been there
> > since 2009 or so. Not sure if it is super valuable to remove it at this
> > point. But we/I should definitely understand it.
>
> The point is indeed not to improve performance because this is a slowpath
> (although...). The main goal is to maintain a clear picture of the ordering
> without needless barriers that leave a taste of doubt to reviewers.
Agreed!
> > I was wondering if you could also point to the fastpath that this is racing
> > with, it is not immediately clear (to me) what this smp_mb() is pairing with
> > :(
>
> It is supposed to pair with the barrier in sync_exp_work_done() but then again
> this is already enforced by the smp_mb__after_unlock_lock() chained through
> rnp locking.
You could interpret that "Order GP completion with preceding accesses"
to include preceding readers, which to your point sounds plausible.
And in that case, again as you say, the raw_spin_lock_irqsave_rcu_node()
in rcu_report_exp_rnp() provides the needed ordering.
I think. ;-)
Thanx, Paul
> Thanks.
>
> >
> > thanks,
> >
> > - Joel
> >
> >
> >
> >
> >
> > > np = rcu_next_node_entry(t, rnp);
> > > list_del_init(&t->rcu_node_entry);
> > > t->rcu_blocked_node = NULL;
> > > --
> > > 2.46.0
> > >
* Re: [PATCH 2/3] rcu/exp: Remove confusing needless full barrier on task unblock
2025-02-26 15:04 ` Paul E. McKenney
@ 2025-02-26 15:26 ` Joel Fernandes
2025-02-26 15:34 ` Frederic Weisbecker
0 siblings, 1 reply; 25+ messages in thread
From: Joel Fernandes @ 2025-02-26 15:26 UTC (permalink / raw)
To: paulmck, Frederic Weisbecker
Cc: LKML, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
Uladzislau Rezki, Zqiang, rcu
On 2/26/2025 10:04 AM, Paul E. McKenney wrote:
>>> I was wondering if you could also point to the fastpath that this is racing
>>> with, it is not immediately clear (to me) what this smp_mb() is pairing with
>>> 🙁
>> It is supposed to pair with the barrier in sync_exp_work_done() but then again
>> this is already enforced by the smp_mb__after_unlock_lock() chained through
>> rnp locking.
> You could interpret that "Order GP completion with preceding accesses"
> to include preceding readers, which to your point sounds plausible.
> And in that case, again as you say, the raw_spin_lock_irqsave_rcu_node()
> in rcu_report_exp_rnp() provides the needed ordering.
>
> I think. 😉
This is for the case where readers are blocked. For the case where readers were
not blocked and we exited the RSCS, we just expect the regular QS reporting
paths to call into rcu_report_exp_rnp() and similarly provide the full memory
barrier (smp_mb) on the now-reader-unlocked CPU, right?
Just wanted to check my understanding was correct :)
Also, if I may paraphrase the ordering requirement here: we do not want RCU
readers to observe any modifications happening to data after the GP has ended
(i.e. synchronize_rcu_expedited() has returned). Similarly, we want updates
made within the pre-existing readers to be visible to accesses after the GP
has ended. Right?
thanks,
- Joel
* Re: [PATCH 2/3] rcu/exp: Remove confusing needless full barrier on task unblock
2025-02-26 15:26 ` Joel Fernandes
@ 2025-02-26 15:34 ` Frederic Weisbecker
0 siblings, 0 replies; 25+ messages in thread
From: Frederic Weisbecker @ 2025-02-26 15:34 UTC (permalink / raw)
To: Joel Fernandes
Cc: paulmck, LKML, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
Uladzislau Rezki, Zqiang, rcu
On Wed, Feb 26, 2025 at 10:26:34AM -0500, Joel Fernandes wrote:
>
>
> On 2/26/2025 10:04 AM, Paul E. McKenney wrote:
> >>> I was wondering if you could also point to the fastpath that this is racing
> >>> with, it is not immediately clear (to me) what this smp_mb() is pairing with
> >>> 🙁
> >> It is supposed to pair with the barrier in sync_exp_work_done() but then again
> >> this is already enforced by the smp_mb__after_unlock_lock() chained through
> >> rnp locking.
> > You could interpret that "Order GP completion with preceding accesses"
> > to include preceding readers, which to your point sounds plausible.
> > And in that case, again as you say, the raw_spin_lock_irqsave_rcu_node()
> > in rcu_report_exp_rnp() provides the needed ordering.
> >
> > I think. 😉
>
> This is for the case where readers are blocked. For the case where readers were
> not blocked and we exited the RSCS, we just expect the regular QS reporting
> paths to call into rcu_report_exp_rnp() and similarly provide the full memory
> barrier (smp_mb) on the now-reader-unlocked CPU, right?
Right, again through rnp locking and smp_mb__after_unlock_lock().
>
> Just wanted to check my understanding was correct :)
>
> Also, if I may paraphrase the ordering requirement here: we do not want RCU
> readers to observe any modifications happening to data after the GP has ended
> (i.e. synchronize_rcu_expedited() has returned). Similarly, we want updates
> made within the pre-existing readers to be visible to accesses after the GP
> has ended. Right?
Exactly!
Thanks.
>
> thanks,
>
> - Joel
* Re: [PATCH 3/3] rcu/exp: Remove needless CPU up quiescent state report
2025-02-13 23:25 ` [PATCH 3/3] rcu/exp: Remove needless CPU up quiescent state report Frederic Weisbecker
2025-02-14 9:01 ` Paul E. McKenney
@ 2025-03-03 20:10 ` Paul E. McKenney
2025-03-14 14:39 ` Frederic Weisbecker
1 sibling, 1 reply; 25+ messages in thread
From: Paul E. McKenney @ 2025-03-03 20:10 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: LKML, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
Uladzislau Rezki, Zqiang, rcu
On Fri, Feb 14, 2025 at 12:25:59AM +0100, Frederic Weisbecker wrote:
> A CPU coming online checks for an ongoing grace period and reports
> a quiescent state accordingly if needed. This special treatment, which
> shortcuts the expedited IPI, originates as an optimization in the
> following commit:
>
> 338b0f760e84 ("rcu: Better hotplug handling for synchronize_sched_expedited()")
>
> The point is to avoid an IPI while waiting for a CPU to become online
> or failing to become offline.
>
> However this is pointless and even error-prone for several reasons:
>
> * If the CPU has been seen offline in the first round scanning offline
> and idle CPUs, no IPI is even tried and the quiescent state is
> reported on behalf of the CPU.
>
> * This means that if the IPI fails, the CPU just became offline. So
> it's unlikely to become online right away, unless the cpu hotplug
> operation failed and rolled back, which is a rare event that can
> wait a jiffy for a new IPI to be issued.
>
> * Moreover, the "optimization" for the failing-CPU-hotplug-down case
> only applies to !PREEMPT_RCU.
>
> * This forcibly reports a quiescent state even if ->cpu_no_qs.b.exp is
> not set. As a result it can race with remote QS reports on the same
> rdp. Fortunately it happens to be OK, but an accident is waiting to
> happen.
>
> For all those reasons, remove this optimization, which does not look
> worth keeping around.
>
> Reported-by: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Based on discussions:
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
If it was still just me doing RCU, I would skip this one, just out of an
abundance of caution. But you break it, you buy it! ;-)
> ---
> kernel/rcu/tree.c | 2 --
> kernel/rcu/tree_exp.h | 45 ++-----------------------------------------
> 2 files changed, 2 insertions(+), 45 deletions(-)
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 8625f616c65a..86935fe00397 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -151,7 +151,6 @@ static void rcu_report_qs_rnp(unsigned long mask, struct rcu_node *rnp,
> unsigned long gps, unsigned long flags);
> static void invoke_rcu_core(void);
> static void rcu_report_exp_rdp(struct rcu_data *rdp);
> -static void sync_sched_exp_online_cleanup(int cpu);
> static void check_cb_ovld_locked(struct rcu_data *rdp, struct rcu_node *rnp);
> static bool rcu_rdp_is_offloaded(struct rcu_data *rdp);
> static bool rcu_rdp_cpu_online(struct rcu_data *rdp);
> @@ -4259,7 +4258,6 @@ int rcutree_online_cpu(unsigned int cpu)
> raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE)
> return 0; /* Too early in boot for scheduler work. */
> - sync_sched_exp_online_cleanup(cpu);
>
> // Stop-machine done, so allow nohz_full to disable tick.
> tick_dep_clear(TICK_DEP_BIT_RCU);
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index caff16e441d1..a3f962eb7057 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -751,12 +751,8 @@ static void rcu_exp_handler(void *unused)
> struct task_struct *t = current;
>
> /*
> - * First, is there no need for a quiescent state from this CPU,
> - * or is this CPU already looking for a quiescent state for the
> - * current grace period? If either is the case, just leave.
> - * However, this should not happen due to the preemptible
> - * sync_sched_exp_online_cleanup() implementation being a no-op,
> - * so warn if this does happen.
> + * WARN if the CPU is unexpectedly already looking for a
> + * QS or has already reported one.
> */
> ASSERT_EXCLUSIVE_WRITER_SCOPED(rdp->cpu_no_qs.b.exp);
> if (WARN_ON_ONCE(!(READ_ONCE(rnp->expmask) & rdp->grpmask) ||
> @@ -803,11 +799,6 @@ static void rcu_exp_handler(void *unused)
> WARN_ON_ONCE(1);
> }
>
> -/* PREEMPTION=y, so no PREEMPTION=n expedited grace period to clean up after. */
> -static void sync_sched_exp_online_cleanup(int cpu)
> -{
> -}
> -
> /*
> * Scan the current list of tasks blocked within RCU read-side critical
> * sections, printing out the tid of each that is blocking the current
> @@ -885,38 +876,6 @@ static void rcu_exp_handler(void *unused)
> rcu_exp_need_qs();
> }
>
> -/* Send IPI for expedited cleanup if needed at end of CPU-hotplug operation. */
> -static void sync_sched_exp_online_cleanup(int cpu)
> -{
> - unsigned long flags;
> - int my_cpu;
> - struct rcu_data *rdp;
> - int ret;
> - struct rcu_node *rnp;
> -
> - rdp = per_cpu_ptr(&rcu_data, cpu);
> - rnp = rdp->mynode;
> - my_cpu = get_cpu();
> - /* Quiescent state either not needed or already requested, leave. */
> - if (!(READ_ONCE(rnp->expmask) & rdp->grpmask) ||
> - READ_ONCE(rdp->cpu_no_qs.b.exp)) {
> - put_cpu();
> - return;
> - }
> - /* Quiescent state needed on current CPU, so set it up locally. */
> - if (my_cpu == cpu) {
> - local_irq_save(flags);
> - rcu_exp_need_qs();
> - local_irq_restore(flags);
> - put_cpu();
> - return;
> - }
> - /* Quiescent state needed on some other CPU, send IPI. */
> - ret = smp_call_function_single(cpu, rcu_exp_handler, NULL, 0);
> - put_cpu();
> - WARN_ON_ONCE(ret);
> -}
> -
> /*
> * Because preemptible RCU does not exist, we never have to check for
> * tasks blocked within RCU read-side critical sections that are
> --
> 2.46.0
>
* Re: [PATCH 1/3] rcu/exp: Protect against early QS report
2025-02-14 9:10 ` Paul E. McKenney
@ 2025-03-13 16:40 ` Frederic Weisbecker
2025-03-13 17:04 ` Paul E. McKenney
0 siblings, 1 reply; 25+ messages in thread
From: Frederic Weisbecker @ 2025-03-13 16:40 UTC (permalink / raw)
To: Paul E. McKenney
Cc: LKML, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
Uladzislau Rezki, Zqiang, rcu
On Fri, Feb 14, 2025 at 01:10:43AM -0800, Paul E. McKenney wrote:
> On Fri, Feb 14, 2025 at 12:25:57AM +0100, Frederic Weisbecker wrote:
> > When a grace period is started, the ->expmask of each node is set up
> > from sync_exp_reset_tree(). Then later on each leaf node also initializes
> > its ->exp_tasks pointer.
> >
> > This means that the initialization of the quiescent state of a node and
> > the initialization of its blocking tasks happen with an unlocked node
> > gap in-between.
> >
> > It happens to be fine because nothing is expected to report an exp
> > quiescent state within this gap, since no IPIs have been issued yet and
> > every rdp's ->cpu_no_qs.b.exp should be false.
> >
> > However if it were to happen by accident, the quiescent state could be
> > reported and propagated while ignoring tasks that blocked _before_ the
> > start of the grace period.
> >
> > Prevent such trouble from happening in the future and initialize both
> > the quiescent-state mask to report and the blocked-tasks list head from
> > within the same node-locked block.
> >
> > Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
>
> Thank you for looking into this!
>
> One question: What happens if a CPU has tasks pending during the
> call to sync_exp_reset_tree(), but all of these tasks complete
> their RCU read-side critical sections before execution reaches
> __sync_rcu_exp_select_node_cpus()?
>
> (My guess is that all is well, but even if so, it would be good to record
> why in the commit log.)
All is (expected to be) well because the QS won't be reported yet:
rdp->cpu_no_qs.b.exp is still false, therefore rnp->expmask will
still have the rdps' mask bits set.
!PREEMPT_RCU is different because sync_sched_exp_online_cleanup()
can report the QS earlier. But this patch is a PREEMPT_RCU concern
only.
I'll drop a note on the changelog.
Thanks.
>
> Thanx, Paul
>
> > ---
> > kernel/rcu/tree_exp.h | 14 +++++++-------
> > 1 file changed, 7 insertions(+), 7 deletions(-)
> >
> > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> > index 8d4895c854c5..caff16e441d1 100644
> > --- a/kernel/rcu/tree_exp.h
> > +++ b/kernel/rcu/tree_exp.h
> > @@ -141,6 +141,13 @@ static void __maybe_unused sync_exp_reset_tree(void)
> > raw_spin_lock_irqsave_rcu_node(rnp, flags);
> > WARN_ON_ONCE(rnp->expmask);
> > WRITE_ONCE(rnp->expmask, rnp->expmaskinit);
> > + /*
> > + * Need to wait for any blocked tasks as well. Note that
> > + * additional blocking tasks will also block the expedited GP
> > + * until such time as the ->expmask bits are cleared.
> > + */
> > + if (rcu_is_leaf_node(rnp) && rcu_preempt_has_tasks(rnp))
> > + WRITE_ONCE(rnp->exp_tasks, rnp->blkd_tasks.next);
> > raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> > }
> > }
> > @@ -393,13 +400,6 @@ static void __sync_rcu_exp_select_node_cpus(struct rcu_exp_work *rewp)
> > }
> > mask_ofl_ipi = rnp->expmask & ~mask_ofl_test;
> >
> > - /*
> > - * Need to wait for any blocked tasks as well. Note that
> > - * additional blocking tasks will also block the expedited GP
> > - * until such time as the ->expmask bits are cleared.
> > - */
> > - if (rcu_preempt_has_tasks(rnp))
> > - WRITE_ONCE(rnp->exp_tasks, rnp->blkd_tasks.next);
> > raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> >
> > /* IPI the remaining CPUs for expedited quiescent state. */
> > --
> > 2.46.0
> >
* Re: [PATCH 1/3] rcu/exp: Protect against early QS report
2025-03-13 16:40 ` Frederic Weisbecker
@ 2025-03-13 17:04 ` Paul E. McKenney
0 siblings, 0 replies; 25+ messages in thread
From: Paul E. McKenney @ 2025-03-13 17:04 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: LKML, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
Uladzislau Rezki, Zqiang, rcu
On Thu, Mar 13, 2025 at 05:40:19PM +0100, Frederic Weisbecker wrote:
> On Fri, Feb 14, 2025 at 01:10:43AM -0800, Paul E. McKenney wrote:
> > On Fri, Feb 14, 2025 at 12:25:57AM +0100, Frederic Weisbecker wrote:
> > > When a grace period is started, the ->expmask of each node is set up
> > > from sync_exp_reset_tree(). Then later on each leaf node also initializes
> > > its ->exp_tasks pointer.
> > >
> > > This means that the initialization of the quiescent state of a node and
> > > the initialization of its blocking tasks happen with an unlocked node
> > > gap in-between.
> > >
> > > It happens to be fine because nothing is expected to report an exp
> > > quiescent state within this gap, since no IPIs have been issued yet and
> > > every rdp's ->cpu_no_qs.b.exp should be false.
> > >
> > > However if it were to happen by accident, the quiescent state could be
> > > reported and propagated while ignoring tasks that blocked _before_ the
> > > start of the grace period.
> > >
> > > Prevent such trouble from happening in the future and initialize both
> > > the quiescent-state mask to report and the blocked-tasks list head from
> > > within the same node-locked block.
> > >
> > > Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> >
> > Thank you for looking into this!
> >
> > One question: What happens if a CPU has tasks pending during the
> > call to sync_exp_reset_tree(), but all of these tasks complete
> > their RCU read-side critical sections before execution reaches
> > __sync_rcu_exp_select_node_cpus()?
> >
> > (My guess is that all is well, but even if so, it would be good to record
> > why in the commit log.)
>
> All is (expected to be) well because the QS won't be reported yet:
> rdp->cpu_no_qs.b.exp is still false, therefore rnp->expmask will
> still have the RDPs masks set.
>
> !PREEMPT_RCU is different because sync_sched_exp_online_cleanup()
> can report the QS earlier. But that patch is a PREEMPT_RCU concern
> only.
>
> I'll drop a note on the changelog.
Thank you!
Thanx, Paul
> Thanks.
>
>
> >
> > Thanx, Paul
> >
> > > ---
> > > kernel/rcu/tree_exp.h | 14 +++++++-------
> > > 1 file changed, 7 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> > > index 8d4895c854c5..caff16e441d1 100644
> > > --- a/kernel/rcu/tree_exp.h
> > > +++ b/kernel/rcu/tree_exp.h
> > > @@ -141,6 +141,13 @@ static void __maybe_unused sync_exp_reset_tree(void)
> > > raw_spin_lock_irqsave_rcu_node(rnp, flags);
> > > WARN_ON_ONCE(rnp->expmask);
> > > WRITE_ONCE(rnp->expmask, rnp->expmaskinit);
> > > + /*
> > > + * Need to wait for any blocked tasks as well. Note that
> > > + * additional blocking tasks will also block the expedited GP
> > > + * until such time as the ->expmask bits are cleared.
> > > + */
> > > + if (rcu_is_leaf_node(rnp) && rcu_preempt_has_tasks(rnp))
> > > + WRITE_ONCE(rnp->exp_tasks, rnp->blkd_tasks.next);
> > > raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> > > }
> > > }
> > > @@ -393,13 +400,6 @@ static void __sync_rcu_exp_select_node_cpus(struct rcu_exp_work *rewp)
> > > }
> > > mask_ofl_ipi = rnp->expmask & ~mask_ofl_test;
> > >
> > > - /*
> > > - * Need to wait for any blocked tasks as well. Note that
> > > - * additional blocking tasks will also block the expedited GP
> > > - * until such time as the ->expmask bits are cleared.
> > > - */
> > > - if (rcu_preempt_has_tasks(rnp))
> > > - WRITE_ONCE(rnp->exp_tasks, rnp->blkd_tasks.next);
> > > raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> > >
> > > /* IPI the remaining CPUs for expedited quiescent state. */
> > > --
> > > 2.46.0
> > >
* Re: [PATCH 3/3] rcu/exp: Remove needless CPU up quiescent state report
2025-03-03 20:10 ` Paul E. McKenney
@ 2025-03-14 14:39 ` Frederic Weisbecker
2025-03-18 17:07 ` Paul E. McKenney
0 siblings, 1 reply; 25+ messages in thread
From: Frederic Weisbecker @ 2025-03-14 14:39 UTC (permalink / raw)
To: Paul E. McKenney
Cc: LKML, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
Uladzislau Rezki, Zqiang, rcu
On Mon, Mar 03, 2025 at 12:10:50PM -0800, Paul E. McKenney wrote:
> On Fri, Feb 14, 2025 at 12:25:59AM +0100, Frederic Weisbecker wrote:
> > A CPU coming online checks for an ongoing grace period and reports
> > a quiescent state accordingly if needed. This special treatment, which
> > shortcuts the expedited IPI, originates as an optimization in the
> > following commit:
> >
> > 338b0f760e84 ("rcu: Better hotplug handling for synchronize_sched_expedited()")
> >
> > The point is to avoid an IPI while waiting for a CPU to become online
> > or failing to become offline.
> >
> > However this is pointless and even error-prone for several reasons:
> >
> > * If the CPU has been seen offline in the first round scanning offline
> > and idle CPUs, no IPI is even tried and the quiescent state is
> > reported on behalf of the CPU.
> >
> > * This means that if the IPI fails, the CPU just became offline. So
> > it's unlikely to become online right away, unless the cpu hotplug
> > operation failed and rolled back, which is a rare event that can
> > wait a jiffy for a new IPI to be issued.
> >
> > * Moreover, the "optimization" for the failing-CPU-hotplug-down case
> > only applies to !PREEMPT_RCU.
> >
> > * This forcibly reports a quiescent state even if ->cpu_no_qs.b.exp is
> > not set. As a result it can race with remote QS reports on the same
> > rdp. Fortunately it happens to be OK, but an accident is waiting to
> > happen.
> >
> > For all those reasons, remove this optimization, which does not look
> > worth keeping around.
> >
> > Reported-by: Paul E. McKenney <paulmck@kernel.org>
> > Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
>
> Based on discussions:
>
> Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
>
> If it was still just me doing RCU, I would skip this one, just out of an
> abundance of caution. But you break it, you buy it! ;-)
I'm trying to be optimistic... ;-))
* Re: [PATCH 3/3] rcu/exp: Remove needless CPU up quiescent state report
2025-03-14 14:39 ` Frederic Weisbecker
@ 2025-03-18 17:07 ` Paul E. McKenney
0 siblings, 0 replies; 25+ messages in thread
From: Paul E. McKenney @ 2025-03-18 17:07 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: LKML, Boqun Feng, Joel Fernandes, Neeraj Upadhyay,
Uladzislau Rezki, Zqiang, rcu
On Fri, Mar 14, 2025 at 03:39:05PM +0100, Frederic Weisbecker wrote:
> On Mon, Mar 03, 2025 at 12:10:50PM -0800, Paul E. McKenney wrote:
> > On Fri, Feb 14, 2025 at 12:25:59AM +0100, Frederic Weisbecker wrote:
> > > A CPU coming online checks for an ongoing grace period and reports
> > > a quiescent state accordingly if needed. This special treatment, which
> > > shortcuts the expedited IPI, originates as an optimization in the
> > > following commit:
> > >
> > > 338b0f760e84 ("rcu: Better hotplug handling for synchronize_sched_expedited()")
> > >
> > > The point is to avoid an IPI while waiting for a CPU to become online
> > > or failing to become offline.
> > >
> > > However this is pointless and even error-prone for several reasons:
> > >
> > > * If the CPU has been seen offline in the first round scanning offline
> > > and idle CPUs, no IPI is even tried and the quiescent state is
> > > reported on behalf of the CPU.
> > >
> > > * This means that if the IPI fails, the CPU just became offline. So
> > > it's unlikely to become online right away, unless the cpu hotplug
> > > operation failed and rolled back, which is a rare event that can
> > > wait a jiffy for a new IPI to be issued.
> > >
> > > * Moreover, the "optimization" for the failing-CPU-hotplug-down case
> > > only applies to !PREEMPT_RCU.
> > >
> > > * This forcibly reports a quiescent state even if ->cpu_no_qs.b.exp is
> > > not set. As a result it can race with remote QS reports on the same
> > > rdp. Fortunately it happens to be OK, but an accident is waiting to
> > > happen.
> > >
> > > For all those reasons, remove this optimization, which does not look
> > > worth keeping around.
> > >
> > > Reported-by: Paul E. McKenney <paulmck@kernel.org>
> > > Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> >
> > Based on discussions:
> >
> > Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
> >
> > If it was still just me doing RCU, I would skip this one, just out of an
> > abundance of caution. But you break it, you buy it! ;-)
>
> I'm trying to be optimistic... ;-))
So am I. ;-)
Thanx, Paul
end of thread, other threads:[~2025-03-18 17:07 UTC | newest]
Thread overview: 25+ messages
2025-02-13 23:25 [PATCH 0/3] rcu/exp updates Frederic Weisbecker
2025-02-13 23:25 ` [PATCH 1/3] rcu/exp: Protect against early QS report Frederic Weisbecker
2025-02-14 9:10 ` Paul E. McKenney
2025-03-13 16:40 ` Frederic Weisbecker
2025-03-13 17:04 ` Paul E. McKenney
2025-02-13 23:25 ` [PATCH 2/3] rcu/exp: Remove confusing needless full barrier on task unblock Frederic Weisbecker
2025-02-25 21:59 ` Joel Fernandes
2025-02-26 0:08 ` Paul E. McKenney
2025-02-26 12:52 ` Frederic Weisbecker
2025-02-26 15:04 ` Paul E. McKenney
2025-02-26 15:26 ` Joel Fernandes
2025-02-26 15:34 ` Frederic Weisbecker
2025-02-13 23:25 ` [PATCH 3/3] rcu/exp: Remove needless CPU up quiescent state report Frederic Weisbecker
2025-02-14 9:01 ` Paul E. McKenney
2025-02-14 12:10 ` Frederic Weisbecker
2025-02-15 10:38 ` Paul E. McKenney
2025-02-15 22:23 ` Frederic Weisbecker
2025-02-19 14:58 ` Paul E. McKenney
2025-02-19 15:55 ` Paul E. McKenney
2025-02-21 15:31 ` Frederic Weisbecker
2025-02-21 15:52 ` Frederic Weisbecker
2025-02-26 0:00 ` Paul E. McKenney
2025-03-03 20:10 ` Paul E. McKenney
2025-03-14 14:39 ` Frederic Weisbecker
2025-03-18 17:07 ` Paul E. McKenney