* [PATCH 0/4] rcu: Fix PF_IDLE related issues v2
@ 2023-10-24 21:46 Frederic Weisbecker
2023-10-24 21:46 ` [PATCH 1/4] rcu: Introduce rcu_cpu_online() Frederic Weisbecker
` (3 more replies)
0 siblings, 4 replies; 30+ messages in thread
From: Frederic Weisbecker @ 2023-10-24 21:46 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Boqun Feng, Joel Fernandes, Josh Triplett,
Lai Jiangshan, Mathieu Desnoyers, Neeraj Upadhyay,
Paul E . McKenney, Steven Rostedt, Uladzislau Rezki, Zqiang, rcu,
Liam R . Howlett, Peter Zijlstra
The recent modification of PF_IDLE semantics, made to fix a bug in
rcutiny, eventually introduced new bugs in RCU-tasks. In this v2, the
series proposes to fix these issues without reverting:
cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup")
Frederic Weisbecker (4):
rcu: Introduce rcu_cpu_online()
rcu/tasks: Handle new PF_IDLE semantics
rcu/tasks-trace: Handle new PF_IDLE semantics
sched: Exclude CPU boot code from PF_IDLE area
include/linux/sched.h | 2 +-
kernel/cpu.c | 4 ++++
kernel/rcu/rcu.h | 2 ++
kernel/rcu/tasks.h | 33 ++++++++++++++++++++++++++++++---
kernel/rcu/tree.c | 7 +++++++
kernel/sched/idle.c | 1 -
6 files changed, 44 insertions(+), 5 deletions(-)
--
2.41.0
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH 1/4] rcu: Introduce rcu_cpu_online()
2023-10-24 21:46 [PATCH 0/4] rcu: Fix PF_IDLE related issues v2 Frederic Weisbecker
@ 2023-10-24 21:46 ` Frederic Weisbecker
2023-10-25 2:29 ` Z qiang
2023-10-24 21:46 ` [PATCH 2/4] rcu/tasks: Handle new PF_IDLE semantics Frederic Weisbecker
` (2 subsequent siblings)
3 siblings, 1 reply; 30+ messages in thread
From: Frederic Weisbecker @ 2023-10-24 21:46 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Boqun Feng, Joel Fernandes, Josh Triplett,
Lai Jiangshan, Mathieu Desnoyers, Neeraj Upadhyay,
Paul E . McKenney, Steven Rostedt, Uladzislau Rezki, Zqiang, rcu,
Liam R . Howlett, Peter Zijlstra
Export the RCU point of view as to when a CPU is considered offline
(i.e., when RCU considers a CPU to be sufficiently far down the
hotplug path that it can no longer host any read-side critical section).
This will be used by RCU-tasks, whose view of an offline CPU should
reasonably match that of the RCU core.
Fixes: cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/rcu/rcu.h | 2 ++
kernel/rcu/tree.c | 7 +++++++
2 files changed, 9 insertions(+)
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 0d866eaa4cc8..b531c33e9545 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -500,6 +500,7 @@ static inline void rcu_expedite_gp(void) { }
static inline void rcu_unexpedite_gp(void) { }
static inline void rcu_async_hurry(void) { }
static inline void rcu_async_relax(void) { }
+static inline bool rcu_cpu_online(int cpu) { return true; }
#else /* #ifdef CONFIG_TINY_RCU */
bool rcu_gp_is_normal(void); /* Internal RCU use. */
bool rcu_gp_is_expedited(void); /* Internal RCU use. */
@@ -509,6 +510,7 @@ void rcu_unexpedite_gp(void);
void rcu_async_hurry(void);
void rcu_async_relax(void);
void rcupdate_announce_bootup_oddness(void);
+bool rcu_cpu_online(int cpu);
#ifdef CONFIG_TASKS_RCU_GENERIC
void show_rcu_tasks_gp_kthreads(void);
#else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 700524726079..fd21c1506092 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4202,6 +4202,13 @@ static bool rcu_rdp_cpu_online(struct rcu_data *rdp)
return !!(rdp->grpmask & rcu_rnp_online_cpus(rdp->mynode));
}
+bool rcu_cpu_online(int cpu)
+{
+ struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
+
+ return rcu_rdp_cpu_online(rdp);
+}
+
#if defined(CONFIG_PROVE_RCU) && defined(CONFIG_HOTPLUG_CPU)
/*
--
2.41.0
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH 2/4] rcu/tasks: Handle new PF_IDLE semantics
2023-10-24 21:46 [PATCH 0/4] rcu: Fix PF_IDLE related issues v2 Frederic Weisbecker
2023-10-24 21:46 ` [PATCH 1/4] rcu: Introduce rcu_cpu_online() Frederic Weisbecker
@ 2023-10-24 21:46 ` Frederic Weisbecker
2023-10-25 8:40 ` Peter Zijlstra
2023-10-24 21:46 ` [PATCH 3/4] rcu/tasks-trace: " Frederic Weisbecker
2023-10-24 21:46 ` [PATCH 4/4] sched: Exclude CPU boot code from PF_IDLE area Frederic Weisbecker
3 siblings, 1 reply; 30+ messages in thread
From: Frederic Weisbecker @ 2023-10-24 21:46 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Boqun Feng, Joel Fernandes, Josh Triplett,
Lai Jiangshan, Mathieu Desnoyers, Neeraj Upadhyay,
Paul E . McKenney, Steven Rostedt, Uladzislau Rezki, Zqiang, rcu,
Liam R . Howlett, Peter Zijlstra
The commit:
cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup")
has changed the semantics of what is to be considered an idle task in
such a way that CPU boot code preceding the actual idle loop is excluded
from it.
However, this has introduced new potential RCU-tasks stalls when:
1) A grace period is started before init/0 has had a chance to set
PF_IDLE, keeping it stuck in the holdout list until the idle task
eventually schedules.
2) A grace period is started while some possible CPUs have never been
online, keeping their idle tasks stuck in the holdout list until
those CPUs eventually boot up.
3) Similar to 1) but with secondary CPUs: a grace period is started
concurrently with a secondary CPU booting, putting its idle task in
the holdout list because PF_IDLE isn't yet observed on it. It then
stays stuck in the holdout list until that CPU eventually schedules.
The effect is mitigated here by the hotplug AP thread, which must
run to bring the CPU up.
Fix this by handling the new semantics of PF_IDLE, keeping in mind
that it may or may not be set on an idle task. Take advantage of that to
strengthen the coverage of an RCU-tasks quiescent state within an idle
task, excluding the CPU boot code from it. Only the code running within
the idle loop is now a quiescent state, along with offline CPUs.
Fixes: cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup")
Suggested-by: Joel Fernandes <joel@joelfernandes.org>
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/rcu/tasks.h | 31 +++++++++++++++++++++++++++++--
1 file changed, 29 insertions(+), 2 deletions(-)
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index bf5f178fe723..acf81efe5eff 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -895,10 +895,37 @@ static void rcu_tasks_pregp_step(struct list_head *hop)
synchronize_rcu();
}
+/* Check for quiescent states since the pregp's synchronize_rcu() */
+static bool rcu_tasks_is_holdout(struct task_struct *t)
+{
+ int cpu;
+
+ /* Has the task been seen voluntarily sleeping? */
+ if (!READ_ONCE(t->on_rq))
+ return false;
+
+ cpu = task_cpu(t);
+
+ /*
+ * Idle tasks within the idle loop or offline CPUs are RCU-tasks
+ * quiescent states. But CPU boot code performed by the idle task
+ * isn't a quiescent state.
+ */
+ if (t == idle_task(cpu)) {
+ if (is_idle_task(t))
+ return false;
+
+ if (!rcu_cpu_online(cpu))
+ return false;
+ }
+
+ return true;
+}
+
/* Per-task initial processing. */
static void rcu_tasks_pertask(struct task_struct *t, struct list_head *hop)
{
- if (t != current && READ_ONCE(t->on_rq) && !is_idle_task(t)) {
+ if (t != current && rcu_tasks_is_holdout(t)) {
get_task_struct(t);
t->rcu_tasks_nvcsw = READ_ONCE(t->nvcsw);
WRITE_ONCE(t->rcu_tasks_holdout, true);
@@ -947,7 +974,7 @@ static void check_holdout_task(struct task_struct *t,
if (!READ_ONCE(t->rcu_tasks_holdout) ||
t->rcu_tasks_nvcsw != READ_ONCE(t->nvcsw) ||
- !READ_ONCE(t->on_rq) ||
+ !rcu_tasks_is_holdout(t) ||
(IS_ENABLED(CONFIG_NO_HZ_FULL) &&
!is_idle_task(t) && READ_ONCE(t->rcu_tasks_idle_cpu) >= 0)) {
WRITE_ONCE(t->rcu_tasks_holdout, false);
--
2.41.0
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH 3/4] rcu/tasks-trace: Handle new PF_IDLE semantics
2023-10-24 21:46 [PATCH 0/4] rcu: Fix PF_IDLE related issues v2 Frederic Weisbecker
2023-10-24 21:46 ` [PATCH 1/4] rcu: Introduce rcu_cpu_online() Frederic Weisbecker
2023-10-24 21:46 ` [PATCH 2/4] rcu/tasks: Handle new PF_IDLE semantics Frederic Weisbecker
@ 2023-10-24 21:46 ` Frederic Weisbecker
2023-10-24 21:46 ` [PATCH 4/4] sched: Exclude CPU boot code from PF_IDLE area Frederic Weisbecker
3 siblings, 0 replies; 30+ messages in thread
From: Frederic Weisbecker @ 2023-10-24 21:46 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Boqun Feng, Joel Fernandes, Josh Triplett,
Lai Jiangshan, Mathieu Desnoyers, Neeraj Upadhyay,
Paul E . McKenney, Steven Rostedt, Uladzislau Rezki, Zqiang, rcu,
Liam R . Howlett, Peter Zijlstra, Naresh Kamboju
The commit:
cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup")
has changed the semantics of what is to be considered an idle task in
such a way that the idle task of an offline CPU may not carry the
PF_IDLE flag anymore.
However, RCU-tasks-trace asserts the opposite, still assuming
that idle tasks carry the PF_IDLE flag throughout their whole lifecycle.
Remove this assumption to avoid spurious warnings, but keep the initial
test verifying that the idle task is the current task on any offline
CPU.
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Fixes: cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup")
Suggested-by: Joel Fernandes <joel@joelfernandes.org>
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/rcu/tasks.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index acf81efe5eff..4dd70f2af4af 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -1552,7 +1552,7 @@ static int trc_inspect_reader(struct task_struct *t, void *bhp_in)
} else {
// The task is not running, so C-language access is safe.
nesting = t->trc_reader_nesting;
- WARN_ON_ONCE(ofl && task_curr(t) && !is_idle_task(t));
+ WARN_ON_ONCE(ofl && task_curr(t) && (t != idle_task(task_cpu(t))));
if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB) && ofl)
n_heavy_reader_ofl_updates++;
}
--
2.41.0
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH 4/4] sched: Exclude CPU boot code from PF_IDLE area
2023-10-24 21:46 [PATCH 0/4] rcu: Fix PF_IDLE related issues v2 Frederic Weisbecker
` (2 preceding siblings ...)
2023-10-24 21:46 ` [PATCH 3/4] rcu/tasks-trace: " Frederic Weisbecker
@ 2023-10-24 21:46 ` Frederic Weisbecker
2023-10-25 8:48 ` Peter Zijlstra
3 siblings, 1 reply; 30+ messages in thread
From: Frederic Weisbecker @ 2023-10-24 21:46 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Boqun Feng, Joel Fernandes, Josh Triplett,
Lai Jiangshan, Mathieu Desnoyers, Neeraj Upadhyay,
Paul E . McKenney, Steven Rostedt, Uladzislau Rezki, Zqiang, rcu,
Liam R . Howlett, Peter Zijlstra
The commit:
cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup")
has changed the semantics of what is to be considered an idle task in
such a way that only the actual idle loop is accounted as PF_IDLE. The
intent is to exclude the CPU boot code from that coverage.
However the commit doesn't clear the flag when the CPU goes down.
Therefore, when the CPU comes up again, its boot code is part of the
PF_IDLE zone. Make sure this flag behaves consistently by clearing it
when a CPU exits the idle loop. If nothing else, RCU-tasks relies on it
to exclude CPU boot code from its quiescent states.
Fixes: cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/sched.h | 2 +-
kernel/cpu.c | 4 ++++
kernel/sched/idle.c | 1 -
3 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8885be2c143e..ad18962b921d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1945,7 +1945,7 @@ extern struct task_struct *idle_task(int cpu);
*/
static __always_inline bool is_idle_task(const struct task_struct *p)
{
- return !!(p->flags & PF_IDLE);
+ return !!(READ_ONCE(p->flags) & PF_IDLE);
}
extern struct task_struct *curr_task(int cpu);
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 3b9d5c7eb4a2..3a1991010f4e 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1394,7 +1394,9 @@ void cpuhp_report_idle_dead(void)
{
struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
+ WRITE_ONCE(current->flags, current->flags & ~PF_IDLE);
BUG_ON(st->state != CPUHP_AP_OFFLINE);
+
rcutree_report_cpu_dead();
st->state = CPUHP_AP_IDLE_DEAD;
/*
@@ -1642,6 +1644,8 @@ void cpuhp_online_idle(enum cpuhp_state state)
{
struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
+ WRITE_ONCE(current->flags, current->flags | PF_IDLE);
+
/* Happens for the boot cpu */
if (state != CPUHP_AP_ONLINE_IDLE)
return;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 5007b25c5bc6..342f58a329f5 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -373,7 +373,6 @@ EXPORT_SYMBOL_GPL(play_idle_precise);
void cpu_startup_entry(enum cpuhp_state state)
{
- current->flags |= PF_IDLE;
arch_cpu_idle_prepare();
cpuhp_online_idle(state);
while (1)
--
2.41.0
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH 1/4] rcu: Introduce rcu_cpu_online()
2023-10-24 21:46 ` [PATCH 1/4] rcu: Introduce rcu_cpu_online() Frederic Weisbecker
@ 2023-10-25 2:29 ` Z qiang
2023-10-25 10:32 ` Frederic Weisbecker
0 siblings, 1 reply; 30+ messages in thread
From: Z qiang @ 2023-10-25 2:29 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: LKML, Boqun Feng, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Mathieu Desnoyers, Neeraj Upadhyay, Paul E . McKenney,
Steven Rostedt, Uladzislau Rezki, rcu, Liam R . Howlett,
Peter Zijlstra
>
> Export the RCU point of view as to when a CPU is considered offline
> (ie: when does RCU consider that a CPU is sufficiently down in the
> hotplug process to not feature any possible read side).
>
> This will be used by RCU-tasks whose vision of an offline CPU should
> reasonably match the one of RCU core.
>
> Fixes: cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup")
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
> kernel/rcu/rcu.h | 2 ++
> kernel/rcu/tree.c | 7 +++++++
> 2 files changed, 9 insertions(+)
>
> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> index 0d866eaa4cc8..b531c33e9545 100644
> --- a/kernel/rcu/rcu.h
> +++ b/kernel/rcu/rcu.h
> @@ -500,6 +500,7 @@ static inline void rcu_expedite_gp(void) { }
> static inline void rcu_unexpedite_gp(void) { }
> static inline void rcu_async_hurry(void) { }
> static inline void rcu_async_relax(void) { }
> +static inline bool rcu_cpu_online(int cpu) { return true; }
> #else /* #ifdef CONFIG_TINY_RCU */
> bool rcu_gp_is_normal(void); /* Internal RCU use. */
> bool rcu_gp_is_expedited(void); /* Internal RCU use. */
> @@ -509,6 +510,7 @@ void rcu_unexpedite_gp(void);
> void rcu_async_hurry(void);
> void rcu_async_relax(void);
> void rcupdate_announce_bootup_oddness(void);
> +bool rcu_cpu_online(int cpu);
> #ifdef CONFIG_TASKS_RCU_GENERIC
> void show_rcu_tasks_gp_kthreads(void);
> #else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 700524726079..fd21c1506092 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -4202,6 +4202,13 @@ static bool rcu_rdp_cpu_online(struct rcu_data *rdp)
> return !!(rdp->grpmask & rcu_rnp_online_cpus(rdp->mynode));
> }
>
> +bool rcu_cpu_online(int cpu)
> +{
> + struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
>
>
Should 'struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu)' be used ?
Thanks
Zqiang
>
> +
> + return rcu_rdp_cpu_online(rdp);
> +}
> +
> #if defined(CONFIG_PROVE_RCU) && defined(CONFIG_HOTPLUG_CPU)
>
> /*
> --
> 2.41.0
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 2/4] rcu/tasks: Handle new PF_IDLE semantics
2023-10-24 21:46 ` [PATCH 2/4] rcu/tasks: Handle new PF_IDLE semantics Frederic Weisbecker
@ 2023-10-25 8:40 ` Peter Zijlstra
2023-10-25 10:31 ` Frederic Weisbecker
0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2023-10-25 8:40 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: LKML, Boqun Feng, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Mathieu Desnoyers, Neeraj Upadhyay, Paul E . McKenney,
Steven Rostedt, Uladzislau Rezki, Zqiang, rcu, Liam R . Howlett
On Tue, Oct 24, 2023 at 11:46:23PM +0200, Frederic Weisbecker wrote:
> +/* Check for quiescent states since the pregp's synchronize_rcu() */
> +static bool rcu_tasks_is_holdout(struct task_struct *t)
> +{
> + int cpu;
> +
> + /* Has the task been seen voluntarily sleeping? */
> + if (!READ_ONCE(t->on_rq))
> + return false;
> +
> + cpu = task_cpu(t);
> +
> + /*
> + * Idle tasks within the idle loop or offline CPUs are RCU-tasks
> + * quiescent states. But CPU boot code performed by the idle task
> + * isn't a quiescent state.
> + */
> + if (t == idle_task(cpu)) {
> + if (is_idle_task(t))
> + return false;
> +
> + if (!rcu_cpu_online(cpu))
> + return false;
> + }
Hmm, why is this guarded by t == idle_task() ?
Notably, there is the idle-injection thing that uses FIFO tasks to run
'idle', see play_idle_precise(). This will (temporarily) get PF_IDLE on
tasks that are not idle_task().
> +
> + return true;
> +}
> +
> /* Per-task initial processing. */
> static void rcu_tasks_pertask(struct task_struct *t, struct list_head *hop)
> {
> - if (t != current && READ_ONCE(t->on_rq) && !is_idle_task(t)) {
> + if (t != current && rcu_tasks_is_holdout(t)) {
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 4/4] sched: Exclude CPU boot code from PF_IDLE area
2023-10-24 21:46 ` [PATCH 4/4] sched: Exclude CPU boot code from PF_IDLE area Frederic Weisbecker
@ 2023-10-25 8:48 ` Peter Zijlstra
2023-10-25 11:25 ` Frederic Weisbecker
0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2023-10-25 8:48 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: LKML, Boqun Feng, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Mathieu Desnoyers, Neeraj Upadhyay, Paul E . McKenney,
Steven Rostedt, Uladzislau Rezki, Zqiang, rcu, Liam R . Howlett
On Tue, Oct 24, 2023 at 11:46:25PM +0200, Frederic Weisbecker wrote:
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8885be2c143e..ad18962b921d 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1945,7 +1945,7 @@ extern struct task_struct *idle_task(int cpu);
> */
> static __always_inline bool is_idle_task(const struct task_struct *p)
> {
> - return !!(p->flags & PF_IDLE);
> + return !!(READ_ONCE(p->flags) & PF_IDLE);
> }
>
> extern struct task_struct *curr_task(int cpu);
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 3b9d5c7eb4a2..3a1991010f4e 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1394,7 +1394,9 @@ void cpuhp_report_idle_dead(void)
> {
> struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
>
> + WRITE_ONCE(current->flags, current->flags & ~PF_IDLE);
> BUG_ON(st->state != CPUHP_AP_OFFLINE);
> +
> rcutree_report_cpu_dead();
> st->state = CPUHP_AP_IDLE_DEAD;
> /*
> @@ -1642,6 +1644,8 @@ void cpuhp_online_idle(enum cpuhp_state state)
> {
> struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
>
> + WRITE_ONCE(current->flags, current->flags | PF_IDLE);
> +
> /* Happens for the boot cpu */
> if (state != CPUHP_AP_ONLINE_IDLE)
> return;
Without changing *ALL* ->flags stores to WRITE_ONCE() I don't see the
point of this. Also, since we only care about a single bit, how does
store tearing affect things?
Not to mention if we're really paranoid, what are the SMP ordering
considerations :-)
[ also, PF_ is used for Protocol Family, Page Flag and Process Flag,
grepping is a pain in the arse :-( ]
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 2/4] rcu/tasks: Handle new PF_IDLE semantics
2023-10-25 8:40 ` Peter Zijlstra
@ 2023-10-25 10:31 ` Frederic Weisbecker
2023-10-26 12:15 ` Z qiang
0 siblings, 1 reply; 30+ messages in thread
From: Frederic Weisbecker @ 2023-10-25 10:31 UTC (permalink / raw)
To: Peter Zijlstra
Cc: LKML, Boqun Feng, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Mathieu Desnoyers, Neeraj Upadhyay, Paul E . McKenney,
Steven Rostedt, Uladzislau Rezki, Zqiang, rcu, Liam R . Howlett
Le Wed, Oct 25, 2023 at 10:40:08AM +0200, Peter Zijlstra a écrit :
> On Tue, Oct 24, 2023 at 11:46:23PM +0200, Frederic Weisbecker wrote:
>
> > +/* Check for quiescent states since the pregp's synchronize_rcu() */
> > +static bool rcu_tasks_is_holdout(struct task_struct *t)
> > +{
> > + int cpu;
> > +
> > + /* Has the task been seen voluntarily sleeping? */
> > + if (!READ_ONCE(t->on_rq))
> > + return false;
> > +
> > + cpu = task_cpu(t);
> > +
> > + /*
> > + * Idle tasks within the idle loop or offline CPUs are RCU-tasks
> > + * quiescent states. But CPU boot code performed by the idle task
> > + * isn't a quiescent state.
> > + */
> > + if (t == idle_task(cpu)) {
> > + if (is_idle_task(t))
> > + return false;
> > +
> > + if (!rcu_cpu_online(cpu))
> > + return false;
> > + }
>
> Hmm, why is this guarded by t == idle_task() ?
>
> Notably, there is the idle-injection thing that uses FIFO tasks to run
> 'idle', see play_idle_precise(). This will (temporarily) get PF_IDLE on
> tasks that are not idle_task().
Ah good point. So indeed the is_idle_task() test mustn't be
guarded by t == idle_task(cpu). But rcu_cpu_online() has to be,
otherwise, if it's not an idle task, there is a risk that the task gets
migrated out by the time we observe the old CPU offline.
Thanks.
>
> > +
> > + return true;
> > +}
> > +
> > /* Per-task initial processing. */
> > static void rcu_tasks_pertask(struct task_struct *t, struct list_head *hop)
> > {
> > - if (t != current && READ_ONCE(t->on_rq) && !is_idle_task(t)) {
> > + if (t != current && rcu_tasks_is_holdout(t)) {
>
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 1/4] rcu: Introduce rcu_cpu_online()
2023-10-25 2:29 ` Z qiang
@ 2023-10-25 10:32 ` Frederic Weisbecker
0 siblings, 0 replies; 30+ messages in thread
From: Frederic Weisbecker @ 2023-10-25 10:32 UTC (permalink / raw)
To: Z qiang
Cc: LKML, Boqun Feng, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Mathieu Desnoyers, Neeraj Upadhyay, Paul E . McKenney,
Steven Rostedt, Uladzislau Rezki, rcu, Liam R . Howlett,
Peter Zijlstra
Le Wed, Oct 25, 2023 at 10:29:46AM +0800, Z qiang a écrit :
> >
> > Export the RCU point of view as to when a CPU is considered offline
> > (ie: when does RCU consider that a CPU is sufficiently down in the
> > hotplug process to not feature any possible read side).
> >
> > This will be used by RCU-tasks whose vision of an offline CPU should
> > reasonably match the one of RCU core.
> >
> > Fixes: cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup")
> > Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> > ---
> > kernel/rcu/rcu.h | 2 ++
> > kernel/rcu/tree.c | 7 +++++++
> > 2 files changed, 9 insertions(+)
> >
> > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> > index 0d866eaa4cc8..b531c33e9545 100644
> > --- a/kernel/rcu/rcu.h
> > +++ b/kernel/rcu/rcu.h
> > @@ -500,6 +500,7 @@ static inline void rcu_expedite_gp(void) { }
> > static inline void rcu_unexpedite_gp(void) { }
> > static inline void rcu_async_hurry(void) { }
> > static inline void rcu_async_relax(void) { }
> > +static inline bool rcu_cpu_online(int cpu) { return true; }
> > #else /* #ifdef CONFIG_TINY_RCU */
> > bool rcu_gp_is_normal(void); /* Internal RCU use. */
> > bool rcu_gp_is_expedited(void); /* Internal RCU use. */
> > @@ -509,6 +510,7 @@ void rcu_unexpedite_gp(void);
> > void rcu_async_hurry(void);
> > void rcu_async_relax(void);
> > void rcupdate_announce_bootup_oddness(void);
> > +bool rcu_cpu_online(int cpu);
> > #ifdef CONFIG_TASKS_RCU_GENERIC
> > void show_rcu_tasks_gp_kthreads(void);
> > #else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 700524726079..fd21c1506092 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -4202,6 +4202,13 @@ static bool rcu_rdp_cpu_online(struct rcu_data *rdp)
> > return !!(rdp->grpmask & rcu_rnp_online_cpus(rdp->mynode));
> > }
> >
> > +bool rcu_cpu_online(int cpu)
> > +{
> > + struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
> >
> >
>
> Should 'struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu)' be used ?
Oh! Good catch!
>
> Thanks
> Zqiang
>
>
> >
> > +
> > + return rcu_rdp_cpu_online(rdp);
> > +}
> > +
> > #if defined(CONFIG_PROVE_RCU) && defined(CONFIG_HOTPLUG_CPU)
> >
> > /*
> > --
> > 2.41.0
> >
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 4/4] sched: Exclude CPU boot code from PF_IDLE area
2023-10-25 8:48 ` Peter Zijlstra
@ 2023-10-25 11:25 ` Frederic Weisbecker
0 siblings, 0 replies; 30+ messages in thread
From: Frederic Weisbecker @ 2023-10-25 11:25 UTC (permalink / raw)
To: Peter Zijlstra
Cc: LKML, Boqun Feng, Joel Fernandes, Josh Triplett, Lai Jiangshan,
Mathieu Desnoyers, Neeraj Upadhyay, Paul E . McKenney,
Steven Rostedt, Uladzislau Rezki, Zqiang, rcu, Liam R . Howlett
Le Wed, Oct 25, 2023 at 10:48:33AM +0200, Peter Zijlstra a écrit :
> On Tue, Oct 24, 2023 at 11:46:25PM +0200, Frederic Weisbecker wrote:
>
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 8885be2c143e..ad18962b921d 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1945,7 +1945,7 @@ extern struct task_struct *idle_task(int cpu);
> > */
> > static __always_inline bool is_idle_task(const struct task_struct *p)
> > {
> > - return !!(p->flags & PF_IDLE);
> > + return !!(READ_ONCE(p->flags) & PF_IDLE);
> > }
> >
> > extern struct task_struct *curr_task(int cpu);
> > diff --git a/kernel/cpu.c b/kernel/cpu.c
> > index 3b9d5c7eb4a2..3a1991010f4e 100644
> > --- a/kernel/cpu.c
> > +++ b/kernel/cpu.c
> > @@ -1394,7 +1394,9 @@ void cpuhp_report_idle_dead(void)
> > {
> > struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
> >
> > + WRITE_ONCE(current->flags, current->flags & ~PF_IDLE);
> > BUG_ON(st->state != CPUHP_AP_OFFLINE);
> > +
> > rcutree_report_cpu_dead();
> > st->state = CPUHP_AP_IDLE_DEAD;
> > /*
> > @@ -1642,6 +1644,8 @@ void cpuhp_online_idle(enum cpuhp_state state)
> > {
> > struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
> >
> > + WRITE_ONCE(current->flags, current->flags | PF_IDLE);
> > +
> > /* Happens for the boot cpu */
> > if (state != CPUHP_AP_ONLINE_IDLE)
> > return;
>
> Without changing *ALL* ->flags stores to WRITE_ONCE() I don't see the
> point of this. Also, since we only care about a single bit, how does
> store tearing affect things?
>
> Not to mention if we're really paranoid, what are the SMP ordering
> considerations :-)
>
> [ also, PF_ is used for Protocol Family, Page Flag and Process Flag,
> grepping is a pain in the arse :-( ]
Indeed. Also cpuhp_online_idle() is called with preemption disabled
and cpuhp_report_idle_dead() with interrupts disabled. As for idle
injection in play_idle_precise(), the flag is set and cleared with
preemption disabled.
This means that all writes are in an RCU read side critical section
that RCU-tasks pre-gp's synchronize_rcu() waits for. So I don't think
we need those WRITE_ONCE/READ_ONCE.
Paul are you ok with that?
Thanks.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 2/4] rcu/tasks: Handle new PF_IDLE semantics
2023-10-25 10:31 ` Frederic Weisbecker
@ 2023-10-26 12:15 ` Z qiang
2023-10-26 14:34 ` Frederic Weisbecker
0 siblings, 1 reply; 30+ messages in thread
From: Z qiang @ 2023-10-26 12:15 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Peter Zijlstra, LKML, Boqun Feng, Joel Fernandes, Josh Triplett,
Lai Jiangshan, Mathieu Desnoyers, Neeraj Upadhyay,
Paul E . McKenney, Steven Rostedt, Uladzislau Rezki, rcu,
Liam R . Howlett
>
> Le Wed, Oct 25, 2023 at 10:40:08AM +0200, Peter Zijlstra a écrit :
> > On Tue, Oct 24, 2023 at 11:46:23PM +0200, Frederic Weisbecker wrote:
> >
> > > +/* Check for quiescent states since the pregp's synchronize_rcu() */
> > > +static bool rcu_tasks_is_holdout(struct task_struct *t)
> > > +{
> > > + int cpu;
> > > +
> > > + /* Has the task been seen voluntarily sleeping? */
> > > + if (!READ_ONCE(t->on_rq))
> > > + return false;
> > > +
> > > + cpu = task_cpu(t);
> > > +
> > > + /*
> > > + * Idle tasks within the idle loop or offline CPUs are RCU-tasks
> > > + * quiescent states. But CPU boot code performed by the idle task
> > > + * isn't a quiescent state.
> > > + */
> > > + if (t == idle_task(cpu)) {
> > > + if (is_idle_task(t))
> > > + return false;
> > > +
> > > + if (!rcu_cpu_online(cpu))
> > > + return false;
> > > + }
> >
> > Hmm, why is this guarded by t == idle_task() ?
> >
> > Notably, there is the idle-injection thing that uses FIFO tasks to run
> > 'idle', see play_idle_precise(). This will (temporarily) get PF_IDLE on
> > tasks that are not idle_task().
>
> Ah good point. So indeed the is_idle_task() test mustn't be
> guarded by t == idle_task(cpu). But rcu_cpu_online() has to, otherwise
> if it's not an idle task, there is a risk that the task gets migrated out
> by the time we observe the old CPU offline.
>
If a FIFO task uses play_idle_precise() to run idle and invokes
do_idle(), it may cause RCU-tasks to falsely report an RCU-tasks QS
when rcu_is_cpu_rrupt_from_idle() returns true in
rcu_sched_clock_irq(). So should we also add a check for
"current == idle_task(task_cpu(current))" in rcu_is_cpu_rrupt_from_idle()?
Thanks
Zqiang
> Thanks.
>
> >
> > > +
> > > + return true;
> > > +}
> > > +
> > > /* Per-task initial processing. */
> > > static void rcu_tasks_pertask(struct task_struct *t, struct list_head *hop)
> > > {
> > > - if (t != current && READ_ONCE(t->on_rq) && !is_idle_task(t)) {
> > > + if (t != current && rcu_tasks_is_holdout(t)) {
> >
> >
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 2/4] rcu/tasks: Handle new PF_IDLE semantics
2023-10-26 12:15 ` Z qiang
@ 2023-10-26 14:34 ` Frederic Weisbecker
0 siblings, 0 replies; 30+ messages in thread
From: Frederic Weisbecker @ 2023-10-26 14:34 UTC (permalink / raw)
To: Z qiang
Cc: Peter Zijlstra, LKML, Boqun Feng, Joel Fernandes, Josh Triplett,
Lai Jiangshan, Mathieu Desnoyers, Neeraj Upadhyay,
Paul E . McKenney, Steven Rostedt, Uladzislau Rezki, rcu,
Liam R . Howlett
On Thu, Oct 26, 2023 at 08:15:33PM +0800, Z qiang wrote:
> >
> > Le Wed, Oct 25, 2023 at 10:40:08AM +0200, Peter Zijlstra a écrit :
> > > On Tue, Oct 24, 2023 at 11:46:23PM +0200, Frederic Weisbecker wrote:
> > >
> > > > +/* Check for quiescent states since the pregp's synchronize_rcu() */
> > > > +static bool rcu_tasks_is_holdout(struct task_struct *t)
> > > > +{
> > > > + int cpu;
> > > > +
> > > > + /* Has the task been seen voluntarily sleeping? */
> > > > + if (!READ_ONCE(t->on_rq))
> > > > + return false;
> > > > +
> > > > + cpu = task_cpu(t);
> > > > +
> > > > + /*
> > > > + * Idle tasks within the idle loop or offline CPUs are RCU-tasks
> > > > + * quiescent states. But CPU boot code performed by the idle task
> > > > + * isn't a quiescent state.
> > > > + */
> > > > + if (t == idle_task(cpu)) {
> > > > + if (is_idle_task(t))
> > > > + return false;
> > > > +
> > > > + if (!rcu_cpu_online(cpu))
> > > > + return false;
> > > > + }
> > >
> > > Hmm, why is this guarded by t == idle_task() ?
> > >
> > > Notably, there is the idle-injection thing that uses FIFO tasks to run
> > > 'idle', see play_idle_precise(). This will (temporarily) get PF_IDLE on
> > > tasks that are not idle_task().
> >
> > Ah good point. So indeed the is_idle_task() test mustn't be
> > guarded by t == idle_task(cpu). But rcu_cpu_online() has to, otherwise
> > if it's not an idle task, there is a risk that the task gets migrated out
> > by the time we observe the old CPU offline.
> >
>
> If a FIFO task uses play_idle_precise() to run idle and invokes
> do_idle(), it may cause
> RCU-tasks to falsely report an RCU-tasks QS
Well, there can be a debate here: should we consider an idle injector a real
task whose voluntary schedule we must wait for, or should we treat it just
like an idle task?
Having that whole idle-task quiescent state in RCU-tasks is quite a strange
semantic anyway. And in the long run, the purpose is to unify RCU-tasks and
RCU-tasks-RUDE by relying on ct_dynticks for idle quiescent states.
> , when rcu_is_cpu_rrupt_from_idle()
> returns true in rcu_sched_clock_irq(), so should we also add a check for
> "current == idle_task(task_cpu(current))" in rcu_is_cpu_rrupt_from_idle()
> ?
That looks fine OTOH. Whether idle injection or real idle,
rcu_is_cpu_rrupt_from_idle() is always a quiescent state in real RCU, because
we know we have no RCU reader between ct_idle_enter() and ct_idle_exit().
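To make the guarding concrete, here is one way to encode the structure described above as a pure userspace function. This is a sketch with made-up names, not the actual kernel code, and it deliberately treats idle injection like idle, which is exactly the open question here:

```c
#include <stdbool.h>
#include <assert.h>

/*
 * Hypothetical userspace model of the adjusted holdout test: the
 * PF_IDLE check applies to any task (so idle injection counts as
 * idle), while the CPU-online check applies only to the per-CPU boot
 * idle task, which cannot be migrated away. Illustrative names only.
 */
struct task_model {
	bool on_rq;		/* models READ_ONCE(t->on_rq) */
	bool pf_idle;		/* models is_idle_task(t) */
	bool is_boot_idle;	/* models t == idle_task(task_cpu(t)) */
	bool cpu_online;	/* models rcu_cpu_online(task_cpu(t)) */
};

static bool rcu_tasks_is_holdout_model(const struct task_model *t)
{
	if (!t->on_rq)
		return false;		/* seen voluntarily sleeping */
	if (t->pf_idle)
		return false;		/* idle loop, idle injection included */
	if (t->is_boot_idle && !t->cpu_online)
		return false;		/* offline CPU's boot idle task */
	return true;
}
```

The point of the structure is that the migration race only matters for tasks that can migrate, and the boot idle task is the one task pinned to its CPU by construction.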
Thanks.
* Re: [PATCH 2/4] rcu/tasks: Handle new PF_IDLE semantics
[not found] ` <20231027144050.110601-3-frederic@kernel.org>
@ 2023-10-27 19:20 ` Peter Zijlstra
2023-10-27 21:23 ` Paul E. McKenney
0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2023-10-27 19:20 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: LKML, Boqun Feng, Joel Fernandes, Josh Triplett,
Mathieu Desnoyers, Neeraj Upadhyay, Paul E . McKenney,
Steven Rostedt, Uladzislau Rezki, rcu, Zqiang, Liam R . Howlett
On Fri, Oct 27, 2023 at 04:40:48PM +0200, Frederic Weisbecker wrote:
> + /* Has the task been seen voluntarily sleeping? */
> + if (!READ_ONCE(t->on_rq))
> + return false;
> - if (t != current && READ_ONCE(t->on_rq) && !is_idle_task(t)) {
AFAICT this ->on_rq usage is outside of scheduler locks and that
READ_ONCE isn't going to help much.
Obviously a pre-existing issue, and I suppose all it cares about is
seeing a 0 or not, irrespective of the races, but urgh..
* Re: [PATCH 2/4] rcu/tasks: Handle new PF_IDLE semantics
2023-10-27 19:20 ` [PATCH 2/4] rcu/tasks: Handle new PF_IDLE semantics Peter Zijlstra
@ 2023-10-27 21:23 ` Paul E. McKenney
2023-10-27 22:46 ` Peter Zijlstra
0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2023-10-27 21:23 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Frederic Weisbecker, LKML, Boqun Feng, Joel Fernandes,
Josh Triplett, Mathieu Desnoyers, Neeraj Upadhyay, Steven Rostedt,
Uladzislau Rezki, rcu, Zqiang, Liam R . Howlett
On Fri, Oct 27, 2023 at 09:20:26PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 27, 2023 at 04:40:48PM +0200, Frederic Weisbecker wrote:
>
> > + /* Has the task been seen voluntarily sleeping? */
> > + if (!READ_ONCE(t->on_rq))
> > + return false;
>
> > - if (t != current && READ_ONCE(t->on_rq) && !is_idle_task(t)) {
>
> AFAICT this ->on_rq usage is outside of scheduler locks and that
> READ_ONCE isn't going to help much.
>
> Obviously a pre-existing issue, and I suppose all it cares about is
> seeing a 0 or not, irrespective of the races, but urgh..
The trick is that RCU Tasks only needs to spot a task voluntarily blocked
once at any point in the grace period. The beginning and end of the
grace-period process have full barriers, so if this code sees t->on_rq
equal to zero, we know that the task was voluntarily blocked at some
point during the grace period, as required.
In theory, we could acquire a scheduler lock, but in practice this would
cause CPU-latency problems at a certain set of large datacenters, and
for once, not the datacenters operated by my employer.
In theory, we could make separate lists of tasks that we need to wait on,
thus avoiding the need to scan the full task list, but in practice this
would require a synchronized linked-list operation on every voluntary
context switch, both in and out.
In theory, the task list could be sharded, so that it could be scanned
incrementally, but in practice, this is a bit non-trivial. Though this
particular use case doesn't care about new tasks, so it could live with
something simpler than would be required for certain types of signal
delivery.
In theory, we could place rcu_segcblist-like mid pointers into the
task list, so that scans could restart from any mid pointer. Care is
required because the mid pointers would likely need to be recycled as
new tasks are added. Plus care is needed because it has been a good
long time since I have looked at the code managing the tasks list,
and I am probably woefully out of date on how it all works.
So, is there a better way?
Thanx, Paul
* Re: [PATCH 2/4] rcu/tasks: Handle new PF_IDLE semantics
2023-10-27 21:23 ` Paul E. McKenney
@ 2023-10-27 22:46 ` Peter Zijlstra
2023-10-27 23:41 ` Paul E. McKenney
0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2023-10-27 22:46 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Frederic Weisbecker, LKML, Boqun Feng, Joel Fernandes,
Josh Triplett, Mathieu Desnoyers, Neeraj Upadhyay, Steven Rostedt,
Uladzislau Rezki, rcu, Zqiang, Liam R . Howlett
On Fri, Oct 27, 2023 at 02:23:56PM -0700, Paul E. McKenney wrote:
> On Fri, Oct 27, 2023 at 09:20:26PM +0200, Peter Zijlstra wrote:
> > On Fri, Oct 27, 2023 at 04:40:48PM +0200, Frederic Weisbecker wrote:
> >
> > > + /* Has the task been seen voluntarily sleeping? */
> > > + if (!READ_ONCE(t->on_rq))
> > > + return false;
> >
> > > - if (t != current && READ_ONCE(t->on_rq) && !is_idle_task(t)) {
> >
> > AFAICT this ->on_rq usage is outside of scheduler locks and that
> > READ_ONCE isn't going to help much.
> >
> > Obviously a pre-existing issue, and I suppose all it cares about is
> > seeing a 0 or not, irrespective of the races, but urgh..
>
> The trick is that RCU Tasks only needs to spot a task voluntarily blocked
> once at any point in the grace period. The beginning and end of the
> grace-period process have full barriers, so if this code sees t->on_rq
> equal to zero, we know that the task was voluntarily blocked at some
> point during the grace period, as required.
>
> In theory, we could acquire a scheduler lock, but in practice this would
> cause CPU-latency problems at a certain set of large datacenters, and
> for once, not the datacenters operated by my employer.
>
> In theory, we could make separate lists of tasks that we need to wait on,
> thus avoiding the need to scan the full task list, but in practice this
> would require a synchronized linked-list operation on every voluntary
> context switch, both in and out.
>
> > In theory, the task list could be sharded, so that it could be scanned
> incrementally, but in practice, this is a bit non-trivial. Though this
> particular use case doesn't care about new tasks, so it could live with
> something simpler than would be required for certain types of signal
> delivery.
>
> In theory, we could place rcu_segcblist-like mid pointers into the
> task list, so that scans could restart from any mid pointer. Care is
> required because the mid pointers would likely need to be recycled as
> new tasks are added. Plus care is needed because it has been a good
> long time since I have looked at the code managing the tasks list,
> and I am probably woefully out of date on how it all works.
>
> So, is there a better way?
Nah, this is more or less what I feared. I just worry people will come
around and put WRITE_ONCE() on the other end. I don't think that'll buy
us much. Nor do I think the current READ_ONCE()s actually matter.
But perhaps put a comment there, that we don't care for the races and
only need to observe a 0 once or something.
* Re: [PATCH 2/4] rcu/tasks: Handle new PF_IDLE semantics
2023-10-27 22:46 ` Peter Zijlstra
@ 2023-10-27 23:41 ` Paul E. McKenney
2023-10-30 8:21 ` Peter Zijlstra
0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2023-10-27 23:41 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Frederic Weisbecker, LKML, Boqun Feng, Joel Fernandes,
Josh Triplett, Mathieu Desnoyers, Neeraj Upadhyay, Steven Rostedt,
Uladzislau Rezki, rcu, Zqiang, Liam R . Howlett
On Sat, Oct 28, 2023 at 12:46:28AM +0200, Peter Zijlstra wrote:
> On Fri, Oct 27, 2023 at 02:23:56PM -0700, Paul E. McKenney wrote:
> > On Fri, Oct 27, 2023 at 09:20:26PM +0200, Peter Zijlstra wrote:
> > > On Fri, Oct 27, 2023 at 04:40:48PM +0200, Frederic Weisbecker wrote:
> > >
> > > > + /* Has the task been seen voluntarily sleeping? */
> > > > + if (!READ_ONCE(t->on_rq))
> > > > + return false;
> > >
> > > > - if (t != current && READ_ONCE(t->on_rq) && !is_idle_task(t)) {
> > >
> > > AFAICT this ->on_rq usage is outside of scheduler locks and that
> > > READ_ONCE isn't going to help much.
> > >
> > > Obviously a pre-existing issue, and I suppose all it cares about is
> > > seeing a 0 or not, irrespective of the races, but urgh..
> >
> > The trick is that RCU Tasks only needs to spot a task voluntarily blocked
> > once at any point in the grace period. The beginning and end of the
> > grace-period process have full barriers, so if this code sees t->on_rq
> > equal to zero, we know that the task was voluntarily blocked at some
> > point during the grace period, as required.
> >
> > In theory, we could acquire a scheduler lock, but in practice this would
> > cause CPU-latency problems at a certain set of large datacenters, and
> > for once, not the datacenters operated by my employer.
> >
> > In theory, we could make separate lists of tasks that we need to wait on,
> > thus avoiding the need to scan the full task list, but in practice this
> > would require a synchronized linked-list operation on every voluntary
> > context switch, both in and out.
> >
> > > In theory, the task list could be sharded, so that it could be scanned
> > incrementally, but in practice, this is a bit non-trivial. Though this
> > particular use case doesn't care about new tasks, so it could live with
> > something simpler than would be required for certain types of signal
> > delivery.
> >
> > In theory, we could place rcu_segcblist-like mid pointers into the
> > task list, so that scans could restart from any mid pointer. Care is
> > required because the mid pointers would likely need to be recycled as
> > new tasks are added. Plus care is needed because it has been a good
> > long time since I have looked at the code managing the tasks list,
> > and I am probably woefully out of date on how it all works.
> >
> > So, is there a better way?
>
> Nah, this is more or less what I feared. I just worry people will come
> around and put WRITE_ONCE() on the other end. I don't think that'll buy
> us much. Nor do I think the current READ_ONCE()s actually matter.
My friend, you trust compilers more than I ever will. ;-)
> But perhaps put a comment there, that we don't care for the races and
> only need to observe a 0 once or something.
There are these two passages in the big block comment preceding the
RCU Tasks code:
// rcu_tasks_pregp_step():
// Invokes synchronize_rcu() in order to wait for all in-flight
// t->on_rq and t->nvcsw transitions to complete. This works because
// all such transitions are carried out with interrupts disabled.
and:
// rcu_tasks_postgp():
// Invokes synchronize_rcu() in order to ensure that all prior
// t->on_rq and t->nvcsw transitions are seen by all CPUs and tasks
// to have happened before the end of this RCU Tasks grace period.
// Again, this works because all such transitions are carried out
// with interrupts disabled.
The rcu_tasks_pregp_step() function contains this comment:
/*
* Wait for all pre-existing t->on_rq and t->nvcsw transitions
* to complete. Invoking synchronize_rcu() suffices because all
* these transitions occur with interrupts disabled. Without this
* synchronize_rcu(), a read-side critical section that started
* before the grace period might be incorrectly seen as having
* started after the grace period.
*
* This synchronize_rcu() also dispenses with the need for a
* memory barrier on the first store to t->rcu_tasks_holdout,
* as it forces the store to happen after the beginning of the
* grace period.
*/
And the rcu_tasks_postgp() function contains this comment:
/*
* Because ->on_rq and ->nvcsw are not guaranteed to have full
* memory barriers prior to them in the schedule() path, memory
* reordering on other CPUs could cause their RCU-tasks read-side
* critical sections to extend past the end of the grace period.
* However, because these ->nvcsw updates are carried out with
* interrupts disabled, we can use synchronize_rcu() to force the
* needed ordering on all such CPUs.
*
* This synchronize_rcu() also confines all ->rcu_tasks_holdout
* accesses to be within the grace period, avoiding the need for
* memory barriers for ->rcu_tasks_holdout accesses.
*
* In addition, this synchronize_rcu() waits for exiting tasks
* to complete their final preempt_disable() region of execution,
* cleaning up after synchronize_srcu(&tasks_rcu_exit_srcu),
* enforcing the whole region before tasklist removal until
* the final schedule() with TASK_DEAD state to be an RCU TASKS
* read side critical section.
*/
Does that suffice, or should we add more?
Thanx, Paul
* Re: [PATCH 2/4] rcu/tasks: Handle new PF_IDLE semantics
2023-10-27 23:41 ` Paul E. McKenney
@ 2023-10-30 8:21 ` Peter Zijlstra
2023-10-30 20:11 ` Paul E. McKenney
0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2023-10-30 8:21 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Frederic Weisbecker, LKML, Boqun Feng, Joel Fernandes,
Josh Triplett, Mathieu Desnoyers, Neeraj Upadhyay, Steven Rostedt,
Uladzislau Rezki, rcu, Zqiang, Liam R . Howlett
On Fri, Oct 27, 2023 at 04:41:30PM -0700, Paul E. McKenney wrote:
> On Sat, Oct 28, 2023 at 12:46:28AM +0200, Peter Zijlstra wrote:
> > Nah, this is more or less what I feared. I just worry people will come
> > around and put WRITE_ONCE() on the other end. I don't think that'll buy
> > us much. Nor do I think the current READ_ONCE()s actually matter.
>
> My friend, you trust compilers more than I ever will. ;-)
Well, we only use the values {0,1,2}, that's contained in the first
byte. Are we saying compiler will not only byte-split but also
bit-split the loads?
But again, lacking the WRITE_ONCE() counterpart, this READ_ONCE() isn't
getting you anything, and if you were really worried about it, shouldn't you
have proposed a patch making it all WRITE_ONCE() back when you did this
tasks-rcu stuff?
> > But perhaps put a comment there, that we don't care for the races and
> > only need to observe a 0 once or something.
>
> There are these two passages in the big block comment preceding the
> RCU Tasks code:
> // rcu_tasks_pregp_step():
> // Invokes synchronize_rcu() in order to wait for all in-flight
> // t->on_rq and t->nvcsw transitions to complete. This works because
> // all such transitions are carried out with interrupts disabled.
> Does that suffice, or should we add more?
Probably sufficient. If one were to have used the search option :-)
Anyway, this brings me to nvcsw, exact same problem there, except
possibly worse, because now we actually do care about the full word.
No WRITE_ONCE() write side, so the READ_ONCE() doesn't help against
store-tearing (however unlikely that actually is in this case).
Also, I'm not entirely sure I see why you need on_rq and nvcsw. Would
not nvcsw increasing be enough to know it passed through a quiescent
state? Are you trying to say that if nvcsw hasn't advanced but on_rq is
still 0, nothing has changed and you can proceed?
Or rather, looking at the code it seems to use the inverse: if on_rq, nvcsw
must change.
Makes sense I suppose, no point waiting for nvcsw to change if the task
never did anything.
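As a sketch (hypothetical names, a userspace model rather than the kernel's holdout-scan machinery), the two-field rule reduces to:

```c
#include <stdbool.h>
#include <assert.h>

/*
 * Hypothetical model of the rule discussed above: a task stops being
 * a holdout once it is either observed off the runqueue (voluntarily
 * blocked) or its voluntary context-switch count has advanced past
 * the snapshot taken at grace-period start.
 */
struct holdout_model {
	bool on_rq;			/* models READ_ONCE(t->on_rq) */
	unsigned long nvcsw;		/* models t->nvcsw now */
	unsigned long nvcsw_snap;	/* snapshot at GP start */
};

static bool still_holdout(const struct holdout_model *t)
{
	if (!t->on_rq)
		return false;	/* blocked: quiescent state observed */
	if (t->nvcsw != t->nvcsw_snap)
		return false;	/* voluntarily scheduled at least once */
	return true;		/* runnable, never switched: keep waiting */
}
```

The on_rq test is what keeps long-sleeping tasks from holding the grace period until their next voluntary switch.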
* Re: [PATCH 2/4] rcu/tasks: Handle new PF_IDLE semantics
2023-10-30 8:21 ` Peter Zijlstra
@ 2023-10-30 20:11 ` Paul E. McKenney
2023-10-31 9:52 ` Peter Zijlstra
0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2023-10-30 20:11 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Frederic Weisbecker, LKML, Boqun Feng, Joel Fernandes,
Josh Triplett, Mathieu Desnoyers, Neeraj Upadhyay, Steven Rostedt,
Uladzislau Rezki, rcu, Zqiang, Liam R . Howlett
On Mon, Oct 30, 2023 at 09:21:38AM +0100, Peter Zijlstra wrote:
> On Fri, Oct 27, 2023 at 04:41:30PM -0700, Paul E. McKenney wrote:
> > On Sat, Oct 28, 2023 at 12:46:28AM +0200, Peter Zijlstra wrote:
>
> > > Nah, this is more or less what I feared. I just worry people will come
> > > around and put WRITE_ONCE() on the other end. I don't think that'll buy
> > > us much. Nor do I think the current READ_ONCE()s actually matter.
> >
> > My friend, you trust compilers more than I ever will. ;-)
>
> Well, we only use the values {0,1,2}, that's contained in the first
> byte. Are we saying compiler will not only byte-split but also
> bit-split the loads?
>
> But again, lacking the WRITE_ONCE() counterpart, this READ_ONCE() isn't
> getting you anything, and if you really worried about it, shouldn't you
> have proposed a patch making it all WRITE_ONCE() back when you did this
> tasks-rcu stuff?
There are not all that many of them. If such a WRITE_ONCE() patch would
be welcome, I would be happy to put it together.
> > > But perhaps put a comment there, that we don't care for the races and
> > > only need to observe a 0 once or something.
> >
> > There are these two passages in the big block comment preceding the
> > RCU Tasks code:
>
> > // rcu_tasks_pregp_step():
> > // Invokes synchronize_rcu() in order to wait for all in-flight
> > // t->on_rq and t->nvcsw transitions to complete. This works because
> > // all such transitions are carried out with interrupts disabled.
>
> > Does that suffice, or should we add more?
>
> Probably sufficient. If one were to have used the search option :-)
>
> Anyway, this brings me to nvcsw, exact same problem there, except
> possibly worse, because now we actually do care about the full word.
>
> No WRITE_ONCE() write side, so the READ_ONCE() doesn't help against
> store-tearing (however unlikely that actually is in this case).
Again, if such a WRITE_ONCE() patch would be welcome, I would be happy
to put it together.
> Also, I'm not entirely sure I see why you need on_rq and nvcsw. Would
> not nvcsw increasing be enough to know it passed through a quiescent
> state? Are you trying to say that if nvcsw hasn't advanced but on_rq is
> still 0, nothing has changed and you can proceed?
>
> Or rather, looking at the code it seems to use the inverse: if on_rq, nvcsw
> must change.
>
> Makes sense I suppose, no point waiting for nvcsw to change if the task
> never did anything.
Exactly, the on_rq check is needed to avoid excessively long grace
periods for tasks that are blocked for long periods of time.
Thanx, Paul
* Re: [PATCH 2/4] rcu/tasks: Handle new PF_IDLE semantics
2023-10-30 20:11 ` Paul E. McKenney
@ 2023-10-31 9:52 ` Peter Zijlstra
2023-10-31 14:16 ` Michael Matz
2023-10-31 14:24 ` Paul E. McKenney
0 siblings, 2 replies; 30+ messages in thread
From: Peter Zijlstra @ 2023-10-31 9:52 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Frederic Weisbecker, LKML, Boqun Feng, Joel Fernandes,
Josh Triplett, Mathieu Desnoyers, Neeraj Upadhyay, Steven Rostedt,
Uladzislau Rezki, rcu, Zqiang, Liam R . Howlett, matz, ubizjak
On Mon, Oct 30, 2023 at 01:11:41PM -0700, Paul E. McKenney wrote:
> On Mon, Oct 30, 2023 at 09:21:38AM +0100, Peter Zijlstra wrote:
> > On Fri, Oct 27, 2023 at 04:41:30PM -0700, Paul E. McKenney wrote:
> > > On Sat, Oct 28, 2023 at 12:46:28AM +0200, Peter Zijlstra wrote:
> >
> > > > Nah, this is more or less what I feared. I just worry people will come
> > > > around and put WRITE_ONCE() on the other end. I don't think that'll buy
> > > > us much. Nor do I think the current READ_ONCE()s actually matter.
> > >
> > > My friend, you trust compilers more than I ever will. ;-)
> >
> > Well, we only use the values {0,1,2}, that's contained in the first
> > byte. Are we saying compiler will not only byte-split but also
> > bit-split the loads?
> >
> > But again, lacking the WRITE_ONCE() counterpart, this READ_ONCE() isn't
> > getting you anything, and if you really worried about it, shouldn't you
> > have proposed a patch making it all WRITE_ONCE() back when you did this
> > tasks-rcu stuff?
>
> There are not all that many of them. If such a WRITE_ONCE() patch would
> be welcome, I would be happy to put it together.
>
> > > > But perhaps put a comment there, that we don't care for the races and
> > > > only need to observe a 0 once or something.
> > >
> > > There are these two passages in the big block comment preceding the
> > > RCU Tasks code:
> >
> > > // rcu_tasks_pregp_step():
> > > // Invokes synchronize_rcu() in order to wait for all in-flight
> > > // t->on_rq and t->nvcsw transitions to complete. This works because
> > > // all such transitions are carried out with interrupts disabled.
> >
> > > Does that suffice, or should we add more?
> >
> > Probably sufficient. If one were to have used the search option :-)
> >
> > Anyway, this brings me to nvcsw, exact same problem there, except
> > possibly worse, because now we actually do care about the full word.
> >
> > No WRITE_ONCE() write side, so the READ_ONCE() doesn't help against
> > store-tearing (however unlikely that actually is in this case).
>
> Again, if such a WRITE_ONCE() patch would be welcome, I would be happy
> to put it together.
Welcome is not the right word. What bugs me most is that this was never
raised when this code was written :/
Mostly my problem is that GCC generates such utter shite when you
mention volatile. See, the below patch changes the perfectly fine and
non-broken:
0148 1d8: 49 83 06 01 addq $0x1,(%r14)
into:
0148 1d8: 49 8b 06 mov (%r14),%rax
014b 1db: 48 83 c0 01 add $0x1,%rax
014f 1df: 49 89 06 mov %rax,(%r14)
For absolutely no reason :-(
At least clang doesn't do this, it stays:
0403 413: 49 ff 45 00 incq 0x0(%r13)
irrespective of the volatile.
---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 802551e0009b..d616211b9151 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6575,8 +6575,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
  */
 static void __sched notrace __schedule(unsigned int sched_mode)
 {
 	struct task_struct *prev, *next;
-	unsigned long *switch_count;
+	volatile unsigned long *switch_count;
 	unsigned long prev_state;
 	struct rq_flags rf;
 	struct rq *rq;
* Re: [PATCH 2/4] rcu/tasks: Handle new PF_IDLE semantics
2023-10-31 9:52 ` Peter Zijlstra
@ 2023-10-31 14:16 ` Michael Matz
2023-10-31 14:43 ` Paul E. McKenney
2023-10-31 15:16 ` Peter Zijlstra
2023-10-31 14:24 ` Paul E. McKenney
1 sibling, 2 replies; 30+ messages in thread
From: Michael Matz @ 2023-10-31 14:16 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Paul E. McKenney, Frederic Weisbecker, LKML, Boqun Feng,
Joel Fernandes, Josh Triplett, Mathieu Desnoyers, Neeraj Upadhyay,
Steven Rostedt, Uladzislau Rezki, rcu, Zqiang, Liam R . Howlett,
ubizjak
Hello,
On Tue, 31 Oct 2023, Peter Zijlstra wrote:
(I can't say anything about the WRITE_ONCE/rcu code, just about the below
codegen part)
> Welcome is not the right word. What bugs me most is that this was never
> raised when this code was written :/
>
> Mostly my problem is that GCC generates such utter shite when you
> mention volatile. See, the below patch changes the perfectly fine and
> non-broken:
>
> 0148 1d8: 49 83 06 01 addq $0x1,(%r14)
What is non-broken here that is ...
> into:
>
> 0148 1d8: 49 8b 06 mov (%r14),%rax
> 014b 1db: 48 83 c0 01 add $0x1,%rax
> 014f 1df: 49 89 06 mov %rax,(%r14)
... broken here? (Sure, code size and additional register use, but I don't
think that's what you mean by broken).
> For absolutely no reason :-(
The reason is simple (and should be obvious): to adhere to the abstract
machine regarding volatile. When x is volatile then x++ consists of a
read and a write, in this order. The easiest way to ensure this is to
actually generate a read and a write instruction. Anything else is an
optimization, and for each such optimization you need to actively find an
argument why this optimization is correct to start with (and then if it's
an optimization at all). In this case the argument needs to somehow
involve arguing that an rmw instruction on x86 is in fact completely
equivalent to the separate instructions, from read cycle to write cycle
over all pipeline stages, on all implementations of x86. I.e. that a rmw
instruction is spec'ed to be equivalent.
You most probably can make that argument in this specific case, I'll give
you that. But why bother to start with, in a piece of software that is
already fairly complex (the compiler)? It's much easier to just not do
much anything with volatile accesses at all and be guaranteed correct.
Even more so as the software author, when using volatile, most likely is
much more interested in correct code (even from a abstract machine
perspective) than micro optimizations.
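Spelled out, the abstract-machine requirement for a volatile increment is just the following (a hypothetical userspace illustration, not anything from the patch):

```c
#include <assert.h>

static volatile unsigned long x;

/* For a volatile object, x++ must perform exactly one read access
 * and then one write access, in that order; written out explicitly: */
static void volatile_increment(void)
{
	unsigned long tmp = x;	/* the mandated read */
	x = tmp + 1;		/* the mandated write */
}
```

Folding those two accesses into a single rmw instruction is the optimization whose correctness would have to be argued, per the above.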
> At least clang doesn't do this, it stays:
>
> 0403 413: 49 ff 45 00 incq 0x0(%r13)
>
> irrespective of the volatile.
And, are you 100% sure that this is correct? Even for x86 CPU
pipeline implementations that you don't know intimately? ;-)
But all that seems to be a side-track anyway, what's your real worry with
the code sequence generated by GCC?
Ciao,
Michael.
* Re: [PATCH 2/4] rcu/tasks: Handle new PF_IDLE semantics
2023-10-31 9:52 ` Peter Zijlstra
2023-10-31 14:16 ` Michael Matz
@ 2023-10-31 14:24 ` Paul E. McKenney
2023-10-31 15:20 ` Peter Zijlstra
1 sibling, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2023-10-31 14:24 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Frederic Weisbecker, LKML, Boqun Feng, Joel Fernandes,
Josh Triplett, Mathieu Desnoyers, Neeraj Upadhyay, Steven Rostedt,
Uladzislau Rezki, rcu, Zqiang, Liam R . Howlett, matz, ubizjak
On Tue, Oct 31, 2023 at 10:52:02AM +0100, Peter Zijlstra wrote:
> On Mon, Oct 30, 2023 at 01:11:41PM -0700, Paul E. McKenney wrote:
> > On Mon, Oct 30, 2023 at 09:21:38AM +0100, Peter Zijlstra wrote:
> > > On Fri, Oct 27, 2023 at 04:41:30PM -0700, Paul E. McKenney wrote:
> > > > On Sat, Oct 28, 2023 at 12:46:28AM +0200, Peter Zijlstra wrote:
> > >
> > > > > Nah, this is more or less what I feared. I just worry people will come
> > > > > around and put WRITE_ONCE() on the other end. I don't think that'll buy
> > > > > us much. Nor do I think the current READ_ONCE()s actually matter.
> > > >
> > > > My friend, you trust compilers more than I ever will. ;-)
> > >
> > > Well, we only use the values {0,1,2}, that's contained in the first
> > > byte. Are we saying compiler will not only byte-split but also
> > > bit-split the loads?
> > >
> > > But again, lacking the WRITE_ONCE() counterpart, this READ_ONCE() isn't
> > > getting you anything, and if you really worried about it, shouldn't you
> > > have proposed a patch making it all WRITE_ONCE() back when you did this
> > > tasks-rcu stuff?
> >
> > There are not all that many of them. If such a WRITE_ONCE() patch would
> > be welcome, I would be happy to put it together.
> >
> > > > > But perhaps put a comment there, that we don't care for the races and
> > > > > only need to observe a 0 once or something.
> > > >
> > > > There are these two passages in the big block comment preceding the
> > > > RCU Tasks code:
> > >
> > > > // rcu_tasks_pregp_step():
> > > > // Invokes synchronize_rcu() in order to wait for all in-flight
> > > > // t->on_rq and t->nvcsw transitions to complete. This works because
> > > > // all such transitions are carried out with interrupts disabled.
> > >
> > > > Does that suffice, or should we add more?
> > >
> > > Probably sufficient. If one were to have used the search option :-)
> > >
> > > Anyway, this brings me to nvcsw, exact same problem there, except
> > > possibly worse, because now we actually do care about the full word.
> > >
> > > No WRITE_ONCE() write side, so the READ_ONCE() doesn't help against
> > > store-tearing (however unlikely that actually is in this case).
> >
> > Again, if such a WRITE_ONCE() patch would be welcome, I would be happy
> > to put it together.
>
> Welcome is not the right word. What bugs me most is that this was never
> raised when this code was written :/
Me, I consider those READ_ONCE() calls to be documentation as well as
defense against overly enthusiastic optimizers. "This access is racy."
> Mostly my problem is that GCC generates such utter shite when you
> mention volatile. See, the below patch changes the perfectly fine and
> non-broken:
>
> 0148 1d8: 49 83 06 01 addq $0x1,(%r14)
>
> into:
>
> 0148 1d8: 49 8b 06 mov (%r14),%rax
> 014b 1db: 48 83 c0 01 add $0x1,%rax
> 014f 1df: 49 89 06 mov %rax,(%r14)
>
> For absolutely no reason :-(
>
> At least clang doesn't do this, it stays:
>
> 0403 413: 49 ff 45 00 incq 0x0(%r13)
>
> irrespective of the volatile.
Sounds like a bug in GCC, perhaps depending on the microarchitecture
in question. And it was in fact reported in the past, but closed as
not-a-bug. Perhaps clang's fix for this will help GCC along.
And yes, I do see that ++*switch_count in __schedule().
So, at least until GCC catches up to clang's code generation, I take it
that you don't want WRITE_ONCE() for that ->nvcsw increment. Thoughts on
->on_rq?
Thanx, Paul
> ---
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 802551e0009b..d616211b9151 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6575,8 +6575,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> */
> static void __sched notrace __schedule(unsigned int sched_mode)
> {
> struct task_struct *prev, *next;
> - unsigned long *switch_count;
> + volatile unsigned long *switch_count;
> unsigned long prev_state;
> struct rq_flags rf;
> struct rq *rq;
* Re: [PATCH 2/4] rcu/tasks: Handle new PF_IDLE semantics
2023-10-31 14:16 ` Michael Matz
@ 2023-10-31 14:43 ` Paul E. McKenney
2023-10-31 15:16 ` Peter Zijlstra
1 sibling, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2023-10-31 14:43 UTC (permalink / raw)
To: Michael Matz
Cc: Peter Zijlstra, Frederic Weisbecker, LKML, Boqun Feng,
Joel Fernandes, Josh Triplett, Mathieu Desnoyers, Neeraj Upadhyay,
Steven Rostedt, Uladzislau Rezki, rcu, Zqiang, Liam R . Howlett,
ubizjak
On Tue, Oct 31, 2023 at 02:16:34PM +0000, Michael Matz wrote:
> Hello,
>
> On Tue, 31 Oct 2023, Peter Zijlstra wrote:
>
> (I can't say anything about the WRITE_ONCE/rcu code, just about the below
> codegen part)
>
> > Welcome is not the right word. What bugs me most is that this was never
> > raised when this code was written :/
> >
> > Mostly my problem is that GCC generates such utter shite when you
> > mention volatile. See, the below patch changes the perfectly fine and
> > non-broken:
> >
> > 0148 1d8: 49 83 06 01 addq $0x1,(%r14)
>
> What is non-broken here that is ...
>
> > into:
> >
> > 0148 1d8: 49 8b 06 mov (%r14),%rax
> > 014b 1db: 48 83 c0 01 add $0x1,%rax
> > 014f 1df: 49 89 06 mov %rax,(%r14)
>
> ... broken here? (Sure, code size and additional register use, but I don't
> think that's what you mean by broken).
>
> > For absolutely no reason :-(
>
> The reason is simple (and should be obvious): to adhere to the abstract
> machine regarding volatile. When x is volatile then x++ consists of a
> read and a write, in this order. The easiest way to ensure this is to
> actually generate a read and a write instruction. Anything else is an
> optimization, and for each such optimization you need to actively find an
> argument why this optimization is correct to start with (and then if it's
> an optimization at all). In this case the argument needs to somehow
> involve arguing that an rmw instruction on x86 is in fact completely
> equivalent to the separate instructions, from read cycle to write cycle
> over all pipeline stages, on all implementations of x86. I.e. that a rmw
> instruction is spec'ed to be equivalent.
>
> You most probably can make that argument in this specific case, I'll give
> you that. But why bother to start with, in a piece of software that is
> already fairly complex (the compiler)? It's much easier to just not do
> much of anything with volatile accesses at all and be guaranteed correct.
> Even more so as the software author, when using volatile, most likely is
> much more interested in correct code (even from an abstract machine
> perspective) than micro optimizations.
I suspect that Peter cares because the code in question is on a fast
path within the scheduler.
> > At least clang doesn't do this, it stays:
> >
> > 0403 413: 49 ff 45 00 incq 0x0(%r13)
> >
> > irrespective of the volatile.
>
> And, are you 100% sure that this is correct? Even for x86 CPU
> pipeline implementations that you aren't intimately knowing about? ;-)
Me, I am not even sure that the load-increment-store is always faithfully
executed by the hardware. ;-)
> But all that seems to be a side-track anyway, what's your real worry with
> the code sequence generated by GCC?
Again, my guess is that Peter cares deeply about the speed of the fast
path on which this increment occurs.
Thanx, Paul
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 2/4] rcu/tasks: Handle new PF_IDLE semantics
2023-10-31 14:16 ` Michael Matz
2023-10-31 14:43 ` Paul E. McKenney
@ 2023-10-31 15:16 ` Peter Zijlstra
2023-10-31 15:55 ` Michael Matz
1 sibling, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2023-10-31 15:16 UTC (permalink / raw)
To: Michael Matz
Cc: Paul E. McKenney, Frederic Weisbecker, LKML, Boqun Feng,
Joel Fernandes, Josh Triplett, Mathieu Desnoyers, Neeraj Upadhyay,
Steven Rostedt, Uladzislau Rezki, rcu, Zqiang, Liam R . Howlett,
ubizjak
On Tue, Oct 31, 2023 at 02:16:34PM +0000, Michael Matz wrote:
> > Mostly my problem is that GCC generates such utter shite when you
> > mention volatile. See, the below patch changes the perfectly fine and
> > non-broken:
> >
> > 0148 1d8: 49 83 06 01 addq $0x1,(%r14)
>
> What is non-broken here that is ...
>
> > into:
> >
> > 0148 1d8: 49 8b 06 mov (%r14),%rax
> > 014b 1db: 48 83 c0 01 add $0x1,%rax
> > 014f 1df: 49 89 06 mov %rax,(%r14)
>
> ... broken here? (Sure code size and additional register use, but I don't
> think you mean this with broken).
The point was that the code was perfectly fine without adding the
volatile, and adding volatile makes it worse.
> > For absolutely no reason :-(
>
> The reason is simple (and should be obvious): to adhere to the abstract
> machine regarding volatile. When x is volatile then x++ consists of a
> read and a write, in this order. The easiest way to ensure this is to
> actually generate a read and a write instruction. Anything else is an
> optimization, and for each such optimization you need to actively find an
> argument why this optimization is correct to start with (and then if it's
> an optimization at all). In this case the argument needs to somehow
> involve arguing that an rmw instruction on x86 is in fact completely
> equivalent to the separate instructions, from read cycle to write cycle
> over all pipeline stages, on all implementations of x86. I.e. that a rmw
> instruction is spec'ed to be equivalent.
>
> You most probably can make that argument in this specific case, I'll give
> you that. But why bother to start with, in a piece of software that is
> already fairly complex (the compiler)? It's much easier to just not do
> much of anything with volatile accesses at all and be guaranteed correct.
> Even more so as the software author, when using volatile, most likely is
> much more interested in correct code (even from an abstract machine
> perspective) than micro optimizations.
There's a pile of situations where a RmW instruction is actively
different vs a load-store split, esp for volatile variables that are
explicitly expected to change asynchronously.
The original RmW instruction is IRQ-safe, while the load-store version
is not. If an interrupt lands in between the load and store and also
modifies the variable then the store after interrupt-return will
over-write said modification.
These are not equivalent.
In this case that's not relevant because the increment happens to happen
with IRQs disabled. But the point is that these forms are very much not
equivalent.
> > At least clang doesn't do this, it stays:
> >
> > 0403 413: 49 ff 45 00 incq 0x0(%r13)
> >
> > irrespective of the volatile.
>
> And, are you 100% sure that this is correct? Even for x86 CPU
> pipeline implementations that you aren't intimately knowing about? ;-)
It so happens that the x86 architecture does guarantee RmW ops are
IRQ-safe or locally atomic. SMP/concurrent loads will observe either
pre or post but no intermediate state as well.
> But all that seems to be a side-track anyway, what's your real worry with
> the code sequence generated by GCC?
In this case it's sub-optimal code, both larger and possibly slower for
having two memops.
The reason to have volatile is because that's what Linux uses to
dis-allow store-tearing, something that doesn't happen in this case. A
suitably insane but conforming compiler could compile a non-volatile
memory increment into something insane like:
load byte-0, r1
increment r1
store r1, byte-0
jno done
load byte-1, r1
increment r1
store r1, byte-1
jno done
...
done:
We want to explicitly dis-allow this.
I know C has recently (2011) grown this _Atomic thing, but that has
other problems.
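For reference, the volatile machinery Linux builds on can be sketched roughly as follows; these are simplified stand-ins for the kernel's real READ_ONCE()/WRITE_ONCE() macros (which additionally carry size checks and other plumbing in asm-generic/rwonce.h):

```c
/* Simplified sketches of the kernel's once-accessors; the real macros
 * add compile-time size checks and handling for non-scalar types. */
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))

static unsigned long nvcsw;

static void count_ctxt_switch(void)
{
	/* The volatile-qualified accesses force exactly one full-width
	 * load and one full-width store: the compiler may neither tear
	 * them into byte-sized pieces nor elide them.  Whether it may
	 * fuse the pair back into a single addq is exactly the codegen
	 * question debated above. */
	WRITE_ONCE(nvcsw, READ_ONCE(nvcsw) + 1);
}
```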
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 2/4] rcu/tasks: Handle new PF_IDLE semantics
2023-10-31 14:24 ` Paul E. McKenney
@ 2023-10-31 15:20 ` Peter Zijlstra
2023-10-31 18:12 ` Paul E. McKenney
0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2023-10-31 15:20 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Frederic Weisbecker, LKML, Boqun Feng, Joel Fernandes,
Josh Triplett, Mathieu Desnoyers, Neeraj Upadhyay, Steven Rostedt,
Uladzislau Rezki, rcu, Zqiang, Liam R . Howlett, matz, ubizjak
On Tue, Oct 31, 2023 at 07:24:13AM -0700, Paul E. McKenney wrote:
> So, at least until GCC catches up to clang's code generation, I take it
> that you don't want WRITE_ONCE() for that ->nvcsw increment. Thoughts on
> ->on_rq?
I've not done the patch yet, but I suspect those would be fine, those
are straight up stores, hard to get wrong (famous last words).
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 2/4] rcu/tasks: Handle new PF_IDLE semantics
2023-10-31 15:16 ` Peter Zijlstra
@ 2023-10-31 15:55 ` Michael Matz
2023-10-31 16:23 ` Peter Zijlstra
0 siblings, 1 reply; 30+ messages in thread
From: Michael Matz @ 2023-10-31 15:55 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Paul E. McKenney, Frederic Weisbecker, LKML, Boqun Feng,
Joel Fernandes, Josh Triplett, Mathieu Desnoyers, Neeraj Upadhyay,
Steven Rostedt, Uladzislau Rezki, rcu, Zqiang, Liam R . Howlett,
ubizjak
Hello,
On Tue, 31 Oct 2023, Peter Zijlstra wrote:
> > > For absolutely no reason :-(
> >
> > The reason is simple (and should be obvious): to adhere to the abstract
> > machine regarding volatile. When x is volatile then x++ consists of a
> > read and a write, in this order. The easiest way to ensure this is to
> > actually generate a read and a write instruction. Anything else is an
> > optimization, and for each such optimization you need to actively find an
> > argument why this optimization is correct to start with (and then if it's
> > an optimization at all). In this case the argument needs to somehow
> > involve arguing that an rmw instruction on x86 is in fact completely
> > equivalent to the separate instructions, from read cycle to write cycle
> > over all pipeline stages, on all implementations of x86. I.e. that a rmw
> > instruction is spec'ed to be equivalent.
> >
> > You most probably can make that argument in this specific case, I'll give
> > you that. But why bother to start with, in a piece of software that is
> > already fairly complex (the compiler)? It's much easier to just not do
> > much of anything with volatile accesses at all and be guaranteed correct.
> > Even more so as the software author, when using volatile, most likely is
> > much more interested in correct code (even from an abstract machine
> > perspective) than micro optimizations.
>
> There's a pile of situations where a RmW instruction is actively
> different vs a load-store split, esp for volatile variables that are
> explicitly expected to change asynchronously.
>
> The original RmW instruction is IRQ-safe, while the load-store version
> is not. If an interrupt lands in between the load and store and also
> modifies the variable then the store after interrupt-return will
> over-write said modification.
>
> These are not equivalent.
Okay, then there you have it. Namely that LLVM has a bug (but see next
paragraph). For volatile x, x++ _must_ expand to a separate read and
write, because the abstract machine of C says so. If a RmW isn't
equivalent to that, then it can't be used in this situation. If you
_have_ to use a RmW for other reasons like interrupt safety, then a
volatile variable is not the way to force this, as C simply doesn't have
that concept and hence can't talk about it. (Of course it can't, as not
all architectures could implement such, if it were required).
(If an RmW merely gives you more guarantees than a split load-store then
of course LLVM doesn't have a bug, but you said not-equivalent, so I'm
assuming the worst, that RmW also has fewer (other) guarantees)
> > > At least clang doesn't do this, it stays:
> > >
> > > 0403 413: 49 ff 45 00 incq 0x0(%r13)
> > >
> > > irrespective of the volatile.
> >
> > And, are you 100% sure that this is correct? Even for x86 CPU
> > pipeline implementations that you aren't intimately knowing about? ;-)
>
> It so happens that the x86 architecture does guarantee RmW ops are
> IRQ-safe or locally atomic. SMP/concurrent loads will observe either
> pre or post but no intermediate state as well.
So, are RMW ops a strict superset (vis the guarantees they give) of split
load-store? If so we can at least say that using RMW is a valid
optimization :) Still, an optimization only.
> > But all that seems to be a side-track anyway, what's your real worry with
> > the code sequence generated by GCC?
>
> In this case it's sub-optimal code, both larger and possibly slower for
> having two memops.
>
> The reason to have volatile is because that's what Linux uses to
> dis-allow store-tearing, something that doesn't happen in this case. A
> suitably insane but conforming compiler could compile a non-volatile
> memory increment into something insane like:
>
> load byte-0, r1
> increment r1
> store r1, byte-0
> jno done
> load byte-1, r1
> increment r1
> store r1, byte-1
> jno done
> ...
> done:
>
> We want to explicitly dis-allow this.
Yeah, I see. Within C you don't have much choice than volatile for this
:-/ Funny thing: on some architectures this is actually what is generated
sometimes, even if it has multi-byte loads/stores. This came up
recently on the gcc list and the byte-per-byte sequence was faster ;-)
(it was rather: load-by-bytes, form whole value via shifts, increment,
store-by-bytes)
Insane codegen for insane micro-architectures!
> I know C has recently (2011) grown this _Atomic thing, but that has
> other problems.
Yeah.
So, hmm, I don't quite know what to say, you're between a rock and a hard
place, I guess. You have to use volatile for its effects but then are
unhappy about its effects :)
If you can confirm the above about validity of the optimization, then at
least there'd be a point for adding a peephole in GCC for this, even if
current codegen isn't a bug, but I still wouldn't hold my breath.
volatile is so ... ewww, it's best left alone.
Ciao,
Michael.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 2/4] rcu/tasks: Handle new PF_IDLE semantics
2023-10-31 15:55 ` Michael Matz
@ 2023-10-31 16:23 ` Peter Zijlstra
2023-10-31 16:49 ` Michael Matz
0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2023-10-31 16:23 UTC (permalink / raw)
To: Michael Matz
Cc: Paul E. McKenney, Frederic Weisbecker, LKML, Boqun Feng,
Joel Fernandes, Josh Triplett, Mathieu Desnoyers, Neeraj Upadhyay,
Steven Rostedt, Uladzislau Rezki, rcu, Zqiang, Liam R . Howlett,
ubizjak
On Tue, Oct 31, 2023 at 03:55:20PM +0000, Michael Matz wrote:
> > There's a pile of situations where a RmW instruction is actively
> > different vs a load-store split, esp for volatile variables that are
> > explicitly expected to change asynchronously.
> >
> > The original RmW instruction is IRQ-safe, while the load-store version
> > is not. If an interrupt lands in between the load and store and also
> > modifies the variable then the store after interrupt-return will
> > over-write said modification.
> >
> > These are not equivalent.
>
> Okay, then there you have it. Namely that LLVM has a bug (but see next
> paragraph). For volatile x, x++ _must_ expand to a separate read and
> write, because the abstract machine of C says so. If a RmW isn't
> equivalent to that, then it can't be used in this situation. If you
> _have_ to use a RmW for other reasons like interrupt safety, then a
> volatile variable is not the way to force this, as C simply doesn't have
> that concept and hence can't talk about it. (Of course it can't, as not
> all architectures could implement such, if it were required).
Yeah, RISC archs typically lack the RmW ops. I can understand C not
mandating their use. However, on architectures that do have them, using
them makes a ton of sense.
For us living in the real world, this C abstract machine is mostly a
pain in the arse :-)
> (If an RmW merely gives you more guarantees than a split load-store then
> of course LLVM doesn't have a bug, but you said not-equivalent, so I'm
> assuming the worst, that RmW also has fewer (other) guarantees)
RmW is a strict superset of load-store, and as such not equivalent :-)
Specifically, using volatile degrades the guarantees -- which is counter
intuitive.
> So, are RMW ops a strict superset (vis the guarantees they give) of split
> load-store? If so we can at least say that using RMW is a valid
> optimization :) Still, an optimization only.
This.
> > > But all that seems to be a side-track anyway, what's your real worry with
> > > the code sequence generated by GCC?
> >
> > In this case it's sub-optimal code, both larger and possibly slower for
> > having two memops.
> >
> > The reason to have volatile is because that's what Linux uses to
> > dis-allow store-tearing, something that doesn't happen in this case. A
> > suitably insane but conforming compiler could compile a non-volatile
> > memory increment into something insane like:
> >
> > load byte-0, r1
> > increment r1
> > store r1, byte-0
> > jno done
> > load byte-1, r1
> > increment r1
> > store r1, byte-1
> > jno done
> > ...
> > done:
> >
> > We want to explicitly dis-allow this.
>
> Yeah, I see. Within C you don't have much choice than volatile for this
> :-/ Funny thing: on some architectures this is actually what is generated
> sometimes, even if it has multi-byte loads/stores. This came up
> recently on the gcc list and the byte-per-byte sequence was faster ;-)
> (it was rather: load-by-bytes, form whole value via shifts, increment,
> store-by-bytes)
> Insane codegen for insane micro-architectures!
*groan*
> > I know C has recently (2011) grown this _Atomic thing, but that has
> > other problems.
>
> Yeah.
>
> So, hmm, I don't quite know what to say, you're between a rock and a hard
> place, I guess. You have to use volatile for its effects but then are
> unhappy about its effects :)
Notably, Linux uses a *ton* of volatile and there has historically been
a lot of grumbling about the GCC stance of 'stupid' codegen the moment
it sees volatile.
It really would help us (the Linux community) if GCC were to be less
offended by the whole volatile thing and would try to generate better
code.
Paul has been on the C/C++ committee meetings and keeps telling me them
folks hate volatile with a passion up to the point of proposing to
remove it from the language or somesuch. But the reality is that Linux
very heavily relies on it and _Atomic simply cannot replace it.
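One of those problems is visible directly in the codegen: even with the weakest ordering C11 offers, an _Atomic increment must be atomic against concurrent writers on other CPUs, so on x86 compilers emit a LOCK-prefixed RmW, which is far more expensive than the plain addq the scheduler code gets away with because its increment runs with IRQs disabled (a hedged sketch):

```c
#include <stdatomic.h>

static _Atomic unsigned long acnt;

/* Even memory_order_relaxed must be atomic against other threads, so
 * x86 compilers emit "lock add"/"lock xadd" here -- a full locked RmW,
 * considerably costlier than a plain addq $0x1,(mem).  The kernel only
 * needs local-IRQ atomicity for this counter, which the plain RmW
 * already provides. */
static void atomic_count(void)
{
	atomic_fetch_add_explicit(&acnt, 1, memory_order_relaxed);
}
```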
> If you can confirm the above about validity of the optimization, then at
> least there'd be a point for adding a peephole in GCC for this, even if
> current codegen isn't a bug, but I still wouldn't hold my breath.
> volatile is so ... ewww, it's best left alone.
Confirmed, and please, your SMP computer only works because of volatile,
it *is* important.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 2/4] rcu/tasks: Handle new PF_IDLE semantics
2023-10-31 16:23 ` Peter Zijlstra
@ 2023-10-31 16:49 ` Michael Matz
2023-10-31 17:10 ` Paul E. McKenney
0 siblings, 1 reply; 30+ messages in thread
From: Michael Matz @ 2023-10-31 16:49 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Paul E. McKenney, Frederic Weisbecker, LKML, Boqun Feng,
Joel Fernandes, Josh Triplett, Mathieu Desnoyers, Neeraj Upadhyay,
Steven Rostedt, Uladzislau Rezki, rcu, Zqiang, Liam R . Howlett,
ubizjak
Hey,
On Tue, 31 Oct 2023, Peter Zijlstra wrote:
> > equivalent to that, then it can't be used in this situation. If you
> > _have_ to use a RmW for other reasons like interrupt safety, then a
> > volatile variable is not the way to force this, as C simply doesn't have
> > that concept and hence can't talk about it. (Of course it can't, as not
> > all architectures could implement such, if it were required).
>
> Yeah, RISC archs typically lack the RmW ops. I can understand C not
> mandating their use. However, on architectures that do have them, using
> them makes a ton of sense.
>
> For us living in the real world, this C abstract machine is mostly a
> pain in the arse :-)
Believe me, without it you would live in a world where only languages like
ECMA script or Rust existed, without any reliable spec at all ("it's
obvious, the language should behave like this-and-that compiler from 2010
implements it! Or was it 2012?"). Even if it sometimes would make life
easier without (also for compilers!), at least you _have_ an arse to feel
pain in! :-) Ahem.
> > So, hmm, I don't quite know what to say, you're between a rock and a hard
> > place, I guess. You have to use volatile for its effects but then are
> > unhappy about its effects :)
>
> Notably, Linux uses a *ton* of volatile and there has historically been
> a lot of grumbling about the GCC stance of 'stupid' codegen the moment
> it sees volatile.
>
> It really would help us (the Linux community) if GCC were to be less
> offended by the whole volatile thing and would try to generate better
> code.
>
> Paul has been on the C/C++ committee meetings and keeps telling me them
> folks hate volatile with a passion up to the point of proposing to
> remove it from the language or somesuch. But the reality is that Linux
> very heavily relies on it and _Atomic simply cannot replace it.
Oh yeah, I agree. People trying to get rid of volatile are misguided.
Of course one can try to capture all the individual aspects of it, and
make individual language constructs for them (_Atomic is one). It makes
arguing about and precisely specifying the aspects much easier. But it
also makes the feature-interoperability matrix explode and the language
more complicated in areas that were already arcane to start with.
But it's precisely _because_ of the large-scale feature set of volatile
and the compilers wish to cater for the real world, that it's mostly left
alone, as is, as written by the author. Sure, one can wish for better
codegen related to volatile. But it's a slippery slope: "here I have
volatile, because I don't want optimizations to happen." - "but here I
want a little optimization to happen" - "but not these" - ... It's ... not
so easy :)
In this specific case I think we have now sufficiently argued that
"load-modify-store --> rmw" on x86 even for volatile accesses is a correct
transformation (and one that has sufficiently local effects that our heads
don't explode while thinking about all consequences). You'd have to do
that for each and every transformation where volatile stuff is involved,
just so to not throw out the baby with the water.
> > If you can confirm the above about validity of the optimization, then at
> > least there'd be a point for adding a peephole in GCC for this, even if
> > current codegen isn't a bug, but I still wouldn't hold my breath.
> > volatile is so ... ewww, it's best left alone.
>
> Confirmed, and please, your SMP computer only works because of volatile,
> it *is* important.
Agreed.
Ciao,
Michael.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 2/4] rcu/tasks: Handle new PF_IDLE semantics
2023-10-31 16:49 ` Michael Matz
@ 2023-10-31 17:10 ` Paul E. McKenney
0 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2023-10-31 17:10 UTC (permalink / raw)
To: Michael Matz
Cc: Peter Zijlstra, Frederic Weisbecker, LKML, Boqun Feng,
Joel Fernandes, Josh Triplett, Mathieu Desnoyers, Neeraj Upadhyay,
Steven Rostedt, Uladzislau Rezki, rcu, Zqiang, Liam R . Howlett,
ubizjak
On Tue, Oct 31, 2023 at 04:49:56PM +0000, Michael Matz wrote:
> Hey,
>
> On Tue, 31 Oct 2023, Peter Zijlstra wrote:
>
> > > equivalent to that, then it can't be used in this situation. If you
> > > _have_ to use a RmW for other reasons like interrupt safety, then a
> > > volatile variable is not the way to force this, as C simply doesn't have
> > > that concept and hence can't talk about it. (Of course it can't, as not
> > > all architectures could implement such, if it were required).
> >
> > Yeah, RISC archs typically lack the RmW ops. I can understand C not
> > mandating their use. However, on architectures that do have them, using
> > them makes a ton of sense.
> >
> > For us living in the real world, this C abstract machine is mostly a
> > pain in the arse :-)
>
> Believe me, without it you would live in a world where only languages like
> ECMA script or Rust existed, without any reliable spec at all ("it's
> obvious, the language should behave like this-and-that compiler from 2010
> implements it! Or was it 2012?"). Even if it sometimes would make life
> easier without (also for compilers!), at least you _have_ an arse to feel
> pain in! :-) Ahem.
You mean like Rust volatiles considering conflicting accesses to be
data races? That certainly leads me to wonder how a Rust-language device
driver is supposed to interoperate with Rust-language device firmware.
They currently propose atomics and things like the barrier() asm to make
that work, and their definition of atomic might just allow it.
> > > So, hmm, I don't quite know what to say, you're between a rock and a hard
> > > place, I guess. You have to use volatile for its effects but then are
> > > unhappy about its effects :)
> >
> > Notably, Linux uses a *ton* of volatile and there has historically been
> > a lot of grumbling about the GCC stance of 'stupid' codegen the moment
> > it sees volatile.
> >
> > It really would help us (the Linux community) if GCC were to be less
> > offended by the whole volatile thing and would try to generate better
> > code.
> >
> > Paul has been on the C/C++ committee meetings and keeps telling me them
> > folks hate volatile with a passion up to the point of proposing to
> > remove it from the language or somesuch. But the reality is that Linux
> > very heavily relies on it and _Atomic simply cannot replace it.
>
> Oh yeah, I agree. People trying to get rid of volatile are misguided.
> Of course one can try to capture all the individual aspects of it, and
> make individual language constructs for them (_Atomic is one). It makes
> arguing about and precisely specifying the aspects much easier. But it
> also makes the feature-interoperability matrix explode and the language
> more complicated in areas that were already arcane to start with.
Agreed, and I have personally witnessed some primal-scream therapy
undertaken in response to attempts to better define volatile.
> But it's precisely _because_ of the large-scale feature set of volatile
> and the compilers wish to cater for the real world, that it's mostly left
> alone, as is, as written by the author. Sure, one can wish for better
> codegen related to volatile. But it's a slippery slope: "here I have
> volatile, because I don't want optimizations to happen." - "but here I
> want a little optimization to happen" - "but not these" - ... It's ... not
> so easy :)
And to your point, there really have been optimization bugs that have
broken volatile. So I do very much appreciate your careful attention
to this matter.
> In this specific case I think we have now sufficiently argued that
> "load-modify-store --> rmw" on x86 even for volatile accesses is a correct
> transformation (and one that has sufficiently local effects that our heads
> don't explode while thinking about all consequences). You'd have to do
> that for each and every transformation where volatile stuff is involved,
> just so to not throw out the baby with the water.
Understood!
> > > If you can confirm the above about validity of the optimization, then at
> > > least there'd be a point for adding a peephole in GCC for this, even if
> > > current codegen isn't a bug, but I still wouldn't hold my breath.
> > > volatile is so ... ewww, it's best left alone.
> >
> > Confirmed, and please, your SMP computer only works becuase of volatile,
> > it *is* important.
>
> Agreed.
Good to hear!!!
Thanx, Paul
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 2/4] rcu/tasks: Handle new PF_IDLE semantics
2023-10-31 15:20 ` Peter Zijlstra
@ 2023-10-31 18:12 ` Paul E. McKenney
0 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2023-10-31 18:12 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Frederic Weisbecker, LKML, Boqun Feng, Joel Fernandes,
Josh Triplett, Mathieu Desnoyers, Neeraj Upadhyay, Steven Rostedt,
Uladzislau Rezki, rcu, Zqiang, Liam R . Howlett, matz, ubizjak
On Tue, Oct 31, 2023 at 04:20:33PM +0100, Peter Zijlstra wrote:
> On Tue, Oct 31, 2023 at 07:24:13AM -0700, Paul E. McKenney wrote:
>
> > So, at least until GCC catches up to clang's code generation, I take it
> > that you don't want WRITE_ONCE() for that ->nvcsw increment. Thoughts on
> > ->on_rq?
>
> I've not done the patch yet, but I suspect those would be fine, those
> are straight up stores, hard to get wrong (famous last words).
Assuming that the reads are already either marked with READ_ONCE() or
are under appropriate locks, my immediate thought would be something
like the all-too-lightly tested patch shown below.
The ASSERT_EXCLUSIVE_WRITER() causes KCSAN to complain if there is a
concurrent store of any kind to the location.
Of course, please feel free to ignore. Thoughts?
Thanx, Paul
------------------------------------------------------------------------
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 81885748871d..aeace19ad7f5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2124,12 +2124,14 @@ void activate_task(struct rq *rq, struct task_struct *p, int flags)
enqueue_task(rq, p, flags);
- p->on_rq = TASK_ON_RQ_QUEUED;
+ WRITE_ONCE(p->on_rq, TASK_ON_RQ_QUEUED);
+ ASSERT_EXCLUSIVE_WRITER(p->on_rq);
}
void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
{
- p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING;
+ WRITE_ONCE(p->on_rq, (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING);
+ ASSERT_EXCLUSIVE_WRITER(p->on_rq);
dequeue_task(rq, p, flags);
}
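The writer-side WRITE_ONCE() in this patch would pair with lockless readers along the lines of the kernel's task_on_rq_queued() helper; a sketch, with READ_ONCE() reduced to its volatile-cast core:

```c
#include <stdbool.h>

#define TASK_ON_RQ_QUEUED	1
#define TASK_ON_RQ_MIGRATING	2

/* Simplified stand-in for the kernel's READ_ONCE(). */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

struct task_struct {
	int on_rq;
};

/* A lockless reader pairs with the patch's WRITE_ONCE(): READ_ONCE()
 * keeps the compiler from refetching or tearing the load, while the
 * writer-side ASSERT_EXCLUSIVE_WRITER() lets KCSAN check that no
 * second writer races with the marked store. */
static bool task_on_rq_queued(struct task_struct *p)
{
	return READ_ONCE(p->on_rq) == TASK_ON_RQ_QUEUED;
}
```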
^ permalink raw reply related [flat|nested] 30+ messages in thread
end of thread, other threads:[~2023-10-31 18:15 UTC | newest]
Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-10-24 21:46 [PATCH 0/4] rcu: Fix PF_IDLE related issues v2 Frederic Weisbecker
2023-10-24 21:46 ` [PATCH 1/4] rcu: Introduce rcu_cpu_online() Frederic Weisbecker
2023-10-25 2:29 ` Z qiang
2023-10-25 10:32 ` Frederic Weisbecker
2023-10-24 21:46 ` [PATCH 2/4] rcu/tasks: Handle new PF_IDLE semantics Frederic Weisbecker
2023-10-25 8:40 ` Peter Zijlstra
2023-10-25 10:31 ` Frederic Weisbecker
2023-10-26 12:15 ` Z qiang
2023-10-26 14:34 ` Frederic Weisbecker
2023-10-24 21:46 ` [PATCH 3/4] rcu/tasks-trace: " Frederic Weisbecker
2023-10-24 21:46 ` [PATCH 4/4] sched: Exclude CPU boot code from PF_IDLE area Frederic Weisbecker
2023-10-25 8:48 ` Peter Zijlstra
2023-10-25 11:25 ` Frederic Weisbecker
[not found] <20231027144050.110601-1-frederic@kernel.org>
[not found] ` <20231027144050.110601-3-frederic@kernel.org>
2023-10-27 19:20 ` [PATCH 2/4] rcu/tasks: Handle new PF_IDLE semantics Peter Zijlstra
2023-10-27 21:23 ` Paul E. McKenney
2023-10-27 22:46 ` Peter Zijlstra
2023-10-27 23:41 ` Paul E. McKenney
2023-10-30 8:21 ` Peter Zijlstra
2023-10-30 20:11 ` Paul E. McKenney
2023-10-31 9:52 ` Peter Zijlstra
2023-10-31 14:16 ` Michael Matz
2023-10-31 14:43 ` Paul E. McKenney
2023-10-31 15:16 ` Peter Zijlstra
2023-10-31 15:55 ` Michael Matz
2023-10-31 16:23 ` Peter Zijlstra
2023-10-31 16:49 ` Michael Matz
2023-10-31 17:10 ` Paul E. McKenney
2023-10-31 14:24 ` Paul E. McKenney
2023-10-31 15:20 ` Peter Zijlstra
2023-10-31 18:12 ` Paul E. McKenney
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox