* [PATCH v3 0/2] cpuhp: Improve SMT switch time via lock batching and RCU expedition
@ 2026-02-18 8:39 Vishal Chourasia
2026-02-18 8:39 ` [PATCH v3 1/2] cpuhp: Optimize SMT switch operation by batching lock acquisition Vishal Chourasia
2026-02-18 8:39 ` [PATCH v3 2/2] cpuhp: Expedite RCU grace periods during SMT operations Vishal Chourasia
0 siblings, 2 replies; 11+ messages in thread
From: Vishal Chourasia @ 2026-02-18 8:39 UTC (permalink / raw)
To: peterz, aboorvad
Cc: boqun.feng, frederic, joelagnelf, josh, linux-kernel,
neeraj.upadhyay, paulmck, rcu, rostedt, srikar, sshegde, tglx,
urezki, samir, vishalc
On large systems with high core counts, toggling SMT modes via sysfs
(/sys/devices/system/cpu/smt/control) incurs significant latency. For
instance, on ~2000 CPUs, switching SMT modes can take close to an hour
as the system hotplugs each hardware thread individually. This series
reduces this time to minutes.
Analysis of the hotplug path [1] identifies synchronize_rcu() as the
primary bottleneck. During a bulk SMT switch, the kernel repeatedly
enters RCU grace periods for each CPU being brought online or offline.
This series optimizes the SMT transition in two ways:
1. Lock Batching [1]: Instead of repeatedly acquiring and releasing the CPU
hotplug write lock for every individual CPU, we now hold cpus_write_lock
across the entire SMT toggle operation.
2. Expedited RCU grace periods [2]: Use rcu_expedite_gp() to force
expedited grace periods specifically for the duration of the SMT switch.
The trade-off is justified here to prevent the administrative task of
SMT switching from stalling for an unacceptable duration on large
systems.
Changes since v1
Link: https://lore.kernel.org/all/20260112094332.66006-2-vishalc@linux.ibm.com/
Expedite system-wide synchronize_rcu() only when SMT switch operations are
triggered via the /sys/devices/system/cpu/smt/control interface.
Changes since v2
Link: https://lore.kernel.org/all/20260216121927.489062-2-vishalc@linux.ibm.com/
Move the declarations of rcu_[un]expedite_gp() to
include/linux/rcupdate.h. Thanks to Shrikanth for sharing the fix and to
the kernel test robot for finding the issue. [3]
[1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com
[2] https://lore.kernel.org/all/20260113090153.GS830755@noisy.programming.kicks-ass.net/
[3] https://lore.kernel.org/all/202602170049.WQD7Wcuj-lkp@intel.com/
Vishal Chourasia (2):
cpuhp: Optimize SMT switch operation by batching lock acquisition
cpuhp: Expedite RCU grace periods during SMT operations
include/linux/rcupdate.h | 8 +++++
kernel/cpu.c | 76 +++++++++++++++++++++++++++++-----------
kernel/rcu/rcu.h | 4 ---
3 files changed, 64 insertions(+), 24 deletions(-)
--
2.53.0
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH v3 1/2] cpuhp: Optimize SMT switch operation by batching lock acquisition
2026-02-18 8:39 [PATCH v3 0/2] cpuhp: Improve SMT switch time via lock batching and RCU expedition Vishal Chourasia
@ 2026-02-18 8:39 ` Vishal Chourasia
2026-03-25 19:09 ` Thomas Gleixner
2026-02-18 8:39 ` [PATCH v3 2/2] cpuhp: Expedite RCU grace periods during SMT operations Vishal Chourasia
1 sibling, 1 reply; 11+ messages in thread
From: Vishal Chourasia @ 2026-02-18 8:39 UTC (permalink / raw)
To: peterz, aboorvad
Cc: boqun.feng, frederic, joelagnelf, josh, linux-kernel,
neeraj.upadhyay, paulmck, rcu, rostedt, srikar, sshegde, tglx,
urezki, samir, vishalc
From: Joel Fernandes <joelagnelf@nvidia.com>
Bulk CPU hotplug operations, such as an SMT switch, require hotplugging
multiple CPUs. The current implementation takes
cpus_write_lock() for each individual CPU, causing multiple slow grace
period requests.
Introduce cpu_up_locked() and cpu_down_locked() that assume the caller
already holds cpus_write_lock(). The cpuhp_smt_enable() and
cpuhp_smt_disable() functions are updated to hold the lock once around
the entire loop, rather than for each individual CPU.
Link: https://lore.kernel.org/all/20260113090153.GS830755@noisy.programming.kicks-ass.net/
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>
---
kernel/cpu.c | 72 +++++++++++++++++++++++++++++++++++++---------------
1 file changed, 52 insertions(+), 20 deletions(-)
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 01968a5c4a16..62e209eda78c 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1400,8 +1400,8 @@ static int cpuhp_down_callbacks(unsigned int cpu, struct cpuhp_cpu_state *st,
return ret;
}
-/* Requires cpu_add_remove_lock to be held */
-static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
+/* Requires cpu_add_remove_lock and cpus_write_lock to be held */
+static int __ref cpu_down_locked(unsigned int cpu, int tasks_frozen,
enum cpuhp_state target)
{
struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
@@ -1413,7 +1413,7 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
if (!cpu_present(cpu))
return -EINVAL;
- cpus_write_lock();
+ lockdep_assert_cpus_held();
/*
* Keep at least one housekeeping cpu onlined to avoid generating
@@ -1421,8 +1421,7 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
*/
if (cpumask_any_and(cpu_online_mask,
housekeeping_cpumask(HK_TYPE_DOMAIN)) >= nr_cpu_ids) {
- ret = -EBUSY;
- goto out;
+ return -EBUSY;
}
cpuhp_tasks_frozen = tasks_frozen;
@@ -1440,14 +1439,14 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
* return the error code..
*/
if (ret)
- goto out;
+ return ret;
/*
* We might have stopped still in the range of the AP hotplug
* thread. Nothing to do anymore.
*/
if (st->state > CPUHP_TEARDOWN_CPU)
- goto out;
+ return 0;
st->target = target;
}
@@ -1464,8 +1463,16 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
WARN(1, "DEAD callback error for CPU%d", cpu);
}
}
+ return ret;
+}
-out:
+static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
+ enum cpuhp_state target)
+{
+
+ int ret;
+ cpus_write_lock();
+ ret = cpu_down_locked(cpu, tasks_frozen, target);
cpus_write_unlock();
arch_smt_update();
return ret;
@@ -1613,18 +1620,18 @@ void cpuhp_online_idle(enum cpuhp_state state)
complete_ap_thread(st, true);
}
-/* Requires cpu_add_remove_lock to be held */
-static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state target)
+/* Requires cpu_add_remove_lock and cpus_write_lock to be held. */
+static int cpu_up_locked(unsigned int cpu, int tasks_frozen,
+ enum cpuhp_state target)
{
struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
struct task_struct *idle;
int ret = 0;
- cpus_write_lock();
+ lockdep_assert_cpus_held();
if (!cpu_present(cpu)) {
- ret = -EINVAL;
- goto out;
+ return -EINVAL;
}
/*
@@ -1632,14 +1639,13 @@ static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state target)
* caller. Nothing to do.
*/
if (st->state >= target)
- goto out;
+ return 0;
if (st->state == CPUHP_OFFLINE) {
/* Let it fail before we try to bring the cpu up */
idle = idle_thread_get(cpu);
if (IS_ERR(idle)) {
- ret = PTR_ERR(idle);
- goto out;
+ return PTR_ERR(idle);
}
/*
@@ -1663,7 +1669,7 @@ static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state target)
* return the error code..
*/
if (ret)
- goto out;
+ return ret;
}
/*
@@ -1673,7 +1679,16 @@ static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state target)
*/
target = min((int)target, CPUHP_BRINGUP_CPU);
ret = cpuhp_up_callbacks(cpu, st, target);
-out:
+ return ret;
+}
+
+/* Requires cpu_add_remove_lock to be held */
+static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state target)
+{
+ int ret;
+
+ cpus_write_lock();
+ ret = cpu_up_locked(cpu, tasks_frozen, target);
cpus_write_unlock();
arch_smt_update();
return ret;
@@ -2659,6 +2674,16 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
int cpu, ret = 0;
cpu_maps_update_begin();
+ if (cpu_hotplug_offline_disabled) {
+ ret = -EOPNOTSUPP;
+ goto out;
+ }
+ if (cpu_hotplug_disabled) {
+ ret = -EBUSY;
+ goto out;
+ }
+ /* Hold cpus_write_lock() for entire batch operation. */
+ cpus_write_lock();
for_each_online_cpu(cpu) {
if (topology_is_primary_thread(cpu))
continue;
@@ -2668,7 +2693,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
*/
if (ctrlval == CPU_SMT_ENABLED && cpu_smt_thread_allowed(cpu))
continue;
- ret = cpu_down_maps_locked(cpu, CPUHP_OFFLINE);
+ ret = cpu_down_locked(cpu, 0, CPUHP_OFFLINE);
if (ret)
break;
/*
@@ -2688,6 +2713,9 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
}
if (!ret)
cpu_smt_control = ctrlval;
+ cpus_write_unlock();
+ arch_smt_update();
+out:
cpu_maps_update_done();
return ret;
}
@@ -2705,6 +2733,8 @@ int cpuhp_smt_enable(void)
int cpu, ret = 0;
cpu_maps_update_begin();
+ /* Hold cpus_write_lock() for entire batch operation. */
+ cpus_write_lock();
cpu_smt_control = CPU_SMT_ENABLED;
for_each_present_cpu(cpu) {
/* Skip online CPUs and CPUs on offline nodes */
@@ -2712,12 +2742,14 @@ int cpuhp_smt_enable(void)
continue;
if (!cpu_smt_thread_allowed(cpu) || !topology_is_core_online(cpu))
continue;
- ret = _cpu_up(cpu, 0, CPUHP_ONLINE);
+ ret = cpu_up_locked(cpu, 0, CPUHP_ONLINE);
if (ret)
break;
/* See comment in cpuhp_smt_disable() */
cpuhp_online_cpu_device(cpu);
}
+ cpus_write_unlock();
+ arch_smt_update();
cpu_maps_update_done();
return ret;
}
--
2.53.0
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH v3 2/2] cpuhp: Expedite RCU grace periods during SMT operations
2026-02-18 8:39 [PATCH v3 0/2] cpuhp: Improve SMT switch time via lock batching and RCU expedition Vishal Chourasia
2026-02-18 8:39 ` [PATCH v3 1/2] cpuhp: Optimize SMT switch operation by batching lock acquisition Vishal Chourasia
@ 2026-02-18 8:39 ` Vishal Chourasia
2026-02-27 1:13 ` Joel Fernandes
2026-03-25 19:10 ` Thomas Gleixner
1 sibling, 2 replies; 11+ messages in thread
From: Vishal Chourasia @ 2026-02-18 8:39 UTC (permalink / raw)
To: peterz, aboorvad
Cc: boqun.feng, frederic, joelagnelf, josh, linux-kernel,
neeraj.upadhyay, paulmck, rcu, rostedt, srikar, sshegde, tglx,
urezki, samir, vishalc
Expedite synchronize_rcu() during SMT mode switch operations when
initiated via the /sys/devices/system/cpu/smt/control interface.
An SMT mode switch, e.g. from SMT 8 to SMT 1 or vice versa, is a
user-driven operation and should therefore complete as soon as possible.
Switching SMT states involves iterating over a list of CPUs and
performing hotplug operations on each. These transitions were found to
take a significantly long time to complete, particularly on
high-core-count systems.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>
---
include/linux/rcupdate.h | 8 ++++++++
kernel/cpu.c | 4 ++++
kernel/rcu/rcu.h | 4 ----
3 files changed, 12 insertions(+), 4 deletions(-)
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 7729fef249e1..61b80c29d53b 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -1190,6 +1190,14 @@ rcu_head_after_call_rcu(struct rcu_head *rhp, rcu_callback_t f)
extern int rcu_expedited;
extern int rcu_normal;
+#ifdef CONFIG_TINY_RCU
+static inline void rcu_expedite_gp(void) { }
+static inline void rcu_unexpedite_gp(void) { }
+#else
+void rcu_expedite_gp(void);
+void rcu_unexpedite_gp(void);
+#endif
+
DEFINE_LOCK_GUARD_0(rcu, rcu_read_lock(), rcu_read_unlock())
DECLARE_LOCK_GUARD_0_ATTRS(rcu, __acquires_shared(RCU), __releases_shared(RCU))
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 62e209eda78c..1377a68d6f47 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2682,6 +2682,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
ret = -EBUSY;
goto out;
}
+ rcu_expedite_gp();
/* Hold cpus_write_lock() for entire batch operation. */
cpus_write_lock();
for_each_online_cpu(cpu) {
@@ -2714,6 +2715,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
if (!ret)
cpu_smt_control = ctrlval;
cpus_write_unlock();
+ rcu_unexpedite_gp();
arch_smt_update();
out:
cpu_maps_update_done();
@@ -2733,6 +2735,7 @@ int cpuhp_smt_enable(void)
int cpu, ret = 0;
cpu_maps_update_begin();
+ rcu_expedite_gp();
/* Hold cpus_write_lock() for entire batch operation. */
cpus_write_lock();
cpu_smt_control = CPU_SMT_ENABLED;
@@ -2749,6 +2752,7 @@ int cpuhp_smt_enable(void)
cpuhp_online_cpu_device(cpu);
}
cpus_write_unlock();
+ rcu_unexpedite_gp();
arch_smt_update();
cpu_maps_update_done();
return ret;
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index dc5d614b372c..41a0d262e964 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -512,8 +512,6 @@ do { \
static inline bool rcu_gp_is_normal(void) { return true; }
static inline bool rcu_gp_is_expedited(void) { return false; }
static inline bool rcu_async_should_hurry(void) { return false; }
-static inline void rcu_expedite_gp(void) { }
-static inline void rcu_unexpedite_gp(void) { }
static inline void rcu_async_hurry(void) { }
static inline void rcu_async_relax(void) { }
static inline bool rcu_cpu_online(int cpu) { return true; }
@@ -521,8 +519,6 @@ static inline bool rcu_cpu_online(int cpu) { return true; }
bool rcu_gp_is_normal(void); /* Internal RCU use. */
bool rcu_gp_is_expedited(void); /* Internal RCU use. */
bool rcu_async_should_hurry(void); /* Internal RCU use. */
-void rcu_expedite_gp(void);
-void rcu_unexpedite_gp(void);
void rcu_async_hurry(void);
void rcu_async_relax(void);
void rcupdate_announce_bootup_oddness(void);
--
2.53.0
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH v3 2/2] cpuhp: Expedite RCU grace periods during SMT operations
2026-02-18 8:39 ` [PATCH v3 2/2] cpuhp: Expedite RCU grace periods during SMT operations Vishal Chourasia
@ 2026-02-27 1:13 ` Joel Fernandes
2026-03-02 11:47 ` Samir M
2026-03-25 19:10 ` Thomas Gleixner
1 sibling, 1 reply; 11+ messages in thread
From: Joel Fernandes @ 2026-02-27 1:13 UTC (permalink / raw)
To: Vishal Chourasia
Cc: peterz, aboorvad, boqun.feng, frederic, josh, linux-kernel,
neeraj.upadhyay, paulmck, rcu, rostedt, srikar, sshegde, tglx,
urezki, samir
On Wed, Feb 18, 2026 at 02:09:18PM +0530, Vishal Chourasia wrote:
> Expedite synchronize_rcu during the SMT mode switch operation when
> initiated via /sys/devices/system/cpu/smt/control interface
>
> SMT mode switch operation i.e. between SMT 8 to SMT 1 or vice versa and
> others are user driven operations and therefore should complete as soon
> as possible. Switching SMT states involves iterating over a list of CPUs
> and performing hotplug operations. It was found these transitions took
> significantly large amount of time to complete particularly on
> high-core-count systems.
>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>
> ---
> include/linux/rcupdate.h | 8 ++++++++
> kernel/cpu.c | 4 ++++
> kernel/rcu/rcu.h | 4 ----
> 3 files changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 7729fef249e1..61b80c29d53b 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -1190,6 +1190,14 @@ rcu_head_after_call_rcu(struct rcu_head *rhp, rcu_callback_t f)
> extern int rcu_expedited;
> extern int rcu_normal;
>
> +#ifdef CONFIG_TINY_RCU
> +static inline void rcu_expedite_gp(void) { }
> +static inline void rcu_unexpedite_gp(void) { }
> +#else
> +void rcu_expedite_gp(void);
> +void rcu_unexpedite_gp(void);
> +#endif
> +
> DEFINE_LOCK_GUARD_0(rcu, rcu_read_lock(), rcu_read_unlock())
> DECLARE_LOCK_GUARD_0_ATTRS(rcu, __acquires_shared(RCU), __releases_shared(RCU))
>
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 62e209eda78c..1377a68d6f47 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -2682,6 +2682,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
> ret = -EBUSY;
> goto out;
> }
> + rcu_expedite_gp();
After the locking related changes in patch 1, is expediting still required? I
am just a bit concerned that we are papering over the real issue of over
usage of synchronize_rcu() (which IIRC we discussed in earlier versions of
the patches that reducing the number of lock acquire/release was supposed to
help.)
Could you provide more justification of why expediting these sections is
required if the locking concerns were addressed? It would be great if you can
provide performance numbers with only the first patch and without the second
patch. That way we can quantify this patch.
thanks,
--
Joel Fernandes
> /* Hold cpus_write_lock() for entire batch operation. */
> cpus_write_lock();
> for_each_online_cpu(cpu) {
> @@ -2714,6 +2715,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
> if (!ret)
> cpu_smt_control = ctrlval;
> cpus_write_unlock();
> + rcu_unexpedite_gp();
> arch_smt_update();
> out:
> cpu_maps_update_done();
> @@ -2733,6 +2735,7 @@ int cpuhp_smt_enable(void)
> int cpu, ret = 0;
>
> cpu_maps_update_begin();
> + rcu_expedite_gp();
> /* Hold cpus_write_lock() for entire batch operation. */
> cpus_write_lock();
> cpu_smt_control = CPU_SMT_ENABLED;
> @@ -2749,6 +2752,7 @@ int cpuhp_smt_enable(void)
> cpuhp_online_cpu_device(cpu);
> }
> cpus_write_unlock();
> + rcu_unexpedite_gp();
> arch_smt_update();
> cpu_maps_update_done();
> return ret;
> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> index dc5d614b372c..41a0d262e964 100644
> --- a/kernel/rcu/rcu.h
> +++ b/kernel/rcu/rcu.h
> @@ -512,8 +512,6 @@ do { \
> static inline bool rcu_gp_is_normal(void) { return true; }
> static inline bool rcu_gp_is_expedited(void) { return false; }
> static inline bool rcu_async_should_hurry(void) { return false; }
> -static inline void rcu_expedite_gp(void) { }
> -static inline void rcu_unexpedite_gp(void) { }
> static inline void rcu_async_hurry(void) { }
> static inline void rcu_async_relax(void) { }
> static inline bool rcu_cpu_online(int cpu) { return true; }
> @@ -521,8 +519,6 @@ static inline bool rcu_cpu_online(int cpu) { return true; }
> bool rcu_gp_is_normal(void); /* Internal RCU use. */
> bool rcu_gp_is_expedited(void); /* Internal RCU use. */
> bool rcu_async_should_hurry(void); /* Internal RCU use. */
> -void rcu_expedite_gp(void);
> -void rcu_unexpedite_gp(void);
> void rcu_async_hurry(void);
> void rcu_async_relax(void);
> void rcupdate_announce_bootup_oddness(void);
> --
> 2.53.0
>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v3 2/2] cpuhp: Expedite RCU grace periods during SMT operations
2026-02-27 1:13 ` Joel Fernandes
@ 2026-03-02 11:47 ` Samir M
2026-03-06 5:44 ` Vishal Chourasia
0 siblings, 1 reply; 11+ messages in thread
From: Samir M @ 2026-03-02 11:47 UTC (permalink / raw)
To: Joel Fernandes, Vishal Chourasia
Cc: peterz, aboorvad, boqun.feng, frederic, josh, linux-kernel,
neeraj.upadhyay, paulmck, rcu, rostedt, srikar, sshegde, tglx,
urezki
On 27/02/26 6:43 am, Joel Fernandes wrote:
> On Wed, Feb 18, 2026 at 02:09:18PM +0530, Vishal Chourasia wrote:
>> Expedite synchronize_rcu during the SMT mode switch operation when
>> initiated via /sys/devices/system/cpu/smt/control interface
>>
>> SMT mode switch operation i.e. between SMT 8 to SMT 1 or vice versa and
>> others are user driven operations and therefore should complete as soon
>> as possible. Switching SMT states involves iterating over a list of CPUs
>> and performing hotplug operations. It was found these transitions took
>> significantly large amount of time to complete particularly on
>> high-core-count systems.
>>
>> Suggested-by: Peter Zijlstra <peterz@infradead.org>
>> Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>
>> ---
>> include/linux/rcupdate.h | 8 ++++++++
>> kernel/cpu.c | 4 ++++
>> kernel/rcu/rcu.h | 4 ----
>> 3 files changed, 12 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
>> index 7729fef249e1..61b80c29d53b 100644
>> --- a/include/linux/rcupdate.h
>> +++ b/include/linux/rcupdate.h
>> @@ -1190,6 +1190,14 @@ rcu_head_after_call_rcu(struct rcu_head *rhp, rcu_callback_t f)
>> extern int rcu_expedited;
>> extern int rcu_normal;
>>
>> +#ifdef CONFIG_TINY_RCU
>> +static inline void rcu_expedite_gp(void) { }
>> +static inline void rcu_unexpedite_gp(void) { }
>> +#else
>> +void rcu_expedite_gp(void);
>> +void rcu_unexpedite_gp(void);
>> +#endif
>> +
>> DEFINE_LOCK_GUARD_0(rcu, rcu_read_lock(), rcu_read_unlock())
>> DECLARE_LOCK_GUARD_0_ATTRS(rcu, __acquires_shared(RCU), __releases_shared(RCU))
>>
>> diff --git a/kernel/cpu.c b/kernel/cpu.c
>> index 62e209eda78c..1377a68d6f47 100644
>> --- a/kernel/cpu.c
>> +++ b/kernel/cpu.c
>> @@ -2682,6 +2682,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
>> ret = -EBUSY;
>> goto out;
>> }
>> + rcu_expedite_gp();
> After the locking related changes in patch 1, is expediting still required? I
> am just a bit concerned that we are papering over the real issue of over
> usage of synchronize_rcu() (which IIRC we discussed in earlier versions of
> the patches that reducing the number of lock acquire/release was supposed to
> help.)
>
> Could you provide more justification of why expediting these sections is
> required if the locking concerns were addressed? It would be great if you can
> provide performance numbers with only the first patch and without the second
> patch. That way we can quantify this patch.
>
> thanks,
>
> --
> Joel Fernandes
>
Hi Vishal/Joel,
Configuration:
• Kernel version: 7.0.0-rc1
• Number of CPUs: 1536
I have verified the below two patches together and observed improvements,
Patch 1:
https://lore.kernel.org/all/20260218083915.660252-4-vishalc@linux.ibm.com/
Patch 2:
https://lore.kernel.org/all/20260218083915.660252-6-vishalc@linux.ibm.com/
SMT Mode | Without Patch(Base) | both patches applied | % Improvement |
------------------------------------------------------------------------|
SMT=off  | 16m 13.956s         | 6m 18.435s           | +61.14 %      |
SMT=on   | 12m 0.982s          | 5m 59.576s           | +50.10 %      |
When I tested the below patch independently, I did not observe any
improvements for either smt=on or smt=off. However, in the smt=off
scenario, I encountered hung task splats (with call traces), where some
threads were blocked on cpus_read_lock. Please also refer to the
attached call trace below.
Patch 1:
https://lore.kernel.org/all/20260218083915.660252-4-vishalc@linux.ibm.com/
SMT Mode | Without Patch(Base) | just patch 1 applied | % Improvement |
------------------------------------------------------------------------|
SMT=off  | 16m 13.956s         | 16m 9.793s           | +0.43 %       |
SMT=on   | 12m 0.982s          | 12m 19.494s          | -2.57 %       |
Call traces:
[ 1477.612377] [ T8746] Tainted: G E 7.0.0-rc1-150700.51-default-dirty #1
[ 1477.612384] [ T8746] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1477.612389] [ T8746] task:systemd state:D stack:0 pid:1
tgid:1 ppid:0 task_flags:0x400100 flags:0x00040000
[ 1477.612397] [ T8746] Call Trace:
[ 1477.612399] [ T8746] [c00000000cc0f4f0] [0000000000100000] 0x100000
(unreliable)
[ 1477.612416] [ T8746] [c00000000cc0f6a0] [c00000000001fe5c]
__switch_to+0x1dc/0x290
[ 1477.612425] [ T8746] [c00000000cc0f6f0] [c0000000012598ac]
__schedule+0x40c/0x1a70
[ 1477.612433] [ T8746] [c00000000cc0f840] [c00000000125af58]
schedule+0x48/0x1a0
[ 1477.612439] [ T8746] [c00000000cc0f870] [c0000000002e27b8]
percpu_rwsem_wait+0x198/0x200
[ 1477.612445] [ T8746] [c00000000cc0f8f0] [c000000001262930]
__percpu_down_read+0xb0/0x210
[ 1477.612449] [ T8746] [c00000000cc0f930] [c00000000022f400]
cpus_read_lock+0xc0/0xd0
[ 1477.612456] [ T8746] [c00000000cc0f950] [c0000000003a6398]
cgroup_procs_write_start+0x328/0x410
[ 1477.612462] [ T8746] [c00000000cc0fa00] [c0000000003a9620]
__cgroup_procs_write+0x70/0x2c0
[ 1477.612468] [ T8746] [c00000000cc0fac0] [c0000000003a98e8]
cgroup_procs_write+0x28/0x50
[ 1477.612473] [ T8746] [c00000000cc0faf0] [c0000000003a1624]
cgroup_file_write+0xb4/0x240
[ 1477.612478] [ T8746] [c00000000cc0fb50] [c000000000853ba8]
kernfs_fop_write_iter+0x1a8/0x2a0
[ 1477.612485] [ T8746] [c00000000cc0fba0] [c000000000733d5c]
vfs_write+0x27c/0x540
[ 1477.612491] [ T8746] [c00000000cc0fc50] [c000000000734350]
ksys_write+0x80/0x150
[ 1477.612495] [ T8746] [c00000000cc0fca0] [c000000000032898]
system_call_exception+0x148/0x320
[ 1477.612500] [ T8746] [c00000000cc0fe50] [c00000000000d6a0]
system_call_common+0x160/0x2c4
[ 1477.612506] [ T8746] ---- interrupt: c00 at 0x7fffa8f73df4
[ 1477.612509] [ T8746] NIP: 00007fffa8f73df4 LR: 00007fffa8eb6144 CTR:
0000000000000000
[ 1477.612512] [ T8746] REGS: c00000000cc0fe80 TRAP: 0c00 Tainted: G
E (7.0.0-rc1-150700.51-default-dirty)
[ 1477.612515] [ T8746] MSR: 800000000000d033 <SF,EE,PR,ME,IR,DR,RI,LE>
CR: 28002288 XER: 00000000
Regards,
Samir
>> /* Hold cpus_write_lock() for entire batch operation. */
>> cpus_write_lock();
>> for_each_online_cpu(cpu) {
>> @@ -2714,6 +2715,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
>> if (!ret)
>> cpu_smt_control = ctrlval;
>> cpus_write_unlock();
>> + rcu_unexpedite_gp();
>> arch_smt_update();
>> out:
>> cpu_maps_update_done();
>> @@ -2733,6 +2735,7 @@ int cpuhp_smt_enable(void)
>> int cpu, ret = 0;
>>
>> cpu_maps_update_begin();
>> + rcu_expedite_gp();
>> /* Hold cpus_write_lock() for entire batch operation. */
>> cpus_write_lock();
>> cpu_smt_control = CPU_SMT_ENABLED;
>> @@ -2749,6 +2752,7 @@ int cpuhp_smt_enable(void)
>> cpuhp_online_cpu_device(cpu);
>> }
>> cpus_write_unlock();
>> + rcu_unexpedite_gp();
>> arch_smt_update();
>> cpu_maps_update_done();
>> return ret;
>> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
>> index dc5d614b372c..41a0d262e964 100644
>> --- a/kernel/rcu/rcu.h
>> +++ b/kernel/rcu/rcu.h
>> @@ -512,8 +512,6 @@ do { \
>> static inline bool rcu_gp_is_normal(void) { return true; }
>> static inline bool rcu_gp_is_expedited(void) { return false; }
>> static inline bool rcu_async_should_hurry(void) { return false; }
>> -static inline void rcu_expedite_gp(void) { }
>> -static inline void rcu_unexpedite_gp(void) { }
>> static inline void rcu_async_hurry(void) { }
>> static inline void rcu_async_relax(void) { }
>> static inline bool rcu_cpu_online(int cpu) { return true; }
>> @@ -521,8 +519,6 @@ static inline bool rcu_cpu_online(int cpu) { return true; }
>> bool rcu_gp_is_normal(void); /* Internal RCU use. */
>> bool rcu_gp_is_expedited(void); /* Internal RCU use. */
>> bool rcu_async_should_hurry(void); /* Internal RCU use. */
>> -void rcu_expedite_gp(void);
>> -void rcu_unexpedite_gp(void);
>> void rcu_async_hurry(void);
>> void rcu_async_relax(void);
>> void rcupdate_announce_bootup_oddness(void);
>> --
>> 2.53.0
>>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v3 2/2] cpuhp: Expedite RCU grace periods during SMT operations
2026-03-02 11:47 ` Samir M
@ 2026-03-06 5:44 ` Vishal Chourasia
2026-03-06 15:12 ` Paul E. McKenney
0 siblings, 1 reply; 11+ messages in thread
From: Vishal Chourasia @ 2026-03-06 5:44 UTC (permalink / raw)
To: Samir M
Cc: Joel Fernandes, peterz, aboorvad, boqun.feng, frederic, josh,
linux-kernel, neeraj.upadhyay, paulmck, rcu, rostedt, srikar,
sshegde, tglx, urezki
On Mon, Mar 02, 2026 at 05:17:16PM +0530, Samir M wrote:
>
> On 27/02/26 6:43 am, Joel Fernandes wrote:
> > On Wed, Feb 18, 2026 at 02:09:18PM +0530, Vishal Chourasia wrote:
> > > Expedite synchronize_rcu during the SMT mode switch operation when
> > > initiated via /sys/devices/system/cpu/smt/control interface
> > >
> > After the locking related changes in patch 1, is expediting still required? I
Yes.
> > am just a bit concerned that we are papering over the real issue of over
> > usage of synchronize_rcu() (which IIRC we discussed in earlier versions of
> > the patches that reducing the number of lock acquire/release was supposed to
> > help.)
At present, I am not sure about the underlying issue. So far, what I have
found is that when synchronize_rcu() is invoked, it marks the start of a
new grace period number, say A. The thread invoking synchronize_rcu()
blocks until all CPUs have reported a QS for GP "A". There is an RCU
grace-period kthread that runs periodically, looping over a CPU list to
figure out whether all CPUs have reported a QS. In the trace, I find some
CPUs reporting a QS for a sequence number way back in the past, e.g.
A - N where N > 10.
> >
> > Could you provide more justification of why expediting these sections is
> > required if the locking concerns were addressed? It would be great if you can
> > provide performance numbers with only the first patch and without the second
> > patch. That way we can quantify this patch.
> >
> >
> SMT Mode | Without Patch(Base) | both patch applied | % Improvement |
> ------------------------------------------------------------------------|
> SMT=off | 16m 13.956s | 6m 18.435s | +61.14 % |
> SMT=on | 12m 0.982s | 5m 59.576s | +50.10 % |
>
> When I tested the below patch independently, I did not observe any
> improvements for either smt=on or smt=off. However, in the smt=off scenario,
> I encountered hung task splats (with call traces), where some threads were
> blocked on cpus_read_lock. Please also refer to the attached call trace
> below.
> Patch 1:
> https://lore.kernel.org/all/20260218083915.660252-4-vishalc@linux.ibm.com/
>
> SMT Mode | Without Patch(Base) | just patch 1 applied | % Improvement
> |
> ----------------------------------------------------------------------------|
> SMT=off | 16m 13.956s | 16m 9.793s | +0.43 %
> |
> SMT=on | 12m 0.982s | 12m 19.494s | -2.57 %
> |
>
>
> Call traces:
> 12377] [ T8746] Tainted: G E 7.0.0-rc1-150700.51-default-dirty #1
> [ 1477.612384] [ T8746] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 1477.612389] [ T8746] task:systemd state:D stack:0 pid:1 tgid:1
> ppid:0 task_flags:0x400100 flags:0x00040000
> [ 1477.612397] [ T8746] Call Trace:
> [ 1477.612399] [ T8746] [c00000000cc0f4f0] [0000000000100000] 0x100000
> (unreliable)
> [ 1477.612416] [ T8746] [c00000000cc0f6a0] [c00000000001fe5c]
> __switch_to+0x1dc/0x290
> [ 1477.612425] [ T8746] [c00000000cc0f6f0] [c0000000012598ac]
> __schedule+0x40c/0x1a70
> [ 1477.612433] [ T8746] [c00000000cc0f840] [c00000000125af58]
> schedule+0x48/0x1a0
> [ 1477.612439] [ T8746] [c00000000cc0f870] [c0000000002e27b8]
> percpu_rwsem_wait+0x198/0x200
> [ 1477.612445] [ T8746] [c00000000cc0f8f0] [c000000001262930]
> __percpu_down_read+0xb0/0x210
> [ 1477.612449] [ T8746] [c00000000cc0f930] [c00000000022f400]
> cpus_read_lock+0xc0/0xd0
> [ 1477.612456] [ T8746] [c00000000cc0f950] [c0000000003a6398]
> cgroup_procs_write_start+0x328/0x410
> [ 1477.612462] [ T8746] [c00000000cc0fa00] [c0000000003a9620]
> __cgroup_procs_write+0x70/0x2c0
> [ 1477.612468] [ T8746] [c00000000cc0fac0] [c0000000003a98e8]
> cgroup_procs_write+0x28/0x50
> [ 1477.612473] [ T8746] [c00000000cc0faf0] [c0000000003a1624]
> cgroup_file_write+0xb4/0x240
> [ 1477.612478] [ T8746] [c00000000cc0fb50] [c000000000853ba8]
> kernfs_fop_write_iter+0x1a8/0x2a0
> [ 1477.612485] [ T8746] [c00000000cc0fba0] [c000000000733d5c]
> vfs_write+0x27c/0x540
> [ 1477.612491] [ T8746] [c00000000cc0fc50] [c000000000734350]
> ksys_write+0x80/0x150
> [ 1477.612495] [ T8746] [c00000000cc0fca0] [c000000000032898]
> system_call_exception+0x148/0x320
> [ 1477.612500] [ T8746] [c00000000cc0fe50] [c00000000000d6a0]
> system_call_common+0x160/0x2c4
> [ 1477.612506] [ T8746] ---- interrupt: c00 at 0x7fffa8f73df4
> [ 1477.612509] [ T8746] NIP: 00007fffa8f73df4 LR: 00007fffa8eb6144 CTR:
> 0000000000000000
> [ 1477.612512] [ T8746] REGS: c00000000cc0fe80 TRAP: 0c00 Tainted: G
> E (7.0.0-rc1-150700.51-default-dirty)
> [ 1477.612515] [ T8746] MSR: 800000000000d033 <SF,EE,PR,ME,IR,DR,RI,LE> CR:
> 28002288 XER: 00000000
>
>
Default timeout is set to 8 mins.
$ grep . /proc/sys/kernel/hung_task_timeout_secs
/proc/sys/kernel/hung_task_timeout_secs:480
Now that cpus_write_lock() is taken once, and an SMT mode switch can take
tens of minutes to complete before relinquishing the lock, threads
waiting on cpus_read_lock() will be blocked for this entire duration.
Although no splats were observed in the "both patches applied" case,
the issue still remains.
regards,
vishal
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v3 2/2] cpuhp: Expedite RCU grace periods during SMT operations
2026-03-06 5:44 ` Vishal Chourasia
@ 2026-03-06 15:12 ` Paul E. McKenney
2026-03-20 18:49 ` Vishal Chourasia
0 siblings, 1 reply; 11+ messages in thread
From: Paul E. McKenney @ 2026-03-06 15:12 UTC (permalink / raw)
To: Vishal Chourasia
Cc: Samir M, Joel Fernandes, peterz, aboorvad, boqun.feng, frederic,
josh, linux-kernel, neeraj.upadhyay, rcu, rostedt, srikar,
sshegde, tglx, urezki
On Fri, Mar 06, 2026 at 11:14:13AM +0530, Vishal Chourasia wrote:
> On Mon, Mar 02, 2026 at 05:17:16PM +0530, Samir M wrote:
> >
> > On 27/02/26 6:43 am, Joel Fernandes wrote:
> > > On Wed, Feb 18, 2026 at 02:09:18PM +0530, Vishal Chourasia wrote:
> > > > Expedite synchronize_rcu during the SMT mode switch operation when
> > > > initiated via /sys/devices/system/cpu/smt/control interface
> > > >
> > > After the locking related changes in patch 1, is expediting still required? I
> Yes.
> > > am just a bit concerned that we are papering over the real issue of over
> > > usage of synchronize_rcu() (which IIRC we discussed in earlier versions of
> > > the patches that reducing the number of lock acquire/release was supposed to
> > > help.)
> At present, I am not sure about the underlying issue. So far what I have
> found is when synchronize_rcu() is invoked, it marks the start of a new
> grace period number, say A. Thread invoking synchronize_rcu() blocks
> until all CPUs have reported QS for GP "A". There is a rcu grace period
> kthread that runs periodically looping over a CPU list to figure out all
> CPUs have reported QS. In the trace, I find some CPUs reporting QS for
> sequence number way back in the past for ex. A - N where N is > 10.
This can happen when a CPU goes idle for multiple grace periods, then
wakes up in the middle of a later grace period. This is (or at least is
supposed to be) harmless because a quiescent state was reported on that
CPU's behalf when RCU noticed that it was idle. The report is quashed
when RCU notices that the quiescent state being reported is for a grace
period that has already completed. Grace-period counter wrap is handled
by the infamous ->gpwrap field in the rcu_data structure.
I have seen N having four digits, with deep embedded devices being most
likely to have extremely large values of N.
Thanx, Paul
> > > Could you provide more justification of why expediting these sections is
> > > required if the locking concerns were addressed? It would be great if you can
> > > provide performance numbers with only the first patch and without the second
> > > patch. That way we can quantify this patch.
> > >
> > >
> > SMT Mode | Without Patch(Base) | both patch applied | % Improvement |
> > ------------------------------------------------------------------------|
> > SMT=off | 16m 13.956s | 6m 18.435s | +61.14 % |
> > SMT=on | 12m 0.982s | 5m 59.576s | +50.10 % |
> >
> > When I tested the below patch independently, I did not observe any
> > improvements for either smt=on or smt=off. However, in the smt=off scenario,
> > I encountered hung task splats (with call traces), where some threads were
> > blocked on cpus_read_lock. Please also refer to the attached call trace
> > below.
> > Patch 1:
> > https://lore.kernel.org/all/20260218083915.660252-4-vishalc@linux.ibm.com/
> >
> > SMT Mode | Without Patch(Base) | just patch 1 applied | % Improvement |
> > ----------------------------------------------------------------------------|
> > SMT=off  | 16m 13.956s         | 16m 9.793s           | +0.43 %       |
> > SMT=on   | 12m 0.982s          | 12m 19.494s          | -2.57 %       |
> >
> >
> > Call traces:
> > 12377] [ T8746] Tainted: G E 7.0.0-rc1-150700.51-default-dirty #1
> > [ 1477.612384] [ T8746] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [ 1477.612389] [ T8746] task:systemd state:D stack:0 pid:1 tgid:1
> > ppid:0 task_flags:0x400100 flags:0x00040000
> > [ 1477.612397] [ T8746] Call Trace:
> > [ 1477.612399] [ T8746] [c00000000cc0f4f0] [0000000000100000] 0x100000
> > (unreliable)
> > [ 1477.612416] [ T8746] [c00000000cc0f6a0] [c00000000001fe5c]
> > __switch_to+0x1dc/0x290
> > [ 1477.612425] [ T8746] [c00000000cc0f6f0] [c0000000012598ac]
> > __schedule+0x40c/0x1a70
> > [ 1477.612433] [ T8746] [c00000000cc0f840] [c00000000125af58]
> > schedule+0x48/0x1a0
> > [ 1477.612439] [ T8746] [c00000000cc0f870] [c0000000002e27b8]
> > percpu_rwsem_wait+0x198/0x200
> > [ 1477.612445] [ T8746] [c00000000cc0f8f0] [c000000001262930]
> > __percpu_down_read+0xb0/0x210
> > [ 1477.612449] [ T8746] [c00000000cc0f930] [c00000000022f400]
> > cpus_read_lock+0xc0/0xd0
> > [ 1477.612456] [ T8746] [c00000000cc0f950] [c0000000003a6398]
> > cgroup_procs_write_start+0x328/0x410
> > [ 1477.612462] [ T8746] [c00000000cc0fa00] [c0000000003a9620]
> > __cgroup_procs_write+0x70/0x2c0
> > [ 1477.612468] [ T8746] [c00000000cc0fac0] [c0000000003a98e8]
> > cgroup_procs_write+0x28/0x50
> > [ 1477.612473] [ T8746] [c00000000cc0faf0] [c0000000003a1624]
> > cgroup_file_write+0xb4/0x240
> > [ 1477.612478] [ T8746] [c00000000cc0fb50] [c000000000853ba8]
> > kernfs_fop_write_iter+0x1a8/0x2a0
> > [ 1477.612485] [ T8746] [c00000000cc0fba0] [c000000000733d5c]
> > vfs_write+0x27c/0x540
> > [ 1477.612491] [ T8746] [c00000000cc0fc50] [c000000000734350]
> > ksys_write+0x80/0x150
> > [ 1477.612495] [ T8746] [c00000000cc0fca0] [c000000000032898]
> > system_call_exception+0x148/0x320
> > [ 1477.612500] [ T8746] [c00000000cc0fe50] [c00000000000d6a0]
> > system_call_common+0x160/0x2c4
> > [ 1477.612506] [ T8746] ---- interrupt: c00 at 0x7fffa8f73df4
> > [ 1477.612509] [ T8746] NIP: 00007fffa8f73df4 LR: 00007fffa8eb6144 CTR:
> > 0000000000000000
> > [ 1477.612512] [ T8746] REGS: c00000000cc0fe80 TRAP: 0c00 Tainted: G
> > E (7.0.0-rc1-150700.51-default-dirty)
> > [ 1477.612515] [ T8746] MSR: 800000000000d033 <SF,EE,PR,ME,IR,DR,RI,LE> CR:
> > 28002288 XER: 00000000
> >
> >
>
> Default timeout is set to 8 mins.
>
> $ grep . /proc/sys/kernel/hung_task_timeout_secs
> /proc/sys/kernel/hung_task_timeout_secs:480
>
> Now that cpus_write_lock is taken once, and SMT mode switch can take
> tens of minutes to complete and relinquish the lock, threads waiting on
> cpus_read_lock will be blocked for this entire duration.
>
> Although there were no splats observed for "both patch applied" case
> the issue still remains.
>
> regards,
> vishal
* Re: [PATCH v3 2/2] cpuhp: Expedite RCU grace periods during SMT operations
2026-03-06 15:12 ` Paul E. McKenney
@ 2026-03-20 18:49 ` Vishal Chourasia
0 siblings, 0 replies; 11+ messages in thread
From: Vishal Chourasia @ 2026-03-20 18:49 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Samir M, Joel Fernandes, peterz, aboorvad, boqun.feng, frederic,
josh, linux-kernel, neeraj.upadhyay, rcu, rostedt, srikar,
sshegde, tglx, urezki
Hi Paul, thank you for your response.
Sorry I could not get back to you sooner; I wanted to understand what is
happening behind the scenes after the cpuhp kthread blocks in
synchronize_rcu(), so I did a little more digging.
On a 320-CPU system, an SMT8-to-SMT4 switch takes >1 minute to complete.
160 CPUs are offlined one by one.
In total, 321 synchronize_rcu() calls are invoked, each taking ~125ms to
finish (ftrace sleep-time option set).
3298110.851011 | 316) cpuhp/3-1614 | | synchronize_rcu() {
3298111.010125 | 316) cpuhp/3-1614 | @ 159112.9 us | }
--
3298111.020432 | 0) kworker-29406 | | synchronize_rcu() {
3298111.190132 | 0) kworker-29406 | @ 169699.4 us | }
--
3298111.191327 | 317) cpuhp/3-1619 | | synchronize_rcu() {
3298111.350129 | 317) cpuhp/3-1619 | @ 158801.9 us | }
--
3298111.360263 | 0) kworker-29406 | | synchronize_rcu() {
3298111.530137 | 0) kworker-29406 | @ 169874.5 us | }
--
3298111.531098 | 318) cpuhp/3-1624 | | synchronize_rcu() {
3298111.650128 | 318) cpuhp/3-1624 | @ 119029.8 us | }
Breakdown of the time spent during a single synchronize_rcu() invoked
from the sched_cpu_deactivate callback (CPU 4 was offlined).
Summary:
--> cpuhp_enter (sched_cpu_deactivate)
    CB registration → AccWaitCB           ~10ms    Waiting for softirq tick on CPU 4
    GP 220685125: FQS scan 1              ~10ms    Tick delay + scan (all clear except
                                                   CPU 260; rcu_gp_kthread runs on CPU 260)
    GP 220685125: wait for CPU 260        ~30ms    FQS sleep interval, CPU 260 not yet reported
    GP 220685125: FQS scan 2 + end        ~0.02ms  CPU 260 clears
    GP 220685129: FQS scan 1              ~30ms    Tick delay + full scan (same: CPU 260 holdout)
    GP 220685129: wait for CPU 260        ~30ms    Same pattern
    GP 220685129: FQS scan 2 + end        ~0.02ms  CPU 260 clears
    CB invocation + wakeup                ~10ms    Softirq tick invokes wakeme_after_rcu
    destroy_sched_domains_rcu queueing    ~8ms     322 call_rcu() callbacks
<-- cpuhp_exit (sched_cpu_deactivate)
I have collected some rcu static tracepoint data, which I am currently
going through.
On Fri, Mar 06, 2026 at 07:12:04AM -0800, Paul E. McKenney wrote:
> On Fri, Mar 06, 2026 at 11:14:13AM +0530, Vishal Chourasia wrote:
> > On Mon, Mar 02, 2026 at 05:17:16PM +0530, Samir M wrote:
> > >
> > > On 27/02/26 6:43 am, Joel Fernandes wrote:
> > > > On Wed, Feb 18, 2026 at 02:09:18PM +0530, Vishal Chourasia wrote:
> > > > > Expedite synchronize_rcu during the SMT mode switch operation when
> > > > > initiated via /sys/devices/system/cpu/smt/control interface
> > > > >
> > > > After the locking related changes in patch 1, is expediting still required? I
> > Yes.
> > > > am just a bit concerned that we are papering over the real issue of over
> > > > usage of synchronize_rcu() (which IIRC we discussed in earlier versions of
> > > > the patches that reducing the number of lock acquire/release was supposed to
> > > > help.)
> > At present, I am not sure about the underlying issue. So far what I have
> > found is when synchronize_rcu() is invoked, it marks the start of a new
> > grace period number, say A. Thread invoking synchronize_rcu() blocks
> > until all CPUs have reported QS for GP "A". There is a rcu grace period
> > kthread that runs periodically looping over a CPU list to figure out all
> > CPUs have reported QS. In the trace, I find some CPUs reporting QS for
> > sequence number way back in the past for ex. A - N where N is > 10.
>
> This can happen when a CPU goes idle for multiple grace periods, then
> wakes up in the middle of a later grace period. This is (or at least is
> supposed to be) harmless because a quiescent state was reported on that
> CPU's behalf when RCU noticed that it was idle. The report is quashed
If it is harmless, can we consider just expediting the SMT mode switch
operation via the smt/control file [1]?
Thanks, vishalc
[1] https://lore.kernel.org/all/20260218083915.660252-6-vishalc@linux.ibm.com/
> when RCU notices that the quiescent state being reported is for a grace
> period that has already completed. Grace-period counter wrap is handled
> by the infamous ->gpwrap field in the rcu_data structure.
>
> I have seen N having four digits, with deep embedded devices being most
> likely to have extremely large values of N.
>
> Thanx, Paul
>
> > > > Could you provide more justification of why expediting these sections is
> > > > required if the locking concerns were addressed? It would be great if you can
> > > > provide performance numbers with only the first patch and without the second
> > > > patch. That way we can quantify this patch.
> > > >
> > > >
> > > SMT Mode | Without Patch(Base) | both patch applied | % Improvement |
> > > ------------------------------------------------------------------------|
> > > SMT=off | 16m 13.956s | 6m 18.435s | +61.14 % |
> > > SMT=on | 12m 0.982s | 5m 59.576s | +50.10 % |
> > >
> > > When I tested the below patch independently, I did not observe any
> > > improvements for either smt=on or smt=off. However, in the smt=off scenario,
> > > I encountered hung task splats (with call traces), where some threads were
> > > blocked on cpus_read_lock. Please also refer to the attached call trace
> > > below.
> > > Patch 1:
> > > https://lore.kernel.org/all/20260218083915.660252-4-vishalc@linux.ibm.com/
> > >
> > > SMT Mode | Without Patch(Base) | just patch 1 applied | % Improvement |
> > > ----------------------------------------------------------------------------|
> > > SMT=off  | 16m 13.956s         | 16m 9.793s           | +0.43 %       |
> > > SMT=on   | 12m 0.982s          | 12m 19.494s          | -2.57 %       |
> > >
> > >
> > > Call traces:
> > > 12377] [ T8746] Tainted: G E 7.0.0-rc1-150700.51-default-dirty #1
> > > [ 1477.612384] [ T8746] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > disables this message.
> > > [ 1477.612389] [ T8746] task:systemd state:D stack:0 pid:1 tgid:1
> > > ppid:0 task_flags:0x400100 flags:0x00040000
> > > [ 1477.612397] [ T8746] Call Trace:
> > > [ 1477.612399] [ T8746] [c00000000cc0f4f0] [0000000000100000] 0x100000
> > > (unreliable)
> > > [ 1477.612416] [ T8746] [c00000000cc0f6a0] [c00000000001fe5c]
> > > __switch_to+0x1dc/0x290
> > > [ 1477.612425] [ T8746] [c00000000cc0f6f0] [c0000000012598ac]
> > > __schedule+0x40c/0x1a70
> > > [ 1477.612433] [ T8746] [c00000000cc0f840] [c00000000125af58]
> > > schedule+0x48/0x1a0
> > > [ 1477.612439] [ T8746] [c00000000cc0f870] [c0000000002e27b8]
> > > percpu_rwsem_wait+0x198/0x200
> > > [ 1477.612445] [ T8746] [c00000000cc0f8f0] [c000000001262930]
> > > __percpu_down_read+0xb0/0x210
> > > [ 1477.612449] [ T8746] [c00000000cc0f930] [c00000000022f400]
> > > cpus_read_lock+0xc0/0xd0
> > > [ 1477.612456] [ T8746] [c00000000cc0f950] [c0000000003a6398]
> > > cgroup_procs_write_start+0x328/0x410
> > > [ 1477.612462] [ T8746] [c00000000cc0fa00] [c0000000003a9620]
> > > __cgroup_procs_write+0x70/0x2c0
> > > [ 1477.612468] [ T8746] [c00000000cc0fac0] [c0000000003a98e8]
> > > cgroup_procs_write+0x28/0x50
> > > [ 1477.612473] [ T8746] [c00000000cc0faf0] [c0000000003a1624]
> > > cgroup_file_write+0xb4/0x240
> > > [ 1477.612478] [ T8746] [c00000000cc0fb50] [c000000000853ba8]
> > > kernfs_fop_write_iter+0x1a8/0x2a0
> > > [ 1477.612485] [ T8746] [c00000000cc0fba0] [c000000000733d5c]
> > > vfs_write+0x27c/0x540
> > > [ 1477.612491] [ T8746] [c00000000cc0fc50] [c000000000734350]
> > > ksys_write+0x80/0x150
> > > [ 1477.612495] [ T8746] [c00000000cc0fca0] [c000000000032898]
> > > system_call_exception+0x148/0x320
> > > [ 1477.612500] [ T8746] [c00000000cc0fe50] [c00000000000d6a0]
> > > system_call_common+0x160/0x2c4
> > > [ 1477.612506] [ T8746] ---- interrupt: c00 at 0x7fffa8f73df4
> > > [ 1477.612509] [ T8746] NIP: 00007fffa8f73df4 LR: 00007fffa8eb6144 CTR:
> > > 0000000000000000
> > > [ 1477.612512] [ T8746] REGS: c00000000cc0fe80 TRAP: 0c00 Tainted: G
> > > E (7.0.0-rc1-150700.51-default-dirty)
> > > [ 1477.612515] [ T8746] MSR: 800000000000d033 <SF,EE,PR,ME,IR,DR,RI,LE> CR:
> > > 28002288 XER: 00000000
> > >
> > >
> >
> > Default timeout is set to 8 mins.
> >
> > $ grep . /proc/sys/kernel/hung_task_timeout_secs
> > /proc/sys/kernel/hung_task_timeout_secs:480
> >
> > Now that cpus_write_lock is taken once, and SMT mode switch can take
> > tens of minutes to complete and relinquish the lock, threads waiting on
> > cpus_read_lock will be blocked for this entire duration.
> >
> > Although there were no splats observed for "both patch applied" case
> > the issue still remains.
> >
> > regards,
> > vishal
* Re: [PATCH v3 1/2] cpuhp: Optimize SMT switch operation by batching lock acquisition
2026-02-18 8:39 ` [PATCH v3 1/2] cpuhp: Optimize SMT switch operation by batching lock acquisition Vishal Chourasia
@ 2026-03-25 19:09 ` Thomas Gleixner
2026-03-26 10:06 ` Vishal Chourasia
0 siblings, 1 reply; 11+ messages in thread
From: Thomas Gleixner @ 2026-03-25 19:09 UTC (permalink / raw)
To: Vishal Chourasia, peterz, aboorvad
Cc: boqun.feng, frederic, joelagnelf, josh, linux-kernel,
neeraj.upadhyay, paulmck, rcu, rostedt, srikar, sshegde, urezki,
samir, vishalc
On Wed, Feb 18 2026 at 14:09, Vishal Chourasia wrote:
> From: Joel Fernandes <joelagnelf@nvidia.com>
>
> Bulk CPU hotplug operations, such as an SMT switch operation, requires
> hotplugging multiple CPUs. The current implementation takes
> cpus_write_lock() for each individual CPU, causing multiple slow grace
> period requests.
>
> Introduce cpu_up_locked() and cpu_down_locked() that assume the caller
> already holds cpus_write_lock(). The cpuhp_smt_enable() and
> cpuhp_smt_disable() functions are updated to hold the lock once around
> the entire loop, rather than for each individual CPU.
>
> Link: https://lore.kernel.org/all/20260113090153.GS830755@noisy.programming.kicks-ass.net/
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>
You dropped Joel's Signed-off-by ....
> -/* Requires cpu_add_remove_lock to be held */
> -static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
> +/* Requires cpu_add_remove_lock and cpus_write_lock to be held */
> +static int __ref cpu_down_locked(unsigned int cpu, int tasks_frozen,
> enum cpuhp_state target)
No line break required. You have 100 chars. If you still need one:
https://www.kernel.org/doc/html/latest/process/maintainer-tip.html
> */
> if (cpumask_any_and(cpu_online_mask,
> housekeeping_cpumask(HK_TYPE_DOMAIN)) >= nr_cpu_ids) {
> - ret = -EBUSY;
> - goto out;
> + return -EBUSY;
> }
Please remove the brackets. They are no longer required. All over the place.
> +static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
> + enum cpuhp_state target)
> +{
> +
> + int ret;
> + cpus_write_lock();
Coding style...
> + ret = cpu_down_locked(cpu, tasks_frozen, target);
> cpus_write_unlock();
> arch_smt_update();
> return ret;
> @@ -2659,6 +2674,16 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
> int cpu, ret = 0;
>
> cpu_maps_update_begin();
> + if (cpu_hotplug_offline_disabled) {
> + ret = -EOPNOTSUPP;
> + goto out;
> + }
> + if (cpu_hotplug_disabled) {
> + ret = -EBUSY;
> + goto out;
> + }
> + /* Hold cpus_write_lock() for entire batch operation. */
> + cpus_write_lock();
.... for the entire ...
And please visually separate things. Newlines exist for a reason.
Thanks,
tglx
* Re: [PATCH v3 2/2] cpuhp: Expedite RCU grace periods during SMT operations
2026-02-18 8:39 ` [PATCH v3 2/2] cpuhp: Expedite RCU grace periods during SMT operations Vishal Chourasia
2026-02-27 1:13 ` Joel Fernandes
@ 2026-03-25 19:10 ` Thomas Gleixner
1 sibling, 0 replies; 11+ messages in thread
From: Thomas Gleixner @ 2026-03-25 19:10 UTC (permalink / raw)
To: Vishal Chourasia, peterz, aboorvad
Cc: boqun.feng, frederic, joelagnelf, josh, linux-kernel,
neeraj.upadhyay, paulmck, rcu, rostedt, srikar, sshegde, urezki,
samir, vishalc
On Wed, Feb 18 2026 at 14:09, Vishal Chourasia wrote:
> Expedite synchronize_rcu during the SMT mode switch operation when
> initiated via /sys/devices/system/cpu/smt/control interface
>
> SMT mode switch operation i.e. between SMT 8 to SMT 1 or vice versa and
> others are user driven operations and therefore should complete as soon
> as possible. Switching SMT states involves iterating over a list of CPUs
> and performing hotplug operations. It was found these transitions took
> significantly large amount of time to complete particularly on
> high-core-count systems.
This changelog is neither explaining the underlying problem, nor
explaining why expedite solves it and does not contain numbers which
justify the change.
Thanks,
tglx
* Re: [PATCH v3 1/2] cpuhp: Optimize SMT switch operation by batching lock acquisition
2026-03-25 19:09 ` Thomas Gleixner
@ 2026-03-26 10:06 ` Vishal Chourasia
0 siblings, 0 replies; 11+ messages in thread
From: Vishal Chourasia @ 2026-03-26 10:06 UTC (permalink / raw)
To: Thomas Gleixner
Cc: peterz, aboorvad, boqun.feng, frederic, joelagnelf, josh,
linux-kernel, neeraj.upadhyay, paulmck, rcu, rostedt, srikar,
sshegde, urezki, samir
Hi Thomas, thank you for the review.
Numbers from a 400-CPU system that I had from a while back:
baseline: Linux 6.19.0-rc4-00310-g755bc1335e3b
On PPC64 system with 400 CPUs:
SMT8 to SMT1:
baseline: real 1m14.792s
baseline+patch: real 0m03.205s # ~23x improvement
SMT1 to SMT8:
baseline: real 2m27.695s
baseline+patch: real 0m02.510s # ~58x improvement
Note: we observe huge improvements on a max-config system, which
originally took approximately 1 hour to switch SMT states; with GPs
expedited it takes 5 to 6 minutes.
Analysis: why expediting GPs improves time to complete.
By expediting the grace period, we force immediate IPI-driven
quiescent-state detection across all CPUs rather than lazily waiting,
which dramatically reduces the time the calling thread remains blocked
in synchronize_rcu().
Why will holding cpus_write_lock() for the duration of the SMT switch
not work? [1] It causes hung-task timeout splats [2], because there are
threads blocked on cpus_read_lock(); expediting grace periods shrinks
the window but doesn't eliminate it. I plan to drop this patch, and the
next version will only carry the expedited RCU grace-period change.
I will incorporate all your other suggestions in the next version.
[1] https://lore.kernel.org/all/20260113090153.GS830755@noisy.programming.kicks-ass.net/
[2] https://lore.kernel.org/all/aapprY-prH0l_WeK@linux.ibm.com/
On Wed, Mar 25, 2026 at 08:09:17PM +0100, Thomas Gleixner wrote:
> On Wed, Feb 18 2026 at 14:09, Vishal Chourasia wrote:
> > From: Joel Fernandes <joelagnelf@nvidia.com>
> >
> > Bulk CPU hotplug operations, such as an SMT switch operation, requires
> > hotplugging multiple CPUs. The current implementation takes
> > cpus_write_lock() for each individual CPU, causing multiple slow grace
> > period requests.
> >
> > Introduce cpu_up_locked() and cpu_down_locked() that assume the caller
> > already holds cpus_write_lock(). The cpuhp_smt_enable() and
> > cpuhp_smt_disable() functions are updated to hold the lock once around
> > the entire loop, rather than for each individual CPU.
> >
> > Link: https://lore.kernel.org/all/20260113090153.GS830755@noisy.programming.kicks-ass.net/
> > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>
>
> You dropped Joel's Signed-off-by ....
Sorry for messing up the changelog w.r.t. the Signed-off-by tag.
Will take care of it in the future.
>
> > -/* Requires cpu_add_remove_lock to be held */
> > -static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
> > +/* Requires cpu_add_remove_lock and cpus_write_lock to be held */
> > +static int __ref cpu_down_locked(unsigned int cpu, int tasks_frozen,
> > enum cpuhp_state target)
>
> No line break required. You have 100 chars. If you still need one:
>
> https://www.kernel.org/doc/html/latest/process/maintainer-tip.html
Ack.
>
> > */
> > if (cpumask_any_and(cpu_online_mask,
> > housekeeping_cpumask(HK_TYPE_DOMAIN)) >= nr_cpu_ids) {
> > - ret = -EBUSY;
> > - goto out;
> > + return -EBUSY;
> > }
>
> Please remove the brackets. They are no longer required. All over the place.
Ack.
>
> > +static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
> > + enum cpuhp_state target)
> > +{
> > +
> > + int ret;
> > + cpus_write_lock();
>
> Coding style...
Ack.
>
> > + ret = cpu_down_locked(cpu, tasks_frozen, target);
> > cpus_write_unlock();
> > arch_smt_update();
> > return ret;
> > @@ -2659,6 +2674,16 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
> > int cpu, ret = 0;
> >
> > cpu_maps_update_begin();
> > + if (cpu_hotplug_offline_disabled) {
> > + ret = -EOPNOTSUPP;
> > + goto out;
> > + }
> > + if (cpu_hotplug_disabled) {
> > + ret = -EBUSY;
> > + goto out;
> > + }
> > + /* Hold cpus_write_lock() for entire batch operation. */
> > + cpus_write_lock();
>
> .... for the entire ...
>
> And please visually separate things. Newlines exist for a reason.
Sure.
>
> Thanks,
>
> tglx
Thanks and Regards!
Vishalc
Thread overview: 11+ messages
2026-02-18 8:39 [PATCH v3 0/2] cpuhp: Improve SMT switch time via lock batching and RCU expedition Vishal Chourasia
2026-02-18 8:39 ` [PATCH v3 1/2] cpuhp: Optimize SMT switch operation by batching lock acquisition Vishal Chourasia
2026-03-25 19:09 ` Thomas Gleixner
2026-03-26 10:06 ` Vishal Chourasia
2026-02-18 8:39 ` [PATCH v3 2/2] cpuhp: Expedite RCU grace periods during SMT operations Vishal Chourasia
2026-02-27 1:13 ` Joel Fernandes
2026-03-02 11:47 ` Samir M
2026-03-06 5:44 ` Vishal Chourasia
2026-03-06 15:12 ` Paul E. McKenney
2026-03-20 18:49 ` Vishal Chourasia
2026-03-25 19:10 ` Thomas Gleixner