* [PATCH v4 0/1] cpuhp: Expedite RCU when toggling system-wide SMT mode
@ 2026-05-07 5:39 Vishal Chourasia
2026-05-07 5:39 ` [PATCH v4 1/1] " Vishal Chourasia
0 siblings, 1 reply; 3+ messages in thread
From: Vishal Chourasia @ 2026-05-07 5:39 UTC (permalink / raw)
To: peterz, aboorvad
Cc: boqun.feng, frederic, joelagnelf, josh, linux-kernel,
neeraj.upadhyay, paulmck, rcu, rostedt, srikar, sshegde, tglx,
urezki, samir, vishalc
Hello All,
An SMT mode switch on a large CPU-count system takes close to an hour
to complete. Initial debugging traced the delay to the CPU hotplug
subsystem being blocked on numerous synchronize_rcu() calls. Simply
enabling system-wide RCU expediting reduced the switch time to 5-6
minutes. Since then, several approaches have been explored; some had
side effects of their own and others didn't work as expected.
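For reference, the switch in question is the one driven through the
SMT control file, e.g.:

	# echo off > /sys/devices/system/cpu/smt/control
	# echo on > /sys/devices/system/cpu/smt/control

(some architectures, ppc64 included, also accept a thread count
there), and the quick system-wide workaround above was presumably the
existing global knob, settable at boot via rcupdate.rcu_expedited or
at runtime:

	# echo 1 > /sys/kernel/rcu_expedited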
Approaches explored:
1. Expediting individual CPU hotplug operations by wrapping
_cpu_up()/_cpu_down() with rcu_expedite_gp()/rcu_unexpedite_gp() [0]
(see the sketch after this list). Peter suggested expediting only
when the SMT switch is triggered via the sysfs control interface, not
for individual hotplug operations [1].
2. Replacing synchronize_rcu() calls in the CPU hotplug codepath with
their expedited variants. This is not viable because one
synchronize_rcu() is invoked inside cpus_write_lock(), which is shared
with other kernel subsystems [5].
3. Hoisting cpus_write_lock() to be taken once for the entire SMT switch
operation instead of per-CPU [3][4]. On large systems where the SMT
switch can still take 5-6 minutes, holding the lock for that duration
causes hung task splats and starves other subsystems depending on the
read lock.
4. Peter also suggested using rcu_sync_{enter|exit}(), which doesn't
help on its own, but could be paired with approach 2 above.
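For illustration, approach 1 amounted to bracketing each individual
hotplug operation with the expedite helpers, roughly as below. This
is a simplified sketch, not the exact hunks from [0];
expedited_cpu_down() is a hypothetical name and the _cpu_down() call
shape is only indicative:

	static int expedited_cpu_down(unsigned int cpu)
	{
		int ret;

		rcu_expedite_gp();	/* hurry all RCU grace periods */
		ret = _cpu_down(cpu, 0, CPUHP_OFFLINE);
		rcu_unexpedite_gp();	/* back to normal grace periods */
		return ret;
	}

The downside is that this hurries every hotplug operation, including
ones unrelated to an SMT switch, which is what [1] pushed back on.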
Current approach: expedite RCU grace periods around the SMT switch
operation in the sysfs control interface path, per Peter's suggestion
[1], with Aboorva's analysis confirming synchronize_rcu() as the
bottleneck [2].
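For context, under Tree RCU these helpers are plain nesting counters,
so bracketing the whole switch is cheap and nest-safe. A sketch of
their implementation (as in kernel/rcu/update.c today; details may
vary across kernel versions):

	void rcu_expedite_gp(void)
	{
		atomic_inc(&rcu_expedited_nesting);
	}

	void rcu_unexpedite_gp(void)
	{
		atomic_dec(&rcu_expedited_nesting);
	}

While the nesting count is non-zero (or the rcu_expedited boot/sysfs
knob is set), rcu_gp_is_expedited() returns true and
synchronize_rcu() takes the expedited path.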
[0] https://lore.kernel.org/all/20260218083915.660252-2-vishalc@linux.ibm.com
[1] https://lore.kernel.org/all/20260113090153.GS830755@noisy.programming.kicks-ass.net/
[2] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com
[3] https://lore.kernel.org/all/20260119114333.GI1890602@noisy.programming.kicks-ass.net/
[4] https://lore.kernel.org/all/ba470918-0ad9-4548-9161-826948462f73@linux.ibm.com/
[5] https://lore.kernel.org/all/804E7B47-F515-4592-B12E-84AD251EB07D@nvidia.com/
[6] https://lore.kernel.org/all/e2cca734-9191-4073-ba9d-936014498645@linux.ibm.com/
Vishal Chourasia (1):
  cpuhp: Expedite RCU when toggling system-wide SMT mode

 include/linux/rcupdate.h | 8 ++++++++
 kernel/cpu.c             | 4 ++++
 kernel/rcu/rcu.h         | 4 ----
 3 files changed, 12 insertions(+), 4 deletions(-)
--
2.54.0
* [PATCH v4 1/1] cpuhp: Expedite RCU when toggling system-wide SMT mode
2026-05-07 5:39 [PATCH v4 0/1] cpuhp: Expedite RCU when toggling system-wide SMT mode Vishal Chourasia
@ 2026-05-07 5:39 ` Vishal Chourasia
2026-05-07 19:07 ` Samir M
0 siblings, 1 reply; 3+ messages in thread
From: Vishal Chourasia @ 2026-05-07 5:39 UTC (permalink / raw)
To: peterz, aboorvad
Cc: boqun.feng, frederic, joelagnelf, josh, linux-kernel,
    neeraj.upadhyay, paulmck, rcu, rostedt, srikar, sshegde, tglx,
    urezki, samir, vishalc

On large idle systems, changing the system-wide SMT level via the
sysfs control interface still takes approximately 40-55 minutes.
Changing SMT levels is a user-triggered operation, generally done when
systems are fairly idle, and the delay blocks users from doing other
work until it completes.

Analyzing profile data collected during an SMT level switch showed
that the CPU hotplug machinery was blocked on synchronize_rcu() calls.

Expedite RCU grace periods for the entire duration of an SMT switch
operation triggered via the sysfs control interface. Individual CPU
hotplug operations via the online/offline interface are not affected.

On a PPC64 system with 400 CPUs:

SMT8 to SMT1:
 before: 1m14s
 after: 3.2s (~23x faster)

SMT1 to SMT8:
 before: 2m27s
 after: 2.5s (~58x faster)

On a large config system with 1920 CPUs, completion time improves from
~1 hour to 5-6 minutes.

Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>
---
 include/linux/rcupdate.h | 8 ++++++++
 kernel/cpu.c             | 4 ++++
 kernel/rcu/rcu.h         | 4 ----
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index bfa765132de8..b6bccf131f1c 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -1178,6 +1178,14 @@ rcu_head_after_call_rcu(struct rcu_head *rhp, rcu_callback_t f)
 extern int rcu_expedited;
 extern int rcu_normal;
 
+#ifdef CONFIG_TINY_RCU
+static inline void rcu_expedite_gp(void) { }
+static inline void rcu_unexpedite_gp(void) { }
+#else
+void rcu_expedite_gp(void);
+void rcu_unexpedite_gp(void);
+#endif
+
 DEFINE_LOCK_GUARD_0(rcu, rcu_read_lock(), rcu_read_unlock())
 DECLARE_LOCK_GUARD_0_ATTRS(rcu, __acquires_shared(RCU), __releases_shared(RCU))
 
diff --git a/kernel/cpu.c b/kernel/cpu.c
index bc4f7a9ba64e..6351da9dffdc 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2658,6 +2658,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
 	int cpu, ret = 0;
 
 	cpu_maps_update_begin();
+	rcu_expedite_gp();
 	for_each_online_cpu(cpu) {
 		if (topology_is_primary_thread(cpu))
 			continue;
@@ -2687,6 +2688,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
 	}
 	if (!ret)
 		cpu_smt_control = ctrlval;
+	rcu_unexpedite_gp();
 	cpu_maps_update_done();
 	return ret;
 }
@@ -2704,6 +2706,7 @@ int cpuhp_smt_enable(void)
 	int cpu, ret = 0;
 
 	cpu_maps_update_begin();
+	rcu_expedite_gp();
 	cpu_smt_control = CPU_SMT_ENABLED;
 	for_each_present_cpu(cpu) {
 		/* Skip online CPUs and CPUs on offline nodes */
@@ -2717,6 +2720,7 @@ int cpuhp_smt_enable(void)
 		/* See comment in cpuhp_smt_disable() */
 		cpuhp_online_cpu_device(cpu);
 	}
+	rcu_unexpedite_gp();
 	cpu_maps_update_done();
 	return ret;
 }
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index fa6d30ce73d1..d6a63ee60bc0 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -521,8 +521,6 @@ do { \
 static inline bool rcu_gp_is_normal(void) { return true; }
 static inline bool rcu_gp_is_expedited(void) { return false; }
 static inline bool rcu_async_should_hurry(void) { return false; }
-static inline void rcu_expedite_gp(void) { }
-static inline void rcu_unexpedite_gp(void) { }
 static inline void rcu_async_hurry(void) { }
 static inline void rcu_async_relax(void) { }
 static inline bool rcu_cpu_online(int cpu) { return true; }
@@ -530,8 +528,6 @@ static inline bool rcu_cpu_online(int cpu) { return true; }
 bool rcu_gp_is_normal(void);  /* Internal RCU use. */
 bool rcu_gp_is_expedited(void);  /* Internal RCU use. */
 bool rcu_async_should_hurry(void);  /* Internal RCU use. */
-void rcu_expedite_gp(void);
-void rcu_unexpedite_gp(void);
 void rcu_async_hurry(void);
 void rcu_async_relax(void);
 void rcupdate_announce_bootup_oddness(void);
--
2.54.0
* Re: [PATCH v4 1/1] cpuhp: Expedite RCU when toggling system-wide SMT mode
2026-05-07 5:39 ` [PATCH v4 1/1] " Vishal Chourasia
@ 2026-05-07 19:07 ` Samir M
0 siblings, 0 replies; 3+ messages in thread
From: Samir M @ 2026-05-07 19:07 UTC (permalink / raw)
To: Vishal Chourasia, peterz, aboorvad
Cc: boqun.feng, frederic, joelagnelf, josh, linux-kernel,
    neeraj.upadhyay, paulmck, rcu, rostedt, srikar, sshegde, tglx,
    urezki

On 07/05/26 11:09 am, Vishal Chourasia wrote:
> On large idle systems, changing the system-wide SMT level via the
> sysfs control interface still takes approximately 40-55 minutes.
> Changing SMT levels is a user-triggered operation, generally done
> when systems are fairly idle, and the delay blocks users from doing
> other work until it completes.
>
> Analyzing profile data collected during an SMT level switch showed
> that the CPU hotplug machinery was blocked on synchronize_rcu()
> calls.
>
> Expedite RCU grace periods for the entire duration of an SMT switch
> operation triggered via the sysfs control interface. Individual CPU
> hotplug operations via the online/offline interface are not
> affected.
>
> On a PPC64 system with 400 CPUs:
>
> SMT8 to SMT1:
>  before: 1m14s
>  after: 3.2s (~23x faster)
>
> SMT1 to SMT8:
>  before: 2m27s
>  after: 2.5s (~58x faster)
>
> On a large config system with 1920 CPUs, completion time improves
> from ~1 hour to 5-6 minutes.

Hi Vishal,

I verified the patch on a PPC64 system using the configuration
described below.

Configuration:
• Kernel version: 7.1.0-rc2+
• Number of CPUs: 960

With this setup, I evaluated the patch with SMT both enabled and
disabled. The patch shows a significant improvement in both the
SMT=on and SMT=off cases.

SMT Mode | Without Patch | With Patch | % Improvement |
-------------------------------------------------------
SMT=off  | 3m 56.174s    | 0m 34.284s | +85.48%       |
SMT=on   | 3m 47.322s    | 0m 35.583s | +84.35%       |

Regards,
Samir