* [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations
@ 2026-01-12 9:43 Vishal Chourasia
From: Vishal Chourasia @ 2026-01-12 9:43 UTC (permalink / raw)
To: rcu, linux-kernel
Cc: paulmck, frederic, neeraj.upadhyay, joelagnelf, josh, boqun.feng,
urezki, rostedt, tglx, peterz, sshegde, srikar, Vishal Chourasia
Bulk CPU hotplug operations—such as switching SMT modes across all
cores—require hotplugging multiple CPUs in rapid succession. On large
systems, this process takes significant time, increasing as the number
of CPUs grows, leading to substantial delays on high-core-count
machines. Analysis [1] reveals that the majority of this time is spent
waiting for synchronize_rcu().

Expedite synchronize_rcu() during the hotplug path to accelerate the
operation. Since CPU hotplug is a user-initiated administrative task,
it should complete as quickly as possible.

Performance data on a PPC64 system with 400 CPUs:

+ ppc64_cpu --smt=1 (SMT8 to SMT1)
  Before: real 1m14.792s
  After:  real 0m03.205s   # ~23x improvement

+ ppc64_cpu --smt=8 (SMT1 to SMT8)
  Before: real 2m27.695s
  After:  real 0m02.510s   # ~58x improvement

Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b.

[1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com

Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>
---
 include/linux/rcupdate.h | 3 +++
 kernel/cpu.c             | 2 ++
 2 files changed, 5 insertions(+)
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index c5b30054cd01..03c06cfb2b6d 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -1192,6 +1192,9 @@ rcu_head_after_call_rcu(struct rcu_head *rhp, rcu_callback_t f)
 extern int rcu_expedited;
 extern int rcu_normal;
 
+extern void rcu_expedite_gp(void);
+extern void rcu_unexpedite_gp(void);
+
 DEFINE_LOCK_GUARD_0(rcu,
 	do {
 		rcu_read_lock();
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 8df2d773fe3b..6b0d491d73f4 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -506,12 +506,14 @@ EXPORT_SYMBOL_GPL(cpus_read_unlock);
 void cpus_write_lock(void)
 {
+	rcu_expedite_gp();
 	percpu_down_write(&cpu_hotplug_lock);
 }
 
 void cpus_write_unlock(void)
 {
 	percpu_up_write(&cpu_hotplug_lock);
+	rcu_unexpedite_gp();
 }
 
 void lockdep_assert_cpus_held(void)
--
2.52.0
^ permalink raw reply related [flat|nested] 54+ messages in thread

* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations
From: Uladzislau Rezki @ 2026-01-12 10:08 UTC (permalink / raw)
To: Vishal Chourasia
Cc: rcu, linux-kernel, paulmck, frederic, neeraj.upadhyay, joelagnelf,
	josh, boqun.feng, urezki, rostedt, tglx, peterz, sshegde, srikar

On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote:
> Bulk CPU hotplug operations—such as switching SMT modes across all
> cores—require hotplugging multiple CPUs in rapid succession. On large
> systems, this process takes significant time, increasing as the number
> of CPUs grows, leading to substantial delays on high-core-count
> machines. Analysis [1] reveals that the majority of this time is spent
> waiting for synchronize_rcu().
>
[...]
>
> Performance data on a PPC64 system with 400 CPUs:
>
> + ppc64_cpu --smt=1 (SMT8 to SMT1)
>   Before: real 1m14.792s
>   After:  real 0m03.205s   # ~23x improvement
>
> + ppc64_cpu --smt=8 (SMT1 to SMT8)
>   Before: real 2m27.695s
>   After:  real 0m02.510s   # ~58x improvement
>
> Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b
>
> [1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com
>
Also, you can try:

  echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp

to speed up regular synchronize_rcu() calls. But I am not saying that it
would beat your "expedited switch" improvement.

--
Uladzislau Rezki
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations
From: Vishal Chourasia @ 2026-01-12 10:43 UTC (permalink / raw)
To: Uladzislau Rezki
Cc: rcu, linux-kernel, paulmck, frederic, neeraj.upadhyay, joelagnelf,
	josh, boqun.feng, rostedt, tglx, peterz, sshegde, srikar, aboorvad

Hi Uladzislau,

On 12/01/26 15:38, Uladzislau Rezki wrote:
> On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote:
>> Performance data on a PPC64 system with 400 CPUs:
>>
[...]
> Also you can try: echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
> to speedup regular synchronize_rcu() call. But i am not saying that it would beat
> your "expedited switch" improvement.

# echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp

After setting,

# time ppc64_cpu --smt=1
real 1m10.726s   # Run 1
real 1m12.530s   # Run 2

# time ppc64_cpu --smt=8
real 0m36.661s   # Run 1
real 0m41.401s   # Run 2

Regards,
vishalc
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations
From: Uladzislau Rezki @ 2026-01-12 11:07 UTC (permalink / raw)
To: Vishal Chourasia
Cc: Uladzislau Rezki, rcu, linux-kernel, paulmck, frederic,
	neeraj.upadhyay, joelagnelf, josh, boqun.feng, rostedt, tglx,
	peterz, sshegde, srikar, aboorvad

Hello, Vishalc!

> # echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
>
> After setting,
>
> # time ppc64_cpu --smt=1
> real 1m10.726s   # Run 1
> real 1m12.530s   # Run 2
>
> # time ppc64_cpu --smt=8
> real 0m36.661s   # Run 1
> real 0m41.401s   # Run 2
>
Thanks. "ppc64_cpu --smt=1" is the same; I assume it is offlining.
"ppc64_cpu --smt=8", which is onlining, does see a difference (~5x).
But your real "0m02.510s" is hard to beat even by activating the
"rcu_normal_wake_from_gp" option.

--
Uladzislau Rezki
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations
From: Shrikanth Hegde @ 2026-01-12 12:02 UTC (permalink / raw)
To: Uladzislau Rezki, Vishal Chourasia
Cc: rcu, linux-kernel, paulmck, frederic, neeraj.upadhyay, joelagnelf,
	josh, boqun.feng, rostedt, tglx, peterz, srikar

Hi Vishal. Thanks for the patch.

On 1/12/26 3:38 PM, Uladzislau Rezki wrote:
> On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote:
[...]
> Also you can try: echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
> to speedup regular synchronize_rcu() call. But i am not saying that it would beat
> your "expedited switch" improvement.
>
Hi Uladzislau.

Had a discussion on this at LPC; an in-kernel solution is likely better
than having it in userspace:

- Having it in the kernel would make it work across all archs. Why
  should any user wait when one initiates the hotplug?

- Userspace tools are spread across chcpu, ppc64_cpu, etc., though
  internally most do "0/1 > /sys/devices/system/cpu/cpuN/online". We
  would have to repeat the same change in each tool.

- There is already /sys/kernel/rcu_expedited, which is the better knob
  if we do need to fall back to userspace.
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations
From: Uladzislau Rezki @ 2026-01-12 12:57 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: Uladzislau Rezki, Vishal Chourasia, rcu, linux-kernel, paulmck,
	frederic, neeraj.upadhyay, joelagnelf, josh, boqun.feng, rostedt,
	tglx, peterz, srikar

Hello, Shrikanth!

> Had a discussion on this at LPC, having in kernel solution is likely
> better than having it in userspace.
>
> - Having it in kernel would make it work across all archs. Why should
>   any user wait when one initiates the hotplug.
>
> - userspace tools are spread across such as chcpu, ppc64_cpu etc.
>   though internally most do "0/1 > /sys/devices/system/cpu/cpuN/online".
>   We will have to repeat the same in each tool.
>
> - There is already /sys/kernel/rcu_expedited which is better if at all
>   we need to fallback to userspace.
>
Sounds good to me. I agree it is better to bypass parameters.

--
Uladzislau Rezki
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations
From: Joel Fernandes @ 2026-01-12 16:09 UTC (permalink / raw)
To: Uladzislau Rezki
Cc: Shrikanth Hegde, Vishal Chourasia, rcu@vger.kernel.org,
	linux-kernel@vger.kernel.org, paulmck@kernel.org, frederic@kernel.org,
	neeraj.upadhyay@kernel.org, josh@joshtriplett.org, boqun.feng@gmail.com,
	rostedt@goodmis.org, tglx@linutronix.de, peterz@infradead.org,
	srikar@linux.ibm.com, Uladzislau Rezki

> On Jan 12, 2026, at 7:57 AM, Uladzislau Rezki <urezki@gmail.com> wrote:
>
>> Had a discussion on this at LPC, having in kernel solution is likely
>> better than having it in userspace.
>>
[...]
>>
> Sounds good to me. I agree it is better to bypass parameters.

Another way to make it in-kernel would be to make the RCU normal wake
from GP optimization enabled by default for systems with more than 16
CPUs.

I was considering this, but I did not bring it up because I did not know
until now that there are large systems that might benefit from it.

thanks,

- Joel
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations
From: Paul E. McKenney @ 2026-01-12 16:48 UTC (permalink / raw)
To: Joel Fernandes
Cc: Uladzislau Rezki, Shrikanth Hegde, Vishal Chourasia, rcu@vger.kernel.org,
	linux-kernel@vger.kernel.org, frederic@kernel.org, neeraj.upadhyay@kernel.org,
	josh@joshtriplett.org, boqun.feng@gmail.com, rostedt@goodmis.org,
	tglx@linutronix.de, peterz@infradead.org, srikar@linux.ibm.com

On Mon, Jan 12, 2026 at 04:09:49PM +0000, Joel Fernandes wrote:
[...]
> Another way to make it in-kernel would be to make the RCU normal wake
> from GP optimization enabled for > 16 CPUs by default.
>
> I was considering this, but I did not bring it up because I did not
> know that there are large systems that might benefit from it until now.

This would require increasing the scalability of this optimization,
right? Or am I thinking of the wrong optimization? ;-)

							Thanx, Paul
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations
From: Uladzislau Rezki @ 2026-01-12 17:05 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Joel Fernandes, Uladzislau Rezki, Shrikanth Hegde, Vishal Chourasia,
	rcu@vger.kernel.org, linux-kernel@vger.kernel.org, frederic@kernel.org,
	neeraj.upadhyay@kernel.org, josh@joshtriplett.org, boqun.feng@gmail.com,
	rostedt@goodmis.org, tglx@linutronix.de, peterz@infradead.org,
	srikar@linux.ibm.com

On Mon, Jan 12, 2026 at 08:48:42AM -0800, Paul E. McKenney wrote:
> On Mon, Jan 12, 2026 at 04:09:49PM +0000, Joel Fernandes wrote:
[...]
>> Another way to make it in-kernel would be to make the RCU normal wake
>> from GP optimization enabled for > 16 CPUs by default.
>>
>> I was considering this, but I did not bring it up because I did not
>> know that there are large systems that might benefit from it until now.
>
> This would require increasing the scalability of this optimization,
> right? Or am I thinking of the wrong optimization? ;-)
>
I tested this before. I noticed that scalability work is required only
beyond 64K simultaneous synchronize_rcu() calls; everything below that
was faster with the new approach.

I can retest. Should I? :)

--
Uladzislau Rezki
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations
From: Vishal Chourasia @ 2026-01-12 18:27 UTC (permalink / raw)
To: Uladzislau Rezki
Cc: Paul E. McKenney, Joel Fernandes, Shrikanth Hegde, rcu@vger.kernel.org,
	linux-kernel@vger.kernel.org, frederic@kernel.org, neeraj.upadhyay@kernel.org,
	josh@joshtriplett.org, boqun.feng@gmail.com, rostedt@goodmis.org,
	tglx@linutronix.de, peterz@infradead.org, srikar@linux.ibm.com

Hello Joel, Paul, Uladzislau,

On Mon, Jan 12, 2026 at 06:05:30PM +0100, Uladzislau Rezki wrote:
[...]
> I tested this before. I noticed that after 64K of simultaneous
> synchronize_rcu() calls the scalability is required. Everything
> less was faster with a new approach.

It is worth noting that bulk CPU hotplug represents a different stress
pattern than the "simultaneous call" scenario mentioned above.

In a large-scale hotplug event (like an SMT mode switch), we aren't
necessarily seeing thousands of simultaneous synchronize_rcu() calls.
Instead, because CPU hotplug operations are serialized, we see a
"conveyor belt" of sequential calls. One synchronize_rcu() blocks, the
hotplug state machine waits, it unblocks, and then the next call is
triggered shortly after.

The bottleneck here isn't RCU scalability under concurrent load, but
rather the accumulated latency of hundreds of sequential grace periods.

For example, on pSeries, onlining 350 out of 400 CPUs triggers exactly
350 calls at three different points in the hotplug state machine. Even
though they happen one at a time, the sheer volume makes the total
operation time prohibitive.

The following call stacks were collected during an SMT mode switch where
350 out of 400 CPUs were onlined:

@[
    synchronize_rcu+12
    cpuidle_pause_and_lock+120
    pseries_cpuidle_cpu_online+88
    cpuhp_invoke_callback+500
    cpuhp_thread_fun+316
    smpboot_thread_fn+512
    kthread+308
    start_kernel_thread+20
]: 350
@[
    synchronize_rcu+12
    rcu_sync_enter+260
    percpu_down_write+76
    _cpu_up+140
    cpu_up+440
    cpu_subsys_online+128
    device_online+176
    online_store+220
    dev_attr_store+52
    sysfs_kf_write+120
    kernfs_fop_write_iter+456
    vfs_write+952
    ksys_write+132
    system_call_exception+292
    system_call_vectored_common+348
]: 350
@[
    synchronize_rcu+12
    rcu_sync_enter+260
    percpu_down_write+76
    try_online_node+64
    cpu_up+120
    cpu_subsys_online+128
    device_online+176
    online_store+220
    dev_attr_store+52
    sysfs_kf_write+120
    kernfs_fop_write_iter+456
    vfs_write+952
    ksys_write+132
    system_call_exception+292
    system_call_vectored_common+348
]: 350

The following call stacks were collected during an SMT mode switch where
350 out of 400 CPUs were offlined:

@[
    synchronize_rcu+12
    rcu_sync_enter+260
    percpu_down_write+76
    _cpu_down+188
    __cpu_down_maps_locked+44
    work_for_cpu_fn+56
    process_one_work+508
    worker_thread+840
    kthread+308
    start_kernel_thread+20
]: 1
@[
    synchronize_rcu+12
    sched_cpu_deactivate+244
    cpuhp_invoke_callback+500
    cpuhp_thread_fun+316
    smpboot_thread_fn+512
    kthread+308
    start_kernel_thread+20
]: 350
@[
    synchronize_rcu+12
    cpuidle_pause_and_lock+120
    pseries_cpuidle_cpu_dead+88
    cpuhp_invoke_callback+500
    __cpuhp_invoke_callback_range+200
    _cpu_down+412
    __cpu_down_maps_locked+44
    work_for_cpu_fn+56
    process_one_work+508
    worker_thread+840
    kthread+308
    start_kernel_thread+20
]: 350

- vishalc
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations
From: Paul E. McKenney @ 2026-01-13 0:03 UTC (permalink / raw)
To: Vishal Chourasia
Cc: Uladzislau Rezki, Joel Fernandes, Shrikanth Hegde, rcu@vger.kernel.org,
	linux-kernel@vger.kernel.org, frederic@kernel.org, neeraj.upadhyay@kernel.org,
	josh@joshtriplett.org, boqun.feng@gmail.com, rostedt@goodmis.org,
	tglx@linutronix.de, peterz@infradead.org, srikar@linux.ibm.com

On Mon, Jan 12, 2026 at 11:57:41PM +0530, Vishal Chourasia wrote:
> It is worth noting that bulk CPU hotplug represents a different stress
> pattern than the "simultaneous call" scenario mentioned above.
>
> In a large-scale hotplug event (like a SMT mode switch), we aren't
> necessarily seeing thousands of simultaneous synchronize_rcu() calls.
> Instead, because CPU hotplug operations are serialized, we see a
> "conveyor belt" of sequential calls. One synchronize_rcu() blocks, the
> hotplug state machine waits, it unblocks, and then the next call is
> triggered shortly after.
>
> The bottleneck here isn't RCU scalability under concurrent load, but
> rather the accumulated latency of hundreds of sequential Grace Periods.
>
[...]

I still suggest that you test on a big system. There are other sources
of synchronize_rcu() calls than just CPU hotplug. ;-)

							Thanx, Paul
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations 2026-01-12 16:48 ` Paul E. McKenney 2026-01-12 17:05 ` Uladzislau Rezki @ 2026-01-12 22:24 ` Joel Fernandes 2026-01-13 0:01 ` Paul E. McKenney 1 sibling, 1 reply; 54+ messages in thread From: Joel Fernandes @ 2026-01-12 22:24 UTC (permalink / raw) To: paulmck Cc: Uladzislau Rezki, Shrikanth Hegde, Vishal Chourasia, rcu@vger.kernel.org, linux-kernel@vger.kernel.org, frederic@kernel.org, neeraj.upadhyay@kernel.org, josh@joshtriplett.org, boqun.feng@gmail.com, rostedt@goodmis.org, tglx@linutronix.de, peterz@infradead.org, srikar@linux.ibm.com On 1/12/2026 11:48 AM, Paul E. McKenney wrote: > On Mon, Jan 12, 2026 at 04:09:49PM +0000, Joel Fernandes wrote: >> >> >>> On Jan 12, 2026, at 7:57 AM, Uladzislau Rezki <urezki@gmail.com> wrote: >>> >>> Hello, Shrikanth! >>> >>>> >>>>> On 1/12/26 3:38 PM, Uladzislau Rezki wrote: >>>>> On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote: >>>>>> Bulk CPU hotplug operations—such as switching SMT modes across all >>>>>> cores—require hotplugging multiple CPUs in rapid succession. On large >>>>>> systems, this process takes significant time, increasing as the number >>>>>> of CPUs grows, leading to substantial delays on high-core-count >>>>>> machines. Analysis [1] reveals that the majority of this time is spent >>>>>> waiting for synchronize_rcu(). >>>>>> >>>>>> Expedite synchronize_rcu() during the hotplug path to accelerate the >>>>>> operation. Since CPU hotplug is a user-initiated administrative task, >>>>>> it should complete as quickly as possible. 
>>>>>> >>>>>> Performance data on a PPC64 system with 400 CPUs: >>>>>> >>>>>> + ppc64_cpu --smt=1 (SMT8 to SMT1) >>>>>> Before: real 1m14.792s >>>>>> After: real 0m03.205s # ~23x improvement >>>>>> >>>>>> + ppc64_cpu --smt=8 (SMT1 to SMT8) >>>>>> Before: real 2m27.695s >>>>>> After: real 0m02.510s # ~58x improvement >>>>>> >>>>>> Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b >>>>>> >>>>>> [1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com >>>>>> >>>>> Also you can try: echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp >>>>> to speedup regular synchronize_rcu() call. But i am not saying that it would beat >>>>> your "expedited switch" improvement. >>>>> >>>> >>>> Hi Uladzislau. >>>> >>>> Had a discussion on this at LPC, having in kernel solution is likely >>>> better than having it in userspace. >>>> >>>> - Having it in kernel would make it work across all archs. Why should >>>> any user wait when one initiates the hotplug. >>>> >>>> - userspace tools are spread across such as chcpu, ppc64_cpu etc. >>>> though internally most do "0/1 > /sys/devices/system/cpu/cpuN/online". >>>> We will have to repeat the same in each tool. >>>> >>>> - There is already /sys/kernel/rcu_expedited which is better if at all >>>> we need to fallback to userspace. >>>> >>> Sounds good to me. I agree it is better to bypass parameters. >> >> Another way to make it in-kernel would be to make the RCU normal wake >> from GP optimization enabled for > 16 CPUs by default.>> >> I was considering this, but I did not bring it up because I did not >> know that there are large systems that might benefit from it until now.> > This would require increasing the scalability of this optimization, > right? Or am I thinking of the wrong optimization? ;-) > Yes I think you are considering the correct one, the concern you have is regarding large number of wake ups initiated from the GP thread, correct? 
I was suggesting on the thread, a more dynamic approach where using
synchronize_rcu_normal() until it gets overloaded with requests. One approach
might be to measure the length of the rcu_state.srs_next to detect an overload
condition, similar to qhimark? Or perhaps qhimark itself can be used. And under
lightly loaded conditions, default to synchronize_rcu_normal() without checking
for the 16 CPU count.

Thoughts?

thanks,

- Joel

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations 2026-01-12 22:24 ` Joel Fernandes @ 2026-01-13 0:01 ` Paul E. McKenney 2026-01-13 2:46 ` Joel Fernandes 0 siblings, 1 reply; 54+ messages in thread From: Paul E. McKenney @ 2026-01-13 0:01 UTC (permalink / raw) To: Joel Fernandes Cc: Uladzislau Rezki, Shrikanth Hegde, Vishal Chourasia, rcu@vger.kernel.org, linux-kernel@vger.kernel.org, frederic@kernel.org, neeraj.upadhyay@kernel.org, josh@joshtriplett.org, boqun.feng@gmail.com, rostedt@goodmis.org, tglx@linutronix.de, peterz@infradead.org, srikar@linux.ibm.com On Mon, Jan 12, 2026 at 05:24:40PM -0500, Joel Fernandes wrote: > > > On 1/12/2026 11:48 AM, Paul E. McKenney wrote: > > On Mon, Jan 12, 2026 at 04:09:49PM +0000, Joel Fernandes wrote: > >> > >> > >>> On Jan 12, 2026, at 7:57 AM, Uladzislau Rezki <urezki@gmail.com> wrote: > >>> > >>> Hello, Shrikanth! > >>> > >>>> > >>>>> On 1/12/26 3:38 PM, Uladzislau Rezki wrote: > >>>>> On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote: > >>>>>> Bulk CPU hotplug operations—such as switching SMT modes across all > >>>>>> cores—require hotplugging multiple CPUs in rapid succession. On large > >>>>>> systems, this process takes significant time, increasing as the number > >>>>>> of CPUs grows, leading to substantial delays on high-core-count > >>>>>> machines. Analysis [1] reveals that the majority of this time is spent > >>>>>> waiting for synchronize_rcu(). > >>>>>> > >>>>>> Expedite synchronize_rcu() during the hotplug path to accelerate the > >>>>>> operation. Since CPU hotplug is a user-initiated administrative task, > >>>>>> it should complete as quickly as possible. 
> >>>>>> > >>>>>> Performance data on a PPC64 system with 400 CPUs: > >>>>>> > >>>>>> + ppc64_cpu --smt=1 (SMT8 to SMT1) > >>>>>> Before: real 1m14.792s > >>>>>> After: real 0m03.205s # ~23x improvement > >>>>>> > >>>>>> + ppc64_cpu --smt=8 (SMT1 to SMT8) > >>>>>> Before: real 2m27.695s > >>>>>> After: real 0m02.510s # ~58x improvement > >>>>>> > >>>>>> Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b > >>>>>> > >>>>>> [1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com > >>>>>> > >>>>> Also you can try: echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp > >>>>> to speedup regular synchronize_rcu() call. But i am not saying that it would beat > >>>>> your "expedited switch" improvement. > >>>>> > >>>> > >>>> Hi Uladzislau. > >>>> > >>>> Had a discussion on this at LPC, having in kernel solution is likely > >>>> better than having it in userspace. > >>>> > >>>> - Having it in kernel would make it work across all archs. Why should > >>>> any user wait when one initiates the hotplug. > >>>> > >>>> - userspace tools are spread across such as chcpu, ppc64_cpu etc. > >>>> though internally most do "0/1 > /sys/devices/system/cpu/cpuN/online". > >>>> We will have to repeat the same in each tool. > >>>> > >>>> - There is already /sys/kernel/rcu_expedited which is better if at all > >>>> we need to fallback to userspace. > >>>> > >>> Sounds good to me. I agree it is better to bypass parameters. > >> > >> Another way to make it in-kernel would be to make the RCU normal wake > >> from GP optimization enabled for > 16 CPUs by default.>> > >> I was considering this, but I did not bring it up because I did not > >> know that there are large systems that might benefit from it until now.> > > This would require increasing the scalability of this optimization, > > right? Or am I thinking of the wrong optimization? 
;-) > > > Yes I think you are considering the correct one, the concern you have is > regarding large number of wake ups initiated from the GP thread, correct? > > I was suggesting on the thread, a more dynamic approach where using > synchronize_rcu_normal() until it gets overloaded with requests. One approach > might be to measure the length of the rcu_state.srs_next to detect an overload > condition, similar to qhimark? Or perhaps qhimark itself can be used. And under > lightly loaded conditions, default to synchronize_rcu_normal() without checking > for the 16 CPU count. > > Thoughts? Or maintain multiple lists. Systems with 1000+ CPUs can be a bit unforgiving of pretty much any form of contention. Thanx, Paul ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations 2026-01-13 0:01 ` Paul E. McKenney @ 2026-01-13 2:46 ` Joel Fernandes 2026-01-13 4:53 ` Shrikanth Hegde 2026-01-14 3:59 ` Paul E. McKenney 0 siblings, 2 replies; 54+ messages in thread From: Joel Fernandes @ 2026-01-13 2:46 UTC (permalink / raw) To: paulmck@kernel.org Cc: Uladzislau Rezki, Shrikanth Hegde, Vishal Chourasia, rcu@vger.kernel.org, linux-kernel@vger.kernel.org, frederic@kernel.org, neeraj.upadhyay@kernel.org, josh@joshtriplett.org, boqun.feng@gmail.com, rostedt@goodmis.org, tglx@linutronix.de, peterz@infradead.org, srikar@linux.ibm.com > On Jan 12, 2026, at 7:01 PM, Paul E. McKenney <paulmck@kernel.org> wrote: > > On Mon, Jan 12, 2026 at 05:24:40PM -0500, Joel Fernandes wrote: >> >> >>> On 1/12/2026 11:48 AM, Paul E. McKenney wrote: >>> On Mon, Jan 12, 2026 at 04:09:49PM +0000, Joel Fernandes wrote: >>>> >>>> >>>>> On Jan 12, 2026, at 7:57 AM, Uladzislau Rezki <urezki@gmail.com> wrote: >>>>> >>>>> Hello, Shrikanth! >>>>> >>>>>> >>>>>>> On 1/12/26 3:38 PM, Uladzislau Rezki wrote: >>>>>>> On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote: >>>>>>>> Bulk CPU hotplug operations—such as switching SMT modes across all >>>>>>>> cores—require hotplugging multiple CPUs in rapid succession. On large >>>>>>>> systems, this process takes significant time, increasing as the number >>>>>>>> of CPUs grows, leading to substantial delays on high-core-count >>>>>>>> machines. Analysis [1] reveals that the majority of this time is spent >>>>>>>> waiting for synchronize_rcu(). >>>>>>>> >>>>>>>> Expedite synchronize_rcu() during the hotplug path to accelerate the >>>>>>>> operation. Since CPU hotplug is a user-initiated administrative task, >>>>>>>> it should complete as quickly as possible. 
>>>>>>>> >>>>>>>> Performance data on a PPC64 system with 400 CPUs: >>>>>>>> >>>>>>>> + ppc64_cpu --smt=1 (SMT8 to SMT1) >>>>>>>> Before: real 1m14.792s >>>>>>>> After: real 0m03.205s # ~23x improvement >>>>>>>> >>>>>>>> + ppc64_cpu --smt=8 (SMT1 to SMT8) >>>>>>>> Before: real 2m27.695s >>>>>>>> After: real 0m02.510s # ~58x improvement >>>>>>>> >>>>>>>> Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b >>>>>>>> >>>>>>>> [1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com >>>>>>>> >>>>>>> Also you can try: echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp >>>>>>> to speedup regular synchronize_rcu() call. But i am not saying that it would beat >>>>>>> your "expedited switch" improvement. >>>>>>> >>>>>> >>>>>> Hi Uladzislau. >>>>>> >>>>>> Had a discussion on this at LPC, having in kernel solution is likely >>>>>> better than having it in userspace. >>>>>> >>>>>> - Having it in kernel would make it work across all archs. Why should >>>>>> any user wait when one initiates the hotplug. >>>>>> >>>>>> - userspace tools are spread across such as chcpu, ppc64_cpu etc. >>>>>> though internally most do "0/1 > /sys/devices/system/cpu/cpuN/online". >>>>>> We will have to repeat the same in each tool. >>>>>> >>>>>> - There is already /sys/kernel/rcu_expedited which is better if at all >>>>>> we need to fallback to userspace. >>>>>> >>>>> Sounds good to me. I agree it is better to bypass parameters. >>>> >>>> Another way to make it in-kernel would be to make the RCU normal wake >>>> from GP optimization enabled for > 16 CPUs by default.>> >>>> I was considering this, but I did not bring it up because I did not >>>> know that there are large systems that might benefit from it until now.> >>> This would require increasing the scalability of this optimization, >>> right? Or am I thinking of the wrong optimization? 
;-) >>> >> Yes I think you are considering the correct one, the concern you have is >> regarding large number of wake ups initiated from the GP thread, correct? >> >> I was suggesting on the thread, a more dynamic approach where using >> synchronize_rcu_normal() until it gets overloaded with requests. One approach >> might be to measure the length of the rcu_state.srs_next to detect an overload >> condition, similar to qhimark? Or perhaps qhimark itself can be used. And under >> lightly loaded conditions, default to synchronize_rcu_normal() without checking >> for the 16 CPU count. >> >> Thoughts? > > Or maintain multiple lists. Systems with 1000+ CPUs can be a bit > unforgiving of pretty much any form of contention. Makes sense. We could also just have a single list but a much smaller threshold for switching synchronize_rcu_normal off. That would address the conveyor belt pattern Vishal expressed. thanks, - Joel > > Thanx, Paul ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations 2026-01-13 2:46 ` Joel Fernandes @ 2026-01-13 4:53 ` Shrikanth Hegde 2026-01-13 8:57 ` Joel Fernandes 2026-01-14 3:59 ` Paul E. McKenney 1 sibling, 1 reply; 54+ messages in thread From: Shrikanth Hegde @ 2026-01-13 4:53 UTC (permalink / raw) To: Joel Fernandes, Uladzislau Rezki, paulmck@kernel.org Cc: Vishal Chourasia, rcu@vger.kernel.org, linux-kernel@vger.kernel.org, frederic@kernel.org, neeraj.upadhyay@kernel.org, josh@joshtriplett.org, boqun.feng@gmail.com, rostedt@goodmis.org, tglx@linutronix.de, peterz@infradead.org, srikar@linux.ibm.com Hi. On 1/13/26 8:16 AM, Joel Fernandes wrote: > > >>>>> Another way to make it in-kernel would be to make the RCU normal wake >>>>> from GP optimization enabled for > 16 CPUs by default.>> >>>>> I was considering this, but I did not bring it up because I did not >>>>> know that there are large systems that might benefit from it until now.> >>>> This would require increasing the scalability of this optimization, >>>> right? Or am I thinking of the wrong optimization? ;-) >>>> >>> Yes I think you are considering the correct one, the concern you have is >>> regarding large number of wake ups initiated from the GP thread, correct? >>> >>> I was suggesting on the thread, a more dynamic approach where using >>> synchronize_rcu_normal() until it gets overloaded with requests. One approach >>> might be to measure the length of the rcu_state.srs_next to detect an overload >>> condition, similar to qhimark? Or perhaps qhimark itself can be used. And under >>> lightly loaded conditions, default to synchronize_rcu_normal() without checking >>> for the 16 CPU count. >>> >>> Thoughts? >> >> Or maintain multiple lists. Systems with 1000+ CPUs can be a bit >> unforgiving of pretty much any form of contention. > > Makes sense. We could also just have a single list but a much smaller threshold for switching synchronize_rcu_normal off. 
>
> That would address the conveyor belt pattern Vishal expressed.
>
> thanks,
>
> - Joel
>

Wouldn't that make most of the sync_rcu calls on large system
with synchronize_rcu_normal off?

Whats the cost of doing this?

(Me not knowing much about rcu internals)

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations 2026-01-13 4:53 ` Shrikanth Hegde @ 2026-01-13 8:57 ` Joel Fernandes 2026-01-14 4:00 ` Paul E. McKenney 0 siblings, 1 reply; 54+ messages in thread From: Joel Fernandes @ 2026-01-13 8:57 UTC (permalink / raw) To: Shrikanth Hegde Cc: Uladzislau Rezki, paulmck@kernel.org, Vishal Chourasia, rcu@vger.kernel.org, linux-kernel@vger.kernel.org, frederic@kernel.org, neeraj.upadhyay@kernel.org, josh@joshtriplett.org, boqun.feng@gmail.com, rostedt@goodmis.org, tglx@linutronix.de, peterz@infradead.org, srikar@linux.ibm.com > On Jan 12, 2026, at 11:55 PM, Shrikanth Hegde <sshegde@linux.ibm.com> wrote: > > Hi. > > On 1/13/26 8:16 AM, Joel Fernandes wrote: > > >>>>>> Another way to make it in-kernel would be to make the RCU normal wake >>>>>> from GP optimization enabled for > 16 CPUs by default.>> >>>>>> I was considering this, but I did not bring it up because I did not >>>>>> know that there are large systems that might benefit from it until now.> >>>>> This would require increasing the scalability of this optimization, >>>>> right? Or am I thinking of the wrong optimization? ;-) >>>>> >>>> Yes I think you are considering the correct one, the concern you have is >>>> regarding large number of wake ups initiated from the GP thread, correct? >>>> >>>> I was suggesting on the thread, a more dynamic approach where using >>>> synchronize_rcu_normal() until it gets overloaded with requests. One approach >>>> might be to measure the length of the rcu_state.srs_next to detect an overload >>>> condition, similar to qhimark? Or perhaps qhimark itself can be used. And under >>>> lightly loaded conditions, default to synchronize_rcu_normal() without checking >>>> for the 16 CPU count. >>>> >>>> Thoughts? >>> >>> Or maintain multiple lists. Systems with 1000+ CPUs can be a bit >>> unforgiving of pretty much any form of contention. >> Makes sense. 
We could also just have a single list but a much smaller threshold for switching synchronize_rcu_normal off. >> That would address the conveyor belt pattern Vishal expressed. >> thanks, >> - Joel > > Wouldn't that make most of the sync_rcu calls on large system > with synchronize_rcu_normal off? It would and that is expected. > > Whats the cost of doing this? There is no cost, that is the point right. The scalability issue Paul is referring to is the large number of wake ups. You wont have that if the number of synchronous callers is small. - Joel > > (Me not knowing much about rcu internals) ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations 2026-01-13 8:57 ` Joel Fernandes @ 2026-01-14 4:00 ` Paul E. McKenney 2026-01-14 8:54 ` Joel Fernandes 0 siblings, 1 reply; 54+ messages in thread From: Paul E. McKenney @ 2026-01-14 4:00 UTC (permalink / raw) To: Joel Fernandes Cc: Shrikanth Hegde, Uladzislau Rezki, Vishal Chourasia, rcu@vger.kernel.org, linux-kernel@vger.kernel.org, frederic@kernel.org, neeraj.upadhyay@kernel.org, josh@joshtriplett.org, boqun.feng@gmail.com, rostedt@goodmis.org, tglx@linutronix.de, peterz@infradead.org, srikar@linux.ibm.com On Tue, Jan 13, 2026 at 08:57:20AM +0000, Joel Fernandes wrote: > > > > On Jan 12, 2026, at 11:55 PM, Shrikanth Hegde <sshegde@linux.ibm.com> wrote: > > > > Hi. > > > > On 1/13/26 8:16 AM, Joel Fernandes wrote: > > > > > >>>>>> Another way to make it in-kernel would be to make the RCU normal wake > >>>>>> from GP optimization enabled for > 16 CPUs by default.>> > >>>>>> I was considering this, but I did not bring it up because I did not > >>>>>> know that there are large systems that might benefit from it until now.> > >>>>> This would require increasing the scalability of this optimization, > >>>>> right? Or am I thinking of the wrong optimization? ;-) > >>>>> > >>>> Yes I think you are considering the correct one, the concern you have is > >>>> regarding large number of wake ups initiated from the GP thread, correct? > >>>> > >>>> I was suggesting on the thread, a more dynamic approach where using > >>>> synchronize_rcu_normal() until it gets overloaded with requests. One approach > >>>> might be to measure the length of the rcu_state.srs_next to detect an overload > >>>> condition, similar to qhimark? Or perhaps qhimark itself can be used. And under > >>>> lightly loaded conditions, default to synchronize_rcu_normal() without checking > >>>> for the 16 CPU count. > >>>> > >>>> Thoughts? > >>> > >>> Or maintain multiple lists. 
Systems with 1000+ CPUs can be a bit > >>> unforgiving of pretty much any form of contention. > >> Makes sense. We could also just have a single list but a much smaller threshold for switching synchronize_rcu_normal off. > >> That would address the conveyor belt pattern Vishal expressed. > >> thanks, > >> - Joel > > > > Wouldn't that make most of the sync_rcu calls on large system > > with synchronize_rcu_normal off? > > It would and that is expected. > > > > > Whats the cost of doing this? > > There is no cost, that is the point right. The scalability issue Paul is referring to is the > large number of wake ups. You wont have that if the number of synchronous callers is small. Also the contention involved in the list management, if there is still only the one list. Thanx, Paul ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations 2026-01-14 4:00 ` Paul E. McKenney @ 2026-01-14 8:54 ` Joel Fernandes 2026-01-16 19:02 ` Paul E. McKenney 0 siblings, 1 reply; 54+ messages in thread From: Joel Fernandes @ 2026-01-14 8:54 UTC (permalink / raw) To: paulmck Cc: Shrikanth Hegde, Uladzislau Rezki, Vishal Chourasia, rcu@vger.kernel.org, linux-kernel@vger.kernel.org, frederic@kernel.org, neeraj.upadhyay@kernel.org, josh@joshtriplett.org, boqun.feng@gmail.com, rostedt@goodmis.org, tglx@linutronix.de, peterz@infradead.org, srikar@linux.ibm.com On 1/13/2026 11:00 PM, Paul E. McKenney wrote: > On Tue, Jan 13, 2026 at 08:57:20AM +0000, Joel Fernandes wrote: >> >> >>> On Jan 12, 2026, at 11:55 PM, Shrikanth Hegde <sshegde@linux.ibm.com> wrote: >>> >>> Hi. >>> >>> On 1/13/26 8:16 AM, Joel Fernandes wrote: >>> >>> >>>>>>>> Another way to make it in-kernel would be to make the RCU normal wake >>>>>>>> from GP optimization enabled for > 16 CPUs by default.>> >>>>>>>> I was considering this, but I did not bring it up because I did not >>>>>>>> know that there are large systems that might benefit from it until now.> >>>>>>> This would require increasing the scalability of this optimization, >>>>>>> right? Or am I thinking of the wrong optimization? ;-) >>>>>>> >>>>>> Yes I think you are considering the correct one, the concern you have is >>>>>> regarding large number of wake ups initiated from the GP thread, correct? >>>>>> >>>>>> I was suggesting on the thread, a more dynamic approach where using >>>>>> synchronize_rcu_normal() until it gets overloaded with requests. One approach >>>>>> might be to measure the length of the rcu_state.srs_next to detect an overload >>>>>> condition, similar to qhimark? Or perhaps qhimark itself can be used. And under >>>>>> lightly loaded conditions, default to synchronize_rcu_normal() without checking >>>>>> for the 16 CPU count. >>>>>> >>>>>> Thoughts? >>>>> >>>>> Or maintain multiple lists. 
Systems with 1000+ CPUs can be a bit >>>>> unforgiving of pretty much any form of contention. >>>> Makes sense. We could also just have a single list but a much smaller threshold for switching synchronize_rcu_normal off. >>>> That would address the conveyor belt pattern Vishal expressed. >>>> thanks, >>>> - Joel >>> >>> Wouldn't that make most of the sync_rcu calls on large system >>> with synchronize_rcu_normal off? >> >> It would and that is expected. >> >>> >>> Whats the cost of doing this? >> >> There is no cost, that is the point right. The scalability issue Paul is referring to is the >> large number of wake ups. You wont have that if the number of synchronous callers is small. > > Also the contention involved in the list management, if there is still > only the one list. > Even if the number of synchronize_rcu() in flight is a small number? like < 10. To clarify, I meant keeping the threshold that small in favor of the list contention issue you're raising. Thanks! - Joel ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations 2026-01-14 8:54 ` Joel Fernandes @ 2026-01-16 19:02 ` Paul E. McKenney 0 siblings, 0 replies; 54+ messages in thread From: Paul E. McKenney @ 2026-01-16 19:02 UTC (permalink / raw) To: Joel Fernandes Cc: Shrikanth Hegde, Uladzislau Rezki, Vishal Chourasia, rcu@vger.kernel.org, linux-kernel@vger.kernel.org, frederic@kernel.org, neeraj.upadhyay@kernel.org, josh@joshtriplett.org, boqun.feng@gmail.com, rostedt@goodmis.org, tglx@linutronix.de, peterz@infradead.org, srikar@linux.ibm.com On Wed, Jan 14, 2026 at 03:54:44AM -0500, Joel Fernandes wrote: > > > On 1/13/2026 11:00 PM, Paul E. McKenney wrote: > > On Tue, Jan 13, 2026 at 08:57:20AM +0000, Joel Fernandes wrote: > >> > >> > >>> On Jan 12, 2026, at 11:55 PM, Shrikanth Hegde <sshegde@linux.ibm.com> wrote: > >>> > >>> Hi. > >>> > >>> On 1/13/26 8:16 AM, Joel Fernandes wrote: > >>> > >>> > >>>>>>>> Another way to make it in-kernel would be to make the RCU normal wake > >>>>>>>> from GP optimization enabled for > 16 CPUs by default.>> > >>>>>>>> I was considering this, but I did not bring it up because I did not > >>>>>>>> know that there are large systems that might benefit from it until now.> > >>>>>>> This would require increasing the scalability of this optimization, > >>>>>>> right? Or am I thinking of the wrong optimization? ;-) > >>>>>>> > >>>>>> Yes I think you are considering the correct one, the concern you have is > >>>>>> regarding large number of wake ups initiated from the GP thread, correct? > >>>>>> > >>>>>> I was suggesting on the thread, a more dynamic approach where using > >>>>>> synchronize_rcu_normal() until it gets overloaded with requests. One approach > >>>>>> might be to measure the length of the rcu_state.srs_next to detect an overload > >>>>>> condition, similar to qhimark? Or perhaps qhimark itself can be used. 
And under > >>>>>> lightly loaded conditions, default to synchronize_rcu_normal() without checking > >>>>>> for the 16 CPU count. > >>>>>> > >>>>>> Thoughts? > >>>>> > >>>>> Or maintain multiple lists. Systems with 1000+ CPUs can be a bit > >>>>> unforgiving of pretty much any form of contention. > >>>> Makes sense. We could also just have a single list but a much smaller threshold for switching synchronize_rcu_normal off. > >>>> That would address the conveyor belt pattern Vishal expressed. > >>>> thanks, > >>>> - Joel > >>> > >>> Wouldn't that make most of the sync_rcu calls on large system > >>> with synchronize_rcu_normal off? > >> > >> It would and that is expected. > >> > >>> > >>> Whats the cost of doing this? > >> > >> There is no cost, that is the point right. The scalability issue Paul is referring to is the > >> large number of wake ups. You wont have that if the number of synchronous callers is small. > > > > Also the contention involved in the list management, if there is still > > only the one list. > > > Even if the number of synchronize_rcu() in flight is a small number? like < 10. > To clarify, I meant keeping the threshold that small in favor of the list > contention issue you're raising. Maybe? One remaining concern is the reads of the counter. Another remaining concern is the possibility of large numbers of CPUs reading the counter, seeing that it is small, then all piling on the list. But it looks like you might get some testing on a large system, so nothing quite like finding out. If it breaks, you guys get to fix it on whatever schedule is indicated at that time. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations 2026-01-13 2:46 ` Joel Fernandes 2026-01-13 4:53 ` Shrikanth Hegde @ 2026-01-14 3:59 ` Paul E. McKenney 1 sibling, 0 replies; 54+ messages in thread From: Paul E. McKenney @ 2026-01-14 3:59 UTC (permalink / raw) To: Joel Fernandes Cc: Uladzislau Rezki, Shrikanth Hegde, Vishal Chourasia, rcu@vger.kernel.org, linux-kernel@vger.kernel.org, frederic@kernel.org, neeraj.upadhyay@kernel.org, josh@joshtriplett.org, boqun.feng@gmail.com, rostedt@goodmis.org, tglx@linutronix.de, peterz@infradead.org, srikar@linux.ibm.com On Tue, Jan 13, 2026 at 02:46:56AM +0000, Joel Fernandes wrote: > > > > On Jan 12, 2026, at 7:01 PM, Paul E. McKenney <paulmck@kernel.org> wrote: > > > > On Mon, Jan 12, 2026 at 05:24:40PM -0500, Joel Fernandes wrote: > >> > >> > >>> On 1/12/2026 11:48 AM, Paul E. McKenney wrote: > >>> On Mon, Jan 12, 2026 at 04:09:49PM +0000, Joel Fernandes wrote: > >>>> > >>>> > >>>>> On Jan 12, 2026, at 7:57 AM, Uladzislau Rezki <urezki@gmail.com> wrote: > >>>>> > >>>>> Hello, Shrikanth! > >>>>> > >>>>>> > >>>>>>> On 1/12/26 3:38 PM, Uladzislau Rezki wrote: > >>>>>>> On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote: > >>>>>>>> Bulk CPU hotplug operations—such as switching SMT modes across all > >>>>>>>> cores—require hotplugging multiple CPUs in rapid succession. On large > >>>>>>>> systems, this process takes significant time, increasing as the number > >>>>>>>> of CPUs grows, leading to substantial delays on high-core-count > >>>>>>>> machines. Analysis [1] reveals that the majority of this time is spent > >>>>>>>> waiting for synchronize_rcu(). > >>>>>>>> > >>>>>>>> Expedite synchronize_rcu() during the hotplug path to accelerate the > >>>>>>>> operation. Since CPU hotplug is a user-initiated administrative task, > >>>>>>>> it should complete as quickly as possible. 
> >>>>>>>> > >>>>>>>> Performance data on a PPC64 system with 400 CPUs: > >>>>>>>> > >>>>>>>> + ppc64_cpu --smt=1 (SMT8 to SMT1) > >>>>>>>> Before: real 1m14.792s > >>>>>>>> After: real 0m03.205s # ~23x improvement > >>>>>>>> > >>>>>>>> + ppc64_cpu --smt=8 (SMT1 to SMT8) > >>>>>>>> Before: real 2m27.695s > >>>>>>>> After: real 0m02.510s # ~58x improvement > >>>>>>>> > >>>>>>>> Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b > >>>>>>>> > >>>>>>>> [1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com > >>>>>>>> > >>>>>>> Also you can try: echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp > >>>>>>> to speedup regular synchronize_rcu() call. But i am not saying that it would beat > >>>>>>> your "expedited switch" improvement. > >>>>>>> > >>>>>> > >>>>>> Hi Uladzislau. > >>>>>> > >>>>>> Had a discussion on this at LPC, having in kernel solution is likely > >>>>>> better than having it in userspace. > >>>>>> > >>>>>> - Having it in kernel would make it work across all archs. Why should > >>>>>> any user wait when one initiates the hotplug. > >>>>>> > >>>>>> - userspace tools are spread across such as chcpu, ppc64_cpu etc. > >>>>>> though internally most do "0/1 > /sys/devices/system/cpu/cpuN/online". > >>>>>> We will have to repeat the same in each tool. > >>>>>> > >>>>>> - There is already /sys/kernel/rcu_expedited which is better if at all > >>>>>> we need to fallback to userspace. > >>>>>> > >>>>> Sounds good to me. I agree it is better to bypass parameters. > >>>> > >>>> Another way to make it in-kernel would be to make the RCU normal wake > >>>> from GP optimization enabled for > 16 CPUs by default.>> > >>>> I was considering this, but I did not bring it up because I did not > >>>> know that there are large systems that might benefit from it until now.> > >>> This would require increasing the scalability of this optimization, > >>> right? Or am I thinking of the wrong optimization? 
;-) > >>> > >> Yes I think you are considering the correct one, the concern you have is > >> regarding large number of wake ups initiated from the GP thread, correct? > >> > >> I was suggesting on the thread, a more dynamic approach where using > >> synchronize_rcu_normal() until it gets overloaded with requests. One approach > >> might be to measure the length of the rcu_state.srs_next to detect an overload > >> condition, similar to qhimark? Or perhaps qhimark itself can be used. And under > >> lightly loaded conditions, default to synchronize_rcu_normal() without checking > >> for the 16 CPU count. > >> > >> Thoughts? > > > > Or maintain multiple lists. Systems with 1000+ CPUs can be a bit > > unforgiving of pretty much any form of contention. > > Makes sense. We could also just have a single list but a much smaller threshold for switching synchronize_rcu_normal off. > > That would address the conveyor belt pattern Vishal expressed. On a system with more than 1,000 CPUs, any single list needs to be handled extremely carefully to avoid contention of one sort or another. At that many CPUs, the default rule is instead "never have just one of anything". Unless that "just one" is protected by some contention-avoidance scheme, for example, like the rcu_node tree protects the root rcu_node structure and the rcu_state structure from contention. Thanx, Paul ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations 2026-01-12 16:09 ` Joel Fernandes 2026-01-12 16:48 ` Paul E. McKenney @ 2026-01-12 17:09 ` Uladzislau Rezki 2026-01-12 17:36 ` Joel Fernandes 1 sibling, 1 reply; 54+ messages in thread From: Uladzislau Rezki @ 2026-01-12 17:09 UTC (permalink / raw) To: Joel Fernandes Cc: Uladzislau Rezki, Shrikanth Hegde, Vishal Chourasia, rcu@vger.kernel.org, linux-kernel@vger.kernel.org, paulmck@kernel.org, frederic@kernel.org, neeraj.upadhyay@kernel.org, josh@joshtriplett.org, boqun.feng@gmail.com, rostedt@goodmis.org, tglx@linutronix.de, peterz@infradead.org, srikar@linux.ibm.com On Mon, Jan 12, 2026 at 04:09:49PM +0000, Joel Fernandes wrote: > > > > On Jan 12, 2026, at 7:57 AM, Uladzislau Rezki <urezki@gmail.com> wrote: > > > > Hello, Shrikanth! > > > >> > >>> On 1/12/26 3:38 PM, Uladzislau Rezki wrote: > >>> On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote: > >>>> Bulk CPU hotplug operations—such as switching SMT modes across all > >>>> cores—require hotplugging multiple CPUs in rapid succession. On large > >>>> systems, this process takes significant time, increasing as the number > >>>> of CPUs grows, leading to substantial delays on high-core-count > >>>> machines. Analysis [1] reveals that the majority of this time is spent > >>>> waiting for synchronize_rcu(). > >>>> > >>>> Expedite synchronize_rcu() during the hotplug path to accelerate the > >>>> operation. Since CPU hotplug is a user-initiated administrative task, > >>>> it should complete as quickly as possible. 
> >>>> > >>>> Performance data on a PPC64 system with 400 CPUs: > >>>> > >>>> + ppc64_cpu --smt=1 (SMT8 to SMT1) > >>>> Before: real 1m14.792s > >>>> After: real 0m03.205s # ~23x improvement > >>>> > >>>> + ppc64_cpu --smt=8 (SMT1 to SMT8) > >>>> Before: real 2m27.695s > >>>> After: real 0m02.510s # ~58x improvement > >>>> > >>>> Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b > >>>> > >>>> [1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com > >>>> > >>> Also you can try: echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp > >>> to speedup regular synchronize_rcu() call. But i am not saying that it would beat > >>> your "expedited switch" improvement. > >>> > >> > >> Hi Uladzislau. > >> > >> Had a discussion on this at LPC, having in kernel solution is likely > >> better than having it in userspace. > >> > >> - Having it in kernel would make it work across all archs. Why should > >> any user wait when one initiates the hotplug. > >> > >> - userspace tools are spread across such as chcpu, ppc64_cpu etc. > >> though internally most do "0/1 > /sys/devices/system/cpu/cpuN/online". > >> We will have to repeat the same in each tool. > >> > >> - There is already /sys/kernel/rcu_expedited which is better if at all > >> we need to fallback to userspace. > >> > > Sounds good to me. I agree it is better to bypass parameters. > > Another way to make it in-kernel would be to make the RCU normal wake from GP optimization enabled for > 16 CPUs by default. > > I was considering this, but I did not bring it up because I did not know that there are large systems that might benefit from it until now. > IMO, we can increase that threshold. 512/1024 is not a problem at all. But as Paul mentioned, we should consider scalability enhancement. 
On the other hand, it is probably also worth getting to the state where we really see them :) -- Uladzislau Rezki ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations 2026-01-12 17:09 ` Uladzislau Rezki @ 2026-01-12 17:36 ` Joel Fernandes 2026-01-13 12:18 ` Uladzislau Rezki 0 siblings, 1 reply; 54+ messages in thread From: Joel Fernandes @ 2026-01-12 17:36 UTC (permalink / raw) To: Uladzislau Rezki Cc: Shrikanth Hegde, Vishal Chourasia, rcu@vger.kernel.org, linux-kernel@vger.kernel.org, paulmck@kernel.org, frederic@kernel.org, neeraj.upadhyay@kernel.org, josh@joshtriplett.org, boqun.feng@gmail.com, rostedt@goodmis.org, tglx@linutronix.de, peterz@infradead.org, srikar@linux.ibm.com, Uladzislau Rezki > On Jan 12, 2026, at 12:09 PM, Uladzislau Rezki <urezki@gmail.com> wrote: > > On Mon, Jan 12, 2026 at 04:09:49PM +0000, Joel Fernandes wrote: >> >> >>>> On Jan 12, 2026, at 7:57 AM, Uladzislau Rezki <urezki@gmail.com> wrote: >>> >>> Hello, Shrikanth! >>> >>>> >>>>> On 1/12/26 3:38 PM, Uladzislau Rezki wrote: >>>>> On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote: >>>>>> Bulk CPU hotplug operations—such as switching SMT modes across all >>>>>> cores—require hotplugging multiple CPUs in rapid succession. On large >>>>>> systems, this process takes significant time, increasing as the number >>>>>> of CPUs grows, leading to substantial delays on high-core-count >>>>>> machines. Analysis [1] reveals that the majority of this time is spent >>>>>> waiting for synchronize_rcu(). >>>>>> >>>>>> Expedite synchronize_rcu() during the hotplug path to accelerate the >>>>>> operation. Since CPU hotplug is a user-initiated administrative task, >>>>>> it should complete as quickly as possible. 
>>>>>> >>>>>> Performance data on a PPC64 system with 400 CPUs: >>>>>> >>>>>> + ppc64_cpu --smt=1 (SMT8 to SMT1) >>>>>> Before: real 1m14.792s >>>>>> After: real 0m03.205s # ~23x improvement >>>>>> >>>>>> + ppc64_cpu --smt=8 (SMT1 to SMT8) >>>>>> Before: real 2m27.695s >>>>>> After: real 0m02.510s # ~58x improvement >>>>>> >>>>>> Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b >>>>>> >>>>>> [1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com >>>>>> >>>>> Also you can try: echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp >>>>> to speedup regular synchronize_rcu() call. But i am not saying that it would beat >>>>> your "expedited switch" improvement. >>>>> >>>> >>>> Hi Uladzislau. >>>> >>>> Had a discussion on this at LPC, having in kernel solution is likely >>>> better than having it in userspace. >>>> >>>> - Having it in kernel would make it work across all archs. Why should >>>> any user wait when one initiates the hotplug. >>>> >>>> - userspace tools are spread across such as chcpu, ppc64_cpu etc. >>>> though internally most do "0/1 > /sys/devices/system/cpu/cpuN/online". >>>> We will have to repeat the same in each tool. >>>> >>>> - There is already /sys/kernel/rcu_expedited which is better if at all >>>> we need to fallback to userspace. >>>> >>> Sounds good to me. I agree it is better to bypass parameters. >> >> Another way to make it in-kernel would be to make the RCU normal wake from GP optimization enabled for > 16 CPUs by default. >> >> I was considering this, but I did not bring it up because I did not know that there are large systems that might benefit from it until now. >> > IMO, we can increase that threshold. 512/1024 is not a problem at all. > But as Paul mentioned, we should consider scalability enhancement. 
From > the other hand it is also probably worth to get into the state when we > really see them :) Instead of pegging to the number of CPUs, perhaps the optimization should be dynamic? That is, default to the sr_normal wake-up optimization unless the synchronize_rcu load is high. Of course, carefully considering all corner cases, adequate testing and all that ;-) Thanks. > > -- > Uladzislau Rezki ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations 2026-01-12 17:36 ` Joel Fernandes @ 2026-01-13 12:18 ` Uladzislau Rezki 2026-01-13 12:44 ` Joel Fernandes 0 siblings, 1 reply; 54+ messages in thread From: Uladzislau Rezki @ 2026-01-13 12:18 UTC (permalink / raw) To: Joel Fernandes Cc: Uladzislau Rezki, Shrikanth Hegde, Vishal Chourasia, rcu@vger.kernel.org, linux-kernel@vger.kernel.org, paulmck@kernel.org, frederic@kernel.org, neeraj.upadhyay@kernel.org, josh@joshtriplett.org, boqun.feng@gmail.com, rostedt@goodmis.org, tglx@linutronix.de, peterz@infradead.org, srikar@linux.ibm.com On Mon, Jan 12, 2026 at 05:36:24PM +0000, Joel Fernandes wrote: > > > > On Jan 12, 2026, at 12:09 PM, Uladzislau Rezki <urezki@gmail.com> wrote: > > > > On Mon, Jan 12, 2026 at 04:09:49PM +0000, Joel Fernandes wrote: > >> > >> > >>>> On Jan 12, 2026, at 7:57 AM, Uladzislau Rezki <urezki@gmail.com> wrote: > >>> > >>> Hello, Shrikanth! > >>> > >>>> > >>>>> On 1/12/26 3:38 PM, Uladzislau Rezki wrote: > >>>>> On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote: > >>>>>> Bulk CPU hotplug operations—such as switching SMT modes across all > >>>>>> cores—require hotplugging multiple CPUs in rapid succession. On large > >>>>>> systems, this process takes significant time, increasing as the number > >>>>>> of CPUs grows, leading to substantial delays on high-core-count > >>>>>> machines. Analysis [1] reveals that the majority of this time is spent > >>>>>> waiting for synchronize_rcu(). > >>>>>> > >>>>>> Expedite synchronize_rcu() during the hotplug path to accelerate the > >>>>>> operation. Since CPU hotplug is a user-initiated administrative task, > >>>>>> it should complete as quickly as possible. 
> >>>>>> > >>>>>> Performance data on a PPC64 system with 400 CPUs: > >>>>>> > >>>>>> + ppc64_cpu --smt=1 (SMT8 to SMT1) > >>>>>> Before: real 1m14.792s > >>>>>> After: real 0m03.205s # ~23x improvement > >>>>>> > >>>>>> + ppc64_cpu --smt=8 (SMT1 to SMT8) > >>>>>> Before: real 2m27.695s > >>>>>> After: real 0m02.510s # ~58x improvement > >>>>>> > >>>>>> Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b > >>>>>> > >>>>>> [1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com > >>>>>> > >>>>> Also you can try: echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp > >>>>> to speedup regular synchronize_rcu() call. But i am not saying that it would beat > >>>>> your "expedited switch" improvement. > >>>>> > >>>> > >>>> Hi Uladzislau. > >>>> > >>>> Had a discussion on this at LPC, having in kernel solution is likely > >>>> better than having it in userspace. > >>>> > >>>> - Having it in kernel would make it work across all archs. Why should > >>>> any user wait when one initiates the hotplug. > >>>> > >>>> - userspace tools are spread across such as chcpu, ppc64_cpu etc. > >>>> though internally most do "0/1 > /sys/devices/system/cpu/cpuN/online". > >>>> We will have to repeat the same in each tool. > >>>> > >>>> - There is already /sys/kernel/rcu_expedited which is better if at all > >>>> we need to fallback to userspace. > >>>> > >>> Sounds good to me. I agree it is better to bypass parameters. > >> > >> Another way to make it in-kernel would be to make the RCU normal wake from GP optimization enabled for > 16 CPUs by default. > >> > >> I was considering this, but I did not bring it up because I did not know that there are large systems that might benefit from it until now. > >> > > IMO, we can increase that threshold. 512/1024 is not a problem at all. > > But as Paul mentioned, we should consider scalability enhancement. 
From > > the other hand it is also probably worth to get into the state when we > > really see them :) > > Instead of pegging to number of CPUs, perhaps the optimization should be dynamic? That is, default to it unless synchronize_rcu load is high, default to the sr_normal wake-up optimization. Of course carefully considering all corner cases, adequate testing and all that ;-) > Honestly I do not see use cases where we are not up to speed to process all callbacks in time, keeping in mind that it is a blocking-context call. How many of them should be in flight (blocked contexts) to make it starve... :) According to my last evaluation, it was ~64K. Note I do not say that it should not be scaled. -- Uladzislau Rezki ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations 2026-01-13 12:18 ` Uladzislau Rezki @ 2026-01-13 12:44 ` Joel Fernandes 2026-01-13 14:17 ` Uladzislau Rezki 0 siblings, 1 reply; 54+ messages in thread From: Joel Fernandes @ 2026-01-13 12:44 UTC (permalink / raw) To: Uladzislau Rezki Cc: Shrikanth Hegde, Vishal Chourasia, rcu@vger.kernel.org, linux-kernel@vger.kernel.org, paulmck@kernel.org, frederic@kernel.org, neeraj.upadhyay@kernel.org, josh@joshtriplett.org, boqun.feng@gmail.com, rostedt@goodmis.org, tglx@linutronix.de, peterz@infradead.org, srikar@linux.ibm.com, Uladzislau Rezki > On Jan 13, 2026, at 7:19 AM, Uladzislau Rezki <urezki@gmail.com> wrote: > > On Mon, Jan 12, 2026 at 05:36:24PM +0000, Joel Fernandes wrote: >> >> >>>> On Jan 12, 2026, at 12:09 PM, Uladzislau Rezki <urezki@gmail.com> wrote: >>> >>> On Mon, Jan 12, 2026 at 04:09:49PM +0000, Joel Fernandes wrote: >>>> >>>> >>>>>> On Jan 12, 2026, at 7:57 AM, Uladzislau Rezki <urezki@gmail.com> wrote: >>>>> >>>>> Hello, Shrikanth! >>>>> >>>>>> >>>>>>> On 1/12/26 3:38 PM, Uladzislau Rezki wrote: >>>>>>> On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote: >>>>>>>> Bulk CPU hotplug operations—such as switching SMT modes across all >>>>>>>> cores—require hotplugging multiple CPUs in rapid succession. On large >>>>>>>> systems, this process takes significant time, increasing as the number >>>>>>>> of CPUs grows, leading to substantial delays on high-core-count >>>>>>>> machines. Analysis [1] reveals that the majority of this time is spent >>>>>>>> waiting for synchronize_rcu(). >>>>>>>> >>>>>>>> Expedite synchronize_rcu() during the hotplug path to accelerate the >>>>>>>> operation. Since CPU hotplug is a user-initiated administrative task, >>>>>>>> it should complete as quickly as possible. 
>>>>>>>> >>>>>>>> Performance data on a PPC64 system with 400 CPUs: >>>>>>>> >>>>>>>> + ppc64_cpu --smt=1 (SMT8 to SMT1) >>>>>>>> Before: real 1m14.792s >>>>>>>> After: real 0m03.205s # ~23x improvement >>>>>>>> >>>>>>>> + ppc64_cpu --smt=8 (SMT1 to SMT8) >>>>>>>> Before: real 2m27.695s >>>>>>>> After: real 0m02.510s # ~58x improvement >>>>>>>> >>>>>>>> Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b >>>>>>>> >>>>>>>> [1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com >>>>>>>> >>>>>>> Also you can try: echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp >>>>>>> to speedup regular synchronize_rcu() call. But i am not saying that it would beat >>>>>>> your "expedited switch" improvement. >>>>>>> >>>>>> >>>>>> Hi Uladzislau. >>>>>> >>>>>> Had a discussion on this at LPC, having in kernel solution is likely >>>>>> better than having it in userspace. >>>>>> >>>>>> - Having it in kernel would make it work across all archs. Why should >>>>>> any user wait when one initiates the hotplug. >>>>>> >>>>>> - userspace tools are spread across such as chcpu, ppc64_cpu etc. >>>>>> though internally most do "0/1 > /sys/devices/system/cpu/cpuN/online". >>>>>> We will have to repeat the same in each tool. >>>>>> >>>>>> - There is already /sys/kernel/rcu_expedited which is better if at all >>>>>> we need to fallback to userspace. >>>>>> >>>>> Sounds good to me. I agree it is better to bypass parameters. >>>> >>>> Another way to make it in-kernel would be to make the RCU normal wake from GP optimization enabled for > 16 CPUs by default. >>>> >>>> I was considering this, but I did not bring it up because I did not know that there are large systems that might benefit from it until now. >>>> >>> IMO, we can increase that threshold. 512/1024 is not a problem at all. >>> But as Paul mentioned, we should consider scalability enhancement. 
From >>> the other hand it is also probably worth to get into the state when we >>> really see them :) >> >> Instead of pegging to number of CPUs, perhaps the optimization should be dynamic? That is, default to it unless synchronize_rcu load is high, default to the sr_normal wake-up optimization. Of course carefully considering all corner cases, adequate testing and all that ;-) >> > Honestly i do not see use cases when we are not up to speed to process > all callbacks in time keeping in mind that it is blocking context call. > > How many of them should be in flight(blocked contexts) to make it starve... :) > According to my last evaluation it was ~64K. > > Note i do not say that it should not be scaled. But you did not test that on a large system with 1000s of CPUs, right? So the options I see are: either default to always using the optimization, not just for fewer than 17 CPUs (what you are saying above), or do what I said above (safer and less risky for systems with 1000s of CPUs). Let me know if I missed something. Thanks. > > -- > Uladzislau Rezki ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations 2026-01-13 12:44 ` Joel Fernandes @ 2026-01-13 14:17 ` Uladzislau Rezki 2026-01-13 14:32 ` Joel Fernandes 0 siblings, 1 reply; 54+ messages in thread From: Uladzislau Rezki @ 2026-01-13 14:17 UTC (permalink / raw) To: Joel Fernandes Cc: Uladzislau Rezki, Shrikanth Hegde, Vishal Chourasia, rcu@vger.kernel.org, linux-kernel@vger.kernel.org, paulmck@kernel.org, frederic@kernel.org, neeraj.upadhyay@kernel.org, josh@joshtriplett.org, boqun.feng@gmail.com, rostedt@goodmis.org, tglx@linutronix.de, peterz@infradead.org, srikar@linux.ibm.com On Tue, Jan 13, 2026 at 12:44:10PM +0000, Joel Fernandes wrote: > > > > On Jan 13, 2026, at 7:19 AM, Uladzislau Rezki <urezki@gmail.com> wrote: > > > > On Mon, Jan 12, 2026 at 05:36:24PM +0000, Joel Fernandes wrote: > >> > >> > >>>> On Jan 12, 2026, at 12:09 PM, Uladzislau Rezki <urezki@gmail.com> wrote: > >>> > >>> On Mon, Jan 12, 2026 at 04:09:49PM +0000, Joel Fernandes wrote: > >>>> > >>>> > >>>>>> On Jan 12, 2026, at 7:57 AM, Uladzislau Rezki <urezki@gmail.com> wrote: > >>>>> > >>>>> Hello, Shrikanth! > >>>>> > >>>>>> > >>>>>>> On 1/12/26 3:38 PM, Uladzislau Rezki wrote: > >>>>>>> On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote: > >>>>>>>> Bulk CPU hotplug operations—such as switching SMT modes across all > >>>>>>>> cores—require hotplugging multiple CPUs in rapid succession. On large > >>>>>>>> systems, this process takes significant time, increasing as the number > >>>>>>>> of CPUs grows, leading to substantial delays on high-core-count > >>>>>>>> machines. Analysis [1] reveals that the majority of this time is spent > >>>>>>>> waiting for synchronize_rcu(). > >>>>>>>> > >>>>>>>> Expedite synchronize_rcu() during the hotplug path to accelerate the > >>>>>>>> operation. Since CPU hotplug is a user-initiated administrative task, > >>>>>>>> it should complete as quickly as possible. 
> >>>>>>>> > >>>>>>>> Performance data on a PPC64 system with 400 CPUs: > >>>>>>>> > >>>>>>>> + ppc64_cpu --smt=1 (SMT8 to SMT1) > >>>>>>>> Before: real 1m14.792s > >>>>>>>> After: real 0m03.205s # ~23x improvement > >>>>>>>> > >>>>>>>> + ppc64_cpu --smt=8 (SMT1 to SMT8) > >>>>>>>> Before: real 2m27.695s > >>>>>>>> After: real 0m02.510s # ~58x improvement > >>>>>>>> > >>>>>>>> Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b > >>>>>>>> > >>>>>>>> [1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com > >>>>>>>> > >>>>>>> Also you can try: echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp > >>>>>>> to speedup regular synchronize_rcu() call. But i am not saying that it would beat > >>>>>>> your "expedited switch" improvement. > >>>>>>> > >>>>>> > >>>>>> Hi Uladzislau. > >>>>>> > >>>>>> Had a discussion on this at LPC, having in kernel solution is likely > >>>>>> better than having it in userspace. > >>>>>> > >>>>>> - Having it in kernel would make it work across all archs. Why should > >>>>>> any user wait when one initiates the hotplug. > >>>>>> > >>>>>> - userspace tools are spread across such as chcpu, ppc64_cpu etc. > >>>>>> though internally most do "0/1 > /sys/devices/system/cpu/cpuN/online". > >>>>>> We will have to repeat the same in each tool. > >>>>>> > >>>>>> - There is already /sys/kernel/rcu_expedited which is better if at all > >>>>>> we need to fallback to userspace. > >>>>>> > >>>>> Sounds good to me. I agree it is better to bypass parameters. > >>>> > >>>> Another way to make it in-kernel would be to make the RCU normal wake from GP optimization enabled for > 16 CPUs by default. > >>>> > >>>> I was considering this, but I did not bring it up because I did not know that there are large systems that might benefit from it until now. > >>>> > >>> IMO, we can increase that threshold. 512/1024 is not a problem at all. 
> >>> But as Paul mentioned, we should consider scalability enhancement. From > >>> the other hand it is also probably worth to get into the state when we > >>> really see them :) > >> > >> Instead of pegging to number of CPUs, perhaps the optimization should be dynamic? That is, default to it unless synchronize_rcu load is high, default to the sr_normal wake-up optimization. Of course carefully considering all corner cases, adequate testing and all that ;-) > >> > > Honestly i do not see use cases when we are not up to speed to process > > all callbacks in time keeping in mind that it is blocking context call. > > > > How many of them should be in flight(blocked contexts) to make it starve... :) > > According to my last evaluation it was ~64K. > > > > Note i do not say that it should not be scaled. > > But you did not test that on large system with 1000s of CPUs right? > No, no. I do not have access to such systems. > > So the options I see are: either default to always using the optimization, > not just for less than 17 CPUs (what you are saying above). Or, do what I said > above (safer for system with 1000s of CPUs and less risky). > You mean introduce a threshold and count how many nodes are in the queue? To me it sounds suboptimal and looks like a temporary solution. Long term, it is better to split it, I mean to scale it. Do you know who can test it on a ~1000-CPU system, so we have some figures? What I have is a 256-CPU system I can test on. -- Uladzislau Rezki ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations 2026-01-13 14:17 ` Uladzislau Rezki @ 2026-01-13 14:32 ` Joel Fernandes 2026-01-13 14:53 ` Shrikanth Hegde 2026-01-13 17:58 ` Uladzislau Rezki 0 siblings, 2 replies; 54+ messages in thread From: Joel Fernandes @ 2026-01-13 14:32 UTC (permalink / raw) To: Uladzislau Rezki Cc: Shrikanth Hegde, Vishal Chourasia, rcu@vger.kernel.org, linux-kernel@vger.kernel.org, paulmck@kernel.org, frederic@kernel.org, neeraj.upadhyay@kernel.org, josh@joshtriplett.org, boqun.feng@gmail.com, rostedt@goodmis.org, tglx@linutronix.de, peterz@infradead.org, srikar@linux.ibm.com On 1/13/2026 9:17 AM, Uladzislau Rezki wrote: > On Tue, Jan 13, 2026 at 12:44:10PM +0000, Joel Fernandes wrote: >> >> >>> On Jan 13, 2026, at 7:19 AM, Uladzislau Rezki <urezki@gmail.com> wrote: >>> >>> On Mon, Jan 12, 2026 at 05:36:24PM +0000, Joel Fernandes wrote: >>>> >>>> >>>>>> On Jan 12, 2026, at 12:09 PM, Uladzislau Rezki <urezki@gmail.com> wrote: >>>>> >>>>> On Mon, Jan 12, 2026 at 04:09:49PM +0000, Joel Fernandes wrote: >>>>>> >>>>>> >>>>>>>> On Jan 12, 2026, at 7:57 AM, Uladzislau Rezki <urezki@gmail.com> wrote: >>>>>>> >>>>>>> Hello, Shrikanth! >>>>>>> >>>>>>>> >>>>>>>>> On 1/12/26 3:38 PM, Uladzislau Rezki wrote: >>>>>>>>> On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote: >>>>>>>>>> Bulk CPU hotplug operations—such as switching SMT modes across all >>>>>>>>>> cores—require hotplugging multiple CPUs in rapid succession. On large >>>>>>>>>> systems, this process takes significant time, increasing as the number >>>>>>>>>> of CPUs grows, leading to substantial delays on high-core-count >>>>>>>>>> machines. Analysis [1] reveals that the majority of this time is spent >>>>>>>>>> waiting for synchronize_rcu(). >>>>>>>>>> >>>>>>>>>> Expedite synchronize_rcu() during the hotplug path to accelerate the >>>>>>>>>> operation. 
Since CPU hotplug is a user-initiated administrative task, >>>>>>>>>> it should complete as quickly as possible. >>>>>>>>>> >>>>>>>>>> Performance data on a PPC64 system with 400 CPUs: >>>>>>>>>> >>>>>>>>>> + ppc64_cpu --smt=1 (SMT8 to SMT1) >>>>>>>>>> Before: real 1m14.792s >>>>>>>>>> After: real 0m03.205s # ~23x improvement >>>>>>>>>> >>>>>>>>>> + ppc64_cpu --smt=8 (SMT1 to SMT8) >>>>>>>>>> Before: real 2m27.695s >>>>>>>>>> After: real 0m02.510s # ~58x improvement >>>>>>>>>> >>>>>>>>>> Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b >>>>>>>>>> >>>>>>>>>> [1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com >>>>>>>>>> >>>>>>>>> Also you can try: echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp >>>>>>>>> to speedup regular synchronize_rcu() call. But i am not saying that it would beat >>>>>>>>> your "expedited switch" improvement. >>>>>>>>> >>>>>>>> >>>>>>>> Hi Uladzislau. >>>>>>>> >>>>>>>> Had a discussion on this at LPC, having in kernel solution is likely >>>>>>>> better than having it in userspace. >>>>>>>> >>>>>>>> - Having it in kernel would make it work across all archs. Why should >>>>>>>> any user wait when one initiates the hotplug. >>>>>>>> >>>>>>>> - userspace tools are spread across such as chcpu, ppc64_cpu etc. >>>>>>>> though internally most do "0/1 > /sys/devices/system/cpu/cpuN/online". >>>>>>>> We will have to repeat the same in each tool. >>>>>>>> >>>>>>>> - There is already /sys/kernel/rcu_expedited which is better if at all >>>>>>>> we need to fallback to userspace. >>>>>>>> >>>>>>> Sounds good to me. I agree it is better to bypass parameters. >>>>>> >>>>>> Another way to make it in-kernel would be to make the RCU normal wake from GP optimization enabled for > 16 CPUs by default. >>>>>> >>>>>> I was considering this, but I did not bring it up because I did not know that there are large systems that might benefit from it until now. 
>>>>>> >>>>> IMO, we can increase that threshold. 512/1024 is not a problem at all. >>>>> But as Paul mentioned, we should consider scalability enhancement. From >>>>> the other hand it is also probably worth to get into the state when we >>>>> really see them :) >>>> >>>> Instead of pegging to number of CPUs, perhaps the optimization should be dynamic? That is, default to it unless synchronize_rcu load is high, default to the sr_normal wake-up optimization. Of course carefully considering all corner cases, adequate testing and all that ;-) >>>> >>> Honestly i do not see use cases when we are not up to speed to process >>> all callbacks in time keeping in mind that it is blocking context call. >>> >>> How many of them should be in flight(blocked contexts) to make it starve... :) >>> According to my last evaluation it was ~64K. >>> >>> Note i do not say that it should not be scaled. >> >> But you did not test that on large system with 1000s of CPUs right? >> > No, no. I do not have access to such systems. > >> >> So the options I see are: either default to always using the optimization, >> not just for less than 17 CPUs (what you are saying above). Or, do what I said >> above (safer for system with 1000s of CPUs and less risky). >> > You mean introduce threshold and count how many nodes are in queue? Yes. > To me it sounds not optimal and looks like a temporary solution. Not more sub-optimal than the existing 16 CPU hard-coded solution I suppose. > > Long term wise, it is better to split it, i mean to scale. But the scalable solution is already there: the !synchronize_rcu_normal path, right? And splitting the list won't help this use case anyway. > > Do you know who can test it on ~1000 CPUs system? So we have some figures. I don't have such systems either. The most I can go is ~200+ CPUs. Perhaps the folks on this thread have such systems as they mentioned 1900+ CPU systems. They should be happy to test. > > What i have is 256 CPUs system i can test on. Same boat. 
;-) thanks, - Joel ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations 2026-01-13 14:32 ` Joel Fernandes @ 2026-01-13 14:53 ` Shrikanth Hegde 2026-01-13 18:17 ` Uladzislau Rezki 2026-01-13 17:58 ` Uladzislau Rezki 1 sibling, 1 reply; 54+ messages in thread From: Shrikanth Hegde @ 2026-01-13 14:53 UTC (permalink / raw) To: Joel Fernandes, Uladzislau Rezki Cc: Vishal Chourasia, rcu@vger.kernel.org, linux-kernel@vger.kernel.org, paulmck@kernel.org, frederic@kernel.org, neeraj.upadhyay@kernel.org, josh@joshtriplett.org, boqun.feng@gmail.com, rostedt@goodmis.org, tglx@linutronix.de, peterz@infradead.org, srikar@linux.ibm.com Hi. On 1/13/26 8:02 PM, Joel Fernandes wrote: > > >>>>>>> Another way to make it in-kernel would be to make the RCU normal wake from GP optimization enabled for > 16 CPUs by default. >>>>>>> >>>>>>> I was considering this, but I did not bring it up because I did not know that there are large systems that might benefit from it until now. >>>>>>> >>>>>> IMO, we can increase that threshold. 512/1024 is not a problem at all. >>>>>> But as Paul mentioned, we should consider scalability enhancement. From >>>>>> the other hand it is also probably worth to get into the state when we >>>>>> really see them :) >>>>> >>>>> Instead of pegging to number of CPUs, perhaps the optimization should be dynamic? That is, default to it unless synchronize_rcu load is high, default to the sr_normal wake-up optimization. Of course carefully considering all corner cases, adequate testing and all that ;-) >>>>> >>>> Honestly i do not see use cases when we are not up to speed to process >>>> all callbacks in time keeping in mind that it is blocking context call. >>>> >>>> How many of them should be in flight(blocked contexts) to make it starve... :) >>>> According to my last evaluation it was ~64K. >>>> >>>> Note i do not say that it should not be scaled. >>> >>> But you did not test that on large system with 1000s of CPUs right? >>> >> No, no. 
I do not have access to such systems. >> >>> >>> So the options I see are: either default to always using the optimization, >>> not just for less than 17 CPUs (what you are saying above). Or, do what I said >>> above (safer for system with 1000s of CPUs and less risky). >>> >> You mean introduce threshold and count how many nodes are in queue? > > Yes. > >> To me it sounds not optimal and looks like a temporary solution. > > Not more sub-optimal than the existing 16 CPU hard-coded solution I suppose. > >> >> Long term wise, it is better to split it, i mean to scale. > > But the scalable solution is already there: the !synchronize_rcu_normal path, > right? And splitting the list won't help this use case anyway. > >> >> Do you know who can test it on ~1000 CPUs system? So we have some figures. > > I don't have such systems either. The most I can go is ~200+ CPUs. Perhaps the > folks on this thread have such systems as they mentioned 1900+ CPU systems. They > should be happy to test. > Do you have a patch to try out? We can test it on these systems. Note: it might take a while to test it, as those systems are a bit tricky to get. >> >> What i have is 256 CPUs system i can test on. > Same boat. ;-) > > thanks, > > - Joel > ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations 2026-01-13 14:53 ` Shrikanth Hegde @ 2026-01-13 18:17 ` Uladzislau Rezki 0 siblings, 0 replies; 54+ messages in thread From: Uladzislau Rezki @ 2026-01-13 18:17 UTC (permalink / raw) To: Shrikanth Hegde Cc: Joel Fernandes, Uladzislau Rezki, Vishal Chourasia, rcu@vger.kernel.org, linux-kernel@vger.kernel.org, paulmck@kernel.org, frederic@kernel.org, neeraj.upadhyay@kernel.org, josh@joshtriplett.org, boqun.feng@gmail.com, rostedt@goodmis.org, tglx@linutronix.de, peterz@infradead.org, srikar@linux.ibm.com On Tue, Jan 13, 2026 at 08:23:29PM +0530, Shrikanth Hegde wrote: > Hi. > > On 1/13/26 8:02 PM, Joel Fernandes wrote: > > > > > > > > > > > > > Another way to make it in-kernel would be to make the RCU normal wake from GP optimization enabled for > 16 CPUs by default. > > > > > > > > > > > > > > > > I was considering this, but I did not bring it up because I did not know that there are large systems that might benefit from it until now. > > > > > > > > > > > > > > > IMO, we can increase that threshold. 512/1024 is not a problem at all. > > > > > > > But as Paul mentioned, we should consider scalability enhancement. From > > > > > > > the other hand it is also probably worth to get into the state when we > > > > > > > really see them :) > > > > > > > > > > > > Instead of pegging to number of CPUs, perhaps the optimization should be dynamic? That is, default to it unless synchronize_rcu load is high, default to the sr_normal wake-up optimization. Of course carefully considering all corner cases, adequate testing and all that ;-) > > > > > > > > > > > Honestly i do not see use cases when we are not up to speed to process > > > > > all callbacks in time keeping in mind that it is blocking context call. > > > > > > > > > > How many of them should be in flight(blocked contexts) to make it starve... :) > > > > > According to my last evaluation it was ~64K. 
> > > > > > > > > > Note i do not say that it should not be scaled. > > > > > > > > But you did not test that on large system with 1000s of CPUs right? > > > > > > > No, no. I do not have access to such systems. > > > > > > > > > > > So the options I see are: either default to always using the optimization, > > > > not just for less than 17 CPUs (what you are saying above). Or, do what I said > > > > above (safer for system with 1000s of CPUs and less risky). > > > > > > > You mean introduce threshold and count how many nodes are in queue? > > > > Yes. > > > > > To me it sounds not optimal and looks like a temporary solution. > > > > Not more sub-optimal than the existing 16 CPU hard-coded solution I suppose. > > > > > > > > Long term wise, it is better to split it, i mean to scale. > > > > But the scalable solution is already there: the !synchronize_rcu_normal path, > > right? And splitting the list won't help this use case anyway. > > > > > > > > Do you know who can test it on ~1000 CPUs system? So we have some figures. > > > > I don't have such systems either. The most I can go is ~200+ CPUs. Perhaps the > > folks on this thread have such systems as they mentioned 1900+ CPU systems. They > > should be happy to test. > > > > Do you have a patch to try out? We can test it on these systems. > > > Note: Might take a while to test it, as those systems are bit tricky to > get. > Let me prepare something. I will come back. -- Uladzislau Rezki ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations 2026-01-13 14:32 ` Joel Fernandes 2026-01-13 14:53 ` Shrikanth Hegde @ 2026-01-13 17:58 ` Uladzislau Rezki 1 sibling, 0 replies; 54+ messages in thread From: Uladzislau Rezki @ 2026-01-13 17:58 UTC (permalink / raw) To: Joel Fernandes Cc: Uladzislau Rezki, Shrikanth Hegde, Vishal Chourasia, rcu@vger.kernel.org, linux-kernel@vger.kernel.org, paulmck@kernel.org, frederic@kernel.org, neeraj.upadhyay@kernel.org, josh@joshtriplett.org, boqun.feng@gmail.com, rostedt@goodmis.org, tglx@linutronix.de, peterz@infradead.org, srikar@linux.ibm.com On Tue, Jan 13, 2026 at 09:32:13AM -0500, Joel Fernandes wrote: > > > On 1/13/2026 9:17 AM, Uladzislau Rezki wrote: > > On Tue, Jan 13, 2026 at 12:44:10PM +0000, Joel Fernandes wrote: > >> > >> > >>> On Jan 13, 2026, at 7:19 AM, Uladzislau Rezki <urezki@gmail.com> wrote: > >>> > >>> On Mon, Jan 12, 2026 at 05:36:24PM +0000, Joel Fernandes wrote: > >>>> > >>>> > >>>>>> On Jan 12, 2026, at 12:09 PM, Uladzislau Rezki <urezki@gmail.com> wrote: > >>>>> > >>>>> On Mon, Jan 12, 2026 at 04:09:49PM +0000, Joel Fernandes wrote: > >>>>>> > >>>>>> > >>>>>>>> On Jan 12, 2026, at 7:57 AM, Uladzislau Rezki <urezki@gmail.com> wrote: > >>>>>>> > >>>>>>> Hello, Shrikanth! > >>>>>>> > >>>>>>>> > >>>>>>>>> On 1/12/26 3:38 PM, Uladzislau Rezki wrote: > >>>>>>>>> On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote: > >>>>>>>>>> Bulk CPU hotplug operations—such as switching SMT modes across all > >>>>>>>>>> cores—require hotplugging multiple CPUs in rapid succession. On large > >>>>>>>>>> systems, this process takes significant time, increasing as the number > >>>>>>>>>> of CPUs grows, leading to substantial delays on high-core-count > >>>>>>>>>> machines. Analysis [1] reveals that the majority of this time is spent > >>>>>>>>>> waiting for synchronize_rcu(). 
> >>>>>>>>>> > >>>>>>>>>> Expedite synchronize_rcu() during the hotplug path to accelerate the > >>>>>>>>>> operation. Since CPU hotplug is a user-initiated administrative task, > >>>>>>>>>> it should complete as quickly as possible. > >>>>>>>>>> > >>>>>>>>>> Performance data on a PPC64 system with 400 CPUs: > >>>>>>>>>> > >>>>>>>>>> + ppc64_cpu --smt=1 (SMT8 to SMT1) > >>>>>>>>>> Before: real 1m14.792s > >>>>>>>>>> After: real 0m03.205s # ~23x improvement > >>>>>>>>>> > >>>>>>>>>> + ppc64_cpu --smt=8 (SMT1 to SMT8) > >>>>>>>>>> Before: real 2m27.695s > >>>>>>>>>> After: real 0m02.510s # ~58x improvement > >>>>>>>>>> > >>>>>>>>>> Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b > >>>>>>>>>> > >>>>>>>>>> [1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com > >>>>>>>>>> > >>>>>>>>> Also you can try: echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp > >>>>>>>>> to speedup regular synchronize_rcu() call. But i am not saying that it would beat > >>>>>>>>> your "expedited switch" improvement. > >>>>>>>>> > >>>>>>>> > >>>>>>>> Hi Uladzislau. > >>>>>>>> > >>>>>>>> Had a discussion on this at LPC, having in kernel solution is likely > >>>>>>>> better than having it in userspace. > >>>>>>>> > >>>>>>>> - Having it in kernel would make it work across all archs. Why should > >>>>>>>> any user wait when one initiates the hotplug. > >>>>>>>> > >>>>>>>> - userspace tools are spread across such as chcpu, ppc64_cpu etc. > >>>>>>>> though internally most do "0/1 > /sys/devices/system/cpu/cpuN/online". > >>>>>>>> We will have to repeat the same in each tool. > >>>>>>>> > >>>>>>>> - There is already /sys/kernel/rcu_expedited which is better if at all > >>>>>>>> we need to fallback to userspace. > >>>>>>>> > >>>>>>> Sounds good to me. I agree it is better to bypass parameters. 
> >>>>>> > >>>>>> Another way to make it in-kernel would be to make the RCU normal wake from GP optimization enabled for > 16 CPUs by default. > >>>>>> > >>>>>> I was considering this, but I did not bring it up because I did not know that there are large systems that might benefit from it until now. > >>>>>> > >>>>> IMO, we can increase that threshold. 512/1024 is not a problem at all. > >>>>> But as Paul mentioned, we should consider scalability enhancement. From > >>>>> the other hand it is also probably worth to get into the state when we > >>>>> really see them :) > >>>> > >>>> Instead of pegging to number of CPUs, perhaps the optimization should be dynamic? That is, default to it unless synchronize_rcu load is high, default to the sr_normal wake-up optimization. Of course carefully considering all corner cases, adequate testing and all that ;-) > >>>> > >>> Honestly i do not see use cases when we are not up to speed to process > >>> all callbacks in time keeping in mind that it is blocking context call. > >>> > >>> How many of them should be in flight(blocked contexts) to make it starve... :) > >>> According to my last evaluation it was ~64K. > >>> > >>> Note i do not say that it should not be scaled. > >> > >> But you did not test that on large system with 1000s of CPUs right? > >> > > No, no. I do not have access to such systems. > > > >> > >> So the options I see are: either default to always using the optimization, > >> not just for less than 17 CPUs (what you are saying above). Or, do what I said > >> above (safer for system with 1000s of CPUs and less risky). > >> > > You mean introduce threshold and count how many nodes are in queue? > > Yes. > > > To me it sounds not optimal and looks like a temporary solution. > > Not more sub-optimal than the existing 16 CPU hard-coded solution I suppose. > It was trial testing :) Agree we should do something with it. > > > > Long term wise, it is better to split it, i mean to scale. 
> > But the scalable solution is already there: the !synchronize_rcu_normal path, > right? And splitting the list won't help this use case anyway. > Fair point. > > > > Do you know who can test it on ~1000 CPUs system? So we have some figures. > > I don't have such systems either. The most I can go is ~200+ CPUs. Perhaps the > folks on this thread have such systems as they mentioned 1900+ CPU systems. They > should be happy to test. > > > > > What i have is 256 CPUs system i can test on. > Same boat. ;-) > :) -- Uladzislau Rezki ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations 2026-01-12 9:43 [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations Vishal Chourasia 2026-01-12 10:08 ` Uladzislau Rezki @ 2026-01-12 12:21 ` Shrikanth Hegde 2026-01-12 12:46 ` Vishal Chourasia 2026-01-12 14:03 ` Joel Fernandes ` (2 subsequent siblings) 4 siblings, 1 reply; 54+ messages in thread From: Shrikanth Hegde @ 2026-01-12 12:21 UTC (permalink / raw) To: Vishal Chourasia, rcu, linux-kernel Cc: paulmck, frederic, neeraj.upadhyay, joelagnelf, josh, boqun.feng, urezki, rostedt, tglx, peterz, srikar On 1/12/26 3:13 PM, Vishal Chourasia wrote: > Bulk CPU hotplug operations—such as switching SMT modes across all > cores—require hotplugging multiple CPUs in rapid succession. On large > systems, this process takes significant time, increasing as the number > of CPUs grows, leading to substantial delays on high-core-count > machines. Analysis [1] reveals that the majority of this time is spent > waiting for synchronize_rcu(). > > Expedite synchronize_rcu() during the hotplug path to accelerate the > operation. Since CPU hotplug is a user-initiated administrative task, > it should complete as quickly as possible. > > Performance data on a PPC64 system with 400 CPUs: > > + ppc64_cpu --smt=1 (SMT8 to SMT1) > Before: real 1m14.792s > After: real 0m03.205s # ~23x improvement > > + ppc64_cpu --smt=8 (SMT1 to SMT8) > Before: real 2m27.695s > After: real 0m02.510s # ~58x improvement > > Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b > Hi Vishal, I tried on tip/master at 315f416d3e26. It fails to apply. is rcu tree updated? 
> [1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com > > Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com> > --- > include/linux/rcupdate.h | 3 +++ > kernel/cpu.c | 2 ++ > 2 files changed, 5 insertions(+) > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > index c5b30054cd01..03c06cfb2b6d 100644 > --- a/include/linux/rcupdate.h > +++ b/include/linux/rcupdate.h > @@ -1192,6 +1192,9 @@ rcu_head_after_call_rcu(struct rcu_head *rhp, rcu_callback_t f) > extern int rcu_expedited; > extern int rcu_normal; > > +extern void rcu_expedite_gp(void); > +extern void rcu_unexpedite_gp(void); > + Why extern is needed? All it needs is declarations no? > DEFINE_LOCK_GUARD_0(rcu, > do { > rcu_read_lock(); > diff --git a/kernel/cpu.c b/kernel/cpu.c > index 8df2d773fe3b..6b0d491d73f4 100644 > --- a/kernel/cpu.c > +++ b/kernel/cpu.c > @@ -506,12 +506,14 @@ EXPORT_SYMBOL_GPL(cpus_read_unlock); > > void cpus_write_lock(void) > { > + rcu_expedite_gp(); > percpu_down_write(&cpu_hotplug_lock); > } > > void cpus_write_unlock(void) > { > percpu_up_write(&cpu_hotplug_lock); > + rcu_unexpedite_gp(); > } > > void lockdep_assert_cpus_held(void) Have you tested kexec path or suspend/resume path? Seems like the counter can nest, but would be good to verify. ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations
  2026-01-12 12:21 ` Shrikanth Hegde
@ 2026-01-12 12:46   ` Vishal Chourasia
  0 siblings, 0 replies; 54+ messages in thread
From: Vishal Chourasia @ 2026-01-12 12:46 UTC (permalink / raw)
  To: Shrikanth Hegde, rcu, linux-kernel
  Cc: paulmck, frederic, neeraj.upadhyay, joelagnelf, josh, boqun.feng,
	urezki, rostedt, tglx, peterz, srikar

On 12/01/26 17:51, Shrikanth Hegde wrote:
>
>
> On 1/12/26 3:13 PM, Vishal Chourasia wrote:
>> Bulk CPU hotplug operations—such as switching SMT modes across all
>> cores—require hotplugging multiple CPUs in rapid succession. On large
>> systems, this process takes significant time, increasing as the number
>> of CPUs grows, leading to substantial delays on high-core-count
>> machines. Analysis [1] reveals that the majority of this time is spent
>> waiting for synchronize_rcu().
>>
>> Expedite synchronize_rcu() during the hotplug path to accelerate the
>> operation. Since CPU hotplug is a user-initiated administrative task,
>> it should complete as quickly as possible.
>>
>> Performance data on a PPC64 system with 400 CPUs:
>>
>> + ppc64_cpu --smt=1 (SMT8 to SMT1)
>> Before: real 1m14.792s
>> After: real 0m03.205s # ~23x improvement
>>
>> + ppc64_cpu --smt=8 (SMT1 to SMT8)
>> Before: real 2m27.695s
>> After: real 0m02.510s # ~58x improvement
>>
>> Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b
>>
>
> Hi Vishal,
>
> I tried on tip/master at 315f416d3e26.
> It fails to apply. is rcu tree updated?

I'm currently working off the GitHub mirror (github.com/torvalds/linux)

>
>
>> [1]
>> https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com
>>
>> Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>
>> ---
>>   include/linux/rcupdate.h | 3 +++
>>   kernel/cpu.c             | 2 ++
>>   2 files changed, 5 insertions(+)
>>
>> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
>> index c5b30054cd01..03c06cfb2b6d 100644
>> --- a/include/linux/rcupdate.h
>> +++ b/include/linux/rcupdate.h
>> @@ -1192,6 +1192,9 @@ rcu_head_after_call_rcu(struct rcu_head *rhp,
>> rcu_callback_t f)
>>   extern int rcu_expedited;
>>   extern int rcu_normal;
>>
>> +extern void rcu_expedite_gp(void);
>> +extern void rcu_unexpedite_gp(void);
>> +
>
> Why extern is needed? All it needs is declarations no?

Already declared in kernel/rcu/rcu.h
kernel/cpu.c already includes linux/rcupdate.h, therefore added an extern.

>
>
>>   DEFINE_LOCK_GUARD_0(rcu,
>>       do {
>>           rcu_read_lock();
>> diff --git a/kernel/cpu.c b/kernel/cpu.c
>> index 8df2d773fe3b..6b0d491d73f4 100644
>> --- a/kernel/cpu.c
>> +++ b/kernel/cpu.c
>> @@ -506,12 +506,14 @@ EXPORT_SYMBOL_GPL(cpus_read_unlock);
>>
>>   void cpus_write_lock(void)
>>   {
>> +    rcu_expedite_gp();
>>       percpu_down_write(&cpu_hotplug_lock);
>>   }
>>
>>   void cpus_write_unlock(void)
>>   {
>>       percpu_up_write(&cpu_hotplug_lock);
>> +    rcu_unexpedite_gp();
>>   }
>>
>>   void lockdep_assert_cpus_held(void)
>
> Have you tested kexec path or suspend/resume path?

I did test the kexec path by booting into another kernel via kexec.
But I didn't test suspend/resume.

> Seems like the counter can nest, but would be good to verify.

^ permalink raw reply	[flat|nested] 54+ messages in thread
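On the nesting question raised above: rcu_expedite_gp() and rcu_unexpedite_gp() maintain a nesting count, so paired calls from overlapping contexts (e.g. a hotplug operation during suspend/resume) are safe as long as every expedite is matched by an unexpedite. The sketch below is a hypothetical user-space model of that counter semantics — the model_* names are invented for illustration and this is not the kernel implementation:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical model of rcu_expedite_gp()/rcu_unexpedite_gp() nesting:
 * grace periods are treated as expedited while at least one expedite
 * request is outstanding. Not the kernel code; names are invented.
 */
static atomic_int expedite_nesting;

static void model_expedite_gp(void)
{
	atomic_fetch_add(&expedite_nesting, 1);
}

static void model_unexpedite_gp(void)
{
	atomic_fetch_sub(&expedite_nesting, 1);
}

static bool model_gp_is_expedited(void)
{
	return atomic_load(&expedite_nesting) > 0;
}
```

Because the state is a counter rather than a flag, an inner unexpedite does not cancel an outer caller's request; expediting stays active until the outermost caller drops its reference.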
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations 2026-01-12 9:43 [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations Vishal Chourasia 2026-01-12 10:08 ` Uladzislau Rezki 2026-01-12 12:21 ` Shrikanth Hegde @ 2026-01-12 14:03 ` Joel Fernandes 2026-01-12 14:20 ` Joel Fernandes 2026-01-12 14:24 ` Peter Zijlstra 2026-01-18 11:38 ` [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations Samir M 4 siblings, 1 reply; 54+ messages in thread From: Joel Fernandes @ 2026-01-12 14:03 UTC (permalink / raw) To: Vishal Chourasia Cc: rcu@vger.kernel.org, linux-kernel@vger.kernel.org, paulmck@kernel.org, frederic@kernel.org, neeraj.upadhyay@kernel.org, josh@joshtriplett.org, boqun.feng@gmail.com, urezki@gmail.com, rostedt@goodmis.org, tglx@linutronix.de, peterz@infradead.org, sshegde@linux.ibm.com, srikar@linux.ibm.com, Vishal Chourasia > On Jan 12, 2026, at 4:44 AM, Vishal Chourasia <vishalc@linux.ibm.com> wrote: > > Bulk CPU hotplug operations—such as switching SMT modes across all > cores—require hotplugging multiple CPUs in rapid succession. On large > systems, this process takes significant time, increasing as the number > of CPUs grows, leading to substantial delays on high-core-count > machines. Analysis [1] reveals that the majority of this time is spent > waiting for synchronize_rcu(). > > Expedite synchronize_rcu() during the hotplug path to accelerate the > operation. Since CPU hotplug is a user-initiated administrative task, > it should complete as quickly as possible. When does the user initiate this in your system? Hotplug should not be happening that often to begin with, it is a slow path that depends on the disruptive stop-machine mechanism. 
> > Performance data on a PPC64 system with 400 CPUs: > > + ppc64_cpu --smt=1 (SMT8 to SMT1) > Before: real 1m14.792s > After: real 0m03.205s # ~23x improvement > > + ppc64_cpu --smt=8 (SMT1 to SMT8) > Before: real 2m27.695s > After: real 0m02.510s # ~58x improvement This does look compelling but, Could you provide more information about how this was tested - what does the ppc binary do (how many hot plugs , how does the performance change with cycle count etc)? Can you also run rcutorture testing? Some of the scenarios like TREE03 stress hotplug. thanks, - Joel > > Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b > > [1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com > > Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com> > --- > include/linux/rcupdate.h | 3 +++ > kernel/cpu.c | 2 ++ > 2 files changed, 5 insertions(+) > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > index c5b30054cd01..03c06cfb2b6d 100644 > --- a/include/linux/rcupdate.h > +++ b/include/linux/rcupdate.h > @@ -1192,6 +1192,9 @@ rcu_head_after_call_rcu(struct rcu_head *rhp, rcu_callback_t f) > extern int rcu_expedited; > extern int rcu_normal; > > +extern void rcu_expedite_gp(void); > +extern void rcu_unexpedite_gp(void); > + > DEFINE_LOCK_GUARD_0(rcu, > do { > rcu_read_lock(); > diff --git a/kernel/cpu.c b/kernel/cpu.c > index 8df2d773fe3b..6b0d491d73f4 100644 > --- a/kernel/cpu.c > +++ b/kernel/cpu.c > @@ -506,12 +506,14 @@ EXPORT_SYMBOL_GPL(cpus_read_unlock); > > void cpus_write_lock(void) > { > + rcu_expedite_gp(); > percpu_down_write(&cpu_hotplug_lock); > } > > void cpus_write_unlock(void) > { > percpu_up_write(&cpu_hotplug_lock); > + rcu_unexpedite_gp(); > } > > void lockdep_assert_cpus_held(void) > -- > 2.52.0 > ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations 2026-01-12 14:03 ` Joel Fernandes @ 2026-01-12 14:20 ` Joel Fernandes 2026-01-12 14:23 ` Peter Zijlstra 0 siblings, 1 reply; 54+ messages in thread From: Joel Fernandes @ 2026-01-12 14:20 UTC (permalink / raw) To: Vishal Chourasia Cc: rcu@vger.kernel.org, linux-kernel@vger.kernel.org, paulmck@kernel.org, frederic@kernel.org, neeraj.upadhyay@kernel.org, josh@joshtriplett.org, boqun.feng@gmail.com, urezki@gmail.com, rostedt@goodmis.org, tglx@linutronix.de, peterz@infradead.org, sshegde@linux.ibm.com, srikar@linux.ibm.com, Vishal Chourasia > On Jan 12, 2026, at 9:03 AM, Joel Fernandes <joelagnelf@nvidia.com> wrote: > > > >> On Jan 12, 2026, at 4:44 AM, Vishal Chourasia <vishalc@linux.ibm.com> wrote: >> >> Bulk CPU hotplug operations—such as switching SMT modes across all >> cores—require hotplugging multiple CPUs in rapid succession. On large >> systems, this process takes significant time, increasing as the number >> of CPUs grows, leading to substantial delays on high-core-count >> machines. Analysis [1] reveals that the majority of this time is spent >> waiting for synchronize_rcu(). >> >> Expedite synchronize_rcu() during the hotplug path to accelerate the >> operation. Since CPU hotplug is a user-initiated administrative task, >> it should complete as quickly as possible. > > When does the user initiate this in your system? > > Hotplug should not be happening that often to begin with, it is a slow path that > depends on the disruptive stop-machine mechanism. 
> >> >> Performance data on a PPC64 system with 400 CPUs: >> >> + ppc64_cpu --smt=1 (SMT8 to SMT1) >> Before: real 1m14.792s >> After: real 0m03.205s # ~23x improvement >> >> + ppc64_cpu --smt=8 (SMT1 to SMT8) >> Before: real 2m27.695s >> After: real 0m02.510s # ~58x improvement > > This does look compelling but, Could you provide more information about how this was tested - what does the ppc binary do (how many hot plugs , how does the performance change with cycle count etc)? > > Can you also run rcutorture testing? Some of the scenarios like TREE03 stress hotplug. Also, why not just use the expedite api at the callsite that is slow than blanket expediting everything between hotplug lock and unlock. That is more specific fix than this fix which applies more broadly to all operations. It appears the report you provided does provide the culprit callsite. - Joel > > thanks, > > - Joel > >> >> Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b >> >> [1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com >> >> Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com> >> --- >> include/linux/rcupdate.h | 3 +++ >> kernel/cpu.c | 2 ++ >> 2 files changed, 5 insertions(+) >> >> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h >> index c5b30054cd01..03c06cfb2b6d 100644 >> --- a/include/linux/rcupdate.h >> +++ b/include/linux/rcupdate.h >> @@ -1192,6 +1192,9 @@ rcu_head_after_call_rcu(struct rcu_head *rhp, rcu_callback_t f) >> extern int rcu_expedited; >> extern int rcu_normal; >> >> +extern void rcu_expedite_gp(void); >> +extern void rcu_unexpedite_gp(void); >> + >> DEFINE_LOCK_GUARD_0(rcu, >> do { >> rcu_read_lock(); >> diff --git a/kernel/cpu.c b/kernel/cpu.c >> index 8df2d773fe3b..6b0d491d73f4 100644 >> --- a/kernel/cpu.c >> +++ b/kernel/cpu.c >> @@ -506,12 +506,14 @@ EXPORT_SYMBOL_GPL(cpus_read_unlock); >> >> void cpus_write_lock(void) >> { >> + rcu_expedite_gp(); >> 
percpu_down_write(&cpu_hotplug_lock); >> } >> >> void cpus_write_unlock(void) >> { >> percpu_up_write(&cpu_hotplug_lock); >> + rcu_unexpedite_gp(); >> } >> >> void lockdep_assert_cpus_held(void) >> -- >> 2.52.0 >> ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations 2026-01-12 14:20 ` Joel Fernandes @ 2026-01-12 14:23 ` Peter Zijlstra 2026-01-12 14:37 ` Joel Fernandes 0 siblings, 1 reply; 54+ messages in thread From: Peter Zijlstra @ 2026-01-12 14:23 UTC (permalink / raw) To: Joel Fernandes Cc: Vishal Chourasia, rcu@vger.kernel.org, linux-kernel@vger.kernel.org, paulmck@kernel.org, frederic@kernel.org, neeraj.upadhyay@kernel.org, josh@joshtriplett.org, boqun.feng@gmail.com, urezki@gmail.com, rostedt@goodmis.org, tglx@linutronix.de, sshegde@linux.ibm.com, srikar@linux.ibm.com On Mon, Jan 12, 2026 at 02:20:44PM +0000, Joel Fernandes wrote: > > > > On Jan 12, 2026, at 9:03 AM, Joel Fernandes <joelagnelf@nvidia.com> wrote: > > > > > > > >> On Jan 12, 2026, at 4:44 AM, Vishal Chourasia <vishalc@linux.ibm.com> wrote: > >> > >> Bulk CPU hotplug operations—such as switching SMT modes across all > >> cores—require hotplugging multiple CPUs in rapid succession. On large > >> systems, this process takes significant time, increasing as the number > >> of CPUs grows, leading to substantial delays on high-core-count > >> machines. Analysis [1] reveals that the majority of this time is spent > >> waiting for synchronize_rcu(). > >> > >> Expedite synchronize_rcu() during the hotplug path to accelerate the > >> operation. Since CPU hotplug is a user-initiated administrative task, > >> it should complete as quickly as possible. > > > > When does the user initiate this in your system? > > > > Hotplug should not be happening that often to begin with, it is a slow path that > > depends on the disruptive stop-machine mechanism. 
> > > >> > >> Performance data on a PPC64 system with 400 CPUs: > >> > >> + ppc64_cpu --smt=1 (SMT8 to SMT1) > >> Before: real 1m14.792s > >> After: real 0m03.205s # ~23x improvement > >> > >> + ppc64_cpu --smt=8 (SMT1 to SMT8) > >> Before: real 2m27.695s > >> After: real 0m02.510s # ~58x improvement > > > > This does look compelling but, Could you provide more information about how this was tested - what does the ppc binary do (how many hot plugs , how does the performance change with cycle count etc)? > > > > Can you also run rcutorture testing? Some of the scenarios like TREE03 stress hotplug. > > Also, why not just use the expedite api at the callsite that is slow > than blanket expediting everything between hotplug lock and unlock. > That is more specific fix than this fix which applies more broadly to > all operations. It appears the report you provided does provide the > culprit callsite. Because hotplug is not a fast path; there is no expectation of performance here. ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations 2026-01-12 14:23 ` Peter Zijlstra @ 2026-01-12 14:37 ` Joel Fernandes 2026-01-12 17:52 ` Vishal Chourasia 0 siblings, 1 reply; 54+ messages in thread From: Joel Fernandes @ 2026-01-12 14:37 UTC (permalink / raw) To: Peter Zijlstra Cc: Vishal Chourasia, rcu@vger.kernel.org, linux-kernel@vger.kernel.org, paulmck@kernel.org, frederic@kernel.org, neeraj.upadhyay@kernel.org, josh@joshtriplett.org, boqun.feng@gmail.com, urezki@gmail.com, rostedt@goodmis.org, tglx@linutronix.de, sshegde@linux.ibm.com, srikar@linux.ibm.com > On Jan 12, 2026, at 9:24 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Mon, Jan 12, 2026 at 02:20:44PM +0000, Joel Fernandes wrote: >> >> >>>> On Jan 12, 2026, at 9:03 AM, Joel Fernandes <joelagnelf@nvidia.com> wrote: >>> >>> >>> >>>> On Jan 12, 2026, at 4:44 AM, Vishal Chourasia <vishalc@linux.ibm.com> wrote: >>>> >>>> Bulk CPU hotplug operations—such as switching SMT modes across all >>>> cores—require hotplugging multiple CPUs in rapid succession. On large >>>> systems, this process takes significant time, increasing as the number >>>> of CPUs grows, leading to substantial delays on high-core-count >>>> machines. Analysis [1] reveals that the majority of this time is spent >>>> waiting for synchronize_rcu(). >>>> >>>> Expedite synchronize_rcu() during the hotplug path to accelerate the >>>> operation. Since CPU hotplug is a user-initiated administrative task, >>>> it should complete as quickly as possible. >>> >>> When does the user initiate this in your system? >>> >>> Hotplug should not be happening that often to begin with, it is a slow path that >>> depends on the disruptive stop-machine mechanism. 
>>> >>>> >>>> Performance data on a PPC64 system with 400 CPUs: >>>> >>>> + ppc64_cpu --smt=1 (SMT8 to SMT1) >>>> Before: real 1m14.792s >>>> After: real 0m03.205s # ~23x improvement >>>> >>>> + ppc64_cpu --smt=8 (SMT1 to SMT8) >>>> Before: real 2m27.695s >>>> After: real 0m02.510s # ~58x improvement >>> >>> This does look compelling but, Could you provide more information about how this was tested - what does the ppc binary do (how many hot plugs , how does the performance change with cycle count etc)? >>> >>> Can you also run rcutorture testing? Some of the scenarios like TREE03 stress hotplug. >> >> Also, why not just use the expedite api at the callsite that is slow >> than blanket expediting everything between hotplug lock and unlock. >> That is more specific fix than this fix which applies more broadly to >> all operations. It appears the report you provided does provide the >> culprit callsite. > > Because hotplug is not a fast path; there is no expectation of > performance here. Agreed, I was just wondering if it was incredibly slow or something. Looking forward to more justification from Vishal on usecase, - Joel > ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations 2026-01-12 14:37 ` Joel Fernandes @ 2026-01-12 17:52 ` Vishal Chourasia 0 siblings, 0 replies; 54+ messages in thread From: Vishal Chourasia @ 2026-01-12 17:52 UTC (permalink / raw) To: Joel Fernandes Cc: Peter Zijlstra, rcu@vger.kernel.org, linux-kernel@vger.kernel.org, paulmck@kernel.org, frederic@kernel.org, neeraj.upadhyay@kernel.org, josh@joshtriplett.org, boqun.feng@gmail.com, urezki@gmail.com, rostedt@goodmis.org, tglx@linutronix.de, sshegde@linux.ibm.com, srikar@linux.ibm.com Hello Joel, Peter On Mon, Jan 12, 2026 at 02:37:14PM +0000, Joel Fernandes wrote: > > > > On Jan 12, 2026, at 9:24 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > > > On Mon, Jan 12, 2026 at 02:20:44PM +0000, Joel Fernandes wrote: > >> > >> > >>>> On Jan 12, 2026, at 9:03 AM, Joel Fernandes <joelagnelf@nvidia.com> wrote: > >>> > >>> > >>> > >>>> On Jan 12, 2026, at 4:44 AM, Vishal Chourasia <vishalc@linux.ibm.com> wrote: > >>>> > >>>> Bulk CPU hotplug operations—such as switching SMT modes across all > >>>> cores—require hotplugging multiple CPUs in rapid succession. On large > >>>> systems, this process takes significant time, increasing as the number > >>>> of CPUs grows, leading to substantial delays on high-core-count > >>>> machines. Analysis [1] reveals that the majority of this time is spent > >>>> waiting for synchronize_rcu(). > >>>> > >>>> Expedite synchronize_rcu() during the hotplug path to accelerate the > >>>> operation. Since CPU hotplug is a user-initiated administrative task, > >>>> it should complete as quickly as possible. > >>> > >>> When does the user initiate this in your system? Workloads exhibit varying sensitivity to SMT levels. Users dynamically adjust SMT modes to optimize performance. > >>> > >>> Hotplug should not be happening that often to begin with, it is a slow path that > >>> depends on the disruptive stop-machine mechanism. 
Yes, it doesn't happen too often, but when it does, on machines with (>= 1920 CPUs) it takes more than 20 mins to finish. > >>> > >>>> > >>>> Performance data on a PPC64 system with 400 CPUs: > >>>> > >>>> + ppc64_cpu --smt=1 (SMT8 to SMT1) > >>>> Before: real 1m14.792s > >>>> After: real 0m03.205s # ~23x improvement > >>>> > >>>> + ppc64_cpu --smt=8 (SMT1 to SMT8) > >>>> Before: real 2m27.695s > >>>> After: real 0m02.510s # ~58x improvement > >>> > >>> This does look compelling but, Could you provide more information about how this was tested - what does the ppc binary do (how many hot plugs , how does the performance change with cycle count etc)? The ppc64_cpu utility generates a list of target CPUs based on the requested SMT state and writes to their corresponding sysfs online entries. Sorry, I didn't get your second question about the performance change with cycle count. > >>> > >>> Can you also run rcutorture testing? Some of the scenarios like TREE03 stress hotplug. Sure, I will get back with the numbers. > >> > >> Also, why not just use the expedite api at the callsite that is slow > >> than blanket expediting everything between hotplug lock and unlock. > >> That is more specific fix than this fix which applies more broadly to > >> all operations. It appears the report you provided does provide the > >> culprit callsite. I initially attempted to replace synchronize_rcu() with synchronize_rcu_expedited() at specific callsites. However, the primary bottlenecks are within percpu_down_write(), called via _cpu_up() and try_online_node(). Please refer to the callstack shared below. Since percpu_down_write() is used throughout the kernel, modifying it directly would force expedited grace periods on unrelated subsystems. 
@[
    synchronize_rcu+12
    rcu_sync_enter+260
    percpu_down_write+76
    _cpu_up+140
    cpu_up+440
    cpu_subsys_online+128
    device_online+176
    online_store+220
    dev_attr_store+52
    sysfs_kf_write+120
    kernfs_fop_write_iter+456
    vfs_write+952
    ksys_write+132
    system_call_exception+292
    system_call_vectored_common+348
]: 350

@[
    synchronize_rcu+12
    rcu_sync_enter+260
    percpu_down_write+76
    try_online_node+64
    cpu_up+120
    cpu_subsys_online+128
    device_online+176
    online_store+220
    dev_attr_store+52
    sysfs_kf_write+120
    kernfs_fop_write_iter+456
    vfs_write+952
    ksys_write+132
    system_call_exception+292
    system_call_vectored_common+348
]: 350

> >
> > Because hotplug is not a fast path; there is no expectation of
> > performance here.
True.

> Agreed, I was just wondering if it was incredibly slow or something. Looking forward to more justification from Vishal on usecase,
>
> - Joel
>
> >

- vishalc

^ permalink raw reply	[flat|nested] 54+ messages in thread
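For context on the callstacks above: tools such as ppc64_cpu and chcpu ultimately write "0" or "1" to each CPU's sysfs online file, which is what drives the online_store() -> device_online() -> cpu_up() path in the traces. A minimal sketch of that userspace side follows — the helper names are hypothetical, and the write only takes effect when run as root against a real sysfs:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Sketch of what SMT-switching tools (ppc64_cpu, chcpu, ...) do per
 * CPU: write "0" or "1" to /sys/devices/system/cpu/cpuN/online.
 * Helper names are invented for illustration only.
 */
static int cpu_online_path(char *buf, size_t len, int cpu)
{
	return snprintf(buf, len, "/sys/devices/system/cpu/cpu%d/online", cpu);
}

static int set_cpu_online(int cpu, int online)
{
	char path[64];
	FILE *f;

	cpu_online_path(path, sizeof(path), cpu);
	f = fopen(path, "w");	/* needs root; fails cleanly otherwise */
	if (!f)
		return -1;
	fprintf(f, "%d\n", online ? 1 : 0);
	return fclose(f);
}
```

Each such write takes the full hotplug slow path shown in the stacks, so a bulk SMT switch repeats it once per affected CPU — which is why per-CPU synchronize_rcu() cost multiplies on large systems.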
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations
  2026-01-12  9:43 [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations Vishal Chourasia
                   ` (2 preceding siblings ...)
  2026-01-12 14:03 ` Joel Fernandes
@ 2026-01-12 14:24 ` Peter Zijlstra
  2026-01-12 18:00   ` Vishal Chourasia
  2026-01-18 11:38 ` [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations Samir M
  4 siblings, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2026-01-12 14:24 UTC (permalink / raw)
  To: Vishal Chourasia
  Cc: rcu, linux-kernel, paulmck, frederic, neeraj.upadhyay, joelagnelf,
	josh, boqun.feng, urezki, rostedt, tglx, sshegde, srikar

On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote:
> Bulk CPU hotplug operations—such as switching SMT modes across all
> cores—require hotplugging multiple CPUs in rapid succession. On large
> systems, this process takes significant time, increasing as the number
> of CPUs grows, leading to substantial delays on high-core-count
> machines. Analysis [1] reveals that the majority of this time is spent
> waiting for synchronize_rcu().
>
> Expedite synchronize_rcu() during the hotplug path to accelerate the
> operation. Since CPU hotplug is a user-initiated administrative task,
> it should complete as quickly as possible.
>
> Performance data on a PPC64 system with 400 CPUs:
>
> + ppc64_cpu --smt=1 (SMT8 to SMT1)
> Before: real 1m14.792s
> After: real 0m03.205s # ~23x improvement
>
> + ppc64_cpu --smt=8 (SMT1 to SMT8)
> Before: real 2m27.695s
> After: real 0m02.510s # ~58x improvement
>

But who cares? Its not like you'd *ever* do this, right?

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations 2026-01-12 14:24 ` Peter Zijlstra @ 2026-01-12 18:00 ` Vishal Chourasia 2026-01-13 9:01 ` Peter Zijlstra 0 siblings, 1 reply; 54+ messages in thread From: Vishal Chourasia @ 2026-01-12 18:00 UTC (permalink / raw) To: Peter Zijlstra Cc: rcu, linux-kernel, paulmck, frederic, neeraj.upadhyay, joelagnelf, josh, boqun.feng, urezki, rostedt, tglx, sshegde, srikar Hello Peter, On Mon, Jan 12, 2026 at 03:24:40PM +0100, Peter Zijlstra wrote: > On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote: > > Bulk CPU hotplug operations—such as switching SMT modes across all > > cores—require hotplugging multiple CPUs in rapid succession. On large > > systems, this process takes significant time, increasing as the number > > of CPUs grows, leading to substantial delays on high-core-count > > machines. Analysis [1] reveals that the majority of this time is spent > > waiting for synchronize_rcu(). > > > > Expedite synchronize_rcu() during the hotplug path to accelerate the > > operation. Since CPU hotplug is a user-initiated administrative task, > > it should complete as quickly as possible. > > > > Performance data on a PPC64 system with 400 CPUs: > > > > + ppc64_cpu --smt=1 (SMT8 to SMT1) > > Before: real 1m14.792s > > After: real 0m03.205s # ~23x improvement > > > > + ppc64_cpu --smt=8 (SMT1 to SMT8) > > Before: real 2m27.695s > > After: real 0m02.510s # ~58x improvement > > > > But who cares? Its not like you'd *ever* do this, right? Users dynamically adjust SMT modes to optimize performance of the workload being run. And, yes it doesn't happen too often, but when it does, on machines with (>= 1920 CPUs) it takes more than 20 mins to finish. - vishal ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations 2026-01-12 18:00 ` Vishal Chourasia @ 2026-01-13 9:01 ` Peter Zijlstra 2026-01-19 10:47 ` [PATCH] cpuhp: Expedite synchronize_rcu during SMT switch Vishal Chourasia 2026-01-19 10:54 ` [RESEND] " Vishal Chourasia 0 siblings, 2 replies; 54+ messages in thread From: Peter Zijlstra @ 2026-01-13 9:01 UTC (permalink / raw) To: Vishal Chourasia Cc: rcu, linux-kernel, paulmck, frederic, neeraj.upadhyay, joelagnelf, josh, boqun.feng, urezki, rostedt, tglx, sshegde, srikar On Mon, Jan 12, 2026 at 11:30:52PM +0530, Vishal Chourasia wrote: > Hello Peter, > > > > On Mon, Jan 12, 2026 at 03:24:40PM +0100, Peter Zijlstra wrote: > > On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote: > > > Bulk CPU hotplug operations—such as switching SMT modes across all > > > cores—require hotplugging multiple CPUs in rapid succession. On large > > > systems, this process takes significant time, increasing as the number > > > of CPUs grows, leading to substantial delays on high-core-count > > > machines. Analysis [1] reveals that the majority of this time is spent > > > waiting for synchronize_rcu(). > > > > > > Expedite synchronize_rcu() during the hotplug path to accelerate the > > > operation. Since CPU hotplug is a user-initiated administrative task, > > > it should complete as quickly as possible. > > > > > > Performance data on a PPC64 system with 400 CPUs: > > > > > > + ppc64_cpu --smt=1 (SMT8 to SMT1) > > > Before: real 1m14.792s > > > After: real 0m03.205s # ~23x improvement > > > > > > + ppc64_cpu --smt=8 (SMT1 to SMT8) > > > Before: real 2m27.695s > > > After: real 0m02.510s # ~58x improvement > > > > > > > But who cares? Its not like you'd *ever* do this, right? > Users dynamically adjust SMT modes to optimize performance of the > workload being run. And, yes it doesn't happen too often, but when it > does, on machines with (>= 1920 CPUs) it takes more than 20 mins to > finish. 
Users cannot change this, it is root only. Having to change SMT mode per
workload seems quite insane; but whatever.

If you do have to put RCU hooks anywhere, I'd much rather see them in
cpuhp_smt_{en,dis}able(), such that they only affect the batch hotplug
case, rather than everything using cpus_write_lock().

Also note that there is a case to be made for optimizing this batch
hotplug case: for one, it makes no sense to take cpus_write_lock() over
and over again. If you can pull that out, just like
cpu_maps_update_begin() has already been lifted, this would help.

And Joel has a point, in that it might make sense for RCU to behave
'better' under these conditions.
* [PATCH] cpuhp: Expedite synchronize_rcu during SMT switch 2026-01-13 9:01 ` Peter Zijlstra @ 2026-01-19 10:47 ` Vishal Chourasia 2026-01-19 11:43 ` Peter Zijlstra 2026-01-19 10:54 ` [RESEND] " Vishal Chourasia 1 sibling, 1 reply; 54+ messages in thread From: Vishal Chourasia @ 2026-01-19 10:47 UTC (permalink / raw) To: peterz Cc: boqun.feng, frederic, joelagnelf, josh, linux-kernel, neeraj.upadhyay, paulmck, rcu, rostedt, srikar, sshegde, tglx, urezki, samir, vishalc Expedite synchronize_rcu() during the cpuhp_smt_[enable|disable] path to accelerate the operation. Bulk CPU hotplug operations—such as switching SMT modes across all cores—require hotplugging multiple CPUs in rapid succession. On large systems, this process takes significant time, increasing as the number of CPUs to hotplug during SMT switch grows, leading to substantial delays on high-core-count machines. Analysis [1] reveals that the majority of this time is spent waiting for synchronize_rcu(). SMT switch is a user-initiated administrative task, it should complete as quickly as possible. 
Performance data on a PPC64 system with 2048 CPUs:

+ ppc64_cpu --smt=1 (SMT8 to SMT1)
  Before: real 30m53.194s
  After:  real  6m4.678s   # ~5x improvement

+ ppc64_cpu --smt=8 (SMT1 to SMT8)
  Before: real 49m5.920s
  After:  real 36m47.798s  # ~1.3x improvement

[1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com

Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>
Tested-by: Samir M <samir@linux.ibm.com>
---
 include/linux/rcupdate.h | 3 +++
 kernel/cpu.c             | 4 ++++
 2 files changed, 7 insertions(+)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index c5b30054cd01..03c06cfb2b6d 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -1192,6 +1192,9 @@ rcu_head_after_call_rcu(struct rcu_head *rhp, rcu_callback_t f)
 extern int rcu_expedited;
 extern int rcu_normal;
 
+extern void rcu_expedite_gp(void);
+extern void rcu_unexpedite_gp(void);
+
 DEFINE_LOCK_GUARD_0(rcu,
 	do {
 		rcu_read_lock();
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 8df2d773fe3b..a264d7170842 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2669,6 +2669,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
 	int cpu, ret = 0;
 
 	cpu_maps_update_begin();
+	rcu_expedite_gp();
 	for_each_online_cpu(cpu) {
 		if (topology_is_primary_thread(cpu))
 			continue;
@@ -2698,6 +2699,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
 	}
 	if (!ret)
 		cpu_smt_control = ctrlval;
+	rcu_unexpedite_gp();
 	cpu_maps_update_done();
 	return ret;
 }
@@ -2716,6 +2718,7 @@ int cpuhp_smt_enable(void)
 
 	cpu_maps_update_begin();
 	cpu_smt_control = CPU_SMT_ENABLED;
+	rcu_expedite_gp();
 	for_each_present_cpu(cpu) {
 		/* Skip online CPUs and CPUs on offline nodes */
 		if (cpu_online(cpu) || !node_online(cpu_to_node(cpu)))
@@ -2728,6 +2731,7 @@ int cpuhp_smt_enable(void)
 		/* See comment in cpuhp_smt_disable() */
 		cpuhp_online_cpu_device(cpu);
 	}
+	rcu_unexpedite_gp();
 	cpu_maps_update_done();
 	return ret;
 }
-- 
2.52.0
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during SMT switch 2026-01-19 10:47 ` [PATCH] cpuhp: Expedite synchronize_rcu during SMT switch Vishal Chourasia @ 2026-01-19 11:43 ` Peter Zijlstra 2026-01-19 13:45 ` Shrikanth Hegde 2026-01-27 17:48 ` Samir M 0 siblings, 2 replies; 54+ messages in thread From: Peter Zijlstra @ 2026-01-19 11:43 UTC (permalink / raw) To: Vishal Chourasia Cc: boqun.feng, frederic, joelagnelf, josh, linux-kernel, neeraj.upadhyay, paulmck, rcu, rostedt, srikar, sshegde, tglx, urezki, samir On Mon, Jan 19, 2026 at 04:17:40PM +0530, Vishal Chourasia wrote: > Expedite synchronize_rcu() during the cpuhp_smt_[enable|disable] path to > accelerate the operation. > > Bulk CPU hotplug operations—such as switching SMT modes across all > cores—require hotplugging multiple CPUs in rapid succession. On large > systems, this process takes significant time, increasing as the number > of CPUs to hotplug during SMT switch grows, leading to substantial > delays on high-core-count machines. Analysis [1] reveals that the > majority of this time is spent waiting for synchronize_rcu(). > You seem to have left out all the useful bits from your changelog again :/ Anyway, ISTR Joel posted a patch hoisting a lock; it was a icky, but not something we can't live with either. Also, memory got jogged and I think something like the below will remove 2/3 of your rcu woes as well. 
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 8df2d773fe3b..1365c19444b2 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2669,6 +2669,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
 	int cpu, ret = 0;
 
 	cpu_maps_update_begin();
+	rcu_sync_enter(&cpu_hotplug_lock.rss);
 	for_each_online_cpu(cpu) {
 		if (topology_is_primary_thread(cpu))
 			continue;
@@ -2698,6 +2699,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
 	}
 	if (!ret)
 		cpu_smt_control = ctrlval;
+	rcu_sync_exit(&cpu_hotplug_lock.rss);
 	cpu_maps_update_done();
 	return ret;
 }
@@ -2715,6 +2717,7 @@ int cpuhp_smt_enable(void)
 	int cpu, ret = 0;
 
 	cpu_maps_update_begin();
+	rcu_sync_enter(&cpu_hotplug_lock.rss);
 	cpu_smt_control = CPU_SMT_ENABLED;
 	for_each_present_cpu(cpu) {
 		/* Skip online CPUs and CPUs on offline nodes */
@@ -2728,6 +2731,7 @@ int cpuhp_smt_enable(void)
 		/* See comment in cpuhp_smt_disable() */
 		cpuhp_online_cpu_device(cpu);
 	}
+	rcu_sync_exit(&cpu_hotplug_lock.rss);
 	cpu_maps_update_done();
 	return ret;
 }
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during SMT switch 2026-01-19 11:43 ` Peter Zijlstra @ 2026-01-19 13:45 ` Shrikanth Hegde 2026-01-19 14:11 ` Peter Zijlstra 2026-01-27 17:48 ` Samir M 1 sibling, 1 reply; 54+ messages in thread From: Shrikanth Hegde @ 2026-01-19 13:45 UTC (permalink / raw) To: Peter Zijlstra, Vishal Chourasia Cc: boqun.feng, frederic, joelagnelf, josh, linux-kernel, neeraj.upadhyay, paulmck, rcu, rostedt, srikar, tglx, urezki, samir Hi Peter. On 1/19/26 5:13 PM, Peter Zijlstra wrote: > On Mon, Jan 19, 2026 at 04:17:40PM +0530, Vishal Chourasia wrote: >> Expedite synchronize_rcu() during the cpuhp_smt_[enable|disable] path to >> accelerate the operation. >> >> Bulk CPU hotplug operations—such as switching SMT modes across all >> cores—require hotplugging multiple CPUs in rapid succession. On large >> systems, this process takes significant time, increasing as the number >> of CPUs to hotplug during SMT switch grows, leading to substantial >> delays on high-core-count machines. Analysis [1] reveals that the >> majority of this time is spent waiting for synchronize_rcu(). >> > > You seem to have left out all the useful bits from your changelog again > :/ > > Anyway, ISTR Joel posted a patch hoisting a lock; it was a icky, but not > something we can't live with either. > > Also, memory got jogged and I think something like the below will remove > 2/3 of your rcu woes as well. 
> > diff --git a/kernel/cpu.c b/kernel/cpu.c > index 8df2d773fe3b..1365c19444b2 100644 > --- a/kernel/cpu.c > +++ b/kernel/cpu.c > @@ -2669,6 +2669,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval) > int cpu, ret = 0; > > cpu_maps_update_begin(); > + rcu_sync_enter(&cpu_hotplug_lock.rss); > for_each_online_cpu(cpu) { > if (topology_is_primary_thread(cpu)) > continue; > @@ -2698,6 +2699,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval) > } > if (!ret) > cpu_smt_control = ctrlval; > + rcu_sync_exit(&cpu_hotplug_lock.rss); > cpu_maps_update_done(); > return ret; > } > @@ -2715,6 +2717,7 @@ int cpuhp_smt_enable(void) > int cpu, ret = 0; > > cpu_maps_update_begin(); > + rcu_sync_enter(&cpu_hotplug_lock.rss); > cpu_smt_control = CPU_SMT_ENABLED; > for_each_present_cpu(cpu) { > /* Skip online CPUs and CPUs on offline nodes */ > @@ -2728,6 +2731,7 @@ int cpuhp_smt_enable(void) > /* See comment in cpuhp_smt_disable() */ > cpuhp_online_cpu_device(cpu); > } > + rcu_sync_exit(&cpu_hotplug_lock.rss); > cpu_maps_update_done(); > return ret; > } Currently, cpuhp_smt_[enable/disable] calls _cpu_up/_cpu_down which does the same in cpus_write_lock/unlock. though it is per cpu enable/disable one after another. How hoisting this up will help? ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during SMT switch 2026-01-19 13:45 ` Shrikanth Hegde @ 2026-01-19 14:11 ` Peter Zijlstra 2026-01-19 14:45 ` Joel Fernandes 0 siblings, 1 reply; 54+ messages in thread From: Peter Zijlstra @ 2026-01-19 14:11 UTC (permalink / raw) To: Shrikanth Hegde Cc: Vishal Chourasia, boqun.feng, frederic, joelagnelf, josh, linux-kernel, neeraj.upadhyay, paulmck, rcu, rostedt, srikar, tglx, urezki, samir On Mon, Jan 19, 2026 at 07:15:09PM +0530, Shrikanth Hegde wrote: > Hi Peter. > > On 1/19/26 5:13 PM, Peter Zijlstra wrote: > > On Mon, Jan 19, 2026 at 04:17:40PM +0530, Vishal Chourasia wrote: > > > Expedite synchronize_rcu() during the cpuhp_smt_[enable|disable] path to > > > accelerate the operation. > > > > > > Bulk CPU hotplug operations—such as switching SMT modes across all > > > cores—require hotplugging multiple CPUs in rapid succession. On large > > > systems, this process takes significant time, increasing as the number > > > of CPUs to hotplug during SMT switch grows, leading to substantial > > > delays on high-core-count machines. Analysis [1] reveals that the > > > majority of this time is spent waiting for synchronize_rcu(). > > > > > > > You seem to have left out all the useful bits from your changelog again > > :/ > > > > Anyway, ISTR Joel posted a patch hoisting a lock; it was a icky, but not > > something we can't live with either. > > > > Also, memory got jogged and I think something like the below will remove > > 2/3 of your rcu woes as well. 
> > > > diff --git a/kernel/cpu.c b/kernel/cpu.c > > index 8df2d773fe3b..1365c19444b2 100644 > > --- a/kernel/cpu.c > > +++ b/kernel/cpu.c > > @@ -2669,6 +2669,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval) > > int cpu, ret = 0; > > cpu_maps_update_begin(); > > + rcu_sync_enter(&cpu_hotplug_lock.rss); > > for_each_online_cpu(cpu) { > > if (topology_is_primary_thread(cpu)) > > continue; > > @@ -2698,6 +2699,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval) > > } > > if (!ret) > > cpu_smt_control = ctrlval; > > + rcu_sync_exit(&cpu_hotplug_lock.rss); > > cpu_maps_update_done(); > > return ret; > > } > > @@ -2715,6 +2717,7 @@ int cpuhp_smt_enable(void) > > int cpu, ret = 0; > > cpu_maps_update_begin(); > > + rcu_sync_enter(&cpu_hotplug_lock.rss); > > cpu_smt_control = CPU_SMT_ENABLED; > > for_each_present_cpu(cpu) { > > /* Skip online CPUs and CPUs on offline nodes */ > > @@ -2728,6 +2731,7 @@ int cpuhp_smt_enable(void) > > /* See comment in cpuhp_smt_disable() */ > > cpuhp_online_cpu_device(cpu); > > } > > + rcu_sync_exit(&cpu_hotplug_lock.rss); > > cpu_maps_update_done(); > > return ret; > > } > > > Currently, cpuhp_smt_[enable/disable] calls _cpu_up/_cpu_down > which does the same in cpus_write_lock/unlock. though it is per > cpu enable/disable one after another. > > How hoisting this up will help? By holding an extra rcu_sync reference, the percpu rwsem is kept into the the slow path, avoiding the rcu-sync on down_write(), which was very prevalent per this: https://lkml.kernel.org/r/aWU9HRcs4ghazIRg@linux.ibm.com ^ permalink raw reply [flat|nested] 54+ messages in thread
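[Editor's note] The reference-counting effect Peter describes can be sketched with a small userspace toy model. All names below are hypothetical stand-ins: the real state machine in kernel/rcu/sync.c is considerably more involved (it has GP_IDLE/GP_ENTER states and asynchronous exit-side grace-period work), but the counting behavior that matters here is the same — only the 0 -> 1 transition of the enter() reference has to wait for a grace period.

```c
#include <assert.h>

/* Toy stand-in for the rcu_sync state embedded in a percpu_rw_semaphore.
 * While gp_count is non-zero the semaphore stays in its reader slow path,
 * so further writers need no fresh grace period. */
struct toy_rcu_sync {
	int gp_count; /* outstanding enter() references */
};

static int gp_waits; /* counts simulated synchronize_rcu() calls */

static void toy_synchronize_rcu(void)
{
	gp_waits++; /* in the kernel this is the expensive blocking wait */
}

static void toy_rcu_sync_enter(struct toy_rcu_sync *rss)
{
	/* Only the transition from zero references pays a grace period. */
	if (rss->gp_count++ == 0)
		toy_synchronize_rcu();
}

static void toy_rcu_sync_exit(struct toy_rcu_sync *rss)
{
	rss->gp_count--; /* last exit re-arms the reader fast path */
}

/* percpu_down_write()/percpu_up_write() enter and leave the sync state. */
static void toy_percpu_down_write(struct toy_rcu_sync *rss)
{
	toy_rcu_sync_enter(rss);
}

static void toy_percpu_up_write(struct toy_rcu_sync *rss)
{
	toy_rcu_sync_exit(rss);
}
```

With an outer toy_rcu_sync_enter() held across a batch of write-lock/unlock cycles, only the first acquisition waits for a grace period instead of one wait per hotplugged CPU — which is the effect the hunks above aim for on cpu_hotplug_lock.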
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during SMT switch 2026-01-19 14:11 ` Peter Zijlstra @ 2026-01-19 14:45 ` Joel Fernandes 2026-01-19 14:59 ` Peter Zijlstra 0 siblings, 1 reply; 54+ messages in thread From: Joel Fernandes @ 2026-01-19 14:45 UTC (permalink / raw) To: Peter Zijlstra Cc: Shrikanth Hegde, Vishal Chourasia, boqun.feng, frederic, josh, linux-kernel, neeraj.upadhyay, paulmck, rcu, rostedt, srikar, tglx, urezki, samir On Sun, Jan 19, 2026 at 03:11:18PM +0100, Peter Zijlstra wrote: > By holding an extra rcu_sync reference, the percpu rwsem is kept into > the the slow path, avoiding the rcu-sync on down_write(), which was very > prevalent per this: > > https://lkml.kernel.org/r/aWU9HRcs4ghazIRg@linux.ibm.com Makes sense, though I wonder if we should have a separate percpu-rwsem API for this rather than directly accessing the lock-internal rcu_sync state? Other future percpu_rw_semaphore users may benefit as well. - Joel ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during SMT switch 2026-01-19 14:45 ` Joel Fernandes @ 2026-01-19 14:59 ` Peter Zijlstra 0 siblings, 0 replies; 54+ messages in thread From: Peter Zijlstra @ 2026-01-19 14:59 UTC (permalink / raw) To: Joel Fernandes Cc: Shrikanth Hegde, Vishal Chourasia, boqun.feng, frederic, josh, linux-kernel, neeraj.upadhyay, paulmck, rcu, rostedt, srikar, tglx, urezki, samir On Mon, Jan 19, 2026 at 09:45:48AM -0500, Joel Fernandes wrote: > On Sun, Jan 19, 2026 at 03:11:18PM +0100, Peter Zijlstra wrote: > > By holding an extra rcu_sync reference, the percpu rwsem is kept into > > the the slow path, avoiding the rcu-sync on down_write(), which was very > > prevalent per this: > > > > https://lkml.kernel.org/r/aWU9HRcs4ghazIRg@linux.ibm.com > > Makes sense, though I wonder if we should have a separate percpu-rwsem API > for this rather than directly accessing the lock-internal rcu_sync state? > Other future percpu_rw_semaphore users may benefit as well. Yeah, perhaps. There is one other user, the above makes two. kernel/cgroup/cgroup.c: rcu_sync_enter(&cgroup_threadgroup_rwsem.rss); ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during SMT switch 2026-01-19 11:43 ` Peter Zijlstra 2026-01-19 13:45 ` Shrikanth Hegde @ 2026-01-27 17:48 ` Samir M 2026-01-29 7:05 ` Samir M 2026-02-03 6:31 ` Samir M 1 sibling, 2 replies; 54+ messages in thread From: Samir M @ 2026-01-27 17:48 UTC (permalink / raw) To: Peter Zijlstra, Vishal Chourasia Cc: boqun.feng, frederic, joelagnelf, josh, linux-kernel, neeraj.upadhyay, paulmck, rcu, rostedt, srikar, sshegde, tglx, urezki On 19/01/26 5:13 pm, Peter Zijlstra wrote: > On Mon, Jan 19, 2026 at 04:17:40PM +0530, Vishal Chourasia wrote: >> Expedite synchronize_rcu() during the cpuhp_smt_[enable|disable] path to >> accelerate the operation. >> >> Bulk CPU hotplug operations—such as switching SMT modes across all >> cores—require hotplugging multiple CPUs in rapid succession. On large >> systems, this process takes significant time, increasing as the number >> of CPUs to hotplug during SMT switch grows, leading to substantial >> delays on high-core-count machines. Analysis [1] reveals that the >> majority of this time is spent waiting for synchronize_rcu(). >> > You seem to have left out all the useful bits from your changelog again > :/ > > Anyway, ISTR Joel posted a patch hoisting a lock; it was a icky, but not > something we can't live with either. > > Also, memory got jogged and I think something like the below will remove > 2/3 of your rcu woes as well. 
> > diff --git a/kernel/cpu.c b/kernel/cpu.c > index 8df2d773fe3b..1365c19444b2 100644 > --- a/kernel/cpu.c > +++ b/kernel/cpu.c > @@ -2669,6 +2669,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval) > int cpu, ret = 0; > > cpu_maps_update_begin(); > + rcu_sync_enter(&cpu_hotplug_lock.rss); > for_each_online_cpu(cpu) { > if (topology_is_primary_thread(cpu)) > continue; > @@ -2698,6 +2699,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval) > } > if (!ret) > cpu_smt_control = ctrlval; > + rcu_sync_exit(&cpu_hotplug_lock.rss); > cpu_maps_update_done(); > return ret; > } > @@ -2715,6 +2717,7 @@ int cpuhp_smt_enable(void) > int cpu, ret = 0; > > cpu_maps_update_begin(); > + rcu_sync_enter(&cpu_hotplug_lock.rss); > cpu_smt_control = CPU_SMT_ENABLED; > for_each_present_cpu(cpu) { > /* Skip online CPUs and CPUs on offline nodes */ > @@ -2728,6 +2731,7 @@ int cpuhp_smt_enable(void) > /* See comment in cpuhp_smt_disable() */ > cpuhp_online_cpu_device(cpu); > } > + rcu_sync_exit(&cpu_hotplug_lock.rss); > cpu_maps_update_done(); > return ret; > } Hi, I verified this patch using the configuration described below. Configuration: • Kernel version: 6.19.0-rc6 • Number of CPUs: 1536 Earlier verification of an older version of this patch was performed on a system with *2048 CPUs*. Due to system unavailability, the current verification was carried out on a *different system.* Using this setup, I evaluated the patch with both SMT enabled and SMT disabled. patch shows a significant improvement in the SMT=off case and a measurable improvement in the SMT=on case. The results indicate that when SMT is enabled, the system time is noticeably higher. In contrast, with SMT disabled, no significant increase in system time is observed. 
SMT=ON  -> sys 50m42.805s
SMT=OFF -> sys  0m0.064s

SMT Mode | Without Patch | With Patch  | % Improvement |
------------------------------------------------------------------
SMT=off  | 20m 32.210s   | 5m 30.898s  | +73.15%       |
SMT=on   | 62m 46.549s   | 55m 45.671s | +11.18%       |

Please add below tag:

Tested-by: Samir M <samir@linux.ibm.com>

Regards,
Samir
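[Editor's note] The improvement column above follows from the real times as (before - after) / before; a quick check of that arithmetic (helper names are made up for illustration):

```c
#include <assert.h>
#include <math.h>

/* Convert a "XmY.YYYs" real-time reading into seconds. */
static double mins(double m, double s)
{
	return m * 60.0 + s;
}

/* Percent reduction in wall-clock time relative to the unpatched run. */
static double improvement(double before_s, double after_s)
{
	return (before_s - after_s) / before_s * 100.0;
}
```

For the SMT=off row, improvement(mins(20, 32.210), mins(5, 30.898)) comes out to about 73.15, matching the table; the SMT=on row evaluates to roughly 11.2.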
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during SMT switch 2026-01-27 17:48 ` Samir M @ 2026-01-29 7:05 ` Samir M 2026-02-03 6:31 ` Samir M 1 sibling, 0 replies; 54+ messages in thread From: Samir M @ 2026-01-29 7:05 UTC (permalink / raw) To: Peter Zijlstra, Vishal Chourasia Cc: boqun.feng, frederic, joelagnelf, josh, linux-kernel, neeraj.upadhyay, paulmck, rcu, rostedt, srikar, sshegde, tglx, urezki On 27/01/26 11:18 pm, Samir M wrote: > > On 19/01/26 5:13 pm, Peter Zijlstra wrote: >> On Mon, Jan 19, 2026 at 04:17:40PM +0530, Vishal Chourasia wrote: >>> Expedite synchronize_rcu() during the cpuhp_smt_[enable|disable] >>> path to >>> accelerate the operation. >>> >>> Bulk CPU hotplug operations—such as switching SMT modes across all >>> cores—require hotplugging multiple CPUs in rapid succession. On large >>> systems, this process takes significant time, increasing as the number >>> of CPUs to hotplug during SMT switch grows, leading to substantial >>> delays on high-core-count machines. Analysis [1] reveals that the >>> majority of this time is spent waiting for synchronize_rcu(). >>> >> You seem to have left out all the useful bits from your changelog again >> :/ >> >> Anyway, ISTR Joel posted a patch hoisting a lock; it was a icky, but not >> something we can't live with either. >> >> Also, memory got jogged and I think something like the below will remove >> 2/3 of your rcu woes as well. 
>> >> diff --git a/kernel/cpu.c b/kernel/cpu.c >> index 8df2d773fe3b..1365c19444b2 100644 >> --- a/kernel/cpu.c >> +++ b/kernel/cpu.c >> @@ -2669,6 +2669,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control >> ctrlval) >> int cpu, ret = 0; >> cpu_maps_update_begin(); >> + rcu_sync_enter(&cpu_hotplug_lock.rss); >> for_each_online_cpu(cpu) { >> if (topology_is_primary_thread(cpu)) >> continue; >> @@ -2698,6 +2699,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control >> ctrlval) >> } >> if (!ret) >> cpu_smt_control = ctrlval; >> + rcu_sync_exit(&cpu_hotplug_lock.rss); >> cpu_maps_update_done(); >> return ret; >> } >> @@ -2715,6 +2717,7 @@ int cpuhp_smt_enable(void) >> int cpu, ret = 0; >> cpu_maps_update_begin(); >> + rcu_sync_enter(&cpu_hotplug_lock.rss); >> cpu_smt_control = CPU_SMT_ENABLED; >> for_each_present_cpu(cpu) { >> /* Skip online CPUs and CPUs on offline nodes */ >> @@ -2728,6 +2731,7 @@ int cpuhp_smt_enable(void) >> /* See comment in cpuhp_smt_disable() */ >> cpuhp_online_cpu_device(cpu); >> } >> + rcu_sync_exit(&cpu_hotplug_lock.rss); >> cpu_maps_update_done(); >> return ret; >> } > > > Hi, > > I verified this patch using the configuration described below. > Configuration: > • Kernel version: 6.19.0-rc6 > • Number of CPUs: 1536 > > Earlier verification of an older version of this patch was performed > on a system with *2048 CPUs*. Due to system unavailability, the > current verification was carried out on a *different system.* > > > Using this setup, I evaluated the patch with both SMT enabled and SMT > disabled. patch shows a significant improvement in the SMT=off case > and a measurable improvement in the SMT=on case. > The results indicate that when SMT is enabled, the system time is > noticeably higher. In contrast, with SMT disabled, no significant > increase in system time is observed. 
> SMT=ON  -> sys 50m42.805s
> SMT=OFF -> sys  0m0.064s
>
> SMT Mode | Without Patch | With Patch  | % Improvement |
> ------------------------------------------------------------------
> SMT=off  | 20m 32.210s   | 5m 30.898s  | +73.15%       |
> SMT=on   | 62m 46.549s   | 55m 45.671s | +11.18%       |
>
> Please add below tag: Tested-by: Samir M <samir@linux.ibm.com>
>
> Regards,
> Samir

Hi All,

For reference, I am updating the results for the Vishal and Rezki patches.

Configuration:
  Kernel version: 6.19.0-rc6
  Number of CPUs: 1536

Earlier verification of an older revision of this patch was performed on a
system with 2048 CPUs. Due to system unavailability, the current
verification was carried out on a different system.

Results for the patches verified using the above configuration:

Patch: https://lore.kernel.org/all/20260112094332.66006-2-vishalc@linux.ibm.com/

SMT Mode | Without Patch | With Patch  | % Improvement |
------------------------------------------------------------------
SMT=off  | 20m 32.210s   | 5m 31.807s  | +73.09%       |
SMT=on   | 62m 46.549s   | 55m 48.801s | +11.08%       |

SMT=ON  -> sys 50m44.105s
SMT=OFF -> sys  0m0.035s

Patch: https://lore.kernel.org/all/20260114183415.286489-1-urezki@gmail.com/

SMT Mode | Without Patch | With Patch  | % Improvement |
------------------------------------------------------------------
SMT=off  | 20m 32.210s   | 18m 2.012s  | +12.19%       |
SMT=on   | 62m 46.549s   | 62m 35.076s | +0.30%        |

SMT=ON  -> sys 50m43.806s
SMT=OFF -> sys  0m0.109s

Regards,
Samir
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during SMT switch 2026-01-27 17:48 ` Samir M 2026-01-29 7:05 ` Samir M @ 2026-02-03 6:31 ` Samir M 1 sibling, 0 replies; 54+ messages in thread From: Samir M @ 2026-02-03 6:31 UTC (permalink / raw) To: Peter Zijlstra, Vishal Chourasia Cc: boqun.feng, frederic, joelagnelf, josh, linux-kernel, neeraj.upadhyay, paulmck, rcu, rostedt, srikar, sshegde, tglx, urezki On 27/01/26 11:18 pm, Samir M wrote: > > On 19/01/26 5:13 pm, Peter Zijlstra wrote: >> On Mon, Jan 19, 2026 at 04:17:40PM +0530, Vishal Chourasia wrote: >>> Expedite synchronize_rcu() during the cpuhp_smt_[enable|disable] >>> path to >>> accelerate the operation. >>> >>> Bulk CPU hotplug operations—such as switching SMT modes across all >>> cores—require hotplugging multiple CPUs in rapid succession. On large >>> systems, this process takes significant time, increasing as the number >>> of CPUs to hotplug during SMT switch grows, leading to substantial >>> delays on high-core-count machines. Analysis [1] reveals that the >>> majority of this time is spent waiting for synchronize_rcu(). >>> >> You seem to have left out all the useful bits from your changelog again >> :/ >> >> Anyway, ISTR Joel posted a patch hoisting a lock; it was a icky, but not >> something we can't live with either. >> >> Also, memory got jogged and I think something like the below will remove >> 2/3 of your rcu woes as well. 
>> >> diff --git a/kernel/cpu.c b/kernel/cpu.c >> index 8df2d773fe3b..1365c19444b2 100644 >> --- a/kernel/cpu.c >> +++ b/kernel/cpu.c >> @@ -2669,6 +2669,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control >> ctrlval) >> int cpu, ret = 0; >> cpu_maps_update_begin(); >> + rcu_sync_enter(&cpu_hotplug_lock.rss); >> for_each_online_cpu(cpu) { >> if (topology_is_primary_thread(cpu)) >> continue; >> @@ -2698,6 +2699,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control >> ctrlval) >> } >> if (!ret) >> cpu_smt_control = ctrlval; >> + rcu_sync_exit(&cpu_hotplug_lock.rss); >> cpu_maps_update_done(); >> return ret; >> } >> @@ -2715,6 +2717,7 @@ int cpuhp_smt_enable(void) >> int cpu, ret = 0; >> cpu_maps_update_begin(); >> + rcu_sync_enter(&cpu_hotplug_lock.rss); >> cpu_smt_control = CPU_SMT_ENABLED; >> for_each_present_cpu(cpu) { >> /* Skip online CPUs and CPUs on offline nodes */ >> @@ -2728,6 +2731,7 @@ int cpuhp_smt_enable(void) >> /* See comment in cpuhp_smt_disable() */ >> cpuhp_online_cpu_device(cpu); >> } >> + rcu_sync_exit(&cpu_hotplug_lock.rss); >> cpu_maps_update_done(); >> return ret; >> } > > > Hi, > > I verified this patch using the configuration described below. > Configuration: > • Kernel version: 6.19.0-rc6 > • Number of CPUs: 1536 > > Earlier verification of an older version of this patch was performed > on a system with *2048 CPUs*. Due to system unavailability, the > current verification was carried out on a *different system.* > > > Using this setup, I evaluated the patch with both SMT enabled and SMT > disabled. patch shows a significant improvement in the SMT=off case > and a measurable improvement in the SMT=on case. > The results indicate that when SMT is enabled, the system time is > noticeably higher. In contrast, with SMT disabled, no significant > increase in system time is observed. 
> SMT=ON  -> sys 50m42.805s
> SMT=OFF -> sys  0m0.064s
>
> SMT Mode | Without Patch | With Patch  | % Improvement |
> ------------------------------------------------------------------
> SMT=off  | 20m 32.210s   | 5m 30.898s  | +73.15%       |
> SMT=on   | 62m 46.549s   | 55m 45.671s | +11.18%       |
>
> Please add below tag: Tested-by: Samir M <samir@linux.ibm.com>
>
> Regards,
> Samir

Hi All,

Apologies for the confusion in my previous report. I had used the b4 am
command to apply the patch, which applied all patches in the mail thread
rather than only the intended one. The results posted earlier therefore
included changes from multiple patches and cannot be considered valid for
evaluating this patch in isolation.

We have since tested Peter's patch separately, applied in isolation. Based
on this testing, we did not observe any improvement in the SMT switch
times.

Configuration:
  Kernel version: 6.19.0-rc6
  Number of CPUs: 1536

SMT Mode | Without Patch | With Patch  | % Improvement |
------------------------------------------------------------------
SMT=off  | 20m 32.210s   | 20m22.441s  | +0.79%        |
SMT=on   | 62m 46.549s   | 63m0.532s   | -0.37%        |

Regards,
Samir
* [RESEND] [PATCH] cpuhp: Expedite synchronize_rcu during SMT switch 2026-01-13 9:01 ` Peter Zijlstra 2026-01-19 10:47 ` [PATCH] cpuhp: Expedite synchronize_rcu during SMT switch Vishal Chourasia @ 2026-01-19 10:54 ` Vishal Chourasia 1 sibling, 0 replies; 54+ messages in thread From: Vishal Chourasia @ 2026-01-19 10:54 UTC (permalink / raw) To: peterz Cc: boqun.feng, frederic, joelagnelf, josh, linux-kernel, neeraj.upadhyay, paulmck, rcu, rostedt, srikar, sshegde, tglx, urezki, samir, vishalc [Edits] Added Suggested-by tag Expedite synchronize_rcu() during the cpuhp_smt_[enable|disable] path to accelerate the operation. Bulk CPU hotplug operations—such as switching SMT modes across all cores—require hotplugging multiple CPUs in rapid succession. On large systems, this process takes significant time, increasing as the number of CPUs to hotplug during SMT switch grows, leading to substantial delays on high-core-count machines. Analysis [1] reveals that the majority of this time is spent waiting for synchronize_rcu(). SMT switch is a user-initiated administrative task, it should complete as quickly as possible. 
Performance data on a PPC64 system with 2048 CPUs:

+ ppc64_cpu --smt=1 (SMT8 to SMT1)
Before: real 30m53.194s
After:  real  6m4.678s   # ~5x improvement

+ ppc64_cpu --smt=8 (SMT1 to SMT8)
Before: real 49m5.920s
After:  real 36m47.798s  # ~1.3x improvement

[1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com
[2] https://lore.kernel.org/all/20260113090153.GS830755@noisy.programming.kicks-ass.net/

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>
Tested-by: Samir M <samir@linux.ibm.com>
---
 include/linux/rcupdate.h | 3 +++
 kernel/cpu.c             | 4 ++++
 2 files changed, 7 insertions(+)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index c5b30054cd01..03c06cfb2b6d 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -1192,6 +1192,9 @@ rcu_head_after_call_rcu(struct rcu_head *rhp, rcu_callback_t f)
 extern int rcu_expedited;
 extern int rcu_normal;
 
+extern void rcu_expedite_gp(void);
+extern void rcu_unexpedite_gp(void);
+
 DEFINE_LOCK_GUARD_0(rcu,
 	do {
 		rcu_read_lock();
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 8df2d773fe3b..a264d7170842 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2669,6 +2669,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
 	int cpu, ret = 0;
 
 	cpu_maps_update_begin();
+	rcu_expedite_gp();
 	for_each_online_cpu(cpu) {
 		if (topology_is_primary_thread(cpu))
 			continue;
@@ -2698,6 +2699,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
 	}
 	if (!ret)
 		cpu_smt_control = ctrlval;
+	rcu_unexpedite_gp();
 	cpu_maps_update_done();
 	return ret;
 }
@@ -2716,6 +2718,7 @@ int cpuhp_smt_enable(void)
 
 	cpu_maps_update_begin();
 	cpu_smt_control = CPU_SMT_ENABLED;
+	rcu_expedite_gp();
 	for_each_present_cpu(cpu) {
 		/* Skip online CPUs and CPUs on offline nodes */
 		if (cpu_online(cpu) || !node_online(cpu_to_node(cpu)))
@@ -2728,6 +2731,7 @@ int cpuhp_smt_enable(void)
 		/* See comment in cpuhp_smt_disable() */
 		cpuhp_online_cpu_device(cpu);
 	}
+	rcu_unexpedite_gp();
 	cpu_maps_update_done();
 	return ret;
 }
-- 
2.52.0

^ permalink raw reply related	[flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations
  2026-01-12  9:43 [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations Vishal Chourasia
                   ` (3 preceding siblings ...)
  2026-01-12 14:24 ` Peter Zijlstra
@ 2026-01-18 11:38 ` Samir M
  2026-01-19  5:18   ` Joel Fernandes
  4 siblings, 1 reply; 54+ messages in thread
From: Samir M @ 2026-01-18 11:38 UTC (permalink / raw)
  To: Vishal Chourasia, rcu, linux-kernel
  Cc: paulmck, frederic, neeraj.upadhyay, joelagnelf, josh, boqun.feng,
	urezki, rostedt, tglx, peterz, sshegde, srikar

On 12/01/26 3:13 pm, Vishal Chourasia wrote:
> Bulk CPU hotplug operations—such as switching SMT modes across all
> cores—require hotplugging multiple CPUs in rapid succession. On large
> systems, this process takes significant time, increasing as the number
> of CPUs grows, leading to substantial delays on high-core-count
> machines. Analysis [1] reveals that the majority of this time is spent
> waiting for synchronize_rcu().
>
> Expedite synchronize_rcu() during the hotplug path to accelerate the
> operation. Since CPU hotplug is a user-initiated administrative task,
> it should complete as quickly as possible.
>
> Performance data on a PPC64 system with 400 CPUs:
>
> + ppc64_cpu --smt=1 (SMT8 to SMT1)
> Before: real 1m14.792s
> After:  real 0m03.205s  # ~23x improvement
>
> + ppc64_cpu --smt=8 (SMT1 to SMT8)
> Before: real 2m27.695s
> After:  real 0m02.510s  # ~58x improvement
>
> Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b
>
> [1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com
>
> Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>
> ---
>  include/linux/rcupdate.h | 3 +++
>  kernel/cpu.c             | 2 ++
>  2 files changed, 5 insertions(+)
>
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index c5b30054cd01..03c06cfb2b6d 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -1192,6 +1192,9 @@ rcu_head_after_call_rcu(struct rcu_head *rhp, rcu_callback_t f)
>  extern int rcu_expedited;
>  extern int rcu_normal;
>
> +extern void rcu_expedite_gp(void);
> +extern void rcu_unexpedite_gp(void);
> +
>  DEFINE_LOCK_GUARD_0(rcu,
>  	do {
>  		rcu_read_lock();
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 8df2d773fe3b..6b0d491d73f4 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -506,12 +506,14 @@ EXPORT_SYMBOL_GPL(cpus_read_unlock);
>
>  void cpus_write_lock(void)
>  {
> +	rcu_expedite_gp();
>  	percpu_down_write(&cpu_hotplug_lock);
>  }
>
>  void cpus_write_unlock(void)
>  {
>  	percpu_up_write(&cpu_hotplug_lock);
> +	rcu_unexpedite_gp();
>  }
>
>  void lockdep_assert_cpus_held(void)

Hi Vishal,

I verified this patch using the configuration described below.

Configuration:
• Kernel version: 6.19.0-rc5
• Number of CPUs: 2048

Using this setup, I evaluated the patch with both SMT enabled and SMT
disabled. The patch shows a significant improvement in the SMT=off case
and a measurable improvement in the SMT=on case. The results indicate
that when SMT is enabled, the system time is noticeably higher. In
contrast, with SMT disabled, no significant increase in system time is
observed.
SMT=ON  -> sys 31m18.849s
SMT=OFF -> sys 0m0.087s

SMT Mode | Without Patch | With Patch  | % Improvement |
------------------------------------------------------------------
SMT=off  | 30m 53.194s   | 6m 4.250s   | +80.40%       |
SMT=on   | 49m 5.920s    | 36m 50.386s | +25.01%       |

Please add below tag:
Tested-by: Samir M <samir@linux.ibm.com>

Regards,
Samir

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations
  2026-01-18 11:38 ` [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations Samir M
@ 2026-01-19  5:18   ` Joel Fernandes
  2026-01-19 13:53     ` Shrikanth Hegde
  2026-02-02  8:46     ` Vishal Chourasia
  0 siblings, 2 replies; 54+ messages in thread
From: Joel Fernandes @ 2026-01-19  5:18 UTC (permalink / raw)
  To: Samir M, Vishal Chourasia, rcu, linux-kernel
  Cc: paulmck, frederic, neeraj.upadhyay, josh, boqun.feng, urezki,
	rostedt, tglx, peterz, sshegde, srikar

On Sun, Jan 18, 2026 at 05:08:44PM +0530, Samir M wrote:
> On 12/01/26 3:13 pm, Vishal Chourasia wrote:
> > Bulk CPU hotplug operations--such as switching SMT modes across all
> > cores--require hotplugging multiple CPUs in rapid succession. On large
> > systems, this process takes significant time, increasing as the number
> > of CPUs grows, leading to substantial delays on high-core-count
> > machines. Analysis [1] reveals that the majority of this time is spent
> > waiting for synchronize_rcu().
> >
> > Expedite synchronize_rcu() during the hotplug path to accelerate the
> > operation. Since CPU hotplug is a user-initiated administrative task,
> > it should complete as quickly as possible.
>
> Hi Vishal,
>
> I verified this patch using the configuration described below.
> Configuration:
> * Kernel version: 6.19.0-rc5
> * Number of CPUs: 2048
>
> SMT Mode | Without Patch | With Patch  | % Improvement |
> ------------------------------------------------------------------
> SMT=off  | 30m 53.194s   | 6m 4.250s   | +80.40%       |
> SMT=on   | 49m 5.920s    | 36m 50.386s | +25.01%       |

Hi Vishal, Samir,

Thanks for the testing on your large CPU count system.

Considering the SMT=on performance is still terrible, before we expedite
RCU, could we try the approach Peter suggested (avoiding repeated
lock/unlock)? I wrote a patch below.
git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git
tag: cpuhp-bulk-optimize-rfc-v1

I tested it lightly on the rcutorture hotplug test and it passes. Please
share any performance results, thanks.

Also, I'd like to use expediting of RCU as a last resort, TBH; we should
optimize the outer operations that require RCU in the first place, such
as Peter's suggestion, since that will improve the overall efficiency of
the code. And if/when expediting RCU, Peter's other suggestion to not do
it in cpus_write_lock() and instead do it from cpuhp_smt_enable() also
makes sense to me.

---8<-----------------------

From: Joel Fernandes <joelagnelf@nvidia.com>
Subject: [PATCH] cpuhp: Optimize batch SMT enable by reducing lock acquiring

Bulk CPU hotplug operations such as enabling SMT across all cores
require hotplugging multiple CPUs. The current implementation takes
cpus_write_lock() for each individual CPU, causing multiple slow grace
period requests.

Therefore, introduce cpu_up_locked(), which assumes the caller already
holds cpus_write_lock(). The cpuhp_smt_enable() function is updated to
hold the lock once around the entire loop rather than for each CPU.

Link: https://lore.kernel.org/all/20260113090153.GS830755@noisy.programming.kicks-ass.net/
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/cpu.c | 40 +++++++++++++++++++++++++---------------
 1 file changed, 25 insertions(+), 15 deletions(-)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 8df2d773fe3b..4ce7deb236d7 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1623,34 +1623,31 @@ void cpuhp_online_idle(enum cpuhp_state state)
 	complete_ap_thread(st, true);
 }
 
-/* Requires cpu_add_remove_lock to be held */
-static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state target)
+/* Requires cpu_add_remove_lock and cpus_write_lock to be held. */
+static int cpu_up_locked(unsigned int cpu, int tasks_frozen,
+			 enum cpuhp_state target)
 {
 	struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
 	struct task_struct *idle;
 	int ret = 0;
 
-	cpus_write_lock();
+	lockdep_assert_cpus_held();
 
-	if (!cpu_present(cpu)) {
-		ret = -EINVAL;
-		goto out;
-	}
+	if (!cpu_present(cpu))
+		return -EINVAL;
 
 	/*
 	 * The caller of cpu_up() might have raced with another
 	 * caller. Nothing to do.
 	 */
 	if (st->state >= target)
-		goto out;
+		return 0;
 
 	if (st->state == CPUHP_OFFLINE) {
 		/* Let it fail before we try to bring the cpu up */
 		idle = idle_thread_get(cpu);
-		if (IS_ERR(idle)) {
-			ret = PTR_ERR(idle);
-			goto out;
-		}
+		if (IS_ERR(idle))
+			return PTR_ERR(idle);
 
 		/*
 		 * Reset stale stack state from the last time this CPU was online.
@@ -1673,7 +1670,7 @@ static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state target)
 		 * return the error code..
 		 */
 		if (ret)
-			goto out;
+			return ret;
 	}
 
 	/*
@@ -1683,7 +1680,16 @@ static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state target)
 	 */
 	target = min((int)target, CPUHP_BRINGUP_CPU);
 	ret = cpuhp_up_callbacks(cpu, st, target);
-out:
+	return ret;
+}
+
+/* Requires cpu_add_remove_lock to be held */
+static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state target)
+{
+	int ret;
+
+	cpus_write_lock();
+	ret = cpu_up_locked(cpu, tasks_frozen, target);
 	cpus_write_unlock();
 	arch_smt_update();
 	return ret;
@@ -2715,6 +2721,8 @@ int cpuhp_smt_enable(void)
 	int cpu, ret = 0;
 
 	cpu_maps_update_begin();
+	/* Hold cpus_write_lock() for entire batch operation. */
+	cpus_write_lock();
 	cpu_smt_control = CPU_SMT_ENABLED;
 	for_each_present_cpu(cpu) {
 		/* Skip online CPUs and CPUs on offline nodes */
@@ -2722,12 +2730,14 @@ int cpuhp_smt_enable(void)
 			continue;
 		if (!cpu_smt_thread_allowed(cpu) || !topology_is_core_online(cpu))
 			continue;
-		ret = _cpu_up(cpu, 0, CPUHP_ONLINE);
+		ret = cpu_up_locked(cpu, 0, CPUHP_ONLINE);
 		if (ret)
 			break;
 		/* See comment in cpuhp_smt_disable() */
 		cpuhp_online_cpu_device(cpu);
 	}
+	cpus_write_unlock();
+	arch_smt_update();
 	cpu_maps_update_done();
 	return ret;
 }
-- 
2.34.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations
  2026-01-19  5:18 ` Joel Fernandes
@ 2026-01-19 13:53   ` Shrikanth Hegde
  2026-01-19 21:10     ` joelagnelf
  0 siblings, 1 reply; 54+ messages in thread
From: Shrikanth Hegde @ 2026-01-19 13:53 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: paulmck, frederic, neeraj.upadhyay, josh, boqun.feng, urezki,
	rostedt, tglx, peterz, srikar, Samir M, Vishal Chourasia, rcu,
	linux-kernel

On 1/19/26 10:48 AM, Joel Fernandes wrote:
> On Sun, Jan 18, 2026 at 05:08:44PM +0530, Samir M wrote:
>> On 12/01/26 3:13 pm, Vishal Chourasia wrote:
>> > Bulk CPU hotplug operations--such as switching SMT modes across all
>> > cores--require hotplugging multiple CPUs in rapid succession. On large
>> > systems, this process takes significant time, increasing as the number
>> > of CPUs grows, leading to substantial delays on high-core-count
>> > machines. Analysis [1] reveals that the majority of this time is spent
>> > waiting for synchronize_rcu().
>> >
>> > Expedite synchronize_rcu() during the hotplug path to accelerate the
>> > operation. Since CPU hotplug is a user-initiated administrative task,
>> > it should complete as quickly as possible.
>>
>> Hi Vishal,
>>
>> I verified this patch using the configuration described below.
>> Configuration:
>> * Kernel version: 6.19.0-rc5
>> * Number of CPUs: 2048
>>
>> SMT Mode | Without Patch | With Patch  | % Improvement |
>> ------------------------------------------------------------------
>> SMT=off  | 30m 53.194s   | 6m 4.250s   | +80.40%       |
>> SMT=on   | 49m 5.920s    | 36m 50.386s | +25.01%       |
>
> Hi Vishal, Samir,
>
> Thanks for the testing on your large CPU count system.
>
> Considering the SMT=on performance is still terrible, before we expedite
> RCU, could we try the approach Peter suggested (avoiding repeated
> lock/unlock)? I wrote a patch below.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git
> tag: cpuhp-bulk-optimize-rfc-v1
>
> I tested it lightly on rcutorture hotplug test and it passes. Please share
> any performance results, thanks.
>
> Also I'd like to use expediting of RCU as a last resort TBH, we should
> optimize the outer operations that require RCU in the first place such as
> Peter's suggestion since that will improve the overall efficiency of the
> code. And if/when expediting RCU, Peter's other suggestion to not do it in
> cpus_write_lock() and instead do it from cpuhp_smt_enable() also makes
> sense to me.
>
> ---8<-----------------------
>
> From: Joel Fernandes <joelagnelf@nvidia.com>
> Subject: [PATCH] cpuhp: Optimize batch SMT enable by reducing lock
> acquiring
>
> Bulk CPU hotplug operations such as enabling SMT across all cores
> require hotplugging multiple CPUs. The current implementation takes
> cpus_write_lock() for each individual CPU causing multiple slow grace
> period requests.
>
> Therefore introduce cpu_up_locked() that assumes the caller already
> holds cpus_write_lock(). The cpuhp_smt_enable() function is updated to
> hold the lock once around the entire loop rather than for each CPU.
>
> Link: https://lore.kernel.org/all/20260113090153.GS830755@noisy.programming.kicks-ass.net/
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
>
> [ ... quoted diff snipped ... ]
>
> -- 
> 2.34.1

What about cpuhp_smt_disable?

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations
  2026-01-19 13:53 ` Shrikanth Hegde
@ 2026-01-19 21:10   ` joelagnelf
  0 siblings, 0 replies; 54+ messages in thread
From: joelagnelf @ 2026-01-19 21:10 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: paulmck, frederic, neeraj.upadhyay, josh, boqun.feng, urezki,
	rostedt, tglx, peterz, srikar, Samir M, Vishal Chourasia, rcu,
	linux-kernel

> On Jan 19, 2026, at 8:54 AM, Shrikanth Hegde wrote:
>
>> On 1/19/26 10:48 AM, Joel Fernandes wrote:
>>> On Sun, Jan 18, 2026 at 05:08:44PM +0530, Samir M wrote:
>>>> On 12/01/26 3:13 pm, Vishal Chourasia wrote:
>>>> > Bulk CPU hotplug operations--such as switching SMT modes across all
>>>> > cores--require hotplugging multiple CPUs in rapid succession. On large
>>>> > systems, this process takes significant time, increasing as the number
>>>> > of CPUs grows, leading to substantial delays on high-core-count
>>>> > machines. Analysis [1] reveals that the majority of this time is spent
>>>> > waiting for synchronize_rcu().
>>>> >
>>>> > Expedite synchronize_rcu() during the hotplug path to accelerate the
>>>> > operation. Since CPU hotplug is a user-initiated administrative task,
>>>> > it should complete as quickly as possible.
>>>>
>>>> Hi Vishal,
>>>>
>>>> I verified this patch using the configuration described below.
>>>> Configuration:
>>>> * Kernel version: 6.19.0-rc5
>>>> * Number of CPUs: 2048
>>>>
>>>> SMT Mode | Without Patch | With Patch  | % Improvement |
>>>> ------------------------------------------------------------------
>>>> SMT=off  | 30m 53.194s   | 6m 4.250s   | +80.40%       |
>>>> SMT=on   | 49m 5.920s    | 36m 50.386s | +25.01%       |
>>>
>>> Hi Vishal, Samir,
>>>
>>> Thanks for the testing on your large CPU count system.
>>>
>>> Considering the SMT=on performance is still terrible, before we expedite
>>> RCU, could we try the approach Peter suggested (avoiding repeated
>>> lock/unlock)? I wrote a patch below.
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git
>>> tag: cpuhp-bulk-optimize-rfc-v1
>>>
>>> I tested it lightly on rcutorture hotplug test and it passes. Please share
>>> any performance results, thanks.
>>>
>>> Also I'd like to use expediting of RCU as a last resort TBH, we should
>>> optimize the outer operations that require RCU in the first place such as
>>> Peter's suggestion since that will improve the overall efficiency of the
>>> code. And if/when expediting RCU, Peter's other suggestion to not do it in
>>> cpus_write_lock() and instead do it from cpuhp_smt_enable() also makes sense
>>> to me.
>>>
>>> ---8<-----------------------
>>>
>>> From: Joel Fernandes
>>> Subject: [PATCH] cpuhp: Optimize batch SMT enable by reducing lock acquiring
>>>
>>> Bulk CPU hotplug operations such as enabling SMT across all cores
>>> require hotplugging multiple CPUs. The current implementation takes
>>> cpus_write_lock() for each individual CPU causing multiple slow grace
>>> period requests.
>>>
>>> Therefore introduce cpu_up_locked() that assumes the caller already
>>> holds cpus_write_lock(). The cpuhp_smt_enable() function is updated to
>>> hold the lock once around the entire loop rather than for each CPU.
>>>
>>> Link: https://lore.kernel.org/all/20260113090153.GS830755@noisy.programming.kicks-ass.net/
>>> Suggested-by: Peter Zijlstra
>>> Signed-off-by: Joel Fernandes
>>>
>>> [ ... quoted diff snipped ... ]
>>>
>>> -- 
>>> 2.34.1
>
> What about cpuhp_smt_disable?

Doing this is not that easy on the disable path, AFAICS. Considering
that the enable path in the performance tests was much worse, I wanted
to contain it to that. This does not have to be the only fix though, but
one of the cures to get there.

Thanks.

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations
  2026-01-19  5:18 ` Joel Fernandes
  2026-01-19 13:53   ` Shrikanth Hegde
@ 2026-02-02  8:46   ` Vishal Chourasia
  1 sibling, 0 replies; 54+ messages in thread
From: Vishal Chourasia @ 2026-02-02  8:46 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Samir M, rcu, linux-kernel, paulmck, frederic, neeraj.upadhyay,
	josh, boqun.feng, urezki, rostedt, tglx, peterz, sshegde, srikar

On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote:
> Performance data on a PPC64 system with 400 CPUs:
>
> + ppc64_cpu --smt=1 (SMT8 to SMT1)
> Before: real 1m14.792s
> After:  real 0m03.205s  # ~23x improvement
>
> + ppc64_cpu --smt=8 (SMT1 to SMT8)
> Before: real 2m27.695s
> After:  real 0m02.510s  # ~58x improvement
>
> Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b

On Mon, Jan 19, 2026 at 12:18:35AM -0500, Joel Fernandes wrote:
> Hi Vishal, Samir,
>
> Thanks for the testing on your large CPU count system.
>
> Considering the SMT=on performance is still terrible, before we expedite
> RCU, could we try the approach Peter suggested (avoiding repeated
> lock/unlock)? I wrote a patch below.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git
> tag: cpuhp-bulk-optimize-rfc-v1
>
> I tested it lightly on rcutorture hotplug test and it passes. Please share
> any performance results, thanks.
>
> Also I'd like to use expediting of RCU as a last resort TBH, we should
> optimize the outer operations that require RCU in the first place such as
> Peter's suggestion since that will improve the overall efficiency of the
> code. And if/when expediting RCU, Peter's other suggestion to not do it in
> cpus_write_lock() and instead do it from cpuhp_smt_enable() also makes
> sense to me.
>
> ---8<-----------------------
>
> From: Joel Fernandes <joelagnelf@nvidia.com>
> Subject: [PATCH] cpuhp: Optimize batch SMT enable by reducing lock acquiring
>
> Bulk CPU hotplug operations such as enabling SMT across all cores
> require hotplugging multiple CPUs. The current implementation takes
> cpus_write_lock() for each individual CPU causing multiple slow grace
> period requests.
>
> Therefore introduce cpu_up_locked() that assumes the caller already
> holds cpus_write_lock(). The cpuhp_smt_enable() function is updated to
> hold the lock once around the entire loop rather than for each CPU.
>
> Link: https://lore.kernel.org/all/20260113090153.GS830755@noisy.programming.kicks-ass.net/
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
>
> [ ... quoted diff snipped ... ]
>
> -- 
> 2.34.1

Hi Joel,

I tested the above patch on the 400-CPU machine that I had originally
posted the numbers for.
# time echo 1 > /sys/devices/system/cpu/smt/control
real 1m27.133s   # Base
real 1m25.859s   # With patch

# time echo 8 > /sys/devices/system/cpu/smt/control
real 1m0.682s    # Base
real 1m3.423s    # With patch

^ permalink raw reply	[flat|nested] 54+ messages in thread
end of thread, other threads:[~2026-02-03  6:32 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2026-01-12  9:43 [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations Vishal Chourasia
2026-01-12 10:08 ` Uladzislau Rezki
2026-01-12 10:43   ` Vishal Chourasia
2026-01-12 11:07     ` Uladzislau Rezki
2026-01-12 12:02       ` Shrikanth Hegde
2026-01-12 12:57         ` Uladzislau Rezki
2026-01-12 16:09           ` Joel Fernandes
2026-01-12 16:48             ` Paul E. McKenney
2026-01-12 17:05               ` Uladzislau Rezki
2026-01-12 18:27                 ` Vishal Chourasia
2026-01-13  0:03                   ` Paul E. McKenney
2026-01-12 22:24                 ` Joel Fernandes
2026-01-13  0:01                   ` Paul E. McKenney
2026-01-13  2:46                     ` Joel Fernandes
2026-01-13  4:53                       ` Shrikanth Hegde
2026-01-13  8:57                         ` Joel Fernandes
2026-01-14  4:00                           ` Paul E. McKenney
2026-01-14  8:54                             ` Joel Fernandes
2026-01-16 19:02                               ` Paul E. McKenney
2026-01-14  3:59                         ` Paul E. McKenney
2026-01-12 17:09               ` Uladzislau Rezki
2026-01-12 17:36                 ` Joel Fernandes
2026-01-13 12:18                   ` Uladzislau Rezki
2026-01-13 12:44                     ` Joel Fernandes
2026-01-13 14:17                       ` Uladzislau Rezki
2026-01-13 14:32                         ` Joel Fernandes
2026-01-13 14:53                           ` Shrikanth Hegde
2026-01-13 18:17                             ` Uladzislau Rezki
2026-01-13 17:58                           ` Uladzislau Rezki
2026-01-12 12:21 ` Shrikanth Hegde
2026-01-12 12:46   ` Vishal Chourasia
2026-01-12 14:03     ` Joel Fernandes
2026-01-12 14:20 ` Joel Fernandes
2026-01-12 14:23   ` Peter Zijlstra
2026-01-12 14:37     ` Joel Fernandes
2026-01-12 17:52     ` Vishal Chourasia
2026-01-12 14:24 ` Peter Zijlstra
2026-01-12 18:00   ` Vishal Chourasia
2026-01-13  9:01     ` Peter Zijlstra
2026-01-19 10:47       ` [PATCH] cpuhp: Expedite synchronize_rcu during SMT switch Vishal Chourasia
2026-01-19 11:43         ` Peter Zijlstra
2026-01-19 13:45           ` Shrikanth Hegde
2026-01-19 14:11             ` Peter Zijlstra
2026-01-19 14:45               ` Joel Fernandes
2026-01-19 14:59                 ` Peter Zijlstra
2026-01-27 17:48         ` Samir M
2026-01-29  7:05           ` Samir M
2026-02-03  6:31             ` Samir M
2026-01-19 10:54       ` [RESEND] [PATCH] cpuhp: Expedite synchronize_rcu during SMT switch Vishal Chourasia
2026-01-18 11:38 ` [PATCH] cpuhp: Expedite synchronize_rcu during CPU hotplug operations Samir M
2026-01-19  5:18   ` Joel Fernandes
2026-01-19 13:53     ` Shrikanth Hegde
2026-01-19 21:10       ` joelagnelf
2026-02-02  8:46     ` Vishal Chourasia
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox