From: Vishal Chourasia <vishalc@linux.ibm.com>
To: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Samir M <samir@linux.ibm.com>,
Joel Fernandes <joelagnelf@nvidia.com>,
peterz@infradead.org, aboorvad@linux.ibm.com,
boqun.feng@gmail.com, frederic@kernel.org, josh@joshtriplett.org,
linux-kernel@vger.kernel.org, neeraj.upadhyay@kernel.org,
rcu@vger.kernel.org, rostedt@goodmis.org, srikar@linux.ibm.com,
sshegde@linux.ibm.com, tglx@linutronix.de, urezki@gmail.com
Subject: Re: [PATCH v3 2/2] cpuhp: Expedite RCU grace periods during SMT operations
Date: Sat, 21 Mar 2026 00:19:35 +0530 [thread overview]
Message-ID: <ab2WvwjWNnJceaWS@linux.ibm.com> (raw)
In-Reply-To: <bde1a8b9-7f56-45fb-830c-038fa7b85f0d@paulmck-laptop>
Hi Paul,

Thank you for your response. Sorry I could not get back to you sooner;
I wanted to understand what happens behind the scenes after the cpuhp
kthread blocks in synchronize_rcu(), so I did a little more digging.
On a 320-CPU system, the SMT8 to SMT4 switch takes more than a minute to
complete: 160 CPUs are offlined one by one, and in total 321
synchronize_rcu() calls are invoked, averaging ~125 ms each (measured
with the ftrace sleep-time option set).
3298110.851011 | 316) cpuhp/3-1614 | | synchronize_rcu() {
3298111.010125 | 316) cpuhp/3-1614 | @ 159112.9 us | }
--
3298111.020432 | 0) kworker-29406 | | synchronize_rcu() {
3298111.190132 | 0) kworker-29406 | @ 169699.4 us | }
--
3298111.191327 | 317) cpuhp/3-1619 | | synchronize_rcu() {
3298111.350129 | 317) cpuhp/3-1619 | @ 158801.9 us | }
--
3298111.360263 | 0) kworker-29406 | | synchronize_rcu() {
3298111.530137 | 0) kworker-29406 | @ 169874.5 us | }
--
3298111.531098 | 318) cpuhp/3-1624 | | synchronize_rcu() {
3298111.650128 | 318) cpuhp/3-1624 | @ 119029.8 us | }
Breakdown of the time spent in a single synchronize_rcu() during the
sched_cpu_deactivate callback (CPU 4 was being offlined):

Summary:
--> cpuhp_enter (sched_cpu_deactivate)
    CB registration → AccWaitCB         ~10 ms    Waiting for softirq tick on CPU 4
    GP 220685125: FQS scan 1            ~10 ms    Tick delay + scan (all clear except
                                                  CPU 260; rcu_gp_kthread is running
                                                  on CPU 260)
    GP 220685125: wait for CPU 260      ~30 ms    FQS sleep interval, CPU 260 not yet
                                                  reported
    GP 220685125: FQS scan 2 + end      ~0.02 ms  CPU 260 clears
    GP 220685129: FQS scan 1            ~30 ms    Tick delay + full scan (same: CPU 260
                                                  holdout)
    GP 220685129: wait for CPU 260      ~30 ms    Same pattern
    GP 220685129: FQS scan 2 + end      ~0.02 ms  CPU 260 clears
    CB invocation + wakeup              ~10 ms    Softirq tick invokes wakeme_after_rcu
    destroy_sched_domains_rcu queueing  ~8 ms     322 call_rcu() callbacks
<-- cpuhp_exit (sched_cpu_deactivate)
I have collected some RCU static tracepoint data, which I am currently
going through.
On Fri, Mar 06, 2026 at 07:12:04AM -0800, Paul E. McKenney wrote:
> On Fri, Mar 06, 2026 at 11:14:13AM +0530, Vishal Chourasia wrote:
> > On Mon, Mar 02, 2026 at 05:17:16PM +0530, Samir M wrote:
> > >
> > > On 27/02/26 6:43 am, Joel Fernandes wrote:
> > > > On Wed, Feb 18, 2026 at 02:09:18PM +0530, Vishal Chourasia wrote:
> > > > > Expedite synchronize_rcu during the SMT mode switch operation when
> > > > > initiated via /sys/devices/system/cpu/smt/control interface
> > > > >
> > > > After the locking related changes in patch 1, is expediting still required? I
> > Yes.
> > > > am just a bit concerned that we are papering over the real issue of over
> > > > usage of synchronize_rcu() (which IIRC we discussed in earlier versions of
> > > > the patches that reducing the number of lock acquire/release was supposed to
> > > > help.)
> > At present, I am not sure about the underlying issue. So far, what I have
> > found is that when synchronize_rcu() is invoked, it marks the start of a
> > new grace period, say number "A". The thread invoking synchronize_rcu()
> > blocks until all CPUs have reported a QS for GP "A". There is an RCU
> > grace-period kthread that runs periodically, looping over a CPU list to
> > check whether all CPUs have reported a QS. In the trace, I find some CPUs
> > reporting a QS for a sequence number far in the past, e.g. A - N where
> > N > 10.
>
> This can happen when a CPU goes idle for multiple grace periods, then
> wakes up in the middle of a later grace period. This is (or at least is
> supposed to be) harmless because a quiescent state was reported on that
> CPU's behalf when RCU noticed that it was idle. The report is quashed
If it is harmless, can we consider just expediting the SMT mode switch
operation via the smt/control file [1]?

Thanks,
vishalc
[1] https://lore.kernel.org/all/20260218083915.660252-6-vishalc@linux.ibm.com/
> when RCU notices that the quiescent state being reported is for a grace
> period that has already completed. Grace-period counter wrap is handled
> by the infamous ->gpwrap field in the rcu_data structure.
>
> I have seen N having four digits, with deep embedded devices being most
> likely to have extremely large values of N.
>
> Thanx, Paul
>
> > > > Could you provide more justification of why expediting these sections is
> > > > required if the locking concerns were addressed? It would be great if you can
> > > > provide performance numbers with only the first patch and without the second
> > > > patch. That way we can quantify this patch.
> > > >
> > > >
> > > SMT Mode | Without Patch (Base) | Both patches applied | % Improvement |
> > > ---------|----------------------|----------------------|---------------|
> > > SMT=off  | 16m 13.956s          | 6m 18.435s           | +61.14 %      |
> > > SMT=on   | 12m 0.982s           | 5m 59.576s           | +50.10 %      |
> > >
> > > When I tested the below patch independently, I did not observe any
> > > improvements for either smt=on or smt=off. However, in the smt=off scenario,
> > > I encountered hung task splats (with call traces), where some threads were
> > > blocked on cpus_read_lock. Please also refer to the attached call trace
> > > below.
> > > Patch 1:
> > > https://lore.kernel.org/all/20260218083915.660252-4-vishalc@linux.ibm.com/
> > >
> > > SMT Mode | Without Patch (Base) | Just patch 1 applied | % Improvement |
> > > ---------|----------------------|----------------------|---------------|
> > > SMT=off  | 16m 13.956s          | 16m 9.793s           | +0.43 %       |
> > > SMT=on   | 12m 0.982s           | 12m 19.494s          | -2.57 %       |
> > >
> > >
> > > Call traces:
> > > [ 1477.612377] [ T8746] Tainted: G            E       7.0.0-rc1-150700.51-default-dirty #1
> > > [ 1477.612384] [ T8746] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > [ 1477.612389] [ T8746] task:systemd state:D stack:0 pid:1 tgid:1 ppid:0 task_flags:0x400100 flags:0x00040000
> > > [ 1477.612397] [ T8746] Call Trace:
> > > [ 1477.612399] [ T8746] [c00000000cc0f4f0] [0000000000100000] 0x100000 (unreliable)
> > > [ 1477.612416] [ T8746] [c00000000cc0f6a0] [c00000000001fe5c] __switch_to+0x1dc/0x290
> > > [ 1477.612425] [ T8746] [c00000000cc0f6f0] [c0000000012598ac] __schedule+0x40c/0x1a70
> > > [ 1477.612433] [ T8746] [c00000000cc0f840] [c00000000125af58] schedule+0x48/0x1a0
> > > [ 1477.612439] [ T8746] [c00000000cc0f870] [c0000000002e27b8] percpu_rwsem_wait+0x198/0x200
> > > [ 1477.612445] [ T8746] [c00000000cc0f8f0] [c000000001262930] __percpu_down_read+0xb0/0x210
> > > [ 1477.612449] [ T8746] [c00000000cc0f930] [c00000000022f400] cpus_read_lock+0xc0/0xd0
> > > [ 1477.612456] [ T8746] [c00000000cc0f950] [c0000000003a6398] cgroup_procs_write_start+0x328/0x410
> > > [ 1477.612462] [ T8746] [c00000000cc0fa00] [c0000000003a9620] __cgroup_procs_write+0x70/0x2c0
> > > [ 1477.612468] [ T8746] [c00000000cc0fac0] [c0000000003a98e8] cgroup_procs_write+0x28/0x50
> > > [ 1477.612473] [ T8746] [c00000000cc0faf0] [c0000000003a1624] cgroup_file_write+0xb4/0x240
> > > [ 1477.612478] [ T8746] [c00000000cc0fb50] [c000000000853ba8] kernfs_fop_write_iter+0x1a8/0x2a0
> > > [ 1477.612485] [ T8746] [c00000000cc0fba0] [c000000000733d5c] vfs_write+0x27c/0x540
> > > [ 1477.612491] [ T8746] [c00000000cc0fc50] [c000000000734350] ksys_write+0x80/0x150
> > > [ 1477.612495] [ T8746] [c00000000cc0fca0] [c000000000032898] system_call_exception+0x148/0x320
> > > [ 1477.612500] [ T8746] [c00000000cc0fe50] [c00000000000d6a0] system_call_common+0x160/0x2c4
> > > [ 1477.612506] [ T8746] ---- interrupt: c00 at 0x7fffa8f73df4
> > > [ 1477.612509] [ T8746] NIP: 00007fffa8f73df4 LR: 00007fffa8eb6144 CTR: 0000000000000000
> > > [ 1477.612512] [ T8746] REGS: c00000000cc0fe80 TRAP: 0c00 Tainted: G            E      (7.0.0-rc1-150700.51-default-dirty)
> > > [ 1477.612515] [ T8746] MSR: 800000000000d033 <SF,EE,PR,ME,IR,DR,RI,LE> CR: 28002288 XER: 00000000
> > >
> > >
> >
> > The default hung-task timeout is set to 8 minutes.
> >
> > $ grep . /proc/sys/kernel/hung_task_timeout_secs
> > /proc/sys/kernel/hung_task_timeout_secs:480
> >
> > Now that cpus_write_lock is taken once for the entire operation, and an
> > SMT mode switch can take tens of minutes to complete before relinquishing
> > the lock, threads waiting on cpus_read_lock will be blocked for that
> > entire duration.
> >
> > Although no splats were observed in the "both patches applied" case, the
> > issue still remains.
> >
> > regards,
> > vishal
Thread overview: 11+ messages
2026-02-18 8:39 [PATCH v3 0/2] cpuhp: Improve SMT switch time via lock batching and RCU expedition Vishal Chourasia
2026-02-18 8:39 ` [PATCH v3 1/2] cpuhp: Optimize SMT switch operation by batching lock acquisition Vishal Chourasia
2026-03-25 19:09 ` Thomas Gleixner
2026-03-26 10:06 ` Vishal Chourasia
2026-02-18 8:39 ` [PATCH v3 2/2] cpuhp: Expedite RCU grace periods during SMT operations Vishal Chourasia
2026-02-27 1:13 ` Joel Fernandes
2026-03-02 11:47 ` Samir M
2026-03-06 5:44 ` Vishal Chourasia
2026-03-06 15:12 ` Paul E. McKenney
2026-03-20 18:49 ` Vishal Chourasia [this message]
2026-03-25 19:10 ` Thomas Gleixner