* [PATCH sched/core] sched/rt: Fix RT_PUSH_IPI soft lockup loop
@ 2026-05-06 23:57 Tejun Heo
2026-05-07 14:14 ` Peter Zijlstra
0 siblings, 1 reply; 17+ messages in thread
From: Tejun Heo @ 2026-05-06 23:57 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Steven Rostedt
Cc: Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
K Prateek Nayak, Kyle McMartin, linux-kernel, Tejun Heo, stable
push_rt_task() picks the highest pushable RT task next_task. If it
outranks rq->donor, the existing path calls resched_curr() and
returns 0, trusting local schedule() to pick next_task soon.
The RT_PUSH_IPI relay caller (rto_push_irq_work_func()) cannot rely
on that. When this CPU has a steady supply of softirq work (e.g.,
incoming packets), the next push IPI arrives before schedule() can
run. Other CPUs keep seeing this CPU as overloaded and keep sending
IPIs, this CPU keeps taking the same bail, and the loop repeats
until soft lockup.
Seen in production on hosts with sustained NET_RX softirq load:
the loop ran millions of iterations before tripping the soft-lockup
watchdog.
Skip the prio bail when called via the IPI relay (pull=true) so
push_rt_task() migrates next_task to another CPU. Verified with a
synthetic reproducer.
Fixes: b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")
Cc: Kyle McMartin <jkkm@meta.com>
Cc: stable@vger.kernel.org # v5.10+
Signed-off-by: Tejun Heo <tj@kernel.org>
---
This looks minimal to me, but happy for suggestions. Thanks.
kernel/sched/rt.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1968,8 +1968,14 @@ retry:
* It's possible that the next_task slipped in of
* higher priority than current. If that's the case
* just reschedule current.
+ *
+ * This doesn't work for the IPI relay caller (pull). When this CPU
+ * has a steady supply of softirq work (e.g., incoming packets), the
+ * next push IPI arrives before schedule() can run. Other CPUs keep
+ * seeing it as overloaded and keep sending IPIs, this CPU keeps
+ * taking the same bail, and the loop repeats until soft lockup.
*/
- if (unlikely(next_task->prio < rq->donor->prio)) {
+ if (unlikely(next_task->prio < rq->donor->prio) && !pull) {
resched_curr(rq);
return 0;
}
^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH sched/core] sched/rt: Fix RT_PUSH_IPI soft lockup loop
From: Peter Zijlstra @ 2026-05-07 14:14 UTC (permalink / raw)
To: Tejun Heo
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Steven Rostedt,
    Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
    K Prateek Nayak, Kyle McMartin, linux-kernel, stable

On Wed, May 06, 2026 at 01:57:16PM -1000, Tejun Heo wrote:
> push_rt_task() picks the highest pushable RT task next_task. If it
> outranks rq->donor, the existing path calls resched_curr() and
> returns 0, trusting local schedule() to pick next_task soon.
>
> The RT_PUSH_IPI relay caller (rto_push_irq_work_func()) cannot rely
> on that. When this CPU has a steady supply of softirq work (e.g.,
> incoming packets), the next push IPI arrives before schedule() can
> run. Other CPUs keep seeing this CPU as overloaded and keep sending
> IPIs, this CPU keeps taking the same bail, and the loop repeats
> until soft lockup.
>
> Seen in production on hosts with sustained NET_RX softirq load:
> the loop ran millions of iterations before tripping the soft-lockup
> watchdog.
>
> Skip the prio bail when called via the IPI relay (pull=true) so
> push_rt_task() migrates next_task to another CPU. Verified with a
> synthetic reproducer.
>
> Fixes: b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")
> Cc: Kyle McMartin <jkkm@meta.com>
> Cc: stable@vger.kernel.org # v5.10+
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
> This looks minimal to me, but happy for suggestions. Thanks.
>
>  kernel/sched/rt.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1968,8 +1968,14 @@ retry:
>  	 * It's possible that the next_task slipped in of
>  	 * higher priority than current. If that's the case
>  	 * just reschedule current.
> +	 *
> +	 * This doesn't work for the IPI relay caller (pull). When this CPU
> +	 * has a steady supply of softirq work (e.g., incoming packets), the
> +	 * next push IPI arrives before schedule() can run. Other CPUs keep
> +	 * seeing it as overloaded and keep sending IPIs, this CPU keeps
> +	 * taking the same bail, and the loop repeats until soft lockup.
>  	 */
> -	if (unlikely(next_task->prio < rq->donor->prio)) {
> +	if (unlikely(next_task->prio < rq->donor->prio) && !pull) {
>  		resched_curr(rq);
>  		return 0;
>  	}

IIRC Steve has a test for this stuff. If this breaks things, an
alternative is keeping a counter/limit on attempts or something.

--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1339,6 +1339,8 @@ struct rq {
 	unsigned int		nr_pinned;
 	unsigned int		push_busy;
 	struct cpu_stop_work	push_work;
+	unsigned int		rt_switches;
+	unsigned int		rt_push_resched;

 #ifdef CONFIG_SCHED_CORE
 	/* per rq */
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2941,6 +2941,13 @@ static int push_dl_task(struct rq *rq)
 	if (dl_task(rq->donor) &&
 	    dl_time_before(next_task->dl.deadline, rq->donor->dl.deadline) &&
 	    rq->curr->nr_cpus_allowed > 1) {
+		if (rq->rt_switches != rq->nr_switches) {
+			rq->rt_switches = rq->nr_switches;
+			rq->rt_push_resched = 0;
+		}
+		if (test_tsk_need_resched(rq->curr) && ++rq->rt_push_resched > 16)
+			return 1;
+
 		resched_curr(rq);
 		return 0;
 	}
* Re: [PATCH sched/core] sched/rt: Fix RT_PUSH_IPI soft lockup loop
From: Tejun Heo @ 2026-05-11 19:33 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Steven Rostedt,
    Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
    K Prateek Nayak, Kyle McMartin, linux-kernel, stable

On Thu, May 07, 2026 at 04:14:37PM +0200, Peter Zijlstra wrote:
> IIRC Steve has a test for this stuff. If this breaks things, an
> alternative is keeping a counter/limit on attempts or something.

Ping. For some reason, we're seeing this reliably now. Whichever way is
fine but it'd be nice to roll out something that's landing upstream.

Thanks.

-- 
tejun
* Re: [PATCH sched/core] sched/rt: Fix RT_PUSH_IPI soft lockup loop
From: Steven Rostedt @ 2026-05-12 15:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Tejun Heo, Ingo Molnar, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
    K Prateek Nayak, Kyle McMartin, linux-kernel, stable,
    Linux RT Development, Clark Williams, Sebastian Andrzej Siewior,
    John Kacur

[ Adding some RT folks ]

Also, Valentin, can you look at this, because I believe the issue was
introduced by your change (see below).

On Thu, 7 May 2026 16:14:37 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, May 06, 2026 at 01:57:16PM -1000, Tejun Heo wrote:
> > push_rt_task() picks the highest pushable RT task next_task. If it
> > outranks rq->donor, the existing path calls resched_curr() and
> > returns 0, trusting local schedule() to pick next_task soon.
> >
> > The RT_PUSH_IPI relay caller (rto_push_irq_work_func()) cannot rely
> > on that. When this CPU has a steady supply of softirq work (e.g.,
> > incoming packets), the next push IPI arrives before schedule() can
> > run. Other CPUs keep seeing this CPU as overloaded and keep sending
> > IPIs, this CPU keeps taking the same bail, and the loop repeats
> > until soft lockup.
> >
> > Seen in production on hosts with sustained NET_RX softirq load:
> > the loop ran millions of iterations before tripping the soft-lockup
> > watchdog.
> >
> > Skip the prio bail when called via the IPI relay (pull=true) so
> > push_rt_task() migrates next_task to another CPU. Verified with a
> > synthetic reproducer.
> >
> > Fixes: b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")

Wrong Fixes tag. That commit doesn't even have the code that you are
changing.

I think the correct commit is:

Fixes: 49bef33e4b87b ("sched/rt: Plug rt_mutex_setprio() vs push_rt_task() race")

Which adds that if statement that exits out of the code early.

> > Cc: Kyle McMartin <jkkm@meta.com>
> > Cc: stable@vger.kernel.org # v5.10+
> > Signed-off-by: Tejun Heo <tj@kernel.org>
> > ---
> > This looks minimal to me, but happy for suggestions. Thanks.
> >
> >  kernel/sched/rt.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -1968,8 +1968,14 @@ retry:
> >  	 * It's possible that the next_task slipped in of
> >  	 * higher priority than current. If that's the case
> >  	 * just reschedule current.
> > +	 *
> > +	 * This doesn't work for the IPI relay caller (pull). When this CPU
> > +	 * has a steady supply of softirq work (e.g., incoming packets), the
> > +	 * next push IPI arrives before schedule() can run. Other CPUs keep
> > +	 * seeing it as overloaded and keep sending IPIs, this CPU keeps
> > +	 * taking the same bail, and the loop repeats until soft lockup.
> >  	 */
> > -	if (unlikely(next_task->prio < rq->donor->prio)) {
> > +	if (unlikely(next_task->prio < rq->donor->prio) && !pull) {
> >  		resched_curr(rq);
> >  		return 0;
> >  	}
>
> IIRC Steve has a test for this stuff. If this breaks things, an
> alternative is keeping a counter/limit on attempts or something.

IIRC, the test we had was simply cyclictest that we ran with the
following parameters. From commit b6366f048e0ca ("sched/rt: Use IPI to
trigger RT task push migration instead of pulling"), it states it runs:

  cyclictest --numa -p95 -m -d0 -i100

The above runs a thread on each CPU at priority 95 and will sleep for
100us. Each thread should wake up at the same time. You can read the
commit message for more details but the tl;dr; is that without the IPI
push request, if one of the CPUs ran another RT task besides cyclictest,
then all the others would then ask to pull from it when the other CPUs
cyclictest would sleep.

Having over 100 CPUs send an IPI to pull a task when only the first one
would get it, caused a large latency. Especially since it took the rq
lock over and over again.

But, the code being fixed wasn't due to that commit, but due to the
commit that added the short cut of the logic. That commit fixes a race
with the normal call to push_rt_task() and I think the pull logic issue
was a side effect.

I agree with Tejun's change, it actually puts the logic for the IPI
pull back to what it was before commit 49bef33e4b87b. The bug was added
by the shortcut case to push_rt_task() that was only meant for the
!pull scenario. Adding !pull to the if conditional seems like the
correct change. Valentin, can you confirm please.

Please update the Fixes tag to point to the appropriate commit as well
as update the change log. With that:

Reviewed-by: Steven Rostedt <rostedt@goodmis.org>

-- Steve
* Re: [PATCH sched/core] sched/rt: Fix RT_PUSH_IPI soft lockup loop
From: Tejun Heo @ 2026-05-12 18:07 UTC (permalink / raw)
To: Steven Rostedt
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
    K Prateek Nayak, Kyle McMartin, linux-kernel, stable,
    Linux RT Development, Clark Williams, Sebastian Andrzej Siewior,
    John Kacur

Hello,

Looking at 49bef33e4b87 ("sched/rt: Plug rt_mutex_setprio() vs
push_rt_task() race"), the prio bail looks like it was already there
and only got moved up to retry:. For non-migration-disabled next_task
the bail fires at the same effective point both before and after, and
rto_push_irq_work_func() + rto_next_cpu() were already in their
current shape, so the loop seems reachable before the move too -
b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration
instead of pulling") looks like the actual origin.

Am I reading it wrong?

Thanks.

-- 
tejun
* Re: [PATCH sched/core] sched/rt: Fix RT_PUSH_IPI soft lockup loop
From: Steven Rostedt @ 2026-05-12 21:28 UTC (permalink / raw)
To: Tejun Heo
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
    K Prateek Nayak, Kyle McMartin, linux-kernel, stable,
    Linux RT Development, Clark Williams, Sebastian Andrzej Siewior,
    John Kacur

On Tue, 12 May 2026 08:07:58 -1000
Tejun Heo <tj@kernel.org> wrote:

> Hello,
>
> Looking at 49bef33e4b87 ("sched/rt: Plug rt_mutex_setprio() vs
> push_rt_task() race"), the prio bail looks like it was already there
> and only got moved up to retry:. For non-migration-disabled next_task
> the bail fires at the same effective point both before and after, and
> rto_push_irq_work_func() + rto_next_cpu() were already in their
> current shape, so the loop seems reachable before the move too -
> b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration
> instead of pulling") looks like the actual origin.
>
> Am I reading it wrong?

No, I missed the movement of that code. Which means I need to
understand the problem better.

I'm still wondering about the trigger of this. That shortcut means the
current process is of lower priority than the waiting tasks and a
simple schedule should happen. From your tests, can you see why a lower
process was running on the CPU instead of a higher priority process?

Also, the IPIs only happen when another CPU is about to schedule
something of lower priority where it tries to pull a task to it. From
your description, you are seeing a storm of IPIs from all these CPUs
before the first CPU could return from hard interrupt and schedule?

I'm thinking there may be something else wrong here.

Note, the RT_PUSH_IPI logic only has a single iteration happening. If
it is happening and another CPU wants to do a "push", it simply ups the
counter to try again. It doesn't send another IPI.

Do you have a trace that shows what is happening?

  # trace-cmd start -e sched_switch -e sched_waking -e irq -e workqueue
  # echo 1 > /proc/sys/kernel/traceoff_on_warning
  # trace-cmd extract

may be enough. May need to add some trace_printk()s into the IPI logic
code too.

-- Steve
* Re: [PATCH sched/core] sched/rt: Fix RT_PUSH_IPI soft lockup loop
From: Tejun Heo @ 2026-05-13 19:39 UTC (permalink / raw)
To: Steven Rostedt
Cc: Tejun Heo, Peter Zijlstra, Ingo Molnar, Juri Lelli,
    Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
    Valentin Schneider, K Prateek Nayak, Kyle McMartin, linux-kernel,
    stable, Linux RT Development, Clark Williams,
    Sebastian Andrzej Siewior, John Kacur

Hello,

Capturing this on the actual production hosts is awkward. It requires a
fleet with a particular management operation in flight, and while the
aggregate occurrence rate is reliable, which specific machine hits it
isn't predictable, so I haven't been able to catch one with tracing on.

Production context: hosts serve live network traffic and storage IO.
CPU util before lockup is moderate (~30-40% steady state), but the
moment-to-moment softirq work is bursty - traffic patterns, IO
completions, plus the PSI poll triggers that source the migratable
psimon. Once softirq processing falls behind on a CPU, work piles up.
With the prio-bail-without-clearing-overload path, the IPI storm forms
on top and amplifies the slowdown.

So, here's a capture from a synthetic reproducer that I think models
the dynamic and reaches the same end state.

Test box, 192 CPUs, kernel without the fix:

- Per-target hrtimer (HRTIMER_MODE_REL_PINNED_HARD) fires every
  750us. Each fire schedules one tasklet round-robin from a pool
  of 20k distinct tasklets. Each tasklet body is a 500us cpu_relax
  loop, standing in for "process one item of softirq work".

- Storm driver: 190 SCHED_FIFO-50 nanosleep loops on non-target
  CPUs drive tell_cpu_to_push from balance_rt. Two synthetic
  psimon-shaped kthreads (FIFO 1) bound to the targets to pin
  them into rto_mask.

Baseline (no storm helpers): ~85% softirq util, no lockup, runs
indefinitely. The reproducer's baseline is higher than production -
my guess is we need to scrape up against capacity to grow a backlog
with the fixed-shape workload here, while production gets the same
effect from bursty arrivals during brief slowdowns.

With the storm: walker IPI overhead stretches each tasklet body
from 500us to ~1.1ms. Service rate drops below arrival, backlog
grows ~430/s. After ~46s, one tasklet_action_common snapshot has
~20k tasklets which it processes serially in BH-disabled softirq
context. That's ~22s uninterruptible, watchdog fires.

Six soft-lockups in a 120s run:

  [61125.38] BUG: soft lockup - CPU#95 stuck for 22s! [kworker/95:0]
  [61145.38] BUG: soft lockup - CPU#47 stuck for 45s! [migration/47]
  [61173.38] BUG: soft lockup - CPU#47 stuck for 71s! [migration/47]
  [61197.38] BUG: soft lockup - CPU#95 stuck for 22s! [migration/95]
  [61209.38] BUG: soft lockup - CPU#47 stuck for 21s! [kworker/47:1]
  [61225.38] BUG: soft lockup - CPU#95 stuck for 48s! [migration/95]

Stack at fire:

  rt_storm_wedge_fn+0x22/0xe0
  tasklet_action_common+0x100/0x2b0
  handle_softirqs+0xbe/0x280
  __irq_exit_rcu+0x47/0x100
  sysvec_apic_timer_interrupt+0x3a/0x80    <- watchdog hrtimer
  asm_sysvec_apic_timer_interrupt+0x16/0x20
  RIP: 0033:0x...                          <- user task (rt_storm_hog)

Trace captured with your event list plus IPI:

  -e sched_switch -e sched_waking -e irq -e workqueue -e ipi
  -e irq_vectors:call_function_single_entry/exit
  -e irq_vectors:irq_work_entry/exit
  -e irq_vectors:reschedule_entry/exit
  -e irq_vectors:local_timer_entry/exit

Sliced to a 17s window around the first RCU stall + first soft-lockup,
filtered to CPUs 47 and 95, gzipped text (~11MB):

  https://drive.google.com/file/d/11AN6dyvOWiZLVNEEuVtQieRyAxJYbCbt/view?usp=sharing

Thanks.

-- 
tejun
* Re: [PATCH sched/core] sched/rt: Fix RT_PUSH_IPI soft lockup loop
From: Steven Rostedt @ 2026-05-14 0:24 UTC (permalink / raw)
To: Tejun Heo
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
    K Prateek Nayak, Kyle McMartin, linux-kernel, stable,
    Linux RT Development, Clark Williams, Sebastian Andrzej Siewior,
    John Kacur

On Wed, 13 May 2026 09:39:14 -1000
Tejun Heo <tj@kernel.org> wrote:

> So, here's a capture from a synthetic reproducer that I think models
> the dynamic and reaches the same end state.

Synthetic capture is fine.

> Test box, 192 CPUs, kernel without the fix:
>
> - Per-target hrtimer (HRTIMER_MODE_REL_PINNED_HARD) fires every
>   750us. Each fire schedules one tasklet round-robin from a pool
>   of 20k distinct tasklets. Each tasklet body is a 500us cpu_relax
>   loop, standing in for "process one item of softirq work".

So you are running a softirq for 500us every 750us?

This basically prevents any task from running on these CPUs while the
softirq is executing.

> - Storm driver: 190 SCHED_FIFO-50 nanosleep loops on non-target
>   CPUs drive tell_cpu_to_push from balance_rt. Two synthetic
>   psimon-shaped kthreads (FIFO 1) bound to the targets to pin
>   them into rto_mask.

What exactly are these synthetic kthreads doing. Have code to share?

> Baseline (no storm helpers): ~85% softirq util, no lockup, runs
> indefinitely. The reproducer's baseline is higher than production -
> my guess is we need to scrape up against capacity to grow a backlog
> with the fixed-shape workload here, while production gets the same
> effect from bursty arrivals during brief slowdowns.
>
> With the storm: walker IPI overhead stretches each tasklet body
> from 500us to ~1.1ms. Service rate drops below arrival, backlog
> grows ~430/s. After ~46s, one tasklet_action_common snapshot has
> ~20k tasklets which it processes serially in BH-disabled softirq
> context. That's ~22s uninterruptible, watchdog fires.

The IPI walker should only go to the CPUs with overloaded RT tasks. Are
you making all the CPUs have overloaded RT tasks?

> Six soft-lockups in a 120s run:
>
>   [61125.38] BUG: soft lockup - CPU#95 stuck for 22s! [kworker/95:0]
>   [61145.38] BUG: soft lockup - CPU#47 stuck for 45s! [migration/47]
>   [61173.38] BUG: soft lockup - CPU#47 stuck for 71s! [migration/47]
>   [61197.38] BUG: soft lockup - CPU#95 stuck for 22s! [migration/95]
>   [61209.38] BUG: soft lockup - CPU#47 stuck for 21s! [kworker/47:1]
>   [61225.38] BUG: soft lockup - CPU#95 stuck for 48s! [migration/95]
>
> Stack at fire:
>
>   rt_storm_wedge_fn+0x22/0xe0
>   tasklet_action_common+0x100/0x2b0
>   handle_softirqs+0xbe/0x280
>   __irq_exit_rcu+0x47/0x100
>   sysvec_apic_timer_interrupt+0x3a/0x80    <- watchdog hrtimer
>   asm_sysvec_apic_timer_interrupt+0x16/0x20
>   RIP: 0033:0x...                          <- user task (rt_storm_hog)
>
> Trace captured with your event list plus IPI:
>
>   -e sched_switch -e sched_waking -e irq -e workqueue -e ipi
>   -e irq_vectors:call_function_single_entry/exit
>   -e irq_vectors:irq_work_entry/exit
>   -e irq_vectors:reschedule_entry/exit
>   -e irq_vectors:local_timer_entry/exit
>
> Sliced to a 17s window around the first RCU stall + first
> soft-lockup, filtered to CPUs 47 and 95, gzipped text (~11MB):
>
>   https://drive.google.com/file/d/11AN6dyvOWiZLVNEEuVtQieRyAxJYbCbt/view?usp=sharing

So this is showing that the IPI logic is just extending the softirq
work load to something greater than the period of execution and causing
a live lock of softirqs.

This still doesn't explain to me why the current process is of a lower
priority than a waiting RT task.

I'm really starting to think you are fixing a symptom and not the cause.

-- Steve
* Re: [PATCH sched/core] sched/rt: Fix RT_PUSH_IPI soft lockup loop
From: Tejun Heo @ 2026-05-14 0:53 UTC (permalink / raw)
To: Steven Rostedt
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
    K Prateek Nayak, Kyle McMartin, linux-kernel, stable,
    Linux RT Development, Clark Williams, Sebastian Andrzej Siewior,
    John Kacur

Hello,

On Wed, May 13, 2026 at 08:24:32PM -0400, Steven Rostedt wrote:
> > - Per-target hrtimer (HRTIMER_MODE_REL_PINNED_HARD) fires every
> >   750us. Each fire schedules one tasklet round-robin from a pool
> >   of 20k distinct tasklets. Each tasklet body is a 500us cpu_relax
> >   loop, standing in for "process one item of softirq work".
>
> So you are running a softirq for 500us every 750us?
>
> This basically prevents any task from running on these CPUs while the
> softirq is executing.

Hmmm? The utilization is high at around 70%. It can still run something
and wouldn't lock up. The prod repro case isn't this high. More like
30-40%. It's just difficult to make synthetic repro reliable with that.

> > - Storm driver: 190 SCHED_FIFO-50 nanosleep loops on non-target
> >   CPUs drive tell_cpu_to_push from balance_rt. Two synthetic
> >   psimon-shaped kthreads (FIFO 1) bound to the targets to pin
> >   them into rto_mask.
>
> What exactly are these synthetic kthreads doing. Have code to share?

It's just looping set number of times. Here's the slop:

  https://gist.github.com/htejun/ba43a0a7bc6f6503602ada850f45ce4d

> The IPI walker should only go to the CPUs with overloaded RT tasks. Are
> you making all the CPUS have overloaded RT tasks?

Only 2 cpus are overloaded. I don't know why it used FIFO threads on
CPUs that aren't overloaded. It's just using that to pulse CPUs to
trigger need_pull_rt_task().

> So this is showing that the IPI logic is just extending the softirq work
> load to something greater than the period of execution and causing a live
> lock of softirqs.
>
> This still doesn't explain to me why the current process is of a lower
> priority than a waiting RT task.

1. The CPU was running a fair task.

2. IRQ triggers which creates softirq work.

3. Either IRQ, softirq or another CPU wakes up multiple RT tasks to the
   CPU.

4. The CPU enters softirq.

5. Other CPUs keep sending pull IPIs, slowing softirq processing.

6. Before softirq processing finishes, another IRQ happens which creates
   more softirq work. Go back to 4.

> I'm really starting to think you are fixing a symptom and not the cause.

It seems relatively straightforward to me. The CPU was relatively loaded
with irq/softirq. While in irq context, RT tasks wake up to it and then
the CPU gets hammered by pull IPIs to the point where it's constantly
chasing new softirq work and thus can't leave irq context in a
reasonable amount of time. What am I missing?

Thanks.

-- 
tejun
* Re: [PATCH sched/core] sched/rt: Fix RT_PUSH_IPI soft lockup loop
From: Steven Rostedt @ 2026-05-14 1:31 UTC (permalink / raw)
To: Tejun Heo
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
    K Prateek Nayak, Kyle McMartin, linux-kernel, stable,
    Linux RT Development, Clark Williams, Sebastian Andrzej Siewior,
    John Kacur

On Wed, 13 May 2026 14:53:21 -1000
Tejun Heo <tj@kernel.org> wrote:

> > This still doesn't explain to me why the current process is of a lower
> > priority than a waiting RT task.
>
> 1. The CPU was running a fair task.
>
> 2. IRQ triggers which creates softirq work.
>
> 3. Either IRQ, softirq or another CPU wakes up multiple RT tasks to the
>    CPU.
>
> 4. The CPU enters softirq.

OK, this is what I was missing. The fact that the CPU was running a
softirq at the time that was running for a very long time that prevents
the schedule from happening.

> 5. Other CPUs keep sending pull IPIs, slowing softirq processing.
>
> 6. Before softirq processing finishes, another IRQ happens which creates
>    more softirq work. Go back to 4.
>
> > I'm really starting to think you are fixing a symptom and not the cause.
>
> It seems relatively straightforward to me. The CPU was relatively loaded
> with irq/softirq. While in irq context, RT tasks wake up to it and then
> the CPU gets hammered by pull IPIs to the point where it's constantly
> chasing new softirq work and thus can't leave irq context in a
> reasonable amount of time. What am I missing?

So if the current task running is SCHED_OTHER we still need to handle
the case where the next task is pinned, as it will cause a warning
again if it tries to move the fair task, especially since that doesn't
fix the overloading.

I think this requires a bit more complex fix. Perhaps if the current
task is fair and the next task is pinned, it needs to look for the task
after that one to move.

-- Steve
* Re: [PATCH sched/core] sched/rt: Fix RT_PUSH_IPI soft lockup loop
From: Tejun Heo @ 2026-05-14 1:42 UTC (permalink / raw)
To: Steven Rostedt
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
    K Prateek Nayak, Kyle McMartin, linux-kernel, stable,
    Linux RT Development, Clark Williams, Sebastian Andrzej Siewior,
    John Kacur

Hello,

On Wed, May 13, 2026 at 09:31:08PM -0400, Steven Rostedt wrote:
> OK, this is what I was missing. The fact that the CPU was running a
> softirq at the time that was running for a very long time that prevents
> the schedule from happening.

Right, although, in prod case, I don't think each softirq invocation is
that long. It's maybe a few msecs, if that. However, there's a constant
stream of them and if you slow down the CPU enough with IPIs, the CPU
can't ever clear pending softirq although it only runs a short time
each time it enters softirq.

> So if the current task running is SCHED_OTHER we still need to handle
> the case where the next task is pinned, as it will cause a warning
> again if it tries to move the fair task, especially since that doesn't
> fix the overloading.
>
> I think this requires a bit more complex fix. Perhaps if the current
> task is fair and the next task is pinned, it needs to look for the task
> after that one to move.

I see. You know the code and history a lot better than I do. Wanna take
over?

Thanks.

-- 
tejun
* Re: [PATCH sched/core] sched/rt: Fix RT_PUSH_IPI soft lockup loop
From: Steven Rostedt @ 2026-05-14 2:01 UTC (permalink / raw)
To: Tejun Heo
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
    K Prateek Nayak, Kyle McMartin, linux-kernel, stable,
    Linux RT Development, Clark Williams, Sebastian Andrzej Siewior,
    John Kacur

On Wed, 13 May 2026 15:42:14 -1000
Tejun Heo <tj@kernel.org> wrote:

> Hello,
>
> On Wed, May 13, 2026 at 09:31:08PM -0400, Steven Rostedt wrote:
> > OK, this is what I was missing. The fact that the CPU was running a
> > softirq at the time that was running for a very long time that prevents
> > the schedule from happening.
>
> Right, although, in prod case, I don't think each softirq invocation is
> that long. It's maybe a few msecs, if that. However, there's a constant
> stream of them and if you slow down the CPU enough with IPIs, the CPU
> can't ever clear pending softirq although it only runs a short time
> each time it enters softirq.
>
> > So if the current task running is SCHED_OTHER we still need to handle
> > the case where the next task is pinned, as it will cause a warning
> > again if it tries to move the fair task, especially since that doesn't
> > fix the overloading.
> >
> > I think this requires a bit more complex fix. Perhaps if the current
> > task is fair and the next task is pinned, it needs to look for the task
> > after that one to move.
>
> I see. You know the code and history a lot better than I do. Wanna take
> over?

I could try, but there are still some things that I don't understand.
One is that to send more IPIs due to the RT pull request, there needs
to be RT tasks constantly sleeping. Is that happening in this use case?
Are the softirqs waking up RT tasks that run for a short time and go
back to sleep, causing the pull IPI to trigger again?

-- Steve
* Re: [PATCH sched/core] sched/rt: Fix RT_PUSH_IPI soft lockup loop
From: Tejun Heo @ 2026-05-14 4:48 UTC (permalink / raw)
To: Steven Rostedt
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
    K Prateek Nayak, Kyle McMartin, linux-kernel, stable,
    Linux RT Development, Clark Williams, Sebastian Andrzej Siewior,
    John Kacur

Hello,

On Wed, May 13, 2026 at 10:01:36PM -0400, Steven Rostedt wrote:
> I could try, but there are still some things that I don't understand.
> One is that to send more IPIs due to the RT pull request, there needs
> to be RT tasks constantly sleeping. Is that happening in this use case?
> Are the softirqs waking up RT tasks that run for a short time and go
> back to sleep, causing the pull IPI to trigger again?

Ah, yes, that makes sense. That's why the repro is using FIFO threads
too. In prod, there's mpi3mr threaded irq handlers that are FIFO. These
are storage machines so they're also constantly active.

Thanks.

-- 
tejun
* Re: [PATCH sched/core] sched/rt: Fix RT_PUSH_IPI soft lockup loop
  2026-05-14  4:48 ` Tejun Heo
@ 2026-05-14 14:03   ` Steven Rostedt
  2026-05-14 21:15     ` Tejun Heo
  0 siblings, 1 reply; 17+ messages in thread
From: Steven Rostedt @ 2026-05-14 14:03 UTC (permalink / raw)
To: Tejun Heo
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
    K Prateek Nayak, Kyle McMartin, linux-kernel, stable,
    Linux RT Development, Clark Williams, Sebastian Andrzej Siewior,
    John Kacur

On Wed, 13 May 2026 18:48:31 -1000
Tejun Heo <tj@kernel.org> wrote:

> Hello,
>
> On Wed, May 13, 2026 at 10:01:36PM -0400, Steven Rostedt wrote:
> > I could try, but there are still some things that I don't understand.
> > One is that to send more IPIs due to the RT pull request, there needs
> > to be RT tasks constantly sleeping. Is that happening in this use case?
> > Are the softirqs waking up RT tasks that run for a short time and go
> > back to sleep, causing the pull IPI to trigger again?
>
> Ah, yes, that makes sense. That's why the repro is using FIFO threads too.
> In prod, there's mpi3mr threaded irq handlers that are FIFO. These are
> storage machines so they're also constantly active.

I was thinking about this more and does disabling the RT_PUSH_IPI cause any
problems for you?

  # echo NO_RT_PUSH_IPI > /sys/kernel/debug/sched/features

The reason I ask is that I'm not sure the RT_PUSH_IPI even makes sense to
have enabled when CONFIG_IRQ_FORCED_THREADING is not enabled.

The reason the RT_PUSH_IPI was created in the first place was due to a kind
of "thundering herd" of taking the rq lock of the CPU that has an
overloaded set of RT tasks on it. When RT_PUSH_IPI is disabled, instead of
sending an IPI to the CPU to do a push, the CPU that is scheduling a lower
priority task takes the overloaded CPU's rq lock and will try to pull tasks
from it.

The issue that RT_PUSH_IPI solved was that if you had 100 CPUs all
scheduling a lower priority task at the same time, they would all try to
take the lock of the overloaded CPU. Only the first one would succeed in
pulling a task. The other 99 would finally get that lock and see that it
has no tasks to pull from. I found that this could cause 500us of latency
or more.

That 500us mattered a lot for PREEMPT_RT, but doesn't really matter if you
have softirqs running uninterruptible for 500us themselves.

I'm thinking that we could just have the following instead:

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9f63b15d309d..0a4f4a212cd6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -829,8 +829,14 @@ static inline int rt_bandwidth_enabled(void)
 	return sysctl_sched_rt_runtime >= 0;
 }
 
-/* RT IPI pull logic requires IRQ_WORK */
-#if defined(CONFIG_IRQ_WORK) && defined(CONFIG_SMP)
+/*
+ * RT IPI pull logic requires IRQ_WORK and doesn't make sense for
+ * uniprocessors. If CONFIG_IRQ_FORCED_THREADING isn't set, then softirqs
+ * do not run as threads and can cause latency larger than what RT_PUSH_IPI
+ * can save, killing the effect of it.
+ */
+#if defined(CONFIG_IRQ_WORK) && defined(CONFIG_SMP) && \
+	defined(CONFIG_IRQ_FORCED_THREADING)
 # define HAVE_RT_PUSH_IPI
 #endif

-- Steve
* Re: [PATCH sched/core] sched/rt: Fix RT_PUSH_IPI soft lockup loop
  2026-05-14 14:03 ` Steven Rostedt
@ 2026-05-14 21:15   ` Tejun Heo
  2026-05-14 23:43     ` Steven Rostedt
  0 siblings, 1 reply; 17+ messages in thread
From: Tejun Heo @ 2026-05-14 21:15 UTC (permalink / raw)
To: Steven Rostedt
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
    K Prateek Nayak, Kyle McMartin, linux-kernel, stable,
    Linux RT Development, Clark Williams, Sebastian Andrzej Siewior,
    John Kacur

Hello, Steven.

On Thu, May 14, 2026 at 10:03:00AM -0400, Steven Rostedt wrote:
> I was thinking about this more and does disabling the RT_PUSH_IPI cause any
> problems for you?
>
>   # echo NO_RT_PUSH_IPI > /sys/kernel/debug/sched/features

Not at all. This is actually the mitigation that we deployed across the
affected machines.

...
> -/* RT IPI pull logic requires IRQ_WORK */
> -#if defined(CONFIG_IRQ_WORK) && defined(CONFIG_SMP)
> +/*
> + * RT IPI pull logic requires IRQ_WORK and doesn't make sense for
> + * uniprocessors. If CONFIG_IRQ_FORCED_THREADING isn't set, then softirqs
> + * do not run as threads and can cause latency larger than what RT_PUSH_IPI
> + * can save, killing the effect of it.
> + */
> +#if defined(CONFIG_IRQ_WORK) && defined(CONFIG_SMP) && \
> +	defined(CONFIG_IRQ_FORCED_THREADING)
>  # define HAVE_RT_PUSH_IPI
>  #endif

Maybe it should trigger on force_irqthreads so that it's active only when
irq threads are actually enabled.

Whichever way it's done tho, wouldn't this still leave machines in that
config susceptible to IPI storms? It took a combination of factors to
trigger - mpi3mr's threaded irq, psimon activated by systemd, and sustained
network load - but those factors are not that exotic.

Thanks.

-- tejun
* Re: [PATCH sched/core] sched/rt: Fix RT_PUSH_IPI soft lockup loop
  2026-05-14 21:15 ` Tejun Heo
@ 2026-05-14 23:43   ` Steven Rostedt
  0 siblings, 0 replies; 17+ messages in thread
From: Steven Rostedt @ 2026-05-14 23:43 UTC (permalink / raw)
To: Tejun Heo
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
    K Prateek Nayak, Kyle McMartin, linux-kernel, stable,
    Linux RT Development, Clark Williams, Sebastian Andrzej Siewior,
    John Kacur

On Thu, 14 May 2026 11:15:31 -1000
Tejun Heo <tj@kernel.org> wrote:

> Hello, Steven.
>
> On Thu, May 14, 2026 at 10:03:00AM -0400, Steven Rostedt wrote:
> > I was thinking about this more and does disabling the RT_PUSH_IPI cause
> > any problems for you?
> >
> >   # echo NO_RT_PUSH_IPI > /sys/kernel/debug/sched/features
>
> Not at all. This is actually the mitigation that we deployed across the
> affected machines.
>
> ...
> > -/* RT IPI pull logic requires IRQ_WORK */
> > -#if defined(CONFIG_IRQ_WORK) && defined(CONFIG_SMP)
> > +/*
> > + * RT IPI pull logic requires IRQ_WORK and doesn't make sense for
> > + * uniprocessors. If CONFIG_IRQ_FORCED_THREADING isn't set, then softirqs
> > + * do not run as threads and can cause latency larger than what
> > + * RT_PUSH_IPI can save, killing the effect of it.
> > + */
> > +#if defined(CONFIG_IRQ_WORK) && defined(CONFIG_SMP) && \
> > +	defined(CONFIG_IRQ_FORCED_THREADING)
> >  # define HAVE_RT_PUSH_IPI
> >  #endif
>
> Maybe it should trigger on force_irqthreads so that it's active only when
> irq threads are actually enabled.

Well, PREEMPT_RT doesn't need force_irqthreads for this to be enabled. But
I could keep this configured like the above, and have the feature be
disabled on boot up if !PREEMPT_RT and force_irqthreads is not set.

> Whichever way it's done tho, wouldn't this still leave machines in that
> config susceptible to IPI storms? It took a combination of factors to
> trigger - mpi3mr's threaded irq, psimon activated by systemd, and
> sustained network load - but those factors are not that exotic.

With softirqs as threads it is highly unlikely to be a problem. The reason
you saw this was because the break out to schedule happened in a softirq
that prevented scheduling from occurring right away. With irqs as threads,
so are softirqs, and they wouldn't be able to cause the delay in scheduling
that you were experiencing.

I'll write up a patch tomorrow or next week.

Thanks!

-- Steve
* Re: [PATCH sched/core] sched/rt: Fix RT_PUSH_IPI soft lockup loop
  2026-05-12 15:37 ` Steven Rostedt
  2026-05-12 18:07   ` Tejun Heo
@ 2026-05-12 20:10   ` Valentin Schneider
  1 sibling, 0 replies; 17+ messages in thread
From: Valentin Schneider @ 2026-05-12 20:10 UTC (permalink / raw)
To: Steven Rostedt, Peter Zijlstra
Cc: Tejun Heo, Ingo Molnar, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Ben Segall, Mel Gorman, K Prateek Nayak,
    Kyle McMartin, linux-kernel, stable, Linux RT Development,
    Clark Williams, Sebastian Andrzej Siewior, John Kacur

On 12/05/26 11:37, Steven Rostedt wrote:
> [ Adding some RT folks ]
>
> Also, Valentin, can you look at this, because I believe the issue was
> introduced by your change (see below).
>

Woops!

> IIRC, the test we had was simply cyclictest that we ran with the following
> parameters. From commit b6366f048e0ca ("sched/rt: Use IPI to trigger RT
> task push migration instead of pulling"), it states it runs:
>
>   cyclictest --numa -p95 -m -d0 -i100
>
> The above runs a thread on each CPU at priority 95 and will sleep for
> 100us. Each thread should wake up at the same time. You can read the
> commit message for more details but the tl;dr is that without the IPI push
> request, if one of the CPUs ran another RT task besides cyclictest, then
> all the others would then ask to pull from it when the other CPUs'
> cyclictest would sleep. Having over 100 CPUs send an IPI to pull a task
> when only the first one would get it, caused a large latency. Especially
> since it took the rq lock over and over again.
>
> But, the code being fixed wasn't due to that commit, but due to the commit
> that added the shortcut of the logic. That commit fixes a race with the
> normal call to push_rt_task() and I think the pull logic issue was a side
> effect.
>
> I agree with Tejun's change, it actually puts the logic for the IPI pull
> back to what it was before commit 49bef33e4b87b. The bug was added by the
> shortcut case to push_rt_task() that was only meant for the !pull
> scenario. Adding !pull to the if conditional seems like the correct
> change.
>
> Valentin, can you confirm please.
>

So looking back at the original report for my patch:

  https://lore.kernel.org/all/Yb3vXx3DcqVOi+EA@donbot/

the splat happened through rto_push_irq_work_func(), i.e. with pull=true
(that naming always causes me to shuffle through my notes; AFAICT that's
because it's when push_rt_task() is invoked due to a pull_rt_task() call,
but urgh).

So IIUC I'm afraid the suggested fix would cause the original issue to
resurface, but that still leaves us with the reported softlock issue.

I don't have any inspiration so far, I'll sleep on it.

> Please update the Fixes tag to point to the appropriate commit as well as
> update the change log. With that:
>
> Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
>
> -- Steve
end of thread, other threads:[~2026-05-14 23:43 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-05-06 23:57 [PATCH sched/core] sched/rt: Fix RT_PUSH_IPI soft lockup loop Tejun Heo
2026-05-07 14:14 ` Peter Zijlstra
2026-05-11 19:33   ` Tejun Heo
2026-05-12 15:37     ` Steven Rostedt
2026-05-12 18:07       ` Tejun Heo
2026-05-12 21:28         ` Steven Rostedt
2026-05-13 19:39           ` Tejun Heo
2026-05-14  0:24             ` Steven Rostedt
2026-05-14  0:53               ` Tejun Heo
2026-05-14  1:31                 ` Steven Rostedt
2026-05-14  1:42                   ` Tejun Heo
2026-05-14  2:01                     ` Steven Rostedt
2026-05-14  4:48                       ` Tejun Heo
2026-05-14 14:03                         ` Steven Rostedt
2026-05-14 21:15                           ` Tejun Heo
2026-05-14 23:43                             ` Steven Rostedt
2026-05-12 20:10       ` Valentin Schneider