* [PATCH sched/core] sched/rt: Fix RT_PUSH_IPI soft lockup loop
@ 2026-05-06 23:57 Tejun Heo
2026-05-07 14:14 ` Peter Zijlstra
0 siblings, 1 reply; 7+ messages in thread
From: Tejun Heo @ 2026-05-06 23:57 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Steven Rostedt
Cc: Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
K Prateek Nayak, Kyle McMartin, linux-kernel, Tejun Heo, stable
push_rt_task() picks the highest pushable RT task next_task. If it
outranks rq->donor, the existing path calls resched_curr() and
returns 0, trusting local schedule() to pick next_task soon.
The RT_PUSH_IPI relay caller (rto_push_irq_work_func()) cannot rely
on that. When this CPU has a steady supply of softirq work (e.g.,
incoming packets), the next push IPI arrives before schedule() can
run. Other CPUs keep seeing this CPU as overloaded and keep sending
IPIs, this CPU keeps taking the same bail, and the loop repeats
until soft lockup.
Seen in production on hosts with sustained NET_RX softirq load:
the loop ran millions of iterations before tripping the soft-lockup
watchdog.
Skip the prio bail when called via the IPI relay (pull=true) so
push_rt_task() migrates next_task to another CPU. Verified with a
synthetic reproducer.
Fixes: b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")
Cc: Kyle McMartin <jkkm@meta.com>
Cc: stable@vger.kernel.org # v5.10+
Signed-off-by: Tejun Heo <tj@kernel.org>
---
This looks minimal to me, but happy for suggestions. Thanks.
kernel/sched/rt.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1968,8 +1968,14 @@ retry:
* It's possible that the next_task slipped in of
* higher priority than current. If that's the case
* just reschedule current.
+ *
+ * This doesn't work for the IPI relay caller (pull). When this CPU
+ * has a steady supply of softirq work (e.g., incoming packets), the
+ * next push IPI arrives before schedule() can run. Other CPUs keep
+ * seeing it as overloaded and keep sending IPIs, this CPU keeps
+ * taking the same bail, and the loop repeats until soft lockup.
*/
- if (unlikely(next_task->prio < rq->donor->prio)) {
+ if (unlikely(next_task->prio < rq->donor->prio) && !pull) {
resched_curr(rq);
return 0;
}
* Re: [PATCH sched/core] sched/rt: Fix RT_PUSH_IPI soft lockup loop
2026-05-06 23:57 [PATCH sched/core] sched/rt: Fix RT_PUSH_IPI soft lockup loop Tejun Heo
@ 2026-05-07 14:14 ` Peter Zijlstra
2026-05-11 19:33 ` Tejun Heo
2026-05-12 15:37 ` Steven Rostedt
0 siblings, 2 replies; 7+ messages in thread
From: Peter Zijlstra @ 2026-05-07 14:14 UTC (permalink / raw)
To: Tejun Heo
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Steven Rostedt,
Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
K Prateek Nayak, Kyle McMartin, linux-kernel, stable
On Wed, May 06, 2026 at 01:57:16PM -1000, Tejun Heo wrote:
> push_rt_task() picks the highest pushable RT task next_task. If it
> outranks rq->donor, the existing path calls resched_curr() and
> returns 0, trusting local schedule() to pick next_task soon.
>
> The RT_PUSH_IPI relay caller (rto_push_irq_work_func()) cannot rely
> on that. When this CPU has a steady supply of softirq work (e.g.,
> incoming packets), the next push IPI arrives before schedule() can
> run. Other CPUs keep seeing this CPU as overloaded and keep sending
> IPIs, this CPU keeps taking the same bail, and the loop repeats
> until soft lockup.
>
> Seen in production on hosts with sustained NET_RX softirq load:
> the loop ran millions of iterations before tripping the soft-lockup
> watchdog.
>
> Skip the prio bail when called via the IPI relay (pull=true) so
> push_rt_task() migrates next_task to another CPU. Verified with a
> synthetic reproducer.
>
> Fixes: b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")
> Cc: Kyle McMartin <jkkm@meta.com>
> Cc: stable@vger.kernel.org # v5.10+
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
> This looks minimal to me, but happy for suggestions. Thanks.
>
> kernel/sched/rt.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1968,8 +1968,14 @@ retry:
> * It's possible that the next_task slipped in of
> * higher priority than current. If that's the case
> * just reschedule current.
> + *
> + * This doesn't work for the IPI relay caller (pull). When this CPU
> + * has a steady supply of softirq work (e.g., incoming packets), the
> + * next push IPI arrives before schedule() can run. Other CPUs keep
> + * seeing it as overloaded and keep sending IPIs, this CPU keeps
> + * taking the same bail, and the loop repeats until soft lockup.
> */
> - if (unlikely(next_task->prio < rq->donor->prio)) {
> + if (unlikely(next_task->prio < rq->donor->prio) && !pull) {
> resched_curr(rq);
> return 0;
> }
IIRC Steve has a test for this stuff. If this breaks things, an
alternative is keeping a counter/limit on attempts or something.
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1339,6 +1339,8 @@ struct rq {
unsigned int nr_pinned;
unsigned int push_busy;
struct cpu_stop_work push_work;
+ unsigned int rt_switches;
+ unsigned int rt_push_resched;
#ifdef CONFIG_SCHED_CORE
/* per rq */
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2941,6 +2941,13 @@ static int push_dl_task(struct rq *rq)
if (dl_task(rq->donor) &&
dl_time_before(next_task->dl.deadline, rq->donor->dl.deadline) &&
rq->curr->nr_cpus_allowed > 1) {
+ if (rq->rt_switches != rq->nr_switches) {
+ rq->rt_switches = rq->nr_switches;
+ rq->rt_push_resched = 0;
+ }
+ if (test_tsk_need_resched(rq->curr) && ++rq->rt_push_resched > 16)
+ return 1;
+
resched_curr(rq);
return 0;
}
* Re: [PATCH sched/core] sched/rt: Fix RT_PUSH_IPI soft lockup loop
2026-05-07 14:14 ` Peter Zijlstra
@ 2026-05-11 19:33 ` Tejun Heo
2026-05-12 15:37 ` Steven Rostedt
1 sibling, 0 replies; 7+ messages in thread
From: Tejun Heo @ 2026-05-11 19:33 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Steven Rostedt,
Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
K Prateek Nayak, Kyle McMartin, linux-kernel, stable
On Thu, May 07, 2026 at 04:14:37PM +0200, Peter Zijlstra wrote:
> IIRC Steve has a test for this stuff. If this breaks things, an
> alternative is keeping a counter/limit on attempts or something.
Ping. For some reason, we're seeing this reliably now. Whichever way is fine,
but it'd be nice to roll out something that lands upstream.
Thanks.
--
tejun
* Re: [PATCH sched/core] sched/rt: Fix RT_PUSH_IPI soft lockup loop
2026-05-07 14:14 ` Peter Zijlstra
2026-05-11 19:33 ` Tejun Heo
@ 2026-05-12 15:37 ` Steven Rostedt
2026-05-12 18:07 ` Tejun Heo
2026-05-12 20:10 ` Valentin Schneider
1 sibling, 2 replies; 7+ messages in thread
From: Steven Rostedt @ 2026-05-12 15:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Tejun Heo, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
K Prateek Nayak, Kyle McMartin, linux-kernel, stable,
Linux RT Development, Clark Williams, Sebastian Andrzej Siewior,
John Kacur
[ Adding some RT folks ]
Also, Valentin, can you look at this, because I believe the issue was
introduced by your change (see below).
On Thu, 7 May 2026 16:14:37 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, May 06, 2026 at 01:57:16PM -1000, Tejun Heo wrote:
> > push_rt_task() picks the highest pushable RT task next_task. If it
> > outranks rq->donor, the existing path calls resched_curr() and
> > returns 0, trusting local schedule() to pick next_task soon.
> >
> > The RT_PUSH_IPI relay caller (rto_push_irq_work_func()) cannot rely
> > on that. When this CPU has a steady supply of softirq work (e.g.,
> > incoming packets), the next push IPI arrives before schedule() can
> > run. Other CPUs keep seeing this CPU as overloaded and keep sending
> > IPIs, this CPU keeps taking the same bail, and the loop repeats
> > until soft lockup.
> >
> > Seen in production on hosts with sustained NET_RX softirq load:
> > the loop ran millions of iterations before tripping the soft-lockup
> > watchdog.
> >
> > Skip the prio bail when called via the IPI relay (pull=true) so
> > push_rt_task() migrates next_task to another CPU. Verified with a
> > synthetic reproducer.
> >
> > Fixes: b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")
Wrong Fixes tag. That commit doesn't even contain the code that you are
changing. I think the correct commit is:
Fixes: 49bef33e4b87b ("sched/rt: Plug rt_mutex_setprio() vs push_rt_task() race")
which adds the if statement that exits out of the code early.
> > Cc: Kyle McMartin <jkkm@meta.com>
> > Cc: stable@vger.kernel.org # v5.10+
> > Signed-off-by: Tejun Heo <tj@kernel.org>
> > ---
> > This looks minimal to me, but happy for suggestions. Thanks.
> >
> > kernel/sched/rt.c | 8 +++++++-
> > 1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -1968,8 +1968,14 @@ retry:
> > * It's possible that the next_task slipped in of
> > * higher priority than current. If that's the case
> > * just reschedule current.
> > + *
> > + * This doesn't work for the IPI relay caller (pull). When this CPU
> > + * has a steady supply of softirq work (e.g., incoming packets), the
> > + * next push IPI arrives before schedule() can run. Other CPUs keep
> > + * seeing it as overloaded and keep sending IPIs, this CPU keeps
> > + * taking the same bail, and the loop repeats until soft lockup.
> > */
> > - if (unlikely(next_task->prio < rq->donor->prio)) {
> > + if (unlikely(next_task->prio < rq->donor->prio) && !pull) {
> > resched_curr(rq);
> > return 0;
> > }
>
> IIRC Steve has a test for this stuff. If this breaks things, an
> alternative is keeping a counter/limit on attempts or something.
IIRC, the test we had was simply cyclictest run with the following
parameters. From commit b6366f048e0ca ("sched/rt: Use IPI to trigger RT
task push migration instead of pulling"), it states it runs:
cyclictest --numa -p95 -m -d0 -i100
The above runs a thread on each CPU at priority 95 that sleeps for
100us. Each thread should wake up at the same time. You can read the commit
message for more details, but the tl;dr is that without the IPI push
request, if one of the CPUs ran another RT task besides cyclictest, then
all the others would ask to pull from it when their cyclictest threads
went to sleep. Having over 100 CPUs send an IPI to pull a task when only
the first one would get it caused large latency, especially since each
one took the rq lock over and over again.
But the bug being fixed wasn't introduced by that commit; it came from the
commit that added the shortcut to the logic. That commit fixes a race with
the normal call to push_rt_task(), and I think the pull logic issue was a
side effect.
I agree with Tejun's change; it actually puts the logic for the IPI pull
back to what it was before commit 49bef33e4b87b. The bug was added by the
shortcut case in push_rt_task() that was only meant for the !pull scenario.
Adding !pull to the if conditional seems like the correct change.
Valentin, can you confirm please.
Please update the Fixes tag to point to the appropriate commit as well as
update the change log. With that:
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
-- Steve
* Re: [PATCH sched/core] sched/rt: Fix RT_PUSH_IPI soft lockup loop
2026-05-12 15:37 ` Steven Rostedt
@ 2026-05-12 18:07 ` Tejun Heo
2026-05-12 21:28 ` Steven Rostedt
2026-05-12 20:10 ` Valentin Schneider
1 sibling, 1 reply; 7+ messages in thread
From: Tejun Heo @ 2026-05-12 18:07 UTC (permalink / raw)
To: Steven Rostedt
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
K Prateek Nayak, Kyle McMartin, linux-kernel, stable,
Linux RT Development, Clark Williams, Sebastian Andrzej Siewior,
John Kacur
Hello,
Looking at 49bef33e4b87 ("sched/rt: Plug rt_mutex_setprio() vs
push_rt_task() race"), the prio bail looks like it was already there
and only got moved up to retry:. For non-migration-disabled next_task
the bail fires at the same effective point both before and after, and
rto_push_irq_work_func() + rto_next_cpu() were already in their
current shape, so the loop seems reachable before the move too -
b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration
instead of pulling") looks like the actual origin.
Am I reading it wrong?
Thanks.
--
tejun
* Re: [PATCH sched/core] sched/rt: Fix RT_PUSH_IPI soft lockup loop
2026-05-12 15:37 ` Steven Rostedt
2026-05-12 18:07 ` Tejun Heo
@ 2026-05-12 20:10 ` Valentin Schneider
1 sibling, 0 replies; 7+ messages in thread
From: Valentin Schneider @ 2026-05-12 20:10 UTC (permalink / raw)
To: Steven Rostedt, Peter Zijlstra
Cc: Tejun Heo, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, K Prateek Nayak,
Kyle McMartin, linux-kernel, stable, Linux RT Development,
Clark Williams, Sebastian Andrzej Siewior, John Kacur
On 12/05/26 11:37, Steven Rostedt wrote:
> [ Adding some RT folks ]
>
> Also, Valentin, can you look at this, because I believe the issue was
> introduced by your change (see below).
>
Woops!
> IIRC, the test we had was simply cyclictest run with the following
> parameters. From commit b6366f048e0ca ("sched/rt: Use IPI to trigger RT
> task push migration instead of pulling"), it states it runs:
>
> cyclictest --numa -p95 -m -d0 -i100
>
> The above runs a thread on each CPU at priority 95 that sleeps for
> 100us. Each thread should wake up at the same time. You can read the commit
> message for more details, but the tl;dr is that without the IPI push
> request, if one of the CPUs ran another RT task besides cyclictest, then
> all the others would ask to pull from it when their cyclictest threads
> went to sleep. Having over 100 CPUs send an IPI to pull a task when only
> the first one would get it caused large latency, especially since each
> one took the rq lock over and over again.
>
> But the bug being fixed wasn't introduced by that commit; it came from the
> commit that added the shortcut to the logic. That commit fixes a race with
> the normal call to push_rt_task(), and I think the pull logic issue was a
> side effect.
>
> I agree with Tejun's change; it actually puts the logic for the IPI pull
> back to what it was before commit 49bef33e4b87b. The bug was added by the
> shortcut case in push_rt_task() that was only meant for the !pull scenario.
> Adding !pull to the if conditional seems like the correct change.
>
> Valentin, can you confirm please.
>
So looking back at the original report for my patch:
https://lore.kernel.org/all/Yb3vXx3DcqVOi+EA@donbot/
the splat happened through rto_push_irq_work_func(), i.e. with pull=true
(that naming always causes me to shuffle through my notes; AFAICT that's
because it's when push_rt_task() is invoked due to a pull_rt_task() call
but urgh).
So IIUC I'm afraid the suggested fix would cause the original issue to
resurface, but that still leaves us with the reported soft lockup issue. I
don't have any inspiration so far; I'll sleep on it.
> Please update the Fixes tag to point to the appropriate commit as well as
> update the change log. With that:
>
> Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
>
> -- Steve
* Re: [PATCH sched/core] sched/rt: Fix RT_PUSH_IPI soft lockup loop
2026-05-12 18:07 ` Tejun Heo
@ 2026-05-12 21:28 ` Steven Rostedt
0 siblings, 0 replies; 7+ messages in thread
From: Steven Rostedt @ 2026-05-12 21:28 UTC (permalink / raw)
To: Tejun Heo
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
K Prateek Nayak, Kyle McMartin, linux-kernel, stable,
Linux RT Development, Clark Williams, Sebastian Andrzej Siewior,
John Kacur
On Tue, 12 May 2026 08:07:58 -1000
Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> Looking at 49bef33e4b87 ("sched/rt: Plug rt_mutex_setprio() vs
> push_rt_task() race"), the prio bail looks like it was already there
> and only got moved up to retry:. For non-migration-disabled next_task
> the bail fires at the same effective point both before and after, and
> rto_push_irq_work_func() + rto_next_cpu() were already in their
> current shape, so the loop seems reachable before the move too -
> b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration
> instead of pulling") looks like the actual origin.
>
> Am I reading it wrong?
>
No, I missed the movement of that code. Which means I need to understand
the problem better.
I'm still wondering about the trigger of this. That shortcut means the
current process is of lower priority than the waiting tasks and a simple
schedule should happen. From your tests, can you see why a lower priority
process was running on the CPU instead of a higher priority one?
Also, the IPIs only happen when another CPU is about to schedule something
of lower priority, at which point it tries to pull a task to itself.
From your description, you are seeing a storm of IPIs from all these CPUs
before the first CPU could return from hard interrupt and schedule?
I'm thinking there may be something else wrong here.
Note, the RT_PUSH_IPI logic only has a single relay iteration in flight at
a time. If one is already running and another CPU wants to request a "push",
it simply bumps the counter so the relay tries again. It doesn't send
another IPI.
Do you have a trace that shows what is happening?
# trace-cmd start -e sched_switch -e sched_waking -e irq -e workqueue
# echo 1 > /proc/sys/kernel/traceoff_on_warning
# trace-cmd extract
may be enough.
May need to add some trace_printk()s into the IPI logic code too.
-- Steve