The Linux Kernel Mailing List
* [PATCH sched/core] sched/rt: Fix RT_PUSH_IPI soft lockup loop
@ 2026-05-06 23:57 Tejun Heo
  2026-05-07 14:14 ` Peter Zijlstra
  0 siblings, 1 reply; 3+ messages in thread
From: Tejun Heo @ 2026-05-06 23:57 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt
  Cc: Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, Kyle McMartin, linux-kernel, Tejun Heo, stable

push_rt_task() picks the highest pushable RT task next_task. If it
outranks rq->donor, the existing path calls resched_curr() and
returns 0, trusting local schedule() to pick next_task soon.

The RT_PUSH_IPI relay caller (rto_push_irq_work_func()) cannot rely
on that. When this CPU has a steady supply of softirq work (e.g.,
incoming packets), the next push IPI arrives before schedule() can
run. Other CPUs keep seeing this CPU as overloaded and keep sending
IPIs, this CPU keeps taking the same bail, and the loop repeats
until soft lockup.

Seen in production on hosts with sustained NET_RX softirq load:
the loop ran millions of iterations before tripping the soft-lockup
watchdog.

Skip the prio bail when called via the IPI relay (pull=true) so
push_rt_task() migrates next_task to another CPU. Verified with a
synthetic reproducer.

Fixes: b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")
Cc: Kyle McMartin <jkkm@meta.com>
Cc: stable@vger.kernel.org # v5.10+
Signed-off-by: Tejun Heo <tj@kernel.org>
---
This looks minimal to me, but happy for suggestions. Thanks.

 kernel/sched/rt.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1968,8 +1968,14 @@ retry:
 	 * It's possible that the next_task slipped in of
 	 * higher priority than current. If that's the case
 	 * just reschedule current.
+	 *
+	 * This doesn't work for the IPI relay caller (pull). When this CPU
+	 * has a steady supply of softirq work (e.g., incoming packets), the
+	 * next push IPI arrives before schedule() can run. Other CPUs keep
+	 * seeing it as overloaded and keep sending IPIs, this CPU keeps
+	 * taking the same bail, and the loop repeats until soft lockup.
 	 */
-	if (unlikely(next_task->prio < rq->donor->prio)) {
+	if (unlikely(next_task->prio < rq->donor->prio) && !pull) {
 		resched_curr(rq);
 		return 0;
 	}


* Re: [PATCH sched/core] sched/rt: Fix RT_PUSH_IPI soft lockup loop
  2026-05-06 23:57 [PATCH sched/core] sched/rt: Fix RT_PUSH_IPI soft lockup loop Tejun Heo
@ 2026-05-07 14:14 ` Peter Zijlstra
  2026-05-11 19:33   ` Tejun Heo
  0 siblings, 1 reply; 3+ messages in thread
From: Peter Zijlstra @ 2026-05-07 14:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, Kyle McMartin, linux-kernel, stable

On Wed, May 06, 2026 at 01:57:16PM -1000, Tejun Heo wrote:
> push_rt_task() picks the highest pushable RT task next_task. If it
> outranks rq->donor, the existing path calls resched_curr() and
> returns 0, trusting local schedule() to pick next_task soon.
> 
> The RT_PUSH_IPI relay caller (rto_push_irq_work_func()) cannot rely
> on that. When this CPU has a steady supply of softirq work (e.g.,
> incoming packets), the next push IPI arrives before schedule() can
> run. Other CPUs keep seeing this CPU as overloaded and keep sending
> IPIs, this CPU keeps taking the same bail, and the loop repeats
> until soft lockup.
> 
> Seen in production on hosts with sustained NET_RX softirq load:
> the loop ran millions of iterations before tripping the soft-lockup
> watchdog.
> 
> Skip the prio bail when called via the IPI relay (pull=true) so
> push_rt_task() migrates next_task to another CPU. Verified with a
> synthetic reproducer.
> 
> Fixes: b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")
> Cc: Kyle McMartin <jkkm@meta.com>
> Cc: stable@vger.kernel.org # v5.10+
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
> This looks minimal to me, but happy for suggestions. Thanks.
> 
>  kernel/sched/rt.c |    8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1968,8 +1968,14 @@ retry:
>  	 * It's possible that the next_task slipped in of
>  	 * higher priority than current. If that's the case
>  	 * just reschedule current.
> +	 *
> +	 * This doesn't work for the IPI relay caller (pull). When this CPU
> +	 * has a steady supply of softirq work (e.g., incoming packets), the
> +	 * next push IPI arrives before schedule() can run. Other CPUs keep
> +	 * seeing it as overloaded and keep sending IPIs, this CPU keeps
> +	 * taking the same bail, and the loop repeats until soft lockup.
>  	 */
> -	if (unlikely(next_task->prio < rq->donor->prio)) {
> +	if (unlikely(next_task->prio < rq->donor->prio) && !pull) {
>  		resched_curr(rq);
>  		return 0;
>  	}

IIRC Steve has a test for this stuff. If this breaks things, an
alternative is keeping a counter/limit on attempts or something.


--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1339,6 +1339,8 @@ struct rq {
 	unsigned int		nr_pinned;
 	unsigned int		push_busy;
 	struct cpu_stop_work	push_work;
+	unsigned int		rt_switches;
+	unsigned int		rt_push_resched;
 
 #ifdef CONFIG_SCHED_CORE
 	/* per rq */
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1972,4 +1972,11 @@ retry:
 	if (unlikely(next_task->prio < rq->donor->prio)) {
+		if (rq->rt_switches != rq->nr_switches) {
+			rq->rt_switches = rq->nr_switches;
+			rq->rt_push_resched = 0;
+		}
+		if (test_tsk_need_resched(rq->curr) && ++rq->rt_push_resched > 16)
+			return 1;
+
 		resched_curr(rq);
 		return 0;
 	}


* Re: [PATCH sched/core] sched/rt: Fix RT_PUSH_IPI soft lockup loop
  2026-05-07 14:14 ` Peter Zijlstra
@ 2026-05-11 19:33   ` Tejun Heo
  0 siblings, 0 replies; 3+ messages in thread
From: Tejun Heo @ 2026-05-11 19:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, Kyle McMartin, linux-kernel, stable

On Thu, May 07, 2026 at 04:14:37PM +0200, Peter Zijlstra wrote:
> IIRC Steve has a test for this stuff. If this breaks things, an
> alternative is keeping a counter/limit on attempts or something.

Ping. For some reason, we're seeing this reliably now. Whichever way is
fine, but it'd be nice to roll out something that's going to land upstream.

Thanks.

-- 
tejun
