stable.vger.kernel.org archive mirror
* [PATCH] Revert "sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks"
@ 2025-09-25 13:33 Matt Fleming
  2025-09-26  0:05 ` John Stultz
  0 siblings, 1 reply; 14+ messages in thread
From: Matt Fleming @ 2025-09-25 13:33 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, linux-kernel, kernel-team, Matt Fleming,
	Oleg Nesterov, John Stultz, Chris Arges, stable

From: Matt Fleming <mfleming@cloudflare.com>

This reverts commit b7ca5743a2604156d6083b88cefacef983f3a3a6.

If we dequeue a task (task B) that was sched delayed then that task is
definitely no longer on the rq and not tracked in the rbtree.
Unfortunately, task_on_rq_queued(B) will still return true because
dequeue_task() doesn't update p->on_rq.

This inconsistency can lead to tasks (task A) spinning indefinitely in
wait_task_inactive(), e.g. when delivering a fatal signal to a thread
group, because task A thinks task B is still queued (it's not) and waits
forever for it to unschedule.
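
For reference, wait_task_inactive()'s polling loop looks roughly like
the sketch below (heavily simplified, not the exact upstream code).
With task_on_rq_queued() stuck returning true for task B, it just keeps
sleeping for a tick and retrying:

  for (;;) {
          rq = task_rq_lock(p, &rf);
          trace_sched_wait_task(p);
          running = task_on_cpu(rq, p);
          queued = task_on_rq_queued(p);  /* stays true for task B */
          task_rq_unlock(rq, p, &rf);

          if (unlikely(running)) {
                  cpu_relax();
                  continue;
          }

          if (unlikely(queued)) {
                  /* sleep for a tick, then re-check */
                  ktime_t to = NSEC_PER_SEC / HZ;

                  set_current_state(TASK_UNINTERRUPTIBLE);
                  schedule_hrtimeout(&to, HRTIMER_MODE_REL_HARD);
                  continue;
          }

          break;  /* task B is fully blocked, we are done */
  }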

          Task A                                    Task B

  arch_do_signal_or_restart()
    get_signal()
      do_coredump()
        coredump_wait()
          zap_threads()                     arch_do_signal_or_restart()
          wait_task_inactive() <-- SPIN       get_signal()
                                                do_group_exit()
                                                  do_exit()
                                                    coredump_task_exit()
                                                      schedule() <--- never comes back

Not only will task A spin forever in wait_task_inactive(), but task B
will also trigger RCU stalls:

  INFO: rcu_tasks detected stalls on tasks:
  00000000a973a4d8: .. nvcsw: 2/2 holdout: 1 idle_cpu: -1/79
  task:ffmpeg          state:I stack:0     pid:665601 tgid:665155 ppid:668691 task_flags:0x400448 flags:0x00004006
  Call Trace:
   <TASK>
   __schedule+0x4fb/0xbf0
   ? srso_return_thunk+0x5/0x5f
   schedule+0x27/0xf0
   do_exit+0xdd/0xaa0
   ? __pfx_futex_wake_mark+0x10/0x10
   do_group_exit+0x30/0x80
   get_signal+0x81e/0x860
   ? srso_return_thunk+0x5/0x5f
   ? futex_wake+0x177/0x1a0
   arch_do_signal_or_restart+0x2e/0x1f0
   ? srso_return_thunk+0x5/0x5f
   ? srso_return_thunk+0x5/0x5f
   ? __x64_sys_futex+0x10c/0x1d0
   syscall_exit_to_user_mode+0xa5/0x130
   do_syscall_64+0x57/0x110
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7f22d05b0f16
  RSP: 002b:00007f2265761cf0 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
  RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00007f22d05b0f16
  RDX: 0000000000000000 RSI: 0000000000000189 RDI: 00005629e320d97c
  RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000ffffffff
  R10: 0000000000000000 R11: 0000000000000246 R12: 00005629e320d928
  R13: 0000000000000000 R14: 0000000000000001 R15: 00005629e320d97c
   </TASK>

Fixes: b7ca5743a260 ("sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks")
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: John Stultz <jstultz@google.com>
Cc: Chris Arges <carges@cloudflare.com>
Cc: stable@vger.kernel.org # v6.12
Signed-off-by: Matt Fleming <mfleming@cloudflare.com>
---
 kernel/sched/core.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ccba6fc3c3fe..2dfc3977920d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2293,12 +2293,6 @@ unsigned long wait_task_inactive(struct task_struct *p, unsigned int match_state
 		 * just go back and repeat.
 		 */
 		rq = task_rq_lock(p, &rf);
-		/*
-		 * If task is sched_delayed, force dequeue it, to avoid always
-		 * hitting the tick timeout in the queued case
-		 */
-		if (p->se.sched_delayed)
-			dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
 		trace_sched_wait_task(p);
 		running = task_on_cpu(rq, p);
 		queued = task_on_rq_queued(p);
-- 
2.34.1



* Re: [PATCH] Revert "sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks"
  2025-09-25 13:33 [PATCH] Revert "sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks" Matt Fleming
@ 2025-09-26  0:05 ` John Stultz
  2025-09-26  0:06   ` John Stultz
  2025-09-26  2:43   ` K Prateek Nayak
  0 siblings, 2 replies; 14+ messages in thread
From: John Stultz @ 2025-09-26  0:05 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, linux-kernel, kernel-team, Matt Fleming,
	Oleg Nesterov, Chris Arges, stable

On Thu, Sep 25, 2025 at 6:33 AM Matt Fleming <matt@readmodwrite.com> wrote:
>
> From: Matt Fleming <mfleming@cloudflare.com>
>
> This reverts commit b7ca5743a2604156d6083b88cefacef983f3a3a6.
>
> If we dequeue a task (task B) that was sched delayed then that task is
> definitely no longer on the rq and not tracked in the rbtree.
> Unfortunately, task_on_rq_queued(B) will still return true because
> dequeue_task() doesn't update p->on_rq.

Hey!
  Sorry again my patch has been causing you trouble. Thanks for your
persistence in chasing this down!

It's confusing as this patch uses similar logic to what
pick_next_entity() uses when a sched_delayed task is picked to run, as
well as elsewhere in __sched_setscheduler() and in sched_ext, so I'd
fret that similar

And my impression was that dequeue_task() on a sched_delayed task
would update p->on_rq via calling __block_task() at the end of
dequeue_entities().

However, there are two spots where we might exit dequeue_entities()
early when cfs_rq_throttled(rq), so maybe that's what's catching us
here?
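
Roughly the shape I have in mind (simplified sketch, not the exact
code):

  static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
  {
          ...
          for_each_sched_entity(se) {
                  ...
                  /* end evaluation on encountering a throttled cfs_rq */
                  if (cfs_rq_throttled(cfs_rq))
                          return 0;       /* <-- bails out before the tail below */
                  ...
          }
          /* (second loop with the same cfs_rq_throttled() bailout) */
          ...
          if (p && task_delayed) {
                  ...
                  __block_task(rq, p);    /* this is what clears p->on_rq */
          }

          return 1;
  }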

Peter: Those cfs_rq_throttled() exits in dequeue_entities() seem a
little odd, as the desired dequeue didn't really complete, but
dequeue_task_fair() will still return true indicating success - not
that too many places are checking the dequeue_task return. Is that
right?
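
That is, something like this (again simplified, from memory):

  static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
  {
          ...
          if (dequeue_entities(rq, &p->se, flags) < 0)
                  return false;

          /* the throttled-hierarchy "return 0" still ends up here */
          ...
          return true;
  }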

thanks
-john


* Re: [PATCH] Revert "sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks"
  2025-09-26  0:05 ` John Stultz
@ 2025-09-26  0:06   ` John Stultz
  2025-09-26  2:43   ` K Prateek Nayak
  1 sibling, 0 replies; 14+ messages in thread
From: John Stultz @ 2025-09-26  0:06 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, linux-kernel, kernel-team, Matt Fleming,
	Oleg Nesterov, Chris Arges, stable

On Thu, Sep 25, 2025 at 5:05 PM John Stultz <jstultz@google.com> wrote:
> It's confusing as this patch uses similar logic to what
> pick_next_entity() uses when a sched_delayed task is picked to run, as
> well as elsewhere in __sched_setscheduler() and in sched_ext, so I'd
> fret that similar

Didn't finish my sentence there.  "...similar problems as the one you
are seeing could still occur."


thanks
-john


* Re: [PATCH] Revert "sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks"
  2025-09-26  0:05 ` John Stultz
  2025-09-26  0:06   ` John Stultz
@ 2025-09-26  2:43   ` K Prateek Nayak
  2025-09-26 15:34     ` Matt Fleming
  2025-09-29 10:38     ` [PATCH] Revert "sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks" Peter Zijlstra
  1 sibling, 2 replies; 14+ messages in thread
From: K Prateek Nayak @ 2025-09-26  2:43 UTC (permalink / raw)
  To: John Stultz, Matt Fleming
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, linux-kernel, kernel-team, Matt Fleming,
	Oleg Nesterov, Chris Arges, stable

Hello John, Matt,

On 9/26/2025 5:35 AM, John Stultz wrote:
> On Thu, Sep 25, 2025 at 6:33 AM Matt Fleming <matt@readmodwrite.com> wrote:
>>
>> From: Matt Fleming <mfleming@cloudflare.com>
>>
>> This reverts commit b7ca5743a2604156d6083b88cefacef983f3a3a6.
>>
>> If we dequeue a task (task B) that was sched delayed then that task is
>> definitely no longer on the rq and not tracked in the rbtree.
>> Unfortunately, task_on_rq_queued(B) will still return true because
>> dequeue_task() doesn't update p->on_rq.
> 
> Hey!
>   Sorry again my patch has been causing you trouble. Thanks for your
> persistence in chasing this down!
> 
> It's confusing as this patch uses similar logic to what
> pick_next_entity() uses when a sched_delayed task is picked to run, as
> well as elsewhere in __sched_setscheduler() and in sched_ext, so I'd
> fret that similar
> 
> And my impression was that dequeue_task() on a sched_delayed task
> would update p->on_rq via calling __block_task() at the end of
> dequeue_entities().
> 
> However, there are two spots where we might exit dequeue_entities()
> early when cfs_rq_throttled(rq), so maybe that's what's catching us
> here?

That could very likely be it.

> 
> Peter: Those cfs_rq_throttled() exits in dequeue_entities() seem a
> little odd, as the desired dequeue didn't really complete, but
> dequeue_task_fair() will still return true indicating success - not
> that too many places are checking the dequeue_task return. Is that
> right?

I think for the most part, until now, it was harmless as we couldn't
pick from a throttled hierarchy, and other calls to
dequeue_task(DEQUEUE_DELAYED) would later do a:

    queued = task_on_rq_queued(p);
    ...
    if (queued)
        enqueue_task(p)

which would either lead to spuriously running a blocked task (which
would then block again), or a wakeup would properly wake up the queued
task via ttwu_runnable(). wait_task_inactive() is interesting, however,
as it expects the dequeue to result in a block, which never happens with
throttled hierarchies. I'm impressed the double dequeue doesn't result
in any major splats!

Matt, if possible can you try the patch attached below to check if the
bailout for throttled hierarchy is indeed the root cause. Thanks in
advance.

P.S. the per-task throttle in tip:sched/core would get rid of all this
but it would be good to have a fix via tip:sched/urgent to get it
backported to v6.12 LTS and the newer stable kernels.

---
Thanks and Regards,
Prateek

(Prepared on top of tip:sched/urgent but should apply fine on v6.17-rc7)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8ce56a8d507f..f0a4d9d7424d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6969,6 +6969,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 	int h_nr_runnable = 0;
 	struct cfs_rq *cfs_rq;
 	u64 slice = 0;
+	int ret = 0; /* XXX: Do we care if ret is 0 vs 1 since we only check ret < 0? */
 
 	if (entity_is_task(se)) {
 		p = task_of(se);
@@ -6998,7 +6999,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
-			return 0;
+			goto out;
 
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
@@ -7039,7 +7040,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
-			return 0;
+			goto out;
 	}
 
 	sub_nr_running(rq, h_nr_queued);
@@ -7048,6 +7049,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 	if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
 		rq->next_balance = jiffies;
 
+	ret = 1;
+out:
 	if (p && task_delayed) {
 		WARN_ON_ONCE(!task_sleep);
 		WARN_ON_ONCE(p->on_rq != 1);
@@ -7063,7 +7066,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 		__block_task(rq, p);
 	}
 
-	return 1;
+	return ret;
 }
 
 /*



* Re: [PATCH] Revert "sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks"
  2025-09-26  2:43   ` K Prateek Nayak
@ 2025-09-26 15:34     ` Matt Fleming
  2025-09-29  3:53       ` K Prateek Nayak
                         ` (2 more replies)
  2025-09-29 10:38     ` [PATCH] Revert "sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks" Peter Zijlstra
  1 sibling, 3 replies; 14+ messages in thread
From: Matt Fleming @ 2025-09-26 15:34 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: John Stultz, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, kernel-team,
	Matt Fleming, Oleg Nesterov, Chris Arges, stable

On Fri, Sep 26, 2025 at 3:43 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello John, Matt,
>
> On 9/26/2025 5:35 AM, John Stultz wrote:
> >
> > However, there are two spots where we might exit dequeue_entities()
> > early when cfs_rq_throttled(rq), so maybe that's what's catching us
> > here?
>
> That could very likely be it.

That tracks -- we're heavy users of cgroups and this particular issue
only appeared on our kubernetes nodes.

> Matt, if possible can you try the patch attached below to check if the
> bailout for throttled hierarchy is indeed the root cause. Thanks in
> advance.

I've been running our reproducer with this patch for the last few
hours without any issues, so the fix looks good to me.

Tested-by: Matt Fleming <mfleming@cloudflare.com>


* Re: [PATCH] Revert "sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks"
  2025-09-26 15:34     ` Matt Fleming
@ 2025-09-29  3:53       ` K Prateek Nayak
  2025-10-14 10:15         ` Matt Fleming
  2025-10-15  6:03       ` [PATCH v6.12] sched/fair: Block delayed tasks on throttled hierarchy during dequeue K Prateek Nayak
  2025-10-23  4:03       ` [PATCH 6.17] " K Prateek Nayak
  2 siblings, 1 reply; 14+ messages in thread
From: K Prateek Nayak @ 2025-09-29  3:53 UTC (permalink / raw)
  To: Matt Fleming
  Cc: John Stultz, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, kernel-team,
	Matt Fleming, Oleg Nesterov, Chris Arges, stable

Hello Matt,

On 9/26/2025 9:04 PM, Matt Fleming wrote:
> On Fri, Sep 26, 2025 at 3:43 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>>
>> Hello John, Matt,
>>
>> On 9/26/2025 5:35 AM, John Stultz wrote:
>>>
>>> However, there are two spots where we might exit dequeue_entities()
>>> early when cfs_rq_throttled(rq), so maybe that's what's catching us
>>> here?
>>
>> That could very likely be it.
> 
> That tracks -- we're heavy users of cgroups and this particular issue
> only appeared on our kubernetes nodes.
> 
>> Matt, if possible can you try the patch attached below to check if the
>> bailout for throttled hierarchy is indeed the root cause. Thanks in
>> advance.
> 
> I've been running our reproducer with this patch for the last few
> hours without any issues, so the fix looks good to me.
> 
> Tested-by: Matt Fleming <mfleming@cloudflare.com>

Thank you for testing the diff. I see Ingo has already posted the
scheduler pull for v6.18 which indirectly solves this by removing those
early returns.

Once that lands, I'll attach a formal commit log and send out a patch
targeting the stable kernels >= 6.12.

-- 
Thanks and Regards,
Prateek



* Re: [PATCH] Revert "sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks"
  2025-09-26  2:43   ` K Prateek Nayak
  2025-09-26 15:34     ` Matt Fleming
@ 2025-09-29 10:38     ` Peter Zijlstra
  1 sibling, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2025-09-29 10:38 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: John Stultz, Matt Fleming, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, kernel-team,
	Matt Fleming, Oleg Nesterov, Chris Arges, stable

On Fri, Sep 26, 2025 at 08:13:09AM +0530, K Prateek Nayak wrote:

> > Peter: Those cfs_rq_throttled() exits in dequeue_entities() seem a
> > little odd, as the desired dequeue didn't really complete, but
> > dequeue_task_fair() will still return true indicating success - not
> > that too many places are checking the dequeue_task return. Is that
> > right?

Bah, I'm forever confused on the throttle cases there :/

> I think for the most part, until now, it was harmless as we couldn't
> pick from a throttled hierarchy, and other calls to
> dequeue_task(DEQUEUE_DELAYED) would later do a:
> 
>     queued = task_on_rq_queued(p);
>     ...
>     if (queued)
>         enqueue_task(p)
> 
> which would either lead to spuriously running a blocked task (which
> would then block again), or a wakeup would properly wake up the queued
> task via ttwu_runnable(). wait_task_inactive() is interesting, however,
> as it expects the dequeue to result in a block, which never happens with
> throttled hierarchies. I'm impressed the double dequeue doesn't result
> in any major splats!
> 
> Matt, if possible can you try the patch attached below to check if the
> bailout for throttled hierarchy is indeed the root cause. Thanks in
> advance.
> 
> P.S. the per-task throttle in tip:sched/core would get rid of all this
> but it would be good to have a fix via tip:sched/urgent to get it
> backported to v6.12 LTS and the newer stable kernels.

Yes, good riddance to that code :-)

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8ce56a8d507f..f0a4d9d7424d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6969,6 +6969,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>  	int h_nr_runnable = 0;
>  	struct cfs_rq *cfs_rq;
>  	u64 slice = 0;
> +	int ret = 0; /* XXX: Do we care if ret is 0 vs 1 since we only check ret < 0? */

Right, we don't appear to really need that.

>  
>  	if (entity_is_task(se)) {
>  		p = task_of(se);
> @@ -6998,7 +6999,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>  
>  		/* end evaluation on encountering a throttled cfs_rq */
>  		if (cfs_rq_throttled(cfs_rq))
> -			return 0;
> +			goto out;
>  
>  		/* Don't dequeue parent if it has other entities besides us */
>  		if (cfs_rq->load.weight) {
> @@ -7039,7 +7040,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>  
>  		/* end evaluation on encountering a throttled cfs_rq */
>  		if (cfs_rq_throttled(cfs_rq))
> -			return 0;
> +			goto out;
>  	}
>  
>  	sub_nr_running(rq, h_nr_queued);
> @@ -7048,6 +7049,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>  	if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
>  		rq->next_balance = jiffies;
>  
> +	ret = 1;
> +out:
>  	if (p && task_delayed) {
>  		WARN_ON_ONCE(!task_sleep);
>  		WARN_ON_ONCE(p->on_rq != 1);
> @@ -7063,7 +7066,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>  		__block_task(rq, p);
>  	}
>  
> -	return 1;
> +	return ret;
>  }

So the difference is that we also do __block_task() when we get
throttled somewhere in the hierarchy. IIRC when I was looking at this, I
thought it wouldn't matter since it won't get picked anyway, on account
of the cfs_rq being blocked/detached, so who cares.

But yeah, this makes sense.

Patch logistics are going to be a pain -- .17 is closed and merge window
is open, which means Linus will have per-task throttle and /urgent don't
work no more.

At this point best we can do is a patch to stable with a note that
upstream is no longer affected due to rework or something.


* Re: [PATCH] Revert "sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks"
  2025-09-29  3:53       ` K Prateek Nayak
@ 2025-10-14 10:15         ` Matt Fleming
  0 siblings, 0 replies; 14+ messages in thread
From: Matt Fleming @ 2025-10-14 10:15 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: John Stultz, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, kernel-team,
	Matt Fleming, Oleg Nesterov, Chris Arges, stable

On Mon, Sep 29, 2025 at 09:23:06AM +0530, K Prateek Nayak wrote:
> 
> Thank you for testing the diff. I see Ingo has already posted the
> scheduler pull for v6.18 which indirectly solves this by removing those
> early returns.
> 
> Once that lands, I'll attach a formal commit log and send out a patch
> targeting the stable kernels >= 6.12.

Hey Prateek, I see the merge window is closed. Do you have plans to
submit a patch for stable now?


* [PATCH v6.12] sched/fair: Block delayed tasks on throttled hierarchy during dequeue
  2025-09-26 15:34     ` Matt Fleming
  2025-09-29  3:53       ` K Prateek Nayak
@ 2025-10-15  6:03       ` K Prateek Nayak
  2025-10-15  6:14         ` Greg Kroah-Hartman
  2025-10-23  4:03       ` [PATCH 6.17] " K Prateek Nayak
  2 siblings, 1 reply; 14+ messages in thread
From: K Prateek Nayak @ 2025-10-15  6:03 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Sasha Levin, stable, Matt Fleming,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, kernel-team, Matt Fleming, Oleg Nesterov,
	John Stultz, Chris Arges, K Prateek Nayak

Dequeuing a fair task on a throttled hierarchy returns early on
encountering a throttled cfs_rq since the throttle path has already
dequeued the hierarchy above and has adjusted the h_nr_* accounting till
the root cfs_rq.

dequeue_entities() crucially misses calling __block_task() for delayed
tasks being dequeued on the throttled hierarchies, but this was mostly
harmless until commit b7ca5743a260 ("sched/core: Tweak
wait_task_inactive() to force dequeue sched_delayed tasks") since all
existing cases would re-enqueue the task if task_on_rq_queued() returned
true and the task would eventually be blocked at pick after the
hierarchy was unthrottled.

wait_task_inactive() is special as it expects the delayed task on
throttled hierarchy to reach the blocked state on dequeue but since
__block_task() is never called, task_on_rq_queued() continues to return
true. Furthermore, since the task is now off the hierarchy, the pick
never reaches it to fully block the task even after unthrottle leading
to wait_task_inactive() looping endlessly.

Remedy this by calling __block_task() if a delayed task is being
dequeued on a throttled hierarchy.

This fix is only required for stable kernels implementing delay dequeue
(>= v6.12) before v6.18 since upstream commit e1fad12dcb66 ("sched/fair:
Switch to task based throttle model") indirectly fixes this by removing
the early return conditions in dequeue_entities() as part of the per-task
throttle feature.

Cc: stable@vger.kernel.org
Reported-by: Matt Fleming <matt@readmodwrite.com>
Closes: https://lore.kernel.org/all/20250925133310.1843863-1-matt@readmodwrite.com/
Fixes: b7ca5743a260 ("sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks")
Tested-by: Matt Fleming <mfleming@cloudflare.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Greg, Sasha,

This fix cleanly applies on top of v6.16.y and v6.17.y stable kernels
too when cherry-picked from v6.12.y branch (or with 'git am -3'). Let me
know if you would like me to send a separate patch for each.

As mentioned above, upstream fixes this as part of a larger feature
and we would only like these bits backported. If there are any future
conflicts in this area during backporting, I would be more than happy
to help resolve them.
---
 kernel/sched/fair.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index af61769b1d50..b3d9826e25b0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7187,6 +7187,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 	int h_nr_delayed = 0;
 	struct cfs_rq *cfs_rq;
 	u64 slice = 0;
+	int ret = 0;
 
 	if (entity_is_task(se)) {
 		p = task_of(se);
@@ -7218,7 +7219,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
-			return 0;
+			goto out;
 
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
@@ -7261,7 +7262,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
-			return 0;
+			goto out;
 	}
 
 	sub_nr_running(rq, h_nr_queued);
@@ -7273,6 +7274,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 	if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
 		rq->next_balance = jiffies;
 
+	ret = 1;
+out:
 	if (p && task_delayed) {
 		SCHED_WARN_ON(!task_sleep);
 		SCHED_WARN_ON(p->on_rq != 1);
@@ -7288,7 +7291,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 		__block_task(rq, p);
 	}
 
-	return 1;
+	return ret;
 }
 
 /*
-- 
2.34.1



* Re: [PATCH v6.12] sched/fair: Block delayed tasks on throttled hierarchy during dequeue
  2025-10-15  6:03       ` [PATCH v6.12] sched/fair: Block delayed tasks on throttled hierarchy during dequeue K Prateek Nayak
@ 2025-10-15  6:14         ` Greg Kroah-Hartman
  2025-10-15  6:27           ` K Prateek Nayak
  0 siblings, 1 reply; 14+ messages in thread
From: Greg Kroah-Hartman @ 2025-10-15  6:14 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Sasha Levin, stable, Matt Fleming, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, linux-kernel, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	kernel-team, Matt Fleming, Oleg Nesterov, John Stultz,
	Chris Arges

On Wed, Oct 15, 2025 at 06:03:59AM +0000, K Prateek Nayak wrote:
> Dequeuing a fair task on a throttled hierarchy returns early on
> encountering a throttled cfs_rq since the throttle path has already
> dequeued the hierarchy above and has adjusted the h_nr_* accounting till
> the root cfs_rq.
> 
> dequeue_entities() crucially misses calling __block_task() for delayed
> tasks being dequeued on the throttled hierarchies, but this was mostly
> harmless until commit b7ca5743a260 ("sched/core: Tweak
> wait_task_inactive() to force dequeue sched_delayed tasks") since all
> existing cases would re-enqueue the task if task_on_rq_queued() returned
> true and the task would eventually be blocked at pick after the
> hierarchy was unthrottled.
> 
> wait_task_inactive() is special as it expects the delayed task on
> throttled hierarchy to reach the blocked state on dequeue but since
> __block_task() is never called, task_on_rq_queued() continues to return
> true. Furthermore, since the task is now off the hierarchy, the pick
> never reaches it to fully block the task even after unthrottle leading
> to wait_task_inactive() looping endlessly.
> 
> Remedy this by calling __block_task() if a delayed task is being
> dequeued on a throttled hierarchy.
> 
> This fix is only required for stable kernels implementing delay dequeue
> (>= v6.12) before v6.18 since upstream commit e1fad12dcb66 ("sched/fair:
> Switch to task based throttle model") indirectly fixes this by removing
> the early return conditions in dequeue_entities() as part of the per-task
> throttle feature.
> 
> Cc: stable@vger.kernel.org
> Reported-by: Matt Fleming <matt@readmodwrite.com>
> Closes: https://lore.kernel.org/all/20250925133310.1843863-1-matt@readmodwrite.com/
> Fixes: b7ca5743a260 ("sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks")
> Tested-by: Matt Fleming <mfleming@cloudflare.com>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> Greg, Sasha,
> 
> This fix cleanly applies on top of v6.16.y and v6.17.y stable kernels
> too when cherry-picked from v6.12.y branch (or with 'git am -3'). Let me
> know if you would like me to send a separate patch for each.
> 
> As mentioned above, upstream fixes this as part of a larger feature
> and we would only like these bits backported. If there are any future
> conflicts in this area during backporting, I would be more than happy
> to help resolve them.

Why not just backport all of the mainline changes instead?  As I say a
lot, whenever we do these "one off" changes, it's almost always wrong
and causes problems over the years going forward as other changes around
the same area can not be backported either.

So please, try to just backport the original commits.

thanks,

greg k-h


* Re: [PATCH v6.12] sched/fair: Block delayed tasks on throttled hierarchy during dequeue
  2025-10-15  6:14         ` Greg Kroah-Hartman
@ 2025-10-15  6:27           ` K Prateek Nayak
  2025-10-15  7:27             ` Greg Kroah-Hartman
  0 siblings, 1 reply; 14+ messages in thread
From: K Prateek Nayak @ 2025-10-15  6:27 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Sasha Levin, stable, Matt Fleming, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, linux-kernel, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	kernel-team, Matt Fleming, Oleg Nesterov, John Stultz,
	Chris Arges

Hello Greg,

On 10/15/2025 11:44 AM, Greg Kroah-Hartman wrote:
>> Greg, Sasha,
>>
>> This fix cleanly applies on top of v6.16.y and v6.17.y stable kernels
>> too when cherry-picked from v6.12.y branch (or with 'git am -3'). Let me
>> know if you would like me to send a seperate patch for each.
>>
>> As mentioned above, the upstream fixes this as a part of larger feature
>> and we would only like these bits backported. If there are any future
>> conflicts in this area during backporting, I would be more than happy to
>> help out resolve them.
> 
> Why not just backport all of the mainline changes instead?  As I say a
> lot, whenever we do these "one off" changes, it's almost always wrong
> and causes problems over the years going forward as other changes around
> the same area can not be backported either.
> 
> So please, try to just backport the original commits.

Peter was in favor of backporting just the necessary bits in
https://lore.kernel.org/all/20250929103836.GK3419281@noisy.programming.kicks-ass.net/

Backporting the whole of the per-task throttle feature is a lot more
heavy-handed, with the core changes adding:

 include/linux/sched.h |   5 +
 kernel/sched/core.c   |   3 +
 kernel/sched/fair.c   | 451 ++++++++++++++++++++++++------------------
 kernel/sched/pelt.h   |   4 +-
 kernel/sched/sched.h  |   7 +-
 5 files changed, 274 insertions(+), 196 deletions(-)

And a few more fixes that will add to the above before v6.18. I'll defer
to Peter to decide the best course of action.

-- 
Thanks and Regards,
Prateek



* Re: [PATCH v6.12] sched/fair: Block delayed tasks on throttled hierarchy during dequeue
  2025-10-15  6:27           ` K Prateek Nayak
@ 2025-10-15  7:27             ` Greg Kroah-Hartman
  2025-10-15  8:22               ` Peter Zijlstra
  0 siblings, 1 reply; 14+ messages in thread
From: Greg Kroah-Hartman @ 2025-10-15  7:27 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Sasha Levin, stable, Matt Fleming, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, linux-kernel, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	kernel-team, Matt Fleming, Oleg Nesterov, John Stultz,
	Chris Arges

On Wed, Oct 15, 2025 at 11:57:19AM +0530, K Prateek Nayak wrote:
> Hello Greg,
> 
> On 10/15/2025 11:44 AM, Greg Kroah-Hartman wrote:
> >> Greg, Sasha,
> >>
> >> This fix cleanly applies on top of v6.16.y and v6.17.y stable kernels
> >> too when cherry-picked from v6.12.y branch (or with 'git am -3'). Let me
> >> know if you would like me to send a seperate patch for each.
> >>
> >> As mentioned above, upstream fixes this as part of a larger feature
> >> and we would only like these bits backported. If there are any future
> >> conflicts in this area during backporting, I would be more than happy
> >> to help resolve them.
> > 
> > Why not just backport all of the mainline changes instead?  As I say a
> > lot, whenever we do these "one off" changes, it's almost always wrong
> > and causes problems over the years going forward as other changes around
> > the same area can not be backported either.
> > 
> > So please, try to just backport the original commits.
> 
> Peter was in favor of backporting just the necessary bits in
> https://lore.kernel.org/all/20250929103836.GK3419281@noisy.programming.kicks-ass.net/
> 
> Backporting the whole of the per-task throttle feature is a lot more
> heavy-handed, with the core changes adding:
> 
>  include/linux/sched.h |   5 +
>  kernel/sched/core.c   |   3 +
>  kernel/sched/fair.c   | 451 ++++++++++++++++++++++++------------------
>  kernel/sched/pelt.h   |   4 +-
>  kernel/sched/sched.h  |   7 +-
>  5 files changed, 274 insertions(+), 196 deletions(-)

That's very tiny overall in the scheme of what we take for the stable
trees.

> And a few more fixes that will add to the above before v6.18. I'll defer
> to Peter to decide the best course of action.

We'll defer to the maintainers of the subsystem as to what they want
here.  If they say take this smaller patch, we'll be glad to do so.

thanks,

greg k-h


* Re: [PATCH v6.12] sched/fair: Block delayed tasks on throttled hierarchy during dequeue
  2025-10-15  7:27             ` Greg Kroah-Hartman
@ 2025-10-15  8:22               ` Peter Zijlstra
  0 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2025-10-15  8:22 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: K Prateek Nayak, Sasha Levin, stable, Matt Fleming, Ingo Molnar,
	Juri Lelli, Vincent Guittot, linux-kernel, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	kernel-team, Matt Fleming, Oleg Nesterov, John Stultz,
	Chris Arges

On Wed, Oct 15, 2025 at 09:27:53AM +0200, Greg Kroah-Hartman wrote:
> On Wed, Oct 15, 2025 at 11:57:19AM +0530, K Prateek Nayak wrote:
> > Hello Greg,
> > 
> > On 10/15/2025 11:44 AM, Greg Kroah-Hartman wrote:
> > >> Greg, Sasha,
> > >>
> > >> This fix cleanly applies on top of v6.16.y and v6.17.y stable kernels
> > >> too when cherry-picked from v6.12.y branch (or with 'git am -3'). Let me
> > >> know if you would like me to send a seperate patch for each.
> > >>
> > >> As mentioned above, upstream fixes this as part of a larger feature
> > >> and we would only like these bits backported. If there are any future
> > >> conflicts in this area during backporting, I would be more than happy
> > >> to help resolve them.
> > > 
> > > Why not just backport all of the mainline changes instead?  As I say a
> > > lot, whenever we do these "one off" changes, it's almost always wrong
> > > and causes problems over the years going forward as other changes around
> > > the same area can not be backported either.
> > > 
> > > So please, try to just backport the original commits.
> > 
> > Peter was in favor of backporting just the necessary bits in
> > https://lore.kernel.org/all/20250929103836.GK3419281@noisy.programming.kicks-ass.net/
> > 
> > Backporting the whole of the per-task throttle feature is a lot more
> > heavy-handed, with the core changes adding:
> > 
> >  include/linux/sched.h |   5 +
> >  kernel/sched/core.c   |   3 +
> >  kernel/sched/fair.c   | 451 ++++++++++++++++++++++++------------------
> >  kernel/sched/pelt.h   |   4 +-
> >  kernel/sched/sched.h  |   7 +-
> >  5 files changed, 274 insertions(+), 196 deletions(-)
> 
> That's very tiny overall in the scheme of what we take for the stable
> trees.
> 
> > And a few more fixes that will add to the above before v6.18. I'll defer
> > to Peter to decide the best course of action.
> 
> We'll defer to the maintainers of the subsystem as to what they want
> here.  If they say take this smaller patch, we'll be glad to do so.

So if the timing of all this would've been slightly different, I'd have
taken the smaller patch and routed it through sched/urgent in time for
the 6.17 release. And we'd not have had this discussion, but alas.

Also, while we're fairly confident in the task throttling rework, it is
somewhat invasive and hasn't seen widespread testing -- it is
conceivable there are performance issues.

So at this point I would really rather backport the smaller patch that
fixes the immediate bug.

Thanks!


* [PATCH 6.17] sched/fair: Block delayed tasks on throttled hierarchy during dequeue
  2025-09-26 15:34     ` Matt Fleming
  2025-09-29  3:53       ` K Prateek Nayak
  2025-10-15  6:03       ` [PATCH v6.12] sched/fair: Block delayed tasks on throttled hierarchy during dequeue K Prateek Nayak
@ 2025-10-23  4:03       ` K Prateek Nayak
  2 siblings, 0 replies; 14+ messages in thread
From: K Prateek Nayak @ 2025-10-23  4:03 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Sasha Levin, stable, Matt Fleming,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, kernel-team, Matt Fleming, Oleg Nesterov,
	John Stultz, Chris Arges, Luis Claudio R. Goncalves,
	K Prateek Nayak

Dequeuing a fair task on a throttled hierarchy returns early on
encountering a throttled cfs_rq since the throttle path has already
dequeued the hierarchy above and has adjusted the h_nr_* accounting till
the root cfs_rq.

dequeue_entities() crucially misses calling __block_task() for delayed
tasks being dequeued on the throttled hierarchies, but this was mostly
harmless until commit b7ca5743a260 ("sched/core: Tweak
wait_task_inactive() to force dequeue sched_delayed tasks") since all
existing cases would re-enqueue the task if task_on_rq_queued() returned
true and the task would eventually be blocked at pick after the
hierarchy was unthrottled.

wait_task_inactive() is special as it expects the delayed task on
throttled hierarchy to reach the blocked state on dequeue but since
__block_task() is never called, task_on_rq_queued() continues to return
true. Furthermore, since the task is now off the hierarchy, the pick
never reaches it to fully block the task even after unthrottle leading
to wait_task_inactive() looping endlessly.

Remedy this by calling __block_task() if a delayed task is being
dequeued on a throttled hierarchy.

This fix is only required for stable kernels implementing delay dequeue
(>= v6.12) before v6.18 since upstream commit e1fad12dcb66 ("sched/fair:
Switch to task based throttle model") indirectly fixes this by removing
the early return conditions in dequeue_entities() as part of the per-task
throttle feature.

Cc: stable@vger.kernel.org
Reported-by: Matt Fleming <matt@readmodwrite.com>
Closes: https://lore.kernel.org/all/20250925133310.1843863-1-matt@readmodwrite.com/
Fixes: b7ca5743a260 ("sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks")
Tested-by: Matt Fleming <mfleming@cloudflare.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Hello Greg, Sasha,

Please consider the same fix for the v6.17 stable kernel too since there
is a report of a similar issue on a v6.17.1-based RT kernel at
https://lore.kernel.org/lkml/aPN7XBJbGhdWJDb2@uudg.org/ and Luis
confirmed that this fix solves the issue for him in
https://lore.kernel.org/lkml/aPgm6KvDx5Os2oJS@uudg.org/
---
 kernel/sched/fair.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8ce56a8d507f..f0a4d9d7424d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6969,6 +6969,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 	int h_nr_runnable = 0;
 	struct cfs_rq *cfs_rq;
 	u64 slice = 0;
+	int ret = 0;
 
 	if (entity_is_task(se)) {
 		p = task_of(se);
@@ -6998,7 +6999,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
-			return 0;
+			goto out;
 
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
@@ -7039,7 +7040,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
-			return 0;
+			goto out;
 	}
 
 	sub_nr_running(rq, h_nr_queued);
@@ -7048,6 +7049,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 	if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
 		rq->next_balance = jiffies;
 
+	ret = 1;
+out:
 	if (p && task_delayed) {
 		WARN_ON_ONCE(!task_sleep);
 		WARN_ON_ONCE(p->on_rq != 1);
@@ -7063,7 +7066,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 		__block_task(rq, p);
 	}
 
-	return 1;
+	return ret;
 }
 
 /*

base-commit: 6c7871823908a4330e145d635371582f76ce1407
-- 
2.34.1


