public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: K Prateek Nayak <kprateek.nayak@amd.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <jstultz@google.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Joel Fernandes <joelagnelf@nvidia.com>,
	Qais Yousef <qyousef@layalina.io>, Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	"Vincent Guittot" <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Valentin Schneider <vschneid@redhat.com>,
	"Steven Rostedt" <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	"Zimuzo Ezeozue" <zezeozue@google.com>,
	Mel Gorman <mgorman@suse.de>, Will Deacon <will@kernel.org>,
	Waiman Long <longman@redhat.com>,
	Boqun Feng <boqun.feng@gmail.com>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	Metin Kaya <Metin.Kaya@arm.com>,
	Xuewen Yan <xuewen.yan94@gmail.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Daniel Lezcano <daniel.lezcano@linaro.org>,
	"Suleiman Souhlal" <suleiman@google.com>,
	kuyo chang <kuyo.chang@mediatek.com>, hupu <hupu.gm@gmail.com>,
	<kernel-team@android.com>
Subject: Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
Date: Fri, 3 Apr 2026 15:55:22 +0530	[thread overview]
Message-ID: <1d2d4596-93d6-4d87-babc-084b8d6c2d98@amd.com> (raw)
In-Reply-To: <20260403095225.GY3738010@noisy.programming.kicks-ass.net>

Hello Peter,

On 4/3/2026 3:22 PM, Peter Zijlstra wrote:
>>>> +       if (p->se.sched_delayed)
>>>> +               enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
>>>
>>> I can't precisely remember the details now, but I believe we need to
>>> handle enqueueing sched_delayed tasks before handling blocked_on
>>> tasks.
>>
>> So proxy_deactivate() can still delay the task leading to
>> task_on_rq_queued() and the wakeup coming to ttwu_runnable() so either
>> we can dequeue it fully in proxy_deactivate() or we need to teach
>> block_task() to add a DEQUEUE_DELAYED flag when task_is_blocked().
>>
>> I think the former is cleaner but we don't decay lag for fair task :-(
>>
>> We can't simply re-enqueue it either since proxy migration might have
>> put it on a CPU outside its affinity mask so we need to take a full
>> dequeue + wakeup in ttwu_runnable().
> 
> Right, sanest option is to have ttwu_runnable() deal with this.

Ack! For now I've used John's original move of doing re-enqueue
before doing a dequeue if we find a delayed + blocked_on task.

> 
>>>> -static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
>>>> -                              struct task_struct *p)
>>>> -       __must_hold(__rq_lockp(rq))
>>>> -{
>>>> -}
>>>> -
>>
>> Went a little heavy of the delete there did you? :-)
> 
> Well, I thought that was the whole idea, have ttwu() handle this :-)
> 
>>>>  /*
>>>>   * Find runnable lock owner to proxy for mutex blocked donor
>>>>   *
>>>> @@ -6777,7 +6723,7 @@ find_proxy_task(struct rq *rq, struct ta
>>>>                                 clear_task_blocked_on(p, PROXY_WAKING);
>>>>                                 return p;
>>>>                         }
>>>> -                       goto force_return;
>>>> +                       goto deactivate;
>>>>                 }
>>
>> This makes sense if we preserve the !TASK_RUNNING + p->blocked_on
>> invariant since we'll definitely get a wakeup here.
> 
> Right, so TASK_RUNNING must imply !->blocked_on.
> 
>>>>
>>>>                 /*
>>>> @@ -6812,7 +6758,7 @@ find_proxy_task(struct rq *rq, struct ta
>>>>                                 __clear_task_blocked_on(p, NULL);
>>>>                                 return p;
>>>>                         }
>>>> -                       goto force_return;
>>>> +                       goto deactivate;
>>
>> This too makes sense considering !owner implies some task will be woken
>> up but ... if we take this task off and another task steals the mutex,
>> this task will no longer be able to proxy it since it is completely
>> blocked now.
>>
>> Probably not desired. We should at least let it run and see if it can
>> get the mutex and evaluate the "p->blocked_on" again since !owner is
>> a limbo state.
> 
> I need to go re-read the mutex side of things, but doesn't that do
> hand-off way more agressively?

Ack but we have optimistic spinning enabled for performance reasons so
there is still a chance that the task may not get the mutex but now that
I think about it, it will definitely receive a wakeup so it it should be
able to re-establish the chain when it gets on CPU again.

> 
> Anyway, one thing that is completely missing is a fast path for when the
> task is still inside its valid mask. I suspect adding that back will
> cure some of these issues.
> 
>> So I added the following in top of Peter's diff on top of
>> queue:sched/core and it hasn't crashed and burnt yet when running a
>> handful instances of sched-messaging with a mix of fair and SCHED_RR
>> priority:
>>
>>   (Includes John's findings from the parallel thread)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 5b2b2451720a..e845e3a8ae65 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2160,7 +2160,7 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
>>  	dequeue_task(rq, p, flags);
>>  }
>>  
>> -static bool block_task(struct rq *rq, struct task_struct *p, unsigned long task_state)
>> +static void block_task(struct rq *rq, struct task_struct *p, unsigned long task_state)
>>  {
>>  	int flags = DEQUEUE_NOCLOCK;
>>  
>> @@ -3696,6 +3696,20 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
>>  	}
>>  }
>>  
>> +static void zap_balance_callbacks(struct rq *rq);
>> +
>> +static inline void proxy_reset_donor(struct rq *rq)
>> +{
>> +#ifdef CONFIG_SCHED_PROXY_EXEC
>> +	WARN_ON_ONCE(rq->curr == rq->donor);
>> +
>> +	put_prev_set_next_task(rq, rq->donor, rq->curr);
>> +	rq_set_donor(rq, rq->curr);
>> +	zap_balance_callbacks(rq);
>> +	resched_curr(rq);
>> +#endif
>> +}
> 
> This one hurts my bain :-)
> 
>>  /*
>>   * Consider @p being inside a wait loop:
>>   *
>> @@ -3730,6 +3744,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
>>  		return 0;
>>  
>>  	update_rq_clock(rq);
>> +	if (p->se.sched_delayed)
>> +		enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> 
> Right, this works but seems wasteful, might be better to add
> DEQUEUE_DELAYED in the blocked_on case.
> 
>>  	if (sched_proxy_exec() && p->blocked_on) {
> 
> So I had doubts about this lockless test of ->blocked_on, I still cannot
> convince myself it is correct.

Let me give a try: A task's "blocked_on" starts off as a valid mutex and
can be transitioned optionally to PROXY_WAKING (!= NULL) before being
cleared.

If blocked_on is cleared directly, PROXY_WAKING transition never
happens even if someone does set_task_blocked_on_waking() since we bail
out early if !p->blocked_on.

All "p->blocked_on" transition happen with "blocked_on_lock" held.

So that begs the question, when is "blocked_on" actually cleared?

1) If the task is task_on_rq_queued(), we either clear it in schedule()
   (find_proxy_task() to be precise) or in ttwu_runnable() - both with
   rq_lock held.

2) *NEW* If the task is off rq and is waking up, it means there is a
   ttwu_state_match() and without proxy, the task would have woken up
   and executed on the CPU.

   Since the task is completely off rq, schedule() cannot clear the
   p->blocked_on. Only other remote transition possible is to
   PROXY_WAKING (!= NULL).

   So *inspecting* the p->blocked_on relation without the
   blocked_on_lock held should be fine to know if the task has a
   blocked_on relation.

Only the task itself can set "p->blocked_on" to a valid mutex when
running on the CPU so it is out of question we can suddenly get a
transition to a new mutex when we are in schedule() or in middle of
waking the task.

> 
>>  		guard(raw_spinlock)(&p->blocked_lock);
>>  		struct mutex *lock = p->blocked_on;
>> @@ -3738,15 +3754,20 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
>>  			 * TASK_WAKING is a special state and results in
>>  			 * DEQUEUE_SPECIAL such that the task will actually be
>>  			 * forced from the runqueue.
>> +			 *
>> +			 * XXX: All of this is now equivalent of
>> +			 * proxy_needs_return() from John's series :-)
>>  			 */
>> -			block_task(rq, p, TASK_WAKING);
>>  			p->blocked_on = NULL;
>> +			if (task_current(rq, p))
>> +				goto out;
> 
> Right, fair enough :-) This could also be done when rq->cpu is inside
> p->cpus_ptr mask, because in that case we don't strictly need a
> migration. Thinking about that was on the todo list.

Ack. Once concern there is that task was out of load balancer's
purview until it is "p->blocked_on" and this could be a good
spot for balance during wakeup.

> 
>> +			if (task_current_donor(rq, p))
>> +				proxy_reset_donor(rq);
> 
> Fun fun fun :-)
> 
>> +			block_task(rq, p, TASK_WAKING);
>>  			return 0;
>>  		}
>>  	}
>> -
>> -	if (p->se.sched_delayed)
>> -		enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
>> +out:
>>  	if (!task_on_cpu(rq, p)) {
>>  		/*
>>  		 * When on_rq && !on_cpu the task is preempted, see if
>> @@ -4256,6 +4277,15 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>>  		 */
>>  		smp_cond_load_acquire(&p->on_cpu, !VAL);
>>  
>> +		/*
>> +		 * We never clear the blocked_on relation on proxy_deactivate.
>> +		 * If we don't clear it here, we have TASK_RUNNING + p->blocked_on
>> +		 * when waking up. Since this is a fully blocked, off CPU task
>> +		 * waking up, it should be safe to clear the blocked_on relation.
>> +		 */
>> +		if (task_is_blocked(p))
>> +			clear_task_blocked_on(p, NULL);
>> +
> 
> Aah, yes! This is when find_proxy_task() hits deactivate() for us and we
> skip ttwu_runnable(). We still need to clear ->blocked_on.
> 
> I am once again not sure on the lockless nature of accessing
> ->blocked_on.

I hope I have convinced you from the short analysis above :-)

> 
>>  		cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
>>  		if (task_cpu(p) != cpu) {
>>  			if (p->in_iowait) {
>> @@ -6977,6 +7007,10 @@ static void __sched notrace __schedule(int sched_mode)
>>  		switch_count = &prev->nvcsw;
>>  	}
>>  
>> +	/* See: https://github.com/kudureranganath/linux/commit/0d6a01bb19db39f045d6f0f5fb4d196500091637 */
>> +	if (!prev_state && task_is_blocked(prev))
>> +		clear_task_blocked_on(prev, NULL);
>> +
> 
> This one confuses me, ttwu() should never results in ->blocked_on being
> set.

This is from the signal_pending_state() in try_to_block_task() putting
prev to TASK_RUNNING while still having p->blocked_on.

It is expected that task executes and re-evaluates if it needs to block
on the mutex again or simply return -EINTR from
mutex_lock_interruptible().

> 
>>  pick_again:
>>  	assert_balance_callbacks_empty(rq);
>>  	next = pick_next_task(rq, rq->donor, &rf);

-- 
Thanks and Regards,
Prateek


  reply	other threads:[~2026-04-03 10:25 UTC|newest]

Thread overview: 56+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-24 19:13 [PATCH v26 00/10] Simple Donor Migration for Proxy Execution John Stultz
2026-03-24 19:13 ` [PATCH v26 01/10] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr() John Stultz
2026-04-03 12:30   ` [tip: sched/core] " tip-bot2 for John Stultz
2026-03-24 19:13 ` [PATCH v26 02/10] sched: Minimise repeated sched_proxy_exec() checking John Stultz
2026-04-03 12:30   ` [tip: sched/core] " tip-bot2 for John Stultz
2026-03-24 19:13 ` [PATCH v26 03/10] sched: Fix potentially missing balancing with Proxy Exec John Stultz
2026-04-03 12:30   ` [tip: sched/core] " tip-bot2 for John Stultz
2026-03-24 19:13 ` [PATCH v26 04/10] locking: Add task::blocked_lock to serialize blocked_on state John Stultz
2026-04-03 12:30   ` [tip: sched/core] " tip-bot2 for John Stultz
2026-03-24 19:13 ` [PATCH v26 05/10] sched: Fix modifying donor->blocked on without proper locking John Stultz
2026-03-26 21:45   ` Steven Rostedt
2026-04-03 12:30   ` [tip: sched/core] " tip-bot2 for John Stultz
2026-03-24 19:13 ` [PATCH v26 06/10] sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration John Stultz
2026-04-03 12:30   ` [tip: sched/core] " tip-bot2 for John Stultz
2026-03-24 19:13 ` [PATCH v26 07/10] sched: Add assert_balance_callbacks_empty helper John Stultz
2026-04-03 12:30   ` [tip: sched/core] " tip-bot2 for John Stultz
2026-03-24 19:13 ` [PATCH v26 08/10] sched: Add logic to zap balance callbacks if we pick again John Stultz
2026-04-03 12:30   ` [tip: sched/core] " tip-bot2 for John Stultz
2026-03-24 19:13 ` [PATCH v26 09/10] sched: Move attach_one_task and attach_task helpers to sched.h John Stultz
2026-04-03 12:30   ` [tip: sched/core] " tip-bot2 for John Stultz
2026-03-24 19:13 ` [PATCH v26 10/10] sched: Handle blocked-waiter migration (and return migration) John Stultz
2026-03-26 22:52   ` Steven Rostedt
2026-03-27  4:47     ` K Prateek Nayak
2026-03-27 12:47       ` Peter Zijlstra
2026-04-02 14:43   ` Peter Zijlstra
2026-04-02 15:08     ` Peter Zijlstra
2026-04-02 17:43       ` John Stultz
2026-04-02 17:34     ` John Stultz
2026-04-03 12:30   ` [tip: sched/core] " tip-bot2 for John Stultz
2026-03-25 10:52 ` [PATCH v26 00/10] Simple Donor Migration for Proxy Execution K Prateek Nayak
2026-03-27 11:48   ` Peter Zijlstra
2026-03-27 13:33     ` K Prateek Nayak
2026-03-27 15:20       ` Peter Zijlstra
2026-03-27 15:41         ` Peter Zijlstra
2026-03-27 16:00       ` Peter Zijlstra
2026-03-27 16:57         ` K Prateek Nayak
2026-04-02 15:50           ` Peter Zijlstra
2026-04-02 18:31             ` John Stultz
2026-04-02 21:04               ` John Stultz
2026-04-03  6:09               ` K Prateek Nayak
2026-04-03  9:52                 ` Peter Zijlstra
2026-04-03 10:25                   ` K Prateek Nayak [this message]
2026-04-03 11:28                     ` Peter Zijlstra
2026-04-03 13:43                       ` K Prateek Nayak
2026-04-03 14:38                         ` Peter Zijlstra
2026-04-03 15:39                           ` K Prateek Nayak
2026-04-03 21:08                             ` Peter Zijlstra
2026-04-04  0:26                             ` John Stultz
2026-04-04  5:49                               ` K Prateek Nayak
2026-04-04  6:07                                 ` John Stultz
2026-04-06  2:40                                   ` K Prateek Nayak
2026-04-03 12:54                     ` Peter Zijlstra
2026-04-03  9:18               ` Peter Zijlstra
2026-03-27 19:15     ` John Stultz
2026-03-27 19:10   ` John Stultz
2026-03-28  4:53     ` K Prateek Nayak

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1d2d4596-93d6-4d87-babc-084b8d6c2d98@amd.com \
    --to=kprateek.nayak@amd.com \
    --cc=Metin.Kaya@arm.com \
    --cc=boqun.feng@gmail.com \
    --cc=bsegall@google.com \
    --cc=daniel.lezcano@linaro.org \
    --cc=dietmar.eggemann@arm.com \
    --cc=hupu.gm@gmail.com \
    --cc=joelagnelf@nvidia.com \
    --cc=jstultz@google.com \
    --cc=juri.lelli@redhat.com \
    --cc=kernel-team@android.com \
    --cc=kuyo.chang@mediatek.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=longman@redhat.com \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=paulmck@kernel.org \
    --cc=peterz@infradead.org \
    --cc=qyousef@layalina.io \
    --cc=rostedt@goodmis.org \
    --cc=suleiman@google.com \
    --cc=tglx@linutronix.de \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    --cc=will@kernel.org \
    --cc=xuewen.yan94@gmail.com \
    --cc=zezeozue@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox