From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from desiato.infradead.org (desiato.infradead.org [90.155.92.199])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7A1FD39769B
	for <linux-kernel@vger.kernel.org>; Fri,  3 Apr 2026 09:52:41 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=90.155.92.199
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1775209966; cv=none; b=uvIpqAXobkM7zE/yikFYPa06QiLFBdJ6kaP87h3HZ2Pg0HgP6tisP1xCj8/31aram8LjuCCZkbAkfhrOARyQN0hlaP46HzpLspZ4zWJR7SgABxJv4SPQs8SS8Ofwu3gMytMv1P+BBVEhQFR36cucBccJuQQPqzjDOKWICXPJJuA=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1775209966; c=relaxed/simple;
	bh=+CoXeVAZBq58mqk6fUg0TeeN5a2tAA6hTk7Wj/RuPs4=;
	h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version:
	 Content-Type:Content-Disposition:In-Reply-To; b=oQkh8dDGUe99hZLRBQ9tszifuGelCG+yiYoqrlAaFjITcUS7BIciyRjiubgGYhlS+sT9dFf6ZqQ8quZtkY7K6OtaDvOkZAfflf4rsz4H/nVfudl4MsLWX3Z4Qw3nL/4Xa+rCACEQyKjLgch1hTjKzlWpgpkoGaQUrsCFXVo8JpE=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=infradead.org; spf=none smtp.mailfrom=infradead.org; dkim=pass (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b=qY1HiOgq; arc=none smtp.client-ip=90.155.92.199
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=infradead.org
Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=infradead.org
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b="qY1HiOgq"
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
	d=infradead.org; s=desiato.20200630; h=In-Reply-To:Content-Type:MIME-Version:
	References:Message-ID:Subject:Cc:To:From:Date:Sender:Reply-To:
	Content-Transfer-Encoding:Content-ID:Content-Description;
	bh=V9n5y2Kb4B+iM7QrSSdBQy5ry8dLOG+UOLM1O73nOmE=; b=qY1HiOgqiRAudlBgh6MlVPTF2h
	bIengrJtgZfcVDlYrs+jcCWYwj7GLly7hLh55vhFg/HwSyoleP+bxPZkgDjVr2loljA9RoH9wZsFB
	+AybRar79NU2hUvdtFsNbGpbx3/5HN9BguVFoDUfJoRlpb0Ps6R0qkUQWm/3la9lBAP3vkDzyt/9g
	cW3bjlSPXWc0zry43VJhPYkwd5kE8ZQvW7D601kG4fhfIobX/iwJ46IfzZoxK4lQt0qr0QFAFKn94
	vI5cN3VXrlaxC2bPBuHVMo/kdS3nr74IJkHj7wVtVYpHr6asKAGF/yP7WHVdeNH1gwXIwlBmOMsmu
	oEqVtR6Q==;
Received: from 77-249-17-252.cable.dynamic.v4.ziggo.nl ([77.249.17.252] helo=noisy.programming.kicks-ass.net)
	by desiato.infradead.org with esmtpsa (Exim 4.98.2 #2 (Red Hat Linux))
	id 1w8bCW-00000003w4s-0s7H;
	Fri, 03 Apr 2026 09:52:28 +0000
Received: by noisy.programming.kicks-ass.net (Postfix, from userid 1000)
	id D5FC83032D1; Fri, 03 Apr 2026 11:52:25 +0200 (CEST)
Date: Fri, 3 Apr 2026 11:52:25 +0200
From: Peter Zijlstra <peterz@infradead.org>
To: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: John Stultz <jstultz@google.com>, LKML <linux-kernel@vger.kernel.org>,
	Joel Fernandes <joelagnelf@nvidia.com>,
	Qais Yousef <qyousef@layalina.io>, Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Valentin Schneider <vschneid@redhat.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Zimuzo Ezeozue <zezeozue@google.com>, Mel Gorman <mgorman@suse.de>,
	Will Deacon <will@kernel.org>, Waiman Long <longman@redhat.com>,
	Boqun Feng <boqun.feng@gmail.com>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	Metin Kaya <Metin.Kaya@arm.com>,
	Xuewen Yan <xuewen.yan94@gmail.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Daniel Lezcano <daniel.lezcano@linaro.org>,
	Suleiman Souhlal <suleiman@google.com>,
	kuyo chang <kuyo.chang@mediatek.com>, hupu <hupu.gm@gmail.com>,
	kernel-team@android.com
Subject: Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
Message-ID: <20260403095225.GY3738010@noisy.programming.kicks-ass.net>
References: <20260324191337.1841376-1-jstultz@google.com>
 <36e96f87-a682-436e-aefc-13e2e5810019@amd.com>
 <20260327114844.GQ2872@noisy.programming.kicks-ass.net>
 <33e60181-1809-44e1-bc4c-8ac7f79d49d6@amd.com>
 <20260327160017.GK3738010@noisy.programming.kicks-ass.net>
 <1515d405-62fc-4952-842f-b69e2bf192c0@amd.com>
 <20260402155055.GV3738010@noisy.programming.kicks-ass.net>
 <CANDhNCoAp8MeSbyO_Q3PanZDZGLS04U0i==Uv_9Ev4eX9VvUVA@mail.gmail.com>
 <e6dfce6a-ee6f-441b-9f21-e2036d0a3e5d@amd.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <e6dfce6a-ee6f-441b-9f21-e2036d0a3e5d@amd.com>

On Fri, Apr 03, 2026 at 11:39:25AM +0530, K Prateek Nayak wrote:

> >> @@ -3702,28 +3723,39 @@ ttwu_do_activate(struct rq *rq, struct t
> >>   */
> >>  static int ttwu_runnable(struct task_struct *p, int wake_flags)
> >>  {
> >> -       struct rq_flags rf;
> >> -       struct rq *rq;
> >> -       int ret = 0;
> >> +       ACQUIRE(__task_rq_lock, guard)(p);
> >> +       struct rq *rq = guard.rq;
> >>
> >> -       rq = __task_rq_lock(p, &rf);
> >> -       if (task_on_rq_queued(p)) {
> >> -               update_rq_clock(rq);
> >> -               if (p->se.sched_delayed)
> >> -                       enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> >> -               if (!task_on_cpu(rq, p)) {
> >> +       if (!task_on_rq_queued(p))
> >> +               return 0;
> >> +
> >> +       if (sched_proxy_exec() && p->blocked_on) {
> >> +               guard(raw_spinlock)(&p->blocked_lock);
> >> +               struct mutex *lock = p->blocked_on;
> >> +               if (lock) {
> >>                         /*
> >> -                        * When on_rq && !on_cpu the task is preempted, see if
> >> -                        * it should preempt the task that is current now.
> >> +                        * TASK_WAKING is a special state and results in
> >> +                        * DEQUEUE_SPECIAL such that the task will actually be
> >> +                        * forced from the runqueue.
> >>                          */
> >> -                       wakeup_preempt(rq, p, wake_flags);
> >> +                       block_task(rq, p, TASK_WAKING);
> 
> This needs to reset the rq->donor if the task getting woken up is the
> current donor.

*groan*, that is a fun case. I'll ponder that.

> >> +                       p->blocked_on = NULL;
> >> +                       return 0;
> >>                 }
> >> -               ttwu_do_wakeup(p);
> >> -               ret = 1;
> >>         }
> >> -       __task_rq_unlock(rq, p, &rf);
> >>
> >> -       return ret;
> >> +       update_rq_clock(rq);
> 
> nit. Since block_task() adds a DEQUEUE_NOCLOCK now we need to move that
> clock update before the block.

D'0h :-)

> >> +       if (p->se.sched_delayed)
> >> +               enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> > 
> > I can't precisely remember the details now, but I believe we need to
> > handle enqueueing sched_delayed tasks before handling blocked_on
> > tasks.
> 
> So proxy_deactivate() can still delay the task leading to
> task_on_rq_queued() and the wakeup coming to ttwu_runnable() so either
> we can dequeue it fully in proxy_deactivate() or we need to teach
> block_task() to add a DEQUEUE_DELAYED flag when task_is_blocked().
> 
> I think the former is cleaner but we don't decay lag for fair task :-(
> 
> We can't simply re-enqueue it either since proxy migration might have
> put it on a CPU outside its affinity mask so we need to take a full
> dequeue + wakeup in ttwu_runnable().

Right, sanest option is to have ttwu_runnable() deal with this.

> >> -static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
> >> -                              struct task_struct *p)
> >> -       __must_hold(__rq_lockp(rq))
> >> -{
> >> -}
> >> -
> 
> Went a little heavy of the delete there did you? :-)

Well, I thought that was the whole idea, have ttwu() handle this :-)

> >>  /*
> >>   * Find runnable lock owner to proxy for mutex blocked donor
> >>   *
> >> @@ -6777,7 +6723,7 @@ find_proxy_task(struct rq *rq, struct ta
> >>                                 clear_task_blocked_on(p, PROXY_WAKING);
> >>                                 return p;
> >>                         }
> >> -                       goto force_return;
> >> +                       goto deactivate;
> >>                 }
> 
> This makes sense if we preserve the !TASK_RUNNING + p->blocked_on
> invariant since we'll definitely get a wakeup here.

Right, so TASK_RUNNING must imply !->blocked_on.

> >>
> >>                 /*
> >> @@ -6812,7 +6758,7 @@ find_proxy_task(struct rq *rq, struct ta
> >>                                 __clear_task_blocked_on(p, NULL);
> >>                                 return p;
> >>                         }
> >> -                       goto force_return;
> >> +                       goto deactivate;
> 
> This too makes sense considering !owner implies some task will be woken
> up but ... if we take this task off and another task steals the mutex,
> this task will no longer be able to proxy it since it is completely
> blocked now.
> 
> Probably not desired. We should at least let it run and see if it can
> get the mutex and evaluate the "p->blocked_on" again since !owner is
> a limbo state.

I need to go re-read the mutex side of things, but doesn't that do
hand-off way more agressively?

Anyway, one thing that is completely missing is a fast path for when the
task is still inside its valid mask. I suspect adding that back will
cure some of these issues.

> So I added the following in top of Peter's diff on top of
> queue:sched/core and it hasn't crashed and burnt yet when running a
> handful instances of sched-messaging with a mix of fair and SCHED_RR
> priority:
> 
>   (Includes John's findings from the parallel thread)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5b2b2451720a..e845e3a8ae65 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2160,7 +2160,7 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
>  	dequeue_task(rq, p, flags);
>  }
>  
> -static bool block_task(struct rq *rq, struct task_struct *p, unsigned long task_state)
> +static void block_task(struct rq *rq, struct task_struct *p, unsigned long task_state)
>  {
>  	int flags = DEQUEUE_NOCLOCK;
>  
> @@ -3696,6 +3696,20 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
>  	}
>  }
>  
> +static void zap_balance_callbacks(struct rq *rq);
> +
> +static inline void proxy_reset_donor(struct rq *rq)
> +{
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +	WARN_ON_ONCE(rq->curr == rq->donor);
> +
> +	put_prev_set_next_task(rq, rq->donor, rq->curr);
> +	rq_set_donor(rq, rq->curr);
> +	zap_balance_callbacks(rq);
> +	resched_curr(rq);
> +#endif
> +}

This one hurts my bain :-)

>  /*
>   * Consider @p being inside a wait loop:
>   *
> @@ -3730,6 +3744,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
>  		return 0;
>  
>  	update_rq_clock(rq);
> +	if (p->se.sched_delayed)
> +		enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);

Right, this works but seems wasteful, might be better to add
DEQUEUE_DELAYED in the blocked_on case.

>  	if (sched_proxy_exec() && p->blocked_on) {

So I had doubts about this lockless test of ->blocked_on, I still cannot
convince myself it is correct.

>  		guard(raw_spinlock)(&p->blocked_lock);
>  		struct mutex *lock = p->blocked_on;
> @@ -3738,15 +3754,20 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
>  			 * TASK_WAKING is a special state and results in
>  			 * DEQUEUE_SPECIAL such that the task will actually be
>  			 * forced from the runqueue.
> +			 *
> +			 * XXX: All of this is now equivalent of
> +			 * proxy_needs_return() from John's series :-)
>  			 */
> -			block_task(rq, p, TASK_WAKING);
>  			p->blocked_on = NULL;
> +			if (task_current(rq, p))
> +				goto out;

Right, fair enough :-) This could also be done when rq->cpu is inside
p->cpus_ptr mask, because in that case we don't strictly need a
migration. Thinking about that was on the todo list.

> +			if (task_current_donor(rq, p))
> +				proxy_reset_donor(rq);

Fun fun fun :-)

> +			block_task(rq, p, TASK_WAKING);
>  			return 0;
>  		}
>  	}
> -
> -	if (p->se.sched_delayed)
> -		enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> +out:
>  	if (!task_on_cpu(rq, p)) {
>  		/*
>  		 * When on_rq && !on_cpu the task is preempted, see if
> @@ -4256,6 +4277,15 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  		 */
>  		smp_cond_load_acquire(&p->on_cpu, !VAL);
>  
> +		/*
> +		 * We never clear the blocked_on relation on proxy_deactivate.
> +		 * If we don't clear it here, we have TASK_RUNNING + p->blocked_on
> +		 * when waking up. Since this is a fully blocked, off CPU task
> +		 * waking up, it should be safe to clear the blocked_on relation.
> +		 */
> +		if (task_is_blocked(p))
> +			clear_task_blocked_on(p, NULL);
> +

Aah, yes! This is when find_proxy_task() hits deactivate() for us and we
skip ttwu_runnable(). We still need to clear ->blocked_on.

I am once again not sure on the lockless nature of accessing
->blocked_on.

>  		cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
>  		if (task_cpu(p) != cpu) {
>  			if (p->in_iowait) {
> @@ -6977,6 +7007,10 @@ static void __sched notrace __schedule(int sched_mode)
>  		switch_count = &prev->nvcsw;
>  	}
>  
> +	/* See: https://github.com/kudureranganath/linux/commit/0d6a01bb19db39f045d6f0f5fb4d196500091637 */
> +	if (!prev_state && task_is_blocked(prev))
> +		clear_task_blocked_on(prev, NULL);
> +

This one confuses me, ttwu() should never results in ->blocked_on being
set.

>  pick_again:
>  	assert_balance_callbacks_empty(rq);
>  	next = pick_next_task(rq, rq->donor, &rf);