Date: Thu, 2 Apr 2026 17:50:55 +0200
From: Peter Zijlstra
To: K Prateek Nayak
Cc: John Stultz, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon,
	Waiman Long, Boqun Feng, "Paul E. McKenney", Metin Kaya, Xuewen Yan,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu,
	kernel-team@android.com
Subject: Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
Message-ID: <20260402155055.GV3738010@noisy.programming.kicks-ass.net>
References: <20260324191337.1841376-1-jstultz@google.com>
	<36e96f87-a682-436e-aefc-13e2e5810019@amd.com>
	<20260327114844.GQ2872@noisy.programming.kicks-ass.net>
	<33e60181-1809-44e1-bc4c-8ac7f79d49d6@amd.com>
	<20260327160017.GK3738010@noisy.programming.kicks-ass.net>
	<1515d405-62fc-4952-842f-b69e2bf192c0@amd.com>
In-Reply-To: <1515d405-62fc-4952-842f-b69e2bf192c0@amd.com>

On Fri, Mar 27, 2026 at 10:27:08PM +0530, K Prateek Nayak wrote:
> So taking a step back, this is what we have today (at least the
> common scenario):
>
>   CPU0 (donor - A)                        CPU1 (owner - B)
>   ================                        ================
>
>   mutex_lock()
>     __set_current_state(TASK_INTERRUPTIBLE)
>     __set_task_blocked_on(M)
>     schedule()
>       /* Retained for proxy */
>       proxy_migrate_task()
>       ==================================> /* Migrates to CPU1 */
>                                           ...
>                                           send_sig(B)
>                                             signal_wake_up_state()
>                                               wake_up_state()
>                                                 try_to_wake_up()
>                                                   ttwu_runnable()
>                                                     ttwu_do_wakeup() =============> /* A->__state = TASK_RUNNING */
>
>                                           /*
>                                            * After this point ttwu_state_match()
>                                            * will fail for A so a mutex_unlock()
>                                            * will have to go through __schedule()
>                                            * for return migration.
>                                            */
>
>                                           __schedule()
>                                             find_proxy_task()
>
>                                             /* Scenario 1 - B sleeps */
>                                             __clear_task_blocked_on()
>                                             proxy_deactivate(A)
>                                             /* A->__state == TASK_RUNNING */
>                                             /* fallthrough */
>
>                                             /* Scenario 2 - return migration after unlock() */
>                                             __clear_task_blocked_on()
>                                             /*
>                                              * At this point proxy stops.
>                                              * Much later after signal.
>                                              */
>                                             proxy_force_return()
>   schedule() <==================================
>   signal_pending_state()
>
>   clear_task_blocked_on()
>   __set_current_state(TASK_RUNNING)
>
>   ... /* return with -EINTR */
>
>
> Basically, a blocked donor has to wait for a mutex_unlock() before it
> can go process the signal and bail out on the mutex_lock_interruptible(),
> which seems counterproductive - but it is still okay from a correctness
> perspective.
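[ Illustration only, not from either tree: the donor side in the diagram
  is a caller like the sketch below (foo() is hypothetical), and the
  -EINTR return at the bottom of the diagram is what ends up waiting for
  a full mutex_unlock() cycle: ]

	int foo(struct mutex *m)
	{
		int ret;

		ret = mutex_lock_interruptible(m);
		if (ret)
			return ret;	/* -EINTR after a signal; this
					 * return is what gets delayed */
		/* ... critical section ... */
		mutex_unlock(m);
		return 0;
	}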
McKenney" , Metin Kaya , Xuewen Yan , Thomas Gleixner , Daniel Lezcano , Suleiman Souhlal , kuyo chang , hupu , kernel-team@android.com Subject: Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution Message-ID: <20260402155055.GV3738010@noisy.programming.kicks-ass.net> References: <20260324191337.1841376-1-jstultz@google.com> <36e96f87-a682-436e-aefc-13e2e5810019@amd.com> <20260327114844.GQ2872@noisy.programming.kicks-ass.net> <33e60181-1809-44e1-bc4c-8ac7f79d49d6@amd.com> <20260327160017.GK3738010@noisy.programming.kicks-ass.net> <1515d405-62fc-4952-842f-b69e2bf192c0@amd.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1515d405-62fc-4952-842f-b69e2bf192c0@amd.com> On Fri, Mar 27, 2026 at 10:27:08PM +0530, K Prateek Nayak wrote: > So taking a step back, this is what we have today (at least the > common scenario): > > CPU0 (donor - A) CPU1 (owner - B) > ================ ================ > > mutex_lock() > __set_current_state(TASK_INTERRUPTIBLE) > __set_task_blocked_on(M) > schedule() > /* Retained for proxy */ > proxy_migrate_task() > ==================================> /* Migrates to CPU1 */ > ... > send_sig(B) > signal_wake_up_state() > wake_up_state() > try_to_wake_up() > ttwu_runnable() > ttwu_do_wakeup() =============> /* A->__state = TASK_RUNNING */ > > /* > * After this point ttwu_state_match() > * will fail for A so a mutex_unlock() > * will have to go through __schedule() > * for return migration. > */ > > __schedule() > find_proxy_task() > > /* Scenario 1 - B sleeps */ > __clear_task_blocked_on() > proxy_deactivate(A) > /* A->__state == TASK_RUNNING */ > /* fallthrough */ > > /* Scenario 2 - return migration after unlock() */ > __clear_task_blocked_on() > /* > * At this point proxy stops. > * Much later after signal. > */ > proxy_force_return() > schedule() <================================== > signal_pending_state() > > clear_task_blocked_on() > __set_current_state(TASK_RUNNING) > > ... /* return with -EINR */ > > > Basically, a blocked donor has to wait for a mutex_unlock() before it > can go process the signal and bail out on the mutex_lock_interruptible() > which seems counter productive - but it is still okay from correctness > perspective. > > > > > One thing you *can* do it frob ttwu_runnable() to 'refuse' to wake the > > task, and then it goes into the normal path and will do the migration. > > I've done things like that before. > > > > Does that fix all the return-migration cases? > > Yes it does! If we handle the return via ttwu_runnable(), which is what > proxy_needs_return() in the next chunk of changes aims to do and we can > build the invariant that TASK_RUNNING + task_is_blocked() is an illegal > state outside of __schedule() which works well with ttwu_state_match(). > > > > >> 2. Why does proxy_needs_return() (this comes later in John's tree but I > >> moved it up ahead) need the proxy_task_runnable_but_waking() override > >> of the ttwu_state_mach() machinery? > >> (https://github.com/johnstultz-work/linux-dev/commit/28ad4d3fa847b90713ca18a623d1ee7f73b648d9) > > > > Since it comes later, I've not seen it and not given it thought ;-) > > > > (I mean, I've probably seen it at some point, but being the gold-fish > > that I am, I have no recollection, so I might as well not have seen it). > > > > A brief look now makes me confused. 
So, I've not gone through all the cases yet, and it is *COMPLETELY*
untested, but something like the below perhaps?

---
 include/linux/sched.h |    2 
 kernel/sched/core.c   |  173 ++++++++++++++++----------------------------------
 2 files changed, 58 insertions(+), 117 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -161,7 +161,7 @@ struct user_event_mm;
  */
 #define is_special_task_state(state)				\
 	((state) & (__TASK_STOPPED | __TASK_TRACED | TASK_PARKED |	\
-		    TASK_DEAD | TASK_FROZEN))
+		    TASK_DEAD | TASK_WAKING | TASK_FROZEN))
 
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 # define debug_normal_state_change(state_value)			\
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2160,8 +2160,29 @@ void deactivate_task(struct rq *rq, stru
 	dequeue_task(rq, p, flags);
 }
 
-static void block_task(struct rq *rq, struct task_struct *p, int flags)
+static bool block_task(struct rq *rq, struct task_struct *p, unsigned long task_state)
 {
+	int flags = DEQUEUE_NOCLOCK;
+
+	p->sched_contributes_to_load =
+		(task_state & TASK_UNINTERRUPTIBLE) &&
+		!(task_state & TASK_NOLOAD) &&
+		!(task_state & TASK_FROZEN);
+
+	if (unlikely(is_special_task_state(task_state)))
+		flags |= DEQUEUE_SPECIAL;
+
+	/*
+	 * __schedule()			ttwu()
+	 *   prev_state = prev->state;	  if (p->on_rq && ...)
+	 *   if (prev_state)		    goto out;
+	 *     p->on_rq = 0;		  smp_acquire__after_ctrl_dep();
+	 *				  p->state = TASK_WAKING
+	 *
+	 * Where __schedule() and ttwu() have matching control dependencies.
+	 *
+	 * After this, schedule() must not care about p->state any more.
+	 */
 	if (dequeue_task(rq, p, DEQUEUE_SLEEP | flags))
 		__block_task(rq, p);
 }
 
@@ -3702,28 +3723,39 @@ ttwu_do_activate(struct rq *rq, struct t
  */
 static int ttwu_runnable(struct task_struct *p, int wake_flags)
 {
-	struct rq_flags rf;
-	struct rq *rq;
-	int ret = 0;
+	ACQUIRE(__task_rq_lock, guard)(p);
+	struct rq *rq = guard.rq;
 
-	rq = __task_rq_lock(p, &rf);
-	if (task_on_rq_queued(p)) {
-		update_rq_clock(rq);
-		if (p->se.sched_delayed)
-			enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
-		if (!task_on_cpu(rq, p)) {
+	if (!task_on_rq_queued(p))
+		return 0;
+
+	if (sched_proxy_exec() && p->blocked_on) {
+		guard(raw_spinlock)(&p->blocked_lock);
+		struct mutex *lock = p->blocked_on;
+		if (lock) {
 			/*
-			 * When on_rq && !on_cpu the task is preempted, see if
-			 * it should preempt the task that is current now.
+			 * TASK_WAKING is a special state and results in
+			 * DEQUEUE_SPECIAL such that the task will actually be
+			 * forced from the runqueue.
 			 */
-			wakeup_preempt(rq, p, wake_flags);
+			block_task(rq, p, TASK_WAKING);
+			p->blocked_on = NULL;
+			return 0;
 		}
-		ttwu_do_wakeup(p);
-		ret = 1;
 	}
-	__task_rq_unlock(rq, p, &rf);
-	return ret;
+	update_rq_clock(rq);
+	if (p->se.sched_delayed)
+		enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
+	if (!task_on_cpu(rq, p)) {
+		/*
+		 * When on_rq && !on_cpu the task is preempted, see if
+		 * it should preempt the task that is current now.
+		 */
+		wakeup_preempt(rq, p, wake_flags);
+	}
+	ttwu_do_wakeup(p);
+	return 1;
 }
 
 void sched_ttwu_pending(void *arg)
@@ -6519,7 +6551,6 @@ static bool try_to_block_task(struct rq
 				unsigned long *task_state_p, bool should_block)
 {
 	unsigned long task_state = *task_state_p;
-	int flags = DEQUEUE_NOCLOCK;
 
 	if (signal_pending_state(task_state, p)) {
 		WRITE_ONCE(p->__state, TASK_RUNNING);
@@ -6539,26 +6570,7 @@ static bool try_to_block_task(struct rq
 	if (!should_block)
 		return false;
 
-	p->sched_contributes_to_load =
-		(task_state & TASK_UNINTERRUPTIBLE) &&
-		!(task_state & TASK_NOLOAD) &&
-		!(task_state & TASK_FROZEN);
-
-	if (unlikely(is_special_task_state(task_state)))
-		flags |= DEQUEUE_SPECIAL;
-
-	/*
-	 * __schedule()			ttwu()
-	 *   prev_state = prev->state;	  if (p->on_rq && ...)
-	 *   if (prev_state)		    goto out;
-	 *     p->on_rq = 0;		  smp_acquire__after_ctrl_dep();
-	 *				  p->state = TASK_WAKING
-	 *
-	 * Where __schedule() and ttwu() have matching control dependencies.
-	 *
-	 * After this, schedule() must not care about p->state any more.
-	 */
-	block_task(rq, p, flags);
+	block_task(rq, p, task_state);
 
 	return true;
 }
@@ -6586,13 +6598,12 @@ static inline struct task_struct *proxy_
 	return rq->idle;
 }
 
-static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
+static void proxy_deactivate(struct rq *rq, struct task_struct *donor)
 {
 	unsigned long state = READ_ONCE(donor->__state);
 
-	/* Don't deactivate if the state has been changed to TASK_RUNNING */
-	if (state == TASK_RUNNING)
-		return false;
+	WARN_ON_ONCE(state == TASK_RUNNING);
+
 	/*
 	 * Because we got donor from pick_next_task(), it is *crucial*
@@ -6603,7 +6614,7 @@ static bool proxy_deactivate(struct rq *
 	 * need to be changed from next *before* we deactivate.
 	 */
 	proxy_resched_idle(rq);
-	return try_to_block_task(rq, donor, &state, true);
+	block_task(rq, donor, state);
 }
 
 static inline void proxy_release_rq_lock(struct rq *rq, struct rq_flags *rf)
@@ -6677,71 +6688,6 @@ static void proxy_migrate_task(struct rq
 	proxy_reacquire_rq_lock(rq, rf);
 }
 
-static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
-			       struct task_struct *p)
-	__must_hold(__rq_lockp(rq))
-{
-	struct rq *task_rq, *target_rq = NULL;
-	int cpu, wake_flag = WF_TTWU;
-
-	lockdep_assert_rq_held(rq);
-	WARN_ON(p == rq->curr);
-
-	if (p == rq->donor)
-		proxy_resched_idle(rq);
-
-	proxy_release_rq_lock(rq, rf);
-	/*
-	 * We drop the rq lock, and re-grab task_rq_lock to get
-	 * the pi_lock (needed for select_task_rq) as well.
-	 */
-	scoped_guard (task_rq_lock, p) {
-		task_rq = scope.rq;
-
-		/*
-		 * Since we let go of the rq lock, the task may have been
-		 * woken or migrated to another rq before we got the
-		 * task_rq_lock. So re-check we're on the same RQ. If
-		 * not, the task has already been migrated and that CPU
-		 * will handle any futher migrations.
-		 */
-		if (task_rq != rq)
-			break;
-
-		/*
-		 * Similarly, if we've been dequeued, someone else will
-		 * wake us
-		 */
-		if (!task_on_rq_queued(p))
-			break;
-
-		/*
-		 * Since we should only be calling here from __schedule()
-		 * -> find_proxy_task(), no one else should have
-		 * assigned current out from under us. But check and warn
-		 * if we see this, then bail.
-		 */
-		if (task_current(task_rq, p) || task_on_cpu(task_rq, p)) {
-			WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d on_cpu: %i\n",
-				  __func__, cpu_of(task_rq),
-				  p->comm, p->pid, p->on_cpu);
-			break;
-		}
-
-		update_rq_clock(task_rq);
-		deactivate_task(task_rq, p, DEQUEUE_NOCLOCK);
-		cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
-		set_task_cpu(p, cpu);
-		target_rq = cpu_rq(cpu);
-		clear_task_blocked_on(p, NULL);
-	}
-
-	if (target_rq)
-		attach_one_task(target_rq, p);
-
-	proxy_reacquire_rq_lock(rq, rf);
-}
-
 /*
  * Find runnable lock owner to proxy for mutex blocked donor
  *
@@ -6777,7 +6723,7 @@ find_proxy_task(struct rq *rq, struct ta
 			clear_task_blocked_on(p, PROXY_WAKING);
 			return p;
 		}
-		goto force_return;
+		goto deactivate;
 	}
 
 	/*
@@ -6812,7 +6758,7 @@ find_proxy_task(struct rq *rq, struct ta
 			__clear_task_blocked_on(p, NULL);
 			return p;
 		}
-		goto force_return;
+		goto deactivate;
 	}
 
 	if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
@@ -6891,12 +6837,7 @@ find_proxy_task(struct rq *rq, struct ta
 	return owner;
 
 deactivate:
-	if (proxy_deactivate(rq, donor))
-		return NULL;
-	/* If deactivate fails, force return */
-	p = donor;
-force_return:
-	proxy_force_return(rq, rf, p);
+	proxy_deactivate(rq, donor);
 	return NULL;
 migrate_task:
 	proxy_migrate_task(rq, rf, p, owner_cpu);
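[ For reference, the resulting block_task() call pattern - a sketch of
  the three call sites after the above, not additional patch content: ]

	block_task(rq, p, task_state);	/* try_to_block_task(): normal sleep */
	block_task(rq, donor, state);	/* proxy_deactivate(): park the donor */
	block_task(rq, p, TASK_WAKING);	/* ttwu_runnable(): TASK_WAKING is now
					 * a special state, so DEQUEUE_SPECIAL
					 * forces the task off the rq for
					 * return migration */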