public inbox for linux-kernel@vger.kernel.org
* [PATCH v28 0/8] Optimized Donor Migration for Proxy Execution
@ 2026-04-22 23:06 John Stultz
  2026-04-22 23:06 ` [PATCH v28 1/8] sched: Rework pick_next_task() and prev_balance() to avoid stale prev references John Stultz
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: John Stultz @ 2026-04-22 23:06 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hey All,

This is just the next step for Proxy Execution, allowing
proxy-return-migration to occur from the try_to_wake_up()
callpath. This however introduces some complexity as it means
the rq->donor can change from a try_to_wake_up() call, and thus
there are further changes needed to make this safe.

As always, I’m trying to submit this larger work in smallish
digestible pieces, so in this portion of the series, I’m only
submitting for review and consideration these patches which
minimize the use of passing “prev” (ie: the rq->donor) down
through the class schedulers since this cached value can become
invalid across runqueue lock drops. From there, there are some
fixups to the deadline class scheduler to avoid issues when the
donor changes, then the actual patch to do return migration
from ttwu(), followed by some optimization patches so we
migrate the entire task chain in proxy_migrate_task().

While there’s not a whole lot new in this revision, I still want
to give a big thanks to K Prateek for his feedback and
suggestions on the new changes.

New in this iteration:
* Dropped the patch adding a rq->donor->sched_class->yield_task
  check, as with changes from K Prateek we shouldn’t temporarily
  run with idle as donor anymore.

* Folded in change suggested by K Prateek to introduce
  proxy_reset_donor() to reset the donor to current task when
  the donor is woken up.

* Dropped an unnecessary PROXY_WAKING assignment, which was
  noted by K Prateek.

I’d love to get further feedback on any place where these
patches are confusing, or could use additional clarifications.

Additionally I’d appreciate any testing or comments that folks
have with the full Proxy Execution series! You can find the full
Proxy Exec series here:
  https://github.com/johnstultz-work/linux-dev/tree/proxy-exec-v28-7.1-rc0
  https://github.com/johnstultz-work/linux-dev.git proxy-exec-v28-7.1-rc0

New changes to the full patch series in this revision include:
* I found that in the sleeping owner handling patch, we end up
  calling put_task() from __proxy_remove_from_sleeping_owner(),
  which might free the owner while we are still holding the
  owner's blocked_lock. So be sure to get_task()/put_task()
  around the owner usage to ensure we don't prematurely free the
  task.

* Similarly, I also re-ordered the put_task/unlock lines in
  activate_blocked_waiters() to avoid the same problem.

Issues still to address with the full series:
* Continue stress testing with the recent ttwu rework to ensure
  the whole stack is working as intended (so far so good).

* Testing to double check I sorted the rwsem collisions w/
  7.1-rc properly

* Continue working to get sched_ext to be ok with Proxy
  Execution enabled.

* Reevaluate performance regression K Prateek Nayak found with
  the full series.

* The chain migration functionality needs further iterations and
  better validation to ensure it truly maintains the RT/DL load
  balancing invariants (despite this being broken in vanilla
  upstream with RT_PUSH_IPI currently)

Future work:
* Expand to more locking primitives: Figuring out pi-futexes
  would be good, using proxy for Binder PI is something else
  we’re exploring.

* Eventually: Work to replace rt_mutexes and get things happy
  with PREEMPT_RT


Credit/Disclaimer:
---------------------
As always, this Proxy Execution series has a long history with
lots of developers that deserve credit: 

First described in a paper[1] by Watkins, Straub, and Niehaus,
then developed from patches by Peter Zijlstra, and extended with
lots of work by Juri Lelli, Valentin Schneider, and Connor
O'Brien (thank you to Steven Rostedt for providing additional
details here!).
Thanks also to Joel Fernandes, Dietmar Eggemann, Metin Kaya,
K Prateek Nayak and Suleiman Souhlal for their substantial
review, suggestions, and patch contributions.

So again, many thanks to those above, as all the credit for this
series really is due to them - while the mistakes are surely mine.

Thanks so much!
-john

[1] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com

John Stultz (7):
  sched: Rework pick_next_task() and prev_balance() to avoid stale prev
    references
  sched: deadline: Add some helper variables to cleanup deadline logic
  sched: deadline: Add dl_rq->curr pointer to address issues with Proxy
    Exec
  sched: Rework block_task so it can be directly called
  sched: Have try_to_wake_up() handle return-migration for PROXY_WAKING
    case
  sched: Break out core of attach_tasks() helper into sched.h
  sched: Migrate whole chain in proxy_migrate_task()

Peter Zijlstra (1):
  sched: Add blocked_donor link to task for smarter mutex handoffs

 include/linux/sched.h    |   3 +-
 init/init_task.c         |   1 +
 kernel/fork.c            |   1 +
 kernel/locking/mutex.c   |  43 +++++-
 kernel/sched/core.c      | 307 +++++++++++++++++++++------------------
 kernel/sched/deadline.c  |  44 ++++--
 kernel/sched/fair.c      |  25 +---
 kernel/sched/idle.c      |   2 +-
 kernel/sched/rt.c        |   8 +-
 kernel/sched/sched.h     |  30 +++-
 kernel/sched/stop_task.c |   2 +-
 11 files changed, 280 insertions(+), 186 deletions(-)

-- 
2.54.0.rc2.533.g4f5dca5207-goog


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v28 1/8] sched: Rework pick_next_task() and prev_balance() to avoid stale prev references
  2026-04-22 23:06 [PATCH v28 0/8] Optimized Donor Migration for Proxy Execution John Stultz
@ 2026-04-22 23:06 ` John Stultz
  2026-04-22 23:06 ` [PATCH v28 2/8] sched: deadline: Add some helper variables to cleanup deadline logic John Stultz
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: John Stultz @ 2026-04-22 23:06 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

Historically, the prev value from __schedule() was the rq->curr.
This prev value is passed down through numerous functions, and
used in the class scheduler implementations. The fact that
prev was on_cpu until the end of __schedule() meant it was
stable across the rq lock drops that the class->pick_next_task()
and ->balance() implementations often do.

However, with proxy-exec, the prev passed to functions called
by __schedule() is rq->donor, which may not be the same as
rq->curr and may not be on_cpu. This makes the prev value
potentially unstable across rq lock drops.

A recently found issue with proxy-exec is that when we begin
doing return migration from try_to_wake_up(), it's possible we
may be waking up the rq->donor. When we do this, we call
proxy_resched_idle(), which uses put_prev_set_next() to set
rq->donor to rq->idle, allowing the old donor to be return
migrated and allowed to run.

This however runs into trouble, as on another cpu we might be in
the middle of calling __schedule(). Conceptually the rq lock is
held for the majority of the time, but in calling pick_next_task()
it's possible the class->pick_next_task() handler or the
->balance() call may briefly drop the rq lock. This opens a
window for try_to_wake_up() to wake and return migrate the
rq->donor before the class logic reacquires the rq lock.

Unfortunately pick_next_task() and prev_balance() take a prev
argument, to which we pass rq->donor. However, this prev value
can now become stale and incorrect across a rq lock drop.

So, to correct this, rework the pick_next_task() and
prev_balance() calls so that they do not take a "prev" argument.

Also rework the class ->pick_next_task() and ->balance()
implementations to drop the prev argument. In the cases where
it was used, have the class functions reference rq->donor
directly and avoid saving the value across rq lock drops, so
that we don't end up with stale references.

Signed-off-by: John Stultz <jstultz@google.com>
---
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/sched/core.c      | 37 ++++++++++++++++++-------------------
 kernel/sched/deadline.c  |  8 +++++++-
 kernel/sched/fair.c      |  9 +++++++--
 kernel/sched/idle.c      |  2 +-
 kernel/sched/rt.c        |  8 +++++++-
 kernel/sched/sched.h     | 10 ++++------
 kernel/sched/stop_task.c |  2 +-
 7 files changed, 45 insertions(+), 31 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index da20fb6ea25ae..3ac6dd4d3c587 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5971,10 +5971,9 @@ static inline void schedule_debug(struct task_struct *prev, bool preempt)
 	schedstat_inc(this_rq()->sched_count);
 }
 
-static void prev_balance(struct rq *rq, struct task_struct *prev,
-			 struct rq_flags *rf)
+static void prev_balance(struct rq *rq, struct rq_flags *rf)
 {
-	const struct sched_class *start_class = prev->sched_class;
+	const struct sched_class *start_class = rq->donor->sched_class;
 	const struct sched_class *class;
 
 	/*
@@ -5986,7 +5985,7 @@ static void prev_balance(struct rq *rq, struct task_struct *prev,
 	 * a runnable task of @class priority or higher.
 	 */
 	for_active_class_range(class, start_class, &idle_sched_class) {
-		if (class->balance && class->balance(rq, prev, rf))
+		if (class->balance && class->balance(rq, rf))
 			break;
 	}
 }
@@ -5995,7 +5994,7 @@ static void prev_balance(struct rq *rq, struct task_struct *prev,
  * Pick up the highest-prio task:
  */
 static inline struct task_struct *
-__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+__pick_next_task(struct rq *rq, struct rq_flags *rf)
 	__must_hold(__rq_lockp(rq))
 {
 	const struct sched_class *class;
@@ -6012,28 +6011,28 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	 * higher scheduling class, because otherwise those lose the
 	 * opportunity to pull in more work from other CPUs.
 	 */
-	if (likely(!sched_class_above(prev->sched_class, &fair_sched_class) &&
+	if (likely(!sched_class_above(rq->donor->sched_class, &fair_sched_class) &&
 		   rq->nr_running == rq->cfs.h_nr_queued)) {
 
-		p = pick_next_task_fair(rq, prev, rf);
+		p = pick_next_task_fair(rq, rf);
 		if (unlikely(p == RETRY_TASK))
 			goto restart;
 
 		/* Assume the next prioritized class is idle_sched_class */
 		if (!p) {
 			p = pick_task_idle(rq, rf);
-			put_prev_set_next_task(rq, prev, p);
+			put_prev_set_next_task(rq, rq->donor, p);
 		}
 
 		return p;
 	}
 
 restart:
-	prev_balance(rq, prev, rf);
+	prev_balance(rq, rf);
 
 	for_each_active_class(class) {
 		if (class->pick_next_task) {
-			p = class->pick_next_task(rq, prev, rf);
+			p = class->pick_next_task(rq, rf);
 			if (unlikely(p == RETRY_TASK))
 				goto restart;
 			if (p)
@@ -6043,7 +6042,7 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			if (unlikely(p == RETRY_TASK))
 				goto restart;
 			if (p) {
-				put_prev_set_next_task(rq, prev, p);
+				put_prev_set_next_task(rq, rq->donor, p);
 				return p;
 			}
 		}
@@ -6096,7 +6095,7 @@ extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_f
 static void queue_core_balance(struct rq *rq);
 
 static struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+pick_next_task(struct rq *rq, struct rq_flags *rf)
 	__must_hold(__rq_lockp(rq))
 {
 	struct task_struct *next, *p, *max;
@@ -6109,7 +6108,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	bool need_sync;
 
 	if (!sched_core_enabled(rq))
-		return __pick_next_task(rq, prev, rf);
+		return __pick_next_task(rq, rf);
 
 	cpu = cpu_of(rq);
 
@@ -6122,7 +6121,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		 */
 		rq->core_pick = NULL;
 		rq->core_dl_server = NULL;
-		return __pick_next_task(rq, prev, rf);
+		return __pick_next_task(rq, rf);
 	}
 
 	/*
@@ -6146,7 +6145,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		goto out_set_next;
 	}
 
-	prev_balance(rq, prev, rf);
+	prev_balance(rq, rf);
 
 	smt_mask = cpu_smt_mask(cpu);
 	need_sync = !!rq->core->core_cookie;
@@ -6328,7 +6327,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	}
 
 out_set_next:
-	put_prev_set_next_task(rq, prev, next);
+	put_prev_set_next_task(rq, rq->donor, next);
 	if (rq->core->core_forceidle_count && next == rq->idle)
 		queue_core_balance(rq);
 
@@ -6551,10 +6550,10 @@ static inline void sched_core_cpu_deactivate(unsigned int cpu) {}
 static inline void sched_core_cpu_dying(unsigned int cpu) {}
 
 static struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+pick_next_task(struct rq *rq, struct rq_flags *rf)
 	__must_hold(__rq_lockp(rq))
 {
-	return __pick_next_task(rq, prev, rf);
+	return __pick_next_task(rq, rf);
 }
 
 #endif /* !CONFIG_SCHED_CORE */
@@ -7101,7 +7100,7 @@ static void __sched notrace __schedule(int sched_mode)
 
 pick_again:
 	assert_balance_callbacks_empty(rq);
-	next = pick_next_task(rq, rq->donor, &rf);
+	next = pick_next_task(rq, &rf);
 	rq->next_class = next->sched_class;
 	if (sched_proxy_exec()) {
 		struct task_struct *prev_donor = rq->donor;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index edca7849b165d..f07a888314450 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2506,8 +2506,14 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
 	resched_curr(rq);
 }
 
-static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
+static int balance_dl(struct rq *rq, struct rq_flags *rf)
 {
+	/*
+	 * Note, rq->donor may change during rq lock drops,
+	 * so don't re-use prev across lock drops
+	 */
+	struct task_struct *p = rq->donor;
+
 	if (!on_dl_rq(&p->dl) && need_pull_dl_task(rq, p)) {
 		/*
 		 * This is OK, because current is on_cpu, which avoids it being
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 69361c63353ad..b843f9a876d6d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9195,14 +9195,19 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
 static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first);
 
 struct task_struct *
-pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+pick_next_task_fair(struct rq *rq, struct rq_flags *rf)
 	__must_hold(__rq_lockp(rq))
 {
 	struct sched_entity *se;
-	struct task_struct *p;
+	struct task_struct *p, *prev;
 	int new_tasks;
 
 again:
+	/*
+	 * Re-read rq->donor at the top as it may have
+	 * changed across a rq lock drop
+	 */
+	prev = rq->donor;
 	p = pick_task_fair(rq, rf);
 	if (!p)
 		goto idle;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index a83be0c834ddb..ff39120d723a9 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -462,7 +462,7 @@ select_task_rq_idle(struct task_struct *p, int cpu, int flags)
 }
 
 static int
-balance_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+balance_idle(struct rq *rq, struct rq_flags *rf)
 {
 	return WARN_ON_ONCE(1);
 }
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 4ee8faf01441a..3c5f37c858b60 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1596,8 +1596,14 @@ static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
 	resched_curr(rq);
 }
 
-static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
+static int balance_rt(struct rq *rq, struct rq_flags *rf)
 {
+	/*
+	 * Note, rq->donor may change during rq lock drops,
+	 * so don't re-use p across lock drops
+	 */
+	struct task_struct *p = rq->donor;
+
 	if (!on_rt_rq(&p->rt) && need_pull_rt_task(rq, p)) {
 		/*
 		 * This is OK, because current is on_cpu, which avoids it being
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9f63b15d309d1..2b3a97735efeb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2561,7 +2561,7 @@ struct sched_class {
 	/*
 	 * schedule/pick_next_task/prev_balance: rq->lock
 	 */
-	int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
+	int (*balance)(struct rq *rq, struct rq_flags *rf);
 
 	/*
 	 * schedule/pick_next_task: rq->lock
@@ -2572,12 +2572,11 @@ struct sched_class {
 	 *
 	 *   next = pick_task();
 	 *   if (next) {
-	 *       put_prev_task(prev);
+	 *       put_prev_task(rq->donor);
 	 *       set_next_task_first(next);
 	 *   }
 	 */
-	struct task_struct *(*pick_next_task)(struct rq *rq, struct task_struct *prev,
-					      struct rq_flags *rf);
+	struct task_struct *(*pick_next_task)(struct rq *rq, struct rq_flags *rf);
 
 	/*
 	 * sched_change:
@@ -2801,8 +2800,7 @@ static inline bool sched_fair_runnable(struct rq *rq)
 	return rq->cfs.nr_queued > 0;
 }
 
-extern struct task_struct *pick_next_task_fair(struct rq *rq, struct task_struct *prev,
-					       struct rq_flags *rf);
+extern struct task_struct *pick_next_task_fair(struct rq *rq, struct rq_flags *rf);
 extern struct task_struct *pick_task_idle(struct rq *rq, struct rq_flags *rf);
 
 #define SCA_CHECK		0x01
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index f95798baddebb..c909ca0d8c87c 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -16,7 +16,7 @@ select_task_rq_stop(struct task_struct *p, int cpu, int flags)
 }
 
 static int
-balance_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+balance_stop(struct rq *rq, struct rq_flags *rf)
 {
 	return sched_stop_runnable(rq);
 }
-- 
2.54.0.rc2.533.g4f5dca5207-goog



* [PATCH v28 2/8] sched: deadline: Add some helper variables to cleanup deadline logic
  2026-04-22 23:06 [PATCH v28 0/8] Optimized Donor Migration for Proxy Execution John Stultz
  2026-04-22 23:06 ` [PATCH v28 1/8] sched: Rework pick_next_task() and prev_balance() to avoid stale prev references John Stultz
@ 2026-04-22 23:06 ` John Stultz
  2026-04-22 23:06 ` [PATCH v28 3/8] sched: deadline: Add dl_rq->curr pointer to address issues with Proxy Exec John Stultz
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: John Stultz @ 2026-04-22 23:06 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Peter Zijlstra, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

As part of an improvement to handling pushable deadline tasks,
Peter suggested this cleanup[1]: use helper variables for the
dl_entity and dl_rq in the enqueue_task_dl() and
put_prev_task_dl() functions. There should be no functional
change from this patch.

To make sure this cleanup change doesn't obscure later logic
changes, I've split it into its own patch.

[1]: https://lore.kernel.org/lkml/20260304095123.GP606826@noisy.programming.kicks-ass.net/

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
---
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/sched/deadline.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index f07a888314450..4ff3e164d9880 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2292,7 +2292,10 @@ static void dequeue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 
 static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
-	if (is_dl_boosted(&p->dl)) {
+	struct sched_dl_entity *dl_se = &p->dl;
+	struct dl_rq *dl_rq = &rq->dl;
+
+	if (is_dl_boosted(dl_se)) {
 		/*
 		 * Because of delays in the detection of the overrun of a
 		 * thread's runtime, it might be the case that a thread
@@ -2305,14 +2308,14 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 		 *
 		 * In this case, the boost overrides the throttle.
 		 */
-		if (p->dl.dl_throttled) {
+		if (dl_se->dl_throttled) {
 			/*
 			 * The replenish timer needs to be canceled. No
 			 * problem if it fires concurrently: boosted threads
 			 * are ignored in dl_task_timer().
 			 */
-			cancel_replenish_timer(&p->dl);
-			p->dl.dl_throttled = 0;
+			cancel_replenish_timer(dl_se);
+			dl_se->dl_throttled = 0;
 		}
 	} else if (!dl_prio(p->normal_prio)) {
 		/*
@@ -2324,7 +2327,7 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 		 * being boosted again with no means to replenish the runtime and clear
 		 * the throttle.
 		 */
-		p->dl.dl_throttled = 0;
+		dl_se->dl_throttled = 0;
 		if (!(flags & ENQUEUE_REPLENISH))
 			printk_deferred_once("sched: DL de-boosted task PID %d: REPLENISH flag missing\n",
 					     task_pid_nr(p));
@@ -2333,14 +2336,14 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 	}
 
 	check_schedstat_required();
-	update_stats_wait_start_dl(dl_rq_of_se(&p->dl), &p->dl);
+	update_stats_wait_start_dl(dl_rq, dl_se);
 
 	if (p->on_rq == TASK_ON_RQ_MIGRATING)
 		flags |= ENQUEUE_MIGRATING;
 
-	enqueue_dl_entity(&p->dl, flags);
+	enqueue_dl_entity(dl_se, flags);
 
-	if (dl_server(&p->dl))
+	if (dl_server(dl_se))
 		return;
 
 	if (task_is_blocked(p))
@@ -2643,7 +2646,7 @@ static void put_prev_task_dl(struct rq *rq, struct task_struct *p, struct task_s
 	struct sched_dl_entity *dl_se = &p->dl;
 	struct dl_rq *dl_rq = &rq->dl;
 
-	if (on_dl_rq(&p->dl))
+	if (on_dl_rq(dl_se))
 		update_stats_wait_start_dl(dl_rq, dl_se);
 
 	update_curr_dl(rq);
@@ -2653,7 +2656,7 @@ static void put_prev_task_dl(struct rq *rq, struct task_struct *p, struct task_s
 	if (task_is_blocked(p))
 		return;
 
-	if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)
+	if (on_dl_rq(dl_se) && p->nr_cpus_allowed > 1)
 		enqueue_pushable_dl_task(rq, p);
 }
 
-- 
2.54.0.rc2.533.g4f5dca5207-goog



* [PATCH v28 3/8] sched: deadline: Add dl_rq->curr pointer to address issues with Proxy Exec
  2026-04-22 23:06 [PATCH v28 0/8] Optimized Donor Migration for Proxy Execution John Stultz
  2026-04-22 23:06 ` [PATCH v28 1/8] sched: Rework pick_next_task() and prev_balance() to avoid stale prev references John Stultz
  2026-04-22 23:06 ` [PATCH v28 2/8] sched: deadline: Add some helper variables to cleanup deadline logic John Stultz
@ 2026-04-22 23:06 ` John Stultz
  2026-04-22 23:06 ` [PATCH v28 4/8] sched: Rework block_task so it can be directly called John Stultz
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: John Stultz @ 2026-04-22 23:06 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Peter Zijlstra, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

The DL scheduler keeps the current task in the rbtree, since
the deadline value isn't usually changed while the task is
runnable.

This results in set_next_task() and put_prev_task() being
simpler, but unfortunately this causes complexity elsewhere.

Specifically when update_curr_dl() updates the deadline, it has
to dequeue and then enqueue the task.

From put_prev_task_dl(), we first call update_curr_dl(), and
then call enqueue_pushable_dl_task().

However, with Proxy Exec this goes awry, since when a mutex is
released, we might wake the waiting rq->donor. This will cause
put_prev_task() to be called on the donor to take it off the
cpu for return migration.

At that point, from put_prev_task_dl() the update_curr_dl()
logic will dequeue & enqueue the task, and the enqueue function
will call enqueue_pushable_dl_task() (since the task_current()
check won't prevent it). Then back up the callstack in
put_prev_task_dl() we'll end up calling
enqueue_pushable_dl_task() again, tripping the
!RB_EMPTY_NODE(&p->pushable_dl_tasks) warning.

So to avoid this, use Peter's suggested[1] approach, and add a
dl_rq->curr pointer that is set/cleared from set_next_task()/
put_prev_task(), which effectively tracks the rq->donor. We can
then use this to avoid adding the active donor to the pushable
list from enqueue_task_dl().

[1]: https://lore.kernel.org/lkml/20260304095123.GP606826@noisy.programming.kicks-ass.net/

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
---
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/sched/deadline.c | 13 +++++++++++++
 kernel/sched/sched.h    |  1 +
 2 files changed, 14 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 4ff3e164d9880..df56406b12766 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2349,6 +2349,9 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 	if (task_is_blocked(p))
 		return;
 
+	if (dl_rq->curr == dl_se)
+		return;
+
 	if (!task_current(rq, p) && !p->dl.dl_throttled && p->nr_cpus_allowed > 1)
 		enqueue_pushable_dl_task(rq, p);
 }
@@ -2571,6 +2574,10 @@ static void start_hrtick_dl(struct rq *rq, struct sched_dl_entity *dl_se)
 }
 #endif /* !CONFIG_SCHED_HRTICK */
 
+/*
+ * DL keeps current in tree, because ->deadline is not typically changed while
+ * a task is runnable.
+ */
 static void set_next_task_dl(struct rq *rq, struct task_struct *p, bool first)
 {
 	struct sched_dl_entity *dl_se = &p->dl;
@@ -2583,6 +2590,9 @@ static void set_next_task_dl(struct rq *rq, struct task_struct *p, bool first)
 	/* You can't push away the running task */
 	dequeue_pushable_dl_task(rq, p);
 
+	WARN_ON_ONCE(dl_rq->curr);
+	dl_rq->curr = dl_se;
+
 	if (!first)
 		return;
 
@@ -2653,6 +2663,9 @@ static void put_prev_task_dl(struct rq *rq, struct task_struct *p, struct task_s
 
 	update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 1);
 
+	WARN_ON_ONCE(dl_rq->curr != dl_se);
+	dl_rq->curr = NULL;
+
 	if (task_is_blocked(p))
 		return;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2b3a97735efeb..d0f3e164a61d2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -889,6 +889,7 @@ struct dl_rq {
 
 	bool			overloaded;
 
+	struct sched_dl_entity	*curr;
 	/*
 	 * Tasks on this rq that can be pushed away. They are kept in
 	 * an rb-tree, ordered by tasks' deadlines, with caching
-- 
2.54.0.rc2.533.g4f5dca5207-goog



* [PATCH v28 4/8] sched: Rework block_task so it can be directly called
  2026-04-22 23:06 [PATCH v28 0/8] Optimized Donor Migration for Proxy Execution John Stultz
                   ` (2 preceding siblings ...)
  2026-04-22 23:06 ` [PATCH v28 3/8] sched: deadline: Add dl_rq->curr pointer to address issues with Proxy Exec John Stultz
@ 2026-04-22 23:06 ` John Stultz
  2026-04-22 23:06 ` [PATCH v28 5/8] sched: Have try_to_wake_up() handle return-migration for PROXY_WAKING case John Stultz
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: John Stultz @ 2026-04-22 23:06 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Peter Zijlstra, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

Pull most of the logic out of try_to_block_task() and put it
into block_task() directly, so that block_task() can be called
without having to worry about the failure cases in try_to_block_task().
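
For reference, the load-accounting test that moves into block_task() can be modeled in isolation (a minimal sketch: the state-bit values mirror include/linux/sched.h at the time of writing, but are hardcoded here purely for illustration):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative copies of the task-state bits from include/linux/sched.h */
#define TASK_UNINTERRUPTIBLE	0x00000002
#define TASK_NOLOAD		0x00000400
#define TASK_FROZEN		0x00008000

/* The p->sched_contributes_to_load computation block_task() now
 * performs up front, before deciding on the dequeue flags. */
static bool contributes_to_load(unsigned long task_state)
{
	return (task_state & TASK_UNINTERRUPTIBLE) &&
	       !(task_state & TASK_NOLOAD) &&
	       !(task_state & TASK_FROZEN);
}
```

In other words, only a plain uninterruptible sleep counts toward loadavg; NOLOAD and frozen sleeps do not, which is why the computation keys off the full task_state rather than a boolean.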

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
---
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/sched/core.c | 45 +++++++++++++++++++++++----------------------
 1 file changed, 23 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3ac6dd4d3c587..942af3b34ffe0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2223,8 +2223,29 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
 	dequeue_task(rq, p, flags);
 }
 
-static void block_task(struct rq *rq, struct task_struct *p, int flags)
+static void block_task(struct rq *rq, struct task_struct *p, unsigned long task_state)
 {
+	int flags = DEQUEUE_NOCLOCK;
+
+	p->sched_contributes_to_load =
+		(task_state & TASK_UNINTERRUPTIBLE) &&
+		!(task_state & TASK_NOLOAD) &&
+		!(task_state & TASK_FROZEN);
+
+	if (unlikely(is_special_task_state(task_state)))
+		flags |= DEQUEUE_SPECIAL;
+
+	/*
+	 * __schedule()			ttwu()
+	 *   prev_state = prev->state;    if (p->on_rq && ...)
+	 *   if (prev_state)		    goto out;
+	 *     p->on_rq = 0;		  smp_acquire__after_ctrl_dep();
+	 *				  p->state = TASK_WAKING
+	 *
+	 * Where __schedule() and ttwu() have matching control dependencies.
+	 *
+	 * After this, schedule() must not care about p->state any more.
+	 */
 	if (dequeue_task(rq, p, DEQUEUE_SLEEP | flags))
 		__block_task(rq, p);
 }
@@ -6581,7 +6602,6 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 			      unsigned long *task_state_p, bool should_block)
 {
 	unsigned long task_state = *task_state_p;
-	int flags = DEQUEUE_NOCLOCK;
 
 	if (signal_pending_state(task_state, p)) {
 		WRITE_ONCE(p->__state, TASK_RUNNING);
@@ -6601,26 +6621,7 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 	if (!should_block)
 		return false;
 
-	p->sched_contributes_to_load =
-		(task_state & TASK_UNINTERRUPTIBLE) &&
-		!(task_state & TASK_NOLOAD) &&
-		!(task_state & TASK_FROZEN);
-
-	if (unlikely(is_special_task_state(task_state)))
-		flags |= DEQUEUE_SPECIAL;
-
-	/*
-	 * __schedule()			ttwu()
-	 *   prev_state = prev->state;    if (p->on_rq && ...)
-	 *   if (prev_state)		    goto out;
-	 *     p->on_rq = 0;		  smp_acquire__after_ctrl_dep();
-	 *				  p->state = TASK_WAKING
-	 *
-	 * Where __schedule() and ttwu() have matching control dependencies.
-	 *
-	 * After this, schedule() must not care about p->state any more.
-	 */
-	block_task(rq, p, flags);
+	block_task(rq, p, task_state);
 	return true;
 }
 
-- 
2.54.0.rc2.533.g4f5dca5207-goog


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v28 5/8] sched: Have try_to_wake_up() handle return-migration for PROXY_WAKING case
  2026-04-22 23:06 [PATCH v28 0/8] Optimized Donor Migration for Proxy Execution John Stultz
                   ` (3 preceding siblings ...)
  2026-04-22 23:06 ` [PATCH v28 4/8] sched: Rework block_task so it can be directly called John Stultz
@ 2026-04-22 23:06 ` John Stultz
  2026-04-22 23:06 ` [PATCH v28 6/8] sched: Add blocked_donor link to task for smarter mutex handoffs John Stultz
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: John Stultz @ 2026-04-22 23:06 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

Add logic so try_to_wake_up() will notice if we are waking a
task where blocked_on == PROXY_WAKING, and if necessary dequeue
the task so the wakeup naturally return-migrates the donor task
back to a cpu it can run on.

This helps performance, as we do the dequeue and wakeup under
the locks already taken in try_to_wake_up(), and avoid having
to do proxy_force_return() from __schedule(), which has to
re-take similar locks and then force another pass through the
pick loop.
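
The early-exit ordering in the new proxy_needs_return() can be sketched as a tiny userspace model (illustrative only: the struct fields and helper below are stand-ins for the kernel predicates named in the comments, not real scheduler types):

```c
#include <assert.h>
#include <stdbool.h>

struct task {
	bool blocked;		/* task_is_blocked(p) */
	bool is_current;	/* task_current(rq, p) */
	bool is_donor;		/* task_current_donor(rq, p) */
};

/*
 * A task that is not blocked, or that is already current, needs no
 * return migration; a blocked donor additionally needs the donor
 * reset to rq->curr before it is dequeued for the return migration.
 */
static bool needs_return(struct task *p, bool *reset_donor)
{
	*reset_donor = false;
	if (!p->blocked)
		return false;
	if (p->is_current)
		return false;
	if (p->is_donor)
		*reset_donor = true;	/* proxy_reset_donor(rq) */
	return true;	/* caller does block_task(rq, p, TASK_WAKING) */
}
```

When needs_return() is true, ttwu_runnable() returns 0 so the wakeup falls through to select_task_rq() and migrates the now-dequeued task back.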

This was split out from the larger proxy patch, and
significantly reworked.

Credits for the original patch go to:
  Peter Zijlstra (Intel) <peterz@infradead.org>
  Juri Lelli <juri.lelli@redhat.com>
  Valentin Schneider <valentin.schneider@arm.com>
  Connor O'Brien <connoro@google.com>

Signed-off-by: John Stultz <jstultz@google.com>
---
XXX - Make sure to switch to use ACQUIRE(__task_rq_lock, guard) in
7.0-rc+ branches
---
v24:
* Reworked proxy_needs_return() so it's less nested, as
  suggested by K Prateek
* Switch to using block_task with DEQUEUE_SPECIAL as suggested
  by K Prateek
* Fix edge case to reset wake_cpu if select_task_rq() chooses
  the current rq and we skip set_task_cpu()
v26:
* Handle both blocked and PROXY_WAKING tasks in
  proxy_needs_return(), as suggested by K Prateek
* Try to handle signal edge case in ttwu that K Prateek pointed
  out
v27:
* Integrate simplifications to proxy_needs_return() suggested
  by K Prateek
* Rework ttwu_runnable() to align with
  ACQUIRE(__task_rq_lock, guard)(p) usage as suggested by Peter
* Major rework suggested by Peter to get rid of
  proxy_force_return() completely, using proxy_deactivate() and
  allow ttwu to handle all the return migration. Lots of helpful
  improvements suggested by K Prateek included as well here.
v28:
* Folded in change suggested by K Prateek to introduce
  proxy_reset_donor() to reset the donor to current task when
  the donor is woken up.
* Drop an unnecessary PROXY_WAKING assignment which was noted
  by K Prateek

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 include/linux/sched.h |   2 +-
 kernel/sched/core.c   | 196 +++++++++++++++++++++---------------------
 2 files changed, 98 insertions(+), 100 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 368c7b4d7cb51..5b68a1c9eedcf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -161,7 +161,7 @@ struct user_event_mm;
  */
 #define is_special_task_state(state)					\
 	((state) & (__TASK_STOPPED | __TASK_TRACED | TASK_PARKED |	\
-		    TASK_DEAD | TASK_FROZEN))
+		    TASK_DEAD | TASK_WAKING | TASK_FROZEN))
 
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 # define debug_normal_state_change(state_value)				\
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 942af3b34ffe0..17797e1f76f25 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3722,6 +3722,54 @@ void update_rq_avg_idle(struct rq *rq)
 	rq->idle_stamp = 0;
 }
 
+#ifdef CONFIG_SCHED_PROXY_EXEC
+static void zap_balance_callbacks(struct rq *rq);
+
+static inline void proxy_reset_donor(struct rq *rq)
+{
+	WARN_ON_ONCE(rq->donor == rq->curr);
+
+	put_prev_set_next_task(rq, rq->donor, rq->curr);
+	rq_set_donor(rq, rq->curr);
+	zap_balance_callbacks(rq);
+	resched_curr(rq);
+}
+
+/*
+ * Checks to see if task p has been proxy-migrated to another rq
+ * and needs to be returned. If so, we deactivate the task here
+ * so that it can be properly woken up on the p->wake_cpu
+ * (or whichever cpu select_task_rq() picks at the bottom of
+ * try_to_wake_up()).
+ */
+static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
+{
+	if (!task_is_blocked(p))
+		return false;
+
+	guard(raw_spinlock)(&p->blocked_lock);
+
+	/* Task is waking up; clear any blocked_on relationship */
+	__clear_task_blocked_on(p, NULL);
+
+	/* If already current, don't need to return migrate */
+	if (task_current(rq, p))
+		return false;
+
+	/* If we're return migrating the rq->donor, switch it out for idle */
+	if (task_current_donor(rq, p))
+		proxy_reset_donor(rq);
+
+	block_task(rq, p, TASK_WAKING);
+	return true;
+}
+#else /* !CONFIG_SCHED_PROXY_EXEC */
+static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
+{
+	return false;
+}
+#endif /* CONFIG_SCHED_PROXY_EXEC */
+
 static void
 ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 		 struct rq_flags *rf)
@@ -3786,28 +3834,26 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
  */
 static int ttwu_runnable(struct task_struct *p, int wake_flags)
 {
-	struct rq_flags rf;
-	struct rq *rq;
-	int ret = 0;
+	ACQUIRE(__task_rq_lock, guard)(p);
+	struct rq *rq = guard.rq;
 
-	rq = __task_rq_lock(p, &rf);
-	if (task_on_rq_queued(p)) {
-		update_rq_clock(rq);
-		if (p->se.sched_delayed)
-			enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
-		if (!task_on_cpu(rq, p)) {
-			/*
-			 * When on_rq && !on_cpu the task is preempted, see if
-			 * it should preempt the task that is current now.
-			 */
-			wakeup_preempt(rq, p, wake_flags);
-		}
-		ttwu_do_wakeup(p);
-		ret = 1;
-	}
-	__task_rq_unlock(rq, p, &rf);
+	if (!task_on_rq_queued(p))
+		return 0;
 
-	return ret;
+	update_rq_clock(rq);
+	if (p->se.sched_delayed)
+		enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
+	if (proxy_needs_return(rq, p))
+		return 0;
+	if (!task_on_cpu(rq, p)) {
+		/*
+		 * When on_rq && !on_cpu the task is preempted, see if
+		 * it should preempt the task that is current now.
+		 */
+		wakeup_preempt(rq, p, wake_flags);
+	}
+	ttwu_do_wakeup(p);
+	return 1;
 }
 
 void sched_ttwu_pending(void *arg)
@@ -4194,6 +4240,8 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		 *    it disabling IRQs (this allows not taking ->pi_lock).
 		 */
 		WARN_ON_ONCE(p->se.sched_delayed);
+		/* If p is current, we know we can run here, so clear blocked_on */
+		clear_task_blocked_on(p, NULL);
 		if (!ttwu_state_match(p, state, &success))
 			goto out;
 
@@ -4210,6 +4258,7 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 */
 	scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
 		smp_mb__after_spinlock();
+
 		if (!ttwu_state_match(p, state, &success))
 			break;
 
@@ -4274,6 +4323,14 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		 */
 		WRITE_ONCE(p->__state, TASK_WAKING);
 
+		/*
+		 * We never clear the blocked_on relation on proxy_deactivate.
+		 * If we don't clear it here, we have TASK_RUNNING + p->blocked_on
+		 * when waking up. Since this is a fully blocked, off CPU task
+		 * waking up, it should be safe to clear the blocked_on relation.
+		 */
+		if (task_is_blocked(p))
+			clear_task_blocked_on(p, NULL);
 		/*
 		 * If the owning (remote) CPU is still in the middle of schedule() with
 		 * this task as prev, considering queueing p on the remote CPUs wake_list
@@ -4318,6 +4375,16 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 			wake_flags |= WF_MIGRATED;
 			psi_ttwu_dequeue(p);
 			set_task_cpu(p, cpu);
+		} else if (cpu != p->wake_cpu) {
+			/*
+			 * If we were proxy-migrated to this cpu and
+			 * select_task_rq() picks it instead of wake_cpu
+			 * to return to, we won't call set_task_cpu(),
+			 * leaving a stale wake_cpu pointing to where we
+			 * proxy-migrated from. So just fix up wake_cpu
+			 * here if it's not correct.
+			 */
+			p->wake_cpu = cpu;
 		}
 
 		ttwu_queue(p, cpu, wake_flags);
@@ -6606,7 +6673,7 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 	if (signal_pending_state(task_state, p)) {
 		WRITE_ONCE(p->__state, TASK_RUNNING);
 		*task_state_p = TASK_RUNNING;
-		set_task_blocked_on_waking(p, NULL);
+		clear_task_blocked_on(p, NULL);
 
 		return false;
 	}
@@ -6649,13 +6716,11 @@ static inline struct task_struct *proxy_resched_idle(struct rq *rq)
 	return rq->idle;
 }
 
-static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
+static void proxy_deactivate(struct rq *rq, struct task_struct *donor)
 {
 	unsigned long state = READ_ONCE(donor->__state);
 
-	/* Don't deactivate if the state has been changed to TASK_RUNNING */
-	if (state == TASK_RUNNING)
-		return false;
+	WARN_ON_ONCE(state == TASK_RUNNING);
 	/*
 	 * Because we got donor from pick_next_task(), it is *crucial*
 	 * that we call proxy_resched_idle() before we deactivate it.
@@ -6666,7 +6731,7 @@ static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
 	 * need to be changed from next *before* we deactivate.
 	 */
 	proxy_resched_idle(rq);
-	return try_to_block_task(rq, donor, &state, true);
+	block_task(rq, donor, state);
 }
 
 static inline void proxy_release_rq_lock(struct rq *rq, struct rq_flags *rf)
@@ -6740,71 +6805,6 @@ static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
 	proxy_reacquire_rq_lock(rq, rf);
 }
 
-static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
-			       struct task_struct *p)
-	__must_hold(__rq_lockp(rq))
-{
-	struct rq *task_rq, *target_rq = NULL;
-	int cpu, wake_flag = WF_TTWU;
-
-	lockdep_assert_rq_held(rq);
-	WARN_ON(p == rq->curr);
-
-	if (p == rq->donor)
-		proxy_resched_idle(rq);
-
-	proxy_release_rq_lock(rq, rf);
-	/*
-	 * We drop the rq lock, and re-grab task_rq_lock to get
-	 * the pi_lock (needed for select_task_rq) as well.
-	 */
-	scoped_guard (task_rq_lock, p) {
-		task_rq = scope.rq;
-
-		/*
-		 * Since we let go of the rq lock, the task may have been
-		 * woken or migrated to another rq before we  got the
-		 * task_rq_lock. So re-check we're on the same RQ. If
-		 * not, the task has already been migrated and that CPU
-		 * will handle any futher migrations.
-		 */
-		if (task_rq != rq)
-			break;
-
-		/*
-		 * Similarly, if we've been dequeued, someone else will
-		 * wake us
-		 */
-		if (!task_on_rq_queued(p))
-			break;
-
-		/*
-		 * Since we should only be calling here from __schedule()
-		 * -> find_proxy_task(), no one else should have
-		 * assigned current out from under us. But check and warn
-		 * if we see this, then bail.
-		 */
-		if (task_current(task_rq, p) || task_on_cpu(task_rq, p)) {
-			WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d  on_cpu: %i\n",
-				  __func__, cpu_of(task_rq),
-				  p->comm, p->pid, p->on_cpu);
-			break;
-		}
-
-		update_rq_clock(task_rq);
-		deactivate_task(task_rq, p, DEQUEUE_NOCLOCK);
-		cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
-		set_task_cpu(p, cpu);
-		target_rq = cpu_rq(cpu);
-		clear_task_blocked_on(p, NULL);
-	}
-
-	if (target_rq)
-		attach_one_task(target_rq, p);
-
-	proxy_reacquire_rq_lock(rq, rf);
-}
-
 /*
  * Find runnable lock owner to proxy for mutex blocked donor
  *
@@ -6840,7 +6840,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 				clear_task_blocked_on(p, PROXY_WAKING);
 				return p;
 			}
-			goto force_return;
+			goto deactivate;
 		}
 
 		/*
@@ -6875,7 +6875,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 				__clear_task_blocked_on(p, NULL);
 				return p;
 			}
-			goto force_return;
+			goto deactivate;
 		}
 
 		if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
@@ -6954,12 +6954,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 	return owner;
 
 deactivate:
-	if (proxy_deactivate(rq, donor))
-		return NULL;
-	/* If deactivate fails, force return */
-	p = donor;
-force_return:
-	proxy_force_return(rq, rf, p);
+	proxy_deactivate(rq, p);
 	return NULL;
 migrate_task:
 	proxy_migrate_task(rq, rf, p, owner_cpu);
@@ -7106,6 +7101,9 @@ static void __sched notrace __schedule(int sched_mode)
 	if (sched_proxy_exec()) {
 		struct task_struct *prev_donor = rq->donor;
 
+		if (!prev_state && prev->blocked_on)
+			clear_task_blocked_on(prev, NULL);
+
 		rq_set_donor(rq, next);
 		if (unlikely(next->blocked_on)) {
 			next = find_proxy_task(rq, next, &rf);
-- 
2.54.0.rc2.533.g4f5dca5207-goog


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v28 6/8] sched: Add blocked_donor link to task for smarter mutex handoffs
  2026-04-22 23:06 [PATCH v28 0/8] Optimized Donor Migration for Proxy Execution John Stultz
                   ` (4 preceding siblings ...)
  2026-04-22 23:06 ` [PATCH v28 5/8] sched: Have try_to_wake_up() handle return-migration for PROXY_WAKING case John Stultz
@ 2026-04-22 23:06 ` John Stultz
  2026-04-22 23:06 ` [PATCH v28 7/8] sched: Break out core of attach_tasks() helper into sched.h John Stultz
  2026-04-22 23:06 ` [PATCH v28 8/8] sched: Migrate whole chain in proxy_migrate_task() John Stultz
  7 siblings, 0 replies; 9+ messages in thread
From: John Stultz @ 2026-04-22 23:06 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Juri Lelli, Valentin Schneider,
	Connor O'Brien, John Stultz, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

From: Peter Zijlstra <peterz@infradead.org>

Add a link to the task this task is proxying for, and use it so
the mutex owner can do an intelligent hand-off of the mutex to
the task that the owner is running on behalf of.
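
The hand-off preference in the unlock slowpath can be modeled as a short userspace sketch (illustrative only: the structs below collapse the real mutex_waiter indirection and locking, and the helper name is made up):

```c
#include <assert.h>
#include <stddef.h>

struct mtx;
struct task {
	struct mtx *blocked_on;
	struct task *blocked_donor;
};
struct mtx {
	struct task *first_waiter;	/* stand-in for lock->first_waiter->task */
};

/*
 * Prefer the blocked_donor when it is blocked on this very lock,
 * since that is the waiter the scheduler already selected to boost
 * the owner; otherwise fall back to the first waiter on the list.
 */
static struct task *pick_next_owner(struct task *owner, struct mtx *lock)
{
	struct task *donor = owner->blocked_donor;

	if (donor && donor->blocked_on == lock) {
		owner->blocked_donor = NULL;
		return donor;
	}
	return lock->first_waiter;
}
```

Note the donor check must compare blocked_on against this specific lock, as the donor may be boosting current through a different mutex in the chain.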

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
[jstultz: This patch was split out from larger proxy patch]
Signed-off-by: John Stultz <jstultz@google.com>
---
v5:
* Split out from larger proxy patch
v6:
* Moved proxied value from earlier patch to this one where it
  is actually used
* Rework logic to check sched_proxy_exec() instead of using ifdefs
* Moved comment change to this patch where it makes sense
v7:
* Use more descriptive term then "us" in comments, as suggested
  by Metin Kaya.
* Minor typo fixup from Metin Kaya
* Reworked proxied variable to prev_not_proxied to simplify usage
v8:
* Use helper for donor blocked_on_state transition
v9:
* Re-add mutex lock handoff in the unlock path, but only when we
  have a blocked donor
* Slight reword of commit message suggested by Metin
v18:
* Add task_init initialization for blocked_donor, suggested by
  Suleiman
v23:
* Reworks for PROXY_WAKING approach suggested by PeterZ
v25:
* Simplified some logic now we don't have proxy_tag_curr()
v28:
* Remove sched_proxy_exec() conditionalized blocked_lock usage
  as it was hitting llvm's "'blocked_lock' is not held on every
  path through here [-Wthread-safety-analysis]" errors.

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com

---
 include/linux/sched.h  |  1 +
 init/init_task.c       |  1 +
 kernel/fork.c          |  1 +
 kernel/locking/mutex.c | 43 +++++++++++++++++++++++++++++++++++++++---
 kernel/sched/core.c    | 14 +++++++++++++-
 5 files changed, 56 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5b68a1c9eedcf..65791d7359553 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1239,6 +1239,7 @@ struct task_struct {
 #endif
 
 	struct mutex			*blocked_on;	/* lock we're blocked on */
+	struct task_struct		*blocked_donor;	/* task that is boosting this task */
 	raw_spinlock_t			blocked_lock;
 
 #ifdef CONFIG_DETECT_HUNG_TASK_BLOCKER
diff --git a/init/init_task.c b/init/init_task.c
index b5f48ebdc2b6e..41c19670c8f6b 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -200,6 +200,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	.mems_allowed_seq = SEQCNT_SPINLOCK_ZERO(init_task.mems_allowed_seq,
 						 &init_task.alloc_lock),
 #endif
+	.blocked_donor = NULL,
 #ifdef CONFIG_RT_MUTEXES
 	.pi_waiters	= RB_ROOT_CACHED,
 	.pi_top_task	= NULL,
diff --git a/kernel/fork.c b/kernel/fork.c
index f1ad69c6dc2d4..c1d65b86729d6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2216,6 +2216,7 @@ __latent_entropy struct task_struct *copy_process(
 	lockdep_init_task(p);
 
 	p->blocked_on = NULL; /* not blocked yet */
+	p->blocked_donor = NULL; /* nobody is boosting p yet */
 
 #ifdef CONFIG_BCACHE
 	p->sequential_io	= 0;
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 6e6699d9fcba6..333f33d33ef06 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -976,7 +976,7 @@ EXPORT_SYMBOL_GPL(ww_mutex_lock_interruptible);
 static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigned long ip)
 	__releases(lock)
 {
-	struct task_struct *next = NULL;
+	struct task_struct *donor, *next = NULL;
 	struct mutex_waiter *waiter;
 	DEFINE_WAKE_Q(wake_q);
 	unsigned long owner;
@@ -997,6 +997,12 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 		MUTEX_WARN_ON(__owner_task(owner) != current);
 		MUTEX_WARN_ON(owner & MUTEX_FLAG_PICKUP);
 
+		if (sched_proxy_exec() && current->blocked_donor) {
+			/* force handoff if we have a blocked_donor */
+			owner = MUTEX_FLAG_HANDOFF;
+			break;
+		}
+
 		if (owner & MUTEX_FLAG_HANDOFF)
 			break;
 
@@ -1009,19 +1015,50 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 	}
 
 	raw_spin_lock_irqsave(&lock->wait_lock, flags);
+	raw_spin_lock(&current->blocked_lock);
 	debug_mutex_unlock(lock);
+
+	if (sched_proxy_exec()) {
+		/*
+		 * If we have a task boosting current, and that task was boosting
+		 * current through this lock, hand the lock to that task, as that
+		 * is the highest waiter, as selected by the scheduling function.
+		 */
+		donor = current->blocked_donor;
+		if (donor) {
+			struct mutex *next_lock;
+
+			raw_spin_lock_nested(&donor->blocked_lock, SINGLE_DEPTH_NESTING);
+			next_lock = __get_task_blocked_on(donor);
+			if (next_lock == lock) {
+				next = donor;
+				__set_task_blocked_on_waking(donor, next_lock);
+				wake_q_add(&wake_q, donor);
+				current->blocked_donor = NULL;
+			}
+			raw_spin_unlock(&donor->blocked_lock);
+		}
+	}
+
+	/*
+	 * Failing that, pick first on the wait list.
+	 */
 	waiter = lock->first_waiter;
-	if (waiter) {
+	if (!next && waiter) {
 		next = waiter->task;
 
+		raw_spin_lock_nested(&next->blocked_lock, SINGLE_DEPTH_NESTING);
 		debug_mutex_wake_waiter(lock, waiter);
-		set_task_blocked_on_waking(next, lock);
+		__set_task_blocked_on_waking(next, lock);
+		raw_spin_unlock(&next->blocked_lock);
 		wake_q_add(&wake_q, next);
+
 	}
 
 	if (owner & MUTEX_FLAG_HANDOFF)
 		__mutex_handoff(lock, next);
 
+	raw_spin_unlock(&current->blocked_lock);
 	raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q);
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 17797e1f76f25..edd418dde1e47 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6809,7 +6809,17 @@ static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
  * Find runnable lock owner to proxy for mutex blocked donor
  *
  * Follow the blocked-on relation:
- *   task->blocked_on -> mutex->owner -> task...
+ *
+ *                ,-> task
+ *                |     | blocked-on
+ *                |     v
+ *  blocked_donor |   mutex
+ *                |     | owner
+ *                |     v
+ *                `-- task
+ *
+ * and set the blocked_donor relation; the latter is used by the mutex
+ * code to find which (blocked) task to hand the lock off to.
  *
  * Lock order:
  *
@@ -6949,6 +6959,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 		 * rq, therefore holding @rq->lock is sufficient to
 		 * guarantee its existence, as per ttwu_remote().
 		 */
+		owner->blocked_donor = p;
 	}
 	WARN_ON_ONCE(owner && !owner->on_rq);
 	return owner;
@@ -7105,6 +7116,7 @@ static void __sched notrace __schedule(int sched_mode)
 			clear_task_blocked_on(prev, NULL);
 
 		rq_set_donor(rq, next);
+		next->blocked_donor = NULL;
 		if (unlikely(next->blocked_on)) {
 			next = find_proxy_task(rq, next, &rf);
 			if (!next) {
-- 
2.54.0.rc2.533.g4f5dca5207-goog


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v28 7/8] sched: Break out core of attach_tasks() helper into sched.h
  2026-04-22 23:06 [PATCH v28 0/8] Optimized Donor Migration for Proxy Execution John Stultz
                   ` (5 preceding siblings ...)
  2026-04-22 23:06 ` [PATCH v28 6/8] sched: Add blocked_donor link to task for smarter mutex handoffs John Stultz
@ 2026-04-22 23:06 ` John Stultz
  2026-04-22 23:06 ` [PATCH v28 8/8] sched: Migrate whole chain in proxy_migrate_task() John Stultz
  7 siblings, 0 replies; 9+ messages in thread
From: John Stultz @ 2026-04-22 23:06 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, K Prateek Nayak, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall,
	Zimuzo Ezeozue, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

Pull the core of attach_tasks() out into sched.h so
it can be used more generically.
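
The drain loop that moves into the __attach_tasks() helper follows a common pattern that can be shown with a minimal userspace list (illustrative only: struct node stands in for the se.group_node list linkage, and the attached flag stands in for attach_task()):

```c
#include <assert.h>
#include <stddef.h>

struct node {
	struct node *next;
	int attached;
};
struct list {
	struct node *head;
};

/* Pop entries off the head until the list is empty, detaching each
 * one before attaching it to the destination. */
static int drain_and_attach(struct list *tasks)
{
	int n = 0;

	while (tasks->head) {
		struct node *p = tasks->head;

		tasks->head = p->next;
		p->next = NULL;		/* list_del_init() analogue */
		p->attached = 1;	/* attach_task(rq, p) analogue */
		n++;
	}
	return n;
}
```

Because the helper only needs an rq and a list head, moving it into sched.h lets callers outside fair.c (such as the proxy migration path) reuse it.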

Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/sched/fair.c  | 16 +---------------
 kernel/sched/sched.h | 19 +++++++++++++++++++
 2 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b843f9a876d6d..af6bd2fdd460f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10011,21 +10011,7 @@ static int detach_tasks(struct lb_env *env)
  */
 static void attach_tasks(struct lb_env *env)
 {
-	struct list_head *tasks = &env->tasks;
-	struct task_struct *p;
-	struct rq_flags rf;
-
-	rq_lock(env->dst_rq, &rf);
-	update_rq_clock(env->dst_rq);
-
-	while (!list_empty(tasks)) {
-		p = list_first_entry(tasks, struct task_struct, se.group_node);
-		list_del_init(&p->se.group_node);
-
-		attach_task(env->dst_rq, p);
-	}
-
-	rq_unlock(env->dst_rq, &rf);
+	__attach_tasks(env->dst_rq, &env->tasks);
 }
 
 #ifdef CONFIG_NO_HZ_COMMON
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d0f3e164a61d2..e1927e53e85cb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3045,6 +3045,25 @@ static inline void attach_one_task(struct rq *rq, struct task_struct *p)
 	attach_task(rq, p);
 }
 
+/*
+ * __attach_tasks() - attaches a list of tasks (using se.group_node) to
+ * the new rq
+ */
+static inline void __attach_tasks(struct rq *rq, struct list_head *tasks)
+{
+	guard(rq_lock)(rq);
+	update_rq_clock(rq);
+
+	while (!list_empty(tasks)) {
+		struct task_struct *p;
+
+		p = list_first_entry(tasks, struct task_struct, se.group_node);
+		list_del_init(&p->se.group_node);
+
+		attach_task(rq, p);
+	}
+}
+
 #ifdef CONFIG_PREEMPT_RT
 # define SCHED_NR_MIGRATE_BREAK 8
 #else
-- 
2.54.0.rc2.533.g4f5dca5207-goog


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v28 8/8] sched: Migrate whole chain in proxy_migrate_task()
  2026-04-22 23:06 [PATCH v28 0/8] Optimized Donor Migration for Proxy Execution John Stultz
                   ` (6 preceding siblings ...)
  2026-04-22 23:06 ` [PATCH v28 7/8] sched: Break out core of attach_tasks() helper into sched.h John Stultz
@ 2026-04-22 23:06 ` John Stultz
  7 siblings, 0 replies; 9+ messages in thread
From: John Stultz @ 2026-04-22 23:06 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

Instead of migrating one task each time through find_proxy_task(),
we can walk up the blocked_donor pointers and migrate the entire
current chain in one go.

This was broken out of earlier patches and held back while the
series was being stabilized, but I wanted to re-introduce it.

Signed-off-by: John Stultz <jstultz@google.com>
---
v12:
* Earlier this was re-using blocked_node, but I hit
  a race with activating blocked entities, and to
  avoid it introduced a new migration_node listhead
v18:
* Add init_task initialization of migration_node as suggested
  by Suleiman
v22:
* Move migration_node under CONFIG_SCHED_PROXY_EXEC as suggested
  by K Prateek
v25:
* Use se.group_node instead of adding migration_node, as
  suggested by K Prateek
* Integrated attach_tasks() cleanups suggested by K Prateek

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/sched/core.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index edd418dde1e47..886984a02adad 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6779,9 +6779,9 @@ static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
 	__must_hold(__rq_lockp(rq))
 {
 	struct rq *target_rq = cpu_rq(target_cpu);
+	LIST_HEAD(migrate_list);
 
 	lockdep_assert_rq_held(rq);
-	WARN_ON(p == rq->curr);
 	/*
 	 * Since we are migrating a blocked donor, it could be rq->donor,
 	 * and we want to make sure there aren't any references from this
@@ -6794,13 +6794,20 @@ static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
 	 * before we release the lock.
 	 */
 	proxy_resched_idle(rq);
-
-	deactivate_task(rq, p, DEQUEUE_NOCLOCK);
-	proxy_set_task_cpu(p, target_cpu);
-
+	for (; p; p = p->blocked_donor) {
+		WARN_ON(p == rq->curr);
+		deactivate_task(rq, p, DEQUEUE_NOCLOCK);
+		proxy_set_task_cpu(p, target_cpu);
+		/*
+		 * We can re-use se.group_node to migrate the thing,
+		 * because @p is deactivated (won't be balanced) and
+		 * we hold the rq_lock.
+		 */
+		list_add(&p->se.group_node, &migrate_list);
+	}
 	proxy_release_rq_lock(rq, rf);
 
-	attach_one_task(target_rq, p);
+	__attach_tasks(target_rq, &migrate_list);
 
 	proxy_reacquire_rq_lock(rq, rf);
 }
-- 
2.54.0.rc2.533.g4f5dca5207-goog



end of thread, other threads:[~2026-04-22 23:07 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-22 23:06 [PATCH v28 0/8] Optimized Donor Migration for Proxy Execution John Stultz
2026-04-22 23:06 ` [PATCH v28 1/8] sched: Rework pick_next_task() and prev_balance() to avoid stale prev references John Stultz
2026-04-22 23:06 ` [PATCH v28 2/8] sched: deadline: Add some helper variables to cleanup deadline logic John Stultz
2026-04-22 23:06 ` [PATCH v28 3/8] sched: deadline: Add dl_rq->curr pointer to address issues with Proxy Exec John Stultz
2026-04-22 23:06 ` [PATCH v28 4/8] sched: Rework block_task so it can be directly called John Stultz
2026-04-22 23:06 ` [PATCH v28 5/8] sched: Have try_to_wake_up() handle return-migration for PROXY_WAKING case John Stultz
2026-04-22 23:06 ` [PATCH v28 6/8] sched: Add blocked_donor link to task for smarter mutex handoffs John Stultz
2026-04-22 23:06 ` [PATCH v28 7/8] sched: Break out core of attach_tasks() helper into sched.h John Stultz
2026-04-22 23:06 ` [PATCH v28 8/8] sched: Migrate whole chain in proxy_migrate_task() John Stultz
