* [PATCH v24 00/11] Donor Migration for Proxy Execution (v24)
@ 2025-11-24 22:30 John Stultz
2025-11-24 22:30 ` [PATCH v24 01/11] locking: Add task::blocked_lock to serialize blocked_on state John Stultz
` (10 more replies)
0 siblings, 11 replies; 23+ messages in thread
From: John Stultz @ 2025-11-24 22:30 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
hupu, kernel-team
Hey All,
Yet another iteration on the next chunk of the Proxy Exec
series: Donor Migration
This is just the next step for Proxy Execution, to allow us to
migrate blocked donors across runqueues to boost remote lock
owners.
In this portion of the series, I’m only submitting for review
and consideration the logic that allows us to do donor
(blocked waiter) migration. This requires some additional
changes to locking and extra state tracking to ensure we don’t
accidentally run a migrated donor on a cpu it isn’t affined to,
as well as some extra handling to deal with balance callback
state that needs to be reset when we decide to pick a different
task after doing donor migration.
In the last iteration, K Prateek provided some really great
review feedback, so I’ve tried to integrate all of his suggested
cleanups and improvements. Many thanks again to K Prateek!
Additionally, in my continued efforts to make Proxy Execution
and sched_ext play well together, I realized a bug I saw with
sched_ext was actually a larger issue: the sched class
implementations assume that the “prev” argument passed in
from __schedule() is stable across rq lock drops. Without Proxy
Exec, “prev” is always “current” and is on the cpu, so this
assumption held, but with Proxy Exec, “prev” is “rq->donor”,
and if the rq lock is dropped, rq->donor may be woken up on
another cpu and migrated away, with rq->donor being set
to idle. So I’ve gone through the class schedulers for both
pick_next_task() and prev_balance() and removed the prev
argument, reworking the functions to sample rq->donor,
particularly after an rq lock drop.
New in this iteration:
* Reworking pick_next_task() and prev_balance() to not pass prev
argument which might go stale across rq lock drops
* Change to avoid a null ptr traversal when a task calls yield
while rq->donor is idle.
* _Lots_ of cleanups and improvements suggested by K Prateek.
* Fix for edge case where select_task_rq() chooses the current
cpu and we don’t call set_task_cpu(), which caused wake_cpu to
go stale
I’d love to get further feedback on any place where these
patches are confusing, or could use additional clarifications.
In the full series, there’s a number of fixes for issues found
enabling and testing with sched_ext, along with another revision
of Suleiman’s rwsem support. I’d appreciate any testing or
comments that folks have with the full set:
You can find the full Proxy Exec series here:
https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v24-6.18-rc6
https://github.com/johnstultz-work/linux-dev.git proxy-exec-v24-6.18-rc6
Issues still to address with the full series:
* Continue working to get sched_ext to be ok with Proxy
Execution enabled.
* I’ve reproduced the performance regression K Prateek Nayak
found with the full series. I’m hoping to work to understand
and narrow the issue down soon.
* The chain migration functionality needs further iterations
and better validation to ensure it truly maintains the RT/DL
load balancing invariants (despite this being broken in
vanilla upstream with RT_PUSH_IPI currently)
Future work:
* Expand to more locking primitives: Figuring out pi-futexes
would be good, using proxy for Binder PI is something else
we’re exploring.
* Eventually: Work to replace rt_mutexes and get things happy
with PREEMPT_RT
I’d really appreciate any feedback or review thoughts on the
full series as well. I’m trying to keep the chunks small,
reviewable and iteratively testable, but if you have any
suggestions on how to improve the larger series, I’m all ears.
Credit/Disclaimer:
---------------------
As always, this Proxy Execution series has a long history with
lots of developers that deserve credit:
First described in a paper[1] by Watkins, Straub, and Niehaus,
then prototyped in patches from Peter Zijlstra, and extended
with lots of work by Juri Lelli, Valentin Schneider, and Connor
O'Brien. (and thank you to Steven Rostedt for providing
additional details here!).
Thanks also to Joel Fernandes, Dietmar Eggemann, Metin Kaya,
K Prateek Nayak and Suleiman Souhlal for their substantial
review, suggestion, and patch contributions.
So again, many thanks to those above, as all the credit for this
series really is due to them - while the mistakes are surely
mine.
Thanks so much!
-john
[1] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
John Stultz (10):
locking: Add task::blocked_lock to serialize blocked_on state
sched: Fix modifying donor->blocked_on without proper locking
sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy
return-migration
sched: Add assert_balance_callbacks_empty helper
sched: Add logic to zap balance callbacks if we pick again
sched: Handle blocked-waiter migration (and return migration)
sched: Rework pick_next_task() and prev_balance() to avoid stale prev
references
sched: Avoid donor->sched_class->yield_task() null traversal
sched: Have try_to_wake_up() handle return-migration for PROXY_WAKING
case
sched: Migrate whole chain in proxy_migrate_task()
Peter Zijlstra (1):
sched: Add blocked_donor link to task for smarter mutex handoffs
include/linux/sched.h | 95 +++++---
init/init_task.c | 5 +
kernel/fork.c | 5 +
kernel/locking/mutex-debug.c | 4 +-
kernel/locking/mutex.c | 82 +++++--
kernel/locking/mutex.h | 6 +
kernel/locking/ww_mutex.h | 16 +-
kernel/sched/core.c | 418 +++++++++++++++++++++++++++++++----
kernel/sched/deadline.c | 8 +-
kernel/sched/ext.c | 8 +-
kernel/sched/fair.c | 15 +-
kernel/sched/idle.c | 2 +-
kernel/sched/rt.c | 8 +-
kernel/sched/sched.h | 17 +-
kernel/sched/stop_task.c | 2 +-
kernel/sched/syscalls.c | 3 +-
16 files changed, 582 insertions(+), 112 deletions(-)
--
2.52.0.487.g5c8c507ade-goog
* [PATCH v24 01/11] locking: Add task::blocked_lock to serialize blocked_on state
2025-11-24 22:30 [PATCH v24 00/11] Donor Migration for Proxy Execution (v24) John Stultz
@ 2025-11-24 22:30 ` John Stultz
2025-11-24 22:30 ` [PATCH v24 02/11] sched: Fix modifying donor->blocked_on without proper locking John Stultz
` (9 subsequent siblings)
10 siblings, 0 replies; 23+ messages in thread
From: John Stultz @ 2025-11-24 22:30 UTC (permalink / raw)
To: LKML
Cc: John Stultz, K Prateek Nayak, Joel Fernandes, Qais Yousef,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall,
Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team
So far, we have been able to utilize the mutex::wait_lock
for serializing the blocked_on state, but when we move to
proxying across runqueues, we will need to add more state
and a way to serialize changes to this state in contexts
where we don't hold the mutex::wait_lock.
So introduce the task::blocked_lock, which nests under the
mutex::wait_lock in the locking order, and rework the locking
to use it.
Signed-off-by: John Stultz <jstultz@google.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
v15:
* Split back out into later in the series
v16:
* Fixups to mark tasks unblocked before sleeping in
mutex_optimistic_spin()
* Rework to use guard() as suggested by Peter
v19:
* Rework logic for PREEMPT_RT issues reported by
K Prateek Nayak
v21:
* After recently thinking more on ww_mutex code, I
reworked the blocked_lock usage in mutex lock to
avoid having to take nested locks in the ww_mutex
paths, as I was concerned the lock ordering
constraints weren't as strong as I had previously
thought.
v22:
* Added some extra spaces to avoid dense code blocks, as
suggested by K Prateek
v23:
* Move get_task_blocked_on() to kernel/locking/mutex.h
as requested by PeterZ
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
include/linux/sched.h | 48 +++++++++++++-----------------------
init/init_task.c | 1 +
kernel/fork.c | 1 +
kernel/locking/mutex-debug.c | 4 +--
kernel/locking/mutex.c | 40 +++++++++++++++++++-----------
kernel/locking/mutex.h | 6 +++++
kernel/locking/ww_mutex.h | 4 +--
kernel/sched/core.c | 4 ++-
8 files changed, 58 insertions(+), 50 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b469878de25c8..16a2951f78b1f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1241,6 +1241,7 @@ struct task_struct {
#endif
struct mutex *blocked_on; /* lock we're blocked on */
+ raw_spinlock_t blocked_lock;
#ifdef CONFIG_DETECT_HUNG_TASK_BLOCKER
/*
@@ -2149,57 +2150,42 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
#ifndef CONFIG_PREEMPT_RT
static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
{
- struct mutex *m = p->blocked_on;
-
- if (m)
- lockdep_assert_held_once(&m->wait_lock);
- return m;
+ lockdep_assert_held_once(&p->blocked_lock);
+ return p->blocked_on;
}
static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
{
- struct mutex *blocked_on = READ_ONCE(p->blocked_on);
-
WARN_ON_ONCE(!m);
/* The task should only be setting itself as blocked */
WARN_ON_ONCE(p != current);
- /* Currently we serialize blocked_on under the mutex::wait_lock */
- lockdep_assert_held_once(&m->wait_lock);
+ /* Currently we serialize blocked_on under the task::blocked_lock */
+ lockdep_assert_held_once(&p->blocked_lock);
/*
* Check ensure we don't overwrite existing mutex value
* with a different mutex. Note, setting it to the same
* lock repeatedly is ok.
*/
- WARN_ON_ONCE(blocked_on && blocked_on != m);
- WRITE_ONCE(p->blocked_on, m);
-}
-
-static inline void set_task_blocked_on(struct task_struct *p, struct mutex *m)
-{
- guard(raw_spinlock_irqsave)(&m->wait_lock);
- __set_task_blocked_on(p, m);
+ WARN_ON_ONCE(p->blocked_on && p->blocked_on != m);
+ p->blocked_on = m;
}
static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *m)
{
- if (m) {
- struct mutex *blocked_on = READ_ONCE(p->blocked_on);
-
- /* Currently we serialize blocked_on under the mutex::wait_lock */
- lockdep_assert_held_once(&m->wait_lock);
- /*
- * There may be cases where we re-clear already cleared
- * blocked_on relationships, but make sure we are not
- * clearing the relationship with a different lock.
- */
- WARN_ON_ONCE(blocked_on && blocked_on != m);
- }
- WRITE_ONCE(p->blocked_on, NULL);
+ /* Currently we serialize blocked_on under the task::blocked_lock */
+ lockdep_assert_held_once(&p->blocked_lock);
+ /*
+ * There may be cases where we re-clear already cleared
+ * blocked_on relationships, but make sure we are not
+ * clearing the relationship with a different lock.
+ */
+ WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m);
+ p->blocked_on = NULL;
}
static inline void clear_task_blocked_on(struct task_struct *p, struct mutex *m)
{
- guard(raw_spinlock_irqsave)(&m->wait_lock);
+ guard(raw_spinlock_irqsave)(&p->blocked_lock);
__clear_task_blocked_on(p, m);
}
#else
diff --git a/init/init_task.c b/init/init_task.c
index a55e2189206fa..60477d74546e0 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -143,6 +143,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
.journal_info = NULL,
INIT_CPU_TIMERS(init_task)
.pi_lock = __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
+ .blocked_lock = __RAW_SPIN_LOCK_UNLOCKED(init_task.blocked_lock),
.timer_slack_ns = 50000, /* 50 usec default slack */
.thread_pid = &init_struct_pid,
.thread_node = LIST_HEAD_INIT(init_signals.thread_head),
diff --git a/kernel/fork.c b/kernel/fork.c
index 3da0f08615a95..0697084be202f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2038,6 +2038,7 @@ __latent_entropy struct task_struct *copy_process(
ftrace_graph_init_task(p);
rt_mutex_init_task(p);
+ raw_spin_lock_init(&p->blocked_lock);
lockdep_assert_irqs_enabled();
#ifdef CONFIG_PROVE_LOCKING
diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index 949103fd8e9b5..1d8cff71f65e1 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -54,13 +54,13 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
lockdep_assert_held(&lock->wait_lock);
/* Current thread can't be already blocked (since it's executing!) */
- DEBUG_LOCKS_WARN_ON(__get_task_blocked_on(task));
+ DEBUG_LOCKS_WARN_ON(get_task_blocked_on(task));
}
void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
struct task_struct *task)
{
- struct mutex *blocked_on = __get_task_blocked_on(task);
+ struct mutex *blocked_on = get_task_blocked_on(task);
DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
DEBUG_LOCKS_WARN_ON(waiter->task != task);
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index de7d6702cd96c..c44fc63d4476e 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -640,6 +640,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
goto err_early_kill;
}
+ raw_spin_lock(¤t->blocked_lock);
__set_task_blocked_on(current, lock);
set_current_state(state);
trace_contention_begin(lock, LCB_F_MUTEX);
@@ -653,8 +654,9 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
* the handoff.
*/
if (__mutex_trylock(lock))
- goto acquired;
+ break;
+ raw_spin_unlock(¤t->blocked_lock);
/*
* Check for signals and kill conditions while holding
* wait_lock. This ensures the lock cancellation is ordered
@@ -677,12 +679,14 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
first = __mutex_waiter_is_first(lock, &waiter);
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
+ raw_spin_lock(¤t->blocked_lock);
/*
* As we likely have been woken up by task
* that has cleared our blocked_on state, re-set
* it to the lock we are trying to acquire.
*/
- set_task_blocked_on(current, lock);
+ __set_task_blocked_on(current, lock);
set_current_state(state);
/*
* Here we order against unlock; we must either see it change
@@ -693,25 +697,33 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
break;
if (first) {
- trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
+ bool opt_acquired;
+
/*
* mutex_optimistic_spin() can call schedule(), so
- * clear blocked on so we don't become unselectable
+ * we need to release these locks before calling it,
+ * and clear blocked on so we don't become unselectable
* to run.
*/
- clear_task_blocked_on(current, lock);
- if (mutex_optimistic_spin(lock, ww_ctx, &waiter))
+ __clear_task_blocked_on(current, lock);
+ raw_spin_unlock(¤t->blocked_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
+
+ trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
+ opt_acquired = mutex_optimistic_spin(lock, ww_ctx, &waiter);
+
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
+ raw_spin_lock(¤t->blocked_lock);
+ __set_task_blocked_on(current, lock);
+
+ if (opt_acquired)
break;
- set_task_blocked_on(current, lock);
trace_contention_begin(lock, LCB_F_MUTEX);
}
-
- raw_spin_lock_irqsave(&lock->wait_lock, flags);
}
- raw_spin_lock_irqsave(&lock->wait_lock, flags);
-acquired:
__clear_task_blocked_on(current, lock);
__set_current_state(TASK_RUNNING);
+ raw_spin_unlock(¤t->blocked_lock);
if (ww_ctx) {
/*
@@ -740,11 +752,11 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
return 0;
err:
- __clear_task_blocked_on(current, lock);
+ clear_task_blocked_on(current, lock);
__set_current_state(TASK_RUNNING);
__mutex_remove_waiter(lock, &waiter);
err_early_kill:
- WARN_ON(__get_task_blocked_on(current));
+ WARN_ON(get_task_blocked_on(current));
trace_contention_end(lock, ret);
raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q);
debug_mutex_free_waiter(&waiter);
@@ -955,7 +967,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
next = waiter->task;
debug_mutex_wake_waiter(lock, waiter);
- __clear_task_blocked_on(next, lock);
+ clear_task_blocked_on(next, lock);
wake_q_add(&wake_q, next);
}
diff --git a/kernel/locking/mutex.h b/kernel/locking/mutex.h
index 2e8080a9bee37..5cfd663e2c011 100644
--- a/kernel/locking/mutex.h
+++ b/kernel/locking/mutex.h
@@ -47,6 +47,12 @@ static inline struct task_struct *__mutex_owner(struct mutex *lock)
return (struct task_struct *)(atomic_long_read(&lock->owner) & ~MUTEX_FLAGS);
}
+static inline struct mutex *get_task_blocked_on(struct task_struct *p)
+{
+ guard(raw_spinlock_irqsave)(&p->blocked_lock);
+ return __get_task_blocked_on(p);
+}
+
#ifdef CONFIG_DEBUG_MUTEXES
extern void debug_mutex_lock_common(struct mutex *lock,
struct mutex_waiter *waiter);
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 31a785afee6c0..e4a81790ea7dd 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -289,7 +289,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
* blocked_on pointer. Otherwise we can see circular
* blocked_on relationships that can't resolve.
*/
- __clear_task_blocked_on(waiter->task, lock);
+ clear_task_blocked_on(waiter->task, lock);
wake_q_add(wake_q, waiter->task);
}
@@ -347,7 +347,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
* are waking the mutex owner, who may be currently
* blocked on a different mutex.
*/
- __clear_task_blocked_on(owner, NULL);
+ clear_task_blocked_on(owner, NULL);
wake_q_add(wake_q, owner);
}
return true;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6960c1bfc741a..b3f9bc20b7e1f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6607,6 +6607,7 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *d
* p->pi_lock
* rq->lock
* mutex->wait_lock
+ * p->blocked_lock
*
* Returns the task that is going to be used as execution context (the one
* that is actually going to be run on cpu_of(rq)).
@@ -6630,8 +6631,9 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
* and ensure @owner sticks around.
*/
guard(raw_spinlock)(&mutex->wait_lock);
+ guard(raw_spinlock)(&p->blocked_lock);
- /* Check again that p is blocked with wait_lock held */
+ /* Check again that p is blocked with blocked_lock held */
if (mutex != __get_task_blocked_on(p)) {
/*
* Something changed in the blocked_on chain and
--
2.52.0.487.g5c8c507ade-goog
* [PATCH v24 02/11] sched: Fix modifying donor->blocked_on without proper locking
2025-11-24 22:30 [PATCH v24 00/11] Donor Migration for Proxy Execution (v24) John Stultz
2025-11-24 22:30 ` [PATCH v24 01/11] locking: Add task::blocked_lock to serialize blocked_on state John Stultz
@ 2025-11-24 22:30 ` John Stultz
2025-11-24 22:30 ` [PATCH v24 03/11] sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration John Stultz
` (8 subsequent siblings)
10 siblings, 0 replies; 23+ messages in thread
From: John Stultz @ 2025-11-24 22:30 UTC (permalink / raw)
To: LKML
Cc: John Stultz, K Prateek Nayak, Joel Fernandes, Qais Yousef,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall,
Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team
Introduce an action enum in find_proxy_task() which allows
us to handle work that needs to be done outside the
mutex.wait_lock and task.blocked_lock guard scopes.
This ensures proper locking when we clear the donor's blocked_on
pointer in proxy_deactivate(), and the switch statement will be
useful as we add more cases to handle later in this series.
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
v23:
* Split out from earlier patch.
v24:
* Minor re-ordering of local variables to keep with style,
as suggested by K Prateek
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
kernel/sched/core.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b3f9bc20b7e1f..1b6fd173daadd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6591,7 +6591,7 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *d
* as unblocked, as we aren't doing proxy-migrations
* yet (more logic will be needed then).
*/
- donor->blocked_on = NULL;
+ clear_task_blocked_on(donor, NULL);
}
return NULL;
}
@@ -6615,6 +6615,7 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *d
static struct task_struct *
find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
{
+ enum { FOUND, DEACTIVATE_DONOR } action = FOUND;
struct task_struct *owner = NULL;
int this_cpu = cpu_of(rq);
struct task_struct *p;
@@ -6652,12 +6653,14 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
/* XXX Don't handle blocked owners/delayed dequeue yet */
- return proxy_deactivate(rq, donor);
+ action = DEACTIVATE_DONOR;
+ break;
}
if (task_cpu(owner) != this_cpu) {
/* XXX Don't handle migrations yet */
- return proxy_deactivate(rq, donor);
+ action = DEACTIVATE_DONOR;
+ break;
}
if (task_on_rq_migrating(owner)) {
@@ -6715,6 +6718,13 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
*/
}
+ /* Handle actions we need to do outside of the guard() scope */
+ switch (action) {
+ case DEACTIVATE_DONOR:
+ return proxy_deactivate(rq, donor);
+ case FOUND:
+ /* fallthrough */;
+ }
WARN_ON_ONCE(owner && !owner->on_rq);
return owner;
}
--
2.52.0.487.g5c8c507ade-goog
* [PATCH v24 03/11] sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration
2025-11-24 22:30 [PATCH v24 00/11] Donor Migration for Proxy Execution (v24) John Stultz
2025-11-24 22:30 ` [PATCH v24 01/11] locking: Add task::blocked_lock to serialize blocked_on state John Stultz
2025-11-24 22:30 ` [PATCH v24 02/11] sched: Fix modifying donor->blocked_on without proper locking John Stultz
@ 2025-11-24 22:30 ` John Stultz
2025-11-24 22:30 ` [PATCH v24 04/11] sched: Add assert_balance_callbacks_empty helper John Stultz
` (7 subsequent siblings)
10 siblings, 0 replies; 23+ messages in thread
From: John Stultz @ 2025-11-24 22:30 UTC (permalink / raw)
To: LKML
Cc: John Stultz, K Prateek Nayak, Joel Fernandes, Qais Yousef,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall,
Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team
As we add functionality to proxy execution, we may migrate a
donor task to a runqueue where it can't run due to cpu affinity.
Thus, we must be careful to ensure we return-migrate the task
back to a cpu in its cpumask when it becomes unblocked.
Peter helpfully provided the following example with pictures:
"Suppose we have a ww_mutex cycle:
,-+-* Mutex-1 <-.
Task-A ---' | | ,-- Task-B
`-> Mutex-2 *-+-'
Where Task-A holds Mutex-1 and tries to acquire Mutex-2, and
where Task-B holds Mutex-2 and tries to acquire Mutex-1.
Then the blocked_on->owner chain will go in circles.
Task-A -> Mutex-2
^ |
| v
Mutex-1 <- Task-B
We need two things:
- find_proxy_task() to stop iterating the circle;
- the woken task to 'unblock' and run, such that it can
back-off and re-try the transaction.
Now, the current code [without this patch] does:
__clear_task_blocked_on();
wake_q_add();
And surely clearing ->blocked_on is sufficient to break the
cycle.
Suppose it is Task-B that is made to back-off, then we have:
Task-A -> Mutex-2 -> Task-B (no further blocked_on)
and it would attempt to run Task-B. Or worse, it could directly
pick Task-B and run it, without ever getting into
find_proxy_task().
Now, here is a problem because Task-B might not be runnable on
the CPU it is currently on; and because !task_is_blocked() we
don't get into the proxy paths, so nobody is going to fix this
up.
Ideally we would have dequeued Task-B alongside of clearing
->blocked_on, but alas, [the lock ordering prevents us from
getting the task_rq_lock() and] spoils things."
Thus we need more than just a binary concept of the task being
blocked on a mutex or not.
So allow setting blocked_on to PROXY_WAKING as a special value
which specifies the task is no longer blocked, but needs to
be evaluated for return migration *before* it can be run.
This will then be used in a later patch to handle proxy
return-migration.
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
v15:
* Split blocked_on_state into its own patch later in the
series, as the tri-state isn't necessary until we deal
with proxy/return migrations
v16:
* Handle case where task in the chain is being set as
BO_WAKING by another cpu (usually via ww_mutex die code).
Make sure we release the rq lock so the wakeup can
complete.
* Rework to use guard() in find_proxy_task() as suggested
by Peter
v18:
* Add initialization of blocked_on_state for init_task
v19:
* PREEMPT_RT build fixups and rework suggested by
K Prateek Nayak
v20:
* Simplify one of the blocked_on_state changes to avoid extra
PREEMPT_RT conditionals
v21:
* Slight reworks due to avoiding nested blocked_lock locking
* Be consistent in use of blocked_on_state helper functions
* Rework calls to proxy_deactivate() to do proper locking
around blocked_on_state changes that we were cheating in
previous versions.
* Minor cleanups, some comment improvements
v22:
* Re-order blocked_on_state helpers to try to make it clearer
the set_task_blocked_on() and clear_task_blocked_on() are
the main enter/exit states and the blocked_on_state helpers
help manage the transition states within. Per feedback from
K Prateek Nayak.
* Rework blocked_on_state to be defined within
CONFIG_SCHED_PROXY_EXEC as suggested by K Prateek Nayak.
* Reworked empty stub functions to just take one line as
suggested by K Prateek
* Avoid using gotos out of a guard() scope, as highlighted by
K Prateek, and instead rework logic to break and switch()
on an action value.
v23:
* Big rework to using PROXY_WAKING instead of blocked_on_state
as suggested by Peter.
* Reworked commit message to include Peter's nice diagrams and
example for why this extra state is necessary.
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
include/linux/sched.h | 51 +++++++++++++++++++++++++++++++++++++--
kernel/locking/mutex.c | 2 +-
kernel/locking/ww_mutex.h | 16 ++++++------
kernel/sched/core.c | 17 +++++++++++++
4 files changed, 75 insertions(+), 11 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 16a2951f78b1f..0d6c4c31e3624 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2148,10 +2148,20 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
})
#ifndef CONFIG_PREEMPT_RT
+
+/*
+ * With proxy exec, if a task has been proxy-migrated, it may be a donor
+ * on a cpu that it can't actually run on. Thus we need a special state
+ * to denote that the task is being woken, but that it needs to be
+ * evaluated for return-migration before it is run. So if the task is
+ * blocked_on PROXY_WAKING, return migrate it before running it.
+ */
+#define PROXY_WAKING ((struct mutex *)(-1L))
+
static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
{
lockdep_assert_held_once(&p->blocked_lock);
- return p->blocked_on;
+ return p->blocked_on == PROXY_WAKING ? NULL : p->blocked_on;
}
static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
@@ -2179,7 +2189,7 @@ static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *
* blocked_on relationships, but make sure we are not
* clearing the relationship with a different lock.
*/
- WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m);
+ WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
p->blocked_on = NULL;
}
@@ -2188,6 +2198,35 @@ static inline void clear_task_blocked_on(struct task_struct *p, struct mutex *m)
guard(raw_spinlock_irqsave)(&p->blocked_lock);
__clear_task_blocked_on(p, m);
}
+
+static inline void __set_task_blocked_on_waking(struct task_struct *p, struct mutex *m)
+{
+ /* Currently we serialize blocked_on under the task::blocked_lock */
+ lockdep_assert_held_once(&p->blocked_lock);
+
+ if (!sched_proxy_exec()) {
+ __clear_task_blocked_on(p, m);
+ return;
+ }
+
+ /* Don't set PROXY_WAKING if blocked_on was already cleared */
+ if (!p->blocked_on)
+ return;
+ /*
+ * There may be cases where we set PROXY_WAKING on tasks that were
+ * already set to waking, but make sure we are not changing
+ * the relationship with a different lock.
+ */
+ WARN_ON_ONCE(m && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
+ p->blocked_on = PROXY_WAKING;
+}
+
+static inline void set_task_blocked_on_waking(struct task_struct *p, struct mutex *m)
+{
+ guard(raw_spinlock_irqsave)(&p->blocked_lock);
+ __set_task_blocked_on_waking(p, m);
+}
+
#else
static inline void __clear_task_blocked_on(struct task_struct *p, struct rt_mutex *m)
{
@@ -2196,6 +2235,14 @@ static inline void __clear_task_blocked_on(struct task_struct *p, struct rt_mute
static inline void clear_task_blocked_on(struct task_struct *p, struct rt_mutex *m)
{
}
+
+static inline void __set_task_blocked_on_waking(struct task_struct *p, struct rt_mutex *m)
+{
+}
+
+static inline void set_task_blocked_on_waking(struct task_struct *p, struct rt_mutex *m)
+{
+}
#endif /* !CONFIG_PREEMPT_RT */
static __always_inline bool need_resched(void)
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index c44fc63d4476e..3cb9001d15119 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -967,7 +967,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
next = waiter->task;
debug_mutex_wake_waiter(lock, waiter);
- clear_task_blocked_on(next, lock);
+ set_task_blocked_on_waking(next, lock);
wake_q_add(&wake_q, next);
}
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index e4a81790ea7dd..5cd9dfa4b31e6 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -285,11 +285,11 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
debug_mutex_wake_waiter(lock, waiter);
#endif
/*
- * When waking up the task to die, be sure to clear the
- * blocked_on pointer. Otherwise we can see circular
- * blocked_on relationships that can't resolve.
+ * When waking up the task to die, be sure to set the
+ * blocked_on to PROXY_WAKING. Otherwise we can see
+ * circular blocked_on relationships that can't resolve.
*/
- clear_task_blocked_on(waiter->task, lock);
+ set_task_blocked_on_waking(waiter->task, lock);
wake_q_add(wake_q, waiter->task);
}
@@ -339,15 +339,15 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
*/
if (owner != current) {
/*
- * When waking up the task to wound, be sure to clear the
- * blocked_on pointer. Otherwise we can see circular
- * blocked_on relationships that can't resolve.
+ * When waking up the task to wound, be sure to set the
+ * blocked_on to PROXY_WAKING. Otherwise we can see
+ * circular blocked_on relationships that can't resolve.
*
* NOTE: We pass NULL here instead of lock, because we
* are waking the mutex owner, who may be currently
* blocked on a different mutex.
*/
- clear_task_blocked_on(owner, NULL);
+ set_task_blocked_on_waking(owner, NULL);
wake_q_add(wake_q, owner);
}
return true;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1b6fd173daadd..b8a8495b82525 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4293,6 +4293,13 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
ttwu_queue(p, cpu, wake_flags);
}
out:
+ /*
+ * For now, if we've been woken up, clear the task->blocked_on
+ * regardless if it was set to a mutex or PROXY_WAKING so the
+ * task can run. We will need to be more careful later when
+ * properly handling proxy migration
+ */
+ clear_task_blocked_on(p, NULL);
if (success)
ttwu_stat(p, task_cpu(p), wake_flags);
@@ -6627,6 +6634,11 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
/* Something changed in the chain, so pick again */
if (!mutex)
return NULL;
+
+ /* if its PROXY_WAKING, resched_idle so ttwu can complete */
+ if (mutex == PROXY_WAKING)
+ return proxy_resched_idle(rq);
+
/*
* By taking mutex->wait_lock we hold off concurrent mutex_unlock()
* and ensure @owner sticks around.
@@ -6647,6 +6659,11 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
owner = __mutex_owner(mutex);
if (!owner) {
+ /*
+ * If there is no owner, clear blocked_on
+ * and return p so it can run and try to
+ * acquire the lock
+ */
__clear_task_blocked_on(p, mutex);
return p;
}
--
2.52.0.487.g5c8c507ade-goog
^ permalink raw reply related [flat|nested] 23+ messages in thread
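[Editor's note: for readers unfamiliar with the sentinel-pointer trick used
above, here is a minimal userspace sketch of the idea. The struct and helper
names mirror the patch, but these are illustrative stand-ins, not the kernel
definitions.]

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for the kernel type; not the real definition. */
struct mutex { int dummy; };

/*
 * A sentinel "pointer" value that can never alias a real mutex address.
 * Mirrors the PROXY_WAKING idea: an all-ones pointer marks a task that is
 * being woken but still needs return-migration evaluation before running.
 */
#define PROXY_WAKING ((struct mutex *)(-1L))

struct task { struct mutex *blocked_on; };

/* Accessor that hides the sentinel from readers, like __get_task_blocked_on(). */
static struct mutex *get_task_blocked_on(struct task *p)
{
	return p->blocked_on == PROXY_WAKING ? NULL : p->blocked_on;
}

/* Mark a blocked task as waking, like __set_task_blocked_on_waking(). */
static void set_task_blocked_on_waking(struct task *p)
{
	if (!p->blocked_on)	/* already cleared: nothing to do */
		return;
	p->blocked_on = PROXY_WAKING;
}
```

Note the accessor returns NULL for PROXY_WAKING, so existing callers that
only care whether a real mutex is blocking the task need no changes.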
* [PATCH v24 04/11] sched: Add assert_balance_callbacks_empty helper
2025-11-24 22:30 [PATCH v24 00/11] Donor Migration for Proxy Execution (v24) John Stultz
` (2 preceding siblings ...)
2025-11-24 22:30 ` [PATCH v24 03/11] sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration John Stultz
@ 2025-11-24 22:30 ` John Stultz
2025-11-24 22:30 ` [PATCH v24 05/11] sched: Add logic to zap balance callbacks if we pick again John Stultz
` (6 subsequent siblings)
10 siblings, 0 replies; 23+ messages in thread
From: John Stultz @ 2025-11-24 22:30 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Peter Zijlstra, K Prateek Nayak, Joel Fernandes,
Qais Yousef, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall,
Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team
With proxy-exec utilizing pick-again logic, we can end up with
balance callbacks set by the previous pick_next_task() call left
on the list.
So pull the warning out into a helper function, and make sure we
check it when we pick again.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
v24:
* Use IS_ENABLED() as suggested by K Prateek
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
kernel/sched/core.c | 1 +
kernel/sched/sched.h | 9 ++++++++-
2 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b8a8495b82525..cfe71c2764558 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6896,6 +6896,7 @@ static void __sched notrace __schedule(int sched_mode)
}
pick_again:
+ assert_balance_callbacks_empty(rq);
next = pick_next_task(rq, rq->donor, &rf);
rq_set_donor(rq, next);
if (unlikely(task_is_blocked(next))) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index adfb6e3409d72..a0de4f00edd61 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1780,6 +1780,13 @@ static inline void scx_rq_clock_update(struct rq *rq, u64 clock) {}
static inline void scx_rq_clock_invalidate(struct rq *rq) {}
#endif /* !CONFIG_SCHED_CLASS_EXT */
+static inline void assert_balance_callbacks_empty(struct rq *rq)
+{
+ WARN_ON_ONCE(IS_ENABLED(CONFIG_PROVE_LOCKING) &&
+ rq->balance_callback &&
+ rq->balance_callback != &balance_push_callback);
+}
+
/*
* Lockdep annotation that avoids accidental unlocks; it's like a
* sticky/continuous lockdep_assert_held().
@@ -1796,7 +1803,7 @@ static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
rf->clock_update_flags = 0;
- WARN_ON_ONCE(rq->balance_callback && rq->balance_callback != &balance_push_callback);
+ assert_balance_callbacks_empty(rq);
}
static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
--
2.52.0.487.g5c8c507ade-goog
^ permalink raw reply related [flat|nested] 23+ messages in thread
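[Editor's note: the helper above combines a compile-time config gate with a
runtime check. A minimal userspace sketch of that pattern follows; the macro
and symbol names are hypothetical stand-ins for CONFIG_PROVE_LOCKING,
WARN_ON_ONCE(), and balance_push_callback.]

```c
#include <stddef.h>

#define MY_PROVE_LOCKING 1	/* stand-in for CONFIG_PROVE_LOCKING */

static int warn_count;

/* Stand-in for WARN_ON_ONCE(): count instead of printing a warning. */
#define WARN_ON_ONCE(cond) do { if (cond) warn_count++; } while (0)

struct cb { int dummy; };
static struct cb push_cb;	/* stand-in for balance_push_callback */

/*
 * Like assert_balance_callbacks_empty(): only warn when the debug config
 * is enabled, and tolerate the hotplug balance_push sentinel, which is
 * legitimately left installed on a dying CPU.
 */
static void assert_callbacks_empty(struct cb *head)
{
	WARN_ON_ONCE(MY_PROVE_LOCKING && head && head != &push_cb);
}
```

When the config macro is 0, the compiler can drop the whole condition, which
is why the kernel prefers IS_ENABLED() inside the expression over #ifdef.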
* [PATCH v24 05/11] sched: Add logic to zap balance callbacks if we pick again
2025-11-24 22:30 [PATCH v24 00/11] Donor Migration for Proxy Execution (v24) John Stultz
` (3 preceding siblings ...)
2025-11-24 22:30 ` [PATCH v24 04/11] sched: Add assert_balance_callbacks_empty helper John Stultz
@ 2025-11-24 22:30 ` John Stultz
2025-11-24 22:30 ` [PATCH v24 06/11] sched: Handle blocked-waiter migration (and return migration) John Stultz
` (5 subsequent siblings)
10 siblings, 0 replies; 23+ messages in thread
From: John Stultz @ 2025-11-24 22:30 UTC (permalink / raw)
To: LKML
Cc: John Stultz, K Prateek Nayak, Joel Fernandes, Qais Yousef,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall,
Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team
With proxy-exec, a task is selected to run via pick_next_task(),
and then if it is a mutex blocked task, we call find_proxy_task()
to find a runnable owner. If the runnable owner is on another
cpu, we will need to migrate the selected donor task away, after
which the pick_again path can call pick_next_task() to choose
something else.
However, in the first call to pick_next_task(), we may have
had a balance_callback setup by the class scheduler. After we
pick again, it's possible pick_next_task_fair() will be called
which calls sched_balance_newidle() and sched_balance_rq().
This will throw a warning:
[ 8.796467] rq->balance_callback && rq->balance_callback != &balance_push_callback
[ 8.796467] WARNING: CPU: 32 PID: 458 at kernel/sched/sched.h:1750 sched_balance_rq+0xe92/0x1250
...
[ 8.796467] Call Trace:
[ 8.796467] <TASK>
[ 8.796467] ? __warn.cold+0xb2/0x14e
[ 8.796467] ? sched_balance_rq+0xe92/0x1250
[ 8.796467] ? report_bug+0x107/0x1a0
[ 8.796467] ? handle_bug+0x54/0x90
[ 8.796467] ? exc_invalid_op+0x17/0x70
[ 8.796467] ? asm_exc_invalid_op+0x1a/0x20
[ 8.796467] ? sched_balance_rq+0xe92/0x1250
[ 8.796467] sched_balance_newidle+0x295/0x820
[ 8.796467] pick_next_task_fair+0x51/0x3f0
[ 8.796467] __schedule+0x23a/0x14b0
[ 8.796467] ? lock_release+0x16d/0x2e0
[ 8.796467] schedule+0x3d/0x150
[ 8.796467] worker_thread+0xb5/0x350
[ 8.796467] ? __pfx_worker_thread+0x10/0x10
[ 8.796467] kthread+0xee/0x120
[ 8.796467] ? __pfx_kthread+0x10/0x10
[ 8.796467] ret_from_fork+0x31/0x50
[ 8.796467] ? __pfx_kthread+0x10/0x10
[ 8.796467] ret_from_fork_asm+0x1a/0x30
[ 8.796467] </TASK>
This is because if a RT task was originally picked, it will
setup the rq->balance_callback with push_rt_tasks() via
set_next_task_rt().
Once the task is migrated away and we pick again, we haven't
processed any balance callbacks, so rq->balance_callback is not
in the same state as it was the first time pick_next_task was
called.
To handle this, add a zap_balance_callbacks() helper function
which cleans up the balance callbacks without running them. This
should be ok, as we are effectively undoing the state set in
the first call to pick_next_task(), and when we pick again,
the new callback can be configured for the donor task actually
selected.
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
v20:
* Tweaked to avoid build issues with different configs
v22:
* Spelling fix suggested by K Prateek
* Collapsed the stub implementation to one line as suggested
by K Prateek
* Zap callbacks when we resched idle, as suggested by K Prateek
v24:
* Don't conditionalize function on CONFIG_SCHED_PROXY_EXEC as
the callers will be optimized out if that is unset, and the
dead function will be removed, as suggested by K Prateek
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
kernel/sched/core.c | 37 +++++++++++++++++++++++++++++++++++--
1 file changed, 35 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cfe71c2764558..8a3f9f63916dd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4970,6 +4970,34 @@ static inline void finish_task(struct task_struct *prev)
smp_store_release(&prev->on_cpu, 0);
}
+/*
+ * Only called from __schedule context
+ *
+ * There are some cases where we are going to re-do the action
+ * that added the balance callbacks. We may not be in a state
+ * where we can run them, so just zap them so they can be
+ * properly re-added the next time around. This is handled
+ * similarly to running the callbacks, except we simply don't
+ * call them.
+ */
+static void zap_balance_callbacks(struct rq *rq)
+{
+ struct balance_callback *next, *head;
+ bool found = false;
+
+ lockdep_assert_rq_held(rq);
+
+ head = rq->balance_callback;
+ while (head) {
+ if (head == &balance_push_callback)
+ found = true;
+ next = head->next;
+ head->next = NULL;
+ head = next;
+ }
+ rq->balance_callback = found ? &balance_push_callback : NULL;
+}
+
static void do_balance_callbacks(struct rq *rq, struct balance_callback *head)
{
void (*func)(struct rq *rq);
@@ -6901,10 +6929,15 @@ static void __sched notrace __schedule(int sched_mode)
rq_set_donor(rq, next);
if (unlikely(task_is_blocked(next))) {
next = find_proxy_task(rq, next, &rf);
- if (!next)
+ if (!next) {
+ /* zap the balance_callbacks before picking again */
+ zap_balance_callbacks(rq);
goto pick_again;
- if (next == rq->idle)
+ }
+ if (next == rq->idle) {
+ zap_balance_callbacks(rq);
goto keep_resched;
+ }
}
picked:
clear_tsk_need_resched(prev);
--
2.52.0.487.g5c8c507ade-goog
^ permalink raw reply related [flat|nested] 23+ messages in thread
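[Editor's note: the zap logic walks the singly linked callback list,
severing each node without invoking it, while keeping the hotplug sentinel
installed if it was queued. A minimal userspace model of that walk, with
stand-in types for the kernel's balance_callback and balance_push_callback:]

```c
#include <stddef.h>

/* Minimal userspace model of the rq->balance_callback list. */
struct balance_callback { struct balance_callback *next; };

/* Stand-in for the kernel's balance_push_callback sentinel. */
static struct balance_callback balance_push_callback;

/*
 * Clear each ->next pointer without invoking the callbacks, but keep the
 * balance_push sentinel installed if it was queued -- mirroring what
 * zap_balance_callbacks() does before the rq lock is dropped.
 */
static struct balance_callback *zap_callbacks(struct balance_callback *head)
{
	struct balance_callback *next;
	int found = 0;

	while (head) {
		if (head == &balance_push_callback)
			found = 1;
		next = head->next;
		head->next = NULL;	/* node can now be re-queued safely */
		head = next;
	}
	return found ? &balance_push_callback : NULL;
}
```

Clearing ->next matters because queue_balance_callback() treats a non-NULL
->next as "already queued"; a stale pointer would block re-adding the
callback on the next pick.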
* [PATCH v24 06/11] sched: Handle blocked-waiter migration (and return migration)
2025-11-24 22:30 [PATCH v24 00/11] Donor Migration for Proxy Execution (v24) John Stultz
` (4 preceding siblings ...)
2025-11-24 22:30 ` [PATCH v24 05/11] sched: Add logic to zap balance callbacks if we pick again John Stultz
@ 2025-11-24 22:30 ` John Stultz
2025-12-30 5:33 ` K Prateek Nayak
2025-11-24 22:30 ` [PATCH v24 07/11] sched: Rework pick_next_task() and prev_balance() to avoid stale prev references John Stultz
` (4 subsequent siblings)
10 siblings, 1 reply; 23+ messages in thread
From: John Stultz @ 2025-11-24 22:30 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
hupu, kernel-team
Add logic to handle migrating a blocked waiter to a remote
cpu where the lock owner is runnable.
Additionally, as the blocked task may not be able to run
on the remote cpu, add logic to handle return migration once
the waiting task is given the mutex.
Because tasks may get migrated to where they cannot run, also
modify the scheduling classes to avoid sched class migrations on
mutex blocked tasks, leaving find_proxy_task() and related logic
to do the migrations and return migrations.
This was split out from the larger proxy patch, and
significantly reworked.
Credits for the original patch go to:
Peter Zijlstra (Intel) <peterz@infradead.org>
Juri Lelli <juri.lelli@redhat.com>
Valentin Schneider <valentin.schneider@arm.com>
Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
v6:
* Integrated sched_proxy_exec() check in proxy_return_migration()
* Minor cleanups to diff
* Unpin the rq before calling __balance_callbacks()
* Tweak proxy migrate to migrate deeper task in chain, to avoid
tasks ping-ponging between rqs
v7:
* Fixup for unused function arguments
* Switch from that_rq -> target_rq, other minor tweaks, and typo
fixes suggested by Metin Kaya
* Switch back to doing return migration in the ttwu path, which
avoids nasty lock juggling and performance issues
* Fixes for UP builds
v8:
* More simplifications from Metin Kaya
* Fixes for null owner case, including doing return migration
* Cleanup proxy_needs_return logic
v9:
* Narrow logic in ttwu that sets BO_RUNNABLE, to avoid missed
return migrations
* Switch to using zap_balance_callbacks rather than running
them when we are dropping rq locks for proxy_migration.
* Drop task_is_blocked check in sched_submit_work as suggested
by Metin (may re-add later if this causes trouble)
* Do return migration when we're not on wake_cpu. This avoids
bad task placement caused by proxy migrations raised by
Xuewen Yan
* Fix to call set_next_task(rq->curr) prior to dropping rq lock
to avoid rq->curr getting migrated before we have actually
switched from it
* Cleanup to re-use proxy_resched_idle() instead of open coding
it in proxy_migrate_task()
* Fix return migration not to use DEQUEUE_SLEEP, so that we
properly see the task as task_on_rq_migrating() after it is
dequeued but before set_task_cpu() has been called on it
* Fix to broaden find_proxy_task() checks to avoid race where
a task is dequeued off the rq due to return migration, but
set_task_cpu() and the enqueue on another rq happened after
we checked task_cpu(owner). This ensures we don't proxy
using a task that is not actually on our runqueue.
* Cleanup to avoid the locked BO_WAKING->BO_RUNNABLE transition
in try_to_wake_up() if proxy execution isn't enabled.
* Cleanup to improve comment in proxy_migrate_task() explaining
the set_next_task(rq->curr) logic
* Cleanup deadline.c change to stylistically match rt.c change
* Numerous cleanups suggested by Metin
v10:
* Drop WARN_ON(task_is_blocked(p)) in ttwu current case
v11:
* Include proxy_set_task_cpu from later in the series to this
change so we can use it, rather than reworking logic later
in the series.
* Fix problem with return migration, where affinity was changed
and wake_cpu was left outside the affinity mask.
* Avoid reading the owner's cpu twice (as it might change in between)
to avoid occasional migration-to-same-cpu edge cases
* Add extra WARN_ON checks for wake_cpu and return migration
edge cases.
* Typo fix from Metin
v13:
* As we set ret, return it, not just NULL (pulling this change
in from later patch)
* Avoid deadlock between try_to_wake_up() and find_proxy_task() when
blocked_on cycle with ww_mutex is trying a mid-chain wakeup.
* Tweaks to use new __set_blocked_on_runnable() helper
* Potential fix for incorrectly updated task->dl_server issues
* Minor comment improvements
* Add logic to handle missed wakeups, in that case doing return
migration from the find_proxy_task() path
* Minor cleanups
v14:
* Improve edge cases where we wouldn't set the task as BO_RUNNABLE
v15:
* Added comment to better describe proxy_needs_return() as suggested
by Qais
* Build fixes for !CONFIG_SMP reported by
Maciej Żenczykowski <maze@google.com>
* Adds fix for re-evaluating proxy_needs_return when
sched_proxy_exec() is disabled, reported and diagnosed by:
kuyo chang <kuyo.chang@mediatek.com>
v16:
* Larger rework of needs_return logic in find_proxy_task, in
order to avoid problems with cpuhotplug
* Rework to use guard() as suggested by Peter
v18:
* Integrate optimization suggested by Suleiman to do the checks
for sleeping owners before checking if the task_cpu is this_cpu,
so that we can avoid needlessly proxy-migrating tasks to only
then dequeue them. Also check if migrating last.
* Improve comments around guard locking
* Include tweak to ttwu_runnable() as suggested by
hupu <hupu.gm@gmail.com>
* Rework the logic releasing the rq->donor reference before letting
go of the rqlock. Just use rq->idle.
* Go back to doing return migration on BO_WAKING owners, as I was
hitting some softlockups caused by running tasks not making
it out of BO_WAKING.
v19:
* Fixed proxy_force_return() logic for !SMP cases
v21:
* Reworked donor deactivation for unhandled sleeping owners
* Commit message tweaks
v22:
* Add comments around zap_balance_callbacks in proxy_migration logic
* Rework logic to avoid gotos out of guard() scopes, and instead
use break and switch() on action value, as suggested by K Prateek
* K Prateek suggested simplifications around putting donor and
setting idle as next task in the migration paths, which I further
simplified to using proxy_resched_idle()
* Comment improvements
* Dropped curr != donor check in pick_next_task_fair() suggested by
K Prateek
v23:
* Rework to use the PROXY_WAKING approach suggested by Peter
* Drop unnecessarily setting wake_cpu when affinity changes
as noticed by Peter
* Split out the ttwu() logic changes into a later separate patch
as suggested by Peter
v24:
* Numerous fixes for rq clock handling, pointed out by K Prateek
* Slight tweak to where put_task() is called suggested by K Prateek
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
kernel/sched/core.c | 232 ++++++++++++++++++++++++++++++++++++++------
1 file changed, 202 insertions(+), 30 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8a3f9f63916dd..4c5493b0ad210 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3682,6 +3682,23 @@ static inline void ttwu_do_wakeup(struct task_struct *p)
trace_sched_wakeup(p);
}
+#ifdef CONFIG_SCHED_PROXY_EXEC
+static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
+{
+ unsigned int wake_cpu;
+
+ /*
+ * Since we are enqueuing a blocked task on a cpu it may
+ * not be able to run on, preserve wake_cpu when we
+ * __set_task_cpu so we can return the task to where it
+ * was previously runnable.
+ */
+ wake_cpu = p->wake_cpu;
+ __set_task_cpu(p, cpu);
+ p->wake_cpu = wake_cpu;
+}
+#endif /* CONFIG_SCHED_PROXY_EXEC */
+
static void
ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
struct rq_flags *rf)
@@ -4293,13 +4310,6 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
ttwu_queue(p, cpu, wake_flags);
}
out:
- /*
- * For now, if we've been woken up, clear the task->blocked_on
- * regardless if it was set to a mutex or PROXY_WAKING so the
- * task can run. We will need to be more careful later when
- * properly handling proxy migration
- */
- clear_task_blocked_on(p, NULL);
if (success)
ttwu_stat(p, task_cpu(p), wake_flags);
@@ -6598,7 +6608,7 @@ static inline struct task_struct *proxy_resched_idle(struct rq *rq)
return rq->idle;
}
-static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
+static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
{
unsigned long state = READ_ONCE(donor->__state);
@@ -6618,17 +6628,146 @@ static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
return try_to_block_task(rq, donor, &state, true);
}
-static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *donor)
+/*
+ * If the blocked-on relationship crosses CPUs, migrate @p to the
+ * owner's CPU.
+ *
+ * This is because we must respect the CPU affinity of execution
+ * contexts (owner) but we can ignore affinity for scheduling
+ * contexts (@p). So we have to move scheduling contexts towards
+ * potential execution contexts.
+ *
+ * Note: The owner can disappear, but simply migrate to @target_cpu
+ * and leave that CPU to sort things out.
+ */
+static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
+ struct task_struct *p, int target_cpu)
{
- if (!__proxy_deactivate(rq, donor)) {
- /*
- * XXX: For now, if deactivation failed, set donor
- * as unblocked, as we aren't doing proxy-migrations
- * yet (more logic will be needed then).
- */
- clear_task_blocked_on(donor, NULL);
+ struct rq *target_rq = cpu_rq(target_cpu);
+
+ lockdep_assert_rq_held(rq);
+
+ /*
+ * Since we're going to drop @rq, we have to put(@rq->donor) first,
+ * otherwise we have a reference that no longer belongs to us.
+ *
+ * Additionally, as we put_prev_task(prev) earlier, it's possible that
+ * prev will migrate away as soon as we drop the rq lock, however we
+ * still have it marked as rq->curr, as we've not yet switched tasks.
+ *
+ * So call proxy_resched_idle() to let go of the references before
+ * we release the lock.
+ */
+ proxy_resched_idle(rq);
+
+ WARN_ON(p == rq->curr);
+
+ deactivate_task(rq, p, DEQUEUE_NOCLOCK);
+ proxy_set_task_cpu(p, target_cpu);
+
+ /*
+ * We have to zap callbacks before unlocking the rq
+ * as another CPU may jump in and call sched_balance_rq
+ * which can trip the warning in rq_pin_lock() if we
+ * leave callbacks set.
+ */
+ zap_balance_callbacks(rq);
+ rq_unpin_lock(rq, rf);
+ raw_spin_rq_unlock(rq);
+
+ raw_spin_rq_lock(target_rq);
+ activate_task(target_rq, p, 0);
+ wakeup_preempt(target_rq, p, 0);
+ raw_spin_rq_unlock(target_rq);
+
+ raw_spin_rq_lock(rq);
+ rq_repin_lock(rq, rf);
+ update_rq_clock(rq);
+}
+
+static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
+ struct task_struct *p)
+{
+ struct rq *this_rq, *target_rq;
+ struct rq_flags this_rf;
+ int cpu, wake_flag = 0;
+
+ lockdep_assert_rq_held(rq);
+ WARN_ON(p == rq->curr);
+
+ get_task_struct(p);
+
+ /*
+ * We have to zap callbacks before unlocking the rq
+ * as another CPU may jump in and call sched_balance_rq
+ * which can trip the warning in rq_pin_lock() if we
+ * leave callbacks set.
+ */
+ zap_balance_callbacks(rq);
+ rq_unpin_lock(rq, rf);
+ raw_spin_rq_unlock(rq);
+
+ /*
+ * We drop the rq lock, and re-grab task_rq_lock to get
+ * the pi_lock (needed for select_task_rq) as well.
+ */
+ this_rq = task_rq_lock(p, &this_rf);
+
+ /*
+ * Since we let go of the rq lock, the task may have been
+ * woken or migrated to another rq before we got the
+ * task_rq_lock. So re-check we're on the same RQ. If
+ * not, the task has already been migrated and that CPU
+ * will handle any further migrations.
+ */
+ if (this_rq != rq)
+ goto err_out;
+
+ /* Similarly, if we've been dequeued, someone else will wake us */
+ if (!task_on_rq_queued(p))
+ goto err_out;
+
+ /*
+ * Since we should only be calling here from __schedule()
+ * -> find_proxy_task(), no one else should have
+ * assigned current out from under us. But check and warn
+ * if we see this, then bail.
+ */
+ if (task_current(this_rq, p) || task_on_cpu(this_rq, p)) {
+ WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d on_cpu: %i\n",
+ __func__, cpu_of(this_rq),
+ p->comm, p->pid, p->on_cpu);
+ goto err_out;
}
- return NULL;
+
+ update_rq_clock(this_rq);
+ proxy_resched_idle(this_rq);
+ deactivate_task(this_rq, p, DEQUEUE_NOCLOCK);
+ cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
+ set_task_cpu(p, cpu);
+ target_rq = cpu_rq(cpu);
+ clear_task_blocked_on(p, NULL);
+ task_rq_unlock(this_rq, p, &this_rf);
+
+ /* Drop this_rq and grab target_rq for activation */
+ raw_spin_rq_lock(target_rq);
+ activate_task(target_rq, p, 0);
+ wakeup_preempt(target_rq, p, 0);
+ put_task_struct(p);
+ raw_spin_rq_unlock(target_rq);
+
+ /* Finally, re-grab the original rq lock and return to pick-again */
+ raw_spin_rq_lock(rq);
+ rq_repin_lock(rq, rf);
+ update_rq_clock(rq);
+ return;
+
+err_out:
+ task_rq_unlock(this_rq, p, &this_rf);
+ put_task_struct(p);
+ raw_spin_rq_lock(rq);
+ rq_repin_lock(rq, rf);
+ update_rq_clock(rq);
}
/*
@@ -6650,11 +6789,13 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *d
static struct task_struct *
find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
{
- enum { FOUND, DEACTIVATE_DONOR } action = FOUND;
+ enum { FOUND, DEACTIVATE_DONOR, MIGRATE, NEEDS_RETURN } action = FOUND;
struct task_struct *owner = NULL;
+ bool curr_in_chain = false;
int this_cpu = cpu_of(rq);
struct task_struct *p;
struct mutex *mutex;
+ int owner_cpu;
/* Follow blocked_on chain. */
for (p = donor; task_is_blocked(p); p = owner) {
@@ -6663,9 +6804,15 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
if (!mutex)
return NULL;
- /* if its PROXY_WAKING, resched_idle so ttwu can complete */
- if (mutex == PROXY_WAKING)
- return proxy_resched_idle(rq);
+ /* if it's PROXY_WAKING, do return migration or run if current */
+ if (mutex == PROXY_WAKING) {
+ if (task_current(rq, p)) {
+ clear_task_blocked_on(p, PROXY_WAKING);
+ return p;
+ }
+ action = NEEDS_RETURN;
+ break;
+ }
/*
* By taking mutex->wait_lock we hold off concurrent mutex_unlock()
@@ -6685,26 +6832,41 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
return NULL;
}
+ if (task_current(rq, p))
+ curr_in_chain = true;
+
owner = __mutex_owner(mutex);
if (!owner) {
/*
- * If there is no owner, clear blocked_on
- * and return p so it can run and try to
- * acquire the lock
+ * If there is no owner, either clear blocked_on
+ * and return p (if it is current and safe to
+ * just run on this rq), or return-migrate the task.
*/
- __clear_task_blocked_on(p, mutex);
- return p;
+ if (task_current(rq, p)) {
+ __clear_task_blocked_on(p, NULL);
+ return p;
+ }
+ action = NEEDS_RETURN;
+ break;
}
if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
/* XXX Don't handle blocked owners/delayed dequeue yet */
+ if (curr_in_chain)
+ return proxy_resched_idle(rq);
action = DEACTIVATE_DONOR;
break;
}
- if (task_cpu(owner) != this_cpu) {
- /* XXX Don't handle migrations yet */
- action = DEACTIVATE_DONOR;
+ owner_cpu = task_cpu(owner);
+ if (owner_cpu != this_cpu) {
+ /*
+ * @owner can disappear, simply migrate to @owner_cpu
+ * and leave that CPU to sort things out.
+ */
+ if (curr_in_chain)
+ return proxy_resched_idle(rq);
+ action = MIGRATE;
break;
}
@@ -6766,7 +6928,17 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
/* Handle actions we need to do outside of the guard() scope */
switch (action) {
case DEACTIVATE_DONOR:
- return proxy_deactivate(rq, donor);
+ if (proxy_deactivate(rq, donor))
+ return NULL;
+ /* If deactivate fails, force return */
+ p = donor;
+ fallthrough;
+ case NEEDS_RETURN:
+ proxy_force_return(rq, rf, p);
+ return NULL;
+ case MIGRATE:
+ proxy_migrate_task(rq, rf, p, owner_cpu);
+ return NULL;
case FOUND:
/* fallthrough */;
}
--
2.52.0.487.g5c8c507ade-goog
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH v24 07/11] sched: Rework pick_next_task() and prev_balance() to avoid stale prev references
2025-11-24 22:30 [PATCH v24 00/11] Donor Migration for Proxy Execution (v24) John Stultz
` (5 preceding siblings ...)
2025-11-24 22:30 ` [PATCH v24 06/11] sched: Handle blocked-waiter migration (and return migration) John Stultz
@ 2025-11-24 22:30 ` John Stultz
2025-11-24 22:31 ` [PATCH v24 08/11] sched: Avoid donor->sched_class->yield_task() null traversal John Stultz
` (3 subsequent siblings)
10 siblings, 0 replies; 23+ messages in thread
From: John Stultz @ 2025-11-24 22:30 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
hupu, kernel-team
Historically, the prev value from __schedule() was the rq->curr.
This prev value is passed down through numerous functions, and
used in the class scheduler implementations. The fact that
prev was on_cpu until the end of __schedule(), meant it was
stable across the rq lock drops that the class->pick_next_task()
and ->balance() implementations often do.
However, with proxy-exec, the prev passed to functions called
by __schedule() is rq->donor, which may not be the same as
rq->curr and may not be on_cpu. This makes the prev value
potentially unstable across rq lock drops.
A recently found issue with proxy-exec is that once we begin
doing return migration from try_to_wake_up(), it's possible we
are waking up the rq->donor. When we do this, proxy_resched_idle()
uses put_prev_set_next_task() to set rq->donor to rq->idle,
allowing the old donor to be return migrated and run.
This however runs into trouble, as on another cpu we might be in
the middle of calling __schedule(). Conceptually the rq lock is
held for the majority of the time, but in calling pick_next_task()
its possible the class->pick_next_task() handler or the
->balance() call may briefly drop the rq lock. This opens a
window for try_to_wake_up() to wake and return migrate the
rq->donor before the class logic reacquires the rq lock.
Unfortunately pick_next_task() and prev_balance() take a prev
argument, to which we pass rq->donor. This prev value can now
become stale and incorrect across a rq lock drop.
So, to correct this, rework the pick_next_task() and
prev_balance() calls so that they do not take a "prev" argument.
Also rework the class ->pick_next_task() and ->balance()
implementations to drop the prev argument; in the cases where it
was used, have the class functions reference rq->donor directly
and avoid saving the value across rq lock drops, so that we
don't end up with stale references.
Signed-off-by: John Stultz <jstultz@google.com>
---
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
kernel/sched/core.c | 37 ++++++++++++++++++-------------------
kernel/sched/deadline.c | 8 +++++++-
kernel/sched/ext.c | 8 ++++++--
kernel/sched/fair.c | 15 ++++++++++-----
kernel/sched/idle.c | 2 +-
kernel/sched/rt.c | 8 +++++++-
kernel/sched/sched.h | 8 ++++----
kernel/sched/stop_task.c | 2 +-
8 files changed, 54 insertions(+), 34 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4c5493b0ad210..fcf64c4db437e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5955,10 +5955,9 @@ static inline void schedule_debug(struct task_struct *prev, bool preempt)
schedstat_inc(this_rq()->sched_count);
}
-static void prev_balance(struct rq *rq, struct task_struct *prev,
- struct rq_flags *rf)
+static void prev_balance(struct rq *rq, struct rq_flags *rf)
{
- const struct sched_class *start_class = prev->sched_class;
+ const struct sched_class *start_class = rq->donor->sched_class;
const struct sched_class *class;
#ifdef CONFIG_SCHED_CLASS_EXT
@@ -5983,7 +5982,7 @@ static void prev_balance(struct rq *rq, struct task_struct *prev,
* a runnable task of @class priority or higher.
*/
for_active_class_range(class, start_class, &idle_sched_class) {
- if (class->balance && class->balance(rq, prev, rf))
+ if (class->balance && class->balance(rq, rf))
break;
}
}
@@ -5992,7 +5991,7 @@ static void prev_balance(struct rq *rq, struct task_struct *prev,
* Pick up the highest-prio task:
*/
static inline struct task_struct *
-__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+__pick_next_task(struct rq *rq, struct rq_flags *rf)
{
const struct sched_class *class;
struct task_struct *p;
@@ -6008,34 +6007,34 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
* higher scheduling class, because otherwise those lose the
* opportunity to pull in more work from other CPUs.
*/
- if (likely(!sched_class_above(prev->sched_class, &fair_sched_class) &&
+ if (likely(!sched_class_above(rq->donor->sched_class, &fair_sched_class) &&
rq->nr_running == rq->cfs.h_nr_queued)) {
- p = pick_next_task_fair(rq, prev, rf);
+ p = pick_next_task_fair(rq, rf);
if (unlikely(p == RETRY_TASK))
goto restart;
/* Assume the next prioritized class is idle_sched_class */
if (!p) {
p = pick_task_idle(rq);
- put_prev_set_next_task(rq, prev, p);
+ put_prev_set_next_task(rq, rq->donor, p);
}
return p;
}
restart:
- prev_balance(rq, prev, rf);
+ prev_balance(rq, rf);
for_each_active_class(class) {
if (class->pick_next_task) {
- p = class->pick_next_task(rq, prev);
+ p = class->pick_next_task(rq);
if (p)
return p;
} else {
p = class->pick_task(rq);
if (p) {
- put_prev_set_next_task(rq, prev, p);
+ put_prev_set_next_task(rq, rq->donor, p);
return p;
}
}
@@ -6084,7 +6083,7 @@ extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_f
static void queue_core_balance(struct rq *rq);
static struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+pick_next_task(struct rq *rq, struct rq_flags *rf)
{
struct task_struct *next, *p, *max = NULL;
const struct cpumask *smt_mask;
@@ -6096,7 +6095,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
bool need_sync;
if (!sched_core_enabled(rq))
- return __pick_next_task(rq, prev, rf);
+ return __pick_next_task(rq, rf);
cpu = cpu_of(rq);
@@ -6109,7 +6108,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
*/
rq->core_pick = NULL;
rq->core_dl_server = NULL;
- return __pick_next_task(rq, prev, rf);
+ return __pick_next_task(rq, rf);
}
/*
@@ -6133,7 +6132,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
goto out_set_next;
}
- prev_balance(rq, prev, rf);
+ prev_balance(rq, rf);
smt_mask = cpu_smt_mask(cpu);
need_sync = !!rq->core->core_cookie;
@@ -6306,7 +6305,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
}
out_set_next:
- put_prev_set_next_task(rq, prev, next);
+ put_prev_set_next_task(rq, rq->donor, next);
if (rq->core->core_forceidle_count && next == rq->idle)
queue_core_balance(rq);
@@ -6528,9 +6527,9 @@ static inline void sched_core_cpu_deactivate(unsigned int cpu) {}
static inline void sched_core_cpu_dying(unsigned int cpu) {}
static struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+pick_next_task(struct rq *rq, struct rq_flags *rf)
{
- return __pick_next_task(rq, prev, rf);
+ return __pick_next_task(rq, rf);
}
#endif /* !CONFIG_SCHED_CORE */
@@ -7097,7 +7096,7 @@ static void __sched notrace __schedule(int sched_mode)
pick_again:
assert_balance_callbacks_empty(rq);
- next = pick_next_task(rq, rq->donor, &rf);
+ next = pick_next_task(rq, &rf);
rq_set_donor(rq, next);
if (unlikely(task_is_blocked(next))) {
next = find_proxy_task(rq, next, &rf);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c4402542ef44f..d86fc3dd0d806 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2268,8 +2268,14 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
resched_curr(rq);
}
-static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
+static int balance_dl(struct rq *rq, struct rq_flags *rf)
{
+ /*
+ * Note, rq->donor may change during rq lock drops,
+ * so don't re-use prev across lock drops
+ */
+ struct task_struct *p = rq->donor;
+
if (!on_dl_rq(&p->dl) && need_pull_dl_task(rq, p)) {
/*
* This is OK, because current is on_cpu, which avoids it being
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 7e0fcfdc06a2d..5c6cb0a3be738 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2153,9 +2153,13 @@ static int balance_one(struct rq *rq, struct task_struct *prev)
return true;
}
-static int balance_scx(struct rq *rq, struct task_struct *prev,
- struct rq_flags *rf)
+static int balance_scx(struct rq *rq, struct rq_flags *rf)
{
+ /*
+ * Note, rq->donor may change during rq lock drops,
+ * so don't re-use prev across lock drops
+ */
+ struct task_struct *prev = rq->donor;
int ret;
rq_unpin_lock(rq, rf);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 328ea325a1d1c..7d2e92a55b164 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8713,7 +8713,7 @@ static void set_cpus_allowed_fair(struct task_struct *p, struct affinity_context
}
static int
-balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+balance_fair(struct rq *rq, struct rq_flags *rf)
{
if (sched_fair_runnable(rq))
return 1;
@@ -8866,13 +8866,18 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first);
struct task_struct *
-pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+pick_next_task_fair(struct rq *rq, struct rq_flags *rf)
{
struct sched_entity *se;
- struct task_struct *p;
+ struct task_struct *p, *prev;
int new_tasks;
again:
+ /*
+ * Re-read rq->donor at the top as it may have
+ * changed across a rq lock drop
+ */
+ prev = rq->donor;
p = pick_task_fair(rq);
if (!p)
goto idle;
@@ -8952,9 +8957,9 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
return NULL;
}
-static struct task_struct *__pick_next_task_fair(struct rq *rq, struct task_struct *prev)
+static struct task_struct *__pick_next_task_fair(struct rq *rq)
{
- return pick_next_task_fair(rq, prev, NULL);
+ return pick_next_task_fair(rq, NULL);
}
static struct task_struct *fair_server_pick_task(struct sched_dl_entity *dl_se)
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index c39b089d4f09b..a7c718c1733ba 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -439,7 +439,7 @@ select_task_rq_idle(struct task_struct *p, int cpu, int flags)
}
static int
-balance_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+balance_idle(struct rq *rq, struct rq_flags *rf)
{
return WARN_ON_ONCE(1);
}
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index fb07dcfc60a24..17cfac1da38b6 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1591,8 +1591,14 @@ static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
resched_curr(rq);
}
-static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
+static int balance_rt(struct rq *rq, struct rq_flags *rf)
{
+ /*
+ * Note, rq->donor may change during rq lock drops,
+ * so don't re-use p across lock drops
+ */
+ struct task_struct *p = rq->donor;
+
if (!on_rt_rq(&p->rt) && need_pull_rt_task(rq, p)) {
/*
* This is OK, because current is on_cpu, which avoids it being
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a0de4f00edd61..424c40bd46e2f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2415,18 +2415,18 @@ struct sched_class {
void (*wakeup_preempt)(struct rq *rq, struct task_struct *p, int flags);
- int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
+ int (*balance)(struct rq *rq, struct rq_flags *rf);
struct task_struct *(*pick_task)(struct rq *rq);
/*
* Optional! When implemented pick_next_task() should be equivalent to:
*
* next = pick_task();
* if (next) {
- * put_prev_task(prev);
+ * put_prev_task(rq->donor);
* set_next_task_first(next);
* }
*/
- struct task_struct *(*pick_next_task)(struct rq *rq, struct task_struct *prev);
+ struct task_struct *(*pick_next_task)(struct rq *rq);
void (*put_prev_task)(struct rq *rq, struct task_struct *p, struct task_struct *next);
void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first);
@@ -2586,7 +2586,7 @@ static inline bool sched_fair_runnable(struct rq *rq)
return rq->cfs.nr_queued > 0;
}
-extern struct task_struct *pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
+extern struct task_struct *pick_next_task_fair(struct rq *rq, struct rq_flags *rf);
extern struct task_struct *pick_task_idle(struct rq *rq);
#define SCA_CHECK 0x01
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 2d4e279f05ee9..73aeb0743aa2e 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -16,7 +16,7 @@ select_task_rq_stop(struct task_struct *p, int cpu, int flags)
}
static int
-balance_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+balance_stop(struct rq *rq, struct rq_flags *rf)
{
return sched_stop_runnable(rq);
}
--
2.52.0.487.g5c8c507ade-goog
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH v24 08/11] sched: Avoid donor->sched_class->yield_task() null traversal
2025-11-24 22:30 [PATCH v24 00/11] Donor Migration for Proxy Execution (v24) John Stultz
` (6 preceding siblings ...)
2025-11-24 22:30 ` [PATCH v24 07/11] sched: Rework pick_next_task() and prev_balance() to avoid stale prev references John Stultz
@ 2025-11-24 22:31 ` John Stultz
2025-12-30 6:01 ` K Prateek Nayak
2025-11-24 22:31 ` [PATCH v24 09/11] sched: Have try_to_wake_up() handle return-migration for PROXY_WAKING case John Stultz
` (2 subsequent siblings)
10 siblings, 1 reply; 23+ messages in thread
From: John Stultz @ 2025-11-24 22:31 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
hupu, kernel-team
With proxy-exec, once we do return migration from ttwu(), if a
task is proxying for a waiting donor and the donor is woken up,
we briefly switch rq->donor to point to idle until we can
re-enter __schedule().
However, if the task that was acting as a proxy calls into
yield() right after the donor is switched to idle, it may
trip a NULL pointer dereference, because the idle task doesn't
have a yield_task() pointer.
So add a conditional to ensure we don't try to call the
yield_task() pointer in that case.
This was only recently found because, prior to commit
127b90315ca07 ("sched/proxy: Yield the donor task"),
do_sched_yield() incorrectly called
current->sched_class->yield_task() instead of using
rq->donor.
Signed-off-by: John Stultz <jstultz@google.com>
---
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
kernel/sched/syscalls.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index bf360a6fbb800..4b2b81437b03b 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -1351,7 +1351,8 @@ static void do_sched_yield(void)
rq = this_rq_lock_irq(&rf);
schedstat_inc(rq->yld_count);
- rq->donor->sched_class->yield_task(rq);
+ if (rq->donor->sched_class->yield_task)
+ rq->donor->sched_class->yield_task(rq);
preempt_disable();
rq_unlock_irq(rq, &rf);
--
2.52.0.487.g5c8c507ade-goog
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH v24 09/11] sched: Have try_to_wake_up() handle return-migration for PROXY_WAKING case
2025-11-24 22:30 [PATCH v24 00/11] Donor Migration for Proxy Execution (v24) John Stultz
` (7 preceding siblings ...)
2025-11-24 22:31 ` [PATCH v24 08/11] sched: Avoid donor->sched_class->yield_task() null traversal John Stultz
@ 2025-11-24 22:31 ` John Stultz
2026-01-01 9:53 ` K Prateek Nayak
2025-11-24 22:31 ` [PATCH v24 10/11] sched: Add blocked_donor link to task for smarter mutex handoffs John Stultz
2025-11-24 22:31 ` [PATCH v24 11/11] sched: Migrate whole chain in proxy_migrate_task() John Stultz
10 siblings, 1 reply; 23+ messages in thread
From: John Stultz @ 2025-11-24 22:31 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
hupu, kernel-team
This patch adds logic so try_to_wake_up() will notice if we are
waking a task where blocked_on == PROXY_WAKING, and if necessary
dequeue the task so the wakeup will naturally return-migrate the
donor task back to a cpu it can run on.
This helps performance, as we do the dequeue and wakeup under
the locks normally taken in try_to_wake_up(), and avoids having
to do proxy_force_return() from __schedule(), which has to
re-take similar locks and then force a pick-again loop.
This was split out from the larger proxy patch, and
significantly reworked.
Credits for the original patch go to:
Peter Zijlstra (Intel) <peterz@infradead.org>
Juri Lelli <juri.lelli@redhat.com>
Valentin Schneider <valentin.schneider@arm.com>
Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
v24:
* Reworked proxy_needs_return() so it's less nested, as suggested
by K Prateek
* Switch to using block_task with DEQUEUE_SPECIAL as suggested
by K Prateek
* Fix edge case to reset wake_cpu if select_task_rq() chooses
the current rq and we skip set_task_cpu()
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
kernel/sched/core.c | 84 +++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 82 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fcf64c4db437e..e4a49c694dcd9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3697,6 +3697,64 @@ static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
__set_task_cpu(p, cpu);
p->wake_cpu = wake_cpu;
}
+
+static bool proxy_task_runnable_but_waking(struct task_struct *p)
+{
+ if (!sched_proxy_exec())
+ return false;
+ return (READ_ONCE(p->__state) == TASK_RUNNING &&
+ READ_ONCE(p->blocked_on) == PROXY_WAKING);
+}
+
+static inline struct task_struct *proxy_resched_idle(struct rq *rq);
+
+/*
+ * Checks to see if task p has been proxy-migrated to another rq
+ * and needs to be returned. If so, we deactivate the task here
+ * so that it can be properly woken up on the p->wake_cpu
+ * (or whichever cpu select_task_rq() picks at the bottom of
+ * try_to_wake_up()).
+ */
+static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
+{
+ if (!sched_proxy_exec())
+ return false;
+
+ guard(raw_spinlock)(&p->blocked_lock);
+
+ /* If task isn't PROXY_WAKING, we don't need to do return migration */
+ if (p->blocked_on != PROXY_WAKING)
+ return false;
+
+ __clear_task_blocked_on(p, PROXY_WAKING);
+
+ /* If already current, don't need to return migrate */
+ if (task_current(rq, p))
+ return false;
+
+ /* If wake_cpu is targeting this cpu, don't bother return migrating */
+ if (p->wake_cpu == cpu_of(rq)) {
+ resched_curr(rq);
+ return false;
+ }
+
+ /* If we're return migrating the rq->donor, switch it out for idle */
+ if (task_current_donor(rq, p))
+ proxy_resched_idle(rq);
+
+ /* (ab)Use DEQUEUE_SPECIAL to ensure task is always blocked here. */
+ block_task(rq, p, DEQUEUE_NOCLOCK | DEQUEUE_SPECIAL);
+ return true;
+}
+#else /* !CONFIG_SCHED_PROXY_EXEC */
+static bool proxy_task_runnable_but_waking(struct task_struct *p)
+{
+ return false;
+}
+static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
+{
+ return false;
+}
#endif /* CONFIG_SCHED_PROXY_EXEC */
static void
@@ -3784,6 +3842,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
update_rq_clock(rq);
if (p->se.sched_delayed)
enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
+ if (proxy_needs_return(rq, p))
+ goto out;
if (!task_on_cpu(rq, p)) {
/*
* When on_rq && !on_cpu the task is preempted, see if
@@ -3794,6 +3854,7 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
ttwu_do_wakeup(p);
ret = 1;
}
+out:
__task_rq_unlock(rq, &rf);
return ret;
@@ -4181,6 +4242,8 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
* it disabling IRQs (this allows not taking ->pi_lock).
*/
WARN_ON_ONCE(p->se.sched_delayed);
+ /* If p is current, we know we can run here, so clear blocked_on */
+ clear_task_blocked_on(p, NULL);
if (!ttwu_state_match(p, state, &success))
goto out;
@@ -4197,8 +4260,15 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
*/
scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
smp_mb__after_spinlock();
- if (!ttwu_state_match(p, state, &success))
- break;
+ if (!ttwu_state_match(p, state, &success)) {
+ /*
+ * If we're already TASK_RUNNING, and PROXY_WAKING
+ * continue on to ttwu_runnable check to force
+ * proxy_needs_return evaluation
+ */
+ if (!proxy_task_runnable_but_waking(p))
+ break;
+ }
trace_sched_waking(p);
@@ -4305,6 +4375,16 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
wake_flags |= WF_MIGRATED;
psi_ttwu_dequeue(p);
set_task_cpu(p, cpu);
+ } else if (cpu != p->wake_cpu) {
+ /*
+ * If we were proxy-migrated to cpu, then
+ * select_task_rq() picks cpu instead of wake_cpu
+ * to return to, we won't call set_task_cpu(),
+ * leaving a stale wake_cpu pointing to where we
+ * proxy-migrated from. So just fixup wake_cpu here
+ * if its not correct
+ */
+ p->wake_cpu = cpu;
}
ttwu_queue(p, cpu, wake_flags);
--
2.52.0.487.g5c8c507ade-goog
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH v24 10/11] sched: Add blocked_donor link to task for smarter mutex handoffs
2025-11-24 22:30 [PATCH v24 00/11] Donor Migration for Proxy Execution (v24) John Stultz
` (8 preceding siblings ...)
2025-11-24 22:31 ` [PATCH v24 09/11] sched: Have try_to_wake_up() handle return-migration for PROXY_WAKING case John Stultz
@ 2025-11-24 22:31 ` John Stultz
2025-11-24 22:31 ` [PATCH v24 11/11] sched: Migrate whole chain in proxy_migrate_task() John Stultz
10 siblings, 0 replies; 23+ messages in thread
From: John Stultz @ 2025-11-24 22:31 UTC (permalink / raw)
To: LKML
Cc: Peter Zijlstra, Juri Lelli, Valentin Schneider,
Connor O'Brien, John Stultz, Joel Fernandes, Qais Yousef,
Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
hupu, kernel-team
From: Peter Zijlstra <peterz@infradead.org>
Add a link to the task this task is proxying for, and use it so
the mutex owner can do an intelligent hand-off of the mutex to
the task on whose behalf the owner is running.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
[jstultz: This patch was split out from larger proxy patch]
Signed-off-by: John Stultz <jstultz@google.com>
---
v5:
* Split out from larger proxy patch
v6:
* Moved proxied value from earlier patch to this one where it
is actually used
* Rework logic to check sched_proxy_exec() instead of using ifdefs
* Moved comment change to this patch where it makes sense
v7:
* Use a more descriptive term than "us" in comments, as suggested
by Metin Kaya.
* Minor typo fixup from Metin Kaya
* Reworked proxied variable to prev_not_proxied to simplify usage
v8:
* Use helper for donor blocked_on_state transition
v9:
* Re-add mutex lock handoff in the unlock path, but only when we
have a blocked donor
* Slight reword of commit message suggested by Metin
v18:
* Add task_init initialization for blocked_donor, suggested by
Suleiman
v23:
* Reworks for PROXY_WAKING approach suggested by PeterZ
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
include/linux/sched.h | 1 +
init/init_task.c | 1 +
kernel/fork.c | 1 +
kernel/locking/mutex.c | 44 +++++++++++++++++++++++++++++++++++++++---
kernel/sched/core.c | 18 +++++++++++++++--
5 files changed, 60 insertions(+), 5 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0d6c4c31e3624..178ed37850470 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1241,6 +1241,7 @@ struct task_struct {
#endif
struct mutex *blocked_on; /* lock we're blocked on */
+ struct task_struct *blocked_donor; /* task that is boosting this task */
raw_spinlock_t blocked_lock;
#ifdef CONFIG_DETECT_HUNG_TASK_BLOCKER
diff --git a/init/init_task.c b/init/init_task.c
index 60477d74546e0..34853a511b4d8 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -177,6 +177,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
.mems_allowed_seq = SEQCNT_SPINLOCK_ZERO(init_task.mems_allowed_seq,
&init_task.alloc_lock),
#endif
+ .blocked_donor = NULL,
#ifdef CONFIG_RT_MUTEXES
.pi_waiters = RB_ROOT_CACHED,
.pi_top_task = NULL,
diff --git a/kernel/fork.c b/kernel/fork.c
index 0697084be202f..0a9a17e25b85d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2136,6 +2136,7 @@ __latent_entropy struct task_struct *copy_process(
lockdep_init_task(p);
p->blocked_on = NULL; /* not blocked yet */
+ p->blocked_donor = NULL; /* nobody is boosting p yet */
#ifdef CONFIG_BCACHE
p->sequential_io = 0;
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 3cb9001d15119..08f438a54f56f 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -926,7 +926,7 @@ EXPORT_SYMBOL_GPL(ww_mutex_lock_interruptible);
*/
static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigned long ip)
{
- struct task_struct *next = NULL;
+ struct task_struct *donor, *next = NULL;
DEFINE_WAKE_Q(wake_q);
unsigned long owner;
unsigned long flags;
@@ -945,6 +945,12 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
MUTEX_WARN_ON(__owner_task(owner) != current);
MUTEX_WARN_ON(owner & MUTEX_FLAG_PICKUP);
+ if (sched_proxy_exec() && current->blocked_donor) {
+ /* force handoff if we have a blocked_donor */
+ owner = MUTEX_FLAG_HANDOFF;
+ break;
+ }
+
if (owner & MUTEX_FLAG_HANDOFF)
break;
@@ -958,7 +964,34 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
raw_spin_lock_irqsave(&lock->wait_lock, flags);
debug_mutex_unlock(lock);
- if (!list_empty(&lock->wait_list)) {
+
+ if (sched_proxy_exec()) {
+ raw_spin_lock(¤t->blocked_lock);
+ /*
+ * If we have a task boosting current, and that task was boosting
+ * current through this lock, hand the lock to that task, as that
+ * is the highest waiter, as selected by the scheduling function.
+ */
+ donor = current->blocked_donor;
+ if (donor) {
+ struct mutex *next_lock;
+
+ raw_spin_lock_nested(&donor->blocked_lock, SINGLE_DEPTH_NESTING);
+ next_lock = __get_task_blocked_on(donor);
+ if (next_lock == lock) {
+ next = donor;
+ __set_task_blocked_on_waking(donor, next_lock);
+ wake_q_add(&wake_q, donor);
+ current->blocked_donor = NULL;
+ }
+ raw_spin_unlock(&donor->blocked_lock);
+ }
+ }
+
+ /*
+ * Failing that, pick any on the wait list.
+ */
+ if (!next && !list_empty(&lock->wait_list)) {
/* get the first entry from the wait-list: */
struct mutex_waiter *waiter =
list_first_entry(&lock->wait_list,
@@ -966,14 +999,19 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
next = waiter->task;
+ raw_spin_lock_nested(&next->blocked_lock, SINGLE_DEPTH_NESTING);
debug_mutex_wake_waiter(lock, waiter);
- set_task_blocked_on_waking(next, lock);
+ __set_task_blocked_on_waking(next, lock);
+ raw_spin_unlock(&next->blocked_lock);
wake_q_add(&wake_q, next);
+
}
if (owner & MUTEX_FLAG_HANDOFF)
__mutex_handoff(lock, next);
+ if (sched_proxy_exec())
+ raw_spin_unlock(¤t->blocked_lock);
raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q);
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e4a49c694dcd9..7f42ec01192dc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6853,7 +6853,17 @@ static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
* Find runnable lock owner to proxy for mutex blocked donor
*
* Follow the blocked-on relation:
- * task->blocked_on -> mutex->owner -> task...
+ *
+ * ,-> task
+ * | | blocked-on
+ * | v
+ * blocked_donor | mutex
+ * | | owner
+ * | v
+ * `-- task
+ *
+ * and set the blocked_donor relation; the latter is used by the mutex
+ * code to find which (blocked) task to hand off to.
*
* Lock order:
*
@@ -7002,6 +7012,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
* rq, therefore holding @rq->lock is sufficient to
* guarantee its existence, as per ttwu_remote().
*/
+ owner->blocked_donor = p;
}
/* Handle actions we need to do outside of the guard() scope */
@@ -7102,6 +7113,7 @@ static void __sched notrace __schedule(int sched_mode)
unsigned long prev_state;
struct rq_flags rf;
struct rq *rq;
+ bool prev_not_proxied;
int cpu;
/* Trace preemptions consistently with task switches */
@@ -7174,10 +7186,12 @@ static void __sched notrace __schedule(int sched_mode)
switch_count = &prev->nvcsw;
}
+ prev_not_proxied = !prev->blocked_donor;
pick_again:
assert_balance_callbacks_empty(rq);
next = pick_next_task(rq, &rf);
rq_set_donor(rq, next);
+ next->blocked_donor = NULL;
if (unlikely(task_is_blocked(next))) {
next = find_proxy_task(rq, next, &rf);
if (!next) {
@@ -7243,7 +7257,7 @@ static void __sched notrace __schedule(int sched_mode)
rq = context_switch(rq, prev, next, &rf);
} else {
/* In case next was already curr but just got blocked_donor */
- if (!task_current_donor(rq, next))
+ if (prev_not_proxied && next->blocked_donor)
proxy_tag_curr(rq, next);
rq_unpin_lock(rq, &rf);
--
2.52.0.487.g5c8c507ade-goog
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH v24 11/11] sched: Migrate whole chain in proxy_migrate_task()
2025-11-24 22:30 [PATCH v24 00/11] Donor Migration for Proxy Execution (v24) John Stultz
` (9 preceding siblings ...)
2025-11-24 22:31 ` [PATCH v24 10/11] sched: Add blocked_donor link to task for smarter mutex handoffs John Stultz
@ 2025-11-24 22:31 ` John Stultz
2025-12-30 6:46 ` K Prateek Nayak
10 siblings, 1 reply; 23+ messages in thread
From: John Stultz @ 2025-11-24 22:31 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
hupu, kernel-team
Instead of migrating one task each time through find_proxy_task(),
we can walk up the blocked_donor pointers and migrate the entire
current chain in one go.
This was broken out of earlier patches and held back while the
series was being stabilized, but I wanted to re-introduce it.
Signed-off-by: John Stultz <jstultz@google.com>
---
v12:
* Earlier this was re-using blocked_node, but I hit
a race with activating blocked entities, and to
avoid it introduced a new migration_node listhead
v18:
* Add init_task initialization of migration_node as suggested
by Suleiman
v22:
* Move migration_node under CONFIG_SCHED_PROXY_EXEC as suggested
by K Prateek
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
include/linux/sched.h | 3 +++
init/init_task.c | 3 +++
kernel/fork.c | 3 +++
kernel/sched/core.c | 24 +++++++++++++++++-------
4 files changed, 26 insertions(+), 7 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 178ed37850470..775cc06f756d0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1243,6 +1243,9 @@ struct task_struct {
struct mutex *blocked_on; /* lock we're blocked on */
struct task_struct *blocked_donor; /* task that is boosting this task */
raw_spinlock_t blocked_lock;
+#ifdef CONFIG_SCHED_PROXY_EXEC
+ struct list_head migration_node;
+#endif
#ifdef CONFIG_DETECT_HUNG_TASK_BLOCKER
/*
diff --git a/init/init_task.c b/init/init_task.c
index 34853a511b4d8..78fb7cb83fa5d 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -178,6 +178,9 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
&init_task.alloc_lock),
#endif
.blocked_donor = NULL,
+#ifdef CONFIG_SCHED_PROXY_EXEC
+ .migration_node = LIST_HEAD_INIT(init_task.migration_node),
+#endif
#ifdef CONFIG_RT_MUTEXES
.pi_waiters = RB_ROOT_CACHED,
.pi_top_task = NULL,
diff --git a/kernel/fork.c b/kernel/fork.c
index 0a9a17e25b85d..a7561480e879e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2137,6 +2137,9 @@ __latent_entropy struct task_struct *copy_process(
p->blocked_on = NULL; /* not blocked yet */
p->blocked_donor = NULL; /* nobody is boosting p yet */
+#ifdef CONFIG_SCHED_PROXY_EXEC
+ INIT_LIST_HEAD(&p->migration_node);
+#endif
#ifdef CONFIG_BCACHE
p->sequential_io = 0;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f42ec01192dc..0c50d154050a3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6723,6 +6723,7 @@ static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
struct task_struct *p, int target_cpu)
{
struct rq *target_rq = cpu_rq(target_cpu);
+ LIST_HEAD(migrate_list);
lockdep_assert_rq_held(rq);
@@ -6739,11 +6740,16 @@ static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
*/
proxy_resched_idle(rq);
- WARN_ON(p == rq->curr);
-
- deactivate_task(rq, p, DEQUEUE_NOCLOCK);
- proxy_set_task_cpu(p, target_cpu);
-
+ for (; p; p = p->blocked_donor) {
+ WARN_ON(p == rq->curr);
+ deactivate_task(rq, p, DEQUEUE_NOCLOCK);
+ proxy_set_task_cpu(p, target_cpu);
+ /*
+ * We can use migration_node to migrate the thing,
+ * because @p was still on the rq.
+ */
+ list_add(&p->migration_node, &migrate_list);
+ }
/*
* We have to zap callbacks before unlocking the rq
* as another CPU may jump in and call sched_balance_rq
@@ -6755,8 +6761,12 @@ static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
raw_spin_rq_unlock(rq);
raw_spin_rq_lock(target_rq);
- activate_task(target_rq, p, 0);
- wakeup_preempt(target_rq, p, 0);
+ while (!list_empty(&migrate_list)) {
+ p = list_first_entry(&migrate_list, struct task_struct, migration_node);
+ list_del_init(&p->migration_node);
+ activate_task(target_rq, p, 0);
+ wakeup_preempt(target_rq, p, 0);
+ }
raw_spin_rq_unlock(target_rq);
raw_spin_rq_lock(rq);
--
2.52.0.487.g5c8c507ade-goog
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH v24 06/11] sched: Handle blocked-waiter migration (and return migration)
2025-11-24 22:30 ` [PATCH v24 06/11] sched: Handle blocked-waiter migration (and return migration) John Stultz
@ 2025-12-30 5:33 ` K Prateek Nayak
2026-03-07 6:48 ` John Stultz
0 siblings, 1 reply; 23+ messages in thread
From: K Prateek Nayak @ 2025-12-30 5:33 UTC (permalink / raw)
To: John Stultz, LKML
Cc: Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
Hello John,
On 11/25/2025 4:00 AM, John Stultz wrote:
> -static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *donor)
> +/*
> + * If the blocked-on relationship crosses CPUs, migrate @p to the
> + * owner's CPU.
> + *
> + * This is because we must respect the CPU affinity of execution
> + * contexts (owner) but we can ignore affinity for scheduling
> + * contexts (@p). So we have to move scheduling contexts towards
> + * potential execution contexts.
> + *
> + * Note: The owner can disappear, but simply migrate to @target_cpu
> + * and leave that CPU to sort things out.
> + */
> +static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
> + struct task_struct *p, int target_cpu)
> {
> - if (!__proxy_deactivate(rq, donor)) {
> - /*
> - * XXX: For now, if deactivation failed, set donor
> - * as unblocked, as we aren't doing proxy-migrations
> - * yet (more logic will be needed then).
> - */
> - clear_task_blocked_on(donor, NULL);
> + struct rq *target_rq = cpu_rq(target_cpu);
> +
> + lockdep_assert_rq_held(rq);
> +
> + /*
> + * Since we're going to drop @rq, we have to put(@rq->donor) first,
> + * otherwise we have a reference that no longer belongs to us.
> + *
> + * Additionally, as we put_prev_task(prev) earlier, it's possible that
> + * prev will migrate away as soon as we drop the rq lock, however we
> + * still have it marked as rq->curr, as we've not yet switched tasks.
> + *
> + * So call proxy_resched_idle() to let go of the references before
> + * we release the lock.
> + */
> + proxy_resched_idle(rq);
> +
> + WARN_ON(p == rq->curr);
> +
> + deactivate_task(rq, p, DEQUEUE_NOCLOCK);
> + proxy_set_task_cpu(p, target_cpu);
> +
> + /*
> + * We have to zap callbacks before unlocking the rq
> + * as another CPU may jump in and call sched_balance_rq
> + * which can trip the warning in rq_pin_lock() if we
> + * leave callbacks set.
> + */
> + zap_balance_callbacks(rq);
> + rq_unpin_lock(rq, rf);
> + raw_spin_rq_unlock(rq);
> +
> + raw_spin_rq_lock(target_rq);
> + activate_task(target_rq, p, 0);
> + wakeup_preempt(target_rq, p, 0);
> + raw_spin_rq_unlock(target_rq);
I'll leave my suggestion for this on Patch 11 but for now ...
> +
> + raw_spin_rq_lock(rq);
> + rq_repin_lock(rq, rf);
> + update_rq_clock(rq);
I'm tempted to suggest extracting this pattern into a guard. We seem
to always do the zap + unpin + unlock somewhere in the middle and
lock + repin + update clock at the end of a function so how about:
(Build and boot tested with a test-ww_mutex run; Open-coded pattern
inspired from the one for perf_ctx_lock in kernel/events/core.c)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4c5493b0ad21..52dfa63327fc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6600,6 +6600,44 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
}
#ifdef CONFIG_SCHED_PROXY_EXEC
+typedef struct {
+ struct rq *rq;
+ struct rq_flags *rf;
+} class_proxy_drop_reacquire_rq_lock_t;
+
+static void __proxy_drop_rq_lock(struct rq *rq, struct rq_flags *rf)
+{
+ /*
+ * We have to zap callbacks before unlocking the rq
+ * as another CPU may jump in and call sched_balance_rq
+ * which can trip the warning in rq_pin_lock() if we
+ * leave callbacks set.
+ */
+ zap_balance_callbacks(rq);
+ rq_unpin_lock(rq, rf);
+ raw_spin_rq_unlock(rq);
+}
+
+static void __proxy_reacquire_rq_lock(struct rq *rq, struct rq_flags *rf)
+{
+ raw_spin_rq_lock(rq);
+ rq_repin_lock(rq, rf);
+ update_rq_clock(rq);
+}
+
+static inline void
+class_proxy_drop_reacquire_rq_lock_destructor(class_proxy_drop_reacquire_rq_lock_t *_T)
+{
+ __proxy_reacquire_rq_lock(_T->rq, _T->rf);
+}
+
+static inline class_proxy_drop_reacquire_rq_lock_t
+class_proxy_drop_reacquire_rq_lock_constructor(struct rq *rq, struct rq_flags *rf)
+{
+ __proxy_drop_rq_lock(rq, rf);
+ return (class_proxy_drop_reacquire_rq_lock_t){ rq, rf };
+}
+
static inline struct task_struct *proxy_resched_idle(struct rq *rq)
{
put_prev_set_next_task(rq, rq->donor, rq->idle);
@@ -6666,23 +6704,15 @@ static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
proxy_set_task_cpu(p, target_cpu);
/*
- * We have to zap callbacks before unlocking the rq
- * as another CPU may jump in and call sched_balance_rq
- * which can trip the warning in rq_pin_lock() if we
- * leave callbacks set.
+ * Zap balance callbacks and drop the rq_lock
+ * until the end of the scope.
*/
- zap_balance_callbacks(rq);
- rq_unpin_lock(rq, rf);
- raw_spin_rq_unlock(rq);
+ guard(proxy_drop_reacquire_rq_lock)(rq, rf);
raw_spin_rq_lock(target_rq);
activate_task(target_rq, p, 0);
wakeup_preempt(target_rq, p, 0);
raw_spin_rq_unlock(target_rq);
-
- raw_spin_rq_lock(rq);
- rq_repin_lock(rq, rf);
- update_rq_clock(rq);
}
static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
---
If the guard obfuscates the flow too much, I'm happy with the current
approach as well but having the guard + a comment on top of its usage is
a lot more soothing on the eye IMHO :-)
It also simplifies proxy_force_return() if my next set of comments aren't
too horrible. I'll leave it to you and Peter to decide which is best.
> +}
> +
> +static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
> + struct task_struct *p)
> +{
> + struct rq *this_rq, *target_rq;
> + struct rq_flags this_rf;
> + int cpu, wake_flag = 0;
Should we pretend this is a wakeup with WF_TTWU? There are some
affine wakeup considerations that select_task_rq_fair() adds with
WF_TTWU. Without it, we'll end up selecting the wake_cpu's LLC to
search for the wakeup target.
> +
> + lockdep_assert_rq_held(rq);
> + WARN_ON(p == rq->curr);
> +
> + get_task_struct(p);
Since we run with IRQs disabled (equivalent of an RCU read-side section), I don't think
we need to grab an extra reference here since the task_struct cannot
disappear from under us and we take the necessary locks to confirm the
task's rq associations so we shouldn't race with anything either.
> +
> + /*
> + * We have to zap callbacks before unlocking the rq
> + * as another CPU may jump in and call sched_balance_rq
> + * which can trip the warning in rq_pin_lock() if we
> + * leave callbacks set.
> + */
> + zap_balance_callbacks(rq);
> + rq_unpin_lock(rq, rf);
> + raw_spin_rq_unlock(rq);
> +
> + /*
> + * We drop the rq lock, and re-grab task_rq_lock to get
> + * the pi_lock (needed for select_task_rq) as well.
> + */
> + this_rq = task_rq_lock(p, &this_rf);
So I'm failing to see why we need to drop the rq_lock, re-grab it via
task_rq_lock() when we are eventually going to do a deactivate_task().
Once we deactivate, wakeup path will stall at ttwu_runnable() since it
cannot grab the __task_rq_lock() with task_on_rq_migrating() so we
should be able to safely move the task around.
Maybe my naive eyes are missing something (and perhaps there are some
details that arrive with sleeping owner stuff) but the following
boots fine and survives a sched-messaging and test-ww_mutex run
without any splat on my machine at this point:
(Includes bits from comments above to give those a spin.)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 52dfa63327fc..2a4e73c807e3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6718,86 +6718,31 @@ static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
struct task_struct *p)
{
- struct rq *this_rq, *target_rq;
- struct rq_flags this_rf;
- int cpu, wake_flag = 0;
+ int target_cpu, wake_flag = WF_TTWU;
+ struct rq *target_rq;
lockdep_assert_rq_held(rq);
WARN_ON(p == rq->curr);
- get_task_struct(p);
-
- /*
- * We have to zap callbacks before unlocking the rq
- * as another CPU may jump in and call sched_balance_rq
- * which can trip the warning in rq_pin_lock() if we
- * leave callbacks set.
- */
- zap_balance_callbacks(rq);
- rq_unpin_lock(rq, rf);
- raw_spin_rq_unlock(rq);
-
- /*
- * We drop the rq lock, and re-grab task_rq_lock to get
- * the pi_lock (needed for select_task_rq) as well.
- */
- this_rq = task_rq_lock(p, &this_rf);
+ proxy_resched_idle(rq);
+ deactivate_task(rq, p, DEQUEUE_NOCLOCK);
- /*
- * Since we let go of the rq lock, the task may have been
- * woken or migrated to another rq before we got the
- * task_rq_lock. So re-check we're on the same RQ. If
- * not, the task has already been migrated and that CPU
- * will handle any further migrations.
- */
- if (this_rq != rq)
- goto err_out;
+ target_cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
+ target_rq = cpu_rq(target_cpu);
+ set_task_cpu(p, target_cpu);
- /* Similarly, if we've been dequeued, someone else will wake us */
- if (!task_on_rq_queued(p))
- goto err_out;
+ clear_task_blocked_on(p, NULL);
/*
- * Since we should only be calling here from __schedule()
- * -> find_proxy_task(), no one else should have
- * assigned current out from under us. But check and warn
- * if we see this, then bail.
+ * Zap balance callbacks and drop the rq_lock
+ * until the end of the scope.
*/
- if (task_current(this_rq, p) || task_on_cpu(this_rq, p)) {
- WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d on_cpu: %i\n",
- __func__, cpu_of(this_rq),
- p->comm, p->pid, p->on_cpu);
- goto err_out;
- }
-
- update_rq_clock(this_rq);
- proxy_resched_idle(this_rq);
- deactivate_task(this_rq, p, DEQUEUE_NOCLOCK);
- cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
- set_task_cpu(p, cpu);
- target_rq = cpu_rq(cpu);
- clear_task_blocked_on(p, NULL);
- task_rq_unlock(this_rq, p, &this_rf);
+ guard(proxy_drop_reacquire_rq_lock)(rq, rf);
- /* Drop this_rq and grab target_rq for activation */
raw_spin_rq_lock(target_rq);
activate_task(target_rq, p, 0);
wakeup_preempt(target_rq, p, 0);
- put_task_struct(p);
raw_spin_rq_unlock(target_rq);
-
- /* Finally, re-grab the original rq lock and return to pick-again */
- raw_spin_rq_lock(rq);
- rq_repin_lock(rq, rf);
- update_rq_clock(rq);
- return;
-
-err_out:
- task_rq_unlock(this_rq, p, &this_rf);
- put_task_struct(p);
- raw_spin_rq_lock(rq);
- rq_repin_lock(rq, rf);
- update_rq_clock(rq);
}
/*
---
Thoughts?
Also, we may want to just move the attach_one_task() helper to sched.h
and use it in both proxy_migrate_task() and proxy_force_return().
> +
> + /*
> + * Since we let go of the rq lock, the task may have been
> + * woken or migrated to another rq before we got the
> + * task_rq_lock. So re-check we're on the same RQ. If
> + * not, the task has already been migrated and that CPU
> + * will handle any further migrations.
> + */
> + if (this_rq != rq)
> + goto err_out;
> +
> + /* Similarly, if we've been dequeued, someone else will wake us */
> + if (!task_on_rq_queued(p))
> + goto err_out;
> +
> + /*
> + * Since we should only be calling here from __schedule()
> + * -> find_proxy_task(), no one else should have
> + * assigned current out from under us. But check and warn
> + * if we see this, then bail.
> + */
> + if (task_current(this_rq, p) || task_on_cpu(this_rq, p)) {
> + WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d on_cpu: %i\n",
> + __func__, cpu_of(this_rq),
> + p->comm, p->pid, p->on_cpu);
> + goto err_out;
> }
> - return NULL;
> +
> + update_rq_clock(this_rq);
> + proxy_resched_idle(this_rq);
> + deactivate_task(this_rq, p, DEQUEUE_NOCLOCK);
> + cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
> + set_task_cpu(p, cpu);
> + target_rq = cpu_rq(cpu);
> + clear_task_blocked_on(p, NULL);
> + task_rq_unlock(this_rq, p, &this_rf);
> +
> + /* Drop this_rq and grab target_rq for activation */
> + raw_spin_rq_lock(target_rq);
> + activate_task(target_rq, p, 0);
> + wakeup_preempt(target_rq, p, 0);
> + put_task_struct(p);
> + raw_spin_rq_unlock(target_rq);
> +
> + /* Finally, re-grab the original rq lock and return to pick-again */
> + raw_spin_rq_lock(rq);
> + rq_repin_lock(rq, rf);
> + update_rq_clock(rq);
> + return;
> +
> +err_out:
> + task_rq_unlock(this_rq, p, &this_rf);
> + put_task_struct(p);
> + raw_spin_rq_lock(rq);
> + rq_repin_lock(rq, rf);
> + update_rq_clock(rq);
> }
>
> /*
> @@ -6650,11 +6789,13 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *d
> static struct task_struct *
> find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> {
> - enum { FOUND, DEACTIVATE_DONOR } action = FOUND;
> + enum { FOUND, DEACTIVATE_DONOR, MIGRATE, NEEDS_RETURN } action = FOUND;
> struct task_struct *owner = NULL;
> + bool curr_in_chain = false;
> int this_cpu = cpu_of(rq);
> struct task_struct *p;
> struct mutex *mutex;
> + int owner_cpu;
>
> /* Follow blocked_on chain. */
> for (p = donor; task_is_blocked(p); p = owner) {
> @@ -6663,9 +6804,15 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> if (!mutex)
> return NULL;
>
> - /* if it's PROXY_WAKING, resched_idle so ttwu can complete */
> - if (mutex == PROXY_WAKING)
> - return proxy_resched_idle(rq);
> + /* if it's PROXY_WAKING, do return migration or run if current */
> + if (mutex == PROXY_WAKING) {
> + if (task_current(rq, p)) {
> + clear_task_blocked_on(p, PROXY_WAKING);
> + return p;
> + }
> + action = NEEDS_RETURN;
> + break;
> + }
>
> /*
> * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
> @@ -6685,26 +6832,41 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> return NULL;
> }
>
> + if (task_current(rq, p))
> + curr_in_chain = true;
> +
> owner = __mutex_owner(mutex);
> if (!owner) {
> /*
> - * If there is no owner, clear blocked_on
> - * and return p so it can run and try to
> - * acquire the lock
> + * If there is no owner, either clear blocked_on
> + * and return p (if it is current and safe to
> + * just run on this rq), or return-migrate the task.
> */
> - __clear_task_blocked_on(p, mutex);
> - return p;
> + if (task_current(rq, p)) {
> + __clear_task_blocked_on(p, NULL);
If my suggestion with proxy_force_return() isn't downright horrible, we
can unconditionally clear the p->blocked_on relation on both the
NEEDS_RETURN paths since we won't drop the rq_lock before deactivate
(jury still out on that).
> + return p;
> + }
> + action = NEEDS_RETURN;
> + break;
> }
>
--
Thanks and Regards,
Prateek
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH v24 08/11] sched: Avoid donor->sched_class->yield_task() null traversal
2025-11-24 22:31 ` [PATCH v24 08/11] sched: Avoid donor->sched_class->yield_task() null traversal John Stultz
@ 2025-12-30 6:01 ` K Prateek Nayak
2025-12-30 9:52 ` K Prateek Nayak
0 siblings, 1 reply; 23+ messages in thread
From: K Prateek Nayak @ 2025-12-30 6:01 UTC (permalink / raw)
To: John Stultz, LKML
Cc: Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
Hello John,
On 11/25/2025 4:01 AM, John Stultz wrote:
> With proxy-exec once we do return migration from ttwu(), if a
> task is proxying for a waiting donor, and the donor is woken up,
> we switch the rq->donor to point to idle briefly until we can
> re-enter __schedule().
>
> However, if a task that was acting as a proxy calls into
> yield() right after the donor is switched to idle, it may
> trip a null pointer traversal, because the idle task doesn't
> have a yield_task() pointer.
I thought that was a transient state that should not be observed by the
running task.
Since NEED_RESCHED is retained, we'll just go though another pass of
__schedule() loop and the task should not observe rq->donor as idle in
do_sched_yield() as only the current task can yield on the local CPU.
Do we have a splat that suggests this happens?
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v24 11/11] sched: Migrate whole chain in proxy_migrate_task()
2025-11-24 22:31 ` [PATCH v24 11/11] sched: Migrate whole chain in proxy_migrate_task() John Stultz
@ 2025-12-30 6:46 ` K Prateek Nayak
2026-03-07 7:07 ` John Stultz
0 siblings, 1 reply; 23+ messages in thread
From: K Prateek Nayak @ 2025-12-30 6:46 UTC (permalink / raw)
To: John Stultz, LKML
Cc: Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
Hello John,
On 11/25/2025 4:01 AM, John Stultz wrote:
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 178ed37850470..775cc06f756d0 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1243,6 +1243,9 @@ struct task_struct {
> struct mutex *blocked_on; /* lock we're blocked on */
> struct task_struct *blocked_donor; /* task that is boosting this task */
> raw_spinlock_t blocked_lock;
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> + struct list_head migration_node;
We can just reuse the "p->se.group_node" here instead of adding the
"migration_node" - it is only used to move tasks during load balancing,
and once we deactivate a task, it can no longer be picked for load
balancing so we should be safe (serialized by the rq_lock).
> +#endif
>
> #ifdef CONFIG_DETECT_HUNG_TASK_BLOCKER
> /*
> diff --git a/init/init_task.c b/init/init_task.c
> index 34853a511b4d8..78fb7cb83fa5d 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -178,6 +178,9 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
> &init_task.alloc_lock),
> #endif
> .blocked_donor = NULL,
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> + .migration_node = LIST_HEAD_INIT(init_task.migration_node),
> +#endif
> #ifdef CONFIG_RT_MUTEXES
> .pi_waiters = RB_ROOT_CACHED,
> .pi_top_task = NULL,
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 0a9a17e25b85d..a7561480e879e 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2137,6 +2137,9 @@ __latent_entropy struct task_struct *copy_process(
>
> p->blocked_on = NULL; /* not blocked yet */
> p->blocked_donor = NULL; /* nobody is boosting p yet */
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> + INIT_LIST_HEAD(&p->migration_node);
> +#endif
>
> #ifdef CONFIG_BCACHE
> p->sequential_io = 0;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 7f42ec01192dc..0c50d154050a3 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6723,6 +6723,7 @@ static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
> struct task_struct *p, int target_cpu)
> {
> struct rq *target_rq = cpu_rq(target_cpu);
> + LIST_HEAD(migrate_list);
>
> lockdep_assert_rq_held(rq);
>
> @@ -6739,11 +6740,16 @@ static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
> */
> proxy_resched_idle(rq);
>
> - WARN_ON(p == rq->curr);
> -
> - deactivate_task(rq, p, DEQUEUE_NOCLOCK);
> - proxy_set_task_cpu(p, target_cpu);
> -
> + for (; p; p = p->blocked_donor) {
> + WARN_ON(p == rq->curr);
> + deactivate_task(rq, p, DEQUEUE_NOCLOCK);
> + proxy_set_task_cpu(p, target_cpu);
> + /*
> + * We can use migration_node to migrate the thing,
> + * because @p was still on the rq.
> + */
> + list_add(&p->migration_node, &migrate_list);
> + }
> /*
> * We have to zap callbacks before unlocking the rq
> * as another CPU may jump in and call sched_balance_rq
> @@ -6755,8 +6761,12 @@ static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
> raw_spin_rq_unlock(rq);
>
> raw_spin_rq_lock(target_rq);
> - activate_task(target_rq, p, 0);
> - wakeup_preempt(target_rq, p, 0);
> + while (!list_empty(&migrate_list)) {
> + p = list_first_entry(&migrate_list, struct task_struct, migration_node);
> + list_del_init(&p->migration_node);
> + activate_task(target_rq, p, 0);
> + wakeup_preempt(target_rq, p, 0);
> + }
> raw_spin_rq_unlock(target_rq);
Here, we can extract the core logic of attach_tasks() into a helper and
reuse it as:
(Included stuff from the suggestion on Patch 6 + the above suggestion;
build + boot tested with test-ww_mutex and sched-messaging)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5a6ac4d13112..700fd08b2392 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6786,7 +6786,7 @@ static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
* We can use migration_node to migrate the thing,
* because @p was still on the rq.
*/
- list_add(&p->migration_node, &migrate_list);
+ list_add(&p->se.group_node, &migrate_list);
}
/*
* Zap balance callbacks and drop the rq_lock
@@ -6794,14 +6794,8 @@ static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
*/
guard(proxy_drop_reacquire_rq_lock)(rq, rf);
- raw_spin_rq_lock(target_rq);
- while (!list_empty(&migrate_list)) {
- p = list_first_entry(&migrate_list, struct task_struct, migration_node);
- list_del_init(&p->migration_node);
- activate_task(target_rq, p, 0);
- wakeup_preempt(target_rq, p, 0);
- }
- raw_spin_rq_unlock(target_rq);
+ /* Move migrate_list to target_rq. */
+ __attach_tasks(target_rq, &migrate_list);
}
static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7d2e92a55b16..f24ca09ded8a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9700,27 +9700,28 @@ static void attach_one_task(struct rq *rq, struct task_struct *p)
rq_unlock(rq, &rf);
}
-/*
- * attach_tasks() -- attaches all tasks detached by detach_tasks() to their
- * new rq.
- */
-static void attach_tasks(struct lb_env *env)
+void __attach_tasks(struct rq *rq, struct list_head *tasks)
{
- struct list_head *tasks = &env->tasks;
- struct task_struct *p;
- struct rq_flags rf;
-
- rq_lock(env->dst_rq, &rf);
- update_rq_clock(env->dst_rq);
+ guard(rq_lock)(rq);
+ update_rq_clock(rq);
while (!list_empty(tasks)) {
+ struct task_struct *p;
+
p = list_first_entry(tasks, struct task_struct, se.group_node);
list_del_init(&p->se.group_node);
- attach_task(env->dst_rq, p);
+ attach_task(rq, p);
}
+}
- rq_unlock(env->dst_rq, &rf);
+/*
+ * attach_tasks() -- attaches all tasks detached by detach_tasks() to their
+ * new rq.
+ */
+static void attach_tasks(struct lb_env *env)
+{
+ __attach_tasks(env->dst_rq, &env->tasks);
}
#ifdef CONFIG_NO_HZ_COMMON
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 424c40bd46e2..43b9ee90c67f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1368,6 +1368,8 @@ static inline void rq_set_donor(struct rq *rq, struct task_struct *t)
}
#endif
+void __attach_tasks(struct rq *rq, struct list_head *tasks);
+
#ifdef CONFIG_SCHED_CORE
static inline struct cpumask *sched_group_span(struct sched_group *sg);
---
Thoughts?
>
> raw_spin_rq_lock(rq);
--
Thanks and Regards,
Prateek
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH v24 08/11] sched: Avoid donor->sched_class->yield_task() null traversal
2025-12-30 6:01 ` K Prateek Nayak
@ 2025-12-30 9:52 ` K Prateek Nayak
2026-01-01 7:04 ` K Prateek Nayak
0 siblings, 1 reply; 23+ messages in thread
From: K Prateek Nayak @ 2025-12-30 9:52 UTC (permalink / raw)
To: John Stultz, LKML
Cc: Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
Hello John,
On 12/30/2025 11:31 AM, K Prateek Nayak wrote:
> On 11/25/2025 4:01 AM, John Stultz wrote:
>> With proxy-exec once we do return migration from ttwu(), if a
>> task is proxying for a waiting donor, and the donor is woken up,
>> we switch the rq->donor to point to idle briefly until we can
>> re-enter __schedule().
>>
>> However, if a task that was acting as a proxy calls into
>> yield() right after the donor is switched to idle, it may
>> trip a null pointer traversal, because the idle task doesn't
>> have a yield_task() pointer.
>
> I thought that was a transient state that should not be observed by the
> running task.
>
> Since NEED_RESCHED is retained, we'll just go through another pass of
> __schedule() loop and the task should not observe rq->donor as idle in
> do_sched_yield() as only the current task can yield on the local CPU.
>
> Do we have a splat that suggests this happens?
So I think I found my answer (at least one of them):
find_proxy_task()
/* case DEACTIVATE_DONOR */
proxy_deactivate(rq, donor)
proxy_resched_idle(rq); /* Switched donor to rq->idle. */
try_to_block_task(rq, donor, &state, true)
if (signal_pending_state(task_state, donor))
WRITE_ONCE(p->__state, TASK_RUNNING)
return false; /* Blocking fails. */
/* If deactivate fails, force return */
p = donor;
return p
next = p; /* Donor is rq->idle. */
This should be illegal and I think we should either force a "pick_again"
if proxy_deactivate() fails (can it get stuck on an infinite loop?) or
we should fix the donor relation before running "p".
We can also push the proxy_resched_idle() into try_to_block_task() and
only do it once we are past all the early returns and if
task_current_donor(rq, p).
... and as I write this I realize we can have this via
proxy_needs_return() so I guess we need this patch after all and
proxy_needs_return() should do resched_curr() for that case too so we
can re-evaluate the donor context on the CPU where we are stealing away
the donor from.
Quick question: Can we avoid that proxy_resched_idle() in
proxy_needs_return() by retaining PROXY_WAKING and letting
proxy_force_return() handle donor's migration too?
(Seems to survive the same set of challenges I've been putting my
suggestions through)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 700fd08b2392..e01a61f1955e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3720,6 +3720,12 @@ static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
if (!sched_proxy_exec())
return false;
+ /* For current, and donor, let proxy_force_return() handle return. */
+ if (task_current(rq, p) || task_current_donor(rq, p)) {
+ resched_curr(rq);
+ return false;
+ }
+
guard(raw_spinlock)(&p->blocked_lock);
/* If task isn't PROXY_WAKING, we don't need to do return migration */
@@ -3728,20 +3734,12 @@ static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
__clear_task_blocked_on(p, PROXY_WAKING);
- /* If already current, don't need to return migrate */
- if (task_current(rq, p))
- return false;
-
/* If wake_cpu is targeting this cpu, don't bother return migrating */
if (p->wake_cpu == cpu_of(rq)) {
resched_curr(rq);
return false;
}
- /* If we're return migrating the rq->donor, switch it out for idle */
- if (task_current_donor(rq, p))
- proxy_resched_idle(rq);
-
/* (ab)Use DEQUEUE_SPECIAL to ensure task is always blocked here. */
block_task(rq, p, DEQUEUE_NOCLOCK | DEQUEUE_SPECIAL);
return true;
---
I think you'll have a note somewhere that says why this is an absolutely
terrible idea :-)
--
Thanks and Regards,
Prateek
“Program testing can be used to show the presence of bugs,
but never to show their absence!” ― Edsger W. Dijkstra
* Re: [PATCH v24 08/11] sched: Avoid donor->sched_class->yield_task() null traversal
2025-12-30 9:52 ` K Prateek Nayak
@ 2026-01-01 7:04 ` K Prateek Nayak
0 siblings, 0 replies; 23+ messages in thread
From: K Prateek Nayak @ 2026-01-01 7:04 UTC (permalink / raw)
To: John Stultz, LKML
Cc: Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
Hello John,
On 12/30/2025 3:22 PM, K Prateek Nayak wrote:
> So I think I found my answer (at least one of them):
>
> find_proxy_task()
> /* case DEACTIVATE_DONOR */
> proxy_deactivate(rq, donor)
> proxy_resched_idle(rq); /* Switched donor to rq->idle. */
> try_to_block_task(rq, donor, &state, true)
> if (signal_pending_state(task_state, donor))
> WRITE_ONCE(p->__state, TASK_RUNNING)
> return false; /* Blocking fails. */
>
> /* If deactivate fails, force return */
> p = donor;
So I was looking at my tree with some modifications on top and that
above scenario cannot happen since "force return" just falls through
to NEEDS_RETURN and that'll migrate the task away and return NULL
forcing a re-pick and donor will be back to normal.
> return p
>
> next = p; /* Donor is rq->idle. */
>
>
> This should be illegal and I think we should either force a "pick_again"
> if proxy_deactivate() fails (can it get stuck on an infinite loop?) or
> we should fix the donor relation before running "p".
>
> We can also push the proxy_resched_idle() into try_to_block_task() and
> only do it once we are past all the early returns and if
> task_current_donor(rq, p).
>
> ... and as I write this I realize we can have this via
> proxy_needs_return() so I guess we need this patch after all and
> proxy_needs_return() should do resched_curr() for that case too so we
> can re-evaluate the donor context on the CPU where we are stealing away
> the donor from.
But this can definitely happen. FWIW I think we can just do:
static inline void proxy_reset_donor(struct rq *rq)
{
put_prev_set_next_task(rq, rq->donor, rq->curr);
rq_set_donor(rq, rq->curr);
resched_curr(rq);
}
... instead of proxy_resched_idle() and we just account the balance time
between stealing the donor and the resched to the "rq->curr" instead of
accounting it to the idle task to stay fair.
--
Thanks and Regards,
Prateek
* Re: [PATCH v24 09/11] sched: Have try_to_wake_up() handle return-migration for PROXY_WAKING case
2025-11-24 22:31 ` [PATCH v24 09/11] sched: Have try_to_wake_up() handle return-migration for PROXY_WAKING case John Stultz
@ 2026-01-01 9:53 ` K Prateek Nayak
2026-03-11 6:45 ` John Stultz
0 siblings, 1 reply; 23+ messages in thread
From: K Prateek Nayak @ 2026-01-01 9:53 UTC (permalink / raw)
To: John Stultz, LKML
Cc: Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
Hello John,
On 11/25/2025 4:01 AM, John Stultz wrote:
> @@ -4197,8 +4260,15 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> */
> scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
> smp_mb__after_spinlock();
> - if (!ttwu_state_match(p, state, &success))
> - break;
> + if (!ttwu_state_match(p, state, &success)) {
> + /*
> + * If we're already TASK_RUNNING, and PROXY_WAKING
> + * continue on to ttwu_runnable check to force
> + * proxy_needs_return evaluation
> + */
> + if (!proxy_task_runnable_but_waking(p))
> + break;
> + }
I don't like the fact that we have to go against the ttwu_state_match()
machinery here or the fact that proxy_deactivate() has to check for
"TASK_RUNNING" as a short-cut to detect wakeups. It makes all these bits
a bit more painful to maintain IMO.
Here is what I've learnt so far:
o Task can go into an interruptible sleep where a signal can cause the
task to give up its attempt to acquire the lock ("err" path in
__mutex_lock_common()) - This means we'll have to re-evaluate the
"p->blocked_on" by running it.
o If we weren't running with PROXY_EXEC, a task that was blocked
(including delayed) on a mutex will definitely receive a wakeup. With
proxy, the only difference is that the wakeup path will now see
"task_on_rq_queued()" and will have to go through ttwu_runnable() under
rq_lock.
o Everything else is handled in __schedule() under the rq_lock again
and all paths that drop the rq_lock will go through a re-pick.
o Once we deactivate a donor (via proxy_deactivate()), the "blocked_on"
relation has no real use since we'll always be PROXY_WAKING when we
come back (or we have received a signal and that blocked_on relation
now needs a re-evaluation anyways)
I'm sure a bunch of these hold up for RWSEM too.
With that in mind, we can actually generalize ttwu_runnable() and
proxy_needs_return() to clear "p->blocked_on", even for !PROXY_WAKING
since a wakeup implies we have to re-evaluate the "p->blocked_on".
We already have clear_task_blocked_on() within the rq_lock critical
section (except for when task is running) and we ensure blocked donors
cannot go TASK_RUNNING with task_is_blocked() != NULL.
This is what I had in mind based on top of commit d424d28ea93
("sched: Migrate whole chain in proxy_migrate_task()") on
"proxy-exec-v24-6.18-rc6" branch:
(lightly tested with sched-messaging; Haven't seen any splats
or hung-task yet but from my experience, it doesn't hold any
water when my suggestion meets the real-world :-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0c50d154050a..cb567c219f04 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3698,15 +3698,14 @@ static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
p->wake_cpu = wake_cpu;
}
-static bool proxy_task_runnable_but_waking(struct task_struct *p)
+static inline void proxy_reset_donor(struct rq *rq)
{
- if (!sched_proxy_exec())
- return false;
- return (READ_ONCE(p->__state) == TASK_RUNNING &&
- READ_ONCE(p->blocked_on) == PROXY_WAKING);
-}
+ WARN_ON_ONCE(rq->donor == rq->curr);
-static inline struct task_struct *proxy_resched_idle(struct rq *rq);
+ put_prev_set_next_task(rq, rq->donor, rq->curr);
+ rq_set_donor(rq, rq->curr);
+ resched_curr(rq);
+}
/*
* Checks to see if task p has been proxy-migrated to another rq
@@ -3720,37 +3719,55 @@ static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
if (!sched_proxy_exec())
return false;
- guard(raw_spinlock)(&p->blocked_lock);
-
- /* If task isn't PROXY_WAKING, we don't need to do return migration */
- if (p->blocked_on != PROXY_WAKING)
+ /*
+ * A preempted task isn't blocked on a mutex, let ttwu_runnable()
+ * handle the rest.
+ *
+ * Since we are under the rq_lock and "p->blocked_on" can only
+ * be transitioned to NULL under the rq_lock for a preempted task, it
+ * is safe to inspect the blocked_on relation outside the blocked_lock.
+ */
+ if (!task_current(rq, p) && !p->blocked_on)
return false;
- __clear_task_blocked_on(p, PROXY_WAKING);
+ guard(raw_spinlock)(&p->blocked_lock);
- /* If already current, don't need to return migrate */
- if (task_current(rq, p))
+ /* Task isn't blocked, let ttwu_runnable() handle the rest. */
+ if (!p->blocked_on)
return false;
+ /*
+ * Blocked task is waking up - either to grab a lock, or to
+ * handle a signal. Clear the blocked_on relation to run the
+ * task. The task will re-establish the blocked_on relation if
+ * it blocks on the lock again.
+ *
+ * A concurrent wakeup from the lock owner will not set
+ * PROXY_WAKING since we are already running. ttwu_state_match()
+ * under "p->pi_lock" will ensure there is no race.
+ */
+ __clear_task_blocked_on(p, NULL);
- /* If wake_cpu is targeting this cpu, don't bother return migrating */
- if (p->wake_cpu == cpu_of(rq)) {
- resched_curr(rq);
+ /* If already current, let ttwu_runnable() handle the rest. */
+ if (task_current(rq, p))
return false;
- }
- /* If we're return migrating the rq->donor, switch it out for idle */
+ /*
+	 * If we're waking up the rq->donor, switch the context
+ * back to rq->curr to account the rest of the runtime to
+ * rq->curr and issue a resched for __schedule() to
+	 * re-evaluate the context as soon as possible.
+ *
+ * Since we are under the rq_lock, it is safe to swap out the
+ * donor.
+ */
if (task_current_donor(rq, p))
- proxy_resched_idle(rq);
+ proxy_reset_donor(rq);
/* (ab)Use DEQUEUE_SPECIAL to ensure task is always blocked here. */
block_task(rq, p, DEQUEUE_NOCLOCK | DEQUEUE_SPECIAL);
return true;
}
#else /* !CONFIG_SCHED_PROXY_EXEC */
-static bool proxy_task_runnable_but_waking(struct task_struct *p)
-{
- return false;
-}
static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
{
return false;
@@ -4260,15 +4277,8 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
*/
scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
smp_mb__after_spinlock();
- if (!ttwu_state_match(p, state, &success)) {
- /*
- * If we're already TASK_RUNNING, and PROXY_WAKING
- * continue on to ttwu_runnable check to force
- * proxy_needs_return evaluation
- */
- if (!proxy_task_runnable_but_waking(p))
- break;
- }
+ if (!ttwu_state_match(p, state, &success))
+ break;
trace_sched_waking(p);
@@ -6634,12 +6644,15 @@ pick_next_task(struct rq *rq, struct rq_flags *rf)
* blocked_on).
*/
static bool try_to_block_task(struct rq *rq, struct task_struct *p,
- unsigned long *task_state_p, bool should_block)
+ unsigned long *task_state_p)
{
unsigned long task_state = *task_state_p;
int flags = DEQUEUE_NOCLOCK;
if (signal_pending_state(task_state, p)) {
+ /* See the next comment for why this is safe. */
+ if (task_is_blocked(p))
+ clear_task_blocked_on(p, NULL);
WRITE_ONCE(p->__state, TASK_RUNNING);
*task_state_p = TASK_RUNNING;
return false;
@@ -6651,8 +6664,23 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
* should_block is false, its likely due to the task being
* blocked on a mutex, and we want to keep it on the runqueue
* to be selectable for proxy-execution.
+ *
+ * Checks against wakeup is safe since we are here with on_rq
+ * and rq_lock held forcing the wakeup on ttwu_runnable()
+ * path and "p->blocked_on" cannot be cleared outside of the
+ * rq_lock when task is preempted.
+ *
+ * XXX: This can be further optimized to block the task on
+ * PROXY_WAKING but that will require grabbing the
+ * "blocked_lock" to inspect "blocked_on" and clearing it iff
+ * PROXY_WAKING.
+ *
+ * If we block with PROXY_WAKING, proxy_needs_return() in
+ * ttwu_runnable() will be skipped and we'll hit TASK_RUNNING
+ * without clearing "blocked_on" which will make
+ * proxy_deactivate() in proxy_wake_up_donor() unhappy :-(
*/
- if (!should_block)
+ if (task_is_blocked(p))
return false;
p->sched_contributes_to_load =
@@ -6690,10 +6718,23 @@ static inline struct task_struct *proxy_resched_idle(struct rq *rq)
static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
{
unsigned long state = READ_ONCE(donor->__state);
+ bool ret;
- /* Don't deactivate if the state has been changed to TASK_RUNNING */
- if (state == TASK_RUNNING)
+ /*
+ * We should **NEVER** be TASK_RUNNING! TASK_RUNNING implies a
+ * blocked doner got a wakeup without clearing the "blocked_on"
+	 * blocked donor got a wakeup without clearing the "blocked_on"
+	 * relation and has taken a path other than ttwu_runnable().
+ * Warn to catch such cases since this can affect
+ * proxy_wake_up_donor() where the wake_up_process() might not
+ * be able to serialize wakeups using ttwu state matching.
+ *
+ * Fundamentally the task is runnable so we can go ahead and
+ * indicate the blocking failed and just run the task.
+ */
+ if (WARN_ON_ONCE(state == TASK_RUNNING))
return false;
+
/*
* Because we got donor from pick_next_task(), it is *crucial*
* that we call proxy_resched_idle() before we deactivate it.
@@ -6704,7 +6745,20 @@ static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
* need to be changed from next *before* we deactivate.
*/
proxy_resched_idle(rq);
- return try_to_block_task(rq, donor, &state, true);
+ ret = try_to_block_task(rq, donor, &state);
+
+ /*
+ * If we fail to block, it must be because of a pending signal
+ * since parallel wakeup needs to grab the rq_lock before
+ * activating us at this point.
+ *
+ * The task_is_blocked() path doesn't modify the state so check
+ * against the state and warn if the blocked_on relation caused
+ * the task to fail blocking.
+ */
+ if (WARN_ON_ONCE(!ret && (state != TASK_RUNNING)))
+ WRITE_ONCE(donor->__state, TASK_RUNNING);
+ return ret;
}
/*
@@ -6774,89 +6828,33 @@ static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
update_rq_clock(rq);
}
-static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
+static bool proxy_wake_up_donor(struct rq *rq, struct rq_flags *rf,
struct task_struct *p)
{
- struct rq *this_rq, *target_rq;
- struct rq_flags this_rf;
- int cpu, wake_flag = 0;
-
+ WARN_ON_ONCE(task_current(rq, p));
lockdep_assert_rq_held(rq);
- WARN_ON(p == rq->curr);
-
- get_task_struct(p);
/*
- * We have to zap callbacks before unlocking the rq
- * as another CPU may jump in and call sched_balance_rq
- * which can trip the warning in rq_pin_lock() if we
- * leave callbacks set.
+ * Task received a signal or was woken up!
+ * It should be safe to run it now.
*/
+	if (!proxy_deactivate(rq, p))
+ return false;
+
zap_balance_callbacks(rq);
rq_unpin_lock(rq, rf);
raw_spin_rq_unlock(rq);
/*
- * We drop the rq lock, and re-grab task_rq_lock to get
- * the pi_lock (needed for select_task_rq) as well.
- */
- this_rq = task_rq_lock(p, &this_rf);
-
- /*
- * Since we let go of the rq lock, the task may have been
- * woken or migrated to another rq before we got the
- * task_rq_lock. So re-check we're on the same RQ. If
- * not, the task has already been migrated and that CPU
- * will handle any futher migrations.
+ * Everything beyond this point is serialized
+ * by the ttwu state machine.
*/
- if (this_rq != rq)
- goto err_out;
-
- /* Similarly, if we've been dequeued, someone else will wake us */
- if (!task_on_rq_queued(p))
- goto err_out;
+ wake_up_process(p);
- /*
- * Since we should only be calling here from __schedule()
- * -> find_proxy_task(), no one else should have
- * assigned current out from under us. But check and warn
- * if we see this, then bail.
- */
- if (task_current(this_rq, p) || task_on_cpu(this_rq, p)) {
- WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d on_cpu: %i\n",
- __func__, cpu_of(this_rq),
- p->comm, p->pid, p->on_cpu);
- goto err_out;
- }
-
- update_rq_clock(this_rq);
- proxy_resched_idle(this_rq);
- deactivate_task(this_rq, p, DEQUEUE_NOCLOCK);
- cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
- set_task_cpu(p, cpu);
- target_rq = cpu_rq(cpu);
- clear_task_blocked_on(p, NULL);
- task_rq_unlock(this_rq, p, &this_rf);
-
- /* Drop this_rq and grab target_rq for activation */
- raw_spin_rq_lock(target_rq);
- activate_task(target_rq, p, 0);
- wakeup_preempt(target_rq, p, 0);
- put_task_struct(p);
- raw_spin_rq_unlock(target_rq);
-
- /* Finally, re-grab the origianl rq lock and return to pick-again */
- raw_spin_rq_lock(rq);
- rq_repin_lock(rq, rf);
- update_rq_clock(rq);
- return;
-
-err_out:
- task_rq_unlock(this_rq, p, &this_rf);
- put_task_struct(p);
raw_spin_rq_lock(rq);
rq_repin_lock(rq, rf);
update_rq_clock(rq);
+ return true;
}
/*
@@ -6888,7 +6886,7 @@ static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
static struct task_struct *
find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
{
- enum { FOUND, DEACTIVATE_DONOR, MIGRATE, NEEDS_RETURN } action = FOUND;
+ enum { FOUND, DEACTIVATE_DONOR, MIGRATE, NEEDS_WAKEUP } action = FOUND;
struct task_struct *owner = NULL;
bool curr_in_chain = false;
int this_cpu = cpu_of(rq);
@@ -6902,14 +6900,29 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
/* Something changed in the chain, so pick again */
if (!mutex)
return NULL;
-
- /* if its PROXY_WAKING, do return migration or run if current */
+ /*
+ * If we see PROXY_WAKING, "p" is expecting a wakeup.
+ * Follow proxy_needs_return() and do a
+ * proxy_deactivate() + wake_up_task() if "p" is not the
+ * current. For current, just clear "blocked_on" and
+ * run the task.
+ *
+ * For the task waking us up, ttwu_state_match() will
+ * either see TASK_RUNNING or we'll be serialized under
+ * "p->pi_lock" (and optionally at ttwu_runnable() if
+ * we drop the rq_lock with "p" still queued on us) to
+ * handle the wakeup.
+ *
+ * Since we are under the rq_lock, ttwu() cannot clear
+ * PROXY_WAKING before we do.
+ */
if (mutex == PROXY_WAKING) {
+ clear_task_blocked_on(p, PROXY_WAKING);
if (task_current(rq, p)) {
- clear_task_blocked_on(p, PROXY_WAKING);
+ WRITE_ONCE(p->__state, TASK_RUNNING);
return p;
}
- action = NEEDS_RETURN;
+ action = NEEDS_WAKEUP;
break;
}
@@ -6937,15 +6950,20 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
owner = __mutex_owner(mutex);
if (!owner) {
/*
- * If there is no owner, either clear blocked_on
- * and return p (if it is current and safe to
- * just run on this rq), or return-migrate the task.
+ * If there is no owner, clear "p->blocked_on"
+ * and run if "p" is the current.
+ *
+ * Else, this is a handoff and we'll soon be
+ * woken up by the previous owner but don't wait
+ * for that and try to do it ourselves via
+ * proxy_wake_up_donor().
*/
+ __clear_task_blocked_on(p, NULL);
if (task_current(rq, p)) {
- __clear_task_blocked_on(p, NULL);
+ WRITE_ONCE(p->__state, TASK_RUNNING);
return p;
}
- action = NEEDS_RETURN;
+ action = NEEDS_WAKEUP;
break;
}
@@ -7028,14 +7046,43 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
/* Handle actions we need to do outside of the guard() scope */
switch (action) {
case DEACTIVATE_DONOR:
+ WARN_ON_ONCE(!task_is_blocked(donor));
+ /*
+ * Since we have to get a wakeup and be queued back
+ * beyond this point, clear the blocked_on relation.
+ *
+	 * XXX: This is probably controversial since the
+ * blocked donors now have their "blocked_on" relation
+ * cleared and this field is also used by
+ * CONFIG_DEBUG_MUTEXES.
+ *
+ * But only debug_mutex_{add,remove}_waiter() cares
+ * about this field when the task is running and those
+ * relations are re-established correctly as soon as the
+ * task resumes execution and before we hit those
+ * checks.
+ */
+ clear_task_blocked_on(donor, NULL);
+
+ /* If donor deactivated successfully, retry pick. */
if (proxy_deactivate(rq, donor))
return NULL;
- /* If deactivate fails, force return */
- p = donor;
- fallthrough;
- case NEEDS_RETURN:
- proxy_force_return(rq, rf, p);
- return NULL;
+
+ /* Donor received a signal or was woken! Just run it. */
+ put_prev_set_next_task(rq, rq->donor, donor);
+ rq_set_donor(rq, donor);
+ return donor;
+ case NEEDS_WAKEUP:
+ /* If p was deactivated and woken successfully, retry pick. */
+ if (proxy_wake_up_donor(rq, rf, p))
+ return NULL;
+ /*
+ * p received a signal or was woken! Restore the donor
+ * context and run the task as proxy.
+ */
+ put_prev_set_next_task(rq, rq->donor, donor);
+ rq_set_donor(rq, donor);
+ return p;
case MIGRATE:
proxy_migrate_task(rq, rf, p, owner_cpu);
return NULL;
@@ -7191,9 +7238,19 @@ static void __sched notrace __schedule(int sched_mode)
* for slection with proxy-exec (without proxy-exec
* task_is_blocked() will always be false).
*/
- try_to_block_task(rq, prev, &prev_state,
- !task_is_blocked(prev));
+ try_to_block_task(rq, prev, &prev_state);
switch_count = &prev->nvcsw;
+ } else if (task_is_blocked(prev)) {
+ /*
+ * We are not blocking anymore! Clear "p->blocked_on"
+ * since something has forced us runnable.
+ *
+ * It is safe to inspect "prev->blocked_on" here without
+ * taking "p->blocked_lock" since blocked_on can only be
+ * set to PROXY_WAKING (!= NULL) when task is preempted
+ * without taking the rq_lock.
+ */
+ clear_task_blocked_on(prev, NULL);
}
prev_not_proxied = !prev->blocked_donor;
---
I guess ignorance is bliss and all this might be a terrible idea but
I like how proxy_wake_up_donor() (aka proxy_needs_return()) ends up
being much simpler now and the whole:
blocked_donor (!TASK_RUNNING && task_is_blocked())
|
v
wakeup (clear_task_blocked_on() + ttwu())
|
v
runnable (TASK_RUNNING && !task_is_blocked())
is easier to reason about IMO and should also help sched-ext since we
now have defined points to notify where the proxy starts and ends for
them to hook onto.
I think I lost the "wake_cpu" check in proxy_needs_return() somewhere
down the line but doing a block + wakeup irrespective seems to be the
right thing since in absence of proxy, !task_current() would have
blocked and received a wakeup anyways.
Sorry for dumping a bunch of radical suggestions during the holidays.
Hope we can flush out a bunch more of proxy execution (if not all of
it) this year :-)
Happy New Year!
--
Thanks and Regards,
Prateek
* Re: [PATCH v24 06/11] sched: Handle blocked-waiter migration (and return migration)
2025-12-30 5:33 ` K Prateek Nayak
@ 2026-03-07 6:48 ` John Stultz
2026-03-10 4:17 ` K Prateek Nayak
0 siblings, 1 reply; 23+ messages in thread
From: John Stultz @ 2026-03-07 6:48 UTC (permalink / raw)
To: K Prateek Nayak
Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
On Mon, Dec 29, 2025 at 9:34 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 11/25/2025 4:00 AM, John Stultz wrote:
> > +
> > + raw_spin_rq_lock(rq);
> > + rq_repin_lock(rq, rf);
> > + update_rq_clock(rq);
>
> I'm tempted to suggest extracting this pattern into a guard. We seem
> to always do the zap + unpin + unlock somewhere in the middle and
> lock + repin + update clock at the end of a function so how about:
>
[snipped guard logic]
>
> If the guard obfuscates the flow too much, I'm happy with the current
> approach as well but having the guard + a comment on top of its usage is
> lot more soothing on the eye IMHO :-)
>
> It also simplifies proxy_force_return() if my next set of comments aren't
> too horrible. I'll leave it to you and Peter to decide which is best.
So first of all, apologies for being so slow to get to this! I very
much appreciate your feedback here, but the first two months of this
year have been busy and I've not had the mental space to get my head
back into the proxy-exec details. I definitely did not intend to
neglect this for so long.
So on your suggestion, I'm torn. It does make the code simpler, but
I've found some guard() usage to be less readable for me, as it can
hide some subtleties.
But I also felt this way when I first ran into scoped locking and I've
started to warm to that.
So maybe if it's ok, I'll leave this for now, but let's consider it
for a future refactor?
> > +static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
> > + struct task_struct *p)
> > +{
> > + struct rq *this_rq, *target_rq;
> > + struct rq_flags this_rf;
> > + int cpu, wake_flag = 0;
>
> Should we pretend this is a wakeup with WF_TTWU? There are some
> affine wakeup considerations that select_task_rq_fair() adds with
> WF_TTWU. Without it, we'll end up selecting the wake_cpu's LLC to
> search for the wakeup target.
Ack
> > +
> > + lockdep_assert_rq_held(rq);
> > + WARN_ON(p == rq->curr);
> > +
> > + get_task_struct(p);
>
> Since we are IRQs disabled (equivalent of RCU read-side), I don't think
> we need to grab an extra reference here since the task_struct cannot
> disappear from under us and we take the necessary locks to confirm the
> task's rq associations so we shouldn't race with anything either.
Ack.
> > +
> > + /*
> > + * We have to zap callbacks before unlocking the rq
> > + * as another CPU may jump in and call sched_balance_rq
> > + * which can trip the warning in rq_pin_lock() if we
> > + * leave callbacks set.
> > + */
> > + zap_balance_callbacks(rq);
> > + rq_unpin_lock(rq, rf);
> > + raw_spin_rq_unlock(rq);
> > +
> > + /*
> > + * We drop the rq lock, and re-grab task_rq_lock to get
> > + * the pi_lock (needed for select_task_rq) as well.
> > + */
> > + this_rq = task_rq_lock(p, &this_rf);
>
> So I'm failing to see why we need to drop the rq_lock, re-grab it via
> task_rq_lock() when we are eventually going to do a deactivate_task().
>
> Once we deactivate, wakeup path will stall at ttwu_runnable() since it
> cannot grab the __task_rq_lock() with task_on_rq_migrating() so we
> should be able to safely move the task around.
>
> Maybe my naive eyes are missing something (and perhaps there are some
> details that arrive with sleeping owner stuff) but the following
> boots fine and survives a sched-messaging and test-ww_mutex run
> without any splat on my machine at this point:
Hum. The reason we need the pi_lock is for select_task_rq(), and since
we hold the rq lock (which is under the pi_lock in the locking order),
we need to drop it to take the pi_lock and rq_lock via task_rq_lock().
I know it's not pretty, but I'm not sure how your testing wouldn't
have blown up on the lockdep_assert_held() in select_task_rq().
Trying a simplified version (without the guard) locally leaves my
console filled with splats:
WARNING: CPU: 20 PID: 0 at kernel/sched/core.c:3746 select_task_rq+0xdc/0x120
But let me know if there's something I'm being daft about.
>
> Also, we may want to just move the attach_one_task() helper to sched.h
> and use it in both proxy_migrate_task() and proxy_force_return().
>
Ah, that looks nice!
Again, thank you so much for the review feedback! I *really*
appreciate the suggestions and am eager to hear more of your thoughts!
thanks
-john
* Re: [PATCH v24 11/11] sched: Migrate whole chain in proxy_migrate_task()
2025-12-30 6:46 ` K Prateek Nayak
@ 2026-03-07 7:07 ` John Stultz
0 siblings, 0 replies; 23+ messages in thread
From: John Stultz @ 2026-03-07 7:07 UTC (permalink / raw)
To: K Prateek Nayak
Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
On Mon, Dec 29, 2025 at 10:46 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 11/25/2025 4:01 AM, John Stultz wrote:
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 178ed37850470..775cc06f756d0 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1243,6 +1243,9 @@ struct task_struct {
> > struct mutex *blocked_on; /* lock we're blocked on */
> > struct task_struct *blocked_donor; /* task that is boosting this task */
> > raw_spinlock_t blocked_lock;
> > +#ifdef CONFIG_SCHED_PROXY_EXEC
> > + struct list_head migration_node;
>
> We can just reuse the "p->se.group_node" here instead of adding the
> "migration_node" - it is only used to move tasks during load balancing,
> and once we deactivate a task, it can no longer be picked for load
> balancing so we should be safe (serialized by the rq_lock).
>
Ah, yes. This along with the rest of your suggestion here is a nice
cleanup! Thank you! I'll get them integrated into the series!
-john
* Re: [PATCH v24 06/11] sched: Handle blocked-waiter migration (and return migration)
2026-03-07 6:48 ` John Stultz
@ 2026-03-10 4:17 ` K Prateek Nayak
0 siblings, 0 replies; 23+ messages in thread
From: K Prateek Nayak @ 2026-03-10 4:17 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
Hello John,
On 3/7/2026 12:18 PM, John Stultz wrote:
> On Mon, Dec 29, 2025 at 9:34 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>> On 11/25/2025 4:00 AM, John Stultz wrote:
>>> +
>>> + raw_spin_rq_lock(rq);
>>> + rq_repin_lock(rq, rf);
>>> + update_rq_clock(rq);
>>
>> I'm tempted to suggest extracting this pattern into a guard. We seem
>> to always do the zap + unpin + unlock somewhere in the middle and
>> lock + repin + update clock at the end of a function so how about:
>>
> [snipped guard logic]
>>
>> If the guard obfuscates the flow too much, I'm happy with the current
>> approach as well but having the guard + a comment on top of its usage is
>> lot more soothing on the eye IMHO :-)
>>
>> It also simplifies proxy_force_return() if my next set of comments aren't
>> too horrible. I'll leave it to you and Peter to decide which is best.
>
> So first of all, apologies for being so slow to get to this! I very
> much appreciate your feedback here, but the first two months of this
> year have been busy and I've not had the mental space to get my head
> back into the proxy-exec details. I definitely did not intend to
> neglect this for so long.
>
> So on your suggestion, I'm torn. It does make the code simpler, but
> I've found some guard() usage to be less readable for me, as it can
> hide some subtleties.
> But I also felt this way when I first ran into scoped locking and I've
> started to warm to that.
> So maybe if it's ok, I'll leave this for now, but let's consider it
> for a future refactor?
Ack! Now that I look at it again, that guard does obscure the flow
a lot for a few lines saved in the diffstat (if at all!) so I too
think this can stay the same.
>>> + /*
>>> + * We drop the rq lock, and re-grab task_rq_lock to get
>>> + * the pi_lock (needed for select_task_rq) as well.
>>> + */
>>> + this_rq = task_rq_lock(p, &this_rf);
>>
>> So I'm failing to see why we need to drop the rq_lock, re-grab it via
>> task_rq_lock() when we are eventually going to do a deactivate_task().
>>
>> Once we deactivate, wakeup path will stall at ttwu_runnable() since it
>> cannot grab the __task_rq_lock() with task_on_rq_migrating() so we
>> should be able to safely move the task around.
>>
>> Maybe my naive eyes are missing something (and perhaps there are some
>> details that arrive with sleeping owner stuff) but the following
>> boots fine and survives a sched-messaging and test-ww_mutex run
>> without any splat on my machine at this point:
>
> Hum. The reason we need the pi_lock is for select_task_rq(), and since
> we hold the rq lock (which is under the pi_lock in the locking order),
> we need to drop it to take the pi_lock and rq_lock via task_rq_lock().
> I know it's not pretty, but I'm not sure how your testing wouldn't
> have blown up on the lockdep_assert_held() in select_task_rq().
>
> Trying a simplified version (without the guard) locally leaves my
> console filled with splats:
> WARNING: CPU: 20 PID: 0 at kernel/sched/core.c:3746 select_task_rq+0xdc/0x120
>
> But let me know if there's something I'm being daft about.
The only one being daft was me, since I forgot the whole pi_lock dependency
for select_task_rq()! I'll start running with lockdep in my testing to
spot these terrible suggestions early. Sorry for the noise and thanks
a ton for testing that!
--
Thanks and Regards,
Prateek
* Re: [PATCH v24 09/11] sched: Have try_to_wake_up() handle return-migration for PROXY_WAKING case
2026-01-01 9:53 ` K Prateek Nayak
@ 2026-03-11 6:45 ` John Stultz
2026-03-11 9:25 ` K Prateek Nayak
0 siblings, 1 reply; 23+ messages in thread
From: John Stultz @ 2026-03-11 6:45 UTC (permalink / raw)
To: K Prateek Nayak
Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
On Thu, Jan 1, 2026 at 1:53 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 11/25/2025 4:01 AM, John Stultz wrote:
> > @@ -4197,8 +4260,15 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> > */
> > scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
> > smp_mb__after_spinlock();
> > - if (!ttwu_state_match(p, state, &success))
> > - break;
> > + if (!ttwu_state_match(p, state, &success)) {
> > + /*
> > + * If we're already TASK_RUNNING, and PROXY_WAKING
> > + * continue on to ttwu_runnable check to force
> > + * proxy_needs_return evaluation
> > + */
> > + if (!proxy_task_runnable_but_waking(p))
> > + break;
> > + }
>
> I don't like the fact that we have to go against the ttwu_state_match()
> machinery here or the fact that proxy_deactivate() has to check for
> "TASK_RUNNING" as a short-cut to detect wakeups. It makes all these bits
> a bit more painful to maintain IMO.
>
> Here is what I've learnt so far:
>
> o Task can go into an interruptible sleep where a signal can cause the
> task to give up its attempt to acquire the lock ("err" path in
> __mutex_lock_common()) - This means we'll have to re-evaluate the
> "p->blocked_on" by running it.
>
> o If we weren't running with the PROXY_EXEC, a task that was blocked
> (including delayed) on a mutex will definitely receive a wakeup. With
> proxy, the only difference is that that wakeup path will now see
> "task_on_rq_queued()" and will have to go through ttwu_runnable() under
> rq_lock.
>
> o Everything else is handled in __schedule() under the rq_lock again
> and all paths that drop the rq_lock will go through a re-pick.
>
> o Once we deactivate a donor (via proxy_deactivate()), the "blocked_on"
> relation has no real use since we'll always be PROXY_WAKING when we
> come back (or we have received a signal and that blocked_on relation
> now needs a re-evaluation anyways)
>
> I'm sure a bunch of these hold up for RWSEM too.
>
> With that in mind, we can actually generalize ttwu_runnable() and
> proxy_needs_return() to clear "p->blocked_on", even for !PROXY_WAKING
> since a wakeup implies we have to re-evaluate the "p->blocked_on".
>
> We already have clear_task_blocked_on() within the rq_lock critical
> section (except for when task is running) and we ensure blocked donors
> cannot go TASK_RUNNING with task_is_blocked() != NULL.
>
>
> This is what I had in mind based on top of commit d424d28ea93
> ("sched: Migrate whole chain in proxy_migrate_task()") on
> "proxy-exec-v24-6.18-rc6" branch:
>
> (lightly tested with sched-messaging; Haven't seen any splats
> or hung-task yet but from my experience, it doesn't hold any
> water when my suggestion meets the real-world :-)
...
>
> I guess ignorance is bliss and all this might be a terrible idea but
> I like how proxy_wake_up_donor() (aka proxy_needs_return()) ends up
> being much simpler now and the whole:
>
> blocked_donor (!TASK_RUNNING && task_is_blocked())
> |
> v
> wakeup (clear_task_blocked_on() + ttwu())
> |
> v
> runnable (TASK_RUNNING && !task_is_blocked())
>
> is easier to reason about IMO and should also help sched-ext since we
> now have defined points to notify where the proxy starts and ends for
> them to hook onto.
>
> I think I lost the "wake_cpu" check in proxy_needs_return() somewhere
> down the line but doing a block + wakeup irrespective seems to be the
> right thing since in absence of proxy, !task_current() would have
> blocked and received a wakeup anyways.
>
> Sorry for dumping a bunch of radical suggestions during the holidays.
> Hope we can flush out a bunch more of proxy execution (if not all of
> it) this year :-)
Apologies, I've been dragging my feet a bit on this email, as it's a
fair amount of change to get my head around. :) As always, I
*really* appreciate your thoughtful feedback.
Running with your patch, I do get quite a few splats from the
assert_balance_callbacks_empty() check (which then later causes a
deadlock from lockdep on the console lock), but that seems to be due
to proxy_needs_return()'s calling proxy_reset_donor() which does a
put_prev_set_next() which might enqueue a balance callback that we
then don't run when we drop the rq lock in ttwu_runnable(). Zapping
the callback seems to resolve this.
From there, I've been trying to separate and understand the different
threads of change and their dependencies so that I can apply them to
the right point in the series, but its subtle and the changes do seem
very inter-dependent.
The main semantic rework of task state and blocked_on is interesting.
I definitely like the idea of simplifying the ttwu paths. But the
change also feels a little messy to me, so I'm not 100% on adopting
it yet, but maybe I need to just get more familiar with it or combine
the logic in some form.
The suggestion on donor wakeup to just set rq->donor to rq->curr
sounds really nice. I've been thinking of doing similar but was
hesitant as the donor change has already been a source of races and
I've been wary to rock the boat, but I'd like to pick this up. However
the put_prev_set_next_task() on something other than idle has numerous
subtle side effects that are giving me grief (seem to be hitting a
case where right as rq->curr is dequeued and instead set as
sched_delayed, we wake the donor from another cpu, tripping warnings
trying to set rq->curr as next when it's sched_delayed).
And I definitely like the reworked proxy_force_return() suggestion,
which is much closer to the wake_up_process() approach I used back in
v22:
https://lore.kernel.org/lkml/20250926032931.27663-5-jstultz@google.com/
But unfortunately my attempts to use a similar method ran into lots of
trouble while I was reworking the logic for v23, which resulted in the
messier logic currently in that function.
So I've had difficulty breaking up and isolating the minimal changeset
to support these simplifications, as they seem to also depend on the
semantic rework around task state and blocked_on that you've made.
Finally, I appreciate your thorough comments, though I'm wary of
adding more spots where we're accessing values without locks. I know
these tricks can be critical for performance, but the scheduler code
is already so terribly subtle with its many exceptions to locking
rules that I fret this makes it even harder to think about (though I'm
maybe overly sensitive here as I've felt particularly dim in my
efforts to rework your patch :).
So I'm a little stuck as I really want to get the next iteration (v25)
of the patch series out to the list, but I don't want to be ignoring
your suggestions here. Since the v25 series has added a number of
earlier fixups, the donor-migration chunk is getting quite long (16
patches currently), so I may just send the early portions (up to
"Handle blocked-waiter migration (and return migration)" - before we
get to this try_to_wake_up() change), and hopefully I can sort
out how to better integrate this feedback for v26 (or whenever the
first chunk is queued).
That sound ok?
Again, very much appreciate your great insights and feedback here!
-john
* Re: [PATCH v24 09/11] sched: Have try_to_wake_up() handle return-migration for PROXY_WAKING case
2026-03-11 6:45 ` John Stultz
@ 2026-03-11 9:25 ` K Prateek Nayak
0 siblings, 0 replies; 23+ messages in thread
From: K Prateek Nayak @ 2026-03-11 9:25 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
Hello John,
On 3/11/2026 12:15 PM, John Stultz wrote:
> On Thu, Jan 1, 2026 at 1:53 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>> On 11/25/2025 4:01 AM, John Stultz wrote:
>>> @@ -4197,8 +4260,15 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>>> */
>>> scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
>>> smp_mb__after_spinlock();
>>> - if (!ttwu_state_match(p, state, &success))
>>> - break;
>>> + if (!ttwu_state_match(p, state, &success)) {
>>> + /*
>>> + * If we're already TASK_RUNNING, and PROXY_WAKING
>>> + * continue on to ttwu_runnable check to force
>>> + * proxy_needs_return evaluation
>>> + */
>>> + if (!proxy_task_runnable_but_waking(p))
>>> + break;
>>> + }
>>
>> I don't like the fact that we have to go against the ttwu_state_match()
>> machinery here or the fact that proxy_deactivate() has to check for
>> "TASK_RUNNING" as a short-cut to detect wakeups. It makes all these bits
>> a bit more painful to maintain IMO.
>>
>> Here is what I've learnt so far:
>>
>> o Task can go into an interruptible sleep where a signal can cause the
>> task to give up its attempt to acquire the lock ("err" path in
>> __mutex_lock_common()) - This means we'll have to re-evaluate the
>> "p->blocked_on" by running it.
>>
>> o If we weren't running with the PROXY_EXEC, a task that was blocked
>> (including delayed) on a mutex will definitely receive a wakeup. With
>> proxy, the only difference is that that wakeup path will now see
>> "task_on_rq_queued()" and will have to go through ttwu_runnable() under
>> rq_lock.
>>
>> o Everything else is handled in __schedule() under the rq_lock again
>> and all paths that drop the rq_lock will go through a re-pick.
>>
>> o Once we deactivate a donor (via proxy_deactivate()), the "blocked_on"
>> relation has no real use since we'll always be PROXY_WAKING when we
>> come back (or we have received a signal and that blocked_on relation
>> now needs a re-evaluation anyways)
>>
>> I'm sure a bunch of these hold up for RWSEM too.
>>
>> With that in mind, we can actually generalize ttwu_runnable() and
>> proxy_needs_return() to clear "p->blocked_on", even for !PROXY_WAKING
>> since a wakeup implies we have to re-evaluate the "p->blocked_on".
>>
>> We already have clear_task_blocked_on() within the rq_lock critical
>> section (except for when task is running) and we ensure blocked donors
>> cannot go TASK_RUNNING with task_is_blocked() != NULL.
>>
>>
>> This is what I had in mind based on top of commit d424d28ea93
>> ("sched: Migrate whole chain in proxy_migrate_task()") on
>> "proxy-exec-v24-6.18-rc6" branch:
>>
>> (lightly tested with sched-messaging; Haven't seen any splats
>> or hung-task yet but from my experience, it doesn't hold any
>> water when my suggestion meets the real-world :-)
> ...
>>
>> I guess ignorance is bliss and all this might be a terrible idea but
>> I like how proxy_wake_up_donor() (aka proxy_needs_return()) ends up
>> being much simpler now and the whole:
>>
>> blocked_donor (!TASK_RUNNING && task_is_blocked())
>> |
>> v
>> wakeup (clear_task_blocked_on() + ttwu())
>> |
>> v
>> runnable (TASK_RUNNING && !task_is_blocked())
>>
>> is easier to reason about IMO and should also help sched-ext since we
>> now have defined points to notify where the proxy starts and ends for
>> them to hook onto.
>>
>> I think I lost the "wake_cpu" check in proxy_needs_return() somewhere
>> down the line but doing a block + wakeup irrespective seems to be the
>> right thing since in absence of proxy, !task_current() would have
>> blocked and received a wakeup anyways.
>>
>> Sorry for dumping a bunch of radical suggestions during the holidays.
>> Hope we can flush out a bunch more of proxy execution (if not all of
>> it) this year :-)
>
> Apologies, I've been dragging my feet a bit on this email, as it's a
> fair amount of change to get my head around. :)
Sorry about that! The timing too was not ideal given the holiday
season! Thank you again for taking a look at it.
> As always, I *really* appreciate your thoughtful feedback.
Likewise :-)
>
> Running with your patch, I do get quite a few splats from the
> assert_balance_callbacks_empty() check (which then later causes a
> deadlock from lockdep on the console lock), but that seems to be due
> to proxy_needs_return()'s calling proxy_reset_donor() which does a
> put_prev_set_next() which might enqueue a balance callback that we
> then don't run when we drop the rq lock in ttwu_runnable(). Zapping
> the callback seems to resolve this.
I clearly didn't test it enough! But zap seems to be the right thing
to do since we are on our way to pick anyways and we'll restore those
callbacks.
>
> From there, I've been trying to separate and understand the different
> threads of change and their dependencies so that I can apply them to
> the right point in the series, but it's subtle and the changes do seem
> very inter-dependent.
Yes, sorry about that! I should have probably given a tree with the
breakdown.
>
> The main semantic rework of task state and blocked_on is interesting.
> I definitely like the idea of simplifying the ttwu paths. But the
> change also feels a little messy to me, so I'm not 100% on adopting
> it yet, but maybe I need to just get more familiar with it or combine
> the logic in some form.
So my main gripe was the position of proxy_task_runnable_but_waking()
and the TASK_RUNNING check in the proxy_deactivate() path - If we fail
ttwu_state_match(), we don't need a wakeup but with proxy we get the
additional TASK_RUNNING + PROXY_WAKING handling which seems out of
place.
TASK_RUNNING + PROXY_WAKING means task is queued on the CPU, and hasn't
blocked yet - running / preempted:
o If it is preempted, find_proxy_task() will realize PROXY_WAKING and
evaluate a NEEDS_RETURN.
o If it is running, it will simply not block but will have PROXY_WAKING
which find_proxy_task() should again handle as it'll either return
curr on seeing PROXY_WAKING or we'll go to the above case of being
preempted with PROXY_WAKING.
Now that I think of it, why exactly did we need it? My memory is a
little hazy on this matter.
>
> The suggestion on donor wakeup to just set rq->donor to rq->curr
> sounds really nice. I've been thinking of doing similar but was
> hesitant as the donor change has already been a source of races and
> I've been wary to rock the boat, but I'd like to pick this up.
Based on my reading, it is all serialized by the rq_lock so the worst
case scenario is the RCU readers seeing the stale rq->donor for a bit
but that happens today anyways so it must be handled.
> However
> the put_prev_set_next_task() on something other than idle has numerous
> subtle side effects that are giving me grief (seem to be hitting a
> case where right as rq->curr is dequeued and instead set as
> sched_delayed, we wake the donor from another cpu, tripping warnings
> trying to set rq->curr as next when it's sched_delayed).
Is it the WARN_ON() in __set_next_task_fair()? If so, it can happen
like this:
CPU0: schedule()                                CPU1: handoff
================                                =============
rq_lock(CPU0)
try_to_block_task(rq->curr) /* Delayed */
donor = pick_next_task()
set_rq_donor(donor) /* Donor has blocked_on */  set_task_blocked_on_waking(donor)
find_proxy_task()                               wake_up_task(donor)
  if (mutex == PROXY_WAKING)                      try_to_wake_up()
    action = NEEDS_RETURN;                          pi_lock(donor)
    break                                           ...
NEEDS_RETURN:                                     ttwu_runnable()
  proxy_force_return(donor)                         rq_lock(rq(CPU0)) ...
    raw_spin_rq_unlock(rq(CPU0))                    ... /* Gets the rq_lock */
    pi_lock(donor) ...                              proxy_needs_return()
                                                      proxy_reset_donor(rq(CPU0)->curr)
                                                        set_next_task_fair(curr)
                                                        !!! Woops! rq->curr is delayed! !!!
We set the donor before the proxy pick but the curr is only
modified later once we finalize the task so it can hit this (or I'm
missing something as usual).
I think we are missing a proxy_resched_idle() in the find_proxy_task()
loop when we plan to return migrate the donor before dropping the rq
lock. I see a proxy_resched_idle() in proxy_force_return() but this
is post task_rq_lock() which is too late.
>
> And I definitely like the reworked proxy_force_return() suggestion,
> which is much closer to the wake_up_process() approach I used back in
> v22:
> https://lore.kernel.org/lkml/20250926032931.27663-5-jstultz@google.com/
> But unfortunately my attempts to use a similar method ran into lots of
> trouble while I was reworking the logic for v23, which resulted in the
> messier logic currently in that function.
>
> So I've had difficulty breaking up and isolating the minimal changeset
> to support these simplifications, as they seem to also depend on the
> semantic rework around task state and blocked_on that you've made.
>
> Finally, I appreciate your thorough comments, though I'm wary of
> adding more spots where we're accessing values without locks. I know
> these tricks can be critical for performance, but the scheduler code
> is already so terribly subtle with its many exceptions to locking
> rules that I fret this makes it even harder to think about (though I'm
> maybe overly sensitive here as I've felt particularly dim in my
> efforts to rework your patch :).
Perhaps we can work these step by step later given the series in its
current state has got a good bit of testing. I saw the WIP branch only
a few days back and I do think some of this gets cleaned away later so
I'm not against keeping it like this for now.
>
> So I'm a little stuck as I really want to get the next iteration (v25)
> of the patch series out to the list, but I don't want to be ignoring
> your suggestions here. Since the v25 series has added a number of
> earlier fixups, the donor-migration chunk is getting quite long (16
> patches currently), so I may just send the early portions (up to
> "Handle blocked-waiter migration (and return migration)" - before we
> get to this try_to_wake_up() change), and hopefully I can sort
> out how to better integrate this feedback for v26 (or whenever the
> first chunk is queued).
>
> That sound ok?
I'm okay with that :-) Again thanks a ton for looking into this.
--
Thanks and Regards,
Prateek
end of thread, other threads: [~2026-03-11 9:25 UTC | newest]
Thread overview: 23+ messages
2025-11-24 22:30 [PATCH v24 00/11] Donor Migration for Proxy Execution (v24) John Stultz
2025-11-24 22:30 ` [PATCH v24 01/11] locking: Add task::blocked_lock to serialize blocked_on state John Stultz
2025-11-24 22:30 ` [PATCH v24 02/11] sched: Fix modifying donor->blocked on without proper locking John Stultz
2025-11-24 22:30 ` [PATCH v24 03/11] sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration John Stultz
2025-11-24 22:30 ` [PATCH v24 04/11] sched: Add assert_balance_callbacks_empty helper John Stultz
2025-11-24 22:30 ` [PATCH v24 05/11] sched: Add logic to zap balance callbacks if we pick again John Stultz
2025-11-24 22:30 ` [PATCH v24 06/11] sched: Handle blocked-waiter migration (and return migration) John Stultz
2025-12-30 5:33 ` K Prateek Nayak
2026-03-07 6:48 ` John Stultz
2026-03-10 4:17 ` K Prateek Nayak
2025-11-24 22:30 ` [PATCH v24 07/11] sched: Rework pick_next_task() and prev_balance() to avoid stale prev references John Stultz
2025-11-24 22:31 ` [PATCH v24 08/11] sched: Avoid donor->sched_class->yield_task() null traversal John Stultz
2025-12-30 6:01 ` K Prateek Nayak
2025-12-30 9:52 ` K Prateek Nayak
2026-01-01 7:04 ` K Prateek Nayak
2025-11-24 22:31 ` [PATCH v24 09/11] sched: Have try_to_wake_up() handle return-migration for PROXY_WAKING case John Stultz
2026-01-01 9:53 ` K Prateek Nayak
2026-03-11 6:45 ` John Stultz
2026-03-11 9:25 ` K Prateek Nayak
2025-11-24 22:31 ` [PATCH v24 10/11] sched: Add blocked_donor link to task for smarter mutex handoffs John Stultz
2025-11-24 22:31 ` [PATCH v24 11/11] sched: Migrate whole chain in proxy_migrate_task() John Stultz
2025-12-30 6:46 ` K Prateek Nayak
2026-03-07 7:07 ` John Stultz