[PATCH v23 0/9] Donor Migration for Proxy Execution (v23)

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH v23 0/9] Donor Migration for Proxy Execution (v23)
@ 2025-10-30  0:18 John Stultz
  2025-10-30  0:18 ` [PATCH v23 1/9] locking: Add task::blocked_lock to serialize blocked_on state John Stultz
                   ` (8 more replies)
  0 siblings, 9 replies; 36+ messages in thread
From: John Stultz @ 2025-10-30  0:18 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

Hey All,

Just another iteration on the next chunk of the proxy-exec
series: Donor Migration

This is just the next step for Proxy Execution, to allow us to
migrate blocked donors across runqueues to boost remote lock
owners.

As always, I’m trying to submit this larger work in smallish
digestible pieces, so in this portion of the series, I’m only
submitting for review and consideration the logic that allows us
to do donor(blocked waiter) migration, which requires some
additional changes to locking and extra state tracking to ensure
we don’t accidentally run a migrated donor on a cpu it isn’t
affined to, as well as some extra handling to deal with balance
callback state that needs to be reset when we decide to pick a
different task after doing donor migration.

Peter provided some very helpful review and feedback on the last
iteration, and I’ve tried to address his concerns and suggestions
here. However, the rework ended up being fairly significant (and
I think I stepped on just about every rake possible in the
process :P), so stabilizing the changes took much longer than I
had hoped for. I suspect Peter will have further suggestions
for changes.

New in this iteration:
* Folded the blocked_on_state into the blocked_on ptr, by
  introducing a special PROXY_WAKING value. This slight
  “compression” of state tracking had some subtle implications,
  especially in the find_proxy_task() chain walking, but I think
  I’ve got it worked out.

* Split the donor migration patch into a few smaller patches,
  handling manual return migration from find_proxy_task() first,
  and then adding the return migration logic in try_to_wakeup()
  in a separate patch, as requested by Peter. This also
  uncovered quite a few subtleties and required majorly
  reworking the proxy_force_return() logic.

* Pulled out balance_callback WARN_ON into its own helper
  function so we can check it at the top of the pick_again loop,
  as Peter suggested

I’d love to get further feedback on any place where these
patches are confusing, or could use additional clarifications.

In the full series, I don’t have much new as this rework took
up much of my time. But I’d still appreciate any testing or
comments that folks have:

Also you can find the full proxy-exec series here:
  https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v23-6.18-rc3
  https://github.com/johnstultz-work/linux-dev.git proxy-exec-v23-6.18-rc3

Issues still to address with the full series:
* Continue working to get sched_ext to be ok with
  proxy-execution enabled.

* I’ve reproduced the performance regression K Prateek Nayak
  found with the full series. I’m hoping to work to
  understand and narrow the issue down soon.

* Polish Suleiman’s rwsem patches some, as the PROXY_WAKING
  rework added some atomicity complications to traversing and
  locking the blocked_on structure and my initial fixups aren’t
  super elegant.

* The chain migration functionality needs further iterations and
  better validation to ensure it truly maintains the RT/DL load
  balancing invariants (despite this being broken in vanilla
  upstream with RT_PUSH_IPI currently)

Future work:
* Expand to more locking primitives: Figuring out pi-futexes
  would be good, using proxy for Binder PI is something else
  we’re exploring.
* Eventually: Work to replace rt_mutexes and get things happy
  with PREEMPT_RT

I’d really appreciate any feedback or review thoughts on the
full series as well. I’m trying to keep the chunks small,
reviewable and iteratively testable, but if you have any
suggestions on how to improve the larger series, I’m all ears.

Credit/Disclaimer:
—--------------------
As always, this Proxy Execution series has a long history with
lots of developers that deserve credit: 

First described in a paper[1] by Watkins, Straub, Niehaus, then
from patches from Peter Zijlstra, extended with lots of work by
Juri Lelli, Valentin Schneider, and Connor O'Brien. (and thank
you to Steven Rostedt for providing additional details here!).

Thanks also to Joel Fernandes, Dietmar Eggemann, Metin Kaya,
K Prateek Nayak and Suleiman Souhlal for their substantial
review, suggestion, and patch contributions.

So again, many thanks to those above, as all the credit for
this series really is due to them - while the mistakes are
surely mine.

Thanks so much!
-john

[1] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>   
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com

John Stultz (8):
  locking: Add task::blocked_lock to serialize blocked_on state
  sched: Fix modifying donor->blocked on without proper locking
  sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy
    return-migration
  sched: Add assert_balance_callbacks_empty helper
  sched: Add logic to zap balance callbacks if we pick again
  sched: Handle blocked-waiter migration (and return migration)
  sched: Have try_to_wake_up() handle return-migration for PROXY_WAKING
    case
  sched: Migrate whole chain in proxy_migrate_task()

Peter Zijlstra (1):
  sched: Add blocked_donor link to task for smarter mutex handoffs

 include/linux/sched.h        |  95 ++++++---
 init/init_task.c             |   5 +
 kernel/fork.c                |   5 +
 kernel/locking/mutex-debug.c |   4 +-
 kernel/locking/mutex.c       |  82 ++++++--
 kernel/locking/mutex.h       |   6 +
 kernel/locking/ww_mutex.h    |  16 +-
 kernel/sched/core.c          | 372 +++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h         |  11 +-
 9 files changed, 519 insertions(+), 77 deletions(-)

-- 
2.51.1.930.gacf6e81ea2-goog

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH v23 1/9] locking: Add task::blocked_lock to serialize blocked_on state
  2025-10-30  0:18 [PATCH v23 0/9] Donor Migration for Proxy Execution (v23) John Stultz
@ 2025-10-30  0:18 ` John Stultz
  2025-10-30  0:18 ` [PATCH v23 2/9] sched: Fix modifying donor->blocked on without proper locking John Stultz
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 36+ messages in thread
From: John Stultz @ 2025-10-30  0:18 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, K Prateek Nayak, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall,
	Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

So far, we have been able to utilize the mutex::wait_lock
for serializing the blocked_on state, but when we move to
proxying across runqueues, we will need to add more state
and a way to serialize changes to this state in contexts
where we don't hold the mutex::wait_lock.

So introduce the task::blocked_lock, which nests under the
mutex::wait_lock in the locking order, and rework the locking
to use it.

Signed-off-by: John Stultz <jstultz@google.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
v15:
* Split back out into later in the series
v16:
* Fixups to mark tasks unblocked before sleeping in
  mutex_optimistic_spin()
* Rework to use guard() as suggested by Peter
v19:
* Rework logic for PREEMPT_RT issues reported by
  K Prateek Nayak
v21:
* After recently thinking more on ww_mutex code, I
  reworked the blocked_lock usage in mutex lock to
  avoid having to take nested locks in the ww_mutex
  paths, as I was concerned the lock ordering
  constraints weren't as strong as I had previously
  thought.
v22:
* Added some extra spaces to avoid dense code blocks
  suggested by K Prateek
v23:
* Move get_task_blocked_on() to kernel/locking/mutex.h
  as requested by PeterZ

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 include/linux/sched.h        | 48 +++++++++++++-----------------------
 init/init_task.c             |  1 +
 kernel/fork.c                |  1 +
 kernel/locking/mutex-debug.c |  4 +--
 kernel/locking/mutex.c       | 40 +++++++++++++++++++-----------
 kernel/locking/mutex.h       |  6 +++++
 kernel/locking/ww_mutex.h    |  4 +--
 kernel/sched/core.c          |  4 ++-
 8 files changed, 58 insertions(+), 50 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index cbb7340c5866f..16122c2a2a224 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1241,6 +1241,7 @@ struct task_struct {
 #endif
 
 	struct mutex			*blocked_on;	/* lock we're blocked on */
+	raw_spinlock_t			blocked_lock;
 
 #ifdef CONFIG_DETECT_HUNG_TASK_BLOCKER
 	/*
@@ -2149,57 +2150,42 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
 #ifndef CONFIG_PREEMPT_RT
 static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
 {
-	struct mutex *m = p->blocked_on;
-
-	if (m)
-		lockdep_assert_held_once(&m->wait_lock);
-	return m;
+	lockdep_assert_held_once(&p->blocked_lock);
+	return p->blocked_on;
 }
 
 static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
 {
-	struct mutex *blocked_on = READ_ONCE(p->blocked_on);
-
 	WARN_ON_ONCE(!m);
 	/* The task should only be setting itself as blocked */
 	WARN_ON_ONCE(p != current);
-	/* Currently we serialize blocked_on under the mutex::wait_lock */
-	lockdep_assert_held_once(&m->wait_lock);
+	/* Currently we serialize blocked_on under the task::blocked_lock */
+	lockdep_assert_held_once(&p->blocked_lock);
 	/*
 	 * Check ensure we don't overwrite existing mutex value
 	 * with a different mutex. Note, setting it to the same
 	 * lock repeatedly is ok.
 	 */
-	WARN_ON_ONCE(blocked_on && blocked_on != m);
-	WRITE_ONCE(p->blocked_on, m);
-}
-
-static inline void set_task_blocked_on(struct task_struct *p, struct mutex *m)
-{
-	guard(raw_spinlock_irqsave)(&m->wait_lock);
-	__set_task_blocked_on(p, m);
+	WARN_ON_ONCE(p->blocked_on && p->blocked_on != m);
+	p->blocked_on = m;
 }
 
 static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *m)
 {
-	if (m) {
-		struct mutex *blocked_on = READ_ONCE(p->blocked_on);
-
-		/* Currently we serialize blocked_on under the mutex::wait_lock */
-		lockdep_assert_held_once(&m->wait_lock);
-		/*
-		 * There may be cases where we re-clear already cleared
-		 * blocked_on relationships, but make sure we are not
-		 * clearing the relationship with a different lock.
-		 */
-		WARN_ON_ONCE(blocked_on && blocked_on != m);
-	}
-	WRITE_ONCE(p->blocked_on, NULL);
+	/* Currently we serialize blocked_on under the task::blocked_lock */
+	lockdep_assert_held_once(&p->blocked_lock);
+	/*
+	 * There may be cases where we re-clear already cleared
+	 * blocked_on relationships, but make sure we are not
+	 * clearing the relationship with a different lock.
+	 */
+	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m);
+	p->blocked_on = NULL;
 }
 
 static inline void clear_task_blocked_on(struct task_struct *p, struct mutex *m)
 {
-	guard(raw_spinlock_irqsave)(&m->wait_lock);
+	guard(raw_spinlock_irqsave)(&p->blocked_lock);
 	__clear_task_blocked_on(p, m);
 }
 #else
diff --git a/init/init_task.c b/init/init_task.c
index a55e2189206fa..60477d74546e0 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -143,6 +143,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	.journal_info	= NULL,
 	INIT_CPU_TIMERS(init_task)
 	.pi_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
+	.blocked_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.blocked_lock),
 	.timer_slack_ns = 50000, /* 50 usec default slack */
 	.thread_pid	= &init_struct_pid,
 	.thread_node	= LIST_HEAD_INIT(init_signals.thread_head),
diff --git a/kernel/fork.c b/kernel/fork.c
index 3da0f08615a95..0697084be202f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2038,6 +2038,7 @@ __latent_entropy struct task_struct *copy_process(
 	ftrace_graph_init_task(p);
 
 	rt_mutex_init_task(p);
+	raw_spin_lock_init(&p->blocked_lock);
 
 	lockdep_assert_irqs_enabled();
 #ifdef CONFIG_PROVE_LOCKING
diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index 949103fd8e9b5..1d8cff71f65e1 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -54,13 +54,13 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 	lockdep_assert_held(&lock->wait_lock);
 
 	/* Current thread can't be already blocked (since it's executing!) */
-	DEBUG_LOCKS_WARN_ON(__get_task_blocked_on(task));
+	DEBUG_LOCKS_WARN_ON(get_task_blocked_on(task));
 }
 
 void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 			 struct task_struct *task)
 {
-	struct mutex *blocked_on = __get_task_blocked_on(task);
+	struct mutex *blocked_on = get_task_blocked_on(task);
 
 	DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
 	DEBUG_LOCKS_WARN_ON(waiter->task != task);
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index de7d6702cd96c..c44fc63d4476e 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -640,6 +640,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 			goto err_early_kill;
 	}
 
+	raw_spin_lock(&current->blocked_lock);
 	__set_task_blocked_on(current, lock);
 	set_current_state(state);
 	trace_contention_begin(lock, LCB_F_MUTEX);
@@ -653,8 +654,9 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 		 * the handoff.
 		 */
 		if (__mutex_trylock(lock))
-			goto acquired;
+			break;
 
+		raw_spin_unlock(&current->blocked_lock);
 		/*
 		 * Check for signals and kill conditions while holding
 		 * wait_lock. This ensures the lock cancellation is ordered
@@ -677,12 +679,14 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 
 		first = __mutex_waiter_is_first(lock, &waiter);
 
+		raw_spin_lock_irqsave(&lock->wait_lock, flags);
+		raw_spin_lock(&current->blocked_lock);
 		/*
 		 * As we likely have been woken up by task
 		 * that has cleared our blocked_on state, re-set
 		 * it to the lock we are trying to acquire.
 		 */
-		set_task_blocked_on(current, lock);
+		__set_task_blocked_on(current, lock);
 		set_current_state(state);
 		/*
 		 * Here we order against unlock; we must either see it change
@@ -693,25 +697,33 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 			break;
 
 		if (first) {
-			trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
+			bool opt_acquired;
+
 			/*
 			 * mutex_optimistic_spin() can call schedule(), so
-			 * clear blocked on so we don't become unselectable
+			 * we need to release these locks before calling it,
+			 * and clear blocked on so we don't become unselectable
 			 * to run.
 			 */
-			clear_task_blocked_on(current, lock);
-			if (mutex_optimistic_spin(lock, ww_ctx, &waiter))
+			__clear_task_blocked_on(current, lock);
+			raw_spin_unlock(&current->blocked_lock);
+			raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
+
+			trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
+			opt_acquired = mutex_optimistic_spin(lock, ww_ctx, &waiter);
+
+			raw_spin_lock_irqsave(&lock->wait_lock, flags);
+			raw_spin_lock(&current->blocked_lock);
+			__set_task_blocked_on(current, lock);
+
+			if (opt_acquired)
 				break;
-			set_task_blocked_on(current, lock);
 			trace_contention_begin(lock, LCB_F_MUTEX);
 		}
-
-		raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	}
-	raw_spin_lock_irqsave(&lock->wait_lock, flags);
-acquired:
 	__clear_task_blocked_on(current, lock);
 	__set_current_state(TASK_RUNNING);
+	raw_spin_unlock(&current->blocked_lock);
 
 	if (ww_ctx) {
 		/*
@@ -740,11 +752,11 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 	return 0;
 
 err:
-	__clear_task_blocked_on(current, lock);
+	clear_task_blocked_on(current, lock);
 	__set_current_state(TASK_RUNNING);
 	__mutex_remove_waiter(lock, &waiter);
 err_early_kill:
-	WARN_ON(__get_task_blocked_on(current));
+	WARN_ON(get_task_blocked_on(current));
 	trace_contention_end(lock, ret);
 	raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q);
 	debug_mutex_free_waiter(&waiter);
@@ -955,7 +967,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 		next = waiter->task;
 
 		debug_mutex_wake_waiter(lock, waiter);
-		__clear_task_blocked_on(next, lock);
+		clear_task_blocked_on(next, lock);
 		wake_q_add(&wake_q, next);
 	}
 
diff --git a/kernel/locking/mutex.h b/kernel/locking/mutex.h
index 2e8080a9bee37..5cfd663e2c011 100644
--- a/kernel/locking/mutex.h
+++ b/kernel/locking/mutex.h
@@ -47,6 +47,12 @@ static inline struct task_struct *__mutex_owner(struct mutex *lock)
 	return (struct task_struct *)(atomic_long_read(&lock->owner) & ~MUTEX_FLAGS);
 }
 
+static inline struct mutex *get_task_blocked_on(struct task_struct *p)
+{
+	guard(raw_spinlock_irqsave)(&p->blocked_lock);
+	return __get_task_blocked_on(p);
+}
+
 #ifdef CONFIG_DEBUG_MUTEXES
 extern void debug_mutex_lock_common(struct mutex *lock,
 				    struct mutex_waiter *waiter);
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 31a785afee6c0..e4a81790ea7dd 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -289,7 +289,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
 		 * blocked_on pointer. Otherwise we can see circular
 		 * blocked_on relationships that can't resolve.
 		 */
-		__clear_task_blocked_on(waiter->task, lock);
+		clear_task_blocked_on(waiter->task, lock);
 		wake_q_add(wake_q, waiter->task);
 	}
 
@@ -347,7 +347,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
 			 * are waking the mutex owner, who may be currently
 			 * blocked on a different mutex.
 			 */
-			__clear_task_blocked_on(owner, NULL);
+			clear_task_blocked_on(owner, NULL);
 			wake_q_add(wake_q, owner);
 		}
 		return true;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cb4f6d91d4455..517b26c515bc5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6607,6 +6607,7 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *d
  *   p->pi_lock
  *     rq->lock
  *       mutex->wait_lock
+ *         p->blocked_lock
  *
  * Returns the task that is going to be used as execution context (the one
  * that is actually going to be run on cpu_of(rq)).
@@ -6630,8 +6631,9 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 		 * and ensure @owner sticks around.
 		 */
 		guard(raw_spinlock)(&mutex->wait_lock);
+		guard(raw_spinlock)(&p->blocked_lock);
 
-		/* Check again that p is blocked with wait_lock held */
+		/* Check again that p is blocked with blocked_lock held */
 		if (mutex != __get_task_blocked_on(p)) {
 			/*
 			 * Something changed in the blocked_on chain and
-- 
2.51.1.930.gacf6e81ea2-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v23 2/9] sched: Fix modifying donor->blocked on without proper locking
  2025-10-30  0:18 [PATCH v23 0/9] Donor Migration for Proxy Execution (v23) John Stultz
  2025-10-30  0:18 ` [PATCH v23 1/9] locking: Add task::blocked_lock to serialize blocked_on state John Stultz
@ 2025-10-30  0:18 ` John Stultz
  2025-10-30  4:51   ` K Prateek Nayak
  2025-10-30  0:18 ` [PATCH v23 3/9] sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration John Stultz
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 36+ messages in thread
From: John Stultz @ 2025-10-30  0:18 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

Introduce an action enum in find_proxy_task() which allows
us to handle work needed to be done outside the mutex.wait_lock
and task.blocked_lock guard scopes.

This ensures proper locking when we clear the donor's blocked_on
pointer in proxy_deactivate(), and the switch statement will be
useful as we add more cases to handle later in this series.

Signed-off-by: John Stultz <jstultz@google.com>
---
v23:
* Split out from earlier patch.

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/sched/core.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 517b26c515bc5..0533a14ce5935 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6591,7 +6591,7 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *d
 		 * as unblocked, as we aren't doing proxy-migrations
 		 * yet (more logic will be needed then).
 		 */
-		donor->blocked_on = NULL;
+		clear_task_blocked_on(donor, NULL);
 	}
 	return NULL;
 }
@@ -6619,6 +6619,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 	int this_cpu = cpu_of(rq);
 	struct task_struct *p;
 	struct mutex *mutex;
+	enum { FOUND, DEACTIVATE_DONOR } action = FOUND;
 
 	/* Follow blocked_on chain. */
 	for (p = donor; task_is_blocked(p); p = owner) {
@@ -6652,12 +6653,14 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 
 		if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
 			/* XXX Don't handle blocked owners/delayed dequeue yet */
-			return proxy_deactivate(rq, donor);
+			action = DEACTIVATE_DONOR;
+			break;
 		}
 
 		if (task_cpu(owner) != this_cpu) {
 			/* XXX Don't handle migrations yet */
-			return proxy_deactivate(rq, donor);
+			action = DEACTIVATE_DONOR;
+			break;
 		}
 
 		if (task_on_rq_migrating(owner)) {
@@ -6715,6 +6718,13 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 		 */
 	}
 
+	/* Handle actions we need to do outside of the guard() scope */
+	switch (action) {
+	case DEACTIVATE_DONOR:
+		return proxy_deactivate(rq, donor);
+	case FOUND:
+		/* fallthrough */;
+	}
 	WARN_ON_ONCE(owner && !owner->on_rq);
 	return owner;
 }
-- 
2.51.1.930.gacf6e81ea2-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v23 3/9] sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration
  2025-10-30  0:18 [PATCH v23 0/9] Donor Migration for Proxy Execution (v23) John Stultz
  2025-10-30  0:18 ` [PATCH v23 1/9] locking: Add task::blocked_lock to serialize blocked_on state John Stultz
  2025-10-30  0:18 ` [PATCH v23 2/9] sched: Fix modifying donor->blocked on without proper locking John Stultz
@ 2025-10-30  0:18 ` John Stultz
  2025-10-30  7:32   ` K Prateek Nayak
  2025-10-30  0:18 ` [PATCH v23 4/9] sched: Add assert_balance_callbacks_empty helper John Stultz
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 36+ messages in thread
From: John Stultz @ 2025-10-30  0:18 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

As we add functionality to proxy execution, we may migrate a
donor task to a runqueue where it can't run due to cpu affinity.
Thus, we must be careful to ensure we return-migrate the task
back to a cpu in its cpumask when it becomes unblocked.

Peter helpfully provided the following example with pictures:
"Suppose we have a ww_mutex cycle:

                  ,-+-* Mutex-1 <-.
        Task-A ---' |             | ,-- Task-B
                    `-> Mutex-2 *-+-'

Where Task-A holds Mutex-1 and tries to acquire Mutex-2, and
where Task-B holds Mutex-2 and tries to acquire Mutex-1.

Then the blocked_on->owner chain will go in circles.

        Task-A  -> Mutex-2
          ^          |
          |          v
        Mutex-1 <- Task-B

We need two things:

 - find_proxy_task() to stop iterating the circle;

 - the woken task to 'unblock' and run, such that it can
   back-off and re-try the transaction.

Now, the current code [without this patch] does:
        __clear_task_blocked_on();
        wake_q_add();

And surely clearing ->blocked_on is sufficient to break the
cycle.

Suppose it is Task-B that is made to back-off, then we have:

  Task-A -> Mutex-2 -> Task-B (no further blocked_on)

and it would attempt to run Task-B. Or worse, it could directly
pick Task-B and run it, without ever getting into
find_proxy_task().

Now, here is a problem because Task-B might not be runnable on
the CPU it is currently on; and because !task_is_blocked() we
don't get into the proxy paths, so nobody is going to fix this
up.

Ideally we would have dequeued Task-B alongside of clearing
->blocked_on, but alas, [the lock ordering prevents us from
getting the task_rq_lock() and] spoils things."

Thus we need more than just a binary concept of the task being
blocked on a mutex or not.

So allow setting blocked_on to PROXY_WAKING as a special value
which specifies the task is no longer blocked, but needs to
be evaluated for return migration *before* it can be run.

This will then be used in a later patch to handle proxy
return-migration.

Signed-off-by: John Stultz <jstultz@google.com>
---
v15:
* Split blocked_on_state into its own patch later in the
  series, as the tri-state isn't necessary until we deal
  with proxy/return migrations
v16:
* Handle case where task in the chain is being set as
  BO_WAKING by another cpu (usually via ww_mutex die code).
  Make sure we release the rq lock so the wakeup can
  complete.
* Rework to use guard() in find_proxy_task() as suggested
  by Peter
v18:
* Add initialization of blocked_on_state for init_task
v19:
* PREEMPT_RT build fixups and rework suggested by
  K Prateek Nayak
v20:
* Simplify one of the blocked_on_state changes to avoid extra
  PREMEPT_RT conditionals
v21:
* Slight reworks due to avoiding nested blocked_lock locking
* Be consistent in use of blocked_on_state helper functions
* Rework calls to proxy_deactivate() to do proper locking
  around blocked_on_state changes that we were cheating in
  previous versions.
* Minor cleanups, some comment improvements
v22:
* Re-order blocked_on_state helpers to try to make it clearer
  the set_task_blocked_on() and clear_task_blocked_on() are
  the main enter/exit states and the blocked_on_state helpers
  help manage the transition states within. Per feedback from
  K Prateek Nayak.
* Rework blocked_on_state to be defined within
  CONFIG_SCHED_PROXY_EXEC as suggested by K Prateek Nayak.
* Reworked empty stub functions to just take one line as
  suggestd by K Prateek
* Avoid using gotos out of a guard() scope, as highlighted by
  K Prateek, and instead rework logic to break and switch()
  on an action value.
v23:
* Big rework to using PROXY_WAKING instead of blocked_on_state
  as suggested by Peter.
* Reworked commit message to include Peter's nice diagrams and
  example for why this extra state is necessary.

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 include/linux/sched.h     | 51 +++++++++++++++++++++++++++++++++++++--
 kernel/locking/mutex.c    |  2 +-
 kernel/locking/ww_mutex.h | 16 ++++++------
 kernel/sched/core.c       | 17 +++++++++++++
 4 files changed, 75 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 16122c2a2a224..863c54685684c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2148,10 +2148,20 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
 })
 
 #ifndef CONFIG_PREEMPT_RT
+
+/*
+ * With proxy exec, if a task has been proxy-migrated, it may be a donor
+ * on a cpu that it can't actually run on. Thus we need a special state
+ * to denote that the task is being woken, but that it needs to be
+ * evaluated for return-migration before it is run. So if the task is
+ * blocked_on PROXY_WAKING, return migrate it before running it.
+ */
+#define PROXY_WAKING ((struct mutex *)(-1L))
+
 static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
 {
 	lockdep_assert_held_once(&p->blocked_lock);
-	return p->blocked_on;
+	return p->blocked_on == PROXY_WAKING ? NULL : p->blocked_on;
 }
 
 static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
@@ -2179,7 +2189,7 @@ static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *
 	 * blocked_on relationships, but make sure we are not
 	 * clearing the relationship with a different lock.
 	 */
-	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m);
+	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
 	p->blocked_on = NULL;
 }
 
@@ -2188,6 +2198,35 @@ static inline void clear_task_blocked_on(struct task_struct *p, struct mutex *m)
 	guard(raw_spinlock_irqsave)(&p->blocked_lock);
 	__clear_task_blocked_on(p, m);
 }
+
+static inline void __set_task_blocked_on_waking(struct task_struct *p, struct mutex *m)
+{
+	/* Currently we serialize blocked_on under the task::blocked_lock */
+	lockdep_assert_held_once(&p->blocked_lock);
+
+	if (!sched_proxy_exec()) {
+		__clear_task_blocked_on(p, m);
+		return;
+	}
+
+	/* Don't set PROXY_WAKING if blocked_on was already cleared */
+	if (!p->blocked_on)
+		return;
+	/*
+	 * There may be cases where we set PROXY_WAKING on tasks that were
+	 * already set to waking, but make sure we are not changing
+	 * the relationship with a different lock.
+	 */
+	WARN_ON_ONCE(m && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
+	p->blocked_on = PROXY_WAKING;
+}
+
+static inline void set_task_blocked_on_waking(struct task_struct *p, struct mutex *m)
+{
+	guard(raw_spinlock_irqsave)(&p->blocked_lock);
+	__set_task_blocked_on_waking(p, m);
+}
+
 #else
 static inline void __clear_task_blocked_on(struct task_struct *p, struct rt_mutex *m)
 {
@@ -2196,6 +2235,14 @@ static inline void __clear_task_blocked_on(struct task_struct *p, struct rt_mute
 static inline void clear_task_blocked_on(struct task_struct *p, struct rt_mutex *m)
 {
 }
+
+static inline void __set_task_blocked_on_waking(struct task_struct *p, struct rt_mutex *m)
+{
+}
+
+static inline void set_task_blocked_on_waking(struct task_struct *p, struct rt_mutex *m)
+{
+}
 #endif /* !CONFIG_PREEMPT_RT */
 
 static __always_inline bool need_resched(void)
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index c44fc63d4476e..3cb9001d15119 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -967,7 +967,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 		next = waiter->task;
 
 		debug_mutex_wake_waiter(lock, waiter);
-		clear_task_blocked_on(next, lock);
+		set_task_blocked_on_waking(next, lock);
 		wake_q_add(&wake_q, next);
 	}
 
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index e4a81790ea7dd..5cd9dfa4b31e6 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -285,11 +285,11 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
 		debug_mutex_wake_waiter(lock, waiter);
 #endif
 		/*
-		 * When waking up the task to die, be sure to clear the
-		 * blocked_on pointer. Otherwise we can see circular
-		 * blocked_on relationships that can't resolve.
+		 * When waking up the task to die, be sure to set the
+		 * blocked_on to PROXY_WAKING. Otherwise we can see
+		 * circular blocked_on relationships that can't resolve.
 		 */
-		clear_task_blocked_on(waiter->task, lock);
+		set_task_blocked_on_waking(waiter->task, lock);
 		wake_q_add(wake_q, waiter->task);
 	}
 
@@ -339,15 +339,15 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
 		 */
 		if (owner != current) {
 			/*
-			 * When waking up the task to wound, be sure to clear the
-			 * blocked_on pointer. Otherwise we can see circular
-			 * blocked_on relationships that can't resolve.
+			 * When waking up the task to wound, be sure to set the
+			 * blocked_on to PROXY_WAKING. Otherwise we can see
+			 * circular blocked_on relationships that can't resolve.
 			 *
 			 * NOTE: We pass NULL here instead of lock, because we
 			 * are waking the mutex owner, who may be currently
 			 * blocked on a different mutex.
 			 */
-			clear_task_blocked_on(owner, NULL);
+			set_task_blocked_on_waking(owner, NULL);
 			wake_q_add(wake_q, owner);
 		}
 		return true;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0533a14ce5935..da6dd2fc8e705 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4293,6 +4293,13 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		ttwu_queue(p, cpu, wake_flags);
 	}
 out:
+	/*
+	 * For now, if we've been woken up, clear the task->blocked_on
+	 * regardless if it was set to a mutex or PROXY_WAKING so the
+	 * task can run. We will need to be more careful later when
+	 * properly handling proxy migration
+	 */
+	clear_task_blocked_on(p, NULL);
 	if (success)
 		ttwu_stat(p, task_cpu(p), wake_flags);
 
@@ -6627,6 +6634,11 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 		/* Something changed in the chain, so pick again */
 		if (!mutex)
 			return NULL;
+
+		/* if its PROXY_WAKING, resched_idle so ttwu can complete */
+		if (mutex == PROXY_WAKING)
+			return proxy_resched_idle(rq);
+
 		/*
 		 * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
 		 * and ensure @owner sticks around.
@@ -6647,6 +6659,11 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 
 		owner = __mutex_owner(mutex);
 		if (!owner) {
+			/*
+			 * If there is no owner, clear blocked_on
+			 * and return p so it can run and try to
+			 * acquire the lock
+			 */
 			__clear_task_blocked_on(p, mutex);
 			return p;
 		}
-- 
2.51.1.930.gacf6e81ea2-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v23 4/9] sched: Add assert_balance_callbacks_empty helper
  2025-10-30  0:18 [PATCH v23 0/9] Donor Migration for Proxy Execution (v23) John Stultz
                   ` (2 preceding siblings ...)
  2025-10-30  0:18 ` [PATCH v23 3/9] sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration John Stultz
@ 2025-10-30  0:18 ` John Stultz
  2025-10-30  7:38   ` K Prateek Nayak
  2025-10-30  0:18 ` [PATCH v23 5/9] sched: Add logic to zap balance callbacks if we pick again John Stultz
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 36+ messages in thread
From: John Stultz @ 2025-10-30  0:18 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Peter Zijlstra, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

With proxy-exec utilizing pick-again logic, we can end up having
balance callbacks set by the preivous pick_next_task() call left
on the list.

So pull the warning out into a helper function, and make sure we
check it when we pick again.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
---
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/sched/core.c  |  1 +
 kernel/sched/sched.h | 11 ++++++++++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index da6dd2fc8e705..680ff147d270d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6896,6 +6896,7 @@ static void __sched notrace __schedule(int sched_mode)
 	}
 
 pick_again:
+	assert_balance_callbacks_empty(rq);
 	next = pick_next_task(rq, rq->donor, &rf);
 	rq_set_donor(rq, next);
 	if (unlikely(task_is_blocked(next))) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 361f9101cef97..de77b3313ab18 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1779,6 +1779,15 @@ static inline void scx_rq_clock_update(struct rq *rq, u64 clock) {}
 static inline void scx_rq_clock_invalidate(struct rq *rq) {}
 #endif /* !CONFIG_SCHED_CLASS_EXT */
 
+#ifdef CONFIG_PROVE_LOCKING
+static inline void assert_balance_callbacks_empty(struct rq *rq)
+{
+	WARN_ON_ONCE(rq->balance_callback && rq->balance_callback != &balance_push_callback);
+}
+#else
+static inline void assert_balance_callbacks_empty(struct rq *rq) {}
+#endif
+
 /*
  * Lockdep annotation that avoids accidental unlocks; it's like a
  * sticky/continuous lockdep_assert_held().
@@ -1795,7 +1804,7 @@ static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
 
 	rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
 	rf->clock_update_flags = 0;
-	WARN_ON_ONCE(rq->balance_callback && rq->balance_callback != &balance_push_callback);
+	assert_balance_callbacks_empty(rq);
 }
 
 static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
-- 
2.51.1.930.gacf6e81ea2-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v23 5/9] sched: Add logic to zap balance callbacks if we pick again
  2025-10-30  0:18 [PATCH v23 0/9] Donor Migration for Proxy Execution (v23) John Stultz
                   ` (3 preceding siblings ...)
  2025-10-30  0:18 ` [PATCH v23 4/9] sched: Add assert_balance_callbacks_empty helper John Stultz
@ 2025-10-30  0:18 ` John Stultz
  2025-10-30  8:08   ` K Prateek Nayak
  2025-10-30  0:18 ` [PATCH v23 6/9] sched: Handle blocked-waiter migration (and return migration) John Stultz
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 36+ messages in thread
From: John Stultz @ 2025-10-30  0:18 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

With proxy-exec, a task is selected to run via pick_next_task(),
and then if it is a mutex blocked task, we call find_proxy_task()
to find a runnable owner. If the runnable owner is on another
cpu, we will need to migrate the selected donor task away, after
which we will pick_again can call pick_next_task() to choose
something else.

However, in the first call to pick_next_task(), we may have
had a balance_callback setup by the class scheduler. After we
pick again, its possible pick_next_task_fair() will be called
which calls sched_balance_newidle() and sched_balance_rq().

This will throw a warning:
[    8.796467] rq->balance_callback && rq->balance_callback != &balance_push_callback
[    8.796467] WARNING: CPU: 32 PID: 458 at kernel/sched/sched.h:1750 sched_balance_rq+0xe92/0x1250
...
[    8.796467] Call Trace:
[    8.796467]  <TASK>
[    8.796467]  ? __warn.cold+0xb2/0x14e
[    8.796467]  ? sched_balance_rq+0xe92/0x1250
[    8.796467]  ? report_bug+0x107/0x1a0
[    8.796467]  ? handle_bug+0x54/0x90
[    8.796467]  ? exc_invalid_op+0x17/0x70
[    8.796467]  ? asm_exc_invalid_op+0x1a/0x20
[    8.796467]  ? sched_balance_rq+0xe92/0x1250
[    8.796467]  sched_balance_newidle+0x295/0x820
[    8.796467]  pick_next_task_fair+0x51/0x3f0
[    8.796467]  __schedule+0x23a/0x14b0
[    8.796467]  ? lock_release+0x16d/0x2e0
[    8.796467]  schedule+0x3d/0x150
[    8.796467]  worker_thread+0xb5/0x350
[    8.796467]  ? __pfx_worker_thread+0x10/0x10
[    8.796467]  kthread+0xee/0x120
[    8.796467]  ? __pfx_kthread+0x10/0x10
[    8.796467]  ret_from_fork+0x31/0x50
[    8.796467]  ? __pfx_kthread+0x10/0x10
[    8.796467]  ret_from_fork_asm+0x1a/0x30
[    8.796467]  </TASK>

This is because if a RT task was originally picked, it will
setup the rq->balance_callback with push_rt_tasks() via
set_next_task_rt().

Once the task is migrated away and we pick again, we haven't
processed any balance callbacks, so rq->balance_callback is not
in the same state as it was the first time pick_next_task was
called.

To handle this, add a zap_balance_callbacks() helper function
which cleans up the balance callbacks without running them. This
should be ok, as we are effectively undoing the state set in
the first call to pick_next_task(), and when we pick again,
the new callback can be configured for the donor task actually
selected.

Signed-off-by: John Stultz <jstultz@google.com>
---
v20:
* Tweaked to avoid build issues with different configs
v22:
* Spelling fix suggested by K Prateek
* Collapsed the stub implementation to one line as suggested
  by K Prateek
* Zap callbacks when we resched idle, as suggested by K Prateek

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/sched/core.c | 41 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 39 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 680ff147d270d..ab6e14259bdf2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4970,6 +4970,38 @@ static inline void finish_task(struct task_struct *prev)
 	smp_store_release(&prev->on_cpu, 0);
 }
 
+#ifdef CONFIG_SCHED_PROXY_EXEC
+/*
+ * Only called from __schedule context
+ *
+ * There are some cases where we are going to re-do the action
+ * that added the balance callbacks. We may not be in a state
+ * where we can run them, so just zap them so they can be
+ * properly re-added on the next time around. This is similar
+ * handling to running the callbacks, except we just don't call
+ * them.
+ */
+static void zap_balance_callbacks(struct rq *rq)
+{
+	struct balance_callback *next, *head;
+	bool found = false;
+
+	lockdep_assert_rq_held(rq);
+
+	head = rq->balance_callback;
+	while (head) {
+		if (head == &balance_push_callback)
+			found = true;
+		next = head->next;
+		head->next = NULL;
+		head = next;
+	}
+	rq->balance_callback = found ? &balance_push_callback : NULL;
+}
+#else
+static inline void zap_balance_callbacks(struct rq *rq) {}
+#endif
+
 static void do_balance_callbacks(struct rq *rq, struct balance_callback *head)
 {
 	void (*func)(struct rq *rq);
@@ -6901,10 +6933,15 @@ static void __sched notrace __schedule(int sched_mode)
 	rq_set_donor(rq, next);
 	if (unlikely(task_is_blocked(next))) {
 		next = find_proxy_task(rq, next, &rf);
-		if (!next)
+		if (!next) {
+			/* zap the balance_callbacks before picking again */
+			zap_balance_callbacks(rq);
 			goto pick_again;
-		if (next == rq->idle)
+		}
+		if (next == rq->idle) {
+			zap_balance_callbacks(rq);
 			goto keep_resched;
+		}
 	}
 picked:
 	clear_tsk_need_resched(prev);
-- 
2.51.1.930.gacf6e81ea2-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v23 6/9] sched: Handle blocked-waiter migration (and return migration)
  2025-10-30  0:18 [PATCH v23 0/9] Donor Migration for Proxy Execution (v23) John Stultz
                   ` (4 preceding siblings ...)
  2025-10-30  0:18 ` [PATCH v23 5/9] sched: Add logic to zap balance callbacks if we pick again John Stultz
@ 2025-10-30  0:18 ` John Stultz
  2025-10-30  9:32   ` K Prateek Nayak
  2025-11-07 15:19   ` Juri Lelli
  2025-10-30  0:18 ` [PATCH v23 7/9] sched: Have try_to_wake_up() handle return-migration for PROXY_WAKING case John Stultz
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 36+ messages in thread
From: John Stultz @ 2025-10-30  0:18 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

Add logic to handle migrating a blocked waiter to a remote
cpu where the lock owner is runnable.

Additionally, as the blocked task may not be able to run
on the remote cpu, add logic to handle return migration once
the waiting task is given the mutex.

Because tasks may get migrated to where they cannot run, also
modify the scheduling classes to avoid sched class migrations on
mutex blocked tasks, leaving find_proxy_task() and related logic
to do the migrations and return migrations.

This was split out from the larger proxy patch, and
significantly reworked.

Credits for the original patch go to:
  Peter Zijlstra (Intel) <peterz@infradead.org>
  Juri Lelli <juri.lelli@redhat.com>
  Valentin Schneider <valentin.schneider@arm.com>
  Connor O'Brien <connoro@google.com>

Signed-off-by: John Stultz <jstultz@google.com>
---
v6:
* Integrated sched_proxy_exec() check in proxy_return_migration()
* Minor cleanups to diff
* Unpin the rq before calling __balance_callbacks()
* Tweak proxy migrate to migrate deeper task in chain, to avoid
  tasks pingponging between rqs
v7:
* Fixup for unused function arguments
* Switch from that_rq -> target_rq, other minor tweaks, and typo
  fixes suggested by Metin Kaya
* Switch back to doing return migration in the ttwu path, which
  avoids nasty lock juggling and performance issues
* Fixes for UP builds
v8:
* More simplifications from Metin Kaya
* Fixes for null owner case, including doing return migration
* Cleanup proxy_needs_return logic
v9:
* Narrow logic in ttwu that sets BO_RUNNABLE, to avoid missed
  return migrations
* Switch to using zap_balance_callbacks rathern then running
  them when we are dropping rq locks for proxy_migration.
* Drop task_is_blocked check in sched_submit_work as suggested
  by Metin (may re-add later if this causes trouble)
* Do return migration when we're not on wake_cpu. This avoids
  bad task placement caused by proxy migrations raised by
  Xuewen Yan
* Fix to call set_next_task(rq->curr) prior to dropping rq lock
  to avoid rq->curr getting migrated before we have actually
  switched from it
* Cleanup to re-use proxy_resched_idle() instead of open coding
  it in proxy_migrate_task()
* Fix return migration not to use DEQUEUE_SLEEP, so that we
  properly see the task as task_on_rq_migrating() after it is
  dequeued but before set_task_cpu() has been called on it
* Fix to broaden find_proxy_task() checks to avoid race where
  a task is dequeued off the rq due to return migration, but
  set_task_cpu() and the enqueue on another rq happened after
  we checked task_cpu(owner). This ensures we don't proxy
  using a task that is not actually on our runqueue.
* Cleanup to avoid the locked BO_WAKING->BO_RUNNABLE transition
  in try_to_wake_up() if proxy execution isn't enabled.
* Cleanup to improve comment in proxy_migrate_task() explaining
  the set_next_task(rq->curr) logic
* Cleanup deadline.c change to stylistically match rt.c change
* Numerous cleanups suggested by Metin
v10:
* Drop WARN_ON(task_is_blocked(p)) in ttwu current case
v11:
* Include proxy_set_task_cpu from later in the series to this
  change so we can use it, rather then reworking logic later
  in the series.
* Fix problem with return migration, where affinity was changed
  and wake_cpu was left outside the affinity mask.
* Avoid reading the owner's cpu twice (as it might change inbetween)
  to avoid occasional migration-to-same-cpu edge cases
* Add extra WARN_ON checks for wake_cpu and return migration
  edge cases.
* Typo fix from Metin
v13:
* As we set ret, return it, not just NULL (pulling this change
  in from later patch)
* Avoid deadlock between try_to_wake_up() and find_proxy_task() when
  blocked_on cycle with ww_mutex is trying a mid-chain wakeup.
* Tweaks to use new __set_blocked_on_runnable() helper
* Potential fix for incorrectly updated task->dl_server issues
* Minor comment improvements
* Add logic to handle missed wakeups, in that case doing return
  migration from the find_proxy_task() path
* Minor cleanups
v14:
* Improve edge cases where we wouldn't set the task as BO_RUNNABLE
v15:
* Added comment to better describe proxy_needs_return() as suggested
  by Qais
* Build fixes for !CONFIG_SMP reported by
  Maciej Żenczykowski <maze@google.com>
* Adds fix for re-evaluating proxy_needs_return when
  sched_proxy_exec() is disabled, reported and diagnosed by:
  kuyo chang <kuyo.chang@mediatek.com>
v16:
* Larger rework of needs_return logic in find_proxy_task, in
  order to avoid problems with cpuhotplug
* Rework to use guard() as suggested by Peter
v18:
* Integrate optimization suggested by Suleiman to do the checks
  for sleeping owners before checking if the task_cpu is this_cpu,
  so that we can avoid needlessly proxy-migrating tasks to only
  then dequeue them. Also check if migrating last.
* Improve comments around guard locking
* Include tweak to ttwu_runnable() as suggested by
  hupu <hupu.gm@gmail.com>
* Rework the logic releasing the rq->donor reference before letting
  go of the rqlock. Just use rq->idle.
* Go back to doing return migration on BO_WAKING owners, as I was
  hitting some softlockups caused by running tasks not making
  it out of BO_WAKING.
v19:
* Fixed proxy_force_return() logic for !SMP cases
v21:
* Reworked donor deactivation for unhandled sleeping owners
* Commit message tweaks
v22:
* Add comments around zap_balance_callbacks in proxy_migration logic
* Rework logic to avoid gotos out of guard() scopes, and instead
  use break and switch() on action value, as suggested by K Prateek
* K Prateek suggested simplifications around putting donor and
  setting idle as next task in the migration paths, which I further
  simplified to using proxy_resched_idle()
* Comment improvements
* Dropped curr != donor check in pick_next_task_fair() suggested by
  K Prateek
v23:
* Rework to use the PROXY_WAKING approach suggested by Peter
* Drop unnecessarily setting wake_cpu when affinity changes
  as noticed by Peter
* Split out the ttwu() logic changes into a later separate patch
  as suggested by Peter

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/sched/core.c | 230 ++++++++++++++++++++++++++++++++++++++------
 1 file changed, 200 insertions(+), 30 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ab6e14259bdf2..3cf5e75abf21e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3682,6 +3682,23 @@ static inline void ttwu_do_wakeup(struct task_struct *p)
 	trace_sched_wakeup(p);
 }
 
+#ifdef CONFIG_SCHED_PROXY_EXEC
+static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
+{
+	unsigned int wake_cpu;
+
+	/*
+	 * Since we are enqueuing a blocked task on a cpu it may
+	 * not be able to run on, preserve wake_cpu when we
+	 * __set_task_cpu so we can return the task to where it
+	 * was previously runnable.
+	 */
+	wake_cpu = p->wake_cpu;
+	__set_task_cpu(p, cpu);
+	p->wake_cpu = wake_cpu;
+}
+#endif /* CONFIG_SCHED_PROXY_EXEC */
+
 static void
 ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 		 struct rq_flags *rf)
@@ -4293,13 +4310,6 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		ttwu_queue(p, cpu, wake_flags);
 	}
 out:
-	/*
-	 * For now, if we've been woken up, clear the task->blocked_on
-	 * regardless if it was set to a mutex or PROXY_WAKING so the
-	 * task can run. We will need to be more careful later when
-	 * properly handling proxy migration
-	 */
-	clear_task_blocked_on(p, NULL);
 	if (success)
 		ttwu_stat(p, task_cpu(p), wake_flags);
 
@@ -6602,7 +6612,7 @@ static inline struct task_struct *proxy_resched_idle(struct rq *rq)
 	return rq->idle;
 }
 
-static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
+static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
 {
 	unsigned long state = READ_ONCE(donor->__state);
 
@@ -6622,17 +6632,144 @@ static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
 	return try_to_block_task(rq, donor, &state, true);
 }
 
-static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *donor)
+/*
+ * If the blocked-on relationship crosses CPUs, migrate @p to the
+ * owner's CPU.
+ *
+ * This is because we must respect the CPU affinity of execution
+ * contexts (owner) but we can ignore affinity for scheduling
+ * contexts (@p). So we have to move scheduling contexts towards
+ * potential execution contexts.
+ *
+ * Note: The owner can disappear, but simply migrate to @target_cpu
+ * and leave that CPU to sort things out.
+ */
+static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
+			       struct task_struct *p, int target_cpu)
 {
-	if (!__proxy_deactivate(rq, donor)) {
-		/*
-		 * XXX: For now, if deactivation failed, set donor
-		 * as unblocked, as we aren't doing proxy-migrations
-		 * yet (more logic will be needed then).
-		 */
-		clear_task_blocked_on(donor, NULL);
+	struct rq *target_rq = cpu_rq(target_cpu);
+
+	lockdep_assert_rq_held(rq);
+
+	/*
+	 * Since we're going to drop @rq, we have to put(@rq->donor) first,
+	 * otherwise we have a reference that no longer belongs to us.
+	 *
+	 * Additionally, as we put_prev_task(prev) earlier, its possible that
+	 * prev will migrate away as soon as we drop the rq lock, however we
+	 * still have it marked as rq->curr, as we've not yet switched tasks.
+	 *
+	 * So call proxy_resched_idle() to let go of the references before
+	 * we release the lock.
+	 */
+	proxy_resched_idle(rq);
+
+	WARN_ON(p == rq->curr);
+
+	deactivate_task(rq, p, 0);
+	proxy_set_task_cpu(p, target_cpu);
+
+	/*
+	 * We have to zap callbacks before unlocking the rq
+	 * as another CPU may jump in and call sched_balance_rq
+	 * which can trip the warning in rq_pin_lock() if we
+	 * leave callbacks set.
+	 */
+	zap_balance_callbacks(rq);
+	rq_unpin_lock(rq, rf);
+	raw_spin_rq_unlock(rq);
+	raw_spin_rq_lock(target_rq);
+
+	activate_task(target_rq, p, 0);
+	wakeup_preempt(target_rq, p, 0);
+
+	raw_spin_rq_unlock(target_rq);
+	raw_spin_rq_lock(rq);
+	rq_repin_lock(rq, rf);
+}
+
+static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
+			       struct task_struct *p)
+{
+	struct rq *this_rq, *target_rq;
+	struct rq_flags this_rf;
+	int cpu, wake_flag = 0;
+
+	lockdep_assert_rq_held(rq);
+	WARN_ON(p == rq->curr);
+
+	get_task_struct(p);
+
+	/*
+	 * We have to zap callbacks before unlocking the rq
+	 * as another CPU may jump in and call sched_balance_rq
+	 * which can trip the warning in rq_pin_lock() if we
+	 * leave callbacks set.
+	 */
+	zap_balance_callbacks(rq);
+	rq_unpin_lock(rq, rf);
+	raw_spin_rq_unlock(rq);
+
+	/*
+	 * We drop the rq lock, and re-grab task_rq_lock to get
+	 * the pi_lock (needed for select_task_rq) as well.
+	 */
+	this_rq = task_rq_lock(p, &this_rf);
+	update_rq_clock(this_rq);
+
+	/*
+	 * Since we let go of the rq lock, the task may have been
+	 * woken or migrated to another rq before we  got the
+	 * task_rq_lock. So re-check we're on the same RQ. If
+	 * not, the task has already been migrated and that CPU
+	 * will handle any futher migrations.
+	 */
+	if (this_rq != rq)
+		goto err_out;
+
+	/* Similarly, if we've been dequeued, someone else will wake us */
+	if (!task_on_rq_queued(p))
+		goto err_out;
+
+	/*
+	 * Since we should only be calling here from __schedule()
+	 * -> find_proxy_task(), no one else should have
+	 * assigned current out from under us. But check and warn
+	 * if we see this, then bail.
+	 */
+	if (task_current(this_rq, p) || task_on_cpu(this_rq, p)) {
+		WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d  on_cpu: %i\n",
+			  __func__, cpu_of(this_rq),
+			  p->comm, p->pid, p->on_cpu);
+		goto err_out;
 	}
-	return NULL;
+
+	proxy_resched_idle(this_rq);
+	deactivate_task(this_rq, p, 0);
+	cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
+	set_task_cpu(p, cpu);
+	target_rq = cpu_rq(cpu);
+	clear_task_blocked_on(p, NULL);
+	task_rq_unlock(this_rq, p, &this_rf);
+
+	/* Drop this_rq and grab target_rq for activation */
+	raw_spin_rq_lock(target_rq);
+	activate_task(target_rq, p, 0);
+	wakeup_preempt(target_rq, p, 0);
+	put_task_struct(p);
+	raw_spin_rq_unlock(target_rq);
+
+	/* Finally, re-grab the origianl rq lock and return to pick-again */
+	raw_spin_rq_lock(rq);
+	rq_repin_lock(rq, rf);
+	return;
+
+err_out:
+	put_task_struct(p);
+	task_rq_unlock(this_rq, p, &this_rf);
+	raw_spin_rq_lock(rq);
+	rq_repin_lock(rq, rf);
+	return;
 }
 
 /*
@@ -6655,10 +6792,12 @@ static struct task_struct *
 find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 {
 	struct task_struct *owner = NULL;
+	bool curr_in_chain = false;
 	int this_cpu = cpu_of(rq);
 	struct task_struct *p;
 	struct mutex *mutex;
-	enum { FOUND, DEACTIVATE_DONOR } action = FOUND;
+	int owner_cpu;
+	enum { FOUND, DEACTIVATE_DONOR, MIGRATE, NEEDS_RETURN } action = FOUND;
 
 	/* Follow blocked_on chain. */
 	for (p = donor; task_is_blocked(p); p = owner) {
@@ -6667,9 +6806,15 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 		if (!mutex)
 			return NULL;
 
-		/* if its PROXY_WAKING, resched_idle so ttwu can complete */
-		if (mutex == PROXY_WAKING)
-			return proxy_resched_idle(rq);
+		/* if its PROXY_WAKING, do return migration or run if current */
+		if (mutex == PROXY_WAKING) {
+			if (task_current(rq, p)) {
+				clear_task_blocked_on(p, PROXY_WAKING);
+				return p;
+			}
+			action = NEEDS_RETURN;
+			break;
+		}
 
 		/*
 		 * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
@@ -6689,26 +6834,41 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 			return NULL;
 		}
 
+		if (task_current(rq, p))
+			curr_in_chain = true;
+
 		owner = __mutex_owner(mutex);
 		if (!owner) {
 			/*
-			 * If there is no owner, clear blocked_on
-			 * and return p so it can run and try to
-			 * acquire the lock
+			 * If there is no owner, either clear blocked_on
+			 * and return p (if it is current and safe to
+			 * just run on this rq), or return-migrate the task.
 			 */
-			__clear_task_blocked_on(p, mutex);
-			return p;
+			if (task_current(rq, p)) {
+				__clear_task_blocked_on(p, NULL);
+				return p;
+			}
+			action = NEEDS_RETURN;
+			break;
 		}
 
 		if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
 			/* XXX Don't handle blocked owners/delayed dequeue yet */
+			if (curr_in_chain)
+				return proxy_resched_idle(rq);
 			action = DEACTIVATE_DONOR;
 			break;
 		}
 
-		if (task_cpu(owner) != this_cpu) {
-			/* XXX Don't handle migrations yet */
-			action = DEACTIVATE_DONOR;
+		owner_cpu = task_cpu(owner);
+		if (owner_cpu != this_cpu) {
+			/*
+			 * @owner can disappear, simply migrate to @owner_cpu
+			 * and leave that CPU to sort things out.
+			 */
+			if (curr_in_chain)
+				return proxy_resched_idle(rq);
+			action = MIGRATE;
 			break;
 		}
 
@@ -6770,7 +6930,17 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 	/* Handle actions we need to do outside of the guard() scope */
 	switch (action) {
 	case DEACTIVATE_DONOR:
-		return proxy_deactivate(rq, donor);
+		if (proxy_deactivate(rq, donor))
+			return NULL;
+		/* If deactivate fails, force return */
+		p = donor;
+		fallthrough;
+	case NEEDS_RETURN:
+		proxy_force_return(rq, rf, p);
+		return NULL;
+	case MIGRATE:
+		proxy_migrate_task(rq, rf, p, owner_cpu);
+		return NULL;
 	case FOUND:
 		/* fallthrough */;
 	}
-- 
2.51.1.930.gacf6e81ea2-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v23 7/9] sched: Have try_to_wake_up() handle return-migration for PROXY_WAKING case
  2025-10-30  0:18 [PATCH v23 0/9] Donor Migration for Proxy Execution (v23) John Stultz
                   ` (5 preceding siblings ...)
  2025-10-30  0:18 ` [PATCH v23 6/9] sched: Handle blocked-waiter migration (and return migration) John Stultz
@ 2025-10-30  0:18 ` John Stultz
  2025-10-31  4:27   ` K Prateek Nayak
  2025-10-30  0:18 ` [PATCH v23 8/9] sched: Add blocked_donor link to task for smarter mutex handoffs John Stultz
  2025-10-30  0:18 ` [PATCH v23 9/9] sched: Migrate whole chain in proxy_migrate_task() John Stultz
  8 siblings, 1 reply; 36+ messages in thread
From: John Stultz @ 2025-10-30  0:18 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

This patch adds logic so try_to_wake_up() will notice if we are
waking a task where blocked_on == PROXY_WAKING, and if necessary
dequeue the task so the wakeup will naturally return-migrate the
donor task back to a cpu it can run on.

This helps performance as we do the dequeue and wakeup under the
locks normally taken in the try_to_wake_up() and avoids having
to do proxy_force_return() from __schedule(), which has to
re-take similar locks and then force a pick again loop.

This was split out from the larger proxy patch, and
significantly reworked.

Credits for the original patch go to:
  Peter Zijlstra (Intel) <peterz@infradead.org>
  Juri Lelli <juri.lelli@redhat.com>
  Valentin Schneider <valentin.schneider@arm.com>
  Connor O'Brien <connoro@google.com>

Signed-off-by: John Stultz <jstultz@google.com>
---
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/sched/core.c | 74 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 72 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3cf5e75abf21e..4546ceb8eae56 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3697,6 +3697,56 @@ static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
 	__set_task_cpu(p, cpu);
 	p->wake_cpu = wake_cpu;
 }
+
+static bool proxy_task_runnable_but_waking(struct task_struct *p)
+{
+	if (!sched_proxy_exec())
+		return false;
+	return (READ_ONCE(p->__state) == TASK_RUNNING &&
+		READ_ONCE(p->blocked_on) == PROXY_WAKING);
+}
+
+static inline struct task_struct *proxy_resched_idle(struct rq *rq);
+
+/*
+ * Checks to see if task p has been proxy-migrated to another rq
+ * and needs to be returned. If so, we deactivate the task here
+ * so that it can be properly woken up on the p->wake_cpu
+ * (or whichever cpu select_task_rq() picks at the bottom of
+ * try_to_wake_up()
+ */
+static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
+{
+	bool ret = false;
+
+	if (!sched_proxy_exec())
+		return false;
+
+	raw_spin_lock(&p->blocked_lock);
+	if (p->blocked_on == PROXY_WAKING) {
+		if (!task_current(rq, p) && p->wake_cpu != cpu_of(rq)) {
+			if (task_current_donor(rq, p))
+				proxy_resched_idle(rq);
+
+			deactivate_task(rq, p, DEQUEUE_NOCLOCK);
+			ret = true;
+		}
+		__clear_task_blocked_on(p, PROXY_WAKING);
+		resched_curr(rq);
+	}
+	raw_spin_unlock(&p->blocked_lock);
+	return ret;
+}
+#else /* !CONFIG_SCHED_PROXY_EXEC */
+static bool proxy_task_runnable_but_waking(struct task_struct *p)
+{
+	return false;
+}
+
+static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
+{
+	return false;
+}
 #endif /* CONFIG_SCHED_PROXY_EXEC */
 
 static void
@@ -3784,6 +3834,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
 		update_rq_clock(rq);
 		if (p->se.sched_delayed)
 			enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
+		if (proxy_needs_return(rq, p))
+			goto out;
 		if (!task_on_cpu(rq, p)) {
 			/*
 			 * When on_rq && !on_cpu the task is preempted, see if
@@ -3794,6 +3846,7 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
 		ttwu_do_wakeup(p);
 		ret = 1;
 	}
+out:
 	__task_rq_unlock(rq, &rf);
 
 	return ret;
@@ -3924,6 +3977,14 @@ static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
 		return false;
 #endif
 
+	/*
+	 * If we're PROXY_WAKING, we have deactivated on this cpu, so we should
+	 * activate it here as well, to avoid IPI'ing a cpu that is stuck in
+	 * task_rq_lock() spinning on p->on_rq, deadlocking that cpu.
+	 */
+	if (task_on_rq_migrating(p))
+		return false;
+
 	/*
 	 * Do not complicate things with the async wake_list while the CPU is
 	 * in hotplug state.
@@ -4181,6 +4242,8 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		 *    it disabling IRQs (this allows not taking ->pi_lock).
 		 */
 		WARN_ON_ONCE(p->se.sched_delayed);
+		/* If p is current, we know we can run here, so clear blocked_on */
+		clear_task_blocked_on(p, NULL);
 		if (!ttwu_state_match(p, state, &success))
 			goto out;
 
@@ -4197,8 +4260,15 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 */
 	scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
 		smp_mb__after_spinlock();
-		if (!ttwu_state_match(p, state, &success))
-			break;
+		if (!ttwu_state_match(p, state, &success)) {
+			/*
+			 * If we're already TASK_RUNNING, and PROXY_WAKING
+			 * continue on to ttwu_runnable check to force
+			 * proxy_needs_return evaluation
+			 */
+			if (!proxy_task_runnable_but_waking(p))
+				break;
+		}
 
 		trace_sched_waking(p);
 
-- 
2.51.1.930.gacf6e81ea2-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v23 8/9] sched: Add blocked_donor link to task for smarter mutex handoffs
  2025-10-30  0:18 [PATCH v23 0/9] Donor Migration for Proxy Execution (v23) John Stultz
                   ` (6 preceding siblings ...)
  2025-10-30  0:18 ` [PATCH v23 7/9] sched: Have try_to_wake_up() handle return-migration for PROXY_WAKING case John Stultz
@ 2025-10-30  0:18 ` John Stultz
  2025-10-31  5:01   ` K Prateek Nayak
  2025-10-30  0:18 ` [PATCH v23 9/9] sched: Migrate whole chain in proxy_migrate_task() John Stultz
  8 siblings, 1 reply; 36+ messages in thread
From: John Stultz @ 2025-10-30  0:18 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Juri Lelli, Valentin Schneider,
	Connor O'Brien, John Stultz, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

From: Peter Zijlstra <peterz@infradead.org>

Add link to the task this task is proxying for, and use it so
the mutex owner can do an intelligent hand-off of the mutex to
the task that the owner is running on behalf.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
[jstultz: This patch was split out from larger proxy patch]
Signed-off-by: John Stultz <jstultz@google.com>
---
v5:
* Split out from larger proxy patch
v6:
* Moved proxied value from earlier patch to this one where it
  is actually used
* Rework logic to check sched_proxy_exec() instead of using ifdefs
* Moved comment change to this patch where it makes sense
v7:
* Use more descriptive term then "us" in comments, as suggested
  by Metin Kaya.
* Minor typo fixup from Metin Kaya
* Reworked proxied variable to prev_not_proxied to simplify usage
v8:
* Use helper for donor blocked_on_state transition
v9:
* Re-add mutex lock handoff in the unlock path, but only when we
  have a blocked donor
* Slight reword of commit message suggested by Metin
v18:
* Add task_init initialization for blocked_donor, suggested by
  Suleiman
v23:
* Reworks for PROXY_WAKING approach suggested by PeterZ

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 include/linux/sched.h  |  1 +
 init/init_task.c       |  1 +
 kernel/fork.c          |  1 +
 kernel/locking/mutex.c | 44 +++++++++++++++++++++++++++++++++++++++---
 kernel/sched/core.c    | 18 +++++++++++++++--
 5 files changed, 60 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 863c54685684c..bac1b956027e2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1241,6 +1241,7 @@ struct task_struct {
 #endif
 
 	struct mutex			*blocked_on;	/* lock we're blocked on */
+	struct task_struct		*blocked_donor;	/* task that is boosting this task */
 	raw_spinlock_t			blocked_lock;
 
 #ifdef CONFIG_DETECT_HUNG_TASK_BLOCKER
diff --git a/init/init_task.c b/init/init_task.c
index 60477d74546e0..34853a511b4d8 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -177,6 +177,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	.mems_allowed_seq = SEQCNT_SPINLOCK_ZERO(init_task.mems_allowed_seq,
 						 &init_task.alloc_lock),
 #endif
+	.blocked_donor = NULL,
 #ifdef CONFIG_RT_MUTEXES
 	.pi_waiters	= RB_ROOT_CACHED,
 	.pi_top_task	= NULL,
diff --git a/kernel/fork.c b/kernel/fork.c
index 0697084be202f..0a9a17e25b85d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2136,6 +2136,7 @@ __latent_entropy struct task_struct *copy_process(
 	lockdep_init_task(p);
 
 	p->blocked_on = NULL; /* not blocked yet */
+	p->blocked_donor = NULL; /* nobody is boosting p yet */
 
 #ifdef CONFIG_BCACHE
 	p->sequential_io	= 0;
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 3cb9001d15119..08f438a54f56f 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -926,7 +926,7 @@ EXPORT_SYMBOL_GPL(ww_mutex_lock_interruptible);
  */
 static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigned long ip)
 {
-	struct task_struct *next = NULL;
+	struct task_struct *donor, *next = NULL;
 	DEFINE_WAKE_Q(wake_q);
 	unsigned long owner;
 	unsigned long flags;
@@ -945,6 +945,12 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 		MUTEX_WARN_ON(__owner_task(owner) != current);
 		MUTEX_WARN_ON(owner & MUTEX_FLAG_PICKUP);
 
+		if (sched_proxy_exec() && current->blocked_donor) {
+			/* force handoff if we have a blocked_donor */
+			owner = MUTEX_FLAG_HANDOFF;
+			break;
+		}
+
 		if (owner & MUTEX_FLAG_HANDOFF)
 			break;
 
@@ -958,7 +964,34 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 
 	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	debug_mutex_unlock(lock);
-	if (!list_empty(&lock->wait_list)) {
+
+	if (sched_proxy_exec()) {
+		raw_spin_lock(&current->blocked_lock);
+		/*
+		 * If we have a task boosting current, and that task was boosting
+		 * current through this lock, hand the lock to that task, as that
+		 * is the highest waiter, as selected by the scheduling function.
+		 */
+		donor = current->blocked_donor;
+		if (donor) {
+			struct mutex *next_lock;
+
+			raw_spin_lock_nested(&donor->blocked_lock, SINGLE_DEPTH_NESTING);
+			next_lock = __get_task_blocked_on(donor);
+			if (next_lock == lock) {
+				next = donor;
+				__set_task_blocked_on_waking(donor, next_lock);
+				wake_q_add(&wake_q, donor);
+				current->blocked_donor = NULL;
+			}
+			raw_spin_unlock(&donor->blocked_lock);
+		}
+	}
+
+	/*
+	 * Failing that, pick any on the wait list.
+	 */
+	if (!next && !list_empty(&lock->wait_list)) {
 		/* get the first entry from the wait-list: */
 		struct mutex_waiter *waiter =
 			list_first_entry(&lock->wait_list,
@@ -966,14 +999,19 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 
 		next = waiter->task;
 
+		raw_spin_lock_nested(&next->blocked_lock, SINGLE_DEPTH_NESTING);
 		debug_mutex_wake_waiter(lock, waiter);
-		set_task_blocked_on_waking(next, lock);
+		__set_task_blocked_on_waking(next, lock);
+		raw_spin_unlock(&next->blocked_lock);
 		wake_q_add(&wake_q, next);
+
 	}
 
 	if (owner & MUTEX_FLAG_HANDOFF)
 		__mutex_handoff(lock, next);
 
+	if (sched_proxy_exec())
+		raw_spin_unlock(&current->blocked_lock);
 	raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q);
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4546ceb8eae56..eabde9706981a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6846,7 +6846,17 @@ static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
  * Find runnable lock owner to proxy for mutex blocked donor
  *
  * Follow the blocked-on relation:
- *   task->blocked_on -> mutex->owner -> task...
+ *
+ *                ,-> task
+ *                |     | blocked-on
+ *                |     v
+ *  blocked_donor |   mutex
+ *                |     | owner
+ *                |     v
+ *                `-- task
+ *
+ * and set the blocked_donor relation, this latter is used by the mutex
+ * code to find which (blocked) task to hand-off to.
  *
  * Lock order:
  *
@@ -6995,6 +7005,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 		 * rq, therefore holding @rq->lock is sufficient to
 		 * guarantee its existence, as per ttwu_remote().
 		 */
+		owner->blocked_donor = p;
 	}
 
 	/* Handle actions we need to do outside of the guard() scope */
@@ -7095,6 +7106,7 @@ static void __sched notrace __schedule(int sched_mode)
 	unsigned long prev_state;
 	struct rq_flags rf;
 	struct rq *rq;
+	bool prev_not_proxied;
 	int cpu;
 
 	/* Trace preemptions consistently with task switches */
@@ -7167,10 +7179,12 @@ static void __sched notrace __schedule(int sched_mode)
 		switch_count = &prev->nvcsw;
 	}
 
+	prev_not_proxied = !prev->blocked_donor;
 pick_again:
 	assert_balance_callbacks_empty(rq);
 	next = pick_next_task(rq, rq->donor, &rf);
 	rq_set_donor(rq, next);
+	next->blocked_donor = NULL;
 	if (unlikely(task_is_blocked(next))) {
 		next = find_proxy_task(rq, next, &rf);
 		if (!next) {
@@ -7236,7 +7250,7 @@ static void __sched notrace __schedule(int sched_mode)
 		rq = context_switch(rq, prev, next, &rf);
 	} else {
 		/* In case next was already curr but just got blocked_donor */
-		if (!task_current_donor(rq, next))
+		if (prev_not_proxied && next->blocked_donor)
 			proxy_tag_curr(rq, next);
 
 		rq_unpin_lock(rq, &rf);
-- 
2.51.1.930.gacf6e81ea2-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v23 9/9] sched: Migrate whole chain in proxy_migrate_task()
  2025-10-30  0:18 [PATCH v23 0/9] Donor Migration for Proxy Execution (v23) John Stultz
                   ` (7 preceding siblings ...)
  2025-10-30  0:18 ` [PATCH v23 8/9] sched: Add blocked_donor link to task for smarter mutex handoffs John Stultz
@ 2025-10-30  0:18 ` John Stultz
  8 siblings, 0 replies; 36+ messages in thread
From: John Stultz @ 2025-10-30  0:18 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

Instead of migrating one task each time through find_proxy_task(),
we can walk up the blocked_donor ptrs and migrate the entire
current chain in one go.

This was broken out of earlier patches and held back while the
series was being stabilized, but I wanted to re-introduce it.

Signed-off-by: John Stultz <jstultz@google.com>
---
v12:
* Earlier this was re-using blocked_node, but I hit
  a race with activating blocked entities, and to
  avoid it introduced a new migration_node listhead
v18:
* Add init_task initialization of migration_node as suggested
  by Suleiman
v22:
* Move migration_node under CONFIG_SCHED_PROXY_EXEC as suggested
  by K Prateek
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 include/linux/sched.h |  3 +++
 init/init_task.c      |  3 +++
 kernel/fork.c         |  3 +++
 kernel/sched/core.c   | 25 +++++++++++++++++--------
 4 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index bac1b956027e2..cd2453c2085c1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1243,6 +1243,9 @@ struct task_struct {
 	struct mutex			*blocked_on;	/* lock we're blocked on */
 	struct task_struct		*blocked_donor;	/* task that is boosting this task */
 	raw_spinlock_t			blocked_lock;
+#ifdef CONFIG_SCHED_PROXY_EXEC
+	struct list_head		migration_node;
+#endif
 
 #ifdef CONFIG_DETECT_HUNG_TASK_BLOCKER
 	/*
diff --git a/init/init_task.c b/init/init_task.c
index 34853a511b4d8..78fb7cb83fa5d 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -178,6 +178,9 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 						 &init_task.alloc_lock),
 #endif
 	.blocked_donor = NULL,
+#ifdef CONFIG_SCHED_PROXY_EXEC
+	.migration_node = LIST_HEAD_INIT(init_task.migration_node),
+#endif
 #ifdef CONFIG_RT_MUTEXES
 	.pi_waiters	= RB_ROOT_CACHED,
 	.pi_top_task	= NULL,
diff --git a/kernel/fork.c b/kernel/fork.c
index 0a9a17e25b85d..a7561480e879e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2137,6 +2137,9 @@ __latent_entropy struct task_struct *copy_process(
 
 	p->blocked_on = NULL; /* not blocked yet */
 	p->blocked_donor = NULL; /* nobody is boosting p yet */
+#ifdef CONFIG_SCHED_PROXY_EXEC
+	INIT_LIST_HEAD(&p->migration_node);
+#endif
 
 #ifdef CONFIG_BCACHE
 	p->sequential_io	= 0;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index eabde9706981a..c202ae19b4ac8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6718,6 +6718,7 @@ static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
 			       struct task_struct *p, int target_cpu)
 {
 	struct rq *target_rq = cpu_rq(target_cpu);
+	LIST_HEAD(migrate_list);
 
 	lockdep_assert_rq_held(rq);
 
@@ -6734,11 +6735,16 @@ static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
 	 */
 	proxy_resched_idle(rq);
 
-	WARN_ON(p == rq->curr);
-
-	deactivate_task(rq, p, 0);
-	proxy_set_task_cpu(p, target_cpu);
-
+	for (; p; p = p->blocked_donor) {
+		WARN_ON(p == rq->curr);
+		deactivate_task(rq, p, 0);
+		proxy_set_task_cpu(p, target_cpu);
+		/*
+		 * We can abuse blocked_node to migrate the thing,
+		 * because @p was still on the rq.
+		 */
+		list_add(&p->migration_node, &migrate_list);
+	}
 	/*
 	 * We have to zap callbacks before unlocking the rq
 	 * as another CPU may jump in and call sched_balance_rq
@@ -6749,10 +6755,13 @@ static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
 	rq_unpin_lock(rq, rf);
 	raw_spin_rq_unlock(rq);
 	raw_spin_rq_lock(target_rq);
+	while (!list_empty(&migrate_list)) {
+		p = list_first_entry(&migrate_list, struct task_struct, migration_node);
+		list_del_init(&p->migration_node);
 
-	activate_task(target_rq, p, 0);
-	wakeup_preempt(target_rq, p, 0);
-
+		activate_task(target_rq, p, 0);
+		wakeup_preempt(target_rq, p, 0);
+	}
 	raw_spin_rq_unlock(target_rq);
 	raw_spin_rq_lock(rq);
 	rq_repin_lock(rq, rf);
-- 
2.51.1.930.gacf6e81ea2-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v23 2/9] sched: Fix modifying donor->blocked on without proper locking
  2025-10-30  0:18 ` [PATCH v23 2/9] sched: Fix modifying donor->blocked on without proper locking John Stultz
@ 2025-10-30  4:51   ` K Prateek Nayak
  2025-10-30 23:42     ` John Stultz
  0 siblings, 1 reply; 36+ messages in thread
From: K Prateek Nayak @ 2025-10-30  4:51 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello John,

On 10/30/2025 5:48 AM, John Stultz wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 517b26c515bc5..0533a14ce5935 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6591,7 +6591,7 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *d
>  		 * as unblocked, as we aren't doing proxy-migrations
>  		 * yet (more logic will be needed then).
>  		 */
> -		donor->blocked_on = NULL;
> +		clear_task_blocked_on(donor, NULL);

nit. You can probably switch this to use __clear_task_blocked_on() in
the previous patch and then to the clear_task_blocked_on() variant here.
It makes it more clear that proxy_deactivate() is now out of the
"donor->blocked_lock" critical section.

Either way, no strong feelings.

>  	}
>  	return NULL;
>  }
> @@ -6619,6 +6619,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>  	int this_cpu = cpu_of(rq);
>  	struct task_struct *p;
>  	struct mutex *mutex;
> +	enum { FOUND, DEACTIVATE_DONOR } action = FOUND;

nit. If you move that declaration to the top, you can preserve the nice
reverse xmas arrangement ;)

Apart from those couple of nits, feel free to include:

Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>

-- 
Thanks and Regards,
Prateek



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v23 3/9] sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration
  2025-10-30  0:18 ` [PATCH v23 3/9] sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration John Stultz
@ 2025-10-30  7:32   ` K Prateek Nayak
  2025-10-30 23:53     ` John Stultz
  0 siblings, 1 reply; 36+ messages in thread
From: K Prateek Nayak @ 2025-10-30  7:32 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello John,

On 10/30/2025 5:48 AM, John Stultz wrote:
> As we add functionality to proxy execution, we may migrate a
> donor task to a runqueue where it can't run due to cpu affinity.
> Thus, we must be careful to ensure we return-migrate the task
> back to a cpu in its cpumask when it becomes unblocked.
> 
> Peter helpfully provided the following example with pictures:
> "Suppose we have a ww_mutex cycle:
> 
>                   ,-+-* Mutex-1 <-.
>         Task-A ---' |             | ,-- Task-B
>                     `-> Mutex-2 *-+-'
> 
> Where Task-A holds Mutex-1 and tries to acquire Mutex-2, and
> where Task-B holds Mutex-2 and tries to acquire Mutex-1.
> 
> Then the blocked_on->owner chain will go in circles.
> 
>         Task-A  -> Mutex-2
>           ^          |
>           |          v
>         Mutex-1 <- Task-B
> 
> We need two things:
> 
>  - find_proxy_task() to stop iterating the circle;
> 
>  - the woken task to 'unblock' and run, such that it can
>    back-off and re-try the transaction.
> 
> Now, the current code [without this patch] does:
>         __clear_task_blocked_on();
>         wake_q_add();
> 
> And surely clearing ->blocked_on is sufficient to break the
> cycle.
> 
> Suppose it is Task-B that is made to back-off, then we have:
> 
>   Task-A -> Mutex-2 -> Task-B (no further blocked_on)
> 
> and it would attempt to run Task-B. Or worse, it could directly
> pick Task-B and run it, without ever getting into
> find_proxy_task().
> 
> Now, here is a problem because Task-B might not be runnable on
> the CPU it is currently on; and because !task_is_blocked() we
> don't get into the proxy paths, so nobody is going to fix this
> up.
> 
> Ideally we would have dequeued Task-B alongside of clearing
> ->blocked_on, but alas, [the lock ordering prevents us from
> getting the task_rq_lock() and] spoils things."
> 
> Thus we need more than just a binary concept of the task being
> blocked on a mutex or not.
> 
> So allow setting blocked_on to PROXY_WAKING as a special value
> which specifies the task is no longer blocked, but needs to
> be evaluated for return migration *before* it can be run.

Now I can truly appreciate the need for the tri-state with
that updated commit log. Thank you for the detailed explanation.
Feel free to include:

Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>

-- 
Thanks and Regards,
Prateek

> 
> This will then be used in a later patch to handle proxy
> return-migration.
> 
> Signed-off-by: John Stultz <jstultz@google.com>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v23 4/9] sched: Add assert_balance_callbacks_empty helper
  2025-10-30  0:18 ` [PATCH v23 4/9] sched: Add assert_balance_callbacks_empty helper John Stultz
@ 2025-10-30  7:38   ` K Prateek Nayak
  0 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2025-10-30  7:38 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello John,

On 10/30/2025 5:48 AM, John Stultz wrote:
> +#ifdef CONFIG_PROVE_LOCKING
> +static inline void assert_balance_callbacks_empty(struct rq *rq)
> +{
> +	WARN_ON_ONCE(rq->balance_callback && rq->balance_callback != &balance_push_callback);

Can we instead use "IS_ENABLED(CONFIG_PROVE_LOCKING) && ..." and avoid
the empty stub?

I see similar pattern used inside WARN_ON_ONCE() by RCU and
context-tracking bits so I'm assuming compiler is smart enough to
optimize it out when the config is disabled :)

Apart from that nit, feel free to include:

Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>

> +}
> +#else
> +static inline void assert_balance_callbacks_empty(struct rq *rq) {}
> +#endif
> +

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v23 5/9] sched: Add logic to zap balance callbacks if we pick again
  2025-10-30  0:18 ` [PATCH v23 5/9] sched: Add logic to zap balance callbacks if we pick again John Stultz
@ 2025-10-30  8:08   ` K Prateek Nayak
  2025-10-31  3:15     ` John Stultz
  0 siblings, 1 reply; 36+ messages in thread
From: K Prateek Nayak @ 2025-10-30  8:08 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello John,

On 10/30/2025 5:48 AM, John Stultz wrote:
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +/*
> + * Only called from __schedule context
> + *
> + * There are some cases where we are going to re-do the action
> + * that added the balance callbacks. We may not be in a state
> + * where we can run them, so just zap them so they can be
> + * properly re-added on the next time around. This is similar
> + * handling to running the callbacks, except we just don't call
> + * them.
> + */
> +static void zap_balance_callbacks(struct rq *rq)
> +{
> +	struct balance_callback *next, *head;
> +	bool found = false;
> +
> +	lockdep_assert_rq_held(rq);
> +
> +	head = rq->balance_callback;
> +	while (head) {
> +		if (head == &balance_push_callback)
> +			found = true;
> +		next = head->next;
> +		head->next = NULL;
> +		head = next;
> +	}
> +	rq->balance_callback = found ? &balance_push_callback : NULL;
> +}

There is nothing proxy-exec specific in this function. Do we need to
keep it behind CONFIG_SCHED_PROXY_EXEC?

I believe compiler will optimize out the dead code and having
zap_balance_callbacks() unconditionally shouldn't have make any
difference to the size of generated binary for
!CONFIG_SCHED_PROXY_EXEC case.

Apart from that nit. feel free to include:

Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>

> +#else
> +static inline void zap_balance_callbacks(struct rq *rq) {}
> +#endif
> +
>  static void do_balance_callbacks(struct rq *rq, struct balance_callback *head)
>  {
>  	void (*func)(struct rq *rq);
> @@ -6901,10 +6933,15 @@ static void __sched notrace __schedule(int sched_mode)
>  	rq_set_donor(rq, next);
>  	if (unlikely(task_is_blocked(next))) {
>  		next = find_proxy_task(rq, next, &rf);
> -		if (!next)
> +		if (!next) {
> +			/* zap the balance_callbacks before picking again */
> +			zap_balance_callbacks(rq);
>  			goto pick_again;
> -		if (next == rq->idle)
> +		}
> +		if (next == rq->idle) {
> +			zap_balance_callbacks(rq);

Also I would have preferred to have that zap_balance_callbacks() in
proxy_resched_idle() but this is okay too.

-- 
Thanks and Regards,
Prateek

>  			goto keep_resched;
> +		}
>  	}
>  picked:
>  	clear_tsk_need_resched(prev);



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v23 6/9] sched: Handle blocked-waiter migration (and return migration)
  2025-10-30  0:18 ` [PATCH v23 6/9] sched: Handle blocked-waiter migration (and return migration) John Stultz
@ 2025-10-30  9:32   ` K Prateek Nayak
  2025-11-07 23:18     ` John Stultz
  2025-11-07 15:19   ` Juri Lelli
  1 sibling, 1 reply; 36+ messages in thread
From: K Prateek Nayak @ 2025-10-30  9:32 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello John,

On 10/30/2025 5:48 AM, John Stultz wrote:
> -static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *donor)
> +/*
> + * If the blocked-on relationship crosses CPUs, migrate @p to the
> + * owner's CPU.
> + *
> + * This is because we must respect the CPU affinity of execution
> + * contexts (owner) but we can ignore affinity for scheduling
> + * contexts (@p). So we have to move scheduling contexts towards
> + * potential execution contexts.
> + *
> + * Note: The owner can disappear, but simply migrate to @target_cpu
> + * and leave that CPU to sort things out.
> + */
> +static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
> +			       struct task_struct *p, int target_cpu)
>  {
> -	if (!__proxy_deactivate(rq, donor)) {
> -		/*
> -		 * XXX: For now, if deactivation failed, set donor
> -		 * as unblocked, as we aren't doing proxy-migrations
> -		 * yet (more logic will be needed then).
> -		 */
> -		clear_task_blocked_on(donor, NULL);
> +	struct rq *target_rq = cpu_rq(target_cpu);
> +
> +	lockdep_assert_rq_held(rq);
> +
> +	/*
> +	 * Since we're going to drop @rq, we have to put(@rq->donor) first,
> +	 * otherwise we have a reference that no longer belongs to us.
> +	 *
> +	 * Additionally, as we put_prev_task(prev) earlier, its possible that
> +	 * prev will migrate away as soon as we drop the rq lock, however we
> +	 * still have it marked as rq->curr, as we've not yet switched tasks.
> +	 *
> +	 * So call proxy_resched_idle() to let go of the references before
> +	 * we release the lock.
> +	 */
> +	proxy_resched_idle(rq);
> +
> +	WARN_ON(p == rq->curr);
> +
> +	deactivate_task(rq, p, 0);

DEQUEUE_NOCLOCK since we arrive here with the clock updated from
schedule().

> +	proxy_set_task_cpu(p, target_cpu);
> +
> +	/*
> +	 * We have to zap callbacks before unlocking the rq
> +	 * as another CPU may jump in and call sched_balance_rq
> +	 * which can trip the warning in rq_pin_lock() if we
> +	 * leave callbacks set.
> +	 */
> +	zap_balance_callbacks(rq);
> +	rq_unpin_lock(rq, rf);
> +	raw_spin_rq_unlock(rq);
> +	raw_spin_rq_lock(target_rq);
> +
> +	activate_task(target_rq, p, 0);
> +	wakeup_preempt(target_rq, p, 0);
> +
> +	raw_spin_rq_unlock(target_rq);
> +	raw_spin_rq_lock(rq);
> +	rq_repin_lock(rq, rf);

We should perhaps update the clock once we've reacquired the rq_lock
given we are going into schedule() again for another pick.

> +}
> +
> +static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
> +			       struct task_struct *p)
> +{
> +	struct rq *this_rq, *target_rq;
> +	struct rq_flags this_rf;
> +	int cpu, wake_flag = 0;
> +
> +	lockdep_assert_rq_held(rq);
> +	WARN_ON(p == rq->curr);
> +
> +	get_task_struct(p);
> +
> +	/*
> +	 * We have to zap callbacks before unlocking the rq
> +	 * as another CPU may jump in and call sched_balance_rq
> +	 * which can trip the warning in rq_pin_lock() if we
> +	 * leave callbacks set.
> +	 */
> +	zap_balance_callbacks(rq);
> +	rq_unpin_lock(rq, rf);
> +	raw_spin_rq_unlock(rq);
> +
> +	/*
> +	 * We drop the rq lock, and re-grab task_rq_lock to get
> +	 * the pi_lock (needed for select_task_rq) as well.
> +	 */
> +	this_rq = task_rq_lock(p, &this_rf);
> +	update_rq_clock(this_rq);

I think we can delay the clock update until proxy_resched_idle().

> +
> +	/*
> +	 * Since we let go of the rq lock, the task may have been
> +	 * woken or migrated to another rq before we  got the
> +	 * task_rq_lock. So re-check we're on the same RQ. If
> +	 * not, the task has already been migrated and that CPU
> +	 * will handle any futher migrations.
> +	 */
> +	if (this_rq != rq)
> +		goto err_out;
> +
> +	/* Similarly, if we've been dequeued, someone else will wake us */
> +	if (!task_on_rq_queued(p))
> +		goto err_out;
> +
> +	/*
> +	 * Since we should only be calling here from __schedule()
> +	 * -> find_proxy_task(), no one else should have
> +	 * assigned current out from under us. But check and warn
> +	 * if we see this, then bail.
> +	 */
> +	if (task_current(this_rq, p) || task_on_cpu(this_rq, p)) {
> +		WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d  on_cpu: %i\n",
> +			  __func__, cpu_of(this_rq),
> +			  p->comm, p->pid, p->on_cpu);
> +		goto err_out;
>  	}
> -	return NULL;
> +
> +	proxy_resched_idle(this_rq);
> +	deactivate_task(this_rq, p, 0);

This should add DEQUEUE_NOCLOCK since we've already updated the rq clock
before the call.

> +	cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
> +	set_task_cpu(p, cpu);
> +	target_rq = cpu_rq(cpu);
> +	clear_task_blocked_on(p, NULL);
> +	task_rq_unlock(this_rq, p, &this_rf);
> +
> +	/* Drop this_rq and grab target_rq for activation */
> +	raw_spin_rq_lock(target_rq);
> +	activate_task(target_rq, p, 0);
> +	wakeup_preempt(target_rq, p, 0);
> +	put_task_struct(p);
> +	raw_spin_rq_unlock(target_rq);
> +
> +	/* Finally, re-grab the origianl rq lock and return to pick-again */
> +	raw_spin_rq_lock(rq);
> +	rq_repin_lock(rq, rf);
> +	return;
> +
> +err_out:
> +	put_task_struct(p);
> +	task_rq_unlock(this_rq, p, &this_rf);

I believe as long a we have the task_rq_lock(), the task cannot
disappear under us but are there any concern on doing a
put_task_struct() and then using the same task reference for
task_rq_unlock()?

> +	raw_spin_rq_lock(rq);
> +	rq_repin_lock(rq, rf);

Probably a clock update once we reacquire the rq_lock since we
go into schedule() again to retry pick?

> +	return;
>  }
>  
>  /*
> @@ -6655,10 +6792,12 @@ static struct task_struct *
>  find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>  {
>  	struct task_struct *owner = NULL;
> +	bool curr_in_chain = false;
>  	int this_cpu = cpu_of(rq);
>  	struct task_struct *p;
>  	struct mutex *mutex;
> -	enum { FOUND, DEACTIVATE_DONOR } action = FOUND;
> +	int owner_cpu;
> +	enum { FOUND, DEACTIVATE_DONOR, MIGRATE, NEEDS_RETURN } action = FOUND;
>  
>  	/* Follow blocked_on chain. */
>  	for (p = donor; task_is_blocked(p); p = owner) {
> @@ -6667,9 +6806,15 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>  		if (!mutex)
>  			return NULL;
>  
> -		/* if its PROXY_WAKING, resched_idle so ttwu can complete */
> -		if (mutex == PROXY_WAKING)
> -			return proxy_resched_idle(rq);
> +		/* if its PROXY_WAKING, do return migration or run if current */
> +		if (mutex == PROXY_WAKING) {
> +			if (task_current(rq, p)) {
> +				clear_task_blocked_on(p, PROXY_WAKING);
> +				return p;
> +			}
> +			action = NEEDS_RETURN;
> +			break;
> +		}
>  
>  		/*
>  		 * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
> @@ -6689,26 +6834,41 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>  			return NULL;
>  		}
>  
> +		if (task_current(rq, p))
> +			curr_in_chain = true;
> +
>  		owner = __mutex_owner(mutex);
>  		if (!owner) {
>  			/*
> -			 * If there is no owner, clear blocked_on
> -			 * and return p so it can run and try to
> -			 * acquire the lock
> +			 * If there is no owner, either clear blocked_on
> +			 * and return p (if it is current and safe to
> +			 * just run on this rq), or return-migrate the task.
>  			 */
> -			__clear_task_blocked_on(p, mutex);
> -			return p;
> +			if (task_current(rq, p)) {
> +				__clear_task_blocked_on(p, NULL);
> +				return p;
> +			}
> +			action = NEEDS_RETURN;
> +			break;
>  		}
>  
>  		if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {

Should we handle task_on_rq_migrating() in the similar way?
Wait for the owner to finish migrating and look at the
task_cpu(owner) once it is reliable?

>  			/* XXX Don't handle blocked owners/delayed dequeue yet */
> +			if (curr_in_chain)
> +				return proxy_resched_idle(rq);
>  			action = DEACTIVATE_DONOR;
>  			break;
>  		}
>  
> -		if (task_cpu(owner) != this_cpu) {
> -			/* XXX Don't handle migrations yet */
> -			action = DEACTIVATE_DONOR;
> +		owner_cpu = task_cpu(owner);
> +		if (owner_cpu != this_cpu) {
> +			/*
> +			 * @owner can disappear, simply migrate to @owner_cpu
> +			 * and leave that CPU to sort things out.
> +			 */
> +			if (curr_in_chain)
> +				return proxy_resched_idle(rq);
> +			action = MIGRATE;
>  			break;
>  		}
>  
> @@ -6770,7 +6930,17 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>  	/* Handle actions we need to do outside of the guard() scope */
>  	switch (action) {
>  	case DEACTIVATE_DONOR:
> -		return proxy_deactivate(rq, donor);
> +		if (proxy_deactivate(rq, donor))
> +			return NULL;
> +		/* If deactivate fails, force return */
> +		p = donor;
> +		fallthrough;
> +	case NEEDS_RETURN:
> +		proxy_force_return(rq, rf, p);
> +		return NULL;
> +	case MIGRATE:
> +		proxy_migrate_task(rq, rf, p, owner_cpu);
> +		return NULL;
>  	case FOUND:
>  		/* fallthrough */;
>  	}

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v23 2/9] sched: Fix modifying donor->blocked on without proper locking
  2025-10-30  4:51   ` K Prateek Nayak
@ 2025-10-30 23:42     ` John Stultz
  0 siblings, 0 replies; 36+ messages in thread
From: John Stultz @ 2025-10-30 23:42 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Wed, Oct 29, 2025 at 9:51 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 10/30/2025 5:48 AM, John Stultz wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 517b26c515bc5..0533a14ce5935 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -6591,7 +6591,7 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *d
> >                * as unblocked, as we aren't doing proxy-migrations
> >                * yet (more logic will be needed then).
> >                */
> > -             donor->blocked_on = NULL;
> > +             clear_task_blocked_on(donor, NULL);
>
> nit. You can probably switch this to use __clear_task_blocked_on() in
> the previous patch and then to the clear_task_blocked_on() variant here.
> It makes it more clear that proxy_deactivate() is now out of the
> "donor->blocked_lock" critical section.

The gotcha there is the __clear_task_blocked_on() will assert if we
don't hold the right lock. So this patch is sort of fixing it up to
allow for proper locking without cheating here.


> Either way, no strong feelings.
>
> >       }
> >       return NULL;
> >  }
> > @@ -6619,6 +6619,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> >       int this_cpu = cpu_of(rq);
> >       struct task_struct *p;
> >       struct mutex *mutex;
> > +     enum { FOUND, DEACTIVATE_DONOR } action = FOUND;
>
> nit. If you move that declaration to the top, you can preserve the nice
> reverse xmas arrangement ;)

Yeah, I meant to do that, but just overlooked it. Thanks for pointing it out.


> Apart from those couple of nits, feel free to include:
>
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>

As always, greatly appreciate your time for the review and feedback here!
Thank you!
-john

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v23 3/9] sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration
  2025-10-30  7:32   ` K Prateek Nayak
@ 2025-10-30 23:53     ` John Stultz
  0 siblings, 0 replies; 36+ messages in thread
From: John Stultz @ 2025-10-30 23:53 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Thu, Oct 30, 2025 at 12:33 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Now I can truly appreciate the need for the tri-state with
> that updated commit log. Thank you for the detailed explanation.

I'm glad, but the credit goes to Peter for his helpful explanation and
drawings I quoted from the last review cycle.

> Feel free to include:
>
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>

Thank you!
-john

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v23 5/9] sched: Add logic to zap balance callbacks if we pick again
  2025-10-30  8:08   ` K Prateek Nayak
@ 2025-10-31  3:15     ` John Stultz
  2025-10-31  3:50       ` K Prateek Nayak
  0 siblings, 1 reply; 36+ messages in thread
From: John Stultz @ 2025-10-31  3:15 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Thu, Oct 30, 2025 at 1:08 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 10/30/2025 5:48 AM, John Stultz wrote:
> > +#ifdef CONFIG_SCHED_PROXY_EXEC
> > +/*
> > + * Only called from __schedule context
> > + *
> > + * There are some cases where we are going to re-do the action
> > + * that added the balance callbacks. We may not be in a state
> > + * where we can run them, so just zap them so they can be
> > + * properly re-added on the next time around. This is similar
> > + * handling to running the callbacks, except we just don't call
> > + * them.
> > + */
> > +static void zap_balance_callbacks(struct rq *rq)
> > +{
> > +     struct balance_callback *next, *head;
> > +     bool found = false;
> > +
> > +     lockdep_assert_rq_held(rq);
> > +
> > +     head = rq->balance_callback;
> > +     while (head) {
> > +             if (head == &balance_push_callback)
> > +                     found = true;
> > +             next = head->next;
> > +             head->next = NULL;
> > +             head = next;
> > +     }
> > +     rq->balance_callback = found ? &balance_push_callback : NULL;
> > +}
>
> There is nothing proxy-exec specific in this function. Do we need to
> keep it behind CONFIG_SCHED_PROXY_EXEC?
>
> I believe compiler will optimize out the dead code and having
> zap_balance_callbacks() unconditionally shouldn't have make any
> difference to the size of generated binary for
> !CONFIG_SCHED_PROXY_EXEC case.

Good point. I'll drop that.


> Apart from that nit. feel free to include:
>
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>

Thank you!

> > @@ -6901,10 +6933,15 @@ static void __sched notrace __schedule(int sched_mode)
> >       rq_set_donor(rq, next);
> >       if (unlikely(task_is_blocked(next))) {
> >               next = find_proxy_task(rq, next, &rf);
> > -             if (!next)
> > +             if (!next) {
> > +                     /* zap the balance_callbacks before picking again */
> > +                     zap_balance_callbacks(rq);
> >                       goto pick_again;
> > -             if (next == rq->idle)
> > +             }
> > +             if (next == rq->idle) {
> > +                     zap_balance_callbacks(rq);
>
> Also I would have preferred to have that zap_balance_callbacks() in
> proxy_resched_idle() but this is okay too.

So my initial hesitation here is just we call proxy_resched_idle() in
other situations where we might return NULL from find_proxy_task() as
well. So this avoids calling zap_balance_callbacks() twice.

But thinking some more, later in the full series we often call
proxy_resched_idle() in those paths where we are briefly dropping the
rq lock and we often call zap_balance_callbacks as well there. I'll
take a closer look at the full patch series and see if that doesn't
make sense to consolidate then.  Not 100% sure it will work out, but
worth looking into.

Thanks for the suggestion!
-john

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v23 5/9] sched: Add logic to zap balance callbacks if we pick again
  2025-10-31  3:15     ` John Stultz
@ 2025-10-31  3:50       ` K Prateek Nayak
  0 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2025-10-31  3:50 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello John,

On 10/31/2025 8:45 AM, John Stultz wrote:
>>> -             if (next == rq->idle)
>>> +             }
>>> +             if (next == rq->idle) {
>>> +                     zap_balance_callbacks(rq);
>>
>> Also I would have preferred to have that zap_balance_callbacks() in
>> proxy_resched_idle() but this is okay too.
> 
> So my initial hesitation here is just we call proxy_resched_idle() in
> other situations where we might return NULL from find_proxy_task() as
> well. So this avoids calling zap_balance_callbacks() twice.
> 
> But thinking some more, later in the full series we often call
> proxy_resched_idle() in those paths where we are briefly dropping the
> rq lock and we often call zap_balance_callbacks as well there. I'll
> take a closer look at the full patch series and see if that doesn't
> make sense to consolidate then.  Not 100% sure it will work out, but
> worth looking into.

I don't have any strong feelings btw. What you have currently works
well. thank you for the additional background. I should go back and
take a look at the full tree again to get a full picture.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v23 7/9] sched: Have try_to_wake_up() handle return-migration for PROXY_WAKING case
  2025-10-30  0:18 ` [PATCH v23 7/9] sched: Have try_to_wake_up() handle return-migration for PROXY_WAKING case John Stultz
@ 2025-10-31  4:27   ` K Prateek Nayak
  2025-11-20  1:05     ` John Stultz
  0 siblings, 1 reply; 36+ messages in thread
From: K Prateek Nayak @ 2025-10-31  4:27 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello John,

On 10/30/2025 5:48 AM, John Stultz wrote:
> +/*
> + * Checks to see if task p has been proxy-migrated to another rq
> + * and needs to be returned. If so, we deactivate the task here
> + * so that it can be properly woken up on the p->wake_cpu
> + * (or whichever cpu select_task_rq() picks at the bottom of
> + * try_to_wake_up()
> + */
> +static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
> +{
> +	bool ret = false;
> +
> +	if (!sched_proxy_exec())
> +		return false;
> +
> +	raw_spin_lock(&p->blocked_lock);
> +	if (p->blocked_on == PROXY_WAKING) {
> +		if (!task_current(rq, p) && p->wake_cpu != cpu_of(rq)) {
> +			if (task_current_donor(rq, p))
> +				proxy_resched_idle(rq);
> +
> +			deactivate_task(rq, p, DEQUEUE_NOCLOCK);
> +			ret = true;
> +		}
> +		__clear_task_blocked_on(p, PROXY_WAKING);
> +		resched_curr(rq);

Do we need to resched_curr() if we've simply dequeued a waiting
owner who is neither the current, nor the donor? I would have
preferred if this function was similar to ttwu_queue_cond() with
each step explaining the the intent - something like:

static inline bool
proxy_needs_return(struct rq *rq, struct task_struct *p, int wake_flags)
{
	if (!sched_proxy_exec())
		return false;

	guard(raw_spinlock)(&p->blocked_lock);

	/* Task has been woken up / is running. */
	if (p->blocked_on != PROXY_WAKING)
		return false;

	__clear_task_blocked_on(p, PROXY_WAKING);

	/* Task is running, allow schedule() to reevaluate. */
	if (task_current(rq, p)) {
		resched_curr(rq);
		return false;
	}

	/*
	 * Task belongs to the same CPU.
	 * Check if it should be run now.
	 */
	if (p->wake_cpu == cpu_of(rq)) {
		wakeup_preempt(rq, p, wake_flags);
		return false;
	}

	/*
	 * Task was migrated from a different CPU for proxy.
	 * Block the task, and do full wakeup. If the task is
	 * the donor, ensure we call put_prev_task() before
	 * proceeding
	 */
	if (task_current_donor(rq, p))
		proxy_resched_idle(rq);

	/*
	 * (ab)Use DEQUEUE_SPECIAL to ensure task is always
	 * blocked here.
	 */
	block_task(rq, p, DEQUEUE_NOCLOCK | DEQUEUE_SPECIAL);
	return true;
}

I would wait for Peter to comment before changing all this up in case
I'm terribly wrong and am missing some subtleties. Also reason why we
may want to switch to block_task() is explained below.

> +	}
> +	raw_spin_unlock(&p->blocked_lock);
> +	return ret;
> +}
> +#else /* !CONFIG_SCHED_PROXY_EXEC */
> +static bool proxy_task_runnable_but_waking(struct task_struct *p)
> +{
> +	return false;
> +}
> +
> +static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
> +{
> +	return false;
> +}
>  #endif /* CONFIG_SCHED_PROXY_EXEC */
>  
>  static void
> @@ -3784,6 +3834,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
>  		update_rq_clock(rq);
>  		if (p->se.sched_delayed)
>  			enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> +		if (proxy_needs_return(rq, p))
> +			goto out;
>  		if (!task_on_cpu(rq, p)) {
>  			/*
>  			 * When on_rq && !on_cpu the task is preempted, see if
> @@ -3794,6 +3846,7 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
>  		ttwu_do_wakeup(p);
>  		ret = 1;
>  	}
> +out:
>  	__task_rq_unlock(rq, &rf);
>  
>  	return ret;
> @@ -3924,6 +3977,14 @@ static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
>  		return false;
>  #endif
>  
> +	/*
> +	 * If we're PROXY_WAKING, we have deactivated on this cpu, so we should
> +	 * activate it here as well, to avoid IPI'ing a cpu that is stuck in
> +	 * task_rq_lock() spinning on p->on_rq, deadlocking that cpu.
> +	 */

Sounds like block_task() would be better than deactivate_task() above
in that case. Anything that is waiting on the task's state change takes
the pi_lock afaik and the wakeup is always done with pi_lock held so
blocking the task shouldn't cause any problems based on my reading.

> +	if (task_on_rq_migrating(p))
> +		return false;
> +
>  	/*
>  	 * Do not complicate things with the async wake_list while the CPU is
>  	 * in hotplug state.
> @@ -4181,6 +4242,8 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  		 *    it disabling IRQs (this allows not taking ->pi_lock).
>  		 */
>  		WARN_ON_ONCE(p->se.sched_delayed);
> +		/* If p is current, we know we can run here, so clear blocked_on */
> +		clear_task_blocked_on(p, NULL);
>  		if (!ttwu_state_match(p, state, &success))
>  			goto out;
>  
> @@ -4197,8 +4260,15 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  	 */
>  	scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
>  		smp_mb__after_spinlock();
> -		if (!ttwu_state_match(p, state, &success))
> -			break;
> +		if (!ttwu_state_match(p, state, &success)) {
> +			/*
> +			 * If we're already TASK_RUNNING, and PROXY_WAKING
> +			 * continue on to ttwu_runnable check to force
> +			 * proxy_needs_return evaluation
> +			 */
> +			if (!proxy_task_runnable_but_waking(p))
> +				break;
> +		}
>  
>  		trace_sched_waking(p);
>  

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v23 8/9] sched: Add blocked_donor link to task for smarter mutex handoffs
  2025-10-30  0:18 ` [PATCH v23 8/9] sched: Add blocked_donor link to task for smarter mutex handoffs John Stultz
@ 2025-10-31  5:01   ` K Prateek Nayak
  2025-11-11  7:50     ` John Stultz
  0 siblings, 1 reply; 36+ messages in thread
From: K Prateek Nayak @ 2025-10-31  5:01 UTC (permalink / raw)
  To: Peter Zijlstra, John Stultz, LKML
  Cc: Juri Lelli, Valentin Schneider, Connor O'Brien,
	Joel Fernandes, Qais Yousef, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall,
	Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello Peter, John,

On 10/30/2025 5:48 AM, John Stultz wrote:
> @@ -958,7 +964,34 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
>  
>  	raw_spin_lock_irqsave(&lock->wait_lock, flags);
>  	debug_mutex_unlock(lock);
> -	if (!list_empty(&lock->wait_list)) {
> +
> +	if (sched_proxy_exec()) {
> +		raw_spin_lock(&current->blocked_lock);
> +		/*
> +		 * If we have a task boosting current, and that task was boosting
> +		 * current through this lock, hand the lock to that task, as that
> +		 * is the highest waiter, as selected by the scheduling function.
> +		 */
> +		donor = current->blocked_donor;
> +		if (donor) {

Any concerns on new waiters always appearing as donors and in-turn
starving the long time waiters on the list?

> +			struct mutex *next_lock;
> +
> +			raw_spin_lock_nested(&donor->blocked_lock, SINGLE_DEPTH_NESTING);
> +			next_lock = __get_task_blocked_on(donor);
> +			if (next_lock == lock) {
> +				next = donor;
> +				__set_task_blocked_on_waking(donor, next_lock);
> +				wake_q_add(&wake_q, donor);
> +				current->blocked_donor = NULL;
> +			}
> +			raw_spin_unlock(&donor->blocked_lock);
> +		}
> +	}
-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v23 6/9] sched: Handle blocked-waiter migration (and return migration)
  2025-10-30  0:18 ` [PATCH v23 6/9] sched: Handle blocked-waiter migration (and return migration) John Stultz
  2025-10-30  9:32   ` K Prateek Nayak
@ 2025-11-07 15:19   ` Juri Lelli
  2025-11-07 17:24     ` John Stultz
  1 sibling, 1 reply; 36+ messages in thread
From: Juri Lelli @ 2025-11-07 15:19 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hi,

On 30/10/25 00:18, John Stultz wrote:
> Add logic to handle migrating a blocked waiter to a remote
> cpu where the lock owner is runnable.
> 
> Additionally, as the blocked task may not be able to run
> on the remote cpu, add logic to handle return migration once
> the waiting task is given the mutex.
> 
> Because tasks may get migrated to where they cannot run, also
> modify the scheduling classes to avoid sched class migrations on
> mutex blocked tasks, leaving find_proxy_task() and related logic
> to do the migrations and return migrations.
> 
> This was split out from the larger proxy patch, and
> significantly reworked.
> 
> Credits for the original patch go to:
>   Peter Zijlstra (Intel) <peterz@infradead.org>
>   Juri Lelli <juri.lelli@redhat.com>
>   Valentin Schneider <valentin.schneider@arm.com>
>   Connor O'Brien <connoro@google.com>
> 
> Signed-off-by: John Stultz <jstultz@google.com>
> ---

...

> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
> +{
> +	unsigned int wake_cpu;
> +
> +	/*
> +	 * Since we are enqueuing a blocked task on a cpu it may
> +	 * not be able to run on, preserve wake_cpu when we
> +	 * __set_task_cpu so we can return the task to where it
> +	 * was previously runnable.
> +	 */
> +	wake_cpu = p->wake_cpu;
> +	__set_task_cpu(p, cpu);
> +	p->wake_cpu = wake_cpu;
> +}
> +#endif /* CONFIG_SCHED_PROXY_EXEC */
> +

...

> +static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
> +			       struct task_struct *p, int target_cpu)
>  {
> -	if (!__proxy_deactivate(rq, donor)) {
> -		/*
> -		 * XXX: For now, if deactivation failed, set donor
> -		 * as unblocked, as we aren't doing proxy-migrations
> -		 * yet (more logic will be needed then).
> -		 */
> -		clear_task_blocked_on(donor, NULL);
> +	struct rq *target_rq = cpu_rq(target_cpu);
> +
> +	lockdep_assert_rq_held(rq);
> +
> +	/*
> +	 * Since we're going to drop @rq, we have to put(@rq->donor) first,
> +	 * otherwise we have a reference that no longer belongs to us.
> +	 *
> +	 * Additionally, as we put_prev_task(prev) earlier, its possible that
> +	 * prev will migrate away as soon as we drop the rq lock, however we
> +	 * still have it marked as rq->curr, as we've not yet switched tasks.
> +	 *
> +	 * So call proxy_resched_idle() to let go of the references before
> +	 * we release the lock.
> +	 */
> +	proxy_resched_idle(rq);
> +
> +	WARN_ON(p == rq->curr);
> +
> +	deactivate_task(rq, p, 0);
> +	proxy_set_task_cpu(p, target_cpu);

We use proxy_set_task_cpu() here. BTW, can you comment/expand on why an
ad-hoc set_task_cpu() is needed for proxy?

> +
> +	/*
> +	 * We have to zap callbacks before unlocking the rq
> +	 * as another CPU may jump in and call sched_balance_rq
> +	 * which can trip the warning in rq_pin_lock() if we
> +	 * leave callbacks set.
> +	 */
> +	zap_balance_callbacks(rq);
> +	rq_unpin_lock(rq, rf);
> +	raw_spin_rq_unlock(rq);
> +	raw_spin_rq_lock(target_rq);
> +
> +	activate_task(target_rq, p, 0);
> +	wakeup_preempt(target_rq, p, 0);
> +
> +	raw_spin_rq_unlock(target_rq);
> +	raw_spin_rq_lock(rq);
> +	rq_repin_lock(rq, rf);
> +}
> +
> +static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
> +			       struct task_struct *p)
> +{
> +	struct rq *this_rq, *target_rq;
> +	struct rq_flags this_rf;
> +	int cpu, wake_flag = 0;
> +
> +	lockdep_assert_rq_held(rq);
> +	WARN_ON(p == rq->curr);
> +
> +	get_task_struct(p);
> +
> +	/*
> +	 * We have to zap callbacks before unlocking the rq
> +	 * as another CPU may jump in and call sched_balance_rq
> +	 * which can trip the warning in rq_pin_lock() if we
> +	 * leave callbacks set.
> +	 */
> +	zap_balance_callbacks(rq);
> +	rq_unpin_lock(rq, rf);
> +	raw_spin_rq_unlock(rq);
> +
> +	/*
> +	 * We drop the rq lock, and re-grab task_rq_lock to get
> +	 * the pi_lock (needed for select_task_rq) as well.
> +	 */
> +	this_rq = task_rq_lock(p, &this_rf);
> +	update_rq_clock(this_rq);
> +
> +	/*
> +	 * Since we let go of the rq lock, the task may have been
> +	 * woken or migrated to another rq before we  got the
> +	 * task_rq_lock. So re-check we're on the same RQ. If
> +	 * not, the task has already been migrated and that CPU
> +	 * will handle any futher migrations.
> +	 */
> +	if (this_rq != rq)
> +		goto err_out;
> +
> +	/* Similarly, if we've been dequeued, someone else will wake us */
> +	if (!task_on_rq_queued(p))
> +		goto err_out;
> +
> +	/*
> +	 * Since we should only be calling here from __schedule()
> +	 * -> find_proxy_task(), no one else should have
> +	 * assigned current out from under us. But check and warn
> +	 * if we see this, then bail.
> +	 */
> +	if (task_current(this_rq, p) || task_on_cpu(this_rq, p)) {
> +		WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d  on_cpu: %i\n",
> +			  __func__, cpu_of(this_rq),
> +			  p->comm, p->pid, p->on_cpu);
> +		goto err_out;
>  	}
> -	return NULL;
> +
> +	proxy_resched_idle(this_rq);
> +	deactivate_task(this_rq, p, 0);
> +	cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
> +	set_task_cpu(p, cpu);

But, then use the 'standard' set_task_cpu() for the return migration. Is
that intended?

> +	target_rq = cpu_rq(cpu);
> +	clear_task_blocked_on(p, NULL);
> +	task_rq_unlock(this_rq, p, &this_rf);
> +
> +	/* Drop this_rq and grab target_rq for activation */
> +	raw_spin_rq_lock(target_rq);
> +	activate_task(target_rq, p, 0);
> +	wakeup_preempt(target_rq, p, 0);
> +	put_task_struct(p);
> +	raw_spin_rq_unlock(target_rq);
> +
> +	/* Finally, re-grab the origianl rq lock and return to pick-again */
> +	raw_spin_rq_lock(rq);
> +	rq_repin_lock(rq, rf);
> +	return;
> +
> +err_out:
> +	put_task_struct(p);
> +	task_rq_unlock(this_rq, p, &this_rf);
> +	raw_spin_rq_lock(rq);
> +	rq_repin_lock(rq, rf);
> +	return;

Thanks,
Juri


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v23 6/9] sched: Handle blocked-waiter migration (and return migration)
  2025-11-07 15:19   ` Juri Lelli
@ 2025-11-07 17:24     ` John Stultz
  0 siblings, 0 replies; 36+ messages in thread
From: John Stultz @ 2025-11-07 17:24 UTC (permalink / raw)
  To: Juri Lelli
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Fri, Nov 7, 2025 at 7:19 AM Juri Lelli <juri.lelli@redhat.com> wrote:
> On 30/10/25 00:18, John Stultz wrote:
> > +#ifdef CONFIG_SCHED_PROXY_EXEC
> > +static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
> > +{
> > +     unsigned int wake_cpu;
> > +
> > +     /*
> > +      * Since we are enqueuing a blocked task on a cpu it may
> > +      * not be able to run on, preserve wake_cpu when we
> > +      * __set_task_cpu so we can return the task to where it
> > +      * was previously runnable.
> > +      */
> > +     wake_cpu = p->wake_cpu;
> > +     __set_task_cpu(p, cpu);
> > +     p->wake_cpu = wake_cpu;
> > +}
> > +#endif /* CONFIG_SCHED_PROXY_EXEC */
> > +
>
> ...
>
> > +static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
> > +                            struct task_struct *p, int target_cpu)
> >  {
> > -     if (!__proxy_deactivate(rq, donor)) {
> > -             /*
> > -              * XXX: For now, if deactivation failed, set donor
> > -              * as unblocked, as we aren't doing proxy-migrations
> > -              * yet (more logic will be needed then).
> > -              */
> > -             clear_task_blocked_on(donor, NULL);
> > +     struct rq *target_rq = cpu_rq(target_cpu);
> > +
> > +     lockdep_assert_rq_held(rq);
> > +
> > +     /*
> > +      * Since we're going to drop @rq, we have to put(@rq->donor) first,
> > +      * otherwise we have a reference that no longer belongs to us.
> > +      *
> > +      * Additionally, as we put_prev_task(prev) earlier, its possible that
> > +      * prev will migrate away as soon as we drop the rq lock, however we
> > +      * still have it marked as rq->curr, as we've not yet switched tasks.
> > +      *
> > +      * So call proxy_resched_idle() to let go of the references before
> > +      * we release the lock.
> > +      */
> > +     proxy_resched_idle(rq);
> > +
> > +     WARN_ON(p == rq->curr);
> > +
> > +     deactivate_task(rq, p, 0);
> > +     proxy_set_task_cpu(p, target_cpu);
>
> We use proxy_set_task_cpu() here. BTW, can you comment/expand on why an
> ad-hoc set_task_cpu() is needed for proxy?

Since with proxy, we keep blocked waiters on the runqueue,
proxy-migrations may move those lock waiters to cpu runqueues where
they can't run. Ideally when the mutex is released, we want the waiter
to wake up on the same cpu it would have woken on if it has been
blocked.  So for proxy-migrations, we want to preserve the wake_cpu
value when we change the task_cpu.

For instance, if we were proxy-migrated as a donor to a cpu outside of
our affinity set, we don't want the wake_cpu hint to suggest that we
should be woken on a cpu we can't run on.

> > +static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
> > +                            struct task_struct *p)
> > +{
> > +     struct rq *this_rq, *target_rq;
> > +     struct rq_flags this_rf;
> > +     int cpu, wake_flag = 0;
> > +
> > +     lockdep_assert_rq_held(rq);
> > +     WARN_ON(p == rq->curr);
> > +
> > +     get_task_struct(p);
> > +
> > +     /*
> > +      * We have to zap callbacks before unlocking the rq
> > +      * as another CPU may jump in and call sched_balance_rq
> > +      * which can trip the warning in rq_pin_lock() if we
> > +      * leave callbacks set.
> > +      */
> > +     zap_balance_callbacks(rq);
> > +     rq_unpin_lock(rq, rf);
> > +     raw_spin_rq_unlock(rq);
> > +
> > +     /*
> > +      * We drop the rq lock, and re-grab task_rq_lock to get
> > +      * the pi_lock (needed for select_task_rq) as well.
> > +      */
> > +     this_rq = task_rq_lock(p, &this_rf);
> > +     update_rq_clock(this_rq);
> > +
> > +     /*
> > +      * Since we let go of the rq lock, the task may have been
> > +      * woken or migrated to another rq before we  got the
> > +      * task_rq_lock. So re-check we're on the same RQ. If
> > +      * not, the task has already been migrated and that CPU
> > +      * will handle any futher migrations.
> > +      */
> > +     if (this_rq != rq)
> > +             goto err_out;
> > +
> > +     /* Similarly, if we've been dequeued, someone else will wake us */
> > +     if (!task_on_rq_queued(p))
> > +             goto err_out;
> > +
> > +     /*
> > +      * Since we should only be calling here from __schedule()
> > +      * -> find_proxy_task(), no one else should have
> > +      * assigned current out from under us. But check and warn
> > +      * if we see this, then bail.
> > +      */
> > +     if (task_current(this_rq, p) || task_on_cpu(this_rq, p)) {
> > +             WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d  on_cpu: %i\n",
> > +                       __func__, cpu_of(this_rq),
> > +                       p->comm, p->pid, p->on_cpu);
> > +             goto err_out;
> >       }
> > -     return NULL;
> > +
> > +     proxy_resched_idle(this_rq);
> > +     deactivate_task(this_rq, p, 0);
> > +     cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
> > +     set_task_cpu(p, cpu);
>
> But, then use the 'standard' set_task_cpu() for the return migration. Is
> that intended?

Yep. On return migration, we want the task to be returned to the
runqueue it would have woken on (which wake_cpu provides a hint, but
select_task_rq can always choose somewhere else more appropriate).
Once the cpu has been selected, it's then fine if the wake_cpu is then
set to cpu the task now being assigned to.

thanks
-john

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v23 6/9] sched: Handle blocked-waiter migration (and return migration)
  2025-10-30  9:32   ` K Prateek Nayak
@ 2025-11-07 23:18     ` John Stultz
  2025-11-10  4:47       ` K Prateek Nayak
  0 siblings, 1 reply; 36+ messages in thread
From: John Stultz @ 2025-11-07 23:18 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Thu, Oct 30, 2025 at 2:33 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 10/30/2025 5:48 AM, John Stultz wrote:
> > -static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *donor)
> > +/*
> > + * If the blocked-on relationship crosses CPUs, migrate @p to the
> > + * owner's CPU.
> > + *
> > + * This is because we must respect the CPU affinity of execution
> > + * contexts (owner) but we can ignore affinity for scheduling
> > + * contexts (@p). So we have to move scheduling contexts towards
> > + * potential execution contexts.
> > + *
> > + * Note: The owner can disappear, but simply migrate to @target_cpu
> > + * and leave that CPU to sort things out.
> > + */
> > +static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
> > +                            struct task_struct *p, int target_cpu)
> >  {
> > -     if (!__proxy_deactivate(rq, donor)) {
> > -             /*
> > -              * XXX: For now, if deactivation failed, set donor
> > -              * as unblocked, as we aren't doing proxy-migrations
> > -              * yet (more logic will be needed then).
> > -              */
> > -             clear_task_blocked_on(donor, NULL);
> > +     struct rq *target_rq = cpu_rq(target_cpu);
> > +
> > +     lockdep_assert_rq_held(rq);
> > +
> > +     /*
> > +      * Since we're going to drop @rq, we have to put(@rq->donor) first,
> > +      * otherwise we have a reference that no longer belongs to us.
> > +      *
> > +      * Additionally, as we put_prev_task(prev) earlier, its possible that
> > +      * prev will migrate away as soon as we drop the rq lock, however we
> > +      * still have it marked as rq->curr, as we've not yet switched tasks.
> > +      *
> > +      * So call proxy_resched_idle() to let go of the references before
> > +      * we release the lock.
> > +      */
> > +     proxy_resched_idle(rq);
> > +
> > +     WARN_ON(p == rq->curr);
> > +
> > +     deactivate_task(rq, p, 0);
>
> DEQUEUE_NOCLOCK since we arrive here with the clock updated from
> schedule().

Ah, good point.

> > +     proxy_set_task_cpu(p, target_cpu);
> > +
> > +     /*
> > +      * We have to zap callbacks before unlocking the rq
> > +      * as another CPU may jump in and call sched_balance_rq
> > +      * which can trip the warning in rq_pin_lock() if we
> > +      * leave callbacks set.
> > +      */
> > +     zap_balance_callbacks(rq);
> > +     rq_unpin_lock(rq, rf);
> > +     raw_spin_rq_unlock(rq);
> > +     raw_spin_rq_lock(target_rq);
> > +
> > +     activate_task(target_rq, p, 0);
> > +     wakeup_preempt(target_rq, p, 0);
> > +
> > +     raw_spin_rq_unlock(target_rq);
> > +     raw_spin_rq_lock(rq);
> > +     rq_repin_lock(rq, rf);
>
> We should perhaps update the clock once we've reacquired the rq_lock
> given we are going into schedule() again for another pick.

No objection to adding this, though I wonder why I'm not hitting the
usual warnings. I'll have to think through that a bit.


> > +static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
> > +                            struct task_struct *p)
> > +{
> > +     struct rq *this_rq, *target_rq;
> > +     struct rq_flags this_rf;
> > +     int cpu, wake_flag = 0;
> > +
> > +     lockdep_assert_rq_held(rq);
> > +     WARN_ON(p == rq->curr);
> > +
> > +     get_task_struct(p);
> > +
> > +     /*
> > +      * We have to zap callbacks before unlocking the rq
> > +      * as another CPU may jump in and call sched_balance_rq
> > +      * which can trip the warning in rq_pin_lock() if we
> > +      * leave callbacks set.
> > +      */
> > +     zap_balance_callbacks(rq);
> > +     rq_unpin_lock(rq, rf);
> > +     raw_spin_rq_unlock(rq);
> > +
> > +     /*
> > +      * We drop the rq lock, and re-grab task_rq_lock to get
> > +      * the pi_lock (needed for select_task_rq) as well.
> > +      */
> > +     this_rq = task_rq_lock(p, &this_rf);
> > +     update_rq_clock(this_rq);
>
> I think we can delay the clock update until proxy_resched_idle().

You're thinking just to avoid it if we trip into the error paths?  Sounds good.


> > +
> > +     /*
> > +      * Since we let go of the rq lock, the task may have been
> > +      * woken or migrated to another rq before we  got the
> > +      * task_rq_lock. So re-check we're on the same RQ. If
> > +      * not, the task has already been migrated and that CPU
> > +      * will handle any futher migrations.
> > +      */
> > +     if (this_rq != rq)
> > +             goto err_out;
> > +
> > +     /* Similarly, if we've been dequeued, someone else will wake us */
> > +     if (!task_on_rq_queued(p))
> > +             goto err_out;
> > +
> > +     /*
> > +      * Since we should only be calling here from __schedule()
> > +      * -> find_proxy_task(), no one else should have
> > +      * assigned current out from under us. But check and warn
> > +      * if we see this, then bail.
> > +      */
> > +     if (task_current(this_rq, p) || task_on_cpu(this_rq, p)) {
> > +             WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d  on_cpu: %i\n",
> > +                       __func__, cpu_of(this_rq),
> > +                       p->comm, p->pid, p->on_cpu);
> > +             goto err_out;
> >       }
> > -     return NULL;
> > +
> > +     proxy_resched_idle(this_rq);
> > +     deactivate_task(this_rq, p, 0);
>
> This should add DEQUEUE_NOCLOCK since we've already updated the rq clock
> before the call.

Ack.


> > +     cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
> > +     set_task_cpu(p, cpu);
> > +     target_rq = cpu_rq(cpu);
> > +     clear_task_blocked_on(p, NULL);
> > +     task_rq_unlock(this_rq, p, &this_rf);
> > +
> > +     /* Drop this_rq and grab target_rq for activation */
> > +     raw_spin_rq_lock(target_rq);
> > +     activate_task(target_rq, p, 0);
> > +     wakeup_preempt(target_rq, p, 0);
> > +     put_task_struct(p);
> > +     raw_spin_rq_unlock(target_rq);
> > +
> > +     /* Finally, re-grab the origianl rq lock and return to pick-again */
> > +     raw_spin_rq_lock(rq);
> > +     rq_repin_lock(rq, rf);
> > +     return;
> > +
> > +err_out:
> > +     put_task_struct(p);
> > +     task_rq_unlock(this_rq, p, &this_rf);
>
> I believe as long a we have the task_rq_lock(), the task cannot
> disappear under us but are there any concern on doing a
> put_task_struct() and then using the same task reference for
> task_rq_unlock()?

Yeah. Seems proper to do it the way you're suggesting.


> > +     raw_spin_rq_lock(rq);
> > +     rq_repin_lock(rq, rf);
>
> Probably a clock update once we reacquire the rq_lock since we
> go into schedule() again to retry pick?
>

Ack.


> > @@ -6689,26 +6834,41 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> >                       return NULL;
> >               }
> >
> > +             if (task_current(rq, p))
> > +                     curr_in_chain = true;
> > +
> >               owner = __mutex_owner(mutex);
> >               if (!owner) {
> >                       /*
> > -                      * If there is no owner, clear blocked_on
> > -                      * and return p so it can run and try to
> > -                      * acquire the lock
> > +                      * If there is no owner, either clear blocked_on
> > +                      * and return p (if it is current and safe to
> > +                      * just run on this rq), or return-migrate the task.
> >                        */
> > -                     __clear_task_blocked_on(p, mutex);
> > -                     return p;
> > +                     if (task_current(rq, p)) {
> > +                             __clear_task_blocked_on(p, NULL);
> > +                             return p;
> > +                     }
> > +                     action = NEEDS_RETURN;
> > +                     break;
> >               }
> >
> >               if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
>
> Should we handle task_on_rq_migrating() in the similar way?
> Wait for the owner to finish migrating and look at the
> task_cpu(owner) once it is reliable?

Hrm. I'm not quite sure I understand your suggestion here. Could you
expand a bit here? Are you thinking we should deactivate the donor
when the owner is migrating? What would then return the donor to the
runqueue? Just rescheduling idle so that we drop the rq lock
momentarily should be sufficient to make sure the owner can finish
migration.

As always, I really appreciate your review and feedback!
Thanks so much!
-john

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v23 6/9] sched: Handle blocked-waiter migration (and return migration)
  2025-11-07 23:18     ` John Stultz
@ 2025-11-10  4:47       ` K Prateek Nayak
  2025-11-20  1:53         ` John Stultz
  0 siblings, 1 reply; 36+ messages in thread
From: K Prateek Nayak @ 2025-11-10  4:47 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello John,

On 11/8/2025 4:48 AM, John Stultz wrote:
>>> @@ -6689,26 +6834,41 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>>>                       return NULL;
>>>               }
>>>
>>> +             if (task_current(rq, p))
>>> +                     curr_in_chain = true;
>>> +
>>>               owner = __mutex_owner(mutex);
>>>               if (!owner) {
>>>                       /*
>>> -                      * If there is no owner, clear blocked_on
>>> -                      * and return p so it can run and try to
>>> -                      * acquire the lock
>>> +                      * If there is no owner, either clear blocked_on
>>> +                      * and return p (if it is current and safe to
>>> +                      * just run on this rq), or return-migrate the task.
>>>                        */
>>> -                     __clear_task_blocked_on(p, mutex);
>>> -                     return p;
>>> +                     if (task_current(rq, p)) {
>>> +                             __clear_task_blocked_on(p, NULL);
>>> +                             return p;
>>> +                     }
>>> +                     action = NEEDS_RETURN;
>>> +                     break;
>>>               }
>>>
>>>               if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
>>
>> Should we handle task_on_rq_migrating() in the similar way?
>> Wait for the owner to finish migrating and look at the
>> task_cpu(owner) once it is reliable?
> 
> Hrm. I'm not quite sure I understand your suggestion here. Could you
> expand a bit here? Are you thinking we should deactivate the donor
> when the owner is migrating? What would then return the donor to the
> runqueue? Just rescheduling idle so that we drop the rq lock
> momentarily should be sufficient to make sure the owner can finish
> migration.

In find_proxy_task() we have:

  if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
    /* Returns rq->idle or NULL */
  }

  /*
   * Owner can be task_on_rq_migrating() at this point
   * since it is in turn blocked on a lock owner on a
   * different CPU.
   */

  owner_cpu = task_cpu(owner); /* Prev CPU */
  if (owner_cpu != this_cpu) {
    ...
    action = MIGRATE;
    break;
  }


So in the end we can migrate to the previous CPU of the owner
and the previous CPU has to do a chain migration again. I'm
probably overthinking about a very unlikely scenario here :)

Unfortunately, I don't really have a great way to detect it
unless we have another member in the task_struct that follows
task_cpu() for most part and is set to the "owner_cpu" as
soon as we know we are going for the "MIGRATE" action when we
are still under the "wait_lock"/"blocked_on_lock".

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v23 8/9] sched: Add blocked_donor link to task for smarter mutex handoffs
  2025-10-31  5:01   ` K Prateek Nayak
@ 2025-11-11  7:50     ` John Stultz
  2025-11-11  8:35       ` K Prateek Nayak
  0 siblings, 1 reply; 36+ messages in thread
From: John Stultz @ 2025-11-11  7:50 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Peter Zijlstra, LKML, Juri Lelli, Valentin Schneider,
	Connor O'Brien, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Thu, Oct 30, 2025 at 10:03 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 10/30/2025 5:48 AM, John Stultz wrote:
> > @@ -958,7 +964,34 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
> >
> >       raw_spin_lock_irqsave(&lock->wait_lock, flags);
> >       debug_mutex_unlock(lock);
> > -     if (!list_empty(&lock->wait_list)) {
> > +
> > +     if (sched_proxy_exec()) {
> > +             raw_spin_lock(&current->blocked_lock);
> > +             /*
> > +              * If we have a task boosting current, and that task was boosting
> > +              * current through this lock, hand the lock to that task, as that
> > +              * is the highest waiter, as selected by the scheduling function.
> > +              */
> > +             donor = current->blocked_donor;
> > +             if (donor) {
>
> Any concerns on new waiters always appearing as donors and in-turn
> starving the long time waiters on the list?
>
Hey!
  So I'm not sure I'm quite following the concern. The scheduler picks
"the most important task in that moment" to run and we just want to
try to to pass the lock off to the donor that was boosting the lock
owner.

Are you concerned that new waiters would somehow always be further
left on the rq and would be selected as donors before other waiters on
the rq? I can sort of see the argument that "new waiters" would be
running until they try to get the lock, so they may have timeslice
still available to make them attractive to the scheduler as a donor,
so maybe they would be re-selected as a donor right away. But I'm
assuming at some point the fairness is going to cycle any waiting
donors up to the front of the rq to act as a donor and then have the
lock handed off.

That said, I don't see how this would be very much different from
new/running optimistic spinners having a better chance at grabbing a
lock then waiting tasks that aren't running.

But let me know more about what you're thinking of, as I'd like to
better understand it and see if I could contrive a test to produce
this sort of concerning behavior.

thanks
-john

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v23 8/9] sched: Add blocked_donor link to task for smarter mutex handoffs
  2025-11-11  7:50     ` John Stultz
@ 2025-11-11  8:35       ` K Prateek Nayak
  0 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2025-11-11  8:35 UTC (permalink / raw)
  To: John Stultz
  Cc: Peter Zijlstra, LKML, Juri Lelli, Valentin Schneider,
	Connor O'Brien, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello John,

On 11/11/2025 1:20 PM, John Stultz wrote:
> On Thu, Oct 30, 2025 at 10:03 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>> On 10/30/2025 5:48 AM, John Stultz wrote:
>>> @@ -958,7 +964,34 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
>>>
>>>       raw_spin_lock_irqsave(&lock->wait_lock, flags);
>>>       debug_mutex_unlock(lock);
>>> -     if (!list_empty(&lock->wait_list)) {
>>> +
>>> +     if (sched_proxy_exec()) {
>>> +             raw_spin_lock(&current->blocked_lock);
>>> +             /*
>>> +              * If we have a task boosting current, and that task was boosting
>>> +              * current through this lock, hand the lock to that task, as that
>>> +              * is the highest waiter, as selected by the scheduling function.
>>> +              */
>>> +             donor = current->blocked_donor;
>>> +             if (donor) {
>>
>> Any concerns on new waiters always appearing as donors and in-turn
>> starving the long time waiters on the list?
>>
> Hey!
>   So I'm not sure I'm quite following the concern. The scheduler picks
> "the most important task in that moment" to run and we just want to
> try to to pass the lock off to the donor that was boosting the lock
> owner.
> 
> Are you concerned that new waiters would somehow always be further
> left on the rq and would be selected as donors before other waiters on
> the rq?

Something along those lines. Say the first waiter on the wait list
has been around for a while but a new waiter keeps appearing, starts
proxy, gets the handoff, runs instead of the first waiter, and the
cycle repeats.

    owner: A (unlock)
    donor: C
    wait_list: B->C->D

    /* handoff to blocked_donor C */

    owner: C
    donor: C
    wait_list: B->D (->A) /* A appears again */


    /* C runs out of slice; B, D and then A take turn as donor */

    owner: C (unlock)
    donor: A
    wait_list: B->D->A

    /* Hand off to blocked donor A */

    owner: A
    donor: A
    wait_list: B->D (->C) /* C appears again */

    /*
     * A runs out of slice; B, D and C take turn as donor.
     * Whole thing repeats from top as C is selected for
     * handoff since it is the blocked donor.
     * B and D keep waiting for a long time for their turn.
     */

> I can sort of see the argument that "new waiters" would be
> running until they try to get the lock, so they may have timeslice
> still available to make them attractive to the scheduler as a donor,
> so maybe they would be re-selected as a donor right away. But I'm
> assuming at some point the fairness is going to cycle any waiting
> donors up to the front of the rq to act as a donor and then have the
> lock handed off.
> 
> That said, I don't see how this would be very much different from
> new/running optimistic spinners having a better chance at grabbing a
> lock then waiting tasks that aren't running.

Now that I take a closer look at optimistic spinning bits, the
"got the lock, yay!" path in __mutex_lock_common() doesn't seem
any different in behavior so I guess it is alright to hand off
to the proxy donor.

> 
> But let me know more about what you're thinking of, as I'd like to
> better understand it and see if I could contrive a test to produce
> this sort of concerning behavior.
> 
> thanks
> -john

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v23 7/9] sched: Have try_to_wake_up() handle return-migration for PROXY_WAKING case
  2025-10-31  4:27   ` K Prateek Nayak
@ 2025-11-20  1:05     ` John Stultz
  2025-11-20  3:15       ` K Prateek Nayak
  0 siblings, 1 reply; 36+ messages in thread
From: John Stultz @ 2025-11-20  1:05 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Thu, Oct 30, 2025 at 9:28 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 10/30/2025 5:48 AM, John Stultz wrote:
> > +/*
> > + * Checks to see if task p has been proxy-migrated to another rq
> > + * and needs to be returned. If so, we deactivate the task here
> > + * so that it can be properly woken up on the p->wake_cpu
> > + * (or whichever cpu select_task_rq() picks at the bottom of
> > + * try_to_wake_up()
> > + */
> > +static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
> > +{
> > +     bool ret = false;
> > +
> > +     if (!sched_proxy_exec())
> > +             return false;
> > +
> > +     raw_spin_lock(&p->blocked_lock);
> > +     if (p->blocked_on == PROXY_WAKING) {
> > +             if (!task_current(rq, p) && p->wake_cpu != cpu_of(rq)) {
> > +                     if (task_current_donor(rq, p))
> > +                             proxy_resched_idle(rq);
> > +
> > +                     deactivate_task(rq, p, DEQUEUE_NOCLOCK);
> > +                     ret = true;
> > +             }
> > +             __clear_task_blocked_on(p, PROXY_WAKING);
> > +             resched_curr(rq);
>
> Do we need to resched_curr() if we've simply dequeued a waiting
> owner who is neither the current, nor the donor? I would have

Hrm. Yeah, I guess really we only need to resched_curr() when we are
skipping the deactivation because already we're waking on the wake_cpu
rq.

I'll fix that up.

> preferred if this function was similar to ttwu_queue_cond() with
> each step explaining the the intent - something like:
>
> static inline bool
> proxy_needs_return(struct rq *rq, struct task_struct *p, int wake_flags)
> {
>         if (!sched_proxy_exec())
>                 return false;
>
>         guard(raw_spinlock)(&p->blocked_lock);
>
>         /* Task has been woken up / is running. */
>         if (p->blocked_on != PROXY_WAKING)
>                 return false;
>
>         __clear_task_blocked_on(p, PROXY_WAKING);

Ah, this is nicer! I'll rework the function in this style! Thanks for
that suggestion!

> > +     /*
> > +      * If we're PROXY_WAKING, we have deactivated on this cpu, so we should
> > +      * activate it here as well, to avoid IPI'ing a cpu that is stuck in
> > +      * task_rq_lock() spinning on p->on_rq, deadlocking that cpu.
> > +      */
>
> Sounds like block_task() would be better than deactivate_task() above
> in that case. Anything that is waiting on the task's state change takes
> the pi_lock afaik and the wakeup is always done with pi_lock held so
> blocking the task shouldn't cause any problems based on my reading.

So earlier I did try using block_task() but it always seemed to run
into crashes, which I assumed was because other cpus were picking the
task up as it wasn't on_rq (any references to a task after
block_task() in other situations often runs into this trouble).

But your point about the pi_lock being held is a good one, so I will
tinker and think a bit more on this.

Thanks as always for the review!
-john

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v23 6/9] sched: Handle blocked-waiter migration (and return migration)
  2025-11-10  4:47       ` K Prateek Nayak
@ 2025-11-20  1:53         ` John Stultz
  2025-11-20  2:00           ` John Stultz
  0 siblings, 1 reply; 36+ messages in thread
From: John Stultz @ 2025-11-20  1:53 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Sun, Nov 9, 2025 at 8:48 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello John,
>
> On 11/8/2025 4:48 AM, John Stultz wrote:
> >>> @@ -6689,26 +6834,41 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> >>>                       return NULL;
> >>>               }
> >>>
> >>> +             if (task_current(rq, p))
> >>> +                     curr_in_chain = true;
> >>> +
> >>>               owner = __mutex_owner(mutex);
> >>>               if (!owner) {
> >>>                       /*
> >>> -                      * If there is no owner, clear blocked_on
> >>> -                      * and return p so it can run and try to
> >>> -                      * acquire the lock
> >>> +                      * If there is no owner, either clear blocked_on
> >>> +                      * and return p (if it is current and safe to
> >>> +                      * just run on this rq), or return-migrate the task.
> >>>                        */
> >>> -                     __clear_task_blocked_on(p, mutex);
> >>> -                     return p;
> >>> +                     if (task_current(rq, p)) {
> >>> +                             __clear_task_blocked_on(p, NULL);
> >>> +                             return p;
> >>> +                     }
> >>> +                     action = NEEDS_RETURN;
> >>> +                     break;
> >>>               }
> >>>
> >>>               if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
> >>
> >> Should we handle task_on_rq_migrating() in the similar way?
> >> Wait for the owner to finish migrating and look at the
> >> task_cpu(owner) once it is reliable?
> >
> > Hrm. I'm not quite sure I understand your suggestion here. Could you
> > expand a bit here? Are you thinking we should deactivate the donor
> > when the owner is migrating? What would then return the donor to the
> > runqueue? Just rescheduling idle so that we drop the rq lock
> > momentarily should be sufficient to make sure the owner can finish
> > migration.
>
> In find_proxy_task() we have:
>
>   if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
>     /* Returns rq->idle or NULL */
>   }
>
>   /*
>    * Owner can be task_on_rq_migrating() at this point
>    * since it is in turn blocked on a lock owner on a
>    * different CPU.
>    */
>
>   owner_cpu = task_cpu(owner); /* Prev CPU */
>   if (owner_cpu != this_cpu) {
>     ...
>     action = MIGRATE;
>     break;
>   }
>
>
> So in the end we can migrate to the previous CPU of the owner
> and the previous CPU has to do a chain migration again. I'm
> probably overthinking about a very unlikely scenario here :)

Ok, so you're suggesting maybe putting the
    if (task_on_rq_migrating(owner))
case ahead of the
    if (owner_cpu != this_cpu)
check?

Let me give that a whirl and see how it does.

thanks
-john

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v23 6/9] sched: Handle blocked-waiter migration (and return migration)
  2025-11-20  1:53         ` John Stultz
@ 2025-11-20  2:00           ` John Stultz
  2025-11-20  2:55             ` K Prateek Nayak
  0 siblings, 1 reply; 36+ messages in thread
From: John Stultz @ 2025-11-20  2:00 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Wed, Nov 19, 2025 at 5:53 PM John Stultz <jstultz@google.com> wrote:
> On Sun, Nov 9, 2025 at 8:48 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> >
> > Hello John,
> >
> > On 11/8/2025 4:48 AM, John Stultz wrote:
> > >>> @@ -6689,26 +6834,41 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> > >>>                       return NULL;
> > >>>               }
> > >>>
> > >>> +             if (task_current(rq, p))
> > >>> +                     curr_in_chain = true;
> > >>> +
> > >>>               owner = __mutex_owner(mutex);
> > >>>               if (!owner) {
> > >>>                       /*
> > >>> -                      * If there is no owner, clear blocked_on
> > >>> -                      * and return p so it can run and try to
> > >>> -                      * acquire the lock
> > >>> +                      * If there is no owner, either clear blocked_on
> > >>> +                      * and return p (if it is current and safe to
> > >>> +                      * just run on this rq), or return-migrate the task.
> > >>>                        */
> > >>> -                     __clear_task_blocked_on(p, mutex);
> > >>> -                     return p;
> > >>> +                     if (task_current(rq, p)) {
> > >>> +                             __clear_task_blocked_on(p, NULL);
> > >>> +                             return p;
> > >>> +                     }
> > >>> +                     action = NEEDS_RETURN;
> > >>> +                     break;
> > >>>               }
> > >>>
> > >>>               if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
> > >>
> > >> Should we handle task_on_rq_migrating() in the similar way?
> > >> Wait for the owner to finish migrating and look at the
> > >> task_cpu(owner) once it is reliable?
> > >
> > > Hrm. I'm not quite sure I understand your suggestion here. Could you
> > > expand a bit here? Are you thinking we should deactivate the donor
> > > when the owner is migrating? What would then return the donor to the
> > > runqueue? Just rescheduling idle so that we drop the rq lock
> > > momentarily should be sufficient to make sure the owner can finish
> > > migration.
> >
> > In find_proxy_task() we have:
> >
> >   if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
> >     /* Returns rq->idle or NULL */
> >   }
> >
> >   /*
> >    * Owner can be task_on_rq_migrating() at this point
> >    * since it is in turn blocked on a lock owner on a
> >    * different CPU.
> >    */
> >
> >   owner_cpu = task_cpu(owner); /* Prev CPU */
> >   if (owner_cpu != this_cpu) {
> >     ...
> >     action = MIGRATE;
> >     break;
> >   }
> >
> >
> > So in the end we can migrate to the previous CPU of the owner
> > and the previous CPU has to do a chain migration again. I'm
> > probably overthinking about a very unlikely scenario here :)
>
> Ok, so you're suggesting maybe putting the
>     if (task_on_rq_migrating(owner))
> case ahead of the
>     if (owner_cpu != this_cpu)
> check?
>
> Let me give that a whirl and see how it does.

That said, thinking another second on it, I also realize once we
decide to proxy_migrate, there is always the chance the owner gets
migrated somewhere else. So we can check task_on_rq_migrating() but
then right after we check that it might be migrated, and we can't
really prevent this.  And in that case, doing the proxy-migration to
the wrong place will be ok, as that cpu will then bounce the tasks to
the owner's new cpu.

Hopefully this would be rare though. :)

thanks
-john

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v23 6/9] sched: Handle blocked-waiter migration (and return migration)
  2025-11-20  2:00           ` John Stultz
@ 2025-11-20  2:55             ` K Prateek Nayak
  2025-11-20  6:33               ` John Stultz
  0 siblings, 1 reply; 36+ messages in thread
From: K Prateek Nayak @ 2025-11-20  2:55 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello John,

On 11/20/2025 7:30 AM, John Stultz wrote:
>> Ok, so you're suggesting maybe putting the
>>     if (task_on_rq_migrating(owner))
>> case ahead of the
>>     if (owner_cpu != this_cpu)
>> check?
>>
>> Let me give that a whirl and see how it does.
> 
> That said, thinking another second on it, I also realize once we
> decide to proxy_migrate, there is always the chance the owner gets
> migrated somewhere else. So we can check task_on_rq_migrating() but
> then right after we check that it might be migrated, and we can't
> really prevent this.  And in that case, doing the proxy-migration to
> the wrong place will be ok, as that cpu will then bounce the tasks to
> the owner's new cpu.
> 
> Hopefully this would be rare though. :)

Ack! I was just thinking of some extreme scenarios. We can probably
think about it if and when we run into a problem with it :)

That said, once we decide to move the first donors to owner's CPU
should we task some care to retain the owner on the same CPU as much
as possible - take it out of the purview of load balancing and only
move it if the owner is no long runnable on that CPU as a result of
affinity changes?

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v23 7/9] sched: Have try_to_wake_up() handle return-migration for PROXY_WAKING case
  2025-11-20  1:05     ` John Stultz
@ 2025-11-20  3:15       ` K Prateek Nayak
  2025-11-20  7:34         ` John Stultz
  0 siblings, 1 reply; 36+ messages in thread
From: K Prateek Nayak @ 2025-11-20  3:15 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello John,

On 11/20/2025 6:35 AM, John Stultz wrote:
>> Sounds like block_task() would be better than deactivate_task() above
>> in that case. Anything that is waiting on the task's state change takes
>> the pi_lock afaik and the wakeup is always done with pi_lock held so
>> blocking the task shouldn't cause any problems based on my reading.
> 
> So earlier I did try using block_task() but it always seemed to run
> into crashes, which I assumed was because other cpus were picking the
> task up as it wasn't on_rq (any references to a task after
> block_task() in other situations often runs into this trouble).
> 
> But your point about the pi_lock being held is a good one, so I will
> tinker and think a bit more on this.

So if you hadn't used DEQUEUE_SPECIAL previously with block_task(),
there is a case where:

> @@ -3784,6 +3834,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
>  		update_rq_clock(rq);
>  		if (p->se.sched_delayed)
>  			enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> +		if (proxy_needs_return(rq, p))
> +			goto out;

Task turns delayed here but the delayed condition is handled
before proxy_needs_return(). Perhaps you can try reordering
them?

Since we avoid calling block_task() on blocked donors, I
don't think they can be delayed until we actually call
block_task().

I might be missing other subtleties but this is one case
I could think of.

>  		if (!task_on_cpu(rq, p)) {
>  			/*
>  			 * When on_rq && !on_cpu the task is preempted, see if

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v23 6/9] sched: Handle blocked-waiter migration (and return migration)
  2025-11-20  2:55             ` K Prateek Nayak
@ 2025-11-20  6:33               ` John Stultz
  2025-11-20  7:16                 ` K Prateek Nayak
  0 siblings, 1 reply; 36+ messages in thread
From: John Stultz @ 2025-11-20  6:33 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Wed, Nov 19, 2025 at 6:55 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 11/20/2025 7:30 AM, John Stultz wrote:
> >> Ok, so you're suggesting maybe putting the
> >>     if (task_on_rq_migrating(owner))
> >> case ahead of the
> >>     if (owner_cpu != this_cpu)
> >> check?
> >>
> >> Let me give that a whirl and see how it does.
> >
> > That said, thinking another second on it, I also realize once we
> > decide to proxy_migrate, there is always the chance the owner gets
> > migrated somewhere else. So we can check task_on_rq_migrating() but
> > then right after we check that it might be migrated, and we can't
> > really prevent this.  And in that case, doing the proxy-migration to
> > the wrong place will be ok, as that cpu will then bounce the tasks to
> > the owner's new cpu.
> >
> > Hopefully this would be rare though. :)
>
> Ack! I was just thinking of some extreme scenarios. We can probably
> think about it if and when we run into a problem with it :)
>
> That said, once we decide to move the first donors to owner's CPU
> should we task some care to retain the owner on the same CPU as much
> as possible - take it out of the purview of load balancing and only
> move it if the owner is no long runnable on that CPU as a result of
> affinity changes?

Eh, I'm hesitant to muck with the balancing effects on the lock
owners.  If it's better for them to move around, then the donor chain
should follow along (which will happen naturally).

Now, the donors (or really any blocked tasks on the rq) we avoid
adding to any movable lists, since we only want proxy-migrations to
move those tasks (so they stick to the owner).

thanks
-john

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v23 6/9] sched: Handle blocked-waiter migration (and return migration)
  2025-11-20  6:33               ` John Stultz
@ 2025-11-20  7:16                 ` K Prateek Nayak
  2025-11-20  7:27                   ` John Stultz
  0 siblings, 1 reply; 36+ messages in thread
From: K Prateek Nayak @ 2025-11-20  7:16 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello John,

On 11/20/2025 12:03 PM, John Stultz wrote:
> On Wed, Nov 19, 2025 at 6:55 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>> On 11/20/2025 7:30 AM, John Stultz wrote:
>>>> Ok, so you're suggesting maybe putting the
>>>>     if (task_on_rq_migrating(owner))
>>>> case ahead of the
>>>>     if (owner_cpu != this_cpu)
>>>> check?
>>>>
>>>> Let me give that a whirl and see how it does.
>>>
>>> That said, thinking another second on it, I also realize once we
>>> decide to proxy_migrate, there is always the chance the owner gets
>>> migrated somewhere else. So we can check task_on_rq_migrating() but
>>> then right after we check that it might be migrated, and we can't
>>> really prevent this.  And in that case, doing the proxy-migration to
>>> the wrong place will be ok, as that cpu will then bounce the tasks to
>>> the owner's new cpu.
>>>
>>> Hopefully this would be rare though. :)
>>
>> Ack! I was just thinking of some extreme scenarios. We can probably
>> think about it if and when we run into a problem with it :)
>>
>> That said, once we decide to move the first donors to owner's CPU
>> should we task some care to retain the owner on the same CPU as much
>> as possible - take it out of the purview of load balancing and only
>> move it if the owner is no long runnable on that CPU as a result of
>> affinity changes?
> 
> Eh, I'm hesitant to muck with the balancing effects on the lock
> owners.  If it's better for them to move around, then the donor chain
> should follow along (which will happen naturally).

So assume the case where you have the owner and a bunch of blocked
donor on the same rq. This rq appears the busiest to the load balancer.

Load balancer go thorough the task list and find that almost
everything is blocked on the owner. Then it arrives at the owner in a
preempted state (queued; not running) and thinks this is a good enough
task to move to reduce imbalance.

Now, this triggers a whole chain migration at pick for all the blocked
donors to the new CPU. Seems wasteful (although again this is a very
unlikely scenario to not be on_cpu with so many donors on the CPU)

> 
> Now, the donors (or really any blocked tasks on the rq) we avoid
> adding to any movable lists, since we only want proxy-migrations to
> move those tasks (so they stick to the owner).
-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v23 6/9] sched: Handle blocked-waiter migration (and return migration)
  2025-11-20  7:16                 ` K Prateek Nayak
@ 2025-11-20  7:27                   ` John Stultz
  0 siblings, 0 replies; 36+ messages in thread
From: John Stultz @ 2025-11-20  7:27 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Wed, Nov 19, 2025 at 11:16 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 11/20/2025 12:03 PM, John Stultz wrote:
> > On Wed, Nov 19, 2025 at 6:55 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> >> On 11/20/2025 7:30 AM, John Stultz wrote:
> >>>> Ok, so you're suggesting maybe putting the
> >>>>     if (task_on_rq_migrating(owner))
> >>>> case ahead of the
> >>>>     if (owner_cpu != this_cpu)
> >>>> check?
> >>>>
> >>>> Let me give that a whirl and see how it does.
> >>>
> >>> That said, thinking another second on it, I also realize once we
> >>> decide to proxy_migrate, there is always the chance the owner gets
> >>> migrated somewhere else. So we can check task_on_rq_migrating() but
> >>> then right after we check that it might be migrated, and we can't
> >>> really prevent this.  And in that case, doing the proxy-migration to
> >>> the wrong place will be ok, as that cpu will then bounce the tasks to
> >>> the owner's new cpu.
> >>>
> >>> Hopefully this would be rare though. :)
> >>
> >> Ack! I was just thinking of some extreme scenarios. We can probably
> >> think about it if and when we run into a problem with it :)
> >>
> >> That said, once we decide to move the first donors to owner's CPU
> >> should we task some care to retain the owner on the same CPU as much
> >> as possible - take it out of the purview of load balancing and only
> >> move it if the owner is no long runnable on that CPU as a result of
> >> affinity changes?
> >
> > Eh, I'm hesitant to muck with the balancing effects on the lock
> > owners.  If it's better for them to move around, then the donor chain
> > should follow along (which will happen naturally).
>
> So assume the case where you have the owner and a bunch of blocked
> donor on the same rq. This rq appears the busiest to the load balancer.
>
> Load balancer go thorough the task list and find that almost
> everything is blocked on the owner. Then it arrives at the owner in a
> preempted state (queued; not running) and thinks this is a good enough
> task to move to reduce imbalance.
>
> Now, this triggers a whole chain migration at pick for all the blocked
> donors to the new CPU. Seems wasteful (although again this is a very
> unlikely scenario to not be on_cpu with so many donors on the CPU)

Yeah. This is a case we will probably need some tuning for. I'd lean
more towards trying not to consider the blocked_on tasks for
balancing, instead of trying to lock the owner down.

As always, I appreciate the thoughts and feedback!
-john

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v23 7/9] sched: Have try_to_wake_up() handle return-migration for PROXY_WAKING case
  2025-11-20  3:15       ` K Prateek Nayak
@ 2025-11-20  7:34         ` John Stultz
  0 siblings, 0 replies; 36+ messages in thread
From: John Stultz @ 2025-11-20  7:34 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Wed, Nov 19, 2025 at 7:16 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 11/20/2025 6:35 AM, John Stultz wrote:
> >> Sounds like block_task() would be better than deactivate_task() above
> >> in that case. Anything that is waiting on the task's state change takes
> >> the pi_lock afaik and the wakeup is always done with pi_lock held so
> >> blocking the task shouldn't cause any problems based on my reading.
> >
> > So earlier I did try using block_task() but it always seemed to run
> > into crashes, which I assumed was because other cpus were picking the
> > task up as it wasn't on_rq (any references to a task after
> > block_task() in other situations often runs into this trouble).
> >
> > But your point about the pi_lock being held is a good one, so I will
> > tinker and think a bit more on this.
>
> So if you hadn't used DEQUEUE_SPECIAL previously with block_task(),

Yeah, you're right, DEQUEUE_SPECIAL definitely solves the crashes I
was seeing without it.
I'll switch over to that.

thanks so much!
-john

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2025-11-20  7:34 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-30  0:18 [PATCH v23 0/9] Donor Migration for Proxy Execution (v23) John Stultz
2025-10-30  0:18 ` [PATCH v23 1/9] locking: Add task::blocked_lock to serialize blocked_on state John Stultz
2025-10-30  0:18 ` [PATCH v23 2/9] sched: Fix modifying donor->blocked on without proper locking John Stultz
2025-10-30  4:51   ` K Prateek Nayak
2025-10-30 23:42     ` John Stultz
2025-10-30  0:18 ` [PATCH v23 3/9] sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration John Stultz
2025-10-30  7:32   ` K Prateek Nayak
2025-10-30 23:53     ` John Stultz
2025-10-30  0:18 ` [PATCH v23 4/9] sched: Add assert_balance_callbacks_empty helper John Stultz
2025-10-30  7:38   ` K Prateek Nayak
2025-10-30  0:18 ` [PATCH v23 5/9] sched: Add logic to zap balance callbacks if we pick again John Stultz
2025-10-30  8:08   ` K Prateek Nayak
2025-10-31  3:15     ` John Stultz
2025-10-31  3:50       ` K Prateek Nayak
2025-10-30  0:18 ` [PATCH v23 6/9] sched: Handle blocked-waiter migration (and return migration) John Stultz
2025-10-30  9:32   ` K Prateek Nayak
2025-11-07 23:18     ` John Stultz
2025-11-10  4:47       ` K Prateek Nayak
2025-11-20  1:53         ` John Stultz
2025-11-20  2:00           ` John Stultz
2025-11-20  2:55             ` K Prateek Nayak
2025-11-20  6:33               ` John Stultz
2025-11-20  7:16                 ` K Prateek Nayak
2025-11-20  7:27                   ` John Stultz
2025-11-07 15:19   ` Juri Lelli
2025-11-07 17:24     ` John Stultz
2025-10-30  0:18 ` [PATCH v23 7/9] sched: Have try_to_wake_up() handle return-migration for PROXY_WAKING case John Stultz
2025-10-31  4:27   ` K Prateek Nayak
2025-11-20  1:05     ` John Stultz
2025-11-20  3:15       ` K Prateek Nayak
2025-11-20  7:34         ` John Stultz
2025-10-30  0:18 ` [PATCH v23 8/9] sched: Add blocked_donor link to task for smarter mutex handoffs John Stultz
2025-10-31  5:01   ` K Prateek Nayak
2025-11-11  7:50     ` John Stultz
2025-11-11  8:35       ` K Prateek Nayak
2025-10-30  0:18 ` [PATCH v23 9/9] sched: Migrate whole chain in proxy_migrate_task() John Stultz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox