[PATCH v22 0/6] Donor Migration for Proxy Execution (v22)

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v22 0/6] Donor Migration for Proxy Execution (v22)
@ 2025-09-26  3:29 John Stultz
  2025-09-26  3:29 ` [PATCH v22 1/6] locking: Add task::blocked_lock to serialize blocked_on state John Stultz
                   ` (5 more replies)
  0 siblings, 6 replies; 17+ messages in thread
From: John Stultz @ 2025-09-26  3:29 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

Hey All,

I wanted to continue pushing for feedback on the next chunk of
the series: Donor Migration

This is just the next step for Proxy Execution, to allow us to
migrate blocked donors across runqueues to boost remote lock
owners.

As always, I’m trying to submit this larger work in smallish
digestible pieces, so in this portion of the series, I’m only
submitting for review and consideration the logic that allows us
to do donor(blocked waiter) migration, which requires some
additional changes to locking and extra state tracking to ensure
we don’t accidentally run a migrated donor on a cpu it isn’t
affined to, as well as some extra handling to deal with balance
callback state that needs to be reset when we decide to pick a
different task after doing donor migration.

My last version got a lot of great feedback from K Prateek Nayak,
which while not significantly changing behavior, did have me
reworking and reorganizing quite a bit of code in this series:

* Reworking find_proxy_task() to avoid mixing gotos with guard()
  usage. Instead break and switch() on a set action enum.
* Zap callbacks when we resched idle
* Remove unjustified curr != donor check in pick_next_task_fair()
* Simplifications around put_prev_set_next() in the migration
  logic
* Reorder functions for readability
* Move a few task_struct elements under #ifdef
  CONFIG_SCHED_PROXY_EXEC
* Switch to one-line stubs and other white space and spelling
  cleanups.

I’d love to get further feedback on any place where these patches
are confusing, or could use additional clarifications.

Also Suleiman Souhlal and I have been working on some
enhancements to the full Proxy Execution series:
* Suleiman has implemented a first pass at enabling Proxy Exec
  on rw_sems! Rw_sems have been another common source of PI
  inversion problems, so I’m excited to be able to have the
  Proxy Exec approach be able to help solve those issues as
  well. More work and validation are required, but it’s very
  exciting!
* I’ve been working to allow Proxy Exec to work with sched_ext.
  Currently I’ve worked out the crashers I was initially
  seeing. However, I find my stress tests tend to eventually
  cause problems, though this seems unfortunately the case
  without proxy-exec as well, and seems to be due to the missing
  dl_server for sched_ext. I need to try to test with Andrea
  Righi’s series here:
     https://lore.kernel.org/lkml/20250903095008.162049-1-arighi@nvidia.com/ 
  I still have further work to better understand if Proxy
  switching the selected task breaks bpf scheduler assumptions
  and what might be done about it.

Also you can find the full proxy-exec series here:
  https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v22-6.17-rc6
  https://github.com/johnstultz-work/linux-dev.git proxy-exec-v22-6.17-rc6

Issues still to address with the full series:
* Continue working to get sched_ext to be ok with
  proxy-execution enabled.
* K Prateek Nayak re-did some performance testing with both this
  set and the full series, and while the set I’m submitting here
  looked ok, the full series did see regressions. I’m working to
  reproduce this so I can narrow the issue down.
* The chain migration functionality needs further iterations and
  better validation to ensure it truly maintains the RT/DL load
  balancing invariants (despite this being broken in vanilla
  upstream with RT_PUSH_IPI currently)

Future work:
* Expand to more locking primitives: Figuring out pi-futexes
  would be good too.
* Eventually: Work to replace rt_mutexes and get things happy
  with PREEMPT_RT

I’d really appreciate any feedback or review thoughts on the
full series as well. I’m trying to keep the chunks small,
reviewable and iteratively testable, but if you have any
suggestions on how to improve the larger series, I’m all ears.

Credit/Disclaimer:
—--------------------
As always, this Proxy Execution series has a long history with
lots of developers that deserve credit: 

First described in a paper[2] by Watkins, Straub, Niehaus, then
from patches from Peter Zijlstra, extended with lots of work by
Juri Lelli, Valentin Schneider, and Connor O'Brien. (and thank
you to Steven Rostedt for providing additional details here!)

So again, many thanks to those above, as all the credit for this
series really is due to them - while the mistakes are likely
mine.

Thanks so much!
-john

[1] https://lore.kernel.org/lkml/20250805001026.2247040-1-jstultz@google.com/
[2] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>   
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com

John Stultz (5):
  locking: Add task::blocked_lock to serialize blocked_on state
  sched/locking: Add blocked_on_state to provide necessary tri-state for
    proxy return-migration
  sched: Add logic to zap balance callbacks if we pick again
  sched: Handle blocked-waiter migration (and return migration)
  sched: Migrate whole chain in proxy_migrate_task()

Peter Zijlstra (1):
  sched: Add blocked_donor link to task for smarter mutex handoffs

 include/linux/sched.h        | 130 ++++++++++----
 init/init_task.c             |   6 +
 kernel/fork.c                |   7 +-
 kernel/locking/mutex-debug.c |   4 +-
 kernel/locking/mutex.c       |  86 +++++++--
 kernel/locking/ww_mutex.h    |  20 +--
 kernel/sched/core.c          | 339 ++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h         |   6 +-
 8 files changed, 507 insertions(+), 91 deletions(-)

-- 
2.51.0.536.g15c5d4f767-goog

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v22 1/6] locking: Add task::blocked_lock to serialize blocked_on state
  2025-09-26  3:29 [PATCH v22 0/6] Donor Migration for Proxy Execution (v22) John Stultz
@ 2025-09-26  3:29 ` John Stultz
  2025-10-08 10:27   ` Peter Zijlstra
  2025-09-26  3:29 ` [PATCH v22 2/6] sched/locking: Add blocked_on_state to provide necessary tri-state for proxy return-migration John Stultz
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: John Stultz @ 2025-09-26  3:29 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, K Prateek Nayak, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall,
	Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

So far, we have been able to utilize the mutex::wait_lock
for serializing the blocked_on state, but when we move to
proxying across runqueues, we will need to add more state
and a way to serialize changes to this state in contexts
where we don't hold the mutex::wait_lock.

So introduce the task::blocked_lock, which nests under the
mutex::wait_lock in the locking order, and rework the locking
to use it.

Signed-off-by: John Stultz <jstultz@google.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
v15:
* Split back out into later in the series
v16:
* Fixups to mark tasks unblocked before sleeping in
  mutex_optimistic_spin()
* Rework to use guard() as suggested by Peter
v19:
* Rework logic for PREEMPT_RT issues reported by
  K Prateek Nayak
v21:
* After recently thinking more on ww_mutex code, I
  reworked the blocked_lock usage in mutex lock to
  avoid having to take nested locks in the ww_mutex
  paths, as I was concerned the lock ordering
  constraints weren't as strong as I had previously
  thought.
v22:
* Added some extra spaces to avoid dense code blocks
  suggested by K Prateek
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 include/linux/sched.h        | 52 +++++++++++++++---------------------
 init/init_task.c             |  1 +
 kernel/fork.c                |  1 +
 kernel/locking/mutex-debug.c |  4 +--
 kernel/locking/mutex.c       | 40 +++++++++++++++++----------
 kernel/locking/ww_mutex.h    |  4 +--
 kernel/sched/core.c          |  4 ++-
 7 files changed, 57 insertions(+), 49 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e4ce0a76831e5..cb4e81d9d9b67 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1233,6 +1233,7 @@ struct task_struct {
 #endif
 
 	struct mutex			*blocked_on;	/* lock we're blocked on */
+	raw_spinlock_t			blocked_lock;
 
 #ifdef CONFIG_DETECT_HUNG_TASK_BLOCKER
 	/*
@@ -2141,57 +2142,48 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
 #ifndef CONFIG_PREEMPT_RT
 static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
 {
-	struct mutex *m = p->blocked_on;
+	lockdep_assert_held_once(&p->blocked_lock);
+	return p->blocked_on;
+}
 
-	if (m)
-		lockdep_assert_held_once(&m->wait_lock);
-	return m;
+static inline struct mutex *get_task_blocked_on(struct task_struct *p)
+{
+	guard(raw_spinlock_irqsave)(&p->blocked_lock);
+	return __get_task_blocked_on(p);
 }
 
 static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
 {
-	struct mutex *blocked_on = READ_ONCE(p->blocked_on);
-
 	WARN_ON_ONCE(!m);
 	/* The task should only be setting itself as blocked */
 	WARN_ON_ONCE(p != current);
-	/* Currently we serialize blocked_on under the mutex::wait_lock */
-	lockdep_assert_held_once(&m->wait_lock);
+	/* Currently we serialize blocked_on under the task::blocked_lock */
+	lockdep_assert_held_once(&p->blocked_lock);
 	/*
 	 * Check ensure we don't overwrite existing mutex value
 	 * with a different mutex. Note, setting it to the same
 	 * lock repeatedly is ok.
 	 */
-	WARN_ON_ONCE(blocked_on && blocked_on != m);
-	WRITE_ONCE(p->blocked_on, m);
-}
-
-static inline void set_task_blocked_on(struct task_struct *p, struct mutex *m)
-{
-	guard(raw_spinlock_irqsave)(&m->wait_lock);
-	__set_task_blocked_on(p, m);
+	WARN_ON_ONCE(p->blocked_on && p->blocked_on != m);
+	p->blocked_on = m;
 }
 
 static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *m)
 {
-	if (m) {
-		struct mutex *blocked_on = READ_ONCE(p->blocked_on);
-
-		/* Currently we serialize blocked_on under the mutex::wait_lock */
-		lockdep_assert_held_once(&m->wait_lock);
-		/*
-		 * There may be cases where we re-clear already cleared
-		 * blocked_on relationships, but make sure we are not
-		 * clearing the relationship with a different lock.
-		 */
-		WARN_ON_ONCE(blocked_on && blocked_on != m);
-	}
-	WRITE_ONCE(p->blocked_on, NULL);
+	/* Currently we serialize blocked_on under the task::blocked_lock */
+	lockdep_assert_held_once(&p->blocked_lock);
+	/*
+	 * There may be cases where we re-clear already cleared
+	 * blocked_on relationships, but make sure we are not
+	 * clearing the relationship with a different lock.
+	 */
+	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m);
+	p->blocked_on = NULL;
 }
 
 static inline void clear_task_blocked_on(struct task_struct *p, struct mutex *m)
 {
-	guard(raw_spinlock_irqsave)(&m->wait_lock);
+	guard(raw_spinlock_irqsave)(&p->blocked_lock);
 	__clear_task_blocked_on(p, m);
 }
 #else
diff --git a/init/init_task.c b/init/init_task.c
index e557f622bd906..7e29d86153d9f 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -140,6 +140,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	.journal_info	= NULL,
 	INIT_CPU_TIMERS(init_task)
 	.pi_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
+	.blocked_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.blocked_lock),
 	.timer_slack_ns = 50000, /* 50 usec default slack */
 	.thread_pid	= &init_struct_pid,
 	.thread_node	= LIST_HEAD_INIT(init_signals.thread_head),
diff --git a/kernel/fork.c b/kernel/fork.c
index c4ada32598bd5..796cfceb2bbda 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2030,6 +2030,7 @@ __latent_entropy struct task_struct *copy_process(
 	ftrace_graph_init_task(p);
 
 	rt_mutex_init_task(p);
+	raw_spin_lock_init(&p->blocked_lock);
 
 	lockdep_assert_irqs_enabled();
 #ifdef CONFIG_PROVE_LOCKING
diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index 949103fd8e9b5..1d8cff71f65e1 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -54,13 +54,13 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 	lockdep_assert_held(&lock->wait_lock);
 
 	/* Current thread can't be already blocked (since it's executing!) */
-	DEBUG_LOCKS_WARN_ON(__get_task_blocked_on(task));
+	DEBUG_LOCKS_WARN_ON(get_task_blocked_on(task));
 }
 
 void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 			 struct task_struct *task)
 {
-	struct mutex *blocked_on = __get_task_blocked_on(task);
+	struct mutex *blocked_on = get_task_blocked_on(task);
 
 	DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
 	DEBUG_LOCKS_WARN_ON(waiter->task != task);
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index de7d6702cd96c..c44fc63d4476e 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -640,6 +640,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 			goto err_early_kill;
 	}
 
+	raw_spin_lock(&current->blocked_lock);
 	__set_task_blocked_on(current, lock);
 	set_current_state(state);
 	trace_contention_begin(lock, LCB_F_MUTEX);
@@ -653,8 +654,9 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 		 * the handoff.
 		 */
 		if (__mutex_trylock(lock))
-			goto acquired;
+			break;
 
+		raw_spin_unlock(&current->blocked_lock);
 		/*
 		 * Check for signals and kill conditions while holding
 		 * wait_lock. This ensures the lock cancellation is ordered
@@ -677,12 +679,14 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 
 		first = __mutex_waiter_is_first(lock, &waiter);
 
+		raw_spin_lock_irqsave(&lock->wait_lock, flags);
+		raw_spin_lock(&current->blocked_lock);
 		/*
 		 * As we likely have been woken up by task
 		 * that has cleared our blocked_on state, re-set
 		 * it to the lock we are trying to acquire.
 		 */
-		set_task_blocked_on(current, lock);
+		__set_task_blocked_on(current, lock);
 		set_current_state(state);
 		/*
 		 * Here we order against unlock; we must either see it change
@@ -693,25 +697,33 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 			break;
 
 		if (first) {
-			trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
+			bool opt_acquired;
+
 			/*
 			 * mutex_optimistic_spin() can call schedule(), so
-			 * clear blocked on so we don't become unselectable
+			 * we need to release these locks before calling it,
+			 * and clear blocked on so we don't become unselectable
 			 * to run.
 			 */
-			clear_task_blocked_on(current, lock);
-			if (mutex_optimistic_spin(lock, ww_ctx, &waiter))
+			__clear_task_blocked_on(current, lock);
+			raw_spin_unlock(&current->blocked_lock);
+			raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
+
+			trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
+			opt_acquired = mutex_optimistic_spin(lock, ww_ctx, &waiter);
+
+			raw_spin_lock_irqsave(&lock->wait_lock, flags);
+			raw_spin_lock(&current->blocked_lock);
+			__set_task_blocked_on(current, lock);
+
+			if (opt_acquired)
 				break;
-			set_task_blocked_on(current, lock);
 			trace_contention_begin(lock, LCB_F_MUTEX);
 		}
-
-		raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	}
-	raw_spin_lock_irqsave(&lock->wait_lock, flags);
-acquired:
 	__clear_task_blocked_on(current, lock);
 	__set_current_state(TASK_RUNNING);
+	raw_spin_unlock(&current->blocked_lock);
 
 	if (ww_ctx) {
 		/*
@@ -740,11 +752,11 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 	return 0;
 
 err:
-	__clear_task_blocked_on(current, lock);
+	clear_task_blocked_on(current, lock);
 	__set_current_state(TASK_RUNNING);
 	__mutex_remove_waiter(lock, &waiter);
 err_early_kill:
-	WARN_ON(__get_task_blocked_on(current));
+	WARN_ON(get_task_blocked_on(current));
 	trace_contention_end(lock, ret);
 	raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q);
 	debug_mutex_free_waiter(&waiter);
@@ -955,7 +967,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 		next = waiter->task;
 
 		debug_mutex_wake_waiter(lock, waiter);
-		__clear_task_blocked_on(next, lock);
+		clear_task_blocked_on(next, lock);
 		wake_q_add(&wake_q, next);
 	}
 
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 31a785afee6c0..e4a81790ea7dd 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -289,7 +289,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
 		 * blocked_on pointer. Otherwise we can see circular
 		 * blocked_on relationships that can't resolve.
 		 */
-		__clear_task_blocked_on(waiter->task, lock);
+		clear_task_blocked_on(waiter->task, lock);
 		wake_q_add(wake_q, waiter->task);
 	}
 
@@ -347,7 +347,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
 			 * are waking the mutex owner, who may be currently
 			 * blocked on a different mutex.
 			 */
-			__clear_task_blocked_on(owner, NULL);
+			clear_task_blocked_on(owner, NULL);
 			wake_q_add(wake_q, owner);
 		}
 		return true;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 631e25ce15c66..007459d42ae4a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6639,6 +6639,7 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *d
  *   p->pi_lock
  *     rq->lock
  *       mutex->wait_lock
+ *         p->blocked_lock
  *
  * Returns the task that is going to be used as execution context (the one
  * that is actually going to be run on cpu_of(rq)).
@@ -6662,8 +6663,9 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 		 * and ensure @owner sticks around.
 		 */
 		guard(raw_spinlock)(&mutex->wait_lock);
+		guard(raw_spinlock)(&p->blocked_lock);
 
-		/* Check again that p is blocked with wait_lock held */
+		/* Check again that p is blocked with blocked_lock held */
 		if (mutex != __get_task_blocked_on(p)) {
 			/*
 			 * Something changed in the blocked_on chain and
-- 
2.51.0.536.g15c5d4f767-goog


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v22 2/6] sched/locking: Add blocked_on_state to provide necessary tri-state for proxy return-migration
  2025-09-26  3:29 [PATCH v22 0/6] Donor Migration for Proxy Execution (v22) John Stultz
  2025-09-26  3:29 ` [PATCH v22 1/6] locking: Add task::blocked_lock to serialize blocked_on state John Stultz
@ 2025-09-26  3:29 ` John Stultz
  2025-10-08 11:26   ` Peter Zijlstra
  2025-09-26  3:29 ` [PATCH v22 3/6] sched: Add logic to zap balance callbacks if we pick again John Stultz
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: John Stultz @ 2025-09-26  3:29 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

As we add functionality to proxy execution, we may migrate a
donor task to a runqueue where it can't run due to cpu affinity.
Thus, we must be careful to ensure we return-migrate the task
back to a cpu in its cpumask when it becomes unblocked.

Thus we need more then just a binary concept of the task being
blocked on a mutex or not.

So add a blocked_on_state value to the task, that allows the
task to move through BO_RUNNING -> BO_BLOCKED -> BO_WAKING
and back to BO_RUNNING. This provides a guard state in
BO_WAKING so we can know the task is no longer blocked
but we don't want to run it until we have potentially
done return migration, back to a usable cpu.

Signed-off-by: John Stultz <jstultz@google.com>
---
v15:
* Split blocked_on_state into its own patch later in the
  series, as the tri-state isn't necessary until we deal
  with proxy/return migrations
v16:
* Handle case where task in the chain is being set as
  BO_WAKING by another cpu (usually via ww_mutex die code).
  Make sure we release the rq lock so the wakeup can
  complete.
* Rework to use guard() in find_proxy_task() as suggested
  by Peter
v18:
* Add initialization of blocked_on_state for init_task
v19:
* PREEMPT_RT build fixups and rework suggested by
  K Prateek Nayak
v20:
* Simplify one of the blocked_on_state changes to avoid extra
  PREMEPT_RT conditionals
v21:
* Slight reworks due to avoiding nested blocked_lock locking
* Be consistent in use of blocked_on_state helper functions
* Rework calls to proxy_deactivate() to do proper locking
  around blocked_on_state changes that we were cheating in
  previous versions.
* Minor cleanups, some comment improvements
v22:
* Re-order blocked_on_state helpers to try to make it clearer
  the set_task_blocked_on() and clear_task_blocked_on() are
  the main enter/exit states and the blocked_on_state helpers
  help manage the transition states within. Per feedback from
  K Prateek Nayak.
* Rework blocked_on_state to be defined within
  CONFIG_SCHED_PROXY_EXEC as suggested by K Prateek Nayak.
* Reworked empty stub functions to just take one line as
  suggestd by K Prateek
* Avoid using gotos out of a guard() scope, as highlighted by
  K Prateek, and instead rework logic to break and switch()
  on an action value.

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 include/linux/sched.h     | 92 +++++++++++++++++++++++++++++++++------
 init/init_task.c          |  3 ++
 kernel/fork.c             |  3 ++
 kernel/locking/mutex.c    | 15 ++++---
 kernel/locking/ww_mutex.h | 20 ++++-----
 kernel/sched/core.c       | 45 +++++++++++++++++--
 kernel/sched/sched.h      |  6 ++-
 7 files changed, 146 insertions(+), 38 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index cb4e81d9d9b67..8245940783c77 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -813,6 +813,12 @@ struct kmap_ctrl {
 #endif
 };
 
+enum blocked_on_state {
+	BO_RUNNABLE,
+	BO_BLOCKED,
+	BO_WAKING,
+};
+
 struct task_struct {
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	/*
@@ -1234,6 +1240,9 @@ struct task_struct {
 
 	struct mutex			*blocked_on;	/* lock we're blocked on */
 	raw_spinlock_t			blocked_lock;
+#ifdef CONFIG_SCHED_PROXY_EXEC
+	enum blocked_on_state		blocked_on_state;
+#endif
 
 #ifdef CONFIG_DETECT_HUNG_TASK_BLOCKER
 	/*
@@ -2139,7 +2148,6 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
 	__cond_resched_rwlock_write(lock);					\
 })
 
-#ifndef CONFIG_PREEMPT_RT
 static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
 {
 	lockdep_assert_held_once(&p->blocked_lock);
@@ -2152,6 +2160,13 @@ static inline struct mutex *get_task_blocked_on(struct task_struct *p)
 	return __get_task_blocked_on(p);
 }
 
+static inline void __force_blocked_on_blocked(struct task_struct *p);
+static inline void __force_blocked_on_runnable(struct task_struct *p);
+
+/*
+ * These helpers set and clear the task blocked_on pointer, as well
+ * as setting the initial blocked_on_state, or clearing it
+ */
 static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
 {
 	WARN_ON_ONCE(!m);
@@ -2161,24 +2176,23 @@ static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
 	lockdep_assert_held_once(&p->blocked_lock);
 	/*
 	 * Check ensure we don't overwrite existing mutex value
-	 * with a different mutex. Note, setting it to the same
-	 * lock repeatedly is ok.
+	 * with a different mutex.
 	 */
-	WARN_ON_ONCE(p->blocked_on && p->blocked_on != m);
+	WARN_ON_ONCE(p->blocked_on);
 	p->blocked_on = m;
+	__force_blocked_on_blocked(p);
 }
 
 static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *m)
 {
+	/* The task should only be clearing itself */
+	WARN_ON_ONCE(p != current);
 	/* Currently we serialize blocked_on under the task::blocked_lock */
 	lockdep_assert_held_once(&p->blocked_lock);
-	/*
-	 * There may be cases where we re-clear already cleared
-	 * blocked_on relationships, but make sure we are not
-	 * clearing the relationship with a different lock.
-	 */
-	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m);
+	/* Make sure we are clearing the relationship with the right lock */
+	WARN_ON_ONCE(m && p->blocked_on != m);
 	p->blocked_on = NULL;
+	__force_blocked_on_runnable(p);
 }
 
 static inline void clear_task_blocked_on(struct task_struct *p, struct mutex *m)
@@ -2186,15 +2200,65 @@ static inline void clear_task_blocked_on(struct task_struct *p, struct mutex *m)
 	guard(raw_spinlock_irqsave)(&p->blocked_lock);
 	__clear_task_blocked_on(p, m);
 }
-#else
-static inline void __clear_task_blocked_on(struct task_struct *p, struct rt_mutex *m)
+
+/*
+ * The following helpers manage the blocked_on_state transitions while
+ * the blocked_on pointer is set.
+ */
+#ifdef CONFIG_SCHED_PROXY_EXEC
+static inline void __force_blocked_on_blocked(struct task_struct *p)
+{
+	lockdep_assert_held(&p->blocked_lock);
+	p->blocked_on_state = BO_BLOCKED;
+}
+
+static inline void __set_blocked_on_waking(struct task_struct *p)
+{
+	lockdep_assert_held(&p->blocked_lock);
+	if (p->blocked_on_state == BO_BLOCKED)
+		p->blocked_on_state = BO_WAKING;
+}
+
+static inline void set_blocked_on_waking(struct task_struct *p)
+{
+	guard(raw_spinlock_irqsave)(&p->blocked_lock);
+	__set_blocked_on_waking(p);
+}
+
+static inline void __force_blocked_on_runnable(struct task_struct *p)
 {
+	lockdep_assert_held(&p->blocked_lock);
+	p->blocked_on_state = BO_RUNNABLE;
 }
 
-static inline void clear_task_blocked_on(struct task_struct *p, struct rt_mutex *m)
+static inline void force_blocked_on_runnable(struct task_struct *p)
 {
+	guard(raw_spinlock_irqsave)(&p->blocked_lock);
+	__force_blocked_on_runnable(p);
+}
+
+static inline void __set_blocked_on_runnable(struct task_struct *p)
+{
+	lockdep_assert_held(&p->blocked_lock);
+	if (p->blocked_on_state == BO_WAKING)
+		p->blocked_on_state = BO_RUNNABLE;
+}
+
+static inline void set_blocked_on_runnable(struct task_struct *p)
+{
+	if (!sched_proxy_exec())
+		return;
+	guard(raw_spinlock_irqsave)(&p->blocked_lock);
+	__set_blocked_on_runnable(p);
 }
-#endif /* !CONFIG_PREEMPT_RT */
+#else  /* CONFIG_SCHED_PROXY_EXEC */
+static inline void __force_blocked_on_blocked(struct task_struct *p) {}
+static inline void __set_blocked_on_waking(struct task_struct *p) {}
+static inline void set_blocked_on_waking(struct task_struct *p) {}
+static inline void __force_blocked_on_runnable(struct task_struct *p) {}
+static inline void __set_blocked_on_runnable(struct task_struct *p) {}
+static inline void set_blocked_on_runnable(struct task_struct *p) {}
+#endif /* CONFIG_SCHED_PROXY_EXEC */
 
 static __always_inline bool need_resched(void)
 {
diff --git a/init/init_task.c b/init/init_task.c
index 7e29d86153d9f..63b66b4aa585a 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -174,6 +174,9 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	.mems_allowed_seq = SEQCNT_SPINLOCK_ZERO(init_task.mems_allowed_seq,
 						 &init_task.alloc_lock),
 #endif
+#ifdef CONFIG_SCHED_PROXY_EXEC
+	.blocked_on_state = BO_RUNNABLE,
+#endif
 #ifdef CONFIG_RT_MUTEXES
 	.pi_waiters	= RB_ROOT_CACHED,
 	.pi_top_task	= NULL,
diff --git a/kernel/fork.c b/kernel/fork.c
index 796cfceb2bbda..d8eb66e5be918 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2130,6 +2130,9 @@ __latent_entropy struct task_struct *copy_process(
 #endif
 
 	p->blocked_on = NULL; /* not blocked yet */
+#ifdef CONFIG_SCHED_PROXY_EXEC
+	p->blocked_on_state = BO_RUNNABLE;
+#endif
 
 #ifdef CONFIG_BCACHE
 	p->sequential_io	= 0;
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index c44fc63d4476e..d8cf2e9a22a65 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -682,11 +682,9 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 		raw_spin_lock_irqsave(&lock->wait_lock, flags);
 		raw_spin_lock(&current->blocked_lock);
 		/*
-		 * As we likely have been woken up by task
-		 * that has cleared our blocked_on state, re-set
-		 * it to the lock we are trying to acquire.
+		 * Re-set blocked_on_state as unlock path set it to WAKING/RUNNABLE
 		 */
-		__set_task_blocked_on(current, lock);
+		__force_blocked_on_blocked(current);
 		set_current_state(state);
 		/*
 		 * Here we order against unlock; we must either see it change
@@ -705,7 +703,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 			 * and clear blocked on so we don't become unselectable
 			 * to run.
 			 */
-			__clear_task_blocked_on(current, lock);
+			__force_blocked_on_runnable(current);
 			raw_spin_unlock(&current->blocked_lock);
 			raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
@@ -714,7 +712,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 
 			raw_spin_lock_irqsave(&lock->wait_lock, flags);
 			raw_spin_lock(&current->blocked_lock);
-			__set_task_blocked_on(current, lock);
+			__force_blocked_on_blocked(current);
 
 			if (opt_acquired)
 				break;
@@ -966,8 +964,11 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 
 		next = waiter->task;
 
+		raw_spin_lock(&next->blocked_lock);
 		debug_mutex_wake_waiter(lock, waiter);
-		clear_task_blocked_on(next, lock);
+		WARN_ON_ONCE(__get_task_blocked_on(next) != lock);
+		__set_blocked_on_waking(next);
+		raw_spin_unlock(&next->blocked_lock);
 		wake_q_add(&wake_q, next);
 	}
 
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index e4a81790ea7dd..f34363615eb34 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -285,11 +285,11 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
 		debug_mutex_wake_waiter(lock, waiter);
 #endif
 		/*
-		 * When waking up the task to die, be sure to clear the
-		 * blocked_on pointer. Otherwise we can see circular
-		 * blocked_on relationships that can't resolve.
+		 * When waking up the task to die, be sure to set the
+		 * blocked_on_state to BO_WAKING. Otherwise we can see
+		 * circular blocked_on relationships that can't resolve.
 		 */
-		clear_task_blocked_on(waiter->task, lock);
+		set_blocked_on_waking(waiter->task);
 		wake_q_add(wake_q, waiter->task);
 	}
 
@@ -339,15 +339,11 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
 		 */
 		if (owner != current) {
 			/*
-			 * When waking up the task to wound, be sure to clear the
-			 * blocked_on pointer. Otherwise we can see circular
-			 * blocked_on relationships that can't resolve.
-			 *
-			 * NOTE: We pass NULL here instead of lock, because we
-			 * are waking the mutex owner, who may be currently
-			 * blocked on a different mutex.
+			 * When waking up the task to wound, be sure to set the
+			 * blocked_on_state to BO_WAKING. Otherwise we can see
+			 * circular blocked_on relationships that can't resolve.
 			 */
-			clear_task_blocked_on(owner, NULL);
+			set_blocked_on_waking(owner);
 			wake_q_add(wake_q, owner);
 		}
 		return true;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 007459d42ae4a..abecd2411e29e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4328,6 +4328,12 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		ttwu_queue(p, cpu, wake_flags);
 	}
 out:
+	/*
+	 * For now, if we've been woken up, set us as BO_RUNNABLE
+	 * We will need to be more careful later when handling
+	 * proxy migration
+	 */
+	set_blocked_on_runnable(p);
 	if (success)
 		ttwu_stat(p, task_cpu(p), wake_flags);
 
@@ -6623,7 +6629,7 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *d
 		 * as unblocked, as we aren't doing proxy-migrations
 		 * yet (more logic will be needed then).
 		 */
-		donor->blocked_on = NULL;
+		force_blocked_on_runnable(donor);
 	}
 	return NULL;
 }
@@ -6651,6 +6657,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 	int this_cpu = cpu_of(rq);
 	struct task_struct *p;
 	struct mutex *mutex;
+	enum { FOUND, DEACTIVATE_DONOR } action = FOUND;
 
 	/* Follow blocked_on chain. */
 	for (p = donor; task_is_blocked(p); p = owner) {
@@ -6676,20 +6683,43 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 			return NULL;
 		}
 
+		/*
+		 * If a ww_mutex hits the die/wound case, it marks the task as
+		 * BO_WAKING and calls try_to_wake_up(), so that the mutex
+		 * cycle can be broken and we avoid a deadlock.
+		 *
+		 * However, if at that moment, we are here on the cpu which the
+		 * die/wounded task is enqueued, we might loop on the cycle as
+		 * BO_WAKING still causes task_is_blocked() to return true
+		 * (since we want return migration to occur before we run the
+		 * task).
+		 *
+		 * Unfortunately since we hold the rq lock, it will block
+		 * try_to_wake_up from completing and doing the return
+		 * migration.
+		 *
+		 * So when we hit a !BO_BLOCKED task briefly schedule idle
+		 * so we release the rq and let the wakeup complete.
+		 */
+		if (p->blocked_on_state != BO_BLOCKED)
+			return proxy_resched_idle(rq);
+
 		owner = __mutex_owner(mutex);
 		if (!owner) {
-			__clear_task_blocked_on(p, mutex);
+			__force_blocked_on_runnable(p);
 			return p;
 		}
 
 		if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
 			/* XXX Don't handle blocked owners/delayed dequeue yet */
-			return proxy_deactivate(rq, donor);
+			action = DEACTIVATE_DONOR;
+			break;
 		}
 
 		if (task_cpu(owner) != this_cpu) {
 			/* XXX Don't handle migrations yet */
-			return proxy_deactivate(rq, donor);
+			action = DEACTIVATE_DONOR;
+			break;
 		}
 
 		if (task_on_rq_migrating(owner)) {
@@ -6747,6 +6777,13 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 		 */
 	}
 
+	/* Handle actions we need to do outside of the guard() scope */
+	switch (action) {
+	case DEACTIVATE_DONOR:
+		return proxy_deactivate(rq, donor);
+	case FOUND:
+		/* fallthrough */;
+	}
 	WARN_ON_ONCE(owner && !owner->on_rq);
 	return owner;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cf2109b67f9a3..03deb68ee5f86 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2284,13 +2284,17 @@ static inline int task_current_donor(struct rq *rq, struct task_struct *p)
 	return rq->donor == p;
 }
 
+#ifdef CONFIG_SCHED_PROXY_EXEC
 static inline bool task_is_blocked(struct task_struct *p)
 {
 	if (!sched_proxy_exec())
 		return false;
 
-	return !!p->blocked_on;
+	return !!p->blocked_on && p->blocked_on_state != BO_RUNNABLE;
 }
+#else
+static inline bool task_is_blocked(struct task_struct *p) { return false; }
+#endif
 
 static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
 {
-- 
2.51.0.536.g15c5d4f767-goog


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v22 3/6] sched: Add logic to zap balance callbacks if we pick again
  2025-09-26  3:29 [PATCH v22 0/6] Donor Migration for Proxy Execution (v22) John Stultz
  2025-09-26  3:29 ` [PATCH v22 1/6] locking: Add task::blocked_lock to serialize blocked_on state John Stultz
  2025-09-26  3:29 ` [PATCH v22 2/6] sched/locking: Add blocked_on_state to provide necessary tri-state for proxy return-migration John Stultz
@ 2025-09-26  3:29 ` John Stultz
  2025-10-08 11:37   ` Peter Zijlstra
  2025-09-26  3:29 ` [PATCH v22 4/6] sched: Handle blocked-waiter migration (and return migration) John Stultz
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: John Stultz @ 2025-09-26  3:29 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

With proxy-exec, a task is selected to run via pick_next_task(),
and then if it is a mutex blocked task, we call find_proxy_task()
to find a runnable owner. If the runnable owner is on another
cpu, we will need to migrate the selected donor task away, after
which we will pick_again can call pick_next_task() to choose
something else.

However, in the first call to pick_next_task(), we may have
had a balance_callback setup by the class scheduler. After we
pick again, its possible pick_next_task_fair() will be called
which calls sched_balance_newidle() and sched_balance_rq().

This will throw a warning:
[    8.796467] rq->balance_callback && rq->balance_callback != &balance_push_callback
[    8.796467] WARNING: CPU: 32 PID: 458 at kernel/sched/sched.h:1750 sched_balance_rq+0xe92/0x1250
...
[    8.796467] Call Trace:
[    8.796467]  <TASK>
[    8.796467]  ? __warn.cold+0xb2/0x14e
[    8.796467]  ? sched_balance_rq+0xe92/0x1250
[    8.796467]  ? report_bug+0x107/0x1a0
[    8.796467]  ? handle_bug+0x54/0x90
[    8.796467]  ? exc_invalid_op+0x17/0x70
[    8.796467]  ? asm_exc_invalid_op+0x1a/0x20
[    8.796467]  ? sched_balance_rq+0xe92/0x1250
[    8.796467]  sched_balance_newidle+0x295/0x820
[    8.796467]  pick_next_task_fair+0x51/0x3f0
[    8.796467]  __schedule+0x23a/0x14b0
[    8.796467]  ? lock_release+0x16d/0x2e0
[    8.796467]  schedule+0x3d/0x150
[    8.796467]  worker_thread+0xb5/0x350
[    8.796467]  ? __pfx_worker_thread+0x10/0x10
[    8.796467]  kthread+0xee/0x120
[    8.796467]  ? __pfx_kthread+0x10/0x10
[    8.796467]  ret_from_fork+0x31/0x50
[    8.796467]  ? __pfx_kthread+0x10/0x10
[    8.796467]  ret_from_fork_asm+0x1a/0x30
[    8.796467]  </TASK>

This is because if a RT task was originally picked, it will
setup the rq->balance_callback with push_rt_tasks() via
set_next_task_rt().

Once the task is migrated away and we pick again, we haven't
processed any balance callbacks, so rq->balance_callback is not
in the same state as it was the first time pick_next_task was
called.

To handle this, add a zap_balance_callbacks() helper function
which cleans up the balance callbacks without running them. This
should be ok, as we are effectively undoing the state set in
the first call to pick_next_task(), and when we pick again,
the new callback can be configured for the donor task actually
selected.

Signed-off-by: John Stultz <jstultz@google.com>
---
v20:
* Tweaked to avoid build issues with different configs
v22:
* Spelling fix suggested by K Prateek
* Collapsed the stub implementation to one line as suggested
  by K Prateek
* Zap callbacks when we resched idle, as suggested by K Prateek

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/sched/core.c | 41 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 39 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index abecd2411e29e..7bba05c07a79d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5001,6 +5001,38 @@ static inline void finish_task(struct task_struct *prev)
 	smp_store_release(&prev->on_cpu, 0);
 }
 
+#ifdef CONFIG_SCHED_PROXY_EXEC
+/*
+ * Only called from __schedule context
+ *
+ * There are some cases where we are going to re-do the action
+ * that added the balance callbacks. We may not be in a state
+ * where we can run them, so just zap them so they can be
+ * properly re-added on the next time around. This is similar
+ * handling to running the callbacks, except we just don't call
+ * them.
+ */
+static void zap_balance_callbacks(struct rq *rq)
+{
+	struct balance_callback *next, *head;
+	bool found = false;
+
+	lockdep_assert_rq_held(rq);
+
+	head = rq->balance_callback;
+	while (head) {
+		if (head == &balance_push_callback)
+			found = true;
+		next = head->next;
+		head->next = NULL;
+		head = next;
+	}
+	rq->balance_callback = found ? &balance_push_callback : NULL;
+}
+#else
+static inline void zap_balance_callbacks(struct rq *rq) {}
+#endif
+
 static void do_balance_callbacks(struct rq *rq, struct balance_callback *head)
 {
 	void (*func)(struct rq *rq);
@@ -6942,10 +6974,15 @@ static void __sched notrace __schedule(int sched_mode)
 	rq_set_donor(rq, next);
 	if (unlikely(task_is_blocked(next))) {
 		next = find_proxy_task(rq, next, &rf);
-		if (!next)
+		if (!next) {
+			/* zap the balance_callbacks before picking again */
+			zap_balance_callbacks(rq);
 			goto pick_again;
-		if (next == rq->idle)
+		}
+		if (next == rq->idle) {
+			zap_balance_callbacks(rq);
 			goto keep_resched;
+		}
 	}
 picked:
 	clear_tsk_need_resched(prev);
-- 
2.51.0.536.g15c5d4f767-goog


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v22 4/6] sched: Handle blocked-waiter migration (and return migration)
  2025-09-26  3:29 [PATCH v22 0/6] Donor Migration for Proxy Execution (v22) John Stultz
                   ` (2 preceding siblings ...)
  2025-09-26  3:29 ` [PATCH v22 3/6] sched: Add logic to zap balance callbacks if we pick again John Stultz
@ 2025-09-26  3:29 ` John Stultz
  2025-10-08 13:32   ` Peter Zijlstra
  2025-09-26  3:29 ` [PATCH v22 5/6] sched: Add blocked_donor link to task for smarter mutex handoffs John Stultz
  2025-09-26  3:29 ` [PATCH v22 6/6] sched: Migrate whole chain in proxy_migrate_task() John Stultz
  5 siblings, 1 reply; 17+ messages in thread
From: John Stultz @ 2025-09-26  3:29 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

Add logic to handle migrating a blocked waiter to a remote
cpu where the lock owner is runnable.

Additionally, as the blocked task may not be able to run
on the remote cpu, add logic to handle return migration once
the waiting task is given the mutex.

Because tasks may get migrated to where they cannot run, also
modify the scheduling classes to avoid sched class migrations on
mutex blocked tasks, leaving find_proxy_task() and related logic
to do the migrations and return migrations.

This was split out from the larger proxy patch, and
significantly reworked.

Credits for the original patch go to:
  Peter Zijlstra (Intel) <peterz@infradead.org>
  Juri Lelli <juri.lelli@redhat.com>
  Valentin Schneider <valentin.schneider@arm.com>
  Connor O'Brien <connoro@google.com>

Signed-off-by: John Stultz <jstultz@google.com>
---
v6:
* Integrated sched_proxy_exec() check in proxy_return_migration()
* Minor cleanups to diff
* Unpin the rq before calling __balance_callbacks()
* Tweak proxy migrate to migrate deeper task in chain, to avoid
  tasks pingponging between rqs
v7:
* Fixup for unused function arguments
* Switch from that_rq -> target_rq, other minor tweaks, and typo
  fixes suggested by Metin Kaya
* Switch back to doing return migration in the ttwu path, which
  avoids nasty lock juggling and performance issues
* Fixes for UP builds
v8:
* More simplifications from Metin Kaya
* Fixes for null owner case, including doing return migration
* Cleanup proxy_needs_return logic
v9:
* Narrow logic in ttwu that sets BO_RUNNABLE, to avoid missed
  return migrations
* Switch to using zap_balance_callbacks rathern then running
  them when we are dropping rq locks for proxy_migration.
* Drop task_is_blocked check in sched_submit_work as suggested
  by Metin (may re-add later if this causes trouble)
* Do return migration when we're not on wake_cpu. This avoids
  bad task placement caused by proxy migrations raised by
  Xuewen Yan
* Fix to call set_next_task(rq->curr) prior to dropping rq lock
  to avoid rq->curr getting migrated before we have actually
  switched from it
* Cleanup to re-use proxy_resched_idle() instead of open coding
  it in proxy_migrate_task()
* Fix return migration not to use DEQUEUE_SLEEP, so that we
  properly see the task as task_on_rq_migrating() after it is
  dequeued but before set_task_cpu() has been called on it
* Fix to broaden find_proxy_task() checks to avoid race where
  a task is dequeued off the rq due to return migration, but
  set_task_cpu() and the enqueue on another rq happened after
  we checked task_cpu(owner). This ensures we don't proxy
  using a task that is not actually on our runqueue.
* Cleanup to avoid the locked BO_WAKING->BO_RUNNABLE transition
  in try_to_wake_up() if proxy execution isn't enabled.
* Cleanup to improve comment in proxy_migrate_task() explaining
  the set_next_task(rq->curr) logic
* Cleanup deadline.c change to stylistically match rt.c change
* Numerous cleanups suggested by Metin
v10:
* Drop WARN_ON(task_is_blocked(p)) in ttwu current case
v11:
* Include proxy_set_task_cpu from later in the series to this
  change so we can use it, rather then reworking logic later
  in the series.
* Fix problem with return migration, where affinity was changed
  and wake_cpu was left outside the affinity mask.
* Avoid reading the owner's cpu twice (as it might change inbetween)
  to avoid occasional migration-to-same-cpu edge cases
* Add extra WARN_ON checks for wake_cpu and return migration
  edge cases.
* Typo fix from Metin
v13:
* As we set ret, return it, not just NULL (pulling this change
  in from later patch)
* Avoid deadlock between try_to_wake_up() and find_proxy_task() when
  blocked_on cycle with ww_mutex is trying a mid-chain wakeup.
* Tweaks to use new __set_blocked_on_runnable() helper
* Potential fix for incorrectly updated task->dl_server issues
* Minor comment improvements
* Add logic to handle missed wakeups, in that case doing return
  migration from the find_proxy_task() path
* Minor cleanups
v14:
* Improve edge cases where we wouldn't set the task as BO_RUNNABLE
v15:
* Added comment to better describe proxy_needs_return() as suggested
  by Qais
* Build fixes for !CONFIG_SMP reported by
  Maciej Żenczykowski <maze@google.com>
* Adds fix for re-evaluating proxy_needs_return when
  sched_proxy_exec() is disabled, reported and diagnosed by:
  kuyo chang <kuyo.chang@mediatek.com>
v16:
* Larger rework of needs_return logic in find_proxy_task, in
  order to avoid problems with cpuhotplug
* Rework to use guard() as suggested by Peter
v18:
* Integrate optimization suggested by Suleiman to do the checks
  for sleeping owners before checking if the task_cpu is this_cpu,
  so that we can avoid needlessly proxy-migrating tasks to only
  then dequeue them. Also check if migrating last.
* Improve comments around guard locking
* Include tweak to ttwu_runnable() as suggested by
  hupu <hupu.gm@gmail.com>
* Rework the logic releasing the rq->donor reference before letting
  go of the rqlock. Just use rq->idle.
* Go back to doing return migration on BO_WAKING owners, as I was
  hitting some softlockups caused by running tasks not making
  it out of BO_WAKING.
v19:
* Fixed proxy_force_return() logic for !SMP cases
v21:
* Reworked donor deactivation for unhandled sleeping owners
* Commit message tweaks
v22:
* Add comments around zap_balance_callbacks in proxy_migration logic
* Rework logic to avoid gotos out of guard() scopes, and instead
  use break and switch() on action value, as suggested by K Prateek
* K Prateek suggested simplifications around putting donor and
  setting idle as next task in the migration paths, which I further
  simplified to using proxy_resched_idle()
* Comment improvements
* Dropped curr != donor check in pick_next_task_fair() suggested by
  K Prateek

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/sched/core.c | 256 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 228 insertions(+), 28 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7bba05c07a79d..d063d2c9bd5aa 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3157,6 +3157,14 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
 
 	__do_set_cpus_allowed(p, ctx);
 
+	/*
+	 * It might be that the p->wake_cpu is no longer
+	 * allowed, so set it to the dest_cpu so return
+	 * migration doesn't send it to an invalid cpu
+	 */
+	if (!is_cpu_allowed(p, p->wake_cpu))
+		p->wake_cpu = dest_cpu;
+
 	return affine_move_task(rq, p, rf, dest_cpu, ctx->flags);
 
 out:
@@ -3717,6 +3725,72 @@ static inline void ttwu_do_wakeup(struct task_struct *p)
 	trace_sched_wakeup(p);
 }
 
+#ifdef CONFIG_SCHED_PROXY_EXEC
+static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
+{
+	unsigned int wake_cpu;
+
+	/*
+	 * Since we are enqueuing a blocked task on a cpu it may
+	 * not be able to run on, preserve wake_cpu when we
+	 * __set_task_cpu so we can return the task to where it
+	 * was previously runnable.
+	 */
+	wake_cpu = p->wake_cpu;
+	__set_task_cpu(p, cpu);
+	p->wake_cpu = wake_cpu;
+}
+
+static bool proxy_task_runnable_but_waking(struct task_struct *p)
+{
+	if (!sched_proxy_exec())
+		return false;
+	return (READ_ONCE(p->__state) == TASK_RUNNING &&
+		READ_ONCE(p->blocked_on_state) == BO_WAKING);
+}
+
+/*
+ * Checks to see if task p has been proxy-migrated to another rq
+ * and needs to be returned. If so, we deactivate the task here
+ * so that it can be properly woken up on the p->wake_cpu
+ * (or whichever cpu select_task_rq() picks at the bottom of
+ * try_to_wake_up()
+ */
+static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
+{
+	bool ret = false;
+
+	if (!sched_proxy_exec())
+		return false;
+
+	raw_spin_lock(&p->blocked_lock);
+	if (__get_task_blocked_on(p) && p->blocked_on_state == BO_WAKING) {
+		if (!task_current(rq, p) && (p->wake_cpu != cpu_of(rq))) {
+			if (task_current_donor(rq, p)) {
+				put_prev_task(rq, p);
+				rq_set_donor(rq, rq->idle);
+			}
+			deactivate_task(rq, p, DEQUEUE_NOCLOCK);
+			ret = true;
+		}
+		__set_blocked_on_runnable(p);
+		resched_curr(rq);
+	}
+	raw_spin_unlock(&p->blocked_lock);
+	return ret;
+}
+#else /* !CONFIG_SCHED_PROXY_EXEC */
+static bool proxy_task_runnable_but_waking(struct task_struct *p)
+{
+	return false;
+}
+
+static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
+{
+	return false;
+}
+#endif /* CONFIG_SCHED_PROXY_EXEC */
+
 static void
 ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 		 struct rq_flags *rf)
@@ -3802,6 +3876,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
 		update_rq_clock(rq);
 		if (p->se.sched_delayed)
 			enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
+		if (proxy_needs_return(rq, p))
+			goto out;
 		if (!task_on_cpu(rq, p)) {
 			/*
 			 * When on_rq && !on_cpu the task is preempted, see if
@@ -3812,6 +3888,7 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
 		ttwu_do_wakeup(p);
 		ret = 1;
 	}
+out:
 	__task_rq_unlock(rq, &rf);
 
 	return ret;
@@ -4199,6 +4276,8 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		 *    it disabling IRQs (this allows not taking ->pi_lock).
 		 */
 		WARN_ON_ONCE(p->se.sched_delayed);
+		/* If current is waking up, we know we can run here, so set BO_RUNNBLE */
+		set_blocked_on_runnable(p);
 		if (!ttwu_state_match(p, state, &success))
 			goto out;
 
@@ -4215,8 +4294,15 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 */
 	scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
 		smp_mb__after_spinlock();
-		if (!ttwu_state_match(p, state, &success))
-			break;
+		if (!ttwu_state_match(p, state, &success)) {
+			/*
+			 * If we're already TASK_RUNNING, and BO_WAKING
+			 * continue on to ttwu_runnable check to force
+			 * proxy_needs_return evaluation
+			 */
+			if (!proxy_task_runnable_but_waking(p))
+				break;
+		}
 
 		trace_sched_waking(p);
 
@@ -4278,6 +4364,7 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		 * enqueue, such as ttwu_queue_wakelist().
 		 */
 		WRITE_ONCE(p->__state, TASK_WAKING);
+		set_blocked_on_runnable(p);
 
 		/*
 		 * If the owning (remote) CPU is still in the middle of schedule() with
@@ -4328,12 +4415,6 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		ttwu_queue(p, cpu, wake_flags);
 	}
 out:
-	/*
-	 * For now, if we've been woken up, set us as BO_RUNNABLE
-	 * We will need to be more careful later when handling
-	 * proxy migration
-	 */
-	set_blocked_on_runnable(p);
 	if (success)
 		ttwu_stat(p, task_cpu(p), wake_flags);
 
@@ -6633,7 +6714,7 @@ static inline struct task_struct *proxy_resched_idle(struct rq *rq)
 	return rq->idle;
 }
 
-static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
+static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
 {
 	unsigned long state = READ_ONCE(donor->__state);
 
@@ -6653,17 +6734,97 @@ static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
 	return try_to_block_task(rq, donor, &state, true);
 }
 
-static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *donor)
+/*
+ * If the blocked-on relationship crosses CPUs, migrate @p to the
+ * owner's CPU.
+ *
+ * This is because we must respect the CPU affinity of execution
+ * contexts (owner) but we can ignore affinity for scheduling
+ * contexts (@p). So we have to move scheduling contexts towards
+ * potential execution contexts.
+ *
+ * Note: The owner can disappear, but simply migrate to @target_cpu
+ * and leave that CPU to sort things out.
+ */
+static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
+			       struct task_struct *p, int target_cpu)
 {
-	if (!__proxy_deactivate(rq, donor)) {
-		/*
-		 * XXX: For now, if deactivation failed, set donor
-		 * as unblocked, as we aren't doing proxy-migrations
-		 * yet (more logic will be needed then).
-		 */
-		force_blocked_on_runnable(donor);
-	}
-	return NULL;
+	struct rq *target_rq = cpu_rq(target_cpu);
+
+	lockdep_assert_rq_held(rq);
+
+	/*
+	 * Since we're going to drop @rq, we have to put(@rq->donor) first,
+	 * otherwise we have a reference that no longer belongs to us.
+	 *
+	 * Additionally, as we put_prev_task(prev) earlier, its possible that
+	 * prev will migrate away as soon as we drop the rq lock, however we
+	 * still have it marked as rq->curr, as we've not yet switched tasks.
+	 *
+	 * So call proxy_resched_idle() to let go of the references before
+	 * we release the lock.
+	 */
+	proxy_resched_idle(rq);
+
+	WARN_ON(p == rq->curr);
+
+	deactivate_task(rq, p, 0);
+	proxy_set_task_cpu(p, target_cpu);
+
+	/*
+	 * We have to zap callbacks before unlocking the rq
+	 * as another CPU may jump in and call sched_balance_rq
+	 * which can trip the warning in rq_pin_lock() if we
+	 * leave callbacks set.
+	 */
+	zap_balance_callbacks(rq);
+	rq_unpin_lock(rq, rf);
+	raw_spin_rq_unlock(rq);
+	raw_spin_rq_lock(target_rq);
+
+	activate_task(target_rq, p, 0);
+	wakeup_preempt(target_rq, p, 0);
+
+	raw_spin_rq_unlock(target_rq);
+	raw_spin_rq_lock(rq);
+	rq_repin_lock(rq, rf);
+}
+
+static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
+			       struct task_struct *p)
+{
+	lockdep_assert_rq_held(rq);
+
+	proxy_resched_idle(rq);
+
+	WARN_ON(p == rq->curr);
+
+	set_blocked_on_waking(p);
+	get_task_struct(p);
+	block_task(rq, p, 0);
+
+	/*
+	 * We have to zap callbacks before unlocking the rq
+	 * as another CPU may jump in and call sched_balance_rq
+	 * which can trip the warning in rq_pin_lock() if we
+	 * leave callbacks set.
+	 */
+	zap_balance_callbacks(rq);
+	rq_unpin_lock(rq, rf);
+	raw_spin_rq_unlock(rq);
+
+	wake_up_process(p);
+	put_task_struct(p);
+
+	raw_spin_rq_lock(rq);
+	rq_repin_lock(rq, rf);
+}
+
+static inline bool proxy_can_run_here(struct rq *rq, struct task_struct *p)
+{
+	if (p == rq->curr || p->wake_cpu == cpu_of(rq))
+		return true;
+	return false;
 }
 
 /*
@@ -6686,10 +6847,12 @@ static struct task_struct *
 find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 {
 	struct task_struct *owner = NULL;
+	bool curr_in_chain = false;
 	int this_cpu = cpu_of(rq);
 	struct task_struct *p;
 	struct mutex *mutex;
-	enum { FOUND, DEACTIVATE_DONOR } action = FOUND;
+	int owner_cpu;
+	enum { FOUND, DEACTIVATE_DONOR, MIGRATE, NEEDS_RETURN } action = FOUND;
 
 	/* Follow blocked_on chain. */
 	for (p = donor; task_is_blocked(p); p = owner) {
@@ -6715,6 +6878,10 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 			return NULL;
 		}
 
+		/* Double check blocked_on_state now we're holding the lock */
+		if (p->blocked_on_state == BO_RUNNABLE)
+			return p;
+
 		/*
 		 * If a ww_mutex hits the die/wound case, it marks the task as
 		 * BO_WAKING and calls try_to_wake_up(), so that the mutex
@@ -6730,27 +6897,50 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 		 * try_to_wake_up from completing and doing the return
 		 * migration.
 		 *
-		 * So when we hit a !BO_BLOCKED task briefly schedule idle
-		 * so we release the rq and let the wakeup complete.
+		 * So when we hit a BO_WAKING task try to wake it up ourselves.
 		 */
-		if (p->blocked_on_state != BO_BLOCKED)
-			return proxy_resched_idle(rq);
+		if (p->blocked_on_state == BO_WAKING) {
+			if (task_current(rq, p)) {
+				/* If its current just set it runnable */
+				__force_blocked_on_runnable(p);
+				return p;
+			}
+			action = NEEDS_RETURN;
+			break;
+		}
+
+		if (task_current(rq, p))
+			curr_in_chain = true;
 
 		owner = __mutex_owner(mutex);
 		if (!owner) {
+			/* If the owner is null, we may have some work to do */
+			if (!proxy_can_run_here(rq, p)) {
+				action = NEEDS_RETURN;
+				break;
+			}
+
 			__force_blocked_on_runnable(p);
 			return p;
 		}
 
 		if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
 			/* XXX Don't handle blocked owners/delayed dequeue yet */
+			if (curr_in_chain)
+				return proxy_resched_idle(rq);
 			action = DEACTIVATE_DONOR;
 			break;
 		}
 
-		if (task_cpu(owner) != this_cpu) {
-			/* XXX Don't handle migrations yet */
-			action = DEACTIVATE_DONOR;
+		owner_cpu = task_cpu(owner);
+		if (owner_cpu != this_cpu) {
+			/*
+			 * @owner can disappear, simply migrate to @owner_cpu
+			 * and leave that CPU to sort things out.
+			 */
+			if (curr_in_chain)
+				return proxy_resched_idle(rq);
+			action = MIGRATE;
 			break;
 		}
 
@@ -6812,7 +7002,17 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 	/* Handle actions we need to do outside of the guard() scope */
 	switch (action) {
 	case DEACTIVATE_DONOR:
-		return proxy_deactivate(rq, donor);
+		if (proxy_deactivate(rq, donor))
+			return NULL;
+		/* If deactivate fails, force return */
+		p = donor;
+		fallthrough;
+	case NEEDS_RETURN:
+		proxy_force_return(rq, rf, p);
+		return NULL;
+	case MIGRATE:
+		proxy_migrate_task(rq, rf, p, owner_cpu);
+		return NULL;
 	case FOUND:
 		/* fallthrough */;
 	}
-- 
2.51.0.536.g15c5d4f767-goog


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v22 5/6] sched: Add blocked_donor link to task for smarter mutex handoffs
  2025-09-26  3:29 [PATCH v22 0/6] Donor Migration for Proxy Execution (v22) John Stultz
                   ` (3 preceding siblings ...)
  2025-09-26  3:29 ` [PATCH v22 4/6] sched: Handle blocked-waiter migration (and return migration) John Stultz
@ 2025-09-26  3:29 ` John Stultz
  2025-09-26  3:29 ` [PATCH v22 6/6] sched: Migrate whole chain in proxy_migrate_task() John Stultz
  5 siblings, 0 replies; 17+ messages in thread
From: John Stultz @ 2025-09-26  3:29 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Juri Lelli, Valentin Schneider,
	Connor O'Brien, John Stultz, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

From: Peter Zijlstra <peterz@infradead.org>

Add link to the task this task is proxying for, and use it so
the mutex owner can do an intelligent hand-off of the mutex to
the task that the owner is running on behalf.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
[jstultz: This patch was split out from larger proxy patch]
Signed-off-by: John Stultz <jstultz@google.com>
---
v5:
* Split out from larger proxy patch
v6:
* Moved proxied value from earlier patch to this one where it
  is actually used
* Rework logic to check sched_proxy_exec() instead of using ifdefs
* Moved comment change to this patch where it makes sense
v7:
* Use more descriptive term then "us" in comments, as suggested
  by Metin Kaya.
* Minor typo fixup from Metin Kaya
* Reworked proxied variable to prev_not_proxied to simplify usage
v8:
* Use helper for donor blocked_on_state transition
v9:
* Re-add mutex lock handoff in the unlock path, but only when we
  have a blocked donor
* Slight reword of commit message suggested by Metin
v18:
* Add task_init initialization for blocked_donor, suggested by
  Suleiman

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 include/linux/sched.h  |  1 +
 init/init_task.c       |  1 +
 kernel/fork.c          |  2 +-
 kernel/locking/mutex.c | 41 ++++++++++++++++++++++++++++++++++++++---
 kernel/sched/core.c    | 18 ++++++++++++++++--
 5 files changed, 57 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8245940783c77..5ca495d5d0a2d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1239,6 +1239,7 @@ struct task_struct {
 #endif
 
 	struct mutex			*blocked_on;	/* lock we're blocked on */
+	struct task_struct		*blocked_donor;	/* task that is boosting this task */
 	raw_spinlock_t			blocked_lock;
 #ifdef CONFIG_SCHED_PROXY_EXEC
 	enum blocked_on_state		blocked_on_state;
diff --git a/init/init_task.c b/init/init_task.c
index 63b66b4aa585a..4fb95ab1810a3 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -177,6 +177,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 #ifdef CONFIG_SCHED_PROXY_EXEC
 	.blocked_on_state = BO_RUNNABLE,
 #endif
+	.blocked_donor = NULL,
 #ifdef CONFIG_RT_MUTEXES
 	.pi_waiters	= RB_ROOT_CACHED,
 	.pi_top_task	= NULL,
diff --git a/kernel/fork.c b/kernel/fork.c
index d8eb66e5be918..651ebe85e1521 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2130,10 +2130,10 @@ __latent_entropy struct task_struct *copy_process(
 #endif
 
 	p->blocked_on = NULL; /* not blocked yet */
+	p->blocked_donor = NULL; /* nobody is boosting p yet */
 #ifdef CONFIG_SCHED_PROXY_EXEC
 	p->blocked_on_state = BO_RUNNABLE;
 #endif
-
 #ifdef CONFIG_BCACHE
 	p->sequential_io	= 0;
 	p->sequential_io_avg	= 0;
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index d8cf2e9a22a65..fca2ee0756b1f 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -924,7 +924,7 @@ EXPORT_SYMBOL_GPL(ww_mutex_lock_interruptible);
  */
 static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigned long ip)
 {
-	struct task_struct *next = NULL;
+	struct task_struct *donor, *next = NULL;
 	DEFINE_WAKE_Q(wake_q);
 	unsigned long owner;
 	unsigned long flags;
@@ -943,6 +943,12 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 		MUTEX_WARN_ON(__owner_task(owner) != current);
 		MUTEX_WARN_ON(owner & MUTEX_FLAG_PICKUP);
 
+		if (sched_proxy_exec() && current->blocked_donor) {
+			/* force handoff if we have a blocked_donor */
+			owner = MUTEX_FLAG_HANDOFF;
+			break;
+		}
+
 		if (owner & MUTEX_FLAG_HANDOFF)
 			break;
 
@@ -956,7 +962,34 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 
 	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	debug_mutex_unlock(lock);
-	if (!list_empty(&lock->wait_list)) {
+
+	if (sched_proxy_exec()) {
+		raw_spin_lock(&current->blocked_lock);
+		/*
+		 * If we have a task boosting current, and that task was boosting
+		 * current through this lock, hand the lock to that task, as that
+		 * is the highest waiter, as selected by the scheduling function.
+		 */
+		donor = current->blocked_donor;
+		if (donor) {
+			struct mutex *next_lock;
+
+			raw_spin_lock_nested(&donor->blocked_lock, SINGLE_DEPTH_NESTING);
+			next_lock = __get_task_blocked_on(donor);
+			if (next_lock == lock) {
+				next = donor;
+				__set_blocked_on_waking(donor);
+				wake_q_add(&wake_q, donor);
+				current->blocked_donor = NULL;
+			}
+			raw_spin_unlock(&donor->blocked_lock);
+		}
+	}
+
+	/*
+	 * Failing that, pick any on the wait list.
+	 */
+	if (!next && !list_empty(&lock->wait_list)) {
 		/* get the first entry from the wait-list: */
 		struct mutex_waiter *waiter =
 			list_first_entry(&lock->wait_list,
@@ -964,7 +997,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 
 		next = waiter->task;
 
-		raw_spin_lock(&next->blocked_lock);
+		raw_spin_lock_nested(&next->blocked_lock, SINGLE_DEPTH_NESTING);
 		debug_mutex_wake_waiter(lock, waiter);
 		WARN_ON_ONCE(__get_task_blocked_on(next) != lock);
 		__set_blocked_on_waking(next);
@@ -975,6 +1008,8 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 	if (owner & MUTEX_FLAG_HANDOFF)
 		__mutex_handoff(lock, next);
 
+	if (sched_proxy_exec())
+		raw_spin_unlock(&current->blocked_lock);
 	raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q);
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d063d2c9bd5aa..bccaa4bf41b7d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6831,7 +6831,17 @@ static inline bool proxy_can_run_here(struct rq *rq, struct task_struct *p)
  * Find runnable lock owner to proxy for mutex blocked donor
  *
  * Follow the blocked-on relation:
- *   task->blocked_on -> mutex->owner -> task...
+ *
+ *                ,-> task
+ *                |     | blocked-on
+ *                |     v
+ *  blocked_donor |   mutex
+ *                |     | owner
+ *                |     v
+ *                `-- task
+ *
+ * and set the blocked_donor relation, this latter is used by the mutex
+ * code to find which (blocked) task to hand-off to.
  *
  * Lock order:
  *
@@ -6997,6 +7007,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 		 * rq, therefore holding @rq->lock is sufficient to
 		 * guarantee its existence, as per ttwu_remote().
 		 */
+		owner->blocked_donor = p;
 	}
 
 	/* Handle actions we need to do outside of the guard() scope */
@@ -7097,6 +7108,7 @@ static void __sched notrace __schedule(int sched_mode)
 	unsigned long prev_state;
 	struct rq_flags rf;
 	struct rq *rq;
+	bool prev_not_proxied;
 	int cpu;
 
 	/* Trace preemptions consistently with task switches */
@@ -7169,9 +7181,11 @@ static void __sched notrace __schedule(int sched_mode)
 		switch_count = &prev->nvcsw;
 	}
 
+	prev_not_proxied = !prev->blocked_donor;
 pick_again:
 	next = pick_next_task(rq, rq->donor, &rf);
 	rq_set_donor(rq, next);
+	next->blocked_donor = NULL;
 	if (unlikely(task_is_blocked(next))) {
 		next = find_proxy_task(rq, next, &rf);
 		if (!next) {
@@ -7237,7 +7251,7 @@ static void __sched notrace __schedule(int sched_mode)
 		rq = context_switch(rq, prev, next, &rf);
 	} else {
 		/* In case next was already curr but just got blocked_donor */
-		if (!task_current_donor(rq, next))
+		if (prev_not_proxied && next->blocked_donor)
 			proxy_tag_curr(rq, next);
 
 		rq_unpin_lock(rq, &rf);
-- 
2.51.0.536.g15c5d4f767-goog


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v22 6/6] sched: Migrate whole chain in proxy_migrate_task()
  2025-09-26  3:29 [PATCH v22 0/6] Donor Migration for Proxy Execution (v22) John Stultz
                   ` (4 preceding siblings ...)
  2025-09-26  3:29 ` [PATCH v22 5/6] sched: Add blocked_donor link to task for smarter mutex handoffs John Stultz
@ 2025-09-26  3:29 ` John Stultz
  5 siblings, 0 replies; 17+ messages in thread
From: John Stultz @ 2025-09-26  3:29 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

Instead of migrating one task each time through find_proxy_task(),
we can walk up the blocked_donor ptrs and migrate the entire
current chain in one go.

This was broken out of earlier patches and held back while the
series was being stabilized, but I wanted to re-introduce it.

Signed-off-by: John Stultz <jstultz@google.com>
---
v12:
* Earlier this was re-using blocked_node, but I hit
  a race with activating blocked entities, and to
  avoid it introduced a new migration_node listhead
v18:
* Add init_task initialization of migration_node as suggested
  by Suleiman
v22:
* Move migration_node under CONFIG_SCHED_PROXY_EXEC as suggested
  by K Prateek
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 include/linux/sched.h |  1 +
 init/init_task.c      |  3 ++-
 kernel/fork.c         |  1 +
 kernel/sched/core.c   | 25 +++++++++++++++++--------
 4 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5ca495d5d0a2d..4a3c836d0bab3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1243,6 +1243,7 @@ struct task_struct {
 	raw_spinlock_t			blocked_lock;
 #ifdef CONFIG_SCHED_PROXY_EXEC
 	enum blocked_on_state		blocked_on_state;
+	struct list_head		migration_node;
 #endif
 
 #ifdef CONFIG_DETECT_HUNG_TASK_BLOCKER
diff --git a/init/init_task.c b/init/init_task.c
index 4fb95ab1810a3..26dc30e2827cd 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -174,10 +174,11 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	.mems_allowed_seq = SEQCNT_SPINLOCK_ZERO(init_task.mems_allowed_seq,
 						 &init_task.alloc_lock),
 #endif
+	.blocked_donor = NULL,
 #ifdef CONFIG_SCHED_PROXY_EXEC
 	.blocked_on_state = BO_RUNNABLE,
+	.migration_node = LIST_HEAD_INIT(init_task.migration_node),
 #endif
-	.blocked_donor = NULL,
 #ifdef CONFIG_RT_MUTEXES
 	.pi_waiters	= RB_ROOT_CACHED,
 	.pi_top_task	= NULL,
diff --git a/kernel/fork.c b/kernel/fork.c
index 651ebe85e1521..f195aff7470ce 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2133,6 +2133,7 @@ __latent_entropy struct task_struct *copy_process(
 	p->blocked_donor = NULL; /* nobody is boosting p yet */
 #ifdef CONFIG_SCHED_PROXY_EXEC
 	p->blocked_on_state = BO_RUNNABLE;
+	INIT_LIST_HEAD(&p->migration_node);
 #endif
 #ifdef CONFIG_BCACHE
 	p->sequential_io	= 0;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bccaa4bf41b7d..9dfc4d705e295 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6750,6 +6750,7 @@ static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
 			       struct task_struct *p, int target_cpu)
 {
 	struct rq *target_rq = cpu_rq(target_cpu);
+	LIST_HEAD(migrate_list);
 
 	lockdep_assert_rq_held(rq);
 
@@ -6766,11 +6767,16 @@ static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
 	 */
 	proxy_resched_idle(rq);
 
-	WARN_ON(p == rq->curr);
-
-	deactivate_task(rq, p, 0);
-	proxy_set_task_cpu(p, target_cpu);
-
+	for (; p; p = p->blocked_donor) {
+		WARN_ON(p == rq->curr);
+		deactivate_task(rq, p, 0);
+		proxy_set_task_cpu(p, target_cpu);
+		/*
+		 * We can abuse blocked_node to migrate the thing,
+		 * because @p was still on the rq.
+		 */
+		list_add(&p->migration_node, &migrate_list);
+	}
 	/*
 	 * We have to zap callbacks before unlocking the rq
 	 * as another CPU may jump in and call sched_balance_rq
@@ -6781,10 +6787,13 @@ static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
 	rq_unpin_lock(rq, rf);
 	raw_spin_rq_unlock(rq);
 	raw_spin_rq_lock(target_rq);
+	while (!list_empty(&migrate_list)) {
+		p = list_first_entry(&migrate_list, struct task_struct, migration_node);
+		list_del_init(&p->migration_node);
 
-	activate_task(target_rq, p, 0);
-	wakeup_preempt(target_rq, p, 0);
-
+		activate_task(target_rq, p, 0);
+		wakeup_preempt(target_rq, p, 0);
+	}
 	raw_spin_rq_unlock(target_rq);
 	raw_spin_rq_lock(rq);
 	rq_repin_lock(rq, rf);
-- 
2.51.0.536.g15c5d4f767-goog


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v22 1/6] locking: Add task::blocked_lock to serialize blocked_on state
  2025-09-26  3:29 ` [PATCH v22 1/6] locking: Add task::blocked_lock to serialize blocked_on state John Stultz
@ 2025-10-08 10:27   ` Peter Zijlstra
  0 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2025-10-08 10:27 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, K Prateek Nayak, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Fri, Sep 26, 2025 at 03:29:09AM +0000, John Stultz wrote:
> So far, we have been able to utilize the mutex::wait_lock
> for serializing the blocked_on state, but when we move to
> proxying across runqueues, we will need to add more state
> and a way to serialize changes to this state in contexts
> where we don't hold the mutex::wait_lock.
> 
> So introduce the task::blocked_lock, which nests under the
> mutex::wait_lock in the locking order, and rework the locking
> to use it.
> 
> Signed-off-by: John Stultz <jstultz@google.com>
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
>  include/linux/sched.h        | 52 +++++++++++++++---------------------
>  init/init_task.c             |  1 +
>  kernel/fork.c                |  1 +
>  kernel/locking/mutex-debug.c |  4 +--
>  kernel/locking/mutex.c       | 40 +++++++++++++++++----------
>  kernel/locking/ww_mutex.h    |  4 +--
>  kernel/sched/core.c          |  4 ++-
>  7 files changed, 57 insertions(+), 49 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index e4ce0a76831e5..cb4e81d9d9b67 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h

> +static inline struct mutex *get_task_blocked_on(struct task_struct *p)
> +{
> +	guard(raw_spinlock_irqsave)(&p->blocked_lock);
> +	return __get_task_blocked_on(p);
>  }

This isn't a safe function in general; nothing guarantees the value
returned is stable. Perhaps move it inside kernel/locking/mutex.h, its
only users (below) are mutex debug code after all.

> diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
> index 949103fd8e9b5..1d8cff71f65e1 100644
> --- a/kernel/locking/mutex-debug.c
> +++ b/kernel/locking/mutex-debug.c
> @@ -54,13 +54,13 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
>  	lockdep_assert_held(&lock->wait_lock);
>  
>  	/* Current thread can't be already blocked (since it's executing!) */
> -	DEBUG_LOCKS_WARN_ON(__get_task_blocked_on(task));
> +	DEBUG_LOCKS_WARN_ON(get_task_blocked_on(task));
>  }
>  
>  void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
>  			 struct task_struct *task)
>  {
> -	struct mutex *blocked_on = __get_task_blocked_on(task);
> +	struct mutex *blocked_on = get_task_blocked_on(task);
>  
>  	DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
>  	DEBUG_LOCKS_WARN_ON(waiter->task != task);
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index de7d6702cd96c..c44fc63d4476e 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c

> @@ -740,11 +752,11 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>  	return 0;
>  
>  err:
> -	__clear_task_blocked_on(current, lock);
> +	clear_task_blocked_on(current, lock);
>  	__set_current_state(TASK_RUNNING);
>  	__mutex_remove_waiter(lock, &waiter);
>  err_early_kill:
> -	WARN_ON(__get_task_blocked_on(current));
> +	WARN_ON(get_task_blocked_on(current));
>  	trace_contention_end(lock, ret);
>  	raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q);
>  	debug_mutex_free_waiter(&waiter);

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v22 2/6] sched/locking: Add blocked_on_state to provide necessary tri-state for proxy return-migration
  2025-09-26  3:29 ` [PATCH v22 2/6] sched/locking: Add blocked_on_state to provide necessary tri-state for proxy return-migration John Stultz
@ 2025-10-08 11:26   ` Peter Zijlstra
  2025-10-09  0:07     ` John Stultz
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2025-10-08 11:26 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Fri, Sep 26, 2025 at 03:29:10AM +0000, John Stultz wrote:
> As we add functionality to proxy execution, we may migrate a
> donor task to a runqueue where it can't run due to cpu affinity.
> Thus, we must be careful to ensure we return-migrate the task
> back to a cpu in its cpumask when it becomes unblocked.
> 
> Thus we need more then just a binary concept of the task being
> blocked on a mutex or not.
> 
> So add a blocked_on_state value to the task, that allows the
> task to move through BO_RUNNING -> BO_BLOCKED -> BO_WAKING
> and back to BO_RUNNING. This provides a guard state in
> BO_WAKING so we can know the task is no longer blocked
> but we don't want to run it until we have potentially
> done return migration, back to a usable cpu.
> 
> Signed-off-by: John Stultz <jstultz@google.com>
> ---
>  include/linux/sched.h     | 92 +++++++++++++++++++++++++++++++++------
>  init/init_task.c          |  3 ++
>  kernel/fork.c             |  3 ++
>  kernel/locking/mutex.c    | 15 ++++---
>  kernel/locking/ww_mutex.h | 20 ++++-----
>  kernel/sched/core.c       | 45 +++++++++++++++++--
>  kernel/sched/sched.h      |  6 ++-
>  7 files changed, 146 insertions(+), 38 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index cb4e81d9d9b67..8245940783c77 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -813,6 +813,12 @@ struct kmap_ctrl {
>  #endif
>  };
>  
> +enum blocked_on_state {
> +	BO_RUNNABLE,
> +	BO_BLOCKED,
> +	BO_WAKING,
> +};

I am still struggling with all this.

  RUNNABLE is !p->blocked_on
  BLOCKED is !!p->blocked_on
  WAKING is !!p->blocked_on but you need magical beans

I'm not sure I follow the argument above, and there is a distinct lack
of comments with this enum explaining the states (although there are
some comments scattered across the patch itself).

Last time we talked about this:

  https://lkml.kernel.org/r/20241216165419.GE35539@noisy.programming.kicks-ass.net

I was equally confused; and suggested not having the WAKING state by
simply dequeueing the offending task and letting ttwu() sort it out --
since we know a wakeup will be coming our way.

I'm thinking that suggesting didn't work out somehow, but I'm still not
sure I understand why.

There is this comment:


+               /*
+                * If a ww_mutex hits the die/wound case, it marks the task as
+                * BO_WAKING and calls try_to_wake_up(), so that the mutex
+                * cycle can be broken and we avoid a deadlock.
+                *
+                * However, if at that moment, we are here on the cpu which the
+                * die/wounded task is enqueued, we might loop on the cycle as
+                * BO_WAKING still causes task_is_blocked() to return true
+                * (since we want return migration to occur before we run the
+                * task).
+                *
+                * Unfortunately since we hold the rq lock, it will block
+                * try_to_wake_up from completing and doing the return
+                * migration.
+                *
+                * So when we hit a !BO_BLOCKED task briefly schedule idle
+                * so we release the rq and let the wakeup complete.
+                */
+               if (p->blocked_on_state != BO_BLOCKED)
+                       return proxy_resched_idle(rq);


Which I presume tries to clarify things, but that only had me scratching
my head again. Why would you need task_is_blocked() to affect return
migration?



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v22 3/6] sched: Add logic to zap balance callbacks if we pick again
  2025-09-26  3:29 ` [PATCH v22 3/6] sched: Add logic to zap balance callbacks if we pick again John Stultz
@ 2025-10-08 11:37   ` Peter Zijlstra
  0 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2025-10-08 11:37 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Fri, Sep 26, 2025 at 03:29:11AM +0000, John Stultz wrote:

> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +/*
> + * Only called from __schedule context
> + *
> + * There are some cases where we are going to re-do the action
> + * that added the balance callbacks. We may not be in a state
> + * where we can run them, so just zap them so they can be
> + * properly re-added on the next time around. This is similar
> + * handling to running the callbacks, except we just don't call
> + * them.
> + */
> +static void zap_balance_callbacks(struct rq *rq)
> +{
> +	struct balance_callback *next, *head;
> +	bool found = false;
> +
> +	lockdep_assert_rq_held(rq);
> +
> +	head = rq->balance_callback;
> +	while (head) {
> +		if (head == &balance_push_callback)
> +			found = true;
> +		next = head->next;
> +		head->next = NULL;
> +		head = next;
> +	}
> +	rq->balance_callback = found ? &balance_push_callback : NULL;
> +}
> +#else
> +static inline void zap_balance_callbacks(struct rq *rq) {}
> +#endif
> +
>  static void do_balance_callbacks(struct rq *rq, struct balance_callback *head)
>  {
>  	void (*func)(struct rq *rq);
> @@ -6942,10 +6974,15 @@ static void __sched notrace __schedule(int sched_mode)
>  	rq_set_donor(rq, next);
>  	if (unlikely(task_is_blocked(next))) {
>  		next = find_proxy_task(rq, next, &rf);
> -		if (!next)
> +		if (!next) {
> +			/* zap the balance_callbacks before picking again */
> +			zap_balance_callbacks(rq);
>  			goto pick_again;
> -		if (next == rq->idle)
> +		}
> +		if (next == rq->idle) {
> +			zap_balance_callbacks(rq);
>  			goto keep_resched;
> +		}
>  	}
>  picked:
>  	clear_tsk_need_resched(prev);

I would feel a wee bit better if you'd add something like:

  pick_again:
+	assert_balance_callbacks_empty();
	next = pick_next_task(...);

And have that verify the balance list is indeed empty (save for push).
Perhaps make that depend on PROVE_LOCKING or so; since someone went and
deleted SCHED_DEBUG *sigh*.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v22 4/6] sched: Handle blocked-waiter migration (and return migration)
  2025-09-26  3:29 ` [PATCH v22 4/6] sched: Handle blocked-waiter migration (and return migration) John Stultz
@ 2025-10-08 13:32   ` Peter Zijlstra
  2025-10-16  0:15     ` John Stultz
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2025-10-08 13:32 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Fri, Sep 26, 2025 at 03:29:12AM +0000, John Stultz wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 7bba05c07a79d..d063d2c9bd5aa 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3157,6 +3157,14 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
>  
>  	__do_set_cpus_allowed(p, ctx);
>  
> +	/*
> +	 * It might be that the p->wake_cpu is no longer
> +	 * allowed, so set it to the dest_cpu so return
> +	 * migration doesn't send it to an invalid cpu
> +	 */

This comment isn't quite right; ->wake_cpu is a mere preference, it does
not have correctness concerns. That is, it is okay for ->wake_cpu to not
be inside cpus_allowed.

> +	if (!is_cpu_allowed(p, p->wake_cpu))
> +		p->wake_cpu = dest_cpu;
> +
>  	return affine_move_task(rq, p, rf, dest_cpu, ctx->flags);
>  
>  out:
> @@ -3717,6 +3725,72 @@ static inline void ttwu_do_wakeup(struct task_struct *p)
>  	trace_sched_wakeup(p);
>  }
>  
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
> +{
> +	unsigned int wake_cpu;
> +
> +	/*
> +	 * Since we are enqueuing a blocked task on a cpu it may
> +	 * not be able to run on, preserve wake_cpu when we
> +	 * __set_task_cpu so we can return the task to where it
> +	 * was previously runnable.
> +	 */
> +	wake_cpu = p->wake_cpu;
> +	__set_task_cpu(p, cpu);
> +	p->wake_cpu = wake_cpu;

Humm, perhaps add an argument to __set_task_cpu() to not set ->wake_cpu
instead?

> +}
> +
> +static bool proxy_task_runnable_but_waking(struct task_struct *p)
> +{
> +	if (!sched_proxy_exec())
> +		return false;
> +	return (READ_ONCE(p->__state) == TASK_RUNNING &&
> +		READ_ONCE(p->blocked_on_state) == BO_WAKING);
> +}
> +
> +/*
> + * Checks to see if task p has been proxy-migrated to another rq
> + * and needs to be returned. If so, we deactivate the task here
> + * so that it can be properly woken up on the p->wake_cpu
> + * (or whichever cpu select_task_rq() picks at the bottom of
> + * try_to_wake_up()
> + */
> +static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
> +{
> +	bool ret = false;
> +
> +	if (!sched_proxy_exec())
> +		return false;
> +
> +	raw_spin_lock(&p->blocked_lock);
> +	if (__get_task_blocked_on(p) && p->blocked_on_state == BO_WAKING) {
> +		if (!task_current(rq, p) && (p->wake_cpu != cpu_of(rq))) {
> +			if (task_current_donor(rq, p)) {
> +				put_prev_task(rq, p);
> +				rq_set_donor(rq, rq->idle);
> +			}
> +			deactivate_task(rq, p, DEQUEUE_NOCLOCK);
> +			ret = true;
> +		}
> +		__set_blocked_on_runnable(p);
> +		resched_curr(rq);
> +	}
> +	raw_spin_unlock(&p->blocked_lock);
> +	return ret;
> +}
> +#else /* !CONFIG_SCHED_PROXY_EXEC */
> +static bool proxy_task_runnable_but_waking(struct task_struct *p)
> +{
> +	return false;
> +}
> +
> +static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
> +{
> +	return false;
> +}
> +#endif /* CONFIG_SCHED_PROXY_EXEC */
> +
>  static void
>  ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
>  		 struct rq_flags *rf)
> @@ -3802,6 +3876,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
>  		update_rq_clock(rq);
>  		if (p->se.sched_delayed)
>  			enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> +		if (proxy_needs_return(rq, p))
> +			goto out;
>  		if (!task_on_cpu(rq, p)) {
>  			/*
>  			 * When on_rq && !on_cpu the task is preempted, see if
> @@ -3812,6 +3888,7 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
>  		ttwu_do_wakeup(p);
>  		ret = 1;
>  	}
> +out:
>  	__task_rq_unlock(rq, &rf);
>  
>  	return ret;
> @@ -4199,6 +4276,8 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  		 *    it disabling IRQs (this allows not taking ->pi_lock).
>  		 */
>  		WARN_ON_ONCE(p->se.sched_delayed);
> +		/* If current is waking up, we know we can run here, so set BO_RUNNBLE */
> +		set_blocked_on_runnable(p);
>  		if (!ttwu_state_match(p, state, &success))
>  			goto out;
>  
> @@ -4215,8 +4294,15 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  	 */
>  	scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
>  		smp_mb__after_spinlock();
> -		if (!ttwu_state_match(p, state, &success))
> -			break;
> +		if (!ttwu_state_match(p, state, &success)) {
> +			/*
> +			 * If we're already TASK_RUNNING, and BO_WAKING
> +			 * continue on to ttwu_runnable check to force
> +			 * proxy_needs_return evaluation
> +			 */
> +			if (!proxy_task_runnable_but_waking(p))
> +				break;
> +		}
>  
>  		trace_sched_waking(p);

Oh gawd :-( why !?!? 

So AFAICT this makes ttwu() do dequeue when it finds WAKING, and then it
falls through to do the normal wakeup. So why can't we do dequeue to
begin with -- instead of setting WAKING in the first place?



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v22 2/6] sched/locking: Add blocked_on_state to provide necessary tri-state for proxy return-migration
  2025-10-08 11:26   ` Peter Zijlstra
@ 2025-10-09  0:07     ` John Stultz
  2025-10-09 11:43       ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: John Stultz @ 2025-10-09  0:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Wed, Oct 8, 2025 at 4:26 AM Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Sep 26, 2025 at 03:29:10AM +0000, John Stultz wrote:
> > As we add functionality to proxy execution, we may migrate a
> > donor task to a runqueue where it can't run due to cpu affinity.
> > Thus, we must be careful to ensure we return-migrate the task
> > back to a cpu in its cpumask when it becomes unblocked.
> >
> > Thus we need more then just a binary concept of the task being
> > blocked on a mutex or not.
> >
> > So add a blocked_on_state value to the task, that allows the
> > task to move through BO_RUNNING -> BO_BLOCKED -> BO_WAKING
> > and back to BO_RUNNING. This provides a guard state in
> > BO_WAKING so we can know the task is no longer blocked
> > but we don't want to run it until we have potentially
> > done return migration, back to a usable cpu.
> >
> > Signed-off-by: John Stultz <jstultz@google.com>
> > ---
> >  include/linux/sched.h     | 92 +++++++++++++++++++++++++++++++++------
> >  init/init_task.c          |  3 ++
> >  kernel/fork.c             |  3 ++
> >  kernel/locking/mutex.c    | 15 ++++---
> >  kernel/locking/ww_mutex.h | 20 ++++-----
> >  kernel/sched/core.c       | 45 +++++++++++++++++--
> >  kernel/sched/sched.h      |  6 ++-
> >  7 files changed, 146 insertions(+), 38 deletions(-)
> >
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index cb4e81d9d9b67..8245940783c77 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -813,6 +813,12 @@ struct kmap_ctrl {
> >  #endif
> >  };
> >
> > +enum blocked_on_state {
> > +     BO_RUNNABLE,
> > +     BO_BLOCKED,
> > +     BO_WAKING,
> > +};
>
> I am still struggling with all this.

My apologies. I really appreciate you taking the time to look it over!

>   RUNNABLE is !p->blocked_on
>   BLOCKED is !!p->blocked_on
>   WAKING is !!p->blocked_on but you need magical beans
>
> I'm not sure I follow the argument above, and there is a distinct lack
> of comments with this enum explaining the states (although there are
> some comments scattered across the patch itself).

That's fair. I'll try to improve the comments there.

So the blocked_on_state values don't quite match to blocked_on as you
listed above, but it did evolve out of the fact that just having the
blocked_on pointer didn't give us enough state and had lots of subtle
bugs, so having more state helped stabilize this. I do agree it has
some duplicative aspects with the task->__state, so I'd love to
flatten it down, but so far I've not found a good way.

So p->blocked_on can be separated off, as is totally managed by the
__mutex_lock_common() path. It's set to the mutex we're trying to
take, and cleared when we get it.

Where as the p->blocked_on_state tells us:
BO_RUNNABLE: If the task was picked from the runqueue it can be run on that cpu.
BO_BLOCKED: The task can be picked, but cannot be executed, it can
only be a donor task. It may migrate to the runqueue of cpus that it
is not allowed to run on.
BO_WAKING: An intermediate "gate" state. This task was BO_BLOCKED, and
we'd like it to be BO_RUNNABLE, but we have to address that it might
be on a runqueue it can't run on. So this prevents tasks from being
run until they have been evaluated for return migration.  Ideally ttwu
will handle the return migration, but there are cases where we will do
it manually in find_proxy_task() if we come across a task in the chain
with this state.

So, just to clarify your summary, the a task can be
p->blocked_on_state=BO_RUNNABLE and p->blocked_on can be set, since we
need to run the task in order for it to complete __mutex_lock_common()
to clear its own blocked_on pointer.  BO_BLOCKED does imply
!!p->blocked_on and BO_WAKING implies !!p->blocked_on but also that we
need to evaluate return migration before we run it.

> Last time we talked about this:
>
>   https://lkml.kernel.org/r/20241216165419.GE35539@noisy.programming.kicks-ass.net
>
> I was equally confused; and suggested not having the WAKING state by
> simply dequeueing the offending task and letting ttwu() sort it out --
> since we know a wakeup will be coming our way.

So yeah, and I really appreciated that suggestion. I used that dequeue
and wake approach in the "manual" return-migration path
(proxy_force_return()), and it did simplify things, but I haven't been
able to apply it everywhere.

> I'm thinking that suggesting didn't work out somehow, but I'm still not
> sure I understand why.

So the main issue is about where we end up setting the task to
BO_WAKING (via set_blocked_on_waking()). This is done in
__mutex_unlock_slowpath(), __ww_mutex_die(), and __ww_mutex_wound().
And in those cases, we are already holding the mutex->wait_lock, and
sometimes a task's blocked_lock, without the rq lock.  So we can't
just grab the rq lock out of order, and we probably shouldn't drop and
try to reacquire the blocked_lock and wait_lock there.

Though, one approach that I just thought of would be to have a special
wake_up_q call, which would handle dequeuing the blocked_on tasks on
the wake_q before doing the wakeup?  I can give that a try.

Though I'm not sure if that will still enable us to drop the
blocked_on_state tri-state. Since I worry we may be able to get
spurious wakeups on blocked_on tasks outside the mutex_unlock_slowpath
or ww_mutex_die/wound paths. Then we risk running a proxy-migrated
task on a cpu outside its affinity set. As without proxy-migration,
spurious wakeups are ok as the task will just loop back into schedule,
but with proxy migration, we have to be sure we return migrate first.

> There is this comment:
>
>
> +               /*
> +                * If a ww_mutex hits the die/wound case, it marks the task as
> +                * BO_WAKING and calls try_to_wake_up(), so that the mutex
> +                * cycle can be broken and we avoid a deadlock.
> +                *
> +                * However, if at that moment, we are here on the cpu which the
> +                * die/wounded task is enqueued, we might loop on the cycle as
> +                * BO_WAKING still causes task_is_blocked() to return true
> +                * (since we want return migration to occur before we run the
> +                * task).
> +                *
> +                * Unfortunately since we hold the rq lock, it will block
> +                * try_to_wake_up from completing and doing the return
> +                * migration.
> +                *
> +                * So when we hit a !BO_BLOCKED task briefly schedule idle
> +                * so we release the rq and let the wakeup complete.
> +                */
> +               if (p->blocked_on_state != BO_BLOCKED)
> +                       return proxy_resched_idle(rq);
>
>
> Which I presume tries to clarify things, but that only had me scratching
> my head again. Why would you need task_is_blocked() to affect return
> migration?

So task_is_blocked() returns true when p->blocked_on is set and
p->blocked_on_state != BO_RUNNABLE.  So BO_WAKING tasks are still
prevented from being selected to run, until they have had a chance to
be return-migrated (because as a donor they may be on a runqueue where
they can't actually run on).

The problem this comment tries to describe is that due to ww_mutexes,
there may be a loop in the blocked_on chain. So the cpu running
find_proxy_task() might spin following this loop. The ww_mutex logic
will fix the loop via ww_mutex_die/wound, which sets BO_WAKING, and
wakes the task up to release the lock.   However, the try_to_wake_up()
can get stuck waiting for the rqlock that the cpu looping in
find_proxy_task() is holding. So for the case where it's not
BO_BLOCKED, we resched_idle for a moment to drop the lock and let
try_to_wake_up() complete.

Though I worry I've just repeated the comment here, so let me know if
this wasn't helpful in clarifying things.

thanks
-john

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v22 2/6] sched/locking: Add blocked_on_state to provide necessary tri-state for proxy return-migration
  2025-10-09  0:07     ` John Stultz
@ 2025-10-09 11:43       ` Peter Zijlstra
  2025-10-09 11:45         ` Peter Zijlstra
  2025-10-14  2:43         ` John Stultz
  0 siblings, 2 replies; 17+ messages in thread
From: Peter Zijlstra @ 2025-10-09 11:43 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Wed, Oct 08, 2025 at 05:07:26PM -0700, John Stultz wrote:

> > I'm thinking that suggesting didn't work out somehow, but I'm still not
> > sure I understand why.
> 
> So the main issue is about where we end up setting the task to
> BO_WAKING (via set_blocked_on_waking()). This is done in
> __mutex_unlock_slowpath(), __ww_mutex_die(), and __ww_mutex_wound().
> And in those cases, we are already holding the mutex->wait_lock, and
> sometimes a task's blocked_lock, without the rq lock.  So we can't
> just grab the rq lock out of order, and we probably shouldn't drop and
> try to reacquire the blocked_lock and wait_lock there.

Oh bugger. In my head the scheduler locks still nest inside wait_lock,
but we've flipped that such that schedule() / find_proxy_task() can take
it inside rq->lock.

Yes that does complicate things.

So suppose we have this ww_mutex cycle thing:

		  ,-+-*	Mutex-1 <-.
	Task-A ---' |		  | ,--	Task-B
		    `->	Mutex-2 *-+-'

Where Task-A holds Mutex-1 and tries to acquire Mutex-2, and
where Task-B holds Mutex-2 and tries to acquire Mutex-1.

Then the blocked_on->owner chain will go in circles.

        Task-A  -> Mutex-2
          ^          |
          |          v
        Mutex-1 <- Task-B

We need two things:

 - find_proxy_task() to stop iterating the circle;

 - the woken task to 'unblock' and run, such that it can back-off and
   re-try the transaction.

Now, the current code does:

	__clear_task_blocked_on();
	wake_q_add();

And surely clearing ->blocked_on is sufficient to break the cycle.

Suppose it is Task-B that is made to back-off, then we have:

  Task-A -> Mutex-2 -> Task-B (no further blocked_on)

and it would attempt to run Task-B. Or worse, it could directly pick
Task-B and run it, without ever getting into find_proxy_task().

Now, here is a problem because Task-B might not be runnable on the CPU
it is currently on; and because !task_is_blocked() we don't get into the
proxy paths, so nobody is going to fix this up.

Ideally we would have dequeued Task-B alongside of clearing
->blocked_on, but alas, lock inversion spoils things.

> Though, one approach that I just thought of would be to have a special
> wake_up_q call, which would handle dequeuing the blocked_on tasks on
> the wake_q before doing the wakeup?  I can give that a try.

I think this is racy worse than you considered. CPU1 could be inside
schedule() trying to pick Task-B while CPU2 does that wound/die thing.
No spurious wakeup required.

Anyway, since the actual value of ->blocked_on doesn't matter in this
case (we really want it to be NULL, but can't because we need someone to
go back migrate the thing), why not simply use something like:

#define PROXY_STOP ((struct mutex *)(-1L))

	__set_task_blocked_on(task, PROXY_STOP);

Then, have find_proxy_task() fix it up?

Random thoughts:

 - we should probably have something like:

	next = pick_next_task();
	rq_set_donor(next)
	if (unlikely(task_is_blocked()) {
		...
	}
+	WARN_ON_ONCE(next->__state);

   at all times the task we end up picking should be in RUNNABLE state.

 - similarly, we should have ttwu() check ->blocked_on is NULL ||
   PROXY_STOP, waking a task that still has a blocked_on relation can't
   be right -- ooh, dang race conditions :/ perhaps DEBUG_MUTEX and
   serialize on wait_lock.

 - I'm confliced on having TTWU fix up PROXY_STOP; strictly not required
   I think, but might improve performance -- if so, include numbers in
   patch that adds it -- which should be a separate patch from the one
   that adds PROXY_STOP.

 - since find_proxy_task() can do a lock-break, it should probably
   re-try the pick if, at the end, a higher runqueue is modified than
   the task we ended up with.

   Also see this thread:

      https://lkml.kernel.org/r/20251006105453.522934521@infradead.org

   eg. something like:

	rq->queue_mask = 0;
	// code with rq-lock-break
   	if (rq_modified_above(rq, next->sched_class))
		return NULL;

I'm still confused on BO_RUNNABLE -- you set that around
optimistic-spin, probably because you want to retain the ->blocked_on
relation, but also you have to run that thing to make progress. There
are a few other sites that use it, but those are more confusing still.

Please try and clarify this.

Anyway, if that is indeed it, you could do this by (ab)using the LSB of
the ->blocked_on pointer I suppose (you could make PROXY_STOP -2).

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v22 2/6] sched/locking: Add blocked_on_state to provide necessary tri-state for proxy return-migration
  2025-10-09 11:43       ` Peter Zijlstra
@ 2025-10-09 11:45         ` Peter Zijlstra
  2025-10-14  2:43         ` John Stultz
  1 sibling, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2025-10-09 11:45 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Thu, Oct 09, 2025 at 01:43:02PM +0200, Peter Zijlstra wrote:
>  - we should probably have something like:
> 
> 	next = pick_next_task();
> 	rq_set_donor(next)
> 	if (unlikely(task_is_blocked()) {
> 		...
> 	}
> +	WARN_ON_ONCE(next->__state);
> 
>    at all times the task we end up picking should be in RUNNABLE state.

Pfff.. PREEMPT won't like that. Ignore this.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v22 2/6] sched/locking: Add blocked_on_state to provide necessary tri-state for proxy return-migration
  2025-10-09 11:43       ` Peter Zijlstra
  2025-10-09 11:45         ` Peter Zijlstra
@ 2025-10-14  2:43         ` John Stultz
  2025-10-16 22:23           ` John Stultz
  1 sibling, 1 reply; 17+ messages in thread
From: John Stultz @ 2025-10-14  2:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Thu, Oct 9, 2025 at 4:43 AM Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Oct 08, 2025 at 05:07:26PM -0700, John Stultz wrote:
>
> > > I'm thinking that suggesting didn't work out somehow, but I'm still not
> > > sure I understand why.
> >
> > So the main issue is about where we end up setting the task to
> > BO_WAKING (via set_blocked_on_waking()). This is done in
> > __mutex_unlock_slowpath(), __ww_mutex_die(), and __ww_mutex_wound().
> > And in those cases, we are already holding the mutex->wait_lock, and
> > sometimes a task's blocked_lock, without the rq lock.  So we can't
> > just grab the rq lock out of order, and we probably shouldn't drop and
> > try to reacquire the blocked_lock and wait_lock there.
>
> Oh bugger. In my head the scheduler locks still nest inside wait_lock,
> but we've flipped that such that schedule() / find_proxy_task() can take
> it inside rq->lock.
>
> Yes that does complicate things.
>
> So suppose we have this ww_mutex cycle thing:
>
>                   ,-+-* Mutex-1 <-.
>         Task-A ---' |             | ,-- Task-B
>                     `-> Mutex-2 *-+-'
>
> Where Task-A holds Mutex-1 and tries to acquire Mutex-2, and
> where Task-B holds Mutex-2 and tries to acquire Mutex-1.
>
> Then the blocked_on->owner chain will go in circles.
>
>         Task-A  -> Mutex-2
>           ^          |
>           |          v
>         Mutex-1 <- Task-B
>
> We need two things:
>
>  - find_proxy_task() to stop iterating the circle;
>
>  - the woken task to 'unblock' and run, such that it can back-off and
>    re-try the transaction.
>
>
> Now, the current code does:
>
>         __clear_task_blocked_on();
>         wake_q_add();
>
> And surely clearing ->blocked_on is sufficient to break the cycle.
>
> Suppose it is Task-B that is made to back-off, then we have:
>
>   Task-A -> Mutex-2 -> Task-B (no further blocked_on)
>
> and it would attempt to run Task-B. Or worse, it could directly pick
> Task-B and run it, without ever getting into find_proxy_task().
>
> Now, here is a problem because Task-B might not be runnable on the CPU
> it is currently on; and because !task_is_blocked() we don't get into the
> proxy paths, so nobody is going to fix this up.
>
> Ideally we would have dequeued Task-B alongside of clearing
> ->blocked_on, but alas, lock inversion spoils things.

Right. Thus my adding of the blocked_on_state to try to gate the task
from running until we evaluate it for return migration.

> > Though, one approach that I just thought of would be to have a special
> > wake_up_q call, which would handle dequeuing the blocked_on tasks on
> > the wake_q before doing the wakeup?  I can give that a try.
>
> I think this is racy worse than you considered. CPU1 could be inside
> schedule() trying to pick Task-B while CPU2 does that wound/die thing.
> No spurious wakeup required.

Yeah. I took a bit of a try at it, but couldn't manage to rework
things without preserving the BO_WAKING guard.

Then trying to do the dequeue in the wake_up_q() really isn't that far
away from just doing it in ttwu() a little deeper in the call stack,
as we still have to take  task_rq_lock() to call block_task().

> Anyway, since the actual value of ->blocked_on doesn't matter in this
> case (we really want it to be NULL, but can't because we need someone to
> go back migrate the thing), why not simply use something like:
>
> #define PROXY_STOP ((struct mutex *)(-1L))
>
>         __set_task_blocked_on(task, PROXY_STOP);
>
> Then, have find_proxy_task() fix it up?

Ok, so this sounds like it sort of matches the BO_WAKING state I
currently have (replacing the BO_WAKING state with PROXY_STOP). Not
much of a logic change, but would indeed save a bit of space.
I'll take a stab at it.

> Random thoughts:
>
>  - we should probably have something like:
>
>         next = pick_next_task();
>         rq_set_donor(next)
>         if (unlikely(task_is_blocked()) {
>                 ...
>         }
> +       WARN_ON_ONCE(next->__state);
>
>    at all times the task we end up picking should be in RUNNABLE state.
>
>  - similarly, we should have ttwu() check ->blocked_on is NULL ||
>    PROXY_STOP, waking a task that still has a blocked_on relation can't
>    be right -- ooh, dang race conditions :/ perhaps DEBUG_MUTEX and
>    serialize on wait_lock.
>
>  - I'm confliced on having TTWU fix up PROXY_STOP; strictly not required
>    I think, but might improve performance -- if so, include numbers in
>    patch that adds it -- which should be a separate patch from the one
>    that adds PROXY_STOP.

Ok, I'll work to split that logic out. The nice thing in ttwu is we
already end up taking the rq lock in ttwu_runnable() when we do the
dequeue so yeah I expect it would help performance.

>  - since find_proxy_task() can do a lock-break, it should probably
>    re-try the pick if, at the end, a higher runqueue is modified than
>    the task we ended up with.

So, I think find_proxy_task() will always pick-again if it releases
the rqlock.  So I'm not sure I'm quite following this bit. Could you
clarify?

>    Also see this thread:
>
>       https://lkml.kernel.org/r/20251006105453.522934521@infradead.org
>
>    eg. something like:
>
>         rq->queue_mask = 0;
>         // code with rq-lock-break
>         if (rq_modified_above(rq, next->sched_class))
>                 return NULL;
>
>
> I'm still confused on BO_RUNNABLE -- you set that around
> optimistic-spin, probably because you want to retain the ->blocked_on
> relation, but also you have to run that thing to make progress. There
> are a few other sites that use it, but those are more confusing still.

Mostly I liked managing the blocked_on_state separately from the
blocked_on pointer as I found it simplified my thinking in the mutex
lock side, for cases where we wake up and then loop again. But let me
take a pass at reworking it a bit like you suggest to see how it goes.

> Please try and clarify this.

Will try to add more comments to explain.

> Anyway, if that is indeed it, you could do this by (ab)using the LSB of
> the ->blocked_on pointer I suppose (you could make PROXY_STOP -2).

One complication for using the LSB of the pointer, Suleiman was
thinking about stealing those for extending the blocked_on pointer for
use with other lock types (rw_sem in his case).
Currently he's got it in a structure with an enum:
  https://github.com/johnstultz-work/linux-dev/commit/e61b487d240782302199f6dc1d99851c3449b547

But we talked a little about potentially squishing that together, but
it sort of depends on how many primitive types we end up using
proxy-exec with.

As always, thanks again for all the feedback, I really appreciate it!
-john

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v22 4/6] sched: Handle blocked-waiter migration (and return migration)
  2025-10-08 13:32   ` Peter Zijlstra
@ 2025-10-16  0:15     ` John Stultz
  0 siblings, 0 replies; 17+ messages in thread
From: John Stultz @ 2025-10-16  0:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Wed, Oct 8, 2025 at 6:32 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Sep 26, 2025 at 03:29:12AM +0000, John Stultz wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 7bba05c07a79d..d063d2c9bd5aa 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3157,6 +3157,14 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
> >
> >       __do_set_cpus_allowed(p, ctx);
> >
> > +     /*
> > +      * It might be that the p->wake_cpu is no longer
> > +      * allowed, so set it to the dest_cpu so return
> > +      * migration doesn't send it to an invalid cpu
> > +      */
>
> This comment isn't quite right; ->wake_cpu is a mere preference, it does
> not have correctness concerns. That is, it is okay for ->wake_cpu to not
> be inside cpus_allowed.

Oh! This is actually left over from earlier in the revisions where
return migration would migrate specifically to the wake_cpu, I thought
it was still important, but looking again, since the return migration
now does a block_task/wake_up_process(), I think we can drop this
whole chunk.  Thanks for catching that!

> > +#ifdef CONFIG_SCHED_PROXY_EXEC
> > +static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
> > +{
> > +     unsigned int wake_cpu;
> > +
> > +     /*
> > +      * Since we are enqueuing a blocked task on a cpu it may
> > +      * not be able to run on, preserve wake_cpu when we
> > +      * __set_task_cpu so we can return the task to where it
> > +      * was previously runnable.
> > +      */
> > +     wake_cpu = p->wake_cpu;
> > +     __set_task_cpu(p, cpu);
> > +     p->wake_cpu = wake_cpu;
>
> Humm, perhaps add an argument to __set_task_cpu() to not set ->wake_cpu
> instead?

Hrm. I can rework it that way, but I do always feel that bool
arguments to functions makes the code less readable (since it's not
always clear from the usage what true/false means).  Making it an enum
 or defined flags argument help a bit with that, but it still seems it
would have much more impact to the source than this small helper
that's only called from the proxy related logic in two places at the
end of the day.

> > @@ -4215,8 +4294,15 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> >        */
> >       scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
> >               smp_mb__after_spinlock();
> > -             if (!ttwu_state_match(p, state, &success))
> > -                     break;
> > +             if (!ttwu_state_match(p, state, &success)) {
> > +                     /*
> > +                      * If we're already TASK_RUNNING, and BO_WAKING
> > +                      * continue on to ttwu_runnable check to force
> > +                      * proxy_needs_return evaluation
> > +                      */
> > +                     if (!proxy_task_runnable_but_waking(p))
> > +                             break;
> > +             }
> >
> >               trace_sched_waking(p);
>
> Oh gawd :-( why !?!?
>
> So AFAICT this makes ttwu() do dequeue when it finds WAKING, and then it
> falls through to do the normal wakeup. So why can't we do dequeue to
> begin with -- instead of setting WAKING in the first place?

So hopefully the earlier discussion cleared that one up, but let me
know if you still object to this or have questions.

thanks
-john

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v22 2/6] sched/locking: Add blocked_on_state to provide necessary tri-state for proxy return-migration
  2025-10-14  2:43         ` John Stultz
@ 2025-10-16 22:23           ` John Stultz
  0 siblings, 0 replies; 17+ messages in thread
From: John Stultz @ 2025-10-16 22:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Mon, Oct 13, 2025 at 7:43 PM John Stultz <jstultz@google.com> wrote:
> On Thu, Oct 9, 2025 at 4:43 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >  - I'm confliced on having TTWU fix up PROXY_STOP; strictly not required
> >    I think, but might improve performance -- if so, include numbers in
> >    patch that adds it -- which should be a separate patch from the one
> >    that adds PROXY_STOP.
>
> Ok, I'll work to split that logic out. The nice thing in ttwu is we
> already end up taking the rq lock in ttwu_runnable() when we do the
> dequeue so yeah I expect it would help performance.

So, I thought this wouldn't be hard, but it ends up there's some
subtlety to trying to separate the ttwu changes.

First, I am using PROXY_WAKING instead of PROXY_STOP since it seemed
more clear and aligned to my previous mental model with BO_WAKING.

One of the issues is when we go through the:
  mutex_unlock_slowpath()/ww_mutex_die()/ww_mutex_wound()
  ->  tsk->blocked_on = PROXY_WAKING
      wake_q_add(tsk)
      ...
      wake_up_q()
      ->  wake_up_process()

The wake_up_process() call through try_to_wake_up() will hit the
ttwu_runnable() case and set the task state RUNNING.

Then on the cpu where that task is enqueued:
  __schedule()
  -> find_proxy_task()
     -> if (p->blocked_on == PROXY_WAKING)
           proxy_force_return(rq, p);

In v22, proxy_force_return() logic would block_task(p),
clear_task_blocked_on(p) and then call wake_up_process(p).
https://github.com/johnstultz-work/linux-dev/blob/proxy-exec-v22-6.17-rc6/kernel/sched/core.c#L7117

However, since the task state has already been set to TASK_RUNNING,
the second wakeup ends up short-circuiting at ttwu_state_match(), and
the now blocked task would end up left dequeued forever.

So, I've reworked the proxy_force_return() to be sort of an open coded
try_to_wakeup() to call select_task_rq() to pick the return cpu and
then basically deactivate/activate the task to migrate it over.  It
was nice to reuse block_task() and wake_up_process() previously, but
that wake/block/wake behavior tripping into the dequeued forever issue
worries me that it could be tripped in rare cases previously with my
series (despite having check after ttwu_state_mach() for this case).
So either I'll keep this approach or maybe we should add some extra
checking in ttwu_state_mach() for on_rq before bailing?  Let me know
if you have thoughts there.

Hopefully will have the patches cleaned up and out again soon.

thanks
-john

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2025-10-16 22:23 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-26  3:29 [PATCH v22 0/6] Donor Migration for Proxy Execution (v22) John Stultz
2025-09-26  3:29 ` [PATCH v22 1/6] locking: Add task::blocked_lock to serialize blocked_on state John Stultz
2025-10-08 10:27   ` Peter Zijlstra
2025-09-26  3:29 ` [PATCH v22 2/6] sched/locking: Add blocked_on_state to provide necessary tri-state for proxy return-migration John Stultz
2025-10-08 11:26   ` Peter Zijlstra
2025-10-09  0:07     ` John Stultz
2025-10-09 11:43       ` Peter Zijlstra
2025-10-09 11:45         ` Peter Zijlstra
2025-10-14  2:43         ` John Stultz
2025-10-16 22:23           ` John Stultz
2025-09-26  3:29 ` [PATCH v22 3/6] sched: Add logic to zap balance callbacks if we pick again John Stultz
2025-10-08 11:37   ` Peter Zijlstra
2025-09-26  3:29 ` [PATCH v22 4/6] sched: Handle blocked-waiter migration (and return migration) John Stultz
2025-10-08 13:32   ` Peter Zijlstra
2025-10-16  0:15     ` John Stultz
2025-09-26  3:29 ` [PATCH v22 5/6] sched: Add blocked_donor link to task for smarter mutex handoffs John Stultz
2025-09-26  3:29 ` [PATCH v22 6/6] sched: Migrate whole chain in proxy_migrate_task() John Stultz

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.