[RESEND][PATCH v21 0/6] Donor Migration for Proxy Execution (v21)

All of lore.kernel.org
 help / color / mirror / Atom feed

* [RESEND][PATCH v21 0/6] Donor Migration for Proxy Execution (v21)
@ 2025-09-04  0:21 John Stultz
  2025-09-04  0:21 ` [RESEND][PATCH v21 1/6] locking: Add task::blocked_lock to serialize blocked_on state John Stultz
                   ` (7 more replies)
  0 siblings, 8 replies; 26+ messages in thread
From: John Stultz @ 2025-09-04  0:21 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

Hey All,

I didn't get any feedback on the last iteration, so I wanted to
resend this next chunk of the series: Donor Migration 

The main change from v20 before it, is that I previously had
logic where the ww_mutex paths took the blocked_lock of the task
it was waking (either the lock waiter->task or owner), but in a
context from __mutex_lock_common() where we already held the
current->block_lock. This required using the spin_lock_nested()
annotation to keep lockdep happy, and I was leaning on the logic
that there is an implied order between running current and the
existing not-running lock waiters, which should avoid loops. In
the wound case, there is also an order used if the owners
context is younger, which sounded likely to avoid loops.

However, after thinking more about the wound case where we are
wounding a lock owner, since that owner is not waiting and could
be trying to acquire a mutex current owns, I couldn’t quite
convince myself we couldn’t get into a ABBA style deadlock with
the nested blocked_lock accesses (though, I’ve not been able to
contrive it to happen, but that doesn’t prove anything).

So the main difference in v21 is reworking of how we hold the
blocked_lock in the mutex_lock_common() code, reducing it so we
don’t call into ww_mutex paths while holding it. The
lock->waiter_lock still serializes things at top level, but the
blocked_lock isn’t held completely in parallel under that, and
is focused on its purpose of protecting the blocked_on,
blocked_on_state and similar proxy-related values in the task
struct.

I also did some cleanups to be more consistent in how the
blocked_on_state is handled. I had a few spots previously where
I was cheating and just set the value instead of going through
the helpers. And sure enough, in fixing those I realized there
were a few spots where I wasn’t always holding the right
blocked_lock, so some minor rework helped clean that up.

I’m trying to submit this larger work in smallish digestible
pieces, so in this portion of the series, I’m only submitting
for review and consideration the logic that allows us to do
donor(blocked waiter) migration, allowing us to proxy-execute
lock owners that might be on other cpu runqueues. This requires
some additional changes to locking and extra state tracking to
ensure we don’t accidentally run a migrated donor on a cpu it
isn’t affined to, as well as some extra handling to deal with
balance callback state that needs to be reset when we decide to
pick a different task after doing donor migration.

I’d love to get some feedback on any place where these patches
are confusing, or could use additional clarifications.

Also you can find the full proxy-exec series here:
  https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v21-6.17-rc4/
  https://github.com/johnstultz-work/linux-dev.git proxy-exec-v21-6.17-rc4

Issues still to address with the full series:
* Need to sort out what is needed for sched_ext to be ok with
  proxy-execution enabled. This is my next priority.

* K Prateek Nayak did some testing about a bit over a year ago
  with an earlier version of the full series and saw ~3-5%
  regressions in some cases. Need to re-evaluate this with the
  proxy-migration avoidance optimization Suleiman suggested
  having now been implemented.

* The chain migration functionality needs further iterations and
  better validation to ensure it truly maintains the RT/DL load
  balancing invariants (despite this being broken in vanilla
  upstream with RT_PUSH_IPI currently)

Future work:
* Expand to other locking primitives: Suleiman is looking at
  rw_semaphores, as that is another common source of priority
  inversion. Figuring out pi-futexes would be good too.
* Eventually: Work to replace rt_mutexes and get things happy
  with PREEMPT_RT

I’d really appreciate any feedback or review thoughts on the
full series as well. I’m trying to keep the chunks small,
reviewable and iteratively testable, but if you have any
suggestions on how to improve the series, I’m all ears.

Credit/Disclaimer:
—--------------------
As always, this Proxy Execution series has a long history with
lots of developers that deserve credit: 

First described in a paper[1] by Watkins, Straub, Niehaus, then
from patches from Peter Zijlstra, extended with lots of work by
Juri Lelli, Valentin Schneider, and Connor O'Brien. (and thank
you to Steven Rostedt for providing additional details here!)

So again, many thanks to those above, as all the credit for this
series really is due to them - while the mistakes are likely mine.

Thanks so much!
-john

[1] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>   
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com

John Stultz (5):
  locking: Add task::blocked_lock to serialize blocked_on state
  sched/locking: Add blocked_on_state to provide necessary tri-state for
    proxy return-migration
  sched: Add logic to zap balance callbacks if we pick again
  sched: Handle blocked-waiter migration (and return migration)
  sched: Migrate whole chain in proxy_migrate_task()

Peter Zijlstra (1):
  sched: Add blocked_donor link to task for smarter mutex handoffs

 include/linux/sched.h        | 120 ++++++++-----
 init/init_task.c             |   4 +
 kernel/fork.c                |   4 +
 kernel/locking/mutex-debug.c |   4 +-
 kernel/locking/mutex.c       |  83 +++++++--
 kernel/locking/ww_mutex.h    |  20 +--
 kernel/sched/core.c          | 329 +++++++++++++++++++++++++++++++++--
 kernel/sched/fair.c          |   3 +-
 kernel/sched/sched.h         |   2 +-
 9 files changed, 473 insertions(+), 96 deletions(-)

-- 
2.51.0.338.gd7d06c2dae-goog

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RESEND][PATCH v21 1/6] locking: Add task::blocked_lock to serialize blocked_on state
  2025-09-04  0:21 [RESEND][PATCH v21 0/6] Donor Migration for Proxy Execution (v21) John Stultz
@ 2025-09-04  0:21 ` John Stultz
  2025-09-12  5:50   ` K Prateek Nayak
  2025-09-04  0:21 ` [RESEND][PATCH v21 2/6] sched/locking: Add blocked_on_state to provide necessary tri-state for proxy return-migration John Stultz
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: John Stultz @ 2025-09-04  0:21 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

So far, we have been able to utilize the mutex::wait_lock
for serializing the blocked_on state, but when we move to
proxying across runqueues, we will need to add more state
and a way to serialize changes to this state in contexts
where we don't hold the mutex::wait_lock.

So introduce the task::blocked_lock, which nests under the
mutex::wait_lock in the locking order, and rework the locking
to use it.

Signed-off-by: John Stultz <jstultz@google.com>
---
v15:
* Split back out into later in the series
v16:
* Fixups to mark tasks unblocked before sleeping in
  mutex_optimistic_spin()
* Rework to use guard() as suggested by Peter
v19:
* Rework logic for PREEMPT_RT issues reported by
  K Prateek Nayak
v21:
* After recently thinking more on ww_mutex code, I
  reworked the blocked_lock usage in mutex lock to
  avoid having to take nested locks in the ww_mutex
  paths, as I was concerned the lock ordering
  constraints weren't as strong as I had previously
  thought.

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 include/linux/sched.h        | 52 +++++++++++++++---------------------
 init/init_task.c             |  1 +
 kernel/fork.c                |  1 +
 kernel/locking/mutex-debug.c |  4 +--
 kernel/locking/mutex.c       | 37 +++++++++++++++----------
 kernel/locking/ww_mutex.h    |  4 +--
 kernel/sched/core.c          |  4 ++-
 7 files changed, 54 insertions(+), 49 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f8188b8333503..3ec0ef0d91603 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1235,6 +1235,7 @@ struct task_struct {
 #endif
 
 	struct mutex			*blocked_on;	/* lock we're blocked on */
+	raw_spinlock_t			blocked_lock;
 
 #ifdef CONFIG_DETECT_HUNG_TASK_BLOCKER
 	/*
@@ -2143,57 +2144,48 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
 #ifndef CONFIG_PREEMPT_RT
 static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
 {
-	struct mutex *m = p->blocked_on;
+	lockdep_assert_held_once(&p->blocked_lock);
+	return p->blocked_on;
+}
 
-	if (m)
-		lockdep_assert_held_once(&m->wait_lock);
-	return m;
+static inline struct mutex *get_task_blocked_on(struct task_struct *p)
+{
+	guard(raw_spinlock_irqsave)(&p->blocked_lock);
+	return __get_task_blocked_on(p);
 }
 
 static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
 {
-	struct mutex *blocked_on = READ_ONCE(p->blocked_on);
-
 	WARN_ON_ONCE(!m);
 	/* The task should only be setting itself as blocked */
 	WARN_ON_ONCE(p != current);
-	/* Currently we serialize blocked_on under the mutex::wait_lock */
-	lockdep_assert_held_once(&m->wait_lock);
+	/* Currently we serialize blocked_on under the task::blocked_lock */
+	lockdep_assert_held_once(&p->blocked_lock);
 	/*
 	 * Check ensure we don't overwrite existing mutex value
 	 * with a different mutex. Note, setting it to the same
 	 * lock repeatedly is ok.
 	 */
-	WARN_ON_ONCE(blocked_on && blocked_on != m);
-	WRITE_ONCE(p->blocked_on, m);
-}
-
-static inline void set_task_blocked_on(struct task_struct *p, struct mutex *m)
-{
-	guard(raw_spinlock_irqsave)(&m->wait_lock);
-	__set_task_blocked_on(p, m);
+	WARN_ON_ONCE(p->blocked_on && p->blocked_on != m);
+	p->blocked_on = m;
 }
 
 static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *m)
 {
-	if (m) {
-		struct mutex *blocked_on = READ_ONCE(p->blocked_on);
-
-		/* Currently we serialize blocked_on under the mutex::wait_lock */
-		lockdep_assert_held_once(&m->wait_lock);
-		/*
-		 * There may be cases where we re-clear already cleared
-		 * blocked_on relationships, but make sure we are not
-		 * clearing the relationship with a different lock.
-		 */
-		WARN_ON_ONCE(blocked_on && blocked_on != m);
-	}
-	WRITE_ONCE(p->blocked_on, NULL);
+	/* Currently we serialize blocked_on under the task::blocked_lock */
+	lockdep_assert_held_once(&p->blocked_lock);
+	/*
+	 * There may be cases where we re-clear already cleared
+	 * blocked_on relationships, but make sure we are not
+	 * clearing the relationship with a different lock.
+	 */
+	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m);
+	p->blocked_on = NULL;
 }
 
 static inline void clear_task_blocked_on(struct task_struct *p, struct mutex *m)
 {
-	guard(raw_spinlock_irqsave)(&m->wait_lock);
+	guard(raw_spinlock_irqsave)(&p->blocked_lock);
 	__clear_task_blocked_on(p, m);
 }
 #else
diff --git a/init/init_task.c b/init/init_task.c
index e557f622bd906..7e29d86153d9f 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -140,6 +140,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	.journal_info	= NULL,
 	INIT_CPU_TIMERS(init_task)
 	.pi_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
+	.blocked_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.blocked_lock),
 	.timer_slack_ns = 50000, /* 50 usec default slack */
 	.thread_pid	= &init_struct_pid,
 	.thread_node	= LIST_HEAD_INIT(init_signals.thread_head),
diff --git a/kernel/fork.c b/kernel/fork.c
index af673856499dc..db6d08946ec11 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2030,6 +2030,7 @@ __latent_entropy struct task_struct *copy_process(
 	ftrace_graph_init_task(p);
 
 	rt_mutex_init_task(p);
+	raw_spin_lock_init(&p->blocked_lock);
 
 	lockdep_assert_irqs_enabled();
 #ifdef CONFIG_PROVE_LOCKING
diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index 949103fd8e9b5..1d8cff71f65e1 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -54,13 +54,13 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 	lockdep_assert_held(&lock->wait_lock);
 
 	/* Current thread can't be already blocked (since it's executing!) */
-	DEBUG_LOCKS_WARN_ON(__get_task_blocked_on(task));
+	DEBUG_LOCKS_WARN_ON(get_task_blocked_on(task));
 }
 
 void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 			 struct task_struct *task)
 {
-	struct mutex *blocked_on = __get_task_blocked_on(task);
+	struct mutex *blocked_on = get_task_blocked_on(task);
 
 	DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
 	DEBUG_LOCKS_WARN_ON(waiter->task != task);
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index de7d6702cd96c..fac40c456098e 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -640,6 +640,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 			goto err_early_kill;
 	}
 
+	raw_spin_lock(&current->blocked_lock);
 	__set_task_blocked_on(current, lock);
 	set_current_state(state);
 	trace_contention_begin(lock, LCB_F_MUTEX);
@@ -653,8 +654,9 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 		 * the handoff.
 		 */
 		if (__mutex_trylock(lock))
-			goto acquired;
+			break;
 
+		raw_spin_unlock(&current->blocked_lock);
 		/*
 		 * Check for signals and kill conditions while holding
 		 * wait_lock. This ensures the lock cancellation is ordered
@@ -677,12 +679,14 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 
 		first = __mutex_waiter_is_first(lock, &waiter);
 
+		raw_spin_lock_irqsave(&lock->wait_lock, flags);
+		raw_spin_lock(&current->blocked_lock);
 		/*
 		 * As we likely have been woken up by task
 		 * that has cleared our blocked_on state, re-set
 		 * it to the lock we are trying to acquire.
 		 */
-		set_task_blocked_on(current, lock);
+		__set_task_blocked_on(current, lock);
 		set_current_state(state);
 		/*
 		 * Here we order against unlock; we must either see it change
@@ -693,25 +697,30 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 			break;
 
 		if (first) {
-			trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
+			bool opt_acquired;
+
 			/*
 			 * mutex_optimistic_spin() can call schedule(), so
-			 * clear blocked on so we don't become unselectable
+			 * we need to release these locks before calling it,
+			 * and clear blocked on so we don't become unselectable
 			 * to run.
 			 */
-			clear_task_blocked_on(current, lock);
-			if (mutex_optimistic_spin(lock, ww_ctx, &waiter))
+			__clear_task_blocked_on(current, lock);
+			raw_spin_unlock(&current->blocked_lock);
+			raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
+			trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
+			opt_acquired = mutex_optimistic_spin(lock, ww_ctx, &waiter);
+			raw_spin_lock_irqsave(&lock->wait_lock, flags);
+			raw_spin_lock(&current->blocked_lock);
+			__set_task_blocked_on(current, lock);
+			if (opt_acquired)
 				break;
-			set_task_blocked_on(current, lock);
 			trace_contention_begin(lock, LCB_F_MUTEX);
 		}
-
-		raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	}
-	raw_spin_lock_irqsave(&lock->wait_lock, flags);
-acquired:
 	__clear_task_blocked_on(current, lock);
 	__set_current_state(TASK_RUNNING);
+	raw_spin_unlock(&current->blocked_lock);
 
 	if (ww_ctx) {
 		/*
@@ -740,11 +749,11 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 	return 0;
 
 err:
-	__clear_task_blocked_on(current, lock);
+	clear_task_blocked_on(current, lock);
 	__set_current_state(TASK_RUNNING);
 	__mutex_remove_waiter(lock, &waiter);
 err_early_kill:
-	WARN_ON(__get_task_blocked_on(current));
+	WARN_ON(get_task_blocked_on(current));
 	trace_contention_end(lock, ret);
 	raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q);
 	debug_mutex_free_waiter(&waiter);
@@ -955,7 +964,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 		next = waiter->task;
 
 		debug_mutex_wake_waiter(lock, waiter);
-		__clear_task_blocked_on(next, lock);
+		clear_task_blocked_on(next, lock);
 		wake_q_add(&wake_q, next);
 	}
 
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 31a785afee6c0..e4a81790ea7dd 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -289,7 +289,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
 		 * blocked_on pointer. Otherwise we can see circular
 		 * blocked_on relationships that can't resolve.
 		 */
-		__clear_task_blocked_on(waiter->task, lock);
+		clear_task_blocked_on(waiter->task, lock);
 		wake_q_add(wake_q, waiter->task);
 	}
 
@@ -347,7 +347,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
 			 * are waking the mutex owner, who may be currently
 			 * blocked on a different mutex.
 			 */
-			__clear_task_blocked_on(owner, NULL);
+			clear_task_blocked_on(owner, NULL);
 			wake_q_add(wake_q, owner);
 		}
 		return true;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index be00629f0ba4c..0180853dd48c5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6639,6 +6639,7 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *d
  *   p->pi_lock
  *     rq->lock
  *       mutex->wait_lock
+ *         p->blocked_lock
  *
  * Returns the task that is going to be used as execution context (the one
  * that is actually going to be run on cpu_of(rq)).
@@ -6662,8 +6663,9 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 		 * and ensure @owner sticks around.
 		 */
 		guard(raw_spinlock)(&mutex->wait_lock);
+		guard(raw_spinlock)(&p->blocked_lock);
 
-		/* Check again that p is blocked with wait_lock held */
+		/* Check again that p is blocked with blocked_lock held */
 		if (mutex != __get_task_blocked_on(p)) {
 			/*
 			 * Something changed in the blocked_on chain and
-- 
2.51.0.338.gd7d06c2dae-goog


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RESEND][PATCH v21 2/6] sched/locking: Add blocked_on_state to provide necessary tri-state for proxy return-migration
  2025-09-04  0:21 [RESEND][PATCH v21 0/6] Donor Migration for Proxy Execution (v21) John Stultz
  2025-09-04  0:21 ` [RESEND][PATCH v21 1/6] locking: Add task::blocked_lock to serialize blocked_on state John Stultz
@ 2025-09-04  0:21 ` John Stultz
  2025-09-15  8:05   ` K Prateek Nayak
  2025-09-15  9:03   ` K Prateek Nayak
  2025-09-04  0:21 ` [RESEND][PATCH v21 3/6] sched: Add logic to zap balance callbacks if we pick again John Stultz
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 26+ messages in thread
From: John Stultz @ 2025-09-04  0:21 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

As we add functionality to proxy execution, we may migrate a
donor task to a runqueue where it can't run due to cpu affinity.
Thus, we must be careful to ensure we return-migrate the task
back to a cpu in its cpumask when it becomes unblocked.

Thus we need more then just a binary concept of the task being
blocked on a mutex or not.

So add a blocked_on_state value to the task, that allows the
task to move through BO_RUNNING -> BO_BLOCKED -> BO_WAKING
and back to BO_RUNNING. This provides a guard state in
BO_WAKING so we can know the task is no longer blocked
but we don't want to run it until we have potentially
done return migration, back to a usable cpu.

Signed-off-by: John Stultz <jstultz@google.com>
---
v15:
* Split blocked_on_state into its own patch later in the
  series, as the tri-state isn't necessary until we deal
  with proxy/return migrations
v16:
* Handle case where task in the chain is being set as
  BO_WAKING by another cpu (usually via ww_mutex die code).
  Make sure we release the rq lock so the wakeup can
  complete.
* Rework to use guard() in find_proxy_task() as suggested
  by Peter
v18:
* Add initialization of blocked_on_state for init_task
v19:
* PREEMPT_RT build fixups and rework suggested by
  K Prateek Nayak
v20:
* Simplify one of the blocked_on_state changes to avoid extra
  PREMEPT_RT conditionals
v21:
* Slight reworks due to avoiding nested blocked_lock locking
* Be consistent in use of blocked_on_state helper functions
* Rework calls to proxy_deactivate() to do proper locking
  around blocked_on_state changes that we were cheating in
  previous versions.
* Minor cleanups, some comment improvements

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 include/linux/sched.h     | 80 +++++++++++++++++++++++++++++----------
 init/init_task.c          |  1 +
 kernel/fork.c             |  1 +
 kernel/locking/mutex.c    | 15 ++++----
 kernel/locking/ww_mutex.h | 20 ++++------
 kernel/sched/core.c       | 44 +++++++++++++++++++--
 kernel/sched/sched.h      |  2 +-
 7 files changed, 120 insertions(+), 43 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3ec0ef0d91603..5801de1a44a79 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -815,6 +815,12 @@ struct kmap_ctrl {
 #endif
 };
 
+enum blocked_on_state {
+	BO_RUNNABLE,
+	BO_BLOCKED,
+	BO_WAKING,
+};
+
 struct task_struct {
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	/*
@@ -1234,6 +1240,7 @@ struct task_struct {
 	struct rt_mutex_waiter		*pi_blocked_on;
 #endif
 
+	enum blocked_on_state		blocked_on_state;
 	struct mutex			*blocked_on;	/* lock we're blocked on */
 	raw_spinlock_t			blocked_lock;
 
@@ -2141,7 +2148,52 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
 	__cond_resched_rwlock_write(lock);					\
 })
 
-#ifndef CONFIG_PREEMPT_RT
+static inline void __force_blocked_on_runnable(struct task_struct *p)
+{
+	lockdep_assert_held(&p->blocked_lock);
+	p->blocked_on_state = BO_RUNNABLE;
+}
+
+static inline void force_blocked_on_runnable(struct task_struct *p)
+{
+	guard(raw_spinlock_irqsave)(&p->blocked_lock);
+	__force_blocked_on_runnable(p);
+}
+
+static inline void __force_blocked_on_blocked(struct task_struct *p)
+{
+	lockdep_assert_held(&p->blocked_lock);
+	p->blocked_on_state = BO_BLOCKED;
+}
+
+static inline void __set_blocked_on_runnable(struct task_struct *p)
+{
+	lockdep_assert_held(&p->blocked_lock);
+	if (p->blocked_on_state == BO_WAKING)
+		p->blocked_on_state = BO_RUNNABLE;
+}
+
+static inline void set_blocked_on_runnable(struct task_struct *p)
+{
+	if (!sched_proxy_exec())
+		return;
+	guard(raw_spinlock_irqsave)(&p->blocked_lock);
+	__set_blocked_on_runnable(p);
+}
+
+static inline void __set_blocked_on_waking(struct task_struct *p)
+{
+	lockdep_assert_held(&p->blocked_lock);
+	if (p->blocked_on_state == BO_BLOCKED)
+		p->blocked_on_state = BO_WAKING;
+}
+
+static inline void set_blocked_on_waking(struct task_struct *p)
+{
+	guard(raw_spinlock_irqsave)(&p->blocked_lock);
+	__set_blocked_on_waking(p);
+}
+
 static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
 {
 	lockdep_assert_held_once(&p->blocked_lock);
@@ -2163,24 +2215,23 @@ static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
 	lockdep_assert_held_once(&p->blocked_lock);
 	/*
 	 * Check ensure we don't overwrite existing mutex value
-	 * with a different mutex. Note, setting it to the same
-	 * lock repeatedly is ok.
+	 * with a different mutex.
 	 */
-	WARN_ON_ONCE(p->blocked_on && p->blocked_on != m);
+	WARN_ON_ONCE(p->blocked_on);
 	p->blocked_on = m;
+	p->blocked_on_state = BO_BLOCKED;
 }
 
 static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *m)
 {
+	/* The task should only be clearing itself */
+	WARN_ON_ONCE(p != current);
 	/* Currently we serialize blocked_on under the task::blocked_lock */
 	lockdep_assert_held_once(&p->blocked_lock);
-	/*
-	 * There may be cases where we re-clear already cleared
-	 * blocked_on relationships, but make sure we are not
-	 * clearing the relationship with a different lock.
-	 */
-	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m);
+	/* Make sure we are clearing the relationship with the right lock */
+	WARN_ON_ONCE(m && p->blocked_on != m);
 	p->blocked_on = NULL;
+	p->blocked_on_state = BO_RUNNABLE;
 }
 
 static inline void clear_task_blocked_on(struct task_struct *p, struct mutex *m)
@@ -2188,15 +2239,6 @@ static inline void clear_task_blocked_on(struct task_struct *p, struct mutex *m)
 	guard(raw_spinlock_irqsave)(&p->blocked_lock);
 	__clear_task_blocked_on(p, m);
 }
-#else
-static inline void __clear_task_blocked_on(struct task_struct *p, struct rt_mutex *m)
-{
-}
-
-static inline void clear_task_blocked_on(struct task_struct *p, struct rt_mutex *m)
-{
-}
-#endif /* !CONFIG_PREEMPT_RT */
 
 static __always_inline bool need_resched(void)
 {
diff --git a/init/init_task.c b/init/init_task.c
index 7e29d86153d9f..6d72ec23410a6 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -174,6 +174,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	.mems_allowed_seq = SEQCNT_SPINLOCK_ZERO(init_task.mems_allowed_seq,
 						 &init_task.alloc_lock),
 #endif
+	.blocked_on_state = BO_RUNNABLE,
 #ifdef CONFIG_RT_MUTEXES
 	.pi_waiters	= RB_ROOT_CACHED,
 	.pi_top_task	= NULL,
diff --git a/kernel/fork.c b/kernel/fork.c
index db6d08946ec11..4bd0731995e86 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2129,6 +2129,7 @@ __latent_entropy struct task_struct *copy_process(
 	lockdep_init_task(p);
 #endif
 
+	p->blocked_on_state = BO_RUNNABLE;
 	p->blocked_on = NULL; /* not blocked yet */
 
 #ifdef CONFIG_BCACHE
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index fac40c456098e..42e4d2e6e4ad4 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -682,11 +682,9 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 		raw_spin_lock_irqsave(&lock->wait_lock, flags);
 		raw_spin_lock(&current->blocked_lock);
 		/*
-		 * As we likely have been woken up by task
-		 * that has cleared our blocked_on state, re-set
-		 * it to the lock we are trying to acquire.
+		 * Re-set blocked_on_state as unlock path set it to WAKING/RUNNABLE
 		 */
-		__set_task_blocked_on(current, lock);
+		__force_blocked_on_blocked(current);
 		set_current_state(state);
 		/*
 		 * Here we order against unlock; we must either see it change
@@ -705,14 +703,14 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 			 * and clear blocked on so we don't become unselectable
 			 * to run.
 			 */
-			__clear_task_blocked_on(current, lock);
+			__force_blocked_on_runnable(current);
 			raw_spin_unlock(&current->blocked_lock);
 			raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 			trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
 			opt_acquired = mutex_optimistic_spin(lock, ww_ctx, &waiter);
 			raw_spin_lock_irqsave(&lock->wait_lock, flags);
 			raw_spin_lock(&current->blocked_lock);
-			__set_task_blocked_on(current, lock);
+			__force_blocked_on_blocked(current);
 			if (opt_acquired)
 				break;
 			trace_contention_begin(lock, LCB_F_MUTEX);
@@ -963,8 +961,11 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 
 		next = waiter->task;
 
+		raw_spin_lock(&next->blocked_lock);
 		debug_mutex_wake_waiter(lock, waiter);
-		clear_task_blocked_on(next, lock);
+		WARN_ON_ONCE(__get_task_blocked_on(next) != lock);
+		__set_blocked_on_waking(next);
+		raw_spin_unlock(&next->blocked_lock);
 		wake_q_add(&wake_q, next);
 	}
 
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index e4a81790ea7dd..f34363615eb34 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -285,11 +285,11 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
 		debug_mutex_wake_waiter(lock, waiter);
 #endif
 		/*
-		 * When waking up the task to die, be sure to clear the
-		 * blocked_on pointer. Otherwise we can see circular
-		 * blocked_on relationships that can't resolve.
+		 * When waking up the task to die, be sure to set the
+		 * blocked_on_state to BO_WAKING. Otherwise we can see
+		 * circular blocked_on relationships that can't resolve.
 		 */
-		clear_task_blocked_on(waiter->task, lock);
+		set_blocked_on_waking(waiter->task);
 		wake_q_add(wake_q, waiter->task);
 	}
 
@@ -339,15 +339,11 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
 		 */
 		if (owner != current) {
 			/*
-			 * When waking up the task to wound, be sure to clear the
-			 * blocked_on pointer. Otherwise we can see circular
-			 * blocked_on relationships that can't resolve.
-			 *
-			 * NOTE: We pass NULL here instead of lock, because we
-			 * are waking the mutex owner, who may be currently
-			 * blocked on a different mutex.
+			 * When waking up the task to wound, be sure to set the
+			 * blocked_on_state to BO_WAKING. Otherwise we can see
+			 * circular blocked_on relationships that can't resolve.
 			 */
-			clear_task_blocked_on(owner, NULL);
+			set_blocked_on_waking(owner);
 			wake_q_add(wake_q, owner);
 		}
 		return true;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0180853dd48c5..e0007660161fa 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4328,6 +4328,12 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		ttwu_queue(p, cpu, wake_flags);
 	}
 out:
+	/*
+	 * For now, if we've been woken up, set us as BO_RUNNABLE
+	 * We will need to be more careful later when handling
+	 * proxy migration
+	 */
+	set_blocked_on_runnable(p);
 	if (success)
 		ttwu_stat(p, task_cpu(p), wake_flags);
 
@@ -6623,7 +6629,7 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *d
 		 * as unblocked, as we aren't doing proxy-migrations
 		 * yet (more logic will be needed then).
 		 */
-		donor->blocked_on = NULL;
+		force_blocked_on_runnable(donor);
 	}
 	return NULL;
 }
@@ -6676,20 +6682,41 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 			return NULL;
 		}
 
+		/*
+		 * If a ww_mutex hits the die/wound case, it marks the task as
+		 * BO_WAKING and calls try_to_wake_up(), so that the mutex
+		 * cycle can be broken and we avoid a deadlock.
+		 *
+		 * However, if at that moment, we are here on the cpu which the
+		 * die/wounded task is enqueued, we might loop on the cycle as
+		 * BO_WAKING still causes task_is_blocked() to return true
+		 * (since we want return migration to occur before we run the
+		 * task).
+		 *
+		 * Unfortunately since we hold the rq lock, it will block
+		 * try_to_wake_up from completing and doing the return
+		 * migration.
+		 *
+		 * So when we hit a !BO_BLOCKED task briefly schedule idle
+		 * so we release the rq and let the wakeup complete.
+		 */
+		if (p->blocked_on_state != BO_BLOCKED)
+			return proxy_resched_idle(rq);
+
 		owner = __mutex_owner(mutex);
 		if (!owner) {
-			__clear_task_blocked_on(p, mutex);
+			__force_blocked_on_runnable(p);
 			return p;
 		}
 
 		if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
 			/* XXX Don't handle blocked owners/delayed dequeue yet */
-			return proxy_deactivate(rq, donor);
+			goto deactivate_donor;
 		}
 
 		if (task_cpu(owner) != this_cpu) {
 			/* XXX Don't handle migrations yet */
-			return proxy_deactivate(rq, donor);
+			goto deactivate_donor;
 		}
 
 		if (task_on_rq_migrating(owner)) {
@@ -6749,6 +6776,15 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 
 	WARN_ON_ONCE(owner && !owner->on_rq);
 	return owner;
+
+	/*
+	 * NOTE: This logic is down here, because we need to call
+	 * the functions with the mutex wait_lock and task
+	 * blocked_lock released, so we have to get out of the
+	 * guard() scope.
+	 */
+deactivate_donor:
+	return proxy_deactivate(rq, donor);
 }
 #else /* SCHED_PROXY_EXEC */
 static struct task_struct *
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index be9745d104f75..845454ec81a22 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2264,7 +2264,7 @@ static inline bool task_is_blocked(struct task_struct *p)
 	if (!sched_proxy_exec())
 		return false;
 
-	return !!p->blocked_on;
+	return !!p->blocked_on && p->blocked_on_state != BO_RUNNABLE;
 }
 
 static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
-- 
2.51.0.338.gd7d06c2dae-goog


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RESEND][PATCH v21 3/6] sched: Add logic to zap balance callbacks if we pick again
  2025-09-04  0:21 [RESEND][PATCH v21 0/6] Donor Migration for Proxy Execution (v21) John Stultz
  2025-09-04  0:21 ` [RESEND][PATCH v21 1/6] locking: Add task::blocked_lock to serialize blocked_on state John Stultz
  2025-09-04  0:21 ` [RESEND][PATCH v21 2/6] sched/locking: Add blocked_on_state to provide necessary tri-state for proxy return-migration John Stultz
@ 2025-09-04  0:21 ` John Stultz
  2025-09-15  8:31   ` K Prateek Nayak
  2025-09-04  0:21 ` [RESEND][PATCH v21 4/6] sched: Handle blocked-waiter migration (and return migration) John Stultz
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: John Stultz @ 2025-09-04  0:21 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

With proxy-exec, a task is selected to run via pick_next_task(),
and then if it is a mutex blocked task, we call find_proxy_task()
to find a runnable owner. If the runnable owner is on another
cpu, we will need to migrate the selected donor task away, after
which we will pick_again can call pick_next_task() to choose
something else.

However, in the first call to pick_next_task(), we may have
had a balance_callback setup by the class scheduler. After we
pick again, its possible pick_next_task_fair() will be called
which calls sched_balance_newidle() and sched_balance_rq().

This will throw a warning:
[    8.796467] rq->balance_callback && rq->balance_callback != &balance_push_callback
[    8.796467] WARNING: CPU: 32 PID: 458 at kernel/sched/sched.h:1750 sched_balance_rq+0xe92/0x1250
...
[    8.796467] Call Trace:
[    8.796467]  <TASK>
[    8.796467]  ? __warn.cold+0xb2/0x14e
[    8.796467]  ? sched_balance_rq+0xe92/0x1250
[    8.796467]  ? report_bug+0x107/0x1a0
[    8.796467]  ? handle_bug+0x54/0x90
[    8.796467]  ? exc_invalid_op+0x17/0x70
[    8.796467]  ? asm_exc_invalid_op+0x1a/0x20
[    8.796467]  ? sched_balance_rq+0xe92/0x1250
[    8.796467]  sched_balance_newidle+0x295/0x820
[    8.796467]  pick_next_task_fair+0x51/0x3f0
[    8.796467]  __schedule+0x23a/0x14b0
[    8.796467]  ? lock_release+0x16d/0x2e0
[    8.796467]  schedule+0x3d/0x150
[    8.796467]  worker_thread+0xb5/0x350
[    8.796467]  ? __pfx_worker_thread+0x10/0x10
[    8.796467]  kthread+0xee/0x120
[    8.796467]  ? __pfx_kthread+0x10/0x10
[    8.796467]  ret_from_fork+0x31/0x50
[    8.796467]  ? __pfx_kthread+0x10/0x10
[    8.796467]  ret_from_fork_asm+0x1a/0x30
[    8.796467]  </TASK>

This is because if a RT task was originally picked, it will
setup the rq->balance_callback with push_rt_tasks() via
set_next_task_rt().

Once the task is migrated away and we pick again, we haven't
processed any balance callbacks, so rq->balance_callback is not
in the same state as it was the first time pick_next_task was
called.

To handle this, add a zap_balance_callbacks() helper function
which cleans up the blance callbacks without running them. This
should be ok, as we are effectively undoing the state set in
the first call to pick_next_task(), and when we pick again,
the new callback can be configured for the donor task actually
selected.

Signed-off-by: John Stultz <jstultz@google.com>
---
v20:
* Tweaked to avoid build issues with different configs

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/sched/core.c | 39 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e0007660161fa..01bf5ef8d9fcc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5001,6 +5001,40 @@ static inline void finish_task(struct task_struct *prev)
 	smp_store_release(&prev->on_cpu, 0);
 }
 
+#if defined(CONFIG_SCHED_PROXY_EXEC)
+/*
+ * Only called from __schedule context
+ *
+ * There are some cases where we are going to re-do the action
+ * that added the balance callbacks. We may not be in a state
+ * where we can run them, so just zap them so they can be
+ * properly re-added on the next time around. This is similar
+ * handling to running the callbacks, except we just don't call
+ * them.
+ */
+static void zap_balance_callbacks(struct rq *rq)
+{
+	struct balance_callback *next, *head;
+	bool found = false;
+
+	lockdep_assert_rq_held(rq);
+
+	head = rq->balance_callback;
+	while (head) {
+		if (head == &balance_push_callback)
+			found = true;
+		next = head->next;
+		head->next = NULL;
+		head = next;
+	}
+	rq->balance_callback = found ? &balance_push_callback : NULL;
+}
+#else
+static inline void zap_balance_callbacks(struct rq *rq)
+{
+}
+#endif
+
 static void do_balance_callbacks(struct rq *rq, struct balance_callback *head)
 {
 	void (*func)(struct rq *rq);
@@ -6941,8 +6975,11 @@ static void __sched notrace __schedule(int sched_mode)
 	rq_set_donor(rq, next);
 	if (unlikely(task_is_blocked(next))) {
 		next = find_proxy_task(rq, next, &rf);
-		if (!next)
+		if (!next) {
+			/* zap the balance_callbacks before picking again */
+			zap_balance_callbacks(rq);
 			goto pick_again;
+		}
 		if (next == rq->idle)
 			goto keep_resched;
 	}
-- 
2.51.0.338.gd7d06c2dae-goog


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RESEND][PATCH v21 4/6] sched: Handle blocked-waiter migration (and return migration)
  2025-09-04  0:21 [RESEND][PATCH v21 0/6] Donor Migration for Proxy Execution (v21) John Stultz
                   ` (2 preceding siblings ...)
  2025-09-04  0:21 ` [RESEND][PATCH v21 3/6] sched: Add logic to zap balance callbacks if we pick again John Stultz
@ 2025-09-04  0:21 ` John Stultz
  2025-09-15 10:06   ` K Prateek Nayak
  2025-09-04  0:21 ` [RESEND][PATCH v21 5/6] sched: Add blocked_donor link to task for smarter mutex handoffs John Stultz
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: John Stultz @ 2025-09-04  0:21 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

Add logic to handle migrating a blocked waiter to a remote
cpu where the lock owner is runnable.

Additionally, as the blocked task may not be able to run
on the remote cpu, add logic to handle return migration once
the waiting task is given the mutex.

Because tasks may get migrated to where they cannot run, also
modify the scheduling classes to avoid sched class migrations on
mutex blocked tasks, leaving find_proxy_task() and related logic
to do the migrations and return migrations.

This was split out from the larger proxy patch, and
significantly reworked.

Credits for the original patch go to:
  Peter Zijlstra (Intel) <peterz@infradead.org>
  Juri Lelli <juri.lelli@redhat.com>
  Valentin Schneider <valentin.schneider@arm.com>
  Connor O'Brien <connoro@google.com>

Signed-off-by: John Stultz <jstultz@google.com>
---
v6:
* Integrated sched_proxy_exec() check in proxy_return_migration()
* Minor cleanups to diff
* Unpin the rq before calling __balance_callbacks()
* Tweak proxy migrate to migrate deeper task in chain, to avoid
  tasks pingponging between rqs
v7:
* Fixup for unused function arguments
* Switch from that_rq -> target_rq, other minor tweaks, and typo
  fixes suggested by Metin Kaya
* Switch back to doing return migration in the ttwu path, which
  avoids nasty lock juggling and performance issues
* Fixes for UP builds
v8:
* More simplifications from Metin Kaya
* Fixes for null owner case, including doing return migration
* Cleanup proxy_needs_return logic
v9:
* Narrow logic in ttwu that sets BO_RUNNABLE, to avoid missed
  return migrations
* Switch to using zap_balance_callbacks rathern then running
  them when we are dropping rq locks for proxy_migration.
* Drop task_is_blocked check in sched_submit_work as suggested
  by Metin (may re-add later if this causes trouble)
* Do return migration when we're not on wake_cpu. This avoids
  bad task placement caused by proxy migrations raised by
  Xuewen Yan
* Fix to call set_next_task(rq->curr) prior to dropping rq lock
  to avoid rq->curr getting migrated before we have actually
  switched from it
* Cleanup to re-use proxy_resched_idle() instead of open coding
  it in proxy_migrate_task()
* Fix return migration not to use DEQUEUE_SLEEP, so that we
  properly see the task as task_on_rq_migrating() after it is
  dequeued but before set_task_cpu() has been called on it
* Fix to broaden find_proxy_task() checks to avoid race where
  a task is dequeued off the rq due to return migration, but
  set_task_cpu() and the enqueue on another rq happened after
  we checked task_cpu(owner). This ensures we don't proxy
  using a task that is not actually on our runqueue.
* Cleanup to avoid the locked BO_WAKING->BO_RUNNABLE transition
  in try_to_wake_up() if proxy execution isn't enabled.
* Cleanup to improve comment in proxy_migrate_task() explaining
  the set_next_task(rq->curr) logic
* Cleanup deadline.c change to stylistically match rt.c change
* Numerous cleanups suggested by Metin
v10:
* Drop WARN_ON(task_is_blocked(p)) in ttwu current case
v11:
* Include proxy_set_task_cpu from later in the series to this
  change so we can use it, rather then reworking logic later
  in the series.
* Fix problem with return migration, where affinity was changed
  and wake_cpu was left outside the affinity mask.
* Avoid reading the owner's cpu twice (as it might change inbetween)
  to avoid occasional migration-to-same-cpu edge cases
* Add extra WARN_ON checks for wake_cpu and return migration
  edge cases.
* Typo fix from Metin
v13:
* As we set ret, return it, not just NULL (pulling this change
  in from later patch)
* Avoid deadlock between try_to_wake_up() and find_proxy_task() when
  blocked_on cycle with ww_mutex is trying a mid-chain wakeup.
* Tweaks to use new __set_blocked_on_runnable() helper
* Potential fix for incorrectly updated task->dl_server issues
* Minor comment improvements
* Add logic to handle missed wakeups, in that case doing return
  migration from the find_proxy_task() path
* Minor cleanups
v14:
* Improve edge cases where we wouldn't set the task as BO_RUNNABLE
v15:
* Added comment to better describe proxy_needs_return() as suggested
  by Qais
* Build fixes for !CONFIG_SMP reported by
  Maciej Żenczykowski <maze@google.com>
* Adds fix for re-evaluating proxy_needs_return when
  sched_proxy_exec() is disabled, reported and diagnosed by:
  kuyo chang <kuyo.chang@mediatek.com>
v16:
* Larger rework of needs_return logic in find_proxy_task, in
  order to avoid problems with cpuhotplug
* Rework to use guard() as suggested by Peter
v18:
* Integrate optimization suggested by Suleiman to do the checks
  for sleeping owners before checking if the task_cpu is this_cpu,
  so that we can avoid needlessly proxy-migrating tasks to only
  then dequeue them. Also check if migrating last.
* Improve comments around guard locking
* Include tweak to ttwu_runnable() as suggested by
  hupu <hupu.gm@gmail.com>
* Rework the logic releasing the rq->donor reference before letting
  go of the rqlock. Just use rq->idle.
* Go back to doing return migration on BO_WAKING owners, as I was
  hitting some softlockups caused by running tasks not making
  it out of BO_WAKING.
v19:
* Fixed proxy_force_return() logic for !SMP cases
v21:
* Reworked donor deactivation for unhandled sleeping owners
* Commit message tweaks

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/sched/core.c | 247 +++++++++++++++++++++++++++++++++++++++-----
 kernel/sched/fair.c |   3 +-
 2 files changed, 222 insertions(+), 28 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 01bf5ef8d9fcc..0f824446c6046 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3157,6 +3157,14 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
 
 	__do_set_cpus_allowed(p, ctx);
 
+	/*
+	 * It might be that the p->wake_cpu is no longer
+	 * allowed, so set it to the dest_cpu so return
+	 * migration doesn't send it to an invalid cpu
+	 */
+	if (!is_cpu_allowed(p, p->wake_cpu))
+		p->wake_cpu = dest_cpu;
+
 	return affine_move_task(rq, p, rf, dest_cpu, ctx->flags);
 
 out:
@@ -3717,6 +3725,67 @@ static inline void ttwu_do_wakeup(struct task_struct *p)
 	trace_sched_wakeup(p);
 }
 
+#ifdef CONFIG_SCHED_PROXY_EXEC
+static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
+{
+	unsigned int wake_cpu;
+
+	/*
+	 * Since we are enqueuing a blocked task on a cpu it may
+	 * not be able to run on, preserve wake_cpu when we
+	 * __set_task_cpu so we can return the task to where it
+	 * was previously runnable.
+	 */
+	wake_cpu = p->wake_cpu;
+	__set_task_cpu(p, cpu);
+	p->wake_cpu = wake_cpu;
+}
+
+static bool proxy_task_runnable_but_waking(struct task_struct *p)
+{
+	if (!sched_proxy_exec())
+		return false;
+	return (READ_ONCE(p->__state) == TASK_RUNNING &&
+		READ_ONCE(p->blocked_on_state) == BO_WAKING);
+}
+#else /* !CONFIG_SCHED_PROXY_EXEC */
+static bool proxy_task_runnable_but_waking(struct task_struct *p)
+{
+	return false;
+}
+#endif /* CONFIG_SCHED_PROXY_EXEC */
+
+/*
+ * Checks to see if task p has been proxy-migrated to another rq
+ * and needs to be returned. If so, we deactivate the task here
+ * so that it can be properly woken up on the p->wake_cpu
+ * (or whichever cpu select_task_rq() picks at the bottom of
+ * try_to_wake_up()
+ */
+static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
+{
+	bool ret = false;
+
+	if (!sched_proxy_exec())
+		return false;
+
+	raw_spin_lock(&p->blocked_lock);
+	if (__get_task_blocked_on(p) && p->blocked_on_state == BO_WAKING) {
+		if (!task_current(rq, p) && (p->wake_cpu != cpu_of(rq))) {
+			if (task_current_donor(rq, p)) {
+				put_prev_task(rq, p);
+				rq_set_donor(rq, rq->idle);
+			}
+			deactivate_task(rq, p, DEQUEUE_NOCLOCK);
+			ret = true;
+		}
+		__set_blocked_on_runnable(p);
+		resched_curr(rq);
+	}
+	raw_spin_unlock(&p->blocked_lock);
+	return ret;
+}
+
 static void
 ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 		 struct rq_flags *rf)
@@ -3802,6 +3871,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
 		update_rq_clock(rq);
 		if (p->se.sched_delayed)
 			enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
+		if (proxy_needs_return(rq, p))
+			goto out;
 		if (!task_on_cpu(rq, p)) {
 			/*
 			 * When on_rq && !on_cpu the task is preempted, see if
@@ -3812,6 +3883,7 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
 		ttwu_do_wakeup(p);
 		ret = 1;
 	}
+out:
 	__task_rq_unlock(rq, &rf);
 
 	return ret;
@@ -4199,6 +4271,8 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		 *    it disabling IRQs (this allows not taking ->pi_lock).
 		 */
 		WARN_ON_ONCE(p->se.sched_delayed);
+		/* If current is waking up, we know we can run here, so set BO_RUNNBLE */
+		set_blocked_on_runnable(p);
 		if (!ttwu_state_match(p, state, &success))
 			goto out;
 
@@ -4215,8 +4289,15 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 */
 	scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
 		smp_mb__after_spinlock();
-		if (!ttwu_state_match(p, state, &success))
-			break;
+		if (!ttwu_state_match(p, state, &success)) {
+			/*
+			 * If we're already TASK_RUNNING, and BO_WAKING
+			 * continue on to ttwu_runnable check to force
+			 * proxy_needs_return evaluation
+			 */
+			if (!proxy_task_runnable_but_waking(p))
+				break;
+		}
 
 		trace_sched_waking(p);
 
@@ -4278,6 +4359,7 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		 * enqueue, such as ttwu_queue_wakelist().
 		 */
 		WRITE_ONCE(p->__state, TASK_WAKING);
+		set_blocked_on_runnable(p);
 
 		/*
 		 * If the owning (remote) CPU is still in the middle of schedule() with
@@ -4328,12 +4410,6 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		ttwu_queue(p, cpu, wake_flags);
 	}
 out:
-	/*
-	 * For now, if we've been woken up, set us as BO_RUNNABLE
-	 * We will need to be more careful later when handling
-	 * proxy migration
-	 */
-	set_blocked_on_runnable(p);
 	if (success)
 		ttwu_stat(p, task_cpu(p), wake_flags);
 
@@ -6635,7 +6711,7 @@ static inline struct task_struct *proxy_resched_idle(struct rq *rq)
 	return rq->idle;
 }
 
-static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
+static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
 {
 	unsigned long state = READ_ONCE(donor->__state);
 
@@ -6655,17 +6731,98 @@ static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
 	return try_to_block_task(rq, donor, &state, true);
 }
 
-static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *donor)
+/*
+ * If the blocked-on relationship crosses CPUs, migrate @p to the
+ * owner's CPU.
+ *
+ * This is because we must respect the CPU affinity of execution
+ * contexts (owner) but we can ignore affinity for scheduling
+ * contexts (@p). So we have to move scheduling contexts towards
+ * potential execution contexts.
+ *
+ * Note: The owner can disappear, but simply migrate to @target_cpu
+ * and leave that CPU to sort things out.
+ */
+static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
+			       struct task_struct *p, int target_cpu)
 {
-	if (!__proxy_deactivate(rq, donor)) {
-		/*
-		 * XXX: For now, if deactivation failed, set donor
-		 * as unblocked, as we aren't doing proxy-migrations
-		 * yet (more logic will be needed then).
-		 */
-		force_blocked_on_runnable(donor);
-	}
-	return NULL;
+	struct rq *target_rq = cpu_rq(target_cpu);
+
+	lockdep_assert_rq_held(rq);
+
+	/*
+	 * Since we're going to drop @rq, we have to put(@rq->donor) first,
+	 * otherwise we have a reference that no longer belongs to us.
+	 *
+	 * Additionally, as we put_prev_task(prev) earlier, its possible that
+	 * prev will migrate away as soon as we drop the rq lock, however we
+	 * still have it marked as rq->curr, as we've not yet switched tasks.
+	 *
+	 * After the migration, we are going to pick_again in the __schedule
+	 * logic, so backtrack a bit before we release the lock:
+	 * Put rq->donor, and set rq->curr as rq->donor and set_next_task,
+	 * so that we're close to the situation we had entering __schedule
+	 * the first time.
+	 *
+	 * Then when we re-aquire the lock, we will re-put rq->curr then
+	 * rq_set_donor(rq->idle) and set_next_task(rq->idle), before
+	 * picking again.
+	 */
+	/* XXX - Added to address problems with changed dl_server semantics - double check */
+	__put_prev_set_next_dl_server(rq, rq->donor, rq->curr);
+	put_prev_task(rq, rq->donor);
+	rq_set_donor(rq, rq->idle);
+	set_next_task(rq, rq->idle);
+
+	WARN_ON(p == rq->curr);
+
+	deactivate_task(rq, p, 0);
+	proxy_set_task_cpu(p, target_cpu);
+
+	zap_balance_callbacks(rq);
+	rq_unpin_lock(rq, rf);
+	raw_spin_rq_unlock(rq);
+	raw_spin_rq_lock(target_rq);
+
+	activate_task(target_rq, p, 0);
+	wakeup_preempt(target_rq, p, 0);
+
+	raw_spin_rq_unlock(target_rq);
+	raw_spin_rq_lock(rq);
+	rq_repin_lock(rq, rf);
+}
+
+static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
+			       struct task_struct *p)
+{
+	lockdep_assert_rq_held(rq);
+
+	put_prev_task(rq, rq->donor);
+	rq_set_donor(rq, rq->idle);
+	set_next_task(rq, rq->idle);
+
+	WARN_ON(p == rq->curr);
+
+	set_blocked_on_waking(p);
+	get_task_struct(p);
+	block_task(rq, p, 0);
+
+	zap_balance_callbacks(rq);
+	rq_unpin_lock(rq, rf);
+	raw_spin_rq_unlock(rq);
+
+	wake_up_process(p);
+	put_task_struct(p);
+
+	raw_spin_rq_lock(rq);
+	rq_repin_lock(rq, rf);
+}
+
+static inline bool proxy_can_run_here(struct rq *rq, struct task_struct *p)
+{
+	if (p == rq->curr || p->wake_cpu == cpu_of(rq))
+		return true;
+	return false;
 }
 
 /*
@@ -6688,9 +6845,11 @@ static struct task_struct *
 find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 {
 	struct task_struct *owner = NULL;
+	bool curr_in_chain = false;
 	int this_cpu = cpu_of(rq);
 	struct task_struct *p;
 	struct mutex *mutex;
+	int owner_cpu;
 
 	/* Follow blocked_on chain. */
 	for (p = donor; task_is_blocked(p); p = owner) {
@@ -6716,6 +6875,10 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 			return NULL;
 		}
 
+		/* Double check blocked_on_state now we're holding the lock */
+		if (p->blocked_on_state == BO_RUNNABLE)
+			return p;
+
 		/*
 		 * If a ww_mutex hits the die/wound case, it marks the task as
 		 * BO_WAKING and calls try_to_wake_up(), so that the mutex
@@ -6731,26 +6894,46 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 		 * try_to_wake_up from completing and doing the return
 		 * migration.
 		 *
-		 * So when we hit a !BO_BLOCKED task briefly schedule idle
-		 * so we release the rq and let the wakeup complete.
+		 * So when we hit a BO_WAKING task try to wake it up ourselves.
 		 */
-		if (p->blocked_on_state != BO_BLOCKED)
-			return proxy_resched_idle(rq);
+		if (p->blocked_on_state == BO_WAKING) {
+			if (task_current(rq, p)) {
+				/* If its current just set it runnable */
+				__force_blocked_on_runnable(p);
+				return p;
+			}
+			goto needs_return;
+		}
+
+		if (task_current(rq, p))
+			curr_in_chain = true;
 
 		owner = __mutex_owner(mutex);
 		if (!owner) {
+			/* If the owner is null, we may have some work to do */
+			if (!proxy_can_run_here(rq, p))
+				goto needs_return;
+
 			__force_blocked_on_runnable(p);
 			return p;
 		}
 
 		if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
 			/* XXX Don't handle blocked owners/delayed dequeue yet */
+			if (curr_in_chain)
+				return proxy_resched_idle(rq);
 			goto deactivate_donor;
 		}
 
-		if (task_cpu(owner) != this_cpu) {
-			/* XXX Don't handle migrations yet */
-			goto deactivate_donor;
+		owner_cpu = task_cpu(owner);
+		if (owner_cpu != this_cpu) {
+			/*
+			 * @owner can disappear, simply migrate to @owner_cpu
+			 * and leave that CPU to sort things out.
+			 */
+			if (curr_in_chain)
+				return proxy_resched_idle(rq);
+			goto migrate;
 		}
 
 		if (task_on_rq_migrating(owner)) {
@@ -6817,8 +7000,18 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 	 * blocked_lock released, so we have to get out of the
 	 * guard() scope.
 	 */
+migrate:
+	proxy_migrate_task(rq, rf, p, owner_cpu);
+	return NULL;
+needs_return:
+	proxy_force_return(rq, rf, p);
+	return NULL;
 deactivate_donor:
-	return proxy_deactivate(rq, donor);
+	if (!proxy_deactivate(rq, donor)) {
+		p = donor;
+		goto needs_return;
+	}
+	return NULL;
 }
 #else /* SCHED_PROXY_EXEC */
 static struct task_struct *
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b173a059315c2..cc531eb939831 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8781,7 +8781,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 	se = &p->se;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-	if (prev->sched_class != &fair_sched_class)
+	if (prev->sched_class != &fair_sched_class ||
+	    rq->curr != rq->donor)
 		goto simple;
 
 	__put_prev_set_next_dl_server(rq, prev, p);
-- 
2.51.0.338.gd7d06c2dae-goog


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RESEND][PATCH v21 5/6] sched: Add blocked_donor link to task for smarter mutex handoffs
  2025-09-04  0:21 [RESEND][PATCH v21 0/6] Donor Migration for Proxy Execution (v21) John Stultz
                   ` (3 preceding siblings ...)
  2025-09-04  0:21 ` [RESEND][PATCH v21 4/6] sched: Handle blocked-waiter migration (and return migration) John Stultz
@ 2025-09-04  0:21 ` John Stultz
  2025-09-04  0:21 ` [RESEND][PATCH v21 6/6] sched: Migrate whole chain in proxy_migrate_task() John Stultz
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: John Stultz @ 2025-09-04  0:21 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Juri Lelli, Valentin Schneider,
	Connor O'Brien, John Stultz, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

From: Peter Zijlstra <peterz@infradead.org>

Add link to the task this task is proxying for, and use it so
the mutex owner can do an intelligent hand-off of the mutex to
the task that the owner is running on behalf.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
[jstultz: This patch was split out from larger proxy patch]
Signed-off-by: John Stultz <jstultz@google.com>
---
v5:
* Split out from larger proxy patch
v6:
* Moved proxied value from earlier patch to this one where it
  is actually used
* Rework logic to check sched_proxy_exec() instead of using ifdefs
* Moved comment change to this patch where it makes sense
v7:
* Use more descriptive term then "us" in comments, as suggested
  by Metin Kaya.
* Minor typo fixup from Metin Kaya
* Reworked proxied variable to prev_not_proxied to simplify usage
v8:
* Use helper for donor blocked_on_state transition
v9:
* Re-add mutex lock handoff in the unlock path, but only when we
  have a blocked donor
* Slight reword of commit message suggested by Metin
v18:
* Add task_init initialization for blocked_donor, suggested by
  Suleiman

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 include/linux/sched.h  |  1 +
 init/init_task.c       |  1 +
 kernel/fork.c          |  1 +
 kernel/locking/mutex.c | 41 ++++++++++++++++++++++++++++++++++++++---
 kernel/sched/core.c    | 18 ++++++++++++++++--
 5 files changed, 57 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5801de1a44a79..ab12eb738c440 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1242,6 +1242,7 @@ struct task_struct {
 
 	enum blocked_on_state		blocked_on_state;
 	struct mutex			*blocked_on;	/* lock we're blocked on */
+	struct task_struct		*blocked_donor;	/* task that is boosting this task */
 	raw_spinlock_t			blocked_lock;
 
 #ifdef CONFIG_DETECT_HUNG_TASK_BLOCKER
diff --git a/init/init_task.c b/init/init_task.c
index 6d72ec23410a6..627bbd8953e88 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -175,6 +175,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 						 &init_task.alloc_lock),
 #endif
 	.blocked_on_state = BO_RUNNABLE,
+	.blocked_donor = NULL,
 #ifdef CONFIG_RT_MUTEXES
 	.pi_waiters	= RB_ROOT_CACHED,
 	.pi_top_task	= NULL,
diff --git a/kernel/fork.c b/kernel/fork.c
index 4bd0731995e86..86fe43ee35952 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2131,6 +2131,7 @@ __latent_entropy struct task_struct *copy_process(
 
 	p->blocked_on_state = BO_RUNNABLE;
 	p->blocked_on = NULL; /* not blocked yet */
+	p->blocked_donor = NULL; /* nobody is boosting p yet */
 
 #ifdef CONFIG_BCACHE
 	p->sequential_io	= 0;
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 42e4d2e6e4ad4..76cba3580fce7 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -921,7 +921,7 @@ EXPORT_SYMBOL_GPL(ww_mutex_lock_interruptible);
  */
 static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigned long ip)
 {
-	struct task_struct *next = NULL;
+	struct task_struct *donor, *next = NULL;
 	DEFINE_WAKE_Q(wake_q);
 	unsigned long owner;
 	unsigned long flags;
@@ -940,6 +940,12 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 		MUTEX_WARN_ON(__owner_task(owner) != current);
 		MUTEX_WARN_ON(owner & MUTEX_FLAG_PICKUP);
 
+		if (sched_proxy_exec() && current->blocked_donor) {
+			/* force handoff if we have a blocked_donor */
+			owner = MUTEX_FLAG_HANDOFF;
+			break;
+		}
+
 		if (owner & MUTEX_FLAG_HANDOFF)
 			break;
 
@@ -953,7 +959,34 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 
 	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	debug_mutex_unlock(lock);
-	if (!list_empty(&lock->wait_list)) {
+
+	if (sched_proxy_exec()) {
+		raw_spin_lock(&current->blocked_lock);
+		/*
+		 * If we have a task boosting current, and that task was boosting
+		 * current through this lock, hand the lock to that task, as that
+		 * is the highest waiter, as selected by the scheduling function.
+		 */
+		donor = current->blocked_donor;
+		if (donor) {
+			struct mutex *next_lock;
+
+			raw_spin_lock_nested(&donor->blocked_lock, SINGLE_DEPTH_NESTING);
+			next_lock = __get_task_blocked_on(donor);
+			if (next_lock == lock) {
+				next = donor;
+				__set_blocked_on_waking(donor);
+				wake_q_add(&wake_q, donor);
+				current->blocked_donor = NULL;
+			}
+			raw_spin_unlock(&donor->blocked_lock);
+		}
+	}
+
+	/*
+	 * Failing that, pick any on the wait list.
+	 */
+	if (!next && !list_empty(&lock->wait_list)) {
 		/* get the first entry from the wait-list: */
 		struct mutex_waiter *waiter =
 			list_first_entry(&lock->wait_list,
@@ -961,7 +994,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 
 		next = waiter->task;
 
-		raw_spin_lock(&next->blocked_lock);
+		raw_spin_lock_nested(&next->blocked_lock, SINGLE_DEPTH_NESTING);
 		debug_mutex_wake_waiter(lock, waiter);
 		WARN_ON_ONCE(__get_task_blocked_on(next) != lock);
 		__set_blocked_on_waking(next);
@@ -972,6 +1005,8 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 	if (owner & MUTEX_FLAG_HANDOFF)
 		__mutex_handoff(lock, next);
 
+	if (sched_proxy_exec())
+		raw_spin_unlock(&current->blocked_lock);
 	raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q);
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0f824446c6046..cac03f68cbcce 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6829,7 +6829,17 @@ static inline bool proxy_can_run_here(struct rq *rq, struct task_struct *p)
  * Find runnable lock owner to proxy for mutex blocked donor
  *
  * Follow the blocked-on relation:
- *   task->blocked_on -> mutex->owner -> task...
+ *
+ *                ,-> task
+ *                |     | blocked-on
+ *                |     v
+ *  blocked_donor |   mutex
+ *                |     | owner
+ *                |     v
+ *                `-- task
+ *
+ * and set the blocked_donor relation, this latter is used by the mutex
+ * code to find which (blocked) task to hand-off to.
  *
  * Lock order:
  *
@@ -6989,6 +6999,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 		 * rq, therefore holding @rq->lock is sufficient to
 		 * guarantee its existence, as per ttwu_remote().
 		 */
+		owner->blocked_donor = p;
 	}
 
 	WARN_ON_ONCE(owner && !owner->on_rq);
@@ -7091,6 +7102,7 @@ static void __sched notrace __schedule(int sched_mode)
 	unsigned long prev_state;
 	struct rq_flags rf;
 	struct rq *rq;
+	bool prev_not_proxied;
 	int cpu;
 
 	/* Trace preemptions consistently with task switches */
@@ -7163,9 +7175,11 @@ static void __sched notrace __schedule(int sched_mode)
 		switch_count = &prev->nvcsw;
 	}
 
+	prev_not_proxied = !prev->blocked_donor;
 pick_again:
 	next = pick_next_task(rq, rq->donor, &rf);
 	rq_set_donor(rq, next);
+	next->blocked_donor = NULL;
 	if (unlikely(task_is_blocked(next))) {
 		next = find_proxy_task(rq, next, &rf);
 		if (!next) {
@@ -7229,7 +7243,7 @@ static void __sched notrace __schedule(int sched_mode)
 		rq = context_switch(rq, prev, next, &rf);
 	} else {
 		/* In case next was already curr but just got blocked_donor */
-		if (!task_current_donor(rq, next))
+		if (prev_not_proxied && next->blocked_donor)
 			proxy_tag_curr(rq, next);
 
 		rq_unpin_lock(rq, &rf);
-- 
2.51.0.338.gd7d06c2dae-goog


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RESEND][PATCH v21 6/6] sched: Migrate whole chain in proxy_migrate_task()
  2025-09-04  0:21 [RESEND][PATCH v21 0/6] Donor Migration for Proxy Execution (v21) John Stultz
                   ` (4 preceding siblings ...)
  2025-09-04  0:21 ` [RESEND][PATCH v21 5/6] sched: Add blocked_donor link to task for smarter mutex handoffs John Stultz
@ 2025-09-04  0:21 ` John Stultz
  2025-09-11 13:59 ` [RESEND][PATCH v21 0/6] Donor Migration for Proxy Execution (v21) Juri Lelli
  2025-09-16  3:19 ` K Prateek Nayak
  7 siblings, 0 replies; 26+ messages in thread
From: John Stultz @ 2025-09-04  0:21 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

Instead of migrating one task each time through find_proxy_task(),
we can walk up the blocked_donor ptrs and migrate the entire
current chain in one go.

This was broken out of earlier patches and held back while the
series was being stabilized, but I wanted to re-introduce it.

Signed-off-by: John Stultz <jstultz@google.com>
---
v12:
* Earlier this was re-using blocked_node, but I hit
  a race with activating blocked entities, and to
  avoid it introduced a new migration_node listhead
v18:
* Add init_task initialization of migration_node as suggested
  by Suleiman

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 include/linux/sched.h |  1 +
 init/init_task.c      |  1 +
 kernel/fork.c         |  1 +
 kernel/sched/core.c   | 25 +++++++++++++++++--------
 4 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ab12eb738c440..176ec117f4041 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1243,6 +1243,7 @@ struct task_struct {
 	enum blocked_on_state		blocked_on_state;
 	struct mutex			*blocked_on;	/* lock we're blocked on */
 	struct task_struct		*blocked_donor;	/* task that is boosting this task */
+	struct list_head		migration_node;
 	raw_spinlock_t			blocked_lock;
 
 #ifdef CONFIG_DETECT_HUNG_TASK_BLOCKER
diff --git a/init/init_task.c b/init/init_task.c
index 627bbd8953e88..65e0f90285966 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -176,6 +176,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 #endif
 	.blocked_on_state = BO_RUNNABLE,
 	.blocked_donor = NULL,
+	.migration_node = LIST_HEAD_INIT(init_task.migration_node),
 #ifdef CONFIG_RT_MUTEXES
 	.pi_waiters	= RB_ROOT_CACHED,
 	.pi_top_task	= NULL,
diff --git a/kernel/fork.c b/kernel/fork.c
index 86fe43ee35952..01fd08c463871 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2132,6 +2132,7 @@ __latent_entropy struct task_struct *copy_process(
 	p->blocked_on_state = BO_RUNNABLE;
 	p->blocked_on = NULL; /* not blocked yet */
 	p->blocked_donor = NULL; /* nobody is boosting p yet */
+	INIT_LIST_HEAD(&p->migration_node);
 
 #ifdef CONFIG_BCACHE
 	p->sequential_io	= 0;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cac03f68cbcce..26f7a11a39e0e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6747,6 +6747,7 @@ static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
 			       struct task_struct *p, int target_cpu)
 {
 	struct rq *target_rq = cpu_rq(target_cpu);
+	LIST_HEAD(migrate_list);
 
 	lockdep_assert_rq_held(rq);
 
@@ -6774,19 +6775,27 @@ static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
 	rq_set_donor(rq, rq->idle);
 	set_next_task(rq, rq->idle);
 
-	WARN_ON(p == rq->curr);
-
-	deactivate_task(rq, p, 0);
-	proxy_set_task_cpu(p, target_cpu);
-
+	for (; p; p = p->blocked_donor) {
+		WARN_ON(p == rq->curr);
+		deactivate_task(rq, p, 0);
+		proxy_set_task_cpu(p, target_cpu);
+		/*
+		 * We can abuse blocked_node to migrate the thing,
+		 * because @p was still on the rq.
+		 */
+		list_add(&p->migration_node, &migrate_list);
+	}
 	zap_balance_callbacks(rq);
 	rq_unpin_lock(rq, rf);
 	raw_spin_rq_unlock(rq);
 	raw_spin_rq_lock(target_rq);
+	while (!list_empty(&migrate_list)) {
+		p = list_first_entry(&migrate_list, struct task_struct, migration_node);
+		list_del_init(&p->migration_node);
 
-	activate_task(target_rq, p, 0);
-	wakeup_preempt(target_rq, p, 0);
-
+		activate_task(target_rq, p, 0);
+		wakeup_preempt(target_rq, p, 0);
+	}
 	raw_spin_rq_unlock(target_rq);
 	raw_spin_rq_lock(rq);
 	rq_repin_lock(rq, rf);
-- 
2.51.0.338.gd7d06c2dae-goog


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [RESEND][PATCH v21 0/6] Donor Migration for Proxy Execution (v21)
  2025-09-04  0:21 [RESEND][PATCH v21 0/6] Donor Migration for Proxy Execution (v21) John Stultz
                   ` (5 preceding siblings ...)
  2025-09-04  0:21 ` [RESEND][PATCH v21 6/6] sched: Migrate whole chain in proxy_migrate_task() John Stultz
@ 2025-09-11 13:59 ` Juri Lelli
  2025-09-11 23:21   ` John Stultz
  2025-09-16  3:19 ` K Prateek Nayak
  7 siblings, 1 reply; 26+ messages in thread
From: Juri Lelli @ 2025-09-11 13:59 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team,
	Luca Abeni, Tommaso Cucinotta, yurand2000

Hi John,

On 04/09/25 00:21, John Stultz wrote:
> Hey All,
> 
> I didn't get any feedback on the last iteration, so I wanted to
> resend this next chunk of the series: Donor Migration 

Not ignoring you, but I had to spend some time putting together some
testing infra I am not trying to use to see how DEADLINE behaves with
the series, as it's somewhat difficult for me to think in abstract about
all this. :)

> Also you can find the full proxy-exec series here:
>   https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v21-6.17-rc4/
>   https://github.com/johnstultz-work/linux-dev.git proxy-exec-v21-6.17-rc4
> 

> I’d really appreciate any feedback or review thoughts on the
> full series as well.

I current have the following on top of your complete series

https://github.com/jlelli/linux/commits/experimental/eval-mbwi/
https://github.com/jlelli/linux experimental/eval-mbwi

of which

https://github.com/jlelli/linux/commit/9d4bbb1aca624e76e5b34938d848dc9a418c6146

introduces the testing (M-BWI is Multiprocessor Bandwidth Inheritance)
infra and the rest some additional tracepoints (based on Gabriele's
patch) to get more DEADLINE info out of tests (in conjuction with
sched_tp [1]).

Nothing bit to report just yet, mainly spent time getting this working.

One thing I noticed thouh (and probably forgot from previous
discussions) is that spin_on_owner might be 'confusing' from an
RT/DEADLINE perspective as it deviates from what one expects from the
ideal theoretical world (as tasks don't immediately block and
potentially donate). Not sure what to do about it. Maybe special case it
for RT/DEADLINE, but just started playing with it.

Anyway, I will keep playing with all this. Just wanted to give
you/others a quick update. Also adding Luca, Tommaso and Yuri to the
thread so that they are aware of the testing framework. :)

Thanks!
Juri

1 - https://github.com/jlelli/sched_tp/tree/deadline-tp
    will open an MR against Qais' mainline repo as soon as TPs are
    hopefully merged upstream as well.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RESEND][PATCH v21 0/6] Donor Migration for Proxy Execution (v21)
  2025-09-11 13:59 ` [RESEND][PATCH v21 0/6] Donor Migration for Proxy Execution (v21) Juri Lelli
@ 2025-09-11 23:21   ` John Stultz
  2025-09-12  8:14     ` Juri Lelli
  0 siblings, 1 reply; 26+ messages in thread
From: John Stultz @ 2025-09-11 23:21 UTC (permalink / raw)
  To: Juri Lelli
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team,
	Luca Abeni, Tommaso Cucinotta, yurand2000

On Thu, Sep 11, 2025 at 6:59 AM Juri Lelli <juri.lelli@redhat.com> wrote:
> On 04/09/25 00:21, John Stultz wrote:
> > I’d really appreciate any feedback or review thoughts on the
> > full series as well.
>
> I current have the following on top of your complete series
>
> https://github.com/jlelli/linux/commits/experimental/eval-mbwi/
> https://github.com/jlelli/linux experimental/eval-mbwi
>
> of which
>
> https://github.com/jlelli/linux/commit/9d4bbb1aca624e76e5b34938d848dc9a418c6146
>
> introduces the testing (M-BWI is Multiprocessor Bandwidth Inheritance)
> infra and the rest some additional tracepoints (based on Gabriele's
> patch) to get more DEADLINE info out of tests (in conjuction with
> sched_tp [1]).
>
> Nothing bit to report just yet, mainly spent time getting this working.

Very cool to see! I'll have to pull those and take a look at it!

And I'm of course very interested to hear if you find anything with
the proxy set that I need to revise.

> One thing I noticed thouh (and probably forgot from previous
> discussions) is that spin_on_owner might be 'confusing' from an
> RT/DEADLINE perspective as it deviates from what one expects from the
> ideal theoretical world (as tasks don't immediately block and
> potentially donate). Not sure what to do about it. Maybe special case it
> for RT/DEADLINE, but just started playing with it.

Can you refresh me a bit on why blocking to donate is preferred? If
the lock owner is running, there's not much that blocking to donate
would help with. Does this concern not apply to the current mutex
logic without proxy?  With proxy-exec, I'm trying to preserve the
existing mutex behavior of spin_on_owner, with the main tweak is just
the lock handoff to the current donor when we are proxying, otherwise
the intent is it should be the same.

Now, I do recognize that rt_mutexes and mutexes do have different lock
handoff requirements for RT tasks (needs to strictly go to the highest
priority waiter, and we can't let a lower priority task steal it),
which is why I've not yet enabled proxy-exec on rt_mutexes.

> Anyway, I will keep playing with all this. Just wanted to give
> you/others a quick update. Also adding Luca, Tommaso and Yuri to the
> thread so that they are aware of the testing framework. :)

Thanks so much for sharing!
-john

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RESEND][PATCH v21 1/6] locking: Add task::blocked_lock to serialize blocked_on state
  2025-09-04  0:21 ` [RESEND][PATCH v21 1/6] locking: Add task::blocked_lock to serialize blocked_on state John Stultz
@ 2025-09-12  5:50   ` K Prateek Nayak
  2025-09-18 19:58     ` John Stultz
  0 siblings, 1 reply; 26+ messages in thread
From: K Prateek Nayak @ 2025-09-12  5:50 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello John,

On 9/4/2025 5:51 AM, John Stultz wrote:
> @@ -693,25 +697,30 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>  			break;
>  
>  		if (first) {
> -			trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
> +			bool opt_acquired;
> +
>  			/*
>  			 * mutex_optimistic_spin() can call schedule(), so
> -			 * clear blocked on so we don't become unselectable
> +			 * we need to release these locks before calling it,
> +			 * and clear blocked on so we don't become unselectable
>  			 * to run.
>  			 */
> -			clear_task_blocked_on(current, lock);
> -			if (mutex_optimistic_spin(lock, ww_ctx, &waiter))
> +			__clear_task_blocked_on(current, lock);
> +			raw_spin_unlock(&current->blocked_lock);
> +			raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
> +			trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
> +			opt_acquired = mutex_optimistic_spin(lock, ww_ctx, &waiter);
> +			raw_spin_lock_irqsave(&lock->wait_lock, flags);
> +			raw_spin_lock(&current->blocked_lock);
> +			__set_task_blocked_on(current, lock);
> +			if (opt_acquired)
>  				break;

nit. Is there any reason for setting the blocked state before the break
other than for symmetry? The blocked_lock is held and we'll soon clear
it after the break anyways so does it have any other subtle purpose?

Also super silly but that above block is too dense. Can we add some
spaces in between, groping the relevant ops, to make it easier on the
eyes :)

Apart from those nit picks, I don't see anything out of the place.
Feel free to include:

Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>

I'm still lagging behind on testing (Sorry about that!) but I should
have the results for this by Monday.

-- 
Thanks and Regards,
Prateek

> -			set_task_blocked_on(current, lock);
>  			trace_contention_begin(lock, LCB_F_MUTEX);
>  		}
> -
> -		raw_spin_lock_irqsave(&lock->wait_lock, flags);
>  	}
> -	raw_spin_lock_irqsave(&lock->wait_lock, flags);
> -acquired:
>  	__clear_task_blocked_on(current, lock);
>  	__set_current_state(TASK_RUNNING);
> +	raw_spin_unlock(&current->blocked_lock);
>  
>  	if (ww_ctx) {
>  		/*


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RESEND][PATCH v21 0/6] Donor Migration for Proxy Execution (v21)
  2025-09-11 23:21   ` John Stultz
@ 2025-09-12  8:14     ` Juri Lelli
  0 siblings, 0 replies; 26+ messages in thread
From: Juri Lelli @ 2025-09-12  8:14 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team,
	Luca Abeni, Tommaso Cucinotta, yurand2000

On 11/09/25 16:21, John Stultz wrote:
> On Thu, Sep 11, 2025 at 6:59 AM Juri Lelli <juri.lelli@redhat.com> wrote:
> > On 04/09/25 00:21, John Stultz wrote:
> > > I’d really appreciate any feedback or review thoughts on the
> > > full series as well.
> >
> > I current have the following on top of your complete series
> >
> > https://github.com/jlelli/linux/commits/experimental/eval-mbwi/
> > https://github.com/jlelli/linux experimental/eval-mbwi
> >
> > of which
> >
> > https://github.com/jlelli/linux/commit/9d4bbb1aca624e76e5b34938d848dc9a418c6146
> >
> > introduces the testing (M-BWI is Multiprocessor Bandwidth Inheritance)
> > infra and the rest some additional tracepoints (based on Gabriele's
> > patch) to get more DEADLINE info out of tests (in conjuction with
> > sched_tp [1]).
> >
> > Nothing bit to report just yet, mainly spent time getting this working.
> 
> Very cool to see! I'll have to pull those and take a look at it!
> 
> And I'm of course very interested to hear if you find anything with
> the proxy set that I need to revise.
> 
> > One thing I noticed thouh (and probably forgot from previous
> > discussions) is that spin_on_owner might be 'confusing' from an
> > RT/DEADLINE perspective as it deviates from what one expects from the
> > ideal theoretical world (as tasks don't immediately block and
> > potentially donate). Not sure what to do about it. Maybe special case it
> > for RT/DEADLINE, but just started playing with it.
> 
> Can you refresh me a bit on why blocking to donate is preferred? If
> the lock owner is running, there's not much that blocking to donate
> would help with. Does this concern not apply to the current mutex
> logic without proxy?  With proxy-exec, I'm trying to preserve the
> existing mutex behavior of spin_on_owner, with the main tweak is just
> the lock handoff to the current donor when we are proxying, otherwise
> the intent is it should be the same.

Yeah, I think we want to preserve that behavior for non-RT mutexes for
throughput, but for RT I fear we might risk priority inversion if tasks
spin (for a bit) before blocking. My understanding is that with PI
enabled futexes (apart from some initial tries to get the lock with
atomic ops) we then call into __rt_mutex_start_proxy_lock() which
enqueues the blocked tasks onto the pi chain (so that PI rules are
respected etc.). Guess maybe we could end-up reintroducing this behavior
when we eventually kill rt-mutexes, so don't worry to much about it yet
I think, just something to keep in mind. :)

> Now, I do recognize that rt_mutexes and mutexes do have different lock
> handoff requirements for RT tasks (needs to strictly go to the highest
> priority waiter, and we can't let a lower priority task steal it),
> which is why I've not yet enabled proxy-exec on rt_mutexes.

Right. I will probably hack something in to test the DEADLINE scenarios,
but again don't worry about it.

Thanks,
Juri


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RESEND][PATCH v21 2/6] sched/locking: Add blocked_on_state to provide necessary tri-state for proxy return-migration
  2025-09-04  0:21 ` [RESEND][PATCH v21 2/6] sched/locking: Add blocked_on_state to provide necessary tri-state for proxy return-migration John Stultz
@ 2025-09-15  8:05   ` K Prateek Nayak
  2025-09-18 22:57     ` John Stultz
  2025-09-15  9:03   ` K Prateek Nayak
  1 sibling, 1 reply; 26+ messages in thread
From: K Prateek Nayak @ 2025-09-15  8:05 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello John,

On 9/4/2025 5:51 AM, John Stultz wrote:
>  static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *m)
>  {
> +	/* The task should only be clearing itself */
> +	WARN_ON_ONCE(p != current);
>  	/* Currently we serialize blocked_on under the task::blocked_lock */
>  	lockdep_assert_held_once(&p->blocked_lock);
> -	/*
> -	 * There may be cases where we re-clear already cleared
> -	 * blocked_on relationships, but make sure we are not
> -	 * clearing the relationship with a different lock.
> -	 */
> -	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m);
> +	/* Make sure we are clearing the relationship with the right lock */
> +	WARN_ON_ONCE(m && p->blocked_on != m);
>  	p->blocked_on = NULL;
> +	p->blocked_on_state = BO_RUNNABLE;
>  }
>  

Maybe it is just me, but I got confused a couple of time only to
realize __clear_task_blocked_on() clears the "blocked_on" and sets
"blocked_on_state" to BO_RUNNABLE.

Can we decouple the two and only set "p->blocked_on" in
*_task_blocked_on_* and "p->blocked_on_state" in
*{set,clear,force}_blocked_on* functions so it becomes easier to
follow in the mutex path as:

    __set_task_blocked_on(p, mutex); // blocked on mutex
    __force_blocked_on_blocked(p); // blocked from running on CPU

   ...

    __clear_task_blocked_on(p, mutex); // p is no longer blocked on a mutex
    __set_blocked_on_runnable(p); // p is now runnable

>  static inline void clear_task_blocked_on(struct task_struct *p, struct mutex *m)

Of the three {set,clear}_task_blcoked_on() usage:

    $ grep -r "\(set\|clear\)_task_blocked_on" include/
    kernel/locking/mutex.c: __set_task_blocked_on(current, lock);
    kernel/locking/mutex.c: __clear_task_blocked_on(current, lock);
    kernel/locking/mutex.c: clear_task_blocked_on(current, lock);

two can be converted directly and perhaps clear_task_blocked_on() can be
renamed as clear_task_blocked_on_set_runnable()?

This is just me rambling on so feel free to ignore. I probably have to
train my mind enough to see __clear_task_blocked_on() not only clears
"blocked_on" but also sets task to runnable :)

[..snip..]

> @@ -6749,6 +6776,15 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>  
>  	WARN_ON_ONCE(owner && !owner->on_rq);
>  	return owner;
> +
> +	/*
> +	 * NOTE: This logic is down here, because we need to call
> +	 * the functions with the mutex wait_lock and task
> +	 * blocked_lock released, so we have to get out of the
> +	 * guard() scope.
> +	 */

I didn't know that was possible! Neat. Since cleanup.h has a note
reading:

    ... the expectation is that usage of "goto" and cleanup helpers is
    never mixed in the same function.

are there any concerns w.r.t. compiler versions etc. or am I just being
paranoid?

> +deactivate_donor:
> +	return proxy_deactivate(rq, donor);
>  }
>  #else /* SCHED_PROXY_EXEC */
>  static struct task_struct *
-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RESEND][PATCH v21 3/6] sched: Add logic to zap balance callbacks if we pick again
  2025-09-04  0:21 ` [RESEND][PATCH v21 3/6] sched: Add logic to zap balance callbacks if we pick again John Stultz
@ 2025-09-15  8:31   ` K Prateek Nayak
  2025-09-19 18:34     ` John Stultz
  0 siblings, 1 reply; 26+ messages in thread
From: K Prateek Nayak @ 2025-09-15  8:31 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello John,

On 9/4/2025 5:51 AM, John Stultz wrote:
> With proxy-exec, a task is selected to run via pick_next_task(),
> and then if it is a mutex blocked task, we call find_proxy_task()
> to find a runnable owner. If the runnable owner is on another
> cpu, we will need to migrate the selected donor task away, after
> which we will pick_again can call pick_next_task() to choose
> something else.
> 
> However, in the first call to pick_next_task(), we may have
> had a balance_callback setup by the class scheduler. After we
> pick again, its possible pick_next_task_fair() will be called
> which calls sched_balance_newidle() and sched_balance_rq().
> 
> This will throw a warning:
> [    8.796467] rq->balance_callback && rq->balance_callback != &balance_push_callback
> [    8.796467] WARNING: CPU: 32 PID: 458 at kernel/sched/sched.h:1750 sched_balance_rq+0xe92/0x1250
> ...
> [    8.796467] Call Trace:
> [    8.796467]  <TASK>
> [    8.796467]  ? __warn.cold+0xb2/0x14e
> [    8.796467]  ? sched_balance_rq+0xe92/0x1250
> [    8.796467]  ? report_bug+0x107/0x1a0
> [    8.796467]  ? handle_bug+0x54/0x90
> [    8.796467]  ? exc_invalid_op+0x17/0x70
> [    8.796467]  ? asm_exc_invalid_op+0x1a/0x20
> [    8.796467]  ? sched_balance_rq+0xe92/0x1250
> [    8.796467]  sched_balance_newidle+0x295/0x820
> [    8.796467]  pick_next_task_fair+0x51/0x3f0
> [    8.796467]  __schedule+0x23a/0x14b0
> [    8.796467]  ? lock_release+0x16d/0x2e0
> [    8.796467]  schedule+0x3d/0x150
> [    8.796467]  worker_thread+0xb5/0x350
> [    8.796467]  ? __pfx_worker_thread+0x10/0x10
> [    8.796467]  kthread+0xee/0x120
> [    8.796467]  ? __pfx_kthread+0x10/0x10
> [    8.796467]  ret_from_fork+0x31/0x50
> [    8.796467]  ? __pfx_kthread+0x10/0x10
> [    8.796467]  ret_from_fork_asm+0x1a/0x30
> [    8.796467]  </TASK>
> 
> This is because if a RT task was originally picked, it will
> setup the rq->balance_callback with push_rt_tasks() via
> set_next_task_rt().
> 
> Once the task is migrated away and we pick again, we haven't
> processed any balance callbacks, so rq->balance_callback is not
> in the same state as it was the first time pick_next_task was
> called.
> 
> To handle this, add a zap_balance_callbacks() helper function
> which cleans up the blance callbacks without running them. This

s/blance/balance/

> should be ok, as we are effectively undoing the state set in
> the first call to pick_next_task(), and when we pick again,
> the new callback can be configured for the donor task actually
> selected.
> 
> Signed-off-by: John Stultz <jstultz@google.com>
> ---
> v20:
> * Tweaked to avoid build issues with different configs
> 
> Cc: Joel Fernandes <joelagnelf@nvidia.com>
> Cc: Qais Yousef <qyousef@layalina.io>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Zimuzo Ezeozue <zezeozue@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Metin Kaya <Metin.Kaya@arm.com>
> Cc: Xuewen Yan <xuewen.yan94@gmail.com>
> Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> Cc: Suleiman Souhlal <suleiman@google.com>
> Cc: kuyo chang <kuyo.chang@mediatek.com>
> Cc: hupu <hupu.gm@gmail.com>
> Cc: kernel-team@android.com
> ---
>  kernel/sched/core.c | 39 ++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 38 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e0007660161fa..01bf5ef8d9fcc 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5001,6 +5001,40 @@ static inline void finish_task(struct task_struct *prev)
>  	smp_store_release(&prev->on_cpu, 0);
>  }
>  
> +#if defined(CONFIG_SCHED_PROXY_EXEC)

nit. This can be an "#ifdef CONFIG_SCHED_PROXY_EXEC" now.

> +/*
> + * Only called from __schedule context
> + *
> + * There are some cases where we are going to re-do the action
> + * that added the balance callbacks. We may not be in a state
> + * where we can run them, so just zap them so they can be
> + * properly re-added on the next time around. This is similar
> + * handling to running the callbacks, except we just don't call
> + * them.
> + */
> +static void zap_balance_callbacks(struct rq *rq)
> +{
> +	struct balance_callback *next, *head;
> +	bool found = false;
> +
> +	lockdep_assert_rq_held(rq);
> +
> +	head = rq->balance_callback;
> +	while (head) {
> +		if (head == &balance_push_callback)
> +			found = true;
> +		next = head->next;
> +		head->next = NULL;
> +		head = next;
> +	}
> +	rq->balance_callback = found ? &balance_push_callback : NULL;
> +}
> +#else
> +static inline void zap_balance_callbacks(struct rq *rq)
> +{
> +}

nit.

This can perhaps be reduced to a single line in light of Thomas' recent
work to condense the stubs elsewhere:
https://lore.kernel.org/lkml/20250908212925.389031537@linutronix.de/

> +#endif
> +
>  static void do_balance_callbacks(struct rq *rq, struct balance_callback *head)
>  {
>  	void (*func)(struct rq *rq);
> @@ -6941,8 +6975,11 @@ static void __sched notrace __schedule(int sched_mode)
>  	rq_set_donor(rq, next);
>  	if (unlikely(task_is_blocked(next))) {
>  		next = find_proxy_task(rq, next, &rf);
> -		if (!next)
> +		if (!next) {
> +			/* zap the balance_callbacks before picking again */
> +			zap_balance_callbacks(rq);
>  			goto pick_again;
> +		}
>  		if (next == rq->idle)
>  			goto keep_resched;

Should we zap the callbacks if we are planning to go through schedule()
again via rq->idle since it essentially voids the last pick too?

>  	}

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RESEND][PATCH v21 2/6] sched/locking: Add blocked_on_state to provide necessary tri-state for proxy return-migration
  2025-09-04  0:21 ` [RESEND][PATCH v21 2/6] sched/locking: Add blocked_on_state to provide necessary tri-state for proxy return-migration John Stultz
  2025-09-15  8:05   ` K Prateek Nayak
@ 2025-09-15  9:03   ` K Prateek Nayak
  2025-09-18 23:07     ` John Stultz
  1 sibling, 1 reply; 26+ messages in thread
From: K Prateek Nayak @ 2025-09-15  9:03 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello John,

On 9/4/2025 5:51 AM, John Stultz wrote:
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -815,6 +815,12 @@ struct kmap_ctrl {
>  #endif
>  };
>  
> +enum blocked_on_state {
> +	BO_RUNNABLE,
> +	BO_BLOCKED,
> +	BO_WAKING,
> +};
> +
>  struct task_struct {
>  #ifdef CONFIG_THREAD_INFO_IN_TASK
>  	/*
> @@ -1234,6 +1240,7 @@ struct task_struct {
>  	struct rt_mutex_waiter		*pi_blocked_on;
>  #endif
>  
> +	enum blocked_on_state		blocked_on_state;

Is there any use of the "blocked_on_state" outside of CONFIG_PROXY_EXEC?
If not, should we start thinking about putting the proxy exec specific
members behind CONFIG_PROXY_EXEC to avoid bloating the task_struct?

-- 
Thanks and Regards,
Prateek

>  	struct mutex			*blocked_on;	/* lock we're blocked on */
>  	raw_spinlock_t			blocked_lock;
>  

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RESEND][PATCH v21 4/6] sched: Handle blocked-waiter migration (and return migration)
  2025-09-04  0:21 ` [RESEND][PATCH v21 4/6] sched: Handle blocked-waiter migration (and return migration) John Stultz
@ 2025-09-15 10:06   ` K Prateek Nayak
  2025-09-20  2:38     ` John Stultz
  0 siblings, 1 reply; 26+ messages in thread
From: K Prateek Nayak @ 2025-09-15 10:06 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello John,

On 9/4/2025 5:51 AM, John Stultz wrote:
> @@ -6655,17 +6731,98 @@ static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
>  	return try_to_block_task(rq, donor, &state, true);
>  }
>  
> -static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *donor)
> +/*
> + * If the blocked-on relationship crosses CPUs, migrate @p to the
> + * owner's CPU.
> + *
> + * This is because we must respect the CPU affinity of execution
> + * contexts (owner) but we can ignore affinity for scheduling
> + * contexts (@p). So we have to move scheduling contexts towards
> + * potential execution contexts.
> + *
> + * Note: The owner can disappear, but simply migrate to @target_cpu
> + * and leave that CPU to sort things out.
> + */
> +static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
> +			       struct task_struct *p, int target_cpu)
>  {
> -	if (!__proxy_deactivate(rq, donor)) {
> -		/*
> -		 * XXX: For now, if deactivation failed, set donor
> -		 * as unblocked, as we aren't doing proxy-migrations
> -		 * yet (more logic will be needed then).
> -		 */
> -		force_blocked_on_runnable(donor);
> -	}
> -	return NULL;
> +	struct rq *target_rq = cpu_rq(target_cpu);
> +
> +	lockdep_assert_rq_held(rq);
> +
> +	/*
> +	 * Since we're going to drop @rq, we have to put(@rq->donor) first,
> +	 * otherwise we have a reference that no longer belongs to us.
> +	 *
> +	 * Additionally, as we put_prev_task(prev) earlier, its possible that
> +	 * prev will migrate away as soon as we drop the rq lock, however we
> +	 * still have it marked as rq->curr, as we've not yet switched tasks.
> +	 *
> +	 * After the migration, we are going to pick_again in the __schedule
> +	 * logic, so backtrack a bit before we release the lock:
> +	 * Put rq->donor, and set rq->curr as rq->donor and set_next_task,
> +	 * so that we're close to the situation we had entering __schedule
> +	 * the first time.
> +	 *
> +	 * Then when we re-aquire the lock, we will re-put rq->curr then
> +	 * rq_set_donor(rq->idle) and set_next_task(rq->idle), before
> +	 * picking again.
> +	 */
> +	/* XXX - Added to address problems with changed dl_server semantics - double check */
> +	__put_prev_set_next_dl_server(rq, rq->donor, rq->curr);

Given we are tagging the rq->dl_server to the donor's context, should we
do:

    __put_prev_set_next_dl_server(rq, rq->donor, rq->idle);

... since we are setting rq->idle as next task and the donor?

On a side note, this can perhaps be simplified as:

    put_prev_set_next_task(rq, rq->donor, rq->idle);
    rq_set_donor(rq, rq->idle);

put_prev_set_next_task() will take are of the dl_server handoff too.

> +	put_prev_task(rq, rq->donor);
> +	rq_set_donor(rq, rq->idle);
> +	set_next_task(rq, rq->idle);
> +
> +	WARN_ON(p == rq->curr);
> +
> +	deactivate_task(rq, p, 0);
> +	proxy_set_task_cpu(p, target_cpu);
> +
> +	zap_balance_callbacks(rq);

Is this zap necessary? Given we return NULL from find_proxy_task() for
migrate case, __schedule() would zap the callback soon. I don't see
any WARN_ON() for the balance callbacks in the unpin + repin path so
this might not be necessary or am I mistaken?

> +	rq_unpin_lock(rq, rf);
> +	raw_spin_rq_unlock(rq);
> +	raw_spin_rq_lock(target_rq);
> +
> +	activate_task(target_rq, p, 0);
> +	wakeup_preempt(target_rq, p, 0);
> +
> +	raw_spin_rq_unlock(target_rq);
> +	raw_spin_rq_lock(rq);
> +	rq_repin_lock(rq, rf);
> +}
> +
> +static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
> +			       struct task_struct *p)
> +{
> +	lockdep_assert_rq_held(rq);
> +
> +	put_prev_task(rq, rq->donor);
> +	rq_set_donor(rq, rq->idle);
> +	set_next_task(rq, rq->idle);

Similar set of comments as above.

> +
> +	WARN_ON(p == rq->curr);
> +
> +	set_blocked_on_waking(p);
> +	get_task_struct(p);
> +	block_task(rq, p, 0);
> +
> +	zap_balance_callbacks(rq);
> +	rq_unpin_lock(rq, rf);
> +	raw_spin_rq_unlock(rq);
> +
> +	wake_up_process(p);
> +	put_task_struct(p);
> +
> +	raw_spin_rq_lock(rq);
> +	rq_repin_lock(rq, rf);
> +}
> +
> +static inline bool proxy_can_run_here(struct rq *rq, struct task_struct *p)
> +{
> +	if (p == rq->curr || p->wake_cpu == cpu_of(rq))
> +		return true;
> +	return false;
>  }
>  
>  /*
> @@ -6688,9 +6845,11 @@ static struct task_struct *
>  find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>  {
>  	struct task_struct *owner = NULL;
> +	bool curr_in_chain = false;
>  	int this_cpu = cpu_of(rq);
>  	struct task_struct *p;
>  	struct mutex *mutex;
> +	int owner_cpu;
>  
>  	/* Follow blocked_on chain. */
>  	for (p = donor; task_is_blocked(p); p = owner) {
> @@ -6716,6 +6875,10 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>  			return NULL;
>  		}
>  
> +		/* Double check blocked_on_state now we're holding the lock */
> +		if (p->blocked_on_state == BO_RUNNABLE)
> +			return p;
> +
>  		/*
>  		 * If a ww_mutex hits the die/wound case, it marks the task as
>  		 * BO_WAKING and calls try_to_wake_up(), so that the mutex
> @@ -6731,26 +6894,46 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>  		 * try_to_wake_up from completing and doing the return
>  		 * migration.
>  		 *
> -		 * So when we hit a !BO_BLOCKED task briefly schedule idle
> -		 * so we release the rq and let the wakeup complete.
> +		 * So when we hit a BO_WAKING task try to wake it up ourselves.
>  		 */
> -		if (p->blocked_on_state != BO_BLOCKED)
> -			return proxy_resched_idle(rq);
> +		if (p->blocked_on_state == BO_WAKING) {
> +			if (task_current(rq, p)) {
> +				/* If its current just set it runnable */
> +				__force_blocked_on_runnable(p);
> +				return p;
> +			}
> +			goto needs_return;
> +		}
> +
> +		if (task_current(rq, p))
> +			curr_in_chain = true;
>  
>  		owner = __mutex_owner(mutex);
>  		if (!owner) {
> +			/* If the owner is null, we may have some work to do */
> +			if (!proxy_can_run_here(rq, p))
> +				goto needs_return;
> +
>  			__force_blocked_on_runnable(p);
>  			return p;
>  		}
>  
>  		if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
>  			/* XXX Don't handle blocked owners/delayed dequeue yet */
> +			if (curr_in_chain)
> +				return proxy_resched_idle(rq);
>  			goto deactivate_donor;
>  		}
>  
> -		if (task_cpu(owner) != this_cpu) {
> -			/* XXX Don't handle migrations yet */
> -			goto deactivate_donor;
> +		owner_cpu = task_cpu(owner);
> +		if (owner_cpu != this_cpu) {
> +			/*
> +			 * @owner can disappear, simply migrate to @owner_cpu
> +			 * and leave that CPU to sort things out.
> +			 */
> +			if (curr_in_chain)
> +				return proxy_resched_idle(rq);
> +			goto migrate;
>  		}
>  
>  		if (task_on_rq_migrating(owner)) {
> @@ -6817,8 +7000,18 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>  	 * blocked_lock released, so we have to get out of the
>  	 * guard() scope.
>  	 */
> +migrate:
> +	proxy_migrate_task(rq, rf, p, owner_cpu);
> +	return NULL;
> +needs_return:
> +	proxy_force_return(rq, rf, p);
> +	return NULL;
>  deactivate_donor:
> -	return proxy_deactivate(rq, donor);
> +	if (!proxy_deactivate(rq, donor)) {
> +		p = donor;
> +		goto needs_return;
> +	}
> +	return NULL;
>  }
>  #else /* SCHED_PROXY_EXEC */
>  static struct task_struct *
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b173a059315c2..cc531eb939831 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8781,7 +8781,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
>  	se = &p->se;
>  
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> -	if (prev->sched_class != &fair_sched_class)
> +	if (prev->sched_class != &fair_sched_class ||
> +	    rq->curr != rq->donor)

Why is this special handling required?

>  		goto simple;
>  
>  	__put_prev_set_next_dl_server(rq, prev, p);

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RESEND][PATCH v21 0/6] Donor Migration for Proxy Execution (v21)
  2025-09-04  0:21 [RESEND][PATCH v21 0/6] Donor Migration for Proxy Execution (v21) John Stultz
                   ` (6 preceding siblings ...)
  2025-09-11 13:59 ` [RESEND][PATCH v21 0/6] Donor Migration for Proxy Execution (v21) Juri Lelli
@ 2025-09-16  3:19 ` K Prateek Nayak
  2025-09-16  4:57   ` John Stultz
  7 siblings, 1 reply; 26+ messages in thread
From: K Prateek Nayak @ 2025-09-16  3:19 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello John,

On 9/4/2025 5:51 AM, John Stultz wrote:
> Also you can find the full proxy-exec series here:
>   https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v21-6.17-rc4/
>   https://github.com/johnstultz-work/linux-dev.git proxy-exec-v21-6.17-rc4

tl;dr

This series seems fine from performance standpoint but the above branch
may have some performance issues but take them with a grain of salt
since this is not all apples to apples comparison.

For this series things are alright - my harness for longer running
benchmarks gave up for some reason so I'll rerun those tests again and
report back later but either tip has some improvements for
netperf / tbench or "proxy-exec-v21-6.17-rc4" may have some issues
around it. I'll take a deeper look later in the week.


o System Details

- 3rd Generation EPYC System
- 2 x 64C/128T
- NPS1 mode


o Kernels

- tip		tip:sched/core at commit 5b726e9bf954 ("sched/fair: Get
		rid of throttled_lb_pair()")
		(CONFIG_SCHED_PROXY_EXEC disabled)

- proxy-v21	tip + this series as is
		(CONFIG_SCHED_PROXY_EXEC=y)

- proxy-full	proxy-exec-v21-6.17-rc4 as is
		(CONFIG_SCHED_PROXY_EXEC=y)


o Benchmark results

    ==================================================================
    Test          : hackbench
    Units         : Normalized time in seconds
    Interpretation: Lower is better
    Statistic     : AMean
    ==================================================================
    Case:           tip[pct imp](CV)      proxy-v21[pct imp](CV)   proxy-full[pct imp](CV)
     1-groups     1.00 [ -0.00](10.57)     0.94 [  6.24]( 7.88)     0.91 [  9.46](10.11)
     2-groups     1.00 [ -0.00]( 3.33)     1.02 [ -1.75]( 3.16)     1.04 [ -4.17]( 2.51)
     4-groups     1.00 [ -0.00]( 2.41)     1.01 [ -0.87]( 2.29)     1.03 [ -3.03]( 1.27)
     8-groups     1.00 [ -0.00]( 2.67)     1.02 [ -1.66]( 2.10)     1.01 [ -0.55]( 1.45)
    16-groups     1.00 [ -0.00]( 1.83)     1.01 [ -0.82]( 2.30)     1.00 [ -0.25]( 1.72)


    ==================================================================
    Test          : tbench
    Units         : Normalized throughput
    Interpretation: Higher is better
    Statistic     : AMean
    ==================================================================
    Clients:    tip[pct imp](CV)      proxy-v21[pct imp](CV)   proxy-full[pct imp](CV)
        1     1.00 [  0.00]( 0.81)     1.00 [ -0.13]( 0.16)     0.92 [ -8.06]( 0.39)
        2     1.00 [  0.00]( 0.32)     0.99 [ -0.84]( 0.66)     0.91 [ -8.85]( 0.54)
        4     1.00 [  0.00]( 0.32)     0.98 [ -2.37]( 1.40)     0.92 [ -8.28]( 0.28)
        8     1.00 [  0.00]( 0.69)     0.98 [ -2.47]( 0.53)     0.90 [ -9.58]( 0.36)
       16     1.00 [  0.00]( 1.24)     0.96 [ -3.94]( 1.51)     0.90 [ -9.83]( 0.69)
       32     1.00 [  0.00]( 0.60)     0.99 [ -1.47]( 3.38)     0.89 [-11.43]( 5.60)
       64     1.00 [  0.00]( 1.22)     0.99 [ -1.33]( 0.88)     0.91 [ -8.52]( 2.67)
      128     1.00 [  0.00]( 0.34)     0.99 [ -1.48]( 0.99)     0.92 [ -7.51]( 0.13)
      256     1.00 [  0.00]( 1.32)     0.98 [ -1.75]( 0.96)     0.97 [ -3.35]( 1.22)
      512     1.00 [  0.00]( 0.25)     0.99 [ -1.29]( 0.41)     0.97 [ -2.90]( 0.17)
     1024     1.00 [  0.00]( 0.24)     0.99 [ -0.59]( 0.14)     0.98 [ -2.36]( 0.33)


    ==================================================================
    Test          : stream-10
    Units         : Normalized Bandwidth, MB/s
    Interpretation: Higher is better
    Statistic     : HMean
    ==================================================================
    Test:       tip[pct imp](CV)      proxy-v21[pct imp](CV)   proxy-full[pct imp](CV)
     Copy     1.00 [  0.00](10.90)     1.07 [  6.53]( 8.21)     1.07 [  7.26]( 7.22)
    Scale     1.00 [  0.00]( 9.62)     1.04 [  4.00]( 6.99)     1.05 [  4.71]( 5.85)
      Add     1.00 [  0.00](10.17)     1.05 [  5.07]( 6.14)     1.06 [  6.03]( 6.56)
    Triad     1.00 [  0.00]( 8.48)     1.04 [  4.34]( 5.09)     1.04 [  4.07]( 4.40)


    ==================================================================
    Test          : stream-100
    Units         : Normalized Bandwidth, MB/s
    Interpretation: Higher is better
    Statistic     : HMean
    ==================================================================
    Test:       tip[pct imp](CV)      proxy-v21[pct imp](CV)   proxy-full[pct imp](CV)
     Copy     1.00 [  0.00]( 1.38)     1.01 [  0.99]( 1.21)     1.02 [  1.68]( 1.50)
    Scale     1.00 [  0.00]( 6.19)     1.02 [  1.94]( 4.34)     1.03 [  3.00]( 1.19)
      Add     1.00 [  0.00]( 4.42)     1.01 [  0.94]( 4.17)     1.02 [  1.58]( 1.54)
    Triad     1.00 [  0.00]( 1.30)     1.01 [  0.61]( 1.37)     1.00 [  0.18]( 2.65)


    ==================================================================
    Test          : netperf
    Units         : Normalized Througput
    Interpretation: Higher is better
    Statistic     : AMean
    ==================================================================
    Clients:         tip[pct imp](CV)      proxy-v21[pct imp](CV)   proxy-full[pct imp](CV)
     1-clients     1.00 [  0.00]( 0.41)     0.99 [ -1.03]( 0.34)     0.90 [ -9.96]( 0.46)
     2-clients     1.00 [  0.00]( 0.31)     0.99 [ -1.17]( 0.72)     0.90 [ -9.77]( 0.78)
     4-clients     1.00 [  0.00]( 0.57)     0.99 [ -0.68]( 0.32)     0.90 [-10.21]( 0.89)
     8-clients     1.00 [  0.00]( 0.46)     0.99 [ -0.69]( 0.32)     0.90 [-10.20]( 0.70)
    16-clients     1.00 [  0.00]( 0.57)     0.99 [ -1.39]( 1.28)     0.90 [-10.37]( 1.34)
    32-clients     1.00 [  0.00]( 1.03)     0.97 [ -2.53]( 1.92)     0.90 [-10.00]( 1.23)
    64-clients     1.00 [  0.00]( 1.23)     0.97 [ -3.15]( 2.94)     0.90 [ -9.94]( 1.52)
    128-clients    1.00 [  0.00]( 1.14)     0.99 [ -1.07]( 0.95)     0.90 [ -9.91]( 0.90)
    256-clients    1.00 [  0.00]( 3.73)     0.98 [ -1.80]( 3.66)     0.97 [ -3.41]( 4.47)
    512-clients    1.00 [  0.00](54.79)     0.97 [ -3.03](48.98)     0.95 [ -4.63](51.77)


    ==================================================================
    Test          : schbench
    Units         : Normalized 99th percentile latency in us
    Interpretation: Lower is better
    Statistic     : Median
    ==================================================================
    #workers: tip[pct imp](CV)      proxy-v21[pct imp](CV)   proxy-full[pct imp](CV)
      1     1.00 [ -0.00](30.14)     1.11 [-11.11](35.78)     1.31 [-30.56](42.87)
      2     1.00 [ -0.00]( 7.87)     0.93 [  7.14]( 8.45)     0.95 [  4.76]( 7.50)
      4     1.00 [ -0.00]( 7.87)     1.07 [ -7.14]( 7.36)     1.14 [-14.29](12.73)
      8     1.00 [ -0.00]( 4.59)     1.08 [ -8.16]( 5.09)     1.12 [-12.24]( 7.44)
     16     1.00 [ -0.00]( 5.33)     1.05 [ -5.08]( 0.93)     1.05 [ -5.08]( 2.75)
     32     1.00 [ -0.00]( 1.04)     1.00 [ -0.00]( 3.12)     1.07 [ -7.29]( 4.49)
     64     1.00 [ -0.00]( 1.04)     0.96 [  3.50]( 3.78)     1.01 [ -1.00]( 2.24)
    128     1.00 [ -0.00]( 5.11)     1.06 [ -6.11]( 7.56)     1.09 [ -8.60]( 6.26)
    256     1.00 [ -0.00](19.39)     1.29 [-28.73](14.92)     1.15 [-14.71](14.83)
    512     1.00 [ -0.00]( 0.15)     0.98 [  2.02]( 1.85)     0.99 [  1.01]( 1.66)


    ==================================================================
    Test          : new-schbench-requests-per-second
    Units         : Normalized Requests per second
    Interpretation: Higher is better
    Statistic     : Median
    ==================================================================
    #workers: tip[pct imp](CV)      proxy-v21[pct imp](CV)   proxy-full[pct imp](CV)
      1     1.00 [  0.00]( 0.26)     1.00 [  0.29]( 0.15)     1.00 [ -0.29]( 0.30)
      2     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.15)     1.00 [  0.00]( 0.15)
      4     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.15)     1.00 [  0.00]( 0.00)
      8     1.00 [  0.00]( 0.15)     1.00 [  0.29]( 0.15)     1.00 [  0.29]( 0.15)
     16     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)
     32     1.00 [  0.00]( 1.86)     1.00 [ -0.31]( 0.28)     1.00 [ -0.31]( 2.12)
     64     1.00 [  0.00](13.62)     0.99 [ -0.77]( 4.78)     0.81 [-18.52](11.11)
    128     1.00 [  0.00]( 0.00)     1.00 [  0.38]( 0.00)     1.00 [  0.38]( 0.00)
    256     1.00 [  0.00]( 1.49)     1.02 [  1.82]( 1.63)     1.00 [  0.00]( 1.19)
    512     1.00 [  0.00]( 0.75)     1.01 [  0.71]( 1.65)     1.01 [  1.19]( 1.53)


    ==================================================================
    Test          : new-schbench-wakeup-latency
    Units         : Normalized 99th percentile latency in us
    Interpretation: Lower is better
    Statistic     : Median
    ==================================================================
    #workers: tip[pct imp](CV)      proxy-v21[pct imp](CV)   proxy-full[pct imp](CV)
      1     1.00 [ -0.00]( 6.74)     1.00 [ -0.00]( 6.74)     1.12 [-12.50](19.26)
      2     1.00 [ -0.00](11.18)     1.00 [ -0.00](17.21)     1.50 [-50.00]( 7.45)
      4     1.00 [ -0.00]( 9.94)     1.00 [ -0.00](19.26)     1.56 [-55.56](15.78)
      8     1.00 [ -0.00](10.68)     1.00 [ -0.00](10.68)     1.44 [-44.44](28.77)
     16     1.00 [ -0.00]( 9.68)     1.00 [ -0.00]( 9.68)     1.20 [-20.00]( 8.15)
     32     1.00 [ -0.00](14.08)     1.00 [ -0.00]( 5.34)     1.20 [-20.00](14.70)
     64     1.00 [ -0.00]( 3.52)     1.13 [-13.33]( 5.26)     1.27 [-26.67]( 2.77)
    128     1.00 [ -0.00]( 1.79)     1.07 [ -6.56]( 2.70)     1.07 [ -6.97]( 7.71)
    256     1.00 [ -0.00]( 9.89)     1.04 [ -4.50]( 3.81)     1.02 [ -2.00]( 7.78)
    512     1.00 [ -0.00]( 0.00)     1.01 [ -0.77]( 0.34)     1.00 [ -0.00]( 0.20)


    ==================================================================
    Test          : new-schbench-request-latency
    Units         : Normalized 99th percentile latency in us
    Interpretation: Lower is better
    Statistic     : Median
    ==================================================================
    #workers: tip[pct imp](CV)      proxy-v21[pct imp](CV)   proxy-full[pct imp](CV)
      1     1.00 [ -0.00]( 1.33)     0.96 [  3.89]( 1.46)     1.02 [ -1.82]( 3.02)
      2     1.00 [ -0.00]( 0.14)     1.01 [ -1.09]( 0.24)     1.02 [ -2.44]( 2.73)
      4     1.00 [ -0.00]( 1.24)     1.00 [ -0.26]( 1.69)     0.97 [  2.65]( 0.14)
      8     1.00 [ -0.00]( 0.54)     1.00 [ -0.00]( 1.02)     0.99 [  1.31]( 2.16)
     16     1.00 [ -0.00]( 0.36)     1.00 [ -0.00]( 1.70)     0.98 [  1.59]( 1.00)
     32     1.00 [ -0.00]( 5.51)     0.99 [  0.73]( 2.09)     1.01 [ -1.45]( 7.52)
     64     1.00 [ -0.00]( 5.38)     1.09 [ -9.27]( 0.88)     1.09 [ -8.56]( 0.11)
    128     1.00 [ -0.00]( 0.32)     1.00 [ -0.36]( 0.32)     1.03 [ -2.54]( 1.15)
    256     1.00 [ -0.00](10.51)     1.14 [-14.23](11.19)     1.00 [  0.24](11.42)
    512     1.00 [ -0.00]( 2.00)     1.03 [ -3.27]( 0.94)     1.02 [ -2.41]( 1.96)


-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RESEND][PATCH v21 0/6] Donor Migration for Proxy Execution (v21)
  2025-09-16  3:19 ` K Prateek Nayak
@ 2025-09-16  4:57   ` John Stultz
  0 siblings, 0 replies; 26+ messages in thread
From: John Stultz @ 2025-09-16  4:57 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Mon, Sep 15, 2025 at 8:19 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello John,
>
> On 9/4/2025 5:51 AM, John Stultz wrote:
> > Also you can find the full proxy-exec series here:
> >   https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v21-6.17-rc4/
> >   https://github.com/johnstultz-work/linux-dev.git proxy-exec-v21-6.17-rc4
>
> tl;dr
>
> This series seems fine from performance standpoint but the above branch
> may have some performance issues but take them with a grain of salt
> since this is not all apples to apples comparison.
>

Thank you so much for running these test! I really appreciate it!
It does look like I need to  spend some more work on the full series,
as those regressions with the full proxy patch set does seem
problematic. Thank  you for raising this!

(Thank you also for your feedback on the patches in this series, I've
not gotten a chance to address and reply back individually yet, but I
hope to do so soon!)


> For this series things are alright - my harness for longer running
> benchmarks gave up for some reason so I'll rerun those tests again and
> report back later but either tip has some improvements for
> netperf / tbench or "proxy-exec-v21-6.17-rc4" may have some issues
> around it. I'll take a deeper look later in the week.
>

Let me know if you find anything further!

Thank you so much again for all your efforts here, both in testing and
review! As well as the testing you do for the whole community!
-john

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RESEND][PATCH v21 1/6] locking: Add task::blocked_lock to serialize blocked_on state
  2025-09-12  5:50   ` K Prateek Nayak
@ 2025-09-18 19:58     ` John Stultz
  0 siblings, 0 replies; 26+ messages in thread
From: John Stultz @ 2025-09-18 19:58 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Thu, Sep 11, 2025 at 10:50 PM 'K Prateek Nayak' via kernel-team
<kernel-team@android.com> wrote:
> On 9/4/2025 5:51 AM, John Stultz wrote:
> > @@ -693,25 +697,30 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> >                       break;
> >
> >               if (first) {
> > -                     trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
> > +                     bool opt_acquired;
> > +
> >                       /*
> >                        * mutex_optimistic_spin() can call schedule(), so
> > -                      * clear blocked on so we don't become unselectable
> > +                      * we need to release these locks before calling it,
> > +                      * and clear blocked on so we don't become unselectable
> >                        * to run.
> >                        */
> > -                     clear_task_blocked_on(current, lock);
> > -                     if (mutex_optimistic_spin(lock, ww_ctx, &waiter))
> > +                     __clear_task_blocked_on(current, lock);
> > +                     raw_spin_unlock(&current->blocked_lock);
> > +                     raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
> > +                     trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
> > +                     opt_acquired = mutex_optimistic_spin(lock, ww_ctx, &waiter);
> > +                     raw_spin_lock_irqsave(&lock->wait_lock, flags);
> > +                     raw_spin_lock(&current->blocked_lock);
> > +                     __set_task_blocked_on(current, lock);
> > +                     if (opt_acquired)
> >                               break;
>
> nit. Is there any reason for setting the blocked state before the break
> other than for symmetry? The blocked_lock is held and we'll soon clear
> it after the break anyways so does it have any other subtle purpose?

So, basically yes, it's for symmetry, and trying to avoid subtlety
where I've found it's easy to confuse myself.

At this point in the patch series, as we don't prevent double-clears,
I guess it could be moved below the break, without much impact.
However, after the "Add blocked_on_state" patch, we do enforce that we
we don't clear cleared values, and we manage the blocked_on_state
separately here, so at that point we'd have to put it back or loosen
the constraint checking. So I think leaving it here seems the most
sensible.

> Also super silly but that above block is too dense. Can we add some
> spaces in between, groping the relevant ops, to make it easier on the
> eyes :)

Sure! Done!

> Apart from those nit picks, I don't see anything out of the place.
> Feel free to include:
>
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
>

Thanks so much again for your review and feedback!
-john

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RESEND][PATCH v21 2/6] sched/locking: Add blocked_on_state to provide necessary tri-state for proxy return-migration
  2025-09-15  8:05   ` K Prateek Nayak
@ 2025-09-18 22:57     ` John Stultz
  2025-09-19  3:27       ` K Prateek Nayak
  0 siblings, 1 reply; 26+ messages in thread
From: John Stultz @ 2025-09-18 22:57 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Mon, Sep 15, 2025 at 1:06 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello John,
>
> On 9/4/2025 5:51 AM, John Stultz wrote:
> >  static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *m)
> >  {
> > +     /* The task should only be clearing itself */
> > +     WARN_ON_ONCE(p != current);
> >       /* Currently we serialize blocked_on under the task::blocked_lock */
> >       lockdep_assert_held_once(&p->blocked_lock);
> > -     /*
> > -      * There may be cases where we re-clear already cleared
> > -      * blocked_on relationships, but make sure we are not
> > -      * clearing the relationship with a different lock.
> > -      */
> > -     WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m);
> > +     /* Make sure we are clearing the relationship with the right lock */
> > +     WARN_ON_ONCE(m && p->blocked_on != m);
> >       p->blocked_on = NULL;
> > +     p->blocked_on_state = BO_RUNNABLE;
> >  }
> >
>
> Maybe it is just me, but I got confused a couple of time only to
> realize __clear_task_blocked_on() clears the "blocked_on" and sets
> "blocked_on_state" to BO_RUNNABLE.
>
> Can we decouple the two and only set "p->blocked_on" in
> *_task_blocked_on_* and "p->blocked_on_state" in
> *{set,clear,force}_blocked_on* functions so it becomes easier to
> follow in the mutex path as:
>
>     __set_task_blocked_on(p, mutex); // blocked on mutex
>     __force_blocked_on_blocked(p); // blocked from running on CPU
>
>    ...
>
>     __clear_task_blocked_on(p, mutex); // p is no longer blocked on a mutex
>     __set_blocked_on_runnable(p); // p is now runnable

Hrm. So I guess I was thinking of
set_task_blocked_on()/clear_task_blocked_on() as the main enter/exit
states, with set_blocked_on_waking(), set_blocked_on_runnable() as
transition states within.

But, the various force_blocked_on_<state>() cases do add more
complexity, so maybe handling it completely separately would be more
intuitive? I dunno.


> >  static inline void clear_task_blocked_on(struct task_struct *p, struct mutex *m)
>
> Of the three {set,clear}_task_blcoked_on() usage:
>
>     $ grep -r "\(set\|clear\)_task_blocked_on" include/
>     kernel/locking/mutex.c: __set_task_blocked_on(current, lock);
>     kernel/locking/mutex.c: __clear_task_blocked_on(current, lock);
>     kernel/locking/mutex.c: clear_task_blocked_on(current, lock);
>
> two can be converted directly and perhaps clear_task_blocked_on() can be
> renamed as clear_task_blocked_on_set_runnable()?
>
> This is just me rambling on so feel free to ignore. I probably have to
> train my mind enough to see __clear_task_blocked_on() not only clears
> "blocked_on" but also sets task to runnable :)

Yeah, the case where we don't already hold the lock and want to do
both gets more complex in that case.

Hrm. Maybe just the way the functions are organized in the header make
it seem like we're managing two separate bits of state, where really
they are ordered.
I'll try to re-arrange that introducing the
set_task_blocked_on/clear_task_blocked_on first, then the transition
set_blocked_on_<state>() helpers after?
Maybe that and  some comments will make that clearer?

> > @@ -6749,6 +6776,15 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> >
> >       WARN_ON_ONCE(owner && !owner->on_rq);
> >       return owner;
> > +
> > +     /*
> > +      * NOTE: This logic is down here, because we need to call
> > +      * the functions with the mutex wait_lock and task
> > +      * blocked_lock released, so we have to get out of the
> > +      * guard() scope.
> > +      */
>
> I didn't know that was possible! Neat. Since cleanup.h has a note
> reading:
>
>     ... the expectation is that usage of "goto" and cleanup helpers is
>     never mixed in the same function.
>
> are there any concerns w.r.t. compiler versions etc. or am I just being
> paranoid?

Hrrrrmmmm.  I hadn't seen that detail. :/   I guess I was just lucky
it worked with my toolchain.

Oof. That may require reworking all this logic back to explicit
lock/unlock calls, away from the guard() usage.

Let me think on if there's a better way.

Thanks so much again for pointing this out!
-john

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RESEND][PATCH v21 2/6] sched/locking: Add blocked_on_state to provide necessary tri-state for proxy return-migration
  2025-09-15  9:03   ` K Prateek Nayak
@ 2025-09-18 23:07     ` John Stultz
  2025-09-19  3:38       ` K Prateek Nayak
  0 siblings, 1 reply; 26+ messages in thread
From: John Stultz @ 2025-09-18 23:07 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Mon, Sep 15, 2025 at 2:05 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 9/4/2025 5:51 AM, John Stultz wrote:
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -815,6 +815,12 @@ struct kmap_ctrl {
> >  #endif
> >  };
> >
> > +enum blocked_on_state {
> > +     BO_RUNNABLE,
> > +     BO_BLOCKED,
> > +     BO_WAKING,
> > +};
> > +
> >  struct task_struct {
> >  #ifdef CONFIG_THREAD_INFO_IN_TASK
> >       /*
> > @@ -1234,6 +1240,7 @@ struct task_struct {
> >       struct rt_mutex_waiter          *pi_blocked_on;
> >  #endif
> >
> > +     enum blocked_on_state           blocked_on_state;
>
> Is there any use of the "blocked_on_state" outside of CONFIG_PROXY_EXEC?
> If not, should we start thinking about putting the proxy exec specific
> members behind CONFIG_PROXY_EXEC to avoid bloating the task_struct?

So yeah, your suggestion is a decent one, though it gets a little
messy in a few spots. I'm working on integrating this and propagating
it through the full series, and hopefully I can clean it up further.
There are a few spots where this and other proxy related values do get
checked, so wrapping those up so they can be ifdef'ed out will require
some extra logic.

Thanks for the review and suggestion!
-john

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RESEND][PATCH v21 2/6] sched/locking: Add blocked_on_state to provide necessary tri-state for proxy return-migration
  2025-09-18 22:57     ` John Stultz
@ 2025-09-19  3:27       ` K Prateek Nayak
  2025-09-23 23:34         ` John Stultz
  0 siblings, 1 reply; 26+ messages in thread
From: K Prateek Nayak @ 2025-09-19  3:27 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello John,

On 9/19/2025 4:27 AM, John Stultz wrote:
>> Of the three {set,clear}_task_blcoked_on() usage:
>>
>>     $ grep -r "\(set\|clear\)_task_blocked_on" include/
>>     kernel/locking/mutex.c: __set_task_blocked_on(current, lock);
>>     kernel/locking/mutex.c: __clear_task_blocked_on(current, lock);
>>     kernel/locking/mutex.c: clear_task_blocked_on(current, lock);
>>
>> two can be converted directly and perhaps clear_task_blocked_on() can be
>> renamed as clear_task_blocked_on_set_runnable()?
>>
>> This is just me rambling on so feel free to ignore. I probably have to
>> train my mind enough to see __clear_task_blocked_on() not only clears
>> "blocked_on" but also sets task to runnable :)
> 
> Yeah, the case where we don't already hold the lock and want to do
> both gets more complex in that case.
> 
> Hrm. Maybe just the way the functions are organized in the header make
> it seem like we're managing two separate bits of state, where really
> they are ordered.
> I'll try to re-arrange that introducing the
> set_task_blocked_on/clear_task_blocked_on first, then the transition
> set_blocked_on_<state>() helpers after?
> Maybe that and  some comments will make that clearer?

Again that was me rambling. Even a small comment above
clear_task_blocked_on() would also be sufficient if the whole rework
turns out to be more extensive. 

> 
>>> @@ -6749,6 +6776,15 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>>>
>>>       WARN_ON_ONCE(owner && !owner->on_rq);
>>>       return owner;
>>> +
>>> +     /*
>>> +      * NOTE: This logic is down here, because we need to call
>>> +      * the functions with the mutex wait_lock and task
>>> +      * blocked_lock released, so we have to get out of the
>>> +      * guard() scope.
>>> +      */
>>
>> I didn't know that was possible! Neat. Since cleanup.h has a note
>> reading:
>>
>>     ... the expectation is that usage of "goto" and cleanup helpers is
>>     never mixed in the same function.
>>
>> are there any concerns w.r.t. compiler versions etc. or am I just being
>> paranoid?
> 
> Hrrrrmmmm.  I hadn't seen that detail. :/   I guess I was just lucky
> it worked with my toolchain.

I have been too. Maybe it is okay to use a goto if folks know what
they are doing ¯\_(ツ)_/¯

Another idea is to have:

    bool deactivate_donor = false;

    for (p = donor; task_is_blocked(p); p = owner) {
        guard(raw_spinlock)(...);
        ...
        if (<condition> {
            deactivate_donor = true;
            break;
        }
        ...
    }
    if (deactivate_donor)
        return proxy_deactivate(rq, donor);

Can that work?

> 
> Oof. That may require reworking all this logic back to explicit
> lock/unlock calls, away from the guard() usage.
> 
> Let me think on if there's a better way.
> 
> Thanks so much again for pointing this out!
> -john

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RESEND][PATCH v21 2/6] sched/locking: Add blocked_on_state to provide necessary tri-state for proxy return-migration
  2025-09-18 23:07     ` John Stultz
@ 2025-09-19  3:38       ` K Prateek Nayak
  0 siblings, 0 replies; 26+ messages in thread
From: K Prateek Nayak @ 2025-09-19  3:38 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello John,

On 9/19/2025 4:37 AM, John Stultz wrote:
> On Mon, Sep 15, 2025 at 2:05 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>> On 9/4/2025 5:51 AM, John Stultz wrote:
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -815,6 +815,12 @@ struct kmap_ctrl {
>>>  #endif
>>>  };
>>>
>>> +enum blocked_on_state {
>>> +     BO_RUNNABLE,
>>> +     BO_BLOCKED,
>>> +     BO_WAKING,
>>> +};
>>> +
>>>  struct task_struct {
>>>  #ifdef CONFIG_THREAD_INFO_IN_TASK
>>>       /*
>>> @@ -1234,6 +1240,7 @@ struct task_struct {
>>>       struct rt_mutex_waiter          *pi_blocked_on;
>>>  #endif
>>>
>>> +     enum blocked_on_state           blocked_on_state;
>>
>> Is there any use of the "blocked_on_state" outside of CONFIG_PROXY_EXEC?
>> If not, should we start thinking about putting the proxy exec specific
>> members behind CONFIG_PROXY_EXEC to avoid bloating the task_struct?
> 
> So yeah, your suggestion is a decent one, though it gets a little
> messy in a few spots. I'm working on integrating this and propagating
> it through the full series, and hopefully I can clean it up further.
> There are a few spots where this and other proxy related values do get
> checked, so wrapping those up so they can be ifdef'ed out will require
> some extra logic.

Looking at the proxy-exec-6.17-rc4, most of the direct references to
"p->blocked_on_state" is already under CONFIG_PROXY_EXEC. Only the
functions that modify it needs a stub and a helper for the usage in
task_is_blocked()

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RESEND][PATCH v21 3/6] sched: Add logic to zap balance callbacks if we pick again
  2025-09-15  8:31   ` K Prateek Nayak
@ 2025-09-19 18:34     ` John Stultz
  0 siblings, 0 replies; 26+ messages in thread
From: John Stultz @ 2025-09-19 18:34 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Mon, Sep 15, 2025 at 1:32 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 9/4/2025 5:51 AM, John Stultz wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index e0007660161fa..01bf5ef8d9fcc 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -5001,6 +5001,40 @@ static inline void finish_task(struct task_struct *prev)
> >       smp_store_release(&prev->on_cpu, 0);
> >  }
> >
> > +#if defined(CONFIG_SCHED_PROXY_EXEC)
>
> nit. This can be an "#ifdef CONFIG_SCHED_PROXY_EXEC" now.

Ah. Yes, this is leftover from it previously checking for PROXY_EXEC
&& CONFIG_SMP.  I'll be sure to clean that up.

> > +#else
> > +static inline void zap_balance_callbacks(struct rq *rq)
> > +{
> > +}
>
> nit.
>
> This can perhaps be reduced to a single line in light of Thomas' recent
> work to condense the stubs elsewhere:
> https://lore.kernel.org/lkml/20250908212925.389031537@linutronix.de/

Ah, if folks are ok with that, I'd prefer it as well!  Thanks for the
suggestion! I'll try to work that throughout the series.

> > +#endif
> > +
> >  static void do_balance_callbacks(struct rq *rq, struct balance_callback *head)
> >  {
> >       void (*func)(struct rq *rq);
> > @@ -6941,8 +6975,11 @@ static void __sched notrace __schedule(int sched_mode)
> >       rq_set_donor(rq, next);
> >       if (unlikely(task_is_blocked(next))) {
> >               next = find_proxy_task(rq, next, &rf);
> > -             if (!next)
> > +             if (!next) {
> > +                     /* zap the balance_callbacks before picking again */
> > +                     zap_balance_callbacks(rq);
> >                       goto pick_again;
> > +             }
> >               if (next == rq->idle)
> >                       goto keep_resched;
>
> Should we zap the callbacks if we are planning to go through schedule()
> again via rq->idle since it essentially voids the last pick too?

Hrm. So I don't think it's strictly necessary, because we will run the
set callback as part of finish_task_switch() when we switch briefly to
idle. So we don't end up with stale callbacks in the next
pick_next_task().

But I guess zapping them could help just avoid running it spuriously.
I'll give that a shot and see how it affects things.

Thanks again for all the suggestions!
-john

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RESEND][PATCH v21 4/6] sched: Handle blocked-waiter migration (and return migration)
  2025-09-15 10:06   ` K Prateek Nayak
@ 2025-09-20  2:38     ` John Stultz
  2025-09-22  4:14       ` K Prateek Nayak
  0 siblings, 1 reply; 26+ messages in thread
From: John Stultz @ 2025-09-20  2:38 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Mon, Sep 15, 2025 at 3:07 AM 'K Prateek Nayak' via kernel-team
<kernel-team@android.com> wrote:
> On 9/4/2025 5:51 AM, John Stultz wrote:
> > +     /* XXX - Added to address problems with changed dl_server semantics - double check */
> > +     __put_prev_set_next_dl_server(rq, rq->donor, rq->curr);
>
> Given we are tagging the rq->dl_server to the donor's context, should we
> do:
>
>     __put_prev_set_next_dl_server(rq, rq->donor, rq->idle);
>
> ... since we are setting rq->idle as next task and the donor?
>
> On a side note, this can perhaps be simplified as:
>
>     put_prev_set_next_task(rq, rq->donor, rq->idle);
>     rq_set_donor(rq, rq->idle);
>
> put_prev_set_next_task() will take are of the dl_server handoff too.

Nice! That is simpler. I think I can also simplify those two lines to
proxy_resched_idle() as well.
And looking, the big comment above that section needs an update as
well as its out of date.

> > +     put_prev_task(rq, rq->donor);
> > +     rq_set_donor(rq, rq->idle);
> > +     set_next_task(rq, rq->idle);
> > +
> > +     WARN_ON(p == rq->curr);
> > +
> > +     deactivate_task(rq, p, 0);
> > +     proxy_set_task_cpu(p, target_cpu);
> > +
> > +     zap_balance_callbacks(rq);
>
> Is this zap necessary? Given we return NULL from find_proxy_task() for
> migrate case, __schedule() would zap the callback soon. I don't see
> any WARN_ON() for the balance callbacks in the unpin + repin path so
> this might not be necessary or am I mistaken?

So unfortunately the issue is that when we're letting go of the
runqueue other CPUs can jump in via sched_balance_domains()  ->
sched_balance_rq() -> rq_lock_irqsave() -> rq_pin_lock() which will
trip the
WARN_ON_ONCE(rq->balance_callback && rq->balance_callback !=
&balance_push_callback); case if we leave callbacks.

So we need to zap the callbacks before we drop the rq lock.  I'll add
a comment to make that clear.

And yeah, that means we do call zap_balance_callbacks() again after we
re-take the lock and return null.

I guess we could remove the zap_balance_callbacks() call in
__schedule() and instead call it in every case where find_proxy_task()
returns NULL? Or we call it twice in the migration paths, as it should
just have one entry the second time.

> > +static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
> > +                            struct task_struct *p)
> > +{
> > +     lockdep_assert_rq_held(rq);
> > +
> > +     put_prev_task(rq, rq->donor);
> > +     rq_set_donor(rq, rq->idle);
> > +     set_next_task(rq, rq->idle);
>
> Similar set of comments as above.

Ack.

> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index b173a059315c2..cc531eb939831 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8781,7 +8781,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> >       se = &p->se;
> >
> >  #ifdef CONFIG_FAIR_GROUP_SCHED
> > -     if (prev->sched_class != &fair_sched_class)
> > +     if (prev->sched_class != &fair_sched_class ||
> > +         rq->curr != rq->donor)
>
> Why is this special handling required?

Hrmm.. that's a good question, as I haven't thought about that in
awhile, and I'm unfortunately forgetful. My instinct is that its a
concern that by checking the prev sched class is fair, is assuming the
prev task was selected by the fair scheduler's pick_next_task last
time around. But since we may have picked a blocked RT task as donor,
if we are proxying, we can't assume prev was previously chosen by
pick_next_task_fair(). So we skip the optimization just to be careful.

But this may be overly cautious and looking it over I don't
immediately see it as necessary. So I'll drop it and do some thorough
testing without it. If I do find an issue I'll re-add it with a much
better comment so I don't forget again.  I do really appreciate you
checking me here!

thanks
-john

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RESEND][PATCH v21 4/6] sched: Handle blocked-waiter migration (and return migration)
  2025-09-20  2:38     ` John Stultz
@ 2025-09-22  4:14       ` K Prateek Nayak
  0 siblings, 0 replies; 26+ messages in thread
From: K Prateek Nayak @ 2025-09-22  4:14 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello John,

On 9/20/2025 8:08 AM, John Stultz wrote:
>>> +     put_prev_task(rq, rq->donor);
>>> +     rq_set_donor(rq, rq->idle);
>>> +     set_next_task(rq, rq->idle);
>>> +
>>> +     WARN_ON(p == rq->curr);
>>> +
>>> +     deactivate_task(rq, p, 0);
>>> +     proxy_set_task_cpu(p, target_cpu);
>>> +
>>> +     zap_balance_callbacks(rq);
>>
>> Is this zap necessary? Given we return NULL from find_proxy_task() for
>> migrate case, __schedule() would zap the callback soon. I don't see
>> any WARN_ON() for the balance callbacks in the unpin + repin path so
>> this might not be necessary or am I mistaken?
> 
> So unfortunately the issue is that when we're letting go of the
> runqueue other CPUs can jump in via sched_balance_domains()  ->
> sched_balance_rq() -> rq_lock_irqsave() -> rq_pin_lock() which will
> trip the
> WARN_ON_ONCE(rq->balance_callback && rq->balance_callback !=
> &balance_push_callback); case if we leave callbacks.

Ah! That makes sense. I overlooked the fact that we have *other CPUs*
that can take the rq lock once dropped.

> 
> So we need to zap the callbacks before we drop the rq lock.  I'll add
> a comment to make that clear.

Thank you!

> 
> And yeah, that means we do call zap_balance_callbacks() again after we
> re-take the lock and return null.
> 
> I guess we could remove the zap_balance_callbacks() call in
> __schedule() and instead call it in every case where find_proxy_task()
> returns NULL? Or we call it twice in the migration paths, as it should
> just have one entry the second time.

I think a comment would be good enough with the current flow. I'll let
you decide which is the best option based on your understanding of the
complete flow (I'm still slowly wrapping my head around all this :)

[..snip..]

>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index b173a059315c2..cc531eb939831 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -8781,7 +8781,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
>>>       se = &p->se;
>>>
>>>  #ifdef CONFIG_FAIR_GROUP_SCHED
>>> -     if (prev->sched_class != &fair_sched_class)
>>> +     if (prev->sched_class != &fair_sched_class ||
>>> +         rq->curr != rq->donor)
>>
>> Why is this special handling required?
> 
> Hrmm.. that's a good question, as I haven't thought about that in
> awhile, and I'm unfortunately forgetful. My instinct is that its a
> concern that by checking the prev sched class is fair, is assuming the
> prev task was selected by the fair scheduler's pick_next_task last
> time around. But since we may have picked a blocked RT task as donor,
> if we are proxying, we can't assume prev was previously chosen by
> pick_next_task_fair(). So we skip the optimization just to be careful.

My interpretation was - since pick_next_task_fair() selected the
scheduling context, it shouldn't matter what the execution context was.

> 
> But this may be overly cautious and looking it over I don't
> immediately see it as necessary. So I'll drop it and do some thorough
> testing without it. If I do find an issue I'll re-add it with a much
> better comment so I don't forget again.

Thank you! I too will do some testing and let you know if I see
something go awry.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RESEND][PATCH v21 2/6] sched/locking: Add blocked_on_state to provide necessary tri-state for proxy return-migration
  2025-09-19  3:27       ` K Prateek Nayak
@ 2025-09-23 23:34         ` John Stultz
  0 siblings, 0 replies; 26+ messages in thread
From: John Stultz @ 2025-09-23 23:34 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Thu, Sep 18, 2025 at 8:28 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 9/19/2025 4:27 AM, John Stultz wrote:
> >> I didn't know that was possible! Neat. Since cleanup.h has a note
> >> reading:
> >>
> >>     ... the expectation is that usage of "goto" and cleanup helpers is
> >>     never mixed in the same function.
> >>
> >> are there any concerns w.r.t. compiler versions etc. or am I just being
> >> paranoid?
> >
> > Hrrrrmmmm.  I hadn't seen that detail. :/   I guess I was just lucky
> > it worked with my toolchain.
>
> I have been too. Maybe it is okay to use a goto if folks know what
> they are doing ¯\_(ツ)_/¯
>
> Another idea is to have:
>
>     bool deactivate_donor = false;
>
>     for (p = donor; task_is_blocked(p); p = owner) {
>         guard(raw_spinlock)(...);
>         ...
>         if (<condition> {
>             deactivate_donor = true;
>             break;
>         }
>         ...
>     }
>     if (deactivate_donor)
>         return proxy_deactivate(rq, donor);
>
> Can that work?

Yeah, I've reworked the logic to switch() on an action enum, which
will let us do something similar without gotos.

thanks
-john

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2025-09-23 23:34 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-04  0:21 [RESEND][PATCH v21 0/6] Donor Migration for Proxy Execution (v21) John Stultz
2025-09-04  0:21 ` [RESEND][PATCH v21 1/6] locking: Add task::blocked_lock to serialize blocked_on state John Stultz
2025-09-12  5:50   ` K Prateek Nayak
2025-09-18 19:58     ` John Stultz
2025-09-04  0:21 ` [RESEND][PATCH v21 2/6] sched/locking: Add blocked_on_state to provide necessary tri-state for proxy return-migration John Stultz
2025-09-15  8:05   ` K Prateek Nayak
2025-09-18 22:57     ` John Stultz
2025-09-19  3:27       ` K Prateek Nayak
2025-09-23 23:34         ` John Stultz
2025-09-15  9:03   ` K Prateek Nayak
2025-09-18 23:07     ` John Stultz
2025-09-19  3:38       ` K Prateek Nayak
2025-09-04  0:21 ` [RESEND][PATCH v21 3/6] sched: Add logic to zap balance callbacks if we pick again John Stultz
2025-09-15  8:31   ` K Prateek Nayak
2025-09-19 18:34     ` John Stultz
2025-09-04  0:21 ` [RESEND][PATCH v21 4/6] sched: Handle blocked-waiter migration (and return migration) John Stultz
2025-09-15 10:06   ` K Prateek Nayak
2025-09-20  2:38     ` John Stultz
2025-09-22  4:14       ` K Prateek Nayak
2025-09-04  0:21 ` [RESEND][PATCH v21 5/6] sched: Add blocked_donor link to task for smarter mutex handoffs John Stultz
2025-09-04  0:21 ` [RESEND][PATCH v21 6/6] sched: Migrate whole chain in proxy_migrate_task() John Stultz
2025-09-11 13:59 ` [RESEND][PATCH v21 0/6] Donor Migration for Proxy Execution (v21) Juri Lelli
2025-09-11 23:21   ` John Stultz
2025-09-12  8:14     ` Juri Lelli
2025-09-16  3:19 ` K Prateek Nayak
2025-09-16  4:57   ` John Stultz

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.