From: John Stultz <jstultz@google.com>
To: LKML <linux-kernel@vger.kernel.org>
Cc: John Stultz <jstultz@google.com>,
Joel Fernandes <joelagnelf@nvidia.com>,
Qais Yousef <qyousef@layalina.io>,
Ingo Molnar <mingo@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Juri Lelli <juri.lelli@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Valentin Schneider <vschneid@redhat.com>,
Steven Rostedt <rostedt@goodmis.org>,
Ben Segall <bsegall@google.com>,
Zimuzo Ezeozue <zezeozue@google.com>,
Mel Gorman <mgorman@suse.de>, Will Deacon <will@kernel.org>,
Waiman Long <longman@redhat.com>,
Boqun Feng <boqun.feng@gmail.com>,
"Paul E. McKenney" <paulmck@kernel.org>,
Metin Kaya <Metin.Kaya@arm.com>,
Xuewen Yan <xuewen.yan94@gmail.com>,
K Prateek Nayak <kprateek.nayak@amd.com>,
Thomas Gleixner <tglx@linutronix.de>,
Daniel Lezcano <daniel.lezcano@linaro.org>,
Suleiman Souhlal <suleiman@google.com>,
kuyo chang <kuyo.chang@mediatek.com>, hupu <hupu.gm@gmail.com>,
kernel-team@android.com
Subject: [PATCH v22 2/6] sched/locking: Add blocked_on_state to provide necessary tri-state for proxy return-migration
Date: Fri, 26 Sep 2025 03:29:10 +0000
Message-ID: <20250926032931.27663-3-jstultz@google.com>
In-Reply-To: <20250926032931.27663-1-jstultz@google.com>
As we add functionality to proxy execution, we may migrate a
donor task to a runqueue where it can't run due to cpu affinity.
Thus, we must be careful to ensure we return-migrate the task
back to a cpu in its cpumask when it becomes unblocked.

To do that, we need more than just a binary concept of the task
being blocked on a mutex or not.

So add a blocked_on_state value to the task, that allows the
task to move through BO_RUNNABLE -> BO_BLOCKED -> BO_WAKING
and back to BO_RUNNABLE. This provides a guard state in
BO_WAKING so we can know the task is no longer blocked
but we don't want to run it until we have potentially
done return migration, back to a usable cpu.
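
As a rough userspace illustration of the transitions described above
(not the kernel code itself), the tri-state can be modeled as below.
The struct demo_task type and all demo_*() names are hypothetical
stand-ins; locking via blocked_lock is deliberately left out, and the
predicate only mirrors the shape of task_is_blocked() in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mirrors the enum added by this patch. */
enum blocked_on_state {
	BO_RUNNABLE,
	BO_BLOCKED,
	BO_WAKING,
};

/* Hypothetical stand-in for the two task_struct fields involved. */
struct demo_task {
	const void *blocked_on;	/* the mutex we wait on, or NULL */
	enum blocked_on_state blocked_on_state;
};

/* Enter: record the mutex and force BO_BLOCKED. */
static void demo_set_blocked_on(struct demo_task *p, const void *m)
{
	assert(m && !p->blocked_on);
	p->blocked_on = m;
	p->blocked_on_state = BO_BLOCKED;
}

/* Unlock/die/wound path: only BO_BLOCKED may advance to BO_WAKING. */
static void demo_set_waking(struct demo_task *p)
{
	if (p->blocked_on_state == BO_BLOCKED)
		p->blocked_on_state = BO_WAKING;
}

/* Wakeup path: only BO_WAKING may advance to BO_RUNNABLE. */
static void demo_set_runnable(struct demo_task *p)
{
	if (p->blocked_on_state == BO_WAKING)
		p->blocked_on_state = BO_RUNNABLE;
}

/* Exit: drop the mutex pointer and force BO_RUNNABLE. */
static void demo_clear_blocked_on(struct demo_task *p)
{
	p->blocked_on = NULL;
	p->blocked_on_state = BO_RUNNABLE;
}

/* Same predicate shape as task_is_blocked() in this patch. */
static bool demo_task_is_blocked(const struct demo_task *p)
{
	return p->blocked_on && p->blocked_on_state != BO_RUNNABLE;
}
```

Note that in this model a BO_WAKING task still counts as blocked, which
is the guard behavior the commit message asks for: the task only becomes
selectable once the wakeup path has moved it to BO_RUNNABLE.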
Signed-off-by: John Stultz <jstultz@google.com>
---
v15:
* Split blocked_on_state into its own patch later in the
series, as the tri-state isn't necessary until we deal
with proxy/return migrations
v16:
* Handle case where task in the chain is being set as
BO_WAKING by another cpu (usually via ww_mutex die code).
Make sure we release the rq lock so the wakeup can
complete.
* Rework to use guard() in find_proxy_task() as suggested
by Peter
v18:
* Add initialization of blocked_on_state for init_task
v19:
* PREEMPT_RT build fixups and rework suggested by
K Prateek Nayak
v20:
* Simplify one of the blocked_on_state changes to avoid extra
PREEMPT_RT conditionals
v21:
* Slight reworks due to avoiding nested blocked_lock locking
* Be consistent in use of blocked_on_state helper functions
* Rework calls to proxy_deactivate() to do proper locking
around blocked_on_state changes that we were cheating in
previous versions.
* Minor cleanups, some comment improvements
v22:
* Re-order blocked_on_state helpers to try to make it clearer
that set_task_blocked_on() and clear_task_blocked_on() are
the main enter/exit states and the blocked_on_state helpers
help manage the transition states within. Per feedback from
K Prateek Nayak.
* Rework blocked_on_state to be defined within
CONFIG_SCHED_PROXY_EXEC as suggested by K Prateek Nayak.
* Reworked empty stub functions to just take one line as
suggested by K Prateek
* Avoid using gotos out of a guard() scope, as highlighted by
K Prateek, and instead rework logic to break and switch()
on an action value.
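
For readers unfamiliar with the last item, the break-and-switch pattern
can be sketched in plain C roughly as follows. This is only a toy model
under stated assumptions: DEMO_GUARD() is a hypothetical scope guard
built on the GCC/Clang cleanup attribute (not the kernel's guard()),
struct demo_lock just counts how deeply it is held, and demo_walk() and
demo_deactivate() are made-up names standing in for the chain walk and
for a helper that must run with the lock dropped.

```c
#include <assert.h>

/* Hypothetical toy lock: just counts how deeply it is held. */
struct demo_lock { int depth; };

static void demo_unlock(struct demo_lock **l) { (*l)->depth--; }

/* Scope guard built on the GCC/Clang cleanup attribute. */
#define DEMO_GUARD(l) \
	struct demo_lock *demo_guard \
		__attribute__((cleanup(demo_unlock), unused)) = \
		((l)->depth++, (l))

enum demo_action { FOUND, DEACTIVATE_DONOR };

/* Hypothetical slow-path helper that must run with the lock dropped. */
static int demo_deactivate(struct demo_lock *l)
{
	assert(l->depth == 0);
	return -1;
}

static int demo_walk(struct demo_lock *l, const int *chain, int n)
{
	enum demo_action action = FOUND;
	int found = 0;

	for (int i = 0; i < n; i++) {
		DEMO_GUARD(l);	/* released at the end of each iteration */

		if (chain[i] < 0) {
			action = DEACTIVATE_DONOR;
			break;	/* leaves the guard scope, lock is dropped */
		}
		found = chain[i];
	}

	/* Act on the decision outside of the guard scope. */
	switch (action) {
	case DEACTIVATE_DONOR:
		return demo_deactivate(l);
	case FOUND:
		break;
	}
	return found;
}
```

The loop only records what should happen next; the work itself runs
after the guard has released the lock, so no jump has to leave the
guarded scope.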
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
include/linux/sched.h | 92 +++++++++++++++++++++++++++++++++------
init/init_task.c | 3 ++
kernel/fork.c | 3 ++
kernel/locking/mutex.c | 15 ++++---
kernel/locking/ww_mutex.h | 20 ++++-----
kernel/sched/core.c | 45 +++++++++++++++++--
kernel/sched/sched.h | 6 ++-
7 files changed, 146 insertions(+), 38 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index cb4e81d9d9b67..8245940783c77 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -813,6 +813,12 @@ struct kmap_ctrl {
#endif
};
+enum blocked_on_state {
+ BO_RUNNABLE,
+ BO_BLOCKED,
+ BO_WAKING,
+};
+
struct task_struct {
#ifdef CONFIG_THREAD_INFO_IN_TASK
/*
@@ -1234,6 +1240,9 @@ struct task_struct {
struct mutex *blocked_on; /* lock we're blocked on */
raw_spinlock_t blocked_lock;
+#ifdef CONFIG_SCHED_PROXY_EXEC
+ enum blocked_on_state blocked_on_state;
+#endif
#ifdef CONFIG_DETECT_HUNG_TASK_BLOCKER
/*
@@ -2139,7 +2148,6 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
__cond_resched_rwlock_write(lock); \
})
-#ifndef CONFIG_PREEMPT_RT
static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
{
lockdep_assert_held_once(&p->blocked_lock);
@@ -2152,6 +2160,13 @@ static inline struct mutex *get_task_blocked_on(struct task_struct *p)
return __get_task_blocked_on(p);
}
+static inline void __force_blocked_on_blocked(struct task_struct *p);
+static inline void __force_blocked_on_runnable(struct task_struct *p);
+
+/*
+ * These helpers set and clear the task blocked_on pointer, as well
+ * as setting the initial blocked_on_state, or clearing it
+ */
static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
{
WARN_ON_ONCE(!m);
@@ -2161,24 +2176,23 @@ static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
lockdep_assert_held_once(&p->blocked_lock);
/*
* Check ensure we don't overwrite existing mutex value
- * with a different mutex. Note, setting it to the same
- * lock repeatedly is ok.
+ * with a different mutex.
*/
- WARN_ON_ONCE(p->blocked_on && p->blocked_on != m);
+ WARN_ON_ONCE(p->blocked_on);
p->blocked_on = m;
+ __force_blocked_on_blocked(p);
}
static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *m)
{
+ /* The task should only be clearing itself */
+ WARN_ON_ONCE(p != current);
/* Currently we serialize blocked_on under the task::blocked_lock */
lockdep_assert_held_once(&p->blocked_lock);
- /*
- * There may be cases where we re-clear already cleared
- * blocked_on relationships, but make sure we are not
- * clearing the relationship with a different lock.
- */
- WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m);
+ /* Make sure we are clearing the relationship with the right lock */
+ WARN_ON_ONCE(m && p->blocked_on != m);
p->blocked_on = NULL;
+ __force_blocked_on_runnable(p);
}
static inline void clear_task_blocked_on(struct task_struct *p, struct mutex *m)
@@ -2186,15 +2200,65 @@ static inline void clear_task_blocked_on(struct task_struct *p, struct mutex *m)
guard(raw_spinlock_irqsave)(&p->blocked_lock);
__clear_task_blocked_on(p, m);
}
-#else
-static inline void __clear_task_blocked_on(struct task_struct *p, struct rt_mutex *m)
+
+/*
+ * The following helpers manage the blocked_on_state transitions while
+ * the blocked_on pointer is set.
+ */
+#ifdef CONFIG_SCHED_PROXY_EXEC
+static inline void __force_blocked_on_blocked(struct task_struct *p)
+{
+ lockdep_assert_held(&p->blocked_lock);
+ p->blocked_on_state = BO_BLOCKED;
+}
+
+static inline void __set_blocked_on_waking(struct task_struct *p)
+{
+ lockdep_assert_held(&p->blocked_lock);
+ if (p->blocked_on_state == BO_BLOCKED)
+ p->blocked_on_state = BO_WAKING;
+}
+
+static inline void set_blocked_on_waking(struct task_struct *p)
+{
+ guard(raw_spinlock_irqsave)(&p->blocked_lock);
+ __set_blocked_on_waking(p);
+}
+
+static inline void __force_blocked_on_runnable(struct task_struct *p)
{
+ lockdep_assert_held(&p->blocked_lock);
+ p->blocked_on_state = BO_RUNNABLE;
}
-static inline void clear_task_blocked_on(struct task_struct *p, struct rt_mutex *m)
+static inline void force_blocked_on_runnable(struct task_struct *p)
{
+ guard(raw_spinlock_irqsave)(&p->blocked_lock);
+ __force_blocked_on_runnable(p);
+}
+
+static inline void __set_blocked_on_runnable(struct task_struct *p)
+{
+ lockdep_assert_held(&p->blocked_lock);
+ if (p->blocked_on_state == BO_WAKING)
+ p->blocked_on_state = BO_RUNNABLE;
+}
+
+static inline void set_blocked_on_runnable(struct task_struct *p)
+{
+ if (!sched_proxy_exec())
+ return;
+ guard(raw_spinlock_irqsave)(&p->blocked_lock);
+ __set_blocked_on_runnable(p);
}
-#endif /* !CONFIG_PREEMPT_RT */
+#else /* CONFIG_SCHED_PROXY_EXEC */
+static inline void __force_blocked_on_blocked(struct task_struct *p) {}
+static inline void __set_blocked_on_waking(struct task_struct *p) {}
+static inline void set_blocked_on_waking(struct task_struct *p) {}
+static inline void __force_blocked_on_runnable(struct task_struct *p) {}
+static inline void __set_blocked_on_runnable(struct task_struct *p) {}
+static inline void set_blocked_on_runnable(struct task_struct *p) {}
+#endif /* CONFIG_SCHED_PROXY_EXEC */
static __always_inline bool need_resched(void)
{
diff --git a/init/init_task.c b/init/init_task.c
index 7e29d86153d9f..63b66b4aa585a 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -174,6 +174,9 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
.mems_allowed_seq = SEQCNT_SPINLOCK_ZERO(init_task.mems_allowed_seq,
&init_task.alloc_lock),
#endif
+#ifdef CONFIG_SCHED_PROXY_EXEC
+ .blocked_on_state = BO_RUNNABLE,
+#endif
#ifdef CONFIG_RT_MUTEXES
.pi_waiters = RB_ROOT_CACHED,
.pi_top_task = NULL,
diff --git a/kernel/fork.c b/kernel/fork.c
index 796cfceb2bbda..d8eb66e5be918 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2130,6 +2130,9 @@ __latent_entropy struct task_struct *copy_process(
#endif
p->blocked_on = NULL; /* not blocked yet */
+#ifdef CONFIG_SCHED_PROXY_EXEC
+ p->blocked_on_state = BO_RUNNABLE;
+#endif
#ifdef CONFIG_BCACHE
p->sequential_io = 0;
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index c44fc63d4476e..d8cf2e9a22a65 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -682,11 +682,9 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
raw_spin_lock_irqsave(&lock->wait_lock, flags);
raw_spin_lock(&current->blocked_lock);
/*
- * As we likely have been woken up by task
- * that has cleared our blocked_on state, re-set
- * it to the lock we are trying to acquire.
+ * Re-set blocked_on_state as unlock path set it to WAKING/RUNNABLE
*/
- __set_task_blocked_on(current, lock);
+ __force_blocked_on_blocked(current);
set_current_state(state);
/*
* Here we order against unlock; we must either see it change
@@ -705,7 +703,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
* and clear blocked on so we don't become unselectable
* to run.
*/
- __clear_task_blocked_on(current, lock);
+ __force_blocked_on_runnable(current);
raw_spin_unlock(&current->blocked_lock);
raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
@@ -714,7 +712,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
raw_spin_lock_irqsave(&lock->wait_lock, flags);
raw_spin_lock(&current->blocked_lock);
- __set_task_blocked_on(current, lock);
+ __force_blocked_on_blocked(current);
if (opt_acquired)
break;
@@ -966,8 +964,11 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
next = waiter->task;
+ raw_spin_lock(&next->blocked_lock);
debug_mutex_wake_waiter(lock, waiter);
- clear_task_blocked_on(next, lock);
+ WARN_ON_ONCE(__get_task_blocked_on(next) != lock);
+ __set_blocked_on_waking(next);
+ raw_spin_unlock(&next->blocked_lock);
wake_q_add(&wake_q, next);
}
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index e4a81790ea7dd..f34363615eb34 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -285,11 +285,11 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
debug_mutex_wake_waiter(lock, waiter);
#endif
/*
- * When waking up the task to die, be sure to clear the
- * blocked_on pointer. Otherwise we can see circular
- * blocked_on relationships that can't resolve.
+ * When waking up the task to die, be sure to set the
+ * blocked_on_state to BO_WAKING. Otherwise we can see
+ * circular blocked_on relationships that can't resolve.
*/
- clear_task_blocked_on(waiter->task, lock);
+ set_blocked_on_waking(waiter->task);
wake_q_add(wake_q, waiter->task);
}
@@ -339,15 +339,11 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
*/
if (owner != current) {
/*
- * When waking up the task to wound, be sure to clear the
- * blocked_on pointer. Otherwise we can see circular
- * blocked_on relationships that can't resolve.
- *
- * NOTE: We pass NULL here instead of lock, because we
- * are waking the mutex owner, who may be currently
- * blocked on a different mutex.
+ * When waking up the task to wound, be sure to set the
+ * blocked_on_state to BO_WAKING. Otherwise we can see
+ * circular blocked_on relationships that can't resolve.
*/
- clear_task_blocked_on(owner, NULL);
+ set_blocked_on_waking(owner);
wake_q_add(wake_q, owner);
}
return true;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 007459d42ae4a..abecd2411e29e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4328,6 +4328,12 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
ttwu_queue(p, cpu, wake_flags);
}
out:
+ /*
+ * For now, if we've been woken up, set us as BO_RUNNABLE
+ * We will need to be more careful later when handling
+ * proxy migration
+ */
+ set_blocked_on_runnable(p);
if (success)
ttwu_stat(p, task_cpu(p), wake_flags);
@@ -6623,7 +6629,7 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *d
* as unblocked, as we aren't doing proxy-migrations
* yet (more logic will be needed then).
*/
- donor->blocked_on = NULL;
+ force_blocked_on_runnable(donor);
}
return NULL;
}
@@ -6651,6 +6657,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
int this_cpu = cpu_of(rq);
struct task_struct *p;
struct mutex *mutex;
+ enum { FOUND, DEACTIVATE_DONOR } action = FOUND;
/* Follow blocked_on chain. */
for (p = donor; task_is_blocked(p); p = owner) {
@@ -6676,20 +6683,43 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
return NULL;
}
+ /*
+ * If a ww_mutex hits the die/wound case, it marks the task as
+ * BO_WAKING and calls try_to_wake_up(), so that the mutex
+ * cycle can be broken and we avoid a deadlock.
+ *
+ * However, if at that moment, we are here on the cpu which the
+ * die/wounded task is enqueued, we might loop on the cycle as
+ * BO_WAKING still causes task_is_blocked() to return true
+ * (since we want return migration to occur before we run the
+ * task).
+ *
+ * Unfortunately since we hold the rq lock, it will block
+ * try_to_wake_up from completing and doing the return
+ * migration.
+ *
+ * So when we hit a !BO_BLOCKED task briefly schedule idle
+ * so we release the rq and let the wakeup complete.
+ */
+ if (p->blocked_on_state != BO_BLOCKED)
+ return proxy_resched_idle(rq);
+
owner = __mutex_owner(mutex);
if (!owner) {
- __clear_task_blocked_on(p, mutex);
+ __force_blocked_on_runnable(p);
return p;
}
if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
/* XXX Don't handle blocked owners/delayed dequeue yet */
- return proxy_deactivate(rq, donor);
+ action = DEACTIVATE_DONOR;
+ break;
}
if (task_cpu(owner) != this_cpu) {
/* XXX Don't handle migrations yet */
- return proxy_deactivate(rq, donor);
+ action = DEACTIVATE_DONOR;
+ break;
}
if (task_on_rq_migrating(owner)) {
@@ -6747,6 +6777,13 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
*/
}
+ /* Handle actions we need to do outside of the guard() scope */
+ switch (action) {
+ case DEACTIVATE_DONOR:
+ return proxy_deactivate(rq, donor);
+ case FOUND:
+ /* fallthrough */;
+ }
WARN_ON_ONCE(owner && !owner->on_rq);
return owner;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cf2109b67f9a3..03deb68ee5f86 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2284,13 +2284,17 @@ static inline int task_current_donor(struct rq *rq, struct task_struct *p)
return rq->donor == p;
}
+#ifdef CONFIG_SCHED_PROXY_EXEC
static inline bool task_is_blocked(struct task_struct *p)
{
if (!sched_proxy_exec())
return false;
- return !!p->blocked_on;
+ return !!p->blocked_on && p->blocked_on_state != BO_RUNNABLE;
}
+#else
+static inline bool task_is_blocked(struct task_struct *p) { return false; }
+#endif
static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
{
--
2.51.0.536.g15c5d4f767-goog