From: John Stultz <jstultz@google.com>
To: LKML <linux-kernel@vger.kernel.org>
Cc: John Stultz <jstultz@google.com>,
Joel Fernandes <joelagnelf@nvidia.com>,
Qais Yousef <qyousef@layalina.io>,
Ingo Molnar <mingo@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Juri Lelli <juri.lelli@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Valentin Schneider <vschneid@redhat.com>,
Steven Rostedt <rostedt@goodmis.org>,
Ben Segall <bsegall@google.com>,
Zimuzo Ezeozue <zezeozue@google.com>,
Mel Gorman <mgorman@suse.de>, Will Deacon <will@kernel.org>,
Waiman Long <longman@redhat.com>,
Boqun Feng <boqun.feng@gmail.com>,
"Paul E. McKenney" <paulmck@kernel.org>,
Metin Kaya <Metin.Kaya@arm.com>,
Xuewen Yan <xuewen.yan94@gmail.com>,
K Prateek Nayak <kprateek.nayak@amd.com>,
Thomas Gleixner <tglx@linutronix.de>,
Daniel Lezcano <daniel.lezcano@linaro.org>,
Suleiman Souhlal <suleiman@google.com>,
kuyo chang <kuyo.chang@mediatek.com>, hupu <hupu.gm@gmail.com>,
kernel-team@android.com
Subject: [PATCH v27 01/10] sched: Rework pick_next_task() and prev_balance() to avoid stale prev references
Date: Sat, 4 Apr 2026 05:36:18 +0000 [thread overview]
Message-ID: <20260404053632.1729280-2-jstultz@google.com> (raw)
In-Reply-To: <20260404053632.1729280-1-jstultz@google.com>
Historically, the prev value from __schedule() was the rq->curr.
This prev value is passed down through numerous functions, and
used in the class scheduler implementations. The fact that
prev was on_cpu until the end of __schedule(), meant it was
stable across the rq lock drops that the class->pick_next_task()
and ->balance() implementations often do.
However, with proxy-exec, the prev passed to functions called
by __schedule() is rq->donor, which may not be the same as
rq->curr and may not be on_cpu, this makes the prev value
potentially unstable across rq lock drops.
A recently found issue with proxy-exec, is when we begin doing
return migration from try_to_wake_up(), its possible we may be
waking up the rq->donor. When we do this, we proxy_resched_idle()
to put_prev_set_next() setting the rq->donor to rq->idle, allowing
the rq->donor to be return migrated and allowed to run.
This however runs into trouble, as on another cpu we might be in
the middle of calling __schedule(). Conceptually the rq lock is
held for the majority of the time, but in calling pick_next_task()
its possible the class->pick_next_task() handler or the
->balance() call may briefly drop the rq lock. This opens a
window for try_to_wake_up() to wake and return migrate the
rq->donor before the class logic reacquires the rq lock.
Unfortunately pick_next_task() and prev_balance() pass in a prev
argument, to which we pass rq->donor. However this prev value can
now become stale and incorrect across a rq lock drop.
So, to correct this, rework the pick_next_task() and
prev_balance() calls so that they do not take a "prev" argument.
Also rework the class ->pick_next_task() and ->balance()
implementations to drop the prev argument, and in the cases
where it was used, and have the class functions reference
rq->donor directly, and not save the value across rq lock drops
so that we don't end up with a stale references.
Signed-off-by: John Stultz <jstultz@google.com>
---
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
kernel/sched/core.c | 37 ++++++++++++++++++-------------------
kernel/sched/deadline.c | 8 +++++++-
kernel/sched/fair.c | 9 +++++++--
kernel/sched/idle.c | 2 +-
kernel/sched/rt.c | 8 +++++++-
kernel/sched/sched.h | 10 ++++------
kernel/sched/stop_task.c | 2 +-
7 files changed, 45 insertions(+), 31 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c15c9865299e7..9c8a769a6d109 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5907,10 +5907,9 @@ static inline void schedule_debug(struct task_struct *prev, bool preempt)
schedstat_inc(this_rq()->sched_count);
}
-static void prev_balance(struct rq *rq, struct task_struct *prev,
- struct rq_flags *rf)
+static void prev_balance(struct rq *rq, struct rq_flags *rf)
{
- const struct sched_class *start_class = prev->sched_class;
+ const struct sched_class *start_class = rq->donor->sched_class;
const struct sched_class *class;
/*
@@ -5922,7 +5921,7 @@ static void prev_balance(struct rq *rq, struct task_struct *prev,
* a runnable task of @class priority or higher.
*/
for_active_class_range(class, start_class, &idle_sched_class) {
- if (class->balance && class->balance(rq, prev, rf))
+ if (class->balance && class->balance(rq, rf))
break;
}
}
@@ -5931,7 +5930,7 @@ static void prev_balance(struct rq *rq, struct task_struct *prev,
* Pick up the highest-prio task:
*/
static inline struct task_struct *
-__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+__pick_next_task(struct rq *rq, struct rq_flags *rf)
__must_hold(__rq_lockp(rq))
{
const struct sched_class *class;
@@ -5948,28 +5947,28 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
* higher scheduling class, because otherwise those lose the
* opportunity to pull in more work from other CPUs.
*/
- if (likely(!sched_class_above(prev->sched_class, &fair_sched_class) &&
+ if (likely(!sched_class_above(rq->donor->sched_class, &fair_sched_class) &&
rq->nr_running == rq->cfs.h_nr_queued)) {
- p = pick_next_task_fair(rq, prev, rf);
+ p = pick_next_task_fair(rq, rf);
if (unlikely(p == RETRY_TASK))
goto restart;
/* Assume the next prioritized class is idle_sched_class */
if (!p) {
p = pick_task_idle(rq, rf);
- put_prev_set_next_task(rq, prev, p);
+ put_prev_set_next_task(rq, rq->donor, p);
}
return p;
}
restart:
- prev_balance(rq, prev, rf);
+ prev_balance(rq, rf);
for_each_active_class(class) {
if (class->pick_next_task) {
- p = class->pick_next_task(rq, prev, rf);
+ p = class->pick_next_task(rq, rf);
if (unlikely(p == RETRY_TASK))
goto restart;
if (p)
@@ -5979,7 +5978,7 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
if (unlikely(p == RETRY_TASK))
goto restart;
if (p) {
- put_prev_set_next_task(rq, prev, p);
+ put_prev_set_next_task(rq, rq->donor, p);
return p;
}
}
@@ -6032,7 +6031,7 @@ extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_f
static void queue_core_balance(struct rq *rq);
static struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+pick_next_task(struct rq *rq, struct rq_flags *rf)
__must_hold(__rq_lockp(rq))
{
struct task_struct *next, *p, *max;
@@ -6045,7 +6044,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
bool need_sync;
if (!sched_core_enabled(rq))
- return __pick_next_task(rq, prev, rf);
+ return __pick_next_task(rq, rf);
cpu = cpu_of(rq);
@@ -6058,7 +6057,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
*/
rq->core_pick = NULL;
rq->core_dl_server = NULL;
- return __pick_next_task(rq, prev, rf);
+ return __pick_next_task(rq, rf);
}
/*
@@ -6082,7 +6081,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
goto out_set_next;
}
- prev_balance(rq, prev, rf);
+ prev_balance(rq, rf);
smt_mask = cpu_smt_mask(cpu);
need_sync = !!rq->core->core_cookie;
@@ -6264,7 +6263,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
}
out_set_next:
- put_prev_set_next_task(rq, prev, next);
+ put_prev_set_next_task(rq, rq->donor, next);
if (rq->core->core_forceidle_count && next == rq->idle)
queue_core_balance(rq);
@@ -6487,10 +6486,10 @@ static inline void sched_core_cpu_deactivate(unsigned int cpu) {}
static inline void sched_core_cpu_dying(unsigned int cpu) {}
static struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+pick_next_task(struct rq *rq, struct rq_flags *rf)
__must_hold(__rq_lockp(rq))
{
- return __pick_next_task(rq, prev, rf);
+ return __pick_next_task(rq, rf);
}
#endif /* !CONFIG_SCHED_CORE */
@@ -7038,7 +7037,7 @@ static void __sched notrace __schedule(int sched_mode)
pick_again:
assert_balance_callbacks_empty(rq);
- next = pick_next_task(rq, rq->donor, &rf);
+ next = pick_next_task(rq, &rf);
rq->next_class = next->sched_class;
if (sched_proxy_exec()) {
struct task_struct *prev_donor = rq->donor;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 27359a1e995f9..7352506208287 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2509,8 +2509,14 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
resched_curr(rq);
}
-static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
+static int balance_dl(struct rq *rq, struct rq_flags *rf)
{
+ /*
+ * Note, rq->donor may change during rq lock drops,
+ * so don't re-use prev across lock drops
+ */
+ struct task_struct *p = rq->donor;
+
if (!on_dl_rq(&p->dl) && need_pull_dl_task(rq, p)) {
/*
* This is OK, because current is on_cpu, which avoids it being
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 597ce5b718d26..4a6669c517dae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9153,14 +9153,19 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first);
struct task_struct *
-pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+pick_next_task_fair(struct rq *rq, struct rq_flags *rf)
__must_hold(__rq_lockp(rq))
{
struct sched_entity *se;
- struct task_struct *p;
+ struct task_struct *p, *prev;
int new_tasks;
again:
+ /*
+ * Re-read rq->donor at the top as it may have
+ * changed across a rq lock drop
+ */
+ prev = rq->donor;
p = pick_task_fair(rq, rf);
if (!p)
goto idle;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index a83be0c834ddb..ff39120d723a9 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -462,7 +462,7 @@ select_task_rq_idle(struct task_struct *p, int cpu, int flags)
}
static int
-balance_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+balance_idle(struct rq *rq, struct rq_flags *rf)
{
return WARN_ON_ONCE(1);
}
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 4e5f1957b91b1..3fd03a836731e 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1596,8 +1596,14 @@ static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
resched_curr(rq);
}
-static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
+static int balance_rt(struct rq *rq, struct rq_flags *rf)
{
+ /*
+ * Note, rq->donor may change during rq lock drops,
+ * so don't re-use p across lock drops
+ */
+ struct task_struct *p = rq->donor;
+
if (!on_rt_rq(&p->rt) && need_pull_rt_task(rq, p)) {
/*
* This is OK, because current is on_cpu, which avoids it being
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9594355a36811..8ee82b03a8a10 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2550,7 +2550,7 @@ struct sched_class {
/*
* schedule/pick_next_task/prev_balance: rq->lock
*/
- int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
+ int (*balance)(struct rq *rq, struct rq_flags *rf);
/*
* schedule/pick_next_task: rq->lock
@@ -2561,12 +2561,11 @@ struct sched_class {
*
* next = pick_task();
* if (next) {
- * put_prev_task(prev);
+ * put_prev_task(rq->donor);
* set_next_task_first(next);
* }
*/
- struct task_struct *(*pick_next_task)(struct rq *rq, struct task_struct *prev,
- struct rq_flags *rf);
+ struct task_struct *(*pick_next_task)(struct rq *rq, struct rq_flags *rf);
/*
* sched_change:
@@ -2790,8 +2789,7 @@ static inline bool sched_fair_runnable(struct rq *rq)
return rq->cfs.nr_queued > 0;
}
-extern struct task_struct *pick_next_task_fair(struct rq *rq, struct task_struct *prev,
- struct rq_flags *rf);
+extern struct task_struct *pick_next_task_fair(struct rq *rq, struct rq_flags *rf);
extern struct task_struct *pick_task_idle(struct rq *rq, struct rq_flags *rf);
#define SCA_CHECK 0x01
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index f95798baddebb..c909ca0d8c87c 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -16,7 +16,7 @@ select_task_rq_stop(struct task_struct *p, int cpu, int flags)
}
static int
-balance_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+balance_stop(struct rq *rq, struct rq_flags *rf)
{
return sched_stop_runnable(rq);
}
--
2.53.0.1213.gd9a14994de-goog
next prev parent reply other threads:[~2026-04-04 5:36 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-04 5:36 [PATCH v27 00/10] Optimized Donor Migration for Proxy Execution John Stultz
2026-04-04 5:36 ` John Stultz [this message]
2026-04-04 5:36 ` [PATCH v27 02/10] sched: Avoid donor->sched_class->yield_task() null traversal John Stultz
2026-04-04 5:57 ` K Prateek Nayak
2026-04-04 6:09 ` John Stultz
2026-04-04 5:36 ` [PATCH v27 03/10] sched: deadline: Add some helper variables to cleanup deadline logic John Stultz
2026-04-04 5:36 ` [PATCH v27 04/10] sched: deadline: Add dl_rq->curr pointer to address issues with Proxy Exec John Stultz
2026-04-04 5:36 ` [PATCH v27 05/10] sched: Rework block_task so it can be directly called John Stultz
2026-04-04 5:36 ` [PATCH v27 06/10] sched: Have try_to_wake_up() handle return-migration for PROXY_WAKING case John Stultz
2026-04-04 5:36 ` [PATCH v27 07/10] sched/core: Reset the donor to current task when donor is woken John Stultz
2026-04-04 5:36 ` [PATCH v27 08/10] sched: Add blocked_donor link to task for smarter mutex handoffs John Stultz
2026-04-04 5:36 ` [PATCH v27 09/10] sched: Break out core of attach_tasks() helper into sched.h John Stultz
2026-04-04 5:36 ` [PATCH v27 10/10] sched: Migrate whole chain in proxy_migrate_task() John Stultz
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260404053632.1729280-2-jstultz@google.com \
--to=jstultz@google.com \
--cc=Metin.Kaya@arm.com \
--cc=boqun.feng@gmail.com \
--cc=bsegall@google.com \
--cc=daniel.lezcano@linaro.org \
--cc=dietmar.eggemann@arm.com \
--cc=hupu.gm@gmail.com \
--cc=joelagnelf@nvidia.com \
--cc=juri.lelli@redhat.com \
--cc=kernel-team@android.com \
--cc=kprateek.nayak@amd.com \
--cc=kuyo.chang@mediatek.com \
--cc=linux-kernel@vger.kernel.org \
--cc=longman@redhat.com \
--cc=mgorman@suse.de \
--cc=mingo@redhat.com \
--cc=paulmck@kernel.org \
--cc=peterz@infradead.org \
--cc=qyousef@layalina.io \
--cc=rostedt@goodmis.org \
--cc=suleiman@google.com \
--cc=tglx@linutronix.de \
--cc=vincent.guittot@linaro.org \
--cc=vschneid@redhat.com \
--cc=will@kernel.org \
--cc=xuewen.yan94@gmail.com \
--cc=zezeozue@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.