From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 22 Apr 2026 23:06:42 +0000
In-Reply-To: <20260422230659.903191-1-jstultz@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
Mime-Version: 1.0
References:
 <20260422230659.903191-1-jstultz@google.com>
X-Mailer: git-send-email 2.54.0.rc2.533.g4f5dca5207-goog
Message-ID: <20260422230659.903191-2-jstultz@google.com>
Subject: [PATCH v28 1/8] sched: Rework pick_next_task() and prev_balance() to avoid stale prev references
From: John Stultz
To: LKML
Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Will Deacon, Waiman Long, Boqun Feng, "Paul E. McKenney", Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team@android.com
Content-Type: text/plain; charset="UTF-8"

Historically, the prev value from __schedule() was rq->curr. This prev value is passed down through numerous functions and used in the scheduler class implementations. Because prev remained on_cpu until the end of __schedule(), it was stable across the rq lock drops that the class->pick_next_task() and ->balance() implementations often do.

However, with proxy-exec, the prev passed to functions called by __schedule() is rq->donor, which may not be the same as rq->curr and may not be on_cpu. This makes the prev value potentially unstable across rq lock drops.

A recently found issue with proxy-exec: when we begin doing return migration from try_to_wake_up(), it's possible we are waking up the rq->donor. When we do this, proxy_resched_idle() uses put_prev_set_next() to set rq->donor to rq->idle, allowing the old donor to be return migrated and allowed to run. This runs into trouble, however, as another CPU might be in the middle of calling __schedule(). Conceptually the rq lock is held for the majority of that time, but in calling pick_next_task() the class->pick_next_task() handler or the ->balance() call may briefly drop the rq lock.
This opens a window for try_to_wake_up() to wake and return migrate the rq->donor before the class logic reacquires the rq lock. Unfortunately, pick_next_task() and prev_balance() take a prev argument, to which we pass rq->donor, and that prev value can now become stale and incorrect across an rq lock drop.

To correct this, rework the pick_next_task() and prev_balance() calls so that they do not take a "prev" argument. Also rework the class ->pick_next_task() and ->balance() implementations to drop the prev argument; in the cases where it was used, have the class functions reference rq->donor directly and avoid saving the value across rq lock drops, so that we don't end up with stale references.

Signed-off-by: John Stultz
---
Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Zimuzo Ezeozue
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E. McKenney"
Cc: Metin Kaya
Cc: Xuewen Yan
Cc: K Prateek Nayak
Cc: Thomas Gleixner
Cc: Daniel Lezcano
Cc: Suleiman Souhlal
Cc: kuyo chang
Cc: hupu
Cc: kernel-team@android.com
---
 kernel/sched/core.c      | 37 ++++++++++++++++++-------------------
 kernel/sched/deadline.c  |  8 +++++++-
 kernel/sched/fair.c      |  9 +++++++--
 kernel/sched/idle.c      |  2 +-
 kernel/sched/rt.c        |  8 +++++++-
 kernel/sched/sched.h     | 10 ++++------
 kernel/sched/stop_task.c |  2 +-
 7 files changed, 45 insertions(+), 31 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index da20fb6ea25ae..3ac6dd4d3c587 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5971,10 +5971,9 @@ static inline void schedule_debug(struct task_struct *prev, bool preempt)
 	schedstat_inc(this_rq()->sched_count);
 }
 
-static void prev_balance(struct rq *rq, struct task_struct *prev,
-			 struct rq_flags *rf)
+static void prev_balance(struct rq *rq, struct rq_flags *rf)
 {
-	const struct sched_class *start_class = prev->sched_class;
+	const struct sched_class *start_class = rq->donor->sched_class;
 	const struct sched_class *class;
 
 	/*
@@ -5986,7 +5985,7 @@ static void prev_balance(struct rq *rq, struct task_struct *prev,
 	 * a runnable task of @class priority or higher.
 	 */
 	for_active_class_range(class, start_class, &idle_sched_class) {
-		if (class->balance && class->balance(rq, prev, rf))
+		if (class->balance && class->balance(rq, rf))
 			break;
 	}
 }
@@ -5995,7 +5994,7 @@ static void prev_balance(struct rq *rq, struct task_struct *prev,
  * Pick up the highest-prio task:
  */
 static inline struct task_struct *
-__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+__pick_next_task(struct rq *rq, struct rq_flags *rf)
 	__must_hold(__rq_lockp(rq))
 {
 	const struct sched_class *class;
@@ -6012,28 +6011,28 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	 * higher scheduling class, because otherwise those lose the
 	 * opportunity to pull in more work from other CPUs.
 	 */
-	if (likely(!sched_class_above(prev->sched_class, &fair_sched_class) &&
+	if (likely(!sched_class_above(rq->donor->sched_class, &fair_sched_class) &&
 		   rq->nr_running == rq->cfs.h_nr_queued)) {
 
-		p = pick_next_task_fair(rq, prev, rf);
+		p = pick_next_task_fair(rq, rf);
 		if (unlikely(p == RETRY_TASK))
 			goto restart;
 
 		/* Assume the next prioritized class is idle_sched_class */
 		if (!p) {
 			p = pick_task_idle(rq, rf);
-			put_prev_set_next_task(rq, prev, p);
+			put_prev_set_next_task(rq, rq->donor, p);
 		}
 
 		return p;
 	}
 
 restart:
-	prev_balance(rq, prev, rf);
+	prev_balance(rq, rf);
 
 	for_each_active_class(class) {
 		if (class->pick_next_task) {
-			p = class->pick_next_task(rq, prev, rf);
+			p = class->pick_next_task(rq, rf);
 			if (unlikely(p == RETRY_TASK))
 				goto restart;
 			if (p)
@@ -6043,7 +6042,7 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			if (unlikely(p == RETRY_TASK))
 				goto restart;
 			if (p) {
-				put_prev_set_next_task(rq, prev, p);
+				put_prev_set_next_task(rq, rq->donor, p);
 				return p;
 			}
 		}
@@ -6096,7 +6095,7 @@ extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_f
 static void queue_core_balance(struct rq *rq);
 
 static struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+pick_next_task(struct rq *rq, struct rq_flags *rf)
 	__must_hold(__rq_lockp(rq))
 {
 	struct task_struct *next, *p, *max;
@@ -6109,7 +6108,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	bool need_sync;
 
 	if (!sched_core_enabled(rq))
-		return __pick_next_task(rq, prev, rf);
+		return __pick_next_task(rq, rf);
 
 	cpu = cpu_of(rq);
 
@@ -6122,7 +6121,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	 */
 	rq->core_pick = NULL;
 	rq->core_dl_server = NULL;
-	return __pick_next_task(rq, prev, rf);
+	return __pick_next_task(rq, rf);
 	}
 
 	/*
@@ -6146,7 +6145,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		goto out_set_next;
 	}
 
-	prev_balance(rq, prev, rf);
+	prev_balance(rq, rf);
 
 	smt_mask = cpu_smt_mask(cpu);
 	need_sync = !!rq->core->core_cookie;
@@ -6328,7 +6327,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	}
 
 out_set_next:
-	put_prev_set_next_task(rq, prev, next);
+	put_prev_set_next_task(rq, rq->donor, next);
 
 	if (rq->core->core_forceidle_count && next == rq->idle)
 		queue_core_balance(rq);
@@ -6551,10 +6550,10 @@ static inline void sched_core_cpu_deactivate(unsigned int cpu) {}
 static inline void sched_core_cpu_dying(unsigned int cpu) {}
 
 static struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+pick_next_task(struct rq *rq, struct rq_flags *rf)
 	__must_hold(__rq_lockp(rq))
 {
-	return __pick_next_task(rq, prev, rf);
+	return __pick_next_task(rq, rf);
 }
 
 #endif /* !CONFIG_SCHED_CORE */
@@ -7101,7 +7100,7 @@ static void __sched notrace __schedule(int sched_mode)
 pick_again:
 	assert_balance_callbacks_empty(rq);
-	next = pick_next_task(rq, rq->donor, &rf);
+	next = pick_next_task(rq, &rf);
 	rq->next_class = next->sched_class;
 
 	if (sched_proxy_exec()) {
 		struct task_struct *prev_donor = rq->donor;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index edca7849b165d..f07a888314450 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2506,8 +2506,14 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
 	resched_curr(rq);
 }
 
-static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
+static int balance_dl(struct rq *rq, struct rq_flags *rf)
 {
+	/*
+	 * Note, rq->donor may change during rq lock drops,
+	 * so don't re-use prev across lock drops
+	 */
+	struct task_struct *p = rq->donor;
+
 	if (!on_dl_rq(&p->dl) && need_pull_dl_task(rq, p)) {
 		/*
 		 * This is OK, because current is on_cpu, which avoids it being
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 69361c63353ad..b843f9a876d6d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9195,14 +9195,19 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
 static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first);
 
 struct task_struct *
-pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+pick_next_task_fair(struct rq *rq, struct rq_flags *rf)
 	__must_hold(__rq_lockp(rq))
 {
 	struct sched_entity *se;
-	struct task_struct *p;
+	struct task_struct *p, *prev;
 	int new_tasks;
 
 again:
+	/*
+	 * Re-read rq->donor at the top as it may have
+	 * changed across a rq lock drop
+	 */
+	prev = rq->donor;
 	p = pick_task_fair(rq, rf);
 	if (!p)
 		goto idle;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index a83be0c834ddb..ff39120d723a9 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -462,7 +462,7 @@ select_task_rq_idle(struct task_struct *p, int cpu, int flags)
 }
 
 static int
-balance_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+balance_idle(struct rq *rq, struct rq_flags *rf)
 {
 	return WARN_ON_ONCE(1);
 }
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 4ee8faf01441a..3c5f37c858b60 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1596,8 +1596,14 @@ static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
 	resched_curr(rq);
 }
 
-static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
+static int balance_rt(struct rq *rq, struct rq_flags *rf)
 {
+	/*
+	 * Note, rq->donor may change during rq lock drops,
+	 * so don't re-use p across lock drops
+	 */
+	struct task_struct *p = rq->donor;
+
 	if (!on_rt_rq(&p->rt) && need_pull_rt_task(rq, p)) {
 		/*
 		 * This is OK, because current is on_cpu, which avoids it being
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9f63b15d309d1..2b3a97735efeb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2561,7 +2561,7 @@ struct sched_class {
 	/*
 	 * schedule/pick_next_task/prev_balance: rq->lock
 	 */
-	int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
+	int (*balance)(struct rq *rq, struct rq_flags *rf);
 
 	/*
 	 * schedule/pick_next_task: rq->lock
@@ -2572,12 +2572,11 @@ struct sched_class {
 	 *
 	 *   next = pick_task();
 	 *   if (next) {
-	 *	put_prev_task(prev);
+	 *	put_prev_task(rq->donor);
 	 *	set_next_task_first(next);
 	 *   }
 	 */
-	struct task_struct *(*pick_next_task)(struct rq *rq, struct task_struct *prev,
-					      struct rq_flags *rf);
+	struct task_struct *(*pick_next_task)(struct rq *rq, struct rq_flags *rf);
 
 	/*
 	 * sched_change:
@@ -2801,8 +2800,7 @@ static inline bool sched_fair_runnable(struct rq *rq)
 	return rq->cfs.nr_queued > 0;
 }
 
-extern struct task_struct *pick_next_task_fair(struct rq *rq, struct task_struct *prev,
-					       struct rq_flags *rf);
+extern struct task_struct *pick_next_task_fair(struct rq *rq, struct rq_flags *rf);
 extern struct task_struct *pick_task_idle(struct rq *rq, struct rq_flags *rf);
 
 #define SCA_CHECK		0x01
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index f95798baddebb..c909ca0d8c87c 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -16,7 +16,7 @@ select_task_rq_stop(struct task_struct *p, int cpu, int flags)
 }
 
 static int
-balance_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+balance_stop(struct rq *rq, struct rq_flags *rf)
 {
 	return sched_stop_runnable(rq);
 }
-- 
2.54.0.rc2.533.g4f5dca5207-goog