From: Peter Zijlstra <peterz@infradead.org>
To: John Stultz <jstultz@google.com>
Cc: LKML <linux-kernel@vger.kernel.org>,
Vineeth Pillai <vineethrp@google.com>,
Sonam Sanju <sonam.sanju@intel.com>,
Sean Christopherson <seanjc@google.com>,
Kunwu Chan <kunwu.chan@linux.dev>, Tejun Heo <tj@kernel.org>,
Joel Fernandes <joelagnelf@nvidia.com>,
Qais Yousef <qyousef@layalina.io>, Ingo Molnar <mingo@redhat.com>,
Juri Lelli <juri.lelli@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Valentin Schneider <vschneid@redhat.com>,
Steven Rostedt <rostedt@goodmis.org>,
Will Deacon <will@kernel.org>, Waiman Long <longman@redhat.com>,
Boqun Feng <boqun.feng@gmail.com>,
"Paul E. McKenney" <paulmck@kernel.org>,
Metin Kaya <Metin.Kaya@arm.com>,
Xuewen Yan <xuewen.yan94@gmail.com>,
K Prateek Nayak <kprateek.nayak@amd.com>,
Thomas Gleixner <tglx@linutronix.de>,
Daniel Lezcano <daniel.lezcano@linaro.org>,
Suleiman Souhlal <suleiman@google.com>,
kuyo chang <kuyo.chang@mediatek.com>, hupu <hupu.gm@gmail.com>,
kernel-team@android.com
Subject: Re: [PATCH 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
Date: Tue, 28 Apr 2026 13:18:33 +0200
Message-ID: <20260428111833.GL3102924@noisy.programming.kicks-ass.net>
In-Reply-To: <20260428094353.GB1026330@noisy.programming.kicks-ass.net>

On Tue, Apr 28, 2026 at 11:43:53AM +0200, Peter Zijlstra wrote:
> On Mon, Apr 27, 2026 at 06:38:40PM +0000, John Stultz wrote:
>
> > kernel/sched/core.c | 11 +++++++++++
> > 1 file changed, 11 insertions(+)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index da20fb6ea25ae..5f684caefd8b2 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -7097,6 +7097,17 @@ static void __sched notrace __schedule(int sched_mode)
> > try_to_block_task(rq, prev, &prev_state,
> > !task_is_blocked(prev));
> > switch_count = &prev->nvcsw;
> > + } else if (preempt && prev->blocked_on) {
> > + /*
> > + * If we are SM_PREEMPT, we may have interrupted
> > + * after blocked_on was set, before schedule()
> > + * was run, preventing workques from running. So
>
> workqueues
>
> > + * clear blocked_on and mark task RUNNING so it
> > + * can be reselected to run and complete its
> > + * logic
> > + */
> > + WRITE_ONCE(prev->__state, TASK_RUNNING);
> > + clear_task_blocked_on(prev, NULL);
> > }
> >
> > pick_again:
>
> *groan*, this feels wrong. Preemption should never touch state. Let me
> try and wake up and make sense of this.

So all non-special block states *SHOULD* be in a loop and handle
spurious wakeups -- I fixed a pile of offenders many years ago, but
there really isn't anything in the kernel that validates this.

[ I suppose someone could try and do a cocci test for this? ]

Any wait for non-special states that is not a loop is fundamentally
broken, since many of the lock wake-up paths are explicitly racy in that
they can cause spurious wakeups (which is the safe side of the race,
since insufficient wakeups is bad etc.).
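
To make the loop requirement concrete, here is a rough userspace sketch. It is a toy model only: toy_waiter, toy_schedule(), toy_wait_looped() and toy_wait_once() are made-up names, not kernel API, and the "wakeups" are replayed from a script rather than delivered by another task. It only illustrates why a wait that re-checks its condition survives spurious wakeups and a single-shot wait does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model; none of these names exist in the kernel. */
enum toy_state { TOY_RUNNING, TOY_UNINTERRUPTIBLE };

struct toy_waiter {
	enum toy_state state;
	bool cond;		/* the condition the waiter sleeps on */
	const bool *wakeups;	/* script: true = real wakeup, false = spurious */
	size_t nr_wakeups;
	size_t pos;
};

/* Stand-in for schedule(): "sleep" until the next scripted wakeup. */
static void toy_schedule(struct toy_waiter *w)
{
	/* The waiter must have set a sleeping state before calling this. */
	assert(w->state != TOY_RUNNING);

	/* An exhausted script counts as a real wakeup so the model ends. */
	bool real = w->pos >= w->nr_wakeups || w->wakeups[w->pos];

	w->pos++;
	if (real)
		w->cond = true;
	w->state = TOY_RUNNING;	/* any wakeup, real or not, makes us runnable */
}

/* Looped wait: re-check the condition after every wakeup. */
static int toy_wait_looped(struct toy_waiter *w)
{
	int sleeps = 0;

	for (;;) {
		w->state = TOY_UNINTERRUPTIBLE;	/* set state, then test */
		if (w->cond)
			break;
		toy_schedule(w);
		sleeps++;
	}
	w->state = TOY_RUNNING;
	return sleeps;		/* number of times the waiter slept */
}

/* Broken single-shot wait: trusts the first wakeup it gets. */
static bool toy_wait_once(struct toy_waiter *w)
{
	w->state = TOY_UNINTERRUPTIBLE;
	if (!w->cond)
		toy_schedule(w);
	w->state = TOY_RUNNING;
	return w->cond;		/* false: continued with the condition unmet */
}
```

With a script of two spurious wakeups followed by a real one, the looped variant sleeps three times and exits with the condition true, while the single-shot variant carries on after the first spurious wakeup with the condition still false.
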
OTOH special states are special, esp. because they cannot handle
spurious wakeups.

Eg, something like:

	set_current_state(TASK_FROZEN)
	  <PREEMPT>
	    current->__state = TASK_RUNNING
	  </PREEMPT>
	schedule();

is all sorts of broken. Now, obviously special states must never have
blocked_on set, so this can be fudged about. But still, touching __state
from schedule seems wrong.

Anyway, the historical distinction between a blocked task and a
preempted task is that the blocked task is not on the runqueue, while
the preempted task is kept on the runqueue.

Obviously PE wrecks this, and hence the problem. And yeah, amazing we
never hit this before.

Something like so perhaps?
---
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 368c7b4d7cb5..0bd5da8360f3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -846,7 +846,11 @@ struct task_struct {
struct alloc_tag *alloc_tag;
#endif
- int on_cpu;
+ u8 on_cpu;
+ u8 on_rq;
+ u8 is_blocked;
+ u8 __pad;
+
struct __call_single_node wake_entry;
unsigned int wakee_flips;
unsigned long wakee_flip_decay_ts;
@@ -861,7 +865,6 @@ struct task_struct {
*/
int recent_used_cpu;
int wake_cpu;
- int on_rq;
int prio;
int static_prio;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index da20fb6ea25a..06817ae0cbd9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -615,6 +615,13 @@ EXPORT_SYMBOL(__trace_set_current_state);
* [ The astute reader will observe that it is possible for two tasks on one
* CPU to have ->on_cpu = 1 at the same time. ]
*
+ * p->is_blocked <- { 0, 1 }:
+ *
+ *   is set by block_task() and cleared by ttwu_do_activate() and indicates
+ *   this task is blocked, as opposed to runnable. Used to distinguish between
+ *   preempted and blocked tasks for proxy exec, which keeps everything on the
+ *   runqueue.
+ *
* task_cpu(p): is changed by set_task_cpu(), the rules are:
*
* - Don't call set_task_cpu() on a blocked task:
@@ -2225,6 +2232,7 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
static void block_task(struct rq *rq, struct task_struct *p, int flags)
{
+ p->is_blocked = 1;
if (dequeue_task(rq, p, DEQUEUE_SLEEP | flags))
__block_task(rq, p);
}
@@ -3722,6 +3730,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
atomic_dec(&task_rq(p)->nr_iowait);
}
+ p->is_blocked = 0;
activate_task(rq, p, en_flags);
wakeup_preempt(rq, p, wake_flags);
@@ -7107,7 +7116,7 @@ static void __sched notrace __schedule(int sched_mode)
struct task_struct *prev_donor = rq->donor;
rq_set_donor(rq, next);
- if (unlikely(next->blocked_on)) {
+ if (unlikely(next->is_blocked && next->blocked_on)) {
next = find_proxy_task(rq, next, &rf);
if (!next) {
zap_balance_callbacks(rq);
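
For illustration only, the intent of the last hunk can be sketched as a toy userspace model. The field names is_blocked and blocked_on follow the patch above; struct toy_task and all toy_*() helpers are hypothetical and vastly simpler than the real scheduler paths they are named after.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model; field names follow the patch, everything else is made up. */
struct toy_task {
	bool is_blocked;	/* went through block_task(), not yet woken */
	const void *blocked_on;	/* lock being waited for, or NULL */
};

/* The lock slow path records the lock before it ever calls schedule(). */
static void toy_set_blocked_on(struct toy_task *p, const void *lock)
{
	assert(lock != NULL);	/* the toy model uses NULL for "not waiting" */
	p->blocked_on = lock;
}

/* Mirrors block_task() in the patch: only a real block sets the flag. */
static void toy_block_task(struct toy_task *p)
{
	p->is_blocked = true;
}

/* Mirrors ttwu_do_activate() in the patch: a wakeup clears the flag. */
static void toy_wake(struct toy_task *p)
{
	p->is_blocked = false;
}

/* Old check in __schedule(): blocked_on alone sends us to the proxy path. */
static bool toy_needs_proxy_old(const struct toy_task *p)
{
	return p->blocked_on != NULL;
}

/* Proposed check: the task must also have actually blocked. */
static bool toy_needs_proxy_new(const struct toy_task *p)
{
	return p->is_blocked && p->blocked_on != NULL;
}
```

A task preempted right after toy_set_blocked_on() but before it blocks has blocked_on set and is_blocked clear: the old check treats it as blocked, the proposed check treats it as merely preempted and lets it run on to schedule().
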