Date: Tue, 28 Apr 2026 13:18:33 +0200
From: Peter Zijlstra
To: John Stultz
Cc: LKML, Vineeth Pillai, Sonam Sanju, Sean Christopherson, Kunwu Chan,
	Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider, Steven Rostedt,
	Will Deacon, Waiman Long, Boqun Feng, "Paul E. McKenney", Metin Kaya,
	Xuewen Yan, K Prateek Nayak, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team@android.com
Subject: Re: [PATCH 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
Message-ID: <20260428111833.GL3102924@noisy.programming.kicks-ass.net>
References: <20260427183848.698551-1-jstultz@google.com>
 <20260427183848.698551-2-jstultz@google.com>
 <20260428094353.GB1026330@noisy.programming.kicks-ass.net>
In-Reply-To: <20260428094353.GB1026330@noisy.programming.kicks-ass.net>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Apr 28, 2026 at 11:43:53AM +0200, Peter Zijlstra wrote:
> On Mon, Apr 27, 2026 at 06:38:40PM +0000, John Stultz wrote:
> 
> >  kernel/sched/core.c | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index da20fb6ea25ae..5f684caefd8b2 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -7097,6 +7097,17 @@ static void __sched notrace __schedule(int sched_mode)
> >  			try_to_block_task(rq, prev, &prev_state,
> >  					  !task_is_blocked(prev));
> >  			switch_count = &prev->nvcsw;
> > +		} else if (preempt && prev->blocked_on) {
> > +			/*
> > +			 * If we are SM_PREEMPT, we may have interrupted
> > +			 * after blocked_on was set, before schedule()
> > +			 * was run, preventing workques from running. So
> 
> workqueues
> 
> > +			 * clear blocked_on and mark task RUNNING so it
> > +			 * can be reselected to run and complete its
> > +			 * logic
> > +			 */
> > +			WRITE_ONCE(prev->__state, TASK_RUNNING);
> > +			clear_task_blocked_on(prev, NULL);
> >  		}
> > 
> >  pick_again:
> 
> *groan*, this feels wrong. Preemption should never touch state. Let me
> try and wake up and make sense of this.
So all non-special block states *SHOULD* be in a loop and handle
spurious wakeups -- I fixed a pile of offenders many years ago, but
there really isn't anything in the kernel that validates this.

[ I suppose someone could try and do a cocci test for this? ]

Any wait for non-special states that is not a loop is fundamentally
broken, since many of the lock wake-up paths are explicitly racy in
that they can cause spurious wakeups (which is the safe side of the
race, since insufficient wakeups is bad etc.).

OTOH special states are special, esp. because they cannot handle
spurious wakeups. Eg, consider something like:

	set_current_state(TASK_FROZEN)
	current->__state = TASK_RUNNING
	on_cpu = 1

at the same time. ]

  *
+ * p->is_blocked <- { 0, 1 }:
+ *
+ *   is set by block_task() and cleared by ttwu_do_activate() and indicates
+ *   this task is blocked, as opposed to runnable. Used to distinguish between
+ *   preempted and blocked tasks for proxy exec, which keeps everything on the
+ *   runqueue.
+ *
  * task_cpu(p): is changed by set_task_cpu(), the rules are:
  *
  *  - Don't call set_task_cpu() on a blocked task:
@@ -2225,6 +2232,7 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
 
 static void block_task(struct rq *rq, struct task_struct *p, int flags)
 {
+	p->is_blocked = 1;
 	if (dequeue_task(rq, p, DEQUEUE_SLEEP | flags))
 		__block_task(rq, p);
 }
@@ -3722,6 +3730,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 			atomic_dec(&task_rq(p)->nr_iowait);
 	}
 
+	p->is_blocked = 0;
 	activate_task(rq, p, en_flags);
 	wakeup_preempt(rq, p, wake_flags);
 
@@ -7107,7 +7116,7 @@ static void __sched notrace __schedule(int sched_mode)
 		struct task_struct *prev_donor = rq->donor;
 
 		rq_set_donor(rq, next);
-		if (unlikely(next->blocked_on)) {
+		if (unlikely(next->is_blocked && next->blocked_on)) {
 			next = find_proxy_task(rq, next, &rf);
 			if (!next) {
 				zap_balance_callbacks(rq);