Message-ID: <9cf9b433-cba5-4a8e-8dbf-6410239cffb6@amd.com>
Date: Tue, 28 Apr 2026 18:45:39 +0530
From: K Prateek Nayak
Subject: Re: [PATCH 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
To: Peter Zijlstra, John Stultz
CC: LKML, Vineeth Pillai, Sonam Sanju, Sean Christopherson, Kunwu Chan, Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu
X-Mailing-List: linux-kernel@vger.kernel.org
References: <20260427183848.698551-1-jstultz@google.com> <20260427183848.698551-2-jstultz@google.com> <20260428094353.GB1026330@noisy.programming.kicks-ass.net> <20260428111833.GL3102924@noisy.programming.kicks-ass.net>
In-Reply-To: <20260428111833.GL3102924@noisy.programming.kicks-ass.net>
Content-Type: text/plain; charset="UTF-8"
Hello Peter,

On 4/28/2026 4:48 PM, Peter Zijlstra wrote:
> On Tue, Apr 28, 2026 at 11:43:53AM +0200, Peter Zijlstra wrote:
>> On Mon, Apr 27, 2026 at 06:38:40PM +0000, John Stultz wrote:
>>
>>>  kernel/sched/core.c | 11 +++++++++++
>>>  1 file changed, 11 insertions(+)
>>>
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index da20fb6ea25ae..5f684caefd8b2 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -7097,6 +7097,17 @@ static void __sched notrace __schedule(int sched_mode)
>>>  			try_to_block_task(rq, prev, &prev_state,
>>>  					  !task_is_blocked(prev));
>>>  			switch_count = &prev->nvcsw;
>>> +		} else if (preempt && prev->blocked_on) {
>>> +			/*
>>> +			 * If we are SM_PREEMPT, we may have interrupted
>>> +			 * after blocked_on was set, before schedule()
>>> +			 * was run, preventing workques from running. So
>>
>> workqueues
>>
>>> +			 * clear blocked_on and mark task RUNNING so it
>>> +			 * can be reselected to run and complete its
>>> +			 * logic
>>> +			 */
>>> +			WRITE_ONCE(prev->__state, TASK_RUNNING);
>>> +			clear_task_blocked_on(prev, NULL);
>>>  		}
>>>
>>>  pick_again:
>>
>> *groan*, this feels wrong. Preemption should never touch state. Let me
>> try and wake up and make sense of this.
>
> So all non-special block states *SHOULD* be in a loop and handle
> spurious wakeups -- I fixed a pile of offenders some many years ago, but
> there really isn't anything in the kernel that validates this.
>
> [ I suppose someone could try and do a cocci test for this? ]
>
> Any wait for non-special states that is not a loop is fundamentally
> broken, since many of the lock wake-up paths are explicitly racy in that
> they can cause spurious wakeups (which is the safe side of the race,
> since insufficient wakeups is bad etc.).
>
> OTOH special states, are special, esp. because they cannot handle
> spurious wakeups.
>
> Eg, consider something like:
>
> 	set_current_state(TASK_FROZEN)
>
> 	current->__state = TASK_RUNNING
> 	schedule();
>
> is all sorts of broken. Now, obviously special states must never have
> blocked_on set, so this can be fudged about. But still, touching __state
> from schedule seems wrong.
>
> Anyway, the historical distinction between a blocked task and a
> preempted task is that the blocked task is not on the runqueue, while
> the preempted task is kept on the runqueue.
>
> Obviously PE wrecks this, and hence the problem. And yeah, amazing we
> never hit this before.
>
> Something like so perhaps?
>
> ---
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 368c7b4d7cb5..0bd5da8360f3 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -846,7 +846,11 @@ struct task_struct {
>  	struct alloc_tag		*alloc_tag;
>  #endif
>
> -	int				on_cpu;
> +	u8				on_cpu;
> +	u8				on_rq;
> +	u8				is_blocked;
> +	u8				__pad;
> +
>  	struct __call_single_node	wake_entry;
>  	unsigned int			wakee_flips;
>  	unsigned long			wakee_flip_decay_ts;
> @@ -861,7 +865,6 @@ struct task_struct {
>  	 */
>  	int				recent_used_cpu;
>  	int				wake_cpu;
> -	int				on_rq;
>
>  	int				prio;
>  	int				static_prio;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index da20fb6ea25a..06817ae0cbd9 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -615,6 +615,13 @@ EXPORT_SYMBOL(__trace_set_current_state);
>   * [ The astute reader will observe that it is possible for two tasks on one
>   *   CPU to have ->on_cpu = 1 at the same time. ]
>   *
> + * p->is_blocked <- { 0, 1 }:
> + *
> + *   is set by block_task() and cleared by ttwu_do_activate() and indicates
> + *   this task is blocked, as opposed to runnable. Used to distinguish between
> + *   preempted and blocked tasks for proxy exec, which keeps everything on the
> + *   runqueue.
> + *
>   * task_cpu(p): is changed by set_task_cpu(), the rules are:
>   *
>   *  - Don't call set_task_cpu() on a blocked task:
> @@ -2225,6 +2232,7 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
>
>  static void block_task(struct rq *rq, struct task_struct *p, int flags)
>  {
> +	p->is_blocked = 1;

We never reach here with PROXY_EXEC. Instead we bail out in the caller
try_to_block_task() ...
>  	if (dequeue_task(rq, p, DEQUEUE_SLEEP | flags))
>  		__block_task(rq, p);
>  }
> @@ -3722,6 +3730,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
>  			atomic_dec(&task_rq(p)->nr_iowait);
>  	}
>
> +	p->is_blocked = 0;
>  	activate_task(rq, p, en_flags);
>  	wakeup_preempt(rq, p, wake_flags);
>
> @@ -7107,7 +7116,7 @@ static void __sched notrace __schedule(int sched_mode)
>  		struct task_struct *prev_donor = rq->donor;
>
>  		rq_set_donor(rq, next);
> -		if (unlikely(next->blocked_on)) {
> +		if (unlikely(next->is_blocked && next->blocked_on)) {

... so ->is_blocked here is always false for proxy tasks retained on the
runqueue.

I was trying something like below but I'm somewhere missing a
clear_task_blocked_on() for PROXY_WAKING before going back into
mutex_lock_common():

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8ec3b6d7d718b..6ea74aecc5fbd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -586,6 +586,7 @@ struct sched_entity {
 	unsigned char			sched_delayed;
 	unsigned char			rel_deadline;
 	unsigned char			custom_slice;
+	unsigned char			sched_proxy;
 					/* hole */
 
 	u64				exec_start;
@@ -2222,6 +2223,7 @@ static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *
 	 * clearing the relationship with a different lock.
 	 */
 	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
+	WRITE_ONCE(p->se.sched_proxy, 0);
 	p->blocked_on = NULL;
 }
@@ -2250,6 +2252,8 @@ static inline void __set_task_blocked_on_waking(struct task_struct *p, struct mu
 	 * the relationship with a different lock.
 	 */
 	WARN_ON_ONCE(m && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
+	/* Force the task down proxy_force_return() path. */
+	WRITE_ONCE(p->se.sched_proxy, 1);
 	p->blocked_on = PROXY_WAKING;
 }
diff --git a/init/init_task.c b/init/init_task.c
index b5f48ebdc2b6e..8e8fc680fcd21 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -118,6 +118,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	},
 	.se		= {
 		.group_node	= LIST_HEAD_INIT(init_task.se.group_node),
+		.sched_proxy	= 0,
 	},
 	.rt		= {
 		.run_list	= LIST_HEAD_INIT(init_task.rt.run_list),
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 49cd5d2171613..8142fba59ad94 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4395,6 +4395,7 @@ static void __sched_fork(u64 clone_flags, struct task_struct *p)
 	p->se.nr_migrations		= 0;
 	p->se.vruntime			= 0;
 	p->se.vlag			= 0;
+	p->se.sched_proxy		= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
 	/* A delayed task cannot be in clone(). */
@@ -6535,8 +6536,13 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 	 * blocked on a mutex, and we want to keep it on the runqueue
 	 * to be selectable for proxy-execution.
 	 */
-	if (!should_block)
+	if (!should_block) {
+		guard(raw_spinlock)(&p->blocked_lock);
+		/* Stable against race */
+		if (task_is_blocked(p))
+			WRITE_ONCE(p->se.sched_proxy, 1);
 		return false;
+	}
 
 	p->sched_contributes_to_load =
 		(task_state & TASK_UNINTERRUPTIBLE) &&
@@ -6765,11 +6771,15 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 	bool curr_in_chain = false;
 	int this_cpu = cpu_of(rq);
 	struct task_struct *p;
-	struct mutex *mutex;
 	int owner_cpu;
 
 	/* Follow blocked_on chain. */
-	for (p = donor; (mutex = p->blocked_on); p = owner) {
+	for (p = donor; READ_ONCE(p->se.sched_proxy); p = owner) {
+		struct mutex *mutex = p->blocked_on;
+
+		if (!mutex)
+			return NULL;
+
 		/* if its PROXY_WAKING, do return migration or run if current */
 		if (mutex == PROXY_WAKING) {
 			if (task_current(rq, p)) {
@@ -6787,7 +6797,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 		guard(raw_spinlock)(&p->blocked_lock);
 
 		/* Check again that p is blocked with blocked_lock held */
-		if (mutex != __get_task_blocked_on(p)) {
+		if (!p->se.sched_proxy || mutex != __get_task_blocked_on(p)) {
 			/*
 			 * Something changed in the blocked_on chain and
 			 * we don't know if only at this level. So, let's
@@ -7044,7 +7054,7 @@ static void __sched notrace __schedule(int sched_mode)
 		struct task_struct *prev_donor = rq->donor;
 
 		rq_set_donor(rq, next);
-		if (unlikely(next->blocked_on)) {
+		if (unlikely(READ_ONCE(next->se.sched_proxy))) {
 			next = find_proxy_task(rq, next, &rf);
 			if (!next) {
 				zap_balance_callbacks(rq);
---

>  			next = find_proxy_task(rq, next, &rf);
>  			if (!next) {
>  				zap_balance_callbacks(rq);

-- 
Thanks and Regards,
Prateek