From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 30 Apr 2026 21:50:46 +0000
In-Reply-To: <20260430215103.2978955-1-jstultz@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
Mime-Version: 1.0
References:
 <20260430215103.2978955-1-jstultz@google.com>
X-Mailer: git-send-email 2.54.0.545.g6539524ca2-goog
Message-ID: <20260430215103.2978955-2-jstultz@google.com>
Subject: [PATCH v2 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
From: John Stultz
To: LKML
Cc: John Stultz, Vineeth Pillai, Sonam Sanju, Sean Christopherson,
 Kunwu Chan, Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
 Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
 Valentin Schneider, Steven Rostedt, Will Deacon, Waiman Long,
 Boqun Feng, "Paul E. McKenney", Metin Kaya, Xuewen Yan,
 K Prateek Nayak, Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal,
 kuyo chang, hupu, kernel-team@android.com
Content-Type: text/plain; charset="UTF-8"

Vineeth reported seeing a KVM-related deadlock connected to workqueue
lockups using the android17-6.18 tree, which has Proxy Execution
enabled (using the full patch stack), and I've subsequently reproduced
it on v7.1-rc1.

On further debugging he found:
- The kvm-irqfd-cleanup workqueue and rcu_gp land in a per-cpu pwq
  (workqueue pool).
- One of the kvm-irqfd-cleanup workers (say A) takes a mutex and then
  calls synchronize_srcu_expedited().
- Another kvm-irqfd-cleanup worker (say B) tries to acquire the lock
  and gets blocked.
- On the way to blocking, the cpu gets an IPI, and on return from the
  IPI it calls __schedule() without having completed the workqueue
  accounting (setting worker->sleeping = 0 and decrementing
  pool->nr_running). That accounting is done in sched_submit_work() ->
  wq_worker_sleeping(), called from schedule(), but we got preempted
  before reaching it.
- Proxy execution doesn't immediately take B off the runqueue, as
  p->blocked_on is set during __mutex_lock.
- The next time B is picked for running, the scheduler notices that A
  (the mutex holder) is not on a runqueue and then blocks B:
  find_proxy_task() -> proxy_deactivate() -> block_task()
- And things are then stuck.
A is waiting for the workqueue work to be run, but B can't run the
workqueue as it is blocked on A.

The trouble is that with Proxy Execution, __mutex_lock_common() sets
the task state to TASK_UNINTERRUPTIBLE and sets blocked_on before
calling into schedule(), where sched_submit_work() will be called.
But if an IPI comes in before we call schedule(), the interrupt calls
__schedule(SM_PREEMPT) directly. This causes the scheduler to see the
current task as blocked_on and deactivate it (because the owner is off
the runqueue). Since it's deactivated, it won't be run, and it won't
get to call sched_submit_work(). And then we see workqueue stalls.

Without proxy execution, things work, as the SM_PREEMPT case prevents
the task from being dequeued; it can be reselected and run, which
allows it to finish calling into schedule() and to call
sched_submit_work() before actually blocking.

Peter didn't like my earlier attempt to solve this by clearing the
blocked_on state and marking the task __state RUNNABLE, as we
shouldn't modify __state from schedule(). So this approach is slightly
different: we use the low bit of the blocked_on pointer as a latch bit
flag. When the task sets the blocked_on pointer, we don't consider it
for use with proxy execution until the latch is set. We then only set
the latch bit in __schedule() when we are not in an SM_PREEMPT case
and are considering blocking the task.

This makes the blocked_on state machine a little more complex:

  NULL -> ptr:unlatched -> ptr:latched -> PROXY_WAKING -> NULL

With additional transitions:

  ptr:latched -> NULL    // only done on current
  ptr:unlatched -> NULL  // only done on current or when trying to set waking

NULL and ptr:unlatched are functionally equivalent except for the
ability to transition to ptr:latched.

Credit for this idea is due to Vineeth and Suleiman, who had proposed
something very similar when the issue was first reported.
As well as to Peter for suggesting it, and to K Prateek, who helped
iterate and shared an initial working version.

Many thanks to Vineeth for figuring this very obscure race out and for
implementing a test tool to make it easily reproducible!

Reported-by: Vineeth Pillai
Signed-off-by: John Stultz
---
v2:
* Switch to using an extra flag bit to ensure we don't proxy early in
  SM_PREEMPT cases, as suggested by Peter (and Vineeth and Suleiman),
  and developed with help from K Prateek

Cc: Vineeth Pillai
Cc: Sonam Sanju
Cc: Sean Christopherson
Cc: Kunwu Chan
Cc: Tejun Heo
Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E. McKenney"
Cc: Metin Kaya
Cc: Xuewen Yan
Cc: K Prateek Nayak
Cc: Thomas Gleixner
Cc: Daniel Lezcano
Cc: Suleiman Souhlal
Cc: kuyo chang
Cc: hupu
Cc: kernel-team@android.com
---
 include/linux/sched.h | 64 +++++++++++++++++++++++++++++++++++--------
 kernel/sched/core.c   | 15 ++++++----
 2 files changed, 63 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 368c7b4d7cb51..8b9e971d98f67 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2183,18 +2183,56 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock) __must_hold(lock);
 
 #ifndef CONFIG_PREEMPT_RT
 /*
- * With proxy exec, if a task has been proxy-migrated, it may be a donor
- * on a cpu that it can't actually run on. Thus we need a special state
- * to denote that the task is being woken, but that it needs to be
- * evaluated for return-migration before it is run. So if the task is
- * blocked_on PROXY_WAKING, return migrate it before running it.
+ * The proxy exec blocked_on pointer value uses the low bit as a latch
+ * value which clarifies if the blocked_on value is used for proxying or
+ * not.
+ *
+ * The state machine looks something like
+ * NULL -> ptr:unlatched -> ptr:latched -> PROXY_WAKING -> NULL
+ *
+ * With some additional transitions:
+ * ptr:unlatched -> NULL (done on current, or via set_task_blocked_on_waking())
+ * ptr:latched -> NULL (done only on current)
+ *
+ * 1) NULL and ptr:unlatched are effectively equivalent, no proxying will occur
+ * 2) ptr:latched is the state when proxying will occur
+ * 3) PROXY_WAKING is used when the task is being woken to ensure we
+ *    return-migrate proxy-migrated tasks before running them (note it has
+ *    the latch bit set).
  */
-#define PROXY_WAKING ((struct mutex *)(-1L))
+#define PROXY_BLOCKED_LATCH (1UL)
+#define PROXY_BLOCKED_ON_MASK(x) ((struct mutex *)((unsigned long)(x) & ~PROXY_BLOCKED_LATCH))
+#define PROXY_WAKING ((struct mutex *)(-1L)) /* PROXY_WAKING has LATCH bit set */
+
+static inline struct mutex *task_is_blocked_on(struct task_struct *p)
+{
+	if (!sched_proxy_exec())
+		return false;
+	return (struct mutex *)((unsigned long)p->blocked_on & PROXY_BLOCKED_LATCH);
+}
+
+static inline void __set_task_blocked_on_latched(struct task_struct *p)
+{
+	lockdep_assert_held_once(&p->blocked_lock);
+	WARN_ON_ONCE(!p->blocked_on);
+	p->blocked_on = (struct mutex *)((unsigned long)p->blocked_on | PROXY_BLOCKED_LATCH);
+}
+
+static inline struct mutex *__get_task_latched_blocked_on(struct task_struct *p)
+{
+	if (!task_is_blocked_on(p))
+		return NULL;
+	if (p->blocked_on == PROXY_WAKING)
+		return PROXY_WAKING;
+	return PROXY_BLOCKED_ON_MASK(p->blocked_on);
+}
 
 static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
 {
 	lockdep_assert_held_once(&p->blocked_lock);
-	return p->blocked_on == PROXY_WAKING ? NULL : p->blocked_on;
+	if (p->blocked_on == PROXY_WAKING)
+		return NULL;
+	return PROXY_BLOCKED_ON_MASK(p->blocked_on);
 }
 
 static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
@@ -2215,6 +2253,8 @@ static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
 
 static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *m)
 {
+	struct mutex *bo = p->blocked_on;
+
 	/* Currently we serialize blocked_on under the task::blocked_lock */
 	lockdep_assert_held_once(&p->blocked_lock);
 	/*
@@ -2222,7 +2262,7 @@ static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *
 	 * blocked_on relationships, but make sure we are not
 	 * clearing the relationship with a different lock.
 	 */
-	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
+	WARN_ON_ONCE(m && bo && __get_task_blocked_on(p) != m && bo != PROXY_WAKING);
 	p->blocked_on = NULL;
 }
 
@@ -2242,15 +2282,17 @@ static inline void __set_task_blocked_on_waking(struct task_struct *p, struct mu
 		return;
 	}
 
-	/* Don't set PROXY_WAKING if blocked_on was already cleared */
-	if (!p->blocked_on)
+	/* Don't set PROXY_WAKING if we are not really blocked_on */
+	if (!task_is_blocked_on(p)) {
+		p->blocked_on = NULL; /* clear if unlatched */
 		return;
+	}
 
 	/*
 	 * There may be cases where we set PROXY_WAKING on tasks that were
 	 * already set to waking, but make sure we are not changing
 	 * the relationship with a different lock.
 	 */
-	WARN_ON_ONCE(m && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
+	WARN_ON_ONCE(m && __get_task_blocked_on(p) != m && p->blocked_on != PROXY_WAKING);
 	p->blocked_on = PROXY_WAKING;
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index da20fb6ea25ae..2f912bf698446 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6599,8 +6599,13 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 	 * blocked on a mutex, and we want to keep it on the runqueue
 	 * to be selectable for proxy-execution.
 	 */
-	if (!should_block)
-		return false;
+	if (!should_block) {
+		guard(raw_spinlock)(&p->blocked_lock);
+		if (p->blocked_on) {
+			__set_task_blocked_on_latched(p);
+			return false;
+		}
+	}
 
 	p->sched_contributes_to_load =
 		(task_state & TASK_UNINTERRUPTIBLE) &&
@@ -6833,7 +6838,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 	int owner_cpu;
 
 	/* Follow blocked_on chain. */
-	for (p = donor; (mutex = p->blocked_on); p = owner) {
+	for (p = donor; (mutex = __get_task_latched_blocked_on(p)); p = owner) {
 		/* if its PROXY_WAKING, do return migration or run if current */
 		if (mutex == PROXY_WAKING) {
 			if (task_current(rq, p)) {
@@ -6851,7 +6856,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 		guard(raw_spinlock)(&p->blocked_lock);
 
 		/* Check again that p is blocked with blocked_lock held */
-		if (mutex != __get_task_blocked_on(p)) {
+		if (mutex != __get_task_latched_blocked_on(p)) {
 			/*
 			 * Something changed in the blocked_on chain and
 			 * we don't know if only at this level. So, let's
@@ -7107,7 +7112,7 @@ static void __sched notrace __schedule(int sched_mode)
 	struct task_struct *prev_donor = rq->donor;
 
 	rq_set_donor(rq, next);
-	if (unlikely(next->blocked_on)) {
+	if (unlikely(task_is_blocked_on(next))) {
 		next = find_proxy_task(rq, next, &rf);
 		if (!next) {
 			zap_balance_callbacks(rq);
-- 
2.54.0.545.g6539524ca2-goog