From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Mon, 27 Apr 2026 18:38:40 +0000
In-Reply-To: <20260427183848.698551-1-jstultz@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
Mime-Version: 1.0
References:
<20260427183848.698551-1-jstultz@google.com>
X-Mailer: git-send-email 2.54.0.545.g6539524ca2-goog
Message-ID: <20260427183848.698551-2-jstultz@google.com>
Subject: [PATCH 1/2] sched: proxy-exec: Close race causing workqueue work to be delayed
From: John Stultz
To: LKML
Cc: John Stultz, Vineeth Pillai, Sonam Sanju, Sean Christopherson,
 Kunwu Chan, Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
 Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
 Valentin Schneider, Steven Rostedt, Will Deacon, Waiman Long,
 Boqun Feng, "Paul E. McKenney", Metin Kaya, Xuewen Yan,
 K Prateek Nayak, Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal,
 kuyo chang, hupu, kernel-team@android.com
Content-Type: text/plain; charset="UTF-8"

Vineeth reported seeing a KVM-related deadlock connected to workqueue
lockups using the android17-6.18 tree, which has Proxy Execution
enabled (using the full patch stack), but I've subsequently reproduced
it on v7.1-rc1.

On further debugging he found:
- The kvm-irqfd-cleanup workqueue and rcu_gp land in a per-cpu pwq
  (workqueue pool).
- One kvm-irqfd-cleanup worker (say A) takes a mutex and then calls
  synchronize_srcu_expedited().
- Another kvm-irqfd-cleanup worker (say B) tries to acquire the lock
  and then gets blocked.
- On the way to blocking, this CPU gets an IPI, and on return from the
  IPI it calls __schedule() without having completed the workqueue
  accounting (setting worker->sleeping = 0 and decrementing
  pool->nr_running). This is done in sched_submit_work() ->
  wq_worker_sleeping(), called from schedule(), and we got preempted
  before that.
- Proxy Execution doesn't immediately take B off the runqueue, as
  p->blocked_on is set during __mutex_lock.
- The next time B is picked for running, the scheduler notices that A
  (the mutex holder) is not on a runqueue and then blocks B:
  find_proxy_task() -> proxy_deactivate() -> block_task()
- And things are then stuck.
A is waiting for the workqueue to be run, but B can't run the
workqueue as it is blocked on A.

The trouble is that with Proxy Execution, in __mutex_lock_common() we
set the task state to TASK_UNINTERRUPTIBLE and set blocked_on before
calling into schedule(), where sched_submit_work() will be called. But
if an IPI comes in before we call schedule(), the interrupt will call
__schedule(SM_PREEMPT) directly. This causes the scheduler to see the
current task as blocked_on and to deactivate it (because the owner is
off the runqueue). Since it's deactivated, it won't be run, and it
won't get to call sched_submit_work().

Without Proxy Execution, the SM_PREEMPT case prevents the task from
being dequeued, so it can be reselected and run, which allows it to
finish calling into schedule() and to call sched_submit_work() before
actually blocking.

So in the SM_PREEMPT case, if current is marked as blocked_on, we need
to clear the blocked_on state and mark the task RUNNING so it can be
selected to complete its call to schedule() -> sched_submit_work().
Because we cleared blocked_on and set the task RUNNING, the task can
be selected and run again, looping back in __mutex_lock_common() where
it re-sets the blocked_on state and calls back into schedule() in
order to properly be chosen as a donor.

Many thanks to Vineeth for figuring this very obscure race out and for
implementing a test tool to make it easily reproducible!

Reported-by: Vineeth Pillai
Tested-by: Vineeth Pillai
Signed-off-by: John Stultz
---
Cc: Vineeth Pillai
Cc: Sonam Sanju
Cc: Sean Christopherson
Cc: Kunwu Chan
Cc: Tejun Heo
Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E. McKenney"
Cc: Metin Kaya
Cc: Xuewen Yan
Cc: K Prateek Nayak
Cc: Thomas Gleixner
Cc: Daniel Lezcano
Cc: Suleiman Souhlal
Cc: kuyo chang
Cc: hupu
Cc: kernel-team@android.com
---
 kernel/sched/core.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index da20fb6ea25ae..5f684caefd8b2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7097,6 +7097,17 @@ static void __sched notrace __schedule(int sched_mode)
 			try_to_block_task(rq, prev, &prev_state,
 					  !task_is_blocked(prev));
 			switch_count = &prev->nvcsw;
+		} else if (preempt && prev->blocked_on) {
+			/*
+			 * If we are SM_PREEMPT, we may have interrupted
+			 * after blocked_on was set, before schedule()
+			 * was run, preventing workqueues from running.
+			 * So clear blocked_on and mark the task RUNNING
+			 * so it can be reselected to run and complete
+			 * its logic.
+			 */
+			WRITE_ONCE(prev->__state, TASK_RUNNING);
+			clear_task_blocked_on(prev, NULL);
 		}
 
 pick_again:
-- 
2.54.0.545.g6539524ca2-goog