Date: Wed, 4 Mar 2026 07:40:49 -0800
From: Breno Leitao
To: Tejun Heo
Cc: Lai Jiangshan, Andrew Morton, linux-kernel@vger.kernel.org, Omar Sandoval, kernel-team@meta.com
Subject: Re: [PATCH 0/4] workqueue: Detect stalled in-flight workers
References: <20260211-wqstall_start-at-v1-0-bd9499a18c19@debian.org>

Hello Tejun,

On Wed, Feb 11, 2026 at 08:56:11AM -1000, Tejun Heo wrote:
> On Wed, Feb 11, 2026 at 04:29:14AM -0800, Breno Leitao wrote:
> > The workqueue watchdog detects pools that haven't made forward progress
> > by checking whether pending work items on the worklist have been waiting
> > too long. However, this approach has a blind spot: if a pool has only
> > one work item and that item has already been dequeued and is executing
> > on a worker, the worklist is empty and the watchdog skips the pool
> > entirely. This means a single hogged worker with no other pending work
> > is invisible to the stall detector.
> >
> > I was able to come up with the following example that shows this blind
> > spot:
> >
> > static void stall_work_fn(struct work_struct *work)
> > {
> > 	for (;;) {
> > 		mdelay(1000);
> > 		cond_resched();
> > 	}
> > }
>
> Workqueue doesn't require users to limit execution time. As long as
> there is enough supply of concurrency to avoid stalling of pending work
> items, work items can run as long as they want, including indefinitely.
> Workqueue stall is there to indicate that there is insufficient supply
> of concurrency.

Thank you for the clarification. Let me share more context about the
actual problem I am observing so we can think through it together.

On some production hosts, I am seeing a workqueue stall where no
backtraces are printed:

  BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 132s!
  Showing busy workqueues and worker pools:
  workqueue events: flags=0x100
    pwq 18: cpus=4 node=0 flags=0x0 nice=0 active=4 refcnt=5
      in-flight: 178:stall_work1_fn [wq_stall]
      pending: stall_work2_fn [wq_stall], free_obj_work, psi_avgs_work
  workqueue mm_percpu_wq: flags=0x108
    pwq 18: cpus=4 node=0 flags=0x0 nice=0 active=1 refcnt=2
      pending: vmstat_update
  pool 18: cpus=4 node=0 flags=0x0 nice=0 hung=132s workers=2 idle: 45
  Showing backtraces of running workers in stalled CPU-bound worker pools:

We initially suspected a TOCTOU issue, and Omar put together a patch to
address that, but it did not identify anything. After digging deeper, I
believe I have found the root cause along with a reproducer[1]:

1) kfence executes toggle_allocation_gate() as a delayed workqueue item
   (kfence_timer) on the system WQ.
2) toggle_allocation_gate() enables a static key, which IPIs every CPU to
   patch code:

	static_branch_enable(&kfence_allocation_key);

3) toggle_allocation_gate() then sleeps in TASK_IDLE waiting for a kfence
   allocation to occur:

	wait_event_idle(allocation_wait,
			atomic_read(&kfence_allocation_gate) > 0 || ...);

   This can last indefinitely if no allocation goes through the kfence
   path. The worker remains in the pool's busy_hash (in-flight) but is no
   longer task_is_running().

4) The workqueue watchdog detects the stall and calls
   show_cpu_pool_hog(), which only prints backtraces for workers that are
   actively running on a CPU:

	static void show_cpu_pool_hog(struct worker_pool *pool)
	{
		...
		if (task_is_running(worker->task))
			sched_show_task(worker->task);
	}

5) Nothing is printed because the offending worker is in TASK_IDLE state.
   The output shows "Showing backtraces of running workers in stalled
   CPU-bound worker pools:" followed by nothing, effectively hiding the
   actual culprit.

The fix I am considering is to remove the task_is_running() filter in
show_cpu_pool_hog() so that all in-flight workers in stalled pools have
their backtraces printed, regardless of whether they are running or
sleeping. This would make sleeping culprits like toggle_allocation_gate()
visible in the watchdog output. When I test without the task_is_running()
check, I see the culprit.

Fix I am testing:

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index aeaec79bc09c4..3f5ee08f99313 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -7593,19 +7593,17 @@ static void show_cpu_pool_hog(struct worker_pool *pool)
 	raw_spin_lock_irqsave(&pool->lock, irq_flags);
 
 	hash_for_each(pool->busy_hash, bkt, worker, hentry) {
-		if (task_is_running(worker->task)) {
-			/*
-			 * Defer printing to avoid deadlocks in console
-			 * drivers that queue work while holding locks
-			 * also taken in their write paths.
-			 */
-			printk_deferred_enter();
+		/*
+		 * Defer printing to avoid deadlocks in console
+		 * drivers that queue work while holding locks
+		 * also taken in their write paths.
+		 */
+		printk_deferred_enter();
 
-			pr_info("pool %d:\n", pool->id);
-			sched_show_task(worker->task);
+		pr_info("pool %d:\n", pool->id);
+		sched_show_task(worker->task);
 
-			printk_deferred_exit();
-		}
+		printk_deferred_exit();
 	}
 
 	raw_spin_unlock_irqrestore(&pool->lock, irq_flags);
@@ -7616,7 +7614,7 @@ static void show_cpu_pools_hogs(void)
 	struct worker_pool *pool;
 	int pi;
 
-	pr_info("Showing backtraces of running workers in stalled CPU-bound worker pools:\n");
+	pr_info("Showing backtraces of in-flight workers in stalled CPU-bound worker pools:\n");
 
 	rcu_read_lock();

Then I see:

  BUG: workqueue lockup - pool cpus=6 node=0 flags=0x0 nice=0 stuck for 34s!
  Showing busy workqueues and worker pools:
  workqueue events: flags=0x100
    pwq 26: cpus=6 node=0 flags=0x0 nice=0 active=3 refcnt=4
      in-flight: 161:stall_work1_fn [wq_stall]
      pending: stall_work2_fn [wq_stall], psi_avgs_work
  workqueue mm_percpu_wq: flags=0x108
    pwq 26: cpus=6 node=0 flags=0x0 nice=0 active=1 refcnt=2
      pending: vmstat_update
  pool 26: cpus=6 node=0 flags=0x0 nice=0 hung=34s workers=3 idle: 210 57
  Showing backtraces of in-flight workers in stalled CPU-bound worker pools:
  pool 26:
  task:kworker/6:1 state:I stack:0 pid:161 tgid:161 ppid:2 task_flags:0x4208040 flags:0x00080000
  Call Trace:
   __schedule+0x1521/0x5360
   ? console_trylock+0x40/0x40
   ? preempt_count_add+0x92/0x1a0
   ? do_raw_spin_lock+0x12c/0x2f0
   ? is_mmconf_reserved+0x390/0x390
   ? schedule+0x91/0x350
   ? schedule+0x91/0x350
   schedule+0x165/0x350
   stall_work1_fn+0x17f/0x250 [wq_stall]

Link: https://github.com/leitao/debug/blob/main/workqueue_stall/wq_stall.c [1]