Date: Fri, 13 Mar 2026 05:24:54 -0700
From: Breno Leitao
To: Petr Mladek
Cc: Tejun Heo, Lai Jiangshan, Andrew Morton, linux-kernel@vger.kernel.org,
 Omar Sandoval, Song Liu, Danielle Costantino, kasan-dev@googlegroups.com,
 kernel-team@meta.com
Subject: Re: [PATCH v2 0/5] workqueue: Detect stalled in-flight workers
References: <20260305-wqstall_start-at-v2-0-b60863ee0899@debian.org>

Hello Petr,

On Thu, Mar 12, 2026 at 05:38:26PM +0100, Petr Mladek wrote:
> On Thu 2026-03-05 08:15:36, Breno Leitao wrote:
> > There is a blind spot in the workqueue stall detector (aka
> > show_cpu_pool_hog()). It only prints workers whose task_is_running()
> > is true, so a busy worker that is sleeping (e.g. in wait_event_idle())
> > produces an empty backtrace section even though it is the cause of
> > the stall.
> >
> > Additionally, when the watchdog does report stalled pools, the output
> > doesn't show how long each in-flight work item has been running,
> > making it harder to identify which specific worker is stuck.
> >
> > Example output from the sample code:
> >
> >   BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 132s!
> >   Showing busy workqueues and worker pools:
> >   workqueue events: flags=0x100
> >     pwq 18: cpus=4 node=0 flags=0x0 nice=0 active=4 refcnt=5
> >       in-flight: 178:stall_work1_fn [wq_stall]
> >       pending: stall_work2_fn [wq_stall], free_obj_work, psi_avgs_work
> >   ...
> >   Showing backtraces of running workers in stalled CPU-bound worker pools:
> >
> > I see it happening on real machines, causing some stalls that don't
> > have any backtrace. This is one of the code paths:
> >
> > 1) kfence executes toggle_allocation_gate() as a delayed workqueue
> >    item (kfence_timer) on the system WQ.
> >
> > 2) toggle_allocation_gate() enables a static key, which IPIs every
> >    CPU to patch code:
> >
> >      static_branch_enable(&kfence_allocation_key);
> >
> > 3) toggle_allocation_gate() then sleeps in TASK_IDLE waiting for a
> >    kfence allocation to occur:
> >
> >      wait_event_idle(allocation_wait,
> >                      atomic_read(&kfence_allocation_gate) > 0 || ...);
> >
> >    This can last indefinitely if no allocation goes through the
> >    kfence path (or if IPIing all the CPUs takes a long time, which is
> >    common on platforms that do not have NMI).
> >
> >    The worker remains in the pool's busy_hash (in-flight) but is no
> >    longer task_is_running().
> >
> > 4) The workqueue watchdog detects the stall and calls
> >    show_cpu_pool_hog(), which only prints backtraces for workers
> >    that are actively running on a CPU:
> >
> >      static void show_cpu_pool_hog(struct worker_pool *pool)
> >      {
> >              ...
> >              if (task_is_running(worker->task))
> >                      sched_show_task(worker->task);
> >      }
> >
> > 5) Nothing is printed because the offending worker is in TASK_IDLE
> >    state. The output shows "Showing backtraces of running workers in
> >    stalled CPU-bound worker pools:" followed by nothing, effectively
> >    hiding the actual culprit.
>
> I am trying to better understand the situation.
> There was a reason why only the worker in the running state was shown.
>
> Normally, a sleeping worker should not cause a stall. The scheduler calls
> wq_worker_sleeping(), which should wake up another idle worker. There is
> always at least one idle worker in the pool. It should start processing
> the next pending work, or it should fork another worker when it was
> the last idle one.

Right, but let's look at this case:

  BUG: workqueue lockup - pool 55 cpu 13 curr 0 (swapper/13) stack ffff800085640000 cpus=13 node=0 flags=0x0 nice=-20 stuck for 679s!
      work func=blk_mq_timeout_work data=0xffff0000ad7e3a05
  Showing busy workqueues and worker pools:
  workqueue events_unbound: flags=0x2
    pwq 288: cpus=0-71 flags=0x4 nice=0 active=1 refcnt=2
      in-flight: 4083734:btrfs_extent_map_shrinker_worker
  workqueue mm_percpu_wq: flags=0x8
    pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=1 refcnt=2
      pending: vmstat_update
  pool 288: cpus=0-71 flags=0x4 nice=0 hung=0s workers=17
    idle: 3800629 3959700 3554824 3706405 3759881 4065549 4041361 4065548 1715676 4086805 3860852 3587585 4065550 4014041 3944711 3744484
  Showing backtraces of running workers in stalled CPU-bound worker pools:
  # Nothing in here

It seems CPU 13 is idle (curr = 0) and blk_mq_timeout_work has been
pending for 679s?

> I wonder what blocked the idle worker from waking or forking
> a new worker. Was it caused by the IPIs?

Not sure. Keep in mind that these hosts (arm64) do not have NMI, so
IPIs are just regular interrupts that could take a long time to be
handled. toggle_allocation_gate() was a good example, given it was
sending IPIs very frequently, and I took it as the example for the
cover letter, but this problem also shows up in different places
(more examples later).

> Did printing the sleeping workers help to analyze the problem?

That is my hope. I don't have a reproducer other than the one in this
patchset. I am currently rolling this patchset out to production, and I
can report back once I get more information.
> I wonder if we could do better in this case. For example, warn
> that the scheduler failed to wake up another idle worker when
> no worker is in the running state. And maybe print the backtrace
> of the currently running process on the given CPU, because it
> likely blocks waking/scheduling the idle worker.

I am happy to improve this, given this has been a hard issue. Let me
give more instances of the "empty" stalls I am seeing, all with empty
backtraces:

# Instance 1

BUG: workqueue lockup - pool cpus=33 node=0 flags=0x0 nice=0 stuck for 33s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
  pwq 134: cpus=33 node=0 flags=0x0 nice=0 active=3 refcnt=4
    pending: 3*psi_avgs_work
  pwq 218: cpus=54 node=0 flags=0x0 nice=0 active=1 refcnt=2
    in-flight: 842:key_garbage_collector
workqueue mm_percpu_wq: flags=0x8
  pwq 134: cpus=33 node=0 flags=0x0 nice=0 active=1 refcnt=2
    pending: vmstat_update
pool 218: cpus=54 node=0 flags=0x0 nice=0 hung=0s workers=3
  idle: 11200 524627
Showing backtraces of running workers in stalled CPU-bound worker pools:

# Instance 2

BUG: workqueue lockup - pool cpus=53 node=0 flags=0x0 nice=0 stuck for 459s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
  pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
    pending: psi_avgs_work
  pwq 214: cpus=53 node=0 flags=0x0 nice=0 active=4 refcnt=5
    pending: 2*psi_avgs_work, drain_local_memcg_stock, iova_depot_work_func
workqueue events_freezable: flags=0x4
  pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
    pending: pci_pme_list_scan
workqueue slub_flushwq: flags=0x8
  pwq 214: cpus=53 node=0 flags=0x0 nice=0 active=1 refcnt=3
    pending: flush_cpu_slab BAR(7520)
workqueue mm_percpu_wq: flags=0x8
  pwq 214: cpus=53 node=0 flags=0x0 nice=0 active=1 refcnt=2
    pending: vmstat_update
workqueue mlx5_cmd_0002:03:00.1: flags=0x6000a
  pwq 576: cpus=0-143 flags=0x4 nice=0 active=1 refcnt=146
    pending: cmd_work_handler
Showing backtraces of running workers in stalled CPU-bound worker pools:

# Instance 3

BUG: workqueue lockup - pool cpus=74 node=1 flags=0x0 nice=0 stuck for 31s!
Showing busy workqueues and worker pools:
workqueue mm_percpu_wq: flags=0x8
  pwq 298: cpus=74 node=1 flags=0x0 nice=0 active=1 refcnt=2
    pending: vmstat_update
Showing backtraces of running workers in stalled CPU-bound worker pools:

# Instance 4

BUG: workqueue lockup - pool cpus=71 node=0 flags=0x0 nice=0 stuck for 32s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
  pwq 286: cpus=71 node=0 flags=0x0 nice=0 active=2 refcnt=3
    pending: psi_avgs_work, fuse_check_timeout
workqueue events_freezable: flags=0x4
  pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
    pending: pci_pme_list_scan
workqueue mm_percpu_wq: flags=0x8
  pwq 286: cpus=71 node=0 flags=0x0 nice=0 active=1 refcnt=2
    pending: vmstat_update
Showing backtraces of running workers in stalled CPU-bound worker pools:

Thanks for your help,
--breno