From: Breno Leitao
To: Petr Mladek
Cc: Tejun Heo, Lai Jiangshan, Song Liu, linux-kernel@vger.kernel.org,
    kernel-team@meta.com
Subject: Re: [PATCH RFC 2/3] workqueue: trigger a single-CPU backtrace for stalled pools
Date: Mon, 22 Jun 2026 04:14:04 -0700
References: <20260616-wq_dump_petr-v1-0-b57473ca6d18@debian.org>
    <20260616-wq_dump_petr-v1-2-b57473ca6d18@debian.org>

Hello Petr,

On Fri, Jun 19, 2026 at 03:42:56PM +0200, Petr Mladek wrote:
> It makes some sense. wq_watchdog_timer_fn() checks either
> 'per_cpu(wq_watchdog_touched_cpu)' or the global 'wq_watchdog_touched'
> depending whether pool->cpu is set or not. And it seems to be wrong
> for disassociated pools.
>
> But this seems to be an existing problem which should be fixed
> separately.

Good observation. For disassociated pools (where the CPU has been
offlined), pool->cpu remains set; only the workers' CPU affinity changes.
When a CPU goes offline, the pool becomes disassociated, but pool->cpu
still points to the now-offline CPU.
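
To make that state concrete, here is a tiny userspace model. It is only a
sketch, not kernel code: the struct is a cut-down stand-in for struct
worker_pool, the POOL_DISASSOCIATED value is illustrative, and
model_offline_cpu() and uses_percpu_heartbeat() are hypothetical helper
names made up for this example.

```c
#include <stdbool.h>

/* Illustrative flag value; the real definition lives in kernel/workqueue.c. */
#define POOL_DISASSOCIATED (1u << 2)

/* Cut-down stand-in for struct worker_pool: only the fields discussed here. */
struct worker_pool {
	int cpu;		/* owning CPU, or -1 for an unbound pool */
	unsigned int flags;	/* POOL_* flags */
};

/*
 * Hypothetical model of what happens to the pool when its CPU goes
 * offline: the pool is flagged as disassociated, pool->cpu is not touched.
 */
static void model_offline_cpu(struct worker_pool *pool)
{
	pool->flags |= POOL_DISASSOCIATED;
}

/*
 * Hypothetical helper mirroring the "pool->cpu >= 0" test: true means the
 * per-cpu heartbeat would be consulted instead of the global one.
 */
static bool uses_percpu_heartbeat(const struct worker_pool *pool)
{
	return pool->cpu >= 0;
}
```

In this model, a pool created with cpu = 3 still has cpu == 3 after
model_offline_cpu(), and uses_percpu_heartbeat() still returns true, even
though the POOL_DISASSOCIATED flag is now set.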

Later in wq_watchdog_timer_fn(), when checking the stalled pool:

	if (pool->cpu >= 0)
		touched = READ_ONCE(per_cpu(wq_watchdog_touched_cpu, pool->cpu));

This reads wq_watchdog_touched_cpu for the offline CPU, which is still
being updated by wq_watchdog_reset_touched() via for_each_possible_cpu()
(which walks every possible CPU, including offlined ones). Regardless of
whether the CPU is online or offline, wq_watchdog_reset_touched() will
mark it as touched.

The real problem is that pool->cpu now names an offline CPU:

 - the per-cpu "touched" heartbeat we consult is the wrong one. The
   pool's work now runs on online CPUs (it behaves like an unbound
   pool), so the global wq_watchdog_touched is the correct grace signal.

 - the same pool->cpu >= 0 test marks the pool cpu_stall and aims the
   new single-CPU backtrace at the offline CPU.

So, I suppose we have a few options:

1) Set pool->cpu to -1 at dissociation time. But that would lose the
   CPU number that is needed to rebind later, so we would need to back
   up pool->cpu if we decide to unset it:

	int workqueue_online_cpu(unsigned int cpu)
	{
		...
		if (pool->cpu == cpu)

2) Treat the pool as cpuless if it is disassociated:

	static int pool_watchdog_cpu(struct worker_pool *pool)
	{
		if (pool->cpu < 0 || (pool->flags & POOL_DISASSOCIATED))
			return -1;
		return pool->cpu;
	}

   and replace the pool->cpu reads with pool_watchdog_cpu() everywhere
   in the stall code path.

I lean towards 2).

Either way this is unrelated to this patchset, so my suggestion is:

1) I respin this RFC with your Reviewed-by + a cpu_online() check before
   triggering the backtrace:

	if (!found_running && cpu_online(cpu))
		trigger_single_cpu_backtrace(cpu);

2) we continue the disassociated-pool discussion separately, so it does
   not block this series.

Thanks,
--breno