From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from stravinsky.debian.org (stravinsky.debian.org [82.195.75.108])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2D5793D3D15
	for <linux-kernel@vger.kernel.org>; Mon,  8 Jun 2026 17:12:51 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=82.195.75.108
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1780938772; cv=none; b=aGEf6o1DiOuJrW9aDPIFJBmQe8yLx3cw+7vOihTxtWLFMDdVN2NA1wRdpQRfSl+9LxL6wxGnrDjIEGt+DE9/EwLgB0ItVrbhrVxcWbbKGR3R6oaG07EniV6F0CvheQ0dQkqlNN+hYyNe59R+iBJwAz2+UoSf8vWglMcN1jYo/vI=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1780938772; c=relaxed/simple;
	bh=eVrcHVoGgBm1Hqw5SDBk8EV4p1E/1P2JwP5mSUga+Io=;
	h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version:
	 Content-Type:Content-Disposition:In-Reply-To; b=ieK9Ufi3mYWwmCQJ976qzY5X60AA/NBhvDV2Ul2ptlKgr9JEx95k4pcKb8ifYAO8N0T/mNmpjZzCDSbWr5PMcJj4gM7+Lg6RBSyqeZNRxmJAwAkcGuTSE9UF39lAfs0N571e+wrtdX8s1QwICfHvniBcOotNgeVucq+AaK5MUwY=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=debian.org; spf=pass smtp.mailfrom=debian.org; dkim=pass (2048-bit key) header.d=debian.org header.i=@debian.org header.b=J5Bbm8F6; arc=none smtp.client-ip=82.195.75.108
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=debian.org
Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=debian.org
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=debian.org header.i=@debian.org header.b="J5Bbm8F6"
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=debian.org;
	s=smtpauto.stravinsky; h=X-Debian-User:In-Reply-To:Content-Type:MIME-Version:
	References:Message-ID:Subject:Cc:To:From:Date:Reply-To:
	Content-Transfer-Encoding:Content-ID:Content-Description;
	bh=Zpm+F8TSEOQuAcA1aY2PT7AMQ+RGdFmT2ytdsEOZp0U=; b=J5Bbm8F62uW9yWtQnDnaIBTxj3
	WBj7PXySvS5ZE3JcATxKpOzjZDrzp7QwGwnNW1yNk+NDtcGDHUwwsmrpC9aJZ+X88xNqqdc78RY9v
	FhBEVorI2qoxaMhJ0c5gIOQlmfbfJQjlO+D7gheKMVZKSIySz2Ws/c1H6CC7ByeR86VVZkuc0v2Ub
	gCqhJYm52B1nnD5DfhxmgPVQI9bDcXgk7n59HefBUl96ehAPdBDJRj+58tsXLdaKxOQ/Ojyzw4Wtr
	AAfaEBT6tfrbczpGHuVM92RJgrEjfAa17y/LhIAti2g75W+xVpijgeNY4RR8j4iEq3ZMErpffx/kK
	df+XpitQ==;
Received: from authenticated-user
	by stravinsky.debian.org with esmtpsa (TLS1.3:ECDHE_X25519__RSA_PSS_RSAE_SHA256__AES_256_GCM:256)
	(Exim 4.96)
	(envelope-from <leitao@debian.org>)
	id 1wWdWk-007mGo-2G;
	Mon, 08 Jun 2026 17:12:43 +0000
Date: Mon, 8 Jun 2026 10:12:38 -0700
From: Breno Leitao <leitao@debian.org>
To: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>, linux-kernel@vger.kernel.org, 
	marco.crivellari@suse.com, frederic@kernel.org, bigeasy@linutronix.de, 
	Hillf Danton <hdanton@sina.com>, kernel-team@meta.com
Subject: Re: [PATCH v2 1/4] workqueue: park kicked worker on pool->kicked_list
Message-ID: <aib0R4r-JSmPK0eo@gmail.com>
References: <20260603-fastwake-v2-0-2977512fe7fa@debian.org>
 <20260603-fastwake-v2-1-2977512fe7fa@debian.org>
 <aiE8RZCHvpqh_V2H@slm.duckdns.org>
 <aiLSOAcx_v1rrMdD@gmail.com>
 <aiMHoGja1Ve7bYC0@slm.duckdns.org>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <aiMHoGja1Ve7bYC0@slm.duckdns.org>
X-Debian-User: leitao

On Fri, Jun 05, 2026 at 07:30:08AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Fri, Jun 05, 2026 at 07:40:43AM -0700, Breno Leitao wrote:
> ...
> > > > task state, so a kicked-but-not-yet-scheduled worker is still a valid
> > > > cull victim -- the cull can reap it before it consumes the just-enqueued
> > > > work, stranding the item.  The window is narrow today but later patches
> > > > in this series defer the wakeup outside pool->lock, widening it.
> > > 
> > > Have you actually reproduced this?
> > 
> > No -- not without artificially changing the timeout in the kernel. So far this
> > is a theoretical race window rather than something I've hit in
> > practice; it was flagged as a critical issue by sashiko:
> > 
> >     https://sashiko.dev/#/patchset/20260526-fastwake-v1-0-e69ad86923e6%40debian.org
> > 
> > The only way I could get it to actually strand an item was by shrinking
> > IDLE_WORKER_TIMEOUT 150,000x (300s -> 2ms), so that a worker counts as
> > "timed out" almost immediately.
> 
> I see. This is a bug then. It shouldn't happen even with that.
> 
> > Would it make sense to refresh last_active when a worker is kicked? The
> > cull walks tail->head and breaks at the first non-expired worker, so a
> > freshly-stamped kicked worker would simply be skipped while genuinely
> > old workers behind it are still reaped.
> 
> I don't think timestamping is where the problem is. The intention of the
> code is that the idle thread minimum count + how workers transition their
> states can't lead to a situation where a work item is pending without an
> idle worker to execute it regardless of timing.
> 
> Can you instrument code with the lowered threshold and record the sequence
> of events. If we record the sequence of work item and worker state
> transitions, it should tell us what's broken. We shouldn't need to protect
> all kicked workers to fix this.

Sure, let me show what I am using first, so, we are comfortable with the
"reproducer" and get on the same page.

This is what I am using. Basically colling a task that is waking up.

	diff --git a/kernel/workqueue.c b/kernel/workqueue.c
	index 78f25afb4a9d6..20e6f8b1fb6b6 100644
	--- a/kernel/workqueue.c
	+++ b/kernel/workqueue.c
	@@ -110,7 +110,7 @@ enum wq_internal_consts {
		BUSY_WORKER_HASH_ORDER  = 6,            /* 64 pointers */

		MAX_IDLE_WORKERS_RATIO  = 4,            /* 1/4 of busy can be idle */
	-       IDLE_WORKER_TIMEOUT     = 300 * HZ,     /* keep idle ones for 5 mins */
	+       IDLE_WORKER_TIMEOUT     = HZ / 500,     /* REPRO: 2ms (was 300 * HZ) */

		MAYDAY_INITIAL_TIMEOUT  = HZ / 100 >= 2 ? HZ / 100 : 2,
							/* call for help after 10ms
	@@ -2954,6 +2954,25 @@ static void set_worker_dying(struct worker *worker, struct list_head *list)

		/* get an extra task struct reference for later kthread_stop_put() */
		get_task_struct(worker->task);
	+
	+       {
	+               static atomic_t total = ATOMIC_INIT(0);
	+               static atomic_t kicked = ATOMIC_INIT(0);
	+               unsigned int st = READ_ONCE(worker->task->__state);
	+               int t = atomic_inc_return(&total);
	+
	+               /* TASK_IDLE = TASK_UNINTERRUPTIBLE|TASK_NOLOAD = 0x402.  Anything
	+                * else means the task was already woken before the cull caught it.
	+                */
	+               if (st != TASK_IDLE) {
	+                       int k = atomic_inc_return(&kicked);
	+
	+                       pr_warn("REPRO: KICKED-then-CULLED #%d/%d (pool=%d state=0x%x worklist=%s nr_idle=%d nr_workers=%d)\n",
	+                               k, t, pool->id, st,
	+                               list_empty(&pool->worklist) ? "empty" : "NON-EMPTY",
	+                               pool->nr_idle, pool->nr_workers);
	+               }
	+       }
	}

And as soon as the machine boots (even before init) I start seeing:

	[    8.340748] REPRO: KICKED-then-CULLED #1/13 (pool=294 state=0x0 worklist=empty nr_idle=2 nr_workers=2)
	[  753.586340] REPRO: KICKED-then-CULLED #2/5946 (pool=289 state=0x0 worklist=NON-EMPTY nr_idle=4 nr_workers=6)
	[  766.349179] REPRO: KICKED-then-CULLED #3/6024 (pool=288 state=0x0 worklist=NON-EMPTY nr_idle=3 nr_workers=6)


Do you think this is a good test?

Thanks
--breno