From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from stravinsky.debian.org (stravinsky.debian.org [82.195.75.108]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A9A7C3F39E7 for ; Wed, 27 May 2026 14:51:34 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=82.195.75.108 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779893496; cv=none; b=OSw9KR3icllDUwxf43Yd9TnssIifWUM6USCRSZXuvNM4OfGXfDMtMHpWOTnul7kZFsMOQwh2Vs5wuLwInQuNUWho45laIwu+Qu6ENnmNZ0oLHEywAQq2CxcmXciaPCmwPjDshDPnjGMjOW5raAZF658+onwU3htTUjk0EcuBzVw= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779893496; c=relaxed/simple; bh=h4q7kRJB6k3cnv4vX0qvO9INT5uSN1nNDy883/D2r2I=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=GyDKmyOg5I+Hl/d6UwymrjwQMibPCUj/cMylF6OuXOOIJLywKctl54N5djz4ZvzCplKSc/MPsGGFYfmLVXpeCu2le1v3SPiPPVDqivvgRmZopLH24CMWpYNtQyCh/pCVoFU9i691xQa+uSjXm7pkleLFU0kPHBBaQ8yFGlip+4I= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=debian.org; spf=pass smtp.mailfrom=debian.org; dkim=pass (2048-bit key) header.d=debian.org header.i=@debian.org header.b=q7Imdshv; arc=none smtp.client-ip=82.195.75.108 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=debian.org Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=debian.org Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=debian.org header.i=@debian.org header.b="q7Imdshv" DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=debian.org; s=smtpauto.stravinsky; h=X-Debian-User:In-Reply-To:Content-Type:MIME-Version: References:Message-ID:Subject:Cc:To:From:Date:Reply-To: Content-Transfer-Encoding:Content-ID:Content-Description; bh=jGvJ/r//4ph/4LmPL27eW/nmkcErb0XjM5euifOGNHM=; b=q7Imdshv5cU4ibifFhik6vuy9q RAYbGTIOqjliqlC8xnR3Td7hZ8/MB3fd+SwNRojEy6OUvzF2rqnDuKG3vH0baN2ziv/y2qCKz0sUh T3gTEse9FkXpFNJqkRF5apjx+VmW4HscvtMVYB0UzSb+CKucGyQEXdnaNH0LGXdrpip3XJ54/YyBE CK8ExAi0x7rUlkevgd4kDAYs6iv2xkD583zag339o5bqNrQUslrauzyGlsqpRe+eNXA/WPTc8dyV4 ntPMI0fXBp7VLNgS5Hpi2tIf2FsMUzi/2I5+8uWbh9DytKvilQpDpvGsnptR1CpDBAgRl/z1NVitQ DTiO1pVw==; Received: from authenticated-user by stravinsky.debian.org with esmtpsa (TLS1.3:ECDHE_X25519__RSA_PSS_RSAE_SHA256__AES_256_GCM:256) (Exim 4.96) (envelope-from ) id 1wSFbO-003FAz-1p; Wed, 27 May 2026 14:51:22 +0000 Date: Wed, 27 May 2026 07:51:17 -0700 From: Breno Leitao To: Hillf Danton Cc: Tejun Heo , Lai Jiangshan , linux-kernel@vger.kernel.org, marco.crivellari@suse.com, frederic@kernel.org, bigeasy@linutronix.de Subject: Re: [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths Message-ID: References: <20260526-fastwake-v1-0-e69ad86923e6@debian.org> <20260526212346.1100-1-hdanton@sina.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Debian-User: leitao On Wed, May 27, 2026 at 10:48:12AM +0100, Breno Leitao wrote: > I am still unclear how to fix it, but I am thinking about it. I would say the best approach is to take the worker from the idle list before waking it up. worker_leave_idle(worker); wake_up_process(p); Not later when the worker is scheduled (worker_thread()). commit f0a524a156b7563b9b37242aa42c822f7a8256e2 Author: Breno Leitao Date: Wed May 27 07:39:37 2026 -0700 workqueue: claim picked idle worker under pool->lock in kick_pool() kick_pool() picks an idle worker via first_idle_worker(), optionally adjusts the wake_cpu and repatriates it, then wakes it via wake_up_process() -- but it does not remove the picked worker from pool->idle_list and does not clear WORKER_IDLE. Those transitions happen later in worker_thread:woke_up: when the woken kthread reacquires pool->lock and calls worker_leave_idle(). idle_cull_fn() walks pool->idle_list from the tail and calls set_worker_dying() on entries whose last_active + IDLE_WORKER_TIMEOUT has elapsed. set_worker_dying() only checks WORKER_IDLE; it does not inspect task->__state. A worker that has been kicked but has not yet rerun worker_leave_idle() therefore remains a valid cull victim. If cull races in between the kick and the wakee reacquiring pool->lock, the worker gets WORKER_DIE set and is reaped by kthread_stop_put(). On wakeup it sees WORKER_DIE in worker_thread() and returns 0 without consuming pool->worklist, stranding the just-enqueued work item until the next pool event kicks somebody else. CPU A (kicker) CPU B (cull) ---------------- ---------------- raw_spin_lock(&pool->lock) pick worker W (idle_list head) wake_up_process(W->task) raw_spin_unlock(&pool->lock) raw_spin_lock(&pool->lock) too_many_workers() == true W is list_last_entry(idle_list) set_worker_dying(W, cull_list) raw_spin_unlock(&pool->lock) reap_dying_workers() kthread_stop_put(W->task) (W finally schedules in) worker_thread(): raw_spin_lock(&pool->lock) WORKER_DIE set -> return 0 Today the window is bounded by scheduler latency between the kicker dropping pool->lock and the wakee reacquiring it, plus the 5 minute IDLE_WORKER_TIMEOUT precondition. Deferring wake_up_process() outside pool->lock on the hot enqueue / process_one_work() paths widens it considerably (IRQs re-enabled, fully preemptible). Close the window by claiming the picked worker under pool->lock: call worker_leave_idle() in kick_pool() before wake_up_process(). That removes the worker from pool->idle_list and clears WORKER_IDLE while we still hold the lock, so set_worker_dying()'s WORKER_IDLE precondition is no longer satisfied and idle_cull_fn()'s tail walk cannot see the worker. Gate the existing worker_leave_idle() call in worker_thread:woke_up: behind WORKER_IDLE, so first-time wakeups from create_worker() (which do not go through kick_pool()) still transition correctly. Signed-off-by: Breno Leitao diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 8df671066dd1..321df19f9fd8 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -1310,6 +1310,18 @@ static bool kick_pool(struct worker_pool *pool) } } #endif + /* + * Claim @worker under pool->lock so it cannot race idle_cull_fn(): + * once @worker is off pool->idle_list and no longer WORKER_IDLE, + * set_worker_dying() will skip it and the cull walk cannot reach + * it. Otherwise, the woken worker remains WORKER_IDLE on + * pool->idle_list until it actually schedules in and runs + * worker_leave_idle() in worker_thread:woke_up:, leaving a window + * where a concurrent idle_cull_fn() can flag it WORKER_DIE and + * kthread_stop_put() it before it consumes pool->worklist, + * stranding the just-enqueued work. + */ + worker_leave_idle(worker); wake_up_process(p); return true; } @@ -3447,7 +3459,13 @@ static int worker_thread(void *__worker) return 0; } - worker_leave_idle(worker); + /* + * Kicked workers have already been removed from pool->idle_list + * by kick_pool(); only first-time wakeups (via create_worker()) + * still arrive with WORKER_IDLE set. + */ + if (worker->flags & WORKER_IDLE) + worker_leave_idle(worker); recheck: /* no more worker necessary? */ if (!need_more_worker(pool))