From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from stravinsky.debian.org (stravinsky.debian.org [82.195.75.108])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id A9A7C3F39E7
	for <linux-kernel@vger.kernel.org>; Wed, 27 May 2026 14:51:34 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=82.195.75.108
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1779893496; cv=none; b=OSw9KR3icllDUwxf43Yd9TnssIifWUM6USCRSZXuvNM4OfGXfDMtMHpWOTnul7kZFsMOQwh2Vs5wuLwInQuNUWho45laIwu+Qu6ENnmNZ0oLHEywAQq2CxcmXciaPCmwPjDshDPnjGMjOW5raAZF658+onwU3htTUjk0EcuBzVw=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1779893496; c=relaxed/simple;
	bh=h4q7kRJB6k3cnv4vX0qvO9INT5uSN1nNDy883/D2r2I=;
	h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version:
	 Content-Type:Content-Disposition:In-Reply-To; b=GyDKmyOg5I+Hl/d6UwymrjwQMibPCUj/cMylF6OuXOOIJLywKctl54N5djz4ZvzCplKSc/MPsGGFYfmLVXpeCu2le1v3SPiPPVDqivvgRmZopLH24CMWpYNtQyCh/pCVoFU9i691xQa+uSjXm7pkleLFU0kPHBBaQ8yFGlip+4I=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=debian.org; spf=pass smtp.mailfrom=debian.org; dkim=pass (2048-bit key) header.d=debian.org header.i=@debian.org header.b=q7Imdshv; arc=none smtp.client-ip=82.195.75.108
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=debian.org
Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=debian.org
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=debian.org header.i=@debian.org header.b="q7Imdshv"
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=debian.org;
	s=smtpauto.stravinsky; h=X-Debian-User:In-Reply-To:Content-Type:MIME-Version:
	References:Message-ID:Subject:Cc:To:From:Date:Reply-To:
	Content-Transfer-Encoding:Content-ID:Content-Description;
	bh=jGvJ/r//4ph/4LmPL27eW/nmkcErb0XjM5euifOGNHM=; b=q7Imdshv5cU4ibifFhik6vuy9q
	RAYbGTIOqjliqlC8xnR3Td7hZ8/MB3fd+SwNRojEy6OUvzF2rqnDuKG3vH0baN2ziv/y2qCKz0sUh
	T3gTEse9FkXpFNJqkRF5apjx+VmW4HscvtMVYB0UzSb+CKucGyQEXdnaNH0LGXdrpip3XJ54/YyBE
	CK8ExAi0x7rUlkevgd4kDAYs6iv2xkD583zag339o5bqNrQUslrauzyGlsqpRe+eNXA/WPTc8dyV4
	ntPMI0fXBp7VLNgS5Hpi2tIf2FsMUzi/2I5+8uWbh9DytKvilQpDpvGsnptR1CpDBAgRl/z1NVitQ
	DTiO1pVw==;
Received: from authenticated-user
	by stravinsky.debian.org with esmtpsa (TLS1.3:ECDHE_X25519__RSA_PSS_RSAE_SHA256__AES_256_GCM:256)
	(Exim 4.96)
	(envelope-from <leitao@debian.org>)
	id 1wSFbO-003FAz-1p;
	Wed, 27 May 2026 14:51:22 +0000
Date: Wed, 27 May 2026 07:51:17 -0700
From: Breno Leitao <leitao@debian.org>
To: Hillf Danton <hdanton@sina.com>
Cc: Tejun Heo <tj@kernel.org>, Lai Jiangshan <jiangshanlai@gmail.com>, 
	linux-kernel@vger.kernel.org, marco.crivellari@suse.com, frederic@kernel.org, 
	bigeasy@linutronix.de
Subject: Re: [PATCH 2/2] workqueue: defer wake_up_process() outside
 pool->lock on hot paths
Message-ID: <ahcEcR6ApWr7E4ID@gmail.com>
References: <20260526-fastwake-v1-0-e69ad86923e6@debian.org>
 <20260526212346.1100-1-hdanton@sina.com>
 <aha6ENW-K-4F4AHi@gmail.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <aha6ENW-K-4F4AHi@gmail.com>
X-Debian-User: leitao

On Wed, May 27, 2026 at 10:48:12AM +0100, Breno Leitao wrote:
> I am still unclear how to fix it, but I am thinking about it.

I would say the best approach is to take the worker from the idle list
before waking it up.

	worker_leave_idle(worker);
        wake_up_process(p);

Not later when the worker is scheduled (worker_thread()).


commit f0a524a156b7563b9b37242aa42c822f7a8256e2
Author: Breno Leitao <leitao@debian.org>
Date:   Wed May 27 07:39:37 2026 -0700

    workqueue: claim picked idle worker under pool->lock in kick_pool()
    
    kick_pool() picks an idle worker via first_idle_worker(), optionally
    adjusts the wake_cpu and repatriates it, then wakes it via
    wake_up_process() -- but it does not remove the picked worker from
    pool->idle_list and does not clear WORKER_IDLE.  Those transitions
    happen later in worker_thread:woke_up: when the woken kthread
    reacquires pool->lock and calls worker_leave_idle().
    
    idle_cull_fn() walks pool->idle_list from the tail and calls
    set_worker_dying() on entries whose last_active + IDLE_WORKER_TIMEOUT
    has elapsed.  set_worker_dying() only checks WORKER_IDLE; it does
    not inspect task->__state.  A worker that has been kicked but has
    not yet rerun worker_leave_idle() therefore remains a valid cull
    victim.  If cull races in between the kick and the wakee reacquiring
    pool->lock, the worker gets WORKER_DIE set and is reaped by
    kthread_stop_put().  On wakeup it sees WORKER_DIE in worker_thread()
    and returns 0 without consuming pool->worklist, stranding the
    just-enqueued work item until the next pool event kicks somebody
    else.
    
        CPU A (kicker)                      CPU B (cull)
        ----------------                    ----------------
        raw_spin_lock(&pool->lock)
        pick worker W (idle_list head)
        wake_up_process(W->task)
        raw_spin_unlock(&pool->lock)
                                            raw_spin_lock(&pool->lock)
                                            too_many_workers() == true
                                            W is list_last_entry(idle_list)
                                            set_worker_dying(W, cull_list)
                                            raw_spin_unlock(&pool->lock)
                                            reap_dying_workers()
                                            kthread_stop_put(W->task)
        (W finally schedules in)
        worker_thread():
          raw_spin_lock(&pool->lock)
          WORKER_DIE set -> return 0
    
    Today the window is bounded by scheduler latency between the kicker
    dropping pool->lock and the wakee reacquiring it, plus the 5 minute
    IDLE_WORKER_TIMEOUT precondition.  Deferring wake_up_process()
    outside pool->lock on the hot enqueue / process_one_work() paths
    widens it considerably (IRQs re-enabled, fully preemptible).
    
    Close the window by claiming the picked worker under pool->lock:
    call worker_leave_idle() in kick_pool() before wake_up_process().
    That removes the worker from pool->idle_list and clears WORKER_IDLE
    while we still hold the lock, so set_worker_dying()'s WORKER_IDLE
    precondition is no longer satisfied and idle_cull_fn()'s tail walk
    cannot see the worker.  Gate the existing worker_leave_idle() call
    in worker_thread:woke_up: behind WORKER_IDLE, so first-time wakeups
    from create_worker() (which do not go through kick_pool()) still
    transition correctly.
    
    Signed-off-by: Breno Leitao <leitao@debian.org>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 8df671066dd1..321df19f9fd8 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1310,6 +1310,18 @@ static bool kick_pool(struct worker_pool *pool)
 		}
 	}
 #endif
+	/*
+	 * Claim @worker under pool->lock so it cannot race idle_cull_fn():
+	 * once @worker is off pool->idle_list and no longer WORKER_IDLE,
+	 * set_worker_dying() will skip it and the cull walk cannot reach
+	 * it.  Otherwise, the woken worker remains WORKER_IDLE on
+	 * pool->idle_list until it actually schedules in and runs
+	 * worker_leave_idle() in worker_thread:woke_up:, leaving a window
+	 * where a concurrent idle_cull_fn() can flag it WORKER_DIE and
+	 * kthread_stop_put() it before it consumes pool->worklist,
+	 * stranding the just-enqueued work.
+	 */
+	worker_leave_idle(worker);
 	wake_up_process(p);
 	return true;
 }
@@ -3447,7 +3459,13 @@ static int worker_thread(void *__worker)
 		return 0;
 	}
 
-	worker_leave_idle(worker);
+	/*
+	 * Kicked workers have already been removed from pool->idle_list
+	 * by kick_pool(); only first-time wakeups (via create_worker())
+	 * still arrive with WORKER_IDLE set.
+	 */
+	if (worker->flags & WORKER_IDLE)
+		worker_leave_idle(worker);
 recheck:
 	/* no more worker necessary? */
 	if (!need_more_worker(pool))