* [PATCH 0/2] workqueue: Shrink the lock time
@ 2026-05-26 18:08 Breno Leitao
2026-05-26 18:08 ` [PATCH 1/2] workqueue: split kick_pool() into kick_pool_pick() + wake_up_process() Breno Leitao
2026-05-26 18:08 ` [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths Breno Leitao
0 siblings, 2 replies; 11+ messages in thread
From: Breno Leitao @ 2026-05-26 18:08 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan
Cc: linux-kernel, marco.crivellari, frederic, bigeasy, Breno Leitao,
kernel-team
The goal of this patchset is to decrease the time spent on the
workqueue pool->lock.
Currently, the worker process is woken up inside the lock, which can
be expensive on some architectures. On arm64, the wake-up issues an
IPI when the target CPU is in idle (wfi on arm64), which requires an
interrupt to wake it. And also, the rq lock, which will be nested in
this case.
This series decreases the time spent under pool->lock by waking up
the process outside of the lock. This improves the microbenchark we have
in lib/ by up to 10%.
Signed-off-by: Breno Leitao <leitao@debian.org>
---
Breno Leitao (2):
workqueue: split kick_pool() into kick_pool_pick() + wake_up_process()
workqueue: defer wake_up_process() outside pool->lock on hot paths
kernel/workqueue.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 49 insertions(+), 8 deletions(-)
---
base-commit: c1ecb239fa3456529a32255359fc78b69eb9d847
change-id: 20260526-fastwake-02982fd66312
Best regards,
--
Breno Leitao <leitao@debian.org>
^ permalink raw reply [flat|nested] 11+ messages in thread* [PATCH 1/2] workqueue: split kick_pool() into kick_pool_pick() + wake_up_process() 2026-05-26 18:08 [PATCH 0/2] workqueue: Shrink the lock time Breno Leitao @ 2026-05-26 18:08 ` Breno Leitao 2026-05-26 18:08 ` [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths Breno Leitao 1 sibling, 0 replies; 11+ messages in thread From: Breno Leitao @ 2026-05-26 18:08 UTC (permalink / raw) To: Tejun Heo, Lai Jiangshan Cc: linux-kernel, marco.crivellari, frederic, bigeasy, Breno Leitao, kernel-team Factor the worker-selection part of kick_pool() out into a new helper kick_pool_pick() that picks the worker to wake (and applies the wake_cpu / repatriation hint that requires pool->lock) but does not call wake_up_process() on it. It returns the task_struct of the worker to wake, or NULL when no wakeup is needed. kick_pool() becomes a thin wrapper that calls kick_pool_pick() and then wake_up_process() on the returned task, so all existing callers keep their current behavior. This is a pure refactor with no functional change. Signed-off-by: Breno Leitao <leitao@debian.org> --- kernel/workqueue.c | 36 ++++++++++++++++++++++++++++++------ 1 file changed, 30 insertions(+), 6 deletions(-) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 8df671066dd1..b788d7c44ac0 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -1258,13 +1258,21 @@ static void kick_bh_pool(struct worker_pool *pool) } /** - * kick_pool - wake up an idle worker if necessary + * kick_pool_pick - select a worker to wake up but defer the wakeup * @pool: pool to kick * - * @pool may have pending work items. Wake up worker if necessary. Returns - * whether a worker was woken up. + * Same selection logic as kick_pool() but returns the task that should be + * woken instead of calling wake_up_process() on it. The caller is + * responsible for calling wake_up_process() on the returned task (typically + * after dropping @pool->lock to shorten the locked region). + * + * Returns the task_struct to wake, or NULL if no wakeup is needed (e.g. the + * pool was empty, no idle worker was available, or it was a BH pool which + * was already kicked synchronously). + * + * Must be called with @pool->lock held. */ -static bool kick_pool(struct worker_pool *pool) +static struct task_struct *kick_pool_pick(struct worker_pool *pool) { struct worker *worker = first_idle_worker(pool); struct task_struct *p; @@ -1272,11 +1280,11 @@ static bool kick_pool(struct worker_pool *pool) lockdep_assert_held(&pool->lock); if (!need_more_worker(pool) || !worker) - return false; + return NULL; if (pool->flags & POOL_BH) { kick_bh_pool(pool); - return true; + return NULL; } p = worker->task; @@ -1310,6 +1318,22 @@ static bool kick_pool(struct worker_pool *pool) } } #endif + return p; +} + +/** + * kick_pool - wake up an idle worker if necessary + * @pool: pool to kick + * + * @pool may have pending work items. Wake up worker if necessary. Returns + * whether a worker was woken up. + */ +static bool kick_pool(struct worker_pool *pool) +{ + struct task_struct *p = kick_pool_pick(pool); + + if (!p) + return false; wake_up_process(p); return true; } -- 2.51.0 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths 2026-05-26 18:08 [PATCH 0/2] workqueue: Shrink the lock time Breno Leitao 2026-05-26 18:08 ` [PATCH 1/2] workqueue: split kick_pool() into kick_pool_pick() + wake_up_process() Breno Leitao @ 2026-05-26 18:08 ` Breno Leitao 2026-05-26 21:23 ` Hillf Danton 2026-05-27 15:22 ` Sebastian Andrzej Siewior 1 sibling, 2 replies; 11+ messages in thread From: Breno Leitao @ 2026-05-26 18:08 UTC (permalink / raw) To: Tejun Heo, Lai Jiangshan Cc: linux-kernel, marco.crivellari, frederic, bigeasy, Breno Leitao, kernel-team Both __queue_work() (enqueue) and process_one_work() (per-work chain kick on unbound/CPU_INTENSIVE pools) call kick_pool() while holding pool->lock. kick_pool() ends in wake_up_process(), which takes the target task's rq->lock. Holding pool->lock across that runqueue lock acquisition lengthens the locked region on the two hottest paths of a contended unbound workqueue. Use the new kick_pool_pick() helper to select the worker to wake while holding pool->lock, then call wake_up_process() after pool->lock is released. All state that requires pool->lock (worker selection, wake_cpu adjustment, BH-pool fast path) is still done under the lock; only the unrelated rq->lock acquisition is moved out. Measured on a CONFIG_SMP arm64 VM (8 vCPUs) with the test_workqueue benchmark (lib/test_workqueue.c) using a batched-submit mode (8 producer kthreads, 200000 work items each, WQ_UNBOUND). Averages of five runs per scope: affinity_scope baseline (items/s) patched (items/s) gain -------------- ------------------ ----------------- ---- cpu 1,419,973 1,413,896 -0.4 % (no contention) smt 1,442,921 1,437,164 -0.4 % (no contention) cache_shard 1,184,058 1,279,184 +8.0 % cache 1,167,603 1,271,341 +8.9 % numa 1,163,617 1,285,427 +10.5 % system 1,175,933 1,255,227 +6.7 % Enqueue latency on the contended scopes also drops (p50 ~2875 -> ~2625 ns, p99 ~5000 -> ~4200 ns). The cpu/smt scopes use per-CPU pools with no producer/consumer contention, so as expected they are unchanged. Signed-off-by: Breno Leitao <leitao@debian.org> --- kernel/workqueue.c | 21 +++++++++++++++++++-- 1 file changed, 19 insertions(+), 2 deletions(-) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index b788d7c44ac0..1403a4b195a3 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -2301,6 +2301,7 @@ static void __queue_work(int cpu, struct workqueue_struct *wq, { struct pool_workqueue *pwq; struct worker_pool *last_pool, *pool; + struct task_struct *wake_p = NULL; unsigned int work_flags; unsigned int req_cpu = cpu; @@ -2415,7 +2416,7 @@ static void __queue_work(int cpu, struct workqueue_struct *wq, trace_workqueue_activate_work(work); insert_work(pwq, work, &pool->worklist, work_flags); - kick_pool(pool); + wake_p = kick_pool_pick(pool); } else { work_flags |= WORK_STRUCT_INACTIVE; insert_work(pwq, work, &pwq->inactive_works, work_flags); @@ -2423,6 +2424,15 @@ static void __queue_work(int cpu, struct workqueue_struct *wq, out: raw_spin_unlock(&pool->lock); + /* + * Issue the wakeup after dropping pool->lock to shorten the + * locked region on this hot enqueue path. kick_pool_pick() did all + * of the work that required the lock (worker selection and + * wake_cpu setup); the wake_up_process() itself only needs to + * take the target rq->lock. + */ + if (wake_p) + wake_up_process(wake_p); rcu_read_unlock(); } @@ -3243,6 +3253,7 @@ __acquires(&pool->lock) { struct pool_workqueue *pwq = get_work_pwq(work); struct worker_pool *pool = worker->pool; + struct task_struct *wake_p; unsigned long work_data; int lockdep_start_depth, rcu_start_depth; bool bh_draining = pool->flags & POOL_BH_DRAINING; @@ -3296,8 +3307,11 @@ __acquires(&pool->lock) * since nr_running would always be >= 1 at this point. This is used to * chain execution of the pending work items for WORKER_NOT_RUNNING * workers such as the UNBOUND and CPU_INTENSIVE ones. + * + * Select the worker to wake while holding pool->lock, but defer the + * actual wake_up_process() until after the lock is dropped below. */ - kick_pool(pool); + wake_p = kick_pool_pick(pool); /* * Record the last pool and clear PENDING which should be the last @@ -3310,6 +3324,9 @@ __acquires(&pool->lock) pwq->stats[PWQ_STAT_STARTED]++; raw_spin_unlock_irq(&pool->lock); + if (wake_p) + wake_up_process(wake_p); + rcu_start_depth = rcu_preempt_depth(); lockdep_start_depth = lockdep_depth(current); /* see drain_dead_softirq_workfn() */ -- 2.51.0 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths 2026-05-26 18:08 ` [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths Breno Leitao @ 2026-05-26 21:23 ` Hillf Danton 2026-05-27 9:48 ` Breno Leitao 2026-05-27 15:22 ` Sebastian Andrzej Siewior 1 sibling, 1 reply; 11+ messages in thread From: Hillf Danton @ 2026-05-26 21:23 UTC (permalink / raw) To: Breno Leitao Cc: Tejun Heo, Lai Jiangshan, linux-kernel, marco.crivellari, frederic, bigeasy On Tue, 26 May 2026 14:08:06 -0400 Breno Leitao wrote: > @@ -2423,6 +2424,15 @@ static void __queue_work(int cpu, struct workqueue_struct *wq, > > out: > raw_spin_unlock(&pool->lock); > + /* > + * Issue the wakeup after dropping pool->lock to shorten the > + * locked region on this hot enqueue path. kick_pool_pick() did all > + * of the work that required the lock (worker selection and > + * wake_cpu setup); the wake_up_process() itself only needs to > + * take the target rq->lock. > + */ > + if (wake_p) > + wake_up_process(wake_p); > rcu_read_unlock(); > } > I suspect this works without the pool lock held, go and see workers culled in idle_cull_fn() for example. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths 2026-05-26 21:23 ` Hillf Danton @ 2026-05-27 9:48 ` Breno Leitao 2026-05-27 14:51 ` Breno Leitao 0 siblings, 1 reply; 11+ messages in thread From: Breno Leitao @ 2026-05-27 9:48 UTC (permalink / raw) To: Hillf Danton Cc: Tejun Heo, Lai Jiangshan, linux-kernel, marco.crivellari, frederic, bigeasy On Wed, May 27, 2026 at 05:23:40AM +0000, Hillf Danton wrote: > On Tue, 26 May 2026 14:08:06 -0400 Breno Leitao wrote: > > @@ -2423,6 +2424,15 @@ static void __queue_work(int cpu, struct workqueue_struct *wq, > > > > out: > > raw_spin_unlock(&pool->lock); > > + /* > > + * Issue the wakeup after dropping pool->lock to shorten the > > + * locked region on this hot enqueue path. kick_pool_pick() did all > > + * of the work that required the lock (worker selection and > > + * wake_cpu setup); the wake_up_process() itself only needs to > > + * take the target rq->lock. > > + */ > > + if (wake_p) > > + wake_up_process(wake_p); > > rcu_read_unlock(); > > } > > > I suspect this works without the pool lock held, go and see workers culled > in idle_cull_fn() for example. Ack, you're right that wake_up_process() itself works without pool->lock held — idle_cull_fn() is a good precedent. But sashiko's point made me look again, and I think there's a latent bug that should be fixed before (or as part of) this patch. When kick_pool() picks a worker and wakes it, the worker is flipped to TASK_RUNNING by the scheduler, but from the workqueue's point of view it is still WORKER_IDLE and still on pool->idle_list. The transition off the idle list only happens later, when >the woken kthread actually schedules in, reacquires pool->lock, and runs worker_leave_idle(). That matters because idle_cull_fn() never looks at the task's scheduler state. It doesn't read task->__state or anything equivalent — its "is this worker idle?" check is purely "is it on pool->idle_list, and is last_active + IDLE_WORKER_TIMEOUT in the past?". So a worker that has been woken but hasn't yet run worker_leave_idle() is, as far as cull is concerned, a perfectly valid victim. The race is then: CPU A (kicker) CPU B (cull) ---------------- ---------------- raw_spin_lock(&pool->lock) pick worker W (idle_list head) wake_up_process(W->task) raw_spin_unlock(&pool->lock) raw_spin_lock(&pool->lock) too_many_workers() == true W is list_last_entry(idle_list) set_worker_dying(W, cull_list) raw_spin_unlock(&pool->lock) reap_dying_workers() kthread_stop_put(W->task) (W finally schedules in) worker_thread(): raw_spin_lock(&pool->lock) WORKER_DIE set -> return 0 W exits without ever running worker_leave_idle() or touching pool->worklist, and the work item we just enqueued is stranded until the next pool event kicks somebody else. This window is narrow in mainline today (bounded by scheduler latency between the kicker dropping pool->lock and the wakee reacquiring it, plus the 5 s IDLE_WORKER_TIMEOUT precondition), but my patch widens it considerably — process_one_work()'s deferred-wake site re-enables IRQs and is fully preemptible across the gap. I am still unclear how to fix it, but I am thinking about it. -- pw-bot: cr ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths 2026-05-27 9:48 ` Breno Leitao @ 2026-05-27 14:51 ` Breno Leitao 2026-05-27 15:35 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 11+ messages in thread From: Breno Leitao @ 2026-05-27 14:51 UTC (permalink / raw) To: Hillf Danton Cc: Tejun Heo, Lai Jiangshan, linux-kernel, marco.crivellari, frederic, bigeasy On Wed, May 27, 2026 at 10:48:12AM +0100, Breno Leitao wrote: > I am still unclear how to fix it, but I am thinking about it. I would say the best approach is to take the worker from the idle list before waking it up. worker_leave_idle(worker); wake_up_process(p); Not later when the worker is scheduled (worker_thread()). commit f0a524a156b7563b9b37242aa42c822f7a8256e2 Author: Breno Leitao <leitao@debian.org> Date: Wed May 27 07:39:37 2026 -0700 workqueue: claim picked idle worker under pool->lock in kick_pool() kick_pool() picks an idle worker via first_idle_worker(), optionally adjusts the wake_cpu and repatriates it, then wakes it via wake_up_process() -- but it does not remove the picked worker from pool->idle_list and does not clear WORKER_IDLE. Those transitions happen later in worker_thread:woke_up: when the woken kthread reacquires pool->lock and calls worker_leave_idle(). idle_cull_fn() walks pool->idle_list from the tail and calls set_worker_dying() on entries whose last_active + IDLE_WORKER_TIMEOUT has elapsed. set_worker_dying() only checks WORKER_IDLE; it does not inspect task->__state. A worker that has been kicked but has not yet rerun worker_leave_idle() therefore remains a valid cull victim. If cull races in between the kick and the wakee reacquiring pool->lock, the worker gets WORKER_DIE set and is reaped by kthread_stop_put(). On wakeup it sees WORKER_DIE in worker_thread() and returns 0 without consuming pool->worklist, stranding the just-enqueued work item until the next pool event kicks somebody else. CPU A (kicker) CPU B (cull) ---------------- ---------------- raw_spin_lock(&pool->lock) pick worker W (idle_list head) wake_up_process(W->task) raw_spin_unlock(&pool->lock) raw_spin_lock(&pool->lock) too_many_workers() == true W is list_last_entry(idle_list) set_worker_dying(W, cull_list) raw_spin_unlock(&pool->lock) reap_dying_workers() kthread_stop_put(W->task) (W finally schedules in) worker_thread(): raw_spin_lock(&pool->lock) WORKER_DIE set -> return 0 Today the window is bounded by scheduler latency between the kicker dropping pool->lock and the wakee reacquiring it, plus the 5 minute IDLE_WORKER_TIMEOUT precondition. Deferring wake_up_process() outside pool->lock on the hot enqueue / process_one_work() paths widens it considerably (IRQs re-enabled, fully preemptible). Close the window by claiming the picked worker under pool->lock: call worker_leave_idle() in kick_pool() before wake_up_process(). That removes the worker from pool->idle_list and clears WORKER_IDLE while we still hold the lock, so set_worker_dying()'s WORKER_IDLE precondition is no longer satisfied and idle_cull_fn()'s tail walk cannot see the worker. Gate the existing worker_leave_idle() call in worker_thread:woke_up: behind WORKER_IDLE, so first-time wakeups from create_worker() (which do not go through kick_pool()) still transition correctly. Signed-off-by: Breno Leitao <leitao@debian.org> diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 8df671066dd1..321df19f9fd8 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -1310,6 +1310,18 @@ static bool kick_pool(struct worker_pool *pool) } } #endif + /* + * Claim @worker under pool->lock so it cannot race idle_cull_fn(): + * once @worker is off pool->idle_list and no longer WORKER_IDLE, + * set_worker_dying() will skip it and the cull walk cannot reach + * it. Otherwise, the woken worker remains WORKER_IDLE on + * pool->idle_list until it actually schedules in and runs + * worker_leave_idle() in worker_thread:woke_up:, leaving a window + * where a concurrent idle_cull_fn() can flag it WORKER_DIE and + * kthread_stop_put() it before it consumes pool->worklist, + * stranding the just-enqueued work. + */ + worker_leave_idle(worker); wake_up_process(p); return true; } @@ -3447,7 +3459,13 @@ static int worker_thread(void *__worker) return 0; } - worker_leave_idle(worker); + /* + * Kicked workers have already been removed from pool->idle_list + * by kick_pool(); only first-time wakeups (via create_worker()) + * still arrive with WORKER_IDLE set. + */ + if (worker->flags & WORKER_IDLE) + worker_leave_idle(worker); recheck: /* no more worker necessary? */ if (!need_more_worker(pool)) ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths 2026-05-27 14:51 ` Breno Leitao @ 2026-05-27 15:35 ` Sebastian Andrzej Siewior 2026-05-28 14:35 ` Breno Leitao 2026-06-01 17:26 ` Breno Leitao 0 siblings, 2 replies; 11+ messages in thread From: Sebastian Andrzej Siewior @ 2026-05-27 15:35 UTC (permalink / raw) To: Breno Leitao Cc: Hillf Danton, Tejun Heo, Lai Jiangshan, linux-kernel, marco.crivellari, frederic On 2026-05-27 07:51:17 [-0700], Breno Leitao wrote: > @@ -3447,7 +3459,13 @@ static int worker_thread(void *__worker) > return 0; > } > > - worker_leave_idle(worker); > + /* > + * Kicked workers have already been removed from pool->idle_list > + * by kick_pool(); only first-time wakeups (via create_worker()) > + * still arrive with WORKER_IDLE set. > + */ > + if (worker->flags & WORKER_IDLE) > + worker_leave_idle(worker); Couldn't create_worker() be aligned here not set the idle flag and wake the thread a few lines later? Then we wouldn't have to conditionally clear the idle flag here (which sort of NULL renders the flag check in worker_leave_idle()). > recheck: > /* no more worker necessary? */ > if (!need_more_worker(pool)) Sebastian ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths 2026-05-27 15:35 ` Sebastian Andrzej Siewior @ 2026-05-28 14:35 ` Breno Leitao 2026-06-01 17:26 ` Breno Leitao 1 sibling, 0 replies; 11+ messages in thread From: Breno Leitao @ 2026-05-28 14:35 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Hillf Danton, Tejun Heo, Lai Jiangshan, linux-kernel, marco.crivellari, frederic On Wed, May 27, 2026 at 05:35:00PM +0000, Sebastian Andrzej Siewior wrote: > On 2026-05-27 07:51:17 [-0700], Breno Leitao wrote: > > @@ -3447,7 +3459,13 @@ static int worker_thread(void *__worker) > > return 0; > > } > > > > - worker_leave_idle(worker); > > + /* > > + * Kicked workers have already been removed from pool->idle_list > > + * by kick_pool(); only first-time wakeups (via create_worker()) > > + * still arrive with WORKER_IDLE set. > > + */ > > + if (worker->flags & WORKER_IDLE) > > + worker_leave_idle(worker); > > Couldn't create_worker() be aligned here not set the idle flag and wake > the thread a few lines later? Then we wouldn't have to conditionally > clear the idle flag here (which sort of NULL renders the flag check in > worker_leave_idle()). Agreed. If create_worker() defers worker_enter_idle() until just before wake_up_process(), we can drop this enter/leave pair entirely and avoid the awkward conditional worker_leave_idle() What about something like: commit 0e91d33f8fe8e86c9a2be40d7a0163f68300ea1f Author: Breno Leitao <leitao@debian.org> Date: Thu May 28 09:58:28 2026 -0400 workqueue: leave idle under pool->lock before waking the worker A woken worker is woken from two different shapes today: - kick_pool_pick() picks a worker off pool->idle_list (LIFO head), optionally adjusts wake_cpu under pool->lock and returns the task to wake. The worker is left WORKER_IDLE and on pool->idle_list at the point wake_up_process() runs. - create_worker() calls worker_enter_idle() to put a brand-new worker on pool->idle_list, then wakes it for the first time. In both cases the woken worker eventually reaches worker_thread:woke_up:, reacquires pool->lock and unconditionally calls worker_leave_idle() to flip its own WORKER_IDLE / pool->idle_list / pool->nr_idle state. That is bookkeeping the waker could just as well have done under the same pool->lock acquisition it already holds, and it muddles the invariant for any concurrent observer: a runnable worker still observably looks idle until it actually schedules in. Move the worker_leave_idle() to the waker side, while pool->lock is still held: - kick_pool_pick() calls worker_leave_idle(worker) right before returning the task to wake. Both kick_pool() and the deferred- wake callers in __queue_work() / process_one_work() inherit this. - create_worker() pairs its worker_enter_idle() with a matching worker_leave_idle() right before wake_up_process(), under the same pool->lock section. nr_idle accounting is unchanged from the outside (enter and leave happen back-to-back under the lock); the only visible difference is that the new worker is no longer on pool->idle_list at the point it is woken. - worker_thread:woke_up: drops its worker_leave_idle() call: by the invariant above, a worker reaching woke_up: is always already !WORKER_IDLE and off pool->idle_list. The resulting invariant is uniform across all wakeup paths: a woken worker is !WORKER_IDLE and off pool->idle_list. No functional change intended. This also shrinks the locked region at worker_thread:woke_up: (one fewer field update while holding pool->lock). Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Breno Leitao <leitao@debian.org> diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 8df671066dd1..e66dd507a841 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -1310,6 +1310,8 @@ static bool kick_pool(struct worker_pool *pool) } } #endif + /* Leave idle under pool->lock; worker_thread:woke_up: relies on this. */ + worker_leave_idle(worker); wake_up_process(p); return true; } @@ -2881,8 +2883,11 @@ static struct worker *create_worker(struct worker_pool *pool) * check if not woken up soon. As kick_pool() is noop if @pool is empty, * wake it up explicitly. */ - if (worker->task) + if (worker->task) { + /* match kick_pool_pick(): leave idle before waking */ + worker_leave_idle(worker); wake_up_process(worker->task); + } raw_spin_unlock_irq(&pool->lock); @@ -3447,7 +3452,7 @@ static int worker_thread(void *__worker) return 0; } - worker_leave_idle(worker); + /* wakers already called worker_leave_idle() under pool->lock */ recheck: /* no more worker necessary? */ if (!need_more_worker(pool)) ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths 2026-05-27 15:35 ` Sebastian Andrzej Siewior 2026-05-28 14:35 ` Breno Leitao @ 2026-06-01 17:26 ` Breno Leitao 1 sibling, 0 replies; 11+ messages in thread From: Breno Leitao @ 2026-06-01 17:26 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Hillf Danton, Tejun Heo, Lai Jiangshan, linux-kernel, marco.crivellari, frederic On Wed, May 27, 2026 at 05:35:00PM +0200, Sebastian Andrzej Siewior wrote: > On 2026-05-27 07:51:17 [-0700], Breno Leitao wrote: > > @@ -3447,7 +3459,13 @@ static int worker_thread(void *__worker) > > return 0; > > } > > > > - worker_leave_idle(worker); > > + /* > > + * Kicked workers have already been removed from pool->idle_list > > + * by kick_pool(); only first-time wakeups (via create_worker()) > > + * still arrive with WORKER_IDLE set. > > + */ > > + if (worker->flags & WORKER_IDLE) > > + worker_leave_idle(worker); > > Couldn't create_worker() be aligned here not set the idle flag and wake > the thread a few lines later? Then we wouldn't have to conditionally > clear the idle flag here (which sort of NULL renders the flag check in > worker_leave_idle()). I tried exactly that and it regresses worker creation, so I'd rather keep create_worker() as-is and leave the check in woke_up: conditional. create_worker() deliberately returns with the new worker still on pool->idle_list (counted in pool->nr_idle), and maybe_create_worker() depends on that. After a successful create_worker() it re-checks: need_to_create_worker() = need_more_worker() && !may_start_working() and may_start_working() is just pool->nr_idle. That nr_idle is the signal "I already created a worker that will pick up the pending work, stop creating". If create_worker() leaves the worker !WORKER_IDLE before the wakeup (or never enters idle), it returns with nr_idle unchanged. A fresh worker starts WORKER_PREP, so it doesn't bump nr_running until it actually runs. So until the woken kthread schedules in, the manager keeps seeing: need_more_worker() == true (worklist non-empty, nr_running 0) may_start_working() == false (nr_idle 0) and loops/goto restart, creating extra workers until one of the woken ones runs. Under scheduler latency that's a burst of surplus kworkers (eventually culled). So I kept create_worker() untouched and only claim the worker (off idle_list) in kick_pool(), with the conditional worker_leave_idle() in woke_up: covering the create_worker() path. --breno ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths 2026-05-26 18:08 ` [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths Breno Leitao 2026-05-26 21:23 ` Hillf Danton @ 2026-05-27 15:22 ` Sebastian Andrzej Siewior 2026-05-28 13:41 ` Breno Leitao 1 sibling, 1 reply; 11+ messages in thread From: Sebastian Andrzej Siewior @ 2026-05-27 15:22 UTC (permalink / raw) To: Breno Leitao Cc: Tejun Heo, Lai Jiangshan, linux-kernel, marco.crivellari, frederic, kernel-team On 2026-05-26 14:08:06 [-0400], Breno Leitao wrote: > --- a/kernel/workqueue.c > +++ b/kernel/workqueue.c > @@ -2423,6 +2424,15 @@ static void __queue_work(int cpu, struct workqueue_struct *wq, > > out: > raw_spin_unlock(&pool->lock); > + /* > + * Issue the wakeup after dropping pool->lock to shorten the > + * locked region on this hot enqueue path. kick_pool_pick() did all > + * of the work that required the lock (worker selection and > + * wake_cpu setup); the wake_up_process() itself only needs to > + * take the target rq->lock. > + */ > + if (wake_p) > + wake_up_process(wake_p); What about using wake_q_add() to grab a worker(s) and then wake_up_q() to wake the worker outside of the lock? > rcu_read_unlock(); > } > Sebastian ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths 2026-05-27 15:22 ` Sebastian Andrzej Siewior @ 2026-05-28 13:41 ` Breno Leitao 0 siblings, 0 replies; 11+ messages in thread From: Breno Leitao @ 2026-05-28 13:41 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Tejun Heo, Lai Jiangshan, linux-kernel, marco.crivellari, frederic, kernel-team Hello Sebastian, On Wed, May 27, 2026 at 05:22:05PM +0000, Sebastian Andrzej Siewior wrote: > On 2026-05-26 14:08:06 [-0400], Breno Leitao wrote: > > --- a/kernel/workqueue.c > > +++ b/kernel/workqueue.c > > @@ -2423,6 +2424,15 @@ static void __queue_work(int cpu, struct workqueue_struct *wq, > > > > out: > > raw_spin_unlock(&pool->lock); > > + /* > > + * Issue the wakeup after dropping pool->lock to shorten the > > + * locked region on this hot enqueue path. kick_pool_pick() did all > > + * of the work that required the lock (worker selection and > > + * wake_cpu setup); the wake_up_process() itself only needs to > > + * take the target rq->lock. > > + */ > > + if (wake_p) > > + wake_up_process(wake_p); > > What about using wake_q_add() to grab a worker(s) and then wake_up_q() > to wake the worker outside of the lock? Thanks, this is a good recommendation. Looking at the API, I understand the intended pattern is to call wake_q_add() while still holding pool->lock and then wake_up_q() after dropping it -- wake_q_add() only does a cmpxchg on task->wake_q.next plus get_task_struct(), so it's safe under the spinlock, while wake_up_q() is the part that takes rq->lock and is what we want outside. Will switch to that in v2. --breno ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2026-06-01 17:27 UTC | newest] Thread overview: 11+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-05-26 18:08 [PATCH 0/2] workqueue: Shrink the lock time Breno Leitao 2026-05-26 18:08 ` [PATCH 1/2] workqueue: split kick_pool() into kick_pool_pick() + wake_up_process() Breno Leitao 2026-05-26 18:08 ` [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths Breno Leitao 2026-05-26 21:23 ` Hillf Danton 2026-05-27 9:48 ` Breno Leitao 2026-05-27 14:51 ` Breno Leitao 2026-05-27 15:35 ` Sebastian Andrzej Siewior 2026-05-28 14:35 ` Breno Leitao 2026-06-01 17:26 ` Breno Leitao 2026-05-27 15:22 ` Sebastian Andrzej Siewior 2026-05-28 13:41 ` Breno Leitao
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.