* [PATCH v2 1/4] workqueue: park kicked worker on pool->kicked_list
2026-06-03 13:40 [PATCH v2 0/4] workqueue: Shrink the lock time Breno Leitao
@ 2026-06-03 13:40 ` Breno Leitao
2026-06-04 8:50 ` Tejun Heo
2026-06-03 13:40 ` [PATCH v2 2/4] workqueue: split kick_pool() into kick_pool_pick() + wake_up_q() Breno Leitao
` (2 subsequent siblings)
3 siblings, 1 reply; 8+ messages in thread
From: Breno Leitao @ 2026-06-03 13:40 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan
Cc: linux-kernel, marco.crivellari, frederic, bigeasy, Hillf Danton,
Breno Leitao, kernel-team
kick_pool() picks an idle worker and wakes it, but leaves it WORKER_IDLE
on pool->idle_list until the woken kthread schedules in and runs
worker_leave_idle(). idle_cull_fn() only checks WORKER_IDLE, not the
task state, so a kicked-but-not-yet-scheduled worker is still a valid
cull victim -- the cull can reap it before it consumes the just-enqueued
work, stranding the item. The window is narrow today but later patches
in this series defer the wakeup outside pool->lock, widening it.
Move the picked worker from pool->idle_list to a new pool->kicked_list
under pool->lock so the cull path -- which walks idle_list only --
cannot reach it. worker_leave_idle() already does list_del_init(), so
it correctly removes the worker from kicked_list when it actually runs;
worker_enter_idle() puts it back onto idle_list on completion. No
extra list ops on the worker side.
LIFO coalescing of back-to-back kicks onto the same cache-hot worker is
preserved by having first_idle_worker() peek kicked_list before
idle_list: the second kick lands on the already-kicked worker, the
duplicate wakeup is a no-op, and the worker drains both items when it
runs.
Why not creating a new WORKER_KICKED flag instead, you might ask. I've
tried it and the numbers decreased.
Compared to tagging the worker with a new WORKER_KICKED flag,
list_move() writes to worker->entry (offset 0 of struct worker), which
the producer already dirties when reading the idle_list head; no new
cross-CPU cacheline is introduced. Tagging worker->flags would have put
a producer-side write on an otherwise worker-private cacheline, causing
a coherence bounce on every kick.
Signed-off-by: Breno Leitao <leitao@debian.org>
---
kernel/workqueue.c | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 8df671066dd1..b3f8b86cb52f 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -216,12 +216,13 @@ struct worker_pool {
int nr_idle; /* L: currently idle workers */
struct list_head idle_list; /* L: list of idle workers */
+ struct list_head kicked_list; /* L: workers kicked but not yet running */
struct timer_list idle_timer; /* L: worker idle timeout */
struct work_struct idle_cull_work; /* L: worker idle cleanup */
struct timer_list mayday_timer; /* L: SOS timer for workers */
- /* a workers is either on busy_hash or idle_list, or the manager */
+ /* a worker is either on busy_hash, idle_list, kicked_list, or the manager */
DECLARE_HASHTABLE(busy_hash, BUSY_WORKER_HASH_ORDER);
/* L: hash of busy workers */
@@ -1031,6 +1032,13 @@ static inline void worker_clr_flags(struct worker *worker, unsigned int flags)
/* Return the first idle worker. Called with pool->lock held. */
static struct worker *first_idle_worker(struct worker_pool *pool)
{
+ /*
+ * Prefer an already-kicked worker so back-to-back kicks coalesce
+ * onto the same cache-hot worker (LIFO reuse).
+ */
+ if (!list_empty(&pool->kicked_list))
+ return list_first_entry(&pool->kicked_list, struct worker, entry);
+
if (unlikely(list_empty(&pool->idle_list)))
return NULL;
@@ -1310,6 +1318,16 @@ static bool kick_pool(struct worker_pool *pool)
}
}
#endif
+ /*
+ * Move @worker to pool->kicked_list so a concurrent idle_cull_fn()
+ * (which only walks pool->idle_list) cannot reap it before it
+ * consumes the just-enqueued work. worker_leave_idle() removes the
+ * worker from whichever list it sits on; worker_enter_idle() puts
+ * it back on pool->idle_list on completion. first_idle_worker()
+ * peeks kicked_list first, so back-to-back kicks still coalesce
+ * onto the same cache-hot worker (LIFO reuse).
+ */
+ list_move(&worker->entry, &pool->kicked_list);
wake_up_process(p);
return true;
}
@@ -4896,6 +4914,7 @@ static int init_worker_pool(struct worker_pool *pool)
pool->last_progress_ts = jiffies;
INIT_LIST_HEAD(&pool->worklist);
INIT_LIST_HEAD(&pool->idle_list);
+ INIT_LIST_HEAD(&pool->kicked_list);
hash_init(pool->busy_hash);
timer_setup(&pool->idle_timer, idle_worker_timeout, TIMER_DEFERRABLE);
--
2.54.0
^ permalink raw reply related [flat|nested] 8+ messages in thread* Re: [PATCH v2 1/4] workqueue: park kicked worker on pool->kicked_list
2026-06-03 13:40 ` [PATCH v2 1/4] workqueue: park kicked worker on pool->kicked_list Breno Leitao
@ 2026-06-04 8:50 ` Tejun Heo
0 siblings, 0 replies; 8+ messages in thread
From: Tejun Heo @ 2026-06-04 8:50 UTC (permalink / raw)
To: Breno Leitao
Cc: Lai Jiangshan, linux-kernel, marco.crivellari, frederic, bigeasy,
Hillf Danton, kernel-team
Hello,
On Wed, Jun 03, 2026 at 06:40:08AM -0700, Breno Leitao wrote:
> kick_pool() picks an idle worker and wakes it, but leaves it WORKER_IDLE
> on pool->idle_list until the woken kthread schedules in and runs
> worker_leave_idle(). idle_cull_fn() only checks WORKER_IDLE, not the
A lot of the existing comments in the file are double spaced but let's not
do that anymore. Please use single space to separate sentences in desc and
comments.
> task state, so a kicked-but-not-yet-scheduled worker is still a valid
> cull victim -- the cull can reap it before it consumes the just-enqueued
> work, stranding the item. The window is narrow today but later patches
> in this series defer the wakeup outside pool->lock, widening it.
Have you actually reproduced this? Idle culling never dips below 2 idle
workers. Wakeup is from the head of the list and culling from the end. How
can the same worker be both the wakeup and culling target? You might think a
woken idle worker can be pushed down the list by workers which became newly
idle before the woken worker could dispatch the work; however, there are no
distinctions between a newly woken idle worker and a task which is about to
turn idle. IOW, other workers cannot turn idle without picking up the work
item that the idle worker was woken up for.
I could be wrong but these days it's fairly low effort to reproduce these
perceived problems using AI agents. Please consider reproducing the problem
before submitting patches.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH v2 2/4] workqueue: split kick_pool() into kick_pool_pick() + wake_up_q()
2026-06-03 13:40 [PATCH v2 0/4] workqueue: Shrink the lock time Breno Leitao
2026-06-03 13:40 ` [PATCH v2 1/4] workqueue: park kicked worker on pool->kicked_list Breno Leitao
@ 2026-06-03 13:40 ` Breno Leitao
2026-06-03 13:40 ` [PATCH v2 3/4] workqueue: defer the worker wakeup outside pool->lock in __queue_work() Breno Leitao
2026-06-03 13:40 ` [PATCH v2 4/4] workqueue: defer the worker wakeup outside pool->lock in process_one_work() Breno Leitao
3 siblings, 0 replies; 8+ messages in thread
From: Breno Leitao @ 2026-06-03 13:40 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan
Cc: linux-kernel, marco.crivellari, frederic, bigeasy, Hillf Danton,
Breno Leitao, kernel-team
Factor the worker selection out of kick_pool() into kick_pool_pick(),
which picks and claims the worker under pool->lock but, instead of waking
it, queues it on a caller-provided wake_q via wake_q_add(). The caller
issues the wakeup later with wake_up_q(). wake_q_add() is safe under the
lock (cmpxchg + get_task_struct()); only wake_up_q() takes rq->lock.
kick_pool() becomes a thin wrapper using a local wake_q, so all existing
callers keep waking under pool->lock.
Pure refactor, no functional change.
Signed-off-by: Breno Leitao <leitao@debian.org>
---
kernel/workqueue.c | 32 +++++++++++++++++++++++++++-----
1 file changed, 27 insertions(+), 5 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index b3f8b86cb52f..2811ada6dec6 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -52,6 +52,7 @@
#include <linux/uaccess.h>
#include <linux/sched/isolation.h>
#include <linux/sched/debug.h>
+#include <linux/sched/wake_q.h>
#include <linux/nmi.h>
#include <linux/kvm_para.h>
#include <linux/delay.h>
@@ -1266,13 +1267,17 @@ static void kick_bh_pool(struct worker_pool *pool)
}
/**
- * kick_pool - wake up an idle worker if necessary
+ * kick_pool_pick - select and claim an idle worker, deferring the wakeup
* @pool: pool to kick
+ * @wakeq: wake_q to queue the selected worker on
*
- * @pool may have pending work items. Wake up worker if necessary. Returns
- * whether a worker was woken up.
+ * Like kick_pool() but queues the picked worker on @wakeq (wake_q_add())
+ * instead of waking it, so the caller can wake_up_q(@wakeq) after dropping
+ * pool->lock. Returns whether a worker was selected.
+ *
+ * Must be called with @pool->lock held.
*/
-static bool kick_pool(struct worker_pool *pool)
+static bool kick_pool_pick(struct worker_pool *pool, struct wake_q_head *wakeq)
{
struct worker *worker = first_idle_worker(pool);
struct task_struct *p;
@@ -1328,10 +1333,27 @@ static bool kick_pool(struct worker_pool *pool)
* onto the same cache-hot worker (LIFO reuse).
*/
list_move(&worker->entry, &pool->kicked_list);
- wake_up_process(p);
+ wake_q_add(wakeq, p);
return true;
}
+/**
+ * kick_pool - wake up an idle worker if necessary
+ * @pool: pool to kick
+ *
+ * @pool may have pending work items. Wake up worker if necessary. Returns
+ * whether a worker was woken up.
+ */
+static bool kick_pool(struct worker_pool *pool)
+{
+ DEFINE_WAKE_Q(wakeq);
+ bool kicked;
+
+ kicked = kick_pool_pick(pool, &wakeq);
+ wake_up_q(&wakeq);
+ return kicked;
+}
+
#ifdef CONFIG_WQ_CPU_INTENSIVE_REPORT
/*
--
2.54.0
^ permalink raw reply related [flat|nested] 8+ messages in thread* [PATCH v2 3/4] workqueue: defer the worker wakeup outside pool->lock in __queue_work()
2026-06-03 13:40 [PATCH v2 0/4] workqueue: Shrink the lock time Breno Leitao
2026-06-03 13:40 ` [PATCH v2 1/4] workqueue: park kicked worker on pool->kicked_list Breno Leitao
2026-06-03 13:40 ` [PATCH v2 2/4] workqueue: split kick_pool() into kick_pool_pick() + wake_up_q() Breno Leitao
@ 2026-06-03 13:40 ` Breno Leitao
2026-06-03 13:40 ` [PATCH v2 4/4] workqueue: defer the worker wakeup outside pool->lock in process_one_work() Breno Leitao
3 siblings, 0 replies; 8+ messages in thread
From: Breno Leitao @ 2026-06-03 13:40 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan
Cc: linux-kernel, marco.crivellari, frederic, bigeasy, Hillf Danton,
Breno Leitao, kernel-team
__queue_work() is the enqueue hot path: it inserts the work item and
calls kick_pool() while holding pool->lock. kick_pool() ends in a
wakeup, which takes the target task's rq->lock, so rq->lock nests under
pool->lock on every enqueue that wakes a worker on a contended unbound
pool.
Use kick_pool_pick() to select and claim the worker under pool->lock,
queue it on an on-stack wake_q, and issue the wakeup with wake_up_q()
right after dropping the lock via raw_spin_unlock_wake(). Worker
selection, wake_cpu setup and claiming the worker off pool->idle_list
still happen under the lock; only the rq->lock acquisition moves out.
Signed-off-by: Breno Leitao <leitao@debian.org>
---
kernel/workqueue.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 2811ada6dec6..b4246a801dd8 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2317,6 +2317,7 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
{
struct pool_workqueue *pwq;
struct worker_pool *last_pool, *pool;
+ DEFINE_WAKE_Q(wakeq);
unsigned int work_flags;
unsigned int req_cpu = cpu;
@@ -2431,14 +2432,15 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
trace_workqueue_activate_work(work);
insert_work(pwq, work, &pool->worklist, work_flags);
- kick_pool(pool);
+ kick_pool_pick(pool, &wakeq);
} else {
work_flags |= WORK_STRUCT_INACTIVE;
insert_work(pwq, work, &pwq->inactive_works, work_flags);
}
out:
- raw_spin_unlock(&pool->lock);
+ /* deferred kick_pool_pick() wakeup, issued outside pool->lock */
+ raw_spin_unlock_wake(&pool->lock, &wakeq);
rcu_read_unlock();
}
--
2.54.0
^ permalink raw reply related [flat|nested] 8+ messages in thread* [PATCH v2 4/4] workqueue: defer the worker wakeup outside pool->lock in process_one_work()
2026-06-03 13:40 [PATCH v2 0/4] workqueue: Shrink the lock time Breno Leitao
` (2 preceding siblings ...)
2026-06-03 13:40 ` [PATCH v2 3/4] workqueue: defer the worker wakeup outside pool->lock in __queue_work() Breno Leitao
@ 2026-06-03 13:40 ` Breno Leitao
2026-06-04 8:50 ` Tejun Heo
3 siblings, 1 reply; 8+ messages in thread
From: Breno Leitao @ 2026-06-03 13:40 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan
Cc: linux-kernel, marco.crivellari, frederic, bigeasy, Hillf Danton,
Breno Leitao, kernel-team
process_one_work() kicks the pool to chain execution of the remaining
work items on WORKER_NOT_RUNNING pools (the UNBOUND and CPU_INTENSIVE
ones), calling kick_pool() while holding pool->lock. As in the enqueue
path, the wakeup pulls the target rq->lock in under pool->lock.
Use kick_pool_pick() to select and claim the worker under pool->lock and
issue the wakeup with wake_up_q() after the lock is dropped via
raw_spin_unlock_irq_wake().
With both hot paths converted, measured on a CONFIG_SMP x86 VM (8 vCPUs)
with the in-tree test_workqueue benchmark (lib/test_workqueue.c; each of
8 producers queues 200000 work items one at a time on a WQ_UNBOUND
workqueue, waiting for each to complete), medians of five boots per
scope:
affinity_scope baseline patched tput p95
(items/s) (items/s) gain drop
-------------- --------- --------- ------ ------
cpu 3,611,591 3,568,433 -1.2% +4.6%
smt 3,601,697 3,550,632 -1.4% +6.1%
cache_shard 341,913 401,213 +17.3% -36.8%
cache 320,607 400,560 +24.9% -41.9%
numa 324,909 389,202 +19.8% -38.0%
system 314,510 392,278 +24.7% -37.5%
(p95 drop is the change in the p95 enqueue latency; negative is better.)
cpu/smt use per-CPU pools with no producer/consumer contention and are
essentially unchanged. On the contended scopes the shorter pool->lock
hold time cuts p95 enqueue latency by ~40%, and because this workload is
bound by the producer<->worker round-trip that latency reduction also
lifts throughput by ~20%.
Signed-off-by: Breno Leitao <leitao@debian.org>
---
kernel/workqueue.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index b4246a801dd8..238b02edd01d 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3261,6 +3261,7 @@ __acquires(&pool->lock)
{
struct pool_workqueue *pwq = get_work_pwq(work);
struct worker_pool *pool = worker->pool;
+ DEFINE_WAKE_Q(wakeq);
unsigned long work_data;
int lockdep_start_depth, rcu_start_depth;
bool bh_draining = pool->flags & POOL_BH_DRAINING;
@@ -3315,7 +3316,7 @@ __acquires(&pool->lock)
* chain execution of the pending work items for WORKER_NOT_RUNNING
* workers such as the UNBOUND and CPU_INTENSIVE ones.
*/
- kick_pool(pool);
+ kick_pool_pick(pool, &wakeq);
/*
* Record the last pool and clear PENDING which should be the last
@@ -3326,7 +3327,8 @@ __acquires(&pool->lock)
set_work_pool_and_clear_pending(work, pool->id, pool_offq_flags(pool));
pwq->stats[PWQ_STAT_STARTED]++;
- raw_spin_unlock_irq(&pool->lock);
+ /* deferred kick_pool_pick() wakeup, issued outside pool->lock */
+ raw_spin_unlock_irq_wake(&pool->lock, &wakeq);
rcu_start_depth = rcu_preempt_depth();
lockdep_start_depth = lockdep_depth(current);
--
2.54.0
^ permalink raw reply related [flat|nested] 8+ messages in thread* Re: [PATCH v2 4/4] workqueue: defer the worker wakeup outside pool->lock in process_one_work()
2026-06-03 13:40 ` [PATCH v2 4/4] workqueue: defer the worker wakeup outside pool->lock in process_one_work() Breno Leitao
@ 2026-06-04 8:50 ` Tejun Heo
2026-06-04 15:29 ` Breno Leitao
0 siblings, 1 reply; 8+ messages in thread
From: Tejun Heo @ 2026-06-04 8:50 UTC (permalink / raw)
To: Breno Leitao
Cc: Lai Jiangshan, linux-kernel, marco.crivellari, frederic, bigeasy,
Hillf Danton, kernel-team
On Wed, Jun 03, 2026 at 06:40:11AM -0700, Breno Leitao wrote:
> process_one_work() kicks the pool to chain execution of the remaining
> work items on WORKER_NOT_RUNNING pools (the UNBOUND and CPU_INTENSIVE
> ones), calling kick_pool() while holding pool->lock. As in the enqueue
> path, the wakeup pulls the target rq->lock in under pool->lock.
>
> Use kick_pool_pick() to select and claim the worker under pool->lock and
> issue the wakeup with wake_up_q() after the lock is dropped via
> raw_spin_unlock_irq_wake().
>
> With both hot paths converted, measured on a CONFIG_SMP x86 VM (8 vCPUs)
> with the in-tree test_workqueue benchmark (lib/test_workqueue.c; each of
> 8 producers queues 200000 work items one at a time on a WQ_UNBOUND
> workqueue, waiting for each to complete), medians of five boots per
> scope:
Please test on bare metal.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v2 4/4] workqueue: defer the worker wakeup outside pool->lock in process_one_work()
2026-06-04 8:50 ` Tejun Heo
@ 2026-06-04 15:29 ` Breno Leitao
0 siblings, 0 replies; 8+ messages in thread
From: Breno Leitao @ 2026-06-04 15:29 UTC (permalink / raw)
To: Tejun Heo
Cc: Lai Jiangshan, linux-kernel, marco.crivellari, frederic, bigeasy,
Hillf Danton, kernel-team
On Wed, Jun 03, 2026 at 10:50:29PM -1000, Tejun Heo wrote:
> On Wed, Jun 03, 2026 at 06:40:11AM -0700, Breno Leitao wrote:
> > process_one_work() kicks the pool to chain execution of the remaining
> > work items on WORKER_NOT_RUNNING pools (the UNBOUND and CPU_INTENSIVE
> > ones), calling kick_pool() while holding pool->lock. As in the enqueue
> > path, the wakeup pulls the target rq->lock in under pool->lock.
> >
> > Use kick_pool_pick() to select and claim the worker under pool->lock and
> > issue the wakeup with wake_up_q() after the lock is dropped via
> > raw_spin_unlock_irq_wake().
> >
> > With both hot paths converted, measured on a CONFIG_SMP x86 VM (8 vCPUs)
> > with the in-tree test_workqueue benchmark (lib/test_workqueue.c; each of
> > 8 producers queues 200000 work items one at a time on a WQ_UNBOUND
> > workqueue, waiting for each to complete), medians of five boots per
> > scope:
>
> Please test on bare metal.
Done, on two bare-metal machines, same test_workqueue benchmark (8
producers x 200000 items, one in flight at a time -- exactly like the test we have
lib/test_workqueue.c today):
- arm64: NVIDIA Grace (Neoverse V2), 72 cores
- x86: Intel Xeon Platinum 8321HC (Cooper Lake), 52 cores
VMs and arm64 (Grace) is where this series is meant to pay off -- waking an
idle CPU sitting in wfi costs an IPI, so doing it under pool->lock lengthens
the critical section. The bare-metal numbers match what the VM showed:
affinity_scope baseline patched tput p95
(items/s) (items/s) gain drop
-------------- --------- --------- ------ ------
cpu 2,569,880 3,029,740 +17.9% -13.6%
smt 2,586,485 3,044,788 +17.7% -14.0%
cache_shard 572,055 797,621 +39.4% -37.1%
cache 538,132 724,997 +34.7% -30.1%
numa 528,673 658,215 +24.5% -20.5%
system 524,287 614,486 +17.2% -21.1%
(p95 drop = change in p95 enqueue latency; negative is better.)
(tput gain = number of requests enqueued per sec; bigger is better.)
On x86 (Cooper Lake) the same test was neutral, thow -- within boot-to-boot
noise on the contended scopes.
I got the impression waking an idle x86 CPU is cheap, so there is little
under-lock wakeup cost to move out, and the benchmark stays
pool->lock-acquisition bound either way (perf shows ~46% in
queued_spin_lock_slowpath both before and after, unchanged).
So the win is real but architecture-dependent: arm64 (and virt, where a vCPU
wakeup is even more expensive) benefit; x86 bare metal is a Null-ish.
^ permalink raw reply [flat|nested] 8+ messages in thread