[PATCH v2 0/4] workqueue: Shrink the lock time

The Linux Kernel Mailing List
 help / color / mirror / Atom feed

* [PATCH v2 0/4] workqueue: Shrink the lock time
@ 2026-06-03 13:40 Breno Leitao
  2026-06-03 13:40 ` [PATCH v2 1/4] workqueue: park kicked worker on pool->kicked_list Breno Leitao
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Breno Leitao @ 2026-06-03 13:40 UTC (permalink / raw)
  To: Tejun Heo, Lai Jiangshan
  Cc: linux-kernel, marco.crivellari, frederic, bigeasy, Hillf Danton,
	Breno Leitao, kernel-team

The goal of this patchset is to decrease the time spent under the
workqueue pool->lock.

Currently the worker process is woken up inside pool->lock.  The wakeup
ends in wake_up_process(), which takes the target task's rq->lock, so
rq->lock nests under pool->lock on the two hottest paths of a contended
unbound workqueue (__queue_work() enqueue and process_one_work() chain
kick).  On some architectures the wakeup is even more expensive: on
arm64 waking a CPU that is idle (in wfi) issues an IPI.

Doing all of that while holding pool->lock lengthens the locked region
and hurts throughput on contended unbound pools.

This series shortens the locked region by selecting and claiming the
worker to wake under pool->lock, but issuing the actual wakeup after the
lock is dropped, using the wake_q machinery (wake_q_add() under the
lock, wake_up_q() after).

Because the win is a shorter pool->lock hold time, it shows up most
clearly as lower enqueue latency under contention.  Measured with the
in-tree test_workqueue microbenchmark (lib/test_workqueue.c, 8 producers
x 200000 items on a WQ_UNBOUND workqueue, x86 8-vCPU VM, medians of five
boots), on the contended affinity scopes (cache_shard, cache, numa,
system) p95 enqueue latency drops ~40% and throughput improves ~20%.
The uncontended per-CPU scopes (cpu, smt) are unaffected.

While reworking this, Hillf Danton pointed out -- and a closer look
confirmed -- a latent race: kick_pool() wakes a worker but leaves it
WORKER_IDLE on pool->idle_list until it schedules in, so a concurrent
idle_cull_fn() (which only checks WORKER_IDLE, not the task state) can
reap it before it consumes the work, stranding the just-enqueued item.

The window is narrow today but deferring the wakeup widens it, so the
first patch closes it by moving the kicked worker onto a new
pool->kicked_list under pool->lock, out of reach of the cull which walks
idle_list only.

Patch 1 fixes the cull race and is a standalone correctness fix.
Patch 2 is a pure refactor introducing kick_pool_pick().
Patch 3 defers the wakeup on the enqueue path (__queue_work()).
Patch 4 defers the wakeup on the per-work chain-kick path
(process_one_work()).

Changes in v2:
- Close the idle_cull_fn() vs kicked-worker race by parking the kicked
  worker on a new pool->kicked_list under pool->lock (new patch 1).
  Reported by Hillf Danton.
- Use the wake_q machinery (wake_q_add() / wake_up_q() via
  raw_spin_unlock_wake()) instead of plumbing a task_struct out of the
  helper by hand.  Suggested by Sebastian Andrzej Siewior.
- Link to v1: https://lore.kernel.org/r/20260526-fastwake-v1-0-e69ad86923e6@debian.org

Signed-off-by: Breno Leitao <leitao@debian.org>
---
To: Tejun Heo <tj@kernel.org>
To: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: linux-kernel@vger.kernel.org

---
Breno Leitao (4):
      workqueue: park kicked worker on pool->kicked_list
      workqueue: split kick_pool() into kick_pool_pick() + wake_up_q()
      workqueue: defer the worker wakeup outside pool->lock in __queue_work()
      workqueue: defer the worker wakeup outside pool->lock in process_one_work()

 kernel/workqueue.c | 65 +++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 55 insertions(+), 10 deletions(-)
---
base-commit: c1ecb239fa3456529a32255359fc78b69eb9d847
change-id: 20260526-fastwake-02982fd66312

Best regards,
-- 
Breno Leitao <leitao@debian.org>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v2 1/4] workqueue: park kicked worker on pool->kicked_list
  2026-06-03 13:40 [PATCH v2 0/4] workqueue: Shrink the lock time Breno Leitao
@ 2026-06-03 13:40 ` Breno Leitao
  2026-06-04  8:50   ` Tejun Heo
  2026-06-03 13:40 ` [PATCH v2 2/4] workqueue: split kick_pool() into kick_pool_pick() + wake_up_q() Breno Leitao
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 10+ messages in thread
From: Breno Leitao @ 2026-06-03 13:40 UTC (permalink / raw)
  To: Tejun Heo, Lai Jiangshan
  Cc: linux-kernel, marco.crivellari, frederic, bigeasy, Hillf Danton,
	Breno Leitao, kernel-team

kick_pool() picks an idle worker and wakes it, but leaves it WORKER_IDLE
on pool->idle_list until the woken kthread schedules in and runs
worker_leave_idle().  idle_cull_fn() only checks WORKER_IDLE, not the
task state, so a kicked-but-not-yet-scheduled worker is still a valid
cull victim -- the cull can reap it before it consumes the just-enqueued
work, stranding the item.  The window is narrow today but later patches
in this series defer the wakeup outside pool->lock, widening it.

Move the picked worker from pool->idle_list to a new pool->kicked_list
under pool->lock so the cull path -- which walks idle_list only --
cannot reach it.  worker_leave_idle() already does list_del_init(), so
it correctly removes the worker from kicked_list when it actually runs;
worker_enter_idle() puts it back onto idle_list on completion.  No
extra list ops on the worker side.

LIFO coalescing of back-to-back kicks onto the same cache-hot worker is
preserved by having first_idle_worker() peek kicked_list before
idle_list: the second kick lands on the already-kicked worker, the
duplicate wakeup is a no-op, and the worker drains both items when it
runs.

Why not creating a new WORKER_KICKED flag instead, you might ask. I've
tried it and the numbers decreased.

Compared to tagging the worker with a new WORKER_KICKED flag,
list_move() writes to worker->entry (offset 0 of struct worker), which
the producer already dirties when reading the idle_list head; no new
cross-CPU cacheline is introduced.  Tagging worker->flags would have put
a producer-side write on an otherwise worker-private cacheline, causing
a coherence bounce on every kick.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 kernel/workqueue.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 8df671066dd1..b3f8b86cb52f 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -216,12 +216,13 @@ struct worker_pool {
 	int			nr_idle;	/* L: currently idle workers */

 	struct list_head	idle_list;	/* L: list of idle workers */
+	struct list_head	kicked_list;	/* L: workers kicked but not yet running */
 	struct timer_list	idle_timer;	/* L: worker idle timeout */
 	struct work_struct      idle_cull_work; /* L: worker idle cleanup */

 	struct timer_list	mayday_timer;	  /* L: SOS timer for workers */

-	/* a workers is either on busy_hash or idle_list, or the manager */
+	/* a worker is either on busy_hash, idle_list, kicked_list, or the manager */
 	DECLARE_HASHTABLE(busy_hash, BUSY_WORKER_HASH_ORDER);
 						/* L: hash of busy workers */

@@ -1031,6 +1032,13 @@ static inline void worker_clr_flags(struct worker *worker, unsigned int flags)
 /* Return the first idle worker.  Called with pool->lock held. */
 static struct worker *first_idle_worker(struct worker_pool *pool)
 {
+	/*
+	 * Prefer an already-kicked worker so back-to-back kicks coalesce
+	 * onto the same cache-hot worker (LIFO reuse).
+	 */
+	if (!list_empty(&pool->kicked_list))
+		return list_first_entry(&pool->kicked_list, struct worker, entry);
+
 	if (unlikely(list_empty(&pool->idle_list)))
 		return NULL;

@@ -1310,6 +1318,16 @@ static bool kick_pool(struct worker_pool *pool)
 		}
 	}
 #endif
+	/*
+	 * Move @worker to pool->kicked_list so a concurrent idle_cull_fn()
+	 * (which only walks pool->idle_list) cannot reap it before it
+	 * consumes the just-enqueued work.  worker_leave_idle() removes the
+	 * worker from whichever list it sits on; worker_enter_idle() puts
+	 * it back on pool->idle_list on completion.  first_idle_worker()
+	 * peeks kicked_list first, so back-to-back kicks still coalesce
+	 * onto the same cache-hot worker (LIFO reuse).
+	 */
+	list_move(&worker->entry, &pool->kicked_list);
 	wake_up_process(p);
 	return true;
 }
@@ -4896,6 +4914,7 @@ static int init_worker_pool(struct worker_pool *pool)
 	pool->last_progress_ts = jiffies;
 	INIT_LIST_HEAD(&pool->worklist);
 	INIT_LIST_HEAD(&pool->idle_list);
+	INIT_LIST_HEAD(&pool->kicked_list);
 	hash_init(pool->busy_hash);

 	timer_setup(&pool->idle_timer, idle_worker_timeout, TIMER_DEFERRABLE);

-- 
2.54.0

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 1/4] workqueue: park kicked worker on pool->kicked_list
  2026-06-03 13:40 ` [PATCH v2 1/4] workqueue: park kicked worker on pool->kicked_list Breno Leitao
@ 2026-06-04  8:50   ` Tejun Heo
  2026-06-05 14:40     ` Breno Leitao
  0 siblings, 1 reply; 10+ messages in thread
From: Tejun Heo @ 2026-06-04  8:50 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Lai Jiangshan, linux-kernel, marco.crivellari, frederic, bigeasy,
	Hillf Danton, kernel-team

Hello,

On Wed, Jun 03, 2026 at 06:40:08AM -0700, Breno Leitao wrote:
> kick_pool() picks an idle worker and wakes it, but leaves it WORKER_IDLE
> on pool->idle_list until the woken kthread schedules in and runs
> worker_leave_idle().  idle_cull_fn() only checks WORKER_IDLE, not the

A lot of the existing comments in the file are double spaced but let's not
do that anymore. Please use single space to separate sentences in desc and
comments.

> task state, so a kicked-but-not-yet-scheduled worker is still a valid
> cull victim -- the cull can reap it before it consumes the just-enqueued
> work, stranding the item.  The window is narrow today but later patches
> in this series defer the wakeup outside pool->lock, widening it.

Have you actually reproduced this? Idle culling never dips below 2 idle
workers. Wakeup is from the head of the list and culling from the end. How
can the same worker be both the wakeup and culling target? You might think a
woken idle worker can be pushed down the list by workers which became newly
idle before the woken worker could dispatch the work; however, there are no
distinctions between a newly woken idle worker and a task which is about to
turn idle. IOW, other workers cannot turn idle without picking up the work
item that the idle worker was woken up for.

I could be wrong but these days it's fairly low effort to reproduce these
perceived problems using AI agents. Please consider reproducing the problem
before submitting patches.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 1/4] workqueue: park kicked worker on pool->kicked_list
  2026-06-04  8:50   ` Tejun Heo
@ 2026-06-05 14:40     ` Breno Leitao
  2026-06-05 17:30       ` Tejun Heo
  0 siblings, 1 reply; 10+ messages in thread
From: Breno Leitao @ 2026-06-05 14:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Lai Jiangshan, linux-kernel, marco.crivellari, frederic, bigeasy,
	Hillf Danton, kernel-team

Hello Tejun,

On Wed, Jun 03, 2026 at 10:50:13PM -1000, Tejun Heo wrote:
> On Wed, Jun 03, 2026 at 06:40:08AM -0700, Breno Leitao wrote:
> > kick_pool() picks an idle worker and wakes it, but leaves it WORKER_IDLE
> > on pool->idle_list until the woken kthread schedules in and runs
> > worker_leave_idle().  idle_cull_fn() only checks WORKER_IDLE, not the
> 
> A lot of the existing comments in the file are double spaced but let's not
> do that anymore. Please use single space to separate sentences in desc and
> comments.

I suppose you mean double space after double space? Sorry, I will avoid
it.

> > task state, so a kicked-but-not-yet-scheduled worker is still a valid
> > cull victim -- the cull can reap it before it consumes the just-enqueued
> > work, stranding the item.  The window is narrow today but later patches
> > in this series defer the wakeup outside pool->lock, widening it.
> 
> Have you actually reproduced this?

No -- not without artificially changing the timeout in the kernel. So far this
is a theoretical race window rather than something I've hit in
practice; it was flagged as a critical issue by sashiko:

    https://sashiko.dev/#/patchset/20260526-fastwake-v1-0-e69ad86923e6%40debian.org

The only way I could get it to actually strand an item was by shrinking
IDLE_WORKER_TIMEOUT 150,000x (300s -> 2ms), so that a worker counts as
"timed out" almost immediately.

> Idle culling never dips below 2 idle workers. Wakeup is from the head
> of the list and culling from the end.

Right -- idle_cull_fn() reaps from the tail (list_last_entry), i.e. the
worker with the oldest last_active, and stops once nr_idle drops to 2,
while kick_pool() wakes the head. So in the common case the kicked
worker (head) and the cull victim (tail) are different workers.

What I'm less sure about: the cull decision is purely time based -- it
reaps the tail worker only if worker->last_active + IDLE_WORKER_TIMEOUT
has elapsed -- but last_active is written in exactly one place,
worker_enter_idle(). Neither kick_pool()/kick_pool_pick() nor
worker_leave_idle() update it; it is only read by the two cull timers.

So being kicked gives a worker no grace period against the cull: a
worker that has already been idle ~IDLE_WORKER_TIMEOUT and is then
kicked still looks expired, and if it drifts off the head (other
workers going idle ahead of it) before it gets a chance to run, the
cull can still reap it.

Would it make sense to refresh last_active when a worker is kicked? The
cull walks tail->head and breaks at the first non-expired worker, so a
freshly-stamped kicked worker would simply be skipped while genuinely
old workers behind it are still reaped.

Thanks for the review,
--breno

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 1/4] workqueue: park kicked worker on pool->kicked_list
  2026-06-05 14:40     ` Breno Leitao
@ 2026-06-05 17:30       ` Tejun Heo
  0 siblings, 0 replies; 10+ messages in thread
From: Tejun Heo @ 2026-06-05 17:30 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Lai Jiangshan, linux-kernel, marco.crivellari, frederic, bigeasy,
	Hillf Danton, kernel-team

Hello,

On Fri, Jun 05, 2026 at 07:40:43AM -0700, Breno Leitao wrote:
...
> > > task state, so a kicked-but-not-yet-scheduled worker is still a valid
> > > cull victim -- the cull can reap it before it consumes the just-enqueued
> > > work, stranding the item.  The window is narrow today but later patches
> > > in this series defer the wakeup outside pool->lock, widening it.
> > 
> > Have you actually reproduced this?
> 
> No -- not without artificially changing the timeout in the kernel. So far this
> is a theoretical race window rather than something I've hit in
> practice; it was flagged as a critical issue by sashiko:
> 
>     https://sashiko.dev/#/patchset/20260526-fastwake-v1-0-e69ad86923e6%40debian.org
> 
> The only way I could get it to actually strand an item was by shrinking
> IDLE_WORKER_TIMEOUT 150,000x (300s -> 2ms), so that a worker counts as
> "timed out" almost immediately.

I see. This is a bug then. It shouldn't happen even with that.

> Would it make sense to refresh last_active when a worker is kicked? The
> cull walks tail->head and breaks at the first non-expired worker, so a
> freshly-stamped kicked worker would simply be skipped while genuinely
> old workers behind it are still reaped.

I don't think timestamping is where the problem is. The intention of the
code is that the idle thread minimum count + how workers transition their
states can't lead to a situation where a work item is pending without an
idle worker to execute it regardless of timing.

Can you instrument code with the lowered threshold and record the sequence
of events. If we record the sequence of work item and worker state
transitions, it should tell us what's broken. We shouldn't need to protect
all kicked workers to fix this.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v2 2/4] workqueue: split kick_pool() into kick_pool_pick() + wake_up_q()
  2026-06-03 13:40 [PATCH v2 0/4] workqueue: Shrink the lock time Breno Leitao
  2026-06-03 13:40 ` [PATCH v2 1/4] workqueue: park kicked worker on pool->kicked_list Breno Leitao
@ 2026-06-03 13:40 ` Breno Leitao
  2026-06-03 13:40 ` [PATCH v2 3/4] workqueue: defer the worker wakeup outside pool->lock in __queue_work() Breno Leitao
  2026-06-03 13:40 ` [PATCH v2 4/4] workqueue: defer the worker wakeup outside pool->lock in process_one_work() Breno Leitao
  3 siblings, 0 replies; 10+ messages in thread
From: Breno Leitao @ 2026-06-03 13:40 UTC (permalink / raw)
  To: Tejun Heo, Lai Jiangshan
  Cc: linux-kernel, marco.crivellari, frederic, bigeasy, Hillf Danton,
	Breno Leitao, kernel-team

Factor the worker selection out of kick_pool() into kick_pool_pick(),
which picks and claims the worker under pool->lock but, instead of waking
it, queues it on a caller-provided wake_q via wake_q_add().  The caller
issues the wakeup later with wake_up_q().  wake_q_add() is safe under the
lock (cmpxchg + get_task_struct()); only wake_up_q() takes rq->lock.

kick_pool() becomes a thin wrapper using a local wake_q, so all existing
callers keep waking under pool->lock.

Pure refactor, no functional change.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 kernel/workqueue.c | 32 +++++++++++++++++++++++++++-----
 1 file changed, 27 insertions(+), 5 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index b3f8b86cb52f..2811ada6dec6 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -52,6 +52,7 @@
 #include <linux/uaccess.h>
 #include <linux/sched/isolation.h>
 #include <linux/sched/debug.h>
+#include <linux/sched/wake_q.h>
 #include <linux/nmi.h>
 #include <linux/kvm_para.h>
 #include <linux/delay.h>
@@ -1266,13 +1267,17 @@ static void kick_bh_pool(struct worker_pool *pool)
 }
 
 /**
- * kick_pool - wake up an idle worker if necessary
+ * kick_pool_pick - select and claim an idle worker, deferring the wakeup
  * @pool: pool to kick
+ * @wakeq: wake_q to queue the selected worker on
  *
- * @pool may have pending work items. Wake up worker if necessary. Returns
- * whether a worker was woken up.
+ * Like kick_pool() but queues the picked worker on @wakeq (wake_q_add())
+ * instead of waking it, so the caller can wake_up_q(@wakeq) after dropping
+ * pool->lock.  Returns whether a worker was selected.
+ *
+ * Must be called with @pool->lock held.
  */
-static bool kick_pool(struct worker_pool *pool)
+static bool kick_pool_pick(struct worker_pool *pool, struct wake_q_head *wakeq)
 {
 	struct worker *worker = first_idle_worker(pool);
 	struct task_struct *p;
@@ -1328,10 +1333,27 @@ static bool kick_pool(struct worker_pool *pool)
 	 * onto the same cache-hot worker (LIFO reuse).
 	 */
 	list_move(&worker->entry, &pool->kicked_list);
-	wake_up_process(p);
+	wake_q_add(wakeq, p);
 	return true;
 }
 
+/**
+ * kick_pool - wake up an idle worker if necessary
+ * @pool: pool to kick
+ *
+ * @pool may have pending work items. Wake up worker if necessary. Returns
+ * whether a worker was woken up.
+ */
+static bool kick_pool(struct worker_pool *pool)
+{
+	DEFINE_WAKE_Q(wakeq);
+	bool kicked;
+
+	kicked = kick_pool_pick(pool, &wakeq);
+	wake_up_q(&wakeq);
+	return kicked;
+}
+
 #ifdef CONFIG_WQ_CPU_INTENSIVE_REPORT
 
 /*

-- 
2.54.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v2 3/4] workqueue: defer the worker wakeup outside pool->lock in __queue_work()
  2026-06-03 13:40 [PATCH v2 0/4] workqueue: Shrink the lock time Breno Leitao
  2026-06-03 13:40 ` [PATCH v2 1/4] workqueue: park kicked worker on pool->kicked_list Breno Leitao
  2026-06-03 13:40 ` [PATCH v2 2/4] workqueue: split kick_pool() into kick_pool_pick() + wake_up_q() Breno Leitao
@ 2026-06-03 13:40 ` Breno Leitao
  2026-06-03 13:40 ` [PATCH v2 4/4] workqueue: defer the worker wakeup outside pool->lock in process_one_work() Breno Leitao
  3 siblings, 0 replies; 10+ messages in thread
From: Breno Leitao @ 2026-06-03 13:40 UTC (permalink / raw)
  To: Tejun Heo, Lai Jiangshan
  Cc: linux-kernel, marco.crivellari, frederic, bigeasy, Hillf Danton,
	Breno Leitao, kernel-team

__queue_work() is the enqueue hot path: it inserts the work item and
calls kick_pool() while holding pool->lock.  kick_pool() ends in a
wakeup, which takes the target task's rq->lock, so rq->lock nests under
pool->lock on every enqueue that wakes a worker on a contended unbound
pool.

Use kick_pool_pick() to select and claim the worker under pool->lock,
queue it on an on-stack wake_q, and issue the wakeup with wake_up_q()
right after dropping the lock via raw_spin_unlock_wake().  Worker
selection, wake_cpu setup and claiming the worker off pool->idle_list
still happen under the lock; only the rq->lock acquisition moves out.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 kernel/workqueue.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 2811ada6dec6..b4246a801dd8 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2317,6 +2317,7 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
 {
 	struct pool_workqueue *pwq;
 	struct worker_pool *last_pool, *pool;
+	DEFINE_WAKE_Q(wakeq);
 	unsigned int work_flags;
 	unsigned int req_cpu = cpu;
 
@@ -2431,14 +2432,15 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
 
 		trace_workqueue_activate_work(work);
 		insert_work(pwq, work, &pool->worklist, work_flags);
-		kick_pool(pool);
+		kick_pool_pick(pool, &wakeq);
 	} else {
 		work_flags |= WORK_STRUCT_INACTIVE;
 		insert_work(pwq, work, &pwq->inactive_works, work_flags);
 	}
 
 out:
-	raw_spin_unlock(&pool->lock);
+	/* deferred kick_pool_pick() wakeup, issued outside pool->lock */
+	raw_spin_unlock_wake(&pool->lock, &wakeq);
 	rcu_read_unlock();
 }
 

-- 
2.54.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v2 4/4] workqueue: defer the worker wakeup outside pool->lock in process_one_work()
  2026-06-03 13:40 [PATCH v2 0/4] workqueue: Shrink the lock time Breno Leitao
                   ` (2 preceding siblings ...)
  2026-06-03 13:40 ` [PATCH v2 3/4] workqueue: defer the worker wakeup outside pool->lock in __queue_work() Breno Leitao
@ 2026-06-03 13:40 ` Breno Leitao
  2026-06-04  8:50   ` Tejun Heo
  3 siblings, 1 reply; 10+ messages in thread
From: Breno Leitao @ 2026-06-03 13:40 UTC (permalink / raw)
  To: Tejun Heo, Lai Jiangshan
  Cc: linux-kernel, marco.crivellari, frederic, bigeasy, Hillf Danton,
	Breno Leitao, kernel-team

process_one_work() kicks the pool to chain execution of the remaining
work items on WORKER_NOT_RUNNING pools (the UNBOUND and CPU_INTENSIVE
ones), calling kick_pool() while holding pool->lock.  As in the enqueue
path, the wakeup pulls the target rq->lock in under pool->lock.

Use kick_pool_pick() to select and claim the worker under pool->lock and
issue the wakeup with wake_up_q() after the lock is dropped via
raw_spin_unlock_irq_wake().

With both hot paths converted, measured on a CONFIG_SMP x86 VM (8 vCPUs)
with the in-tree test_workqueue benchmark (lib/test_workqueue.c; each of
8 producers queues 200000 work items one at a time on a WQ_UNBOUND
workqueue, waiting for each to complete), medians of five boots per
scope:

    affinity_scope    baseline    patched    tput     p95
                     (items/s)  (items/s)    gain    drop
    --------------   ---------  ---------  ------  ------
    cpu              3,611,591  3,568,433   -1.2%   +4.6%
    smt              3,601,697  3,550,632   -1.4%   +6.1%
    cache_shard        341,913    401,213  +17.3%  -36.8%
    cache              320,607    400,560  +24.9%  -41.9%
    numa               324,909    389,202  +19.8%  -38.0%
    system             314,510    392,278  +24.7%  -37.5%

(p95 drop is the change in the p95 enqueue latency; negative is better.)

cpu/smt use per-CPU pools with no producer/consumer contention and are
essentially unchanged.  On the contended scopes the shorter pool->lock
hold time cuts p95 enqueue latency by ~40%, and because this workload is
bound by the producer<->worker round-trip that latency reduction also
lifts throughput by ~20%.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 kernel/workqueue.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index b4246a801dd8..238b02edd01d 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3261,6 +3261,7 @@ __acquires(&pool->lock)
 {
 	struct pool_workqueue *pwq = get_work_pwq(work);
 	struct worker_pool *pool = worker->pool;
+	DEFINE_WAKE_Q(wakeq);
 	unsigned long work_data;
 	int lockdep_start_depth, rcu_start_depth;
 	bool bh_draining = pool->flags & POOL_BH_DRAINING;
@@ -3315,7 +3316,7 @@ __acquires(&pool->lock)
 	 * chain execution of the pending work items for WORKER_NOT_RUNNING
 	 * workers such as the UNBOUND and CPU_INTENSIVE ones.
 	 */
-	kick_pool(pool);
+	kick_pool_pick(pool, &wakeq);
 
 	/*
 	 * Record the last pool and clear PENDING which should be the last
@@ -3326,7 +3327,8 @@ __acquires(&pool->lock)
 	set_work_pool_and_clear_pending(work, pool->id, pool_offq_flags(pool));
 
 	pwq->stats[PWQ_STAT_STARTED]++;
-	raw_spin_unlock_irq(&pool->lock);
+	/* deferred kick_pool_pick() wakeup, issued outside pool->lock */
+	raw_spin_unlock_irq_wake(&pool->lock, &wakeq);
 
 	rcu_start_depth = rcu_preempt_depth();
 	lockdep_start_depth = lockdep_depth(current);

-- 
2.54.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 4/4] workqueue: defer the worker wakeup outside pool->lock in process_one_work()
  2026-06-03 13:40 ` [PATCH v2 4/4] workqueue: defer the worker wakeup outside pool->lock in process_one_work() Breno Leitao
@ 2026-06-04  8:50   ` Tejun Heo
  2026-06-04 15:29     ` Breno Leitao
  0 siblings, 1 reply; 10+ messages in thread
From: Tejun Heo @ 2026-06-04  8:50 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Lai Jiangshan, linux-kernel, marco.crivellari, frederic, bigeasy,
	Hillf Danton, kernel-team

On Wed, Jun 03, 2026 at 06:40:11AM -0700, Breno Leitao wrote:
> process_one_work() kicks the pool to chain execution of the remaining
> work items on WORKER_NOT_RUNNING pools (the UNBOUND and CPU_INTENSIVE
> ones), calling kick_pool() while holding pool->lock.  As in the enqueue
> path, the wakeup pulls the target rq->lock in under pool->lock.
> 
> Use kick_pool_pick() to select and claim the worker under pool->lock and
> issue the wakeup with wake_up_q() after the lock is dropped via
> raw_spin_unlock_irq_wake().
> 
> With both hot paths converted, measured on a CONFIG_SMP x86 VM (8 vCPUs)
> with the in-tree test_workqueue benchmark (lib/test_workqueue.c; each of
> 8 producers queues 200000 work items one at a time on a WQ_UNBOUND
> workqueue, waiting for each to complete), medians of five boots per
> scope:

Please test on bare metal.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 4/4] workqueue: defer the worker wakeup outside pool->lock in process_one_work()
  2026-06-04  8:50   ` Tejun Heo
@ 2026-06-04 15:29     ` Breno Leitao
  0 siblings, 0 replies; 10+ messages in thread
From: Breno Leitao @ 2026-06-04 15:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Lai Jiangshan, linux-kernel, marco.crivellari, frederic, bigeasy,
	Hillf Danton, kernel-team

On Wed, Jun 03, 2026 at 10:50:29PM -1000, Tejun Heo wrote:
> On Wed, Jun 03, 2026 at 06:40:11AM -0700, Breno Leitao wrote:
> > process_one_work() kicks the pool to chain execution of the remaining
> > work items on WORKER_NOT_RUNNING pools (the UNBOUND and CPU_INTENSIVE
> > ones), calling kick_pool() while holding pool->lock.  As in the enqueue
> > path, the wakeup pulls the target rq->lock in under pool->lock.
> > 
> > Use kick_pool_pick() to select and claim the worker under pool->lock and
> > issue the wakeup with wake_up_q() after the lock is dropped via
> > raw_spin_unlock_irq_wake().
> > 
> > With both hot paths converted, measured on a CONFIG_SMP x86 VM (8 vCPUs)
> > with the in-tree test_workqueue benchmark (lib/test_workqueue.c; each of
> > 8 producers queues 200000 work items one at a time on a WQ_UNBOUND
> > workqueue, waiting for each to complete), medians of five boots per
> > scope:
> 
> Please test on bare metal.

Done, on two bare-metal machines, same test_workqueue benchmark (8
producers x 200000 items, one in flight at a time -- exactly like the test we have
lib/test_workqueue.c today):

  - arm64: NVIDIA Grace (Neoverse V2), 72 cores
  - x86:   Intel Xeon Platinum 8321HC (Cooper Lake), 52 cores

VMs and arm64 (Grace) is where this series is meant to pay off -- waking an
idle CPU sitting in wfi costs an IPI, so doing it under pool->lock lengthens
the critical section. The bare-metal numbers match what the VM showed:

    affinity_scope    baseline    patched    tput     p95
                     (items/s)  (items/s)    gain    drop
    --------------   ---------  ---------  ------  ------
    cpu              2,569,880  3,029,740  +17.9%  -13.6%
    smt              2,586,485  3,044,788  +17.7%  -14.0%
    cache_shard        572,055    797,621  +39.4%  -37.1%
    cache              538,132    724,997  +34.7%  -30.1%
    numa               528,673    658,215  +24.5%  -20.5%
    system             524,287    614,486  +17.2%  -21.1%

(p95 drop = change in p95 enqueue latency; negative is better.)
(tput gain = number of requests enqueued per sec; bigger is better.)

On x86 (Cooper Lake) the same test was neutral, thow -- within boot-to-boot
noise on the contended scopes.

I got the impression waking an idle x86 CPU is cheap, so there is little
under-lock wakeup cost to move out, and the benchmark stays
pool->lock-acquisition bound either way (perf shows ~46% in
queued_spin_lock_slowpath both before and after, unchanged).

So the win is real but architecture-dependent: arm64 (and virt, where a vCPU
wakeup is even more expensive) benefit; x86 bare metal is a Null-ish.


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2026-06-05 17:30 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-03 13:40 [PATCH v2 0/4] workqueue: Shrink the lock time Breno Leitao
2026-06-03 13:40 ` [PATCH v2 1/4] workqueue: park kicked worker on pool->kicked_list Breno Leitao
2026-06-04  8:50   ` Tejun Heo
2026-06-05 14:40     ` Breno Leitao
2026-06-05 17:30       ` Tejun Heo
2026-06-03 13:40 ` [PATCH v2 2/4] workqueue: split kick_pool() into kick_pool_pick() + wake_up_q() Breno Leitao
2026-06-03 13:40 ` [PATCH v2 3/4] workqueue: defer the worker wakeup outside pool->lock in __queue_work() Breno Leitao
2026-06-03 13:40 ` [PATCH v2 4/4] workqueue: defer the worker wakeup outside pool->lock in process_one_work() Breno Leitao
2026-06-04  8:50   ` Tejun Heo
2026-06-04 15:29     ` Breno Leitao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox