[PATCH v2 0/4] workqueue: Shrink the lock time

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v2 0/4] workqueue: Shrink the lock time
@ 2026-06-03 13:40 Breno Leitao
  2026-06-03 13:40 ` [PATCH v2 1/4] workqueue: park kicked worker on pool->kicked_list Breno Leitao
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Breno Leitao @ 2026-06-03 13:40 UTC (permalink / raw)
  To: Tejun Heo, Lai Jiangshan
  Cc: linux-kernel, marco.crivellari, frederic, bigeasy, Hillf Danton,
	Breno Leitao, kernel-team

The goal of this patchset is to decrease the time spent under the
workqueue pool->lock.

Currently the worker process is woken up inside pool->lock.  The wakeup
ends in wake_up_process(), which takes the target task's rq->lock, so
rq->lock nests under pool->lock on the two hottest paths of a contended
unbound workqueue (__queue_work() enqueue and process_one_work() chain
kick).  On some architectures the wakeup is even more expensive: on
arm64 waking a CPU that is idle (in wfi) issues an IPI.

Doing all of that while holding pool->lock lengthens the locked region
and hurts throughput on contended unbound pools.

This series shortens the locked region by selecting and claiming the
worker to wake under pool->lock, but issuing the actual wakeup after the
lock is dropped, using the wake_q machinery (wake_q_add() under the
lock, wake_up_q() after).

Because the win is a shorter pool->lock hold time, it shows up most
clearly as lower enqueue latency under contention.  Measured with the
in-tree test_workqueue microbenchmark (lib/test_workqueue.c, 8 producers
x 200000 items on a WQ_UNBOUND workqueue, x86 8-vCPU VM, medians of five
boots), on the contended affinity scopes (cache_shard, cache, numa,
system) p95 enqueue latency drops ~40% and throughput improves ~20%.
The uncontended per-CPU scopes (cpu, smt) are unaffected.

While reworking this, Hillf Danton pointed out -- and a closer look
confirmed -- a latent race: kick_pool() wakes a worker but leaves it
WORKER_IDLE on pool->idle_list until it schedules in, so a concurrent
idle_cull_fn() (which only checks WORKER_IDLE, not the task state) can
reap it before it consumes the work, stranding the just-enqueued item.

The window is narrow today but deferring the wakeup widens it, so the
first patch closes it by moving the kicked worker onto a new
pool->kicked_list under pool->lock, out of reach of the cull which walks
idle_list only.

Patch 1 fixes the cull race and is a standalone correctness fix.
Patch 2 is a pure refactor introducing kick_pool_pick().
Patch 3 defers the wakeup on the enqueue path (__queue_work()).
Patch 4 defers the wakeup on the per-work chain-kick path
(process_one_work()).

Changes in v2:
- Close the idle_cull_fn() vs kicked-worker race by parking the kicked
  worker on a new pool->kicked_list under pool->lock (new patch 1).
  Reported by Hillf Danton.
- Use the wake_q machinery (wake_q_add() / wake_up_q() via
  raw_spin_unlock_wake()) instead of plumbing a task_struct out of the
  helper by hand.  Suggested by Sebastian Andrzej Siewior.
- Link to v1: https://lore.kernel.org/r/20260526-fastwake-v1-0-e69ad86923e6@debian.org

Signed-off-by: Breno Leitao <leitao@debian.org>
---
To: Tejun Heo <tj@kernel.org>
To: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: linux-kernel@vger.kernel.org

---
Breno Leitao (4):
      workqueue: park kicked worker on pool->kicked_list
      workqueue: split kick_pool() into kick_pool_pick() + wake_up_q()
      workqueue: defer the worker wakeup outside pool->lock in __queue_work()
      workqueue: defer the worker wakeup outside pool->lock in process_one_work()

 kernel/workqueue.c | 65 +++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 55 insertions(+), 10 deletions(-)
---
base-commit: c1ecb239fa3456529a32255359fc78b69eb9d847
change-id: 20260526-fastwake-02982fd66312

Best regards,
-- 
Breno Leitao <leitao@debian.org>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v2 1/4] workqueue: park kicked worker on pool->kicked_list
  2026-06-03 13:40 [PATCH v2 0/4] workqueue: Shrink the lock time Breno Leitao
@ 2026-06-03 13:40 ` Breno Leitao
  2026-06-04  8:50   ` Tejun Heo
  2026-06-03 13:40 ` [PATCH v2 2/4] workqueue: split kick_pool() into kick_pool_pick() + wake_up_q() Breno Leitao
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 8+ messages in thread
From: Breno Leitao @ 2026-06-03 13:40 UTC (permalink / raw)
  To: Tejun Heo, Lai Jiangshan
  Cc: linux-kernel, marco.crivellari, frederic, bigeasy, Hillf Danton,
	Breno Leitao, kernel-team

kick_pool() picks an idle worker and wakes it, but leaves it WORKER_IDLE
on pool->idle_list until the woken kthread schedules in and runs
worker_leave_idle().  idle_cull_fn() only checks WORKER_IDLE, not the
task state, so a kicked-but-not-yet-scheduled worker is still a valid
cull victim -- the cull can reap it before it consumes the just-enqueued
work, stranding the item.  The window is narrow today but later patches
in this series defer the wakeup outside pool->lock, widening it.

Move the picked worker from pool->idle_list to a new pool->kicked_list
under pool->lock so the cull path -- which walks idle_list only --
cannot reach it.  worker_leave_idle() already does list_del_init(), so
it correctly removes the worker from kicked_list when it actually runs;
worker_enter_idle() puts it back onto idle_list on completion.  No
extra list ops on the worker side.

LIFO coalescing of back-to-back kicks onto the same cache-hot worker is
preserved by having first_idle_worker() peek kicked_list before
idle_list: the second kick lands on the already-kicked worker, the
duplicate wakeup is a no-op, and the worker drains both items when it
runs.

Why not creating a new WORKER_KICKED flag instead, you might ask. I've
tried it and the numbers decreased.

Compared to tagging the worker with a new WORKER_KICKED flag,
list_move() writes to worker->entry (offset 0 of struct worker), which
the producer already dirties when reading the idle_list head; no new
cross-CPU cacheline is introduced.  Tagging worker->flags would have put
a producer-side write on an otherwise worker-private cacheline, causing
a coherence bounce on every kick.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 kernel/workqueue.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 8df671066dd1..b3f8b86cb52f 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -216,12 +216,13 @@ struct worker_pool {
 	int			nr_idle;	/* L: currently idle workers */

 	struct list_head	idle_list;	/* L: list of idle workers */
+	struct list_head	kicked_list;	/* L: workers kicked but not yet running */
 	struct timer_list	idle_timer;	/* L: worker idle timeout */
 	struct work_struct      idle_cull_work; /* L: worker idle cleanup */

 	struct timer_list	mayday_timer;	  /* L: SOS timer for workers */

-	/* a workers is either on busy_hash or idle_list, or the manager */
+	/* a worker is either on busy_hash, idle_list, kicked_list, or the manager */
 	DECLARE_HASHTABLE(busy_hash, BUSY_WORKER_HASH_ORDER);
 						/* L: hash of busy workers */

@@ -1031,6 +1032,13 @@ static inline void worker_clr_flags(struct worker *worker, unsigned int flags)
 /* Return the first idle worker.  Called with pool->lock held. */
 static struct worker *first_idle_worker(struct worker_pool *pool)
 {
+	/*
+	 * Prefer an already-kicked worker so back-to-back kicks coalesce
+	 * onto the same cache-hot worker (LIFO reuse).
+	 */
+	if (!list_empty(&pool->kicked_list))
+		return list_first_entry(&pool->kicked_list, struct worker, entry);
+
 	if (unlikely(list_empty(&pool->idle_list)))
 		return NULL;

@@ -1310,6 +1318,16 @@ static bool kick_pool(struct worker_pool *pool)
 		}
 	}
 #endif
+	/*
+	 * Move @worker to pool->kicked_list so a concurrent idle_cull_fn()
+	 * (which only walks pool->idle_list) cannot reap it before it
+	 * consumes the just-enqueued work.  worker_leave_idle() removes the
+	 * worker from whichever list it sits on; worker_enter_idle() puts
+	 * it back on pool->idle_list on completion.  first_idle_worker()
+	 * peeks kicked_list first, so back-to-back kicks still coalesce
+	 * onto the same cache-hot worker (LIFO reuse).
+	 */
+	list_move(&worker->entry, &pool->kicked_list);
 	wake_up_process(p);
 	return true;
 }
@@ -4896,6 +4914,7 @@ static int init_worker_pool(struct worker_pool *pool)
 	pool->last_progress_ts = jiffies;
 	INIT_LIST_HEAD(&pool->worklist);
 	INIT_LIST_HEAD(&pool->idle_list);
+	INIT_LIST_HEAD(&pool->kicked_list);
 	hash_init(pool->busy_hash);

 	timer_setup(&pool->idle_timer, idle_worker_timeout, TIMER_DEFERRABLE);

-- 
2.54.0

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 1/4] workqueue: park kicked worker on pool->kicked_list
  2026-06-03 13:40 ` [PATCH v2 1/4] workqueue: park kicked worker on pool->kicked_list Breno Leitao
@ 2026-06-04  8:50   ` Tejun Heo
  0 siblings, 0 replies; 8+ messages in thread
From: Tejun Heo @ 2026-06-04  8:50 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Lai Jiangshan, linux-kernel, marco.crivellari, frederic, bigeasy,
	Hillf Danton, kernel-team

Hello,

On Wed, Jun 03, 2026 at 06:40:08AM -0700, Breno Leitao wrote:
> kick_pool() picks an idle worker and wakes it, but leaves it WORKER_IDLE
> on pool->idle_list until the woken kthread schedules in and runs
> worker_leave_idle().  idle_cull_fn() only checks WORKER_IDLE, not the

A lot of the existing comments in the file are double spaced but let's not
do that anymore. Please use single space to separate sentences in desc and
comments.

> task state, so a kicked-but-not-yet-scheduled worker is still a valid
> cull victim -- the cull can reap it before it consumes the just-enqueued
> work, stranding the item.  The window is narrow today but later patches
> in this series defer the wakeup outside pool->lock, widening it.

Have you actually reproduced this? Idle culling never dips below 2 idle
workers. Wakeup is from the head of the list and culling from the end. How
can the same worker be both the wakeup and culling target? You might think a
woken idle worker can be pushed down the list by workers which became newly
idle before the woken worker could dispatch the work; however, there are no
distinctions between a newly woken idle worker and a task which is about to
turn idle. IOW, other workers cannot turn idle without picking up the work
item that the idle worker was woken up for.

I could be wrong but these days it's fairly low effort to reproduce these
perceived problems using AI agents. Please consider reproducing the problem
before submitting patches.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v2 2/4] workqueue: split kick_pool() into kick_pool_pick() + wake_up_q()
  2026-06-03 13:40 [PATCH v2 0/4] workqueue: Shrink the lock time Breno Leitao
  2026-06-03 13:40 ` [PATCH v2 1/4] workqueue: park kicked worker on pool->kicked_list Breno Leitao
@ 2026-06-03 13:40 ` Breno Leitao
  2026-06-03 13:40 ` [PATCH v2 3/4] workqueue: defer the worker wakeup outside pool->lock in __queue_work() Breno Leitao
  2026-06-03 13:40 ` [PATCH v2 4/4] workqueue: defer the worker wakeup outside pool->lock in process_one_work() Breno Leitao
  3 siblings, 0 replies; 8+ messages in thread
From: Breno Leitao @ 2026-06-03 13:40 UTC (permalink / raw)
  To: Tejun Heo, Lai Jiangshan
  Cc: linux-kernel, marco.crivellari, frederic, bigeasy, Hillf Danton,
	Breno Leitao, kernel-team

Factor the worker selection out of kick_pool() into kick_pool_pick(),
which picks and claims the worker under pool->lock but, instead of waking
it, queues it on a caller-provided wake_q via wake_q_add().  The caller
issues the wakeup later with wake_up_q().  wake_q_add() is safe under the
lock (cmpxchg + get_task_struct()); only wake_up_q() takes rq->lock.

kick_pool() becomes a thin wrapper using a local wake_q, so all existing
callers keep waking under pool->lock.

Pure refactor, no functional change.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 kernel/workqueue.c | 32 +++++++++++++++++++++++++++-----
 1 file changed, 27 insertions(+), 5 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index b3f8b86cb52f..2811ada6dec6 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -52,6 +52,7 @@
 #include <linux/uaccess.h>
 #include <linux/sched/isolation.h>
 #include <linux/sched/debug.h>
+#include <linux/sched/wake_q.h>
 #include <linux/nmi.h>
 #include <linux/kvm_para.h>
 #include <linux/delay.h>
@@ -1266,13 +1267,17 @@ static void kick_bh_pool(struct worker_pool *pool)
 }
 
 /**
- * kick_pool - wake up an idle worker if necessary
+ * kick_pool_pick - select and claim an idle worker, deferring the wakeup
  * @pool: pool to kick
+ * @wakeq: wake_q to queue the selected worker on
  *
- * @pool may have pending work items. Wake up worker if necessary. Returns
- * whether a worker was woken up.
+ * Like kick_pool() but queues the picked worker on @wakeq (wake_q_add())
+ * instead of waking it, so the caller can wake_up_q(@wakeq) after dropping
+ * pool->lock.  Returns whether a worker was selected.
+ *
+ * Must be called with @pool->lock held.
  */
-static bool kick_pool(struct worker_pool *pool)
+static bool kick_pool_pick(struct worker_pool *pool, struct wake_q_head *wakeq)
 {
 	struct worker *worker = first_idle_worker(pool);
 	struct task_struct *p;
@@ -1328,10 +1333,27 @@ static bool kick_pool(struct worker_pool *pool)
 	 * onto the same cache-hot worker (LIFO reuse).
 	 */
 	list_move(&worker->entry, &pool->kicked_list);
-	wake_up_process(p);
+	wake_q_add(wakeq, p);
 	return true;
 }
 
+/**
+ * kick_pool - wake up an idle worker if necessary
+ * @pool: pool to kick
+ *
+ * @pool may have pending work items. Wake up worker if necessary. Returns
+ * whether a worker was woken up.
+ */
+static bool kick_pool(struct worker_pool *pool)
+{
+	DEFINE_WAKE_Q(wakeq);
+	bool kicked;
+
+	kicked = kick_pool_pick(pool, &wakeq);
+	wake_up_q(&wakeq);
+	return kicked;
+}
+
 #ifdef CONFIG_WQ_CPU_INTENSIVE_REPORT
 
 /*

-- 
2.54.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v2 3/4] workqueue: defer the worker wakeup outside pool->lock in __queue_work()
  2026-06-03 13:40 [PATCH v2 0/4] workqueue: Shrink the lock time Breno Leitao
  2026-06-03 13:40 ` [PATCH v2 1/4] workqueue: park kicked worker on pool->kicked_list Breno Leitao
  2026-06-03 13:40 ` [PATCH v2 2/4] workqueue: split kick_pool() into kick_pool_pick() + wake_up_q() Breno Leitao
@ 2026-06-03 13:40 ` Breno Leitao
  2026-06-03 13:40 ` [PATCH v2 4/4] workqueue: defer the worker wakeup outside pool->lock in process_one_work() Breno Leitao
  3 siblings, 0 replies; 8+ messages in thread
From: Breno Leitao @ 2026-06-03 13:40 UTC (permalink / raw)
  To: Tejun Heo, Lai Jiangshan
  Cc: linux-kernel, marco.crivellari, frederic, bigeasy, Hillf Danton,
	Breno Leitao, kernel-team

__queue_work() is the enqueue hot path: it inserts the work item and
calls kick_pool() while holding pool->lock.  kick_pool() ends in a
wakeup, which takes the target task's rq->lock, so rq->lock nests under
pool->lock on every enqueue that wakes a worker on a contended unbound
pool.

Use kick_pool_pick() to select and claim the worker under pool->lock,
queue it on an on-stack wake_q, and issue the wakeup with wake_up_q()
right after dropping the lock via raw_spin_unlock_wake().  Worker
selection, wake_cpu setup and claiming the worker off pool->idle_list
still happen under the lock; only the rq->lock acquisition moves out.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 kernel/workqueue.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 2811ada6dec6..b4246a801dd8 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2317,6 +2317,7 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
 {
 	struct pool_workqueue *pwq;
 	struct worker_pool *last_pool, *pool;
+	DEFINE_WAKE_Q(wakeq);
 	unsigned int work_flags;
 	unsigned int req_cpu = cpu;
 
@@ -2431,14 +2432,15 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
 
 		trace_workqueue_activate_work(work);
 		insert_work(pwq, work, &pool->worklist, work_flags);
-		kick_pool(pool);
+		kick_pool_pick(pool, &wakeq);
 	} else {
 		work_flags |= WORK_STRUCT_INACTIVE;
 		insert_work(pwq, work, &pwq->inactive_works, work_flags);
 	}
 
 out:
-	raw_spin_unlock(&pool->lock);
+	/* deferred kick_pool_pick() wakeup, issued outside pool->lock */
+	raw_spin_unlock_wake(&pool->lock, &wakeq);
 	rcu_read_unlock();
 }
 

-- 
2.54.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v2 4/4] workqueue: defer the worker wakeup outside pool->lock in process_one_work()
  2026-06-03 13:40 [PATCH v2 0/4] workqueue: Shrink the lock time Breno Leitao
                   ` (2 preceding siblings ...)
  2026-06-03 13:40 ` [PATCH v2 3/4] workqueue: defer the worker wakeup outside pool->lock in __queue_work() Breno Leitao
@ 2026-06-03 13:40 ` Breno Leitao
  2026-06-04  8:50   ` Tejun Heo
  3 siblings, 1 reply; 8+ messages in thread
From: Breno Leitao @ 2026-06-03 13:40 UTC (permalink / raw)
  To: Tejun Heo, Lai Jiangshan
  Cc: linux-kernel, marco.crivellari, frederic, bigeasy, Hillf Danton,
	Breno Leitao, kernel-team

process_one_work() kicks the pool to chain execution of the remaining
work items on WORKER_NOT_RUNNING pools (the UNBOUND and CPU_INTENSIVE
ones), calling kick_pool() while holding pool->lock.  As in the enqueue
path, the wakeup pulls the target rq->lock in under pool->lock.

Use kick_pool_pick() to select and claim the worker under pool->lock and
issue the wakeup with wake_up_q() after the lock is dropped via
raw_spin_unlock_irq_wake().

With both hot paths converted, measured on a CONFIG_SMP x86 VM (8 vCPUs)
with the in-tree test_workqueue benchmark (lib/test_workqueue.c; each of
8 producers queues 200000 work items one at a time on a WQ_UNBOUND
workqueue, waiting for each to complete), medians of five boots per
scope:

    affinity_scope    baseline    patched    tput     p95
                     (items/s)  (items/s)    gain    drop
    --------------   ---------  ---------  ------  ------
    cpu              3,611,591  3,568,433   -1.2%   +4.6%
    smt              3,601,697  3,550,632   -1.4%   +6.1%
    cache_shard        341,913    401,213  +17.3%  -36.8%
    cache              320,607    400,560  +24.9%  -41.9%
    numa               324,909    389,202  +19.8%  -38.0%
    system             314,510    392,278  +24.7%  -37.5%

(p95 drop is the change in the p95 enqueue latency; negative is better.)

cpu/smt use per-CPU pools with no producer/consumer contention and are
essentially unchanged.  On the contended scopes the shorter pool->lock
hold time cuts p95 enqueue latency by ~40%, and because this workload is
bound by the producer<->worker round-trip that latency reduction also
lifts throughput by ~20%.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 kernel/workqueue.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index b4246a801dd8..238b02edd01d 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3261,6 +3261,7 @@ __acquires(&pool->lock)
 {
 	struct pool_workqueue *pwq = get_work_pwq(work);
 	struct worker_pool *pool = worker->pool;
+	DEFINE_WAKE_Q(wakeq);
 	unsigned long work_data;
 	int lockdep_start_depth, rcu_start_depth;
 	bool bh_draining = pool->flags & POOL_BH_DRAINING;
@@ -3315,7 +3316,7 @@ __acquires(&pool->lock)
 	 * chain execution of the pending work items for WORKER_NOT_RUNNING
 	 * workers such as the UNBOUND and CPU_INTENSIVE ones.
 	 */
-	kick_pool(pool);
+	kick_pool_pick(pool, &wakeq);
 
 	/*
 	 * Record the last pool and clear PENDING which should be the last
@@ -3326,7 +3327,8 @@ __acquires(&pool->lock)
 	set_work_pool_and_clear_pending(work, pool->id, pool_offq_flags(pool));
 
 	pwq->stats[PWQ_STAT_STARTED]++;
-	raw_spin_unlock_irq(&pool->lock);
+	/* deferred kick_pool_pick() wakeup, issued outside pool->lock */
+	raw_spin_unlock_irq_wake(&pool->lock, &wakeq);
 
 	rcu_start_depth = rcu_preempt_depth();
 	lockdep_start_depth = lockdep_depth(current);

-- 
2.54.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 4/4] workqueue: defer the worker wakeup outside pool->lock in process_one_work()
  2026-06-03 13:40 ` [PATCH v2 4/4] workqueue: defer the worker wakeup outside pool->lock in process_one_work() Breno Leitao
@ 2026-06-04  8:50   ` Tejun Heo
  2026-06-04 15:29     ` Breno Leitao
  0 siblings, 1 reply; 8+ messages in thread
From: Tejun Heo @ 2026-06-04  8:50 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Lai Jiangshan, linux-kernel, marco.crivellari, frederic, bigeasy,
	Hillf Danton, kernel-team

On Wed, Jun 03, 2026 at 06:40:11AM -0700, Breno Leitao wrote:
> process_one_work() kicks the pool to chain execution of the remaining
> work items on WORKER_NOT_RUNNING pools (the UNBOUND and CPU_INTENSIVE
> ones), calling kick_pool() while holding pool->lock.  As in the enqueue
> path, the wakeup pulls the target rq->lock in under pool->lock.
> 
> Use kick_pool_pick() to select and claim the worker under pool->lock and
> issue the wakeup with wake_up_q() after the lock is dropped via
> raw_spin_unlock_irq_wake().
> 
> With both hot paths converted, measured on a CONFIG_SMP x86 VM (8 vCPUs)
> with the in-tree test_workqueue benchmark (lib/test_workqueue.c; each of
> 8 producers queues 200000 work items one at a time on a WQ_UNBOUND
> workqueue, waiting for each to complete), medians of five boots per
> scope:

Please test on bare metal.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 4/4] workqueue: defer the worker wakeup outside pool->lock in process_one_work()
  2026-06-04  8:50   ` Tejun Heo
@ 2026-06-04 15:29     ` Breno Leitao
  0 siblings, 0 replies; 8+ messages in thread
From: Breno Leitao @ 2026-06-04 15:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Lai Jiangshan, linux-kernel, marco.crivellari, frederic, bigeasy,
	Hillf Danton, kernel-team

On Wed, Jun 03, 2026 at 10:50:29PM -1000, Tejun Heo wrote:
> On Wed, Jun 03, 2026 at 06:40:11AM -0700, Breno Leitao wrote:
> > process_one_work() kicks the pool to chain execution of the remaining
> > work items on WORKER_NOT_RUNNING pools (the UNBOUND and CPU_INTENSIVE
> > ones), calling kick_pool() while holding pool->lock.  As in the enqueue
> > path, the wakeup pulls the target rq->lock in under pool->lock.
> > 
> > Use kick_pool_pick() to select and claim the worker under pool->lock and
> > issue the wakeup with wake_up_q() after the lock is dropped via
> > raw_spin_unlock_irq_wake().
> > 
> > With both hot paths converted, measured on a CONFIG_SMP x86 VM (8 vCPUs)
> > with the in-tree test_workqueue benchmark (lib/test_workqueue.c; each of
> > 8 producers queues 200000 work items one at a time on a WQ_UNBOUND
> > workqueue, waiting for each to complete), medians of five boots per
> > scope:
> 
> Please test on bare metal.

Done, on two bare-metal machines, same test_workqueue benchmark (8
producers x 200000 items, one in flight at a time -- exactly like the test we have
lib/test_workqueue.c today):

  - arm64: NVIDIA Grace (Neoverse V2), 72 cores
  - x86:   Intel Xeon Platinum 8321HC (Cooper Lake), 52 cores

VMs and arm64 (Grace) is where this series is meant to pay off -- waking an
idle CPU sitting in wfi costs an IPI, so doing it under pool->lock lengthens
the critical section. The bare-metal numbers match what the VM showed:

    affinity_scope    baseline    patched    tput     p95
                     (items/s)  (items/s)    gain    drop
    --------------   ---------  ---------  ------  ------
    cpu              2,569,880  3,029,740  +17.9%  -13.6%
    smt              2,586,485  3,044,788  +17.7%  -14.0%
    cache_shard        572,055    797,621  +39.4%  -37.1%
    cache              538,132    724,997  +34.7%  -30.1%
    numa               528,673    658,215  +24.5%  -20.5%
    system             524,287    614,486  +17.2%  -21.1%

(p95 drop = change in p95 enqueue latency; negative is better.)
(tput gain = number of requests enqueued per sec; bigger is better.)

On x86 (Cooper Lake) the same test was neutral, thow -- within boot-to-boot
noise on the contended scopes.

I got the impression waking an idle x86 CPU is cheap, so there is little
under-lock wakeup cost to move out, and the benchmark stays
pool->lock-acquisition bound either way (perf shows ~46% in
queued_spin_lock_slowpath both before and after, unchanged).

So the win is real but architecture-dependent: arm64 (and virt, where a vCPU
wakeup is even more expensive) benefit; x86 bare metal is a Null-ish.


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2026-06-04 15:29 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-03 13:40 [PATCH v2 0/4] workqueue: Shrink the lock time Breno Leitao
2026-06-03 13:40 ` [PATCH v2 1/4] workqueue: park kicked worker on pool->kicked_list Breno Leitao
2026-06-04  8:50   ` Tejun Heo
2026-06-03 13:40 ` [PATCH v2 2/4] workqueue: split kick_pool() into kick_pool_pick() + wake_up_q() Breno Leitao
2026-06-03 13:40 ` [PATCH v2 3/4] workqueue: defer the worker wakeup outside pool->lock in __queue_work() Breno Leitao
2026-06-03 13:40 ` [PATCH v2 4/4] workqueue: defer the worker wakeup outside pool->lock in process_one_work() Breno Leitao
2026-06-04  8:50   ` Tejun Heo
2026-06-04 15:29     ` Breno Leitao

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.