[PATCH 0/2] workqueue: Shrink the lock time

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/2] workqueue: Shrink the lock time
@ 2026-05-26 18:08 Breno Leitao
  2026-05-26 18:08 ` [PATCH 1/2] workqueue: split kick_pool() into kick_pool_pick() + wake_up_process() Breno Leitao
  2026-05-26 18:08 ` [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths Breno Leitao
  0 siblings, 2 replies; 11+ messages in thread
From: Breno Leitao @ 2026-05-26 18:08 UTC (permalink / raw)
  To: Tejun Heo, Lai Jiangshan
  Cc: linux-kernel, marco.crivellari, frederic, bigeasy, Breno Leitao,
	kernel-team

The goal of this patchset is to decrease the time spent on the
workqueue pool->lock.

Currently, the worker process is woken up inside the lock, which can
be expensive on some architectures.  On arm64, the wake-up issues an
IPI when the target CPU is in idle (wfi on arm64), which requires an
interrupt to wake it. And also, the rq lock, which will be nested in
this case.

This series decreases the time spent under pool->lock by waking up
the process outside of the lock. This improves the microbenchark we have
in lib/ by up to 10%.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
Breno Leitao (2):
      workqueue: split kick_pool() into kick_pool_pick() + wake_up_process()
      workqueue: defer wake_up_process() outside pool->lock on hot paths

 kernel/workqueue.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 49 insertions(+), 8 deletions(-)
---
base-commit: c1ecb239fa3456529a32255359fc78b69eb9d847
change-id: 20260526-fastwake-02982fd66312

Best regards,
--  
Breno Leitao <leitao@debian.org>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/2] workqueue: split kick_pool() into kick_pool_pick() + wake_up_process()
  2026-05-26 18:08 [PATCH 0/2] workqueue: Shrink the lock time Breno Leitao
@ 2026-05-26 18:08 ` Breno Leitao
  2026-05-26 18:08 ` [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths Breno Leitao
  1 sibling, 0 replies; 11+ messages in thread
From: Breno Leitao @ 2026-05-26 18:08 UTC (permalink / raw)
  To: Tejun Heo, Lai Jiangshan
  Cc: linux-kernel, marco.crivellari, frederic, bigeasy, Breno Leitao,
	kernel-team

Factor the worker-selection part of kick_pool() out into a new helper
kick_pool_pick() that picks the worker to wake (and applies the
wake_cpu / repatriation hint that requires pool->lock) but does not
call wake_up_process() on it.  It returns the task_struct of the
worker to wake, or NULL when no wakeup is needed.

kick_pool() becomes a thin wrapper that calls kick_pool_pick() and
then wake_up_process() on the returned task, so all existing callers
keep their current behavior.

This is a pure refactor with no functional change.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 kernel/workqueue.c | 36 ++++++++++++++++++++++++++++++------
 1 file changed, 30 insertions(+), 6 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 8df671066dd1..b788d7c44ac0 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1258,13 +1258,21 @@ static void kick_bh_pool(struct worker_pool *pool)
 }
 
 /**
- * kick_pool - wake up an idle worker if necessary
+ * kick_pool_pick - select a worker to wake up but defer the wakeup
  * @pool: pool to kick
  *
- * @pool may have pending work items. Wake up worker if necessary. Returns
- * whether a worker was woken up.
+ * Same selection logic as kick_pool() but returns the task that should be
+ * woken instead of calling wake_up_process() on it.  The caller is
+ * responsible for calling wake_up_process() on the returned task (typically
+ * after dropping @pool->lock to shorten the locked region).
+ *
+ * Returns the task_struct to wake, or NULL if no wakeup is needed (e.g. the
+ * pool was empty, no idle worker was available, or it was a BH pool which
+ * was already kicked synchronously).
+ *
+ * Must be called with @pool->lock held.
  */
-static bool kick_pool(struct worker_pool *pool)
+static struct task_struct *kick_pool_pick(struct worker_pool *pool)
 {
 	struct worker *worker = first_idle_worker(pool);
 	struct task_struct *p;
@@ -1272,11 +1280,11 @@ static bool kick_pool(struct worker_pool *pool)
 	lockdep_assert_held(&pool->lock);
 
 	if (!need_more_worker(pool) || !worker)
-		return false;
+		return NULL;
 
 	if (pool->flags & POOL_BH) {
 		kick_bh_pool(pool);
-		return true;
+		return NULL;
 	}
 
 	p = worker->task;
@@ -1310,6 +1318,22 @@ static bool kick_pool(struct worker_pool *pool)
 		}
 	}
 #endif
+	return p;
+}
+
+/**
+ * kick_pool - wake up an idle worker if necessary
+ * @pool: pool to kick
+ *
+ * @pool may have pending work items.  Wake up worker if necessary.  Returns
+ * whether a worker was woken up.
+ */
+static bool kick_pool(struct worker_pool *pool)
+{
+	struct task_struct *p = kick_pool_pick(pool);
+
+	if (!p)
+		return false;
 	wake_up_process(p);
 	return true;
 }

-- 
2.51.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths
  2026-05-26 18:08 [PATCH 0/2] workqueue: Shrink the lock time Breno Leitao
  2026-05-26 18:08 ` [PATCH 1/2] workqueue: split kick_pool() into kick_pool_pick() + wake_up_process() Breno Leitao
@ 2026-05-26 18:08 ` Breno Leitao
  2026-05-26 21:23   ` Hillf Danton
  2026-05-27 15:22   ` Sebastian Andrzej Siewior
  1 sibling, 2 replies; 11+ messages in thread
From: Breno Leitao @ 2026-05-26 18:08 UTC (permalink / raw)
  To: Tejun Heo, Lai Jiangshan
  Cc: linux-kernel, marco.crivellari, frederic, bigeasy, Breno Leitao,
	kernel-team

Both __queue_work() (enqueue) and process_one_work() (per-work chain
kick on unbound/CPU_INTENSIVE pools) call kick_pool() while holding
pool->lock.  kick_pool() ends in wake_up_process(), which takes the
target task's rq->lock.  Holding pool->lock across that runqueue lock
acquisition lengthens the locked region on the two hottest paths of a
contended unbound workqueue.

Use the new kick_pool_pick() helper to select the worker to wake
while holding pool->lock, then call wake_up_process() after pool->lock
is released.

All state that requires pool->lock (worker selection, wake_cpu
adjustment, BH-pool fast path) is still done under the lock; only the
unrelated rq->lock acquisition is moved out.

Measured on a CONFIG_SMP arm64 VM (8 vCPUs) with the test_workqueue
benchmark (lib/test_workqueue.c) using a batched-submit mode (8
producer kthreads, 200000 work items each, WQ_UNBOUND).  Averages of
five runs per scope:

    affinity_scope   baseline (items/s)   patched (items/s)   gain
    --------------   ------------------   -----------------   ----
    cpu                       1,419,973           1,413,896   -0.4 %  (no contention)
    smt                       1,442,921           1,437,164   -0.4 %  (no contention)
    cache_shard               1,184,058           1,279,184   +8.0 %
    cache                     1,167,603           1,271,341   +8.9 %
    numa                      1,163,617           1,285,427   +10.5 %
    system                    1,175,933           1,255,227   +6.7 %

Enqueue latency on the contended scopes also drops (p50 ~2875 -> ~2625
ns, p99 ~5000 -> ~4200 ns).  The cpu/smt scopes use per-CPU pools with
no producer/consumer contention, so as expected they are unchanged.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 kernel/workqueue.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index b788d7c44ac0..1403a4b195a3 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2301,6 +2301,7 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
 {
 	struct pool_workqueue *pwq;
 	struct worker_pool *last_pool, *pool;
+	struct task_struct *wake_p = NULL;
 	unsigned int work_flags;
 	unsigned int req_cpu = cpu;
 
@@ -2415,7 +2416,7 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
 
 		trace_workqueue_activate_work(work);
 		insert_work(pwq, work, &pool->worklist, work_flags);
-		kick_pool(pool);
+		wake_p = kick_pool_pick(pool);
 	} else {
 		work_flags |= WORK_STRUCT_INACTIVE;
 		insert_work(pwq, work, &pwq->inactive_works, work_flags);
@@ -2423,6 +2424,15 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
 
 out:
 	raw_spin_unlock(&pool->lock);
+	/*
+	 * Issue the wakeup after dropping pool->lock to shorten the
+	 * locked region on this hot enqueue path.  kick_pool_pick() did all
+	 * of the work that required the lock (worker selection and
+	 * wake_cpu setup); the wake_up_process() itself only needs to
+	 * take the target rq->lock.
+	 */
+	if (wake_p)
+		wake_up_process(wake_p);
 	rcu_read_unlock();
 }
 
@@ -3243,6 +3253,7 @@ __acquires(&pool->lock)
 {
 	struct pool_workqueue *pwq = get_work_pwq(work);
 	struct worker_pool *pool = worker->pool;
+	struct task_struct *wake_p;
 	unsigned long work_data;
 	int lockdep_start_depth, rcu_start_depth;
 	bool bh_draining = pool->flags & POOL_BH_DRAINING;
@@ -3296,8 +3307,11 @@ __acquires(&pool->lock)
 	 * since nr_running would always be >= 1 at this point. This is used to
 	 * chain execution of the pending work items for WORKER_NOT_RUNNING
 	 * workers such as the UNBOUND and CPU_INTENSIVE ones.
+	 *
+	 * Select the worker to wake while holding pool->lock, but defer the
+	 * actual wake_up_process() until after the lock is dropped below.
 	 */
-	kick_pool(pool);
+	wake_p = kick_pool_pick(pool);
 
 	/*
 	 * Record the last pool and clear PENDING which should be the last
@@ -3310,6 +3324,9 @@ __acquires(&pool->lock)
 	pwq->stats[PWQ_STAT_STARTED]++;
 	raw_spin_unlock_irq(&pool->lock);
 
+	if (wake_p)
+		wake_up_process(wake_p);
+
 	rcu_start_depth = rcu_preempt_depth();
 	lockdep_start_depth = lockdep_depth(current);
 	/* see drain_dead_softirq_workfn() */

-- 
2.51.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths
  2026-05-26 18:08 ` [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths Breno Leitao
@ 2026-05-26 21:23   ` Hillf Danton
  2026-05-27  9:48     ` Breno Leitao
  2026-05-27 15:22   ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 11+ messages in thread
From: Hillf Danton @ 2026-05-26 21:23 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Tejun Heo, Lai Jiangshan, linux-kernel, marco.crivellari,
	frederic, bigeasy

On Tue, 26 May 2026 14:08:06 -0400 Breno Leitao wrote:
> @@ -2423,6 +2424,15 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
>  
>  out:
>  	raw_spin_unlock(&pool->lock);
> +	/*
> +	 * Issue the wakeup after dropping pool->lock to shorten the
> +	 * locked region on this hot enqueue path.  kick_pool_pick() did all
> +	 * of the work that required the lock (worker selection and
> +	 * wake_cpu setup); the wake_up_process() itself only needs to
> +	 * take the target rq->lock.
> +	 */
> +	if (wake_p)
> +		wake_up_process(wake_p);
>  	rcu_read_unlock();
>  }
>
I suspect this works without the pool lock held, go and see workers culled
in idle_cull_fn() for example.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths
  2026-05-26 21:23   ` Hillf Danton
@ 2026-05-27  9:48     ` Breno Leitao
  2026-05-27 14:51       ` Breno Leitao
  0 siblings, 1 reply; 11+ messages in thread
From: Breno Leitao @ 2026-05-27  9:48 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Tejun Heo, Lai Jiangshan, linux-kernel, marco.crivellari,
	frederic, bigeasy

On Wed, May 27, 2026 at 05:23:40AM +0000, Hillf Danton wrote:
> On Tue, 26 May 2026 14:08:06 -0400 Breno Leitao wrote:
> > @@ -2423,6 +2424,15 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
> >  
> >  out:
> >  	raw_spin_unlock(&pool->lock);
> > +	/*
> > +	 * Issue the wakeup after dropping pool->lock to shorten the
> > +	 * locked region on this hot enqueue path.  kick_pool_pick() did all
> > +	 * of the work that required the lock (worker selection and
> > +	 * wake_cpu setup); the wake_up_process() itself only needs to
> > +	 * take the target rq->lock.
> > +	 */
> > +	if (wake_p)
> > +		wake_up_process(wake_p);
> >  	rcu_read_unlock();
> >  }
> >
> I suspect this works without the pool lock held, go and see workers culled
> in idle_cull_fn() for example.

Ack, you're right that wake_up_process() itself works without pool->lock
held — idle_cull_fn() is a good precedent. But sashiko's point made me
look again, and I think there's a latent bug that should be fixed before
(or as part of) this patch.

When kick_pool() picks a worker and wakes it, the worker is flipped to
TASK_RUNNING by the scheduler, but from the workqueue's point of view it
is still WORKER_IDLE and still on pool->idle_list. The transition off
the idle list only happens later, when >the woken kthread actually
schedules in, reacquires pool->lock, and runs worker_leave_idle().

That matters because idle_cull_fn() never looks at the task's scheduler
state. It doesn't read task->__state or anything equivalent — its "is
this worker idle?" check is purely "is it on pool->idle_list, and is
last_active + IDLE_WORKER_TIMEOUT in the past?". So a worker that has
been woken but hasn't yet run worker_leave_idle() is, as far as cull is
concerned, a perfectly valid victim.

   The race is then:
   CPU A (kicker)                      CPU B (cull)
   ----------------                    ----------------
   raw_spin_lock(&pool->lock)
   pick worker W (idle_list head)
   wake_up_process(W->task)
   raw_spin_unlock(&pool->lock)
                                       raw_spin_lock(&pool->lock)
                                       too_many_workers() == true
                                       W is list_last_entry(idle_list)
                                       set_worker_dying(W, cull_list)
                                       raw_spin_unlock(&pool->lock)
                                       reap_dying_workers()
                                       kthread_stop_put(W->task)
   (W finally schedules in)
   worker_thread():
     raw_spin_lock(&pool->lock)
     WORKER_DIE set -> return 0

W exits without ever running worker_leave_idle() or touching
pool->worklist, and the work item we just enqueued is stranded until the
next pool event kicks somebody else.

This window is narrow in mainline today (bounded by scheduler latency
between the kicker dropping pool->lock and the wakee reacquiring it,
plus the 5 s IDLE_WORKER_TIMEOUT precondition), but my patch widens it
considerably — process_one_work()'s deferred-wake site re-enables IRQs
and is fully preemptible across the gap.

I am still unclear how to fix it, but I am thinking about it.

--
pw-bot: cr

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths
  2026-05-27  9:48     ` Breno Leitao
@ 2026-05-27 14:51       ` Breno Leitao
  2026-05-27 15:35         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 11+ messages in thread
From: Breno Leitao @ 2026-05-27 14:51 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Tejun Heo, Lai Jiangshan, linux-kernel, marco.crivellari,
	frederic, bigeasy

On Wed, May 27, 2026 at 10:48:12AM +0100, Breno Leitao wrote:
> I am still unclear how to fix it, but I am thinking about it.

I would say the best approach is to take the worker from the idle list
before waking it up.

	worker_leave_idle(worker);
        wake_up_process(p);

Not later when the worker is scheduled (worker_thread()).


commit f0a524a156b7563b9b37242aa42c822f7a8256e2
Author: Breno Leitao <leitao@debian.org>
Date:   Wed May 27 07:39:37 2026 -0700

    workqueue: claim picked idle worker under pool->lock in kick_pool()
    
    kick_pool() picks an idle worker via first_idle_worker(), optionally
    adjusts the wake_cpu and repatriates it, then wakes it via
    wake_up_process() -- but it does not remove the picked worker from
    pool->idle_list and does not clear WORKER_IDLE.  Those transitions
    happen later in worker_thread:woke_up: when the woken kthread
    reacquires pool->lock and calls worker_leave_idle().
    
    idle_cull_fn() walks pool->idle_list from the tail and calls
    set_worker_dying() on entries whose last_active + IDLE_WORKER_TIMEOUT
    has elapsed.  set_worker_dying() only checks WORKER_IDLE; it does
    not inspect task->__state.  A worker that has been kicked but has
    not yet rerun worker_leave_idle() therefore remains a valid cull
    victim.  If cull races in between the kick and the wakee reacquiring
    pool->lock, the worker gets WORKER_DIE set and is reaped by
    kthread_stop_put().  On wakeup it sees WORKER_DIE in worker_thread()
    and returns 0 without consuming pool->worklist, stranding the
    just-enqueued work item until the next pool event kicks somebody
    else.
    
        CPU A (kicker)                      CPU B (cull)
        ----------------                    ----------------
        raw_spin_lock(&pool->lock)
        pick worker W (idle_list head)
        wake_up_process(W->task)
        raw_spin_unlock(&pool->lock)
                                            raw_spin_lock(&pool->lock)
                                            too_many_workers() == true
                                            W is list_last_entry(idle_list)
                                            set_worker_dying(W, cull_list)
                                            raw_spin_unlock(&pool->lock)
                                            reap_dying_workers()
                                            kthread_stop_put(W->task)
        (W finally schedules in)
        worker_thread():
          raw_spin_lock(&pool->lock)
          WORKER_DIE set -> return 0
    
    Today the window is bounded by scheduler latency between the kicker
    dropping pool->lock and the wakee reacquiring it, plus the 5 minute
    IDLE_WORKER_TIMEOUT precondition.  Deferring wake_up_process()
    outside pool->lock on the hot enqueue / process_one_work() paths
    widens it considerably (IRQs re-enabled, fully preemptible).
    
    Close the window by claiming the picked worker under pool->lock:
    call worker_leave_idle() in kick_pool() before wake_up_process().
    That removes the worker from pool->idle_list and clears WORKER_IDLE
    while we still hold the lock, so set_worker_dying()'s WORKER_IDLE
    precondition is no longer satisfied and idle_cull_fn()'s tail walk
    cannot see the worker.  Gate the existing worker_leave_idle() call
    in worker_thread:woke_up: behind WORKER_IDLE, so first-time wakeups
    from create_worker() (which do not go through kick_pool()) still
    transition correctly.
    
    Signed-off-by: Breno Leitao <leitao@debian.org>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 8df671066dd1..321df19f9fd8 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1310,6 +1310,18 @@ static bool kick_pool(struct worker_pool *pool)
 		}
 	}
 #endif
+	/*
+	 * Claim @worker under pool->lock so it cannot race idle_cull_fn():
+	 * once @worker is off pool->idle_list and no longer WORKER_IDLE,
+	 * set_worker_dying() will skip it and the cull walk cannot reach
+	 * it.  Otherwise, the woken worker remains WORKER_IDLE on
+	 * pool->idle_list until it actually schedules in and runs
+	 * worker_leave_idle() in worker_thread:woke_up:, leaving a window
+	 * where a concurrent idle_cull_fn() can flag it WORKER_DIE and
+	 * kthread_stop_put() it before it consumes pool->worklist,
+	 * stranding the just-enqueued work.
+	 */
+	worker_leave_idle(worker);
 	wake_up_process(p);
 	return true;
 }
@@ -3447,7 +3459,13 @@ static int worker_thread(void *__worker)
 		return 0;
 	}
 
-	worker_leave_idle(worker);
+	/*
+	 * Kicked workers have already been removed from pool->idle_list
+	 * by kick_pool(); only first-time wakeups (via create_worker())
+	 * still arrive with WORKER_IDLE set.
+	 */
+	if (worker->flags & WORKER_IDLE)
+		worker_leave_idle(worker);
 recheck:
 	/* no more worker necessary? */
 	if (!need_more_worker(pool))

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths
  2026-05-27 14:51       ` Breno Leitao
@ 2026-05-27 15:35         ` Sebastian Andrzej Siewior
  2026-05-28 14:35           ` Breno Leitao
  2026-06-01 17:26           ` Breno Leitao
  0 siblings, 2 replies; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-05-27 15:35 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Hillf Danton, Tejun Heo, Lai Jiangshan, linux-kernel,
	marco.crivellari, frederic

On 2026-05-27 07:51:17 [-0700], Breno Leitao wrote:
> @@ -3447,7 +3459,13 @@ static int worker_thread(void *__worker)
>  		return 0;
>  	}
>  
> -	worker_leave_idle(worker);
> +	/*
> +	 * Kicked workers have already been removed from pool->idle_list
> +	 * by kick_pool(); only first-time wakeups (via create_worker())
> +	 * still arrive with WORKER_IDLE set.
> +	 */
> +	if (worker->flags & WORKER_IDLE)
> +		worker_leave_idle(worker);

Couldn't create_worker() be aligned here not set the idle flag and wake
the thread a few lines later? Then we wouldn't have to conditionally
clear the idle flag here (which sort of NULL renders the flag check in
worker_leave_idle()).

>  recheck:
>  	/* no more worker necessary? */
>  	if (!need_more_worker(pool))

Sebastian

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths
  2026-05-27 15:35         ` Sebastian Andrzej Siewior
@ 2026-05-28 14:35           ` Breno Leitao
  2026-06-01 17:26           ` Breno Leitao
  1 sibling, 0 replies; 11+ messages in thread
From: Breno Leitao @ 2026-05-28 14:35 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Hillf Danton, Tejun Heo, Lai Jiangshan, linux-kernel,
	marco.crivellari, frederic

On Wed, May 27, 2026 at 05:35:00PM +0000, Sebastian Andrzej Siewior wrote:
> On 2026-05-27 07:51:17 [-0700], Breno Leitao wrote:
> > @@ -3447,7 +3459,13 @@ static int worker_thread(void *__worker)
> >  		return 0;
> >  	}
> >  
> > -	worker_leave_idle(worker);
> > +	/*
> > +	 * Kicked workers have already been removed from pool->idle_list
> > +	 * by kick_pool(); only first-time wakeups (via create_worker())
> > +	 * still arrive with WORKER_IDLE set.
> > +	 */
> > +	if (worker->flags & WORKER_IDLE)
> > +		worker_leave_idle(worker);
> 
> Couldn't create_worker() be aligned here not set the idle flag and wake
> the thread a few lines later? Then we wouldn't have to conditionally
> clear the idle flag here (which sort of NULL renders the flag check in
> worker_leave_idle()).

Agreed. If create_worker() defers worker_enter_idle() until just before
wake_up_process(), we can drop this enter/leave pair entirely and avoid
the awkward conditional worker_leave_idle()

What about something like:

commit 0e91d33f8fe8e86c9a2be40d7a0163f68300ea1f
Author: Breno Leitao <leitao@debian.org>
Date:   Thu May 28 09:58:28 2026 -0400

    workqueue: leave idle under pool->lock before waking the worker
    
    A woken worker is woken from two different shapes today:
    
      - kick_pool_pick() picks a worker off pool->idle_list (LIFO head),
        optionally adjusts wake_cpu under pool->lock and returns the
        task to wake.  The worker is left WORKER_IDLE and on
        pool->idle_list at the point wake_up_process() runs.
    
      - create_worker() calls worker_enter_idle() to put a brand-new
        worker on pool->idle_list, then wakes it for the first time.
    
    In both cases the woken worker eventually reaches worker_thread:woke_up:,
    reacquires pool->lock and unconditionally calls worker_leave_idle() to
    flip its own WORKER_IDLE / pool->idle_list / pool->nr_idle state.  That
    is bookkeeping the waker could just as well have done under the same
    pool->lock acquisition it already holds, and it muddles the invariant
    for any concurrent observer: a runnable worker still observably looks
    idle until it actually schedules in.
    
    Move the worker_leave_idle() to the waker side, while pool->lock is
    still held:
    
      - kick_pool_pick() calls worker_leave_idle(worker) right before
        returning the task to wake.  Both kick_pool() and the deferred-
        wake callers in __queue_work() / process_one_work() inherit this.
    
      - create_worker() pairs its worker_enter_idle() with a matching
        worker_leave_idle() right before wake_up_process(), under the
        same pool->lock section.  nr_idle accounting is unchanged from
        the outside (enter and leave happen back-to-back under the
        lock); the only visible difference is that the new worker is no
        longer on pool->idle_list at the point it is woken.
    
      - worker_thread:woke_up: drops its worker_leave_idle() call: by the
        invariant above, a worker reaching woke_up: is always already
        !WORKER_IDLE and off pool->idle_list.
    
    The resulting invariant is uniform across all wakeup paths:
    
      a woken worker is !WORKER_IDLE and off pool->idle_list.
    
    No functional change intended.  This also shrinks the locked region
    at worker_thread:woke_up: (one fewer field update while holding
    pool->lock).
    
    Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Breno Leitao <leitao@debian.org>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 8df671066dd1..e66dd507a841 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1310,6 +1310,8 @@ static bool kick_pool(struct worker_pool *pool)
 		}
 	}
 #endif
+	/* Leave idle under pool->lock; worker_thread:woke_up: relies on this. */
+	worker_leave_idle(worker);
 	wake_up_process(p);
 	return true;
 }
@@ -2881,8 +2883,11 @@ static struct worker *create_worker(struct worker_pool *pool)
 	 * check if not woken up soon. As kick_pool() is noop if @pool is empty,
 	 * wake it up explicitly.
 	 */
-	if (worker->task)
+	if (worker->task) {
+		/* match kick_pool_pick(): leave idle before waking */
+		worker_leave_idle(worker);
 		wake_up_process(worker->task);
+	}
 
 	raw_spin_unlock_irq(&pool->lock);
 
@@ -3447,7 +3452,7 @@ static int worker_thread(void *__worker)
 		return 0;
 	}
 
-	worker_leave_idle(worker);
+	/* wakers already called worker_leave_idle() under pool->lock */
 recheck:
 	/* no more worker necessary? */
 	if (!need_more_worker(pool))

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths
  2026-05-27 15:35         ` Sebastian Andrzej Siewior
  2026-05-28 14:35           ` Breno Leitao
@ 2026-06-01 17:26           ` Breno Leitao
  1 sibling, 0 replies; 11+ messages in thread
From: Breno Leitao @ 2026-06-01 17:26 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Hillf Danton, Tejun Heo, Lai Jiangshan, linux-kernel,
	marco.crivellari, frederic

On Wed, May 27, 2026 at 05:35:00PM +0200, Sebastian Andrzej Siewior wrote:
> On 2026-05-27 07:51:17 [-0700], Breno Leitao wrote:
> > @@ -3447,7 +3459,13 @@ static int worker_thread(void *__worker)
> >  		return 0;
> >  	}
> >  
> > -	worker_leave_idle(worker);
> > +	/*
> > +	 * Kicked workers have already been removed from pool->idle_list
> > +	 * by kick_pool(); only first-time wakeups (via create_worker())
> > +	 * still arrive with WORKER_IDLE set.
> > +	 */
> > +	if (worker->flags & WORKER_IDLE)
> > +		worker_leave_idle(worker);
> 
> Couldn't create_worker() be aligned here not set the idle flag and wake
> the thread a few lines later? Then we wouldn't have to conditionally
> clear the idle flag here (which sort of NULL renders the flag check in
> worker_leave_idle()).

I tried exactly that and it regresses worker creation, so I'd rather
keep create_worker() as-is and leave the check in woke_up: conditional.

create_worker() deliberately returns with the new worker still on
pool->idle_list (counted in pool->nr_idle), and maybe_create_worker()
depends on that.  After a successful create_worker() it re-checks:

      need_to_create_worker() = need_more_worker() && !may_start_working()

and may_start_working() is just pool->nr_idle.  That nr_idle is the
signal "I already created a worker that will pick up the pending work,
stop creating".

If create_worker() leaves the worker !WORKER_IDLE before the wakeup (or
never enters idle), it returns with nr_idle unchanged.  A fresh worker
starts WORKER_PREP, so it doesn't bump nr_running until it actually
runs.  So until the woken kthread schedules in, the manager keeps
seeing:

      need_more_worker()  == true   (worklist non-empty, nr_running 0)
      may_start_working() == false  (nr_idle 0)

and loops/goto restart, creating extra workers until one of the woken
ones runs.  Under scheduler latency that's a burst of surplus kworkers
(eventually culled).

So I kept create_worker() untouched and only claim the worker (off
idle_list) in kick_pool(), with the conditional worker_leave_idle() in
woke_up: covering the create_worker() path.

--breno

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths
  2026-05-26 18:08 ` [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths Breno Leitao
  2026-05-26 21:23   ` Hillf Danton
@ 2026-05-27 15:22   ` Sebastian Andrzej Siewior
  2026-05-28 13:41     ` Breno Leitao
  1 sibling, 1 reply; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-05-27 15:22 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Tejun Heo, Lai Jiangshan, linux-kernel, marco.crivellari,
	frederic, kernel-team

On 2026-05-26 14:08:06 [-0400], Breno Leitao wrote:
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -2423,6 +2424,15 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
>  
>  out:
>  	raw_spin_unlock(&pool->lock);
> +	/*
> +	 * Issue the wakeup after dropping pool->lock to shorten the
> +	 * locked region on this hot enqueue path.  kick_pool_pick() did all
> +	 * of the work that required the lock (worker selection and
> +	 * wake_cpu setup); the wake_up_process() itself only needs to
> +	 * take the target rq->lock.
> +	 */
> +	if (wake_p)
> +		wake_up_process(wake_p);

What about using wake_q_add() to grab a worker(s) and then wake_up_q()
to wake the worker outside of the lock?

>  	rcu_read_unlock();
>  }
>  

Sebastian

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths
  2026-05-27 15:22   ` Sebastian Andrzej Siewior
@ 2026-05-28 13:41     ` Breno Leitao
  0 siblings, 0 replies; 11+ messages in thread
From: Breno Leitao @ 2026-05-28 13:41 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Tejun Heo, Lai Jiangshan, linux-kernel, marco.crivellari,
	frederic, kernel-team

Hello Sebastian,

On Wed, May 27, 2026 at 05:22:05PM +0000, Sebastian Andrzej Siewior wrote:
> On 2026-05-26 14:08:06 [-0400], Breno Leitao wrote:
> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -2423,6 +2424,15 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
> >  
> >  out:
> >  	raw_spin_unlock(&pool->lock);
> > +	/*
> > +	 * Issue the wakeup after dropping pool->lock to shorten the
> > +	 * locked region on this hot enqueue path.  kick_pool_pick() did all
> > +	 * of the work that required the lock (worker selection and
> > +	 * wake_cpu setup); the wake_up_process() itself only needs to
> > +	 * take the target rq->lock.
> > +	 */
> > +	if (wake_p)
> > +		wake_up_process(wake_p);
> 
> What about using wake_q_add() to grab a worker(s) and then wake_up_q()
> to wake the worker outside of the lock?

Thanks, this is a good recommendation.

Looking at the API, I understand the intended pattern is to call
wake_q_add() while still holding pool->lock and then wake_up_q() after
dropping it -- wake_q_add() only does a cmpxchg on task->wake_q.next
plus get_task_struct(), so it's safe under the spinlock, while
wake_up_q() is the part that takes rq->lock and is what we want
outside.

Will switch to that in v2.
--breno

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2026-06-01 17:27 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-26 18:08 [PATCH 0/2] workqueue: Shrink the lock time Breno Leitao
2026-05-26 18:08 ` [PATCH 1/2] workqueue: split kick_pool() into kick_pool_pick() + wake_up_process() Breno Leitao
2026-05-26 18:08 ` [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths Breno Leitao
2026-05-26 21:23   ` Hillf Danton
2026-05-27  9:48     ` Breno Leitao
2026-05-27 14:51       ` Breno Leitao
2026-05-27 15:35         ` Sebastian Andrzej Siewior
2026-05-28 14:35           ` Breno Leitao
2026-06-01 17:26           ` Breno Leitao
2026-05-27 15:22   ` Sebastian Andrzej Siewior
2026-05-28 13:41     ` Breno Leitao

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.