* [PATCH v3 1/3] workqueue: split kick_pool() into kick_pool_pick() + wake_up_q()
2026-06-16 13:33 [PATCH v3 0/3] workqueue: Shrink the lock time Breno Leitao
@ 2026-06-16 13:33 ` Breno Leitao
2026-06-16 13:33 ` [PATCH v3 2/3] workqueue: defer the worker wakeup outside pool->lock in __queue_work() Breno Leitao
` (2 subsequent siblings)
3 siblings, 0 replies; 7+ messages in thread
From: Breno Leitao @ 2026-06-16 13:33 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan
Cc: linux-kernel, marco.crivellari, frederic, bigeasy, Hillf Danton,
Breno Leitao, kernel-team, kmagar, psuriset
Factor the worker selection out of kick_pool() into kick_pool_pick(),
which picks and claims the worker under pool->lock but, instead of waking
it, queues it on a caller-provided wake_q via wake_q_add(). The caller
issues the wakeup later with wake_up_q(). wake_q_add() is safe under the
lock (cmpxchg + get_task_struct()); only wake_up_q() takes rq->lock.
kick_pool() becomes a thin wrapper using a local wake_q, so all existing
callers keep waking under pool->lock.
Pure refactor, no functional change.
Signed-off-by: Breno Leitao <leitao@debian.org>
---
kernel/workqueue.c | 32 +++++++++++++++++++++++++++-----
1 file changed, 27 insertions(+), 5 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 78f25afb4a9d6..fd3b5bc78df9e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -52,6 +52,7 @@
#include <linux/uaccess.h>
#include <linux/sched/isolation.h>
#include <linux/sched/debug.h>
+#include <linux/sched/wake_q.h>
#include <linux/nmi.h>
#include <linux/kvm_para.h>
#include <linux/delay.h>
@@ -1258,13 +1259,17 @@ static void kick_bh_pool(struct worker_pool *pool)
}
/**
- * kick_pool - wake up an idle worker if necessary
+ * kick_pool_pick - select and claim an idle worker, deferring the wakeup
* @pool: pool to kick
+ * @wakeq: wake_q to queue the selected worker on
*
- * @pool may have pending work items. Wake up worker if necessary. Returns
- * whether a worker was woken up.
+ * Like kick_pool() but queues the picked worker on @wakeq (wake_q_add())
+ * instead of waking it, so the caller can wake_up_q(@wakeq) after dropping
+ * pool->lock. Returns whether a worker was selected.
+ *
+ * Must be called with @pool->lock held.
*/
-static bool kick_pool(struct worker_pool *pool)
+static bool kick_pool_pick(struct worker_pool *pool, struct wake_q_head *wakeq)
{
struct worker *worker = first_idle_worker(pool);
struct task_struct *p;
@@ -1310,10 +1315,27 @@ static bool kick_pool(struct worker_pool *pool)
}
}
#endif
- wake_up_process(p);
+ wake_q_add(wakeq, p);
return true;
}
+/**
+ * kick_pool - wake up an idle worker if necessary
+ * @pool: pool to kick
+ *
+ * @pool may have pending work items. Wake up worker if necessary. Returns
+ * whether a worker was woken up.
+ */
+static bool kick_pool(struct worker_pool *pool)
+{
+ DEFINE_WAKE_Q(wakeq);
+ bool kicked;
+
+ kicked = kick_pool_pick(pool, &wakeq);
+ wake_up_q(&wakeq);
+ return kicked;
+}
+
#ifdef CONFIG_WQ_CPU_INTENSIVE_REPORT
/*
--
2.53.0-Meta
^ permalink raw reply related [flat|nested] 7+ messages in thread* [PATCH v3 2/3] workqueue: defer the worker wakeup outside pool->lock in __queue_work()
2026-06-16 13:33 [PATCH v3 0/3] workqueue: Shrink the lock time Breno Leitao
2026-06-16 13:33 ` [PATCH v3 1/3] workqueue: split kick_pool() into kick_pool_pick() + wake_up_q() Breno Leitao
@ 2026-06-16 13:33 ` Breno Leitao
2026-06-24 8:47 ` Sebastian Andrzej Siewior
2026-06-16 13:33 ` [PATCH v3 3/3] workqueue: defer the worker wakeup outside pool->lock in process_one_work() Breno Leitao
2026-06-24 8:54 ` [PATCH v3 0/3] workqueue: Shrink the lock time Sebastian Andrzej Siewior
3 siblings, 1 reply; 7+ messages in thread
From: Breno Leitao @ 2026-06-16 13:33 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan
Cc: linux-kernel, marco.crivellari, frederic, bigeasy, Hillf Danton,
Breno Leitao, kernel-team, kmagar, psuriset
__queue_work() is the enqueue hot path: it inserts the work item and
calls kick_pool() while holding pool->lock. kick_pool() ends in a
wakeup, which takes the target task's rq->lock, so rq->lock nests under
pool->lock on every enqueue that wakes a worker on a contended unbound
pool.
Use kick_pool_pick() to select and claim the worker under pool->lock,
queue it on an on-stack wake_q, and issue the wakeup with wake_up_q()
right after dropping the lock via raw_spin_unlock_wake(). Worker
selection, wake_cpu setup and claiming the worker off pool->idle_list
still happen under the lock; only the rq->lock acquisition moves out.
Signed-off-by: Breno Leitao <leitao@debian.org>
---
kernel/workqueue.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index fd3b5bc78df9e..44ad3450ff77f 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2299,6 +2299,7 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
{
struct pool_workqueue *pwq;
struct worker_pool *last_pool, *pool;
+ DEFINE_WAKE_Q(wakeq);
unsigned int work_flags;
unsigned int req_cpu = cpu;
@@ -2421,14 +2422,14 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
trace_workqueue_activate_work(work);
insert_work(pwq, work, &pool->worklist, work_flags);
- kick_pool(pool);
+ kick_pool_pick(pool, &wakeq);
} else {
work_flags |= WORK_STRUCT_INACTIVE;
insert_work(pwq, work, &pwq->inactive_works, work_flags);
}
out:
- raw_spin_unlock(&pool->lock);
+ raw_spin_unlock_wake(&pool->lock, &wakeq);
rcu_read_unlock();
}
--
2.53.0-Meta
^ permalink raw reply related [flat|nested] 7+ messages in thread* Re: [PATCH v3 2/3] workqueue: defer the worker wakeup outside pool->lock in __queue_work()
2026-06-16 13:33 ` [PATCH v3 2/3] workqueue: defer the worker wakeup outside pool->lock in __queue_work() Breno Leitao
@ 2026-06-24 8:47 ` Sebastian Andrzej Siewior
2026-06-24 11:19 ` Breno Leitao
0 siblings, 1 reply; 7+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-06-24 8:47 UTC (permalink / raw)
To: Breno Leitao
Cc: Tejun Heo, Lai Jiangshan, linux-kernel, marco.crivellari,
frederic, Hillf Danton, kernel-team, kmagar, psuriset
On 2026-06-16 06:33:32 [-0700], Breno Leitao wrote:
> __queue_work() is the enqueue hot path: it inserts the work item and
> calls kick_pool() while holding pool->lock. kick_pool() ends in a
> wakeup, which takes the target task's rq->lock, so rq->lock nests under
> pool->lock on every enqueue that wakes a worker on a contended unbound
> pool.
>
> Use kick_pool_pick() to select and claim the worker under pool->lock,
> queue it on an on-stack wake_q, and issue the wakeup with wake_up_q()
> right after dropping the lock via raw_spin_unlock_wake(). Worker
> selection, wake_cpu setup and claiming the worker off pool->idle_list
> still happen under the lock; only the rq->lock acquisition moves out.
>
> Signed-off-by: Breno Leitao <leitao@debian.org>
…
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
…
> out:
> - raw_spin_unlock(&pool->lock);
> + raw_spin_unlock_wake(&pool->lock, &wakeq);
> rcu_read_unlock();
> }
It is not wrong but I am not sure if this is really needed here. The
pattern
preempt_disable();
raw_spin_unlock();
wake_up_q()
preempt_enable();
is used to prevent task preemption after the unlock operation. The
futex/ locking code needs to wake a task but before the unlock operation
the task priority might have been lowered as result of dropping the
lock. This means it might not be the task with the highest priority in
the system and the task with highest is not yet active and we schedule a
task in middle in the instead.
To form this easier: say we have Task A prio=1, B prio=2 and C prio=3
with higher number higher priority.
A owns a lock and is preempted by B. C gets on the CPU preempts B. C
wants A's lock so it passes its priority to A (PI-boost) and goes idle.
A gets on the CPU and unlocks.
Now: As part of the unlock operation (before the raw_spin_unlock()) A
goes back to its initial priority and C is not yet woken up meaning B is
the task with highest priority and the held raw_spinlock_t is the only
thing preventing scheduling. So we disable preemption, unlock and then
perform the wake-up so C becomes the next candidate (as it should be) to
occupy the CPU.
Otherwise it would be B which means per definition a task with lower
priority runs before a task with higher priority.
I don't think workqueue has this requirements here. Worst case something
else gets on the CPU and worker wakeup is delayed until the task is
scheduled again. It could be fine since the preemption could happen
before queue-work().
It probably does not lead to a huge performance
regression but who knows.
Your goal was to lower the contention on the pool lock during the wake
up which you achieved. The __queue_work() remains still not preemptible
until after the wake up which might be fine.
Sebastian
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v3 2/3] workqueue: defer the worker wakeup outside pool->lock in __queue_work()
2026-06-24 8:47 ` Sebastian Andrzej Siewior
@ 2026-06-24 11:19 ` Breno Leitao
0 siblings, 0 replies; 7+ messages in thread
From: Breno Leitao @ 2026-06-24 11:19 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Tejun Heo, Lai Jiangshan, linux-kernel, marco.crivellari,
frederic, Hillf Danton, kernel-team, kmagar, psuriset
Hello Sebastian,
On Wed, Jun 24, 2026 at 10:47:56AM +0200, Sebastian Andrzej Siewior wrote:
> On 2026-06-16 06:33:32 [-0700], Breno Leitao wrote:
> > __queue_work() is the enqueue hot path: it inserts the work item and
> > calls kick_pool() while holding pool->lock. kick_pool() ends in a
> > wakeup, which takes the target task's rq->lock, so rq->lock nests under
> > pool->lock on every enqueue that wakes a worker on a contended unbound
> > pool.
> >
> > Use kick_pool_pick() to select and claim the worker under pool->lock,
> > queue it on an on-stack wake_q, and issue the wakeup with wake_up_q()
> > right after dropping the lock via raw_spin_unlock_wake(). Worker
> > selection, wake_cpu setup and claiming the worker off pool->idle_list
> > still happen under the lock; only the rq->lock acquisition moves out.
> >
> > Signed-off-by: Breno Leitao <leitao@debian.org>
> …
> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> …
> > out:
> > - raw_spin_unlock(&pool->lock);
> > + raw_spin_unlock_wake(&pool->lock, &wakeq);
> > rcu_read_unlock();
> > }
>
> It is not wrong but I am not sure if this is really needed here. The
> pattern
> preempt_disable();
> raw_spin_unlock();
> wake_up_q()
> preempt_enable();
>
> is used to prevent task preemption after the unlock operation. The
> futex/ locking code needs to wake a task but before the unlock operation
> the task priority might have been lowered as result of dropping the
> lock. This means it might not be the task with the highest priority in
> the system and the task with highest is not yet active and we schedule a
> task in middle in the instead.
> To form this easier: say we have Task A prio=1, B prio=2 and C prio=3
> with higher number higher priority.
> A owns a lock and is preempted by B. C gets on the CPU preempts B. C
> wants A's lock so it passes its priority to A (PI-boost) and goes idle.
> A gets on the CPU and unlocks.
> Now: As part of the unlock operation (before the raw_spin_unlock()) A
> goes back to its initial priority and C is not yet woken up meaning B is
> the task with highest priority and the held raw_spinlock_t is the only
> thing preventing scheduling. So we disable preemption, unlock and then
> perform the wake-up so C becomes the next candidate (as it should be) to
> occupy the CPU.
> Otherwise it would be B which means per definition a task with lower
> priority runs before a task with higher priority.
>
> I don't think workqueue has this requirements here. Worst case something
> else gets on the CPU and worker wakeup is delayed until the task is
> scheduled again. It could be fine since the preemption could happen
> before queue-work().
> It probably does not lead to a huge performance
> regression but who knows.
>
> Your goal was to lower the contention on the pool lock during the wake
> up which you achieved. The __queue_work() remains still not preemptible
> until after the wake up which might be fine.
You're absolutely right — the preemption dance isn't needed here.
I used raw_spin_unlock_wake() because the helper seemed to match what I
was doing, but this preempt_disable/enable pattern (while cheap) is
unnecessary for this use case.
I'll update the series to use the standard approach:
raw_spin_unlock_irq(&pool->lock);
wake_up_q(&wakeq);
Thanks,
--breno
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH v3 3/3] workqueue: defer the worker wakeup outside pool->lock in process_one_work()
2026-06-16 13:33 [PATCH v3 0/3] workqueue: Shrink the lock time Breno Leitao
2026-06-16 13:33 ` [PATCH v3 1/3] workqueue: split kick_pool() into kick_pool_pick() + wake_up_q() Breno Leitao
2026-06-16 13:33 ` [PATCH v3 2/3] workqueue: defer the worker wakeup outside pool->lock in __queue_work() Breno Leitao
@ 2026-06-16 13:33 ` Breno Leitao
2026-06-24 8:54 ` [PATCH v3 0/3] workqueue: Shrink the lock time Sebastian Andrzej Siewior
3 siblings, 0 replies; 7+ messages in thread
From: Breno Leitao @ 2026-06-16 13:33 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan
Cc: linux-kernel, marco.crivellari, frederic, bigeasy, Hillf Danton,
Breno Leitao, kernel-team, kmagar, psuriset
process_one_work() kicks the pool to chain execution of the remaining
work items on WORKER_NOT_RUNNING pools (the UNBOUND and CPU_INTENSIVE
ones), calling kick_pool() while holding pool->lock. As in the enqueue
path, the wakeup pulls the target rq->lock in under pool->lock.
Use kick_pool_pick() to select and claim the worker under pool->lock and
issue the wakeup with wake_up_q() after the lock is dropped via
raw_spin_unlock_irq_wake().
Signed-off-by: Breno Leitao <leitao@debian.org>
---
kernel/workqueue.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 44ad3450ff77f..9fb57635968e1 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3250,6 +3250,7 @@ __acquires(&pool->lock)
{
struct pool_workqueue *pwq = get_work_pwq(work);
struct worker_pool *pool = worker->pool;
+ DEFINE_WAKE_Q(wakeq);
unsigned long work_data;
int lockdep_start_depth, rcu_start_depth;
bool bh_draining = pool->flags & POOL_BH_DRAINING;
@@ -3304,7 +3305,7 @@ __acquires(&pool->lock)
* chain execution of the pending work items for WORKER_NOT_RUNNING
* workers such as the UNBOUND and CPU_INTENSIVE ones.
*/
- kick_pool(pool);
+ kick_pool_pick(pool, &wakeq);
/*
* Record the last pool and clear PENDING which should be the last
@@ -3315,7 +3316,7 @@ __acquires(&pool->lock)
set_work_pool_and_clear_pending(work, pool->id, pool_offq_flags(pool));
pwq->stats[PWQ_STAT_STARTED]++;
- raw_spin_unlock_irq(&pool->lock);
+ raw_spin_unlock_irq_wake(&pool->lock, &wakeq);
rcu_start_depth = rcu_preempt_depth();
lockdep_start_depth = lockdep_depth(current);
--
2.53.0-Meta
^ permalink raw reply related [flat|nested] 7+ messages in thread* Re: [PATCH v3 0/3] workqueue: Shrink the lock time
2026-06-16 13:33 [PATCH v3 0/3] workqueue: Shrink the lock time Breno Leitao
` (2 preceding siblings ...)
2026-06-16 13:33 ` [PATCH v3 3/3] workqueue: defer the worker wakeup outside pool->lock in process_one_work() Breno Leitao
@ 2026-06-24 8:54 ` Sebastian Andrzej Siewior
3 siblings, 0 replies; 7+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-06-24 8:54 UTC (permalink / raw)
To: Breno Leitao
Cc: Tejun Heo, Lai Jiangshan, linux-kernel, marco.crivellari,
frederic, Hillf Danton, kernel-team, kmagar, psuriset
On 2026-06-16 06:33:30 [-0700], Breno Leitao wrote:
> The goal of this patchset is to decrease the time spent under the
> workqueue pool->lock.
…
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Sebastian
^ permalink raw reply [flat|nested] 7+ messages in thread