All of lore.kernel.org
 help / color / mirror / Atom feed
From: Breno Leitao <leitao@debian.org>
To: Hillf Danton <hdanton@sina.com>
Cc: Tejun Heo <tj@kernel.org>, Lai Jiangshan <jiangshanlai@gmail.com>,
	 linux-kernel@vger.kernel.org, marco.crivellari@suse.com,
	frederic@kernel.org,  bigeasy@linutronix.de
Subject: Re: [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths
Date: Wed, 27 May 2026 07:51:17 -0700	[thread overview]
Message-ID: <ahcEcR6ApWr7E4ID@gmail.com> (raw)
In-Reply-To: <aha6ENW-K-4F4AHi@gmail.com>

On Wed, May 27, 2026 at 10:48:12AM +0100, Breno Leitao wrote:
> I am still unclear how to fix it, but I am thinking about it.

I would say the best approach is to take the worker from the idle list
before waking it up.

	worker_leave_idle(worker);
        wake_up_process(p);

Not later when the worker is scheduled (worker_thread()).


commit f0a524a156b7563b9b37242aa42c822f7a8256e2
Author: Breno Leitao <leitao@debian.org>
Date:   Wed May 27 07:39:37 2026 -0700

    workqueue: claim picked idle worker under pool->lock in kick_pool()
    
    kick_pool() picks an idle worker via first_idle_worker(), optionally
    adjusts the wake_cpu and repatriates it, then wakes it via
    wake_up_process() -- but it does not remove the picked worker from
    pool->idle_list and does not clear WORKER_IDLE.  Those transitions
    happen later in worker_thread:woke_up: when the woken kthread
    reacquires pool->lock and calls worker_leave_idle().
    
    idle_cull_fn() walks pool->idle_list from the tail and calls
    set_worker_dying() on entries whose last_active + IDLE_WORKER_TIMEOUT
    has elapsed.  set_worker_dying() only checks WORKER_IDLE; it does
    not inspect task->__state.  A worker that has been kicked but has
    not yet rerun worker_leave_idle() therefore remains a valid cull
    victim.  If cull races in between the kick and the wakee reacquiring
    pool->lock, the worker gets WORKER_DIE set and is reaped by
    kthread_stop_put().  On wakeup it sees WORKER_DIE in worker_thread()
    and returns 0 without consuming pool->worklist, stranding the
    just-enqueued work item until the next pool event kicks somebody
    else.
    
        CPU A (kicker)                      CPU B (cull)
        ----------------                    ----------------
        raw_spin_lock(&pool->lock)
        pick worker W (idle_list head)
        wake_up_process(W->task)
        raw_spin_unlock(&pool->lock)
                                            raw_spin_lock(&pool->lock)
                                            too_many_workers() == true
                                            W is list_last_entry(idle_list)
                                            set_worker_dying(W, cull_list)
                                            raw_spin_unlock(&pool->lock)
                                            reap_dying_workers()
                                            kthread_stop_put(W->task)
        (W finally schedules in)
        worker_thread():
          raw_spin_lock(&pool->lock)
          WORKER_DIE set -> return 0
    
    Today the window is bounded by scheduler latency between the kicker
    dropping pool->lock and the wakee reacquiring it, plus the 5 minute
    IDLE_WORKER_TIMEOUT precondition.  Deferring wake_up_process()
    outside pool->lock on the hot enqueue / process_one_work() paths
    widens it considerably (IRQs re-enabled, fully preemptible).
    
    Close the window by claiming the picked worker under pool->lock:
    call worker_leave_idle() in kick_pool() before wake_up_process().
    That removes the worker from pool->idle_list and clears WORKER_IDLE
    while we still hold the lock, so set_worker_dying()'s WORKER_IDLE
    precondition is no longer satisfied and idle_cull_fn()'s tail walk
    cannot see the worker.  Gate the existing worker_leave_idle() call
    in worker_thread:woke_up: behind WORKER_IDLE, so first-time wakeups
    from create_worker() (which do not go through kick_pool()) still
    transition correctly.
    
    Signed-off-by: Breno Leitao <leitao@debian.org>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 8df671066dd1..321df19f9fd8 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1310,6 +1310,18 @@ static bool kick_pool(struct worker_pool *pool)
 		}
 	}
 #endif
+	/*
+	 * Claim @worker under pool->lock so it cannot race idle_cull_fn():
+	 * once @worker is off pool->idle_list and no longer WORKER_IDLE,
+	 * set_worker_dying() will skip it and the cull walk cannot reach
+	 * it.  Otherwise, the woken worker remains WORKER_IDLE on
+	 * pool->idle_list until it actually schedules in and runs
+	 * worker_leave_idle() in worker_thread:woke_up:, leaving a window
+	 * where a concurrent idle_cull_fn() can flag it WORKER_DIE and
+	 * kthread_stop_put() it before it consumes pool->worklist,
+	 * stranding the just-enqueued work.
+	 */
+	worker_leave_idle(worker);
 	wake_up_process(p);
 	return true;
 }
@@ -3447,7 +3459,13 @@ static int worker_thread(void *__worker)
 		return 0;
 	}
 
-	worker_leave_idle(worker);
+	/*
+	 * Kicked workers have already been removed from pool->idle_list
+	 * by kick_pool(); only first-time wakeups (via create_worker())
+	 * still arrive with WORKER_IDLE set.
+	 */
+	if (worker->flags & WORKER_IDLE)
+		worker_leave_idle(worker);
 recheck:
 	/* no more worker necessary? */
 	if (!need_more_worker(pool))

  reply	other threads:[~2026-05-27 14:51 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-26 18:08 [PATCH 0/2] workqueue: Shrink the lock time Breno Leitao
2026-05-26 18:08 ` [PATCH 1/2] workqueue: split kick_pool() into kick_pool_pick() + wake_up_process() Breno Leitao
2026-05-26 18:08 ` [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths Breno Leitao
2026-05-26 21:23   ` Hillf Danton
2026-05-27  9:48     ` Breno Leitao
2026-05-27 14:51       ` Breno Leitao [this message]
2026-05-27 15:35         ` Sebastian Andrzej Siewior
2026-05-28 14:35           ` Breno Leitao
2026-06-01 17:26           ` Breno Leitao
2026-05-27 15:22   ` Sebastian Andrzej Siewior
2026-05-28 13:41     ` Breno Leitao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ahcEcR6ApWr7E4ID@gmail.com \
    --to=leitao@debian.org \
    --cc=bigeasy@linutronix.de \
    --cc=frederic@kernel.org \
    --cc=hdanton@sina.com \
    --cc=jiangshanlai@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=marco.crivellari@suse.com \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.