* [PATCH V3 0/7] workqueue: Factor the way to assign rescuer work
@ 2025-11-21 14:57 Lai Jiangshan
2025-11-21 14:57 ` [PATCH V3 1/7] workqueue: Factor out assign_rescuer_work() Lai Jiangshan
` (7 more replies)
0 siblings, 8 replies; 17+ messages in thread
From: Lai Jiangshan @ 2025-11-21 14:57 UTC (permalink / raw)
To: linux-kernel; +Cc: Tejun Heo, ying chen, Lai Jiangshan
From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Previously, the rescuer scanned for all matching work items at once and
processed them within a single rescuer thread, which could cause one
blocking work item to stall all others.
Make the rescuer process work items one-by-one instead of slurping all
matches in a single pass.
Break the rescuer loop after finding and processing the first matching
work item, then restart the search to pick up the next. This gives
other work items a chance to be processed by normal worker threads
instead of waiting in the rescuer's queue, and prevents a blocking work
item from stalling the rest once memory pressure is relieved.
Introduce a dummy cursor work item to avoid potentially O(N^2)
rescans of the work list. The marker records the resume position for
the next scan, eliminating redundant traversals.
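To illustrate the cursor mechanism, here is a standalone sketch in plain C. The `node`, `assign_one` and helper names are made up for this example; it models the kernel's list_head and the rescuer scan, not the actual workqueue code. A detached dummy node is re-inserted after each assigned item so the next search resumes there rather than rescanning from the head:

```c
#include <stddef.h>

/* Minimal doubly-linked list, standing in for the kernel's list_head. */
struct node {
	struct node *prev, *next;
	int pwq_id;		/* owner pwq of the item; unused for the cursor */
};

static void list_init(struct node *n)
{
	n->prev = n->next = n;
}

/* Insert @n before @pos (inserting before the head appends at the tail). */
static void list_insert_before(struct node *n, struct node *pos)
{
	n->prev = pos->prev;
	n->next = pos;
	pos->prev->next = n;
	pos->prev = n;
}

static void list_del_init(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);		/* detached nodes are self-linked */
}

static int detached(const struct node *n)
{
	return n->next == n;
}

/*
 * Find, remove and return the next item owned by @pwq_id, resuming from
 * @cursor when it is still on the list, otherwise from the head.  On
 * success the cursor is re-inserted at the resume position for the next
 * call; when nothing matches, the cursor stays off the list.
 */
static struct node *assign_one(struct node *head, struct node *cursor, int pwq_id)
{
	struct node *w = detached(cursor) ? head->next : cursor->next;

	if (!detached(cursor))
		list_del_init(cursor);

	for (; w != head; w = w->next) {
		if (w->pwq_id != pwq_id)
			continue;
		list_insert_before(cursor, w->next); /* remember resume point */
		list_del_init(w);
		return w;
	}
	return NULL;
}
```

With items of two pwqs interleaved on the list, successive calls for one pwq return its items in order, and each call picks up where the previous one left off instead of restarting from the head.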
Changes from v2:
Renames.
Change WARN_ON_ONCE() conditions in assign_work().
Remove the check whether the cursor is on the list in send_mayday(),
which has the benefit of allowing only a limited number of work items
to be processed per run while still letting the cursor stay in the
worklist between runs.
Also remove the insertion of the cursor in send_mayday() and keep all
the cursor handling code within assign[_rescuer]_work().
Split patches.
V2: https://lore.kernel.org/lkml/20251118093832.81920-1-jiangshanlai@gmail.com/
Changes from v1:
Insert the marker and use it purely as an iteration marker, as Tejun
requested.
It is hard to maintain its proper state, especially when keeping it
only in pool->worklist, which also requires an unlikely path in the
worker_thread() path.
Add an unlikely path in the worker_thread() path (in assign_work()).
Add other code in other places to make sure it is not treated like a
regular work item. I'm sure all the paths have been covered now, but I
still feel a bit nervous about it with respect to future changes.
Extract the code that assigns work in rescuer_thread() out into
assign_rescuer_work().
V1: https://lore.kernel.org/lkml/20251113163426.2950-1-jiangshanlai@gmail.com/
Lai Jiangshan (7):
workqueue: Factor out assign_rescuer_work()
workqueue: Only assign rescuer work when really needed
workqueue: Don't rely on wq->rescuer to stop rescuer
workqueue: Loop over in rescuer until all its work is done
workqueue: Process rescuer work items one-by-one using a cursor
workqueue: Limit number of processed works in rescuer per turn
workqueue: Process extra works in rescuer when there are no more to
rescue
kernel/workqueue.c | 146 ++++++++++++++++++++++++++++++++-------------
1 file changed, 103 insertions(+), 43 deletions(-)
--
2.19.1.6.gb485710b
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH V3 1/7] workqueue: Factor out assign_rescuer_work()
2025-11-21 14:57 [PATCH V3 0/7] workqueue: Factor the way to assign rescuer work Lai Jiangshan
@ 2025-11-21 14:57 ` Lai Jiangshan
2025-11-21 14:57 ` [PATCH V3 2/7] workqueue: Only assign rescuer work when really needed Lai Jiangshan
` (6 subsequent siblings)
7 siblings, 0 replies; 17+ messages in thread
From: Lai Jiangshan @ 2025-11-21 14:57 UTC (permalink / raw)
To: linux-kernel; +Cc: Tejun Heo, ying chen, Lai Jiangshan, Lai Jiangshan
From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Move the code that assigns work to the rescuer into assign_rescuer_work().
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
kernel/workqueue.c | 29 ++++++++++++++++++-----------
1 file changed, 18 insertions(+), 11 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 19bc7ea931d8..c60584a39fc9 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3439,6 +3439,23 @@ static int worker_thread(void *__worker)
goto woke_up;
}
+static bool assign_rescuer_work(struct pool_workqueue *pwq, struct worker *rescuer)
+{
+ struct worker_pool *pool = pwq->pool;
+ struct work_struct *work, *n;
+
+ /*
+ * Slurp in all works issued via this workqueue and
+ * process'em.
+ */
+ list_for_each_entry_safe(work, n, &pool->worklist, entry) {
+ if (get_work_pwq(work) == pwq && assign_work(work, rescuer, &n))
+ pwq->stats[PWQ_STAT_RESCUED]++;
+ }
+
+ return !list_empty(&rescuer->scheduled);
+}
+
/**
* rescuer_thread - the rescuer thread function
* @__rescuer: self
@@ -3493,7 +3510,6 @@ static int rescuer_thread(void *__rescuer)
struct pool_workqueue *pwq = list_first_entry(&wq->maydays,
struct pool_workqueue, mayday_node);
struct worker_pool *pool = pwq->pool;
- struct work_struct *work, *n;
__set_current_state(TASK_RUNNING);
list_del_init(&pwq->mayday_node);
@@ -3504,18 +3520,9 @@ static int rescuer_thread(void *__rescuer)
raw_spin_lock_irq(&pool->lock);
- /*
- * Slurp in all works issued via this workqueue and
- * process'em.
- */
WARN_ON_ONCE(!list_empty(&rescuer->scheduled));
- list_for_each_entry_safe(work, n, &pool->worklist, entry) {
- if (get_work_pwq(work) == pwq &&
- assign_work(work, rescuer, &n))
- pwq->stats[PWQ_STAT_RESCUED]++;
- }
- if (!list_empty(&rescuer->scheduled)) {
+ if (assign_rescuer_work(pwq, rescuer)) {
process_scheduled_works(rescuer);
/*
--
2.19.1.6.gb485710b
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH V3 2/7] workqueue: Only assign rescuer work when really needed
2025-11-21 14:57 [PATCH V3 0/7] workqueue: Factor the way to assign rescuer work Lai Jiangshan
2025-11-21 14:57 ` [PATCH V3 1/7] workqueue: Factor out assign_rescuer_work() Lai Jiangshan
@ 2025-11-21 14:57 ` Lai Jiangshan
2025-11-21 14:57 ` [PATCH V3 3/7] workqueue: Don't rely on wq->rescuer to stop rescuer Lai Jiangshan
` (5 subsequent siblings)
7 siblings, 0 replies; 17+ messages in thread
From: Lai Jiangshan @ 2025-11-21 14:57 UTC (permalink / raw)
To: linux-kernel; +Cc: Tejun Heo, ying chen, Lai Jiangshan, Lai Jiangshan
From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
If the pwq does not need rescue (normal workers have been created or
become available), the rescuer can immediately move on to other stalled
pwqs.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
kernel/workqueue.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c60584a39fc9..932581d93edb 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3444,6 +3444,10 @@ static bool assign_rescuer_work(struct pool_workqueue *pwq, struct worker *rescu
struct worker_pool *pool = pwq->pool;
struct work_struct *work, *n;
+ /* need rescue? */
+ if (!pwq->nr_active || !need_to_create_worker(pool))
+ return false;
+
/*
* Slurp in all works issued via this workqueue and
* process'em.
--
2.19.1.6.gb485710b
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH V3 3/7] workqueue: Don't rely on wq->rescuer to stop rescuer
2025-11-21 14:57 [PATCH V3 0/7] workqueue: Factor the way to assign rescuer work Lai Jiangshan
2025-11-21 14:57 ` [PATCH V3 1/7] workqueue: Factor out assign_rescuer_work() Lai Jiangshan
2025-11-21 14:57 ` [PATCH V3 2/7] workqueue: Only assign rescuer work when really needed Lai Jiangshan
@ 2025-11-21 14:57 ` Lai Jiangshan
2025-11-21 14:57 ` [PATCH V3 4/7] workqueue: Loop over in rescuer until all its work is done Lai Jiangshan
` (4 subsequent siblings)
7 siblings, 0 replies; 17+ messages in thread
From: Lai Jiangshan @ 2025-11-21 14:57 UTC (permalink / raw)
To: linux-kernel; +Cc: Tejun Heo, ying chen, Lai Jiangshan, Lai Jiangshan
From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Commit1 def98c84b6cd ("workqueue: Fix spurious sanity check failures
in destroy_workqueue()") tried to fix spurious sanity check failures by
stopping send_mayday() via setting wq->rescuer to NULL.
But it failed to stop the pwq->mayday_node requeuing in the rescuer,
and commit2 e66b39af00f4 ("workqueue: Fix pwq ref leak in
rescuer_thread()") fixed that by checking wq->rescuer, relying on the
result of commit1.
Both commits together really fix spurious sanity check failures caused
by the rescuer, but they both use a convoluted method by relying on
wq->rescuer state rather than the real count of work items.
Actually __WQ_DESTROYING and drain_workqueue() together already stop
send_mayday() by draining all the work items and ensuring no new work
item requeuing.
And the more proper fix to stop the pwq->mayday_node requeuing in the
rescuer is commit3 4f3f4cf388f8 ("workqueue: avoid unneeded requeuing
the pwq in rescuer thread"), which renders the check of wq->rescuer in
commit2 unnecessary.
So __WQ_DESTROYING, drain_workqueue() and commit3 together fix spurious
sanity check failures introduced by the rescuer.
Just remove the convoluted code of using wq->rescuer.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
kernel/workqueue.c | 17 +++++------------
1 file changed, 5 insertions(+), 12 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 932581d93edb..943fa27e272b 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3541,10 +3541,9 @@ static int rescuer_thread(void *__rescuer)
if (pwq->nr_active && need_to_create_worker(pool)) {
raw_spin_lock(&wq_mayday_lock);
/*
- * Queue iff we aren't racing destruction
- * and somebody else hasn't queued it already.
+ * Queue iff somebody else hasn't queued it already.
*/
- if (wq->rescuer && list_empty(&pwq->mayday_node)) {
+ if (list_empty(&pwq->mayday_node)) {
get_pwq(pwq);
list_add_tail(&pwq->mayday_node, &wq->maydays);
}
@@ -5903,16 +5902,10 @@ void destroy_workqueue(struct workqueue_struct *wq)
/* kill rescuer, if sanity checks fail, leave it w/o rescuer */
if (wq->rescuer) {
- struct worker *rescuer = wq->rescuer;
-
- /* this prevents new queueing */
- raw_spin_lock_irq(&wq_mayday_lock);
- wq->rescuer = NULL;
- raw_spin_unlock_irq(&wq_mayday_lock);
-
/* rescuer will empty maydays list before exiting */
- kthread_stop(rescuer->task);
- kfree(rescuer);
+ kthread_stop(wq->rescuer->task);
+ kfree(wq->rescuer);
+ wq->rescuer = NULL;
}
/*
--
2.19.1.6.gb485710b
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH V3 4/7] workqueue: Loop over in rescuer until all its work is done
2025-11-21 14:57 [PATCH V3 0/7] workqueue: Factor the way to assign rescuer work Lai Jiangshan
` (2 preceding siblings ...)
2025-11-21 14:57 ` [PATCH V3 3/7] workqueue: Don't rely on wq->rescuer to stop rescuer Lai Jiangshan
@ 2025-11-21 14:57 ` Lai Jiangshan
2025-11-21 19:30 ` Tejun Heo
2025-11-21 14:57 ` [PATCH V3 5/7] workqueue: Process rescuer work items one-by-one using a cursor Lai Jiangshan
` (3 subsequent siblings)
7 siblings, 1 reply; 17+ messages in thread
From: Lai Jiangshan @ 2025-11-21 14:57 UTC (permalink / raw)
To: linux-kernel; +Cc: Tejun Heo, ying chen, Lai Jiangshan, Lai Jiangshan
From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Simplify the rescuer's work by looping directly in the rescuer rather
than adding the pwq back to the maydays list. This also helps when
max_active is 1 or small but pwq->inactive_works holds a large number
of pending work items.
This might hurt fairness among PWQs and the rescuer could end up being
stuck on one PWQ indefinitely, but the rescuer's objective is to make
forward progress rather than ensure fairness.
Fairness can be further improved in future by assigning work items to
the rescuer one by one.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
kernel/workqueue.c | 24 +-----------------------
1 file changed, 1 insertion(+), 23 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 943fa27e272b..3032235a131e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3526,31 +3526,9 @@ static int rescuer_thread(void *__rescuer)
WARN_ON_ONCE(!list_empty(&rescuer->scheduled));
- if (assign_rescuer_work(pwq, rescuer)) {
+ while (assign_rescuer_work(pwq, rescuer))
process_scheduled_works(rescuer);
- /*
- * The above execution of rescued work items could
- * have created more to rescue through
- * pwq_activate_first_inactive() or chained
- * queueing. Let's put @pwq back on mayday list so
- * that such back-to-back work items, which may be
- * being used to relieve memory pressure, don't
- * incur MAYDAY_INTERVAL delay inbetween.
- */
- if (pwq->nr_active && need_to_create_worker(pool)) {
- raw_spin_lock(&wq_mayday_lock);
- /*
- * Queue iff somebody else hasn't queued it already.
- */
- if (list_empty(&pwq->mayday_node)) {
- get_pwq(pwq);
- list_add_tail(&pwq->mayday_node, &wq->maydays);
- }
- raw_spin_unlock(&wq_mayday_lock);
- }
- }
-
/*
* Leave this pool. Notify regular workers; otherwise, we end up
* with 0 concurrency and stalling the execution.
--
2.19.1.6.gb485710b
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH V3 5/7] workqueue: Process rescuer work items one-by-one using a cursor
2025-11-21 14:57 [PATCH V3 0/7] workqueue: Factor the way to assign rescuer work Lai Jiangshan
` (3 preceding siblings ...)
2025-11-21 14:57 ` [PATCH V3 4/7] workqueue: Loop over in rescuer until all its work is done Lai Jiangshan
@ 2025-11-21 14:57 ` Lai Jiangshan
2025-11-21 19:05 ` Tejun Heo
2025-11-21 14:57 ` [PATCH V3 6/7] workqueue: Limit number of processed works in rescuer per turn Lai Jiangshan
` (2 subsequent siblings)
7 siblings, 1 reply; 17+ messages in thread
From: Lai Jiangshan @ 2025-11-21 14:57 UTC (permalink / raw)
To: linux-kernel; +Cc: Tejun Heo, ying chen, Lai Jiangshan, Lai Jiangshan
From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Previously, the rescuer scanned for all matching work items at once and
processed them within a single rescuer thread, which could cause one
blocking work item to stall all others.
Make the rescuer process work items one-by-one instead of slurping all
matches in a single pass.
Break the rescuer loop after finding and processing the first matching
work item, then restart the search to pick up the next. This gives
other work items a chance to be processed by normal worker threads
instead of waiting in the rescuer's queue, and prevents a blocking work
item from stalling the rest once memory pressure is relieved.
Introduce a dummy cursor work item to avoid potentially O(N^2)
rescans of the work list. The marker records the resume position for
the next scan, eliminating redundant traversals.
Cc: ying chen <yc1082463@gmail.com>
Reported-by: ying chen <yc1082463@gmail.com>
Fixes: e22bee782b3b ("workqueue: implement concurrency managed dynamic worker pool")
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
kernel/workqueue.c | 56 ++++++++++++++++++++++++++++++++++++++++------
1 file changed, 49 insertions(+), 7 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 3032235a131e..49dce50ff647 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -286,6 +286,7 @@ struct pool_workqueue {
struct list_head pending_node; /* LN: node on wq_node_nr_active->pending_pwqs */
struct list_head pwqs_node; /* WR: node on wq->pwqs */
struct list_head mayday_node; /* MD: node on wq->maydays */
+ struct work_struct mayday_cursor; /* L: cursor on pool->worklist */
u64 stats[PWQ_NR_STATS];
@@ -1126,6 +1127,12 @@ static struct worker *find_worker_executing_work(struct worker_pool *pool,
return NULL;
}
+static void mayday_cursor_func(struct work_struct *work)
+{
+ /* should not be processed, only for marking position */
+ BUG();
+}
+
/**
* move_linked_works - move linked works to a list
* @work: start of series of works to be scheduled
@@ -1188,6 +1195,16 @@ static bool assign_work(struct work_struct *work, struct worker *worker,
lockdep_assert_held(&pool->lock);
+ /* The cursor work should not be processed */
+ if (unlikely(work->func == mayday_cursor_func)) {
+ /* only worker_thread() can possibly take this branch */
+ WARN_ON_ONCE(worker->rescue_wq);
+ if (nextp)
+ *nextp = list_next_entry(work, entry);
+ list_del_init(&work->entry);
+ return false;
+ }
+
/*
* A single work shouldn't be executed concurrently by multiple workers.
* __queue_work() ensures that @work doesn't jump to a different pool
@@ -3442,22 +3459,33 @@ static int worker_thread(void *__worker)
static bool assign_rescuer_work(struct pool_workqueue *pwq, struct worker *rescuer)
{
struct worker_pool *pool = pwq->pool;
+ struct work_struct *cursor = &pwq->mayday_cursor;
struct work_struct *work, *n;
+ /* from where to search */
+ if (list_empty(&cursor->entry)) {
+ work = list_first_entry(&pool->worklist, struct work_struct, entry);
+ } else {
+ work = list_next_entry(cursor, entry);
+ /* It will be at a new position or not need cursor anymore */
+ list_del_init(&cursor->entry);
+ }
+
/* need rescue? */
if (!pwq->nr_active || !need_to_create_worker(pool))
return false;
- /*
- * Slurp in all works issued via this workqueue and
- * process'em.
- */
- list_for_each_entry_safe(work, n, &pool->worklist, entry) {
- if (get_work_pwq(work) == pwq && assign_work(work, rescuer, &n))
+ /* try to assign a work to rescue */
+ list_for_each_entry_safe_from(work, n, &pool->worklist, entry) {
+ if (get_work_pwq(work) == pwq && assign_work(work, rescuer, &n)) {
pwq->stats[PWQ_STAT_RESCUED]++;
+ /* put the cursor for next search */
+ list_add_tail(&cursor->entry, &n->entry);
+ return true;
+ }
}
- return !list_empty(&rescuer->scheduled);
+ return false;
}
/**
@@ -5141,6 +5169,20 @@ static void init_pwq(struct pool_workqueue *pwq, struct workqueue_struct *wq,
INIT_LIST_HEAD(&pwq->pwqs_node);
INIT_LIST_HEAD(&pwq->mayday_node);
kthread_init_work(&pwq->release_work, pwq_release_workfn);
+
+ /*
+ * Set the dumpy cursor work with valid function and get_work_pwq().
+ *
+ * The cursor work should only be in the pwq->pool->worklist, and
+ * should never be queued, processed, flushed, cancelled or even examed
+ * as a work item.
+ *
+ * WORK_STRUCT_PENDING and WORK_STRUCT_INACTIVE just make it less
+ * surprise for kernel debuging tools and reviewers.
+ */
+ INIT_WORK(&pwq->mayday_cursor, mayday_cursor_func);
+ atomic_long_set(&pwq->mayday_cursor.data, (unsigned long)pwq |
+ WORK_STRUCT_PENDING | WORK_STRUCT_PWQ | WORK_STRUCT_INACTIVE);
}
/* sync @pwq with the current state of its associated wq and link it */
--
2.19.1.6.gb485710b
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH V3 6/7] workqueue: Limit number of processed works in rescuer per turn
2025-11-21 14:57 [PATCH V3 0/7] workqueue: Factor the way to assign rescuer work Lai Jiangshan
` (4 preceding siblings ...)
2025-11-21 14:57 ` [PATCH V3 5/7] workqueue: Process rescuer work items one-by-one using a cursor Lai Jiangshan
@ 2025-11-21 14:57 ` Lai Jiangshan
2025-11-21 19:28 ` Tejun Heo
2025-11-21 14:57 ` [PATCH V3 7/7] workqueue: Process extra works in rescuer when there are no more to rescue Lai Jiangshan
2025-11-21 19:57 ` [PATCH V3 0/7] workqueue: Factor the way to assign rescuer work Tejun Heo
7 siblings, 1 reply; 17+ messages in thread
From: Lai Jiangshan @ 2025-11-21 14:57 UTC (permalink / raw)
To: linux-kernel; +Cc: Tejun Heo, ying chen, Lai Jiangshan, Lai Jiangshan
From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Currently the rescuer keeps looping until all work on a PWQ is done, and
this may hurt fairness among PWQs, as the rescuer could remain stuck on
one PWQ indefinitely.
Introduce RESCUER_BATCH to control the maximum number of work items the
rescuer processes in each turn, and move on to other PWQs when the limit
is reached.
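As a rough model of the per-turn limit (illustrative only; the real rescuer re-sends mayday and walks wq->maydays rather than an array, and `rescue_all`/`NQ` are made-up names for this sketch): each turn drains at most @batch items from one pwq, then yields to the next one, so a deep backlog on a single pwq cannot monopolize the rescuer.

```c
#define NQ 4	/* number of pwqs in this toy model */

/*
 * Drain all pending counts round-robin, at most @batch items per pwq
 * per turn; a pwq with leftovers "re-queues itself" for another turn
 * (the analogue of re-sending mayday).  Returns the total number of
 * turns taken.
 */
static int rescue_all(int pending[NQ], int batch)
{
	int turns = 0, remaining = 1;

	while (remaining) {
		remaining = 0;
		for (int q = 0; q < NQ; q++) {
			if (!pending[q])
				continue;
			int take = pending[q] < batch ? pending[q] : batch;

			pending[q] -= take;	/* "process" up to batch items */
			turns++;
			if (pending[q])
				remaining = 1;	/* would re-send mayday */
		}
	}
	return turns;
}
```

With a backlog of 40 items on one pwq and single items on two others, a batch of 16 lets the short pwqs finish in the first round instead of waiting behind the long one.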
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
kernel/workqueue.c | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 49dce50ff647..9bc155545492 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -117,6 +117,8 @@ enum wq_internal_consts {
MAYDAY_INTERVAL = HZ / 10, /* and then every 100ms */
CREATE_COOLDOWN = HZ, /* time to breath after fail */
+ RESCUER_BATCH = 16, /* process items per turn */
+
/*
* Rescue workers are used only on emergencies and shared by
* all cpus. Give MIN_NICE.
@@ -3456,7 +3458,7 @@ static int worker_thread(void *__worker)
goto woke_up;
}
-static bool assign_rescuer_work(struct pool_workqueue *pwq, struct worker *rescuer)
+static bool assign_rescuer_work(struct pool_workqueue *pwq, struct worker *rescuer, bool limited)
{
struct worker_pool *pool = pwq->pool;
struct work_struct *cursor = &pwq->mayday_cursor;
@@ -3477,7 +3479,20 @@ static bool assign_rescuer_work(struct pool_workqueue *pwq, struct worker *rescu
/* try to assign a work to rescue */
list_for_each_entry_safe_from(work, n, &pool->worklist, entry) {
- if (get_work_pwq(work) == pwq && assign_work(work, rescuer, &n)) {
+ if (get_work_pwq(work) != pwq)
+ continue;
+ /*
+ * put the cursor, resend mayday for itself and move on to other
+ * PWQs when the limit is reached.
+ */
+ if (limited && !list_empty(&pwq->wq->maydays)) {
+ list_add_tail(&cursor->entry, &work->entry);
+ raw_spin_lock(&wq_mayday_lock); /* for wq->maydays */
+ send_mayday(work);
+ raw_spin_unlock(&wq_mayday_lock);
+ return false;
+ }
+ if (assign_work(work, rescuer, &n)) {
pwq->stats[PWQ_STAT_RESCUED]++;
/* put the cursor for next search */
list_add_tail(&cursor->entry, &n->entry);
@@ -3542,6 +3557,7 @@ static int rescuer_thread(void *__rescuer)
struct pool_workqueue *pwq = list_first_entry(&wq->maydays,
struct pool_workqueue, mayday_node);
struct worker_pool *pool = pwq->pool;
+ unsigned int count = 0;
__set_current_state(TASK_RUNNING);
list_del_init(&pwq->mayday_node);
@@ -3554,7 +3570,7 @@ static int rescuer_thread(void *__rescuer)
WARN_ON_ONCE(!list_empty(&rescuer->scheduled));
- while (assign_rescuer_work(pwq, rescuer))
+ while (assign_rescuer_work(pwq, rescuer, ++count > RESCUER_BATCH))
process_scheduled_works(rescuer);
/*
--
2.19.1.6.gb485710b
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH V3 7/7] workqueue: Process extra works in rescuer when there are no more to rescue
2025-11-21 14:57 [PATCH V3 0/7] workqueue: Factor the way to assign rescuer work Lai Jiangshan
` (5 preceding siblings ...)
2025-11-21 14:57 ` [PATCH V3 6/7] workqueue: Limit number of processed works in rescuer per turn Lai Jiangshan
@ 2025-11-21 14:57 ` Lai Jiangshan
2025-11-21 19:29 ` Tejun Heo
2025-11-21 19:57 ` [PATCH V3 0/7] workqueue: Factor the way to assign rescuer work Tejun Heo
7 siblings, 1 reply; 17+ messages in thread
From: Lai Jiangshan @ 2025-11-21 14:57 UTC (permalink / raw)
To: linux-kernel; +Cc: Tejun Heo, ying chen, Lai Jiangshan, Lai Jiangshan
From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Make the rescuer process extra work items on the last pwq when there is
nothing more to rescue for the whole workqueue, to help the regular
workers in case the relief is only temporary and to reduce relapse.
Also limit the number of extra work items.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
kernel/workqueue.c | 24 ++++++++++++++++++++++--
1 file changed, 22 insertions(+), 2 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9bc155545492..86acf2a9cd41 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3473,10 +3473,30 @@ static bool assign_rescuer_work(struct pool_workqueue *pwq, struct worker *rescu
list_del_init(&cursor->entry);
}
- /* need rescue? */
- if (!pwq->nr_active || !need_to_create_worker(pool))
+ /* have work items to rescue? */
+ if (!pwq->nr_active)
return false;
+ /* need rescue? */
+ if (!need_to_create_worker(pool)) {
+ /*
+ * The pool has idle workers and doesn't need the rescuer, so it
+ * could simply return false here. However, the memory pressure
+ * might not be fully relieved, and if the regular workers later
+ * go to sleep and the pool runs out of idle workers, mayday
+ * would be triggered again.
+ *
+ * Do more work even if it doesn't need rescue, to help the
+ * regular workers in case it is a temporary relief and to
+ * reduce relapse.
+ *
+ * Unless the limit is reached or there are other PWQs needing
+ * help.
+ */
+ if (limited || !list_empty(&pwq->wq->maydays))
+ return false;
+ }
+
/* try to assign a work to rescue */
list_for_each_entry_safe_from(work, n, &pool->worklist, entry) {
if (get_work_pwq(work) != pwq)
--
2.19.1.6.gb485710b
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH V3 5/7] workqueue: Process rescuer work items one-by-one using a cursor
2025-11-21 14:57 ` [PATCH V3 5/7] workqueue: Process rescuer work items one-by-one using a cursor Lai Jiangshan
@ 2025-11-21 19:05 ` Tejun Heo
0 siblings, 0 replies; 17+ messages in thread
From: Tejun Heo @ 2025-11-21 19:05 UTC (permalink / raw)
To: Lai Jiangshan; +Cc: linux-kernel, ying chen, Lai Jiangshan
> static bool assign_rescuer_work(struct pool_workqueue *pwq, struct worker *rescuer)
> {
> struct worker_pool *pool = pwq->pool;
> + struct work_struct *cursor = &pwq->mayday_cursor;
> struct work_struct *work, *n;
>
> + /* from where to search */
maybe: search from the start or cursor if available
...
> + /* try to assign a work to rescue */
maybe: find the next work item to rescue
> + list_for_each_entry_safe_from(work, n, &pool->worklist, entry) {
> + if (get_work_pwq(work) == pwq && assign_work(work, rescuer, &n)) {
> pwq->stats[PWQ_STAT_RESCUED]++;
> + /* put the cursor for next search */
> + list_add_tail(&cursor->entry, &n->entry);
> + return true;
> + }
...
> @@ -5141,6 +5169,20 @@ static void init_pwq(struct pool_workqueue *pwq, struct workqueue_struct *wq,
> INIT_LIST_HEAD(&pwq->pwqs_node);
> INIT_LIST_HEAD(&pwq->mayday_node);
> kthread_init_work(&pwq->release_work, pwq_release_workfn);
> +
> + /*
> + * Set the dumpy cursor work with valid function and get_work_pwq().
^
dummy
> + *
> + * The cursor work should only be in the pwq->pool->worklist, and
> + * should never be queued, processed, flushed, cancelled or even examed
^
examined
I wonder whether this is unnecessarily verbose. Maybe just "should not be
treated as a regular work item"?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH V3 6/7] workqueue: Limit number of processed works in rescuer per turn
2025-11-21 14:57 ` [PATCH V3 6/7] workqueue: Limit number of processed works in rescuer per turn Lai Jiangshan
@ 2025-11-21 19:28 ` Tejun Heo
2025-11-22 6:22 ` Lai Jiangshan
0 siblings, 1 reply; 17+ messages in thread
From: Tejun Heo @ 2025-11-21 19:28 UTC (permalink / raw)
To: Lai Jiangshan; +Cc: linux-kernel, ying chen, Lai Jiangshan
Hello,
On Fri, Nov 21, 2025 at 10:57:19PM +0800, Lai Jiangshan wrote:
> +static bool assign_rescuer_work(struct pool_workqueue *pwq, struct worker *rescuer, bool limited)
I find the organization a bit odd, with the expiration detection in the
caller and the implementation of it piped into this function. Please see
below.
> list_for_each_entry_safe_from(work, n, &pool->worklist, entry) {
> - if (get_work_pwq(work) == pwq && assign_work(work, rescuer, &n)) {
> + if (get_work_pwq(work) != pwq)
> + continue;
> + /*
> + * put the cursor, resend mayday for itself and move on to other
> + * PWQs when the limit is reached.
> + */
> + if (limited && !list_empty(&pwq->wq->maydays)) {
> + list_add_tail(&cursor->entry, &work->entry);
> + raw_spin_lock(&wq_mayday_lock); /* for wq->maydays */
> + send_mayday(work);
> + raw_spin_unlock(&wq_mayday_lock);
> + return false;
Does it make sense to maintain cursor position across pwqs? Shouldn't it be
reset? Imagine two pwqs' (A, B) work items interleaved:
A1 B1 A2 B2 A3 B3
1. Two of A's work items are rescued and cursor is inserted before the next
eligible one:
B1 B2 A3 B3
^
2. Let's say limit is reached and we're moving on to B. Then, the rescuer
would first run B3. Wouldn't it make more sense to go back to the head of
the queue and start over so that it can pick up B1 first?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH V3 7/7] workqueue: Process extra works in rescuer when there are no more to rescue
2025-11-21 14:57 ` [PATCH V3 7/7] workqueue: Process extra works in rescuer when there are no more to rescue Lai Jiangshan
@ 2025-11-21 19:29 ` Tejun Heo
2025-11-22 7:07 ` Lai Jiangshan
0 siblings, 1 reply; 17+ messages in thread
From: Tejun Heo @ 2025-11-21 19:29 UTC (permalink / raw)
To: Lai Jiangshan; +Cc: linux-kernel, ying chen, Lai Jiangshan
On Fri, Nov 21, 2025 at 10:57:20PM +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
>
> Make the rescuer process more work on the last pwq when there are no
> more to rescue for the whole workqueue to help the regular workers in
> case it is a temporary relief and to reduce relapse.
>
> Also limit the number of extra works.
>
> Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Is this based on some empirical case? Can you give a concrete example?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH V3 4/7] workqueue: Loop over in rescuer until all its work is done
2025-11-21 14:57 ` [PATCH V3 4/7] workqueue: Loop over in rescuer until all its work is done Lai Jiangshan
@ 2025-11-21 19:30 ` Tejun Heo
0 siblings, 0 replies; 17+ messages in thread
From: Tejun Heo @ 2025-11-21 19:30 UTC (permalink / raw)
To: Lai Jiangshan; +Cc: linux-kernel, ying chen, Lai Jiangshan
On Fri, Nov 21, 2025 at 10:57:17PM +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
>
> Simplify the rescuer work by looping directly in the rescuer rather than
> adding the pwq back to the maydays list. This also helps when
> max_requests is 1 or small but pwq->inactive_works has a large number of
> pending work items.
>
> This might hurt fairness among PWQs and the rescuer could end up being
> stuck on one PWQ indefinitely, but the rescuer's objective is to make
> forward progress rather than ensure fairness.
>
> Fairness can be further improved in future by assigning work items to
> the rescuer one by one.
Yeah, given that you restore similar behavior later. I'd just note that this
is temporary change to ease the transition.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH V3 0/7] workqueue: Factor the way to assign rescuer work
2025-11-21 14:57 [PATCH V3 0/7] workqueue: Factor the way to assign rescuer work Lai Jiangshan
` (6 preceding siblings ...)
2025-11-21 14:57 ` [PATCH V3 7/7] workqueue: Process extra works in rescuer when there are no more to rescue Lai Jiangshan
@ 2025-11-21 19:57 ` Tejun Heo
7 siblings, 0 replies; 17+ messages in thread
From: Tejun Heo @ 2025-11-21 19:57 UTC (permalink / raw)
To: linux-kernel; +Cc: Lai Jiangshan, ying chen, Lai Jiangshan
On Fri, Nov 21, 2025 at 10:57:13PM +0800, Lai Jiangshan wrote:
> Lai Jiangshan (7):
> workqueue: Factor out assign_rescuer_work()
> workqueue: Only assign rescuer work when really needed
> workqueue: Don't rely on wq->rescuer to stop rescuer
Applied 1-3 to wq/for-6.19.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH V3 6/7] workqueue: Limit number of processed works in rescuer per turn
2025-11-21 19:28 ` Tejun Heo
@ 2025-11-22 6:22 ` Lai Jiangshan
2025-11-22 14:26 ` Tejun Heo
0 siblings, 1 reply; 17+ messages in thread
From: Lai Jiangshan @ 2025-11-22 6:22 UTC (permalink / raw)
To: Tejun Heo; +Cc: linux-kernel, ying chen, Lai Jiangshan
Hello
On Sat, Nov 22, 2025 at 3:28 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, Nov 21, 2025 at 10:57:19PM +0800, Lai Jiangshan wrote:
> > +static bool assign_rescuer_work(struct pool_workqueue *pwq, struct worker *rescuer, bool limited)
>
> I find the organization a bit odd with the expiration detection in the
> caller and the implmentation of it piped into this function. Please see
> below.
>
> > 	list_for_each_entry_safe_from(work, n, &pool->worklist, entry) {
> > -		if (get_work_pwq(work) == pwq && assign_work(work, rescuer, &n)) {
> > +		if (get_work_pwq(work) != pwq)
> > +			continue;
> > +		/*
> > +		 * put the cursor, resend mayday for itself and move on to other
> > +		 * PWQs when the limit is reached.
> > +		 */
> > +		if (limited && !list_empty(&pwq->wq->maydays)) {
> > +			list_add_tail(&cursor->entry, &work->entry);
> > +			raw_spin_lock(&wq_mayday_lock);	/* for wq->maydays */
> > +			send_mayday(work);
> > +			raw_spin_unlock(&wq_mayday_lock);
> > +			return false;
>
> Does it make sense to maintain cursor position across pwqs? Shouldn't it be
> reset? Imagine two pwqs' (A, B) work items interleaved:
>
> A1 B1 A2 B2 A3 B3
I might have misunderstood the setting.
>
> 1. Two of A's work items are rescued and cursor is inserted before the next
> eligible one:
>
> B1 B2 A3 B3
>       ^
>
> 2. Let's say limit is reached and we're moving on to B. Then, the rescuer
> would first run B3. Wouldn't it make more sense to go back to the head of
> the queue and start over so that it can pick up B1 first?
>
The cursor is per PWQ. When the rescuer comes back to this pool next time,
it can only handle the PWQ belonging to its wq, which is A, and it will search
from A3 and process A3, instead of searching from B1 as it would if the cursor were reset.
The rescuer never moves to B (I assume "A1 B1 A2 B2 A3 B3" is the worklist
of the pool, and that B is a pwq of the pool), because B is not a PWQ
of the workqueue; there can never be two PWQs of the same wq in the same
pool.
The rescuer will leave this pool and move to the next pwq in
pwq->wq->maydays, continuing to help its own wq, shifting from
pwq to pwq with a limited number of works processed for each pwq in turn.
Thanks,
Lai
* Re: [PATCH V3 7/7] workqueue: Process extra works in rescuer when there are no more to rescue
2025-11-21 19:29 ` Tejun Heo
@ 2025-11-22 7:07 ` Lai Jiangshan
2025-11-22 14:31 ` Tejun Heo
0 siblings, 1 reply; 17+ messages in thread
From: Lai Jiangshan @ 2025-11-22 7:07 UTC (permalink / raw)
To: Tejun Heo; +Cc: linux-kernel, ying chen, Lai Jiangshan
On Sat, Nov 22, 2025 at 3:29 AM Tejun Heo <tj@kernel.org> wrote:
>
> On Fri, Nov 21, 2025 at 10:57:20PM +0800, Lai Jiangshan wrote:
> > From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> >
> > Make the rescuer process more work on the last pwq when there are no
> > more to rescue for the whole workqueue to help the regular workers in
> > case it is a temporary relief and to reduce relapse.
> >
> > Also limit the number of extra works.
> >
> > Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
>
> Is this based on some empirical case? Can you give a concrete example?
>
It is not a case observed in production, but it can be reproduced in testing.
If the pool is a PERCPU pool with concurrency enabled, having idle
workers does not necessarily mean memory pressure is relieved; it may
simply mean regular workers woke up, completed their work, and went idle
due to concurrency management.
In such a case, those workers may later sleep again, the pool may
run out of idle workers, and it will need to allocate new ones again, becoming
stalled and waiting for the rescuer. This can cause unnecessary delay,
especially if memory pressure was never resolved throughout.
Before patch 2/7, the rescuer processed all active work items
unconditionally, so this situation was less problematic.
But this patch does not really check whether memory pressure is still on;
testing (pool->flags & POOL_MANAGER_ACTIVE) can roughly
achieve that, though not precisely. There is no precise way to tell,
but since the system has already poured resources into creating the rescuer,
it is better to make the best use of it, as before patch 2/7.
this patch:

	if (!pwq->nr_active)
		return false;
	if (!need_to_create_worker(pool)) {
		if (limited || !list_empty(&pwq->wq->maydays))
			return false;
	}
with memory pressure test via POOL_MANAGER_ACTIVE:

	if (!pwq->nr_active)
		return false;
	if (!need_to_create_worker(pool)) {
		if (limited || !(pool->flags & POOL_MANAGER_ACTIVE) ||
		    !list_empty(&pwq->wq->maydays))
			return false;
	}
What do you think?
Thanks
Lai
* Re: [PATCH V3 6/7] workqueue: Limit number of processed works in rescuer per turn
2025-11-22 6:22 ` Lai Jiangshan
@ 2025-11-22 14:26 ` Tejun Heo
0 siblings, 0 replies; 17+ messages in thread
From: Tejun Heo @ 2025-11-22 14:26 UTC (permalink / raw)
To: Lai Jiangshan; +Cc: linux-kernel, ying chen, Lai Jiangshan
Hello,
On Sat, Nov 22, 2025 at 02:22:12PM +0800, Lai Jiangshan wrote:
> The cursor is per PWQ. When the rescuer comes back to this pool next time,
> it can only handle the PWQ belonging to its wq, which is A, and it will search
> from A3 and process A3, instead of searching from B1 as it would if the cursor were reset.
Oh yeah, you're right. I was thinking that the cursor was shared across pwqs
for some reason.
Thanks.
--
tejun
* Re: [PATCH V3 7/7] workqueue: Process extra works in rescuer when there are no more to rescue
2025-11-22 7:07 ` Lai Jiangshan
@ 2025-11-22 14:31 ` Tejun Heo
0 siblings, 0 replies; 17+ messages in thread
From: Tejun Heo @ 2025-11-22 14:31 UTC (permalink / raw)
To: Lai Jiangshan; +Cc: linux-kernel, ying chen, Lai Jiangshan
Hello,
On Sat, Nov 22, 2025 at 03:07:25PM +0800, Lai Jiangshan wrote:
...
> But this patch does not really check whether memory pressure is still on;
> testing (pool->flags & POOL_MANAGER_ACTIVE) can roughly
> achieve that, though not precisely. There is no precise way to tell,
> but since the system has already poured resources into creating the rescuer,
> it is better to make the best use of it, as before patch 2/7.
>
> this patch:
>
> 	if (!pwq->nr_active)
> 		return false;
> 	if (!need_to_create_worker(pool)) {
> 		if (limited || !list_empty(&pwq->wq->maydays))
> 			return false;
> 	}
>
> with memory pressure test via POOL_MANAGER_ACTIVE:
>
> 	if (!pwq->nr_active)
> 		return false;
> 	if (!need_to_create_worker(pool)) {
> 		if (limited || !(pool->flags & POOL_MANAGER_ACTIVE) ||
> 		    !list_empty(&pwq->wq->maydays))
> 			return false;
> 	}
>
> What do you think?
Yeah, POOL_MANAGER_ACTIVE seems like a more reliable condition to test.
Let's do that.
Thanks.
--
tejun
end of thread, other threads:[~2025-11-22 14:31 UTC | newest]
Thread overview: 17+ messages
2025-11-21 14:57 [PATCH V3 0/7] workqueue: Factor the way to assign rescuer work Lai Jiangshan
2025-11-21 14:57 ` [PATCH V3 1/7] workqueue: Factor out assign_rescuer_work() Lai Jiangshan
2025-11-21 14:57 ` [PATCH V3 2/7] workqueue: Only assign rescuer work when really needed Lai Jiangshan
2025-11-21 14:57 ` [PATCH V3 3/7] workqueue: Don't rely on wq->rescuer to stop rescuer Lai Jiangshan
2025-11-21 14:57 ` [PATCH V3 4/7] workqueue: Loop over in rescuer until all its work is done Lai Jiangshan
2025-11-21 19:30 ` Tejun Heo
2025-11-21 14:57 ` [PATCH V3 5/7] workqueue: Process rescuer work items one-by-one using a cursor Lai Jiangshan
2025-11-21 19:05 ` Tejun Heo
2025-11-21 14:57 ` [PATCH V3 6/7] workqueue: Limit number of processed works in rescuer per turn Lai Jiangshan
2025-11-21 19:28 ` Tejun Heo
2025-11-22 6:22 ` Lai Jiangshan
2025-11-22 14:26 ` Tejun Heo
2025-11-21 14:57 ` [PATCH V3 7/7] workqueue: Process extra works in rescuer when there are no more to rescue Lai Jiangshan
2025-11-21 19:29 ` Tejun Heo
2025-11-22 7:07 ` Lai Jiangshan
2025-11-22 14:31 ` Tejun Heo
2025-11-21 19:57 ` [PATCH V3 0/7] workqueue: Factor the way to assign rescuer work Tejun Heo