public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH V5 0/3] workqueue: Factor the way to assign rescuer work
@ 2025-12-08 13:25 Lai Jiangshan
  2025-12-08 13:25 ` [PATCH V5 1/3] workqueue: Make send_mayday() take a PWQ argument directly Lai Jiangshan
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Lai Jiangshan @ 2025-12-08 13:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: Tejun Heo, ying chen, Lai Jiangshan

From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

Previously, the rescuer scanned for all matching work items at once and
processed them within a single rescuer thread, which could cause one
blocking work item to stall all others.

Make the rescuer process work items one-by-one instead of slurping all
matches in a single pass using a cursor.

Changed from V4:

  Control the limit and call send_mayday() in rescuer_thread(), which
  makes rescuer_thread() responsible for ensuring the cursor is not left
  behind.

v4: https://lore.kernel.org/lkml/20251125063617.671199-1-jiangshanlai@gmail.com/

Lai Jiangshan (3):
  workqueue: Refactor send_mayday() to take pwq directly
  workqueue: Process rescuer work items one-by-one using a cursor
  workqueue: Process extra works in rescuer on memory pressure

 kernel/workqueue.c | 119 ++++++++++++++++++++++++++++++++++-----------
 1 file changed, 91 insertions(+), 28 deletions(-)

-- 
2.19.1.6.gb485710b


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH V5 1/3] workqueue: Make send_mayday() take a PWQ argument directly
  2025-12-08 13:25 [PATCH V5 0/3] workqueue: Factor the way to assign rescuer work Lai Jiangshan
@ 2025-12-08 13:25 ` Lai Jiangshan
  2025-12-08 13:25 ` [PATCH V5 2/3] workqueue: Process rescuer work items one-by-one using a cursor Lai Jiangshan
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Lai Jiangshan @ 2025-12-08 13:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: Tejun Heo, ying chen, Lai Jiangshan, Lai Jiangshan

From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

Make send_mayday() operate on a PWQ directly instead of taking a work
item, so that rescuer_thread() now calls send_mayday(pwq) instead of
open-coding the mayday list manipulation.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
 kernel/workqueue.c | 13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 253311af47c6..f8371aa54dca 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2976,9 +2976,8 @@ static void idle_cull_fn(struct work_struct *work)
 	reap_dying_workers(&cull_list);
 }
 
-static void send_mayday(struct work_struct *work)
+static void send_mayday(struct pool_workqueue *pwq)
 {
-	struct pool_workqueue *pwq = get_work_pwq(work);
 	struct workqueue_struct *wq = pwq->wq;
 
 	lockdep_assert_held(&wq_mayday_lock);
@@ -3016,7 +3015,7 @@ static void pool_mayday_timeout(struct timer_list *t)
 		 * rescuers.
 		 */
 		list_for_each_entry(work, &pool->worklist, entry)
-			send_mayday(work);
+			send_mayday(get_work_pwq(work));
 	}
 
 	raw_spin_unlock(&wq_mayday_lock);
@@ -3538,13 +3537,7 @@ static int rescuer_thread(void *__rescuer)
 			 */
 			if (pwq->nr_active && need_to_create_worker(pool)) {
 				raw_spin_lock(&wq_mayday_lock);
-				/*
-				 * Queue iff somebody else hasn't queued it already.
-				 */
-				if (list_empty(&pwq->mayday_node)) {
-					get_pwq(pwq);
-					list_add_tail(&pwq->mayday_node, &wq->maydays);
-				}
+				send_mayday(pwq);
 				raw_spin_unlock(&wq_mayday_lock);
 			}
 		}
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH V5 2/3] workqueue: Process rescuer work items one-by-one using a cursor
  2025-12-08 13:25 [PATCH V5 0/3] workqueue: Factor the way to assign rescuer work Lai Jiangshan
  2025-12-08 13:25 ` [PATCH V5 1/3] workqueue: Make send_mayday() take a PWQ argument directly Lai Jiangshan
@ 2025-12-08 13:25 ` Lai Jiangshan
  2025-12-08 13:25 ` [PATCH V5 3/3] workqueue: Process extra works in rescuer on memory pressure Lai Jiangshan
  2025-12-08 19:24 ` [PATCH V5 0/3] workqueue: Factor the way to assign rescuer work Tejun Heo
  3 siblings, 0 replies; 5+ messages in thread
From: Lai Jiangshan @ 2025-12-08 13:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: Tejun Heo, ying chen, Lai Jiangshan, Lai Jiangshan

From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

Previously, the rescuer scanned for all matching work items at once and
processed them within a single rescuer thread, which could cause one
blocking work item to stall all others.

Make the rescuer process work items one-by-one instead of slurping all
matches in a single pass.

Break the rescuer loop after finding and processing the first matching
work item, then restart the search to pick up the next. This gives
normal worker threads a chance to process other items which gives them
the opportinity to be processed instead of waiting on the rescuer's
queue and prevents a blocking work item from stalling the rest once
memory pressure is relieved.

Introduce a dummy cursor  work item to avoid potentially O(N^2)
rescans of the work list.  The marker records the resume position for
the next scan, eliminating redundant traversals.

Also introduce RESCUER_BATCH to control the maximum number of work items
the rescuer processes in each turn, and move on to other PWQs when the
limit is reached.

Cc: ying chen <yc1082463@gmail.com>
Reported-by: ying chen <yc1082463@gmail.com>
Fixes: e22bee782b3b ("workqueue: implement concurrency managed dynamic worker pool")
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
 kernel/workqueue.c | 75 ++++++++++++++++++++++++++++++++++++----------
 1 file changed, 59 insertions(+), 16 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index f8371aa54dca..add236e0dac4 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -117,6 +117,8 @@ enum wq_internal_consts {
 	MAYDAY_INTERVAL		= HZ / 10,	/* and then every 100ms */
 	CREATE_COOLDOWN		= HZ,		/* time to breath after fail */
 
+	RESCUER_BATCH		= 16,		/* process items per turn */
+
 	/*
 	 * Rescue workers are used only on emergencies and shared by
 	 * all cpus.  Give MIN_NICE.
@@ -286,6 +288,7 @@ struct pool_workqueue {
 	struct list_head	pending_node;	/* LN: node on wq_node_nr_active->pending_pwqs */
 	struct list_head	pwqs_node;	/* WR: node on wq->pwqs */
 	struct list_head	mayday_node;	/* MD: node on wq->maydays */
+	struct work_struct	mayday_cursor;	/* L: cursor on pool->worklist */
 
 	u64			stats[PWQ_NR_STATS];
 
@@ -1120,6 +1123,12 @@ static struct worker *find_worker_executing_work(struct worker_pool *pool,
 	return NULL;
 }
 
+static void mayday_cursor_func(struct work_struct *work)
+{
+	/* should not be processed, only for marking position */
+	BUG();
+}
+
 /**
  * move_linked_works - move linked works to a list
  * @work: start of series of works to be scheduled
@@ -1182,6 +1191,16 @@ static bool assign_work(struct work_struct *work, struct worker *worker,
 
 	lockdep_assert_held(&pool->lock);
 
+	/* The cursor work should not be processed */
+	if (unlikely(work->func == mayday_cursor_func)) {
+		/* only worker_thread() can possibly take this branch */
+		WARN_ON_ONCE(worker->rescue_wq);
+		if (nextp)
+			*nextp = list_next_entry(work, entry);
+		list_del_init(&work->entry);
+		return false;
+	}
+
 	/*
 	 * A single work shouldn't be executed concurrently by multiple workers.
 	 * __queue_work() ensures that @work doesn't jump to a different pool
@@ -3439,22 +3458,30 @@ static int worker_thread(void *__worker)
 static bool assign_rescuer_work(struct pool_workqueue *pwq, struct worker *rescuer)
 {
 	struct worker_pool *pool = pwq->pool;
+	struct work_struct *cursor = &pwq->mayday_cursor;
 	struct work_struct *work, *n;
 
 	/* need rescue? */
 	if (!pwq->nr_active || !need_to_create_worker(pool))
 		return false;
 
-	/*
-	 * Slurp in all works issued via this workqueue and
-	 * process'em.
-	 */
-	list_for_each_entry_safe(work, n, &pool->worklist, entry) {
-		if (get_work_pwq(work) == pwq && assign_work(work, rescuer, &n))
+	/* search from the start or cursor if available */
+	if (list_empty(&cursor->entry))
+		work = list_first_entry(&pool->worklist, struct work_struct, entry);
+	else
+		work = list_next_entry(cursor, entry);
+
+	/* find the next work item to rescue */
+	list_for_each_entry_safe_from(work, n, &pool->worklist, entry) {
+		if (get_work_pwq(work) == pwq && assign_work(work, rescuer, &n)) {
 			pwq->stats[PWQ_STAT_RESCUED]++;
+			/* put the cursor for next search */
+			list_move_tail(&cursor->entry, &n->entry);
+			return true;
+		}
 	}
 
-	return !list_empty(&rescuer->scheduled);
+	return false;
 }
 
 /**
@@ -3511,6 +3538,7 @@ static int rescuer_thread(void *__rescuer)
 		struct pool_workqueue *pwq = list_first_entry(&wq->maydays,
 					struct pool_workqueue, mayday_node);
 		struct worker_pool *pool = pwq->pool;
+		unsigned int count = 0;
 
 		__set_current_state(TASK_RUNNING);
 		list_del_init(&pwq->mayday_node);
@@ -3523,25 +3551,27 @@ static int rescuer_thread(void *__rescuer)
 
 		WARN_ON_ONCE(!list_empty(&rescuer->scheduled));
 
-		if (assign_rescuer_work(pwq, rescuer)) {
+		while (assign_rescuer_work(pwq, rescuer)) {
 			process_scheduled_works(rescuer);
 
 			/*
-			 * The above execution of rescued work items could
-			 * have created more to rescue through
-			 * pwq_activate_first_inactive() or chained
-			 * queueing.  Let's put @pwq back on mayday list so
-			 * that such back-to-back work items, which may be
-			 * being used to relieve memory pressure, don't
-			 * incur MAYDAY_INTERVAL delay inbetween.
+			 * If the per-turn work item limit is reached and other
+			 * PWQs are in mayday, requeue mayday for this PWQ and
+			 * let the rescuer handle the other PWQs first.
 			 */
-			if (pwq->nr_active && need_to_create_worker(pool)) {
+			if (++count > RESCUER_BATCH && !list_empty(&pwq->wq->maydays) &&
+			    pwq->nr_active && need_to_create_worker(pool)) {
 				raw_spin_lock(&wq_mayday_lock);
 				send_mayday(pwq);
 				raw_spin_unlock(&wq_mayday_lock);
+				break;
 			}
 		}
 
+		/* The cursor can not be left behind without the rescuer watching it. */
+		if (!list_empty(&pwq->mayday_cursor.entry) && list_empty(&pwq->mayday_node))
+			list_del_init(&pwq->mayday_cursor.entry);
+
 		/*
 		 * Leave this pool. Notify regular workers; otherwise, we end up
 		 * with 0 concurrency and stalling the execution.
@@ -5160,6 +5190,19 @@ static void init_pwq(struct pool_workqueue *pwq, struct workqueue_struct *wq,
 	INIT_LIST_HEAD(&pwq->pwqs_node);
 	INIT_LIST_HEAD(&pwq->mayday_node);
 	kthread_init_work(&pwq->release_work, pwq_release_workfn);
+
+	/*
+	 * Set the dummy cursor work with valid function and get_work_pwq().
+	 *
+	 * The cursor work should only be in the pwq->pool->worklist, and
+	 * should not be treated as a processable work item.
+	 *
+	 * WORK_STRUCT_PENDING and WORK_STRUCT_INACTIVE just make it less
+	 * surprise for kernel debuging tools and reviewers.
+	 */
+	INIT_WORK(&pwq->mayday_cursor, mayday_cursor_func);
+	atomic_long_set(&pwq->mayday_cursor.data, (unsigned long)pwq |
+			WORK_STRUCT_PENDING | WORK_STRUCT_PWQ | WORK_STRUCT_INACTIVE);
 }
 
 /* sync @pwq with the current state of its associated wq and link it */
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH V5 3/3] workqueue: Process extra works in rescuer on memory pressure
  2025-12-08 13:25 [PATCH V5 0/3] workqueue: Factor the way to assign rescuer work Lai Jiangshan
  2025-12-08 13:25 ` [PATCH V5 1/3] workqueue: Make send_mayday() take a PWQ argument directly Lai Jiangshan
  2025-12-08 13:25 ` [PATCH V5 2/3] workqueue: Process rescuer work items one-by-one using a cursor Lai Jiangshan
@ 2025-12-08 13:25 ` Lai Jiangshan
  2025-12-08 19:24 ` [PATCH V5 0/3] workqueue: Factor the way to assign rescuer work Tejun Heo
  3 siblings, 0 replies; 5+ messages in thread
From: Lai Jiangshan @ 2025-12-08 13:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: Tejun Heo, ying chen, Lai Jiangshan, Lai Jiangshan

From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

Make the rescuer process more work on the last pwq when there are no
more to rescue for the whole workqueue to help the regular workers in
case it is a temporary memory pressure relief and to reduce relapse.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
 kernel/workqueue.c | 31 +++++++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index add236e0dac4..fe5daa32d55e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3461,10 +3461,37 @@ static bool assign_rescuer_work(struct pool_workqueue *pwq, struct worker *rescu
 	struct work_struct *cursor = &pwq->mayday_cursor;
 	struct work_struct *work, *n;
 
-	/* need rescue? */
-	if (!pwq->nr_active || !need_to_create_worker(pool))
+	/* have work items to rescue? */
+	if (!pwq->nr_active)
 		return false;
 
+	/* need rescue? */
+	if (!need_to_create_worker(pool)) {
+		/*
+		 * The pool has idle workers and doesn't need the rescuer, so it
+		 * could simply return false here.
+		 *
+		 * However, the memory pressure might not be fully relieved.
+		 * In PERCPU pool with concurrency enabled, having idle workers
+		 * does not necessarily mean memory pressure is gone; it may
+		 * simply mean regular workers have woken up, completed their
+		 * work, and gone idle again due to concurrency limits.
+		 *
+		 * In this case, those working workers may later sleep again,
+		 * the pool may run out of idle workers, and it will have to
+		 * allocate new ones and wait for the timer to send mayday,
+		 * causing unnecessary delay - especially if memory pressure
+		 * was never resolved throughout.
+		 *
+		 * Do more work if memory pressure is still on to reduce
+		 * relapse, using (pool->flags & POOL_MANAGER_ACTIVE), though
+		 * not precisely, unless there are other PWQs needing help.
+		 */
+		if (!(pool->flags & POOL_MANAGER_ACTIVE) ||
+		    !list_empty(&pwq->wq->maydays))
+			return false;
+	}
+
 	/* search from the start or cursor if available */
 	if (list_empty(&cursor->entry))
 		work = list_first_entry(&pool->worklist, struct work_struct, entry);
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH V5 0/3] workqueue: Factor the way to assign rescuer work
  2025-12-08 13:25 [PATCH V5 0/3] workqueue: Factor the way to assign rescuer work Lai Jiangshan
                   ` (2 preceding siblings ...)
  2025-12-08 13:25 ` [PATCH V5 3/3] workqueue: Process extra works in rescuer on memory pressure Lai Jiangshan
@ 2025-12-08 19:24 ` Tejun Heo
  3 siblings, 0 replies; 5+ messages in thread
From: Tejun Heo @ 2025-12-08 19:24 UTC (permalink / raw)
  To: Lai Jiangshan; +Cc: linux-kernel, ying chen, Lai Jiangshan

> Lai Jiangshan (3):
>   workqueue: Refactor send_mayday() to take pwq directly
>   workqueue: Process rescuer work items one-by-one using a cursor
>   workqueue: Process extra works in rescuer on memory pressure

Applied 1-3 to wq/for-6.20 with minor typo fixes.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2025-12-08 19:24 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-12-08 13:25 [PATCH V5 0/3] workqueue: Factor the way to assign rescuer work Lai Jiangshan
2025-12-08 13:25 ` [PATCH V5 1/3] workqueue: Make send_mayday() take a PWQ argument directly Lai Jiangshan
2025-12-08 13:25 ` [PATCH V5 2/3] workqueue: Process rescuer work items one-by-one using a cursor Lai Jiangshan
2025-12-08 13:25 ` [PATCH V5 3/3] workqueue: Process extra works in rescuer on memory pressure Lai Jiangshan
2025-12-08 19:24 ` [PATCH V5 0/3] workqueue: Factor the way to assign rescuer work Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox