* [RFC PATCH 1/3] workqueue: Support unbound RT workqueue by sysfs
From: Xin Zhao @ 2025-12-05 12:54 UTC
To: tj, jiangshanlai; +Cc: hch, jackzxcui1989, linux-kernel
In a system with high real-time requirements, we have noticed that many
high-priority tasks, such as kernel threads responsible for dispatching
GPU tasks and receiving source data, often experience latency spikes
because their work items are not executed in a sufficiently real-time
manner.
The existing sysfs interface can only adjust the nice value of unbound
workqueues. Add a new 'policy' node to support three common policies:
SCHED_NORMAL, SCHED_FIFO and SCHED_RR. The original 'nice' node is
retained for compatibility; add a new 'rtprio' node to adjust the
real-time priority when 'policy' is SCHED_FIFO or SCHED_RR. The value of
'rtprio' has the same numerical meaning as in the user-space tool chrt.
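For illustration, a minimal sketch of the rtprio mapping used internally
by this patch; the helper names below are hypothetical, but the
arithmetic matches the diff (MAX_RT_PRIO comes from the scheduler
headers):

/*
 * Hypothetical helpers illustrating the rtprio <-> prio mapping. A
 * chrt-style rtprio of 1..MAX_RT_PRIO-1 is stored internally as
 * MAX_RT_PRIO - rtprio, so a higher rtprio maps to a numerically
 * lower (more urgent) kernel priority.
 */
static inline int wq_rtprio_to_prio(int rtprio)
{
        return MAX_RT_PRIO - rtprio;    /* e.g. rtprio 99 -> prio 1 */
}

static inline int wq_prio_to_rtprio(int prio)
{
        return MAX_RT_PRIO - prio;      /* the mapping is its own inverse */
}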
In addition, replace the existing sscanf with kstrto*, as suggested by
checkpatch.pl.
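As a side-by-side sketch (not part of the diff): sscanf silently accepts
trailing garbage, while kstrto* rejects it, which is why checkpatch.pl
prefers the latter:

        int val;

        /* before: "42abc" parses as 42 and is accepted */
        if (sscanf(buf, "%d", &val) != 1)
                return -EINVAL;

        /* after: strict parse; any trailing garbage fails */
        if (kstrtoint(buf, 10, &val))
                return -EINVAL;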
Signed-off-by: Xin Zhao <jackzxcui1989@163.com>
---
include/linux/workqueue.h | 9 +-
kernel/workqueue.c | 185 +++++++++++++++++++++++++++++++-------
2 files changed, 162 insertions(+), 32 deletions(-)
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index dabc351cc..919e86496 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -146,9 +146,14 @@ enum wq_affn_scope {
*/
struct workqueue_attrs {
/**
- * @nice: nice level
+ * @policy: SCHED_NORMAL/SCHED_FIFO/SCHED_RR
*/
- int nice;
+ int policy;
+
+ /**
+ * @prio: static priority
+ */
+ int prio;
/**
* @cpumask: allowed CPUs
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 253311af4..e5cec7cdd 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -55,6 +55,7 @@
#include <linux/kvm_para.h>
#include <linux/delay.h>
#include <linux/irq_work.h>
+#include <uapi/linux/sched/types.h>
#include "workqueue_internal.h"
@@ -1202,7 +1203,7 @@ static bool assign_work(struct work_struct *work, struct worker *worker,
static struct irq_work *bh_pool_irq_work(struct worker_pool *pool)
{
- int high = pool->attrs->nice == HIGHPRI_NICE_LEVEL ? 1 : 0;
+ int high = PRIO_TO_NICE(pool->attrs->prio) == HIGHPRI_NICE_LEVEL ? 1 : 0;
return &per_cpu(bh_pool_irq_works, pool->cpu)[high];
}
@@ -1217,7 +1218,7 @@ static void kick_bh_pool(struct worker_pool *pool)
return;
}
#endif
- if (pool->attrs->nice == HIGHPRI_NICE_LEVEL)
+ if (PRIO_TO_NICE(pool->attrs->prio) == HIGHPRI_NICE_LEVEL)
raise_softirq_irqoff(HI_SOFTIRQ);
else
raise_softirq_irqoff(TASKLET_SOFTIRQ);
@@ -2747,7 +2748,7 @@ static int format_worker_id(char *buf, size_t size, struct worker *worker,
if (pool->cpu >= 0)
return scnprintf(buf, size, "kworker/%d:%d%s",
pool->cpu, worker->id,
- pool->attrs->nice < 0 ? "H" : "");
+ pool->attrs->prio < DEFAULT_PRIO ? "H" : "");
else
return scnprintf(buf, size, "kworker/u%d:%d",
pool->id, worker->id);
@@ -2772,6 +2773,8 @@ static struct worker *create_worker(struct worker_pool *pool)
{
struct worker *worker;
int id;
+ struct workqueue_attrs *attrs = pool->attrs;
+ struct sched_param sp;
/* ID is needed to determine kthread name */
id = ida_alloc(&pool->worker_ida, GFP_KERNEL);
@@ -2806,7 +2809,12 @@ static struct worker *create_worker(struct worker_pool *pool)
goto fail;
}
- set_user_nice(worker->task, pool->attrs->nice);
+ if (attrs->policy == SCHED_NORMAL)
+ set_user_nice(worker->task, PRIO_TO_NICE(attrs->prio));
+ else {
+ sp.sched_priority = MAX_RT_PRIO - attrs->prio;
+ sched_setscheduler_nocheck(worker->task, attrs->policy, &sp);
+ }
kthread_bind_mask(worker->task, pool_allowed_cpus(pool));
}
@@ -3676,7 +3684,7 @@ static void drain_dead_softirq_workfn(struct work_struct *work)
* don't hog this CPU's BH.
*/
if (repeat) {
- if (pool->attrs->nice == HIGHPRI_NICE_LEVEL)
+ if (PRIO_TO_NICE(pool->attrs->prio) == HIGHPRI_NICE_LEVEL)
queue_work(system_bh_highpri_wq, work);
else
queue_work(system_bh_wq, work);
@@ -3708,7 +3716,7 @@ void workqueue_softirq_dead(unsigned int cpu)
dead_work.pool = pool;
init_completion(&dead_work.done);
- if (pool->attrs->nice == HIGHPRI_NICE_LEVEL)
+ if (PRIO_TO_NICE(pool->attrs->prio) == HIGHPRI_NICE_LEVEL)
queue_work(system_bh_highpri_wq, &dead_work.work);
else
queue_work(system_bh_wq, &dead_work.work);
@@ -4683,7 +4691,8 @@ struct workqueue_attrs *alloc_workqueue_attrs_noprof(void)
static void copy_workqueue_attrs(struct workqueue_attrs *to,
const struct workqueue_attrs *from)
{
- to->nice = from->nice;
+ to->policy = from->policy;
+ to->prio = from->prio;
cpumask_copy(to->cpumask, from->cpumask);
cpumask_copy(to->__pod_cpumask, from->__pod_cpumask);
to->affn_strict = from->affn_strict;
@@ -4714,7 +4723,7 @@ static u32 wqattrs_hash(const struct workqueue_attrs *attrs)
{
u32 hash = 0;
- hash = jhash_1word(attrs->nice, hash);
+ hash = jhash_1word(attrs->prio, hash);
hash = jhash_1word(attrs->affn_strict, hash);
hash = jhash(cpumask_bits(attrs->__pod_cpumask),
BITS_TO_LONGS(nr_cpumask_bits) * sizeof(long), hash);
@@ -4728,7 +4737,9 @@ static u32 wqattrs_hash(const struct workqueue_attrs *attrs)
static bool wqattrs_equal(const struct workqueue_attrs *a,
const struct workqueue_attrs *b)
{
- if (a->nice != b->nice)
+ if (a->policy != b->policy)
+ return false;
+ if (a->prio != b->prio)
return false;
if (a->affn_strict != b->affn_strict)
return false;
@@ -6202,9 +6213,9 @@ static void pr_cont_pool_info(struct worker_pool *pool)
pr_cont(" flags=0x%x", pool->flags);
if (pool->flags & POOL_BH)
pr_cont(" bh%s",
- pool->attrs->nice == HIGHPRI_NICE_LEVEL ? "-hi" : "");
+ PRIO_TO_NICE(pool->attrs->prio) == HIGHPRI_NICE_LEVEL ? "-hi" : "");
else
- pr_cont(" nice=%d", pool->attrs->nice);
+ pr_cont(" prio=%d", pool->attrs->prio);
}
static void pr_cont_worker_id(struct worker *worker)
@@ -6213,7 +6224,7 @@ static void pr_cont_worker_id(struct worker *worker)
if (pool->flags & WQ_BH)
pr_cont("bh%s",
- pool->attrs->nice == HIGHPRI_NICE_LEVEL ? "-hi" : "");
+ PRIO_TO_NICE(pool->attrs->prio) == HIGHPRI_NICE_LEVEL ? "-hi" : "");
else
pr_cont("%d%s", task_pid_nr(worker->task),
worker->rescue_wq ? "(RESCUER)" : "");
@@ -7055,8 +7066,19 @@ module_param_cb(default_affinity_scope, &wq_affn_dfl_ops, NULL, 0644);
* max_active RW int : maximum number of in-flight work items
*
* Unbound workqueues have the following extra attributes.
- *
+ * Set the desired policy before setting nice/rtprio.
+ * When the policy changes from SCHED_NORMAL to SCHED_FIFO/SCHED_RR, rtprio
+ * defaults to 1.
+ * When the policy changes from SCHED_FIFO/SCHED_RR to SCHED_NORMAL, nice
+ * defaults to 0.
+ * When the policy changes between SCHED_FIFO and SCHED_RR, all values except
+ * policy remain the same.
+ * Reading nice returns -EINVAL when the policy is SCHED_FIFO/SCHED_RR.
+ * Reading rtprio returns -EINVAL when the policy is SCHED_NORMAL.
+ *
+ * policy RW int : SCHED_NORMAL/SCHED_FIFO/SCHED_RR
* nice RW int : nice value of the workers
+ * rtprio RW int : rtprio value of the workers
* cpumask RW mask : bitmask of allowed CPUs for the workers
* affinity_scope RW str : worker CPU affinity scope (cache, numa, none)
* affinity_strict RW bool : worker CPU affinity is strict
@@ -7097,7 +7119,7 @@ static ssize_t max_active_store(struct device *dev,
struct workqueue_struct *wq = dev_to_wq(dev);
int val;
- if (sscanf(buf, "%d", &val) != 1 || val <= 0)
+ if (kstrtoint(buf, 10, &val) || val <= 0)
return -EINVAL;
workqueue_set_max_active(wq, val);
@@ -7112,14 +7134,16 @@ static struct attribute *wq_sysfs_attrs[] = {
};
ATTRIBUTE_GROUPS(wq_sysfs);
-static ssize_t wq_nice_show(struct device *dev, struct device_attribute *attr,
- char *buf)
+static ssize_t wq_policy_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
{
struct workqueue_struct *wq = dev_to_wq(dev);
int written;
mutex_lock(&wq->mutex);
- written = scnprintf(buf, PAGE_SIZE, "%d\n", wq->unbound_attrs->nice);
+ written = scnprintf(buf, PAGE_SIZE, "%d\n",
+ wq->unbound_attrs->policy);
mutex_unlock(&wq->mutex);
return written;
@@ -7140,11 +7164,67 @@ static struct workqueue_attrs *wq_sysfs_prep_attrs(struct workqueue_struct *wq)
return attrs;
}
+static void wq_attrs_policy_change(struct workqueue_struct *wq,
+ struct workqueue_attrs *attrs)
+{
+ if (wq->unbound_attrs->policy == SCHED_NORMAL)
+ attrs->prio = MAX_RT_PRIO - 1;
+ else if (attrs->policy == SCHED_NORMAL)
+ attrs->prio = DEFAULT_PRIO;
+}
+
+static ssize_t wq_policy_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct workqueue_struct *wq = dev_to_wq(dev);
+ struct workqueue_attrs *attrs;
+ int policy, ret = -ENOMEM;
+
+ apply_wqattrs_lock();
+
+ attrs = wq_sysfs_prep_attrs(wq);
+ if (!attrs)
+ goto out_unlock;
+
+ ret = -EINVAL;
+ if (!kstrtoint(buf, 10, &policy) &&
+ policy >= SCHED_NORMAL && policy <= SCHED_RR) {
+ ret = 0;
+ if (policy != wq->unbound_attrs->policy) {
+ attrs->policy = policy;
+ wq_attrs_policy_change(wq, attrs);
+ ret = apply_workqueue_attrs_locked(wq, attrs);
+ }
+ }
+
+out_unlock:
+ apply_wqattrs_unlock();
+ free_workqueue_attrs(attrs);
+ return ret ?: count;
+}
+
+static ssize_t wq_nice_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct workqueue_struct *wq = dev_to_wq(dev);
+ int written = -EINVAL;
+
+ mutex_lock(&wq->mutex);
+ if (wq->unbound_attrs->policy == SCHED_NORMAL)
+ written = scnprintf(buf, PAGE_SIZE, "%d\n",
+ PRIO_TO_NICE(wq->unbound_attrs->prio));
+ mutex_unlock(&wq->mutex);
+
+ return written;
+}
+
static ssize_t wq_nice_store(struct device *dev, struct device_attribute *attr,
const char *buf, size_t count)
{
struct workqueue_struct *wq = dev_to_wq(dev);
struct workqueue_attrs *attrs;
+ int nice;
int ret = -ENOMEM;
apply_wqattrs_lock();
@@ -7153,11 +7233,55 @@ static ssize_t wq_nice_store(struct device *dev, struct device_attribute *attr,
if (!attrs)
goto out_unlock;
- if (sscanf(buf, "%d", &attrs->nice) == 1 &&
- attrs->nice >= MIN_NICE && attrs->nice <= MAX_NICE)
+ ret = -EINVAL;
+ if (attrs->policy == SCHED_NORMAL &&
+ !kstrtoint(buf, 10, &nice) &&
+ nice >= MIN_NICE && nice <= MAX_NICE) {
+ attrs->prio = NICE_TO_PRIO(nice);
ret = apply_workqueue_attrs_locked(wq, attrs);
- else
- ret = -EINVAL;
+ }
+
+out_unlock:
+ apply_wqattrs_unlock();
+ free_workqueue_attrs(attrs);
+ return ret ?: count;
+}
+
+static ssize_t wq_rtprio_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct workqueue_struct *wq = dev_to_wq(dev);
+ int written = -EINVAL;
+
+ mutex_lock(&wq->mutex);
+ if (wq->unbound_attrs->policy != SCHED_NORMAL)
+ written = scnprintf(buf, PAGE_SIZE, "%d\n",
+ MAX_RT_PRIO - wq->unbound_attrs->prio);
+ mutex_unlock(&wq->mutex);
+
+ return written;
+}
+
+static ssize_t wq_rtprio_store(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct workqueue_struct *wq = dev_to_wq(dev);
+ struct workqueue_attrs *attrs;
+ int rtprio, ret = -ENOMEM;
+
+ apply_wqattrs_lock();
+
+ attrs = wq_sysfs_prep_attrs(wq);
+ if (!attrs)
+ goto out_unlock;
+
+ ret = -EINVAL;
+ if (attrs->policy != SCHED_NORMAL &&
+ !kstrtoint(buf, 10, &rtprio) &&
+ rtprio > 0 && rtprio < MAX_RT_PRIO) {
+ attrs->prio = MAX_RT_PRIO - rtprio;
+ ret = apply_workqueue_attrs_locked(wq, attrs);
+ }
out_unlock:
apply_wqattrs_unlock();
@@ -7259,16 +7383,15 @@ static ssize_t wq_affinity_strict_store(struct device *dev,
{
struct workqueue_struct *wq = dev_to_wq(dev);
struct workqueue_attrs *attrs;
- int v, ret = -ENOMEM;
-
- if (sscanf(buf, "%d", &v) != 1)
- return -EINVAL;
+ int ret = -ENOMEM;
apply_wqattrs_lock();
attrs = wq_sysfs_prep_attrs(wq);
if (attrs) {
- attrs->affn_strict = (bool)v;
- ret = apply_workqueue_attrs_locked(wq, attrs);
+ if (!kstrtobool(buf, &attrs->affn_strict))
+ ret = apply_workqueue_attrs_locked(wq, attrs);
+ else
+ ret = -EINVAL;
}
apply_wqattrs_unlock();
free_workqueue_attrs(attrs);
@@ -7276,7 +7399,9 @@ static ssize_t wq_affinity_strict_store(struct device *dev,
}
static struct device_attribute wq_sysfs_unbound_attrs[] = {
+ __ATTR(policy, 0644, wq_policy_show, wq_policy_store),
__ATTR(nice, 0644, wq_nice_show, wq_nice_store),
+ __ATTR(rtprio, 0644, wq_rtprio_show, wq_rtprio_store),
__ATTR(cpumask, 0644, wq_cpumask_show, wq_cpumask_store),
__ATTR(affinity_scope, 0644, wq_affn_scope_show, wq_affn_scope_store),
__ATTR(affinity_strict, 0644, wq_affinity_strict_show, wq_affinity_strict_store),
@@ -7737,7 +7862,7 @@ static void __init init_cpu_worker_pool(struct worker_pool *pool, int cpu, int n
pool->cpu = cpu;
cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
cpumask_copy(pool->attrs->__pod_cpumask, cpumask_of(cpu));
- pool->attrs->nice = nice;
+ pool->attrs->prio = NICE_TO_PRIO(nice);
pool->attrs->affn_strict = true;
pool->node = cpu_to_node(cpu);
@@ -7829,7 +7954,7 @@ void __init workqueue_init_early(void)
struct workqueue_attrs *attrs;
BUG_ON(!(attrs = alloc_workqueue_attrs()));
- attrs->nice = std_nice[i];
+ attrs->prio = NICE_TO_PRIO(std_nice[i]);
unbound_std_wq_attrs[i] = attrs;
/*
@@ -7837,7 +7962,7 @@ void __init workqueue_init_early(void)
* guaranteed by max_active which is enforced by pwqs.
*/
BUG_ON(!(attrs = alloc_workqueue_attrs()));
- attrs->nice = std_nice[i];
+ attrs->prio = NICE_TO_PRIO(std_nice[i]);
attrs->ordered = true;
ordered_wq_attrs[i] = attrs;
}
--
2.34.1
* [RFC PATCH 2/3] workqueue: Introduce nr_idle_extra to reduce work tail latency
From: Xin Zhao @ 2025-12-05 12:54 UTC
To: tj, jiangshanlai; +Cc: hch, jackzxcui1989, linux-kernel
If a workqueue has been set as an RT workqueue, the associated work
should be executed in a more real-time manner. However, the existing
mechanism does not wake up additional kworker threads if a running,
non-sleeping kworker thread already exists, which delays work
execution. We temporarily refer to this phenomenon as 'tail latency'.
Another type of 'tail latency' occurs when pool->nr_running is 0,
meaning the currently working kworker thread is sleeping while
executing a work item, possibly due to lock contention. Although
'need_more_worker' indicates that additional kworker threads are
needed, no idle kworker threads are available. New kworker threads are
not created immediately when 'need_more_worker' is detected; they are
only created after the previously sleeping kworker threads wake up
again.
Introduce the variable 'nr_idle_extra', which allows user space to
configure an unbound workqueue through sysfs according to its real-time
requirements. By default, workqueues created by the system set
'nr_idle_extra' to 0. When the policy of a workqueue is set to
SCHED_FIFO or SCHED_RR via sysfs, 'nr_idle_extra' defaults to
WORKER_NR_RT_DEF (2).
If 'nr_idle_extra' is not 0, the system will unconditionally wake up
existing idle kworker threads to execute work immediately.
Additionally, each time a kworker thread runs through the WORKER_PREP
phase, it ensures that the number of idle kworker threads is not less
than 'nr_idle_extra', creating idle kworker threads as needed.
Furthermore, the threshold in 'too_many_workers' is raised
correspondingly so that at least 'nr_idle_extra' idle kworker threads
stay alive; a worked example of the adjusted threshold follows.
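For concreteness, a worked example of the adjusted too_many_workers()
threshold, assuming MAX_IDLE_WORKERS_RATIO is 4 (its current in-tree
value); the worker counts are illustrative:

/*
 * With nr_idle_extra == 2 the factor becomes f = 2 + 2 = 4:
 *
 *      nr_idle = 6, nr_busy = 8:
 *              6 > 4 && (6 - 4) * 4 >= 8  ->  true, trim an idle worker
 *      nr_idle = 4, nr_busy = 8:
 *              4 > 4 is false             ->  keep all 4 idle workers
 *
 * so idle trimming never reduces the pool below f idle workers,
 * preserving the nr_idle_extra reserve.
 */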
Signed-off-by: Xin Zhao <jackzxcui1989@163.com>
---
include/linux/workqueue.h | 16 ++++++++
kernel/workqueue.c | 81 ++++++++++++++++++++++++++++++++++-----
2 files changed, 88 insertions(+), 9 deletions(-)
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 919e86496..c8f40fd6f 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -97,6 +97,9 @@ enum wq_misc_consts {
/* maximum string length for set_worker_desc() */
WORKER_DESC_LEN = 32,
+
+ /* default value of nr_idle_extra when policy is SCHED_FIFO/SCHED_RR */
+ WORKER_NR_RT_DEF = 2,
};
/* Convenience constants - of type 'unsigned long', not 'enum'! */
@@ -155,6 +158,19 @@ struct workqueue_attrs {
*/
int prio;
+ /**
+ * @nr_idle_extra: number of extra idle threads reserved
+ *
+ * Default value:
+ * 0 when policy is SCHED_NORMAL.
+ * WORKER_NR_RT_DEF when policy is SCHED_FIFO/SCHED_RR.
+ *
+ * Reduces tail latency when multiple work items are enqueued in bursts.
+ * When nr_idle_extra != 0, work will be queued immediately to an idle
+ * worker.
+ */
+ int nr_idle_extra;
+
/**
* @cpumask: allowed CPUs
*
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index e5cec7cdd..d2bdde40b 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -929,10 +929,12 @@ static unsigned long work_offqd_pack_flags(struct work_offq_data *offqd)
* Note that, because unbound workers never contribute to nr_running, this
* function will always return %true for unbound pools as long as the
* worklist isn't empty.
+ * Wake up an idle worker unconditionally when nr_idle_extra != 0.
*/
static bool need_more_worker(struct worker_pool *pool)
{
- return !list_empty(&pool->worklist) && !pool->nr_running;
+ return !list_empty(&pool->worklist) &&
+ (!pool->nr_running || pool->attrs->nr_idle_extra);
}
/* Can I start working? Called from busy but !running workers. */
@@ -953,14 +955,25 @@ static bool need_to_create_worker(struct worker_pool *pool)
return need_more_worker(pool) && !may_start_working(pool);
}
+static bool need_idle_extra(struct worker_pool *pool)
+{
+ return pool->nr_idle < pool->attrs->nr_idle_extra;
+}
+
+static bool need_to_create_worker_extra(struct worker_pool *pool)
+{
+ return need_to_create_worker(pool) || need_idle_extra(pool);
+}
+
/* Do we have too many workers and should some go away? */
static bool too_many_workers(struct worker_pool *pool)
{
bool managing = pool->flags & POOL_MANAGER_ACTIVE;
int nr_idle = pool->nr_idle + managing; /* manager is considered idle */
int nr_busy = pool->nr_workers - nr_idle;
+ int f = 2 + pool->attrs->nr_idle_extra; /* factor of idle check */
- return nr_idle > 2 && (nr_idle - 2) * MAX_IDLE_WORKERS_RATIO >= nr_busy;
+ return nr_idle > f && (nr_idle - f) * MAX_IDLE_WORKERS_RATIO >= nr_busy;
}
/**
@@ -3062,12 +3075,12 @@ __acquires(&pool->lock)
mod_timer(&pool->mayday_timer, jiffies + MAYDAY_INITIAL_TIMEOUT);
while (true) {
- if (create_worker(pool) || !need_to_create_worker(pool))
+ if (create_worker(pool) || !need_to_create_worker_extra(pool))
break;
schedule_timeout_interruptible(CREATE_COOLDOWN);
- if (!need_to_create_worker(pool))
+ if (!need_to_create_worker_extra(pool))
break;
}
@@ -3078,7 +3091,7 @@ __acquires(&pool->lock)
* created as @pool->lock was dropped and the new worker might have
* already become busy.
*/
- if (need_to_create_worker(pool))
+ if (need_to_create_worker_extra(pool))
goto restart;
}
@@ -3396,6 +3409,10 @@ static int worker_thread(void *__worker)
worker_leave_idle(worker);
recheck:
+ /* reserve idle worker if nr_idle_extra != 0 */
+ if (need_idle_extra(pool))
+ manage_workers(worker);
+
/* no more worker necessary? */
if (!need_more_worker(pool))
goto sleep;
@@ -4693,6 +4710,7 @@ static void copy_workqueue_attrs(struct workqueue_attrs *to,
{
to->policy = from->policy;
to->prio = from->prio;
+ to->nr_idle_extra = from->nr_idle_extra;
cpumask_copy(to->cpumask, from->cpumask);
cpumask_copy(to->__pod_cpumask, from->__pod_cpumask);
to->affn_strict = from->affn_strict;
@@ -4741,6 +4759,8 @@ static bool wqattrs_equal(const struct workqueue_attrs *a,
return false;
if (a->prio != b->prio)
return false;
+ if (a->nr_idle_extra != b->nr_idle_extra)
+ return false;
if (a->affn_strict != b->affn_strict)
return false;
if (!cpumask_equal(a->__pod_cpumask, b->__pod_cpumask))
@@ -7068,9 +7088,9 @@ module_param_cb(default_affinity_scope, &wq_affn_dfl_ops, NULL, 0644);
* Unbound workqueues have the following extra attributes.
* Set the desired policy before setting nice/rtprio.
* When the policy changes from SCHED_NORMAL to SCHED_FIFO/SCHED_RR, rtprio
- * defaults to 1.
+ * defaults to 1 and nr_idle_extra defaults to WORKER_NR_RT_DEF.
* When the policy changes from SCHED_FIFO/SCHED_RR to SCHED_NORMAL, nice
- * defaults to 0.
+ * defaults to 0 and nr_idle_extra defaults to 0.
* When the policy changes between SCHED_FIFO and SCHED_RR, all values except
* policy remain the same.
* Reading nice returns -EINVAL when the policy is SCHED_FIFO/SCHED_RR.
@@ -7079,6 +7099,7 @@ module_param_cb(default_affinity_scope, &wq_affn_dfl_ops, NULL, 0644);
* policy RW int : SCHED_NORMAL/SCHED_FIFO/SCHED_RR
* nice RW int : nice value of the workers
* rtprio RW int : rtprio value of the workers
+ * nr_idle_extra RW int : number of extra idle threads reserved
* cpumask RW mask : bitmask of allowed CPUs for the workers
* affinity_scope RW str : worker CPU affinity scope (cache, numa, none)
* affinity_strict RW bool : worker CPU affinity is strict
@@ -7167,10 +7188,13 @@ static struct workqueue_attrs *wq_sysfs_prep_attrs(struct workqueue_struct *wq)
static void wq_attrs_policy_change(struct workqueue_struct *wq,
struct workqueue_attrs *attrs)
{
- if (wq->unbound_attrs->policy == SCHED_NORMAL)
+ if (wq->unbound_attrs->policy == SCHED_NORMAL) {
attrs->prio = MAX_RT_PRIO - 1;
- else if (attrs->policy == SCHED_NORMAL)
+ attrs->nr_idle_extra = WORKER_NR_RT_DEF;
+ } else if (attrs->policy == SCHED_NORMAL) {
attrs->prio = DEFAULT_PRIO;
+ attrs->nr_idle_extra = 0;
+ }
}
static ssize_t wq_policy_store(struct device *dev,
@@ -7289,6 +7313,44 @@ static ssize_t wq_rtprio_store(struct device *dev, struct device_attribute *attr
return ret ?: count;
}
+static ssize_t wq_idle_extra_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct workqueue_struct *wq = dev_to_wq(dev);
+ int written;
+
+ mutex_lock(&wq->mutex);
+ written = scnprintf(buf, PAGE_SIZE, "%d\n", wq->unbound_attrs->nr_idle_extra);
+ mutex_unlock(&wq->mutex);
+
+ return written;
+}
+
+static ssize_t wq_idle_extra_store(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct workqueue_struct *wq = dev_to_wq(dev);
+ struct workqueue_attrs *attrs;
+ int nr_idle_extra, ret = -ENOMEM;
+
+ apply_wqattrs_lock();
+
+ attrs = wq_sysfs_prep_attrs(wq);
+ if (!attrs)
+ goto out_unlock;
+
+ ret = -EINVAL;
+ if (!kstrtoint(buf, 10, &nr_idle_extra) && nr_idle_extra >= 0) {
+ attrs->nr_idle_extra = nr_idle_extra;
+ ret = apply_workqueue_attrs_locked(wq, attrs);
+ }
+
+out_unlock:
+ apply_wqattrs_unlock();
+ free_workqueue_attrs(attrs);
+ return ret ?: count;
+}
+
static ssize_t wq_cpumask_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
@@ -7402,6 +7464,7 @@ static struct device_attribute wq_sysfs_unbound_attrs[] = {
__ATTR(policy, 0644, wq_policy_show, wq_policy_store),
__ATTR(nice, 0644, wq_nice_show, wq_nice_store),
__ATTR(rtprio, 0644, wq_rtprio_show, wq_rtprio_store),
+ __ATTR(nr_idle_extra, 0644, wq_idle_extra_show, wq_idle_extra_store),
__ATTR(cpumask, 0644, wq_cpumask_show, wq_cpumask_store),
__ATTR(affinity_scope, 0644, wq_affn_scope_show, wq_affn_scope_store),
__ATTR(affinity_strict, 0644, wq_affinity_strict_show, wq_affinity_strict_store),
--
2.34.1
* [RFC PATCH 3/3] workqueue: Support private workqueue by sysfs
From: Xin Zhao @ 2025-12-05 12:54 UTC
To: tj, jiangshanlai; +Cc: hch, jackzxcui1989, linux-kernel
Globally, unbound workqueues with the same attributes share one worker
pool. Directly changing the scheduling attributes of a specific kworker
thread with tools like 'chrt' or 'taskset' may therefore affect other
work that runs on the same worker_pool. During a discussion with Tejun
about whether to rewrite the code using kthread_work or continue using
kworkers, Tejun pointed out that it is possible to let a workqueue
become "private" via sysfs. In this way, the scheduling attributes of
the kworker threads associated with this private workqueue can be set
individually.
However, simply adding a 'private' node does not address all
situations, as kworker threads are created and destroyed dynamically.
Within this patch series, the 'private' support is still meaningful.
While the 'nr_idle_extra' attribute added in the previous patch can
increase the number of reserved idle kworkers, there may still be a
significant number of workqueues with the same attributes globally. If,
at some moment, a large number of concurrent work items enter the same
worker pool, this can still cause 'tail latency', a concept described
in the previous patch of this series. Increasing nr_idle_extra may help
alleviate the delays caused by such a sudden influx, but
indiscriminately setting nr_idle_extra too high wastes thread
resources.
Supporting the private configuration aims to deterministically ensure
that work items within one workqueue are not affected by work from
other workqueues with the same attributes. Users with high real-time
requirements can increase the nr_idle_extra added in the previous patch
while also marking the workqueue 'private', allowing it to use kworker
threads independently and thus avoiding scheduling-related work delays.
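As a usage sketch: assuming a WQ_SYSFS workqueue named "mywq" (a
placeholder name), user space could combine the nodes from this series
as below; the paths follow the standard /sys/bus/workqueue/devices/
layout:

#include <stdio.h>

/* Write one value to a workqueue sysfs node; returns 0 on success. */
static int wq_sysfs_write(const char *wq, const char *node, const char *val)
{
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path), "/sys/bus/workqueue/devices/%s/%s",
                 wq, node);
        f = fopen(path, "w");
        if (!f)
                return -1;
        if (fputs(val, f) == EOF) {
                fclose(f);
                return -1;
        }
        return fclose(f);
}

int main(void)
{
        wq_sysfs_write("mywq", "private", "1");       /* dedicated pool */
        wq_sysfs_write("mywq", "policy", "1");        /* SCHED_FIFO */
        wq_sysfs_write("mywq", "rtprio", "50");       /* chrt-style prio */
        wq_sysfs_write("mywq", "nr_idle_extra", "4"); /* reserve idle workers */
        return 0;
}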
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Xin Zhao <jackzxcui1989@163.com>
---
include/linux/workqueue.h | 7 +++++++
kernel/workqueue.c | 37 +++++++++++++++++++++++++++++++++++++
2 files changed, 44 insertions(+)
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index c8f40fd6f..faa554384 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -171,6 +171,13 @@ struct workqueue_attrs {
*/
int nr_idle_extra;
+ /**
+ * @private: whether to use an individual worker_pool
+ *
+ * true means the pool is not shared with others even if the attributes
+ * are the same
+ */
+ bool private;
+
/**
* @cpumask: allowed CPUs
*
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index d2bdde40b..bd0a1c1ff 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4711,6 +4711,7 @@ static void copy_workqueue_attrs(struct workqueue_attrs *to,
to->policy = from->policy;
to->prio = from->prio;
to->nr_idle_extra = from->nr_idle_extra;
+ to->private = from->private;
cpumask_copy(to->cpumask, from->cpumask);
cpumask_copy(to->__pod_cpumask, from->__pod_cpumask);
to->affn_strict = from->affn_strict;
@@ -4761,6 +4762,8 @@ static bool wqattrs_equal(const struct workqueue_attrs *a,
return false;
if (a->nr_idle_extra != b->nr_idle_extra)
return false;
+ if (a->private || b->private)
+ return false;
if (a->affn_strict != b->affn_strict)
return false;
if (!cpumask_equal(a->__pod_cpumask, b->__pod_cpumask))
@@ -7100,6 +7103,7 @@ module_param_cb(default_affinity_scope, &wq_affn_dfl_ops, NULL, 0644);
* nice RW int : nice value of the workers
* rtprio RW int : rtprio value of the workers
* nr_idle_extra RW int : number of extra idle threads reserved
+ * private RW bool : use a dedicated worker pool, not shared with others
* cpumask RW mask : bitmask of allowed CPUs for the workers
* affinity_scope RW str : worker CPU affinity scope (cache, numa, none)
* affinity_strict RW bool : worker CPU affinity is strict
@@ -7351,6 +7355,38 @@ static ssize_t wq_idle_extra_store(struct device *dev, struct device_attribute *
return ret ?: count;
}
+static ssize_t wq_private_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct workqueue_struct *wq = dev_to_wq(dev);
+
+ return scnprintf(buf, PAGE_SIZE, "%d\n",
+ wq->unbound_attrs->private);
+}
+
+static ssize_t wq_private_store(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct workqueue_struct *wq = dev_to_wq(dev);
+ struct workqueue_attrs *attrs;
+ int ret = -ENOMEM;
+
+ apply_wqattrs_lock();
+
+ attrs = wq_sysfs_prep_attrs(wq);
+ if (!attrs)
+ goto out_unlock;
+
+ ret = -EINVAL;
+ if (!kstrtobool(buf, &attrs->private))
+ ret = apply_workqueue_attrs_locked(wq, attrs);
+
+out_unlock:
+ apply_wqattrs_unlock();
+ free_workqueue_attrs(attrs);
+ return ret ?: count;
+}
+
static ssize_t wq_cpumask_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
@@ -7465,6 +7501,7 @@ static struct device_attribute wq_sysfs_unbound_attrs[] = {
__ATTR(nice, 0644, wq_nice_show, wq_nice_store),
__ATTR(rtprio, 0644, wq_rtprio_show, wq_rtprio_store),
__ATTR(nr_idle_extra, 0644, wq_idle_extra_show, wq_idle_extra_store),
+ __ATTR(private, 0644, wq_private_show, wq_private_store),
__ATTR(cpumask, 0644, wq_cpumask_show, wq_cpumask_store),
__ATTR(affinity_scope, 0644, wq_affn_scope_show, wq_affn_scope_store),
__ATTR(affinity_strict, 0644, wq_affinity_strict_show, wq_affinity_strict_store),
--
2.34.1
* Re: [RFC PATCH 0/3] workqueue: Add configure to reduce work latency
From: Tejun Heo @ 2025-12-05 17:47 UTC
To: Xin Zhao; +Cc: jiangshanlai, hch, linux-kernel
Hello,
On Fri, Dec 05, 2025 at 08:54:42PM +0800, Xin Zhao wrote:
> [...]
I don't think I'm applying this:
- The rationale is too vague. What are you exactly running and observing?
How does this improve the situation?
- If wq supports private pools, then I don't think it makes sense to add wq
interface to change their attributes. Once turned private, the worker
threads are fixed and userspace can set whatever attributes they want to
set, no?
Thanks.
--
tejun
* Re: [RFC PATCH 0/3] workqueue: Add configure to reduce work latency
From: Xin Zhao @ 2025-12-06 4:33 UTC
To: tj; +Cc: hch, jackzxcui1989, jiangshanlai, linux-kernel
On Fri, 5 Dec 2025 07:47:40 -1000 Tejun Heo <tj@kernel.org> wrote:
> On Fri, Dec 05, 2025 at 08:54:42PM +0800, Xin Zhao wrote:
> > [...]
>
> I don't think I'm applying this:
>
> - The rationale is too vague. What are you exactly running and observing?
> How does this improve the situation?
>
> - If wq supports private pools, then I don't think it makes sense to add wq
> interface to change their attributes. Once turned private, the worker
> threads are fixed and userspace can set whatever attributes they want to
> set, no?
Our system runs intelligent driving workloads, which have explicit and
stringent real-time requirements. This is why I developed this set of
patches:
1. Data acquisition quality inspection relies on deterministic
processing of UART IMU data, whose latency must not exceed a specified
range; otherwise some topic data is observed with higher latency. As
you know, I have already proposed a patch for the TTY flip buffer to
improve this situation. Recent tests show that even with the workqueue
attributes set to a nice value of -20, after long-term operation 2% of
the entries in the quality inspection total still show anomalies.
2. GPU model processing must sustain at least 20 frames per second, so
the time allocated for dispatching and running GPU tasks is within
10ms. Excluding the execution time of the GPU itself, the remaining
budget from dispatching a work item to its actual execution is only
about 1ms. Although there are not many tasks with high real-time
requirements on the system, there are some. With common CFS kworkers,
captured perfetto traces show that, due to untimely scheduling of
kworker/u37, the kernel submit cost from dispatching a work item to its
actual execution often exceeds 20ms.
3. The workqueue API is the most commonly used programming interface
for task processing in kernel drivers; the GPU driver and TTY driver
where we encounter these issues use it as well. Switching to
kthread_work would require adjusting the driver logic and retesting,
while adding functionality to the existing workqueue API only requires
testing the workqueue itself. Additionally, the workqueue API is
sophisticated, and its pooling management saves system resources. I
believe that providing real-time capabilities on top of the current
workqueue API is the better choice.
Assuming we provide real-time capabilities based on the workqueue API,
let's discuss how to implement this:
1. Regarding your point that the wq interface is no longer needed once
wq supports private pools: kworker threads are created and released
dynamically, and concurrently enqueuing multiple work items in a
workqueue may lead to new thread creation. After a user sets the
scheduling attributes of the kworker threads belonging to a private wq,
newly created kworker threads will not automatically inherit those
attributes, because their parent process is kthreadd. That is why the
attributes have to be applied at worker creation time, as sketched
below.
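Here is a simplified excerpt of what patch 1 does in create_worker()
(condensed, not a verbatim quote of the diff), which is why dynamically
created workers stay consistent with the sysfs configuration:

        /*
         * Applied to every newly created kworker, so workers forked
         * later by kthreadd cannot diverge from what user space
         * configured via sysfs.
         */
        if (attrs->policy == SCHED_NORMAL) {
                set_user_nice(worker->task, PRIO_TO_NICE(attrs->prio));
        } else {
                struct sched_param sp = {
                        .sched_priority = MAX_RT_PRIO - attrs->prio,
                };

                sched_setscheduler_nocheck(worker->task, attrs->policy, &sp);
        }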
2. In the commit log of the nr_idle_extra patch, I described two common
types of latency (see the sketch below):
Type 1: need_more_worker checks whether pool->nr_running is zero; if it
is not zero, no idle kworker thread is woken up to execute immediately,
which delays work execution.
Type 2: need_more_worker has already observed that pool->nr_running is
zero, but no idle kworker threads currently exist, which also delays
work execution.
The nr_idle_extra feature lets users optionally reduce this execution
latency according to their real-time requirements.
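A sketch of the change, taken from patch 2's diff; the _old/_new
suffixes are mine, added for contrast:

/* Stock check: wakes an idle worker only when nothing is running,
 * which produces the Type 1 latency above. */
static bool need_more_worker_old(struct worker_pool *pool)
{
        return !list_empty(&pool->worklist) && !pool->nr_running;
}

/* Patched check: with nr_idle_extra != 0, pending work always wakes an
 * idle worker; the reserved idle workers then cover the Type 2 case. */
static bool need_more_worker_new(struct worker_pool *pool)
{
        return !list_empty(&pool->worklist) &&
               (!pool->nr_running || pool->attrs->nr_idle_extra);
}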
As for the testing results of this patch set, I enabled it this week
and am running performance and stability tests. I will share the
results once testing is complete.
--
Xin Zhao
* Re: [RFC PATCH 0/3] workqueue: Add configure to reduce work latency
From: Xin Zhao @ 2025-12-28 16:49 UTC
To: jackzxcui1989, tj; +Cc: hch, jiangshanlai, linux-kernel
Dear Tejun,
On Sat, 6 Dec 2025 12:33:57 +0800 Xin Zhao <jackzxcui1989@163.com> wrote:
> [...]
My RT workqueue patch has been running stably on our autonomous driving
system for over two weeks. Previously, the data collection scenario
suffered 2% data loss due to kworker scheduling delays, which showed up
as duplicated IMU timestamps. After applying this patch, setting the
scheduling policy of the tty flip buffer workqueue to FIFO with
priority 20, and binding it to the relevant CPUs (CPU0 and CPU1), we no
longer see any duplicated IMU timestamps. In another, GPU-related
scenario, this patch reduced the kernel submit cost from spikes of up
to 40ms to a stable 7ms, a significant benefit.
I have enabled this patch in our project, and I believe this is a
fairly common problem; other scenarios may require similar user-space
configuration of kworkers. The kernel I am using is RT-Linux 6.1.
Additionally, I saw your earlier commit
636b927eba5bc633753f8eb80f35e1d5be806e51 ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") and found another
potentially common issue: once that commit takes effect, controlling
max_active across the whole workqueue requires the
alloc_ordered_workqueue macro. However, once alloc_ordered_workqueue is
used, sysfs settings can no longer be applied.
In our 6.1 kernel I could still work around this by reverting from the
alloc_ordered_workqueue macro to alloc_workqueue with WQ_UNBOUND and
max_active set to 1 on our non-NUMA system, keeping the sysfs settings
usable. However, in newer kernel versions an unbound workqueue that is
not created with alloc_ordered_workqueue enforces max_active per CPU by
default, which differs from the traditional per-node understanding of
max_active. In other words, on the latest kernels I cannot set the CPU
binding attributes of a system-wide ordered workqueue through sysfs. We
feel that allowing ordered workqueues to register sysfs nodes and
adjust the related attributes would be meaningful; the two creation
styles in question are sketched below.
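To make the contrast concrete, the two creation styles in question;
"my_wq" is a placeholder name, and WQ_SYSFS is assumed so the sysfs
nodes exist:

        /* Ordered: strict one-at-a-time execution system-wide, but
         * currently no sysfs attributes can be applied to it. */
        struct workqueue_struct *ordered_wq =
                alloc_ordered_workqueue("my_wq", 0);

        /* Unbound with max_active = 1: tunable via sysfs, but since
         * commit 636b927eba5b max_active is enforced per CPU, so this
         * no longer gives system-wide ordering. */
        struct workqueue_struct *unbound_wq =
                alloc_workqueue("my_wq", WQ_UNBOUND | WQ_SYSFS, 1);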
Furthermore, many comments in the current code still describe an
unbound workqueue's max_active as applying to the whole system. I
believe these comments also need updating, such as the description of
max_active for alloc_workqueue in workqueue.h.
--
Xin Zhao