* [PATCH] workqueue: Add WQ_SCHED_FIFO
@ 2023-01-13 21:07 Nathan Huckleberry
2023-01-13 21:11 ` Tejun Heo
2023-01-14 2:19 ` Gao Xiang
From: Nathan Huckleberry @ 2023-01-13 21:07 UTC (permalink / raw)
Cc: Nathan Huckleberry, Sandeep Dhavale, Daeho Jeong, Eric Biggers,
Sami Tolvanen, Tejun Heo, Lai Jiangshan, Jonathan Corbet,
linux-doc, linux-kernel
Add a WQ flag that allows workqueues to use SCHED_FIFO with the least
important RT priority. This can reduce scheduler latency for IO
post-processing when the CPU is under load without impacting other RT
workloads. This has been shown to improve app startup time on Android
[1].
Scheduler latency affects several drivers as evidenced by [1], [2], [3],
[4]. Some of these drivers have moved post-processing into IRQ context.
However, this can cause latency spikes for real-time threads and
jitter-related jank on Android. Using a workqueue with SCHED_FIFO improves
scheduler latency without causing latency problems for RT threads.
[1]:
https://lore.kernel.org/linux-erofs/20230106073502.4017276-1-dhavale@google.com/
[2]:
https://lore.kernel.org/linux-f2fs-devel/20220802192437.1895492-1-daeho43@gmail.com/
[3]:
https://lore.kernel.org/dm-devel/20220722093823.4158756-4-nhuck@google.com/
[4]:
https://lore.kernel.org/dm-crypt/20200706173731.3734-1-ignat@cloudflare.com/
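For illustration, a driver opting in would do so at allocation time,
roughly as follows (a sketch, not part of this patch; the workqueue
name and max_active value are made up):

	/* per-CPU wq whose workers run SCHED_FIFO at the lowest RT priority */
	wq = alloc_workqueue("io_postproc", WQ_SCHED_FIFO, 0);
	if (!wq)
		return -ENOMEM;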
This change has been tested on dm-verity with the following fio config:
[global]
time_based
runtime=120
[do-verify]
ioengine=sync
filename=/dev/testing
rw=randread
direct=1
[burn_8x90%_qsort]
ioengine=cpuio
cpuload=90
numjobs=8
cpumode=qsort
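For reference, the job file above can be saved under any name, e.g.
verity.fio, and run with:

	fio verity.fio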
Before:
clat (usec): min=13, max=23882, avg=29.56, stdev=113.29
READ: bw=122MiB/s (128MB/s), 122MiB/s-122MiB/s (128MB/s-128MB/s),
io=14.3GiB (15.3GB), run=120001-120001msec

After:
clat (usec): min=13, max=23137, avg=19.96, stdev=105.71
READ: bw=180MiB/s (189MB/s), 180MiB/s-180MiB/s (189MB/s-189MB/s),
io=21.1GiB (22.7GB), run=120012-120012msec
Cc: Sandeep Dhavale <dhavale@google.com>
Cc: Daeho Jeong <daehojeong@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Nathan Huckleberry <nhuck@google.com>
---
Documentation/core-api/workqueue.rst | 12 ++++++++++
include/linux/workqueue.h | 9 +++++++
kernel/workqueue.c | 36 +++++++++++++++++++++-------
3 files changed, 48 insertions(+), 9 deletions(-)
diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
index 3b22ed137662..26faf2806c66 100644
--- a/Documentation/core-api/workqueue.rst
+++ b/Documentation/core-api/workqueue.rst
@@ -216,6 +216,18 @@ resources, scheduled and executed.
This flag is meaningless for unbound wq.
+``WQ_SCHED_FIFO``
+ Work items of a fifo wq are queued to the fifo
+ worker-pool of the target cpu. Fifo worker-pools are
+ served by worker threads with scheduler policy SCHED_FIFO and
+ the least important real-time priority. This can be useful
+ for workloads where low latency is important.
+
+ A workqueue cannot be both high-priority and fifo.
+
+ Note that normal and fifo worker-pools don't interact with
+ each other. Each maintains its separate pool of workers and
+ implements concurrency management among its workers.
``max_active``
--------------
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index ac551b8ee7d9..43a4eeaf8ff4 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -134,6 +134,10 @@ struct workqueue_attrs {
* @nice: nice level
*/
int nice;
+ /**
+ * @sched_fifo: is using SCHED_FIFO
+ */
+ bool sched_fifo;
/**
* @cpumask: allowed CPUs
@@ -334,6 +338,11 @@ enum {
* http://thread.gmane.org/gmane.linux.kernel/1480396
*/
WQ_POWER_EFFICIENT = 1 << 7,
+ /*
+ * Low real-time priority workqueues can reduce scheduler latency
+ * for latency sensitive workloads like IO post-processing.
+ */
+ WQ_SCHED_FIFO = 1 << 8,
__WQ_DESTROYING = 1 << 15, /* internal: workqueue is destroying */
__WQ_DRAINING = 1 << 16, /* internal: workqueue is draining */
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 5dc67aa9d696..99c5e0a3dc28 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -85,7 +85,7 @@ enum {
WORKER_NOT_RUNNING = WORKER_PREP | WORKER_CPU_INTENSIVE |
WORKER_UNBOUND | WORKER_REBOUND,
- NR_STD_WORKER_POOLS = 2, /* # standard pools per cpu */
+ NR_STD_WORKER_POOLS = 3, /* # standard pools per cpu */
UNBOUND_POOL_HASH_ORDER = 6, /* hashed by pool->attrs */
BUSY_WORKER_HASH_ORDER = 6, /* 64 pointers */
@@ -1949,7 +1949,8 @@ static struct worker *create_worker(struct worker_pool *pool)
if (pool->cpu >= 0)
snprintf(id_buf, sizeof(id_buf), "%d:%d%s", pool->cpu, id,
- pool->attrs->nice < 0 ? "H" : "");
+ pool->attrs->sched_fifo ? "F" :
+ (pool->attrs->nice < 0 ? "H" : ""));
else
snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);
@@ -1958,7 +1959,11 @@ static struct worker *create_worker(struct worker_pool *pool)
if (IS_ERR(worker->task))
goto fail;
- set_user_nice(worker->task, pool->attrs->nice);
+ if (pool->attrs->sched_fifo)
+ sched_set_fifo_low(worker->task);
+ else
+ set_user_nice(worker->task, pool->attrs->nice);
+
kthread_bind_mask(worker->task, pool->attrs->cpumask);
/* successful, attach the worker to the pool */
@@ -4323,9 +4328,17 @@ static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,
static int alloc_and_link_pwqs(struct workqueue_struct *wq)
{
- bool highpri = wq->flags & WQ_HIGHPRI;
+ int pool_index = 0;
int cpu, ret;
+ if (wq->flags & WQ_HIGHPRI && wq->flags & WQ_SCHED_FIFO)
+ return -EINVAL;
+
+ if (wq->flags & WQ_HIGHPRI)
+ pool_index = 1;
+ if (wq->flags & WQ_SCHED_FIFO)
+ pool_index = 2;
+
if (!(wq->flags & WQ_UNBOUND)) {
wq->cpu_pwqs = alloc_percpu(struct pool_workqueue);
if (!wq->cpu_pwqs)
@@ -4337,7 +4350,7 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
struct worker_pool *cpu_pools =
per_cpu(cpu_worker_pools, cpu);
- init_pwq(pwq, wq, &cpu_pools[highpri]);
+ init_pwq(pwq, wq, &cpu_pools[pool_index]);
mutex_lock(&wq->mutex);
link_pwq(pwq);
@@ -4348,13 +4361,13 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
cpus_read_lock();
if (wq->flags & __WQ_ORDERED) {
- ret = apply_workqueue_attrs(wq, ordered_wq_attrs[highpri]);
+ ret = apply_workqueue_attrs(wq, ordered_wq_attrs[pool_index]);
/* there should only be single pwq for ordering guarantee */
WARN(!ret && (wq->pwqs.next != &wq->dfl_pwq->pwqs_node ||
wq->pwqs.prev != &wq->dfl_pwq->pwqs_node),
"ordering guarantee broken for workqueue %s\n", wq->name);
} else {
- ret = apply_workqueue_attrs(wq, unbound_std_wq_attrs[highpri]);
+ ret = apply_workqueue_attrs(wq, unbound_std_wq_attrs[pool_index]);
}
cpus_read_unlock();
@@ -6138,7 +6151,8 @@ static void __init wq_numa_init(void)
*/
void __init workqueue_init_early(void)
{
- int std_nice[NR_STD_WORKER_POOLS] = { 0, HIGHPRI_NICE_LEVEL };
+ int std_nice[NR_STD_WORKER_POOLS] = { 0, HIGHPRI_NICE_LEVEL, 0 };
+ bool std_sched_fifo[NR_STD_WORKER_POOLS] = { false, false, true };
int i, cpu;
BUILD_BUG_ON(__alignof__(struct pool_workqueue) < __alignof__(long long));
@@ -6158,8 +6172,10 @@ void __init workqueue_init_early(void)
BUG_ON(init_worker_pool(pool));
pool->cpu = cpu;
cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
- pool->attrs->nice = std_nice[i++];
+ pool->attrs->nice = std_nice[i];
+ pool->attrs->sched_fifo = std_sched_fifo[i];
pool->node = cpu_to_node(cpu);
+ i++;
/* alloc pool ID */
mutex_lock(&wq_pool_mutex);
@@ -6174,6 +6190,7 @@ void __init workqueue_init_early(void)
BUG_ON(!(attrs = alloc_workqueue_attrs()));
attrs->nice = std_nice[i];
+ attrs->sched_fifo = std_sched_fifo[i];
unbound_std_wq_attrs[i] = attrs;
/*
@@ -6183,6 +6200,7 @@ void __init workqueue_init_early(void)
*/
BUG_ON(!(attrs = alloc_workqueue_attrs()));
attrs->nice = std_nice[i];
+ attrs->sched_fifo = std_sched_fifo[i];
attrs->no_numa = true;
ordered_wq_attrs[i] = attrs;
}
--
2.39.0.314.g84b9a713c41-goog
* Re: [PATCH] workqueue: Add WQ_SCHED_FIFO
2023-01-13 21:07 [PATCH] workqueue: Add WQ_SCHED_FIFO Nathan Huckleberry
@ 2023-01-13 21:11 ` Tejun Heo
2023-01-14 21:00 ` Nathan Huckleberry
2023-01-19 2:01 ` Nathan Huckleberry
2023-01-14 2:19 ` Gao Xiang
From: Tejun Heo @ 2023-01-13 21:11 UTC (permalink / raw)
To: Nathan Huckleberry
Cc: Sandeep Dhavale, Daeho Jeong, Eric Biggers, Sami Tolvanen,
Lai Jiangshan, Jonathan Corbet, linux-doc, linux-kernel
On Fri, Jan 13, 2023 at 01:07:02PM -0800, Nathan Huckleberry wrote:
> Add a WQ flag that allows workqueues to use SCHED_FIFO with the least
> important RT priority. This can reduce scheduler latency for IO
> post-processing when the CPU is under load without impacting other RT
> workloads. This has been shown to improve app startup time on Android
> [1].
>
> Scheduler latency affects several drivers as evidenced by [1], [2], [3],
> [4]. Some of these drivers have moved post-processing into IRQ context.
> However, this can cause latency spikes for real-time threads and
> jitter-related jank on Android. Using a workqueue with SCHED_FIFO improves
> scheduler latency without causing latency problems for RT threads.
>
> [1]:
> https://lore.kernel.org/linux-erofs/20230106073502.4017276-1-dhavale@google.com/
> [2]:
> https://lore.kernel.org/linux-f2fs-devel/20220802192437.1895492-1-daeho43@gmail.com/
> [3]:
> https://lore.kernel.org/dm-devel/20220722093823.4158756-4-nhuck@google.com/
> [4]:
> https://lore.kernel.org/dm-crypt/20200706173731.3734-1-ignat@cloudflare.com/
>
> This change has been tested on dm-verity with the following fio config:
>
> [global]
> time_based
> runtime=120
>
> [do-verify]
> ioengine=sync
> filename=/dev/testing
> rw=randread
> direct=1
>
> [burn_8x90%_qsort]
> ioengine=cpuio
> cpuload=90
> numjobs=8
> cpumode=qsort
>
> Before:
> clat (usec): min=13, max=23882, avg=29.56, stdev=113.29
> READ: bw=122MiB/s (128MB/s), 122MiB/s-122MiB/s (128MB/s-128MB/s),
> io=14.3GiB (15.3GB), run=120001-120001msec
>
> After:
> clat (usec): min=13, max=23137, avg=19.96, stdev=105.71
> READ: bw=180MiB/s (189MB/s), 180MiB/s-180MiB/s (189MB/s-189MB/s),
> io=21.1GiB (22.7GB), run=120012-120012msec
Given that its use case mostly intersects with WQ_HIGHPRI, would it make
more sense to add a switch to alter its behavior instead? I don't really
like the idea of pushing the decision between WQ_HIGHPRI and WQ_SCHED_FIFO
to each user.
Thanks.
--
tejun
* Re: [PATCH] workqueue: Add WQ_SCHED_FIFO
2023-01-13 21:07 [PATCH] workqueue: Add WQ_SCHED_FIFO Nathan Huckleberry
2023-01-13 21:11 ` Tejun Heo
@ 2023-01-14 2:19 ` Gao Xiang
2023-01-14 21:00 ` Nathan Huckleberry
From: Gao Xiang @ 2023-01-14 2:19 UTC (permalink / raw)
To: Nathan Huckleberry, Lai Jiangshan, Tejun Heo
Cc: Sandeep Dhavale, Daeho Jeong, Eric Biggers, Sami Tolvanen,
Jonathan Corbet, linux-doc, linux-kernel, linux-erofs
Hi Nathan!
On 2023/1/14 05:07, Nathan Huckleberry wrote:
> Add a WQ flag that allows workqueues to use SCHED_FIFO with the least
> important RT priority. This can reduce scheduler latency for IO
> post-processing when the CPU is under load without impacting other RT
> workloads. This has been shown to improve app startup time on Android
> [1].
Thank you all for your effort on this. Unfortunately I have had no time
to set up the test [1] until now. If it can be addressed as a new
workqueue feature, that would be very helpful to me. Otherwise, I still
need to find a way to resolve the latest Android + EROFS latency problem.
>
> Scheduler latency affects several drivers as evidenced by [1], [2], [3],
> [4]. Some of these drivers have moved post-processing into IRQ context.
> However, this can cause latency spikes for real-time threads and
> jitter-related jank on Android. Using a workqueue with SCHED_FIFO improves
> scheduler latency without causing latency problems for RT threads.
softirq context is mainly meant for post-interrupt handling, I think,
but decompression/verification/decryption workloads are all much more
complex than that and less important than real post-interrupt handling.
I don't think softirq context is the best place to handle these
CPU-intensive jobs. Besides, in extreme cases it could unexpectedly push
important work off to ksoftirqd. Also, many of these post-processing
jobs are complex enough that they may sleep, so softirq context is
unsuitable as well.
Anyway, I second this proposal if possible:
Acked-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Thanks,
Gao Xiang
* Re: [PATCH] workqueue: Add WQ_SCHED_FIFO
2023-01-13 21:11 ` Tejun Heo
@ 2023-01-14 21:00 ` Nathan Huckleberry
2023-01-18 17:51 ` Tejun Heo
2023-01-19 2:01 ` Nathan Huckleberry
From: Nathan Huckleberry @ 2023-01-14 21:00 UTC (permalink / raw)
To: Tejun Heo
Cc: Sandeep Dhavale, Daeho Jeong, Eric Biggers, Sami Tolvanen,
Lai Jiangshan, Jonathan Corbet, linux-doc, linux-kernel
On Fri, Jan 13, 2023 at 1:11 PM Tejun Heo <tj@kernel.org> wrote:
>
> Given that its use case mostly intersects with WQ_HIGHPRI, would it make
> more sense to add a switch to alter its behavior instead? I don't really
> like the idea of pushing the decision between WQ_HIGHPRI and WQ_SCHED_FIFO
> to each user.
This sounds fine to me. How do you feel about a config flag to change
the default WQ_HIGHPRI scheduler policy and a sysfs node to update the
policy per workqueue?
Thanks,
Huck
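To sketch what I mean (a rough draft; the option name is made up):

	config WQ_HIGHPRI_FIFO
		bool "Serve WQ_HIGHPRI workqueues with SCHED_FIFO workers"
		default n
		help
		  If set, worker threads backing WQ_HIGHPRI workqueues run
		  with the SCHED_FIFO policy at the lowest real-time
		  priority instead of an elevated nice level.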
* Re: [PATCH] workqueue: Add WQ_SCHED_FIFO
2023-01-14 2:19 ` Gao Xiang
@ 2023-01-14 21:00 ` Nathan Huckleberry
2023-01-19 2:41 ` Sandeep Dhavale
From: Nathan Huckleberry @ 2023-01-14 21:00 UTC (permalink / raw)
To: Gao Xiang
Cc: Lai Jiangshan, Tejun Heo, Sandeep Dhavale, Daeho Jeong,
Eric Biggers, Sami Tolvanen, Jonathan Corbet, linux-doc,
linux-kernel, linux-erofs
On Fri, Jan 13, 2023 at 6:20 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
> Hi Nathan!
>
> On 2023/1/14 05:07, Nathan Huckleberry wrote:
> > Add a WQ flag that allows workqueues to use SCHED_FIFO with the least
> > important RT priority. This can reduce scheduler latency for IO
> > post-processing when the CPU is under load without impacting other RT
> > workloads. This has been shown to improve app startup time on Android
> > [1].
>
> Thank you all for your effort on this. Unfortunately I have had no time
> to set up the test [1] until now. If it can be addressed as a new
> workqueue feature, that would be very helpful to me. Otherwise, I still
> need to find a way to resolve the latest Android + EROFS latency problem.
>
The above patch and following diff should have equivalent performance
to [1], but I have not tested it.
diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index ccf7c55d477f..a9c3893ad1d4 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -201,7 +201,7 @@ static inline int z_erofs_init_workqueue(void)
* scheduling overhead, perhaps per-CPU threads should be better?
*/
z_erofs_workqueue = alloc_workqueue("erofs_unzipd",
- WQ_UNBOUND | WQ_HIGHPRI,
+ WQ_SCHED_FIFO,
onlinecpus + onlinecpus / 4);
return z_erofs_workqueue ? 0 : -ENOMEM;
}

Thanks,
Huck
* Re: [PATCH] workqueue: Add WQ_SCHED_FIFO
2023-01-14 21:00 ` Nathan Huckleberry
@ 2023-01-18 17:51 ` Tejun Heo
2023-01-18 18:22 ` Sandeep Dhavale
From: Tejun Heo @ 2023-01-18 17:51 UTC (permalink / raw)
To: Nathan Huckleberry
Cc: Sandeep Dhavale, Daeho Jeong, Eric Biggers, Sami Tolvanen,
Lai Jiangshan, Jonathan Corbet, linux-doc, linux-kernel
On Sat, Jan 14, 2023 at 01:00:00PM -0800, Nathan Huckleberry wrote:
> This sounds fine to me. How do you feel about a config flag to change
> the default WQ_HIGHPRI scheduler policy and a sysfs node to update the
> policy per workqueue?
Yeah, sounds fine to me.
Thanks.
--
tejun
* Re: [PATCH] workqueue: Add WQ_SCHED_FIFO
2023-01-18 17:51 ` Tejun Heo
@ 2023-01-18 18:22 ` Sandeep Dhavale
2023-01-18 18:25 ` Tejun Heo
2023-01-18 22:04 ` Nathan Huckleberry
From: Sandeep Dhavale @ 2023-01-18 18:22 UTC (permalink / raw)
To: Tejun Heo
Cc: Nathan Huckleberry, Daeho Jeong, Eric Biggers, Sami Tolvanen,
Lai Jiangshan, Jonathan Corbet, linux-doc, linux-kernel
On Wed, Jan 18, 2023 at 9:52 AM Tejun Heo <tj@kernel.org> wrote:
>
> On Sat, Jan 14, 2023 at 01:00:00PM -0800, Nathan Huckleberry wrote:
> > This sounds fine to me. How do you feel about a config flag to change
> > the default WQ_HIGHPRI scheduler policy and a sysfs node to update the
> > policy per workqueue?
>
> Yeah, sounds fine to me.
>
> Thanks.
>
Hi Tejun,
If, with the kernel config option, every WQ_HIGHPRI workqueue is
elevated to sched_fifo_low, wouldn't that kind of defeat the purpose?
Having another class for even more urgent work is better in my opinion.
Thanks,
Sandeep.
* Re: [PATCH] workqueue: Add WQ_SCHED_FIFO
2023-01-18 18:22 ` Sandeep Dhavale
@ 2023-01-18 18:25 ` Tejun Heo
2023-01-18 22:04 ` Nathan Huckleberry
From: Tejun Heo @ 2023-01-18 18:25 UTC (permalink / raw)
To: Sandeep Dhavale
Cc: Nathan Huckleberry, Daeho Jeong, Eric Biggers, Sami Tolvanen,
Lai Jiangshan, Jonathan Corbet, linux-doc, linux-kernel
On Wed, Jan 18, 2023 at 10:22:32AM -0800, Sandeep Dhavale wrote:
> If with the kernel config option, every WQ_HIGHPRI is elevated to
> sched_fifo_low, wouldn't that be kind of defeating the purpose? Having
> another class for even more urgent work is better in my opinion.
I mean, everybody thinks their work items are the most important. Even with
explicit FIFO, you're still gonna have similar problems as people crowd that
flag. If this is a concern, please benchmark with realistic scenarios and
consider other options (e.g. maybe that problematic workqueue doesn't need
to be HIGHPRI or should be split somehow). Right now, I don't think there
are enough justifications for adding another level.
Thanks.
--
tejun
* Re: [PATCH] workqueue: Add WQ_SCHED_FIFO
2023-01-18 18:22 ` Sandeep Dhavale
2023-01-18 18:25 ` Tejun Heo
@ 2023-01-18 22:04 ` Nathan Huckleberry
From: Nathan Huckleberry @ 2023-01-18 22:04 UTC (permalink / raw)
To: Sandeep Dhavale
Cc: Tejun Heo, Daeho Jeong, Eric Biggers, Sami Tolvanen,
Lai Jiangshan, Jonathan Corbet, linux-doc, linux-kernel
On Wed, Jan 18, 2023 at 10:22 AM Sandeep Dhavale <dhavale@google.com> wrote:
>
> On Wed, Jan 18, 2023 at 9:52 AM Tejun Heo <tj@kernel.org> wrote:
> >
> > On Sat, Jan 14, 2023 at 01:00:00PM -0800, Nathan Huckleberry wrote:
> > > This sounds fine to me. How do you feel about a config flag to change
> > > the default WQ_HIGHPRI scheduler policy and a sysfs node to update the
> > > policy per workqueue?
> >
> > Yeah, sounds fine to me.
> >
> > Thanks.
> >
> Hi Tejun,
> If with the kernel config option, every WQ_HIGHPRI is elevated to
> sched_fifo_low, wouldn't that be kind of defeating the purpose? Having
> another class for even more urgent work is better in my opinion.
I agree; however, most of the users of WQ_HIGHPRI that are relevant to
Android would probably have SCHED_FIFO enabled anyway.
I can write the patches such that WQ_HIGHPRI selects one of two
internal WQ flags. If using SCHED_FIFO for all workqueues is
problematic, Android can consider using the internal WQ flags directly
and carry those one-line patches in the Android tree.
Thanks,
Huck
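Roughly like this (a sketch; the internal flag names are hypothetical):

	/* in alloc_workqueue(): resolve WQ_HIGHPRI per the config default */
	if (flags & WQ_HIGHPRI) {
		flags &= ~WQ_HIGHPRI;
		if (IS_ENABLED(CONFIG_WQ_HIGHPRI_FIFO))
			flags |= __WQ_SCHED_FIFO;
		else
			flags |= __WQ_NICE_HIGHPRI;
	}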
* Re: [PATCH] workqueue: Add WQ_SCHED_FIFO
2023-01-13 21:11 ` Tejun Heo
2023-01-14 21:00 ` Nathan Huckleberry
@ 2023-01-19 2:01 ` Nathan Huckleberry
2023-01-19 2:28 ` Tejun Heo
From: Nathan Huckleberry @ 2023-01-19 2:01 UTC (permalink / raw)
To: Tejun Heo
Cc: Sandeep Dhavale, Daeho Jeong, Eric Biggers, Sami Tolvanen,
Lai Jiangshan, Jonathan Corbet, linux-doc, linux-kernel
On Fri, Jan 13, 2023 at 1:11 PM Tejun Heo <tj@kernel.org> wrote:
>
Hi Tejun,
> Given that its use case mostly intersects with WQ_HIGHPRI, would it make
> more sense to add a switch to alter its behavior instead? I don't really
> like the idea of pushing the decision between WQ_HIGHPRI and WQ_SCHED_FIFO
> to each user.
Do you think something similar should be done for WQ_UNBOUND? In most
places where WQ_HIGHPRI is used, WQ_UNBOUND is also used because it
boosts performance. However, I suspect that most of these benchmarks
were done on x86-64. I've found that WQ_UNBOUND significantly reduces
performance on arm64/Android.
From the documentation, using WQ_UNBOUND for performance doesn't seem
correct. It's only supposed to be used for long-running work. It might
make more sense to get rid of WQ_UNBOUND altogether and only move work
to unbound worker pools once it has stuck around for long enough.
Android will probably need to remove WQ_UNBOUND from all of these
performance critical users.
If there are performance benefits to unbinding workqueues from CPUs on
x86-64, that should probably be a config flag, not controlled by every
user.
Thanks,
Huck
* Re: [PATCH] workqueue: Add WQ_SCHED_FIFO
2023-01-19 2:01 ` Nathan Huckleberry
@ 2023-01-19 2:28 ` Tejun Heo
2023-01-27 19:25 ` Nathan Huckleberry
From: Tejun Heo @ 2023-01-19 2:28 UTC (permalink / raw)
To: Nathan Huckleberry
Cc: Sandeep Dhavale, Daeho Jeong, Eric Biggers, Sami Tolvanen,
Lai Jiangshan, Jonathan Corbet, linux-doc, linux-kernel
Hello,
On Wed, Jan 18, 2023 at 06:01:04PM -0800, Nathan Huckleberry wrote:
> Do you think something similar should be done for WQ_UNBOUND? In most
> places where WQ_HIGHPRI is used, WQ_UNBOUND is also used because it
> boosts performance. However, I suspect that most of these benchmarks
> were done on x86-64. I've found that WQ_UNBOUND significantly reduces
> performance on arm64/Android.
One attribute of per-cpu workqueues is that they're concurrency-level
limited, i.e. if you have two per-cpu work items queued, the second one
might not run until the first one is done. Maybe people were trying to
avoid possible latency spikes from that?
Even aside from that, UNBOUND tends to give more consistent latency
behaviors as you aren't necessarily bound to what's happening on that
particular CPU, so I guess maybe that's also why. But I didn't really
follow how each user is picking and justifying these flags, so my
insight is pretty limited.
> From the documentation, using WQ_UNBOUND for performance doesn't seem
> correct. It's only supposed to be used for long-running work. It might
> make more sense to get rid of WQ_UNBOUND altogether and only move work
> to unbound worker pools once it has stuck around for long enough.
UNBOUND says: Don't pin this to one cpu or subject it to workqueue's
concurrency limit. Use workqueue as a generic thread pool.
I don't know what you mean by performance but HIGHPRI | UNBOUND will
definitely improve some aspects.
> Android will probably need to remove WQ_UNBOUND from all of these
> performance critical users.
>
> If there are performance benefits to unbinding workqueues from CPUs on
> x86-64, that should probably be a config flag, not controlled by every
> user.
It's unlikely that the instruction set is what's making the difference
here, right? It would probably help if we understood why it's worse on
arm.
I don't think people have been all that deliberate with these flags, so
it's also likely that some of the usages can drop UNBOUND completely,
but I really think more data and analyses would help.
Thanks.
--
tejun
* Re: [PATCH] workqueue: Add WQ_SCHED_FIFO
2023-01-14 21:00 ` Nathan Huckleberry
@ 2023-01-19 2:41 ` Sandeep Dhavale
2023-01-19 4:31 ` Gao Xiang
From: Sandeep Dhavale @ 2023-01-19 2:41 UTC (permalink / raw)
To: Nathan Huckleberry
Cc: Gao Xiang, Lai Jiangshan, Tejun Heo, Daeho Jeong, Eric Biggers,
Sami Tolvanen, Jonathan Corbet, linux-doc, linux-kernel,
linux-erofs
On Sat, Jan 14, 2023 at 1:01 PM Nathan Huckleberry <nhuck@google.com> wrote:
>
> On Fri, Jan 13, 2023 at 6:20 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> >
> > Hi Nathan!
> >
> > On 2023/1/14 05:07, Nathan Huckleberry wrote:
> > > Add a WQ flag that allows workqueues to use SCHED_FIFO with the least
> > > important RT priority. This can reduce scheduler latency for IO
> > > post-processing when the CPU is under load without impacting other RT
> > > workloads. This has been shown to improve app startup time on Android
> > > [1].
> >
> > Thank you all for your effort on this. Unfortunately I have no time to
> > setup the test [1] until now. If it can be addressed as a new workqueue
> > feature, that would be much helpful to me. Otherwise, I still need to
> > find a way to resolve the latest Android + EROFS latency problem.
> >
>
> The above patch and following diff should have equivalent performance
> to [1], but I have not tested it.
>
> diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
> index ccf7c55d477f..a9c3893ad1d4 100644
> --- a/fs/erofs/zdata.c
> +++ b/fs/erofs/zdata.c
> @@ -201,7 +201,7 @@ static inline int z_erofs_init_workqueue(void)
> * scheduling overhead, perhaps per-CPU threads should be better?
> */
> z_erofs_workqueue = alloc_workqueue("erofs_unzipd",
> - WQ_UNBOUND | WQ_HIGHPRI,
> + WQ_SCHED_FIFO,
> onlinecpus + onlinecpus / 4);
> return z_erofs_workqueue ? 0 : -ENOMEM;
>
> }
>
> Thanks,
> Huck
>
Hello All,
With WQ_SCHED_FIFO and the erofs patch mentioned above, I see that
average sched latency improves by about the same margin as my previous
work proposed in [1] in the same experiment (app launch tests), and
variation is reduced significantly.
Here is the table
|--------------+-----------+---------------+---------|
| | Workqueue | WQ_SCHED_FIFO | Delta |
|--------------+-----------+---------------+---------|
| Average (us) | 15253 | 3514 | -76.96% |
|--------------+-----------+---------------+---------|
| Median (us) | 14001 | 3450 | -75.36% |
|--------------+-----------+---------------+---------|
| Minimum (us) | 3117 | 3097 | -0.64% |
|--------------+-----------+---------------+---------|
| Maximum (us) | 30170 | 4896 | -83.77% |
|--------------+-----------+---------------+---------|
| Stdev | 7166 | 319 | |
|--------------+-----------+---------------+---------|
Thanks,
Sandeep.
[1] https://lore.kernel.org/linux-erofs/20230106073502.4017276-1-dhavale@google.com/
* Re: [PATCH] workqueue: Add WQ_SCHED_FIFO
2023-01-19 2:41 ` Sandeep Dhavale
@ 2023-01-19 4:31 ` Gao Xiang
From: Gao Xiang @ 2023-01-19 4:31 UTC (permalink / raw)
To: Sandeep Dhavale
Cc: Nathan Huckleberry, Daeho Jeong, Jonathan Corbet, linux-doc,
Lai Jiangshan, linux-kernel, Sami Tolvanen, Tejun Heo, Gao Xiang,
linux-erofs
On Wed, Jan 18, 2023 at 06:41:26PM -0800, Sandeep Dhavale via Linux-erofs wrote:
> On Sat, Jan 14, 2023 at 1:01 PM Nathan Huckleberry <nhuck@google.com> wrote:
> >
> > On Fri, Jan 13, 2023 at 6:20 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> > >
> > > Hi Nathan!
> > >
> > > On 2023/1/14 05:07, Nathan Huckleberry wrote:
> > > > Add a WQ flag that allows workqueues to use SCHED_FIFO with the least
> > > > imporant RT priority. This can reduce scheduler latency for IO
> > > > post-processing when the CPU is under load without impacting other RT
> > > > workloads. This has been shown to improve app startup time on Android
> > > > [1].
> > >
> > > Thank you all for your effort on this. Unfortunately I haven't had time
> > > to set up the test [1] until now. If this can be addressed as a new
> > > workqueue feature, that would be very helpful to me. Otherwise, I still
> > > need to find a way to resolve the latest Android + EROFS latency problem.
> > >
> >
> > The above patch and following diff should have equivalent performance
> > to [1], but I have not tested it.
> >
> > diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
> > index ccf7c55d477f..a9c3893ad1d4 100644
> > --- a/fs/erofs/zdata.c
> > +++ b/fs/erofs/zdata.c
> > @@ -201,7 +201,7 @@ static inline int z_erofs_init_workqueue(void)
> > * scheduling overhead, perhaps per-CPU threads should be better?
> > */
> > z_erofs_workqueue = alloc_workqueue("erofs_unzipd",
> > - WQ_UNBOUND | WQ_HIGHPRI,
> > + WQ_SCHED_FIFO,
> > onlinecpus + onlinecpus / 4);
> > return z_erofs_workqueue ? 0 : -ENOMEM;
> > }
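> >
> > As a rough usage sketch (untested; the struct and function names below
> > are illustrative stand-ins, not the actual erofs code), queueing work
> > to the FIFO wq then looks the same as before:
> >
> > struct z_erofs_decompress_req {
> > 	struct work_struct work;
> > 	/* pages, compressed buffers, etc. */
> > };
> >
> > static void z_erofs_decompress_work(struct work_struct *w)
> > {
> > 	struct z_erofs_decompress_req *req =
> > 		container_of(w, struct z_erofs_decompress_req, work);
> >
> > 	/* CPU-heavy decompression now runs in a SCHED_FIFO worker */
> > 	pr_debug("decompressing req %p\n", req);
> > }
> >
> > static void z_erofs_kick_decompress(struct z_erofs_decompress_req *req)
> > {
> > 	INIT_WORK(&req->work, z_erofs_decompress_work);
> > 	queue_work(z_erofs_workqueue, &req->work);
> > }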
> >
> > Thanks,
> > Huck
> >
> Hello All,
> With WQ_SCHED_FIFO and the erofs patch mentioned above, I see that
> average sched latency improves by roughly the same margin as my previous
> work proposed in [1] in the same experiment (app launch tests), and
> variation is reduced significantly.
>
> Here is the table
> |--------------+-----------+---------------+---------|
> | | Workqueue | WQ_SCHED_FIFO | Delta |
> |--------------+-----------+---------------+---------|
> | Average (us) | 15253 | 3514 | -76.96% |
> |--------------+-----------+---------------+---------|
> | Median (us) | 14001 | 3450 | -75.36% |
> |--------------+-----------+---------------+---------|
> | Minimum (us) | 3117 | 3097 | -0.64% |
> |--------------+-----------+---------------+---------|
> | Maximum (us) | 30170 | 4896 | -83.77% |
> |--------------+-----------+---------------+---------|
> | Stdev | 7166 | 319 | |
> |--------------+-----------+---------------+---------|
Thanks, Sandeep. If so, WQ_SCHED_FIFO could be a way forward
as well. Anyway, I will find time to work on a kthread_worker
alternative during my Lunar New Year vacation, since I'd like
to resolve this one way or another.
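
For reference, a minimal sketch of what such a kthread_worker approach
might look like (untested; the worker name and init hook are assumptions
on my side, not existing erofs code):

#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/sched.h>

static struct kthread_worker *z_erofs_kworker;

static int z_erofs_init_kworker(void)
{
	z_erofs_kworker = kthread_create_worker(0, "z_erofs_kworker");
	if (IS_ERR(z_erofs_kworker))
		return PTR_ERR(z_erofs_kworker);

	/* same low RT priority the WQ_SCHED_FIFO pools would use */
	sched_set_fifo_low(z_erofs_kworker->task);
	return 0;
}

/* callers would then pair kthread_init_work()/kthread_queue_work()
 * in place of INIT_WORK()/queue_work() */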
Thanks,
Gao Xiang
>
> Thanks,
> Sandeep.
>
> [1] https://lore.kernel.org/linux-erofs/20230106073502.4017276-1-dhavale@google.com/
>
> >
> > > >
> > > > Scheduler latency affects several drivers as evidenced by [1], [2], [3],
> > > > [4]. Some of these drivers have moved post-processing into IRQ context.
> > > > However, this can cause latency spikes for real-time threads and jitter
> > > > related jank on Android. Using a workqueue with SCHED_FIFO improves
> > > > scheduler latency without causing latency problems for RT threads.
> > >
> > > softirq context is actually mainly for post-interrupt handling, I think,
> > > but decompression/verification/decryption workloads are all much more
> > > complex than that and less important than real post-interrupt handling.
> > > I don't think softirq context is the best place to handle these
> > > CPU-intensive jobs. Besides, it could unexpectedly push important work
> > > off to ksoftirqd in extreme cases. Also, many of these post-processing
> > > jobs are complex enough that they may sleep, which makes softirq context
> > > unsuitable as well.
> > >
> > > Anyway, I second this proposal if possible:
> > >
> > > Acked-by: Gao Xiang <hsiangkao@linux.alibaba.com>
> > >
> > > Thanks,
> > > Gao Xiang
> > >
> > > >
> > > > [1]:
> > > > https://lore.kernel.org/linux-erofs/20230106073502.4017276-1-dhavale@google.com/
> > > > [2]:
> > > > https://lore.kernel.org/linux-f2fs-devel/20220802192437.1895492-1-daeho43@gmail.com/
> > > > [3]:
> > > > https://lore.kernel.org/dm-devel/20220722093823.4158756-4-nhuck@google.com/
> > > > [4]:
> > > > https://lore.kernel.org/dm-crypt/20200706173731.3734-1-ignat@cloudflare.com/
> > > >
> > > > This change has been tested on dm-verity with the following fio config:
> > > >
> > > > [global]
> > > > time_based
> > > > runtime=120
> > > >
> > > > [do-verify]
> > > > ioengine=sync
> > > > filename=/dev/testing
> > > > rw=randread
> > > > direct=1
> > > >
> > > > [burn_8x90%_qsort]
> > > > ioengine=cpuio
> > > > cpuload=90
> > > > numjobs=8
> > > > cpumode=qsort
> > > >
> > > > Before:
> > > > clat (usec): min=13, max=23882, avg=29.56, stdev=113.29 READ:
> > > > bw=122MiB/s (128MB/s), 122MiB/s-122MiB/s (128MB/s-128MB/s), io=14.3GiB
> > > > (15.3GB), run=120001-120001msec
> > > >
> > > > After:
> > > > clat (usec): min=13, max=23137, avg=19.96, stdev=105.71 READ:
> > > > bw=180MiB/s (189MB/s), 180MiB/s-180MiB/s (189MB/s-189MB/s), io=21.1GiB
> > > > (22.7GB), run=120012-120012msec
> > > >
> > > > Cc: Sandeep Dhavale <dhavale@google.com>
> > > > Cc: Daeho Jeong <daehojeong@google.com>
> > > > Cc: Eric Biggers <ebiggers@kernel.org>
> > > > Cc: Sami Tolvanen <samitolvanen@google.com>
> > > > Signed-off-by: Nathan Huckleberry <nhuck@google.com>
> > > > ---
> > > > Documentation/core-api/workqueue.rst | 12 ++++++++++
> > > > include/linux/workqueue.h | 9 +++++++
> > > > kernel/workqueue.c | 36 +++++++++++++++++++++-------
> > > > 3 files changed, 48 insertions(+), 9 deletions(-)
> > > >
> > > > diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
> > > > index 3b22ed137662..26faf2806c66 100644
> > > > --- a/Documentation/core-api/workqueue.rst
> > > > +++ b/Documentation/core-api/workqueue.rst
> > > > @@ -216,6 +216,18 @@ resources, scheduled and executed.
> > > >
> > > > This flag is meaningless for unbound wq.
> > > >
> > > > +``WQ_SCHED_FIFO``
> > > > + Work items of a fifo wq are queued to the fifo
> > > > + worker-pool of the target cpu. Fifo worker-pools are
> > > > + served by worker threads with scheduler policy SCHED_FIFO and
> > > > + the least important real-time priority. This can be useful
> > > > + for workloads where low latency is important.
> > > > +
> > > > + A workqueue cannot be both high-priority and fifo.
> > > > +
> > > > + Note that normal and fifo worker-pools don't interact with
> > > > + each other. Each maintains its separate pool of workers and
> > > > + implements concurrency management among its workers.
> > > >
> > > > ``max_active``
> > > > --------------
> > > > diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
> > > > index ac551b8ee7d9..43a4eeaf8ff4 100644
> > > > --- a/include/linux/workqueue.h
> > > > +++ b/include/linux/workqueue.h
> > > > @@ -134,6 +134,10 @@ struct workqueue_attrs {
> > > > * @nice: nice level
> > > > */
> > > > int nice;
> > > > + /**
> > > > + * @sched_fifo: is using SCHED_FIFO
> > > > + */
> > > > + bool sched_fifo;
> > > >
> > > > /**
> > > > * @cpumask: allowed CPUs
> > > > @@ -334,6 +338,11 @@ enum {
> > > > * http://thread.gmane.org/gmane.linux.kernel/1480396
> > > > */
> > > > WQ_POWER_EFFICIENT = 1 << 7,
> > > > + /*
> > > > + * Low real-time priority workqueues can reduce scheduler latency
> > > > + * for latency sensitive workloads like IO post-processing.
> > > > + */
> > > > + WQ_SCHED_FIFO = 1 << 8,
> > > >
> > > > __WQ_DESTROYING = 1 << 15, /* internal: workqueue is destroying */
> > > > __WQ_DRAINING = 1 << 16, /* internal: workqueue is draining */
> > > > diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> > > > index 5dc67aa9d696..99c5e0a3dc28 100644
> > > > --- a/kernel/workqueue.c
> > > > +++ b/kernel/workqueue.c
> > > > @@ -85,7 +85,7 @@ enum {
> > > > WORKER_NOT_RUNNING = WORKER_PREP | WORKER_CPU_INTENSIVE |
> > > > WORKER_UNBOUND | WORKER_REBOUND,
> > > >
> > > > - NR_STD_WORKER_POOLS = 2, /* # standard pools per cpu */
> > > > + NR_STD_WORKER_POOLS = 3, /* # standard pools per cpu */
> > > >
> > > > UNBOUND_POOL_HASH_ORDER = 6, /* hashed by pool->attrs */
> > > > BUSY_WORKER_HASH_ORDER = 6, /* 64 pointers */
> > > > @@ -1949,7 +1949,8 @@ static struct worker *create_worker(struct worker_pool *pool)
> > > >
> > > > if (pool->cpu >= 0)
> > > > snprintf(id_buf, sizeof(id_buf), "%d:%d%s", pool->cpu, id,
> > > > - pool->attrs->nice < 0 ? "H" : "");
> > > > + pool->attrs->sched_fifo ? "F" :
> > > > + (pool->attrs->nice < 0 ? "H" : ""));
> > > > else
> > > > snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);
> > > >
> > > > @@ -1958,7 +1959,11 @@ static struct worker *create_worker(struct worker_pool *pool)
> > > > if (IS_ERR(worker->task))
> > > > goto fail;
> > > >
> > > > - set_user_nice(worker->task, pool->attrs->nice);
> > > > + if (pool->attrs->sched_fifo)
> > > > + sched_set_fifo_low(worker->task);
> > > > + else
> > > > + set_user_nice(worker->task, pool->attrs->nice);
> > > > +
> > > > kthread_bind_mask(worker->task, pool->attrs->cpumask);
> > > >
> > > > /* successful, attach the worker to the pool */
> > > > @@ -4323,9 +4328,17 @@ static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,
> > > >
> > > > static int alloc_and_link_pwqs(struct workqueue_struct *wq)
> > > > {
> > > > - bool highpri = wq->flags & WQ_HIGHPRI;
> > > > + int pool_index = 0;
> > > > int cpu, ret;
> > > >
> > > > + if (wq->flags & WQ_HIGHPRI && wq->flags & WQ_SCHED_FIFO)
> > > > + return -EINVAL;
> > > > +
> > > > + if (wq->flags & WQ_HIGHPRI)
> > > > + pool_index = 1;
> > > > + if (wq->flags & WQ_SCHED_FIFO)
> > > > + pool_index = 2;
> > > > +
> > > > if (!(wq->flags & WQ_UNBOUND)) {
> > > > wq->cpu_pwqs = alloc_percpu(struct pool_workqueue);
> > > > if (!wq->cpu_pwqs)
> > > > @@ -4337,7 +4350,7 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
> > > > struct worker_pool *cpu_pools =
> > > > per_cpu(cpu_worker_pools, cpu);
> > > >
> > > > - init_pwq(pwq, wq, &cpu_pools[highpri]);
> > > > + init_pwq(pwq, wq, &cpu_pools[pool_index]);
> > > >
> > > > mutex_lock(&wq->mutex);
> > > > link_pwq(pwq);
> > > > @@ -4348,13 +4361,13 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
> > > >
> > > > cpus_read_lock();
> > > > if (wq->flags & __WQ_ORDERED) {
> > > > - ret = apply_workqueue_attrs(wq, ordered_wq_attrs[highpri]);
> > > > + ret = apply_workqueue_attrs(wq, ordered_wq_attrs[pool_index]);
> > > > /* there should only be single pwq for ordering guarantee */
> > > > WARN(!ret && (wq->pwqs.next != &wq->dfl_pwq->pwqs_node ||
> > > > wq->pwqs.prev != &wq->dfl_pwq->pwqs_node),
> > > > "ordering guarantee broken for workqueue %s\n", wq->name);
> > > > } else {
> > > > - ret = apply_workqueue_attrs(wq, unbound_std_wq_attrs[highpri]);
> > > > + ret = apply_workqueue_attrs(wq, unbound_std_wq_attrs[pool_index]);
> > > > }
> > > > cpus_read_unlock();
> > > >
> > > > @@ -6138,7 +6151,8 @@ static void __init wq_numa_init(void)
> > > > */
> > > > void __init workqueue_init_early(void)
> > > > {
> > > > - int std_nice[NR_STD_WORKER_POOLS] = { 0, HIGHPRI_NICE_LEVEL };
> > > > + int std_nice[NR_STD_WORKER_POOLS] = { 0, HIGHPRI_NICE_LEVEL, 0 };
> > > > + bool std_sched_fifo[NR_STD_WORKER_POOLS] = { false, false, true };
> > > > int i, cpu;
> > > >
> > > > BUILD_BUG_ON(__alignof__(struct pool_workqueue) < __alignof__(long long));
> > > > @@ -6158,8 +6172,10 @@ void __init workqueue_init_early(void)
> > > > BUG_ON(init_worker_pool(pool));
> > > > pool->cpu = cpu;
> > > > cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
> > > > - pool->attrs->nice = std_nice[i++];
> > > > + pool->attrs->nice = std_nice[i];
> > > > + pool->attrs->sched_fifo = std_sched_fifo[i];
> > > > pool->node = cpu_to_node(cpu);
> > > > + i++;
> > > >
> > > > /* alloc pool ID */
> > > > mutex_lock(&wq_pool_mutex);
> > > > @@ -6174,6 +6190,7 @@ void __init workqueue_init_early(void)
> > > >
> > > > BUG_ON(!(attrs = alloc_workqueue_attrs()));
> > > > attrs->nice = std_nice[i];
> > > > + attrs->sched_fifo = std_sched_fifo[i];
> > > > unbound_std_wq_attrs[i] = attrs;
> > > >
> > > > /*
> > > > @@ -6183,6 +6200,7 @@ void __init workqueue_init_early(void)
> > > > */
> > > > BUG_ON(!(attrs = alloc_workqueue_attrs()));
> > > > attrs->nice = std_nice[i];
> > > > + attrs->sched_fifo = std_sched_fifo[i];
> > > > attrs->no_numa = true;
> > > > ordered_wq_attrs[i] = attrs;
> > > > }
* Re: [PATCH] workqueue: Add WQ_SCHED_FIFO
2023-01-19 2:28 ` Tejun Heo
@ 2023-01-27 19:25 ` Nathan Huckleberry
0 siblings, 0 replies; 14+ messages in thread
From: Nathan Huckleberry @ 2023-01-27 19:25 UTC (permalink / raw)
To: Tejun Heo
Cc: Sandeep Dhavale, Daeho Jeong, Eric Biggers, Sami Tolvanen,
Lai Jiangshan, Jonathan Corbet, linux-doc, linux-kernel
On Wed, Jan 18, 2023 at 6:29 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Wed, Jan 18, 2023 at 06:01:04PM -0800, Nathan Huckleberry wrote:
> > Do you think something similar should be done for WQ_UNBOUND? In most
> > places where WQ_HIGHPRI is used, WQ_UNBOUND is also used because it
> > boosts performance. However, I suspect that most of these benchmarks
> > were done on x86-64. I've found that WQ_UNBOUND significantly reduces
> > performance on arm64/Android.
>
> One attribute of per-cpu workqueues is that they're concurrency-level
> limited, i.e. if you have two per-cpu work items queued, the second one might
> not run until the first one is done. Maybe people were trying to avoid
> possible latency spikes from that?
>
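To make that limit concrete, here is a toy sketch (hypothetical
handlers, not from any driver): because busy_fn() burns CPU without
sleeping, the pool's concurrency management never wakes another worker,
so second_fn() queued on the same CPU waits until busy_fn() returns.

#include <linux/delay.h>
#include <linux/workqueue.h>

static void busy_fn(struct work_struct *w)
{
	mdelay(100);	/* CPU-bound, never sleeps */
}

static void second_fn(struct work_struct *w)
{
	pr_info("ran only after busy_fn finished\n");
}

static DECLARE_WORK(busy_work, busy_fn);
static DECLARE_WORK(second_work, second_fn);

static void demo(void)
{
	queue_work_on(0, system_wq, &busy_work);
	queue_work_on(0, system_wq, &second_work);
}
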
> Even aside from that, UNBOUND tends to give more consistent latency
> behavior, as you aren't necessarily bound to what's happening on that
> particular CPU, so I guess maybe that's also why, but I didn't really
> follow how each user is picking and justifying these flags, so my
> insight is pretty limited.
>
> > From the documentation, using WQ_UNBOUND for performance doesn't seem
> > correct. It's only supposed to be used for long-running work. It might
> > make more sense to get rid of WQ_UNBOUND altogether and only move work
> > to unbound worker pools once it has stuck around for long enough.
>
> UNBOUND says: Don't pin this to one cpu or subject it to workqueue's
> concurrency limit. Use workqueue as a generic thread pool.
>
> I don't know what you mean by performance but HIGHPRI | UNBOUND will
> definitely improve some aspects.
>
> > Android will probably need to remove WQ_UNBOUND from all of these
> > performance-critical users.
> >
> > If there are performance benefits to using unbinding workqueues from
> > CPUs on x86-64, that should probably be a config flag, not controlled
> > by every user.
>
> It's unlikely that the instruction set is what's making the difference here,
> right? It probably would help if we understood why it's worse on arm.
I did some more digging. For dm-verity I think this is related to the
availability of SHA instructions. If SHA instructions are present,
WQ_UNBOUND is suboptimal because the work finishes very quickly.
That doesn't explain why EROFS is slower with WQ_UNBOUND though.
It might also be related to the heterogeneity of modern arm
processors. Locality may be more important for ARM processors than for
x86-64.
See the table below:
| open-prebuilt-camera  | UNBOUND + HIGHPRI | HIGHPRI ONLY  | SCHED_FIFO ONLY |
| erofs wait time (us)  | 357805            | 174205 (-51%) | 129861 (-63%)   |
| verity wait time (us) | 11746             | 119 (-98%)    | 0 (-100%)       |
The bigger issue seems to be WQ_UNBOUND, so I'm abandoning these
patches for now.
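
For illustration, the kind of change this points to, sketched on the
erofs queue from earlier in the thread (the flag choice is a sketch,
not a tested recommendation):

	/* keep post-processing bound to the submitting CPU on arm64;
	 * WQ_UNBOUND dropped, WQ_HIGHPRI kept for latency */
	z_erofs_workqueue = alloc_workqueue("erofs_unzipd",
					    WQ_HIGHPRI,
					    onlinecpus + onlinecpus / 4);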
Thanks,
Huck
>
> I don't think ppl have been all that deliberate with these flags, so it's
> also likely that some of the usages can drop UNBOUND completely but I really
> think more data and analyses would help.
>
> Thanks.
>
> --
> tejun
Thread overview: 14+ messages
2023-01-13 21:07 [PATCH] workqueue: Add WQ_SCHED_FIFO Nathan Huckleberry
2023-01-13 21:11 ` Tejun Heo
2023-01-14 21:00 ` Nathan Huckleberry
2023-01-18 17:51 ` Tejun Heo
2023-01-18 18:22 ` Sandeep Dhavale
2023-01-18 18:25 ` Tejun Heo
2023-01-18 22:04 ` Nathan Huckleberry
2023-01-19 2:01 ` Nathan Huckleberry
2023-01-19 2:28 ` Tejun Heo
2023-01-27 19:25 ` Nathan Huckleberry
2023-01-14 2:19 ` Gao Xiang
2023-01-14 21:00 ` Nathan Huckleberry
2023-01-19 2:41 ` Sandeep Dhavale
2023-01-19 4:31 ` Gao Xiang