* Observing higher CPU utilization during random IO fio testing
@ 2026-05-21 19:44 Wen Xiong
2026-05-21 21:52 ` Jens Axboe
2026-05-30 1:10 ` Ming Lei
0 siblings, 2 replies; 6+ messages in thread
From: Wen Xiong @ 2026-05-21 19:44 UTC
To: linux-block, axboe; +Cc: tom.leiming, jmoyer, Gjoyce, wenxiong
Hi All,
Our performance team observed higher CPU utilization in RHEL 10
compared to RHEL 9.8, and observed a similar issue in the upstream
kernel (v7.1-rc4) when running fio random IO tests.
System configuration:
47 dedicated cores
120 GB memory
PCIe4 2-Port 64Gb FC Adapter
FlashSystem: FS9500, 12 LUNs/FC port, 100G each LUN.
Random IO tests are more CPU intensive than sequential IO tests due to
several factors: more context switching, interrupt handling, cache
inefficiency, etc. We found that the following patch causes the higher
CPU utilization in RHEL 10 and newer Linux kernels:
commit 060406c61c7cb4bbd82a02d179decca9c9bb3443 (HEAD)
Author: Yu Kuai <yukuai3@huawei.com>
Date: Thu May 9 20:38:25 2024 +0800
block: add plug while submitting IO
So that if caller didn't use plug, for example, __blkdev_direct_IO_simple()
and __blkdev_direct_IO_async(), block layer can still benefit from caching
nsec time in the plug.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240509123825.3225207-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We reverted the above patch in the RHEL 10 kernel and in upstream
v7.1-rc4, and saw lower CPU utilization when running the same fio test.
The patch adds plugging in __submit_bio() in the block layer, which may
cause the performance degradation:
- Random IO tests see little merging, so the plug mostly adds flush
  overhead.
- More IO scheduler interaction: requests are forced through the
  scheduler instead of being dispatched directly to the hardware queue.
- Poor cache locality during plug operations.
Below is some performance data that our performance team collected.
RHEL 9.8 compared with RHEL 10.0:
Iotype  qd  nj  rmix  mpstat busy delta  lparstat delta
Randrw  1   20  100   135%               109%
Randrw  1   40  100   72%                81%
Randrw  1   20  70    278%               174%
Randrw  1   40  70    272%               191%
Randrw  1   20  0     93%                30%
Randrw  1   40  0     104%               36%
RHEL 9.8 compared with RHEL 10 with the above plugging patch reverted in
the block layer:
Iotype  qd  nj  rmix  mpstat busy delta  lparstat delta
Randrw  1   20  100   -12%               20%
Randrw  1   40  100   -42%               -4%
Randrw  1   20  70    70%                71%
Randrw  1   40  70    51%                60%
Randrw  1   20  0     -14%               -43%
Randrw  1   40  0     -33%               -51%
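For readers comparing the two tables above, the arithmetic below is one way to relate them. It is a sketch under two assumptions that the thread does not confirm: that each "delta" is a relative change in CPU busy against the same RHEL 9.8 baseline, and that the "%51" entry means 51%. The name busy_ratio is made up for illustration. Under those assumptions, (1 + patched delta) / (1 + reverted delta) estimates how much more CPU the patched kernel uses than the reverted one.

```python
# mpstat busy deltas (percent) copied from the two tables above.
# Keys are (numjobs, rwmixread); values are (RHEL 10.0, RHEL 10 with revert).
# Assumption: the "%51" entry in the second table is read as 51%.
MPSTAT_DELTAS = {
    (20, 100): (135, -12),
    (40, 100): (72, -42),
    (20, 70): (278, 70),
    (40, 70): (272, 51),
    (20, 0): (93, -14),
    (40, 0): (104, -33),
}


def busy_ratio(patched_pct, reverted_pct):
    """CPU busy of the patched kernel relative to the reverted kernel.

    Assumes both deltas are relative changes against the same baseline.
    """
    return (1 + patched_pct / 100) / (1 + reverted_pct / 100)


if __name__ == "__main__":
    for (nj, rmix), (patched, reverted) in MPSTAT_DELTAS.items():
        ratio = busy_ratio(patched, reverted)
        print(f"nj={nj:2d} rmix={rmix:3d}: patched/reverted = {ratio:.2f}x")
```

If the deltas were instead measured in percentage points, this ratio would not apply and the raw differences between the tables would be the better comparison.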
Can a block layer expert help us resolve this high CPU utilization
performance issue?
Let us know if you need more performance data or other perf data.
Thanks a lot for your help!
Wendy
* Re: Observing higher CPU utilization during random IO fio testing
2026-05-21 19:44 Observing higher CPU utilization during random IO fio testing Wen Xiong
@ 2026-05-21 21:52 ` Jens Axboe
2026-05-25 5:28 ` Yu Kuai
2026-05-30 1:10 ` Ming Lei
1 sibling, 1 reply; 6+ messages in thread
From: Jens Axboe @ 2026-05-21 21:52 UTC
To: Wen Xiong, linux-block; +Cc: tom.leiming, jmoyer, Gjoyce, wenxiong, Yu Kuai
On 5/21/26 1:44 PM, Wen Xiong wrote:
> Hi All,
>
> Our performance team observed the higher CPU utilization in RHEL10 compared to RHEL9.8, observed the similar issue in upstream kernel(v7.1-rc4) as well when running FIO random IO tests.
>
> System configuration:
> 47 dedicate cores
> 120 GB memory
> PCIe4 2-Port 64Gb FC Adapter
> FlashSystem: FS9500, 12 LUNs/FC port, 100G each LUN.
>
> Random IO tests are more CPU intensive than sequential IO tests due to several factors: more context switching, Interrupt Handling, cache Inefficiency etc. We found out the following patch which caused the higher CPU utilization in rhel10 and newer linux kernel:
>
> commit 060406c61c7cb4bbd82a02d179decca9c9bb3443 (HEAD)
> Author: Yu Kuai <yukuai3@huawei.com>
> Date: Thu May 9 20:38:25 2024 +0800
>
> block: add plug while submitting IO
>
> So that if caller didn't use plug, for example, __blkdev_direct_IO_simple()
> and __blkdev_direct_IO_async(), block layer can still benefit from caching
> nsec time in the plug.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> Link: https://lore.kernel.org/r/20240509123825.3225207-1-yukuai1@huaweicloud.com
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>
> We reverted above patch in rhel10 kernel and upstream 7.1-rc4, saw lower CPU utilization when doing the same FIO test.
>
> The patch adds plugging in __submit_bio() in block layer, maybe cause performance degradation:
> - Random IO tests have less merging, flush overhead.
> - More IO scheduler interaction, forces requests through scheduler instead of direct dispatch(direct dispatch to hardware queue)
> - Poor cache locality during plug operation
>
> Below are some performance data that our performance team collected:
>
> RHEL9.8 comparison RHEL10.0
> Iotype qd nj rmix mpstat busy delta lparstat delta
> Randrw 1 20 100 135% 109%
> Randrw 1 40 100 72% 81%
> Randrw 1 20 70 278% 174%
> Randrw 1 40 70 272% 191%
> Randrw 1 20 0 93% 30%
> Randrw 1 40 0 104% 36%
>
> RHEL 9.8 comparison RHEL10 with reverting above plugging patch in block layer.h
> Iotype qd nj rmix mpstat busy delta lparstat deltab
> Randrw 1 20 100 -12% 20%
> Randrw 1 40 100 -42% -4%
> Randrw 1 20 70 70% 71%
> Randrw 1 40 70 %51 60%
> Randrw 1 20 0 -14% -43%
> Randrw 1 40 0 -33% -51%
>
> Can a block layer expert help us resolve this high CPU utilization performance issue?
> Let us know if you need more performance data or other perf data.
Let's CC Yu Kuai who wrote that commit, that might help.
--
Jens Axboe
* Re: Observing higher CPU utilization during random IO fio testing
2026-05-21 21:52 ` Jens Axboe
@ 2026-05-25 5:28 ` Yu Kuai
2026-05-26 15:28 ` Wen Xiong
2026-05-29 17:13 ` Wen Xiong
0 siblings, 2 replies; 6+ messages in thread
From: Yu Kuai @ 2026-05-25 5:28 UTC
To: Jens Axboe, Wen Xiong, linux-block
Cc: tom.leiming, jmoyer, Gjoyce, wenxiong, yukuai
Hi,
On 2026/5/22 5:52, Jens Axboe wrote:
> On 5/21/26 1:44 PM, Wen Xiong wrote:
>> Hi All,
>>
>> Our performance team observed the higher CPU utilization in RHEL10 compared to RHEL9.8, observed the similar issue in upstream kernel(v7.1-rc4) as well when running FIO random IO tests.
>>
>> System configuration:
>> 47 dedicate cores
>> 120 GB memory
>> PCIe4 2-Port 64Gb FC Adapter
>> FlashSystem: FS9500, 12 LUNs/FC port, 100G each LUN.
>>
>> Random IO tests are more CPU intensive than sequential IO tests due to several factors: more context switching, Interrupt Handling, cache Inefficiency etc. We found out the following patch which caused the higher CPU utilization in rhel10 and newer linux kernel:
>>
>> commit 060406c61c7cb4bbd82a02d179decca9c9bb3443 (HEAD)
>> Author: Yu Kuai <yukuai3@huawei.com>
>> Date: Thu May 9 20:38:25 2024 +0800
>>
>> block: add plug while submitting IO
>>
>> So that if caller didn't use plug, for example, __blkdev_direct_IO_simple()
>> and __blkdev_direct_IO_async(), block layer can still benefit from caching
>> nsec time in the plug.
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> Link: https://lore.kernel.org/r/20240509123825.3225207-1-yukuai1@huaweicloud.com
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>
>> We reverted above patch in rhel10 kernel and upstream 7.1-rc4, saw lower CPU utilization when doing the same FIO test.
>>
>> The patch adds plugging in __submit_bio() in block layer, maybe cause performance degradation:
>> - Random IO tests have less merging, flush overhead.
>> - More IO scheduler interaction, forces requests through scheduler instead of direct dispatch(direct dispatch to hardware queue)
I don't understand this point. Can you explain more? I think plugging should not affect whether requests go through the scheduler or not.
>> - Poor cache locality during plug operation
>>
>> Below are some performance data that our performance team collected:
>>
>> RHEL9.8 comparison RHEL10.0
>> Iotype qd nj rmix mpstat busy delta lparstat delta
>> Randrw 1 20 100 135% 109%
>> Randrw 1 40 100 72% 81%
>> Randrw 1 20 70 278% 174%
>> Randrw 1 40 70 272% 191%
>> Randrw 1 20 0 93% 30%
>> Randrw 1 40 0 104% 36%
>>
>> RHEL 9.8 comparison RHEL10 with reverting above plugging patch in block layer.h
>> Iotype qd nj rmix mpstat busy delta lparstat deltab
>> Randrw 1 20 100 -12% 20%
>> Randrw 1 40 100 -42% -4%
>> Randrw 1 20 70 70% 71%
>> Randrw 1 40 70 %51 60%
>> Randrw 1 20 0 -14% -43%
>> Randrw 1 40 0 -33% -51%
>>
>> Can a block layer expert help us resolve this high CPU utilization performance issue?
And I assume you're testing on raw disks, because filesystems should always enable plugging.
>> Let us know if you need more performance data or other perf data.
Yes, perf data will be helpful. Please also describe your test in detail
and I'll check whether I can reproduce it.
> Let's CC Yu Kuai who wrote that commit, that might help.
>
--
Thanks,
Kuai
* Re: Observing higher CPU utilization during random IO fio testing
2026-05-25 5:28 ` Yu Kuai
@ 2026-05-26 15:28 ` Wen Xiong
2026-05-29 17:13 ` Wen Xiong
1 sibling, 0 replies; 6+ messages in thread
From: Wen Xiong @ 2026-05-26 15:28 UTC
To: yukuai; +Cc: Jens Axboe, linux-block, tom.leiming, jmoyer, Gjoyce, wenxiong
On 2026-05-25 00:28, Yu Kuai wrote:
> Hi,
>
> On 2026/5/22 5:52, Jens Axboe wrote:
>>> - More IO scheduler interaction, forces requests through scheduler
>>> instead of direct dispatch(direct dispatch to hardware queue)
> I don't understand this point. Can you explain more? I think plug
> should not matter if request go through scheduler or not.
My understanding is:
Random IO tests are more CPU intensive.
Plugging delays dispatching IOs directly to the hardware queue (the
quick path).
A plug collects multiple IO requests into a batch and defers submitting
them until blk_flush_plug() is called (dispatch to the hardware queue)
or the task is scheduled out.
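The batching described above can be illustrated with a small user-space model. This is only a sketch, not kernel code: the HardwareQueue and Plug classes and the submit_batch helper are hypothetical names, and the model only counts how many dispatch calls reach the hardware queue. It suggests that a plug saves dispatch calls when a caller submits many requests at once, while at queue depth 1 every plug holds exactly one request, so the extra bookkeeping saves nothing.

```python
class HardwareQueue:
    """Counts how many times the (toy) driver is asked to dispatch."""

    def __init__(self):
        self.dispatch_calls = 0
        self.requests = 0

    def dispatch(self, batch):
        self.dispatch_calls += 1
        self.requests += len(batch)


class Plug:
    """Toy plug: holds requests until flush(), then dispatches them together."""

    def __init__(self, hwq):
        self.hwq = hwq
        self.pending = []

    def add(self, request):
        self.pending.append(request)

    def flush(self):
        if self.pending:
            self.hwq.dispatch(self.pending)
            self.pending = []


def submit_batch(hwq, requests, use_plug):
    """Submit one caller-visible batch, with or without a plug around it."""
    if use_plug:
        plug = Plug(hwq)
        for req in requests:
            plug.add(req)
        plug.flush()  # analogous to finishing the plug
    else:
        for req in requests:
            hwq.dispatch([req])  # direct dispatch, one call per request


if __name__ == "__main__":
    for depth in (1, 16):
        for use_plug in (False, True):
            hwq = HardwareQueue()
            for _ in range(100):  # 100 submission rounds
                submit_batch(hwq, list(range(depth)), use_plug)
            print(f"depth={depth:2d} plug={use_plug}: "
                  f"{hwq.dispatch_calls} dispatch calls "
                  f"for {hwq.requests} requests")
```

The model ignores merging, scheduling, and the per-plug setup cost, so it shows only where batching can and cannot reduce dispatch calls; it does not measure CPU time.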
>
> And I assume you're testing raw disk, because filesystems should
> always enable plug.
>
Yes. FIO random IO tests over raw disks.
> Yes, perf data will be helpful. And please show your test in details
> and I'll
> check if I can reproduce it.
System config:
47 dedicated cores
120 GB memory
PCIe4 2-Port 64Gb FC Adapter
64Gb FC switch
FlashSystem: FS9500, 12 LUNs/FC port
Below is fio config for rwmixread=100:
[global]
randrepeat=0
buffered=0
direct=1
norandommap=1
group_reporting=1
size=80g
ioengine=libaio
rw=randrw
bs=4k
iodepth=1
rwmixread=100
runtime=600
ramp_time=5
time_based=1
numjobs=20
[job1]
filename=/dev/dm-2
[job2]
filename=/dev/dm-3
...
24 jobs in total.
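Because the job list above is elided, the following sketch shows one way to generate a complete job file from a device list. The global options are copied from the configuration above. The helper name build_fio_job and the device list passed to it are placeholders, since the thread names only /dev/dm-2 and /dev/dm-3.

```python
# Global options copied from the fio configuration above.
GLOBAL_OPTIONS = {
    "randrepeat": 0,
    "buffered": 0,
    "direct": 1,
    "norandommap": 1,
    "group_reporting": 1,
    "size": "80g",
    "ioengine": "libaio",
    "rw": "randrw",
    "bs": "4k",
    "iodepth": 1,
    "rwmixread": 100,
    "runtime": 600,
    "ramp_time": 5,
    "time_based": 1,
    "numjobs": 20,
}


def build_fio_job(devices, rwmixread=100, numjobs=20):
    """Return fio job-file text with one [jobN] section per device."""
    options = dict(GLOBAL_OPTIONS, rwmixread=rwmixread, numjobs=numjobs)
    lines = ["[global]"]
    lines += [f"{key}={value}" for key, value in options.items()]
    for index, device in enumerate(devices, start=1):
        lines.append(f"[job{index}]")
        lines.append(f"filename={device}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Only these two devices are named in the thread; extend the list locally.
    print(build_fio_job(["/dev/dm-2", "/dev/dm-3"]), end="")
```

Passing rwmixread=70 or rwmixread=0 and numjobs=40 would cover the other rows of the earlier tables, assuming those runs changed only these two options.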
We collected some perf data. What kind of perf data do you want? Let me
know.
Thanks,
Wendy
* Re: Observing higher CPU utilization during random IO fio testing
2026-05-25 5:28 ` Yu Kuai
2026-05-26 15:28 ` Wen Xiong
@ 2026-05-29 17:13 ` Wen Xiong
1 sibling, 0 replies; 6+ messages in thread
From: Wen Xiong @ 2026-05-29 17:13 UTC
To: yukuai; +Cc: Jens Axboe, linux-block, tom.leiming, jmoyer, Gjoyce, wenxiong
On 2026-05-25 00:28, Yu Kuai wrote:
> On 2026/5/22 5:52, Jens Axboe wrote:
> Yes, perf data will be helpful. And please show your test in details
> and I'll
> check if I can reproduce it.
Hi Yu Kuai,
Have you reproduced the issue yet?
Below is some perf data we took while running the random read test:
Test:
fio random read with qdepth=1 and nj=20; we saw higher CPU utilization
in this test case.
Perf record:
Start the fio run in one session and kick off the script in another
session while the test is running.
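The capture script itself is not shown in the thread. As a rough sketch of what such a script might do, the snippet below builds a system-wide perf record with call graphs for a fixed window, followed by a plain-text report. The 30-second window, the file names, and the helper names are assumptions, not the settings used for the data that follows.

```python
import shlex
import subprocess


def perf_record_cmd(seconds=30, output="perf.data"):
    """System-wide sampling with call graphs while `sleep` runs."""
    return ["perf", "record", "-a", "-g", "-o", output,
            "--", "sleep", str(seconds)]


def perf_report_cmd(data="perf.data"):
    """Plain-text report read from the recorded data file."""
    return ["perf", "report", "-i", data, "--stdio"]


def capture(seconds=30, output="perf.data", report="report.txt"):
    """Run both steps; call this while fio is running in another session."""
    subprocess.run(perf_record_cmd(seconds, output), check=True)
    with open(report, "w") as fh:
        subprocess.run(perf_report_cmd(output), stdout=fh, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them, so this is safe to try.
    print(shlex.join(perf_record_cmd()))
    print(shlex.join(perf_report_cmd()))
```

Calling capture() requires perf to be installed and usually needs elevated privileges for system-wide sampling.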
Perf report:
With blk_start_plug/blk_finish_plug before calling __submit_bio() in
blk-core.c:
Top.txt
2.41% fio [kernel.kallsyms] [k] cpupri_set
1.16% fio [kernel.kallsyms] [k] queued_spin_lock_slowpath
0.75% fio [kernel.kallsyms] [k] sbitmap_find_bit
0.47% fio [kernel.kallsyms] [k] set_next_task_rt
0.41% fio [kernel.kallsyms] [k] pull_rt_task
0.34% fio [kernel.kallsyms] [k] enqueue_pushable_task
…
0.02% fio [kernel.kallsyms] [k] __blk_flush_plug
0.01% fio [kernel.kallsyms] [k] blk_add_rq_to_plug
0.01% fio [kernel.kallsyms] [k] blk_mq_flush_plug_list
0.00% fio [kernel.kallsyms] [k] blk_attempt_plug_merge
Callgraph.txt
2.41% fio [kernel.kallsyms] [k] cpupri_set
|
---cpupri_set
|
|--1.15%--__enqueue_rt_entity
| enqueue_task_rt
| enqueue_task
| ttwu_do_activate
Perf report
Without blk_start_plug and blk_finish_plug before calling
__submit_bio():
Top.txt
0.67% fio [kernel.kallsyms] [k] queued_spin_lock_slowpath
0.64% fio [kernel.kallsyms] [k] sched_balance_newidle
0.47% fio [kernel.kallsyms] [k] _raw_spin_lock
0.39% fio [kernel.kallsyms] [k] sbitmap_find_bit
0.35% fio [kernel.kallsyms] [k] cpupri_set
0.28% fio [kernel.kallsyms] [k] work_grab_pending
0.24% fio [kernel.kallsyms] [k] lookup_ioctx
0.23% fio [kernel.kallsyms] [k] __schedule
…
…
0.00% fio [kernel.kallsyms] [k] blk_attempt_plug_merge
Callgraph.txt
0.35% fio [kernel.kallsyms] [k] cpupri_set
|
---cpupri_set
|
|--0.17%--arch_local_irq_restore.part.0
| |
| |--0.14%--finish_task_switch.isra.0
| | __schedule
| | |
| | |--0.13%--schedule
| | | |
| | | |--0.07%--read_events
…..
|--0.13%--__enqueue_rt_entity
| enqueue_task_rt
| enqueue_task
| ttwu_do_activate
From the above perf data, it looks like:
1. More time is spent in cpupri_set() with the plug: tasks are being
enqueued/dequeued more frequently, more IO scheduling.
2. More plug routines are called.
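To make the comparison concrete, the sketch below lines up the symbols that appear in both Top.txt excerpts above and prints how their share of samples changed. The percentages are copied from the excerpts. The dictionary and function names are illustrative only, and the elided parts of the reports are not represented.

```python
# Percent of samples per kernel symbol, copied from the two Top.txt excerpts.
WITH_PLUG = {
    "cpupri_set": 2.41,
    "queued_spin_lock_slowpath": 1.16,
    "sbitmap_find_bit": 0.75,
    "set_next_task_rt": 0.47,
    "pull_rt_task": 0.41,
    "enqueue_pushable_task": 0.34,
    "__blk_flush_plug": 0.02,
    "blk_add_rq_to_plug": 0.01,
    "blk_mq_flush_plug_list": 0.01,
    "blk_attempt_plug_merge": 0.00,
}
WITHOUT_PLUG = {
    "queued_spin_lock_slowpath": 0.67,
    "sched_balance_newidle": 0.64,
    "_raw_spin_lock": 0.47,
    "sbitmap_find_bit": 0.39,
    "cpupri_set": 0.35,
    "work_grab_pending": 0.28,
    "lookup_ioctx": 0.24,
    "__schedule": 0.23,
    "blk_attempt_plug_merge": 0.00,
}


def compare(with_plug, without_plug):
    """Return (symbol, with, without, difference) for symbols in both runs,
    ordered by largest difference first."""
    shared = set(with_plug) & set(without_plug)
    rows = [(sym, with_plug[sym], without_plug[sym],
             round(with_plug[sym] - without_plug[sym], 2)) for sym in shared]
    return sorted(rows, key=lambda row: (-row[3], row[0]))


if __name__ == "__main__":
    for sym, a, b, diff in compare(WITH_PLUG, WITHOUT_PLUG):
        print(f"{sym:28s} {a:5.2f}% -> {b:5.2f}%  ({diff:+.2f})")
```

Symbols that appear in only one excerpt are skipped, because the elided lines make it impossible to tell whether they were absent from the other run or simply not shown.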
If you need the full perf report, I can email or attach it.
Thanks for your help!
Wendy
* Re: Observing higher CPU utilization during random IO fio testing
2026-05-21 19:44 Observing higher CPU utilization during random IO fio testing Wen Xiong
2026-05-21 21:52 ` Jens Axboe
@ 2026-05-30 1:10 ` Ming Lei
1 sibling, 0 replies; 6+ messages in thread
From: Ming Lei @ 2026-05-30 1:10 UTC
To: Wen Xiong; +Cc: linux-block, axboe, jmoyer, Gjoyce, wenxiong
On Thu, May 21, 2026 at 02:44:22PM -0500, Wen Xiong wrote:
> Hi All,
>
> Our performance team observed the higher CPU utilization in RHEL10 compared
> to RHEL9.8, observed the similar issue in upstream kernel(v7.1-rc4) as well
> when running FIO random IO tests.
>
> System configuration:
> 47 dedicate cores
> 120 GB memory
> PCIe4 2-Port 64Gb FC Adapter
> FlashSystem: FS9500, 12 LUNs/FC port, 100G each LUN.
>
> Random IO tests are more CPU intensive than sequential IO tests due to
> several factors: more context switching, Interrupt Handling, cache
> Inefficiency etc. We found out the following patch which caused the higher
> CPU utilization in rhel10 and newer linux kernel:
>
> commit 060406c61c7cb4bbd82a02d179decca9c9bb3443 (HEAD)
> Author: Yu Kuai <yukuai3@huawei.com>
> Date: Thu May 9 20:38:25 2024 +0800
>
> block: add plug while submitting IO
>
> So that if caller didn't use plug, for example, __blkdev_direct_IO_simple()
> and __blkdev_direct_IO_async(), block layer can still benefit from caching
> nsec time in the plug.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> Link:
> https://lore.kernel.org/r/20240509123825.3225207-1-yukuai1@huaweicloud.com
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>
> We reverted above patch in rhel10 kernel and upstream 7.1-rc4, saw lower CPU
> utilization when doing the same FIO test.
>
> The patch adds plugging in __submit_bio() in block layer, maybe cause
> performance degradation:
> - Random IO tests have less merging, flush overhead.
> - More IO scheduler interaction, forces requests through scheduler instead
> of direct dispatch(direct dispatch to hardware queue)
> - Poor cache locality during plug operation
Yes, a regression is expected on QD=1 workloads.
Adding an inner plug only to cache the timestamp is not good from the
plug's functional viewpoint, because only the outer code path (io_uring,
libaio, ...) knows the exact IO batch size and can decide whether a plug
should be used.
Given that 060406c61c7c ("block: add plug while submitting IO") doesn't
provide any performance data, maybe it can be reverted.
I am wondering, why not move the timestamp cache into 'task_struct' so
that it gets wider use?
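As a rough illustration of that idea, the user-space model below caches the clock reading on a task object rather than on a plug, and drops it whenever the task is scheduled out. It is a sketch only: the Task class, its method names, and the invalidation rule are assumptions made for the example, not a proposed kernel interface.

```python
import time


class Task:
    """Toy stand-in for a task that owns a cached timestamp."""

    def __init__(self, clock=time.monotonic_ns):
        self._clock = clock
        self._cached_ns = None
        self.clock_reads = 0

    def time_get_ns(self):
        """Return the cached time, reading the clock only on a cache miss."""
        if self._cached_ns is None:
            self._cached_ns = self._clock()
            self.clock_reads += 1
        return self._cached_ns

    def schedule_out(self):
        """The cache is stale once the task stops running, so drop it."""
        self._cached_ns = None


if __name__ == "__main__":
    task = Task()
    requests = 0
    for _ in range(3):        # three IO submissions in one run of the task
        for _ in range(4):    # several timestamp users per submission
            task.time_get_ns()
            requests += 1
    task.schedule_out()
    task.time_get_ns()
    requests += 1
    print(f"clock reads: {task.clock_reads} for {requests} timestamp requests")
```

In this model the cache no longer depends on a plug being active, so the timestamp can be shared without forcing single-request submissions through plug start and flush.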
Thanks,
Ming