* Observing higher CPU utilization during random IO fio testing
From: Wen Xiong @ 2026-05-21 19:44 UTC (permalink / raw)
To: linux-block, axboe; +Cc: tom.leiming, jmoyer, Gjoyce, wenxiong
Hi All,
Our performance team observed higher CPU utilization in RHEL 10 than
in RHEL 9.8 when running fio random IO tests. We observed a similar
issue in the upstream kernel (v7.1-rc4) as well.
System configuration:
47 dedicated cores
120 GB memory
PCIe4 2-Port 64Gb FC Adapter
FlashSystem: FS9500, 12 LUNs/FC port, 100G each LUN.
Random IO tests are more CPU intensive than sequential IO tests due to
several factors: more context switching, interrupt handling, cache
inefficiency, etc. We found that the following patch causes the
higher CPU utilization in RHEL 10 and newer Linux kernels:
commit 060406c61c7cb4bbd82a02d179decca9c9bb3443 (HEAD)
Author: Yu Kuai <yukuai3@huawei.com>
Date: Thu May 9 20:38:25 2024 +0800

    block: add plug while submitting IO

    So that if caller didn't use plug, for example, __blkdev_direct_IO_simple()
    and __blkdev_direct_IO_async(), block layer can still benefit from caching
    nsec time in the plug.

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240509123825.3225207-1-yukuai1@huaweicloud.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
We reverted the above patch in the RHEL 10 kernel and in upstream
7.1-rc4, and saw lower CPU utilization when running the same fio test.
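For context on the "caching nsec time in the plug" that the commit message mentions, here is a minimal user-space Python model. It is not kernel code. It assumes that, while a plug is active, the first timestamp request reads the clock and later requests reuse the cached value. The names ToyPlug, toy_time_get_ns() and submit_one_io() are hypothetical and exist only for this illustration.

```python
import time


class ToyPlug:
    """Hypothetical stand-in for a plug; NOT the kernel's struct blk_plug."""

    def __init__(self):
        self.cached_ns = None  # timestamp cached while the plug is active
        self.clock_reads = 0   # how many times the real clock was read


def toy_time_get_ns(plug):
    """Return a timestamp, reading the clock at most once per active plug."""
    if plug is None:
        # No plug: every call pays for a real clock read.
        return time.monotonic_ns()
    if plug.cached_ns is None:
        plug.cached_ns = time.monotonic_ns()
        plug.clock_reads += 1
    return plug.cached_ns


def submit_one_io(plug, timestamps_needed=3):
    """Pretend one IO submission needs several timestamps (assumed count)."""
    return [toy_time_get_ns(plug) for _ in range(timestamps_needed)]


plug = ToyPlug()
stamps = submit_one_io(plug)
print(plug.clock_reads, len(set(stamps)))  # prints: 1 1
```

In this model, three timestamp requests cost one clock read when a plug is present. The sketch says nothing about how that saving compares with the cost of setting up and flushing a plug for every submission, which is the question raised in this thread.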
The patch adds plugging in __submit_bio() in the block layer, which may
cause performance degradation:
- Random IO tests see little merging, so the plug mostly adds flush
overhead.
- More IO scheduler interaction: requests are forced through the
scheduler instead of being dispatched directly to the hardware queue.
- Poor cache locality during plug operation.
Below is some performance data that our performance team collected:
RHEL 9.8 compared with RHEL 10.0:
Iotype  qd  nj  rmix  mpstat busy delta  lparstat delta
Randrw   1  20   100               135%            109%
Randrw   1  40   100                72%             81%
Randrw   1  20    70               278%            174%
Randrw   1  40    70               272%            191%
Randrw   1  20     0                93%             30%
Randrw   1  40     0               104%             36%
RHEL 9.8 compared with RHEL 10 with the above plugging patch reverted in
the block layer:
Iotype  qd  nj  rmix  mpstat busy delta  lparstat delta
Randrw   1  20   100               -12%             20%
Randrw   1  40   100               -42%             -4%
Randrw   1  20    70                70%             71%
Randrw   1  40    70                51%             60%
Randrw   1  20     0               -14%            -43%
Randrw   1  40     0               -33%            -51%
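The thread does not say how the delta columns are computed. Assuming they are the percent change in CPU busy relative to the RHEL 9.8 baseline, the arithmetic would look like the sketch below. The busy values passed in are made up for illustration and are not measurements from this thread; percent_delta() is a hypothetical helper.

```python
def percent_delta(baseline_busy, new_busy):
    """Percent change of new_busy relative to baseline_busy (assumed definition)."""
    if baseline_busy == 0:
        raise ValueError("baseline must be non-zero")
    return (new_busy - baseline_busy) * 100.0 / baseline_busy


# Hypothetical CPU busy percentages, NOT data from the tables above.
increase = percent_delta(10.0, 23.5)   # busy rose from 10.0% to 23.5%
decrease = percent_delta(25.0, 22.0)   # busy fell from 25.0% to 22.0%
print(f"{increase:.0f}% {decrease:.0f}%")  # prints: 135% -12%
```

Under this definition a positive delta means the newer kernel used more CPU than the baseline, and a negative delta means it used less.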
Can a block layer expert help us resolve this high CPU utilization
performance issue?
Let us know if you need more performance data or other perf data.
Thanks a lot for your help!
Wendy
* Re: Observing higher CPU utilization during random IO fio testing
From: Jens Axboe @ 2026-05-21 21:52 UTC (permalink / raw)
To: Wen Xiong, linux-block; +Cc: tom.leiming, jmoyer, Gjoyce, wenxiong, Yu Kuai
On 5/21/26 1:44 PM, Wen Xiong wrote:
> Can a block layer expert help us resolve this high CPU utilization performance issue?
> Let us know if you need more performance data or other perf data.
Let's CC Yu Kuai who wrote that commit, that might help.
--
Jens Axboe
* Re: Observing higher CPU utilization during random IO fio testing
From: Yu Kuai @ 2026-05-25 5:28 UTC (permalink / raw)
To: Jens Axboe, Wen Xiong, linux-block
Cc: tom.leiming, jmoyer, Gjoyce, wenxiong, yukuai
Hi,
On 2026/5/22 5:52, Jens Axboe wrote:
> On 5/21/26 1:44 PM, Wen Xiong wrote:
>> The patch adds plugging in __submit_bio() in the block layer, which may cause performance degradation:
>> - Random IO tests see little merging, so the plug mostly adds flush overhead.
>> - More IO scheduler interaction: requests are forced through the scheduler instead of being dispatched directly to the hardware queue.
I don't understand this point. Can you explain more? I think the plug should not affect whether a request goes through the scheduler or not.
>> - Poor cache locality during plug operation
>> Can a block layer expert help us resolve this high CPU utilization performance issue?
And I assume you're testing on raw disks, because filesystems should always enable plugging.
>> Let us know if you need more performance data or other perf data.
Yes, perf data would be helpful. Please also describe your test in detail, and
I'll check whether I can reproduce it.
> Let's CC Yu Kuai who wrote that commit, that might help.
>
--
Thanks,
Kuai
* Re: Observing higher CPU utilization during random IO fio testing
From: Wen Xiong @ 2026-05-26 15:28 UTC (permalink / raw)
To: yukuai; +Cc: Jens Axboe, linux-block, tom.leiming, jmoyer, Gjoyce, wenxiong
On 2026-05-25 00:28, Yu Kuai wrote:
> Hi,
>
> On 2026/5/22 5:52, Jens Axboe wrote:
>>> - More IO scheduler interaction: requests are forced through the
>>> scheduler instead of being dispatched directly to the hardware queue.
> I don't understand this point. Can you explain more? I think the plug
> should not affect whether a request goes through the scheduler or not.
My understanding is:
Random IO tests are more CPU intensive.
The plug delays the direct dispatch of IOs to the hardware queue (the
quick path).
The plug collects multiple IO requests into a batch and defers
submitting them until blk_flush_plug() is called (which dispatches to
the hardware queue) or the task is scheduled out.
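The batching and deferral described above can be sketched with a toy user-space model in Python. It is not kernel code and does not measure CPU cost. ToyQueue, ToyPlug and the three submit_* functions are hypothetical names, and the model assumes that a plug simply holds requests in a list until it is flushed.

```python
class ToyQueue:
    """Hypothetical hardware queue that counts how often it is handed work."""

    def __init__(self):
        self.dispatch_calls = 0
        self.dispatched = []

    def dispatch(self, requests):
        self.dispatch_calls += 1
        self.dispatched.extend(requests)


class ToyPlug:
    """Hypothetical plug: holds requests until flush() is called."""

    def __init__(self, queue):
        self.queue = queue
        self.pending = []

    def add(self, request):
        self.pending.append(request)  # deferred, not dispatched yet

    def flush(self):
        if self.pending:
            self.queue.dispatch(self.pending)
            self.pending = []


def submit_direct(queue, requests):
    """No plug: each request goes straight to the queue."""
    for r in requests:
        queue.dispatch([r])


def submit_with_caller_plug(queue, requests):
    """One plug held by the caller across all requests, flushed once."""
    plug = ToyPlug(queue)
    for r in requests:
        plug.add(r)
    plug.flush()


def submit_with_per_io_plug(queue, requests):
    """A fresh plug per submission, as with one IO in flight at a time."""
    for r in requests:
        plug = ToyPlug(queue)
        plug.add(r)
        plug.flush()


results = {}
for name, fn in (("direct", submit_direct),
                 ("caller_plug", submit_with_caller_plug),
                 ("per_io_plug", submit_with_per_io_plug)):
    q = ToyQueue()
    fn(q, list(range(8)))
    results[name] = q.dispatch_calls
print(results)  # prints: {'direct': 8, 'caller_plug': 1, 'per_io_plug': 8}
```

In this model a plug that only ever holds one request gives the same number of dispatch calls as no plug at all, while still paying for plug setup and flush. Whether that bookkeeping explains the measured CPU difference is exactly what perf data would need to confirm.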
>
> And I assume you're testing on raw disks, because filesystems should
> always enable plugging.
>
Yes, these are fio random IO tests on raw disks.
> Yes, perf data would be helpful. Please also describe your test in
> detail, and I'll check whether I can reproduce it.
System config:
47 dedicated cores
120 GB memory
PCIe4 2-Port 64Gb FC Adapter
64Gb FC switch
FlashSystem: FS9500, 12 LUNs/FC port
Below is the fio config for rwmixread=100:
[global]
randrepeat=0
buffered=0
direct=1
norandommap=1
group_reporting=1
size=80g
ioengine=libaio
rw=randrw
bs=4k
iodepth=1
rwmixread=100
runtime=600
ramp_time=5
time_based=1
numjobs=20
[job1]
filename=/dev/dm-2
[job2]
filename=/dev/dm-3
...
24 jobs in total.
We collected some perf data. What kind of perf data do you want? Let me
know.
Thanks,
Wendy