Linux block layer
* Observing higher CPU utilization during random IO fio testing
From: Wen Xiong @ 2026-05-21 19:44 UTC
  To: linux-block, axboe; +Cc: tom.leiming, jmoyer, Gjoyce, wenxiong

Hi All,

Our performance team observed higher CPU utilization in RHEL 10 than in 
RHEL 9.8 when running fio random I/O tests. We see a similar issue in the 
upstream kernel (v7.1-rc4) as well.

System configuration:
47 dedicated cores
120 GB memory
PCIe4 2-Port 64Gb FC Adapter
FlashSystem: FS9500, 12 LUNs/FC port, 100G each LUN.

Random I/O tests are more CPU-intensive than sequential I/O tests due to 
several factors: more context switching, interrupt handling, cache 
inefficiency, etc. We found that the following patch causes the higher 
CPU utilization in RHEL 10 and newer Linux kernels:

commit 060406c61c7cb4bbd82a02d179decca9c9bb3443 (HEAD)
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Thu May 9 20:38:25 2024 +0800

block: add plug while submitting IO

So that if caller didn't use plug, for example, __blkdev_direct_IO_simple()
and __blkdev_direct_IO_async(), block layer can still benefit from caching
nsec time in the plug.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240509123825.3225207-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

We reverted the above patch in the RHEL 10 kernel and in upstream 7.1-rc4, 
and saw lower CPU utilization when running the same fio test.

The patch adds plugging in __submit_bio() in the block layer, which may 
cause the performance degradation:
- Random I/O tests get little benefit from merging but still pay the plug 
flush overhead.
- More I/O scheduler interaction: requests are forced through the 
scheduler instead of being dispatched directly to the hardware queue.
- Poor cache locality during the plug operation.

Below is some performance data that our performance team collected:

RHEL 9.8 vs. RHEL 10.0:
Iotype     qd        nj         rmix      mpstat busy delta    lparstat delta
Randrw     1         20         100       135%                 109%
Randrw     1         40         100       72%                  81%
Randrw     1         20         70        278%                 174%
Randrw     1         40         70        272%                 191%
Randrw     1         20         0         93%                  30%
Randrw     1         40         0         104%                 36%

RHEL 9.8 vs. RHEL 10 with the above plugging patch reverted in the block 
layer:
Iotype     qd        nj         rmix       mpstat busy delta    lparstat delta
Randrw     1         20         100        -12%                 20%
Randrw     1         40         100        -42%                 -4%
Randrw     1         20         70         70%                  71%
Randrw     1         40         70         51%                  60%
Randrw     1         20         0          -14%                 -43%
Randrw     1         40         0          -33%                 -51%
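
The thread does not say how the "delta" columns above are computed. As a 
minimal sketch, assuming each delta is the relative change in CPU busy 
against the RHEL 9.8 baseline, the arithmetic would look like this. The 
function name and the sample values are hypothetical and are not measured 
data:

```python
def busy_delta_percent(baseline_busy: float, candidate_busy: float) -> float:
    """Relative change of candidate vs. baseline CPU busy, in percent.

    Assumption: this is one plausible definition of "delta"; the thread
    does not confirm the formula the performance team used.
    """
    if baseline_busy <= 0:
        raise ValueError("baseline busy must be positive")
    return (candidate_busy - baseline_busy) / baseline_busy * 100.0


# Hypothetical illustration only: a baseline of 10% busy and a candidate
# of 23.5% busy would be reported as a +135% delta under this definition.
print(f"{busy_delta_percent(10.0, 23.5):+.0f}%")
```

Under this reading, a negative delta means the candidate kernel used less 
CPU than the baseline.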

Can a block layer expert help us resolve this high CPU utilization issue?
Let us know if you need more performance data or perf data.

Thanks a lot for your help!
Wendy



* Re: Observing higher CPU utilization during random IO fio testing
From: Jens Axboe @ 2026-05-21 21:52 UTC
  To: Wen Xiong, linux-block; +Cc: tom.leiming, jmoyer, Gjoyce, wenxiong, Yu Kuai

Let's CC Yu Kuai who wrote that commit, that might help.

-- 
Jens Axboe



* Re: Observing higher CPU utilization during random IO fio testing
From: Yu Kuai @ 2026-05-25  5:28 UTC
  To: Jens Axboe, Wen Xiong, linux-block
  Cc: tom.leiming, jmoyer, Gjoyce, wenxiong, yukuai

Hi,

On 2026/5/22 5:52, Jens Axboe wrote:
> On 5/21/26 1:44 PM, Wen Xiong wrote:
>> The patch adds plugging in __submit_bio() in block layer, maybe cause performance degradation:
>> - Random IO tests have less merging, flush overhead.
>> - More IO scheduler interaction, forces requests through scheduler instead of direct dispatch(direct dispatch to hardware queue)

I don't understand this point. Can you explain more? I don't think the plug affects whether a request goes through the scheduler or not.

>> - Poor cache locality during plug operation
>> Can a block layer expert help us resolve this high CPU utilization performance issue?

And I assume you're testing a raw disk, because filesystems should always enable plugging.

>> Let us know if you need more performance data or other perf data.

Yes, perf data will be helpful. Please also describe your test in detail, and I'll
check whether I can reproduce it.

> Let's CC Yu Kuai who wrote that commit, that might help.
>
-- 
Thanks,
Kuai


* Re: Observing higher CPU utilization during random IO fio testing
From: Wen Xiong @ 2026-05-26 15:28 UTC
  To: yukuai; +Cc: Jens Axboe, linux-block, tom.leiming, jmoyer, Gjoyce, wenxiong

On 2026-05-25 00:28, Yu Kuai wrote:
> Hi,
> 
> On 2026/5/22 5:52, Jens Axboe wrote:
>>> - More IO scheduler interaction, forces requests through scheduler 
>>> instead of direct dispatch(direct dispatch to hardware queue)
> I don't understand this point. Can you explain more? I think plug
> should not matter if request go through scheduler or not.

My understanding is:
Random I/O tests are more CPU-intensive.
The plug delays dispatching I/Os directly to the hardware queue (the 
quick path).
The plug batches multiple I/O requests and defers submitting them until 
blk_flush_plug() is called (which dispatches them to the hardware queue) 
or the task is scheduled out.
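
The batching behavior described above can be pictured with a toy model. 
This is a sketch in Python, not kernel code: ToyPlug and submit_with_plug 
are hypothetical names, and the model only counts batches and flushes. It 
does not measure CPU cost and does not show that plugging causes the 
regression. It only shows that when each submission carries a single 
request, as with iodepth=1, every plug is flushed holding one request, so 
there is nothing to batch.

```python
class ToyPlug:
    """Toy model of a per-task plug: holds requests until flushed."""

    def __init__(self, dispatch):
        self.pending = []
        self.dispatch = dispatch  # callable that receives a list of requests
        self.flushes = 0

    def add(self, request):
        # Queue the request instead of dispatching it immediately.
        self.pending.append(request)

    def flush(self):
        # Hand the whole batch to the dispatcher, then start a new list.
        if self.pending:
            self.dispatch(self.pending)
            self.pending = []
        self.flushes += 1


def submit_with_plug(requests_per_call, calls, dispatched):
    """Each submission call opens a plug, queues its requests, and flushes."""
    flushes = 0
    for call in range(calls):
        plug = ToyPlug(dispatched.append)
        for i in range(requests_per_call):
            plug.add((call, i))
        plug.flush()
        flushes += plug.flushes
    return flushes


qd1 = []
flushes_qd1 = submit_with_plug(1, 4, qd1)          # one request per call
batched = []
flushes_batched = submit_with_plug(4, 1, batched)  # four requests in one call

print([len(b) for b in qd1], flushes_qd1)          # [1, 1, 1, 1] 4
print([len(b) for b in batched], flushes_batched)  # [4] 1
```

In the first case the four requests need four plug setups and four 
flushes. In the second case one plug covers all four requests.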
> 

> And I assume you're testing raw disk, because filesystems should
> always enable plug.
> 
Yes. FIO random IO tests over raw disks.

> Yes, perf data will be helpful. And please show your test in details 
> and I'll
> check if I can reproduce it.

System config:
47 dedicated cores
120 GB memory
PCIe4 2-Port 64Gb FC Adapter
64Gb FC switch
FlashSystem: FS9500, 12 LUNs/FC port

Below is the fio config for rwmixread=100:
[global]
randrepeat=0
buffered=0
direct=1
norandommap=1
group_reporting=1
size=80g
ioengine=libaio
rw=randrw
bs=4k
iodepth=1
rwmixread=100
runtime=600
ramp_time=5
time_based=1
numjobs=20

[job1]
filename=/dev/dm-2

[job2]
filename=/dev/dm-3
...
24 jobs in total.
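
For anyone trying to reproduce the other rows of the tables, the job file 
above can be generated per rwmixread/numjobs combination. The following is 
a minimal sketch in Python. build_fio_job is a hypothetical helper, and 
the device list is an assumption: only /dev/dm-2 and /dev/dm-3 are named 
above, so the remaining devices have to be filled in for the system under 
test.

```python
# Global options copied from the job file above; rwmixread and numjobs
# are passed in because they vary between test rows.
GLOBAL_OPTIONS = {
    "randrepeat": 0,
    "buffered": 0,
    "direct": 1,
    "norandommap": 1,
    "group_reporting": 1,
    "size": "80g",
    "ioengine": "libaio",
    "rw": "randrw",
    "bs": "4k",
    "iodepth": 1,
    "runtime": 600,
    "ramp_time": 5,
    "time_based": 1,
}


def build_fio_job(devices, rwmixread, numjobs):
    """Return fio job-file text with one [jobN] section per device."""
    lines = ["[global]"]
    lines += [f"{key}={value}" for key, value in GLOBAL_OPTIONS.items()]
    lines += [f"rwmixread={rwmixread}", f"numjobs={numjobs}", ""]
    for index, device in enumerate(devices, start=1):
        lines += [f"[job{index}]", f"filename={device}", ""]
    return "\n".join(lines)


# Only the two devices named in the original config are listed here.
print(build_fio_job(["/dev/dm-2", "/dev/dm-3"], rwmixread=70, numjobs=40))
```

The printed text can be saved to a file and passed to fio as the job file.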

We collected some perf data. What kind of perf data do you want? Let me 
know.
Thanks,
Wendy



