Linux block layer
* Observing higher CPU utilization during random IO fio testing
@ 2026-05-21 19:44 Wen Xiong
  2026-05-21 21:52 ` Jens Axboe
  2026-05-30  1:10 ` Ming Lei
  0 siblings, 2 replies; 6+ messages in thread
From: Wen Xiong @ 2026-05-21 19:44 UTC
  To: linux-block, axboe; +Cc: tom.leiming, jmoyer, Gjoyce, wenxiong

Hi All,

Our performance team observed higher CPU utilization in RHEL 10 compared
to RHEL 9.8 when running FIO random IO tests. We observed a similar issue
in the upstream kernel (v7.1-rc4) as well.

System configuration:
47 dedicated cores
120 GB memory
PCIe4 2-Port 64Gb FC Adapter
FlashSystem: FS9500, 12 LUNs/FC port, 100G each LUN.

Random IO tests are more CPU intensive than sequential IO tests due to
several factors: more context switching, interrupt handling, cache
inefficiency, etc. We found that the following patch causes the higher
CPU utilization in RHEL 10 and newer Linux kernels:

commit 060406c61c7cb4bbd82a02d179decca9c9bb3443 (HEAD)
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Thu May 9 20:38:25 2024 +0800

block: add plug while submitting IO

So that if caller didn't use plug, for example, __blkdev_direct_IO_simple()
and __blkdev_direct_IO_async(), block layer can still benefit from caching
nsec time in the plug.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240509123825.3225207-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

We reverted the above patch in the RHEL 10 kernel and in upstream 7.1-rc4,
and saw lower CPU utilization when running the same FIO test.

The patch adds plugging in __submit_bio() in the block layer, which may
cause the performance degradation:
- Random IO tests see little merging but still pay the plug flush overhead.
- More IO scheduler interaction: requests are forced through the scheduler
instead of being dispatched directly to the hardware queue.
- Poor cache locality during plug operations.

Below are some performance data that our performance team collected:

RHEL 9.8 compared with RHEL 10.0:
Iotype     qd        nj         rmix      mpstat busy delta    lparstat delta
Randrw     1         20         100       135%                 109%
Randrw     1         40         100       72%                  81%
Randrw     1         20         70        278%                 174%
Randrw     1         40         70        272%                 191%
Randrw     1         20         0         93%                  30%
Randrw     1         40         0         104%                 36%

RHEL 9.8 compared with RHEL 10 with the above plugging patch reverted in
the block layer:
Iotype     qd        nj         rmix       mpstat busy delta    lparstat delta
Randrw     1         20         100        -12%                 20%
Randrw     1         40         100        -42%                 -4%
Randrw     1         20         70         70%                  71%
Randrw     1         40         70         51%                  60%
Randrw     1         20         0          -14%                 -43%
Randrw     1         40         0          -33%                 -51%
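
To make the effect of the revert easier to see, the mpstat busy deltas from
the two tables can be put side by side. The sketch below only tabulates the
numbers already posted, reading the nj=40, rmix=70 entry of the second table
as 51%. It assumes both deltas are relative to the same RHEL 9.8 baseline,
so their difference is expressed in percentage points of that baseline. The
names with_plug, reverted and gap are illustrative only.

```python
# mpstat busy delta versus RHEL 9.8, keyed by (numjobs, rwmixread).
# Values are copied from the two tables in this message.
with_plug = {(20, 100): 135, (40, 100): 72, (20, 70): 278,
             (40, 70): 272, (20, 0): 93, (40, 0): 104}
reverted = {(20, 100): -12, (40, 100): -42, (20, 70): 70,
            (40, 70): 51, (20, 0): -14, (40, 0): -33}


def gap(key):
    """Difference between the two deltas, in percentage points.

    Assumption: both deltas use the same RHEL 9.8 run as their baseline.
    """
    return with_plug[key] - reverted[key]


for key in sorted(with_plug, reverse=True):
    nj, rmix = key
    print(f"nj={nj:<3} rmix={rmix:<4} with plug {with_plug[key]:>4}%  "
          f"reverted {reverted[key]:>4}%  gap {gap(key):>4} points")
```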

Can a block layer expert help us resolve this high CPU utilization 
performance issue?
Let us know if you need more performance data or other perf data.

Thanks a lot for your help!
Wendy



* Re: Observing higher CPU utilization during random IO fio testing
  2026-05-21 19:44 Observing higher CPU utilization during random IO fio testing Wen Xiong
@ 2026-05-21 21:52 ` Jens Axboe
  2026-05-25  5:28   ` Yu Kuai
  2026-05-30  1:10 ` Ming Lei
  1 sibling, 1 reply; 6+ messages in thread
From: Jens Axboe @ 2026-05-21 21:52 UTC
  To: Wen Xiong, linux-block; +Cc: tom.leiming, jmoyer, Gjoyce, wenxiong, Yu Kuai

On 5/21/26 1:44 PM, Wen Xiong wrote:
> Hi All,
> 
> Our performance team observed the higher CPU utilization in RHEL10 compared to RHEL9.8, observed the similar issue in upstream kernel(v7.1-rc4) as well when running FIO random IO tests.
> 
> System configuration:
> 47 dedicated cores
> 120 GB memory
> PCIe4 2-Port 64Gb FC Adapter
> FlashSystem: FS9500, 12 LUNs/FC port, 100G each LUN.
> 
> Random IO tests are more CPU intensive than sequential IO tests due to several factors: more context switching, Interrupt Handling,  cache Inefficiency etc. We found out the following patch which caused the higher CPU utilization in rhel10 and newer linux kernel:
> 
> commit 060406c61c7cb4bbd82a02d179decca9c9bb3443 (HEAD)
> Author: Yu Kuai <yukuai3@huawei.com>
> Date:   Thu May 9 20:38:25 2024 +0800
> 
> block: add plug while submitting IO
> 
> So that if caller didn't use plug, for example, __blkdev_direct_IO_simple()
> and __blkdev_direct_IO_async(), block layer can still benefit from caching
> nsec time in the plug.
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> Link: https://lore.kernel.org/r/20240509123825.3225207-1-yukuai1@huaweicloud.com
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> We reverted above patch in rhel10 kernel and upstream 7.1-rc4, saw lower CPU utilization when doing the same FIO test.
> 
> The patch adds plugging in __submit_bio() in block layer, maybe cause performance degradation:
> - Random IO tests have less merging, flush overhead.
> - More IO scheduler interaction, forces requests through scheduler instead of direct dispatch(direct dispatch to hardware queue)
> - Poor cache locality during plug operation
> 
> Below are some performance data that our performance team collected:
> 
> RHEL9.8 comparison RHEL10.0
> Iotype     qd        nj         rmix      mpstat busy delta    lparstat delta
> Randrw     1         20         100       135%                 109%
> Randrw     1         40         100       72%                  81%
> Randrw     1         20         70        278%                 174%
> Randrw     1         40         70        272%                 191%
> Randrw     1         20         0         93%                  30%
> Randrw     1         40         0         104%                 36%
> 
> RHEL 9.8 comparison RHEL10 with reverting above plugging patch in block layer
> Iotype     qd        nj         rmix       mpstat busy delta    lparstat delta
> Randrw     1         20         100        -12%                 20%
> Randrw     1         40         100        -42%                 -4%
> Randrw     1         20         70         70%                  71%
> Randrw     1         40         70         51%                  60%
> Randrw     1         20         0          -14%                 -43%
> Randrw     1         40         0          -33%                 -51%
> 
> Can a block layer expert help us resolve this high CPU utilization performance issue?
> Let us know if you need more performance data or other perf data.

Let's CC Yu Kuai who wrote that commit, that might help.

-- 
Jens Axboe



* Re: Observing higher CPU utilization during random IO fio testing
  2026-05-21 21:52 ` Jens Axboe
@ 2026-05-25  5:28   ` Yu Kuai
  2026-05-26 15:28     ` Wen Xiong
  2026-05-29 17:13     ` Wen Xiong
  0 siblings, 2 replies; 6+ messages in thread
From: Yu Kuai @ 2026-05-25  5:28 UTC
  To: Jens Axboe, Wen Xiong, linux-block
  Cc: tom.leiming, jmoyer, Gjoyce, wenxiong, yukuai

Hi,

On 2026/5/22 5:52, Jens Axboe wrote:
> On 5/21/26 1:44 PM, Wen Xiong wrote:
>> Hi All,
>>
>> Our performance team observed the higher CPU utilization in RHEL10 compared to RHEL9.8, observed the similar issue in upstream kernel(v7.1-rc4) as well when running FIO random IO tests.
>>
>> System configuration:
>> 47 dedicated cores
>> 120 GB memory
>> PCIe4 2-Port 64Gb FC Adapter
>> FlashSystem: FS9500, 12 LUNs/FC port, 100G each LUN.
>>
>> Random IO tests are more CPU intensive than sequential IO tests due to several factors: more context switching, Interrupt Handling,  cache Inefficiency etc. We found out the following patch which caused the higher CPU utilization in rhel10 and newer linux kernel:
>>
>> commit 060406c61c7cb4bbd82a02d179decca9c9bb3443 (HEAD)
>> Author: Yu Kuai <yukuai3@huawei.com>
>> Date:   Thu May 9 20:38:25 2024 +0800
>>
>> block: add plug while submitting IO
>>
>> So that if caller didn't use plug, for example, __blkdev_direct_IO_simple()
>> and __blkdev_direct_IO_async(), block layer can still benefit from caching
>> nsec time in the plug.
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> Link: https://lore.kernel.org/r/20240509123825.3225207-1-yukuai1@huaweicloud.com
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>
>> We reverted above patch in rhel10 kernel and upstream 7.1-rc4, saw lower CPU utilization when doing the same FIO test.
>>
>> The patch adds plugging in __submit_bio() in block layer, maybe cause performance degradation:
>> - Random IO tests have less merging, flush overhead.
>> - More IO scheduler interaction, forces requests through scheduler instead of direct dispatch(direct dispatch to hardware queue)

I don't understand this point. Can you explain more? I think plugging should
not affect whether a request goes through the scheduler or not.

>> - Poor cache locality during plug operation
>>
>> Below are some performance data that our performance team collected:
>>
>> RHEL9.8 comparison RHEL10.0
>> Iotype     qd        nj         rmix      mpstat busy delta    lparstat delta
>> Randrw     1         20         100       135%                 109%
>> Randrw     1         40         100       72%                  81%
>> Randrw     1         20         70        278%                 174%
>> Randrw     1         40         70        272%                 191%
>> Randrw     1         20         0         93%                  30%
>> Randrw     1         40         0         104%                 36%
>>
>> RHEL 9.8 comparison RHEL10 with reverting above plugging patch in block layer
>> Iotype     qd        nj         rmix       mpstat busy delta    lparstat delta
>> Randrw     1         20         100        -12%                 20%
>> Randrw     1         40         100        -42%                 -4%
>> Randrw     1         20         70         70%                  71%
>> Randrw     1         40         70         51%                  60%
>> Randrw     1         20         0          -14%                 -43%
>> Randrw     1         40         0          -33%                 -51%
>>
>> Can a block layer expert help us resolve this high CPU utilization performance issue?

And I assume you're testing raw disks, because filesystems should always use a plug.

>> Let us know if you need more performance data or other perf data.

Yes, perf data will be helpful. Please also describe your test in detail and
I'll check if I can reproduce it.

> Let's CC Yu Kuai who wrote that commit, that might help.
>
-- 
Thanks,
Kuai


* Re: Observing higher CPU utilization during random IO fio testing
  2026-05-25  5:28   ` Yu Kuai
@ 2026-05-26 15:28     ` Wen Xiong
  2026-05-29 17:13     ` Wen Xiong
  1 sibling, 0 replies; 6+ messages in thread
From: Wen Xiong @ 2026-05-26 15:28 UTC
  To: yukuai; +Cc: Jens Axboe, linux-block, tom.leiming, jmoyer, Gjoyce, wenxiong

On 2026-05-25 00:28, Yu Kuai wrote:
> Hi,
> 
> On 2026/5/22 5:52, Jens Axboe wrote:
>>> - More IO scheduler interaction, forces requests through scheduler 
>>> instead of direct dispatch(direct dispatch to hardware queue)
> I don't understand this point. Can you explain more? I think plug
> should not matter if request go through scheduler or not.

My understanding is:
Random IO tests are more CPU intensive.
A plug delays dispatching IOs directly to the hardware queue (the quick path).
A plug collects multiple IO requests into a batch and defers submitting them
until blk_flush_plug() is called (dispatch to the hardware queue) or the task
is scheduled out.
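
As a rough illustration of that understanding, here is a toy user-space
model of a plug. It is not kernel code, and ToyPlug and run() are names made
up for the sketch: requests are collected in a list and only handed to the
dispatch function when the plug is flushed. In this model, a submitter that
issues one request and then waits (as with iodepth=1) ends up with one
request per flush, so the batch never grows beyond a single entry.

```python
class ToyPlug:
    """Toy model of a plug: collect requests, dispatch them on flush."""

    def __init__(self, dispatch):
        self.pending = []
        self.dispatch = dispatch  # callable that receives a list of requests

    def add(self, request):
        self.pending.append(request)

    def flush(self):
        # Hand over whatever was collected; an empty plug dispatches nothing.
        if self.pending:
            self.dispatch(self.pending)
            self.pending = []


def run(requests, batch_size):
    """Submit requests, flushing the plug after every batch_size requests.

    The flush stands in for blk_finish_plug() or the task going to sleep.
    Returns the list of batches that reached the dispatch function.
    """
    batches = []
    plug = ToyPlug(batches.append)
    for count, request in enumerate(requests, start=1):
        plug.add(request)
        if count % batch_size == 0:
            plug.flush()
    plug.flush()  # final flush for a partial batch
    return batches


print(run(list(range(8)), batch_size=1))  # one request per flush
print(run(list(range(8)), batch_size=4))  # four requests per flush
```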
> 

> And I assume you're testing raw disk, because filesystems should
> always enable plug.
> 
Yes. FIO random IO tests over raw disks.

> Yes, perf data will be helpful. And please show your test in details 
> and I'll
> check if I can reproduce it.

System config:
47 dedicated cores
120 GB memory
PCIe4 2-Port 64Gb FC Adapter
64Gb FC switch
FlashSystem: FS9500, 12 LUNs/FC port

Below is the fio config for rwmixread=100:
[global]
randrepeat=0
buffered=0
direct=1
norandommap=1
group_reporting=1
size=80g
ioengine=libaio
rw=randrw
bs=4k
iodepth=1
rwmixread=100
runtime=600
ramp_time=5
time_based=1
numjobs=20

[job1]
filename=/dev/dm-2

[job2]
filename=/dev/dm-3
...
24 jobs in total.
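
For anyone trying to reproduce this, the job files for the other
rwmixread/numjobs combinations can be generated with a small script. This is
only a sketch: the option values are the ones listed above, make_job_file()
is a made-up helper name, and the device list passed in is a placeholder
(only /dev/dm-2 and /dev/dm-3 are shown in this mail, the remaining devices
are not listed here).

```python
# Global options exactly as listed in the job file above.
GLOBAL_OPTS = [
    ("randrepeat", 0), ("buffered", 0), ("direct", 1), ("norandommap", 1),
    ("group_reporting", 1), ("size", "80g"), ("ioengine", "libaio"),
    ("rw", "randrw"), ("bs", "4k"), ("iodepth", 1), ("rwmixread", 100),
    ("runtime", 600), ("ramp_time", 5), ("time_based", 1), ("numjobs", 20),
]


def make_job_file(devices, rwmixread=100, numjobs=20):
    """Return fio job file text with one [jobN] section per device."""
    overrides = {"rwmixread": rwmixread, "numjobs": numjobs}
    lines = ["[global]"]
    for key, value in GLOBAL_OPTS:
        lines.append(f"{key}={overrides.get(key, value)}")
    for index, device in enumerate(devices, start=1):
        lines += ["", f"[job{index}]", f"filename={device}"]
    return "\n".join(lines) + "\n"


# Placeholder device list: substitute the real set of devices under test.
print(make_job_file(["/dev/dm-2", "/dev/dm-3"], rwmixread=70, numjobs=40))
```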

We collected some perf data. What kind of perf data do you want? Let me
know.
Thanks,
Wendy




* Re: Observing higher CPU utilization during random IO fio testing
  2026-05-25  5:28   ` Yu Kuai
  2026-05-26 15:28     ` Wen Xiong
@ 2026-05-29 17:13     ` Wen Xiong
  1 sibling, 0 replies; 6+ messages in thread
From: Wen Xiong @ 2026-05-29 17:13 UTC
  To: yukuai; +Cc: Jens Axboe, linux-block, tom.leiming, jmoyer, Gjoyce, wenxiong

On 2026-05-25 00:28, Yu Kuai wrote:

> On 2026/5/22 5:52, Jens Axboe wrote:
> Yes, perf data will be helpful. And please show your test in details 
> and I'll
> check if I can reproduce it.

Hi Yu Kuai,
Have you reproduced the issue yet?

Below is some perf data we took while running random read test:

Test:
FIO random read with qdepth=1 nj=20. We saw higher CPU utilization in
this test case.

Perf record:
Start the fio run in one session and kick off the script in another session
while the test is running.

Perf report:
With blk_start_plug/blk_finish_plug before calling __submit_bio() in blk-core.c:
Top.txt
      2.41%  fio  [kernel.kallsyms]  [k] cpupri_set
      1.16%  fio  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
      0.75%  fio  [kernel.kallsyms]  [k] sbitmap_find_bit
      0.47%  fio  [kernel.kallsyms]  [k] set_next_task_rt
      0.41%  fio  [kernel.kallsyms]  [k] pull_rt_task
      0.34%  fio  [kernel.kallsyms]  [k] enqueue_pushable_task
       …
      0.02%  fio  [kernel.kallsyms]  [k] __blk_flush_plug
      0.01%  fio  [kernel.kallsyms]  [k] blk_add_rq_to_plug
      0.01%  fio  [kernel.kallsyms]  [k] blk_mq_flush_plug_list
      0.00%  fio  [kernel.kallsyms]  [k] blk_attempt_plug_merge

Callgraph.txt

      2.41%  fio  [kernel.kallsyms]  [k] cpupri_set
             |
             ---cpupri_set
                |
                |--1.15%--__enqueue_rt_entity
                |          enqueue_task_rt
                |          enqueue_task
                |          ttwu_do_activate


Perf report:
Without blk_start_plug/blk_finish_plug before calling __submit_bio():
Top.txt
      0.67%  fio  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
      0.64%  fio  [kernel.kallsyms]  [k] sched_balance_newidle
      0.47%  fio  [kernel.kallsyms]  [k] _raw_spin_lock
      0.39%  fio  [kernel.kallsyms]  [k] sbitmap_find_bit
      0.35%  fio  [kernel.kallsyms]  [k] cpupri_set
      0.28%  fio  [kernel.kallsyms]  [k] work_grab_pending
      0.24%  fio  [kernel.kallsyms]  [k] lookup_ioctx
      0.23%  fio  [kernel.kallsyms]  [k] __schedule
       …
      0.00%  fio  [kernel.kallsyms]  [k] blk_attempt_plug_merge

Callgraph.txt

      0.35%  fio  [kernel.kallsyms]  [k] cpupri_set
             |
             ---cpupri_set
                |
                |--0.17%--arch_local_irq_restore.part.0
                |          |
                |          |--0.14%--finish_task_switch.isra.0
                |          |          __schedule
                |          |          |
                |          |          |--0.13%--schedule
                |          |          |          |
                |          |          |          |--0.07%--read_events
                …
                |--0.13%--__enqueue_rt_entity
                |          enqueue_task_rt
                |          enqueue_task
                |          ttwu_do_activate

From the above perf data, it looks like:
1. More time is spent in cpupri_set(): tasks are being enqueued/dequeued
frequently, more IO scheduling.
2. More plug routines are called.
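
The comparison between the two runs can also be scripted. The sketch below
is only an illustration: it parses a few of the Top.txt entries quoted in
this mail (assuming each entry sits on a single line) and prints the
overhead of the symbols that appear in both runs. parse_top() and the LINE
pattern are made-up helpers, not part of perf.

```python
import re

# Matches lines such as: "  2.41%  fio  [kernel.kallsyms]  [k] cpupri_set"
LINE = re.compile(
    r"^\s*(\d+\.\d+)%\s+\S+\s+\[kernel\.kallsyms\]\s+\[k\]\s+(\S+)\s*$")


def parse_top(text):
    """Map kernel symbol -> overhead percent for each matching line."""
    result = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            result[match.group(2)] = float(match.group(1))
    return result


# A subset of the entries from the two Top.txt listings in this mail.
with_plug = parse_top("""
      2.41%  fio  [kernel.kallsyms]  [k] cpupri_set
      1.16%  fio  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
      0.75%  fio  [kernel.kallsyms]  [k] sbitmap_find_bit
""")
without_plug = parse_top("""
      0.67%  fio  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
      0.39%  fio  [kernel.kallsyms]  [k] sbitmap_find_bit
      0.35%  fio  [kernel.kallsyms]  [k] cpupri_set
""")

for symbol in sorted(with_plug, key=with_plug.get, reverse=True):
    if symbol in without_plug:
        print(f"{symbol:<28} {with_plug[symbol]:5.2f}% -> "
              f"{without_plug[symbol]:5.2f}%")
```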

If you need the full perf report, I can email it to you.

Thanks for your help!
Wendy


* Re: Observing higher CPU utilization during random IO fio testing
  2026-05-21 19:44 Observing higher CPU utilization during random IO fio testing Wen Xiong
  2026-05-21 21:52 ` Jens Axboe
@ 2026-05-30  1:10 ` Ming Lei
  1 sibling, 0 replies; 6+ messages in thread
From: Ming Lei @ 2026-05-30  1:10 UTC
  To: Wen Xiong; +Cc: linux-block, axboe, jmoyer, Gjoyce, wenxiong

On Thu, May 21, 2026 at 02:44:22PM -0500, Wen Xiong wrote:
> Hi All,
> 
> Our performance team observed the higher CPU utilization in RHEL10 compared
> to RHEL9.8, observed the similar issue in upstream kernel(v7.1-rc4) as well
> when running FIO random IO tests.
> 
> System configuration:
> 47 dedicated cores
> 120 GB memory
> PCIe4 2-Port 64Gb FC Adapter
> FlashSystem: FS9500, 12 LUNs/FC port, 100G each LUN.
> 
> Random IO tests are more CPU intensive than sequential IO tests due to
> several factors: more context switching, Interrupt Handling,  cache
> Inefficiency etc. We found out the following patch which caused the higher
> CPU utilization in rhel10 and newer linux kernel:
> 
> commit 060406c61c7cb4bbd82a02d179decca9c9bb3443 (HEAD)
> Author: Yu Kuai <yukuai3@huawei.com>
> Date:   Thu May 9 20:38:25 2024 +0800
> 
> block: add plug while submitting IO
> 
> So that if caller didn't use plug, for example, __blkdev_direct_IO_simple()
> and __blkdev_direct_IO_async(), block layer can still benefit from caching
> nsec time in the plug.
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> Link:
> https://lore.kernel.org/r/20240509123825.3225207-1-yukuai1@huaweicloud.com
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> We reverted above patch in rhel10 kernel and upstream 7.1-rc4, saw lower CPU
> utilization when doing the same FIO test.
> 
> The patch adds plugging in __submit_bio() in block layer, maybe cause
> performance degradation:
> - Random IO tests have less merging, flush overhead.
> - More IO scheduler interaction, forces requests through scheduler instead
> of direct dispatch(direct dispatch to hardware queue)
> - Poor cache locality during plug operation

Yes, a regression is expected on QD=1 workloads.

Adding an inner plug only to cache the timestamp is not good from the point
of view of what a plug is for, because only the outer code path (io_uring,
libaio, ...) knows the exact IO batch size and can decide whether a plug
should be used.

Given that 060406c61c7c ("block: add plug while submitting IO") doesn't come
with any performance data, maybe it can be reverted.

I am wondering: why not move the timestamp cache into 'task_struct' so that
it gets wider use?
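
A rough user-space sketch of that idea, only to show the shape. Nothing
here is kernel code, and ToyTask, time_get_ns() and schedule() are made-up
names: the cached timestamp lives with the task instead of with a plug, so
no plug is needed just to reuse it. Dropping the cache when the task
schedules is an assumption of this sketch, made so that a stale value is
not reused after a sleep.

```python
import time


class ToyTask:
    """Toy stand-in for a task that carries its own cached timestamp."""

    def __init__(self, clock=time.monotonic_ns):
        self.clock = clock
        self.cached_ns = None   # None means "no valid cached value"
        self.clock_reads = 0    # how often the real clock was consulted

    def time_get_ns(self):
        # Read the clock once, then reuse the value until it is invalidated.
        if self.cached_ns is None:
            self.cached_ns = self.clock()
            self.clock_reads += 1
        return self.cached_ns

    def schedule(self):
        # Assumption: the cache is dropped whenever the task sleeps.
        self.cached_ns = None


task = ToyTask()
first = task.time_get_ns()
second = task.time_get_ns()     # served from the cache
task.schedule()
third = task.time_get_ns()      # clock is read again after the sleep
print(first == second, task.clock_reads)
```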


Thanks,
Ming

