From: "Fengnan" <changfengnan@bytedance.com>
To: "Dave Chinner" <dgc@kernel.org>
Cc: "Ojaswin Mujoo" <ojaswin@linux.ibm.com>,
"Fengnan Chang" <fengnanchang@gmail.com>, <brauner@kernel.org>,
<djwong@kernel.org>, <linux-xfs@vger.kernel.org>,
<linux-fsdevel@vger.kernel.org>, <linux-ext4@vger.kernel.org>,
<lidiangang@bytedance.com>
Subject: Re: [RFC PATCH] iomap: add fast read path for small direct I/O
Date: Tue, 21 Apr 2026 11:19:31 +0800 [thread overview]
Message-ID: <87674d63-c8cb-4135-8d76-84f52e90ac2e@bytedance.com> (raw)
In-Reply-To: <aea96YRt2aHJsM96@dread>
[-- Attachment #1: Type: text/plain, Size: 6240 bytes --]
On 2026/4/21 07:59, Dave Chinner wrote:
> On Thu, Apr 16, 2026 at 11:22:08AM +0800, changfengnan wrote:
>> This is test 4k randread with QD 512 in io_uring poll mode.
>> If you use fio, almost like this, but ./t/io_uring can get higher IOPS.
>> fio \
>> --name=io_uring_test \
>> --ioengine=io_uring \
>> --filename=/mnt/testfile \
>> --direct=1 \
>> --rw=randread \
>> --bs=4096 \
>> --iodepth=512 \
>> --iodepth_batch_submit=32 \
>> --iodepth_batch_complete_min=32 \
>> --hipri=1 \
>> --fixedbufs=1 \
>> --registerfiles=1 \
>> --nonvectored=1 \
>> --sqthread_poll=1
> Ok, given the way fio works, the iodepth batching will result in
> the code submitting repeated batches of 32 read IO submissions in a
> single 'syscall'.
>
> If you change the size of this batch, how does it change the
> performance of both vanilla and patched IO paths? i.e. does this
> optimisation provide a benefit over a range of IO submission
> patterns, or is it only evident when the CPU is running a IO-uring
> microbenchmark and userspace is doing no real work on the IO buffers
> being submitted?
Hi Dave,
If batch size is 16, IOPS goes from 1.84M -> 2.11M.
If batch size is 8, IOPS goes from 1.72M -> 1.98M.
If batch size is 1, IOPS goes from 1.09M -> 1.17M.
This is a general optimization that isn't limited to specific tests; the
benefit is also visible with fio+libaio. With the following command, IOPS
goes from 480K -> 500K:
taskset -c 30 fio --name=test --ioengine=libaio --filename=/mnt/mytest \
    --direct=1 --rw=randread --bs=4096 --iodepth=512
>
> Also, 'fixedbufs=1' leads me to believe that this is using the same
> set of buffer memory for all IOs, and hence we've probably got a
> cache-hot data set here. Hence: is userspace reading the buffers at
> IO completion (i.e. emulating the application actually consuming the
> data that is being read from the disk), or are they remaining
> untouched by userspace and immediately reused for the next IO
> submission batch?
When testing with t/io_uring, the buffers remain untouched by userspace,
but I don't think it matters. I ran some tests with fio's refill_buffers
argument, and the results are the same:
taskset -c 30 fio --name=test --ioengine=libaio --filename=/mnt/mytest \
    --direct=1 --rw=randread --bs=4096 --iodepth=512 --refill_buffers
IOPS goes from 478K -> 498K.
taskset -c 30 fio --name=test --ioengine=io_uring --filename=/mnt/mytest \
    --direct=1 --rw=randread --bs=4096 --iodepth=512 --refill_buffers
IOPS goes from 542K -> 568K.
Perhaps my test cases are a bit unusual, which has raised quite a few
questions. In the upcoming patch, I'll include more fio test results.
>
>>>> Profiling the ext4 workload reveals that a significant portion of CPU
>>>> time is spent on memory allocation and the iomap state machine
>>>> iteration:
>>>> 5.33% [kernel] [k] __iomap_dio_rw
>>>> 3.26% [kernel] [k] iomap_iter
>>>> 2.37% [kernel] [k] iomap_dio_bio_iter
>>>> 2.35% [kernel] [k] kfree
>>>> 1.33% [kernel] [k] iomap_dio_complete
>>>
>>> Hmm read is usually under a shared lock for inode as well as extent
>>> lookup so we should ideally not be blocking too much there. Can you
>>> share a bit more detailed perf report. I'd be interested to see where
>>> in iomap_iter() are you seeing the regression?
>> Are there enough images of the flame diagram? I’ve attached them.
>> ext4_poll_7.svg is without this patch, iomap_fast.svg is with this patch.
> I've had a look at them, and the biggest change in CPU usage is that
> bio_alloc_bioset() disappears from the graph. In the vanilla kernel,
> that accounts for 6.05% of the cpu samples.
>
> Let's put this in a table:
>
> function vanilla patched saved
> ---------- ------- ------- -----
> ext4_file_read_iter 54.75 46.85 -7.90
> iomap_dio_rw 49.21 40.69 -8.52
> ----
> bio_alloc_bioset 6.05 1.77 -4.28
> iomap_dio_bio_iter 25.44
> iomap_iter 15.02
> iomap_dio_fast_read_async 39.82
>
> (subtotals) 46.51 41.59 -4.99
> ----
> bio_alloc_bioset 6.05 1.77 -4.28
> bio_init 4.52 0.00 -4.52
>
> More than 50% of the difference in CPU usage between the two code
> paths is entirely from bio_init() overhead.
>
> That makes no sense to me. The fast path still requires bios to be
> allocated and have bio_init() called on them, and we are doing many
> more of those calls every second. Why is this overhead not showing
> up in the fast path profile -at all-?
I haven't figured that out either. I ran another flame graph on the old
kernel version, and bio_alloc_bioset only accounted for 1.83%. I'm not
sure if something went wrong with the flame graph I generated back then.
I re-captured the ext4 flame graph using the newly modified patch, and
it looks more reasonable now.
>
>>>> I attempted several incremental optimizations in the __iomap_dio_rw()
>>>> path to close the gap:
>>>> 1. Allocating the `bio` and `struct iomap_dio` together to avoid a
>>>> separate kmalloc. However, because `struct iomap_dio` is relatively
>>>> large and the main path is complex, this yielded almost no
>>>> performance improvement.
> Yet this is exactly what you do in the fast path. Why did it not
> provide any improvement for the existing code when it is an implied
> beneficial optimisation for the new fast path?
I think there might be two reasons: first, the __iomap_dio_rw path is
too complex, with
too many checks; second, the dio structure has to maintain reference
counts for every
I/O operation, and the operations on atomic variables are a bit heavy.
> I'm clearly missing something here. I'm trying to work out why the
> profiles show what they do, but there's differences between them
> that don't make obvious sense to me.
>
> It would also be useful to have XFS profiles, because it has a
> larger CPU cache footprint than ext4. If what the profiles are
> showing is a result of CPU cache residency artifacts, then we'll see
> different profile (and, potentially, performance) artifacts with
> XFS...
The XFS flame graph is also attached now.
IOPS: 1.92M->2.3M.
>
> -Dave.
[-- Attachment #2: xfs_patch.svg --]
[-- Type: image/svg+xml, Size: 79092 bytes --]
[-- Attachment #3: xfs_base.svg --]
[-- Type: image/svg+xml, Size: 82837 bytes --]
[-- Attachment #4: ext4_base.svg --]
[-- Type: image/svg+xml, Size: 76895 bytes --]
[-- Attachment #5: ext4_patched.svg --]
[-- Type: image/svg+xml, Size: 72805 bytes --]
Thread overview: 11+ messages
2026-04-14 12:26 [RFC PATCH] iomap: add fast read path for small direct I/O Fengnan Chang
2026-04-15 7:14 ` Christoph Hellwig
2026-04-16 3:16 ` changfengnan
2026-04-17 7:30 ` Christoph Hellwig
2026-04-15 19:06 ` Ojaswin Mujoo
2026-04-16 3:22 ` changfengnan
2026-04-18 19:36 ` Ojaswin Mujoo
2026-04-20 23:59 ` Dave Chinner
2026-04-21 3:19 ` Fengnan [this message]
2026-04-21 22:36 ` Dave Chinner
2026-04-22 2:43 ` Fengnan