public inbox for linux-ext4@vger.kernel.org
From: "Fengnan" <changfengnan@bytedance.com>
To: "Dave Chinner" <dgc@kernel.org>
Cc: "Ojaswin Mujoo" <ojaswin@linux.ibm.com>,
	 "Fengnan Chang" <fengnanchang@gmail.com>, <brauner@kernel.org>,
	 <djwong@kernel.org>, <linux-xfs@vger.kernel.org>,
	 <linux-fsdevel@vger.kernel.org>, <linux-ext4@vger.kernel.org>,
	 <lidiangang@bytedance.com>
Subject: Re: [RFC PATCH] iomap: add fast read path for small direct I/O
Date: Tue, 21 Apr 2026 11:19:31 +0800	[thread overview]
Message-ID: <87674d63-c8cb-4135-8d76-84f52e90ac2e@bytedance.com> (raw)
In-Reply-To: <aea96YRt2aHJsM96@dread>

[-- Attachment #1: Type: text/plain, Size: 6240 bytes --]

On 2026/4/21 07:59, Dave Chinner wrote:
> On Thu, Apr 16, 2026 at 11:22:08AM +0800, changfengnan wrote:
>> This is a 4k randread test with QD 512 in io_uring poll mode.
>> With fio it looks roughly like the command below, but ./t/io_uring can reach higher IOPS.
>> fio \
>>    --name=io_uring_test \
>>    --ioengine=io_uring \
>>    --filename=/mnt/testfile \
>>    --direct=1 \
>>    --rw=randread \
>>    --bs=4096 \
>>    --iodepth=512 \
>>    --iodepth_batch_submit=32 \
>>    --iodepth_batch_complete_min=32 \
>>    --hipri=1 \
>>    --fixedbufs=1 \
>>    --registerfiles=1 \
>>    --nonvectored=1 \
>>    --sqthread_poll=1
> Ok, given the way fio works, the iodepth batching will result in
> the code submitting repeated batches of 32 read IO submissions in a
> single 'syscall'.
>
> If you change the size of this batch, how does it change the
> performance of both vanilla and patched IO paths? i.e. does this
> optimisation provide a benefit over a range of IO submission
> patterns, or is it only evident when the CPU is running an io_uring
> microbenchmark and userspace is doing no real work on the IO buffers
> being submitted?
Hi Dave,

If batch size is 16, IOPS 1.84M -> 2.11M.
If batch size is 8, IOPS 1.72M -> 1.98M.
If batch size is 1, IOPS 1.09M -> 1.17M.
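For reference, a rough ./t/io_uring invocation for the batch-size-16
case (a sketch only; -s and -c set the submit/complete batch size,
-p1 enables polled IO, -B1 fixed buffers, -F1 registered files):

taskset -c 30 ./t/io_uring -p1 -d512 -b4096 -s16 -c16 -B1 -F1 /mnt/testfile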
This is a general optimization that isn't limited to specific tests; the
benefit is visible even with fio+libaio. With the following command,
IOPS goes from 480K to 500K:

taskset -c 30 fio --name=test --ioengine=libaio --filename=/mnt/mytest \
    --direct=1 --rw=randread --bs=4096 --iodepth=512

>
> Also, 'fixedbufs=1' leads me to believe that this is using the same
> set of buffer memory for all IOs, and hence we've probably got a
> cache-hot data set here. Hence: is userspace reading the buffers at
> IO completion (i.e. emulating the application actually consuming the
> data that is being read from the disk), or are they remaining
> untouched by userspace and immediately reused for the next IO
> submission batch?
When testing with t/io_uring, the buffers are untouched by userspace,
but I don't think it matters. I ran the same tests with fio's
refill_buffers argument, and the results are about the same:

taskset -c 30 fio --name=test --ioengine=libaio --filename=/mnt/mytest \
    --direct=1 --rw=randread --bs=4096 --iodepth=512 --refill_buffers
IOPS 478K -> 498K.

taskset -c 30 fio --name=test --ioengine=io_uring --filename=/mnt/mytest \
    --direct=1 --rw=randread --bs=4096 --iodepth=512 --refill_buffers
IOPS 542K -> 568K.

Perhaps my test cases are a bit unusual, which has raised quite a few
questions. In the upcoming patch, I'll include more fio test results.
>
>>>> Profiling the ext4 workload reveals that a significant portion of CPU
>>>> time is spent on memory allocation and the iomap state machine
>>>> iteration:
>>>>    5.33%  [kernel]  [k] __iomap_dio_rw
>>>>    3.26%  [kernel]  [k] iomap_iter
>>>>    2.37%  [kernel]  [k] iomap_dio_bio_iter
>>>>    2.35%  [kernel]  [k] kfree
>>>>    1.33%  [kernel]  [k] iomap_dio_complete
>>>   
>>> Hmm read is usually under a shared lock for inode as well as extent
>>> lookup so we should ideally not be blocking too much there. Can you
>>> share a bit more detailed perf report. I'd be interested to see where
>>> in iomap_iter() are you seeing the regression?
>> Are the flame graph images enough? I've attached them.
>> ext4_poll_7.svg is without this patch, iomap_fast.svg is with this patch.
> I've had a look at them, and the biggest change in CPU usage is that
> bio_alloc_bioset() disappears from the graph. In the vanilla kernel,
> that accounts for 6.05% of the cpu samples.
>
> Let's put this in a table:
>
> function		vanilla		patched		saved
> ----------		-------		-------		-----
> ext4_file_read_iter	54.75		46.85		-7.90
> iomap_dio_rw		49.21		40.69		-8.52
> ----
> bio_alloc_bioset	 6.05		1.77		-4.28
> iomap_dio_bio_iter	25.44
> iomap_iter		15.02
> iomap_dio_fast_read_async		39.82
>
> (subtotals)		46.51		41.59		-4.92
> ----
> bio_alloc_bioset	 6.05		1.77		-4.28
> bio_init		 4.52		0.00		-4.52
>
> More than 50% of the difference in CPU usage between the two code
> paths comes from bio_init() overhead alone.
>
> That makes no sense to me. The fast path still requires bios to be
> allocated and have bio_init() called on them, and we are doing many
> more of those calls every second. Why is this overhead not showing
> up in the fast path profile -at all-?
I haven't figured that out either. I ran another flame graph on the old
kernel version, and bio_alloc_bioset only accounted for 1.83%. I'm not
sure if something was wrong with the flame graph I generated back then.
I re-captured the ext4 flame graph with the newly modified patch, and it
looks more reasonable now.

>
>>>> I attempted several incremental optimizations in the __iomap_dio_rw()
>>>> path to close the gap:
>>>> 1. Allocating the `bio` and `struct iomap_dio` together to avoid a
>>>>     separate kmalloc. However, because `struct iomap_dio` is relatively
>>>>     large and the main path is complex, this yielded almost no
>>>>     performance improvement.
> Yet this is exactly what you do in the fast path. Why did it not
> provide any improvement for the existing code when it is an implied
> beneficial optimisation for the new fast path?
I think there might be two reasons: first, the __iomap_dio_rw path is
too complex, with too many checks; second, the dio structure has to
maintain reference counts for every I/O operation, and the atomic
operations on those counters are somewhat heavy.
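
To make the idea concrete, the combined allocation I experimented with
looked conceptually like this (a minimal sketch, not the actual patch;
the struct and helper names are made up):

/* Sketch: one kmalloc covers the dio, the bio and its inline bvecs.
 * struct bio must stay last because bi_inline_vecs is a flexible array.
 */
struct iomap_dio_fast {
	struct iomap_dio dio;	/* existing dio state, atomic refcount */
	struct bio bio;		/* inline bio for the single-bio case */
};

static struct iomap_dio_fast *iomap_dio_fast_alloc(struct block_device *bdev,
		unsigned short nr_vecs)
{
	struct iomap_dio_fast *df;

	/* one allocation instead of separate dio kmalloc + bio alloc */
	df = kmalloc(sizeof(*df) + nr_vecs * sizeof(struct bio_vec),
			GFP_KERNEL);
	if (!df)
		return NULL;
	bio_init(&df->bio, bdev, df->bio.bi_inline_vecs, nr_vecs,
			REQ_OP_READ);
	return df;
}

Even with the single allocation, each bio completion still has to
manipulate the dio's atomic reference count, so the allocation saving
alone doesn't recover much on the regular path.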

> I'm clearly missing something here. I'm trying to work out why the
> profiles show what they do, but there are differences between them
> that don't make obvious sense to me.
>
> It would also be useful to have XFS profiles, because it has a
> larger CPU cache footprint than ext4. If what the profiles are
> showing is a result of CPU cache residency artifacts, then we'll see
> different profile (and, potentially, performance) artifacts with
> XFS...
The XFS flame graphs are also attached now.
IOPS: 1.92M -> 2.3M.

>
> -Dave.

[-- Attachment #2: xfs_patch.svg --]
[-- Type: image/svg+xml, Size: 79092 bytes --]

[-- Attachment #3: xfs_base.svg --]
[-- Type: image/svg+xml, Size: 82837 bytes --]

[-- Attachment #4: ext4_base.svg --]
[-- Type: image/svg+xml, Size: 76895 bytes --]

[-- Attachment #5: ext4_patched.svg --]
[-- Type: image/svg+xml, Size: 72805 bytes --]


Thread overview: 11+ messages
2026-04-14 12:26 [RFC PATCH] iomap: add fast read path for small direct I/O Fengnan Chang
2026-04-15  7:14 ` Christoph Hellwig
2026-04-16  3:16   ` changfengnan
2026-04-17  7:30     ` Christoph Hellwig
2026-04-15 19:06 ` Ojaswin Mujoo
2026-04-16  3:22   ` changfengnan
2026-04-18 19:36     ` Ojaswin Mujoo
2026-04-20 23:59     ` Dave Chinner
2026-04-21  3:19       ` Fengnan [this message]
2026-04-21 22:36         ` Dave Chinner
2026-04-22  2:43           ` Fengnan
