From: Dave Chinner <dgc@kernel.org>
To: changfengnan <changfengnan@bytedance.com>
Cc: Ojaswin Mujoo <ojaswin@linux.ibm.com>,
Fengnan Chang <fengnanchang@gmail.com>,
brauner@kernel.org, djwong@kernel.org, linux-xfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org,
lidiangang@bytedance.com
Subject: Re: [RFC PATCH] iomap: add fast read path for small direct I/O
Date: Tue, 21 Apr 2026 09:59:37 +1000
Message-ID: <aea96YRt2aHJsM96@dread>
In-Reply-To: <d9210bcdf73fbe1ac8b6ec132865609a3ed68688.bd12b07f.c444.4fe0.8460.b6fed4af7332@bytedance.com>
On Thu, Apr 16, 2026 at 11:22:08AM +0800, changfengnan wrote:
> This is testing 4k randread with QD 512 in io_uring poll mode.
> With fio the job looks roughly like the following, though ./t/io_uring can reach higher IOPS.
> fio \
> --name=io_uring_test \
> --ioengine=io_uring \
> --filename=/mnt/testfile \
> --direct=1 \
> --rw=randread \
> --bs=4096 \
> --iodepth=512 \
> --iodepth_batch_submit=32 \
> --iodepth_batch_complete_min=32 \
> --hipri=1 \
> --fixedbufs=1 \
> --registerfiles=1 \
> --nonvectored=1 \
> --sqthread_poll=1
Ok, given the way fio works, the iodepth batching will result in
the code submitting repeated batches of 32 read IO submissions in a
single 'syscall'.
If you change the size of this batch, how does it change the
performance of both the vanilla and patched IO paths? i.e. does this
optimisation provide a benefit over a range of IO submission
patterns, or is it only evident when the CPU is running an io_uring
microbenchmark and userspace is doing no real work on the IO buffers
being submitted?
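i.e. something along these lines (an untested sketch reusing your
options, with example batch values and an arbitrary runtime cap
added):

```shell
# Sweep the submission batch size; all other options as in the fio
# command quoted above. Batch values and runtime are just examples.
for batch in 1 8 32 128 512; do
    fio --name=sweep_${batch} \
        --ioengine=io_uring --filename=/mnt/testfile \
        --direct=1 --rw=randread --bs=4096 --iodepth=512 \
        --iodepth_batch_submit=${batch} \
        --iodepth_batch_complete_min=${batch} \
        --hipri=1 --fixedbufs=1 --registerfiles=1 \
        --nonvectored=1 --sqthread_poll=1 \
        --runtime=30 --time_based
done
```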
Also, 'fixedbufs=1' leads me to believe that this is using the same
set of buffer memory for all IOs, and hence we've probably got a
cache-hot data set here. Hence: is userspace reading the buffers at
IO completion (i.e. emulating the application actually consuming the
data that is being read from the disk), or are they remaining
untouched by userspace and immediately reused for the next IO
submission batch?
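If they aren't being read, a minimal way to emulate a consumer is to
touch every cacheline of each buffer at IO completion. A sketch of
what I mean (the 64 byte cacheline stride is an assumption):

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Touch every cacheline of a completed read buffer so the data is
 * actually pulled through the CPU caches, emulating an application
 * that consumes what it read. Assumes 64 byte cachelines.
 */
static uint64_t consume_buffer(const unsigned char *buf, size_t len)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < len; i += 64)
		sum += buf[i];
	return sum;
}
```

Calling something like that on each buffer in the completion handler
would change the cache residency of the data set substantially.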
> > > Profiling the ext4 workload reveals that a significant portion of CPU
> > > time is spent on memory allocation and the iomap state machine
> > > iteration:
> > > 5.33% [kernel] [k] __iomap_dio_rw
> > > 3.26% [kernel] [k] iomap_iter
> > > 2.37% [kernel] [k] iomap_dio_bio_iter
> > > 2.35% [kernel] [k] kfree
> > > 1.33% [kernel] [k] iomap_dio_complete
> >
> > Hmm read is usually under a shared lock for inode as well as extent
> > lookup so we should ideally not be blocking too much there. Can you
> > share a bit more detailed perf report. I'd be interested to see where
> > in iomap_iter() are you seeing the regression?
> Are the attached flame graph images enough? I've attached them.
> ext4_poll_7.svg is without this patch, iomap_fast.svg is with this patch.
I've had a look at them, and the biggest change in CPU usage is that
bio_alloc_bioset() disappears from the graph. In the vanilla kernel,
that accounts for 6.05% of the cpu samples.
Let's put this in a table:

function                    vanilla  patched  saved
--------------------------  -------  -------  -----
ext4_file_read_iter           54.75    46.85  -7.90
iomap_dio_rw                  49.21    40.69  -8.52
----
bio_alloc_bioset               6.05     1.77  -4.28
iomap_dio_bio_iter            25.44
iomap_iter                    15.02
iomap_dio_fast_read_async              39.82
(subtotals)                   46.51    41.59  -4.99
----
bio_alloc_bioset               6.05     1.77  -4.28
bio_init                       4.52     0.00  -4.52
More than 50% of the difference in CPU usage between the two code
paths is entirely from bio_init() overhead.
That makes no sense to me. The fast path still requires bios to be
allocated and have bio_init() called on them, and we are doing many
more of those calls every second. Why is this overhead not showing
up in the fast path profile -at all-?
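A symbol-by-symbol comparison of the two recordings might make the
discrepancy more obvious; perf can diff profiles directly (the
perf.data file names here are placeholders for recordings taken
under each kernel):

```shell
# Per-symbol overhead delta between the two recordings: baseline
# (vanilla kernel) first, comparison (patched kernel) second.
perf diff perf.data.vanilla perf.data.patched
```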
> > > I attempted several incremental optimizations in the __iomap_dio_rw()
> > > path to close the gap:
> > > 1. Allocating the `bio` and `struct iomap_dio` together to avoid a
> > > separate kmalloc. However, because `struct iomap_dio` is relatively
> > > large and the main path is complex, this yielded almost no
> > > performance improvement.
Yet this is exactly what you do in the fast path. Why did it not
provide any improvement for the existing code when it is an implied
beneficial optimisation for the new fast path?
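For reference, the co-allocation pattern in question looks something
like the following. The structs here are tiny hypothetical stand-ins
for struct iomap_dio and struct bio, not the real kernel definitions,
and calloc() stands in for the kernel allocator:

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical stand-ins -- the real kernel structures are much
 * larger and more complex. */
struct fake_dio { unsigned int flags; };
struct fake_bio { unsigned short vcnt; };

/*
 * Embed the bio in the same allocation as the dio, so a single
 * allocation replaces a separate dio allocation plus a
 * bio_alloc_bioset() call.
 */
struct dio_and_bio {
	struct fake_dio dio;
	struct fake_bio bio;
};

static struct dio_and_bio *dio_alloc_combined(void)
{
	/* one allocation, freed as a unit at IO completion */
	return calloc(1, sizeof(struct dio_and_bio));
}
```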
I'm clearly missing something here. I'm trying to work out why the
profiles show what they do, but there are differences between them
that don't make obvious sense to me.
It would also be useful to have XFS profiles, because it has a
larger CPU cache footprint than ext4. If what the profiles are
showing is a result of CPU cache residency artifacts, then we'll see
different profile (and, potentially, performance) artifacts with
XFS...
-Dave.
--
Dave Chinner
dgc@kernel.org
Thread overview: 11+ messages
2026-04-14 12:26 [RFC PATCH] iomap: add fast read path for small direct I/O Fengnan Chang
2026-04-15 7:14 ` Christoph Hellwig
2026-04-16 3:16 ` changfengnan
2026-04-17 7:30 ` Christoph Hellwig
2026-04-15 19:06 ` Ojaswin Mujoo
2026-04-16 3:22 ` changfengnan
2026-04-18 19:36 ` Ojaswin Mujoo
2026-04-20 23:59 ` Dave Chinner [this message]
2026-04-21 3:19 ` Fengnan
2026-04-21 22:36 ` Dave Chinner
2026-04-22 2:43 ` Fengnan