From: Christoph Hellwig <hch@infradead.org>
To: Fengnan Chang <fengnanchang@gmail.com>
Cc: brauner@kernel.org, djwong@kernel.org, linux-xfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org,
lidiangang@bytedance.com,
Fengnan Chang <changfengnan@bytedance.com>
Subject: Re: [RFC PATCH] iomap: add fast read path for small direct I/O
Date: Wed, 15 Apr 2026 00:14:52 -0700 [thread overview]
Message-ID: <ad867C8dlaQGLS97@infradead.org> (raw)
In-Reply-To: <20260414122647.15686-1-changfengnan@bytedance.com>
On Tue, Apr 14, 2026 at 08:26:47PM +0800, Fengnan Chang wrote:
> 1. Allocating the `bio` and `struct iomap_dio` together to avoid a
> separate kmalloc. However, because `struct iomap_dio` is relatively
> large and the main path is complex, this yielded almost no
> performance improvement.
One interesting bit here would be a slab for struct iomap_dio, and
the previously discussed per-cpu allocation of some form.
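To illustrate what I mean by a dedicated object cache, here is the idea in plain userspace C. The kernel version would of course use kmem_cache_create()/kmem_cache_alloc() (or a per-cpu freelist) rather than this toy; all names here are made up:

```c
#include <stdlib.h>

/* Toy fixed-size object cache illustrating the slab idea: recycle
 * freed objects through a freelist instead of hitting the general
 * allocator on every I/O.  sizeof(payload) stands in for the fairly
 * large struct iomap_dio. */
struct obj {
	struct obj *next;
	char payload[120];
};

static struct obj *freelist;

static struct obj *cache_alloc(void)
{
	struct obj *o = freelist;

	if (o) {
		freelist = o->next;
		return o;
	}
	return malloc(sizeof(*o));
}

static void cache_free(struct obj *o)
{
	o->next = freelist;
	freelist = o;
}
```

The win is that the hot path almost always hits the recycled object, which is also likely to still be cache-warm.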
> 2. Reducing unnecessary state resets in the iomap state machine (e.g.,
> skipping `iomap_iter_reset_iomap` where safe). This provided a ~5%
> IOPS boost, which is helpful but still falls far short of closing
> the gap with the raw block device.
But it already is a major improvement, and one that would apply outside
of narrow special cases. So I'd really like to see that patch.
> The fast path is triggered when the request satisfies:
> - Asynchronous READ request only for now.
I think you really should handle synchronous reads as well.
> - I/O size is <= inode blocksize (fits in a single block, no splits).
Makes sense, and I suspect this is the main source of speedups.
> - Aligned to the block device's logical block size.
All direct I/O requires this.
> - No bounce buffering, fscrypt, or fsverity involved.
> - No custom `iomap_dio_ops` (dops) registered by the filesystem.
I'm really curious what difference this makes.  It removes a few
branches, but should not have much of an effect while limiting the
applicability a lot.
> After this optimization, the heavy generic functions disappear from the
> profile, replaced by a single streamlined execution path:
> 4.83% [kernel] [k] iomap_dio_fast_read_async.isra.31
>
> With this patch, 4K random read IOPS on ext4 increases from 1.9M to
> 2.3M.
That is still a lot slower than the block device path.  A big part of
it should be the extent lookup and the locking associated with it, but
I'd expect things to be a bit better.  Do you have XFS numbers as well?
> However, I am submitting this patch to validate whether this
> optimization direction is correct and worth pursuing. I would appreciate
> feedback on how to better integrate these ideas into the main iomap
> execution path.
I think a <= block size fast path makes a lot of sense, just like we
have a simple version on the block device, but it needs more work.
> +struct iomap_dio_fast_read {
> + struct kiocb *iocb;
> + size_t size;
> + bool should_dirty;
> + struct work_struct work;
> + struct bio bio ____cacheline_aligned_in_smp;
Does the cache line alignment matter here? If yes, can you explain why
in a comment?
> +static struct bio_set iomap_dio_fast_read_pool;
In general I'd prefer sticking to "simple", as in the block device
version, instead of "fast".
> +static void iomap_dio_fast_read_complete_work(struct work_struct *work)
> +{
> + struct iomap_dio_fast_read *fr =
> + container_of(work, struct iomap_dio_fast_read, work);
> + struct kiocb *iocb = fr->iocb;
> + struct inode *inode = file_inode(iocb->ki_filp);
> + bool should_dirty = fr->should_dirty;
> + struct bio *bio = &fr->bio;
> + ssize_t ret;
> +
> + WRITE_ONCE(iocb->private, NULL);
> +
> + if (likely(!bio->bi_status)) {
> + ret = fr->size;
> + iocb->ki_pos += ret;
> + } else {
> + ret = blk_status_to_errno(bio->bi_status);
> + fserror_report_io(inode, FSERR_DIRECTIO_READ, iocb->ki_pos,
> + fr->size, ret, GFP_NOFS);
> + }
> +
> + if (should_dirty) {
> + bio_check_pages_dirty(bio);
> + } else {
> + bio_release_pages(bio, false);
> + bio_put(bio);
> + }
> +
> + inode_dio_end(inode);
> +
> + trace_iomap_dio_complete(iocb, ret < 0 ? ret : 0, ret > 0 ? ret : 0);
> + iocb->ki_complete(iocb, ret);
This is a lot of duplicate code.  Can we somehow share it by passing
more arguments or embedding the simple context into the bigger one?
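Roughly what I have in mind, sketched in standalone C with placeholder types and names (the real thing would embed this at the start of struct iomap_dio and use the actual kiocb/bio state):

```c
#include <stddef.h>

/* Small completion context shared by the simple and the full path. */
struct dio_simple {
	long long pos;		/* iocb->ki_pos stand-in */
	size_t size;
	int status;
};

/* The full context embeds the simple one so one completion helper
 * serves both. */
struct dio_full {
	struct dio_simple s;	/* shared completion state */
	unsigned int flags;
	/* ... rest of the full direct I/O state ... */
};

/* Shared completion logic: advance the position on success, return
 * the error otherwise. */
static long long dio_complete_common(struct dio_simple *d)
{
	if (d->status)
		return d->status;
	d->pos += d->size;
	return (long long)d->size;
}
```

That way the simple path allocates only the small struct, while the full path reuses the same completion helper on its embedded copy.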
> +static inline bool iomap_dio_fast_read_supported(struct kiocb *iocb,
> + struct iov_iter *iter,
> + unsigned int dio_flags,
> + size_t done_before)
Please stick to two-tab indents for prototype continuations, which is
both more readable and easier to modify later.
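I.e. something like this (the body is stubbed out only so the fragment stands alone):

```c
#include <stdbool.h>
#include <stddef.h>

struct kiocb;
struct iov_iter;

/* Continuation lines indented by two tabs instead of being aligned
 * to the opening parenthesis: */
static inline bool iomap_dio_fast_read_supported(struct kiocb *iocb,
		struct iov_iter *iter, unsigned int dio_flags,
		size_t done_before)
{
	/* body elided; always false in this standalone stub */
	return false;
}
```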
> + if (count < bdev_logical_block_size(inode->i_sb->s_bdev))
> + return false;
Sub-sector reads (unlike writes) don't require any special handling, so
I don't see why they are excluded.
> + if (dio_flags & IOMAP_DIO_FSBLOCK_ALIGNED)
> + alignment = i_blocksize(inode);
> + else
> + alignment = bdev_logical_block_size(inode->i_sb->s_bdev);
> +
> + if ((iocb->ki_pos | count) & (alignment - 1))
> + return false;
Factor this into a helper?
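Something along these lines, as a self-contained sketch (helper name made up; the kernel version would take loff_t/size_t):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when both the file position and the byte
 * count are multiples of alignment, which must be a power of two. */
static bool iomap_dio_pos_aligned(uint64_t pos, uint64_t count,
		unsigned int alignment)
{
	return !((pos | count) & (alignment - 1));
}
```

The caller would still pick i_blocksize() or the logical block size based on IOMAP_DIO_FSBLOCK_ALIGNED and then just test the helper.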
> + inode_dio_begin(inode);
> +
> + ret = ops->iomap_begin(inode, iomi.pos, count, iomi.flags,
> + &iomi.iomap, &iomi.srcmap);
> + if (ret) {
> + inode_dio_end(inode);
> + return ret;
> + }
If we can I'd much prefer avoiding the open coded iomap_begin
invocation, as that is a real maintenance burden.
> +
> + if (iomi.iomap.type != IOMAP_MAPPED ||
> + iomi.iomap.offset > iomi.pos ||
> + iomi.iomap.offset + iomi.iomap.length < iomi.pos + count ||
> + (iomi.iomap.flags & IOMAP_F_ANON_WRITE)) {
IOMAP_F_ANON_WRITE (as the name implies) only applies to writes.
> + ret = -EAGAIN;
-EAGAIN is a bad status code here, as we already use it to indicate
that a non-blocking read would block.
> + ret = bio_iov_iter_get_pages(bio, iter,
> + bdev_logical_block_size(iomi.iomap.bdev) - 1);
Overly long line. Also this needs to use the calculated alignment
value.
> + if (unlikely(ret)) {
> + bio_put(bio);
> + goto out_iomap_end;
> + }
> +
> + if (bio->bi_iter.bi_size != count) {
> + iov_iter_revert(iter, bio->bi_iter.bi_size);
> + bio_release_pages(bio, false);
> + bio_put(bio);
> + ret = -EAGAIN;
> + goto out_iomap_end;
> + }
Share the bio_put with a new goto label, and maybe also move all
the other cleanup code out of the main path into a label?
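For illustration, the unwind structure I'd suggest, as a userspace sketch with stand-in names and error values (malloc/free modelling bio_alloc_bioset()/bio_put()):

```c
#include <stdlib.h>

/* Both failure paths funnel through a shared label so the bio_put()
 * (modelled by free()) appears exactly once. */
static int submit_fast_read(int short_mapping, int get_pages_err,
		int *nfrees)
{
	char *bio = malloc(64);		/* bio_alloc_bioset() stand-in */
	int ret;

	if (!bio)
		return -12;		/* -ENOMEM */

	if (get_pages_err) {		/* bio_iov_iter_get_pages() failed */
		ret = get_pages_err;
		goto out_put_bio;
	}

	if (short_mapping) {		/* bi_size != count: bail out */
		/* iov_iter_revert() + bio_release_pages() go here */
		ret = -1;
		goto out_put_bio;
	}

	/* success: submit_bio() would consume the bio; freed here only
	 * so this standalone demo does not leak */
	free(bio);
	(*nfrees)++;
	return 0;

out_put_bio:
	free(bio);			/* bio_put() */
	(*nfrees)++;
	return ret;
}
```

Keeping the cleanup out of line like this also makes the success path easier to read at a glance.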
> + if (!dops && iomap_dio_fast_read_supported(iocb, iter, dio_flags, done_before)) {
Overly long line. But we should not make the fast path conditional
on an option anyway.