public inbox for linux-fsdevel@vger.kernel.org
From: Christoph Hellwig <hch@infradead.org>
To: Fengnan Chang <fengnanchang@gmail.com>
Cc: brauner@kernel.org, djwong@kernel.org, linux-xfs@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org,
	lidiangang@bytedance.com,
	Fengnan Chang <changfengnan@bytedance.com>
Subject: Re: [RFC PATCH] iomap: add fast read path for small direct I/O
Date: Wed, 15 Apr 2026 00:14:52 -0700	[thread overview]
Message-ID: <ad867C8dlaQGLS97@infradead.org> (raw)
In-Reply-To: <20260414122647.15686-1-changfengnan@bytedance.com>

On Tue, Apr 14, 2026 at 08:26:47PM +0800, Fengnan Chang wrote:
> 1. Allocating the `bio` and `struct iomap_dio` together to avoid a
>    separate kmalloc. However, because `struct iomap_dio` is relatively
>    large and the main path is complex, this yielded almost no
>    performance improvement.

One interesting bit here would be a slab for struct iomap_dio, and
the previously discussed per-cpu allocation of some form.
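A minimal sketch of that idea (the cache and helper names here are hypothetical, not from the patch): a dedicated kmem_cache avoids the generic kmalloc size-class lookup for every request, and could later be fronted by a per-cpu cache:

```c
static struct kmem_cache *iomap_dio_cachep;

static int __init iomap_dio_cache_init(void)
{
	iomap_dio_cachep = kmem_cache_create("iomap_dio",
			sizeof(struct iomap_dio), 0, SLAB_HWCACHE_ALIGN, NULL);
	if (!iomap_dio_cachep)
		return -ENOMEM;
	return 0;
}

static struct iomap_dio *iomap_dio_alloc(void)
{
	return kmem_cache_alloc(iomap_dio_cachep, GFP_KERNEL);
}
```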

> 2. Reducing unnecessary state resets in the iomap state machine (e.g.,
>    skipping `iomap_iter_reset_iomap` where safe). This provided a ~5%
>    IOPS boost, which is helpful but still falls far short of closing
>    the gap with the raw block device.

But it already is a major improvement, and one that would apply outside
of narrow special cases.  So I'd really like to see that patch.

> The fast path is triggered when the request satisfies:
> - Asynchronous READ request only for now.

I think you really should handle synchronous reads as well.

> - I/O size is <= inode blocksize (fits in a single block, no splits).

Makes sense, and I suspect this is the main source of speedups.

> - Aligned to the block device's logical block size.

All direct I/O requires this.

> - No bounce buffering, fscrypt, or fsverity involved.
> - No custom `iomap_dio_ops` (dops) registered by the filesystem.

I'm really curious what difference this makes.  It removes a few
branches, but should not have much of an effect, while limiting the
applicability a lot.

> After this optimization, the heavy generic functions disappear from the
> profile, replaced by a single streamlined execution path:
>   4.83%  [kernel]  [k] iomap_dio_fast_read_async.isra.31
> 
> With this patch, 4K random read IOPS on ext4 increases from 1.9M to
> 2.3M.

That is still a lot slower than the block device path.  A big part of
it should be the extent lookup and the locking associated with it, but
I'd expect things to be a bit better.  Do you have XFS numbers as well?

> However, I am submitting this patch to validate whether this
> optimization direction is correct and worth pursuing. I would appreciate
> feedback on how to better integrate these ideas into the main iomap
> execution path.

I think a <= block size fast path makes a lot of sense, just like we
have a simple version on the block device, but it needs more work.

> +struct iomap_dio_fast_read {
> +	struct kiocb	*iocb;
> +	size_t		size;
> +	bool		should_dirty;
> +	struct work_struct	work;
> +	struct bio	bio ____cacheline_aligned_in_smp;

Does the cache line alignment matter here?  If yes, can you explain why
in a comment?

> +static struct bio_set iomap_dio_fast_read_pool;

In general I'd prefer to stick to "simple", as in the block device
version, instead of "fast".

> +static void iomap_dio_fast_read_complete_work(struct work_struct *work)
> +{
> +	struct iomap_dio_fast_read *fr =
> +		container_of(work, struct iomap_dio_fast_read, work);
> +	struct kiocb *iocb = fr->iocb;
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	bool should_dirty = fr->should_dirty;
> +	struct bio *bio = &fr->bio;
> +	ssize_t ret;
> +
> +	WRITE_ONCE(iocb->private, NULL);
> +
> +	if (likely(!bio->bi_status)) {
> +		ret = fr->size;
> +		iocb->ki_pos += ret;
> +	} else {
> +		ret = blk_status_to_errno(bio->bi_status);
> +		fserror_report_io(inode, FSERR_DIRECTIO_READ, iocb->ki_pos,
> +				  fr->size, ret, GFP_NOFS);
> +	}
> +
> +	if (should_dirty) {
> +		bio_check_pages_dirty(bio);
> +	} else {
> +		bio_release_pages(bio, false);
> +		bio_put(bio);
> +	}
> +
> +	inode_dio_end(inode);
> +
> +	trace_iomap_dio_complete(iocb, ret < 0 ? ret : 0, ret > 0 ? ret : 0);
> +	iocb->ki_complete(iocb, ret);

This is a lot of duplicate code.  Can we somehow share it by passing
more arguments or embedding the simple context into the bigger one?

> +static inline bool iomap_dio_fast_read_supported(struct kiocb *iocb,
> +					  struct iov_iter *iter,
> +					  unsigned int dio_flags,
> +					  size_t done_before)

Please stick to two-tab indents for prototype continuations, which is
both more readable and easier to modify later.
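For illustration, the quoted prototype reformatted with the suggested
two-tab continuation indent:

```c
static inline bool iomap_dio_fast_read_supported(struct kiocb *iocb,
		struct iov_iter *iter, unsigned int dio_flags,
		size_t done_before)
```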

> +	if (count < bdev_logical_block_size(inode->i_sb->s_bdev))
> +		return false;

Sub-sector reads (unlike writes) don't require any special handling, so
I don't see why they are excluded.

> +	if (dio_flags & IOMAP_DIO_FSBLOCK_ALIGNED)
> +		alignment = i_blocksize(inode);
> +	else
> +		alignment = bdev_logical_block_size(inode->i_sb->s_bdev);
> +
> +	if ((iocb->ki_pos | count) & (alignment - 1))
> +		return false;

Factor this into a helper?

> +	inode_dio_begin(inode);
> +
> +	ret = ops->iomap_begin(inode, iomi.pos, count, iomi.flags,
> +			       &iomi.iomap, &iomi.srcmap);
> +	if (ret) {
> +		inode_dio_end(inode);
> +		return ret;
> +	}

If we can, I'd much prefer avoiding the open-coded iomap_begin
invocation, as that is a real maintenance burden.

> +
> +	if (iomi.iomap.type != IOMAP_MAPPED ||
> +	    iomi.iomap.offset > iomi.pos ||
> +	    iomi.iomap.offset + iomi.iomap.length < iomi.pos + count ||
> +	    (iomi.iomap.flags & IOMAP_F_ANON_WRITE)) {

IOMAP_F_ANON_WRITE (as the name implies) only applies to writes.

> +		ret = -EAGAIN;

-EAGAIN is a bad choice of status code, as we already use it to
indicate that a non-blocking read would block.

> +	ret = bio_iov_iter_get_pages(bio, iter,
> +				     bdev_logical_block_size(iomi.iomap.bdev) - 1);

Overly long line.  Also this needs to use the calculated alignment
value.

> +	if (unlikely(ret)) {
> +		bio_put(bio);
> +		goto out_iomap_end;
> +	}
> +
> +	if (bio->bi_iter.bi_size != count) {
> +		iov_iter_revert(iter, bio->bi_iter.bi_size);
> +		bio_release_pages(bio, false);
> +		bio_put(bio);
> +		ret = -EAGAIN;
> +		goto out_iomap_end;
> +	}

Share the bio_put with a new goto label, and maybe also move all
the other cleanup code out of the main path into a label?

> +	if (!dops && iomap_dio_fast_read_supported(iocb, iter, dio_flags, done_before)) {

Overly long line.  But we should not make the fast path conditional
on an option anyway.



Thread overview: 6+ messages
2026-04-14 12:26 [RFC PATCH] iomap: add fast read path for small direct I/O Fengnan Chang
2026-04-15  7:14 ` Christoph Hellwig [this message]
2026-04-16  3:16   ` changfengnan
2026-04-17  7:30     ` Christoph Hellwig
2026-04-15 19:06 ` Ojaswin Mujoo
2026-04-16  3:22   ` changfengnan
