From: Christoph Hellwig <hch@infradead.org>
To: Fengnan Chang <changfengnan@bytedance.com>
Cc: brauner@kernel.org, djwong@kernel.org, hch@infradead.org,
ojaswin@linux.ibm.com, dgc@kernel.org, linux-xfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org,
linux-kernel@vger.kernel.org, lidiangang@bytedance.com
Subject: Re: [PATCH v4] iomap: add simple read path for small direct I/O
Date: Wed, 24 Jun 2026 08:37:10 -0700 [thread overview]
Message-ID: <ajv5pqNureiK80Eu@infradead.org> (raw)
In-Reply-To: <20260608073134.95964-1-changfengnan@bytedance.com>
Sorry for the delay in getting back to this, I'm a bit overloaded at
the moment.
> -static inline bool should_report_dio_fserror(const struct iomap_dio *dio)
> +static inline bool should_report_dio_fserror(int error)
Can you split all the refactoring into prep patches?
> +/*
> + * In the async simple read path, we need to prevent bio_endio() from
> + * triggering iocb->ki_complete() before the submitter has returned
> + * -EIOCBQUEUED. Otherwise, the caller might free the iocb concurrently.
> + *
> + * We use a three-state rendezvous to synchronize the submitter and end_io:
> + *
> + * IOMAP_DIO_SIMPLE_SUBMITTING: Initial state set before submitting the bio.
> + *
> + * IOMAP_DIO_SIMPLE_QUEUED: The submitter has safely queued the IO and will
> + * return -EIOCBQUEUED. If end_io sees this state, it takes over and calls
> + * ki_complete().
> + *
> + * IOMAP_DIO_SIMPLE_DONE: end_io fired before the submitter finished the
> + * submit path. end_io sets this state and does nothing else. The submitter
> + * will see this state and handle the completion synchronously (bypassing
> + * ki_complete() and returning the actual result).
> + */
I don't think we actually need any of this. For the sync case we
can just use submit_bio_wait, and for async just always complete
from the end_io handler. This will simplify the implementation a lot,
and also avoid the atomic.
> +static void iomap_dio_simple_read_async_done(struct iomap_dio_simple_read *sr)
Btw, I'd drop the _read in the name. Most of this would work as-is
for trivial overwrites if we figure out when to use them.
> + if (dio_flags & IOMAP_DIO_BOUNCE)
> + nr_pages = bio_iov_bounce_nr_vecs(iter, REQ_OP_READ);
> + else
> + nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS);
Bounce buffering requires dops, so all this can be dropped.
> + ret = ops->iomap_begin(inode, iomi.pos, count, iomi.flags,
> + &iomi.iomap, &iomi.srcmap);
> + if (ret) {
> + inode_dio_end(inode);
> + return ret;
> + }
> +
> + if (iomi.iomap.type != IOMAP_MAPPED ||
> + iomi.iomap.offset > iomi.pos ||
I don't think offset > pos can happen
> + iomi.iomap.offset + iomi.iomap.length < iomi.pos + count ||
> + (iomi.iomap.flags & IOMAP_F_INTEGRITY)) {
> + ret = -ENOTBLK;
> + goto out_iomap_end;
> + }
Given that we already have a fallback here, I'm not sure why this is
limited to a single file system block. Anything that:
a) fits into the iomap
b) fits into a single bio
can be easily supported. The first condition is a trivial, and for
the second we could just check if iter->nr_segs is larger than
BIO_MAX_VECS.
> + if (user_backed_iter(iter))
> + dio_flags |= IOMAP_DIO_USER_BACKED;
> + bio->bi_iter.bi_sector = iomap_sector(&iomi.iomap, iomi.pos);
> + bio->bi_ioprio = iocb->ki_ioprio;
> + bio->bi_private = sr;
> + bio->bi_end_io = iomap_dio_simple_read_end_io;
> +
> + if ((dio_flags & IOMAP_DIO_USER_BACKED) &&
> + !(dio_flags & IOMAP_DIO_BOUNCE))
> + bio_set_pages_dirty(bio);
> +
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + bio->bi_opf |= REQ_NOWAIT;
> + if ((iocb->ki_flags & IOCB_HIPRI) && !wait_for_completion) {
> + bio->bi_opf |= REQ_POLLED;
> + bio_set_polled(bio, iocb);
> + WRITE_ONCE(iocb->private, bio);
> + }
Can you check if sone more of this can be factored into a shared
helper?
Below is a completely untested patch implementing my suggestion
for the completion simplification. It compiles, but that's about
the guarantees I can give for it:
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 3cb179752612..c785512e5339 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -909,11 +909,7 @@ struct iomap_dio_simple_read {
struct kiocb *iocb;
size_t size;
unsigned int dio_flags;
- atomic_t state;
- union {
- struct task_struct *waiter;
- struct work_struct work;
- };
+ struct work_struct work;
/*
* Align @bio to a cacheline boundary so that, combined with the
* front_pad passed to bioset_init(), the bio sits at the start of
@@ -926,35 +922,12 @@ struct iomap_dio_simple_read {
static struct bio_set iomap_dio_simple_read_pool;
-/*
- * In the async simple read path, we need to prevent bio_endio() from
- * triggering iocb->ki_complete() before the submitter has returned
- * -EIOCBQUEUED. Otherwise, the caller might free the iocb concurrently.
- *
- * We use a three-state rendezvous to synchronize the submitter and end_io:
- *
- * IOMAP_DIO_SIMPLE_SUBMITTING: Initial state set before submitting the bio.
- *
- * IOMAP_DIO_SIMPLE_QUEUED: The submitter has safely queued the IO and will
- * return -EIOCBQUEUED. If end_io sees this state, it takes over and calls
- * ki_complete().
- *
- * IOMAP_DIO_SIMPLE_DONE: end_io fired before the submitter finished the
- * submit path. end_io sets this state and does nothing else. The submitter
- * will see this state and handle the completion synchronously (bypassing
- * ki_complete() and returning the actual result).
- */
-enum {
- IOMAP_DIO_SIMPLE_SUBMITTING = 0,
- IOMAP_DIO_SIMPLE_QUEUED,
- IOMAP_DIO_SIMPLE_DONE,
-};
-
-static ssize_t iomap_dio_simple_read_finish(struct kiocb *iocb,
- struct bio *bio, ssize_t ret)
+static ssize_t iomap_dio_simple_read_complete(struct iomap_dio_simple_read *sr)
{
+ struct bio *bio = &sr->bio;
+ struct kiocb *iocb = sr->iocb;
struct inode *inode = file_inode(iocb->ki_filp);
- struct iomap_dio_simple_read *sr = bio->bi_private;
+ ssize_t ret = blk_status_to_errno(bio->bi_status);
if (likely(!ret)) {
ret = sr->size;
@@ -965,21 +938,6 @@ static ssize_t iomap_dio_simple_read_finish(struct kiocb *iocb,
}
iomap_dio_bio_release_pages(bio, sr->dio_flags, ret < 0);
-
- return ret;
-}
-
-static ssize_t iomap_dio_simple_read_complete(struct kiocb *iocb,
- struct bio *bio)
-{
- struct inode *inode = file_inode(iocb->ki_filp);
- ssize_t ret;
-
- WRITE_ONCE(iocb->private, NULL);
-
- ret = iomap_dio_simple_read_finish(iocb, bio,
- blk_status_to_errno(bio->bi_status));
-
inode_dio_end(inode);
trace_iomap_dio_complete(iocb, ret < 0 ? ret : 0, ret > 0 ? ret : 0);
return ret;
@@ -988,45 +946,26 @@ static ssize_t iomap_dio_simple_read_complete(struct kiocb *iocb,
static void iomap_dio_simple_read_complete_work(struct work_struct *work)
{
struct iomap_dio_simple_read *sr =
- container_of(work, struct iomap_dio_simple_read, work);
- struct kiocb *iocb = sr->iocb;
- ssize_t ret;
+ container_of(work, struct iomap_dio_simple_read, work);
- ret = iomap_dio_simple_read_complete(iocb, &sr->bio);
- iocb->ki_complete(iocb, ret);
+ WRITE_ONCE(sr->iocb->private, NULL);
+ sr->iocb->ki_complete(sr->iocb, iomap_dio_simple_read_complete(sr));
}
-static void iomap_dio_simple_read_async_done(struct iomap_dio_simple_read *sr)
+static void iomap_dio_simple_read_end_io(struct bio *bio)
{
- struct kiocb *iocb = sr->iocb;
+ struct iomap_dio_simple_read *sr =
+ container_of(bio, struct iomap_dio_simple_read, bio);
if (unlikely(sr->bio.bi_status)) {
- struct inode *inode = file_inode(iocb->ki_filp);
+ struct inode *inode = file_inode(sr->iocb->ki_filp);
INIT_WORK(&sr->work, iomap_dio_simple_read_complete_work);
queue_work(inode->i_sb->s_dio_done_wq, &sr->work);
return;
}
- iomap_dio_simple_read_complete_work(&sr->work);
-}
-
-static void iomap_dio_simple_read_end_io(struct bio *bio)
-{
- struct iomap_dio_simple_read *sr = bio->bi_private;
-
- if (sr->waiter) {
- struct task_struct *waiter = sr->waiter;
-
- WRITE_ONCE(sr->waiter, NULL);
- blk_wake_io_task(waiter);
- return;
- }
-
- if (likely(atomic_read(&sr->state) == IOMAP_DIO_SIMPLE_QUEUED) ||
- atomic_cmpxchg(&sr->state, IOMAP_DIO_SIMPLE_SUBMITTING,
- IOMAP_DIO_SIMPLE_DONE) == IOMAP_DIO_SIMPLE_QUEUED)
- iomap_dio_simple_read_async_done(sr);
+ sr->iocb->ki_complete(sr->iocb, iomap_dio_simple_read_complete(sr));
}
static inline bool iomap_dio_simple_read_supported(struct kiocb *iocb,
@@ -1046,11 +985,13 @@ static inline bool iomap_dio_simple_read_supported(struct kiocb *iocb,
*/
if (count > inode->i_sb->s_blocksize)
return false;
- if (dio_flags & (IOMAP_DIO_FORCE_WAIT | IOMAP_DIO_PARTIAL))
+ if (dio_flags & (IOMAP_DIO_FORCE_WAIT | IOMAP_DIO_PARTIAL |
+ IOMAP_DIO_BOUNCE))
return false;
if (iocb->ki_pos + count > i_size_read(inode))
return false;
+ // XXX: reject fscrypt
return true;
}
@@ -1060,7 +1001,6 @@ static ssize_t iomap_dio_simple_read(struct kiocb *iocb,
{
struct inode *inode = file_inode(iocb->ki_filp);
size_t count = iov_iter_count(iter);
- int nr_pages;
struct iomap_dio_simple_read *sr;
unsigned int alignment;
struct iomap_iter iomi = {
@@ -1074,11 +1014,6 @@ static ssize_t iomap_dio_simple_read(struct kiocb *iocb,
bool wait_for_completion = is_sync_kiocb(iocb);
ssize_t ret;
- if (dio_flags & IOMAP_DIO_BOUNCE)
- nr_pages = bio_iov_bounce_nr_vecs(iter, REQ_OP_READ);
- else
- nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS);
-
if (iocb->ki_flags & IOCB_NOWAIT)
iomi.flags |= IOMAP_NOWAIT;
@@ -1120,24 +1055,18 @@ static ssize_t iomap_dio_simple_read(struct kiocb *iocb,
if (user_backed_iter(iter))
dio_flags |= IOMAP_DIO_USER_BACKED;
- bio = bio_alloc_bioset(iomi.iomap.bdev, nr_pages,
- REQ_OP_READ | REQ_SYNC | REQ_IDLE,
- GFP_KERNEL, &iomap_dio_simple_read_pool);
+ bio = bio_alloc_bioset(iomi.iomap.bdev,
+ bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS),
+ REQ_OP_READ | REQ_SYNC | REQ_IDLE,
+ GFP_KERNEL, &iomap_dio_simple_read_pool);
sr = container_of(bio, struct iomap_dio_simple_read, bio);
-
- fscrypt_set_bio_crypt_ctx(bio, inode, iomi.pos, GFP_KERNEL);
sr->iocb = iocb;
sr->dio_flags = dio_flags;
bio->bi_iter.bi_sector = iomap_sector(&iomi.iomap, iomi.pos);
bio->bi_ioprio = iocb->ki_ioprio;
- bio->bi_private = sr;
- bio->bi_end_io = iomap_dio_simple_read_end_io;
- if (dio_flags & IOMAP_DIO_BOUNCE)
- ret = bio_iov_iter_bounce(bio, iter, count);
- else
- ret = bio_iov_iter_get_pages(bio, iter, alignment - 1);
+ ret = bio_iov_iter_get_pages(bio, iter, alignment - 1);
if (unlikely(ret))
goto out_bio_put;
@@ -1161,49 +1090,22 @@ static ssize_t iomap_dio_simple_read(struct kiocb *iocb,
WRITE_ONCE(iocb->private, bio);
}
- if (wait_for_completion) {
- sr->waiter = current;
- blk_crypto_submit_bio(bio);
- } else {
- atomic_set(&sr->state, IOMAP_DIO_SIMPLE_SUBMITTING);
- sr->waiter = NULL;
- blk_crypto_submit_bio(bio);
- ret = -EIOCBQUEUED;
- }
-
if (ops->iomap_end)
ops->iomap_end(inode, iomi.pos, count, count, iomi.flags,
&iomi.iomap);
- if (wait_for_completion) {
- for (;;) {
- set_current_state(TASK_UNINTERRUPTIBLE);
- if (!READ_ONCE(sr->waiter))
- break;
- blk_io_schedule();
- }
- __set_current_state(TASK_RUNNING);
-
- ret = iomap_dio_simple_read_finish(iocb, bio,
- blk_status_to_errno(bio->bi_status));
- inode_dio_end(inode);
- trace_iomap_dio_complete(iocb, ret < 0 ? ret : 0,
- ret > 0 ? ret : 0);
- } else if (atomic_cmpxchg(&sr->state, IOMAP_DIO_SIMPLE_SUBMITTING,
- IOMAP_DIO_SIMPLE_QUEUED) ==
- IOMAP_DIO_SIMPLE_DONE) {
- ret = iomap_dio_simple_read_complete(iocb, bio);
- } else {
+ if (!wait_for_completion) {
+ bio->bi_end_io = iomap_dio_simple_read_end_io;
+ submit_bio(bio);
trace_iomap_dio_rw_queued(inode, iomi.pos, count);
+ return -EIOCBQUEUED;
}
- return ret;
+ submit_bio_wait(bio);
+ return iomap_dio_simple_read_complete(sr);
out_bio_release_pages:
- if (dio_flags & IOMAP_DIO_BOUNCE)
- bio_iov_iter_unbounce(bio, true, false);
- else
- bio_release_pages(bio, false);
+ bio_release_pages(bio, false);
out_bio_put:
bio_put(bio);
out_iomap_end:
next prev parent reply other threads:[~2026-06-24 15:37 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-08 7:31 [PATCH v4] iomap: add simple read path for small direct I/O Fengnan Chang
2026-06-11 9:36 ` Pankaj Raghav (Samsung)
2026-06-11 12:04 ` Fengnan
2026-06-24 15:37 ` Christoph Hellwig [this message]
2026-06-25 2:24 ` Fengnan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ajv5pqNureiK80Eu@infradead.org \
--to=hch@infradead.org \
--cc=brauner@kernel.org \
--cc=changfengnan@bytedance.com \
--cc=dgc@kernel.org \
--cc=djwong@kernel.org \
--cc=lidiangang@bytedance.com \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-xfs@vger.kernel.org \
--cc=ojaswin@linux.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox