From: Christoph Hellwig <hch@infradead.org>
To: Chuck Lever <cel@kernel.org>
Cc: NeilBrown <neil@brown.name>, Jeff Layton <jlayton@kernel.org>,
Olga Kornievskaia <okorniev@redhat.com>,
Dai Ngo <dai.ngo@oracle.com>, Tom Talpey <tom@talpey.com>,
linux-nfs@vger.kernel.org, Mike Snitzer <snitzer@kernel.org>
Subject: Re: [PATCH v4 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
Date: Mon, 20 Oct 2025 00:19:35 -0700 [thread overview]
Message-ID: <aPXihwGTiA7bqTsN@infradead.org> (raw)
In-Reply-To: <20251018005431.3403-3-cel@kernel.org>
On Fri, Oct 17, 2025 at 08:54:30PM -0400, Chuck Lever wrote:
> From: Mike Snitzer <snitzer@kernel.org>
>
> If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
> middle and end as needed. The large middle extent is DIO-aligned and
> the start and/or end are misaligned. Synchronous buffered IO (with
> preference towards using DONTCACHE) is used for the misaligned extents
> and O_DIRECT is used for the middle DIO-aligned extent.
Can you define synchronous better here? The term is unfortunately
overloaded between synchronous syscalls vs aio/io_uring and O_(D)SYNC
style I/O. As of now I don't understand which one you mean, especially
with the DONTCACHE reference thrown in, but I guess I'll figure it out
reading the patch.
> If vfs_iocb_iter_write() returns -ENOTBLK, due to its inability to
> invalidate the page cache on behalf of the DIO WRITE, then
> nfsd_issue_write_dio() will fall back to using buffered IO.
Did you see -ENOTBLK leaking out of the file systems? Because at
least for iomap it is supposed to be an indication that the
file system ->write_iter handler needs to retry using buffered
I/O and never leak to the caller.
> These changes served as the original starting point for the NFS
> client's misaligned O_DIRECT support that landed with
> commit c817248fc831 ("nfs/localio: add proper O_DIRECT support for
> READ and WRITE"). But NFSD's support is simpler because it currently
> doesn't use AIO completion.
I don't understand this paragraph. What does starting point mean
here? How does it matter for the patch description?
> +struct nfsd_write_dio {
> + ssize_t start_len; /* Length for misaligned first extent */
> + ssize_t middle_len; /* Length for DIO-aligned middle extent */
> + ssize_t end_len; /* Length for misaligned last extent */
> +};
Looking at how the code is structured later on, it seems like it would
work much better if each of these sections had it's own object with
the len, iov_iter, flag if it's aligned, etc. Otherwise we have this
structure and lots of arrays of three items passed around.
> +static bool
> +nfsd_iov_iter_aligned_bvec(const struct iov_iter *i, unsigned int addr_mask,
> + unsigned int len_mask)
Wouldn't it make sense to track the alignment when building the bio_vec
array instead of doing another walk here touching all cache lines?
> + if (unlikely(dio_blocksize > PAGE_SIZE))
> + return false;
Why does this matter? Can you add a comment explaining it?
> +static int
> +nfsd_buffered_write(struct svc_rqst *rqstp, struct file *file,
> + unsigned int nvecs, unsigned long *cnt,
> + struct kiocb *kiocb)
> +{
> + struct iov_iter iter;
> + int host_err;
> +
> + iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
> + host_err = vfs_iocb_iter_write(file, kiocb, &iter);
> + if (host_err < 0)
> + return host_err;
> + *cnt = host_err;
> +
> + return 0;
Nothing really buffered here per se, it's just a small wrapper
around vfs_iocb_iter_write.
> + /*
> + * Any buffered IO issued here will be misaligned, use
> + * sync IO to ensure it has completed before returning.
> + * Also update @stable_how to avoid need for COMMIT.
> + */
> + kiocb->ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
What do you mean with completed before returning? I guess you
mean writeback actually happening, right? Why do you need that,
why do you also force it for the direct I/O?
Also IOCB_SYNC is wrong here, as the only thing it does over
IOCB_DSYNC is also forcing back of metadata not needed to find
data (aka timestamps), which I can't see any need for here.
> + *stable_how = NFS_FILE_SYNC;
> +
> + *cnt = 0;
> + for (int i = 0; i < n_iters; i++) {
> + if (iter_is_dio_aligned[i])
> + kiocb->ki_flags |= IOCB_DIRECT;
> + else
> + kiocb->ki_flags &= ~IOCB_DIRECT;
> +
> + host_err = vfs_iocb_iter_write(file, kiocb, &iter[i]);
> + if (host_err < 0) {
> + /*
> + * VFS will return -ENOTBLK if DIO WRITE fails to
> + * invalidate the page cache. Retry using buffered IO.
> + */
> + if (unlikely(host_err == -ENOTBLK)) {
The VFS certainly does not, and if it leaks out of a specific file
system we need to fix that.
> + } else if (unlikely(host_err == -EINVAL)) {
> + struct inode *inode = d_inode(fhp->fh_dentry);
> +
> + pr_info_ratelimited("nfsd: Direct I/O alignment failure on %s/%ld\n",
> + inode->i_sb->s_id, inode->i_ino);
> + host_err = -ESERVERFAULT;
-EINVAL can be lot more things than alignment failure. And more
importantly alignment failures should not happen with the proper
checks in place.
next prev parent reply other threads:[~2025-10-20 7:19 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-18 0:54 [PATCH v4 0/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-10-18 0:54 ` [PATCH v4 1/3] NFSD: Enable return of an updated stable_how to NFS clients Chuck Lever
2025-10-20 7:02 ` Christoph Hellwig
2025-10-18 0:54 ` [PATCH v4 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-10-20 7:19 ` Christoph Hellwig [this message]
2025-10-20 13:56 ` Chuck Lever
2025-10-20 14:05 ` Christoph Hellwig
2025-10-20 16:27 ` Mike Snitzer
2025-10-22 5:14 ` Christoph Hellwig
2025-10-22 14:37 ` Chuck Lever
2025-10-23 5:46 ` Christoph Hellwig
2025-10-21 11:24 ` Jeff Layton
2025-10-22 5:16 ` Christoph Hellwig
2025-10-22 10:15 ` Jeff Layton
2025-10-22 11:17 ` Christoph Hellwig
2025-10-22 11:30 ` Jeff Layton
2025-10-22 13:31 ` Chuck Lever
2025-10-23 5:27 ` Christoph Hellwig
2025-10-22 17:59 ` Chuck Lever
2025-10-23 5:52 ` Christoph Hellwig
2025-10-18 0:54 ` [PATCH v4 3/3] svcrdma: Mark Read chunks Chuck Lever
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aPXihwGTiA7bqTsN@infradead.org \
--to=hch@infradead.org \
--cc=cel@kernel.org \
--cc=dai.ngo@oracle.com \
--cc=jlayton@kernel.org \
--cc=linux-nfs@vger.kernel.org \
--cc=neil@brown.name \
--cc=okorniev@redhat.com \
--cc=snitzer@kernel.org \
--cc=tom@talpey.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).