From: Jeff Layton <jlayton@kernel.org>
To: Christoph Hellwig <hch@infradead.org>, Chuck Lever <cel@kernel.org>
Cc: NeilBrown <neil@brown.name>,
Olga Kornievskaia <okorniev@redhat.com>,
Dai Ngo <dai.ngo@oracle.com>, Tom Talpey <tom@talpey.com>,
linux-nfs@vger.kernel.org, Mike Snitzer <snitzer@kernel.org>
Subject: Re: [PATCH v4 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
Date: Tue, 21 Oct 2025 07:24:42 -0400
Message-ID: <a5f3911ae6b65c70e1fd897bdd4f3e651decb196.camel@kernel.org>
In-Reply-To: <aPXihwGTiA7bqTsN@infradead.org>
On Mon, 2025-10-20 at 00:19 -0700, Christoph Hellwig wrote:
> On Fri, Oct 17, 2025 at 08:54:30PM -0400, Chuck Lever wrote:
> > From: Mike Snitzer <snitzer@kernel.org>
> >
> > If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
> > middle and end as needed. The large middle extent is DIO-aligned and
> > the start and/or end are misaligned. Synchronous buffered IO (with
> > preference towards using DONTCACHE) is used for the misaligned extents
> > and O_DIRECT is used for the middle DIO-aligned extent.
>
> Can you define synchronous better here? The term is unfortunately
> overloaded between synchronous syscalls vs aio/io_uring and O_(D)SYNC
> style I/O. As of now I don't understand which one you mean, especially
> with the DONTCACHE reference thrown in, but I guess I'll figure it out
> reading the patch.
>
> > If vfs_iocb_iter_write() returns -ENOTBLK, due to its inability to
> > invalidate the page cache on behalf of the DIO WRITE, then
> > nfsd_issue_write_dio() will fall back to using buffered IO.
>
> Did you see -ENOTBLK leaking out of the file systems? Because at
> least for iomap it is supposed to be an indication that the
> file system ->write_iter handler needs to retry using buffered
> I/O and never leak to the caller.
>
> > These changes served as the original starting point for the NFS
> > client's misaligned O_DIRECT support that landed with
> > commit c817248fc831 ("nfs/localio: add proper O_DIRECT support for
> > READ and WRITE"). But NFSD's support is simpler because it currently
> > doesn't use AIO completion.
>
> I don't understand this paragraph. What does starting point mean
> here? How does it matter for the patch description?
>
> > +struct nfsd_write_dio {
> > + ssize_t start_len; /* Length for misaligned first extent */
> > + ssize_t middle_len; /* Length for DIO-aligned middle extent */
> > + ssize_t end_len; /* Length for misaligned last extent */
> > +};
>
> Looking at how the code is structured later on, it seems like it would
> work much better if each of these sections had its own object with
> the len, iov_iter, flag if it's aligned, etc. Otherwise we have this
> structure and lots of arrays of three items passed around.
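
FWIW, I read that as something like the following (just a sketch, the
name is made up):

struct nfsd_write_seg {
	struct iov_iter	iter;		/* source data for this segment */
	ssize_t		len;		/* bytes covered by this segment */
	bool		dio_aligned;	/* whether to set IOCB_DIRECT */
};

i.e. the issue loop walks one array of these instead of a struct of
three lengths plus parallel arrays of iov_iters and alignment flags.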
>
> > +static bool
> > +nfsd_iov_iter_aligned_bvec(const struct iov_iter *i, unsigned int addr_mask,
> > + unsigned int len_mask)
>
> Wouldn't it make sense to track the alignment when building the bio_vec
> array instead of doing another walk here touching all cache lines?
>
> > + if (unlikely(dio_blocksize > PAGE_SIZE))
> > + return false;
>
> Why does this matter? Can you add a comment explaining it?
>
> > +static int
> > +nfsd_buffered_write(struct svc_rqst *rqstp, struct file *file,
> > + unsigned int nvecs, unsigned long *cnt,
> > + struct kiocb *kiocb)
> > +{
> > + struct iov_iter iter;
> > + int host_err;
> > +
> > + iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
> > + host_err = vfs_iocb_iter_write(file, kiocb, &iter);
> > + if (host_err < 0)
> > + return host_err;
> > + *cnt = host_err;
> > +
> > + return 0;
>
>
> Nothing really buffered here per se, it's just a small wrapper
> around vfs_iocb_iter_write.
>
> > + /*
> > + * Any buffered IO issued here will be misaligned, use
> > + * sync IO to ensure it has completed before returning.
> > + * Also update @stable_how to avoid need for COMMIT.
> > + */
> > + kiocb->ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
>
> What do you mean with completed before returning? I guess you
> mean writeback actually happening, right? Why do you need that,
> why do you also force it for the direct I/O?
>
> Also IOCB_SYNC is wrong here, as the only thing it does over
> IOCB_DSYNC is also forcing writeback of metadata not needed to find
> data (aka timestamps), which I can't see any need for here.
>

Responding to a WRITE with the NFS_FILE_SYNC flag set means that the data
the client wrote is now on stable storage (and hence the client doesn't
need to follow up with a COMMIT). This patch uses DIO for the aligned
middle extent, but any misaligned ends use buffered I/O. If we want to
return NFS_FILE_SYNC here, then all of the data and metadata need to be
on disk when the reply goes out.
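
For reference, the stable_how values are defined in
include/uapi/linux/nfs.h; the comments here are my paraphrase of the
RFC 1813 semantics, not from the header:

enum nfs3_stable_how {
	NFS_UNSTABLE  = 0,	/* reply may precede durability; client must COMMIT */
	NFS_DATA_SYNC = 1,	/* data plus metadata needed to retrieve it is durable */
	NFS_FILE_SYNC = 2,	/* data and all file metadata durable; no COMMIT needed */
};
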
Don't we need IOCB_SYNC in that case? Otherwise the server could crash
while the metadata is still only in memory, and when it comes back up
the client could see stale timestamps. Maybe that's not fatal, but it
seems pretty sketchy.
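
If it helps to be concrete, this is roughly what the post-write sync
path does (paraphrased from memory of generic_write_sync() in
include/linux/fs.h, so check the actual tree before quoting me):

static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
{
	if (iocb_is_dsync(iocb)) {
		/* IOCB_SYNC promotes the datasync flush to a full fsync */
		int ret = vfs_fsync_range(iocb->ki_filp,
				iocb->ki_pos - count, iocb->ki_pos - 1,
				(iocb->ki_flags & IOCB_SYNC) ? 0 : 1);
		if (ret)
			return ret;
	}
	return count;
}

With IOCB_DSYNC alone, metadata that isn't needed to find the data
(mostly timestamps) can still be lost across a crash; IOCB_SYNC is what
pulls that in too, which is exactly what I'm worried about above.
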
> > + *stable_how = NFS_FILE_SYNC;
> > +
> > + *cnt = 0;
> > + for (int i = 0; i < n_iters; i++) {
> > + if (iter_is_dio_aligned[i])
> > + kiocb->ki_flags |= IOCB_DIRECT;
> > + else
> > + kiocb->ki_flags &= ~IOCB_DIRECT;
> > +
> > + host_err = vfs_iocb_iter_write(file, kiocb, &iter[i]);
> > + if (host_err < 0) {
> > + /*
> > + * VFS will return -ENOTBLK if DIO WRITE fails to
> > + * invalidate the page cache. Retry using buffered IO.
> > + */
> > + if (unlikely(host_err == -ENOTBLK)) {
>
> The VFS certainly does not, and if it leaks out of a specific file
> system we need to fix that.
>
> > + } else if (unlikely(host_err == -EINVAL)) {
> > + struct inode *inode = d_inode(fhp->fh_dentry);
> > +
> > + pr_info_ratelimited("nfsd: Direct I/O alignment failure on %s/%ld\n",
> > + inode->i_sb->s_id, inode->i_ino);
> > + host_err = -ESERVERFAULT;
>
> -EINVAL can be lot more things than alignment failure. And more
> importantly alignment failures should not happen with the proper
> checks in place.
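
As an aside, for anyone following along without the patch in front of
them: the start/middle/end split is just rounding the write range inward
to the DIO block size. Roughly (illustrative only; the helper name and
signature are made up, dio_blocksize is assumed to be a power of two,
and the in-memory alignment of the bvecs still has to be checked
separately):

static void split_for_dio(loff_t pos, size_t count, u32 dio_blocksize,
			  ssize_t *start_len, ssize_t *middle_len,
			  ssize_t *end_len)
{
	/* first and last file offsets that are DIO-block aligned */
	loff_t astart = round_up(pos, dio_blocksize);
	loff_t aend = round_down(pos + count, dio_blocksize);

	if (aend <= astart) {
		/* no DIO-able middle; the whole write goes buffered */
		*start_len = count;
		*middle_len = 0;
		*end_len = 0;
		return;
	}
	*start_len = astart - pos;		/* misaligned head */
	*middle_len = aend - astart;		/* DIO-aligned middle */
	*end_len = pos + count - aend;		/* misaligned tail */
}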
--
Jeff Layton <jlayton@kernel.org>