From: Christoph Hellwig <hch@infradead.org>
To: Chuck Lever <cel@kernel.org>
Cc: NeilBrown <neil@brown.name>, Jeff Layton <jlayton@kernel.org>,
	Olga Kornievskaia <okorniev@redhat.com>,
	Dai Ngo <dai.ngo@oracle.com>, Tom Talpey <tom@talpey.com>,
	linux-nfs@vger.kernel.org, Mike Snitzer <snitzer@kernel.org>
Subject: Re: [PATCH v4 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
Date: Mon, 20 Oct 2025 00:19:35 -0700
Message-ID: <aPXihwGTiA7bqTsN@infradead.org>
In-Reply-To: <20251018005431.3403-3-cel@kernel.org>

On Fri, Oct 17, 2025 at 08:54:30PM -0400, Chuck Lever wrote:
> From: Mike Snitzer <snitzer@kernel.org>
> 
> If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
> middle and end as needed. The large middle extent is DIO-aligned and
> the start and/or end are misaligned. Synchronous buffered IO (with
> preference towards using DONTCACHE) is used for the misaligned extents
> and O_DIRECT is used for the middle DIO-aligned extent.

Can you define synchronous better here?  The term is unfortunately
overloaded between synchronous syscalls vs aio/io_uring and O_(D)SYNC
style I/O.  As of now I don't understand which one you mean, especially
with the DONTCACHE reference thrown in, but I guess I'll figure it out
reading the patch.

> If vfs_iocb_iter_write() returns -ENOTBLK, due to its inability to
> invalidate the page cache on behalf of the DIO WRITE, then
> nfsd_issue_write_dio() will fall back to using buffered IO.

Did you see -ENOTBLK leaking out of the file systems?  Because at
least for iomap it is supposed to be an indication that the
file system ->write_iter handler needs to retry using buffered
I/O and never leak to the caller.
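
To illustrate the expectation above, here is a minimal userspace toy
model, not kernel code and not taken from the patch.  All names
(toy_file, toy_write_iter, the two stub paths) are hypothetical.  It
only shows the shape of the contract: the write handler consumes
-ENOTBLK itself and retries through the buffered path, so the caller
never sees that error.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for a file system's internal write paths. */
struct toy_file {
	ssize_t (*direct_write)(struct toy_file *f, const void *buf, size_t len);
	ssize_t (*buffered_write)(struct toy_file *f, const void *buf, size_t len);
	int fell_back;	/* set when the buffered fallback was taken */
};

/* Stub direct path that always asks for a buffered retry. */
static ssize_t toy_direct_enotblk(struct toy_file *f, const void *buf, size_t len)
{
	(void)f; (void)buf; (void)len;
	return -ENOTBLK;
}

/* Stub buffered path that always accepts the whole write. */
static ssize_t toy_buffered_ok(struct toy_file *f, const void *buf, size_t len)
{
	(void)f; (void)buf;
	return (ssize_t)len;
}

/* Toy ->write_iter: -ENOTBLK is handled here and never returned. */
static ssize_t toy_write_iter(struct toy_file *f, const void *buf, size_t len,
			      int want_direct)
{
	ssize_t ret;

	if (want_direct) {
		ret = f->direct_write(f, buf, len);
		if (ret != -ENOTBLK)
			return ret;	/* success or a real error */
		f->fell_back = 1;	/* retry below using buffered I/O */
	}
	return f->buffered_write(f, buf, len);
}
```

With the two stubs wired in, a direct write request returns the full
length and sets fell_back, and the caller has nothing to special-case.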

> These changes served as the original starting point for the NFS
> client's misaligned O_DIRECT support that landed with
> commit c817248fc831 ("nfs/localio: add proper O_DIRECT support for
> READ and WRITE"). But NFSD's support is simpler because it currently
> doesn't use AIO completion.

I don't understand this paragraph.  What does starting point mean
here?  How does it matter for the patch description?

> +struct nfsd_write_dio {
> +     ssize_t start_len;      /* Length for misaligned first extent */
> +     ssize_t middle_len;     /* Length for DIO-aligned middle extent */
> +     ssize_t end_len;        /* Length for misaligned last extent */
> +};

Looking at how the code is structured later on, it seems like it would
work much better if each of these sections had its own object with
the len, iov_iter, flag if it's aligned, etc.  Otherwise we have this
structure and lots of arrays of three items passed around.
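
A rough userspace sketch of what one object per section could look
like.  The names (write_segment, split_write) are made up for
illustration, the iov_iter is left out, and only file-offset alignment
is considered here; memory alignment of the buffers would need its own
check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-section object: each extent carries its own state. */
struct write_segment {
	uint64_t offset;	/* file offset where this segment starts */
	size_t len;		/* number of bytes in this segment */
	bool dio_aligned;	/* true if eligible for direct I/O */
};

/*
 * Split [offset, offset + len) into at most three segments.
 * @align must be a power of two.  Returns the number of segments used.
 */
static int split_write(uint64_t offset, size_t len, size_t align,
		       struct write_segment segs[3])
{
	uint64_t mask = (uint64_t)align - 1;
	uint64_t end = offset + len;
	uint64_t mid_start, mid_end;
	int n = 0;

	assert(align != 0 && (align & (align - 1)) == 0);

	mid_start = (offset + mask) & ~mask;	/* round start up */
	mid_end = end & ~mask;			/* round end down */

	if (mid_start >= mid_end) {
		/* No aligned middle: one misaligned segment, if any. */
		if (len)
			segs[n++] = (struct write_segment){ offset, len, false };
		return n;
	}
	if (mid_start > offset)
		segs[n++] = (struct write_segment){
			offset, (size_t)(mid_start - offset), false };
	segs[n++] = (struct write_segment){
		mid_start, (size_t)(mid_end - mid_start), true };
	if (end > mid_end)
		segs[n++] = (struct write_segment){
			mid_end, (size_t)(end - mid_end), false };
	return n;
}
```

For example, a 10000 byte write at offset 100 with a 4096 byte
alignment yields three segments of 3996, 4096 and 1908 bytes, and the
issuing loop can walk a single array instead of three parallel ones.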

> +static bool
> +nfsd_iov_iter_aligned_bvec(const struct iov_iter *i, unsigned int addr_mask,
> +                        unsigned int len_mask)

Wouldn't it make sense to track the alignment when building the bio_vec
array instead of doing another walk here touching all cache lines?
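
Something along these lines, as a userspace sketch with hypothetical
names (toy_vec, toy_vec_array) and plain pointers standing in for the
page/offset pairs of a real bio_vec: OR the address and length bits
together as each entry is added, so the later alignment check is two
mask tests instead of a second walk over the array.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_VECS 16

/* Hypothetical stand-in for one bio_vec entry. */
struct toy_vec {
	void *base;
	size_t len;
};

struct toy_vec_array {
	struct toy_vec vecs[TOY_MAX_VECS];
	unsigned int nvecs;
	uintptr_t addr_bits;	/* OR of every segment address added so far */
	size_t len_bits;	/* OR of every segment length added so far */
};

/* Append one segment and fold its address and length into the summary. */
static bool toy_vec_add(struct toy_vec_array *a, void *base, size_t len)
{
	if (a->nvecs >= TOY_MAX_VECS)
		return false;
	a->vecs[a->nvecs++] = (struct toy_vec){ base, len };
	a->addr_bits |= (uintptr_t)base;
	a->len_bits |= len;
	return true;
}

/* Constant-time check: no low bits set in any address or length. */
static bool toy_vec_aligned(const struct toy_vec_array *a,
			    unsigned int addr_mask, unsigned int len_mask)
{
	return !(a->addr_bits & addr_mask) && !(a->len_bits & len_mask);
}
```

The summary is conservative for the array as a whole: one misaligned
entry marks the entire array as misaligned, which matches what an
all-or-nothing direct I/O decision needs.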

> +	if (unlikely(dio_blocksize > PAGE_SIZE))
> +		return false;

Why does this matter?  Can you add a comment explaining it?

> +static int
> +nfsd_buffered_write(struct svc_rqst *rqstp, struct file *file,
> +		    unsigned int nvecs, unsigned long *cnt,
> +		    struct kiocb *kiocb)
> +{
> +	struct iov_iter iter;
> +	int host_err;
> +
> +	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
> +	host_err = vfs_iocb_iter_write(file, kiocb, &iter);
> +	if (host_err < 0)
> +		return host_err;
> +	*cnt = host_err;
> +
> +	return 0;


Nothing really buffered here per se, it's just a small wrapper
around vfs_iocb_iter_write.

> +	/*
> +	 * Any buffered IO issued here will be misaligned, use
> +	 * sync IO to ensure it has completed before returning.
> +	 * Also update @stable_how to avoid need for COMMIT.
> +	 */
> +	kiocb->ki_flags |= (IOCB_DSYNC | IOCB_SYNC);

What do you mean by completed before returning?  I guess you
mean writeback actually happening, right?  Why do you need that,
and why do you also force it for the direct I/O?

Also IOCB_SYNC is wrong here, as the only thing it does over
IOCB_DSYNC is also forcing writeback of metadata not needed to find
the data (aka timestamps), which I can't see any need for here.

> +	*stable_how = NFS_FILE_SYNC;
> +
> +	*cnt = 0;
> +	for (int i = 0; i < n_iters; i++) {
> +		if (iter_is_dio_aligned[i])
> +			kiocb->ki_flags |= IOCB_DIRECT;
> +		else
> +			kiocb->ki_flags &= ~IOCB_DIRECT;
> +
> +		host_err = vfs_iocb_iter_write(file, kiocb, &iter[i]);
> +		if (host_err < 0) {
> +			/*
> +			 * VFS will return -ENOTBLK if DIO WRITE fails to
> +			 * invalidate the page cache. Retry using buffered IO.
> +			 */
> +			if (unlikely(host_err == -ENOTBLK)) {

The VFS certainly does not, and if it leaks out of a specific file
system we need to fix that.

> +			} else if (unlikely(host_err == -EINVAL)) {
> +				struct inode *inode = d_inode(fhp->fh_dentry);
> +
> +				pr_info_ratelimited("nfsd: Direct I/O alignment failure on %s/%ld\n",
> +						    inode->i_sb->s_id, inode->i_ino);
> +				host_err = -ESERVERFAULT;

-EINVAL can be a lot more things than an alignment failure.  And more
importantly, alignment failures should not happen with the proper
checks in place.


