From: Mike Snitzer <snitzer@kernel.org>
To: Jeff Layton <jlayton@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>, linux-nfs@vger.kernel.org
Subject: Re: [PATCH v7 6/7] NFSD: issue WRITEs using O_DIRECT even if IO is misaligned
Date: Tue, 26 Aug 2025 12:34:31 -0400
Message-ID: <aK3iF3_807xdXZRk@kernel.org>
In-Reply-To: <e5052736e0d18f153bd3c3a9b75a7349218b5697.camel@kernel.org>

On Mon, Aug 18, 2025 at 03:46:06PM -0400, Jeff Layton wrote:
> On Fri, 2025-08-15 at 10:46 -0400, Mike Snitzer wrote:
> > If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
> > middle and end as needed. The large middle extent is DIO-aligned and
> > the start and/or end are misaligned. Buffered IO is used for the
> > misaligned extents and O_DIRECT is used for the middle DIO-aligned
> > extent.
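
The start/middle/end split described above can be modeled in userspace. This is a minimal sketch under stated assumptions, not the patch code: ROUND_UP()/ROUND_DOWN() stand in for the kernel's round_up()/round_down(), struct write_extents is a hypothetical mirror of struct nfsd_write_dio, split_write() is a hypothetical helper, and @align plays the role of nf_dio_offset_align (assumed to be a power of two).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Userspace stand-ins for the kernel's round_up()/round_down().
 * Both require @a to be a power of two, as DIO alignment is.
 */
#define ROUND_UP(x, a)   (((x) + ((uint64_t)(a) - 1)) & ~((uint64_t)(a) - 1))
#define ROUND_DOWN(x, a) ((x) & ~((uint64_t)(a) - 1))

/* Hypothetical mirror of struct nfsd_write_dio (illustration only). */
struct write_extents {
	uint64_t start_len;	/* misaligned head at the WRITE offset: buffered */
	uint64_t middle_offset;	/* start of the DIO-aligned middle: O_DIRECT */
	uint64_t middle_len;
	uint64_t end_offset;	/* start of the misaligned tail: buffered */
	uint64_t end_len;
};

/*
 * Hypothetical helper: split [offset, offset + len) around @align.
 * Returns false when the caller should just issue one buffered write.
 */
static bool split_write(uint64_t offset, uint64_t len, uint32_t align,
			struct write_extents *ext)
{
	uint64_t start_end, middle_end, orig_end;

	/* Need a power-of-two alignment and a WRITE at least one block long. */
	if (align == 0 || (align & (align - 1)) != 0 || len < align)
		return false;

	orig_end = offset + len;
	start_end = ROUND_UP(offset, align);      /* first aligned byte */
	middle_end = ROUND_DOWN(orig_end, align); /* end of last aligned block */
	if (middle_end <= start_end)
		return false;	/* e.g. offset 1, len 4096: nothing left for DIO */

	ext->start_len = start_end - offset;
	ext->middle_offset = start_end;
	ext->middle_len = middle_end - start_end;
	ext->end_offset = middle_end;
	ext->end_len = orig_end - middle_end;
	return true;
}
```

For example, a 10000-byte WRITE at offset 1000 with 4096-byte alignment splits into a 3096-byte buffered head, a 4096-byte O_DIRECT middle at offset 4096, and a 2808-byte buffered tail at offset 8192. Unlike the quoted nfsd_analyze_write_dio(), the sketch also returns false when no aligned middle remains (offset 1, length 4096), so that case falls back to a plain buffered write.
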
> >
> > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > ---
> > fs/nfsd/vfs.c | 176 +++++++++++++++++++++++++++++++++++++++++++++++---
> > 1 file changed, 168 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > index 64732dc8985d6..afcc22fdddefc 100644
> > --- a/fs/nfsd/vfs.c
> > +++ b/fs/nfsd/vfs.c
> > @@ -1355,6 +1355,169 @@ static int wait_for_concurrent_writes(struct file *file)
> > return err;
> > }
> >
> > +struct nfsd_write_dio {
> > + loff_t middle_offset; /* Offset for start of DIO-aligned middle */
> > + loff_t end_offset; /* Offset for start of DIO-aligned end */
> > + ssize_t start_len; /* Length for misaligned first extent */
> > + ssize_t middle_len; /* Length for DIO-aligned middle extent */
> > + ssize_t end_len; /* Length for misaligned last extent */
> > +};
> > +
> > +static bool
> > +nfsd_analyze_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > + struct nfsd_file *nf, loff_t offset,
> > + unsigned long len, struct nfsd_write_dio *write_dio)
> > +{
> > + const u32 dio_blocksize = nf->nf_dio_offset_align;
> > + loff_t orig_end, middle_end, start_end, start_offset = offset;
> > + ssize_t start_len = len;
> > +
> > + if (WARN_ONCE(!nf->nf_dio_mem_align || !dio_blocksize,
> > + "%s: underlying filesystem has not provided DIO alignment info\n",
> > + __func__))
> > + return false;
> > + if (WARN_ONCE(dio_blocksize > PAGE_SIZE,
> > + "%s: underlying storage's dio_blocksize=%u > PAGE_SIZE=%lu\n",
> > + __func__, dio_blocksize, PAGE_SIZE))
> > + return false;
> > + if (unlikely(len < dio_blocksize))
> > + return false;
> > +
> > + memset(write_dio, 0, sizeof(*write_dio));
> > +
> > + if (((offset | len) & (dio_blocksize-1)) == 0) {
> > + /* already DIO-aligned, no misaligned head or tail */
> > + write_dio->middle_offset = offset;
> > + write_dio->middle_len = len;
> > + /* clear these for the benefit of trace_nfsd_analyze_write_dio */
> > + start_offset = 0;
> > + start_len = 0;
> > + return true;
> > + }
> > +
> > + start_end = round_up(offset, dio_blocksize);
> > + start_len = start_end - offset;
> > + orig_end = offset + len;
> > + middle_end = round_down(orig_end, dio_blocksize);
> > +
> > + write_dio->start_len = start_len;
> > + write_dio->middle_offset = start_end;
> > + write_dio->middle_len = middle_end - start_end;
> > + write_dio->end_offset = middle_end;
> > + write_dio->end_len = orig_end - middle_end;
> > +
> > + return true;
> > +}
> > +
> > +/*
> > + * Setup as many as 3 iov_iter based on extents described by @write_dio.
> > + * @iterp: pointer to pointer to onstack array of 3 iov_iter structs from caller.
> > + * @iter_is_dio_aligned: pointer to onstack array of 3 bools from caller.
> > + * @rq_bvec: backing bio_vec used to setup all 3 iov_iter permutations.
> > + * @nvecs: number of segments in @rq_bvec
> > + * @cnt: size of the request in bytes
> > + * @write_dio: nfsd_write_dio struct that describes start, middle and end extents.
> > + *
> > + * Returns the number of iov_iter that were setup.
> > + */
> > +static int
> > +nfsd_setup_write_dio_iters(struct iov_iter **iterp, bool *iter_is_dio_aligned,
> > + struct bio_vec *rq_bvec, unsigned int nvecs,
> > + unsigned long cnt, struct nfsd_write_dio *write_dio)
> > +{
> > + int n_iters = 0;
> > + struct iov_iter *iters = *iterp;
> > +
> > + /* Setup misaligned start? */
> > + if (write_dio->start_len) {
> > + iter_is_dio_aligned[n_iters] = false;
> > + iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
> > + iters[n_iters].count = write_dio->start_len;
> > + ++n_iters;
> > + }
> > +
> > + /* Setup DIO-aligned middle */
> > + iter_is_dio_aligned[n_iters] = true;
> > + iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
> > + if (write_dio->start_len)
> > + iov_iter_advance(&iters[n_iters], write_dio->start_len);
> > + iters[n_iters].count -= write_dio->end_len;
> > + ++n_iters;
> > +
> > + /* Setup misaligned end? */
> > + if (write_dio->end_len) {
> > + iter_is_dio_aligned[n_iters] = false;
> > + iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
> > + iov_iter_advance(&iters[n_iters],
> > + write_dio->start_len + write_dio->middle_len);
> > + ++n_iters;
> > + }
> > +
> > + return n_iters;
> > +}
> > +
> > +static int
> > +nfsd_issue_write_buffered(struct svc_rqst *rqstp, struct file *file,
> > + unsigned int nvecs, unsigned long *cnt,
> > + struct kiocb *kiocb)
> > +{
> > + struct iov_iter iter;
> > + int host_err;
> > +
> > + iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
> > + host_err = vfs_iocb_iter_write(file, kiocb, &iter);
> > + if (host_err < 0)
> > + return host_err;
> > + *cnt = host_err;
> > +
> > + return 0;
> > +}
> > +
> > +static noinline int
> > +nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > + struct nfsd_file *nf, loff_t offset,
> > + unsigned int nvecs, unsigned long *cnt,
> > + struct kiocb *kiocb)
> > +{
> > + struct nfsd_write_dio write_dio;
> > + struct file *file = nf->nf_file;
> > +
> > + if (!nfsd_analyze_write_dio(rqstp, fhp, nf, offset, *cnt, &write_dio))
> > + return nfsd_issue_write_buffered(rqstp, file, nvecs, cnt, kiocb);
> > + else {
> > + bool iter_is_dio_aligned[3];
> > + struct iov_iter iter_stack[3];
> > + struct iov_iter *iter = iter_stack;
> > + unsigned int n_iters = 0;
> > + int host_err;
> > +
> > + n_iters = nfsd_setup_write_dio_iters(&iter, iter_is_dio_aligned,
> > + rqstp->rq_bvec, nvecs, *cnt, &write_dio);
> > + *cnt = 0;
> > + for (int i = 0; i < n_iters; i++) {
> > + if (iter_is_dio_aligned[i] &&
> > + nfsd_iov_iter_aligned_bvec(&iter[i], nf->nf_dio_mem_align-1,
> > + nf->nf_dio_offset_align-1))
> > + kiocb->ki_flags |= IOCB_DIRECT;
> > + else
> > + kiocb->ki_flags &= ~IOCB_DIRECT;
> > + host_err = vfs_iocb_iter_write(file, kiocb, &iter[i]);
> > + if (host_err < 0) {
> > + /* Underlying FS will return -EINVAL if misaligned
> > + * DIO is attempted because it shouldn't be.
> > + */
> > + WARN_ON_ONCE(host_err == -EINVAL);
> > + return host_err;
> > + }
> > + *cnt += host_err;
> > + if (host_err < iter[i].count) /* partial write? */
> > + return *cnt;
> > + }
>
> If this ends up doing some buffered I/Os on the unaligned end bits,
> should we have it issue a vfs_fsync_range() (or maybe use IOCB_SYNC)
> here as well to ensure that those bits get written back before sending
> the reply? I worry a bit about an aligned (DIO) read of a page racing
> in and not seeing data that hasn't been written back yet.

I haven't hit any problems with it as-is, but that doesn't rule out
such a race.

For the misaligned buffered IO writes, I'm fine with clearing
IOCB_DIRECT and setting IOCB_SYNC.
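
The per-extent flag choice above could look roughly like the following userspace sketch. It is illustrative only: the SKETCH_IOCB_* bits are hypothetical stand-ins for IOCB_DIRECT, IOCB_DSYNC and IOCB_SYNC, and extent_flags()/plan_extent_flags() are hypothetical helpers. It assumes IOCB_SYNC is paired with IOCB_DSYNC, as the kernel does when mapping RWF_SYNC, because generic_write_sync() only flushes when IOCB_DSYNC is set (or the inode is itself sync); IOCB_SYNC then selects a full rather than data-only sync.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-ins for the kernel's IOCB_DIRECT, IOCB_DSYNC and
 * IOCB_SYNC bits; the real values in include/linux/fs.h are not used.
 */
#define SKETCH_IOCB_DIRECT (1u << 0)
#define SKETCH_IOCB_DSYNC  (1u << 1)
#define SKETCH_IOCB_SYNC   (1u << 2)

/*
 * Flags for one extent, derived from the request's original flags:
 * the DIO-aligned middle goes O_DIRECT; a misaligned head or tail goes
 * buffered and synchronous (DSYNC|SYNC, as RWF_SYNC maps) so that its
 * pages are written back before the WRITE reply is sent.
 */
static unsigned int extent_flags(unsigned int base_flags, bool dio_aligned)
{
	if (dio_aligned)
		return base_flags | SKETCH_IOCB_DIRECT;
	return (base_flags & ~SKETCH_IOCB_DIRECT) |
	       SKETCH_IOCB_DSYNC | SKETCH_IOCB_SYNC;
}

/* Fill @flags[i] for each of the @n extents described by @aligned[i]. */
static void plan_extent_flags(unsigned int base_flags, const bool *aligned,
			      unsigned int *flags, size_t n)
{
	for (size_t i = 0; i < n; i++)
		flags[i] = extent_flags(base_flags, aligned[i]);
}
```

Deriving each extent's flags from the request's original flags, instead of toggling IOCB_DIRECT on one shared kiocb as the quoted loop does, also keeps the sync bits added for a misaligned head from carrying over to the O_DIRECT middle.
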
Mike