From: Mike Snitzer <snitzer@kernel.org>
To: Chuck Lever <chuck.lever@oracle.com>
Cc: linux-nfs@vger.kernel.org, Jeff Layton <jlayton@kernel.org>
Subject: Re: [PATCH v4 4/4] NFSD: issue WRITEs using O_DIRECT even if IO is misaligned
Date: Wed, 6 Aug 2025 11:55:06 -0400
Message-ID: <aJN62l2AY00s5tVC@kernel.org>
In-Reply-To: <6d862893-ac75-4727-a5a2-abfff55b9836@oracle.com>
On Wed, Aug 06, 2025 at 09:53:10AM -0400, Chuck Lever wrote:
> On 8/5/25 2:44 PM, Mike Snitzer wrote:
> > If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
> > middle and end as needed. The large middle extent is DIO-aligned and
> > the start and/or end are misaligned. Buffered IO is used for the
> > misaligned extents and O_DIRECT is used for the middle DIO-aligned
> > extent.
> >
> > The nfsd_analyze_write_dio trace event shows how NFSD splits a given
> > misaligned WRITE into a mix of misaligned extent(s) and a DIO-aligned
> > extent.
> >
> > This combination of trace events is useful:
> >
> > echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_opened/enable
> > echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_analyze_write_dio/enable
> > echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_io_done/enable
> > echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable
> >
> > Which for this dd command:
> >
> > dd if=/dev/zero of=/mnt/share1/test bs=47008 count=2 oflag=direct
> >
> > Results in:
> >
> > nfsd-55714 [043] ..... 79976.260851: nfsd_write_opened: xid=0x966c5d2d fh_hash=0x4d34e6c1 offset=0 len=47008
> > nfsd-55714 [043] ..... 79976.260852: nfsd_analyze_write_dio: xid=0x966c5d2d fh_hash=0x4d34e6c1 offset=0 len=47008 start=0+0 middle=0+45056 end=45056+1952
> > nfsd-55714 [043] ..... 79976.260857: xfs_file_direct_write: dev 259:12 ino 0x3e00008f disize 0x0 pos 0x0 bytecount 0xb000
> > nfsd-55714 [043] ..... 79976.260965: nfsd_write_io_done: xid=0x966c5d2d fh_hash=0x4d34e6c1 offset=0 len=47008
> >
> > nfsd-55714 [043] ..... 79976.307762: nfsd_write_opened: xid=0x67e5ce6f fh_hash=0x4d34e6c1 offset=47008 len=47008
> > nfsd-55714 [043] ..... 79976.307762: nfsd_analyze_write_dio: xid=0x67e5ce6f fh_hash=0x4d34e6c1 offset=47008 len=47008 start=47008+2144 middle=49152+40960 end=90112+3904
> > nfsd-55714 [043] ..... 79976.307797: xfs_file_direct_write: dev 259:12 ino 0x3e00008f disize 0xc000 pos 0xc000 bytecount 0xa000
> > nfsd-55714 [043] ..... 79976.307866: nfsd_write_io_done: xid=0x67e5ce6f fh_hash=0x4d34e6c1 offset=47008 len=47008
> >
> > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > Reviewed-by: Jeff Layton <jlayton@kernel.org>
> > ---
> > fs/nfsd/vfs.c | 142 ++++++++++++++++++++++++++++++++++++++++++++++----
> > 1 file changed, 131 insertions(+), 11 deletions(-)
> >
> > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > index 0d4f9f452d466..4980800fab66e 100644
> > --- a/fs/nfsd/vfs.c
> > +++ b/fs/nfsd/vfs.c
> > @@ -1315,6 +1315,121 @@ static int wait_for_concurrent_writes(struct file *file)
> > return err;
> > }
> >
> > +struct nfsd_write_dio
> > +{
>
> struct nfsd_write_dio {
Yep, fixed now ;)
> > + loff_t middle_offset; /* Offset for start of DIO-aligned middle */
> > + loff_t end_offset; /* Offset for start of misaligned end */
> > + ssize_t start_len; /* Length for misaligned first extent */
> > + ssize_t middle_len; /* Length for DIO-aligned middle extent */
> > + ssize_t end_len; /* Length for misaligned last extent */
> > +};
> > +
> > +static void init_nfsd_write_dio(struct nfsd_write_dio *write_dio)
> > +{
> > + memset(write_dio, 0, sizeof(*write_dio));
> > +}
> > +
> > +static bool nfsd_analyze_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > + struct nfsd_file *nf, loff_t offset,
> > + unsigned long len, struct nfsd_write_dio *write_dio)
> > +{
> > + const u32 dio_blocksize = nf->nf_dio_offset_align;
> > + loff_t orig_end, middle_end, start_end, start_offset = offset;
> > + ssize_t start_len = len;
> > + bool aligned = true;
> > +
> > + if (WARN_ONCE(!nf->nf_dio_mem_align || !dio_blocksize,
> > + "%s: underlying filesystem has not provided DIO alignment info\n",
> > + __func__))
> > + return false;
> > +
> > + if (WARN_ONCE(dio_blocksize > PAGE_SIZE,
> > + "%s: underlying storage's dio_blocksize=%u > PAGE_SIZE=%lu\n",
> > + __func__, dio_blocksize, PAGE_SIZE))
> > + return false;
> > +
> > + if (unlikely(len < dio_blocksize)) {
> > + aligned = false;
> > + goto out;
> > + }
> > +
> > + if (((offset | len) & (dio_blocksize-1)) == 0) {
> > + /* already DIO-aligned, no misaligned head or tail */
> > + write_dio->middle_offset = offset;
> > + write_dio->middle_len = len;
> > + /* clear these for the benefit of trace_nfsd_analyze_write_dio */
> > + start_offset = 0;
> > + start_len = 0;
> > + goto out;
> > + }
> > +
> > + start_end = round_up(offset, dio_blocksize);
> > + start_len = start_end - offset;
> > + orig_end = offset + len;
> > + middle_end = round_down(orig_end, dio_blocksize);
> > +
> > + write_dio->start_len = start_len;
> > + write_dio->middle_offset = start_end;
> > + write_dio->middle_len = middle_end - start_end;
> > + write_dio->end_offset = middle_end;
> > + write_dio->end_len = orig_end - middle_end;
> > +out:
> > + trace_nfsd_analyze_write_dio(rqstp, fhp, offset, len, start_offset, start_len,
> > + write_dio->middle_offset, write_dio->middle_len,
> > + write_dio->end_offset, write_dio->end_len);
> > + return aligned;
> > +}
> > +
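For anyone checking the trace output by hand, the split arithmetic in
nfsd_analyze_write_dio() can be reproduced in userspace. What follows is
only a sketch: struct write_split and split_write() are hypothetical
stand-ins rather than kernel code, and the 4096-byte alignment is an
assumption (it is the value the quoted trace numbers are consistent with).
Unlike the patch, the sketch also fills in end_offset for an already
aligned write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the patch's struct nfsd_write_dio. */
struct write_split {
	int64_t start_offset, start_len;	/* misaligned head */
	int64_t middle_offset, middle_len;	/* DIO-aligned middle */
	int64_t end_offset, end_len;		/* misaligned tail */
};

/*
 * Split [offset, offset + len) on blocksize boundaries.
 * Returns false when the write is shorter than one block, in which
 * case *s is left zeroed and the caller would use buffered I/O.
 */
static bool split_write(int64_t offset, int64_t len, uint32_t blocksize,
			struct write_split *s)
{
	int64_t mask = (int64_t)blocksize - 1;
	int64_t orig_end = offset + len;

	/* the mask arithmetic below needs a power-of-two block size */
	assert(blocksize && (blocksize & (blocksize - 1)) == 0);

	*s = (struct write_split){ 0 };
	if (len < blocksize)
		return false;

	s->start_offset = offset;
	s->middle_offset = (offset + mask) & ~mask;	/* round_up() */
	s->end_offset = orig_end & ~mask;		/* round_down() */
	s->start_len = s->middle_offset - offset;
	s->middle_len = s->end_offset - s->middle_offset;
	s->end_len = orig_end - s->end_offset;
	return true;
}
```

With offset=47008 and len=47008 this gives start=47008+2144,
middle=49152+40960 and end=90112+3904, which is what the second
nfsd_analyze_write_dio trace line reports.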
> > +/*
> > + * Setup as many as 3 iov_iter based on extents possibly described by @write_dio.
> > + * @iterp: pointer to pointer to onstack array of 3 iov_iter structs from caller.
> > + * @iter_is_dio_aligned: pointer to onstack array of 3 bools from caller.
> > + * @dio_aligned: bool that reflects nfsd_analyze_write_dio()'s return
> > + * @rq_bvec: backing bio_vec used to setup all 3 iov_iter permutations.
> > + * @nvecs: number of segments in @rq_bvec
> > + * @cnt: size of the request in bytes
> > + * @write_dio: nfsd_write_dio struct that describes start, middle and end extents.
> > + *
> > + * Returns the number of iov_iter that were setup.
> > + */
> > +static int nfsd_setup_write_iters(struct iov_iter **iterp, bool *iter_is_dio_aligned,
> > + bool dio_aligned, struct bio_vec *rq_bvec,
> > + unsigned int nvecs, unsigned long cnt,
> > + struct nfsd_write_dio *write_dio)
> > +{
> > + int n_iters = 0;
> > + struct iov_iter *iters = *iterp;
> > +
> > + /* Setup misaligned start? */
> > + if (write_dio->start_len) {
> > + iter_is_dio_aligned[n_iters] = false;
> > + iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
> > + iters[n_iters].count = write_dio->start_len;
> > + n_iters++;
> > + }
> > +
> > + /* Setup possibly DIO-aligned middle */
> > + iter_is_dio_aligned[n_iters] = dio_aligned;
> > + iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
> > + if (dio_aligned) {
> > + if (write_dio->start_len)
> > + iov_iter_advance(&iters[n_iters], write_dio->start_len);
> > + iters[n_iters].count -= write_dio->end_len;
> > + }
> > + n_iters++;
> > +
> > + /* Setup misaligned end? */
> > + if (write_dio->end_len) {
> > + iter_is_dio_aligned[n_iters] = false;
> > + iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
> > + iov_iter_advance(&iters[n_iters],
> > + write_dio->start_len + write_dio->middle_len);
> > + n_iters++;
> > + }
> > +
> > + return n_iters;
> > +}
> > +
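The iterator setup above amounts to carving one request buffer into up
to three consecutive views. Below is a minimal userspace sketch of that
idea, assuming a single flat buffer in place of the bio_vec array;
struct span and setup_spans() are hypothetical names, not kernel API.
It also differs from the patch in that it skips empty extents, whereas
nfsd_setup_write_iters() always sets up the middle iov_iter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an iov_iter over one flat buffer. */
struct span {
	const unsigned char *base;
	size_t len;
};

/*
 * Carve buf[0..cnt) into head, middle and tail views.
 * Returns the number of spans stored in out[] (at most 3).
 */
static int setup_spans(struct span out[3], const unsigned char *buf,
		       size_t cnt, size_t start_len, size_t middle_len,
		       size_t end_len)
{
	int n = 0;

	/* the three extents must cover the request exactly */
	assert(start_len + middle_len + end_len == cnt);

	if (start_len)		/* misaligned head */
		out[n++] = (struct span){ buf, start_len };
	if (middle_len)		/* aligned middle, starts after the head */
		out[n++] = (struct span){ buf + start_len, middle_len };
	if (end_len)		/* misaligned tail */
		out[n++] = (struct span){ buf + start_len + middle_len,
					  end_len };
	return n;
}
```

For the second WRITE in the trace (2144 + 40960 + 3904 bytes) this
yields three spans; for the first (0 + 45056 + 1952) it yields two.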
> > /**
> > * nfsd_vfs_write - write data to an already-open file
> > * @rqstp: RPC execution context
> > @@ -1349,9 +1464,12 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > unsigned int pflags = current->flags;
> > bool restore_flags = false;
> > unsigned int nvecs;
> > - struct iov_iter iter_stack[1];
> > + struct iov_iter iter_stack[3];
>
> struct iov_iter isn't that small. This is going to grow the stack frame
> substantially but is used for only the direct I/O case.
Yes, that's the only lingering footprint I have after another pass at
cleanup based on your IPL feedback yesterday.
I expect to be able to push the use of multiple iov_iter down into the
O_DIRECT path only.
> > struct iov_iter *iter = iter_stack;
> > unsigned int n_iters = 0;
> > + bool iov_iter_is_dio_aligned[3];
> > + bool dio_aligned = false;
> > + struct nfsd_write_dio write_dio;
> >
> > trace_nfsd_write_opened(rqstp, fhp, offset, *cnt);
> >
> > @@ -1380,18 +1498,12 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > if (stable && !fhp->fh_use_wgather)
> > kiocb.ki_flags |= IOCB_DSYNC;
> >
> > - nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
> > - iov_iter_bvec(&iter[0], ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
> > - n_iters++;
> > -
> > + init_nfsd_write_dio(&write_dio);
>
> I assume init_nfsd_write_dio() is going to be called only once.
> Is there a plan to make it more than a memset() ? Can it be called
> in only direct I/O mode?
Yes, I've fixed this.
> > switch (nfsd_io_cache_write) {
> > case NFSD_IO_DIRECT:
> > - /* direct I/O must be aligned to device logical sector size */
> > - if (nf->nf_dio_mem_align && nf->nf_dio_offset_align &&
> > - (((offset | *cnt) & (nf->nf_dio_offset_align-1)) == 0) &&
> > - iov_iter_is_aligned(&iter[0], nf->nf_dio_mem_align - 1,
> > - nf->nf_dio_offset_align - 1))
> > - kiocb.ki_flags = IOCB_DIRECT;
> > + if (nfsd_analyze_write_dio(rqstp, fhp, nf, offset,
> > + *cnt, &write_dio))
> > + dio_aligned = true;
>
> How about
>
> dio_aligned = nfsd_analyze_write_dio(rqstp, fhp, nf,
> offset, *cnt,
> &write_dio);
I've iterated on things a bit, no longer need dio_aligned variable.
> Let's make nfsd_analyze_write_dio a "noinline" so that the compiler
> removes it from the hot path in page cache I/O mode.
OK, will do.
> > break;
> > case NFSD_IO_DONTCACHE:
> > kiocb.ki_flags = IOCB_DONTCACHE;
> > @@ -1400,11 +1512,19 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > break;
> > }
> >
> > + nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
> > + n_iters = nfsd_setup_write_iters(&iter, iov_iter_is_dio_aligned, dio_aligned,
> > + rqstp->rq_bvec, nvecs, *cnt, &write_dio);
>
> Is there a plan to use buffer re-alignment for the other two I/O modes?
>
> I ask because there are many more conditional branches now, and they
> seem to be useful only if there are multiple iters. And it looks like
> there are multiple iters only in the direct I/O case.
>
> Generally what we do in situations like this is create utility functions
> that contain code common to all paths, and have the separate paths use
> those helpers in the combination that they need. Not only is the
> instruction path length shorter for each individual path, but the
> resulting source code is much more legible.
Yes, I am now working on doing just that.
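For readers following along, here is one possible shape for that kind of
split, as a userspace sketch only. Every name in it is hypothetical and
none of it is taken from a later revision of the patch. The extent
analysis lives in a function reached only in direct I/O mode, marked
with the GCC/Clang noinline attribute (the kernel spells this
"noinline"), and all paths share a small helper.

```c
#include <assert.h>
#include <stddef.h>

/* All names below are hypothetical; none come from fs/nfsd/vfs.c. */
enum io_mode { IO_BUFFERED, IO_DONTCACHE, IO_DIRECT };

struct io_stats {
	size_t buffered;	/* bytes issued through the page cache */
	size_t direct;		/* bytes issued as direct I/O */
};

/* Helper shared by every path: account for one extent. */
static void issue_extent(struct io_stats *st, size_t len, int direct)
{
	if (direct)
		st->direct += len;
	else
		st->buffered += len;
}

/*
 * Direct path only: misaligned head, aligned middle, misaligned tail.
 * Kept out of line so its locals do not grow the caller's stack frame.
 */
static __attribute__((noinline)) void
write_direct(struct io_stats *st, size_t off, size_t len, size_t bs)
{
	size_t mid_start, mid_end;

	assert(bs && (bs & (bs - 1)) == 0);	/* power of two */
	mid_start = (off + bs - 1) & ~(bs - 1);
	mid_end = (off + len) & ~(bs - 1);
	if (mid_end <= mid_start) {
		/* no whole aligned block: use buffered I/O for all of it */
		issue_extent(st, len, 0);
		return;
	}
	issue_extent(st, mid_start - off, 0);
	issue_extent(st, mid_end - mid_start, 1);
	issue_extent(st, off + len - mid_end, 0);
}

/* Dispatcher: the page cache modes never touch the split logic. */
static void do_write(struct io_stats *st, enum io_mode mode,
		     size_t off, size_t len, size_t bs)
{
	if (mode == IO_DIRECT)
		write_direct(st, off, len, bs);
	else
		issue_extent(st, len, 0);
}
```

In this arrangement the buffered and DONTCACHE modes pay for a single
branch, and the three-extent bookkeeping exists only where it is used.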
Thanks,
Mike