From: Mike Snitzer <snitzer@kernel.org>
To: Chuck Lever <chuck.lever@oracle.com>
Cc: Jeff Layton <jlayton@kernel.org>, linux-nfs@vger.kernel.org
Subject: Re: [PATCH v5 6/7] NFSD: issue READs using O_DIRECT even if IO is misaligned
Date: Fri, 8 Aug 2025 14:19:38 -0400 [thread overview]
Message-ID: <aJY_unu7uL7h3Q4z@kernel.org> (raw)
In-Reply-To: <9df6001e-8930-4618-841a-14e1831b720d@oracle.com>
On Fri, Aug 08, 2025 at 01:59:47PM -0400, Chuck Lever wrote:
> On 8/7/25 12:25 PM, Mike Snitzer wrote:
> > If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
> > DIO-aligned block (on either end of the READ). The expanded READ is
> > verified to have proper offset/len (logical_block_size) and
> > dma_alignment checking.
> >
> > Must allocate and use a bounce-buffer page (called 'start_extra_page')
> > if/when expanding the misaligned READ requires reading extra partial
> > page at the start of the READ so that its DIO-aligned. Otherwise that
> > extra page at the start will make its way back to the NFS client and
> > corruption will occur. As found, and then this fix of using an extra
> > page verified, using the 'dt' utility:
> > dt of=/mnt/share1/dt_a.test passes=1 bs=47008 count=2 \
> > iotype=sequential pattern=iot onerr=abort oncerr=abort
> > see: https://github.com/RobinTMiller/dt.git
> >
> > Any misaligned READ that is less than 32K won't be expanded to be
> > DIO-aligned (this heuristic just avoids excess work, like allocating
> > start_extra_page, for smaller IO that can generally already perform
> > well using buffered IO).
> >
> > Also add EVENT_CLASS nfsd_analyze_dio_class and use it to create
> > nfsd_analyze_read_dio and nfsd_analyze_write_dio trace events. This
> > prepares for nfsd_vfs_write() to also make use of it when handling
> > misaligned WRITEs.
> >
> > This combination of trace events is useful:
> >
> > echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable
> > echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_analyze_read_dio/enable
> > echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable
> > echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable
> >
> > Which for this dd command:
> >
> > dd if=/mnt/share1/test of=/dev/null bs=47008 count=2 iflag=direct
> >
> > Results in:
> >
> > nfsd-23908 [010] ..... 10375.141640: nfsd_analyze_read_dio: xid=0x82c5923b fh_hash=0x857ca4fc offset=0 len=47008 start=0+0 middle=0+47008 end=47008+96
> > nfsd-23908 [010] ..... 10375.141642: nfsd_read_vector: xid=0x82c5923b fh_hash=0x857ca4fc offset=0 len=47104
> > nfsd-23908 [010] ..... 10375.141643: xfs_file_direct_read: dev 259:2 ino 0xc00116 disize 0x226e0 pos 0x0 bytecount 0xb800
> > nfsd-23908 [010] ..... 10375.141773: nfsd_read_io_done: xid=0x82c5923b fh_hash=0x857ca4fc offset=0 len=47008
> >
> > nfsd-23908 [010] ..... 10375.142063: nfsd_analyze_read_dio: xid=0x83c5923b fh_hash=0x857ca4fc offset=47008 len=47008 start=46592+416 middle=47008+47008 end=94016+192
> > nfsd-23908 [010] ..... 10375.142064: nfsd_read_vector: xid=0x83c5923b fh_hash=0x857ca4fc offset=46592 len=47616
> > nfsd-23908 [010] ..... 10375.142065: xfs_file_direct_read: dev 259:2 ino 0xc00116 disize 0x226e0 pos 0xb600 bytecount 0xba00
> > nfsd-23908 [010] ..... 10375.142103: nfsd_read_io_done: xid=0x83c5923b fh_hash=0x857ca4fc offset=47008 len=47008
> >
> > Suggested-by: Jeff Layton <jlayton@kernel.org>
> > Suggested-by: Chuck Lever <chuck.lever@oracle.com>
> > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > ---
> > fs/nfsd/trace.h | 61 ++++++++++++++
> > fs/nfsd/vfs.c | 167 ++++++++++++++++++++++++++++++++++---
> > include/linux/sunrpc/svc.h | 5 +-
> > 3 files changed, 221 insertions(+), 12 deletions(-)
> >
> > diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> > index a664fdf1161e9..4173bd9344b6b 100644
> > --- a/fs/nfsd/trace.h
> > +++ b/fs/nfsd/trace.h
> > @@ -473,6 +473,67 @@ DEFINE_NFSD_IO_EVENT(write_done);
> > DEFINE_NFSD_IO_EVENT(commit_start);
> > DEFINE_NFSD_IO_EVENT(commit_done);
> >
> > +DECLARE_EVENT_CLASS(nfsd_analyze_dio_class,
> > + TP_PROTO(struct svc_rqst *rqstp,
> > + struct svc_fh *fhp,
> > + u64 offset,
> > + u32 len,
> > + loff_t start,
> > + ssize_t start_len,
> > + loff_t middle,
> > + ssize_t middle_len,
> > + loff_t end,
> > + ssize_t end_len),
> > + TP_ARGS(rqstp, fhp, offset, len, start, start_len, middle, middle_len, end, end_len),
> > + TP_STRUCT__entry(
> > + __field(u32, xid)
> > + __field(u32, fh_hash)
> > + __field(u64, offset)
> > + __field(u32, len)
> > + __field(loff_t, start)
> > + __field(ssize_t, start_len)
> > + __field(loff_t, middle)
> > + __field(ssize_t, middle_len)
> > + __field(loff_t, end)
> > + __field(ssize_t, end_len)
> > + ),
> > + TP_fast_assign(
> > + __entry->xid = be32_to_cpu(rqstp->rq_xid);
> > + __entry->fh_hash = knfsd_fh_hash(&fhp->fh_handle);
> > + __entry->offset = offset;
> > + __entry->len = len;
> > + __entry->start = start;
> > + __entry->start_len = start_len;
> > + __entry->middle = middle;
> > + __entry->middle_len = middle_len;
> > + __entry->end = end;
> > + __entry->end_len = end_len;
> > + ),
> > + TP_printk("xid=0x%08x fh_hash=0x%08x offset=%llu len=%u start=%llu+%lu middle=%llu+%lu end=%llu+%lu",
> > + __entry->xid, __entry->fh_hash,
> > + __entry->offset, __entry->len,
> > + __entry->start, __entry->start_len,
> > + __entry->middle, __entry->middle_len,
> > + __entry->end, __entry->end_len)
> > +)
> > +
> > +#define DEFINE_NFSD_ANALYZE_DIO_EVENT(name) \
> > +DEFINE_EVENT(nfsd_analyze_dio_class, nfsd_analyze_##name##_dio, \
> > + TP_PROTO(struct svc_rqst *rqstp, \
> > + struct svc_fh *fhp, \
> > + u64 offset, \
> > + u32 len, \
> > + loff_t start, \
> > + ssize_t start_len, \
> > + loff_t middle, \
> > + ssize_t middle_len, \
> > + loff_t end, \
> > + ssize_t end_len), \
> > + TP_ARGS(rqstp, fhp, offset, len, start, start_len, middle, middle_len, end, end_len))
> > +
> > +DEFINE_NFSD_ANALYZE_DIO_EVENT(read);
> > +DEFINE_NFSD_ANALYZE_DIO_EVENT(write);
> > +
>
> Just a random thought: usually I add new trace points at the end of
> the series to keep the deeper patches smaller.
OK, will do.
> > DECLARE_EVENT_CLASS(nfsd_err_class,
> > TP_PROTO(struct svc_rqst *rqstp,
> > struct svc_fh *fhp,
> > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > index 5768244c7a3c3..be083a8812717 100644
> > --- a/fs/nfsd/vfs.c
> > +++ b/fs/nfsd/vfs.c
> > @@ -19,6 +19,7 @@
> > #include <linux/splice.h>
> > #include <linux/falloc.h>
> > #include <linux/fcntl.h>
> > +#include <linux/math.h>
> > #include <linux/namei.h>
> > #include <linux/delay.h>
> > #include <linux/fsnotify.h>
> > @@ -1073,6 +1074,116 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> > }
> >
> > +struct nfsd_read_dio {
> > + loff_t start;
> > + loff_t end;
> > + unsigned long start_extra;
> > + unsigned long end_extra;
> > + struct page *start_extra_page;
> > +};
> > +
> > +static void init_nfsd_read_dio(struct nfsd_read_dio *read_dio)
> > +{
> > + memset(read_dio, 0, sizeof(*read_dio));
> > + read_dio->start_extra_page = NULL;
> > +}
> > +
> > +static bool nfsd_analyze_read_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > + struct nfsd_file *nf, loff_t offset,
> > + unsigned long len, unsigned int base,
> > + struct nfsd_read_dio *read_dio)
> > +{
> > + const u32 dio_blocksize = nf->nf_dio_read_offset_align;
> > + loff_t middle_end, orig_end = offset + len;
> > +
> > + if (WARN_ONCE(!nf->nf_dio_mem_align || !nf->nf_dio_read_offset_align,
> > + "%s: underlying filesystem has not provided DIO alignment info\n",
> > + __func__))
> > + return false;
> > + if (WARN_ONCE(dio_blocksize > PAGE_SIZE,
> > + "%s: underlying storage's dio_blocksize=%u > PAGE_SIZE=%lu\n",
> > + __func__, dio_blocksize, PAGE_SIZE))
> > + return false;
> > +
> > + /* Return early if IO is irreparably misaligned
> > + * (len < PAGE_SIZE, or base not aligned).
> > + */
> > + if (unlikely(len < dio_blocksize) ||
> > + ((base & (nf->nf_dio_mem_align-1)) != 0))
> > + return false;
> > +
> > + read_dio->start = round_down(offset, dio_blocksize);
> > + read_dio->end = round_up(orig_end, dio_blocksize);
> > + read_dio->start_extra = offset - read_dio->start;
> > + read_dio->end_extra = read_dio->end - orig_end;
> > +
> > + /* don't expand READ for IO less than 32K */
>
> The code already says "don't expand READ for IO less than 32K". The
> comment needs to explain why. Move the rationale from the patch
> description to this comment, maybe?
>
>
> > + if ((read_dio->start_extra || read_dio->end_extra) && (len < (32 << 10))) {
> > + init_nfsd_read_dio(read_dio);
> > + return false;
> > + }
>
> Nit: Let's replace the raw integer with a symbolic constant. But let's
> resist the urge to expose this as a tunable for now ;-)
Ack to both.
> > +
> > + if (read_dio->start_extra) {
> > + read_dio->start_extra_page = alloc_page(GFP_KERNEL);
> > + if (WARN_ONCE(read_dio->start_extra_page == NULL,
> > + "%s: Unable to allocate start_extra_page\n", __func__)) {
> > + init_nfsd_read_dio(read_dio);
> > + return false;
> > + }
> > + }
> > +
> > + /* Show original offset and count, and how it was expanded for DIO */
> > + middle_end = read_dio->end - read_dio->end_extra;
> > + trace_nfsd_analyze_read_dio(rqstp, fhp, offset, len,
> > + read_dio->start, read_dio->start_extra,
> > + offset, (middle_end - offset),
> > + middle_end, read_dio->end_extra);
> > + return true;
> > +}
> > +
> > +static ssize_t nfsd_complete_misaligned_read_dio(struct svc_rqst *rqstp,
> > + struct nfsd_read_dio *read_dio,
> > + ssize_t bytes_read,
> > + unsigned long bytes_expected,
> > + loff_t *offset,
> > + unsigned long *rq_bvec_numpages)
> > +{
> > + ssize_t host_err = bytes_read;
> > + loff_t v;
> > +
> > + /* If nfsd_analyze_read_dio() allocated a start_extra_page it must
> > + * be removed from rqstp->rq_bvec[] to avoid returning unwanted data.
> > + */
> > + if (read_dio->start_extra_page) {
> > + __free_page(read_dio->start_extra_page);
> > + *rq_bvec_numpages -= 1;
> > + v = *rq_bvec_numpages;
> > + memmove(rqstp->rq_bvec, rqstp->rq_bvec + 1,
> > + v * sizeof(struct bio_vec));
> > + }
> > + /* Eliminate any end_extra bytes from the last page */
> > + v = *rq_bvec_numpages;
> > + rqstp->rq_bvec[v].bv_len -= read_dio->end_extra;
> > +
> > + if (host_err < 0)
> > + return host_err;
> > +
> > + /* nfsd_analyze_read_dio() may have expanded the start and end,
> > + * if so adjust returned read size to reflect original extent.
> > + */
> > + *offset += read_dio->start_extra;
> > + if (likely(host_err >= read_dio->start_extra)) {
> > + host_err -= read_dio->start_extra;
> > + if (host_err > bytes_expected)
> > + host_err = bytes_expected;
> > + } else {
> > + /* Short read that didn't read any of requested data */
> > + host_err = 0;
> > + }
> > +
> > + return host_err;
> > +}
> > +
> > /**
> > * nfsd_iter_read - Perform a VFS read using an iterator
> > * @rqstp: RPC transaction context
> > @@ -1094,26 +1205,49 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > unsigned int base, u32 *eof)
> > {
> > struct file *file = nf->nf_file;
> > - unsigned long v, total;
> > + unsigned long v, total, in_count = *count;
> > + struct nfsd_read_dio read_dio;
> > struct iov_iter iter;
> > struct kiocb kiocb;
> > - ssize_t host_err;
> > + ssize_t host_err = 0;
> > size_t len;
> >
> > + init_nfsd_read_dio(&read_dio);
>
> Initialize only if direct I/O is in use. I think all this new read code
> needs the same treatment as the write path: move the handling of the
> esoteric I/O types out of the hot (buffered) path.
Will try to pull that off, but the read path needs a bit more
branching, etc. As you mentioned, splice is already the default so
nfsd_iter_read() theoretically afforded some additional lattitude
_but_ I don't disagree with the ideal of being as light as possible.
Took me a solid day to refactor the WRITE side, so this will slip
until next week.
Thanks,
Mike
next prev parent reply other threads:[~2025-08-08 18:19 UTC|newest]
Thread overview: 25+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-08-07 16:25 [PATCH v5 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
2025-08-07 16:25 ` [PATCH v5 1/7] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
2025-08-08 11:49 ` Jeff Layton
2025-08-07 16:25 ` [PATCH v5 2/7] NFSD: pass nfsd_file to nfsd_iter_read() Mike Snitzer
2025-08-08 11:51 ` Jeff Layton
2025-08-07 16:25 ` [PATCH v5 3/7] NFSD: add io_cache_read controls to debugfs interface Mike Snitzer
2025-08-08 12:05 ` Jeff Layton
2025-08-08 12:05 ` Jeff Layton
2025-08-08 17:58 ` Chuck Lever
2025-08-07 16:25 ` [PATCH v5 4/7] NFSD: add io_cache_write " Mike Snitzer
2025-08-08 12:03 ` Jeff Layton
2025-08-08 17:58 ` Chuck Lever
2025-08-08 18:10 ` Mike Snitzer
2025-08-07 16:25 ` [PATCH v5 5/7] NFSD: filecache: only get DIO alignment attrs if NFSD_IO_DIRECT enabled Mike Snitzer
2025-08-08 12:05 ` Jeff Layton
2025-08-08 17:59 ` Chuck Lever
2025-08-08 18:12 ` Mike Snitzer
2025-08-07 16:25 ` [PATCH v5 6/7] NFSD: issue READs using O_DIRECT even if IO is misaligned Mike Snitzer
2025-08-08 12:16 ` Jeff Layton
2025-08-08 17:59 ` Chuck Lever
2025-08-08 18:19 ` Mike Snitzer [this message]
2025-08-07 16:25 ` [PATCH v5 7/7] NFSD: issue WRITEs " Mike Snitzer
2025-08-08 2:30 ` Mike Snitzer
2025-08-08 14:21 ` Chuck Lever
2025-08-08 12:20 ` Jeff Layton
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aJY_unu7uL7h3Q4z@kernel.org \
--to=snitzer@kernel.org \
--cc=chuck.lever@oracle.com \
--cc=jlayton@kernel.org \
--cc=linux-nfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).