From: Jeff Layton <jlayton@kernel.org>
To: Mike Snitzer <snitzer@kernel.org>, Chuck Lever <chuck.lever@oracle.com>
Cc: linux-nfs@vger.kernel.org
Subject: Re: [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned
Date: Thu, 28 Aug 2025 12:22:43 -0400 [thread overview]
Message-ID: <45ccb9b479b2c5af09755c95f3dbd5b29db4370a.camel@kernel.org> (raw)
In-Reply-To: <aK9fZR7pQxrosEfW@kernel.org>
On Wed, 2025-08-27 at 15:41 -0400, Mike Snitzer wrote:
> On Wed, Aug 27, 2025 at 11:34:03AM -0400, Chuck Lever wrote:
> > On 8/26/25 2:57 PM, Mike Snitzer wrote:
> > > If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
> > > DIO-aligned block (on either end of the READ). The expanded READ is
> > > verified to have proper offset/len (logical_block_size) and
> > > dma_alignment checking.
> > >
> > > Must allocate and use a bounce-buffer page (called 'start_extra_page')
> > > if/when expanding the misaligned READ requires reading extra partial
> > > page at the start of the READ so that its DIO-aligned. Otherwise that
> > > extra page at the start will make its way back to the NFS client and
> > > corruption will occur. As found, and then this fix of using an extra
> > > page verified, using the 'dt' utility:
> > > dt of=/mnt/share1/dt_a.test passes=1 bs=47008 count=2 \
> > > iotype=sequential pattern=iot onerr=abort oncerr=abort
> > > see: https://github.com/RobinTMiller/dt.git
> > >
> > > Any misaligned READ that is less than 32K won't be expanded to be
> > > DIO-aligned (this heuristic just avoids excess work, like allocating
> > > start_extra_page, for smaller IO that can generally already perform
> > > well using buffered IO).
> > >
> > > Suggested-by: Jeff Layton <jlayton@kernel.org>
> > > Suggested-by: Chuck Lever <chuck.lever@oracle.com>
> > > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > > Reviewed-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > > fs/nfsd/vfs.c | 200 +++++++++++++++++++++++++++++++++++--
> > > include/linux/sunrpc/svc.h | 5 +-
> > > 2 files changed, 194 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > > index c340708fbab4d..64732dc8985d6 100644
> > > --- a/fs/nfsd/vfs.c
> > > +++ b/fs/nfsd/vfs.c
> > > @@ -19,6 +19,7 @@
> > > #include <linux/splice.h>
> > > #include <linux/falloc.h>
> > > #include <linux/fcntl.h>
> > > +#include <linux/math.h>
> > > #include <linux/namei.h>
> > > #include <linux/delay.h>
> > > #include <linux/fsnotify.h>
> > > @@ -1073,6 +1074,153 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > > return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> > > }
> > >
> > > +struct nfsd_read_dio {
> > > + loff_t start;
> > > + loff_t end;
> > > + unsigned long start_extra;
> > > + unsigned long end_extra;
> > > + struct page *start_extra_page;
> > > +};
> > > +
> > > +static void init_nfsd_read_dio(struct nfsd_read_dio *read_dio)
> > > +{
> > > + memset(read_dio, 0, sizeof(*read_dio));
> > > + read_dio->start_extra_page = NULL;
> > > +}
> > > +
> > > +#define NFSD_READ_DIO_MIN_KB (32 << 10)
> > > +
> > > +static bool nfsd_analyze_read_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > > + struct nfsd_file *nf, loff_t offset,
> > > + unsigned long len, unsigned int base,
> > > + struct nfsd_read_dio *read_dio)
> > > +{
> > > + const u32 dio_blocksize = nf->nf_dio_read_offset_align;
> > > + loff_t middle_end, orig_end = offset + len;
> > > +
> > > + if (WARN_ONCE(!nf->nf_dio_mem_align || !nf->nf_dio_read_offset_align,
> > > + "%s: underlying filesystem has not provided DIO alignment info\n",
> > > + __func__))
> > > + return false;
> > > + if (WARN_ONCE(dio_blocksize > PAGE_SIZE,
> > > + "%s: underlying storage's dio_blocksize=%u > PAGE_SIZE=%lu\n",
> > > + __func__, dio_blocksize, PAGE_SIZE))
> > > + return false;
> >
> > IMHO these checks do not warrant a WARN. Perhaps a trace event, instead?
>
> I won't die on this hill, I just don't see the risk of these given
> they are highly unlikely ("famous last words").
>
> But if they trigger we should surely be made aware immediately. Not
> only if someone happens to have a trace event enabled (which would
> only happen with further support and engineering involvement to chase
> "why isn't O_DIRECT being used like NFSD was optionally configured
> to!?").
>
A kernel log message in this case makes sense to me, since it is a
(minor) administrative issue. WARN_ONCE() is going to throw a big,
scary stack trace, however that won't be terribly useful. We'll get hit
with bug reports from it for years.
Maybe pr_notice_once() for this? Or, maybe a pr_notice_once, but do it
on a per-export basis?
> > > + /* Return early if IO is irreparably misaligned (len < PAGE_SIZE,
> > > + * or base not aligned).
> > > + * Ondisk alignment is implied by the following code that expands
> > > + * misaligned IO to have a DIO-aligned offset and len.
> > > + */
> > > + if (unlikely(len < dio_blocksize) || ((base & (nf->nf_dio_mem_align-1)) != 0))
> > > + return false;
> > > +
> > > + init_nfsd_read_dio(read_dio);
> > > +
> > > + read_dio->start = round_down(offset, dio_blocksize);
> > > + read_dio->end = round_up(orig_end, dio_blocksize);
> > > + read_dio->start_extra = offset - read_dio->start;
> > > + read_dio->end_extra = read_dio->end - orig_end;
> > > +
> > > + /*
> > > + * Any misaligned READ less than NFSD_READ_DIO_MIN_KB won't be expanded
> > > + * to be DIO-aligned (this heuristic avoids excess work, like allocating
> > > + * start_extra_page, for smaller IO that can generally already perform
> > > + * well using buffered IO).
> > > + */
> > > + if ((read_dio->start_extra || read_dio->end_extra) &&
> > > + (len < NFSD_READ_DIO_MIN_KB)) {
> > > + init_nfsd_read_dio(read_dio);
> > > + return false;
> > > + }
> > > +
> > > + if (read_dio->start_extra) {
> > > + read_dio->start_extra_page = alloc_page(GFP_KERNEL);
> >
> > This introduces a page allocation where there weren't any before. For
> > NFSD, I/O pages come from rqstp->rq_pages[] so that memory allocation
> > like this is not needed on an I/O path.
>
> NFSD never supported DIO before. Yes, with this patch there is
> a single page allocation in the misaligned DIO READ path (if it
> requires reading extra before the client requested data starts).
>
> I tried to succinctly explain the need for the extra page allocation
> for misaligned DIO READ in this patch's header (in 2nd paragraph
> of the above header).
>
> I cannot see how to read extra, not requested by the client, into the
> head of rq_pages without causing serious problems. So that cannot be
> what you're saying needed.
>
> > So I think the answer to this is that I want you to implement reading
> > an entire aligned range from the file and then forming the NFS READ
> > response with only the range of bytes that the client requested, as we
> > discussed before.
>
> That is what I'm doing. But you're taking issue with my implementation
> that uses a single extra page.
>
> > The use of xdr_buf and bvec should make that quite
> > straightforward.
>
> Is your suggestion to, rather than allocate a disjoint single page,
> borrow the extra page from the end of rq_pages? Just map it into the
> bvec instead of my extra page?
>
> > This should make the aligned and unaligned cases nearly identical and
> > much less fraught.
>
> Regardless of which memory used to read the extra data, I don't see
> how the care I've taken to read extra but hide that fact from the
> client can be avoided. So the pre/post misaligned DIO READ code won't
> change a whole lot. But once I understand your suggestion better
> (after a clarifying response to this message) hopefully I'll see what
> you're saying.
>
> All said, this patchset is very important to me, I don't want it to
> miss v6.18 -- I'm still "in it to win it" but it feels like I do need
> your or others' help to pull this off.
>
> And/or is it possible to accept this initial implementation with
> mutual understanding that we must revisit your concern about my
> allocating an extra page for the misaligned DIO READ path?
>
> > > + if (WARN_ONCE(read_dio->start_extra_page == NULL,
> > > + "%s: Unable to allocate start_extra_page\n", __func__)) {
> > > + init_nfsd_read_dio(read_dio);
> > > + return false;
> > > + }
> > > + }
> > > +
> > > + return true;
> > > +}
> > > +
> > > +static ssize_t nfsd_complete_misaligned_read_dio(struct svc_rqst *rqstp,
> > > + struct nfsd_read_dio *read_dio,
> > > + ssize_t bytes_read,
> > > + unsigned long bytes_expected,
> > > + loff_t *offset,
> > > + unsigned long *rq_bvec_numpages)
> > > +{
> > > + ssize_t host_err = bytes_read;
> > > + loff_t v;
> > > +
> > > + if (!read_dio->start_extra && !read_dio->end_extra)
> > > + return host_err;
> > > +
> > > + /* If nfsd_analyze_read_dio() allocated a start_extra_page it must
> > > + * be removed from rqstp->rq_bvec[] to avoid returning unwanted data.
> > > + */
> > > + if (read_dio->start_extra_page) {
> > > + __free_page(read_dio->start_extra_page);
> > > + *rq_bvec_numpages -= 1;
> > > + v = *rq_bvec_numpages;
> > > + memmove(rqstp->rq_bvec, rqstp->rq_bvec + 1,
> > > + v * sizeof(struct bio_vec));
> > > + }
> > > + /* Eliminate any end_extra bytes from the last page */
> > > + v = *rq_bvec_numpages;
> > > + rqstp->rq_bvec[v].bv_len -= read_dio->end_extra;
> > > +
> > > + if (host_err < 0) {
> > > + /* Underlying FS will return -EINVAL if misaligned
> > > + * DIO is attempted because it shouldn't be.
> > > + */
> > > + WARN_ON_ONCE(host_err == -EINVAL);
> > > + return host_err;
> > > + }
> > > +
> > > + /* nfsd_analyze_read_dio() may have expanded the start and end,
> > > + * if so adjust returned read size to reflect original extent.
> > > + */
> > > + *offset += read_dio->start_extra;
> > > + if (likely(host_err >= read_dio->start_extra)) {
> > > + host_err -= read_dio->start_extra;
> > > + if (host_err > bytes_expected)
> > > + host_err = bytes_expected;
> > > + } else {
> > > + /* Short read that didn't read any of requested data */
> > > + host_err = 0;
> > > + }
> > > +
> > > + return host_err;
> > > +}
> > > +
> > > +static bool nfsd_iov_iter_aligned_bvec(const struct iov_iter *i,
> > > + unsigned addr_mask, unsigned len_mask)
> > > +{
> > > + const struct bio_vec *bvec = i->bvec;
> > > + unsigned skip = i->iov_offset;
> > > + size_t size = i->count;
> >
> > checkpatch.pl is complaining about the use of "unsigned" rather than
> > "unsigned int".
>
> OK.
>
> > > +
> > > + if (size & len_mask)
> > > + return false;
> > > + do {
> > > + size_t len = bvec->bv_len;
> > > +
> > > + if (len > size)
> > > + len = size;
> > > + if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
> > > + return false;
> > > + bvec++;
> > > + size -= len;
> > > + skip = 0;
> > > + } while (size);
> > > +
> > > + return true;
> > > +}
> > > +
> > > /**
> > > * nfsd_iter_read - Perform a VFS read using an iterator
> > > * @rqstp: RPC transaction context
> > > @@ -1094,7 +1242,8 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > > unsigned int base, u32 *eof)
> > > {
> > > struct file *file = nf->nf_file;
> > > - unsigned long v, total;
> > > + unsigned long v, total, in_count = *count;
> > > + struct nfsd_read_dio read_dio;
> > > struct iov_iter iter;
> > > struct kiocb kiocb;
> > > ssize_t host_err;
> > > @@ -1102,13 +1251,34 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > >
> > > init_sync_kiocb(&kiocb, file);
> > >
> > > + v = 0;
> > > + total = in_count;
> > > +
> > > switch (nfsd_io_cache_read) {
> > > case NFSD_IO_DIRECT:
> > > - /* Verify ondisk and memory DIO alignment */
> > > - if (nf->nf_dio_mem_align && nf->nf_dio_read_offset_align &&
> > > - (((offset | *count) & (nf->nf_dio_read_offset_align - 1)) == 0) &&
> > > - (base & (nf->nf_dio_mem_align - 1)) == 0)
> > > - kiocb.ki_flags = IOCB_DIRECT;
> > > + /*
> > > + * If NFSD_IO_DIRECT enabled, expand any misaligned READ to
> > > + * the next DIO-aligned block (on either end of the READ).
> > > + */
> > > + if (nfsd_analyze_read_dio(rqstp, fhp, nf, offset,
> > > + in_count, base, &read_dio)) {
> > > + /* trace_nfsd_read_vector() will reflect larger
> > > + * DIO-aligned READ.
> > > + */
> > > + offset = read_dio.start;
> > > + in_count = read_dio.end - offset;
> > > + total = in_count;
> > > +
> > > + kiocb.ki_flags |= IOCB_DIRECT;
> > > + if (read_dio.start_extra) {
> > > + len = read_dio.start_extra;
> > > + bvec_set_page(&rqstp->rq_bvec[v],
> > > + read_dio.start_extra_page,
> > > + len, PAGE_SIZE - len);
> > > + total -= len;
> > > + ++v;
> > > + }
> > > + }
> > > break;
> > > case NFSD_IO_DONTCACHE:
> > > kiocb.ki_flags = IOCB_DONTCACHE;
> > > @@ -1120,8 +1290,6 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > >
> > > kiocb.ki_pos = offset;
> > >
> > > - v = 0;
> > > - total = *count;
> > > while (total) {
> > > len = min_t(size_t, total, PAGE_SIZE - base);
> > > bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++),
> > > @@ -1132,9 +1300,21 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > > }
> > > WARN_ON_ONCE(v > rqstp->rq_maxpages);
> > >
> > > - trace_nfsd_read_vector(rqstp, fhp, offset, *count);
> > > - iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, *count);
> > > + trace_nfsd_read_vector(rqstp, fhp, offset, in_count);
> > > + iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, in_count);
> > > +
> > > + if ((kiocb.ki_flags & IOCB_DIRECT) &&
> > > + !nfsd_iov_iter_aligned_bvec(&iter, nf->nf_dio_mem_align-1,
> > > + nf->nf_dio_read_offset_align-1))
> > > + kiocb.ki_flags &= ~IOCB_DIRECT;
> > > +
> > > host_err = vfs_iocb_iter_read(file, &kiocb, &iter);
> > > +
> > > + if (in_count != *count) {
> > > + /* misaligned DIO expanded read to be DIO-aligned */
> > > + host_err = nfsd_complete_misaligned_read_dio(rqstp, &read_dio,
> > > + host_err, *count, &offset, &v);
> > > + }
> > > return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> > > }
> > >
> > > diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> > > index e64ab444e0a7f..190c2667500e2 100644
> > > --- a/include/linux/sunrpc/svc.h
> > > +++ b/include/linux/sunrpc/svc.h
> > > @@ -163,10 +163,13 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> > > * pages, one for the request, and one for the reply.
> > > * nfsd_splice_actor() might need an extra page when a READ payload
> > > * is not page-aligned.
> > > + * nfsd_iter_read() might need two extra pages when a READ payload
> > > + * is not DIO-aligned -- but nfsd_iter_read() and nfsd_splice_actor()
> > > + * are mutually exclusive (so reuse page reserved for nfsd_splice_actor).
> > > */
> > > static inline unsigned long svc_serv_maxpages(const struct svc_serv *serv)
> > > {
> > > - return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1;
> > > + return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1 + 1;
> > > }
> > >
> > > /*
> >
> > To properly evaluate the impact of using direct I/O for reads with real
> > world user workloads, we will want to identify (or construct) some
> > metrics (and this is future work, but near-term future).
> >
> > Seems like allocating memory becomes difficult only when too many pages
> > are dirty. I am skeptical that the issue is due to read caching, since
> > clean pages in the page cache are pretty easy to evict quickly, AIUI. If
> > that's incorrect, I'd like to understand why.
>
> The much more problematic case is heavy WRITE workload with a working
> set that far exceeds system memory.
>
> But I agree it doesn't make a whole lot of sense that clean pages in
> the page cache would be getting in the way. All I can tell you is
> that in my experience MM seems to _not_ evict them quickly (but more
> focused read-only testing is warranted to further understand the
> dynamics and heuristics in MM and beyond -- especially if/when
> READ-only then a pivot to a mix of heavy READ and WRITE or
> WRITE-only).
>
> NFSD using DIO is optional. I thought the point was to get it as an
> available option so that _others_ could experiment and help categorize
> the benefits/pitfalls further?
>
> I cannot be a one man show on all this. I welcome more help from
> anyone interested.
--
Jeff Layton <jlayton@kernel.org>
next prev parent reply other threads:[~2025-08-28 16:22 UTC|newest]
Thread overview: 42+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-08-26 18:57 [PATCH v8 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
2025-08-26 18:57 ` [PATCH v8 1/7] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
2025-08-26 18:57 ` [PATCH v8 2/7] NFSD: pass nfsd_file to nfsd_iter_read() Mike Snitzer
2025-08-26 18:57 ` [PATCH v8 3/7] NFSD: add io_cache_read controls to debugfs interface Mike Snitzer
2025-09-03 14:38 ` Chuck Lever
2025-09-03 15:07 ` Mike Snitzer
2025-09-03 16:02 ` Mike Snitzer
2025-09-03 16:12 ` Chuck Lever
2025-09-03 16:50 ` Mike Snitzer
2025-08-26 18:57 ` [PATCH v8 4/7] NFSD: add io_cache_write " Mike Snitzer
2025-08-26 18:57 ` [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned Mike Snitzer
2025-08-27 15:34 ` Chuck Lever
2025-08-27 19:41 ` Mike Snitzer
2025-08-27 20:56 ` Chuck Lever
2025-08-27 23:15 ` Mike Snitzer
2025-08-28 1:57 ` Chuck Lever
2025-08-28 8:09 ` Mike Snitzer
2025-08-28 14:53 ` Chuck Lever
2025-08-28 18:52 ` Mike Snitzer
2025-08-30 17:38 ` [RFC PATCH 0/2] some progress on rpcrdma bug [was: Re: [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned] Mike Snitzer
2025-08-30 17:38 ` [RFC PATCH 1/2] NFSD: fix misaligned DIO READ to not use a start_extra_page, exposes rpcrdma bug? Mike Snitzer
2025-09-02 14:04 ` Chuck Lever
2025-09-02 15:56 ` Chuck Lever
2025-09-02 17:59 ` Chuck Lever
2025-09-02 21:06 ` Mike Snitzer
2025-09-02 21:16 ` Chuck Lever
2025-09-02 21:27 ` Mike Snitzer
2025-09-02 22:18 ` Mike Snitzer
2025-09-04 19:07 ` Chuck Lever
2025-09-04 21:00 ` Mike Snitzer
2025-09-04 14:42 ` Mike Snitzer
2025-09-04 15:12 ` Chuck Lever
2025-09-04 16:10 ` Chuck Lever
2025-09-04 16:33 ` Mike Snitzer
2025-09-04 17:54 ` Chuck Lever
2025-08-30 17:38 ` [RFC PATCH 2/2] NFSD: use /end/ of rq_pages for front_pad page, simpler workaround for rpcrdma bug Mike Snitzer
2025-08-30 18:53 ` [RFC PATCH 0/2] some progress on rpcrdma bug [was: Re: [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned] Mike Snitzer
2025-08-28 16:36 ` [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned Jeff Layton
2025-08-28 16:22 ` Jeff Layton [this message]
2025-08-28 16:27 ` Chuck Lever
2025-08-26 18:57 ` [PATCH v8 6/7] NFSD: issue WRITEs " Mike Snitzer
2025-08-26 18:57 ` [PATCH v8 7/7] NFSD: add nfsd_analyze_read_dio and nfsd_analyze_write_dio trace events Mike Snitzer
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=45ccb9b479b2c5af09755c95f3dbd5b29db4370a.camel@kernel.org \
--to=jlayton@kernel.org \
--cc=chuck.lever@oracle.com \
--cc=linux-nfs@vger.kernel.org \
--cc=snitzer@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox