Linux NFS development
 help / color / mirror / Atom feed
From: Mike Snitzer <snitzer@kernel.org>
To: Chuck Lever <chuck.lever@oracle.com>
Cc: linux-nfs@vger.kernel.org, Jeff Layton <jlayton@kernel.org>
Subject: Re: [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned
Date: Wed, 27 Aug 2025 19:15:06 -0400	[thread overview]
Message-ID: <aK-Reg6g8ccscwMu@kernel.org> (raw)
In-Reply-To: <6f5516a5-1954-4f77-8a07-dacba1fb570c@oracle.com>

On Wed, Aug 27, 2025 at 04:56:08PM -0400, Chuck Lever wrote:
> On 8/27/25 3:41 PM, Mike Snitzer wrote:
> > On Wed, Aug 27, 2025 at 11:34:03AM -0400, Chuck Lever wrote:
> >> On 8/26/25 2:57 PM, Mike Snitzer wrote:
> 
> >>> +	if (WARN_ONCE(!nf->nf_dio_mem_align || !nf->nf_dio_read_offset_align,
> >>> +		      "%s: underlying filesystem has not provided DIO alignment info\n",
> >>> +		      __func__))
> >>> +		return false;
> >>> +	if (WARN_ONCE(dio_blocksize > PAGE_SIZE,
> >>> +		      "%s: underlying storage's dio_blocksize=%u > PAGE_SIZE=%lu\n",
> >>> +		      __func__, dio_blocksize, PAGE_SIZE))
> >>> +		return false;
> >>
> >> IMHO these checks do not warrant a WARN. Perhaps a trace event, instead?
> > 
> > I won't die on this hill, I just don't see the risk of these given
> > they are highly unlikely ("famous last words").
> > 
> > But if they trigger we should surely be made aware immediately.  Not
> > only if someone happens to have a trace event enabled (which would
> > only happen with further support and engineering involvement to chase
> > "why isn't O_DIRECT being used like NFSD was optionally configured
> > to!?").
> A. It seems particularly inefficient to make this check for every I/O
>    rather than once per file system
> 
> B. Once the warning has fired for one file, it won't fire again, making
>    it pretty useless if there are multiple similar mismatches. You still
>    end up with "No direct I/O even though I flipped the switch, and I
>    can't tell why."

I've removed the WARN_ON_ONCEs for read and write.  These repeat
per-IO negative checks aren't ideal but they certainly aren't costly.

> >>> +	/* Return early if IO is irreparably misaligned (len < PAGE_SIZE,
> >>> +	 * or base not aligned).
> >>> +	 * Ondisk alignment is implied by the following code that expands
> >>> +	 * misaligned IO to have a DIO-aligned offset and len.
> >>> +	 */
> >>> +	if (unlikely(len < dio_blocksize) || ((base & (nf->nf_dio_mem_align-1)) != 0))
> >>> +		return false;
> >>> +
> >>> +	init_nfsd_read_dio(read_dio);
> >>> +
> >>> +	read_dio->start = round_down(offset, dio_blocksize);
> >>> +	read_dio->end = round_up(orig_end, dio_blocksize);
> >>> +	read_dio->start_extra = offset - read_dio->start;
> >>> +	read_dio->end_extra = read_dio->end - orig_end;
> >>> +
> >>> +	/*
> >>> +	 * Any misaligned READ less than NFSD_READ_DIO_MIN_KB won't be expanded
> >>> +	 * to be DIO-aligned (this heuristic avoids excess work, like allocating
> >>> +	 * start_extra_page, for smaller IO that can generally already perform
> >>> +	 * well using buffered IO).
> >>> +	 */
> >>> +	if ((read_dio->start_extra || read_dio->end_extra) &&
> >>> +	    (len < NFSD_READ_DIO_MIN_KB)) {
> >>> +		init_nfsd_read_dio(read_dio);
> >>> +		return false;
> >>> +	}
> >>> +
> >>> +	if (read_dio->start_extra) {
> >>> +		read_dio->start_extra_page = alloc_page(GFP_KERNEL);
> >>
> >> This introduces a page allocation where there weren't any before. For
> >> NFSD, I/O pages come from rqstp->rq_pages[] so that memory allocation
> >> like this is not needed on an I/O path.
> > 
> > NFSD never supported DIO before. Yes, with this patch there is
> > a single page allocation in the misaligned DIO READ path (if it
> > requires reading extra before the client requested data starts).
> > 
> > I tried to succinctly explain the need for the extra page allocation
> > for misaligned DIO READ in this patch's header (in 2nd paragraph
> > of the above header).
> > 
> > I cannot see how to read extra, not requested by the client, into the
> > head of rq_pages without causing serious problems. So that cannot be
> > what you're saying needed.
> > 
> >> So I think the answer to this is that I want you to implement reading
> >> an entire aligned range from the file and then forming the NFS READ
> >> response with only the range of bytes that the client requested, as we
> >> discussed before.
> > 
> > That is what I'm doing. But you're taking issue with my implementation
> > that uses a single extra page.
> > 
> >> The use of xdr_buf and bvec should make that quite
> >> straightforward.
> > 
> > Is your suggestion to, rather than allocate a disjoint single page,
> > borrow the extra page from the end of rq_pages? Just map it into the
> > bvec instead of my extra page?
> 
> Yes, the extra page needs to come from rq_pages. But I don't see why it
> should come from the /end/ of rq_pages.
> 
> - Extend the start of the byte range back to make it align with the
>   file's DIO alignment constraint
> 
> - Extend the end of the byte range forward to make it align with the
>   file's DIO alignment constraint

nfsd_analyze_read_dio() does that (start_extra and end_extra).

> - Fill in the sink buffer's bvec using pages from rq_pages, as usual
> 
> - When the I/O is complete, adjust the offset in the first bvec entry
>   forward by setting a non-zero page offset, and adjust the returned
>   count downward to match the requested byte count from the client

Tried it long ago, such bvec manipulation only works when not using
RDMA.  When the memory is remote, twiddling a local bvec isn't going
to ensure the correct pages have the correct data upon return to the
client.

RDMA is why the pages must be used in-place, and RDMA is also why
the extra page needed by this patch (for use as throwaway front-pad
for expanded misaligned DIO READ) must either be allocated _or_
hopefully it can be from rq_pages (after the end of the client
requested READ payload).

Or am I wrong and simply need to keep learning about NFSD's IO path?

> > NFSD using DIO is optional. I thought the point was to get it as an
> > available option so that _others_ could experiment and help categorize
> > the benefits/pitfalls further?
> 
> Yes, that is the point. But such experiments lose value if there is no
> data collection plan to go with them.

Each user runs something they care about performing well and they
measure the result.

Literally the same thing as has been done for anything in Linux since
it all started.  Nothing unicorn or bespoke here.

> If you would rather make this drive-by, then you'll have to realize
> that you are requesting more than simple review from us. You'll have
> to be content with the pace at which us overloaded maintainers can get
> to the work.

I think I just experienced the mailing-list equivalent of the Detroit
definition of "drive-by".  Good/bad news: you're a terrible shot.

  reply	other threads:[~2025-08-27 23:15 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-08-26 18:57 [PATCH v8 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
2025-08-26 18:57 ` [PATCH v8 1/7] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
2025-08-26 18:57 ` [PATCH v8 2/7] NFSD: pass nfsd_file to nfsd_iter_read() Mike Snitzer
2025-08-26 18:57 ` [PATCH v8 3/7] NFSD: add io_cache_read controls to debugfs interface Mike Snitzer
2025-09-03 14:38   ` Chuck Lever
2025-09-03 15:07     ` Mike Snitzer
2025-09-03 16:02       ` Mike Snitzer
2025-09-03 16:12         ` Chuck Lever
2025-09-03 16:50           ` Mike Snitzer
2025-08-26 18:57 ` [PATCH v8 4/7] NFSD: add io_cache_write " Mike Snitzer
2025-08-26 18:57 ` [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned Mike Snitzer
2025-08-27 15:34   ` Chuck Lever
2025-08-27 19:41     ` Mike Snitzer
2025-08-27 20:56       ` Chuck Lever
2025-08-27 23:15         ` Mike Snitzer [this message]
2025-08-28  1:57           ` Chuck Lever
2025-08-28  8:09             ` Mike Snitzer
2025-08-28 14:53               ` Chuck Lever
2025-08-28 18:52                 ` Mike Snitzer
2025-08-30 17:38                   ` [RFC PATCH 0/2] some progress on rpcrdma bug [was: Re: [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned] Mike Snitzer
2025-08-30 17:38                     ` [RFC PATCH 1/2] NFSD: fix misaligned DIO READ to not use a start_extra_page, exposes rpcrdma bug? Mike Snitzer
2025-09-02 14:04                       ` Chuck Lever
2025-09-02 15:56                       ` Chuck Lever
2025-09-02 17:59                         ` Chuck Lever
2025-09-02 21:06                           ` Mike Snitzer
2025-09-02 21:16                             ` Chuck Lever
2025-09-02 21:27                               ` Mike Snitzer
2025-09-02 22:18                                 ` Mike Snitzer
2025-09-04 19:07                                   ` Chuck Lever
2025-09-04 21:00                                     ` Mike Snitzer
2025-09-04 14:42                                 ` Mike Snitzer
2025-09-04 15:12                                   ` Chuck Lever
2025-09-04 16:10                                   ` Chuck Lever
2025-09-04 16:33                                     ` Mike Snitzer
2025-09-04 17:54                                       ` Chuck Lever
2025-08-30 17:38                     ` [RFC PATCH 2/2] NFSD: use /end/ of rq_pages for front_pad page, simpler workaround for rpcrdma bug Mike Snitzer
2025-08-30 18:53                     ` [RFC PATCH 0/2] some progress on rpcrdma bug [was: Re: [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned] Mike Snitzer
2025-08-28 16:36               ` [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned Jeff Layton
2025-08-28 16:22       ` Jeff Layton
2025-08-28 16:27         ` Chuck Lever
2025-08-26 18:57 ` [PATCH v8 6/7] NFSD: issue WRITEs " Mike Snitzer
2025-08-26 18:57 ` [PATCH v8 7/7] NFSD: add nfsd_analyze_read_dio and nfsd_analyze_write_dio trace events Mike Snitzer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aK-Reg6g8ccscwMu@kernel.org \
    --to=snitzer@kernel.org \
    --cc=chuck.lever@oracle.com \
    --cc=jlayton@kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox