From: Jeff Layton <jlayton@kernel.org>
To: Chuck Lever <chuck.lever@oracle.com>, Mike Snitzer <snitzer@kernel.org>
Cc: linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
Jens Axboe <axboe@kernel.dk>
Subject: Re: need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO]
Date: Thu, 12 Jun 2025 12:22:42 -0400
Message-ID: <f201c16677525288597becfd904d873931092cea.camel@kernel.org>
In-Reply-To: <d13ef7d6-0040-40ac-9761-922a1ec5d911@oracle.com>
On Thu, 2025-06-12 at 09:28 -0400, Chuck Lever wrote:
> On 6/12/25 6:28 AM, Jeff Layton wrote:
> > On Wed, 2025-06-11 at 17:36 -0400, Mike Snitzer wrote:
> > > On Wed, Jun 11, 2025 at 04:29:58PM -0400, Jeff Layton wrote:
> > > > On Wed, 2025-06-11 at 15:18 -0400, Mike Snitzer wrote:
> > > > > On Wed, Jun 11, 2025 at 10:31:20AM -0400, Chuck Lever wrote:
> > > > > > On 6/10/25 4:57 PM, Mike Snitzer wrote:
> > > > > > > Add 'enable-dontcache' to NFSD's debugfs interface so that any data
> > > > > > > read or written by NFSD will either not be cached (thanks to O_DIRECT)
> > > > > > > or will be removed from the page cache upon completion (DONTCACHE).
> > > > > >
> > > > > > I thought we were going to do two switches: One for reads and one for
> > > > > > writes? I could be misremembering.
> > > > >
> > > > > We did discuss the possibility of doing that. We still can, if
> > > > > that's what you'd prefer.
> > > > >
> > > >
> > > > Having them as separate controls in debugfs is fine for
> > > > experimentation's sake, but I imagine we'll need to be all-in one way
> > > > or the other with a real interface.
> > > >
> > > > I think if we can crack the problem of receiving WRITE payloads into an
> > > > already-aligned buffer, then that becomes much more feasible. I think
> > > > that's a solvable problem.
> > >
> > > You'd immediately be my hero! Let's get into it:
> > >
> > > In a previous reply to this thread you aptly detailed what I found
> > > out the hard way (with too much xdr_buf code review and tracing):
> > >
> > > On Wed, Jun 11, 2025 at 08:55:20AM -0400, Jeff Layton wrote:
> > > > >
> > > > > NFSD will also set RWF_DIRECT if a WRITE's IO is aligned relative to
> > > > > DIO alignment (both page and disk alignment). This works quite well
> > > > > for aligned WRITE IO with SUNRPC's RDMA transport as-is, because it
> > > > > maps the WRITE payload into aligned pages. But more work is needed to
> > > > > be able to leverage O_DIRECT when SUNRPC's regular TCP transport is
> > > > > used. I spent quite a bit of time analyzing the existing xdr_buf code
> > > > > and NFSD's use of it. Unfortunately, the WRITE payload gets stored in
> > > > > misaligned pages such that O_DIRECT isn't possible without a copy
> > > > > (completely defeating the point). I'll reply to this cover letter to
> > > > > start a subthread to discuss how best to deal with misaligned write
> > > > > IO (by association with Hammerspace, I'm most interested in NFS v3).
> > > > >
> > > >
> > > > Tricky problem. svc_tcp_recvfrom() just slurps the whole RPC into the
> > > > rq_pages array. To get alignment right, you'd probably have to do the
> > > > receive in a much more piecemeal way.
> > > >
> > > > Basically, you'd need to decode as you receive chunks of the message,
> > > > and look out for WRITEs, and then set it up so that their payloads are
> > > > received with proper alignment.
> > >
> > > 1)
> > > Yes, and while I arrived at the exact same conclusion, I was left
> > > with dread about the potential for "breaking too many eggs to make
> > > that tasty omelette".
> > >
> > > If you (or others) see a way forward to have SUNRPC TCP's XDR receive
> > > decode "inline" (rather than the two-stage process you covered
> > > above), that'd be fantastic. It seems like really old tech debt in
> > > SUNRPC from a time when caring about the alignment of WRITE payload
> > > pages was completely off engineers' collective radar (owing to NFSD
> > > only using buffered IO, I assume?).
> > >
> > > 2)
> > > One hack that I verified to work for READ and WRITE IO on my
> > > particular TCP testbed was to front-pad the first "head" page of the
> > > xdr_buf such that the WRITE payload started at the 2nd page of
> > > rq_pages. It looked like this for my usage:
> > >
> > > diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> > > index 8fc5b2b2d806..cf082a265261 100644
> > > --- a/net/sunrpc/svc_xprt.c
> > > +++ b/net/sunrpc/svc_xprt.c
> > > @@ -676,7 +676,9 @@ static bool svc_alloc_arg(struct svc_rqst *rqstp)
> > >
> > > /* Make arg->head point to first page and arg->pages point to rest */
> > > arg->head[0].iov_base = page_address(rqstp->rq_pages[0]);
> > > - arg->head[0].iov_len = PAGE_SIZE;
> > > + // FIXME: front-pad optimized to align TCP's WRITE payload
> > > + // but may not be enough for other operations?
> > > + arg->head[0].iov_len = 148;
> > > arg->pages = rqstp->rq_pages + 1;
> > > arg->page_base = 0;
> > > /* save at least one page for response */
> > >
> > > That gut-feeling comment ("but may not be enough for other
> > > operations?") proved to be prophetic.
> > >
> > > Sadly it went on to fail spectacularly for other ops (specifically
> > > READDIR and READDIRPLUS; probably others would too) because
> > > xdr_inline_decode() _really_ doesn't like going beyond the end of the
> > > xdr_buf's inline "head" page. Even if xdr_inline_decode() et al were
> > > "fixed" (which isn't for the faint of heart, given xdr_buf's complex
> > > nature), other moles would likely pop up. In addition, we'd be
> > > wasting space in the xdr_buf's head page (PAGE_SIZE minus the
> > > front-pad). So I moved on from trying to see this front-pad hack
> > > through to completion.
> > >
> > > 3)
> > > Lastly, for completeness, I also mentioned this briefly in a recent
> > > reply:
> > >
> > > On Wed, Jun 11, 2025 at 04:51:03PM -0400, Mike Snitzer wrote:
> > > > On Wed, Jun 11, 2025 at 11:44:29AM -0400, Jeff Layton wrote:
> > > >
> > > > > In any case, for now at least, unless you're using RDMA, it's going to
> > > > > end up falling back to buffered writes everywhere. The data is almost
> > > > > never going to be properly aligned coming in off the wire. That might
> > > > > be fixable though.
> > > >
> > > > Ben Coddington mentioned to me that soft-iwarp would allow use of RDMA
> > > > over TCP to work around SUNRPC TCP's XDR handling always storing the
> > > > write payload in misaligned pages. But that's purely a stop-gap
> > > > workaround, which needs testing (to see whether soft-iwarp negates the
> > > > win of using O_DIRECT, etc).
> > >
> > > (Ab)using soft-iwarp as the basis for easily getting page-aligned TCP
> > > WRITE payloads seems pretty gross given we are chasing utmost
> > > performance, etc.
> > >
> > > All said, I welcome your sage advice and help on this effort to
> > > DIO-align SUNRPC TCP's WRITE payload pages.
> > >
> > > Thanks,
> > > Mike
> >
> > (Sent this to Mike only by accident yesterday -- resending to the full
> > list now)
> >
> > I've been looking over the code today. Basically, I think we need to
> > have svc_tcp_recvfrom() receive in phases. At a high level:
> >
> > 1/ receive the record marker (just like it does today)
> >
> > 2/ receive enough for the RPC header and then decode it.
> >
> > 3/ Use the RPC program and version from the decoded header to look up
> > the svc_program. Add an optional pg_tcp_recvfrom callback to that
> > structure that will receive the rest of the data into the buffer. If
> > pg_tcp_recvfrom isn't set, then just call svc_tcp_read_msg() like we
> > do today (see the sketch below).
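> >
> > Very roughly, the dispatch could look like this (a sketch only:
> > pg_tcp_recvfrom is the new hook proposed above, svc_find_program() is
> > a made-up name for the program lookup, and the svc_tcp_read_msg()
> > arguments are hand-waved):
> >
> >     /* in svc_tcp_recvfrom(), once the RPC header has been decoded */
> >     struct svc_program *progp = svc_find_program(serv, prog, vers);
> >
> >     if (progp && progp->pg_tcp_recvfrom)
> >             len = progp->pg_tcp_recvfrom(rqstp); /* program-aware receive */
> >     else
> >             len = svc_tcp_read_msg(rqstp, ...);  /* today's behavior */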
>
> The layering violations here are mind-blowing.
>
Aww. I don't think it's too bad.
>
> > For NFSv3, pg_tcp_recvfrom can just look at the procedure. If it's
> > anything but a WRITE we'll just do what we do today
> > (svc_tcp_read_msg()).
> >
> > For a WRITE, we'll receive the first part of the WRITE3args (everything
> > but the data) into rq_pages, and decode it. We can then use that info
> > to figure out the alignment. Advance to the next page in rq_pages, and
> > then to the point where the data is properly aligned. Do the receive
> > into that spot.
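> >
> > Picking the landing spot could be as simple as this (illustrative
> > helper, not existing code; it assumes that matching the payload's
> > in-page offset to the write's in-file offset satisfies the device's
> > DIO memory-alignment requirement):
> >
> >     /*
> >      * Land the payload at the same offset within the page as the
> >      * WRITE's offset within the file, so that a block-aligned WRITE
> >      * ends up at least block-aligned in memory.
> >      */
> >     static size_t nfsd_dio_payload_offset(u64 wr_offset)
> >     {
> >             return wr_offset & (PAGE_SIZE - 1);
> >     }
> >
> > ...and then pointing the receive iovec at
> > page_address(rqstp->rq_pages[n]) plus that offset.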
> >
> > Then we just add an RQ_ALIGNED_DATA flag to rqstp->rq_flags, and teach
> > nfsd3_proc_write how to find the data and do a DIO write when it's set.
> >
> > Unaligned writes are still a problem though. If two WRITE RPCs come in
> > for different parts of the same block at the same time, then you could
> > end up losing the result of the first write. I don't see a way to make
> > that non-racy.
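> >
> > Concretely (hypothetical 4k DIO block, two racing 2k WRITEs to
> > opposite halves, each padded out via read-modify-write):
> >
> >     A: read block        (sees old bytes 2048-4095)
> >     B: read block        (sees old bytes 0-2047)
> >     A: write [new 0-2047 | old 2048-4095]
> >     B: write [old 0-2047 | new 2048-4095]  <- A's data is lost
> >
> > Whichever RMW lands second resurrects stale data in the half it
> > didn't own.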
> >
> > NFSv4 will also be a bit of a challenge. We'll need to receive the
> > whole compound one operation at a time. If we hit a WRITE, then we can
> > just do the same thing that we do for v3 to align the data.
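> >
> > The shape of it would be something like this (a very rough sketch;
> > every helper name here is hypothetical, and discovering each op's
> > length on the wire is the hard part):
> >
> >     /* walk the compound, receiving one op at a time */
> >     for (i = 0; i < argop_count; i++) {
> >             u32 op = svc_recv_peek_opnum(rqstp);    /* hypothetical */
> >
> >             if (op == OP_WRITE)
> >                     svc_recv_write_aligned(rqstp);  /* as for v3 */
> >             else
> >                     svc_recv_op_default(rqstp);     /* hypothetical */
> >     }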
> >
> > I'd probably aim to start with an implementation for v3, and then add
> > v4 support in a second phase.
> >
> > I'm interested in working on this. It'll be a fair bit of work though.
> > I'll need to think about how to break this up into manageable pieces.
>
> Bruce has been thinking about payload alignment schemes for at least
> ten years. My opinion has always been:
>
> - We have this already via RDMA, even over TCP
> - Any scheme like this will still not perform as well as RDMA
> - NFS/TCP is kind of a "works everywhere" technology that I prefer to
> not screw around with
> - The corner cases will be troubling us for many years
> - Only a handful of users will truly benefit from it
> - There are plenty of higher priority items on our to-do list.
>
If you're against the idea, I won't waste my time.

It would require some fairly hefty rejiggering of the receive code. The
v4 part would be pretty nightmarish to work out too, since you'd have to
decode the compound as you receive it to tell where the next op starts.
The potential for corruption with unaligned writes is also pretty
nasty.
--
Jeff Layton <jlayton@kernel.org>