From: Jeff Layton <jlayton@kernel.org>
To: Mike Snitzer <snitzer@kernel.org>, Christoph Hellwig <hch@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>,
linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
Jens Axboe <axboe@kernel.dk>,
david.flynn@hammerspace.com
Subject: Re: need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO]
Date: Fri, 13 Jun 2025 09:02:23 -0400 [thread overview]
Message-ID: <826d22214f01fc453a7e38953e2b8893073fcd46.camel@kernel.org> (raw)
In-Reply-To: <aEvuJP7_xhVk5R4S@kernel.org>
On Fri, 2025-06-13 at 05:23 -0400, Mike Snitzer wrote:
> On Thu, Jun 12, 2025 at 10:46:01PM -0700, Christoph Hellwig wrote:
> > On Thu, Jun 12, 2025 at 12:22:42PM -0400, Jeff Layton wrote:
> > > If you're against the idea, I won't waste my time.
> > >
> > > It would require some fairly hefty rejiggering of the receive code. The
> > > v4 part would be pretty nightmarish to work out too since you'd have to
> > > decode the compound as you receive to tell where the next op starts.
> > >
> > > The potential for corruption with unaligned writes is also pretty
> > > nasty.
> >
> > Maybe I'm missing an improvement to the receive buffer handling in modern
> > network hardware, but AFAIK this still would only help you to align the
> > sunrpc data buffer to page boundaries, but avoid the data copy from the
> > hardware receive buffer to the sunrpc data buffer as you still don't have
> > hardware header splitting.
>
> Correct, everything that Jeff detailed is about ensuring the WRITE
> payload is received into page aligned boundary.
>
> Which in practice has proven a hard requirement for O_DIRECT in my
> testing -- but I could be hitting some bizarre driver bug in my TCP
> testbed (which sadly sits ontop of older VMware guests/drivers).
>
> But if you looking at patch 5 in this series:
> https://lore.kernel.org/linux-nfs/20250610205737.63343-6-snitzer@kernel.org/
>
> I added fs/nfsd/vfs.c:is_dio_aligned(), which is basically a tweaked
> ditto of fs/btrfs/direct-io.c:check_direct_IO():
>
> static bool is_dio_aligned(const struct iov_iter *iter, loff_t offset,
> const u32 blocksize)
> {
> u32 blocksize_mask;
>
> if (!blocksize)
> return false;
>
> blocksize_mask = blocksize - 1;
> if ((offset & blocksize_mask) ||
> (iov_iter_alignment(iter) & blocksize_mask))
> return false;
>
> return true;
> }
>
> And fs/nfsd/vfs.c:nfsd_vfs_write() has (after my patch 5):
>
> nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
> iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
>
> if (nfsd_enable_dontcache) {
> if (is_dio_aligned(&iter, offset, nf->nf_dio_offset_align))
> flags |= RWF_DIRECT;
>
> What I found is that unless SUNRPC TPC stored the WRITE payload in a
> page-aligned boundary then iov_iter_alignment() would fail.
>
> The @payload arg above, with my SUNRPC TCP testing, was always offset
> 148 bytes into the first page of the pages allocated for xdr_buf's
> use, which is rqstp->rq_pages, which is allocated by
> net/sunrpc/svc_xprt.c:svc_alloc_arg().
>
> > And I don't even know what this is supposed to buy the nfs server.
> > Direct I/O writes need to have the proper file offset alignment, but as
> > far as Linux is concerned we don't require any memory alignment. Most
> > storage hardware has requirements for the memory alignment that we pass
> > on, but typically that's just a dword (4-byte) alignment, which matches
> > the alignment sunrpc wants for most XDR data structures anyway. So what
> > additional alignment is actually needed for support direct I/O writes
> > assuming that is the goal? (I might also simply misunderstand the
> > problem).
>
> THIS... this is the very precise question/detail I discussed with
> Hammerspace's CEO David Flynn when discussing Linux's O_DIRECT
> support. David shares your understanding and confusion. And all I
> could tell him is that in practice I always page-aligned my data
> buffers used to issue O_DIRECT. And that in this instance if I don't
> then O_DIRECT doesn't work (if I commented out the iov_iter_alignment
> check in is_dio_aligned above).
>
> But is that simply due to xdr_buf_to_bvec()'s use of bvec_set_virt()
> for xdr_buf "head" page (first page of rqstp->rg_pages)? Whereas you
> can see xdr_buf_to_bvec() uses bvec_set_page() to add each of the
> other pages that immediately follow the first "head" page.
>
> All said, if Linux can/should happily allow non-page-aligned DIO (and
> we only need to worry about the on-disk DIO alignment requirements)
> that'd be wonderful.
>
> Then its just a matter of finding where that is broken...
>
> Happy to dig into this further if you might nudge me in the right
> direction.
>
This is an excellent point. If the memory alignment doesn't matter,
then maybe it's enough to just receive the same way we do today and
just pad out to the correct blocksize in the bvec array if the data is
unaligned vs. the blocksize.
We still have the problem of how to do a proper RMW though to deal with
unaligned writes. A couple of possibilities come to mind:
1. nfsd could just return nfserr_inval when a write is unaligned and
the export is set up for DIO writes. IOW, just project the requirement
about alignment to the client. This might be the safest option, at
least initially. Unaligned writes are pretty uncommon. Most clients
will probably never hit the error.
2. What if we added a new "rmw_iter" operation to file_operations that
could be used for unaligned writes? XFS (for instance) could take the
i_rwsem exclusive, do DIO reads of the end blocks into bounce pages,
copy in the unaligned bits at the ends of the iter, do a DIO write and
release the lock. It'll be slow as hell, but it wouldn't be racy.
Mike, would you be amenable to option #1, at least initially? If we can
come up with a way to do unaligned writes safely, we could relax the
restriction later.
I'm only half serious about rmw_iter, but it does seem like that could
work.
--
Jeff Layton <jlayton@kernel.org>
next prev parent reply other threads:[~2025-06-13 13:02 UTC|newest]
Thread overview: 75+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-06-10 20:57 [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support Mike Snitzer
2025-06-10 20:57 ` [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO Mike Snitzer
2025-06-11 6:57 ` Christoph Hellwig
2025-06-11 10:44 ` Mike Snitzer
2025-06-11 13:04 ` Jeff Layton
2025-06-11 13:56 ` Chuck Lever
2025-06-11 14:31 ` Chuck Lever
2025-06-11 19:18 ` Mike Snitzer
2025-06-11 20:29 ` Jeff Layton
2025-06-11 21:36 ` need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO] Mike Snitzer
2025-06-12 10:28 ` Jeff Layton
2025-06-12 11:28 ` Jeff Layton
2025-06-12 13:28 ` Chuck Lever
2025-06-12 14:17 ` Benjamin Coddington
2025-06-12 15:56 ` Mike Snitzer
2025-06-12 15:58 ` Chuck Lever
2025-06-12 16:12 ` Mike Snitzer
2025-06-12 16:32 ` Chuck Lever
2025-06-13 5:39 ` Christoph Hellwig
2025-06-12 16:22 ` Jeff Layton
2025-06-13 5:46 ` Christoph Hellwig
2025-06-13 9:23 ` Mike Snitzer
2025-06-13 13:02 ` Jeff Layton [this message]
2025-06-16 12:35 ` Christoph Hellwig
2025-06-16 12:29 ` Christoph Hellwig
2025-06-16 16:07 ` Mike Snitzer
2025-06-17 4:37 ` Christoph Hellwig
2025-06-17 20:26 ` Mike Snitzer
2025-06-17 22:23 ` [RFC PATCH] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec [was: Re: need SUNRPC TCP to receive into aligned pages] Mike Snitzer
2025-07-03 0:12 ` need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO] NeilBrown
2025-06-12 7:13 ` [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO Christoph Hellwig
2025-06-12 13:15 ` Chuck Lever
2025-06-12 13:21 ` Chuck Lever
2025-06-12 16:00 ` Mike Snitzer
2025-06-16 13:32 ` Chuck Lever
2025-06-16 16:10 ` Mike Snitzer
2025-06-17 17:22 ` Mike Snitzer
2025-06-17 17:31 ` Chuck Lever
2025-06-19 20:19 ` Mike Snitzer
2025-06-30 14:50 ` Chuck Lever
2025-07-04 19:46 ` Mike Snitzer
2025-07-04 19:49 ` Chuck Lever
2025-06-10 20:57 ` [PATCH 2/6] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
2025-06-10 20:57 ` [PATCH 3/6] NFSD: pass nfsd_file to nfsd_iter_read() Mike Snitzer
2025-06-10 20:57 ` [PATCH 4/6] fs: introduce RWF_DIRECT to allow using O_DIRECT on a per-IO basis Mike Snitzer
2025-06-11 6:58 ` Christoph Hellwig
2025-06-11 10:51 ` Mike Snitzer
2025-06-11 14:17 ` Chuck Lever
2025-06-12 7:15 ` Christoph Hellwig
2025-06-10 20:57 ` [PATCH 5/6] NFSD: leverage DIO alignment to selectively issue O_DIRECT reads and writes Mike Snitzer
2025-06-11 7:00 ` Christoph Hellwig
2025-06-11 12:23 ` Mike Snitzer
2025-06-11 13:30 ` Jeff Layton
2025-06-12 7:22 ` Christoph Hellwig
2025-06-12 7:23 ` Christoph Hellwig
2025-06-11 14:42 ` Chuck Lever
2025-06-11 15:07 ` Jeff Layton
2025-06-11 15:11 ` Chuck Lever
2025-06-11 15:44 ` Jeff Layton
2025-06-11 20:51 ` Mike Snitzer
2025-06-12 7:32 ` Christoph Hellwig
2025-06-12 7:28 ` Christoph Hellwig
2025-06-12 7:25 ` Christoph Hellwig
2025-06-10 20:57 ` [PATCH 6/6] NFSD: issue READs using O_DIRECT even if IO is misaligned Mike Snitzer
2025-06-11 12:55 ` [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support Jeff Layton
2025-06-12 7:39 ` Christoph Hellwig
2025-06-12 20:37 ` Mike Snitzer
2025-06-13 5:31 ` Christoph Hellwig
2025-06-11 14:16 ` Chuck Lever
2025-06-11 18:02 ` Mike Snitzer
2025-06-11 19:06 ` Chuck Lever
2025-06-11 19:58 ` Mike Snitzer
2025-06-12 13:46 ` Chuck Lever
2025-06-12 19:08 ` Mike Snitzer
2025-06-12 20:17 ` Chuck Lever
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=826d22214f01fc453a7e38953e2b8893073fcd46.camel@kernel.org \
--to=jlayton@kernel.org \
--cc=axboe@kernel.dk \
--cc=chuck.lever@oracle.com \
--cc=david.flynn@hammerspace.com \
--cc=hch@infradead.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-nfs@vger.kernel.org \
--cc=snitzer@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).