All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mike Snitzer <snitzer@kernel.org>
To: Christoph Hellwig <hch@infradead.org>
Cc: Jeff Layton <jlayton@kernel.org>,
	Chuck Lever <chuck.lever@oracle.com>,
	linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Jens Axboe <axboe@kernel.dk>,
	david.flynn@hammerspace.com
Subject: Re: need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO]
Date: Fri, 13 Jun 2025 05:23:48 -0400	[thread overview]
Message-ID: <aEvuJP7_xhVk5R4S@kernel.org> (raw)
In-Reply-To: <aEu7GSa7HRNNVJVA@infradead.org>

On Thu, Jun 12, 2025 at 10:46:01PM -0700, Christoph Hellwig wrote:
> On Thu, Jun 12, 2025 at 12:22:42PM -0400, Jeff Layton wrote:
> > If you're against the idea, I won't waste my time.
> > 
> > It would require some fairly hefty rejiggering of the receive code. The
> > v4 part would be pretty nightmarish to work out too since you'd have to
> > decode the compound as you receive to tell where the next op starts.
> > 
> > The potential for corruption with unaligned writes is also pretty
> > nasty.
> 
> Maybe I'm missing an improvement to the receive buffer handling in modern
> network hardware, but AFAIK this still would only help you to align the
> sunrpc data buffer to page boundaries, but avoid the data copy from the
> hardware receive buffer to the sunrpc data buffer as you still don't have
> hardware header splitting.

Correct, everything that Jeff detailed is about ensuring the WRITE
payload is received into page aligned boundary.

Which in practice has proven a hard requirement for O_DIRECT in my
testing -- but I could be hitting some bizarre driver bug in my TCP
testbed (which sadly sits ontop of older VMware guests/drivers).

But if you looking at patch 5 in this series:
https://lore.kernel.org/linux-nfs/20250610205737.63343-6-snitzer@kernel.org/

I added fs/nfsd/vfs.c:is_dio_aligned(), which is basically a tweaked
ditto of fs/btrfs/direct-io.c:check_direct_IO():

static bool is_dio_aligned(const struct iov_iter *iter, loff_t offset,
                           const u32 blocksize)
{
        u32 blocksize_mask;

        if (!blocksize)
                return false;

        blocksize_mask = blocksize - 1;
        if ((offset & blocksize_mask) ||
            (iov_iter_alignment(iter) & blocksize_mask))
                return false;

        return true;
}

And fs/nfsd/vfs.c:nfsd_vfs_write() has (after my patch 5):

        nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
        iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);

        if (nfsd_enable_dontcache) {
                if (is_dio_aligned(&iter, offset, nf->nf_dio_offset_align))
                        flags |= RWF_DIRECT;

What I found is that unless SUNRPC TPC stored the WRITE payload in a
page-aligned boundary then iov_iter_alignment() would fail.

The @payload arg above, with my SUNRPC TCP testing, was always offset
148 bytes into the first page of the pages allocated for xdr_buf's
use, which is rqstp->rq_pages, which is allocated by
net/sunrpc/svc_xprt.c:svc_alloc_arg().

> And I don't even know what this is supposed to buy the nfs server.
> Direct I/O writes need to have the proper file offset alignment, but as
> far as Linux is concerned we don't require any memory alignment.  Most
> storage hardware has requirements for the memory alignment that we pass
> on, but typically that's just a dword (4-byte) alignment, which matches
> the alignment sunrpc wants for most XDR data structures anyway.  So what
> additional alignment is actually needed for support direct I/O writes
> assuming that is the goal?  (I might also simply misunderstand the
> problem).

THIS... this is the very precise question/detail I discussed with
Hammerspace's CEO David Flynn when discussing Linux's O_DIRECT
support.  David shares your understanding and confusion.  And all I
could tell him is that in practice I always page-aligned my data
buffers used to issue O_DIRECT.  And that in this instance if I don't
then O_DIRECT doesn't work (if I commented out the iov_iter_alignment
check in is_dio_aligned above).

But is that simply due to xdr_buf_to_bvec()'s use of bvec_set_virt()
for xdr_buf "head" page (first page of rqstp->rg_pages)?  Whereas you
can see xdr_buf_to_bvec() uses bvec_set_page() to add each of the
other pages that immediately follow the first "head" page.

All said, if Linux can/should happily allow non-page-aligned DIO (and
we only need to worry about the on-disk DIO alignment requirements)
that'd be wonderful.

Then its just a matter of finding where that is broken...

Happy to dig into this further if you might nudge me in the right
direction.

Thanks,
Mike

  reply	other threads:[~2025-06-13  9:23 UTC|newest]

Thread overview: 75+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-10 20:57 [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support Mike Snitzer
2025-06-10 20:57 ` [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO Mike Snitzer
2025-06-11  6:57   ` Christoph Hellwig
2025-06-11 10:44     ` Mike Snitzer
2025-06-11 13:04       ` Jeff Layton
2025-06-11 13:56     ` Chuck Lever
2025-06-11 14:31   ` Chuck Lever
2025-06-11 19:18     ` Mike Snitzer
2025-06-11 20:29       ` Jeff Layton
2025-06-11 21:36         ` need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO] Mike Snitzer
2025-06-12 10:28           ` Jeff Layton
2025-06-12 11:28             ` Jeff Layton
2025-06-12 13:28             ` Chuck Lever
2025-06-12 14:17               ` Benjamin Coddington
2025-06-12 15:56                 ` Mike Snitzer
2025-06-12 15:58                   ` Chuck Lever
2025-06-12 16:12                     ` Mike Snitzer
2025-06-12 16:32                       ` Chuck Lever
2025-06-13  5:39                     ` Christoph Hellwig
2025-06-12 16:22               ` Jeff Layton
2025-06-13  5:46                 ` Christoph Hellwig
2025-06-13  9:23                   ` Mike Snitzer [this message]
2025-06-13 13:02                     ` Jeff Layton
2025-06-16 12:35                       ` Christoph Hellwig
2025-06-16 12:29                     ` Christoph Hellwig
2025-06-16 16:07                       ` Mike Snitzer
2025-06-17  4:37                         ` Christoph Hellwig
2025-06-17 20:26                           ` Mike Snitzer
2025-06-17 22:23                             ` [RFC PATCH] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec [was: Re: need SUNRPC TCP to receive into aligned pages] Mike Snitzer
2025-07-03  0:12             ` need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO] NeilBrown
2025-06-12  7:13         ` [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO Christoph Hellwig
2025-06-12 13:15           ` Chuck Lever
2025-06-12 13:21       ` Chuck Lever
2025-06-12 16:00         ` Mike Snitzer
2025-06-16 13:32           ` Chuck Lever
2025-06-16 16:10             ` Mike Snitzer
2025-06-17 17:22               ` Mike Snitzer
2025-06-17 17:31                 ` Chuck Lever
2025-06-19 20:19                   ` Mike Snitzer
2025-06-30 14:50                     ` Chuck Lever
2025-07-04 19:46                       ` Mike Snitzer
2025-07-04 19:49                         ` Chuck Lever
2025-06-10 20:57 ` [PATCH 2/6] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
2025-06-10 20:57 ` [PATCH 3/6] NFSD: pass nfsd_file to nfsd_iter_read() Mike Snitzer
2025-06-10 20:57 ` [PATCH 4/6] fs: introduce RWF_DIRECT to allow using O_DIRECT on a per-IO basis Mike Snitzer
2025-06-11  6:58   ` Christoph Hellwig
2025-06-11 10:51     ` Mike Snitzer
2025-06-11 14:17     ` Chuck Lever
2025-06-12  7:15       ` Christoph Hellwig
2025-06-10 20:57 ` [PATCH 5/6] NFSD: leverage DIO alignment to selectively issue O_DIRECT reads and writes Mike Snitzer
2025-06-11  7:00   ` Christoph Hellwig
2025-06-11 12:23     ` Mike Snitzer
2025-06-11 13:30       ` Jeff Layton
2025-06-12  7:22         ` Christoph Hellwig
2025-06-12  7:23       ` Christoph Hellwig
2025-06-11 14:42   ` Chuck Lever
2025-06-11 15:07     ` Jeff Layton
2025-06-11 15:11       ` Chuck Lever
2025-06-11 15:44         ` Jeff Layton
2025-06-11 20:51           ` Mike Snitzer
2025-06-12  7:32           ` Christoph Hellwig
2025-06-12  7:28         ` Christoph Hellwig
2025-06-12  7:25       ` Christoph Hellwig
2025-06-10 20:57 ` [PATCH 6/6] NFSD: issue READs using O_DIRECT even if IO is misaligned Mike Snitzer
2025-06-11 12:55 ` [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support Jeff Layton
2025-06-12  7:39   ` Christoph Hellwig
2025-06-12 20:37     ` Mike Snitzer
2025-06-13  5:31       ` Christoph Hellwig
2025-06-11 14:16 ` Chuck Lever
2025-06-11 18:02   ` Mike Snitzer
2025-06-11 19:06     ` Chuck Lever
2025-06-11 19:58       ` Mike Snitzer
2025-06-12 13:46 ` Chuck Lever
2025-06-12 19:08   ` Mike Snitzer
2025-06-12 20:17     ` Chuck Lever

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aEvuJP7_xhVk5R4S@kernel.org \
    --to=snitzer@kernel.org \
    --cc=axboe@kernel.dk \
    --cc=chuck.lever@oracle.com \
    --cc=david.flynn@hammerspace.com \
    --cc=hch@infradead.org \
    --cc=jlayton@kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.