From: Mike Snitzer <snitzer@kernel.org>
To: Christoph Hellwig <hch@infradead.org>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Andrew Morton <akpm@linux-foundation.org>
Cc: Jeff Layton <jlayton@kernel.org>,
Chuck Lever <chuck.lever@oracle.com>,
linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
Jens Axboe <axboe@kernel.dk>,
david.flynn@hammerspace.com
Subject: [RFC PATCH] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec [was: Re: need SUNRPC TCP to receive into aligned pages]
Date: Tue, 17 Jun 2025 18:23:05 -0400
Message-ID: <aFHqyU4qO_W1enUT@kernel.org>
In-Reply-To: <aFHPgrPM798wXdSG@kernel.org>

[Cc'ing Al and Andrew]

On Tue, Jun 17, 2025 at 04:26:42PM -0400, Mike Snitzer wrote:
> On Mon, Jun 16, 2025 at 09:37:01PM -0700, Christoph Hellwig wrote:
> > On Mon, Jun 16, 2025 at 12:07:42PM -0400, Mike Snitzer wrote:
> > > But that's OK... my test bdev is a bad example (archaic VMware vSphere
> > > provided SCSI device): it doesn't reflect expected modern hardware.
> > >
> > > But I just slapped together a test pmem blockdevice (memory backed,
> > > using memmap=6G!18G) and it too has dma_alignment=511
> >
> > That's the block layer default when not overridden by the driver, I guess
> > pmem folks didn't care enough. I suspect it should not have any
> > alignment requirements at all.
>
> Yeah, I hacked it with this just to quickly simulate NVMe's dma_alignment:
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 210fb77f51ba..0ab2826073f9 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -457,6 +457,7 @@ static int pmem_attach_disk(struct device *dev,
>  		.max_hw_sectors		= UINT_MAX,
>  		.features		= BLK_FEAT_WRITE_CACHE |
>  					  BLK_FEAT_SYNCHRONOUS,
> +		.dma_alignment		= 3,
>  	};
>  	int nid = dev_to_node(dev), fua;
>  	struct resource *res = &nsio->res;
>
> > > I'd like NFSD to be able to know if its bvec is dma-aligned, before
> > > issuing DIO writes to underlying XFS. AFAIK I can do that simply by
> > > checking the STATX_DIOALIGN provided dio_mem_align...
> >
> > Exactly.
>
> I'm finding that even with dma_alignment=3, the bvec that
> nfsd_vfs_write()'s call to xdr_buf_to_bvec() produces from NFS's WRITE
> payload still causes iov_iter_aligned_bvec() to return false.
>
> The reason is that iov_iter_aligned_bvec() inspects each member of the
> bio_vec in isolation (in its while() loop). So even though NFS WRITE
> payload's overall size is aligned on-disk (e.g. offset=0 len=512K) its
> first and last bvec members are _not_ aligned (due to 512K NFS WRITE
> payload being offset 148 bytes into the first page of the pages
> allocated for it by SUNRPC). So iov_iter_aligned_bvec() fails at this
> check:
>
> 		if (len & len_mask)
> 			return false;
>
> with tracing I added:
>
> nfsd-14027 [001] ..... 3734.668780: nfsd_vfs_write: iov_iter_aligned_bvec: addr_mask=3 len_mask=511
> nfsd-14027 [001] ..... 3734.668781: nfsd_vfs_write: iov_iter_aligned_bvec: len=3948 & len_mask=511 failed
>
> Is this another case of the checks being too strict? The bvec does
> describe a contiguous 512K extent of on-disk LBA, just not if
> inspected piece-wise.
>
> BTW, XFS's directio code _will_ also check with
> iov_iter_aligned_bvec() via iov_iter_is_aligned().

This works, I just don't know what (if any) breakage it exposes us to:

Author: Mike Snitzer <snitzer@kernel.org>
Date:   Tue Jun 17 22:04:44 2025 +0000
Subject: lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec

iov_iter_aligned_bvec() is strictly checking alignment of each element
of the bvec to arrive at whether the bvec is aligned relative to
dma_alignment and on-disk alignment.  Checking each element
individually results in disallowing a bvec that in aggregate is
perfectly aligned relative to the provided @len_mask.

Relax the on-disk alignment checking such that it is done on the full
extent described by the bvec but still do piecewise checking of the
dma_alignment for each bvec's bv_offset.

This allows for NFS's WRITE payload to be issued using O_DIRECT as
long as the bvec created with xdr_buf_to_bvec() is composed of pages
that respect the underlying device's dma_alignment (@addr_mask) and
the overall contiguous on-disk extent is aligned relative to the
logical_block_size (@len_mask).

Signed-off-by: Mike Snitzer <snitzer@kernel.org>

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index bdb37d572e97..b2ae482b8a1d 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -819,13 +819,14 @@ static bool iov_iter_aligned_bvec(const struct iov_iter *i, unsigned addr_mask,
 	unsigned skip = i->iov_offset;
 	size_t size = i->count;
 
+	if (size & len_mask)
+		return false;
+
 	do {
 		size_t len = bvec->bv_len;
 
 		if (len > size)
 			len = size;
-		if (len & len_mask)
-			return false;
 		if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
 			return false;
 