From: Mike Snitzer <snitzer@kernel.org>
To: Keith Busch <kbusch@kernel.org>
Cc: Ming Lei <ming.lei@redhat.com>, Jens Axboe <axboe@kernel.dk>,
	Jeff Layton <jlayton@kernel.org>,
	Chuck Lever <chuck.lever@oracle.com>, NeilBrown <neil@brown.name>,
	Olga Kornievskaia <okorniev@redhat.com>,
	Dai Ngo <Dai.Ngo@oracle.com>, Tom Talpey <tom@talpey.com>,
	Trond Myklebust <trondmy@kernel.org>,
	Anna Schumaker <anna@kernel.org>,
	linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, hch@infradead.org,
	linux-block@vger.kernel.org
Subject: Re: [RFC PATCH v2 4/8] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec
Date: Thu, 10 Jul 2025 13:22:59 -0400
Message-ID: <aG_28zNe3T-wt7L8@kernel.org>
In-Reply-To: <aG_qYnxiK1Rq5nZR@kbusch-mbp>

On Thu, Jul 10, 2025 at 10:29:22AM -0600, Keith Busch wrote:
> On Thu, Jul 10, 2025 at 12:12:29PM -0400, Mike Snitzer wrote:
> > On Thu, Jul 10, 2025 at 08:48:04AM -0600, Keith Busch wrote:
> > > On Thu, Jul 10, 2025 at 09:52:53AM -0400, Jeff Layton wrote:
> > > > On Tue, 2025-07-08 at 12:06 -0400, Mike Snitzer wrote:
> > > > > iov_iter_aligned_bvec() is strictly checking alignment of each element
> > > > > of the bvec to arrive at whether the bvec is aligned relative to
> > > > > dma_alignment and on-disk alignment.  Checking each element
> > > > > individually results in disallowing a bvec that in aggregate is
> > > > > perfectly aligned relative to the provided @len_mask.
> > > > > 
> > > > > Relax the on-disk alignment checking such that it is done on the full
> > > > > extent described by the bvec but still do piecewise checking of the
> > > > > dma_alignment for each bvec's bv_offset.
> > > > > 
> > > > > This allows for NFS's WRITE payload to be issued using O_DIRECT as
> > > > > long as the bvec created with xdr_buf_to_bvec() is composed of pages
> > > > > that respect the underlying device's dma_alignment (@addr_mask) and
> > > > > the overall contiguous on-disk extent is aligned relative to the
> > > > > logical_block_size (@len_mask).
> > > > > 
> > > > > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > > > > ---
> > > > >  lib/iov_iter.c | 5 +++--
> > > > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> > > > > index bdb37d572e97..b2ae482b8a1d 100644
> > > > > --- a/lib/iov_iter.c
> > > > > +++ b/lib/iov_iter.c
> > > > > @@ -819,13 +819,14 @@ static bool iov_iter_aligned_bvec(const struct iov_iter *i, unsigned addr_mask,
> > > > >  	unsigned skip = i->iov_offset;
> > > > >  	size_t size = i->count;
> > > > >  
> > > > > +	if (size & len_mask)
> > > > > +		return false;
> > > > > +
> > > > >  	do {
> > > > >  		size_t len = bvec->bv_len;
> > > > >  
> > > > >  		if (len > size)
> > > > >  			len = size;
> > > > > -		if (len & len_mask)
> > > > > -			return false;
> > > > >  		if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
> > > > >  			return false;
> > > > >  
> > > > 
> > > > cc'ing Keith too since he wrote this helper originally.
> > > 
> > > Thanks.
> > > 
> > > There's a comment in __bio_iov_iter_get_pages that says it expects each
> > > vector to be a multiple of the block size. That makes it easier to
> > > split when needed, and this patch would allow vectors that break the
> > > current assumption when calculating the "trim" value.
> > 
> > Thanks for the pointer, that high-level bio code is being too
> > restrictive.
> > 
> > But not seeing any issues with the trim calculation itself, 'trim' is
> > the number of bytes that are past the last logical_block_size aligned
> > boundary.  And then iov_iter_revert() will rollback the iov such that
> > it doesn't include those.  Then size is reduced by trim bytes.
> 
> The trim calculation assumes the current bi_size is already a block size
> multiple, but it may not be with your proposal. So the trim bytes needs
> to take into account the existing bi_size to know how much to trim off
> to arrive at a proper total bi_size instead of assuming we can append a
> block sized multiple carved out of the current iov.

The trim "calculation" doesn't assume anything, it just lops off
whatever is past the end of the last logical_block_size aligned
boundary of the requested pages (which is meant to be bi_size).  The
fact that the trim ever gets anything implies bi_size is *not* always
logical_block_size aligned. No?
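
To make the two readings of "trim" concrete, here is a minimal userspace
sketch, not the kernel code.  It assumes trim is derived only from the
size that was just extracted, and compares that with a variant that also
accounts for what the bio already holds.  Both helper names are
hypothetical, and lbs stands in for the logical_block_size.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: bytes past the last block boundary of the
 * freshly extracted size alone.  The existing bi_size is ignored.
 */
static size_t trim_extracted_only(size_t size, size_t lbs)
{
	assert(lbs && !(lbs & (lbs - 1)));	/* lbs must be a power of two */
	return size & (lbs - 1);
}

/*
 * Hypothetical helper: bytes to drop so that bi_size + size ends on a
 * block boundary.  The result is clamped to size, because only the new
 * bytes can be given back to the iterator.
 */
static size_t trim_with_bi_size(size_t bi_size, size_t size, size_t lbs)
{
	size_t trim;

	assert(lbs && !(lbs & (lbs - 1)));	/* lbs must be a power of two */
	trim = (bi_size + size) & (lbs - 1);
	return trim > size ? size : trim;
}
```

With lbs = 512, bi_size = 256 and size = 768, the first helper trims 256
bytes and leaves a total of 768, which is not a block multiple.  The
second trims nothing and leaves 1024.  When bi_size is already a block
multiple the two helpers agree.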

But sure, with my change it opens the door for bvecs with vectors that
aren't all logical_block_size aligned.  
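
As an illustration of what the relaxed check admits, here is a
simplified userspace model of the before and after behaviour.  It is a
sketch under stated assumptions: struct seg is a hypothetical stand-in
for struct bio_vec, the function names are made up, and the iov_offset
skip and the clamping of len to the remaining size are left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bio_vec: only what the check reads. */
struct seg {
	size_t offset;	/* models bv_offset */
	size_t len;	/* models bv_len */
};

/* Old behaviour: every segment length must satisfy len_mask. */
static bool aligned_piecewise(const struct seg *v, size_t n,
			      size_t addr_mask, size_t len_mask)
{
	for (size_t k = 0; k < n; k++) {
		if (v[k].len & len_mask)
			return false;
		if (v[k].offset & addr_mask)
			return false;
	}
	return true;
}

/* Patched behaviour: only the total length must satisfy len_mask. */
static bool aligned_aggregate(const struct seg *v, size_t n,
			      size_t addr_mask, size_t len_mask)
{
	size_t total = 0;

	for (size_t k = 0; k < n; k++)
		total += v[k].len;
	if (total & len_mask)
		return false;

	/* The per-segment offset check against addr_mask is unchanged. */
	for (size_t k = 0; k < n; k++) {
		if (v[k].offset & addr_mask)
			return false;
	}
	return true;
}
```

Take three segments of (offset 256, len 3840), (offset 0, len 4096) and
(offset 0, len 256), with addr_mask = 3 and len_mask = 511.  The
piecewise check rejects them because 3840 is not a multiple of 512.  The
aggregate check accepts them because the total is 8192 and every offset
satisfies addr_mask.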

I'll revisit this code, but if you see a way forward to fix
__bio_iov_iter_get_pages to cope with my desired iov_iter_aligned_bvec
change please don't be shy with a patch ;)

> > All said, in practice I haven't had any issues with this patch.  But
> > it could just be I don't have the stars aligned to test the case that
> > might have problems.  If you know of such a case I'd welcome
> > suggestions.
> 
> It might be a little harder with iter_bvec, but you also mentioned doing
> the same for iter_iovec too, which I think should be pretty easy to
> cause a problem for nvme: just submit an O_DIRECT read or write with
> individual iovec sizes that are not block size granularities.

I made the iter_iovec change yesterday (before I realized I don't
actually need it for my NFSD case) and all was fine issuing O_DIRECT
IO (via NFSD, so needing the relaxed checking) through to 16
XFS-on-NVMe devices.  So I think the devil will be in the details if
NVMe actually cares.

Mike
