linux-block.vger.kernel.org archive mirror
* Re: [RFC PATCH v2 4/8] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec
       [not found]     ` <aG_SpLuUv4EH7fAb@kbusch-mbp>
@ 2025-07-10 16:12       ` Mike Snitzer
  2025-07-10 16:29         ` Keith Busch
  2025-08-01 15:23         ` Keith Busch
  0 siblings, 2 replies; 7+ messages in thread
From: Mike Snitzer @ 2025-07-10 16:12 UTC (permalink / raw)
  To: Keith Busch, Ming Lei, Jens Axboe
  Cc: Jeff Layton, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, Trond Myklebust, Anna Schumaker, linux-nfs,
	linux-fsdevel, linux-mm, hch, linux-block

On Thu, Jul 10, 2025 at 08:48:04AM -0600, Keith Busch wrote:
> On Thu, Jul 10, 2025 at 09:52:53AM -0400, Jeff Layton wrote:
> > On Tue, 2025-07-08 at 12:06 -0400, Mike Snitzer wrote:
> > > iov_iter_aligned_bvec() is strictly checking alignment of each element
> > > of the bvec to arrive at whether the bvec is aligned relative to
> > > dma_alignment and on-disk alignment.  Checking each element
> > > individually results in disallowing a bvec that in aggregate is
> > > perfectly aligned relative to the provided @len_mask.
> > > 
> > > Relax the on-disk alignment checking such that it is done on the full
> > > extent described by the bvec but still do piecewise checking of the
> > > dma_alignment for each bvec's bv_offset.
> > > 
> > > This allows for NFS's WRITE payload to be issued using O_DIRECT as
> > > long as the bvec created with xdr_buf_to_bvec() is composed of pages
> > > that respect the underlying device's dma_alignment (@addr_mask) and
> > > the overall contiguous on-disk extent is aligned relative to the
> > > logical_block_size (@len_mask).
> > > 
> > > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > > ---
> > >  lib/iov_iter.c | 5 +++--
> > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> > > index bdb37d572e97..b2ae482b8a1d 100644
> > > --- a/lib/iov_iter.c
> > > +++ b/lib/iov_iter.c
> > > @@ -819,13 +819,14 @@ static bool iov_iter_aligned_bvec(const struct iov_iter *i, unsigned addr_mask,
> > >  	unsigned skip = i->iov_offset;
> > >  	size_t size = i->count;
> > >  
> > > +	if (size & len_mask)
> > > +		return false;
> > > +
> > >  	do {
> > >  		size_t len = bvec->bv_len;
> > >  
> > >  		if (len > size)
> > >  			len = size;
> > > -		if (len & len_mask)
> > > -			return false;
> > >  		if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
> > >  			return false;
> > >  
> > 
> > cc'ing Keith too since he wrote this helper originally.
> 
> Thanks.
> 
> There's a comment in __bio_iov_iter_get_pages that says it expects each
> vector to be a multiple of the block size. That makes it easier to
> split when needed, and this patch would allow vectors that break the
> current assumption when calculating the "trim" value.

Thanks for the pointer, that high-level bio code is being too
restrictive.

But I'm not seeing any issues with the trim calculation itself: 'trim' is
the number of bytes that are past the last logical_block_size aligned
boundary.  iov_iter_revert() then rolls back the iov such that it
doesn't include those bytes, and size is reduced by trim bytes.
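
For reference, the relevant trim handling in __bio_iov_iter_get_pages
currently looks roughly like this (condensed sketch; unrelated details
omitted):

	size = iov_iter_extract_pages(iter, &pages,
				      UINT_MAX - bio->bi_iter.bi_size,
				      nr_pages, extraction_flags, &offset);
	...
	if (bio->bi_bdev) {
		/* bytes past the last logical_block_size boundary */
		size_t trim = size & (bdev_logical_block_size(bio->bi_bdev) - 1);

		iov_iter_revert(iter, trim);
		size -= trim;
	}

	if (unlikely(!size)) {
		ret = -EFAULT;
		goto out;
	}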

Just restating my challenge:
Assuming that each vector is itself a multiple of logical_block_size
disallows valid use cases (like the one I have with NFSD needing to use
O_DIRECT for its WRITE payload, which can have head and/or tail vectors
that do _not_ start or end on logical_block_size aligned boundaries).
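
To make that concrete (illustrative numbers only), with a 512 byte
logical_block_size:

	bvec[0].bv_len =  256;	/* head fragment, not a 512B multiple */
	bvec[1].bv_len = 3584;	/* page data */
	bvec[2].bv_len =  256;	/* tail fragment, not a 512B multiple */

The aggregate is 4096 bytes, a multiple of 512, so the overall extent is
logical_block_size aligned even though the per-vector check rejects it
today; each bv_offset can still be checked against dma_alignment.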

> But for nvme, you couldn't split such a bvec into a usable command
> anyway. I think you'd have to introduce a different queue limit to check
> against when validating iter alignment if you don't want to use the
> logical block size. 

There isn't a new queue limit in play though, but I take your point
that we did in fact assume logical_block_size applies to each vector.

With my patch, we still validate logical_block_size alignment (of the
overall bvec length, not of each vector) and we still do piecewise
checking that each vector's bv_offset respects dma_alignment.

All said, in practice I haven't had any issues with this patch.  But
it could just be I don't have the stars aligned to test the case that
might have problems.  If you know of such a case I'd welcome
suggestions.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH v2 4/8] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec
  2025-07-10 16:12       ` [RFC PATCH v2 4/8] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec Mike Snitzer
@ 2025-07-10 16:29         ` Keith Busch
  2025-07-10 17:22           ` Mike Snitzer
  2025-08-01 15:23         ` Keith Busch
  1 sibling, 1 reply; 7+ messages in thread
From: Keith Busch @ 2025-07-10 16:29 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Ming Lei, Jens Axboe, Jeff Layton, Chuck Lever, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
	Anna Schumaker, linux-nfs, linux-fsdevel, linux-mm, hch,
	linux-block

On Thu, Jul 10, 2025 at 12:12:29PM -0400, Mike Snitzer wrote:
> On Thu, Jul 10, 2025 at 08:48:04AM -0600, Keith Busch wrote:
> > On Thu, Jul 10, 2025 at 09:52:53AM -0400, Jeff Layton wrote:
> > > On Tue, 2025-07-08 at 12:06 -0400, Mike Snitzer wrote:
> > > > iov_iter_aligned_bvec() is strictly checking alignment of each element
> > > > of the bvec to arrive at whether the bvec is aligned relative to
> > > > dma_alignment and on-disk alignment.  Checking each element
> > > > individually results in disallowing a bvec that in aggregate is
> > > > perfectly aligned relative to the provided @len_mask.
> > > > 
> > > > Relax the on-disk alignment checking such that it is done on the full
> > > > extent described by the bvec but still do piecewise checking of the
> > > > dma_alignment for each bvec's bv_offset.
> > > > 
> > > > This allows for NFS's WRITE payload to be issued using O_DIRECT as
> > > > long as the bvec created with xdr_buf_to_bvec() is composed of pages
> > > > that respect the underlying device's dma_alignment (@addr_mask) and
> > > > the overall contiguous on-disk extent is aligned relative to the
> > > > logical_block_size (@len_mask).
> > > > 
> > > > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > > > ---
> > > >  lib/iov_iter.c | 5 +++--
> > > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> > > > index bdb37d572e97..b2ae482b8a1d 100644
> > > > --- a/lib/iov_iter.c
> > > > +++ b/lib/iov_iter.c
> > > > @@ -819,13 +819,14 @@ static bool iov_iter_aligned_bvec(const struct iov_iter *i, unsigned addr_mask,
> > > >  	unsigned skip = i->iov_offset;
> > > >  	size_t size = i->count;
> > > >  
> > > > +	if (size & len_mask)
> > > > +		return false;
> > > > +
> > > >  	do {
> > > >  		size_t len = bvec->bv_len;
> > > >  
> > > >  		if (len > size)
> > > >  			len = size;
> > > > -		if (len & len_mask)
> > > > -			return false;
> > > >  		if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
> > > >  			return false;
> > > >  
> > > 
> > > cc'ing Keith too since he wrote this helper originally.
> > 
> > Thanks.
> > 
> > There's a comment in __bio_iov_iter_get_pages that says it expects each
> > vector to be a multiple of the block size. That makes it easier to
> > split when needed, and this patch would allow vectors that break the
> > current assumption when calculating the "trim" value.
> 
> Thanks for the pointer, that high-level bio code is being too
> restrictive.
> 
> But I'm not seeing any issues with the trim calculation itself: 'trim' is
> the number of bytes that are past the last logical_block_size aligned
> boundary.  iov_iter_revert() then rolls back the iov such that it
> doesn't include those bytes, and size is reduced by trim bytes.

The trim calculation assumes the current bi_size is already a block size
multiple, but it may not be with your proposal. So the trim calculation
needs to take into account the existing bi_size to know how much to trim
off to arrive at a proper total bi_size, instead of assuming we can
append a block size multiple carved out of the current iov.
 
> All said, in practice I haven't had any issues with this patch.  But
> it could just be I don't have the stars aligned to test the case that
> might have problems.  If you know of such a case I'd welcome
> suggestions.

It might be a little harder with iter_bvec, but you also mentioned doing
the same for iter_iovec too, which I think should be pretty easy to
cause a problem for nvme: just submit an O_DIRECT read or write with
individual iovec sizes that are not block size granularities.
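
Something like this should demonstrate it (untested sketch; the device
path, sizes and use of preadv are just placeholders):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>

int main(void)
{
	struct iovec iov[4];
	void *buf;
	int fd, i;

	/*
	 * Four separately page-aligned chunks, so only the 128-byte
	 * vector lengths violate the logical_block_size granularity.
	 */
	if (posix_memalign(&buf, 4096, 4 * 4096))
		return 1;

	fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
	if (fd < 0)
		return 1;

	for (i = 0; i < 4; i++) {
		iov[i].iov_base = (char *)buf + i * 4096;
		iov[i].iov_len = 128;	/* not a 512B multiple */
	}

	/* total 512 bytes, but piecewise unaligned */
	if (preadv(fd, iov, 4, 0) < 0)
		perror("preadv");
	return 0;
}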

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH v2 4/8] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec
  2025-07-10 16:29         ` Keith Busch
@ 2025-07-10 17:22           ` Mike Snitzer
  2025-07-10 19:51             ` Keith Busch
  2025-07-10 19:57             ` Keith Busch
  0 siblings, 2 replies; 7+ messages in thread
From: Mike Snitzer @ 2025-07-10 17:22 UTC (permalink / raw)
  To: Keith Busch
  Cc: Ming Lei, Jens Axboe, Jeff Layton, Chuck Lever, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
	Anna Schumaker, linux-nfs, linux-fsdevel, linux-mm, hch,
	linux-block

On Thu, Jul 10, 2025 at 10:29:22AM -0600, Keith Busch wrote:
> On Thu, Jul 10, 2025 at 12:12:29PM -0400, Mike Snitzer wrote:
> > On Thu, Jul 10, 2025 at 08:48:04AM -0600, Keith Busch wrote:
> > > On Thu, Jul 10, 2025 at 09:52:53AM -0400, Jeff Layton wrote:
> > > > On Tue, 2025-07-08 at 12:06 -0400, Mike Snitzer wrote:
> > > > > iov_iter_aligned_bvec() is strictly checking alignment of each element
> > > > > of the bvec to arrive at whether the bvec is aligned relative to
> > > > > dma_alignment and on-disk alignment.  Checking each element
> > > > > individually results in disallowing a bvec that in aggregate is
> > > > > perfectly aligned relative to the provided @len_mask.
> > > > > 
> > > > > Relax the on-disk alignment checking such that it is done on the full
> > > > > extent described by the bvec but still do piecewise checking of the
> > > > > dma_alignment for each bvec's bv_offset.
> > > > > 
> > > > > This allows for NFS's WRITE payload to be issued using O_DIRECT as
> > > > > long as the bvec created with xdr_buf_to_bvec() is composed of pages
> > > > > that respect the underlying device's dma_alignment (@addr_mask) and
> > > > > the overall contiguous on-disk extent is aligned relative to the
> > > > > logical_block_size (@len_mask).
> > > > > 
> > > > > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > > > > ---
> > > > >  lib/iov_iter.c | 5 +++--
> > > > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> > > > > index bdb37d572e97..b2ae482b8a1d 100644
> > > > > --- a/lib/iov_iter.c
> > > > > +++ b/lib/iov_iter.c
> > > > > @@ -819,13 +819,14 @@ static bool iov_iter_aligned_bvec(const struct iov_iter *i, unsigned addr_mask,
> > > > >  	unsigned skip = i->iov_offset;
> > > > >  	size_t size = i->count;
> > > > >  
> > > > > +	if (size & len_mask)
> > > > > +		return false;
> > > > > +
> > > > >  	do {
> > > > >  		size_t len = bvec->bv_len;
> > > > >  
> > > > >  		if (len > size)
> > > > >  			len = size;
> > > > > -		if (len & len_mask)
> > > > > -			return false;
> > > > >  		if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
> > > > >  			return false;
> > > > >  
> > > > 
> > > > cc'ing Keith too since he wrote this helper originally.
> > > 
> > > Thanks.
> > > 
> > > There's a comment in __bio_iov_iter_get_pages that says it expects each
> > > vector to be a multiple of the block size. That makes it easier to
> > > split when needed, and this patch would allow vectors that break the
> > > current assumption when calculating the "trim" value.
> > 
> > Thanks for the pointer, that high-level bio code is being too
> > restrictive.
> > 
> > But I'm not seeing any issues with the trim calculation itself: 'trim' is
> > the number of bytes that are past the last logical_block_size aligned
> > boundary.  iov_iter_revert() then rolls back the iov such that it
> > doesn't include those bytes, and size is reduced by trim bytes.
> 
> The trim calculation assumes the current bi_size is already a block size
> multiple, but it may not be with your proposal. So the trim calculation
> needs to take into account the existing bi_size to know how much to trim
> off to arrive at a proper total bi_size, instead of assuming we can
> append a block size multiple carved out of the current iov.

The trim "calculation" doesn't assume anything, it just lops off
whatever is past the end of the last logical_block_size aligned
boundary of the requested pages (which is meant to be bi_size).  The
fact that the trim ever gets anything implies bi_size is *not* always
logical_block_size aligned. No?

But sure, my change does open the door to bvecs whose vectors aren't
all logical_block_size aligned.

I'll revisit this code, but if you see a way forward to fix
__bio_iov_iter_get_pages to cope with my desired iov_iter_aligned_bvec
change please don't be shy with a patch ;)

> > All said, in practice I haven't had any issues with this patch.  But
> > it could just be I don't have the stars aligned to test the case that
> > might have problems.  If you know of such a case I'd welcome
> > suggestions.
> 
> It might be a little harder with iter_bvec, but you also mentioned doing
> the same for iter_iovec too, which I think should be pretty easy to
> cause a problem for nvme: just submit an O_DIRECT read or write with
> individual iovec sizes that are not block size granularities.

I made the iter_iovec change yesterday (before I realized I don't
actually need it for my NFSD case) and all was fine issuing O_DIRECT
IO (via NFSD, so needing the relaxed checking) through to 16
XFS-on-NVMe devices.  So I think the devil will be in the details if
NVMe actually cares.

Mike

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH v2 4/8] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec
  2025-07-10 17:22           ` Mike Snitzer
@ 2025-07-10 19:51             ` Keith Busch
  2025-07-10 19:57             ` Keith Busch
  1 sibling, 0 replies; 7+ messages in thread
From: Keith Busch @ 2025-07-10 19:51 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Ming Lei, Jens Axboe, Jeff Layton, Chuck Lever, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
	Anna Schumaker, linux-nfs, linux-fsdevel, linux-mm, hch,
	linux-block

On Thu, Jul 10, 2025 at 01:22:59PM -0400, Mike Snitzer wrote:
> On Thu, Jul 10, 2025 at 10:29:22AM -0600, Keith Busch wrote:
> > The trim calculation assumes the current bi_size is already a block size
> > multiple, but it may not be with your proposal. So the trim calculation
> > needs to take into account the existing bi_size to know how much to trim
> > off to arrive at a proper total bi_size, instead of assuming we can
> > append a block size multiple carved out of the current iov.
> 
> The trim "calculation" doesn't assume anything, it just lops off
> whatever is past the end of the last logical_block_size aligned
> boundary of the requested pages (which is meant to be bi_size).  The
> fact that the trim ever gets anything implies bi_size is *not* always
> logical_block_size aligned. No?

No. The iov must be a block size multiple, but if it spans more pages
than the bio can hold (because of bi_max_vecs), the total size of the
pages gotten is only part of the iov. That's the case that 'trim' is
trying to handle: since we only got part of the iov, it's aligned down
to make sure the next iteration can consider perfectly block size
aligned iovecs. At every step of iovec iteration, the bio's bi_size is
a block size multiple.

Let's say we tried to allow smaller vecs. Assume block size of 512
bytes, and you send a direct IO with 4 vecs of 128 bytes each. That
would normally get rejected very early, but if you did send that to the
bio layer, the entirety of the first iov would get trimmed off and you
should get an EFAULT return.
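
Tracing that through the current __bio_iov_iter_get_pages (a rough
walk-through, assuming extraction stops at the first iovec segment
boundary):

	size = iov_iter_extract_pages(...);	/* 128 bytes: first segment only */
	trim = 128 & (512 - 1);			/* 128: the whole extraction */
	iov_iter_revert(iter, 128);		/* give it all back */
	size -= 128;				/* 0 -> -EFAULT */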

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH v2 4/8] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec
  2025-07-10 17:22           ` Mike Snitzer
  2025-07-10 19:51             ` Keith Busch
@ 2025-07-10 19:57             ` Keith Busch
  1 sibling, 0 replies; 7+ messages in thread
From: Keith Busch @ 2025-07-10 19:57 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Ming Lei, Jens Axboe, Jeff Layton, Chuck Lever, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
	Anna Schumaker, linux-nfs, linux-fsdevel, linux-mm, hch,
	linux-block

On Thu, Jul 10, 2025 at 01:22:59PM -0400, Mike Snitzer wrote:
> I'll revisit this code, but if you see a way forward to fix
> __bio_iov_iter_get_pages to cope with my desired iov_iter_aligned_bvec
> change please don't be shy with a patch ;)

For the record, I do like the cause you're going for. I wanted to make
that work when I initially introduced dma_alignment to direct-io, but
the splitting, merging, and respecting all the queue constraints for
things like virt boundaries and segment limits was a bit more than I
expected.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH v2 4/8] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec
  2025-07-10 16:12       ` [RFC PATCH v2 4/8] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec Mike Snitzer
  2025-07-10 16:29         ` Keith Busch
@ 2025-08-01 15:23         ` Keith Busch
  2025-08-01 16:10           ` Mike Snitzer
  1 sibling, 1 reply; 7+ messages in thread
From: Keith Busch @ 2025-08-01 15:23 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Ming Lei, Jens Axboe, Jeff Layton, Chuck Lever, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
	Anna Schumaker, linux-nfs, linux-fsdevel, linux-mm, hch,
	linux-block

On Thu, Jul 10, 2025 at 12:12:29PM -0400, Mike Snitzer wrote:
> All said, in practice I haven't had any issues with this patch.  But
> it could just be I don't have the stars aligned to test the case that
> might have problems.  If you know of such a case I'd welcome
> suggestions.

This is something I threw together that appears to be successful with
NVMe through raw block direct-io. This will defer catching an invalid io
vector to much later in the block stack, which should be okay, and
removes one of the vector walks in the fast path, so that's a bonus.

While this is testing okay with NVMe so far, I haven't tested any more
complicated setups yet, and I probably need to get filesystems using
this relaxed limit too.

---
diff --git a/block/bio.c b/block/bio.c
index 92c512e876c8d..634b2031c4829 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1227,13 +1227,6 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	if (bio->bi_bdev && blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
 		extraction_flags |= ITER_ALLOW_P2PDMA;
 
-	/*
-	 * Each segment in the iov is required to be a block size multiple.
-	 * However, we may not be able to get the entire segment if it spans
-	 * more pages than bi_max_vecs allows, so we have to ALIGN_DOWN the
-	 * result to ensure the bio's total size is correct. The remainder of
-	 * the iov data will be picked up in the next bio iteration.
-	 */
 	size = iov_iter_extract_pages(iter, &pages,
 				      UINT_MAX - bio->bi_iter.bi_size,
 				      nr_pages, extraction_flags, &offset);
@@ -1241,18 +1234,6 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		return size ? size : -EFAULT;
 
 	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
-
-	if (bio->bi_bdev) {
-		size_t trim = size & (bdev_logical_block_size(bio->bi_bdev) - 1);
-		iov_iter_revert(iter, trim);
-		size -= trim;
-	}
-
-	if (unlikely(!size)) {
-		ret = -EFAULT;
-		goto out;
-	}
-
 	for (left = size, i = 0; left > 0; left -= len, i += num_pages) {
 		struct page *page = pages[i];
 		struct folio *folio = page_folio(page);
@@ -1297,6 +1278,23 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	return ret;
 }
 
+static int bio_align_to_bs(struct bio *bio, struct iov_iter *iter)
+{
+	unsigned int mask = bdev_logical_block_size(bio->bi_bdev) - 1;
+	unsigned int total = bio->bi_iter.bi_size;
+	size_t trim = total & mask;
+
+	if (!trim)
+	        return 0;
+
+	/* FIXME: might be leaking pages */
+	bio_revert(bio, trim);
+	iov_iter_revert(iter, trim);
+	if (total == trim)
+	        return -EFAULT;
+	return 0;
+}
+
 /**
  * bio_iov_iter_get_pages - add user or kernel pages to a bio
  * @bio: bio to add pages to
@@ -1327,7 +1325,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	if (iov_iter_is_bvec(iter)) {
 		bio_iov_bvec_set(bio, iter);
 		iov_iter_advance(iter, bio->bi_iter.bi_size);
-		return 0;
+		return bio_align_to_bs(bio, iter);
 	}
 
 	if (iov_iter_extract_will_pin(iter))
@@ -1336,6 +1334,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		ret = __bio_iov_iter_get_pages(bio, iter);
 	} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
 
+	ret = bio_align_to_bs(bio, iter);
 	return bio->bi_vcnt ? 0 : ret;
 }
 EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages);
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 70d704615be52..a3acfef8eb81d 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -298,6 +298,9 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
 	unsigned nsegs = 0, bytes = 0;
 
 	bio_for_each_bvec(bv, bio, iter) {
+		if (bv.bv_offset & lim->dma_alignment)
+			return -EFAULT;
+
 		/*
 		 * If the queue doesn't support SG gaps and adding this
 		 * offset would create a gap, disallow it.
@@ -341,6 +344,8 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
 	 * we do not use the full hardware limits.
 	 */
 	bytes = ALIGN_DOWN(bytes, bio_split_alignment(bio, lim));
+	if (!bytes)
+		return -EFAULT;
 
 	/*
 	 * Bio splitting may cause subtle trouble such as hang when doing sync
diff --git a/block/fops.c b/block/fops.c
index 82451ac8ff25d..820902cf10730 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -38,8 +38,8 @@ static blk_opf_t dio_bio_write_op(struct kiocb *iocb)
 static bool blkdev_dio_invalid(struct block_device *bdev, struct kiocb *iocb,
 				struct iov_iter *iter)
 {
-	return iocb->ki_pos & (bdev_logical_block_size(bdev) - 1) ||
-		!bdev_iter_is_aligned(bdev, iter);
+	return (iocb->ki_pos | iov_iter_count(iter)) &
+			(bdev_logical_block_size(bdev) - 1);
 }
 
 #define DIO_INLINE_BIO_VECS 4
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 46ffac5caab78..d3ddf78d1f35e 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -169,6 +169,22 @@ static inline void bio_advance(struct bio *bio, unsigned int nbytes)
 
 #define bio_iter_last(bvec, iter) ((iter).bi_size == (bvec).bv_len)
 
+static inline void bio_revert(struct bio *bio, unsigned int nbytes)
+{
+	bio->bi_iter.bi_size -= nbytes;
+
+	while (nbytes) {
+		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
+
+		if (nbytes < bv->bv_len) {
+			bv->bv_len -= nbytes;
+			return;
+		}
+		bio->bi_vcnt--;
+		nbytes -= bv->bv_len;
+       }
+}
+
 static inline unsigned bio_segments(struct bio *bio)
 {
 	unsigned segs = 0;
--

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH v2 4/8] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec
  2025-08-01 15:23         ` Keith Busch
@ 2025-08-01 16:10           ` Mike Snitzer
  0 siblings, 0 replies; 7+ messages in thread
From: Mike Snitzer @ 2025-08-01 16:10 UTC (permalink / raw)
  To: Keith Busch
  Cc: Ming Lei, Jens Axboe, Jeff Layton, Chuck Lever, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
	Anna Schumaker, linux-nfs, linux-fsdevel, linux-mm, hch,
	linux-block

On Fri, Aug 01, 2025 at 09:23:57AM -0600, Keith Busch wrote:
> On Thu, Jul 10, 2025 at 12:12:29PM -0400, Mike Snitzer wrote:
> > All said, in practice I haven't had any issues with this patch.  But
> > it could just be I don't have the stars aligned to test the case that
> > might have problems.  If you know of such a case I'd welcome
> > suggestions.
> 
> This is something I threw together that appears to be successful with
> NVMe through raw block direct-io. This will defer catching an invalid io
> vector to much later in the block stack, which should be okay, and
> removes one of the vector walks in the fast path, so that's a bonus.
> 
> While this is testing okay with NVMe so far, I haven't tested any more
> complicated setups yet, and I probably need to get filesystems using
> this relaxed limit too.

Ship it! ;)

Many thanks for this, I'll review closely and circle back!

Mike

 
> ---
> diff --git a/block/bio.c b/block/bio.c
> index 92c512e876c8d..634b2031c4829 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1227,13 +1227,6 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  	if (bio->bi_bdev && blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
>  		extraction_flags |= ITER_ALLOW_P2PDMA;
>  
> -	/*
> -	 * Each segment in the iov is required to be a block size multiple.
> -	 * However, we may not be able to get the entire segment if it spans
> -	 * more pages than bi_max_vecs allows, so we have to ALIGN_DOWN the
> -	 * result to ensure the bio's total size is correct. The remainder of
> -	 * the iov data will be picked up in the next bio iteration.
> -	 */
>  	size = iov_iter_extract_pages(iter, &pages,
>  				      UINT_MAX - bio->bi_iter.bi_size,
>  				      nr_pages, extraction_flags, &offset);
> @@ -1241,18 +1234,6 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  		return size ? size : -EFAULT;
>  
>  	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
> -
> -	if (bio->bi_bdev) {
> -		size_t trim = size & (bdev_logical_block_size(bio->bi_bdev) - 1);
> -		iov_iter_revert(iter, trim);
> -		size -= trim;
> -	}
> -
> -	if (unlikely(!size)) {
> -		ret = -EFAULT;
> -		goto out;
> -	}
> -
>  	for (left = size, i = 0; left > 0; left -= len, i += num_pages) {
>  		struct page *page = pages[i];
>  		struct folio *folio = page_folio(page);
> @@ -1297,6 +1278,23 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  	return ret;
>  }
>  
> +static int bio_align_to_bs(struct bio *bio, struct iov_iter *iter)
> +{
> +	unsigned int mask = bdev_logical_block_size(bio->bi_bdev) - 1;
> +	unsigned int total = bio->bi_iter.bi_size;
> +	size_t trim = total & mask;
> +
> +	if (!trim)
> +	        return 0;
> +
> +	/* FIXME: might be leaking pages */
> +	bio_revert(bio, trim);
> +	iov_iter_revert(iter, trim);
> +	if (total == trim)
> +	        return -EFAULT;
> +	return 0;
> +}
> +
>  /**
>   * bio_iov_iter_get_pages - add user or kernel pages to a bio
>   * @bio: bio to add pages to
> @@ -1327,7 +1325,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  	if (iov_iter_is_bvec(iter)) {
>  		bio_iov_bvec_set(bio, iter);
>  		iov_iter_advance(iter, bio->bi_iter.bi_size);
> -		return 0;
> +		return bio_align_to_bs(bio, iter);
>  	}
>  
>  	if (iov_iter_extract_will_pin(iter))
> @@ -1336,6 +1334,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  		ret = __bio_iov_iter_get_pages(bio, iter);
>  	} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
>  
> +	ret = bio_align_to_bs(bio, iter);
>  	return bio->bi_vcnt ? 0 : ret;
>  }
>  EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages);
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 70d704615be52..a3acfef8eb81d 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -298,6 +298,9 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
>  	unsigned nsegs = 0, bytes = 0;
>  
>  	bio_for_each_bvec(bv, bio, iter) {
> +		if (bv.bv_offset & lim->dma_alignment)
> +			return -EFAULT;
> +
>  		/*
>  		 * If the queue doesn't support SG gaps and adding this
>  		 * offset would create a gap, disallow it.
> @@ -341,6 +344,8 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
>  	 * we do not use the full hardware limits.
>  	 */
>  	bytes = ALIGN_DOWN(bytes, bio_split_alignment(bio, lim));
> +	if (!bytes)
> +		return -EFAULT;
>  
>  	/*
>  	 * Bio splitting may cause subtle trouble such as hang when doing sync
> diff --git a/block/fops.c b/block/fops.c
> index 82451ac8ff25d..820902cf10730 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -38,8 +38,8 @@ static blk_opf_t dio_bio_write_op(struct kiocb *iocb)
>  static bool blkdev_dio_invalid(struct block_device *bdev, struct kiocb *iocb,
>  				struct iov_iter *iter)
>  {
> -	return iocb->ki_pos & (bdev_logical_block_size(bdev) - 1) ||
> -		!bdev_iter_is_aligned(bdev, iter);
> +	return (iocb->ki_pos | iov_iter_count(iter)) &
> +			(bdev_logical_block_size(bdev) - 1);
>  }
>  
>  #define DIO_INLINE_BIO_VECS 4
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 46ffac5caab78..d3ddf78d1f35e 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -169,6 +169,22 @@ static inline void bio_advance(struct bio *bio, unsigned int nbytes)
>  
>  #define bio_iter_last(bvec, iter) ((iter).bi_size == (bvec).bv_len)
>  
> +static inline void bio_revert(struct bio *bio, unsigned int nbytes)
> +{
> +	bio->bi_iter.bi_size -= nbytes;
> +
> +	while (nbytes) {
> +		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
> +
> +		if (nbytes < bv->bv_len) {
> +			bv->bv_len -= nbytes;
> +			return;
> +		}
> +		bio->bi_vcnt--;
> +		nbytes -= bv->bv_len;
> +       }
> +}
> +
>  static inline unsigned bio_segments(struct bio *bio)
>  {
>  	unsigned segs = 0;
> --

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2025-08-01 16:10 UTC | newest]

Thread overview: 7+ messages
     [not found] <20250708160619.64800-1-snitzer@kernel.org>
     [not found] ` <20250708160619.64800-5-snitzer@kernel.org>
     [not found]   ` <5819d6c5bb194613a14d2dcf05605e701683ba49.camel@kernel.org>
     [not found]     ` <aG_SpLuUv4EH7fAb@kbusch-mbp>
2025-07-10 16:12       ` [RFC PATCH v2 4/8] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec Mike Snitzer
2025-07-10 16:29         ` Keith Busch
2025-07-10 17:22           ` Mike Snitzer
2025-07-10 19:51             ` Keith Busch
2025-07-10 19:57             ` Keith Busch
2025-08-01 15:23         ` Keith Busch
2025-08-01 16:10           ` Mike Snitzer
