linux-fsdevel.vger.kernel.org archive mirror
* [PATCHv2 0/7] direct-io: even more flexible io vectors
@ 2025-08-05 14:11 Keith Busch
  2025-08-05 14:11 ` [PATCHv2 1/7] block: check for valid bio while splitting Keith Busch
                   ` (8 more replies)
  0 siblings, 9 replies; 27+ messages in thread
From: Keith Busch @ 2025-08-05 14:11 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-kernel
  Cc: snitzer, axboe, dw, brauner, Keith Busch

From: Keith Busch <kbusch@kernel.org>

This series removes the direct io requirement that io vector lengths
align to the logical block size.

Changes from v1:

 - Fixed the return value when attempting to align a built bio to a
   block size multiple.

 - Added reviews

Keith Busch (7):
  block: check for valid bio while splitting
  block: align the bio after building it
  block: simplify direct io validity check
  iomap: simplify direct io validity check
  block: remove bdev_iter_is_aligned
  blk-integrity: use simpler alignment check
  iov_iter: remove iov_iter_is_aligned

 block/bio-integrity.c  |  4 +-
 block/bio.c            | 60 +++++++++++++++++---------
 block/blk-merge.c      |  5 +++
 block/fops.c           |  4 +-
 fs/iomap/direct-io.c   |  3 +-
 include/linux/blkdev.h |  7 ----
 include/linux/uio.h    |  2 -
 lib/iov_iter.c         | 95 ------------------------------------------
 8 files changed, 50 insertions(+), 130 deletions(-)

-- 
2.47.3


^ permalink raw reply	[flat|nested] 27+ messages in thread
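The rule the cover letter relaxes can be sketched in userspace. This is a minimal illustrative model, not kernel code: `dio_valid_after_series`, the per-segment/whole-IO split, and all parameter names are assumptions for demonstration. After the series, only the position and the total length must be logical-block aligned; individual vectors need only meet the device's DMA alignment mask.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical model of the relaxed direct-io validity rule: each
 * segment's address and length need only satisfy the DMA alignment
 * mask; the block size constraint applies to the IO as a whole. */
static bool dio_valid_after_series(off_t pos, const struct iovec *iov,
				   int cnt, unsigned int lbs,
				   unsigned int dma_mask)
{
	size_t total = 0;
	int i;

	for (i = 0; i < cnt; i++) {
		/* per-segment requirement: DMA alignment only */
		if (((unsigned long)iov[i].iov_base | iov[i].iov_len) & dma_mask)
			return false;
		total += iov[i].iov_len;
	}
	/* whole-IO requirement: logical block size alignment */
	return !((pos | total) & (lbs - 1));
}
```

With a 512-byte logical block and a 4-byte DMA mask, two vectors of 100 and 412 bytes now pass, because their sum is block aligned even though neither length is.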

* [PATCHv2 1/7] block: check for valid bio while splitting
  2025-08-05 14:11 [PATCHv2 0/7] direct-io: even more flexible io vectors Keith Busch
@ 2025-08-05 14:11 ` Keith Busch
  2025-08-10 14:37   ` Christoph Hellwig
  2025-08-13 20:06   ` Keith Busch
  2025-08-05 14:11 ` [PATCHv2 2/7] block: align the bio after building it Keith Busch
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 27+ messages in thread
From: Keith Busch @ 2025-08-05 14:11 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-kernel
  Cc: snitzer, axboe, dw, brauner, Keith Busch, Hannes Reinecke

From: Keith Busch <kbusch@kernel.org>

We're already iterating every segment, so check them for a valid IO at
the same time.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 block/blk-merge.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 70d704615be52..81bdad915699a 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -298,6 +298,9 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
 	unsigned nsegs = 0, bytes = 0;
 
 	bio_for_each_bvec(bv, bio, iter) {
+		if (bv.bv_offset & lim->dma_alignment)
+			return -EINVAL;
+
 		/*
 		 * If the queue doesn't support SG gaps and adding this
 		 * offset would create a gap, disallow it.
@@ -341,6 +344,8 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
 	 * we do not use the full hardware limits.
 	 */
 	bytes = ALIGN_DOWN(bytes, bio_split_alignment(bio, lim));
+	if (!bytes)
+		return -EINVAL;
 
 	/*
 	 * Bio splitting may cause subtle trouble such as hang when doing sync
-- 
2.47.3


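The two checks this patch adds to bio_split_rw_at() can be modeled in userspace as follows. Everything here (`struct seg`, `split_bytes`, the mask parameters) is illustrative, not kernel API: every segment's starting offset must satisfy the queue's dma_alignment mask, and the accumulated byte count, rounded down to the split alignment, must not collapse to zero.

```c
#include <errno.h>

/* Round x down to a power-of-two alignment a. */
#define ALIGN_DOWN(x, a)	((x) & ~((a) - 1))

struct seg {
	unsigned int offset;
	unsigned int len;
};

/* Returns the block-aligned byte count to split at, or -EINVAL when a
 * segment is misaligned or nothing block-sized remains to issue. */
static int split_bytes(const struct seg *segs, int nsegs,
		       unsigned int dma_align_mask, unsigned int split_align)
{
	unsigned int bytes = 0;
	int i;

	for (i = 0; i < nsegs; i++) {
		if (segs[i].offset & dma_align_mask)
			return -EINVAL;	/* misaligned segment start */
		bytes += segs[i].len;
	}
	bytes = ALIGN_DOWN(bytes, split_align);
	if (!bytes)
		return -EINVAL;		/* nothing block-sized left to issue */
	return bytes;
}
```

The second check matters once per-vector alignment is no longer enforced up front: a bio shorter than one logical block would otherwise round down to a zero-byte split.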

* [PATCHv2 2/7] block: align the bio after building it
  2025-08-05 14:11 [PATCHv2 0/7] direct-io: even more flexible io vectors Keith Busch
  2025-08-05 14:11 ` [PATCHv2 1/7] block: check for valid bio while splitting Keith Busch
@ 2025-08-05 14:11 ` Keith Busch
  2025-08-06  6:07   ` Hannes Reinecke
  2025-08-10 14:43   ` Christoph Hellwig
  2025-08-05 14:11 ` [PATCHv2 3/7] block: simplify direct io validity check Keith Busch
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 27+ messages in thread
From: Keith Busch @ 2025-08-05 14:11 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-kernel
  Cc: snitzer, axboe, dw, brauner, Keith Busch

From: Keith Busch <kbusch@kernel.org>

Instead of ensuring each vector is block size aligned while constructing
the bio, just ensure the entire size is aligned after it's built. This
adds flexibility, accepting device-valid io vectors that would otherwise
be rejected by alignment checks.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/bio.c | 60 +++++++++++++++++++++++++++++++++++------------------
 1 file changed, 40 insertions(+), 20 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 92c512e876c8d..9ecd546c5b077 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1227,13 +1227,6 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	if (bio->bi_bdev && blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
 		extraction_flags |= ITER_ALLOW_P2PDMA;
 
-	/*
-	 * Each segment in the iov is required to be a block size multiple.
-	 * However, we may not be able to get the entire segment if it spans
-	 * more pages than bi_max_vecs allows, so we have to ALIGN_DOWN the
-	 * result to ensure the bio's total size is correct. The remainder of
-	 * the iov data will be picked up in the next bio iteration.
-	 */
 	size = iov_iter_extract_pages(iter, &pages,
 				      UINT_MAX - bio->bi_iter.bi_size,
 				      nr_pages, extraction_flags, &offset);
@@ -1241,18 +1234,6 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		return size ? size : -EFAULT;
 
 	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
-
-	if (bio->bi_bdev) {
-		size_t trim = size & (bdev_logical_block_size(bio->bi_bdev) - 1);
-		iov_iter_revert(iter, trim);
-		size -= trim;
-	}
-
-	if (unlikely(!size)) {
-		ret = -EFAULT;
-		goto out;
-	}
-
 	for (left = size, i = 0; left > 0; left -= len, i += num_pages) {
 		struct page *page = pages[i];
 		struct folio *folio = page_folio(page);
@@ -1297,6 +1278,43 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	return ret;
 }
 
+static inline void bio_revert(struct bio *bio, unsigned int nbytes)
+{
+	bio->bi_iter.bi_size -= nbytes;
+
+	while (nbytes) {
+		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
+
+		if (nbytes < bv->bv_len) {
+			bv->bv_len -= nbytes;
+			return;
+		}
+
+		bio_release_page(bio, bv->bv_page);
+		bio->bi_vcnt--;
+		nbytes -= bv->bv_len;
+	}
+}
+
+static int bio_align_to_lbs(struct bio *bio, struct iov_iter *iter)
+{
+	struct block_device *bdev = bio->bi_bdev;
+	size_t nbytes;
+
+	if (!bdev)
+		return 0;
+
+	nbytes = bio->bi_iter.bi_size & (bdev_logical_block_size(bdev) - 1);
+	if (!nbytes)
+		return 0;
+
+	bio_revert(bio, nbytes);
+	iov_iter_revert(iter, nbytes);
+	if (!bio->bi_iter.bi_size)
+		return -EFAULT;
+	return 0;
+}
+
 /**
  * bio_iov_iter_get_pages - add user or kernel pages to a bio
  * @bio: bio to add pages to
@@ -1336,7 +1354,9 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		ret = __bio_iov_iter_get_pages(bio, iter);
 	} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
 
-	return bio->bi_vcnt ? 0 : ret;
+	if (bio->bi_vcnt)
+		return bio_align_to_lbs(bio, iter);
+	return ret;
 }
 EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages);
 
-- 
2.47.3


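The trimming done by this patch's bio_revert()/bio_align_to_lbs() pair can be illustrated with a simplified userspace model; the bio is reduced to an array of segment lengths and a running total, and the names are hypothetical, not kernel code. Trailing bytes that make the total a non-multiple of the logical block size are peeled off, dropping whole trailing segments as needed.

```c
#include <stdbool.h>

/* Trim the trailing bytes that keep *total from being a multiple of
 * lbs, shrinking or dropping trailing segments. Returns false when
 * trimming leaves nothing, mirroring the patch's -EFAULT case. */
static bool align_to_lbs(unsigned int *seg_len, int *nsegs,
			 unsigned int *total, unsigned int lbs)
{
	unsigned int trim = *total & (lbs - 1);

	*total -= trim;
	while (trim) {
		unsigned int *last = &seg_len[*nsegs - 1];

		if (trim < *last) {
			*last -= trim;	/* shrink the last segment */
			break;
		}
		trim -= *last;		/* drop the whole trailing segment */
		(*nsegs)--;
	}
	return *total != 0;
}
```

In the kernel version the dropped segments also have their pages released, and the iov_iter is reverted by the same amount so the caller can retry the remainder.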

* [PATCHv2 3/7] block: simplify direct io validity check
  2025-08-05 14:11 [PATCHv2 0/7] direct-io: even more flexible io vectors Keith Busch
  2025-08-05 14:11 ` [PATCHv2 1/7] block: check for valid bio while splitting Keith Busch
  2025-08-05 14:11 ` [PATCHv2 2/7] block: align the bio after building it Keith Busch
@ 2025-08-05 14:11 ` Keith Busch
  2025-08-05 14:11 ` [PATCHv2 4/7] iomap: " Keith Busch
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 27+ messages in thread
From: Keith Busch @ 2025-08-05 14:11 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-kernel
  Cc: snitzer, axboe, dw, brauner, Keith Busch, Hannes Reinecke

From: Keith Busch <kbusch@kernel.org>

The block layer checks all the segments for validity later, so no need
for an early check here. Just reduce it to a simple check of the
position and total length, and defer the per-segment checks to the
block layer.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 block/fops.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index 82451ac8ff25d..820902cf10730 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -38,8 +38,8 @@ static blk_opf_t dio_bio_write_op(struct kiocb *iocb)
 static bool blkdev_dio_invalid(struct block_device *bdev, struct kiocb *iocb,
 				struct iov_iter *iter)
 {
-	return iocb->ki_pos & (bdev_logical_block_size(bdev) - 1) ||
-		!bdev_iter_is_aligned(bdev, iter);
+	return (iocb->ki_pos | iov_iter_count(iter)) &
+			(bdev_logical_block_size(bdev) - 1);
 }
 
 #define DIO_INLINE_BIO_VECS 4
-- 
2.47.3


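The rewritten check relies on a standard trick: for a power-of-two block size, OR-ing the position and the length preserves every low bit of both, so a single AND with (lbs - 1) catches a misalignment in either value. A userspace sketch, with illustrative names only:

```c
#include <stdbool.h>

/* Combined form, as in the patch. */
static bool dio_invalid(unsigned long long pos, unsigned long long count,
			unsigned int lbs)
{
	return (pos | count) & (lbs - 1);
}

/* Equivalent long form, kept for comparison. */
static bool dio_invalid_long(unsigned long long pos,
			     unsigned long long count, unsigned int lbs)
{
	return (pos & (lbs - 1)) || (count & (lbs - 1));
}
```

The fold is valid only because the low bits of the OR are exactly the union of the operands' low bits; it would not work for a non-power-of-two alignment, where a modulo test is needed instead.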

* [PATCHv2 4/7] iomap: simplify direct io validity check
  2025-08-05 14:11 [PATCHv2 0/7] direct-io: even more flexible io vectors Keith Busch
                   ` (2 preceding siblings ...)
  2025-08-05 14:11 ` [PATCHv2 3/7] block: simplify direct io validity check Keith Busch
@ 2025-08-05 14:11 ` Keith Busch
  2025-08-06  6:07   ` Hannes Reinecke
  2025-08-05 14:11 ` [PATCHv2 5/7] block: remove bdev_iter_is_aligned Keith Busch
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Keith Busch @ 2025-08-05 14:11 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-kernel
  Cc: snitzer, axboe, dw, brauner, Keith Busch

From: Keith Busch <kbusch@kernel.org>

The block layer checks all the segments for validity later, so no need
for an early check here. Just reduce it to a simple check of the
position and total length, and defer the per-segment checks to the
block layer.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 fs/iomap/direct-io.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 6f25d4cfea9f7..2c1a45e46dc75 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -337,8 +337,7 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
 	u64 copied = 0;
 	size_t orig_count;
 
-	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
-	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
+	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1))
 		return -EINVAL;
 
 	if (dio->flags & IOMAP_DIO_WRITE) {
-- 
2.47.3



* [PATCHv2 5/7] block: remove bdev_iter_is_aligned
  2025-08-05 14:11 [PATCHv2 0/7] direct-io: even more flexible io vectors Keith Busch
                   ` (3 preceding siblings ...)
  2025-08-05 14:11 ` [PATCHv2 4/7] iomap: " Keith Busch
@ 2025-08-05 14:11 ` Keith Busch
  2025-08-05 14:11 ` [PATCHv2 6/7] blk-integrity: use simpler alignment check Keith Busch
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 27+ messages in thread
From: Keith Busch @ 2025-08-05 14:11 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-kernel
  Cc: snitzer, axboe, dw, brauner, Keith Busch, Hannes Reinecke

From: Keith Busch <kbusch@kernel.org>

No more callers.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 include/linux/blkdev.h | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 95886b404b16b..f86aecbb87385 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1589,13 +1589,6 @@ static inline unsigned int bdev_dma_alignment(struct block_device *bdev)
 	return queue_dma_alignment(bdev_get_queue(bdev));
 }
 
-static inline bool bdev_iter_is_aligned(struct block_device *bdev,
-					struct iov_iter *iter)
-{
-	return iov_iter_is_aligned(iter, bdev_dma_alignment(bdev),
-				   bdev_logical_block_size(bdev) - 1);
-}
-
 static inline unsigned int
 blk_lim_dma_alignment_and_pad(struct queue_limits *lim)
 {
-- 
2.47.3



* [PATCHv2 6/7] blk-integrity: use simpler alignment check
  2025-08-05 14:11 [PATCHv2 0/7] direct-io: even more flexible io vectors Keith Busch
                   ` (4 preceding siblings ...)
  2025-08-05 14:11 ` [PATCHv2 5/7] block: remove bdev_iter_is_aligned Keith Busch
@ 2025-08-05 14:11 ` Keith Busch
  2025-08-05 14:11 ` [PATCHv2 7/7] iov_iter: remove iov_iter_is_aligned Keith Busch
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 27+ messages in thread
From: Keith Busch @ 2025-08-05 14:11 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-kernel
  Cc: snitzer, axboe, dw, brauner, Keith Busch, Hannes Reinecke

From: Keith Busch <kbusch@kernel.org>

We're checking lengths and addresses against the same alignment value,
so use the simpler iterator check.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 block/bio-integrity.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 6b077ca937f6b..6d069a49b4aad 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -262,7 +262,6 @@ static unsigned int bvec_from_pages(struct bio_vec *bvec, struct page **pages,
 int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter)
 {
 	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
-	unsigned int align = blk_lim_dma_alignment_and_pad(&q->limits);
 	struct page *stack_pages[UIO_FASTIOV], **pages = stack_pages;
 	struct bio_vec stack_vec[UIO_FASTIOV], *bvec = stack_vec;
 	size_t offset, bytes = iter->count;
@@ -285,7 +284,8 @@ int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter)
 		pages = NULL;
 	}
 
-	copy = !iov_iter_is_aligned(iter, align, align);
+	copy = iov_iter_alignment(iter) &
+			blk_lim_dma_alignment_and_pad(&q->limits);
 	ret = iov_iter_extract_pages(iter, &pages, bytes, nr_vecs, 0, &offset);
 	if (unlikely(ret < 0))
 		goto free_bvec;
-- 
2.47.3


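The reason one iov_iter_alignment()-style value can replace the two-mask iov_iter_is_aligned() call here can be sketched as follows. This is a simplification over plain iovecs (the real kernel helper handles ubuf, kvec, bvec, and other iterator types): accumulate the OR of every segment's address and length, and a single AND with the DMA mask is nonzero iff some segment violates it.

```c
#include <sys/uio.h>

/* OR together every segment's base address and length; any low bit set
 * in the result below the alignment mask means some segment is
 * misaligned in address or length. */
static unsigned long iter_alignment(const struct iovec *iov, int cnt)
{
	unsigned long res = 0;
	int i;

	for (i = 0; i < cnt; i++)
		res |= (unsigned long)iov[i].iov_base | iov[i].iov_len;
	return res;
}
/* usage: copy = iter_alignment(iov, cnt) & dma_mask;  nonzero => bounce */
```

This works here precisely because, as the commit message notes, the length and address masks are the same value; the removed two-mask helper was only needed when they differed.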

* [PATCHv2 7/7] iov_iter: remove iov_iter_is_aligned
  2025-08-05 14:11 [PATCHv2 0/7] direct-io: even more flexible io vectors Keith Busch
                   ` (5 preceding siblings ...)
  2025-08-05 14:11 ` [PATCHv2 6/7] blk-integrity: use simpler alignment check Keith Busch
@ 2025-08-05 14:11 ` Keith Busch
  2025-08-06  6:08   ` Hannes Reinecke
  2025-08-06  3:03 ` [PATCHv2 0/7] direct-io: even more flexible io vectors Martin K. Petersen
  2025-08-06 14:51 ` Christoph Hellwig
  8 siblings, 1 reply; 27+ messages in thread
From: Keith Busch @ 2025-08-05 14:11 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-kernel
  Cc: snitzer, axboe, dw, brauner, Keith Busch

From: Keith Busch <kbusch@kernel.org>

No more callers.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
---
 include/linux/uio.h |  2 -
 lib/iov_iter.c      | 95 ---------------------------------------------
 2 files changed, 97 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 2e86c653186c6..5b127043a1519 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -286,8 +286,6 @@ size_t _copy_mc_to_iter(const void *addr, size_t bytes, struct iov_iter *i);
 #endif
 
 size_t iov_iter_zero(size_t bytes, struct iov_iter *);
-bool iov_iter_is_aligned(const struct iov_iter *i, unsigned addr_mask,
-			unsigned len_mask);
 unsigned long iov_iter_alignment(const struct iov_iter *i);
 unsigned long iov_iter_gap_alignment(const struct iov_iter *i);
 void iov_iter_init(struct iov_iter *i, unsigned int direction, const struct iovec *iov,
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f9193f952f499..2fe66a6b8789e 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -784,101 +784,6 @@ void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count)
 }
 EXPORT_SYMBOL(iov_iter_discard);
 
-static bool iov_iter_aligned_iovec(const struct iov_iter *i, unsigned addr_mask,
-				   unsigned len_mask)
-{
-	const struct iovec *iov = iter_iov(i);
-	size_t size = i->count;
-	size_t skip = i->iov_offset;
-
-	do {
-		size_t len = iov->iov_len - skip;
-
-		if (len > size)
-			len = size;
-		if (len & len_mask)
-			return false;
-		if ((unsigned long)(iov->iov_base + skip) & addr_mask)
-			return false;
-
-		iov++;
-		size -= len;
-		skip = 0;
-	} while (size);
-
-	return true;
-}
-
-static bool iov_iter_aligned_bvec(const struct iov_iter *i, unsigned addr_mask,
-				  unsigned len_mask)
-{
-	const struct bio_vec *bvec = i->bvec;
-	unsigned skip = i->iov_offset;
-	size_t size = i->count;
-
-	do {
-		size_t len = bvec->bv_len - skip;
-
-		if (len > size)
-			len = size;
-		if (len & len_mask)
-			return false;
-		if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
-			return false;
-
-		bvec++;
-		size -= len;
-		skip = 0;
-	} while (size);
-
-	return true;
-}
-
-/**
- * iov_iter_is_aligned() - Check if the addresses and lengths of each segments
- * 	are aligned to the parameters.
- *
- * @i: &struct iov_iter to restore
- * @addr_mask: bit mask to check against the iov element's addresses
- * @len_mask: bit mask to check against the iov element's lengths
- *
- * Return: false if any addresses or lengths intersect with the provided masks
- */
-bool iov_iter_is_aligned(const struct iov_iter *i, unsigned addr_mask,
-			 unsigned len_mask)
-{
-	if (likely(iter_is_ubuf(i))) {
-		if (i->count & len_mask)
-			return false;
-		if ((unsigned long)(i->ubuf + i->iov_offset) & addr_mask)
-			return false;
-		return true;
-	}
-
-	if (likely(iter_is_iovec(i) || iov_iter_is_kvec(i)))
-		return iov_iter_aligned_iovec(i, addr_mask, len_mask);
-
-	if (iov_iter_is_bvec(i))
-		return iov_iter_aligned_bvec(i, addr_mask, len_mask);
-
-	/* With both xarray and folioq types, we're dealing with whole folios. */
-	if (iov_iter_is_xarray(i)) {
-		if (i->count & len_mask)
-			return false;
-		if ((i->xarray_start + i->iov_offset) & addr_mask)
-			return false;
-	}
-	if (iov_iter_is_folioq(i)) {
-		if (i->count & len_mask)
-			return false;
-		if (i->iov_offset & addr_mask)
-			return false;
-	}
-
-	return true;
-}
-EXPORT_SYMBOL_GPL(iov_iter_is_aligned);
-
 static unsigned long iov_iter_alignment_iovec(const struct iov_iter *i)
 {
 	const struct iovec *iov = iter_iov(i);
-- 
2.47.3



* Re: [PATCHv2 0/7] direct-io: even more flexible io vectors
  2025-08-05 14:11 [PATCHv2 0/7] direct-io: even more flexible io vectors Keith Busch
                   ` (6 preceding siblings ...)
  2025-08-05 14:11 ` [PATCHv2 7/7] iov_iter: remove iov_iter_is_aligned Keith Busch
@ 2025-08-06  3:03 ` Martin K. Petersen
  2025-08-06 14:51 ` Christoph Hellwig
  8 siblings, 0 replies; 27+ messages in thread
From: Martin K. Petersen @ 2025-08-06  3:03 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-fsdevel, linux-kernel, snitzer, axboe, dw,
	brauner, Keith Busch


Keith,

> This series removes the direct io requirement that io vector lengths
> align to the logical block size.

Looks OK to me.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

-- 
Martin K. Petersen


* Re: [PATCHv2 2/7] block: align the bio after building it
  2025-08-05 14:11 ` [PATCHv2 2/7] block: align the bio after building it Keith Busch
@ 2025-08-06  6:07   ` Hannes Reinecke
  2025-08-10 14:43   ` Christoph Hellwig
  1 sibling, 0 replies; 27+ messages in thread
From: Hannes Reinecke @ 2025-08-06  6:07 UTC (permalink / raw)
  To: Keith Busch, linux-block, linux-fsdevel, linux-kernel
  Cc: snitzer, axboe, dw, brauner, Keith Busch

On 8/5/25 16:11, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> Instead of ensuring each vector is block size aligned while constructing
> the bio, just ensure the entire size is aligned after it's built. This
> makes it more flexible to accepting device valid io vectors that would
> otherwise get rejected by alignment checks.
> 
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
>   block/bio.c | 60 +++++++++++++++++++++++++++++++++++------------------
>   1 file changed, 40 insertions(+), 20 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


* Re: [PATCHv2 4/7] iomap: simplify direct io validity check
  2025-08-05 14:11 ` [PATCHv2 4/7] iomap: " Keith Busch
@ 2025-08-06  6:07   ` Hannes Reinecke
  0 siblings, 0 replies; 27+ messages in thread
From: Hannes Reinecke @ 2025-08-06  6:07 UTC (permalink / raw)
  To: Keith Busch, linux-block, linux-fsdevel, linux-kernel
  Cc: snitzer, axboe, dw, brauner, Keith Busch

On 8/5/25 16:11, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> The block layer checks all the segments for validity later, so no need
> for an early check. Just reduce it to a simple position and total length
> and defer the segment checks to the block layer.
> 
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
>   fs/iomap/direct-io.c | 3 +--
>   1 file changed, 1 insertion(+), 2 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


* Re: [PATCHv2 7/7] iov_iter: remove iov_iter_is_aligned
  2025-08-05 14:11 ` [PATCHv2 7/7] iov_iter: remove iov_iter_is_aligned Keith Busch
@ 2025-08-06  6:08   ` Hannes Reinecke
  0 siblings, 0 replies; 27+ messages in thread
From: Hannes Reinecke @ 2025-08-06  6:08 UTC (permalink / raw)
  To: Keith Busch, linux-block, linux-fsdevel, linux-kernel
  Cc: snitzer, axboe, dw, brauner, Keith Busch

On 8/5/25 16:11, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> No more callers.
> 
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> Reviewed-by: Mike Snitzer <snitzer@kernel.org>
> ---
>   include/linux/uio.h |  2 -
>   lib/iov_iter.c      | 95 ---------------------------------------------
>   2 files changed, 97 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


* Re: [PATCHv2 0/7] direct-io: even more flexible io vectors
  2025-08-05 14:11 [PATCHv2 0/7] direct-io: even more flexible io vectors Keith Busch
                   ` (7 preceding siblings ...)
  2025-08-06  3:03 ` [PATCHv2 0/7] direct-io: even more flexible io vectors Martin K. Petersen
@ 2025-08-06 14:51 ` Christoph Hellwig
  2025-08-07 23:20   ` Keith Busch
  8 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2025-08-06 14:51 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-fsdevel, linux-kernel, snitzer, axboe, dw,
	brauner, Keith Busch

Just jumping in while travelling with almost no time available:  I like
the idea, but you have to be really, really careful for writes through
file systems with this.  Deferred errors happening after bio submission
are nasty, and we tend to get things wrong there too often.

So this needs to come with an xfstest that actually hits the alignment
errors, including a case where the failing bio is not the first bio
submitted.

Even with that it will change behavior quite a bit, e.g. for file
systems that can do direct I/O write without the exclusive i_rwsem will
leak newly allocated space on errors asynchronously reported through bio
completions, so you'll need to get fs maintainer buy in for that.

Given all that, I'm not sure converting the legacy fs/direct-io.c path
is such a good idea, as the only file systems left using it are those
that are more or less unmaintained (or at least severely undermaintained).


* Re: [PATCHv2 0/7] direct-io: even more flexible io vectors
  2025-08-06 14:51 ` Christoph Hellwig
@ 2025-08-07 23:20   ` Keith Busch
  2025-08-11 10:32     ` Christoph Hellwig
  0 siblings, 1 reply; 27+ messages in thread
From: Keith Busch @ 2025-08-07 23:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, linux-block, linux-fsdevel, linux-kernel, snitzer,
	axboe, dw, brauner

On Wed, Aug 06, 2025 at 07:51:34AM -0700, Christoph Hellwig wrote:
> So this needs to come with an xfstests to actually hit the alignment
> errors, including something that is not the first bio submitted.

Sure. I wrote up some for blktests, and the same test works as-is for
filesystems too. Open question: where do such programs go (xfstests,
blktests, both, or some common place)?

Anyway, the following are diffs for xfstests, then blktests; last is the
test file itself, which works for both.

I tested on loop, nvme, and virtio-blk, both raw block (blktests) and
xfs (fstests). Seems fine.

It should fail on current Linus upstream; it needs my patches to
succeed, because only with them can direct io accept these vector
lengths. So far, user space doesn't have a way to know whether that is
supported.

These tests caught a bug in "PATCH 2/7" from my series, specifically
that it needs to check bv_len in addition to bv_offset against the
dma_alignment. I have a fix ready for the next version.


For 'xfstests':


---
diff --git a/.gitignore b/.gitignore
index 58dc2a63..0f5f57cc 100644
--- a/.gitignore
+++ b/.gitignore
@@ -77,6 +77,7 @@ tags
 /src/dio-buf-fault
 /src/dio-interleaved
 /src/dio-invalidate-cache
+/src/dio-offsets
 /src/dio-write-fsync-same-fd
 /src/dirhash_collide
 /src/dirperf
diff --git a/src/Makefile b/src/Makefile
index 2cc1fb40..49a7c0c7 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -21,7 +21,7 @@ TARGETS = dirstress fill fill2 getpagesize holes lstat64 \
 	t_mmap_writev_overlap checkpoint_journal mmap-rw-fault allocstale \
 	t_mmap_cow_memory_failure fake-dump-rootino dio-buf-fault rewinddir-test \
 	readdir-while-renames dio-append-buf-fault dio-write-fsync-same-fd \
-	dio-writeback-race
+	dio-writeback-race dio-offsets
 
 LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
 	preallo_rw_pattern_writer ftrunc trunc fs_perms testx looptest \
diff --git a/tests/generic/771 b/tests/generic/771
new file mode 100755
index 00000000..3100a4b8
--- /dev/null
+++ b/tests/generic/771
@@ -0,0 +1,39 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2025 Keith Busch.  All Rights Reserved.
+#
+# FS QA Test 771
+#
+# Test direct IO boundaries
+#
+
+. ./common/preamble
+. ./common/rc
+_begin_fstest auto quick
+
+_require_scratch
+_require_odirect
+_require_test
+_require_test_program dio-offsets
+
+# Modify as appropriate.
+
+_scratch_mkfs > $seqres.full 2>&1
+_scratch_mount
+
+sys_max_segments=$(cat "/sys/block/$(_short_dev $SCRATCH_DEV)/queue/max_segments")
+sys_dma_alignment=$(cat "/sys/block/$(_short_dev $SCRATCH_DEV)/queue/dma_alignment")
+sys_virt_boundary_mask=$(cat "/sys/block/$(_short_dev $SCRATCH_DEV)/queue/virt_boundary_mask")
+sys_logical_block_size=$(cat "/sys/block/$(_short_dev $SCRATCH_DEV)/queue/logical_block_size")
+sys_max_sectors_kb=$(cat "/sys/block/$(_short_dev $SCRATCH_DEV)/queue/max_sectors_kb")
+
+echo "max_segments=$sys_max_segments dma_alignment=$sys_dma_alignment virt_boundary_mask=$sys_virt_boundary_mask logical_block_size=$sys_logical_block_size max_sectors_kb=$sys_max_sectors_kb" >> $seqres.full
+
+$here/src/dio-offsets $SCRATCH_MNT/foobar $sys_max_segments $sys_max_sectors_kb $sys_dma_alignment $sys_virt_boundary_mask $sys_logical_block_size
+
+cat $SCRATCH_MNT/foobar > /dev/null
+
+echo "Silence is golden"
+# success, all done
+status=0
+exit
diff --git a/tests/generic/771.out b/tests/generic/771.out
new file mode 100644
index 00000000..c2345c7b
--- /dev/null
+++ b/tests/generic/771.out
@@ -0,0 +1,2 @@
+QA output created by 771
+Silence is golden
--


For 'blktests'


---
diff --git a/src/.gitignore b/src/.gitignore
index 399a046..eb34474 100644
--- a/src/.gitignore
+++ b/src/.gitignore
@@ -1,3 +1,4 @@
+/dio-offsets
 /discontiguous-io
 /loblksize
 /loop_change_fd
diff --git a/src/Makefile b/src/Makefile
index f91ac62..7a20e46 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -9,6 +9,7 @@ HAVE_C_MACRO = $(shell if echo "$(H)include <$(1)>" |	\
 		then echo 1;else echo 0; fi)
 
 C_TARGETS := \
+	dio-offsets \
 	loblksize \
 	loop_change_fd \
 	loop_get_status_null \
diff --git a/tests/block/041 b/tests/block/041
new file mode 100755
index 0000000..714e1bb
--- /dev/null
+++ b/tests/block/041
@@ -0,0 +1,22 @@
+#!/bin/bash
+
+. tests/block/rc
+
+DESCRIPTION="Test unusual direct-io offsets"
+QUICK=1
+
+test_device() {
+	echo "Running ${TEST_NAME}"
+
+	sys_max_segments=$(cat "${TEST_DEV_SYSFS}/queue/max_segments")
+	sys_dma_alignment=$(cat "${TEST_DEV_SYSFS}/queue/dma_alignment")
+	sys_virt_boundary_mask=$(cat "${TEST_DEV_SYSFS}/queue/virt_boundary_mask")
+	sys_logical_block_size=$(cat "${TEST_DEV_SYSFS}/queue/logical_block_size")
+	sys_max_sectors_kb=$(cat "${TEST_DEV_SYSFS}/queue/max_sectors_kb")
+
+	if ! src/dio-offsets ${TEST_DEV} $sys_max_segments $sys_max_sectors_kb $sys_dma_alignment $sys_virt_boundary_mask $sys_logical_block_size ; then
+		echo "src/dio-offsets failed"
+	fi
+
+	echo "Test complete"
+}
diff --git a/tests/block/041.out b/tests/block/041.out
new file mode 100644
index 0000000..6706a76
--- /dev/null
+++ b/tests/block/041.out
@@ -0,0 +1,2 @@
+Running block/041
+Test complete
--


And regardless of which you're running, drop this file, 'dio-offsets.c'
into the src/ directory:


---
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif

#include <sys/uio.h>

#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <stdbool.h>

#define power_of_2(x) ((x) && !((x) & ((x) - 1)))

static unsigned long logical_block_size;
static unsigned long dma_alignment;
static unsigned long virt_boundary;
static unsigned long max_segments;
static unsigned long max_bytes;
static size_t buf_size;
static long pagesize;
static void *out_buf;
static void *in_buf;
static int test_fd;

static void init_args(char **argv)
{
	/* O_CREAT requires a mode argument */
	test_fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC | O_DIRECT, 0644);
	if (test_fd < 0)
		err(errno, "%s: failed to open %s", __func__, argv[1]);

	max_segments = strtoul(argv[2], NULL, 0);
	max_bytes = strtoul(argv[3], NULL, 0) * 1024;
	dma_alignment = strtoul(argv[4], NULL, 0) + 1;
	virt_boundary = strtoul(argv[5], NULL, 0) + 1;
	logical_block_size = strtoul(argv[6], NULL, 0);

	if (!power_of_2(virt_boundary) ||
	    !power_of_2(dma_alignment) ||
	    !power_of_2(logical_block_size)) {
		errno = EINVAL;
		err(1, "%s: bad parameters", __func__);
	}

	if (virt_boundary > 1 && virt_boundary < logical_block_size) {
		errno = EINVAL;
		err(1, "%s: virt_boundary:%lu logical_block_size:%lu", __func__,
			virt_boundary, logical_block_size);
	}

	if (dma_alignment > logical_block_size) {
		errno = EINVAL;
		err(1, "%s: dma_alignment:%lu logical_block_size:%lu", __func__,
			dma_alignment, logical_block_size);
	}


	if (max_segments > 4096)
		max_segments = 4096;
	if (max_bytes > 16384 * 1024)
		max_bytes = 16384 * 1024;
	if (max_bytes & (logical_block_size - 1))
		max_bytes -= max_bytes & (logical_block_size - 1);
	pagesize = sysconf(_SC_PAGE_SIZE);
}

static void init_buffers()
{
	unsigned long lb_mask = logical_block_size - 1;
	int fd, ret;

	buf_size = max_bytes * max_segments * 2;
	if (buf_size < logical_block_size * max_segments) {
		errno = EINVAL;
		err(1, "%s: logical block size is too big", __func__);
	}

	if (buf_size < logical_block_size * 1024 * 4)
		buf_size = logical_block_size * 1024 * 4;

	if (buf_size & lb_mask)
		buf_size = (buf_size + lb_mask) & ~(lb_mask);

	ret = posix_memalign((void **)&in_buf, pagesize, buf_size);
	if (ret)
		err(1, "%s: failed to allocate in-buf", __func__);

	ret = posix_memalign((void **)&out_buf, pagesize, buf_size);
	if (ret)
		err(1, "%s: failed to allocate out-buf", __func__);

	fd = open("/dev/urandom", O_RDONLY);
	if (fd < 0)
		err(1, "%s: failed to open urandom", __func__);

	ret = read(fd, out_buf, buf_size);
	if (ret < 0)
		err(1, "%s: failed to read from urandom", __func__);

	close(fd);
}

/*
 * Test using page aligned buffers, single source
 *
 * Total size is aligned to a logical block size and exceeds the max transfer
 * size as well as the max segments. This should test the kernel's split bio
 * construction and bio splitting for exceeding these limits.
 */
static void test_1()
{
	int ret;

	memset(in_buf, 0, buf_size);
	ret = pwrite(test_fd, out_buf, buf_size, 0);
	if (ret < 0)
		err(1, "%s: failed to write buf", __func__);

	ret = pread(test_fd, in_buf, buf_size, 0);
	if (ret < 0)
		err(1, "%s: failed to read buf", __func__);

	if (memcmp(out_buf, in_buf, buf_size)) {
		errno = EIO;
		err(1, "%s: data corruption", __func__);
	}
}

/*
 * Test using dma aligned buffers, single source
 *
 * This tests the kernel's dio memory alignment
 */
static void test_2()
{
	int ret;

	memset(in_buf, 0, buf_size);
	ret = pwrite(test_fd, out_buf + dma_alignment, max_bytes, 0);
	if (ret < 0)
		err(1, "%s: failed to write buf", __func__);

	ret = pread(test_fd, in_buf + dma_alignment, max_bytes, 0);
	if (ret < 0)
		err(1, "%s: failed to read buf", __func__);

	if (memcmp(out_buf + dma_alignment, in_buf + dma_alignment, max_bytes)) {
		errno = EIO;
		err(1, "%s: data corruption", __func__);
	}
}

/*
 * Test using page aligned buffers + logical block sized vectored source
 *
 * This tests discontiguous vectored sources
 */
static void test_3()
{
	const int vecs = 4;

	int i, ret, offset;
	struct iovec iov[vecs];

	memset(in_buf, 0, buf_size);
	for (i = 0; i < vecs; i++) {
		offset = logical_block_size * i * 4;
		iov[i].iov_base = out_buf + offset;
		iov[i].iov_len = logical_block_size * 2;
	}

	ret = pwritev(test_fd, iov, vecs, 0);
	if (ret < 0)
		err(1, "%s: failed to write buf", __func__);

	for (i = 0; i < vecs; i++) {
		offset = logical_block_size * i * 4;
		iov[i].iov_base = in_buf + offset;
		iov[i].iov_len = logical_block_size * 2;
	}

	ret = preadv(test_fd, iov, vecs, 0);
	if (ret < 0)
		err(1, "%s: failed to read buf", __func__);

	for (i = 0; i < vecs; i++) {
		offset = logical_block_size * i * 4;
		if (memcmp(in_buf + offset, out_buf + offset, logical_block_size * 2)) {
			errno = EIO;
			err(1, "%s: data corruption", __func__);
		}
	}
}

/*
 * Test using dma aligned buffers, vectored source
 *
 * This tests discontiguous vectored sources with incrementing dma aligned
 * offsets
 */
static void test_4()
{
	const int vecs = 4;

	int i, ret, offset;
	struct iovec iov[vecs];

	memset(in_buf, 0, buf_size);
	for (i = 0; i < vecs; i++) {
		offset = logical_block_size * i * 8 + dma_alignment * (i + 1);
		iov[i].iov_base = out_buf + offset;
		iov[i].iov_len = logical_block_size * 2;
	}

	ret = pwritev(test_fd, iov, vecs, 0);
	if (ret < 0)
		err(1, "%s: failed to write buf", __func__);

	for (i = 0; i < vecs; i++) {
		offset = logical_block_size * i * 8 + dma_alignment * (i + 1);
		iov[i].iov_base = in_buf + offset;
		iov[i].iov_len = logical_block_size * 2;
	}

	ret = preadv(test_fd, iov, vecs, 0);
	if (ret < 0)
		err(1, "%s: failed to read buf", __func__);

	for (i = 0; i < vecs; i++) {
		offset = logical_block_size * i * 8 + dma_alignment * (i + 1);
		if (memcmp(in_buf + offset, out_buf + offset, logical_block_size * 2)) {
			errno = EIO;
			err(1, "%s: data corruption", __func__);
		}
	}
}

/*
 * Test vectored read with a total size aligned to a block, but individual
 * vectors will not be; however, all the middle vectors start and end on page
 * boundaries which should satisfy any virt_boundary condition.
 */
static void test_5()
{
	const int vecs = 4;

	int i, ret, offset, mult;
	struct iovec iov[vecs];

	i = 0;
	memset(in_buf, 0, buf_size);
	mult = pagesize / logical_block_size;
	if (mult < 2)
		mult = 2;

	offset = pagesize - (logical_block_size / 4);
	if (offset & (dma_alignment - 1))
		offset = pagesize - dma_alignment;

	iov[i].iov_base = out_buf + offset;
	iov[i].iov_len = pagesize - offset;

	for (i = 1; i < vecs - 1; i++) {
		offset = logical_block_size * i * 8 * mult;
		iov[i].iov_base = out_buf + offset;
		iov[i].iov_len = logical_block_size * mult;
	}

	offset = logical_block_size * i * 8 * mult;
	iov[i].iov_base = out_buf + offset;
	iov[i].iov_len = logical_block_size * mult - iov[0].iov_len;

	ret = pwritev(test_fd, iov, vecs, 0);
	if (ret < 0)
		err(1, "%s: failed to write buf len:%zu", __func__,
			iov[0].iov_len + iov[1].iov_len + iov[2].iov_len + iov[3].iov_len);

	i = 0;
	offset = pagesize - (logical_block_size / 4);
	if (offset & (dma_alignment - 1))
		offset = pagesize - dma_alignment;

	iov[i].iov_base = in_buf + offset;
	iov[i].iov_len = pagesize - offset;

	for (i = 1; i < vecs - 1; i++) {
		offset = logical_block_size * i * 8 * mult;
		iov[i].iov_base = in_buf + offset;
		iov[i].iov_len = logical_block_size * mult;
	}

	offset = logical_block_size * i * 8 * mult;
	iov[i].iov_base = in_buf + offset;
	iov[i].iov_len = logical_block_size * mult - iov[0].iov_len;

	ret = preadv(test_fd, iov, vecs, 0);
	if (ret < 0)
		err(1, "%s: failed to read buf len:%zu", __func__,
			iov[0].iov_len + iov[1].iov_len + iov[2].iov_len + iov[3].iov_len);

	i = 0;
	offset = pagesize - (logical_block_size / 4);
	if (offset & (dma_alignment - 1))
		offset = pagesize - dma_alignment;

	if (memcmp(in_buf + offset, out_buf + offset, iov[i].iov_len)) {
		errno = EIO;
		err(1, "%s: data corruption", __func__);
	}
	for (i = 1; i < vecs - 1; i++) {
		offset = logical_block_size * i * 8 * mult;
		if (memcmp(in_buf + offset, out_buf + offset, iov[i].iov_len)) {
			errno = EIO;
			err(1, "%s: data corruption", __func__);
		}
	}
	offset = logical_block_size * i * 8 * mult;
	if (memcmp(in_buf + offset, out_buf + offset, iov[i].iov_len)) {
		errno = EIO;
		err(1, "%s: data corruption", __func__);
	}
}

/*
 * Total size is a logical block size multiple, but none of the vectors are.
 * Total vectors will be less than the max. The vectors will be dma aligned. If
 * a virtual boundary exists, this should fail, otherwise it should succeed.
 */
static void test_6()
{
	const int vecs = 4;

	int i, ret, offset;
	struct iovec iov[vecs];
	bool should_fail = virt_boundary > 1;

	memset(in_buf, 0, buf_size);
	for (i = 0; i < vecs; i++) {
		offset = logical_block_size * i * 8;
		iov[i].iov_base = out_buf + offset;
		iov[i].iov_len = logical_block_size / 2;
	}

	ret = pwritev(test_fd, iov, vecs, 0);
	if (ret < 0) {
		if (should_fail)
			return;
		err(1, "%s: failed to write buf", __func__);
	} else if (should_fail) {
		errno = ENOTSUP;
		err(1, "%s: write buf unexpectedly succeeded(?)", __func__);
	}

	for (i = 0; i < vecs; i++) {
		offset = logical_block_size * i * 8;
		iov[i].iov_base = in_buf + offset;
		iov[i].iov_len = logical_block_size / 2;
	}

	ret = preadv(test_fd, iov, vecs, 0);
	if (ret < 0)
		err(1, "%s: failed to read buf", __func__);

	for (i = 0; i < vecs; i++) {
		offset = logical_block_size * i * 8;
		if (memcmp(in_buf + offset, out_buf + offset, logical_block_size / 2)) {
			errno = EIO;
			err(1, "%s: data corruption", __func__);
		}
	}
}

/*
 * Provide an invalid iov_base at the beginning to test the kernel catching it
 * while building a bio.
 */
static void test_7()
{
	const int vecs = 4;

	int i, ret, offset;
	struct iovec iov[vecs];

	i = 0;
	iov[i].iov_base = 0;
	iov[i].iov_len = logical_block_size;

	for (i = 1; i < vecs; i++) {
		offset = logical_block_size * i * 8;
		iov[i].iov_base = out_buf + offset;
		iov[i].iov_len = logical_block_size;
	}

	ret = pwritev(test_fd, iov, vecs, 0);
	if (ret < 0)
		return;

	errno = ENOTSUP;
	err(1, "%s: write buf unexpectedly succeeded with NULL address(?)", __func__);
}

/*
 * Provide an invalid iov_base in the middle to test the kernel catching it
 * while building split bios. Ensure it is split by sending enough vectors to
 * exceed the bio's BIO_MAX_VECS; this should cause part of the io to dispatch.
 */
static void test_8()
{
	const int vecs = 1024;

	int i, ret, offset;
	struct iovec iov[vecs];

	for (i = 0; i < vecs / 2 + 1; i++) {
		offset = logical_block_size * i * 2;
		iov[i].iov_base = out_buf + offset;
		iov[i].iov_len = logical_block_size;
	}

	offset = logical_block_size * i * 2;
	iov[i].iov_base = 0;
	iov[i].iov_len = logical_block_size;

	for (++i; i < vecs; i++) {
		offset = logical_block_size * i * 2;
		iov[i].iov_base = out_buf + offset;
		iov[i].iov_len = logical_block_size;
	}

	ret = pwritev(test_fd, iov, vecs, 0);
	if (ret < 0)
		return;

	errno = ENOTSUP;
	err(1, "%s: write buf unexpectedly succeeded with NULL address(?)", __func__);
}

/*
 * Test with an invalid DMA address. Should get caught early when splitting. If
 * the device supports byte aligned memory (which is unusual), then this should
 * be successful.
 */
static void test_9()
{
	int ret, offset;
	size_t size;
	bool should_fail = dma_alignment > 1;

	memset(in_buf, 0, buf_size);
	offset = 2 * dma_alignment - 1;
	size = logical_block_size * 256;
	ret = pwrite(test_fd, out_buf + offset, size, 0);
	if (ret < 0) {
		if (should_fail)
			return;
		err(1, "%s: failed to write buf", __func__);
	} else if (should_fail) {
		errno = ENOTSUP;
		err(1, "%s: write buf unexpectedly succeeded with invalid DMA offset address(?)",
			__func__);
	}

	ret = pread(test_fd, in_buf + offset, size, 0);
	if (ret < 0)
		err(1, "%s: failed to read buf", __func__);

	if (memcmp(out_buf + offset, in_buf + offset, size)) {
		errno = EIO;
		err(1, "%s: data corruption", __func__);
	}
}

/*
 * Test with invalid DMA alignment in the middle. This should get split with
 * the first part being dispatched, and the 2nd one failing without dispatch.
 */
static void test_10()
{
	const int vecs = 5;

	bool should_fail = dma_alignment > 1;
	struct iovec iov[vecs];
	int ret, offset;

	offset = dma_alignment * 2 - 1;
	memset(in_buf, 0, buf_size);

	iov[0].iov_base = out_buf;
	iov[0].iov_len = max_bytes;

	iov[1].iov_base = out_buf + max_bytes * 2;
	iov[1].iov_len = max_bytes;

	iov[2].iov_base = out_buf + max_bytes * 4 + offset;
	iov[2].iov_len = max_bytes;

	iov[3].iov_base = out_buf + max_bytes * 6;
	iov[3].iov_len = max_bytes;

	iov[4].iov_base = out_buf + max_bytes * 8;
	iov[4].iov_len = max_bytes;

	ret = pwritev(test_fd, iov, vecs, 0);
	if (ret < 0) {
		if (should_fail)
			return;
		err(1, "%s: failed to write buf", __func__);
	} else if (should_fail) {
		errno = ENOTSUP;
		err(1, "%s: write buf unexpectedly succeeded with invalid DMA offset address(?)", __func__);
	}

	iov[0].iov_base = in_buf;
	iov[0].iov_len = max_bytes;

	iov[1].iov_base = in_buf + max_bytes * 2;
	iov[1].iov_len = max_bytes;

	iov[2].iov_base = in_buf + max_bytes * 4 + offset;
	iov[2].iov_len = max_bytes;

	iov[3].iov_base = in_buf + max_bytes * 6;
	iov[3].iov_len = max_bytes;

	iov[4].iov_base = in_buf + max_bytes * 8;
	iov[4].iov_len = max_bytes;

	ret = preadv(test_fd, iov, vecs, 0);
	if (ret < 0)
		err(1, "%s: failed to read buf", __func__);

	if (memcmp(out_buf, in_buf, max_bytes) ||
	    memcmp(out_buf + max_bytes * 2, in_buf + max_bytes * 2, max_bytes) ||
	    memcmp(out_buf + max_bytes * 4 + offset, in_buf + max_bytes * 4 + offset, max_bytes) ||
	    memcmp(out_buf + max_bytes * 6, in_buf + max_bytes * 6, max_bytes) ||
	    memcmp(out_buf + max_bytes * 8, in_buf + max_bytes * 8, max_bytes)) {
		errno = EIO;
		err(1, "%s: data corruption", __func__);
	}
}

/*
 * Test a bunch of small vectors if the device dma alignment allows it. We'll
 * try to force a MAX_IOV split that can't form a valid IO so expect a failure.
 */
static void test_11()
{
	const int vecs = 320;

	int ret, i, offset, iovpb, iov_size;
	bool should_fail = true;
	struct iovec iov[vecs];

	memset(in_buf, 0, buf_size);
	iovpb = logical_block_size / dma_alignment;
	iov_size = logical_block_size / iovpb;

	if ((pagesize / iov_size) < 256 &&
	    iov_size >= virt_boundary)
		should_fail = false;

	for (i = 0; i < vecs; i++) {
		offset = i * iov_size * 2;
		iov[i].iov_base = out_buf + offset;
		iov[i].iov_len = iov_size;
	}

	ret = pwritev(test_fd, iov, vecs, 0);
	if (ret < 0) {
		if (should_fail)
			return;
		err(1, "%s: failed to write buf", __func__);
	} else if (should_fail) {
		errno = ENOTSUP;
		err(1, "%s: write buf unexpectedly succeeded(?)", __func__);
	}

	for (i = 0; i < vecs; i++) {
		offset = i * iov_size * 2;
		iov[i].iov_base = in_buf + offset;
		iov[i].iov_len = iov_size;
	}

	ret = preadv(test_fd, iov, vecs, 0);
	if (ret < 0)
		err(1, "%s: failed to read buf", __func__);

	for (i = 0; i < vecs; i++) {
		offset = i * iov_size * 2;
		if (memcmp(in_buf + offset, out_buf + offset, iov_size)) {
			errno = EIO;
			err(1, "%s: data corruption", __func__);
		}
	}
}

/*
 * Start with a valid vector that can be split into a dispatched IO, but poison
 * the rest with an invalid DMA offset testing the kernel's late catch.
 */
static void test_12()
{
	const int vecs = 4;

	struct iovec iov[vecs];
	int i, ret;

	i = 0;
	iov[i].iov_base = out_buf;
	iov[i].iov_len = max_bytes - logical_block_size;

	i++;
	iov[i].iov_base = out_buf + max_bytes + logical_block_size;
	iov[i].iov_len = logical_block_size;

	i++;
	iov[i].iov_base = iov[1].iov_base + pagesize * 2 + (dma_alignment - 1);
	iov[i].iov_len = logical_block_size;

	i++;
	iov[i].iov_base = out_buf + max_bytes * 8;
	iov[i].iov_len = logical_block_size;

	ret = pwritev(test_fd, iov, vecs, 0);
	if (ret < 0)
		return;

	errno = ENOTSUP;
	err(1, "%s: write buf unexpectedly succeeded with invalid DMA offset address(?)", __func__);
}

/*
 * Total size is block aligned, addresses are dma aligned, but individual vector
 * sizes may not be dma aligned. If device has byte sized dma alignment, this
 * should succeed. If not, part of this should get dispatched, and the other
 * part should fail.
 */
static void test_13()
{
	const int vecs = 4;

	bool should_fail = dma_alignment > 1;
	struct iovec iov[vecs];
	int ret;

	iov[0].iov_base = out_buf;
	iov[0].iov_len = max_bytes * 2 - max_bytes / 2;

	iov[1].iov_base = out_buf + max_bytes * 4;
	iov[1].iov_len = logical_block_size * 2 - (dma_alignment + 1);

	iov[2].iov_base = out_buf + max_bytes * 8;
	iov[2].iov_len = logical_block_size * 2 + (dma_alignment + 1);

	iov[3].iov_base = out_buf + max_bytes * 12;
	iov[3].iov_len = max_bytes - max_bytes / 2;

	ret = pwritev(test_fd, iov, vecs, 0);
	if (ret < 0) {
		if (should_fail)
			return;
		err(1, "%s: failed to write buf", __func__);
	} else if (should_fail) {
		errno = ENOTSUP;
		err(1, "%s: write buf unexpectedly succeeded with invalid DMA offset address(?)", __func__);
	}

	iov[0].iov_base = in_buf;
	iov[0].iov_len = max_bytes * 2 - max_bytes / 2;

	iov[1].iov_base = in_buf + max_bytes * 4;
	iov[1].iov_len = logical_block_size * 2 - (dma_alignment + 1);

	iov[2].iov_base = in_buf + max_bytes * 8;
	iov[2].iov_len = logical_block_size * 2 + (dma_alignment + 1);

	iov[3].iov_base = in_buf + max_bytes * 12;
	iov[3].iov_len = max_bytes - max_bytes / 2;

	ret = preadv(test_fd, iov, vecs, 0);
	if (ret < 0)
		err(1, "%s: failed to read buf", __func__);

	if (memcmp(out_buf, in_buf, iov[0].iov_len) ||
	    memcmp(out_buf + max_bytes * 4, in_buf + max_bytes * 4, iov[1].iov_len) ||
	    memcmp(out_buf + max_bytes * 8, in_buf + max_bytes * 8, iov[2].iov_len) ||
	    memcmp(out_buf + max_bytes * 12, in_buf + max_bytes * 12, iov[3].iov_len)) {
		errno = EIO;
		err(1, "%s: data corruption", __func__);
	}
}

static void run_tests()
{
	test_1();
	test_2();
	test_3();
	test_4();
	test_5();
	test_6();
	test_7();
	test_8();
	test_9();
	test_10();
	test_11();
	test_12();
	test_13();
}

/* ./$prog-name file max_segments max_sectors_kb dma_alignment virt_boundary logical_block_size */
int main(int argc, char **argv)
{
	if (argc < 7)
		errx(1, "expect arguments: file max_segments max_sectors_kb dma_alignment virt_boundary logical_block_size");

	init_args(argv);
	init_buffers();
	run_tests();

	close(test_fd);
	free(out_buf);
	free(in_buf);

	return 0;
}
--

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCHv2 1/7] block: check for valid bio while splitting
  2025-08-05 14:11 ` [PATCHv2 1/7] block: check for valid bio while splitting Keith Busch
@ 2025-08-10 14:37   ` Christoph Hellwig
  2025-08-10 15:39     ` Keith Busch
  2025-08-13 20:06   ` Keith Busch
  1 sibling, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2025-08-10 14:37 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-fsdevel, linux-kernel, snitzer, axboe, dw,
	brauner, Keith Busch, Hannes Reinecke

On Tue, Aug 05, 2025 at 07:11:17AM -0700, Keith Busch wrote:
> @@ -341,6 +344,8 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
>  	 * we do not use the full hardware limits.
>  	 */
>  	bytes = ALIGN_DOWN(bytes, bio_split_alignment(bio, lim));
> +	if (!bytes)
> +		return -EINVAL;

How is this related to the other hunk and the patch description?


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv2 2/7] block: align the bio after building it
  2025-08-05 14:11 ` [PATCHv2 2/7] block: align the bio after building it Keith Busch
  2025-08-06  6:07   ` Hannes Reinecke
@ 2025-08-10 14:43   ` Christoph Hellwig
  1 sibling, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2025-08-10 14:43 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-fsdevel, linux-kernel, snitzer, axboe, dw,
	brauner, Keith Busch

On Tue, Aug 05, 2025 at 07:11:18AM -0700, Keith Busch wrote:
> +static inline void bio_revert(struct bio *bio, unsigned int nbytes)

Can you add a little comment explaining what this code does?  The name
suggests it is similar to iov_iter_revert, but I'm not sure how similar
it is intended to be.  The direct poking into bi_io_vec suggests it
can only be used by the I/O submitter, and the use of bio_release_page
suggests it is closely tied to bio_iov_iter_get_pages despite the
very generic name.

> +static int bio_align_to_lbs(struct bio *bio, struct iov_iter *iter)
> +{
> +	struct block_device *bdev = bio->bi_bdev;
> +	size_t nbytes;
> +
> +	if (!bdev)
> +		return 0;

So this is something horrible Kent put in where he failed to deal with
review feedback and just fed his stuff to Linus, and I think we need
to fix it when touching this code again.  Assuming we want to support
bio_iov_iter_get_pages on bios without bi_bdev set we simply can't
rely on looking at bdev_logical_block_size here, but instead need to
pass it explicitly.  Which honestly doesn't sound too bad, just add an
explicit argument for the required alignment to bio_iov_iter_get_pages
instead of trying to derive it.  Which is actually going to be useful
to reduce duplicate checks for file systems that require > LBA size
alignment as well.

> -	return bio->bi_vcnt ? 0 : ret;
> +	if (bio->bi_vcnt)
> +		return bio_align_to_lbs(bio, iter);
> +	return ret;

Nit, this would flow a little more easily by moving the error /
exceptional case into the branch, i.e.

	if (!bio->bi_vcnt)
		return ret;
	return bio_align_to_lbs(bio, iter);

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv2 1/7] block: check for valid bio while splitting
  2025-08-10 14:37   ` Christoph Hellwig
@ 2025-08-10 15:39     ` Keith Busch
  2025-08-10 17:14       ` Christoph Hellwig
  0 siblings, 1 reply; 27+ messages in thread
From: Keith Busch @ 2025-08-10 15:39 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, linux-block, linux-fsdevel, linux-kernel, snitzer,
	axboe, dw, brauner, Hannes Reinecke

On Sun, Aug 10, 2025 at 07:37:36AM -0700, Christoph Hellwig wrote:
> On Tue, Aug 05, 2025 at 07:11:17AM -0700, Keith Busch wrote:
> > @@ -341,6 +344,8 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
> >  	 * we do not use the full hardware limits.
> >  	 */
> >  	bytes = ALIGN_DOWN(bytes, bio_split_alignment(bio, lim));
> > +	if (!bytes)
> > +		return -EINVAL;
> 
> How is this related to the other hunk and the patch description?

The patchset allows you to submit an io with vectors that are partial
logical blocks. Misuse could create a bio that exceeds the device max
vectors or introduces virtual boundary gaps, requiring a split into
something that is smaller than a block size. This check catches that.

Quick example: nvme with a 4k logical block size, and the usual 4k
virtual boundary. Send an io with four vectors iov_len=1k. The total
size is block sized, but there's no way that could split into a valid
io. There's a test specifically for this in my reply about xfstests.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv2 1/7] block: check for valid bio while splitting
  2025-08-10 15:39     ` Keith Busch
@ 2025-08-10 17:14       ` Christoph Hellwig
  0 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2025-08-10 17:14 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Keith Busch, linux-block, linux-fsdevel,
	linux-kernel, snitzer, axboe, dw, brauner, Hannes Reinecke

On Sun, Aug 10, 2025 at 09:39:50AM -0600, Keith Busch wrote:
> > >  	bytes = ALIGN_DOWN(bytes, bio_split_alignment(bio, lim));
> > > +	if (!bytes)
> > > +		return -EINVAL;
> > 
> > How is this related to the other hunk and the patch description?
> 
> The patchset allows you to submit an io with vectors that are partial
> logical blocks. Misuse could create a bio that exceeds the device max
> vectors or introduces virtual boundary gaps, requiring a split into
> something that is smaller than a block size. This check catches that.
> 
> Quick example: nvme with a 4k logical block size, and the usual 4k
> virtual boundary. Send an io with four vectors iov_len=1k. The total
> size is block sized, but there's no way that could split into a valid
> io. There's a test specifically for this in my reply about xfstests.

Can you turn the above into a comment explaining the check?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv2 0/7] direct-io: even more flexible io vectors
  2025-08-07 23:20   ` Keith Busch
@ 2025-08-11 10:32     ` Christoph Hellwig
  0 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2025-08-11 10:32 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Keith Busch, linux-block, linux-fsdevel,
	linux-kernel, snitzer, axboe, dw, brauner, linux-xfs

On Thu, Aug 07, 2025 at 05:20:17PM -0600, Keith Busch wrote:
> Sure. I wrote up some for blktest, and the same test works as-is for
> filesystems too. Potential question: where do such programs go
> (xfstests, blktests, both, or some common place)?

We currently have no good way to share tests between xfstest and
blktests, so I think it would require duplicating the helper programs.

> I tested on loop, nvme, and virtio-blk, both raw block (blktests) and
> xfs (fstests). Seems fine.

Cool.  I'd like the hear from the other XFS folks if the possibility
of easily introducing preallocated space (that's what it looks like on
disk) for writes failed because of wrong alignment is fine.  Given that
the same user could introduce them using fallocate it's definitively not
a security issue, but also rather unusual and unexpected.  And from the
other file system maintainers if they have similar issues.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv2 1/7] block: check for valid bio while splitting
  2025-08-05 14:11 ` [PATCHv2 1/7] block: check for valid bio while splitting Keith Busch
  2025-08-10 14:37   ` Christoph Hellwig
@ 2025-08-13 20:06   ` Keith Busch
  2025-08-13 20:41     ` Bart Van Assche
  2025-08-13 21:23     ` Martin K. Petersen
  1 sibling, 2 replies; 27+ messages in thread
From: Keith Busch @ 2025-08-13 20:06 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-fsdevel, linux-kernel, snitzer, axboe, dw,
	brauner, Hannes Reinecke

On Tue, Aug 05, 2025 at 07:11:17AM -0700, Keith Busch wrote:
>  	bio_for_each_bvec(bv, bio, iter) {
> +		if (bv.bv_offset & lim->dma_alignment)
> +			return -EINVAL;

I have a question about this part here because testing with various scsi
is showing odd results.

NVMe wants this check to actually be this:

		if ((bv.bv_offset | bv.bv_len) & lim->dma_alignment)

because the dma alignment defines not only the starting address offset,
but also the length. NVMe's default alignment is 4 bytes.

But I can't make that change because many scsi devices don't set the dma
alignment and get the default 511 value. This is fine for the memory
address offset, but the lengths sent for various inquiry commands are
much smaller, like 4 and 32 byte lengths. That length wouldn't pass the
dma alignment granularity, so I think the default value is far too
conservative. Does the starting address alignment need to be a different
limit than the minimum length? I feel like they should be the same, but
maybe that's just an nvme thing.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv2 1/7] block: check for valid bio while splitting
  2025-08-13 20:06   ` Keith Busch
@ 2025-08-13 20:41     ` Bart Van Assche
  2025-08-13 21:01       ` Keith Busch
  2025-08-13 21:23     ` Martin K. Petersen
  1 sibling, 1 reply; 27+ messages in thread
From: Bart Van Assche @ 2025-08-13 20:41 UTC (permalink / raw)
  To: Keith Busch, Keith Busch
  Cc: linux-block, linux-fsdevel, linux-kernel, snitzer, axboe, dw,
	brauner, Hannes Reinecke

On 8/13/25 1:06 PM, Keith Busch wrote:
> But I can't make that change because many scsi devices don't set the dma
> alignment and get the default 511 value. This is fine for the memory
> address offset, but the lengths sent for various inquriy commands are
> much smaller, like 4 and 32 byte lengths. That length wouldn't pass the
> dma alignment granularity, so I think the default value is far too
> conservative. Does the address start size need to be a different limit
> than minimum length? I feel like they should be the same, but maybe
> that's just an nvme thing.

Hi Keith,

Maybe I misunderstood your question. It seems to me that the SCSI core
sets the DMA alignment by default to four bytes. From
drivers/scsi/hosts.c:

	/* 32-byte (dword) is a common minimum for HBAs. */
	if (sht->dma_alignment)
		shost->dma_alignment = sht->dma_alignment;
	else
		shost->dma_alignment = 3;

Bart.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv2 1/7] block: check for valid bio while splitting
  2025-08-13 20:41     ` Bart Van Assche
@ 2025-08-13 21:01       ` Keith Busch
  2025-08-13 23:17         ` Bart Van Assche
  2025-08-14  1:42         ` Damien Le Moal
  0 siblings, 2 replies; 27+ messages in thread
From: Keith Busch @ 2025-08-13 21:01 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Keith Busch, linux-block, linux-fsdevel, linux-kernel, snitzer,
	axboe, dw, brauner, Hannes Reinecke

On Wed, Aug 13, 2025 at 01:41:49PM -0700, Bart Van Assche wrote:
> On 8/13/25 1:06 PM, Keith Busch wrote:
> > But I can't make that change because many scsi devices don't set the dma
> > alignment and get the default 511 value. This is fine for the memory
> > address offset, but the lengths sent for various inquriy commands are
> > much smaller, like 4 and 32 byte lengths. That length wouldn't pass the
> > dma alignment granularity, so I think the default value is far too
> > conservative. Does the address start size need to be a different limit
> > than minimum length? I feel like they should be the same, but maybe
> > that's just an nvme thing.
> 
> Hi Keith,
> 
> Maybe I misunderstood your question. It seems to me that the SCSI core
> sets the DMA alignment by default to four bytes. From
> drivers/scsi/hosts.c:

Thanks, I think you got my meaning. 

I'm using the AHCI driver. It looks like ata_scsi_dev_config() overrides
the dma_alignment to sector_size - 1, and that pattern goes way back,
almost 20 years ago, so maybe I can't change it.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv2 1/7] block: check for valid bio while splitting
  2025-08-13 20:06   ` Keith Busch
  2025-08-13 20:41     ` Bart Van Assche
@ 2025-08-13 21:23     ` Martin K. Petersen
  2025-08-13 21:52       ` Keith Busch
  1 sibling, 1 reply; 27+ messages in thread
From: Martin K. Petersen @ 2025-08-13 21:23 UTC (permalink / raw)
  To: Keith Busch
  Cc: Keith Busch, linux-block, linux-fsdevel, linux-kernel, snitzer,
	axboe, dw, brauner, Hannes Reinecke


Hi Keith!

> NVMe wants this check to actually be this:
>
> 		if ((bv.bv_offset | bv.bv_len) & lim->dma_alignment)
>
> because the dma alignment defines not only the starting address offset,
> but also the length. NVMe's default alignment is 4 bytes.

dma_alignment defines the alignment of the DMA starting address. We
don't have a dedicated queue limit for the transfer length granularity
described by NVMe.

-- 
Martin K. Petersen

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv2 1/7] block: check for valid bio while splitting
  2025-08-13 21:23     ` Martin K. Petersen
@ 2025-08-13 21:52       ` Keith Busch
  2025-08-18  4:49         ` Christoph Hellwig
  0 siblings, 1 reply; 27+ messages in thread
From: Keith Busch @ 2025-08-13 21:52 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Keith Busch, linux-block, linux-fsdevel, linux-kernel, snitzer,
	axboe, dw, brauner, Hannes Reinecke

On Wed, Aug 13, 2025 at 05:23:47PM -0400, Martin K. Petersen wrote:
> dma_alignment defines the alignment of the DMA starting address. We
> don't have a dedicated queue limit for the transfer length granularity
> described by NVMe.

Darn, but thanks for confirming. I'll see if I can get by without
needing a new limit, or look into adding one if not. Worst case, I can
also let the device return the error, though I think we prefer not to
send an IO that we know should fail.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv2 1/7] block: check for valid bio while splitting
  2025-08-13 21:01       ` Keith Busch
@ 2025-08-13 23:17         ` Bart Van Assche
  2025-08-14  1:42         ` Damien Le Moal
  1 sibling, 0 replies; 27+ messages in thread
From: Bart Van Assche @ 2025-08-13 23:17 UTC (permalink / raw)
  To: Keith Busch
  Cc: Keith Busch, linux-block, linux-fsdevel, linux-kernel, snitzer,
	axboe, dw, brauner, Hannes Reinecke

On 8/13/25 2:01 PM, Keith Busch wrote:
> I'm using the AHCI driver. It looks like ata_scsi_dev_config() overrides
> the dma_alignment to sector_size - 1, and that pattern goes way back,
> almost 20 years ago, so maybe I can't change it.

 From an AHCI specification
(https://www.intel.com/content/dam/www/public/us/en/documents/technical-specifications/serial-ata-ahci-spec-rev1-3-1.pdf):
"Data Base Address (DBA): Indicates the 32-bit physical address of the 
data block. The block must be word aligned, indicated by bit 0 being 
reserved."

Please note that I have no experience with programming AHCI controllers.

Bart.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv2 1/7] block: check for valid bio while splitting
  2025-08-13 21:01       ` Keith Busch
  2025-08-13 23:17         ` Bart Van Assche
@ 2025-08-14  1:42         ` Damien Le Moal
  1 sibling, 0 replies; 27+ messages in thread
From: Damien Le Moal @ 2025-08-14  1:42 UTC (permalink / raw)
  To: Keith Busch, Bart Van Assche
  Cc: Keith Busch, linux-block, linux-fsdevel, linux-kernel, snitzer,
	axboe, dw, brauner, Hannes Reinecke

On 8/14/25 6:01 AM, Keith Busch wrote:
> On Wed, Aug 13, 2025 at 01:41:49PM -0700, Bart Van Assche wrote:
>> On 8/13/25 1:06 PM, Keith Busch wrote:
>>> But I can't make that change because many scsi devices don't set the dma
>>> alignment and get the default 511 value. This is fine for the memory
>>> address offset, but the lengths sent for various inquriy commands are
>>> much smaller, like 4 and 32 byte lengths. That length wouldn't pass the
>>> dma alignment granularity, so I think the default value is far too
>>> conservative. Does the address start size need to be a different limit
>>> than minimum length? I feel like they should be the same, but maybe
>>> that's just an nvme thing.
>>
>> Hi Keith,
>>
>> Maybe I misunderstood your question. It seems to me that the SCSI core
>> sets the DMA alignment by default to four bytes. From
>> drivers/scsi/hosts.c:
> 
> Thanks, I think you got my meaning. 
> 
> I'm using the AHCI driver. It looks like ata_scsi_dev_config() overrides
> the dma_alignment to sector_size - 1, and that pattern goes way back,
> almost 20 years ago, so maybe I can't change it.

That is probably buggy now in the sense that the scsi layer should be able to
send any command with a size not aligned to the LBA size or ATA sector (512 B)
and libata-scsi SAT should do the translation using an internal 512B aligned
command size.

What makes a mess here is that SCSI allows having a media-access command
specifying a transfer size that is not aligned on the LBA size. The transfer
will be "short" in that case, which is perfectly fine with SCSI. But ATA does
not allow that. It is all or nothing and the command size thus must always be
aligned to the LBA size.

I think that dma_alignment was abused to check that. But I think it should not
be too hard to check the alignment in libata-scsi when translating the command.
SAS HBAs should be doing something similar too. I have never actually tested
that though, and I am afraid many SAS HBAs will not like unaligned commands
to ATA devices...

We also have the different alignment for management commands (most of which use
512B sector size) and media access commands which use the actual device LBA
size alignment.

So it is a mess :)

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv2 1/7] block: check for valid bio while splitting
  2025-08-13 21:52       ` Keith Busch
@ 2025-08-18  4:49         ` Christoph Hellwig
  0 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2025-08-18  4:49 UTC (permalink / raw)
  To: Keith Busch
  Cc: Martin K. Petersen, Keith Busch, linux-block, linux-fsdevel,
	linux-kernel, snitzer, axboe, dw, brauner, Hannes Reinecke

On Wed, Aug 13, 2025 at 03:52:44PM -0600, Keith Busch wrote:
> On Wed, Aug 13, 2025 at 05:23:47PM -0400, Martin K. Petersen wrote:
> > dma_alignment defines the alignment of the DMA starting address. We
> > don't have a dedicated queue limit for the transfer length granularity
> > described by NVMe.
> 
> Darn, but thanks for confirming. I'll see if I can get by without
> needing a new limit, or look into adding one if not. Worst case, I can
> also let the device return the error, though I think we prefer not to
> send an IO that we know should fail.

Allowing an unprivileged user application to trivially trigger an I/O
error sounds like a bad idea.

But I think simply applying the dma_alignment to the length of
READ/WRITE commands might work, while ignoring it for other types
of commands.

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2025-08-18  4:49 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-05 14:11 [PATCHv2 0/7] direct-io: even more flexible io vectors Keith Busch
2025-08-05 14:11 ` [PATCHv2 1/7] block: check for valid bio while splitting Keith Busch
2025-08-10 14:37   ` Christoph Hellwig
2025-08-10 15:39     ` Keith Busch
2025-08-10 17:14       ` Christoph Hellwig
2025-08-13 20:06   ` Keith Busch
2025-08-13 20:41     ` Bart Van Assche
2025-08-13 21:01       ` Keith Busch
2025-08-13 23:17         ` Bart Van Assche
2025-08-14  1:42         ` Damien Le Moal
2025-08-13 21:23     ` Martin K. Petersen
2025-08-13 21:52       ` Keith Busch
2025-08-18  4:49         ` Christoph Hellwig
2025-08-05 14:11 ` [PATCHv2 2/7] block: align the bio after building it Keith Busch
2025-08-06  6:07   ` Hannes Reinecke
2025-08-10 14:43   ` Christoph Hellwig
2025-08-05 14:11 ` [PATCHv2 3/7] block: simplify direct io validity check Keith Busch
2025-08-05 14:11 ` [PATCHv2 4/7] iomap: " Keith Busch
2025-08-06  6:07   ` Hannes Reinecke
2025-08-05 14:11 ` [PATCHv2 5/7] block: remove bdev_iter_is_aligned Keith Busch
2025-08-05 14:11 ` [PATCHv2 6/7] blk-integrity: use simpler alignment check Keith Busch
2025-08-05 14:11 ` [PATCHv2 7/7] iov_iter: remove iov_iter_is_aligned Keith Busch
2025-08-06  6:08   ` Hannes Reinecke
2025-08-06  3:03 ` [PATCHv2 0/7] direct-io: even more flexible io vectors Martin K. Petersen
2025-08-06 14:51 ` Christoph Hellwig
2025-08-07 23:20   ` Keith Busch
2025-08-11 10:32     ` Christoph Hellwig
