linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCHv4 0/8]
@ 2025-08-27 14:12 Keith Busch
  2025-08-27 14:12 ` [PATCHv4 1/8] block: check for valid bio while splitting Keith Busch
                   ` (8 more replies)
  0 siblings, 9 replies; 27+ messages in thread
From: Keith Busch @ 2025-08-27 14:12 UTC (permalink / raw)
  To: linux-block, linux-fsdevel; +Cc: linux-xfs, linux-ext4, hch, axboe, Keith Busch

From: Keith Busch <kbusch@kernel.org>

Previous version:

  https://lore.kernel.org/linux-block/20250819164922.640964-1-kbusch@meta.com/

This series removes the direct io requirement that io vector lengths
align to the logical block size. There are two primary benefits from
doing this:

  1. It allows user space more flexibility in what kind of io vectors
     are accepted, removing the need to bounce their data to specially
     aligned buffers.

  2. By moving the alignment checks to later when the segments are
     already being checked, we remove one more iov walk per IO, reducing
     CPU utilization and submission latency.

Same as previously, I've tested direct IO on raw block, xfs, ext4, and btrfs.

Changes from v3:

  - Added reviews

  - Code style and comment updates

Keith Busch (8):
  block: check for valid bio while splitting
  block: add size alignment to bio_iov_iter_get_pages
  block: align the bio after building it
  block: simplify direct io validity check
  iomap: simplify direct io validity check
  block: remove bdev_iter_is_aligned
  blk-integrity: use simpler alignment check
  iov_iter: remove iov_iter_is_aligned

 block/bio-integrity.c  |  4 +-
 block/bio.c            | 64 ++++++++++++++++++----------
 block/blk-map.c        |  2 +-
 block/blk-merge.c      | 21 ++++++++--
 block/fops.c           | 10 ++---
 fs/iomap/direct-io.c   |  5 +--
 include/linux/bio.h    | 13 ++++--
 include/linux/blkdev.h | 21 ++++++----
 include/linux/uio.h    |  2 -
 lib/iov_iter.c         | 95 ------------------------------------------
 10 files changed, 92 insertions(+), 145 deletions(-)

-- 
2.47.3


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCHv4 1/8] block: check for valid bio while splitting
  2025-08-27 14:12 [PATCHv4 0/8] Keith Busch
@ 2025-08-27 14:12 ` Keith Busch
  2025-08-31  0:40   ` Martin K. Petersen
  2025-08-27 14:12 ` [PATCHv4 2/8] block: add size alignment to bio_iov_iter_get_pages Keith Busch
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Keith Busch @ 2025-08-27 14:12 UTC (permalink / raw)
  To: linux-block, linux-fsdevel; +Cc: linux-xfs, linux-ext4, hch, axboe, Keith Busch

From: Keith Busch <kbusch@kernel.org>

We're already iterating every segment, so check these for a valid IO
lengths at the same time. Individual segment lengths will not be checked
on passthrough commands. The read/write command segments must be sized
to the dma alignment.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-map.c        |  2 +-
 block/blk-merge.c      | 21 +++++++++++++++++----
 include/linux/bio.h    |  4 ++--
 include/linux/blkdev.h |  7 +++++++
 4 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/block/blk-map.c b/block/blk-map.c
index 23e5d5ebe59ec..6d1268aa82715 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -443,7 +443,7 @@ int blk_rq_append_bio(struct request *rq, struct bio *bio)
 	int ret;
 
 	/* check that the data layout matches the hardware restrictions */
-	ret = bio_split_rw_at(bio, lim, &nr_segs, max_bytes);
+	ret = bio_split_io_at(bio, lim, &nr_segs, max_bytes, 0);
 	if (ret) {
 		/* if we would have to split the bio, copy instead */
 		if (ret > 0)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 70d704615be52..cffc0fe48d8a3 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -279,25 +279,30 @@ static unsigned int bio_split_alignment(struct bio *bio,
 }
 
 /**
- * bio_split_rw_at - check if and where to split a read/write bio
+ * bio_split_io_at - check if and where to split a bio
  * @bio:  [in] bio to be split
  * @lim:  [in] queue limits to split based on
  * @segs: [out] number of segments in the bio with the first half of the sectors
  * @max_bytes: [in] maximum number of bytes per bio
+ * @len_align_mask: [in] length alignment mask for each vector
  *
  * Find out if @bio needs to be split to fit the queue limits in @lim and a
  * maximum size of @max_bytes.  Returns a negative error number if @bio can't be
  * split, 0 if the bio doesn't have to be split, or a positive sector offset if
  * @bio needs to be split.
  */
-int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
-		unsigned *segs, unsigned max_bytes)
+int bio_split_io_at(struct bio *bio, const struct queue_limits *lim,
+		unsigned *segs, unsigned max_bytes, unsigned len_align_mask)
 {
 	struct bio_vec bv, bvprv, *bvprvp = NULL;
 	struct bvec_iter iter;
 	unsigned nsegs = 0, bytes = 0;
 
 	bio_for_each_bvec(bv, bio, iter) {
+		if (bv.bv_offset & lim->dma_alignment ||
+		    bv.bv_len & len_align_mask)
+			return -EINVAL;
+
 		/*
 		 * If the queue doesn't support SG gaps and adding this
 		 * offset would create a gap, disallow it.
@@ -339,8 +344,16 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
 	 * Individual bvecs might not be logical block aligned. Round down the
 	 * split size so that each bio is properly block size aligned, even if
 	 * we do not use the full hardware limits.
+	 *
+	 * It is possible to submit a bio that can't be split into a valid io:
+	 * there may either be too many discontiguous vectors for the max
+	 * segments limit, or contain virtual boundary gaps without having a
+	 * valid block sized split. A zero byte result means one of those
+	 * conditions occured.
 	 */
 	bytes = ALIGN_DOWN(bytes, bio_split_alignment(bio, lim));
+	if (!bytes)
+		return -EINVAL;
 
 	/*
 	 * Bio splitting may cause subtle trouble such as hang when doing sync
@@ -350,7 +363,7 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
 	bio_clear_polled(bio);
 	return bytes >> SECTOR_SHIFT;
 }
-EXPORT_SYMBOL_GPL(bio_split_rw_at);
+EXPORT_SYMBOL_GPL(bio_split_io_at);
 
 struct bio *bio_split_rw(struct bio *bio, const struct queue_limits *lim,
 		unsigned *nr_segs)
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 46ffac5caab78..519a1d59805f8 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -322,8 +322,8 @@ static inline void bio_next_folio(struct folio_iter *fi, struct bio *bio)
 void bio_trim(struct bio *bio, sector_t offset, sector_t size);
 extern struct bio *bio_split(struct bio *bio, int sectors,
 			     gfp_t gfp, struct bio_set *bs);
-int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
-		unsigned *segs, unsigned max_bytes);
+int bio_split_io_at(struct bio *bio, const struct queue_limits *lim,
+		unsigned *segs, unsigned max_bytes, unsigned len_align);
 
 /**
  * bio_next_split - get next @sectors from a bio, splitting if necessary
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index fe1797bbec420..d75c77eb8cb97 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1870,6 +1870,13 @@ bdev_atomic_write_unit_max_bytes(struct block_device *bdev)
 	return queue_atomic_write_unit_max_bytes(bdev_get_queue(bdev));
 }
 
+static inline int bio_split_rw_at(struct bio *bio,
+		const struct queue_limits *lim,
+		unsigned *segs, unsigned max_bytes)
+{
+	return bio_split_io_at(bio, lim, segs, max_bytes, lim->dma_alignment);
+}
+
 #define DEFINE_IO_COMP_BATCH(name)	struct io_comp_batch name = { }
 
 #endif /* _LINUX_BLKDEV_H */
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCHv4 2/8] block: add size alignment to bio_iov_iter_get_pages
  2025-08-27 14:12 [PATCHv4 0/8] Keith Busch
  2025-08-27 14:12 ` [PATCHv4 1/8] block: check for valid bio while splitting Keith Busch
@ 2025-08-27 14:12 ` Keith Busch
  2025-08-31  0:40   ` Martin K. Petersen
  2025-08-27 14:12 ` [PATCHv4 3/8] block: align the bio after building it Keith Busch
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Keith Busch @ 2025-08-27 14:12 UTC (permalink / raw)
  To: linux-block, linux-fsdevel; +Cc: linux-xfs, linux-ext4, hch, axboe, Keith Busch

From: Keith Busch <kbusch@kernel.org>

The block layer tries to align bio vectors to the block device's logical
block size. Some cases don't have a block device, or we may need to
align to something larger, which we can't derive it from the queue
limits. Have the caller specify what they want, or allow any length
alignment if nothing was specified. Since the most common use case
relies on the block device's limits, a helper function is provided.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 block/bio.c            | 19 +++++++++++--------
 block/fops.c           |  6 +++---
 fs/iomap/direct-io.c   |  2 +-
 include/linux/bio.h    |  9 ++++++++-
 include/linux/blkdev.h |  7 +++++++
 5 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 3b371a5da159e..44286db14355f 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1204,7 +1204,8 @@ static unsigned int get_contig_folio_len(unsigned int *num_pages,
  * For a multi-segment *iter, this function only adds pages from the next
  * non-empty segment of the iov iterator.
  */
-static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
+static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
+				    unsigned len_align_mask)
 {
 	iov_iter_extraction_t extraction_flags = 0;
 	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
@@ -1213,7 +1214,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	struct page **pages = (struct page **)bv;
 	ssize_t size;
 	unsigned int num_pages, i = 0;
-	size_t offset, folio_offset, left, len;
+	size_t offset, folio_offset, left, len, trim;
 	int ret = 0;
 
 	/*
@@ -1242,8 +1243,8 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
 	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
 
-	if (bio->bi_bdev) {
-		size_t trim = size & (bdev_logical_block_size(bio->bi_bdev) - 1);
+	trim = size & len_align_mask;
+	if (trim) {
 		iov_iter_revert(iter, trim);
 		size -= trim;
 	}
@@ -1298,9 +1299,10 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 }
 
 /**
- * bio_iov_iter_get_pages - add user or kernel pages to a bio
+ * bio_iov_iter_get_pages_aligned - add user or kernel pages to a bio
  * @bio: bio to add pages to
  * @iter: iov iterator describing the region to be added
+ * @len_align_mask: the mask to align each vector size to, 0 for any length
  *
  * This takes either an iterator pointing to user memory, or one pointing to
  * kernel pages (BVEC iterator). If we're adding user pages, we pin them and
@@ -1317,7 +1319,8 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
  * MM encounters an error pinning the requested pages, it stops. Error
  * is returned only if 0 pages could be pinned.
  */
-int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
+int bio_iov_iter_get_pages_aligned(struct bio *bio, struct iov_iter *iter,
+			   unsigned len_align_mask)
 {
 	int ret = 0;
 
@@ -1333,12 +1336,12 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	if (iov_iter_extract_will_pin(iter))
 		bio_set_flag(bio, BIO_PAGE_PINNED);
 	do {
-		ret = __bio_iov_iter_get_pages(bio, iter);
+		ret = __bio_iov_iter_get_pages(bio, iter, len_align_mask);
 	} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
 
 	return bio->bi_vcnt ? 0 : ret;
 }
-EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages);
+EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages_aligned);
 
 static void submit_bio_wait_endio(struct bio *bio)
 {
diff --git a/block/fops.c b/block/fops.c
index 82451ac8ff25d..d136fb5f6b6ab 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -78,7 +78,7 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 	if (iocb->ki_flags & IOCB_ATOMIC)
 		bio.bi_opf |= REQ_ATOMIC;
 
-	ret = bio_iov_iter_get_pages(&bio, iter);
+	ret = bio_iov_iter_get_bdev_pages(&bio, iter, bdev);
 	if (unlikely(ret))
 		goto out;
 	ret = bio.bi_iter.bi_size;
@@ -212,7 +212,7 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 		bio->bi_end_io = blkdev_bio_end_io;
 		bio->bi_ioprio = iocb->ki_ioprio;
 
-		ret = bio_iov_iter_get_pages(bio, iter);
+		ret = bio_iov_iter_get_bdev_pages(bio, iter, bdev);
 		if (unlikely(ret)) {
 			bio->bi_status = BLK_STS_IOERR;
 			bio_endio(bio);
@@ -348,7 +348,7 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 		 */
 		bio_iov_bvec_set(bio, iter);
 	} else {
-		ret = bio_iov_iter_get_pages(bio, iter);
+		ret = bio_iov_iter_get_bdev_pages(bio, iter, bdev);
 		if (unlikely(ret))
 			goto out_bio_put;
 	}
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index b84f6af2eb4c8..fea23fa6a402f 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -434,7 +434,7 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
 		bio->bi_private = dio;
 		bio->bi_end_io = iomap_dio_bio_end_io;
 
-		ret = bio_iov_iter_get_pages(bio, dio->submit.iter);
+		ret = bio_iov_iter_get_bdev_pages(bio, dio->submit.iter, iomap->bdev);
 		if (unlikely(ret)) {
 			/*
 			 * We have to stop part way through an IO. We must fall
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 519a1d59805f8..788a50ff319e3 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -441,7 +441,14 @@ int submit_bio_wait(struct bio *bio);
 int bdev_rw_virt(struct block_device *bdev, sector_t sector, void *data,
 		size_t len, enum req_op op);
 
-int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter);
+int bio_iov_iter_get_pages_aligned(struct bio *bio, struct iov_iter *iter,
+		unsigned len_align_mask);
+
+static inline int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
+{
+	return bio_iov_iter_get_pages_aligned(bio, iter, 0);
+}
+
 void bio_iov_bvec_set(struct bio *bio, const struct iov_iter *iter);
 void __bio_release_pages(struct bio *bio, bool mark_dirty);
 extern void bio_set_pages_dirty(struct bio *bio);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d75c77eb8cb97..36500d576d7e9 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1877,6 +1877,13 @@ static inline int bio_split_rw_at(struct bio *bio,
 	return bio_split_io_at(bio, lim, segs, max_bytes, lim->dma_alignment);
 }
 
+static inline int bio_iov_iter_get_bdev_pages(struct bio *bio,
+		struct iov_iter *iter, struct block_device *bdev)
+{
+	return bio_iov_iter_get_pages_aligned(bio, iter,
+					bdev_logical_block_size(bdev) - 1);
+}
+
 #define DEFINE_IO_COMP_BATCH(name)	struct io_comp_batch name = { }
 
 #endif /* _LINUX_BLKDEV_H */
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCHv4 3/8] block: align the bio after building it
  2025-08-27 14:12 [PATCHv4 0/8] Keith Busch
  2025-08-27 14:12 ` [PATCHv4 1/8] block: check for valid bio while splitting Keith Busch
  2025-08-27 14:12 ` [PATCHv4 2/8] block: add size alignment to bio_iov_iter_get_pages Keith Busch
@ 2025-08-27 14:12 ` Keith Busch
  2025-08-31  0:41   ` Martin K. Petersen
  2025-09-02  5:23   ` Christoph Hellwig
  2025-08-27 14:12 ` [PATCHv4 4/8] block: simplify direct io validity check Keith Busch
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 27+ messages in thread
From: Keith Busch @ 2025-08-27 14:12 UTC (permalink / raw)
  To: linux-block, linux-fsdevel; +Cc: linux-xfs, linux-ext4, hch, axboe, Keith Busch

From: Keith Busch <kbusch@kernel.org>

Instead of ensuring each vector is block size aligned while constructing
the bio, just ensure the entire size is aligned after it's built. This
makes getting bio pages more flexible to accepting device valid io
vectors that would otherwise get rejected by alignment checks.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/bio.c | 65 ++++++++++++++++++++++++++++++++---------------------
 1 file changed, 40 insertions(+), 25 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 44286db14355f..8703be4748db8 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1204,8 +1204,7 @@ static unsigned int get_contig_folio_len(unsigned int *num_pages,
  * For a multi-segment *iter, this function only adds pages from the next
  * non-empty segment of the iov iterator.
  */
-static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
-				    unsigned len_align_mask)
+static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
 	iov_iter_extraction_t extraction_flags = 0;
 	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
@@ -1214,7 +1213,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
 	struct page **pages = (struct page **)bv;
 	ssize_t size;
 	unsigned int num_pages, i = 0;
-	size_t offset, folio_offset, left, len, trim;
+	size_t offset, folio_offset, left, len;
 	int ret = 0;
 
 	/*
@@ -1228,13 +1227,6 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
 	if (bio->bi_bdev && blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
 		extraction_flags |= ITER_ALLOW_P2PDMA;
 
-	/*
-	 * Each segment in the iov is required to be a block size multiple.
-	 * However, we may not be able to get the entire segment if it spans
-	 * more pages than bi_max_vecs allows, so we have to ALIGN_DOWN the
-	 * result to ensure the bio's total size is correct. The remainder of
-	 * the iov data will be picked up in the next bio iteration.
-	 */
 	size = iov_iter_extract_pages(iter, &pages,
 				      UINT_MAX - bio->bi_iter.bi_size,
 				      nr_pages, extraction_flags, &offset);
@@ -1242,18 +1234,6 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
 		return size ? size : -EFAULT;
 
 	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
-
-	trim = size & len_align_mask;
-	if (trim) {
-		iov_iter_revert(iter, trim);
-		size -= trim;
-	}
-
-	if (unlikely(!size)) {
-		ret = -EFAULT;
-		goto out;
-	}
-
 	for (left = size, i = 0; left > 0; left -= len, i += num_pages) {
 		struct page *page = pages[i];
 		struct folio *folio = page_folio(page);
@@ -1298,11 +1278,44 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
 	return ret;
 }
 
+/*
+ * Aligns the bio size to the len_align_mask, releasing excessive bio vecs that
+ * __bio_iov_iter_get_pages may have inserted, and reverts the trimmed length
+ * for the next iteration.
+ */
+static int bio_iov_iter_align_down(struct bio *bio, struct iov_iter *iter,
+			    unsigned len_align_mask)
+{
+	size_t nbytes = bio->bi_iter.bi_size & len_align_mask;
+
+	if (!nbytes)
+		return 0;
+
+	iov_iter_revert(iter, nbytes);
+	bio->bi_iter.bi_size -= nbytes;
+	do {
+		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
+
+		if (nbytes < bv->bv_len) {
+			bv->bv_len -= nbytes;
+			break;
+		}
+
+		bio_release_page(bio, bv->bv_page);
+		bio->bi_vcnt--;
+		nbytes -= bv->bv_len;
+	} while (nbytes);
+
+	if (!bio->bi_vcnt)
+		return -EFAULT;
+	return 0;
+}
+
 /**
  * bio_iov_iter_get_pages_aligned - add user or kernel pages to a bio
  * @bio: bio to add pages to
  * @iter: iov iterator describing the region to be added
- * @len_align_mask: the mask to align each vector size to, 0 for any length
+ * @len_align_mask: the mask to align the total size to, 0 for any length
  *
  * This takes either an iterator pointing to user memory, or one pointing to
  * kernel pages (BVEC iterator). If we're adding user pages, we pin them and
@@ -1336,10 +1349,12 @@ int bio_iov_iter_get_pages_aligned(struct bio *bio, struct iov_iter *iter,
 	if (iov_iter_extract_will_pin(iter))
 		bio_set_flag(bio, BIO_PAGE_PINNED);
 	do {
-		ret = __bio_iov_iter_get_pages(bio, iter, len_align_mask);
+		ret = __bio_iov_iter_get_pages(bio, iter);
 	} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
 
-	return bio->bi_vcnt ? 0 : ret;
+	if (bio->bi_vcnt)
+		return bio_iov_iter_align_down(bio, iter, len_align_mask);
+	return ret;
 }
 EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages_aligned);
 
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCHv4 4/8] block: simplify direct io validity check
  2025-08-27 14:12 [PATCHv4 0/8] Keith Busch
                   ` (2 preceding siblings ...)
  2025-08-27 14:12 ` [PATCHv4 3/8] block: align the bio after building it Keith Busch
@ 2025-08-27 14:12 ` Keith Busch
  2025-08-27 14:12 ` [PATCHv4 5/8] iomap: " Keith Busch
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 27+ messages in thread
From: Keith Busch @ 2025-08-27 14:12 UTC (permalink / raw)
  To: linux-block, linux-fsdevel
  Cc: linux-xfs, linux-ext4, hch, axboe, Keith Busch, Hannes Reinecke,
	Martin K. Petersen

From: Keith Busch <kbusch@kernel.org>

The block layer checks all the segments for validity later, so no need
for an early check. Just reduce it to a simple position and total length
check, and defer the more invasive segment checks to the block layer.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 block/fops.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index d136fb5f6b6ab..19814bddd77e2 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -38,8 +38,8 @@ static blk_opf_t dio_bio_write_op(struct kiocb *iocb)
 static bool blkdev_dio_invalid(struct block_device *bdev, struct kiocb *iocb,
 				struct iov_iter *iter)
 {
-	return iocb->ki_pos & (bdev_logical_block_size(bdev) - 1) ||
-		!bdev_iter_is_aligned(bdev, iter);
+	return (iocb->ki_pos | iov_iter_count(iter)) &
+			(bdev_logical_block_size(bdev) - 1);
 }
 
 #define DIO_INLINE_BIO_VECS 4
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCHv4 5/8] iomap: simplify direct io validity check
  2025-08-27 14:12 [PATCHv4 0/8] Keith Busch
                   ` (3 preceding siblings ...)
  2025-08-27 14:12 ` [PATCHv4 4/8] block: simplify direct io validity check Keith Busch
@ 2025-08-27 14:12 ` Keith Busch
  2025-10-27 16:25   ` Carlos Llamas
  2025-08-27 14:12 ` [PATCHv4 6/8] block: remove bdev_iter_is_aligned Keith Busch
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Keith Busch @ 2025-08-27 14:12 UTC (permalink / raw)
  To: linux-block, linux-fsdevel
  Cc: linux-xfs, linux-ext4, hch, axboe, Keith Busch, Hannes Reinecke,
	Martin K. Petersen

From: Keith Busch <kbusch@kernel.org>

The block layer checks all the segments for validity later, so no need
for an early check. Just reduce it to a simple position and total length
check, and defer the more invasive segment checks to the block layer.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap/direct-io.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index fea23fa6a402f..c06e41fd4d0af 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -337,8 +337,7 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
 	u64 copied = 0;
 	size_t orig_count;
 
-	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
-	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
+	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1))
 		return -EINVAL;
 
 	if (dio->flags & IOMAP_DIO_WRITE) {
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCHv4 6/8] block: remove bdev_iter_is_aligned
  2025-08-27 14:12 [PATCHv4 0/8] Keith Busch
                   ` (4 preceding siblings ...)
  2025-08-27 14:12 ` [PATCHv4 5/8] iomap: " Keith Busch
@ 2025-08-27 14:12 ` Keith Busch
  2025-08-27 14:12 ` [PATCHv4 7/8] blk-integrity: use simpler alignment check Keith Busch
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 27+ messages in thread
From: Keith Busch @ 2025-08-27 14:12 UTC (permalink / raw)
  To: linux-block, linux-fsdevel
  Cc: linux-xfs, linux-ext4, hch, axboe, Keith Busch, Hannes Reinecke,
	Martin K. Petersen

From: Keith Busch <kbusch@kernel.org>

No more callers.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 include/linux/blkdev.h | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 36500d576d7e9..221f6d7c0beb1 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1590,13 +1590,6 @@ static inline unsigned int bdev_dma_alignment(struct block_device *bdev)
 	return queue_dma_alignment(bdev_get_queue(bdev));
 }
 
-static inline bool bdev_iter_is_aligned(struct block_device *bdev,
-					struct iov_iter *iter)
-{
-	return iov_iter_is_aligned(iter, bdev_dma_alignment(bdev),
-				   bdev_logical_block_size(bdev) - 1);
-}
-
 static inline unsigned int
 blk_lim_dma_alignment_and_pad(struct queue_limits *lim)
 {
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCHv4 7/8] blk-integrity: use simpler alignment check
  2025-08-27 14:12 [PATCHv4 0/8] Keith Busch
                   ` (5 preceding siblings ...)
  2025-08-27 14:12 ` [PATCHv4 6/8] block: remove bdev_iter_is_aligned Keith Busch
@ 2025-08-27 14:12 ` Keith Busch
  2025-08-27 14:12 ` [PATCHv4 8/8] iov_iter: remove iov_iter_is_aligned Keith Busch
  2025-09-09 16:27 ` [PATCHv4 0/8] Jens Axboe
  8 siblings, 0 replies; 27+ messages in thread
From: Keith Busch @ 2025-08-27 14:12 UTC (permalink / raw)
  To: linux-block, linux-fsdevel
  Cc: linux-xfs, linux-ext4, hch, axboe, Keith Busch, Hannes Reinecke,
	Martin K. Petersen

From: Keith Busch <kbusch@kernel.org>

We're checking length and addresses against the same alignment value, so
use the more simple iterator check.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 block/bio-integrity.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 6b077ca937f6b..6d069a49b4aad 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -262,7 +262,6 @@ static unsigned int bvec_from_pages(struct bio_vec *bvec, struct page **pages,
 int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter)
 {
 	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
-	unsigned int align = blk_lim_dma_alignment_and_pad(&q->limits);
 	struct page *stack_pages[UIO_FASTIOV], **pages = stack_pages;
 	struct bio_vec stack_vec[UIO_FASTIOV], *bvec = stack_vec;
 	size_t offset, bytes = iter->count;
@@ -285,7 +284,8 @@ int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter)
 		pages = NULL;
 	}
 
-	copy = !iov_iter_is_aligned(iter, align, align);
+	copy = iov_iter_alignment(iter) &
+			blk_lim_dma_alignment_and_pad(&q->limits);
 	ret = iov_iter_extract_pages(iter, &pages, bytes, nr_vecs, 0, &offset);
 	if (unlikely(ret < 0))
 		goto free_bvec;
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCHv4 8/8] iov_iter: remove iov_iter_is_aligned
  2025-08-27 14:12 [PATCHv4 0/8] Keith Busch
                   ` (6 preceding siblings ...)
  2025-08-27 14:12 ` [PATCHv4 7/8] blk-integrity: use simpler alignment check Keith Busch
@ 2025-08-27 14:12 ` Keith Busch
  2025-09-09 16:27 ` [PATCHv4 0/8] Jens Axboe
  8 siblings, 0 replies; 27+ messages in thread
From: Keith Busch @ 2025-08-27 14:12 UTC (permalink / raw)
  To: linux-block, linux-fsdevel
  Cc: linux-xfs, linux-ext4, hch, axboe, Keith Busch, Hannes Reinecke,
	Martin K. Petersen

From: Keith Busch <kbusch@kernel.org>

No more callers.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 include/linux/uio.h |  2 -
 lib/iov_iter.c      | 95 ---------------------------------------------
 2 files changed, 97 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 2e86c653186c6..5b127043a1519 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -286,8 +286,6 @@ size_t _copy_mc_to_iter(const void *addr, size_t bytes, struct iov_iter *i);
 #endif
 
 size_t iov_iter_zero(size_t bytes, struct iov_iter *);
-bool iov_iter_is_aligned(const struct iov_iter *i, unsigned addr_mask,
-			unsigned len_mask);
 unsigned long iov_iter_alignment(const struct iov_iter *i);
 unsigned long iov_iter_gap_alignment(const struct iov_iter *i);
 void iov_iter_init(struct iov_iter *i, unsigned int direction, const struct iovec *iov,
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f9193f952f499..2fe66a6b8789e 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -784,101 +784,6 @@ void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count)
 }
 EXPORT_SYMBOL(iov_iter_discard);
 
-static bool iov_iter_aligned_iovec(const struct iov_iter *i, unsigned addr_mask,
-				   unsigned len_mask)
-{
-	const struct iovec *iov = iter_iov(i);
-	size_t size = i->count;
-	size_t skip = i->iov_offset;
-
-	do {
-		size_t len = iov->iov_len - skip;
-
-		if (len > size)
-			len = size;
-		if (len & len_mask)
-			return false;
-		if ((unsigned long)(iov->iov_base + skip) & addr_mask)
-			return false;
-
-		iov++;
-		size -= len;
-		skip = 0;
-	} while (size);
-
-	return true;
-}
-
-static bool iov_iter_aligned_bvec(const struct iov_iter *i, unsigned addr_mask,
-				  unsigned len_mask)
-{
-	const struct bio_vec *bvec = i->bvec;
-	unsigned skip = i->iov_offset;
-	size_t size = i->count;
-
-	do {
-		size_t len = bvec->bv_len - skip;
-
-		if (len > size)
-			len = size;
-		if (len & len_mask)
-			return false;
-		if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
-			return false;
-
-		bvec++;
-		size -= len;
-		skip = 0;
-	} while (size);
-
-	return true;
-}
-
-/**
- * iov_iter_is_aligned() - Check if the addresses and lengths of each segments
- * 	are aligned to the parameters.
- *
- * @i: &struct iov_iter to restore
- * @addr_mask: bit mask to check against the iov element's addresses
- * @len_mask: bit mask to check against the iov element's lengths
- *
- * Return: false if any addresses or lengths intersect with the provided masks
- */
-bool iov_iter_is_aligned(const struct iov_iter *i, unsigned addr_mask,
-			 unsigned len_mask)
-{
-	if (likely(iter_is_ubuf(i))) {
-		if (i->count & len_mask)
-			return false;
-		if ((unsigned long)(i->ubuf + i->iov_offset) & addr_mask)
-			return false;
-		return true;
-	}
-
-	if (likely(iter_is_iovec(i) || iov_iter_is_kvec(i)))
-		return iov_iter_aligned_iovec(i, addr_mask, len_mask);
-
-	if (iov_iter_is_bvec(i))
-		return iov_iter_aligned_bvec(i, addr_mask, len_mask);
-
-	/* With both xarray and folioq types, we're dealing with whole folios. */
-	if (iov_iter_is_xarray(i)) {
-		if (i->count & len_mask)
-			return false;
-		if ((i->xarray_start + i->iov_offset) & addr_mask)
-			return false;
-	}
-	if (iov_iter_is_folioq(i)) {
-		if (i->count & len_mask)
-			return false;
-		if (i->iov_offset & addr_mask)
-			return false;
-	}
-
-	return true;
-}
-EXPORT_SYMBOL_GPL(iov_iter_is_aligned);
-
 static unsigned long iov_iter_alignment_iovec(const struct iov_iter *i)
 {
 	const struct iovec *iov = iter_iov(i);
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCHv4 1/8] block: check for valid bio while splitting
  2025-08-27 14:12 ` [PATCHv4 1/8] block: check for valid bio while splitting Keith Busch
@ 2025-08-31  0:40   ` Martin K. Petersen
  0 siblings, 0 replies; 27+ messages in thread
From: Martin K. Petersen @ 2025-08-31  0:40 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-ext4, hch, axboe,
	Keith Busch


Keith,

> We're already iterating every segment, so check these for a valid IO
> lengths at the same time. Individual segment lengths will not be
> checked on passthrough commands. The read/write command segments must
> be sized to the dma alignment.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

-- 
Martin K. Petersen

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv4 2/8] block: add size alignment to bio_iov_iter_get_pages
  2025-08-27 14:12 ` [PATCHv4 2/8] block: add size alignment to bio_iov_iter_get_pages Keith Busch
@ 2025-08-31  0:40   ` Martin K. Petersen
  0 siblings, 0 replies; 27+ messages in thread
From: Martin K. Petersen @ 2025-08-31  0:40 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-ext4, hch, axboe,
	Keith Busch


Keith,

> The block layer tries to align bio vectors to the block device's
> logical block size. Some cases don't have a block device, or we may
> need to align to something larger, which we can't derive it from the
> queue limits. Have the caller specify what they want, or allow any
> length alignment if nothing was specified. Since the most common use
> case relies on the block device's limits, a helper function is
> provided.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

-- 
Martin K. Petersen

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv4 3/8] block: align the bio after building it
  2025-08-27 14:12 ` [PATCHv4 3/8] block: align the bio after building it Keith Busch
@ 2025-08-31  0:41   ` Martin K. Petersen
  2025-09-02  5:23   ` Christoph Hellwig
  1 sibling, 0 replies; 27+ messages in thread
From: Martin K. Petersen @ 2025-08-31  0:41 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-ext4, hch, axboe,
	Keith Busch


Keith,

> Instead of ensuring each vector is block size aligned while
> constructing the bio, just ensure the entire size is aligned after
> it's built. This makes getting bio pages more flexible to accepting
> device valid io vectors that would otherwise get rejected by alignment
> checks.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

-- 
Martin K. Petersen

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv4 3/8] block: align the bio after building it
  2025-08-27 14:12 ` [PATCHv4 3/8] block: align the bio after building it Keith Busch
  2025-08-31  0:41   ` Martin K. Petersen
@ 2025-09-02  5:23   ` Christoph Hellwig
  1 sibling, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2025-09-02  5:23 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-ext4, hch, axboe,
	Keith Busch

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv4 0/8]
  2025-08-27 14:12 [PATCHv4 0/8] Keith Busch
                   ` (7 preceding siblings ...)
  2025-08-27 14:12 ` [PATCHv4 8/8] iov_iter: remove iov_iter_is_aligned Keith Busch
@ 2025-09-09 16:27 ` Jens Axboe
  8 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2025-09-09 16:27 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, Keith Busch
  Cc: linux-xfs, linux-ext4, hch, Keith Busch


On Wed, 27 Aug 2025 07:12:50 -0700, Keith Busch wrote:
> Previous version:
> 
>   https://lore.kernel.org/linux-block/20250819164922.640964-1-kbusch@meta.com/
> 
> This series removes the direct io requirement that io vector lengths
> align to the logical block size. There are two primary benefits from
> doing this:
> 
> [...]

Applied, thanks!

[1/8] block: check for valid bio while splitting
      commit: fec2e705729dc93de5399d8b139e4746805c3d81
[2/8] block: add size alignment to bio_iov_iter_get_pages
      commit: 743bf2e0c49c835cb7c4e4ac7d5a2610587047be
[3/8] block: align the bio after building it
      commit: 20a0e6276edba4318c13486df02c31e5f3c09431
[4/8] block: simplify direct io validity check
      commit: 5ff3f74e145adc79b49668adb8de276446acf6be
[5/8] iomap: simplify direct io validity check
      commit: 7eac331869575d81eaa2dd68b19e7468f8fa93cb
[6/8] block: remove bdev_iter_is_aligned
      commit: 9eab1d4e0d15b633adc170c458c51e8be3b1c553
[7/8] blk-integrity: use simpler alignment check
      commit: 69d7ed5b9ef661230264bfa0db4c96fa25b8efa4
[8/8] iov_iter: remove iov_iter_is_aligned
      commit: b475272f03ca5d0c437c8f899ff229b21010ec83

Best regards,
-- 
Jens Axboe




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv4 5/8] iomap: simplify direct io validity check
  2025-08-27 14:12 ` [PATCHv4 5/8] iomap: " Keith Busch
@ 2025-10-27 16:25   ` Carlos Llamas
  2025-10-27 16:42     ` Keith Busch
  0 siblings, 1 reply; 27+ messages in thread
From: Carlos Llamas @ 2025-10-27 16:25 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-ext4, hch, axboe,
	Keith Busch, Hannes Reinecke, Martin K. Petersen

On Wed, Aug 27, 2025 at 07:12:55AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> The block layer checks all the segments for validity later, so no need
> for an early check. Just reduce it to a simple position and total length
> check, and defer the more invasive segment checks to the block layer.
> 
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/iomap/direct-io.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index fea23fa6a402f..c06e41fd4d0af 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -337,8 +337,7 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
>  	u64 copied = 0;
>  	size_t orig_count;
>  
> -	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
> -	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
> +	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1))
>  		return -EINVAL;
>  
>  	if (dio->flags & IOMAP_DIO_WRITE) {
> -- 
> 2.47.3
> 

Hey Keith, I'be bisected an LTP issue down to this patch. There is a
O_DIRECT read test that expects EINVAL for a bad buffer alignment.
However, if I understand the patchset correctly, this is intentional
move which makes this LTP test obsolete, correct?

The broken test is "test 5" here:
https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/read/read02.c

... and this is what I get now:
  read02.c:87: TFAIL: read() failed unexpectedly, expected EINVAL: EIO (5)

Cheers,
Carlos Llamas

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv4 5/8] iomap: simplify direct io validity check
  2025-10-27 16:25   ` Carlos Llamas
@ 2025-10-27 16:42     ` Keith Busch
  2025-10-27 17:12       ` Carlos Llamas
  2025-10-28 22:47       ` Carlos Llamas
  0 siblings, 2 replies; 27+ messages in thread
From: Keith Busch @ 2025-10-27 16:42 UTC (permalink / raw)
  To: Carlos Llamas
  Cc: Keith Busch, linux-block, linux-fsdevel, linux-xfs, linux-ext4,
	hch, axboe, Hannes Reinecke, Martin K. Petersen

On Mon, Oct 27, 2025 at 04:25:10PM +0000, Carlos Llamas wrote:
> Hey Keith, I'be bisected an LTP issue down to this patch. There is a
> O_DIRECT read test that expects EINVAL for a bad buffer alignment.
> However, if I understand the patchset correctly, this is intentional
> move which makes this LTP test obsolete, correct?
> 
> The broken test is "test 5" here:
> https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/read/read02.c
> 
> ... and this is what I get now:
>   read02.c:87: TFAIL: read() failed unexpectedly, expected EINVAL: EIO (5)

Yes, the changes are intentional. Your test should still see the read
fail since it looks like its attempting a byte aligned memory offset,
and most storage controllers don't advertise support for byte aligned
DMA. So the problem is that you got EIO instead of EINVAL? The block
layer that finds your misaligned address should have still failed with
EINVAL, but that check is deferred to pretty low in the stack rather
than preemptively checked as before. The filesystem may return a generic
EIO in that case, but not sure. What filesystem was this using?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv4 5/8] iomap: simplify direct io validity check
  2025-10-27 16:42     ` Keith Busch
@ 2025-10-27 17:12       ` Carlos Llamas
  2025-10-28 22:47       ` Carlos Llamas
  1 sibling, 0 replies; 27+ messages in thread
From: Carlos Llamas @ 2025-10-27 17:12 UTC (permalink / raw)
  To: Keith Busch
  Cc: Keith Busch, linux-block, linux-fsdevel, linux-xfs, linux-ext4,
	hch, axboe, Hannes Reinecke, Martin K. Petersen

On Mon, Oct 27, 2025 at 10:42:47AM -0600, Keith Busch wrote:
> On Mon, Oct 27, 2025 at 04:25:10PM +0000, Carlos Llamas wrote:
> > Hey Keith, I'be bisected an LTP issue down to this patch. There is a
> > O_DIRECT read test that expects EINVAL for a bad buffer alignment.
> > However, if I understand the patchset correctly, this is intentional
> > move which makes this LTP test obsolete, correct?
> > 
> > The broken test is "test 5" here:
> > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/read/read02.c
> > 
> > ... and this is what I get now:
> >   read02.c:87: TFAIL: read() failed unexpectedly, expected EINVAL: EIO (5)
> 
> Yes, the changes are intentional. Your test should still see the read
> fail since it looks like its attempting a byte aligned memory offset,
> and most storage controllers don't advertise support for byte aligned
> DMA. So the problem is that you got EIO instead of EINVAL? The block

Yes, that is the problem.

> layer that finds your misaligned address should have still failed with
> EINVAL, but that check is deferred to pretty low in the stack rather
> than preemptively checked as before. The filesystem may return a generic
> EIO in that case, but not sure. What filesystem was this using?

I see, so the check is to be deferred to the block implementation. I
don't really know what fs I was using, I throught it was ext4 but let me
double check.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv4 5/8] iomap: simplify direct io validity check
  2025-10-27 16:42     ` Keith Busch
  2025-10-27 17:12       ` Carlos Llamas
@ 2025-10-28 22:47       ` Carlos Llamas
  2025-10-28 22:56         ` Eric Biggers
  1 sibling, 1 reply; 27+ messages in thread
From: Carlos Llamas @ 2025-10-28 22:47 UTC (permalink / raw)
  To: Keith Busch
  Cc: Keith Busch, linux-block, linux-fsdevel, linux-xfs, linux-ext4,
	hch, axboe, Hannes Reinecke, Martin K. Petersen, Eric Biggers

On Mon, Oct 27, 2025 at 10:42:47AM -0600, Keith Busch wrote:
> On Mon, Oct 27, 2025 at 04:25:10PM +0000, Carlos Llamas wrote:
> > Hey Keith, I'be bisected an LTP issue down to this patch. There is a
> > O_DIRECT read test that expects EINVAL for a bad buffer alignment.
> > However, if I understand the patchset correctly, this is intentional
> > move which makes this LTP test obsolete, correct?
> > 
> > The broken test is "test 5" here:
> > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/read/read02.c
> > 
> > ... and this is what I get now:
> >   read02.c:87: TFAIL: read() failed unexpectedly, expected EINVAL: EIO (5)
> 
> Yes, the changes are intentional. Your test should still see the read
> fail since it looks like its attempting a byte aligned memory offset,
> and most storage controllers don't advertise support for byte aligned
> DMA. So the problem is that you got EIO instead of EINVAL? The block
> layer that finds your misaligned address should have still failed with
> EINVAL, but that check is deferred to pretty low in the stack rather
> than preemptively checked as before. The filesystem may return a generic
> EIO in that case, but not sure. What filesystem was this using?

Cc: Eric Biggers <ebiggers@google.com>

Ok, I did a bit more digging. I'm using f2fs but the problem in this
case is the blk_crypto layer. The OP_READ request goes through
submit_bio() which then calls blk_crypto_bio_prep() and if the bio has
crypto context then it checks for bio_crypt_check_alignment().

This is where the LTP tests fails the alignment. However, the propagated
error goes through "bio->bi_status = BLK_STS_IOERR" which in bio_endio()
get translates to EIO due to blk_status_to_errno().

I've verified this restores the original behavior matching the LTP test,
so I'll write up a patch and send it a bit later.

diff --git a/block/blk-crypto.c b/block/blk-crypto.c
index 1336cbf5e3bd..a417843e7e4a 100644
--- a/block/blk-crypto.c
+++ b/block/blk-crypto.c
@@ -293,7 +293,7 @@ bool __blk_crypto_bio_prep(struct bio **bio_ptr)
 	}
 
 	if (!bio_crypt_check_alignment(bio)) {
-		bio->bi_status = BLK_STS_IOERR;
+		bio->bi_status = BLK_STS_INVAL;
 		goto fail;
 	}
 

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCHv4 5/8] iomap: simplify direct io validity check
  2025-10-28 22:47       ` Carlos Llamas
@ 2025-10-28 22:56         ` Eric Biggers
  2025-10-28 23:03           ` Eric Biggers
  0 siblings, 1 reply; 27+ messages in thread
From: Eric Biggers @ 2025-10-28 22:56 UTC (permalink / raw)
  To: Carlos Llamas
  Cc: Keith Busch, Keith Busch, linux-block, linux-fsdevel, linux-xfs,
	linux-ext4, hch, axboe, Hannes Reinecke, Martin K. Petersen

On Tue, Oct 28, 2025 at 10:47:53PM +0000, Carlos Llamas wrote:
> Ok, I did a bit more digging. I'm using f2fs but the problem in this
> case is the blk_crypto layer. The OP_READ request goes through
> submit_bio() which then calls blk_crypto_bio_prep() and if the bio has
> crypto context then it checks for bio_crypt_check_alignment().
> 
> This is where the LTP tests fails the alignment. However, the propagated
> error goes through "bio->bi_status = BLK_STS_IOERR" which in bio_endio()
> get translates to EIO due to blk_status_to_errno().
> 
> I've verified this restores the original behavior matching the LTP test,
> so I'll write up a patch and send it a bit later.
> 
> diff --git a/block/blk-crypto.c b/block/blk-crypto.c
> index 1336cbf5e3bd..a417843e7e4a 100644
> --- a/block/blk-crypto.c
> +++ b/block/blk-crypto.c
> @@ -293,7 +293,7 @@ bool __blk_crypto_bio_prep(struct bio **bio_ptr)
>  	}
>  
>  	if (!bio_crypt_check_alignment(bio)) {
> -		bio->bi_status = BLK_STS_IOERR;
> +		bio->bi_status = BLK_STS_INVAL;
>  		goto fail;
>  	}

That change looks fine, but I'm wondering how this case was reached in
the first place.  Upper layers aren't supposed to be submitting
misaligned bios like this.  For example, ext4 and f2fs require
filesystem logical block size alignment for direct I/O on encrypted
files.  They check for this early, before getting to the point of
submitting a bio, and fall back to buffered I/O if needed.

- Eric

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv4 5/8] iomap: simplify direct io validity check
  2025-10-28 22:56         ` Eric Biggers
@ 2025-10-28 23:03           ` Eric Biggers
  2025-10-29  7:06             ` Christoph Hellwig
  2025-10-30  4:54             ` Carlos Llamas
  0 siblings, 2 replies; 27+ messages in thread
From: Eric Biggers @ 2025-10-28 23:03 UTC (permalink / raw)
  To: Carlos Llamas
  Cc: Keith Busch, Keith Busch, linux-block, linux-fsdevel, linux-xfs,
	linux-ext4, hch, axboe, Hannes Reinecke, Martin K. Petersen

On Tue, Oct 28, 2025 at 10:56:48PM +0000, Eric Biggers wrote:
> On Tue, Oct 28, 2025 at 10:47:53PM +0000, Carlos Llamas wrote:
> > Ok, I did a bit more digging. I'm using f2fs but the problem in this
> > case is the blk_crypto layer. The OP_READ request goes through
> > submit_bio() which then calls blk_crypto_bio_prep() and if the bio has
> > crypto context then it checks for bio_crypt_check_alignment().
> > 
> > This is where the LTP tests fails the alignment. However, the propagated
> > error goes through "bio->bi_status = BLK_STS_IOERR" which in bio_endio()
> > get translates to EIO due to blk_status_to_errno().
> > 
> > I've verified this restores the original behavior matching the LTP test,
> > so I'll write up a patch and send it a bit later.
> > 
> > diff --git a/block/blk-crypto.c b/block/blk-crypto.c
> > index 1336cbf5e3bd..a417843e7e4a 100644
> > --- a/block/blk-crypto.c
> > +++ b/block/blk-crypto.c
> > @@ -293,7 +293,7 @@ bool __blk_crypto_bio_prep(struct bio **bio_ptr)
> >  	}
> >  
> >  	if (!bio_crypt_check_alignment(bio)) {
> > -		bio->bi_status = BLK_STS_IOERR;
> > +		bio->bi_status = BLK_STS_INVAL;
> >  		goto fail;
> >  	}
> 
> That change looks fine, but I'm wondering how this case was reached in
> the first place.  Upper layers aren't supposed to be submitting
> misaligned bios like this.  For example, ext4 and f2fs require
> filesystem logical block size alignment for direct I/O on encrypted
> files.  They check for this early, before getting to the point of
> submitting a bio, and fall back to buffered I/O if needed.

I suppose it's this code in f2fs_should_use_dio():

	/*
	 * Direct I/O not aligned to the disk's logical_block_size will be
	 * attempted, but will fail with -EINVAL.
	 *
	 * f2fs additionally requires that direct I/O be aligned to the
	 * filesystem block size, which is often a stricter requirement.
	 * However, f2fs traditionally falls back to buffered I/O on requests
	 * that are logical_block_size-aligned but not fs-block aligned.
	 *
	 * The below logic implements this behavior.
	 */
	align = iocb->ki_pos | iov_iter_alignment(iter);
	if (!IS_ALIGNED(align, i_blocksize(inode)) &&
	    IS_ALIGNED(align, bdev_logical_block_size(inode->i_sb->s_bdev)))
		return false;

So it relies on the alignment check in iomap in the case where the
request is neither logical_block_size nor filesystem_block_size aligned.

f2fs_should_use_dio() probably should just handle that case explicitly.

But making __blk_crypto_bio_prep() use a better error code sounds good
too.

- Eric

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv4 5/8] iomap: simplify direct io validity check
  2025-10-28 23:03           ` Eric Biggers
@ 2025-10-29  7:06             ` Christoph Hellwig
  2025-10-30 17:40               ` Eric Biggers
  2025-10-30  4:54             ` Carlos Llamas
  1 sibling, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2025-10-29  7:06 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Carlos Llamas, Keith Busch, Keith Busch, linux-block,
	linux-fsdevel, linux-xfs, linux-ext4, hch, axboe, Hannes Reinecke,
	Martin K. Petersen

I think we need to take a step back and talk about what alignment
we're talking about here, as there are two dimensions to it.

The first dimension is: disk alignment vs memory alignment.

Disk alignment:
  Direct I/O obviously needs to be aligned to on-disk sectors to have
  a chance to work, as that is the lowest possible granularity of access.

  For fіle systems that write out of place we also need to align writes
  to the logical block size of the file system.

  With blk-crypto we need to align to the DUN if it is larger than the
  disk-sector dize.

Memory alignment:

  This is the alignment of the buffer in-memory.  Hardware only really
  cares about this when DMA engines discard the lowest bits, so a typical
  hardware alignment requirement is to only require a dword (4 byte)
  alignment.   For drivers that process the payload in software such
  low alignment have a tendency to cause bugs as they're not written
  thinking about it.  Similarly for any additional processing like
  encryption, parity or checksums.

The second dimension is for the entire operation vs individual vectors,
this has implications both for the disk and memory alignment.  Keith
has done work there recently to relax the alignment of the vectors to
only require the memory alignment, so that preadv/pwritev-like calls
can have lots of unaligned segments.

I think it's the latter that's tripping up here now.  Hard coding these
checks in the file systems seem like a bad idea, we really need to
advertise them in the queue limits, which is complicated by the fact that
we only want to do that for bios using block layer encryption. i.e., we
probably need a separate queue limit that mirrors dma_alignment, but only
for encrypted bios, and which is taken into account in the block layer
splitting and communicated up by file systems only for encrypted bios.
For blk-crypto-fallback we'd need DUN alignment so that the algorithms
just work (assuming the crypto API can't scatter over misaligned
segments), but for hardware blk-crypto I suspect that the normal DMA
engine rules apply, and we don't need to restrict alignment.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv4 5/8] iomap: simplify direct io validity check
  2025-10-28 23:03           ` Eric Biggers
  2025-10-29  7:06             ` Christoph Hellwig
@ 2025-10-30  4:54             ` Carlos Llamas
  1 sibling, 0 replies; 27+ messages in thread
From: Carlos Llamas @ 2025-10-30  4:54 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Keith Busch, Keith Busch, linux-block, linux-fsdevel, linux-xfs,
	linux-ext4, hch, axboe, Hannes Reinecke, Martin K. Petersen

On Tue, Oct 28, 2025 at 11:03:50PM +0000, Eric Biggers wrote:
> On Tue, Oct 28, 2025 at 10:56:48PM +0000, Eric Biggers wrote:
> > On Tue, Oct 28, 2025 at 10:47:53PM +0000, Carlos Llamas wrote:
> > > Ok, I did a bit more digging. I'm using f2fs but the problem in this
> > > case is the blk_crypto layer. The OP_READ request goes through
> > > submit_bio() which then calls blk_crypto_bio_prep() and if the bio has
> > > crypto context then it checks for bio_crypt_check_alignment().
> > > 
> > > This is where the LTP tests fails the alignment. However, the propagated
> > > error goes through "bio->bi_status = BLK_STS_IOERR" which in bio_endio()
> > > get translates to EIO due to blk_status_to_errno().
> > > 
> > > I've verified this restores the original behavior matching the LTP test,
> > > so I'll write up a patch and send it a bit later.
> > > 
> > > diff --git a/block/blk-crypto.c b/block/blk-crypto.c
> > > index 1336cbf5e3bd..a417843e7e4a 100644
> > > --- a/block/blk-crypto.c
> > > +++ b/block/blk-crypto.c
> > > @@ -293,7 +293,7 @@ bool __blk_crypto_bio_prep(struct bio **bio_ptr)
> > >  	}
> > >  
> > >  	if (!bio_crypt_check_alignment(bio)) {
> > > -		bio->bi_status = BLK_STS_IOERR;
> > > +		bio->bi_status = BLK_STS_INVAL;
> > >  		goto fail;
> > >  	}
> > 
> > That change looks fine, but I'm wondering how this case was reached in
> > the first place.  Upper layers aren't supposed to be submitting
> > misaligned bios like this.  For example, ext4 and f2fs require
> > filesystem logical block size alignment for direct I/O on encrypted
> > files.  They check for this early, before getting to the point of
> > submitting a bio, and fall back to buffered I/O if needed.
> 
> I suppose it's this code in f2fs_should_use_dio():
> 
> 	/*
> 	 * Direct I/O not aligned to the disk's logical_block_size will be
> 	 * attempted, but will fail with -EINVAL.
> 	 *
> 	 * f2fs additionally requires that direct I/O be aligned to the
> 	 * filesystem block size, which is often a stricter requirement.
> 	 * However, f2fs traditionally falls back to buffered I/O on requests
> 	 * that are logical_block_size-aligned but not fs-block aligned.
> 	 *
> 	 * The below logic implements this behavior.
> 	 */
> 	align = iocb->ki_pos | iov_iter_alignment(iter);
> 	if (!IS_ALIGNED(align, i_blocksize(inode)) &&
> 	    IS_ALIGNED(align, bdev_logical_block_size(inode->i_sb->s_bdev)))
> 		return false;
> 
> So it relies on the alignment check in iomap in the case where the
> request is neither logical_block_size nor filesystem_block_size aligned.
> 
> f2fs_should_use_dio() probably should just handle that case explicitly.
> 
> But making __blk_crypto_bio_prep() use a better error code sounds good
> too.

I realize this is a bit of a band-aid but here is the patch to fail the
bad alignment with EINVAL:
https://lore.kernel.org/all/20251030043919.2787231-1-cmllamas@google.com/

As for the more achitectural fix suggested by Christoph, I'm absolutely
out of my depth so I can't comment on that.

Cheers,
--
Carlos Llamas

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv4 5/8] iomap: simplify direct io validity check
  2025-10-29  7:06             ` Christoph Hellwig
@ 2025-10-30 17:40               ` Eric Biggers
  2025-10-31  9:18                 ` Christoph Hellwig
  0 siblings, 1 reply; 27+ messages in thread
From: Eric Biggers @ 2025-10-30 17:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Carlos Llamas, Keith Busch, Keith Busch, linux-block,
	linux-fsdevel, linux-xfs, linux-ext4, axboe, Hannes Reinecke,
	Martin K. Petersen

On Wed, Oct 29, 2025 at 08:06:18AM +0100, Christoph Hellwig wrote:
> I think we need to take a step back and talk about what alignment
> we're talking about here, as there are two dimensions to it.
> 
> The first dimension is: disk alignment vs memory alignment.
> 
> Disk alignment:
>   Direct I/O obviously needs to be aligned to on-disk sectors to have
>   a chance to work, as that is the lowest possible granularity of access.
> 
>   For fіle systems that write out of place we also need to align writes
>   to the logical block size of the file system.
> 
>   With blk-crypto we need to align to the DUN if it is larger than the
>   disk-sector dize.
> 
> Memory alignment:
> 
>   This is the alignment of the buffer in-memory.  Hardware only really
>   cares about this when DMA engines discard the lowest bits, so a typical
>   hardware alignment requirement is to only require a dword (4 byte)
>   alignment.   For drivers that process the payload in software such
>   low alignment have a tendency to cause bugs as they're not written
>   thinking about it.  Similarly for any additional processing like
>   encryption, parity or checksums.
> 
> The second dimension is for the entire operation vs individual vectors,
> this has implications both for the disk and memory alignment.  Keith
> has done work there recently to relax the alignment of the vectors to
> only require the memory alignment, so that preadv/pwritev-like calls
> can have lots of unaligned segments.
> 
> I think it's the latter that's tripping up here now.  Hard coding these
> checks in the file systems seem like a bad idea, we really need to
> advertise them in the queue limits, which is complicated by the fact that
> we only want to do that for bios using block layer encryption. i.e., we
> probably need a separate queue limit that mirrors dma_alignment, but only
> for encrypted bios, and which is taken into account in the block layer
> splitting and communicated up by file systems only for encrypted bios.
> For blk-crypto-fallback we'd need DUN alignment so that the algorithms
> just work (assuming the crypto API can't scatter over misaligned
> segments), but for hardware blk-crypto I suspect that the normal DMA
> engine rules apply, and we don't need to restrict alignment.

Allowing DIO segments to be aligned (in memory address and/or length) to
less than crypto_data_unit_size on encrypted files has been attempted
and discussed before.  Read the cover letter of
https://lore.kernel.org/linux-fscrypt/20220128233940.79464-1-ebiggers@kernel.org/

We eventually decided to proceed with DIO support without it, since it
would have added a lot of complexity.  It would have made the bio
splitting code in the block layer split bios at boundaries where the
length isn't aligned to crypto_data_unit_size, it would have caused a
lot of trouble for blk-crypto-fallback, and it even would have been
incompatible with some of the hardware drivers (e.g. ufs-exynos.c).

It also didn't seem to be all that useful, and it would have introduced
edge cases that don't get tested much.  All reachable to unprivileged
userspace code too, of course.

I can't say that the idea seems all that great to me.

We can always reconsider and still add support for this.  But it's not
clear to me what's changed.

- Eric

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv4 5/8] iomap: simplify direct io validity check
  2025-10-30 17:40               ` Eric Biggers
@ 2025-10-31  9:18                 ` Christoph Hellwig
  2025-11-03 18:10                   ` Eric Biggers
  0 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2025-10-31  9:18 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Christoph Hellwig, Carlos Llamas, Keith Busch, Keith Busch,
	linux-block, linux-fsdevel, linux-xfs, linux-ext4, axboe,
	Hannes Reinecke, Martin K. Petersen

On Thu, Oct 30, 2025 at 10:40:15AM -0700, Eric Biggers wrote:
> Allowing DIO segments to be aligned (in memory address and/or length) to
> less than crypto_data_unit_size on encrypted files has been attempted
> and discussed before.  Read the cover letter of
> https://lore.kernel.org/linux-fscrypt/20220128233940.79464-1-ebiggers@kernel.org/

Hmm, where does "First, it
necessarily causes it to be possible that crypto data units span bvecs.
Splits cannot occur at such locations; however the block layer currently
assumes that bios can be split at any bvec boundary.? come from?  The
block layer splits at arbitrary boundaries that don't need any kind of
bvec alignment.

> We eventually decided to proceed with DIO support without it, since it
> would have added a lot of complexity.  It would have made the bio
> splitting code in the block layer split bios at boundaries where the
> length isn't aligned to crypto_data_unit_size, it would have caused a
> lot of trouble for blk-crypto-fallback, and it even would have been
> incompatible with some of the hardware drivers (e.g. ufs-exynos.c).

Ok, if hardware drivers can't handle it that's a good argument.  I can
see why handling it in the software case is very annoying, but non-stupid
hardware should not be affected.  Stupid me assuming UFS might not be
dead stupid of course.

> It also didn't seem to be all that useful, and it would have introduced
> edge cases that don't get tested much.  All reachable to unprivileged
> userspace code too, of course.

xfstests just started exercising this and we're getting lots of interesting
reports (for the non-fscrypt case).


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv4 5/8] iomap: simplify direct io validity check
  2025-10-31  9:18                 ` Christoph Hellwig
@ 2025-11-03 18:10                   ` Eric Biggers
  2025-11-03 18:26                     ` Keith Busch
  0 siblings, 1 reply; 27+ messages in thread
From: Eric Biggers @ 2025-11-03 18:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Carlos Llamas, Keith Busch, Keith Busch, linux-block,
	linux-fsdevel, linux-xfs, linux-ext4, axboe, Hannes Reinecke,
	Martin K. Petersen

On Fri, Oct 31, 2025 at 10:18:20AM +0100, Christoph Hellwig wrote:
> On Thu, Oct 30, 2025 at 10:40:15AM -0700, Eric Biggers wrote:
> > Allowing DIO segments to be aligned (in memory address and/or length) to
> > less than crypto_data_unit_size on encrypted files has been attempted
> > and discussed before.  Read the cover letter of
> > https://lore.kernel.org/linux-fscrypt/20220128233940.79464-1-ebiggers@kernel.org/
> 
> Hmm, where does "First, it
> necessarily causes it to be possible that crypto data units span bvecs.
> Splits cannot occur at such locations; however the block layer currently
> assumes that bios can be split at any bvec boundary.? come from?  The
> block layer splits at arbitrary boundaries that don't need any kind of
> bvec alignment.

While splits in general can occur on any logical_block_size boundary,
the last I checked if things are set up properly the splitting is much
better behaved than that in practice.

For example, if max_segment_size is a multiple of logical_block_size
size but not the crypto_data_unit_size being used, then that would of
course cause incorrect splits.  However, that's not an issue if the
driver sets max_segment_size to be a multiple of its largest supported
crypto_data_unit_size.  For example, the UFS driver defaults to
max_segment_size=SZ_256K, which is nicely aligned.

I'm sure there are other examples, related to dm devices,
virt_boundary_mask, and so on.  But the point is that they don't seem to
have been applicable to anyone actually using
crypto_data_unit_size > logical_block_size yet.

In contrast, allowing DIO with memory alignment < crypto_data_unit_size
makes the current code trivially start generating incorrect splits due
to the max_segments based splitting.

I'm not sure the current situation of the block layer not really paying
attention to crypto_data_unit_size is sustainable.  But when someone
looked into making the "only split on crypto_data_unit_size" boundaries
guarantee explicit
(https://lkml.kernel.org/linux-block/20210707052943.3960-1-satyaprateek2357@gmail.com/),
it turned out to be a lot of work.  And no one could find a real-world
example where it actually mattered, besides DIO with memory alignment <
crypto_data_unit_size which didn't seem that useful in the first place.
So that's why we're still in the current situation.

> > We eventually decided to proceed with DIO support without it, since it
> > would have added a lot of complexity.  It would have made the bio
> > splitting code in the block layer split bios at boundaries where the
> > length isn't aligned to crypto_data_unit_size, it would have caused a
> > lot of trouble for blk-crypto-fallback, and it even would have been
> > incompatible with some of the hardware drivers (e.g. ufs-exynos.c).
> 
> Ok, if hardware drivers can't handle it that's a good argument.  I can
> see why handling it in the software case is very annoying, but non-stupid
> hardware should not be affected.  Stupid me assuming UFS might not be
> dead stupid of course.
> 
> > It also didn't seem to be all that useful, and it would have introduced
> > edge cases that don't get tested much.  All reachable to unprivileged
> > userspace code too, of course.
> 
> xfstests just started exercising this and we're getting lots of interesting
> reports (for the non-fscrypt case).

Great to hear that it's starting to be tested.  But it's concerning that
it's just happening now, 3 years after the patches went in, and is also
still finding lots of bugs.  It's hard for me to understand how it was
ready, or even useful, in the first place.

- Eric

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv4 5/8] iomap: simplify direct io validity check
  2025-11-03 18:10                   ` Eric Biggers
@ 2025-11-03 18:26                     ` Keith Busch
  2025-11-04 11:35                       ` Christoph Hellwig
  0 siblings, 1 reply; 27+ messages in thread
From: Keith Busch @ 2025-11-03 18:26 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Christoph Hellwig, Carlos Llamas, Keith Busch, linux-block,
	linux-fsdevel, linux-xfs, linux-ext4, axboe, Hannes Reinecke,
	Martin K. Petersen

On Mon, Nov 03, 2025 at 10:10:31AM -0800, Eric Biggers wrote:
> On Fri, Oct 31, 2025 at 10:18:20AM +0100, Christoph Hellwig wrote:
> > 
> > xfstests just started exercising this and we're getting lots of interesting
> > reports (for the non-fscrypt case).
> 
> Great to hear that it's starting to be tested.  But it's concerning that
> it's just happening now, 3 years after the patches went in, and is also
> still finding lots of bugs.  It's hard for me to understand how it was
> ready, or even useful, in the first place.

I've been using these memory alignment capabilities in production for
quite some time without issue on real hardware, and it's proven very
useful at reducing memory and cpu utilization because that's really how
the data alignment comes into the services responisble for running the
disk io, and the alignment is outside the service's control.

Christoph is testing different use cases with check summing and finding
much of the infrastructure wasn't ready to accept the more arbitrary
memory offsets and lengths.

Finding it now is more of an indication that those two use cases haven't
overlapped.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHv4 5/8] iomap: simplify direct io validity check
  2025-11-03 18:26                     ` Keith Busch
@ 2025-11-04 11:35                       ` Christoph Hellwig
  0 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2025-11-04 11:35 UTC (permalink / raw)
  To: Keith Busch
  Cc: Eric Biggers, Christoph Hellwig, Carlos Llamas, Keith Busch,
	linux-block, linux-fsdevel, linux-xfs, linux-ext4, axboe,
	Hannes Reinecke, Martin K. Petersen

On Mon, Nov 03, 2025 at 11:26:10AM -0700, Keith Busch wrote:
> I've been using these memory alignment capabilities in production for
> quite some time without issue on real hardware, and it's proven very
> useful at reducing memory and cpu utilization because that's really how
> the data alignment comes into the services responisble for running the
> disk io, and the alignment is outside the service's control.
> 
> Christoph is testing different use cases with check summing and finding
> much of the infrastructure wasn't ready to accept the more arbitrary
> memory offsets and lengths.

Well, I've mostly just wired up the alignment reporting to the legacy
XFS ioctl, and now xfstests actually tried it.  Which found breakage
in the PI code and in null_blk pretty quickly, and I suspect it might
find more in drivers that access the data in software and don't just
DMA map it.


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2025-11-04 11:35 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-27 14:12 [PATCHv4 0/8] Keith Busch
2025-08-27 14:12 ` [PATCHv4 1/8] block: check for valid bio while splitting Keith Busch
2025-08-31  0:40   ` Martin K. Petersen
2025-08-27 14:12 ` [PATCHv4 2/8] block: add size alignment to bio_iov_iter_get_pages Keith Busch
2025-08-31  0:40   ` Martin K. Petersen
2025-08-27 14:12 ` [PATCHv4 3/8] block: align the bio after building it Keith Busch
2025-08-31  0:41   ` Martin K. Petersen
2025-09-02  5:23   ` Christoph Hellwig
2025-08-27 14:12 ` [PATCHv4 4/8] block: simplify direct io validity check Keith Busch
2025-08-27 14:12 ` [PATCHv4 5/8] iomap: " Keith Busch
2025-10-27 16:25   ` Carlos Llamas
2025-10-27 16:42     ` Keith Busch
2025-10-27 17:12       ` Carlos Llamas
2025-10-28 22:47       ` Carlos Llamas
2025-10-28 22:56         ` Eric Biggers
2025-10-28 23:03           ` Eric Biggers
2025-10-29  7:06             ` Christoph Hellwig
2025-10-30 17:40               ` Eric Biggers
2025-10-31  9:18                 ` Christoph Hellwig
2025-11-03 18:10                   ` Eric Biggers
2025-11-03 18:26                     ` Keith Busch
2025-11-04 11:35                       ` Christoph Hellwig
2025-10-30  4:54             ` Carlos Llamas
2025-08-27 14:12 ` [PATCHv4 6/8] block: remove bdev_iter_is_aligned Keith Busch
2025-08-27 14:12 ` [PATCHv4 7/8] blk-integrity: use simpler alignment check Keith Busch
2025-08-27 14:12 ` [PATCHv4 8/8] iov_iter: remove iov_iter_is_aligned Keith Busch
2025-09-09 16:27 ` [PATCHv4 0/8] Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).