linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCHv3 0/8] direct-io: even more flexible io vectors
@ 2025-08-19 16:49 Keith Busch
  2025-08-19 16:49 ` [PATCHv3 1/8] block: check for valid bio while splitting Keith Busch
                   ` (10 more replies)
  0 siblings, 11 replies; 42+ messages in thread
From: Keith Busch @ 2025-08-19 16:49 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-kernel
  Cc: snitzer, axboe, dw, brauner, hch, martin.petersen, djwong,
	linux-xfs, viro, Keith Busch

From: Keith Busch <kbusch@kernel.org>

Previous version:

  https://lore.kernel.org/linux-block/20250805141123.332298-1-kbusch@meta.com/

This series removes the direct io requirement that io vector lengths
align to the logical block size.

I tested this on a few raw block device types including nvme,
virtio-blk, ahci, and loop. NVMe is the only one I tested with 4k
logical sectors; everything else was 512.

On each of those, I tested several iomap filesystems: xfs, ext4, and
btrfs. I found it interesting that each behave a little
differently with handling invalid vector alignments:

  - XFS is the most straight forward and reports failures on invalid
    vector conditions, same as raw blocks devices.

  - EXT4 falls back to buffered io for writes but not for reads.

  - BTRFS doesn't even try direct io for any unusual alignments; it
    chooses buffered io from the start.

So it has been a little slow going figuring out which results to expect
from various tests, but I think I've got all the corner cases covered. I
can submit the tests cases to blktests and fstests for consideration
separately, too.

I'm not 100% sure where we're at with the last patch. I think Mike
initially indicated this was okay to remove, but I could swear I read
something saying that might not be the case anymore. I just can't find
the message now. Mike?

Changes from v2:

  Include vector lengths when validating a split. The length check is
  only valid for r/w commands, and skipped for passthrough
  DRV_IN/DRV_OUT commands.

  Introduce a prep patch having bio_iov_iter_get_pages() take the
  caller's desired length alignment.

  Additional code comments explaing less obvious error conditions.

  Added reviews on the patches that haven't changed.

Keith Busch (8):
  block: check for valid bio while splitting
  block: add size alignment to bio_iov_iter_get_pages
  block: align the bio after building it
  block: simplify direct io validity check
  iomap: simplify direct io validity check
  block: remove bdev_iter_is_aligned
  blk-integrity: use simpler alignment check
  iov_iter: remove iov_iter_is_aligned

 block/bio-integrity.c  |  4 +-
 block/bio.c            | 64 ++++++++++++++++++----------
 block/blk-map.c        |  2 +-
 block/blk-merge.c      | 20 +++++++--
 block/fops.c           | 13 +++---
 fs/iomap/direct-io.c   |  6 +--
 include/linux/bio.h    | 13 ++++--
 include/linux/blkdev.h | 20 +++++----
 include/linux/uio.h    |  2 -
 lib/iov_iter.c         | 95 ------------------------------------------
 10 files changed, 94 insertions(+), 145 deletions(-)

-- 
2.47.3


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCHv3 1/8] block: check for valid bio while splitting
  2025-08-19 16:49 [PATCHv3 0/8] direct-io: even more flexible io vectors Keith Busch
@ 2025-08-19 16:49 ` Keith Busch
  2025-08-20  7:02   ` Damien Le Moal
                     ` (2 more replies)
  2025-08-19 16:49 ` [PATCHv3 2/8] block: add size alignment to bio_iov_iter_get_pages Keith Busch
                   ` (9 subsequent siblings)
  10 siblings, 3 replies; 42+ messages in thread
From: Keith Busch @ 2025-08-19 16:49 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-kernel
  Cc: snitzer, axboe, dw, brauner, hch, martin.petersen, djwong,
	linux-xfs, viro, Keith Busch

From: Keith Busch <kbusch@kernel.org>

We're already iterating every segment, so check these for a valid IO at
the same time.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/blk-map.c        |  2 +-
 block/blk-merge.c      | 20 ++++++++++++++++----
 include/linux/bio.h    |  4 ++--
 include/linux/blkdev.h | 13 +++++++++++++
 4 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/block/blk-map.c b/block/blk-map.c
index 23e5d5ebe59ec..c5da9d37deee9 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -443,7 +443,7 @@ int blk_rq_append_bio(struct request *rq, struct bio *bio)
 	int ret;
 
 	/* check that the data layout matches the hardware restrictions */
-	ret = bio_split_rw_at(bio, lim, &nr_segs, max_bytes);
+	ret = bio_split_drv_at(bio, lim, &nr_segs, max_bytes);
 	if (ret) {
 		/* if we would have to split the bio, copy instead */
 		if (ret > 0)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 70d704615be52..a0d8364983000 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -279,25 +279,29 @@ static unsigned int bio_split_alignment(struct bio *bio,
 }
 
 /**
- * bio_split_rw_at - check if and where to split a read/write bio
+ * bio_split_io_at - check if and where to split a read/write bio
  * @bio:  [in] bio to be split
  * @lim:  [in] queue limits to split based on
  * @segs: [out] number of segments in the bio with the first half of the sectors
  * @max_bytes: [in] maximum number of bytes per bio
+ * @len_align: [in] length alignment for each vector
  *
  * Find out if @bio needs to be split to fit the queue limits in @lim and a
  * maximum size of @max_bytes.  Returns a negative error number if @bio can't be
  * split, 0 if the bio doesn't have to be split, or a positive sector offset if
  * @bio needs to be split.
  */
-int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
-		unsigned *segs, unsigned max_bytes)
+int bio_split_io_at(struct bio *bio, const struct queue_limits *lim,
+		unsigned *segs, unsigned max_bytes, unsigned len_align)
 {
 	struct bio_vec bv, bvprv, *bvprvp = NULL;
 	struct bvec_iter iter;
 	unsigned nsegs = 0, bytes = 0;
 
 	bio_for_each_bvec(bv, bio, iter) {
+		if (bv.bv_offset & lim->dma_alignment || bv.bv_len & len_align)
+			return -EINVAL;
+
 		/*
 		 * If the queue doesn't support SG gaps and adding this
 		 * offset would create a gap, disallow it.
@@ -339,8 +343,16 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
 	 * Individual bvecs might not be logical block aligned. Round down the
 	 * split size so that each bio is properly block size aligned, even if
 	 * we do not use the full hardware limits.
+	 *
+	 * Misuse may submit a bio that can't be split into a valid io. There
+	 * may either be too many discontiguous vectors for the max segments
+	 * limit, or contain virtual boundary gaps without having a valid block
+	 * sized split. Catch that condition by checking for a zero byte
+	 * result.
 	 */
 	bytes = ALIGN_DOWN(bytes, bio_split_alignment(bio, lim));
+	if (!bytes)
+		return -EINVAL;
 
 	/*
 	 * Bio splitting may cause subtle trouble such as hang when doing sync
@@ -350,7 +362,7 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
 	bio_clear_polled(bio);
 	return bytes >> SECTOR_SHIFT;
 }
-EXPORT_SYMBOL_GPL(bio_split_rw_at);
+EXPORT_SYMBOL_GPL(bio_split_io_at);
 
 struct bio *bio_split_rw(struct bio *bio, const struct queue_limits *lim,
 		unsigned *nr_segs)
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 46ffac5caab78..519a1d59805f8 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -322,8 +322,8 @@ static inline void bio_next_folio(struct folio_iter *fi, struct bio *bio)
 void bio_trim(struct bio *bio, sector_t offset, sector_t size);
 extern struct bio *bio_split(struct bio *bio, int sectors,
 			     gfp_t gfp, struct bio_set *bs);
-int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
-		unsigned *segs, unsigned max_bytes);
+int bio_split_io_at(struct bio *bio, const struct queue_limits *lim,
+		unsigned *segs, unsigned max_bytes, unsigned len_align);
 
 /**
  * bio_next_split - get next @sectors from a bio, splitting if necessary
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 95886b404b16b..7f83ad2df5425 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1869,6 +1869,19 @@ bdev_atomic_write_unit_max_bytes(struct block_device *bdev)
 	return queue_atomic_write_unit_max_bytes(bdev_get_queue(bdev));
 }
 
+static inline int bio_split_rw_at(struct bio *bio,
+		const struct queue_limits *lim,
+		unsigned *segs, unsigned max_bytes)
+{
+	return bio_split_io_at(bio, lim, segs, max_bytes, lim->dma_alignment);
+}
+
+static inline int bio_split_drv_at(struct bio *bio,
+		const struct queue_limits *lim,
+		unsigned *segs, unsigned max_bytes)
+{
+	return bio_split_io_at(bio, lim, segs, max_bytes, 0);
+}
 #define DEFINE_IO_COMP_BATCH(name)	struct io_comp_batch name = { }
 
 #endif /* _LINUX_BLKDEV_H */
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCHv3 2/8] block: add size alignment to bio_iov_iter_get_pages
  2025-08-19 16:49 [PATCHv3 0/8] direct-io: even more flexible io vectors Keith Busch
  2025-08-19 16:49 ` [PATCHv3 1/8] block: check for valid bio while splitting Keith Busch
@ 2025-08-19 16:49 ` Keith Busch
  2025-08-25  7:36   ` Christoph Hellwig
  2025-08-19 16:49 ` [PATCHv3 3/8] block: align the bio after building it Keith Busch
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2025-08-19 16:49 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-kernel
  Cc: snitzer, axboe, dw, brauner, hch, martin.petersen, djwong,
	linux-xfs, viro, Keith Busch

From: Keith Busch <kbusch@kernel.org>

The block layer tries to align bio vectors to the block device's logical
block size. Some cases don't have a block device, or we may want to align
to something larger, which we can't derive it from the queue limits.
Have the caller specify what they want, or allow any length alignment if
nothing was specified.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/bio.c          | 19 +++++++++++--------
 block/fops.c         |  9 ++++++---
 fs/iomap/direct-io.c |  3 ++-
 include/linux/bio.h  |  9 ++++++++-
 4 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 3b371a5da159e..44286db14355f 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1204,7 +1204,8 @@ static unsigned int get_contig_folio_len(unsigned int *num_pages,
  * For a multi-segment *iter, this function only adds pages from the next
  * non-empty segment of the iov iterator.
  */
-static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
+static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
+				    unsigned len_align_mask)
 {
 	iov_iter_extraction_t extraction_flags = 0;
 	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
@@ -1213,7 +1214,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	struct page **pages = (struct page **)bv;
 	ssize_t size;
 	unsigned int num_pages, i = 0;
-	size_t offset, folio_offset, left, len;
+	size_t offset, folio_offset, left, len, trim;
 	int ret = 0;
 
 	/*
@@ -1242,8 +1243,8 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
 	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
 
-	if (bio->bi_bdev) {
-		size_t trim = size & (bdev_logical_block_size(bio->bi_bdev) - 1);
+	trim = size & len_align_mask;
+	if (trim) {
 		iov_iter_revert(iter, trim);
 		size -= trim;
 	}
@@ -1298,9 +1299,10 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 }
 
 /**
- * bio_iov_iter_get_pages - add user or kernel pages to a bio
+ * bio_iov_iter_get_pages_aligned - add user or kernel pages to a bio
  * @bio: bio to add pages to
  * @iter: iov iterator describing the region to be added
+ * @len_align_mask: the mask to align each vector size to, 0 for any length
  *
  * This takes either an iterator pointing to user memory, or one pointing to
  * kernel pages (BVEC iterator). If we're adding user pages, we pin them and
@@ -1317,7 +1319,8 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
  * MM encounters an error pinning the requested pages, it stops. Error
  * is returned only if 0 pages could be pinned.
  */
-int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
+int bio_iov_iter_get_pages_aligned(struct bio *bio, struct iov_iter *iter,
+			   unsigned len_align_mask)
 {
 	int ret = 0;
 
@@ -1333,12 +1336,12 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	if (iov_iter_extract_will_pin(iter))
 		bio_set_flag(bio, BIO_PAGE_PINNED);
 	do {
-		ret = __bio_iov_iter_get_pages(bio, iter);
+		ret = __bio_iov_iter_get_pages(bio, iter, len_align_mask);
 	} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
 
 	return bio->bi_vcnt ? 0 : ret;
 }
-EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages);
+EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages_aligned);
 
 static void submit_bio_wait_endio(struct bio *bio)
 {
diff --git a/block/fops.c b/block/fops.c
index 82451ac8ff25d..6d5c1e680d4a7 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -78,7 +78,8 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 	if (iocb->ki_flags & IOCB_ATOMIC)
 		bio.bi_opf |= REQ_ATOMIC;
 
-	ret = bio_iov_iter_get_pages(&bio, iter);
+	ret = bio_iov_iter_get_pages_aligned(&bio, iter,
+				     bdev_logical_block_size(bdev) - 1);
 	if (unlikely(ret))
 		goto out;
 	ret = bio.bi_iter.bi_size;
@@ -212,7 +213,8 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 		bio->bi_end_io = blkdev_bio_end_io;
 		bio->bi_ioprio = iocb->ki_ioprio;
 
-		ret = bio_iov_iter_get_pages(bio, iter);
+		ret = bio_iov_iter_get_pages_aligned(bio, iter,
+					     bdev_logical_block_size(bdev) - 1);
 		if (unlikely(ret)) {
 			bio->bi_status = BLK_STS_IOERR;
 			bio_endio(bio);
@@ -348,7 +350,8 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 		 */
 		bio_iov_bvec_set(bio, iter);
 	} else {
-		ret = bio_iov_iter_get_pages(bio, iter);
+		ret = bio_iov_iter_get_pages_aligned(bio, iter,
+					     bdev_logical_block_size(bdev) - 1);
 		if (unlikely(ret))
 			goto out_bio_put;
 	}
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 6f25d4cfea9f7..213764bdee8f2 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -434,7 +434,8 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
 		bio->bi_private = dio;
 		bio->bi_end_io = iomap_dio_bio_end_io;
 
-		ret = bio_iov_iter_get_pages(bio, dio->submit.iter);
+		ret = bio_iov_iter_get_pages_aligned(bio, dio->submit.iter,
+				bdev_logical_block_size(iomap->bdev) - 1);
 		if (unlikely(ret)) {
 			/*
 			 * We have to stop part way through an IO. We must fall
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 519a1d59805f8..788a50ff319e3 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -441,7 +441,14 @@ int submit_bio_wait(struct bio *bio);
 int bdev_rw_virt(struct block_device *bdev, sector_t sector, void *data,
 		size_t len, enum req_op op);
 
-int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter);
+int bio_iov_iter_get_pages_aligned(struct bio *bio, struct iov_iter *iter,
+		unsigned len_align_mask);
+
+static inline int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
+{
+	return bio_iov_iter_get_pages_aligned(bio, iter, 0);
+}
+
 void bio_iov_bvec_set(struct bio *bio, const struct iov_iter *iter);
 void __bio_release_pages(struct bio *bio, bool mark_dirty);
 extern void bio_set_pages_dirty(struct bio *bio);
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCHv3 3/8] block: align the bio after building it
  2025-08-19 16:49 [PATCHv3 0/8] direct-io: even more flexible io vectors Keith Busch
  2025-08-19 16:49 ` [PATCHv3 1/8] block: check for valid bio while splitting Keith Busch
  2025-08-19 16:49 ` [PATCHv3 2/8] block: add size alignment to bio_iov_iter_get_pages Keith Busch
@ 2025-08-19 16:49 ` Keith Busch
  2025-08-20  7:07   ` Damien Le Moal
                     ` (2 more replies)
  2025-08-19 16:49 ` [PATCHv3 4/8] block: simplify direct io validity check Keith Busch
                   ` (7 subsequent siblings)
  10 siblings, 3 replies; 42+ messages in thread
From: Keith Busch @ 2025-08-19 16:49 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-kernel
  Cc: snitzer, axboe, dw, brauner, hch, martin.petersen, djwong,
	linux-xfs, viro, Keith Busch

From: Keith Busch <kbusch@kernel.org>

Instead of ensuring each vector is block size aligned while constructing
the bio, just ensure the entire size is aligned after it's built. This
makes getting bio pages more flexible to accepting device valid io
vectors that would otherwise get rejected by alignment checks.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/bio.c | 65 ++++++++++++++++++++++++++++++++---------------------
 1 file changed, 40 insertions(+), 25 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 44286db14355f..00bd5c76c461f 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1204,8 +1204,7 @@ static unsigned int get_contig_folio_len(unsigned int *num_pages,
  * For a multi-segment *iter, this function only adds pages from the next
  * non-empty segment of the iov iterator.
  */
-static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
-				    unsigned len_align_mask)
+static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
 	iov_iter_extraction_t extraction_flags = 0;
 	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
@@ -1214,7 +1213,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
 	struct page **pages = (struct page **)bv;
 	ssize_t size;
 	unsigned int num_pages, i = 0;
-	size_t offset, folio_offset, left, len, trim;
+	size_t offset, folio_offset, left, len;
 	int ret = 0;
 
 	/*
@@ -1228,13 +1227,6 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
 	if (bio->bi_bdev && blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
 		extraction_flags |= ITER_ALLOW_P2PDMA;
 
-	/*
-	 * Each segment in the iov is required to be a block size multiple.
-	 * However, we may not be able to get the entire segment if it spans
-	 * more pages than bi_max_vecs allows, so we have to ALIGN_DOWN the
-	 * result to ensure the bio's total size is correct. The remainder of
-	 * the iov data will be picked up in the next bio iteration.
-	 */
 	size = iov_iter_extract_pages(iter, &pages,
 				      UINT_MAX - bio->bi_iter.bi_size,
 				      nr_pages, extraction_flags, &offset);
@@ -1242,18 +1234,6 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
 		return size ? size : -EFAULT;
 
 	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
-
-	trim = size & len_align_mask;
-	if (trim) {
-		iov_iter_revert(iter, trim);
-		size -= trim;
-	}
-
-	if (unlikely(!size)) {
-		ret = -EFAULT;
-		goto out;
-	}
-
 	for (left = size, i = 0; left > 0; left -= len, i += num_pages) {
 		struct page *page = pages[i];
 		struct folio *folio = page_folio(page);
@@ -1298,11 +1278,44 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
 	return ret;
 }
 
+/*
+ * Aligns the bio size to the len_align_mask, releasing any excessive bio vecs
+ * that  __bio_iov_iter_get_pages may have inserted and reverts that length for
+ * the next iteration.
+ */
+static int bio_align(struct bio *bio, struct iov_iter *iter,
+			    unsigned len_align_mask)
+{
+	size_t nbytes = bio->bi_iter.bi_size & len_align_mask;
+
+	if (!nbytes)
+		return 0;
+
+	iov_iter_revert(iter, nbytes);
+	bio->bi_iter.bi_size -= nbytes;
+	while (nbytes) {
+		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
+
+		if (nbytes < bv->bv_len) {
+			bv->bv_len -= nbytes;
+			nbytes = 0;
+		} else {
+			bio_release_page(bio, bv->bv_page);
+			bio->bi_vcnt--;
+			nbytes -= bv->bv_len;
+		}
+	}
+
+	if (!bio->bi_iter.bi_size)
+		return -EFAULT;
+	return 0;
+}
+
 /**
  * bio_iov_iter_get_pages_aligned - add user or kernel pages to a bio
  * @bio: bio to add pages to
  * @iter: iov iterator describing the region to be added
- * @len_align_mask: the mask to align each vector size to, 0 for any length
+ * @len_align_mask: the mask to align the total size to, 0 for any length
  *
  * This takes either an iterator pointing to user memory, or one pointing to
  * kernel pages (BVEC iterator). If we're adding user pages, we pin them and
@@ -1336,10 +1349,12 @@ int bio_iov_iter_get_pages_aligned(struct bio *bio, struct iov_iter *iter,
 	if (iov_iter_extract_will_pin(iter))
 		bio_set_flag(bio, BIO_PAGE_PINNED);
 	do {
-		ret = __bio_iov_iter_get_pages(bio, iter, len_align_mask);
+		ret = __bio_iov_iter_get_pages(bio, iter);
 	} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
 
-	return bio->bi_vcnt ? 0 : ret;
+	if (bio->bi_vcnt && len_align_mask)
+		return bio_align(bio, iter, len_align_mask);
+	return ret;
 }
 EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages_aligned);
 
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCHv3 4/8] block: simplify direct io validity check
  2025-08-19 16:49 [PATCHv3 0/8] direct-io: even more flexible io vectors Keith Busch
                   ` (2 preceding siblings ...)
  2025-08-19 16:49 ` [PATCHv3 3/8] block: align the bio after building it Keith Busch
@ 2025-08-19 16:49 ` Keith Busch
  2025-08-25  7:48   ` Christoph Hellwig
  2025-08-19 16:49 ` [PATCHv3 5/8] iomap: " Keith Busch
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2025-08-19 16:49 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-kernel
  Cc: snitzer, axboe, dw, brauner, hch, martin.petersen, djwong,
	linux-xfs, viro, Keith Busch, Hannes Reinecke

From: Keith Busch <kbusch@kernel.org>

The block layer checks all the segments for validity later, so no need
for an early check. Just reduce it to a simple position and total length
check, and defer the more invasive segment checks to the block layer.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
---
 block/fops.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index 6d5c1e680d4a7..32a8e6fdc7338 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -38,8 +38,8 @@ static blk_opf_t dio_bio_write_op(struct kiocb *iocb)
 static bool blkdev_dio_invalid(struct block_device *bdev, struct kiocb *iocb,
 				struct iov_iter *iter)
 {
-	return iocb->ki_pos & (bdev_logical_block_size(bdev) - 1) ||
-		!bdev_iter_is_aligned(bdev, iter);
+	return (iocb->ki_pos | iov_iter_count(iter)) &
+			(bdev_logical_block_size(bdev) - 1);
 }
 
 #define DIO_INLINE_BIO_VECS 4
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCHv3 5/8] iomap: simplify direct io validity check
  2025-08-19 16:49 [PATCHv3 0/8] direct-io: even more flexible io vectors Keith Busch
                   ` (3 preceding siblings ...)
  2025-08-19 16:49 ` [PATCHv3 4/8] block: simplify direct io validity check Keith Busch
@ 2025-08-19 16:49 ` Keith Busch
  2025-08-25  7:48   ` Christoph Hellwig
  2025-08-19 16:49 ` [PATCHv3 6/8] block: remove bdev_iter_is_aligned Keith Busch
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2025-08-19 16:49 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-kernel
  Cc: snitzer, axboe, dw, brauner, hch, martin.petersen, djwong,
	linux-xfs, viro, Keith Busch, Hannes Reinecke

From: Keith Busch <kbusch@kernel.org>

The block layer checks all the segments for validity later, so no need
for an early check. Just reduce it to a simple position and total length
check, and defer the more invasive segment checks to the block layer.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
---
 fs/iomap/direct-io.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 213764bdee8f2..33ecdb11e331e 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -337,8 +337,7 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
 	u64 copied = 0;
 	size_t orig_count;
 
-	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
-	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
+	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1))
 		return -EINVAL;
 
 	if (dio->flags & IOMAP_DIO_WRITE) {
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCHv3 6/8] block: remove bdev_iter_is_aligned
  2025-08-19 16:49 [PATCHv3 0/8] direct-io: even more flexible io vectors Keith Busch
                   ` (4 preceding siblings ...)
  2025-08-19 16:49 ` [PATCHv3 5/8] iomap: " Keith Busch
@ 2025-08-19 16:49 ` Keith Busch
  2025-08-25  7:48   ` Christoph Hellwig
  2025-08-19 16:49 ` [PATCHv3 7/8] blk-integrity: use simpler alignment check Keith Busch
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2025-08-19 16:49 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-kernel
  Cc: snitzer, axboe, dw, brauner, hch, martin.petersen, djwong,
	linux-xfs, viro, Keith Busch, Hannes Reinecke

From: Keith Busch <kbusch@kernel.org>

No more callers.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
---
 include/linux/blkdev.h | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7f83ad2df5425..a2128af3f08cf 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1589,13 +1589,6 @@ static inline unsigned int bdev_dma_alignment(struct block_device *bdev)
 	return queue_dma_alignment(bdev_get_queue(bdev));
 }
 
-static inline bool bdev_iter_is_aligned(struct block_device *bdev,
-					struct iov_iter *iter)
-{
-	return iov_iter_is_aligned(iter, bdev_dma_alignment(bdev),
-				   bdev_logical_block_size(bdev) - 1);
-}
-
 static inline unsigned int
 blk_lim_dma_alignment_and_pad(struct queue_limits *lim)
 {
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCHv3 7/8] blk-integrity: use simpler alignment check
  2025-08-19 16:49 [PATCHv3 0/8] direct-io: even more flexible io vectors Keith Busch
                   ` (5 preceding siblings ...)
  2025-08-19 16:49 ` [PATCHv3 6/8] block: remove bdev_iter_is_aligned Keith Busch
@ 2025-08-19 16:49 ` Keith Busch
  2025-08-25  7:49   ` Christoph Hellwig
  2025-08-19 16:49 ` [PATCHv3 8/8] iov_iter: remove iov_iter_is_aligned Keith Busch
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2025-08-19 16:49 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-kernel
  Cc: snitzer, axboe, dw, brauner, hch, martin.petersen, djwong,
	linux-xfs, viro, Keith Busch, Hannes Reinecke

From: Keith Busch <kbusch@kernel.org>

We're checking length and addresses against the same alignment value, so
use the more simple iterator check.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
---
 block/bio-integrity.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 6b077ca937f6b..6d069a49b4aad 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -262,7 +262,6 @@ static unsigned int bvec_from_pages(struct bio_vec *bvec, struct page **pages,
 int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter)
 {
 	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
-	unsigned int align = blk_lim_dma_alignment_and_pad(&q->limits);
 	struct page *stack_pages[UIO_FASTIOV], **pages = stack_pages;
 	struct bio_vec stack_vec[UIO_FASTIOV], *bvec = stack_vec;
 	size_t offset, bytes = iter->count;
@@ -285,7 +284,8 @@ int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter)
 		pages = NULL;
 	}
 
-	copy = !iov_iter_is_aligned(iter, align, align);
+	copy = iov_iter_alignment(iter) &
+			blk_lim_dma_alignment_and_pad(&q->limits);
 	ret = iov_iter_extract_pages(iter, &pages, bytes, nr_vecs, 0, &offset);
 	if (unlikely(ret < 0))
 		goto free_bvec;
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCHv3 8/8] iov_iter: remove iov_iter_is_aligned
  2025-08-19 16:49 [PATCHv3 0/8] direct-io: even more flexible io vectors Keith Busch
                   ` (6 preceding siblings ...)
  2025-08-19 16:49 ` [PATCHv3 7/8] blk-integrity: use simpler alignment check Keith Busch
@ 2025-08-19 16:49 ` Keith Busch
  2025-08-25  7:50   ` Christoph Hellwig
  2025-08-19 23:36 ` [PATCHv3 0/8] direct-io: even more flexible io vectors Mike Snitzer
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2025-08-19 16:49 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-kernel
  Cc: snitzer, axboe, dw, brauner, hch, martin.petersen, djwong,
	linux-xfs, viro, Keith Busch, Hannes Reinecke

From: Keith Busch <kbusch@kernel.org>

No more callers.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
---
 include/linux/uio.h |  2 -
 lib/iov_iter.c      | 95 ---------------------------------------------
 2 files changed, 97 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 2e86c653186c6..5b127043a1519 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -286,8 +286,6 @@ size_t _copy_mc_to_iter(const void *addr, size_t bytes, struct iov_iter *i);
 #endif
 
 size_t iov_iter_zero(size_t bytes, struct iov_iter *);
-bool iov_iter_is_aligned(const struct iov_iter *i, unsigned addr_mask,
-			unsigned len_mask);
 unsigned long iov_iter_alignment(const struct iov_iter *i);
 unsigned long iov_iter_gap_alignment(const struct iov_iter *i);
 void iov_iter_init(struct iov_iter *i, unsigned int direction, const struct iovec *iov,
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f9193f952f499..2fe66a6b8789e 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -784,101 +784,6 @@ void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count)
 }
 EXPORT_SYMBOL(iov_iter_discard);
 
-static bool iov_iter_aligned_iovec(const struct iov_iter *i, unsigned addr_mask,
-				   unsigned len_mask)
-{
-	const struct iovec *iov = iter_iov(i);
-	size_t size = i->count;
-	size_t skip = i->iov_offset;
-
-	do {
-		size_t len = iov->iov_len - skip;
-
-		if (len > size)
-			len = size;
-		if (len & len_mask)
-			return false;
-		if ((unsigned long)(iov->iov_base + skip) & addr_mask)
-			return false;
-
-		iov++;
-		size -= len;
-		skip = 0;
-	} while (size);
-
-	return true;
-}
-
-static bool iov_iter_aligned_bvec(const struct iov_iter *i, unsigned addr_mask,
-				  unsigned len_mask)
-{
-	const struct bio_vec *bvec = i->bvec;
-	unsigned skip = i->iov_offset;
-	size_t size = i->count;
-
-	do {
-		size_t len = bvec->bv_len - skip;
-
-		if (len > size)
-			len = size;
-		if (len & len_mask)
-			return false;
-		if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
-			return false;
-
-		bvec++;
-		size -= len;
-		skip = 0;
-	} while (size);
-
-	return true;
-}
-
-/**
- * iov_iter_is_aligned() - Check if the addresses and lengths of each segments
- * 	are aligned to the parameters.
- *
- * @i: &struct iov_iter to restore
- * @addr_mask: bit mask to check against the iov element's addresses
- * @len_mask: bit mask to check against the iov element's lengths
- *
- * Return: false if any addresses or lengths intersect with the provided masks
- */
-bool iov_iter_is_aligned(const struct iov_iter *i, unsigned addr_mask,
-			 unsigned len_mask)
-{
-	if (likely(iter_is_ubuf(i))) {
-		if (i->count & len_mask)
-			return false;
-		if ((unsigned long)(i->ubuf + i->iov_offset) & addr_mask)
-			return false;
-		return true;
-	}
-
-	if (likely(iter_is_iovec(i) || iov_iter_is_kvec(i)))
-		return iov_iter_aligned_iovec(i, addr_mask, len_mask);
-
-	if (iov_iter_is_bvec(i))
-		return iov_iter_aligned_bvec(i, addr_mask, len_mask);
-
-	/* With both xarray and folioq types, we're dealing with whole folios. */
-	if (iov_iter_is_xarray(i)) {
-		if (i->count & len_mask)
-			return false;
-		if ((i->xarray_start + i->iov_offset) & addr_mask)
-			return false;
-	}
-	if (iov_iter_is_folioq(i)) {
-		if (i->count & len_mask)
-			return false;
-		if (i->iov_offset & addr_mask)
-			return false;
-	}
-
-	return true;
-}
-EXPORT_SYMBOL_GPL(iov_iter_is_aligned);
-
 static unsigned long iov_iter_alignment_iovec(const struct iov_iter *i)
 {
 	const struct iovec *iov = iter_iov(i);
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 0/8] direct-io: even more flexible io vectors
  2025-08-19 16:49 [PATCHv3 0/8] direct-io: even more flexible io vectors Keith Busch
                   ` (7 preceding siblings ...)
  2025-08-19 16:49 ` [PATCHv3 8/8] iov_iter: remove iov_iter_is_aligned Keith Busch
@ 2025-08-19 23:36 ` Mike Snitzer
  2025-08-20  1:52 ` Song Chen
  2025-08-22 13:27 ` Ritesh Harjani
  10 siblings, 0 replies; 42+ messages in thread
From: Mike Snitzer @ 2025-08-19 23:36 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-fsdevel, linux-kernel, axboe, dw, brauner, hch,
	martin.petersen, djwong, linux-xfs, viro, Keith Busch, linux-nfs

On Tue, Aug 19, 2025 at 09:49:14AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> Previous version:
> 
>   https://lore.kernel.org/linux-block/20250805141123.332298-1-kbusch@meta.com/
> 
> This series removes the direct io requirement that io vector lengths
> align to the logical block size.
> 
> I tested this on a few raw block device types including nvme,
> virtio-blk, ahci, and loop. NVMe is the only one I tested with 4k
> logical sectors; everything else was 512.
> 
> On each of those, I tested several iomap filesystems: xfs, ext4, and
> btrfs. I found it interesting that each behave a little
> differently with handling invalid vector alignments:
> 
>   - XFS is the most straight forward and reports failures on invalid
>     vector conditions, same as raw blocks devices.
> 
>   - EXT4 falls back to buffered io for writes but not for reads.
> 
>   - BTRFS doesn't even try direct io for any unusual alignments; it
>     chooses buffered io from the start.
> 
> So it has been a little slow going figuring out which results to expect
> from various tests, but I think I've got all the corner cases covered. I
> can submit the tests cases to blktests and fstests for consideration
> separately, too.
> 
> I'm not 100% sure where we're at with the last patch. I think Mike
> initially indicated this was okay to remove, but I could swear I read
> something saying that might not be the case anymore. I just can't find
> the message now. Mike?

Hey,

Yes, I don't have pointers immediately available but I did mention it
and cc'd you.  I have found that my work relative to NFS and NFSD does
still need to use iov_iter_aligned_bvec -- otherwise misaligned DIO
can get issued to the underlying filesystem.

I did try to push all the relevant checking down to NFS/NFSD code that
assembles their respective bvec into an iov_iter, like you suggested,
but came up short after my first attempt.

I don't want to speak for the NFS or NFSD miantainers, but I'm
personally still OK with the broader iov_iter_is_aligned() interface
and even iov_iter_aligned_bvec() going away (and NFS/NFSD carrying
their own until I can circle back to hopefully eliminating the need).

Either that, or we remove all but iov_iter_aligned_bvec() and export
it so that NFS/NFSD can use it, _and_  tweak it so that it offers more
coarse-grained length checking, like so:
https://lore.kernel.org/linux-nfs/20250708160619.64800-5-snitzer@kernel.org/
(this is probably the best intermediate solution actually, though it'd
force my NFS and NFSD changes to be dependent on your series landing
-- which is probably a perfectly appropriate constraint)

Thanks,
Mike


> 
> Changes from v2:
> 
>   Include vector lengths when validating a split. The length check is
>   only valid for r/w commands, and skipped for passthrough
>   DRV_IN/DRV_OUT commands.
> 
>   Introduce a prep patch having bio_iov_iter_get_pages() take the
>   caller's desired length alignment.
> 
>   Additional code comments explaing less obvious error conditions.
> 
>   Added reviews on the patches that haven't changed.
> 
> Keith Busch (8):
>   block: check for valid bio while splitting
>   block: add size alignment to bio_iov_iter_get_pages
>   block: align the bio after building it
>   block: simplify direct io validity check
>   iomap: simplify direct io validity check
>   block: remove bdev_iter_is_aligned
>   blk-integrity: use simpler alignment check
>   iov_iter: remove iov_iter_is_aligned
> 
>  block/bio-integrity.c  |  4 +-
>  block/bio.c            | 64 ++++++++++++++++++----------
>  block/blk-map.c        |  2 +-
>  block/blk-merge.c      | 20 +++++++--
>  block/fops.c           | 13 +++---
>  fs/iomap/direct-io.c   |  6 +--
>  include/linux/bio.h    | 13 ++++--
>  include/linux/blkdev.h | 20 +++++----
>  include/linux/uio.h    |  2 -
>  lib/iov_iter.c         | 95 ------------------------------------------
>  10 files changed, 94 insertions(+), 145 deletions(-)
> 
> -- 
> 2.47.3
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 0/8] direct-io: even more flexible io vectors
  2025-08-19 16:49 [PATCHv3 0/8] direct-io: even more flexible io vectors Keith Busch
                   ` (8 preceding siblings ...)
  2025-08-19 23:36 ` [PATCHv3 0/8] direct-io: even more flexible io vectors Mike Snitzer
@ 2025-08-20  1:52 ` Song Chen
  2025-08-22 13:27 ` Ritesh Harjani
  10 siblings, 0 replies; 42+ messages in thread
From: Song Chen @ 2025-08-20  1:52 UTC (permalink / raw)
  To: Keith Busch, linux-block, linux-fsdevel, linux-kernel
  Cc: snitzer, axboe, dw, brauner, hch, martin.petersen, djwong,
	linux-xfs, viro, Keith Busch

Hi,

在 2025/8/20 00:49, Keith Busch 写道:
> From: Keith Busch <kbusch@kernel.org>
> 
> Previous version:
> 
>    https://lore.kernel.org/linux-block/20250805141123.332298-1-kbusch@meta.com/
> 
> This series removes the direct io requirement that io vector lengths
> align to the logical block size.
> 
> I tested this on a few raw block device types including nvme,
> virtio-blk, ahci, and loop. NVMe is the only one I tested with 4k
> logical sectors; everything else was 512.
> 
> On each of those, I tested several iomap filesystems: xfs, ext4, and
> btrfs. I found it interesting that each behave a little
> differently with handling invalid vector alignments:
> 
>    - XFS is the most straight forward and reports failures on invalid
>      vector conditions, same as raw blocks devices.
> 
>    - EXT4 falls back to buffered io for writes but not for reads.

I found it in ext4 too, i tried to fall the misaligned dio read request 
back to buffered io and submitted a patch[1], but haven't received any 
comments yet.

[1]:https://lore.kernel.org/all/20250710085910.123168-1-chensong_2000@189.cn/

Song
> 
>    - BTRFS doesn't even try direct io for any unusual alignments; it
>      chooses buffered io from the start.
> 
> So it has been a little slow going figuring out which results to expect
> from various tests, but I think I've got all the corner cases covered. I
> can submit the tests cases to blktests and fstests for consideration
> separately, too.
> 
> I'm not 100% sure where we're at with the last patch. I think Mike
> initially indicated this was okay to remove, but I could swear I read
> something saying that might not be the case anymore. I just can't find
> the message now. Mike?
> 
> Changes from v2:
> 
>    Include vector lengths when validating a split. The length check is
>    only valid for r/w commands, and skipped for passthrough
>    DRV_IN/DRV_OUT commands.
> 
>    Introduce a prep patch having bio_iov_iter_get_pages() take the
>    caller's desired length alignment.
> 
>    Additional code comments explaing less obvious error conditions.
> 
>    Added reviews on the patches that haven't changed.
> 
> Keith Busch (8):
>    block: check for valid bio while splitting
>    block: add size alignment to bio_iov_iter_get_pages
>    block: align the bio after building it
>    block: simplify direct io validity check
>    iomap: simplify direct io validity check
>    block: remove bdev_iter_is_aligned
>    blk-integrity: use simpler alignment check
>    iov_iter: remove iov_iter_is_aligned
> 
>   block/bio-integrity.c  |  4 +-
>   block/bio.c            | 64 ++++++++++++++++++----------
>   block/blk-map.c        |  2 +-
>   block/blk-merge.c      | 20 +++++++--
>   block/fops.c           | 13 +++---
>   fs/iomap/direct-io.c   |  6 +--
>   include/linux/bio.h    | 13 ++++--
>   include/linux/blkdev.h | 20 +++++----
>   include/linux/uio.h    |  2 -
>   lib/iov_iter.c         | 95 ------------------------------------------
>   10 files changed, 94 insertions(+), 145 deletions(-)
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 1/8] block: check for valid bio while splitting
  2025-08-19 16:49 ` [PATCHv3 1/8] block: check for valid bio while splitting Keith Busch
@ 2025-08-20  7:02   ` Damien Le Moal
  2025-08-20 14:25     ` Keith Busch
  2025-08-20  7:04   ` Damien Le Moal
  2025-08-25  7:35   ` Christoph Hellwig
  2 siblings, 1 reply; 42+ messages in thread
From: Damien Le Moal @ 2025-08-20  7:02 UTC (permalink / raw)
  To: Keith Busch, linux-block, linux-fsdevel, linux-kernel
  Cc: snitzer, axboe, dw, brauner, hch, martin.petersen, djwong,
	linux-xfs, viro, Keith Busch

On 8/20/25 1:49 AM, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> We're already iterating every segment, so check these for a valid IO at
> the same time.
> 
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
>  block/blk-map.c        |  2 +-
>  block/blk-merge.c      | 20 ++++++++++++++++----
>  include/linux/bio.h    |  4 ++--
>  include/linux/blkdev.h | 13 +++++++++++++
>  4 files changed, 32 insertions(+), 7 deletions(-)
> 
> diff --git a/block/blk-map.c b/block/blk-map.c
> index 23e5d5ebe59ec..c5da9d37deee9 100644
> --- a/block/blk-map.c
> +++ b/block/blk-map.c
> @@ -443,7 +443,7 @@ int blk_rq_append_bio(struct request *rq, struct bio *bio)
>  	int ret;
>  
>  	/* check that the data layout matches the hardware restrictions */
> -	ret = bio_split_rw_at(bio, lim, &nr_segs, max_bytes);
> +	ret = bio_split_drv_at(bio, lim, &nr_segs, max_bytes);
>  	if (ret) {
>  		/* if we would have to split the bio, copy instead */
>  		if (ret > 0)
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 70d704615be52..a0d8364983000 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -279,25 +279,29 @@ static unsigned int bio_split_alignment(struct bio *bio,
>  }
>  
>  /**
> - * bio_split_rw_at - check if and where to split a read/write bio
> + * bio_split_io_at - check if and where to split a read/write bio
>   * @bio:  [in] bio to be split
>   * @lim:  [in] queue limits to split based on
>   * @segs: [out] number of segments in the bio with the first half of the sectors
>   * @max_bytes: [in] maximum number of bytes per bio
> + * @len_align: [in] length alignment for each vector
>   *
>   * Find out if @bio needs to be split to fit the queue limits in @lim and a
>   * maximum size of @max_bytes.  Returns a negative error number if @bio can't be
>   * split, 0 if the bio doesn't have to be split, or a positive sector offset if
>   * @bio needs to be split.
>   */
> -int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
> -		unsigned *segs, unsigned max_bytes)
> +int bio_split_io_at(struct bio *bio, const struct queue_limits *lim,
> +		unsigned *segs, unsigned max_bytes, unsigned len_align)
>  {
>  	struct bio_vec bv, bvprv, *bvprvp = NULL;
>  	struct bvec_iter iter;
>  	unsigned nsegs = 0, bytes = 0;
>  
>  	bio_for_each_bvec(bv, bio, iter) {
> +		if (bv.bv_offset & lim->dma_alignment || bv.bv_len & len_align)

Shouldn't this be:

		if (bv.bv_offset & len_align || bv.bv_len & len_align)

?

> +			return -EINVAL;
> +
>  		/*
>  		 * If the queue doesn't support SG gaps and adding this
>  		 * offset would create a gap, disallow it.
> @@ -339,8 +343,16 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
>  	 * Individual bvecs might not be logical block aligned. Round down the
>  	 * split size so that each bio is properly block size aligned, even if
>  	 * we do not use the full hardware limits.
> +	 *
> +	 * Misuse may submit a bio that can't be split into a valid io. There
> +	 * may either be too many discontiguous vectors for the max segments
> +	 * limit, or contain virtual boundary gaps without having a valid block
> +	 * sized split. Catch that condition by checking for a zero byte
> +	 * result.
>  	 */
>  	bytes = ALIGN_DOWN(bytes, bio_split_alignment(bio, lim));
> +	if (!bytes)
> +		return -EINVAL;
>  
>  	/*
>  	 * Bio splitting may cause subtle trouble such as hang when doing sync
> @@ -350,7 +362,7 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
>  	bio_clear_polled(bio);
>  	return bytes >> SECTOR_SHIFT;
>  }
> -EXPORT_SYMBOL_GPL(bio_split_rw_at);
> +EXPORT_SYMBOL_GPL(bio_split_io_at);
>  
>  struct bio *bio_split_rw(struct bio *bio, const struct queue_limits *lim,
>  		unsigned *nr_segs)
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 46ffac5caab78..519a1d59805f8 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -322,8 +322,8 @@ static inline void bio_next_folio(struct folio_iter *fi, struct bio *bio)
>  void bio_trim(struct bio *bio, sector_t offset, sector_t size);
>  extern struct bio *bio_split(struct bio *bio, int sectors,
>  			     gfp_t gfp, struct bio_set *bs);
> -int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
> -		unsigned *segs, unsigned max_bytes);
> +int bio_split_io_at(struct bio *bio, const struct queue_limits *lim,
> +		unsigned *segs, unsigned max_bytes, unsigned len_align);
>  
>  /**
>   * bio_next_split - get next @sectors from a bio, splitting if necessary
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 95886b404b16b..7f83ad2df5425 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1869,6 +1869,19 @@ bdev_atomic_write_unit_max_bytes(struct block_device *bdev)
>  	return queue_atomic_write_unit_max_bytes(bdev_get_queue(bdev));
>  }
>  
> +static inline int bio_split_rw_at(struct bio *bio,
> +		const struct queue_limits *lim,
> +		unsigned *segs, unsigned max_bytes)
> +{
> +	return bio_split_io_at(bio, lim, segs, max_bytes, lim->dma_alignment);
> +}
> +
> +static inline int bio_split_drv_at(struct bio *bio,
> +		const struct queue_limits *lim,
> +		unsigned *segs, unsigned max_bytes)
> +{
> +	return bio_split_io_at(bio, lim, segs, max_bytes, 0);
> +}
>  #define DEFINE_IO_COMP_BATCH(name)	struct io_comp_batch name = { }
>  
>  #endif /* _LINUX_BLKDEV_H */


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 1/8] block: check for valid bio while splitting
  2025-08-19 16:49 ` [PATCHv3 1/8] block: check for valid bio while splitting Keith Busch
  2025-08-20  7:02   ` Damien Le Moal
@ 2025-08-20  7:04   ` Damien Le Moal
  2025-08-25  7:35   ` Christoph Hellwig
  2 siblings, 0 replies; 42+ messages in thread
From: Damien Le Moal @ 2025-08-20  7:04 UTC (permalink / raw)
  To: Keith Busch, linux-block, linux-fsdevel, linux-kernel
  Cc: snitzer, axboe, dw, brauner, hch, martin.petersen, djwong,
	linux-xfs, viro, Keith Busch

On 8/20/25 1:49 AM, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> We're already iterating every segment, so check these for a valid IO at
> the same time.
> 
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
>  block/blk-map.c        |  2 +-
>  block/blk-merge.c      | 20 ++++++++++++++++----
>  include/linux/bio.h    |  4 ++--
>  include/linux/blkdev.h | 13 +++++++++++++
>  4 files changed, 32 insertions(+), 7 deletions(-)
> 
> diff --git a/block/blk-map.c b/block/blk-map.c
> index 23e5d5ebe59ec..c5da9d37deee9 100644
> --- a/block/blk-map.c
> +++ b/block/blk-map.c
> @@ -443,7 +443,7 @@ int blk_rq_append_bio(struct request *rq, struct bio *bio)
>  	int ret;
>  
>  	/* check that the data layout matches the hardware restrictions */
> -	ret = bio_split_rw_at(bio, lim, &nr_segs, max_bytes);
> +	ret = bio_split_drv_at(bio, lim, &nr_segs, max_bytes);
>  	if (ret) {
>  		/* if we would have to split the bio, copy instead */
>  		if (ret > 0)
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 70d704615be52..a0d8364983000 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -279,25 +279,29 @@ static unsigned int bio_split_alignment(struct bio *bio,
>  }
>  
>  /**
> - * bio_split_rw_at - check if and where to split a read/write bio
> + * bio_split_io_at - check if and where to split a read/write bio
>   * @bio:  [in] bio to be split
>   * @lim:  [in] queue limits to split based on
>   * @segs: [out] number of segments in the bio with the first half of the sectors
>   * @max_bytes: [in] maximum number of bytes per bio
> + * @len_align: [in] length alignment for each vector

Another thing: since this is a mask, maybe better to call it len_align_mask
like you did in patch 2 for bio_iov_iter_get_pages_aligned() ?

>   *
>   * Find out if @bio needs to be split to fit the queue limits in @lim and a
>   * maximum size of @max_bytes.  Returns a negative error number if @bio can't be
>   * split, 0 if the bio doesn't have to be split, or a positive sector offset if
>   * @bio needs to be split.
>   */
> -int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
> -		unsigned *segs, unsigned max_bytes)
> +int bio_split_io_at(struct bio *bio, const struct queue_limits *lim,
> +		unsigned *segs, unsigned max_bytes, unsigned len_align)
>  {
>  	struct bio_vec bv, bvprv, *bvprvp = NULL;
>  	struct bvec_iter iter;
>  	unsigned nsegs = 0, bytes = 0;
>  
>  	bio_for_each_bvec(bv, bio, iter) {
> +		if (bv.bv_offset & lim->dma_alignment || bv.bv_len & len_align)
> +			return -EINVAL;
> +
>  		/*
>  		 * If the queue doesn't support SG gaps and adding this
>  		 * offset would create a gap, disallow it.
> @@ -339,8 +343,16 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
>  	 * Individual bvecs might not be logical block aligned. Round down the
>  	 * split size so that each bio is properly block size aligned, even if
>  	 * we do not use the full hardware limits.
> +	 *
> +	 * Misuse may submit a bio that can't be split into a valid io. There
> +	 * may either be too many discontiguous vectors for the max segments
> +	 * limit, or contain virtual boundary gaps without having a valid block
> +	 * sized split. Catch that condition by checking for a zero byte
> +	 * result.
>  	 */
>  	bytes = ALIGN_DOWN(bytes, bio_split_alignment(bio, lim));
> +	if (!bytes)
> +		return -EINVAL;
>  
>  	/*
>  	 * Bio splitting may cause subtle trouble such as hang when doing sync
> @@ -350,7 +362,7 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
>  	bio_clear_polled(bio);
>  	return bytes >> SECTOR_SHIFT;
>  }
> -EXPORT_SYMBOL_GPL(bio_split_rw_at);
> +EXPORT_SYMBOL_GPL(bio_split_io_at);
>  
>  struct bio *bio_split_rw(struct bio *bio, const struct queue_limits *lim,
>  		unsigned *nr_segs)
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 46ffac5caab78..519a1d59805f8 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -322,8 +322,8 @@ static inline void bio_next_folio(struct folio_iter *fi, struct bio *bio)
>  void bio_trim(struct bio *bio, sector_t offset, sector_t size);
>  extern struct bio *bio_split(struct bio *bio, int sectors,
>  			     gfp_t gfp, struct bio_set *bs);
> -int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
> -		unsigned *segs, unsigned max_bytes);
> +int bio_split_io_at(struct bio *bio, const struct queue_limits *lim,
> +		unsigned *segs, unsigned max_bytes, unsigned len_align);
>  
>  /**
>   * bio_next_split - get next @sectors from a bio, splitting if necessary
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 95886b404b16b..7f83ad2df5425 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1869,6 +1869,19 @@ bdev_atomic_write_unit_max_bytes(struct block_device *bdev)
>  	return queue_atomic_write_unit_max_bytes(bdev_get_queue(bdev));
>  }
>  
> +static inline int bio_split_rw_at(struct bio *bio,
> +		const struct queue_limits *lim,
> +		unsigned *segs, unsigned max_bytes)
> +{
> +	return bio_split_io_at(bio, lim, segs, max_bytes, lim->dma_alignment);
> +}
> +
> +static inline int bio_split_drv_at(struct bio *bio,
> +		const struct queue_limits *lim,
> +		unsigned *segs, unsigned max_bytes)
> +{
> +	return bio_split_io_at(bio, lim, segs, max_bytes, 0);
> +}
>  #define DEFINE_IO_COMP_BATCH(name)	struct io_comp_batch name = { }
>  
>  #endif /* _LINUX_BLKDEV_H */


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 3/8] block: align the bio after building it
  2025-08-19 16:49 ` [PATCHv3 3/8] block: align the bio after building it Keith Busch
@ 2025-08-20  7:07   ` Damien Le Moal
  2025-08-25  7:46   ` Christoph Hellwig
  2025-08-25  7:47   ` Christoph Hellwig
  2 siblings, 0 replies; 42+ messages in thread
From: Damien Le Moal @ 2025-08-20  7:07 UTC (permalink / raw)
  To: Keith Busch, linux-block, linux-fsdevel, linux-kernel
  Cc: snitzer, axboe, dw, brauner, hch, martin.petersen, djwong,
	linux-xfs, viro, Keith Busch

On 8/20/25 1:49 AM, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> Instead of ensuring each vector is block size aligned while constructing
> the bio, just ensure the entire size is aligned after it's built. This
> makes getting bio pages more flexible to accepting device valid io
> vectors that would otherwise get rejected by alignment checks.
> 
> Signed-off-by: Keith Busch <kbusch@kernel.org>

[...]

> +/*
> + * Aligns the bio size to the len_align_mask, releasing any excessive bio vecs
> + * that  __bio_iov_iter_get_pages may have inserted and reverts that length for
> + * the next iteration.
> + */
> +static int bio_align(struct bio *bio, struct iov_iter *iter,
> +			    unsigned len_align_mask)
> +{
> +	size_t nbytes = bio->bi_iter.bi_size & len_align_mask;
> +
> +	if (!nbytes)
> +		return 0;
> +
> +	iov_iter_revert(iter, nbytes);
> +	bio->bi_iter.bi_size -= nbytes;
> +	while (nbytes) {
> +		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
> +
> +		if (nbytes < bv->bv_len) {
> +			bv->bv_len -= nbytes;
> +			nbytes = 0;

Nit: remove the "nbytes = 0" and do a break here ?

> +		} else {
> +			bio_release_page(bio, bv->bv_page);
> +			bio->bi_vcnt--;
> +			nbytes -= bv->bv_len;
> +		}
> +	}
> +
> +	if (!bio->bi_iter.bi_size)
> +		return -EFAULT;
> +	return 0;
> +}
> +



-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 1/8] block: check for valid bio while splitting
  2025-08-20  7:02   ` Damien Le Moal
@ 2025-08-20 14:25     ` Keith Busch
  0 siblings, 0 replies; 42+ messages in thread
From: Keith Busch @ 2025-08-20 14:25 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Keith Busch, linux-block, linux-fsdevel, linux-kernel, snitzer,
	axboe, dw, brauner, hch, martin.petersen, djwong, linux-xfs, viro

On Wed, Aug 20, 2025 at 04:02:31PM +0900, Damien Le Moal wrote:
> On 8/20/25 1:49 AM, Keith Busch wrote:
> >  {
> >  	struct bio_vec bv, bvprv, *bvprvp = NULL;
> >  	struct bvec_iter iter;
> >  	unsigned nsegs = 0, bytes = 0;
> >  
> >  	bio_for_each_bvec(bv, bio, iter) {
> > +		if (bv.bv_offset & lim->dma_alignment || bv.bv_len & len_align)
> 
> Shouldn't this be:
> 
> 		if (bv.bv_offset & len_align || bv.bv_len & len_align)
> 
> ?

No, we alwqys need to validate the address offset against the dma
alignment. The length is not validated for passthrough commands though,
so different masks for each so they can be independently set.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 0/8] direct-io: even more flexible io vectors
  2025-08-19 16:49 [PATCHv3 0/8] direct-io: even more flexible io vectors Keith Busch
                   ` (9 preceding siblings ...)
  2025-08-20  1:52 ` Song Chen
@ 2025-08-22 13:27 ` Ritesh Harjani
  2025-08-22 14:30   ` Keith Busch
  2025-08-25 12:07   ` Jan Kara
  10 siblings, 2 replies; 42+ messages in thread
From: Ritesh Harjani @ 2025-08-22 13:27 UTC (permalink / raw)
  To: Keith Busch, linux-block, linux-fsdevel, linux-kernel, linux-ext4
  Cc: snitzer, axboe, dw, brauner, hch, martin.petersen, djwong,
	linux-xfs, viro, Keith Busch, Jan Kara

Keith Busch <kbusch@meta.com> writes:

> From: Keith Busch <kbusch@kernel.org>
>
> Previous version:
>
>   https://lore.kernel.org/linux-block/20250805141123.332298-1-kbusch@meta.com/
>
> This series removes the direct io requirement that io vector lengths
> align to the logical block size.
>
> I tested this on a few raw block device types including nvme,
> virtio-blk, ahci, and loop. NVMe is the only one I tested with 4k
> logical sectors; everything else was 512.
>
> On each of those, I tested several iomap filesystems: xfs, ext4, and
> btrfs. I found it interesting that each behave a little
> differently with handling invalid vector alignments:
>
>   - XFS is the most straight forward and reports failures on invalid
>     vector conditions, same as raw blocks devices.
>
>   - EXT4 falls back to buffered io for writes but not for reads.

++linux-ext4 to get any historical context behind why the difference of
behaviour in reads v/s writes for EXT4 DIO. 


BTW - I did some basic testing of the series against block device, XFS &
EXT4 and it worked as expected (for both DIO & AIO-DIO) i.e.
1. Individial iov_len need not be aligned to the logical block size anymore.
2. Total length of iovecs should be logical block size aligned though.

i.e. this combination works with this patch series now:

    posix_memalign((void**)&aligned_buf, mem_align, 2 * BLOCK_SIZE);
    struct iovec iov[4] = {
        {.iov_base = aligned_buf, .iov_len = 500},
        {.iov_base = aligned_buf + 500, .iov_len = 1500},
        {.iov_base = aligned_buf + 2000, .iov_len = 2000},
        {.iov_base = aligned_buf + 4000, .iov_len = 4192}
    }; // 500 + 1500 + 2000 + 4192 = 8192
 

-ritesh

>
>   - BTRFS doesn't even try direct io for any unusual alignments; it
>     chooses buffered io from the start.
>
> So it has been a little slow going figuring out which results to expect
> from various tests, but I think I've got all the corner cases covered. I
> can submit the tests cases to blktests and fstests for consideration
> separately, too.
>
> I'm not 100% sure where we're at with the last patch. I think Mike
> initially indicated this was okay to remove, but I could swear I read
> something saying that might not be the case anymore. I just can't find
> the message now. Mike?
>
> Changes from v2:
>
>   Include vector lengths when validating a split. The length check is
>   only valid for r/w commands, and skipped for passthrough
>   DRV_IN/DRV_OUT commands.
>
>   Introduce a prep patch having bio_iov_iter_get_pages() take the
>   caller's desired length alignment.
>
>   Additional code comments explaing less obvious error conditions.
>
>   Added reviews on the patches that haven't changed.
>
> Keith Busch (8):
>   block: check for valid bio while splitting
>   block: add size alignment to bio_iov_iter_get_pages
>   block: align the bio after building it
>   block: simplify direct io validity check
>   iomap: simplify direct io validity check
>   block: remove bdev_iter_is_aligned
>   blk-integrity: use simpler alignment check
>   iov_iter: remove iov_iter_is_aligned
>
>  block/bio-integrity.c  |  4 +-
>  block/bio.c            | 64 ++++++++++++++++++----------
>  block/blk-map.c        |  2 +-
>  block/blk-merge.c      | 20 +++++++--
>  block/fops.c           | 13 +++---
>  fs/iomap/direct-io.c   |  6 +--
>  include/linux/bio.h    | 13 ++++--
>  include/linux/blkdev.h | 20 +++++----
>  include/linux/uio.h    |  2 -
>  lib/iov_iter.c         | 95 ------------------------------------------
>  10 files changed, 94 insertions(+), 145 deletions(-)
>
> -- 
> 2.47.3

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 0/8] direct-io: even more flexible io vectors
  2025-08-22 13:27 ` Ritesh Harjani
@ 2025-08-22 14:30   ` Keith Busch
  2025-08-25 12:07   ` Jan Kara
  1 sibling, 0 replies; 42+ messages in thread
From: Keith Busch @ 2025-08-22 14:30 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Keith Busch, linux-block, linux-fsdevel, linux-kernel, linux-ext4,
	snitzer, axboe, dw, brauner, hch, martin.petersen, djwong,
	linux-xfs, viro, Jan Kara

On Fri, Aug 22, 2025 at 06:57:08PM +0530, Ritesh Harjani wrote:
> Keith Busch <kbusch@meta.com> writes:
> 
> BTW - I did some basic testing of the series against block device, XFS &
> EXT4 and it worked as expected (for both DIO & AIO-DIO) i.e.
> 1. Individial iov_len need not be aligned to the logical block size anymore.
> 2. Total length of iovecs should be logical block size aligned though.
> 
> i.e. this combination works with this patch series now:
> 
>     posix_memalign((void**)&aligned_buf, mem_align, 2 * BLOCK_SIZE);
>     struct iovec iov[4] = {
>         {.iov_base = aligned_buf, .iov_len = 500},
>         {.iov_base = aligned_buf + 500, .iov_len = 1500},
>         {.iov_base = aligned_buf + 2000, .iov_len = 2000},
>         {.iov_base = aligned_buf + 4000, .iov_len = 4192}
>     }; // 500 + 1500 + 2000 + 4192 = 8192

Yep, the kernel would have rejected that before, but should work now. An
added bonus, the code doesn't spend CPU cycles walking the iovec early
anymore.

Your test, though, is not getting to the real good stuff! :) Your
vectors are virtually contiguous, so the block layer will merge them to
maybe only one block sized segment. Add some offsets to create gaps, but
still adhere to your device's dma and virtual boundary limits. Your
offset options may be constrained if you're using NVMe, but I have a
follow up series fixing that for capable hardware.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 1/8] block: check for valid bio while splitting
  2025-08-19 16:49 ` [PATCHv3 1/8] block: check for valid bio while splitting Keith Busch
  2025-08-20  7:02   ` Damien Le Moal
  2025-08-20  7:04   ` Damien Le Moal
@ 2025-08-25  7:35   ` Christoph Hellwig
  2 siblings, 0 replies; 42+ messages in thread
From: Christoph Hellwig @ 2025-08-25  7:35 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-fsdevel, linux-kernel, snitzer, axboe, dw,
	brauner, hch, martin.petersen, djwong, linux-xfs, viro,
	Keith Busch

On Tue, Aug 19, 2025 at 09:49:15AM -0700, Keith Busch wrote:
>  		/*
>  		 * If the queue doesn't support SG gaps and adding this
>  		 * offset would create a gap, disallow it.
> @@ -339,8 +343,16 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
>  	 * Individual bvecs might not be logical block aligned. Round down the
>  	 * split size so that each bio is properly block size aligned, even if
>  	 * we do not use the full hardware limits.
> +	 *
> +	 * Misuse may submit a bio that can't be split into a valid io. There
> +	 * may either be too many discontiguous vectors for the max segments
> +	 * limit, or contain virtual boundary gaps without having a valid block
> +	 * sized split. Catch that condition by checking for a zero byte
> +	 * result.
>  	 */
>  	bytes = ALIGN_DOWN(bytes, bio_split_alignment(bio, lim));
> +	if (!bytes)

If this is just misuse it could be a WARN_ON_ONCE.  But I think we
can also trigger this when validating passthrough commands that need
to be built to hardware limits.  So maybe don't speak about misuse
here?

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 2/8] block: add size alignment to bio_iov_iter_get_pages
  2025-08-19 16:49 ` [PATCHv3 2/8] block: add size alignment to bio_iov_iter_get_pages Keith Busch
@ 2025-08-25  7:36   ` Christoph Hellwig
  0 siblings, 0 replies; 42+ messages in thread
From: Christoph Hellwig @ 2025-08-25  7:36 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-fsdevel, linux-kernel, snitzer, axboe, dw,
	brauner, hch, martin.petersen, djwong, linux-xfs, viro,
	Keith Busch

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 3/8] block: align the bio after building it
  2025-08-19 16:49 ` [PATCHv3 3/8] block: align the bio after building it Keith Busch
  2025-08-20  7:07   ` Damien Le Moal
@ 2025-08-25  7:46   ` Christoph Hellwig
  2025-08-25 13:57     ` Keith Busch
  2025-08-25  7:47   ` Christoph Hellwig
  2 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2025-08-25  7:46 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-fsdevel, linux-kernel, snitzer, axboe, dw,
	brauner, hch, martin.petersen, djwong, linux-xfs, viro,
	Keith Busch

On Tue, Aug 19, 2025 at 09:49:17AM -0700, Keith Busch wrote:
> +/*
> + * Aligns the bio size to the len_align_mask, releasing any excessive bio vecs
> + * that  __bio_iov_iter_get_pages may have inserted and reverts that length for
> + * the next iteration.
> + */
> +static int bio_align(struct bio *bio, struct iov_iter *iter,
> +			    unsigned len_align_mask)

I think the name is a bit too generic, as the function is very
specific to the __bio_iov_iter_get_pages-path only releasing of
pages, and also only aligns down.  Maybe name it
bio_iov_iter_align_down or something like that?

> +	size_t nbytes = bio->bi_iter.bi_size & len_align_mask;
> +
> +	if (!nbytes)
> +		return 0;
> +
> +	iov_iter_revert(iter, nbytes);
> +	bio->bi_iter.bi_size -= nbytes;
> +	while (nbytes) {
> +		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
> +
> +		if (nbytes < bv->bv_len) {
> +			bv->bv_len -= nbytes;
> +			nbytes = 0;
> +		} else {
> +			bio_release_page(bio, bv->bv_page);
> +			bio->bi_vcnt--;
> +			nbytes -= bv->bv_len;
> +		}
> +	}

Minor nitpicks on the loop:  it could be turned into a do { } while ()
loop, because nbytes is already checked above.  And the condition that
sets nbytes to 0 could just break out of the loo.

> +
> +	if (!bio->bi_iter.bi_size)
> +		return -EFAULT;

And as bi_size doesn't change in the loop, this should probably move
above the loop.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 3/8] block: align the bio after building it
  2025-08-19 16:49 ` [PATCHv3 3/8] block: align the bio after building it Keith Busch
  2025-08-20  7:07   ` Damien Le Moal
  2025-08-25  7:46   ` Christoph Hellwig
@ 2025-08-25  7:47   ` Christoph Hellwig
  2025-08-26  0:37     ` Keith Busch
  2 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2025-08-25  7:47 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-fsdevel, linux-kernel, snitzer, axboe, dw,
	brauner, hch, martin.petersen, djwong, linux-xfs, viro,
	Keith Busch

Also with this we should be able to drop the iov_iter_alignment check
for always COW inodes in xfs_file_dio_write.  If you don't feel like
doing that yourself I can add it to my todo list.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 4/8] block: simplify direct io validity check
  2025-08-19 16:49 ` [PATCHv3 4/8] block: simplify direct io validity check Keith Busch
@ 2025-08-25  7:48   ` Christoph Hellwig
  0 siblings, 0 replies; 42+ messages in thread
From: Christoph Hellwig @ 2025-08-25  7:48 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-fsdevel, linux-kernel, snitzer, axboe, dw,
	brauner, hch, martin.petersen, djwong, linux-xfs, viro,
	Keith Busch, Hannes Reinecke

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 5/8] iomap: simplify direct io validity check
  2025-08-19 16:49 ` [PATCHv3 5/8] iomap: " Keith Busch
@ 2025-08-25  7:48   ` Christoph Hellwig
  0 siblings, 0 replies; 42+ messages in thread
From: Christoph Hellwig @ 2025-08-25  7:48 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-fsdevel, linux-kernel, snitzer, axboe, dw,
	brauner, hch, martin.petersen, djwong, linux-xfs, viro,
	Keith Busch, Hannes Reinecke

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 6/8] block: remove bdev_iter_is_aligned
  2025-08-19 16:49 ` [PATCHv3 6/8] block: remove bdev_iter_is_aligned Keith Busch
@ 2025-08-25  7:48   ` Christoph Hellwig
  0 siblings, 0 replies; 42+ messages in thread
From: Christoph Hellwig @ 2025-08-25  7:48 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-fsdevel, linux-kernel, snitzer, axboe, dw,
	brauner, hch, martin.petersen, djwong, linux-xfs, viro,
	Keith Busch, Hannes Reinecke

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 7/8] blk-integrity: use simpler alignment check
  2025-08-19 16:49 ` [PATCHv3 7/8] blk-integrity: use simpler alignment check Keith Busch
@ 2025-08-25  7:49   ` Christoph Hellwig
  0 siblings, 0 replies; 42+ messages in thread
From: Christoph Hellwig @ 2025-08-25  7:49 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-fsdevel, linux-kernel, snitzer, axboe, dw,
	brauner, hch, martin.petersen, djwong, linux-xfs, viro,
	Keith Busch, Hannes Reinecke

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 8/8] iov_iter: remove iov_iter_is_aligned
  2025-08-19 16:49 ` [PATCHv3 8/8] iov_iter: remove iov_iter_is_aligned Keith Busch
@ 2025-08-25  7:50   ` Christoph Hellwig
  0 siblings, 0 replies; 42+ messages in thread
From: Christoph Hellwig @ 2025-08-25  7:50 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-fsdevel, linux-kernel, snitzer, axboe, dw,
	brauner, hch, martin.petersen, djwong, linux-xfs, viro,
	Keith Busch, Hannes Reinecke

I have two callers in code I plan to eventually upstream, but I think
of all your pending work in the area lands I might be able to kill
that.  If not I'll have to bring it back, but for now it should go as
all dead code should:

Signed-off-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 0/8] direct-io: even more flexible io vectors
  2025-08-22 13:27 ` Ritesh Harjani
  2025-08-22 14:30   ` Keith Busch
@ 2025-08-25 12:07   ` Jan Kara
  2025-08-25 14:53     ` Keith Busch
  1 sibling, 1 reply; 42+ messages in thread
From: Jan Kara @ 2025-08-25 12:07 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Keith Busch, linux-block, linux-fsdevel, linux-kernel, linux-ext4,
	snitzer, axboe, dw, brauner, hch, martin.petersen, djwong,
	linux-xfs, viro, Keith Busch, Jan Kara

On Fri 22-08-25 18:57:08, Ritesh Harjani wrote:
> Keith Busch <kbusch@meta.com> writes:
> 
> > From: Keith Busch <kbusch@kernel.org>
> >
> > Previous version:
> >
> >   https://lore.kernel.org/linux-block/20250805141123.332298-1-kbusch@meta.com/
> >
> > This series removes the direct io requirement that io vector lengths
> > align to the logical block size.
> >
> > I tested this on a few raw block device types including nvme,
> > virtio-blk, ahci, and loop. NVMe is the only one I tested with 4k
> > logical sectors; everything else was 512.
> >
> > On each of those, I tested several iomap filesystems: xfs, ext4, and
> > btrfs. I found it interesting that each behave a little
> > differently with handling invalid vector alignments:
> >
> >   - XFS is the most straight forward and reports failures on invalid
> >     vector conditions, same as raw blocks devices.
> >
> >   - EXT4 falls back to buffered io for writes but not for reads.
> 
> ++linux-ext4 to get any historical context behind why the difference of
> behaviour in reads v/s writes for EXT4 DIO. 

Hum, how did you test? Because in the basic testing I did (with vanilla
kernel) I get EINVAL when doing unaligned DIO write in ext4... We should be
falling back to buffered IO only if the underlying file itself does not
support any kind of direct IO.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 3/8] block: align the bio after building it
  2025-08-25  7:46   ` Christoph Hellwig
@ 2025-08-25 13:57     ` Keith Busch
  0 siblings, 0 replies; 42+ messages in thread
From: Keith Busch @ 2025-08-25 13:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, linux-block, linux-fsdevel, linux-kernel, snitzer,
	axboe, dw, brauner, martin.petersen, djwong, linux-xfs, viro

On Mon, Aug 25, 2025 at 09:46:06AM +0200, Christoph Hellwig wrote:
> On Tue, Aug 19, 2025 at 09:49:17AM -0700, Keith Busch wrote:
> > +
> > +	if (!bio->bi_iter.bi_size)
> > +		return -EFAULT;
> 
> And as bi_size doesn't change in the loop, this should probably move
> above the loop.

The loop also releases pinned pages if necessary. We can't count on the
caller to do that if we return error here.  __blkdev_direct_IO is the
only one that tries to release pages on error, everyone else would leak
them.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 0/8] direct-io: even more flexible io vectors
  2025-08-25 12:07   ` Jan Kara
@ 2025-08-25 14:53     ` Keith Busch
  2025-08-26  4:59       ` Ritesh Harjani
  0 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2025-08-25 14:53 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ritesh Harjani, Keith Busch, linux-block, linux-fsdevel,
	linux-kernel, linux-ext4, snitzer, axboe, dw, brauner, hch,
	martin.petersen, djwong, linux-xfs, viro, Jan Kara

On Mon, Aug 25, 2025 at 02:07:15PM +0200, Jan Kara wrote:
> On Fri 22-08-25 18:57:08, Ritesh Harjani wrote:
> > Keith Busch <kbusch@meta.com> writes:
> > >
> > >   - EXT4 falls back to buffered io for writes but not for reads.
> > 
> > ++linux-ext4 to get any historical context behind why the difference of
> > behaviour in reads v/s writes for EXT4 DIO. 
> 
> Hum, how did you test? Because in the basic testing I did (with vanilla
> kernel) I get EINVAL when doing unaligned DIO write in ext4... We should be
> falling back to buffered IO only if the underlying file itself does not
> support any kind of direct IO.

Simple test case (dio-offset-test.c) below.

I also ran this on vanilla kernel and got these results:

  # mkfs.ext4 /dev/vda
  # mount /dev/vda /mnt/ext4/
  # make dio-offset-test
  # ./dio-offset-test /mnt/ext4/foobar
  write: Success
  read: Invalid argument

I tracked the "write: Success" down to ext4's handling for the "special"
-ENOTBLK error after ext4_want_directio_fallback() returns "true".

dio-offset-test.c:
---
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif

#include <sys/uio.h>
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	unsigned int pagesize;
	struct iovec iov[2];
	int ret, fd;
	void *buf;

	if (argc < 2)
		err(EINVAL, "usage: %s <file>", argv[0]);
	
	pagesize = sysconf(_SC_PAGE_SIZE);
	ret = posix_memalign((void **)&buf, pagesize, 2 * pagesize);
	if (ret)
		err(errno, "%s: failed to allocate buf", __func__);
	
	fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC | O_DIRECT);
	if (fd < 0)
		err(errno, "%s: failed to open %s", __func__, argv[1]);
	
	iov[0].iov_base = buf;
	iov[0].iov_len = 256;
	iov[1].iov_base = buf + pagesize;
	iov[1].iov_len = 256;
	ret = pwritev(fd, iov, 2, 0);
	perror("write");
	
	ret = preadv(fd, iov, 2, 0);
	perror("read");
	
	return 0;
}
--

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 3/8] block: align the bio after building it
  2025-08-25  7:47   ` Christoph Hellwig
@ 2025-08-26  0:37     ` Keith Busch
  2025-08-26  8:02       ` Christoph Hellwig
  0 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2025-08-26  0:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, linux-block, linux-fsdevel, linux-kernel, snitzer,
	axboe, dw, brauner, martin.petersen, djwong, linux-xfs, viro

On Mon, Aug 25, 2025 at 09:47:44AM +0200, Christoph Hellwig wrote:
> Also with this we should be able to drop the iov_iter_alignment check
> for always COW inodes in xfs_file_dio_write.  If you don't feel like
> doing that yourself I can add it to my todo list.

I'm unsure about the commit that introduced that behavior, so I think
you should remove it if you know its okay. :)

Specifically, we have this in the comments and commit message:

  check the alignment of each individual iovec segment, as they could
  end up with different I/Os due to the way bio_iov_iter_get_pages works 

bio_iov_iter_get_pages() might submit the segments as separate IO's
anyway for other reasons. I am not sure why the alignment conditions are
handled specifically here.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 0/8] direct-io: even more flexible io vectors
  2025-08-25 14:53     ` Keith Busch
@ 2025-08-26  4:59       ` Ritesh Harjani
  2025-08-27 15:20         ` Jan Kara
  0 siblings, 1 reply; 42+ messages in thread
From: Ritesh Harjani @ 2025-08-26  4:59 UTC (permalink / raw)
  To: Keith Busch, Jan Kara
  Cc: Keith Busch, linux-block, linux-fsdevel, linux-kernel, linux-ext4,
	snitzer, axboe, dw, brauner, hch, martin.petersen, djwong,
	linux-xfs, viro, Jan Kara

Keith Busch <kbusch@kernel.org> writes:

> On Mon, Aug 25, 2025 at 02:07:15PM +0200, Jan Kara wrote:
>> On Fri 22-08-25 18:57:08, Ritesh Harjani wrote:
>> > Keith Busch <kbusch@meta.com> writes:
>> > >
>> > >   - EXT4 falls back to buffered io for writes but not for reads.
>> > 
>> > ++linux-ext4 to get any historical context behind why the difference of
>> > behaviour in reads v/s writes for EXT4 DIO. 
>> 
>> Hum, how did you test? Because in the basic testing I did (with vanilla
>> kernel) I get EINVAL when doing unaligned DIO write in ext4... We should be
>> falling back to buffered IO only if the underlying file itself does not
>> support any kind of direct IO.
>
> Simple test case (dio-offset-test.c) below.
>
> I also ran this on vanilla kernel and got these results:
>
>   # mkfs.ext4 /dev/vda
>   # mount /dev/vda /mnt/ext4/
>   # make dio-offset-test
>   # ./dio-offset-test /mnt/ext4/foobar
>   write: Success
>   read: Invalid argument
>
> I tracked the "write: Success" down to ext4's handling for the "special"
> -ENOTBLK error after ext4_want_directio_fallback() returns "true".
>

Right. Ext4 has fallback only for dio writes but not for DIO reads... 

buffered
static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
{
	/* must be a directio to fall back to buffered */
	if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) !=
		    (IOMAP_WRITE | IOMAP_DIRECT))
		return false;

    ...
}

So basically the path is ext4_file_[read|write]_iter() -> iomap_dio_rw
    -> iomap_dio_bio_iter() -> return -EINVAL. i.e. from...


	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
		return -EINVAL;

EXT4 then fallsback to buffered-io only for writes, but not for reads. 


-ritesh


> dio-offset-test.c:
> ---
> #ifndef _GNU_SOURCE
> #define _GNU_SOURCE
> #endif
>
> #include <sys/uio.h>
> #include <err.h>
> #include <errno.h>
> #include <fcntl.h>
> #include <stdlib.h>
> #include <stdio.h>
> #include <unistd.h>
>
> int main(int argc, char **argv)
> {
> 	unsigned int pagesize;
> 	struct iovec iov[2];
> 	int ret, fd;
> 	void *buf;
>
> 	if (argc < 2)
> 		err(EINVAL, "usage: %s <file>", argv[0]);
> 	
> 	pagesize = sysconf(_SC_PAGE_SIZE);
> 	ret = posix_memalign((void **)&buf, pagesize, 2 * pagesize);
> 	if (ret)
> 		err(errno, "%s: failed to allocate buf", __func__);
> 	
> 	fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC | O_DIRECT);
> 	if (fd < 0)
> 		err(errno, "%s: failed to open %s", __func__, argv[1]);
> 	
> 	iov[0].iov_base = buf;
> 	iov[0].iov_len = 256;
> 	iov[1].iov_base = buf + pagesize;
> 	iov[1].iov_len = 256;
> 	ret = pwritev(fd, iov, 2, 0);
> 	perror("write");
> 	
> 	ret = preadv(fd, iov, 2, 0);
> 	perror("read");
> 	
> 	return 0;
> }
> --

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 3/8] block: align the bio after building it
  2025-08-26  0:37     ` Keith Busch
@ 2025-08-26  8:02       ` Christoph Hellwig
  2025-08-26 23:11         ` Keith Busch
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2025-08-26  8:02 UTC (permalink / raw)
  To: Keith Busch
  Cc: Keith Busch, linux-block, linux-fsdevel, linux-kernel, snitzer,
	axboe, dw, brauner, martin.petersen, djwong, linux-xfs, viro

On Mon, Aug 25, 2025 at 06:37:05PM -0600, Keith Busch wrote:
> On Mon, Aug 25, 2025 at 09:47:44AM +0200, Christoph Hellwig wrote:
> > Also with this we should be able to drop the iov_iter_alignment check
> > for always COW inodes in xfs_file_dio_write.  If you don't feel like
> > doing that yourself I can add it to my todo list.
> 
> I'm unsure about the commit that introduced that behavior, so I think
> you should remove it if you know its okay. :)

Yeah.

> Specifically, we have this in the comments and commit message:
> 
>   check the alignment of each individual iovec segment, as they could
>   end up with different I/Os due to the way bio_iov_iter_get_pages works 
> 
> bio_iov_iter_get_pages() might submit the segments as separate IO's
> anyway for other reasons. I am not sure why the alignment conditions are
> handled specifically here.

I'll take another look.  Basically what this wants to prevent is
bio_iov_iter_get_pages creating bios not aligned to file system
block size.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 3/8] block: align the bio after building it
  2025-08-26  8:02       ` Christoph Hellwig
@ 2025-08-26 23:11         ` Keith Busch
  0 siblings, 0 replies; 42+ messages in thread
From: Keith Busch @ 2025-08-26 23:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, linux-block, linux-fsdevel, linux-kernel, snitzer,
	axboe, dw, brauner, martin.petersen, djwong, linux-xfs, viro

On Tue, Aug 26, 2025 at 10:02:00AM +0200, Christoph Hellwig wrote:
> On Mon, Aug 25, 2025 at 06:37:05PM -0600, Keith Busch wrote:
> > bio_iov_iter_get_pages() might submit the segments as separate IO's
> > anyway for other reasons. I am not sure why the alignment conditions are
> > handled specifically here.
> 
> I'll take another look.  Basically what this wants to prevent is
> bio_iov_iter_get_pages creating bios not aligned to file system
> block size.

Got it, I think we're okay in that case. You should only need to
consider the ki_pos and the iov_iter_count to align to the filesystem
block size. This patch will dispatch appropriately sized IO as long as
the hw limits allow it, and you'll get an error (same as before) if it
doesn't.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 0/8] direct-io: even more flexible io vectors
  2025-08-26  4:59       ` Ritesh Harjani
@ 2025-08-27 15:20         ` Jan Kara
  2025-08-27 16:09           ` Mike Snitzer
                             ` (3 more replies)
  0 siblings, 4 replies; 42+ messages in thread
From: Jan Kara @ 2025-08-27 15:20 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Keith Busch, Jan Kara, Keith Busch, linux-block, linux-fsdevel,
	linux-kernel, linux-ext4, snitzer, axboe, dw, brauner, hch,
	martin.petersen, djwong, linux-xfs, viro, Jan Kara, Brian Foster

[-- Attachment #1: Type: text/plain, Size: 2979 bytes --]

On Tue 26-08-25 10:29:58, Ritesh Harjani wrote:
> Keith Busch <kbusch@kernel.org> writes:
> 
> > On Mon, Aug 25, 2025 at 02:07:15PM +0200, Jan Kara wrote:
> >> On Fri 22-08-25 18:57:08, Ritesh Harjani wrote:
> >> > Keith Busch <kbusch@meta.com> writes:
> >> > >
> >> > >   - EXT4 falls back to buffered io for writes but not for reads.
> >> > 
> >> > ++linux-ext4 to get any historical context behind why the difference of
> >> > behaviour in reads v/s writes for EXT4 DIO. 
> >> 
> >> Hum, how did you test? Because in the basic testing I did (with vanilla
> >> kernel) I get EINVAL when doing unaligned DIO write in ext4... We should be
> >> falling back to buffered IO only if the underlying file itself does not
> >> support any kind of direct IO.
> >
> > Simple test case (dio-offset-test.c) below.
> >
> > I also ran this on vanilla kernel and got these results:
> >
> >   # mkfs.ext4 /dev/vda
> >   # mount /dev/vda /mnt/ext4/
> >   # make dio-offset-test
> >   # ./dio-offset-test /mnt/ext4/foobar
> >   write: Success
> >   read: Invalid argument
> >
> > I tracked the "write: Success" down to ext4's handling for the "special"
> > -ENOTBLK error after ext4_want_directio_fallback() returns "true".
> >
> 
> Right. Ext4 has fallback only for dio writes but not for DIO reads... 
> 
> buffered
> static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
> {
> 	/* must be a directio to fall back to buffered */
> 	if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) !=
> 		    (IOMAP_WRITE | IOMAP_DIRECT))
> 		return false;
> 
>     ...
> }
> 
> So basically the path is ext4_file_[read|write]_iter() -> iomap_dio_rw
>     -> iomap_dio_bio_iter() -> return -EINVAL. i.e. from...
> 
> 
> 	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
> 	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
> 		return -EINVAL;
> 
> EXT4 then fallsback to buffered-io only for writes, but not for reads. 

Right. And the fallback for writes was actually inadvertedly "added" by
commit bc264fea0f6f "iomap: support incremental iomap_iter advances". That
changed the error handling logic. Previously if iomap_dio_bio_iter()
returned EINVAL, it got propagated to userspace regardless of what
->iomap_end() returned. After this commit if ->iomap_end() returns error
(which is ENOTBLK in ext4 case), it gets propagated to userspace instead of
the error returned by iomap_dio_bio_iter().

Now both the old and new behavior make some sense so I won't argue that the
new iomap_iter() behavior is wrong. But I think we should change ext4 back
to the old behavior of failing unaligned dio writes instead of them falling
back to buffered IO. I think something like the attached patch should do
the trick - it makes unaligned dio writes fail again while writes to holes
of indirect-block mapped files still correctly fall back to buffered IO.
Once fstests run completes, I'll do a proper submission...


								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: 0001-ext4-Fail-unaligned-direct-IO-write-with-EINVAL.patch --]
[-- Type: text/x-patch, Size: 3254 bytes --]

From ce6da00a09647a03013c3f420c2e7ef7489c3de8 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Wed, 27 Aug 2025 14:55:19 +0200
Subject: [PATCH] ext4: Fail unaligned direct IO write with EINVAL

Commit bc264fea0f6f ("iomap: support incremental iomap_iter advances")
changed the error handling logic in iomap_iter(). Previously any error
from iomap_dio_bio_iter() got propagated to userspace, after this commit
if ->iomap_end returns error, it gets propagated to userspace instead of
an error from iomap_dio_bio_iter(). This results in unaligned writes to
ext4 to silently fallback to buffered IO instead of erroring out.

Now returning ENOTBLK for DIO writes from ext4_iomap_end() seems
unnecessary these days. It is enough to return ENOTBLK from
ext4_iomap_begin() when we don't support DIO write for that particular
file offset (due to hole).

Fixes: bc264fea0f6f ("iomap: support incremental iomap_iter advances")
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/file.c  |  2 --
 fs/ext4/inode.c | 35 -----------------------------------
 2 files changed, 37 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 93240e35ee36..cf39f57d21e9 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -579,8 +579,6 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		iomap_ops = &ext4_iomap_overwrite_ops;
 	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
 			   dio_flags, NULL, 0);
-	if (ret == -ENOTBLK)
-		ret = 0;
 	if (extend) {
 		/*
 		 * We always perform extending DIO write synchronously so by
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5b7a15db4953..c3b23c90fd11 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3872,47 +3872,12 @@ static int ext4_iomap_overwrite_begin(struct inode *inode, loff_t offset,
 	return ret;
 }
 
-static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
-{
-	/* must be a directio to fall back to buffered */
-	if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) !=
-		    (IOMAP_WRITE | IOMAP_DIRECT))
-		return false;
-
-	/* atomic writes are all-or-nothing */
-	if (flags & IOMAP_ATOMIC)
-		return false;
-
-	/* can only try again if we wrote nothing */
-	return written == 0;
-}
-
-static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
-			  ssize_t written, unsigned flags, struct iomap *iomap)
-{
-	/*
-	 * Check to see whether an error occurred while writing out the data to
-	 * the allocated blocks. If so, return the magic error code for
-	 * non-atomic write so that we fallback to buffered I/O and attempt to
-	 * complete the remainder of the I/O.
-	 * For non-atomic writes, any blocks that may have been
-	 * allocated in preparation for the direct I/O will be reused during
-	 * buffered I/O. For atomic write, we never fallback to buffered-io.
-	 */
-	if (ext4_want_directio_fallback(flags, written))
-		return -ENOTBLK;
-
-	return 0;
-}
-
 const struct iomap_ops ext4_iomap_ops = {
 	.iomap_begin		= ext4_iomap_begin,
-	.iomap_end		= ext4_iomap_end,
 };
 
 const struct iomap_ops ext4_iomap_overwrite_ops = {
 	.iomap_begin		= ext4_iomap_overwrite_begin,
-	.iomap_end		= ext4_iomap_end,
 };
 
 static int ext4_iomap_begin_report(struct inode *inode, loff_t offset,
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 0/8] direct-io: even more flexible io vectors
  2025-08-27 15:20         ` Jan Kara
@ 2025-08-27 16:09           ` Mike Snitzer
  2025-09-01  7:55             ` Jan Kara
  2025-08-27 17:52           ` Brian Foster
                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 42+ messages in thread
From: Mike Snitzer @ 2025-08-27 16:09 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ritesh Harjani, Keith Busch, Keith Busch, linux-block,
	linux-fsdevel, linux-kernel, linux-ext4, axboe, dw, brauner, hch,
	martin.petersen, djwong, linux-xfs, viro, Jan Kara, Brian Foster

Hi Jan,

On Wed, Aug 27, 2025 at 05:20:53PM +0200, Jan Kara wrote:
> On Tue 26-08-25 10:29:58, Ritesh Harjani wrote:
> > Keith Busch <kbusch@kernel.org> writes:
> > 
> > > On Mon, Aug 25, 2025 at 02:07:15PM +0200, Jan Kara wrote:
> > >> On Fri 22-08-25 18:57:08, Ritesh Harjani wrote:
> > >> > Keith Busch <kbusch@meta.com> writes:
> > >> > >
> > >> > >   - EXT4 falls back to buffered io for writes but not for reads.
> > >> > 
> > >> > ++linux-ext4 to get any historical context behind why the difference of
> > >> > behaviour in reads v/s writes for EXT4 DIO. 
> > >> 
> > >> Hum, how did you test? Because in the basic testing I did (with vanilla
> > >> kernel) I get EINVAL when doing unaligned DIO write in ext4... We should be
> > >> falling back to buffered IO only if the underlying file itself does not
> > >> support any kind of direct IO.
> > >
> > > Simple test case (dio-offset-test.c) below.
> > >
> > > I also ran this on vanilla kernel and got these results:
> > >
> > >   # mkfs.ext4 /dev/vda
> > >   # mount /dev/vda /mnt/ext4/
> > >   # make dio-offset-test
> > >   # ./dio-offset-test /mnt/ext4/foobar
> > >   write: Success
> > >   read: Invalid argument
> > >
> > > I tracked the "write: Success" down to ext4's handling for the "special"
> > > -ENOTBLK error after ext4_want_directio_fallback() returns "true".
> > >
> > 
> > Right. Ext4 has fallback only for dio writes but not for DIO reads... 
> > 
> > buffered
> > static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
> > {
> > 	/* must be a directio to fall back to buffered */
> > 	if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) !=
> > 		    (IOMAP_WRITE | IOMAP_DIRECT))
> > 		return false;
> > 
> >     ...
> > }
> > 
> > So basically the path is ext4_file_[read|write]_iter() -> iomap_dio_rw
> >     -> iomap_dio_bio_iter() -> return -EINVAL. i.e. from...
> > 
> > 
> > 	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
> > 	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
> > 		return -EINVAL;
> > 
> > EXT4 then fallsback to buffered-io only for writes, but not for reads. 
> 
> Right. And the fallback for writes was actually inadvertedly "added" by
> commit bc264fea0f6f "iomap: support incremental iomap_iter advances". That
> changed the error handling logic. Previously if iomap_dio_bio_iter()
> returned EINVAL, it got propagated to userspace regardless of what
> ->iomap_end() returned. After this commit if ->iomap_end() returns error
> (which is ENOTBLK in ext4 case), it gets propagated to userspace instead of
> the error returned by iomap_dio_bio_iter().
> 
> Now both the old and new behavior make some sense so I won't argue that the
> new iomap_iter() behavior is wrong. But I think we should change ext4 back
> to the old behavior of failing unaligned dio writes instead of them falling
> back to buffered IO. I think something like the attached patch should do
> the trick - it makes unaligned dio writes fail again while writes to holes
> of indirect-block mapped files still correctly fall back to buffered IO.
> Once fstests run completes, I'll do a proper submission...
> 
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

> From ce6da00a09647a03013c3f420c2e7ef7489c3de8 Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@suse.cz>
> Date: Wed, 27 Aug 2025 14:55:19 +0200
> Subject: [PATCH] ext4: Fail unaligned direct IO write with EINVAL
> 
> Commit bc264fea0f6f ("iomap: support incremental iomap_iter advances")
> changed the error handling logic in iomap_iter(). Previously any error
> from iomap_dio_bio_iter() got propagated to userspace, after this commit
> if ->iomap_end returns error, it gets propagated to userspace instead of
> an error from iomap_dio_bio_iter(). This results in unaligned writes to
> ext4 to silently fallback to buffered IO instead of erroring out.
> 
> Now returning ENOTBLK for DIO writes from ext4_iomap_end() seems
> unnecessary these days. It is enough to return ENOTBLK from
> ext4_iomap_begin() when we don't support DIO write for that particular
> file offset (due to hole).

Any particular reason for ext4 still returning -ENOTBLK for unaligned
DIO?

In my experience XFS returns -EINVAL when failing unaligned DIO (but
maybe there are edge cases where that isn't always the case?)

Would be nice to have consistency across filesystems for what is
returned when failing unaligned DIO.

The iomap code returns -ENOTBLK as "the magic error code to fall back
to buffered I/O".  But that seems only for page cache invalidation
failure, _not_ for unaligned DIO.

(Anyway, __iomap_dio_rw's WRITE handling can return -ENOTBLK if page
cache invalidation fails during DIO write. So it seems higher-level
code, like I've added to NFS/NFSD to check for unaligned DIO failure,
should check for both -EINVAL and -ENOTBLK).

Thanks,
Mike

ps. ENOTBLK is actually much less easily confused with other random
uses of EINVAL (EINVAL use is generally way too overloaded, rendering
it a pretty unhelpful error).  But switching XFS to use ENOTBLK
instead of EINVAL seems like disruptive interface breakage (I suppose
same could be said for ext4 if it were to now return EINVAL for
unaligned DIO, but ext4 flip-flopping on how it handles unaligned DIO
prompted me to ask these questions now)

> ---
>  fs/ext4/file.c  |  2 --
>  fs/ext4/inode.c | 35 -----------------------------------
>  2 files changed, 37 deletions(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 93240e35ee36..cf39f57d21e9 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -579,8 +579,6 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  		iomap_ops = &ext4_iomap_overwrite_ops;
>  	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
>  			   dio_flags, NULL, 0);
> -	if (ret == -ENOTBLK)
> -		ret = 0;
>  	if (extend) {
>  		/*
>  		 * We always perform extending DIO write synchronously so by
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5b7a15db4953..c3b23c90fd11 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3872,47 +3872,12 @@ static int ext4_iomap_overwrite_begin(struct inode *inode, loff_t offset,
>  	return ret;
>  }
>  
> -static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
> -{
> -	/* must be a directio to fall back to buffered */
> -	if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) !=
> -		    (IOMAP_WRITE | IOMAP_DIRECT))
> -		return false;
> -
> -	/* atomic writes are all-or-nothing */
> -	if (flags & IOMAP_ATOMIC)
> -		return false;
> -
> -	/* can only try again if we wrote nothing */
> -	return written == 0;
> -}
> -
> -static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
> -			  ssize_t written, unsigned flags, struct iomap *iomap)
> -{
> -	/*
> -	 * Check to see whether an error occurred while writing out the data to
> -	 * the allocated blocks. If so, return the magic error code for
> -	 * non-atomic write so that we fallback to buffered I/O and attempt to
> -	 * complete the remainder of the I/O.
> -	 * For non-atomic writes, any blocks that may have been
> -	 * allocated in preparation for the direct I/O will be reused during
> -	 * buffered I/O. For atomic write, we never fallback to buffered-io.
> -	 */
> -	if (ext4_want_directio_fallback(flags, written))
> -		return -ENOTBLK;
> -
> -	return 0;
> -}
> -
>  const struct iomap_ops ext4_iomap_ops = {
>  	.iomap_begin		= ext4_iomap_begin,
> -	.iomap_end		= ext4_iomap_end,
>  };
>  
>  const struct iomap_ops ext4_iomap_overwrite_ops = {
>  	.iomap_begin		= ext4_iomap_overwrite_begin,
> -	.iomap_end		= ext4_iomap_end,
>  };
>  
>  static int ext4_iomap_begin_report(struct inode *inode, loff_t offset,
> -- 
> 2.43.0
> 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 0/8] direct-io: even more flexible io vectors
  2025-08-27 15:20         ` Jan Kara
  2025-08-27 16:09           ` Mike Snitzer
@ 2025-08-27 17:52           ` Brian Foster
  2025-08-27 19:20           ` Keith Busch
  2025-08-29  2:11           ` Ritesh Harjani
  3 siblings, 0 replies; 42+ messages in thread
From: Brian Foster @ 2025-08-27 17:52 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ritesh Harjani, Keith Busch, Keith Busch, linux-block,
	linux-fsdevel, linux-kernel, linux-ext4, snitzer, axboe, dw,
	brauner, hch, martin.petersen, djwong, linux-xfs, viro, Jan Kara

On Wed, Aug 27, 2025 at 05:20:53PM +0200, Jan Kara wrote:
> On Tue 26-08-25 10:29:58, Ritesh Harjani wrote:
> > Keith Busch <kbusch@kernel.org> writes:
> > 
> > > On Mon, Aug 25, 2025 at 02:07:15PM +0200, Jan Kara wrote:
> > >> On Fri 22-08-25 18:57:08, Ritesh Harjani wrote:
> > >> > Keith Busch <kbusch@meta.com> writes:
> > >> > >
> > >> > >   - EXT4 falls back to buffered io for writes but not for reads.
> > >> > 
> > >> > ++linux-ext4 to get any historical context behind why the difference of
> > >> > behaviour in reads v/s writes for EXT4 DIO. 
> > >> 
> > >> Hum, how did you test? Because in the basic testing I did (with vanilla
> > >> kernel) I get EINVAL when doing unaligned DIO write in ext4... We should be
> > >> falling back to buffered IO only if the underlying file itself does not
> > >> support any kind of direct IO.
> > >
> > > Simple test case (dio-offset-test.c) below.
> > >
> > > I also ran this on vanilla kernel and got these results:
> > >
> > >   # mkfs.ext4 /dev/vda
> > >   # mount /dev/vda /mnt/ext4/
> > >   # make dio-offset-test
> > >   # ./dio-offset-test /mnt/ext4/foobar
> > >   write: Success
> > >   read: Invalid argument
> > >
> > > I tracked the "write: Success" down to ext4's handling for the "special"
> > > -ENOTBLK error after ext4_want_directio_fallback() returns "true".
> > >
> > 
> > Right. Ext4 has fallback only for dio writes but not for DIO reads... 
> > 
> > buffered
> > static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
> > {
> > 	/* must be a directio to fall back to buffered */
> > 	if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) !=
> > 		    (IOMAP_WRITE | IOMAP_DIRECT))
> > 		return false;
> > 
> >     ...
> > }
> > 
> > So basically the path is ext4_file_[read|write]_iter() -> iomap_dio_rw
> >     -> iomap_dio_bio_iter() -> return -EINVAL. i.e. from...
> > 
> > 
> > 	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
> > 	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
> > 		return -EINVAL;
> > 
> > EXT4 then fallsback to buffered-io only for writes, but not for reads. 
> 
> Right. And the fallback for writes was actually inadvertedly "added" by
> commit bc264fea0f6f "iomap: support incremental iomap_iter advances". That
> changed the error handling logic. Previously if iomap_dio_bio_iter()
> returned EINVAL, it got propagated to userspace regardless of what
> ->iomap_end() returned. After this commit if ->iomap_end() returns error
> (which is ENOTBLK in ext4 case), it gets propagated to userspace instead of
> the error returned by iomap_dio_bio_iter().
> 

Ah, so IIUC you're referring to the change in iomap_iter() where the
iomap_end() error code was returned "if (ret < 0 && !iter->processed)",
where iter->processed held a potential error code from the iterator.
That was changed to !advanced, which filters out the processed < 0 case
and allows the error to return from iomap_end here rather than from
iter->processed a few lines down.

There were further changes to eliminate the advance from iomap_iter()
case (and rename processed -> status), so I suppose we could consider
changing that to something like:

	if (ret < 0 && !advanced && !iter->status)
		return ret;

... which I think would restore original error behavior. But I agree
it's not totally clear which is preferable. Certainly the change in
behavior was not intentional so thanks for the analysis. I'd have to
stare at the code and think (and test) some more to form an opinion on
whether it's worth changing. Meanwhile it looks like you have a
reasonable enough workaround..

Brian

> Now both the old and new behavior make some sense so I won't argue that the
> new iomap_iter() behavior is wrong. But I think we should change ext4 back
> to the old behavior of failing unaligned dio writes instead of them falling
> back to buffered IO. I think something like the attached patch should do
> the trick - it makes unaligned dio writes fail again while writes to holes
> of indirect-block mapped files still correctly fall back to buffered IO.
> Once fstests run completes, I'll do a proper submission...
> 
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

> From ce6da00a09647a03013c3f420c2e7ef7489c3de8 Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@suse.cz>
> Date: Wed, 27 Aug 2025 14:55:19 +0200
> Subject: [PATCH] ext4: Fail unaligned direct IO write with EINVAL
> 
> Commit bc264fea0f6f ("iomap: support incremental iomap_iter advances")
> changed the error handling logic in iomap_iter(). Previously any error
> from iomap_dio_bio_iter() got propagated to userspace, after this commit
> if ->iomap_end returns error, it gets propagated to userspace instead of
> an error from iomap_dio_bio_iter(). This results in unaligned writes to
> ext4 to silently fallback to buffered IO instead of erroring out.
> 
> Now returning ENOTBLK for DIO writes from ext4_iomap_end() seems
> unnecessary these days. It is enough to return ENOTBLK from
> ext4_iomap_begin() when we don't support DIO write for that particular
> file offset (due to hole).
> 
> Fixes: bc264fea0f6f ("iomap: support incremental iomap_iter advances")
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext4/file.c  |  2 --
>  fs/ext4/inode.c | 35 -----------------------------------
>  2 files changed, 37 deletions(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 93240e35ee36..cf39f57d21e9 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -579,8 +579,6 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  		iomap_ops = &ext4_iomap_overwrite_ops;
>  	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
>  			   dio_flags, NULL, 0);
> -	if (ret == -ENOTBLK)
> -		ret = 0;
>  	if (extend) {
>  		/*
>  		 * We always perform extending DIO write synchronously so by
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5b7a15db4953..c3b23c90fd11 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3872,47 +3872,12 @@ static int ext4_iomap_overwrite_begin(struct inode *inode, loff_t offset,
>  	return ret;
>  }
>  
> -static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
> -{
> -	/* must be a directio to fall back to buffered */
> -	if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) !=
> -		    (IOMAP_WRITE | IOMAP_DIRECT))
> -		return false;
> -
> -	/* atomic writes are all-or-nothing */
> -	if (flags & IOMAP_ATOMIC)
> -		return false;
> -
> -	/* can only try again if we wrote nothing */
> -	return written == 0;
> -}
> -
> -static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
> -			  ssize_t written, unsigned flags, struct iomap *iomap)
> -{
> -	/*
> -	 * Check to see whether an error occurred while writing out the data to
> -	 * the allocated blocks. If so, return the magic error code for
> -	 * non-atomic write so that we fallback to buffered I/O and attempt to
> -	 * complete the remainder of the I/O.
> -	 * For non-atomic writes, any blocks that may have been
> -	 * allocated in preparation for the direct I/O will be reused during
> -	 * buffered I/O. For atomic write, we never fallback to buffered-io.
> -	 */
> -	if (ext4_want_directio_fallback(flags, written))
> -		return -ENOTBLK;
> -
> -	return 0;
> -}
> -
>  const struct iomap_ops ext4_iomap_ops = {
>  	.iomap_begin		= ext4_iomap_begin,
> -	.iomap_end		= ext4_iomap_end,
>  };
>  
>  const struct iomap_ops ext4_iomap_overwrite_ops = {
>  	.iomap_begin		= ext4_iomap_overwrite_begin,
> -	.iomap_end		= ext4_iomap_end,
>  };
>  
>  static int ext4_iomap_begin_report(struct inode *inode, loff_t offset,
> -- 
> 2.43.0
> 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 0/8] direct-io: even more flexible io vectors
  2025-08-27 15:20         ` Jan Kara
  2025-08-27 16:09           ` Mike Snitzer
  2025-08-27 17:52           ` Brian Foster
@ 2025-08-27 19:20           ` Keith Busch
  2025-09-01  8:22             ` Jan Kara
  2025-08-29  2:11           ` Ritesh Harjani
  3 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2025-08-27 19:20 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ritesh Harjani, Keith Busch, linux-block, linux-fsdevel,
	linux-kernel, linux-ext4, snitzer, axboe, dw, brauner, hch,
	martin.petersen, djwong, linux-xfs, viro, Jan Kara, Brian Foster

On Wed, Aug 27, 2025 at 05:20:53PM +0200, Jan Kara wrote:
> Now both the old and new behavior make some sense so I won't argue that the
> new iomap_iter() behavior is wrong. But I think we should change ext4 back
> to the old behavior of failing unaligned dio writes instead of them falling
> back to buffered IO. I think something like the attached patch should do
> the trick - it makes unaligned dio writes fail again while writes to holes
> of indirect-block mapped files still correctly fall back to buffered IO.
> Once fstests run completes, I'll do a proper submission...

Your suggestion looks all well and good, but I have a general question
about fstests. I've written up some to test this series, and I have
filesystem specific expectations for what should error or succeed. If
you modify ext4 to fail direct-io as described, my test will have to be
kernel version specific too. Is there a best practice in fstests for
handling such scenarios?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 0/8] direct-io: even more flexible io vectors
  2025-08-27 15:20         ` Jan Kara
                             ` (2 preceding siblings ...)
  2025-08-27 19:20           ` Keith Busch
@ 2025-08-29  2:11           ` Ritesh Harjani
  2025-08-29  3:19             ` Ritesh Harjani
  3 siblings, 1 reply; 42+ messages in thread
From: Ritesh Harjani @ 2025-08-29  2:11 UTC (permalink / raw)
  To: Jan Kara
  Cc: Keith Busch, Jan Kara, Keith Busch, linux-block, linux-fsdevel,
	linux-kernel, linux-ext4, snitzer, axboe, dw, brauner, hch,
	martin.petersen, djwong, linux-xfs, viro, Jan Kara, Brian Foster

Jan Kara <jack@suse.cz> writes:

> On Tue 26-08-25 10:29:58, Ritesh Harjani wrote:
>> Keith Busch <kbusch@kernel.org> writes:
>> 
>> > On Mon, Aug 25, 2025 at 02:07:15PM +0200, Jan Kara wrote:
>> >> On Fri 22-08-25 18:57:08, Ritesh Harjani wrote:
>> >> > Keith Busch <kbusch@meta.com> writes:
>> >> > >
>> >> > >   - EXT4 falls back to buffered io for writes but not for reads.
>> >> > 
>> >> > ++linux-ext4 to get any historical context behind why the difference of
>> >> > behaviour in reads v/s writes for EXT4 DIO. 
>> >> 
>> >> Hum, how did you test? Because in the basic testing I did (with vanilla
>> >> kernel) I get EINVAL when doing unaligned DIO write in ext4... We should be
>> >> falling back to buffered IO only if the underlying file itself does not
>> >> support any kind of direct IO.
>> >
>> > Simple test case (dio-offset-test.c) below.
>> >
>> > I also ran this on vanilla kernel and got these results:
>> >
>> >   # mkfs.ext4 /dev/vda
>> >   # mount /dev/vda /mnt/ext4/
>> >   # make dio-offset-test
>> >   # ./dio-offset-test /mnt/ext4/foobar
>> >   write: Success
>> >   read: Invalid argument
>> >
>> > I tracked the "write: Success" down to ext4's handling for the "special"
>> > -ENOTBLK error after ext4_want_directio_fallback() returns "true".
>> >
>> 
>> Right. Ext4 has fallback only for dio writes but not for DIO reads... 
>> 
>> buffered
>> static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
>> {
>> 	/* must be a directio to fall back to buffered */
>> 	if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) !=
>> 		    (IOMAP_WRITE | IOMAP_DIRECT))
>> 		return false;
>> 
>>     ...
>> }
>> 
>> So basically the path is ext4_file_[read|write]_iter() -> iomap_dio_rw
>>     -> iomap_dio_bio_iter() -> return -EINVAL. i.e. from...
>> 
>> 
>> 	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
>> 	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
>> 		return -EINVAL;
>> 
>> EXT4 then fallsback to buffered-io only for writes, but not for reads. 
>
> Right. And the fallback for writes was actually inadvertedly "added" by
> commit bc264fea0f6f "iomap: support incremental iomap_iter advances". That
> changed the error handling logic. Previously if iomap_dio_bio_iter()
> returned EINVAL, it got propagated to userspace regardless of what
> ->iomap_end() returned. After this commit if ->iomap_end() returns error
> (which is ENOTBLK in ext4 case), it gets propagated to userspace instead of
> the error returned by iomap_dio_bio_iter().
>
> Now both the old and new behavior make some sense so I won't argue that the
> new iomap_iter() behavior is wrong. But I think we should change ext4 back
> to the old behavior of failing unaligned dio writes instead of them falling
> back to buffered IO. I think something like the attached patch should do
> the trick - it makes unaligned dio writes fail again while writes to holes
> of indirect-block mapped files still correctly fall back to buffered IO.
> Once fstests run completes, I'll do a proper submission...
>

Aah, right. So it wasn't EXT4 which had this behaviour of falling back
to buffered I/O for unaligned writes. Earlier EXT4 was assuming an error
code will be detected by iomap and will be passed to it as "written" in
ext4_iomap_end() for such unaligned writes. But I guess that logic
silently got changed with that commit. Thanks for analyzing that. 
I missed looking underneath iomap behaviour change :). 


>
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
> From ce6da00a09647a03013c3f420c2e7ef7489c3de8 Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@suse.cz>
> Date: Wed, 27 Aug 2025 14:55:19 +0200
> Subject: [PATCH] ext4: Fail unaligned direct IO write with EINVAL
>
> Commit bc264fea0f6f ("iomap: support incremental iomap_iter advances")
> changed the error handling logic in iomap_iter(). Previously any error
> from iomap_dio_bio_iter() got propagated to userspace, after this commit
> if ->iomap_end returns error, it gets propagated to userspace instead of
> an error from iomap_dio_bio_iter(). This results in unaligned writes to
> ext4 to silently fallback to buffered IO instead of erroring out.
>
> Now returning ENOTBLK for DIO writes from ext4_iomap_end() seems
> unnecessary these days. It is enough to return ENOTBLK from
> ext4_iomap_begin() when we don't support DIO write for that particular
> file offset (due to hole).

Right. This mainly only happens if we have holes in non-extent (indirect
blocks) case.

Also, as I see ext4 always just fallsback to buffered-io for no or
partial writes (unless iomap returned any error code). So, I was just
wondering if that could ever happen for DIO atomic write case. It's good
that we have a WARN_ON_ONCE() check in there to catch it. But I was
wondering if this needs an explicit handling in ext4_dio_write_iter() to
not fallback to buffered-writes for atomic DIO requests?

-ritesh



>
> Fixes: bc264fea0f6f ("iomap: support incremental iomap_iter advances")
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext4/file.c  |  2 --
>  fs/ext4/inode.c | 35 -----------------------------------
>  2 files changed, 37 deletions(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 93240e35ee36..cf39f57d21e9 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -579,8 +579,6 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  		iomap_ops = &ext4_iomap_overwrite_ops;
>  	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
>  			   dio_flags, NULL, 0);
> -	if (ret == -ENOTBLK)
> -		ret = 0;
>  	if (extend) {
>  		/*
>  		 * We always perform extending DIO write synchronously so by
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5b7a15db4953..c3b23c90fd11 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3872,47 +3872,12 @@ static int ext4_iomap_overwrite_begin(struct inode *inode, loff_t offset,
>  	return ret;
>  }
>  
> -static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
> -{
> -	/* must be a directio to fall back to buffered */
> -	if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) !=
> -		    (IOMAP_WRITE | IOMAP_DIRECT))
> -		return false;
> -
> -	/* atomic writes are all-or-nothing */
> -	if (flags & IOMAP_ATOMIC)
> -		return false;
> -
> -	/* can only try again if we wrote nothing */
> -	return written == 0;
> -}
> -
> -static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
> -			  ssize_t written, unsigned flags, struct iomap *iomap)
> -{
> -	/*
> -	 * Check to see whether an error occurred while writing out the data to
> -	 * the allocated blocks. If so, return the magic error code for
> -	 * non-atomic write so that we fallback to buffered I/O and attempt to
> -	 * complete the remainder of the I/O.
> -	 * For non-atomic writes, any blocks that may have been
> -	 * allocated in preparation for the direct I/O will be reused during
> -	 * buffered I/O. For atomic write, we never fallback to buffered-io.
> -	 */
> -	if (ext4_want_directio_fallback(flags, written))
> -		return -ENOTBLK;
> -
> -	return 0;
> -}
> -
>  const struct iomap_ops ext4_iomap_ops = {
>  	.iomap_begin		= ext4_iomap_begin,
> -	.iomap_end		= ext4_iomap_end,
>  };
>  
>  const struct iomap_ops ext4_iomap_overwrite_ops = {
>  	.iomap_begin		= ext4_iomap_overwrite_begin,
> -	.iomap_end		= ext4_iomap_end,
>  };
>  
>  static int ext4_iomap_begin_report(struct inode *inode, loff_t offset,
> -- 
> 2.43.0

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 0/8] direct-io: even more flexible io vectors
  2025-08-29  2:11           ` Ritesh Harjani
@ 2025-08-29  3:19             ` Ritesh Harjani
  0 siblings, 0 replies; 42+ messages in thread
From: Ritesh Harjani @ 2025-08-29  3:19 UTC (permalink / raw)
  To: Jan Kara
  Cc: Keith Busch, Jan Kara, Keith Busch, linux-block, linux-fsdevel,
	linux-kernel, linux-ext4, snitzer, axboe, dw, brauner, hch,
	martin.petersen, djwong, linux-xfs, viro, Jan Kara, Brian Foster

Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:

> Jan Kara <jack@suse.cz> writes:
>
>> On Tue 26-08-25 10:29:58, Ritesh Harjani wrote:
>>> Keith Busch <kbusch@kernel.org> writes:
>>> 
>>> > On Mon, Aug 25, 2025 at 02:07:15PM +0200, Jan Kara wrote:
>>> >> On Fri 22-08-25 18:57:08, Ritesh Harjani wrote:
>>> >> > Keith Busch <kbusch@meta.com> writes:
>>> >> > >
>>> >> > >   - EXT4 falls back to buffered io for writes but not for reads.
>>> >> > 
>>> >> > ++linux-ext4 to get any historical context behind why the difference of
>>> >> > behaviour in reads v/s writes for EXT4 DIO. 
>>> >> 
>>> >> Hum, how did you test? Because in the basic testing I did (with vanilla
>>> >> kernel) I get EINVAL when doing unaligned DIO write in ext4... We should be
>>> >> falling back to buffered IO only if the underlying file itself does not
>>> >> support any kind of direct IO.
>>> >
>>> > Simple test case (dio-offset-test.c) below.
>>> >
>>> > I also ran this on vanilla kernel and got these results:
>>> >
>>> >   # mkfs.ext4 /dev/vda
>>> >   # mount /dev/vda /mnt/ext4/
>>> >   # make dio-offset-test
>>> >   # ./dio-offset-test /mnt/ext4/foobar
>>> >   write: Success
>>> >   read: Invalid argument
>>> >
>>> > I tracked the "write: Success" down to ext4's handling for the "special"
>>> > -ENOTBLK error after ext4_want_directio_fallback() returns "true".
>>> >
>>> 
>>> Right. Ext4 has fallback only for dio writes but not for DIO reads... 
>>> 
>>> buffered
>>> static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
>>> {
>>> 	/* must be a directio to fall back to buffered */
>>> 	if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) !=
>>> 		    (IOMAP_WRITE | IOMAP_DIRECT))
>>> 		return false;
>>> 
>>>     ...
>>> }
>>> 
>>> So basically the path is ext4_file_[read|write]_iter() -> iomap_dio_rw
>>>     -> iomap_dio_bio_iter() -> return -EINVAL. i.e. from...
>>> 
>>> 
>>> 	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
>>> 	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
>>> 		return -EINVAL;
>>> 
>>> EXT4 then fallsback to buffered-io only for writes, but not for reads. 
>>
>> Right. And the fallback for writes was actually inadvertedly "added" by
>> commit bc264fea0f6f "iomap: support incremental iomap_iter advances". That
>> changed the error handling logic. Previously if iomap_dio_bio_iter()
>> returned EINVAL, it got propagated to userspace regardless of what
>> ->iomap_end() returned. After this commit if ->iomap_end() returns error
>> (which is ENOTBLK in ext4 case), it gets propagated to userspace instead of
>> the error returned by iomap_dio_bio_iter().
>>
>> Now both the old and new behavior make some sense so I won't argue that the
>> new iomap_iter() behavior is wrong. But I think we should change ext4 back
>> to the old behavior of failing unaligned dio writes instead of them falling
>> back to buffered IO. I think something like the attached patch should do
>> the trick - it makes unaligned dio writes fail again while writes to holes
>> of indirect-block mapped files still correctly fall back to buffered IO.
>> Once fstests run completes, I'll do a proper submission...
>>
>
> Aah, right. So it wasn't EXT4 which had this behaviour of falling back
> to buffered I/O for unaligned writes. Earlier EXT4 was assuming an error
> code will be detected by iomap and will be passed to it as "written" in
> ext4_iomap_end() for such unaligned writes. But I guess that logic
> silently got changed with that commit. Thanks for analyzing that. 
> I missed looking underneath iomap behaviour change :). 
>
>
>>
>> 								Honza
>> -- 
>> Jan Kara <jack@suse.com>
>> SUSE Labs, CR
>> From ce6da00a09647a03013c3f420c2e7ef7489c3de8 Mon Sep 17 00:00:00 2001
>> From: Jan Kara <jack@suse.cz>
>> Date: Wed, 27 Aug 2025 14:55:19 +0200
>> Subject: [PATCH] ext4: Fail unaligned direct IO write with EINVAL
>>
>> Commit bc264fea0f6f ("iomap: support incremental iomap_iter advances")
>> changed the error handling logic in iomap_iter(). Previously any error
>> from iomap_dio_bio_iter() got propagated to userspace, after this commit
>> if ->iomap_end returns error, it gets propagated to userspace instead of
>> an error from iomap_dio_bio_iter(). This results in unaligned writes to
>> ext4 to silently fallback to buffered IO instead of erroring out.
>>
>> Now returning ENOTBLK for DIO writes from ext4_iomap_end() seems
>> unnecessary these days. It is enough to return ENOTBLK from
>> ext4_iomap_begin() when we don't support DIO write for that particular
>> file offset (due to hole).
>
> Right. This mainly only happens if we have holes in non-extent (indirect
> blocks) case.
>

Thinking more on this case. Do we really want a fallback to buffered-io
for unaligned writes in this case (indirect block case)?

I don't think we care much here, right? And anyways the unaligned writes
should have the same behaviour for extents v/s non-extents case right?

I guess the problem is, iomap alignment check happens in
iomap_dio_bio_iter() where it has a valid bdev (populated by filesystem
during ->iomap_begin() call) to check the alignment against. But in this
indirect block case we return -ENOTBLK much earlier from ->iomap_begin()
call itself.


-ritesh



> Also, as I see ext4 always just fallsback to buffered-io for no or
> partial writes (unless iomap returned any error code). So, I was just
> wondering if that could ever happen for DIO atomic write case. It's good
> that we have a WARN_ON_ONCE() check in there to catch it. But I was
> wondering if this needs an explicit handling in ext4_dio_write_iter() to
> not fallback to buffered-writes for atomic DIO requests?
>
> -ritesh
>
>
>
>>
>> Fixes: bc264fea0f6f ("iomap: support incremental iomap_iter advances")
>> Signed-off-by: Jan Kara <jack@suse.cz>
>> ---
>>  fs/ext4/file.c  |  2 --
>>  fs/ext4/inode.c | 35 -----------------------------------
>>  2 files changed, 37 deletions(-)
>>
>> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
>> index 93240e35ee36..cf39f57d21e9 100644
>> --- a/fs/ext4/file.c
>> +++ b/fs/ext4/file.c
>> @@ -579,8 +579,6 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>>  		iomap_ops = &ext4_iomap_overwrite_ops;
>>  	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
>>  			   dio_flags, NULL, 0);
>> -	if (ret == -ENOTBLK)
>> -		ret = 0;
>>  	if (extend) {
>>  		/*
>>  		 * We always perform extending DIO write synchronously so by
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 5b7a15db4953..c3b23c90fd11 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -3872,47 +3872,12 @@ static int ext4_iomap_overwrite_begin(struct inode *inode, loff_t offset,
>>  	return ret;
>>  }
>>  
>> -static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
>> -{
>> -	/* must be a directio to fall back to buffered */
>> -	if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) !=
>> -		    (IOMAP_WRITE | IOMAP_DIRECT))
>> -		return false;
>> -
>> -	/* atomic writes are all-or-nothing */
>> -	if (flags & IOMAP_ATOMIC)
>> -		return false;
>> -
>> -	/* can only try again if we wrote nothing */
>> -	return written == 0;
>> -}
>> -
>> -static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
>> -			  ssize_t written, unsigned flags, struct iomap *iomap)
>> -{
>> -	/*
>> -	 * Check to see whether an error occurred while writing out the data to
>> -	 * the allocated blocks. If so, return the magic error code for
>> -	 * non-atomic write so that we fallback to buffered I/O and attempt to
>> -	 * complete the remainder of the I/O.
>> -	 * For non-atomic writes, any blocks that may have been
>> -	 * allocated in preparation for the direct I/O will be reused during
>> -	 * buffered I/O. For atomic write, we never fallback to buffered-io.
>> -	 */
>> -	if (ext4_want_directio_fallback(flags, written))
>> -		return -ENOTBLK;
>> -
>> -	return 0;
>> -}
>> -
>>  const struct iomap_ops ext4_iomap_ops = {
>>  	.iomap_begin		= ext4_iomap_begin,
>> -	.iomap_end		= ext4_iomap_end,
>>  };
>>  
>>  const struct iomap_ops ext4_iomap_overwrite_ops = {
>>  	.iomap_begin		= ext4_iomap_overwrite_begin,
>> -	.iomap_end		= ext4_iomap_end,
>>  };
>>  
>>  static int ext4_iomap_begin_report(struct inode *inode, loff_t offset,
>> -- 
>> 2.43.0

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 0/8] direct-io: even more flexible io vectors
  2025-08-27 16:09           ` Mike Snitzer
@ 2025-09-01  7:55             ` Jan Kara
  2025-09-02 14:39               ` Mike Snitzer
  0 siblings, 1 reply; 42+ messages in thread
From: Jan Kara @ 2025-09-01  7:55 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jan Kara, Ritesh Harjani, Keith Busch, Keith Busch, linux-block,
	linux-fsdevel, linux-kernel, linux-ext4, axboe, dw, brauner, hch,
	martin.petersen, djwong, linux-xfs, viro, Jan Kara, Brian Foster

Hi Mike!

On Wed 27-08-25 12:09:29, Mike Snitzer wrote:
> On Wed, Aug 27, 2025 at 05:20:53PM +0200, Jan Kara wrote:
> > On Tue 26-08-25 10:29:58, Ritesh Harjani wrote:
> > > Keith Busch <kbusch@kernel.org> writes:
> > > 
> > > > On Mon, Aug 25, 2025 at 02:07:15PM +0200, Jan Kara wrote:
> > > >> On Fri 22-08-25 18:57:08, Ritesh Harjani wrote:
> > > >> > Keith Busch <kbusch@meta.com> writes:
> > > >> > >
> > > >> > >   - EXT4 falls back to buffered io for writes but not for reads.
> > > >> > 
> > > >> > ++linux-ext4 to get any historical context behind why the difference of
> > > >> > behaviour in reads v/s writes for EXT4 DIO. 
> > > >> 
> > > >> Hum, how did you test? Because in the basic testing I did (with vanilla
> > > >> kernel) I get EINVAL when doing unaligned DIO write in ext4... We should be
> > > >> falling back to buffered IO only if the underlying file itself does not
> > > >> support any kind of direct IO.
> > > >
> > > > Simple test case (dio-offset-test.c) below.
> > > >
> > > > I also ran this on vanilla kernel and got these results:
> > > >
> > > >   # mkfs.ext4 /dev/vda
> > > >   # mount /dev/vda /mnt/ext4/
> > > >   # make dio-offset-test
> > > >   # ./dio-offset-test /mnt/ext4/foobar
> > > >   write: Success
> > > >   read: Invalid argument
> > > >
> > > > I tracked the "write: Success" down to ext4's handling for the "special"
> > > > -ENOTBLK error after ext4_want_directio_fallback() returns "true".
> > > >
> > > 
> > > Right. Ext4 has fallback only for dio writes but not for DIO reads... 
> > > 
> > > buffered
> > > static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
> > > {
> > > 	/* must be a directio to fall back to buffered */
> > > 	if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) !=
> > > 		    (IOMAP_WRITE | IOMAP_DIRECT))
> > > 		return false;
> > > 
> > >     ...
> > > }
> > > 
> > > So basically the path is ext4_file_[read|write]_iter() -> iomap_dio_rw
> > >     -> iomap_dio_bio_iter() -> return -EINVAL. i.e. from...
> > > 
> > > 
> > > 	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
> > > 	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
> > > 		return -EINVAL;
> > > 
> > > EXT4 then fallsback to buffered-io only for writes, but not for reads. 
> > 
> > Right. And the fallback for writes was actually inadvertedly "added" by
> > commit bc264fea0f6f "iomap: support incremental iomap_iter advances". That
> > changed the error handling logic. Previously if iomap_dio_bio_iter()
> > returned EINVAL, it got propagated to userspace regardless of what
> > ->iomap_end() returned. After this commit if ->iomap_end() returns error
> > (which is ENOTBLK in ext4 case), it gets propagated to userspace instead of
> > the error returned by iomap_dio_bio_iter().
> > 
> > Now both the old and new behavior make some sense so I won't argue that the
> > new iomap_iter() behavior is wrong. But I think we should change ext4 back
> > to the old behavior of failing unaligned dio writes instead of them falling
> > back to buffered IO. I think something like the attached patch should do
> > the trick - it makes unaligned dio writes fail again while writes to holes
> > of indirect-block mapped files still correctly fall back to buffered IO.
> > Once fstests run completes, I'll do a proper submission...
> > 
> > 
> > 								Honza
> > -- 
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR
> 
> > From ce6da00a09647a03013c3f420c2e7ef7489c3de8 Mon Sep 17 00:00:00 2001
> > From: Jan Kara <jack@suse.cz>
> > Date: Wed, 27 Aug 2025 14:55:19 +0200
> > Subject: [PATCH] ext4: Fail unaligned direct IO write with EINVAL
> > 
> > Commit bc264fea0f6f ("iomap: support incremental iomap_iter advances")
> > changed the error handling logic in iomap_iter(). Previously any error
> > from iomap_dio_bio_iter() got propagated to userspace, after this commit
> > if ->iomap_end returns error, it gets propagated to userspace instead of
> > an error from iomap_dio_bio_iter(). This results in unaligned writes to
> > ext4 to silently fallback to buffered IO instead of erroring out.
> > 
> > Now returning ENOTBLK for DIO writes from ext4_iomap_end() seems
> > unnecessary these days. It is enough to return ENOTBLK from
> > ext4_iomap_begin() when we don't support DIO write for that particular
> > file offset (due to hole).
> 
> Any particular reason for ext4 still returning -ENOTBLK for unaligned
> DIO?

No, that is actually the bug I'm speaking about - ext4 should be returning
EINVAL for unaligned DIO as other filesystems do but after recent iomap
changes it started to return ENOTBLK.

> In my experience XFS returns -EINVAL when failing unaligned DIO (but
> maybe there are edge cases where that isn't always the case?)
> 
> Would be nice to have consistency across filesystems for what is
> returned when failing unaligned DIO.

Agreed although there are various corner cases like files which never
support direct IO - e.g. with data journalling - and thus fallback to
buffered IO happens before any alignment checks. 

> The iomap code returns -ENOTBLK as "the magic error code to fall back
> to buffered I/O".  But that seems only for page cache invalidation
> failure, _not_ for unaligned DIO.
> 
> (Anyway, __iomap_dio_rw's WRITE handling can return -ENOTBLK if page
> cache invalidation fails during DIO write. So it seems higher-level
> code, like I've added to NFS/NFSD to check for unaligned DIO failure,
> should check for both -EINVAL and -ENOTBLK).

I think the idea here is that if page cache invalidation fails we want to
fallback to buffered IO so that we don't cause cache coherency issues and
that's why ENOTBLK is returned.

> ps. ENOTBLK is actually much less easily confused with other random
> uses of EINVAL (EINVAL use is generally way too overloaded, rendering
> it a pretty unhelpful error).  But switching XFS to use ENOTBLK
> instead of EINVAL seems like disruptive interface breakage (I suppose
> same could be said for ext4 if it were to now return EINVAL for
> unaligned DIO, but ext4 flip-flopping on how it handles unaligned DIO
> prompted me to ask these questions now)

Definitely. In this particular case EINVAL for unaligned DIO is there for
ages and there likely is some userspace program somewhere that depends on
it.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 0/8] direct-io: even more flexible io vectors
  2025-08-27 19:20           ` Keith Busch
@ 2025-09-01  8:22             ` Jan Kara
  0 siblings, 0 replies; 42+ messages in thread
From: Jan Kara @ 2025-09-01  8:22 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jan Kara, Ritesh Harjani, Keith Busch, linux-block, linux-fsdevel,
	linux-kernel, linux-ext4, snitzer, axboe, dw, brauner, hch,
	martin.petersen, djwong, linux-xfs, viro, Jan Kara, Brian Foster

On Wed 27-08-25 13:20:56, Keith Busch wrote:
> On Wed, Aug 27, 2025 at 05:20:53PM +0200, Jan Kara wrote:
> > Now both the old and new behavior make some sense so I won't argue that the
> > new iomap_iter() behavior is wrong. But I think we should change ext4 back
> > to the old behavior of failing unaligned dio writes instead of them falling
> > back to buffered IO. I think something like the attached patch should do
> > the trick - it makes unaligned dio writes fail again while writes to holes
> > of indirect-block mapped files still correctly fall back to buffered IO.
> > Once fstests run completes, I'll do a proper submission...
> 
> Your suggestion looks all well and good, but I have a general question
> about fstests. I've written up some to test this series, and I have
> filesystem specific expectations for what should error or succeed. If
> you modify ext4 to fail direct-io as described, my test will have to be
> kernel version specific too. Is there a best practice in fstests for
> handling such scenarios?

Well, I'd just expect EINVAL for ext4 in the test. Certain kernel versions
(since February or so) will fail but that's just an indication you should
backport the fix if you care...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3 0/8] direct-io: even more flexible io vectors
  2025-09-01  7:55             ` Jan Kara
@ 2025-09-02 14:39               ` Mike Snitzer
  0 siblings, 0 replies; 42+ messages in thread
From: Mike Snitzer @ 2025-09-02 14:39 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ritesh Harjani, Keith Busch, Keith Busch, linux-block,
	linux-fsdevel, linux-kernel, linux-ext4, axboe, dw, brauner, hch,
	martin.petersen, djwong, linux-xfs, viro, Jan Kara, Brian Foster

On Mon, Sep 01, 2025 at 09:55:20AM +0200, Jan Kara wrote:
> Hi Mike!
> 
> On Wed 27-08-25 12:09:29, Mike Snitzer wrote:
> > On Wed, Aug 27, 2025 at 05:20:53PM +0200, Jan Kara wrote:
> > > On Tue 26-08-25 10:29:58, Ritesh Harjani wrote:
> > > > Keith Busch <kbusch@kernel.org> writes:
> > > > 
> > > > > On Mon, Aug 25, 2025 at 02:07:15PM +0200, Jan Kara wrote:
> > > > >> On Fri 22-08-25 18:57:08, Ritesh Harjani wrote:
> > > > >> > Keith Busch <kbusch@meta.com> writes:
> > > > >> > >
> > > > >> > >   - EXT4 falls back to buffered io for writes but not for reads.
> > > > >> > 
> > > > >> > ++linux-ext4 to get any historical context behind why the difference of
> > > > >> > behaviour in reads v/s writes for EXT4 DIO. 
> > > > >> 
> > > > >> Hum, how did you test? Because in the basic testing I did (with vanilla
> > > > >> kernel) I get EINVAL when doing unaligned DIO write in ext4... We should be
> > > > >> falling back to buffered IO only if the underlying file itself does not
> > > > >> support any kind of direct IO.
> > > > >
> > > > > Simple test case (dio-offset-test.c) below.
> > > > >
> > > > > I also ran this on vanilla kernel and got these results:
> > > > >
> > > > >   # mkfs.ext4 /dev/vda
> > > > >   # mount /dev/vda /mnt/ext4/
> > > > >   # make dio-offset-test
> > > > >   # ./dio-offset-test /mnt/ext4/foobar
> > > > >   write: Success
> > > > >   read: Invalid argument
> > > > >
> > > > > I tracked the "write: Success" down to ext4's handling for the "special"
> > > > > -ENOTBLK error after ext4_want_directio_fallback() returns "true".
> > > > >
> > > > 
> > > > Right. Ext4 has fallback only for dio writes but not for DIO reads... 
> > > > 
> > > > buffered
> > > > static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
> > > > {
> > > > 	/* must be a directio to fall back to buffered */
> > > > 	if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) !=
> > > > 		    (IOMAP_WRITE | IOMAP_DIRECT))
> > > > 		return false;
> > > > 
> > > >     ...
> > > > }
> > > > 
> > > > So basically the path is ext4_file_[read|write]_iter() -> iomap_dio_rw
> > > >     -> iomap_dio_bio_iter() -> return -EINVAL. i.e. from...
> > > > 
> > > > 
> > > > 	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
> > > > 	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
> > > > 		return -EINVAL;
> > > > 
> > > > EXT4 then fallsback to buffered-io only for writes, but not for reads. 
> > > 
> > > Right. And the fallback for writes was actually inadvertedly "added" by
> > > commit bc264fea0f6f "iomap: support incremental iomap_iter advances". That
> > > changed the error handling logic. Previously if iomap_dio_bio_iter()
> > > returned EINVAL, it got propagated to userspace regardless of what
> > > ->iomap_end() returned. After this commit if ->iomap_end() returns error
> > > (which is ENOTBLK in ext4 case), it gets propagated to userspace instead of
> > > the error returned by iomap_dio_bio_iter().
> > > 
> > > Now both the old and new behavior make some sense so I won't argue that the
> > > new iomap_iter() behavior is wrong. But I think we should change ext4 back
> > > to the old behavior of failing unaligned dio writes instead of them falling
> > > back to buffered IO. I think something like the attached patch should do
> > > the trick - it makes unaligned dio writes fail again while writes to holes
> > > of indirect-block mapped files still correctly fall back to buffered IO.
> > > Once fstests run completes, I'll do a proper submission...
> > > 
> > > 
> > > 								Honza
> > > -- 
> > > Jan Kara <jack@suse.com>
> > > SUSE Labs, CR
> > 
> > > From ce6da00a09647a03013c3f420c2e7ef7489c3de8 Mon Sep 17 00:00:00 2001
> > > From: Jan Kara <jack@suse.cz>
> > > Date: Wed, 27 Aug 2025 14:55:19 +0200
> > > Subject: [PATCH] ext4: Fail unaligned direct IO write with EINVAL
> > > 
> > > Commit bc264fea0f6f ("iomap: support incremental iomap_iter advances")
> > > changed the error handling logic in iomap_iter(). Previously any error
> > > from iomap_dio_bio_iter() got propagated to userspace, after this commit
> > > if ->iomap_end returns error, it gets propagated to userspace instead of
> > > an error from iomap_dio_bio_iter(). This results in unaligned writes to
> > > ext4 to silently fallback to buffered IO instead of erroring out.
> > > 
> > > Now returning ENOTBLK for DIO writes from ext4_iomap_end() seems
> > > unnecessary these days. It is enough to return ENOTBLK from
> > > ext4_iomap_begin() when we don't support DIO write for that particular
> > > file offset (due to hole).
> > 
> > Any particular reason for ext4 still returning -ENOTBLK for unaligned
> > DIO?
> 
> No, that is actually the bug I'm speaking about - ext4 should be returning
> EINVAL for unaligned DIO as other filesystems do but after recent iomap
> changes it started to return ENOTBLK.
> 
> > In my experience XFS returns -EINVAL when failing unaligned DIO (but
> > maybe there are edge cases where that isn't always the case?)
> > 
> > Would be nice to have consistency across filesystems for what is
> > returned when failing unaligned DIO.
> 
> Agreed although there are various corner cases like files which never
> support direct IO - e.g. with data journalling - and thus fallback to
> buffered IO happens before any alignment checks. 
> 
> > The iomap code returns -ENOTBLK as "the magic error code to fall back
> > to buffered I/O".  But that seems only for page cache invalidation
> > failure, _not_ for unaligned DIO.
> > 
> > (Anyway, __iomap_dio_rw's WRITE handling can return -ENOTBLK if page
> > cache invalidation fails during DIO write. So it seems higher-level
> > code, like I've added to NFS/NFSD to check for unaligned DIO failure,
> > should check for both -EINVAL and -ENOTBLK).
> 
> I think the idea here is that if page cache invalidation fails we want to
> fallback to buffered IO so that we don't cause cache coherency issues and
> that's why ENOTBLK is returned.
> 
> > ps. ENOTBLK is actually much less easily confused with other random
> > uses of EINVAL (EINVAL use is generally way too overloaded, rendering
> > it a pretty unhelpful error).  But switching XFS to use ENOTBLK
> > instead of EINVAL seems like disruptive interface breakage (I suppose
> > same could be said for ext4 if it were to now return EINVAL for
> > unaligned DIO, but ext4 flip-flopping on how it handles unaligned DIO
> > prompted me to ask these questions now)
> 
> Definitely. In this particular case EINVAL for unaligned DIO is there for
> ages and there likely is some userspace program somewhere that depends on
> it.

Thanks for your reply, that all makes sense.

Mike

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2025-09-02 14:39 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-19 16:49 [PATCHv3 0/8] direct-io: even more flexible io vectors Keith Busch
2025-08-19 16:49 ` [PATCHv3 1/8] block: check for valid bio while splitting Keith Busch
2025-08-20  7:02   ` Damien Le Moal
2025-08-20 14:25     ` Keith Busch
2025-08-20  7:04   ` Damien Le Moal
2025-08-25  7:35   ` Christoph Hellwig
2025-08-19 16:49 ` [PATCHv3 2/8] block: add size alignment to bio_iov_iter_get_pages Keith Busch
2025-08-25  7:36   ` Christoph Hellwig
2025-08-19 16:49 ` [PATCHv3 3/8] block: align the bio after building it Keith Busch
2025-08-20  7:07   ` Damien Le Moal
2025-08-25  7:46   ` Christoph Hellwig
2025-08-25 13:57     ` Keith Busch
2025-08-25  7:47   ` Christoph Hellwig
2025-08-26  0:37     ` Keith Busch
2025-08-26  8:02       ` Christoph Hellwig
2025-08-26 23:11         ` Keith Busch
2025-08-19 16:49 ` [PATCHv3 4/8] block: simplify direct io validity check Keith Busch
2025-08-25  7:48   ` Christoph Hellwig
2025-08-19 16:49 ` [PATCHv3 5/8] iomap: " Keith Busch
2025-08-25  7:48   ` Christoph Hellwig
2025-08-19 16:49 ` [PATCHv3 6/8] block: remove bdev_iter_is_aligned Keith Busch
2025-08-25  7:48   ` Christoph Hellwig
2025-08-19 16:49 ` [PATCHv3 7/8] blk-integrity: use simpler alignment check Keith Busch
2025-08-25  7:49   ` Christoph Hellwig
2025-08-19 16:49 ` [PATCHv3 8/8] iov_iter: remove iov_iter_is_aligned Keith Busch
2025-08-25  7:50   ` Christoph Hellwig
2025-08-19 23:36 ` [PATCHv3 0/8] direct-io: even more flexible io vectors Mike Snitzer
2025-08-20  1:52 ` Song Chen
2025-08-22 13:27 ` Ritesh Harjani
2025-08-22 14:30   ` Keith Busch
2025-08-25 12:07   ` Jan Kara
2025-08-25 14:53     ` Keith Busch
2025-08-26  4:59       ` Ritesh Harjani
2025-08-27 15:20         ` Jan Kara
2025-08-27 16:09           ` Mike Snitzer
2025-09-01  7:55             ` Jan Kara
2025-09-02 14:39               ` Mike Snitzer
2025-08-27 17:52           ` Brian Foster
2025-08-27 19:20           ` Keith Busch
2025-09-01  8:22             ` Jan Kara
2025-08-29  2:11           ` Ritesh Harjani
2025-08-29  3:19             ` Ritesh Harjani

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).