[PATCHv2 0/6] direct-io: validate user space vectors during extraction

* [PATCHv2 0/6] direct-io: validate user space vectors during extraction
@ 2026-06-22 17:42 Keith Busch
  2026-06-22 17:42 ` [PATCHv2 1/6] block: introduce bio_endio_errno helper Keith Busch
                   ` (5 more replies)
  0 siblings, 6 replies; 17+ messages in thread
From: Keith Busch @ 2026-06-22 17:42 UTC (permalink / raw)
  To: linux-block, linux-fsdevel
  Cc: dm-devel, hch, axboe, brauner, djwong, viro, Keith Busch

From: Keith Busch <kbusch@kernel.org>

This addresses the misaligned direct-io problem behind various threads:

 https://lore.kernel.org/linux-xfs/20260610145218.141369-1-cem@kernel.org/
 https://lore.kernel.org/all/CAC_j7i1R7oy+nRhxEjCTba=DUgn02w9X+p94DCu0aHv5+5tKnQ@mail.gmail.com/
 https://lore.kernel.org/linux-block/ai7rnH20IYeSmY8s@gallifrey/
 https://lore.kernel.org/linux-block/20260616154009.2123183-1-kbusch@meta.com/

The previously tested fixes are correct as far as they go, but they
treat the symptom: they only matter because an invalid bio reaches those
drivers in the first place.

The reason it reaches them is an assumption I made when I removed
direct-io alignment checks in 5ff3f74e145a ("block: simplify direct io
validity check") and 7eac331869575 ("iomap: simplify direct io validity
check"): every bio is eventually split to the device limits, and the
upper layers cope with resulting errors once the bio has been formed. Both
were optimistic assumptions. Drivers with their own ->submit_bio may
never pass through blk_mq_submit_bio()'s split, so the check never runs
for them, and as numerous threads showed, the consumers don't uniformly
handle this condition.

This series stops the invalid bio at the source instead. It validates the
buffer's alignment against the device's dma_alignment when the bio is built
from the iov_iter. The check is folded into the bvec extraction that
already walks the vectors, so it adds only a comparison on a path that
is pinning direct-io pages anyway. Misalignment is now uniformly
rejected with EINVAL before submission for every direct-io path.
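
As a rough user-space illustration of the kind of check described above
(not code from this series; the function name is made up for the example,
and the mask is assumed to be a power of two minus one, as with
dma_alignment), each vector's address and length can be tested against
the mask in the same pass that walks the vectors:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/*
 * Illustrative model only: returns 0 if every segment's address and
 * length are aligned to (mem_align_mask + 1), -EINVAL otherwise.
 * A mask of 0 means "no alignment requirement".
 */
static int check_vectors_aligned(const struct iovec *iov, size_t nr,
				 unsigned int mem_align_mask)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		uintptr_t addr = (uintptr_t)iov[i].iov_base;

		/* OR-ing address and length tests both with one mask */
		if ((addr | iov[i].iov_len) & mem_align_mask)
			return -EINVAL;
	}
	return 0;
}
```

The single OR-and-mask per vector is what keeps the cost down to one
comparison on a path that already visits every vector.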

With this in place, the dm side changes under discussion are no longer
required to fix the bugs: the affected targets simply never see the
invalid bio. The tested patches remain reasonable as defense-in-depth if
desired, but they are not strictly necessary after this.

v1->v2:

 I've included some prep patches that fix other issues in this path.

 Renamed the alignment to "mem_align_mask", re-ordered the function
 parameters so it appears before the length alignment, and added the
 appropriate kerneldoc.

 Added additional comments to explain the rationale behind the checks.

 For DEBUG kernels, a bio_vec iterator is checked in its entirety. The
 existing use cases appear to only need the first vector to be checked,
 so the more expensive exhaustive check is only happening for the debug
 kernels.
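
To make the difference between the two bvec checks concrete, here is a
small stand-alone model (the struct and function names are hypothetical
stand-ins, not the kernel's bio_vec API): the cheap variant looks only at
the first segment's offset, while the exhaustive variant tests the offset
and length of every segment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a bio_vec: offset into a page plus length. */
struct seg {
	unsigned int offset;
	unsigned int len;
};

/* Cheap check: only the first segment's offset is inspected. */
static bool first_seg_aligned(const struct seg *segs, size_t nr,
			      unsigned int mask)
{
	return nr == 0 || !(segs[0].offset & mask);
}

/* Exhaustive check: every segment's offset and length is inspected. */
static bool all_segs_aligned(const struct seg *segs, size_t nr,
			     unsigned int mask)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if ((segs[i].offset | segs[i].len) & mask)
			return false;
	return true;
}
```

A misaligned second segment passes the cheap check but fails the
exhaustive one, which is the case the debug-only variant is meant to
catch.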

Keith Busch (6):
  block: introduce bio_endio_errno helper
  block: report the actual status
  block: fix dio leak on metadata mapping error
  loop: set dma_alignment from the backing file for direct I/O
  zloop: set dma_alignment from the backing files for direct I/O
  block: validate user space vectors during extraction

 block/bio.c            | 50 +++++++++++++++++++++++++++++++++++++++---
 block/blk-map.c        |  2 +-
 block/blk-merge.c      |  4 ++--
 block/fops.c           |  9 +++++---
 drivers/block/loop.c   | 50 +++++++++++++++++++++++++++++++++++-------
 drivers/block/zloop.c  | 22 +++++++++++++++++--
 fs/iomap/direct-io.c   |  1 +
 include/linux/bio.h    |  2 +-
 include/linux/blkdev.h |  5 +++++
 include/linux/uio.h    |  3 ++-
 lib/iov_iter.c         |  9 +++++++-
 11 files changed, 135 insertions(+), 22 deletions(-)

-- 
2.52.0


* [PATCHv2 1/6] block: introduce bio_endio_errno helper
  2026-06-22 17:42 [PATCHv2 0/6] direct-io: validate user space vectors during extraction Keith Busch
@ 2026-06-22 17:42 ` Keith Busch
  2026-06-23 14:54   ` Christoph Hellwig
  2026-06-22 17:42 ` [PATCHv2 2/6] block: report the actual status Keith Busch
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2026-06-22 17:42 UTC (permalink / raw)
  To: linux-block, linux-fsdevel
  Cc: dm-devel, hch, axboe, brauner, djwong, viro, Keith Busch

From: Keith Busch <kbusch@kernel.org>

No functional change; purely introducing a convenience function.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/blk-merge.c      | 4 ++--
 include/linux/blkdev.h | 5 +++++
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index ab1161ca69f1e..c93170f340977 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -122,7 +122,7 @@ struct bio *bio_submit_split_bioset(struct bio *bio, unsigned int split_sectors,
 	struct bio *split = bio_split(bio, split_sectors, GFP_NOIO, bs);
 
 	if (IS_ERR(split)) {
-		bio_endio_status(bio, errno_to_blk_status(PTR_ERR(split)));
+		bio_endio_errno(bio, PTR_ERR(split));
 		return NULL;
 	}
 
@@ -142,7 +142,7 @@ EXPORT_SYMBOL_GPL(bio_submit_split_bioset);
 static struct bio *bio_submit_split(struct bio *bio, int split_sectors)
 {
 	if (unlikely(split_sectors < 0)) {
-		bio_endio_status(bio, errno_to_blk_status(split_sectors));
+		bio_endio_errno(bio, split_sectors);
 		return NULL;
 	}
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9213a5716f95a..88e4bd88c3e28 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1047,6 +1047,11 @@ extern const char *blk_op_str(enum req_op op);
 int blk_status_to_errno(blk_status_t status);
 blk_status_t errno_to_blk_status(int errno);
 
+static inline void bio_endio_errno(struct bio *bio, int errno)
+{
+	bio_endio_status(bio, errno_to_blk_status(errno));
+}
+
 /* only poll the hardware once, don't continue until a completion was found */
 #define BLK_POLL_ONESHOT		(1 << 0)
 int bio_poll(struct bio *bio, struct io_comp_batch *iob, unsigned int flags);
-- 
2.52.0



* Re: [PATCHv2 1/6] block: introduce bio_endio_errno helper
  2026-06-22 17:42 ` [PATCHv2 1/6] block: introduce bio_endio_errno helper Keith Busch
@ 2026-06-23 14:54   ` Christoph Hellwig
  2026-06-23 15:05     ` Keith Busch
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2026-06-23 14:54 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-fsdevel, dm-devel, hch, axboe, brauner, djwong,
	viro, Keith Busch

On Mon, Jun 22, 2026 at 10:42:36AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> No functional change; purely introducing a convenience function.

I've been deeply into untangling the 1:1 BLK_STS_ mapping to errnos,
as propagating them up that way often causes more issues than it
solves.  So if we can avoid it, I'd rather not add more helpers to
facilitate that (even if the helpers are just the messenger and not
the cause of the problem).



* Re: [PATCHv2 1/6] block: introduce bio_endio_errno helper
  2026-06-23 14:54   ` Christoph Hellwig
@ 2026-06-23 15:05     ` Keith Busch
  2026-06-23 15:07       ` Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2026-06-23 15:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, linux-block, linux-fsdevel, dm-devel, axboe, brauner,
	djwong, viro

On Tue, Jun 23, 2026 at 04:54:31PM +0200, Christoph Hellwig wrote:
> On Mon, Jun 22, 2026 at 10:42:36AM -0700, Keith Busch wrote:
> > From: Keith Busch <kbusch@kernel.org>
> > 
> > No functional change; purely introducing a convenience function.
> 
> I've been deeply into untangling the 1:1 BLK_STS_ mapping to errnos,
> as propagating them up that way often causes more issues than it
> solves.  So if we can avoid it, I'd rather not add more helpers to
> facilitate that (even if the helpers are just the messenger and not
> the cause of the problem).

Sure, that's fine. I'm not sure what you have in mind for untangling the
errno:blk_status_t mappings, but I can certainly have the new users this
series introduces open code it like the existing users if that's
alright.


* Re: [PATCHv2 1/6] block: introduce bio_endio_errno helper
  2026-06-23 15:05     ` Keith Busch
@ 2026-06-23 15:07       ` Christoph Hellwig
  0 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2026-06-23 15:07 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Keith Busch, linux-block, linux-fsdevel,
	dm-devel, axboe, brauner, djwong, viro

On Tue, Jun 23, 2026 at 09:05:49AM -0600, Keith Busch wrote:
> On Tue, Jun 23, 2026 at 04:54:31PM +0200, Christoph Hellwig wrote:
> > On Mon, Jun 22, 2026 at 10:42:36AM -0700, Keith Busch wrote:
> > > From: Keith Busch <kbusch@kernel.org>
> > > 
> > > No functional change; purely introducing a convenience function.
> > 
> > I've been deeply into untangling the 1:1 BLK_STS_ mapping to errnos,
> > as propagating them up that way often causes more issues than it
> > solves.  So if we can avoid it, I'd rather not add more helpers to
> > facilitate that (even if the helpers are just the messenger and not
> > the cause of the problem).
> 
> Sure, that's fine. I'm not sure what you have in mind for untangling the
> errno:blk_status_t mappings, but I can certainly have the new users this
> series introduces open code it like the existing users if that's
> alright.

I've tried a few things and banged my head against the wall, so I'm
not entirely sure yet either..


* [PATCHv2 2/6] block: report the actual status
  2026-06-22 17:42 [PATCHv2 0/6] direct-io: validate user space vectors during extraction Keith Busch
  2026-06-22 17:42 ` [PATCHv2 1/6] block: introduce bio_endio_errno helper Keith Busch
@ 2026-06-22 17:42 ` Keith Busch
  2026-06-23 14:55   ` Christoph Hellwig
  2026-06-22 17:42 ` [PATCHv2 3/6] block: fix dio leak on metadata mapping error Keith Busch
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2026-06-22 17:42 UTC (permalink / raw)
  To: linux-block, linux-fsdevel
  Cc: dm-devel, hch, axboe, brauner, djwong, viro, Keith Busch

From: Keith Busch <kbusch@kernel.org>

Rather than assume EIO, set the actual reported status for user space
informational purposes.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/fops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/fops.c b/block/fops.c
index 15783a6180dec..f237d6cab8975 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -218,7 +218,7 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 
 		ret = blkdev_iov_iter_get_pages(bio, iter, bdev);
 		if (unlikely(ret)) {
-			bio_endio_status(bio, BLK_STS_IOERR);
+			bio_endio_errno(bio, ret);
 			break;
 		}
 		if (iocb->ki_flags & IOCB_NOWAIT) {
-- 
2.52.0



* Re: [PATCHv2 2/6] block: report the actual status
  2026-06-22 17:42 ` [PATCHv2 2/6] block: report the actual status Keith Busch
@ 2026-06-23 14:55   ` Christoph Hellwig
  2026-06-23 14:59     ` Keith Busch
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2026-06-23 14:55 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-fsdevel, dm-devel, hch, axboe, brauner, djwong,
	viro, Keith Busch

On Mon, Jun 22, 2026 at 10:42:37AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> Rather than assume EIO, set the actual reported status for user space
> informational purposes.

Where "informational purposes" primarily means not dropping the EINVAL
for incorrect alignment, right?  Maybe state that more clearly..



* Re: [PATCHv2 2/6] block: report the actual status
  2026-06-23 14:55   ` Christoph Hellwig
@ 2026-06-23 14:59     ` Keith Busch
  0 siblings, 0 replies; 17+ messages in thread
From: Keith Busch @ 2026-06-23 14:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, linux-block, linux-fsdevel, dm-devel, axboe, brauner,
	djwong, viro

On Tue, Jun 23, 2026 at 04:55:11PM +0200, Christoph Hellwig wrote:
> On Mon, Jun 22, 2026 at 10:42:37AM -0700, Keith Busch wrote:
> > From: Keith Busch <kbusch@kernel.org>
> > 
> > Rather than assume EIO, set the actual reported status for user space
> > informational purposes.
> 
> Where "informational purposes" primarily means not dropping the EINVAL
> for incorrect alignment, right?  

It could be any possible error, but for the practical purposes of this
series, yes, EINVAL is the status I need forwarded. But EFAULT was also
always a real possibility that wouldn't have been reported.


* [PATCHv2 3/6] block: fix dio leak on metadata mapping error
  2026-06-22 17:42 [PATCHv2 0/6] direct-io: validate user space vectors during extraction Keith Busch
  2026-06-22 17:42 ` [PATCHv2 1/6] block: introduce bio_endio_errno helper Keith Busch
  2026-06-22 17:42 ` [PATCHv2 2/6] block: report the actual status Keith Busch
@ 2026-06-22 17:42 ` Keith Busch
  2026-06-23 15:01   ` Christoph Hellwig
  2026-06-22 17:42 ` [PATCHv2 4/6] loop: set dma_alignment from the backing file for direct I/O Keith Busch
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2026-06-22 17:42 UTC (permalink / raw)
  To: linux-block, linux-fsdevel
  Cc: dm-devel, hch, axboe, brauner, djwong, viro, Keith Busch

From: Keith Busch <kbusch@kernel.org>

A failed integrity mapping holds a dio reference, so we need to go
through the full bio ending in case there were previously submitted
bios in the sequence.

Fixes: 2729a60bbfb92 ("block: don't silently ignore metadata for sync read/write")
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/fops.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index f237d6cab8975..b5c320da28123 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -238,8 +238,10 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 		}
 		if (iocb->ki_flags & IOCB_HAS_METADATA) {
 			ret = bio_integrity_map_iter(bio, iocb->private);
-			if (unlikely(ret))
-				goto fail;
+			if (unlikely(ret)) {
+				bio_endio_errno(bio, ret);
+				break;
+			}
 		}
 
 		if (is_read) {
-- 
2.52.0



* Re: [PATCHv2 3/6] block: fix dio leak on metadata mapping error
  2026-06-22 17:42 ` [PATCHv2 3/6] block: fix dio leak on metadata mapping error Keith Busch
@ 2026-06-23 15:01   ` Christoph Hellwig
  0 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2026-06-23 15:01 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-fsdevel, dm-devel, hch, axboe, brauner, djwong,
	viro, Keith Busch

On Mon, Jun 22, 2026 at 10:42:38AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> A failed integrity mapping holds a dio reference, so we need to go
> through the full bio ending in case there were previously submitted
> bios in the sequence.

Yeah, the goto fail is for sure wrong here.  I have a vague memory
of seeing the same or at least a very similar patch from others before,
but right now I'm too overloaded to find out if that really was the
case.

Reviewed-by: Christoph Hellwig <hch@lst.de>



* [PATCHv2 4/6] loop: set dma_alignment from the backing file for direct I/O
  2026-06-22 17:42 [PATCHv2 0/6] direct-io: validate user space vectors during extraction Keith Busch
                   ` (2 preceding siblings ...)
  2026-06-22 17:42 ` [PATCHv2 3/6] block: fix dio leak on metadata mapping error Keith Busch
@ 2026-06-22 17:42 ` Keith Busch
  2026-06-23 15:04   ` Christoph Hellwig
  2026-06-22 17:42 ` [PATCHv2 5/6] zloop: set dma_alignment from the backing files " Keith Busch
  2026-06-22 17:42 ` [PATCHv2 6/6] block: validate user space vectors during extraction Keith Busch
  5 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2026-06-22 17:42 UTC (permalink / raw)
  To: linux-block, linux-fsdevel
  Cc: dm-devel, hch, axboe, brauner, djwong, viro, Keith Busch

From: Keith Busch <kbusch@kernel.org>

Direct I/O user pages are forwarded to the backing file unchanged, so
the backing's DMA alignment requirement applies to them. Track the
backing's dio_mem_align and advertise it as the loop device's
dma_alignment so the limits are reported correctly and misaligned I/O is
rejected here instead of being dispatched to the backend.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/block/loop.c | 50 +++++++++++++++++++++++++++++++++++++-------
 1 file changed, 42 insertions(+), 8 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 310de0463beb1..7114f80ab162a 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -54,6 +54,7 @@ struct loop_device {
 
 	struct file	*lo_backing_file;
 	unsigned int	lo_min_dio_size;
+	unsigned int	lo_dio_mem_align;
 	struct block_device *lo_device;
 
 	gfp_t		old_gfp_mask;
@@ -447,26 +448,37 @@ static void loop_reread_partitions(struct loop_device *lo)
 			__func__, lo->lo_number, lo->lo_file_name, rc);
 }
 
-static unsigned int loop_query_min_dio_size(struct loop_device *lo)
+static void loop_update_dio_alignment(struct loop_device *lo)
 {
 	struct file *file = lo->lo_backing_file;
 	struct block_device *sb_bdev = file->f_mapping->host->i_sb->s_bdev;
 	struct kstat st;
 
 	/*
-	 * Use the minimal dio alignment of the file system if provided.
+	 * Use the dio alignment of the file system if provided.  dio_offset_align
+	 * is the minimum dio size and offset; dio_mem_align is the buffer memory
+	 * alignment, kept as a mask to become the loop device's dma_alignment in
+	 * direct I/O mode where the buffer is handed to the backing file unchanged.
 	 */
 	if (!vfs_getattr(&file->f_path, &st, STATX_DIOALIGN, 0) &&
-	    (st.result_mask & STATX_DIOALIGN))
-		return st.dio_offset_align;
+	    (st.result_mask & STATX_DIOALIGN)) {
+		lo->lo_min_dio_size = st.dio_offset_align;
+		lo->lo_dio_mem_align = st.dio_mem_align - 1;
+		return;
+	}
 
 	/*
 	 * In a perfect world this wouldn't be needed, but as of Linux 6.13 only
 	 * a handful of file systems support the STATX_DIOALIGN flag.
 	 */
-	if (sb_bdev)
-		return bdev_logical_block_size(sb_bdev);
-	return SECTOR_SIZE;
+	if (sb_bdev) {
+		lo->lo_min_dio_size = bdev_logical_block_size(sb_bdev);
+		lo->lo_dio_mem_align = bdev_dma_alignment(sb_bdev);
+		return;
+	}
+
+	lo->lo_min_dio_size = SECTOR_SIZE;
+	lo->lo_dio_mem_align = SECTOR_SIZE - 1;
 }
 
 static inline int is_loop_device(struct file *file)
@@ -509,7 +521,7 @@ static void loop_assign_backing_file(struct loop_device *lo, struct file *file)
 			lo->old_gfp_mask & ~(__GFP_IO | __GFP_FS));
 	if (lo->lo_backing_file->f_flags & O_DIRECT)
 		lo->lo_flags |= LO_FLAGS_DIRECT_IO;
-	lo->lo_min_dio_size = loop_query_min_dio_size(lo);
+	loop_update_dio_alignment(lo);
 }
 
 static int loop_check_backing_file(struct file *file)
@@ -961,6 +973,17 @@ static void loop_update_limits(struct loop_device *lo, struct queue_limits *lim,
 	lim->logical_block_size = bsize;
 	lim->physical_block_size = bsize;
 	lim->io_min = bsize;
+	/*
+	 * In direct I/O the user pages are handed to the backing file as-is, so
+	 * the backing's DMA alignment requirement applies to them.  Advertise it
+	 * so misaligned I/O is rejected at this device's entry instead of being
+	 * dispatched to the backend.  Buffered I/O copies through the page cache
+	 * and imposes no such requirement.
+	 */
+	if (lo->lo_flags & LO_FLAGS_DIRECT_IO)
+		lim->dma_alignment = lo->lo_dio_mem_align;
+	else
+		lim->dma_alignment = SECTOR_SIZE - 1;
 	lim->features &= ~(BLK_FEAT_WRITE_CACHE | BLK_FEAT_ROTATIONAL);
 	if (file->f_op->fsync && !(lo->lo_flags & LO_FLAGS_READ_ONLY))
 		lim->features |= BLK_FEAT_WRITE_CACHE;
@@ -1416,6 +1439,7 @@ static int loop_set_dio(struct loop_device *lo, unsigned long arg)
 {
 	bool use_dio = !!arg;
 	unsigned int memflags;
+	struct queue_limits lim;
 
 	if (lo->lo_state != Lo_bound)
 		return -ENXIO;
@@ -1434,6 +1458,16 @@ static int loop_set_dio(struct loop_device *lo, unsigned long arg)
 		lo->lo_flags |= LO_FLAGS_DIRECT_IO;
 	else
 		lo->lo_flags &= ~LO_FLAGS_DIRECT_IO;
+	/*
+	 * Direct I/O forwards the user pages to the backing file unchanged, so
+	 * track the backing's DMA alignment requirement as the mode is toggled.
+	 */
+	lim = queue_limits_start_update(lo->lo_queue);
+	if (lo->lo_flags & LO_FLAGS_DIRECT_IO)
+		lim.dma_alignment = lo->lo_dio_mem_align;
+	else
+		lim.dma_alignment = SECTOR_SIZE - 1;
+	queue_limits_commit_update(lo->lo_queue, &lim);
 	blk_mq_unfreeze_queue(lo->lo_queue, memflags);
 	return 0;
 }
-- 
2.52.0



* Re: [PATCHv2 4/6] loop: set dma_alignment from the backing file for direct I/O
  2026-06-22 17:42 ` [PATCHv2 4/6] loop: set dma_alignment from the backing file for direct I/O Keith Busch
@ 2026-06-23 15:04   ` Christoph Hellwig
  0 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2026-06-23 15:04 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-fsdevel, dm-devel, hch, axboe, brauner, djwong,
	viro, Keith Busch

On Mon, Jun 22, 2026 at 10:42:39AM -0700, Keith Busch wrote:
>  	/*
> -	 * Use the minimal dio alignment of the file system if provided.
> +	 * Use the dio alignment of the file system if provided.  dio_offset_align
> +	 * is the minimum dio size and offset; dio_mem_align is the buffer memory
> +	 * alignment, kept as a mask to become the loop device's dma_alignment in
> +	 * direct I/O mode where the buffer is handed to the backing file unchanged.

A bunch of overly long lines here.

> +	 * In direct I/O the user pages are handed to the backing file as-is, so
> +	 * the backing's DMA alignment requirement applies to them.  Advertise it
> +	 * so misaligned I/O is rejected at this device's entry instead of being
> +	 * dispatched to the backend.  Buffered I/O copies through the page cache
> +	 * and imposes no such requirement.
> +	 */

More line spillover here.

> +	if (lo->lo_flags & LO_FLAGS_DIRECT_IO)
> +		lim->dma_alignment = lo->lo_dio_mem_align;
> +	else
> +		lim->dma_alignment = SECTOR_SIZE - 1;

Despite the comment above this does enforce a SECTOR_SIZE dma
alignment for buffered I/O.  Shouldn't this be our lowest supported
value (or dword alignment to match real devices)?

> +	lim = queue_limits_start_update(lo->lo_queue);
> +	if (lo->lo_flags & LO_FLAGS_DIRECT_IO)
> +		lim.dma_alignment = lo->lo_dio_mem_align;
> +	else
> +		lim.dma_alignment = SECTOR_SIZE - 1;

Should this and the above copy of this assignment be factored into a
helper?
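
A sketch of what such a helper could look like, modeled in plain C with
stand-in types (the struct, flag, and function names below are
hypothetical and not taken from the patch or from loop.c):

```c
#include <assert.h>

#define MODEL_SECTOR_SIZE	512u
#define MODEL_FLAG_DIRECT_IO	(1u << 0)	/* stand-in for LO_FLAGS_DIRECT_IO */

/* Hypothetical stand-in for the relevant loop_device fields. */
struct model_loop {
	unsigned int flags;
	unsigned int dio_mem_align;	/* stored as a mask, as in the patch */
};

/*
 * Single place that decides the dma_alignment mask, so the limit setup
 * and the dio toggle path cannot drift apart.
 */
static unsigned int model_loop_dma_alignment(const struct model_loop *lo)
{
	if (lo->flags & MODEL_FLAG_DIRECT_IO)
		return lo->dio_mem_align;
	return MODEL_SECTOR_SIZE - 1;
}
```

Both call sites would then reduce to a single assignment of the helper's
return value to dma_alignment. The buffered-mode fallback here simply
mirrors the value used in the patch; whether it should be smaller is the
separate question raised above.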



* [PATCHv2 5/6] zloop: set dma_alignment from the backing files for direct I/O
  2026-06-22 17:42 [PATCHv2 0/6] direct-io: validate user space vectors during extraction Keith Busch
                   ` (3 preceding siblings ...)
  2026-06-22 17:42 ` [PATCHv2 4/6] loop: set dma_alignment from the backing file for direct I/O Keith Busch
@ 2026-06-22 17:42 ` Keith Busch
  2026-06-23 15:06   ` Christoph Hellwig
  2026-06-22 17:42 ` [PATCHv2 6/6] block: validate user space vectors during extraction Keith Busch
  5 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2026-06-22 17:42 UTC (permalink / raw)
  To: linux-block, linux-fsdevel
  Cc: dm-devel, hch, axboe, brauner, djwong, viro, Keith Busch

From: Keith Busch <kbusch@kernel.org>

Direct I/O requests use pages handed to the backing files unchanged, so
the backing's DMA alignment requirement applies. Track dio_mem_align and
advertise it as the device's dma_alignment so we communicate proper
limits and misaligned I/O is rejected here instead of reaching the
backend.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/block/zloop.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/block/zloop.c b/drivers/block/zloop.c
index 55eeb6aac0ea3..1149b817b5bc9 100644
--- a/drivers/block/zloop.c
+++ b/drivers/block/zloop.c
@@ -144,6 +144,7 @@ struct zloop_device {
 	unsigned int		nr_conv_zones;
 	unsigned int		max_open_zones;
 	unsigned int		block_size;
+	unsigned int		dio_mem_align;
 
 	spinlock_t		open_zones_lock;
 	struct list_head	open_zones_lru_list;
@@ -1035,6 +1036,9 @@ static int zloop_get_block_size(struct zloop_device *zlo,
 {
 	struct block_device *sb_bdev = zone->file->f_mapping->host->i_sb->s_bdev;
 	struct kstat st;
+	bool have_dioalign = !vfs_getattr(&zone->file->f_path, &st,
+					  STATX_DIOALIGN, 0) &&
+			     (st.result_mask & STATX_DIOALIGN);
 
 	/*
 	 * If the FS block size is lower than or equal to 4K, use that as the
@@ -1044,14 +1048,25 @@ static int zloop_get_block_size(struct zloop_device *zlo,
 	 */
 	if (file_inode(zone->file)->i_sb->s_blocksize <= SZ_4K)
 		zlo->block_size = file_inode(zone->file)->i_sb->s_blocksize;
-	else if (!vfs_getattr(&zone->file->f_path, &st, STATX_DIOALIGN, 0) &&
-		 (st.result_mask & STATX_DIOALIGN))
+	else if (have_dioalign)
 		zlo->block_size = st.dio_offset_align;
 	else if (sb_bdev)
 		zlo->block_size = bdev_physical_block_size(sb_bdev);
 	else
 		zlo->block_size = SECTOR_SIZE;
 
+	/*
+	 * In direct I/O the request's pages are handed to the backing files
+	 * unchanged, so track their required memory alignment as a mask for
+	 * dma_alignment.
+	 */
+	if (have_dioalign)
+		zlo->dio_mem_align = st.dio_mem_align - 1;
+	else if (sb_bdev)
+		zlo->dio_mem_align = bdev_dma_alignment(sb_bdev);
+	else
+		zlo->dio_mem_align = SECTOR_SIZE - 1;
+
 	if (zlo->zone_capacity & ((zlo->block_size >> SECTOR_SHIFT) - 1)) {
 		pr_err("Zone capacity is not aligned to block size %u\n",
 		       zlo->block_size);
@@ -1279,6 +1294,9 @@ static int zloop_ctl_add(struct zloop_options *opts)
 
 	lim.physical_block_size = zlo->block_size;
 	lim.logical_block_size = zlo->block_size;
+	/* Direct I/O hands the request's pages to the backing files unchanged. */
+	if (!opts->buffered_io)
+		lim.dma_alignment = zlo->dio_mem_align;
 	if (zlo->zone_append)
 		lim.max_hw_zone_append_sectors = lim.max_hw_sectors;
 	lim.max_open_zones = zlo->max_open_zones;
-- 
2.52.0



* Re: [PATCHv2 5/6] zloop: set dma_alignment from the backing files for direct I/O
  2026-06-22 17:42 ` [PATCHv2 5/6] zloop: set dma_alignment from the backing files " Keith Busch
@ 2026-06-23 15:06   ` Christoph Hellwig
  0 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2026-06-23 15:06 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-fsdevel, dm-devel, hch, axboe, brauner, djwong,
	viro, Keith Busch

On Mon, Jun 22, 2026 at 10:42:40AM -0700, Keith Busch wrote:
>  {
>  	struct block_device *sb_bdev = zone->file->f_mapping->host->i_sb->s_bdev;
>  	struct kstat st;
> +	bool have_dioalign = !vfs_getattr(&zone->file->f_path, &st,
> +					  STATX_DIOALIGN, 0) &&
> +			     (st.result_mask & STATX_DIOALIGN);

This is getting a bit crazy for an assignment :)

Maybe refactor this along the lines of the loop.c code?

> +	/* Direct I/O hands the request's pages to the backing files unchanged. */

Overly long line.



* [PATCHv2 6/6] block: validate user space vectors during extraction
  2026-06-22 17:42 [PATCHv2 0/6] direct-io: validate user space vectors during extraction Keith Busch
                   ` (4 preceding siblings ...)
  2026-06-22 17:42 ` [PATCHv2 5/6] zloop: set dma_alignment from the backing files " Keith Busch
@ 2026-06-22 17:42 ` Keith Busch
  2026-06-23 15:10   ` Christoph Hellwig
  5 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2026-06-22 17:42 UTC (permalink / raw)
  To: linux-block, linux-fsdevel
  Cc: dm-devel, hch, axboe, brauner, djwong, viro, Keith Busch, stable

From: Keith Busch <kbusch@kernel.org>

Bio-based drivers don't necessarily pass through the split path that
checks alignment, and stacking block drivers don't always handle a
misalignment detected after the bio is submitted. Validate user vectors
against the device's dma_alignment as the bio is built from the
iov_iter, rejecting misaligned vectors early with -EINVAL.

Cc: stable@vger.kernel.org
Fixes: 5ff3f74e145a ("block: simplify direct io validity check")
Fixes: 7eac33186957 ("iomap: simplify direct io validity check")
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/bio.c          | 50 +++++++++++++++++++++++++++++++++++++++++---
 block/blk-map.c      |  2 +-
 block/fops.c         |  1 +
 fs/iomap/direct-io.c |  1 +
 include/linux/bio.h  |  2 +-
 include/linux/uio.h  |  3 ++-
 lib/iov_iter.c       |  9 +++++++-
 7 files changed, 61 insertions(+), 7 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index f2a5f4d0a9672..4360149d4eba2 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1220,10 +1220,39 @@ static int bio_iov_iter_align_down(struct bio *bio, struct iov_iter *iter,
 	return 0;
 }
 
+#ifdef CONFIG_DEBUG_KERNEL
+static inline bool bio_iov_bvec_aligned(const struct bio *bio,
+					unsigned mem_align_mask)
+{
+	struct bvec_iter iter;
+	struct bio_vec bv;
+
+	for_each_mp_bvec(bv, bio->bi_io_vec, iter, bio->bi_iter)
+		if ((bv.bv_offset | bv.bv_len) & mem_align_mask)
+			return false;
+	return true;
+}
+#else
+static inline bool bio_iov_bvec_aligned(const struct bio *bio,
+					unsigned mem_align_mask)
+{
+	/*
+	 * The vectors are owned and laid out by the caller; we only forward
+	 * them. Most callers are already aligned, but io_uring can place a
+	 * user chosen offset through a registered buffer, where only the first
+	 * vector may be unaligned.
+	 */
+	return !(mp_bvec_iter_offset(bio->bi_io_vec, bio->bi_iter) &
+							mem_align_mask);
+}
+#endif
+
 /**
  * bio_iov_iter_get_pages - add user or kernel pages to a bio
  * @bio: bio to add pages to
  * @iter: iov iterator describing the region to be added
+ * @mem_align_mask: the mask the source address and length must be aligned to,
+ *	0 for no requirement
  * @len_align_mask: the mask to align the total size to, 0 for any length
  *
  * This takes either an iterator pointing to user memory, or one pointing to
@@ -1242,7 +1271,7 @@ static int bio_iov_iter_align_down(struct bio *bio, struct iov_iter *iter,
  * is returned only if 0 pages could be pinned.
  */
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
-			   unsigned len_align_mask)
+			   unsigned mem_align_mask, unsigned len_align_mask)
 {
 	iov_iter_extraction_t flags = 0;
 
@@ -1251,6 +1280,10 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
 
 	if (iov_iter_is_bvec(iter)) {
 		bio_iov_bvec_set(bio, iter);
+
+		if (!bio_iov_bvec_aligned(bio, mem_align_mask))
+			return -EINVAL;
+
 		iov_iter_advance(iter, bio->bi_iter.bi_size);
 		return 0;
 	}
@@ -1265,8 +1298,19 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
 
 		ret = iov_iter_extract_bvecs(iter, bio->bi_io_vec,
 				BIO_MAX_SIZE - bio->bi_iter.bi_size,
-				&bio->bi_vcnt, bio->bi_max_vecs, flags);
+				&bio->bi_vcnt, bio->bi_max_vecs,
+				mem_align_mask, flags);
 		if (ret <= 0) {
+			/*
+			 * A misaligned vector fails the whole I/O.  Release any
+			 * pages pinned by earlier iterations before returning
+			 * since this bio won't be submitted to release them.
+			 */
+			if (ret == -EINVAL) {
+				bio_release_pages(bio, false);
+				bio_clear_flag(bio, BIO_PAGE_PINNED);
+				bio->bi_vcnt = 0;
+			}
 			if (!bio->bi_vcnt)
 				return ret;
 			break;
@@ -1377,7 +1421,7 @@ static int bio_iov_iter_bounce_read(struct bio *bio, struct iov_iter *iter,
 		ssize_t ret;
 
 		ret = iov_iter_extract_bvecs(iter, bio->bi_io_vec + 1, len,
-				&bio->bi_vcnt, bio->bi_max_vecs - 1, 0);
+				&bio->bi_vcnt, bio->bi_max_vecs - 1, 0, 0);
 		if (ret <= 0) {
 			if (!bio->bi_vcnt) {
 				folio_put(folio);
diff --git a/block/blk-map.c b/block/blk-map.c
index 768549f19f97e..c9535efe1a913 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -274,7 +274,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 	 * No alignment requirements on our part to support arbitrary
 	 * passthrough commands.
 	 */
-	ret = bio_iov_iter_get_pages(bio, iter, 0);
+	ret = bio_iov_iter_get_pages(bio, iter, 0, 0);
 	if (ret)
 		goto out_put;
 	ret = blk_rq_append_bio(rq, bio);
diff --git a/block/fops.c b/block/fops.c
index b5c320da28123..84eeabd97e1f0 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -47,6 +47,7 @@ static inline int blkdev_iov_iter_get_pages(struct bio *bio,
 		struct iov_iter *iter, struct block_device *bdev)
 {
 	return bio_iov_iter_get_pages(bio, iter,
+			bdev_dma_alignment(bdev),
 			bdev_logical_block_size(bdev) - 1);
 }
 
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index b485e3b191daf..ff458aa12ae29 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -358,6 +358,7 @@ static ssize_t iomap_dio_bio_iter_one(struct iomap_iter *iter,
 				iomap_max_bio_size(&iter->iomap), alignment);
 	else
 		ret = bio_iov_iter_get_pages(bio, dio->submit.iter,
+					     bdev_dma_alignment(bio->bi_bdev),
 					     alignment - 1);
 	if (unlikely(ret))
 		goto out_put_bio;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8f33f717b14f5..ce34ea49ef358 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -477,7 +477,7 @@ int bdev_rw_virt(struct block_device *bdev, sector_t sector, void *data,
 		size_t len, enum req_op op);
 
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
-		unsigned len_align_mask);
+		unsigned mem_align_mask, unsigned len_align_mask);
 
 void bio_iov_bvec_set(struct bio *bio, const struct iov_iter *iter);
 void __bio_release_pages(struct bio *bio, bool mark_dirty);
diff --git a/include/linux/uio.h b/include/linux/uio.h
index a9bc5b3067e32..653dee76c0b33 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -391,7 +391,8 @@ ssize_t iov_iter_extract_pages(struct iov_iter *i, struct page ***pages,
 			       size_t *offset0);
 ssize_t iov_iter_extract_bvecs(struct iov_iter *iter, struct bio_vec *bv,
 		size_t max_size, unsigned short *nr_vecs,
-		unsigned short max_vecs, iov_iter_extraction_t extraction_flags);
+		unsigned short max_vecs, unsigned mem_align_mask,
+		iov_iter_extraction_t extraction_flags);
 
 /**
  * iov_iter_extract_will_pin - Indicate how pages from the iterator will be retained
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 273919b161617..8d5ca3e38522a 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1886,6 +1886,8 @@ static unsigned int get_contig_folio_len(struct page **pages,
  * @max_size:	maximum size to extract from @iter
  * @nr_vecs:	number of vectors in @bv (on in and output)
  * @max_vecs:	maximum vectors in @bv, including those filled before calling
+ * @mem_align_mask:	reject with -EINVAL if the source address or length is not
+ *		aligned to this mask
  * @extraction_flags: flags to qualify request
  *
  * Like iov_iter_extract_pages(), but returns physically contiguous ranges
@@ -1897,14 +1899,19 @@ static unsigned int get_contig_folio_len(struct page **pages,
  */
 ssize_t iov_iter_extract_bvecs(struct iov_iter *iter, struct bio_vec *bv,
 		size_t max_size, unsigned short *nr_vecs,
-		unsigned short max_vecs, iov_iter_extraction_t extraction_flags)
+		unsigned short max_vecs, unsigned mem_align_mask,
+		iov_iter_extraction_t extraction_flags)
 {
+	unsigned long start = (unsigned long)iter_iov_addr(iter);
 	unsigned short entries_left = max_vecs - *nr_vecs;
 	unsigned short nr_pages, i = 0;
 	size_t left, offset, len;
 	struct page **pages;
 	ssize_t size;
 
+	if ((start | iter_iov_len(iter)) & mem_align_mask)
+		return -EINVAL;
+
 	/*
 	 * Move page array up in the allocated memory for the bio vecs as far as
 	 * possible so that we can start filling biovecs from the beginning
-- 
2.52.0



* Re: [PATCHv2 6/6] block: validate user space vectors during extraction
  2026-06-22 17:42 ` [PATCHv2 6/6] block: validate user space vectors during extraction Keith Busch
@ 2026-06-23 15:10   ` Christoph Hellwig
  2026-06-23 16:17     ` Keith Busch
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2026-06-23 15:10 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-fsdevel, dm-devel, hch, axboe, brauner, djwong,
	viro, Keith Busch, stable

> +#ifdef CONFIG_DEBUG_KERNEL

That's a pretty broad option.  Not that I have any better idea off the
bat.

> +static inline bool bio_iov_bvec_aligned(const struct bio *bio,
> +					unsigned mem_align_mask)
> +{
> +	/*
> +	 * The vectors are owned and laid out by the caller; we only forward
> +	 * them. Most callers are already aligned, but io_uring can place a
> +	 * user chosen offset through a registered buffer, where only the first
> +	 * vector may be unaligned.
> +	 */
> +	return !(mp_bvec_iter_offset(bio->bi_io_vec, bio->bi_iter) &
> +							mem_align_mask);

I don't fully understand the comment.  I guess this is to say ITER_BVEC
users had better not create any alignment gaps?  Maybe we should also
clearly document that in uio.h?

>  	return bio_iov_iter_get_pages(bio, iter,
> +			bdev_dma_alignment(bdev),

Nit: this easily fits onto the previous line.

Otherwise this looks good.


* Re: [PATCHv2 6/6] block: validate user space vectors during extraction
  2026-06-23 15:10   ` Christoph Hellwig
@ 2026-06-23 16:17     ` Keith Busch
  0 siblings, 0 replies; 17+ messages in thread
From: Keith Busch @ 2026-06-23 16:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, linux-block, linux-fsdevel, dm-devel, axboe, brauner,
	djwong, viro, stable

On Tue, Jun 23, 2026 at 05:10:21PM +0200, Christoph Hellwig wrote:
> > +	/*
> > +	 * The vectors are owned and laid out by the caller; we only forward
> > +	 * them. Most callers are already aligned, but io_uring can place a
> > +	 * user chosen offset through a registered buffer, where only the first
> > +	 * vector may be unaligned.
> > +	 */
> > +	return !(mp_bvec_iter_offset(bio->bi_io_vec, bio->bi_iter) &
> > +							mem_align_mask);
> 
> I don't fully understand the comment.  I guess this is to say ITER_BVEC
> users had better not create any alignment gaps?  Maybe we should also
> clearly document that in uio.h?

Exactly. The in-kernel users of ITER_BVEC that allocate their own
buffers are, as far as I know, aligned already. Fabric storage targets
like nvme allocate their own SGLs on page boundaries, so the bio is
aligned at the point it is constructed.

The ones that forward user buffers, like loop and zloop, are addressed in
the previous two patches. They generally should have been fine for most
hardware without those updates, but the updates are included in case a
backing device has stricter constraints than 512-byte ("sector_t")
alignment.

The only other user-space-provided buffer whose alignment I think may
trip this up is the io_uring registered buffer, so that's what I'm trying
to call out here.
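
For readers following along, the mask test the patch applies to each user
vector, (start | len) & mem_align_mask, can be modelled in userspace. The
following is only a minimal sketch: the function name vec_is_aligned is
hypothetical, and the assumption that the mask has the form 2^n - 1 (such
as 511 for 512-byte alignment) is mine, based on how the patch passes
bdev_dma_alignment() and "logical block size - 1". It is not the kernel
code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of the alignment test; not kernel code.
 * mask is assumed to be of the form 2^n - 1 (e.g. 511 for 512-byte
 * alignment), or 0 for "no requirement".
 */
static bool vec_is_aligned(uintptr_t addr, size_t len, unsigned int mask)
{
	/* Reject masks that are not a contiguous run of low bits. */
	assert((mask & (mask + 1)) == 0);

	/*
	 * OR-ing the address and length folds two tests into one: if either
	 * value has any of the masked low bits set, the result is non-zero.
	 */
	return ((addr | len) & mask) == 0;
}
```

With a mask of 511, an address of 0x1000 and a length of 4096 pass, while
an address of 0x1004 or a length of 100 fails. A mask of 0 accepts any
address and length, which matches the "0 for no requirement" wording in
the kerneldoc above.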



Thread overview: 17+ messages
2026-06-22 17:42 [PATCHv2 0/6] direct-io: validate user space vectors during extraction Keith Busch
2026-06-22 17:42 ` [PATCHv2 1/6] block: introduce bio_endio_errno helper Keith Busch
2026-06-23 14:54   ` Christoph Hellwig
2026-06-23 15:05     ` Keith Busch
2026-06-23 15:07       ` Christoph Hellwig
2026-06-22 17:42 ` [PATCHv2 2/6] block: report the actual status Keith Busch
2026-06-23 14:55   ` Christoph Hellwig
2026-06-23 14:59     ` Keith Busch
2026-06-22 17:42 ` [PATCHv2 3/6] block: fix dio leak on metadata mapping error Keith Busch
2026-06-23 15:01   ` Christoph Hellwig
2026-06-22 17:42 ` [PATCHv2 4/6] loop: set dma_alignment from the backing file for direct I/O Keith Busch
2026-06-23 15:04   ` Christoph Hellwig
2026-06-22 17:42 ` [PATCHv2 5/6] zloop: set dma_alignment from the backing files " Keith Busch
2026-06-23 15:06   ` Christoph Hellwig
2026-06-22 17:42 ` [PATCHv2 6/6] block: validate user space vectors during extraction Keith Busch
2026-06-23 15:10   ` Christoph Hellwig
2026-06-23 16:17     ` Keith Busch
