Linux block layer


* Re: [PATCH v4 0/3] btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered IO
From: Christoph Hellwig @ 2026-06-16 12:45 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, linux-block, linux-fsdevel, linux-xfs
In-Reply-To: <cover.1781597506.git.wqu@suse.com>

Note: You'll need to include Jens for the block bits to get either an
ACK or a merge through the block tree.



* Re: [PATCH v4 2/3] block: respect iov_iter::nofault flag in bio_iov_iter_bounce_write()
From: Christoph Hellwig @ 2026-06-16 12:44 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, linux-block, linux-fsdevel, linux-xfs
In-Reply-To: <9c165a314022b61566eb247852eb773ca6c70889.1781597506.git.wqu@suse.com>

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [PATCH v4 1/3] block: revert the iov_iter after a short copy in bio_iov_iter_bounce_write()
From: Christoph Hellwig @ 2026-06-16 12:44 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, linux-block, linux-fsdevel, linux-xfs
In-Reply-To: <c400989f227343b134110773d5acaaacf7024574.1781597506.git.wqu@suse.com>

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [PATCH 0/3] mm/zram: route block swap I/O through swap_ops
From: Christoph Hellwig @ 2026-06-16 12:36 UTC (permalink / raw)
  To: Jianyue Wu
  Cc: Andrew Morton, Christoph Hellwig, Chris Li, Baoquan He, Nhat Pham,
	Barry Song, Kairui Song, Kemeng Shi, Youngjun Park, Minchan Kim,
	Sergey Senozhatsky, Jens Axboe, Matthew Wilcox (Oracle), Jan Kara,
	linux-mm, linux-kernel, linux-block, linux-doc
In-Reply-To: <20260614-zram-swap-ops-block-register-v1-0-6c1a6639c222@gmail.com>

I fear this is going entirely in the wrong direction.

Yes, we have to keep zram around as a legacy interface for now,
but the right place to deal with compressed swap is in the core.

So please don't add more hacks for 'magic' block devices.


* Re: [PATCH net v2 1/2] iov_iter: export iov_iter_restore
From: Stefano Garzarella @ 2026-06-16 12:35 UTC (permalink / raw)
  To: Octavian Purdila
  Cc: netdev, Alexander Viro, Andrew Morton, Arseniy Krasnov,
	David S. Miller, Eric Dumazet, Eugenio Pérez, Jakub Kicinski,
	Jason Wang, kvm, linux-block, linux-fsdevel, linux-kernel,
	Michael S. Tsirkin, Paolo Abeni, Simon Horman, Stefan Hajnoczi,
	virtualization, Xuan Zhuo
In-Reply-To: <20260613000953.467473-2-tavip@google.com>

On Sat, Jun 13, 2026 at 12:09:52AM +0000, Octavian Purdila wrote:
>Export iov_iter_restore so that it can be used by modules.
>
>This is needed by the virtio vsock transport (which can be built as a
>module) to restore the msg_iter state when transmission fails.
>
>Signed-off-by: Octavian Purdila <tavip@google.com>
>---
> lib/iov_iter.c | 1 +
> 1 file changed, 1 insertion(+)

Acked-by: Stefano Garzarella <sgarzare@redhat.com>

>
>diff --git a/lib/iov_iter.c b/lib/iov_iter.c
>index 243662af1af73..067e745f9ef53 100644
>--- a/lib/iov_iter.c
>+++ b/lib/iov_iter.c
>@@ -1491,6 +1491,7 @@ void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
> 		i->__iov -= state->nr_segs - i->nr_segs;
> 	i->nr_segs = state->nr_segs;
> }
>+EXPORT_SYMBOL(iov_iter_restore);
>
> /*
>  * Extract a list of contiguous pages from an ITER_FOLIOQ iterator.  This does
>-- 
>2.54.0.1136.gdb2ca164c4-goog
>
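
The use case named in the commit message (rewind msg_iter when transmission fails) follows a save/consume/restore pattern. Below is a minimal user-space sketch of that pattern, not the vsock code. `struct cursor`, `cursor_save()`, `cursor_restore()`, `cursor_copy()` and `try_send()` are hypothetical names invented for illustration; in the kernel the corresponding calls are iov_iter_save_state() and iov_iter_restore().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an iterator over one flat buffer. */
struct cursor {
	const char *buf;
	size_t offset;	/* bytes already consumed */
	size_t count;	/* bytes still available */
};

/* Snapshot of the fields that consuming data changes. */
struct cursor_state {
	size_t offset;
	size_t count;
};

void cursor_save(const struct cursor *c, struct cursor_state *s)
{
	s->offset = c->offset;
	s->count = c->count;
}

void cursor_restore(struct cursor *c, const struct cursor_state *s)
{
	/* Restoring may only rewind, never skip ahead. */
	assert(s->offset <= c->offset);
	c->offset = s->offset;
	c->count = s->count;
}

/* Copy up to len bytes out of the cursor and advance it. */
size_t cursor_copy(struct cursor *c, char *dst, size_t len)
{
	if (len > c->count)
		len = c->count;
	memcpy(dst, c->buf + c->offset, len);
	c->offset += len;
	c->count -= len;
	return len;
}

/*
 * Consume len bytes, then pretend to transmit them.  If the transmit
 * fails (fail != 0), rewind so the caller can retry with the same data.
 */
int try_send(struct cursor *c, char *dst, size_t len, int fail)
{
	struct cursor_state saved;

	cursor_save(c, &saved);
	cursor_copy(c, dst, len);
	if (fail) {
		cursor_restore(c, &saved);
		return -1;
	}
	return 0;
}
```

Without the restore, a failed attempt would leave the cursor advanced and a retry would silently skip the bytes that were never sent.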



* Re: [PATCH v2 2/5] block: split bdev_yield_claim() out of bdev_fput()
From: Jan Kara @ 2026-06-16 12:35 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Chris Mason, Jens Axboe, David Sterba, Jan Kara, Naohiro Aota,
	Josef Bacik, linux-btrfs, linux-block, linux-fsdevel
In-Reply-To: <20260616-work-super-freeze_deny_upstream-v2-2-b3567c7f994b@kernel.org>

On Tue 16-06-26 13:58:15, Christian Brauner wrote:
> bdev_fput() yields the holder claim and then closes the file, which is a
> deferred operation.  Split the yield half into bdev_yield_claim() so a caller
> can give up the holder while the file - and therefore the block device - is
> still open, act on the device, and only then bdev_fput().
> 
> A filesystem that made a device unfreezable for a membership change with
> bdev_deny_freeze() undoes the deny on release with
> 
> 	bdev_yield_claim(bdev_file);
> 	bdev_allow_freeze(file_bdev(bdev_file));
> 	bdev_fput(bdev_file);
> 
> Re-allowing only after the holder is yielded avoids stranding the filesystem
> on a racing freeze, and doing it while the file is still open avoids touching
> the block device after bdev_fput().  bdev_fput() yields again, which is a
> no-op once the claim has already been given up.
> 
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  block/bdev.c           | 50 ++++++++++++++++++++++++++++++++++----------------
>  include/linux/blkdev.h |  1 +
>  2 files changed, 35 insertions(+), 16 deletions(-)
> 
> diff --git a/block/bdev.c b/block/bdev.c
> index a83a3809380c..54b35a084c36 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -1200,6 +1200,39 @@ void bdev_release(struct file *bdev_file)
>  	blkdev_put_no_open(bdev);
>  }
>  
> +/**
> + * bdev_yield_claim - give up the holder claim on an open block device
> + * @bdev_file: open block device
> + *
> + * Yield the holder and any write access for @bdev_file without closing it, so
> + * the caller can still act on the device - e.g. bdev_allow_freeze() it - before
> + * the final bdev_fput().  bdev_fput() yields too, so calling it afterwards is
> + * safe.
> + */
> +void bdev_yield_claim(struct file *bdev_file)
> +{
> +	struct block_device *bdev;
> +	struct gendisk *disk;
> +
> +	if (!bdev_file->private_data)
> +		return;
> +
> +	bdev = file_bdev(bdev_file);
> +	disk = bdev->bd_disk;
> +
> +	mutex_lock(&disk->open_mutex);
> +	bdev_yield_write_access(bdev_file);
> +	bd_yield_claim(bdev_file);
> +	/*
> +	 * Tell release we already gave up our hold on the
> +	 * device and if write restrictions are available that
> +	 * we already gave up write access to the device.
> +	 */
> +	bdev_file->private_data = BDEV_I(bdev_file->f_mapping->host);
> +	mutex_unlock(&disk->open_mutex);
> +}
> +EXPORT_SYMBOL_GPL(bdev_yield_claim);
> +
>  /**
>   * bdev_fput - yield claim to the block device and put the file
>   * @bdev_file: open block device
> @@ -1213,22 +1246,7 @@ void bdev_fput(struct file *bdev_file)
>  	if (WARN_ON_ONCE(bdev_file->f_op != &def_blk_fops))
>  		return;
>  
> -	if (bdev_file->private_data) {
> -		struct block_device *bdev = file_bdev(bdev_file);
> -		struct gendisk *disk = bdev->bd_disk;
> -
> -		mutex_lock(&disk->open_mutex);
> -		bdev_yield_write_access(bdev_file);
> -		bd_yield_claim(bdev_file);
> -		/*
> -		 * Tell release we already gave up our hold on the
> -		 * device and if write restrictions are available that
> -		 * we already gave up write access to the device.
> -		 */
> -		bdev_file->private_data = BDEV_I(bdev_file->f_mapping->host);
> -		mutex_unlock(&disk->open_mutex);
> -	}
> -
> +	bdev_yield_claim(bdev_file);
>  	fput(bdev_file);
>  }
>  EXPORT_SYMBOL(bdev_fput);
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index cf1951caadb2..9fc16e3c8075 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1832,6 +1832,7 @@ int bdev_thaw(struct block_device *bdev);
>  int bdev_deny_freeze(struct block_device *bdev);
>  void bdev_allow_freeze(struct block_device *bdev);
>  void bdev_fput(struct file *bdev_file);
> +void bdev_yield_claim(struct file *bdev_file);
>  
>  struct io_comp_batch {
>  	struct rq_list req_list;
> 
> -- 
> 2.47.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
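
The quoted patch relies on the yield being idempotent: bdev_fput() yields again after bdev_yield_claim(), and the second yield must do nothing. A small user-space model of that property is sketched below. `struct model_bdev`, `struct model_file`, `model_yield_claim()` and `model_fput()` are hypothetical names; the `has_claim` flag stands in for the private_data check in the real code, and the model closes the file synchronously whereas the real fput() is deferred.

```c
#include <assert.h>
#include <stdbool.h>

struct model_bdev {
	int holders;	/* number of active holder claims */
	int openers;	/* number of open files */
};

struct model_file {
	struct model_bdev *bdev;
	bool has_claim;	/* stands in for the private_data check */
	bool open;
};

/* Give up the holder claim but keep the file open. */
void model_yield_claim(struct model_file *f)
{
	if (!f->has_claim)
		return;		/* already yielded: nothing to do */
	assert(f->bdev->holders > 0);
	f->bdev->holders--;
	f->has_claim = false;
}

/* Yield (a no-op if already done) and then close the file. */
void model_fput(struct model_file *f)
{
	model_yield_claim(f);
	assert(f->open);
	f->open = false;
	f->bdev->openers--;
}
```

Because the second yield returns early, the holder count is decremented exactly once whether or not the caller yielded before the final put.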


* Re: [PATCH RFC 2/8] fs: add a global device to super block hash table
From: Christoph Hellwig @ 2026-06-16 12:34 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christoph Hellwig, Jan Kara, Jens Axboe, Alexander Viro,
	linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
	linux-xfs, Chris Mason, David Sterba, linux-btrfs,
	Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260602-work-super-bdev_holder_global-v1-2-bb0fd82f3861@kernel.org>

On Tue, Jun 02, 2026 at 12:10:08PM +0200, Christian Brauner wrote:
> fs_holder_ops recovers the owning superblock from bdev->bd_holder, which
> forces the holder to be exactly one superblock and prevents several
> superblocks from sharing one block device. That's what erofs is doing.
> 
> Introduce a global dev_t-keyed rhltable mapping each block device to the
> superblock(s) using it. The holder argument becomes purely the block
> layer's exclusivity token (a superblock, or a file_system_type for
> shared devices) and is no longer needed by the fs specific callbacks.

Err, no.  Block devices need to have a specific owner.  If erofs wants
to share a device between superblocks, it needs to come up with an entity
that owns the block device and is not a superblock.

IMHO sharing devices between superblocks is a bad idea.  That ship has
sailed, but please keep it contained inside erofs.



* [PATCH v2 5/5] btrfs: deny freezing devices undergoing a replace
From: Christian Brauner @ 2026-06-16 11:58 UTC (permalink / raw)
  To: Chris Mason, Jens Axboe, David Sterba, Jan Kara
  Cc: Naohiro Aota, Josef Bacik, linux-btrfs, linux-block,
	linux-fsdevel, Christian Brauner (Amutable)
In-Reply-To: <20260616-work-super-freeze_deny_upstream-v2-0-b3567c7f994b@kernel.org>

A device replace opens a target and, on success, frees the source on a live
filesystem from btrfs_dev_replace_finishing() - which cannot fail and also
runs from a kthread on mount resume.  A bdev_freeze() racing the source free
or the target swap-in would freeze the filesystem through a claim that is
being torn down or replaced, leaving nothing for bdev_thaw() to rebalance.

Make both devices unfreezable for the whole replace, with the invariant that
a STARTED replace holds one deny on each device and any other state holds
none.  The target is denied at open (btrfs_open_device_deny_freeze(), undone
on btrfs_init_dev_replace_tgtdev()'s error unwind); the source is denied at
the start of btrfs_dev_replace_start(), before mark_block_group_to_copy() so
every 'leave' unwind sees both denied.

The deny tracks the STARTED state and is dropped whenever the replace leaves
it: btrfs_dev_replace_finishing() re-allows the target it makes a member and
frees the source through btrfs_close_bdev(allow_freeze=true), and its
scrub-error path re-allows both as it cancels.  Its early failures (before
the device swap) keep the replace STARTED and resumable, so both stay denied.
Suspending for unmount re-allows both, so they are reopened freezable at the
next mount where btrfs_resume_dev_replace_async() re-denies them (staying
suspended if a device is frozen right then); a replace cancelled from the
suspended state therefore destroys the target without allowing.
btrfs_close_bdev() and btrfs_destroy_dev_replace_tgtdev() take an allow_freeze
argument to carry this distinction; the unmount path
(btrfs_close_one_device()) passes false.

On resume, a failed kthread_run() re-allows both devices and goes through the
suspend path, resetting the replace to SUSPENDED and finishing the exclusive
operation instead of returning straight away.  The (re)mount still aborts on
that error; routing it through suspend keeps the deny balanced against the
unmount teardown and additionally drops BTRFS_EXCLOP_DEV_REPLACE, closing a
pre-existing leak that was harmless on the failed mount that frees the fs but
would have wedged future exclusive operations after a failed remount-rw.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 fs/btrfs/dev-replace.c | 65 ++++++++++++++++++++++++++++++++++++++++++++------
 fs/btrfs/volumes.c     | 18 +++++++++-----
 fs/btrfs/volumes.h     |  3 ++-
 3 files changed, 72 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 8f8fa14886de..4ae34acb89e8 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -247,8 +247,8 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 		return -EINVAL;
 	}
 
-	bdev_file = bdev_file_open_by_path(device_path, BLK_OPEN_WRITE,
-					   fs_info->sb, &fs_holder_ops);
+	/* Unfreezable for the whole replace; see btrfs_dev_replace_start(). */
+	bdev_file = btrfs_open_device_deny_freeze(device_path, fs_info->sb);
 	if (IS_ERR(bdev_file)) {
 		btrfs_err(fs_info, "target device %s is invalid!", device_path);
 		return PTR_ERR(bdev_file);
@@ -325,7 +325,8 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 	return 0;
 
 error:
-	bdev_fput(bdev_file);
+	/* Undo the open-time freeze deny. */
+	btrfs_release_device_allow_freeze(bdev_file);
 	return ret;
 }
 
@@ -622,6 +623,15 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 	if (ret)
 		return ret;
 
+	/* Deny the source before mark, so every 'leave' unwinds both denied. */
+	if (src_device->bdev) {
+		ret = bdev_deny_freeze(src_device->bdev);
+		if (ret) {
+			btrfs_destroy_dev_replace_tgtdev(tgt_device, true);
+			return ret;
+		}
+	}
+
 	ret = mark_block_group_to_copy(fs_info, src_device);
 	if (ret)
 		return ret;
@@ -706,7 +716,9 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 	return ret;
 
 leave:
-	btrfs_destroy_dev_replace_tgtdev(tgt_device);
+	if (src_device->bdev)
+		bdev_allow_freeze(src_device->bdev);
+	btrfs_destroy_dev_replace_tgtdev(tgt_device, true);
 	return ret;
 }
 
@@ -887,6 +899,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 	 */
 	ret = btrfs_start_delalloc_roots(fs_info, LONG_MAX, false);
 	if (ret) {
+		/* Stays started/resumable; keep both denied. */
 		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
 		return ret;
 	}
@@ -900,6 +913,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 	while (1) {
 		trans = btrfs_start_transaction(root, 0);
 		if (IS_ERR(trans)) {
+			/* Stays started/resumable; keep both denied. */
 			mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
 			return PTR_ERR(trans);
 		}
@@ -952,7 +966,10 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 		mutex_unlock(&fs_devices->device_list_mutex);
 		btrfs_rm_dev_replace_blocked(fs_info);
 		if (tgt_device)
-			btrfs_destroy_dev_replace_tgtdev(tgt_device);
+			btrfs_destroy_dev_replace_tgtdev(tgt_device, true);
+		/* The source stays a member; re-allow freezing it. */
+		if (src_device->bdev)
+			bdev_allow_freeze(src_device->bdev);
 		btrfs_rm_dev_replace_unblocked(fs_info);
 		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
 
@@ -1018,6 +1035,8 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 
 	mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
 
+	/* The target is now a member; the source is freed (allow + release). */
+	bdev_allow_freeze(tgt_device->bdev);
 	btrfs_rm_dev_replace_free_srcdev(src_device);
 
 	return 0;
@@ -1146,8 +1165,9 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info)
 			btrfs_dev_name(src_device), src_device->devid,
 			btrfs_dev_name(tgt_device));
 
+		/* A suspended replace never re-denied freezing; do not allow. */
 		if (tgt_device)
-			btrfs_destroy_dev_replace_tgtdev(tgt_device);
+			btrfs_destroy_dev_replace_tgtdev(tgt_device, false);
 		break;
 	default:
 		up_write(&dev_replace->rwsem);
@@ -1177,6 +1197,11 @@ void btrfs_dev_replace_suspend_for_unmount(struct btrfs_fs_info *fs_info)
 		dev_replace->time_stopped = ktime_get_real_seconds();
 		dev_replace->item_needs_writeback = 1;
 		btrfs_info(fs_info, "suspending dev_replace for unmount");
+		/* Reopened freezable next mount; resume re-denies. */
+		if (dev_replace->srcdev && dev_replace->srcdev->bdev)
+			bdev_allow_freeze(dev_replace->srcdev->bdev);
+		if (dev_replace->tgtdev && dev_replace->tgtdev->bdev)
+			bdev_allow_freeze(dev_replace->tgtdev->bdev);
 		break;
 	}
 
@@ -1189,6 +1214,7 @@ int btrfs_resume_dev_replace_async(struct btrfs_fs_info *fs_info)
 {
 	struct task_struct *task;
 	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
+	int ret = 0;
 
 	down_write(&dev_replace->rwsem);
 
@@ -1232,8 +1258,33 @@ int btrfs_resume_dev_replace_async(struct btrfs_fs_info *fs_info)
 		return 0;
 	}
 
+	/* Re-deny for the resumed replace; stay suspended if frozen now. */
+	if (dev_replace->srcdev->bdev &&
+	    bdev_deny_freeze(dev_replace->srcdev->bdev))
+		goto suspend;
+	if (bdev_deny_freeze(dev_replace->tgtdev->bdev)) {
+		if (dev_replace->srcdev->bdev)
+			bdev_allow_freeze(dev_replace->srcdev->bdev);
+		goto suspend;
+	}
+
 	task = kthread_run(btrfs_dev_replace_kthread, fs_info, "btrfs-devrepl");
-	return PTR_ERR_OR_ZERO(task);
+	if (IS_ERR(task)) {
+		bdev_allow_freeze(dev_replace->tgtdev->bdev);
+		if (dev_replace->srcdev->bdev)
+			bdev_allow_freeze(dev_replace->srcdev->bdev);
+		/* Undo the deny and suspend, but still fail the mount. */
+		ret = PTR_ERR(task);
+		goto suspend;
+	}
+	return 0;
+
+suspend:
+	btrfs_exclop_finish(fs_info);
+	down_write(&dev_replace->rwsem);
+	dev_replace->replace_state = BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED;
+	up_write(&dev_replace->rwsem);
+	return ret;
 }
 
 static int btrfs_dev_replace_kthread(void *data)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 167a1c3d0fca..9ffc5329f6b2 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1128,7 +1128,7 @@ void btrfs_release_device_allow_freeze(struct file *bdev_file)
 	bdev_fput(bdev_file);
 }
 
-static void btrfs_close_bdev(struct btrfs_device *device)
+static void btrfs_close_bdev(struct btrfs_device *device, bool allow_freeze)
 {
 	if (!device->bdev)
 		return;
@@ -1138,7 +1138,11 @@ static void btrfs_close_bdev(struct btrfs_device *device)
 		invalidate_bdev(device->bdev);
 	}
 
-	bdev_fput(device->bdev_file);
+	/* @allow_freeze undoes a replace-time deny; unmount-close was never denied. */
+	if (allow_freeze)
+		btrfs_release_device_allow_freeze(device->bdev_file);
+	else
+		bdev_fput(device->bdev_file);
 }
 
 static void btrfs_close_one_device(struct btrfs_device *device)
@@ -1159,7 +1163,7 @@ static void btrfs_close_one_device(struct btrfs_device *device)
 		fs_devices->missing_devices--;
 	}
 
-	btrfs_close_bdev(device);
+	btrfs_close_bdev(device, false);
 	if (device->bdev) {
 		fs_devices->open_devices--;
 		device->bdev = NULL;
@@ -2511,7 +2515,8 @@ void btrfs_rm_dev_replace_free_srcdev(struct btrfs_device *srcdev)
 
 	mutex_lock(&uuid_mutex);
 
-	btrfs_close_bdev(srcdev);
+	/* The source was made unfreezable for the replace; undo it. */
+	btrfs_close_bdev(srcdev, true);
 	synchronize_rcu();
 	btrfs_free_device(srcdev);
 
@@ -2532,7 +2537,8 @@ void btrfs_rm_dev_replace_free_srcdev(struct btrfs_device *srcdev)
 	mutex_unlock(&uuid_mutex);
 }
 
-void btrfs_destroy_dev_replace_tgtdev(struct btrfs_device *tgtdev)
+void btrfs_destroy_dev_replace_tgtdev(struct btrfs_device *tgtdev,
+				      bool allow_freeze)
 {
 	struct btrfs_fs_devices *fs_devices = tgtdev->fs_info->fs_devices;
 
@@ -2553,7 +2559,7 @@ void btrfs_destroy_dev_replace_tgtdev(struct btrfs_device *tgtdev)
 
 	btrfs_scratch_superblocks(tgtdev->fs_info, tgtdev);
 
-	btrfs_close_bdev(tgtdev);
+	btrfs_close_bdev(tgtdev, allow_freeze);
 	synchronize_rcu();
 	btrfs_free_device(tgtdev);
 }
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 75c7963f5d4c..65de9504d887 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -790,7 +790,8 @@ int btrfs_init_dev_stats(struct btrfs_fs_info *fs_info);
 int btrfs_run_dev_stats(struct btrfs_trans_handle *trans);
 void btrfs_rm_dev_replace_remove_srcdev(struct btrfs_device *srcdev);
 void btrfs_rm_dev_replace_free_srcdev(struct btrfs_device *srcdev);
-void btrfs_destroy_dev_replace_tgtdev(struct btrfs_device *tgtdev);
+void btrfs_destroy_dev_replace_tgtdev(struct btrfs_device *tgtdev,
+				      bool allow_freeze);
 unsigned long btrfs_full_stripe_len(struct btrfs_fs_info *fs_info,
 				    u64 logical);
 u64 btrfs_calc_stripe_length(const struct btrfs_chunk_map *map);

-- 
2.47.3
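
The invariant stated in the commit message above (a STARTED replace holds one deny on each device, any other state holds none) can be checked with a small user-space state machine. This is only a sketch of the bookkeeping, not btrfs code. `struct model_replace` and the `replace_*()` helpers are hypothetical names, and the model collapses the individual error paths into the transitions the commit message describes.

```c
#include <assert.h>
#include <stdbool.h>

enum replace_state { NEVER_STARTED, STARTED, SUSPENDED, FINISHED, CANCELED };

struct model_replace {
	enum replace_state state;
	int src_deny;	/* denies held on the source device */
	int tgt_deny;	/* denies held on the target device */
};

/* STARTED holds exactly one deny per device; every other state holds none. */
bool invariant_holds(const struct model_replace *r)
{
	int want = (r->state == STARTED) ? 1 : 0;

	return r->src_deny == want && r->tgt_deny == want;
}

void replace_start(struct model_replace *r)
{
	assert(r->state == NEVER_STARTED);
	r->tgt_deny++;		/* taken when the target is opened */
	r->src_deny++;		/* taken at the start of the replace */
	r->state = STARTED;
}

/* Success: target becomes a member, source is freed; both are re-allowed. */
void replace_finish(struct model_replace *r)
{
	assert(r->state == STARTED);
	r->tgt_deny--;
	r->src_deny--;
	r->state = FINISHED;
}

/* Unmount: both devices are re-allowed so they reopen freezable. */
void replace_suspend(struct model_replace *r)
{
	assert(r->state == STARTED);
	r->tgt_deny--;
	r->src_deny--;
	r->state = SUSPENDED;
}

/* Next mount: re-deny both, or stay suspended if a device is frozen now. */
void replace_resume(struct model_replace *r, bool device_frozen)
{
	assert(r->state == SUSPENDED);
	if (device_frozen)
		return;
	r->src_deny++;
	r->tgt_deny++;
	r->state = STARTED;
}

/* Cancel: drop the denies only if the replace currently holds them. */
void replace_cancel(struct model_replace *r)
{
	if (r->state == STARTED) {
		r->tgt_deny--;
		r->src_deny--;
	}
	r->state = CANCELED;
}
```

The cancel-from-suspended branch is the reason the patch passes allow_freeze=false there: the suspended replace holds no deny, so allowing again would unbalance the count.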



* [PATCH v2 4/5] btrfs: deny freezing a device while it is being added
From: Christian Brauner @ 2026-06-16 11:58 UTC (permalink / raw)
  To: Chris Mason, Jens Axboe, David Sterba, Jan Kara
  Cc: Naohiro Aota, Josef Bacik, linux-btrfs, linux-block,
	linux-fsdevel, Christian Brauner (Amutable)
In-Reply-To: <20260616-work-super-freeze_deny_upstream-v2-0-b3567c7f994b@kernel.org>

btrfs_init_new_device() opens and claims the new device on a live
superblock without holding the write count, so a bdev_freeze() racing the
window between the claim being published and the device becoming a member
could freeze the filesystem through a claim the add may still abort and tear
down.

Add btrfs_open_device_deny_freeze(): it opens the device once
non-exclusively to take the freeze deny, then claims it by the same dev_t,
so the holder is only ever published while the device is already
unfreezable.  Keep it denied until the add is durable: bdev_allow_freeze()
on each success return (the device is now a committed member),
btrfs_release_device_allow_freeze() on the error unwind.  The deny spans the
whole add, including the seeding tail whose late failures still release the
device.  A device already frozen when the add starts is refused with -EBUSY.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 fs/btrfs/volumes.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/volumes.h |  2 ++
 2 files changed, 43 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 36f9835f65e3..167a1c3d0fca 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2822,6 +2822,37 @@ static int btrfs_finish_sprout(struct btrfs_trans_handle *trans)
 	return 0;
 }
 
+/*
+ * Open @path for @sb with freezing denied before the holder claim is published,
+ * so a racing bdev_freeze() can never reach a claim a device add or replace may
+ * still abort.  The deny is taken on a throwaway non-holder probe open, then the
+ * holder is opened by the probe's dev_t.  Balanced by the caller.
+ */
+struct file *btrfs_open_device_deny_freeze(const char *path,
+					   struct super_block *sb)
+{
+	struct file *probe_file, *bdev_file;
+	int ret;
+
+	/* WRITE so bdev_file_open_by_path() rejects a read-only device. */
+	probe_file = bdev_file_open_by_path(path, BLK_OPEN_WRITE, NULL, NULL);
+	if (IS_ERR(probe_file))
+		return probe_file;
+
+	ret = bdev_deny_freeze(file_bdev(probe_file));
+	if (ret) {
+		bdev_fput(probe_file);
+		return ERR_PTR(ret);
+	}
+
+	bdev_file = bdev_file_open_by_dev(file_bdev(probe_file)->bd_dev,
+					  BLK_OPEN_WRITE, sb, &fs_holder_ops);
+	if (IS_ERR(bdev_file))
+		bdev_allow_freeze(file_bdev(probe_file));
+	bdev_fput(probe_file);
+	return bdev_file;
+}
+
 int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path)
 {
 	struct btrfs_root *root = fs_info->dev_root;
@@ -2840,8 +2871,8 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	if (sb_rdonly(sb) && !fs_devices->seeding)
 		return -EROFS;
 
-	bdev_file = bdev_file_open_by_path(device_path, BLK_OPEN_WRITE,
-					   fs_info->sb, &fs_holder_ops);
+	/* Forbid freezing until the device is a committed member (or unwound). */
+	bdev_file = btrfs_open_device_deny_freeze(device_path, fs_info->sb);
 	if (IS_ERR(bdev_file))
 		return PTR_ERR(bdev_file);
 
@@ -3006,8 +3037,10 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 		up_write(&sb->s_umount);
 		locked = false;
 
-		if (ret) /* transaction commit */
+		if (ret) { /* transaction commit */
+			bdev_allow_freeze(file_bdev(bdev_file));
 			return ret;
+		}
 
 		ret = btrfs_relocate_sys_chunks(fs_info);
 		if (ret < 0)
@@ -3015,8 +3048,10 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 				    "Failed to relocate sys chunks after device initialization. This can be fixed using the \"btrfs balance\" command.");
 		trans = btrfs_attach_transaction(root);
 		if (IS_ERR(trans)) {
-			if (PTR_ERR(trans) == -ENOENT)
+			if (PTR_ERR(trans) == -ENOENT) {
+				bdev_allow_freeze(file_bdev(bdev_file));
 				return 0;
+			}
 			ret = PTR_ERR(trans);
 			trans = NULL;
 			goto error_sysfs;
@@ -3036,6 +3071,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	/* Update ctime/mtime for blkid or udev */
 	update_dev_time(device_path);
 
+	bdev_allow_freeze(file_bdev(bdev_file));
 	return ret;
 
 error_sysfs:
@@ -3065,7 +3101,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 error_free_device:
 	btrfs_free_device(device);
 error:
-	bdev_fput(bdev_file);
+	btrfs_release_device_allow_freeze(bdev_file);
 	if (locked) {
 		mutex_unlock(&uuid_mutex);
 		up_write(&sb->s_umount);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 60e82c15881a..75c7963f5d4c 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -769,6 +769,8 @@ struct btrfs_device *btrfs_find_device(const struct btrfs_fs_devices *fs_devices
 				       const struct btrfs_dev_lookup_args *args);
 int btrfs_shrink_device(struct btrfs_device *device, u64 new_size);
 int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *path);
+struct file *btrfs_open_device_deny_freeze(const char *path,
+					   struct super_block *sb);
 int btrfs_balance(struct btrfs_fs_info *fs_info,
 		  struct btrfs_balance_control *bctl,
 		  struct btrfs_ioctl_balance_args *bargs);

-- 
2.47.3
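
The ordering in btrfs_open_device_deny_freeze() above (probe open, deny, claim, undo the deny if the claim fails, always drop the probe) can be modelled in user space as follows. This is a sketch under simplifying assumptions, not the kernel implementation: `struct model_dev` and the `model_*()` functions are hypothetical, the deny is modelled as a plain counter, and a claim is assumed to fail with -EBUSY whenever another holder already exists.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct model_dev {
	int deny_count;	/* outstanding freeze denies */
	int frozen;	/* nonzero while the device is frozen */
	int openers;	/* open files, with or without a holder */
	void *holder;	/* NULL when unclaimed */
};

/* Refuse to deny a device that is already frozen. */
int model_deny_freeze(struct model_dev *d)
{
	if (d->frozen)
		return -EBUSY;
	d->deny_count++;
	return 0;
}

void model_allow_freeze(struct model_dev *d)
{
	assert(d->deny_count > 0);
	d->deny_count--;
}

/* Holder open: fails if the device is already claimed (assumption). */
int model_claim(struct model_dev *d, void *holder)
{
	if (d->holder)
		return -EBUSY;
	d->holder = holder;
	d->openers++;
	return 0;
}

/* The holder only becomes visible once the device is unfreezable. */
int model_open_deny_freeze(struct model_dev *d, void *holder)
{
	int ret;

	d->openers++;			/* throwaway probe open, no holder */
	ret = model_deny_freeze(d);
	if (!ret) {
		ret = model_claim(d, holder);
		if (ret)
			model_allow_freeze(d);	/* claim failed: undo the deny */
	}
	d->openers--;			/* the probe is always dropped */
	return ret;
}
```

On success the caller ends up with one holder open and one deny to balance later; on either failure the device is left exactly as it was found.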



* [PATCH v2 3/5] btrfs: deny freezing a device while it is being removed
From: Christian Brauner @ 2026-06-16 11:58 UTC (permalink / raw)
  To: Chris Mason, Jens Axboe, David Sterba, Jan Kara
  Cc: Naohiro Aota, Josef Bacik, linux-btrfs, linux-block,
	linux-fsdevel, Christian Brauner (Amutable)
In-Reply-To: <20260616-work-super-freeze_deny_upstream-v2-0-b3567c7f994b@kernel.org>

btrfs_rm_device() runs under mnt_want_write_file(), but the claim on the
removed device is released by the ioctl after mnt_drop_write_file(), so a
bdev_freeze() racing that window could freeze the filesystem through the
device just as its claim is torn down, leaving nothing for bdev_thaw() to
rebalance.

The window cannot be closed by reordering the teardown.  btrfs_rm_device()
hands the final bdev_fput() back to the ioctl, run only after
mnt_drop_write_file(), because bdev_release() takes the disk ->open_mutex and
its dependency chain, which must not nest under the superblock's freeze/write
protection -- freeze_super() drops s_umount before draining writers precisely
to keep sb_start_write ordered above s_umount.  Holding mnt_want_write across
bdev_fput() would reintroduce that inversion, so the holder teardown is forced
outside the write-protected section.  A freeze landing in the resulting gap
resolves the still-live holder, rides in, and strands when the claim is
released; no ordering of the close against the drop removes the gap.  The
device itself therefore has to refuse freezing for the whole removal.

Deny freezing the device for the duration of the removal: bdev_deny_freeze()
at the start of btrfs_rm_device() (it cannot be frozen yet, the ioctl holds
the write count), and release it through btrfs_release_device_allow_freeze()
in the ioctls on success, or bdev_allow_freeze() on the error paths that keep
the device a member.  A device frozen before the removal begins is refused
with -EBUSY.

btrfs_release_device_allow_freeze() yields the holder, re-allows freezing,
then closes the device, so the re-allow neither strands the filesystem on a
racing freeze nor touches the block device after the final fput.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 fs/btrfs/ioctl.c   |  4 ++--
 fs/btrfs/volumes.c | 20 ++++++++++++++++++++
 fs/btrfs/volumes.h |  1 +
 3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index b2e447f5005c..fc3e06445211 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2579,7 +2579,7 @@ static long btrfs_ioctl_rm_dev_v2(struct file *file, void __user *arg)
 err_drop:
 	mnt_drop_write_file(file);
 	if (bdev_file)
-		bdev_fput(bdev_file);
+		btrfs_release_device_allow_freeze(bdev_file);
 out:
 	btrfs_put_dev_args_from_path(&args);
 	kfree(vol_args);
@@ -2630,7 +2630,7 @@ static long btrfs_ioctl_rm_dev(struct file *file, void __user *arg)

 	mnt_drop_write_file(file);
 	if (bdev_file)
-		bdev_fput(bdev_file);
+		btrfs_release_device_allow_freeze(bdev_file);
 out:
 	btrfs_put_dev_args_from_path(&args);
 out_free:
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a88e68f90564..36f9835f65e3 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1119,6 +1119,15 @@ void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices)
 	mutex_unlock(&uuid_mutex);
 }

+/* Release a device that was made unfreezable for a membership change. */
+void btrfs_release_device_allow_freeze(struct file *bdev_file)
+{
+	/* Yield before allow (strand-safe); file still open for the allow (UAF-safe). */
+	bdev_yield_claim(bdev_file);
+	bdev_allow_freeze(file_bdev(bdev_file));
+	bdev_fput(bdev_file);
+}
+
 static void btrfs_close_bdev(struct btrfs_device *device)
 {
 	if (!device->bdev)
@@ -2336,6 +2345,13 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
 	    fs_info->fs_devices->rw_devices == 1)
 		return BTRFS_ERROR_DEV_ONLY_WRITABLE;

+	/* Removal and freezing are mutually exclusive; refuse if frozen now. */
+	if (device->bdev) {
+		ret = bdev_deny_freeze(device->bdev);
+		if (ret)
+			return ret;
+	}
+
 	if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
 		mutex_lock(&fs_info->chunk_mutex);
 		list_del_init(&device->dev_alloc_list);
@@ -2362,6 +2378,8 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
 			   device->devid, ret);
 		btrfs_abort_transaction(trans, ret);
 		btrfs_end_transaction(trans);
+		if (device->bdev)
+			bdev_allow_freeze(device->bdev);
 		return ret;
 	}

@@ -2447,6 +2465,8 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
 	return btrfs_commit_transaction(trans);

 error_undo:
+	if (device->bdev)
+		bdev_allow_freeze(device->bdev);
 	if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
 		mutex_lock(&fs_info->chunk_mutex);
 		list_add(&device->dev_alloc_list,
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 0082c166af91..60e82c15881a 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -744,6 +744,7 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
 struct btrfs_device *btrfs_scan_one_device(const char *path, bool mount_arg_dev);
 int btrfs_forget_devices(dev_t devt);
 void btrfs_close_devices(struct btrfs_fs_devices *fs_devices);
+void btrfs_release_device_allow_freeze(struct file *bdev_file);
 void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices);
 void btrfs_assign_next_active_device(struct btrfs_device *device,
 				     struct btrfs_device *this_dev);

-- 
2.47.3
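
The ordering argument for btrfs_release_device_allow_freeze() above (yield, then allow, then close) can be illustrated with a user-space model that injects a racing freeze at each step. This is a sketch only. `struct model_dev`, `model_freeze()`, `model_release()` and `model_release_allow_first()` are hypothetical, and returning -EBUSY from a denied freeze is an assumption made for the model, since the deny implementation is not part of this message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct model_dev {
	int deny_count;		/* outstanding freeze denies */
	bool has_holder;	/* filesystem still holds the claim */
	bool open;		/* file still open */
	int fs_freezes;		/* freezes that reached the fs via the holder */
};

enum race_point { RACE_BEFORE_YIELD, RACE_AFTER_YIELD, RACE_AFTER_ALLOW };

/* A freeze only reaches the filesystem through a live holder. */
int model_freeze(struct model_dev *d)
{
	if (d->deny_count > 0)
		return -EBUSY;		/* assumed error code */
	if (d->has_holder)
		d->fs_freezes++;
	return 0;
}

/* Order used by the patch; returns the number of stranded fs freezes. */
int model_release(struct model_dev *d, enum race_point race)
{
	if (race == RACE_BEFORE_YIELD)
		model_freeze(d);	/* refused: still denied */
	d->has_holder = false;		/* yield the claim */
	if (race == RACE_AFTER_YIELD)
		model_freeze(d);	/* refused: still denied */
	assert(d->open);		/* allow happens on an open device */
	d->deny_count--;
	if (race == RACE_AFTER_ALLOW)
		model_freeze(d);	/* succeeds, but finds no holder */
	d->open = false;		/* final close */
	return d->fs_freezes;
}

/* Counter-example: allowing before the yield lets a freeze ride in. */
int model_release_allow_first(struct model_dev *d)
{
	d->deny_count--;
	model_freeze(d);		/* holder still live: fs gets frozen */
	d->has_holder = false;		/* claim released underneath it */
	d->open = false;
	return d->fs_freezes;
}
```

In this model no race point strands a freeze with the patch's order, while the allow-first variant leaves one filesystem freeze with no holder left to thaw it through.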


* [PATCH v2 2/5] block: split bdev_yield_claim() out of bdev_fput()
From: Christian Brauner @ 2026-06-16 11:58 UTC (permalink / raw)
  To: Chris Mason, Jens Axboe, David Sterba, Jan Kara
  Cc: Naohiro Aota, Josef Bacik, linux-btrfs, linux-block,
	linux-fsdevel, Christian Brauner (Amutable)
In-Reply-To: <20260616-work-super-freeze_deny_upstream-v2-0-b3567c7f994b@kernel.org>

bdev_fput() yields the holder claim and then closes the file, which is a
deferred operation.  Split the yield half into bdev_yield_claim() so a caller
can give up the holder while the file - and therefore the block device - is
still open, act on the device, and only then bdev_fput().

A filesystem that made a device unfreezable for a membership change with
bdev_deny_freeze() undoes the deny on release with

	bdev_yield_claim(bdev_file);
	bdev_allow_freeze(file_bdev(bdev_file));
	bdev_fput(bdev_file);

Re-allowing only after the holder is yielded avoids stranding the filesystem
on a racing freeze, and doing it while the file is still open avoids touching
the block device after bdev_fput().  bdev_fput() yields again, which is a
no-op once the claim has already been given up.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 block/bdev.c           | 50 ++++++++++++++++++++++++++++++++++----------------
 include/linux/blkdev.h |  1 +
 2 files changed, 35 insertions(+), 16 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index a83a3809380c..54b35a084c36 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -1200,6 +1200,39 @@ void bdev_release(struct file *bdev_file)
 	blkdev_put_no_open(bdev);
 }
 
+/**
+ * bdev_yield_claim - give up the holder claim on an open block device
+ * @bdev_file: open block device
+ *
+ * Yield the holder and any write access for @bdev_file without closing it, so
+ * the caller can still act on the device - e.g. bdev_allow_freeze() it - before
+ * the final bdev_fput().  bdev_fput() yields too, so calling it afterwards is
+ * safe.
+ */
+void bdev_yield_claim(struct file *bdev_file)
+{
+	struct block_device *bdev;
+	struct gendisk *disk;
+
+	if (!bdev_file->private_data)
+		return;
+
+	bdev = file_bdev(bdev_file);
+	disk = bdev->bd_disk;
+
+	mutex_lock(&disk->open_mutex);
+	bdev_yield_write_access(bdev_file);
+	bd_yield_claim(bdev_file);
+	/*
+	 * Tell release we already gave up our hold on the
+	 * device and if write restrictions are available that
+	 * we already gave up write access to the device.
+	 */
+	bdev_file->private_data = BDEV_I(bdev_file->f_mapping->host);
+	mutex_unlock(&disk->open_mutex);
+}
+EXPORT_SYMBOL_GPL(bdev_yield_claim);
+
 /**
  * bdev_fput - yield claim to the block device and put the file
  * @bdev_file: open block device
@@ -1213,22 +1246,7 @@ void bdev_fput(struct file *bdev_file)
 	if (WARN_ON_ONCE(bdev_file->f_op != &def_blk_fops))
 		return;
 
-	if (bdev_file->private_data) {
-		struct block_device *bdev = file_bdev(bdev_file);
-		struct gendisk *disk = bdev->bd_disk;
-
-		mutex_lock(&disk->open_mutex);
-		bdev_yield_write_access(bdev_file);
-		bd_yield_claim(bdev_file);
-		/*
-		 * Tell release we already gave up our hold on the
-		 * device and if write restrictions are available that
-		 * we already gave up write access to the device.
-		 */
-		bdev_file->private_data = BDEV_I(bdev_file->f_mapping->host);
-		mutex_unlock(&disk->open_mutex);
-	}
-
+	bdev_yield_claim(bdev_file);
 	fput(bdev_file);
 }
 EXPORT_SYMBOL(bdev_fput);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index cf1951caadb2..9fc16e3c8075 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1832,6 +1832,7 @@ int bdev_thaw(struct block_device *bdev);
 int bdev_deny_freeze(struct block_device *bdev);
 void bdev_allow_freeze(struct block_device *bdev);
 void bdev_fput(struct file *bdev_file);
+void bdev_yield_claim(struct file *bdev_file);
 
 struct io_comp_batch {
 	struct rq_list req_list;

-- 
2.47.3


^ permalink raw reply related

* [PATCH v2 1/5] block: allow making a block device unfreezable
From: Christian Brauner @ 2026-06-16 11:58 UTC (permalink / raw)
  To: Chris Mason, Jens Axboe, David Sterba, Jan Kara
  Cc: Naohiro Aota, Josef Bacik, linux-btrfs, linux-block,
	linux-fsdevel, Christian Brauner (Amutable)
In-Reply-To: <20260616-work-super-freeze_deny_upstream-v2-0-b3567c7f994b@kernel.org>

Add bdev_deny_freeze() and bdev_allow_freeze(), modeled on
deny_write_access()/allow_write_access().  bd_fsfreeze_count becomes a
signed counter: > 0 counts active freezes, < 0 counts deniers, and the
two regimes are mutually exclusive.  bdev_freeze() refuses with -EBUSY
while a deny is held, and bdev_deny_freeze() refuses while the device is
frozen.

A filesystem that mutates a device's membership (a btrfs device add,
remove or replace) denies freezing on the device for the duration, so a
claim a freeze walk might act on is never added or torn down behind the
freezer's back.

The deny/allow helpers are a single atomic on bd_fsfreeze_count and take
no lock, so they can be called while holding s_umount without inverting
against bdev_freeze()'s bd_fsfreeze_mutex -> s_umount order.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 block/bdev.c              | 63 +++++++++++++++++++++++++++++++++++++++--------
 include/linux/blk_types.h |  2 +-
 include/linux/blkdev.h    |  2 ++
 3 files changed, 56 insertions(+), 11 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index bb0ffa3bb4df..a83a3809380c 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -304,7 +304,12 @@ int bdev_freeze(struct block_device *bdev)
 
 	mutex_lock(&bdev->bd_fsfreeze_mutex);
 
-	if (atomic_inc_return(&bdev->bd_fsfreeze_count) > 1) {
+	/* A device being removed from its filesystem refuses freezes. */
+	if (!atomic_inc_unless_negative(&bdev->bd_fsfreeze_count)) {
+		mutex_unlock(&bdev->bd_fsfreeze_mutex);
+		return -EBUSY;
+	}
+	if (atomic_read(&bdev->bd_fsfreeze_count) > 1) {
 		mutex_unlock(&bdev->bd_fsfreeze_mutex);
 		return 0;
 	}
@@ -340,18 +345,18 @@ int bdev_thaw(struct block_device *bdev)
 
 	mutex_lock(&bdev->bd_fsfreeze_mutex);
 
-	/*
-	 * If this returns < 0 it means that @bd_fsfreeze_count was
-	 * already 0 and no decrement was performed.
-	 */
-	nr_freeze = atomic_dec_if_positive(&bdev->bd_fsfreeze_count);
-	if (nr_freeze < 0)
+	/* <= 0: not frozen (0) or a freeze deny is held (< 0); leave it. */
+	nr_freeze = atomic_read(&bdev->bd_fsfreeze_count);
+	if (nr_freeze <= 0)
 		goto out;
 
 	error = 0;
-	if (nr_freeze > 0)
+	if (nr_freeze > 1) {
+		atomic_dec(&bdev->bd_fsfreeze_count);
 		goto out;
+	}
 
+	/* Keep the count positive across the thaw so a deny is refused. */
 	mutex_lock(&bdev->bd_holder_lock);
 	if (bdev->bd_holder_ops && bdev->bd_holder_ops->thaw) {
 		error = bdev->bd_holder_ops->thaw(bdev);
@@ -360,14 +365,52 @@ int bdev_thaw(struct block_device *bdev)
 		mutex_unlock(&bdev->bd_holder_lock);
 	}
 
-	if (error)
-		atomic_inc(&bdev->bd_fsfreeze_count);
+	if (!error)
+		atomic_dec(&bdev->bd_fsfreeze_count);
 out:
 	mutex_unlock(&bdev->bd_fsfreeze_mutex);
 	return error;
 }
 EXPORT_SYMBOL(bdev_thaw);
 
+/**
+ * bdev_deny_freeze - make a block device unfreezable
+ * @bdev: block device
+ *
+ * Reserve @bdev against bdev_freeze() the way deny_write_access() reserves a
+ * file against writers.  bd_fsfreeze_count is sign-encoded: > 0 counts active
+ * freezes, < 0 counts deniers, so a deny succeeds only while no freeze is in
+ * progress.  While held, bdev_freeze() returns -EBUSY.  Pair with
+ * bdev_allow_freeze().
+ *
+ * A filesystem removing, adding or replacing a member device denies freezes on
+ * it for the duration, so a claim a freeze walk might act on is never torn down
+ * behind the freezer's back.  The deny is device-scoped, not (device,
+ * superblock)-scoped: a device shared by several superblocks is refused for all
+ * of them.  No in-tree filesystem removes a shared claim from a live superblock.
+ *
+ * Return: 0, or -EBUSY if the device is currently frozen.
+ */
+int bdev_deny_freeze(struct block_device *bdev)
+{
+	return atomic_dec_unless_positive(&bdev->bd_fsfreeze_count) ? 0 : -EBUSY;
+}
+EXPORT_SYMBOL_GPL(bdev_deny_freeze);
+
+/**
+ * bdev_allow_freeze - allow freezing a block device again
+ * @bdev: block device
+ *
+ * Undo one bdev_deny_freeze().
+ */
+void bdev_allow_freeze(struct block_device *bdev)
+{
+	/* A deny must be held, i.e. the count must be negative. */
+	WARN_ON_ONCE(atomic_read(&bdev->bd_fsfreeze_count) >= 0);
+	atomic_inc(&bdev->bd_fsfreeze_count);
+}
+EXPORT_SYMBOL_GPL(bdev_allow_freeze);
+
 /*
  * pseudo-fs
  */
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 8808ee76e73c..5a725a0cd35f 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -66,7 +66,7 @@ struct block_device {
 	int			bd_holders;
 	struct kobject		*bd_holder_dir;
 
-	atomic_t		bd_fsfreeze_count; /* number of freeze requests */
+	atomic_t		bd_fsfreeze_count; /* >0 freeze requests, <0 freeze deniers */
 	struct mutex		bd_fsfreeze_mutex; /* serialize freeze/thaw */
 
 	struct partition_meta_info *bd_meta_info;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 890128cdea1c..cf1951caadb2 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1829,6 +1829,8 @@ static inline int early_lookup_bdev(const char *pathname, dev_t *dev)
 
 int bdev_freeze(struct block_device *bdev);
 int bdev_thaw(struct block_device *bdev);
+int bdev_deny_freeze(struct block_device *bdev);
+void bdev_allow_freeze(struct block_device *bdev);
 void bdev_fput(struct file *bdev_file);
 
 struct io_comp_batch {

-- 
2.47.3


^ permalink raw reply related

* [PATCH v2 0/5] block,btrfs: fix frozen-superblock strand on device add/remove/replace
From: Christian Brauner @ 2026-06-16 11:58 UTC (permalink / raw)
  To: Chris Mason, Jens Axboe, David Sterba, Jan Kara
  Cc: Naohiro Aota, Josef Bacik, linux-btrfs, linux-block,
	linux-fsdevel, Christian Brauner (Amutable)

This is another series of fixes that fell out of the device to
superblock hashtable work. These are all pre-existing bugs.

A block-device freeze that races a btrfs device membership change can leave
the whole filesystem stuck frozen, recoverable only with a manual FITHAW.

btrfs holds each of its devices open with the superblock as the block-device
holder.  bdev_freeze() - issued by "dmsetup suspend" or an LVM snapshot -
resolves that holder to freeze the filesystem, and bdev_thaw() ("dmsetup
resume") resolves it again to thaw.  If a freeze lands while btrfs is adding,
removing or replacing a device, it rides in on the device's holder link and
freezes the filesystem; the membership change then drops that link, so the
matching thaw can no longer find the superblock.  The filesystem stays frozen
with no way back short of FITHAW.

To reproduce on the remove path: build a two-device btrfs with one member
behind a dm-linear target, write enough data that removing that member
relocates for a few seconds, start "btrfs device remove" on it, and
"dmsetup suspend" the dm device while the removal is underway.  The suspend's
freeze blocks on the remove ioctl's write access and rides in as the ioctl
drops it; the removal then clears the device's holder link, so the matching
"dmsetup resume" can no longer reach the superblock.  On an unpatched kernel
the filesystem is left frozen and the next write hangs in D state until a
manual FITHAW (fsfreeze -u).

The fix lets a filesystem forbid freezing a device for the duration of a
membership change, modelled on deny_write_access()/allow_write_access().
bd_fsfreeze_count becomes signed: > 0 counts active freezes, < 0 counts deny
holders, and the two are mutually exclusive.  bdev_deny_freeze() reserves the
device (bdev_freeze() then returns -EBUSY) and bdev_allow_freeze() releases
it; both are a single lockless atomic, so a filesystem can deny under
s_umount without inverting against bdev_freeze()'s bd_fsfreeze_mutex.  btrfs
denies the device across each add, remove and replace, so a racing freeze is
refused instead of riding in, while a normal freeze of a settled member
still works.

To re-allow freezing safely on release, bdev_yield_claim() is split out of
bdev_fput(): the caller yields the holder while the device file is still
open, re-allows freezing on the now-holderless device, and only then closes
it. Re-allowing after the holder is gone avoids re-stranding on a racing
freeze; doing it while the file is still open keeps the block device alive
without referencing it after the final fput.

With the fix the racing suspend is refused with -EBUSY mid-removal and the
filesystem stays writable.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
Changes in v2:
- block: bdev_thaw() now keeps bd_fsfreeze_count positive across the thaw
  and only drops it to 0 on success, so a bdev_deny_freeze() racing the thaw
  is refused instead of slipping in on a transient 0 and corrupting the
  sign-encoded counter.
- block: bdev_allow_freeze() WARN_ON_ONCE()s an unbalanced call (Jan Kara).
- block: bdev_yield_claim() early-returns instead of wrapping its body in an
  if (Johannes Thumshirn).
- btrfs: btrfs_open_device_deny_freeze() opens the probe BLK_OPEN_WRITE so a
  read-only device is rejected at "device add"; the by-dev open of the
  holder skipped the read-only check the previous by-path open enforced.
- Reword the cover: FIFREEZE freezes the superblock, not the bare device.
- Link to v1: https://patch.msgid.link/20260615-work-super-freeze_deny_upstream-v1-0-a6c72b840e7d@kernel.org

---
Christian Brauner (5):
      block: allow making a block device unfreezable
      block: split bdev_yield_claim() out of bdev_fput()
      btrfs: deny freezing a device while it is being removed
      btrfs: deny freezing a device while it is being added
      btrfs: deny freezing devices undergoing a replace

 block/bdev.c              | 113 +++++++++++++++++++++++++++++++++++-----------
 fs/btrfs/dev-replace.c    |  65 +++++++++++++++++++++++---
 fs/btrfs/ioctl.c          |   4 +-
 fs/btrfs/volumes.c        |  84 +++++++++++++++++++++++++++++-----
 fs/btrfs/volumes.h        |   6 ++-
 include/linux/blk_types.h |   2 +-
 include/linux/blkdev.h    |   3 ++
 7 files changed, 229 insertions(+), 48 deletions(-)
---
base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
change-id: 20260615-work-super-freeze_deny_upstream-498ae64761a0

^ permalink raw reply

* Re: [PATCH v17 03/10] rust: implement `ForeignOwnable` for `Owned`
From: Gary Guo @ 2026-06-16 11:55 UTC (permalink / raw)
  To: Andreas Hindborg, Miguel Ojeda, Gary Guo, Björn Roy Baron,
	Benno Lossin, Alice Ryhl, Trevor Gross, Danilo Krummrich,
	Greg Kroah-Hartman, Dave Ertman, Ira Weiny, Leon Romanovsky,
	Paul Moore, Serge Hallyn, Rafael J. Wysocki, David Airlie,
	Simona Vetter, Alexander Viro, Christian Brauner, Jan Kara,
	Daniel Almeida, Viresh Kumar, Nishanth Menon, Stephen Boyd,
	Bjorn Helgaas, Krzysztof Wilczyński, Boqun Feng,
	Uladzislau Rezki, Lorenzo Stoakes, Vlastimil Babka,
	Liam R. Howlett, Igor Korotin, Pavel Tikhomirov
  Cc: linux-kernel, rust-for-linux, linux-block, linux-security-module,
	dri-devel, linux-fsdevel, linux-mm, linux-pm, linux-pci,
	driver-core
In-Reply-To: <20260604-unique-ref-v17-3-7b4c3d2930b9@kernel.org>

On Thu Jun 4, 2026 at 9:11 PM BST, Andreas Hindborg wrote:
> Implement `ForeignOwnable` for `Owned<T>`. This allows use of `Owned<T>` in
> places such as the `XArray`.
>
> Note that `T` does not need to implement `ForeignOwnable` for `Owned<T>` to
> implement `ForeignOwnable`.
>
> Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
> ---
>  rust/kernel/owned.rs | 46 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 46 insertions(+)
>
> diff --git a/rust/kernel/owned.rs b/rust/kernel/owned.rs
> index 456e239e906e..5eacdf327d12 100644
> --- a/rust/kernel/owned.rs
> +++ b/rust/kernel/owned.rs
> @@ -15,6 +15,8 @@
>      ptr::NonNull, //
>  };
>  
> +use kernel::types::ForeignOwnable;
> +
>  /// Types that specify their own way of performing allocation and destruction. Typically, this trait
>  /// is implemented on types from the C side.
>  ///
> @@ -108,6 +110,7 @@ pub trait Ownable {
>  ///
>  /// - Until `T::release` is called, this `Owned<T>` exclusively owns the underlying `T`.
>  /// - The `T` value is pinned.
> +#[repr(transparent)]

AFAICT this `#[repr(transparent)]` isn't actually needed.

>  pub struct Owned<T: Ownable> {
>      ptr: NonNull<T>,
>  }
> @@ -185,3 +188,46 @@ fn drop(&mut self) {
>          unsafe { T::release(self.ptr.as_mut()) };
>      }
>  }
> +
> +// SAFETY: We derive the pointer to `T` from a valid `T`, so the returned
> +// pointer satisfies the alignment requirements of `T`.
> +unsafe impl<T: Ownable + 'static> ForeignOwnable for Owned<T> {

You should drop the `'static` bound and put a `where` bound on the GAT below
instead. See how `Box` does it.

Best,
Gary

> +    const FOREIGN_ALIGN: usize = core::mem::align_of::<Owned<T>>();
> +
> +    type Borrowed<'a> = &'a T;
> +    type BorrowedMut<'a> = Pin<&'a mut T>;
> +
> +    #[inline]
> +    fn into_foreign(self) -> *mut kernel::ffi::c_void {
> +        let ptr = self.ptr.as_ptr().cast();
> +        core::mem::forget(self);
> +        ptr
> +    }
> +
> +    #[inline]
> +    unsafe fn from_foreign(ptr: *mut kernel::ffi::c_void) -> Self {
> +        Self {
> +            // SAFETY: By function safety contract, `ptr` came from
> +            // `into_foreign` and cannot be null.
> +            ptr: unsafe { NonNull::new_unchecked(ptr.cast()) },
> +        }
> +    }
> +
> +    #[inline]
> +    unsafe fn borrow<'a>(ptr: *mut kernel::ffi::c_void) -> Self::Borrowed<'a> {
> +        // SAFETY: By function safety requirements, `ptr` is valid for use as a
> +        // reference for `'a`.
> +        unsafe { &*ptr.cast() }
> +    }
> +
> +    #[inline]
> +    unsafe fn borrow_mut<'a>(ptr: *mut kernel::ffi::c_void) -> Self::BorrowedMut<'a> {
> +        // SAFETY: By function safety requirements, `ptr` is valid for use as a
> +        // unique reference for `'a`.
> +        let inner = unsafe { &mut *ptr.cast() };
> +
> +        // SAFETY: We never move out of inner, and we do not hand out mutable
> +        // references when `T: !Unpin`.
> +        unsafe { Pin::new_unchecked(inner) }
> +    }
> +}



^ permalink raw reply

* Re: [PATCH v17 02/10] rust: types: Add Ownable/Owned types
From: Alice Ryhl @ 2026-06-16 11:54 UTC (permalink / raw)
  To: Andreas Hindborg
  Cc: Miguel Ojeda, Gary Guo, Björn Roy Baron, Benno Lossin,
	Trevor Gross, Danilo Krummrich, Greg Kroah-Hartman, Dave Ertman,
	Ira Weiny, Leon Romanovsky, Paul Moore, Serge Hallyn,
	Rafael J. Wysocki, David Airlie, Simona Vetter, Alexander Viro,
	Christian Brauner, Jan Kara, Daniel Almeida, Viresh Kumar,
	Nishanth Menon, Stephen Boyd, Bjorn Helgaas,
	Krzysztof Wilczyński, Boqun Feng, Uladzislau Rezki,
	Lorenzo Stoakes, Vlastimil Babka, Liam R. Howlett, Igor Korotin,
	Pavel Tikhomirov, linux-kernel, rust-for-linux, linux-block,
	linux-security-module, dri-devel, linux-fsdevel, linux-mm,
	linux-pm, linux-pci, driver-core, Asahi Lina, Oliver Mangold
In-Reply-To: <20260604-unique-ref-v17-2-7b4c3d2930b9@kernel.org>

On Thu, Jun 04, 2026 at 10:11:14PM +0200, Andreas Hindborg wrote:
> From: Asahi Lina <lina+kernel@asahilina.net>
> 
> By analogy to `AlwaysRefCounted` and `ARef`, an `Ownable` type is a
> (typically C FFI) type that *may* be owned by Rust, but need not be. Unlike
> `AlwaysRefCounted`, this mechanism expects the reference to be unique
> within Rust, and does not allow cloning.
> 
> Conceptually, this is similar to a `KBox<T>`, except that it delegates
> resource management to the `T` instead of using a generic allocator.
> 
> [ om:
>   - Split code into separate file and `pub use` it from types.rs.
>   - Make from_raw() and into_raw() public.
>   - Remove OwnableMut, and make DerefMut dependent on Unpin instead.
>   - Usage example/doctest for Ownable/Owned.
>   - Fixes to documentation and commit message.
> ]
> 
> Link: https://lore.kernel.org/all/20250202-rust-page-v1-1-e3170d7fe55e@asahilina.net/
> Signed-off-by: Asahi Lina <lina+kernel@asahilina.net>
> Co-developed-by: Oliver Mangold <oliver.mangold@pm.me>
> Signed-off-by: Oliver Mangold <oliver.mangold@pm.me>
> Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
> Reviewed-by: Daniel Almeida <daniel.almeida@collabora.com>
> [ Andreas: Updated documentation, examples, and formatting. Change safety
>   requirements, safety comments. Use a reference for `release`. ]
> Reviewed-by: Gary Guo <gary@garyguo.net>
> Co-developed-by: Andreas Hindborg <a.hindborg@kernel.org>
> Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>

Overall looks good to me, but two nits below. With them fixed:

Reviewed-by: Alice Ryhl <aliceryhl@google.com>

> +pub trait Ownable {
> +    /// Tear down this `Ownable`.
> +    ///
> +    /// Implementers of `Ownable` can use this function to clean up the use of `Self`. This can
> +    /// include freeing the underlying object.
> +    ///
> +    /// # Safety
> +    ///
> +    /// Callers must ensure that the caller has exclusive ownership of `T`, and this ownership can
> +    /// be transferred to the `release` method.
> +    unsafe fn release(&mut self);

I'd make this take a raw pointer because the pointer can be freed during
the execution of release(), which references don't allow.

> diff --git a/rust/kernel/types.rs b/rust/kernel/types.rs
> index 4329d3c2c2e5..4aec7b699269 100644
> --- a/rust/kernel/types.rs
> +++ b/rust/kernel/types.rs
> @@ -11,6 +11,17 @@
>  };
>  use pin_init::{PinInit, Wrapper, Zeroable};
>  
> +pub use crate::{
> +    owned::{
> +        Ownable,
> +        Owned, //
> +    },
> +    sync::aref::{
> +        ARef,
> +        AlwaysRefCounted, //
> +    }, //
> +};

We removed the types::ARef re-export, so you shouldn't add it back.

Alice

^ permalink raw reply

* Re: [PATCH v17 01/10] rust: alloc: add `KBox::into_non_null`
From: Gary Guo @ 2026-06-16 11:52 UTC (permalink / raw)
  To: Andreas Hindborg, Miguel Ojeda, Gary Guo, Björn Roy Baron,
	Benno Lossin, Alice Ryhl, Trevor Gross, Danilo Krummrich,
	Greg Kroah-Hartman, Dave Ertman, Ira Weiny, Leon Romanovsky,
	Paul Moore, Serge Hallyn, Rafael J. Wysocki, David Airlie,
	Simona Vetter, Alexander Viro, Christian Brauner, Jan Kara,
	Daniel Almeida, Viresh Kumar, Nishanth Menon, Stephen Boyd,
	Bjorn Helgaas, Krzysztof Wilczyński, Boqun Feng,
	Uladzislau Rezki, Lorenzo Stoakes, Vlastimil Babka,
	Liam R. Howlett, Igor Korotin, Pavel Tikhomirov
  Cc: linux-kernel, rust-for-linux, linux-block, linux-security-module,
	dri-devel, linux-fsdevel, linux-mm, linux-pm, linux-pci,
	driver-core
In-Reply-To: <20260604-unique-ref-v17-1-7b4c3d2930b9@kernel.org>

On Thu Jun 4, 2026 at 9:11 PM BST, Andreas Hindborg wrote:
> Add a method to consume a `Box<T, A>` and return a `NonNull<T>`. This
> is a convenience wrapper around `Self::into_raw` for callers that need
> a `NonNull` pointer rather than a raw pointer.
> 
> Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>

Reviewed-by: Gary Guo <gary@garyguo.net>

> ---
>  rust/kernel/alloc/kbox.rs | 9 +++++++++
>  1 file changed, 9 insertions(+)


^ permalink raw reply

* Re: [PATCH v17 01/10] rust: alloc: add `KBox::into_non_null`
From: Alice Ryhl @ 2026-06-16 11:50 UTC (permalink / raw)
  To: Andreas Hindborg
  Cc: Miguel Ojeda, Gary Guo, Björn Roy Baron, Benno Lossin,
	Trevor Gross, Danilo Krummrich, Greg Kroah-Hartman, Dave Ertman,
	Ira Weiny, Leon Romanovsky, Paul Moore, Serge Hallyn,
	Rafael J. Wysocki, David Airlie, Simona Vetter, Alexander Viro,
	Christian Brauner, Jan Kara, Daniel Almeida, Viresh Kumar,
	Nishanth Menon, Stephen Boyd, Bjorn Helgaas,
	Krzysztof Wilczyński, Boqun Feng, Uladzislau Rezki,
	Lorenzo Stoakes, Vlastimil Babka, Liam R. Howlett, Igor Korotin,
	Pavel Tikhomirov, linux-kernel, rust-for-linux, linux-block,
	linux-security-module, dri-devel, linux-fsdevel, linux-mm,
	linux-pm, linux-pci, driver-core
In-Reply-To: <20260604-unique-ref-v17-1-7b4c3d2930b9@kernel.org>

On Thu, Jun 04, 2026 at 10:11:13PM +0200, Andreas Hindborg wrote:
> Add a method to consume a `Box<T, A>` and return a `NonNull<T>`. This
> is a convenience wrapper around `Self::into_raw` for callers that need
> a `NonNull` pointer rather than a raw pointer.
> 
> Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>

Reviewed-by: Alice Ryhl <aliceryhl@google.com>

^ permalink raw reply

* Re: [PATCH 1/1] block: partitions: bound sysv68 slice table count
From: Philippe De Muyter @ 2026-06-16 10:44 UTC (permalink / raw)
  To: Ren Wei
  Cc: linux-block, kees, axboe, objecting, akpm, yuantan098, zcliangcn,
	bird, zzhan461
In-Reply-To: <f16321f7378d613d81af13f288de82217fc7d934.1781036698.git.zzhan461@ucr.edu>

Hi Ren Wei,

On Thu, Jun 11, 2026 at 12:58:13AM +0800, Ren Wei wrote:
> From: Zhao Zhang <zzhan461@ucr.edu>
> 
> sysv68_partition() reads a single sector for the slice table, but it
> trusts ios_slccnt from disk and walks that many entries after skipping
> the synthetic whole-disk slice. A crafted image can set ios_slccnt
> larger than the 64 struct slice records that fit in one sector and
> trigger an out-of-bounds read while scanning partitions.
> 
> Limit the slice count to the number of records that fit in the sector
> returned by read_part_sector(), then drop the whole-disk entry only
> when the bounded count is non-zero.
> 
> Fixes: 19d0e8ce856a ("partition: add support for sysv68 partitions")
> Cc: stable@vger.kernel.org
> Reported-by: Yuan Tan <yuantan098@gmail.com>
> Reported-by: Zhengchuan Liang <zcliangcn@gmail.com>
> Reported-by: Xin Liu <bird@lzu.edu.cn>
> Assisted-by: Codex:GPT-5.4
> Signed-off-by: Zhao Zhang <zzhan461@ucr.edu>
> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
> ---
>  block/partitions/sysv68.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/block/partitions/sysv68.c b/block/partitions/sysv68.c
> index 470e0f9de7be..5110ed83c541 100644
> --- a/block/partitions/sysv68.c
> +++ b/block/partitions/sysv68.c
> @@ -48,7 +48,8 @@ struct slice {
>  
>  int sysv68_partition(struct parsed_partitions *state)
>  {
> -	int i, slices;
> +	sector_t slice_sector;
> +	unsigned int i, slices;
>  	int slot = 1;
>  	Sector sect;
>  	unsigned char *data;
> @@ -65,14 +66,16 @@ int sysv68_partition(struct parsed_partitions *state)
>  		return 0;
>  	}
>  	slices = be16_to_cpu(b->dk_ios.ios_slccnt);
> -	i = be32_to_cpu(b->dk_ios.ios_slcblk);
> +	slice_sector = be32_to_cpu(b->dk_ios.ios_slcblk);
>  	put_dev_sector(sect);
>  
> -	data = read_part_sector(state, i, &sect);
> +	data = read_part_sector(state, slice_sector, &sect);
>  	if (!data)
>  		return -1;
>  
> -	slices -= 1; /* last slice is the whole disk */
> +	slices = min_t(unsigned int, slices, SECTOR_SIZE / sizeof(*slice));
> +	if (slices)
> +		slices -= 1; /* last slice is the whole disk */
>  	seq_buf_printf(&state->pp_buf, "sysV68: %s(s%u)", state->name, slices);
>  	slice = (struct slice *)data;
>  	for (i = 0; i < slices; i++, slice++) {
> -- 
> 2.47.3

That does the job.  IIRC 'last slice' had number 7, so ios_slccnt had to be
8.  I do not have such a partition handy at the moment, so

Reviewed-by: Philippe De Muyter <phdm@macqel.be>

Best regards

Philippe

^ permalink raw reply

* [PATCH v4 15/30] iov_iter: Add a segmented queue of bio_vec[]
From: David Howells @ 2026-06-16 10:08 UTC (permalink / raw)
  To: Christian Brauner, Matthew Wilcox, Christoph Hellwig
  Cc: David Howells, Paulo Alcantara, Jens Axboe, Leon Romanovsky,
	Steve French, ChenXiaoSong, Marc Dionne, Eric Van Hensbergen,
	Dominique Martinet, Ilya Dryomov, netfs, linux-afs, linux-cifs,
	linux-nfs, ceph-devel, v9fs, linux-erofs, linux-fsdevel,
	linux-kernel, linux-block
In-Reply-To: <20260616100821.2062304-1-dhowells@redhat.com>

Add the concept of a segmented queue of bio_vec[] arrays.  This allows an
indefinite quantity of elements to be handled and allows things like
network filesystems and crypto drivers to glue bits on the ends without
having to reallocate the array.

The bvecq struct that defines each segment also carries capacity/usage
information along with flags indicating whether the constituent memory
regions need freeing or unpinning and the file position of the first
element in a segment.  The bvecq structs are refcounted to allow a queue to
be extracted in batches and split between a number of subrequests.

The bvecq can have the bio_vec[] it manages allocated together with it, but
this is not required.  A flag indicates when this is the case, as comparing
->bv to ->__bv is not sufficient to detect it.

Add an iterator type ITER_BVECQ for it.  This is intended to replace
ITER_FOLIOQ (and ITER_XARRAY).

Note that the prev pointer is only really needed for iov_iter_revert() and
could be dispensed with if struct iov_iter contained the head information
as well as the current point.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: linux-block@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
 include/linux/bvecq.h      |  56 +++++++
 include/linux/iov_iter.h   |  74 ++++++++-
 include/linux/uio.h        |  11 ++
 lib/iov_iter.c             | 324 ++++++++++++++++++++++++++++++++++++-
 lib/scatterlist.c          |  67 +++++++-
 lib/tests/kunit_iov_iter.c | 262 ++++++++++++++++++++++++++++++
 6 files changed, 788 insertions(+), 6 deletions(-)
 create mode 100644 include/linux/bvecq.h

diff --git a/include/linux/bvecq.h b/include/linux/bvecq.h
new file mode 100644
index 000000000000..15f16f905877
--- /dev/null
+++ b/include/linux/bvecq.h
@@ -0,0 +1,56 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Implementation of a segmented queue of bio_vec[].
+ *
+ * Copyright (C) 2026 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#ifndef _LINUX_BVECQ_H
+#define _LINUX_BVECQ_H
+
+#include <linux/bvec.h>
+
+/*
+ * The type of memory retention used by the elements in bvecq->bv[] and how to
+ * clean it up.
+ */
+enum bvecq_mem {
+	BVECQ_MEM_EXTERNAL,	/* Externally retained memory - no freeing */
+	BVECQ_MEM_PAGECACHE,	/* Ref'd pagecache pages - must put */
+	BVECQ_MEM_GUP,		/* Pinned memory from get_user_pages() - unpin */
+	BVECQ_MEM_ALLOCED,	/* Memory alloc'd by bvecq - can be freed/pooled */
+} __mode(byte);
+
+/*
+ * Segmented bio_vec queue.
+ *
+ * These can be linked together to form messages of indefinite length and
+ * iterated over with an ITER_BVECQ iterator.  The list is non-circular; next
+ * and prev are NULL at the ends.
+ *
+ * The bv pointer points to the bio_vec array; this may be __bv if allocated
+ * together.  The caller is responsible for determining whether or not this is
+ * the case as the array pointed to by bv may follow on directly from the
+ * bvecq by accident of allocation (ie. ->bv == ->__bv is *not* sufficient to
+ * determine this).
+ *
+ * The file position and discontiguity flag allow non-contiguous data sets to
+ * be chained together, but still teased apart without the need to convert the
+ * info in the bio_vec back into a folio pointer.
+ */
+struct bvecq {
+	struct bvecq	*next;		/* Next bvec in the list or NULL */
+	struct bvecq	*prev;		/* Prev bvec in the list or NULL */
+	unsigned long long fpos;	/* File position */
+	refcount_t	ref;
+	u32		priv;		/* Private data */
+	u16		nr_slots;	/* Number of elements in bv[] used */
+	u16		max_slots;	/* Number of elements allocated in bv[] */
+	enum bvecq_mem	mem_type:3;	/* What sort of memory and how to free it */
+	bool		inline_bv:1;	/* T if __bv[] is being used */
+	bool		discontig:1;	/* T if not contiguous with previous bvecq */
+	struct bio_vec	*bv;		/* Pointer to array of page fragments */
+	struct bio_vec	__bv[];		/* Default array (if ->inline_bv) */
+};
+
+#endif /* _LINUX_BVECQ_H */
diff --git a/include/linux/iov_iter.h b/include/linux/iov_iter.h
index f9a17fbbd398..04b8a6d943fa 100644
--- a/include/linux/iov_iter.h
+++ b/include/linux/iov_iter.h
@@ -10,6 +10,7 @@
 
 #include <linux/uio.h>
 #include <linux/bvec.h>
+#include <linux/bvecq.h>
 #include <linux/folio_queue.h>
 
 typedef size_t (*iov_step_f)(void *iter_base, size_t progress, size_t len,
@@ -141,6 +142,71 @@ size_t iterate_bvec(struct iov_iter *iter, size_t len, void *priv, void *priv2,
 	return progress;
 }
 
+/*
+ * Handle ITER_BVECQ.
+ */
+static __always_inline
+size_t iterate_bvecq(struct iov_iter *iter, size_t len, void *priv, void *priv2,
+		     iov_step_f step)
+{
+	const struct bvecq *bq = iter->bvecq;
+	unsigned int slot = iter->bvecq_slot;
+	size_t progress = 0, skip = iter->iov_offset;
+
+	do {
+		const struct bio_vec *bvec;
+		struct page *page;
+		size_t poff, plen;
+		void *base;
+
+		if (slot >= bq->nr_slots) {
+			if (!bq->next)
+				break;
+			bq = bq->next;
+			slot = 0;
+			continue;
+		}
+
+		bvec = &bq->bv[slot];
+		/*
+		 * The caller must ensure that a slot with bv_len>0 has a valid
+		 * bv_page.
+		 */
+		page = bvec->bv_page + (bvec->bv_offset + skip) / PAGE_SIZE;
+		poff = (bvec->bv_offset + skip) % PAGE_SIZE;
+		plen = umin(bvec->bv_len - skip, len);
+
+		while (plen > 0) {
+			size_t part, remain, consumed;
+
+			part = umin(plen, PAGE_SIZE - poff);
+			base = kmap_local_page(page) + poff;
+			remain = step(base, progress, part, priv, priv2);
+			kunmap_local(base);
+
+			consumed = part - remain;
+			progress += consumed;
+			skip += consumed;
+			len -= consumed;
+			if (!len || remain)
+				goto stop;
+			page++;
+			poff = 0;
+			plen -= consumed;
+		}
+
+		skip = 0;
+		slot++;
+	} while (len);
+
+stop:
+	iter->bvecq_slot = slot;
+	iter->bvecq = bq;
+	iter->iov_offset = skip;
+	iter->count -= progress;
+	return progress;
+}
+
 /*
  * Handle ITER_FOLIOQ.
  */
@@ -306,6 +372,8 @@ size_t iterate_and_advance2(struct iov_iter *iter, size_t len, void *priv,
 		return iterate_bvec(iter, len, priv, priv2, step);
 	if (iov_iter_is_kvec(iter))
 		return iterate_kvec(iter, len, priv, priv2, step);
+	if (iov_iter_is_bvecq(iter))
+		return iterate_bvecq(iter, len, priv, priv2, step);
 	if (iov_iter_is_folioq(iter))
 		return iterate_folioq(iter, len, priv, priv2, step);
 	if (iov_iter_is_xarray(iter))
@@ -342,8 +410,8 @@ size_t iterate_and_advance(struct iov_iter *iter, size_t len, void *priv,
  * buffer is presented in segments, which for kernel iteration are broken up by
  * physical pages and mapped, with the mapped address being presented.
  *
- * [!] Note This will only handle BVEC, KVEC, FOLIOQ, XARRAY and DISCARD-type
- * iterators; it will not handle UBUF or IOVEC-type iterators.
+ * [!] Note This will only handle BVEC, KVEC, BVECQ, FOLIOQ, XARRAY and
+ * DISCARD-type iterators; it will not handle UBUF or IOVEC-type iterators.
  *
  * A step functions, @step, must be provided, one for handling mapped kernel
  * addresses and the other is given user addresses which have the potential to
@@ -370,6 +438,8 @@ size_t iterate_and_advance_kernel(struct iov_iter *iter, size_t len, void *priv,
 		return iterate_bvec(iter, len, priv, priv2, step);
 	if (iov_iter_is_kvec(iter))
 		return iterate_kvec(iter, len, priv, priv2, step);
+	if (iov_iter_is_bvecq(iter))
+		return iterate_bvecq(iter, len, priv, priv2, step);
 	if (iov_iter_is_folioq(iter))
 		return iterate_folioq(iter, len, priv, priv2, step);
 	if (iov_iter_is_xarray(iter))
diff --git a/include/linux/uio.h b/include/linux/uio.h
index a9bc5b3067e3..f7cfa6ea8213 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -26,6 +26,7 @@ enum iter_type {
 	ITER_IOVEC,
 	ITER_BVEC,
 	ITER_KVEC,
+	ITER_BVECQ,
 	ITER_FOLIOQ,
 	ITER_XARRAY,
 	ITER_DISCARD,
@@ -68,6 +69,7 @@ struct iov_iter {
 				const struct iovec *__iov;
 				const struct kvec *kvec;
 				const struct bio_vec *bvec;
+				const struct bvecq *bvecq;
 				const struct folio_queue *folioq;
 				struct xarray *xarray;
 				void __user *ubuf;
@@ -77,6 +79,7 @@ struct iov_iter {
 	};
 	union {
 		unsigned long nr_segs;
+		u16 bvecq_slot;
 		u8 folioq_slot;
 		loff_t xarray_start;
 	};
@@ -145,6 +148,11 @@ static inline bool iov_iter_is_discard(const struct iov_iter *i)
 	return iov_iter_type(i) == ITER_DISCARD;
 }
 
+static inline bool iov_iter_is_bvecq(const struct iov_iter *i)
+{
+	return iov_iter_type(i) == ITER_BVECQ;
+}
+
 static inline bool iov_iter_is_folioq(const struct iov_iter *i)
 {
 	return iov_iter_type(i) == ITER_FOLIOQ;
@@ -295,6 +303,9 @@ void iov_iter_kvec(struct iov_iter *i, unsigned int direction, const struct kvec
 void iov_iter_bvec(struct iov_iter *i, unsigned int direction, const struct bio_vec *bvec,
 			unsigned long nr_segs, size_t count);
 void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count);
+void iov_iter_bvec_queue(struct iov_iter *i, unsigned int direction,
+			 const struct bvecq *bvecq,
+			 unsigned int first_slot, unsigned int offset, size_t count);
 void iov_iter_folio_queue(struct iov_iter *i, unsigned int direction,
 			  const struct folio_queue *folioq,
 			  unsigned int first_slot, unsigned int offset, size_t count);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 49bca2c2f019..205d0da47b12 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -538,6 +538,40 @@ static void iov_iter_iovec_advance(struct iov_iter *i, size_t size)
 	i->__iov = iov;
 }
 
+static void iov_iter_bvecq_advance(struct iov_iter *i, size_t by)
+{
+	const struct bvecq *bq = i->bvecq;
+	unsigned int slot = i->bvecq_slot;
+
+	if (!i->count)
+		return;
+	i->count -= by;
+
+	by += i->iov_offset; /* From beginning of current segment. */
+	do {
+		size_t len;
+
+		if (slot >= bq->nr_slots) {
+			if (!bq->next)
+				break;
+			bq = bq->next;
+			slot = 0;
+			continue;
+		}
+
+		len = bq->bv[slot].bv_len;
+
+		if (likely(by < len))
+			break;
+		by -= len;
+		slot++;
+	} while (by);
+
+	i->iov_offset = by;
+	i->bvecq_slot = slot;
+	i->bvecq = bq;
+}
+
 static void iov_iter_folioq_advance(struct iov_iter *i, size_t size)
 {
 	const struct folio_queue *folioq = i->folioq;
@@ -583,6 +617,8 @@ void iov_iter_advance(struct iov_iter *i, size_t size)
 		iov_iter_iovec_advance(i, size);
 	} else if (iov_iter_is_bvec(i)) {
 		iov_iter_bvec_advance(i, size);
+	} else if (iov_iter_is_bvecq(i)) {
+		iov_iter_bvecq_advance(i, size);
 	} else if (iov_iter_is_folioq(i)) {
 		iov_iter_folioq_advance(i, size);
 	} else if (iov_iter_is_discard(i)) {
@@ -591,6 +627,33 @@ void iov_iter_advance(struct iov_iter *i, size_t size)
 }
 EXPORT_SYMBOL(iov_iter_advance);
 
+static void iov_iter_bvecq_revert(struct iov_iter *i, size_t unroll)
+{
+	const struct bvecq *bq = i->bvecq;
+	unsigned int slot = i->bvecq_slot;
+
+	for (;;) {
+		size_t len;
+
+		if (slot == 0) {
+			bq = bq->prev;
+			slot = bq->nr_slots;
+			continue;
+		}
+		slot--;
+
+		len = bq->bv[slot].bv_len;
+		if (unroll <= len) {
+			i->iov_offset = len - unroll;
+			break;
+		}
+		unroll -= len;
+	}
+
+	i->bvecq_slot = slot;
+	i->bvecq = bq;
+}
+
 static void iov_iter_folioq_revert(struct iov_iter *i, size_t unroll)
 {
 	const struct folio_queue *folioq = i->folioq;
@@ -648,6 +711,9 @@ void iov_iter_revert(struct iov_iter *i, size_t unroll)
 			}
 			unroll -= n;
 		}
+	} else if (iov_iter_is_bvecq(i)) {
+		i->iov_offset = 0;
+		iov_iter_bvecq_revert(i, unroll);
 	} else if (iov_iter_is_folioq(i)) {
 		i->iov_offset = 0;
 		iov_iter_folioq_revert(i, unroll);
@@ -678,9 +744,24 @@ size_t iov_iter_single_seg_count(const struct iov_iter *i)
 		if (iov_iter_is_bvec(i))
 			return min(i->count, i->bvec->bv_len - i->iov_offset);
 	}
+	if (!i->count)
+		return 0;
+	if (unlikely(iov_iter_is_bvecq(i))) {
+		const struct bvecq *bq = i->bvecq;
+		unsigned int slot = i->bvecq_slot;
+		size_t offset = i->iov_offset;
+
+		while (slot >= bq->nr_slots) {
+			bq = bq->next;
+			if (!bq)
+				return 0;
+			slot = 0;
+			offset = 0;
+		}
+		return umin(i->count, bq->bv[slot].bv_len - offset);
+	}
 	if (unlikely(iov_iter_is_folioq(i)))
-		return !i->count ? 0 :
-			umin(folioq_folio_size(i->folioq, i->folioq_slot), i->count);
+		return umin(folioq_folio_size(i->folioq, i->folioq_slot), i->count);
 	return i->count;
 }
 EXPORT_SYMBOL(iov_iter_single_seg_count);
@@ -717,6 +798,35 @@ void iov_iter_bvec(struct iov_iter *i, unsigned int direction,
 }
 EXPORT_SYMBOL(iov_iter_bvec);
 
+/**
+ * iov_iter_bvec_queue - Initialise an I/O iterator to use a segmented bvec queue
+ * @i: The iterator to initialise.
+ * @direction: The direction of the transfer.
+ * @bvecq: The starting point in the bvec queue.
+ * @first_slot: The first slot in the bvec queue to use
+ * @offset: The offset into the bvec in the first slot to start at
+ * @count: The size of the I/O buffer in bytes.
+ *
+ * Set up an I/O iterator to either draw data out of the buffers attached to an
+ * inode or to inject data into those buffers.  The pages *must* be prevented
+ * from evaporation by the caller.
+ */
+void iov_iter_bvec_queue(struct iov_iter *i, unsigned int direction,
+			 const struct bvecq *bvecq, unsigned int first_slot,
+			 unsigned int offset, size_t count)
+{
+	WARN_ON(direction & ~(READ | WRITE));
+	*i = (struct iov_iter) {
+		.iter_type	= ITER_BVECQ,
+		.data_source	= direction,
+		.bvecq		= bvecq,
+		.bvecq_slot	= first_slot,
+		.count		= count,
+		.iov_offset	= offset,
+	};
+}
+EXPORT_SYMBOL(iov_iter_bvec_queue);
+
 /**
  * iov_iter_folio_queue - Initialise an I/O iterator to use the folios in a folio queue
  * @i: The iterator to initialise.
@@ -839,6 +949,37 @@ static unsigned long iov_iter_alignment_bvec(const struct iov_iter *i)
 	return res;
 }
 
+static unsigned long iov_iter_alignment_bvecq(const struct iov_iter *iter)
+{
+	const struct bvecq *bq;
+	unsigned long res = 0;
+	unsigned int slot = iter->bvecq_slot;
+	size_t skip = iter->iov_offset;
+	size_t size = iter->count;
+
+	if (!size)
+		return res;
+
+	for (bq = iter->bvecq; bq; bq = bq->next) {
+		for (; slot < bq->nr_slots; slot++) {
+			const struct bio_vec *bvec = &bq->bv[slot];
+			size_t part = umin(bvec->bv_len - skip, size);
+
+			res |= bvec->bv_offset + skip;
+			res |= part;
+
+			size -= part;
+			if (size == 0)
+				return res;
+			skip = 0;
+		}
+
+		slot = 0;
+	}
+
+	return res;
+}
+
 unsigned long iov_iter_alignment(const struct iov_iter *i)
 {
 	if (likely(iter_is_ubuf(i))) {
@@ -854,6 +995,8 @@ unsigned long iov_iter_alignment(const struct iov_iter *i)
 
 	if (iov_iter_is_bvec(i))
 		return iov_iter_alignment_bvec(i);
+	if (iov_iter_is_bvecq(i))
+		return iov_iter_alignment_bvecq(i);
 
 	/* With both xarray and folioq types, we're dealing with whole folios. */
 	if (iov_iter_is_folioq(i))
@@ -1066,6 +1209,36 @@ static int bvec_npages(const struct iov_iter *i, int maxpages)
 	return npages;
 }
 
+static size_t iov_npages_bvecq(const struct iov_iter *iter, size_t maxpages)
+{
+	const struct bvecq *bq;
+	unsigned int slot = iter->bvecq_slot;
+	size_t npages = 0;
+	size_t skip = iter->iov_offset;
+	size_t size = iter->count;
+
+	for (bq = iter->bvecq; bq; bq = bq->next) {
+		for (; slot < bq->nr_slots; slot++) {
+			const struct bio_vec *bvec = &bq->bv[slot];
+			size_t offs = (bvec->bv_offset + skip) % PAGE_SIZE;
+			size_t part = umin(bvec->bv_len - skip, size);
+
+			npages += DIV_ROUND_UP(offs + part, PAGE_SIZE);
+			if (npages >= maxpages)
+				goto out;
+
+			size -= part;
+			if (!size)
+				goto out;
+			skip = 0;
+		}
+
+		slot = 0;
+	}
+out:
+	return umin(npages, maxpages);
+}
+
 int iov_iter_npages(const struct iov_iter *i, int maxpages)
 {
 	if (unlikely(!i->count))
@@ -1080,6 +1253,8 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
 		return iov_npages(i, maxpages);
 	if (iov_iter_is_bvec(i))
 		return bvec_npages(i, maxpages);
+	if (iov_iter_is_bvecq(i))
+		return iov_npages_bvecq(i, maxpages);
 	if (iov_iter_is_folioq(i)) {
 		unsigned offset = i->iov_offset % PAGE_SIZE;
 		int npages = DIV_ROUND_UP(offset + i->count, PAGE_SIZE);
@@ -1366,6 +1541,147 @@ void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
 	i->nr_segs = state->nr_segs;
 }
 
+/*
+ * Count the number of virtually contiguous pages coming up next in an
+ * ITER_BVECQ iterator, up to the specified maxima.
+ */
+static unsigned int iter_count_bvecq_pages(const struct iov_iter *iter,
+					   size_t maxsize,
+					   unsigned int maxpages)
+{
+	const struct bvecq *bvecq = iter->bvecq;
+	unsigned int slot = iter->bvecq_slot;
+	ssize_t remain = umin(maxsize, iter->count);
+	size_t count = 0, offset = iter->iov_offset;
+
+	do {
+		const struct bio_vec *bv;
+		size_t boff, blen;
+
+		if (slot >= bvecq->nr_slots) {
+			if (!bvecq->next) {
+				WARN_ON_ONCE(remain > 0);
+				break;
+			}
+			bvecq = bvecq->next;
+			slot = 0;
+			offset = 0;
+			continue;
+		}
+
+		bv = &bvecq->bv[slot++];
+		boff = bv->bv_offset;
+		blen = bv->bv_len;
+
+		if (unlikely(!bv->bv_page)) {
+			if (blen && count > 0)
+				break;
+			continue;
+		}
+		if (!PAGE_ALIGNED(boff) && count > 0)
+			break;
+
+		boff += offset;
+		blen -= offset;
+		offset = 0;
+		if (!blen)
+			continue;
+
+		blen = umin(blen, remain);
+		remain -= blen;
+		blen += offset_in_page(boff);
+		count += DIV_ROUND_UP(blen, PAGE_SIZE);
+
+		if (!PAGE_ALIGNED(blen))
+			break;
+	} while (remain > 0 && count < maxpages);
+
+	return umin(count, maxpages);
+}
+
+/*
+ * Extract a list of virtually contiguous pages from an ITER_BVECQ iterator.
+ * This does not get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_bvecq_pages(struct iov_iter *iter,
+					    struct page ***pages, size_t maxsize,
+					    unsigned int maxpages,
+					    iov_iter_extraction_t extraction_flags,
+					    size_t *offset0)
+{
+	const struct bvecq *bvecq;
+	struct page **p;
+	unsigned int slot, nr = 0;
+	size_t extracted = 0, offset;
+
+	/* Count the next run of virtually contiguous pages. */
+	maxpages = iter_count_bvecq_pages(iter, maxsize, maxpages);
+
+	if (!*pages) {
+		*pages = kvmalloc_array(maxpages, sizeof(struct page *), GFP_KERNEL);
+		if (!*pages)
+			return -ENOMEM;
+	}
+
+	p = *pages;
+
+	/* Now transcribe the page pointers. */
+	extracted = 0;
+	bvecq = iter->bvecq;
+	offset = iter->iov_offset;
+	slot = iter->bvecq_slot;
+
+	do {
+		const struct bio_vec *bv;
+		size_t boff, blen;
+
+		if (slot >= bvecq->nr_slots) {
+			if (!bvecq->next) {
+				WARN_ON_ONCE(extracted < iter->count);
+				break;
+			}
+			bvecq = bvecq->next;
+			slot = 0;
+			offset = 0;
+			continue;
+		}
+
+		bv = &bvecq->bv[slot];
+		boff = bv->bv_offset;
+		blen = bv->bv_len;
+
+		if (!bv->bv_page)
+			blen = 0;
+
+		if (offset < blen) {
+			size_t part = umin(maxsize - extracted, blen - offset);
+			size_t poff = (boff + offset) % PAGE_SIZE;
+			size_t pix = (boff + offset) / PAGE_SIZE;
+
+			if (poff + part > PAGE_SIZE)
+				part = PAGE_SIZE - poff;
+
+			if (!extracted)
+				*offset0 = poff;
+
+			p[nr++] = bv->bv_page + pix;
+			offset += part;
+			extracted += part;
+		}
+
+		if (offset >= blen) {
+			offset = 0;
+			slot++;
+		}
+	} while (nr < maxpages && extracted < maxsize);
+
+	iter->bvecq = bvecq;
+	iter->bvecq_slot = slot;
+	iter->iov_offset = offset;
+	iter->count -= extracted;
+	return extracted;
+}
+
 /*
  * Extract a list of contiguous pages from an ITER_FOLIOQ iterator.  This does
  * not get references on the pages, nor does it get a pin on them.
@@ -1713,6 +2029,10 @@ ssize_t iov_iter_extract_pages(struct iov_iter *i,
 		return iov_iter_extract_bvec_pages(i, pages, maxsize,
 						   maxpages, extraction_flags,
 						   offset0);
+	if (iov_iter_is_bvecq(i))
+		return iov_iter_extract_bvecq_pages(i, pages, maxsize,
+						    maxpages, extraction_flags,
+						    offset0);
 	if (iov_iter_is_folioq(i))
 		return iov_iter_extract_folioq_pages(i, pages, maxsize,
 						     maxpages, extraction_flags,
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 6ea40d2e6247..23e5a180103b 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -10,6 +10,7 @@
 #include <linux/highmem.h>
 #include <linux/kmemleak.h>
 #include <linux/bvec.h>
+#include <linux/bvecq.h>
 #include <linux/uio.h>
 #include <linux/folio_queue.h>
 
@@ -1267,6 +1268,65 @@ static ssize_t extract_kvec_to_sg(struct iov_iter *iter,
 	return ret;
 }
 
+/*
+ * Extract up to sg_max segments from a BVECQ-type iterator and add them to
+ * the scatterlist.  The pages are not pinned.
+ */
+static ssize_t extract_bvecq_to_sg(struct iov_iter *iter,
+				   ssize_t maxsize,
+				   struct sg_table *sgtable,
+				   unsigned int sg_max,
+				   iov_iter_extraction_t extraction_flags)
+{
+	const struct bvecq *bvecq = iter->bvecq;
+	struct scatterlist *sg = sgtable->sgl + sgtable->nents;
+	unsigned int slot = iter->bvecq_slot;
+	ssize_t ret = 0;
+	size_t offset = iter->iov_offset;
+
+	maxsize = umin(maxsize, iov_iter_count(iter));
+
+	while (sg_max > 0 && ret < maxsize) {
+		const struct bio_vec *bv;
+		size_t blen, part;
+
+		if (slot >= bvecq->nr_slots) {
+			if (!bvecq->next) {
+				WARN_ON_ONCE(ret < iter->count);
+				break;
+			}
+			bvecq = bvecq->next;
+			slot = 0;
+			offset = 0;
+			continue;
+		}
+
+		bv = &bvecq->bv[slot];
+		blen = bv->bv_len;
+
+		if (offset >= blen) {
+			offset = 0;
+			slot++;
+			continue;
+		}
+
+		part = umin(maxsize - ret, blen - offset);
+
+		sg_set_page(sg, bv->bv_page, part, bv->bv_offset + offset);
+		sgtable->nents++;
+		sg++;
+		sg_max--;
+		offset += part;
+		ret += part;
+	}
+
+	iter->bvecq = bvecq;
+	iter->bvecq_slot = slot;
+	iter->iov_offset = offset;
+	iter->count -= ret;
+	return ret;
+}
+
 /*
  * Extract up to sg_max folios from an FOLIOQ-type iterator and add them to
  * the scatterlist.  The pages are not pinned.
@@ -1391,8 +1451,8 @@ static ssize_t extract_xarray_to_sg(struct iov_iter *iter,
  * addition of @sg_max elements.
  *
  * The pages referred to by UBUF- and IOVEC-type iterators are extracted and
- * pinned; BVEC-, KVEC-, FOLIOQ- and XARRAY-type are extracted but aren't
- * pinned; DISCARD-type is not supported.
+ * pinned; BVEC-, BVECQ-, KVEC-, FOLIOQ- and XARRAY-type are extracted but
+ * aren't pinned; DISCARD-type is not supported.
  *
  * No end mark is placed on the scatterlist; that's left to the caller.
  *
@@ -1424,6 +1484,9 @@ ssize_t extract_iter_to_sg(struct iov_iter *iter, size_t maxsize,
 	case ITER_KVEC:
 		return extract_kvec_to_sg(iter, maxsize, sgtable, sg_max,
 					  extraction_flags);
+	case ITER_BVECQ:
+		return extract_bvecq_to_sg(iter, maxsize, sgtable, sg_max,
+					   extraction_flags);
 	case ITER_FOLIOQ:
 		return extract_folioq_to_sg(iter, maxsize, sgtable, sg_max,
 					    extraction_flags);
diff --git a/lib/tests/kunit_iov_iter.c b/lib/tests/kunit_iov_iter.c
index e7e154f94f66..df3e78e7c7aa 100644
--- a/lib/tests/kunit_iov_iter.c
+++ b/lib/tests/kunit_iov_iter.c
@@ -12,6 +12,7 @@
 #include <linux/mm.h>
 #include <linux/uio.h>
 #include <linux/bvec.h>
+#include <linux/bvecq.h>
 #include <linux/folio_queue.h>
 #include <linux/scatterlist.h>
 #include <linux/minmax.h>
@@ -544,6 +545,185 @@ static void __init iov_kunit_copy_from_folioq(struct kunit *test)
 	KUNIT_SUCCEED(test);
 }
 
+static void iov_kunit_destroy_bvecq(void *data)
+{
+	struct bvecq *bq, *next;
+
+	for (bq = data; bq; bq = next) {
+		next = bq->next;
+		for (int i = 0; i < bq->nr_slots; i++)
+			if (bq->bv[i].bv_page)
+				put_page(bq->bv[i].bv_page);
+		kfree(bq);
+	}
+}
+
+static struct bvecq *iov_kunit_alloc_bvecq(struct kunit *test, unsigned int max_slots)
+{
+	struct bvecq *bq;
+
+	bq = kzalloc(struct_size(bq, __bv, max_slots), GFP_KERNEL);
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, bq);
+	bq->max_slots = max_slots;
+	bq->bv = bq->__bv;
+	bq->inline_bv = true;
+	return bq;
+}
+
+static struct bvecq *iov_kunit_create_bvecq(struct kunit *test, unsigned int max_slots)
+{
+	struct bvecq *bq;
+
+	bq = iov_kunit_alloc_bvecq(test, max_slots);
+	kunit_add_action_or_reset(test, iov_kunit_destroy_bvecq, bq);
+	return bq;
+}
+
+static void __init iov_kunit_load_bvecq(struct kunit *test,
+					struct iov_iter *iter, int dir,
+					struct bvecq *bq_head,
+					struct page **pages, size_t npages)
+{
+	struct bvecq *bq = bq_head;
+	size_t size = 0;
+
+	for (int i = 0; i < npages; i++) {
+		if (bq->nr_slots >= bq->max_slots) {
+			bq->next = iov_kunit_alloc_bvecq(test, 13);
+			bq->next->prev = bq;
+			bq = bq->next;
+		}
+		bvec_set_page(&bq->bv[bq->nr_slots], pages[i], PAGE_SIZE, 0);
+		bq->nr_slots++;
+		size += PAGE_SIZE;
+	}
+	iov_iter_bvec_queue(iter, dir, bq_head, 0, 0, size);
+}
+
+/*
+ * Test copying to an ITER_BVECQ-type iterator.
+ */
+static void __init iov_kunit_copy_to_bvecq(struct kunit *test)
+{
+	const struct kvec_test_range *pr;
+	struct iov_iter iter;
+	struct bvecq *bq;
+	struct page **spages, **bpages;
+	u8 *scratch, *buffer;
+	size_t bufsize, npages, size, copied;
+	int i, patt;
+
+	bufsize = 0x100000;
+	npages = bufsize / PAGE_SIZE;
+
+	bq = iov_kunit_create_bvecq(test, 13);
+
+	scratch = iov_kunit_create_buffer(test, &spages, npages);
+	for (i = 0; i < bufsize; i++)
+		scratch[i] = pattern(i);
+
+	buffer = iov_kunit_create_buffer(test, &bpages, npages);
+	memset(buffer, 0, bufsize);
+
+	iov_kunit_load_bvecq(test, &iter, READ, bq, bpages, npages);
+
+	i = 0;
+	for (pr = kvec_test_ranges; pr->from >= 0; pr++) {
+		size = pr->to - pr->from;
+		KUNIT_ASSERT_LE(test, pr->to, bufsize);
+
+		iov_iter_bvec_queue(&iter, READ, bq, 0, 0, pr->to);
+		iov_iter_advance(&iter, pr->from);
+		copied = copy_to_iter(scratch + i, size, &iter);
+
+		KUNIT_EXPECT_EQ(test, copied, size);
+		KUNIT_EXPECT_EQ(test, iter.count, 0);
+		i += size;
+		if (test->status == KUNIT_FAILURE)
+			goto stop;
+	}
+
+	/* Build the expected image in the scratch buffer. */
+	patt = 0;
+	memset(scratch, 0, bufsize);
+	for (pr = kvec_test_ranges; pr->from >= 0; pr++)
+		for (i = pr->from; i < pr->to; i++)
+			scratch[i] = pattern(patt++);
+
+	/* Compare the images */
+	for (i = 0; i < bufsize; i++) {
+		KUNIT_EXPECT_EQ_MSG(test, buffer[i], scratch[i], "at i=%x", i);
+		if (buffer[i] != scratch[i])
+			return;
+	}
+
+stop:
+	KUNIT_SUCCEED(test);
+}
+
+/*
+ * Test copying from an ITER_BVECQ-type iterator.
+ */
+static void __init iov_kunit_copy_from_bvecq(struct kunit *test)
+{
+	const struct kvec_test_range *pr;
+	struct iov_iter iter;
+	struct bvecq *bq;
+	struct page **spages, **bpages;
+	u8 *scratch, *buffer;
+	size_t bufsize, npages, size, copied;
+	int i, j;
+
+	bufsize = 0x100000;
+	npages = bufsize / PAGE_SIZE;
+
+	bq = iov_kunit_create_bvecq(test, 13);
+
+	buffer = iov_kunit_create_buffer(test, &bpages, npages);
+	for (i = 0; i < bufsize; i++)
+		buffer[i] = pattern(i);
+
+	scratch = iov_kunit_create_buffer(test, &spages, npages);
+	memset(scratch, 0, bufsize);
+
+	iov_kunit_load_bvecq(test, &iter, READ, bq, bpages, npages);
+
+	i = 0;
+	for (pr = kvec_test_ranges; pr->from >= 0; pr++) {
+		size = pr->to - pr->from;
+		KUNIT_ASSERT_LE(test, pr->to, bufsize);
+
+		iov_iter_bvec_queue(&iter, WRITE, bq, 0, 0, pr->to);
+		iov_iter_advance(&iter, pr->from);
+		copied = copy_from_iter(scratch + i, size, &iter);
+
+		KUNIT_EXPECT_EQ(test, copied, size);
+		KUNIT_EXPECT_EQ(test, iter.count, 0);
+		i += size;
+	}
+
+	/* Build the expected image in the main buffer. */
+	i = 0;
+	memset(buffer, 0, bufsize);
+	for (pr = kvec_test_ranges; pr->from >= 0; pr++) {
+		for (j = pr->from; j < pr->to; j++) {
+			buffer[i++] = pattern(j);
+			if (i >= bufsize)
+				goto stop;
+		}
+	}
+stop:
+
+	/* Compare the images */
+	for (i = 0; i < bufsize; i++) {
+		KUNIT_EXPECT_EQ_MSG(test, scratch[i], buffer[i], "at i=%x", i);
+		if (scratch[i] != buffer[i])
+			return;
+	}
+
+	KUNIT_SUCCEED(test);
+}
+
 static void iov_kunit_destroy_xarray(void *data)
 {
 	struct xarray *xarray = data;
@@ -859,6 +1039,85 @@ static void __init iov_kunit_extract_pages_bvec(struct kunit *test)
 	KUNIT_SUCCEED(test);
 }
 
+/*
+ * Test the extraction of ITER_BVECQ-type iterators.
+ */
+static void __init iov_kunit_extract_pages_bvecq(struct kunit *test)
+{
+	const struct kvec_test_range *pr;
+	struct iov_iter iter;
+	struct bvecq *bq;
+	struct page **bpages, *pagelist[8], **pages = pagelist;
+	ssize_t len;
+	size_t bufsize, size = 0, npages;
+	int i, from;
+
+	bufsize = 0x100000;
+	npages = bufsize / PAGE_SIZE;
+
+	bq = iov_kunit_create_bvecq(test, 13);
+
+	iov_kunit_create_buffer(test, &bpages, npages);
+	iov_kunit_load_bvecq(test, &iter, READ, bq, bpages, npages);
+
+	for (pr = kvec_test_ranges; pr->from >= 0; pr++) {
+		from = pr->from;
+		size = pr->to - from;
+		KUNIT_ASSERT_LE(test, pr->to, bufsize);
+
+		iov_iter_bvec_queue(&iter, WRITE, bq, 0, 0, pr->to);
+		iov_iter_advance(&iter, from);
+
+		do {
+			size_t offset0 = LONG_MAX;
+
+			for (i = 0; i < ARRAY_SIZE(pagelist); i++)
+				pagelist[i] = (void *)(unsigned long)0xaa55aa55aa55aa55ULL;
+
+			len = iov_iter_extract_pages(&iter, &pages, 100 * 1024,
+						     ARRAY_SIZE(pagelist), 0, &offset0);
+			KUNIT_EXPECT_GE(test, len, 0);
+			if (len < 0)
+				break;
+			KUNIT_EXPECT_LE(test, len, size);
+			KUNIT_EXPECT_EQ(test, iter.count, size - len);
+			if (len == 0)
+				break;
+			size -= len;
+			KUNIT_EXPECT_GE(test, (ssize_t)offset0, 0);
+			KUNIT_EXPECT_LT(test, offset0, PAGE_SIZE);
+
+			for (i = 0; i < ARRAY_SIZE(pagelist); i++) {
+				struct page *p;
+				ssize_t part = min_t(ssize_t, len, PAGE_SIZE - offset0);
+				int ix;
+
+				KUNIT_ASSERT_GE(test, part, 0);
+				ix = from / PAGE_SIZE;
+				KUNIT_ASSERT_LT(test, ix, npages);
+				p = bpages[ix];
+				KUNIT_EXPECT_PTR_EQ(test, pagelist[i], p);
+				KUNIT_EXPECT_EQ(test, offset0, from % PAGE_SIZE);
+				from += part;
+				len -= part;
+				KUNIT_ASSERT_GE(test, len, 0);
+				if (len == 0)
+					break;
+				offset0 = 0;
+			}
+
+			if (test->status == KUNIT_FAILURE)
+				goto stop;
+		} while (iov_iter_count(&iter) > 0);
+
+		KUNIT_EXPECT_EQ(test, size, 0);
+		KUNIT_EXPECT_EQ(test, iter.count, 0);
+	}
+
+stop:
+	KUNIT_SUCCEED(test);
+}
+
 /*
  * Test the extraction of ITER_FOLIOQ-type iterators.
  */
@@ -1218,12 +1477,15 @@ static struct kunit_case __refdata iov_kunit_cases[] = {
 	KUNIT_CASE(iov_kunit_copy_from_kvec),
 	KUNIT_CASE(iov_kunit_copy_to_bvec),
 	KUNIT_CASE(iov_kunit_copy_from_bvec),
+	KUNIT_CASE(iov_kunit_copy_to_bvecq),
+	KUNIT_CASE(iov_kunit_copy_from_bvecq),
 	KUNIT_CASE(iov_kunit_copy_to_folioq),
 	KUNIT_CASE(iov_kunit_copy_from_folioq),
 	KUNIT_CASE(iov_kunit_copy_to_xarray),
 	KUNIT_CASE(iov_kunit_copy_from_xarray),
 	KUNIT_CASE(iov_kunit_extract_pages_kvec),
 	KUNIT_CASE(iov_kunit_extract_pages_bvec),
+	KUNIT_CASE(iov_kunit_extract_pages_bvecq),
 	KUNIT_CASE(iov_kunit_extract_pages_folioq),
 	KUNIT_CASE(iov_kunit_extract_pages_xarray),
 	KUNIT_CASE(iov_kunit_iter_to_sg_kvec),

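A note for readers following the cursor handling: the slot/offset
bookkeeping in iov_iter_bvecq_advance() above can be modelled in userspace.
What follows is a minimal sketch, not kernel code.  struct seg, struct segq,
struct cursor and cursor_advance() are hypothetical stand-ins whose fields
only mirror the names used in the patch, and the caller is assumed to have
clamped the advance to the remaining count (the assert checks this).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bio_vec: only the length matters here. */
struct seg {
	size_t len;
};

/* Hypothetical stand-in for struct bvecq: a chained array of segments. */
struct segq {
	struct segq *next;
	struct segq *prev;
	unsigned int nr_slots;
	const struct seg *bv;
};

/* Hypothetical stand-in for the ITER_BVECQ fields of struct iov_iter. */
struct cursor {
	const struct segq *q;
	unsigned int slot;
	size_t offset;		/* Offset into the current segment */
	size_t count;		/* Bytes remaining */
};

/* Same control flow as iov_iter_bvecq_advance() in the patch. */
static void cursor_advance(struct cursor *c, size_t by)
{
	const struct segq *q = c->q;
	unsigned int slot = c->slot;

	if (!c->count)
		return;
	assert(by <= c->count);	/* Assumption: caller clamps the advance */
	c->count -= by;

	by += c->offset;	/* Measure from the start of the current segment */
	do {
		size_t len;

		if (slot >= q->nr_slots) {
			if (!q->next)
				break;
			q = q->next;
			slot = 0;
			continue;
		}

		len = q->bv[slot].len;
		if (by < len)
			break;
		by -= len;
		slot++;
	} while (by);

	c->offset = by;
	c->slot = slot;
	c->q = q;
}
```

With two chained queues holding segments of 4096+4096 and 4096 bytes,
advancing by 5000 leaves the cursor at slot 1, offset 904 of the first queue.
Advancing a further 3192 bytes leaves it at slot 2 of the first queue
(slot == nr_slots) rather than at slot 0 of the second, which is why the
other bvecq helpers in the patch step to ->next when they find
slot >= nr_slots.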

^ permalink raw reply related

* [PATCH v4 14/30] iov_iter: Make iov_iter_get_pages*() wrap iov_iter_extract_pages()
From: David Howells @ 2026-06-16 10:08 UTC (permalink / raw)
  To: Christian Brauner, Matthew Wilcox, Christoph Hellwig
  Cc: David Howells, Paulo Alcantara, Jens Axboe, Leon Romanovsky,
	Steve French, ChenXiaoSong, Marc Dionne, Eric Van Hensbergen,
	Dominique Martinet, Ilya Dryomov, netfs, linux-afs, linux-cifs,
	linux-nfs, ceph-devel, v9fs, linux-erofs, linux-fsdevel,
	linux-kernel, linux-block
In-Reply-To: <20260616100821.2062304-1-dhowells@redhat.com>

Make iov_iter_get_pages*() wrap iov_iter_extract_pages() for kernel
iterator types (e.g. ITER_BVEC, ITER_FOLIOQ, ITER_XARRAY).  The pages
obtained have their refcounts incremented afterwards if they're not slab
pages.  ITER_KVEC is left returning -EFAULT.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: linux-block@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
 lib/iov_iter.c | 164 ++++++-------------------------------------------
 1 file changed, 19 insertions(+), 145 deletions(-)
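
The loop bound in iter_get_kernel_pages() in the diff below is easy to
misread, so here is a userspace sketch of just the arithmetic.  It counts how
many page pointers the loop visits for a given extracted length and starting
offset.  pages_touched() and SKETCH_PAGE_SIZE are hypothetical names, and a
4 KiB page size is assumed purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>	/* ssize_t */

#define SKETCH_PAGE_SIZE 4096	/* Assumption: 4 KiB pages */

/*
 * Count the iterations of a loop shaped like the one in
 * iter_get_kernel_pages(): start from the extracted byte count plus the
 * offset into the first page and step down one page at a time.
 */
static unsigned int pages_touched(ssize_t extracted, size_t start_offset)
{
	unsigned int n = 0;
	ssize_t done;

	for (done = extracted + (ssize_t)start_offset; done > 0;
	     done -= SKETCH_PAGE_SIZE)
		n++;
	return n;
}
```

Adding the start offset before stepping is what makes a 4096-byte extraction
that begins one byte into a page count as two pages rather than one.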

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 6386ae4ef491..49bca2c2f019 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -910,118 +910,34 @@ static int want_pages_array(struct page ***res, size_t size,
 	return count;
 }
 
-static ssize_t iter_folioq_get_pages(struct iov_iter *iter,
+/*
+ * Wrap iov_iter_extract_pages() and then take a ref on each non-slab page we
+ * got back.  This only works for non-user iterator types as get_pages uses
+ * get_user_pages not pin_user_pages.
+ */
+static ssize_t iter_get_kernel_pages(struct iov_iter *iter,
 				     struct page ***ppages, size_t maxsize,
 				     unsigned maxpages, size_t *_start_offset)
 {
-	const struct folio_queue *folioq = iter->folioq;
 	struct page **pages;
-	unsigned int slot = iter->folioq_slot;
-	size_t extracted = 0, count = iter->count, iov_offset = iter->iov_offset;
+	ssize_t ret, done;
 
-	if (slot >= folioq_nr_slots(folioq)) {
-		folioq = folioq->next;
-		slot = 0;
-		if (WARN_ON(iov_offset != 0))
-			return -EIO;
-	}
+	ret = iov_iter_extract_pages(iter, ppages, maxsize, maxpages,
+				     0, _start_offset);
+	if (ret <= 0)
+		return ret;
 
-	maxpages = want_pages_array(ppages, maxsize, iov_offset & ~PAGE_MASK, maxpages);
-	if (!maxpages)
-		return -ENOMEM;
-	*_start_offset = iov_offset & ~PAGE_MASK;
 	pages = *ppages;
+	for (done = ret + *_start_offset; done > 0; done -= PAGE_SIZE) {
+		struct folio *folio = page_folio(*pages);
 
-	for (;;) {
-		struct folio *folio = folioq_folio(folioq, slot);
-		size_t offset = iov_offset, fsize = folioq_folio_size(folioq, slot);
-		size_t part = PAGE_SIZE - offset % PAGE_SIZE;
-
-		if (offset < fsize) {
-			part = umin(part, umin(maxsize - extracted, fsize - offset));
-			count -= part;
-			iov_offset += part;
-			extracted += part;
-
-			*pages = folio_page(folio, offset / PAGE_SIZE);
-			get_page(*pages);
-			pages++;
-			maxpages--;
-		}
-
-		if (maxpages == 0 || extracted >= maxsize)
-			break;
-
-		if (iov_offset >= fsize) {
-			iov_offset = 0;
-			slot++;
-			if (slot == folioq_nr_slots(folioq) && folioq->next) {
-				folioq = folioq->next;
-				slot = 0;
-			}
-		}
+		if (!folio_test_slab(folio))
+			folio_get(folio);
+		pages++;
 	}
-
-	iter->count = count;
-	iter->iov_offset = iov_offset;
-	iter->folioq = folioq;
-	iter->folioq_slot = slot;
-	return extracted;
-}
-
-static ssize_t iter_xarray_populate_pages(struct page **pages, struct xarray *xa,
-					  pgoff_t index, unsigned int nr_pages)
-{
-	XA_STATE(xas, xa, index);
-	struct folio *folio;
-	unsigned int ret = 0;
-
-	rcu_read_lock();
-	for (folio = xas_load(&xas); folio; folio = xas_next(&xas)) {
-		if (xas_retry(&xas, folio))
-			continue;
-
-		/* Has the folio moved or been split? */
-		if (unlikely(folio != xas_reload(&xas))) {
-			xas_reset(&xas);
-			continue;
-		}
-
-		pages[ret] = folio_file_page(folio, xas.xa_index);
-		folio_get(folio);
-		if (++ret == nr_pages)
-			break;
-	}
-	rcu_read_unlock();
 	return ret;
 }
 
-static ssize_t iter_xarray_get_pages(struct iov_iter *i,
-				     struct page ***pages, size_t maxsize,
-				     unsigned maxpages, size_t *_start_offset)
-{
-	unsigned nr, offset, count;
-	pgoff_t index;
-	loff_t pos;
-
-	pos = i->xarray_start + i->iov_offset;
-	index = pos >> PAGE_SHIFT;
-	offset = pos & ~PAGE_MASK;
-	*_start_offset = offset;
-
-	count = want_pages_array(pages, maxsize, offset, maxpages);
-	if (!count)
-		return -ENOMEM;
-	nr = iter_xarray_populate_pages(*pages, i->xarray, index, count);
-	if (nr == 0)
-		return 0;
-
-	maxsize = min_t(size_t, nr * PAGE_SIZE - offset, maxsize);
-	i->iov_offset += maxsize;
-	i->count -= maxsize;
-	return maxsize;
-}
-
 /* must be done on non-empty ITER_UBUF or ITER_IOVEC one */
 static unsigned long first_iovec_segment(const struct iov_iter *i, size_t *size)
 {
@@ -1044,22 +960,6 @@ static unsigned long first_iovec_segment(const struct iov_iter *i, size_t *size)
 	BUG(); // if it had been empty, we wouldn't get called
 }
 
-/* must be done on non-empty ITER_BVEC one */
-static struct page *first_bvec_segment(const struct iov_iter *i,
-				       size_t *size, size_t *start)
-{
-	struct page *page;
-	size_t skip = i->iov_offset, len;
-
-	len = i->bvec->bv_len - skip;
-	if (*size > len)
-		*size = len;
-	skip += i->bvec->bv_offset;
-	page = i->bvec->bv_page + skip / PAGE_SIZE;
-	*start = skip % PAGE_SIZE;
-	return page;
-}
-
 static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
 		   unsigned int maxpages, size_t *start)
@@ -1095,36 +995,10 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		iov_iter_advance(i, maxsize);
 		return maxsize;
 	}
-	if (iov_iter_is_bvec(i)) {
-		struct page **p;
-		struct page *page;
 
-		page = first_bvec_segment(i, &maxsize, start);
-		n = want_pages_array(pages, maxsize, *start, maxpages);
-		if (!n)
-			return -ENOMEM;
-		p = *pages;
-		for (int k = 0; k < n; k++) {
-			struct folio *folio = page_folio(page + k);
-			p[k] = page + k;
-			if (!folio_test_slab(folio))
-				folio_get(folio);
-		}
-		maxsize = min_t(size_t, maxsize, n * PAGE_SIZE - *start);
-		i->count -= maxsize;
-		i->iov_offset += maxsize;
-		if (i->iov_offset == i->bvec->bv_len) {
-			i->iov_offset = 0;
-			i->bvec++;
-			i->nr_segs--;
-		}
-		return maxsize;
-	}
-	if (iov_iter_is_folioq(i))
-		return iter_folioq_get_pages(i, pages, maxsize, maxpages, start);
-	if (iov_iter_is_xarray(i))
-		return iter_xarray_get_pages(i, pages, maxsize, maxpages, start);
-	return -EFAULT;
+	if (iov_iter_is_kvec(i))
+		return -EFAULT;
+	return iter_get_kernel_pages(i, pages, maxsize, maxpages, start);
 }
 
 ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,


^ permalink raw reply related

* [PATCH v4 13/30] Add a function to kmap one page of a multipage bio_vec
From: David Howells @ 2026-06-16 10:08 UTC (permalink / raw)
  To: Christian Brauner, Matthew Wilcox, Christoph Hellwig
  Cc: David Howells, Paulo Alcantara, Jens Axboe, Leon Romanovsky,
	Steve French, ChenXiaoSong, Marc Dionne, Eric Van Hensbergen,
	Dominique Martinet, Ilya Dryomov, netfs, linux-afs, linux-cifs,
	linux-nfs, ceph-devel, v9fs, linux-erofs, linux-fsdevel,
	linux-kernel, linux-block
In-Reply-To: <20260616100821.2062304-1-dhowells@redhat.com>

Add a function to kmap one page of a multipage bio_vec by offset (which is
added to the offset in the bio_vec internally).  The caller is responsible
for calculating how much of the page is then available.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: linux-block@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
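Note: the offset arithmetic described above can be modelled in user space.
The following is only a sketch under stated assumptions, not kernel code:
it assumes 4 KiB pages, and struct model_bvec and the model_*() helpers
are hypothetical names invented for illustration.  It shows which page of
a multipage bvec a byte offset lands in, where in that page it lands, and
how a caller might work out how much is then available.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 4 KiB pages; the kernel's PAGE_SHIFT depends on the arch. */
#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE  ((size_t)1 << MODEL_PAGE_SHIFT)

/* Hypothetical stand-in for the bio_vec fields the arithmetic uses. */
struct model_bvec {
	size_t bv_offset;	/* offset of the data within the first page */
	size_t bv_len;		/* length of the data in bytes */
};

/* Index of the page (relative to bv_page) that holds byte @offset. */
static size_t model_page_index(const struct model_bvec *bv, size_t offset)
{
	assert(offset < bv->bv_len);
	return (bv->bv_offset + offset) >> MODEL_PAGE_SHIFT;
}

/* Offset of that byte inside its page. */
static size_t model_offset_in_page(const struct model_bvec *bv, size_t offset)
{
	assert(offset < bv->bv_len);
	return (bv->bv_offset + offset) & (MODEL_PAGE_SIZE - 1);
}

/* Bytes usable from that point: stop at the page end or the bvec end. */
static size_t model_bytes_available(const struct model_bvec *bv, size_t offset)
{
	size_t to_page_end = MODEL_PAGE_SIZE - model_offset_in_page(bv, offset);
	size_t to_bvec_end = bv->bv_len - offset;

	return to_page_end < to_bvec_end ? to_page_end : to_bvec_end;
}
```

For a bvec with bv_offset 512 and bv_len 10000, offset 5000 falls in the
second page (index 1) at in-page offset 1416, leaving 2680 bytes before
the page boundary.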
 include/linux/bvec.h | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index d36dd476feda..f834a862224e 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -299,4 +299,22 @@ static inline phys_addr_t bvec_phys(const struct bio_vec *bvec)
 	return page_to_phys(bvec->bv_page) + bvec->bv_offset;
 }
 
+/**
+ * bvec_kmap_partial - Map part of a bvec into the kernel virtual address space
+ * @bvec: bvec to map
+ * @offset: Offset into bvec
+ *
+ * Map the page containing the byte at @offset into the kernel virtual address
+ * space.  The caller is responsible for making sure this doesn't overrun.
+ *
+ * Call kunmap_local on the returned address to unmap.
+ */
+static inline void *bvec_kmap_partial(struct bio_vec *bvec, size_t offset)
+{
+	offset += bvec->bv_offset;
+
+	return kmap_local_page(bvec->bv_page + (offset >> PAGE_SHIFT)) +
+		(offset & ~PAGE_MASK);
+}
+
 #endif /* __LINUX_BVEC_H */


^ permalink raw reply related

* [PATCH v4 2/3] block: respect iov_iter::nofault flag in bio_iov_iter_bounce_write()
From: Qu Wenruo @ 2026-06-16  8:12 UTC (permalink / raw)
  To: linux-btrfs, linux-block, linux-fsdevel, linux-xfs
In-Reply-To: <cover.1781597506.git.wqu@suse.com>

For the upcoming use of IOMAP_DIO_BOUNCE in btrfs, btrfs sets
iov_iter::nofault to prevent a deadlock when a page fault would be
needed to read the buffer.

However, bio_iov_iter_bounce_write() doesn't respect the
iov_iter::nofault flag and just calls a plain copy_from_iter(), so it
can still trigger a page fault and cause a deadlock in btrfs.

Fix it by using copy_folio_from_iter_atomic() if the nofault flag is
set, and copy_folio_from_iter() otherwise.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 block/bio.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index d2fdfcb016c1..7fd7c35f9a28 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1335,7 +1335,11 @@ static int bio_iov_iter_bounce_write(struct bio *bio, struct iov_iter *iter,
 			break;
 		bio_add_folio_nofail(bio, folio, this_len, 0);

-		copied = copy_from_iter(folio_address(folio), this_len, iter);
+		if (iter->nofault)
+			copied = copy_folio_from_iter_atomic(folio, 0, this_len,
+							     iter);
+		else
+			copied = copy_folio_from_iter(folio, 0, this_len, iter);
 		if (copied < this_len) {
 			/*
 			 * Need to revert the iov iter for all bytes we have
-- 
2.54.0

^ permalink raw reply related

* [PATCH v4 3/3] btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered IO
From: Qu Wenruo @ 2026-06-16  8:12 UTC (permalink / raw)
  To: linux-btrfs, linux-block, linux-fsdevel, linux-xfs
In-Reply-To: <cover.1781597506.git.wqu@suse.com>

Previously btrfs forced direct writes to fall back to buffered writes if
the inode has data checksums or the data profile has duplication.

That fallback prevents the buffer from being modified during the write,
which could leave the final content mismatching the checksum or the
other mirrors.

That brings a large performance cost, which already caused some concern
at the time.

But the later upstream commit c9d114846b38 ("iomap: add a flag to bounce
buffer direct I/O") introduced a new method: copy the content into new
pages and do all the operations on the newly allocated pages.

So let btrfs use the new flag for direct writes if we require stable
folios.

Here is a quick benchmark, using the following fio setup:

 fio --name=randwrite --filename $mnt/foobar --ioengine=libaio --size=4G \
     --rw=randwrite --iodepth=64 --runtime=60 --time_based --direct=1 \
     --bs=$blocksize

Unit is MiB/s.

 Blocksize | Zero-copy (*) | Buffered |   Bounce
-----------+---------------+----------+-----------
        4K |          35.1 |     17.1 |      33.8
       64K |           522 |      251 |       492

*: This is done by reverting the commit 968f19c5b1b7 ("btrfs: always
   fallback to buffered write if the inode requires checksum")

Although with page bouncing the performance is only around 95% of true
zero-copy, it's still almost double the performance of the buffered
fallback.
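
As a cross-check of the figures in the table above, the ratios can be
recomputed directly.  This is a small illustrative sketch only: struct
bench_row and the helper names are hypothetical, and the numbers are
copied from the table.  Bounce comes out at roughly 96% (4K) and 94%
(64K) of zero-copy, and roughly 1.98x (4K) and 1.96x (64K) of buffered.

```c
#include <assert.h>

/* One row of the benchmark table, all values in MiB/s. */
struct bench_row {
	const char *blocksize;
	double zero_copy;
	double buffered;
	double bounce;
};

/* Values copied from the table in this commit message. */
static const struct bench_row bench_rows[] = {
	{ "4K",   35.1,  17.1,  33.8 },
	{ "64K", 522.0, 251.0, 492.0 },
};

/* Fraction of zero-copy throughput that bouncing retains. */
static double bounce_vs_zero_copy(const struct bench_row *r)
{
	assert(r->zero_copy > 0.0);
	return r->bounce / r->zero_copy;
}

/* Speedup of bouncing over the buffered fallback. */
static double bounce_vs_buffered(const struct bench_row *r)
{
	assert(r->buffered > 0.0);
	return r->bounce / r->buffered;
}
```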

There is a small change in behavior: since IOMAP_DIO_BOUNCE allocates
new folios, a NOWAIT direct write that needs bouncing now fails
immediately with -EAGAIN.

So for true NOWAIT direct IOs, NODATASUM and RAID0/SINGLE profiles are
still required.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
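Note: the decision this patch makes for each direct write can be
summarised with a small user-space model.  This is only a sketch of the
logic described above, not the btrfs code: the enums and the model_*()
helpers are hypothetical names, and only a subset of the btrfs profiles
is listed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of data profiles, for illustration only. */
enum model_profile {
	MODEL_SINGLE,
	MODEL_RAID0,
	MODEL_RAID1,
	MODEL_DUP,
	MODEL_RAID5,
};

/* What the direct write path ends up doing in this model. */
enum model_action {
	MODEL_ZERO_COPY,	/* write straight from the user buffer */
	MODEL_BOUNCE,		/* copy into new folios first */
	MODEL_EAGAIN,		/* NOWAIT write that would need bouncing */
};

/* Checksums or any mirror/parity profile need a stable buffer. */
static bool model_need_stable(bool has_datasum, enum model_profile profile)
{
	assert(profile >= MODEL_SINGLE && profile <= MODEL_RAID5);
	if (has_datasum)
		return true;
	return profile != MODEL_SINGLE && profile != MODEL_RAID0;
}

/* Bounce when stability is needed, unless the caller asked for NOWAIT. */
static enum model_action model_dio_write_mode(bool has_datasum,
					      enum model_profile profile,
					      bool nowait)
{
	if (!model_need_stable(has_datasum, profile))
		return MODEL_ZERO_COPY;
	return nowait ? MODEL_EAGAIN : MODEL_BOUNCE;
}
```

In this model only a NODATASUM inode on SINGLE or RAID0 stays zero-copy,
which is also the only case where a NOWAIT direct write can proceed.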
 fs/btrfs/direct-io.c | 58 ++++++++++++++++++++++----------------------
 1 file changed, 29 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index e566a60b0ce5..bb457649902e 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -818,13 +818,41 @@ static ssize_t btrfs_dio_read(struct kiocb *iocb, struct iov_iter *iter,
 			    IOMAP_DIO_PARTIAL | IOMAP_DIO_FSBLOCK_ALIGNED, &data, done_before);
 }
 
+static bool need_stable_write(struct btrfs_inode *inode)
+{
+	const u64 data_profile = btrfs_data_alloc_profile(inode->root->fs_info) &
+				 BTRFS_BLOCK_GROUP_PROFILE_MASK;
+
+	/* Data checksum requires stable buffer. */
+	if (!(inode->flags & BTRFS_INODE_NODATASUM))
+		return true;
+	/*
+	 * Any profile with mirror/parity will require stable buffer.
+	 * Otherwise the mirror may differ from each other.
+	 *
+	 * Thus only SINGLE and RAID0 doesn't require stable buffer.
+	 */
+	if (data_profile != 0 && data_profile != BTRFS_BLOCK_GROUP_RAID0)
+		return true;
+	return false;
+}
+
 static struct iomap_dio *btrfs_dio_write(struct kiocb *iocb, struct iov_iter *iter,
 					 size_t done_before)
 {
 	struct btrfs_dio_data data = { 0 };
+	unsigned int dio_flags = IOMAP_DIO_PARTIAL | IOMAP_DIO_FSBLOCK_ALIGNED;
+
+	if (need_stable_write(BTRFS_I(file_inode(iocb->ki_filp)))) {
+		/* For now no support for BOUNCE and NOWAIT direct write. */
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			return ERR_PTR(-EAGAIN);
+
+		dio_flags |= IOMAP_DIO_BOUNCE;
+	}
 
 	return __iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
-			    IOMAP_DIO_PARTIAL | IOMAP_DIO_FSBLOCK_ALIGNED, &data, done_before);
+			      dio_flags, &data, done_before);
 }
 
 static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
@@ -853,8 +881,6 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	ssize_t ret;
 	unsigned int ilock_flags = 0;
 	struct iomap_dio *dio;
-	const u64 data_profile = btrfs_data_alloc_profile(fs_info) &
-				 BTRFS_BLOCK_GROUP_PROFILE_MASK;
 
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		ilock_flags |= BTRFS_ILOCK_TRY;
@@ -868,16 +894,6 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	if (iocb->ki_pos + iov_iter_count(from) <= i_size_read(inode) && IS_NOSEC(inode))
 		ilock_flags |= BTRFS_ILOCK_SHARED;
 
-	/*
-	 * If our data profile has duplication (either extra mirrors or RAID56),
-	 * we can not trust the direct IO buffer, the content may change during
-	 * writeback and cause different contents written to different mirrors.
-	 *
-	 * Thus only RAID0 and SINGLE can go true zero-copy direct IO.
-	 */
-	if (data_profile != BTRFS_BLOCK_GROUP_RAID0 && data_profile != 0)
-		goto buffered;
-
 relock:
 	ret = btrfs_inode_lock(BTRFS_I(inode), ilock_flags);
 	if (ret < 0)
@@ -918,22 +934,6 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 		btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
 		goto buffered;
 	}
-	/*
-	 * We can't control the folios being passed in, applications can write
-	 * to them while a direct IO write is in progress.  This means the
-	 * content might change after we calculated the data checksum.
-	 * Therefore we can end up storing a checksum that doesn't match the
-	 * persisted data.
-	 *
-	 * To be extra safe and avoid false data checksum mismatch, if the
-	 * inode requires data checksum, just fallback to buffered IO.
-	 * For buffered IO we have full control of page cache and can ensure
-	 * no one is modifying the content during writeback.
-	 */
-	if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
-		btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
-		goto buffered;
-	}
 
 	/*
 	 * The iov_iter can be mapped to the same file range we are writing to.
-- 
2.54.0


^ permalink raw reply related

* [PATCH v4 1/3] block: revert the iov_iter after a short copy in bio_iov_iter_bounce_write()
From: Qu Wenruo @ 2026-06-16  8:12 UTC (permalink / raw)
  To: linux-btrfs, linux-block, linux-fsdevel, linux-xfs
In-Reply-To: <cover.1781597506.git.wqu@suse.com>

With the upcoming use of the IOMAP_DIO_BOUNCE flag in btrfs, it's pretty
easy to hit a short copy inside bio_iov_iter_bounce_write().

This is because btrfs disables page faults to avoid certain deadlocks
during direct writes, and instead manually faults in the pages and then
retries.

But inside bio_iov_iter_bounce_write(), if we hit a short copy, we
don't revert the iov_iter, which can cause problems such as unexpected
garbage on the next retry.

Revert the iov_iter after a short copy.

Note that the folio is allocated and then immediately queued into the
bio, so the proper revert size is (bi_size - this_len + copied).

Fixes: 8dd5e7c75d7b ("block: add helpers to bounce buffer an iov_iter into bios")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
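Note: the revert size can be checked with a small user-space model.
This is a sketch under simplifying assumptions, not the block layer
code: the model_*() helpers and the chunk arrays are hypothetical, and
the iterator is reduced to a byte counter.  Each chunk is queued into
the bio before it is copied, so after a short copy the bio size
overstates what the iterator consumed by (this_len - copied).

```c
#include <assert.h>
#include <stddef.h>

/*
 * bi_size:  bytes queued in the bio so far, including the failed chunk
 * this_len: size of the chunk that was queued just before the copy
 * copied:   bytes the short copy actually consumed from the iterator
 */
static size_t model_revert_len(size_t bi_size, size_t this_len, size_t copied)
{
	assert(copied < this_len);
	assert(this_len <= bi_size);
	return bi_size - this_len + copied;
}

/*
 * Walk the chunks in order, stopping after the first short copy.
 * Returns how far the iterator advanced and stores the bio size.
 */
static size_t model_iter_advance(const size_t *chunk_len,
				 const size_t *chunk_copied, size_t nr,
				 size_t *bi_size)
{
	size_t advanced = 0;

	*bi_size = 0;
	for (size_t k = 0; k < nr; k++) {
		*bi_size += chunk_len[k];	/* chunk is queued first */
		advanced += chunk_copied[k];	/* then the iterator moves */
		if (chunk_copied[k] < chunk_len[k])
			break;			/* short copy: stop here */
	}
	return advanced;
}
```

With chunks of 1M, 1M and 512K where the last copy stops after 4096
bytes, the bio holds 2621440 bytes while the iterator has advanced by
2101248 bytes, which is exactly bi_size - this_len + copied.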
 block/bio.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 5f10900b3f42..d2fdfcb016c1 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1321,6 +1321,7 @@ static int bio_iov_iter_bounce_write(struct bio *bio, struct iov_iter *iter,
 
 	do {
 		size_t this_len = min(total_len, SZ_1M);
+		size_t copied;
 		struct folio *folio;
 
 		if (this_len > minsize * 2)
@@ -1334,12 +1335,22 @@ static int bio_iov_iter_bounce_write(struct bio *bio, struct iov_iter *iter,
 			break;
 		bio_add_folio_nofail(bio, folio, this_len, 0);
 
-		if (copy_from_iter(folio_address(folio), this_len, iter) !=
-				this_len) {
+		copied = copy_from_iter(folio_address(folio), this_len, iter);
+		if (copied < this_len) {
+			/*
+			 * Need to revert the iov iter for all bytes we have
+			 * copied.
+			 *
+			 * However the bio size differs from the real copied
+			 * bytes as @this_len is queued but only advanced
+			 * less than that.
+			 * Need to compensate that for the revert.
+			 */
+			iov_iter_revert(iter, bio->bi_iter.bi_size - this_len +
+					copied);
 			bio_free_folios(bio);
 			return -EFAULT;
 		}
-
 		total_len -= this_len;
 	} while (total_len && bio->bi_vcnt < bio->bi_max_vecs);
 
-- 
2.54.0


^ permalink raw reply related

* [PATCH v4 0/3] btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered IO
From: Qu Wenruo @ 2026-06-16  8:12 UTC (permalink / raw)
  To: linux-btrfs, linux-block, linux-fsdevel, linux-xfs

[CHANGELOG]
v4:
- Follow iomap/block layer code style to avoid lines over 80 chars

- Reject NOWAIT BOUNCE direct writes inside btrfs
  The iomap code still allocates memory with GFP_KERNEL in other
  locations.
  For now just disable NOWAIT BOUNCE direct writes and let the caller
  fall back to blocking mode.

v3:
- Fix a bug in error handling of bio_iov_iter_bounce_write()
  Which can lead to generic/708 failure on btrfs.

- Respect nofault flag in bio_iov_iter_bounce_write()
  To avoid btrfs specific deadlocks.

- Reject NOWAIT and BOUNCE direct IOs
  BOUNCE always allocates pages using GFP_KERNEL, which can sleep and
  break the NOWAIT requirement, so this combination has to be rejected.

v2:
- Rework the comment in btrfs_dio_write()

Commit 968f19c5b1b7 ("btrfs: always fallback to buffered write if the
inode requires checksum") solved the csum mismatch caused by unstable
direct IO buffers, but it has a pretty hefty performance penalty.

Meanwhile upstream iomap has introduced the IOMAP_DIO_BOUNCE flag to get
stable buffers without falling back to buffered IO.

Using that flag btrfs can reach 95% of the original zero-copy direct IO
performance, almost 2x the current buffered fallback performance.

However, during my tests I hit several bugs related to iomap that can
lead to direct IO test case failures:

- generic/708
  Results in garbage at the end of the writes.  This is a bug in the
  error handling of a short copy.

  Fixed in the first patch.

- Deadlock if using the page cache as direct IO buffer
  This is because bio_iov_iter_bounce_write() doesn't respect
  iov_iter::nofault flag.

  Fixed in the second patch.

- Possible NOWAIT and BOUNCE conflicts
  The BOUNCE flag, for both reads and writes, allocates new folios using
  GFP_KERNEL, which can sleep and break the NOWAIT requirement.

  Reject this combination in btrfs when enabling IOMAP_DIO_BOUNCE
  support.

The final patch enables btrfs to use the IOMAP_DIO_BOUNCE flag, so that
even with data checksums we do not need to fall back to buffered IO, and
we reclaim most of the lost direct IO performance.

Qu Wenruo (3):
  block: revert the iov_iter after a short copy in
    bio_iov_iter_bounce_write()
  block: respect iov_iter::nofault flag in bio_iov_iter_bounce_write()
  btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered
    IO

 block/bio.c          | 21 +++++++++++++---
 fs/btrfs/direct-io.c | 58 ++++++++++++++++++++++----------------------
 2 files changed, 47 insertions(+), 32 deletions(-)

-- 
2.54.0

^ permalink raw reply
