Linux block layer


* Re: [PATCH RFC v2 00/18] fs: support freeze/thaw/mark_dead/sync with shared devices
From: Jan Kara @ 2026-06-22 15:40 UTC
  To: Christian Brauner
  Cc: Jan Kara, Christoph Hellwig, Jens Axboe, Alexander Viro,
	linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
	linux-xfs, Chris Mason, David Sterba, linux-btrfs,
	Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs, syzbot
In-Reply-To: <20260616-work-super-bdev_holder_global-v2-0-7df6b864028e@kernel.org>

Hi!

On Tue 16-06-26 16:08:16, Christian Brauner wrote:
> This is a generalization of the device number to superblock so it works
> for actual block device and anonymous (or even mtd) devices.
> 
> fs_holder_ops recovers the affected superblock from bdev->bd_holder. That
> forces the holder of a block device to be exactly one superblock and makes
> it impossible for several superblocks to share a single device.
> 
> erofs does exactly that. It can mount read-only "blob" devices that are
> shared between many superblocks: a metadata-only erofs that indexes a set
> of per-layer blobs (one filesystem instead of one per OCI layer), or an
> incremental image whose base device is shared by several updates. Because
> the block layer only tracks a single holder, a freeze, thaw, removal or
> sync on such a device is never propagated to all the superblocks using it,
> and the current infrastructure has no way to find them.
> 
> This series replaces the bd_holder-based lookup with a global, dev_t-keyed
> table mapping each block device to the superblock(s) using it. The holder
> argument becomes purely the block layer's exclusivity token -- a superblock,
> or the file_system_type for a device shared within one filesystem type --
> and the fs_holder_ops callbacks look the device up in the table and act on
> every superblock registered for it: 1:1 for most filesystems, 1:many for
> erofs.

So I was thinking about this also in the light of Christoph's complaints. I
agree with you, Christian, that this translation table maintains the
abstraction of the holder - holder ops define how to transition from bdev
to its holder(s) and how to translate the .sync, .freeze and other
operations for the holders - and that is kept since your changes are
specific to fs_holder_ops.

What I'm wondering about a bit is whether we want this complexity for the
only user which is erofs (i.e., whether this wouldn't be better implemented
in erofs specific holder ops which could arguably be simpler than this
generic solution). On the other hand that will likely have to replicate
the locking dances we do in bdev_super_lock() and I'm not sure whether
spreading this locking complexity into filesystems is better than this
more complex VFS mapping code.

One more thing I was considering is that the need to transition from one
bdev to multiple holders isn't actually unique to erofs. For example, device
mapper will need the same thing, and arguably partition bdevs could also be
made holders of the whole bdev so that events are propagated from the whole
bdev into partition bdevs properly (which currently happens in a kind of ad
hoc manner and only in some cases). Currently your translation mechanism is
tied to mapping to superblock but actually rather weakly - we only need the
guarantee that the holder stays alive while the mapping entry exists, the
rest is protected by the mapping entry refcount AFAICS. So with a bit of
effort we could make this a generic bdev -> holders mapping mechanism
usable from whichever holder ops decide to employ it, which would then be
quite attractive IMO.
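
To make the generic bdev -> holders idea concrete, here is a minimal
userspace sketch in C. It is only a model under stated assumptions: every
name in it (struct holder, map_claim(), map_release(), map_freeze()) is
hypothetical, a linked list stands in for the rhltable, and there is no
locking or RCU. It shows one device fanning out to several holders, with
repeated claims on the same (device, holder) pair sharing one
claim-counted entry.

```c
/* Userspace model only: all names here are hypothetical, not kernel API. */
#include <assert.h>
#include <stdlib.h>

struct holder {			/* stands in for a superblock, dm table, ... */
	int frozen;
};

struct map_entry {		/* one (device, holder) pair */
	unsigned int dev;	/* stands in for dev_t */
	struct holder *holder;
	unsigned int claims;	/* claim count; entry is unlinked at zero */
	struct map_entry *next;
};

static struct map_entry *map_head;	/* plain list instead of a hash table */

struct map_entry *map_lookup(unsigned int dev, struct holder *h)
{
	for (struct map_entry *e = map_head; e; e = e->next)
		if (e->dev == dev && e->holder == h)
			return e;
	return NULL;
}

/* Claim @dev for @h; repeated claims on the same pair share one entry. */
int map_claim(unsigned int dev, struct holder *h)
{
	struct map_entry *e = map_lookup(dev, h);

	if (e) {
		e->claims++;
		return 0;
	}
	e = calloc(1, sizeof(*e));
	if (!e)
		return -1;
	e->dev = dev;
	e->holder = h;
	e->claims = 1;
	e->next = map_head;
	map_head = e;
	return 0;
}

/* Drop one claim; the last one unlinks and frees the entry. */
void map_release(unsigned int dev, struct holder *h)
{
	for (struct map_entry **p = &map_head; *p; p = &(*p)->next) {
		struct map_entry *e = *p;

		if (e->dev != dev || e->holder != h)
			continue;
		if (--e->claims == 0) {
			*p = e->next;
			free(e);
		}
		return;
	}
}

/* Propagate a freeze on @dev to every holder; returns how many were hit. */
int map_freeze(unsigned int dev)
{
	int n = 0;

	for (struct map_entry *e = map_head; e; e = e->next) {
		if (e->dev == dev) {
			e->holder->frozen = 1;
			n++;
		}
	}
	return n;
}
```

Nothing in the entry depends on the holder being a superblock, which is
the "rather weak" coupling mentioned above; the only requirement the model
leaves implicit is that the holder outlives its entry.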

But I guess let's leave lifting the mapping code from super.c and
converting it into a generic mapping mechanism until the moment we really
get to implementing another user.

All this is a long way of saying that I'm OK with the mapping mechanism
like this :).

								Honza

> Filesystems claim and release their devices through new
> fs_bdev_file_open_by_{dev,path}() and fs_bdev_file_release() helpers; the
> per-fs patches convert xfs, btrfs, ext4, f2fs and erofs over to them and
> fix cramfs and romfs, which released the registered main device with a
> raw bdev_fput().
> 
> Since every superblock is registered under its s_dev the table also
> replaces the last s_dev-keyed walk of the super_blocks list:
> user_get_super() resolves device numbers through it, so ustat() and
> quotactl() now work on any device a filesystem claims and no longer
> take sb_lock.
> 
> The longer-term motivation is to let userspace decide which devices may be
> onlined from one central place, without having to teach every filesystem
> about it individually.
> 
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
> ---
> Changes in v2:
> - super: rework the device-to-superblock table reference counting: each
>   (device, superblock) entry carries a single claim count and holds one
>   passive reference on its superblock for the entry's lifetime. New prep
>   patches convert s_count to refcount_t s_passive and make put_super()
>   self-locking.
> - super: preallocate the entry in alloc_super() and register it from the
>   set callbacks through set_anon_super()/set_bdev_super(); an insert
>   failure unwinds exactly like a set callback failure. The superblock
>   stashes the entry in sb->s_super_dev and kill_super_notify() drops the
>   claim through it.
> - super: initialize the table from mnt_init(); the rootfs and shm mounts
>   are created long before any initcall runs.
> - super: fold the v1 "refuse to claim a frozen block device" patch into
>   the registration helper and restore the EBUSY check for the primary
>   device in setup_bdev_super(): additional devices (the xfs log, the ext4
>   journal, erofs blobs) are now refused while frozen as well, answering
>   Jan's question on v1 3/8.
> - Split the core patch into table/helpers/switch-over and move the
>   xfs/btrfs/ext4 conversions before the fs_holder_ops switch so no
>   freeze/mark_dead events are lost mid-series; erofs follows the switch.
> - New prep patches: the ext4 KUnit tests allocate anonymous devices and
>   ocfs2 stops resetting s_dev on dismount.
> - New: convert user_get_super() to the device table, plus a ustat()
>   selftest.
> - New: fix a pre-existing double release of the realtime device file and
>   dangling buftarg pointers in xfs_open_devices()'s error unwind.
> - New: convert f2fs's additional devices to the helpers; fix cramfs and
>   romfs releasing the registered main device with a raw bdev_fput().
> - erofs: drop the .shutdown() and .remove_bdev() implementations and the
>   per-device "dead" flag. Immutable filesystems don't need them: the block
>   layer sets GD_DEAD before fs_bdev_mark_dead() so in-flight bios fail
>   anyway, erofs has no write path or journal to stop, and the read-only
>   loop_change_fd() case must not be forced to -EIO. Patch from Gao Xiang,
>   applied verbatim - thanks!
> - btrfs: fix a general protection fault in close_fs_devices() on a failed
>   mount (reported by syzbot). The release path took the superblock from
>   device->fs_info, which is still NULL if open_ctree() fails before
>   btrfs_init_devices_late(); it now uses bdev_file->private_data.
> - erofs: the v1 conversion was sent with a generic boilerplate changelog;
>   superseded by Gao's patch above.
> - Collect Reviewed-by from Jan Kara and Tested-by from syzbot.
> - Rebase onto v7.1-rc1.
> - Link to v1: https://patch.msgid.link/20260602-work-super-bdev_holder_global-v1-0-bb0fd82f3861@kernel.org
> 
> ---
> Christian Brauner (18):
>       xfs: fix the error unwind in xfs_open_devices()
>       super: convert s_count to refcount_t s_passive
>       super: take lock after last reference count
>       fs, block: move blk_mode_t and fop_flags_t into <linux/types.h>
>       ext4: use anonymous devices for KUnit test superblocks
>       ocfs2: don't reset s_dev on dismount
>       fs: maintain a global device-to-superblock table
>       fs: add dedicated block device open helpers for filesystems
>       xfs: port to fs_bdev_file_open_by_path()
>       btrfs: open via dedicated fs bdev helpers
>       ext4: open via dedicated fs bdev helpers
>       fs: look up superblocks via the device table in fs_holder_ops
>       fs: tolerate per-superblock freeze errors on shared devices
>       erofs: open via dedicated fs bdev helpers
>       f2fs: open via dedicated fs bdev helpers
>       super: make fs_holder_ops private
>       fs: look up the superblock via the device table in user_get_super()
>       selftests/filesystems: add ustat() coverage
> 
>  fs/btrfs/volumes.c                               |  31 +-
>  fs/cramfs/inode.c                                |   2 +-
>  fs/erofs/super.c                                 |  35 +-
>  fs/ext4/extents-test.c                           |   9 +-
>  fs/ext4/mballoc-test.c                           |   9 +-
>  fs/ext4/super.c                                  |  12 +-
>  fs/f2fs/super.c                                  |   6 +-
>  fs/internal.h                                    |   1 +
>  fs/namespace.c                                   |   2 +
>  fs/ocfs2/super.c                                 |   1 -
>  fs/romfs/super.c                                 |   2 +-
>  fs/super.c                                       | 620 ++++++++++++++++-------
>  fs/xfs/xfs_buf.c                                 |   2 +-
>  fs/xfs/xfs_super.c                               |  13 +-
>  include/linux/blkdev.h                           |   9 -
>  include/linux/fs.h                               |   2 -
>  include/linux/fs/super.h                         |   8 +
>  include/linux/fs/super_types.h                   |   4 +-
>  include/linux/types.h                            |   2 +
>  tools/testing/selftests/filesystems/.gitignore   |   1 +
>  tools/testing/selftests/filesystems/Makefile     |   2 +-
>  tools/testing/selftests/filesystems/ustat_test.c | 135 +++++
>  22 files changed, 647 insertions(+), 261 deletions(-)
> ---
> base-commit: 0c0d974f62e6603d4514e1a8035658edb353c68f
> change-id: 20260602-work-super-bdev_holder_global-8cba5e52bed5
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH RFC v2 07/18] fs: maintain a global device-to-superblock table
From: Jan Kara @ 2026-06-22 15:59 UTC
  To: Christian Brauner
  Cc: Jan Kara, Christoph Hellwig, Jens Axboe, Alexander Viro,
	linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
	linux-xfs, Chris Mason, David Sterba, linux-btrfs,
	Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260616-work-super-bdev_holder_global-v2-7-7df6b864028e@kernel.org>

On Tue 16-06-26 16:08:23, Christian Brauner wrote:
> fs_holder_ops recovers the owning superblock from bdev->bd_holder, which
> forces the holder to be exactly one superblock and prevents several
> superblocks from sharing one block device. That's what erofs is doing.
> 
> As a first step introduce a global dev_t-keyed rhltable mapping each
> device to the superblock(s) using it. The entry is preallocated in
> alloc_super() and registered under sb->s_dev by the set callback through
> set_anon_super() and set_bdev_super(), the two helpers every set
> callback assigns s_dev through. Registration is the final fallible act
> of a set callback, so an insert failure unwinds through sget_fc()'s
> existing set-failure path: the fs_context keeps ownership of s_fs_info
> and the callers' error paths stay correct. set_anon_super() releases
> the anonymous dev it allocated when registration fails. Unwinding
> through deactivate_locked_super() instead would run kill_sb() and free
> s_fs_info behind the caller's back: nfs and ceph free that object
> through a local pointer when sget_fc() fails and would double-free.
> 
> The superblock stashes the entry in sb->s_super_dev and
> kill_super_notify() drops the claim through it, so teardown doesn't
> depend on s_dev staying stable; an entry that was never registered is
> freed together with the superblock in destroy_super_work().
> 
> Each table entry holds a passive reference (s_passive) on its
> superblock, so the struct stays valid for as long as the entry is
> reachable. Entries are claim-counted through sd_ref: additional claims
> on the same (device, superblock) pair share the entry, and the unlink
> is deferred to the last put, so a later iteration cursor never resumes
> from a removed node.
> 
> The table is initialized from mnt_init(): the first superblocks (the
> tmpfs shm mount and rootfs) are created from start_kernel() long before
> any initcall runs, so an initcall would be too late.
> 
> The table has no readers yet; the fs_holder_ops callbacks are switched
> over once all devices a filesystem claims are registered.
> 
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/internal.h                  |   1 +
>  fs/namespace.c                 |   2 +
>  fs/super.c                     | 102 ++++++++++++++++++++++++++++++++++++++++-
>  include/linux/fs/super_types.h |   2 +
>  4 files changed, 105 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/internal.h b/fs/internal.h
> index d77578d66d42..83eb3e2a0f85 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -137,6 +137,7 @@ extern int reconfigure_super(struct fs_context *);
>  extern bool super_trylock_shared(struct super_block *sb);
>  struct super_block *user_get_super(dev_t, bool excl);
>  void put_super(struct super_block *sb);
> +void __init super_dev_init(void);
>  extern bool mount_capable(struct fs_context *);
>  int sb_init_dio_done_wq(struct super_block *sb);
>  
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 3d5cd5bf3b05..7cef6dae0854 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -6262,6 +6262,8 @@ void __init mnt_init(void)
>  	if (!mount_hashtable || !mountpoint_hashtable)
>  		panic("Failed to allocate mount hash table\n");
>  
> +	super_dev_init();
> +
>  	kernfs_init();
>  
>  	err = sysfs_init();
> diff --git a/fs/super.c b/fs/super.c
> index a771a0ad4c9a..ff5e305d0ab4 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -24,6 +24,7 @@
>  #include <linux/export.h>
>  #include <linux/slab.h>
>  #include <linux/blkdev.h>
> +#include <linux/rhashtable.h>
>  #include <linux/mount.h>
>  #include <linux/security.h>
>  #include <linux/writeback.h>		/* for the emergency remount stuff */
> @@ -272,6 +273,8 @@ static unsigned long super_cache_count(struct shrinker *shrink,
>  	return total_objects;
>  }
>  
> +static struct super_dev *super_dev_alloc(dev_t dev, struct super_block *sb);
> +
>  static void destroy_super_work(struct work_struct *work)
>  {
>  	struct super_block *s = container_of(work, struct super_block,
> @@ -279,6 +282,8 @@ static void destroy_super_work(struct work_struct *work)
>  	fsnotify_sb_free(s);
>  	security_sb_free(s);
>  	put_user_ns(s->s_user_ns);
> +	/* Only an unregistered entry is still owned by the superblock. */
> +	kfree(s->s_super_dev);
>  	kfree(s->s_subtype);
>  	for (int i = 0; i < SB_FREEZE_LEVELS; i++)
>  		percpu_free_rwsem(&s->s_writers.rw_sem[i]);
> @@ -392,6 +397,10 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
>  		goto fail;
>  	if (list_lru_init_memcg(&s->s_inode_lru, s->s_shrink))
>  		goto fail;
> +	s->s_super_dev = super_dev_alloc(0, s);
> +	if (!s->s_super_dev)
> +		goto fail;
> +
>  	s->s_min_writeback_pages = MIN_WRITEBACK_PAGES;
>  	return s;
>  
> @@ -421,6 +430,77 @@ void put_super(struct super_block *s)
>  	}
>  }
>  
> +struct super_dev {
> +	dev_t			sd_dev;
> +	struct super_block	*sd_sb;
> +	refcount_t		sd_ref;
> +	struct rhlist_head	sd_node;
> +	struct rcu_head		sd_rcu;
> +};
> +
> +static struct rhltable super_dev_table;
> +static const struct rhashtable_params super_dev_params = {
> +	.key_len	= sizeof(dev_t),
> +	.key_offset	= offsetof(struct super_dev, sd_dev),
> +	.head_offset	= offsetof(struct super_dev, sd_node),
> +};
> +
> +static struct super_dev *super_dev_alloc(dev_t dev, struct super_block *sb)
> +{
> +	struct super_dev *fsd;
> +
> +	fsd = kzalloc_obj(*fsd);
> +	if (!fsd)
> +		return NULL;
> +	fsd->sd_dev = dev;
> +	fsd->sd_sb = sb;
> +	refcount_set(&fsd->sd_ref, 1);
> +	return fsd;
> +}
> +
> +static void super_dev_put(struct super_dev *fsd)
> +{
> +	/* Unlink only once unpinned, so a cursor never resumes from a removed node. */
> +	if (fsd && refcount_dec_and_test(&fsd->sd_ref)) {
> +		rhltable_remove(&super_dev_table, &fsd->sd_node, super_dev_params);
> +		put_super(fsd->sd_sb);
> +		kfree_rcu(fsd, sd_rcu);
> +	}
> +}
> +
> +void __init super_dev_init(void)
> +{
> +	if (rhltable_init(&super_dev_table, &super_dev_params))
> +		panic("VFS: Cannot initialise super_dev_table\n");
> +}
> +
> +static int super_dev_insert(struct super_dev *fsd)
> +{
> +	int err;
> +
> +	err = rhltable_insert(&super_dev_table, &fsd->sd_node, super_dev_params);
> +	if (!err)
> +		refcount_inc(&fsd->sd_sb->s_passive);
> +	return err;
> +}
> +
> +/* Register @sb under @sb->s_dev as the final fallible act of a set callback. */
> +static int super_dev_register(struct super_block *sb)
> +{
> +	struct super_dev *fsd = sb->s_super_dev;
> +	int err;
> +
> +	lockdep_assert_held(&sb_lock);
> +	VFS_WARN_ON_ONCE(!sb->s_dev);
> +	VFS_WARN_ON_ONCE(!fsd || fsd->sd_dev);
> +
> +	fsd->sd_dev = sb->s_dev;
> +	err = super_dev_insert(fsd);
> +	if (err)
> +		fsd->sd_dev = 0;
> +	return err;
> +}
> +
>  static void kill_super_notify(struct super_block *sb)
>  {
>  	lockdep_assert_not_held(&sb->s_umount);
> @@ -440,6 +520,12 @@ static void kill_super_notify(struct super_block *sb)
>  	hlist_del_init(&sb->s_instances);
>  	spin_unlock(&sb_lock);
>  
> +	/* Drop sget_fc()'s claim; a never-registered entry stays with the sb. */
> +	if (sb->s_super_dev->sd_dev) {
> +		super_dev_put(sb->s_super_dev);
> +		sb->s_super_dev = NULL;
> +	}
> +
>  	/*
>  	 * Let concurrent mounts know that this thing is really dead.
>  	 * We don't need @sb->s_umount here as every concurrent caller
> @@ -750,6 +836,7 @@ struct super_block *sget_fc(struct fs_context *fc,
>  	}
>  	if (!s) {
>  		spin_unlock(&sb_lock);
> +
>  		s = alloc_super(fc->fs_type, fc->sb_flags, user_ns);
>  		if (!s)
>  			return ERR_PTR(-ENOMEM);
> @@ -759,11 +846,13 @@ struct super_block *sget_fc(struct fs_context *fc,
>  	s->s_fs_info = fc->s_fs_info;
>  	err = set(s, fc);
>  	if (err) {
> +		VFS_WARN_ON_ONCE(s->s_super_dev->sd_dev);
>  		s->s_fs_info = NULL;
>  		spin_unlock(&sb_lock);
>  		destroy_unused_super(s);
>  		return ERR_PTR(err);
>  	}
> +	VFS_WARN_ON_ONCE(!s->s_super_dev->sd_dev);
>  	fc->s_fs_info = NULL;
>  	s->s_type = fc->fs_type;
>  	s->s_iflags |= fc->s_iflags;
> @@ -1217,7 +1306,16 @@ EXPORT_SYMBOL(free_anon_bdev);
>  
>  int set_anon_super(struct super_block *s, void *data)
>  {
> -	return get_anon_bdev(&s->s_dev);
> +	int error;
> +
> +	error = get_anon_bdev(&s->s_dev);
> +	if (error)
> +		return error;
> +
> +	error = super_dev_register(s);
> +	if (error)
> +		free_anon_bdev(s->s_dev);
> +	return error;
>  }
>  EXPORT_SYMBOL(set_anon_super);
>  
> @@ -1303,7 +1401,7 @@ EXPORT_SYMBOL(get_tree_keyed);
>  static int set_bdev_super(struct super_block *s, void *data)
>  {
>  	s->s_dev = *(dev_t *)data;
> -	return 0;
> +	return super_dev_register(s);
>  }
>  
>  static int super_s_dev_set(struct super_block *s, struct fs_context *fc)
> diff --git a/include/linux/fs/super_types.h b/include/linux/fs/super_types.h
> index 68747182abf9..c8172558750f 100644
> --- a/include/linux/fs/super_types.h
> +++ b/include/linux/fs/super_types.h
> @@ -30,6 +30,7 @@ struct mount;
>  struct mtd_info;
>  struct quotactl_ops;
>  struct shrinker;
> +struct super_dev;
>  struct unicode_map;
>  struct user_namespace;
>  struct workqueue_struct;
> @@ -132,6 +133,7 @@ struct super_operations {
>  struct super_block {
>  	struct list_head			s_list;		/* Keep this first */
>  	dev_t					s_dev;		/* search index; _not_ kdev_t */
> +	struct super_dev			*s_super_dev;	/* sget_fc()'s device table claim */
>  	unsigned char				s_blocksize_bits;
>  	unsigned long				s_blocksize;
>  	loff_t					s_maxbytes;	/* Max file size */
> 
> -- 
> 2.47.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* [PATCH] block: fix incorrect error injection static key decrement
From: Christoph Hellwig @ 2026-06-22 16:07 UTC
  To: axboe; +Cc: dlemoal, linux-block

Only decrement the static key when we had items and thus it was
incremented before.

Fixes: e8dcf2d142bd ("block: add configurable error injection")
Reported-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/error-injection.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/error-injection.c b/block/error-injection.c
index d24c90e9a25f..cfb83138960c 100644
--- a/block/error-injection.c
+++ b/block/error-injection.c
@@ -120,13 +120,13 @@ static void error_inject_removeall(struct gendisk *disk)
 	struct blk_error_inject *inj;
 
 	mutex_lock(&disk->error_injection_lock);
-	clear_bit(GD_ERROR_INJECT, &disk->state);
+	if (test_and_clear_bit(GD_ERROR_INJECT, &disk->state))
+		static_branch_dec(&blk_error_injection_enabled);
 	while ((inj = list_first_entry_or_null(&disk->error_injection_list,
 			struct blk_error_inject, entry))) {
 		list_del_rcu(&inj->entry);
 		kfree_rcu_mightsleep(inj);
 	}
-	static_branch_dec(&blk_error_injection_enabled);
 	mutex_unlock(&disk->error_injection_lock);
 }
 
-- 
2.53.0
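
The imbalance this patch fixes can be modelled in a few lines of userspace
C. This is a sketch under assumptions, not the kernel code: a plain
integer stands in for the static key's enable count, a bool stands in for
the GD_ERROR_INJECT bit, and it assumes the key is incremented when the
first rule sets that bit. The function names are hypothetical.

```c
/* Userspace model only: names are hypothetical stand-ins. */
#include <assert.h>
#include <stdbool.h>

static int key_count;		/* stands in for the static key's enable count */

struct disk {
	bool inject_enabled;	/* stands in for the GD_ERROR_INJECT bit */
	int nr_rules;
};

/* Adding the first rule sets the flag and takes one reference on the key. */
void disk_add_rule(struct disk *d)
{
	if (!d->inject_enabled) {
		d->inject_enabled = true;
		key_count++;
	}
	d->nr_rules++;
}

/* Tear down all rules; drop the key reference only if this disk took one. */
void disk_remove_all(struct disk *d)
{
	bool was_enabled = d->inject_enabled;	/* the "test" half */

	d->inject_enabled = false;		/* the "clear" half */
	if (was_enabled)
		key_count--;
	d->nr_rules = 0;
}
```

With an unconditional decrement, removing a disk that never had a rule
would push the count below zero, which is the load/unload case the fix
addresses.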



* [PATCH blktests] block/044: basic block error injection sanity test
From: Christoph Hellwig @ 2026-06-22 16:08 UTC
  To: shinichiro.kawasaki; +Cc: linux-block

Test the basic block layer error injection functionality.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 tests/block/044     | 71 +++++++++++++++++++++++++++++++++++++++++++++
 tests/block/044.out |  9 ++++++
 2 files changed, 80 insertions(+)
 create mode 100755 tests/block/044
 create mode 100644 tests/block/044.out

diff --git a/tests/block/044 b/tests/block/044
new file mode 100755
index 000000000000..8baf9fa59c68
--- /dev/null
+++ b/tests/block/044
@@ -0,0 +1,71 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 Christoph Hellwig.
+#
+# Basic block error injection test.
+
+. tests/block/rc
+. common/scsi_debug
+
+DESCRIPTION="basic block error injection test"
+QUICK=1
+
+requires()
+{	
+	_have_loadable_scsi_debug
+	_have_program xfs_io
+}
+
+# load and remove scsi_debug once to test the static_key bug in the
+# initial commit
+_test_load_unload()
+{
+	if ! _init_scsi_debug dev_size_mb=500; then
+		return 1
+	fi
+
+	local dev=${SCSI_DEBUG_DEVICES[0]}
+	local debugfs_file="/sys/kernel/debug/block/$dev/error_injection"
+	if [[ ! -f "${debugfs_file}" ]]; then
+		SKIP_REASONS+=("error injection not supported")
+		_exit_scsi_debug
+		return 1
+	fi
+	echo "Testing unload without rules"
+	_exit_scsi_debug
+}
+
+_test_rules()
+{
+	if ! _init_scsi_debug dev_size_mb=500; then
+		return 1
+	fi
+
+	local dev=${SCSI_DEBUG_DEVICES[0]}
+	local debugfs_file="/sys/kernel/debug/block/$dev/error_injection"
+
+	echo "Testing valid rules"
+	echo "add,op=WRITE,status=RESOURCE,start=0,nr_sectors=8" > $debugfs_file
+	echo "add,op=READ,status=IOERR,start=16,nr_sectors=8" > $debugfs_file
+	xfs_io -d -c 'pwrite -q 0 4096' /dev/$dev
+	xfs_io -d -c 'pread -q 0 4096' /dev/$dev
+	xfs_io -d -c 'pwrite -q 4096 4096' /dev/$dev
+	xfs_io -d -c 'pread -q 8192 8192' /dev/$dev
+
+	echo "Testing invalid rules"
+	echo "op=READ,status=IOERR" > $debugfs_file
+	echo "add,op=READ,status=EIO,start=32" > $debugfs_file
+	_exit_scsi_debug
+}
+
+test()
+{
+	echo "Running ${TEST_NAME}"
+
+	local ret
+
+	_test_load_unload
+	_test_rules
+
+	echo "Test complete"
+}
diff --git a/tests/block/044.out b/tests/block/044.out
new file mode 100644
index 000000000000..92efcddf7c8e
--- /dev/null
+++ b/tests/block/044.out
@@ -0,0 +1,9 @@
+Running block/044
+Testing unload without rules
+Testing valid rules
+pwrite: Cannot allocate memory
+pread: Input/output error
+Testing invalid rules
+tests/block/044: line 56: echo: write error: Invalid argument
+tests/block/044: line 57: echo: write error: Invalid argument
+Test complete
-- 
2.53.0
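
The expected output lists only two failures, which follows from simple
sector arithmetic. The C sketch below works through it, assuming that
start and nr_sectors are given in 512-byte sectors and that a rule fires
when an I/O of the matching op overlaps its range; both are assumptions
about the rule semantics, and io_hits_rule() is a hypothetical helper.

```c
/* Worked arithmetic only; io_hits_rule() is a hypothetical helper. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SIZE 512u	/* the block layer's sector unit */

/* Does the byte range [off, off + len) touch sectors [start, start + nr)? */
bool io_hits_rule(uint64_t off, uint64_t len, uint64_t start, uint64_t nr)
{
	uint64_t first = off / SECTOR_SIZE;
	/* one past the last sector touched, rounding a partial sector up */
	uint64_t end = (off + len + SECTOR_SIZE - 1) / SECTOR_SIZE;

	return first < start + nr && start < end;
}
```

Under those assumptions, "pwrite 0 4096" covers sectors 0-7 and hits the
WRITE rule, "pread 8192 8192" covers sectors 16-31 and hits the READ rule,
while "pread 0 4096" and "pwrite 4096 4096" fall outside the range of the
rule for their op and succeed quietly.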



* Re: [PATCH RFC v2 08/18] fs: add dedicated block device open helpers for filesystems
From: Jan Kara @ 2026-06-22 16:28 UTC
  To: Christian Brauner
  Cc: Jan Kara, Christoph Hellwig, Jens Axboe, Alexander Viro,
	linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
	linux-xfs, Chris Mason, David Sterba, linux-btrfs,
	Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260616-work-super-bdev_holder_global-v2-8-7df6b864028e@kernel.org>

On Tue 16-06-26 16:08:24, Christian Brauner wrote:
> Add fs_bdev_file_open_by_{dev,path}() and fs_bdev_file_release(). They
> open the device with fs_holder_ops and register a claim in the
> device-to-superblock table. Claims on the same (device, superblock)
> pair share one entry, so when a filesystem claims a device it already
> uses (xfs with its log on the data device), no second entry is added
> and each superblock will be acted on once.
> 
> The holder argument remains purely the block layer's exclusivity token:
> a superblock, or a file_system_type for a device shared by several
> superblocks of that type. The shared case only becomes usable once the
> fs_holder_ops callbacks resolve superblocks through the table instead
> of bdev->bd_holder.
> 
> Convert the main device, setup_bdev_super() and kill_block_super(),
> over: the open finds the entry registered by sget_fc() and claims it
> again. cramfs and romfs bypass kill_block_super() so they can handle
> MTD mounts and release the main device with a plain bdev_fput(), which
> would leave the claim behind: the (dev, sb) entry would never be
> unregistered and the passive reference it holds would keep the
> superblock alive forever. Convert their release paths in the same
> step.
> 
> The frozen-device check stays in setup_bdev_super() for the primary
> device and is added to fs_bdev_register() for new claims, i.e. every
> additional device a filesystem opens through the helpers. Only a
> (device, superblock) pair the superblock claimed earlier may be
> reopened while frozen (xfs with its log on the data device): the freeze
> already covers that superblock through the existing claim, so nothing
> escapes it. Without the setup_bdev_super() check a device frozen before
> the mount even started (dm lock_fs, loop) could be mounted and written
> to (journal replay) under an active freeze, because the primary open
> reuses the entry registered by sget_fc() and never takes the new-claim
> path.
> 
> Both checks read bd_fsfreeze_count only after the entry is published
> (by sget_fc() for the primary, by fs_bdev_register() for new claims)
> and pair with bdev_freeze() incrementing the count before walking the
> table: either the mount sees the elevated freeze count and fails with
> EBUSY, or the freeze finds the published entry and converges once
> SB_BORN is set.
> 
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

...

> +static int fs_bdev_register(struct file *bdev_file, struct super_block *sb)
> +{
> +	struct super_dev *sb_dev __free(kfree) = NULL;

Frankly I find the use of __free on sb_dev more confusing than helping in
this function. If you didn't use it, you could remove the somewhat
confusing retain_and_null_ptr() calls below, remove this initialization and
just put one kfree() into the error handling branch when super_dev_insert()
fails...
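
For comparison, the two styles can be put side by side in a small
userspace C model. It is a sketch under assumptions, not the patch: the
helper names are hypothetical, a one-slot "table" stands in for the
rhltable, and __attribute__((cleanup)) (a GCC/Clang extension) stands in
for __free(kfree).

```c
/* Userspace model only: helper names are hypothetical stand-ins. */
#include <assert.h>
#include <stdlib.h>

struct entry { int dev; };

static int live_allocs;			/* lets the model detect leaks */
static struct entry *table_slot;	/* one-slot stand-in for the table */

struct entry *entry_alloc(int dev)
{
	struct entry *e = malloc(sizeof(*e));

	if (e) {
		e->dev = dev;
		live_allocs++;
	}
	return e;
}

void entry_free(struct entry *e)
{
	if (e) {
		live_allocs--;
		free(e);
	}
}

/* On success the table owns @e; on failure the caller still owns it. */
int table_insert(struct entry *e)
{
	if (table_slot)
		return -1;
	table_slot = e;
	return 0;
}

static void entry_cleanup(struct entry **p)
{
	entry_free(*p);
}

/* Scope-based style: each path that hands the pointer off must disarm it. */
int register_scoped(int dev)
{
	struct entry *e __attribute__((cleanup(entry_cleanup))) = entry_alloc(dev);

	if (!e)
		return -2;
	if (table_insert(e))
		return -1;	/* cleanup frees e on the way out */
	e = NULL;		/* ownership moved to the table: disarm cleanup */
	return 0;
}

/* Explicit style: one free, in the only branch that still owns the entry. */
int register_explicit(int dev)
{
	struct entry *e = entry_alloc(dev);

	if (!e)
		return -2;
	if (table_insert(e)) {
		entry_free(e);
		return -1;
	}
	return 0;
}
```

Both variants free the entry exactly once on failure; the difference is
only whether the reader has to track where the automatic cleanup is
disarmed.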

> +	dev_t dev = file_bdev(bdev_file)->bd_dev;
> +	int err;
> +
> +	scoped_guard(rcu) {
> +		sb_dev = super_dev_lookup(dev, sb);
> +		if (sb_dev && refcount_inc_not_zero(&sb_dev->sd_ref)) {
> +			retain_and_null_ptr(sb_dev);
> +			return 0;
> +		}
> +	}
> +
> +	sb_dev = super_dev_alloc(dev, sb);
> +	if (!sb_dev)
> +		return -ENOMEM;
> +
> +	err = super_dev_insert(sb_dev);
> +	if (err)
> +		return err;
> +
> +	/* Publish the entry before reading the count; pairs with bdev_freeze(). */
> +	smp_mb();
> +	if (atomic_read(&file_bdev(bdev_file)->bd_fsfreeze_count) > 0) {
> +		err = -EBUSY;
> +		super_dev_put(sb_dev);
> +	}
> +
> +	retain_and_null_ptr(sb_dev);
> +	return err;
> +}
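
The publish-then-check ordering in the function above can be modelled with
C11 atomics. This is a sketch, assuming the freeze side orders its
increment before its table walk as the changelog describes; the variable
and function names are hypothetical, and a seq_cst fence stands in for
smp_mb().

```c
/* Userspace model only: names are hypothetical stand-ins. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int entry_published;	/* the table entry being visible */
static atomic_int freeze_count;		/* stands in for bd_fsfreeze_count */

/*
 * Mount side: publish the entry, full barrier, then read the freeze count.
 * Returns true if the mount must back out (the EBUSY case).
 */
bool mount_sees_freeze(void)
{
	atomic_store_explicit(&entry_published, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* models smp_mb() */
	return atomic_load_explicit(&freeze_count, memory_order_relaxed) > 0;
}

/*
 * Freeze side: bump the count, full barrier, then walk the table.
 * Returns true if the freeze found the published entry.
 */
bool freeze_sees_entry(void)
{
	atomic_fetch_add_explicit(&freeze_count, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&entry_published, memory_order_relaxed) != 0;
}
```

With a full barrier between the write and the read on each side, the two
sides cannot both miss each other: either the mount sees the elevated
count or the freeze sees the entry.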

...

> +/**
> + * fs_bdev_file_release - release a block device claimed for a superblock
> + * @bdev_file: file returned by fs_bdev_file_open_by_{dev,path}()
> + * @sb: superblock the device was claimed for
> + *
> + * Drop one claim on the {dev, @sb} entry; the last claim unregisters it (a
> + * pinning cursor defers the actual unlink).  Then close the block device.
> + */
> +void fs_bdev_file_release(struct file *bdev_file, struct super_block *sb)
> +{
> +	dev_t dev = file_bdev(bdev_file)->bd_dev;
> +	struct super_dev *sb_dev;
> +
> +	rcu_read_lock();
> +	sb_dev = super_dev_lookup(dev, sb);
> +	rcu_read_unlock();
> +	super_dev_put(sb_dev);
> +	bdev_fput(bdev_file);
> +}
> +EXPORT_SYMBOL_GPL(fs_bdev_file_release);

Why don't you use sb->s_super_dev in this function?

Otherwise the patch looks good to me.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH RFC v2 08/18] fs: add dedicated block device open helpers for filesystems
From: Jan Kara @ 2026-06-22 16:34 UTC
  To: Christian Brauner
  Cc: Jan Kara, Christoph Hellwig, Jens Axboe, Alexander Viro,
	linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
	linux-xfs, Chris Mason, David Sterba, linux-btrfs,
	Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <xlfnmwv2upjia6ozd4z5l5icaewor4a6cgkafnigulndzmt6r7@rhay3h3wablo>

On Mon 22-06-26 18:28:50, Jan Kara wrote:
> On Tue 16-06-26 16:08:24, Christian Brauner wrote:
> > Add fs_bdev_file_open_by_{dev,path}() and fs_bdev_file_release(). They
> > open the device with fs_holder_ops and register a claim in the
> > device-to-superblock table. Claims on the same (device, superblock)
> > pair share one entry, so when a filesystem claims a device it already
> > uses (xfs with its log on the data device), no second entry is added
> > and each superblock will be acted on once.
> > 
> > The holder argument remains purely the block layer's exclusivity token:
> > a superblock, or a file_system_type for a device shared by several
> > superblocks of that type. The shared case only becomes usable once the
> > fs_holder_ops callbacks resolve superblocks through the table instead
> > of bdev->bd_holder.
> > 
> > Convert the main device, setup_bdev_super() and kill_block_super(),
> > over: the open finds the entry registered by sget_fc() and claims it
> > again. cramfs and romfs bypass kill_block_super() so they can handle
> > MTD mounts and release the main device with a plain bdev_fput(), which
> > would leave the claim behind: the (dev, sb) entry would never be
> > unregistered and the passive reference it holds would keep the
> > superblock alive forever. Convert their release paths in the same
> > step.
> > 
> > The frozen-device check stays in setup_bdev_super() for the primary
> > device and is added to fs_bdev_register() for new claims, i.e. every
> > additional device a filesystem opens through the helpers. Only a
> > (device, superblock) pair the superblock claimed earlier may be
> > reopened while frozen (xfs with its log on the data device): the freeze
> > already covers that superblock through the existing claim, so nothing
> > escapes it. Without the setup_bdev_super() check a device frozen before
> > the mount even started (dm lock_fs, loop) could be mounted and written
> > to (journal replay) under an active freeze, because the primary open
> > reuses the entry registered by sget_fc() and never takes the new-claim
> > path.
> > 
> > Both checks read bd_fsfreeze_count only after the entry is published
> > (by sget_fc() for the primary, by fs_bdev_register() for new claims)
> > and pair with bdev_freeze() incrementing the count before walking the
> > table: either the mount sees the elevated freeze count and fails with
> > EBUSY, or the freeze finds the published entry and converges once
> > SB_BORN is set.
> > 
> > Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

...

> > +/**
> > + * fs_bdev_file_release - release a block device claimed for a superblock
> > + * @bdev_file: file returned by fs_bdev_file_open_by_{dev,path}()
> > + * @sb: superblock the device was claimed for
> > + *
> > + * Drop one claim on the {dev, @sb} entry; the last claim unregisters it (a
> > + * pinning cursor defers the actual unlink).  Then close the block device.
> > + */
> > +void fs_bdev_file_release(struct file *bdev_file, struct super_block *sb)
> > +{
> > +	dev_t dev = file_bdev(bdev_file)->bd_dev;
> > +	struct super_dev *sb_dev;
> > +
> > +	rcu_read_lock();
> > +	sb_dev = super_dev_lookup(dev, sb);
> > +	rcu_read_unlock();
> > +	super_dev_put(sb_dev);
> > +	bdev_fput(bdev_file);
> > +}
> > +EXPORT_SYMBOL_GPL(fs_bdev_file_release);
> 
> Why don't you use sb->s_super_dev in this function?

Never mind, I've realized an sb can hold multiple bdevs, so this lookup is
really needed.

I'd still prefer the __free handling in fs_bdev_register() sorted out but
regardless feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH v6 1/4] block: add task-context bio completion infrastructure
From: Tal Zussman @ 2026-06-22 16:45 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Jens Axboe, Matthew Wilcox (Oracle),
	Christian Brauner, Darrick J. Wong, Carlos Maiolino,
	Alexander Viro, Dave Chinner, Bart Van Assche, linux-block,
	linux-kernel, linux-xfs, linux-fsdevel, linux-mm, Gao Xiang
In-Reply-To: <4jtsjd2sbsn2w7fzfwb7wwowz72r4kc6345ckkkcjoxjbbwjwn@rkl44x3o3sgy>

On 6/18/26 10:26 AM, Jan Kara wrote:
> On Mon 01-06-26 13:04:41, Jan Kara wrote:
>> On Fri 29-05-26 16:46:15, Tal Zussman wrote:
>> > On 5/27/26 9:00 AM, Christoph Hellwig wrote:
>> > > On Wed, May 27, 2026 at 11:42:28AM +0200, Jan Kara wrote:
>> > >> > I ran some experiments with fio on both XFS and a raw block device. Five
>> > >> > iterations each for 60s. Results below.
>> > >> > 
>> > >> > TLDR: Removing the delay doesn't significantly decrease user-visible
>> > >> > latency or otherwise improve performance, but does significantly reduce
>> > >> > throughput and increase context switches in some workloads (e.g. C).
>> > >> > I think it makes sense to leave the delay as-is. Thoughts?
>> > >> 
>> > >> Thanks for the test! One question below:
>> > > 
>> > > Thanks from me as well!
>> > > 
>> > >> 
>> > >> > Results:
>> > >> > 
>> > >> > Workloads (all `uncached=1`):
>> > >> >   A: rw=write     bs=128k iodepth=1   ioengine=pvsync2     # XFS
>> > >> >   B: rw=write     bs=128k iodepth=128 ioengine=io_uring    # XFS
>> > >> >   C: rw=randwrite bs=4k   iodepth=32  ioengine=io_uring    # XFS
>> > >> >   D: rw=rw 50/50  bs=64k  iodepth=32  ioengine=io_uring    # XFS
>> > >> >   E: rw=write     bs=128k iodepth=128 ioengine=io_uring    # raw /dev/nvmeXn1
>> > >> >   F: rw=write     bs=128k iodepth=128 numjobs=4
>> > >> >      + vm.dirty_bytes=64MB, vm.dirty_background_bytes=32MB # XFS
>> > >> > 
>> > >> > Mean ± stddev across 5 iterations:
>> > >> > 
>> > >> >     metric                     delay=1           delay=0     delta
>> > >> >     --------------------------------------------------------------
>> > >> > 
>> > >> >   A seq 128k qd1
>> > >> >     BW (MB/s)                4333 ± 27         4374 ± 34     +0.9%
>> > >> >     p99   (us)              36.2 ± 0.8        35.8 ± 0.4     -1.1%
>> > >> >     p999  (us)               3260 ± 75         3228 ± 29     -1.0%
>> > >> >     ctx-switches          184 k ± 59 k     3.68 M ± 65 k    +1903%
>> > >> >     cs / io                0.09 ± 0.03       1.86 ± 0.03    +1888%
>> > >> >     avg bios/run            80.4 ± 0.6         1.1 ± 0.0    -98.7%
>> > >> 
>> > >> So a 1 jiffy delay is (with default HZ=1000) 1ms. That means for this load
>> > >> the completion latency should be at least 1000us, but your results show a
>> > >> p99 latency of 36us. What am I missing?
>> > > 
>> > > Yes, this looks a bit odd.  Unless there's multiple threads submitting
>> > > and somehow the completions get batched this should complete one
>> > > bio at a time and be the worst case for the delay scheme.
>> > 
>> > Sorry, I should've clarified - the latency here is the userspace-visible
>> > I/O completion latency (i.e. fio's clat value).
>> > 
>> > I ran again and traced the actual time from __bio_complete_in_task() to the
>> > ->bi_end_io() call. The results now match the 1 jiffy delay:
>> > 
>> >   metric                  delay=1  delay=0
>> > 
>> >   A seq 128k qd1
>> >     fio clat p99             38us     36us
>> >     bio cb p50             1.23ms    2.5us
>> >     bio cb p99             4.13ms   1.44ms
>> >     bio cb p999            5.01ms   2.63ms
>> 
>> So I'm clearly missing something fundamental, as I don't see how the IO
>> completion time reported by fio can be lower than the end_io callback
>> latency... Ahh, it is the strange meaning of clat in fio in combination with
>> the sync engine, where clat means: "how long after the syscall has returned
>> the data is ready". For the sync engine that is immediately, so the clat
>> number is meaningless. I think reporting 'lat' numbers from fio would make
>> more sense, but whatever.
>> 
>> The bio cb latency indeed looks like what I'd roughly expect now. And
>> notice how the median latency of IO completion is 1.23ms in the delay=1 case
>> and your throughput isn't abysmal only because writes end up accumulating
>> in the page cache and the writeback infrastructure ends up submitting a lot
>> of writeback IOs in parallel (you have ~80 bios to complete per run, which
>> amortizes the latency to a decent level).
>> 
>> However, if you had IO using the BIO_COMPLETE_IN_TASK infrastructure
>> without so many IOs in flight (like direct IO with a lower queue depth,
>> which has to do extent conversion on completion), you would very much see
>> the latency hit on your throughput as well. In the extreme case of qd=1
>> direct IO you'd reduce the throughput to ~4MB/s.
>> 
>> Now I'm not saying the delay is bad - it is a tradeoff with clear wins in
>> CPU overhead your benchmarks are showing. I just wanted to point out
>> there's also the cost side which your benchmarks don't show very clearly.
>> So we might need to keep some stats showing how many IO completions we are
>> offloading per second on each CPU and switch to delaying the work only once
>> it crosses a threshold like 1000000/HZ per second or so (so we at most
>> double the IO latency by delaying the end_io callback).
> 
> Any progress here? The patchset looks really promising so I'd love to have
> it completed :)
> 
Sorry for the delay - got caught up with some other work and had to set this
aside for a couple weeks, but haven't forgotten about this. Planning to pick
it back up some time this week.

Thanks,
Tal
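
As a rough illustration of the ~4MB/s figure quoted above, the sketch below
computes an upper bound on throughput when every completion waits a fixed
number of jiffies. The function names and the 4KiB I/O size are assumptions
made for this example (the thread does not state a block size), and device
latency is ignored, so real numbers would be lower.

```c
#include <stdint.h>

/*
 * Upper bound on completions per second when each I/O waits
 * delay_jiffies before its end_io callback runs and at most
 * queue_depth I/Os are in flight. delay_jiffies must be non-zero.
 * Hypothetical helper, not part of the patchset.
 */
static uint64_t max_iops(unsigned int hz, unsigned int delay_jiffies,
			 unsigned int queue_depth)
{
	return (uint64_t)queue_depth * hz / delay_jiffies;
}

/* The same bound expressed in bytes per second for a given I/O size. */
static uint64_t max_bytes_per_sec(unsigned int hz, unsigned int delay_jiffies,
				  unsigned int queue_depth, uint64_t io_size)
{
	return max_iops(hz, delay_jiffies, queue_depth) * io_size;
}
```

With HZ=1000, a one jiffy delay and qd=1, this gives 1000 completions per
second, i.e. 4096000 bytes/s for 4KiB I/O, which is where a figure of roughly
4MB/s would come from under these assumptions.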


* Re: [PATCH blktests] Fix _get_page_size()
From: Omar Sandoval @ 2026-06-22 17:31 UTC (permalink / raw)
  To: Shin'ichiro Kawasaki; +Cc: Bart Van Assche, Jeff Moyer, linux-block, kch
In-Reply-To: <ajkeGQd-0LnKJbHN@shinmob>

On Mon, Jun 22, 2026 at 08:38:48PM +0900, Shin'ichiro Kawasaki wrote:
> On Jun 20, 2026 / 09:11, Bart Van Assche wrote:
> > On 6/20/26 6:51 AM, Shin'ichiro Kawasaki wrote:
> > > On Jun 20, 2026 / 05:55, Bart Van Assche wrote:
> > > > On 6/20/26 3:26 AM, Shin'ichiro Kawasaki wrote:
> > > > > This is a rather fundamental change, so I would like to ask opinions from
> > > > > other blktests users, especially Omar and Chaitanya. What do you think about
> > > > > the idea to add getconf to the requirement list?
> > > > 
> > > > CONFIG_PAGE_SHIFT was introduced in the Linux kernel in February 2024
> > > > (commit ba89f9c8ccba ("arch: consolidate existing CONFIG_PAGE_SIZE_*KB
> > > > definitions")). Older kernels had CONFIG_PAGE_SIZE_4KB,
> > > > CONFIG_PAGE_SIZE_16KB, etc. This means that it is possible to derive the
> > > > kernel page size from the kernel configuration file for all upstream and
> > > > distro kernels, isn't it?
> > > 
> > > I checked that the commit is in the v6.9 tag. My Debian bookworm system has
> > > kernel v6.1, so the config file in /boot does not have CONFIG_PAGE_SHIFT, as
> > > expected. But it does not have CONFIG_PAGE_SIZE_* either... I'm still afraid
> > > that the kernel config file approach is not reliable.
> > 
> > Right, for older kernels CONFIG_PAGE_SIZE_*KB is only available for some
> > but not for all supported architectures.
> > 
> > It is not clear to me where the desire to avoid the dependency on
> > getconf comes from. As far as I know it is available on all Linux
> > distros. Since it is typically included in the C library package it
> > should not introduce a new dependency.
> 
> I think fewer dependencies are better in general, and I wanted to confirm
> that this is fine for everybody. If nobody objects, I will create a patch to
> add getconf to the requirement list.

I agree with Bart, getconf is ubiquitous enough that it's not worth
trying to hack around its absence. In my opinion, parsing kernel config
options should be a last resort. If anyone complains about the getconf
dependency in the future, I think it'd be better to add a simple
src/pagesize.c file that uses sysconf(_SC_PAGESIZE), but I don't expect
that to be necessary.
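
For reference, a minimal sketch of what the core of such a helper could look
like. The function name page_size_bytes() is made up for this example, and
this is not an actual blktests source file; it only shows the
sysconf(_SC_PAGESIZE) call mentioned above.

```c
#include <unistd.h>

/*
 * Return the system page size in bytes as reported by the C library,
 * or -1 if sysconf() cannot determine it. Hypothetical helper name.
 */
static long page_size_bytes(void)
{
	return sysconf(_SC_PAGESIZE);
}
```

A standalone src/pagesize.c would wrap this in a main() that prints the value
so that shell code could capture it the same way it captures getconf output.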

Omar

* Re: [PATCH 1/1] block: validate user space vectors during extraction
From: Keith Busch @ 2026-06-22 17:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, linux-block, linux-fsdevel, dm-devel, axboe, brauner,
	djwong, viro, stable
In-Reply-To: <ajPv7yOoYsR5O6kf@kbusch-mbp>

On Thu, Jun 18, 2026 at 07:17:35AM -0600, Keith Busch wrote:
>
> But since you mention it, __blkdev_direct_IO's handling does look wrong,
> so maybe I can clean that up first.

After careful review, I think __blkdev_direct_IO() is mostly correct
in what it's doing. It looks weird, but appears to be well optimized for
the common case, such that making it more readable would produce less
efficient code. So I'm not going to touch it, but there is a bug here in
the metadata mapping error handling that I'm going to propose a fix for
in the next version.

* [PATCHv2 0/6] direct-io: validate user space vectors during extraction
From: Keith Busch @ 2026-06-22 17:42 UTC (permalink / raw)
  To: linux-block, linux-fsdevel
  Cc: dm-devel, hch, axboe, brauner, djwong, viro, Keith Busch

From: Keith Busch <kbusch@kernel.org>

This addresses the misaligned direct-io problem behind various threads:

 https://lore.kernel.org/linux-xfs/20260610145218.141369-1-cem@kernel.org/
 https://lore.kernel.org/all/CAC_j7i1R7oy+nRhxEjCTba=DUgn02w9X+p94DCu0aHv5+5tKnQ@mail.gmail.com/
 https://lore.kernel.org/linux-block/ai7rnH20IYeSmY8s@gallifrey/
 https://lore.kernel.org/linux-block/20260616154009.2123183-1-kbusch@meta.com/

The previously tested fixes are correct as far as they go, but they
treat the symptom: they only matter because an invalid bio reaches those
drivers in the first place.

The reason it reaches them is an assumption I made when I removed
direct-io alignment checks in 5ff3f74e145a ("block: simplify direct io
validity check") and 7eac331869575 ("iomap: simplify direct io validity
check"): every bio is eventually split to the device limits, and the
upper layers cope with resulting errors once the bio has formed. Both
were optimistic assumptions. Drivers with their own ->submit_bio may
never pass through blk_mq_submit_bio()'s split, so the check never runs
for them, and as numerous threads showed, the consumers don't uniformly
handle this condition.

This patch stops the invalid bio at the source instead. It validates the
buffer's alignment against the alignment limits when the bio is built
from the iov_iter. The check is folded into the bvec extraction that
already walks the vectors, so it adds only a comparison on a path that
is pinning direct-io pages anyway. Misalignment is now uniformly
rejected with EINVAL before submission for every direct-io path.
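
To illustrate the kind of check described above, here is a minimal userspace
sketch. It is not the kernel code from this series: struct user_seg and
user_segs_aligned() are hypothetical names, and the only thing taken from the
series is the idea of testing address and length against an alignment mask
(alignment minus one).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One user buffer segment, similar in spirit to a struct iovec entry. */
struct user_seg {
	uintptr_t addr;
	size_t len;
};

/*
 * Return true if every segment's start address and length are aligned
 * to mem_align_mask + 1 bytes. A mask of 0 accepts everything.
 */
static bool user_segs_aligned(const struct user_seg *segs, size_t nr_segs,
			      unsigned int mem_align_mask)
{
	for (size_t i = 0; i < nr_segs; i++)
		if ((segs[i].addr | segs[i].len) & mem_align_mask)
			return false;
	return true;
}
```

OR-ing the address and length before masking keeps the per-segment cost to a
single comparison, which matches the claim that the check adds little to a
path that already walks the vectors.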

With this in place, the dm side changes under discussion are no longer
required to fix the bugs: the affected targets simply never see the
invalid bio. The tested patches remain reasonable as defense-in-depth if
desired, but they are not strictly necessary after this.

v1->v2:

 I've included some prep patches that fix other issues in this path.

 Renamed the alignment to "mem_align_mask", re-ordered the function
 parameters so it appears before the length alignment, and added the
 appropriate kerneldoc.

 Added additional comments to explain the rationale behind the checks.

 For debug kernels (CONFIG_DEBUG_KERNEL), a bio_vec iterator is checked in
 its entirety. The existing use cases appear to need only the first vector
 checked, so the more expensive exhaustive check runs only on debug
 kernels.

Keith Busch (6):
  block: introduce bio_endio_errno helper
  block: report the actual status
  block: fix dio leak on metadata mapping error
  loop: set dma_alignment from the backing file for direct I/O
  zloop: set dma_alignment from the backing files for direct I/O
  block: validate user space vectors during extraction

 block/bio.c            | 50 +++++++++++++++++++++++++++++++++++++++---
 block/blk-map.c        |  2 +-
 block/blk-merge.c      |  4 ++--
 block/fops.c           |  9 +++++---
 drivers/block/loop.c   | 50 +++++++++++++++++++++++++++++++++++-------
 drivers/block/zloop.c  | 22 +++++++++++++++++--
 fs/iomap/direct-io.c   |  1 +
 include/linux/bio.h    |  2 +-
 include/linux/blkdev.h |  5 +++++
 include/linux/uio.h    |  3 ++-
 lib/iov_iter.c         |  9 +++++++-
 11 files changed, 135 insertions(+), 22 deletions(-)

-- 
2.52.0

* [PATCHv2 3/6] block: fix dio leak on metadata mapping error
From: Keith Busch @ 2026-06-22 17:42 UTC (permalink / raw)
  To: linux-block, linux-fsdevel
  Cc: dm-devel, hch, axboe, brauner, djwong, viro, Keith Busch
In-Reply-To: <20260622174241.2299563-1-kbusch@meta.com>

From: Keith Busch <kbusch@kernel.org>

A failed integrity mapping holds a dio reference, so we need to go
through the full bio completion path in case bios were previously
submitted in the sequence.

Fixes: 2729a60bbfb92 ("block: don't silently ignore metadata for sync read/write")
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/fops.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index f237d6cab8975..b5c320da28123 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -238,8 +238,10 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 		}
 		if (iocb->ki_flags & IOCB_HAS_METADATA) {
 			ret = bio_integrity_map_iter(bio, iocb->private);
-			if (unlikely(ret))
-				goto fail;
+			if (unlikely(ret)) {
+				bio_endio_errno(bio, ret);
+				break;
+			}
 		}
 
 		if (is_read) {
-- 
2.52.0


* [PATCHv2 1/6] block: introduce bio_endio_errno helper
From: Keith Busch @ 2026-06-22 17:42 UTC (permalink / raw)
  To: linux-block, linux-fsdevel
  Cc: dm-devel, hch, axboe, brauner, djwong, viro, Keith Busch
In-Reply-To: <20260622174241.2299563-1-kbusch@meta.com>

From: Keith Busch <kbusch@kernel.org>

No functional change; purely introducing a convenience function.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/blk-merge.c      | 4 ++--
 include/linux/blkdev.h | 5 +++++
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index ab1161ca69f1e..c93170f340977 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -122,7 +122,7 @@ struct bio *bio_submit_split_bioset(struct bio *bio, unsigned int split_sectors,
 	struct bio *split = bio_split(bio, split_sectors, GFP_NOIO, bs);
 
 	if (IS_ERR(split)) {
-		bio_endio_status(bio, errno_to_blk_status(PTR_ERR(split)));
+		bio_endio_errno(bio, PTR_ERR(split));
 		return NULL;
 	}
 
@@ -142,7 +142,7 @@ EXPORT_SYMBOL_GPL(bio_submit_split_bioset);
 static struct bio *bio_submit_split(struct bio *bio, int split_sectors)
 {
 	if (unlikely(split_sectors < 0)) {
-		bio_endio_status(bio, errno_to_blk_status(split_sectors));
+		bio_endio_errno(bio, split_sectors);
 		return NULL;
 	}
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9213a5716f95a..88e4bd88c3e28 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1047,6 +1047,11 @@ extern const char *blk_op_str(enum req_op op);
 int blk_status_to_errno(blk_status_t status);
 blk_status_t errno_to_blk_status(int errno);
 
+static inline void bio_endio_errno(struct bio *bio, int errno)
+{
+	bio_endio_status(bio, errno_to_blk_status(errno));
+}
+
 /* only poll the hardware once, don't continue until a completion was found */
 #define BLK_POLL_ONESHOT		(1 << 0)
 int bio_poll(struct bio *bio, struct io_comp_batch *iob, unsigned int flags);
-- 
2.52.0


* [PATCHv2 4/6] loop: set dma_alignment from the backing file for direct I/O
From: Keith Busch @ 2026-06-22 17:42 UTC (permalink / raw)
  To: linux-block, linux-fsdevel
  Cc: dm-devel, hch, axboe, brauner, djwong, viro, Keith Busch
In-Reply-To: <20260622174241.2299563-1-kbusch@meta.com>

From: Keith Busch <kbusch@kernel.org>

Direct I/O user pages are forwarded to the backing file unchanged, so
the backing's DMA alignment requirement applies to them. Track the
backing's dio_mem_align and advertise it as the loop device's
dma_alignment so that the limits are correct and misaligned I/O is
rejected here instead of being dispatched to the backend.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/block/loop.c | 50 +++++++++++++++++++++++++++++++++++++-------
 1 file changed, 42 insertions(+), 8 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 310de0463beb1..7114f80ab162a 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -54,6 +54,7 @@ struct loop_device {
 
 	struct file	*lo_backing_file;
 	unsigned int	lo_min_dio_size;
+	unsigned int	lo_dio_mem_align;
 	struct block_device *lo_device;
 
 	gfp_t		old_gfp_mask;
@@ -447,26 +448,37 @@ static void loop_reread_partitions(struct loop_device *lo)
 			__func__, lo->lo_number, lo->lo_file_name, rc);
 }
 
-static unsigned int loop_query_min_dio_size(struct loop_device *lo)
+static void loop_update_dio_alignment(struct loop_device *lo)
 {
 	struct file *file = lo->lo_backing_file;
 	struct block_device *sb_bdev = file->f_mapping->host->i_sb->s_bdev;
 	struct kstat st;
 
 	/*
-	 * Use the minimal dio alignment of the file system if provided.
+	 * Use the dio alignment of the file system if provided.  dio_offset_align
+	 * is the minimum dio size and offset; dio_mem_align is the buffer memory
+	 * alignment, kept as a mask to become the loop device's dma_alignment in
+	 * direct I/O mode where the buffer is handed to the backing file unchanged.
 	 */
 	if (!vfs_getattr(&file->f_path, &st, STATX_DIOALIGN, 0) &&
-	    (st.result_mask & STATX_DIOALIGN))
-		return st.dio_offset_align;
+	    (st.result_mask & STATX_DIOALIGN)) {
+		lo->lo_min_dio_size = st.dio_offset_align;
+		lo->lo_dio_mem_align = st.dio_mem_align - 1;
+		return;
+	}
 
 	/*
 	 * In a perfect world this wouldn't be needed, but as of Linux 6.13 only
 	 * a handful of file systems support the STATX_DIOALIGN flag.
 	 */
-	if (sb_bdev)
-		return bdev_logical_block_size(sb_bdev);
-	return SECTOR_SIZE;
+	if (sb_bdev) {
+		lo->lo_min_dio_size = bdev_logical_block_size(sb_bdev);
+		lo->lo_dio_mem_align = bdev_dma_alignment(sb_bdev);
+		return;
+	}
+
+	lo->lo_min_dio_size = SECTOR_SIZE;
+	lo->lo_dio_mem_align = SECTOR_SIZE - 1;
 }
 
 static inline int is_loop_device(struct file *file)
@@ -509,7 +521,7 @@ static void loop_assign_backing_file(struct loop_device *lo, struct file *file)
 			lo->old_gfp_mask & ~(__GFP_IO | __GFP_FS));
 	if (lo->lo_backing_file->f_flags & O_DIRECT)
 		lo->lo_flags |= LO_FLAGS_DIRECT_IO;
-	lo->lo_min_dio_size = loop_query_min_dio_size(lo);
+	loop_update_dio_alignment(lo);
 }
 
 static int loop_check_backing_file(struct file *file)
@@ -961,6 +973,17 @@ static void loop_update_limits(struct loop_device *lo, struct queue_limits *lim,
 	lim->logical_block_size = bsize;
 	lim->physical_block_size = bsize;
 	lim->io_min = bsize;
+	/*
+	 * In direct I/O the user pages are handed to the backing file as-is, so
+	 * the backing's DMA alignment requirement applies to them.  Advertise it
+	 * so misaligned I/O is rejected at this device's entry instead of being
+	 * dispatched to the backend.  Buffered I/O copies through the page cache
+	 * and imposes no such requirement.
+	 */
+	if (lo->lo_flags & LO_FLAGS_DIRECT_IO)
+		lim->dma_alignment = lo->lo_dio_mem_align;
+	else
+		lim->dma_alignment = SECTOR_SIZE - 1;
 	lim->features &= ~(BLK_FEAT_WRITE_CACHE | BLK_FEAT_ROTATIONAL);
 	if (file->f_op->fsync && !(lo->lo_flags & LO_FLAGS_READ_ONLY))
 		lim->features |= BLK_FEAT_WRITE_CACHE;
@@ -1416,6 +1439,7 @@ static int loop_set_dio(struct loop_device *lo, unsigned long arg)
 {
 	bool use_dio = !!arg;
 	unsigned int memflags;
+	struct queue_limits lim;
 
 	if (lo->lo_state != Lo_bound)
 		return -ENXIO;
@@ -1434,6 +1458,16 @@ static int loop_set_dio(struct loop_device *lo, unsigned long arg)
 		lo->lo_flags |= LO_FLAGS_DIRECT_IO;
 	else
 		lo->lo_flags &= ~LO_FLAGS_DIRECT_IO;
+	/*
+	 * Direct I/O forwards the user pages to the backing file unchanged, so
+	 * track the backing's DMA alignment requirement as the mode is toggled.
+	 */
+	lim = queue_limits_start_update(lo->lo_queue);
+	if (lo->lo_flags & LO_FLAGS_DIRECT_IO)
+		lim.dma_alignment = lo->lo_dio_mem_align;
+	else
+		lim.dma_alignment = SECTOR_SIZE - 1;
+	queue_limits_commit_update(lo->lo_queue, &lim);
 	blk_mq_unfreeze_queue(lo->lo_queue, memflags);
 	return 0;
 }
-- 
2.52.0
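
The alignment-to-mask conversion used in the patch above can be sketched in
isolation as follows. This is an illustration, not the patch's code:
dio_mem_align_to_mask() and its fallback_mask parameter are hypothetical, and
the zero check is an added assumption based on statx(2) documenting
stx_dio_mem_align as 0 when direct I/O is not supported on the file.

```c
/*
 * Convert a direct I/O memory alignment in bytes (assumed to be a power
 * of two) into a dma_alignment style mask. If the alignment is reported
 * as 0, return fallback_mask instead of letting "0 - 1" wrap around.
 */
static unsigned int dio_mem_align_to_mask(unsigned int dio_mem_align,
					  unsigned int fallback_mask)
{
	if (!dio_mem_align)
		return fallback_mask;
	return dio_mem_align - 1;
}
```

For example, an alignment of 512 bytes yields the mask 511, and an alignment
of 4 bytes yields 3, which is the form the dma_alignment limit expects.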


* [PATCHv2 5/6] zloop: set dma_alignment from the backing files for direct I/O
From: Keith Busch @ 2026-06-22 17:42 UTC (permalink / raw)
  To: linux-block, linux-fsdevel
  Cc: dm-devel, hch, axboe, brauner, djwong, viro, Keith Busch
In-Reply-To: <20260622174241.2299563-1-kbusch@meta.com>

From: Keith Busch <kbusch@kernel.org>

Direct I/O requests use pages handed to the backing files unchanged, so
the backing's DMA alignment requirement applies. Track dio_mem_align and
advertise it as the device's dma_alignment so we communicate proper
limits and misaligned I/O is rejected here instead of reaching the
backend.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/block/zloop.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/block/zloop.c b/drivers/block/zloop.c
index 55eeb6aac0ea3..1149b817b5bc9 100644
--- a/drivers/block/zloop.c
+++ b/drivers/block/zloop.c
@@ -144,6 +144,7 @@ struct zloop_device {
 	unsigned int		nr_conv_zones;
 	unsigned int		max_open_zones;
 	unsigned int		block_size;
+	unsigned int		dio_mem_align;
 
 	spinlock_t		open_zones_lock;
 	struct list_head	open_zones_lru_list;
@@ -1035,6 +1036,9 @@ static int zloop_get_block_size(struct zloop_device *zlo,
 {
 	struct block_device *sb_bdev = zone->file->f_mapping->host->i_sb->s_bdev;
 	struct kstat st;
+	bool have_dioalign = !vfs_getattr(&zone->file->f_path, &st,
+					  STATX_DIOALIGN, 0) &&
+			     (st.result_mask & STATX_DIOALIGN);
 
 	/*
 	 * If the FS block size is lower than or equal to 4K, use that as the
@@ -1044,14 +1048,25 @@ static int zloop_get_block_size(struct zloop_device *zlo,
 	 */
 	if (file_inode(zone->file)->i_sb->s_blocksize <= SZ_4K)
 		zlo->block_size = file_inode(zone->file)->i_sb->s_blocksize;
-	else if (!vfs_getattr(&zone->file->f_path, &st, STATX_DIOALIGN, 0) &&
-		 (st.result_mask & STATX_DIOALIGN))
+	else if (have_dioalign)
 		zlo->block_size = st.dio_offset_align;
 	else if (sb_bdev)
 		zlo->block_size = bdev_physical_block_size(sb_bdev);
 	else
 		zlo->block_size = SECTOR_SIZE;
 
+	/*
+	 * In direct I/O the request's pages are handed to the backing files
+	 * unchanged, so track their required memory alignment as a mask for
+	 * dma_alignment.
+	 */
+	if (have_dioalign)
+		zlo->dio_mem_align = st.dio_mem_align - 1;
+	else if (sb_bdev)
+		zlo->dio_mem_align = bdev_dma_alignment(sb_bdev);
+	else
+		zlo->dio_mem_align = SECTOR_SIZE - 1;
+
 	if (zlo->zone_capacity & ((zlo->block_size >> SECTOR_SHIFT) - 1)) {
 		pr_err("Zone capacity is not aligned to block size %u\n",
 		       zlo->block_size);
@@ -1279,6 +1294,9 @@ static int zloop_ctl_add(struct zloop_options *opts)
 
 	lim.physical_block_size = zlo->block_size;
 	lim.logical_block_size = zlo->block_size;
+	/* Direct I/O hands the request's pages to the backing files unchanged. */
+	if (!opts->buffered_io)
+		lim.dma_alignment = zlo->dio_mem_align;
 	if (zlo->zone_append)
 		lim.max_hw_zone_append_sectors = lim.max_hw_sectors;
 	lim.max_open_zones = zlo->max_open_zones;
-- 
2.52.0


* [PATCHv2 2/6] block: report the actual status
From: Keith Busch @ 2026-06-22 17:42 UTC (permalink / raw)
  To: linux-block, linux-fsdevel
  Cc: dm-devel, hch, axboe, brauner, djwong, viro, Keith Busch
In-Reply-To: <20260622174241.2299563-1-kbusch@meta.com>

From: Keith Busch <kbusch@kernel.org>

Rather than assuming EIO, report the actual error status so that user
space can see why the I/O failed.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/fops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/fops.c b/block/fops.c
index 15783a6180dec..f237d6cab8975 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -218,7 +218,7 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 
 		ret = blkdev_iov_iter_get_pages(bio, iter, bdev);
 		if (unlikely(ret)) {
-			bio_endio_status(bio, BLK_STS_IOERR);
+			bio_endio_errno(bio, ret);
 			break;
 		}
 		if (iocb->ki_flags & IOCB_NOWAIT) {
-- 
2.52.0


* [PATCHv2 6/6] block: validate user space vectors during extraction
From: Keith Busch @ 2026-06-22 17:42 UTC (permalink / raw)
  To: linux-block, linux-fsdevel
  Cc: dm-devel, hch, axboe, brauner, djwong, viro, Keith Busch, stable
In-Reply-To: <20260622174241.2299563-1-kbusch@meta.com>

From: Keith Busch <kbusch@kernel.org>

Bio-based drivers don't necessarily run the alignment check done at
split time, and stacking block drivers don't always handle a misalignment
detected after submitting the bio. Validate user vectors against the
device's dma_alignment as the bio is built from the iov_iter, rejecting
misaligned buffers early with -EINVAL.

Cc: stable@vger.kernel.org
Fixes: 5ff3f74e145a ("block: simplify direct io validity check")
Fixes: 7eac33186957 ("iomap: simplify direct io validity check")
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/bio.c          | 50 +++++++++++++++++++++++++++++++++++++++++---
 block/blk-map.c      |  2 +-
 block/fops.c         |  1 +
 fs/iomap/direct-io.c |  1 +
 include/linux/bio.h  |  2 +-
 include/linux/uio.h  |  3 ++-
 lib/iov_iter.c       |  9 +++++++-
 7 files changed, 61 insertions(+), 7 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index f2a5f4d0a9672..4360149d4eba2 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1220,10 +1220,39 @@ static int bio_iov_iter_align_down(struct bio *bio, struct iov_iter *iter,
 	return 0;
 }
 
+#ifdef CONFIG_DEBUG_KERNEL
+static inline bool bio_iov_bvec_aligned(const struct bio *bio,
+					unsigned mem_align_mask)
+{
+	struct bvec_iter iter;
+	struct bio_vec bv;
+
+	for_each_mp_bvec(bv, bio->bi_io_vec, iter, bio->bi_iter)
+		if ((bv.bv_offset | bv.bv_len) & mem_align_mask)
+			return false;
+	return true;
+}
+#else
+static inline bool bio_iov_bvec_aligned(const struct bio *bio,
+					unsigned mem_align_mask)
+{
+	/*
+	 * The vectors are owned and laid out by the caller; we only forward
+	 * them. Most callers are already aligned, but io_uring can place a
+	 * user chosen offset through a registered buffer, where only the first
+	 * vector may be unaligned.
+	 */
+	return !(mp_bvec_iter_offset(bio->bi_io_vec, bio->bi_iter) &
+							mem_align_mask);
+}
+#endif
+
 /**
  * bio_iov_iter_get_pages - add user or kernel pages to a bio
  * @bio: bio to add pages to
  * @iter: iov iterator describing the region to be added
+ * @mem_align_mask: the mask the source address and length must be aligned to,
+ *	0 for no requirement
  * @len_align_mask: the mask to align the total size to, 0 for any length
  *
  * This takes either an iterator pointing to user memory, or one pointing to
@@ -1242,7 +1271,7 @@ static int bio_iov_iter_align_down(struct bio *bio, struct iov_iter *iter,
  * is returned only if 0 pages could be pinned.
  */
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
-			   unsigned len_align_mask)
+			   unsigned mem_align_mask, unsigned len_align_mask)
 {
 	iov_iter_extraction_t flags = 0;
 
@@ -1251,6 +1280,10 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
 
 	if (iov_iter_is_bvec(iter)) {
 		bio_iov_bvec_set(bio, iter);
+
+		if (!bio_iov_bvec_aligned(bio, mem_align_mask))
+			return -EINVAL;
+
 		iov_iter_advance(iter, bio->bi_iter.bi_size);
 		return 0;
 	}
@@ -1265,8 +1298,19 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
 
 		ret = iov_iter_extract_bvecs(iter, bio->bi_io_vec,
 				BIO_MAX_SIZE - bio->bi_iter.bi_size,
-				&bio->bi_vcnt, bio->bi_max_vecs, flags);
+				&bio->bi_vcnt, bio->bi_max_vecs,
+				mem_align_mask, flags);
 		if (ret <= 0) {
+			/*
+			 * A misaligned vector fails the whole I/O.  Release any
+			 * pages pinned by earlier iterations before returning
+			 * since this bio won't be submitted to release them.
+			 */
+			if (ret == -EINVAL) {
+				bio_release_pages(bio, false);
+				bio_clear_flag(bio, BIO_PAGE_PINNED);
+				bio->bi_vcnt = 0;
+			}
 			if (!bio->bi_vcnt)
 				return ret;
 			break;
@@ -1377,7 +1421,7 @@ static int bio_iov_iter_bounce_read(struct bio *bio, struct iov_iter *iter,
 		ssize_t ret;
 
 		ret = iov_iter_extract_bvecs(iter, bio->bi_io_vec + 1, len,
-				&bio->bi_vcnt, bio->bi_max_vecs - 1, 0);
+				&bio->bi_vcnt, bio->bi_max_vecs - 1, 0, 0);
 		if (ret <= 0) {
 			if (!bio->bi_vcnt) {
 				folio_put(folio);
diff --git a/block/blk-map.c b/block/blk-map.c
index 768549f19f97e..c9535efe1a913 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -274,7 +274,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 	 * No alignment requirements on our part to support arbitrary
 	 * passthrough commands.
 	 */
-	ret = bio_iov_iter_get_pages(bio, iter, 0);
+	ret = bio_iov_iter_get_pages(bio, iter, 0, 0);
 	if (ret)
 		goto out_put;
 	ret = blk_rq_append_bio(rq, bio);
diff --git a/block/fops.c b/block/fops.c
index b5c320da28123..84eeabd97e1f0 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -47,6 +47,7 @@ static inline int blkdev_iov_iter_get_pages(struct bio *bio,
 		struct iov_iter *iter, struct block_device *bdev)
 {
 	return bio_iov_iter_get_pages(bio, iter,
+			bdev_dma_alignment(bdev),
 			bdev_logical_block_size(bdev) - 1);
 }
 
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index b485e3b191daf..ff458aa12ae29 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -358,6 +358,7 @@ static ssize_t iomap_dio_bio_iter_one(struct iomap_iter *iter,
 				iomap_max_bio_size(&iter->iomap), alignment);
 	else
 		ret = bio_iov_iter_get_pages(bio, dio->submit.iter,
+					     bdev_dma_alignment(bio->bi_bdev),
 					     alignment - 1);
 	if (unlikely(ret))
 		goto out_put_bio;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8f33f717b14f5..ce34ea49ef358 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -477,7 +477,7 @@ int bdev_rw_virt(struct block_device *bdev, sector_t sector, void *data,
 		size_t len, enum req_op op);
 
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
-		unsigned len_align_mask);
+		unsigned mem_align_mask, unsigned len_align_mask);
 
 void bio_iov_bvec_set(struct bio *bio, const struct iov_iter *iter);
 void __bio_release_pages(struct bio *bio, bool mark_dirty);
diff --git a/include/linux/uio.h b/include/linux/uio.h
index a9bc5b3067e32..653dee76c0b33 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -391,7 +391,8 @@ ssize_t iov_iter_extract_pages(struct iov_iter *i, struct page ***pages,
 			       size_t *offset0);
 ssize_t iov_iter_extract_bvecs(struct iov_iter *iter, struct bio_vec *bv,
 		size_t max_size, unsigned short *nr_vecs,
-		unsigned short max_vecs, iov_iter_extraction_t extraction_flags);
+		unsigned short max_vecs, unsigned mem_align_mask,
+		iov_iter_extraction_t extraction_flags);
 
 /**
  * iov_iter_extract_will_pin - Indicate how pages from the iterator will be retained
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 273919b161617..8d5ca3e38522a 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1886,6 +1886,8 @@ static unsigned int get_contig_folio_len(struct page **pages,
  * @max_size:	maximum size to extract from @iter
  * @nr_vecs:	number of vectors in @bv (on in and output)
  * @max_vecs:	maximum vectors in @bv, including those filled before calling
+ * @mem_align_mask:	reject with -EINVAL if the source address or length is not
+ *		aligned to this mask
  * @extraction_flags: flags to qualify request
  *
  * Like iov_iter_extract_pages(), but returns physically contiguous ranges
@@ -1897,14 +1899,19 @@ static unsigned int get_contig_folio_len(struct page **pages,
  */
 ssize_t iov_iter_extract_bvecs(struct iov_iter *iter, struct bio_vec *bv,
 		size_t max_size, unsigned short *nr_vecs,
-		unsigned short max_vecs, iov_iter_extraction_t extraction_flags)
+		unsigned short max_vecs, unsigned mem_align_mask,
+		iov_iter_extraction_t extraction_flags)
 {
+	unsigned long start = (unsigned long)iter_iov_addr(iter);
 	unsigned short entries_left = max_vecs - *nr_vecs;
 	unsigned short nr_pages, i = 0;
 	size_t left, offset, len;
 	struct page **pages;
 	ssize_t size;
 
+	if ((start | iter_iov_len(iter)) & mem_align_mask)
+		return -EINVAL;
+
 	/*
 	 * Move page array up in the allocated memory for the bio vecs as far as
 	 * possible so that we can start filling biovecs from the beginning
-- 
2.52.0
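
The alignment test added to iov_iter_extract_bvecs() above is a plain mask check, and it can be tried outside the kernel. The following is a minimal userspace sketch, not kernel code: buf_is_aligned() is a hypothetical helper, and it assumes the mask is "alignment - 1" with a power-of-two alignment, which is how the callers in the diff build it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical userspace helper mirroring the check in the patch:
 * OR the start address and the length together, then test the low
 * bits selected by the mask. Any set bit means the address or the
 * length is not a multiple of (mask + 1).
 */
static bool buf_is_aligned(const void *addr, size_t len,
			   unsigned int mem_align_mask)
{
	return (((uintptr_t)addr | len) & mem_align_mask) == 0;
}
```

A mask of 0 accepts every buffer, which matches the callers in the diff that pass 0 because they have no memory alignment requirement.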


^ permalink raw reply related

* Re: [PATCH v4 0/3] crypto: skcipher - per-request multi-data-unit batching
From: Eric Biggers @ 2026-06-22 18:23 UTC (permalink / raw)
  To: Leonid Ravich
  Cc: Herbert Xu, Alasdair Kergon, Ard Biesheuvel, Jens Axboe, dm-devel,
	linux-block
In-Reply-To: <20260622071044.4079-1-lravich@amazon.com>

On Mon, Jun 22, 2026 at 07:10:44AM +0000, Leonid Ravich wrote:
> On Mon, Jun 15, 2026 at 03:53:17PM -0700, Eric Biggers wrote:
> > So in other words, this series slows down dm-crypt and crypto_skcipher
> > for everyone to optimize for an out-of-tree driver.  And there's also no
> > benchmark showing that your driver is even worth it over just using the
> > CPU.
> 
> I measured on arm64 (Graviton3, dm-crypt + xts-aes-ce, RAM-backed,
> fixed CPU freq):
> 
>   - 4 KiB random write, 512-byte sectors: v4 as posted regressed ~5%.
>     Root cause (ftrace): a per-bio kmalloc_array() for the scatterlists,
>     where the per-sector path uses dm-crypt's inline sg_in[]/sg_out[].
> 
>   - Reusing the inline arrays when the segment count fits (heap only for
>     larger bios) removes the regression, back to parity. This will be in
>     the dm-crypt patch for v5.
> 
> So the software path is neutral after the fix, not slower. There is no
> software throughput win either: the auto-splitter still calls alg->encrypt
> per data unit. The win is for a consumer that takes the whole request in
> one pass: a HW engine, or any async offload engine that pays a fixed
> per-request cost. Such a consumer currently pays that cost once per sector
> instead of once per bio.
> 
> I'd rather not over-complicate the patches until there's a general
> ack on the direction: per-request data_unit_size + auto-split,
> enabling one-pass consumers, neutral for everyone else. Is that direction
> acceptable? If so I'll respin v5.

I don't think there's a path forward without an in-tree user that's
shown to be worthwhile over just using the acceleration built directly
into the CPU.  It would also need confirmation of no regression to existing
users, including in cases where the inline sg list can't be used.

- Eric
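
The "inline arrays when the segment count fits, heap only for larger bios" idea quoted above is a common small-buffer pattern. Here is a minimal userspace sketch of it, not the dm-crypt code: struct seg, struct seg_buf, INLINE_SEGS and the seg_buf_*() helpers are all hypothetical names, and calloc() stands in for the kernel allocator.

```c
#include <stdlib.h>

#define INLINE_SEGS 4	/* hypothetical inline capacity */

struct seg {
	void *addr;
	size_t len;
};

struct seg_buf {
	struct seg inline_segs[INLINE_SEGS];
	struct seg *segs;	/* points at inline_segs or at a heap array */
};

/* Use the inline array when it is large enough, otherwise allocate. */
static int seg_buf_init(struct seg_buf *b, size_t nsegs)
{
	if (nsegs <= INLINE_SEGS) {
		b->segs = b->inline_segs;
		return 0;
	}
	b->segs = calloc(nsegs, sizeof(*b->segs));
	return b->segs ? 0 : -1;
}

/* True when the slow (allocating) path was taken. */
static int seg_buf_on_heap(const struct seg_buf *b)
{
	return b->segs != b->inline_segs;
}

/* Free only what was actually allocated. */
static void seg_buf_release(struct seg_buf *b)
{
	if (b->segs != b->inline_segs)
		free(b->segs);
	b->segs = NULL;
}
```

The case raised in the reply, where the inline list cannot be used, corresponds to the calloc() branch here, so that branch is the one a regression test would need to exercise.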

^ permalink raw reply

* Re: [PATCH blktests] scsi/009: fix unset bytes_to_write in TEST 8
From: Shin'ichiro Kawasaki @ 2026-06-22 21:30 UTC (permalink / raw)
  To: Sebastian Chlad; +Cc: linux-block, Sebastian Chlad, alan.adamson
In-Reply-To: <ajIhtkOIMXeM6BAI@shinmob>

On Jun 17, 2026 / 13:29, Shin'ichiro Kawasaki wrote:
> CC+ Alan,
> 
> On Jun 14, 2026 / 20:16, Sebastian Chlad wrote:
> > bytes_to_write was never assigned before TEST 8, causing it to pass for
> > the wrong reason. Set it to atomic_unit_max_bytes + logical_block_size
> > and update the golden output with the expected "pwrite: Invalid argument"
> > from xfs_io.
> > 
> > Signed-off-by: Sebastian Chlad <sebastian.chlad@suse.com>
> 
> Thanks. The change looks good to me.
> 
> I will wait a few more days just in case anyone has an opinion on the change.
> FYI: Sebastian posted a similar change for nvme/059 [*].
> 
> [*] https://github.com/linux-blktests/blktests/pull/245

I applied the patch. Thanks!

^ permalink raw reply

* Re: [PATCH] block: fix incorrect error injection static key decrement
From: Jens Axboe @ 2026-06-22 22:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: dlemoal, linux-block
In-Reply-To: <20260622160752.1552516-1-hch@lst.de>


On Mon, 22 Jun 2026 18:07:52 +0200, Christoph Hellwig wrote:
> Only decrement the static key when we had items and thus it was
> incremented before.

Applied, thanks!

[1/1] block: fix incorrect error injection static key decrement
      commit: 214cdae69dba9bb1fc0b517b7fb97bab385a2e3a

Best regards,
-- 
Jens Axboe
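
The rule in the commit message, decrement only if a matching increment happened, can be illustrated with a plain counter. This is a sketch under assumptions and not the block layer code: key_refs stands in for the static key, and struct inject_state, inject_add() and inject_clear() are hypothetical names.

```c
/* Stand-in for the static key's enable count (assumption). */
static int key_refs;

struct inject_state {
	unsigned int nr_items;
};

/* The first item enables the key; later items leave it alone. */
static void inject_add(struct inject_state *s)
{
	if (s->nr_items++ == 0)
		key_refs++;
}

/* Drop the key reference only if this state ever took one. */
static void inject_clear(struct inject_state *s)
{
	if (s->nr_items) {
		s->nr_items = 0;
		key_refs--;
	}
}
```

Without the nr_items test in inject_clear(), clearing an empty state would drive key_refs below zero, which is the class of imbalance the commit message describes.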




^ permalink raw reply

* Re: [PATCH] block, bfq: protect async queue reset with blkcg locks
From: Jens Axboe @ 2026-06-22 22:00 UTC (permalink / raw)
  To: Yu Kuai, Tejun Heo, Josef Bacik, Arianna Avanzini, Paolo Valente,
	Cen Zhang
  Cc: linux-block, cgroups, linux-kernel, baijiaju1990
In-Reply-To: <20260621135930.2657810-1-zzzccc427@gmail.com>


On Sun, 21 Jun 2026 21:59:30 +0800, Cen Zhang wrote:
> Writing 0 to BFQ's low_latency attribute ends weight raising for active,
> idle and async queues. The async cgroup path walks q->blkg_list, converts
> each blkg to BFQ policy data and then reads bfqg->async_bfqq and
> bfqg->async_idle_bfqq.
> 
> That walk was protected only by bfqd->lock. blkcg release work is
> serialized by q->blkcg_mutex and q->queue_lock instead, and
> blkg_free_workfn() can call BFQ's pd_free_fn before it removes
> blkg->q_node from q->blkg_list. A low_latency reset can therefore still
> find the blkg on the queue list after the BFQ policy data has been freed.
> 
> [...]

Applied, thanks!

[1/1] block, bfq: protect async queue reset with blkcg locks
      commit: 17b2d950a3c0328ed749476e6118ca869b3ca8b5

Best regards,
-- 
Jens Axboe




^ permalink raw reply

* Re: [PATCH] nbd: don't warn when reclassifying a busy socket lock
From: Jens Axboe @ 2026-06-22 22:00 UTC (permalink / raw)
  To: josef, edumazet, Deepanshu Kartikey
  Cc: linux-block, nbd, linux-kernel, syzbot+6b85d1e39a5b8ed9a954
In-Reply-To: <20260621235255.66015-1-kartikey406@gmail.com>


On Mon, 22 Jun 2026 05:22:55 +0530, Deepanshu Kartikey wrote:
> nbd_reclassify_socket() warns via WARN_ON_ONCE() if the socket lock is
> held at the point of reclassification. That assertion was copied from
> nvme-tcp, where the socket is created internally by the kernel
> (sock_create_kern()) and is never visible to user space, so the lock
> is guaranteed to be free.
> 
> NBD is different: the socket is looked up from a user-supplied fd in
> nbd_get_socket(), and user space retains that fd. A concurrent syscall
> on the same socket (or softirq processing taking bh_lock_sock() on a
> connected TCP socket) can legitimately hold the lock at the instant
> NBD reclassifies it. sock_allow_reclassification() then returns false
> and the WARN_ON_ONCE() fires, which turns into a crash under
> panic_on_warn. This is reachable by simply racing NBD_CMD_CONNECT
> against socket activity on the same fd, as reported by syzbot.
> 
> [...]

Applied, thanks!

[1/1] nbd: don't warn when reclassifying a busy socket lock
      commit: 9280e6edf65662b6aafc8b704ad065b54c08b519

Best regards,
-- 
Jens Axboe




^ permalink raw reply

* Re: [PATCH V2] blk-cgroup: fix UAF in __blkcg_rstat_flush()
From: Jens Axboe @ 2026-06-22 22:01 UTC (permalink / raw)
  To: linux-block, Ming Lei
  Cc: Michal Koutný, stable, Jay Shin, Tejun Heo, Waiman Long,
	coregee2000
In-Reply-To: <20260205155425.342084-1-ming.lei@redhat.com>


On Thu, 05 Feb 2026 23:54:23 +0800, Ming Lei wrote:
> When multiple blkgs in the same blkcg are released concurrently,
> a use-after-free can occur. The race happens when one blkg's
> __blkcg_rstat_flush() removes another blkg's iostat entries via
> llist_del_all(). The second blkg sees an empty list and proceeds
> to free itself while the first is still iterating over its entries.
> 
> Move the flush from __blkg_release() (RCU callback) to blkg_release()
> (before call_rcu). This ensures the RCU grace period waits for any
> concurrent flush's rcu_read_lock() section to complete before freeing.
> 
> [...]

Applied, thanks!

[1/1] blk-cgroup: fix UAF in __blkcg_rstat_flush()
      commit: 0ab5ee5a1badb58cbb2242617cb01a4972b1f2a2

Best regards,
-- 
Jens Axboe




^ permalink raw reply

* Re: [PATCH V3] blk-cgroup: defer blkcg css_put until blkg is unlinked from queue
From: Jens Axboe @ 2026-06-22 22:01 UTC (permalink / raw)
  To: tj, josef, linux-block, Zizhi Wo
  Cc: cgroups, yangerkun, chengzhihao1, houtao1, yukuai
In-Reply-To: <20260616011746.2451461-1-wozizhi@huaweicloud.com>


On Tue, 16 Jun 2026 09:17:46 +0800, Zizhi Wo wrote:
> [BUG]
> Our fuzz testing triggered a blkcg use-after-free issue:
> 
>   BUG: KASAN: slab-use-after-free in _raw_spin_lock+0x75/0xe0
>   Call Trace:
>   ...
>   blkcg_deactivate_policy+0x244/0x4d0
>   ioc_rqos_exit+0x44/0xe0
>   rq_qos_exit+0xba/0x120
>   __del_gendisk+0x50b/0x800
>   del_gendisk+0xff/0x190
>   ...
> 
> [...]

Applied, thanks!

[1/1] blk-cgroup: defer blkcg css_put until blkg is unlinked from queue
      commit: 3ed9b4779a4aa3f44cd9f78627498d7adac40daa

Best regards,
-- 
Jens Axboe




^ permalink raw reply

* Re: [PATCH 1/2 blktests] src/miniublk: switch to ioctl-encoded ublk commands
From: Shin'ichiro Kawasaki @ 2026-06-22 22:21 UTC (permalink / raw)
  To: Sebastian Chlad; +Cc: Sebastian Chlad, linux-block
In-Reply-To: <CAJR+Y9K=0C+TnKAycdXbeQF98FE=RhaYYvK6SCpLPbdeMH2Xxw@mail.gmail.com>

On Jun 22, 2026 / 15:34, Sebastian Chlad wrote:
[...]
> > > diff --git a/src/miniublk.c b/src/miniublk.c
> > > index f98f850..5a35ca7 100644
> > > --- a/src/miniublk.c
> > > +++ b/src/miniublk.c
> > [...]
> > > @@ -624,9 +624,9 @@ static int ublk_queue_io_cmd(struct ublk_queue *q,
> > >               return 0;
> > >
> > >       if (io->flags & UBLKSRV_NEED_COMMIT_RQ_COMP)
> > > -             cmd_op = UBLK_IO_COMMIT_AND_FETCH_REQ;
> > > -     else if (io->flags & UBLKSRV_NEED_FETCH_RQ)
> > > -             cmd_op = UBLK_IO_FETCH_REQ;
> > > +             cmd_op = UBLK_U_IO_COMMIT_AND_FETCH_REQ;
> > > +     else
> > > +             cmd_op = UBLK_U_IO_FETCH_REQ;
> >
> > The hunk above changes the "else if" part, is this intentional?
> >
> 
> Yes, this is intentional. We already check
>     if (!(io->flags &
>         (UBLKSRV_NEED_FETCH_RQ | UBLKSRV_NEED_COMMIT_RQ_COMP)))
> which returns early if neither flag is set. Once the first condition
> fails, a second check is redundant: by that time we know we need
> UBLK_U_IO_FETCH_REQ.

Thanks for the explanation. Now I see your point.

> 
> However if you think it's safer to still check if io->flags &
> UBLKSRV_NEED_FETCH_RQ, I can implement it this way in the v2.
> Let me know what you prefer.

I think it's better to keep the current

  "else if (io->flags & UBLKSRV_NEED_FETCH_RQ)"

form. Even though the change is small and will not affect the code behavior, it
goes against the "single purpose per patch" guideline. Anyone who looks at the
commit in the future may have the same question I had.
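
For reference, the claim that both forms behave the same can be checked with a small standalone sketch. It is not the miniublk code: the flag values, enum cmd and the two pick_*() functions are hypothetical stand-ins that keep only the control flow discussed above.

```c
/* Hypothetical flag values, chosen only for this sketch. */
#define NEED_FETCH_RQ		(1u << 0)
#define NEED_COMMIT_RQ_COMP	(1u << 1)

enum cmd { CMD_NONE, CMD_FETCH, CMD_COMMIT_AND_FETCH };

/* Original form: the second flag is tested explicitly. */
static enum cmd pick_explicit(unsigned int flags)
{
	if (!(flags & (NEED_FETCH_RQ | NEED_COMMIT_RQ_COMP)))
		return CMD_NONE;
	if (flags & NEED_COMMIT_RQ_COMP)
		return CMD_COMMIT_AND_FETCH;
	else if (flags & NEED_FETCH_RQ)
		return CMD_FETCH;
	return CMD_NONE;	/* not reachable after the early return */
}

/* Form from the patch: a plain else after the early return. */
static enum cmd pick_implicit(unsigned int flags)
{
	if (!(flags & (NEED_FETCH_RQ | NEED_COMMIT_RQ_COMP)))
		return CMD_NONE;
	if (flags & NEED_COMMIT_RQ_COMP)
		return CMD_COMMIT_AND_FETCH;
	else
		return CMD_FETCH;
}
```

With only two flags there are four inputs, and the two functions agree on all of them, so the objection is about patch scope rather than behavior.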

^ permalink raw reply

* Re: [PATCH blktests] Fix _get_page_size()
From: Shin'ichiro Kawasaki @ 2026-06-22 22:27 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: Bart Van Assche, Jeff Moyer, linux-block, kch
In-Reply-To: <ajlxXfgpMQJ4qlRR@telecaster>

On Jun 22, 2026 / 10:31, Omar Sandoval wrote:
> On Mon, Jun 22, 2026 at 08:38:48PM +0900, Shin'ichiro Kawasaki wrote:
> > On Jun 20, 2026 / 09:11, Bart Van Assche wrote:
> > > On 6/20/26 6:51 AM, Shin'ichiro Kawasaki wrote:
> > > > On Jun 20, 2026 / 05:55, Bart Van Assche wrote:
> > > > > On 6/20/26 3:26 AM, Shin'ichiro Kawasaki wrote:
> > > > > > This is a rather fundamental change, so I would like to ask opinions from
> > > > > > other blktests users, especially Omar and Chaitanya. What do you think about
> > > > > > the idea to add getconf to the requirement list?
> > > > > 
> > > > > CONFIG_PAGE_SHIFT was introduced in the Linux kernel in February 2024
> > > > > (commit ba89f9c8ccba ("arch: consolidate existing CONFIG_PAGE_SIZE_*KB
> > > > > definitions")). Older kernels had CONFIG_PAGE_SIZE_4KB,
> > > > > CONFIG_PAGE_SIZE_16KB, etc. This means that it is possible to derive the
> > > > > kernel page size from the kernel configuration file for all upstream and
> > > > > distro kernels, isn't it?
> > > > 
> > > > I checked that the commit is in tag v6.9. My Debian bookworm system has
> > > > kernel v6.1, so the config file at /boot does not have CONFIG_PAGE_SHIFT,
> > > > as expected. But it does not have CONFIG_PAGE_SIZE_* either... I'm still
> > > > afraid that the kernel config file approach is not reliable.
> > > 
> > > Right, for older kernels CONFIG_PAGE_SIZE_*KB is only available for some
> > > but not for all supported architectures.
> > > 
> > > It is not clear to me where the desire to avoid the dependency on
> > > getconf comes from. As far as I know it is available on all Linux
> > > distros. Since it is typically included in the C library package it
> > > should not introduce a new dependency.
> > 
> > I think fewer dependencies are better in general, and I wanted to confirm
> > that it is fine for everybody. If nobody objects, I will create a
> > patch to add getconf to the requirement list.
> 
> I agree with Bart, getconf is ubiquitous enough that it's not worth
> trying to hack around its absence. In my opinion, parsing kernel config
> options should be a last resort. If anyone complains about the getconf
> dependency in the future, I think it'd be better to add a simple
> src/pagesize.c file that uses sysconf(_SC_PAGESIZE), but I don't expect
> that to be necessary.

Omar, thank you for the comment. It's good to have "src/pagesize.c" as a
plan B. I will prepare the patch to add getconf to the requirement list as
plan A.
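
For the record, the core of such a plan B helper would be very small. The following is only a sketch of the idea and not a proposed blktests file: page_size_bytes() is a hypothetical name, and a real src/pagesize.c would still need a main() that prints the value.

```c
#include <unistd.h>

/*
 * Ask the C library for the page size of the running system.
 * sysconf() returns -1 on error, which is passed through.
 */
static long page_size_bytes(void)
{
	return sysconf(_SC_PAGESIZE);
}
```

This asks the running system directly, so it does not depend on which CONFIG_PAGE_* options a given kernel config file happens to contain.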

^ permalink raw reply
