Linux EXT4 FS development


* Re: [PATCH RFC 7/8] erofs: open via dedicated fs bdev helpers
From: Gao Xiang @ 2026-06-02 16:25 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, Alexander Viro, linux-block, linux-kernel,
	linux-fsdevel, Carlos Maiolino, linux-xfs, Chris Mason,
	David Sterba, linux-btrfs, Theodore Ts'o, linux-ext4,
	Gao Xiang, linux-erofs, Christoph Hellwig, Jan Kara
In-Reply-To: <20260602-work-super-bdev_holder_global-v1-7-bb0fd82f3861@kernel.org>



On 2026/6/2 18:10, Christian Brauner wrote:
> Route opens through fs_bdev_file_open_by_path() so each external device
> is registered against the correct superblock, and convert the matching
> releases.
> 
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
> ---
>   fs/erofs/data.c     |  6 +++++
>   fs/erofs/internal.h | 10 ++++++++
>   fs/erofs/super.c    | 66 +++++++++++++++++++++++++++++++++++++++++++----------
>   fs/erofs/zdata.c    | 10 +++++---
>   4 files changed, 77 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/erofs/data.c b/fs/erofs/data.c
> index 44da21c9d777..5220585293df 100644
> --- a/fs/erofs/data.c
> +++ b/fs/erofs/data.c
> @@ -69,6 +69,9 @@ int erofs_init_metabuf(struct erofs_buf *buf, struct super_block *sb,
>   {
>   	struct erofs_sb_info *sbi = EROFS_SB(sb);
>   
> +	if (erofs_is_shutdown(sb))
> +		return -EIO;
> +
>   	buf->file = NULL;
>   	if (in_metabox) {
>   		if (unlikely(!sbi->metabox_inode))
> @@ -236,6 +239,9 @@ int erofs_map_dev(struct super_block *sb, struct erofs_map_dev *map)
>   		}
>   		up_read(&devs->rwsem);
>   	}
> +	if (erofs_is_shutdown(sb) ||
> +	    (map->m_dif && READ_ONCE(map->m_dif->dead)))
> +		return -EIO;

Having taken a quick look at the code, maybe we can just
add the SHUTDOWN status alone: I don't think removing an
individual blob device is useful for the typical image
use cases, so there is no need to add `dead` to each
individual extra device.

Then just bail out if erofs_is_shutdown() at the very
beginning of erofs_map_dev()?
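
As a rough illustration of that suggestion, here is a small userspace
model rather than erofs code. All names in it (`model_sb`,
`model_map_dev`, the `shutdown` field) are hypothetical stand-ins. It
only shows the shape of the idea: one superblock-wide state checked
before any device lookup, with no per-device `dead` field.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are real erofs types. */
struct model_sb {
	bool shutdown;		/* models the superblock-wide SHUTDOWN state */
	unsigned int nr_devs;	/* number of extra blob devices */
};

struct model_map_dev {
	unsigned int deviceid;	/* 0 means the primary device */
	int mapped_dev;		/* filled in on success */
};

/* Check the shutdown state first; only then look at individual devices. */
static int model_map_dev(const struct model_sb *sb, struct model_map_dev *map)
{
	assert(sb && map);

	if (sb->shutdown)
		return -EIO;
	if (map->deviceid > sb->nr_devs)
		return -ENODEV;
	map->mapped_dev = (int)map->deviceid;
	return 0;
}
```

In this model every caller sees -EIO as soon as the flag is set,
whichever device id it asks for, which is the behaviour a single
SHUTDOWN state would give.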

>   	return 0;
>   }
>   

...

> diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
> index 43bb5a6a9924..89ae91935364 100644
> --- a/fs/erofs/zdata.c
> +++ b/fs/erofs/zdata.c
> @@ -1697,11 +1697,15 @@ static void z_erofs_submit_queue(struct z_erofs_frontend *f,
>   			continue;
>   		}
>   
> -		/* no device id here, thus it will always succeed */
>   		mdev = (struct erofs_map_dev) {
>   			.m_pa = round_down(pcl->pos, sb->s_blocksize),
>   		};
> -		(void)erofs_map_dev(sb, &mdev);
> +		if (erofs_map_dev(sb, &mdev)) {
> +			/* the backing device is gone; fail the batch */
> +			q[JQ_SUBMIT]->eio = true;
> +			qtail[JQ_SUBMIT] = &pcl->next;
> +			continue;
> +		}

It needs some injection tests anyway.

May I ask if this is urgent work for 7.2? If not, I could
make a preparatory patch for the upcoming 7.2 cycle that
handles erofs_map_dev() failure here, so you don't need
to bother with it in this patchset.

I keep trying to find more time for the recent TODOs, but
I am always interrupted by other unrelated work.

Thanks,
Gao Xiang


* Re: [PATCH RFC 0/8] fs: support freeze/thaw/mark_dead/sync with shared devices
From: Gao Xiang @ 2026-06-02 16:12 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, Alexander Viro, linux-block, linux-kernel,
	linux-fsdevel, Carlos Maiolino, linux-xfs, Chris Mason,
	David Sterba, linux-btrfs, Theodore Ts'o, linux-ext4,
	Gao Xiang, linux-erofs, Christoph Hellwig, Jan Kara
In-Reply-To: <20260602-work-super-bdev_holder_global-v1-0-bb0fd82f3861@kernel.org>

Hi,

On 2026/6/2 18:10, Christian Brauner wrote:
> Note, this is on the border between RFC/POC and so I haven't pushed this
> through testing yet. But I don't want to waste more time on this before
> showing it.
> 
> I surveyed various fs implementations because I want to give
> userspace the ability to manage what devices can be onlined in a
> centralized way without having to force every fs to care about this.
> 
> I realized that erofs allows sharing block devices with multiple
> superblocks. Any freeze, thaw, removal, or sync on those devices will
> not be communicated to the superblocks using it and our current
> infrastructure is unable to deal with this.
> 
> This attempts to add the ability to go from device number to all the
> superblocks using that device, iterate through them one-by-one and
> perform actions on them. For most fses this is a 1:1 mapping but for
> erofs it's a 1:many mapping.
> 
> This is not unreasonable infrastructure to support in my opinion. I
> played around with some ideas for this and I want to send out an RFC to
> gather some early input.

Yes, just a side note: on the erofs side, each filesystem follows an
immutable model rather than a writable one, so inode data (in devices
or files) can be shared among multiple different filesystems without
any reference counting (in similar models, any write has to be COWed,
using overlayfs for example). Blob devices are therefore a 1:many
shared mapping by design.

One typical example is that we could convert each OCI tar layer
into an erofs blob and use a metadata-only erofs to index these
converted blobs, so there is only one filesystem instead of
per-layer filesystems (this is called fsmerge in the containerd
implementation). Each converted erofs blob can still be shared
among different filesystems.

Another example is incremental diff updates: the primary device
can contain only the incremental data and refer to the base image
for the remaining data, and the base image can be shared too.
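
To make the 1:many relationship concrete, the following is a tiny
userspace model, not kernel code. The registry, its fixed size and
all `model_*` names are assumptions made up for illustration. It
records one (device number, superblock id) pair per user of a blob
and walks every superblock that uses a given device, which is roughly
the lookup the cover letter describes.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_HOLDERS 8

/* Hypothetical model: one (device number, superblock id) pair per claim. */
struct model_holder {
	unsigned int dev;
	int sb_id;
};

struct model_registry {
	struct model_holder holders[MODEL_MAX_HOLDERS];
	size_t nr;
};

/* Record that superblock sb_id uses device dev; returns 0, or -1 if full. */
static int model_register(struct model_registry *r, unsigned int dev, int sb_id)
{
	assert(r);
	if (r->nr == MODEL_MAX_HOLDERS)
		return -1;
	r->holders[r->nr++] = (struct model_holder){ .dev = dev, .sb_id = sb_id };
	return 0;
}

/* Call fn (if non-NULL) for each superblock using dev; returns the count. */
static size_t model_for_each_sb(const struct model_registry *r, unsigned int dev,
				void (*fn)(int sb_id, void *data), void *data)
{
	size_t i, visited = 0;

	assert(r);
	for (i = 0; i < r->nr; i++) {
		if (r->holders[i].dev != dev)
			continue;
		if (fn)
			fn(r->holders[i].sb_id, data);
		visited++;
	}
	return visited;
}
```

With a base image registered by two filesystems and a private blob
registered by one, a walk over the base image's device number visits
two superblocks, while a walk over the private blob visits one.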

Thanks,
Gao Xiang


* Re: [PATCH v2 4/4] fs: retire sget()
From: Jan Kara @ 2026-06-02 11:20 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Theodore Ts'o, Andreas Dilger, Jan Kara,
	Ritesh Harjani (IBM), linux-ext4, linux-cifs, Alexander Viro
In-Reply-To: <20260529-work-sget-v2-4-57bbe08604e4@kernel.org>

On Fri 29-05-26 10:43:43, Christian Brauner wrote:
> sget() and sget_fc() have lived side by side as near-duplicate
> find-or-create-and-publish helpers for the legacy and fs_context mount
> APIs. The three remaining in-tree callers (CIFS plus the ext4 extents
> and mballoc KUnit tests) have all been moved to sget_fc(). Nothing
> calls sget() anymore.
> 
> Delete sget() from fs/super.c and the prototype in <linux/fs.h>.
> Update the two comments that referred to "sget()" or "sget{_fc}()" to
> just say "sget_fc()".
> 
> This removes ~60 lines of code that only existed to be kept in
> lockstep with sget_fc() on every superblock publish-path change.
> 
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/btrfs/super.c   |  2 +-
>  fs/super.c         | 71 ++++--------------------------------------------------
>  include/linux/fs.h |  4 ---
>  3 files changed, 6 insertions(+), 71 deletions(-)
> 
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index b26aa9169e83..636154861d7c 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -2052,7 +2052,7 @@ static int btrfs_get_tree_subvol(struct fs_context *fc)
>  	 * then open_ctree will properly initialize the file system specific
>  	 * settings later.  btrfs_init_fs_info initializes the static elements
>  	 * of the fs_info (locks and such) to make cleanup easier if we find a
> -	 * superblock with our given fs_devices later on at sget() time.
> +	 * superblock with our given fs_devices later on at sget_fc() time.
>  	 */
>  	fs_info = kvzalloc_obj(struct btrfs_fs_info);
>  	if (!fs_info)
> diff --git a/fs/super.c b/fs/super.c
> index 378e81efe643..5fe8cea9f8fe 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -328,7 +328,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
>  	init_rwsem(&s->s_umount);
>  	lockdep_set_class(&s->s_umount, &type->s_umount_key);
>  	/*
> -	 * sget() can have s_umount recursion.
> +	 * sget_fc() can have s_umount recursion.
>  	 *
>  	 * When it cannot find a suitable sb, it allocates a new
>  	 * one (this one), and tries again to find a suitable old
> @@ -439,7 +439,7 @@ static void kill_super_notify(struct super_block *sb)
>  
>  	/*
>  	 * Remove it from @fs_supers so it isn't found by new
> -	 * sget{_fc}() walkers anymore. Any concurrent mounter still
> +	 * sget_fc() walkers anymore. Any concurrent mounter still
>  	 * managing to grab a temporary reference is guaranteed to
>  	 * already see SB_DYING and will wait until we notify them about
>  	 * SB_DEAD.
> @@ -517,7 +517,7 @@ EXPORT_SYMBOL(deactivate_super);
>   * @sb: superblock to acquire
>   *
>   * Acquire a temporary reference on a superblock and try to trade it for
> - * an active reference. This is used in sget{_fc}() to wait for a
> + * an active reference. This is used in sget_fc() to wait for a
>   * superblock to either become SB_BORN or for it to pass through
>   * sb->kill() and be marked as SB_DEAD.
>   *
> @@ -673,11 +673,11 @@ void generic_shutdown_super(struct super_block *sb)
>  	/*
>  	 * Broadcast to everyone that grabbed a temporary reference to this
>  	 * superblock before we removed it from @fs_supers that the superblock
> -	 * is dying. Every walker of @fs_supers outside of sget{_fc}() will now
> +	 * is dying. Every walker of @fs_supers outside of sget_fc() will now
>  	 * discard this superblock and treat it as dead.
>  	 *
>  	 * We leave the superblock on @fs_supers so it can be found by
> -	 * sget{_fc}() until we passed sb->kill_sb().
> +	 * sget_fc() until we passed sb->kill_sb().
>  	 */
>  	super_wake(sb, SB_DYING);
>  	super_unlock_excl(sb);
> @@ -808,67 +808,6 @@ struct super_block *sget_fc(struct fs_context *fc,
>  }
>  EXPORT_SYMBOL(sget_fc);
>  
> -/**
> - *	sget	-	find or create a superblock
> - *	@type:	  filesystem type superblock should belong to
> - *	@test:	  comparison callback
> - *	@set:	  setup callback
> - *	@flags:	  mount flags
> - *	@data:	  argument to each of them
> - */
> -struct super_block *sget(struct file_system_type *type,
> -			int (*test)(struct super_block *,void *),
> -			int (*set)(struct super_block *,void *),
> -			int flags,
> -			void *data)
> -{
> -	struct user_namespace *user_ns = current_user_ns();
> -	struct super_block *s = NULL;
> -	struct super_block *old;
> -	int err;
> -
> -retry:
> -	spin_lock(&sb_lock);
> -	if (test) {
> -		hlist_for_each_entry(old, &type->fs_supers, s_instances) {
> -			if (!test(old, data))
> -				continue;
> -			if (user_ns != old->s_user_ns) {
> -				spin_unlock(&sb_lock);
> -				destroy_unused_super(s);
> -				return ERR_PTR(-EBUSY);
> -			}
> -			if (!grab_super(old))
> -				goto retry;
> -			destroy_unused_super(s);
> -			return old;
> -		}
> -	}
> -	if (!s) {
> -		spin_unlock(&sb_lock);
> -		s = alloc_super(type, flags, user_ns);
> -		if (!s)
> -			return ERR_PTR(-ENOMEM);
> -		goto retry;
> -	}
> -
> -	err = set(s, data);
> -	if (err) {
> -		spin_unlock(&sb_lock);
> -		destroy_unused_super(s);
> -		return ERR_PTR(err);
> -	}
> -	s->s_type = type;
> -	strscpy(s->s_id, type->name, sizeof(s->s_id));
> -	list_add_tail(&s->s_list, &super_blocks);
> -	hlist_add_head(&s->s_instances, &type->fs_supers);
> -	spin_unlock(&sb_lock);
> -	get_filesystem(type);
> -	shrinker_register(s->s_shrink);
> -	return s;
> -}
> -EXPORT_SYMBOL(sget);
> -
>  void drop_super(struct super_block *sb)
>  {
>  	super_unlock_shared(sb);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 11559c513dfb..6dbe3218dc1e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2327,10 +2327,6 @@ void free_anon_bdev(dev_t);
>  struct super_block *sget_fc(struct fs_context *fc,
>  			    int (*test)(struct super_block *, struct fs_context *),
>  			    int (*set)(struct super_block *, struct fs_context *));
> -struct super_block *sget(struct file_system_type *type,
> -			int (*test)(struct super_block *,void *),
> -			int (*set)(struct super_block *,void *),
> -			int flags, void *data);
>  struct super_block *sget_dev(struct fs_context *fc, dev_t dev);
>  
>  /* Alas, no aliases. Too much hassle with bringing module.h everywhere */
> 
> -- 
> 2.47.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 2/4] ext4: convert mballoc KUnit test to sget_fc()
From: Jan Kara @ 2026-06-02 11:18 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Theodore Ts'o, Andreas Dilger, Jan Kara,
	Ritesh Harjani (IBM), linux-ext4, linux-cifs, Alexander Viro
In-Reply-To: <20260529-work-sget-v2-2-57bbe08604e4@kernel.org>

On Fri 29-05-26 10:43:41, Christian Brauner wrote:
> Same treatment as the extents KUnit test. The mballoc test uses sget()
> as a thin "give me an initialized superblock" wrapper for a fake
> file_system_type. Move it onto sget_fc() so sget() can go away.
> 
> Add a no-op mbt_init_fs_context() so fs_context_for_mount() has
> something to call on the fake fs_type. mbt_set() now takes a struct
> fs_context * (still a no-op). mbt_ext4_alloc_super_block() allocates
> the fc, hands it to sget_fc() and drops the fc reference once the sb
> is published.
> 
> No functional change.
> 
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/mballoc-test.c | 17 +++++++++++++++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/mballoc-test.c b/fs/ext4/mballoc-test.c
> index 90ed505fa4b1..d90da44aadbd 100644
> --- a/fs/ext4/mballoc-test.c
> +++ b/fs/ext4/mballoc-test.c
> @@ -5,6 +5,7 @@
>  
>  #include <kunit/test.h>
>  #include <kunit/static_stub.h>
> +#include <linux/fs_context.h>
>  #include <linux/random.h>
>  
>  #include "ext4.h"
> @@ -63,8 +64,14 @@ static void mbt_kill_sb(struct super_block *sb)
>  	generic_shutdown_super(sb);
>  }
>  
> +static int mbt_init_fs_context(struct fs_context *fc)
> +{
> +	return 0;
> +}
> +
>  static struct file_system_type mbt_fs_type = {
>  	.name			= "mballoc test",
> +	.init_fs_context	= mbt_init_fs_context,
>  	.kill_sb		= mbt_kill_sb,
>  };
>  
> @@ -127,7 +134,7 @@ static void mbt_mb_release(struct super_block *sb)
>  	kfree(sb->s_bdev);
>  }
>  
> -static int mbt_set(struct super_block *sb, void *data)
> +static int mbt_set(struct super_block *sb, struct fs_context *fc)
>  {
>  	return 0;
>  }
> @@ -136,13 +143,19 @@ static struct super_block *mbt_ext4_alloc_super_block(void)
>  {
>  	struct mbt_ext4_super_block *fsb;
>  	struct super_block *sb;
> +	struct fs_context *fc;
>  	struct ext4_sb_info *sbi;
>  
>  	fsb = kzalloc_obj(*fsb);
>  	if (fsb == NULL)
>  		return NULL;
>  
> -	sb = sget(&mbt_fs_type, NULL, mbt_set, 0, NULL);
> +	fc = fs_context_for_mount(&mbt_fs_type, 0);
> +	if (IS_ERR(fc))
> +		goto out;
> +
> +	sb = sget_fc(fc, NULL, mbt_set);
> +	put_fs_context(fc);
>  	if (IS_ERR(sb))
>  		goto out;
>  
> 
> -- 
> 2.47.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 1/4] ext4: convert extents KUnit test to sget_fc()
From: Jan Kara @ 2026-06-02 11:17 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Theodore Ts'o, Andreas Dilger, Jan Kara,
	Ritesh Harjani (IBM), linux-ext4, linux-cifs, Alexander Viro
In-Reply-To: <20260529-work-sget-v2-1-57bbe08604e4@kernel.org>

On Fri 29-05-26 10:43:40, Christian Brauner wrote:
> The extents KUnit test uses sget() to get an initialized superblock for
> its fake file_system_type. sget() predates fs_context and we want to
> retire it. Switch this caller over to sget_fc().
> 
> Add a no-op ext_init_fs_context() so fs_context_for_mount() has
> something to call on the fake fs_type. ext_set() now takes a struct
> fs_context * (still a no-op). extents_kunit_init() allocates the fc,
> hands it to sget_fc() and drops the fc reference once the sb is
> published. sget_fc() does not retain a pointer to it.
> 
> No functional change for the test.
> 
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/extents-test.c | 22 ++++++++++++++++++----
>  1 file changed, 18 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/ext4/extents-test.c b/fs/ext4/extents-test.c
> index 6b53a3f39fcd..bd7795a82607 100644
> --- a/fs/ext4/extents-test.c
> +++ b/fs/ext4/extents-test.c
> @@ -37,6 +37,7 @@
>  
>  #include <kunit/test.h>
>  #include <kunit/static_stub.h>
> +#include <linux/fs_context.h>
>  #include <linux/gfp_types.h>
>  #include <linux/stddef.h>
>  
> @@ -130,14 +131,20 @@ static void ext_kill_sb(struct super_block *sb)
>  	generic_shutdown_super(sb);
>  }
>  
> -static int ext_set(struct super_block *sb, void *data)
> +static int ext_init_fs_context(struct fs_context *fc)
> +{
> +	return 0;
> +}
> +
> +static int ext_set(struct super_block *sb, struct fs_context *fc)
>  {
>  	return 0;
>  }
>  
>  static struct file_system_type ext_fs_type = {
> -	.name = "extents test",
> -	.kill_sb = ext_kill_sb,
> +	.name		 = "extents test",
> +	.init_fs_context = ext_init_fs_context,
> +	.kill_sb	 = ext_kill_sb,
>  };
>  
>  static void extents_kunit_exit(struct kunit *test)
> @@ -223,6 +230,7 @@ static int extents_kunit_init(struct kunit *test)
>  	struct ext4_inode_info *ei;
>  	struct inode *inode;
>  	struct super_block *sb;
> +	struct fs_context *fc;
>  	struct ext4_sb_info *sbi = NULL;
>  	struct kunit_ext_test_param *param =
>  		(struct kunit_ext_test_param *)(test->param_value);
> @@ -232,7 +240,13 @@ static int extents_kunit_init(struct kunit *test)
>  	if (sbi == NULL)
>  		return -ENOMEM;
>  
> -	sb = sget(&ext_fs_type, NULL, ext_set, 0, NULL);
> +	fc = fs_context_for_mount(&ext_fs_type, 0);
> +	if (IS_ERR(fc)) {
> +		kfree(sbi);
> +		return PTR_ERR(fc);
> +	}
> +	sb = sget_fc(fc, NULL, ext_set);
> +	put_fs_context(fc);
>  	if (IS_ERR(sb)) {
>  		kfree(sbi);
>  		return PTR_ERR(sb);
> 
> -- 
> 2.47.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH] common/defrag: Skip defrag tests on DAX-enabled filesystems
From: Ojaswin Mujoo @ 2026-06-02 10:38 UTC (permalink / raw)
  To: Disha Goel; +Cc: fstests, linux-ext4, linux-fsdevel, ritesh.list
In-Reply-To: <20260602101418.55131-1-disgoel@linux.ibm.com>

On Tue, Jun 02, 2026 at 03:44:18PM +0530, Disha Goel wrote:
> Online defragmentation is not supported on DAX-enabled filesystems
> because DAX bypasses the page cache required for defrag operations.
> 
> Add check in _require_defrag() to skip tests when DAX is enabled,
> avoiding false failures on ext4/301-304, ext4/308 and generic/018.
> 
> Signed-off-by: Disha Goel <disgoel@linux.ibm.com>

Looks good Disha, feel free to add:

Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

One small comment:
> ---
>  common/defrag | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/common/defrag b/common/defrag
> index 055d0d0e..28db2f7a 100644
> --- a/common/defrag
> +++ b/common/defrag
> @@ -6,6 +6,10 @@
>  
>  _require_defrag()
>  {
> +    # Defragmentation is not supported on DAX-enabled filesystems

I think this comment is not needed, as the _notrun message already
explains it.

> +    if echo "$MOUNT_OPTIONS" | grep -qw "dax"; then
> +        _notrun "Defragmentation not supported on DAX-enabled filesystem"
> +    fi
>      case "$FSTYP" in
>      xfs)
>          # xfs_fsr does preallocates, require "falloc"
> -- 
> 2.45.1
> 


* Re: [PATCH v4 08/23] ext4: implement buffered write path using iomap
From: Ojaswin Mujoo @ 2026-06-02 10:26 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai
In-Reply-To: <20260511072344.191271-9-yi.zhang@huaweicloud.com>

On Mon, May 11, 2026 at 03:23:28PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Introduce two new iomap_ops instances for ext4 buffered writes:
> 
>  - ext4_iomap_buffered_da_write_ops: for delayed allocation mode, using
>    ext4_da_map_blocks() to map delalloc extents.
>  - ext4_iomap_buffered_write_ops: for non-delayed allocation mode, using
>    ext4_iomap_get_blocks() to directly allocate blocks.
> 
> Also add ext4_iomap_valid() for the iomap infrastructure to check extent
> validity.
> 
> Key changes and considerations:
> 
>  - Unwritten extents for new blocks (dioread_nolock always on)
>    Since data=ordered mode is not used to prevent stale data exposure in
>    the non-delayed allocation path, new blocks are always allocated as
>    unwritten extents.
> 
>  - Short write and write failure handling
>    a. Delalloc path: On short write or failure, the stale delalloc range
>       must be dropped and its space reservation released. Otherwise, a
>       clean folio may cover leftover delalloc extents, causing
>       inaccurate space reservation accounting.
>    b. Non-delalloc path: No cleanup of allocated blocks is needed on
>       short write.
> 
>  - Lock ordering reversal
>    The folio lock and transaction start ordering is reversed compared to
>    the buffer_head buffered write path. To handle this, the journal
>    handle must be stopped in iomap_begin() callbacks. The lock ordering
>    documentation in super.c has been updated accordingly.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

I went through this again and, after our discussion, the changes look
okay. Just a small question below, but otherwise feel free to add:

Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

> ---
>  fs/ext4/ext4.h  |   4 ++
>  fs/ext4/file.c  |  20 +++++-
>  fs/ext4/inode.c | 165 +++++++++++++++++++++++++++++++++++++++++++++++-
>  fs/ext4/super.c |  10 ++-
>  4 files changed, 192 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 1e27d73d7427..4832e7f7db82 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -3057,6 +3057,7 @@ int ext4_walk_page_buffers(handle_t *handle,
>  int do_journal_get_write_access(handle_t *handle, struct inode *inode,
>  				struct buffer_head *bh);
>  void ext4_set_inode_mapping_order(struct inode *inode);
> +int ext4_nonda_switch(struct super_block *sb);
>  #define FALL_BACK_TO_NONDELALLOC 1
>  #define CONVERT_INLINE_DATA	 2

<snip>

> +	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> +	return 0;
> +}
> +
> +static int ext4_iomap_buffered_write_begin(struct inode *inode,
> +		loff_t offset, loff_t length, unsigned int flags,
> +		struct iomap *iomap, struct iomap *srcmap)
> +{
> +	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> +						  iomap, srcmap, false);
> +}
> +
> +static int ext4_iomap_buffered_da_write_begin(struct inode *inode,
> +		loff_t offset, loff_t length, unsigned int flags,
> +		struct iomap *iomap, struct iomap *srcmap)
> +{
> +	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> +						  iomap, srcmap, true);
> +}
> +
> +/*
> + * On write failure, drop the stale delayed allocation range and release
> + * its reserved space for both start and end blocks. Otherwise, we may
> + * leave a range of delayed extents covered by a clean folio, which can
> + * result in inaccurate space reservation accounting.
> + */
> +static void ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset,
> +				     loff_t length, struct iomap *iomap)
> +{
> +	down_write(&EXT4_I(inode)->i_data_sem);
> +	ext4_es_remove_extent(inode, offset >> inode->i_blkbits,
> +			DIV_ROUND_UP_ULL(length, EXT4_BLOCK_SIZE(inode->i_sb)));
> +	up_write(&EXT4_I(inode)->i_data_sem);
> +}
> +
> +static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
> +					    loff_t length, ssize_t written,
> +					    unsigned int flags,
> +					    struct iomap *iomap)
> +{
> +	loff_t start_byte, end_byte;
> +
> +	/* If we didn't reserve the blocks, we're not allowed to punch them. */
> +	if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))

Will we ever get IOMAP_F_NEW here? I think the da_write_begin() call
either creates a new IOMAP_DELALLOC extent or finds older ones which
won't have EXT4_MAP_NEW set

> +		return 0;
> +
> +	/* Nothing to do if we've written the entire delalloc extent */
> +	start_byte = iomap_last_written_block(inode, offset, written);
> +	end_byte = round_up(offset + length, i_blocksize(inode));
> +	if (start_byte >= end_byte)
> +		return 0;
> +
> +	filemap_invalidate_lock(inode->i_mapping);
> +	iomap_write_delalloc_release(inode, start_byte, end_byte, flags,
> +				     iomap, ext4_iomap_punch_delalloc);
> +	filemap_invalidate_unlock(inode->i_mapping);
> +	return 0;
> +}


* [PATCH] common/defrag: Skip defrag tests on DAX-enabled filesystems
From: Disha Goel @ 2026-06-02 10:14 UTC (permalink / raw)
  To: fstests; +Cc: linux-ext4, linux-fsdevel, ritesh.list, ojaswin, Disha Goel

Online defragmentation is not supported on DAX-enabled filesystems
because DAX bypasses the page cache required for defrag operations.

Add check in _require_defrag() to skip tests when DAX is enabled,
avoiding false failures on ext4/301-304, ext4/308 and generic/018.

Signed-off-by: Disha Goel <disgoel@linux.ibm.com>
---
 common/defrag | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/common/defrag b/common/defrag
index 055d0d0e..28db2f7a 100644
--- a/common/defrag
+++ b/common/defrag
@@ -6,6 +6,10 @@
 
 _require_defrag()
 {
+    # Defragmentation is not supported on DAX-enabled filesystems
+    if echo "$MOUNT_OPTIONS" | grep -qw "dax"; then
+        _notrun "Defragmentation not supported on DAX-enabled filesystem"
+    fi
     case "$FSTYP" in
     xfs)
         # xfs_fsr does preallocates, require "falloc"
-- 
2.45.1



* [PATCH RFC 8/8] super: make fs_holder_ops private
From: Christian Brauner @ 2026-06-02 10:10 UTC (permalink / raw)
  To: Christoph Hellwig, Jan Kara
  Cc: Jens Axboe, Alexander Viro, linux-block, linux-kernel,
	linux-fsdevel, Carlos Maiolino, linux-xfs, Chris Mason,
	David Sterba, linux-btrfs, Theodore Ts'o, linux-ext4,
	Gao Xiang, linux-erofs, Christian Brauner (Amutable)
In-Reply-To: <20260602-work-super-bdev_holder_global-v1-0-bb0fd82f3861@kernel.org>

There's no need to expose it anymore.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 fs/super.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index cea743f699e4..983c2fbf5202 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1643,13 +1643,12 @@ static int fs_bdev_thaw(struct block_device *bdev)
 	return error;
 }
 
-const struct blk_holder_ops fs_holder_ops = {
+static const struct blk_holder_ops fs_holder_ops = {
 	.mark_dead		= fs_bdev_mark_dead,
 	.sync			= fs_bdev_sync,
 	.freeze			= fs_bdev_freeze,
 	.thaw			= fs_bdev_thaw,
 };
-EXPORT_SYMBOL_GPL(fs_holder_ops);
 
 static int fs_bdev_register(struct file *bdev_file, struct super_block *sb)
 {

-- 
2.47.3



* [PATCH RFC 7/8] erofs: open via dedicated fs bdev helpers
From: Christian Brauner @ 2026-06-02 10:10 UTC (permalink / raw)
  To: Christoph Hellwig, Jan Kara
  Cc: Jens Axboe, Alexander Viro, linux-block, linux-kernel,
	linux-fsdevel, Carlos Maiolino, linux-xfs, Chris Mason,
	David Sterba, linux-btrfs, Theodore Ts'o, linux-ext4,
	Gao Xiang, linux-erofs, Christian Brauner (Amutable)
In-Reply-To: <20260602-work-super-bdev_holder_global-v1-0-bb0fd82f3861@kernel.org>

Route opens through fs_bdev_file_open_by_path() so each external device
is registered against the correct superblock, and convert the matching
releases.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 fs/erofs/data.c     |  6 +++++
 fs/erofs/internal.h | 10 ++++++++
 fs/erofs/super.c    | 66 +++++++++++++++++++++++++++++++++++++++++++----------
 fs/erofs/zdata.c    | 10 +++++---
 4 files changed, 77 insertions(+), 15 deletions(-)

diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index 44da21c9d777..5220585293df 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -69,6 +69,9 @@ int erofs_init_metabuf(struct erofs_buf *buf, struct super_block *sb,
 {
 	struct erofs_sb_info *sbi = EROFS_SB(sb);
 
+	if (erofs_is_shutdown(sb))
+		return -EIO;
+
 	buf->file = NULL;
 	if (in_metabox) {
 		if (unlikely(!sbi->metabox_inode))
@@ -236,6 +239,9 @@ int erofs_map_dev(struct super_block *sb, struct erofs_map_dev *map)
 		}
 		up_read(&devs->rwsem);
 	}
+	if (erofs_is_shutdown(sb) ||
+	    (map->m_dif && READ_ONCE(map->m_dif->dead)))
+		return -EIO;
 	return 0;
 }
 
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index 4792490161ec..ca1ed7ce3961 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -48,6 +48,7 @@ struct erofs_device_info {
 
 	erofs_blk_t blocks;
 	erofs_blk_t uniaddr;
+	bool dead;		/* backing device gone; fence I/O */
 };
 
 enum {
@@ -104,6 +105,7 @@ struct erofs_xattr_prefix_item {
 struct erofs_sb_info {
 	struct erofs_device_info dif0;
 	struct erofs_mount_opts opt;	/* options */
+	unsigned long flags;		/* see EROFS_SB_* */
 #ifdef CONFIG_EROFS_FS_ZIP
 	/* list for all registered superblocks, mainly for shrinker */
 	struct list_head list;
@@ -195,6 +197,14 @@ static inline bool erofs_is_fscache_mode(struct super_block *sb)
 			!erofs_is_fileio_mode(EROFS_SB(sb)) && !sb->s_bdev;
 }
 
+/* erofs_sb_info->flags */
+#define EROFS_SB_SHUTDOWN	0	/* primary device gone; fail all I/O */
+
+static inline bool erofs_is_shutdown(struct super_block *sb)
+{
+	return test_bit(EROFS_SB_SHUTDOWN, &EROFS_SB(sb)->flags);
+}
+
 enum {
 	EROFS_ZIP_CACHE_DISABLED,
 	EROFS_ZIP_CACHE_READAHEAD,
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index 802add6652fd..e03cb95be96b 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -153,8 +153,8 @@ static int erofs_init_device(struct erofs_buf *buf, struct super_block *sb,
 	} else if (!sbi->devs->flatdev) {
 		file = erofs_is_fileio_mode(sbi) ?
 				filp_open(dif->path, O_RDONLY | O_LARGEFILE, 0) :
-				bdev_file_open_by_path(dif->path,
-						BLK_OPEN_READ, sb->s_type, NULL);
+				fs_bdev_file_open_by_path(dif->path,
+						BLK_OPEN_READ, sb->s_type, sb);
 		if (IS_ERR(file)) {
 			if (file == ERR_PTR(-ENOTBLK))
 				return -EINVAL;
@@ -843,11 +843,16 @@ static int erofs_fc_reconfigure(struct fs_context *fc)
 
 static int erofs_release_device_info(int id, void *ptr, void *data)
 {
+	struct super_block *sb = data;
 	struct erofs_device_info *dif = ptr;
 
 	fs_put_dax(dif->dax_dev, NULL);
-	if (dif->file)
-		fput(dif->file);
+	if (dif->file) {
+		if (S_ISBLK(file_inode(dif->file)->i_mode))
+			fs_bdev_file_release(dif->file, sb);
+		else
+			fput(dif->file);
+	}
 	erofs_fscache_unregister_cookie(dif->fscache);
 	dif->fscache = NULL;
 	kfree(dif->path);
@@ -855,18 +860,19 @@ static int erofs_release_device_info(int id, void *ptr, void *data)
 	return 0;
 }
 
-static void erofs_free_dev_context(struct erofs_dev_context *devs)
+static void erofs_free_dev_context(struct erofs_dev_context *devs,
+				   struct super_block *sb)
 {
 	if (!devs)
 		return;
-	idr_for_each(&devs->tree, &erofs_release_device_info, NULL);
+	idr_for_each(&devs->tree, &erofs_release_device_info, sb);
 	idr_destroy(&devs->tree);
 	kfree(devs);
 }
 
-static void erofs_sb_free(struct erofs_sb_info *sbi)
+static void erofs_sb_free(struct erofs_sb_info *sbi, struct super_block *sb)
 {
-	erofs_free_dev_context(sbi->devs);
+	erofs_free_dev_context(sbi->devs, sb);
 	kfree(sbi->fsid);
 	kfree_sensitive(sbi->domain_id);
 	if (sbi->dif0.file)
@@ -879,8 +885,13 @@ static void erofs_fc_free(struct fs_context *fc)
 {
 	struct erofs_sb_info *sbi = fc->s_fs_info;
 
-	if (sbi) /* free here if an error occurs before transferring to sb */
-		erofs_sb_free(sbi);
+	/*
+	 * Freed here only if an error occurs before the sb is set up; at that
+	 * point no block-backed device has been claimed (that happens in
+	 * fill_super), so the NULL sb never reaches fs_bdev_file_release().
+	 */
+	if (sbi)
+		erofs_sb_free(sbi, NULL);
 }
 
 static const struct fs_context_operations erofs_context_ops = {
@@ -936,7 +947,7 @@ static void erofs_kill_sb(struct super_block *sb)
 	erofs_drop_internal_inodes(sbi);
 	fs_put_dax(sbi->dif0.dax_dev, NULL);
 	erofs_fscache_unregister_fs(sb);
-	erofs_sb_free(sbi);
+	erofs_sb_free(sbi, sb);
 	sb->s_fs_info = NULL;
 }
 
@@ -948,7 +959,7 @@ static void erofs_put_super(struct super_block *sb)
 	erofs_shrinker_unregister(sb);
 	erofs_xattr_prefixes_cleanup(sb);
 	erofs_drop_internal_inodes(sbi);
-	erofs_free_dev_context(sbi->devs);
+	erofs_free_dev_context(sbi->devs, sb);
 	sbi->devs = NULL;
 	erofs_fscache_unregister_fs(sb);
 }
@@ -1121,6 +1132,35 @@ static void erofs_evict_inode(struct inode *inode)
 	clear_inode(inode);
 }
 
+/*
+ * A blob device may back several erofs superblocks; fence only the affected
+ * one and keep the rest of the mount alive.  The primary device falls back to
+ * the generic teardown (return non-zero).
+ */
+static int erofs_remove_bdev(struct super_block *sb, struct block_device *bdev)
+{
+	struct erofs_dev_context *devs = EROFS_SB(sb)->devs;
+	struct erofs_device_info *dif;
+	int id;
+
+	if (bdev == sb->s_bdev)
+		return 1;
+
+	down_read(&devs->rwsem);
+	idr_for_each_entry(&devs->tree, dif, id) {
+		if (dif->file && S_ISBLK(file_inode(dif->file)->i_mode) &&
+		    file_bdev(dif->file)->bd_dev == bdev->bd_dev)
+			WRITE_ONCE(dif->dead, true);
+	}
+	up_read(&devs->rwsem);
+	return 0;
+}
+
+static void erofs_shutdown(struct super_block *sb)
+{
+	set_bit(EROFS_SB_SHUTDOWN, &EROFS_SB(sb)->flags);
+}
+
 const struct super_operations erofs_sops = {
 	.put_super = erofs_put_super,
 	.alloc_inode = erofs_alloc_inode,
@@ -1128,6 +1168,8 @@ const struct super_operations erofs_sops = {
 	.evict_inode = erofs_evict_inode,
 	.statfs = erofs_statfs,
 	.show_options = erofs_show_options,
+	.remove_bdev = erofs_remove_bdev,
+	.shutdown = erofs_shutdown,
 };
 
 module_init(erofs_module_init);
diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index 43bb5a6a9924..89ae91935364 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -1697,11 +1697,15 @@ static void z_erofs_submit_queue(struct z_erofs_frontend *f,
 			continue;
 		}
 
-		/* no device id here, thus it will always succeed */
 		mdev = (struct erofs_map_dev) {
 			.m_pa = round_down(pcl->pos, sb->s_blocksize),
 		};
-		(void)erofs_map_dev(sb, &mdev);
+		if (erofs_map_dev(sb, &mdev)) {
+			/* the backing device is gone; fail the batch */
+			q[JQ_SUBMIT]->eio = true;
+			qtail[JQ_SUBMIT] = &pcl->next;
+			continue;
+		}
 
 		cur = mdev.m_pa;
 		end = round_up(cur + pcl->pageofs_in + pcl->pclustersize,
@@ -1785,7 +1789,7 @@ static void z_erofs_submit_queue(struct z_erofs_frontend *f,
 	 * although background is preferred, no one is pending for submission.
 	 * don't issue decompression but drop it directly instead.
 	 */
-	if (!*force_fg && !nr_bios) {
+	if (!*force_fg && !nr_bios && !q[JQ_SUBMIT]->eio) {
 		kvfree(q[JQ_SUBMIT]);
 		return;
 	}

-- 
2.47.3


* [PATCH RFC 6/8] ext4: open via dedicated fs bdev helpers
From: Christian Brauner @ 2026-06-02 10:10 UTC (permalink / raw)
  To: Christoph Hellwig, Jan Kara
  Cc: Jens Axboe, Alexander Viro, linux-block, linux-kernel,
	linux-fsdevel, Carlos Maiolino, linux-xfs, Chris Mason,
	David Sterba, linux-btrfs, Theodore Ts'o, linux-ext4,
	Gao Xiang, linux-erofs, Christian Brauner (Amutable)
In-Reply-To: <20260602-work-super-bdev_holder_global-v1-0-bb0fd82f3861@kernel.org>

Route the journal device open through fs_bdev_file_open_by_dev() so the
external journal device is registered against the correct superblock, and
convert the matching releases.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 fs/ext4/super.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 6a77db4d3124..8108d999008e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5793,7 +5793,7 @@ failed_mount8: __maybe_unused
 	brelse(sbi->s_sbh);
 	if (sbi->s_journal_bdev_file) {
 		invalidate_bdev(file_bdev(sbi->s_journal_bdev_file));
-		bdev_fput(sbi->s_journal_bdev_file);
+		fs_bdev_file_release(sbi->s_journal_bdev_file, sb);
 	}
 out_fail:
 	invalidate_bdev(sb->s_bdev);
@@ -5972,9 +5972,9 @@ static struct file *ext4_get_journal_blkdev(struct super_block *sb,
 	struct ext4_super_block *es;
 	int errno;
 
-	bdev_file = bdev_file_open_by_dev(j_dev,
+	bdev_file = fs_bdev_file_open_by_dev(j_dev,
 		BLK_OPEN_READ | BLK_OPEN_WRITE | BLK_OPEN_RESTRICT_WRITES,
-		sb, &fs_holder_ops);
+		sb, sb);
 	if (IS_ERR(bdev_file)) {
 		ext4_msg(sb, KERN_ERR,
 			 "failed to open journal device unknown-block(%u,%u) %ld",
@@ -6034,7 +6034,7 @@ static struct file *ext4_get_journal_blkdev(struct super_block *sb,
 out_bh:
 	brelse(bh);
 out_bdev:
-	bdev_fput(bdev_file);
+	fs_bdev_file_release(bdev_file, sb);
 	return ERR_PTR(errno);
 }
 
@@ -6073,7 +6073,7 @@ static journal_t *ext4_open_dev_journal(struct super_block *sb,
 out_journal:
 	ext4_journal_destroy(EXT4_SB(sb), journal);
 out_bdev:
-	bdev_fput(bdev_file);
+	fs_bdev_file_release(bdev_file, sb);
 	return ERR_PTR(errno);
 }
 
@@ -7492,7 +7492,7 @@ static void ext4_kill_sb(struct super_block *sb)
 	kill_block_super(sb);
 
 	if (bdev_file)
-		bdev_fput(bdev_file);
+		fs_bdev_file_release(bdev_file, sb);
 }
 
 static struct file_system_type ext4_fs_type = {

-- 
2.47.3


* [PATCH RFC 5/8] btrfs: open via dedicated fs bdev helpers
From: Christian Brauner @ 2026-06-02 10:10 UTC (permalink / raw)
  To: Christoph Hellwig, Jan Kara
  Cc: Jens Axboe, Alexander Viro, linux-block, linux-kernel,
	linux-fsdevel, Carlos Maiolino, linux-xfs, Chris Mason,
	David Sterba, linux-btrfs, Theodore Ts'o, linux-ext4,
	Gao Xiang, linux-erofs, Christian Brauner (Amutable)
In-Reply-To: <20260602-work-super-bdev_holder_global-v1-0-bb0fd82f3861@kernel.org>

Route opens through fs_bdev_file_open_by_path() so each external device
is registered against the correct superblock, and convert the matching
releases.

The temporary identification opens that only read the superblock and close
again pass a NULL holder and are left untouched.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 fs/btrfs/dev-replace.c |  6 +++---
 fs/btrfs/ioctl.c       |  4 ++--
 fs/btrfs/volumes.c     | 26 +++++++++++++++++---------
 3 files changed, 22 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 8f8fa14886de..463155b0b1ff 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -247,8 +247,8 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 		return -EINVAL;
 	}
 
-	bdev_file = bdev_file_open_by_path(device_path, BLK_OPEN_WRITE,
-					   fs_info->sb, &fs_holder_ops);
+	bdev_file = fs_bdev_file_open_by_path(device_path, BLK_OPEN_WRITE,
+					      fs_info->sb, fs_info->sb);
 	if (IS_ERR(bdev_file)) {
 		btrfs_err(fs_info, "target device %s is invalid!", device_path);
 		return PTR_ERR(bdev_file);
@@ -325,7 +325,7 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 	return 0;
 
 error:
-	bdev_fput(bdev_file);
+	fs_bdev_file_release(bdev_file, fs_info->sb);
 	return ret;
 }
 
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index b2e447f5005c..16afa71b98f2 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2579,7 +2579,7 @@ static long btrfs_ioctl_rm_dev_v2(struct file *file, void __user *arg)
 err_drop:
 	mnt_drop_write_file(file);
 	if (bdev_file)
-		bdev_fput(bdev_file);
+		fs_bdev_file_release(bdev_file, fs_info->sb);
 out:
 	btrfs_put_dev_args_from_path(&args);
 	kfree(vol_args);
@@ -2630,7 +2630,7 @@ static long btrfs_ioctl_rm_dev(struct file *file, void __user *arg)
 
 	mnt_drop_write_file(file);
 	if (bdev_file)
-		bdev_fput(bdev_file);
+		fs_bdev_file_release(bdev_file, fs_info->sb);
 out:
 	btrfs_put_dev_args_from_path(&args);
 out_free:
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a88e68f90564..6f7d7afb4d66 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -480,7 +480,12 @@ btrfs_get_bdev_and_sb(const char *device_path, blk_mode_t flags, void *holder,
 	struct block_device *bdev;
 	int ret;
 
-	*bdev_file = bdev_file_open_by_path(device_path, flags, holder, &fs_holder_ops);
+	if (holder)
+		*bdev_file = fs_bdev_file_open_by_path(device_path, flags,
+						       holder, holder);
+	else
+		*bdev_file = bdev_file_open_by_path(device_path, flags, NULL,
+						    NULL);
 
 	if (IS_ERR(*bdev_file)) {
 		ret = PTR_ERR(*bdev_file);
@@ -495,7 +500,7 @@ btrfs_get_bdev_and_sb(const char *device_path, blk_mode_t flags, void *holder,
 	if (holder) {
 		ret = set_blocksize(*bdev_file, BTRFS_BDEV_BLOCKSIZE);
 		if (ret) {
-			bdev_fput(*bdev_file);
+			fs_bdev_file_release(*bdev_file, holder);
 			goto error;
 		}
 	}
@@ -503,7 +508,10 @@ btrfs_get_bdev_and_sb(const char *device_path, blk_mode_t flags, void *holder,
 	*disk_super = btrfs_read_disk_super(bdev, 0, false);
 	if (IS_ERR(*disk_super)) {
 		ret = PTR_ERR(*disk_super);
-		bdev_fput(*bdev_file);
+		if (holder)
+			fs_bdev_file_release(*bdev_file, holder);
+		else
+			bdev_fput(*bdev_file);
 		goto error;
 	}
 
@@ -727,7 +735,7 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
 
 error_free_page:
 	btrfs_release_disk_super(disk_super);
-	bdev_fput(bdev_file);
+	fs_bdev_file_release(bdev_file, holder);
 
 	return -EINVAL;
 }
@@ -1082,7 +1090,7 @@ static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices,
 			continue;
 
 		if (device->bdev_file) {
-			bdev_fput(device->bdev_file);
+			fs_bdev_file_release(device->bdev_file, fs_devices->fs_info->sb);
 			device->bdev = NULL;
 			device->bdev_file = NULL;
 			fs_devices->open_devices--;
@@ -1129,7 +1137,7 @@ static void btrfs_close_bdev(struct btrfs_device *device)
 		invalidate_bdev(device->bdev);
 	}
 
-	bdev_fput(device->bdev_file);
+	fs_bdev_file_release(device->bdev_file, device->fs_info->sb);
 }
 
 static void btrfs_close_one_device(struct btrfs_device *device)
@@ -2820,8 +2828,8 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	if (sb_rdonly(sb) && !fs_devices->seeding)
 		return -EROFS;
 
-	bdev_file = bdev_file_open_by_path(device_path, BLK_OPEN_WRITE,
-					   fs_info->sb, &fs_holder_ops);
+	bdev_file = fs_bdev_file_open_by_path(device_path, BLK_OPEN_WRITE,
+					      fs_info->sb, fs_info->sb);
 	if (IS_ERR(bdev_file))
 		return PTR_ERR(bdev_file);
 
@@ -3045,7 +3053,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 error_free_device:
 	btrfs_free_device(device);
 error:
-	bdev_fput(bdev_file);
+	fs_bdev_file_release(bdev_file, fs_info->sb);
 	if (locked) {
 		mutex_unlock(&uuid_mutex);
 		up_write(&sb->s_umount);

-- 
2.47.3


* [PATCH RFC 4/8] xfs: port to fs_bdev_file_open_by_path()
From: Christian Brauner @ 2026-06-02 10:10 UTC (permalink / raw)
  To: Christoph Hellwig, Jan Kara
  Cc: Jens Axboe, Alexander Viro, linux-block, linux-kernel,
	linux-fsdevel, Carlos Maiolino, linux-xfs, Chris Mason,
	David Sterba, linux-btrfs, Theodore Ts'o, linux-ext4,
	Gao Xiang, linux-erofs, Christian Brauner (Amutable)
In-Reply-To: <20260602-work-super-bdev_holder_global-v1-0-bb0fd82f3861@kernel.org>

Route opens through fs_bdev_file_open_by_path() so each external device
is registered against mp->m_super, and convert the matching releases.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 fs/xfs/xfs_buf.c   |  2 +-
 fs/xfs/xfs_super.c | 10 +++++-----
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 580d40a5ee57..3d3b29edb156 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1601,7 +1601,7 @@ xfs_free_buftarg(
 	fs_put_dax(btp->bt_daxdev, btp->bt_mount);
 	/* the main block device is closed by kill_block_super */
 	if (btp->bt_bdev != btp->bt_mount->m_super->s_bdev)
-		bdev_fput(btp->bt_file);
+		fs_bdev_file_release(btp->bt_file, btp->bt_mount->m_super);
 	kfree(btp);
 }
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index f8de44443e81..304667210695 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -400,8 +400,8 @@ xfs_blkdev_get(
 	blk_mode_t		mode;
 
 	mode = sb_open_mode(mp->m_super->s_flags);
-	*bdev_filep = bdev_file_open_by_path(name, mode,
-			mp->m_super, &fs_holder_ops);
+	*bdev_filep = fs_bdev_file_open_by_path(name, mode,
+			mp->m_super, mp->m_super);
 	if (IS_ERR(*bdev_filep)) {
 		error = PTR_ERR(*bdev_filep);
 		*bdev_filep = NULL;
@@ -526,7 +526,7 @@ xfs_open_devices(
 		mp->m_logdev_targp = mp->m_ddev_targp;
 		/* Handle won't be used, drop it */
 		if (logdev_file)
-			bdev_fput(logdev_file);
+			fs_bdev_file_release(logdev_file, mp->m_super);
 	}
 
 	return 0;
@@ -538,10 +538,10 @@ xfs_open_devices(
 	xfs_free_buftarg(mp->m_ddev_targp);
  out_close_rtdev:
 	 if (rtdev_file)
-		bdev_fput(rtdev_file);
+		fs_bdev_file_release(rtdev_file, mp->m_super);
  out_close_logdev:
 	if (logdev_file)
-		bdev_fput(logdev_file);
+		fs_bdev_file_release(logdev_file, mp->m_super);
 	return error;
 }
 

-- 
2.47.3


* [PATCH RFC 3/8] fs: refuse to claim any frozen block device
From: Christian Brauner @ 2026-06-02 10:10 UTC (permalink / raw)
  To: Christoph Hellwig, Jan Kara
  Cc: Jens Axboe, Alexander Viro, linux-block, linux-kernel,
	linux-fsdevel, Carlos Maiolino, linux-xfs, Chris Mason,
	David Sterba, linux-btrfs, Theodore Ts'o, linux-ext4,
	Gao Xiang, linux-erofs, Christian Brauner (Amutable)
In-Reply-To: <20260602-work-super-bdev_holder_global-v1-0-bb0fd82f3861@kernel.org>

setup_bdev_super() already refuses to bring a filesystem up on a frozen
block device but only for the primary device. Now that filesystems claim
every device through fs_bdev_file_open_by_{dev,path}(), do that check
once in the registration helper so it covers all of them.

Drop the now-redundant check from setup_bdev_super().
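
As a rough illustration of the ordering described above, here is a userspace
model, not the kernel code: the struct, its fields and model_claim() are
hypothetical stand-ins. The entry is published first and the freeze count is
checked afterwards, so the same check covers every claimed device rather
than only the primary one.

```c
#include <assert.h>
#include <errno.h>

/* Userspace model only; the names below are illustrative, not the kernel's. */
struct frozen_bdev_model {
	int fsfreeze_count;	/* stands in for bdev->bd_fsfreeze_count */
	int entries;		/* published (device, superblock) entries */
};

/* Publish the entry first, then refuse the claim if the device is frozen. */
int model_claim(struct frozen_bdev_model *bdev)
{
	assert(bdev->fsfreeze_count >= 0);

	bdev->entries++;
	if (bdev->fsfreeze_count > 0) {
		/* Tear the entry down again; the mount aborts. */
		bdev->entries--;
		return -EBUSY;
	}
	return 0;
}
```

In this model a claim on a frozen device fails with -EBUSY and leaves no
entry behind, whether the device is the primary one or an external one.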

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 fs/super.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index e0174d5819a0..cea743f699e4 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1690,6 +1690,17 @@ static int fs_bdev_register(struct file *bdev_file, struct super_block *sb)
 	sb->s_count++;
 	spin_unlock(&sb_lock);
 
+	/*
+	 * Don't bring a filesystem up on a frozen device.  The entry is already
+	 * published, so a freeze either is seen here or finds it and waits in
+	 * super_lock() until this mount is born or (on -EBUSY) dies.  The mount
+	 * aborts, so the entry is torn down without rebalancing @fs_bdev_active.
+	 */
+	if (atomic_read(&file_bdev(bdev_file)->bd_fsfreeze_count) > 0) {
+		fs_bdev_holder_put(h);
+		return -EBUSY;
+	}
+
 	return 0;
 }
 
@@ -1801,16 +1812,6 @@ int setup_bdev_super(struct super_block *sb, int sb_flags,
 		return -EACCES;
 	}
 
-	/*
-	 * It is enough to check bdev was not frozen before we set
-	 * s_bdev as freezing will wait until SB_BORN is set.
-	 */
-	if (atomic_read(&bdev->bd_fsfreeze_count) > 0) {
-		if (fc)
-			warnf(fc, "%pg: Can't mount, blockdev is frozen", bdev);
-		fs_bdev_file_release(bdev_file, sb);
-		return -EBUSY;
-	}
 	spin_lock(&sb_lock);
 	sb->s_bdev_file = bdev_file;
 	sb->s_bdev = bdev;

-- 
2.47.3


* [PATCH RFC 2/8] fs: add a global device to super block hash table
From: Christian Brauner @ 2026-06-02 10:10 UTC (permalink / raw)
  To: Christoph Hellwig, Jan Kara
  Cc: Jens Axboe, Alexander Viro, linux-block, linux-kernel,
	linux-fsdevel, Carlos Maiolino, linux-xfs, Chris Mason,
	David Sterba, linux-btrfs, Theodore Ts'o, linux-ext4,
	Gao Xiang, linux-erofs, Christian Brauner (Amutable)
In-Reply-To: <20260602-work-super-bdev_holder_global-v1-0-bb0fd82f3861@kernel.org>

fs_holder_ops recovers the owning superblock from bdev->bd_holder, which
forces the holder to be exactly one superblock and prevents several
superblocks from sharing one block device, which is what erofs does.

Introduce a global dev_t-keyed rhltable mapping each block device to the
superblock(s) using it. The holder argument becomes purely the block
layer's exclusivity token (a superblock, or a file_system_type for
shared devices) and is no longer needed by the fs specific callbacks.

Registration keeps one entry per (device, superblock). When a filesystem
claims a device it already uses (xfs with its log on the data device), no
second entry is added, so each superblock is acted on once.

Each table entry holds a passive reference (s_count) on its superblock,
so the struct stays valid for as long as the entry is reachable. The
callbacks look the device up in the table and act on every superblock
using it.

Unlinking an entry is deferred to the last unpin, so a cursor never
resumes from a removed node. After this it's possible to act on all
superblocks that share a given device.
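
The two counters can be illustrated with a small userspace model. This is a
sketch under assumptions, not the kernel code: plain ints stand in for
refcount_t, a fixed array stands in for the rhltable, and every model_* name
is hypothetical. It shows one entry per (device, superblock), claims counted
in "active", and the unlink deferred until the last "passive" reference
(claim bias or cursor pin) is dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: ints stand in for refcount_t, an array for the rhltable. */
#define MODEL_SLOTS 8

struct entry_model {
	unsigned int dev;	/* stands in for the dev_t key */
	const void *sb;		/* stands in for struct super_block * */
	int active;		/* open claims for (dev, sb) */
	int passive;		/* 1 while active > 0, plus 1 per cursor pin */
	bool linked;		/* still reachable through the table */
};

static struct entry_model model_table[MODEL_SLOTS];

/* Find the live entry for (dev, sb); entries with no claims left are skipped. */
struct entry_model *model_find(unsigned int dev, const void *sb)
{
	for (size_t i = 0; i < MODEL_SLOTS; i++) {
		struct entry_model *e = &model_table[i];

		if (e->linked && e->active > 0 && e->dev == dev && e->sb == sb)
			return e;
	}
	return NULL;
}

/* Claim: bump the claim count of a live entry, else publish a new one. */
int model_register(unsigned int dev, const void *sb)
{
	struct entry_model *e = model_find(dev, sb);

	if (e) {
		e->active++;
		return 0;
	}
	for (size_t i = 0; i < MODEL_SLOTS; i++) {
		e = &model_table[i];
		if (!e->linked) {
			*e = (struct entry_model){ .dev = dev, .sb = sb,
				.active = 1, .passive = 1, .linked = true };
			return 0;
		}
	}
	return -1;	/* model table full */
}

/* Drop one passive reference; the last one unlinks the entry. */
void model_put(struct entry_model *e)
{
	assert(e->passive > 0);
	if (--e->passive == 0)
		e->linked = false;
}

/* Cursor pin: only succeeds while the entry is still referenced. */
bool model_pin(struct entry_model *e)
{
	if (e->passive == 0)
		return false;
	e->passive++;
	return true;
}

/* Release one claim; the last claim drops the bias on the passive count. */
void model_release(unsigned int dev, const void *sb)
{
	struct entry_model *e = model_find(dev, sb);

	if (e && --e->active == 0)
		model_put(e);
}

/* Number of entries still linked for dev. */
int model_count(unsigned int dev)
{
	int n = 0;

	for (size_t i = 0; i < MODEL_SLOTS; i++)
		if (model_table[i].linked && model_table[i].dev == dev)
			n++;
	return n;
}
```

In this model, claiming the same device twice for one superblock leaves a
single entry with active == 2, and an entry pinned by a cursor stays linked
after its last claim is released until the pin is dropped as well.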

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 fs/super.c               | 430 +++++++++++++++++++++++++++++++++--------------
 include/linux/blkdev.h   |   7 -
 include/linux/fs/super.h |   7 +
 3 files changed, 309 insertions(+), 135 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 378e81efe643..e0174d5819a0 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -24,6 +24,7 @@
 #include <linux/export.h>
 #include <linux/slab.h>
 #include <linux/blkdev.h>
+#include <linux/rhashtable.h>
 #include <linux/mount.h>
 #include <linux/security.h>
 #include <linux/writeback.h>		/* for the emergency remount stuff */
@@ -1411,186 +1412,234 @@ EXPORT_SYMBOL(sget_dev);
 
 #ifdef CONFIG_BLOCK
 /*
- * Lock the superblock that is holder of the bdev. Returns the superblock
- * pointer if we successfully locked the superblock and it is alive. Otherwise
- * we return NULL and just unlock bdev->bd_holder_lock.
- *
- * The function must be called with bdev->bd_holder_lock and releases it.
+ * Filesystems claim block devices through fs_bdev_file_open_by_{dev,path}(),
+ * which records a {dev_t -> super_block} entry in the global @fs_bdev_supers
+ * table.  The fs_holder_ops callbacks resolve a device event to the
+ * superblock(s) using that device by looking it up there rather than reading
+ * bdev->bd_holder, so several superblocks may share one block device -- the
+ * holder is then only the block layer's exclusivity token.
  */
-static struct super_block *bdev_super_lock(struct block_device *bdev, bool excl)
-	__releases(&bdev->bd_holder_lock)
+struct fs_bdev_holder {
+	dev_t			dev;		/* @fs_bdev_supers key */
+	struct super_block	*sb;
+	refcount_t		fs_bdev_passive;	/* @fs_bdev_active>0 bias + cursor pins */
+	refcount_t		fs_bdev_active;		/* open claims for (dev, sb) */
+	struct rhlist_head	node;
+	struct rcu_head		rcu;
+};
+
+static struct rhltable fs_bdev_supers;
+static const struct rhashtable_params fs_bdev_params = {
+	.key_len	= sizeof(dev_t),
+	.key_offset	= offsetof(struct fs_bdev_holder, dev),
+	.head_offset	= offsetof(struct fs_bdev_holder, node),
+};
+
+static int __init fs_bdev_supers_init(void)
 {
-	struct super_block *sb = bdev->bd_holder;
-	bool locked;
+	if (rhltable_init(&fs_bdev_supers, &fs_bdev_params))
+		panic("VFS: Cannot initialise fs_bdev_supers\n");
+	return 0;
+}
+fs_initcall(fs_bdev_supers_init);
 
-	lockdep_assert_held(&bdev->bd_holder_lock);
-	lockdep_assert_not_held(&sb->s_umount);
-	lockdep_assert_not_held(&bdev->bd_disk->open_mutex);
+static void fs_bdev_holder_put(struct fs_bdev_holder *h)
+{
+	/* Unlink only once unpinned, so a cursor never resumes from a removed node. */
+	if (refcount_dec_and_test(&h->fs_bdev_passive)) {
+		rhltable_remove(&fs_bdev_supers, &h->node, fs_bdev_params);
+		put_super(h->sb);
+		kfree_rcu(h, rcu);
+	}
+}
 
-	/* Make sure sb doesn't go away from under us */
-	spin_lock(&sb_lock);
-	sb->s_count++;
-	spin_unlock(&sb_lock);
+/*
+ * Walk the superblocks sharing a block device the way __iterate_supers() walks
+ * super_blocks: fs_bdev_first()/fs_bdev_next() return each entry with its node
+ * pinned (refcount) so the chain link survives the RCU drop and the sleeping
+ * work the callbacks do between iterations; fs_bdev_next() also unpins the
+ * previous entry.  The entry's fs_bdev_passive ref keeps @h->sb valid; callers
+ * take s_active and/or super_lock_shared() as needed and skip dying superblocks.
+ * A shared per-entry list node can't replace this because mark_dead and sync
+ * are not mutually serialised.
+ */
+static struct fs_bdev_holder *fs_bdev_pin(struct rhlist_head *pos)
+{
+	struct fs_bdev_holder *h;
 
-	mutex_unlock(&bdev->bd_holder_lock);
+	/* Caller holds rcu_read_lock(). */
+	for (; pos; pos = rcu_dereference_all(pos->next)) {
+		h = container_of(pos, struct fs_bdev_holder, node);
+		if (refcount_inc_not_zero(&h->fs_bdev_passive))
+			return h;
+	}
+	return NULL;
+}
 
-	locked = super_lock(sb, excl);
+static struct fs_bdev_holder *fs_bdev_first(dev_t dev)
+{
+	struct fs_bdev_holder *h;
 
-	/*
-	 * If the superblock wasn't already SB_DYING then we hold
-	 * s_umount and can safely drop our temporary reference.
-         */
-	put_super(sb);
+	rcu_read_lock();
+	h = fs_bdev_pin(rhltable_lookup(&fs_bdev_supers, &dev, fs_bdev_params));
+	rcu_read_unlock();
+	return h;
+}
 
-	if (!locked)
-		return NULL;
+static struct fs_bdev_holder *fs_bdev_next(struct fs_bdev_holder *prev)
+{
+	struct fs_bdev_holder *h;
 
-	if (!sb->s_root || !(sb->s_flags & SB_ACTIVE)) {
-		super_unlock(sb, excl);
-		return NULL;
-	}
+	rcu_read_lock();
+	h = fs_bdev_pin(rcu_dereference_all(prev->node.next));
+	rcu_read_unlock();
+
+	fs_bdev_holder_put(prev);
+	return h;
+}
 
-	return sb;
+static int fs_super_freeze(struct super_block *sb)
+{
+	if (sb->s_op->freeze_super)
+		return sb->s_op->freeze_super(sb,
+				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
+	return freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
+}
+
+static int fs_super_thaw(struct super_block *sb)
+{
+	if (sb->s_op->thaw_super)
+		return sb->s_op->thaw_super(sb,
+				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
+	return thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
 }
 
 static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
 {
-	struct super_block *sb;
+	struct fs_bdev_holder *h;
+	dev_t dev = bdev->bd_dev;
 
-	sb = bdev_super_lock(bdev, false);
-	if (!sb)
-		return;
+	mutex_unlock(&bdev->bd_holder_lock);
 
-	if (sb->s_op->remove_bdev) {
-		int ret;
+	for (h = fs_bdev_first(dev); h; h = fs_bdev_next(h)) {
+		struct super_block *sb = h->sb;
 
-		ret = sb->s_op->remove_bdev(sb, bdev);
-		if (!ret) {
-			super_unlock_shared(sb);
-			return;
+		if (!super_lock_shared(sb))
+			continue;
+		if (sb->s_root && (sb->s_flags & SB_ACTIVE)) {
+			if (!sb->s_op->remove_bdev ||
+			    sb->s_op->remove_bdev(sb, bdev)) {
+				if (!surprise)
+					sync_filesystem(sb);
+				shrink_dcache_sb(sb);
+				evict_inodes(sb);
+				if (sb->s_op->shutdown)
+					sb->s_op->shutdown(sb);
+			}
 		}
-		/* Fallback to shutdown. */
+		super_unlock_shared(sb);
 	}
-
-	if (!surprise)
-		sync_filesystem(sb);
-	shrink_dcache_sb(sb);
-	evict_inodes(sb);
-	if (sb->s_op->shutdown)
-		sb->s_op->shutdown(sb);
-
-	super_unlock_shared(sb);
 }
 
 static void fs_bdev_sync(struct block_device *bdev)
 {
-	struct super_block *sb;
+	struct fs_bdev_holder *h;
+	dev_t dev = bdev->bd_dev;
 
-	sb = bdev_super_lock(bdev, false);
-	if (!sb)
-		return;
+	mutex_unlock(&bdev->bd_holder_lock);
 
-	sync_filesystem(sb);
-	super_unlock_shared(sb);
-}
+	for (h = fs_bdev_first(dev); h; h = fs_bdev_next(h)) {
+		struct super_block *sb = h->sb;
 
-static struct super_block *get_bdev_super(struct block_device *bdev)
-{
-	bool active = false;
-	struct super_block *sb;
-
-	sb = bdev_super_lock(bdev, true);
-	if (sb) {
-		active = atomic_inc_not_zero(&sb->s_active);
-		super_unlock_excl(sb);
+		if (!super_lock_shared(sb))
+			continue;
+		if (sb->s_root && (sb->s_flags & SB_ACTIVE))
+			sync_filesystem(sb);
+		super_unlock_shared(sb);
 	}
-	if (!active)
-		return NULL;
-	return sb;
 }
 
 /**
- * fs_bdev_freeze - freeze owning filesystem of block device
+ * fs_bdev_freeze - freeze every superblock using a block device
  * @bdev: block device
  *
- * Freeze the filesystem that owns this block device if it is still
- * active.
- *
- * A filesystem that owns multiple block devices may be frozen from each
- * block device and won't be unfrozen until all block devices are
- * unfrozen. Each block device can only freeze the filesystem once as we
- * nest freezes for block devices in the block layer.
+ * Freeze each live superblock using @bdev.  A superblock owning several block
+ * devices is frozen once per device and stays frozen until all are thawed; the
+ * block layer nests these freezes so the count stays balanced.
  *
- * Return: If the freeze was successful zero is returned. If the freeze
- *         failed a negative error code is returned.
+ * Return: 0, or the error from the one superblock on a single-fs device.  When
+ *         several superblocks share @bdev a per-superblock failure is swallowed
+ *         (see below), but a sync_blockdev() failure is always reported.
  */
 static int fs_bdev_freeze(struct block_device *bdev)
 {
-	struct super_block *sb;
-	int error = 0;
+	dev_t dev = bdev->bd_dev;
+	struct fs_bdev_holder *h;
+	unsigned int count = 0;
+	int error = 0, err;
 
 	lockdep_assert_held(&bdev->bd_fsfreeze_mutex);
 
-	sb = get_bdev_super(bdev);
-	if (!sb)
-		return -EINVAL;
+	mutex_unlock(&bdev->bd_holder_lock);
 
-	if (sb->s_op->freeze_super)
-		error = sb->s_op->freeze_super(sb,
-				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
-	else
-		error = freeze_super(sb,
-				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
+	for (h = fs_bdev_first(dev); h; h = fs_bdev_next(h)) {
+		if (!atomic_inc_not_zero(&h->sb->s_active))
+			continue;
+		err = fs_super_freeze(h->sb);
+		if (err && !error)
+			error = err;
+		deactivate_super(h->sb);
+		count++;
+	}
+
+	/*
+	 * When several superblocks share the device, keep it frozen even if some
+	 * of them failed to freeze and swallow the error: rolling the rest back
+	 * via thaw_super() can fail too, so neither is a clear win. A single
+	 * filesystem (count == 1) still reports its error.
+	 */
+	if (error && count > 1)
+		error = 0;
 	if (!error)
 		error = sync_blockdev(bdev);
-	deactivate_super(sb);
 	return error;
 }
 
 /**
- * fs_bdev_thaw - thaw owning filesystem of block device
+ * fs_bdev_thaw - thaw every superblock using a block device
  * @bdev: block device
  *
- * Thaw the filesystem that owns this block device.
+ * The counterpart to fs_bdev_freeze(): thaw each live superblock using @bdev.
+ * A zero return does not imply a superblock is fully unfrozen; it may have been
+ * frozen more than once (by the kernel or via another device).
  *
- * A filesystem that owns multiple block devices may be frozen from each
- * block device and won't be unfrozen until all block devices are
- * unfrozen. Each block device can only freeze the filesystem once as we
- * nest freezes for block devices in the block layer.
- *
- * Return: If the thaw was successful zero is returned. If the thaw
- *         failed a negative error code is returned. If this function
- *         returns zero it doesn't mean that the filesystem is unfrozen
- *         as it may have been frozen multiple times (kernel may hold a
- *         freeze or might be frozen from other block devices).
+ * Return: 0, or the first error on a single-fs device; a shared device swallows
+ *         per-superblock errors, as fs_bdev_freeze() does.
  */
 static int fs_bdev_thaw(struct block_device *bdev)
 {
-	struct super_block *sb;
-	int error;
+	dev_t dev = bdev->bd_dev;
+	struct fs_bdev_holder *h;
+	unsigned int count = 0;
+	int error = 0, err;
 
 	lockdep_assert_held(&bdev->bd_fsfreeze_mutex);
 
-	/*
-	 * The block device may have been frozen before it was claimed by a
-	 * filesystem. Concurrently another process might try to mount that
-	 * frozen block device and has temporarily claimed the block device for
-	 * that purpose causing a concurrent fs_bdev_thaw() to end up here. The
-	 * mounter is already about to abort mounting because they still saw an
-	 * elevanted bdev->bd_fsfreeze_count so get_bdev_super() will return
-	 * NULL in that case.
-	 */
-	sb = get_bdev_super(bdev);
-	if (!sb)
-		return -EINVAL;
+	mutex_unlock(&bdev->bd_holder_lock);
 
-	if (sb->s_op->thaw_super)
-		error = sb->s_op->thaw_super(sb,
-				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
-	else
-		error = thaw_super(sb,
-				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
-	deactivate_super(sb);
+	for (h = fs_bdev_first(dev); h; h = fs_bdev_next(h)) {
+		if (!atomic_inc_not_zero(&h->sb->s_active))
+			continue;
+		err = fs_super_thaw(h->sb);
+		if (err && !error)
+			error = err;
+		deactivate_super(h->sb);
+		count++;
+	}
+
+	/* Shared device: swallow per-superblock errors, like fs_bdev_freeze(). */
+	if (error && count > 1)
+		error = 0;
 	return error;
 }
 
@@ -1602,6 +1651,131 @@ const struct blk_holder_ops fs_holder_ops = {
 };
 EXPORT_SYMBOL_GPL(fs_holder_ops);
 
+static int fs_bdev_register(struct file *bdev_file, struct super_block *sb)
+{
+	dev_t dev = file_bdev(bdev_file)->bd_dev;
+	struct rhlist_head *list, *pos;
+	struct fs_bdev_holder *h;
+	int err;
+
+	/*
+	 * A superblock may claim one device more than once (xfs with its log on
+	 * the data device).  Keep a single entry per (device, superblock) and
+	 * count the claims in @fs_bdev_active; the entry lives until the last one
+	 * is released.
+	 */
+	scoped_guard(rcu) {
+		list = rhltable_lookup(&fs_bdev_supers, &dev, fs_bdev_params);
+		rhl_for_each_entry_rcu(h, pos, list, node)
+			if (h->sb == sb && refcount_inc_not_zero(&h->fs_bdev_active))
+				return 0;
+	}
+
+	h = kmalloc(sizeof(*h), GFP_KERNEL);
+	if (!h)
+		return -ENOMEM;
+	h->dev = dev;
+	h->sb = sb;
+	refcount_set(&h->fs_bdev_passive, 1);
+	refcount_set(&h->fs_bdev_active, 1);
+
+	err = rhltable_insert(&fs_bdev_supers, &h->node, fs_bdev_params);
+	if (err) {
+		kfree(h);
+		return err;
+	}
+
+	/* The sb->s_count ref keeps @h->sb valid for as long as the entry exists. */
+	spin_lock(&sb_lock);
+	sb->s_count++;
+	spin_unlock(&sb_lock);
+
+	return 0;
+}
+
+/**
+ * fs_bdev_file_open_by_dev - claim a block device on behalf of a superblock
+ * @dev: block device number
+ * @mode: open mode
+ * @holder: block-layer exclusivity token (a superblock, or the file_system_type
+ *          when the device may be shared by several superblocks of that type)
+ * @sb: superblock to drive fs_holder_ops events for
+ *
+ * Open @dev with &fs_holder_ops and register that @sb uses it, so device
+ * removal/sync/freeze/thaw are propagated to @sb (and any other superblock
+ * sharing @dev).  Must be paired with fs_bdev_file_release().
+ *
+ * Return: an opened block-device file or an ERR_PTR().
+ */
+struct file *fs_bdev_file_open_by_dev(dev_t dev, blk_mode_t mode, void *holder,
+				      struct super_block *sb)
+{
+	struct file *bdev_file;
+	int err;
+
+	bdev_file = bdev_file_open_by_dev(dev, mode, holder, &fs_holder_ops);
+	if (IS_ERR(bdev_file))
+		return bdev_file;
+
+	err = fs_bdev_register(bdev_file, sb);
+	if (err) {
+		bdev_fput(bdev_file);
+		return ERR_PTR(err);
+	}
+	return bdev_file;
+}
+EXPORT_SYMBOL_GPL(fs_bdev_file_open_by_dev);
+
+struct file *fs_bdev_file_open_by_path(const char *path, blk_mode_t mode,
+				       void *holder, struct super_block *sb)
+{
+	struct file *bdev_file;
+	int err;
+
+	bdev_file = bdev_file_open_by_path(path, mode, holder, &fs_holder_ops);
+	if (IS_ERR(bdev_file))
+		return bdev_file;
+
+	err = fs_bdev_register(bdev_file, sb);
+	if (err) {
+		bdev_fput(bdev_file);
+		return ERR_PTR(err);
+	}
+	return bdev_file;
+}
+EXPORT_SYMBOL_GPL(fs_bdev_file_open_by_path);
+
+/**
+ * fs_bdev_file_release - release a block device claimed for a superblock
+ * @bdev_file: file returned by fs_bdev_file_open_by_{dev,path}()
+ * @sb: superblock the device was claimed for
+ *
+ * Drop one claim on the {dev, @sb} entry; the last claim unregisters it (a
+ * pinning cursor defers the actual unlink).  Then close the block device.
+ */
+void fs_bdev_file_release(struct file *bdev_file, struct super_block *sb)
+{
+	dev_t dev = file_bdev(bdev_file)->bd_dev;
+	struct fs_bdev_holder *h, *found = NULL;
+	struct rhlist_head *list, *pos;
+
+	rcu_read_lock();
+	list = rhltable_lookup(&fs_bdev_supers, &dev, fs_bdev_params);
+	rhl_for_each_entry_rcu(h, pos, list, node) {
+		if (h->sb != sb)
+			continue;
+		/* At most one entry per (dev, sb); the last claim drops the bias. */
+		if (refcount_dec_and_test(&h->fs_bdev_active))
+			found = h;
+		break;
+	}
+	rcu_read_unlock();
+	if (found)
+		fs_bdev_holder_put(found);
+	bdev_fput(bdev_file);
+}
+EXPORT_SYMBOL_GPL(fs_bdev_file_release);
+
 int setup_bdev_super(struct super_block *sb, int sb_flags,
 		struct fs_context *fc)
 {
@@ -1609,7 +1783,7 @@ int setup_bdev_super(struct super_block *sb, int sb_flags,
 	struct file *bdev_file;
 	struct block_device *bdev;
 
-	bdev_file = bdev_file_open_by_dev(sb->s_dev, mode, sb, &fs_holder_ops);
+	bdev_file = fs_bdev_file_open_by_dev(sb->s_dev, mode, sb, sb);
 	if (IS_ERR(bdev_file)) {
 		if (fc)
 			errorf(fc, "%s: Can't open blockdev", fc->source);
@@ -1623,7 +1797,7 @@ int setup_bdev_super(struct super_block *sb, int sb_flags,
 	 * writable from userspace even for a read-only block device.
 	 */
 	if ((mode & BLK_OPEN_WRITE) && bdev_read_only(bdev)) {
-		bdev_fput(bdev_file);
+		fs_bdev_file_release(bdev_file, sb);
 		return -EACCES;
 	}
 
@@ -1634,7 +1808,7 @@ int setup_bdev_super(struct super_block *sb, int sb_flags,
 	if (atomic_read(&bdev->bd_fsfreeze_count) > 0) {
 		if (fc)
 			warnf(fc, "%pg: Can't mount, blockdev is frozen", bdev);
-		bdev_fput(bdev_file);
+		fs_bdev_file_release(bdev_file, sb);
 		return -EBUSY;
 	}
 	spin_lock(&sb_lock);
@@ -1725,7 +1899,7 @@ void kill_block_super(struct super_block *sb)
 	generic_shutdown_super(sb);
 	if (bdev) {
 		sync_blockdev(bdev);
-		bdev_fput(sb->s_bdev_file);
+		fs_bdev_file_release(sb->s_bdev_file, sb);
 	}
 }
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c8494d64a69d..43d37c02febf 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1760,13 +1760,6 @@ struct blk_holder_ops {
 	int (*thaw)(struct block_device *bdev);
 };
 
-/*
- * For filesystems using @fs_holder_ops, the @holder argument passed to
- * helpers used to open and claim block devices via
- * bd_prepare_to_claim() must point to a superblock.
- */
-extern const struct blk_holder_ops fs_holder_ops;
-
 /*
  * Return the correct open flags for blkdev_get_by_* for super block flags
  * as stored in sb->s_flags.
diff --git a/include/linux/fs/super.h b/include/linux/fs/super.h
index f21ffbb6dea5..721d842e3b24 100644
--- a/include/linux/fs/super.h
+++ b/include/linux/fs/super.h
@@ -235,4 +235,11 @@ int freeze_super(struct super_block *super, enum freeze_holder who,
 int thaw_super(struct super_block *super, enum freeze_holder who,
 	       const void *freeze_owner);
 
+struct file;
+struct file *fs_bdev_file_open_by_dev(dev_t dev, blk_mode_t mode, void *holder,
+				      struct super_block *sb);
+struct file *fs_bdev_file_open_by_path(const char *path, blk_mode_t mode,
+				       void *holder, struct super_block *sb);
+void fs_bdev_file_release(struct file *bdev_file, struct super_block *sb);
+
 #endif /* _LINUX_FS_SUPER_H */

-- 
2.47.3



* [PATCH RFC 1/8] fs, block: move blk_mode_t and fop_flags_t into <linux/types.h>
From: Christian Brauner @ 2026-06-02 10:10 UTC (permalink / raw)
  To: Christoph Hellwig, Jan Kara
  Cc: Jens Axboe, Alexander Viro, linux-block, linux-kernel,
	linux-fsdevel, Carlos Maiolino, linux-xfs, Chris Mason,
	David Sterba, linux-btrfs, Theodore Ts'o, linux-ext4,
	Gao Xiang, linux-erofs, Christian Brauner (Amutable)
In-Reply-To: <20260602-work-super-bdev_holder_global-v1-0-bb0fd82f3861@kernel.org>

blk_mode_t and fop_flags_t are both plain 'unsigned int __bitwise' flag
typedefs, exactly like the gfp_t, slab_flags_t and fmode_t that already
live in <linux/types.h>. Move them there so they are available
everywhere without having to drag in a subsystem header.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 include/linux/blkdev.h | 2 --
 include/linux/fs.h     | 2 --
 include/linux/types.h  | 2 ++
 3 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 890128cdea1c..c8494d64a69d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -126,8 +126,6 @@ struct blk_integrity {
 	unsigned char				pi_tuple_size;
 };
 
-typedef unsigned int __bitwise blk_mode_t;
-
 /* open for reading */
 #define BLK_OPEN_READ		((__force blk_mode_t)(1 << 0))
 /* open for writing */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 11559c513dfb..e9346be8470f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1921,8 +1921,6 @@ struct dir_context {
 struct io_uring_cmd;
 struct offset_ctx;
 
-typedef unsigned int __bitwise fop_flags_t;
-
 struct file_operations {
 	struct module *owner;
 	fop_flags_t fop_flags;
diff --git a/include/linux/types.h b/include/linux/types.h
index 608050dbca6a..ef026585420b 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -163,6 +163,8 @@ typedef u32 dma_addr_t;
 typedef unsigned int __bitwise gfp_t;
 typedef unsigned int __bitwise slab_flags_t;
 typedef unsigned int __bitwise fmode_t;
+typedef unsigned int __bitwise blk_mode_t;
+typedef unsigned int __bitwise fop_flags_t;
 
 #ifdef CONFIG_PHYS_ADDR_T_64BIT
 typedef u64 phys_addr_t;

-- 
2.47.3



* [PATCH RFC 0/8] fs: support freeze/thaw/mark_dead/sync with shared devices
From: Christian Brauner @ 2026-06-02 10:10 UTC (permalink / raw)
  To: Christoph Hellwig, Jan Kara
  Cc: Jens Axboe, Alexander Viro, linux-block, linux-kernel,
	linux-fsdevel, Carlos Maiolino, linux-xfs, Chris Mason,
	David Sterba, linux-btrfs, Theodore Ts'o, linux-ext4,
	Gao Xiang, linux-erofs, Christian Brauner (Amutable)

Note, this is on the border between RFC/POC and so I haven't pushed this
through testing yet. But I don't want to waste more time on this before
showing it.

I surveyed various fs implementations because I want to give userspace
the ability to manage which devices can be onlined in a centralized way,
without forcing every fs to care about this.

I realized that erofs allows sharing block devices between multiple
superblocks. Any freeze, thaw, removal, or sync on those devices will
not be communicated to the superblocks using them, and our current
infrastructure is unable to deal with this.

This attempts to add the ability to go from a device number to all the
superblocks using that device, iterate through them one-by-one and
perform actions on them. For most fses this is a 1:1 mapping but for
erofs it's a 1:many mapping.

This is not unreasonable infrastructure to support in my opinion. I
played around with some ideas for this and I want to send out an RFC to
gather some early input.
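
To make the device-to-superblocks mapping concrete, here is a minimal
userspace sketch. It is not the code from this series: every name in it
(toy_sb, toy_claim, toy_register, toy_release, toy_freeze_dev) is
hypothetical, a fixed array stands in for the hash table, and it assumes
that repeated opens by the same (dev, sb) pair share one counted entry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
#define MAX_CLAIMS 16

struct toy_sb {
	int frozen;		/* set when a freeze event reaches this superblock */
};

struct toy_claim {
	unsigned int dev;	/* device number used as the lookup key */
	struct toy_sb *sb;	/* superblock that uses the device */
	int active;		/* opens by this (dev, sb) pair; 0 means free slot */
};

static struct toy_claim claims[MAX_CLAIMS];

/* Register one use of dev by sb; at most one entry per (dev, sb) pair. */
int toy_register(unsigned int dev, struct toy_sb *sb)
{
	struct toy_claim *free_slot = NULL;

	assert(sb != NULL);
	for (size_t i = 0; i < MAX_CLAIMS; i++) {
		struct toy_claim *c = &claims[i];

		if (c->active && c->dev == dev && c->sb == sb) {
			c->active++;	/* same pair opened the device again */
			return 0;
		}
		if (!c->active && !free_slot)
			free_slot = c;
	}
	if (!free_slot)
		return -1;		/* table full */
	free_slot->dev = dev;
	free_slot->sb = sb;
	free_slot->active = 1;
	return 0;
}

/* Drop one use; the slot becomes free when the last claim goes away. */
void toy_release(unsigned int dev, struct toy_sb *sb)
{
	for (size_t i = 0; i < MAX_CLAIMS; i++) {
		struct toy_claim *c = &claims[i];

		if (c->active && c->dev == dev && c->sb == sb) {
			c->active--;
			return;
		}
	}
}

/* Propagate a freeze of dev to every superblock using it; returns the count. */
int toy_freeze_dev(unsigned int dev)
{
	int reached = 0;

	for (size_t i = 0; i < MAX_CLAIMS; i++) {
		struct toy_claim *c = &claims[i];

		if (c->active && c->dev == dev) {
			c->sb->frozen = 1;
			reached++;
		}
	}
	return reached;
}
```

With two superblocks registered on the same device number, one call to
toy_freeze_dev() reaches both; that is the 1:many case described above
for erofs, while a 1:1 filesystem would only ever see one match.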

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
Christian Brauner (8):
      fs, block: move blk_mode_t and fop_flags_t into <linux/types.h>
      fs: add a global device to super block hash table
      fs: refuse to claim any frozen block device
      xfs: port to fs_bdev_file_open_by_path()
      btrfs: open via dedicated fs bdev helpers
      ext4: open via dedicated fs bdev helpers
      erofs: open via dedicated fs bdev helpers
      super: make fs_holder_ops private

 fs/btrfs/dev-replace.c   |   6 +-
 fs/btrfs/ioctl.c         |   4 +-
 fs/btrfs/volumes.c       |  26 ++-
 fs/erofs/data.c          |   6 +
 fs/erofs/internal.h      |  10 ++
 fs/erofs/super.c         |  66 +++++--
 fs/erofs/zdata.c         |  10 +-
 fs/ext4/super.c          |  12 +-
 fs/super.c               | 452 ++++++++++++++++++++++++++++++++---------------
 fs/xfs/xfs_buf.c         |   2 +-
 fs/xfs/xfs_super.c       |  10 +-
 include/linux/blkdev.h   |   9 -
 include/linux/fs.h       |   2 -
 include/linux/fs/super.h |   7 +
 include/linux/types.h    |   2 +
 15 files changed, 433 insertions(+), 191 deletions(-)
---
base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
change-id: 20260602-work-super-bdev_holder_global-8cba5e52bed5


* Re: [PATCH v4 08/23] ext4: implement buffered write path using iomap
From: Ojaswin Mujoo @ 2026-06-02 10:05 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai
In-Reply-To: <83e36f9c-aeb1-40d4-9265-fd22120a7fa9@huaweicloud.com>

On Fri, May 29, 2026 at 05:13:55PM +0800, Zhang Yi wrote:
> Hi, Ojaswin!
> 
> On 5/27/2026 1:10 AM, Ojaswin Mujoo wrote:
> > On Mon, May 11, 2026 at 03:23:28PM +0800, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> Introduce two new iomap_ops instances for ext4 buffered writes:
> >>
> >>  - ext4_iomap_buffered_da_write_ops: for delayed allocation mode, using
> >>    ext4_da_map_blocks() to map delalloc extents.
> >>  - ext4_iomap_buffered_write_ops: for non-delayed allocation mode, using
> >>    ext4_iomap_get_blocks() to directly allocate blocks.
> >>
> >> Also add ext4_iomap_valid() for the iomap infrastructure to check extent
> >> validity.
> >>
> >> Key changes and considerations:
> >>
> >>  - Unwritten extents for new blocks (dioread_nolock always on)
> >>    Since data=ordered mode is not used to prevent stale data exposure in
> >>    the non-delayed allocation path, new blocks are always allocated as
> >>    unwritten extents.
> > 
> > Okay makes sense.
> > 
> >>
> >>  - Short write and write failure handling
> >>    a. Delalloc path: On short write or failure, the stale delalloc range
> >>       must be dropped and its space reservation released. Otherwise, a
> >>       clean folio may cover leftover delalloc extents, causing
> >>       inaccurate space reservation accounting.
> > 
> > Hmm, okay so in the usual buffer head path, seems like during a short
> > write we still zero the new buffers we couldn't write and keep it dirty
> > (folio_zero_new_buffers()). This way they are still written back and
> > the delalloc reservations are used up.
> > 
> 
> In fact, in the normal buffer head path, writeback does not consume
> delalloc reservations. Instead, the reservations are retained until the
> inode is released or the area is written again using delalloc. This is
> because i_size is not updated during short writes. Therefore, when a
> zeroed dirty folio is written back, no block mapping is created for it.
> For details, please see the lblk >= blocks judgment in
> mpage_process_page_bufs().

Oh okay, I see. I'm not very clear on the code path, but what about a case
where i_size is beyond the short write range?
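
For reference, the cutoff being discussed can be modelled in a few lines
of userspace C. This is only a sketch of the comparison as described
above (a block is skipped once its logical number is at or past the
number of blocks covered by i_size); the function names are hypothetical
and the real mpage_process_page_bufs() checks more than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Number of blocks needed to cover i_size bytes (i_size rounded up to a block). */
uint64_t toy_blocks_in_isize(uint64_t i_size, unsigned int blkbits)
{
	uint64_t block_size;

	assert(blkbits < 32);
	block_size = UINT64_C(1) << blkbits;
	return (i_size + block_size - 1) >> blkbits;
}

/* True if this model would let writeback consider mapping logical block lblk. */
bool toy_writeback_maps_block(uint64_t lblk, uint64_t i_size,
			      unsigned int blkbits)
{
	return lblk < toy_blocks_in_isize(i_size, blkbits);
}
```

In this model, a block that lies entirely past i_size is never mapped,
which matches the explanation that the reservation stays around after a
short write that did not move i_size. A block below the cutoff passes
this particular comparison; what happens to its reservation after that
is the open question here, and the sketch does not answer it.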

> 
> This will not lead to duplicate space statistics, because
> ext4_da_map_blocks() only reserves space when inserting a new delalloc
> extent. Therefore, this does not pose a serious issue. However, it may
> cause some temporary and minor space leaks. Nevertheless, I think it
> would be better if delalloc extents could be released for the buffer
> head path when short writes occur.

Yes, true. Ideally it would be more intuitive if we cancelled the
reservations on a short write.

Regards,
ojaswin

> 
> > However in iomap we don't mark the range that we couldn't write as dirty
> > so we need to make sure we clear up the stale delalloc mappings. Is this
> > correct?
> > 
> Yeah.
> 
> Thanks,
> Yi.
> 
> > Regards,
> > Ojaswin
> > 
> >>    b. Non-delalloc path: No cleanup of allocated blocks is needed on
> >>       short write.
> >>
> >>  - Lock ordering reversal
> >>    The folio lock and transaction start ordering is reversed compared to
> >>    the buffer_head buffered write path. To handle this, the journal
> >>    handle must be stopped in iomap_begin() callbacks. The lock ordering
> >>    documentation in super.c has been updated accordingly.
> >>
> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >> ---
> >>  fs/ext4/ext4.h  |   4 ++
> >>  fs/ext4/file.c  |  20 +++++-
> >>  fs/ext4/inode.c | 165 +++++++++++++++++++++++++++++++++++++++++++++++-
> >>  fs/ext4/super.c |  10 ++-
> >>  4 files changed, 192 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> >> index 1e27d73d7427..4832e7f7db82 100644
> >> --- a/fs/ext4/ext4.h
> >> +++ b/fs/ext4/ext4.h
> >> @@ -3057,6 +3057,7 @@ int ext4_walk_page_buffers(handle_t *handle,
> >>  int do_journal_get_write_access(handle_t *handle, struct inode *inode,
> >>  				struct buffer_head *bh);
> >>  void ext4_set_inode_mapping_order(struct inode *inode);
> >> +int ext4_nonda_switch(struct super_block *sb);
> >>  #define FALL_BACK_TO_NONDELALLOC 1
> >>  #define CONVERT_INLINE_DATA	 2
> >>  
> >> @@ -3926,6 +3927,9 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
> >>  
> >>  extern const struct iomap_ops ext4_iomap_ops;
> >>  extern const struct iomap_ops ext4_iomap_report_ops;
> >> +extern const struct iomap_ops ext4_iomap_buffered_write_ops;
> >> +extern const struct iomap_ops ext4_iomap_buffered_da_write_ops;
> >> +extern const struct iomap_write_ops ext4_iomap_write_ops;
> >>  
> >>  static inline int ext4_buffer_uptodate(struct buffer_head *bh)
> >>  {
> >> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> >> index eb1a323962b1..7f9bfbbc4a4e 100644
> >> --- a/fs/ext4/file.c
> >> +++ b/fs/ext4/file.c
> >> @@ -299,6 +299,21 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> >>  	return count;
> >>  }
> >>  
> >> +static ssize_t ext4_iomap_buffered_write(struct kiocb *iocb,
> >> +					 struct iov_iter *from)
> >> +{
> >> +	struct inode *inode = file_inode(iocb->ki_filp);
> >> +	const struct iomap_ops *iomap_ops;
> >> +
> >> +	if (test_opt(inode->i_sb, DELALLOC) && !ext4_nonda_switch(inode->i_sb))
> >> +		iomap_ops = &ext4_iomap_buffered_da_write_ops;
> >> +	else
> >> +		iomap_ops = &ext4_iomap_buffered_write_ops;
> >> +
> >> +	return iomap_file_buffered_write(iocb, from, iomap_ops,
> >> +					 &ext4_iomap_write_ops, NULL);
> >> +}
> >> +
> >>  static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> >>  					struct iov_iter *from)
> >>  {
> >> @@ -313,7 +328,10 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> >>  	if (ret <= 0)
> >>  		goto out;
> >>  
> >> -	ret = generic_perform_write(iocb, from);
> >> +	if (ext4_inode_buffered_iomap(inode))
> >> +		ret = ext4_iomap_buffered_write(iocb, from);
> >> +	else
> >> +		ret = generic_perform_write(iocb, from);
> >>  
> >>  out:
> >>  	inode_unlock(inode);
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index 39577a6b65b9..1ae7d3f4a1c8 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -3097,7 +3097,7 @@ static int ext4_dax_writepages(struct address_space *mapping,
> >>  	return ret;
> >>  }
> >>  
> >> -static int ext4_nonda_switch(struct super_block *sb)
> >> +int ext4_nonda_switch(struct super_block *sb)
> >>  {
> >>  	s64 free_clusters, dirty_clusters;
> >>  	struct ext4_sb_info *sbi = EXT4_SB(sb);
> >> @@ -3467,6 +3467,15 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
> >>  	return inode_state_read_once(inode) & I_DIRTY_DATASYNC;
> >>  }
> >>  
> >> +static bool ext4_iomap_valid(struct inode *inode, const struct iomap *iomap)
> >> +{
> >> +	return iomap->validity_cookie == READ_ONCE(EXT4_I(inode)->i_es_seq);
> >> +}
> >> +
> >> +const struct iomap_write_ops ext4_iomap_write_ops = {
> >> +	.iomap_valid = ext4_iomap_valid,
> >> +};
> >> +
> >>  static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
> >>  			   struct ext4_map_blocks *map, loff_t offset,
> >>  			   loff_t length, unsigned int flags)
> >> @@ -3501,6 +3510,8 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
> >>  	    !ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> >>  		iomap->flags |= IOMAP_F_MERGED;
> >>  
> >> +	iomap->validity_cookie = map->m_seq;
> >> +
> >>  	/*
> >>  	 * Flags passed to ext4_map_blocks() for direct I/O writes can result
> >>  	 * in m_flags having both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits
> >> @@ -3908,8 +3919,12 @@ const struct iomap_ops ext4_iomap_report_ops = {
> >>  	.iomap_begin = ext4_iomap_begin_report,
> >>  };
> >>  
> >> +/* Map blocks */
> >> +typedef int (ext4_get_blocks_t)(struct inode *, struct ext4_map_blocks *);
> >> +
> >>  static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
> >> -		loff_t length, struct ext4_map_blocks *map)
> >> +		loff_t length, ext4_get_blocks_t get_blocks,
> >> +		struct ext4_map_blocks *map)
> >>  {
> >>  	u8 blkbits = inode->i_blkbits;
> >>  
> >> @@ -3921,6 +3936,9 @@ static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
> >>  	map->m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
> >>  			   EXT4_MAX_LOGICAL_BLOCK) - map->m_lblk + 1;
> >>  
> >> +	if (get_blocks)
> >> +		return get_blocks(inode, map);
> >> +
> >>  	return ext4_map_blocks(NULL, inode, map, 0);
> >>  }
> >>  
> >> @@ -3938,7 +3956,7 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
> >>  	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
> >>  		return -ERANGE;
> >>  
> >> -	ret = ext4_iomap_map_blocks(inode, offset, length, &map);
> >> +	ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
> >>  	if (ret < 0)
> >>  		return ret;
> >>  
> >> @@ -3946,6 +3964,147 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
> >>  	return 0;
> >>  }
> >>  
> >> +static int ext4_iomap_get_blocks(struct inode *inode,
> >> +				 struct ext4_map_blocks *map)
> >> +{
> >> +	loff_t i_size = i_size_read(inode);
> >> +	handle_t *handle;
> >> +	int ret;
> >> +
> >> +	/*
> >> +	 * Check if the blocks have already been allocated, this could
> >> +	 * avoid initiating a new journal transaction and return the
> >> +	 * mapping information directly.
> >> +	 */
> >> +	if ((map->m_lblk + map->m_len) <=
> >> +	    round_up(i_size, i_blocksize(inode)) >> inode->i_blkbits) {
> >> +		ret = ext4_map_blocks(NULL, inode, map, 0);
> >> +		if (ret < 0)
> >> +			return ret;
> >> +		if (map->m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN |
> >> +				    EXT4_MAP_DELAYED))
> >> +			return 0;
> >> +	}
> >> +
> >> +	handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
> >> +			ext4_chunk_trans_blocks(inode, map->m_len));
> >> +	if (IS_ERR(handle))
> >> +		return PTR_ERR(handle);
> >> +
> >> +	ret = ext4_map_blocks(handle, inode, map,
> >> +			      EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT);
> >> +	/*
> >> +	 * Stop handle here following the lock ordering of the folio lock
> >> +	 * and the transaction start.
> >> +	 */
> >> +	ext4_journal_stop(handle);
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +static int ext4_iomap_buffered_do_write_begin(struct inode *inode,
> >> +		loff_t offset, loff_t length, unsigned int flags,
> >> +		struct iomap *iomap, struct iomap *srcmap, bool delalloc)
> >> +{
> >> +	int ret, retries = 0;
> >> +	struct ext4_map_blocks map;
> >> +	ext4_get_blocks_t *get_blocks;
> >> +
> >> +	ret = ext4_emergency_state(inode->i_sb);
> >> +	if (unlikely(ret))
> >> +		return ret;
> >> +
> >> +	/* Inline data and non-extent are not supported. */
> >> +	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
> >> +		return -ERANGE;
> >> +	if (WARN_ON_ONCE(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
> >> +		return -EINVAL;
> >> +	if (WARN_ON_ONCE(!(flags & IOMAP_WRITE)))
> >> +		return -EINVAL;
> >> +
> >> +	if (delalloc)
> >> +		get_blocks = ext4_da_map_blocks;
> >> +	else
> >> +		get_blocks = ext4_iomap_get_blocks;
> >> +retry:
> >> +	ret = ext4_iomap_map_blocks(inode, offset, length, get_blocks, &map);
> >> +	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> >> +		goto retry;
> >> +	if (ret < 0)
> >> +		return ret;
> >> +
> >> +	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> >> +	return 0;
> >> +}
> >> +
> >> +static int ext4_iomap_buffered_write_begin(struct inode *inode,
> >> +		loff_t offset, loff_t length, unsigned int flags,
> >> +		struct iomap *iomap, struct iomap *srcmap)
> >> +{
> >> +	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> >> +						  iomap, srcmap, false);
> >> +}
> >> +
> >> +static int ext4_iomap_buffered_da_write_begin(struct inode *inode,
> >> +		loff_t offset, loff_t length, unsigned int flags,
> >> +		struct iomap *iomap, struct iomap *srcmap)
> >> +{
> >> +	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> >> +						  iomap, srcmap, true);
> >> +}
> >> +
> >> +/*
> >> + * On write failure, drop the stale delayed allocation range and release
> >> + * its reserved space for both start and end blocks. Otherwise, we may
> >> + * leave a range of delayed extents covered by a clean folio, which can
> >> + * result in inaccurate space reservation accounting.
> >> + */
> >> +static void ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset,
> >> +				     loff_t length, struct iomap *iomap)
> >> +{
> >> +	down_write(&EXT4_I(inode)->i_data_sem);
> >> +	ext4_es_remove_extent(inode, offset >> inode->i_blkbits,
> >> +			DIV_ROUND_UP_ULL(length, EXT4_BLOCK_SIZE(inode->i_sb)));
> >> +	up_write(&EXT4_I(inode)->i_data_sem);
> >> +}
> >> +
> >> +static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
> >> +					    loff_t length, ssize_t written,
> >> +					    unsigned int flags,
> >> +					    struct iomap *iomap)
> >> +{
> >> +	loff_t start_byte, end_byte;
> >> +
> >> +	/* If we didn't reserve the blocks, we're not allowed to punch them. */
> >> +	if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
> >> +		return 0;
> >> +
> >> +	/* Nothing to do if we've written the entire delalloc extent */
> >> +	start_byte = iomap_last_written_block(inode, offset, written);
> >> +	end_byte = round_up(offset + length, i_blocksize(inode));
> >> +	if (start_byte >= end_byte)
> >> +		return 0;
> >> +
> >> +	filemap_invalidate_lock(inode->i_mapping);
> >> +	iomap_write_delalloc_release(inode, start_byte, end_byte, flags,
> >> +				     iomap, ext4_iomap_punch_delalloc);
> >> +	filemap_invalidate_unlock(inode->i_mapping);
> >> +	return 0;
> >> +}
> >> +
> >> +/*
> >> + * Since we always allocate unwritten extents, there is no need for
> >> + * iomap_end to clean up allocated blocks on a short write.
> >> + */
> >> +const struct iomap_ops ext4_iomap_buffered_write_ops = {
> >> +	.iomap_begin = ext4_iomap_buffered_write_begin,
> >> +};
> >> +
> >> +const struct iomap_ops ext4_iomap_buffered_da_write_ops = {
> >> +	.iomap_begin = ext4_iomap_buffered_da_write_begin,
> >> +	.iomap_end = ext4_iomap_buffered_da_write_end,
> >> +};
> >> +
> >>  const struct iomap_ops ext4_iomap_buffered_read_ops = {
> >>  	.iomap_begin = ext4_iomap_buffered_read_begin,
> >>  };
> >> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> >> index 6a77db4d3124..9bc294b769db 100644
> >> --- a/fs/ext4/super.c
> >> +++ b/fs/ext4/super.c
> >> @@ -104,9 +104,13 @@ static const struct fs_parameter_spec ext4_param_specs[];
> >>   *   -> page lock -> i_data_sem (rw)
> >>   *
> >>   * buffered write path:
> >> - * sb_start_write -> i_mutex -> mmap_lock
> >> - * sb_start_write -> i_mutex -> transaction start -> page lock ->
> >> - *   i_data_sem (rw)
> >> + * sb_start_write -> i_rwsem (w) -> mmap_lock
> >> + * - buffer_head path:
> >> + *   sb_start_write -> i_rwsem (w) -> transaction start -> folio lock ->
> >> + *     i_data_sem (rw)
> >> + * - iomap path:
> >> + *   sb_start_write -> i_rwsem (w) -> transaction start -> i_data_sem (rw)
> >> + *   sb_start_write -> i_rwsem (w) -> folio lock (not under an active handle)
> >>   *
> >>   * truncate:
> >>   * sb_start_write -> i_mutex -> invalidate_lock (w) -> i_mmap_rwsem (w) ->
> >> -- 
> >> 2.52.0
> >>
> 


* Re: [PATCH v2 0/10] fs: Fix missed inode write during fsync
From: Jan Kara @ 2026-06-02  7:22 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Christian Brauner, aivazian.tigran, Ted Tso, linux-ext4,
	OGAWA Hirofumi, Jan Kara
In-Reply-To: <20260525085035.12891-1-jack@suse.cz>

Hello!

On Mon 25-05-26 10:58:06, Jan Kara wrote:
> here is v2 of the patch series which fixes the possibly missing inode write
> during fsync(2) for filesystems using generic metadata bh tracking. The
> inherent problem is that .write_inode methods clear inode dirty bit but they
> only copy inode contents into the buffer cache. Because the buffer carrying the
> inode is shared among multiple inodes, it cannot be tracked by the generic
> metadata bh tracking infrastructure and thus nothing is tracking that the buffer
> needs to be written out to maintain fsync(2) guarantees. Normally, this gets
> taken care of by .write_inode checking for WB_SYNC_ALL writeback and submitting
> & waiting for the buffer in that case. However, if the flush worker ends up writing
> the inode before data integrity writeback, this mechanism is broken.
> 
> This patch series adds a way for filesystems to track the metadata block number
> which contains the inode metadata and then uses this information to write out
> the buffer on fsync.

FWIW I went through Sashiko review comments. A lot of them are hallucinated
but there are actually three good finds:

1) FAT implementation of inode tracking is broken when fsync races with
rename.

2) ext2 & minix inode tracking makes handling of dirsync even more broken
than it already is (current handling is already broken because we don't
flush any directory indirect blocks but my changes also stop flushing the
inode buffer).

3) mmb_sync() flushing of inode buffer is racy for multiple parallel
fsyncs.

So I'll be addressing these. Please don't waste time with the series as is.

								Honza

> 
> Changes since v1:
> * Fixed freeing for ext4 dynamically allocated mmb struct
> * Optimized tracking of block carrying the inode so that we don't flush it
>   unnecessarily on fsync
> * Add forgotten check for reclaimed bh to mmb_sync() to avoid NULL ptr deref
> * Couple other smaller fixups pointed out by Sashiko
> 
> 								Honza
> Previous versions:
> Link: http://lore.kernel.org/r/20260511115725.28441-1-jack@suse.cz # v1
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v4 08/23] ext4: implement buffered write path using iomap
From: Ojaswin Mujoo @ 2026-06-02  7:02 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel, tytso,
	adilger.kernel, libaokun, jack, ritesh.list, hch, yi.zhang,
	yizhang089, yangerkun, yukuai
In-Reply-To: <20260528154059.GA6054@frogsfrogsfrogs>

On Thu, May 28, 2026 at 08:40:59AM -0700, Darrick J. Wong wrote:
> On Tue, May 26, 2026 at 10:40:30PM +0530, Ojaswin Mujoo wrote:
> > On Mon, May 11, 2026 at 03:23:28PM +0800, Zhang Yi wrote:
> > > From: Zhang Yi <yi.zhang@huawei.com>
> > > 
> > > Introduce two new iomap_ops instances for ext4 buffered writes:
> > > 
> > >  - ext4_iomap_buffered_da_write_ops: for delayed allocation mode, using
> > >    ext4_da_map_blocks() to map delalloc extents.
> > >  - ext4_iomap_buffered_write_ops: for non-delayed allocation mode, using
> > >    ext4_iomap_get_blocks() to directly allocate blocks.
> > > 
> > > Also add ext4_iomap_valid() for the iomap infrastructure to check extent
> > > validity.
> > > 
> > > Key changes and considerations:
> > > 
> > >  - Unwritten extents for new blocks (dioread_nolock always on)
> > >    Since data=ordered mode is not used to prevent stale data exposure in
> > >    the non-delayed allocation path, new blocks are always allocated as
> > >    unwritten extents.
> > 
> > Okay makes sense.
> > 
> > > 
> > >  - Short write and write failure handling
> > >    a. Delalloc path: On short write or failure, the stale delalloc range
> > >       must be dropped and its space reservation released. Otherwise, a
> > >       clean folio may cover leftover delalloc extents, causing
> > >       inaccurate space reservation accounting.
> > 
> > Hmm, okay so in the usual buffer head path, seems like during a short
> > write we still zero the new buffers we couldn't write and keep it dirty
> > (folio_zero_new_buffers()). This way they are still written back and
> > the delalloc reservations are used up.
> > 
> > However in iomap we don't mark the range that we couldn't write as dirty
> > so we need to make sure we clear up the stale delalloc mappings. Is this
> > correct?
> 
> Yes, that's true of iomap's pagecache handling.

Thanks for confirming Darrick.
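
As a rough illustration of the range that has to be cleaned up, the
arithmetic in the delalloc iomap_end callback of this patch can be
modelled in userspace. This is a sketch under assumptions: the names
(toy_range, toy_punch_range) are hypothetical, and it assumes that a
failed write (written < 0) starts the range at the block containing
offset, while a short write starts it at the first block boundary after
the written bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Half-open byte range [start, end); empty when start >= end. */
struct toy_range {
	int64_t start;
	int64_t end;
};

static int64_t toy_round_down(int64_t x, int64_t block_size)
{
	return x - (x % block_size);
}

static int64_t toy_round_up(int64_t x, int64_t block_size)
{
	return toy_round_down(x + block_size - 1, block_size);
}

/*
 * Compute the stale delalloc range left behind by a write of `length`
 * bytes at `offset` of which only `written` bytes were copied.
 */
struct toy_range toy_punch_range(int64_t offset, int64_t length,
				 int64_t written, int64_t block_size)
{
	struct toy_range r;

	assert(block_size > 0 && offset >= 0 && length >= 0);
	if (written < 0)	/* assumed: nothing was written at all */
		r.start = toy_round_down(offset, block_size);
	else
		r.start = toy_round_up(offset + written, block_size);
	r.end = toy_round_up(offset + length, block_size);
	return r;
}
```

For a full write the range comes out empty, so nothing is punched; for a
short write only the blocks past the last written block are candidates.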

Regards,
Ojaswin

> 
> --D
> 
> > Regards,
> > Ojaswin
> > 
> > >    b. Non-delalloc path: No cleanup of allocated blocks is needed on
> > >       short write.
> > > 
> > >  - Lock ordering reversal
> > >    The folio lock and transaction start ordering is reversed compared to
> > >    the buffer_head buffered write path. To handle this, the journal
> > >    handle must be stopped in iomap_begin() callbacks. The lock ordering
> > >    documentation in super.c has been updated accordingly.
> > > 
> > > Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> > > ---
> > >  fs/ext4/ext4.h  |   4 ++
> > >  fs/ext4/file.c  |  20 +++++-
> > >  fs/ext4/inode.c | 165 +++++++++++++++++++++++++++++++++++++++++++++++-
> > >  fs/ext4/super.c |  10 ++-
> > >  4 files changed, 192 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > > index 1e27d73d7427..4832e7f7db82 100644
> > > --- a/fs/ext4/ext4.h
> > > +++ b/fs/ext4/ext4.h
> > > @@ -3057,6 +3057,7 @@ int ext4_walk_page_buffers(handle_t *handle,
> > >  int do_journal_get_write_access(handle_t *handle, struct inode *inode,
> > >  				struct buffer_head *bh);
> > >  void ext4_set_inode_mapping_order(struct inode *inode);
> > > +int ext4_nonda_switch(struct super_block *sb);
> > >  #define FALL_BACK_TO_NONDELALLOC 1
> > >  #define CONVERT_INLINE_DATA	 2
> > >  
> > > @@ -3926,6 +3927,9 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
> > >  
> > >  extern const struct iomap_ops ext4_iomap_ops;
> > >  extern const struct iomap_ops ext4_iomap_report_ops;
> > > +extern const struct iomap_ops ext4_iomap_buffered_write_ops;
> > > +extern const struct iomap_ops ext4_iomap_buffered_da_write_ops;
> > > +extern const struct iomap_write_ops ext4_iomap_write_ops;
> > >  
> > >  static inline int ext4_buffer_uptodate(struct buffer_head *bh)
> > >  {
> > > diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> > > index eb1a323962b1..7f9bfbbc4a4e 100644
> > > --- a/fs/ext4/file.c
> > > +++ b/fs/ext4/file.c
> > > @@ -299,6 +299,21 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > >  	return count;
> > >  }
> > >  
> > > +static ssize_t ext4_iomap_buffered_write(struct kiocb *iocb,
> > > +					 struct iov_iter *from)
> > > +{
> > > +	struct inode *inode = file_inode(iocb->ki_filp);
> > > +	const struct iomap_ops *iomap_ops;
> > > +
> > > +	if (test_opt(inode->i_sb, DELALLOC) && !ext4_nonda_switch(inode->i_sb))
> > > +		iomap_ops = &ext4_iomap_buffered_da_write_ops;
> > > +	else
> > > +		iomap_ops = &ext4_iomap_buffered_write_ops;
> > > +
> > > +	return iomap_file_buffered_write(iocb, from, iomap_ops,
> > > +					 &ext4_iomap_write_ops, NULL);
> > > +}
> > > +
> > >  static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> > >  					struct iov_iter *from)
> > >  {
> > > @@ -313,7 +328,10 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> > >  	if (ret <= 0)
> > >  		goto out;
> > >  
> > > -	ret = generic_perform_write(iocb, from);
> > > +	if (ext4_inode_buffered_iomap(inode))
> > > +		ret = ext4_iomap_buffered_write(iocb, from);
> > > +	else
> > > +		ret = generic_perform_write(iocb, from);
> > >  
> > >  out:
> > >  	inode_unlock(inode);
> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > index 39577a6b65b9..1ae7d3f4a1c8 100644
> > > --- a/fs/ext4/inode.c
> > > +++ b/fs/ext4/inode.c
> > > @@ -3097,7 +3097,7 @@ static int ext4_dax_writepages(struct address_space *mapping,
> > >  	return ret;
> > >  }
> > >  
> > > -static int ext4_nonda_switch(struct super_block *sb)
> > > +int ext4_nonda_switch(struct super_block *sb)
> > >  {
> > >  	s64 free_clusters, dirty_clusters;
> > >  	struct ext4_sb_info *sbi = EXT4_SB(sb);
> > > @@ -3467,6 +3467,15 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
> > >  	return inode_state_read_once(inode) & I_DIRTY_DATASYNC;
> > >  }
> > >  
> > > +static bool ext4_iomap_valid(struct inode *inode, const struct iomap *iomap)
> > > +{
> > > +	return iomap->validity_cookie == READ_ONCE(EXT4_I(inode)->i_es_seq);
> > > +}
> > > +
> > > +const struct iomap_write_ops ext4_iomap_write_ops = {
> > > +	.iomap_valid = ext4_iomap_valid,
> > > +};
> > > +
> > >  static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
> > >  			   struct ext4_map_blocks *map, loff_t offset,
> > >  			   loff_t length, unsigned int flags)
> > > @@ -3501,6 +3510,8 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
> > >  	    !ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> > >  		iomap->flags |= IOMAP_F_MERGED;
> > >  
> > > +	iomap->validity_cookie = map->m_seq;
> > > +
> > >  	/*
> > >  	 * Flags passed to ext4_map_blocks() for direct I/O writes can result
> > >  	 * in m_flags having both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits
> > > @@ -3908,8 +3919,12 @@ const struct iomap_ops ext4_iomap_report_ops = {
> > >  	.iomap_begin = ext4_iomap_begin_report,
> > >  };
> > >  
> > > +/* Map blocks */
> > > +typedef int (ext4_get_blocks_t)(struct inode *, struct ext4_map_blocks *);
> > > +
> > >  static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
> > > -		loff_t length, struct ext4_map_blocks *map)
> > > +		loff_t length, ext4_get_blocks_t get_blocks,
> > > +		struct ext4_map_blocks *map)
> > >  {
> > >  	u8 blkbits = inode->i_blkbits;
> > >  
> > > @@ -3921,6 +3936,9 @@ static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
> > >  	map->m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
> > >  			   EXT4_MAX_LOGICAL_BLOCK) - map->m_lblk + 1;
> > >  
> > > +	if (get_blocks)
> > > +		return get_blocks(inode, map);
> > > +
> > >  	return ext4_map_blocks(NULL, inode, map, 0);
> > >  }
> > >  
> > > @@ -3938,7 +3956,7 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
> > >  	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
> > >  		return -ERANGE;
> > >  
> > > -	ret = ext4_iomap_map_blocks(inode, offset, length, &map);
> > > +	ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
> > >  	if (ret < 0)
> > >  		return ret;
> > >  
> > > @@ -3946,6 +3964,147 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
> > >  	return 0;
> > >  }
> > >  
> > > +static int ext4_iomap_get_blocks(struct inode *inode,
> > > +				 struct ext4_map_blocks *map)
> > > +{
> > > +	loff_t i_size = i_size_read(inode);
> > > +	handle_t *handle;
> > > +	int ret;
> > > +
> > > +	/*
> > > +	 * Check if the blocks have already been allocated, this could
> > > +	 * avoid initiating a new journal transaction and return the
> > > +	 * mapping information directly.
> > > +	 */
> > > +	if ((map->m_lblk + map->m_len) <=
> > > +	    round_up(i_size, i_blocksize(inode)) >> inode->i_blkbits) {
> > > +		ret = ext4_map_blocks(NULL, inode, map, 0);
> > > +		if (ret < 0)
> > > +			return ret;
> > > +		if (map->m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN |
> > > +				    EXT4_MAP_DELAYED))
> > > +			return 0;
> > > +	}
> > > +
> > > +	handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
> > > +			ext4_chunk_trans_blocks(inode, map->m_len));
> > > +	if (IS_ERR(handle))
> > > +		return PTR_ERR(handle);
> > > +
> > > +	ret = ext4_map_blocks(handle, inode, map,
> > > +			      EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT);
> > > +	/*
> > > +	 * Stop handle here following the lock ordering of the folio lock
> > > +	 * and the transaction start.
> > > +	 */
> > > +	ext4_journal_stop(handle);
> > > +
> > > +	return ret;
> > > +}
> > > +
> > > +static int ext4_iomap_buffered_do_write_begin(struct inode *inode,
> > > +		loff_t offset, loff_t length, unsigned int flags,
> > > +		struct iomap *iomap, struct iomap *srcmap, bool delalloc)
> > > +{
> > > +	int ret, retries = 0;
> > > +	struct ext4_map_blocks map;
> > > +	ext4_get_blocks_t *get_blocks;
> > > +
> > > +	ret = ext4_emergency_state(inode->i_sb);
> > > +	if (unlikely(ret))
> > > +		return ret;
> > > +
> > > +	/* Inline data and non-extent are not supported. */
> > > +	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
> > > +		return -ERANGE;
> > > +	if (WARN_ON_ONCE(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
> > > +		return -EINVAL;
> > > +	if (WARN_ON_ONCE(!(flags & IOMAP_WRITE)))
> > > +		return -EINVAL;
> > > +
> > > +	if (delalloc)
> > > +		get_blocks = ext4_da_map_blocks;
> > > +	else
> > > +		get_blocks = ext4_iomap_get_blocks;
> > > +retry:
> > > +	ret = ext4_iomap_map_blocks(inode, offset, length, get_blocks, &map);
> > > +	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> > > +		goto retry;
> > > +	if (ret < 0)
> > > +		return ret;
> > > +
> > > +	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> > > +	return 0;
> > > +}
> > > +
> > > +static int ext4_iomap_buffered_write_begin(struct inode *inode,
> > > +		loff_t offset, loff_t length, unsigned int flags,
> > > +		struct iomap *iomap, struct iomap *srcmap)
> > > +{
> > > +	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> > > +						  iomap, srcmap, false);
> > > +}
> > > +
> > > +static int ext4_iomap_buffered_da_write_begin(struct inode *inode,
> > > +		loff_t offset, loff_t length, unsigned int flags,
> > > +		struct iomap *iomap, struct iomap *srcmap)
> > > +{
> > > +	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> > > +						  iomap, srcmap, true);
> > > +}
> > > +
> > > +/*
> > > + * On write failure, drop the stale delayed allocation range and release
> > > + * its reserved space for both start and end blocks. Otherwise, we may
> > > + * leave a range of delayed extents covered by a clean folio, which can
> > > + * result in inaccurate space reservation accounting.
> > > + */
> > > +static void ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset,
> > > +				     loff_t length, struct iomap *iomap)
> > > +{
> > > +	down_write(&EXT4_I(inode)->i_data_sem);
> > > +	ext4_es_remove_extent(inode, offset >> inode->i_blkbits,
> > > +			DIV_ROUND_UP_ULL(length, EXT4_BLOCK_SIZE(inode->i_sb)));
> > > +	up_write(&EXT4_I(inode)->i_data_sem);
> > > +}
> > > +
> > > +static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
> > > +					    loff_t length, ssize_t written,
> > > +					    unsigned int flags,
> > > +					    struct iomap *iomap)
> > > +{
> > > +	loff_t start_byte, end_byte;
> > > +
> > > +	/* If we didn't reserve the blocks, we're not allowed to punch them. */
> > > +	if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
> > > +		return 0;
> > > +
> > > +	/* Nothing to do if we've written the entire delalloc extent */
> > > +	start_byte = iomap_last_written_block(inode, offset, written);
> > > +	end_byte = round_up(offset + length, i_blocksize(inode));
> > > +	if (start_byte >= end_byte)
> > > +		return 0;
> > > +
> > > +	filemap_invalidate_lock(inode->i_mapping);
> > > +	iomap_write_delalloc_release(inode, start_byte, end_byte, flags,
> > > +				     iomap, ext4_iomap_punch_delalloc);
> > > +	filemap_invalidate_unlock(inode->i_mapping);
> > > +	return 0;
> > > +}
> > > +
> > > +/*
> > > + * Since we always allocate unwritten extents, there is no need for
> > > + * iomap_end to clean up allocated blocks on a short write.
> > > + */
> > > +const struct iomap_ops ext4_iomap_buffered_write_ops = {
> > > +	.iomap_begin = ext4_iomap_buffered_write_begin,
> > > +};
> > > +
> > > +const struct iomap_ops ext4_iomap_buffered_da_write_ops = {
> > > +	.iomap_begin = ext4_iomap_buffered_da_write_begin,
> > > +	.iomap_end = ext4_iomap_buffered_da_write_end,
> > > +};
> > > +
> > >  const struct iomap_ops ext4_iomap_buffered_read_ops = {
> > >  	.iomap_begin = ext4_iomap_buffered_read_begin,
> > >  };
> > > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > > index 6a77db4d3124..9bc294b769db 100644
> > > --- a/fs/ext4/super.c
> > > +++ b/fs/ext4/super.c
> > > @@ -104,9 +104,13 @@ static const struct fs_parameter_spec ext4_param_specs[];
> > >   *   -> page lock -> i_data_sem (rw)
> > >   *
> > >   * buffered write path:
> > > - * sb_start_write -> i_mutex -> mmap_lock
> > > - * sb_start_write -> i_mutex -> transaction start -> page lock ->
> > > - *   i_data_sem (rw)
> > > + * sb_start_write -> i_rwsem (w) -> mmap_lock
> > > + * - buffer_head path:
> > > + *   sb_start_write -> i_rwsem (w) -> transaction start -> folio lock ->
> > > + *     i_data_sem (rw)
> > > + * - iomap path:
> > > + *   sb_start_write -> i_rwsem (w) -> transaction start -> i_data_sem (rw)
> > > + *   sb_start_write -> i_rwsem (w) -> folio lock (not under an active handle)
> > >   *
> > >   * truncate:
> > >   * sb_start_write -> i_mutex -> invalidate_lock (w) -> i_mmap_rwsem (w) ->
> > > -- 
> > > 2.52.0
> > > 
> > 
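
The validity-cookie scheme in the quoted hunk (ext4_set_iomap() storing
map->m_seq, ext4_iomap_valid() comparing against i_es_seq) can be
modelled in userspace roughly as below. This is only a sketch under
stated assumptions: every toy_* name is hypothetical, and the real
sequence counter is maintained by the ext4 extent status code, not by a
plain increment.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-inode sequence (i_es_seq). */
struct toy_inode {
	uint64_t es_seq;	/* bumped whenever the extent layout changes */
};

/* Hypothetical stand-in for struct iomap; only the cookie matters here. */
struct toy_iomap {
	uint64_t validity_cookie;
};

/* Sample the sequence when the mapping is built (cf. ext4_set_iomap()). */
static void toy_map_begin(const struct toy_inode *inode, struct toy_iomap *map)
{
	map->validity_cookie = inode->es_seq;
}

/* Assumption: any layout change advances the sequence by one. */
static void toy_extent_changed(struct toy_inode *inode)
{
	inode->es_seq++;
}

/* Recheck before using a cached mapping (cf. ext4_iomap_valid()). */
static bool toy_iomap_valid(const struct toy_inode *inode,
			    const struct toy_iomap *map)
{
	return map->validity_cookie == inode->es_seq;
}
```

A mapping taken before toy_extent_changed() no longer validates
afterwards, which is the signal the caller would use to remap the range.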

^ permalink raw reply

* Re: [PATCH v4 18/23] ext4: wait for ordered I/O in the iomap buffered I/O path
From: Ojaswin Mujoo @ 2026-06-02  5:56 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai
In-Reply-To: <b22baeed-3a20-47a4-8e7c-22f61a6eb49b@huaweicloud.com>

On Sat, May 30, 2026 at 05:32:54PM +0800, Zhang Yi wrote:
> On 5/28/2026 9:34 PM, Ojaswin Mujoo wrote:
> > On Wed, May 27, 2026 at 09:28:28PM +0530, Ojaswin Mujoo wrote:
> >> On Mon, May 11, 2026 at 03:23:38PM +0800, Zhang Yi wrote:
> >>> From: Zhang Yi <yi.zhang@huawei.com>
> >>>
> >>> For append writes, wait for ordered I/O to complete before updating
> >>> i_disksize. This ensures that zeroed data is flushed to disk before the
> >>> metadata update, preventing stale data from being exposed during
> >>> unaligned post-EOF append writes.
> >>>
> >>> Suggested-by: Jan Kara <jack@suse.cz>
> >>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >>> ---
> >>>  fs/ext4/ext4.h    | 11 +++++++
> >>>  fs/ext4/inode.c   | 80 ++++++++++++++++++++++++++++++++++++++++++-----
> >>>  fs/ext4/page-io.c | 60 +++++++++++++++++++++++++++++++++++
> >>>  fs/ext4/super.c   | 23 ++++++++++----
> >>>  4 files changed, 161 insertions(+), 13 deletions(-)
> >>>
> [...]
> >>> @@ -4746,8 +4771,10 @@ static int ext4_iomap_submit_zero_block(struct inode *inode,
> >>>  					loff_t from, loff_t end)
> >>>  {
> >>>  	struct address_space *mapping = inode->i_mapping;
> >>> +	struct ext4_inode_info *ei = EXT4_I(inode);
> >>>  	struct folio *folio;
> >>>  	bool do_submit = false;
> >>> +	int ret;
> >>>  
> >>>  	folio = filemap_lock_folio(mapping, from >> PAGE_SHIFT);
> >>>  	if (IS_ERR(folio))
> >>> @@ -4757,14 +4784,50 @@ static int ext4_iomap_submit_zero_block(struct inode *inode,
> >>>  	folio_wait_writeback(folio);
> >>>  	WARN_ON_ONCE(folio_test_writeback(folio));
> >>>  
> >>> -	if (likely(folio_test_dirty(folio)))
> >>> +	/*
> >>> +	 * Mark the ordered range. It will be cleared upon I/O completion
> >>> +	 * in ext4_iomap_end_bio(). Any operation that extends i_disksize
> >>> +	 * (including append write end io past the zeroed boundary,
> >>> +	 * truncate up and append fallocate) must wait for this I/O to
> >>> +	 * complete before updating i_disksize.
> >>> +	 *
> >>> +	 * When multiple overlapping unaligned EOF writes are in flight, we
> >>> +	 * only need to track and wait for the first one. Subsequent writes
> >>> +	 * will zero the gap in memory and ensure that the zeroed data is
> >>> +	 * written out along with the valid data in the same block before
> >>> +	 * i_disksize is updated.
> >>> +	 */
> >>> +	if (likely(folio_test_dirty(folio) &&
> >>> +		   READ_ONCE(ei->i_ordered_len) == 0)) {
> >>> +		WRITE_ONCE(ei->i_ordered_lblk,
> >>> +			   from >> inode->i_blkbits);
> >>> +		/*
> >>> +		 * Pairs with smp_rmb() in ext4_iomap_writeback_submit()
> >>> +		 * and ext4_iomap_wb_ordered_wait(). Ensure the updated
> >>> +		 * i_ordered_lblk is visible when i_ordered_len becomes
> >>> +		 * non-zero.
> >>> +		 */
> >>> +		smp_store_release(&ei->i_ordered_len, 1);
> >>>  		do_submit = true;
> >>> +	}
> >>>  	folio_unlock(folio);
> >>>  	folio_put(folio);
> >>>  
> >>>  	/* Submit zeroed block. */
> >>> -	if (do_submit)
> >>> -		return filemap_fdatawrite_range(mapping, from, end - 1);
> >>> +	if (do_submit) {
> >>> +		ret = filemap_fdatawrite_range(mapping, from, end - 1);
> >>> +		if (ret) {
> >>> +			/*
> >>> +			 * Pairs with wait_event() in
> >>> +			 * ext4_iomap_wb_ordered_wait(). Ensure
> >>> +			 * i_ordered_len = 0 is visible before waking up
> >>> +			 * waiters.
> >>> +			 */
> >>> +			smp_store_release(&ei->i_ordered_len, 0);
> >>> +			wake_up_all(&ei->i_ordered_wq);
> >>> +			return ret;
> > 
> > Okay so even if the ordered IO fails we still let the i_disksize updates
> > go ahead? 
> 
> Yes when data_err=ignore, no when data_err=abort.
> 
> > I think this is a deviation from the current behavior where we
> > abort the journal. If this is acceptable we should at least add a comment
> > on why it's okay.
> > 
> 
> I think this behavior is consistent with the current data=ordered mode.
> In the data_err=ignore mode, if an I/O write fails, ext4_end_io_end()
> does not abort the journal, so i_disksize is still updated normally.
> Conversely, in the data_err=abort mode, the journal is aborted, and
> since i_disksize is not updated, it cannot be updated afterwards. Am I
> missing something?

So I was thinking through the various scenarios where
filemap_fdatawrite_range() might return an error, and yes, it seems that
we do end up aborting the journal for almost all paths, and ENOMEM is
already taken care of. So I think it should be okay.
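
For reference, the publish/clear ordering that the quoted hunk relies on
(write i_ordered_lblk, then release-store i_ordered_len; clear
i_ordered_len on failure) can be sketched with C11 atomics. This is a
userspace model only: the toy_* names are hypothetical, and an acquire
load is substituted for the kernel's READ_ONCE() plus smp_rmb() pairing.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the i_ordered_lblk / i_ordered_len pair. */
struct toy_ordered {
	_Atomic uint32_t lblk;
	_Atomic uint32_t len;
};

static void toy_ordered_init(struct toy_ordered *o)
{
	atomic_init(&o->lblk, 0);
	atomic_init(&o->len, 0);
}

/* Writer: store the block first, then publish it via a release store. */
static void toy_ordered_publish(struct toy_ordered *o, uint32_t lblk)
{
	atomic_store_explicit(&o->lblk, lblk, memory_order_relaxed);
	atomic_store_explicit(&o->len, 1, memory_order_release);
}

/* Reader: a non-zero len (acquire) guarantees lblk is the published one. */
static bool toy_ordered_read(struct toy_ordered *o, uint32_t *lblk)
{
	if (atomic_load_explicit(&o->len, memory_order_acquire) == 0)
		return false;
	*lblk = atomic_load_explicit(&o->lblk, memory_order_relaxed);
	return true;
}

/* Completion or error path: drop the marker; waiters would be woken next. */
static void toy_ordered_clear(struct toy_ordered *o)
{
	atomic_store_explicit(&o->len, 0, memory_order_release);
}
```

The sketch leaves out the wait queue; in the patch the clear is followed
by wake_up_all() so that anyone blocked on the ordered range rechecks
i_ordered_len.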
> 
> >>> +		}
> >>> +	}
> >>>  	return 0;
> >>>  }
> >>>  
> >>> @@ -4827,10 +4890,13 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
> >>>  		 * data=ordered mode. We submit zeroed range directly here.
> >>>  		 * Do not wait for I/O completion for performance.
> >>>  		 *
> >>> -		 * TODO: Any operation that extends i_disksize (including
> >>> -		 * append write end io past the zeroed boundary, truncate up,
> >>> -		 * and append fallocate) must wait for the relevant I/O to
> >>> -		 * complete before updating i_disksize.
> >>> +		 * The end_io handler ext4_iomap_wb_ordered_wait() will wait
> >>> +		 * for I/O completion before updating i_disksize if the write
> >>> +		 * extends beyond the zeroed boundary.
> >>> +		 *
> >>> +		 * TODO: Any other operation that extends i_disksize
> >>> +		 * (including truncate up and append fallocate) must wait for
> >>> +		 * the relevant I/O to complete before updating i_disksize.
> >>>  		 */
> >>>  		} else if (ext4_inode_buffered_iomap(inode)) {
> >>>  			err = ext4_iomap_submit_zero_block(inode, from, end);
> >>> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> >>> index 3050c887329f..ad05ebb49bf6 100644
> >>> --- a/fs/ext4/page-io.c
> >>> +++ b/fs/ext4/page-io.c
> >>> @@ -613,6 +613,46 @@ int ext4_bio_write_folio(struct ext4_io_submit *io, struct folio *folio,
> >>>  	return 0;
> >>>  }
> >>>  
> >>> +/*
> >>> + * If the old disk size is not block size aligned and the current
> >>> + * writeback range is entirely beyond the old EOF block, we should
> >>> + * wait for the zeroed data written in ext4_block_zero_eof() to be
> >>> + * written out, otherwise, it may expose stale data in that block.
> >>> + */
> >>> +static void ext4_iomap_wb_ordered_wait(struct inode *inode,
> >>> +				       loff_t pos, loff_t end)
> >>> +{
> >>> +	struct ext4_inode_info *ei = EXT4_I(inode);
> >>> +	unsigned int blocksize = i_blocksize(inode);
> >>> +	loff_t disksize = READ_ONCE(ei->i_disksize);
> >>> +	ext4_lblk_t order_lblk, order_len;
> >>> +
> >>> +	/*
> >>> +	 * Waiting for ordered I/O is unnecessary when:
> >>> +	 * - The on-disk size is block-aligned (no stale data exists).
> >>> +	 * - The write start is within the block of the old EOF
> >>> +	 *   (overwriting, or appending to a block that already contains
> >>> +	 *   valid data).
> >>> +	 */
> >>> +	if (!(disksize & (blocksize - 1)) ||
> >>> +	    pos < round_up(disksize, blocksize))
> >>> +		return;
> > 
> > Okay these checks are pretty confusing. I was initially thinking that
> > i_disksize's block would always be equal to i_ordered_lblk but seems
> > like that is not true because ext4_block_zero_eof() uses from=i_size.
> 
> Yeah, this is the key point that I was a bit confused about as
> well.
> 
> > 
> > So we could have a sequence where
> > 
> > 1. truncate 4k (i_disksize = i_size = 4k)
> > 2. write 8k,10k (i_disksize = 4k i_size = 10k, i_ordered_len = 0 (old i_size is block aligned))
> > 3. write 16k,18k (i_disksize = 4k i_size = 10k, i_ordered_len = 1, lblk=4)
>                                              18k                     lblk=2, (10k >> 12)
                                               ^^^ Yes, correct, my bad.
> > 
> > Here we issue ordered IO even though it's probably not needed. Now if
> > write 3 finishes first we see disksize as 4k so we don't wait for
> > ordered write. Which seems okay since we don't risk any stale data
> > exposure. However, this flow is pretty confusing.
> 
> Indeed!
> 
> > 
> > Can't we somehow avoid having to issue/set ordered len/lblk in case it
> > is not really needed, like only issue it if i_disksize (and not i_size) 
> > is unaligned. That can simplify some of our check and avoid extra IO
> > overhead.
> > 
> 
> I was also planning to explore optimizations on this point next.
> However, since the original logic in buffer_head already works this way,
> keeping the same logic in the iomap path will not introduce any
> additional side effects. To avoid unnecessary waiting, I simply added
> the disksize alignment check in ext4_iomap_wb_ordered_wait().
> 
> Therefore, I do not plan to implement this optimization in this series.
> I can open a separate series later to address this optimization — perhaps
> by checking i_disksize in ext4_block_zero_eof() before issuing or adding
> ordered I/O, and the buffer_head path might also benefit from optimization.
> Meanwhile, to avoid confusion, I can add a TODO comment in this patch.
> 
> What do you think?

Sure Zhang, such an optimization would make the code simpler but I'm
okay to do this in a different series.
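
To make the sequence above concrete, here is a small userspace sketch of
the early-return test quoted from ext4_iomap_wb_ordered_wait(), plus the
(i_size >> blkbits) computation of the ordered block. The toy_* helpers
are hypothetical and assume a power-of-two block size; they model the
arithmetic only, not the waiting itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Round x up to a multiple of a power-of-two block size. */
static uint64_t toy_round_up(uint64_t x, uint64_t blocksize)
{
	return (x + blocksize - 1) & ~(blocksize - 1);
}

/*
 * Mirror of the quoted check: no wait is needed when the on-disk size
 * is block aligned, or when the writeback range starts inside the old
 * EOF block.
 */
static bool toy_can_skip_ordered_wait(uint64_t disksize, uint64_t pos,
				      uint64_t blocksize)
{
	if (!(disksize & (blocksize - 1)))
		return true;
	return pos < toy_round_up(disksize, blocksize);
}

/* Block that gets marked as ordered when zeroing starts at i_size. */
static uint32_t toy_ordered_lblk(uint64_t i_size, unsigned int blkbits)
{
	return (uint32_t)(i_size >> blkbits);
}
```

With 4k blocks, toy_ordered_lblk(10k, 12) is 2, matching the corrected
value above, and a writeback at 16k with disksize still at 4k skips the
wait. If disksize had instead already advanced to 10k, the same
writeback at 16k would not take the early return.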

Regards,
ojaswin

> 
> Cheers,
> Yi.
> 

^ permalink raw reply

* Re: [PATCH v4 18/23] ext4: wait for ordered I/O in the iomap buffered I/O path
From: Ojaswin Mujoo @ 2026-06-02  5:35 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel, tytso,
	adilger.kernel, libaokun, jack, ritesh.list, djwong, hch,
	yangerkun, yukuai
In-Reply-To: <ad963861-4ae3-4753-9415-793bbac00e06@gmail.com>

On Tue, Jun 02, 2026 at 11:22:12AM +0800, Zhang Yi wrote:
> On 6/2/2026 2:33 AM, Ojaswin Mujoo wrote:
> > On Sat, May 30, 2026 at 04:24:24PM +0800, Zhang Yi wrote:
> > > On 5/30/2026 3:22 PM, Zhang Yi wrote:
> > > > Hi, Ojaswin!
> > > > 
> > > > On 5/27/2026 11:58 PM, Ojaswin Mujoo wrote:
> > > > > On Mon, May 11, 2026 at 03:23:38PM +0800, Zhang Yi wrote:
> > > > > > From: Zhang Yi <yi.zhang@huawei.com>
> > > > > > 
> > > > > > For append writes, wait for ordered I/O to complete before updating
> > > > > > i_disksize. This ensures that zeroed data is flushed to disk before the
> > > > > > metadata update, preventing stale data from being exposed during
> > > > > > unaligned post-EOF append writes.
> > > > > > 
> > > > > > Suggested-by: Jan Kara <jack@suse.cz>
> > > > > > Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> > > > > > ---
> > > > > >   fs/ext4/ext4.h    | 11 +++++++
> > > > > >   fs/ext4/inode.c   | 80 ++++++++++++++++++++++++++++++++++++++++++-----
> > > > > >   fs/ext4/page-io.c | 60 +++++++++++++++++++++++++++++++++++
> > > > > >   fs/ext4/super.c   | 23 ++++++++++----
> > > > > >   4 files changed, 161 insertions(+), 13 deletions(-)
> > > > > > 
> > > > > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > > > > > index 078feda47e36..9ce2128eea3e 100644
> > > > > > --- a/fs/ext4/ext4.h
> > > > > > +++ b/fs/ext4/ext4.h
> > > > > > @@ -1195,6 +1195,15 @@ struct ext4_inode_info {
> > > > > >   #ifdef CONFIG_FS_ENCRYPTION
> > > > > >   	struct fscrypt_inode_info *i_crypt_info;
> > > > > >   #endif
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Track ordered zeroed data during post-EOF append writes, fallocate,
> > > > > > +	 * and truncate-up operations. These parameters are used only in the
> > > > > > +	 * iomap buffered I/O path.
> > > > > > +	 */
> > > > > > +	ext4_lblk_t i_ordered_lblk;
> > > > > > +	ext4_lblk_t i_ordered_len;
> > > > > > +	wait_queue_head_t i_ordered_wq;
> > > > > >   };
> > > > > >   /*
> > > > > > @@ -3858,6 +3867,8 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
> > > > > >   			     __u64 len, __u64 *moved_len);
> > > > > >   /* page-io.c */
> > > > > > +#define EXT4_IOMAP_IOEND_ORDER_IO	1UL	/* This I/O is an ordered one */
> > > > > > +
> > > > > >   extern int __init ext4_init_pageio(void);
> > > > > >   extern void ext4_exit_pageio(void);
> > > > > >   extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
> > > > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > > > > index e013aeb03d7b..11fb369efeb1 100644
> > > > > > --- a/fs/ext4/inode.c
> > > > > > +++ b/fs/ext4/inode.c
> > > > > > @@ -4345,6 +4345,7 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
> > > > > >   {
> > > > > >   	struct iomap_ioend *ioend = wpc->wb_ctx;
> > > > > >   	struct ext4_inode_info *ei = EXT4_I(ioend->io_inode);
> > > > > > +	ext4_lblk_t start, end, order_lblk, order_len;
> > > > > >   	/*
> > > > > >   	 * After I/O completion, a worker needs to be scheduled when:
> > > > > > @@ -4357,6 +4358,30 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
> > > > > >   	    test_opt(ioend->io_inode->i_sb, DATA_ERR_ABORT))
> > > > > >   		ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
> > > > > > +	/*
> > > > > > +	 * Mark the I/O as ordered. Ordered I/O requires separate endio
> > > > > > +	 * handling and must not be merged with regular I/O operations.
> > > > > > +	 */
> > > > > > +	order_len = READ_ONCE(ei->i_ordered_len);
> > > > > > +	if (order_len) {
> > > > > > +		/*
> > > > > > +		 * Pair with smp_store_release() in ext4_block_zero_eof().
> > > > > > +		 * Ensure we see the updated i_ordered_lblk that was written
> > > > > > +		 * before the release store to i_ordered_len.
> > > > > > +		 */
> > > > > > +		smp_rmb();
> > > > > > +		order_lblk = READ_ONCE(ei->i_ordered_lblk);
> > > > > > +		start = ioend->io_offset >> ioend->io_inode->i_blkbits;
> > > > > > +		end = EXT4_B_TO_LBLK(ioend->io_inode,
> > > > > > +				     ioend->io_offset + ioend->io_size);
> > > > > > +
> > > > > > +		if (start <= order_lblk && end >= order_lblk + order_len) {
> > > > > 
> > > > > Hi Zhang,
> > > > > 
> > > > > I guess this check is enough cause ordered_lblk and ordered_len will
> > > > > always be  contained in a single block.
> > > > 
> > > > Yeah.
> > > > 
> > > > > 
> > > > > > +			ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
> > > > > > +			ioend->io_private = (void *)EXT4_IOMAP_IOEND_ORDER_IO;
> > > > > > +			ioend->io_flags |= IOMAP_IOEND_BOUNDARY;
> > > > > 
> > > > > FWIU, we are wanting the ordered IO to not be merged and submitted asap
> > > > > since we want to wake up the waiters. Is there any other reason?
> > > > 
> > > > My original intention was to prevent the loss of the
> > > > EXT4_IOMAP_IOEND_ORDER_IO flag during worker processing triggered by I/O
> > > > completion, which could be caused by merging an ordered ioend with a
> > > > normal ioend.  In patch 19, we need to determine the flag to update
> > > > i_disksize to the correct position.
> > 
> > Ahh okay, we don't want the flag to be lost.
> > 
> > > > 
> > > > > 
> > > > > Adding the boundary in ->writeback_submit() only affects
> > > > > iomap_ioend_can_merge() which happens after we have woken up the waiters
> > > > > and deferred the IO to the wq. We ideally want it affect
> > > > > iomap_can_add_to_ioend() ie we need to add IOMAP_F_BOUNDARY in
> > > > > ->writeback_range().
> > > > 
> > > > IIUC, merging into the same ioend during the submission stage doesn't
> > > > seem to cause any problems.
> > 
> > Got it since the flag is set later. I was thinking we want to quickly
> > issue the ordered IO to wake up the waiters and not waste time trying to
> > merge it and hence we wanted to use that flag.
> > 
> > > > 
> > > > > 
> > > > > Secondly, I don't think boundary is the right flag here. It ensures
> > > > > that everything before the ordered iomap gets submitted and the ordered
> > > > > iomap starts a new ioend. This can still keep getting merged with the
> > > > > > newer ioends until we decide to submit the IO, which can delay waking
> > > > > up the waiters. If we really want the "no merge" behavior, we'll have to
> > > > > do something like [1] (Check the 2 NOMERGE flag patches).
> > > > 
> > > > Yeah, IOMAP_IOEND_BOUNDARY appears to be just a one-way barrier and
> > > > still cannot prevent merging. I missed this, thank you for pointing this
> > > > out. However, I think perhaps we should change iomap_ioend_can_merge()
> > > > to check the iomap_ioend->io_private. Something like:
> > > > 
> > > > 	if (ioend->io_private || next->io_private)
> > >              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >              ioend->io_private != next->io_private
> > 
> > > I guess if the purpose is to just not lose the flag, then boundary seems
> > > to work because we only lose the flag if the ordered ioend backward
> > > merges into a previous one. The flag is retained if we forward merge, which
> > > boundary seems to take care of.
> > 
> Yes, IOMAP_IOEND_BOUNDARY does indeed work currently as it prevents flag
> loss. However, from the perspective of the iomap infrastructure, I
> believe it is still necessary to add the ioend->io_private !=
> next->io_private check. Because ioends with different io_private values
> should not be merged, as this carries the risk of potentially losing
> io_private or even memory leaks. With this check in iomap, we would no
> longer need IOMAP_IOEND_BOUNDARY.

I agree that even outside this patchset it seems like a sane thing to
do.
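
As a rough illustration of the proposed rule, a merge predicate that
refuses to combine ioends with different io_private values could look
like the userspace sketch below. It is a toy model: the struct and
function names are hypothetical, and the real iomap_ioend_can_merge()
checks more conditions than the simple contiguity test assumed here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker, modelled on EXT4_IOMAP_IOEND_ORDER_IO. */
#define TOY_IOEND_ORDER_IO	((void *)1UL)

/* Hypothetical, trimmed-down stand-in for struct iomap_ioend. */
struct toy_ioend {
	uint64_t offset;
	uint64_t size;
	void *private_data;	/* plays the role of io_private */
};

static bool toy_ioend_can_merge(const struct toy_ioend *ioend,
				const struct toy_ioend *next)
{
	/* Assumption: only byte-contiguous ioends are merge candidates. */
	if (ioend->offset + ioend->size != next->offset)
		return false;
	/* Proposed rule: never merge across different private data. */
	if (ioend->private_data != next->private_data)
		return false;
	return true;
}
```

Under this rule an ordered ioend can merge neither forward nor backward
with a regular one, so the ordered marker cannot be lost in either
direction.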
> 
> > However, if we want to avoid merges so we can quickly issue IO and wake
> > up the waiters then the above change looks good. Also, if this is the
> > reason we'd also want to have this during submission stage so the flag
> > setting will probably have to move to ->writeback_range()
> 
> Yes. Issuing ordered I/O as soon as possible is beneficial as it reduces
> the latency of sync file range. Suppose when we are syncing data beyond
> the ordered range, the background writeback process has already started
> committing and bundled the ordered range into a large ioend (up to
> IOEND_BATCH_SIZE folios), then this sync operation will indeed
> experience significant latency. However, for other non-sync scenarios,
> there should be little benefit.
 
Yes, that's true.

> 
> But I'm not sure if this is strictly necessary, because in the existing
> implementation, issuing ordered I/O via data=ordered mode works the same
> way — it also doesn't issue ordered I/O as soon as possible, and still
> has to wait when encountering concurrent background writeback. So I
> think we can keep the current implementation for now and see user
> feedback to decide whether further optimization is needed.

I agree!

Thanks,
ojaswin
> 
> Cheers,
> Yi.

^ permalink raw reply

* Re: [PATCH v4 17/23] ext4: submit zeroed post-EOF data immediately in the iomap buffered I/O path
From: Zhang Yi @ 2026-06-02  3:36 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel, tytso,
	adilger.kernel, libaokun, jack, ritesh.list, djwong, hch,
	yi.zhang, yangerkun, yukuai
In-Reply-To: <ah3MR1gYMm3nKe5f@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>

On 6/2/2026 2:15 AM, Ojaswin Mujoo wrote:
> On Mon, Jun 01, 2026 at 08:22:09PM +0800, Zhang Yi wrote:
>> On 6/1/2026 5:08 PM, Ojaswin Mujoo wrote:
>>> On Sat, May 30, 2026 at 10:53:24AM +0800, Zhang Yi wrote:
>>>> On 5/27/2026 9:41 PM, Ojaswin Mujoo wrote:
>>>>> On Mon, May 11, 2026 at 03:23:37PM +0800, Zhang Yi wrote:
>>>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>>>
>>>>>> In the generic buffer_head I/O path, we rely on the data=ordered mode to
>>>>>> ensure that the zeroed EOF block data is written before updating
>>>>>> i_disksize, thus preventing stale data from being exposed.
>>>>>>
>>>>>> However, the iomap buffered I/O path cannot use this mechanism. Instead,
>>>>>> we issue the I/O immediately after performing the zero operation
>>>>>> (without synchronous waiting for performance). This can reduce the risk
>>>>>> of exposing stale data, but it does not guarantee that the zero data
>>>>>> will be flushed to disk before the metadata of i_disksize is updated.
>>>>>> The subsequent patches will wait for this I/O to complete before
>>>>>> updating i_disksize.
>>>>>>
>>>>>> Suggested-by: Jan Kara <jack@suse.cz>
>>>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>>>
>>>>> I think we discussed that we may not need to do this [1] but I guess
>>>>> you've decided to make the tradeoff of issuing the IO to avoid having to
>>>>> wait for bg flush to complete the tail page zeroing
>>>>>
>>
>> I'm glad to hear that, thanks.
>>
>>>>
>>>> Yes. For truncate up and append fallocate, originally i_disksize would
>>>> be updated immediately, and the change would be persisted via the
>>>> journal within default 5 seconds. But now, if the tail page is not
>>>> committed immediately, the update to i_disksize will be delayed by about
>>>> 30 seconds, and persistence will be postponed to around 35 seconds. I'm
>>>> not sure what impact this change might have — I just don't really want
>>>> to introduce it.
>>>
>>>> For normal append writes, the impact is minimal, unless we call
>>>> sync_range to sync the portion of data that extends beyond EOF.
>>>
>>> Hmm while trying to retain the behavior for falloc and truncate up,
>>> we end up changing it for append writes :) But anyways, I understand
>>> your reasoning and don't have any strong opinions against it. I'll let
>>> Jan pitch in since he had some comments around this.
>>>
>>>>
>>>> In addition, if the zeroed page is not issued here immediately, the
>>>> logic will become more complex because we need to be more careful about the
>>>> order of write-back IOs to prevent deadlock issues caused by mutual
>>>> waiting.
>>>
>>> You mean an endio completion waiting for ordered IO to complete but
>>> ordered IO writeback is somehow waiting for this endio completion? Is that
>>> actually possible?
>>
>> Well, after thinking it over more carefully, it seems this is
>> impossible, I cannot think of a scenario that could trigger this kind of
>> issue. The generic writeback process always executes writeback in folio
>> index order, so there would be no situation where a later data I/O
>> depends on an earlier ordered I/O. Even if both kinds of IOs are placed
>> in the same ioend, there should be no problem. I was confused and
>> overthinking it.
>>
>>  From this perspective, if we can accept that truncate up and fallocate
>> will have longer persistence time by default(I guess this is
>> acceptable), we can avoid writing back zeroed data immediately. To
>> achieve this, we only need to consider the case of sync file range. :-)
> 
> Yeah, I think during writeback we will have to submit the ordered data
> if we are writing back data beyond the i_disksize.
> 
> If this is straightforward enough to implement, I think this approach
> would be a safer choice because we will avoid the overhead of small,
> random ordered IOs overworking the writeback layer.

Indeed. Let me investigate further to ensure there are no other side
effects.

Thanks,
Yi.

> 
>>
>> Regards,
>> Yi.
>>>>
>>>>> However, I think one side effect might be many threads calling the
>>>>> writeback mechanism to issue zero IOs which might not scale well. I
>>>>> don't know if it'll be a huge problem though, I guess it's a sort of
>>>>> thing we will have to deal with in case we see it in real world
>>>>> workloads.
>>>>>
>>>>
>>>> I agree with you. However, I suspect that unless we run some specific
>>>> benchmark tests, it should be difficult to encounter a large number of
>>>> post-EOF append writes and truncate up operations in real-world usage
>>>> scenarios — and I haven't come across such scenarios yet. For
>>>> simplicity, I'd like to proceed with this implementation for now. If we
>>>> do run into actual problems later, we can consider not issuing I/O
>>>> directly here, but instead: 1) find the ordered block in
>>>> ext4_sync_file() and perform writeback; 2) ensure writeback ordering
>>>> for normal background writeback as well — otherwise, there is a risk of
>>>> deadlock (mutual waiting). What do you think?
>>>
>>> Yes sounds good Yi, we can deal with performance tuning later.
>>>
>>> Regards,
>>> Ojaswin
>>>
>>>>
>>>> Cheers,
>>>> Yi.
>>>>
>>>>> [1] https://lore.kernel.org/linux-ext4/yhy4cgc4fnk7tzfejuhy6m6ljo425ebpg6khss6vtvpidg6lyp@5xcyabxrl6zm/
>>>>>
>>>>>> ---
>>>>>>    fs/ext4/inode.c | 66 ++++++++++++++++++++++++++++++++++++++++---------
>>>>>>    1 file changed, 55 insertions(+), 11 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>>>> index 239d387ffaf2..e013aeb03d7b 100644
>>>>>> --- a/fs/ext4/inode.c
>>>>>> +++ b/fs/ext4/inode.c
>>>>>> @@ -4742,6 +4742,32 @@ static int ext4_block_zero_range(struct inode *inode,
>>>>>>    					zero_written);
>>>>>>    }
>>>>>> +static int ext4_iomap_submit_zero_block(struct inode *inode,
>>>>>> +					loff_t from, loff_t end)
>>>>>> +{
>>>>>> +	struct address_space *mapping = inode->i_mapping;
>>>>>> +	struct folio *folio;
>>>>>> +	bool do_submit = false;
>>>>>> +
>>>>>> +	folio = filemap_lock_folio(mapping, from >> PAGE_SHIFT);
>>>>>> +	if (IS_ERR(folio))
>>>>>> +		/* Already writeback and clear? */
>>>>>> +		return PTR_ERR(folio) == -ENOENT ? 0 : PTR_ERR(folio);
>>>>>> +
>>>>>> +	folio_wait_writeback(folio);
>>>>>> +	WARN_ON_ONCE(folio_test_writeback(folio));
>>>>>> +
>>>>>> +	if (likely(folio_test_dirty(folio)))
>>>>>> +		do_submit = true;
>>>>>> +	folio_unlock(folio);
>>>>>> +	folio_put(folio);
>>>>>> +
>>>>>> +	/* Submit zeroed block. */
>>>>>> +	if (do_submit)
>>>>>> +		return filemap_fdatawrite_range(mapping, from, end - 1);
>>>>>> +	return 0;
>>>>>> +}
>>>>>> +
>>>>>>    /*
>>>>>>     * Zero out a mapping from file offset 'from' up to the end of the block
>>>>>>     * which corresponds to 'from' or to the given 'end' inside this block.
>>>>>> @@ -4765,8 +4791,10 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
>>>>>>    	if (IS_ENCRYPTED(inode) && !fscrypt_has_encryption_key(inode))
>>>>>>    		return 0;
>>>>>> -	if (length > blocksize - offset)
>>>>>> +	if (length > blocksize - offset) {
>>>>>>    		length = blocksize - offset;
>>>>>> +		end = from + length;
>>>>>> +	}
>>>>>>    	err = ext4_block_zero_range(inode, from, length,
>>>>>>    				    &did_zero, &zero_written);
>>>>>> @@ -4781,18 +4809,34 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
>>>>>>    	 * TODO: In the iomap path, handle this by updating i_disksize to
>>>>>>    	 * i_size after the zeroed data has been written back.
>>>>>>    	 */
>>>>>> -	if (ext4_should_order_data(inode) &&
>>>>>> -	    did_zero && zero_written && !IS_DAX(inode)) {
>>>>>> -		handle_t *handle;
>>>>>> +	if (did_zero && zero_written && !IS_DAX(inode)) {
>>>>>> +		if (ext4_should_order_data(inode)) {
>>>>>> +			handle_t *handle;
>>>>>> -		handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
>>>>>> -		if (IS_ERR(handle))
>>>>>> -			return PTR_ERR(handle);
>>>>>> +			handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
>>>>>> +			if (IS_ERR(handle))
>>>>>> +				return PTR_ERR(handle);
>>>>>> -		err = ext4_jbd2_inode_add_write(handle, inode, from, length);
>>>>>> -		ext4_journal_stop(handle);
>>>>>> -		if (err)
>>>>>> -			return err;
>>>>>> +			err = ext4_jbd2_inode_add_write(handle, inode, from,
>>>>>> +							length);
>>>>>> +			ext4_journal_stop(handle);
>>>>>> +			if (err)
>>>>>> +				return err;
>>>>>> +		/*
>>>>>> +		 * inodes using the iomap buffered I/O path do not use the
>>>>>> +		 * data=ordered mode. We submit zeroed range directly here.
>>>>>> +		 * Do not wait for I/O completion for performance.
>>>>>> +		 *
>>>>>> +		 * TODO: Any operation that extends i_disksize (including
>>>>>> +		 * append write end io past the zeroed boundary, truncate up,
>>>>>> +		 * and append fallocate) must wait for the relevant I/O to
>>>>>> +		 * complete before updating i_disksize.
>>>>>> +		 */
>>>>>> +		} else if (ext4_inode_buffered_iomap(inode)) {
>>>>>> +			err = ext4_iomap_submit_zero_block(inode, from, end);
>>>>>> +			if (err)
>>>>>> +				return err;
>>>>>> +		}
>>>>>>    	}
>>>>>>    	return 0;
>>>>>> -- 
>>>>>> 2.52.0
>>>>>>
>>>>
>>



* Re: [PATCH v4 18/23] ext4: wait for ordered I/O in the iomap buffered I/O path
From: Zhang Yi @ 2026-06-02  3:22 UTC (permalink / raw)
  To: Ojaswin Mujoo, Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yangerkun, yukuai
In-Reply-To: <ah3QlCt8V-3kVzW8@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>

On 6/2/2026 2:33 AM, Ojaswin Mujoo wrote:
> On Sat, May 30, 2026 at 04:24:24PM +0800, Zhang Yi wrote:
>> On 5/30/2026 3:22 PM, Zhang Yi wrote:
>>> Hi, Ojaswin!
>>>
>>> On 5/27/2026 11:58 PM, Ojaswin Mujoo wrote:
>>>> On Mon, May 11, 2026 at 03:23:38PM +0800, Zhang Yi wrote:
>>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>>
>>>>> For append writes, wait for ordered I/O to complete before updating
>>>>> i_disksize. This ensures that zeroed data is flushed to disk before the
>>>>> metadata update, preventing stale data from being exposed during
>>>>> unaligned post-EOF append writes.
>>>>>
>>>>> Suggested-by: Jan Kara <jack@suse.cz>
>>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>>> ---
>>>>>   fs/ext4/ext4.h    | 11 +++++++
>>>>>   fs/ext4/inode.c   | 80 ++++++++++++++++++++++++++++++++++++++++++-----
>>>>>   fs/ext4/page-io.c | 60 +++++++++++++++++++++++++++++++++++
>>>>>   fs/ext4/super.c   | 23 ++++++++++----
>>>>>   4 files changed, 161 insertions(+), 13 deletions(-)
>>>>>
>>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>>>> index 078feda47e36..9ce2128eea3e 100644
>>>>> --- a/fs/ext4/ext4.h
>>>>> +++ b/fs/ext4/ext4.h
>>>>> @@ -1195,6 +1195,15 @@ struct ext4_inode_info {
>>>>>   #ifdef CONFIG_FS_ENCRYPTION
>>>>>   	struct fscrypt_inode_info *i_crypt_info;
>>>>>   #endif
>>>>> +
>>>>> +	/*
>>>>> +	 * Track ordered zeroed data during post-EOF append writes, fallocate,
>>>>> +	 * and truncate-up operations. These parameters are used only in the
>>>>> +	 * iomap buffered I/O path.
>>>>> +	 */
>>>>> +	ext4_lblk_t i_ordered_lblk;
>>>>> +	ext4_lblk_t i_ordered_len;
>>>>> +	wait_queue_head_t i_ordered_wq;
>>>>>   };
>>>>>   
>>>>>   /*
>>>>> @@ -3858,6 +3867,8 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
>>>>>   			     __u64 len, __u64 *moved_len);
>>>>>   
>>>>>   /* page-io.c */
>>>>> +#define EXT4_IOMAP_IOEND_ORDER_IO	1UL	/* This I/O is an ordered one */
>>>>> +
>>>>>   extern int __init ext4_init_pageio(void);
>>>>>   extern void ext4_exit_pageio(void);
>>>>>   extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
>>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>>> index e013aeb03d7b..11fb369efeb1 100644
>>>>> --- a/fs/ext4/inode.c
>>>>> +++ b/fs/ext4/inode.c
>>>>> @@ -4345,6 +4345,7 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
>>>>>   {
>>>>>   	struct iomap_ioend *ioend = wpc->wb_ctx;
>>>>>   	struct ext4_inode_info *ei = EXT4_I(ioend->io_inode);
>>>>> +	ext4_lblk_t start, end, order_lblk, order_len;
>>>>>   
>>>>>   	/*
>>>>>   	 * After I/O completion, a worker needs to be scheduled when:
>>>>> @@ -4357,6 +4358,30 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
>>>>>   	    test_opt(ioend->io_inode->i_sb, DATA_ERR_ABORT))
>>>>>   		ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
>>>>>   
>>>>> +	/*
>>>>> +	 * Mark the I/O as ordered. Ordered I/O requires separate endio
>>>>> +	 * handling and must not be merged with regular I/O operations.
>>>>> +	 */
>>>>> +	order_len = READ_ONCE(ei->i_ordered_len);
>>>>> +	if (order_len) {
>>>>> +		/*
>>>>> +		 * Pair with smp_store_release() in ext4_block_zero_eof().
>>>>> +		 * Ensure we see the updated i_ordered_lblk that was written
>>>>> +		 * before the release store to i_ordered_len.
>>>>> +		 */
>>>>> +		smp_rmb();
>>>>> +		order_lblk = READ_ONCE(ei->i_ordered_lblk);
>>>>> +		start = ioend->io_offset >> ioend->io_inode->i_blkbits;
>>>>> +		end = EXT4_B_TO_LBLK(ioend->io_inode,
>>>>> +				     ioend->io_offset + ioend->io_size);
>>>>> +
>>>>> +		if (start <= order_lblk && end >= order_lblk + order_len) {
>>>>
>>>> Hi Zhang,
>>>>
>>>> I guess this check is enough cause ordered_lblk and ordered_len will
>>>> always be  contained in a single block.
>>>
>>> Yeah.
>>>
>>>>
>>>>> +			ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
>>>>> +			ioend->io_private = (void *)EXT4_IOMAP_IOEND_ORDER_IO;
>>>>> +			ioend->io_flags |= IOMAP_IOEND_BOUNDARY;
>>>>
>>>> FWIU, we are wanting the ordered IO to not be merged and submitted asap
>>>> since we want to wake up the waiters. Is there any other reason?
>>>
>>> My original intention was to prevent the loss of the
>>> EXT4_IOMAP_IOEND_ORDER_IO flag during worker processing triggered by I/O
>>> completion, which could be caused by merging an ordered ioend with a
>>> normal ioend.  In patch 19, we need to determine the flag to update
>>> i_disksize to the correct position.
> 
> Ahh okay, we don't want the flag to be lost.
> 
>>>
>>>>
>>>> Adding the boundary in ->writeback_submit() only affects
>>>> iomap_ioend_can_merge() which happens after we have woken up the waiters
>>>> and deferred the IO to the wq. We ideally want it affect
>>>> iomap_can_add_to_ioend() ie we need to add IOMAP_F_BOUNDARY in
>>>> ->writeback_range().
>>>
>>> IIUC, merging into the same ioend during the submission stage doesn't
>>> seem to cause any problems.
> 
> Got it, since the flag is set later. I was thinking we want to quickly
> issue the ordered IO to wake up the waiters and not waste time trying to
> merge it, and hence we wanted to use that flag.
> 
>>>
>>>>
>>>> Secondly, I don't think boundary is the right flag here. It ensures
>>>> that everything before the ordered iomap gets submitted and the ordered
>>>> iomap starts a new ioend. This can still keep getting merged with the
>>>> newer ioends until we decide to submit the IO, which can delay waking
>>>> up the waiters. If we really want the "no merge" behavior, we'll have to
>>>> do something like [1] (Check the 2 NOMERGE flag patches).
>>>
>>> Yeah, IOMAP_IOEND_BOUNDARY appears to be just a one-way barrier and
>>> still cannot prevent merging. I missed this, thank you for pointing this
>>> out. However, I think perhaps we should change iomap_ioend_can_merge()
>>> to check the iomap_ioend->io_private. Something like:
>>>
>>> 	if (ioend->io_private || next->io_private)
>>              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>              ioend->io_private != next->io_private
> 
> I guess if the purpose is just to not lose the flag, then boundary seems
> to work, because we only lose the flag if the ordered ioend merges backward
> into a previous one, which boundary seems to take care of. The flag is
> retained if we merge forward.
> 
Yes, IOMAP_IOEND_BOUNDARY does work at the moment, as it prevents flag
loss. However, from the perspective of the iomap infrastructure, I
believe it is still necessary to add the ioend->io_private !=
next->io_private check, because ioends with different io_private values
should not be merged: doing so risks losing io_private or even leaking
memory. With this check in iomap, we would no longer need
IOMAP_IOEND_BOUNDARY.
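
To make the idea concrete, here is a minimal userspace sketch of such a
merge check. It is a toy model only, not the iomap code: the toy_* names
and the struct layout are hypothetical stand-ins, and the real
iomap_ioend_can_merge() checks more than contiguity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for EXT4_IOMAP_IOEND_ORDER_IO. */
#define TOY_ORDER_IO ((void *)1UL)

/* Hypothetical, simplified stand-in for struct iomap_ioend. */
struct toy_ioend {
	uint64_t offset;	/* byte offset in the file */
	uint64_t size;		/* byte length */
	void *private;		/* filesystem-private cookie */
};

/* Can 'next' be appended to 'ioend'? */
static bool toy_can_merge(const struct toy_ioend *ioend,
			  const struct toy_ioend *next)
{
	/* Proposed check: ioends with different cookies never merge. */
	if (ioend->private != next->private)
		return false;
	/* Simplified contiguity check. */
	return ioend->offset + ioend->size == next->offset;
}

/* An ordered ioend between two normal ones merges in neither direction. */
static void toy_selfcheck(void)
{
	struct toy_ioend prev = { 0, 4096, NULL };
	struct toy_ioend ordered = { 4096, 4096, TOY_ORDER_IO };
	struct toy_ioend later = { 8192, 4096, NULL };

	assert(!toy_can_merge(&prev, &ordered));
	assert(!toy_can_merge(&ordered, &later));
}
```

In this model the comparison blocks merging on both sides of the ordered
ioend, which is why no separate boundary flag is needed.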

> However, if we want to avoid merges so we can quickly issue IO and wake
> up the waiters then the above change looks good. Also, if this is the
> reason we'd also want to have this during submission stage so the flag
> setting will probably have to move to ->writeback_range()

Yes. Issuing ordered I/O as soon as possible is beneficial as it reduces
the latency of sync file range. Suppose when we are syncing data beyond
the ordered range, the background writeback process has already started
committing and bundled the ordered range into a large ioend (up to
IOEND_BATCH_SIZE folios), then this sync operation will indeed
experience significant latency. However, for other non-sync scenarios,
there should be little benefit.

But I'm not sure if this is strictly necessary, because in the existing
implementation, issuing ordered I/O via data=ordered mode works the same
way: it also doesn't issue ordered I/O as soon as possible, and still
has to wait when encountering concurrent background writeback. So I
think we can keep the current implementation for now and wait for user
feedback to decide whether further optimization is needed.

Cheers,
Yi.


* Re: [PATCH v4 18/23] ext4: wait for ordered I/O in the iomap buffered I/O path
From: Ojaswin Mujoo @ 2026-06-01 18:33 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yizhang089, yangerkun,
	yukuai
In-Reply-To: <81f4c0ec-1d80-4987-b31e-4e9ecd394c63@huaweicloud.com>

On Sat, May 30, 2026 at 04:24:24PM +0800, Zhang Yi wrote:
> On 5/30/2026 3:22 PM, Zhang Yi wrote:
> > Hi, Ojaswin!
> > 
> > On 5/27/2026 11:58 PM, Ojaswin Mujoo wrote:
> >> On Mon, May 11, 2026 at 03:23:38PM +0800, Zhang Yi wrote:
> >>> From: Zhang Yi <yi.zhang@huawei.com>
> >>>
> >>> For append writes, wait for ordered I/O to complete before updating
> >>> i_disksize. This ensures that zeroed data is flushed to disk before the
> >>> metadata update, preventing stale data from being exposed during
> >>> unaligned post-EOF append writes.
> >>>
> >>> Suggested-by: Jan Kara <jack@suse.cz>
> >>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >>> ---
> >>>  fs/ext4/ext4.h    | 11 +++++++
> >>>  fs/ext4/inode.c   | 80 ++++++++++++++++++++++++++++++++++++++++++-----
> >>>  fs/ext4/page-io.c | 60 +++++++++++++++++++++++++++++++++++
> >>>  fs/ext4/super.c   | 23 ++++++++++----
> >>>  4 files changed, 161 insertions(+), 13 deletions(-)
> >>>
> >>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> >>> index 078feda47e36..9ce2128eea3e 100644
> >>> --- a/fs/ext4/ext4.h
> >>> +++ b/fs/ext4/ext4.h
> >>> @@ -1195,6 +1195,15 @@ struct ext4_inode_info {
> >>>  #ifdef CONFIG_FS_ENCRYPTION
> >>>  	struct fscrypt_inode_info *i_crypt_info;
> >>>  #endif
> >>> +
> >>> +	/*
> >>> +	 * Track ordered zeroed data during post-EOF append writes, fallocate,
> >>> +	 * and truncate-up operations. These parameters are used only in the
> >>> +	 * iomap buffered I/O path.
> >>> +	 */
> >>> +	ext4_lblk_t i_ordered_lblk;
> >>> +	ext4_lblk_t i_ordered_len;
> >>> +	wait_queue_head_t i_ordered_wq;
> >>>  };
> >>>  
> >>>  /*
> >>> @@ -3858,6 +3867,8 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
> >>>  			     __u64 len, __u64 *moved_len);
> >>>  
> >>>  /* page-io.c */
> >>> +#define EXT4_IOMAP_IOEND_ORDER_IO	1UL	/* This I/O is an ordered one */
> >>> +
> >>>  extern int __init ext4_init_pageio(void);
> >>>  extern void ext4_exit_pageio(void);
> >>>  extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
> >>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >>> index e013aeb03d7b..11fb369efeb1 100644
> >>> --- a/fs/ext4/inode.c
> >>> +++ b/fs/ext4/inode.c
> >>> @@ -4345,6 +4345,7 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
> >>>  {
> >>>  	struct iomap_ioend *ioend = wpc->wb_ctx;
> >>>  	struct ext4_inode_info *ei = EXT4_I(ioend->io_inode);
> >>> +	ext4_lblk_t start, end, order_lblk, order_len;
> >>>  
> >>>  	/*
> >>>  	 * After I/O completion, a worker needs to be scheduled when:
> >>> @@ -4357,6 +4358,30 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
> >>>  	    test_opt(ioend->io_inode->i_sb, DATA_ERR_ABORT))
> >>>  		ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
> >>>  
> >>> +	/*
> >>> +	 * Mark the I/O as ordered. Ordered I/O requires separate endio
> >>> +	 * handling and must not be merged with regular I/O operations.
> >>> +	 */
> >>> +	order_len = READ_ONCE(ei->i_ordered_len);
> >>> +	if (order_len) {
> >>> +		/*
> >>> +		 * Pair with smp_store_release() in ext4_block_zero_eof().
> >>> +		 * Ensure we see the updated i_ordered_lblk that was written
> >>> +		 * before the release store to i_ordered_len.
> >>> +		 */
> >>> +		smp_rmb();
> >>> +		order_lblk = READ_ONCE(ei->i_ordered_lblk);
> >>> +		start = ioend->io_offset >> ioend->io_inode->i_blkbits;
> >>> +		end = EXT4_B_TO_LBLK(ioend->io_inode,
> >>> +				     ioend->io_offset + ioend->io_size);
> >>> +
> >>> +		if (start <= order_lblk && end >= order_lblk + order_len) {
> >>
> >> Hi Zhang,
> >>
> >> I guess this check is enough cause ordered_lblk and ordered_len will
> >> always be  contained in a single block.
> > 
> > Yeah.
> > 
> >>
> >>> +			ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
> >>> +			ioend->io_private = (void *)EXT4_IOMAP_IOEND_ORDER_IO;
> >>> +			ioend->io_flags |= IOMAP_IOEND_BOUNDARY;
> >>
> >> FWIU, we are wanting the ordered IO to not be merged and submitted asap
> >> since we want to wake up the waiters. Is there any other reason?
> > 
> > My original intention was to prevent the loss of the
> > EXT4_IOMAP_IOEND_ORDER_IO flag during worker processing triggered by I/O
> > completion, which could be caused by merging an ordered ioend with a
> > normal ioend.  In patch 19, we need to determine the flag to update
> > i_disksize to the correct position.

Ahh okay, we don't want the flag to be lost.

> > 
> >>
> >> Adding the boundary in ->writeback_submit() only affects
> >> iomap_ioend_can_merge() which happens after we have woken up the waiters
> >> and deferred the IO to the wq. We ideally want it affect
> >> iomap_can_add_to_ioend() ie we need to add IOMAP_F_BOUNDARY in
> >> ->writeback_range().
> > 
> > IIUC, merging into the same ioend during the submission stage doesn't
> > seem to cause any problems.

Got it, since the flag is set later. I was thinking we want to quickly
issue the ordered IO to wake up the waiters and not waste time trying to
merge it, and hence we wanted to use that flag.

> > 
> >>
> >> Secondly, I don't think boundary is the right flag here. It ensures
> >> that everything before the ordered iomap gets submitted and the ordered
> >> iomap starts a new ioend. This can still keep getting merged with the
> >> newer ioends until we decide to submit the IO, which can delay waking
> >> up the waiters. If we really want the "no merge" behavior, we'll have to
> >> do something like [1] (Check the 2 NOMERGE flag patches).
> > 
> > Yeah, IOMAP_IOEND_BOUNDARY appears to be just a one-way barrier and
> > still cannot prevent merging. I missed this, thank you for pointing this
> > out. However, I think perhaps we should change iomap_ioend_can_merge()
> > to check the iomap_ioend->io_private. Something like:
> > 
> > 	if (ioend->io_private || next->io_private)
>             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>             ioend->io_private != next->io_private

I guess if the purpose is just to not lose the flag, then boundary seems
to work, because we only lose the flag if the ordered ioend merges backward
into a previous one, which boundary seems to take care of. The flag is
retained if we merge forward.
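
A small userspace model of that one-way behaviour, for illustration
only. The model_* names are hypothetical, and the model assumes that
after a merge only the head ioend's private value is seen by the
completion handler:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_BOUNDARY 0x1u		/* stand-in for IOMAP_IOEND_BOUNDARY */
#define MODEL_ORDER_IO ((void *)1UL)	/* stand-in for EXT4_IOMAP_IOEND_ORDER_IO */

/* Hypothetical, simplified stand-in for struct iomap_ioend. */
struct model_ioend {
	uint64_t offset;	/* byte offset in the file */
	uint64_t size;		/* byte length */
	unsigned int flags;
	void *private;		/* filesystem-private cookie */
};

/*
 * Try to append 'next' to 'ioend'. Only the flag on 'next' is checked,
 * so a boundary ioend cannot be absorbed by an earlier one but can
 * still absorb later ones. The head keeps its own private value.
 */
static bool model_try_merge(struct model_ioend *ioend,
			    const struct model_ioend *next)
{
	if (next->flags & MODEL_BOUNDARY)
		return false;
	if (ioend->offset + ioend->size != next->offset)
		return false;
	ioend->size += next->size;
	return true;
}

static void model_selfcheck(void)
{
	struct model_ioend prev = { 0, 4096, 0, NULL };
	struct model_ioend ordered = { 4096, 4096, MODEL_BOUNDARY,
				       MODEL_ORDER_IO };
	struct model_ioend later = { 8192, 4096, 0, NULL };

	/* Backward merge is refused, so the flag is not lost. */
	assert(!model_try_merge(&prev, &ordered));
	/* Forward merge is allowed and the flag is retained. */
	assert(model_try_merge(&ordered, &later));
	assert(ordered.private == MODEL_ORDER_IO);
}
```

The forward merge is also what grows the ordered ioend, which is the
delay concern raised earlier in the thread.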

However, if we want to avoid merges so we can quickly issue IO and wake
up the waiters then the above change looks good. Also, if this is the
reason we'd also want to have this during submission stage so the flag
setting will probably have to move to ->writeback_range()

Regards,
Ojaswin


> 
> 
> > 		return false;
> > 
> > What do you think?
> > 
> > Thanks,
> > Yi.
> 

