[PATCH v4 0/6] btrfs: add remove

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v4 0/6] btrfs: add remove_bdev() callback
@ 2025-07-04  0:42 Qu Wenruo
  2025-07-04  0:42 ` [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev() Qu Wenruo
                   ` (5 more replies)
  0 siblings, 6 replies; 42+ messages in thread
From: Qu Wenruo @ 2025-07-04  0:42 UTC (permalink / raw)
  To: linux-btrfs, linux-fsdevel; +Cc: viro, brauner, jack

[CHANGELOG]
v4:
- Update the commit message of the first patch
  Remove the out-of-date comments about the old *_shutdown() names.

v3:
- Also rename the callback functions inside each fs to *_remove_bdev()
  To keep the consistency between the interface and implementation.

- Add extra handling if the to-be-removed device is already missing in
  btrfs
  I do not know if a device can be double-removed, but the handling
  inside btrfs is pretty simple, if the target device is already
  missing, nothing needs to be done and can exit immediately.

v2:
- Enhance and rename shutdown() callback
  Rename it to remove_bdev() and add a @bdev parameter.
  For the existing call backs in filesystems, keep their callback
  function names, now something like ".remove_bdev = ext4_shutdown,"
  will be a quick indicator of the behavior.

- Remove the @surprise parameter for the remove_bdev() parameter.
  The fs_bdev_mark_dead() is already trying to sync the fs if it's not
  a surprise removal.
  So there isn't much a filesystem can do with the @surprise parameter.

- Fix btrfs error handling when the devices are not opened
  There are several cases that the fs_devices is not opened, including:
  * sget_fc() failure
  * an existing super block is returned
  * a new super block is returned but btrfS_open_fs_devices() failed

  Handle the error properly so that fs_devices is not freed twice.

RFC->v1:
- Add a new remove_bdev() callback
  Thanks all the feedback from Christian, Christoph and Jan on this new
  name.

- Add a @surprise parameter to the remove_bdev() callback
  To keep it the same as the bdev_mark_dead().

- Hide the shutdown ioctl and remove_bdev callback behind experimental 
  With the shutdown ioctl, there are at least 2 test failures (g/388, g/508).

  G/388 is related to the error handling with COW fixup.
  G/508 looks like something related to log replay.

  And the remove_bdev() doesn't have any btrfs specific test case yet to
  check the auto-degraded behavior, nor the auto-degraded behavior is
  fully discussed.

  So hide both of them behind experimental features.

- Do not use btrfs_handle_fs_error() to avoid freeze/thaw behavior change
  btrfs_handle_fs_error() will flips the fs read-only, which will
  affect freeze/thaw behavior.
  And no other fs set the fs read-only when shutting down, so follow the
  other fs to have a more consistent behavior.


Qu Wenruo (6):
  fs: enhance and rename shutdown() callback to remove_bdev()
  btrfs: introduce a new fs state, EMERGENCY_SHUTDOWN
  btrfs: reject file operations if in shutdown state
  btrfs: reject delalloc ranges if in shutdown state
  btrfs: implement shutdown ioctl
  btrfs: implement remove_bdev super operation callback

 fs/btrfs/file.c            | 25 ++++++++++++++++-
 fs/btrfs/fs.h              | 28 +++++++++++++++++++
 fs/btrfs/inode.c           | 14 +++++++++-
 fs/btrfs/ioctl.c           | 43 ++++++++++++++++++++++++++++
 fs/btrfs/messages.c        |  1 +
 fs/btrfs/reflink.c         |  3 ++
 fs/btrfs/super.c           | 57 ++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.c         |  2 ++
 fs/btrfs/volumes.h         |  5 ++++
 fs/exfat/super.c           |  4 +--
 fs/ext4/super.c            |  4 +--
 fs/f2fs/super.c            |  4 +--
 fs/ntfs3/super.c           |  6 ++--
 fs/super.c                 |  4 +--
 fs/xfs/xfs_super.c         |  7 +++--
 include/linux/fs.h         |  7 ++++-
 include/uapi/linux/btrfs.h |  9 ++++++
 17 files changed, 206 insertions(+), 17 deletions(-)

-- 
2.50.0


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-04  0:42 [PATCH v4 0/6] btrfs: add remove_bdev() callback Qu Wenruo
@ 2025-07-04  0:42 ` Qu Wenruo
  2025-07-04  9:00   ` (subset) " Christian Brauner
                     ` (2 more replies)
  2025-07-04  0:42 ` [PATCH v4 2/6] btrfs: introduce a new fs state, EMERGENCY_SHUTDOWN Qu Wenruo
                   ` (4 subsequent siblings)
  5 siblings, 3 replies; 42+ messages in thread
From: Qu Wenruo @ 2025-07-04  0:42 UTC (permalink / raw)
  To: linux-btrfs, linux-fsdevel
  Cc: viro, brauner, jack, linux-ext4, linux-f2fs-devel, ntfs3,
	linux-xfs

Currently all the filesystems implementing the
super_opearations::shutdown() callback can not afford losing a device.

Thus fs_bdev_mark_dead() will just call the shutdown() callback for the
involved filesystem.

But it will no longer be the case, with multi-device filesystems like
btrfs and bcachefs the filesystem can handle certain device loss without
shutting down the whole filesystem.

To allow those multi-device filesystems to be integrated to use
fs_holder_ops:

- Replace super_opearation::shutdown() with
  super_opearations::remove_bdev()
  To better describe when the callback is called.

- Add a new @bdev parameter to remove_bdev() callback
  To allow the fs to determine which device is missing, and do the
  proper handling when needed.

For the existing shutdown callback users, the change is minimal.
They only need to follow the rename and the new parameter list.
The new @bdev parameter can be ignored if the filesystem can not afford
losing any device, and continue using the old shutdown behavior.

Btrfs is going to implement the callback soon, which will either
shutdown the fs or continue read-write operations.

Cc: linux-fsdevel@vger.kernel.org
Cc: linux-ext4@vger.kernel.org
Cc: linux-f2fs-devel@lists.sourceforge.net
Cc: ntfs3@lists.linux.dev
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/exfat/super.c   | 4 ++--
 fs/ext4/super.c    | 4 ++--
 fs/f2fs/super.c    | 4 ++--
 fs/ntfs3/super.c   | 6 +++---
 fs/super.c         | 4 ++--
 fs/xfs/xfs_super.c | 7 ++++---
 include/linux/fs.h | 7 ++++++-
 7 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/fs/exfat/super.c b/fs/exfat/super.c
index 7ed858937d45..a0e11166b194 100644
--- a/fs/exfat/super.c
+++ b/fs/exfat/super.c
@@ -172,7 +172,7 @@ int exfat_force_shutdown(struct super_block *sb, u32 flags)
 	return 0;
 }
 
-static void exfat_shutdown(struct super_block *sb)
+static void exfat_remove_bdev(struct super_block *sb, struct block_device *bdev)
 {
 	exfat_force_shutdown(sb, EXFAT_GOING_DOWN_NOSYNC);
 }
@@ -202,7 +202,7 @@ static const struct super_operations exfat_sops = {
 	.put_super	= exfat_put_super,
 	.statfs		= exfat_statfs,
 	.show_options	= exfat_show_options,
-	.shutdown	= exfat_shutdown,
+	.remove_bdev	= exfat_remove_bdev,
 };
 
 enum {
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c7d39da7e733..d75b416401ae 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1456,7 +1456,7 @@ static void ext4_destroy_inode(struct inode *inode)
 			 EXT4_I(inode)->i_reserved_data_blocks);
 }
 
-static void ext4_shutdown(struct super_block *sb)
+static void ext4_remove_bdev(struct super_block *sb, struct block_device *bdev)
 {
        ext4_force_shutdown(sb, EXT4_GOING_FLAGS_NOLOGFLUSH);
 }
@@ -1620,7 +1620,7 @@ static const struct super_operations ext4_sops = {
 	.unfreeze_fs	= ext4_unfreeze,
 	.statfs		= ext4_statfs,
 	.show_options	= ext4_show_options,
-	.shutdown	= ext4_shutdown,
+	.remove_bdev	= ext4_remove_bdev,
 #ifdef CONFIG_QUOTA
 	.quota_read	= ext4_quota_read,
 	.quota_write	= ext4_quota_write,
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index bbf1dad6843f..8667af9f76e4 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -2640,7 +2640,7 @@ static int f2fs_remount(struct super_block *sb, int *flags, char *data)
 	return err;
 }
 
-static void f2fs_shutdown(struct super_block *sb)
+static void f2fs_remove_bdev(struct super_block *sb, struct block_device *bdev)
 {
 	f2fs_do_shutdown(F2FS_SB(sb), F2FS_GOING_DOWN_NOSYNC, false, false);
 }
@@ -3264,7 +3264,7 @@ static const struct super_operations f2fs_sops = {
 	.unfreeze_fs	= f2fs_unfreeze,
 	.statfs		= f2fs_statfs,
 	.remount_fs	= f2fs_remount,
-	.shutdown	= f2fs_shutdown,
+	.remove_bdev	= f2fs_remove_bdev,
 };
 
 #ifdef CONFIG_FS_ENCRYPTION
diff --git a/fs/ntfs3/super.c b/fs/ntfs3/super.c
index 920a1ab47b63..3e69dc805e3a 100644
--- a/fs/ntfs3/super.c
+++ b/fs/ntfs3/super.c
@@ -762,9 +762,9 @@ static int ntfs_show_options(struct seq_file *m, struct dentry *root)
 }
 
 /*
- * ntfs_shutdown - super_operations::shutdown
+ * ntfs_remove_bdev - super_operations::remove_bdev
  */
-static void ntfs_shutdown(struct super_block *sb)
+static void ntfs_remove_bdev(struct super_block *sb, struct block_device *bdev)
 {
 	set_bit(NTFS_FLAGS_SHUTDOWN_BIT, &ntfs_sb(sb)->flags);
 }
@@ -821,7 +821,7 @@ static const struct super_operations ntfs_sops = {
 	.put_super = ntfs_put_super,
 	.statfs = ntfs_statfs,
 	.show_options = ntfs_show_options,
-	.shutdown = ntfs_shutdown,
+	.remove_bdev = ntfs_remove_bdev,
 	.sync_fs = ntfs_sync_fs,
 	.write_inode = ntfs3_write_inode,
 };
diff --git a/fs/super.c b/fs/super.c
index 80418ca8e215..c972efb38f6a 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1463,8 +1463,8 @@ static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
 		sync_filesystem(sb);
 	shrink_dcache_sb(sb);
 	evict_inodes(sb);
-	if (sb->s_op->shutdown)
-		sb->s_op->shutdown(sb);
+	if (sb->s_op->remove_bdev)
+		sb->s_op->remove_bdev(sb, bdev);
 
 	super_unlock_shared(sb);
 }
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 0bc4b5489078..8e307b036133 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1276,8 +1276,9 @@ xfs_fs_free_cached_objects(
 }
 
 static void
-xfs_fs_shutdown(
-	struct super_block	*sb)
+xfs_fs_remove_bdev(
+	struct super_block	*sb,
+	struct block_device	*bdev)
 {
 	xfs_force_shutdown(XFS_M(sb), SHUTDOWN_DEVICE_REMOVED);
 }
@@ -1308,7 +1309,7 @@ static const struct super_operations xfs_super_operations = {
 	.show_options		= xfs_fs_show_options,
 	.nr_cached_objects	= xfs_fs_nr_cached_objects,
 	.free_cached_objects	= xfs_fs_free_cached_objects,
-	.shutdown		= xfs_fs_shutdown,
+	.remove_bdev		= xfs_fs_remove_bdev,
 	.show_stats		= xfs_fs_show_stats,
 };
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b085f161ed22..b08af63d2d4f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2367,7 +2367,12 @@ struct super_operations {
 				  struct shrink_control *);
 	long (*free_cached_objects)(struct super_block *,
 				    struct shrink_control *);
-	void (*shutdown)(struct super_block *sb);
+	/*
+	 * Called when block device @bdev belonging to @sb is removed.
+	 *
+	 * If the fs can't afford the device loss, it should be shutdown.
+	 */
+	void (*remove_bdev)(struct super_block *sb, struct block_device *bdev);
 };
 
 /*
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 2/6] btrfs: introduce a new fs state, EMERGENCY_SHUTDOWN
  2025-07-04  0:42 [PATCH v4 0/6] btrfs: add remove_bdev() callback Qu Wenruo
  2025-07-04  0:42 ` [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev() Qu Wenruo
@ 2025-07-04  0:42 ` Qu Wenruo
  2025-07-04  0:42 ` [PATCH v4 3/6] btrfs: reject file operations if in shutdown state Qu Wenruo
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 42+ messages in thread
From: Qu Wenruo @ 2025-07-04  0:42 UTC (permalink / raw)
  To: linux-btrfs, linux-fsdevel; +Cc: viro, brauner, jack

This is btrfs' equivalent of XFS_IOC_GOINGDOWN or EXT4_IOC_SHUTDOWN,
after entering the emergency shutdown state, all operations will return
errors (-EIO), and can not be bring back to normal state until unmount.

A new helper, btrfs_force_shutdown() is introduced, which will:

- Mark the fs as error
  But without flipping the fs read-only.
  This is a special handling for the future shutdown ioctl, which will
  freeze the fs first, set the SHUTDOWN flag, thaw the fs.

  But the thaw path will no longer call the unfreeze_fs() call back
  if the superblock is already read-only.

  So to handle future shutdown correctly, we only mark the fs as error,
  without flipping it read-only.

- Set the SHUTDOWN flag and output an message

New users of those interfaces will be added when implementing shutdown
ioctl support.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/fs.h       | 28 ++++++++++++++++++++++++++++
 fs/btrfs/messages.c |  1 +
 2 files changed, 29 insertions(+)

diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 5154ad390f31..83d93ef0c451 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -29,6 +29,7 @@
 #include "extent-io-tree.h"
 #include "async-thread.h"
 #include "block-rsv.h"
+#include "messages.h"
 
 struct inode;
 struct super_block;
@@ -120,6 +121,12 @@ enum {
 	/* No more delayed iput can be queued. */
 	BTRFS_FS_STATE_NO_DELAYED_IPUT,
 
+	/*
+	 * Emergency shutdown, a step further than trans aborted by rejecting
+	 * all operations.
+	 */
+	BTRFS_FS_STATE_EMERGENCY_SHUTDOWN,
+
 	BTRFS_FS_STATE_COUNT
 };
 
@@ -1094,6 +1101,27 @@ static inline void btrfs_wake_unfinished_drop(struct btrfs_fs_info *fs_info)
 	(unlikely(test_bit(BTRFS_FS_STATE_LOG_CLEANUP_ERROR,		\
 			   &(fs_info)->fs_state)))
 
+static inline bool btrfs_is_shutdown(struct btrfs_fs_info *fs_info)
+{
+	return test_bit(BTRFS_FS_STATE_EMERGENCY_SHUTDOWN, &fs_info->fs_state);
+}
+
+static inline void btrfs_force_shutdown(struct btrfs_fs_info *fs_info)
+{
+	/*
+	 * Here we do not want to use handle_fs_error(), which will mark
+	 * the fs read-only.
+	 * Some call sites like shutdown ioctl will mark the fs shutdown
+	 * when the fs is frozen. But thaw path will handle RO and RW fs
+	 * differently.
+	 *
+	 * So here we only mark the fs error without flipping it RO.
+	 */
+	WRITE_ONCE(fs_info->fs_error, -EIO);
+	if (!test_and_set_bit(BTRFS_FS_STATE_EMERGENCY_SHUTDOWN, &fs_info->fs_state))
+		btrfs_info(fs_info, "emergency shutdown");
+}
+
 /*
  * We use folio flag owner_2 to indicate there is an ordered extent with
  * unfinished IO.
diff --git a/fs/btrfs/messages.c b/fs/btrfs/messages.c
index 363fd28c0268..2bb4bcb7c2cd 100644
--- a/fs/btrfs/messages.c
+++ b/fs/btrfs/messages.c
@@ -23,6 +23,7 @@ static const char fs_state_chars[] = {
 	[BTRFS_FS_STATE_NO_DATA_CSUMS]		= 'C',
 	[BTRFS_FS_STATE_SKIP_META_CSUMS]	= 'S',
 	[BTRFS_FS_STATE_LOG_CLEANUP_ERROR]	= 'L',
+	[BTRFS_FS_STATE_EMERGENCY_SHUTDOWN]	= 'E',
 };
 
 static void btrfs_state_to_string(const struct btrfs_fs_info *info, char *buf)
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 3/6] btrfs: reject file operations if in shutdown state
  2025-07-04  0:42 [PATCH v4 0/6] btrfs: add remove_bdev() callback Qu Wenruo
  2025-07-04  0:42 ` [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev() Qu Wenruo
  2025-07-04  0:42 ` [PATCH v4 2/6] btrfs: introduce a new fs state, EMERGENCY_SHUTDOWN Qu Wenruo
@ 2025-07-04  0:42 ` Qu Wenruo
  2025-07-05 14:10   ` David Sterba
  2025-07-04  0:42 ` [PATCH v4 4/6] btrfs: reject delalloc ranges " Qu Wenruo
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 42+ messages in thread
From: Qu Wenruo @ 2025-07-04  0:42 UTC (permalink / raw)
  To: linux-btrfs, linux-fsdevel; +Cc: viro, brauner, jack

This includes the following callbacks of file_operations:

- read_iter()
- write_iter()
- mmap()
- open()
- remap_file_range()
- uring_cmd()
- splice_read()
  This requires a small wrapper to do the extra shutdown check, then call
  the regular filemap_splice_read() function

This should reject most of the file operations on a shutdown btrfs.

The callback ioctl() is intentionally skipped, as ext4 doesn't do the
shutdown check on ioctl() either, thus I believe there is some special
require for ioctl() callback even if the fs is fully shutdown.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/file.c    | 25 ++++++++++++++++++++++++-
 fs/btrfs/ioctl.c   |  3 +++
 fs/btrfs/reflink.c |  3 +++
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 05b046c6806f..cb7d1d53fc13 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1419,6 +1419,8 @@ ssize_t btrfs_do_write_iter(struct kiocb *iocb, struct iov_iter *from,
 	struct btrfs_inode *inode = BTRFS_I(file_inode(file));
 	ssize_t num_written, num_sync;
 
+	if (unlikely(btrfs_is_shutdown(inode->root->fs_info)))
+		return -EIO;
 	/*
 	 * If the fs flips readonly due to some impossible error, although we
 	 * have opened a file as writable, we have to stop this write operation
@@ -1981,6 +1983,8 @@ static int btrfs_file_mmap(struct file	*filp, struct vm_area_struct *vma)
 {
 	struct address_space *mapping = filp->f_mapping;
 
+	if (unlikely(btrfs_is_shutdown(inode_to_fs_info(file_inode(filp)))))
+		return -EIO;
 	if (!mapping->a_ops->read_folio)
 		return -ENOEXEC;
 
@@ -3040,6 +3044,9 @@ static long btrfs_fallocate(struct file *file, int mode,
 	int blocksize = BTRFS_I(inode)->root->fs_info->sectorsize;
 	int ret;
 
+	if (unlikely(btrfs_is_shutdown(inode_to_fs_info(inode))))
+		return -EIO;
+
 	/* Do not allow fallocate in ZONED mode */
 	if (btrfs_is_zoned(inode_to_fs_info(inode)))
 		return -EOPNOTSUPP;
@@ -3731,6 +3738,9 @@ static int btrfs_file_open(struct inode *inode, struct file *filp)
 {
 	int ret;
 
+	if (unlikely(btrfs_is_shutdown(inode_to_fs_info(inode))))
+		return -EIO;
+
 	filp->f_mode |= FMODE_NOWAIT | FMODE_CAN_ODIRECT;
 
 	ret = fsverity_file_open(inode, filp);
@@ -3743,6 +3753,9 @@ static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 	ssize_t ret = 0;
 
+	if (unlikely(btrfs_is_shutdown(inode_to_fs_info(file_inode(iocb->ki_filp)))))
+		return -EIO;
+
 	if (iocb->ki_flags & IOCB_DIRECT) {
 		ret = btrfs_direct_read(iocb, to);
 		if (ret < 0 || !iov_iter_count(to) ||
@@ -3753,10 +3766,20 @@ static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	return filemap_read(iocb, to, ret);
 }
 
+static ssize_t btrfs_file_splice_read(struct file *in, loff_t *ppos,
+				      struct pipe_inode_info *pipe,
+				      size_t len, unsigned int flags)
+{
+	if (unlikely(btrfs_is_shutdown(inode_to_fs_info(file_inode(in)))))
+		return -EIO;
+
+	return filemap_splice_read(in, ppos, pipe, len, flags);
+}
+
 const struct file_operations btrfs_file_operations = {
 	.llseek		= btrfs_file_llseek,
 	.read_iter      = btrfs_file_read_iter,
-	.splice_read	= filemap_splice_read,
+	.splice_read	= btrfs_file_splice_read,
 	.write_iter	= btrfs_file_write_iter,
 	.splice_write	= iter_file_splice_write,
 	.mmap		= btrfs_file_mmap,
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 503c469249e5..2f3b7be13bea 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -5048,6 +5048,9 @@ static int btrfs_uring_encoded_write(struct io_uring_cmd *cmd, unsigned int issu
 
 int btrfs_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 {
+	if (unlikely(btrfs_is_shutdown(inode_to_fs_info(file_inode(cmd->file)))))
+		return -EIO;
+
 	switch (cmd->cmd_op) {
 	case BTRFS_IOC_ENCODED_READ:
 #if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index 0197bd9160a7..123a5682514b 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -869,6 +869,9 @@ loff_t btrfs_remap_file_range(struct file *src_file, loff_t off,
 	bool same_inode = dst_inode == src_inode;
 	int ret;
 
+	if (unlikely(btrfs_is_shutdown(inode_to_fs_info(file_inode(src_file)))))
+		return -EIO;
+
 	if (remap_flags & ~(REMAP_FILE_DEDUP | REMAP_FILE_ADVISORY))
 		return -EINVAL;
 
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 4/6] btrfs: reject delalloc ranges if in shutdown state
  2025-07-04  0:42 [PATCH v4 0/6] btrfs: add remove_bdev() callback Qu Wenruo
                   ` (2 preceding siblings ...)
  2025-07-04  0:42 ` [PATCH v4 3/6] btrfs: reject file operations if in shutdown state Qu Wenruo
@ 2025-07-04  0:42 ` Qu Wenruo
  2025-07-04  0:42 ` [PATCH v4 5/6] btrfs: implement shutdown ioctl Qu Wenruo
  2025-07-04  0:42 ` [PATCH v4 6/6] btrfs: implement remove_bdev super operation callback Qu Wenruo
  5 siblings, 0 replies; 42+ messages in thread
From: Qu Wenruo @ 2025-07-04  0:42 UTC (permalink / raw)
  To: linux-btrfs, linux-fsdevel; +Cc: viro, brauner, jack

If the filesystem has dirty pages before the fs is shutdown, we should
no longer write them back, instead should treat them as writeback error.

Handle such situation by marking all those delalloc range as error and
let error handling path to clean them up.

For ranges that already have ordered extent created, let them continue
the writeback, and at ordered io finish time the file extent item update
will be rejected as the fs is already marked error.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d7a2ea7d121f..0f0c415e82a2 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -861,6 +861,9 @@ static void compress_file_range(struct btrfs_work *work)
 	int compress_type = fs_info->compress_type;
 	int compress_level = fs_info->compress_level;
 
+	if (unlikely(btrfs_is_shutdown(fs_info)))
+		goto cleanup_and_bail_uncompressed;
+
 	inode_should_defrag(inode, start, end, end - start + 1, SZ_16K);
 
 	/*
@@ -1276,6 +1279,11 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 	unsigned long page_ops;
 	int ret = 0;
 
+	if (unlikely(btrfs_is_shutdown(fs_info))) {
+		ret = -EIO;
+		goto out_unlock;
+	}
+
 	if (btrfs_is_free_space_inode(inode)) {
 		ret = -EINVAL;
 		goto out_unlock;
@@ -2026,7 +2034,7 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct btrfs_root *root = inode->root;
-	struct btrfs_path *path;
+	struct btrfs_path *path = NULL;
 	u64 cow_start = (u64)-1;
 	/*
 	 * If not 0, represents the inclusive end of the last fallback_to_cow()
@@ -2046,6 +2054,10 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 	 */
 	ASSERT(!btrfs_is_zoned(fs_info) || btrfs_is_data_reloc_root(root));
 
+	if (unlikely(btrfs_is_shutdown(fs_info))) {
+		ret = -EIO;
+		goto error;
+	}
 	path = btrfs_alloc_path();
 	if (!path) {
 		ret = -ENOMEM;
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 5/6] btrfs: implement shutdown ioctl
  2025-07-04  0:42 [PATCH v4 0/6] btrfs: add remove_bdev() callback Qu Wenruo
                   ` (3 preceding siblings ...)
  2025-07-04  0:42 ` [PATCH v4 4/6] btrfs: reject delalloc ranges " Qu Wenruo
@ 2025-07-04  0:42 ` Qu Wenruo
  2025-07-05 14:22   ` David Sterba
  2025-07-04  0:42 ` [PATCH v4 6/6] btrfs: implement remove_bdev super operation callback Qu Wenruo
  5 siblings, 1 reply; 42+ messages in thread
From: Qu Wenruo @ 2025-07-04  0:42 UTC (permalink / raw)
  To: linux-btrfs, linux-fsdevel; +Cc: viro, brauner, jack

The shutdown ioctl should follow the XFS one, which use magic number 'X',
and ioctl number 125, with a u32 as flags.

For now btrfs don't distinguish DEFAULT and LOGFLUSH flags (just like
f2fs), both will freeze the fs first (implies committing the current
transaction), setting the SHUTDOWN flag and finally thaw the fs.

For NOLOGFLUSH flag, the freeze/thaw part is skipped thus the current
transaction is aborted.

The new shutdown ioctl is hidden behind experimental features for more
testing.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ioctl.c           | 40 ++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/btrfs.h |  9 +++++++++
 2 files changed, 49 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 2f3b7be13bea..94eb7a8499db 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -5194,6 +5194,36 @@ static int btrfs_ioctl_subvol_sync(struct btrfs_fs_info *fs_info, void __user *a
 	return 0;
 }
 
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+static int btrfs_emergency_shutdown(struct btrfs_fs_info *fs_info, u32 flags)
+{
+	int ret = 0;
+
+	if (flags >= BTRFS_SHUTDOWN_FLAGS_LAST)
+		return -EINVAL;
+
+	if (btrfs_is_shutdown(fs_info))
+		return 0;
+
+	switch (flags) {
+	case BTRFS_SHUTDOWN_FLAGS_LOGFLUSH:
+	case BTRFS_SHUTDOWN_FLAGS_DEFAULT:
+		ret = freeze_super(fs_info->sb, FREEZE_HOLDER_KERNEL, NULL);
+		if (ret)
+			return ret;
+		btrfs_force_shutdown(fs_info);
+		ret = thaw_super(fs_info->sb, FREEZE_HOLDER_KERNEL, NULL);
+		if (ret)
+			return ret;
+		break;
+	case BTRFS_SHUTDOWN_FLAGS_NOLOGFLUSH:
+		btrfs_force_shutdown(fs_info);
+		break;
+	}
+	return ret;
+}
+#endif
+
 long btrfs_ioctl(struct file *file, unsigned int
 		cmd, unsigned long arg)
 {
@@ -5201,6 +5231,8 @@ long btrfs_ioctl(struct file *file, unsigned int
 	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	void __user *argp = (void __user *)arg;
+	/* If @arg is just an unsigned long value. */
+	unsigned long flags;
 
 	switch (cmd) {
 	case FS_IOC_GETVERSION:
@@ -5349,6 +5381,14 @@ long btrfs_ioctl(struct file *file, unsigned int
 #endif
 	case BTRFS_IOC_SUBVOL_SYNC_WAIT:
 		return btrfs_ioctl_subvol_sync(fs_info, argp);
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+	case BTRFS_IOC_SHUTDOWN:
+		if (!capable(CAP_SYS_ADMIN))
+			return -EPERM;
+		if (get_user(flags, (__u32 __user *)arg))
+			return -EFAULT;
+		return btrfs_emergency_shutdown(fs_info, flags);
+#endif
 	}
 
 	return -ENOTTY;
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index dd02160015b2..8f6324cf15d9 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -1096,6 +1096,12 @@ enum btrfs_err_code {
 	BTRFS_ERROR_DEV_RAID1C4_MIN_NOT_MET,
 };
 
+/* Flags for IOC_SHUTDOWN, should match XFS' flags. */
+#define BTRFS_SHUTDOWN_FLAGS_DEFAULT	0x0
+#define BTRFS_SHUTDOWN_FLAGS_LOGFLUSH	0x1
+#define BTRFS_SHUTDOWN_FLAGS_NOLOGFLUSH	0x2
+#define BTRFS_SHUTDOWN_FLAGS_LAST	0x3
+
 #define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \
 				   struct btrfs_ioctl_vol_args)
 #define BTRFS_IOC_DEFRAG _IOW(BTRFS_IOCTL_MAGIC, 2, \
@@ -1217,6 +1223,9 @@ enum btrfs_err_code {
 #define BTRFS_IOC_SUBVOL_SYNC_WAIT _IOW(BTRFS_IOCTL_MAGIC, 65, \
 					struct btrfs_ioctl_subvol_wait)
 
+/* Shutdown ioctl should follow XFS's interfaces, thus not using btrfs magic. */
+#define BTRFS_IOC_SHUTDOWN	_IOR('X', 125, __u32)
+
 #ifdef __cplusplus
 }
 #endif
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 6/6] btrfs: implement remove_bdev super operation callback
  2025-07-04  0:42 [PATCH v4 0/6] btrfs: add remove_bdev() callback Qu Wenruo
                   ` (4 preceding siblings ...)
  2025-07-04  0:42 ` [PATCH v4 5/6] btrfs: implement shutdown ioctl Qu Wenruo
@ 2025-07-04  0:42 ` Qu Wenruo
  5 siblings, 0 replies; 42+ messages in thread
From: Qu Wenruo @ 2025-07-04  0:42 UTC (permalink / raw)
  To: linux-btrfs, linux-fsdevel; +Cc: viro, brauner, jack

For this callback, btrfs will:

- Go degraded if the fs can still maintain RW operations
  And of course mark the target device as missing.

- Shutdown if the fs can not maintain RW operations

To support the lookup from bdev to a btrfs_device,
btrfs_dev_lookup_args is enhanced to have a new @devt member.
If set, we should be able to use that @devt member to uniquely locating a
btrfs device.

I know the shutdown can be a little overkilled, if one has a RAID1
metadata and RAID0 data, in that case one can still read data with 50%
chance to got some good data.

But a filesystem returning -EIO for half of the time is not really
considered usable.
Further it can also be as bad as the only device went missing for a single
device btrfs.

So here we go safe other than sorry when handling missing device.

And the remove_bdev callback will be hidden behind experimental features
for now, the reasons are:

- There are not enough btrfs specific bdev removal test cases
  The existing test cases are all removing the only device, thus only
  exercises the shutdown branch.

- Not yet determined what's the expected behavior
  Although the current auto-degrade behavior is no worse than the old
  behavior, it may not always be what the end users want.

  Before there is a concrete interface, better hide the new feature
  from end users.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/super.c   | 57 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.c |  2 ++
 fs/btrfs/volumes.h |  5 ++++
 3 files changed, 64 insertions(+)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 727d5c1d1bf1..dd6e8a50ac39 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2421,6 +2421,60 @@ static long btrfs_free_cached_objects(struct super_block *sb, struct shrink_cont
 	return 0;
 }
 
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+static void btrfs_remove_bdev(struct super_block *sb, struct block_device *bdev)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(sb);
+	struct btrfs_device *device;
+	struct btrfs_dev_lookup_args lookup_args = { .devt = bdev->bd_dev };
+	bool can_rw;
+
+	mutex_lock(&fs_info->fs_devices->device_list_mutex);
+	device = btrfs_find_device(fs_info->fs_devices, &lookup_args);
+	if (!device) {
+		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+		btrfs_warn(fs_info, "unable to find btrfs device for block device '%pg'",
+			   bdev);
+		return;
+	}
+	/*
+	 * The to-be-removed device is already missing?
+	 *
+	 * That's weird but no special handling needed and can exit right now.
+	 */
+	if (unlikely(test_and_set_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))) {
+		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+		btrfs_warn(fs_info, "btrfs device id %llu is already missing",
+			   device->devid);
+		return;
+	}
+
+	device->fs_devices->missing_devices++;
+	if (test_and_clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
+		list_del_init(&device->dev_alloc_list);
+		WARN_ON(device->fs_devices->rw_devices < 1);
+		device->fs_devices->rw_devices--;
+	}
+	can_rw = btrfs_check_rw_degradable(fs_info, device);
+	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+	/*
+	 * Now device is considered missing, btrfs_device_name() won't give a
+	 * meaningful result anymore, so only output the devid.
+	 */
+	if (!can_rw) {
+		btrfs_crit(fs_info,
+		"btrfs device id %llu has gone missing, can not maintain read-write",
+			   device->devid);
+		btrfs_force_shutdown(fs_info);
+		return;
+	}
+	btrfs_warn(fs_info,
+		   "btrfs device id %llu has gone missing, continue as degraded",
+		   device->devid);
+	btrfs_set_opt(fs_info->mount_opt, DEGRADED);
+}
+#endif
+
 static const struct super_operations btrfs_super_ops = {
 	.drop_inode	= btrfs_drop_inode,
 	.evict_inode	= btrfs_evict_inode,
@@ -2436,6 +2490,9 @@ static const struct super_operations btrfs_super_ops = {
 	.unfreeze_fs	= btrfs_unfreeze,
 	.nr_cached_objects = btrfs_nr_cached_objects,
 	.free_cached_objects = btrfs_free_cached_objects,
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+	.remove_bdev	= btrfs_remove_bdev,
+#endif
 };
 
 static const struct file_operations btrfs_ctl_fops = {
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 2098129743e7..340a1bb0c844 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6792,6 +6792,8 @@ static bool dev_args_match_fs_devices(const struct btrfs_dev_lookup_args *args,
 static bool dev_args_match_device(const struct btrfs_dev_lookup_args *args,
 				  const struct btrfs_device *device)
 {
+	if (args->devt)
+		return device->devt == args->devt;
 	if (args->missing) {
 		if (test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state) &&
 		    !device->bdev)
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 6acb154ccf87..71e570f8337d 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -663,6 +663,11 @@ struct btrfs_dev_lookup_args {
 	u64 devid;
 	u8 *uuid;
 	u8 *fsid;
+	/*
+	 * If devt is specified, all other members will be ignored as it is
+	 * enough to uniquely locate a device.
+	 */
+	dev_t devt;
 	bool missing;
 };
 
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: (subset) [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-04  0:42 ` [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev() Qu Wenruo
@ 2025-07-04  9:00   ` Christian Brauner
  2025-07-04  9:05   ` Jan Kara
  2025-07-07 23:02   ` Dave Chinner
  2 siblings, 0 replies; 42+ messages in thread
From: Christian Brauner @ 2025-07-04  9:00 UTC (permalink / raw)
  To: linux-btrfs, linux-fsdevel, Qu Wenruo
  Cc: Christian Brauner, viro, jack, linux-ext4, linux-f2fs-devel,
	ntfs3, linux-xfs

On Fri, 04 Jul 2025 10:12:29 +0930, Qu Wenruo wrote:
> Currently all the filesystems implementing the
> super_opearations::shutdown() callback can not afford losing a device.
> 
> Thus fs_bdev_mark_dead() will just call the shutdown() callback for the
> involved filesystem.
> 
> But it will no longer be the case, with multi-device filesystems like
> btrfs and bcachefs the filesystem can handle certain device loss without
> shutting down the whole filesystem.
> 
> [...]

Moving this into a stable branch that won't change anymore and that you
can pull into btrfs/use as base for your patches.

---

Applied to the vfs-6.17.super branch of the vfs/vfs.git tree.
Patches in the vfs-6.17.super branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.17.super

[1/6] fs: enhance and rename shutdown() callback to remove_bdev()
      https://git.kernel.org/vfs/vfs/c/165fa94de612

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-04  0:42 ` [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev() Qu Wenruo
  2025-07-04  9:00   ` (subset) " Christian Brauner
@ 2025-07-04  9:05   ` Jan Kara
  2025-07-07 23:02   ` Dave Chinner
  2 siblings, 0 replies; 42+ messages in thread
From: Jan Kara @ 2025-07-04  9:05 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: linux-btrfs, linux-fsdevel, viro, brauner, jack, linux-ext4,
	linux-f2fs-devel, ntfs3, linux-xfs

OK, now that you've made me read the changelog :) :

On Fri 04-07-25 10:12:29, Qu Wenruo wrote:
> Currently all the filesystems implementing the
> super_opearations::shutdown() callback can not afford losing a device.
        ^^^ operations

> Thus fs_bdev_mark_dead() will just call the shutdown() callback for the
> involved filesystem.
> 
> But it will no longer be the case, with multi-device filesystems like
> btrfs and bcachefs the filesystem can handle certain device loss without
> shutting down the whole filesystem.
> 
> To allow those multi-device filesystems to be integrated to use
> fs_holder_ops:
> 
> - Replace super_opearation::shutdown() with
>   super_opearations::remove_bdev()
    ^^ again typos in work "operations"

>   To better describe when the callback is called.
> 
> - Add a new @bdev parameter to remove_bdev() callback
>   To allow the fs to determine which device is missing, and do the
>   proper handling when needed.
> 
> For the existing shutdown callback users, the change is minimal.
> They only need to follow the rename and the new parameter list.
> The new @bdev parameter can be ignored if the filesystem can not afford
> losing any device, and continue using the old shutdown behavior.
> 
> Btrfs is going to implement the callback soon, which will either
> shutdown the fs or continue read-write operations.
> 
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-ext4@vger.kernel.org
> Cc: linux-f2fs-devel@lists.sourceforge.net
> Cc: ntfs3@lists.linux.dev
> Cc: linux-xfs@vger.kernel.org
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Still feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/exfat/super.c   | 4 ++--
>  fs/ext4/super.c    | 4 ++--
>  fs/f2fs/super.c    | 4 ++--
>  fs/ntfs3/super.c   | 6 +++---
>  fs/super.c         | 4 ++--
>  fs/xfs/xfs_super.c | 7 ++++---
>  include/linux/fs.h | 7 ++++++-
>  7 files changed, 21 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/exfat/super.c b/fs/exfat/super.c
> index 7ed858937d45..a0e11166b194 100644
> --- a/fs/exfat/super.c
> +++ b/fs/exfat/super.c
> @@ -172,7 +172,7 @@ int exfat_force_shutdown(struct super_block *sb, u32 flags)
>  	return 0;
>  }
>  
> -static void exfat_shutdown(struct super_block *sb)
> +static void exfat_remove_bdev(struct super_block *sb, struct block_device *bdev)
>  {
>  	exfat_force_shutdown(sb, EXFAT_GOING_DOWN_NOSYNC);
>  }
> @@ -202,7 +202,7 @@ static const struct super_operations exfat_sops = {
>  	.put_super	= exfat_put_super,
>  	.statfs		= exfat_statfs,
>  	.show_options	= exfat_show_options,
> -	.shutdown	= exfat_shutdown,
> +	.remove_bdev	= exfat_remove_bdev,
>  };
>  
>  enum {
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index c7d39da7e733..d75b416401ae 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1456,7 +1456,7 @@ static void ext4_destroy_inode(struct inode *inode)
>  			 EXT4_I(inode)->i_reserved_data_blocks);
>  }
>  
> -static void ext4_shutdown(struct super_block *sb)
> +static void ext4_remove_bdev(struct super_block *sb, struct block_device *bdev)
>  {
>         ext4_force_shutdown(sb, EXT4_GOING_FLAGS_NOLOGFLUSH);
>  }
> @@ -1620,7 +1620,7 @@ static const struct super_operations ext4_sops = {
>  	.unfreeze_fs	= ext4_unfreeze,
>  	.statfs		= ext4_statfs,
>  	.show_options	= ext4_show_options,
> -	.shutdown	= ext4_shutdown,
> +	.remove_bdev	= ext4_remove_bdev,
>  #ifdef CONFIG_QUOTA
>  	.quota_read	= ext4_quota_read,
>  	.quota_write	= ext4_quota_write,
> diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
> index bbf1dad6843f..8667af9f76e4 100644
> --- a/fs/f2fs/super.c
> +++ b/fs/f2fs/super.c
> @@ -2640,7 +2640,7 @@ static int f2fs_remount(struct super_block *sb, int *flags, char *data)
>  	return err;
>  }
>  
> -static void f2fs_shutdown(struct super_block *sb)
> +static void f2fs_remove_bdev(struct super_block *sb, struct block_device *bdev)
>  {
>  	f2fs_do_shutdown(F2FS_SB(sb), F2FS_GOING_DOWN_NOSYNC, false, false);
>  }
> @@ -3264,7 +3264,7 @@ static const struct super_operations f2fs_sops = {
>  	.unfreeze_fs	= f2fs_unfreeze,
>  	.statfs		= f2fs_statfs,
>  	.remount_fs	= f2fs_remount,
> -	.shutdown	= f2fs_shutdown,
> +	.remove_bdev	= f2fs_remove_bdev,
>  };
>  
>  #ifdef CONFIG_FS_ENCRYPTION
> diff --git a/fs/ntfs3/super.c b/fs/ntfs3/super.c
> index 920a1ab47b63..3e69dc805e3a 100644
> --- a/fs/ntfs3/super.c
> +++ b/fs/ntfs3/super.c
> @@ -762,9 +762,9 @@ static int ntfs_show_options(struct seq_file *m, struct dentry *root)
>  }
>  
>  /*
> - * ntfs_shutdown - super_operations::shutdown
> + * ntfs_remove_bdev - super_operations::remove_bdev
>   */
> -static void ntfs_shutdown(struct super_block *sb)
> +static void ntfs_remove_bdev(struct super_block *sb, struct block_device *bdev)
>  {
>  	set_bit(NTFS_FLAGS_SHUTDOWN_BIT, &ntfs_sb(sb)->flags);
>  }
> @@ -821,7 +821,7 @@ static const struct super_operations ntfs_sops = {
>  	.put_super = ntfs_put_super,
>  	.statfs = ntfs_statfs,
>  	.show_options = ntfs_show_options,
> -	.shutdown = ntfs_shutdown,
> +	.remove_bdev = ntfs_remove_bdev,
>  	.sync_fs = ntfs_sync_fs,
>  	.write_inode = ntfs3_write_inode,
>  };
> diff --git a/fs/super.c b/fs/super.c
> index 80418ca8e215..c972efb38f6a 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1463,8 +1463,8 @@ static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
>  		sync_filesystem(sb);
>  	shrink_dcache_sb(sb);
>  	evict_inodes(sb);
> -	if (sb->s_op->shutdown)
> -		sb->s_op->shutdown(sb);
> +	if (sb->s_op->remove_bdev)
> +		sb->s_op->remove_bdev(sb, bdev);
>  
>  	super_unlock_shared(sb);
>  }
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 0bc4b5489078..8e307b036133 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1276,8 +1276,9 @@ xfs_fs_free_cached_objects(
>  }
>  
>  static void
> -xfs_fs_shutdown(
> -	struct super_block	*sb)
> +xfs_fs_remove_bdev(
> +	struct super_block	*sb,
> +	struct block_device	*bdev)
>  {
>  	xfs_force_shutdown(XFS_M(sb), SHUTDOWN_DEVICE_REMOVED);
>  }
> @@ -1308,7 +1309,7 @@ static const struct super_operations xfs_super_operations = {
>  	.show_options		= xfs_fs_show_options,
>  	.nr_cached_objects	= xfs_fs_nr_cached_objects,
>  	.free_cached_objects	= xfs_fs_free_cached_objects,
> -	.shutdown		= xfs_fs_shutdown,
> +	.remove_bdev		= xfs_fs_remove_bdev,
>  	.show_stats		= xfs_fs_show_stats,
>  };
>  
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index b085f161ed22..b08af63d2d4f 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2367,7 +2367,12 @@ struct super_operations {
>  				  struct shrink_control *);
>  	long (*free_cached_objects)(struct super_block *,
>  				    struct shrink_control *);
> -	void (*shutdown)(struct super_block *sb);
> +	/*
> +	 * Called when block device @bdev belonging to @sb is removed.
> +	 *
> +	 * If the fs can't afford the device loss, it should be shutdown.
> +	 */
> +	void (*remove_bdev)(struct super_block *sb, struct block_device *bdev);
>  };
>  
>  /*
> -- 
> 2.50.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 3/6] btrfs: reject file operations if in shutdown state
  2025-07-04  0:42 ` [PATCH v4 3/6] btrfs: reject file operations if in shutdown state Qu Wenruo
@ 2025-07-05 14:10   ` David Sterba
  0 siblings, 0 replies; 42+ messages in thread
From: David Sterba @ 2025-07-05 14:10 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, linux-fsdevel, viro, brauner, jack

On Fri, Jul 04, 2025 at 10:12:31AM +0930, Qu Wenruo wrote:
> This includes the following callbacks of file_operations:
> 
> - read_iter()
> - write_iter()
> - mmap()
> - open()
> - remap_file_range()
> - uring_cmd()
> - splice_read()
>   This requires a small wrapper to do the extra shutdown check, then call
>   the regular filemap_splice_read() function
> 
> This should reject most of the file operations on a shutdown btrfs.
> 
> The callback ioctl() is intentionally skipped, as ext4 doesn't do the
> shutdown check on ioctl() either, thus I believe there is some special
> require for ioctl() callback even if the fs is fully shutdown.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/file.c    | 25 ++++++++++++++++++++++++-
>  fs/btrfs/ioctl.c   |  3 +++
>  fs/btrfs/reflink.c |  3 +++
>  3 files changed, 30 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 05b046c6806f..cb7d1d53fc13 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1419,6 +1419,8 @@ ssize_t btrfs_do_write_iter(struct kiocb *iocb, struct iov_iter *from,
>  	struct btrfs_inode *inode = BTRFS_I(file_inode(file));
>  	ssize_t num_written, num_sync;
>  
> +	if (unlikely(btrfs_is_shutdown(inode->root->fs_info)))

This looks like a repetitive pattern, it would be better to do something
like

#define IS_SHUTDOWN(fs_info) (unlikely(btrfs_is_shutdown(fs_info))

Eventually we can use _Generic to pick the fs_info from the most
commonly used types, like inode or root.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 5/6] btrfs: implement shutdown ioctl
  2025-07-04  0:42 ` [PATCH v4 5/6] btrfs: implement shutdown ioctl Qu Wenruo
@ 2025-07-05 14:22   ` David Sterba
  2025-07-06  3:37     ` Qu Wenruo
  0 siblings, 1 reply; 42+ messages in thread
From: David Sterba @ 2025-07-05 14:22 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, linux-fsdevel, viro, brauner, jack

On Fri, Jul 04, 2025 at 10:12:33AM +0930, Qu Wenruo wrote:
> The shutdown ioctl should follow the XFS one, which use magic number 'X',
> and ioctl number 125, with a u32 as flags.
> 
> For now btrfs don't distinguish DEFAULT and LOGFLUSH flags (just like
> f2fs), both will freeze the fs first (implies committing the current
> transaction), setting the SHUTDOWN flag and finally thaw the fs.
> 
> For NOLOGFLUSH flag, the freeze/thaw part is skipped thus the current
> transaction is aborted.
> 
> The new shutdown ioctl is hidden behind experimental features for more
> testing.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/ioctl.c           | 40 ++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/btrfs.h |  9 +++++++++
>  2 files changed, 49 insertions(+)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 2f3b7be13bea..94eb7a8499db 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -5194,6 +5194,36 @@ static int btrfs_ioctl_subvol_sync(struct btrfs_fs_info *fs_info, void __user *a
>  	return 0;
>  }
>  
> +#ifdef CONFIG_BTRFS_EXPERIMENTAL
> +static int btrfs_emergency_shutdown(struct btrfs_fs_info *fs_info, u32 flags)
> +{
> +	int ret = 0;
> +
> +	if (flags >= BTRFS_SHUTDOWN_FLAGS_LAST)
> +		return -EINVAL;
> +
> +	if (btrfs_is_shutdown(fs_info))
> +		return 0;
> +
> +	switch (flags) {
> +	case BTRFS_SHUTDOWN_FLAGS_LOGFLUSH:
> +	case BTRFS_SHUTDOWN_FLAGS_DEFAULT:
> +		ret = freeze_super(fs_info->sb, FREEZE_HOLDER_KERNEL, NULL);

Recently I've looked at scrub blocking filesystem freezing and it does
not work because it blocks on the semaphore taken in mnt_want_write,
also taken in freeze_super().

I have an idea for fix, basically pause scrub, undo mnt_want_write
and then call freeze_super. So we'll need that too for shutdown. Once
implemented the fixup would be to use btrfs_freeze_super callback here.

> +		if (ret)
> +			return ret;
> +		btrfs_force_shutdown(fs_info);
> +		ret = thaw_super(fs_info->sb, FREEZE_HOLDER_KERNEL, NULL);
> +		if (ret)
> +			return ret;
> +		break;
> +	case BTRFS_SHUTDOWN_FLAGS_NOLOGFLUSH:
> +		btrfs_force_shutdown(fs_info);
> +		break;
> +	}
> +	return ret;
> +}
> +#endif
> +
>  long btrfs_ioctl(struct file *file, unsigned int
>  		cmd, unsigned long arg)
>  {

> --- a/include/uapi/linux/btrfs.h
> +++ b/include/uapi/linux/btrfs.h
> @@ -1096,6 +1096,12 @@ enum btrfs_err_code {
>  	BTRFS_ERROR_DEV_RAID1C4_MIN_NOT_MET,
>  };
>  
> +/* Flags for IOC_SHUTDOWN, should match XFS' flags. */
> +#define BTRFS_SHUTDOWN_FLAGS_DEFAULT	0x0
> +#define BTRFS_SHUTDOWN_FLAGS_LOGFLUSH	0x1
> +#define BTRFS_SHUTDOWN_FLAGS_NOLOGFLUSH	0x2
> +#define BTRFS_SHUTDOWN_FLAGS_LAST	0x3
> +
>  #define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \
>  				   struct btrfs_ioctl_vol_args)
>  #define BTRFS_IOC_DEFRAG _IOW(BTRFS_IOCTL_MAGIC, 2, \
> @@ -1217,6 +1223,9 @@ enum btrfs_err_code {
>  #define BTRFS_IOC_SUBVOL_SYNC_WAIT _IOW(BTRFS_IOCTL_MAGIC, 65, \
>  					struct btrfs_ioctl_subvol_wait)
>  
> +/* Shutdown ioctl should follow XFS's interfaces, thus not using btrfs magic. */
> +#define BTRFS_IOC_SHUTDOWN	_IOR('X', 125, __u32)

In XFS it's

#define XFS_IOC_GOINGDOWN            _IOR ('X', 125, uint32_t)

It's right to use the same definition and ioctl value as this will
be a generic ioctl eventually, with 3 users at least. I like the name
SHUTDOWN better, ext4 also uses that.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 5/6] btrfs: implement shutdown ioctl
  2025-07-05 14:22   ` David Sterba
@ 2025-07-06  3:37     ` Qu Wenruo
  2025-07-07 20:51       ` David Sterba
  0 siblings, 1 reply; 42+ messages in thread
From: Qu Wenruo @ 2025-07-06  3:37 UTC (permalink / raw)
  To: dsterba; +Cc: linux-btrfs, linux-fsdevel, viro, brauner, jack



在 2025/7/5 23:52, David Sterba 写道:
> On Fri, Jul 04, 2025 at 10:12:33AM +0930, Qu Wenruo wrote:
>> The shutdown ioctl should follow the XFS one, which use magic number 'X',
>> and ioctl number 125, with a u32 as flags.
>>
>> For now btrfs don't distinguish DEFAULT and LOGFLUSH flags (just like
>> f2fs), both will freeze the fs first (implies committing the current
>> transaction), setting the SHUTDOWN flag and finally thaw the fs.
>>
>> For NOLOGFLUSH flag, the freeze/thaw part is skipped thus the current
>> transaction is aborted.
>>
>> The new shutdown ioctl is hidden behind experimental features for more
>> testing.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/ioctl.c           | 40 ++++++++++++++++++++++++++++++++++++++
>>   include/uapi/linux/btrfs.h |  9 +++++++++
>>   2 files changed, 49 insertions(+)
>>
>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>> index 2f3b7be13bea..94eb7a8499db 100644
>> --- a/fs/btrfs/ioctl.c
>> +++ b/fs/btrfs/ioctl.c
>> @@ -5194,6 +5194,36 @@ static int btrfs_ioctl_subvol_sync(struct btrfs_fs_info *fs_info, void __user *a
>>   	return 0;
>>   }
>>   
>> +#ifdef CONFIG_BTRFS_EXPERIMENTAL
>> +static int btrfs_emergency_shutdown(struct btrfs_fs_info *fs_info, u32 flags)
>> +{
>> +	int ret = 0;
>> +
>> +	if (flags >= BTRFS_SHUTDOWN_FLAGS_LAST)
>> +		return -EINVAL;
>> +
>> +	if (btrfs_is_shutdown(fs_info))
>> +		return 0;
>> +
>> +	switch (flags) {
>> +	case BTRFS_SHUTDOWN_FLAGS_LOGFLUSH:
>> +	case BTRFS_SHUTDOWN_FLAGS_DEFAULT:
>> +		ret = freeze_super(fs_info->sb, FREEZE_HOLDER_KERNEL, NULL);
> 
> Recently I've looked at scrub blocking filesystem freezing and it does
> not work because it blocks on the semaphore taken in mnt_want_write,
> also taken in freeze_super().
> 
> I have an idea for fix, basically pause scrub, undo mnt_want_write
> and then call freeze_super. So we'll need that too for shutdown. Once
> implemented the fixup would be to use btrfs_freeze_super callback here.

It may not be that simple.

freeze_super() itself is doing extra works related to the 
stage/freeze_owner/etc.

I'm not sure if it's a good idea to completely skip that part.

I'd prefer scrub to check the frozen stage, and if it's already in any 
FREEZE stages, exit early.

Thanks,
Qu

> 
>> +		if (ret)
>> +			return ret;
>> +		btrfs_force_shutdown(fs_info);
>> +		ret = thaw_super(fs_info->sb, FREEZE_HOLDER_KERNEL, NULL);
>> +		if (ret)
>> +			return ret;
>> +		break;
>> +	case BTRFS_SHUTDOWN_FLAGS_NOLOGFLUSH:
>> +		btrfs_force_shutdown(fs_info);
>> +		break;
>> +	}
>> +	return ret;
>> +}
>> +#endif
>> +
>>   long btrfs_ioctl(struct file *file, unsigned int
>>   		cmd, unsigned long arg)
>>   {
> 
>> --- a/include/uapi/linux/btrfs.h
>> +++ b/include/uapi/linux/btrfs.h
>> @@ -1096,6 +1096,12 @@ enum btrfs_err_code {
>>   	BTRFS_ERROR_DEV_RAID1C4_MIN_NOT_MET,
>>   };
>>   
>> +/* Flags for IOC_SHUTDOWN, should match XFS' flags. */
>> +#define BTRFS_SHUTDOWN_FLAGS_DEFAULT	0x0
>> +#define BTRFS_SHUTDOWN_FLAGS_LOGFLUSH	0x1
>> +#define BTRFS_SHUTDOWN_FLAGS_NOLOGFLUSH	0x2
>> +#define BTRFS_SHUTDOWN_FLAGS_LAST	0x3
>> +
>>   #define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \
>>   				   struct btrfs_ioctl_vol_args)
>>   #define BTRFS_IOC_DEFRAG _IOW(BTRFS_IOCTL_MAGIC, 2, \
>> @@ -1217,6 +1223,9 @@ enum btrfs_err_code {
>>   #define BTRFS_IOC_SUBVOL_SYNC_WAIT _IOW(BTRFS_IOCTL_MAGIC, 65, \
>>   					struct btrfs_ioctl_subvol_wait)
>>   
>> +/* Shutdown ioctl should follow XFS's interfaces, thus not using btrfs magic. */
>> +#define BTRFS_IOC_SHUTDOWN	_IOR('X', 125, __u32)
> 
> In XFS it's
> 
> #define XFS_IOC_GOINGDOWN            _IOR ('X', 125, uint32_t)
> 
> It's right to use the same definition and ioctl value as this will
> be a generic ioctl eventually, with 3 users at least. I like the name
> SHUTDOWN better, ext4 also uses that.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 5/6] btrfs: implement shutdown ioctl
  2025-07-06  3:37     ` Qu Wenruo
@ 2025-07-07 20:51       ` David Sterba
  2025-07-07 23:04         ` Qu Wenruo
  0 siblings, 1 reply; 42+ messages in thread
From: David Sterba @ 2025-07-07 20:51 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, linux-btrfs, linux-fsdevel, viro, brauner, jack

On Sun, Jul 06, 2025 at 01:07:19PM +0930, Qu Wenruo wrote:
> 在 2025/7/5 23:52, David Sterba 写道:
> >> +static int btrfs_emergency_shutdown(struct btrfs_fs_info *fs_info, u32 flags)
> >> +{
> >> +	int ret = 0;
> >> +
> >> +	if (flags >= BTRFS_SHUTDOWN_FLAGS_LAST)
> >> +		return -EINVAL;
> >> +
> >> +	if (btrfs_is_shutdown(fs_info))
> >> +		return 0;
> >> +
> >> +	switch (flags) {
> >> +	case BTRFS_SHUTDOWN_FLAGS_LOGFLUSH:
> >> +	case BTRFS_SHUTDOWN_FLAGS_DEFAULT:
> >> +		ret = freeze_super(fs_info->sb, FREEZE_HOLDER_KERNEL, NULL);
> > 
> > Recently I've looked at scrub blocking filesystem freezing and it does
> > not work because it blocks on the semaphore taken in mnt_want_write,
> > also taken in freeze_super().
> > 
> > I have an idea for fix, basically pause scrub, undo mnt_want_write
> > and then call freeze_super. So we'll need that too for shutdown. Once
> > implemented the fixup would be to use btrfs_freeze_super callback here.
> 
> It may not be that simple.
> 
> freeze_super() itself is doing extra works related to the 
> stage/freeze_owner/etc.
> 
> I'm not sure if it's a good idea to completely skip that part.
> 
> I'd prefer scrub to check the frozen stage, and if it's already in any 
> FREEZE stages, exit early.

I have working prototype for pausing scrub that does not need to exit,
so far I've tested it with fsfreeze in a VM, I still need to test actual
freezing for suspend purposes.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-04  0:42 ` [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev() Qu Wenruo
  2025-07-04  9:00   ` (subset) " Christian Brauner
  2025-07-04  9:05   ` Jan Kara
@ 2025-07-07 23:02   ` Dave Chinner
  2025-07-07 23:22     ` Qu Wenruo
  2 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2025-07-07 23:02 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: linux-btrfs, linux-fsdevel, viro, brauner, jack, linux-ext4,
	linux-f2fs-devel, ntfs3, linux-xfs

On Fri, Jul 04, 2025 at 10:12:29AM +0930, Qu Wenruo wrote:
> Currently all the filesystems implementing the
> super_opearations::shutdown() callback can not afford losing a device.
> 
> Thus fs_bdev_mark_dead() will just call the shutdown() callback for the
> involved filesystem.
> 
> But it will no longer be the case, with multi-device filesystems like
> btrfs and bcachefs the filesystem can handle certain device loss without
> shutting down the whole filesystem.
> 
> To allow those multi-device filesystems to be integrated to use
> fs_holder_ops:
> 
> - Replace super_opearation::shutdown() with
>   super_opearations::remove_bdev()
>   To better describe when the callback is called.

This conflates cause with action.

The shutdown callout is an action that the filesystem must execute,
whilst "remove bdev" is a cause notification that might require an
action to be take.

Yes, the cause could be someone doing hot-unplug of the block
device, but it could also be something going wrong in software
layers below the filesystem. e.g. dm-thinp having an unrecoverable
corruption or ENOSPC errors.

We already have a "cause" notification: blk_holder_ops->mark_dead().

The generic fs action that is taken by this notification is
fs_bdev_mark_dead().  That action is to invalidate caches and shut
down the filesystem.

btrfs needs to do something different to a blk_holder_ops->mark_dead
notification. i.e. it needs an action that is different to
fs_bdev_mark_dead().

Indeed, this is how bcachefs already handles "single device
died" events for multi-device filesystems - see
bch2_fs_bdev_mark_dead().

Hence Btrfs should be doing the same thing as bcachefs. The
bdev_handle_ops structure exists precisly because it allows the
filesystem to handle block device events in the exact manner they
require....

> - Add a new @bdev parameter to remove_bdev() callback
>   To allow the fs to determine which device is missing, and do the
>   proper handling when needed.
> 
> For the existing shutdown callback users, the change is minimal.

Except for the change in API semantics. ->shutdown is an external
shutdown trigger for the filesystem, not a generic "block device
removed" notification.

Hooking blk_holder_ops->mark_dead means that btrfs can also provide
a ->shutdown implementation for when something external other than a
block device removal needs to shut down the filesystem....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 5/6] btrfs: implement shutdown ioctl
  2025-07-07 20:51       ` David Sterba
@ 2025-07-07 23:04         ` Qu Wenruo
  2025-07-08  0:53           ` David Sterba
  0 siblings, 1 reply; 42+ messages in thread
From: Qu Wenruo @ 2025-07-07 23:04 UTC (permalink / raw)
  To: dsterba; +Cc: linux-btrfs, linux-fsdevel, viro, brauner, jack



在 2025/7/8 06:21, David Sterba 写道:
> On Sun, Jul 06, 2025 at 01:07:19PM +0930, Qu Wenruo wrote:
>> 在 2025/7/5 23:52, David Sterba 写道:
>>>> +static int btrfs_emergency_shutdown(struct btrfs_fs_info *fs_info, u32 flags)
>>>> +{
>>>> +	int ret = 0;
>>>> +
>>>> +	if (flags >= BTRFS_SHUTDOWN_FLAGS_LAST)
>>>> +		return -EINVAL;
>>>> +
>>>> +	if (btrfs_is_shutdown(fs_info))
>>>> +		return 0;
>>>> +
>>>> +	switch (flags) {
>>>> +	case BTRFS_SHUTDOWN_FLAGS_LOGFLUSH:
>>>> +	case BTRFS_SHUTDOWN_FLAGS_DEFAULT:
>>>> +		ret = freeze_super(fs_info->sb, FREEZE_HOLDER_KERNEL, NULL);
>>>
>>> Recently I've looked at scrub blocking filesystem freezing and it does
>>> not work because it blocks on the semaphore taken in mnt_want_write,
>>> also taken in freeze_super().
>>>
>>> I have an idea for fix, basically pause scrub, undo mnt_want_write
>>> and then call freeze_super. So we'll need that too for shutdown. Once
>>> implemented the fixup would be to use btrfs_freeze_super callback here.
>>
>> It may not be that simple.
>>
>> freeze_super() itself is doing extra works related to the
>> stage/freeze_owner/etc.
>>
>> I'm not sure if it's a good idea to completely skip that part.
>>
>> I'd prefer scrub to check the frozen stage, and if it's already in any
>> FREEZE stages, exit early.
> 
> I have working prototype for pausing scrub that does not need to exit,
> so far I've tested it with fsfreeze in a VM, I still need to test actual
> freezing for suspend purposes.

Not sure how would you test with running scrub and freeze, but please 
enable lockdep for the potential reversed lock sequence related to btrfs 
specific locks and s_umount/s_writers.rw_sem.

But I guess we should have a test case utilizing freeze and scrub.
Especially that fsstress doesn't include freeze, thus we have to 
manually do scrub and freeze (maybe with fsstress at the background).

We need such test case no matter if we allow scrub to be paused/canceled.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-07 23:02   ` Dave Chinner
@ 2025-07-07 23:22     ` Qu Wenruo
  2025-07-08  0:45       ` Darrick J. Wong
  0 siblings, 1 reply; 42+ messages in thread
From: Qu Wenruo @ 2025-07-07 23:22 UTC (permalink / raw)
  To: Dave Chinner, Qu Wenruo
  Cc: linux-btrfs, linux-fsdevel, viro, brauner, jack, linux-ext4,
	linux-f2fs-devel, ntfs3, linux-xfs



在 2025/7/8 08:32, Dave Chinner 写道:
> On Fri, Jul 04, 2025 at 10:12:29AM +0930, Qu Wenruo wrote:
>> Currently all the filesystems implementing the
>> super_opearations::shutdown() callback can not afford losing a device.
>>
>> Thus fs_bdev_mark_dead() will just call the shutdown() callback for the
>> involved filesystem.
>>
>> But it will no longer be the case, with multi-device filesystems like
>> btrfs and bcachefs the filesystem can handle certain device loss without
>> shutting down the whole filesystem.
>>
>> To allow those multi-device filesystems to be integrated to use
>> fs_holder_ops:
>>
>> - Replace super_opearation::shutdown() with
>>    super_opearations::remove_bdev()
>>    To better describe when the callback is called.
> 
> This conflates cause with action.
> 
> The shutdown callout is an action that the filesystem must execute,
> whilst "remove bdev" is a cause notification that might require an
> action to be take.
> 
> Yes, the cause could be someone doing hot-unplug of the block
> device, but it could also be something going wrong in software
> layers below the filesystem. e.g. dm-thinp having an unrecoverable
> corruption or ENOSPC errors.
> 
> We already have a "cause" notification: blk_holder_ops->mark_dead().
> 
> The generic fs action that is taken by this notification is
> fs_bdev_mark_dead().  That action is to invalidate caches and shut
> down the filesystem.
> 
> btrfs needs to do something different to a blk_holder_ops->mark_dead
> notification. i.e. it needs an action that is different to
> fs_bdev_mark_dead().
> 
> Indeed, this is how bcachefs already handles "single device
> died" events for multi-device filesystems - see
> bch2_fs_bdev_mark_dead().

I do not think it's the correct way to go, especially when there is 
already fs_holder_ops.

We're always going towards a more generic solution, other than letting 
the individual fs to do the same thing slightly differently.

Yes, the naming is not perfect and mixing cause and action, but the end 
result is still a more generic and less duplicated code base.

> 
> Hence Btrfs should be doing the same thing as bcachefs. The
> bdev_handle_ops structure exists precisly because it allows the
> filesystem to handle block device events in the exact manner they
> require....
> 
>> - Add a new @bdev parameter to remove_bdev() callback
>>    To allow the fs to determine which device is missing, and do the
>>    proper handling when needed.
>>
>> For the existing shutdown callback users, the change is minimal.
> 
> Except for the change in API semantics. ->shutdown is an external
> shutdown trigger for the filesystem, not a generic "block device
> removed" notification.

The problem is, there is no one utilizing ->shutdown() out of 
fs_bdev_mark_dead().

If shutdown ioctl is handled through super_operations::shutdown, it will 
be more meaningful to split shutdown and dev removal.

But that's not the case, and different fses even have slightly different 
handling for the shutdown flags (not all fses even utilize journal to 
protect their metadata).

Thanks,
Qu


> 
> Hooking blk_holder_ops->mark_dead means that btrfs can also provide
> a ->shutdown implementation for when something external other than a
> block device removal needs to shut down the filesystem....
> 
> -Dave.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-07 23:22     ` Qu Wenruo
@ 2025-07-08  0:45       ` Darrick J. Wong
  2025-07-08  2:09         ` Qu Wenruo
                           ` (3 more replies)
  0 siblings, 4 replies; 42+ messages in thread
From: Darrick J. Wong @ 2025-07-08  0:45 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Dave Chinner, Qu Wenruo, linux-btrfs, linux-fsdevel, viro,
	brauner, jack, linux-ext4, linux-f2fs-devel, ntfs3, linux-xfs

On Tue, Jul 08, 2025 at 08:52:47AM +0930, Qu Wenruo wrote:
> 
> 
> 在 2025/7/8 08:32, Dave Chinner 写道:
> > On Fri, Jul 04, 2025 at 10:12:29AM +0930, Qu Wenruo wrote:
> > > Currently all the filesystems implementing the
> > > super_opearations::shutdown() callback can not afford losing a device.
> > > 
> > > Thus fs_bdev_mark_dead() will just call the shutdown() callback for the
> > > involved filesystem.
> > > 
> > > But it will no longer be the case, with multi-device filesystems like
> > > btrfs and bcachefs the filesystem can handle certain device loss without
> > > shutting down the whole filesystem.
> > > 
> > > To allow those multi-device filesystems to be integrated to use
> > > fs_holder_ops:
> > > 
> > > - Replace super_opearation::shutdown() with
> > >    super_opearations::remove_bdev()
> > >    To better describe when the callback is called.
> > 
> > This conflates cause with action.
> > 
> > The shutdown callout is an action that the filesystem must execute,
> > whilst "remove bdev" is a cause notification that might require an
> > action to be take.
> > 
> > Yes, the cause could be someone doing hot-unplug of the block
> > device, but it could also be something going wrong in software
> > layers below the filesystem. e.g. dm-thinp having an unrecoverable
> > corruption or ENOSPC errors.
> > 
> > We already have a "cause" notification: blk_holder_ops->mark_dead().
> > 
> > The generic fs action that is taken by this notification is
> > fs_bdev_mark_dead().  That action is to invalidate caches and shut
> > down the filesystem.
> > 
> > btrfs needs to do something different to a blk_holder_ops->mark_dead
> > notification. i.e. it needs an action that is different to
> > fs_bdev_mark_dead().
> > 
> > Indeed, this is how bcachefs already handles "single device
> > died" events for multi-device filesystems - see
> > bch2_fs_bdev_mark_dead().
> 
> I do not think it's the correct way to go, especially when there is already
> fs_holder_ops.
> 
> We're always going towards a more generic solution, other than letting the
> individual fs to do the same thing slightly differently.

On second thought -- it's weird that you'd flush the filesystem and
shrink the inode/dentry caches in a "your device went away" handler.
Fancy filesystems like bcachefs and btrfs would likely just shift IO to
a different bdev, right?  And there's no good reason to run shrinkers on
either of those fses, right?

> Yes, the naming is not perfect and mixing cause and action, but the end
> result is still a more generic and less duplicated code base.

I think dchinner makes a good point that if your filesystem can do
something clever on device removal, it should provide its own block
device holder ops instead of using fs_holder_ops.  I don't understand
why you need a "generic" solution for btrfs when it's not going to do
what the others do anyway.

Awkward naming is often a sign that further thought (or at least
separation of code) is needed.

As an aside:
'twould be nice if we could lift the *FS_IOC_SHUTDOWN dispatch out of
everyone's ioctl functions into the VFS, and then move the "I am dead"
state into super_block so that you could actually shut down any
filesystem, not just the seven that currently implement it.

--D

> > Hence Btrfs should be doing the same thing as bcachefs. The
> > bdev_handle_ops structure exists precisly because it allows the
> > filesystem to handle block device events in the exact manner they
> > require....
> > 
> > > - Add a new @bdev parameter to remove_bdev() callback
> > >    To allow the fs to determine which device is missing, and do the
> > >    proper handling when needed.
> > > 
> > > For the existing shutdown callback users, the change is minimal.
> > 
> > Except for the change in API semantics. ->shutdown is an external
> > shutdown trigger for the filesystem, not a generic "block device
> > removed" notification.
> 
> The problem is, there is no one utilizing ->shutdown() out of
> fs_bdev_mark_dead().
> 
> If shutdown ioctl is handled through super_operations::shutdown, it will be
> more meaningful to split shutdown and dev removal.
> 
> But that's not the case, and different fses even have slightly different
> handling for the shutdown flags (not all fses even utilize journal to
> protect their metadata).
> 
> Thanks,
> Qu
> 
> 
> > 
> > Hooking blk_holder_ops->mark_dead means that btrfs can also provide
> > a ->shutdown implementation for when something external other than a
> > block device removal needs to shut down the filesystem....
> > 
> > -Dave.
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 5/6] btrfs: implement shutdown ioctl
  2025-07-07 23:04         ` Qu Wenruo
@ 2025-07-08  0:53           ` David Sterba
  0 siblings, 0 replies; 42+ messages in thread
From: David Sterba @ 2025-07-08  0:53 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, linux-btrfs, linux-fsdevel, viro, brauner, jack

On Tue, Jul 08, 2025 at 08:34:09AM +0930, Qu Wenruo wrote:
> > I have working prototype for pausing scrub that does not need to exit,
> > so far I've tested it with fsfreeze in a VM, I still need to test actual
> > freezing for suspend purposes.
> 
> Not sure how would you test with running scrub and freeze, but please 
> enable lockdep for the potential reversed lock sequence related to btrfs 
> specific locks and s_umount/s_writers.rw_sem.

I used debug_show_held_locks(current) in the task and it showed only the
writers rwsem, once dropped there were no other locks held.

The test is a started scrub with a low bw limit so it takes long enough,
then either from script or manually I do fsfreeze.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-08  0:45       ` Darrick J. Wong
@ 2025-07-08  2:09         ` Qu Wenruo
  2025-07-08  3:06           ` Qu Wenruo
  2025-07-08  7:55         ` Christian Brauner
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 42+ messages in thread
From: Qu Wenruo @ 2025-07-08  2:09 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dave Chinner, Qu Wenruo, linux-btrfs, linux-fsdevel, viro,
	brauner, jack, linux-ext4, linux-f2fs-devel, ntfs3, linux-xfs



在 2025/7/8 10:15, Darrick J. Wong 写道:
[...]
>>
>> I do not think it's the correct way to go, especially when there is already
>> fs_holder_ops.
>>
>> We're always going towards a more generic solution, other than letting the
>> individual fs to do the same thing slightly differently.
> 
> On second thought -- it's weird that you'd flush the filesystem and
> shrink the inode/dentry caches in a "your device went away" handler.
> Fancy filesystems like bcachefs and btrfs would likely just shift IO to
> a different bdev, right?  And there's no good reason to run shrinkers on
> either of those fses, right?

That's right, some part of fs_bdev_mark_dead() is not making much sense 
if the fs can handle the dev loss.

> 
>> Yes, the naming is not perfect and mixing cause and action, but the end
>> result is still a more generic and less duplicated code base.
> 
> I think dchinner makes a good point that if your filesystem can do
> something clever on device removal, it should provide its own block
> device holder ops instead of using fs_holder_ops.

Then re-implement a lot of things like bdev_super_lock()?

I'd prefer not.


fs_holder_ops solves a lot of things like handling mounting/inactive 
fses, and pushing it back again to the fs code is just causing more 
duplication.

Not really worthy if we only want a single different behavior.

Thus I strongly prefer to do with the existing fs_holder_ops, no matter 
if it's using/renaming the shutdown() callback, or a new callback.

>  I don't understand
> why you need a "generic" solution for btrfs when it's not going to do
> what the others do anyway.

Because there is only one behavior different.

Other things like freezing/thawing/syncing are all the same.

Thanks,
Qu

> 
> Awkward naming is often a sign that further thought (or at least
> separation of code) is needed.
> 
> As an aside:
> 'twould be nice if we could lift the *FS_IOC_SHUTDOWN dispatch out of
> everyone's ioctl functions into the VFS, and then move the "I am dead"
> state into super_block so that you could actually shut down any
> filesystem, not just the seven that currently implement it.
> 
> --D
> 
>>> Hence Btrfs should be doing the same thing as bcachefs. The
>>> bdev_handle_ops structure exists precisly because it allows the
>>> filesystem to handle block device events in the exact manner they
>>> require....
>>>
>>>> - Add a new @bdev parameter to remove_bdev() callback
>>>>     To allow the fs to determine which device is missing, and do the
>>>>     proper handling when needed.
>>>>
>>>> For the existing shutdown callback users, the change is minimal.
>>>
>>> Except for the change in API semantics. ->shutdown is an external
>>> shutdown trigger for the filesystem, not a generic "block device
>>> removed" notification.
>>
>> The problem is, there is no one utilizing ->shutdown() out of
>> fs_bdev_mark_dead().
>>
>> If shutdown ioctl is handled through super_operations::shutdown, it will be
>> more meaningful to split shutdown and dev removal.
>>
>> But that's not the case, and different fses even have slightly different
>> handling for the shutdown flags (not all fses even utilize journal to
>> protect their metadata).
>>
>> Thanks,
>> Qu
>>
>>
>>>
>>> Hooking blk_holder_ops->mark_dead means that btrfs can also provide
>>> a ->shutdown implementation for when something external other than a
>>> block device removal needs to shut down the filesystem....
>>>
>>> -Dave.
>>
> 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-08  2:09         ` Qu Wenruo
@ 2025-07-08  3:06           ` Qu Wenruo
  2025-07-08  5:05             ` Dave Chinner
  0 siblings, 1 reply; 42+ messages in thread
From: Qu Wenruo @ 2025-07-08  3:06 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dave Chinner, Qu Wenruo, linux-btrfs, linux-fsdevel, viro,
	brauner, jack, linux-ext4, linux-f2fs-devel, ntfs3, linux-xfs



在 2025/7/8 11:39, Qu Wenruo 写道:
> 
> 
> 在 2025/7/8 10:15, Darrick J. Wong 写道:
> [...]
>>>
>>> I do not think it's the correct way to go, especially when there is 
>>> already
>>> fs_holder_ops.
>>>
>>> We're always going towards a more generic solution, other than 
>>> letting the
>>> individual fs to do the same thing slightly differently.
>>
>> On second thought -- it's weird that you'd flush the filesystem and
>> shrink the inode/dentry caches in a "your device went away" handler.
>> Fancy filesystems like bcachefs and btrfs would likely just shift IO to
>> a different bdev, right?  And there's no good reason to run shrinkers on
>> either of those fses, right?
> 
> That's right, some part of fs_bdev_mark_dead() is not making much sense 
> if the fs can handle the dev loss.
> 
>>
>>> Yes, the naming is not perfect and mixing cause and action, but the end
>>> result is still a more generic and less duplicated code base.
>>
>> I think dchinner makes a good point that if your filesystem can do
>> something clever on device removal, it should provide its own block
>> device holder ops instead of using fs_holder_ops.
> 
> Then re-implement a lot of things like bdev_super_lock()?
> 
> I'd prefer not.
> 
> 
> fs_holder_ops solves a lot of things like handling mounting/inactive 
> fses, and pushing it back again to the fs code is just causing more 
> duplication.
> 
> Not really worthy if we only want a single different behavior.
> 
> Thus I strongly prefer to do with the existing fs_holder_ops, no matter 
> if it's using/renaming the shutdown() callback, or a new callback.

Previously Christoph is against a new ->remove_bdev() callback, as it is 
conflicting with the existing ->shutdown().

So what about a new ->handle_bdev_remove() callback, that we do 
something like this inside fs_bdev_mark_dead():

{
	bdev_super_lock();
	if (!surprise)
		sync_filesystem();

	if (s_op->handle_bdev_remove) {
		ret = s_op->handle_bdev_remove();
		if (!ret) {
			super_unlock_shared();
			return;
		}
	}
	shrink_dcache_sb();
	evict_inodes();
	if (s_op->shutdown)
		s_op->shutdown();
}

So that the new ->handle_bdev_remove() is not conflicting with
->shutdown() but an optional one.

And if the fs can not handle the removal, just let
->handle_bdev_remove() return an error so that we fallback to the 
existing shutdown routine.

Would this be more acceptable?

Thanks,
Qu

> 
>>  I don't understand
>> why you need a "generic" solution for btrfs when it's not going to do
>> what the others do anyway.
> 
> Because there is only one behavior different.
> 
> Other things like freezing/thawing/syncing are all the same.
> 
> Thanks,
> Qu
> 
>>
>> Awkward naming is often a sign that further thought (or at least
>> separation of code) is needed.
>>
>> As an aside:
>> 'twould be nice if we could lift the *FS_IOC_SHUTDOWN dispatch out of
>> everyone's ioctl functions into the VFS, and then move the "I am dead"
>> state into super_block so that you could actually shut down any
>> filesystem, not just the seven that currently implement it.
>>
>> --D
>>
>>>> Hence Btrfs should be doing the same thing as bcachefs. The
>>>> bdev_handle_ops structure exists precisly because it allows the
>>>> filesystem to handle block device events in the exact manner they
>>>> require....
>>>>
>>>>> - Add a new @bdev parameter to remove_bdev() callback
>>>>>     To allow the fs to determine which device is missing, and do the
>>>>>     proper handling when needed.
>>>>>
>>>>> For the existing shutdown callback users, the change is minimal.
>>>>
>>>> Except for the change in API semantics. ->shutdown is an external
>>>> shutdown trigger for the filesystem, not a generic "block device
>>>> removed" notification.
>>>
>>> The problem is, there is no one utilizing ->shutdown() out of
>>> fs_bdev_mark_dead().
>>>
>>> If shutdown ioctl is handled through super_operations::shutdown, it 
>>> will be
>>> more meaningful to split shutdown and dev removal.
>>>
>>> But that's not the case, and different fses even have slightly different
>>> handling for the shutdown flags (not all fses even utilize journal to
>>> protect their metadata).
>>>
>>> Thanks,
>>> Qu
>>>
>>>
>>>>
>>>> Hooking blk_holder_ops->mark_dead means that btrfs can also provide
>>>> a ->shutdown implementation for when something external other than a
>>>> block device removal needs to shut down the filesystem....
>>>>
>>>> -Dave.
>>>
>>
> 
> 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-08  3:06           ` Qu Wenruo
@ 2025-07-08  5:05             ` Dave Chinner
  2025-07-08  5:41               ` Qu Wenruo
  0 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2025-07-08  5:05 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Darrick J. Wong, Qu Wenruo, linux-btrfs, linux-fsdevel, viro,
	brauner, jack, linux-ext4, linux-f2fs-devel, ntfs3, linux-xfs

On Tue, Jul 08, 2025 at 12:36:48PM +0930, Qu Wenruo wrote:
> 在 2025/7/8 11:39, Qu Wenruo 写道:
> > 在 2025/7/8 10:15, Darrick J. Wong 写道:
> > > > Yes, the naming is not perfect and mixing cause and action, but the end
> > > > result is still a more generic and less duplicated code base.
> > > 
> > > I think dchinner makes a good point that if your filesystem can do
> > > something clever on device removal, it should provide its own block
> > > device holder ops instead of using fs_holder_ops.
> > 
> > Then re-implement a lot of things like bdev_super_lock()?

IDGI. Simply add:

EXPORT_SYMBOL_GPL(get_bdev_super);

And the problem is solved.

> > I'd prefer not.
> > 
> > 
> > fs_holder_ops solves a lot of things like handling mounting/inactive
> > fses, and pushing it back again to the fs code is just causing more
> > duplication.

This is all encapsulated in get_bdev_super(), so btrfs doesn't need
to implement any of this. get_bdev_super/deactivate_super is the API
you should be using with the blk_holder_ops methods.

> > Not really worthy if we only want a single different behavior.

This is the *3rd* different behaviour for ->mark_dead. We
have the generic behaviour, the bcachefs behaviour, and now the
btrfs behaviour (whatever that may be).

> > Thus I strongly prefer to do with the existing fs_holder_ops, no matter
> > if it's using/renaming the shutdown() callback, or a new callback.
> 
> Previously Christoph is against a new ->remove_bdev() callback, as it is
> conflicting with the existing ->shutdown().
> 
> So what about a new ->handle_bdev_remove() callback, that we do something
> like this inside fs_bdev_mark_dead():
> 
> {
> 	bdev_super_lock();
> 	if (!surprise)
> 		sync_filesystem();
> 
> 	if (s_op->handle_bdev_remove) {
> 		ret = s_op->handle_bdev_remove();
> 		if (!ret) {
> 			super_unlock_shared();
> 			return;
> 		}
> 	}
> 	shrink_dcache_sb();
> 	evict_inodes();
> 	if (s_op->shutdown)
> 		s_op->shutdown();
> }
> 
> So that the new ->handle_bdev_remove() is not conflicting with
> ->shutdown() but an optional one.
> 
> And if the fs can not handle the removal, just let
> ->handle_bdev_remove() return an error so that we fallback to the existing
> shutdown routine.
> 
> Would this be more acceptable?

No.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-08  5:05             ` Dave Chinner
@ 2025-07-08  5:41               ` Qu Wenruo
  0 siblings, 0 replies; 42+ messages in thread
From: Qu Wenruo @ 2025-07-08  5:41 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, Qu Wenruo, linux-btrfs, linux-fsdevel, viro,
	brauner, jack, linux-ext4, linux-f2fs-devel, ntfs3, linux-xfs



在 2025/7/8 14:35, Dave Chinner 写道:
[...]
>>> Not really worthy if we only want a single different behavior.
> 
> This is the *3rd* different behaviour for ->mark_dead. We
> have the generic behaviour, the bcachefs behaviour, and now the
> btrfs behaviour (whatever that may be).

Then why not merging the btrfs/bcachefs callback into one generic 
callback? Other than introducing 3 different bdev_holder_ops?

This looks exactly the opposite what VFS is trying to do.

Thanks,
Qu

> 
>>> Thus I strongly prefer to do with the existing fs_holder_ops, no matter
>>> if it's using/renaming the shutdown() callback, or a new callback.
>>
>> Previously Christoph is against a new ->remove_bdev() callback, as it is
>> conflicting with the existing ->shutdown().
>>
>> So what about a new ->handle_bdev_remove() callback, that we do something
>> like this inside fs_bdev_mark_dead():
>>
>> {
>> 	bdev_super_lock();
>> 	if (!surprise)
>> 		sync_filesystem();
>>
>> 	if (s_op->handle_bdev_remove) {
>> 		ret = s_op->handle_bdev_remove();
>> 		if (!ret) {
>> 			super_unlock_shared();
>> 			return;
>> 		}
>> 	}
>> 	shrink_dcache_sb();
>> 	evict_inodes();
>> 	if (s_op->shutdown)
>> 		s_op->shutdown();
>> }
>>
>> So that the new ->handle_bdev_remove() is not conflicting with
>> ->shutdown() but an optional one.
>>
>> And if the fs can not handle the removal, just let
>> ->handle_bdev_remove() return an error so that we fallback to the existing
>> shutdown routine.
>>
>> Would this be more acceptable?
> 
> No.
> 
> -Dave.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-08  0:45       ` Darrick J. Wong
  2025-07-08  2:09         ` Qu Wenruo
@ 2025-07-08  7:55         ` Christian Brauner
  2025-07-08 22:59           ` Dave Chinner
  2025-07-10 10:54           ` Christoph Hellwig
  2025-07-08 10:20         ` Jan Kara
  2025-07-10 10:52         ` Christoph Hellwig
  3 siblings, 2 replies; 42+ messages in thread
From: Christian Brauner @ 2025-07-08  7:55 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Qu Wenruo, Dave Chinner, Qu Wenruo, linux-btrfs, linux-fsdevel,
	viro, jack, linux-ext4, linux-f2fs-devel, ntfs3, linux-xfs

On Mon, Jul 07, 2025 at 05:45:32PM -0700, Darrick J. Wong wrote:
> On Tue, Jul 08, 2025 at 08:52:47AM +0930, Qu Wenruo wrote:
> > 
> > 
> > 在 2025/7/8 08:32, Dave Chinner 写道:
> > > On Fri, Jul 04, 2025 at 10:12:29AM +0930, Qu Wenruo wrote:
> > > > Currently all the filesystems implementing the
> > > > super_opearations::shutdown() callback can not afford losing a device.
> > > > 
> > > > Thus fs_bdev_mark_dead() will just call the shutdown() callback for the
> > > > involved filesystem.
> > > > 
> > > > But it will no longer be the case, with multi-device filesystems like
> > > > btrfs and bcachefs the filesystem can handle certain device loss without
> > > > shutting down the whole filesystem.
> > > > 
> > > > To allow those multi-device filesystems to be integrated to use
> > > > fs_holder_ops:
> > > > 
> > > > - Replace super_opearation::shutdown() with
> > > >    super_opearations::remove_bdev()
> > > >    To better describe when the callback is called.
> > > 
> > > This conflates cause with action.
> > > 
> > > The shutdown callout is an action that the filesystem must execute,
> > > whilst "remove bdev" is a cause notification that might require an
> > > action to be take.
> > > 
> > > Yes, the cause could be someone doing hot-unplug of the block
> > > device, but it could also be something going wrong in software
> > > layers below the filesystem. e.g. dm-thinp having an unrecoverable
> > > corruption or ENOSPC errors.
> > > 
> > > We already have a "cause" notification: blk_holder_ops->mark_dead().
> > > 
> > > The generic fs action that is taken by this notification is
> > > fs_bdev_mark_dead().  That action is to invalidate caches and shut
> > > down the filesystem.
> > > 
> > > btrfs needs to do something different to a blk_holder_ops->mark_dead
> > > notification. i.e. it needs an action that is different to
> > > fs_bdev_mark_dead().
> > > 
> > > Indeed, this is how bcachefs already handles "single device
> > > died" events for multi-device filesystems - see
> > > bch2_fs_bdev_mark_dead().
> > 
> > I do not think it's the correct way to go, especially when there is already
> > fs_holder_ops.
> > 
> > We're always going towards a more generic solution, other than letting the
> > individual fs to do the same thing slightly differently.
> 
> On second thought -- it's weird that you'd flush the filesystem and
> shrink the inode/dentry caches in a "your device went away" handler.
> Fancy filesystems like bcachefs and btrfs would likely just shift IO to
> a different bdev, right?  And there's no good reason to run shrinkers on
> either of those fses, right?
> 
> > Yes, the naming is not perfect and mixing cause and action, but the end
> > result is still a more generic and less duplicated code base.
> 
> I think dchinner makes a good point that if your filesystem can do
> something clever on device removal, it should provide its own block
> device holder ops instead of using fs_holder_ops.  I don't understand
> why you need a "generic" solution for btrfs when it's not going to do
> what the others do anyway.

I think letting filesystems implement their own holder ops should be
avoided if we can. Christoph may chime in here. I have no appettite for
exporting stuff like get_bdev_super() unless absolutely necessary. We
tried to move all that handling into the VFS to eliminate a slew of
deadlocks we detected and fixed. I have no appetite to repeat that
cycle.

The shutdown method is implemented only by block-based filesystems and
arguably shutdown was always a misnomer because it assumed that the
filesystem needs to actually shut down when it is called. IOW, we made
it so that it is a call to action but that doesn't have to be the case.
Calling it ->remove_bdev() is imo the correct thing because it gives
block based filesystem the ability to handle device events how they see
fit.

Once we will have non-block based filesystems that need a method to
always shut down the filesystem itself we might have to revisit this
design anyway but no one had that use-case yet.

> 
> Awkward naming is often a sign that further thought (or at least
> separation of code) is needed.
> 
> As an aside:
> 'twould be nice if we could lift the *FS_IOC_SHUTDOWN dispatch out of
> everyone's ioctl functions into the VFS, and then move the "I am dead"
> state into super_block so that you could actually shut down any
> filesystem, not just the seven that currently implement it.

That goes back to my earlier point. Fwiw, I think that's valuable work.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-08  0:45       ` Darrick J. Wong
  2025-07-08  2:09         ` Qu Wenruo
  2025-07-08  7:55         ` Christian Brauner
@ 2025-07-08 10:20         ` Jan Kara
  2025-07-08 20:20           ` Darrick J. Wong
  2025-07-10 10:52         ` Christoph Hellwig
  3 siblings, 1 reply; 42+ messages in thread
From: Jan Kara @ 2025-07-08 10:20 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Qu Wenruo, Dave Chinner, Qu Wenruo, linux-btrfs, linux-fsdevel,
	viro, brauner, jack, linux-ext4, linux-f2fs-devel, ntfs3,
	linux-xfs

On Mon 07-07-25 17:45:32, Darrick J. Wong wrote:
> On Tue, Jul 08, 2025 at 08:52:47AM +0930, Qu Wenruo wrote:
> > 在 2025/7/8 08:32, Dave Chinner 写道:
> > > On Fri, Jul 04, 2025 at 10:12:29AM +0930, Qu Wenruo wrote:
> > > > Currently all the filesystems implementing the
> > > > super_opearations::shutdown() callback can not afford losing a device.
> > > > 
> > > > Thus fs_bdev_mark_dead() will just call the shutdown() callback for the
> > > > involved filesystem.
> > > > 
> > > > But it will no longer be the case, with multi-device filesystems like
> > > > btrfs and bcachefs the filesystem can handle certain device loss without
> > > > shutting down the whole filesystem.
> > > > 
> > > > To allow those multi-device filesystems to be integrated to use
> > > > fs_holder_ops:
> > > > 
> > > > - Replace super_opearation::shutdown() with
> > > >    super_opearations::remove_bdev()
> > > >    To better describe when the callback is called.
> > > 
> > > This conflates cause with action.
> > > 
> > > The shutdown callout is an action that the filesystem must execute,
> > > whilst "remove bdev" is a cause notification that might require an
> > > action to be take.
> > > 
> > > Yes, the cause could be someone doing hot-unplug of the block
> > > device, but it could also be something going wrong in software
> > > layers below the filesystem. e.g. dm-thinp having an unrecoverable
> > > corruption or ENOSPC errors.
> > > 
> > > We already have a "cause" notification: blk_holder_ops->mark_dead().
> > > 
> > > The generic fs action that is taken by this notification is
> > > fs_bdev_mark_dead().  That action is to invalidate caches and shut
> > > down the filesystem.
> > > 
> > > btrfs needs to do something different to a blk_holder_ops->mark_dead
> > > notification. i.e. it needs an action that is different to
> > > fs_bdev_mark_dead().
> > > 
> > > Indeed, this is how bcachefs already handles "single device
> > > died" events for multi-device filesystems - see
> > > bch2_fs_bdev_mark_dead().
> > 
> > I do not think it's the correct way to go, especially when there is already
> > fs_holder_ops.
> > 
> > We're always going towards a more generic solution, other than letting the
> > individual fs to do the same thing slightly differently.
> 
> On second thought -- it's weird that you'd flush the filesystem and
> shrink the inode/dentry caches in a "your device went away" handler.
> Fancy filesystems like bcachefs and btrfs would likely just shift IO to
> a different bdev, right?  And there's no good reason to run shrinkers on
> either of those fses, right?

I agree it is awkward and bcachefs avoids these in case of removal it can
handle gracefully AFAICS.

> > Yes, the naming is not perfect and mixing cause and action, but the end
> > result is still a more generic and less duplicated code base.
> 
> I think dchinner makes a good point that if your filesystem can do
> something clever on device removal, it should provide its own block
> device holder ops instead of using fs_holder_ops.  I don't understand
> why you need a "generic" solution for btrfs when it's not going to do
> what the others do anyway.

Well, I'd also say just go for own fs_holder_ops if it was not for the
awkward "get super from bdev" step. As Christian wrote we've encapsulated
that in fs/super.c and bdev_super_lock() in particular but the calling
conventions for the fs_holder_ops are not very nice (holding
bdev_holder_lock, need to release it before grabbing practically anything
else) so I'd have much greater peace of mind if this didn't spread too
much. Once you call bdev_super_lock() and hold on to sb with s_umount held,
things are much more conventional for the fs land so I'd like if this
step happened before any fs hook got called. So I prefer something like
Qu's proposal of separate sb op for device removal over exporting
bdev_super_lock(). Like:

static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
{
        struct super_block *sb;

        sb = bdev_super_lock(bdev, false);
        if (!sb)
                return;

	if (sb->s_op->remove_bdev) {
		sb->s_op->remove_bdev(sb, bdev, surprise);
		return;
	}

	if (!surprise)
		sync_filesystem(sb);
	shrink_dcache_sb(sb);
	evict_inodes(sb);
	if (sb->s_op->shutdown)
		sb->s_op->shutdown(sb);

	super_unlock_shared(sb);
}

> As an aside:
> 'twould be nice if we could lift the *FS_IOC_SHUTDOWN dispatch out of
> everyone's ioctl functions into the VFS, and then move the "I am dead"
> state into super_block so that you could actually shut down any
> filesystem, not just the seven that currently implement it.

Yes, I should find time to revive that patch series... It was not *that*
hard to do.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-08 10:20         ` Jan Kara
@ 2025-07-08 20:20           ` Darrick J. Wong
  2025-07-08 22:12             ` Qu Wenruo
  2025-07-10  8:40             ` Christian Brauner
  0 siblings, 2 replies; 42+ messages in thread
From: Darrick J. Wong @ 2025-07-08 20:20 UTC (permalink / raw)
  To: Jan Kara
  Cc: Qu Wenruo, Dave Chinner, Qu Wenruo, linux-btrfs, linux-fsdevel,
	viro, brauner, linux-ext4, linux-f2fs-devel, ntfs3, linux-xfs

On Tue, Jul 08, 2025 at 12:20:00PM +0200, Jan Kara wrote:
> On Mon 07-07-25 17:45:32, Darrick J. Wong wrote:
> > On Tue, Jul 08, 2025 at 08:52:47AM +0930, Qu Wenruo wrote:
> > > 在 2025/7/8 08:32, Dave Chinner 写道:
> > > > On Fri, Jul 04, 2025 at 10:12:29AM +0930, Qu Wenruo wrote:
> > > > > Currently all the filesystems implementing the
> > > > > super_opearations::shutdown() callback can not afford losing a device.
> > > > > 
> > > > > Thus fs_bdev_mark_dead() will just call the shutdown() callback for the
> > > > > involved filesystem.
> > > > > 
> > > > > But it will no longer be the case, with multi-device filesystems like
> > > > > btrfs and bcachefs the filesystem can handle certain device loss without
> > > > > shutting down the whole filesystem.
> > > > > 
> > > > > To allow those multi-device filesystems to be integrated to use
> > > > > fs_holder_ops:
> > > > > 
> > > > > - Replace super_opearation::shutdown() with
> > > > >    super_opearations::remove_bdev()
> > > > >    To better describe when the callback is called.
> > > > 
> > > > This conflates cause with action.
> > > > 
> > > > The shutdown callout is an action that the filesystem must execute,
> > > > whilst "remove bdev" is a cause notification that might require an
> > > > action to be take.
> > > > 
> > > > Yes, the cause could be someone doing hot-unplug of the block
> > > > device, but it could also be something going wrong in software
> > > > layers below the filesystem. e.g. dm-thinp having an unrecoverable
> > > > corruption or ENOSPC errors.
> > > > 
> > > > We already have a "cause" notification: blk_holder_ops->mark_dead().
> > > > 
> > > > The generic fs action that is taken by this notification is
> > > > fs_bdev_mark_dead().  That action is to invalidate caches and shut
> > > > down the filesystem.
> > > > 
> > > > btrfs needs to do something different to a blk_holder_ops->mark_dead
> > > > notification. i.e. it needs an action that is different to
> > > > fs_bdev_mark_dead().
> > > > 
> > > > Indeed, this is how bcachefs already handles "single device
> > > > died" events for multi-device filesystems - see
> > > > bch2_fs_bdev_mark_dead().
> > > 
> > > I do not think it's the correct way to go, especially when there is already
> > > fs_holder_ops.
> > > 
> > > We're always going towards a more generic solution, other than letting the
> > > individual fs to do the same thing slightly differently.
> > 
> > On second thought -- it's weird that you'd flush the filesystem and
> > shrink the inode/dentry caches in a "your device went away" handler.
> > Fancy filesystems like bcachefs and btrfs would likely just shift IO to
> > a different bdev, right?  And there's no good reason to run shrinkers on
> > either of those fses, right?
> 
> I agree it is awkward and bcachefs avoids these in case of removal it can
> handle gracefully AFAICS.
> 
> > > Yes, the naming is not perfect and mixing cause and action, but the end
> > > result is still a more generic and less duplicated code base.
> > 
> > I think dchinner makes a good point that if your filesystem can do
> > something clever on device removal, it should provide its own block
> > device holder ops instead of using fs_holder_ops.  I don't understand
> > why you need a "generic" solution for btrfs when it's not going to do
> > what the others do anyway.
> 
> Well, I'd also say just go for own fs_holder_ops if it was not for the
> awkward "get super from bdev" step. As Christian wrote we've encapsulated
> that in fs/super.c and bdev_super_lock() in particular but the calling
> conventions for the fs_holder_ops are not very nice (holding
> bdev_holder_lock, need to release it before grabbing practically anything
> else) so I'd have much greater peace of mind if this didn't spread too
> much. Once you call bdev_super_lock() and hold on to sb with s_umount held,
> things are much more conventional for the fs land so I'd like if this
> step happened before any fs hook got called. So I prefer something like
> Qu's proposal of separate sb op for device removal over exporting
> bdev_super_lock(). Like:
> 
> static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
> {
>         struct super_block *sb;
> 
>         sb = bdev_super_lock(bdev, false);
>         if (!sb)
>                 return;
> 
> 	if (sb->s_op->remove_bdev) {
> 		sb->s_op->remove_bdev(sb, bdev, surprise);
> 		return;
> 	}

It feels odd but I could live with this, particularly since that's the
direction that brauner is laying down. :)

Do we still need to super_unlock_shared here?

--D

> 
> 	if (!surprise)
> 		sync_filesystem(sb);
> 	shrink_dcache_sb(sb);
> 	evict_inodes(sb);
> 	if (sb->s_op->shutdown)
> 		sb->s_op->shutdown(sb);
> 
> 	super_unlock_shared(sb);
> }
> 
> > As an aside:
> > 'twould be nice if we could lift the *FS_IOC_SHUTDOWN dispatch out of
> > everyone's ioctl functions into the VFS, and then move the "I am dead"
> > state into super_block so that you could actually shut down any
> > filesystem, not just the seven that currently implement it.
> 
> Yes, I should find time to revive that patch series... It was not *that*
> hard to do.
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-08 20:20           ` Darrick J. Wong
@ 2025-07-08 22:12             ` Qu Wenruo
  2025-07-10  8:40             ` Christian Brauner
  1 sibling, 0 replies; 42+ messages in thread
From: Qu Wenruo @ 2025-07-08 22:12 UTC (permalink / raw)
  To: Darrick J. Wong, Jan Kara
  Cc: Dave Chinner, Qu Wenruo, linux-btrfs, linux-fsdevel, viro,
	brauner, linux-ext4, linux-f2fs-devel, ntfs3, linux-xfs



在 2025/7/9 05:50, Darrick J. Wong 写道:
[...]
>> Well, I'd also say just go for own fs_holder_ops if it was not for the
>> awkward "get super from bdev" step. As Christian wrote we've encapsulated
>> that in fs/super.c and bdev_super_lock() in particular but the calling
>> conventions for the fs_holder_ops are not very nice (holding
>> bdev_holder_lock, need to release it before grabbing practically anything
>> else) so I'd have much greater peace of mind if this didn't spread too
>> much. Once you call bdev_super_lock() and hold on to sb with s_umount held,
>> things are much more conventional for the fs land so I'd like if this
>> step happened before any fs hook got called. So I prefer something like
>> Qu's proposal of separate sb op for device removal over exporting
>> bdev_super_lock(). Like:
>>
>> static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
>> {
>>          struct super_block *sb;
>>
>>          sb = bdev_super_lock(bdev, false);
>>          if (!sb)
>>                  return;
>>
>> 	if (sb->s_op->remove_bdev) {
>> 		sb->s_op->remove_bdev(sb, bdev, surprise);

The only concern here is, remove_bdev() will handle two different things:

- Mark the target devices as missing and continue working
   Of course that's when the fs can handle it.

- Shutdown
   If the fs can not handle it.

And if the fs has to shutdown, we're missing all the shrinks in the 
shutdown path.

Thus I'd prefer the remove_bdev() to have a return value, so that we can 
fallback to shutdown path if the remove_bdev() failed.

This also aligns more towards Brauner's idea that we may want to expose 
if the fs can handle the removal.

Thanks,
Qu

>> 		return;
>> 	}
> 
> It feels odd but I could live with this, particularly since that's the
> direction that brauner is laying down. :)
> 
> Do we still need to super_unlock_shared here?
> 
> --D
> 
>>
>> 	if (!surprise)
>> 		sync_filesystem(sb);
>> 	shrink_dcache_sb(sb);
>> 	evict_inodes(sb);
>> 	if (sb->s_op->shutdown)
>> 		sb->s_op->shutdown(sb);
>>
>> 	super_unlock_shared(sb);
>> }
>>
>>> As an aside:
>>> 'twould be nice if we could lift the *FS_IOC_SHUTDOWN dispatch out of
>>> everyone's ioctl functions into the VFS, and then move the "I am dead"
>>> state into super_block so that you could actually shut down any
>>> filesystem, not just the seven that currently implement it.
>>
>> Yes, I should find time to revive that patch series... It was not *that*
>> hard to do.
>>
>> 								Honza
>> -- 
>> Jan Kara <jack@suse.com>
>> SUSE Labs, CR
>>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-08  7:55         ` Christian Brauner
@ 2025-07-08 22:59           ` Dave Chinner
  2025-07-08 23:07             ` Qu Wenruo
  2025-07-10  8:33             ` Christian Brauner
  2025-07-10 10:54           ` Christoph Hellwig
  1 sibling, 2 replies; 42+ messages in thread
From: Dave Chinner @ 2025-07-08 22:59 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Darrick J. Wong, Qu Wenruo, Qu Wenruo, linux-btrfs, linux-fsdevel,
	viro, jack, linux-ext4, linux-f2fs-devel, ntfs3, linux-xfs

On Tue, Jul 08, 2025 at 09:55:14AM +0200, Christian Brauner wrote:
> On Mon, Jul 07, 2025 at 05:45:32PM -0700, Darrick J. Wong wrote:
> > On Tue, Jul 08, 2025 at 08:52:47AM +0930, Qu Wenruo wrote:
> > > 
> > > 
> > > 在 2025/7/8 08:32, Dave Chinner 写道:
> > > > On Fri, Jul 04, 2025 at 10:12:29AM +0930, Qu Wenruo wrote:
> > > > > Currently all the filesystems implementing the
> > > > > super_opearations::shutdown() callback can not afford losing a device.
> > > > > 
> > > > > Thus fs_bdev_mark_dead() will just call the shutdown() callback for the
> > > > > involved filesystem.
> > > > > 
> > > > > But it will no longer be the case, with multi-device filesystems like
> > > > > btrfs and bcachefs the filesystem can handle certain device loss without
> > > > > shutting down the whole filesystem.
> > > > > 
> > > > > To allow those multi-device filesystems to be integrated to use
> > > > > fs_holder_ops:
> > > > > 
> > > > > - Replace super_opearation::shutdown() with
> > > > >    super_opearations::remove_bdev()
> > > > >    To better describe when the callback is called.
> > > > 
> > > > This conflates cause with action.
> > > > 
> > > > The shutdown callout is an action that the filesystem must execute,
> > > > whilst "remove bdev" is a cause notification that might require an
> > > > action to be take.
> > > > 
> > > > Yes, the cause could be someone doing hot-unplug of the block
> > > > device, but it could also be something going wrong in software
> > > > layers below the filesystem. e.g. dm-thinp having an unrecoverable
> > > > corruption or ENOSPC errors.
> > > > 
> > > > We already have a "cause" notification: blk_holder_ops->mark_dead().
> > > > 
> > > > The generic fs action that is taken by this notification is
> > > > fs_bdev_mark_dead().  That action is to invalidate caches and shut
> > > > down the filesystem.
> > > > 
> > > > btrfs needs to do something different to a blk_holder_ops->mark_dead
> > > > notification. i.e. it needs an action that is different to
> > > > fs_bdev_mark_dead().
> > > > 
> > > > Indeed, this is how bcachefs already handles "single device
> > > > died" events for multi-device filesystems - see
> > > > bch2_fs_bdev_mark_dead().
> > > 
> > > I do not think it's the correct way to go, especially when there is already
> > > fs_holder_ops.
> > > 
> > > We're always going towards a more generic solution, other than letting the
> > > individual fs to do the same thing slightly differently.
> > 
> > On second thought -- it's weird that you'd flush the filesystem and
> > shrink the inode/dentry caches in a "your device went away" handler.
> > Fancy filesystems like bcachefs and btrfs would likely just shift IO to
> > a different bdev, right?  And there's no good reason to run shrinkers on
> > either of those fses, right?
> > 
> > > Yes, the naming is not perfect and mixing cause and action, but the end
> > > result is still a more generic and less duplicated code base.
> > 
> > I think dchinner makes a good point that if your filesystem can do
> > something clever on device removal, it should provide its own block
> > device holder ops instead of using fs_holder_ops.  I don't understand
> > why you need a "generic" solution for btrfs when it's not going to do
> > what the others do anyway.
> 
> I think letting filesystems implement their own holder ops should be
> avoided if we can. Christoph may chime in here. I have no appettite for
> exporting stuff like get_bdev_super() unless absolutely necessary. We
> tried to move all that handling into the VFS to eliminate a slew of
> deadlocks we detected and fixed. I have no appetite to repeat that
> cycle.

Except it isn't actually necessary.

Everyone here seems to be assuming that the filesystem *must* take
an active superblock reference to process a device removal event,
and that is *simply not true*.

bcachefs does not use get_bdev_super() or an active superblock
reference to process ->mark_dead events.

It has it's own internal reference counting on the struct bch_fs
attached to the bdev that ensures the filesystem structures can't go
away whilst ->mark_dead is being processed.  i.e. bcachefs is only
dependent on the bdev->bd_holder_lock() being held when
->mark_dead() is called and does not rely on the VFS for anything.

This means that device removal processing can be performed
without global filesystem/VFS locks needing to be held. Hence issues
like re-entrancy deadlocks when there are concurrent/cascading
device failures (e.g. a HBA dies, taking out multiple devices
simultaneously) are completely avoided...

It also avoids the problem of ->mark_dead events being generated
from a context that holds filesystem/vfs locks and then deadlocking
waiting for those locks to be released.

IOWs, a multi-device filesystem should really be implementing
->mark_dead itself, and should not be depending on being able to
lock the superblock to take an active reference to it.

It should be pretty clear that these are not issues that the generic
filesystem ->mark_dead implementation should be trying to
handle.....

> The shutdown method is implemented only by block-based filesystems and
> arguably shutdown was always a misnomer because it assumed that the
> filesystem needs to actually shut down when it is called.

Shutdown was not -assumed- as the operation that needed to be
performed. That was the feature that was *required* to fix
filesystem level problems that occur when the device underneath it
disappears.

->mark_dead() is the abstract filesystem notification from the block
device, fs_bdfev_mark_dead() is the -generic implementation- of the
functionality required by single block device filesystems. Part of
that functionality is shutting down the filesystem because it can
*no longer function without a backing device*.

multi-block device filesystems require compeltely different
implementations, and we already have one that -does not use active
superblock references-. IOWs, even if we add ->remove_bdev(sb)
callout, bcachefs will continue to use ->mark_dead() because low
level filesystem device management isn't (and shouldn't be!)
dependent on high level VFS structure reference counting....

> IOW, we made
> it so that it is a call to action but that doesn't have to be the case.
> Calling it ->remove_bdev() is imo the correct thing because it gives
> block based filesystem the ability to handle device events how they see
> fit.

And that's exactly what ->mark_dead already provides. 

And, as I've pointed out above, multi-device filesystems don't
actually need actively referenced superblocks to process device
removal notifications. Hence ->mark_dead is the correct interface
for them to use.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-08 22:59           ` Dave Chinner
@ 2025-07-08 23:07             ` Qu Wenruo
  2025-07-09  0:35               ` Kent Overstreet
  2025-07-10  8:33             ` Christian Brauner
  1 sibling, 1 reply; 42+ messages in thread
From: Qu Wenruo @ 2025-07-08 23:07 UTC (permalink / raw)
  To: Dave Chinner, Christian Brauner
  Cc: Darrick J. Wong, Qu Wenruo, linux-btrfs, linux-fsdevel, viro,
	jack, linux-ext4, linux-f2fs-devel, ntfs3, linux-xfs,
	Kent Overstreet



在 2025/7/9 08:29, Dave Chinner 写道:
> On Tue, Jul 08, 2025 at 09:55:14AM +0200, Christian Brauner wrote:
>> On Mon, Jul 07, 2025 at 05:45:32PM -0700, Darrick J. Wong wrote:
>>> On Tue, Jul 08, 2025 at 08:52:47AM +0930, Qu Wenruo wrote:
>>>>
>>>>
>>>> 在 2025/7/8 08:32, Dave Chinner 写道:
>>>>> On Fri, Jul 04, 2025 at 10:12:29AM +0930, Qu Wenruo wrote:
>>>>>> Currently all the filesystems implementing the
>>>>>> super_opearations::shutdown() callback can not afford losing a device.
>>>>>>
>>>>>> Thus fs_bdev_mark_dead() will just call the shutdown() callback for the
>>>>>> involved filesystem.
>>>>>>
>>>>>> But it will no longer be the case, with multi-device filesystems like
>>>>>> btrfs and bcachefs the filesystem can handle certain device loss without
>>>>>> shutting down the whole filesystem.
>>>>>>
>>>>>> To allow those multi-device filesystems to be integrated to use
>>>>>> fs_holder_ops:
>>>>>>
>>>>>> - Replace super_opearation::shutdown() with
>>>>>>     super_opearations::remove_bdev()
>>>>>>     To better describe when the callback is called.
>>>>>
>>>>> This conflates cause with action.
>>>>>
>>>>> The shutdown callout is an action that the filesystem must execute,
>>>>> whilst "remove bdev" is a cause notification that might require an
>>>>> action to be take.
>>>>>
>>>>> Yes, the cause could be someone doing hot-unplug of the block
>>>>> device, but it could also be something going wrong in software
>>>>> layers below the filesystem. e.g. dm-thinp having an unrecoverable
>>>>> corruption or ENOSPC errors.
>>>>>
>>>>> We already have a "cause" notification: blk_holder_ops->mark_dead().
>>>>>
>>>>> The generic fs action that is taken by this notification is
>>>>> fs_bdev_mark_dead().  That action is to invalidate caches and shut
>>>>> down the filesystem.
>>>>>
>>>>> btrfs needs to do something different to a blk_holder_ops->mark_dead
>>>>> notification. i.e. it needs an action that is different to
>>>>> fs_bdev_mark_dead().
>>>>>
>>>>> Indeed, this is how bcachefs already handles "single device
>>>>> died" events for multi-device filesystems - see
>>>>> bch2_fs_bdev_mark_dead().
>>>>
>>>> I do not think it's the correct way to go, especially when there is already
>>>> fs_holder_ops.
>>>>
>>>> We're always going towards a more generic solution, other than letting the
>>>> individual fs to do the same thing slightly differently.
>>>
>>> On second thought -- it's weird that you'd flush the filesystem and
>>> shrink the inode/dentry caches in a "your device went away" handler.
>>> Fancy filesystems like bcachefs and btrfs would likely just shift IO to
>>> a different bdev, right?  And there's no good reason to run shrinkers on
>>> either of those fses, right?
>>>
>>>> Yes, the naming is not perfect and mixing cause and action, but the end
>>>> result is still a more generic and less duplicated code base.
>>>
>>> I think dchinner makes a good point that if your filesystem can do
>>> something clever on device removal, it should provide its own block
>>> device holder ops instead of using fs_holder_ops.  I don't understand
>>> why you need a "generic" solution for btrfs when it's not going to do
>>> what the others do anyway.
>>
>> I think letting filesystems implement their own holder ops should be
>> avoided if we can. Christoph may chime in here. I have no appettite for
>> exporting stuff like get_bdev_super() unless absolutely necessary. We
>> tried to move all that handling into the VFS to eliminate a slew of
>> deadlocks we detected and fixed. I have no appetite to repeat that
>> cycle.
> 
> Except it isn't actually necessary.
> 
> Everyone here seems to be assuming that the filesystem *must* take
> an active superblock reference to process a device removal event,
> and that is *simply not true*.
> 
> bcachefs does not use get_bdev_super() or an active superblock
> reference to process ->mark_dead events.

Yes, bcachefs doesn't go this path, but the reason is more important.

Is it just because there is no such callback so that Kent has to come up 
his own solution, or something else?

If the former case, all your argument here makes no sense.

Adding Kent here to see if he wants a more generic s_op->remove_bdev() 
solution or the current each fs doing its own mark_dead() callback.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-08 23:07             ` Qu Wenruo
@ 2025-07-09  0:35               ` Kent Overstreet
  2025-07-09  0:55                 ` Qu Wenruo
  0 siblings, 1 reply; 42+ messages in thread
From: Kent Overstreet @ 2025-07-09  0:35 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Dave Chinner, Christian Brauner, Darrick J. Wong, Qu Wenruo,
	linux-btrfs, linux-fsdevel, viro, jack, linux-ext4,
	linux-f2fs-devel, ntfs3, linux-xfs

On Wed, Jul 09, 2025 at 08:37:05AM +0930, Qu Wenruo wrote:
> 在 2025/7/9 08:29, Dave Chinner 写道:
> > On Tue, Jul 08, 2025 at 09:55:14AM +0200, Christian Brauner wrote:
> > > On Mon, Jul 07, 2025 at 05:45:32PM -0700, Darrick J. Wong wrote:
> > > > On Tue, Jul 08, 2025 at 08:52:47AM +0930, Qu Wenruo wrote:
> > > > > 
> > > > > 
> > > > > 在 2025/7/8 08:32, Dave Chinner 写道:
> > > > > > On Fri, Jul 04, 2025 at 10:12:29AM +0930, Qu Wenruo wrote:
> > > > > > > Currently all the filesystems implementing the
> > > > > > > super_opearations::shutdown() callback can not afford losing a device.
> > > > > > > 
> > > > > > > Thus fs_bdev_mark_dead() will just call the shutdown() callback for the
> > > > > > > involved filesystem.
> > > > > > > 
> > > > > > > But it will no longer be the case, with multi-device filesystems like
> > > > > > > btrfs and bcachefs the filesystem can handle certain device loss without
> > > > > > > shutting down the whole filesystem.
> > > > > > > 
> > > > > > > To allow those multi-device filesystems to be integrated to use
> > > > > > > fs_holder_ops:
> > > > > > > 
> > > > > > > - Replace super_opearation::shutdown() with
> > > > > > >     super_opearations::remove_bdev()
> > > > > > >     To better describe when the callback is called.
> > > > > > 
> > > > > > This conflates cause with action.
> > > > > > 
> > > > > > The shutdown callout is an action that the filesystem must execute,
> > > > > > whilst "remove bdev" is a cause notification that might require an
> > > > > > action to be take.
> > > > > > 
> > > > > > Yes, the cause could be someone doing hot-unplug of the block
> > > > > > device, but it could also be something going wrong in software
> > > > > > layers below the filesystem. e.g. dm-thinp having an unrecoverable
> > > > > > corruption or ENOSPC errors.
> > > > > > 
> > > > > > We already have a "cause" notification: blk_holder_ops->mark_dead().
> > > > > > 
> > > > > > The generic fs action that is taken by this notification is
> > > > > > fs_bdev_mark_dead().  That action is to invalidate caches and shut
> > > > > > down the filesystem.
> > > > > > 
> > > > > > btrfs needs to do something different to a blk_holder_ops->mark_dead
> > > > > > notification. i.e. it needs an action that is different to
> > > > > > fs_bdev_mark_dead().
> > > > > > 
> > > > > > Indeed, this is how bcachefs already handles "single device
> > > > > > died" events for multi-device filesystems - see
> > > > > > bch2_fs_bdev_mark_dead().
> > > > > 
> > > > > I do not think it's the correct way to go, especially when there is already
> > > > > fs_holder_ops.
> > > > > 
> > > > > We're always going towards a more generic solution, other than letting the
> > > > > individual fs to do the same thing slightly differently.
> > > > 
> > > > On second thought -- it's weird that you'd flush the filesystem and
> > > > shrink the inode/dentry caches in a "your device went away" handler.
> > > > Fancy filesystems like bcachefs and btrfs would likely just shift IO to
> > > > a different bdev, right?  And there's no good reason to run shrinkers on
> > > > either of those fses, right?
> > > > 
> > > > > Yes, the naming is not perfect and mixing cause and action, but the end
> > > > > result is still a more generic and less duplicated code base.
> > > > 
> > > > I think dchinner makes a good point that if your filesystem can do
> > > > something clever on device removal, it should provide its own block
> > > > device holder ops instead of using fs_holder_ops.  I don't understand
> > > > why you need a "generic" solution for btrfs when it's not going to do
> > > > what the others do anyway.
> > > 
> > > I think letting filesystems implement their own holder ops should be
> > > avoided if we can. Christoph may chime in here. I have no appettite for
> > > exporting stuff like get_bdev_super() unless absolutely necessary. We
> > > tried to move all that handling into the VFS to eliminate a slew of
> > > deadlocks we detected and fixed. I have no appetite to repeat that
> > > cycle.
> > 
> > Except it isn't actually necessary.
> > 
> > Everyone here seems to be assuming that the filesystem *must* take
> > an active superblock reference to process a device removal event,
> > and that is *simply not true*.
> > 
> > bcachefs does not use get_bdev_super() or an active superblock
> > reference to process ->mark_dead events.
> 
> Yes, bcachefs doesn't go this path, but the reason is more important.
> 
> Is it just because there is no such callback so that Kent has to come up his
> own solution, or something else?
> 
> If the former case, all your argument here makes no sense.
> 
> Adding Kent here to see if he wants a more generic s_op->remove_bdev()
> solution or the current each fs doing its own mark_dead() callback.

Consider that the thing that has a block device open might not even be a
filesystem, or at least a VFS filesystem.

It could be a stacking block device driver - md or md - and those
absolutely should be implementing .mark_dead() (likely passing it
through on up the stack), or something else entirely.

In bcachefs's case, we might not even have created the VFS super_block
yet: we don't do that until after recovery finishes, and indeed we can't
because creating a super_block and leaving it in !SB_BORN will cause
such fun as sync calls to hang for the entire system...

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-09  0:35               ` Kent Overstreet
@ 2025-07-09  0:55                 ` Qu Wenruo
  2025-07-09  1:13                   ` Kent Overstreet
  0 siblings, 1 reply; 42+ messages in thread
From: Qu Wenruo @ 2025-07-09  0:55 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Dave Chinner, Christian Brauner, Darrick J. Wong, Qu Wenruo,
	linux-btrfs, linux-fsdevel, viro, jack, linux-ext4,
	linux-f2fs-devel, ntfs3, linux-xfs



在 2025/7/9 10:05, Kent Overstreet 写道:
> On Wed, Jul 09, 2025 at 08:37:05AM +0930, Qu Wenruo wrote:
>> 在 2025/7/9 08:29, Dave Chinner 写道:
>>> On Tue, Jul 08, 2025 at 09:55:14AM +0200, Christian Brauner wrote:
>>>> On Mon, Jul 07, 2025 at 05:45:32PM -0700, Darrick J. Wong wrote:
>>>>> On Tue, Jul 08, 2025 at 08:52:47AM +0930, Qu Wenruo wrote:
>>>>>>
>>>>>>
>>>>>> 在 2025/7/8 08:32, Dave Chinner 写道:
>>>>>>> On Fri, Jul 04, 2025 at 10:12:29AM +0930, Qu Wenruo wrote:
>>>>>>>> Currently all the filesystems implementing the
>>>>>>>> super_opearations::shutdown() callback can not afford losing a device.
>>>>>>>>
>>>>>>>> Thus fs_bdev_mark_dead() will just call the shutdown() callback for the
>>>>>>>> involved filesystem.
>>>>>>>>
>>>>>>>> But it will no longer be the case, with multi-device filesystems like
>>>>>>>> btrfs and bcachefs the filesystem can handle certain device loss without
>>>>>>>> shutting down the whole filesystem.
>>>>>>>>
>>>>>>>> To allow those multi-device filesystems to be integrated to use
>>>>>>>> fs_holder_ops:
>>>>>>>>
>>>>>>>> - Replace super_opearation::shutdown() with
>>>>>>>>      super_opearations::remove_bdev()
>>>>>>>>      To better describe when the callback is called.
>>>>>>>
>>>>>>> This conflates cause with action.
>>>>>>>
>>>>>>> The shutdown callout is an action that the filesystem must execute,
>>>>>>> whilst "remove bdev" is a cause notification that might require an
>>>>>>> action to be take.
>>>>>>>
>>>>>>> Yes, the cause could be someone doing hot-unplug of the block
>>>>>>> device, but it could also be something going wrong in software
>>>>>>> layers below the filesystem. e.g. dm-thinp having an unrecoverable
>>>>>>> corruption or ENOSPC errors.
>>>>>>>
>>>>>>> We already have a "cause" notification: blk_holder_ops->mark_dead().
>>>>>>>
>>>>>>> The generic fs action that is taken by this notification is
>>>>>>> fs_bdev_mark_dead().  That action is to invalidate caches and shut
>>>>>>> down the filesystem.
>>>>>>>
>>>>>>> btrfs needs to do something different to a blk_holder_ops->mark_dead
>>>>>>> notification. i.e. it needs an action that is different to
>>>>>>> fs_bdev_mark_dead().
>>>>>>>
>>>>>>> Indeed, this is how bcachefs already handles "single device
>>>>>>> died" events for multi-device filesystems - see
>>>>>>> bch2_fs_bdev_mark_dead().
>>>>>>
>>>>>> I do not think it's the correct way to go, especially when there is already
>>>>>> fs_holder_ops.
>>>>>>
>>>>>> We're always going towards a more generic solution, other than letting the
>>>>>> individual fs to do the same thing slightly differently.
>>>>>
>>>>> On second thought -- it's weird that you'd flush the filesystem and
>>>>> shrink the inode/dentry caches in a "your device went away" handler.
>>>>> Fancy filesystems like bcachefs and btrfs would likely just shift IO to
>>>>> a different bdev, right?  And there's no good reason to run shrinkers on
>>>>> either of those fses, right?
>>>>>
>>>>>> Yes, the naming is not perfect and mixing cause and action, but the end
>>>>>> result is still a more generic and less duplicated code base.
>>>>>
>>>>> I think dchinner makes a good point that if your filesystem can do
>>>>> something clever on device removal, it should provide its own block
>>>>> device holder ops instead of using fs_holder_ops.  I don't understand
>>>>> why you need a "generic" solution for btrfs when it's not going to do
>>>>> what the others do anyway.
>>>>
>>>> I think letting filesystems implement their own holder ops should be
>>>> avoided if we can. Christoph may chime in here. I have no appettite for
>>>> exporting stuff like get_bdev_super() unless absolutely necessary. We
>>>> tried to move all that handling into the VFS to eliminate a slew of
>>>> deadlocks we detected and fixed. I have no appetite to repeat that
>>>> cycle.
>>>
>>> Except it isn't actually necessary.
>>>
>>> Everyone here seems to be assuming that the filesystem *must* take
>>> an active superblock reference to process a device removal event,
>>> and that is *simply not true*.
>>>
>>> bcachefs does not use get_bdev_super() or an active superblock
>>> reference to process ->mark_dead events.
>>
>> Yes, bcachefs doesn't go this path, but the reason is more important.
>>
>> Is it just because there is no such callback so that Kent has to come up his
>> own solution, or something else?
>>
>> If the former case, all your argument here makes no sense.
>>
>> Adding Kent here to see if he wants a more generic s_op->remove_bdev()
>> solution or the current each fs doing its own mark_dead() callback.
> 
> Consider that the thing that has a block device open might not even be a
> filesystem, or at least a VFS filesystem.
> 
> It could be a stacking block device driver - md or md - and those
> absolutely should be implementing .mark_dead() (likely passing it
> through on up the stack), or something else entirely.
> 
> In bcachefs's case, we might not even have created the VFS super_block
> yet: we don't do that until after recovery finishes, and indeed we can't
> because creating a super_block and leaving it in !SB_BORN will cause
> such fun as sync calls to hang for the entire system...
> 

Not related to the series, but IIRC if s_flags doesn't have SB_ACTIVE 
set, things like bdev_super_lock() won't choose that superblock, thus 
won't call ->sync() callback through the bdev callbacks.

And btrfs also follows the same scheme, only setting SB_ACTIVE after 
everything is done (including replaying the log etc), and so far we 
haven't yet hit such sync during mount.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-09  0:55                 ` Qu Wenruo
@ 2025-07-09  1:13                   ` Kent Overstreet
  0 siblings, 0 replies; 42+ messages in thread
From: Kent Overstreet @ 2025-07-09  1:13 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Dave Chinner, Christian Brauner, Darrick J. Wong, Qu Wenruo,
	linux-btrfs, linux-fsdevel, viro, jack, linux-ext4,
	linux-f2fs-devel, ntfs3, linux-xfs

On Wed, Jul 09, 2025 at 10:25:08AM +0930, Qu Wenruo wrote:
> 在 2025/7/9 10:05, Kent Overstreet 写道:
> > Consider that the thing that has a block device open might not even be a
> > filesystem, or at least a VFS filesystem.
> > 
> > It could be a stacking block device driver - md or md - and those
> > absolutely should be implementing .mark_dead() (likely passing it
> > through on up the stack), or something else entirely.
> > 
> > In bcachefs's case, we might not even have created the VFS super_block
> > yet: we don't do that until after recovery finishes, and indeed we can't
> > because creating a super_block and leaving it in !SB_BORN will cause
> > such fun as sync calls to hang for the entire system...
> > 
> 
> Not related to the series, but IIRC if s_flags doesn't have SB_ACTIVE set,
> things like bdev_super_lock() won't choose that superblock, thus won't call
> ->sync() callback through the bdev callbacks.
> 
> And btrfs also follows the same scheme, only setting SB_ACTIVE after
> everything is done (including replaying the log etc), and so far we haven't
> yet hit such sync during mount.

Well, how long can that take? Have a look at iterate_supers(), it's
something to be wary of.

Fixing the fs/super.c locking is something I was looking at, it's doable
but it'd be a giant hassle - for bcachefs it wasn't worth it, bcachefs
has preexisting reasons for wanting to avoid the vfs superblock
dependency.

Anyways - the VFS trying to own .mark_dead() is a layering violation.

        VFS
------------------
        FS
------------------
        BLOCK

By default, the "FS" layer should be considered a black box by both the
block layer and VFS; reaching across that and assuming filesystem
behavior is a good way to paint yourself into a corner.

It's seductive because most filesystems are single device filesystems,
and for that case it makes total sense to provide helpers for
convenience, given that there's not much room for behaviour to deviate
in the single device case.

But in the multi device case: the behaviour is completely up to the
filesystem - in general we don't shut down the entire filesystem if a
single block device goes dead, if we're redundant we can just drop that
device and continue.

And if you're thinking you want to make use of locking provided by the
VFS - I would warn away from that line of thinking too, mixing up
locking from different layers creates all sorts of fun...

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
@ 2025-07-09 17:23 Jan Kara
  2025-07-09 17:49 ` Kent Overstreet
  0 siblings, 1 reply; 42+ messages in thread
From: Jan Kara @ 2025-07-09 17:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christian Brauner, Darrick J. Wong, Qu Wenruo, Qu Wenruo,
	linux-btrfs, linux-fsdevel, viro, jack, linux-ext4,
	linux-f2fs-devel, ntfs3, linux-xfs, Kent Overstreet,
	linux-bcachefs

Bcc:
Subject: Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to
 remove_bdev()
Reply-To:
In-Reply-To: <aG2i3qP01m-vmFVE@dread.disaster.area>

On Wed 09-07-25 08:59:42, Dave Chinner wrote:
> On Tue, Jul 08, 2025 at 09:55:14AM +0200, Christian Brauner wrote:
> > On Mon, Jul 07, 2025 at 05:45:32PM -0700, Darrick J. Wong wrote:
> > > On Tue, Jul 08, 2025 at 08:52:47AM +0930, Qu Wenruo wrote:
> > > > 在 2025/7/8 08:32, Dave Chinner 写道:
> > > > > On Fri, Jul 04, 2025 at 10:12:29AM +0930, Qu Wenruo wrote:
> > > > > > Currently all the filesystems implementing the
> > > > > > super_opearations::shutdown() callback can not afford losing a device.
> > > > > > 
> > > > > > Thus fs_bdev_mark_dead() will just call the shutdown() callback for the
> > > > > > involved filesystem.
> > > > > > 
> > > > > > But it will no longer be the case, with multi-device filesystems like
> > > > > > btrfs and bcachefs the filesystem can handle certain device loss without
> > > > > > shutting down the whole filesystem.
> > > > > > 
> > > > > > To allow those multi-device filesystems to be integrated to use
> > > > > > fs_holder_ops:
> > > > > > 
> > > > > > - Replace super_opearation::shutdown() with
> > > > > >    super_opearations::remove_bdev()
> > > > > >    To better describe when the callback is called.
> > > > > 
> > > > > This conflates cause with action.
> > > > > 
> > > > > The shutdown callout is an action that the filesystem must execute,
> > > > > whilst "remove bdev" is a cause notification that might require an
> > > > > action to be take.
> > > > > 
> > > > > Yes, the cause could be someone doing hot-unplug of the block
> > > > > device, but it could also be something going wrong in software
> > > > > layers below the filesystem. e.g. dm-thinp having an unrecoverable
> > > > > corruption or ENOSPC errors.
> > > > > 
> > > > > We already have a "cause" notification: blk_holder_ops->mark_dead().
> > > > > 
> > > > > The generic fs action that is taken by this notification is
> > > > > fs_bdev_mark_dead().  That action is to invalidate caches and shut
> > > > > down the filesystem.
> > > > > 
> > > > > btrfs needs to do something different to a blk_holder_ops->mark_dead
> > > > > notification. i.e. it needs an action that is different to
> > > > > fs_bdev_mark_dead().
> > > > > 
> > > > > Indeed, this is how bcachefs already handles "single device
> > > > > died" events for multi-device filesystems - see
> > > > > bch2_fs_bdev_mark_dead().
> > > > 
> > > > I do not think it's the correct way to go, especially when there is already
> > > > fs_holder_ops.
> > > > 
> > > > We're always going towards a more generic solution, other than letting the
> > > > individual fs to do the same thing slightly differently.
> > > 
> > > On second thought -- it's weird that you'd flush the filesystem and
> > > shrink the inode/dentry caches in a "your device went away" handler.
> > > Fancy filesystems like bcachefs and btrfs would likely just shift IO to
> > > a different bdev, right?  And there's no good reason to run shrinkers on
> > > either of those fses, right?
> > > 
> > > > Yes, the naming is not perfect and mixing cause and action, but the end
> > > > result is still a more generic and less duplicated code base.
> > > 
> > > I think dchinner makes a good point that if your filesystem can do
> > > something clever on device removal, it should provide its own block
> > > device holder ops instead of using fs_holder_ops.  I don't understand
> > > why you need a "generic" solution for btrfs when it's not going to do
> > > what the others do anyway.
> > 
> > I think letting filesystems implement their own holder ops should be
> > avoided if we can. Christoph may chime in here. I have no appettite for
> > exporting stuff like get_bdev_super() unless absolutely necessary. We
> > tried to move all that handling into the VFS to eliminate a slew of
> > deadlocks we detected and fixed. I have no appetite to repeat that
> > cycle.
> 
> Except it isn't actually necessary.
> 
> Everyone here seems to be assuming that the filesystem *must* take
> an active superblock reference to process a device removal event,
> and that is *simply not true*.
> 
> bcachefs does not use get_bdev_super() or an active superblock
> reference to process ->mark_dead events.
>
> It has it's own internal reference counting on the struct bch_fs
> attached to the bdev that ensures the filesystem structures can't go
> away whilst ->mark_dead is being processed.  i.e. bcachefs is only
> dependent on the bdev->bd_holder_lock() being held when
> ->mark_dead() is called and does not rely on the VFS for anything.

Right, they have their own refcount which effectively blocks umount
in .put_super AFAICS and they use it instead of VFS refcounts for this.

> This means that device removal processing can be performed
> without global filesystem/VFS locks needing to be held. Hence issues
> like re-entrancy deadlocks when there are concurrent/cascading
> device failures (e.g. a HBA dies, taking out multiple devices
> simultaneously) are completely avoided...

Funnily enough how about:

bch2_fs_bdev_mark_dead()		umount()
  bdev_get_fs()
    bch2_ro_ref_tryget() -> grabs bch_fs->ro_ref
    mutex_unlock(&bdev->bd_holder_lock);
					deactivate_super()
					  down_write(&sb->s_umount);
					  deactivate_locked_super()
					    bch2_kill_sb()
					      generic_shutdown_super()
					        bch2_put_super()
						  __bch2_fs_stop()
						    bch2_ro_ref_put()
						    wait_event(c->ro_ref_wait, !refcount_read(&c->ro_ref));
  sb = c->vfs_sb;
  down_read(&sb->s_umount); -> deadlock

Which is a case in point why I would like to have a shared infrastructure
for bdev -> sb transition that's used as widely as possible. Because it
isn't easy to get the lock ordering right given all the constraints in the
VFS and block layer code paths for this transition that's going contrary to
the usual ordering sb -> bdev. And yes I do realize bcachefs grabs s_umount
not because it itself needs it but because it calls some VFS helpers
(sync_filesystem()) which expect it to be held so the pain is inflicted
by VFS here but that just demostrates the fact that VFS and FS locking are
deeply intertwined and you can hardly avoid dealing with VFS locking rules
in the filesystem itself.

> It also avoids the problem of ->mark_dead events being generated
> from a context that holds filesystem/vfs locks and then deadlocking
> waiting for those locks to be released.
> 
> IOWs, a multi-device filesystem should really be implementing
> ->mark_dead itself, and should not be depending on being able to
> lock the superblock to take an active reference to it.
> 
> It should be pretty clear that these are not issues that the generic
> filesystem ->mark_dead implementation should be trying to
> handle.....

Well, IMO every fs implementation needs to do the bdev -> sb transition and
make sb somehow stable. It may be that grabbing s_umount and active sb
reference is not what everybody wants but AFAIU btrfs as the second
multi-device filesystem would be fine with that and for bcachefs this
doesn't work only because they have special superblock instantiation
behavior on mount for independent reasons (i.e., not because active ref
+ s_umount would be problematic for them) if I understand Kent right.
So I'm still not fully convinced each multi-device filesystem should be
shipping their special method to get from device to stable sb reference.

> > The shutdown method is implemented only by block-based filesystems and
> > arguably shutdown was always a misnomer because it assumed that the
> > filesystem needs to actually shut down when it is called.
> 
> Shutdown was not -assumed- as the operation that needed to be
> performed. That was the feature that was *required* to fix
> filesystem level problems that occur when the device underneath it
> disappears.
> 
> ->mark_dead() is the abstract filesystem notification from the block
> device, fs_bdfev_mark_dead() is the -generic implementation- of the
> functionality required by single block device filesystems. Part of
> that functionality is shutting down the filesystem because it can
> *no longer function without a backing device*.
> 
> multi-block device filesystems require compeltely different
> implementations, and we already have one that -does not use active
> superblock references-. IOWs, even if we add ->remove_bdev(sb)
> callout, bcachefs will continue to use ->mark_dead() because low
> level filesystem device management isn't (and shouldn't be!)
> dependent on high level VFS structure reference counting....

I have to admit I don't get why device management shouldn't be dependent on
VFS refcounts / locking. IMO it is often dependent although I agree with
multiple devices you likely have to do *additional* locking. And yes, I can
imagine VFS locking could get in your way but the only tangible example we
have is bcachefs and btrfs seems to be a counter example showing even multi
device filesystem can live with VFS locking. So I don't think the case is
as clear as you try to frame it.

So conceptually I agree making filesystems as bdev holders implement their
own holder ops makes a lot of sense but because of lock ordering rules it
is not quite practical or easily maintainable choice I'm afraid.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-09 17:23 [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev() Jan Kara
@ 2025-07-09 17:49 ` Kent Overstreet
  2025-07-10 13:10   ` Jan Kara
  0 siblings, 1 reply; 42+ messages in thread
From: Kent Overstreet @ 2025-07-09 17:49 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Christian Brauner, Darrick J. Wong, Qu Wenruo,
	Qu Wenruo, linux-btrfs, linux-fsdevel, viro, linux-ext4,
	linux-f2fs-devel, ntfs3, linux-xfs, linux-bcachefs

On Wed, Jul 09, 2025 at 07:23:07PM +0200, Jan Kara wrote:
> On Wed 09-07-25 08:59:42, Dave Chinner wrote:
> > This means that device removal processing can be performed
> > without global filesystem/VFS locks needing to be held. Hence issues
> > like re-entrancy deadlocks when there are concurrent/cascading
> > device failures (e.g. a HBA dies, taking out multiple devices
> > simultaneously) are completely avoided...
> 
> Funnily enough how about:
> 
> bch2_fs_bdev_mark_dead()		umount()
>   bdev_get_fs()
>     bch2_ro_ref_tryget() -> grabs bch_fs->ro_ref
>     mutex_unlock(&bdev->bd_holder_lock);
> 					deactivate_super()
> 					  down_write(&sb->s_umount);
> 					  deactivate_locked_super()
> 					    bch2_kill_sb()
> 					      generic_shutdown_super()
> 					        bch2_put_super()
> 						  __bch2_fs_stop()
> 						    bch2_ro_ref_put()
> 						    wait_event(c->ro_ref_wait, !refcount_read(&c->ro_ref));
>   sb = c->vfs_sb;
>   down_read(&sb->s_umount); -> deadlock
> 
> Which is a case in point why I would like to have a shared infrastructure
> for bdev -> sb transition that's used as widely as possible. Because it
> isn't easy to get the lock ordering right given all the constraints in the
> VFS and block layer code paths for this transition that's going contrary to
> the usual ordering sb -> bdev. And yes I do realize bcachefs grabs s_umount
> not because it itself needs it but because it calls some VFS helpers
> (sync_filesystem()) which expect it to be held so the pain is inflicted
> by VFS here but that just demostrates the fact that VFS and FS locking are
> deeply intertwined and you can hardly avoid dealing with VFS locking rules
> in the filesystem itself.

Getting rid of the s_umount use looks like the much saner and easier
fix - like the comment notes, it's only taken to avoid the warning in
sync_filesystem, we don't actually need it.

Locking gets easier when locks are private to individual subsystems,
protecting specific data structures that are private to those
subsystems.

> > It also avoids the problem of ->mark_dead events being generated
> > from a context that holds filesystem/vfs locks and then deadlocking
> > waiting for those locks to be released.
> > 
> > IOWs, a multi-device filesystem should really be implementing
> > ->mark_dead itself, and should not be depending on being able to
> > lock the superblock to take an active reference to it.
> > 
> > It should be pretty clear that these are not issues that the generic
> > filesystem ->mark_dead implementation should be trying to
> > handle.....
> 
> Well, IMO every fs implementation needs to do the bdev -> sb transition and
> make sb somehow stable. It may be that grabbing s_umount and active sb
> reference is not what everybody wants but AFAIU btrfs as the second
> multi-device filesystem would be fine with that and for bcachefs this
> doesn't work only because they have special superblock instantiation
> behavior on mount for independent reasons (i.e., not because active ref
> + s_umount would be problematic for them) if I understand Kent right.
> So I'm still not fully convinced each multi-device filesystem should be
> shipping their special method to get from device to stable sb reference.

Honestly, the sync_filesystem() call seems bogus.

If the block device is truly dead, what's it going to accomplish?

It's not like we get callbacks for "this device is going to be going
away soon", we only get that in reaction to something that's already
happened.

> > > The shutdown method is implemented only by block-based filesystems and
> > > arguably shutdown was always a misnomer because it assumed that the
> > > filesystem needs to actually shut down when it is called.
> > 
> > Shutdown was not -assumed- as the operation that needed to be
> > performed. That was the feature that was *required* to fix
> > filesystem level problems that occur when the device underneath it
> > disappears.
> > 
> > ->mark_dead() is the abstract filesystem notification from the block
> > device, fs_bdfev_mark_dead() is the -generic implementation- of the
> > functionality required by single block device filesystems. Part of
> > that functionality is shutting down the filesystem because it can
> > *no longer function without a backing device*.
> > 
> > multi-block device filesystems require compeltely different
> > implementations, and we already have one that -does not use active
> > superblock references-. IOWs, even if we add ->remove_bdev(sb)
> > callout, bcachefs will continue to use ->mark_dead() because low
> > level filesystem device management isn't (and shouldn't be!)
> > dependent on high level VFS structure reference counting....
> 
> I have to admit I don't get why device management shouldn't be dependent on
> VFS refcounts / locking. IMO it is often dependent although I agree with
> multiple devices you likely have to do *additional* locking. And yes, I can
> imagine VFS locking could get in your way but the only tangible example we
> have is bcachefs and btrfs seems to be a counter example showing even multi
> device filesystem can live with VFS locking. So I don't think the case is
> as clear as you try to frame it.

Individual devices coming and going has nothing to do with the VFS. If a
single device goes away and we're continuing in RW mode, _no_ VFS state
is affected whatsoever.

The only thing that's needed is a ref to prevent the filesystem from
going away, not a lock. But again given that a bch_fs doesn't
necessarily even have a VFS superblock it's not something we'd use
directly in .mark_dead, that synchronization is handled directly via
kill_sb -> generic_shutdown_super -> and all that...

We don't want bch_fs to outlive the VFS superblock if we do have a VFS
sb, because asynchronous shutdown and releasing of resources causes very
real problems (which already exist for other reasons...)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-08 22:59           ` Dave Chinner
  2025-07-08 23:07             ` Qu Wenruo
@ 2025-07-10  8:33             ` Christian Brauner
  1 sibling, 0 replies; 42+ messages in thread
From: Christian Brauner @ 2025-07-10  8:33 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, Qu Wenruo, Qu Wenruo, linux-btrfs, linux-fsdevel,
	viro, jack, linux-ext4, linux-f2fs-devel, ntfs3, linux-xfs

On Wed, Jul 09, 2025 at 08:59:42AM +1000, Dave Chinner wrote:
> On Tue, Jul 08, 2025 at 09:55:14AM +0200, Christian Brauner wrote:
> > On Mon, Jul 07, 2025 at 05:45:32PM -0700, Darrick J. Wong wrote:
> > > On Tue, Jul 08, 2025 at 08:52:47AM +0930, Qu Wenruo wrote:
> > > > 
> > > > 
> > > > 在 2025/7/8 08:32, Dave Chinner 写道:
> > > > > On Fri, Jul 04, 2025 at 10:12:29AM +0930, Qu Wenruo wrote:
> > > > > > Currently all the filesystems implementing the
> > > > > > super_opearations::shutdown() callback can not afford losing a device.
> > > > > > 
> > > > > > Thus fs_bdev_mark_dead() will just call the shutdown() callback for the
> > > > > > involved filesystem.
> > > > > > 
> > > > > > But it will no longer be the case, with multi-device filesystems like
> > > > > > btrfs and bcachefs the filesystem can handle certain device loss without
> > > > > > shutting down the whole filesystem.
> > > > > > 
> > > > > > To allow those multi-device filesystems to be integrated to use
> > > > > > fs_holder_ops:
> > > > > > 
> > > > > > - Replace super_opearation::shutdown() with
> > > > > >    super_opearations::remove_bdev()
> > > > > >    To better describe when the callback is called.
> > > > > 
> > > > > This conflates cause with action.
> > > > > 
> > > > > The shutdown callout is an action that the filesystem must execute,
> > > > > whilst "remove bdev" is a cause notification that might require an
> > > > > action to be take.
> > > > > 
> > > > > Yes, the cause could be someone doing hot-unplug of the block
> > > > > device, but it could also be something going wrong in software
> > > > > layers below the filesystem. e.g. dm-thinp having an unrecoverable
> > > > > corruption or ENOSPC errors.
> > > > > 
> > > > > We already have a "cause" notification: blk_holder_ops->mark_dead().
> > > > > 
> > > > > The generic fs action that is taken by this notification is
> > > > > fs_bdev_mark_dead().  That action is to invalidate caches and shut
> > > > > down the filesystem.
> > > > > 
> > > > > btrfs needs to do something different to a blk_holder_ops->mark_dead
> > > > > notification. i.e. it needs an action that is different to
> > > > > fs_bdev_mark_dead().
> > > > > 
> > > > > Indeed, this is how bcachefs already handles "single device
> > > > > died" events for multi-device filesystems - see
> > > > > bch2_fs_bdev_mark_dead().
> > > > 
> > > > I do not think it's the correct way to go, especially when there is already
> > > > fs_holder_ops.
> > > > 
> > > > We're always going towards a more generic solution, other than letting the
> > > > individual fs to do the same thing slightly differently.
> > > 
> > > On second thought -- it's weird that you'd flush the filesystem and
> > > shrink the inode/dentry caches in a "your device went away" handler.
> > > Fancy filesystems like bcachefs and btrfs would likely just shift IO to
> > > a different bdev, right?  And there's no good reason to run shrinkers on
> > > either of those fses, right?
> > > 
> > > > Yes, the naming is not perfect and mixing cause and action, but the end
> > > > result is still a more generic and less duplicated code base.
> > > 
> > > I think dchinner makes a good point that if your filesystem can do
> > > something clever on device removal, it should provide its own block
> > > device holder ops instead of using fs_holder_ops.  I don't understand
> > > why you need a "generic" solution for btrfs when it's not going to do
> > > what the others do anyway.
> > 
> > I think letting filesystems implement their own holder ops should be
> > avoided if we can. Christoph may chime in here. I have no appettite for
> > exporting stuff like get_bdev_super() unless absolutely necessary. We
> > tried to move all that handling into the VFS to eliminate a slew of
> > deadlocks we detected and fixed. I have no appetite to repeat that
> > cycle.
> 
> Except it isn't actually necessary.
> 
> Everyone here seems to be assuming that the filesystem *must* take
> an active superblock reference to process a device removal event,
> and that is *simply not true*.
> 
> bcachefs does not use get_bdev_super() or an active superblock
> reference to process ->mark_dead events.
> 
> It has it's own internal reference counting on the struct bch_fs
> attached to the bdev that ensures the filesystem structures can't go
> away whilst ->mark_dead is being processed.  i.e. bcachefs is only
> dependent on the bdev->bd_holder_lock() being held when
> ->mark_dead() is called and does not rely on the VFS for anything.
> 
> This means that device removal processing can be performed
> without global filesystem/VFS locks needing to be held. Hence issues
> like re-entrancy deadlocks when there are concurrent/cascading
> device failures (e.g. a HBA dies, taking out multiple devices
> simultaneously) are completely avoided...
> 
> It also avoids the problem of ->mark_dead events being generated
> from a context that holds filesystem/vfs locks and then deadlocking
> waiting for those locks to be released.
> 
> IOWs, a multi-device filesystem should really be implementing
> ->mark_dead itself, and should not be depending on being able to
> lock the superblock to take an active reference to it.
> 
> It should be pretty clear that these are not issues that the generic
> filesystem ->mark_dead implementation should be trying to
> handle.....
> 
> > The shutdown method is implemented only by block-based filesystems and
> > arguably shutdown was always a misnomer because it assumed that the
> > filesystem needs to actually shut down when it is called.
> 
> Shutdown was not -assumed- as the operation that needed to be
> performed. That was the feature that was *required* to fix
> filesystem level problems that occur when the device underneath it
> disappears.
> 
> ->mark_dead() is the abstract filesystem notification from the block
> device, fs_bdfev_mark_dead() is the -generic implementation- of the
> functionality required by single block device filesystems. Part of
> that functionality is shutting down the filesystem because it can
> *no longer function without a backing device*.
> 
> multi-block device filesystems require compeltely different
> implementations, and we already have one that -does not use active
> superblock references-. IOWs, even if we add ->remove_bdev(sb)
> callout, bcachefs will continue to use ->mark_dead() because low
> level filesystem device management isn't (and shouldn't be!)
> dependent on high level VFS structure reference counting....
> 
> > IOW, we made
> > it so that it is a call to action but that doesn't have to be the case.
> > Calling it ->remove_bdev() is imo the correct thing because it gives
> > block based filesystem the ability to handle device events how they see
> > fit.
> 
> And that's exactly what ->mark_dead already provides. 
> 
> And, as I've pointed out above, multi-device filesystems don't
> actually need actively referenced superblocks to process device
> removal notifications. Hence ->mark_dead is the correct interface
> for them to use.

I'm not sure what this is trying to argue about as we agree.

All current filesystems that use the fs_holder_ops require an active
superblock reference. If they want to rewrite themselves to not need an
active superblock reference and switch to custom holder ops then the VFS
doesn't care.

This is about what is currently the case. Everyone is aware that a
filesystem can do this differently.

If btrfs wants to rely on the VFS infrastructure then we will enable it
and we will help them move along and the only requirement is that we
don't have to bleed the VFS locking requirements into the specific
filesystem unnecessarily. That's all this is.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-08 20:20           ` Darrick J. Wong
  2025-07-08 22:12             ` Qu Wenruo
@ 2025-07-10  8:40             ` Christian Brauner
  2025-07-10  9:54               ` Qu Wenruo
  1 sibling, 1 reply; 42+ messages in thread
From: Christian Brauner @ 2025-07-10  8:40 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jan Kara, Qu Wenruo, Dave Chinner, Qu Wenruo, linux-btrfs,
	linux-fsdevel, viro, linux-ext4, linux-f2fs-devel, ntfs3,
	linux-xfs

On Tue, Jul 08, 2025 at 01:20:50PM -0700, Darrick J. Wong wrote:
> On Tue, Jul 08, 2025 at 12:20:00PM +0200, Jan Kara wrote:
> > On Mon 07-07-25 17:45:32, Darrick J. Wong wrote:
> > > On Tue, Jul 08, 2025 at 08:52:47AM +0930, Qu Wenruo wrote:
> > > > 在 2025/7/8 08:32, Dave Chinner 写道:
> > > > > On Fri, Jul 04, 2025 at 10:12:29AM +0930, Qu Wenruo wrote:
> > > > > > Currently all the filesystems implementing the
> > > > > > super_opearations::shutdown() callback can not afford losing a device.
> > > > > > 
> > > > > > Thus fs_bdev_mark_dead() will just call the shutdown() callback for the
> > > > > > involved filesystem.
> > > > > > 
> > > > > > But it will no longer be the case, with multi-device filesystems like
> > > > > > btrfs and bcachefs the filesystem can handle certain device loss without
> > > > > > shutting down the whole filesystem.
> > > > > > 
> > > > > > To allow those multi-device filesystems to be integrated to use
> > > > > > fs_holder_ops:
> > > > > > 
> > > > > > - Replace super_opearation::shutdown() with
> > > > > >    super_opearations::remove_bdev()
> > > > > >    To better describe when the callback is called.
> > > > > 
> > > > > This conflates cause with action.
> > > > > 
> > > > > The shutdown callout is an action that the filesystem must execute,
> > > > > whilst "remove bdev" is a cause notification that might require an
> > > > > action to be take.
> > > > > 
> > > > > Yes, the cause could be someone doing hot-unplug of the block
> > > > > device, but it could also be something going wrong in software
> > > > > layers below the filesystem. e.g. dm-thinp having an unrecoverable
> > > > > corruption or ENOSPC errors.
> > > > > 
> > > > > We already have a "cause" notification: blk_holder_ops->mark_dead().
> > > > > 
> > > > > The generic fs action that is taken by this notification is
> > > > > fs_bdev_mark_dead().  That action is to invalidate caches and shut
> > > > > down the filesystem.
> > > > > 
> > > > > btrfs needs to do something different to a blk_holder_ops->mark_dead
> > > > > notification. i.e. it needs an action that is different to
> > > > > fs_bdev_mark_dead().
> > > > > 
> > > > > Indeed, this is how bcachefs already handles "single device
> > > > > died" events for multi-device filesystems - see
> > > > > bch2_fs_bdev_mark_dead().
> > > > 
> > > > I do not think it's the correct way to go, especially when there is already
> > > > fs_holder_ops.
> > > > 
> > > > We're always going towards a more generic solution, other than letting the
> > > > individual fs to do the same thing slightly differently.
> > > 
> > > On second thought -- it's weird that you'd flush the filesystem and
> > > shrink the inode/dentry caches in a "your device went away" handler.
> > > Fancy filesystems like bcachefs and btrfs would likely just shift IO to
> > > a different bdev, right?  And there's no good reason to run shrinkers on
> > > either of those fses, right?
> > 
> > I agree it is awkward and bcachefs avoids these in case of removal it can
> > handle gracefully AFAICS.
> > 
> > > > Yes, the naming is not perfect and mixing cause and action, but the end
> > > > result is still a more generic and less duplicated code base.
> > > 
> > > I think dchinner makes a good point that if your filesystem can do
> > > something clever on device removal, it should provide its own block
> > > device holder ops instead of using fs_holder_ops.  I don't understand
> > > why you need a "generic" solution for btrfs when it's not going to do
> > > what the others do anyway.
> > 
> > Well, I'd also say just go for own fs_holder_ops if it was not for the
> > awkward "get super from bdev" step. As Christian wrote we've encapsulated
> > that in fs/super.c and bdev_super_lock() in particular but the calling
> > conventions for the fs_holder_ops are not very nice (holding
> > bdev_holder_lock, need to release it before grabbing practically anything
> > else) so I'd have much greater peace of mind if this didn't spread too
> > much. Once you call bdev_super_lock() and hold on to sb with s_umount held,
> > things are much more conventional for the fs land so I'd like if this
> > step happened before any fs hook got called. So I prefer something like
> > Qu's proposal of separate sb op for device removal over exporting
> > bdev_super_lock(). Like:
> > 
> > static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
> > {
> >         struct super_block *sb;
> > 
> >         sb = bdev_super_lock(bdev, false);
> >         if (!sb)
> >                 return;
> > 
> > 	if (sb->s_op->remove_bdev) {
> > 		sb->s_op->remove_bdev(sb, bdev, surprise);
> > 		return;
> > 	}
> 
> It feels odd but I could live with this, particularly since that's the
> direction that brauner is laying down. :)

I want to reiterate that no one is saying "under no circumstances
implement your own holder ops". But if you rely on the VFS locking then
you better not spill it's guts into your filesystem and make us export
this bloody locking that half the world had implemented wrong in their
drivers in the first places spewing endless syzbot deadlocks reports
that we had to track down and fix. That will not happen again similar
way we don't bleed all the nastiness of other locking paths.

Please all stop long philosophical treatises about things no on has ever
argued. btrfs wants to rely on the VFS infra. That is fine and well. We
will support and enable this.

I think the two method idea is fine given that they now are clearly
delineated.

Thanks for providing some clarity here, Darrick and Qu.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-10  8:40             ` Christian Brauner
@ 2025-07-10  9:54               ` Qu Wenruo
  2025-07-11  9:34                 ` Christian Brauner
  0 siblings, 1 reply; 42+ messages in thread
From: Qu Wenruo @ 2025-07-10  9:54 UTC (permalink / raw)
  To: Christian Brauner, Darrick J. Wong
  Cc: Jan Kara, Dave Chinner, Qu Wenruo, linux-btrfs, linux-fsdevel,
	viro, linux-ext4, linux-f2fs-devel, ntfs3, linux-xfs



在 2025/7/10 18:10, Christian Brauner 写道:
> On Tue, Jul 08, 2025 at 01:20:50PM -0700, Darrick J. Wong wrote:
>> On Tue, Jul 08, 2025 at 12:20:00PM +0200, Jan Kara wrote:
>>> On Mon 07-07-25 17:45:32, Darrick J. Wong wrote:
>>>> On Tue, Jul 08, 2025 at 08:52:47AM +0930, Qu Wenruo wrote:
>>>>> 在 2025/7/8 08:32, Dave Chinner 写道:
>>>>>> On Fri, Jul 04, 2025 at 10:12:29AM +0930, Qu Wenruo wrote:
>>>>>>> Currently all the filesystems implementing the
>>>>>>> super_opearations::shutdown() callback can not afford losing a device.
>>>>>>>
>>>>>>> Thus fs_bdev_mark_dead() will just call the shutdown() callback for the
>>>>>>> involved filesystem.
>>>>>>>
>>>>>>> But it will no longer be the case, with multi-device filesystems like
>>>>>>> btrfs and bcachefs the filesystem can handle certain device loss without
>>>>>>> shutting down the whole filesystem.
>>>>>>>
>>>>>>> To allow those multi-device filesystems to be integrated to use
>>>>>>> fs_holder_ops:
>>>>>>>
>>>>>>> - Replace super_opearation::shutdown() with
>>>>>>>     super_opearations::remove_bdev()
>>>>>>>     To better describe when the callback is called.
>>>>>>
>>>>>> This conflates cause with action.
>>>>>>
>>>>>> The shutdown callout is an action that the filesystem must execute,
>>>>>> whilst "remove bdev" is a cause notification that might require an
>>>>>> action to be take.
>>>>>>
>>>>>> Yes, the cause could be someone doing hot-unplug of the block
>>>>>> device, but it could also be something going wrong in software
>>>>>> layers below the filesystem. e.g. dm-thinp having an unrecoverable
>>>>>> corruption or ENOSPC errors.
>>>>>>
>>>>>> We already have a "cause" notification: blk_holder_ops->mark_dead().
>>>>>>
>>>>>> The generic fs action that is taken by this notification is
>>>>>> fs_bdev_mark_dead().  That action is to invalidate caches and shut
>>>>>> down the filesystem.
>>>>>>
>>>>>> btrfs needs to do something different to a blk_holder_ops->mark_dead
>>>>>> notification. i.e. it needs an action that is different to
>>>>>> fs_bdev_mark_dead().
>>>>>>
>>>>>> Indeed, this is how bcachefs already handles "single device
>>>>>> died" events for multi-device filesystems - see
>>>>>> bch2_fs_bdev_mark_dead().
>>>>>
>>>>> I do not think it's the correct way to go, especially when there is already
>>>>> fs_holder_ops.
>>>>>
>>>>> We're always going towards a more generic solution, other than letting the
>>>>> individual fs to do the same thing slightly differently.
>>>>
>>>> On second thought -- it's weird that you'd flush the filesystem and
>>>> shrink the inode/dentry caches in a "your device went away" handler.
>>>> Fancy filesystems like bcachefs and btrfs would likely just shift IO to
>>>> a different bdev, right?  And there's no good reason to run shrinkers on
>>>> either of those fses, right?
>>>
>>> I agree it is awkward and bcachefs avoids these in case of removal it can
>>> handle gracefully AFAICS.
>>>
>>>>> Yes, the naming is not perfect and mixing cause and action, but the end
>>>>> result is still a more generic and less duplicated code base.
>>>>
>>>> I think dchinner makes a good point that if your filesystem can do
>>>> something clever on device removal, it should provide its own block
>>>> device holder ops instead of using fs_holder_ops.  I don't understand
>>>> why you need a "generic" solution for btrfs when it's not going to do
>>>> what the others do anyway.
>>>
>>> Well, I'd also say just go for own fs_holder_ops if it was not for the
>>> awkward "get super from bdev" step. As Christian wrote we've encapsulated
>>> that in fs/super.c and bdev_super_lock() in particular but the calling
>>> conventions for the fs_holder_ops are not very nice (holding
>>> bdev_holder_lock, need to release it before grabbing practically anything
>>> else) so I'd have much greater peace of mind if this didn't spread too
>>> much. Once you call bdev_super_lock() and hold on to sb with s_umount held,
>>> things are much more conventional for the fs land so I'd like if this
>>> step happened before any fs hook got called. So I prefer something like
>>> Qu's proposal of separate sb op for device removal over exporting
>>> bdev_super_lock(). Like:
>>>
>>> static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
>>> {
>>>          struct super_block *sb;
>>>
>>>          sb = bdev_super_lock(bdev, false);
>>>          if (!sb)
>>>                  return;
>>>
>>> 	if (sb->s_op->remove_bdev) {
>>> 		sb->s_op->remove_bdev(sb, bdev, surprise);
>>> 		return;
>>> 	}
>>
>> It feels odd but I could live with this, particularly since that's the
>> direction that brauner is laying down. :)
> 
> I want to reiterate that no one is saying "under no circumstances
> implement your own holder ops". But if you rely on the VFS locking then
> you better not spill it's guts into your filesystem and make us export
> this bloody locking that half the world had implemented wrong in their
> drivers in the first places spewing endless syzbot deadlocks reports
> that we had to track down and fix. That will not happen again similar
> way we don't bleed all the nastiness of other locking paths.
> 
> Please all stop long philosophical treatises about things no on has ever
> argued. btrfs wants to rely on the VFS infra. That is fine and well. We
> will support and enable this.
> 
> I think the two method idea is fine given that they now are clearly
> delineated.
> 
> Thanks for providing some clarity here, Darrick and Qu.
> 

So the next update would be something like this for fs_bdev_mark_dead():

	sb = bdev_super_lock();
	if (!sb)
		return;
	if (!surprise)
		sync_filesystem(sb);
+	if (sb->s_op->remove_bdev) {
+		ret = sb->s_op->remove_bdev();
+		if (!ret) {
+			/* Fs can handle the dev loss. */
+			super_unlock_shared();
+			return;
+		}
+	}
+	/* Fs can not handle the dev loss, shutdown. */
	shrink_dcache_sb();
	evict_inodes();
	if (sb->s_op->shutdown)
		sb->s_op->shutdown();
	super_unlock_shared();

This means ->remove_bdev() must have a return value to indicate if the 
fs can handle the loss.
And any error, no matter if it's not enough tolerance from the fs or 
some other problem during the dev loss handling, the old shutdown 
behavior will be triggered.

Would this be an acceptable solution?

Thanks,
Qu

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-08  0:45       ` Darrick J. Wong
                           ` (2 preceding siblings ...)
  2025-07-08 10:20         ` Jan Kara
@ 2025-07-10 10:52         ` Christoph Hellwig
  3 siblings, 0 replies; 42+ messages in thread
From: Christoph Hellwig @ 2025-07-10 10:52 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Qu Wenruo, Dave Chinner, Qu Wenruo, linux-btrfs, linux-fsdevel,
	viro, brauner, jack, linux-ext4, linux-f2fs-devel, ntfs3,
	linux-xfs

On Mon, Jul 07, 2025 at 05:45:32PM -0700, Darrick J. Wong wrote:
> On second thought -- it's weird that you'd flush the filesystem and
> shrink the inode/dentry caches in a "your device went away" handler.
> Fancy filesystems like bcachefs and btrfs would likely just shift IO to
> a different bdev, right?  And there's no good reason to run shrinkers on
> either of those fses, right?

No nmeed for fancy file systems, this is weird no matter what.  But it
is what Linux has done for 30+ years, so I kept it when refactoring
this code to sit in a callback.

> > Yes, the naming is not perfect and mixing cause and action, but the end
> > result is still a more generic and less duplicated code base.
> 
> I think dchinner makes a good point that if your filesystem can do
> something clever on device removal, it should provide its own block
> device holder ops instead of using fs_holder_ops.  I don't understand
> why you need a "generic" solution for btrfs when it's not going to do
> what the others do anyway.

Why?  You're most likely to get the locking wrong, and so on.

What might make sense is to move the sync_filesystem, shrink_dcache_sb
and evict_inodes into the method.  That way file systems where we

> As an aside:
> 'twould be nice if we could lift the *FS_IOC_SHUTDOWN dispatch out of
> everyone's ioctl functions into the VFS, and then move the "I am dead"
> state into super_block so that you could actually shut down any
> filesystem, not just the seven that currently implement it.

Sure.  Someone just needs to do the work..


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-08  7:55         ` Christian Brauner
  2025-07-08 22:59           ` Dave Chinner
@ 2025-07-10 10:54           ` Christoph Hellwig
  1 sibling, 0 replies; 42+ messages in thread
From: Christoph Hellwig @ 2025-07-10 10:54 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Darrick J. Wong, Qu Wenruo, Dave Chinner, Qu Wenruo, linux-btrfs,
	linux-fsdevel, viro, jack, linux-ext4, linux-f2fs-devel, ntfs3,
	linux-xfs

On Tue, Jul 08, 2025 at 09:55:14AM +0200, Christian Brauner wrote:
> I think letting filesystems implement their own holder ops should be
> avoided if we can. Christoph may chime in here.

Ccing helps for that..

>
> I have no appettite for
> exporting stuff like get_bdev_super() unless absolutely necessary. We
> tried to move all that handling into the VFS to eliminate a slew of
> deadlocks we detected and fixed. I have no appetite to repeat that
> cycle.

Exactly.

> The shutdown method is implemented only by block-based filesystems and
> arguably shutdown was always a misnomer because it assumed that the
> filesystem needs to actually shut down when it is called. IOW, we made
> it so that it is a call to action but that doesn't have to be the case.
> Calling it ->remove_bdev() is imo the correct thing because it gives
> block based filesystem the ability to handle device events how they see
> fit.
> 
> Once we will have non-block based filesystems that need a method to
> always shut down the filesystem itself we might have to revisit this
> design anyway but no one had that use-case yet.

I'm not sure what non-block file systems would need it for except
maybe for a generic IOC_SHUTDOWN implementation, but that would have
a different signature anyway as it needs to pass flags that don't
fit here.  So that would be a separate method.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-09 17:49 ` Kent Overstreet
@ 2025-07-10 13:10   ` Jan Kara
  2025-07-10 18:41     ` Kent Overstreet
  0 siblings, 1 reply; 42+ messages in thread
From: Jan Kara @ 2025-07-10 13:10 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Jan Kara, Dave Chinner, Christian Brauner, Darrick J. Wong,
	Qu Wenruo, Qu Wenruo, linux-btrfs, linux-fsdevel, viro,
	linux-ext4, linux-f2fs-devel, ntfs3, linux-xfs, linux-bcachefs

On Wed 09-07-25 13:49:12, Kent Overstreet wrote:
> On Wed, Jul 09, 2025 at 07:23:07PM +0200, Jan Kara wrote:
> > > It also avoids the problem of ->mark_dead events being generated
> > > from a context that holds filesystem/vfs locks and then deadlocking
> > > waiting for those locks to be released.
> > > 
> > > IOWs, a multi-device filesystem should really be implementing
> > > ->mark_dead itself, and should not be depending on being able to
> > > lock the superblock to take an active reference to it.
> > > 
> > > It should be pretty clear that these are not issues that the generic
> > > filesystem ->mark_dead implementation should be trying to
> > > handle.....
> > 
> > Well, IMO every fs implementation needs to do the bdev -> sb transition and
> > make sb somehow stable. It may be that grabbing s_umount and active sb
> > reference is not what everybody wants but AFAIU btrfs as the second
> > multi-device filesystem would be fine with that and for bcachefs this
> > doesn't work only because they have special superblock instantiation
> > behavior on mount for independent reasons (i.e., not because active ref
> > + s_umount would be problematic for them) if I understand Kent right.
> > So I'm still not fully convinced each multi-device filesystem should be
> > shipping their special method to get from device to stable sb reference.
> 
> Honestly, the sync_filesystem() call seems bogus.
> 
> If the block device is truly dead, what's it going to accomplish?

Notice that fs_bdev_mark_dead() calls sync_filesystem() only in case
'surprise' argument is false - meaning this is actually a notification
*before* the device is going away. I.e., graceful device hot unplug when
you can access the device to clean up as much as possible.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-10 13:10   ` Jan Kara
@ 2025-07-10 18:41     ` Kent Overstreet
  2025-07-11 14:20       ` Jan Kara
  0 siblings, 1 reply; 42+ messages in thread
From: Kent Overstreet @ 2025-07-10 18:41 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Christian Brauner, Darrick J. Wong, Qu Wenruo,
	Qu Wenruo, linux-btrfs, linux-fsdevel, viro, linux-ext4,
	linux-f2fs-devel, ntfs3, linux-xfs, linux-bcachefs

On Thu, Jul 10, 2025 at 03:10:04PM +0200, Jan Kara wrote:
> On Wed 09-07-25 13:49:12, Kent Overstreet wrote:
> > On Wed, Jul 09, 2025 at 07:23:07PM +0200, Jan Kara wrote:
> > > > It also avoids the problem of ->mark_dead events being generated
> > > > from a context that holds filesystem/vfs locks and then deadlocking
> > > > waiting for those locks to be released.
> > > > 
> > > > IOWs, a multi-device filesystem should really be implementing
> > > > ->mark_dead itself, and should not be depending on being able to
> > > > lock the superblock to take an active reference to it.
> > > > 
> > > > It should be pretty clear that these are not issues that the generic
> > > > filesystem ->mark_dead implementation should be trying to
> > > > handle.....
> > > 
> > > Well, IMO every fs implementation needs to do the bdev -> sb transition and
> > > make sb somehow stable. It may be that grabbing s_umount and active sb
> > > reference is not what everybody wants but AFAIU btrfs as the second
> > > multi-device filesystem would be fine with that and for bcachefs this
> > > doesn't work only because they have special superblock instantiation
> > > behavior on mount for independent reasons (i.e., not because active ref
> > > + s_umount would be problematic for them) if I understand Kent right.
> > > So I'm still not fully convinced each multi-device filesystem should be
> > > shipping their special method to get from device to stable sb reference.
> > 
> > Honestly, the sync_filesystem() call seems bogus.
> > 
> > If the block device is truly dead, what's it going to accomplish?
> 
> Notice that fs_bdev_mark_dead() calls sync_filesystem() only in case
> 'surprise' argument is false - meaning this is actually a notification
> *before* the device is going away. I.e., graceful device hot unplug when
> you can access the device to clean up as much as possible.

That doesn't seem to be hooked up to anything?

blk_mark_disk_dead() -> blk_report_disk_dead(), surprise is always true

disk_force_media_change(), same

The only call where it's falso is in s390 code. If we know that a disk
is going away, that would be a userspace thing, and they can just
unmount.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-10  9:54               ` Qu Wenruo
@ 2025-07-11  9:34                 ` Christian Brauner
  0 siblings, 0 replies; 42+ messages in thread
From: Christian Brauner @ 2025-07-11  9:34 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Darrick J. Wong, Jan Kara, Dave Chinner, Qu Wenruo, linux-btrfs,
	linux-fsdevel, viro, linux-ext4, linux-f2fs-devel, ntfs3,
	linux-xfs

On Thu, Jul 10, 2025 at 07:24:46PM +0930, Qu Wenruo wrote:
> 
> 
> 在 2025/7/10 18:10, Christian Brauner 写道:
> > On Tue, Jul 08, 2025 at 01:20:50PM -0700, Darrick J. Wong wrote:
> > > On Tue, Jul 08, 2025 at 12:20:00PM +0200, Jan Kara wrote:
> > > > On Mon 07-07-25 17:45:32, Darrick J. Wong wrote:
> > > > > On Tue, Jul 08, 2025 at 08:52:47AM +0930, Qu Wenruo wrote:
> > > > > > 在 2025/7/8 08:32, Dave Chinner 写道:
> > > > > > > On Fri, Jul 04, 2025 at 10:12:29AM +0930, Qu Wenruo wrote:
> > > > > > > > Currently all the filesystems implementing the
> > > > > > > > super_opearations::shutdown() callback can not afford losing a device.
> > > > > > > > 
> > > > > > > > Thus fs_bdev_mark_dead() will just call the shutdown() callback for the
> > > > > > > > involved filesystem.
> > > > > > > > 
> > > > > > > > But it will no longer be the case, with multi-device filesystems like
> > > > > > > > btrfs and bcachefs the filesystem can handle certain device loss without
> > > > > > > > shutting down the whole filesystem.
> > > > > > > > 
> > > > > > > > To allow those multi-device filesystems to be integrated to use
> > > > > > > > fs_holder_ops:
> > > > > > > > 
> > > > > > > > - Replace super_opearation::shutdown() with
> > > > > > > >     super_opearations::remove_bdev()
> > > > > > > >     To better describe when the callback is called.
> > > > > > > 
> > > > > > > This conflates cause with action.
> > > > > > > 
> > > > > > > The shutdown callout is an action that the filesystem must execute,
> > > > > > > whilst "remove bdev" is a cause notification that might require an
> > > > > > > action to be take.
> > > > > > > 
> > > > > > > Yes, the cause could be someone doing hot-unplug of the block
> > > > > > > device, but it could also be something going wrong in software
> > > > > > > layers below the filesystem. e.g. dm-thinp having an unrecoverable
> > > > > > > corruption or ENOSPC errors.
> > > > > > > 
> > > > > > > We already have a "cause" notification: blk_holder_ops->mark_dead().
> > > > > > > 
> > > > > > > The generic fs action that is taken by this notification is
> > > > > > > fs_bdev_mark_dead().  That action is to invalidate caches and shut
> > > > > > > down the filesystem.
> > > > > > > 
> > > > > > > btrfs needs to do something different to a blk_holder_ops->mark_dead
> > > > > > > notification. i.e. it needs an action that is different to
> > > > > > > fs_bdev_mark_dead().
> > > > > > > 
> > > > > > > Indeed, this is how bcachefs already handles "single device
> > > > > > > died" events for multi-device filesystems - see
> > > > > > > bch2_fs_bdev_mark_dead().
> > > > > > 
> > > > > > I do not think it's the correct way to go, especially when there is already
> > > > > > fs_holder_ops.
> > > > > > 
> > > > > > We're always going towards a more generic solution, other than letting the
> > > > > > individual fs to do the same thing slightly differently.
> > > > > 
> > > > > On second thought -- it's weird that you'd flush the filesystem and
> > > > > shrink the inode/dentry caches in a "your device went away" handler.
> > > > > Fancy filesystems like bcachefs and btrfs would likely just shift IO to
> > > > > a different bdev, right?  And there's no good reason to run shrinkers on
> > > > > either of those fses, right?
> > > > 
> > > > I agree it is awkward and bcachefs avoids these in case of removal it can
> > > > handle gracefully AFAICS.
> > > > 
> > > > > > Yes, the naming is not perfect and mixing cause and action, but the end
> > > > > > result is still a more generic and less duplicated code base.
> > > > > 
> > > > > I think dchinner makes a good point that if your filesystem can do
> > > > > something clever on device removal, it should provide its own block
> > > > > device holder ops instead of using fs_holder_ops.  I don't understand
> > > > > why you need a "generic" solution for btrfs when it's not going to do
> > > > > what the others do anyway.
> > > > 
> > > > Well, I'd also say just go for own fs_holder_ops if it was not for the
> > > > awkward "get super from bdev" step. As Christian wrote we've encapsulated
> > > > that in fs/super.c and bdev_super_lock() in particular but the calling
> > > > conventions for the fs_holder_ops are not very nice (holding
> > > > bdev_holder_lock, need to release it before grabbing practically anything
> > > > else) so I'd have much greater peace of mind if this didn't spread too
> > > > much. Once you call bdev_super_lock() and hold on to sb with s_umount held,
> > > > things are much more conventional for the fs land so I'd like if this
> > > > step happened before any fs hook got called. So I prefer something like
> > > > Qu's proposal of separate sb op for device removal over exporting
> > > > bdev_super_lock(). Like:
> > > > 
> > > > static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
> > > > {
> > > >          struct super_block *sb;
> > > > 
> > > >          sb = bdev_super_lock(bdev, false);
> > > >          if (!sb)
> > > >                  return;
> > > > 
> > > > 	if (sb->s_op->remove_bdev) {
> > > > 		sb->s_op->remove_bdev(sb, bdev, surprise);
> > > > 		return;
> > > > 	}
> > > 
> > > It feels odd but I could live with this, particularly since that's the
> > > direction that brauner is laying down. :)
> > 
> > I want to reiterate that no one is saying "under no circumstances
> > implement your own holder ops". But if you rely on the VFS locking then
> > you better not spill it's guts into your filesystem and make us export
> > this bloody locking that half the world had implemented wrong in their
> > drivers in the first places spewing endless syzbot deadlocks reports
> > that we had to track down and fix. That will not happen again similar
> > way we don't bleed all the nastiness of other locking paths.
> > 
> > Please all stop long philosophical treatises about things no on has ever
> > argued. btrfs wants to rely on the VFS infra. That is fine and well. We
> > will support and enable this.
> > 
> > I think the two method idea is fine given that they now are clearly
> > delineated.
> > 
> > Thanks for providing some clarity here, Darrick and Qu.
> > 
> 
> So the next update would be something like this for fs_bdev_mark_dead():
> 
> 	sb = bdev_super_lock();
> 	if (!sb)
> 		return;
> 	if (!surprise)
> 		sync_filesystem(sb);
> +	if (sb->s_op->remove_bdev) {
> +		ret = sb->s_op->remove_bdev();
> +		if (!ret) {
> +			/* Fs can handle the dev loss. */
> +			super_unlock_shared();
> +			return;
> +		}
> +	}
> +	/* Fs can not handle the dev loss, shutdown. */
> 	shrink_dcache_sb();
> 	evict_inodes();
> 	if (sb->s_op->shutdown)
> 		sb->s_op->shutdown();
> 	super_unlock_shared();
> 
> This means ->remove_bdev() must have a return value to indicate if the fs
> can handle the loss.
> And any error, no matter if it's not enough tolerance from the fs or some
> other problem during the dev loss handling, the old shutdown behavior will
> be triggered.
> 
> Would this be an acceptable solution?

This works for me.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-10 18:41     ` Kent Overstreet
@ 2025-07-11 14:20       ` Jan Kara
  0 siblings, 0 replies; 42+ messages in thread
From: Jan Kara @ 2025-07-11 14:20 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Jan Kara, Dave Chinner, Christian Brauner, Darrick J. Wong,
	Qu Wenruo, Qu Wenruo, linux-btrfs, linux-fsdevel, viro,
	linux-ext4, linux-f2fs-devel, ntfs3, linux-xfs, linux-bcachefs

On Thu 10-07-25 14:41:18, Kent Overstreet wrote:
> On Thu, Jul 10, 2025 at 03:10:04PM +0200, Jan Kara wrote:
> > On Wed 09-07-25 13:49:12, Kent Overstreet wrote:
> > > On Wed, Jul 09, 2025 at 07:23:07PM +0200, Jan Kara wrote:
> > > > > It also avoids the problem of ->mark_dead events being generated
> > > > > from a context that holds filesystem/vfs locks and then deadlocking
> > > > > waiting for those locks to be released.
> > > > > 
> > > > > IOWs, a multi-device filesystem should really be implementing
> > > > > ->mark_dead itself, and should not be depending on being able to
> > > > > lock the superblock to take an active reference to it.
> > > > > 
> > > > > It should be pretty clear that these are not issues that the generic
> > > > > filesystem ->mark_dead implementation should be trying to
> > > > > handle.....
> > > > 
> > > > Well, IMO every fs implementation needs to do the bdev -> sb transition and
> > > > make sb somehow stable. It may be that grabbing s_umount and active sb
> > > > reference is not what everybody wants but AFAIU btrfs as the second
> > > > multi-device filesystem would be fine with that and for bcachefs this
> > > > doesn't work only because they have special superblock instantiation
> > > > behavior on mount for independent reasons (i.e., not because active ref
> > > > + s_umount would be problematic for them) if I understand Kent right.
> > > > So I'm still not fully convinced each multi-device filesystem should be
> > > > shipping their special method to get from device to stable sb reference.
> > > 
> > > Honestly, the sync_filesystem() call seems bogus.
> > > 
> > > If the block device is truly dead, what's it going to accomplish?
> > 
> > Notice that fs_bdev_mark_dead() calls sync_filesystem() only in case
> > 'surprise' argument is false - meaning this is actually a notification
> > *before* the device is going away. I.e., graceful device hot unplug when
> > you can access the device to clean up as much as possible.
> 
> That doesn't seem to be hooked up to anything?

__del_gendisk()
  if (!test_bit(GD_DEAD, &disk->state))
    blk_report_disk_dead(disk, false);

Is the path which results in "surprise" to be false. I have to admit I
didn't check deeper into drivers whether this is hooked up properly but
del_gendisk() is a standard call to tear down a disk so it would seem so
from the first glance.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2025-07-11 14:20 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-04  0:42 [PATCH v4 0/6] btrfs: add remove_bdev() callback Qu Wenruo
2025-07-04  0:42 ` [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev() Qu Wenruo
2025-07-04  9:00   ` (subset) " Christian Brauner
2025-07-04  9:05   ` Jan Kara
2025-07-07 23:02   ` Dave Chinner
2025-07-07 23:22     ` Qu Wenruo
2025-07-08  0:45       ` Darrick J. Wong
2025-07-08  2:09         ` Qu Wenruo
2025-07-08  3:06           ` Qu Wenruo
2025-07-08  5:05             ` Dave Chinner
2025-07-08  5:41               ` Qu Wenruo
2025-07-08  7:55         ` Christian Brauner
2025-07-08 22:59           ` Dave Chinner
2025-07-08 23:07             ` Qu Wenruo
2025-07-09  0:35               ` Kent Overstreet
2025-07-09  0:55                 ` Qu Wenruo
2025-07-09  1:13                   ` Kent Overstreet
2025-07-10  8:33             ` Christian Brauner
2025-07-10 10:54           ` Christoph Hellwig
2025-07-08 10:20         ` Jan Kara
2025-07-08 20:20           ` Darrick J. Wong
2025-07-08 22:12             ` Qu Wenruo
2025-07-10  8:40             ` Christian Brauner
2025-07-10  9:54               ` Qu Wenruo
2025-07-11  9:34                 ` Christian Brauner
2025-07-10 10:52         ` Christoph Hellwig
2025-07-04  0:42 ` [PATCH v4 2/6] btrfs: introduce a new fs state, EMERGENCY_SHUTDOWN Qu Wenruo
2025-07-04  0:42 ` [PATCH v4 3/6] btrfs: reject file operations if in shutdown state Qu Wenruo
2025-07-05 14:10   ` David Sterba
2025-07-04  0:42 ` [PATCH v4 4/6] btrfs: reject delalloc ranges " Qu Wenruo
2025-07-04  0:42 ` [PATCH v4 5/6] btrfs: implement shutdown ioctl Qu Wenruo
2025-07-05 14:22   ` David Sterba
2025-07-06  3:37     ` Qu Wenruo
2025-07-07 20:51       ` David Sterba
2025-07-07 23:04         ` Qu Wenruo
2025-07-08  0:53           ` David Sterba
2025-07-04  0:42 ` [PATCH v4 6/6] btrfs: implement remove_bdev super operation callback Qu Wenruo
  -- strict thread matches above, loose matches on Subject: below --
2025-07-09 17:23 [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev() Jan Kara
2025-07-09 17:49 ` Kent Overstreet
2025-07-10 13:10   ` Jan Kara
2025-07-10 18:41     ` Kent Overstreet
2025-07-11 14:20       ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).