[PATCH 0/6] btrfs: add remove

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/6] btrfs: add remove_bdev() callback
@ 2025-06-25 23:53 Qu Wenruo
  2025-06-25 23:53 ` [PATCH 1/6] fs: add a new remove_bdev() super operations callback Qu Wenruo
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: Qu Wenruo @ 2025-06-25 23:53 UTC (permalink / raw)
  To: linux-btrfs, linux-fsdevel; +Cc: viro, brauner, jack

[CHANGELOG]
RFC->v1:

- Add a new remove_bdev() callback
  Thanks all the feedback from Christian, Christoph and Jan on this new
  name.

- Add a @surprise parameter to the remove_bdev() callback
  To keep it the same as the bdev_mark_dead().

- Hide the shutdown ioctl and remove_bdev callback behind experimental 
  With the shutdown ioctl, there are 3 test failures (g/050, g/388,
  g/508).

  G/050 may require a test case update.
  G/388 is related to the error handling with COW fixup.
  G/508 looks like something related to log replay.

  And the remove_bdev() doesn't have any btrfs specific test case yet to
  check the auto-degraded behavior, nor the auto-degraded behavior is
  fully discussed.

  So hide both of them behind experimental features.

- Do not use btrfs_handle_fs_error() to avoid freeze/thaw behavior change
  btrfs_handle_fs_error() will flips the fs read-only, which will
  affect freeze/thaw behavior.
  And no other fs set the fs read-only when shutting down, so follow the
  other fs to have a more consistent behavior.

The series is based on the fs_holder_ops patchset here:
https://lore.kernel.org/linux-btrfs/cover.1750724841.git.wqu@suse.com/

The full series including the btrfs' support of fs_holder_ops can be
found here:
https://github.com/adam900710/linux/tree/shutdown

Patch 1 is introducing the new remove_bdev() super operation callback,
with not only the extra @bdev parameter, but also a @surprise flag to
follow bdev_mark_dead().

Patch 2~5 implement the shutdown ioctl so that btrfs can benefit from
the existing shutdown test group.
Although it's still only enabled for experimental builds, as there are
still some minor failures needs to be addressed

Patch 6 implements the remove_bdev() callback, so that btrfs can either
shutdown the fs or continue degraded.
It's still an experimental feature, as we still need more disccussion
on the user expectation and possible extra user configurable behaviors.

Qu Wenruo (6):
  fs: add a new remove_bdev() super operations callback
  btrfs: introduce a new fs state, EMERGENCY_SHUTDOWN
  btrfs: reject file operations if in shutdown state
  btrfs: reject delalloc ranges if the fs is shutdown
  btrfs: implement shutdown ioctl
  btrfs: implement remove_bdev super operation callback

 fs/btrfs/file.c            | 25 ++++++++++++++++-
 fs/btrfs/fs.h              | 28 +++++++++++++++++++
 fs/btrfs/inode.c           | 14 +++++++++-
 fs/btrfs/ioctl.c           | 43 +++++++++++++++++++++++++++++
 fs/btrfs/messages.c        |  1 +
 fs/btrfs/reflink.c         |  3 +++
 fs/btrfs/super.c           | 55 ++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.c         |  2 ++
 fs/btrfs/volumes.h         |  5 ++++
 fs/super.c                 |  4 ++-
 include/linux/fs.h         | 18 +++++++++++++
 include/uapi/linux/btrfs.h |  9 +++++++
 12 files changed, 204 insertions(+), 3 deletions(-)

-- 
2.49.0


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/6] fs: add a new remove_bdev() super operations callback
  2025-06-25 23:53 [PATCH 0/6] btrfs: add remove_bdev() callback Qu Wenruo
@ 2025-06-25 23:53 ` Qu Wenruo
  2025-06-26  8:38   ` Christian Brauner
  2025-06-25 23:53 ` [PATCH 2/6] btrfs: introduce a new fs state, EMERGENCY_SHUTDOWN Qu Wenruo
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Qu Wenruo @ 2025-06-25 23:53 UTC (permalink / raw)
  To: linux-btrfs, linux-fsdevel; +Cc: viro, brauner, jack

The new remove_bdev() call back is mostly for multi-device filesystems
to handle device removal.

Some multi-devices filesystems like btrfs can have the ability to handle
device lose according to the setup (e.g. all chunks have extra mirrors),
thus losing a block device will not interrupt the normal operations.

Btrfs will soon implement this call back by:

- Automatically degrade the fs if read-write operations can be
  maintained

- Shutdown the fs if read-write operations can not be maintained

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/super.c         |  4 +++-
 include/linux/fs.h | 18 ++++++++++++++++++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/fs/super.c b/fs/super.c
index 80418ca8e215..07845d2f9ec4 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1463,7 +1463,9 @@ static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
 		sync_filesystem(sb);
 	shrink_dcache_sb(sb);
 	evict_inodes(sb);
-	if (sb->s_op->shutdown)
+	if (sb->s_op->remove_bdev)
+		sb->s_op->remove_bdev(sb, bdev, surprise);
+	else if (sb->s_op->shutdown)
 		sb->s_op->shutdown(sb);
 
 	super_unlock_shared(sb);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b085f161ed22..5e84e06c7354 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2367,7 +2367,25 @@ struct super_operations {
 				  struct shrink_control *);
 	long (*free_cached_objects)(struct super_block *,
 				    struct shrink_control *);
+	/*
+	 * Callback to shutdown the fs.
+	 *
+	 * If a fs can not afford losing any block device, implement this callback.
+	 */
 	void (*shutdown)(struct super_block *sb);
+
+	/*
+	 * Callback to handle a block device removal.
+	 *
+	 * Recommended to implement this for multi-device filesystems, as they
+	 * may afford losing a block device and continue operations.
+	 *
+	 * @surprse:	indicates a surprise removal. If true the device/media is
+	 *		already gone. Otherwise we're prepareing for an orderly
+	 *		removal.
+	 */
+	void (*remove_bdev)(struct super_block *sb, struct block_device *bdev,
+			    bool surprise);
 };
 
 /*
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/6] btrfs: introduce a new fs state, EMERGENCY_SHUTDOWN
  2025-06-25 23:53 [PATCH 0/6] btrfs: add remove_bdev() callback Qu Wenruo
  2025-06-25 23:53 ` [PATCH 1/6] fs: add a new remove_bdev() super operations callback Qu Wenruo
@ 2025-06-25 23:53 ` Qu Wenruo
  2025-06-25 23:53 ` [PATCH 3/6] btrfs: reject file operations if in shutdown state Qu Wenruo
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2025-06-25 23:53 UTC (permalink / raw)
  To: linux-btrfs, linux-fsdevel; +Cc: viro, brauner, jack

This is btrfs' equivalent of XFS_IOC_GOINGDOWN or EXT4_IOC_SHUTDOWN,
after entering the emergency shutdown state, all operations will return
error (-EIO), and can not be bring back to normal state until unmount.

A new helper, btrfs_force_shutdown() is introduced, which will:

- Mark the fs as error
  But without flipping the fs read-only.
  This is a special handling for the future shutdown ioctl, which will
  freeze the fs first, set the SHUTDOWN flag, thaw the fs.

  But the thaw path will no longer call the unfreeze_fs() call back
  if the superblock is already read-only.

  So to handle future shutdown correctly, we only mark the fs as error,
  without flipping it read-only.

- Set the SHUTDOWN flag and output an message

The only extra thing is to set the EMERGENCY_SHUTDOWN flag to reject
future operations with -EIO.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/fs.h       | 28 ++++++++++++++++++++++++++++
 fs/btrfs/messages.c |  1 +
 2 files changed, 29 insertions(+)

diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index d90304d4e32c..45185c095696 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -29,6 +29,7 @@
 #include "extent-io-tree.h"
 #include "async-thread.h"
 #include "block-rsv.h"
+#include "messages.h"
 
 struct inode;
 struct super_block;
@@ -120,6 +121,12 @@ enum {
 	/* No more delayed iput can be queued. */
 	BTRFS_FS_STATE_NO_DELAYED_IPUT,
 
+	/*
+	 * Emergency shutdown, a step further than trans aborted by rejecting
+	 * all operations.
+	 */
+	BTRFS_FS_STATE_EMERGENCY_SHUTDOWN,
+
 	BTRFS_FS_STATE_COUNT
 };
 
@@ -1100,6 +1107,27 @@ static inline void btrfs_wake_unfinished_drop(struct btrfs_fs_info *fs_info)
 	(unlikely(test_bit(BTRFS_FS_STATE_LOG_CLEANUP_ERROR,		\
 			   &(fs_info)->fs_state)))
 
+static inline bool btrfs_is_shutdown(struct btrfs_fs_info *fs_info)
+{
+	return test_bit(BTRFS_FS_STATE_EMERGENCY_SHUTDOWN, &fs_info->fs_state);
+}
+
+static inline void btrfs_force_shutdown(struct btrfs_fs_info *fs_info)
+{
+	/*
+	 * Here we do not want to use handle_fs_error(), which will mark
+	 * the fs read-only.
+	 * Some call sites like shutdown ioctl will mark the fs shutdown
+	 * when the fs is frozen. But thaw path will handle RO and RW fs
+	 * differently.
+	 *
+	 * So here we only mark the fs error without flipping it RO.
+	 */
+	WRITE_ONCE(fs_info->fs_error, -EIO);
+	if (!test_and_set_bit(BTRFS_FS_STATE_EMERGENCY_SHUTDOWN, &fs_info->fs_state))
+		btrfs_info(fs_info, "emergency shutdown");
+}
+
 /*
  * We use folio flag owner_2 to indicate there is an ordered extent with
  * unfinished IO.
diff --git a/fs/btrfs/messages.c b/fs/btrfs/messages.c
index 363fd28c0268..2bb4bcb7c2cd 100644
--- a/fs/btrfs/messages.c
+++ b/fs/btrfs/messages.c
@@ -23,6 +23,7 @@ static const char fs_state_chars[] = {
 	[BTRFS_FS_STATE_NO_DATA_CSUMS]		= 'C',
 	[BTRFS_FS_STATE_SKIP_META_CSUMS]	= 'S',
 	[BTRFS_FS_STATE_LOG_CLEANUP_ERROR]	= 'L',
+	[BTRFS_FS_STATE_EMERGENCY_SHUTDOWN]	= 'E',
 };
 
 static void btrfs_state_to_string(const struct btrfs_fs_info *info, char *buf)
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 3/6] btrfs: reject file operations if in shutdown state
  2025-06-25 23:53 [PATCH 0/6] btrfs: add remove_bdev() callback Qu Wenruo
  2025-06-25 23:53 ` [PATCH 1/6] fs: add a new remove_bdev() super operations callback Qu Wenruo
  2025-06-25 23:53 ` [PATCH 2/6] btrfs: introduce a new fs state, EMERGENCY_SHUTDOWN Qu Wenruo
@ 2025-06-25 23:53 ` Qu Wenruo
  2025-06-25 23:53 ` [PATCH 4/6] btrfs: reject delalloc ranges if the fs is shutdown Qu Wenruo
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2025-06-25 23:53 UTC (permalink / raw)
  To: linux-btrfs, linux-fsdevel; +Cc: viro, brauner, jack

This includes the following callbacks of file_operations:

- read_iter()
- write_iter()
- mmap()
- open()
- remap_file_range()
- uring_cmd()
- splice_read()
  This requires a small wrapper to do the extra shutdown check, then call
  the regular filemap_splice_read() function

This should reject most of the file operations on a shutdown btrfs.

The callback ioctl() is intentionally skipped, as ext4 doesn't do the
shutdown check on ioctl() either, thus I believe there is some special
require for ioctl() callback even if the fs is fully shutdown.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/file.c    | 25 ++++++++++++++++++++++++-
 fs/btrfs/ioctl.c   |  3 +++
 fs/btrfs/reflink.c |  3 +++
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 247676f34f2e..918c39a5ed39 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1419,6 +1419,8 @@ ssize_t btrfs_do_write_iter(struct kiocb *iocb, struct iov_iter *from,
 	struct btrfs_inode *inode = BTRFS_I(file_inode(file));
 	ssize_t num_written, num_sync;
 
+	if (unlikely(btrfs_is_shutdown(inode->root->fs_info)))
+		return -EIO;
 	/*
 	 * If the fs flips readonly due to some impossible error, although we
 	 * have opened a file as writable, we have to stop this write operation
@@ -1981,6 +1983,8 @@ static int btrfs_file_mmap(struct file	*filp, struct vm_area_struct *vma)
 {
 	struct address_space *mapping = filp->f_mapping;
 
+	if (unlikely(btrfs_is_shutdown(inode_to_fs_info(file_inode(filp)))))
+		return -EIO;
 	if (!mapping->a_ops->read_folio)
 		return -ENOEXEC;
 
@@ -3040,6 +3044,9 @@ static long btrfs_fallocate(struct file *file, int mode,
 	int blocksize = BTRFS_I(inode)->root->fs_info->sectorsize;
 	int ret;
 
+	if (unlikely(btrfs_is_shutdown(inode_to_fs_info(inode))))
+		return -EIO;
+
 	/* Do not allow fallocate in ZONED mode */
 	if (btrfs_is_zoned(inode_to_fs_info(inode)))
 		return -EOPNOTSUPP;
@@ -3731,6 +3738,9 @@ static int btrfs_file_open(struct inode *inode, struct file *filp)
 {
 	int ret;
 
+	if (unlikely(btrfs_is_shutdown(inode_to_fs_info(inode))))
+		return -EIO;
+
 	filp->f_mode |= FMODE_NOWAIT | FMODE_CAN_ODIRECT;
 
 	ret = fsverity_file_open(inode, filp);
@@ -3743,6 +3753,9 @@ static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 	ssize_t ret = 0;
 
+	if (unlikely(btrfs_is_shutdown(inode_to_fs_info(file_inode(iocb->ki_filp)))))
+		return -EIO;
+
 	if (iocb->ki_flags & IOCB_DIRECT) {
 		ret = btrfs_direct_read(iocb, to);
 		if (ret < 0 || !iov_iter_count(to) ||
@@ -3753,10 +3766,20 @@ static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	return filemap_read(iocb, to, ret);
 }
 
+static ssize_t btrfs_file_splice_read(struct file *in, loff_t *ppos,
+				      struct pipe_inode_info *pipe,
+				      size_t len, unsigned int flags)
+{
+	if (unlikely(btrfs_is_shutdown(inode_to_fs_info(file_inode(in)))))
+		return -EIO;
+
+	return filemap_splice_read(in, ppos, pipe, len, flags);
+}
+
 const struct file_operations btrfs_file_operations = {
 	.llseek		= btrfs_file_llseek,
 	.read_iter      = btrfs_file_read_iter,
-	.splice_read	= filemap_splice_read,
+	.splice_read	= btrfs_file_splice_read,
 	.write_iter	= btrfs_file_write_iter,
 	.splice_write	= iter_file_splice_write,
 	.mmap		= btrfs_file_mmap,
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 65763bc6a0f6..d86967bd3c9c 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -5066,6 +5066,9 @@ static int btrfs_uring_encoded_write(struct io_uring_cmd *cmd, unsigned int issu
 
 int btrfs_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 {
+	if (unlikely(btrfs_is_shutdown(inode_to_fs_info(file_inode(cmd->file)))))
+		return -EIO;
+
 	switch (cmd->cmd_op) {
 	case BTRFS_IOC_ENCODED_READ:
 #if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index 0197bd9160a7..123a5682514b 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -869,6 +869,9 @@ loff_t btrfs_remap_file_range(struct file *src_file, loff_t off,
 	bool same_inode = dst_inode == src_inode;
 	int ret;
 
+	if (unlikely(btrfs_is_shutdown(inode_to_fs_info(file_inode(src_file)))))
+		return -EIO;
+
 	if (remap_flags & ~(REMAP_FILE_DEDUP | REMAP_FILE_ADVISORY))
 		return -EINVAL;
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 4/6] btrfs: reject delalloc ranges if the fs is shutdown
  2025-06-25 23:53 [PATCH 0/6] btrfs: add remove_bdev() callback Qu Wenruo
                   ` (2 preceding siblings ...)
  2025-06-25 23:53 ` [PATCH 3/6] btrfs: reject file operations if in shutdown state Qu Wenruo
@ 2025-06-25 23:53 ` Qu Wenruo
  2025-06-25 23:53 ` [PATCH 5/6] btrfs: implement shutdown ioctl Qu Wenruo
  2025-06-25 23:53 ` [PATCH 6/6] btrfs: implement remove_bdev super operation callback Qu Wenruo
  5 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2025-06-25 23:53 UTC (permalink / raw)
  To: linux-btrfs, linux-fsdevel; +Cc: viro, brauner, jack

If the filesystem has dirty pages before the fs is shutdown, we should
no longer write them back, instead should treat them as writeback error.

Handle such situation by marking all those delalloc range as error and
let error handling path to clean them up.

For ranges that already have ordered extent created, let them continue
the writeback, and at ordered io finish time the file extent item update
will be rejected as the fs is already marked error.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 80c72c594b19..bb2b5d594b14 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -862,6 +862,9 @@ static void compress_file_range(struct btrfs_work *work)
 	int compress_type = fs_info->compress_type;
 	int compress_level = fs_info->compress_level;
 
+	if (unlikely(btrfs_is_shutdown(fs_info)))
+		goto cleanup_and_bail_uncompressed;
+
 	inode_should_defrag(inode, start, end, end - start + 1, SZ_16K);
 
 	/*
@@ -1277,6 +1280,11 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 	unsigned long page_ops;
 	int ret = 0;
 
+	if (unlikely(btrfs_is_shutdown(fs_info))) {
+		ret = -EIO;
+		goto out_unlock;
+	}
+
 	if (btrfs_is_free_space_inode(inode)) {
 		ret = -EINVAL;
 		goto out_unlock;
@@ -2027,7 +2035,7 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct btrfs_root *root = inode->root;
-	struct btrfs_path *path;
+	struct btrfs_path *path = NULL;
 	u64 cow_start = (u64)-1;
 	/*
 	 * If not 0, represents the inclusive end of the last fallback_to_cow()
@@ -2047,6 +2055,10 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 	 */
 	ASSERT(!btrfs_is_zoned(fs_info) || btrfs_is_data_reloc_root(root));
 
+	if (unlikely(btrfs_is_shutdown(fs_info))) {
+		ret = -EIO;
+		goto error;
+	}
 	path = btrfs_alloc_path();
 	if (!path) {
 		ret = -ENOMEM;
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 5/6] btrfs: implement shutdown ioctl
  2025-06-25 23:53 [PATCH 0/6] btrfs: add remove_bdev() callback Qu Wenruo
                   ` (3 preceding siblings ...)
  2025-06-25 23:53 ` [PATCH 4/6] btrfs: reject delalloc ranges if the fs is shutdown Qu Wenruo
@ 2025-06-25 23:53 ` Qu Wenruo
  2025-06-25 23:53 ` [PATCH 6/6] btrfs: implement remove_bdev super operation callback Qu Wenruo
  5 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2025-06-25 23:53 UTC (permalink / raw)
  To: linux-btrfs, linux-fsdevel; +Cc: viro, brauner, jack

The shutdown interface should all follow the XFS one, which use magic
'X', and ioctl number 125, with a u32 as flags.

For now btrfs don't distinguish DEFAULT and LOGFLUSH flags, both will
freeze the fs first (implies committing the current transaction),
setting the SHUTDOWN flag (along with fs error flag), thaw the fs.

For NOLOGFLUSH flag, the freeze/thaw part is skipped thus the current
transaction is aborted.

The new shutdown ioctl is hidden behind experimental features, as there
are still some minor test failures exposed during the shutdown group
run.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ioctl.c           | 40 ++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/btrfs.h |  9 +++++++++
 2 files changed, 49 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index d86967bd3c9c..1550fe7887fa 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -5212,6 +5212,36 @@ static int btrfs_ioctl_subvol_sync(struct btrfs_fs_info *fs_info, void __user *a
 	return 0;
 }
 
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+static int btrfs_emergency_shutdown(struct btrfs_fs_info *fs_info, u32 flags)
+{
+	int ret = 0;
+
+	if (flags >= BTRFS_SHUTDOWN_FLAGS_LAST)
+		return -EINVAL;
+
+	if (btrfs_is_shutdown(fs_info))
+		return 0;
+
+	switch (flags) {
+	case BTRFS_SHUTDOWN_FLAGS_LOGFLUSH:
+	case BTRFS_SHUTDOWN_FLAGS_DEFAULT:
+		ret = freeze_super(fs_info->sb, FREEZE_HOLDER_KERNEL, NULL);
+		if (ret)
+			return ret;
+		btrfs_force_shutdown(fs_info);
+		ret = thaw_super(fs_info->sb, FREEZE_HOLDER_KERNEL, NULL);
+		if (ret)
+			return ret;
+		break;
+	case BTRFS_SHUTDOWN_FLAGS_NOLOGFLUSH:
+		btrfs_force_shutdown(fs_info);
+		break;
+	}
+	return ret;
+}
+#endif
+
 long btrfs_ioctl(struct file *file, unsigned int
 		cmd, unsigned long arg)
 {
@@ -5219,6 +5249,8 @@ long btrfs_ioctl(struct file *file, unsigned int
 	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	void __user *argp = (void __user *)arg;
+	/* If @arg is just an unsigned long value. */
+	unsigned long flags;
 
 	switch (cmd) {
 	case FS_IOC_GETVERSION:
@@ -5367,7 +5399,15 @@ long btrfs_ioctl(struct file *file, unsigned int
 #endif
 	case BTRFS_IOC_SUBVOL_SYNC_WAIT:
 		return btrfs_ioctl_subvol_sync(fs_info, argp);
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+	case BTRFS_IOC_SHUTDOWN:
+		if (!capable(CAP_SYS_ADMIN))
+			return -EPERM;
+		if (get_user(flags, (__u32 __user *)arg))
+			return -EFAULT;
+		return btrfs_emergency_shutdown(fs_info, flags);
 	}
+#endif
 
 	return -ENOTTY;
 }
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index dd02160015b2..8f6324cf15d9 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -1096,6 +1096,12 @@ enum btrfs_err_code {
 	BTRFS_ERROR_DEV_RAID1C4_MIN_NOT_MET,
 };
 
+/* Flags for IOC_SHUTDOWN, should match XFS' flags. */
+#define BTRFS_SHUTDOWN_FLAGS_DEFAULT	0x0
+#define BTRFS_SHUTDOWN_FLAGS_LOGFLUSH	0x1
+#define BTRFS_SHUTDOWN_FLAGS_NOLOGFLUSH	0x2
+#define BTRFS_SHUTDOWN_FLAGS_LAST	0x3
+
 #define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \
 				   struct btrfs_ioctl_vol_args)
 #define BTRFS_IOC_DEFRAG _IOW(BTRFS_IOCTL_MAGIC, 2, \
@@ -1217,6 +1223,9 @@ enum btrfs_err_code {
 #define BTRFS_IOC_SUBVOL_SYNC_WAIT _IOW(BTRFS_IOCTL_MAGIC, 65, \
 					struct btrfs_ioctl_subvol_wait)
 
+/* Shutdown ioctl should follow XFS's interfaces, thus not using btrfs magic. */
+#define BTRFS_IOC_SHUTDOWN	_IOR('X', 125, __u32)
+
 #ifdef __cplusplus
 }
 #endif
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 6/6] btrfs: implement remove_bdev super operation callback
  2025-06-25 23:53 [PATCH 0/6] btrfs: add remove_bdev() callback Qu Wenruo
                   ` (4 preceding siblings ...)
  2025-06-25 23:53 ` [PATCH 5/6] btrfs: implement shutdown ioctl Qu Wenruo
@ 2025-06-25 23:53 ` Qu Wenruo
  5 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2025-06-25 23:53 UTC (permalink / raw)
  To: linux-btrfs, linux-fsdevel; +Cc: viro, brauner, jack

For this callback, btrfs will:

- Go degraded if the fs can still maintain RW operations
  And of course mark the target device as missing.

- Shutdown if the fs can not maintain RW operations

I know the shutdown can be a little overkilled, if one has a RAID1
metadata and RAID0 data, in that case one can still read data with 50%
rate to got some good data.

But it can also be as bad as the only device went missing for a single
device btrfs.

So here we go safe other than sorry when handling missing device.

And the remove_bdev callback will be hidden behind experimental features
for now, the reasons are:

- There are not enough btrfs specific bdev removal test cases
  The existing test cases are all removing the only device, thus only
  exercises the shutdown branch.

- Not yet determined what's the expected behavior
  Although the current auto-degrade behavior is no worse than the old
  behavior, it may not always be what the end users want.

  Before there is a concrete solution, better hide the new feature
  from end users.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/super.c   | 55 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.c |  2 ++
 fs/btrfs/volumes.h |  5 +++++
 3 files changed, 62 insertions(+)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 5a07330fb3a6..3fba3d6309a2 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2412,6 +2412,58 @@ static long btrfs_free_cached_objects(struct super_block *sb, struct shrink_cont
 	return 0;
 }
 
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+static void btrfs_remove_bdev(struct super_block *sb, struct block_device *bdev,
+			      bool surprise)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(sb);
+	struct btrfs_device *device;
+	struct btrfs_dev_lookup_args lookup_args = { .devt = bdev->bd_dev };
+	bool can_rw;
+	int ret;
+
+	if (!surprise) {
+		ret = btrfs_sync_fs(sb, 1);
+		if (ret)
+			btrfs_warn(fs_info,
+			"filesystem failed to sync in preparation for device loss: %d",
+				   ret);
+	}
+
+	mutex_lock(&fs_info->fs_devices->device_list_mutex);
+	device = btrfs_find_device(fs_info->fs_devices, &lookup_args);
+	if (!device) {
+		btrfs_warn(fs_info, "unable to find btrfs device for block device '%pg'",
+			   bdev);
+		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+		return;
+	}
+	set_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state);
+	device->fs_devices->missing_devices++;
+	if (test_and_clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
+		list_del_init(&device->dev_alloc_list);
+		device->fs_devices->rw_devices--;
+	}
+	can_rw = btrfs_check_rw_degradable(fs_info, device);
+	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+	/*
+	 * Now device is considered missing, btrfs_device_name() won't give a
+	 * meaningful result anymore.
+	 */
+	if (!can_rw) {
+		btrfs_warn(fs_info,
+		"btrfs device id %llu has gone missing, can not maintain read-write",
+			   device->devid);
+		btrfs_force_shutdown(fs_info);
+		return;
+	}
+	btrfs_warn(fs_info,
+		   "btrfs device id %llu has gone missing, continue as degraded",
+		   device->devid);
+	btrfs_set_opt(fs_info->mount_opt, DEGRADED);
+}
+#endif
+
 static const struct super_operations btrfs_super_ops = {
 	.drop_inode	= btrfs_drop_inode,
 	.evict_inode	= btrfs_evict_inode,
@@ -2427,6 +2479,9 @@ static const struct super_operations btrfs_super_ops = {
 	.unfreeze_fs	= btrfs_unfreeze,
 	.nr_cached_objects = btrfs_nr_cached_objects,
 	.free_cached_objects = btrfs_free_cached_objects,
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+	.remove_bdev	= btrfs_remove_bdev,
+#endif
 };
 
 static const struct file_operations btrfs_ctl_fops = {
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8ea1a69808a3..8feac0129bdd 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6794,6 +6794,8 @@ static bool dev_args_match_fs_devices(const struct btrfs_dev_lookup_args *args,
 static bool dev_args_match_device(const struct btrfs_dev_lookup_args *args,
 				  const struct btrfs_device *device)
 {
+	if (args->devt)
+		return device->devt == args->devt;
 	if (args->missing) {
 		if (test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state) &&
 		    !device->bdev)
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 6acb154ccf87..71e570f8337d 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -663,6 +663,11 @@ struct btrfs_dev_lookup_args {
 	u64 devid;
 	u8 *uuid;
 	u8 *fsid;
+	/*
+	 * If devt is specified, all other members will be ignored as it is
+	 * enough to uniquely locate a device.
+	 */
+	dev_t devt;
 	bool missing;
 };
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/6] fs: add a new remove_bdev() super operations callback
  2025-06-25 23:53 ` [PATCH 1/6] fs: add a new remove_bdev() super operations callback Qu Wenruo
@ 2025-06-26  8:38   ` Christian Brauner
  2025-06-26  8:44     ` Qu Wenruo
  2025-06-26 10:14     ` Christoph Hellwig
  0 siblings, 2 replies; 16+ messages in thread
From: Christian Brauner @ 2025-06-26  8:38 UTC (permalink / raw)
  To: Qu Wenruo, Christoph Hellwig; +Cc: linux-btrfs, linux-fsdevel, viro, jack

On Thu, Jun 26, 2025 at 09:23:42AM +0930, Qu Wenruo wrote:
> The new remove_bdev() call back is mostly for multi-device filesystems
> to handle device removal.
> 
> Some multi-devices filesystems like btrfs can have the ability to handle
> device lose according to the setup (e.g. all chunks have extra mirrors),
> thus losing a block device will not interrupt the normal operations.
> 
> Btrfs will soon implement this call back by:
> 
> - Automatically degrade the fs if read-write operations can be
>   maintained
> 
> - Shutdown the fs if read-write operations can not be maintained
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/super.c         |  4 +++-
>  include/linux/fs.h | 18 ++++++++++++++++++
>  2 files changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 80418ca8e215..07845d2f9ec4 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1463,7 +1463,9 @@ static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
>  		sync_filesystem(sb);
>  	shrink_dcache_sb(sb);
>  	evict_inodes(sb);
> -	if (sb->s_op->shutdown)
> +	if (sb->s_op->remove_bdev)
> +		sb->s_op->remove_bdev(sb, bdev, surprise);
> +	else if (sb->s_op->shutdown)
>  		sb->s_op->shutdown(sb);

This makes ->remove_bdev() and ->shutdown() mutually exclusive. I really
really dislike this pattern. It introduces the possibility that a
filesystem accidently implement both variants and assumes both are
somehow called. That can be solved by an assert at superblock initation
time but it's still nasty.

The other thing is that this just reeks of being the wrong api. We
should absolutely aim for the methods to not be mutually exclusive. I
hate that with a passion. That's just an ugly api and I want to have as
little of that as possible in our code.

>  
>  	super_unlock_shared(sb);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index b085f161ed22..5e84e06c7354 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2367,7 +2367,25 @@ struct super_operations {
>  				  struct shrink_control *);
>  	long (*free_cached_objects)(struct super_block *,
>  				    struct shrink_control *);
> +	/*
> +	 * Callback to shutdown the fs.
> +	 *
> +	 * If a fs can not afford losing any block device, implement this callback.
> +	 */
>  	void (*shutdown)(struct super_block *sb);
> +
> +	/*
> +	 * Callback to handle a block device removal.
> +	 *
> +	 * Recommended to implement this for multi-device filesystems, as they
> +	 * may afford losing a block device and continue operations.
> +	 *
> +	 * @surprse:	indicates a surprise removal. If true the device/media is
> +	 *		already gone. Otherwise we're prepareing for an orderly
> +	 *		removal.
> +	 */
> +	void (*remove_bdev)(struct super_block *sb, struct block_device *bdev,
> +			    bool surprise);
>  };

Yeah, I think that's just not a good api. That looks a lot to me like we
should just collapse both functions even though earlier discussion said
we shouldn't. Just do:

s/shutdown/remove_bdev/

or

s/shutdown/shutdown_bdev()

The filesystem will know whether it has to kill the filesystem or if it
can keep going even if the device is lost. Hell, if we have to we could
just have it return whether it killed the superblock or just the device
by giving the method a return value. But for now it probably doesn't
matter.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/6] fs: add a new remove_bdev() super operations callback
  2025-06-26  8:38   ` Christian Brauner
@ 2025-06-26  8:44     ` Qu Wenruo
  2025-06-26  9:10       ` Christian Brauner
  2025-06-26 10:14     ` Christoph Hellwig
  1 sibling, 1 reply; 16+ messages in thread
From: Qu Wenruo @ 2025-06-26  8:44 UTC (permalink / raw)
  To: Christian Brauner, Christoph Hellwig
  Cc: linux-btrfs, linux-fsdevel, viro, jack



在 2025/6/26 18:08, Christian Brauner 写道:
> On Thu, Jun 26, 2025 at 09:23:42AM +0930, Qu Wenruo wrote:
>> The new remove_bdev() call back is mostly for multi-device filesystems
>> to handle device removal.
>>
>> Some multi-devices filesystems like btrfs can have the ability to handle
>> device lose according to the setup (e.g. all chunks have extra mirrors),
>> thus losing a block device will not interrupt the normal operations.
>>
>> Btrfs will soon implement this call back by:
>>
>> - Automatically degrade the fs if read-write operations can be
>>    maintained
>>
>> - Shutdown the fs if read-write operations can not be maintained
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/super.c         |  4 +++-
>>   include/linux/fs.h | 18 ++++++++++++++++++
>>   2 files changed, 21 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/super.c b/fs/super.c
>> index 80418ca8e215..07845d2f9ec4 100644
>> --- a/fs/super.c
>> +++ b/fs/super.c
>> @@ -1463,7 +1463,9 @@ static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
>>   		sync_filesystem(sb);
>>   	shrink_dcache_sb(sb);
>>   	evict_inodes(sb);
>> -	if (sb->s_op->shutdown)
>> +	if (sb->s_op->remove_bdev)
>> +		sb->s_op->remove_bdev(sb, bdev, surprise);
>> +	else if (sb->s_op->shutdown)
>>   		sb->s_op->shutdown(sb);
> 
> This makes ->remove_bdev() and ->shutdown() mutually exclusive. I really
> really dislike this pattern. It introduces the possibility that a
> filesystem accidently implement both variants and assumes both are
> somehow called. That can be solved by an assert at superblock initation
> time but it's still nasty.
> 
> The other thing is that this just reeks of being the wrong api. We
> should absolutely aim for the methods to not be mutually exclusive. I
> hate that with a passion. That's just an ugly api and I want to have as
> little of that as possible in our code.

So what do you really want?

The original path to expand the shutdown() callback is rejected, and now 
the new callback is also rejected.

I guess the only thing left is to rename shutdown() to remove_bdev(), 
add the new parameters and keep existing fses doing what they do (aka, 
shutdown)?

Thanks,
Qu

> 
>>   
>>   	super_unlock_shared(sb);
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index b085f161ed22..5e84e06c7354 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -2367,7 +2367,25 @@ struct super_operations {
>>   				  struct shrink_control *);
>>   	long (*free_cached_objects)(struct super_block *,
>>   				    struct shrink_control *);
>> +	/*
>> +	 * Callback to shutdown the fs.
>> +	 *
>> +	 * If a fs can not afford losing any block device, implement this callback.
>> +	 */
>>   	void (*shutdown)(struct super_block *sb);
>> +
>> +	/*
>> +	 * Callback to handle a block device removal.
>> +	 *
>> +	 * Recommended to implement this for multi-device filesystems, as they
>> +	 * may afford losing a block device and continue operations.
>> +	 *
>> +	 * @surprse:	indicates a surprise removal. If true the device/media is
>> +	 *		already gone. Otherwise we're prepareing for an orderly
>> +	 *		removal.
>> +	 */
>> +	void (*remove_bdev)(struct super_block *sb, struct block_device *bdev,
>> +			    bool surprise);
>>   };
> 
> Yeah, I think that's just not a good api. That looks a lot to me like we
> should just collapse both functions even though earlier discussion said
> we shouldn't. Just do:
> 
> s/shutdown/remove_bdev/
> 
> or
> 
> s/shutdown/shutdown_bdev()
> 
> The filesystem will know whether it has to kill the filesystem or if it
> can keep going even if the device is lost. Hell, if we have to we could
> just have it return whether it killed the superblock or just the device
> by giving the method a return value. But for now it probably doesn't
> matter.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/6] fs: add a new remove_bdev() super operations callback
  2025-06-26  8:44     ` Qu Wenruo
@ 2025-06-26  9:10       ` Christian Brauner
  2025-06-26  9:18         ` Qu Wenruo
  0 siblings, 1 reply; 16+ messages in thread
From: Christian Brauner @ 2025-06-26  9:10 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Christoph Hellwig, linux-btrfs, linux-fsdevel, viro, jack

On Thu, Jun 26, 2025 at 06:14:03PM +0930, Qu Wenruo wrote:
> 
> 
> 在 2025/6/26 18:08, Christian Brauner 写道:
> > On Thu, Jun 26, 2025 at 09:23:42AM +0930, Qu Wenruo wrote:
> > > The new remove_bdev() call back is mostly for multi-device filesystems
> > > to handle device removal.
> > > 
> > > Some multi-devices filesystems like btrfs can have the ability to handle
> > > device lose according to the setup (e.g. all chunks have extra mirrors),
> > > thus losing a block device will not interrupt the normal operations.
> > > 
> > > Btrfs will soon implement this call back by:
> > > 
> > > - Automatically degrade the fs if read-write operations can be
> > >    maintained
> > > 
> > > - Shutdown the fs if read-write operations can not be maintained
> > > 
> > > Signed-off-by: Qu Wenruo <wqu@suse.com>
> > > ---
> > >   fs/super.c         |  4 +++-
> > >   include/linux/fs.h | 18 ++++++++++++++++++
> > >   2 files changed, 21 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/super.c b/fs/super.c
> > > index 80418ca8e215..07845d2f9ec4 100644
> > > --- a/fs/super.c
> > > +++ b/fs/super.c
> > > @@ -1463,7 +1463,9 @@ static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
> > >   		sync_filesystem(sb);
> > >   	shrink_dcache_sb(sb);
> > >   	evict_inodes(sb);
> > > -	if (sb->s_op->shutdown)
> > > +	if (sb->s_op->remove_bdev)
> > > +		sb->s_op->remove_bdev(sb, bdev, surprise);
> > > +	else if (sb->s_op->shutdown)
> > >   		sb->s_op->shutdown(sb);
> > 
> > This makes ->remove_bdev() and ->shutdown() mutually exclusive. I really
> > really dislike this pattern. It introduces the possibility that a
> > filesystem accidently implement both variants and assumes both are
> > somehow called. That can be solved by an assert at superblock initation
> > time but it's still nasty.
> > 
> > The other thing is that this just reeks of being the wrong api. We
> > should absolutely aim for the methods to not be mutually exclusive. I
> > hate that with a passion. That's just an ugly api and I want to have as
> > little of that as possible in our code.
> 
> So what do you really want?
> 
> The original path to expand the shutdown() callback is rejected, and now the
> new callback is also rejected.
> 
> I guess the only thing left is to rename shutdown() to remove_bdev(), add
> the new parameters and keep existing fses doing what they do (aka,
> shutdown)?

Yes. My original understanding had been that ->remove_bdev() would be
called in a different codepath than ->shutdown() and they would be
complimentary. So sorry for the back and forth here. If that's not the
case I don't see any point in having two distinct methods.

> 
> Thanks,
> Qu
> 
> > 
> > >   	super_unlock_shared(sb);
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index b085f161ed22..5e84e06c7354 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -2367,7 +2367,25 @@ struct super_operations {
> > >   				  struct shrink_control *);
> > >   	long (*free_cached_objects)(struct super_block *,
> > >   				    struct shrink_control *);
> > > +	/*
> > > +	 * Callback to shutdown the fs.
> > > +	 *
> > > +	 * If a fs can not afford losing any block device, implement this callback.
> > > +	 */
> > >   	void (*shutdown)(struct super_block *sb);
> > > +
> > > +	/*
> > > +	 * Callback to handle a block device removal.
> > > +	 *
> > > +	 * Recommended to implement this for multi-device filesystems, as they
> > > +	 * may afford losing a block device and continue operations.
> > > +	 *
> > > +	 * @surprse:	indicates a surprise removal. If true the device/media is
> > > +	 *		already gone. Otherwise we're prepareing for an orderly
> > > +	 *		removal.
> > > +	 */
> > > +	void (*remove_bdev)(struct super_block *sb, struct block_device *bdev,
> > > +			    bool surprise);
> > >   };
> > 
> > Yeah, I think that's just not a good api. That looks a lot to me like we
> > should just collapse both functions even though earlier discussion said
> > we shouldn't. Just do:
> > 
> > s/shutdown/remove_bdev/
> > 
> > or
> > 
> > s/shutdown/shutdown_bdev()
> > 
> > The filesystem will know whether it has to kill the filesystem or if it
> > can keep going even if the device is lost. Hell, if we have to we could
> > just have it return whether it killed the superblock or just the device
> > by giving the method a return value. But for now it probably doesn't
> > matter.
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/6] fs: add a new remove_bdev() super operations callback
  2025-06-26  9:10       ` Christian Brauner
@ 2025-06-26  9:18         ` Qu Wenruo
  2025-06-26  9:29           ` Christian Brauner
  0 siblings, 1 reply; 16+ messages in thread
From: Qu Wenruo @ 2025-06-26  9:18 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christoph Hellwig, linux-btrfs, linux-fsdevel, viro, jack



在 2025/6/26 18:40, Christian Brauner 写道:
> On Thu, Jun 26, 2025 at 06:14:03PM +0930, Qu Wenruo wrote:
>>
>>
>> 在 2025/6/26 18:08, Christian Brauner 写道:
>>> On Thu, Jun 26, 2025 at 09:23:42AM +0930, Qu Wenruo wrote:
>>>> The new remove_bdev() call back is mostly for multi-device filesystems
>>>> to handle device removal.
>>>>
>>>> Some multi-devices filesystems like btrfs can have the ability to handle
>>>> device lose according to the setup (e.g. all chunks have extra mirrors),
>>>> thus losing a block device will not interrupt the normal operations.
>>>>
>>>> Btrfs will soon implement this call back by:
>>>>
>>>> - Automatically degrade the fs if read-write operations can be
>>>>     maintained
>>>>
>>>> - Shutdown the fs if read-write operations can not be maintained
>>>>
>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>> ---
>>>>    fs/super.c         |  4 +++-
>>>>    include/linux/fs.h | 18 ++++++++++++++++++
>>>>    2 files changed, 21 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/super.c b/fs/super.c
>>>> index 80418ca8e215..07845d2f9ec4 100644
>>>> --- a/fs/super.c
>>>> +++ b/fs/super.c
>>>> @@ -1463,7 +1463,9 @@ static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
>>>>    		sync_filesystem(sb);
>>>>    	shrink_dcache_sb(sb);
>>>>    	evict_inodes(sb);
>>>> -	if (sb->s_op->shutdown)
>>>> +	if (sb->s_op->remove_bdev)
>>>> +		sb->s_op->remove_bdev(sb, bdev, surprise);
>>>> +	else if (sb->s_op->shutdown)
>>>>    		sb->s_op->shutdown(sb);
>>>
>>> This makes ->remove_bdev() and ->shutdown() mutually exclusive. I really
>>> really dislike this pattern. It introduces the possibility that a
>>> filesystem accidently implement both variants and assumes both are
>>> somehow called. That can be solved by an assert at superblock initation
>>> time but it's still nasty.
>>>
>>> The other thing is that this just reeks of being the wrong api. We
>>> should absolutely aim for the methods to not be mutually exclusive. I
>>> hate that with a passion. That's just an ugly api and I want to have as
>>> little of that as possible in our code.
>>
>> So what do you really want?
>>
>> The original path to expand the shutdown() callback is rejected, and now the
>> new callback is also rejected.
>>
>> I guess the only thing left is to rename shutdown() to remove_bdev(), add
>> the new parameters and keep existing fses doing what they do (aka,
>> shutdown)?
> 
> Yes. My original understanding had been that ->remove_bdev() would be
> called in a different codepath than ->shutdown() and they would be
> complimentary. So sorry for the back and forth here. If that's not the
> case I don't see any point in having two distinct methods.

That's fine, just want to do a final confirmation that everyone is fine 
with such change:

diff --git a/include/linux/fs.h b/include/linux/fs.h
index b085f161ed22..c11b9219863b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2367,7 +2367,8 @@ struct super_operations {
                                   struct shrink_control *);
         long (*free_cached_objects)(struct super_block *,
                                     struct shrink_control *);
-       void (*shutdown)(struct super_block *sb);
+       void (*remove_bdev)(struct super_block *sb, struct block_device 
*bdev,
+                           bool surprise);
  };

  /*

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/6] fs: add a new remove_bdev() super operations callback
  2025-06-26  9:18         ` Qu Wenruo
@ 2025-06-26  9:29           ` Christian Brauner
  0 siblings, 0 replies; 16+ messages in thread
From: Christian Brauner @ 2025-06-26  9:29 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Christoph Hellwig, linux-btrfs, linux-fsdevel, viro, jack

On Thu, Jun 26, 2025 at 06:48:29PM +0930, Qu Wenruo wrote:
> 
> 
> 在 2025/6/26 18:40, Christian Brauner 写道:
> > On Thu, Jun 26, 2025 at 06:14:03PM +0930, Qu Wenruo wrote:
> > > 
> > > 
> > > 在 2025/6/26 18:08, Christian Brauner 写道:
> > > > On Thu, Jun 26, 2025 at 09:23:42AM +0930, Qu Wenruo wrote:
> > > > > The new remove_bdev() call back is mostly for multi-device filesystems
> > > > > to handle device removal.
> > > > > 
> > > > > Some multi-devices filesystems like btrfs can have the ability to handle
> > > > > device lose according to the setup (e.g. all chunks have extra mirrors),
> > > > > thus losing a block device will not interrupt the normal operations.
> > > > > 
> > > > > Btrfs will soon implement this call back by:
> > > > > 
> > > > > - Automatically degrade the fs if read-write operations can be
> > > > >     maintained
> > > > > 
> > > > > - Shutdown the fs if read-write operations can not be maintained
> > > > > 
> > > > > Signed-off-by: Qu Wenruo <wqu@suse.com>
> > > > > ---
> > > > >    fs/super.c         |  4 +++-
> > > > >    include/linux/fs.h | 18 ++++++++++++++++++
> > > > >    2 files changed, 21 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/fs/super.c b/fs/super.c
> > > > > index 80418ca8e215..07845d2f9ec4 100644
> > > > > --- a/fs/super.c
> > > > > +++ b/fs/super.c
> > > > > @@ -1463,7 +1463,9 @@ static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
> > > > >    		sync_filesystem(sb);
> > > > >    	shrink_dcache_sb(sb);
> > > > >    	evict_inodes(sb);
> > > > > -	if (sb->s_op->shutdown)
> > > > > +	if (sb->s_op->remove_bdev)
> > > > > +		sb->s_op->remove_bdev(sb, bdev, surprise);
> > > > > +	else if (sb->s_op->shutdown)
> > > > >    		sb->s_op->shutdown(sb);
> > > > 
> > > > This makes ->remove_bdev() and ->shutdown() mutually exclusive. I really
> > > > really dislike this pattern. It introduces the possibility that a
> > > > filesystem accidently implement both variants and assumes both are
> > > > somehow called. That can be solved by an assert at superblock initation
> > > > time but it's still nasty.
> > > > 
> > > > The other thing is that this just reeks of being the wrong api. We
> > > > should absolutely aim for the methods to not be mutually exclusive. I
> > > > hate that with a passion. That's just an ugly api and I want to have as
> > > > little of that as possible in our code.
> > > 
> > > So what do you really want?
> > > 
> > > The original path to expand the shutdown() callback is rejected, and now the
> > > new callback is also rejected.
> > > 
> > > I guess the only thing left is to rename shutdown() to remove_bdev(), add
> > > the new parameters and keep existing fses doing what they do (aka,
> > > shutdown)?
> > 
> > Yes. My original understanding had been that ->remove_bdev() would be
> > called in a different codepath than ->shutdown() and they would be
> > complimentary. So sorry for the back and forth here. If that's not the
> > case I don't see any point in having two distinct methods.
> 
> That's fine, just want to do a final confirmation that everyone is fine with
> such change:
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index b085f161ed22..c11b9219863b 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2367,7 +2367,8 @@ struct super_operations {
>                                   struct shrink_control *);
>         long (*free_cached_objects)(struct super_block *,
>                                     struct shrink_control *);
> -       void (*shutdown)(struct super_block *sb);
> +       void (*remove_bdev)(struct super_block *sb, struct block_device
> *bdev,
> +                           bool surprise);
>  };

This looks good to me!

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/6] fs: add a new remove_bdev() super operations callback
  2025-06-26  8:38   ` Christian Brauner
  2025-06-26  8:44     ` Qu Wenruo
@ 2025-06-26 10:14     ` Christoph Hellwig
  2025-06-26 10:36       ` Qu Wenruo
  2025-07-01 11:35       ` Christian Brauner
  1 sibling, 2 replies; 16+ messages in thread
From: Christoph Hellwig @ 2025-06-26 10:14 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Qu Wenruo, Christoph Hellwig, linux-btrfs, linux-fsdevel, viro,
	jack

On Thu, Jun 26, 2025 at 10:38:11AM +0200, Christian Brauner wrote:
> > +	if (sb->s_op->remove_bdev)
> > +		sb->s_op->remove_bdev(sb, bdev, surprise);
> > +	else if (sb->s_op->shutdown)
> >  		sb->s_op->shutdown(sb);
> 
> This makes ->remove_bdev() and ->shutdown() mutually exclusive. I really
> really dislike this pattern. It introduces the possibility that a
> filesystem accidently implement both variants and assumes both are
> somehow called. That can be solved by an assert at superblock initation
> time but it's still nasty.
> 
> The other thing is that this just reeks of being the wrong api. We
> should absolutely aim for the methods to not be mutually exclusive. I
> hate that with a passion. That's just an ugly api and I want to have as
> little of that as possible in our code.

Yes.  As I mentioned before I think we just want to transition everyone
to the remove_bdev API.  But our life is probably easier if Qu's series
only adds the new one so that we can do the transitions through the
file system trees instead of needing a global API change.  Or am I
overthinking this?

> > +	 * @surprse:	indicates a surprise removal. If true the device/media is

surprse is misspelled here.  But I don't see how passing this on to
the file system would be useful, because at this point everything is
a surprise for the file system.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/6] fs: add a new remove_bdev() super operations callback
  2025-06-26 10:14     ` Christoph Hellwig
@ 2025-06-26 10:36       ` Qu Wenruo
  2025-06-26 11:02         ` Christoph Hellwig
  2025-07-01 11:35       ` Christian Brauner
  1 sibling, 1 reply; 16+ messages in thread
From: Qu Wenruo @ 2025-06-26 10:36 UTC (permalink / raw)
  To: Christoph Hellwig, Christian Brauner
  Cc: Qu Wenruo, linux-btrfs, linux-fsdevel, viro, jack

在 2025/6/26 19:44, Christoph Hellwig 写道:
> On Thu, Jun 26, 2025 at 10:38:11AM +0200, Christian Brauner wrote:
>>> +	if (sb->s_op->remove_bdev)
>>> +		sb->s_op->remove_bdev(sb, bdev, surprise);
>>> +	else if (sb->s_op->shutdown)
>>>   		sb->s_op->shutdown(sb);
>>
>> This makes ->remove_bdev() and ->shutdown() mutually exclusive. I really
>> really dislike this pattern. It introduces the possibility that a
>> filesystem accidently implement both variants and assumes both are
>> somehow called. That can be solved by an assert at superblock initation
>> time but it's still nasty.
>>
>> The other thing is that this just reeks of being the wrong api. We
>> should absolutely aim for the methods to not be mutually exclusive. I
>> hate that with a passion. That's just an ugly api and I want to have as
>> little of that as possible in our code.
> 
> Yes.  As I mentioned before I think we just want to transition everyone
> to the remove_bdev API.  But our life is probably easier if Qu's series
> only adds the new one so that we can do the transitions through the
> file system trees instead of needing a global API change.  Or am I
> overthinking this?

Then let's do the transition right now.

The transition itself is pretty straightforward, just renaming and 
appending the parameter list, and keeping the old shutdown behavior.

AFAIK only btrfs and bcachefs will take advantage of the new interface 
for now.

> 
>>> +	 * @surprse:	indicates a surprise removal. If true the device/media is
> 
> surprse is misspelled here.  But I don't see how passing this on to
> the file system would be useful, because at this point everything is
> a surprise for the file system.

If I understand the @surprise parameter correctly, it should allow the 
fs to do read/write as usual if it's not a surprise removal.

And btrfs will take the chance to fully writeback all the dirty pages 
(more than the default shutdown behavior which only writebacks the 
current transaction, no dirty data pages.).

But in the real world, for test case like generic/730, the @surprise 
flag is either not properly respected, I'm getting @surprise == false 
but the block device is already gone.

So I'm not sure what's the real expected behavior here, and the new flag 
is only for future expansion for now.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/6] fs: add a new remove_bdev() super operations callback
  2025-06-26 10:36       ` Qu Wenruo
@ 2025-06-26 11:02         ` Christoph Hellwig
  0 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2025-06-26 11:02 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Christoph Hellwig, Christian Brauner, linux-btrfs, linux-fsdevel,
	viro, jack

On Thu, Jun 26, 2025 at 08:06:03PM +0930, Qu Wenruo wrote:
>
> If I understand the @surprise parameter correctly, it should allow the fs 
> to do read/write as usual if it's not a surprise removal.
>
> And btrfs will take the chance to fully writeback all the dirty pages (more 
> than the default shutdown behavior which only writebacks the current 
> transaction, no dirty data pages.).

That's already taken care of by the call to sync_filesystem in
fs_bdev_mark_dead.

> But in the real world, for test case like generic/730, the @surprise flag 
> is either not properly respected, I'm getting @surprise == false but the 
> block device is already gone.

It only works for drivers that call blk_mark_disk_dead or bdev_mark_dead
directly with the surprise flag.

> So I'm not sure what's the real expected behavior here, and the new flag is 
> only for future expansion for now.

Let's not add just in case arguments.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/6] fs: add a new remove_bdev() super operations callback
  2025-06-26 10:14     ` Christoph Hellwig
  2025-06-26 10:36       ` Qu Wenruo
@ 2025-07-01 11:35       ` Christian Brauner
  1 sibling, 0 replies; 16+ messages in thread
From: Christian Brauner @ 2025-07-01 11:35 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Qu Wenruo, linux-btrfs, linux-fsdevel, viro, jack

On Thu, Jun 26, 2025 at 12:14:43PM +0200, Christoph Hellwig wrote:
> On Thu, Jun 26, 2025 at 10:38:11AM +0200, Christian Brauner wrote:
> > > +	if (sb->s_op->remove_bdev)
> > > +		sb->s_op->remove_bdev(sb, bdev, surprise);
> > > +	else if (sb->s_op->shutdown)
> > >  		sb->s_op->shutdown(sb);
> > 
> > This makes ->remove_bdev() and ->shutdown() mutually exclusive. I really
> > really dislike this pattern. It introduces the possibility that a
> > filesystem accidently implement both variants and assumes both are
> > somehow called. That can be solved by an assert at superblock initation
> > time but it's still nasty.
> > 
> > The other thing is that this just reeks of being the wrong api. We
> > should absolutely aim for the methods to not be mutually exclusive. I
> > hate that with a passion. That's just an ugly api and I want to have as
> > little of that as possible in our code.
> 
> Yes.  As I mentioned before I think we just want to transition everyone
> to the remove_bdev API.  But our life is probably easier if Qu's series
> only adds the new one so that we can do the transitions through the
> file system trees instead of needing a global API change.  Or am I
> overthinking this?

I think it's fine to do it right now. If we can avoid having
transitional stages then let's do that.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2025-07-01 11:35 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-25 23:53 [PATCH 0/6] btrfs: add remove_bdev() callback Qu Wenruo
2025-06-25 23:53 ` [PATCH 1/6] fs: add a new remove_bdev() super operations callback Qu Wenruo
2025-06-26  8:38   ` Christian Brauner
2025-06-26  8:44     ` Qu Wenruo
2025-06-26  9:10       ` Christian Brauner
2025-06-26  9:18         ` Qu Wenruo
2025-06-26  9:29           ` Christian Brauner
2025-06-26 10:14     ` Christoph Hellwig
2025-06-26 10:36       ` Qu Wenruo
2025-06-26 11:02         ` Christoph Hellwig
2025-07-01 11:35       ` Christian Brauner
2025-06-25 23:53 ` [PATCH 2/6] btrfs: introduce a new fs state, EMERGENCY_SHUTDOWN Qu Wenruo
2025-06-25 23:53 ` [PATCH 3/6] btrfs: reject file operations if in shutdown state Qu Wenruo
2025-06-25 23:53 ` [PATCH 4/6] btrfs: reject delalloc ranges if the fs is shutdown Qu Wenruo
2025-06-25 23:53 ` [PATCH 5/6] btrfs: implement shutdown ioctl Qu Wenruo
2025-06-25 23:53 ` [PATCH 6/6] btrfs: implement remove_bdev super operation callback Qu Wenruo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).