* [PATCH 0/6] btrfs: add remove_bdev() callback
@ 2025-06-25 23:53 Qu Wenruo
2025-06-25 23:53 ` [PATCH 1/6] fs: add a new remove_bdev() super operations callback Qu Wenruo
` (5 more replies)
0 siblings, 6 replies; 16+ messages in thread
From: Qu Wenruo @ 2025-06-25 23:53 UTC (permalink / raw)
To: linux-btrfs, linux-fsdevel; +Cc: viro, brauner, jack
[CHANGELOG]
RFC->v1:
- Add a new remove_bdev() callback
Thanks all the feedback from Christian, Christoph and Jan on this new
name.
- Add a @surprise parameter to the remove_bdev() callback
To keep it the same as the bdev_mark_dead().
- Hide the shutdown ioctl and remove_bdev callback behind experimental
With the shutdown ioctl, there are 3 test failures (g/050, g/388,
g/508).
G/050 may require a test case update.
G/388 is related to the error handling with COW fixup.
G/508 looks like something related to log replay.
And the remove_bdev() doesn't have any btrfs specific test case yet to
check the auto-degraded behavior, nor the auto-degraded behavior is
fully discussed.
So hide both of them behind experimental features.
- Do not use btrfs_handle_fs_error() to avoid freeze/thaw behavior change
btrfs_handle_fs_error() will flips the fs read-only, which will
affect freeze/thaw behavior.
And no other fs set the fs read-only when shutting down, so follow the
other fs to have a more consistent behavior.
The series is based on the fs_holder_ops patchset here:
https://lore.kernel.org/linux-btrfs/cover.1750724841.git.wqu@suse.com/
The full series including the btrfs' support of fs_holder_ops can be
found here:
https://github.com/adam900710/linux/tree/shutdown
Patch 1 is introducing the new remove_bdev() super operation callback,
with not only the extra @bdev parameter, but also a @surprise flag to
follow bdev_mark_dead().
Patch 2~5 implement the shutdown ioctl so that btrfs can benefit from
the existing shutdown test group.
Although it's still only enabled for experimental builds, as there are
still some minor failures needs to be addressed
Patch 6 implements the remove_bdev() callback, so that btrfs can either
shutdown the fs or continue degraded.
It's still an experimental feature, as we still need more disccussion
on the user expectation and possible extra user configurable behaviors.
Qu Wenruo (6):
fs: add a new remove_bdev() super operations callback
btrfs: introduce a new fs state, EMERGENCY_SHUTDOWN
btrfs: reject file operations if in shutdown state
btrfs: reject delalloc ranges if the fs is shutdown
btrfs: implement shutdown ioctl
btrfs: implement remove_bdev super operation callback
fs/btrfs/file.c | 25 ++++++++++++++++-
fs/btrfs/fs.h | 28 +++++++++++++++++++
fs/btrfs/inode.c | 14 +++++++++-
fs/btrfs/ioctl.c | 43 +++++++++++++++++++++++++++++
fs/btrfs/messages.c | 1 +
fs/btrfs/reflink.c | 3 +++
fs/btrfs/super.c | 55 ++++++++++++++++++++++++++++++++++++++
fs/btrfs/volumes.c | 2 ++
fs/btrfs/volumes.h | 5 ++++
fs/super.c | 4 ++-
include/linux/fs.h | 18 +++++++++++++
include/uapi/linux/btrfs.h | 9 +++++++
12 files changed, 204 insertions(+), 3 deletions(-)
--
2.49.0
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 1/6] fs: add a new remove_bdev() super operations callback
2025-06-25 23:53 [PATCH 0/6] btrfs: add remove_bdev() callback Qu Wenruo
@ 2025-06-25 23:53 ` Qu Wenruo
2025-06-26 8:38 ` Christian Brauner
2025-06-25 23:53 ` [PATCH 2/6] btrfs: introduce a new fs state, EMERGENCY_SHUTDOWN Qu Wenruo
` (4 subsequent siblings)
5 siblings, 1 reply; 16+ messages in thread
From: Qu Wenruo @ 2025-06-25 23:53 UTC (permalink / raw)
To: linux-btrfs, linux-fsdevel; +Cc: viro, brauner, jack
The new remove_bdev() call back is mostly for multi-device filesystems
to handle device removal.
Some multi-devices filesystems like btrfs can have the ability to handle
device lose according to the setup (e.g. all chunks have extra mirrors),
thus losing a block device will not interrupt the normal operations.
Btrfs will soon implement this call back by:
- Automatically degrade the fs if read-write operations can be
maintained
- Shutdown the fs if read-write operations can not be maintained
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/super.c | 4 +++-
include/linux/fs.h | 18 ++++++++++++++++++
2 files changed, 21 insertions(+), 1 deletion(-)
diff --git a/fs/super.c b/fs/super.c
index 80418ca8e215..07845d2f9ec4 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1463,7 +1463,9 @@ static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
sync_filesystem(sb);
shrink_dcache_sb(sb);
evict_inodes(sb);
- if (sb->s_op->shutdown)
+ if (sb->s_op->remove_bdev)
+ sb->s_op->remove_bdev(sb, bdev, surprise);
+ else if (sb->s_op->shutdown)
sb->s_op->shutdown(sb);
super_unlock_shared(sb);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b085f161ed22..5e84e06c7354 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2367,7 +2367,25 @@ struct super_operations {
struct shrink_control *);
long (*free_cached_objects)(struct super_block *,
struct shrink_control *);
+ /*
+ * Callback to shutdown the fs.
+ *
+ * If a fs can not afford losing any block device, implement this callback.
+ */
void (*shutdown)(struct super_block *sb);
+
+ /*
+ * Callback to handle a block device removal.
+ *
+ * Recommended to implement this for multi-device filesystems, as they
+ * may afford losing a block device and continue operations.
+ *
+ * @surprse: indicates a surprise removal. If true the device/media is
+ * already gone. Otherwise we're prepareing for an orderly
+ * removal.
+ */
+ void (*remove_bdev)(struct super_block *sb, struct block_device *bdev,
+ bool surprise);
};
/*
--
2.49.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 2/6] btrfs: introduce a new fs state, EMERGENCY_SHUTDOWN
2025-06-25 23:53 [PATCH 0/6] btrfs: add remove_bdev() callback Qu Wenruo
2025-06-25 23:53 ` [PATCH 1/6] fs: add a new remove_bdev() super operations callback Qu Wenruo
@ 2025-06-25 23:53 ` Qu Wenruo
2025-06-25 23:53 ` [PATCH 3/6] btrfs: reject file operations if in shutdown state Qu Wenruo
` (3 subsequent siblings)
5 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2025-06-25 23:53 UTC (permalink / raw)
To: linux-btrfs, linux-fsdevel; +Cc: viro, brauner, jack
This is btrfs' equivalent of XFS_IOC_GOINGDOWN or EXT4_IOC_SHUTDOWN,
after entering the emergency shutdown state, all operations will return
error (-EIO), and can not be bring back to normal state until unmount.
A new helper, btrfs_force_shutdown() is introduced, which will:
- Mark the fs as error
But without flipping the fs read-only.
This is a special handling for the future shutdown ioctl, which will
freeze the fs first, set the SHUTDOWN flag, thaw the fs.
But the thaw path will no longer call the unfreeze_fs() call back
if the superblock is already read-only.
So to handle future shutdown correctly, we only mark the fs as error,
without flipping it read-only.
- Set the SHUTDOWN flag and output an message
The only extra thing is to set the EMERGENCY_SHUTDOWN flag to reject
future operations with -EIO.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/fs.h | 28 ++++++++++++++++++++++++++++
fs/btrfs/messages.c | 1 +
2 files changed, 29 insertions(+)
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index d90304d4e32c..45185c095696 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -29,6 +29,7 @@
#include "extent-io-tree.h"
#include "async-thread.h"
#include "block-rsv.h"
+#include "messages.h"
struct inode;
struct super_block;
@@ -120,6 +121,12 @@ enum {
/* No more delayed iput can be queued. */
BTRFS_FS_STATE_NO_DELAYED_IPUT,
+ /*
+ * Emergency shutdown, a step further than trans aborted by rejecting
+ * all operations.
+ */
+ BTRFS_FS_STATE_EMERGENCY_SHUTDOWN,
+
BTRFS_FS_STATE_COUNT
};
@@ -1100,6 +1107,27 @@ static inline void btrfs_wake_unfinished_drop(struct btrfs_fs_info *fs_info)
(unlikely(test_bit(BTRFS_FS_STATE_LOG_CLEANUP_ERROR, \
&(fs_info)->fs_state)))
+static inline bool btrfs_is_shutdown(struct btrfs_fs_info *fs_info)
+{
+ return test_bit(BTRFS_FS_STATE_EMERGENCY_SHUTDOWN, &fs_info->fs_state);
+}
+
+static inline void btrfs_force_shutdown(struct btrfs_fs_info *fs_info)
+{
+ /*
+ * Here we do not want to use handle_fs_error(), which will mark
+ * the fs read-only.
+ * Some call sites like shutdown ioctl will mark the fs shutdown
+ * when the fs is frozen. But thaw path will handle RO and RW fs
+ * differently.
+ *
+ * So here we only mark the fs error without flipping it RO.
+ */
+ WRITE_ONCE(fs_info->fs_error, -EIO);
+ if (!test_and_set_bit(BTRFS_FS_STATE_EMERGENCY_SHUTDOWN, &fs_info->fs_state))
+ btrfs_info(fs_info, "emergency shutdown");
+}
+
/*
* We use folio flag owner_2 to indicate there is an ordered extent with
* unfinished IO.
diff --git a/fs/btrfs/messages.c b/fs/btrfs/messages.c
index 363fd28c0268..2bb4bcb7c2cd 100644
--- a/fs/btrfs/messages.c
+++ b/fs/btrfs/messages.c
@@ -23,6 +23,7 @@ static const char fs_state_chars[] = {
[BTRFS_FS_STATE_NO_DATA_CSUMS] = 'C',
[BTRFS_FS_STATE_SKIP_META_CSUMS] = 'S',
[BTRFS_FS_STATE_LOG_CLEANUP_ERROR] = 'L',
+ [BTRFS_FS_STATE_EMERGENCY_SHUTDOWN] = 'E',
};
static void btrfs_state_to_string(const struct btrfs_fs_info *info, char *buf)
--
2.49.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 3/6] btrfs: reject file operations if in shutdown state
2025-06-25 23:53 [PATCH 0/6] btrfs: add remove_bdev() callback Qu Wenruo
2025-06-25 23:53 ` [PATCH 1/6] fs: add a new remove_bdev() super operations callback Qu Wenruo
2025-06-25 23:53 ` [PATCH 2/6] btrfs: introduce a new fs state, EMERGENCY_SHUTDOWN Qu Wenruo
@ 2025-06-25 23:53 ` Qu Wenruo
2025-06-25 23:53 ` [PATCH 4/6] btrfs: reject delalloc ranges if the fs is shutdown Qu Wenruo
` (2 subsequent siblings)
5 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2025-06-25 23:53 UTC (permalink / raw)
To: linux-btrfs, linux-fsdevel; +Cc: viro, brauner, jack
This includes the following callbacks of file_operations:
- read_iter()
- write_iter()
- mmap()
- open()
- remap_file_range()
- uring_cmd()
- splice_read()
This requires a small wrapper to do the extra shutdown check, then call
the regular filemap_splice_read() function
This should reject most of the file operations on a shutdown btrfs.
The callback ioctl() is intentionally skipped, as ext4 doesn't do the
shutdown check on ioctl() either, thus I believe there is some special
require for ioctl() callback even if the fs is fully shutdown.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/file.c | 25 ++++++++++++++++++++++++-
fs/btrfs/ioctl.c | 3 +++
fs/btrfs/reflink.c | 3 +++
3 files changed, 30 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 247676f34f2e..918c39a5ed39 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1419,6 +1419,8 @@ ssize_t btrfs_do_write_iter(struct kiocb *iocb, struct iov_iter *from,
struct btrfs_inode *inode = BTRFS_I(file_inode(file));
ssize_t num_written, num_sync;
+ if (unlikely(btrfs_is_shutdown(inode->root->fs_info)))
+ return -EIO;
/*
* If the fs flips readonly due to some impossible error, although we
* have opened a file as writable, we have to stop this write operation
@@ -1981,6 +1983,8 @@ static int btrfs_file_mmap(struct file *filp, struct vm_area_struct *vma)
{
struct address_space *mapping = filp->f_mapping;
+ if (unlikely(btrfs_is_shutdown(inode_to_fs_info(file_inode(filp)))))
+ return -EIO;
if (!mapping->a_ops->read_folio)
return -ENOEXEC;
@@ -3040,6 +3044,9 @@ static long btrfs_fallocate(struct file *file, int mode,
int blocksize = BTRFS_I(inode)->root->fs_info->sectorsize;
int ret;
+ if (unlikely(btrfs_is_shutdown(inode_to_fs_info(inode))))
+ return -EIO;
+
/* Do not allow fallocate in ZONED mode */
if (btrfs_is_zoned(inode_to_fs_info(inode)))
return -EOPNOTSUPP;
@@ -3731,6 +3738,9 @@ static int btrfs_file_open(struct inode *inode, struct file *filp)
{
int ret;
+ if (unlikely(btrfs_is_shutdown(inode_to_fs_info(inode))))
+ return -EIO;
+
filp->f_mode |= FMODE_NOWAIT | FMODE_CAN_ODIRECT;
ret = fsverity_file_open(inode, filp);
@@ -3743,6 +3753,9 @@ static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
ssize_t ret = 0;
+ if (unlikely(btrfs_is_shutdown(inode_to_fs_info(file_inode(iocb->ki_filp)))))
+ return -EIO;
+
if (iocb->ki_flags & IOCB_DIRECT) {
ret = btrfs_direct_read(iocb, to);
if (ret < 0 || !iov_iter_count(to) ||
@@ -3753,10 +3766,20 @@ static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
return filemap_read(iocb, to, ret);
}
+static ssize_t btrfs_file_splice_read(struct file *in, loff_t *ppos,
+ struct pipe_inode_info *pipe,
+ size_t len, unsigned int flags)
+{
+ if (unlikely(btrfs_is_shutdown(inode_to_fs_info(file_inode(in)))))
+ return -EIO;
+
+ return filemap_splice_read(in, ppos, pipe, len, flags);
+}
+
const struct file_operations btrfs_file_operations = {
.llseek = btrfs_file_llseek,
.read_iter = btrfs_file_read_iter,
- .splice_read = filemap_splice_read,
+ .splice_read = btrfs_file_splice_read,
.write_iter = btrfs_file_write_iter,
.splice_write = iter_file_splice_write,
.mmap = btrfs_file_mmap,
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 65763bc6a0f6..d86967bd3c9c 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -5066,6 +5066,9 @@ static int btrfs_uring_encoded_write(struct io_uring_cmd *cmd, unsigned int issu
int btrfs_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
{
+ if (unlikely(btrfs_is_shutdown(inode_to_fs_info(file_inode(cmd->file)))))
+ return -EIO;
+
switch (cmd->cmd_op) {
case BTRFS_IOC_ENCODED_READ:
#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index 0197bd9160a7..123a5682514b 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -869,6 +869,9 @@ loff_t btrfs_remap_file_range(struct file *src_file, loff_t off,
bool same_inode = dst_inode == src_inode;
int ret;
+ if (unlikely(btrfs_is_shutdown(inode_to_fs_info(file_inode(src_file)))))
+ return -EIO;
+
if (remap_flags & ~(REMAP_FILE_DEDUP | REMAP_FILE_ADVISORY))
return -EINVAL;
--
2.49.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 4/6] btrfs: reject delalloc ranges if the fs is shutdown
2025-06-25 23:53 [PATCH 0/6] btrfs: add remove_bdev() callback Qu Wenruo
` (2 preceding siblings ...)
2025-06-25 23:53 ` [PATCH 3/6] btrfs: reject file operations if in shutdown state Qu Wenruo
@ 2025-06-25 23:53 ` Qu Wenruo
2025-06-25 23:53 ` [PATCH 5/6] btrfs: implement shutdown ioctl Qu Wenruo
2025-06-25 23:53 ` [PATCH 6/6] btrfs: implement remove_bdev super operation callback Qu Wenruo
5 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2025-06-25 23:53 UTC (permalink / raw)
To: linux-btrfs, linux-fsdevel; +Cc: viro, brauner, jack
If the filesystem has dirty pages before the fs is shutdown, we should
no longer write them back, instead should treat them as writeback error.
Handle such situation by marking all those delalloc range as error and
let error handling path to clean them up.
For ranges that already have ordered extent created, let them continue
the writeback, and at ordered io finish time the file extent item update
will be rejected as the fs is already marked error.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/inode.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 80c72c594b19..bb2b5d594b14 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -862,6 +862,9 @@ static void compress_file_range(struct btrfs_work *work)
int compress_type = fs_info->compress_type;
int compress_level = fs_info->compress_level;
+ if (unlikely(btrfs_is_shutdown(fs_info)))
+ goto cleanup_and_bail_uncompressed;
+
inode_should_defrag(inode, start, end, end - start + 1, SZ_16K);
/*
@@ -1277,6 +1280,11 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
unsigned long page_ops;
int ret = 0;
+ if (unlikely(btrfs_is_shutdown(fs_info))) {
+ ret = -EIO;
+ goto out_unlock;
+ }
+
if (btrfs_is_free_space_inode(inode)) {
ret = -EINVAL;
goto out_unlock;
@@ -2027,7 +2035,7 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
{
struct btrfs_fs_info *fs_info = inode->root->fs_info;
struct btrfs_root *root = inode->root;
- struct btrfs_path *path;
+ struct btrfs_path *path = NULL;
u64 cow_start = (u64)-1;
/*
* If not 0, represents the inclusive end of the last fallback_to_cow()
@@ -2047,6 +2055,10 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
*/
ASSERT(!btrfs_is_zoned(fs_info) || btrfs_is_data_reloc_root(root));
+ if (unlikely(btrfs_is_shutdown(fs_info))) {
+ ret = -EIO;
+ goto error;
+ }
path = btrfs_alloc_path();
if (!path) {
ret = -ENOMEM;
--
2.49.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 5/6] btrfs: implement shutdown ioctl
2025-06-25 23:53 [PATCH 0/6] btrfs: add remove_bdev() callback Qu Wenruo
` (3 preceding siblings ...)
2025-06-25 23:53 ` [PATCH 4/6] btrfs: reject delalloc ranges if the fs is shutdown Qu Wenruo
@ 2025-06-25 23:53 ` Qu Wenruo
2025-06-25 23:53 ` [PATCH 6/6] btrfs: implement remove_bdev super operation callback Qu Wenruo
5 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2025-06-25 23:53 UTC (permalink / raw)
To: linux-btrfs, linux-fsdevel; +Cc: viro, brauner, jack
The shutdown interface should all follow the XFS one, which use magic
'X', and ioctl number 125, with a u32 as flags.
For now btrfs don't distinguish DEFAULT and LOGFLUSH flags, both will
freeze the fs first (implies committing the current transaction),
setting the SHUTDOWN flag (along with fs error flag), thaw the fs.
For NOLOGFLUSH flag, the freeze/thaw part is skipped thus the current
transaction is aborted.
The new shutdown ioctl is hidden behind experimental features, as there
are still some minor test failures exposed during the shutdown group
run.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/ioctl.c | 40 ++++++++++++++++++++++++++++++++++++++
include/uapi/linux/btrfs.h | 9 +++++++++
2 files changed, 49 insertions(+)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index d86967bd3c9c..1550fe7887fa 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -5212,6 +5212,36 @@ static int btrfs_ioctl_subvol_sync(struct btrfs_fs_info *fs_info, void __user *a
return 0;
}
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+static int btrfs_emergency_shutdown(struct btrfs_fs_info *fs_info, u32 flags)
+{
+ int ret = 0;
+
+ if (flags >= BTRFS_SHUTDOWN_FLAGS_LAST)
+ return -EINVAL;
+
+ if (btrfs_is_shutdown(fs_info))
+ return 0;
+
+ switch (flags) {
+ case BTRFS_SHUTDOWN_FLAGS_LOGFLUSH:
+ case BTRFS_SHUTDOWN_FLAGS_DEFAULT:
+ ret = freeze_super(fs_info->sb, FREEZE_HOLDER_KERNEL, NULL);
+ if (ret)
+ return ret;
+ btrfs_force_shutdown(fs_info);
+ ret = thaw_super(fs_info->sb, FREEZE_HOLDER_KERNEL, NULL);
+ if (ret)
+ return ret;
+ break;
+ case BTRFS_SHUTDOWN_FLAGS_NOLOGFLUSH:
+ btrfs_force_shutdown(fs_info);
+ break;
+ }
+ return ret;
+}
+#endif
+
long btrfs_ioctl(struct file *file, unsigned int
cmd, unsigned long arg)
{
@@ -5219,6 +5249,8 @@ long btrfs_ioctl(struct file *file, unsigned int
struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
struct btrfs_root *root = BTRFS_I(inode)->root;
void __user *argp = (void __user *)arg;
+ /* If @arg is just an unsigned long value. */
+ unsigned long flags;
switch (cmd) {
case FS_IOC_GETVERSION:
@@ -5367,7 +5399,15 @@ long btrfs_ioctl(struct file *file, unsigned int
#endif
case BTRFS_IOC_SUBVOL_SYNC_WAIT:
return btrfs_ioctl_subvol_sync(fs_info, argp);
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+ case BTRFS_IOC_SHUTDOWN:
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+ if (get_user(flags, (__u32 __user *)arg))
+ return -EFAULT;
+ return btrfs_emergency_shutdown(fs_info, flags);
}
+#endif
return -ENOTTY;
}
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index dd02160015b2..8f6324cf15d9 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -1096,6 +1096,12 @@ enum btrfs_err_code {
BTRFS_ERROR_DEV_RAID1C4_MIN_NOT_MET,
};
+/* Flags for IOC_SHUTDOWN, should match XFS' flags. */
+#define BTRFS_SHUTDOWN_FLAGS_DEFAULT 0x0
+#define BTRFS_SHUTDOWN_FLAGS_LOGFLUSH 0x1
+#define BTRFS_SHUTDOWN_FLAGS_NOLOGFLUSH 0x2
+#define BTRFS_SHUTDOWN_FLAGS_LAST 0x3
+
#define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \
struct btrfs_ioctl_vol_args)
#define BTRFS_IOC_DEFRAG _IOW(BTRFS_IOCTL_MAGIC, 2, \
@@ -1217,6 +1223,9 @@ enum btrfs_err_code {
#define BTRFS_IOC_SUBVOL_SYNC_WAIT _IOW(BTRFS_IOCTL_MAGIC, 65, \
struct btrfs_ioctl_subvol_wait)
+/* Shutdown ioctl should follow XFS's interfaces, thus not using btrfs magic. */
+#define BTRFS_IOC_SHUTDOWN _IOR('X', 125, __u32)
+
#ifdef __cplusplus
}
#endif
--
2.49.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 6/6] btrfs: implement remove_bdev super operation callback
2025-06-25 23:53 [PATCH 0/6] btrfs: add remove_bdev() callback Qu Wenruo
` (4 preceding siblings ...)
2025-06-25 23:53 ` [PATCH 5/6] btrfs: implement shutdown ioctl Qu Wenruo
@ 2025-06-25 23:53 ` Qu Wenruo
5 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2025-06-25 23:53 UTC (permalink / raw)
To: linux-btrfs, linux-fsdevel; +Cc: viro, brauner, jack
For this callback, btrfs will:
- Go degraded if the fs can still maintain RW operations
And of course mark the target device as missing.
- Shutdown if the fs can not maintain RW operations
I know the shutdown can be a little overkilled, if one has a RAID1
metadata and RAID0 data, in that case one can still read data with 50%
rate to got some good data.
But it can also be as bad as the only device went missing for a single
device btrfs.
So here we go safe other than sorry when handling missing device.
And the remove_bdev callback will be hidden behind experimental features
for now, the reasons are:
- There are not enough btrfs specific bdev removal test cases
The existing test cases are all removing the only device, thus only
exercises the shutdown branch.
- Not yet determined what's the expected behavior
Although the current auto-degrade behavior is no worse than the old
behavior, it may not always be what the end users want.
Before there is a concrete solution, better hide the new feature
from end users.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/super.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/volumes.c | 2 ++
fs/btrfs/volumes.h | 5 +++++
3 files changed, 62 insertions(+)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 5a07330fb3a6..3fba3d6309a2 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2412,6 +2412,58 @@ static long btrfs_free_cached_objects(struct super_block *sb, struct shrink_cont
return 0;
}
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+static void btrfs_remove_bdev(struct super_block *sb, struct block_device *bdev,
+ bool surprise)
+{
+ struct btrfs_fs_info *fs_info = btrfs_sb(sb);
+ struct btrfs_device *device;
+ struct btrfs_dev_lookup_args lookup_args = { .devt = bdev->bd_dev };
+ bool can_rw;
+ int ret;
+
+ if (!surprise) {
+ ret = btrfs_sync_fs(sb, 1);
+ if (ret)
+ btrfs_warn(fs_info,
+ "filesystem failed to sync in preparation for device loss: %d",
+ ret);
+ }
+
+ mutex_lock(&fs_info->fs_devices->device_list_mutex);
+ device = btrfs_find_device(fs_info->fs_devices, &lookup_args);
+ if (!device) {
+ btrfs_warn(fs_info, "unable to find btrfs device for block device '%pg'",
+ bdev);
+ mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+ return;
+ }
+ set_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state);
+ device->fs_devices->missing_devices++;
+ if (test_and_clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
+ list_del_init(&device->dev_alloc_list);
+ device->fs_devices->rw_devices--;
+ }
+ can_rw = btrfs_check_rw_degradable(fs_info, device);
+ mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+ /*
+ * Now device is considered missing, btrfs_device_name() won't give a
+ * meaningful result anymore.
+ */
+ if (!can_rw) {
+ btrfs_warn(fs_info,
+ "btrfs device id %llu has gone missing, can not maintain read-write",
+ device->devid);
+ btrfs_force_shutdown(fs_info);
+ return;
+ }
+ btrfs_warn(fs_info,
+ "btrfs device id %llu has gone missing, continue as degraded",
+ device->devid);
+ btrfs_set_opt(fs_info->mount_opt, DEGRADED);
+}
+#endif
+
static const struct super_operations btrfs_super_ops = {
.drop_inode = btrfs_drop_inode,
.evict_inode = btrfs_evict_inode,
@@ -2427,6 +2479,9 @@ static const struct super_operations btrfs_super_ops = {
.unfreeze_fs = btrfs_unfreeze,
.nr_cached_objects = btrfs_nr_cached_objects,
.free_cached_objects = btrfs_free_cached_objects,
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+ .remove_bdev = btrfs_remove_bdev,
+#endif
};
static const struct file_operations btrfs_ctl_fops = {
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8ea1a69808a3..8feac0129bdd 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6794,6 +6794,8 @@ static bool dev_args_match_fs_devices(const struct btrfs_dev_lookup_args *args,
static bool dev_args_match_device(const struct btrfs_dev_lookup_args *args,
const struct btrfs_device *device)
{
+ if (args->devt)
+ return device->devt == args->devt;
if (args->missing) {
if (test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state) &&
!device->bdev)
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 6acb154ccf87..71e570f8337d 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -663,6 +663,11 @@ struct btrfs_dev_lookup_args {
u64 devid;
u8 *uuid;
u8 *fsid;
+ /*
+ * If devt is specified, all other members will be ignored as it is
+ * enough to uniquely locate a device.
+ */
+ dev_t devt;
bool missing;
};
--
2.49.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH 1/6] fs: add a new remove_bdev() super operations callback
2025-06-25 23:53 ` [PATCH 1/6] fs: add a new remove_bdev() super operations callback Qu Wenruo
@ 2025-06-26 8:38 ` Christian Brauner
2025-06-26 8:44 ` Qu Wenruo
2025-06-26 10:14 ` Christoph Hellwig
0 siblings, 2 replies; 16+ messages in thread
From: Christian Brauner @ 2025-06-26 8:38 UTC (permalink / raw)
To: Qu Wenruo, Christoph Hellwig; +Cc: linux-btrfs, linux-fsdevel, viro, jack
On Thu, Jun 26, 2025 at 09:23:42AM +0930, Qu Wenruo wrote:
> The new remove_bdev() call back is mostly for multi-device filesystems
> to handle device removal.
>
> Some multi-devices filesystems like btrfs can have the ability to handle
> device lose according to the setup (e.g. all chunks have extra mirrors),
> thus losing a block device will not interrupt the normal operations.
>
> Btrfs will soon implement this call back by:
>
> - Automatically degrade the fs if read-write operations can be
> maintained
>
> - Shutdown the fs if read-write operations can not be maintained
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/super.c | 4 +++-
> include/linux/fs.h | 18 ++++++++++++++++++
> 2 files changed, 21 insertions(+), 1 deletion(-)
>
> diff --git a/fs/super.c b/fs/super.c
> index 80418ca8e215..07845d2f9ec4 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1463,7 +1463,9 @@ static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
> sync_filesystem(sb);
> shrink_dcache_sb(sb);
> evict_inodes(sb);
> - if (sb->s_op->shutdown)
> + if (sb->s_op->remove_bdev)
> + sb->s_op->remove_bdev(sb, bdev, surprise);
> + else if (sb->s_op->shutdown)
> sb->s_op->shutdown(sb);
This makes ->remove_bdev() and ->shutdown() mutually exclusive. I really
really dislike this pattern. It introduces the possibility that a
filesystem accidently implement both variants and assumes both are
somehow called. That can be solved by an assert at superblock initation
time but it's still nasty.
The other thing is that this just reeks of being the wrong api. We
should absolutely aim for the methods to not be mutually exclusive. I
hate that with a passion. That's just an ugly api and I want to have as
little of that as possible in our code.
>
> super_unlock_shared(sb);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index b085f161ed22..5e84e06c7354 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2367,7 +2367,25 @@ struct super_operations {
> struct shrink_control *);
> long (*free_cached_objects)(struct super_block *,
> struct shrink_control *);
> + /*
> + * Callback to shutdown the fs.
> + *
> + * If a fs can not afford losing any block device, implement this callback.
> + */
> void (*shutdown)(struct super_block *sb);
> +
> + /*
> + * Callback to handle a block device removal.
> + *
> + * Recommended to implement this for multi-device filesystems, as they
> + * may afford losing a block device and continue operations.
> + *
> + * @surprse: indicates a surprise removal. If true the device/media is
> + * already gone. Otherwise we're prepareing for an orderly
> + * removal.
> + */
> + void (*remove_bdev)(struct super_block *sb, struct block_device *bdev,
> + bool surprise);
> };
Yeah, I think that's just not a good api. That looks a lot to me like we
should just collapse both functions even though earlier discussion said
we shouldn't. Just do:
s/shutdown/remove_bdev/
or
s/shutdown/shutdown_bdev()
The filesystem will know whether it has to kill the filesystem or if it
can keep going even if the device is lost. Hell, if we have to we could
just have it return whether it killed the superblock or just the device
by giving the method a return value. But for now it probably doesn't
matter.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/6] fs: add a new remove_bdev() super operations callback
2025-06-26 8:38 ` Christian Brauner
@ 2025-06-26 8:44 ` Qu Wenruo
2025-06-26 9:10 ` Christian Brauner
2025-06-26 10:14 ` Christoph Hellwig
1 sibling, 1 reply; 16+ messages in thread
From: Qu Wenruo @ 2025-06-26 8:44 UTC (permalink / raw)
To: Christian Brauner, Christoph Hellwig
Cc: linux-btrfs, linux-fsdevel, viro, jack
在 2025/6/26 18:08, Christian Brauner 写道:
> On Thu, Jun 26, 2025 at 09:23:42AM +0930, Qu Wenruo wrote:
>> The new remove_bdev() call back is mostly for multi-device filesystems
>> to handle device removal.
>>
>> Some multi-devices filesystems like btrfs can have the ability to handle
>> device lose according to the setup (e.g. all chunks have extra mirrors),
>> thus losing a block device will not interrupt the normal operations.
>>
>> Btrfs will soon implement this call back by:
>>
>> - Automatically degrade the fs if read-write operations can be
>> maintained
>>
>> - Shutdown the fs if read-write operations can not be maintained
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> fs/super.c | 4 +++-
>> include/linux/fs.h | 18 ++++++++++++++++++
>> 2 files changed, 21 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/super.c b/fs/super.c
>> index 80418ca8e215..07845d2f9ec4 100644
>> --- a/fs/super.c
>> +++ b/fs/super.c
>> @@ -1463,7 +1463,9 @@ static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
>> sync_filesystem(sb);
>> shrink_dcache_sb(sb);
>> evict_inodes(sb);
>> - if (sb->s_op->shutdown)
>> + if (sb->s_op->remove_bdev)
>> + sb->s_op->remove_bdev(sb, bdev, surprise);
>> + else if (sb->s_op->shutdown)
>> sb->s_op->shutdown(sb);
>
> This makes ->remove_bdev() and ->shutdown() mutually exclusive. I really
> really dislike this pattern. It introduces the possibility that a
> filesystem accidently implement both variants and assumes both are
> somehow called. That can be solved by an assert at superblock initation
> time but it's still nasty.
>
> The other thing is that this just reeks of being the wrong api. We
> should absolutely aim for the methods to not be mutually exclusive. I
> hate that with a passion. That's just an ugly api and I want to have as
> little of that as possible in our code.
So what do you really want?
The original path to expand the shutdown() callback is rejected, and now
the new callback is also rejected.
I guess the only thing left is to rename shutdown() to remove_bdev(),
add the new parameters and keep existing fses doing what they do (aka,
shutdown)?
Thanks,
Qu
>
>>
>> super_unlock_shared(sb);
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index b085f161ed22..5e84e06c7354 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -2367,7 +2367,25 @@ struct super_operations {
>> struct shrink_control *);
>> long (*free_cached_objects)(struct super_block *,
>> struct shrink_control *);
>> + /*
>> + * Callback to shutdown the fs.
>> + *
>> + * If a fs can not afford losing any block device, implement this callback.
>> + */
>> void (*shutdown)(struct super_block *sb);
>> +
>> + /*
>> + * Callback to handle a block device removal.
>> + *
>> + * Recommended to implement this for multi-device filesystems, as they
>> + * may afford losing a block device and continue operations.
>> + *
>> + * @surprse: indicates a surprise removal. If true the device/media is
>> + * already gone. Otherwise we're prepareing for an orderly
>> + * removal.
>> + */
>> + void (*remove_bdev)(struct super_block *sb, struct block_device *bdev,
>> + bool surprise);
>> };
>
> Yeah, I think that's just not a good api. That looks a lot to me like we
> should just collapse both functions even though earlier discussion said
> we shouldn't. Just do:
>
> s/shutdown/remove_bdev/
>
> or
>
> s/shutdown/shutdown_bdev()
>
> The filesystem will know whether it has to kill the filesystem or if it
> can keep going even if the device is lost. Hell, if we have to we could
> just have it return whether it killed the superblock or just the device
> by giving the method a return value. But for now it probably doesn't
> matter.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/6] fs: add a new remove_bdev() super operations callback
2025-06-26 8:44 ` Qu Wenruo
@ 2025-06-26 9:10 ` Christian Brauner
2025-06-26 9:18 ` Qu Wenruo
0 siblings, 1 reply; 16+ messages in thread
From: Christian Brauner @ 2025-06-26 9:10 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Christoph Hellwig, linux-btrfs, linux-fsdevel, viro, jack
On Thu, Jun 26, 2025 at 06:14:03PM +0930, Qu Wenruo wrote:
>
>
> 在 2025/6/26 18:08, Christian Brauner 写道:
> > On Thu, Jun 26, 2025 at 09:23:42AM +0930, Qu Wenruo wrote:
> > > The new remove_bdev() call back is mostly for multi-device filesystems
> > > to handle device removal.
> > >
> > > Some multi-devices filesystems like btrfs can have the ability to handle
> > > device lose according to the setup (e.g. all chunks have extra mirrors),
> > > thus losing a block device will not interrupt the normal operations.
> > >
> > > Btrfs will soon implement this call back by:
> > >
> > > - Automatically degrade the fs if read-write operations can be
> > > maintained
> > >
> > > - Shutdown the fs if read-write operations can not be maintained
> > >
> > > Signed-off-by: Qu Wenruo <wqu@suse.com>
> > > ---
> > > fs/super.c | 4 +++-
> > > include/linux/fs.h | 18 ++++++++++++++++++
> > > 2 files changed, 21 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/super.c b/fs/super.c
> > > index 80418ca8e215..07845d2f9ec4 100644
> > > --- a/fs/super.c
> > > +++ b/fs/super.c
> > > @@ -1463,7 +1463,9 @@ static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
> > > sync_filesystem(sb);
> > > shrink_dcache_sb(sb);
> > > evict_inodes(sb);
> > > - if (sb->s_op->shutdown)
> > > + if (sb->s_op->remove_bdev)
> > > + sb->s_op->remove_bdev(sb, bdev, surprise);
> > > + else if (sb->s_op->shutdown)
> > > sb->s_op->shutdown(sb);
> >
> > This makes ->remove_bdev() and ->shutdown() mutually exclusive. I really
> > really dislike this pattern. It introduces the possibility that a
> > filesystem accidently implement both variants and assumes both are
> > somehow called. That can be solved by an assert at superblock initation
> > time but it's still nasty.
> >
> > The other thing is that this just reeks of being the wrong api. We
> > should absolutely aim for the methods to not be mutually exclusive. I
> > hate that with a passion. That's just an ugly api and I want to have as
> > little of that as possible in our code.
>
> So what do you really want?
>
> The original path to expand the shutdown() callback is rejected, and now the
> new callback is also rejected.
>
> I guess the only thing left is to rename shutdown() to remove_bdev(), add
> the new parameters and keep existing fses doing what they do (aka,
> shutdown)?
Yes. My original understanding had been that ->remove_bdev() would be
called in a different codepath than ->shutdown() and they would be
complimentary. So sorry for the back and forth here. If that's not the
case I don't see any point in having two distinct methods.
>
> Thanks,
> Qu
>
> >
> > > super_unlock_shared(sb);
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index b085f161ed22..5e84e06c7354 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -2367,7 +2367,25 @@ struct super_operations {
> > > struct shrink_control *);
> > > long (*free_cached_objects)(struct super_block *,
> > > struct shrink_control *);
> > > + /*
> > > + * Callback to shutdown the fs.
> > > + *
> > > + * If a fs can not afford losing any block device, implement this callback.
> > > + */
> > > void (*shutdown)(struct super_block *sb);
> > > +
> > > + /*
> > > + * Callback to handle a block device removal.
> > > + *
> > > + * Recommended to implement this for multi-device filesystems, as they
> > > + * may afford losing a block device and continue operations.
> > > + *
> > > + * @surprse: indicates a surprise removal. If true the device/media is
> > > + * already gone. Otherwise we're prepareing for an orderly
> > > + * removal.
> > > + */
> > > + void (*remove_bdev)(struct super_block *sb, struct block_device *bdev,
> > > + bool surprise);
> > > };
> >
> > Yeah, I think that's just not a good api. That looks a lot to me like we
> > should just collapse both functions even though earlier discussion said
> > we shouldn't. Just do:
> >
> > s/shutdown/remove_bdev/
> >
> > or
> >
> > s/shutdown/shutdown_bdev()
> >
> > The filesystem will know whether it has to kill the filesystem or if it
> > can keep going even if the device is lost. Hell, if we have to we could
> > just have it return whether it killed the superblock or just the device
> > by giving the method a return value. But for now it probably doesn't
> > matter.
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/6] fs: add a new remove_bdev() super operations callback
2025-06-26 9:10 ` Christian Brauner
@ 2025-06-26 9:18 ` Qu Wenruo
2025-06-26 9:29 ` Christian Brauner
0 siblings, 1 reply; 16+ messages in thread
From: Qu Wenruo @ 2025-06-26 9:18 UTC (permalink / raw)
To: Christian Brauner
Cc: Christoph Hellwig, linux-btrfs, linux-fsdevel, viro, jack
在 2025/6/26 18:40, Christian Brauner 写道:
> On Thu, Jun 26, 2025 at 06:14:03PM +0930, Qu Wenruo wrote:
>>
>>
>> 在 2025/6/26 18:08, Christian Brauner 写道:
>>> On Thu, Jun 26, 2025 at 09:23:42AM +0930, Qu Wenruo wrote:
>>>> The new remove_bdev() call back is mostly for multi-device filesystems
>>>> to handle device removal.
>>>>
>>>> Some multi-devices filesystems like btrfs can have the ability to handle
>>>> device lose according to the setup (e.g. all chunks have extra mirrors),
>>>> thus losing a block device will not interrupt the normal operations.
>>>>
>>>> Btrfs will soon implement this call back by:
>>>>
>>>> - Automatically degrade the fs if read-write operations can be
>>>> maintained
>>>>
>>>> - Shutdown the fs if read-write operations can not be maintained
>>>>
>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>> ---
>>>> fs/super.c | 4 +++-
>>>> include/linux/fs.h | 18 ++++++++++++++++++
>>>> 2 files changed, 21 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/super.c b/fs/super.c
>>>> index 80418ca8e215..07845d2f9ec4 100644
>>>> --- a/fs/super.c
>>>> +++ b/fs/super.c
>>>> @@ -1463,7 +1463,9 @@ static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
>>>> sync_filesystem(sb);
>>>> shrink_dcache_sb(sb);
>>>> evict_inodes(sb);
>>>> - if (sb->s_op->shutdown)
>>>> + if (sb->s_op->remove_bdev)
>>>> + sb->s_op->remove_bdev(sb, bdev, surprise);
>>>> + else if (sb->s_op->shutdown)
>>>> sb->s_op->shutdown(sb);
>>>
>>> This makes ->remove_bdev() and ->shutdown() mutually exclusive. I really
>>> really dislike this pattern. It introduces the possibility that a
>>> filesystem accidently implement both variants and assumes both are
>>> somehow called. That can be solved by an assert at superblock initation
>>> time but it's still nasty.
>>>
>>> The other thing is that this just reeks of being the wrong api. We
>>> should absolutely aim for the methods to not be mutually exclusive. I
>>> hate that with a passion. That's just an ugly api and I want to have as
>>> little of that as possible in our code.
>>
>> So what do you really want?
>>
>> The original path to expand the shutdown() callback is rejected, and now the
>> new callback is also rejected.
>>
>> I guess the only thing left is to rename shutdown() to remove_bdev(), add
>> the new parameters and keep existing fses doing what they do (aka,
>> shutdown)?
>
> Yes. My original understanding had been that ->remove_bdev() would be
> called in a different codepath than ->shutdown() and they would be
> complimentary. So sorry for the back and forth here. If that's not the
> case I don't see any point in having two distinct methods.
That's fine, just want to do a final confirmation that everyone is fine
with such change:
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b085f161ed22..c11b9219863b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2367,7 +2367,8 @@ struct super_operations {
struct shrink_control *);
long (*free_cached_objects)(struct super_block *,
struct shrink_control *);
- void (*shutdown)(struct super_block *sb);
+ void (*remove_bdev)(struct super_block *sb, struct block_device
*bdev,
+ bool surprise);
};
/*
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH 1/6] fs: add a new remove_bdev() super operations callback
2025-06-26 9:18 ` Qu Wenruo
@ 2025-06-26 9:29 ` Christian Brauner
0 siblings, 0 replies; 16+ messages in thread
From: Christian Brauner @ 2025-06-26 9:29 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Christoph Hellwig, linux-btrfs, linux-fsdevel, viro, jack
On Thu, Jun 26, 2025 at 06:48:29PM +0930, Qu Wenruo wrote:
>
>
> 在 2025/6/26 18:40, Christian Brauner 写道:
> > On Thu, Jun 26, 2025 at 06:14:03PM +0930, Qu Wenruo wrote:
> > >
> > >
> > > 在 2025/6/26 18:08, Christian Brauner 写道:
> > > > On Thu, Jun 26, 2025 at 09:23:42AM +0930, Qu Wenruo wrote:
> > > > > The new remove_bdev() call back is mostly for multi-device filesystems
> > > > > to handle device removal.
> > > > >
> > > > > Some multi-devices filesystems like btrfs can have the ability to handle
> > > > > device lose according to the setup (e.g. all chunks have extra mirrors),
> > > > > thus losing a block device will not interrupt the normal operations.
> > > > >
> > > > > Btrfs will soon implement this call back by:
> > > > >
> > > > > - Automatically degrade the fs if read-write operations can be
> > > > > maintained
> > > > >
> > > > > - Shutdown the fs if read-write operations can not be maintained
> > > > >
> > > > > Signed-off-by: Qu Wenruo <wqu@suse.com>
> > > > > ---
> > > > > fs/super.c | 4 +++-
> > > > > include/linux/fs.h | 18 ++++++++++++++++++
> > > > > 2 files changed, 21 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/fs/super.c b/fs/super.c
> > > > > index 80418ca8e215..07845d2f9ec4 100644
> > > > > --- a/fs/super.c
> > > > > +++ b/fs/super.c
> > > > > @@ -1463,7 +1463,9 @@ static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
> > > > > sync_filesystem(sb);
> > > > > shrink_dcache_sb(sb);
> > > > > evict_inodes(sb);
> > > > > - if (sb->s_op->shutdown)
> > > > > + if (sb->s_op->remove_bdev)
> > > > > + sb->s_op->remove_bdev(sb, bdev, surprise);
> > > > > + else if (sb->s_op->shutdown)
> > > > > sb->s_op->shutdown(sb);
> > > >
> > > > This makes ->remove_bdev() and ->shutdown() mutually exclusive. I really
> > > > really dislike this pattern. It introduces the possibility that a
> > > > filesystem accidently implement both variants and assumes both are
> > > > somehow called. That can be solved by an assert at superblock initation
> > > > time but it's still nasty.
> > > >
> > > > The other thing is that this just reeks of being the wrong api. We
> > > > should absolutely aim for the methods to not be mutually exclusive. I
> > > > hate that with a passion. That's just an ugly api and I want to have as
> > > > little of that as possible in our code.
> > >
> > > So what do you really want?
> > >
> > > The original path to expand the shutdown() callback is rejected, and now the
> > > new callback is also rejected.
> > >
> > > I guess the only thing left is to rename shutdown() to remove_bdev(), add
> > > the new parameters and keep existing fses doing what they do (aka,
> > > shutdown)?
> >
> > Yes. My original understanding had been that ->remove_bdev() would be
> > called in a different codepath than ->shutdown() and they would be
> > complimentary. So sorry for the back and forth here. If that's not the
> > case I don't see any point in having two distinct methods.
>
> That's fine, just want to do a final confirmation that everyone is fine with
> such change:
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index b085f161ed22..c11b9219863b 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2367,7 +2367,8 @@ struct super_operations {
> struct shrink_control *);
> long (*free_cached_objects)(struct super_block *,
> struct shrink_control *);
> - void (*shutdown)(struct super_block *sb);
> + void (*remove_bdev)(struct super_block *sb, struct block_device
> *bdev,
> + bool surprise);
> };
This looks good to me!
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/6] fs: add a new remove_bdev() super operations callback
2025-06-26 8:38 ` Christian Brauner
2025-06-26 8:44 ` Qu Wenruo
@ 2025-06-26 10:14 ` Christoph Hellwig
2025-06-26 10:36 ` Qu Wenruo
2025-07-01 11:35 ` Christian Brauner
1 sibling, 2 replies; 16+ messages in thread
From: Christoph Hellwig @ 2025-06-26 10:14 UTC (permalink / raw)
To: Christian Brauner
Cc: Qu Wenruo, Christoph Hellwig, linux-btrfs, linux-fsdevel, viro,
jack
On Thu, Jun 26, 2025 at 10:38:11AM +0200, Christian Brauner wrote:
> > + if (sb->s_op->remove_bdev)
> > + sb->s_op->remove_bdev(sb, bdev, surprise);
> > + else if (sb->s_op->shutdown)
> > sb->s_op->shutdown(sb);
>
> This makes ->remove_bdev() and ->shutdown() mutually exclusive. I really
> really dislike this pattern. It introduces the possibility that a
> filesystem accidently implement both variants and assumes both are
> somehow called. That can be solved by an assert at superblock initation
> time but it's still nasty.
>
> The other thing is that this just reeks of being the wrong api. We
> should absolutely aim for the methods to not be mutually exclusive. I
> hate that with a passion. That's just an ugly api and I want to have as
> little of that as possible in our code.
Yes. As I mentioned before I think we just want to transition everyone
to the remove_bdev API. But our life is probably easier if Qu's series
only adds the new one so that we can do the transitions through the
file system trees instead of needing a global API change. Or am I
overthinking this?
> > + * @surprse: indicates a surprise removal. If true the device/media is
surprse is misspelled here. But I don't see how passing this on to
the file system would be useful, because at this point everything is
a surprise for the file system.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/6] fs: add a new remove_bdev() super operations callback
2025-06-26 10:14 ` Christoph Hellwig
@ 2025-06-26 10:36 ` Qu Wenruo
2025-06-26 11:02 ` Christoph Hellwig
2025-07-01 11:35 ` Christian Brauner
1 sibling, 1 reply; 16+ messages in thread
From: Qu Wenruo @ 2025-06-26 10:36 UTC (permalink / raw)
To: Christoph Hellwig, Christian Brauner
Cc: Qu Wenruo, linux-btrfs, linux-fsdevel, viro, jack
在 2025/6/26 19:44, Christoph Hellwig 写道:
> On Thu, Jun 26, 2025 at 10:38:11AM +0200, Christian Brauner wrote:
>>> + if (sb->s_op->remove_bdev)
>>> + sb->s_op->remove_bdev(sb, bdev, surprise);
>>> + else if (sb->s_op->shutdown)
>>> sb->s_op->shutdown(sb);
>>
>> This makes ->remove_bdev() and ->shutdown() mutually exclusive. I really
>> really dislike this pattern. It introduces the possibility that a
>> filesystem accidently implement both variants and assumes both are
>> somehow called. That can be solved by an assert at superblock initation
>> time but it's still nasty.
>>
>> The other thing is that this just reeks of being the wrong api. We
>> should absolutely aim for the methods to not be mutually exclusive. I
>> hate that with a passion. That's just an ugly api and I want to have as
>> little of that as possible in our code.
>
> Yes. As I mentioned before I think we just want to transition everyone
> to the remove_bdev API. But our life is probably easier if Qu's series
> only adds the new one so that we can do the transitions through the
> file system trees instead of needing a global API change. Or am I
> overthinking this?
Then let's do the transition right now.
The transition itself is pretty straightforward, just renaming and
appending the parameter list, and keeping the old shutdown behavior.
AFAIK only btrfs and bcachefs will take advantage of the new interface
for now.
>
>>> + * @surprse: indicates a surprise removal. If true the device/media is
>
> surprse is misspelled here. But I don't see how passing this on to
> the file system would be useful, because at this point everything is
> a surprise for the file system.
If I understand the @surprise parameter correctly, it should allow the
fs to do read/write as usual if it's not a surprise removal.
And btrfs will take the chance to fully writeback all the dirty pages
(more than the default shutdown behavior which only writebacks the
current transaction, no dirty data pages.).
But in the real world, for test case like generic/730, the @surprise
flag is either not properly respected, I'm getting @surprise == false
but the block device is already gone.
So I'm not sure what's the real expected behavior here, and the new flag
is only for future expansion for now.
Thanks,
Qu
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/6] fs: add a new remove_bdev() super operations callback
2025-06-26 10:36 ` Qu Wenruo
@ 2025-06-26 11:02 ` Christoph Hellwig
0 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2025-06-26 11:02 UTC (permalink / raw)
To: Qu Wenruo
Cc: Christoph Hellwig, Christian Brauner, linux-btrfs, linux-fsdevel,
viro, jack
On Thu, Jun 26, 2025 at 08:06:03PM +0930, Qu Wenruo wrote:
>
> If I understand the @surprise parameter correctly, it should allow the fs
> to do read/write as usual if it's not a surprise removal.
>
> And btrfs will take the chance to fully writeback all the dirty pages (more
> than the default shutdown behavior which only writebacks the current
> transaction, no dirty data pages.).
That's already taken care of by the call to sync_filesystem in
fs_bdev_mark_dead.
> But in the real world, for test case like generic/730, the @surprise flag
> is either not properly respected, I'm getting @surprise == false but the
> block device is already gone.
It only works for drivers that call blk_mark_disk_dead or bdev_mark_dead
directly with the surprise flag.
> So I'm not sure what's the real expected behavior here, and the new flag is
> only for future expansion for now.
Let's not add just in case arguments.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/6] fs: add a new remove_bdev() super operations callback
2025-06-26 10:14 ` Christoph Hellwig
2025-06-26 10:36 ` Qu Wenruo
@ 2025-07-01 11:35 ` Christian Brauner
1 sibling, 0 replies; 16+ messages in thread
From: Christian Brauner @ 2025-07-01 11:35 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Qu Wenruo, linux-btrfs, linux-fsdevel, viro, jack
On Thu, Jun 26, 2025 at 12:14:43PM +0200, Christoph Hellwig wrote:
> On Thu, Jun 26, 2025 at 10:38:11AM +0200, Christian Brauner wrote:
> > > + if (sb->s_op->remove_bdev)
> > > + sb->s_op->remove_bdev(sb, bdev, surprise);
> > > + else if (sb->s_op->shutdown)
> > > sb->s_op->shutdown(sb);
> >
> > This makes ->remove_bdev() and ->shutdown() mutually exclusive. I really
> > really dislike this pattern. It introduces the possibility that a
> > filesystem accidently implement both variants and assumes both are
> > somehow called. That can be solved by an assert at superblock initation
> > time but it's still nasty.
> >
> > The other thing is that this just reeks of being the wrong api. We
> > should absolutely aim for the methods to not be mutually exclusive. I
> > hate that with a passion. That's just an ugly api and I want to have as
> > little of that as possible in our code.
>
> Yes. As I mentioned before I think we just want to transition everyone
> to the remove_bdev API. But our life is probably easier if Qu's series
> only adds the new one so that we can do the transitions through the
> file system trees instead of needing a global API change. Or am I
> overthinking this?
I think it's fine to do it right now. If we can avoid having
transitional stages then let's do that.
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2025-07-01 11:35 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-25 23:53 [PATCH 0/6] btrfs: add remove_bdev() callback Qu Wenruo
2025-06-25 23:53 ` [PATCH 1/6] fs: add a new remove_bdev() super operations callback Qu Wenruo
2025-06-26 8:38 ` Christian Brauner
2025-06-26 8:44 ` Qu Wenruo
2025-06-26 9:10 ` Christian Brauner
2025-06-26 9:18 ` Qu Wenruo
2025-06-26 9:29 ` Christian Brauner
2025-06-26 10:14 ` Christoph Hellwig
2025-06-26 10:36 ` Qu Wenruo
2025-06-26 11:02 ` Christoph Hellwig
2025-07-01 11:35 ` Christian Brauner
2025-06-25 23:53 ` [PATCH 2/6] btrfs: introduce a new fs state, EMERGENCY_SHUTDOWN Qu Wenruo
2025-06-25 23:53 ` [PATCH 3/6] btrfs: reject file operations if in shutdown state Qu Wenruo
2025-06-25 23:53 ` [PATCH 4/6] btrfs: reject delalloc ranges if the fs is shutdown Qu Wenruo
2025-06-25 23:53 ` [PATCH 5/6] btrfs: implement shutdown ioctl Qu Wenruo
2025-06-25 23:53 ` [PATCH 6/6] btrfs: implement remove_bdev super operation callback Qu Wenruo
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).