* [PATCH v5 0/6] btrfs: add remove_bdev() callback
@ 2025-07-14 5:25 Qu Wenruo
2025-07-14 5:25 ` [PATCH v5 1/6] fs: add a new " Qu Wenruo
` (5 more replies)
0 siblings, 6 replies; 10+ messages in thread
From: Qu Wenruo @ 2025-07-14 5:25 UTC (permalink / raw)
To: linux-btrfs, linux-fsdevel; +Cc: viro, brauner, jack
[CHANGELOG]
v5:
- Split remove_bdev() from shutdown()
Now remove_bdev() will have a return value to indicate if the fs can
handle the removal of the device.
If it cannot, a non-zero (normally negative) value is returned.
In that case ->shutdown() will be called as usual.
This allows us to avoid unnecessary operations that only make sense
for shutdown case, like shrinking the cache.
This also means no change to any of the existing filesystems.
- Implement ->shutdown() callback for btrfs
Since the ->shutdown() and ->remove_bdev() callbacks are separate now,
btrfs needs to implement both.
v4:
- Update the commit message of the first patch
Remove the out-of-date comments about the old *_shutdown() names.
v3:
- Also rename the callback functions inside each fs to *_remove_bdev()
To keep the consistency between the interface and implementation.
- Add extra handling if the to-be-removed device is already missing in
btrfs
I do not know if a device can be double-removed, but the handling
inside btrfs is pretty simple: if the target device is already
missing, nothing needs to be done and the callback can exit immediately.
v2:
- Enhance and rename shutdown() callback
Rename it to remove_bdev() and add a @bdev parameter.
For the existing callbacks in filesystems, keep their callback
function names, now something like ".remove_bdev = ext4_shutdown,"
will be a quick indicator of the behavior.
- Remove the @surprise parameter from the remove_bdev() callback.
The fs_bdev_mark_dead() is already trying to sync the fs if it's not
a surprise removal.
So there isn't much a filesystem can do with the @surprise parameter.
- Fix btrfs error handling when the devices are not opened
There are several cases in which fs_devices is not opened, including:
* sget_fc() failure
* an existing super block is returned
* a new super block is returned but btrfs_open_fs_devices() failed
Handle the error properly so that fs_devices is not freed twice.
RFC->v1:
- Add a new remove_bdev() callback
Thanks for all the feedback from Christian, Christoph and Jan on this new
name.
- Add a @surprise parameter to the remove_bdev() callback
To keep it the same as the bdev_mark_dead().
- Hide the shutdown ioctl and remove_bdev callback behind experimental
With the shutdown ioctl, there are at least 2 test failures (g/388, g/508).
G/388 is related to the error handling with COW fixup.
G/508 looks like something related to log replay.
And remove_bdev() doesn't have any btrfs-specific test case yet to
check the auto-degraded behavior, nor has the auto-degraded behavior been
fully discussed.
So hide both of them behind experimental features.
- Do not use btrfs_handle_fs_error() to avoid freeze/thaw behavior change
btrfs_handle_fs_error() will flip the fs read-only, which will
affect freeze/thaw behavior.
And no other fs sets the fs read-only when shutting down, so follow the
other filesystems for more consistent behavior.
Qu Wenruo (6):
fs: add a new remove_bdev() callback
btrfs: introduce a new fs state, EMERGENCY_SHUTDOWN
btrfs: reject file operations if in shutdown state
btrfs: reject delalloc ranges if in shutdown state
btrfs: implement shutdown ioctl
btrfs: implement remove_bdev and shutdown super operation callbacks
fs/btrfs/file.c | 25 ++++++++++++++-
fs/btrfs/fs.h | 28 ++++++++++++++++
fs/btrfs/inode.c | 14 +++++++-
fs/btrfs/ioctl.c | 44 +++++++++++++++++++++++++
fs/btrfs/messages.c | 1 +
fs/btrfs/reflink.c | 3 ++
fs/btrfs/super.c | 66 ++++++++++++++++++++++++++++++++++++++
fs/btrfs/volumes.c | 2 ++
fs/btrfs/volumes.h | 5 +++
fs/super.c | 11 +++++++
include/linux/fs.h | 9 ++++++
include/uapi/linux/btrfs.h | 9 ++++++
12 files changed, 215 insertions(+), 2 deletions(-)
--
2.50.0
* [PATCH v5 1/6] fs: add a new remove_bdev() callback
2025-07-14 5:25 [PATCH v5 0/6] btrfs: add remove_bdev() callback Qu Wenruo
@ 2025-07-14 5:25 ` Qu Wenruo
2025-07-14 10:14 ` Jan Kara
2025-07-15 11:40 ` (subset) " Christian Brauner
2025-07-14 5:25 ` [PATCH v5 2/6] btrfs: introduce a new fs state, EMERGENCY_SHUTDOWN Qu Wenruo
` (4 subsequent siblings)
5 siblings, 2 replies; 10+ messages in thread
From: Qu Wenruo @ 2025-07-14 5:25 UTC (permalink / raw)
To: linux-btrfs, linux-fsdevel; +Cc: viro, brauner, jack
Currently all filesystems which implement super_operations::shutdown()
can not afford losing a device.
Thus fs_bdev_mark_dead() will just call the ->shutdown() callback for the
involved filesystem.
But it will no longer be the case, as multi-device filesystems like
btrfs and bcachefs can handle certain device loss without the need to
shutdown the whole filesystem.
To allow those multi-device filesystems to be integrated to use
fs_holder_ops:
- Add a new super_operations::remove_bdev() callback
- Try ->remove_bdev() callback first inside fs_bdev_mark_dead()
If the callback returned 0, meaning the fs can handle the device
loss, then exit without doing anything else.
If there is no such callback or the callback returned non-zero value,
continue to shutdown the filesystem as usual.
This means the new remove_bdev() should only do the check on whether the
operation can continue, and if so do the fs-specific handling.
The shutdown handling should still be handled by the existing
->shutdown() callback.
For all existing filesystems with shutdown callback, there is no change
to the code nor behavior.
Btrfs is going to implement both the ->remove_bdev() and ->shutdown()
callbacks soon.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
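The dispatch order described in the commit message can be sketched in userspace as below. This is only a rough model under stated assumptions, not the kernel code: every `sketch_*` name is hypothetical, the block device is reduced to an integer id, and the sync and cache-shrinking steps of the real fallback path are left out. The boolean return value exists only to make the outcome observable; the real fs_bdev_mark_dead() returns void.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are the kernel's definitions. */
struct sketch_sb;

struct sketch_super_ops {
	int (*remove_bdev)(struct sketch_sb *sb, int bdev_id);
	void (*shutdown)(struct sketch_sb *sb);
};

struct sketch_sb {
	const struct sketch_super_ops *ops;
	int devices_left;
	bool is_shutdown;
};

/*
 * Models the order of operations: try ->remove_bdev() first and fall
 * back to ->shutdown() when the callback is missing or returns non-zero.
 * Returns true if the shutdown path was taken.
 */
bool sketch_mark_dead(struct sketch_sb *sb, int bdev_id)
{
	assert(sb != NULL && sb->ops != NULL);

	if (sb->ops->remove_bdev && sb->ops->remove_bdev(sb, bdev_id) == 0)
		return false;

	/* Fallback to shutdown (sync and cache shrinking are omitted here). */
	if (sb->ops->shutdown)
		sb->ops->shutdown(sb);
	return true;
}

/* A multi-device fs that tolerates a loss as long as one device remains. */
int sketch_multi_remove_bdev(struct sketch_sb *sb, int bdev_id)
{
	(void)bdev_id;
	if (sb->devices_left <= 1)
		return -EIO;
	sb->devices_left--;
	return 0;
}

void sketch_shutdown(struct sketch_sb *sb)
{
	sb->is_shutdown = true;
}

const struct sketch_super_ops sketch_multi_dev_ops = {
	.remove_bdev = sketch_multi_remove_bdev,
	.shutdown = sketch_shutdown,
};

/* Existing single-device filesystems only provide ->shutdown(). */
const struct sketch_super_ops sketch_single_dev_ops = {
	.shutdown = sketch_shutdown,
};
```

With the multi-device ops, the first device loss is absorbed and the second one falls through to shutdown; with the single-device ops, every loss goes straight to shutdown, which matches the "no change for existing filesystems" claim.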
fs/super.c | 11 +++++++++++
include/linux/fs.h | 9 +++++++++
2 files changed, 20 insertions(+)
diff --git a/fs/super.c b/fs/super.c
index 80418ca8e215..7f876f32343a 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1459,6 +1459,17 @@ static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
if (!sb)
return;
+ if (sb->s_op->remove_bdev) {
+ int ret;
+
+ ret = sb->s_op->remove_bdev(sb, bdev);
+ if (!ret) {
+ super_unlock_shared(sb);
+ return;
+ }
+ /* Fallback to shutdown. */
+ }
+
if (!surprise)
sync_filesystem(sb);
shrink_dcache_sb(sb);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b085f161ed22..6a8a5e63a5d4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2367,6 +2367,15 @@ struct super_operations {
struct shrink_control *);
long (*free_cached_objects)(struct super_block *,
struct shrink_control *);
+ /*
+ * If a filesystem can support graceful removal of a device and
+ * continue read-write operations, implement this callback.
+ *
+ * Return 0 if the filesystem can continue read-write.
+ * Non-zero return value or no such callback means the fs will be shutdown
+ * as usual.
+ */
+ int (*remove_bdev)(struct super_block *sb, struct block_device *bdev);
void (*shutdown)(struct super_block *sb);
};
--
2.50.0
* [PATCH v5 2/6] btrfs: introduce a new fs state, EMERGENCY_SHUTDOWN
2025-07-14 5:25 [PATCH v5 0/6] btrfs: add remove_bdev() callback Qu Wenruo
2025-07-14 5:25 ` [PATCH v5 1/6] fs: add a new " Qu Wenruo
@ 2025-07-14 5:25 ` Qu Wenruo
2025-07-14 5:25 ` [PATCH v5 3/6] btrfs: reject file operations if in shutdown state Qu Wenruo
` (3 subsequent siblings)
5 siblings, 0 replies; 10+ messages in thread
From: Qu Wenruo @ 2025-07-14 5:25 UTC (permalink / raw)
To: linux-btrfs, linux-fsdevel; +Cc: viro, brauner, jack
This is btrfs' equivalent of XFS_IOC_GOINGDOWN or EXT4_IOC_SHUTDOWN.
After entering the emergency shutdown state, all operations return
errors (-EIO), and the fs cannot be brought back to a normal state until
it is unmounted.
A new helper, btrfs_force_shutdown() is introduced, which will:
- Mark the fs as error
But without flipping the fs read-only.
This is special handling for the future shutdown ioctl, which will
freeze the fs first, set the SHUTDOWN flag, then thaw the fs.
But the thaw path will no longer call the unfreeze_fs() callback
if the superblock is already read-only.
So to handle future shutdown correctly, we only mark the fs as error,
without flipping it read-only.
- Set the SHUTDOWN flag and output a message
New users of those interfaces will be added when implementing shutdown
ioctl support.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
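For illustration, the helper's behavior (record the error, set the flag, log once, leave the read-only state alone) can be modeled in userspace with C11 atomics. This is a sketch under assumptions, not the btrfs implementation: the `sketch_*` names and the `messages_logged` counter are hypothetical, and `atomic_fetch_or()` stands in for the kernel's test_and_set_bit().

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit for this sketch; not the kernel's enum value. */
#define SKETCH_STATE_EMERGENCY_SHUTDOWN (1UL << 0)

struct sketch_fs_info {
	atomic_ulong fs_state;
	atomic_int fs_error;
	bool read_only;		/* deliberately never touched below */
	int messages_logged;	/* stand-in for the btrfs_info() output */
};

bool sketch_is_shutdown(struct sketch_fs_info *fs)
{
	return (atomic_load(&fs->fs_state) & SKETCH_STATE_EMERGENCY_SHUTDOWN) != 0;
}

void sketch_force_shutdown(struct sketch_fs_info *fs)
{
	unsigned long old;

	assert(fs != NULL);
	/* Record the error, but leave read_only untouched on purpose. */
	atomic_store(&fs->fs_error, -EIO);
	old = atomic_fetch_or(&fs->fs_state, SKETCH_STATE_EMERGENCY_SHUTDOWN);
	/* Only the caller that actually flips the bit emits the message. */
	if (!(old & SKETCH_STATE_EMERGENCY_SHUTDOWN))
		fs->messages_logged++;
}
```

Calling the helper twice leaves the state set and the message count at one, which is the point of the test-and-set in the patch.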
fs/btrfs/fs.h | 28 ++++++++++++++++++++++++++++
fs/btrfs/messages.c | 1 +
2 files changed, 29 insertions(+)
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 8cc07cc70b12..f0f1d1e47d6c 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -29,6 +29,7 @@
#include "extent-io-tree.h"
#include "async-thread.h"
#include "block-rsv.h"
+#include "messages.h"
struct inode;
struct super_block;
@@ -120,6 +121,12 @@ enum {
/* No more delayed iput can be queued. */
BTRFS_FS_STATE_NO_DELAYED_IPUT,
+ /*
+ * Emergency shutdown, a step further than trans aborted by rejecting
+ * all operations.
+ */
+ BTRFS_FS_STATE_EMERGENCY_SHUTDOWN,
+
BTRFS_FS_STATE_COUNT
};
@@ -1095,6 +1102,27 @@ static inline void btrfs_wake_unfinished_drop(struct btrfs_fs_info *fs_info)
(unlikely(test_bit(BTRFS_FS_STATE_LOG_CLEANUP_ERROR, \
&(fs_info)->fs_state)))
+static inline bool btrfs_is_shutdown(struct btrfs_fs_info *fs_info)
+{
+ return test_bit(BTRFS_FS_STATE_EMERGENCY_SHUTDOWN, &fs_info->fs_state);
+}
+
+static inline void btrfs_force_shutdown(struct btrfs_fs_info *fs_info)
+{
+ /*
+ * Here we do not want to use handle_fs_error(), which will mark
+ * the fs read-only.
+ * Some call sites like shutdown ioctl will mark the fs shutdown
+ * when the fs is frozen. But thaw path will handle RO and RW fs
+ * differently.
+ *
+ * So here we only mark the fs error without flipping it RO.
+ */
+ WRITE_ONCE(fs_info->fs_error, -EIO);
+ if (!test_and_set_bit(BTRFS_FS_STATE_EMERGENCY_SHUTDOWN, &fs_info->fs_state))
+ btrfs_info(fs_info, "emergency shutdown");
+}
+
/*
* We use folio flag owner_2 to indicate there is an ordered extent with
* unfinished IO.
diff --git a/fs/btrfs/messages.c b/fs/btrfs/messages.c
index 363fd28c0268..2bb4bcb7c2cd 100644
--- a/fs/btrfs/messages.c
+++ b/fs/btrfs/messages.c
@@ -23,6 +23,7 @@ static const char fs_state_chars[] = {
[BTRFS_FS_STATE_NO_DATA_CSUMS] = 'C',
[BTRFS_FS_STATE_SKIP_META_CSUMS] = 'S',
[BTRFS_FS_STATE_LOG_CLEANUP_ERROR] = 'L',
+ [BTRFS_FS_STATE_EMERGENCY_SHUTDOWN] = 'E',
};
static void btrfs_state_to_string(const struct btrfs_fs_info *info, char *buf)
--
2.50.0
* [PATCH v5 3/6] btrfs: reject file operations if in shutdown state
2025-07-14 5:25 [PATCH v5 0/6] btrfs: add remove_bdev() callback Qu Wenruo
2025-07-14 5:25 ` [PATCH v5 1/6] fs: add a new " Qu Wenruo
2025-07-14 5:25 ` [PATCH v5 2/6] btrfs: introduce a new fs state, EMERGENCY_SHUTDOWN Qu Wenruo
@ 2025-07-14 5:25 ` Qu Wenruo
2025-07-14 5:26 ` [PATCH v5 4/6] btrfs: reject delalloc ranges " Qu Wenruo
` (2 subsequent siblings)
5 siblings, 0 replies; 10+ messages in thread
From: Qu Wenruo @ 2025-07-14 5:25 UTC (permalink / raw)
To: linux-btrfs, linux-fsdevel; +Cc: viro, brauner, jack
This includes the following callbacks of file_operations:
- read_iter()
- write_iter()
- mmap()
- open()
- remap_file_range()
- uring_cmd()
- splice_read()
This requires a small wrapper to do the extra shutdown check, then call
the regular filemap_splice_read() function.
This should reject most of the file operations on a shutdown btrfs.
The ioctl() callback is intentionally skipped, as ext4 doesn't do the
shutdown check on ioctl() either, thus I believe there is some special
requirement for the ioctl() callback even if the fs is fully shut down.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
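The wrapper pattern used for splice_read() can be illustrated with a small userspace model. All names here are hypothetical, and `sketch_generic_read()` only stands in for a generic helper such as filemap_splice_read(); the sketch shows the shape of the change (check first, then delegate), not the real signatures.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_file {
	bool fs_shutdown;	/* stand-in for btrfs_is_shutdown(fs_info) */
	size_t size;
};

/* Stand-in for a generic helper that knows nothing about shutdown. */
long sketch_generic_read(struct sketch_file *f, size_t len)
{
	return (long)(len < f->size ? len : f->size);
}

/* The small wrapper: reject early on a shut down fs, else delegate. */
long sketch_checked_read(struct sketch_file *f, size_t len)
{
	assert(f != NULL);
	if (f->fs_shutdown)
		return -EIO;
	return sketch_generic_read(f, len);
}

struct sketch_file_ops {
	long (*read)(struct sketch_file *f, size_t len);
};

/* The ops table points at the wrapper instead of the generic helper. */
const struct sketch_file_ops sketch_ops = {
	.read = sketch_checked_read,
};
```

Callers that go through the ops table get -EIO as soon as the flag is set, while the generic helper itself stays unmodified.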
fs/btrfs/file.c | 25 ++++++++++++++++++++++++-
fs/btrfs/ioctl.c | 3 +++
fs/btrfs/reflink.c | 3 +++
3 files changed, 30 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index bc1e00db96c9..efcb9e6e34a3 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1442,6 +1442,8 @@ ssize_t btrfs_do_write_iter(struct kiocb *iocb, struct iov_iter *from,
struct btrfs_inode *inode = BTRFS_I(file_inode(file));
ssize_t num_written, num_sync;
+ if (unlikely(btrfs_is_shutdown(inode->root->fs_info)))
+ return -EIO;
/*
* If the fs flips readonly due to some impossible error, although we
* have opened a file as writable, we have to stop this write operation
@@ -2043,6 +2045,8 @@ static int btrfs_file_mmap(struct file *filp, struct vm_area_struct *vma)
{
struct address_space *mapping = filp->f_mapping;
+ if (unlikely(btrfs_is_shutdown(inode_to_fs_info(file_inode(filp)))))
+ return -EIO;
if (!mapping->a_ops->read_folio)
return -ENOEXEC;
@@ -3102,6 +3106,9 @@ static long btrfs_fallocate(struct file *file, int mode,
int blocksize = BTRFS_I(inode)->root->fs_info->sectorsize;
int ret;
+ if (unlikely(btrfs_is_shutdown(inode_to_fs_info(inode))))
+ return -EIO;
+
/* Do not allow fallocate in ZONED mode */
if (btrfs_is_zoned(inode_to_fs_info(inode)))
return -EOPNOTSUPP;
@@ -3793,6 +3800,9 @@ static int btrfs_file_open(struct inode *inode, struct file *filp)
{
int ret;
+ if (unlikely(btrfs_is_shutdown(inode_to_fs_info(inode))))
+ return -EIO;
+
filp->f_mode |= FMODE_NOWAIT | FMODE_CAN_ODIRECT;
ret = fsverity_file_open(inode, filp);
@@ -3805,6 +3815,9 @@ static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
ssize_t ret = 0;
+ if (unlikely(btrfs_is_shutdown(inode_to_fs_info(file_inode(iocb->ki_filp)))))
+ return -EIO;
+
if (iocb->ki_flags & IOCB_DIRECT) {
ret = btrfs_direct_read(iocb, to);
if (ret < 0 || !iov_iter_count(to) ||
@@ -3815,10 +3828,20 @@ static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
return filemap_read(iocb, to, ret);
}
+static ssize_t btrfs_file_splice_read(struct file *in, loff_t *ppos,
+ struct pipe_inode_info *pipe,
+ size_t len, unsigned int flags)
+{
+ if (unlikely(btrfs_is_shutdown(inode_to_fs_info(file_inode(in)))))
+ return -EIO;
+
+ return filemap_splice_read(in, ppos, pipe, len, flags);
+}
+
const struct file_operations btrfs_file_operations = {
.llseek = btrfs_file_llseek,
.read_iter = btrfs_file_read_iter,
- .splice_read = filemap_splice_read,
+ .splice_read = btrfs_file_splice_read,
.write_iter = btrfs_file_write_iter,
.splice_write = iter_file_splice_write,
.mmap = btrfs_file_mmap,
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 680c4e794e67..01d27f093eeb 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -5040,6 +5040,9 @@ static int btrfs_uring_encoded_write(struct io_uring_cmd *cmd, unsigned int issu
int btrfs_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
{
+ if (unlikely(btrfs_is_shutdown(inode_to_fs_info(file_inode(cmd->file)))))
+ return -EIO;
+
switch (cmd->cmd_op) {
case BTRFS_IOC_ENCODED_READ:
#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index ce25ab7f0e99..d88318ea31ba 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -869,6 +869,9 @@ loff_t btrfs_remap_file_range(struct file *src_file, loff_t off,
bool same_inode = dst_inode == src_inode;
int ret;
+ if (unlikely(btrfs_is_shutdown(inode_to_fs_info(file_inode(src_file)))))
+ return -EIO;
+
if (remap_flags & ~(REMAP_FILE_DEDUP | REMAP_FILE_ADVISORY))
return -EINVAL;
--
2.50.0
* [PATCH v5 4/6] btrfs: reject delalloc ranges if in shutdown state
2025-07-14 5:25 [PATCH v5 0/6] btrfs: add remove_bdev() callback Qu Wenruo
` (2 preceding siblings ...)
2025-07-14 5:25 ` [PATCH v5 3/6] btrfs: reject file operations if in shutdown state Qu Wenruo
@ 2025-07-14 5:26 ` Qu Wenruo
2025-07-14 5:26 ` [PATCH v5 5/6] btrfs: implement shutdown ioctl Qu Wenruo
2025-07-14 5:26 ` [PATCH v5 6/6] btrfs: implement remove_bdev and shutdown super operation callbacks Qu Wenruo
5 siblings, 0 replies; 10+ messages in thread
From: Qu Wenruo @ 2025-07-14 5:26 UTC (permalink / raw)
To: linux-btrfs, linux-fsdevel; +Cc: viro, brauner, jack
If the filesystem has dirty pages before the fs is shut down, we should
no longer write them back, and should instead treat them as writeback errors.
Handle this situation by marking all those delalloc ranges as errors and
letting the error handling path clean them up.
For ranges that already have an ordered extent created, let them continue
the writeback; at ordered IO finish time the file extent item update
will be rejected as the fs is already marked as error.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/inode.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7ed340cac33f..928ab8ba7d0e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -861,6 +861,9 @@ static void compress_file_range(struct btrfs_work *work)
int compress_type = fs_info->compress_type;
int compress_level = fs_info->compress_level;
+ if (unlikely(btrfs_is_shutdown(fs_info)))
+ goto cleanup_and_bail_uncompressed;
+
inode_should_defrag(inode, start, end, end - start + 1, SZ_16K);
/*
@@ -1276,6 +1279,11 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
unsigned long page_ops;
int ret = 0;
+ if (unlikely(btrfs_is_shutdown(fs_info))) {
+ ret = -EIO;
+ goto out_unlock;
+ }
+
if (btrfs_is_free_space_inode(inode)) {
ret = -EINVAL;
goto out_unlock;
@@ -2027,7 +2035,7 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
{
struct btrfs_fs_info *fs_info = inode->root->fs_info;
struct btrfs_root *root = inode->root;
- struct btrfs_path *path;
+ struct btrfs_path *path = NULL;
u64 cow_start = (u64)-1;
/*
* If not 0, represents the inclusive end of the last fallback_to_cow()
@@ -2047,6 +2055,10 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
*/
ASSERT(!btrfs_is_zoned(fs_info) || btrfs_is_data_reloc_root(root));
+ if (unlikely(btrfs_is_shutdown(fs_info))) {
+ ret = -EIO;
+ goto error;
+ }
path = btrfs_alloc_path();
if (!path) {
ret = -ENOMEM;
--
2.50.0
* [PATCH v5 5/6] btrfs: implement shutdown ioctl
2025-07-14 5:25 [PATCH v5 0/6] btrfs: add remove_bdev() callback Qu Wenruo
` (3 preceding siblings ...)
2025-07-14 5:26 ` [PATCH v5 4/6] btrfs: reject delalloc ranges " Qu Wenruo
@ 2025-07-14 5:26 ` Qu Wenruo
2025-07-14 5:26 ` [PATCH v5 6/6] btrfs: implement remove_bdev and shutdown super operation callbacks Qu Wenruo
5 siblings, 0 replies; 10+ messages in thread
From: Qu Wenruo @ 2025-07-14 5:26 UTC (permalink / raw)
To: linux-btrfs, linux-fsdevel; +Cc: viro, brauner, jack
The shutdown ioctl should follow the XFS one, which uses magic number 'X'
and ioctl number 125, with a uint32 as flags.
For now btrfs doesn't distinguish the DEFAULT and LOGFLUSH flags (just like
f2fs); both will freeze the fs first (which implies committing the current
transaction), set the SHUTDOWN flag and finally thaw the fs.
For the NOLOGFLUSH flag, the freeze/thaw part is skipped, thus the current
transaction is aborted.
The new shutdown ioctl is hidden behind experimental features for more
testing.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
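A userspace caller might look roughly like the sketch below. It is an assumption-laden example, not a tested tool: the `SKETCH_*` macros simply copy the values proposed in this patch because the definitions are not in any installed header yet, and `sketch_request_shutdown()` is a hypothetical helper name. The flag check mirrors the bound the kernel side applies.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Values copied from this patch; the names are local to this sketch. */
#define SKETCH_SHUTDOWN_FLAGS_DEFAULT		0x0
#define SKETCH_SHUTDOWN_FLAGS_LOGFLUSH		0x1
#define SKETCH_SHUTDOWN_FLAGS_NOLOGFLUSH	0x2
#define SKETCH_SHUTDOWN_FLAGS_LAST		0x3
#define SKETCH_IOC_SHUTDOWN			_IOR('X', 125, uint32_t)

static_assert(SKETCH_SHUTDOWN_FLAGS_NOLOGFLUSH < SKETCH_SHUTDOWN_FLAGS_LAST,
	      "LAST is an exclusive upper bound");

/*
 * Ask the fs behind @fd to shut down. Returns 0 on success or a negative
 * errno. Invalid flags are rejected up front, so they never reach the
 * ioctl.
 */
int sketch_request_shutdown(int fd, uint32_t flags)
{
	if (flags >= SKETCH_SHUTDOWN_FLAGS_LAST)
		return -EINVAL;
	if (ioctl(fd, SKETCH_IOC_SHUTDOWN, &flags) < 0)
		return -errno;
	return 0;
}
```

On a kernel without CONFIG_BTRFS_EXPERIMENTAL the call is expected to fail with -ENOTTY, since the ioctl case is compiled out; the caller also needs CAP_SYS_ADMIN.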
fs/btrfs/ioctl.c | 41 ++++++++++++++++++++++++++++++++++++++
include/uapi/linux/btrfs.h | 9 +++++++++
2 files changed, 50 insertions(+)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 01d27f093eeb..54bb9e5f9892 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -5186,6 +5186,43 @@ static int btrfs_ioctl_subvol_sync(struct btrfs_fs_info *fs_info, void __user *a
return 0;
}
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+static int btrfs_ioctl_shutdown(struct btrfs_fs_info *fs_info, unsigned long arg)
+{
+ int ret = 0;
+ uint32_t flags;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ if (get_user(flags, (uint32_t __user *)arg))
+ return -EFAULT;
+
+ if (flags >= BTRFS_SHUTDOWN_FLAGS_LAST)
+ return -EINVAL;
+
+ if (btrfs_is_shutdown(fs_info))
+ return 0;
+
+ switch (flags) {
+ case BTRFS_SHUTDOWN_FLAGS_LOGFLUSH:
+ case BTRFS_SHUTDOWN_FLAGS_DEFAULT:
+ ret = freeze_super(fs_info->sb, FREEZE_HOLDER_KERNEL, NULL);
+ if (ret)
+ return ret;
+ btrfs_force_shutdown(fs_info);
+ ret = thaw_super(fs_info->sb, FREEZE_HOLDER_KERNEL, NULL);
+ if (ret)
+ return ret;
+ break;
+ case BTRFS_SHUTDOWN_FLAGS_NOLOGFLUSH:
+ btrfs_force_shutdown(fs_info);
+ break;
+ }
+ return ret;
+}
+#endif
+
long btrfs_ioctl(struct file *file, unsigned int
cmd, unsigned long arg)
{
@@ -5341,6 +5378,10 @@ long btrfs_ioctl(struct file *file, unsigned int
#endif
case BTRFS_IOC_SUBVOL_SYNC_WAIT:
return btrfs_ioctl_subvol_sync(fs_info, argp);
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+ case BTRFS_IOC_SHUTDOWN:
+ return btrfs_ioctl_shutdown(fs_info, arg);
+#endif
}
return -ENOTTY;
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index dd02160015b2..4d8201f3b4a4 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -1096,6 +1096,12 @@ enum btrfs_err_code {
BTRFS_ERROR_DEV_RAID1C4_MIN_NOT_MET,
};
+/* Flags for IOC_SHUTDOWN, should match XFS' flags. */
+#define BTRFS_SHUTDOWN_FLAGS_DEFAULT 0x0
+#define BTRFS_SHUTDOWN_FLAGS_LOGFLUSH 0x1
+#define BTRFS_SHUTDOWN_FLAGS_NOLOGFLUSH 0x2
+#define BTRFS_SHUTDOWN_FLAGS_LAST 0x3
+
#define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \
struct btrfs_ioctl_vol_args)
#define BTRFS_IOC_DEFRAG _IOW(BTRFS_IOCTL_MAGIC, 2, \
@@ -1217,6 +1223,9 @@ enum btrfs_err_code {
#define BTRFS_IOC_SUBVOL_SYNC_WAIT _IOW(BTRFS_IOCTL_MAGIC, 65, \
struct btrfs_ioctl_subvol_wait)
+/* Shutdown ioctl should follow XFS's interfaces, thus not using btrfs magic. */
+#define BTRFS_IOC_SHUTDOWN _IOR('X', 125, uint32_t)
+
#ifdef __cplusplus
}
#endif
--
2.50.0
* [PATCH v5 6/6] btrfs: implement remove_bdev and shutdown super operation callbacks
2025-07-14 5:25 [PATCH v5 0/6] btrfs: add remove_bdev() callback Qu Wenruo
` (4 preceding siblings ...)
2025-07-14 5:26 ` [PATCH v5 5/6] btrfs: implement shutdown ioctl Qu Wenruo
@ 2025-07-14 5:26 ` Qu Wenruo
2025-07-15 11:36 ` Christian Brauner
5 siblings, 1 reply; 10+ messages in thread
From: Qu Wenruo @ 2025-07-14 5:26 UTC (permalink / raw)
To: linux-btrfs, linux-fsdevel; +Cc: viro, brauner, jack
For the ->remove_bdev() callback, btrfs will:
- Mark the target device as missing
- Go degraded if the fs can afford it
- Return an error otherwise
This falls back to the shutdown callback.
For the ->shutdown callback, btrfs will:
- Set the SHUTDOWN flag
This rejects all new incoming operations and makes all writeback
fail.
The behavior is the same as the NOLOGFLUSH behavior.
To support the lookup from bdev to a btrfs_device,
btrfs_dev_lookup_args is enhanced to have a new @devt member.
If set, we should be able to use that @devt member to uniquely locate a
btrfs device.
I know the shutdown can be a little overkill: if one has RAID1
metadata and RAID0 data, one can still read data with a 50%
chance of getting good data.
But a filesystem returning -EIO half of the time is not really
considered usable.
Further, it can also be as bad as the only device going missing on a
single-device btrfs.
So here we are better safe than sorry when handling a missing device.
And the remove_bdev callback will be hidden behind experimental features
for now, the reasons are:
- There are not enough btrfs specific bdev removal test cases
The existing test cases are all removing the only device, thus only
exercises the ->shutdown() behavior.
- The expected behavior is not yet determined
Although the current auto-degrade behavior is no worse than the old
behavior, it may not always be what the end users want.
Before there is a concrete interface, better hide the new feature
from end users.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
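The decision flow of the ->remove_bdev() callback can be sketched in userspace as follows. This is a simplified model with hypothetical `sketch_*` names: locking is omitted, the device number is a plain integer, and a single `tolerated_missing` count is a crude stand-in for btrfs_check_rw_degradable(), which in btrfs depends on the chunk profiles rather than on one number.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_device {
	unsigned int devt;	/* stand-in for dev_t */
	bool missing;
	bool writeable;
};

struct sketch_fs_devices {
	struct sketch_device *devices;
	size_t nr_devices;
	int missing_devices;
	int rw_devices;
	int tolerated_missing;	/* crude stand-in for the degradable check */
	bool degraded;
};

int sketch_remove_bdev(struct sketch_fs_devices *fs, unsigned int devt)
{
	struct sketch_device *dev = NULL;

	/* Look the device up by its device number only. */
	for (size_t i = 0; i < fs->nr_devices; i++) {
		if (fs->devices[i].devt == devt) {
			dev = &fs->devices[i];
			break;
		}
	}
	/* Unknown device: it cannot affect the running fs. */
	if (!dev)
		return 0;
	/* Already missing: nothing more to do. */
	if (dev->missing)
		return 0;

	dev->missing = true;
	fs->missing_devices++;
	if (dev->writeable) {
		dev->writeable = false;
		assert(fs->rw_devices >= 1);
		fs->rw_devices--;
	}

	/* Too many devices gone: report failure so the caller shuts down. */
	if (fs->missing_devices > fs->tolerated_missing)
		return -EIO;

	fs->degraded = true;
	return 0;
}
```

For a two-device setup that tolerates one missing device, the first removal returns 0 and marks the fs degraded, a repeated removal of the same device is a no-op, and losing the second device returns -EIO so that the generic code falls back to ->shutdown().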
fs/btrfs/super.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/volumes.c | 2 ++
fs/btrfs/volumes.h | 5 ++++
3 files changed, 73 insertions(+)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 466d0450269c..79f6ad1d44de 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2413,6 +2413,68 @@ static long btrfs_free_cached_objects(struct super_block *sb, struct shrink_cont
return 0;
}
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+static int btrfs_remove_bdev(struct super_block *sb, struct block_device *bdev)
+{
+ struct btrfs_fs_info *fs_info = btrfs_sb(sb);
+ struct btrfs_device *device;
+ struct btrfs_dev_lookup_args lookup_args = { .devt = bdev->bd_dev };
+ bool can_rw;
+
+ mutex_lock(&fs_info->fs_devices->device_list_mutex);
+ device = btrfs_find_device(fs_info->fs_devices, &lookup_args);
+ if (!device) {
+ mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+ /* Device not found, should not affect the running fs, just give a warning. */
+ btrfs_warn(fs_info, "unable to find btrfs device for block device '%pg'",
+ bdev);
+ return 0;
+ }
+ /*
+ * The to-be-removed device is already missing?
+ *
+ * That's weird but no special handling needed and can exit right now.
+ */
+ if (unlikely(test_and_set_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))) {
+ mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+ btrfs_warn(fs_info, "btrfs device id %llu is already missing",
+ device->devid);
+ return 0;
+ }
+
+ device->fs_devices->missing_devices++;
+ if (test_and_clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
+ list_del_init(&device->dev_alloc_list);
+ WARN_ON(device->fs_devices->rw_devices < 1);
+ device->fs_devices->rw_devices--;
+ }
+ can_rw = btrfs_check_rw_degradable(fs_info, device);
+ mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+ /*
+ * Now device is considered missing, btrfs_device_name() won't give a
+ * meaningful result anymore, so only output the devid.
+ */
+ if (!can_rw) {
+ btrfs_crit(fs_info,
+ "btrfs device id %llu has gone missing, can not maintain read-write",
+ device->devid);
+ return -EIO;
+ }
+ btrfs_warn(fs_info,
+ "btrfs device id %llu has gone missing, continue as degraded",
+ device->devid);
+ btrfs_set_opt(fs_info->mount_opt, DEGRADED);
+ return 0;
+}
+
+static void btrfs_shutdown(struct super_block *sb)
+{
+ struct btrfs_fs_info *fs_info = btrfs_sb(sb);
+
+ btrfs_force_shutdown(fs_info);
+}
+#endif
+
static const struct super_operations btrfs_super_ops = {
.drop_inode = btrfs_drop_inode,
.evict_inode = btrfs_evict_inode,
@@ -2428,6 +2490,10 @@ static const struct super_operations btrfs_super_ops = {
.unfreeze_fs = btrfs_unfreeze,
.nr_cached_objects = btrfs_nr_cached_objects,
.free_cached_objects = btrfs_free_cached_objects,
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+ .remove_bdev = btrfs_remove_bdev,
+ .shutdown = btrfs_shutdown,
+#endif
};
static const struct file_operations btrfs_ctl_fops = {
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index fa7a929a0461..89a82b2a5a7a 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6802,6 +6802,8 @@ static bool dev_args_match_fs_devices(const struct btrfs_dev_lookup_args *args,
static bool dev_args_match_device(const struct btrfs_dev_lookup_args *args,
const struct btrfs_device *device)
{
+ if (args->devt)
+ return device->devt == args->devt;
if (args->missing) {
if (test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state) &&
!device->bdev)
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index a56e873a3029..78003c9b8abd 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -662,6 +662,11 @@ struct btrfs_dev_lookup_args {
u64 devid;
u8 *uuid;
u8 *fsid;
+ /*
+ * If devt is specified, all other members will be ignored as it is
+ * enough to uniquely locate a device.
+ */
+ dev_t devt;
bool missing;
};
--
2.50.0
* Re: [PATCH v5 1/6] fs: add a new remove_bdev() callback
2025-07-14 5:25 ` [PATCH v5 1/6] fs: add a new " Qu Wenruo
@ 2025-07-14 10:14 ` Jan Kara
2025-07-15 11:40 ` (subset) " Christian Brauner
1 sibling, 0 replies; 10+ messages in thread
From: Jan Kara @ 2025-07-14 10:14 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs, linux-fsdevel, viro, brauner, jack
On Mon 14-07-25 14:55:57, Qu Wenruo wrote:
> Currently all filesystems which implement super_operations::shutdown()
> can not afford losing a device.
>
> Thus fs_bdev_mark_dead() will just call the ->shutdown() callback for the
> involved filesystem.
>
> But it will no longer be the case, as multi-device filesystems like
> btrfs and bcachefs can handle certain device loss without the need to
> shutdown the whole filesystem.
>
> To allow those multi-device filesystems to be integrated to use
> fs_holder_ops:
>
> - Add a new super_operations::remove_bdev() callback
>
> - Try ->remove_bdev() callback first inside fs_bdev_mark_dead()
> If the callback returned 0, meaning the fs can handling the device
^^^ handle
> loss, then exit without doing anything else.
>
> If there is no such callback or the callback returned non-zero value,
> continue to shutdown the filesystem as usual.
>
> This means the new remove_bdev() should only do the check on whether the
> operation can continue, and if so do the fs specific handlings.
> The shutdown handling should still be handled by the existing
^^^^ I'd remove this word.
> ->shutdown() callback.
>
> For all existing filesystems with shutdown callback, there is no change
> to the code nor behavior.
>
> Btrfs is going to implement both the ->remove_bdev() and ->shutdown()
> callbacks soon.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
Besides the spelling fixes looks good to me. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/super.c | 11 +++++++++++
> include/linux/fs.h | 9 +++++++++
> 2 files changed, 20 insertions(+)
>
> diff --git a/fs/super.c b/fs/super.c
> index 80418ca8e215..7f876f32343a 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1459,6 +1459,17 @@ static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
> if (!sb)
> return;
>
> + if (sb->s_op->remove_bdev) {
> + int ret;
> +
> + ret = sb->s_op->remove_bdev(sb, bdev);
> + if (!ret) {
> + super_unlock_shared(sb);
> + return;
> + }
> + /* Fallback to shutdown. */
> + }
> +
> if (!surprise)
> sync_filesystem(sb);
> shrink_dcache_sb(sb);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index b085f161ed22..6a8a5e63a5d4 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2367,6 +2367,15 @@ struct super_operations {
> struct shrink_control *);
> long (*free_cached_objects)(struct super_block *,
> struct shrink_control *);
> + /*
> + * If a filesystem can support graceful removal of a device and
> + * continue read-write operations, implement this callback.
> + *
> + * Return 0 if the filesystem can continue read-write.
> + * Non-zero return value or no such callback means the fs will be shutdown
> + * as usual.
> + */
> + int (*remove_bdev)(struct super_block *sb, struct block_device *bdev);
> void (*shutdown)(struct super_block *sb);
> };
>
> --
> 2.50.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v5 6/6] btrfs: implement remove_bdev and shutdown super operation callbacks
2025-07-14 5:26 ` [PATCH v5 6/6] btrfs: implement remove_bdev and shutdown super operation callbacks Qu Wenruo
@ 2025-07-15 11:36 ` Christian Brauner
0 siblings, 0 replies; 10+ messages in thread
From: Christian Brauner @ 2025-07-15 11:36 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs, linux-fsdevel, viro, jack
On Mon, Jul 14, 2025 at 02:56:02PM +0930, Qu Wenruo wrote:
> For the ->remove_bdev() callback, btrfs will:
>
> - Mark the target device as missing
>
> - Go degraded if the fs can afford it
>
> - Return error other wise
> Thus falls back to the shutdown callback
>
> For the ->shutdown callback, btrfs will:
>
> - Set the SHUTDOWN flag
> Which will reject all new incoming operations, and make all writeback
> to fail.
>
> The behavior is the same as the NOLOGFLUSH behavior.
>
> To support the lookup from bdev to a btrfs_device,
> btrfs_dev_lookup_args is enhanced to have a new @devt member.
> If set, we should be able to use that @devt member to uniquely locate a
> btrfs device.
>
> I know the shutdown can be a little overkill: if one has RAID1
> metadata and RAID0 data, one can still read data with a 50%
> chance of getting some good data.
>
> But a filesystem returning -EIO half of the time is not really
> considered usable.
> Further, it can be as bad as the only device going missing on a
> single-device btrfs.
>
> So here we err on the safe side when handling a missing device.
>
> The remove_bdev callback will be hidden behind experimental features
> for now. The reasons are:
>
> - There are not enough btrfs specific bdev removal test cases
> The existing test cases all remove the only device, thus only
> exercise the ->shutdown() behavior.
>
> - The expected behavior is not yet determined
> Although the current auto-degrade behavior is no worse than the old
> behavior, it may not always be what the end users want.
>
> Before there is a concrete interface, better hide the new feature
> from end users.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/super.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++
> fs/btrfs/volumes.c | 2 ++
> fs/btrfs/volumes.h | 5 ++++
> 3 files changed, 73 insertions(+)
>
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 466d0450269c..79f6ad1d44de 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -2413,6 +2413,68 @@ static long btrfs_free_cached_objects(struct super_block *sb, struct shrink_cont
> return 0;
> }
>
> +#ifdef CONFIG_BTRFS_EXPERIMENTAL
> +static int btrfs_remove_bdev(struct super_block *sb, struct block_device *bdev)
> +{
> + struct btrfs_fs_info *fs_info = btrfs_sb(sb);
> + struct btrfs_device *device;
> + struct btrfs_dev_lookup_args lookup_args = { .devt = bdev->bd_dev };
> + bool can_rw;
> +
> + mutex_lock(&fs_info->fs_devices->device_list_mutex);
> + device = btrfs_find_device(fs_info->fs_devices, &lookup_args);
> + if (!device) {
> + mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> + /* Device not found, should not affect the running fs, just give a warning. */
> + btrfs_warn(fs_info, "unable to find btrfs device for block device '%pg'",
> + bdev);
> + return 0;
> + }
I got very confused when reviewing this, asking myself how this was
going to work... until I pulled btrfs/for-next. It would've been good to
know that you merged patches to do bdev_file_open_by_*() calls with the
superblock as owner and fs_holder_ops set. Very good to see this! Thanks
everyone!
So I can now go delete the whole paragraph I had about that. :)
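
The policy described in the quoted commit message (look the device up by its dev_t, mark it missing, stay read-write only if the fs can afford the loss, otherwise return an error so that the shutdown path runs) can be sketched as a userspace model. Everything below is an assumption made for illustration: the model_* types, the max_missing tolerance field, and the -EIO error code are hypothetical. The real btrfs decision about whether it can go degraded is not shown in this thread and is more involved than a single counter.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a btrfs device, keyed by a dev_t-like number. */
struct model_device {
	unsigned int devt;
	bool missing;
};

/* Hypothetical stand-in for the filesystem's device list. */
struct model_fs {
	struct model_device *devices;
	size_t nr_devices;
	size_t max_missing;	/* assumed tolerance, e.g. 1 for a two-device mirror */
};

/* Linear lookup by device number, mirroring the idea of the @devt member. */
static struct model_device *model_find_device(struct model_fs *fs,
					      unsigned int devt)
{
	for (size_t i = 0; i < fs->nr_devices; i++) {
		if (fs->devices[i].devt == devt)
			return &fs->devices[i];
	}
	return NULL;
}

static size_t model_count_missing(const struct model_fs *fs)
{
	size_t count = 0;

	for (size_t i = 0; i < fs->nr_devices; i++) {
		if (fs->devices[i].missing)
			count++;
	}
	return count;
}

/*
 * Return 0 if the modelled fs can continue read-write, or a negative
 * errno so that the caller falls back to shutdown.
 */
static int model_remove_bdev(struct model_fs *fs, unsigned int devt)
{
	struct model_device *dev;

	assert(fs != NULL);

	dev = model_find_device(fs, devt);
	if (!dev)
		return 0;	/* Unknown device: the running fs is unaffected. */
	if (dev->missing)
		return 0;	/* Already missing: nothing more to do. */

	dev->missing = true;
	if (model_count_missing(fs) > fs->max_missing)
		return -EIO;	/* Cannot go degraded: let shutdown handle it. */
	return 0;
}
```

With a two-device mirror that tolerates one missing device, removing the first device succeeds and removing the second returns an error. A single-device fs with a tolerance of zero always ends up on the shutdown path.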
* Re: (subset) [PATCH v5 1/6] fs: add a new remove_bdev() callback
2025-07-14 5:25 ` [PATCH v5 1/6] fs: add a new " Qu Wenruo
2025-07-14 10:14 ` Jan Kara
@ 2025-07-15 11:40 ` Christian Brauner
1 sibling, 0 replies; 10+ messages in thread
From: Christian Brauner @ 2025-07-15 11:40 UTC (permalink / raw)
To: linux-btrfs, linux-fsdevel, Qu Wenruo; +Cc: Christian Brauner, viro, jack
On Mon, 14 Jul 2025 14:55:57 +0930, Qu Wenruo wrote:
> Currently all filesystems which implement super_operations::shutdown()
> can not afford losing a device.
>
> Thus fs_bdev_mark_dead() will just call the ->shutdown() callback for the
> involved filesystem.
>
> But it will no longer be the case, as multi-device filesystems like
> btrfs and bcachefs can handle certain device loss without the need to
> shutdown the whole filesystem.
>
> [...]
Applied to the vfs-6.17.super branch of the vfs/vfs.git tree.
Patches in the vfs-6.17.super branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.
Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.
tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.17.super
[1/6] fs: add a new remove_bdev() callback
https://git.kernel.org/vfs/vfs/c/d9c37a4904ec
Thread overview: 10+ messages
2025-07-14 5:25 [PATCH v5 0/6] btrfs: add remove_bdev() callback Qu Wenruo
2025-07-14 5:25 ` [PATCH v5 1/6] fs: add a new " Qu Wenruo
2025-07-14 10:14 ` Jan Kara
2025-07-15 11:40 ` (subset) " Christian Brauner
2025-07-14 5:25 ` [PATCH v5 2/6] btrfs: introduce a new fs state, EMERGENCY_SHUTDOWN Qu Wenruo
2025-07-14 5:25 ` [PATCH v5 3/6] btrfs: reject file operations if in shutdown state Qu Wenruo
2025-07-14 5:26 ` [PATCH v5 4/6] btrfs: reject delalloc ranges " Qu Wenruo
2025-07-14 5:26 ` [PATCH v5 5/6] btrfs: implement shutdown ioctl Qu Wenruo
2025-07-14 5:26 ` [PATCH v5 6/6] btrfs: implement remove_bdev and shutdown super operation callbacks Qu Wenruo
2025-07-15 11:36 ` Christian Brauner