[PATCH v2 0/4] btrfs: introduce btrfs specific bdev holder ops and implement mark


* [PATCH v2 0/4] btrfs: introduce btrfs specific bdev holder ops and implement mark_dead() call back
@ 2025-06-09  5:19 Qu Wenruo
  2025-06-09  5:19 ` [PATCH v2 1/4] btrfs: use fs_info as the block device holder Qu Wenruo
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Qu Wenruo @ 2025-06-09  5:19 UTC (permalink / raw)
  To: linux-btrfs

[CHANGELOG]
v2:
- Extra patches to properly handle fs mounting/unmounting races

- Harden freeze/thaw against races
  Freeze/thaw can be triggered on both a per-fs and a per-bdev basis,
  which can cause extra races.

Btrfs doesn't implement any of the blk_holder_ops callbacks, which means:

- No sync/freeze/thaw support for an opened btrfs device
  Not sure if this is the root cause of btrfs + hibernation data loss,
  but it won't hurt to have such support.

- No ability to detect a dead device at runtime
  This means btrfs can keep using a "living" dead device. The worst
  case is generic/730, where the only device of a btrfs is removed and
  btrfs merely aborts the transaction when it fails to write it.

This series improves the situation by:

- Using a per-fs holder for block devices
  This is a dependency for implementing proper blk_holder_ops, as we
  need a way to grab the fs_info from a block device.

- Using bdev_fput() for btrfs devices
  This ensures the bdev holder reclaim is not delayed, and thus that
  the fs_info's lifespan always covers any opened block device.

  This is another dependency for implementing blk_holder_ops, as
  otherwise we could grab an fs_info which is already released.

- Implementing per-bdev sync/freeze/thaw callbacks
  Freeze/thaw has a special requirement, as it can be triggered from
  two different paths: by ioctl (aka per-fs) and by block device (aka
  per-bdev).

  In the worst case, per-fs and per-bdev freezing/thawing can race
  with each other, which requires fs level serialization.

- Implementing the mark_dead() callback
  This automatically marks dead devices as missing and degrades the
  fs.

  Furthermore, if the remaining devices can no longer maintain RW
  operations, the fs is immediately marked as error (and thus also
  read-only) to prevent further data loss.
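
The fs level serialization mentioned for freeze/thaw can be pictured
with a small userspace model. This is only a sketch under assumptions:
struct model_fs and the model_fs_*() helpers are hypothetical names, a
pthread mutex stands in for the kernel mutex, and the error codes follow
what patch 3 returns from btrfs_freeze()/btrfs_unfreeze().

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-fs state, not the real btrfs_fs_info. */
struct model_fs {
	pthread_mutex_t freeze_mutex;	/* shared by per-fs and per-bdev callers */
	bool frozen;
};

static void model_fs_init(struct model_fs *fs)
{
	assert(fs != NULL);
	pthread_mutex_init(&fs->freeze_mutex, NULL);
	fs->frozen = false;
}

/* Returns 0 on success, -EBUSY if another path already froze the fs. */
static int model_fs_freeze(struct model_fs *fs)
{
	int ret = 0;

	pthread_mutex_lock(&fs->freeze_mutex);
	if (fs->frozen)
		ret = -EBUSY;
	else
		fs->frozen = true;
	pthread_mutex_unlock(&fs->freeze_mutex);
	return ret;
}

/* Returns 0 on success, -EINVAL if the fs is not frozen (e.g. lost a race). */
static int model_fs_thaw(struct model_fs *fs)
{
	int ret = 0;

	pthread_mutex_lock(&fs->freeze_mutex);
	if (!fs->frozen)
		ret = -EINVAL;
	else
		fs->frozen = false;
	pthread_mutex_unlock(&fs->freeze_mutex);
	return ret;
}
```

Whichever path (ioctl or bdev callback) arrives second sees the state
left by the first one and gets an error instead of repeating the work.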


Qu Wenruo (4):
  btrfs: use fs_info as the block device holder
  btrfs: replace fput() with bdev_fput() for block devices
  btrfs: implement a basic per-block-device call backs
  btrfs: add a simple dead device detection mechanism

 fs/btrfs/dev-replace.c |   4 +-
 fs/btrfs/disk-io.c     |  11 +++
 fs/btrfs/fs.h          |  14 +++-
 fs/btrfs/ioctl.c       |   4 +-
 fs/btrfs/super.c       |  27 ++++++--
 fs/btrfs/super.h       |   2 +
 fs/btrfs/volumes.c     | 154 +++++++++++++++++++++++++++++++++++++----
 fs/btrfs/volumes.h     |   6 ++
 8 files changed, 197 insertions(+), 25 deletions(-)

-- 
2.49.0



* [PATCH v2 1/4] btrfs: use fs_info as the block device holder
  2025-06-09  5:19 [PATCH v2 0/4] btrfs: introduce btrfs specific bdev holder ops and implement mark_dead() call back Qu Wenruo
@ 2025-06-09  5:19 ` Qu Wenruo
  2025-06-09  5:19 ` [PATCH v2 2/4] btrfs: replace fput() with bdev_fput() for block devices Qu Wenruo
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Qu Wenruo @ 2025-06-09  5:19 UTC (permalink / raw)
  To: linux-btrfs

Currently btrfs uses "btrfs_fs_type" as the bdev holder for all opened
devices, which means all btrfs filesystems share the same holder value.

That's only fine as long as no blk_holder_ops is provided, but we're
going to implement blk_holder_ops soon, so replace the "btrfs_fs_type"
holder with a proper per-fs fs_info.

This means we can remove btrfs_fs_info::bdev_holder completely.
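
Why a shared holder is not enough once callbacks exist can be shown
with a small userspace model. This is a sketch only: struct model_fs,
struct model_bdev and the model_*() helpers are hypothetical, and
model_fs_type merely plays the role of the single shared holder value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for fs_info and block_device. */
struct model_fs {
	int dead_devices;
};

struct model_bdev {
	void *holder;
};

/* Plays the role of the one holder value shared by every filesystem. */
static int model_fs_type;

static void model_open(struct model_bdev *bdev, void *holder)
{
	bdev->holder = holder;
}

/* A callback only gets the bdev, so the holder is its sole way to the fs. */
static struct model_fs *model_fs_from_bdev(const struct model_bdev *bdev)
{
	if (bdev->holder == NULL || bdev->holder == &model_fs_type)
		return NULL;	/* shared holder: cannot tell which fs owns it */
	return bdev->holder;
}

/* Returns 0 if the owning fs was found and updated, -1 otherwise. */
static int model_mark_dead(struct model_bdev *bdev)
{
	struct model_fs *fs = model_fs_from_bdev(bdev);

	if (!fs)
		return -1;
	fs->dead_devices++;
	return 0;
}
```

With the shared value the callback has nothing to act on, while a
per-fs holder leads straight back to the owning filesystem.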

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/dev-replace.c | 2 +-
 fs/btrfs/fs.h          | 2 --
 fs/btrfs/super.c       | 3 +--
 fs/btrfs/volumes.c     | 4 ++--
 4 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 2decb9fff445..cf63f4b29327 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -250,7 +250,7 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 	}
 
 	bdev_file = bdev_file_open_by_path(device_path, BLK_OPEN_WRITE,
-					fs_info->bdev_holder, NULL);
+					   fs_info, NULL);
 	if (IS_ERR(bdev_file)) {
 		btrfs_err(fs_info, "target device %s is invalid!", device_path);
 		return PTR_ERR(bdev_file);
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index b239e4b8421c..d90304d4e32c 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -715,8 +715,6 @@ struct btrfs_fs_info {
 	u32 data_chunk_allocations;
 	u32 metadata_ratio;
 
-	void *bdev_holder;
-
 	/* Private scrub information */
 	struct mutex scrub_lock;
 	atomic_t scrubs_running;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 2d0d8c6e77b4..c1efd20166cc 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1865,7 +1865,7 @@ static int btrfs_get_tree_super(struct fs_context *fc)
 	fs_devices = device->fs_devices;
 	fs_info->fs_devices = fs_devices;
 
-	ret = btrfs_open_devices(fs_devices, mode, &btrfs_fs_type);
+	ret = btrfs_open_devices(fs_devices, mode, fs_info);
 	mutex_unlock(&uuid_mutex);
 	if (ret)
 		return ret;
@@ -1905,7 +1905,6 @@ static int btrfs_get_tree_super(struct fs_context *fc)
 	} else {
 		snprintf(sb->s_id, sizeof(sb->s_id), "%pg", bdev);
 		shrinker_debugfs_rename(sb->s_shrink, "sb-btrfs:%s", sb->s_id);
-		btrfs_sb(sb)->bdev_holder = &btrfs_fs_type;
 		ret = btrfs_fill_super(sb, fs_devices);
 		if (ret) {
 			deactivate_locked_super(sb);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1535a425e8f9..dae946ee6b07 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2705,7 +2705,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 		return -EROFS;
 
 	bdev_file = bdev_file_open_by_path(device_path, BLK_OPEN_WRITE,
-					fs_info->bdev_holder, NULL);
+					   fs_info, NULL);
 	if (IS_ERR(bdev_file))
 		return PTR_ERR(bdev_file);
 
@@ -7174,7 +7174,7 @@ static struct btrfs_fs_devices *open_seed_devices(struct btrfs_fs_info *fs_info,
 	if (IS_ERR(fs_devices))
 		return fs_devices;
 
-	ret = open_fs_devices(fs_devices, BLK_OPEN_READ, fs_info->bdev_holder);
+	ret = open_fs_devices(fs_devices, BLK_OPEN_READ, fs_info);
 	if (ret) {
 		free_fs_devices(fs_devices);
 		return ERR_PTR(ret);
-- 
2.49.0



* [PATCH v2 2/4] btrfs: replace fput() with bdev_fput() for block devices
  2025-06-09  5:19 [PATCH v2 0/4] btrfs: introduce btrfs specific bdev holder ops and implement mark_dead() call back Qu Wenruo
  2025-06-09  5:19 ` [PATCH v2 1/4] btrfs: use fs_info as the block device holder Qu Wenruo
@ 2025-06-09  5:19 ` Qu Wenruo
  2025-06-09  5:19 ` [PATCH v2 3/4] btrfs: implement a basic per-block-device call backs Qu Wenruo
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Qu Wenruo @ 2025-06-09  5:19 UTC (permalink / raw)
  To: linux-btrfs

The function fput() defers the reclaim of a block device's holder to a
workqueue.

This means that even after btrfs has closed a block device, certain
callbacks on that block device can still grab the out-of-date holder
value.

So far this is not a big deal, because btrfs doesn't support any
blk_holder_ops callbacks and thus there is no real holder usage, but
the situation will soon change.

For the future blk_holder_ops callbacks, we need to ensure that the
holder value of a btrfs block device is always valid as long as btrfs
is still holding it.

So instead of relying on the deferred reclaim, call bdev_fput() to
immediately reclaim the holder.
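
The window described above can be modelled in userspace. This is a
sketch under assumptions, not the kernel implementation: the model_*()
names are hypothetical, a plain array stands in for the deferred work,
and "release" is reduced to clearing the holder pointer.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_DEFERRED 8

struct model_bdev {
	void *holder;
};

/* Stand-in for the deferred release work. */
static struct model_bdev *model_deferred[MODEL_MAX_DEFERRED];
static size_t model_nr_deferred;

static void model_release_holder(struct model_bdev *bdev)
{
	bdev->holder = NULL;
}

/* Deferred put: the holder stays visible until the deferred work runs. */
static void model_fput(struct model_bdev *bdev)
{
	assert(model_nr_deferred < MODEL_MAX_DEFERRED);
	model_deferred[model_nr_deferred++] = bdev;
}

/* Immediate variant: give up the holder now, defer only the rest. */
static void model_bdev_fput(struct model_bdev *bdev)
{
	model_release_holder(bdev);
	model_fput(bdev);
}

/* Runs the deferred work, as a worker would do at some later point. */
static void model_run_deferred(void)
{
	for (size_t i = 0; i < model_nr_deferred; i++)
		model_release_holder(model_deferred[i]);
	model_nr_deferred = 0;
}
```

Between model_fput() and model_run_deferred() a callback could still
read the old holder; with model_bdev_fput() that window does not exist.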

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/dev-replace.c |  2 +-
 fs/btrfs/ioctl.c       |  4 ++--
 fs/btrfs/volumes.c     | 18 +++++++++---------
 3 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index cf63f4b29327..42d795156397 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -327,7 +327,7 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 	return 0;
 
 error:
-	fput(bdev_file);
+	bdev_fput(bdev_file);
 	return ret;
 }
 
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 4eda35bdba71..608a63f07e6b 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2700,7 +2700,7 @@ static long btrfs_ioctl_rm_dev_v2(struct file *file, void __user *arg)
 err_drop:
 	mnt_drop_write_file(file);
 	if (bdev_file)
-		fput(bdev_file);
+		bdev_fput(bdev_file);
 out:
 	btrfs_put_dev_args_from_path(&args);
 	kfree(vol_args);
@@ -2751,7 +2751,7 @@ static long btrfs_ioctl_rm_dev(struct file *file, void __user *arg)
 
 	mnt_drop_write_file(file);
 	if (bdev_file)
-		fput(bdev_file);
+		bdev_fput(bdev_file);
 out:
 	btrfs_put_dev_args_from_path(&args);
 out_free:
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index dae946ee6b07..d17d898fd12b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -488,7 +488,7 @@ btrfs_get_bdev_and_sb(const char *device_path, blk_mode_t flags, void *holder,
 	if (holder) {
 		ret = set_blocksize(*bdev_file, BTRFS_BDEV_BLOCKSIZE);
 		if (ret) {
-			fput(*bdev_file);
+			bdev_fput(*bdev_file);
 			goto error;
 		}
 	}
@@ -496,7 +496,7 @@ btrfs_get_bdev_and_sb(const char *device_path, blk_mode_t flags, void *holder,
 	*disk_super = btrfs_read_disk_super(bdev, 0, false);
 	if (IS_ERR(*disk_super)) {
 		ret = PTR_ERR(*disk_super);
-		fput(*bdev_file);
+		bdev_fput(*bdev_file);
 		goto error;
 	}
 
@@ -720,7 +720,7 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
 
 error_free_page:
 	btrfs_release_disk_super(disk_super);
-	fput(bdev_file);
+	bdev_fput(bdev_file);
 
 	return -EINVAL;
 }
@@ -1070,7 +1070,7 @@ static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices,
 			continue;
 
 		if (device->bdev_file) {
-			fput(device->bdev_file);
+			bdev_fput(device->bdev_file);
 			device->bdev = NULL;
 			device->bdev_file = NULL;
 			fs_devices->open_devices--;
@@ -1117,7 +1117,7 @@ static void btrfs_close_bdev(struct btrfs_device *device)
 		invalidate_bdev(device->bdev);
 	}
 
-	fput(device->bdev_file);
+	bdev_fput(device->bdev_file);
 }
 
 static void btrfs_close_one_device(struct btrfs_device *device)
@@ -1490,7 +1490,7 @@ struct btrfs_device *btrfs_scan_one_device(const char *path, blk_mode_t flags,
 	btrfs_release_disk_super(disk_super);
 
 error_bdev_put:
-	fput(bdev_file);
+	bdev_fput(bdev_file);
 
 	return device;
 }
@@ -2294,7 +2294,7 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
 	 * free the device.
 	 *
 	 * We cannot call btrfs_close_bdev() here because we're holding the sb
-	 * write lock, and fput() on the block device will pull in the
+	 * write lock, and bdev_fput() on the block device will pull in the
 	 * ->open_mutex on the block device and it's dependencies.  Instead
 	 *  just flush the device and let the caller do the final bdev_release.
 	 */
@@ -2473,7 +2473,7 @@ int btrfs_get_dev_args_from_path(struct btrfs_fs_info *fs_info,
 	else
 		memcpy(args->fsid, disk_super->fsid, BTRFS_FSID_SIZE);
 	btrfs_release_disk_super(disk_super);
-	fput(bdev_file);
+	bdev_fput(bdev_file);
 	return 0;
 }
 
@@ -2921,7 +2921,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 error_free_device:
 	btrfs_free_device(device);
 error:
-	fput(bdev_file);
+	bdev_fput(bdev_file);
 	if (locked) {
 		mutex_unlock(&uuid_mutex);
 		up_write(&sb->s_umount);
-- 
2.49.0



* [PATCH v2 3/4] btrfs: implement a basic per-block-device call backs
  2025-06-09  5:19 [PATCH v2 0/4] btrfs: introduce btrfs specific bdev holder ops and implement mark_dead() call back Qu Wenruo
  2025-06-09  5:19 ` [PATCH v2 1/4] btrfs: use fs_info as the block device holder Qu Wenruo
  2025-06-09  5:19 ` [PATCH v2 2/4] btrfs: replace fput() with bdev_fput() for block devices Qu Wenruo
@ 2025-06-09  5:19 ` Qu Wenruo
  2025-06-09  5:19 ` [PATCH v2 4/4] btrfs: add a simple dead device detection mechanism Qu Wenruo
  2025-06-09  5:21 ` [PATCH v2 0/4] btrfs: introduce btrfs specific bdev holder ops and implement mark_dead() call back Christoph Hellwig
  4 siblings, 0 replies; 9+ messages in thread
From: Qu Wenruo @ 2025-06-09  5:19 UTC (permalink / raw)
  To: linux-btrfs

Currently btrfs doesn't implement any per-block-device callbacks, nor
does it utilize the generic single-device ones from fs_holder_ops, as
btrfs is a multi-device fs and doesn't go through the common
setup_bdev_super() path.

For the incoming support of the mark_dead() callback, implement basic
sync/freeze/thaw callbacks first.

Those callbacks are just wrappers around the per-fs ioctl versions, with
extra modifications to handle mount/unmount races:

- Add an atomic_t to record how many bdev callbacks are running
  So that at unmount time we can properly wait for them.

- Add a helper to properly grab the fs_info
  The fs_info stored inside block_device::bd_holder is ensured to be
  valid, as the lifespan of the fs_info covers the bdev:

  fs_info creation                                      fs_info free
  |      |                                    |         |
         btrfs devs open                      btrfs devs close
	 (bdev_file_open_*())                 (bdev_fput())

  But that doesn't mean it's safe to do anything when the fs is not
  yet mounted or is being closed.

  So a helper, get_bdev_fs_info(), is introduced to:

  * Check if the fs_info is properly mounted and not being closed
  * Increase the bdev_ops_running atomic
  * Make fs closing wait for any unfinished bdev operations

  This should help us avoid races against mount/unmount.

- Add a helper to release the fs_info for bdev callbacks
  The helper put_bdev_fs_info() decreases the bdev_ops_running
  atomic and wakes up the waiter.

- Harden unfreeze checks
  Since it's possible to call freeze/unfreeze on two different devices
  of a btrfs at the same time, we may hit a race where two processes
  are trying to unfreeze the same fs at the same time.

  Normally freeze/thaw is just a single bit operation, but for thaw
  btrfs does extra super block checks, and we need to protect those
  checks with the proper device_list_mutex.

  And since it's possible that two processes are running the same thaw
  operation, add an extra mutex and extra checks for freeze/thaw.
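
The get/put pairing described in the list above can be sketched as a
userspace model. Everything here is an assumption made for
illustration: the model_*() names are hypothetical, C11 atomics stand in
for the kernel flag bits and atomic_t, and the wait queue is reduced to
a predicate that the closing side would wait on. The model is
single-threaded and says nothing about memory ordering between the two
sides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant parts of fs_info. */
struct model_fs {
	atomic_bool open;		/* fully mounted */
	atomic_bool closing;		/* unmount has started */
	atomic_int bdev_ops_running;	/* callbacks currently inside the fs */
};

static void model_fs_init(struct model_fs *fs)
{
	atomic_init(&fs->open, false);
	atomic_init(&fs->closing, false);
	atomic_init(&fs->bdev_ops_running, 0);
}

/* Returns the fs with the counter raised, or NULL if it must not be used. */
static struct model_fs *model_get_fs(struct model_fs *holder)
{
	if (!holder)
		return NULL;
	if (!atomic_load(&holder->open) || atomic_load(&holder->closing))
		return NULL;
	atomic_fetch_add(&holder->bdev_ops_running, 1);
	return holder;
}

/* Drops the reference; the kernel version also wakes up the waiter. */
static void model_put_fs(struct model_fs *fs)
{
	if (!fs)
		return;
	int old = atomic_fetch_sub(&fs->bdev_ops_running, 1);
	assert(old > 0);
}

/* Closing side: reject new callbacks, then report whether it may go on. */
static bool model_close_may_proceed(struct model_fs *fs)
{
	atomic_store(&fs->closing, true);
	return atomic_load(&fs->bdev_ops_running) == 0;
}
```

A callback that got the fs before closing started keeps the close path
waiting until it calls the put helper; later callbacks are turned away.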

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/dev-replace.c |  2 +-
 fs/btrfs/disk-io.c     | 11 ++++++
 fs/btrfs/fs.h          | 12 ++++++
 fs/btrfs/super.c       | 24 +++++++++---
 fs/btrfs/super.h       |  2 +
 fs/btrfs/volumes.c     | 86 +++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/volumes.h     |  1 +
 7 files changed, 130 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 42d795156397..81091e22478c 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -250,7 +250,7 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 	}
 
 	bdev_file = bdev_file_open_by_path(device_path, BLK_OPEN_WRITE,
-					   fs_info, NULL);
+					   fs_info, &btrfs_bdev_ops);
 	if (IS_ERR(bdev_file)) {
 		btrfs_err(fs_info, "target device %s is invalid!", device_path);
 		return PTR_ERR(bdev_file);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3def93016963..a71ea9eb5646 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2843,6 +2843,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 	atomic_set(&fs_info->async_delalloc_pages, 0);
 	atomic_set(&fs_info->defrag_running, 0);
 	atomic_set(&fs_info->nr_delayed_iputs, 0);
+	atomic_set(&fs_info->bdev_ops_running, 0);
 	atomic64_set(&fs_info->tree_mod_seq, 0);
 	fs_info->global_root_tree = RB_ROOT;
 	fs_info->max_inline = BTRFS_DEFAULT_MAX_INLINE;
@@ -2876,6 +2877,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 	mutex_init(&fs_info->transaction_kthread_mutex);
 	mutex_init(&fs_info->cleaner_mutex);
 	mutex_init(&fs_info->ro_block_group_mutex);
+	mutex_init(&fs_info->freeze_mutex);
 	init_rwsem(&fs_info->commit_root_sem);
 	init_rwsem(&fs_info->cleanup_work_sem);
 	init_rwsem(&fs_info->subvol_sem);
@@ -2893,6 +2895,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 	init_waitqueue_head(&fs_info->transaction_blocked_wait);
 	init_waitqueue_head(&fs_info->async_submit_wait);
 	init_waitqueue_head(&fs_info->delayed_iputs_wait);
+	init_waitqueue_head(&fs_info->bdev_ops_wait);
 
 	/* Usable values until the real ones are cached from the superblock */
 	fs_info->nodesize = 4096;
@@ -4197,6 +4200,14 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
 
 	set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
 
+	/*
+	 * Wait for any bdev operations first.
+	 * After setting the above CLOSING_START bit, we will no longer
+	 * accept any new bdev operations.
+	 */
+	wait_event(fs_info->bdev_ops_wait,
+		   (atomic_read(&fs_info->bdev_ops_running) == 0));
+
 	/*
 	 * If we had UNFINISHED_DROPS we could still be processing them, so
 	 * clear that bit and wake up relocation so it can stop.
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index d90304d4e32c..4c8df5f4ba1a 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -541,6 +541,14 @@ struct btrfs_fs_info {
 	struct mutex transaction_kthread_mutex;
 	struct mutex cleaner_mutex;
 	struct mutex chunk_mutex;
+	/*
+	 * Serialize freeze/unfreeze operations.
+	 *
+	 * Freeze/thaw is shared by not only the per-fs freeze/thaw but also
+	 * per-bdev callbacks, thus need a unified mutex inside btrfs to handle
+	 * per-fs and per-bdev races correctly.
+	 */
+	struct mutex freeze_mutex;
 
 	/*
 	 * This is taken to make sure we don't set block groups ro after the
@@ -690,6 +698,10 @@ struct btrfs_fs_info {
 	struct rb_root defrag_inodes;
 	atomic_t defrag_running;
 
+	/* For per-block-device callbacks.*/
+	atomic_t bdev_ops_running;
+	wait_queue_head_t bdev_ops_wait;
+
 	/* Used to protect avail_{data, metadata, system}_alloc_bits */
 	seqlock_t profiles_lock;
 	/*
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index c1efd20166cc..a847d64803cb 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2273,11 +2273,19 @@ static long btrfs_control_ioctl(struct file *file, unsigned int cmd,
 	return ret;
 }
 
-static int btrfs_freeze(struct super_block *sb)
+int btrfs_freeze(struct super_block *sb)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(sb);
 
+	mutex_lock(&fs_info->freeze_mutex);
+	if (test_bit(BTRFS_FS_FROZEN, &fs_info->flags)) {
+		mutex_unlock(&fs_info->freeze_mutex);
+		return -EBUSY;
+	}
+
+
 	set_bit(BTRFS_FS_FROZEN, &fs_info->flags);
+	mutex_unlock(&fs_info->freeze_mutex);
 	/*
 	 * We don't need a barrier here, we'll wait for any transaction that
 	 * could be in progress on other threads (and do delayed iputs that
@@ -2339,20 +2347,24 @@ static int check_dev_super(struct btrfs_device *dev)
 	return ret;
 }
 
-static int btrfs_unfreeze(struct super_block *sb)
+int btrfs_unfreeze(struct super_block *sb)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(sb);
 	struct btrfs_device *device;
 	int ret = 0;
 
+	mutex_lock(&fs_info->freeze_mutex);
+	if (!test_bit(BTRFS_FS_FROZEN, &fs_info->flags)) {
+		mutex_unlock(&fs_info->freeze_mutex);
+		return -EINVAL;
+	}
+
 	/*
 	 * Make sure the fs is not changed by accident (like hibernation then
 	 * modified by other OS).
 	 * If we found anything wrong, we mark the fs error immediately.
-	 *
-	 * And since the fs is frozen, no one can modify the fs yet, thus
-	 * we don't need to hold device_list_mutex.
 	 */
+	mutex_lock(&fs_info->fs_devices->device_list_mutex);
 	list_for_each_entry(device, &fs_info->fs_devices->devices, dev_list) {
 		ret = check_dev_super(device);
 		if (ret < 0) {
@@ -2362,7 +2374,9 @@ static int btrfs_unfreeze(struct super_block *sb)
 			break;
 		}
 	}
+	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
 	clear_bit(BTRFS_FS_FROZEN, &fs_info->flags);
+	mutex_unlock(&fs_info->freeze_mutex);
 
 	/*
 	 * We still return 0, to allow VFS layer to unfreeze the fs even the
diff --git a/fs/btrfs/super.h b/fs/btrfs/super.h
index d80a86acfbbe..1d4a029a6042 100644
--- a/fs/btrfs/super.h
+++ b/fs/btrfs/super.h
@@ -14,6 +14,8 @@ bool btrfs_check_options(const struct btrfs_fs_info *info,
 			 unsigned long long *mount_opt,
 			 unsigned long flags);
 int btrfs_sync_fs(struct super_block *sb, int wait);
+int btrfs_freeze(struct super_block *sb);
+int btrfs_unfreeze(struct super_block *sb);
 char *btrfs_get_subvol_name_from_objectid(struct btrfs_fs_info *fs_info,
 					  u64 subvol_objectid);
 void btrfs_set_free_space_cache_settings(struct btrfs_fs_info *fs_info);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d17d898fd12b..d7cfc883c834 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -256,6 +256,88 @@ void btrfs_describe_block_groups(u64 bg_flags, char *buf, u32 size_buf)
 out_overflow:;
 }
 
+static struct btrfs_fs_info *get_bdev_fs_info(struct block_device *bdev)
+	__releases(&bdev->bd_holder_lock)
+{
+	struct btrfs_fs_info *fs_info = bdev->bd_holder;
+
+	if (!fs_info)
+		goto out;
+
+	/*
+	 * The fs_info's lifespan is ensured to cover the lifespan of an opened
+	 * bdev, so we are safe to access the fs_info.
+	 */
+	if (!test_bit(BTRFS_FS_OPEN, &fs_info->flags) ||
+	    btrfs_fs_closing(fs_info)) {
+		fs_info = NULL;
+		goto out;
+	}
+	atomic_inc(&fs_info->bdev_ops_running);
+
+out:
+	mutex_unlock(&bdev->bd_holder_lock);
+	return fs_info;
+}
+
+static void put_bdev_fs_info(struct btrfs_fs_info *fs_info)
+{
+	if (!fs_info)
+		return;
+	if (atomic_dec_and_test(&fs_info->bdev_ops_running))
+		wake_up(&fs_info->bdev_ops_wait);
+}
+
+static void btrfs_bdev_sync(struct block_device *bdev)
+{
+	struct btrfs_fs_info *fs_info = get_bdev_fs_info(bdev);
+	int ret;
+
+	if (!fs_info)
+		goto out;
+	ret = btrfs_start_delalloc_roots(fs_info, LONG_MAX, false);
+	if (ret)
+		goto out;
+	btrfs_sync_fs(fs_info->sb, 1);
+	wake_up_process(fs_info->cleaner_kthread);
+out:
+	put_bdev_fs_info(fs_info);
+}
+
+static int btrfs_bdev_freeze(struct block_device *bdev)
+{
+	struct btrfs_fs_info *fs_info = get_bdev_fs_info(bdev);
+	int ret = 0;
+
+	lockdep_assert_held(&bdev->bd_fsfreeze_mutex);
+	if (!fs_info)
+		goto out;
+	ret = btrfs_freeze(fs_info->sb);
+out:
+	put_bdev_fs_info(fs_info);
+	return ret;
+}
+
+static int btrfs_bdev_unfreeze(struct block_device *bdev)
+{
+	struct btrfs_fs_info *fs_info = get_bdev_fs_info(bdev);
+	int ret = 0;
+
+	lockdep_assert_held(&bdev->bd_fsfreeze_mutex);
+	if (!fs_info)
+		goto out;
+	ret = btrfs_unfreeze(fs_info->sb);
+out:
+	put_bdev_fs_info(fs_info);
+	return ret;
+}
+
+const struct blk_holder_ops btrfs_bdev_ops = {
+	.sync = btrfs_bdev_sync,
+	.freeze = btrfs_bdev_freeze,
+	.thaw = btrfs_bdev_unfreeze,
+};
+
 static int init_first_rw_device(struct btrfs_trans_handle *trans);
 static int btrfs_relocate_sys_chunks(struct btrfs_fs_info *fs_info);
 static void btrfs_dev_stat_print_on_load(struct btrfs_device *device);
@@ -473,7 +555,7 @@ btrfs_get_bdev_and_sb(const char *device_path, blk_mode_t flags, void *holder,
 	struct block_device *bdev;
 	int ret;
 
-	*bdev_file = bdev_file_open_by_path(device_path, flags, holder, NULL);
+	*bdev_file = bdev_file_open_by_path(device_path, flags, holder, &btrfs_bdev_ops);
 
 	if (IS_ERR(*bdev_file)) {
 		ret = PTR_ERR(*bdev_file);
@@ -2705,7 +2787,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 		return -EROFS;
 
 	bdev_file = bdev_file_open_by_path(device_path, BLK_OPEN_WRITE,
-					   fs_info, NULL);
+					   fs_info, &btrfs_bdev_ops);
 	if (IS_ERR(bdev_file))
 		return PTR_ERR(bdev_file);
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 6d8b1f38e3ee..b4b8adc80e52 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -607,6 +607,7 @@ struct btrfs_raid_attr {
 };
 
 extern const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES];
+extern const struct blk_holder_ops btrfs_bdev_ops;
 
 struct btrfs_chunk_map {
 	struct rb_node rb_node;
-- 
2.49.0



* [PATCH v2 4/4] btrfs: add a simple dead device detection mechanism
  2025-06-09  5:19 [PATCH v2 0/4] btrfs: introduce btrfs specific bdev holder ops and implement mark_dead() call back Qu Wenruo
                   ` (2 preceding siblings ...)
  2025-06-09  5:19 ` [PATCH v2 3/4] btrfs: implement a basic per-block-device call backs Qu Wenruo
@ 2025-06-09  5:19 ` Qu Wenruo
  2025-06-09  5:21 ` [PATCH v2 0/4] btrfs: introduce btrfs specific bdev holder ops and implement mark_dead() call back Christoph Hellwig
  4 siblings, 0 replies; 9+ messages in thread
From: Qu Wenruo @ 2025-06-09  5:19 UTC (permalink / raw)
  To: linux-btrfs

Currently btrfs only detects missing devices at mount time, and lacks a
way to detect a dead device at runtime.

This makes btrfs treat an intentionally or unintentionally removed
device as if it were still present, making test case generic/730 fail as
btrfs still returns the cached data from the page cache.
(The root cause is that btrfs has no shutdown support for test cases
requiring shutdown)

Add a very basic and simple dead device detection mechanism for btrfs,
which includes:

- Output an info/warning message
  If the device is not removed by surprise, the log level is info;
  otherwise it is warning.

  The message includes the device id and device path.

- Mark that device as missing and mark the fs degraded

- If the fs cannot maintain RW operations, mark the fs as error
  So the fs is fully read-only, preventing further corruption.
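
The decision made for a dead device can be sketched as a userspace
model. This is a simplification under assumptions: the model_*() names
are hypothetical, and a single tolerated_failures number replaces the
per-chunk check that btrfs_check_rw_degradable() performs. The sketch
also assumes each device is reported dead only once.

```c
#include <assert.h>
#include <stdbool.h>

enum model_fs_state { MODEL_FS_OK, MODEL_FS_DEGRADED, MODEL_FS_ERROR };

struct model_dev {
	bool missing;
	bool writeable;
};

struct model_fs {
	int rw_devices;
	int missing_devices;
	int tolerated_failures;	/* assumed: e.g. 1 for a two-device mirror */
	enum model_fs_state state;
};

static void model_mark_dead(struct model_fs *fs, struct model_dev *dev)
{
	assert(!dev->missing);	/* assumption: reported once per device */

	dev->missing = true;
	fs->missing_devices++;
	if (dev->writeable) {
		dev->writeable = false;
		fs->rw_devices--;
	}

	/* Too many devices gone: flip to error, otherwise run degraded. */
	if (fs->missing_devices > fs->tolerated_failures)
		fs->state = MODEL_FS_ERROR;
	else
		fs->state = MODEL_FS_DEGRADED;
}
```

In this model a two-device mirror survives the first loss as degraded
and turns into an error on the second, while a single-device fs goes to
error right away, which matches the generic/730 case.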

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h |  5 +++++
 2 files changed, 53 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d7cfc883c834..f3b3bb652cfc 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -332,7 +332,52 @@ static int btrfs_bdev_unfreeze(struct block_device *bdev)
 	return ret;
 }
 
+static void btrfs_bdev_mark_dead(struct block_device *bdev, bool surprise)
+{
+	struct btrfs_fs_info *fs_info = get_bdev_fs_info(bdev);
+	struct btrfs_dev_lookup_args args = { .devt = bdev->bd_dev, };
+	struct btrfs_device *device;
+
+	if (!fs_info)
+		return;
+
+	mutex_lock(&fs_info->fs_devices->device_list_mutex);
+	device = btrfs_find_device(fs_info->fs_devices, &args);
+	if (unlikely(!device)) {
+		btrfs_crit(fs_info, "can't find a btrfs_device for %pg", bdev);
+		goto out;
+	}
+	if (surprise)
+		btrfs_warn_in_rcu(fs_info, "devid %llu device %pg path %s is dead",
+				  device->devid, device->bdev, btrfs_dev_name(device));
+	else
+		btrfs_info_in_rcu(fs_info, "devid %llu device %pg path %s is going to be removed",
+				  device->devid, device->bdev, btrfs_dev_name(device));
+	set_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state);
+	device->fs_devices->missing_devices++;
+	if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
+		list_del_init(&device->dev_alloc_list);
+		clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
+		device->fs_devices->rw_devices--;
+	}
+	/*
+	 * If we can no longer maintain the RW operations for the fs, mark the
+	 * fs error.
+	 */
+	if (!btrfs_check_rw_degradable(fs_info, device)) {
+		btrfs_handle_fs_error(fs_info, -EIO,
+			"btrfs can no longer maintain read-write due to missing device(s)");
+	} else  {
+		btrfs_set_opt(fs_info->mount_opt, DEGRADED);
+		btrfs_warn(fs_info, "filesystem degraded due to missing device(s)");
+	}
+out:
+	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+	put_bdev_fs_info(fs_info);
+}
+
 const struct blk_holder_ops btrfs_bdev_ops = {
+	.mark_dead = btrfs_bdev_mark_dead,
 	.sync = btrfs_bdev_sync,
 	.freeze = btrfs_bdev_freeze,
 	.thaw = btrfs_bdev_unfreeze,
@@ -6874,6 +6919,9 @@ static bool dev_args_match_fs_devices(const struct btrfs_dev_lookup_args *args,
 static bool dev_args_match_device(const struct btrfs_dev_lookup_args *args,
 				  const struct btrfs_device *device)
 {
+	if (args->devt)
+		return device->devt == args->devt;
+
 	if (args->missing) {
 		if (test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state) &&
 		    !device->bdev)
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index b4b8adc80e52..e2c453e230a0 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -651,6 +651,11 @@ struct btrfs_balance_control {
  */
 struct btrfs_dev_lookup_args {
 	u64 devid;
+	/*
+	 * If @devt is set (non-zero), then other args will be ignored since the
+	 * non-zero dev_t can locate the device uniquely.
+	 */
+	dev_t devt;
 	u8 *uuid;
 	u8 *fsid;
 	bool missing;
-- 
2.49.0



* Re: [PATCH v2 0/4] btrfs: introduce btrfs specific bdev holder ops and implement mark_dead() call back
  2025-06-09  5:19 [PATCH v2 0/4] btrfs: introduce btrfs specific bdev holder ops and implement mark_dead() call back Qu Wenruo
                   ` (3 preceding siblings ...)
  2025-06-09  5:19 ` [PATCH v2 4/4] btrfs: add a simple dead device detection mechanism Qu Wenruo
@ 2025-06-09  5:21 ` Christoph Hellwig
  2025-06-09  5:31   ` Qu Wenruo
  4 siblings, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2025-06-09  5:21 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

No full review yet, but I think in the long run your maintenance
burden will be a lot lower if you implement my suggestion of using
the generic code and adding a new devloss super_operation.

This might require resurrecting my old holder cleanup that Johannes
reposted about a year ago.


* Re: [PATCH v2 0/4] btrfs: introduce btrfs specific bdev holder ops and implement mark_dead() call back
  2025-06-09  5:21 ` [PATCH v2 0/4] btrfs: introduce btrfs specific bdev holder ops and implement mark_dead() call back Christoph Hellwig
@ 2025-06-09  5:31   ` Qu Wenruo
  2025-06-09  5:38     ` Christoph Hellwig
  0 siblings, 1 reply; 9+ messages in thread
From: Qu Wenruo @ 2025-06-09  5:31 UTC (permalink / raw)
  To: Christoph Hellwig, Qu Wenruo; +Cc: linux-btrfs

On 2025/6/9 14:51, Christoph Hellwig wrote:
> No full review yet, but I think in the long run your maintenance
> burden will be a lot lower if you implement my suggestion of using
> the generic code and adding a new devloss super_operation.

The main problem here is that we don't go through setup_bdev_super() at
all, and the super_block structure itself only supports one bdev.

Thus even if we implement a devloss call back in super ops, it will
still require quite some extra work to make btrfs go through
setup_bdev_super().

Although I have to admit, if all btrfs bdevs go through fs_holder_ops, 
it indeed solves a lot of extra races more easily (freeze ioctl vs bdev 
freeze call back races).

> 
> This might require resurrecting my old holder cleanup that Johannes
> reposted about a year ago.
>
Maybe it's time to revive that series. Would you mind sharing the link
to it?

Thanks,
Qu


* Re: [PATCH v2 0/4] btrfs: introduce btrfs specific bdev holder ops and implement mark_dead() call back
  2025-06-09  5:31   ` Qu Wenruo
@ 2025-06-09  5:38     ` Christoph Hellwig
  2025-06-09  6:27       ` Qu Wenruo
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2025-06-09  5:38 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs

On Mon, Jun 09, 2025 at 03:01:32PM +0930, Qu Wenruo wrote:
> 
> 
> 在 2025/6/9 14:51, Christoph Hellwig 写道:
> > No full reivew yet, but I think in the long run your maintainance
> > burdern will be a lot lower if you implement my suggestion of using
> > the generic code and adding a new devloss super_uperation.
> 
> The main problem here is that btrfs doesn't go through
> setup_bdev_super() at all, and the super_block structure itself only
> supports one bdev.
> 
> Thus even if we implement a devloss callback in super ops, it will
> still require quite a lot of extra work to make btrfs go through
> setup_bdev_super().

Why do you need setup_bdev_super()?  Everything relevant is already
open coded in btrfs; you just need to use fs_holder_ops and ensure
the sb is stored as the holder of every block device.

The other nice thing is that you can also stage the changes, i.e.
first resurrect the old holder cleanups, then support ->shutdown,
then add the new ->devloss callback to not shut down the entire file
system if there is enough redundancy.
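
To make the staging above concrete, here is a minimal userspace C sketch,
not kernel code. Every model_* name is hypothetical, ->devloss is only the
callback proposed in this thread, and the min_devices redundancy check is
an assumption made purely for illustration. It models a block device whose
holder is the sb, plus a generic mark_dead path that tries ->devloss first
and falls back to ->shutdown.

```c
/* Userspace model only: all model_* names are hypothetical, not kernel API. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_sb;

/* Stand-in for super_operations; ->devloss is the proposed hook. */
struct model_super_ops {
	void (*shutdown)(struct model_sb *sb);
	/* Returns true if the fs can keep running without the device. */
	bool (*devloss)(struct model_sb *sb, int devid);
};

struct model_sb {
	const struct model_super_ops *ops;
	int nr_devices;		/* devices still alive */
	int min_devices;	/* assumed minimum needed to keep running */
	bool shut_down;
};

struct model_bdev {
	int devid;
	void *holder;		/* the sb, stored as holder at open time */
	bool dead;
};

/* Open a device on behalf of a filesystem: the sb becomes the holder. */
static void model_open_bdev(struct model_bdev *bdev, int devid,
			    struct model_sb *sb)
{
	bdev->devid = devid;
	bdev->holder = sb;
	bdev->dead = false;
	sb->nr_devices++;
}

/* Generic mark_dead path: map holder -> sb, then pick the callback. */
static void model_mark_dead(struct model_bdev *bdev)
{
	struct model_sb *sb = bdev->holder;

	assert(sb != NULL);
	if (bdev->dead)
		return;		/* already handled, nothing more to do */
	bdev->dead = true;

	/* Stage three: let the fs decide whether it survives the loss. */
	if (sb->ops->devloss && sb->ops->devloss(sb, bdev->devid))
		return;
	/* Stage two (and fallback): shut the whole filesystem down. */
	if (sb->ops->shutdown)
		sb->ops->shutdown(sb);
}

static void model_shutdown(struct model_sb *sb)
{
	sb->shut_down = true;
}

static bool model_devloss(struct model_sb *sb, int devid)
{
	(void)devid;
	sb->nr_devices--;
	return sb->nr_devices >= sb->min_devices;
}

/* Stage two: only ->shutdown is wired up. */
static const struct model_super_ops model_stage2_ops = {
	.shutdown = model_shutdown,
};

/* Stage three: ->devloss is tried before ->shutdown. */
static const struct model_super_ops model_stage3_ops = {
	.shutdown = model_shutdown,
	.devloss = model_devloss,
};
```

With model_stage2_ops, losing any device shuts the modelled filesystem
down. With model_stage3_ops, a two-device filesystem with min_devices = 1
survives the first loss and shuts down only on the second.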

> Although I have to admit, if all btrfs bdevs go through fs_holder_ops,
> it would indeed solve a lot of the extra races more easily (e.g. the
> freeze ioctl vs. bdev freeze callback race).
> 
> > 
> > This might require resurrecting my old holder cleanup that Johannes
> > reposted about a year ago.
> > 
> Maybe it's time to revive that series. Would you mind sharing the link
> to it?

My original posting:

https://lore.kernel.org/linux-btrfs/b083ae24-2273-479f-8c9e-96cb9ef083b8@wdc.com/

Rebase from Johannes:

https://lore.kernel.org/linux-btrfs/20240214-hch-device-open-v1-0-b153428b4f72@wdc.com/



* Re: [PATCH v2 0/4] btrfs: introduce btrfs specific bdev holder ops and implement mark_dead() call back
  2025-06-09  5:38     ` Christoph Hellwig
@ 2025-06-09  6:27       ` Qu Wenruo
  0 siblings, 0 replies; 9+ messages in thread
From: Qu Wenruo @ 2025-06-09  6:27 UTC (permalink / raw)
  To: Christoph Hellwig, Qu Wenruo; +Cc: linux-btrfs



On 2025/6/9 15:08, Christoph Hellwig wrote:
> On Mon, Jun 09, 2025 at 03:01:32PM +0930, Qu Wenruo wrote:
>>
>>
>> On 2025/6/9 14:51, Christoph Hellwig wrote:
>>> No full review yet, but I think in the long run your maintenance
>>> burden will be a lot lower if you implement my suggestion of using
>>> the generic code and adding a new devloss super_operation.
>>
>> The main problem here is that btrfs doesn't go through
>> setup_bdev_super() at all, and the super_block structure itself only
>> supports one bdev.
>>
>> Thus even if we implement a devloss callback in super ops, it will
>> still require quite a lot of extra work to make btrfs go through
>> setup_bdev_super().
> 
> Why do you need setup_bdev_super?  Everything relevant is already
> open coded in btrfs, you'll just need to use fs_holder_ops and ensure
> the sb is stored as holder in every block device.
> 
> The other nice thing is that you can also stage the changes, i.e.
> first resurrect the old holder cleanups, then support ->shutdown,
> then add the new ->devloss callback to not shut down the entire file
> system if there is enough redundancy.
> 
>> Although I have to admit, if all btrfs bdevs go through fs_holder_ops,
>> it would indeed solve a lot of the extra races more easily (e.g. the
>> freeze ioctl vs. bdev freeze callback race).
>>
>>>
>>> This might require resurrecting my old holder cleanup that Johannes
>>> reposted about a year ago.
>>>
>> Maybe it's time to revive that series. Would you mind sharing the link
>> to it?
> 
> My original posting:
> 
> https://lore.kernel.org/linux-btrfs/b083ae24-2273-479f-8c9e-96cb9ef083b8@wdc.com/
> 
> Rebase from Johannes:
> 
> https://lore.kernel.org/linux-btrfs/20240214-hch-device-open-v1-0-b153428b4f72@wdc.com/
> 

Thanks a lot, I'll give that series a review and rebase.

It would be great if we didn't need to introduce any extra per-device
freeze/thaw serialization inside btrfs.

Thanks,
Qu
