* [PATCH RFC 0/2] btrfs: do not poke into bdev's page cache
@ 2025-07-08 9:06 Qu Wenruo
2025-07-08 9:06 ` [PATCH RFC 1/2] btrfs: use bdev_rw_virt() to read and scratch the disk super block Qu Wenruo
` (3 more replies)
0 siblings, 4 replies; 9+ messages in thread
From: Qu Wenruo @ 2025-07-08 9:06 UTC (permalink / raw)
To: linux-btrfs
[ABUSE OF BDEV'S PAGE CACHE]
Btrfs has a long history using bdev's page cache for super block IOs.
This looks weird, but is mostly for the sake of concurrency.
However this has already caused problems, for example when the block
layer page cache enables large folio support, it triggers an ASSERT()
inside btrfs, this is fixed by commit 65f2a3b2323e ("btrfs: remove folio
order ASSERT()s in super block writeback path"), but it is already a
warning.
[MOVEING AWAY FROM BDEV'S PAGE CACHE]
Thankfully we're moving away from the bdev's page cache already, starting
with commit bc00965dbff7 ("btrfs: count super block write errors in
device instead of tracking folio error state"), we no longer relies on
page cache to detect super block IO errors.
We still have the following paths using bdev's page cache, and those
points will be addressed in this series:
- Reading super blocks
This is the easist one to kill, just kmalloc() and bdev_rw_virt() will
handle it well.
- Scratching super blocks
Previously we just zero out the magic, but leaving everything else
there.
We rely on the block layer to write the involved blocks.
Here we follow btrfs_read_disk_super() by kzalloc()ing a dummy super
block, and write the full super block back to disk.
- Writing super blocks
Although write_dev_supers() is alreadying using the bio interface, it
still relies on the bdev's page cache.
One of the reason is, we want to submit all super blocks of a device
in one go, and each super block of the same block device is slightly
different, thus we go using page cache, so that each super block can
have its own backing folio.
Here we solve it by pre-allocating super block buffers.
This also makes endio function much simpler, no need to iterate the
bio to unlock the folio.
- Waiting super blocks
Instead of locking the folio to make sure its IO is done, just use an
atomic and wait queue head to do it the usual way.
By this we solve the problem and all IOs are done using bio interface.
[THE COST AND REASON FOR RFC]
But this brings some overhead, thus I marked the series RFC:
- Extra 12K memory usage for each block device
I hope the extra cost is acceptable for modern day systems.
- Extra memory copy for super block writeback
Previously we do the copy into the bdev's page cache, then submit the
IO using folio from the bdev page cache.
This updates the page cache and do the IO in one go.
But now we memcpy() into the preallocated super block buffer, not
updating the bdev's page cache directly.
If by somehow the block device drive determines to copy the bio's
content to page cache, it will need to do one extra memory copy.
- Extra memory allocation for btrfs_scratch_superblock()
Previously we need no memory allocation, thus no error handling
needed.
But now we need extra memory allocation, and such allocation is just
to write zero into block devices. Thus the cost is a little hard to
accept.
- No more cached super block during device scan
But the cost should be minimal.
Qu Wenruo (2):
btrfs: use bdev_rw_virt() to read and scratch the disk super block
btrfs: do not poke into bdev's page cache for super block write
fs/btrfs/disk-io.c | 76 ++++++++++++++--------------------------------
fs/btrfs/volumes.c | 59 ++++++++++++++++++++---------------
fs/btrfs/volumes.h | 11 ++++++-
3 files changed, 67 insertions(+), 79 deletions(-)
--
2.50.0
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH RFC 1/2] btrfs: use bdev_rw_virt() to read and scratch the disk super block
2025-07-08 9:06 [PATCH RFC 0/2] btrfs: do not poke into bdev's page cache Qu Wenruo
@ 2025-07-08 9:06 ` Qu Wenruo
2025-07-08 15:30 ` David Sterba
2025-07-08 9:06 ` [PATCH RFC 2/2] btrfs: do not poke into bdev's page cache for super block write Qu Wenruo
` (2 subsequent siblings)
3 siblings, 1 reply; 9+ messages in thread
From: Qu Wenruo @ 2025-07-08 9:06 UTC (permalink / raw)
To: linux-btrfs
Currently we're using the block device page cache to read and scratch
the super block.
This makes btrfs to poking into lower layer unnecessarily.
Change both btrfs_read_disk_super() and btrfs_scratch_superblock() to
kmalloc() + bdev_rw_virt() to do the read and write.
This allows btrfs_release_disk_super() to be a simple wrapper of
kfree().
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/volumes.c | 46 +++++++++++++++++++++-------------------------
fs/btrfs/volumes.h | 5 ++++-
2 files changed, 25 insertions(+), 26 deletions(-)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 31aecd291d6c..47f11e3c4a98 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1326,18 +1326,10 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
return ret;
}
-void btrfs_release_disk_super(struct btrfs_super_block *super)
-{
- struct page *page = virt_to_page(super);
-
- put_page(page);
-}
-
struct btrfs_super_block *btrfs_read_disk_super(struct block_device *bdev,
int copy_num, bool drop_cache)
{
struct btrfs_super_block *super;
- struct page *page;
u64 bytenr, bytenr_orig;
struct address_space *mapping = bdev->bd_mapping;
int ret;
@@ -1362,14 +1354,19 @@ struct btrfs_super_block *btrfs_read_disk_super(struct block_device *bdev,
* always read from the device.
*/
invalidate_inode_pages2_range(mapping, bytenr >> PAGE_SHIFT,
- (bytenr + BTRFS_SUPER_INFO_SIZE) >> PAGE_SHIFT);
+ (bytenr + BTRFS_SUPER_INFO_SIZE) >> PAGE_SHIFT);
}
- page = read_cache_page_gfp(mapping, bytenr >> PAGE_SHIFT, GFP_NOFS);
- if (IS_ERR(page))
- return ERR_CAST(page);
+ super = kmalloc(BTRFS_SUPER_INFO_SIZE, GFP_KERNEL);
+ if (!super)
+ return ERR_PTR(-ENOMEM);
+ ret = bdev_rw_virt(bdev, bytenr >> SECTOR_SHIFT, super, BTRFS_SUPER_INFO_SIZE,
+ REQ_OP_READ);
+ if (ret < 0) {
+ btrfs_release_disk_super(super);
+ return ERR_PTR(ret);
+ }
- super = page_address(page);
if (btrfs_super_magic(super) != BTRFS_MAGIC ||
btrfs_super_bytenr(super) != bytenr_orig) {
btrfs_release_disk_super(super);
@@ -2134,21 +2131,20 @@ static u64 btrfs_num_devices(struct btrfs_fs_info *fs_info)
static void btrfs_scratch_superblock(struct btrfs_fs_info *fs_info,
struct block_device *bdev, int copy_num)
{
- struct btrfs_super_block *disk_super;
- const size_t len = sizeof(disk_super->magic);
+ struct btrfs_super_block *super;
const u64 bytenr = btrfs_sb_offset(copy_num);
int ret;
- disk_super = btrfs_read_disk_super(bdev, copy_num, false);
- if (IS_ERR(disk_super))
- return;
-
- memset(&disk_super->magic, 0, len);
- folio_mark_dirty(virt_to_folio(disk_super));
- btrfs_release_disk_super(disk_super);
-
- ret = sync_blockdev_range(bdev, bytenr, bytenr + len - 1);
- if (ret)
+ super = kzalloc(BTRFS_SUPER_INFO_SIZE, GFP_KERNEL);
+ if (!super) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ ret = bdev_rw_virt(bdev, bytenr >> SECTOR_SHIFT, super,
+ BTRFS_SUPER_INFO_SIZE, REQ_OP_WRITE);
+out:
+ btrfs_release_disk_super(super);
+ if (ret < 0)
btrfs_warn(fs_info, "error clearing superblock number %d (%d)",
copy_num, ret);
}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 7395cb5e1238..d3b65f66a691 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -799,7 +799,10 @@ struct btrfs_chunk_map *btrfs_get_chunk_map(struct btrfs_fs_info *fs_info,
void btrfs_remove_chunk_map(struct btrfs_fs_info *fs_info, struct btrfs_chunk_map *map);
struct btrfs_super_block *btrfs_read_disk_super(struct block_device *bdev,
int copy_num, bool drop_cache);
-void btrfs_release_disk_super(struct btrfs_super_block *super);
+static inline void btrfs_release_disk_super(struct btrfs_super_block *super)
+{
+ kfree(super);
+}
static inline void btrfs_dev_stat_inc(struct btrfs_device *dev,
int index)
--
2.50.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH RFC 2/2] btrfs: do not poke into bdev's page cache for super block write
2025-07-08 9:06 [PATCH RFC 0/2] btrfs: do not poke into bdev's page cache Qu Wenruo
2025-07-08 9:06 ` [PATCH RFC 1/2] btrfs: use bdev_rw_virt() to read and scratch the disk super block Qu Wenruo
@ 2025-07-08 9:06 ` Qu Wenruo
2025-07-08 10:35 ` [PATCH RFC 0/2] btrfs: do not poke into bdev's page cache Johannes Thumshirn
2025-07-08 15:27 ` David Sterba
3 siblings, 0 replies; 9+ messages in thread
From: Qu Wenruo @ 2025-07-08 9:06 UTC (permalink / raw)
To: linux-btrfs
[EXISTING BEHAVIOR]
Currently btrfs is utilizing block device's page cache to write super
blocks.
This has a long history, dating back to early days when we relies on the
page's flags to determine if the IO is done or failed.
But later commit bc00965dbff7 ("btrfs: count super block write errors in
device instead of tracking folio error state") uses extra atomic to
track how many errors are hit, and we no longer rely on page flags to
determine if the IO is done or failed.
[PROBLEMS]
But such direct usage of block devices' page cache is putting btrfs
responsible to the implementation details of the block device.
In fact this has already caused problem when the block device's page
cache has large folio support enabled, fixed by commit 65f2a3b2323e
("btrfs: remove folio order ASSERT()s in super block writeback path").
This should give us a warning that poking into other layer's
implementation is not a good idea.
[ENHANCEMENT]
Instead of directly using block device's page cache, use the regular bio
interfaces, this involves:
- Introduce extra buffer for all super blocks of a device
This is needed because we want to submit bios for all super blocks in
one go, and wait for them all.
This is to increase concurrency, thus we can not reuse a single super
block copy, and that's part of the reason we want to use page cache.
Unfortunately this means we will have extra 12K memory usage for each
btrfs device.
- Introduce ways to wait for super block writeback
Previously we rely on the page cache to wait for the IO to finish.
Now we have an atomic, @sb_write_running, to record how many running
super block writes are pending.
And also a wait queue head for waiting.
This greatly simplify wait_dev_supers(), now we only need to wait for
the pending ios to finish.
- Simplify btrfs_end_super_write() function
Now we don't need to bother releasing the folio, we can get rid of the
bio iteration code.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/disk-io.c | 76 ++++++++++++++--------------------------------
fs/btrfs/volumes.c | 13 ++++++++
fs/btrfs/volumes.h | 6 ++++
3 files changed, 42 insertions(+), 53 deletions(-)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 44e7ae4a2e0b..de844772585d 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3692,27 +3692,23 @@ ALLOW_ERROR_INJECTION(open_ctree, ERRNO);
static void btrfs_end_super_write(struct bio *bio)
{
struct btrfs_device *device = bio->bi_private;
- struct folio_iter fi;
- bio_for_each_folio_all(fi, bio) {
- if (bio->bi_status) {
- btrfs_warn_rl(device->fs_info,
- "lost super block write due to IO error on %s (%d)",
- btrfs_dev_name(device),
- blk_status_to_errno(bio->bi_status));
- btrfs_dev_stat_inc_and_print(device,
- BTRFS_DEV_STAT_WRITE_ERRS);
- /* Ensure failure if the primary sb fails. */
- if (bio->bi_opf & REQ_FUA)
- atomic_add(BTRFS_SUPER_PRIMARY_WRITE_ERROR,
- &device->sb_write_errors);
- else
- atomic_inc(&device->sb_write_errors);
- }
- folio_unlock(fi.folio);
- folio_put(fi.folio);
+ if (bio->bi_status) {
+ btrfs_warn_rl(device->fs_info,
+ "lost super block write due to IO error on %s (%d)",
+ btrfs_dev_name(device),
+ blk_status_to_errno(bio->bi_status));
+ btrfs_dev_stat_inc_and_print(device,
+ BTRFS_DEV_STAT_WRITE_ERRS);
+ /* Ensure failure if the primary sb fails. */
+ if (bio->bi_opf & REQ_FUA)
+ atomic_add(BTRFS_SUPER_PRIMARY_WRITE_ERROR,
+ &device->sb_write_errors);
+ else
+ atomic_inc(&device->sb_write_errors);
}
-
+ if (atomic_dec_and_test(&device->sb_write_running))
+ wake_up(&device->sb_write_wait);
bio_put(bio);
}
@@ -3730,24 +3726,21 @@ static int write_dev_supers(struct btrfs_device *device,
struct btrfs_super_block *sb, int max_mirrors)
{
struct btrfs_fs_info *fs_info = device->fs_info;
- struct address_space *mapping = device->bdev->bd_mapping;
SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
int i;
int ret;
u64 bytenr, bytenr_orig;
atomic_set(&device->sb_write_errors, 0);
+ ASSERT(atomic_read(&device->sb_write_running) == 0);
if (max_mirrors == 0)
max_mirrors = BTRFS_SUPER_MIRROR_MAX;
-
+ ASSERT(max_mirrors <= BTRFS_SUPER_MIRROR_MAX);
shash->tfm = fs_info->csum_shash;
for (i = 0; i < max_mirrors; i++) {
- struct folio *folio;
struct bio *bio;
- struct btrfs_super_block *disk_super;
- size_t offset;
bytenr_orig = btrfs_sb_offset(i);
ret = btrfs_sb_log_location(device, i, WRITE, &bytenr);
@@ -3769,21 +3762,7 @@ static int write_dev_supers(struct btrfs_device *device,
crypto_shash_digest(shash, (const char *)sb + BTRFS_CSUM_SIZE,
BTRFS_SUPER_INFO_SIZE - BTRFS_CSUM_SIZE,
sb->csum);
-
- folio = __filemap_get_folio(mapping, bytenr >> PAGE_SHIFT,
- FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
- GFP_NOFS);
- if (IS_ERR(folio)) {
- btrfs_err(device->fs_info,
- "couldn't get super block page for bytenr %llu error %ld",
- bytenr, PTR_ERR(folio));
- atomic_inc(&device->sb_write_errors);
- continue;
- }
-
- offset = offset_in_folio(folio, bytenr);
- disk_super = folio_address(folio) + offset;
- memcpy(disk_super, sb, BTRFS_SUPER_INFO_SIZE);
+ memcpy(device->sb_write_buf[i], sb, BTRFS_SUPER_INFO_SIZE);
/*
* Directly use bios here instead of relying on the page cache
@@ -3796,7 +3775,7 @@ static int write_dev_supers(struct btrfs_device *device,
bio->bi_iter.bi_sector = bytenr >> SECTOR_SHIFT;
bio->bi_private = device;
bio->bi_end_io = btrfs_end_super_write;
- bio_add_folio_nofail(bio, folio, BTRFS_SUPER_INFO_SIZE, offset);
+ bio_add_virt_nofail(bio, device->sb_write_buf[i], BTRFS_SUPER_INFO_SIZE);
/*
* We FUA only the first super block. The others we allow to
@@ -3805,6 +3784,7 @@ static int write_dev_supers(struct btrfs_device *device,
*/
if (i == 0 && !btrfs_test_opt(device->fs_info, NOBARRIER))
bio->bi_opf |= REQ_FUA;
+ atomic_inc(&device->sb_write_running);
submit_bio(bio);
if (btrfs_advance_sb_log(device, i))
@@ -3830,10 +3810,10 @@ static int wait_dev_supers(struct btrfs_device *device, int max_mirrors)
if (max_mirrors == 0)
max_mirrors = BTRFS_SUPER_MIRROR_MAX;
+ ASSERT(max_mirrors <= BTRFS_SUPER_MIRROR_MAX);
+ /* Calculate how many super blocks we really have. */
for (i = 0; i < max_mirrors; i++) {
- struct folio *folio;
-
ret = btrfs_sb_log_location(device, i, READ, &bytenr);
if (ret == -ENOENT) {
break;
@@ -3846,17 +3826,8 @@ static int wait_dev_supers(struct btrfs_device *device, int max_mirrors)
if (bytenr + BTRFS_SUPER_INFO_SIZE >=
device->commit_total_bytes)
break;
-
- folio = filemap_get_folio(device->bdev->bd_mapping,
- bytenr >> PAGE_SHIFT);
- /* If the folio has been removed, then we know it completed. */
- if (IS_ERR(folio))
- continue;
-
- /* Folio will be unlocked once the write completes. */
- folio_wait_locked(folio);
- folio_put(folio);
}
+ wait_event(device->sb_write_wait, atomic_read(&device->sb_write_running) == 0);
errors += atomic_read(&device->sb_write_errors);
if (errors >= BTRFS_SUPER_PRIMARY_WRITE_ERROR)
@@ -3866,7 +3837,6 @@ static int wait_dev_supers(struct btrfs_device *device, int max_mirrors)
device->devid);
return -1;
}
-
return errors < i ? 0 : -1;
}
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 47f11e3c4a98..b58640e36ee1 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -400,6 +400,7 @@ static struct btrfs_fs_devices *alloc_fs_devices(const u8 *fsid)
static void btrfs_free_device(struct btrfs_device *device)
{
WARN_ON(!list_empty(&device->post_commit_list));
+ WARN_ON(atomic_read(&device->sb_write_running) != 0);
/*
* No need to call kfree_rcu() nor do RCU lock/unlock, nothing is
* reading the device name.
@@ -407,6 +408,8 @@ static void btrfs_free_device(struct btrfs_device *device)
kfree(rcu_dereference_raw(device->name));
btrfs_extent_io_tree_release(&device->alloc_state);
btrfs_destroy_dev_zone_info(device);
+ for (int i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++)
+ kfree(device->sb_write_buf[i]);
kfree(device);
}
@@ -6905,6 +6908,8 @@ struct btrfs_device *btrfs_alloc_device(struct btrfs_fs_info *fs_info,
INIT_LIST_HEAD(&dev->post_commit_list);
atomic_set(&dev->dev_stats_ccnt, 0);
+ atomic_set(&dev->sb_write_running, 0);
+ init_waitqueue_head(&dev->sb_write_wait);
btrfs_device_data_ordered_init(dev);
btrfs_extent_io_tree_init(fs_info, &dev->alloc_state, IO_TREE_DEVICE_ALLOC_STATE);
@@ -6937,6 +6942,14 @@ struct btrfs_device *btrfs_alloc_device(struct btrfs_fs_info *fs_info,
rcu_assign_pointer(dev->name, name);
}
+ for (int i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
+ dev->sb_write_buf[i] = kmalloc(BTRFS_SUPER_INFO_SIZE, GFP_KERNEL);
+ if (!dev->sb_write_buf[i]) {
+ btrfs_free_device(dev);
+ return ERR_PTR(-ENOMEM);
+ }
+ }
+
return dev;
}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index d3b65f66a691..dd8e2cfdcd2d 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -20,6 +20,7 @@
#include <linux/rbtree.h>
#include <uapi/linux/btrfs.h>
#include <uapi/linux/btrfs_tree.h>
+#include "disk-io.h"
#include "messages.h"
#include "extent-io-tree.h"
@@ -182,6 +183,11 @@ struct btrfs_device {
struct bio flush_bio;
struct completion flush_wait;
+ /* Buffer for each super block. */
+ struct btrfs_super_block *sb_write_buf[BTRFS_SUPER_MIRROR_MAX];
+ atomic_t sb_write_running;
+ wait_queue_head_t sb_write_wait;
+
/* per-device scrub information */
struct scrub_ctx *scrub_ctx;
--
2.50.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH RFC 0/2] btrfs: do not poke into bdev's page cache
2025-07-08 9:06 [PATCH RFC 0/2] btrfs: do not poke into bdev's page cache Qu Wenruo
2025-07-08 9:06 ` [PATCH RFC 1/2] btrfs: use bdev_rw_virt() to read and scratch the disk super block Qu Wenruo
2025-07-08 9:06 ` [PATCH RFC 2/2] btrfs: do not poke into bdev's page cache for super block write Qu Wenruo
@ 2025-07-08 10:35 ` Johannes Thumshirn
2025-07-08 21:59 ` Qu Wenruo
2025-07-08 15:27 ` David Sterba
3 siblings, 1 reply; 9+ messages in thread
From: Johannes Thumshirn @ 2025-07-08 10:35 UTC (permalink / raw)
To: WenRuo Qu, linux-btrfs@vger.kernel.org; +Cc: Naohiro Aota
On 08.07.25 11:07, Qu Wenruo wrote:
> [ABUSE OF BDEV'S PAGE CACHE]
> Btrfs has a long history using bdev's page cache for super block IOs.
> This looks weird, but is mostly for the sake of concurrency.
>
> However this has already caused problems, for example when the block
> layer page cache enables large folio support, it triggers an ASSERT()
> inside btrfs, this is fixed by commit 65f2a3b2323e ("btrfs: remove folio
> order ASSERT()s in super block writeback path"), but it is already a
> warning.
>
> [MOVEING AWAY FROM BDEV'S PAGE CACHE]
> Thankfully we're moving away from the bdev's page cache already, starting
> with commit bc00965dbff7 ("btrfs: count super block write errors in
> device instead of tracking folio error state"), we no longer relies on
> page cache to detect super block IO errors.
>
> We still have the following paths using bdev's page cache, and those
> points will be addressed in this series:
>
> - Reading super blocks
> This is the easist one to kill, just kmalloc() and bdev_rw_virt() will
> handle it well.
>
> - Scratching super blocks
> Previously we just zero out the magic, but leaving everything else
> there.
> We rely on the block layer to write the involved blocks.
>
> Here we follow btrfs_read_disk_super() by kzalloc()ing a dummy super
> block, and write the full super block back to disk.
>
> - Writing super blocks
> Although write_dev_supers() is alreadying using the bio interface, it
> still relies on the bdev's page cache.
>
> One of the reason is, we want to submit all super blocks of a device
> in one go, and each super block of the same block device is slightly
> different, thus we go using page cache, so that each super block can
> have its own backing folio.
>
> Here we solve it by pre-allocating super block buffers.
> This also makes endio function much simpler, no need to iterate the
> bio to unlock the folio.
>
> - Waiting super blocks
> Instead of locking the folio to make sure its IO is done, just use an
> atomic and wait queue head to do it the usual way.
>
> By this we solve the problem and all IOs are done using bio interface.
>
> [THE COST AND REASON FOR RFC]
> But this brings some overhead, thus I marked the series RFC:
>
> - Extra 12K memory usage for each block device
> I hope the extra cost is acceptable for modern day systems.
>
> - Extra memory copy for super block writeback
> Previously we do the copy into the bdev's page cache, then submit the
> IO using folio from the bdev page cache.
>
> This updates the page cache and do the IO in one go.
>
> But now we memcpy() into the preallocated super block buffer, not
> updating the bdev's page cache directly.
> If by somehow the block device drive determines to copy the bio's
> content to page cache, it will need to do one extra memory copy.
>
> - Extra memory allocation for btrfs_scratch_superblock()
> Previously we need no memory allocation, thus no error handling
> needed.
>
> But now we need extra memory allocation, and such allocation is just
> to write zero into block devices. Thus the cost is a little hard to
> accept.
>
> - No more cached super block during device scan
> But the cost should be minimal.
I've also grepped for the users of btrfs_release_disk_super() because I
think we should remove it as well as after this series it's a plain
kfree() wrapper and found that `sb_write_pointer()` in zoned.c is still
using the page cache for superblock reading.
Should that be converted as well in the final series?
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH RFC 0/2] btrfs: do not poke into bdev's page cache
2025-07-08 9:06 [PATCH RFC 0/2] btrfs: do not poke into bdev's page cache Qu Wenruo
` (2 preceding siblings ...)
2025-07-08 10:35 ` [PATCH RFC 0/2] btrfs: do not poke into bdev's page cache Johannes Thumshirn
@ 2025-07-08 15:27 ` David Sterba
2025-07-09 4:02 ` Qu Wenruo
3 siblings, 1 reply; 9+ messages in thread
From: David Sterba @ 2025-07-08 15:27 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Tue, Jul 08, 2025 at 06:36:31PM +0930, Qu Wenruo wrote:
> [ABUSE OF BDEV'S PAGE CACHE]
> Btrfs has a long history using bdev's page cache for super block IOs.
> This looks weird, but is mostly for the sake of concurrency.
>
> However this has already caused problems, for example when the block
> layer page cache enables large folio support, it triggers an ASSERT()
> inside btrfs, this is fixed by commit 65f2a3b2323e ("btrfs: remove folio
> order ASSERT()s in super block writeback path"), but it is already a
> warning.
>
> [MOVEING AWAY FROM BDEV'S PAGE CACHE]
> Thankfully we're moving away from the bdev's page cache already, starting
> with commit bc00965dbff7 ("btrfs: count super block write errors in
> device instead of tracking folio error state"), we no longer relies on
> page cache to detect super block IO errors.
>
> We still have the following paths using bdev's page cache, and those
> points will be addressed in this series:
>
> - Reading super blocks
> This is the easist one to kill, just kmalloc() and bdev_rw_virt() will
> handle it well.
>
> - Scratching super blocks
> Previously we just zero out the magic, but leaving everything else
> there.
> We rely on the block layer to write the involved blocks.
>
> Here we follow btrfs_read_disk_super() by kzalloc()ing a dummy super
> block, and write the full super block back to disk.
>
> - Writing super blocks
> Although write_dev_supers() is alreadying using the bio interface, it
> still relies on the bdev's page cache.
>
> One of the reason is, we want to submit all super blocks of a device
> in one go, and each super block of the same block device is slightly
> different, thus we go using page cache, so that each super block can
> have its own backing folio.
>
> Here we solve it by pre-allocating super block buffers.
> This also makes endio function much simpler, no need to iterate the
> bio to unlock the folio.
>
> - Waiting super blocks
> Instead of locking the folio to make sure its IO is done, just use an
> atomic and wait queue head to do it the usual way.
>
> By this we solve the problem and all IOs are done using bio interface.
>
> [THE COST AND REASON FOR RFC]
> But this brings some overhead, thus I marked the series RFC:
>
> - Extra 12K memory usage for each block device
> I hope the extra cost is acceptable for modern day systems.
>
> - Extra memory copy for super block writeback
> Previously we do the copy into the bdev's page cache, then submit the
> IO using folio from the bdev page cache.
>
> This updates the page cache and do the IO in one go.
>
> But now we memcpy() into the preallocated super block buffer, not
> updating the bdev's page cache directly.
> If by somehow the block device drive determines to copy the bio's
> content to page cache, it will need to do one extra memory copy.
>
> - Extra memory allocation for btrfs_scratch_superblock()
> Previously we need no memory allocation, thus no error handling
> needed.
>
> But now we need extra memory allocation, and such allocation is just
> to write zero into block devices. Thus the cost is a little hard to
> accept.
>
> - No more cached super block during device scan
> But the cost should be minimal.
I have a similar patchset, there was a discussion about how to write the
super block
https://lore.kernel.org/all/Yn%2FtxWbij5voeGOB@casper.infradead.org/
I did separate pages and bios for the write, not using the page cache
and IIRC there were some problems with userspace interaction as it
depends on the page cache as the synchronization point.
The preallocation of 3 pages per device is IMHO acceptable and it avoids
any potential memory allocation failures when we're writing the most
critical block.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH RFC 1/2] btrfs: use bdev_rw_virt() to read and scratch the disk super block
2025-07-08 9:06 ` [PATCH RFC 1/2] btrfs: use bdev_rw_virt() to read and scratch the disk super block Qu Wenruo
@ 2025-07-08 15:30 ` David Sterba
2025-07-08 22:07 ` Qu Wenruo
0 siblings, 1 reply; 9+ messages in thread
From: David Sterba @ 2025-07-08 15:30 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Tue, Jul 08, 2025 at 06:36:32PM +0930, Qu Wenruo wrote:
> Currently we're using the block device page cache to read and scratch
> the super block.
>
> This makes btrfs to poking into lower layer unnecessarily.
>
> Change both btrfs_read_disk_super() and btrfs_scratch_superblock() to
> kmalloc() + bdev_rw_virt() to do the read and write.
>
> This allows btrfs_release_disk_super() to be a simple wrapper of
> kfree().
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/volumes.c | 46 +++++++++++++++++++++-------------------------
> fs/btrfs/volumes.h | 5 ++++-
> 2 files changed, 25 insertions(+), 26 deletions(-)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 31aecd291d6c..47f11e3c4a98 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1326,18 +1326,10 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
> return ret;
> }
>
> -void btrfs_release_disk_super(struct btrfs_super_block *super)
> -{
> - struct page *page = virt_to_page(super);
> -
> - put_page(page);
> -}
> -
> struct btrfs_super_block *btrfs_read_disk_super(struct block_device *bdev,
> int copy_num, bool drop_cache)
> {
> struct btrfs_super_block *super;
> - struct page *page;
> u64 bytenr, bytenr_orig;
> struct address_space *mapping = bdev->bd_mapping;
> int ret;
> @@ -1362,14 +1354,19 @@ struct btrfs_super_block *btrfs_read_disk_super(struct block_device *bdev,
> * always read from the device.
> */
> invalidate_inode_pages2_range(mapping, bytenr >> PAGE_SHIFT,
> - (bytenr + BTRFS_SUPER_INFO_SIZE) >> PAGE_SHIFT);
> + (bytenr + BTRFS_SUPER_INFO_SIZE) >> PAGE_SHIFT);
> }
>
> - page = read_cache_page_gfp(mapping, bytenr >> PAGE_SHIFT, GFP_NOFS);
> - if (IS_ERR(page))
> - return ERR_CAST(page);
> + super = kmalloc(BTRFS_SUPER_INFO_SIZE, GFP_KERNEL);
> + if (!super)
> + return ERR_PTR(-ENOMEM);
> + ret = bdev_rw_virt(bdev, bytenr >> SECTOR_SHIFT, super, BTRFS_SUPER_INFO_SIZE,
> + REQ_OP_READ);
> + if (ret < 0) {
> + btrfs_release_disk_super(super);
> + return ERR_PTR(ret);
> + }
>
> - super = page_address(page);
> if (btrfs_super_magic(super) != BTRFS_MAGIC ||
> btrfs_super_bytenr(super) != bytenr_orig) {
> btrfs_release_disk_super(super);
> @@ -2134,21 +2131,20 @@ static u64 btrfs_num_devices(struct btrfs_fs_info *fs_info)
> static void btrfs_scratch_superblock(struct btrfs_fs_info *fs_info,
> struct block_device *bdev, int copy_num)
> {
> - struct btrfs_super_block *disk_super;
> - const size_t len = sizeof(disk_super->magic);
> + struct btrfs_super_block *super;
> const u64 bytenr = btrfs_sb_offset(copy_num);
> int ret;
>
> - disk_super = btrfs_read_disk_super(bdev, copy_num, false);
> - if (IS_ERR(disk_super))
> - return;
> -
> - memset(&disk_super->magic, 0, len);
> - folio_mark_dirty(virt_to_folio(disk_super));
> - btrfs_release_disk_super(disk_super);
> -
> - ret = sync_blockdev_range(bdev, bytenr, bytenr + len - 1);
> - if (ret)
> + super = kzalloc(BTRFS_SUPER_INFO_SIZE, GFP_KERNEL);
> + if (!super) {
> + ret = -ENOMEM;
> + goto out;
> + }
> + ret = bdev_rw_virt(bdev, bytenr >> SECTOR_SHIFT, super,
> + BTRFS_SUPER_INFO_SIZE, REQ_OP_WRITE);
Can we keep this as before, i.e. only clearing the magic? The additional
memcpy shouldn't be a problem, this is one write per device removal. The
remains of the superblock can be used for restoration or later analysis,
wipefs also only removes the signature, I'd like to keep the
compatibility.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH RFC 0/2] btrfs: do not poke into bdev's page cache
2025-07-08 10:35 ` [PATCH RFC 0/2] btrfs: do not poke into bdev's page cache Johannes Thumshirn
@ 2025-07-08 21:59 ` Qu Wenruo
0 siblings, 0 replies; 9+ messages in thread
From: Qu Wenruo @ 2025-07-08 21:59 UTC (permalink / raw)
To: Johannes Thumshirn, WenRuo Qu, linux-btrfs@vger.kernel.org; +Cc: Naohiro Aota
在 2025/7/8 20:05, Johannes Thumshirn 写道:
> On 08.07.25 11:07, Qu Wenruo wrote:
>> [ABUSE OF BDEV'S PAGE CACHE]
>> Btrfs has a long history using bdev's page cache for super block IOs.
>> This looks weird, but is mostly for the sake of concurrency.
>>
>> However this has already caused problems, for example when the block
>> layer page cache enables large folio support, it triggers an ASSERT()
>> inside btrfs, this is fixed by commit 65f2a3b2323e ("btrfs: remove folio
>> order ASSERT()s in super block writeback path"), but it is already a
>> warning.
>>
>> [MOVEING AWAY FROM BDEV'S PAGE CACHE]
>> Thankfully we're moving away from the bdev's page cache already, starting
>> with commit bc00965dbff7 ("btrfs: count super block write errors in
>> device instead of tracking folio error state"), we no longer relies on
>> page cache to detect super block IO errors.
>>
>> We still have the following paths using bdev's page cache, and those
>> points will be addressed in this series:
>>
>> - Reading super blocks
>> This is the easist one to kill, just kmalloc() and bdev_rw_virt() will
>> handle it well.
>>
>> - Scratching super blocks
>> Previously we just zero out the magic, but leaving everything else
>> there.
>> We rely on the block layer to write the involved blocks.
>>
>> Here we follow btrfs_read_disk_super() by kzalloc()ing a dummy super
>> block, and write the full super block back to disk.
>>
>> - Writing super blocks
>> Although write_dev_supers() is alreadying using the bio interface, it
>> still relies on the bdev's page cache.
>>
>> One of the reason is, we want to submit all super blocks of a device
>> in one go, and each super block of the same block device is slightly
>> different, thus we go using page cache, so that each super block can
>> have its own backing folio.
>>
>> Here we solve it by pre-allocating super block buffers.
>> This also makes endio function much simpler, no need to iterate the
>> bio to unlock the folio.
>>
>> - Waiting super blocks
>> Instead of locking the folio to make sure its IO is done, just use an
>> atomic and wait queue head to do it the usual way.
>>
>> By this we solve the problem and all IOs are done using bio interface.
>>
>> [THE COST AND REASON FOR RFC]
>> But this brings some overhead, thus I marked the series RFC:
>>
>> - Extra 12K memory usage for each block device
>> I hope the extra cost is acceptable for modern day systems.
>>
>> - Extra memory copy for super block writeback
>> Previously we do the copy into the bdev's page cache, then submit the
>> IO using folio from the bdev page cache.
>>
>> This updates the page cache and do the IO in one go.
>>
>> But now we memcpy() into the preallocated super block buffer, not
>> updating the bdev's page cache directly.
>> If by somehow the block device drive determines to copy the bio's
>> content to page cache, it will need to do one extra memory copy.
>>
>> - Extra memory allocation for btrfs_scratch_superblock()
>> Previously we need no memory allocation, thus no error handling
>> needed.
>>
>> But now we need extra memory allocation, and such allocation is just
>> to write zero into block devices. Thus the cost is a little hard to
>> accept.
>>
>> - No more cached super block during device scan
>> But the cost should be minimal.
>
> I've also grepped for the users of btrfs_release_disk_super() because I
> think we should remove it as well as after this series it's a plain
> kfree() wrapper and found that `sb_write_pointer()` in zoned.c is still
> using the page cache for superblock reading.
>
> Should that be converted as well in the final series?
Thanks a lot for exposing another page cache user, sure it must be gone.
Thanks,
Qu
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH RFC 1/2] btrfs: use bdev_rw_virt() to read and scratch the disk super block
2025-07-08 15:30 ` David Sterba
@ 2025-07-08 22:07 ` Qu Wenruo
0 siblings, 0 replies; 9+ messages in thread
From: Qu Wenruo @ 2025-07-08 22:07 UTC (permalink / raw)
To: dsterba, Qu Wenruo; +Cc: linux-btrfs
在 2025/7/9 01:00, David Sterba 写道:
> On Tue, Jul 08, 2025 at 06:36:32PM +0930, Qu Wenruo wrote:
>> Currently we're using the block device page cache to read and scratch
>> the super block.
>>
>> This makes btrfs to poking into lower layer unnecessarily.
>>
>> Change both btrfs_read_disk_super() and btrfs_scratch_superblock() to
>> kmalloc() + bdev_rw_virt() to do the read and write.
>>
>> This allows btrfs_release_disk_super() to be a simple wrapper of
>> kfree().
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> fs/btrfs/volumes.c | 46 +++++++++++++++++++++-------------------------
>> fs/btrfs/volumes.h | 5 ++++-
>> 2 files changed, 25 insertions(+), 26 deletions(-)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 31aecd291d6c..47f11e3c4a98 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -1326,18 +1326,10 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
>> return ret;
>> }
>>
>> -void btrfs_release_disk_super(struct btrfs_super_block *super)
>> -{
>> - struct page *page = virt_to_page(super);
>> -
>> - put_page(page);
>> -}
>> -
>> struct btrfs_super_block *btrfs_read_disk_super(struct block_device *bdev,
>> int copy_num, bool drop_cache)
>> {
>> struct btrfs_super_block *super;
>> - struct page *page;
>> u64 bytenr, bytenr_orig;
>> struct address_space *mapping = bdev->bd_mapping;
>> int ret;
>> @@ -1362,14 +1354,19 @@ struct btrfs_super_block *btrfs_read_disk_super(struct block_device *bdev,
>> * always read from the device.
>> */
>> invalidate_inode_pages2_range(mapping, bytenr >> PAGE_SHIFT,
>> - (bytenr + BTRFS_SUPER_INFO_SIZE) >> PAGE_SHIFT);
>> + (bytenr + BTRFS_SUPER_INFO_SIZE) >> PAGE_SHIFT);
>> }
>>
>> - page = read_cache_page_gfp(mapping, bytenr >> PAGE_SHIFT, GFP_NOFS);
>> - if (IS_ERR(page))
>> - return ERR_CAST(page);
>> + super = kmalloc(BTRFS_SUPER_INFO_SIZE, GFP_KERNEL);
>> + if (!super)
>> + return ERR_PTR(-ENOMEM);
>> + ret = bdev_rw_virt(bdev, bytenr >> SECTOR_SHIFT, super, BTRFS_SUPER_INFO_SIZE,
>> + REQ_OP_READ);
>> + if (ret < 0) {
>> + btrfs_release_disk_super(super);
>> + return ERR_PTR(ret);
>> + }
>>
>> - super = page_address(page);
>> if (btrfs_super_magic(super) != BTRFS_MAGIC ||
>> btrfs_super_bytenr(super) != bytenr_orig) {
>> btrfs_release_disk_super(super);
>> @@ -2134,21 +2131,20 @@ static u64 btrfs_num_devices(struct btrfs_fs_info *fs_info)
>> static void btrfs_scratch_superblock(struct btrfs_fs_info *fs_info,
>> struct block_device *bdev, int copy_num)
>> {
>> - struct btrfs_super_block *disk_super;
>> - const size_t len = sizeof(disk_super->magic);
>> + struct btrfs_super_block *super;
>> const u64 bytenr = btrfs_sb_offset(copy_num);
>> int ret;
>>
>> - disk_super = btrfs_read_disk_super(bdev, copy_num, false);
>> - if (IS_ERR(disk_super))
>> - return;
>> -
>> - memset(&disk_super->magic, 0, len);
>> - folio_mark_dirty(virt_to_folio(disk_super));
>> - btrfs_release_disk_super(disk_super);
>> -
>> - ret = sync_blockdev_range(bdev, bytenr, bytenr + len - 1);
>> - if (ret)
>> + super = kzalloc(BTRFS_SUPER_INFO_SIZE, GFP_KERNEL);
>> + if (!super) {
>> + ret = -ENOMEM;
>> + goto out;
>> + }
>> + ret = bdev_rw_virt(bdev, bytenr >> SECTOR_SHIFT, super,
>> + BTRFS_SUPER_INFO_SIZE, REQ_OP_WRITE);
>
> Can we keep this as before, i.e. only clearing the magic?
> The additional
> memcpy shouldn't be a problem, this is one write per device removal. The
> remains of the superblock can be used for restoration or later analysis,
> wipefs also only removes the signature, I'd like to keep the
> compatibility.
>
OK, that just means we need to do two IOs, one to read (the same as the
old behavior), then write with magic wiped.
Shouldn't be that complex.
Thanks,
Qu
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH RFC 0/2] btrfs: do not poke into bdev's page cache
2025-07-08 15:27 ` David Sterba
@ 2025-07-09 4:02 ` Qu Wenruo
0 siblings, 0 replies; 9+ messages in thread
From: Qu Wenruo @ 2025-07-09 4:02 UTC (permalink / raw)
To: dsterba, Qu Wenruo; +Cc: linux-btrfs
在 2025/7/9 00:57, David Sterba 写道:
> On Tue, Jul 08, 2025 at 06:36:31PM +0930, Qu Wenruo wrote:
>> [ABUSE OF BDEV'S PAGE CACHE]
>> Btrfs has a long history using bdev's page cache for super block IOs.
>> This looks weird, but is mostly for the sake of concurrency.
>>
>> However this has already caused problems, for example when the block
>> layer page cache enables large folio support, it triggers an ASSERT()
>> inside btrfs, this is fixed by commit 65f2a3b2323e ("btrfs: remove folio
>> order ASSERT()s in super block writeback path"), but it is already a
>> warning.
>>
>> [MOVEING AWAY FROM BDEV'S PAGE CACHE]
>> Thankfully we're moving away from the bdev's page cache already, starting
>> with commit bc00965dbff7 ("btrfs: count super block write errors in
>> device instead of tracking folio error state"), we no longer relies on
>> page cache to detect super block IO errors.
>>
>> We still have the following paths using bdev's page cache, and those
>> points will be addressed in this series:
>>
>> - Reading super blocks
>> This is the easist one to kill, just kmalloc() and bdev_rw_virt() will
>> handle it well.
>>
>> - Scratching super blocks
>> Previously we just zero out the magic, but leaving everything else
>> there.
>> We rely on the block layer to write the involved blocks.
>>
>> Here we follow btrfs_read_disk_super() by kzalloc()ing a dummy super
>> block, and write the full super block back to disk.
>>
>> - Writing super blocks
>> Although write_dev_supers() is alreadying using the bio interface, it
>> still relies on the bdev's page cache.
>>
>> One of the reason is, we want to submit all super blocks of a device
>> in one go, and each super block of the same block device is slightly
>> different, thus we go using page cache, so that each super block can
>> have its own backing folio.
>>
>> Here we solve it by pre-allocating super block buffers.
>> This also makes endio function much simpler, no need to iterate the
>> bio to unlock the folio.
>>
>> - Waiting super blocks
>> Instead of locking the folio to make sure its IO is done, just use an
>> atomic and wait queue head to do it the usual way.
>>
>> By this we solve the problem and all IOs are done using bio interface.
>>
>> [THE COST AND REASON FOR RFC]
>> But this brings some overhead, thus I marked the series RFC:
>>
>> - Extra 12K memory usage for each block device
>> I hope the extra cost is acceptable for modern day systems.
>>
>> - Extra memory copy for super block writeback
>> Previously we do the copy into the bdev's page cache, then submit the
>> IO using folio from the bdev page cache.
>>
>> This updates the page cache and do the IO in one go.
>>
>> But now we memcpy() into the preallocated super block buffer, not
>> updating the bdev's page cache directly.
>> If by somehow the block device drive determines to copy the bio's
>> content to page cache, it will need to do one extra memory copy.
>>
>> - Extra memory allocation for btrfs_scratch_superblock()
>> Previously we need no memory allocation, thus no error handling
>> needed.
>>
>> But now we need extra memory allocation, and such allocation is just
>> to write zero into block devices. Thus the cost is a little hard to
>> accept.
>>
>> - No more cached super block during device scan
>> But the cost should be minimal.
>
> I have a similar patchset, there was a discussion about how to write the
> super block
> https://lore.kernel.org/all/Yn%2FtxWbij5voeGOB@casper.infradead.org/
>
> I did separate pages and bios for the write, not using the page cache
> and IIRC there were some problems with userspace interaction as it
> depends on the page cache as the synchronization point.
Do you still have some details about this problem?
In my fstests runs on non-zoned devices, it's totally fine.
And to be honest, I think this is very weird.
For filesystems that fully relies on bio interface, there will be no
interaction with the bdev's page cache.
Thus if there is something requiring the page cache as a sync point, it
will never work anyway.
Thanks,
Qu
>
> The preallocation of 3 pages per device is IMHO acceptable and it avoids
> any potential memory allocation failures when we're writing the most
> critical block.
>
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2025-07-09 4:02 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-08 9:06 [PATCH RFC 0/2] btrfs: do not poke into bdev's page cache Qu Wenruo
2025-07-08 9:06 ` [PATCH RFC 1/2] btrfs: use bdev_rw_virt() to read and scratch the disk super block Qu Wenruo
2025-07-08 15:30 ` David Sterba
2025-07-08 22:07 ` Qu Wenruo
2025-07-08 9:06 ` [PATCH RFC 2/2] btrfs: do not poke into bdev's page cache for super block write Qu Wenruo
2025-07-08 10:35 ` [PATCH RFC 0/2] btrfs: do not poke into bdev's page cache Johannes Thumshirn
2025-07-08 21:59 ` Qu Wenruo
2025-07-08 15:27 ` David Sterba
2025-07-09 4:02 ` Qu Wenruo
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.