* [PATCH RFC 0/2] btrfs: do not poke into bdev's page cache
@ 2025-07-08  9:06 Qu Wenruo
From: Qu Wenruo @ 2025-07-08  9:06 UTC
  To: linux-btrfs

[ABUSE OF BDEV'S PAGE CACHE]
Btrfs has a long history of using the bdev's page cache for super block
IO. This looks odd, but it is mostly done for the sake of concurrency.

However, this has already caused problems. For example, when the block
layer page cache enabled large folio support, it triggered an ASSERT()
inside btrfs. This was fixed by commit 65f2a3b2323e ("btrfs: remove folio
order ASSERT()s in super block writeback path"), but it is already a
warning sign.

[MOVING AWAY FROM BDEV'S PAGE CACHE]
Thankfully we're already moving away from the bdev's page cache. Starting
with commit bc00965dbff7 ("btrfs: count super block write errors in
device instead of tracking folio error state"), we no longer rely on the
page cache to detect super block IO errors.

We still have the following paths using the bdev's page cache, and all
of them are addressed in this series:

- Reading super blocks
  This is the easiest one to kill; a kmalloc() followed by
  bdev_rw_virt() handles it well.

- Scratching super blocks
  Previously we just zeroed out the magic and left everything else in
  place, relying on the block layer to write back the involved blocks.

  Here we follow btrfs_read_disk_super() by kzalloc()ing a dummy super
  block and writing the full super block back to disk.

- Writing super blocks
  Although write_dev_supers() is already using the bio interface, it
  still relies on the bdev's page cache.

  One of the reasons is that we want to submit all super blocks of a
  device in one go, and each super block copy on the same block device
  is slightly different. Thus we used the page cache, so that each super
  block can have its own backing folio.

  Here we solve it by pre-allocating super block buffers.
  This also makes the endio function much simpler, as there is no need
  to iterate the bio to unlock the folios.

- Waiting for super blocks
  Instead of locking the folio to make sure its IO is done, just use an
  atomic and a wait queue head to do it the usual way.
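
To illustrate the waiting scheme in the last item, here is a minimal
userspace sketch. It is not the code from this series: the kernel atomic
and wait queue head are swapped for a mutex-protected counter and a
pthread condition variable, and every name in it (struct sb_waiter,
sb_write_submitted() and so on) is hypothetical.

```c
#include <assert.h>
#include <pthread.h>

/*
 * Hypothetical userspace stand-in for a per-device "pending super block
 * writes" counter plus a wait queue head.
 */
struct sb_waiter {
	pthread_mutex_t lock;
	pthread_cond_t  done;
	int             pending;	/* submitted but not yet completed */
};

static void sb_waiter_init(struct sb_waiter *w)
{
	pthread_mutex_init(&w->lock, NULL);
	pthread_cond_init(&w->done, NULL);
	w->pending = 0;
}

/* Called once for every super block write that gets submitted. */
static void sb_write_submitted(struct sb_waiter *w)
{
	pthread_mutex_lock(&w->lock);
	w->pending++;
	pthread_mutex_unlock(&w->lock);
}

/* Called from the completion path (the endio analogue). */
static void sb_write_completed(struct sb_waiter *w)
{
	pthread_mutex_lock(&w->lock);
	assert(w->pending > 0);
	if (--w->pending == 0)
		pthread_cond_broadcast(&w->done);	/* wake all waiters */
	pthread_mutex_unlock(&w->lock);
}

/* Block until every submitted write has completed. */
static void sb_wait_all(struct sb_waiter *w)
{
	pthread_mutex_lock(&w->lock);
	while (w->pending > 0)
		pthread_cond_wait(&w->done, &w->lock);
	pthread_mutex_unlock(&w->lock);
}
```

The waiter only needs the counter to reach zero, so no folio has to be
locked or looked up just to learn that the IO has finished.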

With this, the problem is solved and all IO is done through the bio
interface.
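
As a rough userspace analogy for the read and scratch paths above, the
sketch below uses a heap buffer with pread()/pwrite() on a file
descriptor in place of kmalloc()/kzalloc() and bdev_rw_virt(). The
helper names sb_read() and sb_scratch() are made up for illustration,
and the 4K size is assumed to match one btrfs super block copy.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define SB_SIZE 4096	/* assumed size of one super block copy */

/* Read one super block copy into a fresh buffer; the caller frees it. */
static void *sb_read(int fd, off_t offset)
{
	void *buf;

	assert(fd >= 0);
	buf = malloc(SB_SIZE);
	if (!buf)
		return NULL;
	/* Treat short reads as failure, a full copy is required. */
	if (pread(fd, buf, SB_SIZE, offset) != SB_SIZE) {
		free(buf);
		return NULL;
	}
	return buf;
}

/* Overwrite one full super block copy with zeroes; 0 on success. */
static int sb_scratch(int fd, off_t offset)
{
	void *zero;
	int ret = 0;

	assert(fd >= 0);
	zero = calloc(1, SB_SIZE);
	if (!zero)
		return -1;	/* allocation can now fail */
	if (pwrite(fd, zero, SB_SIZE, offset) != SB_SIZE)
		ret = -1;
	free(zero);
	return ret;
}
```

Note that sb_scratch() wipes the whole copy rather than only the magic,
and that it gains an allocation failure path the old approach did not
have.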

[THE COST AND REASON FOR RFC]
But this brings some overhead, which is why I marked the series as RFC:

- Extra 12K memory usage for each block device
  I hope the extra cost is acceptable for modern day systems.

- Extra memory copy for super block writeback
  Previously we copied into the bdev's page cache, then submitted the
  IO using the folio from the bdev's page cache.

  This updated the page cache and did the IO in one go.

  But now we memcpy() into the preallocated super block buffer, without
  updating the bdev's page cache directly.
  If the block device driver somehow decides to copy the bio's content
  into the page cache, it will need one extra memory copy.

- Extra memory allocation for btrfs_scratch_superblock()
  Previously we needed no memory allocation, thus no error handling was
  needed.

  But now we need an extra memory allocation, and that allocation exists
  only to write zeroes to the block device. Thus the cost is a little
  hard to accept.

- No more cached super block during device scan
  But the cost should be minimal.
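
The 12K figure in the first item is not broken down above. Assuming it
comes from one 4K buffer for each of the up to three super block copies
per device, the arithmetic can be sketched as follows. The constant
values follow the btrfs on-disk format (primary copy at 64K, mirrors at
64M and 256G); the names are local to this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define SB_INFO_OFFSET   (64ULL * 1024)	/* primary copy at 64KiB */
#define SB_INFO_SIZE     4096ULL	/* one super block copy */
#define SB_MIRROR_MAX    3		/* copies per device, at most */
#define SB_MIRROR_SHIFT  12

/* Byte offset of super block copy @mirror on a device. */
static uint64_t sb_offset(int mirror)
{
	assert(mirror >= 0 && mirror < SB_MIRROR_MAX);
	if (mirror == 0)
		return SB_INFO_OFFSET;
	return (16ULL * 1024) << (SB_MIRROR_SHIFT * mirror);
}

/* Memory needed to preallocate one buffer per copy. */
static uint64_t sb_prealloc_bytes(void)
{
	return SB_MIRROR_MAX * SB_INFO_SIZE;
}
```

Under that assumption sb_prealloc_bytes() gives 3 * 4096 = 12288 bytes,
which matches the 12K per block device quoted above.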

Qu Wenruo (2):
  btrfs: use bdev_rw_virt() to read and scratch the disk super block
  btrfs: do not poke into bdev's page cache for super block write

 fs/btrfs/disk-io.c | 76 ++++++++++++++--------------------------------
 fs/btrfs/volumes.c | 59 ++++++++++++++++++++---------------
 fs/btrfs/volumes.h | 11 ++++++-
 3 files changed, 67 insertions(+), 79 deletions(-)

-- 
2.50.0

