From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org
Subject: [PATCH 0/2] btrfs: do not poke into bdev's page cache
Date: Wed, 9 Jul 2025 13:35:47 +0930
Message-ID: <cover.1752033203.git.wqu@suse.com>

[CHANGELOG]
v2:
- Make sb_write_pointer() use bdev_rw_virt()
  That is the missed location that still uses the bdev's page cache,
  thanks Johannes for exposing this one.

- Replace btrfs_release_disk_super() with kfree()
  There is no need to keep that helper, and the replacement helps us
  expose locations which are still using the old page cache, like the
  above case.

- Only scratch the magic number of a super block in
  btrfs_scratch_superblock()
  To keep the behavior the same.

- Use GFP_NOFS when allocating memory
  This is also to keep the old behavior.
  Although I'd say the btrfs_read_disk_super() call sites are safe, as
  they are either scanning a device or at mount time, thus out of the
  write path.
  The sb_write_pointer() one still needs the old GFP_NOFS flag, as it
  can be called when writing the super block.
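
As an illustration of the "only scratch the magic number" item, here is a
minimal userspace sketch, not the kernel code. It assumes the usual btrfs
on-disk layout (4096-byte super block, 8-byte magic at offset 64, after
csum[32], fsid[16], bytenr and flags). The helper names sb_has_magic() and
sb_scratch_magic() are hypothetical, and the private buffer stands in for
memory that the kernel would allocate with GFP_NOFS and read and write
back through bdev_rw_virt().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Values follow the btrfs on-disk format; the names are hypothetical. */
#define SB_SIZE       4096u /* BTRFS_SUPER_INFO_SIZE */
#define SB_MAGIC_OFF  64u   /* after csum[32], fsid[16], bytenr, flags */
#define SB_MAGIC_LEN  8u

static const char SB_MAGIC[] = "_BHRfS_M";

/* Return true if the buffer carries the btrfs magic. */
static bool sb_has_magic(const uint8_t *sb)
{
	return memcmp(sb + SB_MAGIC_OFF, SB_MAGIC, SB_MAGIC_LEN) == 0;
}

/*
 * Scratch only the magic of a super block held in a private buffer.
 * Every other byte is left untouched, matching the old behavior.
 */
static void sb_scratch_magic(uint8_t *sb)
{
	assert(sb != NULL);
	memset(sb + SB_MAGIC_OFF, 0, SB_MAGIC_LEN);
}
```

After sb_scratch_magic() the buffer no longer passes sb_has_magic(), while
the bytes around the magic keep their previous values.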

Btrfs has a long history of using the bdev's page cache for super block
IOs. This looks weird, but is mostly for the sake of concurrency.

However this has already caused problems. For example, when the block
layer page cache enabled large folio support, it triggered an ASSERT()
inside btrfs. This was fixed by commit 65f2a3b2323e ("btrfs: remove folio
order ASSERT()s in super block writeback path"), but it is already a
warning sign.

Thankfully we're moving away from the bdev's page cache already. Starting
with commit bc00965dbff7 ("btrfs: count super block write errors in
device instead of tracking folio error state"), we no longer rely on the
page cache to detect super block IO errors.

But we're still using the bdev's page cache for:

- Reading super blocks
  This is the easiest one to kill, just kmalloc() and bdev_rw_virt() will
  handle it well.

- Scratching super blocks
  Previously we just zeroed out the magic, leaving everything else
  there, and relied on the block layer to write the involved blocks.
  Here we follow btrfs_read_disk_super() by kzalloc()ing a dummy super
  block, and write the full super block back to disk.

- Writing super blocks
  Although write_dev_supers() is already using the bio interface, it
  still relies on the bdev's page cache.

  One of the reasons is that we want to submit all super blocks of a
  device in one go, and each super block of the same block device is
  slightly different, thus we went with the page cache, so that each
  super block can have its own backing folio.

  Here we solve it by pre-allocating super block buffers.
  This also makes the endio function much simpler, with no need to
  iterate the bio to unlock the folios.

- Waiting for super blocks
  Instead of locking the folio to make sure its IO is done, just use an
  atomic and a wait queue head to do it the usual way.

With this we solve the problem, and all IOs are done using the bio
interface.
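
The write and wait items above can be pictured with a small userspace
model. This is only a sketch under stated assumptions, not the code from
the patches: struct sb_write_ctx and the sb_*() helpers are hypothetical
names, the per-copy difference is reduced to the bytenr field (the real
write path also recomputes the checksum of each copy), and the wait queue
is left out. In the kernel, the waiter would sleep on a wait queue head
and be woken when the counter reaches zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SB_SIZE       4096u /* BTRFS_SUPER_INFO_SIZE */
#define SB_MIRRORS    3     /* BTRFS_SUPER_MIRROR_MAX */
#define SB_BYTENR_OFF 48u   /* after csum[32] and fsid[16] */

/* Hypothetical per-device state: one private buffer per super block copy. */
struct sb_write_ctx {
	uint8_t buf[SB_MIRRORS][SB_SIZE];
	atomic_int pending; /* writes submitted but not yet completed */
};

/* Byte offset of a super block copy: 64KiB, 64MiB, 256GiB. */
static uint64_t sb_offset(int mirror)
{
	return mirror ? (UINT64_C(16384) << (12 * mirror)) : UINT64_C(65536);
}

/* Store a 64-bit value in little-endian order, as on disk. */
static void put_le64(uint8_t *p, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		p[i] = (uint8_t)(v >> (8 * i));
}

/*
 * Fill every preallocated buffer from the in-memory super block and
 * stamp the per-copy location, which is what makes the copies differ.
 * This is the extra memcpy() that replaces the copy into the page cache.
 */
static void sb_prepare(struct sb_write_ctx *ctx, const uint8_t *sb)
{
	for (int i = 0; i < SB_MIRRORS; i++) {
		memcpy(ctx->buf[i], sb, SB_SIZE);
		put_le64(ctx->buf[i] + SB_BYTENR_OFF, sb_offset(i));
	}
	atomic_store(&ctx->pending, SB_MIRRORS);
}

/*
 * Completion side: no folio to look up or unlock, just drop the counter.
 * Returns true when the last write has finished, which is the point
 * where a kernel implementation would wake up the waiter.
 */
static bool sb_write_done(struct sb_write_ctx *ctx)
{
	int prev = atomic_fetch_sub(&ctx->pending, 1);

	assert(prev > 0);
	return prev == 1;
}
```

Because each copy owns its buffer, all three writes can be in flight at
once without sharing a backing folio, and the completion handler only has
to touch the counter.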

But this brings some overhead, thus I marked the series RFC:

- Extra 12K memory usage for each block device
  I hope the extra cost is acceptable for modern day systems.

- Extra memory copy for super block writeback
  Previously we did the copy into the bdev's page cache, then submitted
  the IO using the folio from the bdev page cache.
  This updated the page cache and did the IO in one go.

  But now we memcpy() into the preallocated super block buffer, not
  updating the bdev's page cache directly.
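
The 12K figure presumably comes from keeping one 4K buffer for each of the
up to three super block copies of a device. The cover letter does not
spell this out, so treat the sketch below as an assumption. The constants
mirror BTRFS_SUPER_INFO_SIZE and BTRFS_SUPER_MIRROR_MAX, and
sb_buffer_bytes() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

enum {
	SB_SIZE    = 4096, /* BTRFS_SUPER_INFO_SIZE */
	SB_MIRRORS = 3,    /* BTRFS_SUPER_MIRROR_MAX */
};

/* One buffer per super block copy gives 12K per device. */
static_assert(SB_MIRRORS * SB_SIZE == 12 * 1024, "expected 12K per device");

/* Bytes of preallocated super block buffers for nr_devices devices. */
static size_t sb_buffer_bytes(size_t nr_devices)
{
	return nr_devices * SB_MIRRORS * SB_SIZE;
}
```

Under that assumption a four-device filesystem would carry 48K of such
buffers in total.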

Qu Wenruo (2):
  btrfs: use bdev_rw_virt() to read and scratch the disk super block
  btrfs: do not poke into bdev's page cache for super block write

 fs/btrfs/disk-io.c | 86 +++++++++++++++-------------------------------
 fs/btrfs/super.c   |  4 +--
 fs/btrfs/volumes.c | 83 +++++++++++++++++++++-----------------------
 fs/btrfs/volumes.h | 10 ++++--
 fs/btrfs/zoned.c   | 28 ++++++++-------
 5 files changed, 93 insertions(+), 118 deletions(-)

--
2.50.0