* Why a lot of fses are using bdev's page cache to do super block read/write?
@ 2025-07-09 9:05 Qu Wenruo
2025-07-09 12:01 ` Matthew Wilcox
2025-07-09 15:04 ` Darrick J. Wong
0 siblings, 2 replies; 4+ messages in thread
From: Qu Wenruo @ 2025-07-09 9:05 UTC (permalink / raw)
To: linux-fsdevel@vger.kernel.org, linux-btrfs, Matthew Wilcox,
linux-block@vger.kernel.org
Hi,
Recently I have been trying to remove the direct use of the bdev's page
cache from btrfs super block I/O, and to replace it with the common bio
interface (mostly bdev_rw_virt()).
However, I am hitting random generic/492 failures where blkid sometimes
fails to detect any btrfs super block signature.
This led to more digging, and to my surprise, using the bdev's page
cache for superblock I/O is not an exception; f2fs does exactly the
same thing.
This makes me wonder:
- Should a fs use the bdev's page cache directly?
  I thought a fs shouldn't do this, and that the bio interface should be
  enough for most if not all cases.
  Or am I wrong in the first place?
- What keeps fs super block updates from racing with user space
  device scans?
  I guess it's the regular page/folio locking of the bdev page cache.
  But that also means pure bio-based I/O will always race with buffered
  reads of the block device.
- If so, is there any special bio flag to prevent such a race?
  So far I have been unable to find such a flag.
Thanks,
Qu
* Re: Why a lot of fses are using bdev's page cache to do super block read/write?
2025-07-09 9:05 Why a lot of fses are using bdev's page cache to do super block read/write? Qu Wenruo
@ 2025-07-09 12:01 ` Matthew Wilcox
2025-07-09 15:04 ` Darrick J. Wong
1 sibling, 0 replies; 4+ messages in thread
From: Matthew Wilcox @ 2025-07-09 12:01 UTC (permalink / raw)
To: Qu Wenruo
Cc: linux-fsdevel@vger.kernel.org, linux-btrfs,
linux-block@vger.kernel.org
On Wed, Jul 09, 2025 at 06:35:00PM +0930, Qu Wenruo wrote:
> This led to more digging, and to my surprise, using the bdev's page cache
> for superblock I/O is not an exception; f2fs does exactly the same thing.
Almost all filesystems use the page cache (sometimes the buffer cache
which amounts to the exact same thing). This is a good thing as many
filesystems put their superblock in the same place, so scanning block
devices to determine what filesystem they have results in less I/O.
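As a rough userspace sketch of why the shared cache helps scanners (this is illustrative C, not libblkid's actual code or API -- the probe table and detect_fs() are invented, though the btrfs magic "_BHRfS_M" at 64 KiB + 64 and the ext4 s_magic 0xEF53 at 1024 + 56 match the real on-disk formats):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Invented probe table for this sketch; offsets/magics are the real
 * on-disk locations for btrfs and ext4. */
struct fs_probe {
	const char *name;
	size_t offset;      /* absolute byte offset of the magic */
	const char *magic;
	size_t magic_len;
};

static const struct fs_probe probes[] = {
	/* btrfs: superblock at 64 KiB, magic 64 bytes into it */
	{ "btrfs", 65536 + 64, "_BHRfS_M", 8 },
	/* ext4: superblock at 1 KiB, s_magic (0xEF53, LE) 56 bytes in */
	{ "ext4", 1024 + 56, "\x53\xEF", 2 },
};

/* One cached read of the device front answers "what fs is this?" for
 * every probe in the table. */
const char *detect_fs(const uint8_t *buf, size_t len)
{
	for (size_t i = 0; i < sizeof(probes) / sizeof(probes[0]); i++) {
		const struct fs_probe *p = &probes[i];

		if (p->offset + p->magic_len <= len &&
		    !memcmp(buf + p->offset, p->magic, p->magic_len))
			return p->name;
	}
	return NULL;
}
```

When every filesystem reads and writes these ranges through the bdev page cache, consecutive scans hit cached folios; a write submitted around that cache leaves a scanner matching stale bytes.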
* Re: Why a lot of fses are using bdev's page cache to do super block read/write?
2025-07-09 9:05 Why a lot of fses are using bdev's page cache to do super block read/write? Qu Wenruo
2025-07-09 12:01 ` Matthew Wilcox
@ 2025-07-09 15:04 ` Darrick J. Wong
2025-07-09 20:40 ` Qu Wenruo
1 sibling, 1 reply; 4+ messages in thread
From: Darrick J. Wong @ 2025-07-09 15:04 UTC (permalink / raw)
To: Qu Wenruo
Cc: linux-fsdevel@vger.kernel.org, linux-btrfs, Matthew Wilcox,
linux-block@vger.kernel.org, Catherine Hoang
On Wed, Jul 09, 2025 at 06:35:00PM +0930, Qu Wenruo wrote:
> Hi,
>
> Recently I have been trying to remove the direct use of the bdev's page
> cache from btrfs super block I/O, and to replace it with the common bio
> interface (mostly bdev_rw_virt()).
>
> However, I am hitting random generic/492 failures where blkid sometimes
> fails to detect any btrfs super block signature.
Yes, you need to invalidate_bdev() after writing the superblock directly
to disk via submit_bio.
> This led to more digging, and to my surprise, using the bdev's page cache
> for superblock I/O is not an exception; f2fs does exactly the same thing.
>
>
> This makes me wonder:
>
> - Should a fs use the bdev's page cache directly?
>   I thought a fs shouldn't do this, and that the bio interface should be
>   enough for most if not all cases.
>
>   Or am I wrong in the first place?
As willy said, most filesystems use the bdev pagecache because then they
don't have to implement their own (metadata) buffer cache. The downside
is that any filesystem that does so must be prepared to handle the
buffer_head contents changing any time they cycle the bh lock because
anyone can write to the block device of a mounted fs ala tune2fs.
Effectively this means that you have to (a) revalidate the entire buffer
contents every time you lock_buffer(); and (b) you can't make decisions
based on superblock feature bits in the superblock bh directly.
I made that mistake when adding metadata_csum support to ext4 -- we'd
only connect to the crc32c "crypto" module if checksums were enabled in
the ondisk super at mount time, but then there were a couple of places
that looked at the ondisk super bits at runtime, so you could flip the
bit on and crash the kernel almost immediately.
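The pattern in (a) and (b) can be sketched in a few lines of userspace C. This is a toy model, not ext4 code: toy_super, toy_csum() (a stand-in for crc32c), and the magic value are all invented. The point is that validity is recomputed from the buffer contents on every look, and nothing is decided from a feature bit cached at mount time.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy on-disk superblock -- field names are illustrative, not ext4's. */
struct toy_super {
	uint32_t magic;          /* invented magic for this sketch */
	uint32_t feature_flags;  /* e.g. bit 0 = "checksums enabled" */
	uint32_t csum;           /* checksum over the fields above */
};

/* Trivial stand-in for crc32c. */
static uint32_t toy_csum(const struct toy_super *sb)
{
	return sb->magic * 31u + sb->feature_flags * 7u + 0x5eedu;
}

/* Revalidate the whole buffer every time it is (re)locked: a concurrent
 * writer (tune2fs-style) may have changed any byte since the last look. */
bool toy_super_valid(const struct toy_super *sb)
{
	if (sb->magic != 0xdeadbeefu)
		return false;
	return sb->csum == toy_csum(sb);
}
```

A feature bit flipped behind the kernel's back now fails validation instead of silently steering code that trusted the mount-time value.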
Nowadays you can protect against malicious writes with
BLK_DEV_WRITE_MOUNTED=n, so at least that's mitigated a little bit.
Note (a) implies that the use of BH_Verified is a giant footgun.
Catherine Hoang [now cc'd] has prototyped a generic buffer cache so that
we can fix these vulnerabilities in ext2:
https://lore.kernel.org/linux-ext4/20250326014928.61507-1-catherine.hoang@oracle.com/
> - What keeps fs super block updates from racing with user space
>   device scans?
>
>   I guess it's the regular page/folio locking of the bdev page cache.
>   But that also means pure bio-based I/O will always race with buffered
>   reads of the block device.
Right. In theory you could take the posix advisory lock (aka flock)
from inside the kernel for the duration of the sb write, and that would
prevent libblkid/udev from seeing torn/stale contents because they take
LOCK_SH.
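A minimal userspace demonstration of that exclusion, assuming only flock(2) semantics (the helper name and lock-file path are invented for the sketch): two separate open file descriptions on the same file conflict, so a LOCK_SH taken the way a blkid-style scanner would fails while the "superblock writer" holds LOCK_EX.

```c
#include <sys/file.h>
#include <fcntl.h>
#include <unistd.h>

/* Try LOCK_SH (the scanner side) while a second open file description
 * on the same path holds LOCK_EX (the sb-writer side).  Returns 0 if
 * the shared lock was granted, -1 with errno EWOULDBLOCK if the
 * exclusive lock wins, -2 on setup failure. */
int try_shared_while_exclusive(const char *path)
{
	int fd_ex = open(path, O_RDWR | O_CREAT, 0600);
	int fd_sh = open(path, O_RDONLY);
	int ret;

	if (fd_ex < 0 || fd_sh < 0) {
		if (fd_ex >= 0)
			close(fd_ex);
		if (fd_sh >= 0)
			close(fd_sh);
		return -2;
	}

	flock(fd_ex, LOCK_EX);                  /* superblock writer */
	ret = flock(fd_sh, LOCK_SH | LOCK_NB);  /* blkid-style scanner */

	flock(fd_ex, LOCK_UN);
	close(fd_ex);
	close(fd_sh);
	return ret;
}
```

Because each open(2) creates its own open file description, flock treats the two fds independently even within one process, which is what makes this single-process demo work.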
> - If so, is there any special bio flag to prevent such a race?
>   So far I have been unable to find such a flag.
No.
--D
> Thanks,
> Qu
>
* Re: Why a lot of fses are using bdev's page cache to do super block read/write?
2025-07-09 15:04 ` Darrick J. Wong
@ 2025-07-09 20:40 ` Qu Wenruo
0 siblings, 0 replies; 4+ messages in thread
From: Qu Wenruo @ 2025-07-09 20:40 UTC (permalink / raw)
To: Darrick J. Wong
Cc: linux-fsdevel@vger.kernel.org, linux-btrfs, Matthew Wilcox,
linux-block@vger.kernel.org, Catherine Hoang
On 2025/7/10 00:34, Darrick J. Wong wrote:
> On Wed, Jul 09, 2025 at 06:35:00PM +0930, Qu Wenruo wrote:
>> Hi,
>>
>> Recently I have been trying to remove the direct use of the bdev's page
>> cache from btrfs super block I/O, and to replace it with the common bio
>> interface (mostly bdev_rw_virt()).
>>
>> However, I am hitting random generic/492 failures where blkid sometimes
>> fails to detect any btrfs super block signature.
>
> Yes, you need to invalidate_bdev() after writing the superblock directly
> to disk via submit_bio.
Since invalidate_bdev() invalidates the whole page cache of the bdev,
it may increase the latency of super block writeback, which may bring
unexpected performance changes.
All we really need is to refresh the content of the folio where our sb
lives, so it looks like we are better off sticking with the existing
bdev page cache usage.
That said, btrfs' super block writeback is still doing something out of
the ordinary, and that will be properly addressed.
Thanks, Matthew and Darrick, for the detailed explanations,
Qu
>
>> This led to more digging, and to my surprise, using the bdev's page cache
>> for superblock I/O is not an exception; f2fs does exactly the same thing.
>>
>>
>> This makes me wonder:
>>
>> - Should a fs use the bdev's page cache directly?
>>   I thought a fs shouldn't do this, and that the bio interface should be
>>   enough for most if not all cases.
>>
>>   Or am I wrong in the first place?
>
> As willy said, most filesystems use the bdev pagecache because then they
> don't have to implement their own (metadata) buffer cache. The downside
> is that any filesystem that does so must be prepared to handle the
> buffer_head contents changing any time they cycle the bh lock because
> anyone can write to the block device of a mounted fs ala tune2fs.
>
> Effectively this means that you have to (a) revalidate the entire buffer
> contents every time you lock_buffer(); and (b) you can't make decisions
> based on superblock feature bits in the superblock bh directly.
>
> I made that mistake when adding metadata_csum support to ext4 -- we'd
> only connect to the crc32c "crypto" module if checksums were enabled in
> the ondisk super at mount time, but then there were a couple of places
> that looked at the ondisk super bits at runtime, so you could flip the
> bit on and crash the kernel almost immediately.
>
> Nowadays you can protect against malicious writes with
> BLK_DEV_WRITE_MOUNTED=n, so at least that's mitigated a little bit.
> Note (a) implies that the use of BH_Verified is a giant footgun.
>
> Catherine Hoang [now cc'd] has prototyped a generic buffer cache so that
> we can fix these vulnerabilities in ext2:
> https://lore.kernel.org/linux-ext4/20250326014928.61507-1-catherine.hoang@oracle.com/
>
>> - What keeps fs super block updates from racing with user space
>>   device scans?
>>
>>   I guess it's the regular page/folio locking of the bdev page cache.
>>   But that also means pure bio-based I/O will always race with buffered
>>   reads of the block device.
>
> Right. In theory you could take the posix advisory lock (aka flock)
> from inside the kernel for the duration of the sb write, and that would
> prevent libblkid/udev from seeing torn/stale contents because they take
> LOCK_SH.
>
>> - If so, is there any special bio flag to prevent such a race?
>>   So far I have been unable to find such a flag.
>
> No.
>
> --D
>
>> Thanks,
>> Qu
>>
>
Thread overview: 4+ messages
2025-07-09 9:05 Why a lot of fses are using bdev's page cache to do super block read/write? Qu Wenruo
2025-07-09 12:01 ` Matthew Wilcox
2025-07-09 15:04 ` Darrick J. Wong
2025-07-09 20:40 ` Qu Wenruo