From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org
Subject: [PATCH v2 2/2] btrfs: use proper superblock writeback tracing
Date: Thu, 10 Jul 2025 07:24:03 +0930 [thread overview]
Message-ID: <ee0c53d5a56dd66fa41dbb5bec3ea02b294cc208.1752097916.git.wqu@suse.com> (raw)
In-Reply-To: <cover.1752097916.git.wqu@suse.com>
[EXISTING BEHAVIOR]
Btrfs currently uses the bdev's page cache to do super block writeback.
However it does so in a not-so-consistent way:

- The page cache is utilized to store the content to write
  That's fine, we can go that way without using the folio's writeback
  flag to do bio writes.

- But we also use the folio lock flag to determine if the IO is finished
  Although we don't go through the regular folio writeback state
  changes (dirty -> locked -> writeback -> clear writeback), we still
  lock and unlock the folio during submission, so although it looks
  weird, it still makes sense.

This has a long history, dating back to the early days when we relied
on the page's flags to determine if the IO was done or failed.
[CONCERNS]
Such usage of the bdev's page cache has raised some concerns. Wilcox
expressed his concern about this abuse of folio flags in the past:

https://lore.kernel.org/all/Yn%2FtxWbij5voeGOB@casper.infradead.org/

Thankfully the later commit bc00965dbff7 ("btrfs: count super block write
errors in device instead of tracking folio error state") uses an extra
atomic to track how many errors are hit, so we no longer rely on
folio/page flags to determine if the IO failed.
The only remaining usage is the folio lock flag, but we still want the
folio locked for a different reason: to ensure user space signature
scanning tools get a consistent view of the super block.

Otherwise, if we used pure bios without the bdev's page cache, our bio
writes could race with user space buffered reads, causing test cases
like generic/492 to fail.
[ENHANCEMENT]
- Add a comment on why we want to use and lock the bdev's page cache
  There are two reasons:

  * To give user space a consistent view of our super block
    The folio containing our sb is locked during writeback, thus user
    space has to wait until the write is done and won't see any
    half-baked contents.

  * To provide separate memory for each super block
    As each super block has slightly different contents from the
    others, and we want to submit them in parallel, we need independent
    memory for each super block.
- Introduce a proper way to wait for super block writeback
  Previously we relied on the page cache to wait for the IO to finish.
  Now we have an atomic, @sb_write_running, to record how many super
  block writes are still pending, plus a wait queue head to wait on.
  This greatly simplifies wait_dev_supers(); now we only need to wait
  for the pending IOs to finish, no need to bother with the page cache
  anymore.
- Slightly refactor the btrfs_end_super_write() function
  The error handling doesn't need to be embedded inside the folio
  iteration, so extract it out of the iteration loop.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/disk-io.c | 55 ++++++++++++++++++++++++----------------------
fs/btrfs/volumes.c | 3 +++
fs/btrfs/volumes.h | 5 +++++
3 files changed, 37 insertions(+), 26 deletions(-)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c4b020187249..243b7d8d7709 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3695,24 +3695,26 @@ static void btrfs_end_super_write(struct bio *bio)
struct folio_iter fi;
bio_for_each_folio_all(fi, bio) {
- if (bio->bi_status) {
- btrfs_warn_rl(device->fs_info,
- "lost super block write due to IO error on %s (%d)",
- btrfs_dev_name(device),
- blk_status_to_errno(bio->bi_status));
- btrfs_dev_stat_inc_and_print(device,
- BTRFS_DEV_STAT_WRITE_ERRS);
- /* Ensure failure if the primary sb fails. */
- if (bio->bi_opf & REQ_FUA)
- atomic_add(BTRFS_SUPER_PRIMARY_WRITE_ERROR,
- &device->sb_write_errors);
- else
- atomic_inc(&device->sb_write_errors);
- }
folio_unlock(fi.folio);
folio_put(fi.folio);
}
+ if (bio->bi_status) {
+ btrfs_warn_rl(device->fs_info,
+ "lost super block write due to IO error on %s (%d)",
+ btrfs_dev_name(device),
+ blk_status_to_errno(bio->bi_status));
+ btrfs_dev_stat_inc_and_print(device,
+ BTRFS_DEV_STAT_WRITE_ERRS);
+ /* Ensure failure if the primary sb fails. */
+ if (bio->bi_opf & REQ_FUA)
+ atomic_add(BTRFS_SUPER_PRIMARY_WRITE_ERROR,
+ &device->sb_write_errors);
+ else
+ atomic_inc(&device->sb_write_errors);
+ }
+ if (atomic_dec_and_test(&device->sb_write_running))
+ wake_up(&device->sb_write_wait);
bio_put(bio);
}
@@ -3737,6 +3739,7 @@ static int write_dev_supers(struct btrfs_device *device,
u64 bytenr, bytenr_orig;
atomic_set(&device->sb_write_errors, 0);
+ ASSERT(atomic_read(&device->sb_write_running) == 0);
if (max_mirrors == 0)
max_mirrors = BTRFS_SUPER_MIRROR_MAX;
@@ -3770,6 +3773,15 @@ static int write_dev_supers(struct btrfs_device *device,
BTRFS_SUPER_INFO_SIZE - BTRFS_CSUM_SIZE,
sb->csum);
+ /*
+ * Lock the folio covering our super block.
+ *
+ * This will prevent user space scanning from seeing half-baked
+ * content caused by races between our bio writes and buffered reads.
+ *
+ * This also provides a different folio for each super block,
+ * each of which contains slightly different contents.
+ */
folio = __filemap_get_folio(mapping, bytenr >> PAGE_SHIFT,
FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
GFP_NOFS);
@@ -3805,6 +3817,7 @@ static int write_dev_supers(struct btrfs_device *device,
*/
if (i == 0 && !btrfs_test_opt(device->fs_info, NOBARRIER))
bio->bi_opf |= REQ_FUA;
+ atomic_inc(&device->sb_write_running);
submit_bio(bio);
if (btrfs_advance_sb_log(device, i))
@@ -3831,9 +3844,8 @@ static int wait_dev_supers(struct btrfs_device *device, int max_mirrors)
if (max_mirrors == 0)
max_mirrors = BTRFS_SUPER_MIRROR_MAX;
+ /* Calculate how many super blocks we really have. */
for (i = 0; i < max_mirrors; i++) {
- struct folio *folio;
-
ret = btrfs_sb_log_location(device, i, READ, &bytenr);
if (ret == -ENOENT) {
break;
@@ -3846,17 +3858,8 @@ static int wait_dev_supers(struct btrfs_device *device, int max_mirrors)
if (bytenr + BTRFS_SUPER_INFO_SIZE >=
device->commit_total_bytes)
break;
-
- folio = filemap_get_folio(device->bdev->bd_mapping,
- bytenr >> PAGE_SHIFT);
- /* If the folio has been removed, then we know it completed. */
- if (IS_ERR(folio))
- continue;
-
- /* Folio will be unlocked once the write completes. */
- folio_wait_locked(folio);
- folio_put(folio);
}
+ wait_event(device->sb_write_wait, atomic_read(&device->sb_write_running) == 0);
errors += atomic_read(&device->sb_write_errors);
if (errors >= BTRFS_SUPER_PRIMARY_WRITE_ERROR)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 5e5b18524d1f..9667ca0374d0 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -400,6 +400,7 @@ static struct btrfs_fs_devices *alloc_fs_devices(const u8 *fsid)
static void btrfs_free_device(struct btrfs_device *device)
{
WARN_ON(!list_empty(&device->post_commit_list));
+ WARN_ON(atomic_read(&device->sb_write_running) != 0);
/*
* No need to call kfree_rcu() nor do RCU lock/unlock, nothing is
* reading the device name.
@@ -6895,6 +6896,8 @@ struct btrfs_device *btrfs_alloc_device(struct btrfs_fs_info *fs_info,
INIT_LIST_HEAD(&dev->post_commit_list);
atomic_set(&dev->dev_stats_ccnt, 0);
+ atomic_set(&dev->sb_write_running, 0);
+ init_waitqueue_head(&dev->sb_write_wait);
btrfs_device_data_ordered_init(dev);
btrfs_extent_io_tree_init(fs_info, &dev->alloc_state, IO_TREE_DEVICE_ALLOC_STATE);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index bf2e9a5a63ea..661ac281c065 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -20,6 +20,7 @@
#include <linux/rbtree.h>
#include <uapi/linux/btrfs.h>
#include <uapi/linux/btrfs_tree.h>
+#include "disk-io.h"
#include "messages.h"
#include "extent-io-tree.h"
@@ -182,6 +183,10 @@ struct btrfs_device {
struct bio flush_bio;
struct completion flush_wait;
+ /* How many super block write bios are running. */
+ atomic_t sb_write_running;
+ wait_queue_head_t sb_write_wait;
+
/* per-device scrub information */
struct scrub_ctx *scrub_ctx;
--
2.50.0
Thread overview:
2025-07-09 21:54 [PATCH v2 0/2] btrfs: only use bdev's page cache for super block writeback Qu Wenruo
2025-07-09 21:54 ` [PATCH v2 1/2] btrfs: use bdev_rw_virt() to read and scratch the disk super block Qu Wenruo
2025-07-09 21:54 ` Qu Wenruo [this message]