From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org
Subject: [PATCH v3 1/2] btrfs: remove the COW fixup mechanism
Date: Tue, 14 Apr 2026 13:05:26 +0930 [thread overview]
Message-ID: <642cf7ec4272f7c6d18e355076d40fa16be4631f.1776137603.git.wqu@suse.com> (raw)
In-Reply-To: <cover.1776137603.git.wqu@suse.com>
[BACKGROUND]
Btrfs has a special mechanism called COW fixup, which detects dirty
pages without an ordered extent (folio ordered flag).
Normally a dirty folio must go through delayed allocation (delalloc)
before it can be submitted, and delalloc will create an ordered extent
for it and mark the range with ordered flag.
However in older kernels, there are bugs related to get_user_pages()
which can lead to some page marked dirty but without notifying the fs to
properly prepare them for writeback.
In that case without an ordered extent btrfs is unable to properly
submit such dirty folios, thus the COW fixup mechanism is introduced,
which do the extra space reservation so that they can be written back
properly.
[MODERN SOLUTIONS]
The MM layer has solved it properly now with the introduction of
pin_user_pages*(), so we're handling cases that are no longer valid.
So commit 7ca3e84980ef ("btrfs: reject out-of-band dirty folios during
writeback") is introduced to change the behavior from going through
COW fixup to rejecting them directly for experimental builds.
So far it works fine, but when errors are injected into the IO path, we
have random failures triggering the new warnings.
It looks like we have error path that cleared the ordered flag but
leaves the folio dirty flag, which later triggers the warning.
[REMOVAL OF COW FIXUP]
Although I hope to fix all those known warnings cases, I just can not
figure out the root cause yet.
But on the other hand, if we remove the ordered and checked flags in the
future, and purely rely on the dirty flags and ordered extent search, we
can get a much cleaner handling.
Considering it's no longer hitting the COW fixup for normal IO paths, I
think it's finally the time to remove the COW fixup completely.
Furthermore, the function name "btrfs_writepage_cow_fixup()" is no
longer meaningful, and since it's pretty small, only a folio flag check
with error message, there is no need to put it as a dedicated helper,
just open code it inside extent_writepage_io().
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/Kconfig | 4 -
fs/btrfs/btrfs_inode.h | 1 -
fs/btrfs/disk-io.c | 16 +---
fs/btrfs/extent_io.c | 23 ++---
fs/btrfs/fs.h | 7 --
fs/btrfs/inode.c | 202 -----------------------------------------
6 files changed, 10 insertions(+), 243 deletions(-)
diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig
index 5e75438e0b73..5d785d010971 100644
--- a/fs/btrfs/Kconfig
+++ b/fs/btrfs/Kconfig
@@ -93,10 +93,6 @@ config BTRFS_EXPERIMENTAL
Current list:
- - COW fixup worker warning - last warning before removing the
- functionality catching out-of-band page
- dirtying, not necessary since 5.8
-
- RAID mirror read policy - additional read policies for balancing
reading from redundant block group
profiles (currently: pid, round-robin,
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 55c272fe5d92..6e696b350dc5 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -630,7 +630,6 @@ int btrfs_prealloc_file_range_trans(struct inode *inode,
loff_t actual_len, u64 *alloc_hint);
int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct folio *locked_folio,
u64 start, u64 end, struct writeback_control *wbc);
-int btrfs_writepage_cow_fixup(struct folio *folio);
int btrfs_encoded_io_compression_from_extent(struct btrfs_fs_info *fs_info,
int compress_type);
int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d450bfc6d89b..761362f4ab9f 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1736,7 +1736,6 @@ static int read_backup_root(struct btrfs_fs_info *fs_info, u8 priority)
/* helper to cleanup workers */
static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
{
- btrfs_destroy_workqueue(fs_info->fixup_workers);
btrfs_destroy_workqueue(fs_info->delalloc_workers);
btrfs_destroy_workqueue(fs_info->workers);
if (fs_info->endio_workers)
@@ -1944,9 +1943,6 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
fs_info->caching_workers =
btrfs_alloc_workqueue(fs_info, "cache", flags, max_active, 0);
- fs_info->fixup_workers =
- btrfs_alloc_ordered_workqueue(fs_info, "fixup", ordered_flags);
-
fs_info->endio_workers =
alloc_workqueue("btrfs-endio", flags, max_active);
fs_info->endio_meta_workers =
@@ -1972,7 +1968,7 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
fs_info->endio_workers && fs_info->endio_meta_workers &&
fs_info->endio_write_workers &&
fs_info->endio_freespace_worker && fs_info->rmw_workers &&
- fs_info->caching_workers && fs_info->fixup_workers &&
+ fs_info->caching_workers &&
fs_info->delayed_workers && fs_info->qgroup_rescan_workers &&
fs_info->discard_ctl.discard_workers)) {
return -ENOMEM;
@@ -4327,16 +4323,6 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
if (unlikely(BTRFS_FS_ERROR(fs_info)))
btrfs_error_commit_super(fs_info);
- /*
- * Wait for any fixup workers to complete.
- * If we don't wait for them here and they are still running by the time
- * we call kthread_stop() against the cleaner kthread further below, we
- * get an use-after-free on the cleaner because the fixup worker adds an
- * inode to the list of delayed iputs and then attempts to wakeup the
- * cleaner kthread, which was already stopped and destroyed. We parked
- * already the cleaner, but below we run all pending delayed iputs.
- */
- btrfs_flush_workqueue(fs_info->fixup_workers);
/*
* Similar case here, we have to wait for delalloc workers before we
* proceed below and stop the cleaner kthread, otherwise we trigger a
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a8887285deda..4eab0f9909e3 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1738,18 +1738,17 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
ASSERT(end <= folio_end, "start=%llu len=%u folio_start=%llu folio_size=%zu",
start, len, folio_start, folio_size(folio));
- ret = btrfs_writepage_cow_fixup(folio);
- if (ret == -EAGAIN) {
- /* Fixup worker will requeue */
- folio_redirty_for_writepage(bio_ctrl->wbc, folio);
- folio_unlock(folio);
- return 1;
- }
- if (ret < 0) {
+ if (unlikely(!folio_test_ordered(folio))) {
+ DEBUG_WARN();
+ btrfs_err_rl(fs_info,
+ "root %lld ino %llu folio %llu is marked dirty without notifying the fs",
+ btrfs_root_id(inode->root),
+ btrfs_ino(inode),
+ folio_pos(folio));
btrfs_folio_clear_dirty(fs_info, folio, start, len);
btrfs_folio_set_writeback(fs_info, folio, start, len);
btrfs_folio_clear_writeback(fs_info, folio, start, len);
- return ret;
+ return -EUCLEAN;
}
bitmap_set(&range_bitmap, (start - folio_pos(folio)) >> fs_info->sectorsize_bits,
@@ -1867,12 +1866,8 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
*
* So here we check if the page has private set to rule out such
* case.
- * But we also have a long history of relying on the COW fixup,
- * so here we only enable this check for experimental builds until
- * we're sure it's safe.
*/
- if (IS_ENABLED(CONFIG_BTRFS_EXPERIMENTAL) &&
- unlikely(!folio_test_private(folio))) {
+ if (unlikely(!folio_test_private(folio))) {
WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
btrfs_err_rl(fs_info,
"root %lld ino %llu folio %llu is marked dirty without notifying the fs",
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index a4758d94b32e..5ccc907327a8 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -697,13 +697,6 @@ struct btrfs_fs_info {
struct btrfs_workqueue *endio_write_workers;
struct btrfs_workqueue *endio_freespace_worker;
struct btrfs_workqueue *caching_workers;
-
- /*
- * Fixup workers take dirty pages that didn't properly go through the
- * cow mechanism and make them safe to write. It happens for the
- * sys_munmap function call path.
- */
- struct btrfs_workqueue *fixup_workers;
struct btrfs_workqueue *delayed_workers;
struct task_struct *transaction_kthread;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 808e52aa6ef2..b75cf936843f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2833,208 +2833,6 @@ int btrfs_set_extent_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
EXTENT_DELALLOC | extra_bits, cached_state);
}
-/* see btrfs_writepage_start_hook for details on why this is required */
-struct btrfs_writepage_fixup {
- struct folio *folio;
- struct btrfs_inode *inode;
- struct btrfs_work work;
-};
-
-static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
-{
- struct btrfs_writepage_fixup *fixup =
- container_of(work, struct btrfs_writepage_fixup, work);
- struct btrfs_ordered_extent *ordered;
- struct extent_state *cached_state = NULL;
- struct extent_changeset *data_reserved = NULL;
- struct folio *folio = fixup->folio;
- struct btrfs_inode *inode = fixup->inode;
- struct btrfs_fs_info *fs_info = inode->root->fs_info;
- u64 page_start = folio_pos(folio);
- u64 page_end = folio_next_pos(folio) - 1;
- int ret = 0;
- bool free_delalloc_space = true;
-
- /*
- * This is similar to page_mkwrite, we need to reserve the space before
- * we take the folio lock.
- */
- ret = btrfs_delalloc_reserve_space(inode, &data_reserved, page_start,
- folio_size(folio));
-again:
- folio_lock(folio);
-
- /*
- * Before we queued this fixup, we took a reference on the folio.
- * folio->mapping may go NULL, but it shouldn't be moved to a different
- * address space.
- */
- if (!folio->mapping || !folio_test_dirty(folio) ||
- !folio_test_checked(folio)) {
- /*
- * Unfortunately this is a little tricky, either
- *
- * 1) We got here and our folio had already been dealt with and
- * we reserved our space, thus ret == 0, so we need to just
- * drop our space reservation and bail. This can happen the
- * first time we come into the fixup worker, or could happen
- * while waiting for the ordered extent.
- * 2) Our folio was already dealt with, but we happened to get an
- * ENOSPC above from the btrfs_delalloc_reserve_space. In
- * this case we obviously don't have anything to release, but
- * because the folio was already dealt with we don't want to
- * mark the folio with an error, so make sure we're resetting
- * ret to 0. This is why we have this check _before_ the ret
- * check, because we do not want to have a surprise ENOSPC
- * when the folio was already properly dealt with.
- */
- if (!ret) {
- btrfs_delalloc_release_extents(inode, folio_size(folio));
- btrfs_delalloc_release_space(inode, data_reserved,
- page_start, folio_size(folio),
- true);
- }
- ret = 0;
- goto out_page;
- }
-
- /*
- * We can't mess with the folio state unless it is locked, so now that
- * it is locked bail if we failed to make our space reservation.
- */
- if (ret)
- goto out_page;
-
- btrfs_lock_extent(&inode->io_tree, page_start, page_end, &cached_state);
-
- /* already ordered? We're done */
- if (folio_test_ordered(folio))
- goto out_reserved;
-
- ordered = btrfs_lookup_ordered_range(inode, page_start, PAGE_SIZE);
- if (ordered) {
- btrfs_unlock_extent(&inode->io_tree, page_start, page_end,
- &cached_state);
- folio_unlock(folio);
- btrfs_start_ordered_extent(ordered);
- btrfs_put_ordered_extent(ordered);
- goto again;
- }
-
- ret = btrfs_set_extent_delalloc(inode, page_start, page_end, 0,
- &cached_state);
- if (ret)
- goto out_reserved;
-
- /*
- * Everything went as planned, we're now the owner of a dirty page with
- * delayed allocation bits set and space reserved for our COW
- * destination.
- *
- * The page was dirty when we started, nothing should have cleaned it.
- */
- BUG_ON(!folio_test_dirty(folio));
- free_delalloc_space = false;
-out_reserved:
- btrfs_delalloc_release_extents(inode, PAGE_SIZE);
- if (free_delalloc_space)
- btrfs_delalloc_release_space(inode, data_reserved, page_start,
- PAGE_SIZE, true);
- btrfs_unlock_extent(&inode->io_tree, page_start, page_end, &cached_state);
-out_page:
- if (ret) {
- /*
- * We hit ENOSPC or other errors. Update the mapping and page
- * to reflect the errors and clean the page.
- */
- mapping_set_error(folio->mapping, ret);
- btrfs_folio_clear_ordered(fs_info, folio, page_start,
- folio_size(folio));
- btrfs_mark_ordered_io_finished(inode, page_start,
- folio_size(folio), !ret);
- folio_clear_dirty_for_io(folio);
- }
- btrfs_folio_clear_checked(fs_info, folio, page_start, PAGE_SIZE);
- folio_unlock(folio);
- folio_put(folio);
- kfree(fixup);
- extent_changeset_free(data_reserved);
- /*
- * As a precaution, do a delayed iput in case it would be the last iput
- * that could need flushing space. Recursing back to fixup worker would
- * deadlock.
- */
- btrfs_add_delayed_iput(inode);
-}
-
-/*
- * There are a few paths in the higher layers of the kernel that directly
- * set the folio dirty bit without asking the filesystem if it is a
- * good idea. This causes problems because we want to make sure COW
- * properly happens and the data=ordered rules are followed.
- *
- * In our case any range that doesn't have the ORDERED bit set
- * hasn't been properly setup for IO. We kick off an async process
- * to fix it up. The async helper will wait for ordered extents, set
- * the delalloc bit and make it safe to write the folio.
- */
-int btrfs_writepage_cow_fixup(struct folio *folio)
-{
- struct inode *inode = folio->mapping->host;
- struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
- struct btrfs_writepage_fixup *fixup;
-
- /* This folio has ordered extent covering it already */
- if (folio_test_ordered(folio))
- return 0;
-
- /*
- * For experimental build, we error out instead of EAGAIN.
- *
- * We should not hit such out-of-band dirty folios anymore.
- */
- if (IS_ENABLED(CONFIG_BTRFS_EXPERIMENTAL)) {
- DEBUG_WARN();
- btrfs_err_rl(fs_info,
- "root %lld ino %llu folio %llu is marked dirty without notifying the fs",
- btrfs_root_id(BTRFS_I(inode)->root),
- btrfs_ino(BTRFS_I(inode)),
- folio_pos(folio));
- return -EUCLEAN;
- }
-
- /*
- * folio_checked is set below when we create a fixup worker for this
- * folio, don't try to create another one if we're already
- * folio_test_checked.
- *
- * The extent_io writepage code will redirty the foio if we send back
- * EAGAIN.
- */
- if (folio_test_checked(folio))
- return -EAGAIN;
-
- fixup = kzalloc_obj(*fixup, GFP_NOFS);
- if (!fixup)
- return -EAGAIN;
-
- /*
- * We are already holding a reference to this inode from
- * write_cache_pages. We need to hold it because the space reservation
- * takes place outside of the folio lock, and we can't trust
- * folio->mapping outside of the folio lock.
- */
- ihold(inode);
- btrfs_folio_set_checked(fs_info, folio, folio_pos(folio), folio_size(folio));
- folio_get(folio);
- btrfs_init_work(&fixup->work, btrfs_writepage_fixup_worker, NULL);
- fixup->folio = folio;
- fixup->inode = BTRFS_I(inode);
- btrfs_queue_work(fs_info->fixup_workers, &fixup->work);
-
- return -EAGAIN;
-}
-
static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
struct btrfs_inode *inode, u64 file_pos,
struct btrfs_file_extent_item *stack_fi,
--
2.53.0
next prev parent reply other threads:[~2026-04-14 3:35 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-14 3:35 [PATCH v3 0/2] btrfs: remove COW fixup and checked folio flag Qu Wenruo
2026-04-14 3:35 ` Qu Wenruo [this message]
2026-04-14 3:35 ` [PATCH v3 2/2] btrfs: remove folio checked subpage bitmap tracking Qu Wenruo
2026-04-14 15:32 ` [PATCH v3 0/2] btrfs: remove COW fixup and checked folio flag David Sterba
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=642cf7ec4272f7c6d18e355076d40fa16be4631f.1776137603.git.wqu@suse.com \
--to=wqu@suse.com \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox