* [PATCH 1/2] btrfs: remove the COW fixup mechanism
2026-04-08 4:25 [PATCH 0/2] btrfs: remove COW fixup and checked folio flag Qu Wenruo
@ 2026-04-08 4:25 ` Qu Wenruo
2026-04-13 17:50 ` David Sterba
2026-04-08 4:25 ` [PATCH 2/2] btrfs: remove folio checked subpage bitmap tracking Qu Wenruo
2026-04-13 17:49 ` [PATCH 0/2] btrfs: remove COW fixup and checked folio flag David Sterba
2 siblings, 1 reply; 6+ messages in thread
From: Qu Wenruo @ 2026-04-08 4:25 UTC (permalink / raw)
To: linux-btrfs
[BACKGROUND]
Btrfs has a special mechanism called COW fixup, which detects dirty
folios without an ordered extent (i.e. without the folio ordered flag).
Normally a dirty folio must go through delayed allocation (delalloc)
before it can be submitted; delalloc creates an ordered extent for it
and marks the range with the ordered flag.
However, older kernels had bugs related to get_user_pages() which could
leave pages marked dirty without notifying the filesystem to properly
prepare them for writeback.
Without an ordered extent btrfs is unable to properly submit such dirty
folios, thus the COW fixup mechanism was introduced, which does the
extra space reservation so that they can be written back properly.
[MODERN SOLUTIONS]
The MM layer has since solved this properly with the introduction of
pin_user_pages*(), so we're handling cases that are no longer valid.
Thus commit 7ca3e84980ef ("btrfs: reject out-of-band dirty folios during
writeback") was introduced to change the behavior from going through the
COW fixup to rejecting such folios directly in experimental builds.
So far it works fine, but when errors are injected into the IO path, we
get random failures triggering the new warnings.
It looks like we have an error path that clears the ordered flag but
leaves the folio dirty flag set, which later triggers the warning.
[REMOVAL OF COW FIXUP]
Although I hope to fix all the known warning cases, I cannot figure out
the root cause yet.
On the other hand, if we remove the ordered and checked flags in the
future and purely rely on the dirty flag and ordered extent search, we
can get much cleaner handling.
Considering the COW fixup is no longer hit on normal IO paths, I think
it's finally time to remove it completely.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/inode.c | 201 ++++-------------------------------------------
1 file changed, 14 insertions(+), 187 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 808e52aa6ef2..2119c957aa47 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2833,206 +2833,33 @@ int btrfs_set_extent_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
EXTENT_DELALLOC | extra_bits, cached_state);
}
-/* see btrfs_writepage_start_hook for details on why this is required */
-struct btrfs_writepage_fixup {
- struct folio *folio;
- struct btrfs_inode *inode;
- struct btrfs_work work;
-};
-
-static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
-{
- struct btrfs_writepage_fixup *fixup =
- container_of(work, struct btrfs_writepage_fixup, work);
- struct btrfs_ordered_extent *ordered;
- struct extent_state *cached_state = NULL;
- struct extent_changeset *data_reserved = NULL;
- struct folio *folio = fixup->folio;
- struct btrfs_inode *inode = fixup->inode;
- struct btrfs_fs_info *fs_info = inode->root->fs_info;
- u64 page_start = folio_pos(folio);
- u64 page_end = folio_next_pos(folio) - 1;
- int ret = 0;
- bool free_delalloc_space = true;
-
- /*
- * This is similar to page_mkwrite, we need to reserve the space before
- * we take the folio lock.
- */
- ret = btrfs_delalloc_reserve_space(inode, &data_reserved, page_start,
- folio_size(folio));
-again:
- folio_lock(folio);
-
- /*
- * Before we queued this fixup, we took a reference on the folio.
- * folio->mapping may go NULL, but it shouldn't be moved to a different
- * address space.
- */
- if (!folio->mapping || !folio_test_dirty(folio) ||
- !folio_test_checked(folio)) {
- /*
- * Unfortunately this is a little tricky, either
- *
- * 1) We got here and our folio had already been dealt with and
- * we reserved our space, thus ret == 0, so we need to just
- * drop our space reservation and bail. This can happen the
- * first time we come into the fixup worker, or could happen
- * while waiting for the ordered extent.
- * 2) Our folio was already dealt with, but we happened to get an
- * ENOSPC above from the btrfs_delalloc_reserve_space. In
- * this case we obviously don't have anything to release, but
- * because the folio was already dealt with we don't want to
- * mark the folio with an error, so make sure we're resetting
- * ret to 0. This is why we have this check _before_ the ret
- * check, because we do not want to have a surprise ENOSPC
- * when the folio was already properly dealt with.
- */
- if (!ret) {
- btrfs_delalloc_release_extents(inode, folio_size(folio));
- btrfs_delalloc_release_space(inode, data_reserved,
- page_start, folio_size(folio),
- true);
- }
- ret = 0;
- goto out_page;
- }
-
- /*
- * We can't mess with the folio state unless it is locked, so now that
- * it is locked bail if we failed to make our space reservation.
- */
- if (ret)
- goto out_page;
-
- btrfs_lock_extent(&inode->io_tree, page_start, page_end, &cached_state);
-
- /* already ordered? We're done */
- if (folio_test_ordered(folio))
- goto out_reserved;
-
- ordered = btrfs_lookup_ordered_range(inode, page_start, PAGE_SIZE);
- if (ordered) {
- btrfs_unlock_extent(&inode->io_tree, page_start, page_end,
- &cached_state);
- folio_unlock(folio);
- btrfs_start_ordered_extent(ordered);
- btrfs_put_ordered_extent(ordered);
- goto again;
- }
-
- ret = btrfs_set_extent_delalloc(inode, page_start, page_end, 0,
- &cached_state);
- if (ret)
- goto out_reserved;
-
- /*
- * Everything went as planned, we're now the owner of a dirty page with
- * delayed allocation bits set and space reserved for our COW
- * destination.
- *
- * The page was dirty when we started, nothing should have cleaned it.
- */
- BUG_ON(!folio_test_dirty(folio));
- free_delalloc_space = false;
-out_reserved:
- btrfs_delalloc_release_extents(inode, PAGE_SIZE);
- if (free_delalloc_space)
- btrfs_delalloc_release_space(inode, data_reserved, page_start,
- PAGE_SIZE, true);
- btrfs_unlock_extent(&inode->io_tree, page_start, page_end, &cached_state);
-out_page:
- if (ret) {
- /*
- * We hit ENOSPC or other errors. Update the mapping and page
- * to reflect the errors and clean the page.
- */
- mapping_set_error(folio->mapping, ret);
- btrfs_folio_clear_ordered(fs_info, folio, page_start,
- folio_size(folio));
- btrfs_mark_ordered_io_finished(inode, page_start,
- folio_size(folio), !ret);
- folio_clear_dirty_for_io(folio);
- }
- btrfs_folio_clear_checked(fs_info, folio, page_start, PAGE_SIZE);
- folio_unlock(folio);
- folio_put(folio);
- kfree(fixup);
- extent_changeset_free(data_reserved);
- /*
- * As a precaution, do a delayed iput in case it would be the last iput
- * that could need flushing space. Recursing back to fixup worker would
- * deadlock.
- */
- btrfs_add_delayed_iput(inode);
-}
-
/*
- * There are a few paths in the higher layers of the kernel that directly
- * set the folio dirty bit without asking the filesystem if it is a
- * good idea. This causes problems because we want to make sure COW
- * properly happens and the data=ordered rules are followed.
+ * There used to be a bug related to get_user_pages() where a page
+ * could be dirtied without notifying the filesystem.
*
- * In our case any range that doesn't have the ORDERED bit set
- * hasn't been properly setup for IO. We kick off an async process
- * to fix it up. The async helper will wait for ordered extents, set
- * the delalloc bit and make it safe to write the folio.
+ * Btrfs used to handle such corner cases by manually re-setting up the
+ * needed flags so we could later submit the folios for writeback.
+ *
+ * But nowadays this can only happen in error paths where we cleared the
+ * ordered flag without clearing the dirty flag.
+ * In that case we just error out.
*/
int btrfs_writepage_cow_fixup(struct folio *folio)
{
struct inode *inode = folio->mapping->host;
struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
- struct btrfs_writepage_fixup *fixup;
/* This folio has ordered extent covering it already */
if (folio_test_ordered(folio))
return 0;
- /*
- * For experimental build, we error out instead of EAGAIN.
- *
- * We should not hit such out-of-band dirty folios anymore.
- */
- if (IS_ENABLED(CONFIG_BTRFS_EXPERIMENTAL)) {
- DEBUG_WARN();
- btrfs_err_rl(fs_info,
+ DEBUG_WARN();
+ btrfs_err_rl(fs_info,
"root %lld ino %llu folio %llu is marked dirty without notifying the fs",
- btrfs_root_id(BTRFS_I(inode)->root),
- btrfs_ino(BTRFS_I(inode)),
- folio_pos(folio));
- return -EUCLEAN;
- }
-
- /*
- * folio_checked is set below when we create a fixup worker for this
- * folio, don't try to create another one if we're already
- * folio_test_checked.
- *
- * The extent_io writepage code will redirty the foio if we send back
- * EAGAIN.
- */
- if (folio_test_checked(folio))
- return -EAGAIN;
-
- fixup = kzalloc_obj(*fixup, GFP_NOFS);
- if (!fixup)
- return -EAGAIN;
-
- /*
- * We are already holding a reference to this inode from
- * write_cache_pages. We need to hold it because the space reservation
- * takes place outside of the folio lock, and we can't trust
- * folio->mapping outside of the folio lock.
- */
- ihold(inode);
- btrfs_folio_set_checked(fs_info, folio, folio_pos(folio), folio_size(folio));
- folio_get(folio);
- btrfs_init_work(&fixup->work, btrfs_writepage_fixup_worker, NULL);
- fixup->folio = folio;
- fixup->inode = BTRFS_I(inode);
- btrfs_queue_work(fs_info->fixup_workers, &fixup->work);
-
- return -EAGAIN;
+ btrfs_root_id(BTRFS_I(inode)->root),
+ btrfs_ino(BTRFS_I(inode)),
+ folio_pos(folio));
+ return -EUCLEAN;
}
static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
--
2.53.0
* [PATCH 2/2] btrfs: remove folio checked subpage bitmap tracking
From: Qu Wenruo @ 2026-04-08 4:25 UTC (permalink / raw)
To: linux-btrfs
The folio checked flag is only utilized by the COW fixup mechanism
inside btrfs.
Since the COW fixup is already removed from non-experimental builds,
there is no need to keep the checked subpage bitmap.
This saves us some space for large folios, for example for a single
256K large folio on a 4K page sized system:

Old bitmap size = 6 * (256K / 4K / 8) = 48 bytes
New bitmap size = 5 * (256K / 4K / 8) = 40 bytes

The savings will be more significant once we support huge folios
(order = 9).
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/defrag.c | 1 -
fs/btrfs/file.c | 10 ----------
fs/btrfs/free-space-cache.c | 4 ----
fs/btrfs/inode.c | 3 ---
fs/btrfs/reflink.c | 1 -
fs/btrfs/subpage.c | 39 ++-----------------------------------
fs/btrfs/subpage.h | 5 +----
7 files changed, 3 insertions(+), 60 deletions(-)
diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index 7e2db5d3a4d4..af40ad62009a 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -1179,7 +1179,6 @@ static int defrag_one_locked_target(struct btrfs_inode *inode,
if (start >= folio_next_pos(folio) ||
start + len <= folio_pos(folio))
continue;
- btrfs_folio_clamp_clear_checked(fs_info, folio, start, len);
btrfs_folio_clamp_set_dirty(fs_info, folio, start, len);
}
btrfs_delalloc_release_extents(inode, len);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index cf1cb5c4db75..586f32fa03d5 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -49,14 +49,6 @@ static void btrfs_drop_folio(struct btrfs_fs_info *fs_info, struct folio *folio,
u64 block_len = round_up(pos + copied, fs_info->sectorsize) - block_start;
ASSERT(block_len <= U32_MAX);
- /*
- * Folio checked is some magic around finding folios that have been
- * modified without going through btrfs_dirty_folio(). Clear it here.
- * There should be no need to mark the pages accessed as
- * prepare_one_folio() should have marked them accessed in
- * prepare_one_folio() via find_or_create_page()
- */
- btrfs_folio_clamp_clear_checked(fs_info, folio, block_start, block_len);
folio_unlock(folio);
folio_put(folio);
}
@@ -107,7 +99,6 @@ int btrfs_dirty_folio(struct btrfs_inode *inode, struct folio *folio, loff_t pos
return ret;
btrfs_folio_clamp_set_uptodate(fs_info, folio, start_pos, num_bytes);
- btrfs_folio_clamp_clear_checked(fs_info, folio, start_pos, num_bytes);
btrfs_folio_clamp_set_dirty(fs_info, folio, start_pos, num_bytes);
/*
@@ -1987,7 +1978,6 @@ static vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
if (zero_start != fsize)
folio_zero_range(folio, zero_start, folio_size(folio) - zero_start);
- btrfs_folio_clear_checked(fs_info, folio, page_start, fsize);
btrfs_folio_set_dirty(fs_info, folio, page_start, end + 1 - page_start);
btrfs_folio_set_uptodate(fs_info, folio, page_start, end + 1 - page_start);
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index ab22e4f9ffdd..07567fd45634 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -433,10 +433,6 @@ static void io_ctl_drop_pages(struct btrfs_io_ctl *io_ctl)
for (i = 0; i < io_ctl->num_pages; i++) {
if (io_ctl->pages[i]) {
- btrfs_folio_clear_checked(io_ctl->fs_info,
- page_folio(io_ctl->pages[i]),
- page_offset(io_ctl->pages[i]),
- PAGE_SIZE);
unlock_page(io_ctl->pages[i]);
put_page(io_ctl->pages[i]);
}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2119c957aa47..8cd537e9d44a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5032,8 +5032,6 @@ int btrfs_truncate_block(struct btrfs_inode *inode, u64 offset, u64 start, u64 e
folio_zero_range(folio, zero_start - folio_pos(folio),
zero_end - zero_start + 1);
- btrfs_folio_clear_checked(fs_info, folio, block_start,
- block_end + 1 - block_start);
btrfs_folio_set_dirty(fs_info, folio, block_start,
block_end + 1 - block_start);
@@ -7650,7 +7648,6 @@ static void btrfs_invalidate_folio(struct folio *folio, size_t offset,
* did something wrong.
*/
ASSERT(!folio_test_ordered(folio));
- btrfs_folio_clear_checked(fs_info, folio, folio_pos(folio), folio_size(folio));
if (!inode_evicting)
__btrfs_release_folio(folio, GFP_NOFS);
clear_folio_extent_mapped(folio);
diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index 49865a463780..14742abe0f92 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -141,7 +141,6 @@ static int copy_inline_to_page(struct btrfs_inode *inode,
folio_zero_range(folio, datal, block_size - datal);
btrfs_folio_set_uptodate(fs_info, folio, file_offset, block_size);
- btrfs_folio_clear_checked(fs_info, folio, file_offset, block_size);
btrfs_folio_set_dirty(fs_info, folio, file_offset, block_size);
out_unlock:
if (!IS_ERR(folio)) {
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index f82e71f5d88b..8a09f34ea31e 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -508,35 +508,6 @@ void btrfs_subpage_clear_ordered(const struct btrfs_fs_info *fs_info,
spin_unlock_irqrestore(&bfs->lock, flags);
}
-void btrfs_subpage_set_checked(const struct btrfs_fs_info *fs_info,
- struct folio *folio, u64 start, u32 len)
-{
- struct btrfs_folio_state *bfs = folio_get_private(folio);
- unsigned int start_bit = subpage_calc_start_bit(fs_info, folio,
- checked, start, len);
- unsigned long flags;
-
- spin_lock_irqsave(&bfs->lock, flags);
- bitmap_set(bfs->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
- if (subpage_test_bitmap_all_set(fs_info, folio, checked))
- folio_set_checked(folio);
- spin_unlock_irqrestore(&bfs->lock, flags);
-}
-
-void btrfs_subpage_clear_checked(const struct btrfs_fs_info *fs_info,
- struct folio *folio, u64 start, u32 len)
-{
- struct btrfs_folio_state *bfs = folio_get_private(folio);
- unsigned int start_bit = subpage_calc_start_bit(fs_info, folio,
- checked, start, len);
- unsigned long flags;
-
- spin_lock_irqsave(&bfs->lock, flags);
- bitmap_clear(bfs->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
- folio_clear_checked(folio);
- spin_unlock_irqrestore(&bfs->lock, flags);
-}
-
/*
* Unlike set/clear which is dependent on each page status, for test all bits
* are tested in the same way.
@@ -561,7 +532,6 @@ IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(uptodate);
IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(dirty);
IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(writeback);
IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(ordered);
-IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(checked);
/*
* Note that, in selftests (extent-io-tests), we can have empty fs_info passed
@@ -659,8 +629,6 @@ IMPLEMENT_BTRFS_PAGE_OPS(writeback, folio_start_writeback, folio_end_writeback,
folio_test_writeback);
IMPLEMENT_BTRFS_PAGE_OPS(ordered, folio_set_ordered, folio_clear_ordered,
folio_test_ordered);
-IMPLEMENT_BTRFS_PAGE_OPS(checked, folio_set_checked, folio_clear_checked,
- folio_test_checked);
#define GET_SUBPAGE_BITMAP(fs_info, folio, name, dst) \
{ \
@@ -782,7 +750,6 @@ void __cold btrfs_subpage_dump_bitmap(const struct btrfs_fs_info *fs_info,
unsigned long dirty_bitmap;
unsigned long writeback_bitmap;
unsigned long ordered_bitmap;
- unsigned long checked_bitmap;
unsigned long locked_bitmap;
unsigned long flags;
@@ -795,20 +762,18 @@ void __cold btrfs_subpage_dump_bitmap(const struct btrfs_fs_info *fs_info,
GET_SUBPAGE_BITMAP(fs_info, folio, dirty, &dirty_bitmap);
GET_SUBPAGE_BITMAP(fs_info, folio, writeback, &writeback_bitmap);
GET_SUBPAGE_BITMAP(fs_info, folio, ordered, &ordered_bitmap);
- GET_SUBPAGE_BITMAP(fs_info, folio, checked, &checked_bitmap);
GET_SUBPAGE_BITMAP(fs_info, folio, locked, &locked_bitmap);
spin_unlock_irqrestore(&bfs->lock, flags);
dump_page(folio_page(folio, 0), "btrfs folio state dump");
btrfs_warn(fs_info,
-"start=%llu len=%u page=%llu, bitmaps uptodate=%*pbl dirty=%*pbl locked=%*pbl writeback=%*pbl ordered=%*pbl checked=%*pbl",
+"start=%llu len=%u page=%llu, bitmaps uptodate=%*pbl dirty=%*pbl locked=%*pbl writeback=%*pbl ordered=%*pbl",
start, len, folio_pos(folio),
blocks_per_folio, &uptodate_bitmap,
blocks_per_folio, &dirty_bitmap,
blocks_per_folio, &locked_bitmap,
blocks_per_folio, &writeback_bitmap,
- blocks_per_folio, &ordered_bitmap,
- blocks_per_folio, &checked_bitmap);
+ blocks_per_folio, &ordered_bitmap);
}
void btrfs_get_subpage_dirty_bitmap(struct btrfs_fs_info *fs_info,
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index d81a0ade559f..fdea0b605bfc 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -41,11 +41,9 @@ enum {
btrfs_bitmap_nr_writeback,
/*
- * The ordered and checked flags are for COW fixup, already marked
- * deprecated, and will be removed eventually.
+ * The ordered flag shows whether the range has an ordered extent.
*/
btrfs_bitmap_nr_ordered,
- btrfs_bitmap_nr_checked,
/*
* The locked bit is for async delalloc range (compression), currently
@@ -182,7 +180,6 @@ DECLARE_BTRFS_SUBPAGE_OPS(uptodate);
DECLARE_BTRFS_SUBPAGE_OPS(dirty);
DECLARE_BTRFS_SUBPAGE_OPS(writeback);
DECLARE_BTRFS_SUBPAGE_OPS(ordered);
-DECLARE_BTRFS_SUBPAGE_OPS(checked);
/*
* Helper for error cleanup, where a folio will have its dirty flag cleared,
--
2.53.0