From: Boris Burkov <boris@bur.io>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: Leo Martins <loemra.dev@gmail.com>,
linux-btrfs@vger.kernel.org, kernel-team@fb.com,
Filipe Manana <fdmanana@suse.com>,
Sun YangKai <sunk67188@gmail.com>
Subject: Re: [PATCH v4 1/3] btrfs: skip COW for written extent buffers allocated in current transaction
Date: Thu, 5 Mar 2026 12:11:19 -0800 [thread overview]
Message-ID: <20260305201119.GA1237826@zen.localdomain> (raw)
In-Reply-To: <3f85a632-d062-4006-8bd7-1048c60197ef@gmx.com>
On Wed, Mar 04, 2026 at 05:01:19PM +1030, Qu Wenruo wrote:
>
>
> 在 2026/2/26 20:21, Leo Martins 写道:
> > When memory pressure causes writeback of a recently COW'd buffer,
> > btrfs sets BTRFS_HEADER_FLAG_WRITTEN on it. Subsequent
> > btrfs_search_slot() restarts then see the WRITTEN flag and re-COW
> > the buffer unnecessarily, causing COW amplification that can exhaust
> > block reservations and degrade throughput.
> >
> > Overwriting in place is crash-safe because the committed superblock
> > does not reference buffers allocated in the current (uncommitted)
> > transaction, so no on-disk tree points to this block yet.
> >
> > When should_cow_block() encounters a WRITTEN buffer whose generation
> > matches the current transaction, instead of requesting a COW, re-dirty
> > the buffer and re-register its range in the transaction's dirty_pages.
> >
> > Both are necessary because btrfs tracks dirty metadata through two
> > independent mechanisms. set_extent_buffer_dirty() sets the
> > EXTENT_BUFFER_DIRTY flag and the buffer_tree xarray PAGECACHE_TAG_DIRTY
> > mark, which is what background writeback (btree_write_cache_pages) uses
> > to find and write dirty buffers. The transaction's dirty_pages io tree
> > is a separate structure used by btrfs_write_and_wait_transaction() at
> > commit time to ensure all buffers allocated during the transaction are
> > persisted. The dirty_pages range was originally registered in
> > btrfs_init_new_buffer() when the block was first allocated. Normally
> > dirty_pages is only cleared at commit time by
> > btrfs_write_and_wait_transaction(), but if qgroups are enabled and
> > snapshots are being created, qgroup_account_snapshot() may have already
> > called btrfs_write_and_wait_transaction() and released the range before
> > the final commit-time call.
> >
> > Keep BTRFS_HEADER_FLAG_WRITTEN set so that btrfs_free_tree_block()
> > correctly pins the block if it is freed later.
> >
> > Relax the lockdep assertion in btrfs_mark_buffer_dirty() from
> > btrfs_assert_tree_write_locked() to lockdep_assert_held() so that it
> > accepts either a read or write lock. should_cow_block() may be called
> > from btrfs_search_slot() when only a read lock is held (nodes above
> > write_lock_level are read-locked). The write lock assertion previously
> > documented the caller convention that buffer content was being modified
> > under exclusive access, but btrfs_mark_buffer_dirty() and
> > set_extent_buffer_dirty() themselves only perform independently
> > synchronized operations: atomic bit ops on bflags, folio_mark_dirty()
> > (kernel-internal folio locking), xarray mark updates (xarray spinlock),
> > and percpu counter updates. The read lock is sufficient because it
> > prevents lock_extent_buffer_for_io() from acquiring the write lock and
> > racing on the dirty state. Since rw_semaphore permits concurrent
> > readers, multiple threads can enter btrfs_mark_buffer_dirty()
> > simultaneously for the same buffer; this is safe because
> > test_and_set_bit(EXTENT_BUFFER_DIRTY) ensures only one thread performs
> > the full dirty state transition.
> >
> > Remove the CONFIG_BTRFS_DEBUG assertion in set_extent_buffer_dirty()
> > that checked folio_test_dirty() after marking the buffer dirty. This
> > assertion assumed exclusive access (only one thread in
> > set_extent_buffer_dirty() at a time), which held when the only caller
> > was btrfs_mark_buffer_dirty() under write lock. With concurrent readers
> > calling through should_cow_block(), a thread that loses the
> > test_and_set_bit race sees was_dirty=true and skips the folio dirty
> > marking, but the winning thread may not have called
> > btrfs_meta_folio_set_dirty() yet, causing the assertion to fire. This
> > is a benign race: the winning thread will complete the folio dirty
> > marking, and no writeback can clear it while readers hold their locks.
> >
> > Hoist the EXTENT_BUFFER_WRITEBACK, BTRFS_HEADER_FLAG_RELOC, and
> > BTRFS_ROOT_FORCE_COW checks before the WRITTEN block since they apply
> > regardless of whether the buffer has been written back. This
> > consolidates the exclusion logic and simplifies the WRITTEN path to
> > only handle log trees and zoned devices. Moving the RELOC checks
> > before the smp_mb__before_atomic() barrier is safe because both
> > btrfs_root_id() (immutable) and BTRFS_HEADER_FLAG_RELOC (set at COW
> > time under tree lock) are stable values not subject to concurrent
> > modification; the barrier is only needed for BTRFS_ROOT_FORCE_COW
> > which is set concurrently by create_pending_snapshot().
> >
> > Exclude cases where in-place overwrite is not safe:
> > - EXTENT_BUFFER_WRITEBACK: buffer is mid-I/O
> > - Zoned devices: require sequential writes
> > - Log trees: log blocks are immediately referenced by a committed
> > superblock via btrfs_sync_log(), so overwriting could corrupt the
> > committed log
> > - BTRFS_ROOT_FORCE_COW: snapshot in progress
> > - BTRFS_HEADER_FLAG_RELOC: block being relocated
> >
> > Signed-off-by: Leo Martins <loemra.dev@gmail.com>
> > Reviewed-by: Filipe Manana <fdmanana@suse.com>
> > Reviewed-by: Sun YangKai <sunk67188@gmail.com>
>
> Unfortunately this patch is making btrfs/232 fail.
> Bisection lead to this one.
>
> I have hit the following errors during btrfs/232 run with this patch, but
> not the commit before it:
>
> - Write time tree-checker errors
> From first key mismatch to bad key order.
>
> - Fsck errors
> From missing inode item other problems.
> AKA, on-disk corruption, which is never a good sign.
>
> One thing to note is, that test case itself can lead to a false alerts from
> DEBUG_WARN() inside btrfs_remove_qgroup(), thus you may need to manually
> remove that DEBUG_WARN() or check the failure dmesg to be extra sure.
>
> Thanks,
> Qu
>
I confirm that b/232 looks broken with this and good without it.
Reverted this patch (but left the independently useful inhibition patch)
while Leo figures out the tree-checker issues, which were reasonable to
expect coming out of something like this.
Thanks,
Boris
> > ---
> > fs/btrfs/ctree.c | 87 ++++++++++++++++++++++++++++++++++----------
> > fs/btrfs/disk-io.c | 2 +-
> > fs/btrfs/extent_io.c | 4 --
> > 3 files changed, 69 insertions(+), 24 deletions(-)
> >
> > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> > index 7267b2502665..ea7cfc3a9e89 100644
> > --- a/fs/btrfs/ctree.c
> > +++ b/fs/btrfs/ctree.c
> > @@ -599,29 +599,40 @@ int btrfs_force_cow_block(struct btrfs_trans_handle *trans,
> > return ret;
> > }
> > -static inline bool should_cow_block(const struct btrfs_trans_handle *trans,
> > +/*
> > + * Check if @buf needs to be COW'd.
> > + *
> > + * Returns true if COW is required, false if the block can be reused
> > + * in place.
> > + *
> > + * We do not need to COW a block if:
> > + * 1) the block was created or changed in this transaction;
> > + * 2) the block does not belong to TREE_RELOC tree;
> > + * 3) the root is not forced COW.
> > + *
> > + * Forced COW happens when we create a snapshot during transaction commit:
> > + * after copying the src root, we must COW the shared block to ensure
> > + * metadata consistency.
> > + *
> > + * When returning false for a WRITTEN buffer allocated in the current
> > + * transaction, re-dirties the buffer for in-place overwrite instead
> > + * of requesting a new COW.
> > + */
> > +static inline bool should_cow_block(struct btrfs_trans_handle *trans,
> > const struct btrfs_root *root,
> > - const struct extent_buffer *buf)
> > + struct extent_buffer *buf)
> > {
> > if (btrfs_is_testing(root->fs_info))
> > return false;
> > - /*
> > - * We do not need to cow a block if
> > - * 1) this block is not created or changed in this transaction;
> > - * 2) this block does not belong to TREE_RELOC tree;
> > - * 3) the root is not forced COW.
> > - *
> > - * What is forced COW:
> > - * when we create snapshot during committing the transaction,
> > - * after we've finished copying src root, we must COW the shared
> > - * block to ensure the metadata consistency.
> > - */
> > -
> > if (btrfs_header_generation(buf) != trans->transid)
> > return true;
> > - if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN))
> > + if (test_bit(EXTENT_BUFFER_WRITEBACK, &buf->bflags))
> > + return true;
> > +
> > + if (btrfs_root_id(root) != BTRFS_TREE_RELOC_OBJECTID &&
> > + btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC))
> > return true;
> > /* Ensure we can see the FORCE_COW bit. */
> > @@ -629,11 +640,49 @@ static inline bool should_cow_block(const struct btrfs_trans_handle *trans,
> > if (test_bit(BTRFS_ROOT_FORCE_COW, &root->state))
> > return true;
> > - if (btrfs_root_id(root) == BTRFS_TREE_RELOC_OBJECTID)
> > - return false;
> > + if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
> > + /*
> > + * The buffer was allocated in this transaction and has been
> > + * written back to disk (WRITTEN is set). Normally we'd COW
> > + * it again, but since the committed superblock doesn't
> > + * reference this buffer (it was allocated in this transaction),
> > + * we can safely overwrite it in place.
> > + *
> > + * We keep BTRFS_HEADER_FLAG_WRITTEN set. The block has been
> > + * persisted at this bytenr and will be again after the
> > + * in-place update. This is important so that
> > + * btrfs_free_tree_block() correctly pins the block if it is
> > + * freed later (e.g., during tree rebalancing or FORCE_COW).
> > + *
> > + * Log trees and zoned devices cannot use this optimization:
> > + * - Log trees: log blocks are written and immediately
> > + * referenced by a committed superblock via
> > + * btrfs_sync_log(), bypassing the normal transaction
> > + * commit. Overwriting in place could corrupt the
> > + * committed log.
> > + * - Zoned devices: require sequential writes.
> > + */
> > + if (btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID ||
> > + btrfs_is_zoned(root->fs_info))
> > + return true;
> > - if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC))
> > - return true;
> > + /*
> > + * Re-register this block's range in the current transaction's
> > + * dirty_pages so that btrfs_write_and_wait_transaction()
> > + * writes it. The range was originally registered when the
> > + * block was allocated. Normally dirty_pages is only cleared
> > + * at commit time by btrfs_write_and_wait_transaction(), but
> > + * if qgroups are enabled and snapshots are being created,
> > + * qgroup_account_snapshot() may have already called
> > + * btrfs_write_and_wait_transaction() and released the range
> > + * before the final commit-time call.
> > + */
> > + btrfs_set_extent_bit(&trans->transaction->dirty_pages,
> > + buf->start,
> > + buf->start + buf->len - 1,
> > + EXTENT_DIRTY, NULL);
> > + btrfs_mark_buffer_dirty(trans, buf);
> > + }
> > return false;
> > }
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index 32fffb0557e5..bee8f76fbfea 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -4491,7 +4491,7 @@ void btrfs_mark_buffer_dirty(struct btrfs_trans_handle *trans,
> > #endif
> > /* This is an active transaction (its state < TRANS_STATE_UNBLOCKED). */
> > ASSERT(trans->transid == fs_info->generation);
> > - btrfs_assert_tree_write_locked(buf);
> > + lockdep_assert_held(&buf->lock);
> > if (unlikely(transid != fs_info->generation)) {
> > btrfs_abort_transaction(trans, -EUCLEAN);
> > btrfs_crit(fs_info,
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index dfc17c292217..ff1fc699a6ca 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -3791,10 +3791,6 @@ void set_extent_buffer_dirty(struct extent_buffer *eb)
> > eb->len,
> > eb->fs_info->dirty_metadata_batch);
> > }
> > -#ifdef CONFIG_BTRFS_DEBUG
> > - for (int i = 0; i < num_extent_folios(eb); i++)
> > - ASSERT(folio_test_dirty(eb->folios[i]));
> > -#endif
> > }
> > void clear_extent_buffer_uptodate(struct extent_buffer *eb)
>
next prev parent reply other threads:[~2026-03-05 20:10 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-02-26 9:51 [PATCH v4 0/3] btrfs: fix COW amplification under memory pressure Leo Martins
2026-02-26 9:51 ` [PATCH v4 1/3] btrfs: skip COW for written extent buffers allocated in current transaction Leo Martins
2026-03-04 6:31 ` Qu Wenruo
2026-03-05 20:11 ` Boris Burkov [this message]
2026-02-26 9:51 ` [PATCH v4 2/3] btrfs: inhibit extent buffer writeback to prevent COW amplification Leo Martins
2026-02-26 15:23 ` Filipe Manana
2026-02-26 15:27 ` Sun YangKai
2026-03-04 4:22 ` David Sterba
2026-02-26 9:51 ` [PATCH v4 3/3] btrfs: add tracepoint for search slot restart tracking Leo Martins
2026-03-02 23:12 ` [PATCH v4 0/3] btrfs: fix COW amplification under memory pressure Boris Burkov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260305201119.GA1237826@zen.localdomain \
--to=boris@bur.io \
--cc=fdmanana@suse.com \
--cc=kernel-team@fb.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=loemra.dev@gmail.com \
--cc=quwenruo.btrfs@gmx.com \
--cc=sunk67188@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox