From: Leo Martins <loemra.dev@gmail.com>
To: Sun YangKai <sunk67188@gmail.com>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH v2 1/3] btrfs: skip COW for written extent buffers allocated in current transaction
Date: Tue, 17 Feb 2026 14:04:26 -0800 [thread overview]
Message-ID: <20260217220440.801494-1-loemra.dev@gmail.com> (raw)
In-Reply-To: <bcde0730-b3d4-4aa2-88f9-7fee861601cc@gmail.com>
On Sat, 14 Feb 2026 09:25:03 +0800 Sun YangKai <sunk67188@gmail.com> wrote:
> Thanks for your working on this and I've expecting this for a long time :)
Thanks for the review!
>
> On 2026/2/14 04:30, Leo Martins wrote:
> > When memory pressure causes writeback of a recently COW'd buffer,
> > btrfs sets BTRFS_HEADER_FLAG_WRITTEN on it. Subsequent
> > btrfs_search_slot() restarts then see the WRITTEN flag and re-COW
> > the buffer unnecessarily, causing COW amplification that can exhaust
> > block reservations and degrade throughput.
> >
> > Overwriting in place is crash-safe because the committed superblock
> > does not reference buffers allocated in the current (uncommitted)
> > transaction, so no on-disk tree points to this block yet.
> >
> > When should_cow_block() encounters a WRITTEN buffer whose generation
> > matches the current transaction, instead of requesting a COW, re-dirty
> > the buffer and re-register its range in the transaction's dirty_pages.
> >
> > Both are necessary because btrfs tracks dirty metadata through two
> > independent mechanisms. set_extent_buffer_dirty() sets the
> > EXTENT_BUFFER_DIRTY flag and the buffer_tree xarray PAGECACHE_TAG_DIRTY
> > mark, which is what background writeback (btree_write_cache_pages) uses
> > to find and write dirty buffers. The transaction's dirty_pages io tree
> > is a separate structure used by btrfs_write_and_wait_transaction() at
> > commit time to ensure all buffers allocated during the transaction are
> > persisted. The dirty_pages range was originally registered in
> > btrfs_init_new_buffer() when the block was first allocated, but
> > background writeback may have already written and cleared it.
> >
> > Keep BTRFS_HEADER_FLAG_WRITTEN set so that btrfs_free_tree_block()
> > correctly pins the block if it is freed later.
> >
> > Exclude cases where in-place overwrite is not safe:
> > - EXTENT_BUFFER_WRITEBACK: buffer is mid-I/O
> > - Zoned devices: require sequential writes
> > - Log trees: log blocks are immediately referenced by a committed
> > superblock via btrfs_sync_log(), so overwriting could corrupt the
> > committed log
> > - BTRFS_ROOT_FORCE_COW: snapshot in progress
> > - BTRFS_HEADER_FLAG_RELOC: block being relocated
> >
> > Signed-off-by: Leo Martins <loemra.dev@gmail.com>
> > ---
> > fs/btrfs/ctree.c | 53 +++++++++++++++++++++++++++++++++++++++++++++---
> > 1 file changed, 50 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> > index 7267b2502665..a345e1be24d8 100644
> > --- a/fs/btrfs/ctree.c
> > +++ b/fs/btrfs/ctree.c
> > @@ -599,9 +599,9 @@ int btrfs_force_cow_block(struct btrfs_trans_handle *trans,
> > return ret;
> > }
> >
> > -static inline bool should_cow_block(const struct btrfs_trans_handle *trans,
> > +static inline bool should_cow_block(struct btrfs_trans_handle *trans,
> > const struct btrfs_root *root,
> > - const struct extent_buffer *buf)
> > + struct extent_buffer *buf)
> > {
> > if (btrfs_is_testing(root->fs_info))
> > return false;
> > @@ -621,8 +621,55 @@ static inline bool should_cow_block(const struct btrfs_trans_handle *trans,
> > if (btrfs_header_generation(buf) != trans->transid)
> > return true;
> >
> > - if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN))
> > + if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
> > + /*
> > + * The buffer was allocated in this transaction and has been
> > + * written back to disk (WRITTEN is set). Normally we'd COW
> > + * it again, but since the committed superblock doesn't
> > + * reference this buffer (it was allocated this transaction),
> > + * we can safely overwrite it in place.
> > + *
> > + * We keep BTRFS_HEADER_FLAG_WRITTEN set. The block has been
> > + * persisted at this bytenr and will be again after the
> > + * in-place update. This is important so that
> > + * btrfs_free_tree_block() correctly pins the block if it is
> > + * freed later (e.g., during tree rebalancing or FORCE_COW).
> > + *
> > + * We re-dirty the buffer to ensure the in-place modifications
> > + * will be written back to disk.
> > + *
> > + * Exclusions:
> > + * - Log trees: log blocks are written and immediately
> > + * referenced by a committed superblock via
> > + * btrfs_sync_log(), bypassing the normal transaction
> > + * commit. Overwriting in place could corrupt the
> > + * committed log.
> > + * - Zoned devices: require sequential writes
> > + * - FORCE_COW: snapshot in progress
> > + * - RELOC flag: block being relocated
> > + */
> > + if (!test_bit(EXTENT_BUFFER_WRITEBACK, &buf->bflags) &&
> > + !btrfs_is_zoned(root->fs_info) &&
> > + btrfs_root_id(root) != BTRFS_TREE_LOG_OBJECTID &&
> > + !test_bit(BTRFS_ROOT_FORCE_COW, &root->state) &&
> it seems we need smp_mb__before_atomic() to see the FORCE_COW bit?
Good call, fixed in v3.
> > + !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)) {
> > + /*
> > + * Re-register this block's range in the current
> > + * transaction's dirty_pages so that
> > + * btrfs_write_and_wait_transaction() writes it.
> > + * The range was originally registered when the block
> > + * was allocated, but that transaction's dirty_pages
> > + * may have already been released.
> > + */
> > + btrfs_set_extent_bit(&trans->transaction->dirty_pages,
> > + buf->start,
> > + buf->start + buf->len - 1,
> > + EXTENT_DIRTY, NULL);
> > + set_extent_buffer_dirty(buf);
> why use set_extent_buffer_dirty() instead of btrfs_mark_buffer_dirty()?
> I don't see any other callers doing this.
btrfs_mark_buffer_dirty() calls btrfs_assert_tree_write_locked(buf),
but should_cow_block() may be called from btrfs_search_slot() when the
buffer only holds a read lock (root node acquired with BTRFS_READ_LOCK
in btrfs_search_slot_get_root()). Added a comment explaining this in
v3.
> > + return false;
> > + }
> > return true;
> > + }
> >
> > /* Ensure we can see the FORCE_COW bit. */
> > smp_mb__before_atomic();
>
> And I wonder if we could have something more readable like this:
>
> if (btrfs_header_generation(buf) != trans->transid)
> return true;
>
> if (test_bit(EXTENT_BUFFER_WRITEBACK, &buf->bflags))
> return true;
>
> if (btrfs_root_id(root) != BTRFS_TREE_RELOC_OBJECTID &&
> btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC))
> return true;
>
> /* Ensure we can see the FORCE_COW bit. */
> smp_mb__before_atomic();
> if (test_bit(BTRFS_ROOT_FORCE_COW, &root->state))
> return true;
>
> if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
> if (btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID ||
> btrfs_is_zoned(root->fs_info))
> return true;
> btrfs_set_extent_bit(&trans->transaction->dirty_pages,
> buf->start,
> buf->start + buf->len - 1,
> EXTENT_DIRTY, NULL);
> btrfs_mark_buffer_dirty(trans, buf);
> }
>
> return false;
Updated in v3!
>
> Thanks,
> Sun YangKai
next prev parent reply other threads:[~2026-02-17 22:04 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-02-13 20:30 [PATCH v2 0/3] btrfs: fix COW amplification under memory pressure Leo Martins
2026-02-13 20:30 ` [PATCH v2 1/3] btrfs: skip COW for written extent buffers allocated in current transaction Leo Martins
2026-02-14 1:25 ` Sun YangKai
2026-02-17 22:04 ` Leo Martins [this message]
2026-02-16 12:18 ` Filipe Manana
2026-02-17 21:48 ` Leo Martins
2026-02-13 20:30 ` [PATCH v2 2/3] btrfs: inhibit extent buffer writeback to prevent COW amplification Leo Martins
2026-02-16 12:33 ` Filipe Manana
2026-02-13 20:30 ` [PATCH v2 3/3] btrfs: add tracepoint for COW amplification tracking Leo Martins
2026-02-16 12:40 ` Filipe Manana
2026-02-17 22:06 ` Leo Martins
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260217220440.801494-1-loemra.dev@gmail.com \
--to=loemra.dev@gmail.com \
--cc=kernel-team@fb.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=sunk67188@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.