From: Sun YangKai <sunk67188@gmail.com>
To: Leo Martins <loemra.dev@gmail.com>,
linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH v2 1/3] btrfs: skip COW for written extent buffers allocated in current transaction
Date: Sat, 14 Feb 2026 09:25:03 +0800 [thread overview]
Message-ID: <bcde0730-b3d4-4aa2-88f9-7fee861601cc@gmail.com> (raw)
In-Reply-To: <04eca407999f1db58a4af9f4d88397aa2edd2d3c.1771012202.git.loemra.dev@gmail.com>
Thanks for your working on this and I've expecting this for a long time :)
On 2026/2/14 04:30, Leo Martins wrote:
> When memory pressure causes writeback of a recently COW'd buffer,
> btrfs sets BTRFS_HEADER_FLAG_WRITTEN on it. Subsequent
> btrfs_search_slot() restarts then see the WRITTEN flag and re-COW
> the buffer unnecessarily, causing COW amplification that can exhaust
> block reservations and degrade throughput.
>
> Overwriting in place is crash-safe because the committed superblock
> does not reference buffers allocated in the current (uncommitted)
> transaction, so no on-disk tree points to this block yet.
>
> When should_cow_block() encounters a WRITTEN buffer whose generation
> matches the current transaction, instead of requesting a COW, re-dirty
> the buffer and re-register its range in the transaction's dirty_pages.
>
> Both are necessary because btrfs tracks dirty metadata through two
> independent mechanisms. set_extent_buffer_dirty() sets the
> EXTENT_BUFFER_DIRTY flag and the buffer_tree xarray PAGECACHE_TAG_DIRTY
> mark, which is what background writeback (btree_write_cache_pages) uses
> to find and write dirty buffers. The transaction's dirty_pages io tree
> is a separate structure used by btrfs_write_and_wait_transaction() at
> commit time to ensure all buffers allocated during the transaction are
> persisted. The dirty_pages range was originally registered in
> btrfs_init_new_buffer() when the block was first allocated, but
> background writeback may have already written and cleared it.
>
> Keep BTRFS_HEADER_FLAG_WRITTEN set so that btrfs_free_tree_block()
> correctly pins the block if it is freed later.
>
> Exclude cases where in-place overwrite is not safe:
> - EXTENT_BUFFER_WRITEBACK: buffer is mid-I/O
> - Zoned devices: require sequential writes
> - Log trees: log blocks are immediately referenced by a committed
> superblock via btrfs_sync_log(), so overwriting could corrupt the
> committed log
> - BTRFS_ROOT_FORCE_COW: snapshot in progress
> - BTRFS_HEADER_FLAG_RELOC: block being relocated
>
> Signed-off-by: Leo Martins <loemra.dev@gmail.com>
> ---
> fs/btrfs/ctree.c | 53 +++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 50 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 7267b2502665..a345e1be24d8 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -599,9 +599,9 @@ int btrfs_force_cow_block(struct btrfs_trans_handle *trans,
> return ret;
> }
>
> -static inline bool should_cow_block(const struct btrfs_trans_handle *trans,
> +static inline bool should_cow_block(struct btrfs_trans_handle *trans,
> const struct btrfs_root *root,
> - const struct extent_buffer *buf)
> + struct extent_buffer *buf)
> {
> if (btrfs_is_testing(root->fs_info))
> return false;
> @@ -621,8 +621,55 @@ static inline bool should_cow_block(const struct btrfs_trans_handle *trans,
> if (btrfs_header_generation(buf) != trans->transid)
> return true;
>
> - if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN))
> + if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
> + /*
> + * The buffer was allocated in this transaction and has been
> + * written back to disk (WRITTEN is set). Normally we'd COW
> + * it again, but since the committed superblock doesn't
> + * reference this buffer (it was allocated this transaction),
> + * we can safely overwrite it in place.
> + *
> + * We keep BTRFS_HEADER_FLAG_WRITTEN set. The block has been
> + * persisted at this bytenr and will be again after the
> + * in-place update. This is important so that
> + * btrfs_free_tree_block() correctly pins the block if it is
> + * freed later (e.g., during tree rebalancing or FORCE_COW).
> + *
> + * We re-dirty the buffer to ensure the in-place modifications
> + * will be written back to disk.
> + *
> + * Exclusions:
> + * - Log trees: log blocks are written and immediately
> + * referenced by a committed superblock via
> + * btrfs_sync_log(), bypassing the normal transaction
> + * commit. Overwriting in place could corrupt the
> + * committed log.
> + * - Zoned devices: require sequential writes
> + * - FORCE_COW: snapshot in progress
> + * - RELOC flag: block being relocated
> + */
> + if (!test_bit(EXTENT_BUFFER_WRITEBACK, &buf->bflags) &&
> + !btrfs_is_zoned(root->fs_info) &&
> + btrfs_root_id(root) != BTRFS_TREE_LOG_OBJECTID &&
> + !test_bit(BTRFS_ROOT_FORCE_COW, &root->state) &&
it seems we need smp_mb__before_atomic() to see the FORCE_COW bit?
> + !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)) {
> + /*
> + * Re-register this block's range in the current
> + * transaction's dirty_pages so that
> + * btrfs_write_and_wait_transaction() writes it.
> + * The range was originally registered when the block
> + * was allocated, but that transaction's dirty_pages
> + * may have already been released.
> + */
> + btrfs_set_extent_bit(&trans->transaction->dirty_pages,
> + buf->start,
> + buf->start + buf->len - 1,
> + EXTENT_DIRTY, NULL);
> + set_extent_buffer_dirty(buf);
why use set_extent_buffer_dirty() instead of btrfs_mark_buffer_dirty()?
I don't see any other callers doing this.
> + return false;
> + }
> return true;
> + }
>
> /* Ensure we can see the FORCE_COW bit. */
> smp_mb__before_atomic();
And I wonder if we could have something more readable like this:
if (btrfs_header_generation(buf) != trans->transid)
return true;
if (test_bit(EXTENT_BUFFER_WRITEBACK, &buf->bflags))
return true;
if (btrfs_root_id(root) != BTRFS_TREE_RELOC_OBJECTID &&
btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC))
return true;
/* Ensure we can see the FORCE_COW bit. */
smp_mb__before_atomic();
if (test_bit(BTRFS_ROOT_FORCE_COW, &root->state))
return true;
if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
if (btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID ||
btrfs_is_zoned(root->fs_info))
return true;
btrfs_set_extent_bit(&trans->transaction->dirty_pages,
buf->start,
buf->start + buf->len - 1,
EXTENT_DIRTY, NULL);
btrfs_mark_buffer_dirty(trans, buf);
}
return false;
Thanks,
Sun YangKai
next prev parent reply other threads:[~2026-02-14 1:25 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-02-13 20:30 [PATCH v2 0/3] btrfs: fix COW amplification under memory pressure Leo Martins
2026-02-13 20:30 ` [PATCH v2 1/3] btrfs: skip COW for written extent buffers allocated in current transaction Leo Martins
2026-02-14 1:25 ` Sun YangKai [this message]
2026-02-17 22:04 ` Leo Martins
2026-02-16 12:18 ` Filipe Manana
2026-02-17 21:48 ` Leo Martins
2026-02-13 20:30 ` [PATCH v2 2/3] btrfs: inhibit extent buffer writeback to prevent COW amplification Leo Martins
2026-02-16 12:33 ` Filipe Manana
2026-02-13 20:30 ` [PATCH v2 3/3] btrfs: add tracepoint for COW amplification tracking Leo Martins
2026-02-16 12:40 ` Filipe Manana
2026-02-17 22:06 ` Leo Martins
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=bcde0730-b3d4-4aa2-88f9-7fee861601cc@gmail.com \
--to=sunk67188@gmail.com \
--cc=kernel-team@fb.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=loemra.dev@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox