From: Sun YangKai <sunk67188@gmail.com>
To: Leo Martins <loemra.dev@gmail.com>,
linux-btrfs@vger.kernel.org, kernel-team@fb.com
Cc: Filipe Manana <fdmanana@suse.com>
Subject: Re: [PATCH v3 2/3] btrfs: inhibit extent buffer writeback to prevent COW amplification
Date: Wed, 25 Feb 2026 14:17:24 +0800 [thread overview]
Message-ID: <8b4f2be8-169d-4e10-a47f-b534ab0ebac6@gmail.com> (raw)
In-Reply-To: <cc847a35e26cc4dfad18c59e3c525cea507ff440.1771884128.git.loemra.dev@gmail.com>
On 2026/2/25 03:22, Leo Martins wrote:
> Inhibit writeback on COW'd extent buffers for the lifetime of the
> transaction handle, preventing background writeback from setting
> BTRFS_HEADER_FLAG_WRITTEN and causing unnecessary re-COW.
>
> COW amplification occurs when background writeback flushes an extent
> buffer that a transaction handle is still actively modifying. When
> lock_extent_buffer_for_io() transitions a buffer from dirty to
> writeback, it sets BTRFS_HEADER_FLAG_WRITTEN, marking the block as
> having been persisted to disk at its current bytenr. Once WRITTEN is
> set, should_cow_block() must either COW the block again or overwrite
> it in place, both of which are unnecessary overhead when the buffer
> is still being modified by the same handle that allocated it. By
> inhibiting background writeback on actively-used buffers, WRITTEN is
> never set while a transaction handle holds a reference to the buffer,
> avoiding this overhead entirely.
>
> Add an atomic_t writeback_inhibitors counter to struct extent_buffer,
> which fits in an existing 6-byte hole without increasing struct size.
> When a buffer is COW'd in btrfs_force_cow_block(), call
> btrfs_inhibit_eb_writeback() to store the eb in the transaction
> handle's writeback_inhibited_ebs xarray (keyed by eb->start), take a
> reference, and increment writeback_inhibitors. The function handles
> dedup (same eb inhibited twice by the same handle) and replacement
> (different eb at the same logical address). Allocation failure is
> graceful: the buffer simply falls back to the pre-existing behavior
> where it may be written back and re-COW'd.
>
> In lock_extent_buffer_for_io(), when writeback_inhibitors is non-zero
> and the writeback mode is WB_SYNC_NONE, skip the buffer. WB_SYNC_NONE
> is used by the VM flusher threads for background and periodic
> writeback, which are the only paths that cause COW amplification by
> opportunistically writing out dirty extent buffers mid-transaction.
> Skipping these is safe because the buffers remain dirty in the page
> cache and will be written out at transaction commit time.
>
> WB_SYNC_ALL must always proceed regardless of writeback_inhibitors.
> This is required for correctness in the fsync path: btrfs_sync_log()
> writes log tree blocks via filemap_fdatawrite_range() (WB_SYNC_ALL)
> while the transaction handle that inhibited those same blocks is still
> active. Without the WB_SYNC_ALL bypass, those inhibited log tree
> blocks would be silently skipped, resulting in an incomplete log on
> disk and corruption on replay. btrfs_write_and_wait_transaction()
> also uses WB_SYNC_ALL via filemap_fdatawrite_range(); for that path,
> inhibitors are already cleared beforehand, but the bypass ensures
> correctness regardless.
>
> Uninhibit in __btrfs_end_transaction() before atomic_dec(num_writers)
> to prevent a race where the committer proceeds while buffers are still
> inhibited. Also uninhibit in btrfs_commit_transaction() before writing
> and in cleanup_transaction() for the error path.
>
> Signed-off-by: Leo Martins <loemra.dev@gmail.com>
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
> ---
> fs/btrfs/ctree.c | 4 +++
> fs/btrfs/extent_io.c | 63 +++++++++++++++++++++++++++++++++++++++++-
> fs/btrfs/extent_io.h | 6 ++++
> fs/btrfs/transaction.c | 19 +++++++++++++
> fs/btrfs/transaction.h | 3 ++
> 5 files changed, 94 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 0e02b7b14adc..d4da65bb9096 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -590,6 +590,10 @@ int btrfs_force_cow_block(struct btrfs_trans_handle *trans,
> btrfs_tree_unlock(buf);
> free_extent_buffer_stale(buf);
> btrfs_mark_buffer_dirty(trans, cow);
> +
> + /* Inhibit writeback on the COW'd buffer for this transaction handle. */
> + btrfs_inhibit_eb_writeback(trans, cow);
> +
> *cow_ret = cow;
> return 0;
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index ff1fc699a6ca..e04e42a81978 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1940,7 +1940,9 @@ static noinline_for_stack bool lock_extent_buffer_for_io(struct extent_buffer *e
> * of time.
> */
> spin_lock(&eb->refs_lock);
> - if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
> + if ((wbc->sync_mode == WB_SYNC_ALL ||
> + atomic_read(&eb->writeback_inhibitors) == 0) &&
> + test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
> XA_STATE(xas, &fs_info->buffer_tree, eb->start >> fs_info->nodesize_bits);
> unsigned long flags;
>
> @@ -2999,6 +3001,64 @@ static inline void btrfs_release_extent_buffer(struct extent_buffer *eb)
> kmem_cache_free(extent_buffer_cache, eb);
> }
>
> +/*
> + * btrfs_inhibit_eb_writeback - Inhibit writeback on buffer during transaction.
> + * @trans: transaction handle that will own the inhibitor
> + * @eb: extent buffer to inhibit writeback on
> + *
> + * Attempts to track this extent buffer in the transaction's inhibited set.
> + * If memory allocation fails, the buffer is simply not tracked. It may
> + * be written back and need re-COW, which is the original behavior.
> + * This is acceptable since inhibiting writeback is an optimization.
> + */
> +void btrfs_inhibit_eb_writeback(struct btrfs_trans_handle *trans,
> + struct extent_buffer *eb)
> +{
> + unsigned long index = eb->start >> trans->fs_info->nodesize_bits;
> + void *old;
> +
> + /* Check if already inhibited by this handle. */
> + old = xa_load(&trans->writeback_inhibited_ebs, index);
> + if (old == eb)
> + return;
> +
> + /* Take reference for the xarray entry. */
> + refcount_inc(&eb->refs);
> +
> + old = xa_store(&trans->writeback_inhibited_ebs, index, eb, GFP_NOFS);
> + if (xa_is_err(old)) {
> + /* Allocation failed, just skip inhibiting this buffer. */
> + free_extent_buffer(eb);
> + return;
> + }
> +
> + /* Handle replacement of different eb at same index. */
> + if (old && old != eb) {
> + struct extent_buffer *old_eb = old;
> +
> + atomic_dec(&old_eb->writeback_inhibitors);
> + free_extent_buffer(old_eb);
> + }
> +
> + atomic_inc(&eb->writeback_inhibitors);
> +}
> +
> +/*
> + * btrfs_uninhibit_all_eb_writeback - Uninhibit writeback on all buffers.
> + * @trans: transaction handle to clean up
> + */
> +void btrfs_uninhibit_all_eb_writeback(struct btrfs_trans_handle *trans)
> +{
> + struct extent_buffer *eb;
> + unsigned long index;
> +
> + xa_for_each(&trans->writeback_inhibited_ebs, index, eb) {
> + atomic_dec(&eb->writeback_inhibitors);
> + free_extent_buffer(eb);
> + }
> + xa_destroy(&trans->writeback_inhibited_ebs);
> +}
> +
> static struct extent_buffer *__alloc_extent_buffer(struct btrfs_fs_info *fs_info,
> u64 start)
> {
> @@ -3009,6 +3069,7 @@ static struct extent_buffer *__alloc_extent_buffer(struct btrfs_fs_info *fs_info
> eb->len = fs_info->nodesize;
> eb->fs_info = fs_info;
> init_rwsem(&eb->lock);
> + atomic_set(&eb->writeback_inhibitors, 0);
>
> btrfs_leak_debug_add_eb(eb);
>
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index 73571d5d3d5a..fb68fbd4866c 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -102,6 +102,8 @@ struct extent_buffer {
> /* >= 0 if eb belongs to a log tree, -1 otherwise */
> s8 log_index;
> u8 folio_shift;
> + /* Inhibits WB_SYNC_NONE writeback when > 0. */
> + atomic_t writeback_inhibitors;
I might be missing something here, but I'm curious whether this atomic
counter can ever go above 1. If not, and it's strictly binary, perhaps
using atomic_set(1/0) instead of atomic_inc/dec would make the intent
clearer?
Otherwise looks good. Thanks.
> struct rcu_head rcu_head;
>
> struct rw_semaphore lock;
> @@ -381,4 +383,8 @@ void btrfs_extent_buffer_leak_debug_check(struct btrfs_fs_info *fs_info);
> #define btrfs_extent_buffer_leak_debug_check(fs_info) do {} while (0)
> #endif
>
> +void btrfs_inhibit_eb_writeback(struct btrfs_trans_handle *trans,
> + struct extent_buffer *eb);
> +void btrfs_uninhibit_all_eb_writeback(struct btrfs_trans_handle *trans);
> +
> #endif
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index f4cc9e1a1b93..a9a22629b49d 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -15,6 +15,7 @@
> #include "misc.h"
> #include "ctree.h"
> #include "disk-io.h"
> +#include "extent_io.h"
> #include "transaction.h"
> #include "locking.h"
> #include "tree-log.h"
> @@ -688,6 +689,8 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
> goto alloc_fail;
> }
>
> + xa_init(&h->writeback_inhibited_ebs);
> +
> /*
> * If we are JOIN_NOLOCK we're already committing a transaction and
> * waiting on this guy, so we don't need to do the sb_start_intwrite
> @@ -1083,6 +1086,13 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
> if (trans->type & __TRANS_FREEZABLE)
> sb_end_intwrite(info->sb);
>
> + /*
> + * Uninhibit extent buffer writeback before decrementing num_writers,
> + * since the decrement wakes the committing thread which needs all
> + * buffers uninhibited to write them to disk.
> + */
> + btrfs_uninhibit_all_eb_writeback(trans);
> +
> WARN_ON(cur_trans != info->running_transaction);
> WARN_ON(atomic_read(&cur_trans->num_writers) < 1);
> atomic_dec(&cur_trans->num_writers);
> @@ -2110,6 +2120,7 @@ static void cleanup_transaction(struct btrfs_trans_handle *trans, int err)
> if (!test_bit(BTRFS_FS_RELOC_RUNNING, &fs_info->flags))
> btrfs_scrub_cancel(fs_info);
>
> + btrfs_uninhibit_all_eb_writeback(trans);
> kmem_cache_free(btrfs_trans_handle_cachep, trans);
> }
>
> @@ -2556,6 +2567,14 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
> fs_info->cleaner_kthread)
> wake_up_process(fs_info->cleaner_kthread);
>
> + /*
> + * Uninhibit writeback on all extent buffers inhibited during this
> + * transaction before writing them to disk. Inhibiting prevented
> + * writeback while the transaction was building, but now we need
> + * them written.
> + */
> + btrfs_uninhibit_all_eb_writeback(trans);
> +
> ret = btrfs_write_and_wait_transaction(trans);
> if (unlikely(ret)) {
> btrfs_err(fs_info, "error while writing out transaction: %d", ret);
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index 18ef069197e5..7d70fe486758 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -12,6 +12,7 @@
> #include <linux/time64.h>
> #include <linux/mutex.h>
> #include <linux/wait.h>
> +#include <linux/xarray.h>
> #include "btrfs_inode.h"
> #include "delayed-ref.h"
>
> @@ -162,6 +163,8 @@ struct btrfs_trans_handle {
> struct btrfs_fs_info *fs_info;
> struct list_head new_bgs;
> struct btrfs_block_rsv delayed_rsv;
> + /* Extent buffers with writeback inhibited by this handle. */
> + struct xarray writeback_inhibited_ebs;
> };
>
> /*
next prev parent reply other threads:[~2026-02-25 6:17 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-02-24 19:22 [PATCH v3 0/3] btrfs: fix COW amplification under memory pressure Leo Martins
2026-02-24 19:22 ` [PATCH v3 1/3] btrfs: skip COW for written extent buffers allocated in current transaction Leo Martins
2026-02-25 6:06 ` Sun YangKai
2026-02-24 19:22 ` [PATCH v3 2/3] btrfs: inhibit extent buffer writeback to prevent COW amplification Leo Martins
2026-02-25 6:17 ` Sun YangKai [this message]
2026-02-26 2:30 ` Leo Martins
2026-02-26 5:03 ` Sun YangKai
2026-02-24 19:22 ` [PATCH v3 3/3] btrfs: add tracepoint for search slot restart tracking Leo Martins
2026-02-25 17:38 ` [PATCH v3 0/3] btrfs: fix COW amplification under memory pressure Filipe Manana
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=8b4f2be8-169d-4e10-a47f-b534ab0ebac6@gmail.com \
--to=sunk67188@gmail.com \
--cc=fdmanana@suse.com \
--cc=kernel-team@fb.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=loemra.dev@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox