public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Sun YangKai <sunk67188@gmail.com>
To: Leo Martins <loemra.dev@gmail.com>,
	linux-btrfs@vger.kernel.org, kernel-team@fb.com
Cc: Filipe Manana <fdmanana@suse.com>
Subject: Re: [PATCH v3 2/3] btrfs: inhibit extent buffer writeback to prevent COW amplification
Date: Wed, 25 Feb 2026 14:17:24 +0800	[thread overview]
Message-ID: <8b4f2be8-169d-4e10-a47f-b534ab0ebac6@gmail.com> (raw)
In-Reply-To: <cc847a35e26cc4dfad18c59e3c525cea507ff440.1771884128.git.loemra.dev@gmail.com>



On 2026/2/25 03:22, Leo Martins wrote:
> Inhibit writeback on COW'd extent buffers for the lifetime of the
> transaction handle, preventing background writeback from setting
> BTRFS_HEADER_FLAG_WRITTEN and causing unnecessary re-COW.
> 
> COW amplification occurs when background writeback flushes an extent
> buffer that a transaction handle is still actively modifying. When
> lock_extent_buffer_for_io() transitions a buffer from dirty to
> writeback, it sets BTRFS_HEADER_FLAG_WRITTEN, marking the block as
> having been persisted to disk at its current bytenr. Once WRITTEN is
> set, should_cow_block() must either COW the block again or overwrite
> it in place, both of which are unnecessary overhead when the buffer
> is still being modified by the same handle that allocated it. By
> inhibiting background writeback on actively-used buffers, WRITTEN is
> never set while a transaction handle holds a reference to the buffer,
> avoiding this overhead entirely.
> 
> Add an atomic_t writeback_inhibitors counter to struct extent_buffer,
> which fits in an existing 6-byte hole without increasing struct size.
> When a buffer is COW'd in btrfs_force_cow_block(), call
> btrfs_inhibit_eb_writeback() to store the eb in the transaction
> handle's writeback_inhibited_ebs xarray (keyed by eb->start), take a
> reference, and increment writeback_inhibitors. The function handles
> dedup (same eb inhibited twice by the same handle) and replacement
> (different eb at the same logical address). Allocation failure is
> graceful: the buffer simply falls back to the pre-existing behavior
> where it may be written back and re-COW'd.
> 
> In lock_extent_buffer_for_io(), when writeback_inhibitors is non-zero
> and the writeback mode is WB_SYNC_NONE, skip the buffer. WB_SYNC_NONE
> is used by the VM flusher threads for background and periodic
> writeback, which are the only paths that cause COW amplification by
> opportunistically writing out dirty extent buffers mid-transaction.
> Skipping these is safe because the buffers remain dirty in the page
> cache and will be written out at transaction commit time.
> 
> WB_SYNC_ALL must always proceed regardless of writeback_inhibitors.
> This is required for correctness in the fsync path: btrfs_sync_log()
> writes log tree blocks via filemap_fdatawrite_range() (WB_SYNC_ALL)
> while the transaction handle that inhibited those same blocks is still
> active. Without the WB_SYNC_ALL bypass, those inhibited log tree
> blocks would be silently skipped, resulting in an incomplete log on
> disk and corruption on replay. btrfs_write_and_wait_transaction()
> also uses WB_SYNC_ALL via filemap_fdatawrite_range(); for that path,
> inhibitors are already cleared beforehand, but the bypass ensures
> correctness regardless.
> 
> Uninhibit in __btrfs_end_transaction() before atomic_dec(num_writers)
> to prevent a race where the committer proceeds while buffers are still
> inhibited. Also uninhibit in btrfs_commit_transaction() before writing
> and in cleanup_transaction() for the error path.
> 
> Signed-off-by: Leo Martins <loemra.dev@gmail.com>
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
> ---
>   fs/btrfs/ctree.c       |  4 +++
>   fs/btrfs/extent_io.c   | 63 +++++++++++++++++++++++++++++++++++++++++-
>   fs/btrfs/extent_io.h   |  6 ++++
>   fs/btrfs/transaction.c | 19 +++++++++++++
>   fs/btrfs/transaction.h |  3 ++
>   5 files changed, 94 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 0e02b7b14adc..d4da65bb9096 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -590,6 +590,10 @@ int btrfs_force_cow_block(struct btrfs_trans_handle *trans,
>   		btrfs_tree_unlock(buf);
>   	free_extent_buffer_stale(buf);
>   	btrfs_mark_buffer_dirty(trans, cow);
> +
> +	/* Inhibit writeback on the COW'd buffer for this transaction handle. */
> +	btrfs_inhibit_eb_writeback(trans, cow);
> +
>   	*cow_ret = cow;
>   	return 0;
>   
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index ff1fc699a6ca..e04e42a81978 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1940,7 +1940,9 @@ static noinline_for_stack bool lock_extent_buffer_for_io(struct extent_buffer *e
>   	 * of time.
>   	 */
>   	spin_lock(&eb->refs_lock);
> -	if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
> +	if ((wbc->sync_mode == WB_SYNC_ALL ||
> +	     atomic_read(&eb->writeback_inhibitors) == 0) &&
> +	    test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
>   		XA_STATE(xas, &fs_info->buffer_tree, eb->start >> fs_info->nodesize_bits);
>   		unsigned long flags;
>   
> @@ -2999,6 +3001,64 @@ static inline void btrfs_release_extent_buffer(struct extent_buffer *eb)
>   	kmem_cache_free(extent_buffer_cache, eb);
>   }
>   
> +/*
> + * btrfs_inhibit_eb_writeback - Inhibit writeback on buffer during transaction.
> + * @trans: transaction handle that will own the inhibitor
> + * @eb: extent buffer to inhibit writeback on
> + *
> + * Attempts to track this extent buffer in the transaction's inhibited set.
> + * If memory allocation fails, the buffer is simply not tracked. It may
> + * be written back and need re-COW, which is the original behavior.
> + * This is acceptable since inhibiting writeback is an optimization.
> + */
> +void btrfs_inhibit_eb_writeback(struct btrfs_trans_handle *trans,
> +				struct extent_buffer *eb)
> +{
> +	unsigned long index = eb->start >> trans->fs_info->nodesize_bits;
> +	void *old;
> +
> +	/* Check if already inhibited by this handle. */
> +	old = xa_load(&trans->writeback_inhibited_ebs, index);
> +	if (old == eb)
> +		return;
> +
> +	/* Take reference for the xarray entry. */
> +	refcount_inc(&eb->refs);
> +
> +	old = xa_store(&trans->writeback_inhibited_ebs, index, eb, GFP_NOFS);
> +	if (xa_is_err(old)) {
> +		/* Allocation failed, just skip inhibiting this buffer. */
> +		free_extent_buffer(eb);
> +		return;
> +	}
> +
> +	/* Handle replacement of different eb at same index. */
> +	if (old && old != eb) {
> +		struct extent_buffer *old_eb = old;
> +
> +		atomic_dec(&old_eb->writeback_inhibitors);
> +		free_extent_buffer(old_eb);
> +	}
> +
> +	atomic_inc(&eb->writeback_inhibitors);
> +}
> +
> +/*
> + * btrfs_uninhibit_all_eb_writeback - Uninhibit writeback on all buffers.
> + * @trans: transaction handle to clean up
> + */
> +void btrfs_uninhibit_all_eb_writeback(struct btrfs_trans_handle *trans)
> +{
> +	struct extent_buffer *eb;
> +	unsigned long index;
> +
> +	xa_for_each(&trans->writeback_inhibited_ebs, index, eb) {
> +		atomic_dec(&eb->writeback_inhibitors);
> +		free_extent_buffer(eb);
> +	}
> +	xa_destroy(&trans->writeback_inhibited_ebs);
> +}
> +
>   static struct extent_buffer *__alloc_extent_buffer(struct btrfs_fs_info *fs_info,
>   						   u64 start)
>   {
> @@ -3009,6 +3069,7 @@ static struct extent_buffer *__alloc_extent_buffer(struct btrfs_fs_info *fs_info
>   	eb->len = fs_info->nodesize;
>   	eb->fs_info = fs_info;
>   	init_rwsem(&eb->lock);
> +	atomic_set(&eb->writeback_inhibitors, 0);
>   
>   	btrfs_leak_debug_add_eb(eb);
>   
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index 73571d5d3d5a..fb68fbd4866c 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -102,6 +102,8 @@ struct extent_buffer {
>   	/* >= 0 if eb belongs to a log tree, -1 otherwise */
>   	s8 log_index;
>   	u8 folio_shift;
> +	/* Inhibits WB_SYNC_NONE writeback when > 0. */
> +	atomic_t writeback_inhibitors;

I might be missing something here, but I'm curious whether this atomic 
counter can ever go above 1. If not, and it's strictly binary, perhaps 
using atomic_set(1/0) instead of atomic_inc/dec would make the intent 
clearer?

Otherwise looks good. Thanks.

>   	struct rcu_head rcu_head;
>   
>   	struct rw_semaphore lock;
> @@ -381,4 +383,8 @@ void btrfs_extent_buffer_leak_debug_check(struct btrfs_fs_info *fs_info);
>   #define btrfs_extent_buffer_leak_debug_check(fs_info)	do {} while (0)
>   #endif
>   
> +void btrfs_inhibit_eb_writeback(struct btrfs_trans_handle *trans,
> +			       struct extent_buffer *eb);
> +void btrfs_uninhibit_all_eb_writeback(struct btrfs_trans_handle *trans);
> +
>   #endif
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index f4cc9e1a1b93..a9a22629b49d 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -15,6 +15,7 @@
>   #include "misc.h"
>   #include "ctree.h"
>   #include "disk-io.h"
> +#include "extent_io.h"
>   #include "transaction.h"
>   #include "locking.h"
>   #include "tree-log.h"
> @@ -688,6 +689,8 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
>   		goto alloc_fail;
>   	}
>   
> +	xa_init(&h->writeback_inhibited_ebs);
> +
>   	/*
>   	 * If we are JOIN_NOLOCK we're already committing a transaction and
>   	 * waiting on this guy, so we don't need to do the sb_start_intwrite
> @@ -1083,6 +1086,13 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
>   	if (trans->type & __TRANS_FREEZABLE)
>   		sb_end_intwrite(info->sb);
>   
> +	/*
> +	 * Uninhibit extent buffer writeback before decrementing num_writers,
> +	 * since the decrement wakes the committing thread which needs all
> +	 * buffers uninhibited to write them to disk.
> +	 */
> +	btrfs_uninhibit_all_eb_writeback(trans);
> +
>   	WARN_ON(cur_trans != info->running_transaction);
>   	WARN_ON(atomic_read(&cur_trans->num_writers) < 1);
>   	atomic_dec(&cur_trans->num_writers);
> @@ -2110,6 +2120,7 @@ static void cleanup_transaction(struct btrfs_trans_handle *trans, int err)
>   	if (!test_bit(BTRFS_FS_RELOC_RUNNING, &fs_info->flags))
>   		btrfs_scrub_cancel(fs_info);
>   
> +	btrfs_uninhibit_all_eb_writeback(trans);
>   	kmem_cache_free(btrfs_trans_handle_cachep, trans);
>   }
>   
> @@ -2556,6 +2567,14 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>   	    fs_info->cleaner_kthread)
>   		wake_up_process(fs_info->cleaner_kthread);
>   
> +	/*
> +	 * Uninhibit writeback on all extent buffers inhibited during this
> +	 * transaction before writing them to disk. Inhibiting prevented
> +	 * writeback while the transaction was building, but now we need
> +	 * them written.
> +	 */
> +	btrfs_uninhibit_all_eb_writeback(trans);
> +
>   	ret = btrfs_write_and_wait_transaction(trans);
>   	if (unlikely(ret)) {
>   		btrfs_err(fs_info, "error while writing out transaction: %d", ret);
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index 18ef069197e5..7d70fe486758 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -12,6 +12,7 @@
>   #include <linux/time64.h>
>   #include <linux/mutex.h>
>   #include <linux/wait.h>
> +#include <linux/xarray.h>
>   #include "btrfs_inode.h"
>   #include "delayed-ref.h"
>   
> @@ -162,6 +163,8 @@ struct btrfs_trans_handle {
>   	struct btrfs_fs_info *fs_info;
>   	struct list_head new_bgs;
>   	struct btrfs_block_rsv delayed_rsv;
> +	/* Extent buffers with writeback inhibited by this handle. */
> +	struct xarray writeback_inhibited_ebs;
>   };
>   
>   /*


  reply	other threads:[~2026-02-25  6:17 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-24 19:22 [PATCH v3 0/3] btrfs: fix COW amplification under memory pressure Leo Martins
2026-02-24 19:22 ` [PATCH v3 1/3] btrfs: skip COW for written extent buffers allocated in current transaction Leo Martins
2026-02-25  6:06   ` Sun YangKai
2026-02-24 19:22 ` [PATCH v3 2/3] btrfs: inhibit extent buffer writeback to prevent COW amplification Leo Martins
2026-02-25  6:17   ` Sun YangKai [this message]
2026-02-26  2:30     ` Leo Martins
2026-02-26  5:03       ` Sun YangKai
2026-02-24 19:22 ` [PATCH v3 3/3] btrfs: add tracepoint for search slot restart tracking Leo Martins
2026-02-25 17:38 ` [PATCH v3 0/3] btrfs: fix COW amplification under memory pressure Filipe Manana

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8b4f2be8-169d-4e10-a47f-b534ab0ebac6@gmail.com \
    --to=sunk67188@gmail.com \
    --cc=fdmanana@suse.com \
    --cc=kernel-team@fb.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=loemra.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox