public inbox for linux-btrfs@vger.kernel.org
From: Sun YangKai <sunk67188@gmail.com>
To: Leo Martins <loemra.dev@gmail.com>,
	linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH v3 1/3] btrfs: skip COW for written extent buffers allocated in current transaction
Date: Wed, 25 Feb 2026 14:06:51 +0800	[thread overview]
Message-ID: <060df404-3e39-4637-88f4-ffd42fcddc26@gmail.com> (raw)
In-Reply-To: <4ce911a475b998ddf76951629ad203e6440ab0ca.1771884128.git.loemra.dev@gmail.com>



On 2026/2/25 03:22, Leo Martins wrote:
> When memory pressure causes writeback of a recently COW'd buffer,
> btrfs sets BTRFS_HEADER_FLAG_WRITTEN on it. Subsequent
> btrfs_search_slot() restarts then see the WRITTEN flag and re-COW
> the buffer unnecessarily, causing COW amplification that can exhaust
> block reservations and degrade throughput.
> 
> Overwriting in place is crash-safe because the committed superblock
> does not reference buffers allocated in the current (uncommitted)
> transaction, so no on-disk tree points to this block yet.
> 
> When should_cow_block() encounters a WRITTEN buffer whose generation
> matches the current transaction, instead of requesting a COW, re-dirty
> the buffer and re-register its range in the transaction's dirty_pages.
> 
> Both are necessary because btrfs tracks dirty metadata through two
> independent mechanisms. set_extent_buffer_dirty() sets the
> EXTENT_BUFFER_DIRTY flag and the buffer_tree xarray PAGECACHE_TAG_DIRTY
> mark, which is what background writeback (btree_write_cache_pages) uses
> to find and write dirty buffers. The transaction's dirty_pages io tree
> is a separate structure used by btrfs_write_and_wait_transaction() at
> commit time to ensure all buffers allocated during the transaction are
> persisted. The dirty_pages range was originally registered in
> btrfs_init_new_buffer() when the block was first allocated. Normally
> dirty_pages is only cleared at commit time by
> btrfs_write_and_wait_transaction(), but if qgroups are enabled and
> snapshots are being created, qgroup_account_snapshot() may have already
> called btrfs_write_and_wait_transaction() and released the range before
> the final commit-time call.
> 
> Keep BTRFS_HEADER_FLAG_WRITTEN set so that btrfs_free_tree_block()
> correctly pins the block if it is freed later.
> 
> Relax the lockdep assertion in btrfs_mark_buffer_dirty() from
> btrfs_assert_tree_write_locked() to lockdep_assert_held() so that it
> accepts either a read or write lock. should_cow_block() may be called
> from btrfs_search_slot() when only a read lock is held (nodes above
> write_lock_level are read-locked). The write lock assertion previously
> documented the caller convention that buffer content was being modified
> under exclusive access, but btrfs_mark_buffer_dirty() and
> set_extent_buffer_dirty() themselves only perform independently
> synchronized operations: atomic bit ops on bflags, folio_mark_dirty()
> (kernel-internal folio locking), xarray mark updates (xarray spinlock),
> and percpu counter updates. The read lock is sufficient because it
> prevents lock_extent_buffer_for_io() from acquiring the write lock and
> racing on the dirty state. Since rw_semaphore permits concurrent
> readers, multiple threads can enter btrfs_mark_buffer_dirty()
> simultaneously for the same buffer; this is safe because
> test_and_set_bit(EXTENT_BUFFER_DIRTY) ensures only one thread performs
> the full dirty state transition.
> 
> Remove the CONFIG_BTRFS_DEBUG assertion in set_extent_buffer_dirty()
> that checked folio_test_dirty() after marking the buffer dirty. This
> assertion assumed exclusive access (only one thread in
> set_extent_buffer_dirty() at a time), which held when the only caller
> was btrfs_mark_buffer_dirty() under write lock. With concurrent readers
> calling through should_cow_block(), a thread that loses the
> test_and_set_bit race sees was_dirty=true and skips the folio dirty
> marking, but the winning thread may not have called
> btrfs_meta_folio_set_dirty() yet, causing the assertion to fire. This
> is a benign race: the winning thread will complete the folio dirty
> marking, and no writeback can clear it while readers hold their locks.
> 
> Hoist the EXTENT_BUFFER_WRITEBACK, BTRFS_HEADER_FLAG_RELOC, and
> BTRFS_ROOT_FORCE_COW checks before the WRITTEN block since they apply
> regardless of whether the buffer has been written back. This
> consolidates the exclusion logic and simplifies the WRITTEN path to
> only handle log trees and zoned devices. Moving the RELOC checks
> before the smp_mb__before_atomic() barrier is safe because both
> btrfs_root_id() (immutable) and BTRFS_HEADER_FLAG_RELOC (set at COW
> time under tree lock) are stable values not subject to concurrent
> modification; the barrier is only needed for BTRFS_ROOT_FORCE_COW
> which is set concurrently by create_pending_snapshot().
> 
> Exclude cases where in-place overwrite is not safe:
>   - EXTENT_BUFFER_WRITEBACK: buffer is mid-I/O
>   - Zoned devices: require sequential writes
>   - Log trees: log blocks are immediately referenced by a committed
>     superblock via btrfs_sync_log(), so overwriting could corrupt the
>     committed log
>   - BTRFS_ROOT_FORCE_COW: snapshot in progress
>   - BTRFS_HEADER_FLAG_RELOC: block being relocated
> 
> Signed-off-by: Leo Martins <loemra.dev@gmail.com>

Thanks! Looks good now. With this version should_cow_block() only
early-returns true, which makes the overall logic easier to follow.

And the commit message and comments are quite detailed and helpful.

Reviewed-by: Sun YangKai <sunk67188@gmail.com>

> ---
>   fs/btrfs/ctree.c     | 56 ++++++++++++++++++++++++++++++++++++++------
>   fs/btrfs/disk-io.c   |  2 +-
>   fs/btrfs/extent_io.c |  4 ----
>   3 files changed, 50 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 7267b2502665..0e02b7b14adc 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -599,9 +599,9 @@ int btrfs_force_cow_block(struct btrfs_trans_handle *trans,
>   	return ret;
>   }
>   
> -static inline bool should_cow_block(const struct btrfs_trans_handle *trans,
> +static inline bool should_cow_block(struct btrfs_trans_handle *trans,
>   				    const struct btrfs_root *root,
> -				    const struct extent_buffer *buf)
> +				    struct extent_buffer *buf)
>   {
>   	if (btrfs_is_testing(root->fs_info))
>   		return false;
> @@ -621,7 +621,11 @@ static inline bool should_cow_block(const struct btrfs_trans_handle *trans,
>   	if (btrfs_header_generation(buf) != trans->transid)
>   		return true;
>   
> -	if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN))
> +	if (test_bit(EXTENT_BUFFER_WRITEBACK, &buf->bflags))
> +		return true;
> +
> +	if (btrfs_root_id(root) != BTRFS_TREE_RELOC_OBJECTID &&
> +	    btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC))
>   		return true;
>   
>   	/* Ensure we can see the FORCE_COW bit. */
> @@ -629,11 +633,49 @@ static inline bool should_cow_block(const struct btrfs_trans_handle *trans,
>   	if (test_bit(BTRFS_ROOT_FORCE_COW, &root->state))
>   		return true;
>   
> -	if (btrfs_root_id(root) == BTRFS_TREE_RELOC_OBJECTID)
> -		return false;
> +	if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
> +		/*
> +		 * The buffer was allocated in this transaction and has been
> +		 * written back to disk (WRITTEN is set). Normally we'd COW
> +		 * it again, but since the committed superblock doesn't
> +		 * reference this buffer (it was allocated in this transaction),
> +		 * we can safely overwrite it in place.
> +		 *
> +		 * We keep BTRFS_HEADER_FLAG_WRITTEN set. The block has been
> +		 * persisted at this bytenr and will be again after the
> +		 * in-place update. This is important so that
> +		 * btrfs_free_tree_block() correctly pins the block if it is
> +		 * freed later (e.g., during tree rebalancing or FORCE_COW).
> +		 *
> +		 * Log trees and zoned devices cannot use this optimization:
> +		 * - Log trees: log blocks are written and immediately
> +		 *   referenced by a committed superblock via
> +		 *   btrfs_sync_log(), bypassing the normal transaction
> +		 *   commit. Overwriting in place could corrupt the
> +		 *   committed log.
> +		 * - Zoned devices: require sequential writes.
> +		 */
> +		if (btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID ||
> +		    btrfs_is_zoned(root->fs_info))
> +			return true;
>   
> -	if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC))
> -		return true;
> +		/*
> +		 * Re-register this block's range in the current transaction's
> +		 * dirty_pages so that btrfs_write_and_wait_transaction()
> +		 * writes it. The range was originally registered when the
> +		 * block was allocated. Normally dirty_pages is only cleared
> +		 * at commit time by btrfs_write_and_wait_transaction(), but
> +		 * if qgroups are enabled and snapshots are being created,
> +		 * qgroup_account_snapshot() may have already called
> +		 * btrfs_write_and_wait_transaction() and released the range
> +		 * before the final commit-time call.
> +		 */
> +		btrfs_set_extent_bit(&trans->transaction->dirty_pages,
> +				     buf->start,
> +				     buf->start + buf->len - 1,
> +				     EXTENT_DIRTY, NULL);
> +		btrfs_mark_buffer_dirty(trans, buf);
> +	}
>   
>   	return false;
>   }
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 32fffb0557e5..bee8f76fbfea 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -4491,7 +4491,7 @@ void btrfs_mark_buffer_dirty(struct btrfs_trans_handle *trans,
>   #endif
>   	/* This is an active transaction (its state < TRANS_STATE_UNBLOCKED). */
>   	ASSERT(trans->transid == fs_info->generation);
> -	btrfs_assert_tree_write_locked(buf);
> +	lockdep_assert_held(&buf->lock);
>   	if (unlikely(transid != fs_info->generation)) {
>   		btrfs_abort_transaction(trans, -EUCLEAN);
>   		btrfs_crit(fs_info,
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index dfc17c292217..ff1fc699a6ca 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3791,10 +3791,6 @@ void set_extent_buffer_dirty(struct extent_buffer *eb)
>   					 eb->len,
>   					 eb->fs_info->dirty_metadata_batch);
>   	}
> -#ifdef CONFIG_BTRFS_DEBUG
> -	for (int i = 0; i < num_extent_folios(eb); i++)
> -		ASSERT(folio_test_dirty(eb->folios[i]));
> -#endif
>   }
>   
>   void clear_extent_buffer_uptodate(struct extent_buffer *eb)



Thread overview: 9+ messages
2026-02-24 19:22 [PATCH v3 0/3] btrfs: fix COW amplification under memory pressure Leo Martins
2026-02-24 19:22 ` [PATCH v3 1/3] btrfs: skip COW for written extent buffers allocated in current transaction Leo Martins
2026-02-25  6:06   ` Sun YangKai [this message]
2026-02-24 19:22 ` [PATCH v3 2/3] btrfs: inhibit extent buffer writeback to prevent COW amplification Leo Martins
2026-02-25  6:17   ` Sun YangKai
2026-02-26  2:30     ` Leo Martins
2026-02-26  5:03       ` Sun YangKai
2026-02-24 19:22 ` [PATCH v3 3/3] btrfs: add tracepoint for search slot restart tracking Leo Martins
2026-02-25 17:38 ` [PATCH v3 0/3] btrfs: fix COW amplification under memory pressure Filipe Manana