* [PATCH v2 0/3] btrfs: fix COW amplification under memory pressure
@ 2026-02-13 20:30 Leo Martins
2026-02-13 20:30 ` [PATCH v2 1/3] btrfs: skip COW for written extent buffers allocated in current transaction Leo Martins
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Leo Martins @ 2026-02-13 20:30 UTC (permalink / raw)
To: linux-btrfs, kernel-team
I've been investigating a pattern where COW amplification under memory
pressure exhausts the transaction block reserve, leading to spurious
ENOSPCs on filesystems with plenty of unallocated space.
The pattern is:
1. btrfs_search_slot() begins tree traversal with cow=1
2. Node at level N needs COW (old generation or WRITTEN flag set)
3. btrfs_cow_block() allocates new block, copies data, updates
parent pointer
4. Traversal hits a condition requiring restart (node not cached,
lock contention, need higher write_lock_level)
5. btrfs_release_path() releases all locks and references
6. Memory pressure triggers writeback on the COW'd block
7. lock_extent_buffer_for_io() clears EXTENT_BUFFER_DIRTY and sets
BTRFS_HEADER_FLAG_WRITTEN
8. goto again - traversal restarts from root
9. Traversal reaches the same tree node
10. should_cow_block() sees WRITTEN flag, returns true
11. btrfs_cow_block() allocates yet another new block, copies data,
updates parent pointer again, consuming another reservation
12. Steps 4-11 repeat under sustained memory pressure
This series fixes the problem with two complementary approaches:
Patch 1 implements Filipe's suggestion of overwriting in place. When
should_cow_block() encounters a WRITTEN buffer whose generation matches
the current transaction, it re-dirties the buffer and returns false
instead of requesting a COW. This is crash-safe because the committed
superblock does not reference buffers allocated in the current
transaction. Log trees, zoned devices, FORCE_COW, relocation, and
buffers mid-writeback are excluded. Beyond fixing the amplification bug,
this is a general optimization for the entire transaction lifetime: any
time writeback runs during a transaction, revisiting the same path no
longer triggers unnecessary COW, reducing extent allocation overhead,
memory copies,
and space usage per transaction. This is especially valuable now that
Qu's recent patch removed the btree_writepages 32M dirty threshold,
allowing external callers (mm, cgroup) to trigger metadata writeback
at any dirty level.
Patch 2 inhibits writeback on COW'd buffers for the lifetime of the
transaction handle. This prevents WRITTEN from being set while a
handle holds a reference to the buffer, avoiding unnecessary re-COW
at its source. Only WB_SYNC_NONE (background/periodic flusher
writeback) is inhibited. WB_SYNC_ALL (data-integrity writeback from
btrfs_sync_log() and btrfs_write_and_wait_transaction()) always
proceeds, which is required for correctness in the fsync path where
log tree blocks must be written while the inhibiting handle is still
active.
Both approaches are independently useful. The overwrite path is a
general optimization that eliminates unnecessary COW across the entire
transaction, not just within a single handle. Writeback inhibition
prevents the problem from occurring in the first place and is the
only fix for log trees and zoned devices where overwrite does not
apply. Together they provide defense in depth against COW
amplification.
Patch 3 adds a tracepoint for tracking cow_count per search_slot
call.
Note: Boris's AS_KERNEL_FILE changes prevent cgroup-scoped reclaim
from targeting extent buffer pages, making this harder to trigger via
per-cgroup memory limits. I was able to reproduce the amplification
using global memory pressure (drop_caches, root cgroup memory.reclaim,
memory hog) with concurrent filesystem operations.
Benchmark: concurrent filesystem operations under aggressive global
memory pressure. COW counts measured via bpftrace.
Approach                    Max COW count   Num operations
--------------------------  -------------   --------------
Baseline                    35              -
Overwrite only              19              ~same
Writeback inhibition only   5               ~2x
Combined (this series)      4               ~2x
Changes since v1:
The v1 patch used a per-btrfs_search_slot() xarray to track COW'd
buffers. Filipe pointed out this was too complex and too narrow in
scope, and suggested overwriting in place instead. Qu raised the
concern that other btrfs_cow_block() callers outside of search_slot
would not be covered. Boris suggested transaction-handle-level scope
as an alternative.
v2 implements both overwrite-in-place and writeback inhibition at
the transaction handle level. The per-search-slot xarray is replaced
by a per-transaction-handle xarray, which covers all
btrfs_force_cow_block() callers. Log trees are excluded from overwrite
because btrfs_sync_log() immediately commits a superblock referencing
them; they are covered by writeback inhibition instead. The
EXTENT_BUFFER_WRITEBACK check addresses Sun YangKai's observation
about races with in-flight I/O.
Leo Martins (3):
btrfs: skip COW for written extent buffers allocated in current
transaction
btrfs: inhibit extent buffer writeback to prevent COW amplification
btrfs: add tracepoint for COW amplification tracking
fs/btrfs/ctree.c | 72 +++++++++++++++++++++++++++++++++---
fs/btrfs/extent_io.c | 62 ++++++++++++++++++++++++++++++-
fs/btrfs/extent_io.h | 5 +++
fs/btrfs/transaction.c | 19 ++++++++++
fs/btrfs/transaction.h | 2 +
include/trace/events/btrfs.h | 26 +++++++++++++
6 files changed, 180 insertions(+), 6 deletions(-)
--
2.47.3
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH v2 1/3] btrfs: skip COW for written extent buffers allocated in current transaction
2026-02-13 20:30 [PATCH v2 0/3] btrfs: fix COW amplification under memory pressure Leo Martins
@ 2026-02-13 20:30 ` Leo Martins
2026-02-14 1:25 ` Sun YangKai
2026-02-16 12:18 ` Filipe Manana
2026-02-13 20:30 ` [PATCH v2 2/3] btrfs: inhibit extent buffer writeback to prevent COW amplification Leo Martins
2026-02-13 20:30 ` [PATCH v2 3/3] btrfs: add tracepoint for COW amplification tracking Leo Martins
2 siblings, 2 replies; 11+ messages in thread
From: Leo Martins @ 2026-02-13 20:30 UTC (permalink / raw)
To: linux-btrfs, kernel-team
When memory pressure causes writeback of a recently COW'd buffer,
btrfs sets BTRFS_HEADER_FLAG_WRITTEN on it. Subsequent
btrfs_search_slot() restarts then see the WRITTEN flag and re-COW
the buffer unnecessarily, causing COW amplification that can exhaust
block reservations and degrade throughput.
Overwriting in place is crash-safe because the committed superblock
does not reference buffers allocated in the current (uncommitted)
transaction, so no on-disk tree points to this block yet.
When should_cow_block() encounters a WRITTEN buffer whose generation
matches the current transaction, instead of requesting a COW, re-dirty
the buffer and re-register its range in the transaction's dirty_pages.
Both are necessary because btrfs tracks dirty metadata through two
independent mechanisms. set_extent_buffer_dirty() sets the
EXTENT_BUFFER_DIRTY flag and the buffer_tree xarray PAGECACHE_TAG_DIRTY
mark, which is what background writeback (btree_write_cache_pages) uses
to find and write dirty buffers. The transaction's dirty_pages io tree
is a separate structure used by btrfs_write_and_wait_transaction() at
commit time to ensure all buffers allocated during the transaction are
persisted. The dirty_pages range was originally registered in
btrfs_init_new_buffer() when the block was first allocated, but
background writeback may have already written and cleared it.
Keep BTRFS_HEADER_FLAG_WRITTEN set so that btrfs_free_tree_block()
correctly pins the block if it is freed later.
Exclude cases where in-place overwrite is not safe:
- EXTENT_BUFFER_WRITEBACK: buffer is mid-I/O
- Zoned devices: require sequential writes
- Log trees: log blocks are immediately referenced by a committed
superblock via btrfs_sync_log(), so overwriting could corrupt the
committed log
- BTRFS_ROOT_FORCE_COW: snapshot in progress
- BTRFS_HEADER_FLAG_RELOC: block being relocated
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
---
fs/btrfs/ctree.c | 53 +++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 50 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 7267b2502665..a345e1be24d8 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -599,9 +599,9 @@ int btrfs_force_cow_block(struct btrfs_trans_handle *trans,
return ret;
}
-static inline bool should_cow_block(const struct btrfs_trans_handle *trans,
+static inline bool should_cow_block(struct btrfs_trans_handle *trans,
const struct btrfs_root *root,
- const struct extent_buffer *buf)
+ struct extent_buffer *buf)
{
if (btrfs_is_testing(root->fs_info))
return false;
@@ -621,8 +621,55 @@ static inline bool should_cow_block(const struct btrfs_trans_handle *trans,
if (btrfs_header_generation(buf) != trans->transid)
return true;
- if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN))
+ if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
+ /*
+ * The buffer was allocated in this transaction and has been
+ * written back to disk (WRITTEN is set). Normally we'd COW
+ * it again, but since the committed superblock doesn't
+ * reference this buffer (it was allocated this transaction),
+ * we can safely overwrite it in place.
+ *
+ * We keep BTRFS_HEADER_FLAG_WRITTEN set. The block has been
+ * persisted at this bytenr and will be again after the
+ * in-place update. This is important so that
+ * btrfs_free_tree_block() correctly pins the block if it is
+ * freed later (e.g., during tree rebalancing or FORCE_COW).
+ *
+ * We re-dirty the buffer to ensure the in-place modifications
+ * will be written back to disk.
+ *
+ * Exclusions:
+ * - Log trees: log blocks are written and immediately
+ * referenced by a committed superblock via
+ * btrfs_sync_log(), bypassing the normal transaction
+ * commit. Overwriting in place could corrupt the
+ * committed log.
+ * - Zoned devices: require sequential writes
+ * - FORCE_COW: snapshot in progress
+ * - RELOC flag: block being relocated
+ */
+ if (!test_bit(EXTENT_BUFFER_WRITEBACK, &buf->bflags) &&
+ !btrfs_is_zoned(root->fs_info) &&
+ btrfs_root_id(root) != BTRFS_TREE_LOG_OBJECTID &&
+ !test_bit(BTRFS_ROOT_FORCE_COW, &root->state) &&
+ !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)) {
+ /*
+ * Re-register this block's range in the current
+ * transaction's dirty_pages so that
+ * btrfs_write_and_wait_transaction() writes it.
+ * The range was originally registered when the block
+ * was allocated, but that transaction's dirty_pages
+ * may have already been released.
+ */
+ btrfs_set_extent_bit(&trans->transaction->dirty_pages,
+ buf->start,
+ buf->start + buf->len - 1,
+ EXTENT_DIRTY, NULL);
+ set_extent_buffer_dirty(buf);
+ return false;
+ }
return true;
+ }
/* Ensure we can see the FORCE_COW bit. */
smp_mb__before_atomic();
--
2.47.3
* [PATCH v2 2/3] btrfs: inhibit extent buffer writeback to prevent COW amplification
2026-02-13 20:30 [PATCH v2 0/3] btrfs: fix COW amplification under memory pressure Leo Martins
2026-02-13 20:30 ` [PATCH v2 1/3] btrfs: skip COW for written extent buffers allocated in current transaction Leo Martins
@ 2026-02-13 20:30 ` Leo Martins
2026-02-16 12:33 ` Filipe Manana
2026-02-13 20:30 ` [PATCH v2 3/3] btrfs: add tracepoint for COW amplification tracking Leo Martins
2 siblings, 1 reply; 11+ messages in thread
From: Leo Martins @ 2026-02-13 20:30 UTC (permalink / raw)
To: linux-btrfs, kernel-team
Inhibit writeback on COW'd extent buffers for the lifetime of the
transaction handle, preventing background writeback from setting
BTRFS_HEADER_FLAG_WRITTEN and causing unnecessary re-COW.
COW amplification occurs when background writeback flushes an extent
buffer that a transaction handle is still actively modifying. When
lock_extent_buffer_for_io() transitions a buffer from dirty to
writeback, it sets BTRFS_HEADER_FLAG_WRITTEN, marking the block as
having been persisted to disk at its current bytenr. Once WRITTEN is
set, should_cow_block() must either COW the block again or overwrite
it in place, both of which are unnecessary overhead when the buffer
is still being modified by the same handle that allocated it. By
inhibiting background writeback on actively-used buffers, WRITTEN is
never set while a transaction handle holds a reference to the buffer,
avoiding this overhead entirely.
Add an atomic_t writeback_inhibitors counter to struct extent_buffer,
which fits in an existing 6-byte hole without increasing struct size.
When a buffer is COW'd in btrfs_force_cow_block(), call
btrfs_inhibit_eb_writeback() to store the eb in the transaction
handle's writeback_inhibited_ebs xarray (keyed by eb->start), take a
reference, and increment writeback_inhibitors. The function handles
dedup (same eb inhibited twice by the same handle) and replacement
(different eb at the same logical address). Allocation failure is
graceful: the buffer simply falls back to the pre-existing behavior
where it may be written back and re-COW'd.
In lock_extent_buffer_for_io(), when writeback_inhibitors is non-zero
and the writeback mode is WB_SYNC_NONE, skip the buffer. WB_SYNC_NONE
is used by the VM flusher threads for background and periodic
writeback, which are the only paths that cause COW amplification by
opportunistically writing out dirty extent buffers mid-transaction.
Skipping these is safe because the buffers remain dirty in the page
cache and will be written out at transaction commit time.
WB_SYNC_ALL must always proceed regardless of writeback_inhibitors.
This is required for correctness in the fsync path: btrfs_sync_log()
writes log tree blocks via filemap_fdatawrite_range() (WB_SYNC_ALL)
while the transaction handle that inhibited those same blocks is still
active. Without the WB_SYNC_ALL bypass, those inhibited log tree
blocks would be silently skipped, resulting in an incomplete log on
disk and corruption on replay. btrfs_write_and_wait_transaction()
also uses WB_SYNC_ALL via filemap_fdatawrite_range(); for that path,
inhibitors are already cleared beforehand, but the bypass ensures
correctness regardless.
Uninhibit in __btrfs_end_transaction() before atomic_dec(num_writers)
to prevent a race where the committer proceeds while buffers are still
inhibited. Also uninhibit in btrfs_commit_transaction() before writing
and in cleanup_transaction() for the error path.
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
---
fs/btrfs/ctree.c | 4 +++
fs/btrfs/extent_io.c | 62 +++++++++++++++++++++++++++++++++++++++++-
fs/btrfs/extent_io.h | 5 ++++
fs/btrfs/transaction.c | 19 +++++++++++++
fs/btrfs/transaction.h | 2 ++
5 files changed, 91 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index a345e1be24d8..55187ba59cc0 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -590,6 +590,10 @@ int btrfs_force_cow_block(struct btrfs_trans_handle *trans,
btrfs_tree_unlock(buf);
free_extent_buffer_stale(buf);
btrfs_mark_buffer_dirty(trans, cow);
+
+ /* Inhibit writeback on the COW'd buffer for this transaction handle */
+ btrfs_inhibit_eb_writeback(trans, cow);
+
*cow_ret = cow;
return 0;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index dfc17c292217..0c9276cff299 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1940,7 +1940,9 @@ static noinline_for_stack bool lock_extent_buffer_for_io(struct extent_buffer *e
* of time.
*/
spin_lock(&eb->refs_lock);
- if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
+ if ((wbc->sync_mode == WB_SYNC_ALL ||
+ atomic_read(&eb->writeback_inhibitors) == 0) &&
+ test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
XA_STATE(xas, &fs_info->buffer_tree, eb->start >> fs_info->nodesize_bits);
unsigned long flags;
@@ -2999,6 +3001,63 @@ static inline void btrfs_release_extent_buffer(struct extent_buffer *eb)
kmem_cache_free(extent_buffer_cache, eb);
}
+/*
+ * btrfs_inhibit_eb_writeback - Inhibit writeback on buffer during transaction
+ * @trans: transaction handle that will own the inhibitor
+ * @eb: extent buffer to inhibit writeback on
+ *
+ * Attempts to track this extent buffer in the transaction's inhibited set.
+ * If memory allocation fails, the buffer is simply not tracked. It may
+ * be written back and need re-COW, which is the original behavior.
+ * This is acceptable since inhibiting writeback is an optimization.
+ */
+void btrfs_inhibit_eb_writeback(struct btrfs_trans_handle *trans,
+ struct extent_buffer *eb)
+{
+ unsigned long index = eb->start >> trans->fs_info->nodesize_bits;
+ void *old;
+
+ /* Check if already inhibited by this handle */
+ old = xa_load(&trans->writeback_inhibited_ebs, index);
+ if (old == eb)
+ return;
+
+ refcount_inc(&eb->refs); /* Take reference */
+
+ old = xa_store(&trans->writeback_inhibited_ebs, index, eb, GFP_NOFS);
+ if (xa_is_err(old)) {
+ /* Allocation failed, just skip inhibiting this buffer */
+ free_extent_buffer(eb);
+ return;
+ }
+
+ /* Handle replacement of different eb at same index */
+ if (old && old != eb) {
+ struct extent_buffer *old_eb = old;
+
+ atomic_dec(&old_eb->writeback_inhibitors);
+ free_extent_buffer(old_eb);
+ }
+
+ atomic_inc(&eb->writeback_inhibitors);
+}
+
+/*
+ * btrfs_uninhibit_all_eb_writeback - Uninhibit writeback on all buffers
+ * @trans: transaction handle to clean up
+ */
+void btrfs_uninhibit_all_eb_writeback(struct btrfs_trans_handle *trans)
+{
+ struct extent_buffer *eb;
+ unsigned long index;
+
+ xa_for_each(&trans->writeback_inhibited_ebs, index, eb) {
+ atomic_dec(&eb->writeback_inhibitors);
+ free_extent_buffer(eb);
+ }
+ xa_destroy(&trans->writeback_inhibited_ebs);
+}
+
static struct extent_buffer *__alloc_extent_buffer(struct btrfs_fs_info *fs_info,
u64 start)
{
@@ -3009,6 +3068,7 @@ static struct extent_buffer *__alloc_extent_buffer(struct btrfs_fs_info *fs_info
eb->len = fs_info->nodesize;
eb->fs_info = fs_info;
init_rwsem(&eb->lock);
+ atomic_set(&eb->writeback_inhibitors, 0);
btrfs_leak_debug_add_eb(eb);
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 73571d5d3d5a..4b15a5d8bc0f 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -102,6 +102,7 @@ struct extent_buffer {
/* >= 0 if eb belongs to a log tree, -1 otherwise */
s8 log_index;
u8 folio_shift;
+ atomic_t writeback_inhibitors; /* inhibits writeback when > 0 */
struct rcu_head rcu_head;
struct rw_semaphore lock;
@@ -381,4 +382,8 @@ void btrfs_extent_buffer_leak_debug_check(struct btrfs_fs_info *fs_info);
#define btrfs_extent_buffer_leak_debug_check(fs_info) do {} while (0)
#endif
+void btrfs_inhibit_eb_writeback(struct btrfs_trans_handle *trans,
+ struct extent_buffer *eb);
+void btrfs_uninhibit_all_eb_writeback(struct btrfs_trans_handle *trans);
+
#endif
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index f4cc9e1a1b93..a9a22629b49d 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -15,6 +15,7 @@
#include "misc.h"
#include "ctree.h"
#include "disk-io.h"
+#include "extent_io.h"
#include "transaction.h"
#include "locking.h"
#include "tree-log.h"
@@ -688,6 +689,8 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
goto alloc_fail;
}
+ xa_init(&h->writeback_inhibited_ebs);
+
/*
* If we are JOIN_NOLOCK we're already committing a transaction and
* waiting on this guy, so we don't need to do the sb_start_intwrite
@@ -1083,6 +1086,13 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
if (trans->type & __TRANS_FREEZABLE)
sb_end_intwrite(info->sb);
+ /*
+ * Uninhibit extent buffer writeback before decrementing num_writers,
+ * since the decrement wakes the committing thread which needs all
+ * buffers uninhibited to write them to disk.
+ */
+ btrfs_uninhibit_all_eb_writeback(trans);
+
WARN_ON(cur_trans != info->running_transaction);
WARN_ON(atomic_read(&cur_trans->num_writers) < 1);
atomic_dec(&cur_trans->num_writers);
@@ -2110,6 +2120,7 @@ static void cleanup_transaction(struct btrfs_trans_handle *trans, int err)
if (!test_bit(BTRFS_FS_RELOC_RUNNING, &fs_info->flags))
btrfs_scrub_cancel(fs_info);
+ btrfs_uninhibit_all_eb_writeback(trans);
kmem_cache_free(btrfs_trans_handle_cachep, trans);
}
@@ -2556,6 +2567,14 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
fs_info->cleaner_kthread)
wake_up_process(fs_info->cleaner_kthread);
+ /*
+ * Uninhibit writeback on all extent buffers inhibited during this
+ * transaction before writing them to disk. Inhibiting prevented
+ * writeback while the transaction was building, but now we need
+ * them written.
+ */
+ btrfs_uninhibit_all_eb_writeback(trans);
+
ret = btrfs_write_and_wait_transaction(trans);
if (unlikely(ret)) {
btrfs_err(fs_info, "error while writing out transaction: %d", ret);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 18ef069197e5..f0d12c16d796 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -12,6 +12,7 @@
#include <linux/time64.h>
#include <linux/mutex.h>
#include <linux/wait.h>
+#include <linux/xarray.h>
#include "btrfs_inode.h"
#include "delayed-ref.h"
@@ -162,6 +163,7 @@ struct btrfs_trans_handle {
struct btrfs_fs_info *fs_info;
struct list_head new_bgs;
struct btrfs_block_rsv delayed_rsv;
+ struct xarray writeback_inhibited_ebs; /* ebs with writeback inhibited */
};
/*
--
2.47.3
* [PATCH v2 3/3] btrfs: add tracepoint for COW amplification tracking
2026-02-13 20:30 [PATCH v2 0/3] btrfs: fix COW amplification under memory pressure Leo Martins
2026-02-13 20:30 ` [PATCH v2 1/3] btrfs: skip COW for written extent buffers allocated in current transaction Leo Martins
2026-02-13 20:30 ` [PATCH v2 2/3] btrfs: inhibit extent buffer writeback to prevent COW amplification Leo Martins
@ 2026-02-13 20:30 ` Leo Martins
2026-02-16 12:40 ` Filipe Manana
2 siblings, 1 reply; 11+ messages in thread
From: Leo Martins @ 2026-02-13 20:30 UTC (permalink / raw)
To: linux-btrfs, kernel-team
Add a btrfs_search_slot_stats tracepoint to btrfs_search_slot() for
measuring COW amplification.
The tracepoint fires when a search with at least one COW completes,
recording the root, total cow_count, restart_count, and return value.
cow_count and restart_count per search_slot call are useful metrics
for tracking COW amplification.
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
---
fs/btrfs/ctree.c | 15 +++++++++++++--
include/trace/events/btrfs.h | 26 ++++++++++++++++++++++++++
2 files changed, 39 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 55187ba59cc0..1971d7bb5f60 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2069,6 +2069,8 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
u8 lowest_level = 0;
int min_write_lock_level;
int prev_cmp;
+ int cow_count = 0;
+ int restart_count = 0;
if (!root)
return -EINVAL;
@@ -2157,6 +2159,7 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
p->nodes[level + 1])) {
write_lock_level = level + 1;
btrfs_release_path(p);
+ restart_count++;
goto again;
}
@@ -2172,6 +2175,7 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
ret = ret2;
goto done;
}
+ cow_count++;
}
cow_done:
p->nodes[level] = b;
@@ -2219,8 +2223,10 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
p->slots[level] = slot;
ret2 = setup_nodes_for_search(trans, root, p, b, level, ins_len,
&write_lock_level);
- if (ret2 == -EAGAIN)
+ if (ret2 == -EAGAIN) {
+ restart_count++;
goto again;
+ }
if (ret2) {
ret = ret2;
goto done;
@@ -2236,6 +2242,7 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
if (slot == 0 && ins_len && write_lock_level < level + 1) {
write_lock_level = level + 1;
btrfs_release_path(p);
+ restart_count++;
goto again;
}
@@ -2249,8 +2256,10 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
}
ret2 = read_block_for_search(root, p, &b, slot, key);
- if (ret2 == -EAGAIN && !p->nowait)
+ if (ret2 == -EAGAIN && !p->nowait) {
+ restart_count++;
goto again;
+ }
if (ret2) {
ret = ret2;
goto done;
@@ -2281,6 +2290,8 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
}
ret = 1;
done:
+ if (cow_count > 0)
+ trace_btrfs_search_slot_stats(root, cow_count, restart_count, ret);
if (ret < 0 && !p->skip_release_on_error)
btrfs_release_path(p);
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 125bdc166bfe..b8934938a087 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1110,6 +1110,32 @@ TRACE_EVENT(btrfs_cow_block,
__entry->cow_level)
);
+TRACE_EVENT(btrfs_search_slot_stats,
+
+ TP_PROTO(const struct btrfs_root *root,
+ int cow_count, int restart_count, int ret),
+
+ TP_ARGS(root, cow_count, restart_count, ret),
+
+ TP_STRUCT__entry_btrfs(
+ __field( u64, root_objectid )
+ __field( int, cow_count )
+ __field( int, restart_count )
+ __field( int, ret )
+ ),
+
+ TP_fast_assign_btrfs(root->fs_info,
+ __entry->root_objectid = btrfs_root_id(root);
+ __entry->cow_count = cow_count;
+ __entry->restart_count = restart_count;
+ __entry->ret = ret;
+ ),
+
+ TP_printk_btrfs("root=%llu(%s) cow_count=%d restarts=%d ret=%d",
+ show_root_type(__entry->root_objectid),
+ __entry->cow_count, __entry->restart_count, __entry->ret)
+);
+
TRACE_EVENT(btrfs_space_reservation,
TP_PROTO(const struct btrfs_fs_info *fs_info, const char *type, u64 val,
--
2.47.3
* Re: [PATCH v2 1/3] btrfs: skip COW for written extent buffers allocated in current transaction
2026-02-13 20:30 ` [PATCH v2 1/3] btrfs: skip COW for written extent buffers allocated in current transaction Leo Martins
@ 2026-02-14 1:25 ` Sun YangKai
2026-02-17 22:04 ` Leo Martins
2026-02-16 12:18 ` Filipe Manana
1 sibling, 1 reply; 11+ messages in thread
From: Sun YangKai @ 2026-02-14 1:25 UTC (permalink / raw)
To: Leo Martins, linux-btrfs, kernel-team
Thanks for working on this; I've been expecting this for a long time :)
On 2026/2/14 04:30, Leo Martins wrote:
> When memory pressure causes writeback of a recently COW'd buffer,
> btrfs sets BTRFS_HEADER_FLAG_WRITTEN on it. Subsequent
> btrfs_search_slot() restarts then see the WRITTEN flag and re-COW
> the buffer unnecessarily, causing COW amplification that can exhaust
> block reservations and degrade throughput.
>
> Overwriting in place is crash-safe because the committed superblock
> does not reference buffers allocated in the current (uncommitted)
> transaction, so no on-disk tree points to this block yet.
>
> When should_cow_block() encounters a WRITTEN buffer whose generation
> matches the current transaction, instead of requesting a COW, re-dirty
> the buffer and re-register its range in the transaction's dirty_pages.
>
> Both are necessary because btrfs tracks dirty metadata through two
> independent mechanisms. set_extent_buffer_dirty() sets the
> EXTENT_BUFFER_DIRTY flag and the buffer_tree xarray PAGECACHE_TAG_DIRTY
> mark, which is what background writeback (btree_write_cache_pages) uses
> to find and write dirty buffers. The transaction's dirty_pages io tree
> is a separate structure used by btrfs_write_and_wait_transaction() at
> commit time to ensure all buffers allocated during the transaction are
> persisted. The dirty_pages range was originally registered in
> btrfs_init_new_buffer() when the block was first allocated, but
> background writeback may have already written and cleared it.
>
> Keep BTRFS_HEADER_FLAG_WRITTEN set so that btrfs_free_tree_block()
> correctly pins the block if it is freed later.
>
> Exclude cases where in-place overwrite is not safe:
> - EXTENT_BUFFER_WRITEBACK: buffer is mid-I/O
> - Zoned devices: require sequential writes
> - Log trees: log blocks are immediately referenced by a committed
> superblock via btrfs_sync_log(), so overwriting could corrupt the
> committed log
> - BTRFS_ROOT_FORCE_COW: snapshot in progress
> - BTRFS_HEADER_FLAG_RELOC: block being relocated
>
> Signed-off-by: Leo Martins <loemra.dev@gmail.com>
> ---
> fs/btrfs/ctree.c | 53 +++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 50 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 7267b2502665..a345e1be24d8 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -599,9 +599,9 @@ int btrfs_force_cow_block(struct btrfs_trans_handle *trans,
> return ret;
> }
>
> -static inline bool should_cow_block(const struct btrfs_trans_handle *trans,
> +static inline bool should_cow_block(struct btrfs_trans_handle *trans,
> const struct btrfs_root *root,
> - const struct extent_buffer *buf)
> + struct extent_buffer *buf)
> {
> if (btrfs_is_testing(root->fs_info))
> return false;
> @@ -621,8 +621,55 @@ static inline bool should_cow_block(const struct btrfs_trans_handle *trans,
> if (btrfs_header_generation(buf) != trans->transid)
> return true;
>
> - if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN))
> + if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
> + /*
> + * The buffer was allocated in this transaction and has been
> + * written back to disk (WRITTEN is set). Normally we'd COW
> + * it again, but since the committed superblock doesn't
> + * reference this buffer (it was allocated this transaction),
> + * we can safely overwrite it in place.
> + *
> + * We keep BTRFS_HEADER_FLAG_WRITTEN set. The block has been
> + * persisted at this bytenr and will be again after the
> + * in-place update. This is important so that
> + * btrfs_free_tree_block() correctly pins the block if it is
> + * freed later (e.g., during tree rebalancing or FORCE_COW).
> + *
> + * We re-dirty the buffer to ensure the in-place modifications
> + * will be written back to disk.
> + *
> + * Exclusions:
> + * - Log trees: log blocks are written and immediately
> + * referenced by a committed superblock via
> + * btrfs_sync_log(), bypassing the normal transaction
> + * commit. Overwriting in place could corrupt the
> + * committed log.
> + * - Zoned devices: require sequential writes
> + * - FORCE_COW: snapshot in progress
> + * - RELOC flag: block being relocated
> + */
> + if (!test_bit(EXTENT_BUFFER_WRITEBACK, &buf->bflags) &&
> + !btrfs_is_zoned(root->fs_info) &&
> + btrfs_root_id(root) != BTRFS_TREE_LOG_OBJECTID &&
> + !test_bit(BTRFS_ROOT_FORCE_COW, &root->state) &&
it seems we need smp_mb__before_atomic() to see the FORCE_COW bit?
> + !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)) {
> + /*
> + * Re-register this block's range in the current
> + * transaction's dirty_pages so that
> + * btrfs_write_and_wait_transaction() writes it.
> + * The range was originally registered when the block
> + * was allocated, but that transaction's dirty_pages
> + * may have already been released.
> + */
> + btrfs_set_extent_bit(&trans->transaction->dirty_pages,
> + buf->start,
> + buf->start + buf->len - 1,
> + EXTENT_DIRTY, NULL);
> + set_extent_buffer_dirty(buf);
why use set_extent_buffer_dirty() instead of btrfs_mark_buffer_dirty()?
I don't see any other callers doing this.
> + return false;
> + }
> return true;
> + }
>
> /* Ensure we can see the FORCE_COW bit. */
> smp_mb__before_atomic();
And I wonder if we could have something more readable like this:
if (btrfs_header_generation(buf) != trans->transid)
        return true;

if (test_bit(EXTENT_BUFFER_WRITEBACK, &buf->bflags))
        return true;

if (btrfs_root_id(root) != BTRFS_TREE_RELOC_OBJECTID &&
    btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC))
        return true;

/* Ensure we can see the FORCE_COW bit. */
smp_mb__before_atomic();
if (test_bit(BTRFS_ROOT_FORCE_COW, &root->state))
        return true;

if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
        if (btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID ||
            btrfs_is_zoned(root->fs_info))
                return true;

        btrfs_set_extent_bit(&trans->transaction->dirty_pages,
                             buf->start, buf->start + buf->len - 1,
                             EXTENT_DIRTY, NULL);
        btrfs_mark_buffer_dirty(trans, buf);
}

return false;
Thanks,
Sun YangKai
* Re: [PATCH v2 1/3] btrfs: skip COW for written extent buffers allocated in current transaction
2026-02-13 20:30 ` [PATCH v2 1/3] btrfs: skip COW for written extent buffers allocated in current transaction Leo Martins
2026-02-14 1:25 ` Sun YangKai
@ 2026-02-16 12:18 ` Filipe Manana
2026-02-17 21:48 ` Leo Martins
1 sibling, 1 reply; 11+ messages in thread
From: Filipe Manana @ 2026-02-16 12:18 UTC (permalink / raw)
To: Leo Martins; +Cc: linux-btrfs, kernel-team
On Fri, Feb 13, 2026 at 8:38 PM Leo Martins <loemra.dev@gmail.com> wrote:
>
> When memory pressure causes writeback of a recently COW'd buffer,
> btrfs sets BTRFS_HEADER_FLAG_WRITTEN on it. Subsequent
> btrfs_search_slot() restarts then see the WRITTEN flag and re-COW
> the buffer unnecessarily, causing COW amplification that can exhaust
> block reservations and degrade throughput.
>
> Overwriting in place is crash-safe because the committed superblock
> does not reference buffers allocated in the current (uncommitted)
> transaction, so no on-disk tree points to this block yet.
>
> When should_cow_block() encounters a WRITTEN buffer whose generation
> matches the current transaction, instead of requesting a COW, re-dirty
> the buffer and re-register its range in the transaction's dirty_pages.
>
> Both are necessary because btrfs tracks dirty metadata through two
> independent mechanisms. set_extent_buffer_dirty() sets the
> EXTENT_BUFFER_DIRTY flag and the buffer_tree xarray PAGECACHE_TAG_DIRTY
> mark, which is what background writeback (btree_write_cache_pages) uses
> to find and write dirty buffers. The transaction's dirty_pages io tree
> is a separate structure used by btrfs_write_and_wait_transaction() at
> commit time to ensure all buffers allocated during the transaction are
> persisted. The dirty_pages range was originally registered in
> btrfs_init_new_buffer() when the block was first allocated, but
> background writeback may have already written and cleared it.
This is not quite correct, the dirty_pages range is never cleared on
background writeback. We only clear it during a transaction commit, in
btrfs_write_and_wait_transaction().

Normally we shouldn't care about setting the range again in dirty_pages,
because after we call btrfs_write_and_wait_transaction(), no more COW
should be possible using this transaction (which is in the unblocked
state, so any new COW attempt will be in another transaction).

The exception is if we have snapshots to create and qgroups are enabled,
since in qgroup_account_snapshot() we call
btrfs_write_and_wait_transaction() and after that we can get more COW,
due to all the stuff we need to do to create a snapshot, before we get
to the final call to btrfs_write_and_wait_transaction() right before we
write the super blocks in btrfs_commit_transaction().
>
> Keep BTRFS_HEADER_FLAG_WRITTEN set so that btrfs_free_tree_block()
> correctly pins the block if it is freed later.
>
> Exclude cases where in-place overwrite is not safe:
> - EXTENT_BUFFER_WRITEBACK: buffer is mid-I/O
> - Zoned devices: require sequential writes
> - Log trees: log blocks are immediately referenced by a committed
> superblock via btrfs_sync_log(), so overwriting could corrupt the
> committed log
> - BTRFS_ROOT_FORCE_COW: snapshot in progress
> - BTRFS_HEADER_FLAG_RELOC: block being relocated
>
> Signed-off-by: Leo Martins <loemra.dev@gmail.com>
> ---
> fs/btrfs/ctree.c | 53 +++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 50 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 7267b2502665..a345e1be24d8 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -599,9 +599,9 @@ int btrfs_force_cow_block(struct btrfs_trans_handle *trans,
> return ret;
> }
>
> -static inline bool should_cow_block(const struct btrfs_trans_handle *trans,
> +static inline bool should_cow_block(struct btrfs_trans_handle *trans,
> const struct btrfs_root *root,
> - const struct extent_buffer *buf)
> + struct extent_buffer *buf)
> {
> if (btrfs_is_testing(root->fs_info))
> return false;
> @@ -621,8 +621,55 @@ static inline bool should_cow_block(const struct btrfs_trans_handle *trans,
> if (btrfs_header_generation(buf) != trans->transid)
> return true;
>
> - if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN))
> + if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
> + /*
> + * The buffer was allocated in this transaction and has been
> + * written back to disk (WRITTEN is set). Normally we'd COW
> + * it again, but since the committed superblock doesn't
> + * reference this buffer (it was allocated this transaction),
Missing an "in" before "this transaction".
> + * we can safely overwrite it in place.
> + *
> + * We keep BTRFS_HEADER_FLAG_WRITTEN set. The block has been
> + * persisted at this bytenr and will be again after the
> + * in-place update. This is important so that
> + * btrfs_free_tree_block() correctly pins the block if it is
> + * freed later (e.g., during tree rebalancing or FORCE_COW).
> + *
> + * We re-dirty the buffer to ensure the in-place modifications
> + * will be written back to disk.
> + *
> + * Exclusions:
> + * - Log trees: log blocks are written and immediately
> + * referenced by a committed superblock via
> + * btrfs_sync_log(), bypassing the normal transaction
> + * commit. Overwriting in place could corrupt the
> + * committed log.
> + * - Zoned devices: require sequential writes
> + * - FORCE_COW: snapshot in progress
> + * - RELOC flag: block being relocated
> + */
> + if (!test_bit(EXTENT_BUFFER_WRITEBACK, &buf->bflags) &&
> + !btrfs_is_zoned(root->fs_info) &&
> + btrfs_root_id(root) != BTRFS_TREE_LOG_OBJECTID &&
> + !test_bit(BTRFS_ROOT_FORCE_COW, &root->state) &&
We need a smp_mb__before_atomic() before checking FORCE_COW, see the
existing code below.
> + !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)) {
> + /*
> + * Re-register this block's range in the current
> + * transaction's dirty_pages so that
> + * btrfs_write_and_wait_transaction() writes it.
> + * The range was originally registered when the block
> + * was allocated, but that transaction's dirty_pages
> + * may have already been released.
I think it's worth adding something like: "... already been released
if we are in a transaction that creates snapshots and we have qgroups
enabled."
Otherwise it looks good, thanks!
> + */
> + btrfs_set_extent_bit(&trans->transaction->dirty_pages,
> + buf->start,
> + buf->start + buf->len - 1,
> + EXTENT_DIRTY, NULL);
> + set_extent_buffer_dirty(buf);
> + return false;
> + }
> return true;
> + }
>
> /* Ensure we can see the FORCE_COW bit. */
> smp_mb__before_atomic();
> --
> 2.47.3
>
>
* Re: [PATCH v2 2/3] btrfs: inhibit extent buffer writeback to prevent COW amplification
2026-02-13 20:30 ` [PATCH v2 2/3] btrfs: inhibit extent buffer writeback to prevent COW amplification Leo Martins
@ 2026-02-16 12:33 ` Filipe Manana
0 siblings, 0 replies; 11+ messages in thread
From: Filipe Manana @ 2026-02-16 12:33 UTC (permalink / raw)
To: Leo Martins; +Cc: linux-btrfs, kernel-team
On Fri, Feb 13, 2026 at 8:38 PM Leo Martins <loemra.dev@gmail.com> wrote:
>
> Inhibit writeback on COW'd extent buffers for the lifetime of the
> transaction handle, preventing background writeback from setting
> BTRFS_HEADER_FLAG_WRITTEN and causing unnecessary re-COW.
>
> COW amplification occurs when background writeback flushes an extent
> buffer that a transaction handle is still actively modifying. When
> lock_extent_buffer_for_io() transitions a buffer from dirty to
> writeback, it sets BTRFS_HEADER_FLAG_WRITTEN, marking the block as
> having been persisted to disk at its current bytenr. Once WRITTEN is
> set, should_cow_block() must either COW the block again or overwrite
> it in place, both of which are unnecessary overhead when the buffer
> is still being modified by the same handle that allocated it. By
> inhibiting background writeback on actively-used buffers, WRITTEN is
> never set while a transaction handle holds a reference to the buffer,
> avoiding this overhead entirely.
>
> Add an atomic_t writeback_inhibitors counter to struct extent_buffer,
> which fits in an existing 6-byte hole without increasing struct size.
> When a buffer is COW'd in btrfs_force_cow_block(), call
> btrfs_inhibit_eb_writeback() to store the eb in the transaction
> handle's writeback_inhibited_ebs xarray (keyed by eb->start), take a
> reference, and increment writeback_inhibitors. The function handles
> dedup (same eb inhibited twice by the same handle) and replacement
> (different eb at the same logical address). Allocation failure is
> graceful: the buffer simply falls back to the pre-existing behavior
> where it may be written back and re-COW'd.
>
> In lock_extent_buffer_for_io(), when writeback_inhibitors is non-zero
> and the writeback mode is WB_SYNC_NONE, skip the buffer. WB_SYNC_NONE
> is used by the VM flusher threads for background and periodic
> writeback, which are the only paths that cause COW amplification by
> opportunistically writing out dirty extent buffers mid-transaction.
> Skipping these is safe because the buffers remain dirty in the page
> cache and will be written out at transaction commit time.
>
> WB_SYNC_ALL must always proceed regardless of writeback_inhibitors.
> This is required for correctness in the fsync path: btrfs_sync_log()
> writes log tree blocks via filemap_fdatawrite_range() (WB_SYNC_ALL)
> while the transaction handle that inhibited those same blocks is still
> active. Without the WB_SYNC_ALL bypass, those inhibited log tree
> blocks would be silently skipped, resulting in an incomplete log on
> disk and corruption on replay. btrfs_write_and_wait_transaction()
> also uses WB_SYNC_ALL via filemap_fdatawrite_range(); for that path,
> inhibitors are already cleared beforehand, but the bypass ensures
> correctness regardless.
>
> Uninhibit in __btrfs_end_transaction() before atomic_dec(num_writers)
> to prevent a race where the committer proceeds while buffers are still
> inhibited. Also uninhibit in btrfs_commit_transaction() before writing
> and in cleanup_transaction() for the error path.
>
> Signed-off-by: Leo Martins <loemra.dev@gmail.com>
> ---
> fs/btrfs/ctree.c | 4 +++
> fs/btrfs/extent_io.c | 62 +++++++++++++++++++++++++++++++++++++++++-
> fs/btrfs/extent_io.h | 5 ++++
> fs/btrfs/transaction.c | 19 +++++++++++++
> fs/btrfs/transaction.h | 2 ++
> 5 files changed, 91 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index a345e1be24d8..55187ba59cc0 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -590,6 +590,10 @@ int btrfs_force_cow_block(struct btrfs_trans_handle *trans,
> btrfs_tree_unlock(buf);
> free_extent_buffer_stale(buf);
> btrfs_mark_buffer_dirty(trans, cow);
> +
> + /* Inhibit writeback on the COW'd buffer for this transaction handle */
Please always end sentences in a comment with punctuation.
> + btrfs_inhibit_eb_writeback(trans, cow);
> +
> *cow_ret = cow;
> return 0;
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index dfc17c292217..0c9276cff299 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1940,7 +1940,9 @@ static noinline_for_stack bool lock_extent_buffer_for_io(struct extent_buffer *e
> * of time.
> */
> spin_lock(&eb->refs_lock);
> - if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
> + if ((wbc->sync_mode == WB_SYNC_ALL ||
> + atomic_read(&eb->writeback_inhibitors) == 0) &&
> + test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
> XA_STATE(xas, &fs_info->buffer_tree, eb->start >> fs_info->nodesize_bits);
> unsigned long flags;
>
> @@ -2999,6 +3001,63 @@ static inline void btrfs_release_extent_buffer(struct extent_buffer *eb)
> kmem_cache_free(extent_buffer_cache, eb);
> }
>
> +/*
> + * btrfs_inhibit_eb_writeback - Inhibit writeback on buffer during transaction
> + * @trans: transaction handle that will own the inhibitor
> + * @eb: extent buffer to inhibit writeback on
> + *
> + * Attempts to track this extent buffer in the transaction's inhibited set.
> + * If memory allocation fails, the buffer is simply not tracked. It may
> + * be written back and need re-COW, which is the original behavior.
> + * This is acceptable since inhibiting writeback is an optimization.
> + */
> +void btrfs_inhibit_eb_writeback(struct btrfs_trans_handle *trans,
> + struct extent_buffer *eb)
> +{
> + unsigned long index = eb->start >> trans->fs_info->nodesize_bits;
> + void *old;
> +
> + /* Check if already inhibited by this handle */
Same here.
> + old = xa_load(&trans->writeback_inhibited_ebs, index);
> + if (old == eb)
> + return;
> +
> + refcount_inc(&eb->refs); /* Take reference */
Always place comments above the code line, not in the same line.
> +
> + old = xa_store(&trans->writeback_inhibited_ebs, index, eb, GFP_NOFS);
> + if (xa_is_err(old)) {
> + /* Allocation failed, just skip inhibiting this buffer */
Punctuation missing.
> + free_extent_buffer(eb);
> + return;
> + }
> +
> + /* Handle replacement of different eb at same index */
Punctuation missing.
> + if (old && old != eb) {
> + struct extent_buffer *old_eb = old;
> +
> + atomic_dec(&old_eb->writeback_inhibitors);
> + free_extent_buffer(old_eb);
> + }
> +
> + atomic_inc(&eb->writeback_inhibitors);
> +}
> +
> +/*
> + * btrfs_uninhibit_all_eb_writeback - Uninhibit writeback on all buffers
> + * @trans: transaction handle to clean up
> + */
> +void btrfs_uninhibit_all_eb_writeback(struct btrfs_trans_handle *trans)
> +{
> + struct extent_buffer *eb;
> + unsigned long index;
> +
> + xa_for_each(&trans->writeback_inhibited_ebs, index, eb) {
> + atomic_dec(&eb->writeback_inhibitors);
> + free_extent_buffer(eb);
> + }
> + xa_destroy(&trans->writeback_inhibited_ebs);
> +}
> +
> static struct extent_buffer *__alloc_extent_buffer(struct btrfs_fs_info *fs_info,
> u64 start)
> {
> @@ -3009,6 +3068,7 @@ static struct extent_buffer *__alloc_extent_buffer(struct btrfs_fs_info *fs_info
> eb->len = fs_info->nodesize;
> eb->fs_info = fs_info;
> init_rwsem(&eb->lock);
> + atomic_set(&eb->writeback_inhibitors, 0);
>
> btrfs_leak_debug_add_eb(eb);
>
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index 73571d5d3d5a..4b15a5d8bc0f 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -102,6 +102,7 @@ struct extent_buffer {
> /* >= 0 if eb belongs to a log tree, -1 otherwise */
> s8 log_index;
> u8 folio_shift;
> + atomic_t writeback_inhibitors; /* inhibits writeback when > 0 */
Always place the comment above the structure's member (just like for
code), and add punctuation and capitalize the first word.
We have old code that does not follow this, from the old days where
"anything goes", but we try to be consistent nowadays, see:
https://btrfs.readthedocs.io/en/latest/dev/Development-notes.html#comments
Otherwise it looks good, with those minor changes:
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Thanks.
> struct rcu_head rcu_head;
>
> struct rw_semaphore lock;
> @@ -381,4 +382,8 @@ void btrfs_extent_buffer_leak_debug_check(struct btrfs_fs_info *fs_info);
> #define btrfs_extent_buffer_leak_debug_check(fs_info) do {} while (0)
> #endif
>
> +void btrfs_inhibit_eb_writeback(struct btrfs_trans_handle *trans,
> + struct extent_buffer *eb);
> +void btrfs_uninhibit_all_eb_writeback(struct btrfs_trans_handle *trans);
> +
> #endif
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index f4cc9e1a1b93..a9a22629b49d 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -15,6 +15,7 @@
> #include "misc.h"
> #include "ctree.h"
> #include "disk-io.h"
> +#include "extent_io.h"
> #include "transaction.h"
> #include "locking.h"
> #include "tree-log.h"
> @@ -688,6 +689,8 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
> goto alloc_fail;
> }
>
> + xa_init(&h->writeback_inhibited_ebs);
> +
> /*
> * If we are JOIN_NOLOCK we're already committing a transaction and
> * waiting on this guy, so we don't need to do the sb_start_intwrite
> @@ -1083,6 +1086,13 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
> if (trans->type & __TRANS_FREEZABLE)
> sb_end_intwrite(info->sb);
>
> + /*
> + * Uninhibit extent buffer writeback before decrementing num_writers,
> + * since the decrement wakes the committing thread which needs all
> + * buffers uninhibited to write them to disk.
> + */
> + btrfs_uninhibit_all_eb_writeback(trans);
> +
> WARN_ON(cur_trans != info->running_transaction);
> WARN_ON(atomic_read(&cur_trans->num_writers) < 1);
> atomic_dec(&cur_trans->num_writers);
> @@ -2110,6 +2120,7 @@ static void cleanup_transaction(struct btrfs_trans_handle *trans, int err)
> if (!test_bit(BTRFS_FS_RELOC_RUNNING, &fs_info->flags))
> btrfs_scrub_cancel(fs_info);
>
> + btrfs_uninhibit_all_eb_writeback(trans);
> kmem_cache_free(btrfs_trans_handle_cachep, trans);
> }
>
> @@ -2556,6 +2567,14 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
> fs_info->cleaner_kthread)
> wake_up_process(fs_info->cleaner_kthread);
>
> + /*
> + * Uninhibit writeback on all extent buffers inhibited during this
> + * transaction before writing them to disk. Inhibiting prevented
> + * writeback while the transaction was building, but now we need
> + * them written.
> + */
> + btrfs_uninhibit_all_eb_writeback(trans);
> +
> ret = btrfs_write_and_wait_transaction(trans);
> if (unlikely(ret)) {
> btrfs_err(fs_info, "error while writing out transaction: %d", ret);
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index 18ef069197e5..f0d12c16d796 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -12,6 +12,7 @@
> #include <linux/time64.h>
> #include <linux/mutex.h>
> #include <linux/wait.h>
> +#include <linux/xarray.h>
> #include "btrfs_inode.h"
> #include "delayed-ref.h"
>
> @@ -162,6 +163,7 @@ struct btrfs_trans_handle {
> struct btrfs_fs_info *fs_info;
> struct list_head new_bgs;
> struct btrfs_block_rsv delayed_rsv;
> + struct xarray writeback_inhibited_ebs; /* ebs with writeback inhibited */
> };
>
> /*
> --
> 2.47.3
>
>
* Re: [PATCH v2 3/3] btrfs: add tracepoint for COW amplification tracking
2026-02-13 20:30 ` [PATCH v2 3/3] btrfs: add tracepoint for COW amplification tracking Leo Martins
@ 2026-02-16 12:40 ` Filipe Manana
2026-02-17 22:06 ` Leo Martins
0 siblings, 1 reply; 11+ messages in thread
From: Filipe Manana @ 2026-02-16 12:40 UTC (permalink / raw)
To: Leo Martins; +Cc: linux-btrfs, kernel-team
On Fri, Feb 13, 2026 at 8:37 PM Leo Martins <loemra.dev@gmail.com> wrote:
>
> Add a btrfs_search_slot_stats tracepoint to btrfs_search_slot() for
> measuring COW amplification.
>
> The tracepoint fires when a search with at least one COW completes,
> recording the root, total cow_count, restart_count, and return value.
> cow_count and restart_count per search_slot call are useful metrics
> for tracking COW amplification.
>
> Signed-off-by: Leo Martins <loemra.dev@gmail.com>
> ---
> fs/btrfs/ctree.c | 15 +++++++++++++--
> include/trace/events/btrfs.h | 26 ++++++++++++++++++++++++++
> 2 files changed, 39 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 55187ba59cc0..1971d7bb5f60 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -2069,6 +2069,8 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> u8 lowest_level = 0;
> int min_write_lock_level;
> int prev_cmp;
> + int cow_count = 0;
> + int restart_count = 0;
>
> if (!root)
> return -EINVAL;
> @@ -2157,6 +2159,7 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> p->nodes[level + 1])) {
> write_lock_level = level + 1;
> btrfs_release_path(p);
> + restart_count++;
> goto again;
> }
>
> @@ -2172,6 +2175,7 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> ret = ret2;
> goto done;
> }
> + cow_count++;
> }
> cow_done:
> p->nodes[level] = b;
> @@ -2219,8 +2223,10 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> p->slots[level] = slot;
> ret2 = setup_nodes_for_search(trans, root, p, b, level, ins_len,
> &write_lock_level);
> - if (ret2 == -EAGAIN)
> + if (ret2 == -EAGAIN) {
> + restart_count++;
> goto again;
> + }
> if (ret2) {
> ret = ret2;
> goto done;
> @@ -2236,6 +2242,7 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> if (slot == 0 && ins_len && write_lock_level < level + 1) {
> write_lock_level = level + 1;
> btrfs_release_path(p);
> + restart_count++;
> goto again;
> }
>
> @@ -2249,8 +2256,10 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> }
>
> ret2 = read_block_for_search(root, p, &b, slot, key);
> - if (ret2 == -EAGAIN && !p->nowait)
> + if (ret2 == -EAGAIN && !p->nowait) {
> + restart_count++;
> goto again;
> + }
> if (ret2) {
> ret = ret2;
> goto done;
> @@ -2281,6 +2290,8 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> }
> ret = 1;
> done:
> + if (cow_count > 0)
> + trace_btrfs_search_slot_stats(root, cow_count, restart_count, ret);
So I find this way too specific, plus even if trace points are
disabled we have the overhead of the counters (and inside critical
sections).
We already have a tracepoint for COW, trace_btrfs_cow_block(), and we
could have one just for the retry thing, maybe naming it like
trace_btrfs_search_slot_restart() or something.
So we could use those two tracepoints to measure things (bpftrace
scripts could easily report a count of each trace point and such),
instead of this highly specialized tracepoint that adds some overhead
when tracepoints are disabled.
Thanks.
> if (ret < 0 && !p->skip_release_on_error)
> btrfs_release_path(p);
>
> diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
> index 125bdc166bfe..b8934938a087 100644
> --- a/include/trace/events/btrfs.h
> +++ b/include/trace/events/btrfs.h
> @@ -1110,6 +1110,32 @@ TRACE_EVENT(btrfs_cow_block,
> __entry->cow_level)
> );
>
> +TRACE_EVENT(btrfs_search_slot_stats,
> +
> + TP_PROTO(const struct btrfs_root *root,
> + int cow_count, int restart_count, int ret),
> +
> + TP_ARGS(root, cow_count, restart_count, ret),
> +
> + TP_STRUCT__entry_btrfs(
> + __field( u64, root_objectid )
> + __field( int, cow_count )
> + __field( int, restart_count )
> + __field( int, ret )
> + ),
> +
> + TP_fast_assign_btrfs(root->fs_info,
> + __entry->root_objectid = btrfs_root_id(root);
> + __entry->cow_count = cow_count;
> + __entry->restart_count = restart_count;
> + __entry->ret = ret;
> + ),
> +
> + TP_printk_btrfs("root=%llu(%s) cow_count=%d restarts=%d ret=%d",
> + show_root_type(__entry->root_objectid),
> + __entry->cow_count, __entry->restart_count, __entry->ret)
> +);
> +
> TRACE_EVENT(btrfs_space_reservation,
>
> TP_PROTO(const struct btrfs_fs_info *fs_info, const char *type, u64 val,
> --
> 2.47.3
>
>
* Re: [PATCH v2 1/3] btrfs: skip COW for written extent buffers allocated in current transaction
2026-02-16 12:18 ` Filipe Manana
@ 2026-02-17 21:48 ` Leo Martins
0 siblings, 0 replies; 11+ messages in thread
From: Leo Martins @ 2026-02-17 21:48 UTC (permalink / raw)
To: Filipe Manana; +Cc: linux-btrfs, kernel-team
On Mon, 16 Feb 2026 12:18:48 +0000 Filipe Manana <fdmanana@kernel.org> wrote:
> On Fri, Feb 13, 2026 at 8:38 PM Leo Martins <loemra.dev@gmail.com> wrote:
> >
> > When memory pressure causes writeback of a recently COW'd buffer,
> > btrfs sets BTRFS_HEADER_FLAG_WRITTEN on it. Subsequent
> > btrfs_search_slot() restarts then see the WRITTEN flag and re-COW
> > the buffer unnecessarily, causing COW amplification that can exhaust
> > block reservations and degrade throughput.
> >
> > Overwriting in place is crash-safe because the committed superblock
> > does not reference buffers allocated in the current (uncommitted)
> > transaction, so no on-disk tree points to this block yet.
> >
> > When should_cow_block() encounters a WRITTEN buffer whose generation
> > matches the current transaction, instead of requesting a COW, re-dirty
> > the buffer and re-register its range in the transaction's dirty_pages.
> >
> > Both are necessary because btrfs tracks dirty metadata through two
> > independent mechanisms. set_extent_buffer_dirty() sets the
> > EXTENT_BUFFER_DIRTY flag and the buffer_tree xarray PAGECACHE_TAG_DIRTY
> > mark, which is what background writeback (btree_write_cache_pages) uses
> > to find and write dirty buffers. The transaction's dirty_pages io tree
> > is a separate structure used by btrfs_write_and_wait_transaction() at
> > commit time to ensure all buffers allocated during the transaction are
> > persisted. The dirty_pages range was originally registered in
> > btrfs_init_new_buffer() when the block was first allocated, but
> > background writeback may have already written and cleared it.
>
> This is not quite correct, the dirty_pages range is never cleared on
> background writeback.
> We only clear it during a transaction commit, in
> btrfs_write_and_wait_transaction().
>
> Normally we shouldn't care about setting the range again in
> dirty_pages, because after
> we call btrfs_write_and_wait_transaction(), no more COW should be
> possible using this
> transaction (which is in the unblocked state, so any new COW attempt
> will be in another transaction).
>
> The exception is if we have snapshots to create and qgroups are
> enabled, since in qgroup_account_snapshot() we
> call btrfs_write_and_wait_transaction() and after that we can get more
> COW, due to all the stuff we need to do to
> create a snapshot, before we get to the final call to
> btrfs_write_and_wait_transaction() right before we write the
> super blocks in btrfs_commit_transaction().
Got it, thanks for the correction. Updated in v3.
Thanks,
Leo
>
> >
> > Keep BTRFS_HEADER_FLAG_WRITTEN set so that btrfs_free_tree_block()
> > correctly pins the block if it is freed later.
> >
> > Exclude cases where in-place overwrite is not safe:
> > - EXTENT_BUFFER_WRITEBACK: buffer is mid-I/O
> > - Zoned devices: require sequential writes
> > - Log trees: log blocks are immediately referenced by a committed
> > superblock via btrfs_sync_log(), so overwriting could corrupt the
> > committed log
> > - BTRFS_ROOT_FORCE_COW: snapshot in progress
> > - BTRFS_HEADER_FLAG_RELOC: block being relocated
> >
> > Signed-off-by: Leo Martins <loemra.dev@gmail.com>
> > ---
> > fs/btrfs/ctree.c | 53 +++++++++++++++++++++++++++++++++++++++++++++---
> > 1 file changed, 50 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> > index 7267b2502665..a345e1be24d8 100644
> > --- a/fs/btrfs/ctree.c
> > +++ b/fs/btrfs/ctree.c
> > @@ -599,9 +599,9 @@ int btrfs_force_cow_block(struct btrfs_trans_handle *trans,
> > return ret;
> > }
> >
> > -static inline bool should_cow_block(const struct btrfs_trans_handle *trans,
> > +static inline bool should_cow_block(struct btrfs_trans_handle *trans,
> > const struct btrfs_root *root,
> > - const struct extent_buffer *buf)
> > + struct extent_buffer *buf)
> > {
> > if (btrfs_is_testing(root->fs_info))
> > return false;
> > @@ -621,8 +621,55 @@ static inline bool should_cow_block(const struct btrfs_trans_handle *trans,
> > if (btrfs_header_generation(buf) != trans->transid)
> > return true;
> >
> > - if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN))
> > + if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
> > + /*
> > + * The buffer was allocated in this transaction and has been
> > + * written back to disk (WRITTEN is set). Normally we'd COW
> > + * it again, but since the committed superblock doesn't
> > + * reference this buffer (it was allocated this transaction),
>
> Missing an "in" before "this transaction".
>
> > + * we can safely overwrite it in place.
> > + *
> > + * We keep BTRFS_HEADER_FLAG_WRITTEN set. The block has been
> > + * persisted at this bytenr and will be again after the
> > + * in-place update. This is important so that
> > + * btrfs_free_tree_block() correctly pins the block if it is
> > + * freed later (e.g., during tree rebalancing or FORCE_COW).
> > + *
> > + * We re-dirty the buffer to ensure the in-place modifications
> > + * will be written back to disk.
> > + *
> > + * Exclusions:
> > + * - Log trees: log blocks are written and immediately
> > + * referenced by a committed superblock via
> > + * btrfs_sync_log(), bypassing the normal transaction
> > + * commit. Overwriting in place could corrupt the
> > + * committed log.
> > + * - Zoned devices: require sequential writes
> > + * - FORCE_COW: snapshot in progress
> > + * - RELOC flag: block being relocated
> > + */
> > + if (!test_bit(EXTENT_BUFFER_WRITEBACK, &buf->bflags) &&
> > + !btrfs_is_zoned(root->fs_info) &&
> > + btrfs_root_id(root) != BTRFS_TREE_LOG_OBJECTID &&
> > + !test_bit(BTRFS_ROOT_FORCE_COW, &root->state) &&
>
> We need a smp_mb__before_atomic() before checking FORCE_COW, see the
> existing code below.
>
> > + !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)) {
> > + /*
> > + * Re-register this block's range in the current
> > + * transaction's dirty_pages so that
> > + * btrfs_write_and_wait_transaction() writes it.
> > + * The range was originally registered when the block
> > + * was allocated, but that transaction's dirty_pages
> > + * may have already been released.
>
> I think it's worth adding something like: "... already been released
> if we are in a transaction that creates snapshots and we have qgroups
> enabled."
>
> Otherwise it looks good, thanks!
>
> > + */
> > + btrfs_set_extent_bit(&trans->transaction->dirty_pages,
> > + buf->start,
> > + buf->start + buf->len - 1,
> > + EXTENT_DIRTY, NULL);
> > + set_extent_buffer_dirty(buf);
> > + return false;
> > + }
> > return true;
> > + }
> >
> > /* Ensure we can see the FORCE_COW bit. */
> > smp_mb__before_atomic();
> > --
> > 2.47.3
> >
> >
* Re: [PATCH v2 1/3] btrfs: skip COW for written extent buffers allocated in current transaction
2026-02-14 1:25 ` Sun YangKai
@ 2026-02-17 22:04 ` Leo Martins
0 siblings, 0 replies; 11+ messages in thread
From: Leo Martins @ 2026-02-17 22:04 UTC (permalink / raw)
To: Sun YangKai; +Cc: linux-btrfs, kernel-team
On Sat, 14 Feb 2026 09:25:03 +0800 Sun YangKai <sunk67188@gmail.com> wrote:
> Thanks for working on this, I've been expecting this for a long time :)
Thanks for the review!
>
> On 2026/2/14 04:30, Leo Martins wrote:
> > When memory pressure causes writeback of a recently COW'd buffer,
> > btrfs sets BTRFS_HEADER_FLAG_WRITTEN on it. Subsequent
> > btrfs_search_slot() restarts then see the WRITTEN flag and re-COW
> > the buffer unnecessarily, causing COW amplification that can exhaust
> > block reservations and degrade throughput.
> >
> > Overwriting in place is crash-safe because the committed superblock
> > does not reference buffers allocated in the current (uncommitted)
> > transaction, so no on-disk tree points to this block yet.
> >
> > When should_cow_block() encounters a WRITTEN buffer whose generation
> > matches the current transaction, instead of requesting a COW, re-dirty
> > the buffer and re-register its range in the transaction's dirty_pages.
> >
> > Both are necessary because btrfs tracks dirty metadata through two
> > independent mechanisms. set_extent_buffer_dirty() sets the
> > EXTENT_BUFFER_DIRTY flag and the buffer_tree xarray PAGECACHE_TAG_DIRTY
> > mark, which is what background writeback (btree_write_cache_pages) uses
> > to find and write dirty buffers. The transaction's dirty_pages io tree
> > is a separate structure used by btrfs_write_and_wait_transaction() at
> > commit time to ensure all buffers allocated during the transaction are
> > persisted. The dirty_pages range was originally registered in
> > btrfs_init_new_buffer() when the block was first allocated, but
> > background writeback may have already written and cleared it.
> >
> > Keep BTRFS_HEADER_FLAG_WRITTEN set so that btrfs_free_tree_block()
> > correctly pins the block if it is freed later.
> >
> > Exclude cases where in-place overwrite is not safe:
> > - EXTENT_BUFFER_WRITEBACK: buffer is mid-I/O
> > - Zoned devices: require sequential writes
> > - Log trees: log blocks are immediately referenced by a committed
> > superblock via btrfs_sync_log(), so overwriting could corrupt the
> > committed log
> > - BTRFS_ROOT_FORCE_COW: snapshot in progress
> > - BTRFS_HEADER_FLAG_RELOC: block being relocated
> >
> > Signed-off-by: Leo Martins <loemra.dev@gmail.com>
> > ---
> > fs/btrfs/ctree.c | 53 +++++++++++++++++++++++++++++++++++++++++++++---
> > 1 file changed, 50 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> > index 7267b2502665..a345e1be24d8 100644
> > --- a/fs/btrfs/ctree.c
> > +++ b/fs/btrfs/ctree.c
> > @@ -599,9 +599,9 @@ int btrfs_force_cow_block(struct btrfs_trans_handle *trans,
> > return ret;
> > }
> >
> > -static inline bool should_cow_block(const struct btrfs_trans_handle *trans,
> > +static inline bool should_cow_block(struct btrfs_trans_handle *trans,
> > const struct btrfs_root *root,
> > - const struct extent_buffer *buf)
> > + struct extent_buffer *buf)
> > {
> > if (btrfs_is_testing(root->fs_info))
> > return false;
> > @@ -621,8 +621,55 @@ static inline bool should_cow_block(const struct btrfs_trans_handle *trans,
> > if (btrfs_header_generation(buf) != trans->transid)
> > return true;
> >
> > - if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN))
> > + if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
> > + /*
> > + * The buffer was allocated in this transaction and has been
> > + * written back to disk (WRITTEN is set). Normally we'd COW
> > + * it again, but since the committed superblock doesn't
> > + * reference this buffer (it was allocated this transaction),
> > + * we can safely overwrite it in place.
> > + *
> > + * We keep BTRFS_HEADER_FLAG_WRITTEN set. The block has been
> > + * persisted at this bytenr and will be again after the
> > + * in-place update. This is important so that
> > + * btrfs_free_tree_block() correctly pins the block if it is
> > + * freed later (e.g., during tree rebalancing or FORCE_COW).
> > + *
> > + * We re-dirty the buffer to ensure the in-place modifications
> > + * will be written back to disk.
> > + *
> > + * Exclusions:
> > + * - Log trees: log blocks are written and immediately
> > + * referenced by a committed superblock via
> > + * btrfs_sync_log(), bypassing the normal transaction
> > + * commit. Overwriting in place could corrupt the
> > + * committed log.
> > + * - Zoned devices: require sequential writes
> > + * - FORCE_COW: snapshot in progress
> > + * - RELOC flag: block being relocated
> > + */
> > + if (!test_bit(EXTENT_BUFFER_WRITEBACK, &buf->bflags) &&
> > + !btrfs_is_zoned(root->fs_info) &&
> > + btrfs_root_id(root) != BTRFS_TREE_LOG_OBJECTID &&
> > + !test_bit(BTRFS_ROOT_FORCE_COW, &root->state) &&
> it seems we need smp_mb__before_atomic() to see the FORCE_COW bit?
Good call, fixed in v3.
> > + !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)) {
> > + /*
> > + * Re-register this block's range in the current
> > + * transaction's dirty_pages so that
> > + * btrfs_write_and_wait_transaction() writes it.
> > + * The range was originally registered when the block
> > + * was allocated, but that transaction's dirty_pages
> > + * may have already been released.
> > + */
> > + btrfs_set_extent_bit(&trans->transaction->dirty_pages,
> > + buf->start,
> > + buf->start + buf->len - 1,
> > + EXTENT_DIRTY, NULL);
> > + set_extent_buffer_dirty(buf);
> why use set_extent_buffer_dirty() instead of btrfs_mark_buffer_dirty()?
> I don't see any other callers doing this.
btrfs_mark_buffer_dirty() calls btrfs_assert_tree_write_locked(buf),
but should_cow_block() may be called from btrfs_search_slot() while we
hold only a read lock on the buffer (the root node is acquired with
BTRFS_READ_LOCK in btrfs_search_slot_get_root()). Added a comment
explaining this in v3.
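To illustrate the constraint, here is a small userspace model (not
kernel code; the struct, enum, and function names are made up for the
example) of why the asserting helper cannot be used under a read lock:

```c
#include <stdbool.h>

/*
 * Userspace model of the locking difference discussed above:
 * btrfs_mark_buffer_dirty() asserts the buffer is write-locked, while
 * set_extent_buffer_dirty() performs no locking assertion. All names
 * here are illustrative, not the kernel API.
 */
enum model_lock { MODEL_UNLOCKED, MODEL_READ_LOCK, MODEL_WRITE_LOCK };

struct model_eb {
	enum model_lock lock;
	bool dirty;
};

/* Mirrors set_extent_buffer_dirty(): dirties unconditionally. */
static void model_set_dirty(struct model_eb *eb)
{
	eb->dirty = true;
}

/*
 * Mirrors btrfs_mark_buffer_dirty(): requires the write lock. Returns
 * false where the kernel helper's btrfs_assert_tree_write_locked()
 * would fire.
 */
static bool model_mark_dirty(struct model_eb *eb)
{
	if (eb->lock != MODEL_WRITE_LOCK)
		return false;
	eb->dirty = true;
	return true;
}
```

Under a read lock the asserting variant refuses to dirty the buffer,
while the plain setter succeeds, which is the situation
should_cow_block() can be called in.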
> > + return false;
> > + }
> > return true;
> > + }
> >
> > /* Ensure we can see the FORCE_COW bit. */
> > smp_mb__before_atomic();
>
> And I wonder if we could have something more readable like this:
>
> if (btrfs_header_generation(buf) != trans->transid)
> return true;
>
> if (test_bit(EXTENT_BUFFER_WRITEBACK, &buf->bflags))
> return true;
>
> if (btrfs_root_id(root) != BTRFS_TREE_RELOC_OBJECTID &&
> btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC))
> return true;
>
> /* Ensure we can see the FORCE_COW bit. */
> smp_mb__before_atomic();
> if (test_bit(BTRFS_ROOT_FORCE_COW, &root->state))
> return true;
>
> if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
> if (btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID ||
> btrfs_is_zoned(root->fs_info))
> return true;
> btrfs_set_extent_bit(&trans->transaction->dirty_pages,
> buf->start,
> buf->start + buf->len - 1,
> EXTENT_DIRTY, NULL);
> btrfs_mark_buffer_dirty(trans, buf);
> }
>
> return false;
Updated in v3!
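For reference, the decision order in your restructured version can be
captured as a pure predicate; a userspace sketch (plain booleans stand
in for the header flags, root state bits, and fs features, and all
names are illustrative, not the kernel API):

```c
#include <stdbool.h>

/*
 * Userspace model of the restructured should_cow_block() ordering
 * proposed above. This only illustrates the decision flow; it is not
 * kernel code.
 */
struct cow_ctx {
	bool gen_matches; /* btrfs_header_generation(buf) == trans->transid */
	bool writeback;   /* EXTENT_BUFFER_WRITEBACK set on buf */
	bool reloc_root;  /* root is BTRFS_TREE_RELOC_OBJECTID */
	bool reloc_flag;  /* BTRFS_HEADER_FLAG_RELOC set on buf */
	bool force_cow;   /* BTRFS_ROOT_FORCE_COW set on root */
	bool written;     /* BTRFS_HEADER_FLAG_WRITTEN set on buf */
	bool log_tree;    /* root is BTRFS_TREE_LOG_OBJECTID */
	bool zoned;       /* zoned filesystem */
};

/*
 * Returns true if the block must be COW'd. *redirty reports whether the
 * WRITTEN overwrite-in-place path would re-dirty the buffer instead.
 */
static bool model_should_cow(const struct cow_ctx *c, bool *redirty)
{
	*redirty = false;
	if (!c->gen_matches)
		return true;
	if (c->writeback)
		return true;
	if (!c->reloc_root && c->reloc_flag)
		return true;
	if (c->force_cow)
		return true;
	if (c->written) {
		if (c->log_tree || c->zoned)
			return true;
		*redirty = true; /* overwrite in place, re-dirty the buffer */
	}
	return false;
}
```

The key property is that a current-generation WRITTEN buffer on a
non-log, non-zoned root falls through to the overwrite-in-place path,
while any of the exclusions still forces a COW.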
>
> Thanks,
> Sun YangKai
* Re: [PATCH v2 3/3] btrfs: add tracepoint for COW amplification tracking
2026-02-16 12:40 ` Filipe Manana
@ 2026-02-17 22:06 ` Leo Martins
0 siblings, 0 replies; 11+ messages in thread
From: Leo Martins @ 2026-02-17 22:06 UTC (permalink / raw)
To: Filipe Manana; +Cc: linux-btrfs, kernel-team
On Mon, 16 Feb 2026 12:40:04 +0000 Filipe Manana <fdmanana@kernel.org> wrote:
> On Fri, Feb 13, 2026 at 8:37 PM Leo Martins <loemra.dev@gmail.com> wrote:
> >
> > Add a btrfs_search_slot_stats tracepoint to btrfs_search_slot() for
> > measuring COW amplification.
> >
> > The tracepoint fires when a search with at least one COW completes,
> > recording the root, total cow_count, restart_count, and return value.
> > cow_count and restart_count per search_slot call are useful metrics
> > for tracking COW amplification.
> >
> > Signed-off-by: Leo Martins <loemra.dev@gmail.com>
> > ---
> > fs/btrfs/ctree.c | 15 +++++++++++++--
> > include/trace/events/btrfs.h | 26 ++++++++++++++++++++++++++
> > 2 files changed, 39 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> > index 55187ba59cc0..1971d7bb5f60 100644
> > --- a/fs/btrfs/ctree.c
> > +++ b/fs/btrfs/ctree.c
> > @@ -2069,6 +2069,8 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> > u8 lowest_level = 0;
> > int min_write_lock_level;
> > int prev_cmp;
> > + int cow_count = 0;
> > + int restart_count = 0;
> >
> > if (!root)
> > return -EINVAL;
> > @@ -2157,6 +2159,7 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> > p->nodes[level + 1])) {
> > write_lock_level = level + 1;
> > btrfs_release_path(p);
> > + restart_count++;
> > goto again;
> > }
> >
> > @@ -2172,6 +2175,7 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> > ret = ret2;
> > goto done;
> > }
> > + cow_count++;
> > }
> > cow_done:
> > p->nodes[level] = b;
> > @@ -2219,8 +2223,10 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> > p->slots[level] = slot;
> > ret2 = setup_nodes_for_search(trans, root, p, b, level, ins_len,
> > &write_lock_level);
> > - if (ret2 == -EAGAIN)
> > + if (ret2 == -EAGAIN) {
> > + restart_count++;
> > goto again;
> > + }
> > if (ret2) {
> > ret = ret2;
> > goto done;
> > @@ -2236,6 +2242,7 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> > if (slot == 0 && ins_len && write_lock_level < level + 1) {
> > write_lock_level = level + 1;
> > btrfs_release_path(p);
> > + restart_count++;
> > goto again;
> > }
> >
> > @@ -2249,8 +2256,10 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> > }
> >
> > ret2 = read_block_for_search(root, p, &b, slot, key);
> > - if (ret2 == -EAGAIN && !p->nowait)
> > + if (ret2 == -EAGAIN && !p->nowait) {
> > + restart_count++;
> > goto again;
> > + }
> > if (ret2) {
> > ret = ret2;
> > goto done;
> > @@ -2281,6 +2290,8 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> > }
> > ret = 1;
> > done:
> > + if (cow_count > 0)
> > + trace_btrfs_search_slot_stats(root, cow_count, restart_count, ret);
>
> So I find this way too specific, plus even if trace points are
> disabled we have the overhead of the counters (and inside critical
> sections).
>
> We already have a tracepoint for COW, trace_btrfs_cow_block(), and we
> could have one just for the retry thing, maybe naming it like
> trace_btrfs_search_slot_restart() or something.
> So we could use those two tracepoints to measure things (bpftrace
> scripts could easily report a count of each trace point and such),
> instead of this highly specialized tracepoint that adds some overhead
> when tracepoints are disabled.
Good point, added a per-restart-site trace_btrfs_search_slot_restart()
tracepoint in v3.
Thanks,
Leo
>
> Thanks.
>
>
> > if (ret < 0 && !p->skip_release_on_error)
> > btrfs_release_path(p);
> >
> > diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
> > index 125bdc166bfe..b8934938a087 100644
> > --- a/include/trace/events/btrfs.h
> > +++ b/include/trace/events/btrfs.h
> > @@ -1110,6 +1110,32 @@ TRACE_EVENT(btrfs_cow_block,
> > __entry->cow_level)
> > );
> >
> > +TRACE_EVENT(btrfs_search_slot_stats,
> > +
> > + TP_PROTO(const struct btrfs_root *root,
> > + int cow_count, int restart_count, int ret),
> > +
> > + TP_ARGS(root, cow_count, restart_count, ret),
> > +
> > + TP_STRUCT__entry_btrfs(
> > + __field( u64, root_objectid )
> > + __field( int, cow_count )
> > + __field( int, restart_count )
> > + __field( int, ret )
> > + ),
> > +
> > + TP_fast_assign_btrfs(root->fs_info,
> > + __entry->root_objectid = btrfs_root_id(root);
> > + __entry->cow_count = cow_count;
> > + __entry->restart_count = restart_count;
> > + __entry->ret = ret;
> > + ),
> > +
> > + TP_printk_btrfs("root=%llu(%s) cow_count=%d restarts=%d ret=%d",
> > + show_root_type(__entry->root_objectid),
> > + __entry->cow_count, __entry->restart_count, __entry->ret)
> > +);
> > +
> > TRACE_EVENT(btrfs_space_reservation,
> >
> > TP_PROTO(const struct btrfs_fs_info *fs_info, const char *type, u64 val,
> > --
> > 2.47.3
> >
> >
end of thread, other threads:[~2026-02-17 22:06 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-13 20:30 [PATCH v2 0/3] btrfs: fix COW amplification under memory pressure Leo Martins
2026-02-13 20:30 ` [PATCH v2 1/3] btrfs: skip COW for written extent buffers allocated in current transaction Leo Martins
2026-02-14 1:25 ` Sun YangKai
2026-02-17 22:04 ` Leo Martins
2026-02-16 12:18 ` Filipe Manana
2026-02-17 21:48 ` Leo Martins
2026-02-13 20:30 ` [PATCH v2 2/3] btrfs: inhibit extent buffer writeback to prevent COW amplification Leo Martins
2026-02-16 12:33 ` Filipe Manana
2026-02-13 20:30 ` [PATCH v2 3/3] btrfs: add tracepoint for COW amplification tracking Leo Martins
2026-02-16 12:40 ` Filipe Manana
2026-02-17 22:06 ` Leo Martins