public inbox for linux-btrfs@vger.kernel.org
* [PATCH v4 0/3] btrfs: fix COW amplification under memory pressure
@ 2026-02-26  9:51 Leo Martins
  2026-02-26  9:51 ` [PATCH v4 1/3] btrfs: skip COW for written extent buffers allocated in current transaction Leo Martins
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Leo Martins @ 2026-02-26  9:51 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

I've been investigating a pattern where COW amplification under memory
pressure exhausts the transaction block reserve, leading to spurious
ENOSPCs on filesystems with plenty of unallocated space.

The pattern is:

 1. btrfs_search_slot() begins tree traversal with cow=1
 2. Node at level N needs COW (old generation or WRITTEN flag set)
 3. btrfs_cow_block() allocates new block, copies data, updates
    parent pointer
 4. Traversal hits a condition requiring restart (node not cached,
    lock contention, need higher write_lock_level)
 5. btrfs_release_path() releases all locks and references
 6. Memory pressure triggers writeback on the COW'd block
 7. lock_extent_buffer_for_io() clears EXTENT_BUFFER_DIRTY and sets
    BTRFS_HEADER_FLAG_WRITTEN
 8. goto again - traversal restarts from root
 9. Traversal reaches the same tree node
 10. should_cow_block() sees WRITTEN flag, returns true
 11. btrfs_cow_block() allocates yet another new block, copies data,
     updates parent pointer again, consuming another reservation
 12. Steps 4-11 repeat under sustained memory pressure

This series fixes the problem with two complementary approaches:

Patch 1 implements Filipe's suggestion of overwriting in place. When
should_cow_block() encounters a WRITTEN buffer whose generation matches
the current transaction, it re-dirties the buffer and returns false
instead of requesting a COW. This is crash-safe because the committed
superblock does not reference buffers allocated in the current
transaction. Log trees, zoned devices, FORCE_COW, relocation, and
buffers mid-writeback are excluded. Beyond mitigating the amplification
bug, this is a general optimization for the entire transaction lifetime:
any time writeback runs during a transaction, revisiting the same path
no longer triggers unnecessary COW, reducing extent allocation overhead,
memory copies, and space usage per transaction.

Patch 2 inhibits writeback on COW'd buffers for the lifetime of the
transaction handle. This prevents WRITTEN from being set while a
handle holds a reference to the buffer, avoiding unnecessary re-COW
at its source. Only WB_SYNC_NONE (background/periodic flusher
writeback) is inhibited. WB_SYNC_ALL (data-integrity writeback from
btrfs_sync_log() and btrfs_write_and_wait_transaction()) always
proceeds, which is required for correctness in the fsync path where
log tree blocks must be written while the inhibiting handle is still
active.

Both approaches are independently useful. The overwrite path is a
general optimization that eliminates unnecessary COW across the entire
transaction, not just within a single handle. Writeback inhibition
prevents the problem from occurring in the first place and is the
only fix for log trees and zoned devices where overwrite does not
apply. Together they provide defense in depth against COW
amplification.

Patch 3 adds a tracepoint for tracking search slot restarts.

Note: Boris's AS_KERNEL_FILE changes prevent cgroup-scoped reclaim
from targeting extent buffer pages, making this harder to trigger via
per-cgroup memory limits. I was able to reproduce the amplification
using global memory pressure (drop_caches, root cgroup memory.reclaim,
memory hog) with concurrent filesystem operations.

Benchmark: concurrent filesystem operations under aggressive global
memory pressure. COW counts measured via bpftrace.

  Approach                   Max COW count   Num operations
  -------                    -------------   --------------
  Baseline                              35   -
  Overwrite only                        19   ~same
  Writeback inhibition only              5   ~2x
  Combined (this series)                 4   ~2x
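
For reference, per-process COW counts like the ones above can be
gathered from the existing btrfs_cow_block tracepoint with a bpftrace
one-liner (a sketch of the measurement, not the exact script used):

```
bpftrace -e 'tracepoint:btrfs:btrfs_cow_block { @cow[comm] = count(); }'
```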

Changes since v3:

Patch 2:
 - Also inhibit writeback in should_cow_block() when COW is skipped,
   so that every transaction handle that reuses a COW'd buffer inhibits
   its writeback, not just the handle that originally COW'd it

Changes since v2:

Patch 1:
 - Add smp_mb__before_atomic() before FORCE_COW check (Filipe, Sun YangKai)
 - Hoist WRITEBACK, RELOC, FORCE_COW checks before WRITTEN block (Sun YangKai)
 - Update dirty_pages comment for qgroup_account_snapshot() (Filipe)
 - Relax btrfs_mark_buffer_dirty() lockdep assertion to accept read locks
 - Use btrfs_mark_buffer_dirty() instead of set_extent_buffer_dirty()
 - Remove folio_test_dirty() debug assertion (concurrent reader race)

Patch 2:
 - Fix comment style (Filipe)

Patch 3:
 - Replace counters with per-restart-site tracepoint (Filipe)

Changes since v1:

The v1 patch used a per-btrfs_search_slot() xarray to track COW'd
buffers. Filipe pointed out this was too complex and too narrow in
scope, and suggested overwriting in place instead. Qu raised the
concern that other btrfs_cow_block() callers outside of search_slot
would not be covered. Boris suggested transaction-handle-level scope
as an alternative.

v2 implemented both overwrite-in-place and writeback inhibition at
the transaction handle level. The per-search-slot xarray was replaced
by a per-transaction-handle xarray, which covers all
btrfs_force_cow_block() callers.

Leo Martins (3):
  btrfs: skip COW for written extent buffers allocated in current
    transaction
  btrfs: inhibit extent buffer writeback to prevent COW amplification
  btrfs: add tracepoint for search slot restart tracking

 fs/btrfs/ctree.c             | 106 ++++++++++++++++++++++++++++-------
 fs/btrfs/disk-io.c           |   2 +-
 fs/btrfs/extent_io.c         |  67 ++++++++++++++++++++--
 fs/btrfs/extent_io.h         |   6 ++
 fs/btrfs/transaction.c       |  19 +++++++
 fs/btrfs/transaction.h       |   3 +
 include/trace/events/btrfs.h |  24 ++++++++
 7 files changed, 200 insertions(+), 27 deletions(-)

-- 
2.47.3


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v4 1/3] btrfs: skip COW for written extent buffers allocated in current transaction
  2026-02-26  9:51 [PATCH v4 0/3] btrfs: fix COW amplification under memory pressure Leo Martins
@ 2026-02-26  9:51 ` Leo Martins
  2026-03-04  6:31   ` Qu Wenruo
  2026-02-26  9:51 ` [PATCH v4 2/3] btrfs: inhibit extent buffer writeback to prevent COW amplification Leo Martins
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 10+ messages in thread
From: Leo Martins @ 2026-02-26  9:51 UTC (permalink / raw)
  To: linux-btrfs, kernel-team; +Cc: Filipe Manana, Sun YangKai

When memory pressure causes writeback of a recently COW'd buffer,
btrfs sets BTRFS_HEADER_FLAG_WRITTEN on it. Subsequent
btrfs_search_slot() restarts then see the WRITTEN flag and re-COW
the buffer unnecessarily, causing COW amplification that can exhaust
block reservations and degrade throughput.

Overwriting in place is crash-safe because the committed superblock
does not reference buffers allocated in the current (uncommitted)
transaction, so no on-disk tree points to this block yet.

When should_cow_block() encounters a WRITTEN buffer whose generation
matches the current transaction, instead of requesting a COW, re-dirty
the buffer and re-register its range in the transaction's dirty_pages.

Both are necessary because btrfs tracks dirty metadata through two
independent mechanisms. set_extent_buffer_dirty() sets the
EXTENT_BUFFER_DIRTY flag and the buffer_tree xarray PAGECACHE_TAG_DIRTY
mark, which is what background writeback (btree_write_cache_pages) uses
to find and write dirty buffers. The transaction's dirty_pages io tree
is a separate structure used by btrfs_write_and_wait_transaction() at
commit time to ensure all buffers allocated during the transaction are
persisted. The dirty_pages range was originally registered in
btrfs_init_new_buffer() when the block was first allocated. Normally
dirty_pages is only cleared at commit time by
btrfs_write_and_wait_transaction(), but if qgroups are enabled and
snapshots are being created, qgroup_account_snapshot() may have already
called btrfs_write_and_wait_transaction() and released the range before
the final commit-time call.

Keep BTRFS_HEADER_FLAG_WRITTEN set so that btrfs_free_tree_block()
correctly pins the block if it is freed later.

Relax the lockdep assertion in btrfs_mark_buffer_dirty() from
btrfs_assert_tree_write_locked() to lockdep_assert_held() so that it
accepts either a read or write lock. should_cow_block() may be called
from btrfs_search_slot() when only a read lock is held (nodes above
write_lock_level are read-locked). The write lock assertion previously
documented the caller convention that buffer content was being modified
under exclusive access, but btrfs_mark_buffer_dirty() and
set_extent_buffer_dirty() themselves only perform independently
synchronized operations: atomic bit ops on bflags, folio_mark_dirty()
(kernel-internal folio locking), xarray mark updates (xarray spinlock),
and percpu counter updates. The read lock is sufficient because it
prevents lock_extent_buffer_for_io() from acquiring the write lock and
racing on the dirty state. Since rw_semaphore permits concurrent
readers, multiple threads can enter btrfs_mark_buffer_dirty()
simultaneously for the same buffer; this is safe because
test_and_set_bit(EXTENT_BUFFER_DIRTY) ensures only one thread performs
the full dirty state transition.

Remove the CONFIG_BTRFS_DEBUG assertion in set_extent_buffer_dirty()
that checked folio_test_dirty() after marking the buffer dirty. This
assertion assumed exclusive access (only one thread in
set_extent_buffer_dirty() at a time), which held when the only caller
was btrfs_mark_buffer_dirty() under write lock. With concurrent readers
calling through should_cow_block(), a thread that loses the
test_and_set_bit race sees was_dirty=true and skips the folio dirty
marking, but the winning thread may not have called
btrfs_meta_folio_set_dirty() yet, causing the assertion to fire. This
is a benign race: the winning thread will complete the folio dirty
marking, and no writeback can clear it while readers hold their locks.

Hoist the EXTENT_BUFFER_WRITEBACK, BTRFS_HEADER_FLAG_RELOC, and
BTRFS_ROOT_FORCE_COW checks before the WRITTEN block since they apply
regardless of whether the buffer has been written back. This
consolidates the exclusion logic and simplifies the WRITTEN path to
only handle log trees and zoned devices. Moving the RELOC checks
before the smp_mb__before_atomic() barrier is safe because both
btrfs_root_id() (immutable) and BTRFS_HEADER_FLAG_RELOC (set at COW
time under tree lock) are stable values not subject to concurrent
modification; the barrier is only needed for BTRFS_ROOT_FORCE_COW
which is set concurrently by create_pending_snapshot().

Exclude cases where in-place overwrite is not safe:
 - EXTENT_BUFFER_WRITEBACK: buffer is mid-I/O
 - Zoned devices: require sequential writes
 - Log trees: log blocks are immediately referenced by a committed
   superblock via btrfs_sync_log(), so overwriting could corrupt the
   committed log
 - BTRFS_ROOT_FORCE_COW: snapshot in progress
 - BTRFS_HEADER_FLAG_RELOC: block being relocated

Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Sun YangKai <sunk67188@gmail.com>
---
 fs/btrfs/ctree.c     | 87 ++++++++++++++++++++++++++++++++++----------
 fs/btrfs/disk-io.c   |  2 +-
 fs/btrfs/extent_io.c |  4 --
 3 files changed, 69 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 7267b2502665..ea7cfc3a9e89 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -599,29 +599,40 @@ int btrfs_force_cow_block(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
-static inline bool should_cow_block(const struct btrfs_trans_handle *trans,
+/*
+ * Check if @buf needs to be COW'd.
+ *
+ * Returns true if COW is required, false if the block can be reused
+ * in place.
+ *
+ * We do not need to COW a block if:
+ * 1) the block was created or changed in this transaction;
+ * 2) the block does not belong to TREE_RELOC tree;
+ * 3) the root is not forced COW.
+ *
+ * Forced COW happens when we create a snapshot during transaction commit:
+ * after copying the src root, we must COW the shared block to ensure
+ * metadata consistency.
+ *
+ * When returning false for a WRITTEN buffer allocated in the current
+ * transaction, re-dirties the buffer for in-place overwrite instead
+ * of requesting a new COW.
+ */
+static inline bool should_cow_block(struct btrfs_trans_handle *trans,
 				    const struct btrfs_root *root,
-				    const struct extent_buffer *buf)
+				    struct extent_buffer *buf)
 {
 	if (btrfs_is_testing(root->fs_info))
 		return false;
 
-	/*
-	 * We do not need to cow a block if
-	 * 1) this block is not created or changed in this transaction;
-	 * 2) this block does not belong to TREE_RELOC tree;
-	 * 3) the root is not forced COW.
-	 *
-	 * What is forced COW:
-	 *    when we create snapshot during committing the transaction,
-	 *    after we've finished copying src root, we must COW the shared
-	 *    block to ensure the metadata consistency.
-	 */
-
 	if (btrfs_header_generation(buf) != trans->transid)
 		return true;
 
-	if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN))
+	if (test_bit(EXTENT_BUFFER_WRITEBACK, &buf->bflags))
+		return true;
+
+	if (btrfs_root_id(root) != BTRFS_TREE_RELOC_OBJECTID &&
+	    btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC))
 		return true;
 
 	/* Ensure we can see the FORCE_COW bit. */
@@ -629,11 +640,49 @@ static inline bool should_cow_block(const struct btrfs_trans_handle *trans,
 	if (test_bit(BTRFS_ROOT_FORCE_COW, &root->state))
 		return true;
 
-	if (btrfs_root_id(root) == BTRFS_TREE_RELOC_OBJECTID)
-		return false;
+	if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
+		/*
+		 * The buffer was allocated in this transaction and has been
+		 * written back to disk (WRITTEN is set). Normally we'd COW
+		 * it again, but since the committed superblock doesn't
+		 * reference this buffer (it was allocated in this transaction),
+		 * we can safely overwrite it in place.
+		 *
+		 * We keep BTRFS_HEADER_FLAG_WRITTEN set. The block has been
+		 * persisted at this bytenr and will be again after the
+		 * in-place update. This is important so that
+		 * btrfs_free_tree_block() correctly pins the block if it is
+		 * freed later (e.g., during tree rebalancing or FORCE_COW).
+		 *
+		 * Log trees and zoned devices cannot use this optimization:
+		 * - Log trees: log blocks are written and immediately
+		 *   referenced by a committed superblock via
+		 *   btrfs_sync_log(), bypassing the normal transaction
+		 *   commit. Overwriting in place could corrupt the
+		 *   committed log.
+		 * - Zoned devices: require sequential writes.
+		 */
+		if (btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID ||
+		    btrfs_is_zoned(root->fs_info))
+			return true;
 
-	if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC))
-		return true;
+		/*
+		 * Re-register this block's range in the current transaction's
+		 * dirty_pages so that btrfs_write_and_wait_transaction()
+		 * writes it. The range was originally registered when the
+		 * block was allocated. Normally dirty_pages is only cleared
+		 * at commit time by btrfs_write_and_wait_transaction(), but
+		 * if qgroups are enabled and snapshots are being created,
+		 * qgroup_account_snapshot() may have already called
+		 * btrfs_write_and_wait_transaction() and released the range
+		 * before the final commit-time call.
+		 */
+		btrfs_set_extent_bit(&trans->transaction->dirty_pages,
+				     buf->start,
+				     buf->start + buf->len - 1,
+				     EXTENT_DIRTY, NULL);
+		btrfs_mark_buffer_dirty(trans, buf);
+	}
 
 	return false;
 }
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 32fffb0557e5..bee8f76fbfea 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4491,7 +4491,7 @@ void btrfs_mark_buffer_dirty(struct btrfs_trans_handle *trans,
 #endif
 	/* This is an active transaction (its state < TRANS_STATE_UNBLOCKED). */
 	ASSERT(trans->transid == fs_info->generation);
-	btrfs_assert_tree_write_locked(buf);
+	lockdep_assert_held(&buf->lock);
 	if (unlikely(transid != fs_info->generation)) {
 		btrfs_abort_transaction(trans, -EUCLEAN);
 		btrfs_crit(fs_info,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index dfc17c292217..ff1fc699a6ca 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3791,10 +3791,6 @@ void set_extent_buffer_dirty(struct extent_buffer *eb)
 					 eb->len,
 					 eb->fs_info->dirty_metadata_batch);
 	}
-#ifdef CONFIG_BTRFS_DEBUG
-	for (int i = 0; i < num_extent_folios(eb); i++)
-		ASSERT(folio_test_dirty(eb->folios[i]));
-#endif
 }
 
 void clear_extent_buffer_uptodate(struct extent_buffer *eb)
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v4 2/3] btrfs: inhibit extent buffer writeback to prevent COW amplification
  2026-02-26  9:51 [PATCH v4 0/3] btrfs: fix COW amplification under memory pressure Leo Martins
  2026-02-26  9:51 ` [PATCH v4 1/3] btrfs: skip COW for written extent buffers allocated in current transaction Leo Martins
@ 2026-02-26  9:51 ` Leo Martins
  2026-02-26 15:23   ` Filipe Manana
                     ` (2 more replies)
  2026-02-26  9:51 ` [PATCH v4 3/3] btrfs: add tracepoint for search slot restart tracking Leo Martins
  2026-03-02 23:12 ` [PATCH v4 0/3] btrfs: fix COW amplification under memory pressure Boris Burkov
  3 siblings, 3 replies; 10+ messages in thread
From: Leo Martins @ 2026-02-26  9:51 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Inhibit writeback on COW'd extent buffers for the lifetime of the
transaction handle, preventing background writeback from setting
BTRFS_HEADER_FLAG_WRITTEN and causing unnecessary re-COW.

COW amplification occurs when background writeback flushes an extent
buffer that a transaction handle is still actively modifying. When
lock_extent_buffer_for_io() transitions a buffer from dirty to
writeback, it sets BTRFS_HEADER_FLAG_WRITTEN, marking the block as
having been persisted to disk at its current bytenr. Once WRITTEN is
set, should_cow_block() must either COW the block again or overwrite
it in place, both of which add unnecessary overhead when the buffer
is still being modified by the same handle that allocated it. By
inhibiting background writeback on actively-used buffers, WRITTEN is
never set while a transaction handle holds a reference to the buffer,
avoiding this overhead entirely.

Add an atomic_t writeback_inhibitors counter to struct extent_buffer,
which fits in an existing 6-byte hole without increasing struct size.
When a buffer is COW'd in btrfs_force_cow_block(), call
btrfs_inhibit_eb_writeback() to store the eb in the transaction
handle's writeback_inhibited_ebs xarray (keyed by eb->start), take a
reference, and increment writeback_inhibitors. The function handles
dedup (same eb inhibited twice by the same handle) and replacement
(different eb at the same logical address). Allocation failure is
graceful: the buffer simply falls back to the pre-existing behavior
where it may be written back and re-COW'd.

Also inhibit writeback in should_cow_block() when COW is skipped,
so that every transaction handle that reuses an already-COW'd buffer
also inhibits its writeback. Without this, if handle A COWs a block
and inhibits it, and handle B later reuses the same block without
inhibiting, handle A's uninhibit on end_transaction leaves the buffer
unprotected while handle B is still using it. This ensures all handles
that access a COW'd buffer contribute to the inhibitor count, and the
buffer remains protected until the last handle releases it.

In lock_extent_buffer_for_io(), when writeback_inhibitors is non-zero
and the writeback mode is WB_SYNC_NONE, skip the buffer. WB_SYNC_NONE
is used by the VM flusher threads for background and periodic
writeback, which are the only paths that cause COW amplification by
opportunistically writing out dirty extent buffers mid-transaction.
Skipping these is safe because the buffers remain dirty in the page
cache and will be written out at transaction commit time.

WB_SYNC_ALL must always proceed regardless of writeback_inhibitors.
This is required for correctness in the fsync path: btrfs_sync_log()
writes log tree blocks via filemap_fdatawrite_range() (WB_SYNC_ALL)
while the transaction handle that inhibited those same blocks is still
active. Without the WB_SYNC_ALL bypass, those inhibited log tree
blocks would be silently skipped, resulting in an incomplete log on
disk and corruption on replay. btrfs_write_and_wait_transaction()
also uses WB_SYNC_ALL via filemap_fdatawrite_range(); for that path,
inhibitors are already cleared beforehand, but the bypass ensures
correctness regardless.

Uninhibit in __btrfs_end_transaction() before atomic_dec(num_writers)
to prevent a race where the committer proceeds while buffers are still
inhibited. Also uninhibit in btrfs_commit_transaction() before writing
and in cleanup_transaction() for the error path.

Signed-off-by: Leo Martins <loemra.dev@gmail.com>
---
 fs/btrfs/ctree.c       |  9 ++++++
 fs/btrfs/extent_io.c   | 63 +++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/extent_io.h   |  6 ++++
 fs/btrfs/transaction.c | 19 +++++++++++++
 fs/btrfs/transaction.h |  3 ++
 5 files changed, 99 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index ea7cfc3a9e89..46a715c95bc8 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -21,6 +21,7 @@
 #include "fs.h"
 #include "accessors.h"
 #include "extent-tree.h"
+#include "extent_io.h"
 #include "relocation.h"
 #include "file-item.h"
 
@@ -590,6 +591,10 @@ int btrfs_force_cow_block(struct btrfs_trans_handle *trans,
 		btrfs_tree_unlock(buf);
 	free_extent_buffer_stale(buf);
 	btrfs_mark_buffer_dirty(trans, cow);
+
+	/* Inhibit writeback on the COW'd buffer for this transaction handle. */
+	btrfs_inhibit_eb_writeback(trans, cow);
+
 	*cow_ret = cow;
 	return 0;
 
@@ -617,6 +622,9 @@ int btrfs_force_cow_block(struct btrfs_trans_handle *trans,
  * When returning false for a WRITTEN buffer allocated in the current
  * transaction, re-dirties the buffer for in-place overwrite instead
  * of requesting a new COW.
+ *
+ * When returning false, inhibits background writeback on the buffer
+ * for the lifetime of the transaction handle.
  */
 static inline bool should_cow_block(struct btrfs_trans_handle *trans,
 				    const struct btrfs_root *root,
@@ -684,6 +692,7 @@ static inline bool should_cow_block(struct btrfs_trans_handle *trans,
 		btrfs_mark_buffer_dirty(trans, buf);
 	}
 
+	btrfs_inhibit_eb_writeback(trans, buf);
 	return false;
 }
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ff1fc699a6ca..e04e42a81978 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1940,7 +1940,9 @@ static noinline_for_stack bool lock_extent_buffer_for_io(struct extent_buffer *e
 	 * of time.
 	 */
 	spin_lock(&eb->refs_lock);
-	if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
+	if ((wbc->sync_mode == WB_SYNC_ALL ||
+	     atomic_read(&eb->writeback_inhibitors) == 0) &&
+	    test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
 		XA_STATE(xas, &fs_info->buffer_tree, eb->start >> fs_info->nodesize_bits);
 		unsigned long flags;
 
@@ -2999,6 +3001,64 @@ static inline void btrfs_release_extent_buffer(struct extent_buffer *eb)
 	kmem_cache_free(extent_buffer_cache, eb);
 }
 
+/*
+ * btrfs_inhibit_eb_writeback - Inhibit writeback on buffer during transaction.
+ * @trans: transaction handle that will own the inhibitor
+ * @eb: extent buffer to inhibit writeback on
+ *
+ * Attempts to track this extent buffer in the transaction's inhibited set.
+ * If memory allocation fails, the buffer is simply not tracked. It may
+ * be written back and need re-COW, which is the original behavior.
+ * This is acceptable since inhibiting writeback is an optimization.
+ */
+void btrfs_inhibit_eb_writeback(struct btrfs_trans_handle *trans,
+				struct extent_buffer *eb)
+{
+	unsigned long index = eb->start >> trans->fs_info->nodesize_bits;
+	void *old;
+
+	/* Check if already inhibited by this handle. */
+	old = xa_load(&trans->writeback_inhibited_ebs, index);
+	if (old == eb)
+		return;
+
+	/* Take reference for the xarray entry. */
+	refcount_inc(&eb->refs);
+
+	old = xa_store(&trans->writeback_inhibited_ebs, index, eb, GFP_NOFS);
+	if (xa_is_err(old)) {
+		/* Allocation failed, just skip inhibiting this buffer. */
+		free_extent_buffer(eb);
+		return;
+	}
+
+	/* Handle replacement of different eb at same index. */
+	if (old && old != eb) {
+		struct extent_buffer *old_eb = old;
+
+		atomic_dec(&old_eb->writeback_inhibitors);
+		free_extent_buffer(old_eb);
+	}
+
+	atomic_inc(&eb->writeback_inhibitors);
+}
+
+/*
+ * btrfs_uninhibit_all_eb_writeback - Uninhibit writeback on all buffers.
+ * @trans: transaction handle to clean up
+ */
+void btrfs_uninhibit_all_eb_writeback(struct btrfs_trans_handle *trans)
+{
+	struct extent_buffer *eb;
+	unsigned long index;
+
+	xa_for_each(&trans->writeback_inhibited_ebs, index, eb) {
+		atomic_dec(&eb->writeback_inhibitors);
+		free_extent_buffer(eb);
+	}
+	xa_destroy(&trans->writeback_inhibited_ebs);
+}
+
 static struct extent_buffer *__alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 						   u64 start)
 {
@@ -3009,6 +3069,7 @@ static struct extent_buffer *__alloc_extent_buffer(struct btrfs_fs_info *fs_info
 	eb->len = fs_info->nodesize;
 	eb->fs_info = fs_info;
 	init_rwsem(&eb->lock);
+	atomic_set(&eb->writeback_inhibitors, 0);
 
 	btrfs_leak_debug_add_eb(eb);
 
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 73571d5d3d5a..fb68fbd4866c 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -102,6 +102,8 @@ struct extent_buffer {
 	/* >= 0 if eb belongs to a log tree, -1 otherwise */
 	s8 log_index;
 	u8 folio_shift;
+	/* Inhibits WB_SYNC_NONE writeback when > 0. */
+	atomic_t writeback_inhibitors;
 	struct rcu_head rcu_head;
 
 	struct rw_semaphore lock;
@@ -381,4 +383,8 @@ void btrfs_extent_buffer_leak_debug_check(struct btrfs_fs_info *fs_info);
 #define btrfs_extent_buffer_leak_debug_check(fs_info)	do {} while (0)
 #endif
 
+void btrfs_inhibit_eb_writeback(struct btrfs_trans_handle *trans,
+			       struct extent_buffer *eb);
+void btrfs_uninhibit_all_eb_writeback(struct btrfs_trans_handle *trans);
+
 #endif
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index f4cc9e1a1b93..a9a22629b49d 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -15,6 +15,7 @@
 #include "misc.h"
 #include "ctree.h"
 #include "disk-io.h"
+#include "extent_io.h"
 #include "transaction.h"
 #include "locking.h"
 #include "tree-log.h"
@@ -688,6 +689,8 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
 		goto alloc_fail;
 	}
 
+	xa_init(&h->writeback_inhibited_ebs);
+
 	/*
 	 * If we are JOIN_NOLOCK we're already committing a transaction and
 	 * waiting on this guy, so we don't need to do the sb_start_intwrite
@@ -1083,6 +1086,13 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
 	if (trans->type & __TRANS_FREEZABLE)
 		sb_end_intwrite(info->sb);
 
+	/*
+	 * Uninhibit extent buffer writeback before decrementing num_writers,
+	 * since the decrement wakes the committing thread which needs all
+	 * buffers uninhibited to write them to disk.
+	 */
+	btrfs_uninhibit_all_eb_writeback(trans);
+
 	WARN_ON(cur_trans != info->running_transaction);
 	WARN_ON(atomic_read(&cur_trans->num_writers) < 1);
 	atomic_dec(&cur_trans->num_writers);
@@ -2110,6 +2120,7 @@ static void cleanup_transaction(struct btrfs_trans_handle *trans, int err)
 	if (!test_bit(BTRFS_FS_RELOC_RUNNING, &fs_info->flags))
 		btrfs_scrub_cancel(fs_info);
 
+	btrfs_uninhibit_all_eb_writeback(trans);
 	kmem_cache_free(btrfs_trans_handle_cachep, trans);
 }
 
@@ -2556,6 +2567,14 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 	    fs_info->cleaner_kthread)
 		wake_up_process(fs_info->cleaner_kthread);
 
+	/*
+	 * Uninhibit writeback on all extent buffers inhibited during this
+	 * transaction before writing them to disk. Inhibiting prevented
+	 * writeback while the transaction was building, but now we need
+	 * them written.
+	 */
+	btrfs_uninhibit_all_eb_writeback(trans);
+
 	ret = btrfs_write_and_wait_transaction(trans);
 	if (unlikely(ret)) {
 		btrfs_err(fs_info, "error while writing out transaction: %d", ret);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 18ef069197e5..7d70fe486758 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -12,6 +12,7 @@
 #include <linux/time64.h>
 #include <linux/mutex.h>
 #include <linux/wait.h>
+#include <linux/xarray.h>
 #include "btrfs_inode.h"
 #include "delayed-ref.h"
 
@@ -162,6 +163,8 @@ struct btrfs_trans_handle {
 	struct btrfs_fs_info *fs_info;
 	struct list_head new_bgs;
 	struct btrfs_block_rsv delayed_rsv;
+	/* Extent buffers with writeback inhibited by this handle. */
+	struct xarray writeback_inhibited_ebs;
 };
 
 /*
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v4 3/3] btrfs: add tracepoint for search slot restart tracking
  2026-02-26  9:51 [PATCH v4 0/3] btrfs: fix COW amplification under memory pressure Leo Martins
  2026-02-26  9:51 ` [PATCH v4 1/3] btrfs: skip COW for written extent buffers allocated in current transaction Leo Martins
  2026-02-26  9:51 ` [PATCH v4 2/3] btrfs: inhibit extent buffer writeback to prevent COW amplification Leo Martins
@ 2026-02-26  9:51 ` Leo Martins
  2026-03-02 23:12 ` [PATCH v4 0/3] btrfs: fix COW amplification under memory pressure Boris Burkov
  3 siblings, 0 replies; 10+ messages in thread
From: Leo Martins @ 2026-02-26  9:51 UTC (permalink / raw)
  To: linux-btrfs, kernel-team; +Cc: Filipe Manana

Add a btrfs_search_slot_restart tracepoint that fires at each restart
site in btrfs_search_slot(), recording the root, tree level, and
reason for the restart. This enables tracking the search slot
restarts that contribute to COW amplification under memory pressure.

The four restart reasons are:
 - write_lock: insufficient write lock level, need to restart with
   higher lock
 - setup_nodes: node setup returned -EAGAIN
 - slot_zero: insertion at slot 0 requires higher write lock level
 - read_block: read_block_for_search returned -EAGAIN (block not
   cached or lock contention)

COW counts are already tracked by the existing trace_btrfs_cow_block()
tracepoint. The per-restart-site tracepoint avoids counter overhead
in the critical path when tracepoints are disabled, and provides
richer per-event information that bpftrace scripts can aggregate into
counts, histograms, and per-root breakdowns.
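
For example, restarts can be aggregated per reason and level with a
bpftrace one-liner (a sketch; assumes this series is applied so the
tracepoint exists):

```
bpftrace -e 'tracepoint:btrfs:btrfs_search_slot_restart
             { @restarts[str(args->reason), args->level] = count(); }'
```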

Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ctree.c             | 10 ++++++++--
 include/trace/events/btrfs.h | 24 ++++++++++++++++++++++++
 2 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 46a715c95bc8..19f1fafeebb6 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2164,6 +2164,7 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 			    p->nodes[level + 1])) {
 				write_lock_level = level + 1;
 				btrfs_release_path(p);
+				trace_btrfs_search_slot_restart(root, level, "write_lock");
 				goto again;
 			}
 
@@ -2226,8 +2227,10 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 		p->slots[level] = slot;
 		ret2 = setup_nodes_for_search(trans, root, p, b, level, ins_len,
 					      &write_lock_level);
-		if (ret2 == -EAGAIN)
+		if (ret2 == -EAGAIN) {
+			trace_btrfs_search_slot_restart(root, level, "setup_nodes");
 			goto again;
+		}
 		if (ret2) {
 			ret = ret2;
 			goto done;
@@ -2243,6 +2246,7 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 		if (slot == 0 && ins_len && write_lock_level < level + 1) {
 			write_lock_level = level + 1;
 			btrfs_release_path(p);
+			trace_btrfs_search_slot_restart(root, level, "slot_zero");
 			goto again;
 		}
 
@@ -2256,8 +2260,10 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 		}
 
 		ret2 = read_block_for_search(root, p, &b, slot, key);
-		if (ret2 == -EAGAIN && !p->nowait)
+		if (ret2 == -EAGAIN && !p->nowait) {
+			trace_btrfs_search_slot_restart(root, level, "read_block");
 			goto again;
+		}
 		if (ret2) {
 			ret = ret2;
 			goto done;
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 125bdc166bfe..abf1b73ee60f 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1110,6 +1110,30 @@ TRACE_EVENT(btrfs_cow_block,
 		  __entry->cow_level)
 );
 
+TRACE_EVENT(btrfs_search_slot_restart,
+
+	TP_PROTO(const struct btrfs_root *root, int level,
+		 const char *reason),
+
+	TP_ARGS(root, level, reason),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	u64,	root_objectid		)
+		__field(	int,	level			)
+		__string(	reason,	reason			)
+	),
+
+	TP_fast_assign_btrfs(root->fs_info,
+		__entry->root_objectid	= btrfs_root_id(root);
+		__entry->level		= level;
+		__assign_str(reason);
+	),
+
+	TP_printk_btrfs("root=%llu(%s) level=%d reason=%s",
+		  show_root_type(__entry->root_objectid),
+		  __entry->level, __get_str(reason))
+);
+
 TRACE_EVENT(btrfs_space_reservation,
 
 	TP_PROTO(const struct btrfs_fs_info *fs_info, const char *type, u64 val,
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v4 2/3] btrfs: inhibit extent buffer writeback to prevent COW amplification
  2026-02-26  9:51 ` [PATCH v4 2/3] btrfs: inhibit extent buffer writeback to prevent COW amplification Leo Martins
@ 2026-02-26 15:23   ` Filipe Manana
  2026-02-26 15:27   ` Sun YangKai
  2026-03-04  4:22   ` David Sterba
  2 siblings, 0 replies; 10+ messages in thread
From: Filipe Manana @ 2026-02-26 15:23 UTC (permalink / raw)
  To: Leo Martins; +Cc: linux-btrfs, kernel-team

On Thu, Feb 26, 2026 at 9:58 AM Leo Martins <loemra.dev@gmail.com> wrote:
>
> Inhibit writeback on COW'd extent buffers for the lifetime of the
> transaction handle, preventing background writeback from setting
> BTRFS_HEADER_FLAG_WRITTEN and causing unnecessary re-COW.
>
> COW amplification occurs when background writeback flushes an extent
> buffer that a transaction handle is still actively modifying. When
> lock_extent_buffer_for_io() transitions a buffer from dirty to
> writeback, it sets BTRFS_HEADER_FLAG_WRITTEN, marking the block as
> having been persisted to disk at its current bytenr. Once WRITTEN is
> set, should_cow_block() must either COW the block again or overwrite
> it in place, both of which are unnecessary overhead when the buffer
> is still being modified by the same handle that allocated it. By
> inhibiting background writeback on actively-used buffers, WRITTEN is
> never set while a transaction handle holds a reference to the buffer,
> avoiding this overhead entirely.
>
> Add an atomic_t writeback_inhibitors counter to struct extent_buffer,
> which fits in an existing 6-byte hole without increasing struct size.
> When a buffer is COW'd in btrfs_force_cow_block(), call
> btrfs_inhibit_eb_writeback() to store the eb in the transaction
> handle's writeback_inhibited_ebs xarray (keyed by eb->start), take a
> reference, and increment writeback_inhibitors. The function handles
> dedup (same eb inhibited twice by the same handle) and replacement
> (different eb at the same logical address). Allocation failure is
> graceful: the buffer simply falls back to the pre-existing behavior
> where it may be written back and re-COW'd.
>
> Also inhibit writeback in should_cow_block() when COW is skipped,
> so that every transaction handle that reuses an already-COW'd buffer
> also inhibits its writeback. Without this, if handle A COWs a block
> and inhibits it, and handle B later reuses the same block without
> inhibiting, handle A's uninhibit on end_transaction leaves the buffer
> unprotected while handle B is still using it. This ensures all handles
> that access a COW'd buffer contribute to the inhibitor count, and the
> buffer remains protected until the last handle releases it.
>
> In lock_extent_buffer_for_io(), when writeback_inhibitors is non-zero
> and the writeback mode is WB_SYNC_NONE, skip the buffer. WB_SYNC_NONE
> is used by the VM flusher threads for background and periodic
> writeback, which are the only paths that cause COW amplification by
> opportunistically writing out dirty extent buffers mid-transaction.
> Skipping these is safe because the buffers remain dirty in the page
> cache and will be written out at transaction commit time.
>
> WB_SYNC_ALL must always proceed regardless of writeback_inhibitors.
> This is required for correctness in the fsync path: btrfs_sync_log()
> writes log tree blocks via filemap_fdatawrite_range() (WB_SYNC_ALL)
> while the transaction handle that inhibited those same blocks is still
> active. Without the WB_SYNC_ALL bypass, those inhibited log tree
> blocks would be silently skipped, resulting in an incomplete log on
> disk and corruption on replay. btrfs_write_and_wait_transaction()
> also uses WB_SYNC_ALL via filemap_fdatawrite_range(); for that path,
> inhibitors are already cleared beforehand, but the bypass ensures
> correctness regardless.
>
> Uninhibit in __btrfs_end_transaction() before atomic_dec(num_writers)
> to prevent a race where the committer proceeds while buffers are still
> inhibited. Also uninhibit in btrfs_commit_transaction() before writing
> and in cleanup_transaction() for the error path.
>
> Signed-off-by: Leo Martins <loemra.dev@gmail.com>
> ---
>  fs/btrfs/ctree.c       |  9 ++++++
>  fs/btrfs/extent_io.c   | 63 +++++++++++++++++++++++++++++++++++++++++-
>  fs/btrfs/extent_io.h   |  6 ++++
>  fs/btrfs/transaction.c | 19 +++++++++++++
>  fs/btrfs/transaction.h |  3 ++
>  5 files changed, 99 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index ea7cfc3a9e89..46a715c95bc8 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -21,6 +21,7 @@
>  #include "fs.h"
>  #include "accessors.h"
>  #include "extent-tree.h"
> +#include "extent_io.h"
>  #include "relocation.h"
>  #include "file-item.h"
>
> @@ -590,6 +591,10 @@ int btrfs_force_cow_block(struct btrfs_trans_handle *trans,
>                 btrfs_tree_unlock(buf);
>         free_extent_buffer_stale(buf);
>         btrfs_mark_buffer_dirty(trans, cow);
> +
> +       /* Inhibit writeback on the COW'd buffer for this transaction handle. */
> +       btrfs_inhibit_eb_writeback(trans, cow);

Btw, that comment is redundant. It's clear what we are doing, since
the function's name is clear about what it does and the eb is named
"cow".
Usually we add comments for things that are not obvious.

> +
>         *cow_ret = cow;
>         return 0;
>
> @@ -617,6 +622,9 @@ int btrfs_force_cow_block(struct btrfs_trans_handle *trans,
>   * When returning false for a WRITTEN buffer allocated in the current
>   * transaction, re-dirties the buffer for in-place overwrite instead
>   * of requesting a new COW.
> + *
> + * When returning false, inhibits background writeback on the buffer
> + * for the lifetime of the transaction handle.
>   */
>  static inline bool should_cow_block(struct btrfs_trans_handle *trans,
>                                     const struct btrfs_root *root,
> @@ -684,6 +692,7 @@ static inline bool should_cow_block(struct btrfs_trans_handle *trans,
>                 btrfs_mark_buffer_dirty(trans, buf);
>         }
>
> +       btrfs_inhibit_eb_writeback(trans, buf);
>         return false;
>  }
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index ff1fc699a6ca..e04e42a81978 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1940,7 +1940,9 @@ static noinline_for_stack bool lock_extent_buffer_for_io(struct extent_buffer *e
>          * of time.
>          */
>         spin_lock(&eb->refs_lock);
> -       if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
> +       if ((wbc->sync_mode == WB_SYNC_ALL ||
> +            atomic_read(&eb->writeback_inhibitors) == 0) &&
> +           test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
>                 XA_STATE(xas, &fs_info->buffer_tree, eb->start >> fs_info->nodesize_bits);
>                 unsigned long flags;
>
> @@ -2999,6 +3001,64 @@ static inline void btrfs_release_extent_buffer(struct extent_buffer *eb)
>         kmem_cache_free(extent_buffer_cache, eb);
>  }
>
> +/*
> + * btrfs_inhibit_eb_writeback - Inhibit writeback on buffer during transaction.
> + * @trans: transaction handle that will own the inhibitor
> + * @eb: extent buffer to inhibit writeback on
> + *
> + * Attempts to track this extent buffer in the transaction's inhibited set.
> + * If memory allocation fails, the buffer is simply not tracked. It may
> + * be written back and need re-COW, which is the original behavior.
> + * This is acceptable since inhibiting writeback is an optimization.
> + */
> +void btrfs_inhibit_eb_writeback(struct btrfs_trans_handle *trans,
> +                               struct extent_buffer *eb)
> +{
> +       unsigned long index = eb->start >> trans->fs_info->nodesize_bits;
> +       void *old;
> +
> +       /* Check if already inhibited by this handle. */
> +       old = xa_load(&trans->writeback_inhibited_ebs, index);
> +       if (old == eb)
> +               return;
> +
> +       /* Take reference for the xarray entry. */
> +       refcount_inc(&eb->refs);
> +
> +       old = xa_store(&trans->writeback_inhibited_ebs, index, eb, GFP_NOFS);
> +       if (xa_is_err(old)) {
> +               /* Allocation failed, just skip inhibiting this buffer. */
> +               free_extent_buffer(eb);
> +               return;
> +       }
> +
> +       /* Handle replacement of different eb at same index. */
> +       if (old && old != eb) {
> +               struct extent_buffer *old_eb = old;
> +
> +               atomic_dec(&old_eb->writeback_inhibitors);
> +               free_extent_buffer(old_eb);
> +       }
> +
> +       atomic_inc(&eb->writeback_inhibitors);

Btw, at the top of this function we should assert the eb is locked.

Otherwise,

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks.

> +}
> +
> +/*
> + * btrfs_uninhibit_all_eb_writeback - Uninhibit writeback on all buffers.
> + * @trans: transaction handle to clean up
> + */
> +void btrfs_uninhibit_all_eb_writeback(struct btrfs_trans_handle *trans)
> +{
> +       struct extent_buffer *eb;
> +       unsigned long index;
> +
> +       xa_for_each(&trans->writeback_inhibited_ebs, index, eb) {
> +               atomic_dec(&eb->writeback_inhibitors);
> +               free_extent_buffer(eb);
> +       }
> +       xa_destroy(&trans->writeback_inhibited_ebs);
> +}
> +
>  static struct extent_buffer *__alloc_extent_buffer(struct btrfs_fs_info *fs_info,
>                                                    u64 start)
>  {
> @@ -3009,6 +3069,7 @@ static struct extent_buffer *__alloc_extent_buffer(struct btrfs_fs_info *fs_info
>         eb->len = fs_info->nodesize;
>         eb->fs_info = fs_info;
>         init_rwsem(&eb->lock);
> +       atomic_set(&eb->writeback_inhibitors, 0);
>
>         btrfs_leak_debug_add_eb(eb);
>
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index 73571d5d3d5a..fb68fbd4866c 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -102,6 +102,8 @@ struct extent_buffer {
>         /* >= 0 if eb belongs to a log tree, -1 otherwise */
>         s8 log_index;
>         u8 folio_shift;
> +       /* Inhibits WB_SYNC_NONE writeback when > 0. */
> +       atomic_t writeback_inhibitors;
>         struct rcu_head rcu_head;
>
>         struct rw_semaphore lock;
> @@ -381,4 +383,8 @@ void btrfs_extent_buffer_leak_debug_check(struct btrfs_fs_info *fs_info);
>  #define btrfs_extent_buffer_leak_debug_check(fs_info)  do {} while (0)
>  #endif
>
> +void btrfs_inhibit_eb_writeback(struct btrfs_trans_handle *trans,
> +                              struct extent_buffer *eb);
> +void btrfs_uninhibit_all_eb_writeback(struct btrfs_trans_handle *trans);
> +
>  #endif
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index f4cc9e1a1b93..a9a22629b49d 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -15,6 +15,7 @@
>  #include "misc.h"
>  #include "ctree.h"
>  #include "disk-io.h"
> +#include "extent_io.h"
>  #include "transaction.h"
>  #include "locking.h"
>  #include "tree-log.h"
> @@ -688,6 +689,8 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
>                 goto alloc_fail;
>         }
>
> +       xa_init(&h->writeback_inhibited_ebs);
> +
>         /*
>          * If we are JOIN_NOLOCK we're already committing a transaction and
>          * waiting on this guy, so we don't need to do the sb_start_intwrite
> @@ -1083,6 +1086,13 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
>         if (trans->type & __TRANS_FREEZABLE)
>                 sb_end_intwrite(info->sb);
>
> +       /*
> +        * Uninhibit extent buffer writeback before decrementing num_writers,
> +        * since the decrement wakes the committing thread which needs all
> +        * buffers uninhibited to write them to disk.
> +        */
> +       btrfs_uninhibit_all_eb_writeback(trans);
> +
>         WARN_ON(cur_trans != info->running_transaction);
>         WARN_ON(atomic_read(&cur_trans->num_writers) < 1);
>         atomic_dec(&cur_trans->num_writers);
> @@ -2110,6 +2120,7 @@ static void cleanup_transaction(struct btrfs_trans_handle *trans, int err)
>         if (!test_bit(BTRFS_FS_RELOC_RUNNING, &fs_info->flags))
>                 btrfs_scrub_cancel(fs_info);
>
> +       btrfs_uninhibit_all_eb_writeback(trans);
>         kmem_cache_free(btrfs_trans_handle_cachep, trans);
>  }
>
> @@ -2556,6 +2567,14 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>             fs_info->cleaner_kthread)
>                 wake_up_process(fs_info->cleaner_kthread);
>
> +       /*
> +        * Uninhibit writeback on all extent buffers inhibited during this
> +        * transaction before writing them to disk. Inhibiting prevented
> +        * writeback while the transaction was building, but now we need
> +        * them written.
> +        */
> +       btrfs_uninhibit_all_eb_writeback(trans);
> +
>         ret = btrfs_write_and_wait_transaction(trans);
>         if (unlikely(ret)) {
>                 btrfs_err(fs_info, "error while writing out transaction: %d", ret);
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index 18ef069197e5..7d70fe486758 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -12,6 +12,7 @@
>  #include <linux/time64.h>
>  #include <linux/mutex.h>
>  #include <linux/wait.h>
> +#include <linux/xarray.h>
>  #include "btrfs_inode.h"
>  #include "delayed-ref.h"
>
> @@ -162,6 +163,8 @@ struct btrfs_trans_handle {
>         struct btrfs_fs_info *fs_info;
>         struct list_head new_bgs;
>         struct btrfs_block_rsv delayed_rsv;
> +       /* Extent buffers with writeback inhibited by this handle. */
> +       struct xarray writeback_inhibited_ebs;
>  };
>
>  /*
> --
> 2.47.3
>
>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v4 2/3] btrfs: inhibit extent buffer writeback to prevent COW amplification
  2026-02-26  9:51 ` [PATCH v4 2/3] btrfs: inhibit extent buffer writeback to prevent COW amplification Leo Martins
  2026-02-26 15:23   ` Filipe Manana
@ 2026-02-26 15:27   ` Sun YangKai
  2026-03-04  4:22   ` David Sterba
  2 siblings, 0 replies; 10+ messages in thread
From: Sun YangKai @ 2026-02-26 15:27 UTC (permalink / raw)
  To: Leo Martins, linux-btrfs, kernel-team



On 2026/2/26 17:51, Leo Martins wrote:
> Inhibit writeback on COW'd extent buffers for the lifetime of the
> transaction handle, preventing background writeback from setting
> BTRFS_HEADER_FLAG_WRITTEN and causing unnecessary re-COW.
> 
> COW amplification occurs when background writeback flushes an extent
> buffer that a transaction handle is still actively modifying. When
> lock_extent_buffer_for_io() transitions a buffer from dirty to
> writeback, it sets BTRFS_HEADER_FLAG_WRITTEN, marking the block as
> having been persisted to disk at its current bytenr. Once WRITTEN is
> set, should_cow_block() must either COW the block again or overwrite
> it in place, both of which are unnecessary overhead when the buffer
> is still being modified by the same handle that allocated it. By
> inhibiting background writeback on actively-used buffers, WRITTEN is
> never set while a transaction handle holds a reference to the buffer,
> avoiding this overhead entirely.
> 
> Add an atomic_t writeback_inhibitors counter to struct extent_buffer,
> which fits in an existing 6-byte hole without increasing struct size.
> When a buffer is COW'd in btrfs_force_cow_block(), call
> btrfs_inhibit_eb_writeback() to store the eb in the transaction
> handle's writeback_inhibited_ebs xarray (keyed by eb->start), take a
> reference, and increment writeback_inhibitors. The function handles
> dedup (same eb inhibited twice by the same handle) and replacement
> (different eb at the same logical address). Allocation failure is
> graceful: the buffer simply falls back to the pre-existing behavior
> where it may be written back and re-COW'd.
> 
> Also inhibit writeback in should_cow_block() when COW is skipped,
> so that every transaction handle that reuses an already-COW'd buffer
> also inhibits its writeback. Without this, if handle A COWs a block
> and inhibits it, and handle B later reuses the same block without
> inhibiting, handle A's uninhibit on end_transaction leaves the buffer
> unprotected while handle B is still using it. This ensures all handles
> that access a COW'd buffer contribute to the inhibitor count, and the
> buffer remains protected until the last handle releases it.
> 
> In lock_extent_buffer_for_io(), when writeback_inhibitors is non-zero
> and the writeback mode is WB_SYNC_NONE, skip the buffer. WB_SYNC_NONE
> is used by the VM flusher threads for background and periodic
> writeback, which are the only paths that cause COW amplification by
> opportunistically writing out dirty extent buffers mid-transaction.
> Skipping these is safe because the buffers remain dirty in the page
> cache and will be written out at transaction commit time.
> 
> WB_SYNC_ALL must always proceed regardless of writeback_inhibitors.
> This is required for correctness in the fsync path: btrfs_sync_log()
> writes log tree blocks via filemap_fdatawrite_range() (WB_SYNC_ALL)
> while the transaction handle that inhibited those same blocks is still
> active. Without the WB_SYNC_ALL bypass, those inhibited log tree
> blocks would be silently skipped, resulting in an incomplete log on
> disk and corruption on replay. btrfs_write_and_wait_transaction()
> also uses WB_SYNC_ALL via filemap_fdatawrite_range(); for that path,
> inhibitors are already cleared beforehand, but the bypass ensures
> correctness regardless.
> 
> Uninhibit in __btrfs_end_transaction() before atomic_dec(num_writers)
> to prevent a race where the committer proceeds while buffers are still
> inhibited. Also uninhibit in btrfs_commit_transaction() before writing
> and in cleanup_transaction() for the error path.
> 
> Signed-off-by: Leo Martins <loemra.dev@gmail.com>

Looks good.
Reviewed-by: Sun YangKai <sunk67188@gmail.com>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v4 0/3] btrfs: fix COW amplification under memory pressure
  2026-02-26  9:51 [PATCH v4 0/3] btrfs: fix COW amplification under memory pressure Leo Martins
                   ` (2 preceding siblings ...)
  2026-02-26  9:51 ` [PATCH v4 3/3] btrfs: add tracepoint for search slot restart tracking Leo Martins
@ 2026-03-02 23:12 ` Boris Burkov
  3 siblings, 0 replies; 10+ messages in thread
From: Boris Burkov @ 2026-03-02 23:12 UTC (permalink / raw)
  To: Leo Martins; +Cc: linux-btrfs, kernel-team

On Thu, Feb 26, 2026 at 01:51:05AM -0800, Leo Martins wrote:
> I've been investigating a pattern where COW amplification under memory
> pressure exhausts the transaction block reserve, leading to spurious
> ENOSPCs on filesystems with plenty of unallocated space.
> 
> The pattern is:
> 
>  1. btrfs_search_slot() begins tree traversal with cow=1
>  2. Node at level N needs COW (old generation or WRITTEN flag set)
>  3. btrfs_cow_block() allocates new block, copies data, updates
>     parent pointer
>  4. Traversal hits a condition requiring restart (node not cached,
>     lock contention, need higher write_lock_level)
>  5. btrfs_release_path() releases all locks and references
>  6. Memory pressure triggers writeback on the COW'd block
>  7. lock_extent_buffer_for_io() clears EXTENT_BUFFER_DIRTY and sets
>     BTRFS_HEADER_FLAG_WRITTEN
>  8. goto again - traversal restarts from root
>  9. Traversal reaches the same tree node
>  10. should_cow_block() sees WRITTEN flag, returns true
>  11. btrfs_cow_block() allocates yet another new block, copies data,
>      updates parent pointer again, consuming another reservation
>  12. Steps 4-11 repeat under sustained memory pressure
> 
> This series fixes the problem with two complementary approaches:
> 
> Patch 1 implements Filipe's suggestion of overwriting in place. When
> should_cow_block() encounters a WRITTEN buffer whose generation matches
> the current transaction, it re-dirties the buffer and returns false
> instead of requesting a COW. This is crash-safe because the committed
> superblock does not reference buffers allocated in the current
> transaction. Log trees, zoned devices, FORCE_COW, relocation, and
> buffers mid-writeback are excluded. Beyond improving the amplification
> bug, this is a general optimization for the entire transaction lifetime:
> any time writeback runs during a transaction, revisiting the same path
> no longer triggers unnecessary COW, reducing extent allocation overhead,
> memory copies, and space usage per transaction.
> 
> Patch 2 inhibits writeback on COW'd buffers for the lifetime of the
> transaction handle. This prevents WRITTEN from being set while a
> handle holds a reference to the buffer, avoiding unnecessary re-COW
> at its source. Only WB_SYNC_NONE (background/periodic flusher
> writeback) is inhibited. WB_SYNC_ALL (data-integrity writeback from
> btrfs_sync_log() and btrfs_write_and_wait_transaction()) always
> proceeds, which is required for correctness in the fsync path where
> log tree blocks must be written while the inhibiting handle is still
> active.
> 
> Both approaches are independently useful. The overwrite path is a
> general optimization that eliminates unnecessary COW across the entire
> transaction, not just within a single handle. Writeback inhibition
> prevents the problem from occurring in the first place and is the
> only fix for log trees and zoned devices where overwrite does not
> apply. Together they provide defense in depth against COW
> amplification.
> 
> Patch 3 adds a tracepoint for tracking search slot restarts.
> 
> Note: Boris's AS_KERNEL_FILE changes prevent cgroup-scoped reclaim
> from targeting extent buffer pages, making this harder to trigger via
> per-cgroup memory limits. I was able to reproduce the amplification
> using global memory pressure (drop_caches, root cgroup memory.reclaim,
> memory hog) with concurrent filesystem operations.
> 
> Benchmark: concurrent filesystem operations under aggressive global
> memory pressure. COW counts measured via bpftrace.
> 
>   Approach                   Max COW count   Num operations
>   -------                    -------------   --------------
>   Baseline                              35   -
>   Overwrite only                        19   ~same
>   Writeback inhibition only              5   ~2x
>   Combined (this series)                 4   ~2x

I applied Filipe's suggestions for patch 2 (adding the lock assert and
removing a comment) and pushed the series to for-next.

Thanks,
Boris

> 
> Changes since v3:
> 
> Patch 2:
>  - Also inhibit writeback in should_cow_block() when COW is skipped,
>    so that every transaction handle that reuses a COW'd buffer inhibits
>    its writeback, not just the handle that originally COW'd it
> 
> Changes since v2:
> 
> Patch 1:
>  - Add smp_mb__before_atomic() before FORCE_COW check (Filipe, Sun YangKai)
>  - Hoist WRITEBACK, RELOC, FORCE_COW checks before WRITTEN block (Sun YangKai)
>  - Update dirty_pages comment for qgroup_account_snapshot() (Filipe)
>  - Relax btrfs_mark_buffer_dirty() lockdep assertion to accept read locks
>  - Use btrfs_mark_buffer_dirty() instead of set_extent_buffer_dirty()
>  - Remove folio_test_dirty() debug assertion (concurrent reader race)
> 
> Patch 2:
>  - Fix comment style (Filipe)
> 
> Patch 3:
>  - Replace counters with per-restart-site tracepoint (Filipe)
> 
> Changes since v1:
> 
> The v1 patch used a per-btrfs_search_slot() xarray to track COW'd
> buffers. Filipe pointed out this was too complex and too narrow in
> scope, and suggested overwriting in place instead. Qu raised the
> concern that other btrfs_cow_block() callers outside of search_slot
> would not be covered. Boris suggested transaction-handle-level scope
> as an alternative.
> 
> v2 implemented both overwrite-in-place and writeback inhibition at
> the transaction handle level. The per-search-slot xarray was replaced
> by a per-transaction-handle xarray, which covers all
> btrfs_force_cow_block() callers.
> 
> Leo Martins (3):
>   btrfs: skip COW for written extent buffers allocated in current
>     transaction
>   btrfs: inhibit extent buffer writeback to prevent COW amplification
>   btrfs: add tracepoint for search slot restart tracking
> 
>  fs/btrfs/ctree.c             | 106 ++++++++++++++++++++++++++++-------
>  fs/btrfs/disk-io.c           |   2 +-
>  fs/btrfs/extent_io.c         |  67 ++++++++++++++++++++--
>  fs/btrfs/extent_io.h         |   6 ++
>  fs/btrfs/transaction.c       |  19 +++++++
>  fs/btrfs/transaction.h       |   3 +
>  include/trace/events/btrfs.h |  24 ++++++++
>  7 files changed, 200 insertions(+), 27 deletions(-)
> 
> -- 
> 2.47.3
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v4 2/3] btrfs: inhibit extent buffer writeback to prevent COW amplification
  2026-02-26  9:51 ` [PATCH v4 2/3] btrfs: inhibit extent buffer writeback to prevent COW amplification Leo Martins
  2026-02-26 15:23   ` Filipe Manana
  2026-02-26 15:27   ` Sun YangKai
@ 2026-03-04  4:22   ` David Sterba
  2 siblings, 0 replies; 10+ messages in thread
From: David Sterba @ 2026-03-04  4:22 UTC (permalink / raw)
  To: Leo Martins; +Cc: linux-btrfs, kernel-team

On Thu, Feb 26, 2026 at 01:51:07AM -0800, Leo Martins wrote:
> +/*
> + * btrfs_inhibit_eb_writeback - Inhibit writeback on buffer during transaction.
> + * @trans: transaction handle that will own the inhibitor
> + * @eb: extent buffer to inhibit writeback on
> + *
> + * Attempts to track this extent buffer in the transaction's inhibited set.
> + * If memory allocation fails, the buffer is simply not tracked. It may
> + * be written back and need re-COW, which is the original behavior.
> + * This is acceptable since inhibiting writeback is an optimization.
> + */

Minor thing, please check how functions should be commented, with the
parameter block.

https://btrfs.readthedocs.io/en/latest/dev/Development-notes.html#comments

> +/*
> + * btrfs_uninhibit_all_eb_writeback - Uninhibit writeback on all buffers.
> + * @trans: transaction handle to clean up
> + */

Same

> @@ -102,6 +102,8 @@ struct extent_buffer {
>  	/* >= 0 if eb belongs to a log tree, -1 otherwise */
>  	s8 log_index;
>  	u8 folio_shift;
> +	/* Inhibits WB_SYNC_NONE writeback when > 0. */
> +	atomic_t writeback_inhibitors;

The extent buffer is a sensitive data structure, so the layout and size
are closely watched. You increase the size to 248 bytes, which is still
under 256, so it fits fine into the slabs. What is not good is the
placement after the s8/u8 members, as this leaves a 2-byte hole before
it and a 4-byte hole after it:

@@ -5666,15 +5669,19 @@ struct extent_buffer {
 
        /* XXX 2 bytes hole, try to pack */
 
-       struct callback_head       callback_head __attribute__((__aligned__(8))); /*    56    16 */
-       /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
-       struct rw_semaphore        lock __attribute__((__aligned__(8))); /*    72    40 */
-       struct folio *             folios[16];           /*   112   128 */
+       atomic_t                   writeback_inhibitors __attribute__((__aligned__(4))); /*    56     4 */
 
-       /* size: 240, cachelines: 4, members: 14 */
-       /* sum members: 238, holes: 1, sum holes: 2 */
-       /* forced alignments: 4, forced holes: 1, sum forced holes: 2 */
-       /* last cacheline: 48 bytes */
+       /* XXX 4 bytes hole, try to pack */
+
+       /* --- cacheline 1 boundary (64 bytes) --- */
+       struct callback_head       callback_head __attribute__((__aligned__(8))); /*    64    16 */
+       struct rw_semaphore        lock __attribute__((__aligned__(8))); /*    80    40 */
+       struct folio *             folios[16];           /*   120   128 */
+
+       /* size: 248, cachelines: 4, members: 15 */
+       /* sum members: 242, holes: 2, sum holes: 6 */
+       /* forced alignments: 5, forced holes: 2, sum forced holes: 6 */
+       /* last cacheline: 56 bytes */
 } __attribute__((__aligned__(8)));

We can't get rid of the holes, unfortunately, but by placing it after
read_mirror the hole becomes contiguous:

@@ -5664,14 +5664,11 @@ struct extent_buffer {
        spinlock_t                 refs_lock __attribute__((__aligned__(4))); /*    40     4 */
        refcount_t                 refs __attribute__((__aligned__(4))); /*    44     4 */
        int                        read_mirror;          /*    48     4 */
-       s8                         log_index;            /*    52     1 */
-       u8                         folio_shift;          /*    53     1 */
+       atomic_t                   writeback_inhibitors __attribute__((__aligned__(4))); /*    52     4 */
+       s8                         log_index;            /*    56     1 */
+       u8                         folio_shift;          /*    57     1 */
 
-       /* XXX 2 bytes hole, try to pack */
-
-       atomic_t                   writeback_inhibitors __attribute__((__aligned__(4))); /*    56     4 */
-
-       /* XXX 4 bytes hole, try to pack */
+       /* XXX 6 bytes hole, try to pack */
 
        /* --- cacheline 1 boundary (64 bytes) --- */
        struct callback_head       callback_head __attribute__((__aligned__(8))); /*    64    16 */
@@ -5679,8 +5676,8 @@ struct extent_buffer {
        struct folio *             folios[16];           /*   120   128 */
 
        /* size: 248, cachelines: 4, members: 15 */
-       /* sum members: 242, holes: 2, sum holes: 6 */
-       /* forced alignments: 5, forced holes: 2, sum forced holes: 6 */
+       /* sum members: 242, holes: 1, sum holes: 6 */
+       /* forced alignments: 5, forced holes: 1, sum forced holes: 6 */
        /* last cacheline: 56 bytes */
 } __attribute__((__aligned__(8)));

Placement in the first cacheline cannot be avoided, but the access pattern
seems consistent with that of the other members like refs or lock, so this
should be OK.


* Re: [PATCH v4 1/3] btrfs: skip COW for written extent buffers allocated in current transaction
  2026-02-26  9:51 ` [PATCH v4 1/3] btrfs: skip COW for written extent buffers allocated in current transaction Leo Martins
@ 2026-03-04  6:31   ` Qu Wenruo
  2026-03-05 20:11     ` Boris Burkov
  0 siblings, 1 reply; 10+ messages in thread
From: Qu Wenruo @ 2026-03-04  6:31 UTC (permalink / raw)
  To: Leo Martins, linux-btrfs, kernel-team; +Cc: Filipe Manana, Sun YangKai



On 2026/2/26 20:21, Leo Martins wrote:
> When memory pressure causes writeback of a recently COW'd buffer,
> btrfs sets BTRFS_HEADER_FLAG_WRITTEN on it. Subsequent
> btrfs_search_slot() restarts then see the WRITTEN flag and re-COW
> the buffer unnecessarily, causing COW amplification that can exhaust
> block reservations and degrade throughput.
> 
> Overwriting in place is crash-safe because the committed superblock
> does not reference buffers allocated in the current (uncommitted)
> transaction, so no on-disk tree points to this block yet.
> 
> When should_cow_block() encounters a WRITTEN buffer whose generation
> matches the current transaction, instead of requesting a COW, re-dirty
> the buffer and re-register its range in the transaction's dirty_pages.
> 
> Both are necessary because btrfs tracks dirty metadata through two
> independent mechanisms. set_extent_buffer_dirty() sets the
> EXTENT_BUFFER_DIRTY flag and the buffer_tree xarray PAGECACHE_TAG_DIRTY
> mark, which is what background writeback (btree_write_cache_pages) uses
> to find and write dirty buffers. The transaction's dirty_pages io tree
> is a separate structure used by btrfs_write_and_wait_transaction() at
> commit time to ensure all buffers allocated during the transaction are
> persisted. The dirty_pages range was originally registered in
> btrfs_init_new_buffer() when the block was first allocated. Normally
> dirty_pages is only cleared at commit time by
> btrfs_write_and_wait_transaction(), but if qgroups are enabled and
> snapshots are being created, qgroup_account_snapshot() may have already
> called btrfs_write_and_wait_transaction() and released the range before
> the final commit-time call.
> 
> Keep BTRFS_HEADER_FLAG_WRITTEN set so that btrfs_free_tree_block()
> correctly pins the block if it is freed later.
> 
> Relax the lockdep assertion in btrfs_mark_buffer_dirty() from
> btrfs_assert_tree_write_locked() to lockdep_assert_held() so that it
> accepts either a read or write lock. should_cow_block() may be called
> from btrfs_search_slot() when only a read lock is held (nodes above
> write_lock_level are read-locked). The write lock assertion previously
> documented the caller convention that buffer content was being modified
> under exclusive access, but btrfs_mark_buffer_dirty() and
> set_extent_buffer_dirty() themselves only perform independently
> synchronized operations: atomic bit ops on bflags, folio_mark_dirty()
> (kernel-internal folio locking), xarray mark updates (xarray spinlock),
> and percpu counter updates. The read lock is sufficient because it
> prevents lock_extent_buffer_for_io() from acquiring the write lock and
> racing on the dirty state. Since rw_semaphore permits concurrent
> readers, multiple threads can enter btrfs_mark_buffer_dirty()
> simultaneously for the same buffer; this is safe because
> test_and_set_bit(EXTENT_BUFFER_DIRTY) ensures only one thread performs
> the full dirty state transition.
> 
> Remove the CONFIG_BTRFS_DEBUG assertion in set_extent_buffer_dirty()
> that checked folio_test_dirty() after marking the buffer dirty. This
> assertion assumed exclusive access (only one thread in
> set_extent_buffer_dirty() at a time), which held when the only caller
> was btrfs_mark_buffer_dirty() under write lock. With concurrent readers
> calling through should_cow_block(), a thread that loses the
> test_and_set_bit race sees was_dirty=true and skips the folio dirty
> marking, but the winning thread may not have called
> btrfs_meta_folio_set_dirty() yet, causing the assertion to fire. This
> is a benign race: the winning thread will complete the folio dirty
> marking, and no writeback can clear it while readers hold their locks.
> 
> Hoist the EXTENT_BUFFER_WRITEBACK, BTRFS_HEADER_FLAG_RELOC, and
> BTRFS_ROOT_FORCE_COW checks before the WRITTEN block since they apply
> regardless of whether the buffer has been written back. This
> consolidates the exclusion logic and simplifies the WRITTEN path to
> only handle log trees and zoned devices. Moving the RELOC checks
> before the smp_mb__before_atomic() barrier is safe because both
> btrfs_root_id() (immutable) and BTRFS_HEADER_FLAG_RELOC (set at COW
> time under tree lock) are stable values not subject to concurrent
> modification; the barrier is only needed for BTRFS_ROOT_FORCE_COW
> which is set concurrently by create_pending_snapshot().
> 
> Exclude cases where in-place overwrite is not safe:
>   - EXTENT_BUFFER_WRITEBACK: buffer is mid-I/O
>   - Zoned devices: require sequential writes
>   - Log trees: log blocks are immediately referenced by a committed
>     superblock via btrfs_sync_log(), so overwriting could corrupt the
>     committed log
>   - BTRFS_ROOT_FORCE_COW: snapshot in progress
>   - BTRFS_HEADER_FLAG_RELOC: block being relocated
> 
> Signed-off-by: Leo Martins <loemra.dev@gmail.com>
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
> Reviewed-by: Sun YangKai <sunk67188@gmail.com>

Unfortunately this patch is making btrfs/232 fail.
Bisection led to this one.

I have hit the following errors during a btrfs/232 run with this patch,
but not with the commit before it:

- Write-time tree-checker errors
   Ranging from first-key mismatches to bad key order.

- Fsck errors
   Ranging from missing inode items to other problems.
   AKA on-disk corruption, which is never a good sign.

One thing to note: the test case itself can lead to false alerts from
DEBUG_WARN() inside btrfs_remove_qgroup(), so you may need to manually
remove that DEBUG_WARN() or check the failure dmesg to be extra sure.

Thanks,
Qu

> ---
>   fs/btrfs/ctree.c     | 87 ++++++++++++++++++++++++++++++++++----------
>   fs/btrfs/disk-io.c   |  2 +-
>   fs/btrfs/extent_io.c |  4 --
>   3 files changed, 69 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 7267b2502665..ea7cfc3a9e89 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -599,29 +599,40 @@ int btrfs_force_cow_block(struct btrfs_trans_handle *trans,
>   	return ret;
>   }
>   
> -static inline bool should_cow_block(const struct btrfs_trans_handle *trans,
> +/*
> + * Check if @buf needs to be COW'd.
> + *
> + * Returns true if COW is required, false if the block can be reused
> + * in place.
> + *
> + * We do not need to COW a block if:
> + * 1) the block was created or changed in this transaction;
> + * 2) the block does not belong to TREE_RELOC tree;
> + * 3) the root is not forced COW.
> + *
> + * Forced COW happens when we create a snapshot during transaction commit:
> + * after copying the src root, we must COW the shared block to ensure
> + * metadata consistency.
> + *
> + * When returning false for a WRITTEN buffer allocated in the current
> + * transaction, re-dirties the buffer for in-place overwrite instead
> + * of requesting a new COW.
> + */
> +static inline bool should_cow_block(struct btrfs_trans_handle *trans,
>   				    const struct btrfs_root *root,
> -				    const struct extent_buffer *buf)
> +				    struct extent_buffer *buf)
>   {
>   	if (btrfs_is_testing(root->fs_info))
>   		return false;
>   
> -	/*
> -	 * We do not need to cow a block if
> -	 * 1) this block is not created or changed in this transaction;
> -	 * 2) this block does not belong to TREE_RELOC tree;
> -	 * 3) the root is not forced COW.
> -	 *
> -	 * What is forced COW:
> -	 *    when we create snapshot during committing the transaction,
> -	 *    after we've finished copying src root, we must COW the shared
> -	 *    block to ensure the metadata consistency.
> -	 */
> -
>   	if (btrfs_header_generation(buf) != trans->transid)
>   		return true;
>   
> -	if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN))
> +	if (test_bit(EXTENT_BUFFER_WRITEBACK, &buf->bflags))
> +		return true;
> +
> +	if (btrfs_root_id(root) != BTRFS_TREE_RELOC_OBJECTID &&
> +	    btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC))
>   		return true;
>   
>   	/* Ensure we can see the FORCE_COW bit. */
> @@ -629,11 +640,49 @@ static inline bool should_cow_block(const struct btrfs_trans_handle *trans,
>   	if (test_bit(BTRFS_ROOT_FORCE_COW, &root->state))
>   		return true;
>   
> -	if (btrfs_root_id(root) == BTRFS_TREE_RELOC_OBJECTID)
> -		return false;
> +	if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
> +		/*
> +		 * The buffer was allocated in this transaction and has been
> +		 * written back to disk (WRITTEN is set). Normally we'd COW
> +		 * it again, but since the committed superblock doesn't
> +		 * reference this buffer (it was allocated in this transaction),
> +		 * we can safely overwrite it in place.
> +		 *
> +		 * We keep BTRFS_HEADER_FLAG_WRITTEN set. The block has been
> +		 * persisted at this bytenr and will be again after the
> +		 * in-place update. This is important so that
> +		 * btrfs_free_tree_block() correctly pins the block if it is
> +		 * freed later (e.g., during tree rebalancing or FORCE_COW).
> +		 *
> +		 * Log trees and zoned devices cannot use this optimization:
> +		 * - Log trees: log blocks are written and immediately
> +		 *   referenced by a committed superblock via
> +		 *   btrfs_sync_log(), bypassing the normal transaction
> +		 *   commit. Overwriting in place could corrupt the
> +		 *   committed log.
> +		 * - Zoned devices: require sequential writes.
> +		 */
> +		if (btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID ||
> +		    btrfs_is_zoned(root->fs_info))
> +			return true;
>   
> -	if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC))
> -		return true;
> +		/*
> +		 * Re-register this block's range in the current transaction's
> +		 * dirty_pages so that btrfs_write_and_wait_transaction()
> +		 * writes it. The range was originally registered when the
> +		 * block was allocated. Normally dirty_pages is only cleared
> +		 * at commit time by btrfs_write_and_wait_transaction(), but
> +		 * if qgroups are enabled and snapshots are being created,
> +		 * qgroup_account_snapshot() may have already called
> +		 * btrfs_write_and_wait_transaction() and released the range
> +		 * before the final commit-time call.
> +		 */
> +		btrfs_set_extent_bit(&trans->transaction->dirty_pages,
> +				     buf->start,
> +				     buf->start + buf->len - 1,
> +				     EXTENT_DIRTY, NULL);
> +		btrfs_mark_buffer_dirty(trans, buf);
> +	}
>   
>   	return false;
>   }
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 32fffb0557e5..bee8f76fbfea 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -4491,7 +4491,7 @@ void btrfs_mark_buffer_dirty(struct btrfs_trans_handle *trans,
>   #endif
>   	/* This is an active transaction (its state < TRANS_STATE_UNBLOCKED). */
>   	ASSERT(trans->transid == fs_info->generation);
> -	btrfs_assert_tree_write_locked(buf);
> +	lockdep_assert_held(&buf->lock);
>   	if (unlikely(transid != fs_info->generation)) {
>   		btrfs_abort_transaction(trans, -EUCLEAN);
>   		btrfs_crit(fs_info,
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index dfc17c292217..ff1fc699a6ca 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3791,10 +3791,6 @@ void set_extent_buffer_dirty(struct extent_buffer *eb)
>   					 eb->len,
>   					 eb->fs_info->dirty_metadata_batch);
>   	}
> -#ifdef CONFIG_BTRFS_DEBUG
> -	for (int i = 0; i < num_extent_folios(eb); i++)
> -		ASSERT(folio_test_dirty(eb->folios[i]));
> -#endif
>   }
>   
>   void clear_extent_buffer_uptodate(struct extent_buffer *eb)



* Re: [PATCH v4 1/3] btrfs: skip COW for written extent buffers allocated in current transaction
  2026-03-04  6:31   ` Qu Wenruo
@ 2026-03-05 20:11     ` Boris Burkov
  0 siblings, 0 replies; 10+ messages in thread
From: Boris Burkov @ 2026-03-05 20:11 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Leo Martins, linux-btrfs, kernel-team, Filipe Manana, Sun YangKai

On Wed, Mar 04, 2026 at 05:01:19PM +1030, Qu Wenruo wrote:
> 
> 
> On 2026/2/26 20:21, Leo Martins wrote:
> > [...]
> 
> Unfortunately this patch is making btrfs/232 fail.
> Bisection led to this one.
> 
> I have hit the following errors during a btrfs/232 run with this patch,
> but not with the commit before it:
> 
> - Write-time tree-checker errors
>   Ranging from first-key mismatches to bad key order.
> 
> - Fsck errors
>   Ranging from missing inode items to other problems.
>   AKA on-disk corruption, which is never a good sign.
> 
> One thing to note: the test case itself can lead to false alerts from
> DEBUG_WARN() inside btrfs_remove_qgroup(), so you may need to manually
> remove that DEBUG_WARN() or check the failure dmesg to be extra sure.
> 
> Thanks,
> Qu
> 

I confirm that btrfs/232 looks broken with this patch and good without it.

I reverted this patch (but left the independently useful inhibition patch)
while Leo figures out the tree-checker issues, which were reasonable to
expect from a change like this.

Thanks,
Boris



end of thread, other threads:[~2026-03-05 20:10 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-26  9:51 [PATCH v4 0/3] btrfs: fix COW amplification under memory pressure Leo Martins
2026-02-26  9:51 ` [PATCH v4 1/3] btrfs: skip COW for written extent buffers allocated in current transaction Leo Martins
2026-03-04  6:31   ` Qu Wenruo
2026-03-05 20:11     ` Boris Burkov
2026-02-26  9:51 ` [PATCH v4 2/3] btrfs: inhibit extent buffer writeback to prevent COW amplification Leo Martins
2026-02-26 15:23   ` Filipe Manana
2026-02-26 15:27   ` Sun YangKai
2026-03-04  4:22   ` David Sterba
2026-02-26  9:51 ` [PATCH v4 3/3] btrfs: add tracepoint for search slot restart tracking Leo Martins
2026-03-02 23:12 ` [PATCH v4 0/3] btrfs: fix COW amplification under memory pressure Boris Burkov
