From: Leo Martins <loemra.dev@gmail.com>
To: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH v3 0/3] btrfs: fix COW amplification under memory pressure
Date: Tue, 24 Feb 2026 11:22:31 -0800 [thread overview]
Message-ID: <cover.1771884128.git.loemra.dev@gmail.com> (raw)
I've been investigating a pattern where COW amplification under memory
pressure exhausts the transaction block reserve, leading to spurious
ENOSPCs on filesystems with plenty of unallocated space.
The pattern is:
1. btrfs_search_slot() begins tree traversal with cow=1
2. Node at level N needs COW (old generation or WRITTEN flag set)
3. btrfs_cow_block() allocates new block, copies data, updates
parent pointer
4. Traversal hits a condition requiring restart (node not cached,
lock contention, need higher write_lock_level)
5. btrfs_release_path() releases all locks and references
6. Memory pressure triggers writeback on the COW'd block
7. lock_extent_buffer_for_io() clears EXTENT_BUFFER_DIRTY and sets
BTRFS_HEADER_FLAG_WRITTEN
8. goto again - traversal restarts from root
9. Traversal reaches the same tree node
10. should_cow_block() sees WRITTEN flag, returns true
11. btrfs_cow_block() allocates yet another new block, copies data,
updates parent pointer again, consuming another reservation
12. Steps 4-11 repeat under sustained memory pressure
This series fixes the problem with two complementary approaches:
Patch 1 implements Filipe's suggestion of overwriting in place. When
should_cow_block() encounters a WRITTEN buffer whose generation matches
the current transaction, it re-dirties the buffer and returns false
instead of requesting a COW. This is crash-safe because the committed
superblock does not reference buffers allocated in the current
transaction. Log trees, zoned devices, FORCE_COW, relocation, and
buffers mid-writeback are excluded. Beyond improving the amplification bug,
this is a general optimization for the entire transaction lifetime: any
time writeback runs during a transaction, revisiting the same path no
longer triggers unnecessary COW, reducing extent allocation overhead,
memory copies, and space usage per transaction.
Patch 2 inhibits writeback on COW'd buffers for the lifetime of the
transaction handle. This prevents WRITTEN from being set while a
handle holds a reference to the buffer, avoiding unnecessary re-COW
at its source. Only WB_SYNC_NONE (background/periodic flusher
writeback) is inhibited. WB_SYNC_ALL (data-integrity writeback from
btrfs_sync_log() and btrfs_write_and_wait_transaction()) always
proceeds, which is required for correctness in the fsync path where
log tree blocks must be written while the inhibiting handle is still
active.
Both approaches are independently useful. The overwrite path is a
general optimization that eliminates unnecessary COW across the entire
transaction, not just within a single handle. Writeback inhibition
prevents the problem from occurring in the first place and is the
only fix for log trees and zoned devices where overwrite does not
apply. Together they provide defense in depth against COW
amplification.
Patch 3 adds a tracepoint for tracking search slot restarts.
Note: Boris's AS_KERNEL_FILE changes prevent cgroup-scoped reclaim
from targeting extent buffer pages, making this harder to trigger via
per-cgroup memory limits. I was able to reproduce the amplification
using global memory pressure (drop_caches, root cgroup memory.reclaim,
memory hog) with concurrent filesystem operations.
Benchmark: concurrent filesystem operations under aggressive global
memory pressure. COW counts measured via bpftrace.
Approach Max COW count Num operations
------- ------------- --------------
Baseline 35 -
Overwrite only 19 ~same
Writeback inhibition only 5 ~2x
Combined (this series) 4 ~2x
Changes since v2:
Patch 1:
- Add smp_mb__before_atomic() before FORCE_COW check (Filipe, Sun YangKai)
- Hoist WRITEBACK, RELOC, FORCE_COW checks before WRITTEN block (Sun YangKai)
- Update dirty_pages comment for qgroup_account_snapshot() (Filipe)
- Relax btrfs_mark_buffer_dirty() lockdep assertion to accept read locks
- Use btrfs_mark_buffer_dirty() instead of set_extent_buffer_dirty()
- Remove folio_test_dirty() debug assertion (concurrent reader race)
Patch 2:
- Fix comment style (Filipe)
Patch 3:
- Replace counters with per-restart-site tracepoint (Filipe)
Changes since v1:
The v1 patch used a per-btrfs_search_slot() xarray to track COW'd
buffers. Filipe pointed out this was too complex and too narrow in
scope, and suggested overwriting in place instead. Qu raised the
concern that other btrfs_cow_block() callers outside of search_slot
would not be covered. Boris suggested transaction-handle-level scope
as an alternative.
v2 implemented both overwrite-in-place and writeback inhibition at
the transaction handle level. The per-search-slot xarray was replaced
by a per-transaction-handle xarray, which covers all
btrfs_force_cow_block() callers.
Leo Martins (3):
btrfs: skip COW for written extent buffers allocated in current
transaction
btrfs: inhibit extent buffer writeback to prevent COW amplification
btrfs: add tracepoint for search slot restart tracking
fs/btrfs/ctree.c | 70 +++++++++++++++++++++++++++++++-----
fs/btrfs/disk-io.c | 2 +-
fs/btrfs/extent_io.c | 67 +++++++++++++++++++++++++++++++---
fs/btrfs/extent_io.h | 6 ++++
fs/btrfs/transaction.c | 19 ++++++++++
fs/btrfs/transaction.h | 3 ++
include/trace/events/btrfs.h | 24 +++++++++++++
7 files changed, 176 insertions(+), 15 deletions(-)
--
2.47.3
next reply other threads:[~2026-02-24 19:22 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-02-24 19:22 Leo Martins [this message]
2026-02-24 19:22 ` [PATCH v3 1/3] btrfs: skip COW for written extent buffers allocated in current transaction Leo Martins
2026-02-25 6:06 ` Sun YangKai
2026-02-24 19:22 ` [PATCH v3 2/3] btrfs: inhibit extent buffer writeback to prevent COW amplification Leo Martins
2026-02-25 6:17 ` Sun YangKai
2026-02-26 2:30 ` Leo Martins
2026-02-26 5:03 ` Sun YangKai
2026-02-24 19:22 ` [PATCH v3 3/3] btrfs: add tracepoint for search slot restart tracking Leo Martins
2026-02-25 17:38 ` [PATCH v3 0/3] btrfs: fix COW amplification under memory pressure Filipe Manana
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=cover.1771884128.git.loemra.dev@gmail.com \
--to=loemra.dev@gmail.com \
--cc=kernel-team@fb.com \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox