public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Filipe Manana <fdmanana@kernel.org>
To: Leo Martins <loemra.dev@gmail.com>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH v3 0/3] btrfs: fix COW amplification under memory pressure
Date: Wed, 25 Feb 2026 17:38:52 +0000	[thread overview]
Message-ID: <CAL3q7H5RAa72uLEktuTiQ-gWHQuZ8_q_zY7cKT5xBjVkyruELA@mail.gmail.com> (raw)
In-Reply-To: <cover.1771884128.git.loemra.dev@gmail.com>

On Tue, Feb 24, 2026 at 7:23 PM Leo Martins <loemra.dev@gmail.com> wrote:
>
> I've been investigating a pattern where COW amplification under memory
> pressure exhausts the transaction block reserve, leading to spurious
> ENOSPCs on filesystems with plenty of unallocated space.
>
> The pattern is:
>
>  1. btrfs_search_slot() begins tree traversal with cow=1
>  2. Node at level N needs COW (old generation or WRITTEN flag set)
>  3. btrfs_cow_block() allocates new block, copies data, updates
>     parent pointer
>  4. Traversal hits a condition requiring restart (node not cached,
>     lock contention, need higher write_lock_level)
>  5. btrfs_release_path() releases all locks and references
>  6. Memory pressure triggers writeback on the COW'd block
>  7. lock_extent_buffer_for_io() clears EXTENT_BUFFER_DIRTY and sets
>     BTRFS_HEADER_FLAG_WRITTEN
>  8. goto again - traversal restarts from root
>  9. Traversal reaches the same tree node
>  10. should_cow_block() sees WRITTEN flag, returns true
>  11. btrfs_cow_block() allocates yet another new block, copies data,
>      updates parent pointer again, consuming another reservation
>  12. Steps 4-11 repeat under sustained memory pressure
>
> This series fixes the problem with two complementary approaches:
>
> Patch 1 implements Filipe's suggestion of overwriting in place. When
> should_cow_block() encounters a WRITTEN buffer whose generation matches
> the current transaction, it re-dirties the buffer and returns false
> instead of requesting a COW. This is crash-safe because the committed
> superblock does not reference buffers allocated in the current
> transaction. Log trees, zoned devices, FORCE_COW, relocation, and
> buffers mid-writeback are excluded. Beyond improving the amplification bug,
> this is a general optimization for the entire transaction lifetime: any
> time writeback runs during a transaction, revisiting the same path no
> longer triggers unnecessary COW, reducing extent allocation overhead,
> memory copies, and space usage per transaction.
>
> Patch 2 inhibits writeback on COW'd buffers for the lifetime of the
> transaction handle. This prevents WRITTEN from being set while a
> handle holds a reference to the buffer, avoiding unnecessary re-COW
> at its source. Only WB_SYNC_NONE (background/periodic flusher
> writeback) is inhibited. WB_SYNC_ALL (data-integrity writeback from
> btrfs_sync_log() and btrfs_write_and_wait_transaction()) always
> proceeds, which is required for correctness in the fsync path where
> log tree blocks must be written while the inhibiting handle is still
> active.
>
> Both approaches are independently useful. The overwrite path is a
> general optimization that eliminates unnecessary COW across the entire
> transaction, not just within a single handle. Writeback inhibition
> prevents the problem from occurring in the first place and is the
> only fix for log trees and zoned devices where overwrite does not
> apply. Together they provide defense in depth against COW
> amplification.
>
> Patch 3 adds a tracepoint for tracking search slot restarts.
>
> Note: Boris's AS_KERNEL_FILE changes prevent cgroup-scoped reclaim
> from targeting extent buffer pages, making this harder to trigger via
> per-cgroup memory limits. I was able to reproduce the amplification
> using global memory pressure (drop_caches, root cgroup memory.reclaim,
> memory hog) with concurrent filesystem operations.
>
> Benchmark: concurrent filesystem operations under aggressive global
> memory pressure. COW counts measured via bpftrace.
>
>   Approach                   Max COW count   Num operations
>   -------                    -------------   --------------
>   Baseline                              35   -
>   Overwrite only                        19   ~same
>   Writeback inhibition only              5   ~2x
>   Combined (this series)                 4   ~2x
>
> Changes since v2:
>
> Patch 1:
>  - Add smp_mb__before_atomic() before FORCE_COW check (Filipe, Sun YangKai)
>  - Hoist WRITEBACK, RELOC, FORCE_COW checks before WRITTEN block (Sun YangKai)
>  - Update dirty_pages comment for qgroup_account_snapshot() (Filipe)
>  - Relax btrfs_mark_buffer_dirty() lockdep assertion to accept read locks
>  - Use btrfs_mark_buffer_dirty() instead of set_extent_buffer_dirty()
>  - Remove folio_test_dirty() debug assertion (concurrent reader race)
>
> Patch 2:
>  - Fix comment style (Filipe)
>
> Patch 3:
>  - Replace counters with per-restart-site tracepoint (Filipe)
>
> Changes since v1:
>
> The v1 patch used a per-btrfs_search_slot() xarray to track COW'd
> buffers. Filipe pointed out this was too complex and too narrow in
> scope, and suggested overwriting in place instead. Qu raised the
> concern that other btrfs_cow_block() callers outside of search_slot
> would not be covered. Boris suggested transaction-handle-level scope
> as an alternative.
>
> v2 implemented both overwrite-in-place and writeback inhibition at
> the transaction handle level. The per-search-slot xarray was replaced
> by a per-transaction-handle xarray, which covers all
> btrfs_force_cow_block() callers.
>
> Leo Martins (3):
>   btrfs: skip COW for written extent buffers allocated in current
>     transaction
>   btrfs: inhibit extent buffer writeback to prevent COW amplification
>   btrfs: add tracepoint for search slot restart tracking

You can add to all the patches:

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks.

>
>  fs/btrfs/ctree.c             | 70 +++++++++++++++++++++++++++++++-----
>  fs/btrfs/disk-io.c           |  2 +-
>  fs/btrfs/extent_io.c         | 67 +++++++++++++++++++++++++++++++---
>  fs/btrfs/extent_io.h         |  6 ++++
>  fs/btrfs/transaction.c       | 19 ++++++++++
>  fs/btrfs/transaction.h       |  3 ++
>  include/trace/events/btrfs.h | 24 +++++++++++++
>  7 files changed, 176 insertions(+), 15 deletions(-)
>
> --
> 2.47.3
>
>

      parent reply	other threads:[~2026-02-25 17:39 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-24 19:22 [PATCH v3 0/3] btrfs: fix COW amplification under memory pressure Leo Martins
2026-02-24 19:22 ` [PATCH v3 1/3] btrfs: skip COW for written extent buffers allocated in current transaction Leo Martins
2026-02-25  6:06   ` Sun YangKai
2026-02-24 19:22 ` [PATCH v3 2/3] btrfs: inhibit extent buffer writeback to prevent COW amplification Leo Martins
2026-02-25  6:17   ` Sun YangKai
2026-02-26  2:30     ` Leo Martins
2026-02-26  5:03       ` Sun YangKai
2026-02-24 19:22 ` [PATCH v3 3/3] btrfs: add tracepoint for search slot restart tracking Leo Martins
2026-02-25 17:38 ` Filipe Manana [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAL3q7H5RAa72uLEktuTiQ-gWHQuZ8_q_zY7cKT5xBjVkyruELA@mail.gmail.com \
    --to=fdmanana@kernel.org \
    --cc=kernel-team@fb.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=loemra.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox