public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Boris Burkov <boris@bur.io>
To: Leo Martins <loemra.dev@gmail.com>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH v4 0/3] btrfs: fix COW amplification under memory pressure
Date: Mon, 2 Mar 2026 15:12:16 -0800	[thread overview]
Message-ID: <20260302231216.GA78195@zen.localdomain> (raw)
In-Reply-To: <cover.1772097864.git.loemra.dev@gmail.com>

On Thu, Feb 26, 2026 at 01:51:05AM -0800, Leo Martins wrote:
> I've been investigating a pattern where COW amplification under memory
> pressure exhausts the transaction block reserve, leading to spurious
> ENOSPCs on filesystems with plenty of unallocated space.
> 
> The pattern is:
> 
>  1. btrfs_search_slot() begins tree traversal with cow=1
>  2. Node at level N needs COW (old generation or WRITTEN flag set)
>  3. btrfs_cow_block() allocates new block, copies data, updates
>     parent pointer
>  4. Traversal hits a condition requiring restart (node not cached,
>     lock contention, need higher write_lock_level)
>  5. btrfs_release_path() releases all locks and references
>  6. Memory pressure triggers writeback on the COW'd block
>  7. lock_extent_buffer_for_io() clears EXTENT_BUFFER_DIRTY and sets
>     BTRFS_HEADER_FLAG_WRITTEN
>  8. goto again - traversal restarts from root
>  9. Traversal reaches the same tree node
>  10. should_cow_block() sees WRITTEN flag, returns true
>  11. btrfs_cow_block() allocates yet another new block, copies data,
>      updates parent pointer again, consuming another reservation
>  12. Steps 4-11 repeat under sustained memory pressure
> 
> This series fixes the problem with two complementary approaches:
> 
> Patch 1 implements Filipe's suggestion of overwriting in place. When
> should_cow_block() encounters a WRITTEN buffer whose generation matches
> the current transaction, it re-dirties the buffer and returns false
> instead of requesting a COW. This is crash-safe because the committed
> superblock does not reference buffers allocated in the current
> transaction. Log trees, zoned devices, FORCE_COW, relocation, and
> buffers mid-writeback are excluded. Beyond improving the amplification
> bug, this is a general optimization for the entire transaction lifetime:
> any time writeback runs during a transaction, revisiting the same path
> no longer triggers unnecessary COW, reducing extent allocation overhead,
> memory copies, and space usage per transaction.
> 
> Patch 2 inhibits writeback on COW'd buffers for the lifetime of the
> transaction handle. This prevents WRITTEN from being set while a
> handle holds a reference to the buffer, avoiding unnecessary re-COW
> at its source. Only WB_SYNC_NONE (background/periodic flusher
> writeback) is inhibited. WB_SYNC_ALL (data-integrity writeback from
> btrfs_sync_log() and btrfs_write_and_wait_transaction()) always
> proceeds, which is required for correctness in the fsync path where
> log tree blocks must be written while the inhibiting handle is still
> active.
> 
> Both approaches are independently useful. The overwrite path is a
> general optimization that eliminates unnecessary COW across the entire
> transaction, not just within a single handle. Writeback inhibition
> prevents the problem from occurring in the first place and is the
> only fix for log trees and zoned devices where overwrite does not
> apply. Together they provide defense in depth against COW
> amplification.
> 
> Patch 3 adds a tracepoint for tracking search slot restarts.
> 
> Note: Boris's AS_KERNEL_FILE changes prevent cgroup-scoped reclaim
> from targeting extent buffer pages, making this harder to trigger via
> per-cgroup memory limits. I was able to reproduce the amplification
> using global memory pressure (drop_caches, root cgroup memory.reclaim,
> memory hog) with concurrent filesystem operations.
> 
> Benchmark: concurrent filesystem operations under aggressive global
> memory pressure. COW counts measured via bpftrace.
> 
>   Approach                   Max COW count   Num operations
>   -------                    -------------   --------------
>   Baseline                              35   -
>   Overwrite only                        19   ~same
>   Writeback inhibition only              5   ~2x
>   Combined (this series)                 4   ~2x

I applied Filipe's suggestion of the lock assert and removing a comment
in patch 2 and pushed the series to for-next.

Thanks,
Boris

> 
> Changes since v3:
> 
> Patch 2:
>  - Also inhibit writeback in should_cow_block() when COW is skipped,
>    so that every transaction handle that reuses a COW'd buffer inhibits
>    its writeback, not just the handle that originally COW'd it
> 
> Changes since v2:
> 
> Patch 1:
>  - Add smp_mb__before_atomic() before FORCE_COW check (Filipe, Sun YangKai)
>  - Hoist WRITEBACK, RELOC, FORCE_COW checks before WRITTEN block (Sun YangKai)
>  - Update dirty_pages comment for qgroup_account_snapshot() (Filipe)
>  - Relax btrfs_mark_buffer_dirty() lockdep assertion to accept read locks
>  - Use btrfs_mark_buffer_dirty() instead of set_extent_buffer_dirty()
>  - Remove folio_test_dirty() debug assertion (concurrent reader race)
> 
> Patch 2:
>  - Fix comment style (Filipe)
> 
> Patch 3:
>  - Replace counters with per-restart-site tracepoint (Filipe)
> 
> Changes since v1:
> 
> The v1 patch used a per-btrfs_search_slot() xarray to track COW'd
> buffers. Filipe pointed out this was too complex and too narrow in
> scope, and suggested overwriting in place instead. Qu raised the
> concern that other btrfs_cow_block() callers outside of search_slot
> would not be covered. Boris suggested transaction-handle-level scope
> as an alternative.
> 
> v2 implemented both overwrite-in-place and writeback inhibition at
> the transaction handle level. The per-search-slot xarray was replaced
> by a per-transaction-handle xarray, which covers all
> btrfs_force_cow_block() callers.
> 
> Leo Martins (3):
>   btrfs: skip COW for written extent buffers allocated in current
>     transaction
>   btrfs: inhibit extent buffer writeback to prevent COW amplification
>   btrfs: add tracepoint for search slot restart tracking
> 
>  fs/btrfs/ctree.c             | 106 ++++++++++++++++++++++++++++-------
>  fs/btrfs/disk-io.c           |   2 +-
>  fs/btrfs/extent_io.c         |  67 ++++++++++++++++++++--
>  fs/btrfs/extent_io.h         |   6 ++
>  fs/btrfs/transaction.c       |  19 +++++++
>  fs/btrfs/transaction.h       |   3 +
>  include/trace/events/btrfs.h |  24 ++++++++
>  7 files changed, 200 insertions(+), 27 deletions(-)
> 
> -- 
> 2.47.3
> 

      parent reply	other threads:[~2026-03-02 23:11 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-26  9:51 [PATCH v4 0/3] btrfs: fix COW amplification under memory pressure Leo Martins
2026-02-26  9:51 ` [PATCH v4 1/3] btrfs: skip COW for written extent buffers allocated in current transaction Leo Martins
2026-03-04  6:31   ` Qu Wenruo
2026-03-05 20:11     ` Boris Burkov
2026-02-26  9:51 ` [PATCH v4 2/3] btrfs: inhibit extent buffer writeback to prevent COW amplification Leo Martins
2026-02-26 15:23   ` Filipe Manana
2026-02-26 15:27   ` Sun YangKai
2026-03-04  4:22   ` David Sterba
2026-02-26  9:51 ` [PATCH v4 3/3] btrfs: add tracepoint for search slot restart tracking Leo Martins
2026-03-02 23:12 ` Boris Burkov [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260302231216.GA78195@zen.localdomain \
    --to=boris@bur.io \
    --cc=kernel-team@fb.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=loemra.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox