public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Leo Martins <loemra.dev@gmail.com>
To: Filipe Manana <fdmanana@kernel.org>
Cc: Boris Burkov <boris@bur.io>,
	linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH] btrfs: prevent COW amplification during btrfs_search_slot
Date: Mon,  2 Feb 2026 17:09:58 -0800	[thread overview]
Message-ID: <20260203011002.1872801-1-loemra.dev@gmail.com> (raw)
In-Reply-To: <CAL3q7H7O9eAA+sapfsGfn=crF3Levk7wrFNVSbwEP4uMtLJbJA@mail.gmail.com>

On Fri, 30 Jan 2026 15:57:51 +0000 Filipe Manana <fdmanana@kernel.org> wrote:

> On Fri, Jan 30, 2026 at 3:44 PM Boris Burkov <boris@bur.io> wrote:
> >
> > On Fri, Jan 30, 2026 at 12:49:55PM +0000, Filipe Manana wrote:
> > > On Fri, Jan 30, 2026 at 12:13 AM Leo Martins <loemra.dev@gmail.com> wrote:
> > > >
> > > > On Thu, 29 Jan 2026 11:52:07 +0000 Filipe Manana <fdmanana@kernel.org> wrote:
> > > >
> > > > > On Tue, Jan 27, 2026 at 8:43 PM Leo Martins <loemra.dev@gmail.com> wrote:
> > > > > >
> > > > > > I've been investigating enospcs at Meta and have observed a strange
> > > > > > pattern where filesystems are enospcing with lots of unallocated space
> > > > > > (> 100G). Sample dmesg dump at bottom of message.
> > > > > >
> > > > > > btrfs_insert_delayed_dir_index is attempting to migrate some reservation
> > > > > > from the transaction block reserve and finding it exhausted leading to a
> > > > > > warning and enospc. This is a bug as the reservations are meant to be
> > > > > > worst case. It should be impossible to exhaust the transaction block
> > > > > > reserve.
> > > > > >
> > > > > > Some tracing of affected hosts revealed that there were single
> > > > > > btrfs_search_slot calls that were COWing 100s of times. I was able to
> > > > > > reproduce this behavior locally by creating a very constrained cgroup
> > > > > > and producing a lot of concurrent filesystem operations. Here's the
> > > > > > pattern:
> > > > > >
> > > > > >  1. btrfs_search_slot() begins tree traversal with cow=1
> > > > > >  2. Node at level N needs COW (old generation or WRITTEN flag set)
> > > > > >  3. btrfs_cow_block() allocates new node, updates parent pointer
> > > > > >  4. Traversal continues, but hits a condition requiring restart (e.g., node
> > > > > >     not cached, lock contention, need higher write_lock_level)
> > > > > >  5. btrfs_release_path() releases all locks and references
> > > > > >  6. Memory pressure triggers writeback on the COW'd node
> > > > > >  7. lock_extent_buffer_for_io() clears EXTENT_BUFFER_DIRTY and sets
> > > > > >     BTRFS_HEADER_FLAG_WRITTEN
> > > > > >  8. goto again - traversal restarts from root
> > > > > >  9. Traversal reaches the freshly COW'd node
> > > > > >  10. should_cow_block() sees WRITTEN flag set, returns true
> > > > > >  11. btrfs_cow_block() allocates another new node - same logical position,
> > > > > >      new physical location, new reservation consumed
> > > > > >  12. Steps 4-11 repeat indefinitely under sustained memory pressure
> > > > > >
> > > > > > Note this behavior should be much harder to trigger since Boris's
> > > > > > AS_KERNEL_FILE changes that make it so that extent_buffer pages aren't
> > > > > > accounted for in user cgroups. However, I believe it
> > > > > > would still be an issue under global memory pressure.
> > > > > > Link: https://lore.kernel.org/linux-btrfs/cover.1755812945.git.boris@bur.io/
> > > > > >
> > > > > > This COW amplification breaks the idea that transaction reservations are
> > > > > > worst case as any search slot call could find itself in this COW loop and
> > > > > > exhaust its reservation.
> > > > > >
> > > > > > My proposed solution is to temporarily pin extent buffers for the
> > > > > > lifetime of btrfs_search_slot. This prevents the massive COW
> > > > > > amplification that can be seen during high memory pressure.
> > > > > >
> > > > > > The implementation uses a local xarray to track COW'd buffers for the
> > > > > > duration of the search. The xarray stores extent_buffer pointers without
> > > > > > taking additional references; this is safe because tracked buffers remain
> > > > > > dirty (writeback_blockers prevents the dirty bit from being cleared) and
> > > > > > dirty buffers cannot be reclaimed by memory pressure.
> > > > > >
> > > > > > Synchronization is provided by eb->lock: increments in
> > > > > > btrfs_search_slot_track_cow() occur while holding the write lock, and
> > > > > > the check in lock_extent_buffer_for_io() also holds the write lock via
> > > > > > btrfs_tree_lock(). Decrements don't require eb->lock because
> > > > > > writeback_blockers is atomic and merely indicates "don't write yet".
> > > > > > Once we decrement, we're done and don't care if writeback proceeds
> > > > > > immediately.
> > > > >
> > > > > This seems too complex to me.
> > > > >
> > > > > So this problem is very similar to some idea I had a few years ago but
> > > > > never managed to implement.
> > > > > It was about avoiding unnecessary COW, not for this space reservation
> > > > > exhaustion due to sustained memory pressure, but it would solve it
> > > > > too.
> > > > >
> > > > > The idea was that we do unnecessary COW in cases like this:
> > > > >
> > > > > 1) We COW a path in some tree and we are at transaction N;
> > > > >
> > > > > 2) Writeback happened for the extent buffers in that path while we are
> > > > > in the same transaction, because we reached the 32M limit and some
> > > > > task called btrfs_btree_balance_dirty() or something else triggered
> > > > > writeback of the btree inode;
> > > > >
> > > > > 3) While still at transaction N, we visit the same path to add an item
> > > > > to a leaf, or modify an item, whatever. Because the extent buffers
> > > > > have BTRFS_HEADER_FLAG_WRITTEN, we COW them again (should_cow_block()
> > > > > returns true).
> > > > >
> > > > > So during the lifetime of a transaction we can have a lot of
> > > > > unnecessary COW - we spend more time allocating extents, allocating
> > > > > memory, copying extent buffer data, use more space per transaction,
> > > > > etc.
> > > > >
> > > > > The idea was to not COW when an extent buffer has
> > > > > BTRFS_HEADER_FLAG_WRITTEN set, but only if its generation
> > > > > (btrfs_header_generation(eb)) matches the current transaction.
> > > > > That is safe because there's no committed tree that points to an
> > > > > extent buffer created in the current transaction.
> > > > >
> > > > > Any further modification to the extent buffer must be sure that the
> > > > > EXTENT_BUFFER_DIRTY flag is set, that the eb range is still in the
> > > > > transaction's dirty_pages io tree, etc, so that we don't miss writing
> > > > > the extent buffer to the same location again before the transaction
> > > > > commits the superblocks.
> > > > >
> > > > > Have you considered an approach like this?
> > > >
> > > > I had not considered this, but it is a great idea.
> > > >
> > > > My first thought is that implementing this could be as simple
> > > > as removing the BTRFS_HEADER_FLAG_WRITTEN check. However, this
> > > > would mess with the assumptions around the log tree. From
> > > > btrfs_sync_log():
> > > >
> > > > /*
> > > >  * IO has been started, blocks of the log tree have WRITTEN flag set
> > > >  * in their headers. new modifications of the log will be written to
> > > >  * new positions. so it's safe to allow log writers to go in.
> > > >  */
> > > >
> > > > ^ Assumes that WRITTEN blocks will be COW'd.
> > > >
> > > > The issue looks like:
> > > >
> > > >  1. fsync A COWs eb
> > > >  2. fsync A lock_extent_buffer_for_io(); sets WRITTEN, unlocks tree
> > > >  3. fsync B does __not__ COW eb and modifies it
> > > >  4. fsync A writes modified eb to disk
> > > >  5. CRASH; the log tree is corrupted
> > > >
> > > > One way to avoid that is to keep the current behavior for the log
> > > > tree, but that leaves the potential for COW amplification...
> > > >
> > > > Another idea is to track the log_transid in the eb in the same way
> > > > the transid is tracked. Then, in should_cow_block we have something
> > > > like:
> > > >
> > > > if (btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID &&
> > > >     buf->log_transid != root->log_transid)
> > > >   return true;
> > >
> > > Log trees are special since their lifetime doesn't span on
> > > transaction, so what I suggested doesn't work of course for log trees
> > > and I forgot to mention that.
> > >
> > > Tracking the log_transid in the extent buffer will not always work -
> > > because it can be evicted and reloaded, so we would lose its value.
> > > We would have to update the on-disk format to store it somewhere or
> > > keep another in memory structure to track that, or prevent eviction of
> > > log tree buffers - all of those are too complex.
> >
> > Supposing we cannot think of a way to do overwrites on log tree ebs,
> > but that we can make it work for other ebs (excluding the zoned case
> > you mentioned below):
> >
> > What do you think about the problem of space reservation exhaustion
> > due to COW amplification when narrowed to just log trees? As far as I
> > can tell, there is nothing special about how logged items consume
> > transaction reservation so the problem would be reduced but still exist.
> 
> Log trees are small to start with, and if they ever hit -ENOSPC, we
> just fallback to transaction commit.
> 
> >
> > Do we want to pursue working out the kinkds in eb-overwrite (seems super
> > valuable regardless of motivation) and think of some other final
> > backstop for log tree ebs? Given that the fsync will be sending the ebs
> > down to the disk quite soon anyway, I was thinking it might be more
> > palatable to try to fully prevent premature writeback of log tree ebs.
> >
> > >
> > > So I had this half baked patch from many years ago:
> > >
> > >  static int split_node(struct btrfs_trans_handle *trans, struct btrfs_root
> > >                       *root, struct btrfs_path *path, int level);
> > > @@ -1426,11 +1427,30 @@ static inline int should_cow_block(struct
> > > btrfs_trans_handle *trans,
> > >          *    block to ensure the metadata consistency.
> > >          */
> > >         if (btrfs_header_generation(buf) == trans->transid &&
> > > -           !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
> > >             !(root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID &&
> > >               btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)) &&
> > > -           !test_bit(BTRFS_ROOT_FORCE_COW, &root->state))
> > > +           !test_bit(BTRFS_ROOT_FORCE_COW, &root->state)) {
> > > +
> > > +               if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
> > > +                       if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN))
> > > +                               return 1;
> > > +                       return 0;
> > > +               }
> > > +
> > > +               if (test_bit(EXTENT_BUFFER_WRITEBACK, &buf->bflags) ||
> > > +                   test_bit(EXTENT_BUFFER_WRITE_ERR, &buf->bflags))
> > > +                       return 1;
> > >
> > > This was before a recent refactoring of should_cow_block(), but you
> > > should get the ideia.
> > > IIRC all fstests were passing back then, except for one or two which I
> > > never spent time debugging.
> > >
> > > And as that attempt was before the tree checker existed, we would need
> > > to make sure we don't change and eb while the tree checker is
> > > verifying it - making sure the tree checker read locks the eb should
> > > be enough.
> >
> > I suspect this is what Sun was hitting in his replies to Leo.
> >
> > >
> > > There's also one problem with this idea: it won't work for zoned
> > > devices as writes are sequential and we can't write twice to the same
> > > location without doing the zone reset thing which only happens around
> > > transaction commit time IIRC.
> > >
> > > Thanks.
> > >
> >
> > Thanks for your input on this, it's really appreciated,
> > Boris
> >
> > > >
> > > > Please let me know if you see any issues with this approach or
> > > > if you can think of a better method.
> > > >
> > > > Thanks,
> > > > Leo
> > > >
> > > > >
> > > > > It would solve this space reservation exhaustion problem, as well as
> > > > > unnecessary COW for general optimization, without the need to for a
> > > > > local xarray, which besides being very specific for the
> > > > > btrfs_search_slot() case (we COW in other places), also requires a
> > > > > memory allocation which can fail.
> > > > >
> > > > > Thanks.

I just want to re-iterate my understanding of the discussion to make
sure everyone is on the same page.

We've discussed two potential strategies: pinning and overwriting.

Overwriting, as suggested by Filipe, where we skip COWing if an eb
has already been COW'd this transaction generation and instead
overwrite the eb in place.

Pinning, as suggested by me, where we skip flushing ebs that have
been COW'd in the current search_slot.

With pinning alone, the class of unlock-retry re-COW loops goes away,
but not all re-COWs. If two different transaction handles in the same
generation touch the same path they would still re-COW.

With overwriting alone, we never re-COW within a generation, we just
overwrite the eb. This is a general optimization that also happens
to fix the amplification bug. But it doesn't work for log trees or
zoned devices, leaving those exposed to the original bug.

So, in my view, the solution is to implement both. Pinning to prevent
the unnecessary writeback that triggers re-COWs during retry loops,
and overwriting to avoid useless COWing when we legitimately revisit
the same path. For log trees and zoned devices where overwriting isn't
possible, pinning alone provides the fix.

To address the critique that the original pinning approach is too
narrow, only working in btrfs_search_slot(), we can instead tie the
pin to the transaction handle. I also believe that we could address
having to allocate memory mid-transaction by reserving enough space
at start_transaction so we don't risk having to abort the transaction.

Let me know if this all makes sense. Or if you disagree that both
approaches are necessary.

Thanks,
Leo

  reply	other threads:[~2026-02-03  1:10 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-01-27 20:42 [PATCH] btrfs: prevent COW amplification during btrfs_search_slot Leo Martins
2026-01-28 21:48 ` Qu Wenruo
2026-01-29 19:30   ` Leo Martins
2026-01-29 11:52 ` Filipe Manana
2026-01-30  0:12   ` Leo Martins
2026-01-30  4:14     ` Sun YangKai
2026-01-30  9:37       ` Sun YangKai
2026-01-30 15:50         ` Sun YangKai
2026-01-30 16:11           ` Filipe Manana
2026-01-31  9:16             ` Sun YangKai
2026-01-30 12:49     ` Filipe Manana
2026-01-30 15:43       ` Boris Burkov
2026-01-30 15:57         ` Filipe Manana
2026-02-03  1:09           ` Leo Martins [this message]
2026-01-30 21:43       ` Leo Martins
2026-01-30 22:34         ` Qu Wenruo
2026-01-31  0:11           ` Boris Burkov
2026-01-31  1:06             ` Qu Wenruo
2026-01-31 17:16               ` Boris Burkov
2026-01-31 21:59                 ` Qu Wenruo
2026-02-10  7:45 ` kernel test robot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260203011002.1872801-1-loemra.dev@gmail.com \
    --to=loemra.dev@gmail.com \
    --cc=boris@bur.io \
    --cc=fdmanana@kernel.org \
    --cc=kernel-team@fb.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox