From: Leo Martins <loemra.dev@gmail.com>
To: Filipe Manana <fdmanana@kernel.org>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH] btrfs: prevent COW amplification during btrfs_search_slot
Date: Fri, 30 Jan 2026 13:43:17 -0800 [thread overview]
Message-ID: <20260130214319.3714908-1-loemra.dev@gmail.com> (raw)
In-Reply-To: <CAL3q7H5kf=vR1Jcc5NduS_jUROyE0WzXCLgzAe=wMVeZcBKOGw@mail.gmail.com>
On Fri, 30 Jan 2026 12:49:55 +0000 Filipe Manana <fdmanana@kernel.org> wrote:
> On Fri, Jan 30, 2026 at 12:13 AM Leo Martins <loemra.dev@gmail.com> wrote:
> >
> > On Thu, 29 Jan 2026 11:52:07 +0000 Filipe Manana <fdmanana@kernel.org> wrote:
> >
> > > On Tue, Jan 27, 2026 at 8:43 PM Leo Martins <loemra.dev@gmail.com> wrote:
> > > >
> > > > I've been investigating enospcs at Meta and have observed a strange
> > > > pattern where filesystems are enospcing with lots of unallocated space
> > > > (> 100G). Sample dmesg dump at bottom of message.
> > > >
> > > > btrfs_insert_delayed_dir_index is attempting to migrate some reservation
> > > > from the transaction block reserve and finding it exhausted leading to a
> > > > warning and enospc. This is a bug as the reservations are meant to be
> > > > worst case. It should be impossible to exhaust the transaction block
> > > > reserve.
> > > >
> > > > Some tracing of affected hosts revealed that there were single
> > > > btrfs_search_slot calls that were COWing 100s of times. I was able to
> > > > reproduce this behavior locally by creating a very constrained cgroup
> > > > and producing a lot of concurrent filesystem operations. Here's the
> > > > pattern:
> > > >
> > > > 1. btrfs_search_slot() begins tree traversal with cow=1
> > > > 2. Node at level N needs COW (old generation or WRITTEN flag set)
> > > > 3. btrfs_cow_block() allocates new node, updates parent pointer
> > > > 4. Traversal continues, but hits a condition requiring restart (e.g., node
> > > > not cached, lock contention, need higher write_lock_level)
> > > > 5. btrfs_release_path() releases all locks and references
> > > > 6. Memory pressure triggers writeback on the COW'd node
> > > > 7. lock_extent_buffer_for_io() clears EXTENT_BUFFER_DIRTY and sets
> > > > BTRFS_HEADER_FLAG_WRITTEN
> > > > 8. goto again - traversal restarts from root
> > > > 9. Traversal reaches the freshly COW'd node
> > > > 10. should_cow_block() sees WRITTEN flag set, returns true
> > > > 11. btrfs_cow_block() allocates another new node - same logical position,
> > > > new physical location, new reservation consumed
> > > > 12. Steps 4-11 repeat indefinitely under sustained memory pressure
> > > >
> > > > Note this behavior should be much harder to trigger since Boris's
> > > > AS_KERNEL_FILE changes that make it so that extent_buffer pages aren't
> > > > accounted for in user cgroups. However, I believe it
> > > > would still be an issue under global memory pressure.
> > > > Link: https://lore.kernel.org/linux-btrfs/cover.1755812945.git.boris@bur.io/
> > > >
> > > > This COW amplification breaks the idea that transaction reservations are
> > > > worst case as any search slot call could find itself in this COW loop and
> > > > exhaust its reservation.
> > > >
> > > > My proposed solution is to temporarily pin extent buffers for the
> > > > lifetime of btrfs_search_slot. This prevents the massive COW
> > > > amplification that can be seen during high memory pressure.
> > > >
> > > > The implementation uses a local xarray to track COW'd buffers for the
> > > > duration of the search. The xarray stores extent_buffer pointers without
> > > > taking additional references; this is safe because tracked buffers remain
> > > > dirty (writeback_blockers prevents the dirty bit from being cleared) and
> > > > dirty buffers cannot be reclaimed by memory pressure.
> > > >
> > > > Synchronization is provided by eb->lock: increments in
> > > > btrfs_search_slot_track_cow() occur while holding the write lock, and
> > > > the check in lock_extent_buffer_for_io() also holds the write lock via
> > > > btrfs_tree_lock(). Decrements don't require eb->lock because
> > > > writeback_blockers is atomic and merely indicates "don't write yet".
> > > > Once we decrement, we're done and don't care if writeback proceeds
> > > > immediately.
> > >
> > > This seems too complex to me.
> > >
> > > So this problem is very similar to some idea I had a few years ago but
> > > never managed to implement.
> > > It was about avoiding unnecessary COW, not for this space reservation
> > > exhaustion due to sustained memory pressure, but it would solve it
> > > too.
> > >
> > > The idea was that we do unnecessary COW in cases like this:
> > >
> > > 1) We COW a path in some tree and we are at transaction N;
> > >
> > > 2) Writeback happened for the extent buffers in that path while we are
> > > in the same transaction, because we reached the 32M limit and some
> > > task called btrfs_btree_balance_dirty() or something else triggered
> > > writeback of the btree inode;
> > >
> > > 3) While still at transaction N, we visit the same path to add an item
> > > to a leaf, or modify an item, whatever. Because the extent buffers
> > > have BTRFS_HEADER_FLAG_WRITTEN, we COW them again (should_cow_block()
> > > returns true).
> > >
> > > So during the lifetime of a transaction we can have a lot of
> > > unnecessary COW - we spend more time allocating extents, allocating
> > > memory, copying extent buffer data, use more space per transaction,
> > > etc.
> > >
> > > The idea was to not COW when an extent buffer has
> > > BTRFS_HEADER_FLAG_WRITTEN set, but only if its generation
> > > (btrfs_header_generation(eb)) matches the current transaction.
> > > That is safe because there's no committed tree that points to an
> > > extent buffer created in the current transaction.
> > >
> > > Any further modification to the extent buffer must be sure that the
> > > EXTENT_BUFFER_DIRTY flag is set, that the eb range is still in the
> > > transaction's dirty_pages io tree, etc, so that we don't miss writing
> > > the extent buffer to the same location again before the transaction
> > > commits the superblocks.
> > >
> > > Have you considered an approach like this?
> >
> > I had not considered this, but it is a great idea.
> >
> > My first thought is that implementing this could be as simple
> > as removing the BTRFS_HEADER_FLAG_WRITTEN check. However, this
> > would mess with the assumptions around the log tree. From
> > btrfs_sync_log():
> >
> > /*
> > * IO has been started, blocks of the log tree have WRITTEN flag set
> > * in their headers. new modifications of the log will be written to
> > * new positions. so it's safe to allow log writers to go in.
> > */
> >
> > ^ Assumes that WRITTEN blocks will be COW'd.
> >
> > The issue looks like:
> >
> > 1. fsync A COWs eb
> > 2. fsync A lock_extent_buffer_for_io(); sets WRITTEN, unlocks tree
> > 3. fsync B does __not__ COW eb and modifies it
> > 4. fsync A writes modified eb to disk
> > 5. CRASH; the log tree is corrupted
> >
> > One way to avoid that is to keep the current behavior for the log
> > tree, but that leaves the potential for COW amplification...
> >
> > Another idea is to track the log_transid in the eb in the same way
> > the transid is tracked. Then, in should_cow_block we have something
> > like:
> >
> > if (btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID &&
> > buf->log_transid != root->log_transid)
> > return true;
>
> Log trees are special since their lifetime doesn't span on
> transaction, so what I suggested doesn't work of course for log trees
> and I forgot to mention that.
>
> Tracking the log_transid in the extent buffer will not always work -
> because it can be evicted and reloaded, so we would lose its value.
> We would have to update the on-disk format to store it somewhere or
> keep another in memory structure to track that, or prevent eviction of
> log tree buffers - all of those are too complex.
>
> So I had this half baked patch from many years ago:
>
> static int split_node(struct btrfs_trans_handle *trans, struct btrfs_root
> *root, struct btrfs_path *path, int level);
> @@ -1426,11 +1427,30 @@ static inline int should_cow_block(struct
> btrfs_trans_handle *trans,
> * block to ensure the metadata consistency.
> */
> if (btrfs_header_generation(buf) == trans->transid &&
> - !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
> !(root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID &&
> btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)) &&
> - !test_bit(BTRFS_ROOT_FORCE_COW, &root->state))
> + !test_bit(BTRFS_ROOT_FORCE_COW, &root->state)) {
> +
> + if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
> + if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN))
> + return 1;
> + return 0;
> + }
> +
> + if (test_bit(EXTENT_BUFFER_WRITEBACK, &buf->bflags) ||
> + test_bit(EXTENT_BUFFER_WRITE_ERR, &buf->bflags))
> + return 1;
This is a great starting point. Will iterate on this and send
out a v2 next week.
Thanks,
Leo
>
> This was before a recent refactoring of should_cow_block(), but you
> should get the ideia.
> IIRC all fstests were passing back then, except for one or two which I
> never spent time debugging.
>
> And as that attempt was before the tree checker existed, we would need
> to make sure we don't change and eb while the tree checker is
> verifying it - making sure the tree checker read locks the eb should
> be enough.
>
> There's also one problem with this idea: it won't work for zoned
> devices as writes are sequential and we can't write twice to the same
> location without doing the zone reset thing which only happens around
> transaction commit time IIRC.
>
> Thanks.
>
> >
> > Please let me know if you see any issues with this approach or
> > if you can think of a better method.
> >
> > Thanks,
> > Leo
> >
> > >
> > > It would solve this space reservation exhaustion problem, as well as
> > > unnecessary COW for general optimization, without the need to for a
> > > local xarray, which besides being very specific for the
> > > btrfs_search_slot() case (we COW in other places), also requires a
> > > memory allocation which can fail.
> > >
> > > Thanks.
next prev parent reply other threads:[~2026-01-30 21:43 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-01-27 20:42 [PATCH] btrfs: prevent COW amplification during btrfs_search_slot Leo Martins
2026-01-28 21:48 ` Qu Wenruo
2026-01-29 19:30 ` Leo Martins
2026-01-29 11:52 ` Filipe Manana
2026-01-30 0:12 ` Leo Martins
2026-01-30 4:14 ` Sun YangKai
2026-01-30 9:37 ` Sun YangKai
2026-01-30 15:50 ` Sun YangKai
2026-01-30 16:11 ` Filipe Manana
2026-01-31 9:16 ` Sun YangKai
2026-01-30 12:49 ` Filipe Manana
2026-01-30 15:43 ` Boris Burkov
2026-01-30 15:57 ` Filipe Manana
2026-02-03 1:09 ` Leo Martins
2026-01-30 21:43 ` Leo Martins [this message]
2026-01-30 22:34 ` Qu Wenruo
2026-01-31 0:11 ` Boris Burkov
2026-01-31 1:06 ` Qu Wenruo
2026-01-31 17:16 ` Boris Burkov
2026-01-31 21:59 ` Qu Wenruo
2026-02-10 7:45 ` kernel test robot
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260130214319.3714908-1-loemra.dev@gmail.com \
--to=loemra.dev@gmail.com \
--cc=fdmanana@kernel.org \
--cc=kernel-team@fb.com \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox