All of lore.kernel.org
 help / color / mirror / Atom feed
From: Leo Martins <loemra.dev@gmail.com>
To: Filipe Manana <fdmanana@kernel.org>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH v2 1/3] btrfs: skip COW for written extent buffers allocated in current transaction
Date: Tue, 17 Feb 2026 13:48:08 -0800	[thread overview]
Message-ID: <20260217214815.658944-1-loemra.dev@gmail.com> (raw)
In-Reply-To: <CAL3q7H54zGbpSiwnJXXg0pXLUQtZSwQ65X8iN716Tko0EtynRQ@mail.gmail.com>

On Mon, 16 Feb 2026 12:18:48 +0000 Filipe Manana <fdmanana@kernel.org> wrote:

> On Fri, Feb 13, 2026 at 8:38 PM Leo Martins <loemra.dev@gmail.com> wrote:
> >
> > When memory pressure causes writeback of a recently COW'd buffer,
> > btrfs sets BTRFS_HEADER_FLAG_WRITTEN on it. Subsequent
> > btrfs_search_slot() restarts then see the WRITTEN flag and re-COW
> > the buffer unnecessarily, causing COW amplification that can exhaust
> > block reservations and degrade throughput.
> >
> > Overwriting in place is crash-safe because the committed superblock
> > does not reference buffers allocated in the current (uncommitted)
> > transaction, so no on-disk tree points to this block yet.
> >
> > When should_cow_block() encounters a WRITTEN buffer whose generation
> > matches the current transaction, instead of requesting a COW, re-dirty
> > the buffer and re-register its range in the transaction's dirty_pages.
> >
> > Both are necessary because btrfs tracks dirty metadata through two
> > independent mechanisms. set_extent_buffer_dirty() sets the
> > EXTENT_BUFFER_DIRTY flag and the buffer_tree xarray PAGECACHE_TAG_DIRTY
> > mark, which is what background writeback (btree_write_cache_pages) uses
> > to find and write dirty buffers. The transaction's dirty_pages io tree
> > is a separate structure used by btrfs_write_and_wait_transaction() at
> > commit time to ensure all buffers allocated during the transaction are
> > persisted. The dirty_pages range was originally registered in
> > btrfs_init_new_buffer() when the block was first allocated, but
> > background writeback may have already written and cleared it.
> 
> This is not quite correct, the dirty_pages range is never cleared on
> background writeback.
> We only clear it during a transaction commit, in
> btrfs_write_and_wait_transaction().
> 
> Normally we shouldn't care about setting the range again in
> dirty_pages, because after
> we call  btrfs_write_and_wait_transaction(), no more COW should be
> possible using this
> transaction (which is in the unblocked state, so any new COW attempt
> will be in another transaction).
> 
> The exception is if we have snapshots to create and qgroups are
> enabled, since in qgroup_account_snapshot() we
> call btrfs_write_and_wait_transaction() and after that we can get more
> COW, due to all the stuff we need to do to
> create a snapshot, before we get to the final call to
> btrfs_write_and_wait_transaction() right before we write the
> super blocks in btrfs_commit_transaction().

Got it, thanks for the correction. Updated in v3.

Thanks,
Leo

> 
> >
> > Keep BTRFS_HEADER_FLAG_WRITTEN set so that btrfs_free_tree_block()
> > correctly pins the block if it is freed later.
> >
> > Exclude cases where in-place overwrite is not safe:
> >  - EXTENT_BUFFER_WRITEBACK: buffer is mid-I/O
> >  - Zoned devices: require sequential writes
> >  - Log trees: log blocks are immediately referenced by a committed
> >    superblock via btrfs_sync_log(), so overwriting could corrupt the
> >    committed log
> >  - BTRFS_ROOT_FORCE_COW: snapshot in progress
> >  - BTRFS_HEADER_FLAG_RELOC: block being relocated
> >
> > Signed-off-by: Leo Martins <loemra.dev@gmail.com>
> > ---
> >  fs/btrfs/ctree.c | 53 +++++++++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 50 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> > index 7267b2502665..a345e1be24d8 100644
> > --- a/fs/btrfs/ctree.c
> > +++ b/fs/btrfs/ctree.c
> > @@ -599,9 +599,9 @@ int btrfs_force_cow_block(struct btrfs_trans_handle *trans,
> >         return ret;
> >  }
> >
> > -static inline bool should_cow_block(const struct btrfs_trans_handle *trans,
> > +static inline bool should_cow_block(struct btrfs_trans_handle *trans,
> >                                     const struct btrfs_root *root,
> > -                                   const struct extent_buffer *buf)
> > +                                   struct extent_buffer *buf)
> >  {
> >         if (btrfs_is_testing(root->fs_info))
> >                 return false;
> > @@ -621,8 +621,55 @@ static inline bool should_cow_block(const struct btrfs_trans_handle *trans,
> >         if (btrfs_header_generation(buf) != trans->transid)
> >                 return true;
> >
> > -       if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN))
> > +       if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
> > +               /*
> > +                * The buffer was allocated in this transaction and has been
> > +                * written back to disk (WRITTEN is set). Normally we'd COW
> > +                * it again, but since the committed superblock doesn't
> > +                * reference this buffer (it was allocated this transaction),
> 
> Missing an "in" before "this transaction".
> 
> > +                * we can safely overwrite it in place.
> > +                *
> > +                * We keep BTRFS_HEADER_FLAG_WRITTEN set. The block has been
> > +                * persisted at this bytenr and will be again after the
> > +                * in-place update. This is important so that
> > +                * btrfs_free_tree_block() correctly pins the block if it is
> > +                * freed later (e.g., during tree rebalancing or FORCE_COW).
> > +                *
> > +                * We re-dirty the buffer to ensure the in-place modifications
> > +                * will be written back to disk.
> > +                *
> > +                * Exclusions:
> > +                * - Log trees: log blocks are written and immediately
> > +                *   referenced by a committed superblock via
> > +                *   btrfs_sync_log(), bypassing the normal transaction
> > +                *   commit. Overwriting in place could corrupt the
> > +                *   committed log.
> > +                * - Zoned devices: require sequential writes
> > +                * - FORCE_COW: snapshot in progress
> > +                * - RELOC flag: block being relocated
> > +                */
> > +               if (!test_bit(EXTENT_BUFFER_WRITEBACK, &buf->bflags) &&
> > +                   !btrfs_is_zoned(root->fs_info) &&
> > +                   btrfs_root_id(root) != BTRFS_TREE_LOG_OBJECTID &&
> > +                   !test_bit(BTRFS_ROOT_FORCE_COW, &root->state) &&
> 
> We need a  smp_mb__before_atomic() before checking FORCE_COW, see the
> existing code below.
> 
> > +                   !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)) {
> > +                       /*
> > +                        * Re-register this block's range in the current
> > +                        * transaction's dirty_pages so that
> > +                        * btrfs_write_and_wait_transaction() writes it.
> > +                        * The range was originally registered when the block
> > +                        * was allocated, but that transaction's dirty_pages
> > +                        * may have already been released.
> 
> I think it's worth adding something like: "... already been released
> if we are in a transaction that creates snapshots and we have qgroups
> enabled."
> 
> Otherwise it looks good, thanks!
> 
> > +                        */
> > +                       btrfs_set_extent_bit(&trans->transaction->dirty_pages,
> > +                                            buf->start,
> > +                                            buf->start + buf->len - 1,
> > +                                            EXTENT_DIRTY, NULL);
> > +                       set_extent_buffer_dirty(buf);
> > +                       return false;
> > +               }
> >                 return true;
> > +       }
> >
> >         /* Ensure we can see the FORCE_COW bit. */
> >         smp_mb__before_atomic();
> > --
> > 2.47.3
> >
> >

  reply	other threads:[~2026-02-17 21:48 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-13 20:30 [PATCH v2 0/3] btrfs: fix COW amplification under memory pressure Leo Martins
2026-02-13 20:30 ` [PATCH v2 1/3] btrfs: skip COW for written extent buffers allocated in current transaction Leo Martins
2026-02-14  1:25   ` Sun YangKai
2026-02-17 22:04     ` Leo Martins
2026-02-16 12:18   ` Filipe Manana
2026-02-17 21:48     ` Leo Martins [this message]
2026-02-13 20:30 ` [PATCH v2 2/3] btrfs: inhibit extent buffer writeback to prevent COW amplification Leo Martins
2026-02-16 12:33   ` Filipe Manana
2026-02-13 20:30 ` [PATCH v2 3/3] btrfs: add tracepoint for COW amplification tracking Leo Martins
2026-02-16 12:40   ` Filipe Manana
2026-02-17 22:06     ` Leo Martins

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260217214815.658944-1-loemra.dev@gmail.com \
    --to=loemra.dev@gmail.com \
    --cc=fdmanana@kernel.org \
    --cc=kernel-team@fb.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.