public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Leo Martins <loemra.dev@gmail.com>
To: Filipe Manana <fdmanana@kernel.org>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH v2 1/3] btrfs: skip COW for written extent buffers allocated in current transaction
Date: Tue, 17 Feb 2026 13:48:08 -0800	[thread overview]
Message-ID: <20260217214815.658944-1-loemra.dev@gmail.com> (raw)
In-Reply-To: <CAL3q7H54zGbpSiwnJXXg0pXLUQtZSwQ65X8iN716Tko0EtynRQ@mail.gmail.com>

On Mon, 16 Feb 2026 12:18:48 +0000 Filipe Manana <fdmanana@kernel.org> wrote:

> On Fri, Feb 13, 2026 at 8:38 PM Leo Martins <loemra.dev@gmail.com> wrote:
> >
> > When memory pressure causes writeback of a recently COW'd buffer,
> > btrfs sets BTRFS_HEADER_FLAG_WRITTEN on it. Subsequent
> > btrfs_search_slot() restarts then see the WRITTEN flag and re-COW
> > the buffer unnecessarily, causing COW amplification that can exhaust
> > block reservations and degrade throughput.
> >
> > Overwriting in place is crash-safe because the committed superblock
> > does not reference buffers allocated in the current (uncommitted)
> > transaction, so no on-disk tree points to this block yet.
> >
> > When should_cow_block() encounters a WRITTEN buffer whose generation
> > matches the current transaction, instead of requesting a COW, re-dirty
> > the buffer and re-register its range in the transaction's dirty_pages.
> >
> > Both are necessary because btrfs tracks dirty metadata through two
> > independent mechanisms. set_extent_buffer_dirty() sets the
> > EXTENT_BUFFER_DIRTY flag and the buffer_tree xarray PAGECACHE_TAG_DIRTY
> > mark, which is what background writeback (btree_write_cache_pages) uses
> > to find and write dirty buffers. The transaction's dirty_pages io tree
> > is a separate structure used by btrfs_write_and_wait_transaction() at
> > commit time to ensure all buffers allocated during the transaction are
> > persisted. The dirty_pages range was originally registered in
> > btrfs_init_new_buffer() when the block was first allocated, but
> > background writeback may have already written and cleared it.
> 
> This is not quite correct, the dirty_pages range is never cleared on
> background writeback.
> We only clear it during a transaction commit, in
> btrfs_write_and_wait_transaction().
> 
> Normally we shouldn't care about setting the range again in
> dirty_pages, because after
> we call  btrfs_write_and_wait_transaction(), no more COW should be
> possible using this
> transaction (which is in the unblocked state, so any new COW attempt
> will be in another transaction).
> 
> The exception is if we have snapshots to create and qgroups are
> enabled, since in qgroup_account_snapshot() we
> call btrfs_write_and_wait_transaction() and after that we can get more
> COW, due to all the stuff we need to do to
> create a snapshot, before we get to the final call to
> btrfs_write_and_wait_transaction() right before we write the
> super blocks in btrfs_commit_transaction().

Got it, thanks for the correction. Updated in v3.

Thanks,
Leo

> 
> >
> > Keep BTRFS_HEADER_FLAG_WRITTEN set so that btrfs_free_tree_block()
> > correctly pins the block if it is freed later.
> >
> > Exclude cases where in-place overwrite is not safe:
> >  - EXTENT_BUFFER_WRITEBACK: buffer is mid-I/O
> >  - Zoned devices: require sequential writes
> >  - Log trees: log blocks are immediately referenced by a committed
> >    superblock via btrfs_sync_log(), so overwriting could corrupt the
> >    committed log
> >  - BTRFS_ROOT_FORCE_COW: snapshot in progress
> >  - BTRFS_HEADER_FLAG_RELOC: block being relocated
> >
> > Signed-off-by: Leo Martins <loemra.dev@gmail.com>
> > ---
> >  fs/btrfs/ctree.c | 53 +++++++++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 50 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> > index 7267b2502665..a345e1be24d8 100644
> > --- a/fs/btrfs/ctree.c
> > +++ b/fs/btrfs/ctree.c
> > @@ -599,9 +599,9 @@ int btrfs_force_cow_block(struct btrfs_trans_handle *trans,
> >         return ret;
> >  }
> >
> > -static inline bool should_cow_block(const struct btrfs_trans_handle *trans,
> > +static inline bool should_cow_block(struct btrfs_trans_handle *trans,
> >                                     const struct btrfs_root *root,
> > -                                   const struct extent_buffer *buf)
> > +                                   struct extent_buffer *buf)
> >  {
> >         if (btrfs_is_testing(root->fs_info))
> >                 return false;
> > @@ -621,8 +621,55 @@ static inline bool should_cow_block(const struct btrfs_trans_handle *trans,
> >         if (btrfs_header_generation(buf) != trans->transid)
> >                 return true;
> >
> > -       if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN))
> > +       if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
> > +               /*
> > +                * The buffer was allocated in this transaction and has been
> > +                * written back to disk (WRITTEN is set). Normally we'd COW
> > +                * it again, but since the committed superblock doesn't
> > +                * reference this buffer (it was allocated this transaction),
> 
> Missing an "in" before "this transaction".
> 
> > +                * we can safely overwrite it in place.
> > +                *
> > +                * We keep BTRFS_HEADER_FLAG_WRITTEN set. The block has been
> > +                * persisted at this bytenr and will be again after the
> > +                * in-place update. This is important so that
> > +                * btrfs_free_tree_block() correctly pins the block if it is
> > +                * freed later (e.g., during tree rebalancing or FORCE_COW).
> > +                *
> > +                * We re-dirty the buffer to ensure the in-place modifications
> > +                * will be written back to disk.
> > +                *
> > +                * Exclusions:
> > +                * - Log trees: log blocks are written and immediately
> > +                *   referenced by a committed superblock via
> > +                *   btrfs_sync_log(), bypassing the normal transaction
> > +                *   commit. Overwriting in place could corrupt the
> > +                *   committed log.
> > +                * - Zoned devices: require sequential writes
> > +                * - FORCE_COW: snapshot in progress
> > +                * - RELOC flag: block being relocated
> > +                */
> > +               if (!test_bit(EXTENT_BUFFER_WRITEBACK, &buf->bflags) &&
> > +                   !btrfs_is_zoned(root->fs_info) &&
> > +                   btrfs_root_id(root) != BTRFS_TREE_LOG_OBJECTID &&
> > +                   !test_bit(BTRFS_ROOT_FORCE_COW, &root->state) &&
> 
> We need a  smp_mb__before_atomic() before checking FORCE_COW, see the
> existing code below.
> 
> > +                   !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)) {
> > +                       /*
> > +                        * Re-register this block's range in the current
> > +                        * transaction's dirty_pages so that
> > +                        * btrfs_write_and_wait_transaction() writes it.
> > +                        * The range was originally registered when the block
> > +                        * was allocated, but that transaction's dirty_pages
> > +                        * may have already been released.
> 
> I think it's worth adding something like: "... already been released
> if we are in a transaction that creates snapshots and we have qgroups
> enabled."
> 
> Otherwise it looks good, thanks!
> 
> > +                        */
> > +                       btrfs_set_extent_bit(&trans->transaction->dirty_pages,
> > +                                            buf->start,
> > +                                            buf->start + buf->len - 1,
> > +                                            EXTENT_DIRTY, NULL);
> > +                       set_extent_buffer_dirty(buf);
> > +                       return false;
> > +               }
> >                 return true;
> > +       }
> >
> >         /* Ensure we can see the FORCE_COW bit. */
> >         smp_mb__before_atomic();
> > --
> > 2.47.3
> >
> >

  reply	other threads:[~2026-02-17 21:48 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-13 20:30 [PATCH v2 0/3] btrfs: fix COW amplification under memory pressure Leo Martins
2026-02-13 20:30 ` [PATCH v2 1/3] btrfs: skip COW for written extent buffers allocated in current transaction Leo Martins
2026-02-14  1:25   ` Sun YangKai
2026-02-17 22:04     ` Leo Martins
2026-02-16 12:18   ` Filipe Manana
2026-02-17 21:48     ` Leo Martins [this message]
2026-02-13 20:30 ` [PATCH v2 2/3] btrfs: inhibit extent buffer writeback to prevent COW amplification Leo Martins
2026-02-16 12:33   ` Filipe Manana
2026-02-13 20:30 ` [PATCH v2 3/3] btrfs: add tracepoint for COW amplification tracking Leo Martins
2026-02-16 12:40   ` Filipe Manana
2026-02-17 22:06     ` Leo Martins

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260217214815.658944-1-loemra.dev@gmail.com \
    --to=loemra.dev@gmail.com \
    --cc=fdmanana@kernel.org \
    --cc=kernel-team@fb.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox