From: Boris Burkov <boris@bur.io>
To: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH v3 0/4] btrfs: improve stalls under sudden writeback
Date: Tue,  7 Apr 2026 12:30:10 -0700
Message-ID: <cover.1775589916.git.boris@bur.io>

If you have a system with very large memory (TiBs) and a normal
percentage based dirty_ratio/dirty_background_ratio like the defaults of
20%/10%, then we can theoretically rack up 100s of GiB of dirty pages
before doing any writeback. This is further exacerbated if we also see a
sudden drop in free memory due to a large allocation. If we also have a
large disk (relatively likely for a large-RAM system), we are unlikely to
trigger much preemptive metadata reclaim either.
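
To put rough numbers on the paragraph above, here is a small worked
example. It is only a sketch: it applies the ratios to total RAM, whereas
the kernel applies them to dirtyable (free plus reclaimable) memory, so
real thresholds are somewhat lower. The function name and the memory
sizes are illustrative assumptions, not taken from the patches.

```python
GIB = 1 << 30
TIB = 1 << 40


def dirty_thresholds(mem_bytes, dirty_ratio=20, dirty_background_ratio=10):
    """Return (background, hard) dirty-page thresholds in bytes.

    Simplification: ratios are applied to total RAM here; the kernel
    uses dirtyable memory, which is smaller.
    """
    background = mem_bytes * dirty_background_ratio // 100
    hard = mem_bytes * dirty_ratio // 100
    return background, hard


if __name__ == "__main__":
    # Hypothetical machine sizes, chosen only to illustrate the scale.
    for mem in (1 * TIB, 2 * TIB, 4 * TIB):
        bg, hard = dirty_thresholds(mem)
        print(f"{mem // TIB} TiB RAM: background writeback from ~{bg / GIB:.1f} GiB, "
              f"writers throttled at ~{hard / GIB:.1f} GiB")
```

With the default 20%/10% ratios, a 2 TiB machine under this approximation
does not start background writeback until roughly 205 GiB is dirty.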

Once we do start doing writeback with such a large supply, the results
are somewhat ugly. The delalloc work generates a huge amount of delayed
refs without proper reservations, which sends the metadata space system
into a tailspin trying to run yet more delalloc to free space.
Ultimately, the system stalls waiting for huge amounts of ordered
extents and delayed refs, blocking all users in start_transaction() on
tickets in reserve_space().

This patch series aims to address these issues in a relatively targeted
way by improving our reservations for delalloc delayed refs and by doing
some very basic smoothing of the work in flush_space(). Further work
could be done to improve flush_space() heuristics and latency but this
is already a big help on my observed workloads.

I was able to reproduce stalls on a more "modest" system with 264GiB of
ram by using a somewhat silly 80% dirty_ratio.

I was unfortunately unable to reproduce any stalls on a yet smaller
system with only 32GiB of ram.
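
For anyone checking where their own machine sits relative to these
setups, the sketch below reads the current ratios from procfs and
estimates the ratio-based limits. It uses MemTotal as the base, which
overstates the real limits (the kernel uses dirtyable memory), and the
helper names are made up for illustration. Under the same approximation,
264GiB at an 80% dirty_ratio allows about 211GiB of dirty pages.

```python
def parse_meminfo_kib(text, key="MemTotal"):
    """Return the value in KiB for `key` from /proc/meminfo-style text."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == key:
            return int(rest.split()[0])
    raise KeyError(key)


def ratio_limit_gib(mem_kib, ratio_percent):
    """Approximate a ratio-based dirty limit in GiB, using total RAM as the base."""
    return mem_kib * ratio_percent / 100 / (1 << 20)


def read_int(path):
    with open(path) as f:
        return int(f.read().strip())


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            mem_kib = parse_meminfo_kib(f.read())
        # These read as 0 when vm.dirty_bytes / vm.dirty_background_bytes
        # are in use instead of the ratio knobs.
        ratio = read_int("/proc/sys/vm/dirty_ratio")
        bg_ratio = read_int("/proc/sys/vm/dirty_background_ratio")
    except OSError as err:
        print(f"could not read procfs: {err}")
    else:
        print(f"dirty_background_ratio={bg_ratio}% -> ~{ratio_limit_gib(mem_kib, bg_ratio):.1f} GiB")
        print(f"dirty_ratio={ratio}% -> ~{ratio_limit_gib(mem_kib, ratio):.1f} GiB")
```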

The first 2 patches do the delayed_ref rsv accounting on btrfs_inode,
mirroring inode->block_rsv.
The 3rd patch is a cleanup to the types counting max extents.
The 4th patch reduces the size of the unit of work in shrink_delalloc()
to further reduce stalls.
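
As a rough illustration of why a smaller unit of work helps, the sketch
below splits one large reclaim request into steps of at most 128MiB, so
that a caller could recheck waiters between steps rather than after one
huge flush. This is a toy model, not the shrink_delalloc() code: only the
128M figure comes from the patch subject, and capped_iterations() and its
parameters are hypothetical names.

```python
MIB = 1 << 20
CAP = 128 * MIB  # the "128M" from the patch subject; the rest is illustrative


def capped_iterations(to_reclaim, cap=CAP):
    """Yield per-iteration work sizes, each at most `cap`, summing to `to_reclaim`."""
    remaining = to_reclaim
    while remaining > 0:
        step = min(remaining, cap)
        yield step
        remaining -= step


if __name__ == "__main__":
    # A hypothetical 300 MiB request becomes three bounded steps.
    steps = list(capped_iterations(300 * MIB))
    print([s // MIB for s in steps])
```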
---
Changelog:
v3:
- Merge csum reservation patch (2) into main delalloc delrefs rsv patch (1)
- Add delayed refs reservations for RST and subvol tree metadata cow to
  patch 1.
- Do the migration in the nocow/prealloc finish_one_ordered() cases as
  there are still metadata delayed refs generated.
- Double delref rsv for cows (add+drop). This seems really conservative
  to me, but I think it is correct. If we like it, it needs to happen
  in more places too...
- Upgrade ASSERTs in patch 3 (old patch 4) to log unexpected values.
- Remove unused return value in migrate function.
- Various stylistic issues in several patches.
v2:
- patch 1 no longer embeds a new block_rsv on btrfs_inode for the
  delayed reservation. Instead it does the reservation on
  inode->block_rsv and migrates it to trans->delayed_rsv at the moment
  of truth.
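
The reserve-then-migrate idea in the v2 note can be pictured with a toy
model of two counters. This is only a sketch of the accounting pattern:
ToyRsv, reserve() and migrate() are hypothetical names, the byte counts
are arbitrary, and none of this reflects the actual btrfs helpers or
their locking.

```python
class ToyRsv:
    """A stand-in for a block reservation: just a name and a byte counter."""

    def __init__(self, name):
        self.name = name
        self.reserved = 0


def reserve(rsv, nbytes):
    """Account nbytes as reserved in rsv (toy model: always succeeds)."""
    rsv.reserved += nbytes


def migrate(src, dst, nbytes):
    """Move nbytes of existing reservation from src to dst."""
    if nbytes > src.reserved:
        raise ValueError(f"{src.name} holds only {src.reserved} bytes")
    src.reserved -= nbytes
    dst.reserved += nbytes


if __name__ == "__main__":
    inode_rsv = ToyRsv("inode->block_rsv")
    trans_delayed = ToyRsv("trans->delayed_rsv")

    # At write time: reserve up front for the delayed refs delalloc will generate.
    reserve(inode_rsv, 64 * 1024)
    # Later, when the delayed refs are actually created: hand the space over.
    migrate(inode_rsv, trans_delayed, 64 * 1024)
    print(inode_rsv.reserved, trans_delayed.reserved)
```

The point of the pattern is that the space is claimed when the dirty data
is created, so the later delayed refs do not arrive unfunded.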

Boris Burkov (4):
  btrfs: reserve space for delayed_refs in delalloc
  btrfs: account for compression in delalloc extent reservation
  btrfs: make inode->outstanding_extents a u64
  btrfs: cap shrink_delalloc iterations to 128M

 fs/btrfs/btrfs_inode.h       | 20 ++++++--
 fs/btrfs/delalloc-space.c    | 84 +++++++++++++++++++++++++++------
 fs/btrfs/delalloc-space.h    |  3 ++
 fs/btrfs/fs.h                | 13 ------
 fs/btrfs/inode.c             | 90 ++++++++++++++++++++++++++++--------
 fs/btrfs/ordered-data.c      |  4 +-
 fs/btrfs/space-info.c        | 31 ++++++++-----
 fs/btrfs/tests/inode-tests.c | 18 ++++----
 fs/btrfs/transaction.c       | 36 ++++++---------
 include/trace/events/btrfs.h |  8 ++--
 10 files changed, 210 insertions(+), 97 deletions(-)

-- 
2.53.0

