Linux Btrfs filesystem development
 help / color / mirror / Atom feed
From: Boris Burkov <boris@bur.io>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH v4 4/4] btrfs: cap shrink_delalloc iterations to 128M
Date: Fri, 24 Apr 2026 13:11:31 -0700	[thread overview]
Message-ID: <20260424201054.GA2801466@zen.localdomain> (raw)
In-Reply-To: <d916e8e3-0c25-4540-b829-a13ef058c762@gmx.com>

On Fri, Apr 24, 2026 at 07:37:38PM +0930, Qu Wenruo wrote:
> 
> 
> 在 2026/4/24 16:08, Qu Wenruo 写道:
> > 
> > 
> > 在 2026/4/10 03:18, Boris Burkov 写道:
> > [...]
> > > 
> > > This means iterating over to_reclaim by 128MiB at a time until it is
> > > drained or we satisfy a ticket, rather than trying 3 times to do the
> > > whole thing.
> > > 
> > > Reviewed-by: Filipe Manana <fdmanana@suse.com>
> > > Signed-off-by: Boris Burkov <boris@bur.io>
> > 
> > Hi Boris,
> > 
> > I'm testing the latest for-next base as the baseline for the incoming
> > huge folio support.
> > 
> > On arm64 64K page size, 4K fs block size, I'm seeing a very weird
> > behavior on generic/027.
> > On 7.0-rc7, the test case takes less than 5 seconds and passes as expected.
> > 
> > But on for-next it never finished, furthermore there is always a kworker
> > taking a full core, deadlooping inside
> > btrfs_async_reclaim_metadata_space(), and you can not unmount the fs.
> > 
> > Here is the "echo l > /proc/sysrq-trigger" stack dump for the involved
> > btrfs kworker:
> > 
> > [ 6616.093728] CPU: 0 UID: 0 PID: 501715 Comm: kworker/u33:0 Not tainted
> > 7.0.0-rc7-custom-64k+ #9 PREEMPT(full)
> > [ 6616.093732] Hardware name: QEMU KVM Virtual Machine, BIOS unknown
> > 2/2/2022
> > [ 6616.093734] Workqueue: events_unbound
> > btrfs_async_reclaim_metadata_space [btrfs]
> > [ 6616.093849] pstate: 63400005 (nZCv daif +PAN -UAO +TCO +DIT -SSBS
> > BTYPE=--)
> > [ 6616.093852] pc : btrfs_start_delalloc_roots+0xf0/0x268 [btrfs]
> > [ 6616.093923] lr : btrfs_start_delalloc_roots+0x88/0x268 [btrfs]
> > [ 6616.093987] sp : ffff80008af0fbd0
> > [...]
> > [ 6616.094008] Call trace:
> > [ 6616.094009]  btrfs_start_delalloc_roots+0xf0/0x268 [btrfs] (P)
> > [ 6616.094073]  flush_space+0x3d4/0x6b0 [btrfs]
> > [ 6616.094138]  do_async_reclaim_metadata_space+0x88/0x1d8 [btrfs]
> > [ 6616.094201]  btrfs_async_reclaim_metadata_space+0x50/0x80 [btrfs]
> > [ 6616.094263]  process_one_work+0x174/0x540
> > [ 6616.094277]  worker_thread+0x1a0/0x318
> > [ 6616.094279]  kthread+0x140/0x158
> > [ 6616.094285]  ret_from_fork+0x10/0x20
> > 
> > So it's a regression, and bisection points to this patch.
> > 
> > And I tried the following steps to further confirm it's caused by this
> > commit:
> > 
> > - The test passes just before the commit
> >    The previous commit is "btrfs: make inode->outstanding_extents a u64".
> > 
> > - The test failed at that commit
> >    The test case never finish and one kworker dead looping.
> > 
> > - The test case pass at for-next with this commit reverted
> >    The test case finishes in seconds as usual.
> 
> Furthermore, even with this particular patch *reverted*, I'm still seeing
> generic/224 hitting the same problem.
> 
> Currently I'm testing at the commit before the whole series, which is
> "btrfs: abort transaction in do_remap_reloc_trans() on failure", and no
> generic/224 hang nor 100% kworker CPU usage.
> 
> Thus I'm afraid the whole series may be involved.
> 
> Thanks,
> Qu
> 

Now that I have had a good chance to try and repro, here is what I have
seen so far on my desktop x86 machine and a cloud arm machine.

x86:
a41c84ba2f51 ("btrfs: abort transaction in do_remap_reloc_trans() on failure")
  consistently done in 1 second
8099a837f487 ("btrfs: cap shrink_delalloc iterations to 128M")
  finishes, but in ~500s
ea60045d9b1b ("btrfs: reserve space for delayed_refs in delalloc")
  finishes, but in ~500s

arm:
a41c84ba2f51 ("btrfs: abort transaction in do_remap_reloc_trans() on failure")
  consistently done in ~300 seconds
ea60045d9b1b ("btrfs: reserve space for delayed_refs in delalloc")
  done in ~600s

The two inconsistencies are that I didn't see it go fast on g/027 with just
the shrink_delalloc iterations patch reverted, and I don't have a 2
second baseline on my arm setup.

So I agree that this patch series effectively breaks those tests, on x86
as well. I didn't notice the change in runtime, unfortunately, as I only
looked for success/failure.

As to the cause:
Both g/027 and g/224 are explicitly testing lots of writes to a small
filesystem.

I suspect that what is happening is what Filipe warned about with
excessive space reclaim/pinning reclaim/etc. choking the workload
due to excessive reservation. I have played around with reducing the
reservation sizes in various ways (set it back to 0, set the level
estimate to 4 as test, etc.) and the result varies from back to full
speed or a 60s run. So in my setup, at least, it looks like the
performance of g/027 is very sensitive to how much we reserve.

Would you be willing to let it run for 5-10m to see if you also
reproduce this behavior?

I will try to instrument the reservations and reclaim codepaths and see
if I can think of a nice fix to reserve "enough but not too much".

I can also try to attack the "stuck big fs under big reclaim" more
directly by trying to make reclaim less stuck-prone, rather than messing
with reservations. Though it would be quite disappointing if we
practically cannot make the reservation choices more accurate..

Thanks,
Boris

> > 
> > Do you have any clue on what's going wrong? I guess it's pretty hard to
> > hit on x86_64.
> > 
> > I have a local btrfs branch with huge folios support, with that it's
> > pretty easy to hit similar problems on x86_64, but without that branch,
> > no hit is observed so far on x86_64.
> > 
> > Thanks,
> > Qu
> > 
> > > ---
> > >   fs/btrfs/space-info.c | 31 ++++++++++++++++++++-----------
> > >   1 file changed, 20 insertions(+), 11 deletions(-)
> > > 
> > > diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> > > index f0436eea1544..e931deb3d013 100644
> > > --- a/fs/btrfs/space-info.c
> > > +++ b/fs/btrfs/space-info.c
> > > @@ -725,9 +725,8 @@ static void shrink_delalloc(struct
> > > btrfs_space_info *space_info,
> > >       struct btrfs_trans_handle *trans;
> > >       u64 delalloc_bytes;
> > >       u64 ordered_bytes;
> > > -    u64 items;
> > >       long time_left;
> > > -    int loops;
> > > +    u64 orig_tickets_id;
> > >       delalloc_bytes = percpu_counter_sum_positive(&fs_info-
> > > >delalloc_bytes);
> > >       ordered_bytes = percpu_counter_sum_positive(&fs_info-
> > > >ordered_bytes);
> > > @@ -735,9 +734,7 @@ static void shrink_delalloc(struct
> > > btrfs_space_info *space_info,
> > >           return;
> > >       /* Calc the number of the pages we need flush for space
> > > reservation */
> > > -    if (to_reclaim == U64_MAX) {
> > > -        items = U64_MAX;
> > > -    } else {
> > > +    if (to_reclaim != U64_MAX) {
> > >           /*
> > >            * to_reclaim is set to however much metadata we need to
> > >            * reclaim, but reclaiming that much data doesn't really track
> > > @@ -751,7 +748,6 @@ static void shrink_delalloc(struct
> > > btrfs_space_info *space_info,
> > >            * aggressive.
> > >            */
> > >           to_reclaim = max(to_reclaim, delalloc_bytes >> 3);
> > > -        items = calc_reclaim_items_nr(fs_info, to_reclaim) * 2;
> > >       }
> > >       trans = current->journal_info;
> > > @@ -764,10 +760,14 @@ static void shrink_delalloc(struct
> > > btrfs_space_info *space_info,
> > >       if (ordered_bytes > delalloc_bytes && !for_preempt)
> > >           wait_ordered = true;
> > > -    loops = 0;
> > > -    while ((delalloc_bytes || ordered_bytes) && loops < 3) {
> > > -        u64 temp = min(delalloc_bytes, to_reclaim) >> PAGE_SHIFT;
> > > -        long nr_pages = min_t(u64, temp, LONG_MAX);
> > > +    spin_lock(&space_info->lock);
> > > +    orig_tickets_id = space_info->tickets_id;
> > > +    spin_unlock(&space_info->lock);
> > > +
> > > +    while ((delalloc_bytes || ordered_bytes) && to_reclaim) {
> > > +        u64 iter_reclaim = min_t(u64, to_reclaim, SZ_128M);
> > > +        long nr_pages = min_t(u64, delalloc_bytes, iter_reclaim) >>
> > > PAGE_SHIFT;
> > > +        u64 items = calc_reclaim_items_nr(fs_info, iter_reclaim) * 2;
> > >           int async_pages;
> > >           btrfs_start_delalloc_roots(fs_info, nr_pages, true);
> > > @@ -811,7 +811,7 @@ static void shrink_delalloc(struct
> > > btrfs_space_info *space_info,
> > >                  atomic_read(&fs_info->async_delalloc_pages) <=
> > >                  async_pages);
> > >   skip_async:
> > > -        loops++;
> > > +        to_reclaim -= iter_reclaim;
> > >           if (wait_ordered && !trans) {
> > >               btrfs_wait_ordered_roots(fs_info, items, NULL);
> > >           } else {
> > > @@ -834,6 +834,15 @@ static void shrink_delalloc(struct
> > > btrfs_space_info *space_info,
> > >               spin_unlock(&space_info->lock);
> > >               break;
> > >           }
> > > +        /*
> > > +         * If a ticket was satisfied since we started, break out
> > > +         * so the async reclaim state machine can process delayed
> > > +         * refs before we flush more delalloc.
> > > +         */
> > > +        if (space_info->tickets_id != orig_tickets_id) {
> > > +            spin_unlock(&space_info->lock);
> > > +            break;
> > > +        }
> > >           spin_unlock(&space_info->lock);
> > >           delalloc_bytes = percpu_counter_sum_positive(
> > 
> > 
> 

  parent reply	other threads:[~2026-04-24 20:12 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-09 17:48 [PATCH v4 0/4] btrfs: improve stalls under sudden writeback Boris Burkov
2026-04-09 17:48 ` [PATCH v4 1/4] btrfs: reserve space for delayed_refs in delalloc Boris Burkov
2026-04-10 16:07   ` Filipe Manana
2026-04-09 17:48 ` [PATCH v4 2/4] btrfs: account for compression in delalloc extent reservation Boris Burkov
2026-04-09 17:48 ` [PATCH v4 3/4] btrfs: make inode->outstanding_extents a u64 Boris Burkov
2026-04-13 18:43   ` David Sterba
2026-04-09 17:48 ` [PATCH v4 4/4] btrfs: cap shrink_delalloc iterations to 128M Boris Burkov
2026-04-24  6:38   ` Qu Wenruo
2026-04-24  9:48     ` Sun YangKai
2026-04-24 10:07     ` Qu Wenruo
2026-04-24 15:26       ` Boris Burkov
2026-04-24 20:11       ` Boris Burkov [this message]
2026-04-24 22:06         ` Qu Wenruo
2026-04-24 22:10           ` Boris Burkov
2026-04-24 22:21             ` Qu Wenruo
2026-04-24 22:23               ` Boris Burkov
2026-04-24 22:59               ` Qu Wenruo
2026-04-13 18:41 ` [PATCH v4 0/4] btrfs: improve stalls under sudden writeback David Sterba

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260424201054.GA2801466@zen.localdomain \
    --to=boris@bur.io \
    --cc=kernel-team@fb.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=quwenruo.btrfs@gmx.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox