From: Boris Burkov <boris@bur.io>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH v4 4/4] btrfs: cap shrink_delalloc iterations to 128M
Date: Fri, 24 Apr 2026 15:10:59 -0700 [thread overview]
Message-ID: <20260424221059.GA2970690@zen.localdomain> (raw)
In-Reply-To: <2dd6a177-b6f5-4c15-976b-7897c6d468dc@gmx.com>
On Sat, Apr 25, 2026 at 07:36:49AM +0930, Qu Wenruo wrote:
>
>
> 在 2026/4/25 05:41, Boris Burkov 写道:
> > On Fri, Apr 24, 2026 at 07:37:38PM +0930, Qu Wenruo wrote:
> [...]
> > >
> > > Furthermore, even with this particular patch *reverted*, I'm still seeing
> > > generic/224 hitting the same problem.
> > >
> > > Currently I'm testing at the commit before the whole series, which is
> > > "btrfs: abort transaction in do_remap_reloc_trans() on failure", and no
> > > generic/224 hang nor 100% kworker CPU usage.
> > >
> > > Thus I'm afraid the whole series may be involved.
>
> Sorry, at least on my arm64 machine, the first 3 patches are not the root
> cause.
>
> In fact on v7.0-rc7, I can still hit the generic/224 hang, i.e. the kernel
> detects a 120s timeout for hung processes, and my VM is configured to reset
> after such detection.
>
> I'm going to slightly loosen the hung-task detection timeout (120s -> 150s)
> and check if it's just too slow in this particular case.
>
>
> The last patch is still causing excessive CPU usage here, though, and very
> reliably.
>
> > >
> > > Thanks,
> > > Qu
> > >
> >
> > Now that I have had a good chance to try and repro, here is what I have
> > seen so far on my desktop x86 machine and a cloud arm machine.
> >
> > x86:
> > a41c84ba2f51 ("btrfs: abort transaction in do_remap_reloc_trans() on failure")
> > consistently done in 1 second
> > 8099a837f487 ("btrfs: cap shrink_delalloc iterations to 128M")
> > finishes, but in ~500s
> > ea60045d9b1b ("btrfs: reserve space for delayed_refs in delalloc")
> > finishes, but in ~500s
> >
> > arm:
> > a41c84ba2f51 ("btrfs: abort transaction in do_remap_reloc_trans() on failure")
> > consistently done in ~300 seconds
> > ea60045d9b1b ("btrfs: reserve space for delayed_refs in delalloc")
> > done in ~600s
> >
> > The two inconsistencies are that I didn't see it go fast on g/027 with just
> > the shrink_delalloc iterations patch reverted, and I don't have a 2
> > second baseline on my arm setup.
>
> At least we got something that both of us can reproduce.
>
> Another thing is, for g/027 on arm64 I'm also actively monitoring the CPU
> usage through top.
>
> Have you experienced very high (~100%) CPU usage on a kworker during g/027?
>
No :(
As far as I can tell the system is stuck waiting on a commit. I'll keep
trying to repro your symptom.
I'm curious if it goes away for you with Sun's proposed fix, something
like clamping nr_pages to at least 1 after those two min() operations.
Such a patch had no impact on the behavior of either of my systems,
though.
> That's the most reliable symptom on my arm64 systems, and that's the
> criterion I used to bisect, as it takes less than 5 seconds to determine if
> it's good or not.
>
> >
> > So I agree that this patch series effectively breaks those tests, on x86
> > as well. I didn't notice the change in runtime, unfortunately, as I only
> > looked for success/failure.
> >
> > As to the cause:
> > Both g/027 and g/224 are explicitly testing lots of writes to a small
> > filesystem.
> >
> > I suspect that what is happening is what Filipe warned about:
> > excessive space reclaim/pinning reclaim/etc. choking the workload
> > due to excessive reservation. I have played around with reducing the
> > reservation sizes in various ways (setting it back to 0, setting the
> > level estimate to 4 as a test, etc.) and the result varies from
> > back-to-full-speed to a 60s run. So in my setup, at least, it looks
> > like the performance of g/027 is very sensitive to how much we reserve.
>
> At least to me, the biggest problem is the 100% CPU usage of the kworker,
> which indicates a pretty bad dead looping.
>
> >
> > Would you be willing to let it run for 5-10m to see if you also
> > reproduce this behavior?
>
> Unfortunately it didn't even finish after 15m here.
> Here is the dmesg with timestamps; the call trace is triggered by
> "echo l > /proc/sysrq-trigger".
>
> [ 30.140269] run fstests generic/027 at 2026-04-25 07:19:32
> [ 30.392655] BTRFS: device fsid 85ba0f7c-dfed-4220-9d47-72b07a1c81d8 devid 1 transid 8 /dev/mapper/test-scratch1 (253:2) scanned by mount (1108)
> [ 30.395605] BTRFS info (device dm-2): first mount of filesystem 85ba0f7c-dfed-4220-9d47-72b07a1c81d8
> [ 30.395625] BTRFS info (device dm-2): using crc32c checksum algorithm
> [ 30.398590] BTRFS info (device dm-2): checking UUID tree
> [ 30.398734] BTRFS info (device dm-2): turning on async discard
> [ 30.398737] BTRFS info (device dm-2): enabling free space tree
> [ 33.294754] systemd-journald[360]: Time jumped backwards, rotating.
> [ 993.736548] sysrq: Show backtrace of all active CPUs
> [ 993.736581] NMI backtrace for cpu 0
> [ 993.736608] CPU: 0 UID: 0 PID: 2410 Comm: bash Not tainted 7.0.0-rc7-custom-64k+ #10 PREEMPT(full)
> [ 993.736613] Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
> [ 993.736616] Call trace:
> [ 993.736618] show_stack+0x20/0x38 (C)
> [ 993.736635] dump_stack_lvl+0x60/0x80
> [ 993.736646] dump_stack+0x18/0x24
> [ 993.736649] nmi_cpu_backtrace+0xf0/0x128
> [ 993.736665] nmi_trigger_cpumask_backtrace+0x1c4/0x1f8
> [ 993.736668] arch_trigger_cpumask_backtrace+0x20/0x40
> [ 993.736675] sysrq_handle_showallcpus+0x24/0x38
> [ 993.736686] __handle_sysrq+0x9c/0x1b8
> [ 993.736689] write_sysrq_trigger+0xcc/0x100
> [ 993.736692] proc_reg_write+0x7c/0xf0
> [ 993.736701] vfs_write+0xd8/0x3a8
> [ 993.736716] ksys_write+0x70/0x120
> [ 993.736719] __arm64_sys_write+0x20/0x40
> [ 993.736722] invoke_syscall.constprop.0+0x64/0xe8
> [ 993.736726] el0_svc_common.constprop.0+0x40/0xe8
> [ 993.736728] do_el0_svc+0x24/0x38
> [ 993.736730] el0_svc+0x3c/0x198
> [ 993.736733] el0t_64_sync_handler+0xa0/0xe8
> [ 993.736735] el0t_64_sync+0x198/0x1a0
> [ 993.736755] Sending NMI from CPU 0 to CPUs 1-7:
> [ 993.736769] NMI backtrace for cpu 3
> [ 993.736777] CPU: 3 UID: 0 PID: 212 Comm: kworker/u38:2 Not tainted 7.0.0-rc7-custom-64k+ #10 PREEMPT(full)
> [ 993.736780] Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
> [ 993.736782] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
> [ 993.736879] pstate: 63400005 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
> [ 993.736882] pc : _raw_spin_unlock_irqrestore+0x10/0x60
> [ 993.736899] lr : __percpu_counter_sum+0x94/0xc0
> [ 993.736909] sp : ffff8000834cfc10
> [ 993.736910] x29: ffff8000834cfc10 x28: 0000000000000400 x27: 0000000008000000
> [ 993.736912] x26: ffff0000ccfef81c x25: 0000000000000000 x24: ffffb6b45621ef98
> [ 993.736914] x23: ffff0000d2e48698 x22: ffffb6b45621a000 x21: ffffb6b456219080
> [ 993.736916] x20: ffffb6b456219288 x19: 0000000000009000 x18: ffff494da90b0000
> [ 993.736917] x17: 0000000000000000 x16: ffffb6b45524e920 x15: ffffb6b45621ef98
> [ 993.736919] x14: ffffb6b456111740 x13: 0000000000000180 x12: ffff0001ff1c1740
> [ 993.736921] x11: 00000000000000c0 x10: 4eb904daffc7d416 x9 : ffffb6b45524e9b4
> [ 993.736922] x8 : ffff8000834cfab0 x7 : 0000000000000000 x6 : ffffffffffffffff
> [ 993.736924] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000008
> [ 993.736926] x2 : 0000000000000008 x1 : 0000000000000000 x0 : ffff0000d2e48698
> [ 993.736928] Call trace:
> [ 993.736929] _raw_spin_unlock_irqrestore+0x10/0x60 (P)
> [ 993.736932] flush_space+0x45c/0x6b0 [btrfs]
> [ 993.737001] do_async_reclaim_metadata_space+0x88/0x1d8 [btrfs]
> [ 993.737064] btrfs_async_reclaim_metadata_space+0x50/0x80 [btrfs]
> [ 993.737126] process_one_work+0x174/0x540
> [ 993.737138] worker_thread+0x1a0/0x318
> [ 993.737140] kthread+0x140/0x158
> [ 993.737145] ret_from_fork+0x10/0x20
> [ 993.737156] NMI backtrace for cpu 4
>
> >
> > I will try to instrument the reservations and reclaim codepaths and see
> > if I can think of a nice fix to reserve "enough but not too much".
> >
> > I can also try to attack the "stuck big fs under big reclaim" more
> > directly by trying to make reclaim less stuck-prone, rather than messing
> > with reservations. Though it would be quite disappointing if we
> > practically cannot make the reservation choices more accurate.
>
> Totally understandable, ENOSPC in btrfs is always the biggest challenge, and
> the trade-offs are always hard to balance.
>
> Meanwhile I'd prefer to have the last commit reverted so that we can
> continue our regular testing.
>
> Thanks,
> Qu
>
Totally agreed. I'm going to revert the whole series till I understand
what happened here.
Thanks,
Boris
> >
> > Thanks,
> > Boris
> >
> > > >
> > > > Do you have any clue on what's going wrong? I guess it's pretty hard to
> > > > hit on x86_64.
> > > >
> > > > I have a local btrfs branch with huge folios support, with that it's
> > > > pretty easy to hit similar problems on x86_64, but without that branch,
> > > > no hit is observed so far on x86_64.
> > > >
> > > > Thanks,
> > > > Qu
> > > >
> > > > > ---
> > > > >  fs/btrfs/space-info.c | 31 ++++++++++++++++++++-----------
> > > > >  1 file changed, 20 insertions(+), 11 deletions(-)
> > > > >
> > > > > diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> > > > > index f0436eea1544..e931deb3d013 100644
> > > > > --- a/fs/btrfs/space-info.c
> > > > > +++ b/fs/btrfs/space-info.c
> > > > > @@ -725,9 +725,8 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
> > > > >  	struct btrfs_trans_handle *trans;
> > > > >  	u64 delalloc_bytes;
> > > > >  	u64 ordered_bytes;
> > > > > -	u64 items;
> > > > >  	long time_left;
> > > > > -	int loops;
> > > > > +	u64 orig_tickets_id;
> > > > >
> > > > >  	delalloc_bytes = percpu_counter_sum_positive(&fs_info->delalloc_bytes);
> > > > >  	ordered_bytes = percpu_counter_sum_positive(&fs_info->ordered_bytes);
> > > > > @@ -735,9 +734,7 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
> > > > >  		return;
> > > > >
> > > > >  	/* Calc the number of the pages we need flush for space reservation */
> > > > > -	if (to_reclaim == U64_MAX) {
> > > > > -		items = U64_MAX;
> > > > > -	} else {
> > > > > +	if (to_reclaim != U64_MAX) {
> > > > >  		/*
> > > > >  		 * to_reclaim is set to however much metadata we need to
> > > > >  		 * reclaim, but reclaiming that much data doesn't really track
> > > > > @@ -751,7 +748,6 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
> > > > >  		 * aggressive.
> > > > >  		 */
> > > > >  		to_reclaim = max(to_reclaim, delalloc_bytes >> 3);
> > > > > -		items = calc_reclaim_items_nr(fs_info, to_reclaim) * 2;
> > > > >  	}
> > > > >
> > > > >  	trans = current->journal_info;
> > > > > @@ -764,10 +760,14 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
> > > > >  	if (ordered_bytes > delalloc_bytes && !for_preempt)
> > > > >  		wait_ordered = true;
> > > > >
> > > > > -	loops = 0;
> > > > > -	while ((delalloc_bytes || ordered_bytes) && loops < 3) {
> > > > > -		u64 temp = min(delalloc_bytes, to_reclaim) >> PAGE_SHIFT;
> > > > > -		long nr_pages = min_t(u64, temp, LONG_MAX);
> > > > > +	spin_lock(&space_info->lock);
> > > > > +	orig_tickets_id = space_info->tickets_id;
> > > > > +	spin_unlock(&space_info->lock);
> > > > > +
> > > > > +	while ((delalloc_bytes || ordered_bytes) && to_reclaim) {
> > > > > +		u64 iter_reclaim = min_t(u64, to_reclaim, SZ_128M);
> > > > > +		long nr_pages = min_t(u64, delalloc_bytes, iter_reclaim) >> PAGE_SHIFT;
> > > > > +		u64 items = calc_reclaim_items_nr(fs_info, iter_reclaim) * 2;
> > > > >  		int async_pages;
> > > > >
> > > > >  		btrfs_start_delalloc_roots(fs_info, nr_pages, true);
> > > > > @@ -811,7 +811,7 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
> > > > >  			   atomic_read(&fs_info->async_delalloc_pages) <= async_pages);
> > > > >  skip_async:
> > > > > -		loops++;
> > > > > +		to_reclaim -= iter_reclaim;
> > > > >  		if (wait_ordered && !trans) {
> > > > >  			btrfs_wait_ordered_roots(fs_info, items, NULL);
> > > > >  		} else {
> > > > > @@ -834,6 +834,15 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
> > > > >  			spin_unlock(&space_info->lock);
> > > > >  			break;
> > > > >  		}
> > > > > +		/*
> > > > > +		 * If a ticket was satisfied since we started, break out
> > > > > +		 * so the async reclaim state machine can process delayed
> > > > > +		 * refs before we flush more delalloc.
> > > > > +		 */
> > > > > +		if (space_info->tickets_id != orig_tickets_id) {
> > > > > +			spin_unlock(&space_info->lock);
> > > > > +			break;
> > > > > +		}
> > > > >  		spin_unlock(&space_info->lock);
> > > > >
> > > > >  		delalloc_bytes = percpu_counter_sum_positive(
> > > >
> > > >
> > >
>
Thread overview: 18+ messages
2026-04-09 17:48 [PATCH v4 0/4] btrfs: improve stalls under sudden writeback Boris Burkov
2026-04-09 17:48 ` [PATCH v4 1/4] btrfs: reserve space for delayed_refs in delalloc Boris Burkov
2026-04-10 16:07 ` Filipe Manana
2026-04-09 17:48 ` [PATCH v4 2/4] btrfs: account for compression in delalloc extent reservation Boris Burkov
2026-04-09 17:48 ` [PATCH v4 3/4] btrfs: make inode->outstanding_extents a u64 Boris Burkov
2026-04-13 18:43 ` David Sterba
2026-04-09 17:48 ` [PATCH v4 4/4] btrfs: cap shrink_delalloc iterations to 128M Boris Burkov
2026-04-24 6:38 ` Qu Wenruo
2026-04-24 9:48 ` Sun YangKai
2026-04-24 10:07 ` Qu Wenruo
2026-04-24 15:26 ` Boris Burkov
2026-04-24 20:11 ` Boris Burkov
2026-04-24 22:06 ` Qu Wenruo
2026-04-24 22:10 ` Boris Burkov [this message]
2026-04-24 22:21 ` Qu Wenruo
2026-04-24 22:23 ` Boris Burkov
2026-04-24 22:59 ` Qu Wenruo
2026-04-13 18:41 ` [PATCH v4 0/4] btrfs: improve stalls under sudden writeback David Sterba