From: Baoquan He <baoquan.he@linux.dev>
To: kerayhuang <huangzjsmile@gmail.com>
Cc: bhe@redhat.com, flyingpeng@tencent.com, kasong@tencent.com,
kerayhuang@tencent.com, albinwyang@tencent.com,
linux-mm@kvack.org
Subject: Re: [PATCH] mm/swap: Add cond_resched() in swap_reclaim_full_clusters to prevent softlockup
Date: Fri, 1 May 2026 10:43:40 +0800
Message-ID: <afQTXPLbNrB3NOqe@MiWiFi-R3L-srv>
In-Reply-To: <20260429124931.452003-1-kerayhuang@tencent.com>
On 04/29/26 at 08:49pm, kerayhuang wrote:
> Hi Baoquan,
> Thanks for the review!
> > Hi Keray,
> >
> > On 04/24/26 at 08:37pm, kerayhuang wrote:
> > > Add periodic cond_resched() calls during large full_clusters
> > > reclaim operations to prevent softlockup issues.
> > >
> > > Signed-off-by: kerayhuang <kerayhuang@tencent.com>
> > > Reviewed-by: Kairui Song <kasong@tencent.com>
> > > Reviewed-by: Hao Peng <flyingpeng@tencent.com>
> > > ---
> > > mm/swapfile.c | 1 +
> > > 1 file changed, 1 insertion(+)
> >
> > Thanks for the patch. The change looks good to me, but I still have a couple
> > of small concerns.
> >
> > For the patch log, it might be better to provide more details, e.g. did you
> > observe this issue in a production environment, or did you just find it by
> > exploring the code? If it was observed in a production environment, what does
> > the backtrace look like when the softlockup happened?
>
> We hit a real softlockup in an internal stress test environment.
> The workload was LTP memory/swap stress on a large arm64 machine,
> with 320 CPUs, about 1TB memory and an 8.6GB swap device.
> The system was under heavy load and the swap device had a large
> number of full clusters. The softlockup was triggered during
> a stress test after about 3 days.
>
> The backtrace looks like:
>
> PID: 3817773 TASK: ffff0883bb28b780 CPU: 48 COMMAND: "kworker/48:7"
> #0 [ffff800080183d10] __crash_kexec at ffffa4c1361e5de4
> #1 [ffff800080183d90] panic at ffffa4c1360d5e9c
> #2 [ffff800080183e20] watchdog_timer_fn at ffffa4c136231fa8
> ...
> #16 [ffff8000c4ad3cb0] swap_cache_del_folio at ffffa4c1363e1614
> #17 [ffff8000c4ad3ce0] __try_to_reclaim_swap at ffffa4c1363e4bfc
> #18 [ffff8000c4ad3d40] swap_reclaim_full_clusters at ffffa4c1363e5474
> #19 [ffff8000c4ad3da0] swap_reclaim_work at ffffa4c1363e550c
> #20 [ffff8000c4ad3dc0] process_one_work at ffffa4c136102edc
> #21 [ffff8000c4ad3e10] worker_thread at ffffa4c136103398
> #22 [ffff8000c4ad3e70] kthread at ffffa4c13610d95c
>
> From the vmcore analysis, swap_reclaim_work() called
> swap_reclaim_full_clusters() with force=true, which sets to_scan to
> 1551 clusters. At the time of the softlockup, there were still 1427
> full clusters remaining in the full_clusters list.
>
> I will add these details to the commit log in v2.
Sounds like great root-cause digging. Adding these details to the patch log
will be very helpful.
By the way, is it worth a Fixes tag?
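Something along these lines, where the hash is just a placeholder (I have not
looked up which commit introduced swap_reclaim_full_clusters(), so please dig
out the real one):

  Fixes: 123456789abc ("mm, swap: <subject of the introducing commit>")

and maybe Cc: <stable@vger.kernel.org> if you think it should be backported.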
>
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index 9174f1eeffb0..74a1e324449d 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -1054,6 +1054,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
> > > swap_cluster_unlock(ci);
> > > if (to_scan <= 0)
> > > break;
> > > + cond_resched();
> >
> > Besides, is it a little bit too aggressive to call cond_resched() for
> > each cluster reclaimed compared with the old code? Would you consider
> > making it gentler, e.g. calling cond_resched() only every several
> > clusters - 8, 16 or another number decided based on your performance
> > testing?
>
> I think calling cond_resched() per cluster is reasonable
> here because:
>
> 1) Each cluster iteration already involves scanning up to 512 slots,
> and each slot reclaim may call __try_to_reclaim_swap() which does
> non-trivial work (lock/unlock, folio lookup, swap cache deletion,
> and potentially slab freeing). So the work per cluster is already
> substantial.
>
> 2) cond_resched() is a lightweight check - it only actually reschedules
> when need_resched() is set, so in the common case it's just a flag
> check with negligible overhead. Therefore calling it once per cluster
> gives a bounded latency without forcing an actual context switch every
> time. If we call it only every 8 or 16 clusters, the worst-case
> non-preemptible window can still become quite large on machines with
> many full clusters.
>
> 3) This is a workqueue context (swap_reclaim_work), not a hot fast
> path, so the slight overhead is acceptable.
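Right, on !CONFIG_PREEMPT kernels the whole call boils down to roughly the
following (heavily simplified from kernel/sched/core.c, so treat it as a
sketch rather than the exact code):

	int __cond_resched(void)
	{
		/* only reschedule if the need_resched flag is set */
		if (should_resched(0)) {
			preempt_schedule_common();
			return 1;
		}
		return 0;
	}

so the per-cluster check itself is indeed cheap.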
OK, that sounds good. My only remaining thought is that under heavy stress it
will yield after every single cluster reclaim, and on a system with a bigger
swap disk it will always need to check whether swap is 50% full and whether it
is already in the workqueue. Anyway, maybe I am overthinking it. Overall this
looks very good to me: good catch, good root-cause digging and a good fix.
Let's see if other people have any concerns.
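Just for completeness, the gentler variant I had in mind was roughly the
following. This is an untested sketch only: next_full_cluster() and
reclaim_one_full_cluster() are made-up helpers standing in for the existing
loop body, and 16 is an arbitrary batch size, not something I measured.

	static void reclaim_full_clusters_batched(struct swap_info_struct *si,
						  long to_scan)
	{
		unsigned int done = 0;
		struct swap_cluster_info *ci;

		while ((ci = next_full_cluster(si)) != NULL) {
			/* existing per-cluster reclaim, drops the cluster lock */
			reclaim_one_full_cluster(si, ci);

			if (--to_scan <= 0)
				break;
			/* yield only once per batch instead of per cluster */
			if (++done % 16 == 0)
				cond_resched();
		}
	}

The trade-off is a longer worst-case non-preemptible window, which is exactly
the point you made above, so I am fine with the per-cluster call.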
Thanks
Baoquan