From: kerayhuang <huangzjsmile@gmail.com>
To: baoquan.he@linux.dev
Cc: bhe@redhat.com, flyingpeng@tencent.com, huangzjsmile@gmail.com,
kasong@tencent.com, kerayhuang@tencent.com,
albinwyang@tencent.com, linux-mm@kvack.org
Subject: Re: [PATCH] mm/swap: Add cond_resched() in swap_reclaim_full_clusters to prevent softlockup
Date: Wed, 29 Apr 2026 20:49:31 +0800
Message-ID: <20260429124931.452003-1-kerayhuang@tencent.com>
In-Reply-To: <aeuOuj5rHa6hgLga@MiWiFi-R3L-srv>
Hi Baoquan,
Thanks for the review!
> Hi Keray,
>
> On 04/24/26 at 08:37pm, kerayhuang wrote:
> > Add periodic cond_resched() calls during large full_clusters
> > reclaim operations to prevent softlockup issues.
> >
> > Signed-off-by: kerayhuang <kerayhuang@tencent.com>
> > Reviewed-by: Kairui Song <kasong@tencent.com>
> > Reviewed-by: Hao Peng <flyingpeng@tencent.com>
> > ---
> > mm/swapfile.c | 1 +
> > 1 file changed, 1 insertion(+)
>
> Thanks for the patch. The change looks good to me, however there are
> still small concerns.
>
> For patch log, it might be better to provide more details, e.g did you
> observe this issue in a product environment, or just a code exploring?
> If observed in a product environment, what does the backtrace look like
> when softlockup happened?
We hit a real softlockup in an internal stress-test environment.
The workload was LTP memory/swap stress on a large arm64 machine
with 320 CPUs, about 1 TB of memory, and an 8.6 GB swap device.
The system was under heavy load and the swap device had accumulated
a large number of full clusters; the softlockup triggered after
about 3 days of continuous stress.
The backtrace looks like:
PID: 3817773 TASK: ffff0883bb28b780 CPU: 48 COMMAND: "kworker/48:7"
#0 [ffff800080183d10] __crash_kexec at ffffa4c1361e5de4
#1 [ffff800080183d90] panic at ffffa4c1360d5e9c
#2 [ffff800080183e20] watchdog_timer_fn at ffffa4c136231fa8
...
#16 [ffff8000c4ad3cb0] swap_cache_del_folio at ffffa4c1363e1614
#17 [ffff8000c4ad3ce0] __try_to_reclaim_swap at ffffa4c1363e4bfc
#18 [ffff8000c4ad3d40] swap_reclaim_full_clusters at ffffa4c1363e5474
#19 [ffff8000c4ad3da0] swap_reclaim_work at ffffa4c1363e550c
#20 [ffff8000c4ad3dc0] process_one_work at ffffa4c136102edc
#21 [ffff8000c4ad3e10] worker_thread at ffffa4c136103398
#22 [ffff8000c4ad3e70] kthread at ffffa4c13610d95c
From the vmcore analysis, swap_reclaim_work() called
swap_reclaim_full_clusters() with force=true, which set to_scan to
1551 clusters. At the time of the softlockup there were still 1427
full clusters remaining on the full_clusters list.
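For scale (rough arithmetic, assuming 512 slots per cluster and 4 KiB
pages): to_scan = 1551 corresponds to about 1551 * 512 ~= 794k swap
slots, i.e. roughly 3 GB of swap that a single forced pass may walk,
so without yielding the worker can hold the CPU for a very long time.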
I will add these details to the commit log in v2.
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 9174f1eeffb0..74a1e324449d 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1054,6 +1054,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
> > swap_cluster_unlock(ci);
> > if (to_scan <= 0)
> > break;
> > + cond_resched();
>
> Besides, is it a little bit too aggressive to call cond_resched() for
> each cluster reclaiming compared with the old code? Do you consider to
> make it gentle, e.g calling cond_resched() every several clusters, 8, 16
> or other number decided based on your testing performance statistics.
I think calling cond_resched() per cluster is reasonable here because:

1) Each cluster iteration already involves scanning up to 512 slots,
   and each slot reclaim may call __try_to_reclaim_swap(), which does
   non-trivial work (lock/unlock, folio lookup, swap cache deletion,
   and potentially slab freeing). So the work per cluster is already
   substantial.

2) cond_resched() is a lightweight check: it only actually reschedules
   when need_resched() is set, so in the common case it is just a flag
   check with negligible overhead. Calling it once per cluster therefore
   bounds the latency without forcing an actual context switch every
   time. If we called it only every 8 or 16 clusters, the worst-case
   non-preemptible window could still grow quite large on machines with
   many full clusters (see the simplified sketch below).

3) This is a workqueue context (swap_reclaim_work), not a hot fast
   path, so the slight overhead is acceptable.
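A simplified sketch of the latency argument (this is not the actual
mm/swapfile.c loop; reclaim_one_cluster() is a made-up placeholder and
the loop shape is abbreviated, only the cond_resched() placement
matters):

	while (!list_empty(&si->full_clusters)) {
		ci = list_first_entry(&si->full_clusters,
				      struct swap_cluster_info, list);

		reclaim_one_cluster(si, ci);	/* up to 512 slot reclaims */

		if (--to_scan <= 0)
			break;

		/* per cluster: non-preemptible window <= 1 cluster of work */
		cond_resched();

		/*
		 * Batched alternative (every 8 clusters): the worst-case
		 * window grows to ~8 * 512 slot reclaims before yielding.
		 *
		 *	if (!(to_scan % 8))
		 *		cond_resched();
		 */
	}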
Thanks,
Keray