* [PATCH] mm/swap: Add cond_resched() in swap_reclaim_full_clusters to prevent softlockup
@ 2026-04-24 12:37 kerayhuang
2026-04-24 15:39 ` Baoquan He
0 siblings, 1 reply; 4+ messages in thread
From: kerayhuang @ 2026-04-24 12:37 UTC (permalink / raw)
To: kasong, bhe; +Cc: linux-mm, kerayhuang, Hao Peng
Add periodic cond_resched() calls during large full_clusters
reclaim operations to prevent softlockup issues.
Signed-off-by: kerayhuang <kerayhuang@tencent.com>
Reviewed-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
---
mm/swapfile.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 9174f1eeffb0..74a1e324449d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1054,6 +1054,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
swap_cluster_unlock(ci);
if (to_scan <= 0)
break;
+ cond_resched();
}
}
--
2.43.5
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [PATCH] mm/swap: Add cond_resched() in swap_reclaim_full_clusters to prevent softlockup
2026-04-24 12:37 [PATCH] mm/swap: Add cond_resched() in swap_reclaim_full_clusters to prevent softlockup kerayhuang
@ 2026-04-24 15:39 ` Baoquan He
2026-04-29 12:49 ` kerayhuang
0 siblings, 1 reply; 4+ messages in thread
From: Baoquan He @ 2026-04-24 15:39 UTC (permalink / raw)
To: kerayhuang; +Cc: kasong, bhe, linux-mm, kerayhuang, Hao Peng
Hi Keray,
On 04/24/26 at 08:37pm, kerayhuang wrote:
> Add periodic cond_resched() calls during large full_clusters
> reclaim operations to prevent softlockup issues.
>
> Signed-off-by: kerayhuang <kerayhuang@tencent.com>
> Reviewed-by: Kairui Song <kasong@tencent.com>
> Reviewed-by: Hao Peng <flyingpeng@tencent.com>
> ---
> mm/swapfile.c | 1 +
> 1 file changed, 1 insertion(+)
Thanks for the patch. The change looks good to me, however there are
still small concerns.
For patch log, it might be better to provide more details, e.g did you
observe this issue in a product environment, or just a code exploring?
If observed in a product environment, what does the backtrace look like
when softlockup happened?
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 9174f1eeffb0..74a1e324449d 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1054,6 +1054,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
> swap_cluster_unlock(ci);
> if (to_scan <= 0)
> break;
> + cond_resched();
Besides, is it a little bit too aggressive to call cond_resched() for
each cluster reclaiming compared with the old code? Do you consider to
make it gentle, e.g calling cond_resched() every several clusters, 8, 16
or other number decided based on your testing performance statistics.
Thanks
Baoquan
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH] mm/swap: Add cond_resched() in swap_reclaim_full_clusters to prevent softlockup
2026-04-24 15:39 ` Baoquan He
@ 2026-04-29 12:49 ` kerayhuang
2026-05-01 2:43 ` Baoquan He
0 siblings, 1 reply; 4+ messages in thread
From: kerayhuang @ 2026-04-29 12:49 UTC (permalink / raw)
To: baoquan.he
Cc: bhe, flyingpeng, huangzjsmile, kasong, kerayhuang, albinwyang,
linux-mm
Hi Baoquan,
Thanks for the review!
> Hi Keray,
>
> On 04/24/26 at 08:37pm, kerayhuang wrote:
> > Add periodic cond_resched() calls during large full_clusters
> > reclaim operations to prevent softlockup issues.
> >
> > Signed-off-by: kerayhuang <kerayhuang@tencent.com>
> > Reviewed-by: Kairui Song <kasong@tencent.com>
> > Reviewed-by: Hao Peng <flyingpeng@tencent.com>
> > ---
> > mm/swapfile.c | 1 +
> > 1 file changed, 1 insertion(+)
>
> Thanks for the patch. The change looks good to me, however there are
> still small concerns.
>
> For patch log, it might be better to provide more details, e.g did you
> observe this issue in a product environment, or just a code exploring?
> If observed in a product environment, what does the backtrace look like
> when softlockup happened?
We hit a real softlockup in an internal stress test environment.
The workload was LTP memory/swap stress on a large arm64 machine,
with 320 CPUs, about 1TB memory and an 8.6GB swap device.
The system was under heavy load and the swap device had a large
number of full clusters. The softlockup was triggered during
a stress test after about 3 days.
The backtrace looks like:
PID: 3817773 TASK: ffff0883bb28b780 CPU: 48 COMMAND: "kworker/48:7"
#0 [ffff800080183d10] __crash_kexec at ffffa4c1361e5de4
#1 [ffff800080183d90] panic at ffffa4c1360d5e9c
#2 [ffff800080183e20] watchdog_timer_fn at ffffa4c136231fa8
...
#16 [ffff8000c4ad3cb0] swap_cache_del_folio at ffffa4c1363e1614
#17 [ffff8000c4ad3ce0] __try_to_reclaim_swap at ffffa4c1363e4bfc
#18 [ffff8000c4ad3d40] swap_reclaim_full_clusters at ffffa4c1363e5474
#19 [ffff8000c4ad3da0] swap_reclaim_work at ffffa4c1363e550c
#20 [ffff8000c4ad3dc0] process_one_work at ffffa4c136102edc
#21 [ffff8000c4ad3e10] worker_thread at ffffa4c136103398
#22 [ffff8000c4ad3e70] kthread at ffffa4c13610d95c
From the vmcore analysis, swap_reclaim_work() called
swap_reclaim_full_clusters() with force=true, which sets to_scan to 1551 clusters.
At the time of the softlockup, there were still 1427 full clusters remaining in the
full_clusters list.
I will add these details to the commit log in v2.
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 9174f1eeffb0..74a1e324449d 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1054,6 +1054,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
> > swap_cluster_unlock(ci);
> > if (to_scan <= 0)
> > break;
> > + cond_resched();
>
> Besides, is it a little bit too aggressive to call cond_resched() for
> each cluster reclaiming compared with the old code? Do you consider to
> make it gentle, e.g calling cond_resched() every several clusters, 8, 16
> or other number decided based on your testing performance statistics.
I think calling cond_resched() per cluster is reasonable
here because:
1) Each cluster iteration already involves scanning up to 512 slots,
and each slot reclaim may call __try_to_reclaim_swap() which does
non-trivial work (lock/unlock, folio lookup, swap cache deletion,
and potentially slab freeing). So the work per cluster is already
substantial.
2) cond_resched() is a lightweight check - it only actually reschedules
when need_resched() is set, so in the common case it's just a flag
check with negligible overhead. Therefore calling it once per cluster
gives a bounded latency without forcing an actual context switch every
time. If we call it only every 8 or 16 clusters, the worst-case
non-preemptible window can still become quite large on machines with
many full clusters.
2) This is a workqueue context (swap_reclaim_work), not a hot fast
path, so the slight overhead is acceptable.
Thanks,
Keray
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH] mm/swap: Add cond_resched() in swap_reclaim_full_clusters to prevent softlockup
2026-04-29 12:49 ` kerayhuang
@ 2026-05-01 2:43 ` Baoquan He
0 siblings, 0 replies; 4+ messages in thread
From: Baoquan He @ 2026-05-01 2:43 UTC (permalink / raw)
To: kerayhuang; +Cc: bhe, flyingpeng, kasong, kerayhuang, albinwyang, linux-mm
On 04/29/26 at 08:49pm, kerayhuang wrote:
> Hi Baoquan,
> Thanks for the review!
> > Hi Keray,
> >
> > On 04/24/26 at 08:37pm, kerayhuang wrote:
> > > Add periodic cond_resched() calls during large full_clusters
> > > reclaim operations to prevent softlockup issues.
> > >
> > > Signed-off-by: kerayhuang <kerayhuang@tencent.com>
> > > Reviewed-by: Kairui Song <kasong@tencent.com>
> > > Reviewed-by: Hao Peng <flyingpeng@tencent.com>
> > > ---
> > > mm/swapfile.c | 1 +
> > > 1 file changed, 1 insertion(+)
> >
> > Thanks for the patch. The change looks good to me, however there are
> > still small concerns.
> >
> > For patch log, it might be better to provide more details, e.g did you
> > observe this issue in a product environment, or just a code exploring?
> > If observed in a product environment, what does the backtrace look like
> > when softlockup happened?
>
> We hit a real softlockup in an internal stress test environment.
> The workload was LTP memory/swap stress on a large arm64 machine,
> with 320 CPUs, about 1TB memory and an 8.6GB swap device.
> The system was under heavy load and the swap device had a large
> number of full clusters. The softlockup was triggered during
> a stress test after about 3 days.
>
> The backtrace looks like:
>
> PID: 3817773 TASK: ffff0883bb28b780 CPU: 48 COMMAND: "kworker/48:7"
> #0 [ffff800080183d10] __crash_kexec at ffffa4c1361e5de4
> #1 [ffff800080183d90] panic at ffffa4c1360d5e9c
> #2 [ffff800080183e20] watchdog_timer_fn at ffffa4c136231fa8
> ...
> #16 [ffff8000c4ad3cb0] swap_cache_del_folio at ffffa4c1363e1614
> #17 [ffff8000c4ad3ce0] __try_to_reclaim_swap at ffffa4c1363e4bfc
> #18 [ffff8000c4ad3d40] swap_reclaim_full_clusters at ffffa4c1363e5474
> #19 [ffff8000c4ad3da0] swap_reclaim_work at ffffa4c1363e550c
> #20 [ffff8000c4ad3dc0] process_one_work at ffffa4c136102edc
> #21 [ffff8000c4ad3e10] worker_thread at ffffa4c136103398
> #22 [ffff8000c4ad3e70] kthread at ffffa4c13610d95c
>
> From the vmcore analysis, swap_reclaim_work() called
> swap_reclaim_full_clusters() with force=true, which sets to_scan to 1551 clusters.
> At the time of the softlockup, there were still 1427 full clusters remaining in the
> full_clusters list.
>
> I will add these details to the commit log in v2.
Sounds like a very great root cause digging. Adding these into patch log
will be very helpful.
By the way, is it worth a Fixes tag?
>
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index 9174f1eeffb0..74a1e324449d 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -1054,6 +1054,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
> > > swap_cluster_unlock(ci);
> > > if (to_scan <= 0)
> > > break;
> > > + cond_resched();
> >
> > Besides, is it a little bit too aggressive to call cond_resched() for
> > each cluster reclaiming compared with the old code? Do you consider to
> > make it gentle, e.g calling cond_resched() every several clusters, 8, 16
> > or other number decided based on your testing performance statistics.
>
> I think calling cond_resched() per cluster is reasonable
> here because:
>
> 1) Each cluster iteration already involves scanning up to 512 slots,
> and each slot reclaim may call __try_to_reclaim_swap() which does
> non-trivial work (lock/unlock, folio lookup, swap cache deletion,
> and potentially slab freeing). So the work per cluster is already
> substantial.
>
> 2) cond_resched() is a lightweight check - it only actually reschedules
> when need_resched() is set, so in the common case it's just a flag
> check with negligible overhead. Therefore calling it once per cluster
> gives a bounded latency without forcing an actual context switch every
> time. If we call it only every 8 or 16 clusters, the worst-case
> non-preemptible window can still become quite large on machines with
> many full clusters.
>
> 2) This is a workqueue context (swap_reclaim_work), not a hot fast
> path, so the slight overhead is acceptable.
OK, that sounds good. Just when system is under heavy stress, it could
yield after each cluster reclaiming. Imagine a system with a bigger swap
disk, it will alwasy need to check if swap is 50% full and if it's in
workqueue. Anyway, maybe I am overthinking. The overrall looks very
great to me, good catch, good root cause digging and good fix. Let's see
if other people have any concern.
Thanks
Baoquan
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2026-05-01 2:43 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-24 12:37 [PATCH] mm/swap: Add cond_resched() in swap_reclaim_full_clusters to prevent softlockup kerayhuang
2026-04-24 15:39 ` Baoquan He
2026-04-29 12:49 ` kerayhuang
2026-05-01 2:43 ` Baoquan He
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox