* [PATCH] mm/swap: Add cond_resched() in swap_reclaim_full_clusters to prevent softlockup @ 2026-04-24 12:37 kerayhuang 2026-04-24 15:39 ` Baoquan He 0 siblings, 1 reply; 4+ messages in thread From: kerayhuang @ 2026-04-24 12:37 UTC (permalink / raw) To: kasong, bhe; +Cc: linux-mm, kerayhuang, Hao Peng Add periodic cond_resched() calls during large full_clusters reclaim operations to prevent softlockup issues. Signed-off-by: kerayhuang <kerayhuang@tencent.com> Reviewed-by: Kairui Song <kasong@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> --- mm/swapfile.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/swapfile.c b/mm/swapfile.c index 9174f1eeffb0..74a1e324449d 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1054,6 +1054,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force) swap_cluster_unlock(ci); if (to_scan <= 0) break; + cond_resched(); } } -- 2.43.5 ^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [PATCH] mm/swap: Add cond_resched() in swap_reclaim_full_clusters to prevent softlockup 2026-04-24 12:37 [PATCH] mm/swap: Add cond_resched() in swap_reclaim_full_clusters to prevent softlockup kerayhuang @ 2026-04-24 15:39 ` Baoquan He 2026-04-29 12:49 ` kerayhuang 0 siblings, 1 reply; 4+ messages in thread From: Baoquan He @ 2026-04-24 15:39 UTC (permalink / raw) To: kerayhuang; +Cc: kasong, bhe, linux-mm, kerayhuang, Hao Peng Hi Keray, On 04/24/26 at 08:37pm, kerayhuang wrote: > Add periodic cond_resched() calls during large full_clusters > reclaim operations to prevent softlockup issues. > > Signed-off-by: kerayhuang <kerayhuang@tencent.com> > Reviewed-by: Kairui Song <kasong@tencent.com> > Reviewed-by: Hao Peng <flyingpeng@tencent.com> > --- > mm/swapfile.c | 1 + > 1 file changed, 1 insertion(+) Thanks for the patch. The change looks good to me, however there are still small concerns. For patch log, it might be better to provide more details, e.g did you observe this issue in a product environment, or just a code exploring? If observed in a product environment, what does the backtrace look like when softlockup happened? > > diff --git a/mm/swapfile.c b/mm/swapfile.c > index 9174f1eeffb0..74a1e324449d 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -1054,6 +1054,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force) > swap_cluster_unlock(ci); > if (to_scan <= 0) > break; > + cond_resched(); Besides, is it a little bit too aggressive to call cond_resched() for each cluster reclaiming compared with the old code? Do you consider to make it gentle, e.g calling cond_resched() every several clusters, 8, 16 or other number decided based on your testing performance statistics. Thanks Baoquan ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH] mm/swap: Add cond_resched() in swap_reclaim_full_clusters to prevent softlockup 2026-04-24 15:39 ` Baoquan He @ 2026-04-29 12:49 ` kerayhuang 2026-05-01 2:43 ` Baoquan He 0 siblings, 1 reply; 4+ messages in thread From: kerayhuang @ 2026-04-29 12:49 UTC (permalink / raw) To: baoquan.he Cc: bhe, flyingpeng, huangzjsmile, kasong, kerayhuang, albinwyang, linux-mm Hi Baoquan, Thanks for the review! > Hi Keray, > > On 04/24/26 at 08:37pm, kerayhuang wrote: > > Add periodic cond_resched() calls during large full_clusters > > reclaim operations to prevent softlockup issues. > > > > Signed-off-by: kerayhuang <kerayhuang@tencent.com> > > Reviewed-by: Kairui Song <kasong@tencent.com> > > Reviewed-by: Hao Peng <flyingpeng@tencent.com> > > --- > > mm/swapfile.c | 1 + > > 1 file changed, 1 insertion(+) > > Thanks for the patch. The change looks good to me, however there are > still small concerns. > > For patch log, it might be better to provide more details, e.g did you > observe this issue in a product environment, or just a code exploring? > If observed in a product environment, what does the backtrace look like > when softlockup happened? We hit a real softlockup in an internal stress test environment. The workload was LTP memory/swap stress on a large arm64 machine, with 320 CPUs, about 1TB memory and an 8.6GB swap device. The system was under heavy load and the swap device had a large number of full clusters. The softlockup was triggered during a stress test after about 3 days. The backtrace looks like: PID: 3817773 TASK: ffff0883bb28b780 CPU: 48 COMMAND: "kworker/48:7" #0 [ffff800080183d10] __crash_kexec at ffffa4c1361e5de4 #1 [ffff800080183d90] panic at ffffa4c1360d5e9c #2 [ffff800080183e20] watchdog_timer_fn at ffffa4c136231fa8 ... #16 [ffff8000c4ad3cb0] swap_cache_del_folio at ffffa4c1363e1614 #17 [ffff8000c4ad3ce0] __try_to_reclaim_swap at ffffa4c1363e4bfc #18 [ffff8000c4ad3d40] swap_reclaim_full_clusters at ffffa4c1363e5474 #19 [ffff8000c4ad3da0] swap_reclaim_work at ffffa4c1363e550c #20 [ffff8000c4ad3dc0] process_one_work at ffffa4c136102edc #21 [ffff8000c4ad3e10] worker_thread at ffffa4c136103398 #22 [ffff8000c4ad3e70] kthread at ffffa4c13610d95c From the vmcore analysis, swap_reclaim_work() called swap_reclaim_full_clusters() with force=true, which sets to_scan to 1551 clusters. At the time of the softlockup, there were still 1427 full clusters remaining in the full_clusters list. I will add these details to the commit log in v2. > > diff --git a/mm/swapfile.c b/mm/swapfile.c > > index 9174f1eeffb0..74a1e324449d 100644 > > --- a/mm/swapfile.c > > +++ b/mm/swapfile.c > > @@ -1054,6 +1054,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force) > > swap_cluster_unlock(ci); > > if (to_scan <= 0) > > break; > > + cond_resched(); > > Besides, is it a little bit too aggressive to call cond_resched() for > each cluster reclaiming compared with the old code? Do you consider to > make it gentle, e.g calling cond_resched() every several clusters, 8, 16 > or other number decided based on your testing performance statistics. I think calling cond_resched() per cluster is reasonable here because: 1) Each cluster iteration already involves scanning up to 512 slots, and each slot reclaim may call __try_to_reclaim_swap() which does non-trivial work (lock/unlock, folio lookup, swap cache deletion, and potentially slab freeing). So the work per cluster is already substantial. 2) cond_resched() is a lightweight check - it only actually reschedules when need_resched() is set, so in the common case it's just a flag check with negligible overhead. Therefore calling it once per cluster gives a bounded latency without forcing an actual context switch every time. If we call it only every 8 or 16 clusters, the worst-case non-preemptible window can still become quite large on machines with many full clusters. 2) This is a workqueue context (swap_reclaim_work), not a hot fast path, so the slight overhead is acceptable. Thanks, Keray ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH] mm/swap: Add cond_resched() in swap_reclaim_full_clusters to prevent softlockup 2026-04-29 12:49 ` kerayhuang @ 2026-05-01 2:43 ` Baoquan He 0 siblings, 0 replies; 4+ messages in thread From: Baoquan He @ 2026-05-01 2:43 UTC (permalink / raw) To: kerayhuang; +Cc: bhe, flyingpeng, kasong, kerayhuang, albinwyang, linux-mm On 04/29/26 at 08:49pm, kerayhuang wrote: > Hi Baoquan, > Thanks for the review! > > Hi Keray, > > > > On 04/24/26 at 08:37pm, kerayhuang wrote: > > > Add periodic cond_resched() calls during large full_clusters > > > reclaim operations to prevent softlockup issues. > > > > > > Signed-off-by: kerayhuang <kerayhuang@tencent.com> > > > Reviewed-by: Kairui Song <kasong@tencent.com> > > > Reviewed-by: Hao Peng <flyingpeng@tencent.com> > > > --- > > > mm/swapfile.c | 1 + > > > 1 file changed, 1 insertion(+) > > > > Thanks for the patch. The change looks good to me, however there are > > still small concerns. > > > > For patch log, it might be better to provide more details, e.g did you > > observe this issue in a product environment, or just a code exploring? > > If observed in a product environment, what does the backtrace look like > > when softlockup happened? > > We hit a real softlockup in an internal stress test environment. > The workload was LTP memory/swap stress on a large arm64 machine, > with 320 CPUs, about 1TB memory and an 8.6GB swap device. > The system was under heavy load and the swap device had a large > number of full clusters. The softlockup was triggered during > a stress test after about 3 days. > > The backtrace looks like: > > PID: 3817773 TASK: ffff0883bb28b780 CPU: 48 COMMAND: "kworker/48:7" > #0 [ffff800080183d10] __crash_kexec at ffffa4c1361e5de4 > #1 [ffff800080183d90] panic at ffffa4c1360d5e9c > #2 [ffff800080183e20] watchdog_timer_fn at ffffa4c136231fa8 > ... > #16 [ffff8000c4ad3cb0] swap_cache_del_folio at ffffa4c1363e1614 > #17 [ffff8000c4ad3ce0] __try_to_reclaim_swap at ffffa4c1363e4bfc > #18 [ffff8000c4ad3d40] swap_reclaim_full_clusters at ffffa4c1363e5474 > #19 [ffff8000c4ad3da0] swap_reclaim_work at ffffa4c1363e550c > #20 [ffff8000c4ad3dc0] process_one_work at ffffa4c136102edc > #21 [ffff8000c4ad3e10] worker_thread at ffffa4c136103398 > #22 [ffff8000c4ad3e70] kthread at ffffa4c13610d95c > > From the vmcore analysis, swap_reclaim_work() called > swap_reclaim_full_clusters() with force=true, which sets to_scan to 1551 clusters. > At the time of the softlockup, there were still 1427 full clusters remaining in the > full_clusters list. > > I will add these details to the commit log in v2. Sounds like a very great root cause digging. Adding these into patch log will be very helpful. By the way, is it worth a Fixes tag? > > > > diff --git a/mm/swapfile.c b/mm/swapfile.c > > > index 9174f1eeffb0..74a1e324449d 100644 > > > --- a/mm/swapfile.c > > > +++ b/mm/swapfile.c > > > @@ -1054,6 +1054,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force) > > > swap_cluster_unlock(ci); > > > if (to_scan <= 0) > > > break; > > > + cond_resched(); > > > > Besides, is it a little bit too aggressive to call cond_resched() for > > each cluster reclaiming compared with the old code? Do you consider to > > make it gentle, e.g calling cond_resched() every several clusters, 8, 16 > > or other number decided based on your testing performance statistics. > > I think calling cond_resched() per cluster is reasonable > here because: > > 1) Each cluster iteration already involves scanning up to 512 slots, > and each slot reclaim may call __try_to_reclaim_swap() which does > non-trivial work (lock/unlock, folio lookup, swap cache deletion, > and potentially slab freeing). So the work per cluster is already > substantial. > > 2) cond_resched() is a lightweight check - it only actually reschedules > when need_resched() is set, so in the common case it's just a flag > check with negligible overhead. Therefore calling it once per cluster > gives a bounded latency without forcing an actual context switch every > time. If we call it only every 8 or 16 clusters, the worst-case > non-preemptible window can still become quite large on machines with > many full clusters. > > 2) This is a workqueue context (swap_reclaim_work), not a hot fast > path, so the slight overhead is acceptable. OK, that sounds good. Just when system is under heavy stress, it could yield after each cluster reclaiming. Imagine a system with a bigger swap disk, it will alwasy need to check if swap is 50% full and if it's in workqueue. Anyway, maybe I am overthinking. The overrall looks very great to me, good catch, good root cause digging and good fix. Let's see if other people have any concern. Thanks Baoquan ^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2026-05-01 2:43 UTC | newest] Thread overview: 4+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-04-24 12:37 [PATCH] mm/swap: Add cond_resched() in swap_reclaim_full_clusters to prevent softlockup kerayhuang 2026-04-24 15:39 ` Baoquan He 2026-04-29 12:49 ` kerayhuang 2026-05-01 2:43 ` Baoquan He
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox