[PATCH] mm/swap: Add cond_resched() in swap_reclaim_full

public inbox for linux-mm@kvack.org
 help / color / mirror / Atom feed

* [PATCH] mm/swap: Add cond_resched() in swap_reclaim_full_clusters to prevent softlockup
@ 2026-04-24 12:37 kerayhuang
  2026-04-24 15:39 ` Baoquan He
  0 siblings, 1 reply; 4+ messages in thread
From: kerayhuang @ 2026-04-24 12:37 UTC (permalink / raw)
  To: kasong, bhe; +Cc: linux-mm, kerayhuang, Hao Peng

Add periodic cond_resched() calls during large full_clusters
reclaim operations to prevent softlockup issues.

Signed-off-by: kerayhuang <kerayhuang@tencent.com>
Reviewed-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
---
 mm/swapfile.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 9174f1eeffb0..74a1e324449d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1054,6 +1054,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
 		swap_cluster_unlock(ci);
 		if (to_scan <= 0)
 			break;
+		cond_resched();
 	}
 }
 
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] mm/swap: Add cond_resched() in swap_reclaim_full_clusters to prevent softlockup
  2026-04-24 12:37 [PATCH] mm/swap: Add cond_resched() in swap_reclaim_full_clusters to prevent softlockup kerayhuang
@ 2026-04-24 15:39 ` Baoquan He
  2026-04-29 12:49   ` kerayhuang
  0 siblings, 1 reply; 4+ messages in thread
From: Baoquan He @ 2026-04-24 15:39 UTC (permalink / raw)
  To: kerayhuang; +Cc: kasong, bhe, linux-mm, kerayhuang, Hao Peng

Hi Keray,

On 04/24/26 at 08:37pm, kerayhuang wrote:
> Add periodic cond_resched() calls during large full_clusters
> reclaim operations to prevent softlockup issues.
> 
> Signed-off-by: kerayhuang <kerayhuang@tencent.com>
> Reviewed-by: Kairui Song <kasong@tencent.com>
> Reviewed-by: Hao Peng <flyingpeng@tencent.com>
> ---
>  mm/swapfile.c | 1 +
>  1 file changed, 1 insertion(+)

Thanks for the patch. The change looks good to me, however there are
still small concerns.

For patch log, it might be better to provide more details, e.g did you
observe this issue in a product environment, or just a code exploring?
If observed in a product environment, what does the backtrace look like 
when softlockup happened?

> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 9174f1eeffb0..74a1e324449d 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1054,6 +1054,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
>  		swap_cluster_unlock(ci);
>  		if (to_scan <= 0)
>  			break;
> +		cond_resched();

Besides, is it a little bit too aggressive to call cond_resched() for
each cluster reclaiming compared with the old code? Do you consider to
make it gentle, e.g calling cond_resched() every several clusters, 8, 16
or other number decided based on your testing performance statistics.

Thanks
Baoquan

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] mm/swap: Add cond_resched() in swap_reclaim_full_clusters to prevent softlockup
  2026-04-24 15:39 ` Baoquan He
@ 2026-04-29 12:49   ` kerayhuang
  2026-05-01  2:43     ` Baoquan He
  0 siblings, 1 reply; 4+ messages in thread
From: kerayhuang @ 2026-04-29 12:49 UTC (permalink / raw)
  To: baoquan.he
  Cc: bhe, flyingpeng, huangzjsmile, kasong, kerayhuang, albinwyang,
	linux-mm

Hi Baoquan,
Thanks for the review!
> Hi Keray,
> 
> On 04/24/26 at 08:37pm, kerayhuang wrote:
> > Add periodic cond_resched() calls during large full_clusters
> > reclaim operations to prevent softlockup issues.
> > 
> > Signed-off-by: kerayhuang <kerayhuang@tencent.com>
> > Reviewed-by: Kairui Song <kasong@tencent.com>
> > Reviewed-by: Hao Peng <flyingpeng@tencent.com>
> > ---
> >  mm/swapfile.c | 1 +
> >  1 file changed, 1 insertion(+)
> 
> Thanks for the patch. The change looks good to me, however there are
> still small concerns.
> 
> For patch log, it might be better to provide more details, e.g did you
> observe this issue in a product environment, or just a code exploring?
> If observed in a product environment, what does the backtrace look like 
> when softlockup happened?

We hit a real softlockup in an internal stress test environment. 
The workload was LTP memory/swap stress on a large arm64 machine, 
with 320 CPUs, about 1TB memory and an 8.6GB swap device. 
The system was under heavy load and the swap device had a large 
number of full clusters. The softlockup was triggered during 
a stress test after about 3 days.

The backtrace looks like:

  PID: 3817773  TASK: ffff0883bb28b780  CPU: 48   COMMAND: "kworker/48:7"
   #0 [ffff800080183d10] __crash_kexec at ffffa4c1361e5de4
   #1 [ffff800080183d90] panic at ffffa4c1360d5e9c
   #2 [ffff800080183e20] watchdog_timer_fn at ffffa4c136231fa8
   ...
  #16 [ffff8000c4ad3cb0] swap_cache_del_folio at ffffa4c1363e1614
  #17 [ffff8000c4ad3ce0] __try_to_reclaim_swap at ffffa4c1363e4bfc
  #18 [ffff8000c4ad3d40] swap_reclaim_full_clusters at ffffa4c1363e5474
  #19 [ffff8000c4ad3da0] swap_reclaim_work at ffffa4c1363e550c
  #20 [ffff8000c4ad3dc0] process_one_work at ffffa4c136102edc
  #21 [ffff8000c4ad3e10] worker_thread at ffffa4c136103398
  #22 [ffff8000c4ad3e70] kthread at ffffa4c13610d95c

From the vmcore analysis, swap_reclaim_work() called
swap_reclaim_full_clusters() with force=true, which sets to_scan to 1551 clusters.
At the time of the softlockup, there were still 1427 full clusters remaining in the
full_clusters list.

I will add these details to the commit log in v2.

> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 9174f1eeffb0..74a1e324449d 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1054,6 +1054,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
> >  		swap_cluster_unlock(ci);
> >  		if (to_scan <= 0)
> >  			break;
> > +		cond_resched();
> 
> Besides, is it a little bit too aggressive to call cond_resched() for
> each cluster reclaiming compared with the old code? Do you consider to
> make it gentle, e.g calling cond_resched() every several clusters, 8, 16
> or other number decided based on your testing performance statistics.

I think calling cond_resched() per cluster is reasonable
here because:

1) Each cluster iteration already involves scanning up to 512 slots,
   and each slot reclaim may call __try_to_reclaim_swap() which does
   non-trivial work (lock/unlock, folio lookup, swap cache deletion,
   and potentially slab freeing). So the work per cluster is already
   substantial.

2) cond_resched() is a lightweight check - it only actually reschedules 
   when need_resched() is set, so in the common case it's just a flag
   check with negligible overhead. Therefore calling it once per cluster
   gives a bounded latency without forcing an actual context switch every 
   time. If we call it only every 8 or 16 clusters, the worst-case 
   non-preemptible window can still become quite large on machines with 
   many full clusters.

2) This is a workqueue context (swap_reclaim_work), not a hot fast
   path, so the slight overhead is acceptable.

Thanks,
Keray 

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] mm/swap: Add cond_resched() in swap_reclaim_full_clusters to prevent softlockup
  2026-04-29 12:49   ` kerayhuang
@ 2026-05-01  2:43     ` Baoquan He
  0 siblings, 0 replies; 4+ messages in thread
From: Baoquan He @ 2026-05-01  2:43 UTC (permalink / raw)
  To: kerayhuang; +Cc: bhe, flyingpeng, kasong, kerayhuang, albinwyang, linux-mm

On 04/29/26 at 08:49pm, kerayhuang wrote:
> Hi Baoquan,
> Thanks for the review!
> > Hi Keray,
> > 
> > On 04/24/26 at 08:37pm, kerayhuang wrote:
> > > Add periodic cond_resched() calls during large full_clusters
> > > reclaim operations to prevent softlockup issues.
> > > 
> > > Signed-off-by: kerayhuang <kerayhuang@tencent.com>
> > > Reviewed-by: Kairui Song <kasong@tencent.com>
> > > Reviewed-by: Hao Peng <flyingpeng@tencent.com>
> > > ---
> > >  mm/swapfile.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > 
> > Thanks for the patch. The change looks good to me, however there are
> > still small concerns.
> > 
> > For patch log, it might be better to provide more details, e.g did you
> > observe this issue in a product environment, or just a code exploring?
> > If observed in a product environment, what does the backtrace look like 
> > when softlockup happened?
> 
> We hit a real softlockup in an internal stress test environment. 
> The workload was LTP memory/swap stress on a large arm64 machine, 
> with 320 CPUs, about 1TB memory and an 8.6GB swap device. 
> The system was under heavy load and the swap device had a large 
> number of full clusters. The softlockup was triggered during 
> a stress test after about 3 days.
> 
> The backtrace looks like:
> 
>   PID: 3817773  TASK: ffff0883bb28b780  CPU: 48   COMMAND: "kworker/48:7"
>    #0 [ffff800080183d10] __crash_kexec at ffffa4c1361e5de4
>    #1 [ffff800080183d90] panic at ffffa4c1360d5e9c
>    #2 [ffff800080183e20] watchdog_timer_fn at ffffa4c136231fa8
>    ...
>   #16 [ffff8000c4ad3cb0] swap_cache_del_folio at ffffa4c1363e1614
>   #17 [ffff8000c4ad3ce0] __try_to_reclaim_swap at ffffa4c1363e4bfc
>   #18 [ffff8000c4ad3d40] swap_reclaim_full_clusters at ffffa4c1363e5474
>   #19 [ffff8000c4ad3da0] swap_reclaim_work at ffffa4c1363e550c
>   #20 [ffff8000c4ad3dc0] process_one_work at ffffa4c136102edc
>   #21 [ffff8000c4ad3e10] worker_thread at ffffa4c136103398
>   #22 [ffff8000c4ad3e70] kthread at ffffa4c13610d95c
> 
> From the vmcore analysis, swap_reclaim_work() called
> swap_reclaim_full_clusters() with force=true, which sets to_scan to 1551 clusters.
> At the time of the softlockup, there were still 1427 full clusters remaining in the
> full_clusters list.
> 
> I will add these details to the commit log in v2.

Sounds like a very great root cause digging. Adding these into patch log
will be very helpful.

By the way, is it worth a Fixes tag?

> 
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index 9174f1eeffb0..74a1e324449d 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -1054,6 +1054,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
> > >  		swap_cluster_unlock(ci);
> > >  		if (to_scan <= 0)
> > >  			break;
> > > +		cond_resched();
> > 
> > Besides, is it a little bit too aggressive to call cond_resched() for
> > each cluster reclaiming compared with the old code? Do you consider to
> > make it gentle, e.g calling cond_resched() every several clusters, 8, 16
> > or other number decided based on your testing performance statistics.
> 
> I think calling cond_resched() per cluster is reasonable
> here because:
> 
> 1) Each cluster iteration already involves scanning up to 512 slots,
>    and each slot reclaim may call __try_to_reclaim_swap() which does
>    non-trivial work (lock/unlock, folio lookup, swap cache deletion,
>    and potentially slab freeing). So the work per cluster is already
>    substantial.
> 
> 2) cond_resched() is a lightweight check - it only actually reschedules 
>    when need_resched() is set, so in the common case it's just a flag
>    check with negligible overhead. Therefore calling it once per cluster
>    gives a bounded latency without forcing an actual context switch every 
>    time. If we call it only every 8 or 16 clusters, the worst-case 
>    non-preemptible window can still become quite large on machines with 
>    many full clusters.
> 
> 2) This is a workqueue context (swap_reclaim_work), not a hot fast
>    path, so the slight overhead is acceptable.

OK, that sounds good. Just when system is under heavy stress, it could
yield after each cluster reclaiming. Imagine a system with a bigger swap
disk, it will alwasy need to check if swap is 50% full and if it's in
workqueue. Anyway, maybe I am overthinking. The overrall looks very
great to me, good catch, good root cause digging and good fix. Let's see
if other people have any concern.

Thanks
Baoquan


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2026-05-01  2:43 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-24 12:37 [PATCH] mm/swap: Add cond_resched() in swap_reclaim_full_clusters to prevent softlockup kerayhuang
2026-04-24 15:39 ` Baoquan He
2026-04-29 12:49   ` kerayhuang
2026-05-01  2:43     ` Baoquan He

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox