From: kerayhuang <huangzjsmile@gmail.com>
To: baoquan.he@linux.dev
Cc: bhe@redhat.com, flyingpeng@tencent.com, huangzjsmile@gmail.com,
	kasong@tencent.com, kerayhuang@tencent.com,
	albinwyang@tencent.com, linux-mm@kvack.org
Subject: Re: [PATCH] mm/swap: Add cond_resched() in swap_reclaim_full_clusters to prevent softlockup
Date: Wed, 29 Apr 2026 20:49:31 +0800	[thread overview]
Message-ID: <20260429124931.452003-1-kerayhuang@tencent.com> (raw)
In-Reply-To: <aeuOuj5rHa6hgLga@MiWiFi-R3L-srv>

Hi Baoquan,
Thanks for the review!
> Hi Keray,
> 
> On 04/24/26 at 08:37pm, kerayhuang wrote:
> > Add periodic cond_resched() calls during large full_clusters
> > reclaim operations to prevent softlockup issues.
> > 
> > Signed-off-by: kerayhuang <kerayhuang@tencent.com>
> > Reviewed-by: Kairui Song <kasong@tencent.com>
> > Reviewed-by: Hao Peng <flyingpeng@tencent.com>
> > ---
> >  mm/swapfile.c | 1 +
> >  1 file changed, 1 insertion(+)
> 
> Thanks for the patch. The change looks good to me; however, there are
> still some small concerns.
> 
> For the patch log, it might be better to provide more details, e.g. did you
> observe this issue in a production environment, or just while exploring the
> code? If it was observed in a production environment, what does the backtrace
> look like when the softlockup happened?

We hit a real softlockup in an internal stress-test environment.
The workload was LTP memory/swap stress on a large arm64 machine
with 320 CPUs, about 1 TB of memory, and an 8.6 GB swap device.
The system was under heavy load and the swap device had accumulated
a large number of full clusters; the softlockup triggered after
about three days of stress testing.

The backtrace looks like:

  PID: 3817773  TASK: ffff0883bb28b780  CPU: 48   COMMAND: "kworker/48:7"
   #0 [ffff800080183d10] __crash_kexec at ffffa4c1361e5de4
   #1 [ffff800080183d90] panic at ffffa4c1360d5e9c
   #2 [ffff800080183e20] watchdog_timer_fn at ffffa4c136231fa8
   ...
  #16 [ffff8000c4ad3cb0] swap_cache_del_folio at ffffa4c1363e1614
  #17 [ffff8000c4ad3ce0] __try_to_reclaim_swap at ffffa4c1363e4bfc
  #18 [ffff8000c4ad3d40] swap_reclaim_full_clusters at ffffa4c1363e5474
  #19 [ffff8000c4ad3da0] swap_reclaim_work at ffffa4c1363e550c
  #20 [ffff8000c4ad3dc0] process_one_work at ffffa4c136102edc
  #21 [ffff8000c4ad3e10] worker_thread at ffffa4c136103398
  #22 [ffff8000c4ad3e70] kthread at ffffa4c13610d95c

From the vmcore analysis, swap_reclaim_work() called
swap_reclaim_full_clusters() with force=true, which set to_scan to 1551 clusters.
At the time of the softlockup, 1427 full clusters still remained on the
full_clusters list.
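
For scale: 1551 clusters at up to 512 slots each is on the order of 790,000
slot checks, plus the associated __try_to_reclaim_swap() work, with no
scheduling point anywhere in that stretch before this patch, which is more
than enough to trip the softlockup watchdog.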

I will add these details to the commit log in v2.

> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 9174f1eeffb0..74a1e324449d 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1054,6 +1054,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
> >  		swap_cluster_unlock(ci);
> >  		if (to_scan <= 0)
> >  			break;
> > +		cond_resched();
> 
> Besides, is it a little bit too aggressive to call cond_resched() for
> each cluster reclaimed, compared with the old code? Did you consider
> making it gentler, e.g. calling cond_resched() only every several clusters
> (8, 16, or another number chosen based on your performance testing)?

I think calling cond_resched() once per cluster is reasonable
here because:

1) Each cluster iteration already involves scanning up to 512 slots,
   and each slot reclaim may call __try_to_reclaim_swap(), which does
   non-trivial work (lock/unlock, folio lookup, swap cache deletion,
   and potentially slab freeing). So the work per cluster is already
   substantial.

2) cond_resched() is a lightweight check: it only actually reschedules
   when need_resched() is set, so in the common case it is just a flag
   check with negligible overhead. Calling it once per cluster therefore
   gives bounded latency without forcing an actual context switch every
   time. If we called it only every 8 or 16 clusters, the worst-case
   non-preemptible window could still become quite large on machines with
   many full clusters (see the sketch after this list).

3) This is a workqueue context (swap_reclaim_work), not a hot fast
   path, so the slight overhead is acceptable.
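
To make the comparison concrete, here is a small illustrative sketch of the
two placements. reclaim_one_cluster() and RESCHED_BATCH are made-up names
used only for illustration; they are not code from mm/swapfile.c:

/* Per-cluster check (what this patch does): the non-preemptible
 * window is bounded by the work of a single cluster.
 */
static void reclaim_percluster(long to_scan)
{
	while (to_scan-- > 0) {
		reclaim_one_cluster();	/* up to ~512 slot reclaims */
		cond_resched();		/* no-op unless need_resched() is set */
	}
}

/* Every-N-clusters check: saves a flag test per iteration, but the
 * worst-case non-preemptible window grows by a factor of N.
 */
#define RESCHED_BATCH	16
static void reclaim_batched(long to_scan)
{
	long scanned = 0;

	while (to_scan-- > 0) {
		reclaim_one_cluster();
		if (++scanned % RESCHED_BATCH == 0)
			cond_resched();
	}
}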

Thanks,
Keray 

