public inbox for linux-mm@kvack.org
 help / color / mirror / Atom feed
From: Baoquan He <baoquan.he@linux.dev>
To: kerayhuang <huangzjsmile@gmail.com>
Cc: bhe@redhat.com, flyingpeng@tencent.com, kasong@tencent.com,
	kerayhuang@tencent.com, albinwyang@tencent.com,
	linux-mm@kvack.org
Subject: Re: [PATCH] mm/swap: Add cond_resched() in swap_reclaim_full_clusters to prevent softlockup
Date: Fri, 1 May 2026 10:43:40 +0800	[thread overview]
Message-ID: <afQTXPLbNrB3NOqe@MiWiFi-R3L-srv> (raw)
In-Reply-To: <20260429124931.452003-1-kerayhuang@tencent.com>

On 04/29/26 at 08:49pm, kerayhuang wrote:
> Hi Baoquan,
> Thanks for the review!
> > Hi Keray,
> > 
> > On 04/24/26 at 08:37pm, kerayhuang wrote:
> > > Add periodic cond_resched() calls during large full_clusters
> > > reclaim operations to prevent softlockup issues.
> > > 
> > > Signed-off-by: kerayhuang <kerayhuang@tencent.com>
> > > Reviewed-by: Kairui Song <kasong@tencent.com>
> > > Reviewed-by: Hao Peng <flyingpeng@tencent.com>
> > > ---
> > >  mm/swapfile.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > 
> > Thanks for the patch. The change looks good to me, however there are
> > still small concerns.
> > 
> > For patch log, it might be better to provide more details, e.g did you
> > observe this issue in a product environment, or just a code exploring?
> > If observed in a product environment, what does the backtrace look like 
> > when softlockup happened?
> 
> We hit a real softlockup in an internal stress test environment. 
> The workload was LTP memory/swap stress on a large arm64 machine, 
> with 320 CPUs, about 1TB memory and an 8.6GB swap device. 
> The system was under heavy load and the swap device had a large 
> number of full clusters. The softlockup was triggered during 
> a stress test after about 3 days.
> 
> The backtrace looks like:
> 
>   PID: 3817773  TASK: ffff0883bb28b780  CPU: 48   COMMAND: "kworker/48:7"
>    #0 [ffff800080183d10] __crash_kexec at ffffa4c1361e5de4
>    #1 [ffff800080183d90] panic at ffffa4c1360d5e9c
>    #2 [ffff800080183e20] watchdog_timer_fn at ffffa4c136231fa8
>    ...
>   #16 [ffff8000c4ad3cb0] swap_cache_del_folio at ffffa4c1363e1614
>   #17 [ffff8000c4ad3ce0] __try_to_reclaim_swap at ffffa4c1363e4bfc
>   #18 [ffff8000c4ad3d40] swap_reclaim_full_clusters at ffffa4c1363e5474
>   #19 [ffff8000c4ad3da0] swap_reclaim_work at ffffa4c1363e550c
>   #20 [ffff8000c4ad3dc0] process_one_work at ffffa4c136102edc
>   #21 [ffff8000c4ad3e10] worker_thread at ffffa4c136103398
>   #22 [ffff8000c4ad3e70] kthread at ffffa4c13610d95c
> 
> From the vmcore analysis, swap_reclaim_work() called
> swap_reclaim_full_clusters() with force=true, which sets to_scan to 1551 clusters.
> At the time of the softlockup, there were still 1427 full clusters remaining in the
> full_clusters list.
> 
> I will add these details to the commit log in v2.

Sounds like a very great root cause digging. Adding these into patch log
will be very helpful.

By the way, is it worth a Fixes tag?

> 
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index 9174f1eeffb0..74a1e324449d 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -1054,6 +1054,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
> > >  		swap_cluster_unlock(ci);
> > >  		if (to_scan <= 0)
> > >  			break;
> > > +		cond_resched();
> > 
> > Besides, is it a little bit too aggressive to call cond_resched() for
> > each cluster reclaiming compared with the old code? Do you consider to
> > make it gentle, e.g calling cond_resched() every several clusters, 8, 16
> > or other number decided based on your testing performance statistics.
> 
> I think calling cond_resched() per cluster is reasonable
> here because:
> 
> 1) Each cluster iteration already involves scanning up to 512 slots,
>    and each slot reclaim may call __try_to_reclaim_swap() which does
>    non-trivial work (lock/unlock, folio lookup, swap cache deletion,
>    and potentially slab freeing). So the work per cluster is already
>    substantial.
> 
> 2) cond_resched() is a lightweight check - it only actually reschedules 
>    when need_resched() is set, so in the common case it's just a flag
>    check with negligible overhead. Therefore calling it once per cluster
>    gives a bounded latency without forcing an actual context switch every 
>    time. If we call it only every 8 or 16 clusters, the worst-case 
>    non-preemptible window can still become quite large on machines with 
>    many full clusters.
> 
> 2) This is a workqueue context (swap_reclaim_work), not a hot fast
>    path, so the slight overhead is acceptable.

OK, that sounds good. Just when system is under heavy stress, it could
yield after each cluster reclaiming. Imagine a system with a bigger swap
disk, it will alwasy need to check if swap is 50% full and if it's in
workqueue. Anyway, maybe I am overthinking. The overrall looks very
great to me, good catch, good root cause digging and good fix. Let's see
if other people have any concern.

Thanks
Baoquan


      reply	other threads:[~2026-05-01  2:43 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-24 12:37 [PATCH] mm/swap: Add cond_resched() in swap_reclaim_full_clusters to prevent softlockup kerayhuang
2026-04-24 15:39 ` Baoquan He
2026-04-29 12:49   ` kerayhuang
2026-05-01  2:43     ` Baoquan He [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=afQTXPLbNrB3NOqe@MiWiFi-R3L-srv \
    --to=baoquan.he@linux.dev \
    --cc=albinwyang@tencent.com \
    --cc=bhe@redhat.com \
    --cc=flyingpeng@tencent.com \
    --cc=huangzjsmile@gmail.com \
    --cc=kasong@tencent.com \
    --cc=kerayhuang@tencent.com \
    --cc=linux-mm@kvack.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox