From: "Jiayuan Chen" <jiayuan.chen@linux.dev>
To: "Andrew Morton" <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org, "Jiayuan Chen" <jiayuan.chen@shopee.com>,
"Johannes Weiner" <hannes@cmpxchg.org>,
"David Hildenbrand" <david@kernel.org>,
"Michal Hocko" <mhocko@kernel.org>,
"Qi Zheng" <zhengqi.arch@bytedance.com>,
"Shakeel Butt" <shakeel.butt@linux.dev>,
"Lorenzo Stoakes" <lorenzo.stoakes@oracle.com>,
"Axel Rasmussen" <axelrasmussen@google.com>,
"Yuanchu Xie" <yuanchu@google.com>, "Wei Xu" <weixugc@google.com>,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
Date: Tue, 23 Dec 2025 01:51:32 +0000
Message-ID: <42e6103fb07fca398f0942c7c41129ffcce90dc6@linux.dev>
In-Reply-To: <20251222102900.91eddc815291496eaf60cbf8@linux-foundation.org>
December 23, 2025 at 02:29, "Andrew Morton" <akpm@linux-foundation.org> wrote:
Hi Andrew,
Thanks for the review.
>
> On Mon, 22 Dec 2025 20:20:21 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
>
> >
> > From: Jiayuan Chen <jiayuan.chen@shopee.com>
> >
> > When kswapd fails to reclaim memory, kswapd_failures is incremented.
> > Once it reaches MAX_RECLAIM_RETRIES, kswapd stops running to avoid
> > futile reclaim attempts. However, any successful direct reclaim
> > unconditionally resets kswapd_failures to 0, which can cause problems.
> >
> > We observed an issue in production on a multi-NUMA system where a
> > process allocated large amounts of anonymous pages on a single NUMA
> > node, causing its watermark to drop below high and evicting most file
> > pages:
> >
> > $ numastat -m
> > Per-node system memory usage (in MBs):
> >                           Node 0          Node 1           Total
> >                  --------------- --------------- ---------------
> > MemTotal               128222.19       127983.91       256206.11
> > MemFree                  1414.48         1432.80         2847.29
> > MemUsed                126807.71       126551.11       252358.82
> > SwapCached                  0.00            0.00            0.00
> > Active                  29017.91        25554.57        54572.48
> > Inactive                92749.06        95377.00       188126.06
> > Active(anon)            28998.96        23356.47        52355.43
> > Inactive(anon)          92685.27        87466.11       180151.39
> > Active(file)               18.95         2198.10         2217.05
> > Inactive(file)             63.79         7910.89         7974.68
> >
> > With swap disabled, only file pages can be reclaimed. When kswapd is
> > woken (e.g., via wake_all_kswapds()), it runs continuously but cannot
> > raise free memory above the high watermark since reclaimable file pages
> > are insufficient. Normally, kswapd would eventually stop after
> > kswapd_failures reaches MAX_RECLAIM_RETRIES.
> >
> > However, pods on this machine have memory.high set in their cgroup.
> >
> What's a "pod"?
A pod is the Kubernetes deployment unit: a group of one or more containers
that run inside a shared cgroup. Sorry for the unclear terminology.
> >
> > Business processes continuously trigger the high limit, causing frequent
> > direct reclaim that keeps resetting kswapd_failures to 0. This prevents
> > kswapd from ever stopping.
> >
> > The result is that kswapd runs endlessly, repeatedly evicting the few
> > remaining file pages which are actually hot. These pages constantly
> > refault, generating sustained heavy IO READ pressure.
> >
> Yes, not good.
>
> >
> > Fix this by only resetting kswapd_failures from direct reclaim when the
> > node is actually balanced. This prevents direct reclaim from keeping
> > kswapd alive when the node cannot be balanced through reclaim alone.
> >
> > ...
> >
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2648,6 +2648,15 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
> > lruvec_memcg(lruvec));
> > }
> >
> > +static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx);
> >
> Forward declaration could be avoided by relocating pgdat_balanced(),
> although the patch will get a lot larger.
Thanks for pointing this out.
> >
> > +static inline void reset_kswapd_failures(struct pglist_data *pgdat,
> > + struct scan_control *sc)
> >
> It would be nice to have a nice comment explaining why this is here.
> Why are we checking for balanced?
You're right, a comment explaining the rationale would be helpful.
> >
> > +{
> > + if (!current_is_kswapd() &&
> >
> kswapd can no longer clear ->kswapd_failures. What's the thinking here?
Good catch. My original thinking was that kswapd already checks pgdat_balanced()
in its own path after successful reclaim, so I wanted to avoid redundant checks.
But looking at the code again, this is indeed a bug - kswapd's reclaim path does
need to clear kswapd_failures on successful reclaim.
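
To make the intended behavior concrete, here is a minimal userspace sketch of
the reset policy discussed above. It is not kernel code and not the actual
patch: struct node_model, model_reset_kswapd_failures() and
model_rounds_until_kswapd_stops() are hypothetical names invented for
illustration, and the "balanced" flag only stands in for the result of
pgdat_balanced(). The assumption is that kswapd may always clear the counter
after making progress, while direct reclaim may clear it only when the node
is balanced.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RECLAIM_RETRIES 16	/* same value as the kernel constant */

/* Hypothetical stand-in for per-node state; not the kernel's pg_data_t. */
struct node_model {
	int kswapd_failures;
	bool balanced;		/* stands in for pgdat_balanced() */
};

/*
 * Policy sketch: kswapd clears the counter whenever it makes progress.
 * Direct reclaim clears it only if the node is actually balanced, so a
 * memcg-local success cannot keep a futile kswapd alive.
 */
static void model_reset_kswapd_failures(struct node_model *node, bool is_kswapd)
{
	if (is_kswapd || node->balanced)
		node->kswapd_failures = 0;
}

/*
 * Each round models one failed kswapd run followed by one successful
 * direct reclaim (e.g. triggered by memory.high) on an unbalanced node.
 * Returns the round in which kswapd gives up, or -1 if it never does.
 */
static int model_rounds_until_kswapd_stops(bool conditional_reset, int max_rounds)
{
	struct node_model node = { .kswapd_failures = 0, .balanced = false };

	assert(max_rounds > 0);

	for (int round = 1; round <= max_rounds; round++) {
		node.kswapd_failures++;		/* kswapd made no progress */
		if (node.kswapd_failures >= MAX_RECLAIM_RETRIES)
			return round;		/* kswapd stops running */

		if (conditional_reset)
			model_reset_kswapd_failures(&node, false);
		else
			node.kswapd_failures = 0;	/* unconditional reset */
	}
	return -1;
}
```

In this model, model_rounds_until_kswapd_stops(false, 1000) never sees kswapd
stop, which matches the production symptom, while
model_rounds_until_kswapd_stops(true, 1000) lets the counter reach
MAX_RECLAIM_RETRIES.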