Date: Tue, 06 Jan 2026 05:25:42 +0000
From: "Jiayuan Chen"
Subject: Re: [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
To: "Shakeel Butt"
Cc: linux-mm@kvack.org, "Jiayuan Chen", "Andrew Morton", "Johannes Weiner",
 "David Hildenbrand", "Michal Hocko", "Qi Zheng", "Lorenzo Stoakes",
 "Axel Rasmussen", "Yuanchu Xie", "Wei Xu", linux-kernel@vger.kernel.org
References: <20251222122022.254268-1-jiayuan.chen@linux.dev>
 <4owaeb7bmkfgfzqd4ztdsi4tefc36cnmpju4yrknsgjm4y32ez@qsgn6lnv3cxb>
 <2e574085ed3d7775c3b83bb80d302ce45415ac42@linux.dev>

January 5, 2026 at 12:51, "Shakeel Butt" wrote:

> Hi Jiayuan,
>
> Sorry for late reply due to holidays/break. I will still be slow to
> respond this week but will be fully back after one more week. Anyways,
> let me respond below.

No worries about the delay - happy holidays!

> On Tue, Dec 23, 2025 at 08:22:43AM +0000, Jiayuan Chen wrote:
> >
> > December 23, 2025 at 14:11, "Shakeel Butt" wrote:
> >
> > > On Tue, Dec 23, 2025 at 01:42:37AM +0000, Jiayuan Chen wrote:
> > >
> > > > December 23, 2025 at 05:15, "Shakeel Butt" wrote:
> > [...]
> > > > > I don't think kswapd is an issue here. The system is out of
> > > > > memory and most of the memory is unreclaimable.
> > > > > Either change the workload to use less memory or enable swap
> > > > > (or zswap) to have more reclaimable memory.
> > > >
> > > > Hi,
> > > > Thanks for looking into this.
> > > >
> > > > Sorry, I didn't describe the scenario clearly enough in the
> > > > original patch. Let me clarify:
> > > >
> > > > This is a multi-NUMA system where the memory pressure is not
> > > > global but node-local. The key observation is:
> > > >
> > > > Node 0: Under memory pressure, most memory is anonymous
> > > >         (unreclaimable without swap)
> > > > Node 1: Has plenty of reclaimable memory (~60GB file cache out
> > > >         of 125GB total)
> > >
> > > Thanks and now the situation is much more clear. IIUC you are
> > > running multiple workloads (pods) on the system. How are the memcg
> > > limits configured for these workloads? You mentioned memory.high,
> > > what about
> >
> > Thanks for the questions. We have pods configured with memory.high
> > and pods configured with memory.max.
> >
> > Actually, memory.max itself causes heavy I/O issues for us, because
> > it keeps trying to reclaim hot pages within the cgroup aggressively
> > without killing the process.
> >
> > So we configured some pods with memory.high instead, since it
> > performs reclaim in resume_user_mode_work, which somewhat throttles
> > the memory allocation of user processes.
> >
> > > memory.max? Also are you using cpusets to limit the pods to
> > > individual nodes (cpu & memory) or they can run on any node?
> >
> > Yes, we have cpusets (only cpuset.cpus, not cpuset.mems) configured
> > for our cgroups, binding them to specific NUMA nodes. But I don't
> > think this is directly related to the issue - the problem can occur
> > with or without cpusets. Even without cpuset.cpus, the kernel prefers
> > to allocate memory from the node where the process is running, so if
> > a process happens to run on a CPU belonging to Node 0, the behavior
> > would be similar.
>
> Are you limiting (using cpuset.cpus) the workloads to single respective
> nodes or can the individual workloads still run on multiple nodes? For
> example do you have a workload which can run on both (or more) nodes?

We have many workloads. Some performance-sensitive ones have cpuset.cpus
configured to bind to a specific node, while others don't.

> > > Overall I still think it is unbalanced numa nodes in terms of
> > > memory and maybe for cpu as well. Anyways let's talk about kswapd.
> > > >
> > > > Node 0's kswapd runs continuously but cannot reclaim anything
> > > > Direct reclaim succeeds by reclaiming from Node 1
> > > > Direct reclaim resets kswapd_failures,
> > >
> > > So successful reclaim on one node does not reset kswapd_failures on
> > > the other node. The kernel reclaims each node one by one, so only
> > > if Node 0 direct reclaim was successful does the kernel allow the
> > > kswapd_failures of Node 0 to be reset.
> >
> > Let me dig deeper into this.
> >
> > When either memory.max or memory.high is reached, direct reclaim is
> > triggered. The memory being reclaimed depends on the CPU where the
> > process is running.
> >
> > When the problem occurred, we had workloads continuously hitting
> > memory.max and workloads continuously hitting memory.high:
> >
> > reclaim_high ------> try_to_free_mem_cgroup_pages
> >                ^       do_try_to_free_pages(zone of current node)
> >                |         shrink_zones()
> > try_charge_memcg            shrink_node()
> >                               kswapd_failures = 0
> >
> > Although the pages are hot, if we scan aggressively enough, they will
> > eventually be reclaimed, and then kswapd_failures gets reset to 0 -
> > because even reclaiming a single page resets kswapd_failures to 0.
> >
> > The end result is that most workloads, which didn't even hit their
> > high or max limits, experienced continuous refaults, causing heavy
> > I/O.
>
> So, the decision to reset kswapd_failures on memcg reclaim can be
> re-evaluated but I think that is not the root cause here.

The workloads triggering direct reclaim have their memory spread across
multiple nodes, since we don't set cpuset.mems, so the cgroup can reclaim
memory from multiple nodes.

In particular, complex applications have many threads, with different
threads allocating and freeing large amounts of memory (both anonymous
and file pages), and these allocations can consume memory from nodes that
are above the low watermark.

You're right that multiple factors contribute to the issue I described.
This patch addresses one of them, just like the boost_watermark patch I
submitted before, and the recent patch about memory.high causing high
I/O. There are other scenarios as well that I'm still trying to
reproduce.

That said, I believe this patch is still a valid fix on its own -
resetting kswapd_failures when the node is not actually balanced doesn't
seem like correct behavior regardless of the broader context.

> The kswapd_failures mechanism is for situations where kswapd is unable
> to reclaim and then punts to the direct reclaimers, but in your
> situation the workloads are not numa memory bound and thus there really
> are not any numa-level direct reclaimers. Also the lack of reclaimable
> memory is making the situation worse.
> >
> > Thanks.
> > > >
> > > > preventing Node 0's kswapd from stopping
> > > > The few file pages on Node 0 are hot and keep refaulting, causing
> > > > heavy I/O
> > >
> > > Have you tried numa balancing? Though I think it would be better to
> > > schedule upfront in a way that one node is not overcommitted but
> > > numa balancing provides a dynamic way to adjust the load on each
> > > node.
> >
> > Yes, we have tried it.
> > Actually, I submitted a patch about a month ago to improve its
> > observability:
> > https://lore.kernel.org/all/20251124153331.465306a2@gandalf.local.home/
> > (though only Steven replied, a bit awkward :( ).
> >
> > We found that the default settings didn't work well for our
> > workloads. When we tried to increase scan_size to make it more
> > aggressive, we noticed the system load started to increase. So we
> > haven't fully adopted it yet.
>
> I feel the numa balancing will not help as well, or it might make it
> worse, as the workloads may have allocated some memory on the other
> node which numa balancing might try to move to the node which is
> already under pressure.

Agreed.

> Let me say what I think is the issue. You have the situation where node
> 0 is overcommitted and is mostly filled with unreclaimable memory. The
> workloads running on node 0 have their workingset continuously getting
> reclaimed due to node 0 being OOM.

From our monitoring, only a few cgroups triggered direct reclaim - some
hitting memory.high and some hitting memory.max (we have tracepoints for
monitoring).

> I think the simplest solution for you is to enable swap to have more
> reclaimable memory on the system. Hopefully you will have the
> workingset of the workloads fully in memory on each node.
>
> You can try to change the application/workload to be more numa aware
> and balance their anon memory on the given nodes but I think that would
> be much more involved and error prone.

Enabling swap is one solution, but due to historical reasons we haven't
enabled it - our disk performance is relatively poor. zram is also an
option, but the migration would take significant time.

Thanks