Date: Tue, 06 Jan 2026 05:25:42 +0000
From: "Jiayuan Chen"
Subject: Re: [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
To: "Shakeel Butt"
Cc: linux-mm@kvack.org, "Jiayuan Chen", "Andrew Morton", "Johannes Weiner",
 "David Hildenbrand", "Michal Hocko", "Qi Zheng", "Lorenzo Stoakes",
 "Axel Rasmussen", "Yuanchu Xie", "Wei Xu", linux-kernel@vger.kernel.org
References: <20251222122022.254268-1-jiayuan.chen@linux.dev>
 <4owaeb7bmkfgfzqd4ztdsi4tefc36cnmpju4yrknsgjm4y32ez@qsgn6lnv3cxb>
 <2e574085ed3d7775c3b83bb80d302ce45415ac42@linux.dev>

January 5, 2026 at 12:51, "Shakeel Butt" wrote:

> Hi Jiayuan,
>
> Sorry for late reply due to holidays/break. I will still be slow to
> respond this week but will be fully back after one more week. Anyways,
> let me respond below.

No worries about the delay - happy holidays!

> On Tue, Dec 23, 2025 at 08:22:43AM +0000, Jiayuan Chen wrote:
> >
> > December 23, 2025 at 14:11, "Shakeel Butt" wrote:
> >
> > > On Tue, Dec 23, 2025 at 01:42:37AM +0000, Jiayuan Chen wrote:
> > >
> > > > December 23, 2025 at 05:15, "Shakeel Butt" wrote:
> > [...]
> > > > > I don't think kswapd is an issue here. The system is out of
> > > > > memory and most of the memory is unreclaimable.
> > > > > Either change the workload to use less memory or enable swap
> > > > > (or zswap) to have more reclaimable memory.
> > > >
> > > > Hi,
> > > > Thanks for looking into this.
> > > >
> > > > Sorry, I didn't describe the scenario clearly enough in the
> > > > original patch. Let me clarify:
> > > >
> > > > This is a multi-NUMA system where the memory pressure is not
> > > > global but node-local. The key observation is:
> > > >
> > > > Node 0: Under memory pressure, most memory is anonymous
> > > >         (unreclaimable without swap)
> > > > Node 1: Has plenty of reclaimable memory (~60GB file cache out
> > > >         of 125GB total)
> > >
> > > Thanks and now the situation is much more clear. IIUC you are
> > > running multiple workloads (pods) on the system. How are the memcg
> > > limits configured for these workloads? You mentioned memory.high,
> > > what about
> >
> > Thanks for the questions. We have pods configured with memory.high
> > and pods configured with memory.max.
> >
> > Actually, memory.max itself causes heavy I/O issues for us, because
> > it keeps trying to reclaim hot pages within the cgroup aggressively
> > without killing the process.
> >
> > So we configured some pods with memory.high instead, since it
> > performs reclaim in resume_user_mode_work, which somewhat throttles
> > the memory allocation of user processes.
> >
> > > memory.max? Also are you using cpusets to limit the pods to
> > > individual nodes (cpu & memory) or they can run on any node?
> >
> > Yes, we have cpusets (only cpuset.cpus, not cpuset.mems) configured
> > for our cgroups, binding them to specific NUMA nodes. But I don't
> > think this is directly related to the issue - the problem can occur
> > with or without cpusets. Even without cpuset.cpus, the kernel prefers
> > to allocate memory from the node where the process is running, so if
> > a process happens to run on a CPU belonging to Node 0, the behavior
> > would be similar.
>
> Are you limiting (using cpuset.cpus) the workloads to single respective
> nodes or can the individual workloads still run on multiple nodes? For
> example do you have a workload which can run on both (or more) nodes?

We have many workloads. Some performance-sensitive ones have cpuset.cpus
configured to bind to a specific node, while others don't.

> > > Overall I still think it is unbalanced numa nodes in terms of
> > > memory and maybe for cpu as well. Anyways let's talk about kswapd.
> > > >
> > > > Node 0's kswapd runs continuously but cannot reclaim anything
> > > > Direct reclaim succeeds by reclaiming from Node 1
> > > > Direct reclaim resets kswapd_failures,
> > >
> > > So successful reclaim on one node does not reset kswapd_failures on
> > > the other node. The kernel reclaims each node one by one, so only
> > > if Node 0 direct reclaim was successful does the kernel allow the
> > > kswapd_failures of Node 0 to be reset.
> >
> > Let me dig deeper into this.
> >
> > When either memory.max or memory.high is reached, direct reclaim is
> > triggered. The memory being reclaimed depends on the CPU where the
> > process is running.
> >
> > When the problem occurred, we had workloads continuously hitting
> > memory.max and workloads continuously hitting memory.high:
> >
> > reclaim_high ------> try_to_free_mem_cgroup_pages
> >                ^       do_try_to_free_pages(zone of current node)
> >                |         shrink_zones()
> > try_charge_memcg            shrink_node()
> >                               kswapd_failures = 0
> >
> > Although the pages are hot, if we scan aggressively enough, they will
> > eventually be reclaimed, and then kswapd_failures gets reset to 0 -
> > because even reclaiming a single page resets kswapd_failures to 0.
> >
> > The end result is that most workloads, which didn't even hit their
> > high or max limits, experienced continuous refaults, causing heavy
> > I/O.
>
> So, the decision to reset kswapd_failures on memcg reclaim can be
> re-evaluated but I think that is not the root cause here.

The workloads triggering direct reclaim have their memory spread across
multiple nodes, since we don't set cpuset.mems, so the cgroup can reclaim
memory from multiple nodes.

In particular, complex applications have many threads, with different
threads allocating and freeing large amounts of memory (both anonymous
and file pages), and these allocations can consume memory from nodes that
are above the low watermark.

You're right that multiple factors contribute to the issue I described.
This patch addresses one of them, just like the boost_watermark patch I
submitted before, and the recent patch about memory.high causing high
I/O. There are other scenarios as well that I'm still trying to
reproduce.

That said, I believe this patch is still a valid fix on its own -
resetting kswapd_failures when the node is not actually balanced doesn't
seem like correct behavior regardless of the broader context.

> The kswapd_failures mechanism is for situations where kswapd is unable
> to reclaim and then punts to the direct reclaimers, but in your
> situation the workloads are not numa memory bound and thus there really
> are not any numa-level direct reclaimers. Also the lack of reclaimable
> memory is making the situation worse.
> >
> > Thanks.
> > > >
> > > > preventing Node 0's kswapd from stopping
> > > > The few file pages on Node 0 are hot and keep refaulting, causing
> > > > heavy I/O
> > >
> > > Have you tried numa balancing? Though I think it would be better to
> > > schedule upfront in a way that one node is not overcommitted but
> > > numa balancing provides a dynamic way to adjust the load on each
> > > node.
> >
> > Yes, we have tried it.
> > Actually, I submitted a patch about a month ago to improve its
> > observability:
> > https://lore.kernel.org/all/20251124153331.465306a2@gandalf.local.home/
> > (though only Steven replied, a bit awkward :( ).
> >
> > We found that the default settings didn't work well for our
> > workloads. When we tried to increase scan_size to make it more
> > aggressive, we noticed the system load started to increase. So we
> > haven't fully adopted it yet.
>
> I feel the numa balancing will not help as well, or it might make it
> worse, as the workloads may have allocated some memory on the other
> node which numa balancing might try to move to the node which is
> already under pressure.

Agreed.

> Let me say what I think is the issue. You have the situation where node
> 0 is overcommitted and is mostly filled with unreclaimable memory. The
> workloads running on node 0 have their workingset continuously getting
> reclaimed due to node 0 being OOM.

From our monitoring, only a few cgroups triggered direct reclaim - some
hitting memory.high and some hitting memory.max (we have tracepoints for
monitoring).

> I think the simplest solution for you is to enable swap to have more
> reclaimable memory on the system. Hopefully you will have the
> workingset of the workloads fully in memory on each node.
>
> You can try to change the application/workload to be more numa aware
> and balance their anon memory on the given nodes but I think that would
> be much more involved and error prone.

Enabling swap is one solution, but due to historical reasons we haven't
enabled it - our disk performance is relatively poor. zram is also an
option, but the migration would take significant time.

Thanks