From: Davidlohr Bueso
To: akpm@linux-foundation.org
Cc: mhocko@kernel.org, hannes@cmpxchg.org, roman.gushchin@linux.dev,
	shakeel.butt@linux.dev, yosryahmed@google.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, dave@stgolabs.net
Subject: [PATCH 4/4] mm: introduce per-node proactive reclaim interface
Date: Mon, 23 Jun 2025 11:58:51 -0700
Message-Id: <20250623185851.830632-5-dave@stgolabs.net>
In-Reply-To: <20250623185851.830632-1-dave@stgolabs.net>
References: <20250623185851.830632-1-dave@stgolabs.net>

This adds support for proactive reclaim in general on a NUMA system.
A per-node interface extends support beyond the memcg-specific
interface while keeping the current semantics of memory.reclaim:
respecting LRU aging and not supporting artificially triggered
eviction on nodes belonging to non-bottom tiers.

This patch allows userspace to do:

     echo "512M swappiness=10" > /sys/devices/system/node/nodeX/reclaim

One of the premises for this is to semantically align as best as
possible with memory.reclaim. For a brief time memcg did support a
nodemask, until 55ab834a86a9 (Revert "mm: add nodes= arg to
memory.reclaim"), because the semantics around reclaim (eviction) vs
demotion were not clear, which broke charging expectations.

With this approach:

1. Users who do not use memcg can benefit from proactive reclaim. The
   memcg interface is not NUMA aware and there are usecases that focus
   on NUMA balancing rather than workload memory footprint.

2. Proactive reclaim on top tiers will trigger demotion, for which
   memory is still byte-addressable. Reclaiming on the bottom nodes
   will trigger evicting to swap (the traditional sense of reclaim).
   This follows the semantics of what is today part of the aging
   process on tiered memory, mirroring what every other form of
   reclaim does (reactive and memcg proactive reclaim).
   Furthermore, per-node proactive reclaim is not as susceptible to
   the memcg charging problem mentioned above.

3. Unlike the nodes= arg, this interface avoids confusing semantics,
   such as what exactly the user wants when mixing top-tier and
   low-tier nodes in the nodemask. Further, the per-node interface is
   less exposed to "free up memory in my container" usecases, where
   eviction is intended.

4. Users that *really* want to free up memory can use proactive
   reclaim on nodes known to be on the bottom tiers to force eviction
   in a natural way - higher access latencies are still better than
   swap. If compelled, while there are no guarantees and it is perhaps
   not worth the effort, users could also potentially follow a
   ladder-like approach to eventually free up the memory.
   Alternatively, perhaps an 'evict' option could be added to the
   parameters for both memory.reclaim and per-node interfaces to force
   this action unconditionally.

Signed-off-by: Davidlohr Bueso
---
 Documentation/ABI/stable/sysfs-devices-node |  9 ++++
 drivers/base/node.c                         |  2 +
 include/linux/swap.h                        | 16 +++++++
 mm/vmscan.c                                 | 53 ++++++++++++++++++---
 4 files changed, 74 insertions(+), 6 deletions(-)

diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
index a02707cb7cbc..2d0e023f22a7 100644
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -227,3 +227,12 @@ Contact:	Jiaqi Yan
 Description:
 		Of the raw poisoned pages on a NUMA node, how many pages are
 		recovered by memory error recovery attempt.
+
+What:		/sys/devices/system/node/nodeX/reclaim
+Date:		June 2025
+Contact:	Linux Memory Management list
+Description:
+		Perform user-triggered proactive reclaim on a NUMA node.
+		This interface is equivalent to the memcg variant.
+
+		See Documentation/admin-guide/cgroup-v2.rst
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 6d66382dae65..548b532a2129 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -659,6 +659,7 @@ static int register_node(struct node *node, int num)
 	} else {
 		hugetlb_register_node(node);
 		compaction_register_node(node);
+		reclaim_register_node(node);
 	}
 
 	return error;
@@ -675,6 +676,7 @@ void unregister_node(struct node *node)
 {
 	hugetlb_unregister_node(node);
 	compaction_unregister_node(node);
+	reclaim_unregister_node(node);
 	node_remove_accesses(node);
 	node_remove_caches(node);
 	device_unregister(&node->dev);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index bc0e1c275fc0..dac7ba98783d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -431,6 +431,22 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
 long remove_mapping(struct address_space *mapping, struct folio *folio);
 
+#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
+extern int reclaim_register_node(struct node *node);
+extern void reclaim_unregister_node(struct node *node);
+
+#else
+
+static inline int reclaim_register_node(struct node *node)
+{
+	return 0;
+}
+
+static inline void reclaim_unregister_node(struct node *node)
+{
+}
+#endif /* CONFIG_SYSFS && CONFIG_NUMA */
+
 #ifdef CONFIG_NUMA
 extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cdd9cb97fb79..f77feb75c678 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -94,10 +94,8 @@ struct scan_control {
 	unsigned long	anon_cost;
 	unsigned long	file_cost;
 
-#ifdef CONFIG_MEMCG
 	/* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
 	int *proactive_swappiness;
-#endif
 
 	/* Can active folios be deactivated as part of reclaim? */
 #define DEACTIVATE_ANON 1
@@ -121,7 +119,7 @@ struct scan_control {
 	/* Has cache_trim_mode failed at least once? */
 	unsigned int cache_trim_mode_failed:1;
 
-	/* Proactive reclaim invoked by userspace through memory.reclaim */
+	/* Proactive reclaim invoked by userspace */
 	unsigned int proactive:1;
 
 	/*
@@ -7732,13 +7730,15 @@ static const match_table_t tokens = {
 	{ MEMORY_RECLAIM_NULL, NULL },
 };
 
-int user_proactive_reclaim(char *buf, struct mem_cgroup *memcg, pg_data_t *pgdat)
+int user_proactive_reclaim(char *buf,
+			   struct mem_cgroup *memcg, pg_data_t *pgdat)
 {
 	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
 	unsigned long nr_to_reclaim, nr_reclaimed = 0;
 	int swappiness = -1;
 	char *old_buf, *start;
 	substring_t args[MAX_OPT_ARGS];
+	gfp_t gfp_mask = GFP_KERNEL;
 
 	if (!buf || (!memcg && !pgdat))
 		return -EINVAL;
@@ -7792,11 +7792,29 @@ int user_proactive_reclaim(char *buf, struct mem_cgroup *memcg, pg_data_t *pgdat
 			reclaim_options = MEMCG_RECLAIM_MAY_SWAP |
 					  MEMCG_RECLAIM_PROACTIVE;
 			reclaimed = try_to_free_mem_cgroup_pages(memcg,
-						 batch_size, GFP_KERNEL,
+						 batch_size, gfp_mask,
 						 reclaim_options,
 						 swappiness == -1 ? NULL : &swappiness);
 		} else {
-			return -EINVAL;
+			struct scan_control sc = {
+				.gfp_mask = current_gfp_context(gfp_mask),
+				.reclaim_idx = gfp_zone(gfp_mask),
+				.proactive_swappiness = swappiness == -1 ? NULL : &swappiness,
+				.priority = DEF_PRIORITY,
+				.may_writepage = !laptop_mode,
+				.nr_to_reclaim = max(batch_size, SWAP_CLUSTER_MAX),
+				.may_unmap = 1,
+				.may_swap = 1,
+				.proactive = 1,
+			};
+
+			if (test_and_set_bit_lock(PGDAT_RECLAIM_LOCKED,
+						  &pgdat->flags))
+				return -EAGAIN;
+
+			reclaimed = __node_reclaim(pgdat, gfp_mask,
+						   batch_size, &sc);
+			clear_bit_unlock(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
 		}
 
 		if (!reclaimed && !nr_retries--)
@@ -7855,3 +7873,26 @@ void check_move_unevictable_folios(struct folio_batch *fbatch)
 	}
 }
 EXPORT_SYMBOL_GPL(check_move_unevictable_folios);
+
+#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
+static ssize_t reclaim_store(struct device *dev,
+			     struct device_attribute *attr,
+			     const char *buf, size_t count)
+{
+	int ret, nid = dev->id;
+
+	ret = user_proactive_reclaim((char *)buf, NULL, NODE_DATA(nid));
+	return ret ? -EAGAIN : count;
+}
+
+static DEVICE_ATTR_WO(reclaim);
+int reclaim_register_node(struct node *node)
+{
+	return device_create_file(&node->dev, &dev_attr_reclaim);
+}
+
+void reclaim_unregister_node(struct node *node)
+{
+	return device_remove_file(&node->dev, &dev_attr_reclaim);
+}
+#endif
-- 
2.39.5
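
For illustration only, and not part of the patch: the echo example in
the changelog can also be driven from a C program. Below is a minimal
userspace sketch, assuming the sysfs file accepts the same
"<size> [swappiness=N]" string as memory.reclaim. The helper names
node_reclaim_path(), node_reclaim_request() and node_reclaim() are
hypothetical and exist only in this example, as is the convention that
a negative swappiness omits the option.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: build "/sys/devices/system/node/node<nid>/reclaim".
 * Returns 0 on success, -1 if nid is negative or the buffer is too small.
 */
static int node_reclaim_path(char *path, size_t len, int nid)
{
	int n;

	if (nid < 0)
		return -1;
	n = snprintf(path, len, "/sys/devices/system/node/node%d/reclaim", nid);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: build the request string, e.g. "512M swappiness=10".
 * A negative swappiness leaves the option out (assumption of this sketch).
 */
static int node_reclaim_request(char *req, size_t len,
				const char *amount, int swappiness)
{
	int n;

	if (swappiness < 0)
		n = snprintf(req, len, "%s", amount);
	else
		n = snprintf(req, len, "%s swappiness=%d", amount, swappiness);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write one reclaim request for node nid; returns 0 or a negative errno. */
static int node_reclaim(int nid, const char *amount, int swappiness)
{
	char path[64], req[64];
	int fd, err = 0;

	if (node_reclaim_path(path, sizeof(path), nid) ||
	    node_reclaim_request(req, sizeof(req), amount, swappiness))
		return -EINVAL;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	/* Save errno before close() can overwrite it. */
	if (write(fd, req, strlen(req)) < 0)
		err = -errno;
	close(fd);
	return err;
}
```

A caller would use node_reclaim(0, "512M", 10) for the changelog
example on node 0. Note that reclaim_store() in this patch maps every
failure to -EAGAIN, so such a caller cannot tell a malformed request
apart from a node that is busy or could not reclaim the full amount.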