From: Roman Gushchin
To: Davidlohr Bueso
Cc: akpm@linux-foundation.org, mhocko@kernel.org, hannes@cmpxchg.org, shakeel.butt@linux.dev, yosryahmed@google.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 4/4] mm: introduce per-node proactive reclaim interface
In-Reply-To: <20250623185851.830632-5-dave@stgolabs.net> (Davidlohr Bueso's message of "Mon, 23 Jun 2025 11:58:51 -0700")
References: <20250623185851.830632-1-dave@stgolabs.net> <20250623185851.830632-5-dave@stgolabs.net>
Date: Wed, 16 Jul 2025 19:46:25 -0700
Message-ID: <87qzyfr0u6.fsf@linux.dev>

Davidlohr Bueso writes:

> This adds support for allowing proactive reclaim in general on a
> NUMA system.
> A per-node interface extends support beyond a
> memcg-specific interface, respecting the current semantics of
> memory.reclaim: respecting aging LRU and not supporting
> artificially triggering eviction on nodes belonging to non-bottom
> tiers.
>
> This patch allows userspace to do:
>
>     echo "512M swappiness=10" > /sys/devices/system/node/nodeX/reclaim
>
> One of the premises for this is to semantically align as best as
> possible with memory.reclaim. During a brief time memcg did
> support nodemask until 55ab834a86a9 (Revert "mm: add nodes=
> arg to memory.reclaim"), for which semantics around reclaim
> (eviction) vs demotion were not clear, rendering charging
> expectations to be broken.
>
> With this approach:
>
> 1. Users who do not use memcg can benefit from proactive reclaim.
> The memcg interface is not NUMA aware and there are usecases that
> are focusing on NUMA balancing rather than workload memory footprint.
>
> 2. Proactive reclaim on top tiers will trigger demotion, for which
> memory is still byte-addressable. Reclaiming on the bottom nodes
> will trigger evicting to swap (the traditional sense of reclaim).
> This follows the semantics of what is today part of the aging process
> on tiered memory, mirroring what every other form of reclaim does
> (reactive and memcg proactive reclaim). Furthermore per-node proactive
> reclaim is not as susceptible to the memcg charging problem mentioned
> above.
>
> 3. Unlike the nodes= arg, this interface avoids confusing semantics,
> such as what exactly the user wants when mixing top-tier and low-tier
> nodes in the nodemask. Further per-node interface is less exposed to
> "free up memory in my container" usecases, where eviction is intended.
>
> 4. Users that *really* want to free up memory can use proactive reclaim
> on nodes known to be on the bottom tiers to force eviction in a
> natural way - higher access latencies are still better than swap.
> If compelled, while no guarantees and perhaps not worth the effort,
> users could also potentially follow a ladder-like approach to
> eventually free up the memory. Alternatively, perhaps an 'evict' option
> could be added to the parameters for both memory.reclaim and per-node
> interfaces to force this action unconditionally.
>
> Signed-off-by: Davidlohr Bueso

Acked-by: Roman Gushchin

small nit below

> ---
>  Documentation/ABI/stable/sysfs-devices-node |  9 ++++
>  drivers/base/node.c                         |  2 +
>  include/linux/swap.h                        | 16 +++++++
>  mm/vmscan.c                                 | 53 ++++++++++++++++++---
>  4 files changed, 74 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> index a02707cb7cbc..2d0e023f22a7 100644
> --- a/Documentation/ABI/stable/sysfs-devices-node
> +++ b/Documentation/ABI/stable/sysfs-devices-node
> @@ -227,3 +227,12 @@ Contact:	Jiaqi Yan
>  Description:
>  		Of the raw poisoned pages on a NUMA node, how many pages are
>  		recovered by memory error recovery attempt.
> +
> +What:		/sys/devices/system/node/nodeX/reclaim
> +Date:		June 2025
> +Contact:	Linux Memory Management list
> +Description:
> +		Perform user-triggered proactive reclaim on a NUMA node.
> +		This interface is equivalent to the memcg variant.
> +
> +		See Documentation/admin-guide/cgroup-v2.rst
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 6d66382dae65..548b532a2129 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -659,6 +659,7 @@ static int register_node(struct node *node, int num)
>  	} else {
>  		hugetlb_register_node(node);
>  		compaction_register_node(node);
> +		reclaim_register_node(node);
>  	}
>
>  	return error;
> @@ -675,6 +676,7 @@ void unregister_node(struct node *node)
>  {
>  	hugetlb_unregister_node(node);
>  	compaction_unregister_node(node);
> +	reclaim_unregister_node(node);
>  	node_remove_accesses(node);
>  	node_remove_caches(node);
>  	device_unregister(&node->dev);
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index bc0e1c275fc0..dac7ba98783d 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -431,6 +431,22 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages);
>  extern int vm_swappiness;
>  long remove_mapping(struct address_space *mapping, struct folio *folio);
>
> +#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> +extern int reclaim_register_node(struct node *node);
> +extern void reclaim_unregister_node(struct node *node);
> +
> +#else
> +
> +static inline int reclaim_register_node(struct node *node)
> +{
> +	return 0;
> +}
> +
> +static inline void reclaim_unregister_node(struct node *node)
> +{
> +}
> +#endif /* CONFIG_SYSFS && CONFIG_NUMA */
> +
>  #ifdef CONFIG_NUMA
>  extern int sysctl_min_unmapped_ratio;
>  extern int sysctl_min_slab_ratio;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index cdd9cb97fb79..f77feb75c678 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -94,10 +94,8 @@ struct scan_control {
>  	unsigned long anon_cost;
>  	unsigned long file_cost;
>
> -#ifdef CONFIG_MEMCG
>  	/* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
>  	int *proactive_swappiness;
> -#endif
>
>  	/* Can active folios be deactivated as part of reclaim? */
>  #define DEACTIVATE_ANON 1
> @@ -121,7 +119,7 @@ struct scan_control {
>  	/* Has cache_trim_mode failed at least once? */
>  	unsigned int cache_trim_mode_failed:1;
>
> -	/* Proactive reclaim invoked by userspace through memory.reclaim */
> +	/* Proactive reclaim invoked by userspace */
>  	unsigned int proactive:1;
>
>  	/*
> @@ -7732,13 +7730,15 @@ static const match_table_t tokens = {
>  	{ MEMORY_RECLAIM_NULL, NULL },
>  };
>
> -int user_proactive_reclaim(char *buf, struct mem_cgroup *memcg, pg_data_t *pgdat)
> +int user_proactive_reclaim(char *buf,
> +			   struct mem_cgroup *memcg, pg_data_t *pgdat)
>  {
>  	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
>  	unsigned long nr_to_reclaim, nr_reclaimed = 0;
>  	int swappiness = -1;
>  	char *old_buf, *start;
>  	substring_t args[MAX_OPT_ARGS];
> +	gfp_t gfp_mask = GFP_KERNEL;
>
>  	if (!buf || (!memcg && !pgdat))
>  		return -EINVAL;
> @@ -7792,11 +7792,29 @@ int user_proactive_reclaim(char *buf, struct mem_cgroup *memcg, pg_data_t *pgdat
>  			reclaim_options = MEMCG_RECLAIM_MAY_SWAP |
>  					  MEMCG_RECLAIM_PROACTIVE;
>  			reclaimed = try_to_free_mem_cgroup_pages(memcg,
> -						 batch_size, GFP_KERNEL,
> +						 batch_size, gfp_mask,
>  						 reclaim_options,
>  						 swappiness == -1 ? NULL : &swappiness);
>  		} else {
> -			return -EINVAL;
> +			struct scan_control sc = {
> +				.gfp_mask = current_gfp_context(gfp_mask),
> +				.reclaim_idx = gfp_zone(gfp_mask),
> +				.proactive_swappiness = swappiness == -1 ? NULL : &swappiness,
> +				.priority = DEF_PRIORITY,
> +				.may_writepage = !laptop_mode,
> +				.nr_to_reclaim = max(batch_size, SWAP_CLUSTER_MAX),
> +				.may_unmap = 1,
> +				.may_swap = 1,
> +				.proactive = 1,
> +			};
> +
> +			if (test_and_set_bit_lock(PGDAT_RECLAIM_LOCKED,
> +						  &pgdat->flags))
> +				return -EAGAIN;

Isn't EBUSY a better choice here? It would at least let userspace
distinguish the "no reclaimable memory left" case from the "somebody
else is abusing the same interface" case.
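To show why the distinction matters on the userspace side, here is a
minimal sketch. It assumes the contended-lock path returned -EBUSY as
suggested (the posted patch returns -EAGAIN there), and that -EAGAIN
keeps the meaning it has for memory.reclaim, i.e. less was reclaimed
than requested. The enum, both helper names and any path passed in are
made up for illustration; none of them are part of the patch.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical classification of a per-node reclaim request. */
enum reclaim_outcome {
	RECLAIM_OK,        /* write succeeded */
	RECLAIM_SHORT,     /* EAGAIN: could not reclaim the requested amount */
	RECLAIM_CONTENDED, /* EBUSY (assumed): another writer holds the node */
	RECLAIM_FAILED,    /* anything else, e.g. EINVAL or a missing file */
};

/* Map an errno value from write(2) to an outcome. */
static enum reclaim_outcome classify_reclaim_errno(int err)
{
	switch (err) {
	case 0:
		return RECLAIM_OK;
	case EAGAIN:
		return RECLAIM_SHORT;
	case EBUSY:
		return RECLAIM_CONTENDED;
	default:
		return RECLAIM_FAILED;
	}
}

/*
 * Write a request such as "512M swappiness=10" to a reclaim file,
 * e.g. /sys/devices/system/node/node0/reclaim, and classify the result.
 */
static enum reclaim_outcome request_reclaim(const char *path, const char *req)
{
	int err = 0;
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return RECLAIM_FAILED;

	/* Save errno before close(2) has a chance to overwrite it. */
	if (write(fd, req, strlen(req)) < 0)
		err = errno;
	close(fd);

	return classify_reclaim_errno(err);
}
```

With a single -EAGAIN for both cases, a caller cannot tell whether to
back off and retry shortly (contention) or to stop asking because the
node has little left to reclaim.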