linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH -next v2 0/4] mm: per-node proactive reclaim
@ 2025-06-23 18:58 Davidlohr Bueso
  2025-06-23 18:58 ` [PATCH 1/4] mm/vmscan: respect psi_memstall region in node reclaim Davidlohr Bueso
                   ` (5 more replies)
  0 siblings, 6 replies; 28+ messages in thread
From: Davidlohr Bueso @ 2025-06-23 18:58 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, hannes, roman.gushchin, shakeel.butt, yosryahmed,
	linux-mm, linux-kernel, dave

Hello,

This is a tardy follow up to v1:
https://lore.kernel.org/linux-mm/20240904162740.1043168-1-dave@stgolabs.net/

Changes:
 - Not a change perse, but further discussed with mhocko potential usecases
   to justify upstreaming this interface. Nowadays NUMA represents the common
   abstraction for memory tiering representing devices of various performance
   characteristics. This interface makes a lot of sense given memcg's lack
   of NUMA awareness.
   
 - Consolidate both memcg and per-node flavors into a common helper. (Yosry)

Patch 1 is a small fixlet independent of the rest of the series.
Patches 2-3 make some of the machinery more generic.
Patch 4 adds the sysfs interface (which has further been deemed ok albeit
not following the one value per file "rule").

Please consider for v6.16.

Thanks!

Davidlohr Bueso (4):
  mm/vmscan: respect psi_memstall region in node reclaim
  mm/memcg: make memory.reclaim interface generic
  mm/vmscan: make __node_reclaim() more generic
  mm: introduce per-node proactive reclaim interface

 Documentation/ABI/stable/sysfs-devices-node |   9 +
 drivers/base/node.c                         |   2 +
 include/linux/swap.h                        |  16 ++
 mm/internal.h                               |   2 +
 mm/memcontrol.c                             |  77 +-------
 mm/vmscan.c                                 | 195 +++++++++++++++++---
 6 files changed, 201 insertions(+), 100 deletions(-)

--
2.39.5



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/4] mm/vmscan: respect psi_memstall region in node reclaim
  2025-06-23 18:58 [PATCH -next v2 0/4] mm: per-node proactive reclaim Davidlohr Bueso
@ 2025-06-23 18:58 ` Davidlohr Bueso
  2025-06-25 17:08   ` Shakeel Butt
  2025-07-17  1:44   ` Roman Gushchin
  2025-06-23 18:58 ` [PATCH 2/4] mm/memcg: make memory.reclaim interface generic Davidlohr Bueso
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 28+ messages in thread
From: Davidlohr Bueso @ 2025-06-23 18:58 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, hannes, roman.gushchin, shakeel.butt, yosryahmed,
	linux-mm, linux-kernel, dave

... rather benign but keep proper ending order.

Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a93a1ba9009e..c13c01eb0b42 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -7653,8 +7653,8 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 	set_task_reclaim_state(p, NULL);
 	memalloc_noreclaim_restore(noreclaim_flag);
 	fs_reclaim_release(sc.gfp_mask);
-	psi_memstall_leave(&pflags);
 	delayacct_freepages_end();
+	psi_memstall_leave(&pflags);
 
 	trace_mm_vmscan_node_reclaim_end(sc.nr_reclaimed);
 
-- 
2.39.5



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 2/4] mm/memcg: make memory.reclaim interface generic
  2025-06-23 18:58 [PATCH -next v2 0/4] mm: per-node proactive reclaim Davidlohr Bueso
  2025-06-23 18:58 ` [PATCH 1/4] mm/vmscan: respect psi_memstall region in node reclaim Davidlohr Bueso
@ 2025-06-23 18:58 ` Davidlohr Bueso
  2025-06-23 21:45   ` Andrew Morton
                     ` (3 more replies)
  2025-06-23 18:58 ` [PATCH 3/4] mm/vmscan: make __node_reclaim() more generic Davidlohr Bueso
                   ` (3 subsequent siblings)
  5 siblings, 4 replies; 28+ messages in thread
From: Davidlohr Bueso @ 2025-06-23 18:58 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, hannes, roman.gushchin, shakeel.butt, yosryahmed,
	linux-mm, linux-kernel, dave

This adds a general call for both parsing as well as the
common reclaim semantics. memcg is still the only user and
no change in semantics.

Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
---
 mm/internal.h   |  2 +
 mm/memcontrol.c | 77 ++------------------------------------
 mm/vmscan.c     | 98 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 104 insertions(+), 73 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 3823fb356d3b..fc4262262b31 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -517,6 +517,8 @@ extern unsigned long highest_memmap_pfn;
 bool folio_isolate_lru(struct folio *folio);
 void folio_putback_lru(struct folio *folio);
 extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason);
+int user_proactive_reclaim(char *buf,
+			   struct mem_cgroup *memcg, pg_data_t *pgdat);
 
 /*
  * in mm/rmap.c:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 902da8a9c643..015e406eadfa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -51,7 +51,6 @@
 #include <linux/spinlock.h>
 #include <linux/fs.h>
 #include <linux/seq_file.h>
-#include <linux/parser.h>
 #include <linux/vmpressure.h>
 #include <linux/memremap.h>
 #include <linux/mm_inline.h>
@@ -4566,83 +4565,15 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
-enum {
-	MEMORY_RECLAIM_SWAPPINESS = 0,
-	MEMORY_RECLAIM_SWAPPINESS_MAX,
-	MEMORY_RECLAIM_NULL,
-};
-
-static const match_table_t tokens = {
-	{ MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
-	{ MEMORY_RECLAIM_SWAPPINESS_MAX, "swappiness=max"},
-	{ MEMORY_RECLAIM_NULL, NULL },
-};
-
 static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 			      size_t nbytes, loff_t off)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
-	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
-	unsigned long nr_to_reclaim, nr_reclaimed = 0;
-	int swappiness = -1;
-	unsigned int reclaim_options;
-	char *old_buf, *start;
-	substring_t args[MAX_OPT_ARGS];
-
-	buf = strstrip(buf);
-
-	old_buf = buf;
-	nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
-	if (buf == old_buf)
-		return -EINVAL;
-
-	buf = strstrip(buf);
-
-	while ((start = strsep(&buf, " ")) != NULL) {
-		if (!strlen(start))
-			continue;
-		switch (match_token(start, tokens, args)) {
-		case MEMORY_RECLAIM_SWAPPINESS:
-			if (match_int(&args[0], &swappiness))
-				return -EINVAL;
-			if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS)
-				return -EINVAL;
-			break;
-		case MEMORY_RECLAIM_SWAPPINESS_MAX:
-			swappiness = SWAPPINESS_ANON_ONLY;
-			break;
-		default:
-			return -EINVAL;
-		}
-	}
-
-	reclaim_options	= MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
-	while (nr_reclaimed < nr_to_reclaim) {
-		/* Will converge on zero, but reclaim enforces a minimum */
-		unsigned long batch_size = (nr_to_reclaim - nr_reclaimed) / 4;
-		unsigned long reclaimed;
-
-		if (signal_pending(current))
-			return -EINTR;
-
-		/*
-		 * This is the final attempt, drain percpu lru caches in the
-		 * hope of introducing more evictable pages for
-		 * try_to_free_mem_cgroup_pages().
-		 */
-		if (!nr_retries)
-			lru_add_drain_all();
-
-		reclaimed = try_to_free_mem_cgroup_pages(memcg,
-					batch_size, GFP_KERNEL,
-					reclaim_options,
-					swappiness == -1 ? NULL : &swappiness);
-
-		if (!reclaimed && !nr_retries--)
-			return -EAGAIN;
+	int ret;
 
-		nr_reclaimed += reclaimed;
-	}
+	ret = user_proactive_reclaim(buf, memcg, NULL);
+	if (ret)
+		return ret;
 
 	return nbytes;
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c13c01eb0b42..63ddec550c3b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -57,6 +57,7 @@
 #include <linux/rculist_nulls.h>
 #include <linux/random.h>
 #include <linux/mmu_notifier.h>
+#include <linux/parser.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -6714,6 +6715,15 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 
 	return nr_reclaimed;
 }
+#else
+unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
+					   unsigned long nr_pages,
+					   gfp_t gfp_mask,
+					   unsigned int reclaim_options,
+					   int *swappiness)
+{
+	return 0;
+}
 #endif
 
 static void kswapd_age_node(struct pglist_data *pgdat, struct scan_control *sc)
@@ -7708,6 +7718,94 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 
 	return ret;
 }
+
+enum {
+	MEMORY_RECLAIM_SWAPPINESS = 0,
+	MEMORY_RECLAIM_SWAPPINESS_MAX,
+	MEMORY_RECLAIM_NULL,
+};
+static const match_table_t tokens = {
+	{ MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
+	{ MEMORY_RECLAIM_SWAPPINESS_MAX, "swappiness=max"},
+	{ MEMORY_RECLAIM_NULL, NULL },
+};
+
+int user_proactive_reclaim(char *buf, struct mem_cgroup *memcg, pg_data_t *pgdat)
+{
+	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
+	unsigned long nr_to_reclaim, nr_reclaimed = 0;
+	int swappiness = -1;
+	char *old_buf, *start;
+	substring_t args[MAX_OPT_ARGS];
+
+	if (!buf || (!memcg && !pgdat))
+		return -EINVAL;
+
+	buf = strstrip(buf);
+
+	old_buf = buf;
+	nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
+	if (buf == old_buf)
+		return -EINVAL;
+
+	buf = strstrip(buf);
+
+	while ((start = strsep(&buf, " ")) != NULL) {
+		if (!strlen(start))
+			continue;
+		switch (match_token(start, tokens, args)) {
+		case MEMORY_RECLAIM_SWAPPINESS:
+			if (match_int(&args[0], &swappiness))
+				return -EINVAL;
+			if (swappiness < MIN_SWAPPINESS ||
+			    swappiness > MAX_SWAPPINESS)
+				return -EINVAL;
+			break;
+		case MEMORY_RECLAIM_SWAPPINESS_MAX:
+			swappiness = SWAPPINESS_ANON_ONLY;
+			break;
+		default:
+			return -EINVAL;
+		}
+	}
+
+	while (nr_reclaimed < nr_to_reclaim) {
+		/* Will converge on zero, but reclaim enforces a minimum */
+		unsigned long batch_size = (nr_to_reclaim - nr_reclaimed) / 4;
+		unsigned long reclaimed;
+
+		if (signal_pending(current))
+			return -EINTR;
+
+		/*
+		 * This is the final attempt, drain percpu lru caches in the
+		 * hope of introducing more evictable pages.
+		 */
+		if (!nr_retries)
+			lru_add_drain_all();
+
+		if (memcg) {
+			unsigned int reclaim_options;
+
+			reclaim_options = MEMCG_RECLAIM_MAY_SWAP |
+					  MEMCG_RECLAIM_PROACTIVE;
+			reclaimed = try_to_free_mem_cgroup_pages(memcg,
+						 batch_size, GFP_KERNEL,
+						 reclaim_options,
+						 swappiness == -1 ? NULL : &swappiness);
+		} else {
+			return -EINVAL;
+		}
+
+		if (!reclaimed && !nr_retries--)
+			return -EAGAIN;
+
+		nr_reclaimed += reclaimed;
+	}
+
+	return 0;
+}
+
 #endif
 
 /**
-- 
2.39.5



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 3/4] mm/vmscan: make __node_reclaim() more generic
  2025-06-23 18:58 [PATCH -next v2 0/4] mm: per-node proactive reclaim Davidlohr Bueso
  2025-06-23 18:58 ` [PATCH 1/4] mm/vmscan: respect psi_memstall region in node reclaim Davidlohr Bueso
  2025-06-23 18:58 ` [PATCH 2/4] mm/memcg: make memory.reclaim interface generic Davidlohr Bueso
@ 2025-06-23 18:58 ` Davidlohr Bueso
  2025-07-17  2:03   ` Roman Gushchin
  2025-07-17 22:25   ` Shakeel Butt
  2025-06-23 18:58 ` [PATCH 4/4] mm: introduce per-node proactive reclaim interface Davidlohr Bueso
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 28+ messages in thread
From: Davidlohr Bueso @ 2025-06-23 18:58 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, hannes, roman.gushchin, shakeel.butt, yosryahmed,
	linux-mm, linux-kernel, dave

As this will be called from non page allocator paths for
proactive reclaim, allow users to pass the sc and nr of
pages, and adjust the return value as well. No change in
semantics.

Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
---
 mm/vmscan.c | 48 +++++++++++++++++++++++++-----------------------
 1 file changed, 25 insertions(+), 23 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 63ddec550c3b..cdd9cb97fb79 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -7618,36 +7618,26 @@ static unsigned long node_pagecache_reclaimable(struct pglist_data *pgdat)
 /*
  * Try to free up some pages from this node through reclaim.
  */
-static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
+static unsigned long __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask,
+				    unsigned long nr_pages,
+				    struct scan_control *sc)
 {
-	/* Minimum pages needed in order to stay on node */
-	const unsigned long nr_pages = 1 << order;
 	struct task_struct *p = current;
 	unsigned int noreclaim_flag;
-	struct scan_control sc = {
-		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
-		.gfp_mask = current_gfp_context(gfp_mask),
-		.order = order,
-		.priority = NODE_RECLAIM_PRIORITY,
-		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
-		.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
-		.may_swap = 1,
-		.reclaim_idx = gfp_zone(gfp_mask),
-	};
 	unsigned long pflags;
 
-	trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, order,
-					   sc.gfp_mask);
+	trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, sc->order,
+					   sc->gfp_mask);
 
 	cond_resched();
 	psi_memstall_enter(&pflags);
 	delayacct_freepages_start();
-	fs_reclaim_acquire(sc.gfp_mask);
+	fs_reclaim_acquire(sc->gfp_mask);
 	/*
 	 * We need to be able to allocate from the reserves for RECLAIM_UNMAP
 	 */
 	noreclaim_flag = memalloc_noreclaim_save();
-	set_task_reclaim_state(p, &sc.reclaim_state);
+	set_task_reclaim_state(p, &sc->reclaim_state);
 
 	if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages ||
 	    node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B) > pgdat->min_slab_pages) {
@@ -7656,24 +7646,36 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 		 * priorities until we have enough memory freed.
 		 */
 		do {
-			shrink_node(pgdat, &sc);
-		} while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
+			shrink_node(pgdat, sc);
+		} while (sc->nr_reclaimed < nr_pages && --sc->priority >= 0);
 	}
 
 	set_task_reclaim_state(p, NULL);
 	memalloc_noreclaim_restore(noreclaim_flag);
-	fs_reclaim_release(sc.gfp_mask);
+	fs_reclaim_release(sc->gfp_mask);
 	delayacct_freepages_end();
 	psi_memstall_leave(&pflags);
 
-	trace_mm_vmscan_node_reclaim_end(sc.nr_reclaimed);
+	trace_mm_vmscan_node_reclaim_end(sc->nr_reclaimed);
 
-	return sc.nr_reclaimed >= nr_pages;
+	return sc->nr_reclaimed;
 }
 
 int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 {
 	int ret;
+	/* Minimum pages needed in order to stay on node */
+	const unsigned long nr_pages = 1 << order;
+	struct scan_control sc = {
+		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
+		.gfp_mask = current_gfp_context(gfp_mask),
+		.order = order,
+		.priority = NODE_RECLAIM_PRIORITY,
+		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
+		.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
+		.may_swap = 1,
+		.reclaim_idx = gfp_zone(gfp_mask),
+	};
 
 	/*
 	 * Node reclaim reclaims unmapped file backed pages and
@@ -7708,7 +7710,7 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 	if (test_and_set_bit_lock(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
 		return NODE_RECLAIM_NOSCAN;
 
-	ret = __node_reclaim(pgdat, gfp_mask, order);
+	ret = __node_reclaim(pgdat, gfp_mask, nr_pages, &sc) >= nr_pages;
 	clear_bit_unlock(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
 
 	if (ret)
-- 
2.39.5



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 4/4] mm: introduce per-node proactive reclaim interface
  2025-06-23 18:58 [PATCH -next v2 0/4] mm: per-node proactive reclaim Davidlohr Bueso
                   ` (2 preceding siblings ...)
  2025-06-23 18:58 ` [PATCH 3/4] mm/vmscan: make __node_reclaim() more generic Davidlohr Bueso
@ 2025-06-23 18:58 ` Davidlohr Bueso
  2025-06-25 23:10   ` Shakeel Butt
                     ` (3 more replies)
  2025-06-23 21:50 ` [PATCH -next v2 0/4] mm: per-node proactive reclaim Andrew Morton
  2025-07-16  0:24 ` Andrew Morton
  5 siblings, 4 replies; 28+ messages in thread
From: Davidlohr Bueso @ 2025-06-23 18:58 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, hannes, roman.gushchin, shakeel.butt, yosryahmed,
	linux-mm, linux-kernel, dave

This adds support for allowing proactive reclaim in general on a
NUMA system. A per-node interface extends support for beyond a
memcg-specific interface, respecting the current semantics of
memory.reclaim: respecting aging LRU and not supporting
artificially triggering eviction on nodes belonging to non-bottom
tiers.

This patch allows userspace to do:

     echo "512M swappiness=10" > /sys/devices/system/node/nodeX/reclaim

One of the premises for this is to semantically align as best as
possible with memory.reclaim. During a brief time memcg did
support nodemask until 55ab834a86a9 (Revert "mm: add nodes=
arg to memory.reclaim"), for which semantics around reclaim
(eviction) vs demotion were not clear, rendering charging
expectations to be broken.

With this approach:

1. Users who do not use memcg can benefit from proactive reclaim.
The memcg interface is not NUMA aware and there are usecases that
are focusing on NUMA balancing rather than workload memory footprint.

2. Proactive reclaim on top tiers will trigger demotion, for which
memory is still byte-addressable. Reclaiming on the bottom nodes
will trigger evicting to swap (the traditional sense of reclaim).
This follows the semantics of what is today part of the aging process
on tiered memory, mirroring what every other form of reclaim does
(reactive and memcg proactive reclaim). Furthermore per-node proactive
reclaim is not as susceptible to the memcg charging problem mentioned
above.

3. Unlike the nodes= arg, this interface avoids confusing semantics,
such as what exactly the user wants when mixing top-tier and low-tier
nodes in the nodemask. Further per-node interface is less exposed to
"free up memory in my container" usecases, where eviction is intended.

4. Users that *really* want to free up memory can use proactive reclaim
on nodes knowingly to be on the bottom tiers to force eviction in a
natural way - higher access latencies are still better than swap.
If compelled, while no guarantees and perhaps not worth the effort,
users could also also potentially follow a ladder-like approach to
eventually free up the memory. Alternatively, perhaps an 'evict' option
could be added to the parameters for both memory.reclaim and per-node
interfaces to force this action unconditionally.

Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
---
 Documentation/ABI/stable/sysfs-devices-node |  9 ++++
 drivers/base/node.c                         |  2 +
 include/linux/swap.h                        | 16 +++++++
 mm/vmscan.c                                 | 53 ++++++++++++++++++---
 4 files changed, 74 insertions(+), 6 deletions(-)

diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
index a02707cb7cbc..2d0e023f22a7 100644
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -227,3 +227,12 @@ Contact:	Jiaqi Yan <jiaqiyan@google.com>
 Description:
 		Of the raw poisoned pages on a NUMA node, how many pages are
 		recovered by memory error recovery attempt.
+
+What:		/sys/devices/system/node/nodeX/reclaim
+Date:		June 2025
+Contact:	Linux Memory Management list <linux-mm@kvack.org>
+Description:
+		Perform user-triggered proactive reclaim on a NUMA node.
+		This interface is equivalent to the memcg variant.
+
+		See Documentation/admin-guide/cgroup-v2.rst
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 6d66382dae65..548b532a2129 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -659,6 +659,7 @@ static int register_node(struct node *node, int num)
 	} else {
 		hugetlb_register_node(node);
 		compaction_register_node(node);
+		reclaim_register_node(node);
 	}
 
 	return error;
@@ -675,6 +676,7 @@ void unregister_node(struct node *node)
 {
 	hugetlb_unregister_node(node);
 	compaction_unregister_node(node);
+	reclaim_unregister_node(node);
 	node_remove_accesses(node);
 	node_remove_caches(node);
 	device_unregister(&node->dev);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index bc0e1c275fc0..dac7ba98783d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -431,6 +431,22 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
 long remove_mapping(struct address_space *mapping, struct folio *folio);
 
+#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
+extern int reclaim_register_node(struct node *node);
+extern void reclaim_unregister_node(struct node *node);
+
+#else
+
+static inline int reclaim_register_node(struct node *node)
+{
+	return 0;
+}
+
+static inline void reclaim_unregister_node(struct node *node)
+{
+}
+#endif /* CONFIG_SYSFS && CONFIG_NUMA */
+
 #ifdef CONFIG_NUMA
 extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cdd9cb97fb79..f77feb75c678 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -94,10 +94,8 @@ struct scan_control {
 	unsigned long	anon_cost;
 	unsigned long	file_cost;
 
-#ifdef CONFIG_MEMCG
 	/* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
 	int *proactive_swappiness;
-#endif
 
 	/* Can active folios be deactivated as part of reclaim? */
 #define DEACTIVATE_ANON 1
@@ -121,7 +119,7 @@ struct scan_control {
 	/* Has cache_trim_mode failed at least once? */
 	unsigned int cache_trim_mode_failed:1;
 
-	/* Proactive reclaim invoked by userspace through memory.reclaim */
+	/* Proactive reclaim invoked by userspace */
 	unsigned int proactive:1;
 
 	/*
@@ -7732,13 +7730,15 @@ static const match_table_t tokens = {
 	{ MEMORY_RECLAIM_NULL, NULL },
 };
 
-int user_proactive_reclaim(char *buf, struct mem_cgroup *memcg, pg_data_t *pgdat)
+int user_proactive_reclaim(char *buf,
+			   struct mem_cgroup *memcg, pg_data_t *pgdat)
 {
 	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
 	unsigned long nr_to_reclaim, nr_reclaimed = 0;
 	int swappiness = -1;
 	char *old_buf, *start;
 	substring_t args[MAX_OPT_ARGS];
+	gfp_t gfp_mask = GFP_KERNEL;
 
 	if (!buf || (!memcg && !pgdat))
 		return -EINVAL;
@@ -7792,11 +7792,29 @@ int user_proactive_reclaim(char *buf, struct mem_cgroup *memcg, pg_data_t *pgdat
 			reclaim_options = MEMCG_RECLAIM_MAY_SWAP |
 					  MEMCG_RECLAIM_PROACTIVE;
 			reclaimed = try_to_free_mem_cgroup_pages(memcg,
-						 batch_size, GFP_KERNEL,
+						 batch_size, gfp_mask,
 						 reclaim_options,
 						 swappiness == -1 ? NULL : &swappiness);
 		} else {
-			return -EINVAL;
+			struct scan_control sc = {
+				.gfp_mask = current_gfp_context(gfp_mask),
+				.reclaim_idx = gfp_zone(gfp_mask),
+				.proactive_swappiness = swappiness == -1 ? NULL : &swappiness,
+				.priority = DEF_PRIORITY,
+				.may_writepage = !laptop_mode,
+				.nr_to_reclaim = max(batch_size, SWAP_CLUSTER_MAX),
+				.may_unmap = 1,
+				.may_swap = 1,
+				.proactive = 1,
+			};
+
+			if (test_and_set_bit_lock(PGDAT_RECLAIM_LOCKED,
+						  &pgdat->flags))
+				return -EAGAIN;
+
+			reclaimed = __node_reclaim(pgdat, gfp_mask,
+						   batch_size, &sc);
+			clear_bit_unlock(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
 		}
 
 		if (!reclaimed && !nr_retries--)
@@ -7855,3 +7873,26 @@ void check_move_unevictable_folios(struct folio_batch *fbatch)
 	}
 }
 EXPORT_SYMBOL_GPL(check_move_unevictable_folios);
+
+#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
+static ssize_t reclaim_store(struct device *dev,
+			     struct device_attribute *attr,
+			     const char *buf, size_t count)
+{
+	int ret, nid = dev->id;
+
+	ret = user_proactive_reclaim((char *)buf, NULL, NODE_DATA(nid));
+	return ret ? -EAGAIN : count;
+}
+
+static DEVICE_ATTR_WO(reclaim);
+int reclaim_register_node(struct node *node)
+{
+	return device_create_file(&node->dev, &dev_attr_reclaim);
+}
+
+void reclaim_unregister_node(struct node *node)
+{
+	return device_remove_file(&node->dev, &dev_attr_reclaim);
+}
+#endif
-- 
2.39.5



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/4] mm/memcg: make memory.reclaim interface generic
  2025-06-23 18:58 ` [PATCH 2/4] mm/memcg: make memory.reclaim interface generic Davidlohr Bueso
@ 2025-06-23 21:45   ` Andrew Morton
  2025-06-23 23:36     ` Davidlohr Bueso
  2025-06-24 18:26   ` Klara Modin
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2025-06-23 21:45 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: mhocko, hannes, roman.gushchin, shakeel.butt, yosryahmed,
	linux-mm, linux-kernel

On Mon, 23 Jun 2025 11:58:49 -0700 Davidlohr Bueso <dave@stgolabs.net> wrote:

> This adds a general call for both parsing as well as the
> common reclaim semantics. memcg is still the only user and
> no change in semantics.
> 
> +int user_proactive_reclaim(char *buf,
> +			   struct mem_cgroup *memcg, pg_data_t *pgdat);

Feeling nitty, is this a good name for it?  It's hard to imagine what a
function called "user_proactive_reclaim" actually does.

That it isn't documented isn't helpful either!


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next v2 0/4] mm: per-node proactive reclaim
  2025-06-23 18:58 [PATCH -next v2 0/4] mm: per-node proactive reclaim Davidlohr Bueso
                   ` (3 preceding siblings ...)
  2025-06-23 18:58 ` [PATCH 4/4] mm: introduce per-node proactive reclaim interface Davidlohr Bueso
@ 2025-06-23 21:50 ` Andrew Morton
  2025-07-16  0:24 ` Andrew Morton
  5 siblings, 0 replies; 28+ messages in thread
From: Andrew Morton @ 2025-06-23 21:50 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: mhocko, hannes, roman.gushchin, shakeel.butt, yosryahmed,
	linux-mm, linux-kernel

On Mon, 23 Jun 2025 11:58:47 -0700 Davidlohr Bueso <dave@stgolabs.net> wrote:

> This is a tardy follow up to v1:
> https://lore.kernel.org/linux-mm/20240904162740.1043168-1-dave@stgolabs.net/

Cool, I'll add it to mm-new for testing.

The v2 series didn't have a [0/N] so I scraped the words from the v1
series.  These were almost a copy of the [4/4] patch changelog but
there seem to have been some changes.  Please do maintain and resend the
cover letter verbiage.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/4] mm/memcg: make memory.reclaim interface generic
  2025-06-23 21:45   ` Andrew Morton
@ 2025-06-23 23:36     ` Davidlohr Bueso
  0 siblings, 0 replies; 28+ messages in thread
From: Davidlohr Bueso @ 2025-06-23 23:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mhocko, hannes, roman.gushchin, shakeel.butt, yosryahmed,
	linux-mm, linux-kernel

On Mon, 23 Jun 2025, Andrew Morton wrote:

>On Mon, 23 Jun 2025 11:58:49 -0700 Davidlohr Bueso <dave@stgolabs.net> wrote:
>
>> This adds a general call for both parsing as well as the
>> common reclaim semantics. memcg is still the only user and
>> no change in semantics.
>>
>> +int user_proactive_reclaim(char *buf,
>> +			   struct mem_cgroup *memcg, pg_data_t *pgdat);
>
>Feeling nitty, is this a good name for it?  It's hard to imagine what a
>function called "user_proactive_reclaim" actually does.

I'm open to another name, sure. But imo the chosen one is actually pretty
descriptive: you know it's coming from userspace (justifying the 'buf'),
you know this is not about memory pressure and the memcg/pgdat parameters
tell the possible interfaces. Would prefixing a 'do_' be any better?

>That it isn't documented isn't helpful either!

I had done this but felt rather redundant and unnecessary, and further
don't expect for it go gain any other users. But ok, will add.

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/4] mm/memcg: make memory.reclaim interface generic
  2025-06-23 18:58 ` [PATCH 2/4] mm/memcg: make memory.reclaim interface generic Davidlohr Bueso
  2025-06-23 21:45   ` Andrew Morton
@ 2025-06-24 18:26   ` Klara Modin
  2025-07-17  1:58   ` Roman Gushchin
  2025-07-17 22:17   ` Shakeel Butt
  3 siblings, 0 replies; 28+ messages in thread
From: Klara Modin @ 2025-06-24 18:26 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: akpm, mhocko, hannes, roman.gushchin, shakeel.butt, yosryahmed,
	linux-mm, linux-kernel

Hi,

On 2025-06-23 11:58:49 -0700, Davidlohr Bueso wrote:
> This adds a general call for both parsing as well as the
> common reclaim semantics. memcg is still the only user and
> no change in semantics.
> 
> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
> ---
>  mm/internal.h   |  2 +
>  mm/memcontrol.c | 77 ++------------------------------------
>  mm/vmscan.c     | 98 +++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 104 insertions(+), 73 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 3823fb356d3b..fc4262262b31 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -517,6 +517,8 @@ extern unsigned long highest_memmap_pfn;
>  bool folio_isolate_lru(struct folio *folio);
>  void folio_putback_lru(struct folio *folio);
>  extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason);
> +int user_proactive_reclaim(char *buf,
> +			   struct mem_cgroup *memcg, pg_data_t *pgdat);
>  
>  /*
>   * in mm/rmap.c:
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 902da8a9c643..015e406eadfa 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -51,7 +51,6 @@
>  #include <linux/spinlock.h>
>  #include <linux/fs.h>
>  #include <linux/seq_file.h>
> -#include <linux/parser.h>
>  #include <linux/vmpressure.h>
>  #include <linux/memremap.h>
>  #include <linux/mm_inline.h>
> @@ -4566,83 +4565,15 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
>  	return nbytes;
>  }
>  
> -enum {
> -	MEMORY_RECLAIM_SWAPPINESS = 0,
> -	MEMORY_RECLAIM_SWAPPINESS_MAX,
> -	MEMORY_RECLAIM_NULL,
> -};
> -
> -static const match_table_t tokens = {
> -	{ MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
> -	{ MEMORY_RECLAIM_SWAPPINESS_MAX, "swappiness=max"},
> -	{ MEMORY_RECLAIM_NULL, NULL },
> -};
> -
>  static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
>  			      size_t nbytes, loff_t off)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> -	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> -	unsigned long nr_to_reclaim, nr_reclaimed = 0;
> -	int swappiness = -1;
> -	unsigned int reclaim_options;
> -	char *old_buf, *start;
> -	substring_t args[MAX_OPT_ARGS];
> -
> -	buf = strstrip(buf);
> -
> -	old_buf = buf;
> -	nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
> -	if (buf == old_buf)
> -		return -EINVAL;
> -
> -	buf = strstrip(buf);
> -
> -	while ((start = strsep(&buf, " ")) != NULL) {
> -		if (!strlen(start))
> -			continue;
> -		switch (match_token(start, tokens, args)) {
> -		case MEMORY_RECLAIM_SWAPPINESS:
> -			if (match_int(&args[0], &swappiness))
> -				return -EINVAL;
> -			if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS)
> -				return -EINVAL;
> -			break;
> -		case MEMORY_RECLAIM_SWAPPINESS_MAX:
> -			swappiness = SWAPPINESS_ANON_ONLY;
> -			break;
> -		default:
> -			return -EINVAL;
> -		}
> -	}
> -
> -	reclaim_options	= MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
> -	while (nr_reclaimed < nr_to_reclaim) {
> -		/* Will converge on zero, but reclaim enforces a minimum */
> -		unsigned long batch_size = (nr_to_reclaim - nr_reclaimed) / 4;
> -		unsigned long reclaimed;
> -
> -		if (signal_pending(current))
> -			return -EINTR;
> -
> -		/*
> -		 * This is the final attempt, drain percpu lru caches in the
> -		 * hope of introducing more evictable pages for
> -		 * try_to_free_mem_cgroup_pages().
> -		 */
> -		if (!nr_retries)
> -			lru_add_drain_all();
> -
> -		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> -					batch_size, GFP_KERNEL,
> -					reclaim_options,
> -					swappiness == -1 ? NULL : &swappiness);
> -
> -		if (!reclaimed && !nr_retries--)
> -			return -EAGAIN;
> +	int ret;
>  
> -		nr_reclaimed += reclaimed;
> -	}

> +	ret = user_proactive_reclaim(buf, memcg, NULL);

This is outside CONFIG_NUMA.

> +	if (ret)
> +		return ret;
>  
>  	return nbytes;
>  }
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c13c01eb0b42..63ddec550c3b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -57,6 +57,7 @@
>  #include <linux/rculist_nulls.h>
>  #include <linux/random.h>
>  #include <linux/mmu_notifier.h>
> +#include <linux/parser.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -6714,6 +6715,15 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  
>  	return nr_reclaimed;
>  }
> +#else
> +unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> +					   unsigned long nr_pages,
> +					   gfp_t gfp_mask,
> +					   unsigned int reclaim_options,
> +					   int *swappiness)
> +{
> +	return 0;
> +}
>  #endif
>  
>  static void kswapd_age_node(struct pglist_data *pgdat, struct scan_control *sc)
> @@ -7708,6 +7718,94 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
>  
>  	return ret;
>  }
> +
> +enum {
> +	MEMORY_RECLAIM_SWAPPINESS = 0,
> +	MEMORY_RECLAIM_SWAPPINESS_MAX,
> +	MEMORY_RECLAIM_NULL,
> +};
> +static const match_table_t tokens = {
> +	{ MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
> +	{ MEMORY_RECLAIM_SWAPPINESS_MAX, "swappiness=max"},
> +	{ MEMORY_RECLAIM_NULL, NULL },
> +};
> +
> +int user_proactive_reclaim(char *buf, struct mem_cgroup *memcg, pg_data_t *pgdat)
> +{
> +	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> +	unsigned long nr_to_reclaim, nr_reclaimed = 0;
> +	int swappiness = -1;
> +	char *old_buf, *start;
> +	substring_t args[MAX_OPT_ARGS];
> +
> +	if (!buf || (!memcg && !pgdat))
> +		return -EINVAL;
> +
> +	buf = strstrip(buf);
> +
> +	old_buf = buf;
> +	nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
> +	if (buf == old_buf)
> +		return -EINVAL;
> +
> +	buf = strstrip(buf);
> +
> +	while ((start = strsep(&buf, " ")) != NULL) {
> +		if (!strlen(start))
> +			continue;
> +		switch (match_token(start, tokens, args)) {
> +		case MEMORY_RECLAIM_SWAPPINESS:
> +			if (match_int(&args[0], &swappiness))
> +				return -EINVAL;
> +			if (swappiness < MIN_SWAPPINESS ||
> +			    swappiness > MAX_SWAPPINESS)
> +				return -EINVAL;
> +			break;
> +		case MEMORY_RECLAIM_SWAPPINESS_MAX:
> +			swappiness = SWAPPINESS_ANON_ONLY;
> +			break;
> +		default:
> +			return -EINVAL;
> +		}
> +	}
> +
> +	while (nr_reclaimed < nr_to_reclaim) {
> +		/* Will converge on zero, but reclaim enforces a minimum */
> +		unsigned long batch_size = (nr_to_reclaim - nr_reclaimed) / 4;
> +		unsigned long reclaimed;
> +
> +		if (signal_pending(current))
> +			return -EINTR;
> +
> +		/*
> +		 * This is the final attempt, drain percpu lru caches in the
> +		 * hope of introducing more evictable pages.
> +		 */
> +		if (!nr_retries)
> +			lru_add_drain_all();
> +
> +		if (memcg) {
> +			unsigned int reclaim_options;
> +
> +			reclaim_options = MEMCG_RECLAIM_MAY_SWAP |
> +					  MEMCG_RECLAIM_PROACTIVE;
> +			reclaimed = try_to_free_mem_cgroup_pages(memcg,
> +						 batch_size, GFP_KERNEL,
> +						 reclaim_options,
> +						 swappiness == -1 ? NULL : &swappiness);
> +		} else {
> +			return -EINVAL;
> +		}
> +
> +		if (!reclaimed && !nr_retries--)
> +			return -EAGAIN;
> +
> +		nr_reclaimed += reclaimed;
> +	}
> +
> +	return 0;
> +}
> +
>  #endif

Should this really be inside CONFIG_NUMA? It was moved from outside of
CONFIG_NUMA where it's now called which results in a build failure if
it's disabled. Or is there a stub missing?

>  
>  /**
> -- 
> 2.39.5
> 

Regards,
Klara Modin


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/4] mm/vmscan: respect psi_memstall region in node reclaim
  2025-06-23 18:58 ` [PATCH 1/4] mm/vmscan: respect psi_memstall region in node reclaim Davidlohr Bueso
@ 2025-06-25 17:08   ` Shakeel Butt
  2025-07-17  1:44   ` Roman Gushchin
  1 sibling, 0 replies; 28+ messages in thread
From: Shakeel Butt @ 2025-06-25 17:08 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: akpm, mhocko, hannes, roman.gushchin, yosryahmed, linux-mm,
	linux-kernel

On Mon, Jun 23, 2025 at 11:58:48AM -0700, Davidlohr Bueso wrote:
> ... rather benign but keep proper ending order.
> 
> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>

Nit: in next version, please have a more clear commit message.

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] mm: introduce per-node proactive reclaim interface
  2025-06-23 18:58 ` [PATCH 4/4] mm: introduce per-node proactive reclaim interface Davidlohr Bueso
@ 2025-06-25 23:10   ` Shakeel Butt
  2025-06-27 19:07     ` SeongJae Park
  2025-07-17  2:46   ` Roman Gushchin
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 28+ messages in thread
From: Shakeel Butt @ 2025-06-25 23:10 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: akpm, mhocko, hannes, roman.gushchin, yosryahmed, linux-mm,
	linux-kernel

On Mon, Jun 23, 2025 at 11:58:51AM -0700, Davidlohr Bueso wrote:
> This adds support for allowing proactive reclaim in general on a
> NUMA system. A per-node interface extends support for beyond a
> memcg-specific interface, respecting the current semantics of
> memory.reclaim: respecting aging LRU and not supporting
> artificially triggering eviction on nodes belonging to non-bottom
> tiers.
> 
> This patch allows userspace to do:
> 
>      echo "512M swappiness=10" > /sys/devices/system/node/nodeX/reclaim
> 
> One of the premises for this is to semantically align as best as
> possible with memory.reclaim. During a brief time memcg did
> support nodemask until 55ab834a86a9 (Revert "mm: add nodes=
> arg to memory.reclaim"), for which semantics around reclaim
> (eviction) vs demotion were not clear, rendering charging
> expectations to be broken.
> 
> With this approach:
> 
> 1. Users who do not use memcg can benefit from proactive reclaim.
> The memcg interface is not NUMA aware and there are usecases that
> are focusing on NUMA balancing rather than workload memory footprint.
> 
> 2. Proactive reclaim on top tiers will trigger demotion, for which
> memory is still byte-addressable. Reclaiming on the bottom nodes
> will trigger evicting to swap (the traditional sense of reclaim).
> This follows the semantics of what is today part of the aging process
> on tiered memory, mirroring what every other form of reclaim does
> (reactive and memcg proactive reclaim). Furthermore per-node proactive
> reclaim is not as susceptible to the memcg charging problem mentioned
> above.
> 
> 3. Unlike the nodes= arg, this interface avoids confusing semantics,
> such as what exactly the user wants when mixing top-tier and low-tier
> nodes in the nodemask. Further per-node interface is less exposed to
> "free up memory in my container" usecases, where eviction is intended.
> 
> 4. Users that *really* want to free up memory can use proactive reclaim
> on nodes knowingly to be on the bottom tiers to force eviction in a
> natural way - higher access latencies are still better than swap.
> If compelled, while no guarantees and perhaps not worth the effort,
> users could also also potentially follow a ladder-like approach to
> eventually free up the memory. Alternatively, perhaps an 'evict' option
> could be added to the parameters for both memory.reclaim and per-node
> interfaces to force this action unconditionally.
> 
> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>

Overall looks good but I will try to dig deeper in next couple of days
(or weeks).

One orthogonal thought: I wonder if we want a unified aging (hotness or
generation or active/inactive) view of jobs/memcgs/system. At the moment
due to the way LRUs are implemented i.e. per-memcg per-node, we can have
different view of these LRUs even for the same memcg. For example the
hottest pages in low tier node might be colder than coldest pages in the
top tier. Not sure how to implement it in a scalable way.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] mm: introduce per-node proactive reclaim interface
  2025-06-25 23:10   ` Shakeel Butt
@ 2025-06-27 19:07     ` SeongJae Park
  0 siblings, 0 replies; 28+ messages in thread
From: SeongJae Park @ 2025-06-27 19:07 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: SeongJae Park, Davidlohr Bueso, akpm, mhocko, hannes,
	roman.gushchin, yosryahmed, linux-mm, linux-kernel

On Wed, 25 Jun 2025 16:10:16 -0700 Shakeel Butt <shakeel.butt@linux.dev> wrote:

> On Mon, Jun 23, 2025 at 11:58:51AM -0700, Davidlohr Bueso wrote:
> > This adds support for allowing proactive reclaim in general on a
> > NUMA system. A per-node interface extends support for beyond a
> > memcg-specific interface, respecting the current semantics of
> > memory.reclaim: respecting aging LRU and not supporting
> > artificially triggering eviction on nodes belonging to non-bottom
> > tiers.
> > 
> > This patch allows userspace to do:
> > 
> >      echo "512M swappiness=10" > /sys/devices/system/node/nodeX/reclaim
[...]
> One orthogonal thought: I wonder if we want a unified aging (hotness or
> generation or active/inactive) view of jobs/memcgs/system. At the moment
> due to the way LRUs are implemented i.e. per-memcg per-node, we can have
> different view of these LRUs even for the same memcg. For example the
> hottest pages in low tier node might be colder than coldest pages in the
> top tier.

I think it would be nice to have, and DAMON could help.

DAMON can monitor access patterns on the entire physical address space and make
actions such as migrating pages to different nodes[1] or LRU-[de]activate
([anti-]aging)[2] for specific cgroups[3,4], based on the monitored access
pattern.

Such migrations and [anti-]aging would not conflict with page fault and memory
pressure based promotions and demotions, so could help existing tiering
solutions by running those together.

> Not sure how to implement it in a scalable way.

DAMON's monitoring overhead is designed to be not ruled by memory size, so
scalable in terms of memory size.  We recently found it actually shows
reasonable monitoring results on an 1 TiB memory machine[5].  DAMON incurs
minimum overhead and limited to one CPU by default.  If needed, it could also
scale out using multiple threads.

[1] https://lore.kernel.org/all/20250420194030.75838-1-sj@kernel.org
[2] https://lore.kernel.org/all/20220613192301.8817-1-sj@kernel.org
[3] https://lkml.kernel.org/r/20221205230830.144349-1-sj@kernel.org
[4] https://lore.kernel.org/20250619220023.24023-1-sj@kernel.org
[5] page 46, right side plot of
    https://static.sched.com/hosted_files/ossna2025/16/damon_ossna25.pdf?_gl=1*12x1jv*_gcl_au*OTkyNjI0NTk0LjE3NTA4Nzg1Mzg.*FPAU*OTkyNjI0NTk0LjE3NTA4Nzg1Mzg.


Thanks,
SJ


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next v2 0/4] mm: per-node proactive reclaim
  2025-06-23 18:58 [PATCH -next v2 0/4] mm: per-node proactive reclaim Davidlohr Bueso
                   ` (4 preceding siblings ...)
  2025-06-23 21:50 ` [PATCH -next v2 0/4] mm: per-node proactive reclaim Andrew Morton
@ 2025-07-16  0:24 ` Andrew Morton
  2025-07-16 15:15   ` Shakeel Butt
  5 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2025-07-16  0:24 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: mhocko, hannes, roman.gushchin, shakeel.butt, yosryahmed,
	linux-mm, linux-kernel


We're a bit short of reviews of David's series.  Does anyone have
additional input?

Thanks.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next v2 0/4] mm: per-node proactive reclaim
  2025-07-16  0:24 ` Andrew Morton
@ 2025-07-16 15:15   ` Shakeel Butt
  0 siblings, 0 replies; 28+ messages in thread
From: Shakeel Butt @ 2025-07-16 15:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Davidlohr Bueso, mhocko, hannes, roman.gushchin, yosryahmed,
	linux-mm, linux-kernel

On Tue, Jul 15, 2025 at 05:24:10PM -0700, Andrew Morton wrote:
> 
> We're a bit short of reviews of David's series.  Does anyone have
> additional input?
> 

I am on this and will respond within a day or two.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/4] mm/vmscan: respect psi_memstall region in node reclaim
  2025-06-23 18:58 ` [PATCH 1/4] mm/vmscan: respect psi_memstall region in node reclaim Davidlohr Bueso
  2025-06-25 17:08   ` Shakeel Butt
@ 2025-07-17  1:44   ` Roman Gushchin
  1 sibling, 0 replies; 28+ messages in thread
From: Roman Gushchin @ 2025-07-17  1:44 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: akpm, mhocko, hannes, shakeel.butt, yosryahmed, linux-mm,
	linux-kernel

Davidlohr Bueso <dave@stgolabs.net> writes:

> ... rather benign but keep proper ending order.
>
> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>

Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/4] mm/memcg: make memory.reclaim interface generic
  2025-06-23 18:58 ` [PATCH 2/4] mm/memcg: make memory.reclaim interface generic Davidlohr Bueso
  2025-06-23 21:45   ` Andrew Morton
  2025-06-24 18:26   ` Klara Modin
@ 2025-07-17  1:58   ` Roman Gushchin
  2025-07-17 16:35     ` Davidlohr Bueso
  2025-07-17 22:17   ` Shakeel Butt
  3 siblings, 1 reply; 28+ messages in thread
From: Roman Gushchin @ 2025-07-17  1:58 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: akpm, mhocko, hannes, shakeel.butt, yosryahmed, linux-mm,
	linux-kernel

Davidlohr Bueso <dave@stgolabs.net> writes:

> This adds a general call for both parsing as well as the
> common reclaim semantics. memcg is still the only user and
> no change in semantics.
>
> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
> ---
>  mm/internal.h   |  2 +
>  mm/memcontrol.c | 77 ++------------------------------------
>  mm/vmscan.c     | 98 +++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 104 insertions(+), 73 deletions(-)
> ...
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c13c01eb0b42..63ddec550c3b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -57,6 +57,7 @@
>  #include <linux/rculist_nulls.h>
>  #include <linux/random.h>
>  #include <linux/mmu_notifier.h>
> +#include <linux/parser.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -6714,6 +6715,15 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  
>  	return nr_reclaimed;
>  }
> +#else
> +unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> +					   unsigned long nr_pages,
> +					   gfp_t gfp_mask,
> +					   unsigned int reclaim_options,
> +					   int *swappiness)
> +{
> +	return 0;
> +}
>  #endif
>  
>  static void kswapd_age_node(struct pglist_data *pgdat, struct scan_control *sc)
> @@ -7708,6 +7718,94 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
>  
>  	return ret;
>  }
> +
> +enum {
> +	MEMORY_RECLAIM_SWAPPINESS = 0,
> +	MEMORY_RECLAIM_SWAPPINESS_MAX,
> +	MEMORY_RECLAIM_NULL,
> +};
> +static const match_table_t tokens = {
> +	{ MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
> +	{ MEMORY_RECLAIM_SWAPPINESS_MAX, "swappiness=max"},
> +	{ MEMORY_RECLAIM_NULL, NULL },
> +};
> +
> +int user_proactive_reclaim(char *buf, struct mem_cgroup *memcg, pg_data_t *pgdat)
> +{
> +	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> +	unsigned long nr_to_reclaim, nr_reclaimed = 0;
> +	int swappiness = -1;
> +	char *old_buf, *start;
> +	substring_t args[MAX_OPT_ARGS];
> +
> +	if (!buf || (!memcg && !pgdat))
> +		return -EINVAL;
> +
> +	buf = strstrip(buf);
> +
> +	old_buf = buf;
> +	nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
> +	if (buf == old_buf)
> +		return -EINVAL;
> +
> +	buf = strstrip(buf);

To be honest, not a big fan of this refactoring. Effectively parts of
the memcg user interface are moved into mm/vmscan.c. I get that you want
to use the exact same interface somewhere else, but still...

Is it possible to keep it in mm/memcontrol.c?
Also maybe split the actual reclaim mechanism and user's input parsing?

Thanks


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/4] mm/vmscan: make __node_reclaim() more generic
  2025-06-23 18:58 ` [PATCH 3/4] mm/vmscan: make __node_reclaim() more generic Davidlohr Bueso
@ 2025-07-17  2:03   ` Roman Gushchin
  2025-07-17 22:25   ` Shakeel Butt
  1 sibling, 0 replies; 28+ messages in thread
From: Roman Gushchin @ 2025-07-17  2:03 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: akpm, mhocko, hannes, shakeel.butt, yosryahmed, linux-mm,
	linux-kernel

Davidlohr Bueso <dave@stgolabs.net> writes:

> As this will be called from non page allocator paths for
> proactive reclaim, allow users to pass the sc and nr of
> pages, and adjust the return value as well. No change in
> semantics.
>
> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>

Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>

Thanks


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] mm: introduce per-node proactive reclaim interface
  2025-06-23 18:58 ` [PATCH 4/4] mm: introduce per-node proactive reclaim interface Davidlohr Bueso
  2025-06-25 23:10   ` Shakeel Butt
@ 2025-07-17  2:46   ` Roman Gushchin
  2025-07-17 16:26     ` Davidlohr Bueso
       [not found]   ` <20250717064925.2304-1-hdanton@sina.com>
  2025-07-17 22:28   ` Shakeel Butt
  3 siblings, 1 reply; 28+ messages in thread
From: Roman Gushchin @ 2025-07-17  2:46 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: akpm, mhocko, hannes, shakeel.butt, yosryahmed, linux-mm,
	linux-kernel

Davidlohr Bueso <dave@stgolabs.net> writes:

> This adds support for allowing proactive reclaim in general on a
> NUMA system. A per-node interface extends support for beyond a
> memcg-specific interface, respecting the current semantics of
> memory.reclaim: respecting aging LRU and not supporting
> artificially triggering eviction on nodes belonging to non-bottom
> tiers.
>
> This patch allows userspace to do:
>
>      echo "512M swappiness=10" > /sys/devices/system/node/nodeX/reclaim
>
> One of the premises for this is to semantically align as best as
> possible with memory.reclaim. During a brief time memcg did
> support nodemask until 55ab834a86a9 (Revert "mm: add nodes=
> arg to memory.reclaim"), for which semantics around reclaim
> (eviction) vs demotion were not clear, rendering charging
> expectations to be broken.
>
> With this approach:
>
> 1. Users who do not use memcg can benefit from proactive reclaim.
> The memcg interface is not NUMA aware and there are usecases that
> are focusing on NUMA balancing rather than workload memory footprint.
>
> 2. Proactive reclaim on top tiers will trigger demotion, for which
> memory is still byte-addressable. Reclaiming on the bottom nodes
> will trigger evicting to swap (the traditional sense of reclaim).
> This follows the semantics of what is today part of the aging process
> on tiered memory, mirroring what every other form of reclaim does
> (reactive and memcg proactive reclaim). Furthermore per-node proactive
> reclaim is not as susceptible to the memcg charging problem mentioned
> above.
>
> 3. Unlike the nodes= arg, this interface avoids confusing semantics,
> such as what exactly the user wants when mixing top-tier and low-tier
> nodes in the nodemask. Further per-node interface is less exposed to
> "free up memory in my container" usecases, where eviction is intended.
>
> 4. Users that *really* want to free up memory can use proactive reclaim
> on nodes knowingly to be on the bottom tiers to force eviction in a
> natural way - higher access latencies are still better than swap.
> If compelled, while no guarantees and perhaps not worth the effort,
> users could also also potentially follow a ladder-like approach to
> eventually free up the memory. Alternatively, perhaps an 'evict' option
> could be added to the parameters for both memory.reclaim and per-node
> interfaces to force this action unconditionally.
>
> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>

Acked-by: Roman Gushchin <roman.gushchin@linux.dev>

small nit below

> ---
>  Documentation/ABI/stable/sysfs-devices-node |  9 ++++
>  drivers/base/node.c                         |  2 +
>  include/linux/swap.h                        | 16 +++++++
>  mm/vmscan.c                                 | 53 ++++++++++++++++++---
>  4 files changed, 74 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> index a02707cb7cbc..2d0e023f22a7 100644
> --- a/Documentation/ABI/stable/sysfs-devices-node
> +++ b/Documentation/ABI/stable/sysfs-devices-node
> @@ -227,3 +227,12 @@ Contact:	Jiaqi Yan <jiaqiyan@google.com>
>  Description:
>  		Of the raw poisoned pages on a NUMA node, how many pages are
>  		recovered by memory error recovery attempt.
> +
> +What:		/sys/devices/system/node/nodeX/reclaim
> +Date:		June 2025
> +Contact:	Linux Memory Management list <linux-mm@kvack.org>
> +Description:
> +		Perform user-triggered proactive reclaim on a NUMA node.
> +		This interface is equivalent to the memcg variant.
> +
> +		See Documentation/admin-guide/cgroup-v2.rst
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 6d66382dae65..548b532a2129 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -659,6 +659,7 @@ static int register_node(struct node *node, int num)
>  	} else {
>  		hugetlb_register_node(node);
>  		compaction_register_node(node);
> +		reclaim_register_node(node);
>  	}
>  
>  	return error;
> @@ -675,6 +676,7 @@ void unregister_node(struct node *node)
>  {
>  	hugetlb_unregister_node(node);
>  	compaction_unregister_node(node);
> +	reclaim_unregister_node(node);
>  	node_remove_accesses(node);
>  	node_remove_caches(node);
>  	device_unregister(&node->dev);
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index bc0e1c275fc0..dac7ba98783d 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -431,6 +431,22 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages);
>  extern int vm_swappiness;
>  long remove_mapping(struct address_space *mapping, struct folio *folio);
>  
> +#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> +extern int reclaim_register_node(struct node *node);
> +extern void reclaim_unregister_node(struct node *node);
> +
> +#else
> +
> +static inline int reclaim_register_node(struct node *node)
> +{
> +	return 0;
> +}
> +
> +static inline void reclaim_unregister_node(struct node *node)
> +{
> +}
> +#endif /* CONFIG_SYSFS && CONFIG_NUMA */
> +
>  #ifdef CONFIG_NUMA
>  extern int sysctl_min_unmapped_ratio;
>  extern int sysctl_min_slab_ratio;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index cdd9cb97fb79..f77feb75c678 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -94,10 +94,8 @@ struct scan_control {
>  	unsigned long	anon_cost;
>  	unsigned long	file_cost;
>  
> -#ifdef CONFIG_MEMCG
>  	/* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
>  	int *proactive_swappiness;
> -#endif
>  
>  	/* Can active folios be deactivated as part of reclaim? */
>  #define DEACTIVATE_ANON 1
> @@ -121,7 +119,7 @@ struct scan_control {
>  	/* Has cache_trim_mode failed at least once? */
>  	unsigned int cache_trim_mode_failed:1;
>  
> -	/* Proactive reclaim invoked by userspace through memory.reclaim */
> +	/* Proactive reclaim invoked by userspace */
>  	unsigned int proactive:1;
>  
>  	/*
> @@ -7732,13 +7730,15 @@ static const match_table_t tokens = {
>  	{ MEMORY_RECLAIM_NULL, NULL },
>  };
>  
> -int user_proactive_reclaim(char *buf, struct mem_cgroup *memcg, pg_data_t *pgdat)
> +int user_proactive_reclaim(char *buf,
> +			   struct mem_cgroup *memcg, pg_data_t *pgdat)
>  {
>  	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
>  	unsigned long nr_to_reclaim, nr_reclaimed = 0;
>  	int swappiness = -1;
>  	char *old_buf, *start;
>  	substring_t args[MAX_OPT_ARGS];
> +	gfp_t gfp_mask = GFP_KERNEL;
>  
>  	if (!buf || (!memcg && !pgdat))
>  		return -EINVAL;
> @@ -7792,11 +7792,29 @@ int user_proactive_reclaim(char *buf, struct mem_cgroup *memcg, pg_data_t *pgdat
>  			reclaim_options = MEMCG_RECLAIM_MAY_SWAP |
>  					  MEMCG_RECLAIM_PROACTIVE;
>  			reclaimed = try_to_free_mem_cgroup_pages(memcg,
> -						 batch_size, GFP_KERNEL,
> +						 batch_size, gfp_mask,
>  						 reclaim_options,
>  						 swappiness == -1 ? NULL : &swappiness);
>  		} else {
> -			return -EINVAL;
> +			struct scan_control sc = {
> +				.gfp_mask = current_gfp_context(gfp_mask),
> +				.reclaim_idx = gfp_zone(gfp_mask),
> +				.proactive_swappiness = swappiness == -1 ? NULL : &swappiness,
> +				.priority = DEF_PRIORITY,
> +				.may_writepage = !laptop_mode,
> +				.nr_to_reclaim = max(batch_size, SWAP_CLUSTER_MAX),
> +				.may_unmap = 1,
> +				.may_swap = 1,
> +				.proactive = 1,
> +			};
> +
> +			if (test_and_set_bit_lock(PGDAT_RECLAIM_LOCKED,
> +						  &pgdat->flags))
> +				return -EAGAIN;

Isn't EBUSY a better choice here?
At least to distinguish between no reclaimable memory left and
somebody else is abusing the same interface cases.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] mm: introduce per-node proactive reclaim interface
       [not found]   ` <20250717064925.2304-1-hdanton@sina.com>
@ 2025-07-17  7:39     ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2025-07-17  7:39 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Davidlohr Bueso, akpm, hannes, Roman Gushchin, shakeel.butt,
	yosryahmed, linux-mm, linux-kernel

On Thu 17-07-25 14:49:24, Hillf Danton wrote:
> Davidlohr Bueso <dave@stgolabs.net> writes:
> 
> > This adds support for allowing proactive reclaim in general on a
> > NUMA system. A per-node interface extends support for beyond a
> > memcg-specific interface, respecting the current semantics of
> > memory.reclaim: respecting aging LRU and not supporting
> > artificially triggering eviction on nodes belonging to non-bottom
> > tiers.
> >
> > This patch allows userspace to do:
> >
> >      echo "512M swappiness=10" > /sys/devices/system/node/nodeX/reclaim
> >
> When kswapd is active, this is not needed.
> When kswapd is idle, why is this needed?

Usecases are described in the section of the email you haven't quoted in your reply.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] mm: introduce per-node proactive reclaim interface
  2025-07-17  2:46   ` Roman Gushchin
@ 2025-07-17 16:26     ` Davidlohr Bueso
  2025-07-17 22:46       ` Andrew Morton
  0 siblings, 1 reply; 28+ messages in thread
From: Davidlohr Bueso @ 2025-07-17 16:26 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: akpm, mhocko, hannes, shakeel.butt, yosryahmed, linux-mm,
	linux-kernel

On Wed, 16 Jul 2025, Roman Gushchin wrote:

>Davidlohr Bueso <dave@stgolabs.net> writes:
>> @@ -7792,11 +7792,29 @@ int user_proactive_reclaim(char *buf, struct mem_cgroup *memcg, pg_data_t *pgdat
>>			reclaim_options = MEMCG_RECLAIM_MAY_SWAP |
>>					  MEMCG_RECLAIM_PROACTIVE;
>>			reclaimed = try_to_free_mem_cgroup_pages(memcg,
>> -						 batch_size, GFP_KERNEL,
>> +						 batch_size, gfp_mask,
>>						 reclaim_options,
>>						 swappiness == -1 ? NULL : &swappiness);
>>		} else {
>> -			return -EINVAL;
>> +			struct scan_control sc = {
>> +				.gfp_mask = current_gfp_context(gfp_mask),
>> +				.reclaim_idx = gfp_zone(gfp_mask),
>> +				.proactive_swappiness = swappiness == -1 ? NULL : &swappiness,
>> +				.priority = DEF_PRIORITY,
>> +				.may_writepage = !laptop_mode,
>> +				.nr_to_reclaim = max(batch_size, SWAP_CLUSTER_MAX),
>> +				.may_unmap = 1,
>> +				.may_swap = 1,
>> +				.proactive = 1,
>> +			};
>> +
>> +			if (test_and_set_bit_lock(PGDAT_RECLAIM_LOCKED,
>> +						  &pgdat->flags))
>> +				return -EAGAIN;
>
>Isn't EBUSY a better choice here?
>At least to distinguish between no reclaimable memory left and
>somebody else is abusing the same interface cases.

Yes, I agree.

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/4] mm/memcg: make memory.reclaim interface generic
  2025-07-17  1:58   ` Roman Gushchin
@ 2025-07-17 16:35     ` Davidlohr Bueso
  0 siblings, 0 replies; 28+ messages in thread
From: Davidlohr Bueso @ 2025-07-17 16:35 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: akpm, mhocko, hannes, shakeel.butt, yosryahmed, linux-mm,
	linux-kernel

On Wed, 16 Jul 2025, Roman Gushchin wrote:

>Davidlohr Bueso <dave@stgolabs.net> writes:
>
>> This adds a general call for both parsing as well as the
>> common reclaim semantics. memcg is still the only user and
>> no change in semantics.
>>
>> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
>> ---
>>  mm/internal.h   |  2 +
>>  mm/memcontrol.c | 77 ++------------------------------------
>>  mm/vmscan.c     | 98 +++++++++++++++++++++++++++++++++++++++++++++++++
>>  3 files changed, 104 insertions(+), 73 deletions(-)
>> ...
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index c13c01eb0b42..63ddec550c3b 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -57,6 +57,7 @@
>>  #include <linux/rculist_nulls.h>
>>  #include <linux/random.h>
>>  #include <linux/mmu_notifier.h>
>> +#include <linux/parser.h>
>>
>>  #include <asm/tlbflush.h>
>>  #include <asm/div64.h>
>> @@ -6714,6 +6715,15 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>>
>>  	return nr_reclaimed;
>>  }
>> +#else
>> +unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>> +					   unsigned long nr_pages,
>> +					   gfp_t gfp_mask,
>> +					   unsigned int reclaim_options,
>> +					   int *swappiness)
>> +{
>> +	return 0;
>> +}
>>  #endif
>>
>>  static void kswapd_age_node(struct pglist_data *pgdat, struct scan_control *sc)
>> @@ -7708,6 +7718,94 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
>>
>>  	return ret;
>>  }
>> +
>> +enum {
>> +	MEMORY_RECLAIM_SWAPPINESS = 0,
>> +	MEMORY_RECLAIM_SWAPPINESS_MAX,
>> +	MEMORY_RECLAIM_NULL,
>> +};
>> +static const match_table_t tokens = {
>> +	{ MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
>> +	{ MEMORY_RECLAIM_SWAPPINESS_MAX, "swappiness=max"},
>> +	{ MEMORY_RECLAIM_NULL, NULL },
>> +};
>> +
>> +int user_proactive_reclaim(char *buf, struct mem_cgroup *memcg, pg_data_t *pgdat)
>> +{
>> +	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
>> +	unsigned long nr_to_reclaim, nr_reclaimed = 0;
>> +	int swappiness = -1;
>> +	char *old_buf, *start;
>> +	substring_t args[MAX_OPT_ARGS];
>> +
>> +	if (!buf || (!memcg && !pgdat))
>> +		return -EINVAL;
>> +
>> +	buf = strstrip(buf);
>> +
>> +	old_buf = buf;
>> +	nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
>> +	if (buf == old_buf)
>> +		return -EINVAL;
>> +
>> +	buf = strstrip(buf);
>
>To be honest, not a big fan of this refactoring. Effectively parts of
>the memcg user interface are moved into mm/vmscan.c. I get that you want
>to use the exact same interface somewhere else, but still...

I disagree, further this is no different than other callers in vmscan.c
around memcg.

>Is it possible to keep it in mm/memcontrol.c?

Why? now proactive reclaim is not special to memcg... unless strong reasons
it makes little sense to keep it there.

>Also maybe split the actual reclaim mechanism and user's input parsing?

I tried something like this initially, and ended up prefering this way.

Further, this approach limits the reachability of the input parsing logic, and
the interface is already being an exception to the one-value per file "rule".

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/4] mm/memcg: make memory.reclaim interface generic
  2025-06-23 18:58 ` [PATCH 2/4] mm/memcg: make memory.reclaim interface generic Davidlohr Bueso
                     ` (2 preceding siblings ...)
  2025-07-17  1:58   ` Roman Gushchin
@ 2025-07-17 22:17   ` Shakeel Butt
  2025-07-17 22:52     ` Andrew Morton
  3 siblings, 1 reply; 28+ messages in thread
From: Shakeel Butt @ 2025-07-17 22:17 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: akpm, mhocko, hannes, roman.gushchin, yosryahmed, linux-mm,
	linux-kernel

On Mon, Jun 23, 2025 at 11:58:49AM -0700, Davidlohr Bueso wrote:
> +
> +int user_proactive_reclaim(char *buf, struct mem_cgroup *memcg, pg_data_t *pgdat)
> +{
> +	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> +	unsigned long nr_to_reclaim, nr_reclaimed = 0;
> +	int swappiness = -1;
> +	char *old_buf, *start;
> +	substring_t args[MAX_OPT_ARGS];
> +
> +	if (!buf || (!memcg && !pgdat))

I don't think this series is adding a use-case where both memcg and
pgdat are non-NULL, so let's error out on that as well.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/4] mm/vmscan: make __node_reclaim() more generic
  2025-06-23 18:58 ` [PATCH 3/4] mm/vmscan: make __node_reclaim() more generic Davidlohr Bueso
  2025-07-17  2:03   ` Roman Gushchin
@ 2025-07-17 22:25   ` Shakeel Butt
  1 sibling, 0 replies; 28+ messages in thread
From: Shakeel Butt @ 2025-07-17 22:25 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: akpm, mhocko, hannes, roman.gushchin, yosryahmed, linux-mm,
	linux-kernel

On Mon, Jun 23, 2025 at 11:58:50AM -0700, Davidlohr Bueso wrote:
> As this will be called from non page allocator paths for
> proactive reclaim, allow users to pass the sc and nr of
> pages, and adjust the return value as well. No change in
> semantics.
> 
> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] mm: introduce per-node proactive reclaim interface
  2025-06-23 18:58 ` [PATCH 4/4] mm: introduce per-node proactive reclaim interface Davidlohr Bueso
                     ` (2 preceding siblings ...)
       [not found]   ` <20250717064925.2304-1-hdanton@sina.com>
@ 2025-07-17 22:28   ` Shakeel Butt
  3 siblings, 0 replies; 28+ messages in thread
From: Shakeel Butt @ 2025-07-17 22:28 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: akpm, mhocko, hannes, roman.gushchin, yosryahmed, linux-mm,
	linux-kernel

On Mon, Jun 23, 2025 at 11:58:51AM -0700, Davidlohr Bueso wrote:
> This adds support for allowing proactive reclaim in general on a
> NUMA system. A per-node interface extends support for beyond a
> memcg-specific interface, respecting the current semantics of
> memory.reclaim: respecting aging LRU and not supporting
> artificially triggering eviction on nodes belonging to non-bottom
> tiers.
> 
> This patch allows userspace to do:
> 
>      echo "512M swappiness=10" > /sys/devices/system/node/nodeX/reclaim
> 
> One of the premises for this is to semantically align as best as
> possible with memory.reclaim. During a brief time memcg did
> support nodemask until 55ab834a86a9 (Revert "mm: add nodes=
> arg to memory.reclaim"), for which semantics around reclaim
> (eviction) vs demotion were not clear, rendering charging
> expectations to be broken.
> 
> With this approach:
> 
> 1. Users who do not use memcg can benefit from proactive reclaim.
> The memcg interface is not NUMA aware and there are usecases that
> are focusing on NUMA balancing rather than workload memory footprint.
> 
> 2. Proactive reclaim on top tiers will trigger demotion, for which
> memory is still byte-addressable. Reclaiming on the bottom nodes
> will trigger evicting to swap (the traditional sense of reclaim).
> This follows the semantics of what is today part of the aging process
> on tiered memory, mirroring what every other form of reclaim does
> (reactive and memcg proactive reclaim). Furthermore per-node proactive
> reclaim is not as susceptible to the memcg charging problem mentioned
> above.
> 
> 3. Unlike the nodes= arg, this interface avoids confusing semantics,
> such as what exactly the user wants when mixing top-tier and low-tier
> nodes in the nodemask. Further per-node interface is less exposed to
> "free up memory in my container" usecases, where eviction is intended.
> 
> 4. Users that *really* want to free up memory can use proactive reclaim
> on nodes knowingly to be on the bottom tiers to force eviction in a
> natural way - higher access latencies are still better than swap.
> If compelled, while no guarantees and perhaps not worth the effort,
> users could also also potentially follow a ladder-like approach to
> eventually free up the memory. Alternatively, perhaps an 'evict' option
> could be added to the parameters for both memory.reclaim and per-node
> interfaces to force this action unconditionally.
> 
> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>

After Roman's suggestion, you can add:

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] mm: introduce per-node proactive reclaim interface
  2025-07-17 16:26     ` Davidlohr Bueso
@ 2025-07-17 22:46       ` Andrew Morton
  0 siblings, 0 replies; 28+ messages in thread
From: Andrew Morton @ 2025-07-17 22:46 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Roman Gushchin, mhocko, hannes, shakeel.butt, yosryahmed,
	linux-mm, linux-kernel

On Thu, 17 Jul 2025 09:26:37 -0700 Davidlohr Bueso <dave@stgolabs.net> wrote:

> >> +			if (test_and_set_bit_lock(PGDAT_RECLAIM_LOCKED,
> >> +						  &pgdat->flags))
> >> +				return -EAGAIN;
> >
> >Isn't EBUSY a better choice here?
> >At least to distinguish between no reclaimable memory left and
> >somebody else is abusing the same interface cases.
> 
> Yes, I agree.

From: Andrew Morton <akpm@linux-foundation.org>
Subject: mm-introduce-per-node-proactive-reclaim-interface-fix
Date: Thu Jul 17 03:44:14 PM PDT 2025

user_proactive_reclaim(): return -EBUSY on PGDAT_RECLAIM_LOCKED
contention, per Roman

Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmscan.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/vmscan.c~mm-introduce-per-node-proactive-reclaim-interface-fix
+++ a/mm/vmscan.c
@@ -7818,7 +7818,7 @@ int user_proactive_reclaim(char *buf,
 
 			if (test_and_set_bit_lock(PGDAT_RECLAIM_LOCKED,
 						  &pgdat->flags))
-				return -EAGAIN;
+				return -EBUSY;
 
 			reclaimed = __node_reclaim(pgdat, gfp_mask,
 						   batch_size, &sc);
_



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/4] mm/memcg: make memory.reclaim interface generic
  2025-07-17 22:17   ` Shakeel Butt
@ 2025-07-17 22:52     ` Andrew Morton
  2025-07-17 23:56       ` Davidlohr Bueso
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2025-07-17 22:52 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Davidlohr Bueso, mhocko, hannes, roman.gushchin, yosryahmed,
	linux-mm, linux-kernel

On Thu, 17 Jul 2025 15:17:09 -0700 Shakeel Butt <shakeel.butt@linux.dev> wrote:

> On Mon, Jun 23, 2025 at 11:58:49AM -0700, Davidlohr Bueso wrote:
> > +
> > +int user_proactive_reclaim(char *buf, struct mem_cgroup *memcg, pg_data_t *pgdat)
> > +{
> > +	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> > +	unsigned long nr_to_reclaim, nr_reclaimed = 0;
> > +	int swappiness = -1;
> > +	char *old_buf, *start;
> > +	substring_t args[MAX_OPT_ARGS];
> > +
> > +	if (!buf || (!memcg && !pgdat))
> 
> I don't think this series is adding a use-case where both memcg and
> pgdat are non-NULL, so let's error out on that as well.

As a followup, please.  This has been in -next for four weeks and I'd
prefer not to have to route around it (again).



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/4] mm/memcg: make memory.reclaim interface generic
  2025-07-17 22:52     ` Andrew Morton
@ 2025-07-17 23:56       ` Davidlohr Bueso
  2025-07-18  0:17         ` Shakeel Butt
  0 siblings, 1 reply; 28+ messages in thread
From: Davidlohr Bueso @ 2025-07-17 23:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Shakeel Butt, mhocko, hannes, roman.gushchin, yosryahmed,
	linux-mm, linux-kernel

On Thu, 17 Jul 2025, Andrew Morton wrote:

>On Thu, 17 Jul 2025 15:17:09 -0700 Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
>> On Mon, Jun 23, 2025 at 11:58:49AM -0700, Davidlohr Bueso wrote:
>> > +
>> > +int user_proactive_reclaim(char *buf, struct mem_cgroup *memcg, pg_data_t *pgdat)
>> > +{
>> > +	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
>> > +	unsigned long nr_to_reclaim, nr_reclaimed = 0;
>> > +	int swappiness = -1;
>> > +	char *old_buf, *start;
>> > +	substring_t args[MAX_OPT_ARGS];
>> > +
>> > +	if (!buf || (!memcg && !pgdat))
>>
>> I don't think this series is adding a use-case where both memcg and
>> pgdat are non-NULL, so let's error out on that as well.
>
>As a followup, please.  This has been in -next for four weeks and I'd
>prefer not to have to route around it (again).
>

From: Davidlohr Bueso <dave@stgolabs.net>
Date: Thu, 17 Jul 2025 16:53:24 -0700
Subject: [PATCH] mm-introduce-per-node-proactive-reclaim-interface-fix

Both memcg and node is also a bogus case, per Shakeel.

Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
---
  mm/vmscan.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4598d18ba256..d5f7b1703234 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -7758,7 +7758,7 @@ int user_proactive_reclaim(char *buf,
  	substring_t args[MAX_OPT_ARGS];
  	gfp_t gfp_mask = GFP_KERNEL;
  
-	if (!buf || (!memcg && !pgdat))
+	if (!buf || (!memcg && !pgdat) || (memcg && pgdat))
  		return -EINVAL;
  
  	buf = strstrip(buf);
-- 
2.39.5



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/4] mm/memcg: make memory.reclaim interface generic
  2025-07-17 23:56       ` Davidlohr Bueso
@ 2025-07-18  0:17         ` Shakeel Butt
  0 siblings, 0 replies; 28+ messages in thread
From: Shakeel Butt @ 2025-07-18  0:17 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Andrew Morton, mhocko, hannes, roman.gushchin, yosryahmed,
	linux-mm, linux-kernel

On Thu, Jul 17, 2025 at 04:56:04PM -0700, Davidlohr Bueso wrote:
> On Thu, 17 Jul 2025, Andrew Morton wrote:
> 
> > On Thu, 17 Jul 2025 15:17:09 -0700 Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > 
> > > On Mon, Jun 23, 2025 at 11:58:49AM -0700, Davidlohr Bueso wrote:
> > > > +
> > > > +int user_proactive_reclaim(char *buf, struct mem_cgroup *memcg, pg_data_t *pgdat)
> > > > +{
> > > > +	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> > > > +	unsigned long nr_to_reclaim, nr_reclaimed = 0;
> > > > +	int swappiness = -1;
> > > > +	char *old_buf, *start;
> > > > +	substring_t args[MAX_OPT_ARGS];
> > > > +
> > > > +	if (!buf || (!memcg && !pgdat))
> > > 
> > > I don't think this series is adding a use-case where both memcg and
> > > pgdat are non-NULL, so let's error out on that as well.
> > 
> > As a followup, please.  This has been in -next for four weeks and I'd
> > prefer not to have to route around it (again).
> > 
> 
> From: Davidlohr Bueso <dave@stgolabs.net>
> Date: Thu, 17 Jul 2025 16:53:24 -0700
> Subject: [PATCH] mm-introduce-per-node-proactive-reclaim-interface-fix
> 
> Both memcg and node is also a bogus case, per Shakeel.
> 
> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>

With this, I think we are good. We can always refactor and move code
around to our taste but interface and functionality wise this is fine.

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>



^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2025-07-18  0:17 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-23 18:58 [PATCH -next v2 0/4] mm: per-node proactive reclaim Davidlohr Bueso
2025-06-23 18:58 ` [PATCH 1/4] mm/vmscan: respect psi_memstall region in node reclaim Davidlohr Bueso
2025-06-25 17:08   ` Shakeel Butt
2025-07-17  1:44   ` Roman Gushchin
2025-06-23 18:58 ` [PATCH 2/4] mm/memcg: make memory.reclaim interface generic Davidlohr Bueso
2025-06-23 21:45   ` Andrew Morton
2025-06-23 23:36     ` Davidlohr Bueso
2025-06-24 18:26   ` Klara Modin
2025-07-17  1:58   ` Roman Gushchin
2025-07-17 16:35     ` Davidlohr Bueso
2025-07-17 22:17   ` Shakeel Butt
2025-07-17 22:52     ` Andrew Morton
2025-07-17 23:56       ` Davidlohr Bueso
2025-07-18  0:17         ` Shakeel Butt
2025-06-23 18:58 ` [PATCH 3/4] mm/vmscan: make __node_reclaim() more generic Davidlohr Bueso
2025-07-17  2:03   ` Roman Gushchin
2025-07-17 22:25   ` Shakeel Butt
2025-06-23 18:58 ` [PATCH 4/4] mm: introduce per-node proactive reclaim interface Davidlohr Bueso
2025-06-25 23:10   ` Shakeel Butt
2025-06-27 19:07     ` SeongJae Park
2025-07-17  2:46   ` Roman Gushchin
2025-07-17 16:26     ` Davidlohr Bueso
2025-07-17 22:46       ` Andrew Morton
     [not found]   ` <20250717064925.2304-1-hdanton@sina.com>
2025-07-17  7:39     ` Michal Hocko
2025-07-17 22:28   ` Shakeel Butt
2025-06-23 21:50 ` [PATCH -next v2 0/4] mm: per-node proactive reclaim Andrew Morton
2025-07-16  0:24 ` Andrew Morton
2025-07-16 15:15   ` Shakeel Butt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).