[PATCH v3] mm: Add nodes= arg to memory.reclaim

public inbox for cgroups@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH v3] mm: Add nodes= arg to memory.reclaim
@ 2022-12-02 22:35 Mina Almasry
  2022-12-02 23:51 ` Shakeel Butt
                   ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Mina Almasry @ 2022-12-02 22:35 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton
  Cc: Huang Ying, Yang Shi, Yosry Ahmed, weixugc, fvdl, Mina Almasry,
	Michal Hocko, bagasdotme, cgroups, linux-doc, linux-kernel,
	linux-mm

The nodes= arg instructs the kernel to only scan the given nodes for
proactive reclaim. For example use cases, consider a 2 tier memory system:

nodes 0,1 -> top tier
nodes 2,3 -> second tier

$ echo "1m nodes=0" > memory.reclaim

This instructs the kernel to attempt to reclaim 1m memory from node 0.
Since node 0 is a top tier node, demotion will be attempted first. This
is useful to direct proactive reclaim to specific nodes that are under
pressure.

$ echo "1m nodes=2,3" > memory.reclaim

This instructs the kernel to attempt to reclaim 1m memory in the second tier,
since this tier of memory has no demotion targets the memory will be
reclaimed.

$ echo "1m nodes=0,1" > memory.reclaim

Instructs the kernel to reclaim memory from the top tier nodes, which can
be desirable according to the userspace policy if there is pressure on
the top tiers. Since these nodes have demotion targets, the kernel will
attempt demotion first.

Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
reclaim""), the proactive reclaim interface memory.reclaim does both
reclaim and demotion. Reclaim and demotion incur different latency costs
to the jobs in the cgroup. Demoted memory would still be addressable
by the userspace at a higher latency, but reclaimed memory would need to
incur a pagefault.

The 'nodes' arg is useful to allow the userspace to control demotion
and reclaim independently according to its policy: if the memory.reclaim
is called on a node with demotion targets, it will attempt demotion first;
if it is called on a node without demotion targets, it will only attempt
reclaim.

Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>

---

v3:
- Dropped RFC tag from subject.
- Added Michal's Ack
- Applied some of Bagas's comment suggestions.
- Converted try_to_fre_mem_cgorup_pages() to take nodemask_t* instead of
  nodemask_t as Shakeel and Muchun suggested.

Cc: bagasdotme@gmail.com

Thanks for the comments and reviews.
---
 Documentation/admin-guide/cgroup-v2.rst | 15 +++---
 include/linux/swap.h                    |  3 +-
 mm/memcontrol.c                         | 67 ++++++++++++++++++++-----
 mm/vmscan.c                             |  4 +-
 4 files changed, 68 insertions(+), 21 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 74cec76be9f2..c8ae7c897f14 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1245,17 +1245,13 @@ PAGE_SIZE multiple when read back.
 	This is a simple interface to trigger memory reclaim in the
 	target cgroup.

-	This file accepts a single key, the number of bytes to reclaim.
-	No nested keys are currently supported.
+	This file accepts a string which contains the number of bytes to
+	reclaim.

 	Example::

 	  echo "1G" > memory.reclaim

-	The interface can be later extended with nested keys to
-	configure the reclaim behavior. For example, specify the
-	type of memory to reclaim from (anon, file, ..).
-
 	Please note that the kernel can over or under reclaim from
 	the target cgroup. If less bytes are reclaimed than the
 	specified amount, -EAGAIN is returned.
@@ -1267,6 +1263,13 @@ PAGE_SIZE multiple when read back.
 	This means that the networking layer will not adapt based on
 	reclaim induced by memory.reclaim.

+	This file also allows the user to specify the nodes to reclaim from,
+	via the 'nodes=' key, for example::
+
+	  echo "1G nodes=0,1" > memory.reclaim
+
+	The above instructs the kernel to reclaim memory from nodes 0,1.
+
   memory.peak
 	A read-only single value file which exists on non-root
 	cgroups.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0ceed49516ad..2787b84eaf12 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -418,7 +418,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
 						  gfp_t gfp_mask,
-						  unsigned int reclaim_options);
+						  unsigned int reclaim_options,
+						  nodemask_t *nodemask);
 extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
 						pg_data_t *pgdat,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 23750cec0036..0f02f47a87e4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -63,6 +63,7 @@
 #include <linux/resume_user_mode.h>
 #include <linux/psi.h>
 #include <linux/seq_buf.h>
+#include <linux/parser.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
@@ -2392,7 +2393,8 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
 		psi_memstall_enter(&pflags);
 		nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
 							gfp_mask,
-							MEMCG_RECLAIM_MAY_SWAP);
+							MEMCG_RECLAIM_MAY_SWAP,
+							NULL);
 		psi_memstall_leave(&pflags);
 	} while ((memcg = parent_mem_cgroup(memcg)) &&
 		 !mem_cgroup_is_root(memcg));
@@ -2683,7 +2685,8 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,

 	psi_memstall_enter(&pflags);
 	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
-						    gfp_mask, reclaim_options);
+						    gfp_mask, reclaim_options,
+						    NULL);
 	psi_memstall_leave(&pflags);

 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
@@ -3503,7 +3506,8 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
 		}

 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
+					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP,
+					NULL)) {
 			ret = -EBUSY;
 			break;
 		}
@@ -3614,7 +3618,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 			return -EINTR;

 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-						  MEMCG_RECLAIM_MAY_SWAP))
+						  MEMCG_RECLAIM_MAY_SWAP,
+						  NULL))
 			nr_retries--;
 	}

@@ -6407,7 +6412,8 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 		}

 		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
-					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
+					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
+					NULL);

 		if (!reclaimed && !nr_retries--)
 			break;
@@ -6456,7 +6462,8 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,

 		if (nr_reclaims) {
 			if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
-					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
+					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
+					NULL))
 				nr_reclaims--;
 			continue;
 		}
@@ -6579,21 +6586,54 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
 	return nbytes;
 }

+enum {
+	MEMORY_RECLAIM_NODES = 0,
+	MEMORY_RECLAIM_NULL,
+};
+
+static const match_table_t if_tokens = {
+	{ MEMORY_RECLAIM_NODES, "nodes=%s" },
+	{ MEMORY_RECLAIM_NULL, NULL },
+};
+
 static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 			      size_t nbytes, loff_t off)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
 	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
 	unsigned long nr_to_reclaim, nr_reclaimed = 0;
-	unsigned int reclaim_options;
-	int err;
+	unsigned int reclaim_options = MEMCG_RECLAIM_MAY_SWAP |
+				       MEMCG_RECLAIM_PROACTIVE;
+	char *old_buf, *start;
+	substring_t args[MAX_OPT_ARGS];
+	int token;
+	char value[256];
+	nodemask_t nodemask = NODE_MASK_ALL;

 	buf = strstrip(buf);
-	err = page_counter_memparse(buf, "", &nr_to_reclaim);
-	if (err)
-		return err;

-	reclaim_options	= MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
+	old_buf = buf;
+	nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
+	if (buf == old_buf)
+		return -EINVAL;
+
+	buf = strstrip(buf);
+
+	while ((start = strsep(&buf, " ")) != NULL) {
+		if (!strlen(start))
+			continue;
+		token = match_token(start, if_tokens, args);
+		match_strlcpy(value, args, sizeof(value));
+		switch (token) {
+		case MEMORY_RECLAIM_NODES:
+			if (nodelist_parse(value, nodemask) < 0)
+				return -EINVAL;
+			break;
+		default:
+			return -EINVAL;
+		}
+	}
+
 	while (nr_reclaimed < nr_to_reclaim) {
 		unsigned long reclaimed;

@@ -6610,7 +6650,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,

 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
 						nr_to_reclaim - nr_reclaimed,
-						GFP_KERNEL, reclaim_options);
+						GFP_KERNEL, reclaim_options,
+						&nodemask);

 		if (!reclaimed && !nr_retries--)
 			return -EAGAIN;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7b8e8e43806b..62b0c9b46bd2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6735,7 +6735,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					   unsigned long nr_pages,
 					   gfp_t gfp_mask,
-					   unsigned int reclaim_options)
+					   unsigned int reclaim_options,
+					   nodemask_t *nodemask)
 {
 	unsigned long nr_reclaimed;
 	unsigned int noreclaim_flag;
@@ -6750,6 +6751,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 		.may_unmap = 1,
 		.may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
 		.proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
+		.nodemask = nodemask,
 	};
 	/*
 	 * Traverse the ZONELIST_FALLBACK zonelist of the current node to put
--
2.39.0.rc0.267.gcb52ba06e7-goog

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
  2022-12-02 22:35 [PATCH v3] mm: Add nodes= arg to memory.reclaim Mina Almasry
@ 2022-12-02 23:51 ` Shakeel Butt
  2022-12-03  3:17 ` Muchun Song
       [not found] ` <20221202223533.1785418-1-almasrymina-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  2 siblings, 0 replies; 35+ messages in thread
From: Shakeel Butt @ 2022-12-02 23:51 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Michal Hocko, Roman Gushchin, Muchun Song, Andrew Morton,
	Huang Ying, Yang Shi, Yosry Ahmed, weixugc, fvdl, Michal Hocko,
	bagasdotme, cgroups, linux-doc, linux-kernel, linux-mm

On Fri, Dec 2, 2022 at 2:37 PM Mina Almasry <almasrymina@google.com> wrote:
>
> The nodes= arg instructs the kernel to only scan the given nodes for
> proactive reclaim. For example use cases, consider a 2 tier memory system:
>
> nodes 0,1 -> top tier
> nodes 2,3 -> second tier
>
> $ echo "1m nodes=0" > memory.reclaim
>
> This instructs the kernel to attempt to reclaim 1m memory from node 0.
> Since node 0 is a top tier node, demotion will be attempted first. This
> is useful to direct proactive reclaim to specific nodes that are under
> pressure.
>
> $ echo "1m nodes=2,3" > memory.reclaim
>
> This instructs the kernel to attempt to reclaim 1m memory in the second tier,
> since this tier of memory has no demotion targets the memory will be
> reclaimed.
>
> $ echo "1m nodes=0,1" > memory.reclaim
>
> Instructs the kernel to reclaim memory from the top tier nodes, which can
> be desirable according to the userspace policy if there is pressure on
> the top tiers. Since these nodes have demotion targets, the kernel will
> attempt demotion first.
>
> Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> reclaim""), the proactive reclaim interface memory.reclaim does both
> reclaim and demotion. Reclaim and demotion incur different latency costs
> to the jobs in the cgroup. Demoted memory would still be addressable
> by the userspace at a higher latency, but reclaimed memory would need to
> incur a pagefault.
>
> The 'nodes' arg is useful to allow the userspace to control demotion
> and reclaim independently according to its policy: if the memory.reclaim
> is called on a node with demotion targets, it will attempt demotion first;
> if it is called on a node without demotion targets, it will only attempt
> reclaim.
>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Mina Almasry <almasrymina@google.com>
>

Acked-by: Shakeel Butt <shakeelb@google.com>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
  2022-12-02 22:35 [PATCH v3] mm: Add nodes= arg to memory.reclaim Mina Almasry
  2022-12-02 23:51 ` Shakeel Butt
@ 2022-12-03  3:17 ` Muchun Song
       [not found] ` <20221202223533.1785418-1-almasrymina-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  2 siblings, 0 replies; 35+ messages in thread
From: Muchun Song @ 2022-12-03  3:17 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Huang Ying, Yang Shi, Yosry Ahmed, weixugc, fvdl,
	Michal Hocko, bagasdotme, cgroups, Linux Doc Mailing List,
	linux-kernel, Linux Memory Management List



> On Dec 3, 2022, at 06:35, Mina Almasry <almasrymina@google.com> wrote:
> 
> The nodes= arg instructs the kernel to only scan the given nodes for
> proactive reclaim. For example use cases, consider a 2 tier memory system:
> 
> nodes 0,1 -> top tier
> nodes 2,3 -> second tier
> 
> $ echo "1m nodes=0" > memory.reclaim
> 
> This instructs the kernel to attempt to reclaim 1m memory from node 0.
> Since node 0 is a top tier node, demotion will be attempted first. This
> is useful to direct proactive reclaim to specific nodes that are under
> pressure.
> 
> $ echo "1m nodes=2,3" > memory.reclaim
> 
> This instructs the kernel to attempt to reclaim 1m memory in the second tier,
> since this tier of memory has no demotion targets the memory will be
> reclaimed.
> 
> $ echo "1m nodes=0,1" > memory.reclaim
> 
> Instructs the kernel to reclaim memory from the top tier nodes, which can
> be desirable according to the userspace policy if there is pressure on
> the top tiers. Since these nodes have demotion targets, the kernel will
> attempt demotion first.
> 
> Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> reclaim""), the proactive reclaim interface memory.reclaim does both
> reclaim and demotion. Reclaim and demotion incur different latency costs
> to the jobs in the cgroup. Demoted memory would still be addressable
> by the userspace at a higher latency, but reclaimed memory would need to
> incur a pagefault.
> 
> The 'nodes' arg is useful to allow the userspace to control demotion
> and reclaim independently according to its policy: if the memory.reclaim
> is called on a node with demotion targets, it will attempt demotion first;
> if it is called on a node without demotion targets, it will only attempt
> reclaim.
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Mina Almasry <almasrymina@google.com>

Acked-by: Muchun Song <songmuchun@bytedance.com>

Thanks.

^ permalink raw reply	[flat|nested] 35+ messages in thread

[parent not found: <20221202223533.1785418-1-almasrymina-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>]

* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
       [not found] ` <20221202223533.1785418-1-almasrymina-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2022-12-12  8:55   ` Michal Hocko
       [not found]     ` <Y5bsmpCyeryu3Zz1-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  2022-12-16  9:54     ` [PATCH] Revert "mm: add nodes= arg to memory.reclaim" Michal Hocko
  0 siblings, 2 replies; 35+ messages in thread
From: Michal Hocko @ 2022-12-12  8:55 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Huang Ying, Yang Shi, Yosry Ahmed, weixugc-hpIqsD4AKlfQT0dZR+AlfA,
	fvdl-hpIqsD4AKlfQT0dZR+AlfA, bagasdotme-Re5JQEeQqe8AvxtiuMwx3w,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

On Fri 02-12-22 14:35:31, Mina Almasry wrote:
> The nodes= arg instructs the kernel to only scan the given nodes for
> proactive reclaim. For example use cases, consider a 2 tier memory system:
> 
> nodes 0,1 -> top tier
> nodes 2,3 -> second tier
> 
> $ echo "1m nodes=0" > memory.reclaim
> 
> This instructs the kernel to attempt to reclaim 1m memory from node 0.
> Since node 0 is a top tier node, demotion will be attempted first. This
> is useful to direct proactive reclaim to specific nodes that are under
> pressure.
> 
> $ echo "1m nodes=2,3" > memory.reclaim
> 
> This instructs the kernel to attempt to reclaim 1m memory in the second tier,
> since this tier of memory has no demotion targets the memory will be
> reclaimed.
> 
> $ echo "1m nodes=0,1" > memory.reclaim
> 
> Instructs the kernel to reclaim memory from the top tier nodes, which can
> be desirable according to the userspace policy if there is pressure on
> the top tiers. Since these nodes have demotion targets, the kernel will
> attempt demotion first.
> 
> Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> reclaim""), the proactive reclaim interface memory.reclaim does both
> reclaim and demotion. Reclaim and demotion incur different latency costs
> to the jobs in the cgroup. Demoted memory would still be addressable
> by the userspace at a higher latency, but reclaimed memory would need to
> incur a pagefault.
> 
> The 'nodes' arg is useful to allow the userspace to control demotion
> and reclaim independently according to its policy: if the memory.reclaim
> is called on a node with demotion targets, it will attempt demotion first;
> if it is called on a node without demotion targets, it will only attempt
> reclaim.
> 
> Acked-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
> Signed-off-by: Mina Almasry <almasrymina-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

After discussion in [1] I have realized that I haven't really thought
through all the consequences of this patch and therefore I am retracting
my ack here. I am not nacking the patch at this statge but I also think
this shouldn't be merged now and we should really consider all the
consequences.

Let me summarize my main concerns here as well. The proposed
implementation doesn't apply the provided nodemask to the whole reclaim
process. This means that demotion can happen outside of the mask so the
the user request cannot really control demotion targets and that limits
the interface should there be any need for a finer grained control in
the future (see an example in [2]).
Another problem is that this can limit future reclaim extensions because
of existing assumptions of the interface [3] - specify only top-tier
node to force the aging without actually reclaiming any charges and
(ab)use the interface only for aging on multi-tier system. A change to
the reclaim to not demote in some cases could break this usecase.

My counter proposal would be to define the nodemask for memory.reclaim
as a domain to constrain the charge reclaim. That means both aging and
reclaim including demotion which is a part of aging. This will allow
to control where to demote for balancing purposes (e.g. demote to node 2
rather than 3) which is impossible with the proposed scheme.

[1] http://lkml.kernel.org/r/20221206023406.3182800-1-almasrymina-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org
[2] http://lkml.kernel.org/r/Y5bnRtJ6sojtjgVD-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org
[3] http://lkml.kernel.org/r/CAAPL-u8rgW-JACKUT5ChmGSJiTDABcDRjNzW_QxMjCTk9zO4sg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 35+ messages in thread

[parent not found: <Y5bsmpCyeryu3Zz1-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]

* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
       [not found]     ` <Y5bsmpCyeryu3Zz1-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2022-12-13  0:54       ` Mina Almasry
  2022-12-13  6:30         ` Huang, Ying
  2022-12-13  8:33         ` Michal Hocko
  0 siblings, 2 replies; 35+ messages in thread
From: Mina Almasry @ 2022-12-13  0:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Huang Ying, Yang Shi, Yosry Ahmed, weixugc-hpIqsD4AKlfQT0dZR+AlfA,
	fvdl-hpIqsD4AKlfQT0dZR+AlfA, bagasdotme-Re5JQEeQqe8AvxtiuMwx3w,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

On Mon, Dec 12, 2022 at 12:55 AM Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote:
>
> On Fri 02-12-22 14:35:31, Mina Almasry wrote:
> > The nodes= arg instructs the kernel to only scan the given nodes for
> > proactive reclaim. For example use cases, consider a 2 tier memory system:
> >
> > nodes 0,1 -> top tier
> > nodes 2,3 -> second tier
> >
> > $ echo "1m nodes=0" > memory.reclaim
> >
> > This instructs the kernel to attempt to reclaim 1m memory from node 0.
> > Since node 0 is a top tier node, demotion will be attempted first. This
> > is useful to direct proactive reclaim to specific nodes that are under
> > pressure.
> >
> > $ echo "1m nodes=2,3" > memory.reclaim
> >
> > This instructs the kernel to attempt to reclaim 1m memory in the second tier,
> > since this tier of memory has no demotion targets the memory will be
> > reclaimed.
> >
> > $ echo "1m nodes=0,1" > memory.reclaim
> >
> > Instructs the kernel to reclaim memory from the top tier nodes, which can
> > be desirable according to the userspace policy if there is pressure on
> > the top tiers. Since these nodes have demotion targets, the kernel will
> > attempt demotion first.
> >
> > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> > reclaim""), the proactive reclaim interface memory.reclaim does both
> > reclaim and demotion. Reclaim and demotion incur different latency costs
> > to the jobs in the cgroup. Demoted memory would still be addressable
> > by the userspace at a higher latency, but reclaimed memory would need to
> > incur a pagefault.
> >
> > The 'nodes' arg is useful to allow the userspace to control demotion
> > and reclaim independently according to its policy: if the memory.reclaim
> > is called on a node with demotion targets, it will attempt demotion first;
> > if it is called on a node without demotion targets, it will only attempt
> > reclaim.
> >
> > Acked-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
> > Signed-off-by: Mina Almasry <almasrymina-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>
> After discussion in [1] I have realized that I haven't really thought
> through all the consequences of this patch and therefore I am retracting
> my ack here. I am not nacking the patch at this statge but I also think
> this shouldn't be merged now and we should really consider all the
> consequences.
>
> Let me summarize my main concerns here as well. The proposed
> implementation doesn't apply the provided nodemask to the whole reclaim
> process. This means that demotion can happen outside of the mask so the
> the user request cannot really control demotion targets and that limits
> the interface should there be any need for a finer grained control in
> the future (see an example in [2]).
> Another problem is that this can limit future reclaim extensions because
> of existing assumptions of the interface [3] - specify only top-tier
> node to force the aging without actually reclaiming any charges and
> (ab)use the interface only for aging on multi-tier system. A change to
> the reclaim to not demote in some cases could break this usecase.
>

I think this is correct. My use case is to request from the kernel to
do demotion without reclaim in the cgroup, and the reason for that is
stated in the commit message:

"Reclaim and demotion incur different latency costs to the jobs in the
cgroup. Demoted memory would still be addressable by the userspace at
a higher latency, but reclaimed memory would need to incur a
pagefault."

For jobs of some latency tiers, we would like to trigger proactive
demotion (which incurs relatively low latency on the job), but not
trigger proactive reclaim (which incurs a pagefault). I initially had
proposed a separate interface for this, but Johannes directed me to
this interface instead in [1]. In the same email Johannes also tells
me that meta's reclaim stack relies on memory.reclaim triggering
demotion, so it seems that I'm not the first to take a dependency on
this. Additionally in [2] Johannes also says it would be great if in
the long term reclaim policy and demotion policy do not diverge.

[1] https://lore.kernel.org/linux-mm/Y35fw2JSAeAddONg-druUgvl0LCNAfugRpC6u6w@public.gmane.org/
[2] https://lore.kernel.org/linux-mm/Y36fIGFCFKiocAd6-druUgvl0LCNAfugRpC6u6w@public.gmane.org/

> My counter proposal would be to define the nodemask for memory.reclaim
> as a domain to constrain the charge reclaim. That means both aging and
> reclaim including demotion which is a part of aging. This will allow
> to control where to demote for balancing purposes (e.g. demote to node 2
> rather than 3) which is impossible with the proposed scheme.
>

My understanding is that with this interface in order to trigger
demotion I would want to list both the top tier nodes and the bottom
tier nodes on the nodemask, and since the bottom tier nodes are in the
nodemask the kernel will not just trigger demotion, but will also
trigger reclaim. This is very specifically not our use case and not
the goal of this patch.

I had also suggested adding a demotion= arg to memory.reclaim so the
userspace may customize this behavior, but Johannes rejected this in
[3] to adhere to the aging pipeline.

All in all I like Johannes's model in [3] describing the aging
pipeline and the relationship between demotion and reclaim. The nodes=
arg is just a hint to the kernel that the userspace is looking for
reclaim from a top tier node (which would be done by demotion
according to the aging pipeline) or a bottom tier node (which would be
done by reclaim according to the aging pipeline). I think this
interface is aligned with this model.

[3] https://lore.kernel.org/linux-mm/Y36XchdgTCsMP4jT-druUgvl0LCNAfugRpC6u6w@public.gmane.org/

> [1] http://lkml.kernel.org/r/20221206023406.3182800-1-almasrymina-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org
> [2] http://lkml.kernel.org/r/Y5bnRtJ6sojtjgVD-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org
> [3] http://lkml.kernel.org/r/CAAPL-u8rgW-JACKUT5ChmGSJiTDABcDRjNzW_QxMjCTk9zO4sg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
  2022-12-13  0:54       ` Mina Almasry
@ 2022-12-13  6:30         ` Huang, Ying
  2022-12-13  7:48           ` Wei Xu
                             ` (2 more replies)
  2022-12-13  8:33         ` Michal Hocko
  1 sibling, 3 replies; 35+ messages in thread
From: Huang, Ying @ 2022-12-13  6:30 UTC (permalink / raw)
  To: Mina Almasry, Michal Hocko
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Yang Shi, Yosry Ahmed, weixugc, fvdl, bagasdotme, cgroups,
	linux-doc, linux-kernel, linux-mm

Mina Almasry <almasrymina@google.com> writes:

> On Mon, Dec 12, 2022 at 12:55 AM Michal Hocko <mhocko@suse.com> wrote:
>>
>> On Fri 02-12-22 14:35:31, Mina Almasry wrote:
>> > The nodes= arg instructs the kernel to only scan the given nodes for
>> > proactive reclaim. For example use cases, consider a 2 tier memory system:
>> >
>> > nodes 0,1 -> top tier
>> > nodes 2,3 -> second tier
>> >
>> > $ echo "1m nodes=0" > memory.reclaim
>> >
>> > This instructs the kernel to attempt to reclaim 1m memory from node 0.
>> > Since node 0 is a top tier node, demotion will be attempted first. This
>> > is useful to direct proactive reclaim to specific nodes that are under
>> > pressure.
>> >
>> > $ echo "1m nodes=2,3" > memory.reclaim
>> >
>> > This instructs the kernel to attempt to reclaim 1m memory in the second tier,
>> > since this tier of memory has no demotion targets the memory will be
>> > reclaimed.
>> >
>> > $ echo "1m nodes=0,1" > memory.reclaim
>> >
>> > Instructs the kernel to reclaim memory from the top tier nodes, which can
>> > be desirable according to the userspace policy if there is pressure on
>> > the top tiers. Since these nodes have demotion targets, the kernel will
>> > attempt demotion first.
>> >
>> > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
>> > reclaim""), the proactive reclaim interface memory.reclaim does both
>> > reclaim and demotion. Reclaim and demotion incur different latency costs
>> > to the jobs in the cgroup. Demoted memory would still be addressable
>> > by the userspace at a higher latency, but reclaimed memory would need to
>> > incur a pagefault.
>> >
>> > The 'nodes' arg is useful to allow the userspace to control demotion
>> > and reclaim independently according to its policy: if the memory.reclaim
>> > is called on a node with demotion targets, it will attempt demotion first;
>> > if it is called on a node without demotion targets, it will only attempt
>> > reclaim.
>> >
>> > Acked-by: Michal Hocko <mhocko@suse.com>
>> > Signed-off-by: Mina Almasry <almasrymina@google.com>
>>
>> After discussion in [1] I have realized that I haven't really thought
>> through all the consequences of this patch and therefore I am retracting
>> my ack here. I am not nacking the patch at this statge but I also think
>> this shouldn't be merged now and we should really consider all the
>> consequences.
>>
>> Let me summarize my main concerns here as well. The proposed
>> implementation doesn't apply the provided nodemask to the whole reclaim
>> process. This means that demotion can happen outside of the mask so the
>> the user request cannot really control demotion targets and that limits
>> the interface should there be any need for a finer grained control in
>> the future (see an example in [2]).
>> Another problem is that this can limit future reclaim extensions because
>> of existing assumptions of the interface [3] - specify only top-tier
>> node to force the aging without actually reclaiming any charges and
>> (ab)use the interface only for aging on multi-tier system. A change to
>> the reclaim to not demote in some cases could break this usecase.
>>
>
> I think this is correct. My use case is to request from the kernel to
> do demotion without reclaim in the cgroup, and the reason for that is
> stated in the commit message:
>
> "Reclaim and demotion incur different latency costs to the jobs in the
> cgroup. Demoted memory would still be addressable by the userspace at
> a higher latency, but reclaimed memory would need to incur a
> pagefault."
>
> For jobs of some latency tiers, we would like to trigger proactive
> demotion (which incurs relatively low latency on the job), but not
> trigger proactive reclaim (which incurs a pagefault). I initially had
> proposed a separate interface for this, but Johannes directed me to
> this interface instead in [1]. In the same email Johannes also tells
> me that meta's reclaim stack relies on memory.reclaim triggering
> demotion, so it seems that I'm not the first to take a dependency on
> this. Additionally in [2] Johannes also says it would be great if in
> the long term reclaim policy and demotion policy do not diverge.
>
> [1] https://lore.kernel.org/linux-mm/Y35fw2JSAeAddONg@cmpxchg.org/
> [2] https://lore.kernel.org/linux-mm/Y36fIGFCFKiocAd6@cmpxchg.org/

After these discussion, I think the solution maybe use different
interfaces for "proactive demote" and "proactive reclaim".  That is,
reconsider "memory.demote".  In this way, we will always uncharge the
cgroup for "memory.reclaim".  This avoid the possible confusion there.
And, because demotion is considered aging, we don't need to disable
demotion for "memory.reclaim", just don't count it.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
  2022-12-13  6:30         ` Huang, Ying
@ 2022-12-13  7:48           ` Wei Xu
  2022-12-13  8:51           ` Michal Hocko
       [not found]           ` <87k02volwe.fsf-fFUE1NP8JkzwuUmzmnQr+vooFf0ArEBIu+b9c/7xato@public.gmane.org>
  2 siblings, 0 replies; 35+ messages in thread
From: Wei Xu @ 2022-12-13  7:48 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Mina Almasry, Michal Hocko, Tejun Heo, Zefan Li, Johannes Weiner,
	Jonathan Corbet, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Yang Shi, Yosry Ahmed, fvdl, bagasdotme, cgroups,
	linux-doc, linux-kernel, linux-mm

On Mon, Dec 12, 2022 at 10:32 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Mina Almasry <almasrymina@google.com> writes:
>
> > On Mon, Dec 12, 2022 at 12:55 AM Michal Hocko <mhocko@suse.com> wrote:
> >>
> >> On Fri 02-12-22 14:35:31, Mina Almasry wrote:
> >> > The nodes= arg instructs the kernel to only scan the given nodes for
> >> > proactive reclaim. For example use cases, consider a 2 tier memory system:
> >> >
> >> > nodes 0,1 -> top tier
> >> > nodes 2,3 -> second tier
> >> >
> >> > $ echo "1m nodes=0" > memory.reclaim
> >> >
> >> > This instructs the kernel to attempt to reclaim 1m memory from node 0.
> >> > Since node 0 is a top tier node, demotion will be attempted first. This
> >> > is useful to direct proactive reclaim to specific nodes that are under
> >> > pressure.
> >> >
> >> > $ echo "1m nodes=2,3" > memory.reclaim
> >> >
> >> > This instructs the kernel to attempt to reclaim 1m memory in the second tier,
> >> > since this tier of memory has no demotion targets the memory will be
> >> > reclaimed.
> >> >
> >> > $ echo "1m nodes=0,1" > memory.reclaim
> >> >
> >> > Instructs the kernel to reclaim memory from the top tier nodes, which can
> >> > be desirable according to the userspace policy if there is pressure on
> >> > the top tiers. Since these nodes have demotion targets, the kernel will
> >> > attempt demotion first.
> >> >
> >> > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> >> > reclaim""), the proactive reclaim interface memory.reclaim does both
> >> > reclaim and demotion. Reclaim and demotion incur different latency costs
> >> > to the jobs in the cgroup. Demoted memory would still be addressable
> >> > by the userspace at a higher latency, but reclaimed memory would need to
> >> > incur a pagefault.
> >> >
> >> > The 'nodes' arg is useful to allow the userspace to control demotion
> >> > and reclaim independently according to its policy: if the memory.reclaim
> >> > is called on a node with demotion targets, it will attempt demotion first;
> >> > if it is called on a node without demotion targets, it will only attempt
> >> > reclaim.
> >> >
> >> > Acked-by: Michal Hocko <mhocko@suse.com>
> >> > Signed-off-by: Mina Almasry <almasrymina@google.com>
> >>
> >> After discussion in [1] I have realized that I haven't really thought
> >> through all the consequences of this patch and therefore I am retracting
> >> my ack here. I am not nacking the patch at this statge but I also think
> >> this shouldn't be merged now and we should really consider all the
> >> consequences.
> >>
> >> Let me summarize my main concerns here as well. The proposed
> >> implementation doesn't apply the provided nodemask to the whole reclaim
> >> process. This means that demotion can happen outside of the mask so the
> >> the user request cannot really control demotion targets and that limits
> >> the interface should there be any need for a finer grained control in
> >> the future (see an example in [2]).
> >> Another problem is that this can limit future reclaim extensions because
> >> of existing assumptions of the interface [3] - specify only top-tier
> >> node to force the aging without actually reclaiming any charges and
> >> (ab)use the interface only for aging on multi-tier system. A change to
> >> the reclaim to not demote in some cases could break this usecase.
> >>
> >
> > I think this is correct. My use case is to request from the kernel to
> > do demotion without reclaim in the cgroup, and the reason for that is
> > stated in the commit message:
> >
> > "Reclaim and demotion incur different latency costs to the jobs in the
> > cgroup. Demoted memory would still be addressable by the userspace at
> > a higher latency, but reclaimed memory would need to incur a
> > pagefault."
> >
> > For jobs of some latency tiers, we would like to trigger proactive
> > demotion (which incurs relatively low latency on the job), but not
> > trigger proactive reclaim (which incurs a pagefault). I initially had
> > proposed a separate interface for this, but Johannes directed me to
> > this interface instead in [1]. In the same email Johannes also tells
> > me that meta's reclaim stack relies on memory.reclaim triggering
> > demotion, so it seems that I'm not the first to take a dependency on
> > this. Additionally in [2] Johannes also says it would be great if in
> > the long term reclaim policy and demotion policy do not diverge.
> >
> > [1] https://lore.kernel.org/linux-mm/Y35fw2JSAeAddONg@cmpxchg.org/
> > [2] https://lore.kernel.org/linux-mm/Y36fIGFCFKiocAd6@cmpxchg.org/
>
> After these discussion, I think the solution maybe use different
> interfaces for "proactive demote" and "proactive reclaim".  That is,
> reconsider "memory.demote".  In this way, we will always uncharge the
> cgroup for "memory.reclaim".  This avoid the possible confusion there.
> And, because demotion is considered aging, we don't need to disable
> demotion for "memory.reclaim", just don't count it.

+1 on memory.demote.

> Best Regards,
> Huang, Ying
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
  2022-12-13  6:30         ` Huang, Ying
  2022-12-13  7:48           ` Wei Xu
@ 2022-12-13  8:51           ` Michal Hocko
  2022-12-13 13:42             ` Huang, Ying
       [not found]           ` <87k02volwe.fsf-fFUE1NP8JkzwuUmzmnQr+vooFf0ArEBIu+b9c/7xato@public.gmane.org>
  2 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2022-12-13  8:51 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Mina Almasry, Tejun Heo, Zefan Li, Johannes Weiner,
	Jonathan Corbet, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Yang Shi, Yosry Ahmed, weixugc, fvdl, bagasdotme,
	cgroups, linux-doc, linux-kernel, linux-mm

On Tue 13-12-22 14:30:57, Huang, Ying wrote:
> Mina Almasry <almasrymina@google.com> writes:
[...]
> After these discussion, I think the solution maybe use different
> interfaces for "proactive demote" and "proactive reclaim".  That is,
> reconsider "memory.demote".  In this way, we will always uncharge the
> cgroup for "memory.reclaim".  This avoid the possible confusion there.
> And, because demotion is considered aging, we don't need to disable
> demotion for "memory.reclaim", just don't count it.

As already pointed out in my previous email, we should really think more
about future requirements. Do we add memory.promote interface when there
is a request to implement numa balancing into the userspace? Maybe yes
but maybe the node balancing should be more generic than bound to memory
tiering and apply to a more fine grained nodemask control.

Fundamentally we already have APIs to age (MADV_COLD, MADV_FREE),
reclaim (MADV_PAGEOUT, MADV_DONTNEED) and MADV_WILLNEED to prioritize
(swap in, or read ahead) which are per mm/file. Their primary usability
issue is that they are process centric and that requires a very deep
understanding of the process mm layout so it is not really usable for a
larger scale orchestration.
The important part of those interfaces is that they do not talk about
demotion because that is an implementation detail. I think we want to
follow that model at least. From a higher level POV I believe we really
need an interface to age&reclaim and balance memory among nodes. Are
there more higher level usecases?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
  2022-12-13  8:51           ` Michal Hocko
@ 2022-12-13 13:42             ` Huang, Ying
  0 siblings, 0 replies; 35+ messages in thread
From: Huang, Ying @ 2022-12-13 13:42 UTC (permalink / raw)
  To: Michal Hocko, Mina Almasry, weixugc
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Yang Shi, Yosry Ahmed, fvdl, bagasdotme, cgroups, linux-doc,
	linux-kernel, linux-mm

Michal Hocko <mhocko@suse.com> writes:

> On Tue 13-12-22 14:30:57, Huang, Ying wrote:
>> Mina Almasry <almasrymina@google.com> writes:
> [...]
>> After these discussion, I think the solution maybe use different
>> interfaces for "proactive demote" and "proactive reclaim".  That is,
>> reconsider "memory.demote".  In this way, we will always uncharge the
>> cgroup for "memory.reclaim".  This avoid the possible confusion there.
>> And, because demotion is considered aging, we don't need to disable
>> demotion for "memory.reclaim", just don't count it.
>
> As already pointed out in my previous email, we should really think more
> about future requirements. Do we add memory.promote interface when there
> is a request to implement numa balancing into the userspace? Maybe yes
> but maybe the node balancing should be more generic than bound to memory
> tiering and apply to a more fine grained nodemask control.
>
> Fundamentally we already have APIs to age (MADV_COLD, MADV_FREE),
> reclaim (MADV_PAGEOUT, MADV_DONTNEED) and MADV_WILLNEED to prioritize
> (swap in, or read ahead) which are per mm/file. Their primary usability
> issue is that they are process centric and that requires a very deep
> understanding of the process mm layout so it is not really usable for a
> larger scale orchestration.
> The important part of those interfaces is that they do not talk about
> demotion because that is an implementation detail. I think we want to
> follow that model at least. From a higher level POV I believe we really
> need an interface to age&reclaim and balance memory among nodes. Are
> there more higher level usecases?

Yes.  If the high level interface can satisfy the requirements, we
should use them or define them.  But I guess Mina and Xu has some
requirements at the level of memory tiers (demotion/promotion)?

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 35+ messages in thread

[parent not found: <87k02volwe.fsf-fFUE1NP8JkzwuUmzmnQr+vooFf0ArEBIu+b9c/7xato@public.gmane.org>]

* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
       [not found]           ` <87k02volwe.fsf-fFUE1NP8JkzwuUmzmnQr+vooFf0ArEBIu+b9c/7xato@public.gmane.org>
@ 2022-12-13 13:30             ` Johannes Weiner
  2022-12-13 14:03               ` Michal Hocko
  0 siblings, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2022-12-13 13:30 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Mina Almasry, Michal Hocko, Tejun Heo, Zefan Li, Jonathan Corbet,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Yang Shi, Yosry Ahmed, weixugc-hpIqsD4AKlfQT0dZR+AlfA,
	fvdl-hpIqsD4AKlfQT0dZR+AlfA, bagasdotme-Re5JQEeQqe8AvxtiuMwx3w,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

On Tue, Dec 13, 2022 at 02:30:57PM +0800, Huang, Ying wrote:
> Mina Almasry <almasrymina-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> writes:
> 
> > On Mon, Dec 12, 2022 at 12:55 AM Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote:
> >>
> >> On Fri 02-12-22 14:35:31, Mina Almasry wrote:
> >> > The nodes= arg instructs the kernel to only scan the given nodes for
> >> > proactive reclaim. For example use cases, consider a 2 tier memory system:
> >> >
> >> > nodes 0,1 -> top tier
> >> > nodes 2,3 -> second tier
> >> >
> >> > $ echo "1m nodes=0" > memory.reclaim
> >> >
> >> > This instructs the kernel to attempt to reclaim 1m memory from node 0.
> >> > Since node 0 is a top tier node, demotion will be attempted first. This
> >> > is useful to direct proactive reclaim to specific nodes that are under
> >> > pressure.
> >> >
> >> > $ echo "1m nodes=2,3" > memory.reclaim
> >> >
> >> > This instructs the kernel to attempt to reclaim 1m memory in the second tier,
> >> > since this tier of memory has no demotion targets the memory will be
> >> > reclaimed.
> >> >
> >> > $ echo "1m nodes=0,1" > memory.reclaim
> >> >
> >> > Instructs the kernel to reclaim memory from the top tier nodes, which can
> >> > be desirable according to the userspace policy if there is pressure on
> >> > the top tiers. Since these nodes have demotion targets, the kernel will
> >> > attempt demotion first.
> >> >
> >> > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> >> > reclaim""), the proactive reclaim interface memory.reclaim does both
> >> > reclaim and demotion. Reclaim and demotion incur different latency costs
> >> > to the jobs in the cgroup. Demoted memory would still be addressable
> >> > by the userspace at a higher latency, but reclaimed memory would need to
> >> > incur a pagefault.
> >> >
> >> > The 'nodes' arg is useful to allow the userspace to control demotion
> >> > and reclaim independently according to its policy: if the memory.reclaim
> >> > is called on a node with demotion targets, it will attempt demotion first;
> >> > if it is called on a node without demotion targets, it will only attempt
> >> > reclaim.
> >> >
> >> > Acked-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
> >> > Signed-off-by: Mina Almasry <almasrymina-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> >>
> >> After discussion in [1] I have realized that I haven't really thought
> >> through all the consequences of this patch and therefore I am retracting
> >> my ack here. I am not nacking the patch at this statge but I also think
> >> this shouldn't be merged now and we should really consider all the
> >> consequences.
> >>
> >> Let me summarize my main concerns here as well. The proposed
> >> implementation doesn't apply the provided nodemask to the whole reclaim
> >> process. This means that demotion can happen outside of the mask so the
> >> the user request cannot really control demotion targets and that limits
> >> the interface should there be any need for a finer grained control in
> >> the future (see an example in [2]).
> >> Another problem is that this can limit future reclaim extensions because
> >> of existing assumptions of the interface [3] - specify only top-tier
> >> node to force the aging without actually reclaiming any charges and
> >> (ab)use the interface only for aging on multi-tier system. A change to
> >> the reclaim to not demote in some cases could break this usecase.
> >>
> >
> > I think this is correct. My use case is to request from the kernel to
> > do demotion without reclaim in the cgroup, and the reason for that is
> > stated in the commit message:
> >
> > "Reclaim and demotion incur different latency costs to the jobs in the
> > cgroup. Demoted memory would still be addressable by the userspace at
> > a higher latency, but reclaimed memory would need to incur a
> > pagefault."
> >
> > For jobs of some latency tiers, we would like to trigger proactive
> > demotion (which incurs relatively low latency on the job), but not
> > trigger proactive reclaim (which incurs a pagefault). I initially had
> > proposed a separate interface for this, but Johannes directed me to
> > this interface instead in [1]. In the same email Johannes also tells
> > me that meta's reclaim stack relies on memory.reclaim triggering
> > demotion, so it seems that I'm not the first to take a dependency on
> > this. Additionally in [2] Johannes also says it would be great if in
> > the long term reclaim policy and demotion policy do not diverge.
> >
> > [1] https://lore.kernel.org/linux-mm/Y35fw2JSAeAddONg-druUgvl0LCNAfugRpC6u6w@public.gmane.org/
> > [2] https://lore.kernel.org/linux-mm/Y36fIGFCFKiocAd6-druUgvl0LCNAfugRpC6u6w@public.gmane.org/
> 
> After these discussion, I think the solution maybe use different
> interfaces for "proactive demote" and "proactive reclaim".  That is,
> reconsider "memory.demote".  In this way, we will always uncharge the
> cgroup for "memory.reclaim".  This avoid the possible confusion there.
> And, because demotion is considered aging, we don't need to disable
> demotion for "memory.reclaim", just don't count it.

Hm, so in summary:

1) memory.reclaim would demote and reclaim like today, but it would
   change to only count reclaimed pages against the goal.

2) memory.demote would only demote.

   a) What if the demotion targets are full? Would it reclaim or fail?

3) Would memory.reclaim and memory.demote still need nodemasks? Would
   they return -EINVAL if a) memory.reclaim gets passed only toptier
   nodes or b) memory.demote gets passed any lasttier nodes?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
  2022-12-13 13:30             ` Johannes Weiner
@ 2022-12-13 14:03               ` Michal Hocko
       [not found]                 ` <Y5iGJ/9PMmSCwqLj-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2022-12-13 14:03 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Huang, Ying, Mina Almasry, Tejun Heo, Zefan Li, Jonathan Corbet,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Yang Shi, Yosry Ahmed, weixugc, fvdl, bagasdotme, cgroups,
	linux-doc, linux-kernel, linux-mm

On Tue 13-12-22 14:30:40, Johannes Weiner wrote:
> On Tue, Dec 13, 2022 at 02:30:57PM +0800, Huang, Ying wrote:
[...]
> > After these discussion, I think the solution maybe use different
> > interfaces for "proactive demote" and "proactive reclaim".  That is,
> > reconsider "memory.demote".  In this way, we will always uncharge the
> > cgroup for "memory.reclaim".  This avoid the possible confusion there.
> > And, because demotion is considered aging, we don't need to disable
> > demotion for "memory.reclaim", just don't count it.
> 
> Hm, so in summary:
> 
> 1) memory.reclaim would demote and reclaim like today, but it would
>    change to only count reclaimed pages against the goal.
> 
> 2) memory.demote would only demote.
> 
>    a) What if the demotion targets are full? Would it reclaim or fail?
> 
> 3) Would memory.reclaim and memory.demote still need nodemasks? Would
>    they return -EINVAL if a) memory.reclaim gets passed only toptier
>    nodes or b) memory.demote gets passed any lasttier nodes?

I would also add
4) Do we want to allow to control the demotion path (e.g. which node to
   demote from and to) and how to achieve that?
5) Is the demotion api restricted to multi-tier systems or any numa
   configuration allowed as well?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 35+ messages in thread

[parent not found: <Y5iGJ/9PMmSCwqLj-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]

* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
       [not found]                 ` <Y5iGJ/9PMmSCwqLj-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2022-12-13 19:29                   ` Mina Almasry
  2022-12-14 10:23                     ` Michal Hocko
  0 siblings, 1 reply; 35+ messages in thread
From: Mina Almasry @ 2022-12-13 19:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Huang, Ying, Tejun Heo, Zefan Li,
	Jonathan Corbet, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Yang Shi, Yosry Ahmed,
	weixugc-hpIqsD4AKlfQT0dZR+AlfA, fvdl-hpIqsD4AKlfQT0dZR+AlfA,
	bagasdotme-Re5JQEeQqe8AvxtiuMwx3w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

On Tue, Dec 13, 2022 at 6:03 AM Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote:
>
> On Tue 13-12-22 14:30:40, Johannes Weiner wrote:
> > On Tue, Dec 13, 2022 at 02:30:57PM +0800, Huang, Ying wrote:
> [...]
> > > After these discussion, I think the solution maybe use different
> > > interfaces for "proactive demote" and "proactive reclaim".  That is,
> > > reconsider "memory.demote".  In this way, we will always uncharge the
> > > cgroup for "memory.reclaim".  This avoid the possible confusion there.
> > > And, because demotion is considered aging, we don't need to disable
> > > demotion for "memory.reclaim", just don't count it.
> >
> > Hm, so in summary:
> >
> > 1) memory.reclaim would demote and reclaim like today, but it would
> >    change to only count reclaimed pages against the goal.
> >
> > 2) memory.demote would only demote.
> >

If the above 2 points are agreeable then yes, this sounds good to me
and does address our use case.

> >    a) What if the demotion targets are full? Would it reclaim or fail?
> >

Wei will chime in if he disagrees, but I think we _require_ that it
fails, not falls back to reclaim. The interface is asking for
demotion, and is called memory.demote. For such an interface to fall
back to reclaim would be very confusing to userspace and may trigger
reclaim on a high priority job that we want to shield from proactive
reclaim.

> > 3) Would memory.reclaim and memory.demote still need nodemasks?

memory.demote will need a nodemask, for sure. Today the nodemask would
be useful if there is a specific node in the top tier that is
overloaded and we want to reduce the pressure by demoting. In the
future there will be N tiers and the nodemask says which tier to
demote from.

I don't think memory.reclaim would need a nodemask anymore? At least I
no longer see the use for it for us.

> >    Would
> >    they return -EINVAL if a) memory.reclaim gets passed only toptier
> >    nodes or b) memory.demote gets passed any lasttier nodes?
>

Honestly it would be great if memory.reclaim can force reclaim from a
top tier nodes. It breaks the aginig pipeline, yes, but if the user is
specifically asking for that because they decided in their usecase
it's a good idea then the kernel should comply IMO. Not a strict
requirement for us. Wei will chime in if he disagrees.

memory.demote returning -EINVAL for lasttier nodes makes sense to me.

> I would also add
> 4) Do we want to allow to control the demotion path (e.g. which node to
>    demote from and to) and how to achieve that?

We care deeply about specifying which node to demote _from_. That
would be some node that is approaching pressure and we're looking for
proactive saving from. So far I haven't seen any reason to control
which nodes to demote _to_. The kernel deciding that based on the
aging pipeline and the node distances sounds good to me. Obviously
someone else may find that useful.

> 5) Is the demotion api restricted to multi-tier systems or any numa
>    configuration allowed as well?
>

demotion will of course not work on single tiered systems. The
interface may return some failure on such systems or not be available
at all.

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
  2022-12-13 19:29                   ` Mina Almasry
@ 2022-12-14 10:23                     ` Michal Hocko
  2022-12-15  5:50                       ` Huang, Ying
       [not found]                       ` <Y5mkJL6I5Zlc1k97-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 2 replies; 35+ messages in thread
From: Michal Hocko @ 2022-12-14 10:23 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Johannes Weiner, Huang, Ying, Tejun Heo, Zefan Li,
	Jonathan Corbet, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Yang Shi, Yosry Ahmed, weixugc, fvdl, bagasdotme,
	cgroups, linux-doc, linux-kernel, linux-mm

On Tue 13-12-22 11:29:45, Mina Almasry wrote:
> On Tue, Dec 13, 2022 at 6:03 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 13-12-22 14:30:40, Johannes Weiner wrote:
> > > On Tue, Dec 13, 2022 at 02:30:57PM +0800, Huang, Ying wrote:
> > [...]
> > > > After these discussion, I think the solution maybe use different
> > > > interfaces for "proactive demote" and "proactive reclaim".  That is,
> > > > reconsider "memory.demote".  In this way, we will always uncharge the
> > > > cgroup for "memory.reclaim".  This avoid the possible confusion there.
> > > > And, because demotion is considered aging, we don't need to disable
> > > > demotion for "memory.reclaim", just don't count it.
> > >
> > > Hm, so in summary:
> > >
> > > 1) memory.reclaim would demote and reclaim like today, but it would
> > >    change to only count reclaimed pages against the goal.
> > >
> > > 2) memory.demote would only demote.
> > >
> 
> If the above 2 points are agreeable then yes, this sounds good to me
> and does address our use case.
> 
> > >    a) What if the demotion targets are full? Would it reclaim or fail?
> > >
> 
> Wei will chime in if he disagrees, but I think we _require_ that it
> fails, not falls back to reclaim. The interface is asking for
> demotion, and is called memory.demote. For such an interface to fall
> back to reclaim would be very confusing to userspace and may trigger
> reclaim on a high priority job that we want to shield from proactive
> reclaim.

But what should happen if the immediate demotion target is full but
lower tiers are still usable. Should the first one demote before
allowing to demote from the top tier?
 
> > > 3) Would memory.reclaim and memory.demote still need nodemasks?
> 
> memory.demote will need a nodemask, for sure. Today the nodemask would
> be useful if there is a specific node in the top tier that is
> overloaded and we want to reduce the pressure by demoting. In the
> future there will be N tiers and the nodemask says which tier to
> demote from.

OK, so what is the exact semantic of the node mask. Does it control
where to demote from or to or both?

> I don't think memory.reclaim would need a nodemask anymore? At least I
> no longer see the use for it for us.
> 
> > >    Would
> > >    they return -EINVAL if a) memory.reclaim gets passed only toptier
> > >    nodes or b) memory.demote gets passed any lasttier nodes?
> >
> 
> Honestly it would be great if memory.reclaim can force reclaim from a
> top tier nodes. It breaks the aginig pipeline, yes, but if the user is
> specifically asking for that because they decided in their usecase
> it's a good idea then the kernel should comply IMO. Not a strict
> requirement for us. Wei will chime in if he disagrees.

That would require a nodemask to say which nodes to reclaim, no? The
default behavior should be in line with what standard memory reclaim
does. If the demotion is a part of that process so should be
memory.reclaim part of it. If we want to have a finer control then a
nodemask is really a must and then the nodemaks should constrain both
agining and reclaim.

> memory.demote returning -EINVAL for lasttier nodes makes sense to me.
> 
> > I would also add
> > 4) Do we want to allow to control the demotion path (e.g. which node to
> >    demote from and to) and how to achieve that?
> 
> We care deeply about specifying which node to demote _from_. That
> would be some node that is approaching pressure and we're looking for
> proactive saving from. So far I haven't seen any reason to control
> which nodes to demote _to_. The kernel deciding that based on the
> aging pipeline and the node distances sounds good to me. Obviously
> someone else may find that useful.

Please keep in mind that the interface should be really prepared for
future extensions so try to abstract from your immediate usecases.

> > 5) Is the demotion api restricted to multi-tier systems or any numa
> >    configuration allowed as well?
> >
> 
> demotion will of course not work on single tiered systems. The
> interface may return some failure on such systems or not be available
> at all.

Is there any strong reason for that? We do not have any interface to
control NUMA balancing from userspace. Why cannot we use the interface
for that purpose? 
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
  2022-12-14 10:23                     ` Michal Hocko
@ 2022-12-15  5:50                       ` Huang, Ying
       [not found]                         ` <87mt7pdxm1.fsf-fFUE1NP8JkzwuUmzmnQr+vooFf0ArEBIu+b9c/7xato@public.gmane.org>
       [not found]                       ` <Y5mkJL6I5Zlc1k97-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  1 sibling, 1 reply; 35+ messages in thread
From: Huang, Ying @ 2022-12-15  5:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mina Almasry, Johannes Weiner, Tejun Heo, Zefan Li,
	Jonathan Corbet, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Yang Shi, Yosry Ahmed, weixugc, fvdl, bagasdotme,
	cgroups, linux-doc, linux-kernel, linux-mm

Michal Hocko <mhocko@suse.com> writes:

> On Tue 13-12-22 11:29:45, Mina Almasry wrote:
>> On Tue, Dec 13, 2022 at 6:03 AM Michal Hocko <mhocko@suse.com> wrote:
>> >
>> > On Tue 13-12-22 14:30:40, Johannes Weiner wrote:
>> > > On Tue, Dec 13, 2022 at 02:30:57PM +0800, Huang, Ying wrote:
>> > [...]
>> > > > After these discussion, I think the solution maybe use different
>> > > > interfaces for "proactive demote" and "proactive reclaim".  That is,
>> > > > reconsider "memory.demote".  In this way, we will always uncharge the
>> > > > cgroup for "memory.reclaim".  This avoid the possible confusion there.
>> > > > And, because demotion is considered aging, we don't need to disable
>> > > > demotion for "memory.reclaim", just don't count it.
>> > >
>> > > Hm, so in summary:
>> > >
>> > > 1) memory.reclaim would demote and reclaim like today, but it would
>> > >    change to only count reclaimed pages against the goal.
>> > >
>> > > 2) memory.demote would only demote.
>> > >
>> 
>> If the above 2 points are agreeable then yes, this sounds good to me
>> and does address our use case.
>> 
>> > >    a) What if the demotion targets are full? Would it reclaim or fail?
>> > >
>> 
>> Wei will chime in if he disagrees, but I think we _require_ that it
>> fails, not falls back to reclaim. The interface is asking for
>> demotion, and is called memory.demote. For such an interface to fall
>> back to reclaim would be very confusing to userspace and may trigger
>> reclaim on a high priority job that we want to shield from proactive
>> reclaim.
>
> But what should happen if the immediate demotion target is full but
> lower tiers are still usable. Should the first one demote before
> allowing to demote from the top tier?
>  
>> > > 3) Would memory.reclaim and memory.demote still need nodemasks?
>> 
>> memory.demote will need a nodemask, for sure. Today the nodemask would
>> be useful if there is a specific node in the top tier that is
>> overloaded and we want to reduce the pressure by demoting. In the
>> future there will be N tiers and the nodemask says which tier to
>> demote from.
>
> OK, so what is the exact semantic of the node mask. Does it control
> where to demote from or to or both?
>
>> I don't think memory.reclaim would need a nodemask anymore? At least I
>> no longer see the use for it for us.
>> 
>> > >    Would
>> > >    they return -EINVAL if a) memory.reclaim gets passed only toptier
>> > >    nodes or b) memory.demote gets passed any lasttier nodes?
>> >
>> 
>> Honestly it would be great if memory.reclaim can force reclaim from a
>> top tier nodes. It breaks the aginig pipeline, yes, but if the user is
>> specifically asking for that because they decided in their usecase
>> it's a good idea then the kernel should comply IMO. Not a strict
>> requirement for us. Wei will chime in if he disagrees.
>
> That would require a nodemask to say which nodes to reclaim, no? The
> default behavior should be in line with what standard memory reclaim
> does. If the demotion is a part of that process so should be
> memory.reclaim part of it. If we want to have a finer control then a
> nodemask is really a must and then the nodemaks should constrain both
> agining and reclaim.
>
>> memory.demote returning -EINVAL for lasttier nodes makes sense to me.
>> 
>> > I would also add
>> > 4) Do we want to allow to control the demotion path (e.g. which node to
>> >    demote from and to) and how to achieve that?
>> 
>> We care deeply about specifying which node to demote _from_. That
>> would be some node that is approaching pressure and we're looking for
>> proactive saving from. So far I haven't seen any reason to control
>> which nodes to demote _to_. The kernel deciding that based on the
>> aging pipeline and the node distances sounds good to me. Obviously
>> someone else may find that useful.
>
> Please keep in mind that the interface should be really prepared for
> future extensions so try to abstract from your immediate usecases.

I see two requirements here, one is to control the demotion source, that
is, which nodes to free memory.  The other is to control the demotion
path.  I think that we can use two different parameters for them, for
example, "from=<demotion source nodes>" and "to=<demotion target
nodes>".  In most cases we don't need to control the demotion path.
Because in current implementation, the nodes in the lower tiers in the
same socket (local nodes) will be preferred.  I think that this is
the desired behavior in most cases.

>> > 5) Is the demotion api restricted to multi-tier systems or any numa
>> >    configuration allowed as well?
>> >
>> 
>> demotion will of course not work on single tiered systems. The
>> interface may return some failure on such systems or not be available
>> at all.
>
> Is there any strong reason for that? We do not have any interface to
> control NUMA balancing from userspace. Why cannot we use the interface
> for that purpose? 

Do you mean to demote the cold pages from the specified source nodes to
the specified target nodes in different sockets?  We don't do that to
avoid loop in the demotion path.  If we prevent the target nodes from
demoting cold pages to the source nodes at the same time, it seems
doable.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 35+ messages in thread

[parent not found: <87mt7pdxm1.fsf-fFUE1NP8JkzwuUmzmnQr+vooFf0ArEBIu+b9c/7xato@public.gmane.org>]

* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
       [not found]                         ` <87mt7pdxm1.fsf-fFUE1NP8JkzwuUmzmnQr+vooFf0ArEBIu+b9c/7xato@public.gmane.org>
@ 2022-12-15  9:21                           ` Michal Hocko
  2022-12-16  3:02                             ` Huang, Ying
  0 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2022-12-15  9:21 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Mina Almasry, Johannes Weiner, Tejun Heo, Zefan Li,
	Jonathan Corbet, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Yang Shi, Yosry Ahmed,
	weixugc-hpIqsD4AKlfQT0dZR+AlfA, fvdl-hpIqsD4AKlfQT0dZR+AlfA,
	bagasdotme-Re5JQEeQqe8AvxtiuMwx3w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

On Thu 15-12-22 13:50:14, Huang, Ying wrote:
> Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> writes:
> 
> > On Tue 13-12-22 11:29:45, Mina Almasry wrote:
> >> On Tue, Dec 13, 2022 at 6:03 AM Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote:
> >> >
> >> > On Tue 13-12-22 14:30:40, Johannes Weiner wrote:
> >> > > On Tue, Dec 13, 2022 at 02:30:57PM +0800, Huang, Ying wrote:
> >> > [...]
> >> > > > After these discussion, I think the solution maybe use different
> >> > > > interfaces for "proactive demote" and "proactive reclaim".  That is,
> >> > > > reconsider "memory.demote".  In this way, we will always uncharge the
> >> > > > cgroup for "memory.reclaim".  This avoid the possible confusion there.
> >> > > > And, because demotion is considered aging, we don't need to disable
> >> > > > demotion for "memory.reclaim", just don't count it.
> >> > >
> >> > > Hm, so in summary:
> >> > >
> >> > > 1) memory.reclaim would demote and reclaim like today, but it would
> >> > >    change to only count reclaimed pages against the goal.
> >> > >
> >> > > 2) memory.demote would only demote.
> >> > >
> >> 
> >> If the above 2 points are agreeable then yes, this sounds good to me
> >> and does address our use case.
> >> 
> >> > >    a) What if the demotion targets are full? Would it reclaim or fail?
> >> > >
> >> 
> >> Wei will chime in if he disagrees, but I think we _require_ that it
> >> fails, not falls back to reclaim. The interface is asking for
> >> demotion, and is called memory.demote. For such an interface to fall
> >> back to reclaim would be very confusing to userspace and may trigger
> >> reclaim on a high priority job that we want to shield from proactive
> >> reclaim.
> >
> > But what should happen if the immediate demotion target is full but
> > lower tiers are still usable. Should the first one demote before
> > allowing to demote from the top tier?
> >  
> >> > > 3) Would memory.reclaim and memory.demote still need nodemasks?
> >> 
> >> memory.demote will need a nodemask, for sure. Today the nodemask would
> >> be useful if there is a specific node in the top tier that is
> >> overloaded and we want to reduce the pressure by demoting. In the
> >> future there will be N tiers and the nodemask says which tier to
> >> demote from.
> >
> > OK, so what is the exact semantic of the node mask. Does it control
> > where to demote from or to or both?
> >
> >> I don't think memory.reclaim would need a nodemask anymore? At least I
> >> no longer see the use for it for us.
> >> 
> >> > >    Would
> >> > >    they return -EINVAL if a) memory.reclaim gets passed only toptier
> >> > >    nodes or b) memory.demote gets passed any lasttier nodes?
> >> >
> >> 
> >> Honestly it would be great if memory.reclaim can force reclaim from a
> >> top tier nodes. It breaks the aginig pipeline, yes, but if the user is
> >> specifically asking for that because they decided in their usecase
> >> it's a good idea then the kernel should comply IMO. Not a strict
> >> requirement for us. Wei will chime in if he disagrees.
> >
> > That would require a nodemask to say which nodes to reclaim, no? The
> > default behavior should be in line with what standard memory reclaim
> > does. If the demotion is a part of that process so should be
> > memory.reclaim part of it. If we want to have a finer control then a
> > nodemask is really a must and then the nodemaks should constrain both
> > agining and reclaim.
> >
> >> memory.demote returning -EINVAL for lasttier nodes makes sense to me.
> >> 
> >> > I would also add
> >> > 4) Do we want to allow to control the demotion path (e.g. which node to
> >> >    demote from and to) and how to achieve that?
> >> 
> >> We care deeply about specifying which node to demote _from_. That
> >> would be some node that is approaching pressure and we're looking for
> >> proactive saving from. So far I haven't seen any reason to control
> >> which nodes to demote _to_. The kernel deciding that based on the
> >> aging pipeline and the node distances sounds good to me. Obviously
> >> someone else may find that useful.
> >
> > Please keep in mind that the interface should be really prepared for
> > future extensions so try to abstract from your immediate usecases.
> 
> I see two requirements here, one is to control the demotion source, that
> is, which nodes to free memory.  The other is to control the demotion
> path.  I think that we can use two different parameters for them, for
> example, "from=<demotion source nodes>" and "to=<demotion target
> nodes>".  In most cases we don't need to control the demotion path.
> Because in current implementation, the nodes in the lower tiers in the
> same socket (local nodes) will be preferred.  I think that this is
> the desired behavior in most cases.

Even if the demotion path is not really required at the moment we should
keep in mind future potential extensions. E.g. when a userspace based
balancing is to be implemented because the default behavior cannot
capture userspace policies (one example would be enforcing a
prioritization of containers when some container's demoted pages would
need to be demoted further to free up a space for a different
workload). 
 
> >> > 5) Is the demotion api restricted to multi-tier systems or any numa
> >> >    configuration allowed as well?
> >> >
> >> 
> >> demotion will of course not work on single tiered systems. The
> >> interface may return some failure on such systems or not be available
> >> at all.
> >
> > Is there any strong reason for that? We do not have any interface to
> > control NUMA balancing from userspace. Why cannot we use the interface
> > for that purpose? 
> 
> Do you mean to demote the cold pages from the specified source nodes to
> the specified target nodes in different sockets?  We don't do that to
> avoid loop in the demotion path.  If we prevent the target nodes from
> demoting cold pages to the source nodes at the same time, it seems
> doable.

Loops could be avoid by properly specifying from and to nodes if this is
going to be a fine grained interface to control demotion.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
  2022-12-15  9:21                           ` Michal Hocko
@ 2022-12-16  3:02                             ` Huang, Ying
  0 siblings, 0 replies; 35+ messages in thread
From: Huang, Ying @ 2022-12-16  3:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mina Almasry, Johannes Weiner, Tejun Heo, Zefan Li,
	Jonathan Corbet, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Yang Shi, Yosry Ahmed, weixugc, fvdl, bagasdotme,
	cgroups, linux-doc, linux-kernel, linux-mm

Michal Hocko <mhocko@suse.com> writes:

> On Thu 15-12-22 13:50:14, Huang, Ying wrote:
>> Michal Hocko <mhocko@suse.com> writes:
>> 
>> > On Tue 13-12-22 11:29:45, Mina Almasry wrote:
>> >> On Tue, Dec 13, 2022 at 6:03 AM Michal Hocko <mhocko@suse.com> wrote:
>> >> >
>> >> > On Tue 13-12-22 14:30:40, Johannes Weiner wrote:
>> >> > > On Tue, Dec 13, 2022 at 02:30:57PM +0800, Huang, Ying wrote:
>> >> > [...]
>> >> > > > After these discussion, I think the solution maybe use different
>> >> > > > interfaces for "proactive demote" and "proactive reclaim".  That is,
>> >> > > > reconsider "memory.demote".  In this way, we will always uncharge the
>> >> > > > cgroup for "memory.reclaim".  This avoid the possible confusion there.
>> >> > > > And, because demotion is considered aging, we don't need to disable
>> >> > > > demotion for "memory.reclaim", just don't count it.
>> >> > >
>> >> > > Hm, so in summary:
>> >> > >
>> >> > > 1) memory.reclaim would demote and reclaim like today, but it would
>> >> > >    change to only count reclaimed pages against the goal.
>> >> > >
>> >> > > 2) memory.demote would only demote.
>> >> > >
>> >> 
>> >> If the above 2 points are agreeable then yes, this sounds good to me
>> >> and does address our use case.
>> >> 
>> >> > >    a) What if the demotion targets are full? Would it reclaim or fail?
>> >> > >
>> >> 
>> >> Wei will chime in if he disagrees, but I think we _require_ that it
>> >> fails, not falls back to reclaim. The interface is asking for
>> >> demotion, and is called memory.demote. For such an interface to fall
>> >> back to reclaim would be very confusing to userspace and may trigger
>> >> reclaim on a high priority job that we want to shield from proactive
>> >> reclaim.
>> >
>> > But what should happen if the immediate demotion target is full but
>> > lower tiers are still usable. Should the first one demote before
>> > allowing to demote from the top tier?
>> >  
>> >> > > 3) Would memory.reclaim and memory.demote still need nodemasks?
>> >> 
>> >> memory.demote will need a nodemask, for sure. Today the nodemask would
>> >> be useful if there is a specific node in the top tier that is
>> >> overloaded and we want to reduce the pressure by demoting. In the
>> >> future there will be N tiers and the nodemask says which tier to
>> >> demote from.
>> >
>> > OK, so what is the exact semantic of the node mask. Does it control
>> > where to demote from or to or both?
>> >
>> >> I don't think memory.reclaim would need a nodemask anymore? At least I
>> >> no longer see the use for it for us.
>> >> 
>> >> > >    Would
>> >> > >    they return -EINVAL if a) memory.reclaim gets passed only toptier
>> >> > >    nodes or b) memory.demote gets passed any lasttier nodes?
>> >> >
>> >> 
>> >> Honestly it would be great if memory.reclaim can force reclaim from a
>> >> top tier nodes. It breaks the aginig pipeline, yes, but if the user is
>> >> specifically asking for that because they decided in their usecase
>> >> it's a good idea then the kernel should comply IMO. Not a strict
>> >> requirement for us. Wei will chime in if he disagrees.
>> >
>> > That would require a nodemask to say which nodes to reclaim, no? The
>> > default behavior should be in line with what standard memory reclaim
>> > does. If the demotion is a part of that process so should be
>> > memory.reclaim part of it. If we want to have a finer control then a
>> > nodemask is really a must and then the nodemaks should constrain both
>> > agining and reclaim.
>> >
>> >> memory.demote returning -EINVAL for lasttier nodes makes sense to me.
>> >> 
>> >> > I would also add
>> >> > 4) Do we want to allow to control the demotion path (e.g. which node to
>> >> >    demote from and to) and how to achieve that?
>> >> 
>> >> We care deeply about specifying which node to demote _from_. That
>> >> would be some node that is approaching pressure and we're looking for
>> >> proactive saving from. So far I haven't seen any reason to control
>> >> which nodes to demote _to_. The kernel deciding that based on the
>> >> aging pipeline and the node distances sounds good to me. Obviously
>> >> someone else may find that useful.
>> >
>> > Please keep in mind that the interface should be really prepared for
>> > future extensions so try to abstract from your immediate usecases.
>> 
>> I see two requirements here, one is to control the demotion source, that
>> is, which nodes to free memory.  The other is to control the demotion
>> path.  I think that we can use two different parameters for them, for
>> example, "from=<demotion source nodes>" and "to=<demotion target
>> nodes>".  In most cases we don't need to control the demotion path.
>> Because in current implementation, the nodes in the lower tiers in the
>> same socket (local nodes) will be preferred.  I think that this is
>> the desired behavior in most cases.
>
> Even if the demotion path is not really required at the moment we should
> keep in mind future potential extensions. E.g. when a userspace based
> balancing is to be implemented because the default behavior cannot
> capture userspace policies (one example would be enforcing a
> prioritization of containers when some container's demoted pages would
> need to be demoted further to free up a space for a different
> workload). 

Yes.  We should consider the potential requirements.

>> >> > 5) Is the demotion api restricted to multi-tier systems or any numa
>> >> >    configuration allowed as well?
>> >> >
>> >> 
>> >> demotion will of course not work on single tiered systems. The
>> >> interface may return some failure on such systems or not be available
>> >> at all.
>> >
>> > Is there any strong reason for that? We do not have any interface to
>> > control NUMA balancing from userspace. Why cannot we use the interface
>> > for that purpose? 
>> 
>> Do you mean to demote the cold pages from the specified source nodes to
>> the specified target nodes in different sockets?  We don't do that to
>> avoid loop in the demotion path.  If we prevent the target nodes from
>> demoting cold pages to the source nodes at the same time, it seems
>> doable.
>
> Loops could be avoid by properly specifying from and to nodes if this is
> going to be a fine grained interface to control demotion.

Yes.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 35+ messages in thread

[parent not found: <Y5mkJL6I5Zlc1k97-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]

* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
       [not found]                       ` <Y5mkJL6I5Zlc1k97-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2022-12-15 17:58                         ` Wei Xu
  2022-12-16  8:40                           ` Michal Hocko
  0 siblings, 1 reply; 35+ messages in thread
From: Wei Xu @ 2022-12-15 17:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mina Almasry, Johannes Weiner, Huang, Ying, Tejun Heo, Zefan Li,
	Jonathan Corbet, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Yang Shi, Yosry Ahmed, fvdl-hpIqsD4AKlfQT0dZR+AlfA,
	bagasdotme-Re5JQEeQqe8AvxtiuMwx3w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

On Wed, Dec 14, 2022 at 2:23 AM Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote:
>
> On Tue 13-12-22 11:29:45, Mina Almasry wrote:
> > On Tue, Dec 13, 2022 at 6:03 AM Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote:
> > >
> > > On Tue 13-12-22 14:30:40, Johannes Weiner wrote:
> > > > On Tue, Dec 13, 2022 at 02:30:57PM +0800, Huang, Ying wrote:
> > > [...]
> > > > > After these discussion, I think the solution maybe use different
> > > > > interfaces for "proactive demote" and "proactive reclaim".  That is,
> > > > > reconsider "memory.demote".  In this way, we will always uncharge the
> > > > > cgroup for "memory.reclaim".  This avoid the possible confusion there.
> > > > > And, because demotion is considered aging, we don't need to disable
> > > > > demotion for "memory.reclaim", just don't count it.
> > > >
> > > > Hm, so in summary:
> > > >
> > > > 1) memory.reclaim would demote and reclaim like today, but it would
> > > >    change to only count reclaimed pages against the goal.
> > > >
> > > > 2) memory.demote would only demote.
> > > >
> >
> > If the above 2 points are agreeable then yes, this sounds good to me
> > and does address our use case.
> >
> > > >    a) What if the demotion targets are full? Would it reclaim or fail?
> > > >
> >
> > Wei will chime in if he disagrees, but I think we _require_ that it
> > fails, not falls back to reclaim. The interface is asking for
> > demotion, and is called memory.demote. For such an interface to fall
> > back to reclaim would be very confusing to userspace and may trigger
> > reclaim on a high priority job that we want to shield from proactive
> > reclaim.
>
> But what should happen if the immediate demotion target is full but
> lower tiers are still usable. Should the first one demote before
> allowing to demote from the top tier?

In that case, the demotion will fall back to the lower tiers.  See
node_get_allowed_targets() and establish_demotion_targets()..

> > > > 3) Would memory.reclaim and memory.demote still need nodemasks?
> >
> > memory.demote will need a nodemask, for sure. Today the nodemask would
> > be useful if there is a specific node in the top tier that is
> > overloaded and we want to reduce the pressure by demoting. In the
> > future there will be N tiers and the nodemask says which tier to
> > demote from.
>
> OK, so what is the exact semantic of the node mask. Does it control
> where to demote from or to or both?

The nodemask argument proposed here is to control where to demote
from.   We can follow the existing kernel demotion order to select
where to demote to.  If the need to control the demotion destination
arises, another argument can be added.

> > I don't think memory.reclaim would need a nodemask anymore? At least I
> > no longer see the use for it for us.
> >
> > > >    Would
> > > >    they return -EINVAL if a) memory.reclaim gets passed only toptier
> > > >    nodes or b) memory.demote gets passed any lasttier nodes?
> > >
> >
> > Honestly it would be great if memory.reclaim can force reclaim from a
> > top tier nodes. It breaks the aginig pipeline, yes, but if the user is
> > specifically asking for that because they decided in their usecase
> > it's a good idea then the kernel should comply IMO. Not a strict
> > requirement for us. Wei will chime in if he disagrees.
>
> That would require a nodemask to say which nodes to reclaim, no? The
> default behavior should be in line with what standard memory reclaim
> does. If the demotion is a part of that process so should be
> memory.reclaim part of it. If we want to have a finer control then a
> nodemask is really a must and then the nodemaks should constrain both
> agining and reclaim.

Given that the original meaning of memory.reclaim is to free up
memory, I agree that when a nodemask is provided, the kernel should be
allowed to do both aging/demotion and reclaim.  Whether to allow
reclaim from top-tier nodes is a kernel implementation choice.  The
userspace should not depend on that.

Also, because the expectation of memory.reclaim is to free up the
specified amount of bytes, I think if a page is demoted, but both its
source and target nodes are still in the given nodemask, such a
demoted page should not be counted towards the requested bytes of
memory.reclaim. In the case that no nodemask is given (i.e. to free up
memory from all nodes), the demoted pages should never be counted in
the return value of try_to_free_mem_cgroup_pages().

Meanwhile, I'd argue that even though we want to unify demotion and
reclaim, there are still significant differences between them.
Demotion moves pages between two memory tiers, while reclaim can move
pages to a much slower tier, e.g. disk-based files or swap.  Both the
page movement latencies and the reaccess latencies can be
significantly different for demotion/reclaim.  So it is useful for the
userspace to be able to request demotion without reclaim.  A separate
interface, e.g. memory.demote, seems like a good choice for that.

> > memory.demote returning -EINVAL for lasttier nodes makes sense to me.
> >
> > > I would also add
> > > 4) Do we want to allow to control the demotion path (e.g. which node to
> > >    demote from and to) and how to achieve that?
> >
> > We care deeply about specifying which node to demote _from_. That
> > would be some node that is approaching pressure and we're looking for
> > proactive saving from. So far I haven't seen any reason to control
> > which nodes to demote _to_. The kernel deciding that based on the
> > aging pipeline and the node distances sounds good to me. Obviously
> > someone else may find that useful.
>
> Please keep in mind that the interface should be really prepared for
> future extensions so try to abstract from your immediate usecases.
>
> > > 5) Is the demotion api restricted to multi-tier systems or any numa
> > >    configuration allowed as well?
> > >
> >
> > demotion will of course not work on single tiered systems. The
> > interface may return some failure on such systems or not be available
> > at all.
>
> Is there any strong reason for that? We do not have any interface to
> control NUMA balancing from userspace. Why cannot we use the interface
> for that purpose?

A demotion interface such as memory.demote will trigger the demotion
code path in the kernel, which depends on multiple memory tiers.

I think what you are getting is a more general page migration
interface for memcg, which will need both source and target nodes as
arguments. I think this can be a great idea.  It should be able to
support our demotion use cases as well.

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
  2022-12-15 17:58                         ` Wei Xu
@ 2022-12-16  8:40                           ` Michal Hocko
  0 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2022-12-16  8:40 UTC (permalink / raw)
  To: Wei Xu
  Cc: Mina Almasry, Johannes Weiner, Huang, Ying, Tejun Heo, Zefan Li,
	Jonathan Corbet, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Yang Shi, Yosry Ahmed, fvdl, bagasdotme, cgroups,
	linux-doc, linux-kernel, linux-mm

On Thu 15-12-22 09:58:12, Wei Xu wrote:
> On Wed, Dec 14, 2022 at 2:23 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 13-12-22 11:29:45, Mina Almasry wrote:
> > > On Tue, Dec 13, 2022 at 6:03 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Tue 13-12-22 14:30:40, Johannes Weiner wrote:
> > > > > On Tue, Dec 13, 2022 at 02:30:57PM +0800, Huang, Ying wrote:
> > > > [...]
> > > > > > After these discussion, I think the solution maybe use different
> > > > > > interfaces for "proactive demote" and "proactive reclaim".  That is,
> > > > > > reconsider "memory.demote".  In this way, we will always uncharge the
> > > > > > cgroup for "memory.reclaim".  This avoid the possible confusion there.
> > > > > > And, because demotion is considered aging, we don't need to disable
> > > > > > demotion for "memory.reclaim", just don't count it.
> > > > >
> > > > > Hm, so in summary:
> > > > >
> > > > > 1) memory.reclaim would demote and reclaim like today, but it would
> > > > >    change to only count reclaimed pages against the goal.
> > > > >
> > > > > 2) memory.demote would only demote.
> > > > >
> > >
> > > If the above 2 points are agreeable then yes, this sounds good to me
> > > and does address our use case.
> > >
> > > > >    a) What if the demotion targets are full? Would it reclaim or fail?
> > > > >
> > >
> > > Wei will chime in if he disagrees, but I think we _require_ that it
> > > fails, not falls back to reclaim. The interface is asking for
> > > demotion, and is called memory.demote. For such an interface to fall
> > > back to reclaim would be very confusing to userspace and may trigger
> > > reclaim on a high priority job that we want to shield from proactive
> > > reclaim.
> >
> > But what should happen if the immediate demotion target is full but
> > lower tiers are still usable. Should the first one demote before
> > allowing to demote from the top tier?
> 
> In that case, the demotion will fall back to the lower tiers.  See
> node_get_allowed_targets() and establish_demotion_targets()..

I am not talking about an implicit behavior that we do not want to cast
into interface. If we want to allow a fine grained control over demotion
then the implementation shouldn't rely on the current behavior.

[...]
> > Is there any strong reason for that? We do not have any interface to
> > control NUMA balancing from userspace. Why cannot we use the interface
> > for that purpose?
> 
> A demotion interface such as memory.demote will trigger the demotion
> code path in the kernel, which depends on multiple memory tiers.

Demotion is just a fancy name of a directed migration. There is no realy
dependency on the HW nor the technology.

> I think what you are getting is a more general page migration
> interface for memcg, which will need both source and target nodes as
> arguments. I think this can be a great idea.  It should be able to
> support our demotion use cases as well.

yes.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
  2022-12-13  0:54       ` Mina Almasry
  2022-12-13  6:30         ` Huang, Ying
@ 2022-12-13  8:33         ` Michal Hocko
  2022-12-13 15:58           ` Johannes Weiner
  1 sibling, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2022-12-13  8:33 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Huang Ying, Yang Shi, Yosry Ahmed, weixugc, fvdl, bagasdotme,
	cgroups, linux-doc, linux-kernel, linux-mm

On Mon 12-12-22 16:54:27, Mina Almasry wrote:
> On Mon, Dec 12, 2022 at 12:55 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > Let me summarize my main concerns here as well. The proposed
> > implementation doesn't apply the provided nodemask to the whole reclaim
> > process. This means that demotion can happen outside of the mask so the
> > the user request cannot really control demotion targets and that limits
> > the interface should there be any need for a finer grained control in
> > the future (see an example in [2]).
> > Another problem is that this can limit future reclaim extensions because
> > of existing assumptions of the interface [3] - specify only top-tier
> > node to force the aging without actually reclaiming any charges and
> > (ab)use the interface only for aging on multi-tier system. A change to
> > the reclaim to not demote in some cases could break this usecase.
> >
> 
> I think this is correct. My use case is to request from the kernel to
> do demotion without reclaim in the cgroup, and the reason for that is
> stated in the commit message:
> 
> "Reclaim and demotion incur different latency costs to the jobs in the
> cgroup. Demoted memory would still be addressable by the userspace at
> a higher latency, but reclaimed memory would need to incur a
> pagefault."
> 
> For jobs of some latency tiers, we would like to trigger proactive
> demotion (which incurs relatively low latency on the job), but not
> trigger proactive reclaim (which incurs a pagefault). I initially had
> proposed a separate interface for this, but Johannes directed me to
> this interface instead in [1]. In the same email Johannes also tells
> me that meta's reclaim stack relies on memory.reclaim triggering
> demotion, so it seems that I'm not the first to take a dependency on
> this. Additionally in [2] Johannes also says it would be great if in
> the long term reclaim policy and demotion policy do not diverge.

I do recognize your need to control the demotion but I argue that it is
a bad idea to rely on an implicit behavior of the memory reclaim and an
interface which is _documented_ to primarily _reclaim_ memory.

Really, consider that the current demotion implementation will change
in the future and based on a newly added heuristic memory reclaim or
compression would be preferred over migration to a different tier.  This
might completely break your current assumptions and break your usecase
which relies on an implicit demotion behavior.  Do you see that as a
potential problem at all? What shall we do in that case? Special case
memory.reclaim behavior?

Now to your specific usecase. If there is a need to do a memory
distribution balancing then fine but this should be a well defined
interface. E.g. is there a need to not only control demotion but
promotions as well? I haven't heard anybody requesting that so far
but I can easily imagine that like outsourcing the memory reclaim to
the userspace someone might want to do the same thing with the numa
balancing because $REASONS. Should that ever happen, I am pretty sure
hooking into memory.reclaim is not really a great idea.

See where I am coming from?

> [1] https://lore.kernel.org/linux-mm/Y35fw2JSAeAddONg@cmpxchg.org/
> [2] https://lore.kernel.org/linux-mm/Y36fIGFCFKiocAd6@cmpxchg.org/
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
  2022-12-13  8:33         ` Michal Hocko
@ 2022-12-13 15:58           ` Johannes Weiner
       [not found]             ` <Y5iet+ch24YrvExA-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
                               ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Johannes Weiner @ 2022-12-13 15:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mina Almasry, Tejun Heo, Zefan Li, Jonathan Corbet,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Huang Ying, Yang Shi, Yosry Ahmed, weixugc, fvdl, bagasdotme,
	cgroups, linux-doc, linux-kernel, linux-mm

On Tue, Dec 13, 2022 at 09:33:24AM +0100, Michal Hocko wrote:
> I do recognize your need to control the demotion but I argue that it is
> a bad idea to rely on an implicit behavior of the memory reclaim and an
> interface which is _documented_ to primarily _reclaim_ memory.

I think memory.reclaim should demote as part of page aging. What I'd
like to avoid is *having* to manually control the aging component in
the interface (e.g. making memory.reclaim *only* reclaim, and
*requiring* a coordinated use of memory.demote to ensure progress.)

> Really, consider that the current demotion implementation will change
> in the future and based on a newly added heuristic memory reclaim or
> compression would be preferred over migration to a different tier.  This
> might completely break your current assumptions and break your usecase
> which relies on an implicit demotion behavior.  Do you see that as a
> potential problem at all? What shall we do in that case? Special case
> memory.reclaim behavior?

Shouldn't that be derived from the distance propertiers in the tier
configuration?

I.e. if local compression is faster than demoting to a slower node, we
should maybe have a separate tier for that. Ignoring proactive reclaim
or demotion commands for a second: on that node, global memory
pressure should always compress first, while the oldest pages from the
compression cache should demote to the other node(s) - until they
eventually get swapped out.

However fine-grained we make proactive reclaim control over these
stages, it should at least be possible for the user to request the
default behavior that global pressure follows, without jumping through
hoops or requiring the coordinated use of multiple knobs. So IMO there
is an argument for having a singular knob that requests comprehensive
aging and reclaiming across the configured hierarchy.

As far as explicit control over the individual stages goes - no idea
if you would call the compression stage demotion or reclaim. The
distinction still does not make much of sense to me, since reclaim is
just another form of demotion. Sure, page faults have a different
access latency than dax to slower memory. But you could also have 3
tiers of memory where the difference between tier 1 and 2 is much
smaller than the difference between 2 and 3, and you might want to
apply different demotion rates between them as well.

The other argument is that demotion does not free cgroup memory,
whereas reclaim does. But with multiple memory tiers of vastly
different performance, isn't there also an argument for granting
cgroups different shares of each memory? So that a higher priority
group has access to a bigger share of the fastest memory, and lower
prio cgroups are relegated to lower tiers. If we split those pools,
then "demotion" will actually free memory in a cgroup.

This is why I liked adding a nodes= argument to memory.reclaim the
best. It doesn't encode a distinction that may not last for long.

The problem comes from how to interpret the input argument and the
return value, right? Could we solve this by requiring the passed
nodes= to all be of the same memory tier? Then there is no confusion
around what is requested and what the return value means.

And if no nodes are passed, it means reclaim (from the lowest memory
tier) X pages and demote as needed, then return the reclaimed pages.

> Now to your specific usecase. If there is a need to do a memory
> distribution balancing then fine but this should be a well defined
> interface. E.g. is there a need to not only control demotion but
> promotions as well? I haven't heard anybody requesting that so far
> but I can easily imagine that like outsourcing the memory reclaim to
> the userspace someone might want to do the same thing with the numa
> balancing because $REASONS. Should that ever happen, I am pretty sure
> hooking into memory.reclaim is not really a great idea.

Should this ever happen, it would seem fair that that be a separate
knob anyway, no? One knob to move the pipeline in one direction
(aging), one knob to move it the other way.

^ permalink raw reply	[flat|nested] 35+ messages in thread

[parent not found: <Y5iet+ch24YrvExA-druUgvl0LCNAfugRpC6u6w@public.gmane.org>]

* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
       [not found]             ` <Y5iet+ch24YrvExA-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
@ 2022-12-13 19:53               ` Mina Almasry
  2022-12-14  7:20                 ` Huang, Ying
  0 siblings, 1 reply; 35+ messages in thread
From: Mina Almasry @ 2022-12-13 19:53 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Tejun Heo, Zefan Li, Jonathan Corbet,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Huang Ying, Yang Shi, Yosry Ahmed, weixugc-hpIqsD4AKlfQT0dZR+AlfA,
	fvdl-hpIqsD4AKlfQT0dZR+AlfA, bagasdotme-Re5JQEeQqe8AvxtiuMwx3w,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

On Tue, Dec 13, 2022 at 7:58 AM Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> wrote:
>
> On Tue, Dec 13, 2022 at 09:33:24AM +0100, Michal Hocko wrote:
> > I do recognize your need to control the demotion but I argue that it is
> > a bad idea to rely on an implicit behavior of the memory reclaim and an
> > interface which is _documented_ to primarily _reclaim_ memory.
>
> I think memory.reclaim should demote as part of page aging. What I'd
> like to avoid is *having* to manually control the aging component in
> the interface (e.g. making memory.reclaim *only* reclaim, and
> *requiring* a coordinated use of memory.demote to ensure progress.)
>
> > Really, consider that the current demotion implementation will change
> > in the future and based on a newly added heuristic memory reclaim or
> > compression would be preferred over migration to a different tier.  This
> > might completely break your current assumptions and break your usecase
> > which relies on an implicit demotion behavior.  Do you see that as a
> > potential problem at all? What shall we do in that case? Special case
> > memory.reclaim behavior?
>
> Shouldn't that be derived from the distance propertiers in the tier
> configuration?
>
> I.e. if local compression is faster than demoting to a slower node, we
> should maybe have a separate tier for that. Ignoring proactive reclaim
> or demotion commands for a second: on that node, global memory
> pressure should always compress first, while the oldest pages from the
> compression cache should demote to the other node(s) - until they
> eventually get swapped out.
>
> However fine-grained we make proactive reclaim control over these
> stages, it should at least be possible for the user to request the
> default behavior that global pressure follows, without jumping through
> hoops or requiring the coordinated use of multiple knobs. So IMO there
> is an argument for having a singular knob that requests comprehensive
> aging and reclaiming across the configured hierarchy.
>
> As far as explicit control over the individual stages goes - no idea
> if you would call the compression stage demotion or reclaim. The
> distinction still does not make much of sense to me, since reclaim is
> just another form of demotion. Sure, page faults have a different
> access latency than dax to slower memory. But you could also have 3
> tiers of memory where the difference between tier 1 and 2 is much
> smaller than the difference between 2 and 3, and you might want to
> apply different demotion rates between them as well.
>
> The other argument is that demotion does not free cgroup memory,
> whereas reclaim does. But with multiple memory tiers of vastly
> different performance, isn't there also an argument for granting
> cgroups different shares of each memory? So that a higher priority
> group has access to a bigger share of the fastest memory, and lower
> prio cgroups are relegated to lower tiers. If we split those pools,
> then "demotion" will actually free memory in a cgroup.
>

I would also like to say I implemented something in line with that in [1].

In this patch, pages demoted from inside the nodemask to outside the
nodemask count as 'reclaimed'. This, in my mind, is a very generic
solution to the 'should demoted pages count as reclaim?' problem, and
will work in all scenarios as long as the nodemask passed to
shrink_folio_list() is set correctly by the call stack.

> This is why I liked adding a nodes= argument to memory.reclaim the
> best. It doesn't encode a distinction that may not last for long.
>
> The problem comes from how to interpret the input argument and the
> return value, right? Could we solve this by requiring the passed
> nodes= to all be of the same memory tier? Then there is no confusion
> around what is requested and what the return value means.
>

I feel like I arrived at a better solution in [1], where pages demoted
from inside of the nodemask to outside count as reclaimed and the rest
don't. But I think we could solve this by explicit checks that nodes=
arg are from the same tier, yes.

> And if no nodes are passed, it means reclaim (from the lowest memory
> tier) X pages and demote as needed, then return the reclaimed pages.
>
> > Now to your specific usecase. If there is a need to do a memory
> > distribution balancing then fine but this should be a well defined
> > interface. E.g. is there a need to not only control demotion but
> > promotions as well? I haven't heard anybody requesting that so far
> > but I can easily imagine that like outsourcing the memory reclaim to
> > the userspace someone might want to do the same thing with the numa
> > balancing because $REASONS. Should that ever happen, I am pretty sure
> > hooking into memory.reclaim is not really a great idea.
>
> Should this ever happen, it would seem fair that that be a separate
> knob anyway, no? One knob to move the pipeline in one direction
> (aging), one knob to move it the other way.

[1] https://lore.kernel.org/linux-mm/20221206023406.3182800-1-almasrymina-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org/

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
  2022-12-13 19:53               ` Mina Almasry
@ 2022-12-14  7:20                 ` Huang, Ying
  0 siblings, 0 replies; 35+ messages in thread
From: Huang, Ying @ 2022-12-14  7:20 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Johannes Weiner, Michal Hocko, Tejun Heo, Zefan Li,
	Jonathan Corbet, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Yang Shi, Yosry Ahmed, weixugc, fvdl, bagasdotme,
	cgroups, linux-doc, linux-kernel, linux-mm

Mina Almasry <almasrymina@google.com> writes:

> On Tue, Dec 13, 2022 at 7:58 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>>
>> On Tue, Dec 13, 2022 at 09:33:24AM +0100, Michal Hocko wrote:
>> > I do recognize your need to control the demotion but I argue that it is
>> > a bad idea to rely on an implicit behavior of the memory reclaim and an
>> > interface which is _documented_ to primarily _reclaim_ memory.
>>
>> I think memory.reclaim should demote as part of page aging. What I'd
>> like to avoid is *having* to manually control the aging component in
>> the interface (e.g. making memory.reclaim *only* reclaim, and
>> *requiring* a coordinated use of memory.demote to ensure progress.)
>>
>> > Really, consider that the current demotion implementation will change
>> > in the future and based on a newly added heuristic memory reclaim or
>> > compression would be preferred over migration to a different tier.  This
>> > might completely break your current assumptions and break your usecase
>> > which relies on an implicit demotion behavior.  Do you see that as a
>> > potential problem at all? What shall we do in that case? Special case
>> > memory.reclaim behavior?
>>
>> Shouldn't that be derived from the distance propertiers in the tier
>> configuration?
>>
>> I.e. if local compression is faster than demoting to a slower node, we
>> should maybe have a separate tier for that. Ignoring proactive reclaim
>> or demotion commands for a second: on that node, global memory
>> pressure should always compress first, while the oldest pages from the
>> compression cache should demote to the other node(s) - until they
>> eventually get swapped out.
>>
>> However fine-grained we make proactive reclaim control over these
>> stages, it should at least be possible for the user to request the
>> default behavior that global pressure follows, without jumping through
>> hoops or requiring the coordinated use of multiple knobs. So IMO there
>> is an argument for having a singular knob that requests comprehensive
>> aging and reclaiming across the configured hierarchy.
>>
>> As far as explicit control over the individual stages goes - no idea
>> if you would call the compression stage demotion or reclaim. The
>> distinction still does not make much of sense to me, since reclaim is
>> just another form of demotion. Sure, page faults have a different
>> access latency than dax to slower memory. But you could also have 3
>> tiers of memory where the difference between tier 1 and 2 is much
>> smaller than the difference between 2 and 3, and you might want to
>> apply different demotion rates between them as well.
>>
>> The other argument is that demotion does not free cgroup memory,
>> whereas reclaim does. But with multiple memory tiers of vastly
>> different performance, isn't there also an argument for granting
>> cgroups different shares of each memory? So that a higher priority
>> group has access to a bigger share of the fastest memory, and lower
>> prio cgroups are relegated to lower tiers. If we split those pools,
>> then "demotion" will actually free memory in a cgroup.
>>
>
> I would also like to say I implemented something in line with that in [1].
>
> In this patch, pages demoted from inside the nodemask to outside the
> nodemask count as 'reclaimed'. This, in my mind, is a very generic
> solution to the 'should demoted pages count as reclaim?' problem, and
> will work in all scenarios as long as the nodemask passed to
> shrink_folio_list() is set correctly by the call stack.

It's still not clear that how many pages should be demoted among the
nodes inside the nodemask.  One possibility is to keep as many higher
tier pages as possible.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
  2022-12-13 15:58           ` Johannes Weiner
       [not found]             ` <Y5iet+ch24YrvExA-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
@ 2022-12-14  7:15             ` Huang, Ying
  2022-12-14 10:43             ` Michal Hocko
  2 siblings, 0 replies; 35+ messages in thread
From: Huang, Ying @ 2022-12-14  7:15 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Mina Almasry, Tejun Heo, Zefan Li, Jonathan Corbet,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Yang Shi, Yosry Ahmed, weixugc, fvdl, bagasdotme, cgroups,
	linux-doc, linux-kernel, linux-mm

Johannes Weiner <hannes@cmpxchg.org> writes:

> On Tue, Dec 13, 2022 at 09:33:24AM +0100, Michal Hocko wrote:
>> I do recognize your need to control the demotion but I argue that it is
>> a bad idea to rely on an implicit behavior of the memory reclaim and an
>> interface which is _documented_ to primarily _reclaim_ memory.
>
> I think memory.reclaim should demote as part of page aging. What I'd
> like to avoid is *having* to manually control the aging component in
> the interface (e.g. making memory.reclaim *only* reclaim, and
> *requiring* a coordinated use of memory.demote to ensure progress.)
>
>> Really, consider that the current demotion implementation will change
>> in the future and based on a newly added heuristic memory reclaim or
>> compression would be preferred over migration to a different tier.  This
>> might completely break your current assumptions and break your usecase
>> which relies on an implicit demotion behavior.  Do you see that as a
>> potential problem at all? What shall we do in that case? Special case
>> memory.reclaim behavior?
>
> Shouldn't that be derived from the distance propertiers in the tier
> configuration?
>
> I.e. if local compression is faster than demoting to a slower node, we
> should maybe have a separate tier for that. Ignoring proactive reclaim
> or demotion commands for a second: on that node, global memory
> pressure should always compress first, while the oldest pages from the
> compression cache should demote to the other node(s) - until they
> eventually get swapped out.
>
> However fine-grained we make proactive reclaim control over these
> stages, it should at least be possible for the user to request the
> default behavior that global pressure follows, without jumping through
> hoops or requiring the coordinated use of multiple knobs. So IMO there
> is an argument for having a singular knob that requests comprehensive
> aging and reclaiming across the configured hierarchy.
>
> As far as explicit control over the individual stages goes - no idea
> if you would call the compression stage demotion or reclaim. The
> distinction still does not make much of sense to me, since reclaim is
> just another form of demotion. Sure, page faults have a different
> access latency than dax to slower memory. But you could also have 3
> tiers of memory where the difference between tier 1 and 2 is much
> smaller than the difference between 2 and 3, and you might want to
> apply different demotion rates between them as well.
>
> The other argument is that demotion does not free cgroup memory,
> whereas reclaim does. But with multiple memory tiers of vastly
> different performance, isn't there also an argument for granting
> cgroups different shares of each memory? So that a higher priority
> group has access to a bigger share of the fastest memory, and lower
> prio cgroups are relegated to lower tiers. If we split those pools,
> then "demotion" will actually free memory in a cgroup.
>
> This is why I liked adding a nodes= argument to memory.reclaim the
> best. It doesn't encode a distinction that may not last for long.
>
> The problem comes from how to interpret the input argument and the
> return value, right? Could we solve this by requiring the passed
> nodes= to all be of the same memory tier? Then there is no confusion
> around what is requested and what the return value means.

Yes.  The definition is clear if nodes= from the same memory tier.

> And if no nodes are passed, it means reclaim (from the lowest memory
> tier) X pages and demote as needed, then return the reclaimed pages.

It appears that the definition isn't very clear here.  How many pages
should be demoted?  The target number is the value echoed to
memory.reclaim?  Or requested_number - pages_in_lowest_tier?  Should we
demote in as many tiers as possible or in as few tiers as possible?  One
possibility is to take advantage of top tier memory as much as
possible.  That is, try to reclaim pages in lower tiers only.

>> Now to your specific usecase. If there is a need to do a memory
>> distribution balancing then fine but this should be a well defined
>> interface. E.g. is there a need to not only control demotion but
>> promotions as well? I haven't heard anybody requesting that so far
>> but I can easily imagine that like outsourcing the memory reclaim to
>> the userspace someone might want to do the same thing with the numa
>> balancing because $REASONS. Should that ever happen, I am pretty sure
>> hooking into memory.reclaim is not really a great idea.
>
> Should this ever happen, it would seem fair that that be a separate
> knob anyway, no? One knob to move the pipeline in one direction
> (aging), one knob to move it the other way.

Agree.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
  2022-12-13 15:58           ` Johannes Weiner
       [not found]             ` <Y5iet+ch24YrvExA-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
  2022-12-14  7:15             ` Huang, Ying
@ 2022-12-14 10:43             ` Michal Hocko
  2 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2022-12-14 10:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mina Almasry, Tejun Heo, Zefan Li, Jonathan Corbet,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Huang Ying, Yang Shi, Yosry Ahmed, weixugc, fvdl, bagasdotme,
	cgroups, linux-doc, linux-kernel, linux-mm

On Tue 13-12-22 16:58:50, Johannes Weiner wrote:
> On Tue, Dec 13, 2022 at 09:33:24AM +0100, Michal Hocko wrote:
> > I do recognize your need to control the demotion but I argue that it is
> > a bad idea to rely on an implicit behavior of the memory reclaim and an
> > interface which is _documented_ to primarily _reclaim_ memory.
> 
> I think memory.reclaim should demote as part of page aging. What I'd
> like to avoid is *having* to manually control the aging component in
> the interface (e.g. making memory.reclaim *only* reclaim, and
> *requiring* a coordinated use of memory.demote to ensure progress.)

Yes, I do agree with that. Demotion is a part of the aging. I meant to say
that the result of the operation should be reclaimed charges but that
doesn't mean that demotion is not a part of that process.

I am mostly concerned about demote only behavior that Mina is targetting
and want to use memory.reclaim interface.

> > Really, consider that the current demotion implementation will change
> > in the future and based on a newly added heuristic memory reclaim or
> > compression would be preferred over migration to a different tier.  This
> > might completely break your current assumptions and break your usecase
> > which relies on an implicit demotion behavior.  Do you see that as a
> > potential problem at all? What shall we do in that case? Special case
> > memory.reclaim behavior?
> 
> Shouldn't that be derived from the distance propertiers in the tier
> configuration?
> 
> I.e. if local compression is faster than demoting to a slower node, we
> should maybe have a separate tier for that. Ignoring proactive reclaim
> or demotion commands for a second: on that node, global memory
> pressure should always compress first, while the oldest pages from the
> compression cache should demote to the other node(s) - until they
> eventually get swapped out.
> 
> However fine-grained we make proactive reclaim control over these
> stages, it should at least be possible for the user to request the
> default behavior that global pressure follows, without jumping through
> hoops or requiring the coordinated use of multiple knobs. So IMO there
> is an argument for having a singular knob that requests comprehensive
> aging and reclaiming across the configured hierarchy.
> 
> As far as explicit control over the individual stages goes - no idea
> if you would call the compression stage demotion or reclaim. The
> distinction still does not make much of sense to me, since reclaim is
> just another form of demotion.

From the external visibility POV the major difference between the two is
that the reclaim decreases the overall charged memory. And there are
pro-active reclaim usecases which rely on that. Demotion is mostly
memory placement rebalancing. Sure still visible in per-node stats and
with implications to performance but that is a different story.

> Sure, page faults have a different
> access latency than dax to slower memory. But you could also have 3
> tiers of memory where the difference between tier 1 and 2 is much
> smaller than the difference between 2 and 3, and you might want to
> apply different demotion rates between them as well.
> 
> The other argument is that demotion does not free cgroup memory,
> whereas reclaim does. But with multiple memory tiers of vastly
> different performance, isn't there also an argument for granting
> cgroups different shares of each memory?

Yes. We have already had requests for per node limits in the past. And I
do expect this will show up as a problem here as well but with a
reasonable memory.reclaim and potentially memory.demote interfaces the
balancing and policy making can be outsourced to the userspace .

> So that a higher priority
> group has access to a bigger share of the fastest memory, and lower
> prio cgroups are relegated to lower tiers. If we split those pools,
> then "demotion" will actually free memory in a cgroup.
> 
> This is why I liked adding a nodes= argument to memory.reclaim the
> best. It doesn't encode a distinction that may not last for long.
> 
> The problem comes from how to interpret the input argument and the
> return value, right? Could we solve this by requiring the passed
> nodes= to all be of the same memory tier? Then there is no confusion
> around what is requested and what the return value means.

Just to make sure I am on the same page. This means that if a node mask
is specified then it always implies demotion without any control over
how the demotion is done, right?

> And if no nodes are passed, it means reclaim (from the lowest memory
> tier) X pages and demote as needed, then return the reclaimed pages.

IMO this is rather constrained semantic which will completely rule out
some potentially interesting usecases. E.g. fine grained control over
the demotion path or enforced reclaim for node balancing. Also if we
ever want a promote interface then it would better fit with demote
counterpart.

> > Now to your specific usecase. If there is a need to do a memory
> > distribution balancing then fine but this should be a well defined
> > interface. E.g. is there a need to not only control demotion but
> > promotions as well? I haven't heard anybody requesting that so far
> > but I can easily imagine that like outsourcing the memory reclaim to
> > the userspace someone might want to do the same thing with the numa
> > balancing because $REASONS. Should that ever happen, I am pretty sure
> > hooking into memory.reclaim is not really a great idea.
> 
> Should this ever happen, it would seem fair that that be a separate
> knob anyway, no? One knob to move the pipeline in one direction
> (aging), one knob to move it the other way.

Yes, this is what I am inclining to as well.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH] Revert "mm: add nodes= arg to memory.reclaim"
  2022-12-12  8:55   ` Michal Hocko
       [not found]     ` <Y5bsmpCyeryu3Zz1-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2022-12-16  9:54     ` Michal Hocko
  2022-12-16 12:02       ` Mina Almasry
                         ` (2 more replies)
  1 sibling, 3 replies; 35+ messages in thread
From: Michal Hocko @ 2022-12-16  9:54 UTC (permalink / raw)
  To: Mina Almasry, Andrew Morton
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Huang Ying, Yang Shi, Yosry Ahmed, weixugc, fvdl, bagasdotme,
	cgroups, linux-doc, linux-kernel, linux-mm

Andrew,
I have noticed that the patch made it into Linus tree already. Can we
please revert it because the semantic is not really clear and we should
really not create yet another user API maintenance problem. I am
proposing to revert the nodemask extension for now before we grow any
upstream users. Deeper in the email thread are some proposals how to
move forward with that.
--- 
From 7c5285f1725d5abfcae5548ab0d73be9ceded2a1 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 16 Dec 2022 10:46:33 +0100
Subject: [PATCH] Revert "mm: add nodes= arg to memory.reclaim"

This reverts commit 12a5d3955227b0d7e04fb793ccceeb2a1dd275c5.

Although it is recognized that a finer grained pro-active reclaim is
something we need and want the semantic of this implementation is really
ambiguous.

From a follow up discussion it became clear that there are two essential
usecases here. One is to use memory.reclaim to pro-actively reclaim
memory and expectation is that the requested and reported amount of memory is
uncharged from the memcg. Another usecase focuses on pro-active demotion
when the memory is merely shuffled around to demotion targets while the
overall charged memory stays unchanged.

The current implementation considers demoted pages as reclaimed and that
break both usecases. [1] has tried to address the reporting part but
there are more issues with that summarized in [2] and follow up emails.

Let's revert the nodemask based extension of the memcg pro-active
reclaim for now until we settle with a more robust semantic.

[1] http://lkml.kernel.org/r/http://lkml.kernel.org/r/20221206023406.3182800-1-almasrymina@google.com
[2] http://lkml.kernel.org/r/Y5bsmpCyeryu3Zz1@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 15 +++---
 include/linux/swap.h                    |  3 +-
 mm/memcontrol.c                         | 67 +++++--------------------
 mm/vmscan.c                             |  4 +-
 4 files changed, 21 insertions(+), 68 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index c8ae7c897f14..74cec76be9f2 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1245,13 +1245,17 @@ PAGE_SIZE multiple when read back.
 	This is a simple interface to trigger memory reclaim in the
 	target cgroup.
 
-	This file accepts a string which contains the number of bytes to
-	reclaim.
+	This file accepts a single key, the number of bytes to reclaim.
+	No nested keys are currently supported.
 
 	Example::
 
 	  echo "1G" > memory.reclaim
 
+	The interface can be later extended with nested keys to
+	configure the reclaim behavior. For example, specify the
+	type of memory to reclaim from (anon, file, ..).
+
 	Please note that the kernel can over or under reclaim from
 	the target cgroup. If less bytes are reclaimed than the
 	specified amount, -EAGAIN is returned.
@@ -1263,13 +1267,6 @@ PAGE_SIZE multiple when read back.
 	This means that the networking layer will not adapt based on
 	reclaim induced by memory.reclaim.
 
-	This file also allows the user to specify the nodes to reclaim from,
-	via the 'nodes=' key, for example::
-
-	  echo "1G nodes=0,1" > memory.reclaim
-
-	The above instructs the kernel to reclaim memory from nodes 0,1.
-
   memory.peak
 	A read-only single value file which exists on non-root
 	cgroups.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2787b84eaf12..0ceed49516ad 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -418,8 +418,7 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
 						  gfp_t gfp_mask,
-						  unsigned int reclaim_options,
-						  nodemask_t *nodemask);
+						  unsigned int reclaim_options);
 extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
 						pg_data_t *pgdat,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ab457f0394ab..73afff8062f9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -63,7 +63,6 @@
 #include <linux/resume_user_mode.h>
 #include <linux/psi.h>
 #include <linux/seq_buf.h>
-#include <linux/parser.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
@@ -2393,8 +2392,7 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
 		psi_memstall_enter(&pflags);
 		nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
 							gfp_mask,
-							MEMCG_RECLAIM_MAY_SWAP,
-							NULL);
+							MEMCG_RECLAIM_MAY_SWAP);
 		psi_memstall_leave(&pflags);
 	} while ((memcg = parent_mem_cgroup(memcg)) &&
 		 !mem_cgroup_is_root(memcg));
@@ -2685,8 +2683,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 	psi_memstall_enter(&pflags);
 	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
-						    gfp_mask, reclaim_options,
-						    NULL);
+						    gfp_mask, reclaim_options);
 	psi_memstall_leave(&pflags);
 
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
@@ -3506,8 +3503,7 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
 		}
 
 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP,
-					NULL)) {
+					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
 			ret = -EBUSY;
 			break;
 		}
@@ -3618,8 +3614,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 			return -EINTR;
 
 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-						  MEMCG_RECLAIM_MAY_SWAP,
-						  NULL))
+						  MEMCG_RECLAIM_MAY_SWAP))
 			nr_retries--;
 	}
 
@@ -6429,8 +6424,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 		}
 
 		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
-					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
-					NULL);
+					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
 
 		if (!reclaimed && !nr_retries--)
 			break;
@@ -6479,8 +6473,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 
 		if (nr_reclaims) {
 			if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
-					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
-					NULL))
+					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
 				nr_reclaims--;
 			continue;
 		}
@@ -6603,54 +6596,21 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
-enum {
-	MEMORY_RECLAIM_NODES = 0,
-	MEMORY_RECLAIM_NULL,
-};
-
-static const match_table_t if_tokens = {
-	{ MEMORY_RECLAIM_NODES, "nodes=%s" },
-	{ MEMORY_RECLAIM_NULL, NULL },
-};
-
 static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 			      size_t nbytes, loff_t off)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
 	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
 	unsigned long nr_to_reclaim, nr_reclaimed = 0;
-	unsigned int reclaim_options = MEMCG_RECLAIM_MAY_SWAP |
-				       MEMCG_RECLAIM_PROACTIVE;
-	char *old_buf, *start;
-	substring_t args[MAX_OPT_ARGS];
-	int token;
-	char value[256];
-	nodemask_t nodemask = NODE_MASK_ALL;
-
-	buf = strstrip(buf);
-
-	old_buf = buf;
-	nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
-	if (buf == old_buf)
-		return -EINVAL;
+	unsigned int reclaim_options;
+	int err;
 
 	buf = strstrip(buf);
+	err = page_counter_memparse(buf, "", &nr_to_reclaim);
+	if (err)
+		return err;
 
-	while ((start = strsep(&buf, " ")) != NULL) {
-		if (!strlen(start))
-			continue;
-		token = match_token(start, if_tokens, args);
-		match_strlcpy(value, args, sizeof(value));
-		switch (token) {
-		case MEMORY_RECLAIM_NODES:
-			if (nodelist_parse(value, nodemask) < 0)
-				return -EINVAL;
-			break;
-		default:
-			return -EINVAL;
-		}
-	}
-
+	reclaim_options	= MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
 	while (nr_reclaimed < nr_to_reclaim) {
 		unsigned long reclaimed;
 
@@ -6667,8 +6627,7 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 
 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
 						nr_to_reclaim - nr_reclaimed,
-						GFP_KERNEL, reclaim_options,
-						&nodemask);
+						GFP_KERNEL, reclaim_options);
 
 		if (!reclaimed && !nr_retries--)
 			return -EAGAIN;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index aba991c505f1..546540bc770a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6757,8 +6757,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					   unsigned long nr_pages,
 					   gfp_t gfp_mask,
-					   unsigned int reclaim_options,
-					   nodemask_t *nodemask)
+					   unsigned int reclaim_options)
 {
 	unsigned long nr_reclaimed;
 	unsigned int noreclaim_flag;
@@ -6773,7 +6772,6 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 		.may_unmap = 1,
 		.may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
 		.proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
-		.nodemask = nodemask,
 	};
 	/*
 	 * Traverse the ZONELIST_FALLBACK zonelist of the current node to put
-- 
2.30.2

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH] Revert "mm: add nodes= arg to memory.reclaim"
  2022-12-16  9:54     ` [PATCH] Revert "mm: add nodes= arg to memory.reclaim" Michal Hocko
@ 2022-12-16 12:02       ` Mina Almasry
  2022-12-16 12:22         ` Michal Hocko
  2022-12-16 12:28       ` Bagas Sanjaya
       [not found]       ` <Y5xASNe1x8cusiTx-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  2 siblings, 1 reply; 35+ messages in thread
From: Mina Almasry @ 2022-12-16 12:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Tejun Heo, Zefan Li, Johannes Weiner,
	Jonathan Corbet, Roman Gushchin, Shakeel Butt, Muchun Song,
	Huang Ying, Yang Shi, Yosry Ahmed, weixugc, fvdl, bagasdotme,
	cgroups, linux-doc, linux-kernel, linux-mm

On Fri, Dec 16, 2022 at 1:54 AM Michal Hocko <mhocko@suse.com> wrote:
>
> Andrew,
> I have noticed that the patch made it into Linus tree already. Can we
> please revert it because the semantic is not really clear and we should
> really not create yet another user API maintenance problem. I am
> proposing to revert the nodemask extension for now before we grow any
> upstream users. Deeper in the email thread are some proposals how to
> move forward with that.

There are proposals, many which have been rejected due to not
addressing the motivating use cases and others that have been rejected
by fellow maintainers, and some that are awaiting feedback. No, there
is no other clear-cut way forward for this use case right now. I have
found the merged approach by far the most agreeable so far.

> ---
> From 7c5285f1725d5abfcae5548ab0d73be9ceded2a1 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Fri, 16 Dec 2022 10:46:33 +0100
> Subject: [PATCH] Revert "mm: add nodes= arg to memory.reclaim"
>
> This reverts commit 12a5d3955227b0d7e04fb793ccceeb2a1dd275c5.
>
> Although it is recognized that a finer grained pro-active reclaim is
> something we need and want the semantic of this implementation is really
> ambiguous.
>
> From a follow up discussion it became clear that there are two essential
> usecases here. One is to use memory.reclaim to pro-actively reclaim
> memory and expectation is that the requested and reported amount of memory is
> uncharged from the memcg. Another usecase focuses on pro-active demotion
> when the memory is merely shuffled around to demotion targets while the
> overall charged memory stays unchanged.
>
> The current implementation considers demoted pages as reclaimed and that
> break both usecases.

I think you're making it sound like this specific patch broke both use
cases, and IMO that is not accurate. commit 3f1509c57b1b ("Revert
"mm/vmscan: never demote for memcg reclaim"") has been in the tree for
around 7 months now and that is the commit that enabled demotion in
memcg reclaim, and implicitly counted demoted pages as reclaimed in
memcg reclaim, which is the source of the ambiguity. Not the patch
that you are reverting here.

The irony I find with this revert is that this patch actually removes
the ambiguity and does not exacerbate it. Currently using
memory.reclaim _without_ the nodes= arg is ambiguous because demoted
pages count as reclaimed. On the other hand using memory.reclaim
_with_ the nodes= arg is completely unambiguous: the kernel will
demote-only from top tier nodes and reclaim-only from bottom tier
nodes.

> [1] has tried to address the reporting part but
> there are more issues with that summarized in [2] and follow up emails.
>

I am the one that put effort into resolving the ambiguity introduced
by commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
reclaim"") and proposed [1]. Reverting this patch does nothing to
resolve ambiguity that it did not introduce.

> Let's revert the nodemask based extension of the memcg pro-active
> reclaim for now until we settle with a more robust semantic.
>

I do not think we should revert this. It enables a couple of important
use cases for Google:

1. Enables us to specifically trigger proactive reclaim in a memcg on
a memory tiered system by specifying only the lower tiered nodes using
the nodes= arg.
2. Enabled us to specifically trigger proactive demotion in a memcg on
a memory tiered system by specifying only the top tier nodes using the
nodes= arg.

Both use cases are broken with this revert, and no progress to resolve
the ambiguity is made with this revert.

I agree with Michal that there is ambiguity that has existed in the
kernel for about 7 months now and is introduced by commit 3f1509c57b1b
("Revert "mm/vmscan: never demote for memcg reclaim""), and I'm trying
to fix this ambiguity in [1]. I think we should move forward in fixing
the ambiguity through the review of the patch in [1] and not revert
patches that enable useful use-cases and did not introduce the
ambiguity.

> [1] http://lkml.kernel.org/r/http://lkml.kernel.org/r/20221206023406.3182800-1-almasrymina@google.com

Broken link. Actual link to my patch to fix the ambiguity:
[1] https://lore.kernel.org/linux-mm/20221206023406.3182800-1-almasrymina@google.com/

> [2] http://lkml.kernel.org/r/Y5bsmpCyeryu3Zz1@dhcp22.suse.cz
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 15 +++---
>  include/linux/swap.h                    |  3 +-
>  mm/memcontrol.c                         | 67 +++++--------------------
>  mm/vmscan.c                             |  4 +-
>  4 files changed, 21 insertions(+), 68 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index c8ae7c897f14..74cec76be9f2 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1245,13 +1245,17 @@ PAGE_SIZE multiple when read back.
>         This is a simple interface to trigger memory reclaim in the
>         target cgroup.
>
> -       This file accepts a string which contains the number of bytes to
> -       reclaim.
> +       This file accepts a single key, the number of bytes to reclaim.
> +       No nested keys are currently supported.
>
>         Example::
>
>           echo "1G" > memory.reclaim
>
> +       The interface can be later extended with nested keys to
> +       configure the reclaim behavior. For example, specify the
> +       type of memory to reclaim from (anon, file, ..).
> +
>         Please note that the kernel can over or under reclaim from
>         the target cgroup. If less bytes are reclaimed than the
>         specified amount, -EAGAIN is returned.
> @@ -1263,13 +1267,6 @@ PAGE_SIZE multiple when read back.
>         This means that the networking layer will not adapt based on
>         reclaim induced by memory.reclaim.
>
> -       This file also allows the user to specify the nodes to reclaim from,
> -       via the 'nodes=' key, for example::
> -
> -         echo "1G nodes=0,1" > memory.reclaim
> -
> -       The above instructs the kernel to reclaim memory from nodes 0,1.
> -
>    memory.peak
>         A read-only single value file which exists on non-root
>         cgroups.
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 2787b84eaf12..0ceed49516ad 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -418,8 +418,7 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>                                                   unsigned long nr_pages,
>                                                   gfp_t gfp_mask,
> -                                                 unsigned int reclaim_options,
> -                                                 nodemask_t *nodemask);
> +                                                 unsigned int reclaim_options);
>  extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
>                                                 gfp_t gfp_mask, bool noswap,
>                                                 pg_data_t *pgdat,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ab457f0394ab..73afff8062f9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -63,7 +63,6 @@
>  #include <linux/resume_user_mode.h>
>  #include <linux/psi.h>
>  #include <linux/seq_buf.h>
> -#include <linux/parser.h>
>  #include "internal.h"
>  #include <net/sock.h>
>  #include <net/ip.h>
> @@ -2393,8 +2392,7 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
>                 psi_memstall_enter(&pflags);
>                 nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
>                                                         gfp_mask,
> -                                                       MEMCG_RECLAIM_MAY_SWAP,
> -                                                       NULL);
> +                                                       MEMCG_RECLAIM_MAY_SWAP);
>                 psi_memstall_leave(&pflags);
>         } while ((memcg = parent_mem_cgroup(memcg)) &&
>                  !mem_cgroup_is_root(memcg));
> @@ -2685,8 +2683,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>
>         psi_memstall_enter(&pflags);
>         nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
> -                                                   gfp_mask, reclaim_options,
> -                                                   NULL);
> +                                                   gfp_mask, reclaim_options);
>         psi_memstall_leave(&pflags);
>
>         if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> @@ -3506,8 +3503,7 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
>                 }
>
>                 if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> -                                       memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP,
> -                                       NULL)) {
> +                                       memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
>                         ret = -EBUSY;
>                         break;
>                 }
> @@ -3618,8 +3614,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
>                         return -EINTR;
>
>                 if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> -                                                 MEMCG_RECLAIM_MAY_SWAP,
> -                                                 NULL))
> +                                                 MEMCG_RECLAIM_MAY_SWAP))
>                         nr_retries--;
>         }
>
> @@ -6429,8 +6424,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
>                 }
>
>                 reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> -                                       GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
> -                                       NULL);
> +                                       GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
>
>                 if (!reclaimed && !nr_retries--)
>                         break;
> @@ -6479,8 +6473,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
>
>                 if (nr_reclaims) {
>                         if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
> -                                       GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
> -                                       NULL))
> +                                       GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
>                                 nr_reclaims--;
>                         continue;
>                 }
> @@ -6603,54 +6596,21 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
>         return nbytes;
>  }
>
> -enum {
> -       MEMORY_RECLAIM_NODES = 0,
> -       MEMORY_RECLAIM_NULL,
> -};
> -
> -static const match_table_t if_tokens = {
> -       { MEMORY_RECLAIM_NODES, "nodes=%s" },
> -       { MEMORY_RECLAIM_NULL, NULL },
> -};
> -
>  static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
>                               size_t nbytes, loff_t off)
>  {
>         struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
>         unsigned int nr_retries = MAX_RECLAIM_RETRIES;
>         unsigned long nr_to_reclaim, nr_reclaimed = 0;
> -       unsigned int reclaim_options = MEMCG_RECLAIM_MAY_SWAP |
> -                                      MEMCG_RECLAIM_PROACTIVE;
> -       char *old_buf, *start;
> -       substring_t args[MAX_OPT_ARGS];
> -       int token;
> -       char value[256];
> -       nodemask_t nodemask = NODE_MASK_ALL;
> -
> -       buf = strstrip(buf);
> -
> -       old_buf = buf;
> -       nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
> -       if (buf == old_buf)
> -               return -EINVAL;
> +       unsigned int reclaim_options;
> +       int err;
>
>         buf = strstrip(buf);
> +       err = page_counter_memparse(buf, "", &nr_to_reclaim);
> +       if (err)
> +               return err;
>
> -       while ((start = strsep(&buf, " ")) != NULL) {
> -               if (!strlen(start))
> -                       continue;
> -               token = match_token(start, if_tokens, args);
> -               match_strlcpy(value, args, sizeof(value));
> -               switch (token) {
> -               case MEMORY_RECLAIM_NODES:
> -                       if (nodelist_parse(value, nodemask) < 0)
> -                               return -EINVAL;
> -                       break;
> -               default:
> -                       return -EINVAL;
> -               }
> -       }
> -
> +       reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
>         while (nr_reclaimed < nr_to_reclaim) {
>                 unsigned long reclaimed;
>
> @@ -6667,8 +6627,7 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
>
>                 reclaimed = try_to_free_mem_cgroup_pages(memcg,
>                                                 nr_to_reclaim - nr_reclaimed,
> -                                               GFP_KERNEL, reclaim_options,
> -                                               &nodemask);
> +                                               GFP_KERNEL, reclaim_options);
>
>                 if (!reclaimed && !nr_retries--)
>                         return -EAGAIN;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index aba991c505f1..546540bc770a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -6757,8 +6757,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
>  unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>                                            unsigned long nr_pages,
>                                            gfp_t gfp_mask,
> -                                          unsigned int reclaim_options,
> -                                          nodemask_t *nodemask)
> +                                          unsigned int reclaim_options)
>  {
>         unsigned long nr_reclaimed;
>         unsigned int noreclaim_flag;
> @@ -6773,7 +6772,6 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>                 .may_unmap = 1,
>                 .may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
>                 .proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
> -               .nodemask = nodemask,
>         };
>         /*
>          * Traverse the ZONELIST_FALLBACK zonelist of the current node to put
> --
> 2.30.2
>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] Revert "mm: add nodes= arg to memory.reclaim"
  2022-12-16 12:02       ` Mina Almasry
@ 2022-12-16 12:22         ` Michal Hocko
  0 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2022-12-16 12:22 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Andrew Morton, Tejun Heo, Zefan Li, Johannes Weiner,
	Jonathan Corbet, Roman Gushchin, Shakeel Butt, Muchun Song,
	Huang Ying, Yang Shi, Yosry Ahmed, weixugc, fvdl, bagasdotme,
	cgroups, linux-doc, linux-kernel, linux-mm

On Fri 16-12-22 04:02:12, Mina Almasry wrote:
> On Fri, Dec 16, 2022 at 1:54 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > Andrew,
> > I have noticed that the patch made it into Linus tree already. Can we
> > please revert it because the semantic is not really clear and we should
> > really not create yet another user API maintenance problem. I am
> > proposing to revert the nodemask extension for now before we grow any
> > upstream users. Deeper in the email thread are some proposals how to
> > move forward with that.
> 
> There are proposals, many which have been rejected due to not
> addressing the motivating use cases and others that have been rejected
> by fellow maintainers, and some that are awaiting feedback. No, there
> is no other clear-cut way forward for this use case right now. I have
> found the merged approach by far the most agreeable so far.

There is a clear need for further discussion and until then we do not
want to expose interface and create dependencies that will inevitably
hard to change the semantic later.

> > From 7c5285f1725d5abfcae5548ab0d73be9ceded2a1 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.com>
> > Date: Fri, 16 Dec 2022 10:46:33 +0100
> > Subject: [PATCH] Revert "mm: add nodes= arg to memory.reclaim"
> >
> > This reverts commit 12a5d3955227b0d7e04fb793ccceeb2a1dd275c5.
> >
> > Although it is recognized that a finer grained pro-active reclaim is
> > something we need and want the semantic of this implementation is really
> > ambiguous.
> >
> > From a follow up discussion it became clear that there are two essential
> > usecases here. One is to use memory.reclaim to pro-actively reclaim
> > memory and expectation is that the requested and reported amount of memory is
> > uncharged from the memcg. Another usecase focuses on pro-active demotion
> > when the memory is merely shuffled around to demotion targets while the
> > overall charged memory stays unchanged.
> >
> > The current implementation considers demoted pages as reclaimed and that
> > break both usecases.
> 
> I think you're making it sound like this specific patch broke both use
> cases, and IMO that is not accurate. commit 3f1509c57b1b ("Revert
> "mm/vmscan: never demote for memcg reclaim"") has been in the tree for
> around 7 months now and that is the commit that enabled demotion in
> memcg reclaim, and implicitly counted demoted pages as reclaimed in
> memcg reclaim, which is the source of the ambiguity. Not the patch
> that you are reverting here.
> 
> The irony I find with this revert is that this patch actually removes
> the ambiguity and does not exacerbate it. Currently using
> memory.reclaim _without_ the nodes= arg is ambiguous because demoted
> pages count as reclaimed. On the other hand using memory.reclaim
> _with_ the nodes= arg is completely unambiguous: the kernel will
> demote-only from top tier nodes and reclaim-only from bottom tier
> nodes.

Yes, demoted patches are indeed counted as reclaimed but that is not a
major issue because from the external point of view charges are getting
reclaimed. It is nodes specification which makes the latent problem much
more obvious.

> 
> > [1] has tried to address the reporting part but
> > there are more issues with that summarized in [2] and follow up emails.
> >
> 
> I am the one that put effort into resolving the ambiguity introduced
> by commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> reclaim"") and proposed [1]. Reverting this patch does nothing to
> resolve ambiguity that it did not introduce.
> 
> > Let's revert the nodemask based extension of the memcg pro-active
> > reclaim for now until we settle with a more robust semantic.
> >
> 
> I do not think we should revert this. It enables a couple of important
> use cases for Google:
> 
> 1. Enables us to specifically trigger proactive reclaim in a memcg on
> a memory tiered system by specifying only the lower tiered nodes using
> the nodes= arg.
> 2. Enabled us to specifically trigger proactive demotion in a memcg on
> a memory tiered system by specifying only the top tier nodes using the
> nodes= arg.

That is clear and the aim of the revert is not to disallow those
usecases. We just need a clear and futureproof interface for that.
Changing the semantic after the fact is a nogo, hence the revert.
> 
> Both use cases are broken with this revert, and no progress to resolve
> the ambiguity is made with this revert.

There cannot be any regression with the revert now because the code
hasn't been upstream.

So let's remove the interface until we can agree on the exact semantic
and build the interface from there.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] Revert "mm: add nodes= arg to memory.reclaim"
  2022-12-16  9:54     ` [PATCH] Revert "mm: add nodes= arg to memory.reclaim" Michal Hocko
  2022-12-16 12:02       ` Mina Almasry
@ 2022-12-16 12:28       ` Bagas Sanjaya
       [not found]       ` <Y5xASNe1x8cusiTx-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  2 siblings, 0 replies; 35+ messages in thread
From: Bagas Sanjaya @ 2022-12-16 12:28 UTC (permalink / raw)
  To: Michal Hocko, Mina Almasry, Andrew Morton
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Roman Gushchin, Shakeel Butt, Muchun Song, Huang Ying, Yang Shi,
	Yosry Ahmed, weixugc, fvdl, cgroups, linux-doc, linux-kernel,
	linux-mm

On 12/16/22 16:54, Michal Hocko wrote:
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index c8ae7c897f14..74cec76be9f2 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1245,13 +1245,17 @@ PAGE_SIZE multiple when read back.
>  	This is a simple interface to trigger memory reclaim in the
>  	target cgroup.
>  
> -	This file accepts a string which contains the number of bytes to
> -	reclaim.
> +	This file accepts a single key, the number of bytes to reclaim.
> +	No nested keys are currently supported.
>  
>  	Example::
>  
>  	  echo "1G" > memory.reclaim
>  
> +	The interface can be later extended with nested keys to
> +	configure the reclaim behavior. For example, specify the
> +	type of memory to reclaim from (anon, file, ..).
> +
>  	Please note that the kernel can over or under reclaim from
>  	the target cgroup. If less bytes are reclaimed than the
>  	specified amount, -EAGAIN is returned.
> @@ -1263,13 +1267,6 @@ PAGE_SIZE multiple when read back.
>  	This means that the networking layer will not adapt based on
>  	reclaim induced by memory.reclaim.
>  
> -	This file also allows the user to specify the nodes to reclaim from,
> -	via the 'nodes=' key, for example::
> -
> -	  echo "1G nodes=0,1" > memory.reclaim
> -
> -	The above instructs the kernel to reclaim memory from nodes 0,1.
> -
>    memory.peak
>  	A read-only single value file which exists on non-root
>  	cgroups.

Ah! I forgot to add my Reviewed-by: tag when the original patch was
submitted. However, I was Cc'ed the revert presumably due to Cc: tag in the
original.

In any case, for the documentation part:

Acked-by: Bagas Sanjaya <bagasdotme@gmail.com>

-- 
An old man doll... just what I always wanted! - Clara


^ permalink raw reply	[flat|nested] 35+ messages in thread

[parent not found: <Y5xASNe1x8cusiTx-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]

* Re: [PATCH] Revert "mm: add nodes= arg to memory.reclaim"
       [not found]       ` <Y5xASNe1x8cusiTx-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2022-12-16 18:18         ` Andrew Morton
       [not found]           ` <20221216101820.3f4a370af2c93d3c2e78ed8a-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2022-12-16 18:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mina Almasry, Tejun Heo, Zefan Li, Johannes Weiner,
	Jonathan Corbet, Roman Gushchin, Shakeel Butt, Muchun Song,
	Huang Ying, Yang Shi, Yosry Ahmed, weixugc-hpIqsD4AKlfQT0dZR+AlfA,
	fvdl-hpIqsD4AKlfQT0dZR+AlfA, bagasdotme-Re5JQEeQqe8AvxtiuMwx3w,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

On Fri, 16 Dec 2022 10:54:16 +0100 Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote:

> I have noticed that the patch made it into Linus tree already. Can we
> please revert it because the semantic is not really clear and we should
> really not create yet another user API maintenance problem.

Well dang.  I was waiting for the discussion to converge, blissfully
unaware that the thing was sitting in mm-stable :(  I guess the

	Acked-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
	Acked-by: Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
	Acked-by: Muchun Song <songmuchun-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org>

fooled me.


I think it's a bit premature to revert at this stage.  Possibly we can
get to the desired end state by modifying the existing code.  Possibly
we can get to the desired end state by reverting this and by adding
something new.

If we can't get to the desired end state at all then yes, I'll send
Linus a revert of this patch later in this -rc cycle.


^ permalink raw reply	[flat|nested] 35+ messages in thread

[parent not found: <20221216101820.3f4a370af2c93d3c2e78ed8a-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>]

* Re: [PATCH] Revert "mm: add nodes= arg to memory.reclaim"
       [not found]           ` <20221216101820.3f4a370af2c93d3c2e78ed8a-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2022-12-17  9:57             ` Michal Hocko
       [not found]               ` <Y52Scge3ynvn/mB4-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2022-12-17  9:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mina Almasry, Tejun Heo, Zefan Li, Johannes Weiner,
	Jonathan Corbet, Roman Gushchin, Shakeel Butt, Muchun Song,
	Huang Ying, Yang Shi, Yosry Ahmed, weixugc-hpIqsD4AKlfQT0dZR+AlfA,
	fvdl-hpIqsD4AKlfQT0dZR+AlfA, bagasdotme-Re5JQEeQqe8AvxtiuMwx3w,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

On Fri 16-12-22 10:18:20, Andrew Morton wrote:
> On Fri, 16 Dec 2022 10:54:16 +0100 Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote:
> 
> > I have noticed that the patch made it into Linus tree already. Can we
> > please revert it because the semantic is not really clear and we should
> > really not create yet another user API maintenance problem.
> 
> Well dang.  I was waiting for the discussion to converge, blissfully
> unaware that the thing was sitting in mm-stable :(  I guess the
> 
> 	Acked-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
> 	Acked-by: Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> 	Acked-by: Muchun Song <songmuchun-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org>
> 
> fooled me.

Hmm, as pointed out in http://lkml.kernel.org/r/Y5bsmpCyeryu3Zz1-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org
I've failed to see through all the consequences of the implementation.
SO my bad here to add my ack before fully understanding all the
implications.

> I think it's a bit premature to revert at this stage.  Possibly we can
> get to the desired end state by modifying the existing code.  Possibly
> we can get to the desired end state by reverting this and by adding
> something new.

Sure if we can converge to a proper implementation during the rc phase
then it would be ok. I cannot speak for others but at least for me
upcoming 2 weeks would be mostly offline so I cannot really contribute
much. A revert would be much more easier from the coordination POV IMHO.

Also I do not think there is any strong reason to rush this in. I do not
really see any major problems to have this extension in 6.2

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 35+ messages in thread

[parent not found: <Y52Scge3ynvn/mB4-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]

* Re: [PATCH] Revert "mm: add nodes= arg to memory.reclaim"
       [not found]               ` <Y52Scge3ynvn/mB4-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2022-12-19 22:42                 ` Andrew Morton
  2023-01-03  8:37                   ` Michal Hocko
  0 siblings, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2022-12-19 22:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mina Almasry, Tejun Heo, Zefan Li, Johannes Weiner,
	Jonathan Corbet, Roman Gushchin, Shakeel Butt, Muchun Song,
	Huang Ying, Yang Shi, Yosry Ahmed, weixugc-hpIqsD4AKlfQT0dZR+AlfA,
	fvdl-hpIqsD4AKlfQT0dZR+AlfA, bagasdotme-Re5JQEeQqe8AvxtiuMwx3w,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

On Sat, 17 Dec 2022 10:57:06 +0100 Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote:

> > I think it's a bit premature to revert at this stage.  Possibly we can
> > get to the desired end state by modifying the existing code.  Possibly
> > we can get to the desired end state by reverting this and by adding
> > something new.
> 
> Sure if we can converge to a proper implementation during the rc phase
> then it would be ok. I cannot speak for others but at least for me
> upcoming 2 weeks would be mostly offline so I cannot really contribute
> much. A revert would be much more easier from the coordination POV IMHO.
> 
> Also I do not think there is any strong reason to rush this in. I do not
> really see any major problems to have this extension in 6.2

I'll queue the revert in mm-unstable with a plan to merge it upstream
around the -rc5 timeframe if there hasn't been resolution.

Please check Mina's issues with this revert's current changelog and
perhaps send along a revised one.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] Revert "mm: add nodes= arg to memory.reclaim"
  2022-12-19 22:42                 ` Andrew Morton
@ 2023-01-03  8:37                   ` Michal Hocko
  2023-01-04  8:41                     ` Proactive reclaim/demote discussion (was Re: [PATCH] Revert "mm: add nodes= arg to memory.reclaim") Huang, Ying
  0 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2023-01-03  8:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mina Almasry, Tejun Heo, Zefan Li, Johannes Weiner,
	Jonathan Corbet, Roman Gushchin, Shakeel Butt, Muchun Song,
	Huang Ying, Yang Shi, Yosry Ahmed, weixugc, fvdl, bagasdotme,
	cgroups, linux-doc, linux-kernel, linux-mm

[Sorry I was offline]

On Mon 19-12-22 14:42:52, Andrew Morton wrote:
> On Sat, 17 Dec 2022 10:57:06 +0100 Michal Hocko <mhocko@suse.com> wrote:
> 
> > > I think it's a bit premature to revert at this stage.  Possibly we can
> > > get to the desired end state by modifying the existing code.  Possibly
> > > we can get to the desired end state by reverting this and by adding
> > > something new.
> > 
> > Sure if we can converge to a proper implementation during the rc phase
> > then it would be ok. I cannot speak for others but at least for me
> > upcoming 2 weeks would be mostly offline so I cannot really contribute
> > much. A revert would be much more easier from the coordination POV IMHO.
> > 
> > Also I do not think there is any strong reason to rush this in. I do not
> > really see any major problems to have this extension in 6.2
> 
> I'll queue the revert in mm-unstable with a plan to merge it upstream
> around the -rc5 timeframe if there hasn't been resolution.

Thanks! I do not really think we need to rush node based reclaim and
better have a reasonable and more futureproof interface.

> Please check Mina's issues with this revert's current changelog and
> perhaps send along a revised one.

Yes, I believe, I have addressed all the feedback but I am open to alter
the wording of course. The biggest concern by Mina IIRC was that the
nr_reclaimed reporting has been a pre-existing problem. And I agree with
that. The thing is that this doesn't matter without node specification
because the memory gets reclaimed even if the reported value is over
accounted. With nodemask specification the value becomes bogus if no
demotion nodes are specified because no memory gets reclaimed
potentially while the success is still reported. Mina has tried to
address that but I am not convinced the fix is actually future proof.

This really requires more discussion.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Proactive reclaim/demote discussion (was Re: [PATCH] Revert "mm: add nodes= arg to memory.reclaim")
  2023-01-03  8:37                   ` Michal Hocko
@ 2023-01-04  8:41                     ` Huang, Ying
  2023-01-18 17:21                       ` Michal Hocko
  0 siblings, 1 reply; 35+ messages in thread
From: Huang, Ying @ 2023-01-04  8:41 UTC (permalink / raw)
  To: Michal Hocko, Mina Almasry, Johannes Weiner, Yang Shi,
	Yosry Ahmed, weixugc, Tim Chen
  Cc: Andrew Morton, Tejun Heo, Zefan Li, Jonathan Corbet,
	Roman Gushchin, Shakeel Butt, Muchun Song, fvdl, bagasdotme,
	cgroups, linux-doc, linux-kernel, linux-mm

Michal Hocko <mhocko@suse.com> writes:

[snip]

> This really requires more discussion.

Let's start the discussion with some summary.

Requirements:

- Proactive reclaim.  The counting of current per-memcg proactive
  reclaim (memory.reclaim) isn't correct.  The demoted, but not
  reclaimed pages will be counted as reclaimed.  So "echo XXM >
  memory.reclaim" may exit prematurely before the specified number of
  memory is reclaimed.

- Proactive demote.  We need an interface to do per-memcg proactive
  demote.  We may reuse memory.reclaim via extending the concept of
  reclaiming to include demoting.  Or, we can add a new interface for
  that (for example, memory.demote).  In addition to demote from fast
  tier to slow tier, in theory, we may need to demote from a set of
  nodes to another set of nodes for something like general node
  balancing.

- Proactive promote.  In theory, this is possible, but there's no real
  life requirements yet.  And it should use a separate interface, so I
  don't think we need to discuss that here.

Open questions:

- Use memory.reclaim or memory.demote for proactive demote.  In current
  memcg context, reclaiming and demoting is quite different, because
  reclaiming will uncharge, while demoting will not.  But if we will add
  per-memory-tier charging finally, the difference disappears.  So the
  question becomes whether will we add per-memory-tier charging.

- Whether should we demote from faster tier nodes to lower tier nodes
  during the proactive reclaiming.  Choice A is to keep as much fast
  memory as possible.  That is, reclaim from the lowest tier nodes
  firstly, then the secondary lowest tier nodes, and so on.  Choice B is
  to demote at the same time of reclaiming.  In this way, if we
  proactively reclaim XX MB memory, we may free XX MB memory on the
  fastest memory nodes.

- When we proactively demote some memory from a fast memory tier, should
  we trigger memory competition in the slower memory tiers?  That is,
  whether to wake up kswapd of the slower memory tiers nodes?  If we
  want to make per-memcg proactive demoting to be per-memcg strictly, we
  should avoid to trigger the global behavior such as triggering memory
  competition in the slower memory tiers.  Instead, we can add a global
  proactive demote interface for that (such as per-memory-tier or
  per-node).

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Proactive reclaim/demote discussion (was Re: [PATCH] Revert "mm: add nodes= arg to memory.reclaim")
  2023-01-04  8:41                     ` Proactive reclaim/demote discussion (was Re: [PATCH] Revert "mm: add nodes= arg to memory.reclaim") Huang, Ying
@ 2023-01-18 17:21                       ` Michal Hocko
       [not found]                         ` <Y8gqkub3AM6c+Z5y-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2023-01-18 17:21 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Mina Almasry, Johannes Weiner, Yang Shi, Yosry Ahmed, weixugc,
	Tim Chen, Andrew Morton, Tejun Heo, Zefan Li, Jonathan Corbet,
	Roman Gushchin, Shakeel Butt, Muchun Song, fvdl, bagasdotme,
	cgroups, linux-doc, linux-kernel, linux-mm

On Wed 04-01-23 16:41:50, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> [snip]
> 
> > This really requires more discussion.
> 
> Let's start the discussion with some summary.
> 
> Requirements:
> 
> - Proactive reclaim.  The counting of current per-memcg proactive
>   reclaim (memory.reclaim) isn't correct.  The demoted, but not
>   reclaimed pages will be counted as reclaimed.  So "echo XXM >
>   memory.reclaim" may exit prematurely before the specified number of
>   memory is reclaimed.

This is reportedly a problem because memory.reclaim interface cannot be
used for proper memcg sizing IIRC.

> - Proactive demote.  We need an interface to do per-memcg proactive
>   demote.

For the further discussion it would be useful to reference the usecase
that is requiring this functionality. I believe this has been mentioned
somewhere but having it in this thread would help.

> We may reuse memory.reclaim via extending the concept of
>   reclaiming to include demoting.  Or, we can add a new interface for
>   that (for example, memory.demote).  In addition to demote from fast
>   tier to slow tier, in theory, we may need to demote from a set of
>   nodes to another set of nodes for something like general node
>   balancing.
> 
> - Proactive promote.  In theory, this is possible, but there's no real
>   life requirements yet.  And it should use a separate interface, so I
>   don't think we need to discuss that here.

Yes, proactive promotion is not backed by any real usecase at the
moment. We do not really have to focus on it but we should be aware of
the posibility and alow future extentions towards that functionality.
 
There is one requirement missing here.
 - Per NUMA node control - this is what makes the distinction between
   demotion and charge reclaim really semantically challenging - e.g.
   should demotions constrained by the provided nodemask or they should
   be implicit?

> Open questions:
> 
> - Use memory.reclaim or memory.demote for proactive demote.  In current
>   memcg context, reclaiming and demoting is quite different, because
>   reclaiming will uncharge, while demoting will not.  But if we will add
>   per-memory-tier charging finally, the difference disappears.  So the
>   question becomes whether will we add per-memory-tier charging.

The question is not whether but when IMHO. We've had a similar situation
with the swap accounting. Originally we have considered swap as a shared
resource but cgroupv2 goes with per swap limits because contention for
the swap space is really something people do care about.

> - Whether should we demote from faster tier nodes to lower tier nodes
>   during the proactive reclaiming.

I thought we are aligned on that. Demotion is a part of aging and that
is an integral part of the reclaim.

>   Choice A is to keep as much fast
>   memory as possible.  That is, reclaim from the lowest tier nodes
>   firstly, then the secondary lowest tier nodes, and so on.  Choice B is
>   to demote at the same time of reclaiming.  In this way, if we
>   proactively reclaim XX MB memory, we may free XX MB memory on the
>   fastest memory nodes.
> 
> - When we proactively demote some memory from a fast memory tier, should
>   we trigger memory competition in the slower memory tiers?  That is,
>   whether to wake up kswapd of the slower memory tiers nodes?

Johannes made some very strong arguments that there is no other choice
than involve kswapd (https://lore.kernel.org/all/Y5nEQeXj6HQBEHEY@cmpxchg.org/).

>   If we
>   want to make per-memcg proactive demoting to be per-memcg strictly, we
>   should avoid to trigger the global behavior such as triggering memory
>   competition in the slower memory tiers.  Instead, we can add a global
>   proactive demote interface for that (such as per-memory-tier or
>   per-node).

I suspect we are left with a real usecase and then follow the path we
took for the swap accounting.

Other open questions I do see are
- what to do when the memory.reclaim is constrained by a nodemask as
  mentioned above. Is the whole reclaim process (including aging) bound to
  the given nodemask or does demotion escape from it.
- should the demotion be specific to multi-tier systems or the interface
  should be just NUMA based and users could use the scheme to shuffle
  memory around and allow numa balancing from userspace that way. That
  would imply that demotion is a dedicated interface of course.
- there are other usecases that would like to trigger aging from
  userspace (http://lkml.kernel.org/r/20221214225123.2770216-1-yuanchu@google.com).
  Isn't demotion just a special case of aging in general or should we
  end up with 3 different interfaces?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 35+ messages in thread

[parent not found: <Y8gqkub3AM6c+Z5y-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]

* Re: Proactive reclaim/demote discussion (was Re: [PATCH] Revert "mm: add nodes= arg to memory.reclaim")
       [not found]                         ` <Y8gqkub3AM6c+Z5y-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2023-01-19  8:29                           ` Huang, Ying
  0 siblings, 0 replies; 35+ messages in thread
From: Huang, Ying @ 2023-01-19  8:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mina Almasry, Johannes Weiner, Yang Shi, Yosry Ahmed,
	weixugc-hpIqsD4AKlfQT0dZR+AlfA, Tim Chen, Andrew Morton,
	Tejun Heo, Zefan Li, Jonathan Corbet, Roman Gushchin,
	Shakeel Butt, Muchun Song, fvdl-hpIqsD4AKlfQT0dZR+AlfA,
	bagasdotme-Re5JQEeQqe8AvxtiuMwx3w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Yuanchu Xie

Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> writes:

> On Wed 04-01-23 16:41:50, Huang, Ying wrote:
>> Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> writes:
>> 
>> [snip]
>> 
>> > This really requires more discussion.
>> 
>> Let's start the discussion with some summary.
>> 
>> Requirements:
>> 
>> - Proactive reclaim.  The counting of current per-memcg proactive
>>   reclaim (memory.reclaim) isn't correct.  The demoted, but not
>>   reclaimed pages will be counted as reclaimed.  So "echo XXM >
>>   memory.reclaim" may exit prematurely before the specified number of
>>   memory is reclaimed.
>
> This is reportedly a problem because memory.reclaim interface cannot be
> used for proper memcg sizing IIRC.
>
>> - Proactive demote.  We need an interface to do per-memcg proactive
>>   demote.
>
> For the further discussion it would be useful to reference the usecase
> that is requiring this functionality. I believe this has been mentioned
> somewhere but having it in this thread would help.

Sure.

Google people in [1] and [2] request a per-cgroup interface to demote
but not reclaim proactively.

"
For jobs of some latency tiers, we would like to trigger proactive
demotion (which incurs relatively low latency on the job), but not
trigger proactive reclaim (which incurs a pagefault).
"

Meta people (Johannes) in [3] say they used per-cgroup memory.reclaim
for demote and reclaim proactively.

 [1] https://lore.kernel.org/linux-mm/CAHS8izM-XdLgFrQ1k13X-4YrK=JGayRXV_G3c3Qh4NLKP7cH_g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org/
 [2] https://lore.kernel.org/linux-mm/CAJD7tkZNW=u1TD-Fd_3RuzRNtaFjxihbGm0836QHkdp0Nn-vyQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org/
 [3] https://lore.kernel.org/linux-mm/Y35fw2JSAeAddONg-druUgvl0LCNAfugRpC6u6w@public.gmane.org/

>> We may reuse memory.reclaim via extending the concept of
>>   reclaiming to include demoting.  Or, we can add a new interface for
>>   that (for example, memory.demote).  In addition to demote from fast
>>   tier to slow tier, in theory, we may need to demote from a set of
>>   nodes to another set of nodes for something like general node
>>   balancing.
>> 
>> - Proactive promote.  In theory, this is possible, but there's no real
>>   life requirements yet.  And it should use a separate interface, so I
>>   don't think we need to discuss that here.
>
> Yes, proactive promotion is not backed by any real usecase at the
> moment. We do not really have to focus on it but we should be aware of
> the posibility and alow future extentions towards that functionality.

OK.

> There is one requirement missing here.
>  - Per NUMA node control - this is what makes the distinction between
>    demotion and charge reclaim really semantically challenging - e.g.
>    should demotions constrained by the provided nodemask or they should
>    be implicit?

Yes.  We may need to specify the NUMA nodes for demotion/reclaiming
source, target, or even path.  That is, to fine control the proactive
demotion/reclaiming.

>> Open questions:
>> 
>> - Use memory.reclaim or memory.demote for proactive demote.  In current
>>   memcg context, reclaiming and demoting is quite different, because
>>   reclaiming will uncharge, while demoting will not.  But if we will add
>>   per-memory-tier charging finally, the difference disappears.  So the
>>   question becomes whether will we add per-memory-tier charging.
>
> The question is not whether but when IMHO. We've had a similar situation
> with the swap accounting. Originally we have considered swap as a shared
> resource but cgroupv2 goes with per swap limits because contention for
> the swap space is really something people do care about.

So, when we design user space interface for proactive demotion, we
should keep per-memory-tier charging in mind.

>> - Whether should we demote from faster tier nodes to lower tier nodes
>>   during the proactive reclaiming.
>
> I thought we are aligned on that. Demotion is a part of aging and that
> is an integral part of the reclaim.

As in the choice A/B of the below text, we should keep more fast memory
size or slow memory size?  For original active/inactive LRU lists, we
will balance the size of lists.  But we don't have similar stuff for the
memory tiers.  What is the preferred balancing policy?  Choice A/B below
are 2 extreme policies that are defined clearly.

>>   Choice A is to keep as much fast
>>   memory as possible.  That is, reclaim from the lowest tier nodes
>>   firstly, then the secondary lowest tier nodes, and so on.  Choice B is
>>   to demote at the same time of reclaiming.  In this way, if we
>>   proactively reclaim XX MB memory, we may free XX MB memory on the
>>   fastest memory nodes.
>> 
>> - When we proactively demote some memory from a fast memory tier, should
>>   we trigger memory competition in the slower memory tiers?  That is,
>>   whether to wake up kswapd of the slower memory tiers nodes?
>
> Johannes made some very strong arguments that there is no other choice
> than involve kswapd (https://lore.kernel.org/all/Y5nEQeXj6HQBEHEY-druUgvl0LCNAfugRpC6u6w@public.gmane.org/).

I have no objection for that too.  The below is just another choice.  If
people don't think it's useful.  I will not insist on it.

>>   If we
>>   want to make per-memcg proactive demoting to be per-memcg strictly, we
>>   should avoid to trigger the global behavior such as triggering memory
>>   competition in the slower memory tiers.  Instead, we can add a global
>>   proactive demote interface for that (such as per-memory-tier or
>>   per-node).
>
> I suspect we are left with a real usecase and then follow the path we
> took for the swap accounting.

Thanks for adding that.

> Other open questions I do see are
> - what to do when the memory.reclaim is constrained by a nodemask as
>   mentioned above. Is the whole reclaim process (including aging) bound to
>   the given nodemask or does demotion escape from it.

Per my understanding, we can use multiple node masks if necessary.  For
example, for "source=<mask1>", we may demote from <mask1> to other
nodes; for "source=<mask1> destination=<mask2>", we will demote from
<mask1> to <mask2>, but will not demote to other nodes.

> - should the demotion be specific to multi-tier systems or the interface
>   should be just NUMA based and users could use the scheme to shuffle
>   memory around and allow numa balancing from userspace that way. That
>   would imply that demotion is a dedicated interface of course.

It appears that if we can force the demotion target nodes (even in the
same tier).  We can implement numa balancing from user space?

> - there are other usecases that would like to trigger aging from
>   userspace (http://lkml.kernel.org/r/20221214225123.2770216-1-yuanchu-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org).
>   Isn't demotion just a special case of aging in general or should we
>   end up with 3 different interfaces?

Thanks for pointer!  If my understanding were correct, this appears a
user of proactive reclaiming/demotion interface?  Cced the patch author
for any further requirements for the interface.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2023-01-19  8:29 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-12-02 22:35 [PATCH v3] mm: Add nodes= arg to memory.reclaim Mina Almasry
2022-12-02 23:51 ` Shakeel Butt
2022-12-03  3:17 ` Muchun Song
     [not found] ` <20221202223533.1785418-1-almasrymina-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2022-12-12  8:55   ` Michal Hocko
     [not found]     ` <Y5bsmpCyeryu3Zz1-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2022-12-13  0:54       ` Mina Almasry
2022-12-13  6:30         ` Huang, Ying
2022-12-13  7:48           ` Wei Xu
2022-12-13  8:51           ` Michal Hocko
2022-12-13 13:42             ` Huang, Ying
     [not found]           ` <87k02volwe.fsf-fFUE1NP8JkzwuUmzmnQr+vooFf0ArEBIu+b9c/7xato@public.gmane.org>
2022-12-13 13:30             ` Johannes Weiner
2022-12-13 14:03               ` Michal Hocko
     [not found]                 ` <Y5iGJ/9PMmSCwqLj-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2022-12-13 19:29                   ` Mina Almasry
2022-12-14 10:23                     ` Michal Hocko
2022-12-15  5:50                       ` Huang, Ying
     [not found]                         ` <87mt7pdxm1.fsf-fFUE1NP8JkzwuUmzmnQr+vooFf0ArEBIu+b9c/7xato@public.gmane.org>
2022-12-15  9:21                           ` Michal Hocko
2022-12-16  3:02                             ` Huang, Ying
     [not found]                       ` <Y5mkJL6I5Zlc1k97-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2022-12-15 17:58                         ` Wei Xu
2022-12-16  8:40                           ` Michal Hocko
2022-12-13  8:33         ` Michal Hocko
2022-12-13 15:58           ` Johannes Weiner
     [not found]             ` <Y5iet+ch24YrvExA-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
2022-12-13 19:53               ` Mina Almasry
2022-12-14  7:20                 ` Huang, Ying
2022-12-14  7:15             ` Huang, Ying
2022-12-14 10:43             ` Michal Hocko
2022-12-16  9:54     ` [PATCH] Revert "mm: add nodes= arg to memory.reclaim" Michal Hocko
2022-12-16 12:02       ` Mina Almasry
2022-12-16 12:22         ` Michal Hocko
2022-12-16 12:28       ` Bagas Sanjaya
     [not found]       ` <Y5xASNe1x8cusiTx-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2022-12-16 18:18         ` Andrew Morton
     [not found]           ` <20221216101820.3f4a370af2c93d3c2e78ed8a-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2022-12-17  9:57             ` Michal Hocko
     [not found]               ` <Y52Scge3ynvn/mB4-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2022-12-19 22:42                 ` Andrew Morton
2023-01-03  8:37                   ` Michal Hocko
2023-01-04  8:41                     ` Proactive reclaim/demote discussion (was Re: [PATCH] Revert "mm: add nodes= arg to memory.reclaim") Huang, Ying
2023-01-18 17:21                       ` Michal Hocko
     [not found]                         ` <Y8gqkub3AM6c+Z5y-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2023-01-19  8:29                           ` Huang, Ying

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox