* [PATCH v3] mm: Add nodes= arg to memory.reclaim
@ 2022-12-02 22:35 Mina Almasry
2022-12-02 23:51 ` Shakeel Butt
` (2 more replies)
0 siblings, 3 replies; 35+ messages in thread
From: Mina Almasry @ 2022-12-02 22:35 UTC (permalink / raw)
To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Andrew Morton
Cc: Huang Ying, Yang Shi, Yosry Ahmed, weixugc, fvdl, Mina Almasry,
Michal Hocko, bagasdotme, cgroups, linux-doc, linux-kernel,
linux-mm
The nodes= arg instructs the kernel to only scan the given nodes for
proactive reclaim. As example use cases, consider a two-tier memory system:
nodes 0,1 -> top tier
nodes 2,3 -> second tier
$ echo "1m nodes=0" > memory.reclaim
This instructs the kernel to attempt to reclaim 1m memory from node 0.
Since node 0 is a top tier node, demotion will be attempted first. This
is useful to direct proactive reclaim to specific nodes that are under
pressure.
$ echo "1m nodes=2,3" > memory.reclaim
This instructs the kernel to attempt to reclaim 1m memory from the second
tier. Since this tier of memory has no demotion targets, the memory will
be reclaimed.
$ echo "1m nodes=0,1" > memory.reclaim
This instructs the kernel to reclaim memory from the top tier nodes, which can
be desirable according to the userspace policy if there is pressure on
the top tiers. Since these nodes have demotion targets, the kernel will
attempt demotion first.
Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
reclaim""), the proactive reclaim interface memory.reclaim does both
reclaim and demotion. Reclaim and demotion incur different latency costs
to the jobs in the cgroup. Demoted memory would still be addressable
by the userspace at a higher latency, but reclaimed memory would need to
incur a pagefault.
The 'nodes' arg is useful to allow the userspace to control demotion
and reclaim independently according to its policy: if memory.reclaim
is called on a node with demotion targets, it will attempt demotion first;
if it is called on a node without demotion targets, it will only attempt
reclaim.
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
---
v3:
- Dropped RFC tag from subject.
- Added Michal's Ack
- Applied some of Bagas's comment suggestions.
- Converted try_to_free_mem_cgroup_pages() to take nodemask_t* instead of
nodemask_t as Shakeel and Muchun suggested.
Cc: bagasdotme@gmail.com
Thanks for the comments and reviews.
---
Documentation/admin-guide/cgroup-v2.rst | 15 +++---
include/linux/swap.h | 3 +-
mm/memcontrol.c | 67 ++++++++++++++++++++-----
mm/vmscan.c | 4 +-
4 files changed, 68 insertions(+), 21 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 74cec76be9f2..c8ae7c897f14 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1245,17 +1245,13 @@ PAGE_SIZE multiple when read back.
This is a simple interface to trigger memory reclaim in the
target cgroup.
- This file accepts a single key, the number of bytes to reclaim.
- No nested keys are currently supported.
+ This file accepts a string which contains the number of bytes to
+ reclaim.
Example::
echo "1G" > memory.reclaim
- The interface can be later extended with nested keys to
- configure the reclaim behavior. For example, specify the
- type of memory to reclaim from (anon, file, ..).
-
Please note that the kernel can over or under reclaim from
the target cgroup. If less bytes are reclaimed than the
specified amount, -EAGAIN is returned.
@@ -1267,6 +1263,13 @@ PAGE_SIZE multiple when read back.
This means that the networking layer will not adapt based on
reclaim induced by memory.reclaim.
+ This file also allows the user to specify the nodes to reclaim from,
+ via the 'nodes=' key, for example::
+
+ echo "1G nodes=0,1" > memory.reclaim
+
+ The above instructs the kernel to reclaim memory from nodes 0,1.
+
memory.peak
A read-only single value file which exists on non-root
cgroups.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0ceed49516ad..2787b84eaf12 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -418,7 +418,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
unsigned long nr_pages,
gfp_t gfp_mask,
- unsigned int reclaim_options);
+ unsigned int reclaim_options,
+ nodemask_t *nodemask);
extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
pg_data_t *pgdat,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 23750cec0036..0f02f47a87e4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -63,6 +63,7 @@
#include <linux/resume_user_mode.h>
#include <linux/psi.h>
#include <linux/seq_buf.h>
+#include <linux/parser.h>
#include "internal.h"
#include <net/sock.h>
#include <net/ip.h>
@@ -2392,7 +2393,8 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
psi_memstall_enter(&pflags);
nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
gfp_mask,
- MEMCG_RECLAIM_MAY_SWAP);
+ MEMCG_RECLAIM_MAY_SWAP,
+ NULL);
psi_memstall_leave(&pflags);
} while ((memcg = parent_mem_cgroup(memcg)) &&
!mem_cgroup_is_root(memcg));
@@ -2683,7 +2685,8 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
psi_memstall_enter(&pflags);
nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
- gfp_mask, reclaim_options);
+ gfp_mask, reclaim_options,
+ NULL);
psi_memstall_leave(&pflags);
if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
@@ -3503,7 +3506,8 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
}
if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
- memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
+ memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP,
+ NULL)) {
ret = -EBUSY;
break;
}
@@ -3614,7 +3618,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
return -EINTR;
if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
- MEMCG_RECLAIM_MAY_SWAP))
+ MEMCG_RECLAIM_MAY_SWAP,
+ NULL))
nr_retries--;
}
@@ -6407,7 +6412,8 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
}
reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
- GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
+ GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
+ NULL);
if (!reclaimed && !nr_retries--)
break;
@@ -6456,7 +6462,8 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
if (nr_reclaims) {
if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
- GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
+ GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
+ NULL))
nr_reclaims--;
continue;
}
@@ -6579,21 +6586,54 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
return nbytes;
}
+enum {
+ MEMORY_RECLAIM_NODES = 0,
+ MEMORY_RECLAIM_NULL,
+};
+
+static const match_table_t if_tokens = {
+ { MEMORY_RECLAIM_NODES, "nodes=%s" },
+ { MEMORY_RECLAIM_NULL, NULL },
+};
+
static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
size_t nbytes, loff_t off)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
unsigned int nr_retries = MAX_RECLAIM_RETRIES;
unsigned long nr_to_reclaim, nr_reclaimed = 0;
- unsigned int reclaim_options;
- int err;
+ unsigned int reclaim_options = MEMCG_RECLAIM_MAY_SWAP |
+ MEMCG_RECLAIM_PROACTIVE;
+ char *old_buf, *start;
+ substring_t args[MAX_OPT_ARGS];
+ int token;
+ char value[256];
+ nodemask_t nodemask = NODE_MASK_ALL;
buf = strstrip(buf);
- err = page_counter_memparse(buf, "", &nr_to_reclaim);
- if (err)
- return err;
- reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
+ old_buf = buf;
+ nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
+ if (buf == old_buf)
+ return -EINVAL;
+
+ buf = strstrip(buf);
+
+ while ((start = strsep(&buf, " ")) != NULL) {
+ if (!strlen(start))
+ continue;
+ token = match_token(start, if_tokens, args);
+ match_strlcpy(value, args, sizeof(value));
+ switch (token) {
+ case MEMORY_RECLAIM_NODES:
+ if (nodelist_parse(value, nodemask) < 0)
+ return -EINVAL;
+ break;
+ default:
+ return -EINVAL;
+ }
+ }
+
while (nr_reclaimed < nr_to_reclaim) {
unsigned long reclaimed;
@@ -6610,7 +6650,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
reclaimed = try_to_free_mem_cgroup_pages(memcg,
nr_to_reclaim - nr_reclaimed,
- GFP_KERNEL, reclaim_options);
+ GFP_KERNEL, reclaim_options,
+ &nodemask);
if (!reclaimed && !nr_retries--)
return -EAGAIN;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7b8e8e43806b..62b0c9b46bd2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6735,7 +6735,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
unsigned long nr_pages,
gfp_t gfp_mask,
- unsigned int reclaim_options)
+ unsigned int reclaim_options,
+ nodemask_t *nodemask)
{
unsigned long nr_reclaimed;
unsigned int noreclaim_flag;
@@ -6750,6 +6751,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
.may_unmap = 1,
.may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
.proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
+ .nodemask = nodemask,
};
/*
* Traverse the ZONELIST_FALLBACK zonelist of the current node to put
--
2.39.0.rc0.267.gcb52ba06e7-goog
* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
2022-12-02 22:35 [PATCH v3] mm: Add nodes= arg to memory.reclaim Mina Almasry
@ 2022-12-02 23:51 ` Shakeel Butt
2022-12-03 3:17 ` Muchun Song
[not found] ` <20221202223533.1785418-1-almasrymina-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2 siblings, 0 replies; 35+ messages in thread
From: Shakeel Butt @ 2022-12-02 23:51 UTC (permalink / raw)
To: Mina Almasry
Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
Michal Hocko, Roman Gushchin, Muchun Song, Andrew Morton,
Huang Ying, Yang Shi, Yosry Ahmed, weixugc, fvdl, Michal Hocko,
bagasdotme, cgroups, linux-doc, linux-kernel, linux-mm
On Fri, Dec 2, 2022 at 2:37 PM Mina Almasry <almasrymina@google.com> wrote:
>
> The nodes= arg instructs the kernel to only scan the given nodes for
> proactive reclaim. For example use cases, consider a 2 tier memory system:
>
> nodes 0,1 -> top tier
> nodes 2,3 -> second tier
>
> $ echo "1m nodes=0" > memory.reclaim
>
> This instructs the kernel to attempt to reclaim 1m memory from node 0.
> Since node 0 is a top tier node, demotion will be attempted first. This
> is useful to direct proactive reclaim to specific nodes that are under
> pressure.
>
> $ echo "1m nodes=2,3" > memory.reclaim
>
> This instructs the kernel to attempt to reclaim 1m memory in the second tier,
> since this tier of memory has no demotion targets the memory will be
> reclaimed.
>
> $ echo "1m nodes=0,1" > memory.reclaim
>
> Instructs the kernel to reclaim memory from the top tier nodes, which can
> be desirable according to the userspace policy if there is pressure on
> the top tiers. Since these nodes have demotion targets, the kernel will
> attempt demotion first.
>
> Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> reclaim""), the proactive reclaim interface memory.reclaim does both
> reclaim and demotion. Reclaim and demotion incur different latency costs
> to the jobs in the cgroup. Demoted memory would still be addressable
> by the userspace at a higher latency, but reclaimed memory would need to
> incur a pagefault.
>
> The 'nodes' arg is useful to allow the userspace to control demotion
> and reclaim independently according to its policy: if the memory.reclaim
> is called on a node with demotion targets, it will attempt demotion first;
> if it is called on a node without demotion targets, it will only attempt
> reclaim.
>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Mina Almasry <almasrymina@google.com>
>
Acked-by: Shakeel Butt <shakeelb@google.com>
* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
2022-12-02 22:35 [PATCH v3] mm: Add nodes= arg to memory.reclaim Mina Almasry
2022-12-02 23:51 ` Shakeel Butt
@ 2022-12-03 3:17 ` Muchun Song
[not found] ` <20221202223533.1785418-1-almasrymina-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2 siblings, 0 replies; 35+ messages in thread
From: Muchun Song @ 2022-12-03 3:17 UTC (permalink / raw)
To: Mina Almasry
Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Andrew Morton, Huang Ying, Yang Shi, Yosry Ahmed, weixugc, fvdl,
Michal Hocko, bagasdotme, cgroups, Linux Doc Mailing List,
linux-kernel, Linux Memory Management List
> On Dec 3, 2022, at 06:35, Mina Almasry <almasrymina@google.com> wrote:
>
> The nodes= arg instructs the kernel to only scan the given nodes for
> proactive reclaim. For example use cases, consider a 2 tier memory system:
>
> nodes 0,1 -> top tier
> nodes 2,3 -> second tier
>
> $ echo "1m nodes=0" > memory.reclaim
>
> This instructs the kernel to attempt to reclaim 1m memory from node 0.
> Since node 0 is a top tier node, demotion will be attempted first. This
> is useful to direct proactive reclaim to specific nodes that are under
> pressure.
>
> $ echo "1m nodes=2,3" > memory.reclaim
>
> This instructs the kernel to attempt to reclaim 1m memory in the second tier,
> since this tier of memory has no demotion targets the memory will be
> reclaimed.
>
> $ echo "1m nodes=0,1" > memory.reclaim
>
> Instructs the kernel to reclaim memory from the top tier nodes, which can
> be desirable according to the userspace policy if there is pressure on
> the top tiers. Since these nodes have demotion targets, the kernel will
> attempt demotion first.
>
> Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> reclaim""), the proactive reclaim interface memory.reclaim does both
> reclaim and demotion. Reclaim and demotion incur different latency costs
> to the jobs in the cgroup. Demoted memory would still be addressable
> by the userspace at a higher latency, but reclaimed memory would need to
> incur a pagefault.
>
> The 'nodes' arg is useful to allow the userspace to control demotion
> and reclaim independently according to its policy: if the memory.reclaim
> is called on a node with demotion targets, it will attempt demotion first;
> if it is called on a node without demotion targets, it will only attempt
> reclaim.
>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Thanks.
[parent not found: <20221202223533.1785418-1-almasrymina-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
[not found] ` <20221202223533.1785418-1-almasrymina-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2022-12-12 8:55 ` Michal Hocko
[not found] ` <Y5bsmpCyeryu3Zz1-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2022-12-16 9:54 ` [PATCH] Revert "mm: add nodes= arg to memory.reclaim" Michal Hocko
0 siblings, 2 replies; 35+ messages in thread
From: Michal Hocko @ 2022-12-12 8:55 UTC (permalink / raw)
To: Mina Almasry
Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
Huang Ying, Yang Shi, Yosry Ahmed, weixugc-hpIqsD4AKlfQT0dZR+AlfA,
fvdl-hpIqsD4AKlfQT0dZR+AlfA, bagasdotme-Re5JQEeQqe8AvxtiuMwx3w,
cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg
On Fri 02-12-22 14:35:31, Mina Almasry wrote:
> The nodes= arg instructs the kernel to only scan the given nodes for
> proactive reclaim. For example use cases, consider a 2 tier memory system:
>
> nodes 0,1 -> top tier
> nodes 2,3 -> second tier
>
> $ echo "1m nodes=0" > memory.reclaim
>
> This instructs the kernel to attempt to reclaim 1m memory from node 0.
> Since node 0 is a top tier node, demotion will be attempted first. This
> is useful to direct proactive reclaim to specific nodes that are under
> pressure.
>
> $ echo "1m nodes=2,3" > memory.reclaim
>
> This instructs the kernel to attempt to reclaim 1m memory in the second tier,
> since this tier of memory has no demotion targets the memory will be
> reclaimed.
>
> $ echo "1m nodes=0,1" > memory.reclaim
>
> Instructs the kernel to reclaim memory from the top tier nodes, which can
> be desirable according to the userspace policy if there is pressure on
> the top tiers. Since these nodes have demotion targets, the kernel will
> attempt demotion first.
>
> Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> reclaim""), the proactive reclaim interface memory.reclaim does both
> reclaim and demotion. Reclaim and demotion incur different latency costs
> to the jobs in the cgroup. Demoted memory would still be addressable
> by the userspace at a higher latency, but reclaimed memory would need to
> incur a pagefault.
>
> The 'nodes' arg is useful to allow the userspace to control demotion
> and reclaim independently according to its policy: if the memory.reclaim
> is called on a node with demotion targets, it will attempt demotion first;
> if it is called on a node without demotion targets, it will only attempt
> reclaim.
>
> Acked-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
> Signed-off-by: Mina Almasry <almasrymina-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
After discussion in [1] I have realized that I haven't really thought
through all the consequences of this patch and therefore I am retracting
my ack here. I am not nacking the patch at this stage, but I also think
this shouldn't be merged now and we should really consider all the
consequences.
Let me summarize my main concerns here as well. The proposed
implementation doesn't apply the provided nodemask to the whole reclaim
process. This means that demotion can happen outside of the mask, so the
user request cannot really control demotion targets, and that limits
the interface should there be any need for a finer grained control in
the future (see an example in [2]).
Another problem is that this can limit future reclaim extensions because
of existing assumptions about the interface [3] - specifying only a top-tier
node to force aging without actually reclaiming any charges, thereby
(ab)using the interface only for aging on a multi-tier system. A change to
the reclaim to not demote in some cases could break this usecase.
My counter proposal would be to define the nodemask for memory.reclaim
as a domain to constrain the charge reclaim. That means both aging and
reclaim, including demotion, which is a part of aging. This will allow
control over where to demote for balancing purposes (e.g. demote to node 2
rather than 3) which is impossible with the proposed scheme.
[1] http://lkml.kernel.org/r/20221206023406.3182800-1-almasrymina-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org
[2] http://lkml.kernel.org/r/Y5bnRtJ6sojtjgVD-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org
[3] http://lkml.kernel.org/r/CAAPL-u8rgW-JACKUT5ChmGSJiTDABcDRjNzW_QxMjCTk9zO4sg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org
--
Michal Hocko
SUSE Labs
[parent not found: <Y5bsmpCyeryu3Zz1-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
[not found] ` <Y5bsmpCyeryu3Zz1-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2022-12-13 0:54 ` Mina Almasry
2022-12-13 6:30 ` Huang, Ying
2022-12-13 8:33 ` Michal Hocko
0 siblings, 2 replies; 35+ messages in thread
From: Mina Almasry @ 2022-12-13 0:54 UTC (permalink / raw)
To: Michal Hocko
Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
Huang Ying, Yang Shi, Yosry Ahmed, weixugc-hpIqsD4AKlfQT0dZR+AlfA,
fvdl-hpIqsD4AKlfQT0dZR+AlfA, bagasdotme-Re5JQEeQqe8AvxtiuMwx3w,
cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg
On Mon, Dec 12, 2022 at 12:55 AM Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote:
>
> On Fri 02-12-22 14:35:31, Mina Almasry wrote:
> > The nodes= arg instructs the kernel to only scan the given nodes for
> > proactive reclaim. For example use cases, consider a 2 tier memory system:
> >
> > nodes 0,1 -> top tier
> > nodes 2,3 -> second tier
> >
> > $ echo "1m nodes=0" > memory.reclaim
> >
> > This instructs the kernel to attempt to reclaim 1m memory from node 0.
> > Since node 0 is a top tier node, demotion will be attempted first. This
> > is useful to direct proactive reclaim to specific nodes that are under
> > pressure.
> >
> > $ echo "1m nodes=2,3" > memory.reclaim
> >
> > This instructs the kernel to attempt to reclaim 1m memory in the second tier,
> > since this tier of memory has no demotion targets the memory will be
> > reclaimed.
> >
> > $ echo "1m nodes=0,1" > memory.reclaim
> >
> > Instructs the kernel to reclaim memory from the top tier nodes, which can
> > be desirable according to the userspace policy if there is pressure on
> > the top tiers. Since these nodes have demotion targets, the kernel will
> > attempt demotion first.
> >
> > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> > reclaim""), the proactive reclaim interface memory.reclaim does both
> > reclaim and demotion. Reclaim and demotion incur different latency costs
> > to the jobs in the cgroup. Demoted memory would still be addressable
> > by the userspace at a higher latency, but reclaimed memory would need to
> > incur a pagefault.
> >
> > The 'nodes' arg is useful to allow the userspace to control demotion
> > and reclaim independently according to its policy: if the memory.reclaim
> > is called on a node with demotion targets, it will attempt demotion first;
> > if it is called on a node without demotion targets, it will only attempt
> > reclaim.
> >
> > Acked-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
> > Signed-off-by: Mina Almasry <almasrymina-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>
> After discussion in [1] I have realized that I haven't really thought
> through all the consequences of this patch and therefore I am retracting
> my ack here. I am not nacking the patch at this statge but I also think
> this shouldn't be merged now and we should really consider all the
> consequences.
>
> Let me summarize my main concerns here as well. The proposed
> implementation doesn't apply the provided nodemask to the whole reclaim
> process. This means that demotion can happen outside of the mask so the
> the user request cannot really control demotion targets and that limits
> the interface should there be any need for a finer grained control in
> the future (see an example in [2]).
> Another problem is that this can limit future reclaim extensions because
> of existing assumptions of the interface [3] - specify only top-tier
> node to force the aging without actually reclaiming any charges and
> (ab)use the interface only for aging on multi-tier system. A change to
> the reclaim to not demote in some cases could break this usecase.
>
I think this is correct. My use case is to request from the kernel to
do demotion without reclaim in the cgroup, and the reason for that is
stated in the commit message:
"Reclaim and demotion incur different latency costs to the jobs in the
cgroup. Demoted memory would still be addressable by the userspace at
a higher latency, but reclaimed memory would need to incur a
pagefault."
For jobs of some latency tiers, we would like to trigger proactive
demotion (which incurs relatively low latency on the job), but not
trigger proactive reclaim (which incurs a pagefault). I initially had
proposed a separate interface for this, but Johannes directed me to
this interface instead in [1]. In the same email Johannes also tells
me that meta's reclaim stack relies on memory.reclaim triggering
demotion, so it seems that I'm not the first to take a dependency on
this. Additionally in [2] Johannes also says it would be great if in
the long term reclaim policy and demotion policy do not diverge.
[1] https://lore.kernel.org/linux-mm/Y35fw2JSAeAddONg-druUgvl0LCNAfugRpC6u6w@public.gmane.org/
[2] https://lore.kernel.org/linux-mm/Y36fIGFCFKiocAd6-druUgvl0LCNAfugRpC6u6w@public.gmane.org/
> My counter proposal would be to define the nodemask for memory.reclaim
> as a domain to constrain the charge reclaim. That means both aging and
> reclaim including demotion which is a part of aging. This will allow
> to control where to demote for balancing purposes (e.g. demote to node 2
> rather than 3) which is impossible with the proposed scheme.
>
My understanding is that, with this interface, in order to trigger
demotion I would want to list both the top tier nodes and the bottom
tier nodes in the nodemask, and since the bottom tier nodes are in the
nodemask the kernel will not just trigger demotion, but will also
trigger reclaim. This is very specifically not our use case and not
the goal of this patch.
I had also suggested adding a demotion= arg to memory.reclaim so the
userspace may customize this behavior, but Johannes rejected this in
[3] to adhere to the aging pipeline.
All in all I like Johannes's model in [3] describing the aging
pipeline and the relationship between demotion and reclaim. The nodes=
arg is just a hint to the kernel that the userspace is looking for
reclaim from a top tier node (which would be done by demotion
according to the aging pipeline) or a bottom tier node (which would be
done by reclaim according to the aging pipeline). I think this
interface is aligned with this model.
[3] https://lore.kernel.org/linux-mm/Y36XchdgTCsMP4jT-druUgvl0LCNAfugRpC6u6w@public.gmane.org/
> [1] http://lkml.kernel.org/r/20221206023406.3182800-1-almasrymina-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org
> [2] http://lkml.kernel.org/r/Y5bnRtJ6sojtjgVD-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org
> [3] http://lkml.kernel.org/r/CAAPL-u8rgW-JACKUT5ChmGSJiTDABcDRjNzW_QxMjCTk9zO4sg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org
> --
> Michal Hocko
> SUSE Labs
* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
2022-12-13 0:54 ` Mina Almasry
@ 2022-12-13 6:30 ` Huang, Ying
2022-12-13 7:48 ` Wei Xu
` (2 more replies)
2022-12-13 8:33 ` Michal Hocko
1 sibling, 3 replies; 35+ messages in thread
From: Huang, Ying @ 2022-12-13 6:30 UTC (permalink / raw)
To: Mina Almasry, Michal Hocko
Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
Yang Shi, Yosry Ahmed, weixugc, fvdl, bagasdotme, cgroups,
linux-doc, linux-kernel, linux-mm
Mina Almasry <almasrymina@google.com> writes:
> On Mon, Dec 12, 2022 at 12:55 AM Michal Hocko <mhocko@suse.com> wrote:
>>
>> On Fri 02-12-22 14:35:31, Mina Almasry wrote:
>> > The nodes= arg instructs the kernel to only scan the given nodes for
>> > proactive reclaim. For example use cases, consider a 2 tier memory system:
>> >
>> > nodes 0,1 -> top tier
>> > nodes 2,3 -> second tier
>> >
>> > $ echo "1m nodes=0" > memory.reclaim
>> >
>> > This instructs the kernel to attempt to reclaim 1m memory from node 0.
>> > Since node 0 is a top tier node, demotion will be attempted first. This
>> > is useful to direct proactive reclaim to specific nodes that are under
>> > pressure.
>> >
>> > $ echo "1m nodes=2,3" > memory.reclaim
>> >
>> > This instructs the kernel to attempt to reclaim 1m memory in the second tier,
>> > since this tier of memory has no demotion targets the memory will be
>> > reclaimed.
>> >
>> > $ echo "1m nodes=0,1" > memory.reclaim
>> >
>> > Instructs the kernel to reclaim memory from the top tier nodes, which can
>> > be desirable according to the userspace policy if there is pressure on
>> > the top tiers. Since these nodes have demotion targets, the kernel will
>> > attempt demotion first.
>> >
>> > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
>> > reclaim""), the proactive reclaim interface memory.reclaim does both
>> > reclaim and demotion. Reclaim and demotion incur different latency costs
>> > to the jobs in the cgroup. Demoted memory would still be addressable
>> > by the userspace at a higher latency, but reclaimed memory would need to
>> > incur a pagefault.
>> >
>> > The 'nodes' arg is useful to allow the userspace to control demotion
>> > and reclaim independently according to its policy: if the memory.reclaim
>> > is called on a node with demotion targets, it will attempt demotion first;
>> > if it is called on a node without demotion targets, it will only attempt
>> > reclaim.
>> >
>> > Acked-by: Michal Hocko <mhocko@suse.com>
>> > Signed-off-by: Mina Almasry <almasrymina@google.com>
>>
>> After discussion in [1] I have realized that I haven't really thought
>> through all the consequences of this patch and therefore I am retracting
>> my ack here. I am not nacking the patch at this statge but I also think
>> this shouldn't be merged now and we should really consider all the
>> consequences.
>>
>> Let me summarize my main concerns here as well. The proposed
>> implementation doesn't apply the provided nodemask to the whole reclaim
>> process. This means that demotion can happen outside of the mask so the
>> the user request cannot really control demotion targets and that limits
>> the interface should there be any need for a finer grained control in
>> the future (see an example in [2]).
>> Another problem is that this can limit future reclaim extensions because
>> of existing assumptions of the interface [3] - specify only top-tier
>> node to force the aging without actually reclaiming any charges and
>> (ab)use the interface only for aging on multi-tier system. A change to
>> the reclaim to not demote in some cases could break this usecase.
>>
>
> I think this is correct. My use case is to request from the kernel to
> do demotion without reclaim in the cgroup, and the reason for that is
> stated in the commit message:
>
> "Reclaim and demotion incur different latency costs to the jobs in the
> cgroup. Demoted memory would still be addressable by the userspace at
> a higher latency, but reclaimed memory would need to incur a
> pagefault."
>
> For jobs of some latency tiers, we would like to trigger proactive
> demotion (which incurs relatively low latency on the job), but not
> trigger proactive reclaim (which incurs a pagefault). I initially had
> proposed a separate interface for this, but Johannes directed me to
> this interface instead in [1]. In the same email Johannes also tells
> me that meta's reclaim stack relies on memory.reclaim triggering
> demotion, so it seems that I'm not the first to take a dependency on
> this. Additionally in [2] Johannes also says it would be great if in
> the long term reclaim policy and demotion policy do not diverge.
>
> [1] https://lore.kernel.org/linux-mm/Y35fw2JSAeAddONg@cmpxchg.org/
> [2] https://lore.kernel.org/linux-mm/Y36fIGFCFKiocAd6@cmpxchg.org/
After these discussions, I think the solution may be to use different
interfaces for "proactive demote" and "proactive reclaim". That is,
reconsider "memory.demote". In this way, we will always uncharge the
cgroup for "memory.reclaim". This avoids the possible confusion there.
And, because demotion is considered aging, we don't need to disable
demotion for "memory.reclaim", just don't count it.
Best Regards,
Huang, Ying
* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
2022-12-13 6:30 ` Huang, Ying
@ 2022-12-13 7:48 ` Wei Xu
2022-12-13 8:51 ` Michal Hocko
[not found] ` <87k02volwe.fsf-fFUE1NP8JkzwuUmzmnQr+vooFf0ArEBIu+b9c/7xato@public.gmane.org>
2 siblings, 0 replies; 35+ messages in thread
From: Wei Xu @ 2022-12-13 7:48 UTC (permalink / raw)
To: Huang, Ying
Cc: Mina Almasry, Michal Hocko, Tejun Heo, Zefan Li, Johannes Weiner,
Jonathan Corbet, Roman Gushchin, Shakeel Butt, Muchun Song,
Andrew Morton, Yang Shi, Yosry Ahmed, fvdl, bagasdotme, cgroups,
linux-doc, linux-kernel, linux-mm
On Mon, Dec 12, 2022 at 10:32 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Mina Almasry <almasrymina@google.com> writes:
>
> > On Mon, Dec 12, 2022 at 12:55 AM Michal Hocko <mhocko@suse.com> wrote:
> >>
> >> On Fri 02-12-22 14:35:31, Mina Almasry wrote:
> >> > The nodes= arg instructs the kernel to only scan the given nodes for
> >> > proactive reclaim. For example use cases, consider a 2 tier memory system:
> >> >
> >> > nodes 0,1 -> top tier
> >> > nodes 2,3 -> second tier
> >> >
> >> > $ echo "1m nodes=0" > memory.reclaim
> >> >
> >> > This instructs the kernel to attempt to reclaim 1m memory from node 0.
> >> > Since node 0 is a top tier node, demotion will be attempted first. This
> >> > is useful to direct proactive reclaim to specific nodes that are under
> >> > pressure.
> >> >
> >> > $ echo "1m nodes=2,3" > memory.reclaim
> >> >
> >> > This instructs the kernel to attempt to reclaim 1m memory in the second tier,
> >> > since this tier of memory has no demotion targets the memory will be
> >> > reclaimed.
> >> >
> >> > $ echo "1m nodes=0,1" > memory.reclaim
> >> >
> >> > Instructs the kernel to reclaim memory from the top tier nodes, which can
> >> > be desirable according to the userspace policy if there is pressure on
> >> > the top tiers. Since these nodes have demotion targets, the kernel will
> >> > attempt demotion first.
> >> >
> >> > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> >> > reclaim""), the proactive reclaim interface memory.reclaim does both
> >> > reclaim and demotion. Reclaim and demotion incur different latency costs
> >> > to the jobs in the cgroup. Demoted memory would still be addressable
> >> > by the userspace at a higher latency, but reclaimed memory would need to
> >> > incur a pagefault.
> >> >
> >> > The 'nodes' arg is useful to allow the userspace to control demotion
> >> > and reclaim independently according to its policy: if the memory.reclaim
> >> > is called on a node with demotion targets, it will attempt demotion first;
> >> > if it is called on a node without demotion targets, it will only attempt
> >> > reclaim.
> >> >
> >> > Acked-by: Michal Hocko <mhocko@suse.com>
> >> > Signed-off-by: Mina Almasry <almasrymina@google.com>
> >>
> >> After discussion in [1] I have realized that I haven't really thought
> >> through all the consequences of this patch and therefore I am retracting
> >> my ack here. I am not nacking the patch at this stage but I also think
> >> this shouldn't be merged now and we should really consider all the
> >> consequences.
> >>
> >> Let me summarize my main concerns here as well. The proposed
> >> implementation doesn't apply the provided nodemask to the whole reclaim
> >> process. This means that demotion can happen outside of the mask so
> >> the user request cannot really control demotion targets and that limits
> >> the interface should there be any need for a finer grained control in
> >> the future (see an example in [2]).
> >> Another problem is that this can limit future reclaim extensions because
> >> of existing assumptions of the interface [3] - specify only top-tier
> >> node to force the aging without actually reclaiming any charges and
> >> (ab)use the interface only for aging on multi-tier system. A change to
> >> the reclaim to not demote in some cases could break this usecase.
> >>
> >
> > I think this is correct. My use case is to request from the kernel to
> > do demotion without reclaim in the cgroup, and the reason for that is
> > stated in the commit message:
> >
> > "Reclaim and demotion incur different latency costs to the jobs in the
> > cgroup. Demoted memory would still be addressable by the userspace at
> > a higher latency, but reclaimed memory would need to incur a
> > pagefault."
> >
> > For jobs of some latency tiers, we would like to trigger proactive
> > demotion (which incurs relatively low latency on the job), but not
> > trigger proactive reclaim (which incurs a pagefault). I initially had
> > proposed a separate interface for this, but Johannes directed me to
> > this interface instead in [1]. In the same email Johannes also tells
> > me that meta's reclaim stack relies on memory.reclaim triggering
> > demotion, so it seems that I'm not the first to take a dependency on
> > this. Additionally in [2] Johannes also says it would be great if in
> > the long term reclaim policy and demotion policy do not diverge.
> >
> > [1] https://lore.kernel.org/linux-mm/Y35fw2JSAeAddONg@cmpxchg.org/
> > [2] https://lore.kernel.org/linux-mm/Y36fIGFCFKiocAd6@cmpxchg.org/
>
> After these discussions, I think the solution may be to use different
> interfaces for "proactive demote" and "proactive reclaim", that is, to
> reconsider "memory.demote". In this way, we will always uncharge the
> cgroup for "memory.reclaim", which avoids the possible confusion there.
> And, because demotion is considered aging, we don't need to disable
> demotion for "memory.reclaim"; we just don't count it.
+1 on memory.demote.
> Best Regards,
> Huang, Ying
>
* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
2022-12-13 6:30 ` Huang, Ying
2022-12-13 7:48 ` Wei Xu
@ 2022-12-13 8:51 ` Michal Hocko
2022-12-13 13:42 ` Huang, Ying
[not found] ` <87k02volwe.fsf-fFUE1NP8JkzwuUmzmnQr+vooFf0ArEBIu+b9c/7xato@public.gmane.org>
2 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2022-12-13 8:51 UTC (permalink / raw)
To: Huang, Ying
Cc: Mina Almasry, Tejun Heo, Zefan Li, Johannes Weiner,
Jonathan Corbet, Roman Gushchin, Shakeel Butt, Muchun Song,
Andrew Morton, Yang Shi, Yosry Ahmed, weixugc, fvdl, bagasdotme,
cgroups, linux-doc, linux-kernel, linux-mm
On Tue 13-12-22 14:30:57, Huang, Ying wrote:
> Mina Almasry <almasrymina@google.com> writes:
[...]
> After these discussions, I think the solution may be to use different
> interfaces for "proactive demote" and "proactive reclaim", that is, to
> reconsider "memory.demote". In this way, we will always uncharge the
> cgroup for "memory.reclaim", which avoids the possible confusion there.
> And, because demotion is considered aging, we don't need to disable
> demotion for "memory.reclaim"; we just don't count it.
As already pointed out in my previous email, we should really think more
about future requirements. Do we add a memory.promote interface when there
is a request to implement NUMA balancing in userspace? Maybe yes, but
maybe the node balancing should be more generic than being bound to memory
tiering, and should apply to more fine-grained nodemask control.
Fundamentally we already have APIs to age (MADV_COLD, MADV_FREE),
reclaim (MADV_PAGEOUT, MADV_DONTNEED) and MADV_WILLNEED to prioritize
(swap in, or read ahead), which are per mm/file. Their primary usability
issue is that they are process-centric, which requires a very deep
understanding of the process mm layout, so they are not really usable for
larger-scale orchestration.
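For reference, the process-centric APIs mentioned above can be exercised directly from userspace. A minimal sketch (Linux-only; it uses MADV_DONTNEED because MADV_COLD/MADV_PAGEOUT need Linux 5.4+, but they are invoked the same way):

```python
import mmap

PAGE = mmap.PAGESIZE

# Map one anonymous *private* page and dirty it.
buf = mmap.mmap(-1, PAGE, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
buf[:5] = b"hello"
assert buf[:5] == b"hello"

# MADV_DONTNEED drops the page immediately; the next access faults in
# a fresh zero-filled page, so the data is gone without unmapping.
buf.madvise(mmap.MADV_DONTNEED)
assert buf[:5] == b"\0" * 5

buf.close()
```

As Michal notes, the difficulty is not invoking these calls but knowing which address ranges of which process to pass, which is why a cgroup-level interface is attractive for orchestration.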
The important part of those interfaces is that they do not talk about
demotion because that is an implementation detail. I think we want to
follow that model at least. From a higher level POV I believe we really
need an interface to age&reclaim and balance memory among nodes. Are
there other higher-level usecases?
--
Michal Hocko
SUSE Labs
* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
2022-12-13 8:51 ` Michal Hocko
@ 2022-12-13 13:42 ` Huang, Ying
0 siblings, 0 replies; 35+ messages in thread
From: Huang, Ying @ 2022-12-13 13:42 UTC (permalink / raw)
To: Michal Hocko, Mina Almasry, weixugc
Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
Yang Shi, Yosry Ahmed, fvdl, bagasdotme, cgroups, linux-doc,
linux-kernel, linux-mm
Michal Hocko <mhocko@suse.com> writes:
> On Tue 13-12-22 14:30:57, Huang, Ying wrote:
>> Mina Almasry <almasrymina@google.com> writes:
> [...]
>> After these discussions, I think the solution may be to use different
>> interfaces for "proactive demote" and "proactive reclaim", that is, to
>> reconsider "memory.demote". In this way, we will always uncharge the
>> cgroup for "memory.reclaim", which avoids the possible confusion there.
>> And, because demotion is considered aging, we don't need to disable
>> demotion for "memory.reclaim"; we just don't count it.
>
> As already pointed out in my previous email, we should really think more
> about future requirements. Do we add memory.promote interface when there
> is a request to implement numa balancing into the userspace? Maybe yes
> but maybe the node balancing should be more generic than bound to memory
> tiering and apply to a more fine grained nodemask control.
>
> Fundamentally we already have APIs to age (MADV_COLD, MADV_FREE),
> reclaim (MADV_PAGEOUT, MADV_DONTNEED) and MADV_WILLNEED to prioritize
> (swap in, or read ahead) which are per mm/file. Their primary usability
> issue is that they are process centric and that requires a very deep
> understanding of the process mm layout so it is not really usable for a
> larger scale orchestration.
> The important part of those interfaces is that they do not talk about
> demotion because that is an implementation detail. I think we want to
> follow that model at least. From a higher level POV I believe we really
> need an interface to age&reclaim and balance memory among nodes. Are
> there more higher level usecases?
Yes. If the high-level interfaces can satisfy the requirements, we
should use or define them. But I guess Mina and Xu have some
requirements at the level of memory tiers (demotion/promotion)?
Best Regards,
Huang, Ying
[parent not found: <87k02volwe.fsf-fFUE1NP8JkzwuUmzmnQr+vooFf0ArEBIu+b9c/7xato@public.gmane.org>]
* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
[not found] ` <87k02volwe.fsf-fFUE1NP8JkzwuUmzmnQr+vooFf0ArEBIu+b9c/7xato@public.gmane.org>
@ 2022-12-13 13:30 ` Johannes Weiner
2022-12-13 14:03 ` Michal Hocko
0 siblings, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2022-12-13 13:30 UTC (permalink / raw)
To: Huang, Ying
Cc: Mina Almasry, Michal Hocko, Tejun Heo, Zefan Li, Jonathan Corbet,
Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
Yang Shi, Yosry Ahmed, weixugc-hpIqsD4AKlfQT0dZR+AlfA,
fvdl-hpIqsD4AKlfQT0dZR+AlfA, bagasdotme-Re5JQEeQqe8AvxtiuMwx3w,
cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg
On Tue, Dec 13, 2022 at 02:30:57PM +0800, Huang, Ying wrote:
> Mina Almasry <almasrymina-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> writes:
>
> > On Mon, Dec 12, 2022 at 12:55 AM Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote:
> >>
> >> On Fri 02-12-22 14:35:31, Mina Almasry wrote:
> >> > The nodes= arg instructs the kernel to only scan the given nodes for
> >> > proactive reclaim. For example use cases, consider a 2 tier memory system:
> >> >
> >> > nodes 0,1 -> top tier
> >> > nodes 2,3 -> second tier
> >> >
> >> > $ echo "1m nodes=0" > memory.reclaim
> >> >
> >> > This instructs the kernel to attempt to reclaim 1m memory from node 0.
> >> > Since node 0 is a top tier node, demotion will be attempted first. This
> >> > is useful to direct proactive reclaim to specific nodes that are under
> >> > pressure.
> >> >
> >> > $ echo "1m nodes=2,3" > memory.reclaim
> >> >
> >> > This instructs the kernel to attempt to reclaim 1m memory in the second tier,
> >> > since this tier of memory has no demotion targets the memory will be
> >> > reclaimed.
> >> >
> >> > $ echo "1m nodes=0,1" > memory.reclaim
> >> >
> >> > Instructs the kernel to reclaim memory from the top tier nodes, which can
> >> > be desirable according to the userspace policy if there is pressure on
> >> > the top tiers. Since these nodes have demotion targets, the kernel will
> >> > attempt demotion first.
> >> >
> >> > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> >> > reclaim""), the proactive reclaim interface memory.reclaim does both
> >> > reclaim and demotion. Reclaim and demotion incur different latency costs
> >> > to the jobs in the cgroup. Demoted memory would still be addressable
> >> > by the userspace at a higher latency, but reclaimed memory would need to
> >> > incur a pagefault.
> >> >
> >> > The 'nodes' arg is useful to allow the userspace to control demotion
> >> > and reclaim independently according to its policy: if the memory.reclaim
> >> > is called on a node with demotion targets, it will attempt demotion first;
> >> > if it is called on a node without demotion targets, it will only attempt
> >> > reclaim.
> >> >
> >> > Acked-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
> >> > Signed-off-by: Mina Almasry <almasrymina-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> >>
> >> After discussion in [1] I have realized that I haven't really thought
> >> through all the consequences of this patch and therefore I am retracting
> >> my ack here. I am not nacking the patch at this stage but I also think
> >> this shouldn't be merged now and we should really consider all the
> >> consequences.
> >>
> >> Let me summarize my main concerns here as well. The proposed
> >> implementation doesn't apply the provided nodemask to the whole reclaim
> >> process. This means that demotion can happen outside of the mask so
> >> the user request cannot really control demotion targets and that limits
> >> the interface should there be any need for a finer grained control in
> >> the future (see an example in [2]).
> >> Another problem is that this can limit future reclaim extensions because
> >> of existing assumptions of the interface [3] - specify only top-tier
> >> node to force the aging without actually reclaiming any charges and
> >> (ab)use the interface only for aging on multi-tier system. A change to
> >> the reclaim to not demote in some cases could break this usecase.
> >>
> >
> > I think this is correct. My use case is to request from the kernel to
> > do demotion without reclaim in the cgroup, and the reason for that is
> > stated in the commit message:
> >
> > "Reclaim and demotion incur different latency costs to the jobs in the
> > cgroup. Demoted memory would still be addressable by the userspace at
> > a higher latency, but reclaimed memory would need to incur a
> > pagefault."
> >
> > For jobs of some latency tiers, we would like to trigger proactive
> > demotion (which incurs relatively low latency on the job), but not
> > trigger proactive reclaim (which incurs a pagefault). I initially had
> > proposed a separate interface for this, but Johannes directed me to
> > this interface instead in [1]. In the same email Johannes also tells
> > me that meta's reclaim stack relies on memory.reclaim triggering
> > demotion, so it seems that I'm not the first to take a dependency on
> > this. Additionally in [2] Johannes also says it would be great if in
> > the long term reclaim policy and demotion policy do not diverge.
> >
> > [1] https://lore.kernel.org/linux-mm/Y35fw2JSAeAddONg-druUgvl0LCNAfugRpC6u6w@public.gmane.org/
> > [2] https://lore.kernel.org/linux-mm/Y36fIGFCFKiocAd6-druUgvl0LCNAfugRpC6u6w@public.gmane.org/
>
> After these discussions, I think the solution may be to use different
> interfaces for "proactive demote" and "proactive reclaim", that is, to
> reconsider "memory.demote". In this way, we will always uncharge the
> cgroup for "memory.reclaim", which avoids the possible confusion there.
> And, because demotion is considered aging, we don't need to disable
> demotion for "memory.reclaim"; we just don't count it.
Hm, so in summary:

1) memory.reclaim would demote and reclaim like today, but it would
change to only count reclaimed pages against the goal.

2) memory.demote would only demote.

a) What if the demotion targets are full? Would it reclaim or fail?

3) Would memory.reclaim and memory.demote still need nodemasks? Would
they return -EINVAL if a) memory.reclaim gets passed only toptier
nodes or b) memory.demote gets passed any lasttier nodes?
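The two-knob semantics summarized in points 1 and 2 can be sketched as a toy model (hypothetical, not kernel code; tier numbers and page counts are illustrative):

```python
# Pages live on numbered tiers; tier 0 is the top tier. "demote" moves
# pages one tier down; "reclaim" ages (demotes) as a side effect but
# counts only evicted pages against the goal, per point 1 above.

def demote(tiers, count):
    """memory.demote analogue: move up to `count` pages from each upper
    tier one tier down; nothing is uncharged from the cgroup."""
    moved = 0
    for t in range(len(tiers) - 1):
        n = min(count - moved, tiers[t])
        tiers[t] -= n
        tiers[t + 1] += n
        moved += n
    return moved

def reclaim(tiers, goal):
    """memory.reclaim analogue: demotion happens as part of aging but is
    not counted; only pages evicted from the last tier count."""
    demote(tiers, goal)          # aging step, not counted
    evicted = min(goal, tiers[-1])
    tiers[-1] -= evicted
    return evicted               # what the cgroup is uncharged for

tiers = [4, 0]                   # 4 pages on the top tier, none below
assert demote(tiers, 2) == 2 and tiers == [2, 2]
assert reclaim(tiers, 3) == 3    # demotes the remaining 2, evicts 3
assert tiers == [0, 1]
```

Johannes's question 2a (demotion targets full: reclaim or fail?) is exactly the case this toy model leaves undefined.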
* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
2022-12-13 13:30 ` Johannes Weiner
@ 2022-12-13 14:03 ` Michal Hocko
[not found] ` <Y5iGJ/9PMmSCwqLj-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
0 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2022-12-13 14:03 UTC (permalink / raw)
To: Johannes Weiner
Cc: Huang, Ying, Mina Almasry, Tejun Heo, Zefan Li, Jonathan Corbet,
Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
Yang Shi, Yosry Ahmed, weixugc, fvdl, bagasdotme, cgroups,
linux-doc, linux-kernel, linux-mm
On Tue 13-12-22 14:30:40, Johannes Weiner wrote:
> On Tue, Dec 13, 2022 at 02:30:57PM +0800, Huang, Ying wrote:
[...]
> > After these discussion, I think the solution maybe use different
> > interfaces for "proactive demote" and "proactive reclaim". That is,
> > reconsider "memory.demote". In this way, we will always uncharge the
> > cgroup for "memory.reclaim". This avoid the possible confusion there.
> > And, because demotion is considered aging, we don't need to disable
> > demotion for "memory.reclaim", just don't count it.
>
> Hm, so in summary:
>
> 1) memory.reclaim would demote and reclaim like today, but it would
> change to only count reclaimed pages against the goal.
>
> 2) memory.demote would only demote.
>
> a) What if the demotion targets are full? Would it reclaim or fail?
>
> 3) Would memory.reclaim and memory.demote still need nodemasks? Would
> they return -EINVAL if a) memory.reclaim gets passed only toptier
> nodes or b) memory.demote gets passed any lasttier nodes?
I would also add
4) Do we want to allow control over the demotion path (e.g. which node to
demote from and to), and how do we achieve that?
5) Is the demotion API restricted to multi-tier systems, or is any NUMA
configuration allowed as well?
--
Michal Hocko
SUSE Labs
* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
2022-12-13 0:54 ` Mina Almasry
2022-12-13 6:30 ` Huang, Ying
@ 2022-12-13 8:33 ` Michal Hocko
2022-12-13 15:58 ` Johannes Weiner
1 sibling, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2022-12-13 8:33 UTC (permalink / raw)
To: Mina Almasry
Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
Huang Ying, Yang Shi, Yosry Ahmed, weixugc, fvdl, bagasdotme,
cgroups, linux-doc, linux-kernel, linux-mm
On Mon 12-12-22 16:54:27, Mina Almasry wrote:
> On Mon, Dec 12, 2022 at 12:55 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > Let me summarize my main concerns here as well. The proposed
> > implementation doesn't apply the provided nodemask to the whole reclaim
> > process. This means that demotion can happen outside of the mask so
> > the user request cannot really control demotion targets and that limits
> > the interface should there be any need for a finer grained control in
> > the future (see an example in [2]).
> > Another problem is that this can limit future reclaim extensions because
> > of existing assumptions of the interface [3] - specify only top-tier
> > node to force the aging without actually reclaiming any charges and
> > (ab)use the interface only for aging on multi-tier system. A change to
> > the reclaim to not demote in some cases could break this usecase.
> >
>
> I think this is correct. My use case is to request from the kernel to
> do demotion without reclaim in the cgroup, and the reason for that is
> stated in the commit message:
>
> "Reclaim and demotion incur different latency costs to the jobs in the
> cgroup. Demoted memory would still be addressable by the userspace at
> a higher latency, but reclaimed memory would need to incur a
> pagefault."
>
> For jobs of some latency tiers, we would like to trigger proactive
> demotion (which incurs relatively low latency on the job), but not
> trigger proactive reclaim (which incurs a pagefault). I initially had
> proposed a separate interface for this, but Johannes directed me to
> this interface instead in [1]. In the same email Johannes also tells
> me that meta's reclaim stack relies on memory.reclaim triggering
> demotion, so it seems that I'm not the first to take a dependency on
> this. Additionally in [2] Johannes also says it would be great if in
> the long term reclaim policy and demotion policy do not diverge.
I do recognize your need to control the demotion but I argue that it is
a bad idea to rely on an implicit behavior of the memory reclaim and an
interface which is _documented_ to primarily _reclaim_ memory.
Really, consider that the current demotion implementation may change
in the future: based on a newly added heuristic, memory reclaim or
compression could be preferred over migration to a different tier. This
might completely break your current assumptions and break your usecase,
which relies on an implicit demotion behavior. Do you see that as a
potential problem at all? What shall we do in that case? Special-case
memory.reclaim behavior?
Now to your specific usecase. If there is a need to do memory
distribution balancing, then fine, but this should be a well-defined
interface. E.g. is there a need to control not only demotion but
promotion as well? I haven't heard anybody requesting that so far,
but I can easily imagine that, just like outsourcing memory reclaim to
userspace, someone might want to do the same thing with NUMA
balancing because $REASONS. Should that ever happen, I am pretty sure
hooking into memory.reclaim is not really a great idea.
See where I am coming from?
> [1] https://lore.kernel.org/linux-mm/Y35fw2JSAeAddONg@cmpxchg.org/
> [2] https://lore.kernel.org/linux-mm/Y36fIGFCFKiocAd6@cmpxchg.org/
--
Michal Hocko
SUSE Labs
* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
2022-12-13 8:33 ` Michal Hocko
@ 2022-12-13 15:58 ` Johannes Weiner
[not found] ` <Y5iet+ch24YrvExA-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
` (2 more replies)
0 siblings, 3 replies; 35+ messages in thread
From: Johannes Weiner @ 2022-12-13 15:58 UTC (permalink / raw)
To: Michal Hocko
Cc: Mina Almasry, Tejun Heo, Zefan Li, Jonathan Corbet,
Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
Huang Ying, Yang Shi, Yosry Ahmed, weixugc, fvdl, bagasdotme,
cgroups, linux-doc, linux-kernel, linux-mm
On Tue, Dec 13, 2022 at 09:33:24AM +0100, Michal Hocko wrote:
> I do recognize your need to control the demotion but I argue that it is
> a bad idea to rely on an implicit behavior of the memory reclaim and an
> interface which is _documented_ to primarily _reclaim_ memory.
I think memory.reclaim should demote as part of page aging. What I'd
like to avoid is *having* to manually control the aging component in
the interface (e.g. making memory.reclaim *only* reclaim, and
*requiring* a coordinated use of memory.demote to ensure progress.)
> Really, consider that the current demotion implementation will change
> in the future and based on a newly added heuristic memory reclaim or
> compression would be preferred over migration to a different tier. This
> might completely break your current assumptions and break your usecase
> which relies on an implicit demotion behavior. Do you see that as a
> potential problem at all? What shall we do in that case? Special case
> memory.reclaim behavior?
Shouldn't that be derived from the distance properties in the tier
configuration?
I.e. if local compression is faster than demoting to a slower node, we
should maybe have a separate tier for that. Ignoring proactive reclaim
or demotion commands for a second: on that node, global memory
pressure should always compress first, while the oldest pages from the
compression cache should demote to the other node(s) - until they
eventually get swapped out.
However fine-grained we make proactive reclaim control over these
stages, it should at least be possible for the user to request the
default behavior that global pressure follows, without jumping through
hoops or requiring the coordinated use of multiple knobs. So IMO there
is an argument for having a singular knob that requests comprehensive
aging and reclaiming across the configured hierarchy.
As far as explicit control over the individual stages goes - no idea
if you would call the compression stage demotion or reclaim. The
distinction still does not make much sense to me, since reclaim is
just another form of demotion. Sure, page faults have a different
access latency than dax to slower memory. But you could also have 3
tiers of memory where the difference between tier 1 and 2 is much
smaller than the difference between 2 and 3, and you might want to
apply different demotion rates between them as well.
The other argument is that demotion does not free cgroup memory,
whereas reclaim does. But with multiple memory tiers of vastly
different performance, isn't there also an argument for granting
cgroups different shares of each memory? So that a higher priority
group has access to a bigger share of the fastest memory, and lower
prio cgroups are relegated to lower tiers. If we split those pools,
then "demotion" will actually free memory in a cgroup.
This is why I liked adding a nodes= argument to memory.reclaim the
best. It doesn't encode a distinction that may not last for long.
The problem comes from how to interpret the input argument and the
return value, right? Could we solve this by requiring the passed
nodes= to all be of the same memory tier? Then there is no confusion
around what is requested and what the return value means.
And if no nodes are passed, it means reclaim (from the lowest memory
tier) X pages and demote as needed, then return the reclaimed pages.
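The same-tier rule proposed above can be sketched as a toy validity check (hypothetical, not kernel code; the node-to-tier layout is the 2-tier example from the patch description):

```python
# nodes 0,1 -> top tier (tier 0); nodes 2,3 -> second tier (tier 1),
# as in the commit message. A nodes= mask is accepted only if every
# node sits on a single tier, so the meaning of the request and of the
# return value is unambiguous.

NODE_TIER = {0: 0, 1: 0, 2: 1, 3: 1}

def parse_nodes(mask):
    tiers = {NODE_TIER[n] for n in mask}
    if len(tiers) != 1:
        raise ValueError("-EINVAL: nodes= spans multiple memory tiers")
    # A top-tier-only request effectively asks for aging/demotion; a
    # last-tier request asks for actual reclaim.
    return "demote" if tiers.pop() == 0 else "reclaim"

assert parse_nodes({0, 1}) == "demote"
assert parse_nodes({2, 3}) == "reclaim"
try:
    parse_nodes({0, 2})
except ValueError:
    pass
else:
    raise AssertionError("mixed-tier mask should be rejected")
```

Under this rule there is no ambiguity about whether the returned byte count refers to demoted or reclaimed pages, since a single-tier mask can only produce one of the two.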
> Now to your specific usecase. If there is a need to do a memory
> distribution balancing then fine but this should be a well defined
> interface. E.g. is there a need to not only control demotion but
> promotions as well? I haven't heard anybody requesting that so far
> but I can easily imagine that like outsourcing the memory reclaim to
> the userspace someone might want to do the same thing with the numa
> balancing because $REASONS. Should that ever happen, I am pretty sure
> hooking into memory.reclaim is not really a great idea.
Should this ever happen, it would seem fair that that be a separate
knob anyway, no? One knob to move the pipeline in one direction
(aging), one knob to move it the other way.
[parent not found: <Y5iet+ch24YrvExA-druUgvl0LCNAfugRpC6u6w@public.gmane.org>]
* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
[not found] ` <Y5iet+ch24YrvExA-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
@ 2022-12-13 19:53 ` Mina Almasry
2022-12-14 7:20 ` Huang, Ying
0 siblings, 1 reply; 35+ messages in thread
From: Mina Almasry @ 2022-12-13 19:53 UTC (permalink / raw)
To: Johannes Weiner
Cc: Michal Hocko, Tejun Heo, Zefan Li, Jonathan Corbet,
Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
Huang Ying, Yang Shi, Yosry Ahmed, weixugc-hpIqsD4AKlfQT0dZR+AlfA,
fvdl-hpIqsD4AKlfQT0dZR+AlfA, bagasdotme-Re5JQEeQqe8AvxtiuMwx3w,
cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg
On Tue, Dec 13, 2022 at 7:58 AM Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> wrote:
>
> On Tue, Dec 13, 2022 at 09:33:24AM +0100, Michal Hocko wrote:
> > I do recognize your need to control the demotion but I argue that it is
> > a bad idea to rely on an implicit behavior of the memory reclaim and an
> > interface which is _documented_ to primarily _reclaim_ memory.
>
> I think memory.reclaim should demote as part of page aging. What I'd
> like to avoid is *having* to manually control the aging component in
> the interface (e.g. making memory.reclaim *only* reclaim, and
> *requiring* a coordinated use of memory.demote to ensure progress.)
>
> > Really, consider that the current demotion implementation will change
> > in the future and based on a newly added heuristic memory reclaim or
> > compression would be preferred over migration to a different tier. This
> > might completely break your current assumptions and break your usecase
> > which relies on an implicit demotion behavior. Do you see that as a
> > potential problem at all? What shall we do in that case? Special case
> > memory.reclaim behavior?
>
> Shouldn't that be derived from the distance properties in the tier
> configuration?
>
> I.e. if local compression is faster than demoting to a slower node, we
> should maybe have a separate tier for that. Ignoring proactive reclaim
> or demotion commands for a second: on that node, global memory
> pressure should always compress first, while the oldest pages from the
> compression cache should demote to the other node(s) - until they
> eventually get swapped out.
>
> However fine-grained we make proactive reclaim control over these
> stages, it should at least be possible for the user to request the
> default behavior that global pressure follows, without jumping through
> hoops or requiring the coordinated use of multiple knobs. So IMO there
> is an argument for having a singular knob that requests comprehensive
> aging and reclaiming across the configured hierarchy.
>
> As far as explicit control over the individual stages goes - no idea
> if you would call the compression stage demotion or reclaim. The
> distinction still does not make much sense to me, since reclaim is
> just another form of demotion. Sure, page faults have a different
> access latency than dax to slower memory. But you could also have 3
> tiers of memory where the difference between tier 1 and 2 is much
> smaller than the difference between 2 and 3, and you might want to
> apply different demotion rates between them as well.
>
> The other argument is that demotion does not free cgroup memory,
> whereas reclaim does. But with multiple memory tiers of vastly
> different performance, isn't there also an argument for granting
> cgroups different shares of each memory? So that a higher priority
> group has access to a bigger share of the fastest memory, and lower
> prio cgroups are relegated to lower tiers. If we split those pools,
> then "demotion" will actually free memory in a cgroup.
>
I would also like to note that I implemented something in line with that in [1].
In this patch, pages demoted from inside the nodemask to outside the
nodemask count as 'reclaimed'. This, in my mind, is a very generic
solution to the 'should demoted pages count as reclaim?' problem, and
will work in all scenarios as long as the nodemask passed to
shrink_folio_list() is set correctly by the call stack.
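The accounting rule described above can be sketched as a toy predicate (hypothetical, not the actual kernel code in [1]): a demotion that leaves the requested nodemask counts as reclaim for that request, while one that stays inside it does not.

```python
def counts_as_reclaimed(src, dst, nodemask):
    """dst is None when the page was actually evicted rather than
    demoted; eviction always counts."""
    if dst is None:
        return True
    # Demotion counts only when the page left the requested nodemask.
    return src in nodemask and dst not in nodemask

# nodes=0: demoting 0 -> 2 moves memory out of the mask, so it counts.
assert counts_as_reclaimed(0, 2, {0})
# nodes=0-3: the same demotion stays inside the mask, so it does not.
assert not counts_as_reclaimed(0, 2, {0, 1, 2, 3})
# Eviction from node 2 counts regardless of the mask.
assert counts_as_reclaimed(2, None, {2, 3})
```

This captures why the scheme generalizes: the caller's nodemask, not a global tier property, decides what "reclaimed" means for each request.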
> This is why I liked adding a nodes= argument to memory.reclaim the
> best. It doesn't encode a distinction that may not last for long.
>
> The problem comes from how to interpret the input argument and the
> return value, right? Could we solve this by requiring the passed
> nodes= to all be of the same memory tier? Then there is no confusion
> around what is requested and what the return value means.
>
I feel like I arrived at a better solution in [1], where pages demoted
from inside the nodemask to outside it count as reclaimed and the rest
don't. But yes, I think we could also solve this with explicit checks
that the nodes= arg is from a single tier.
> And if no nodes are passed, it means reclaim (from the lowest memory
> tier) X pages and demote as needed, then return the reclaimed pages.
>
> > Now to your specific usecase. If there is a need to do a memory
> > distribution balancing then fine but this should be a well defined
> > interface. E.g. is there a need to not only control demotion but
> > promotions as well? I haven't heard anybody requesting that so far
> > but I can easily imagine that like outsourcing the memory reclaim to
> > the userspace someone might want to do the same thing with the numa
> > balancing because $REASONS. Should that ever happen, I am pretty sure
> > hooking into memory.reclaim is not really a great idea.
>
> Should this ever happen, it would seem fair that that be a separate
> knob anyway, no? One knob to move the pipeline in one direction
> (aging), one knob to move it the other way.
[1] https://lore.kernel.org/linux-mm/20221206023406.3182800-1-almasrymina-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org/
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
2022-12-13 19:53 ` Mina Almasry
@ 2022-12-14 7:20 ` Huang, Ying
0 siblings, 0 replies; 35+ messages in thread
From: Huang, Ying @ 2022-12-14 7:20 UTC (permalink / raw)
To: Mina Almasry
Cc: Johannes Weiner, Michal Hocko, Tejun Heo, Zefan Li,
Jonathan Corbet, Roman Gushchin, Shakeel Butt, Muchun Song,
Andrew Morton, Yang Shi, Yosry Ahmed, weixugc, fvdl, bagasdotme,
cgroups, linux-doc, linux-kernel, linux-mm
Mina Almasry <almasrymina@google.com> writes:
> On Tue, Dec 13, 2022 at 7:58 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>>
>> On Tue, Dec 13, 2022 at 09:33:24AM +0100, Michal Hocko wrote:
>> > I do recognize your need to control the demotion but I argue that it is
>> > a bad idea to rely on an implicit behavior of the memory reclaim and an
>> > interface which is _documented_ to primarily _reclaim_ memory.
>>
>> I think memory.reclaim should demote as part of page aging. What I'd
>> like to avoid is *having* to manually control the aging component in
>> the interface (e.g. making memory.reclaim *only* reclaim, and
>> *requiring* a coordinated use of memory.demote to ensure progress.)
>>
>> > Really, consider that the current demotion implementation will change
>> > in the future and based on a newly added heuristic memory reclaim or
>> > compression would be preferred over migration to a different tier. This
>> > might completely break your current assumptions and break your usecase
>> > which relies on an implicit demotion behavior. Do you see that as a
>> > potential problem at all? What shall we do in that case? Special case
>> > memory.reclaim behavior?
>>
>> Shouldn't that be derived from the distance properties in the tier
>> configuration?
>>
>> I.e. if local compression is faster than demoting to a slower node, we
>> should maybe have a separate tier for that. Ignoring proactive reclaim
>> or demotion commands for a second: on that node, global memory
>> pressure should always compress first, while the oldest pages from the
>> compression cache should demote to the other node(s) - until they
>> eventually get swapped out.
>>
>> However fine-grained we make proactive reclaim control over these
>> stages, it should at least be possible for the user to request the
>> default behavior that global pressure follows, without jumping through
>> hoops or requiring the coordinated use of multiple knobs. So IMO there
>> is an argument for having a singular knob that requests comprehensive
>> aging and reclaiming across the configured hierarchy.
>>
>> As far as explicit control over the individual stages goes - no idea
>> if you would call the compression stage demotion or reclaim. The
>> distinction still does not make much sense to me, since reclaim is
>> just another form of demotion. Sure, page faults have a different
>> access latency than dax to slower memory. But you could also have 3
>> tiers of memory where the difference between tier 1 and 2 is much
>> smaller than the difference between 2 and 3, and you might want to
>> apply different demotion rates between them as well.
>>
>> The other argument is that demotion does not free cgroup memory,
>> whereas reclaim does. But with multiple memory tiers of vastly
>> different performance, isn't there also an argument for granting
>> cgroups different shares of each memory? So that a higher priority
>> group has access to a bigger share of the fastest memory, and lower
>> prio cgroups are relegated to lower tiers. If we split those pools,
>> then "demotion" will actually free memory in a cgroup.
>>
>
> I would also like to say I implemented something in line with that in [1].
>
> In this patch, pages demoted from inside the nodemask to outside the
> nodemask count as 'reclaimed'. This, in my mind, is a very generic
> solution to the 'should demoted pages count as reclaim?' problem, and
> will work in all scenarios as long as the nodemask passed to
> shrink_folio_list() is set correctly by the call stack.
It's still not clear how many pages should be demoted among the
nodes inside the nodemask. One possibility is to keep as many higher
tier pages as possible.
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
2022-12-13 15:58 ` Johannes Weiner
[not found] ` <Y5iet+ch24YrvExA-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
@ 2022-12-14 7:15 ` Huang, Ying
2022-12-14 10:43 ` Michal Hocko
2 siblings, 0 replies; 35+ messages in thread
From: Huang, Ying @ 2022-12-14 7:15 UTC (permalink / raw)
To: Johannes Weiner
Cc: Michal Hocko, Mina Almasry, Tejun Heo, Zefan Li, Jonathan Corbet,
Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
Yang Shi, Yosry Ahmed, weixugc, fvdl, bagasdotme, cgroups,
linux-doc, linux-kernel, linux-mm
Johannes Weiner <hannes@cmpxchg.org> writes:
> On Tue, Dec 13, 2022 at 09:33:24AM +0100, Michal Hocko wrote:
>> I do recognize your need to control the demotion but I argue that it is
>> a bad idea to rely on an implicit behavior of the memory reclaim and an
>> interface which is _documented_ to primarily _reclaim_ memory.
>
> I think memory.reclaim should demote as part of page aging. What I'd
> like to avoid is *having* to manually control the aging component in
> the interface (e.g. making memory.reclaim *only* reclaim, and
> *requiring* a coordinated use of memory.demote to ensure progress.)
>
>> Really, consider that the current demotion implementation will change
>> in the future and based on a newly added heuristic memory reclaim or
>> compression would be preferred over migration to a different tier. This
>> might completely break your current assumptions and break your usecase
>> which relies on an implicit demotion behavior. Do you see that as a
>> potential problem at all? What shall we do in that case? Special case
>> memory.reclaim behavior?
>
> Shouldn't that be derived from the distance properties in the tier
> configuration?
>
> I.e. if local compression is faster than demoting to a slower node, we
> should maybe have a separate tier for that. Ignoring proactive reclaim
> or demotion commands for a second: on that node, global memory
> pressure should always compress first, while the oldest pages from the
> compression cache should demote to the other node(s) - until they
> eventually get swapped out.
>
> However fine-grained we make proactive reclaim control over these
> stages, it should at least be possible for the user to request the
> default behavior that global pressure follows, without jumping through
> hoops or requiring the coordinated use of multiple knobs. So IMO there
> is an argument for having a singular knob that requests comprehensive
> aging and reclaiming across the configured hierarchy.
>
> As far as explicit control over the individual stages goes - no idea
> if you would call the compression stage demotion or reclaim. The
> distinction still does not make much sense to me, since reclaim is
> just another form of demotion. Sure, page faults have a different
> access latency than dax to slower memory. But you could also have 3
> tiers of memory where the difference between tier 1 and 2 is much
> smaller than the difference between 2 and 3, and you might want to
> apply different demotion rates between them as well.
>
> The other argument is that demotion does not free cgroup memory,
> whereas reclaim does. But with multiple memory tiers of vastly
> different performance, isn't there also an argument for granting
> cgroups different shares of each memory? So that a higher priority
> group has access to a bigger share of the fastest memory, and lower
> prio cgroups are relegated to lower tiers. If we split those pools,
> then "demotion" will actually free memory in a cgroup.
>
> This is why I liked adding a nodes= argument to memory.reclaim the
> best. It doesn't encode a distinction that may not last for long.
>
> The problem comes from how to interpret the input argument and the
> return value, right? Could we solve this by requiring the passed
> nodes= to all be of the same memory tier? Then there is no confusion
> around what is requested and what the return value means.
Yes. The definition is clear if the nodes= are all from the same memory tier.
> And if no nodes are passed, it means reclaim (from the lowest memory
> tier) X pages and demote as needed, then return the reclaimed pages.
It appears that the definition isn't very clear here. How many pages
should be demoted? Is the target number the value echoed to
memory.reclaim, or requested_number - pages_in_lowest_tier? Should we
demote in as many tiers as possible or in as few tiers as possible? One
possibility is to take advantage of top tier memory as much as
possible. That is, try to reclaim pages in lower tiers only.
>> Now to your specific usecase. If there is a need to do a memory
>> distribution balancing then fine but this should be a well defined
>> interface. E.g. is there a need to not only control demotion but
>> promotions as well? I haven't heard anybody requesting that so far
>> but I can easily imagine that like outsourcing the memory reclaim to
>> the userspace someone might want to do the same thing with the numa
>> balancing because $REASONS. Should that ever happen, I am pretty sure
>> hooking into memory.reclaim is not really a great idea.
>
> Should this ever happen, it would seem fair that that be a separate
> knob anyway, no? One knob to move the pipeline in one direction
> (aging), one knob to move it the other way.
Agree.
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
2022-12-13 15:58 ` Johannes Weiner
[not found] ` <Y5iet+ch24YrvExA-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
2022-12-14 7:15 ` Huang, Ying
@ 2022-12-14 10:43 ` Michal Hocko
2 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2022-12-14 10:43 UTC (permalink / raw)
To: Johannes Weiner
Cc: Mina Almasry, Tejun Heo, Zefan Li, Jonathan Corbet,
Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
Huang Ying, Yang Shi, Yosry Ahmed, weixugc, fvdl, bagasdotme,
cgroups, linux-doc, linux-kernel, linux-mm
On Tue 13-12-22 16:58:50, Johannes Weiner wrote:
> On Tue, Dec 13, 2022 at 09:33:24AM +0100, Michal Hocko wrote:
> > I do recognize your need to control the demotion but I argue that it is
> > a bad idea to rely on an implicit behavior of the memory reclaim and an
> > interface which is _documented_ to primarily _reclaim_ memory.
>
> I think memory.reclaim should demote as part of page aging. What I'd
> like to avoid is *having* to manually control the aging component in
> the interface (e.g. making memory.reclaim *only* reclaim, and
> *requiring* a coordinated use of memory.demote to ensure progress.)
Yes, I do agree with that. Demotion is a part of the aging. I meant to say
that the result of the operation should be reclaimed charges but that
doesn't mean that demotion is not a part of that process.
I am mostly concerned about the demote-only behavior that Mina is
targeting and wants to use the memory.reclaim interface for.
> > Really, consider that the current demotion implementation will change
> > in the future and based on a newly added heuristic memory reclaim or
> > compression would be preferred over migration to a different tier. This
> > might completely break your current assumptions and break your usecase
> > which relies on an implicit demotion behavior. Do you see that as a
> > potential problem at all? What shall we do in that case? Special case
> > memory.reclaim behavior?
>
> Shouldn't that be derived from the distance properties in the tier
> configuration?
>
> I.e. if local compression is faster than demoting to a slower node, we
> should maybe have a separate tier for that. Ignoring proactive reclaim
> or demotion commands for a second: on that node, global memory
> pressure should always compress first, while the oldest pages from the
> compression cache should demote to the other node(s) - until they
> eventually get swapped out.
>
> However fine-grained we make proactive reclaim control over these
> stages, it should at least be possible for the user to request the
> default behavior that global pressure follows, without jumping through
> hoops or requiring the coordinated use of multiple knobs. So IMO there
> is an argument for having a singular knob that requests comprehensive
> aging and reclaiming across the configured hierarchy.
>
> As far as explicit control over the individual stages goes - no idea
> if you would call the compression stage demotion or reclaim. The
> distinction still does not make much sense to me, since reclaim is
> just another form of demotion.
From the external visibility POV the major difference between the two is
that the reclaim decreases the overall charged memory. And there are
pro-active reclaim usecases which rely on that. Demotion is mostly
memory placement rebalancing. Sure, it is still visible in per-node stats
and has performance implications, but that is a different story.
> Sure, page faults have a different
> access latency than dax to slower memory. But you could also have 3
> tiers of memory where the difference between tier 1 and 2 is much
> smaller than the difference between 2 and 3, and you might want to
> apply different demotion rates between them as well.
>
> The other argument is that demotion does not free cgroup memory,
> whereas reclaim does. But with multiple memory tiers of vastly
> different performance, isn't there also an argument for granting
> cgroups different shares of each memory?
Yes. We have already had requests for per-node limits in the past. And I
do expect this will show up as a problem here as well, but with a
reasonable memory.reclaim and potentially memory.demote interfaces the
balancing and policy making can be outsourced to userspace.
> So that a higher priority
> group has access to a bigger share of the fastest memory, and lower
> prio cgroups are relegated to lower tiers. If we split those pools,
> then "demotion" will actually free memory in a cgroup.
>
> This is why I liked adding a nodes= argument to memory.reclaim the
> best. It doesn't encode a distinction that may not last for long.
>
> The problem comes from how to interpret the input argument and the
> return value, right? Could we solve this by requiring the passed
> nodes= to all be of the same memory tier? Then there is no confusion
> around what is requested and what the return value means.
Just to make sure I am on the same page. This means that if a node mask
is specified then it always implies demotion without any control over
how the demotion is done, right?
> And if no nodes are passed, it means reclaim (from the lowest memory
> tier) X pages and demote as needed, then return the reclaimed pages.
IMO this is a rather constrained semantic which will completely rule out
some potentially interesting usecases. E.g. fine grained control over
the demotion path or enforced reclaim for node balancing. Also if we
ever want a promote interface then it would better fit with a demote
counterpart.
> > Now to your specific usecase. If there is a need to do a memory
> > distribution balancing then fine but this should be a well defined
> > interface. E.g. is there a need to not only control demotion but
> > promotions as well? I haven't heard anybody requesting that so far
> > but I can easily imagine that like outsourcing the memory reclaim to
> > the userspace someone might want to do the same thing with the numa
> > balancing because $REASONS. Should that ever happen, I am pretty sure
> > hooking into memory.reclaim is not really a great idea.
>
> Should this ever happen, it would seem fair that that be a separate
> knob anyway, no? One knob to move the pipeline in one direction
> (aging), one knob to move it the other way.
Yes, this is what I am inclining to as well.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH] Revert "mm: add nodes= arg to memory.reclaim"
2022-12-12 8:55 ` Michal Hocko
[not found] ` <Y5bsmpCyeryu3Zz1-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2022-12-16 9:54 ` Michal Hocko
2022-12-16 12:02 ` Mina Almasry
` (2 more replies)
1 sibling, 3 replies; 35+ messages in thread
From: Michal Hocko @ 2022-12-16 9:54 UTC (permalink / raw)
To: Mina Almasry, Andrew Morton
Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
Huang Ying, Yang Shi, Yosry Ahmed, weixugc, fvdl, bagasdotme,
cgroups, linux-doc, linux-kernel, linux-mm
Andrew,
I have noticed that the patch made it into Linus' tree already. Can we
please revert it because the semantics are not really clear and we should
really not create yet another user API maintenance problem. I am
proposing to revert the nodemask extension for now before we grow any
upstream users. Deeper in the email thread are some proposals on how to
move forward with that.
---
From 7c5285f1725d5abfcae5548ab0d73be9ceded2a1 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 16 Dec 2022 10:46:33 +0100
Subject: [PATCH] Revert "mm: add nodes= arg to memory.reclaim"
This reverts commit 12a5d3955227b0d7e04fb793ccceeb2a1dd275c5.
Although it is recognized that a finer grained pro-active reclaim is
something we need and want, the semantics of this implementation are
really ambiguous.
From a follow up discussion it became clear that there are two essential
usecases here. One is to use memory.reclaim to pro-actively reclaim
memory and expectation is that the requested and reported amount of memory is
uncharged from the memcg. Another usecase focuses on pro-active demotion
when the memory is merely shuffled around to demotion targets while the
overall charged memory stays unchanged.
The current implementation considers demoted pages as reclaimed and that
breaks both usecases. [1] has tried to address the reporting part but
there are more issues with that summarized in [2] and follow up emails.
Let's revert the nodemask based extension of the memcg pro-active
reclaim for now until we settle with a more robust semantic.
[1] http://lkml.kernel.org/r/http://lkml.kernel.org/r/20221206023406.3182800-1-almasrymina@google.com
[2] http://lkml.kernel.org/r/Y5bsmpCyeryu3Zz1@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
Documentation/admin-guide/cgroup-v2.rst | 15 +++---
include/linux/swap.h | 3 +-
mm/memcontrol.c | 67 +++++--------------------
mm/vmscan.c | 4 +-
4 files changed, 21 insertions(+), 68 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index c8ae7c897f14..74cec76be9f2 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1245,13 +1245,17 @@ PAGE_SIZE multiple when read back.
This is a simple interface to trigger memory reclaim in the
target cgroup.
- This file accepts a string which contains the number of bytes to
- reclaim.
+ This file accepts a single key, the number of bytes to reclaim.
+ No nested keys are currently supported.
Example::
echo "1G" > memory.reclaim
+ The interface can be later extended with nested keys to
+ configure the reclaim behavior. For example, specify the
+ type of memory to reclaim from (anon, file, ..).
+
Please note that the kernel can over or under reclaim from
the target cgroup. If less bytes are reclaimed than the
specified amount, -EAGAIN is returned.
@@ -1263,13 +1267,6 @@ PAGE_SIZE multiple when read back.
This means that the networking layer will not adapt based on
reclaim induced by memory.reclaim.
- This file also allows the user to specify the nodes to reclaim from,
- via the 'nodes=' key, for example::
-
- echo "1G nodes=0,1" > memory.reclaim
-
- The above instructs the kernel to reclaim memory from nodes 0,1.
-
memory.peak
A read-only single value file which exists on non-root
cgroups.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2787b84eaf12..0ceed49516ad 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -418,8 +418,7 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
unsigned long nr_pages,
gfp_t gfp_mask,
- unsigned int reclaim_options,
- nodemask_t *nodemask);
+ unsigned int reclaim_options);
extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
pg_data_t *pgdat,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ab457f0394ab..73afff8062f9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -63,7 +63,6 @@
#include <linux/resume_user_mode.h>
#include <linux/psi.h>
#include <linux/seq_buf.h>
-#include <linux/parser.h>
#include "internal.h"
#include <net/sock.h>
#include <net/ip.h>
@@ -2393,8 +2392,7 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
psi_memstall_enter(&pflags);
nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
gfp_mask,
- MEMCG_RECLAIM_MAY_SWAP,
- NULL);
+ MEMCG_RECLAIM_MAY_SWAP);
psi_memstall_leave(&pflags);
} while ((memcg = parent_mem_cgroup(memcg)) &&
!mem_cgroup_is_root(memcg));
@@ -2685,8 +2683,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
psi_memstall_enter(&pflags);
nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
- gfp_mask, reclaim_options,
- NULL);
+ gfp_mask, reclaim_options);
psi_memstall_leave(&pflags);
if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
@@ -3506,8 +3503,7 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
}
if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
- memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP,
- NULL)) {
+ memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
ret = -EBUSY;
break;
}
@@ -3618,8 +3614,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
return -EINTR;
if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
- MEMCG_RECLAIM_MAY_SWAP,
- NULL))
+ MEMCG_RECLAIM_MAY_SWAP))
nr_retries--;
}
@@ -6429,8 +6424,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
}
reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
- GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
- NULL);
+ GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
if (!reclaimed && !nr_retries--)
break;
@@ -6479,8 +6473,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
if (nr_reclaims) {
if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
- GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
- NULL))
+ GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
nr_reclaims--;
continue;
}
@@ -6603,54 +6596,21 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
return nbytes;
}
-enum {
- MEMORY_RECLAIM_NODES = 0,
- MEMORY_RECLAIM_NULL,
-};
-
-static const match_table_t if_tokens = {
- { MEMORY_RECLAIM_NODES, "nodes=%s" },
- { MEMORY_RECLAIM_NULL, NULL },
-};
-
static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
size_t nbytes, loff_t off)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
unsigned int nr_retries = MAX_RECLAIM_RETRIES;
unsigned long nr_to_reclaim, nr_reclaimed = 0;
- unsigned int reclaim_options = MEMCG_RECLAIM_MAY_SWAP |
- MEMCG_RECLAIM_PROACTIVE;
- char *old_buf, *start;
- substring_t args[MAX_OPT_ARGS];
- int token;
- char value[256];
- nodemask_t nodemask = NODE_MASK_ALL;
-
- buf = strstrip(buf);
-
- old_buf = buf;
- nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
- if (buf == old_buf)
- return -EINVAL;
+ unsigned int reclaim_options;
+ int err;
buf = strstrip(buf);
+ err = page_counter_memparse(buf, "", &nr_to_reclaim);
+ if (err)
+ return err;
- while ((start = strsep(&buf, " ")) != NULL) {
- if (!strlen(start))
- continue;
- token = match_token(start, if_tokens, args);
- match_strlcpy(value, args, sizeof(value));
- switch (token) {
- case MEMORY_RECLAIM_NODES:
- if (nodelist_parse(value, nodemask) < 0)
- return -EINVAL;
- break;
- default:
- return -EINVAL;
- }
- }
-
+ reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
while (nr_reclaimed < nr_to_reclaim) {
unsigned long reclaimed;
@@ -6667,8 +6627,7 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
reclaimed = try_to_free_mem_cgroup_pages(memcg,
nr_to_reclaim - nr_reclaimed,
- GFP_KERNEL, reclaim_options,
- &nodemask);
+ GFP_KERNEL, reclaim_options);
if (!reclaimed && !nr_retries--)
return -EAGAIN;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index aba991c505f1..546540bc770a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6757,8 +6757,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
unsigned long nr_pages,
gfp_t gfp_mask,
- unsigned int reclaim_options,
- nodemask_t *nodemask)
+ unsigned int reclaim_options)
{
unsigned long nr_reclaimed;
unsigned int noreclaim_flag;
@@ -6773,7 +6772,6 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
.may_unmap = 1,
.may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
.proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
- .nodemask = nodemask,
};
/*
* Traverse the ZONELIST_FALLBACK zonelist of the current node to put
--
2.30.2
--
Michal Hocko
SUSE Labs
^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: [PATCH] Revert "mm: add nodes= arg to memory.reclaim"
2022-12-16 9:54 ` [PATCH] Revert "mm: add nodes= arg to memory.reclaim" Michal Hocko
@ 2022-12-16 12:02 ` Mina Almasry
2022-12-16 12:22 ` Michal Hocko
2022-12-16 12:28 ` Bagas Sanjaya
[not found] ` <Y5xASNe1x8cusiTx-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2 siblings, 1 reply; 35+ messages in thread
From: Mina Almasry @ 2022-12-16 12:02 UTC (permalink / raw)
To: Michal Hocko
Cc: Andrew Morton, Tejun Heo, Zefan Li, Johannes Weiner,
Jonathan Corbet, Roman Gushchin, Shakeel Butt, Muchun Song,
Huang Ying, Yang Shi, Yosry Ahmed, weixugc, fvdl, bagasdotme,
cgroups, linux-doc, linux-kernel, linux-mm
On Fri, Dec 16, 2022 at 1:54 AM Michal Hocko <mhocko@suse.com> wrote:
>
> Andrew,
> I have noticed that the patch made it into Linus' tree already. Can we
> please revert it because the semantics are not really clear and we should
> really not create yet another user API maintenance problem. I am
> proposing to revert the nodemask extension for now before we grow any
> upstream users. Deeper in the email thread are some proposals on how to
> move forward with that.
There are proposals, many of which have been rejected due to not
addressing the motivating use cases, others that have been rejected
by fellow maintainers, and some that are awaiting feedback. No, there
is no other clear-cut way forward for this use case right now. I have
found the merged approach by far the most agreeable so far.
> ---
> From 7c5285f1725d5abfcae5548ab0d73be9ceded2a1 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Fri, 16 Dec 2022 10:46:33 +0100
> Subject: [PATCH] Revert "mm: add nodes= arg to memory.reclaim"
>
> This reverts commit 12a5d3955227b0d7e04fb793ccceeb2a1dd275c5.
>
> Although it is recognized that a finer grained pro-active reclaim is
> something we need and want, the semantics of this implementation are
> really ambiguous.
>
> From a follow up discussion it became clear that there are two essential
> usecases here. One is to use memory.reclaim to pro-actively reclaim
> memory and expectation is that the requested and reported amount of memory is
> uncharged from the memcg. Another usecase focuses on pro-active demotion
> when the memory is merely shuffled around to demotion targets while the
> overall charged memory stays unchanged.
>
> The current implementation considers demoted pages as reclaimed and that
> breaks both usecases.
I think you're making it sound like this specific patch broke both use
cases, and IMO that is not accurate. commit 3f1509c57b1b ("Revert
"mm/vmscan: never demote for memcg reclaim"") has been in the tree for
around 7 months now and that is the commit that enabled demotion in
memcg reclaim, and implicitly counted demoted pages as reclaimed in
memcg reclaim, which is the source of the ambiguity. Not the patch
that you are reverting here.
The irony I find with this revert is that this patch actually removes
the ambiguity and does not exacerbate it. Currently using
memory.reclaim _without_ the nodes= arg is ambiguous because demoted
pages count as reclaimed. On the other hand using memory.reclaim
_with_ the nodes= arg is completely unambiguous: the kernel will
demote-only from top tier nodes and reclaim-only from bottom tier
nodes.
> [1] has tried to address the reporting part but
> there are more issues with that summarized in [2] and follow up emails.
>
I am the one that put effort into resolving the ambiguity introduced
by commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
reclaim"") and proposed [1]. Reverting this patch does nothing to
resolve ambiguity that it did not introduce.
> Let's revert the nodemask based extension of the memcg pro-active
> reclaim for now until we settle with a more robust semantic.
>
I do not think we should revert this. It enables a couple of important
use cases for Google:
1. Enables us to specifically trigger proactive reclaim in a memcg on
a memory tiered system by specifying only the lower tiered nodes using
the nodes= arg.
2. Enables us to specifically trigger proactive demotion in a memcg on
a memory tiered system by specifying only the top tier nodes using the
nodes= arg.
Both use cases are broken with this revert, and no progress to resolve
the ambiguity is made with this revert.
I agree with Michal that there is ambiguity that has existed in the
kernel for about 7 months now and is introduced by commit 3f1509c57b1b
("Revert "mm/vmscan: never demote for memcg reclaim""), and I'm trying
to fix this ambiguity in [1]. I think we should move forward in fixing
the ambiguity through the review of the patch in [1] and not revert
patches that enable useful use-cases and did not introduce the
ambiguity.
> [1] http://lkml.kernel.org/r/http://lkml.kernel.org/r/20221206023406.3182800-1-almasrymina@google.com
Broken link. Actual link to my patch to fix the ambiguity:
[1] https://lore.kernel.org/linux-mm/20221206023406.3182800-1-almasrymina@google.com/
> [2] http://lkml.kernel.org/r/Y5bsmpCyeryu3Zz1@dhcp22.suse.cz
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
> Documentation/admin-guide/cgroup-v2.rst | 15 +++---
> include/linux/swap.h | 3 +-
> mm/memcontrol.c | 67 +++++--------------------
> mm/vmscan.c | 4 +-
> 4 files changed, 21 insertions(+), 68 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index c8ae7c897f14..74cec76be9f2 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1245,13 +1245,17 @@ PAGE_SIZE multiple when read back.
> This is a simple interface to trigger memory reclaim in the
> target cgroup.
>
> - This file accepts a string which contains the number of bytes to
> - reclaim.
> + This file accepts a single key, the number of bytes to reclaim.
> + No nested keys are currently supported.
>
> Example::
>
> echo "1G" > memory.reclaim
>
> + The interface can be later extended with nested keys to
> + configure the reclaim behavior. For example, specify the
> + type of memory to reclaim from (anon, file, ..).
> +
> Please note that the kernel can over or under reclaim from
> the target cgroup. If less bytes are reclaimed than the
> specified amount, -EAGAIN is returned.
> @@ -1263,13 +1267,6 @@ PAGE_SIZE multiple when read back.
> This means that the networking layer will not adapt based on
> reclaim induced by memory.reclaim.
>
> - This file also allows the user to specify the nodes to reclaim from,
> - via the 'nodes=' key, for example::
> -
> - echo "1G nodes=0,1" > memory.reclaim
> -
> - The above instructs the kernel to reclaim memory from nodes 0,1.
> -
> memory.peak
> A read-only single value file which exists on non-root
> cgroups.
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 2787b84eaf12..0ceed49516ad 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -418,8 +418,7 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> unsigned long nr_pages,
> gfp_t gfp_mask,
> - unsigned int reclaim_options,
> - nodemask_t *nodemask);
> + unsigned int reclaim_options);
> extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
> gfp_t gfp_mask, bool noswap,
> pg_data_t *pgdat,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ab457f0394ab..73afff8062f9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -63,7 +63,6 @@
> #include <linux/resume_user_mode.h>
> #include <linux/psi.h>
> #include <linux/seq_buf.h>
> -#include <linux/parser.h>
> #include "internal.h"
> #include <net/sock.h>
> #include <net/ip.h>
> @@ -2393,8 +2392,7 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
> psi_memstall_enter(&pflags);
> nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
> gfp_mask,
> - MEMCG_RECLAIM_MAY_SWAP,
> - NULL);
> + MEMCG_RECLAIM_MAY_SWAP);
> psi_memstall_leave(&pflags);
> } while ((memcg = parent_mem_cgroup(memcg)) &&
> !mem_cgroup_is_root(memcg));
> @@ -2685,8 +2683,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>
> psi_memstall_enter(&pflags);
> nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
> - gfp_mask, reclaim_options,
> - NULL);
> + gfp_mask, reclaim_options);
> psi_memstall_leave(&pflags);
>
> if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> @@ -3506,8 +3503,7 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
> }
>
> if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> - memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP,
> - NULL)) {
> + memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
> ret = -EBUSY;
> break;
> }
> @@ -3618,8 +3614,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
> return -EINTR;
>
> if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> - MEMCG_RECLAIM_MAY_SWAP,
> - NULL))
> + MEMCG_RECLAIM_MAY_SWAP))
> nr_retries--;
> }
>
> @@ -6429,8 +6424,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
> }
>
> reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
> - NULL);
> + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
>
> if (!reclaimed && !nr_retries--)
> break;
> @@ -6479,8 +6473,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
>
> if (nr_reclaims) {
> if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
> - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
> - NULL))
> + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
> nr_reclaims--;
> continue;
> }
> @@ -6603,54 +6596,21 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
> return nbytes;
> }
>
> -enum {
> - MEMORY_RECLAIM_NODES = 0,
> - MEMORY_RECLAIM_NULL,
> -};
> -
> -static const match_table_t if_tokens = {
> - { MEMORY_RECLAIM_NODES, "nodes=%s" },
> - { MEMORY_RECLAIM_NULL, NULL },
> -};
> -
> static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> size_t nbytes, loff_t off)
> {
> struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> unsigned long nr_to_reclaim, nr_reclaimed = 0;
> - unsigned int reclaim_options = MEMCG_RECLAIM_MAY_SWAP |
> - MEMCG_RECLAIM_PROACTIVE;
> - char *old_buf, *start;
> - substring_t args[MAX_OPT_ARGS];
> - int token;
> - char value[256];
> - nodemask_t nodemask = NODE_MASK_ALL;
> -
> - buf = strstrip(buf);
> -
> - old_buf = buf;
> - nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
> - if (buf == old_buf)
> - return -EINVAL;
> + unsigned int reclaim_options;
> + int err;
>
> buf = strstrip(buf);
> + err = page_counter_memparse(buf, "", &nr_to_reclaim);
> + if (err)
> + return err;
>
> - while ((start = strsep(&buf, " ")) != NULL) {
> - if (!strlen(start))
> - continue;
> - token = match_token(start, if_tokens, args);
> - match_strlcpy(value, args, sizeof(value));
> - switch (token) {
> - case MEMORY_RECLAIM_NODES:
> - if (nodelist_parse(value, nodemask) < 0)
> - return -EINVAL;
> - break;
> - default:
> - return -EINVAL;
> - }
> - }
> -
> + reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
> while (nr_reclaimed < nr_to_reclaim) {
> unsigned long reclaimed;
>
> @@ -6667,8 +6627,7 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
>
> reclaimed = try_to_free_mem_cgroup_pages(memcg,
> nr_to_reclaim - nr_reclaimed,
> - GFP_KERNEL, reclaim_options,
> - &nodemask);
> + GFP_KERNEL, reclaim_options);
>
> if (!reclaimed && !nr_retries--)
> return -EAGAIN;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index aba991c505f1..546540bc770a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -6757,8 +6757,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
> unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> unsigned long nr_pages,
> gfp_t gfp_mask,
> - unsigned int reclaim_options,
> - nodemask_t *nodemask)
> + unsigned int reclaim_options)
> {
> unsigned long nr_reclaimed;
> unsigned int noreclaim_flag;
> @@ -6773,7 +6772,6 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> .may_unmap = 1,
> .may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
> .proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
> - .nodemask = nodemask,
> };
> /*
> * Traverse the ZONELIST_FALLBACK zonelist of the current node to put
> --
> 2.30.2
>
> --
> Michal Hocko
> SUSE Labs
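For reference, the "SIZE [nodes=LIST]" syntax removed by the hunk above can be modeled in userspace terms. The Python sketch below is illustrative only: the kernel used memparse() and nodelist_parse(), and the node count here is assumed, not taken from any real system.

```python
# Rough userspace model of the "<size> [nodes=<list>]" syntax that the
# reverted code accepted in memory.reclaim. Not the kernel implementation.

def parse_reclaim_arg(buf, nr_node_ids=4):
    """Parse 'SIZE [nodes=N,M,...]' into (bytes, set of node ids)."""
    units = {"k": 1 << 10, "m": 1 << 20, "g": 1 << 30}
    parts = buf.strip().split()
    if not parts:
        raise ValueError("empty input")
    size = parts[0].lower()
    nr_bytes = int(size[:-1]) * units[size[-1]] if size[-1] in units else int(size)
    nodemask = set(range(nr_node_ids))          # default: NODE_MASK_ALL
    for tok in parts[1:]:
        key, _, value = tok.partition("=")
        if key != "nodes" or not value:
            raise ValueError(f"unrecognized token: {tok}")
        nodemask = {int(n) for n in value.split(",")}
        if any(n < 0 or n >= nr_node_ids for n in nodemask):
            raise ValueError("node id out of range")
    return nr_bytes, nodemask

assert parse_reclaim_arg("1m nodes=0,1") == (1 << 20, {0, 1})
assert parse_reclaim_arg("1G") == (1 << 30, {0, 1, 2, 3})
```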
* Re: [PATCH] Revert "mm: add nodes= arg to memory.reclaim"
2022-12-16 12:02 ` Mina Almasry
@ 2022-12-16 12:22 ` Michal Hocko
0 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2022-12-16 12:22 UTC (permalink / raw)
To: Mina Almasry
Cc: Andrew Morton, Tejun Heo, Zefan Li, Johannes Weiner,
Jonathan Corbet, Roman Gushchin, Shakeel Butt, Muchun Song,
Huang Ying, Yang Shi, Yosry Ahmed, weixugc, fvdl, bagasdotme,
cgroups, linux-doc, linux-kernel, linux-mm
On Fri 16-12-22 04:02:12, Mina Almasry wrote:
> On Fri, Dec 16, 2022 at 1:54 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > Andrew,
> > I have noticed that the patch made it into Linus tree already. Can we
> > please revert it because the semantic is not really clear and we should
> > really not create yet another user API maintenance problem. I am
> > proposing to revert the nodemask extension for now before we grow any
> > upstream users. Deeper in the email thread are some proposals how to
> > move forward with that.
>
> There are proposals, many of which have been rejected for not
> addressing the motivating use cases, others rejected by fellow
> maintainers, and some still awaiting feedback. No, there is no other
> clear-cut way forward for this use case right now. I have found the
> merged approach by far the most agreeable so far.
There is a clear need for further discussion, and until then we do not
want to expose an interface and create dependencies that will inevitably
make it hard to change the semantics later.
> > From 7c5285f1725d5abfcae5548ab0d73be9ceded2a1 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.com>
> > Date: Fri, 16 Dec 2022 10:46:33 +0100
> > Subject: [PATCH] Revert "mm: add nodes= arg to memory.reclaim"
> >
> > This reverts commit 12a5d3955227b0d7e04fb793ccceeb2a1dd275c5.
> >
> > Although it is recognized that a finer grained pro-active reclaim is
> > something we need and want the semantic of this implementation is really
> > ambiguous.
> >
> > From a follow-up discussion it became clear that there are two
> > essential use cases here. One is to use memory.reclaim to pro-actively
> > reclaim memory, with the expectation that the requested and reported
> > amount of memory is uncharged from the memcg. The other use case
> > focuses on pro-active demotion, where the memory is merely shuffled
> > around to demotion targets while the overall charged memory stays
> > unchanged.
> >
> > The current implementation considers demoted pages as reclaimed, and
> > that breaks both use cases.
>
> I think you're making it sound like this specific patch broke both use
> cases, and IMO that is not accurate. commit 3f1509c57b1b ("Revert
> "mm/vmscan: never demote for memcg reclaim"") has been in the tree for
> around 7 months now and that is the commit that enabled demotion in
> memcg reclaim, and implicitly counted demoted pages as reclaimed in
> memcg reclaim, which is the source of the ambiguity. Not the patch
> that you are reverting here.
>
> The irony I find with this revert is that this patch actually removes
> the ambiguity and does not exacerbate it. Currently using
> memory.reclaim _without_ the nodes= arg is ambiguous because demoted
> pages count as reclaimed. On the other hand using memory.reclaim
> _with_ the nodes= arg is completely unambiguous: the kernel will
> demote-only from top tier nodes and reclaim-only from bottom tier
> nodes.
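The demote-only / reclaim-only split described above can be sketched with a toy tier map. The node layout (0,1 top tier with demotion targets; 2,3 terminal second tier) is assumed for illustration and mirrors the example in the original patch description, not any real topology.

```python
# Toy model: pages scanned on a node that has a demotion target are
# demoted first; pages on terminal (bottom tier) nodes can only be
# reclaimed. Tier layout assumed for illustration.
DEMOTION_TARGET = {0: 2, 1: 3, 2: None, 3: None}

def action_for(node):
    """What memcg reclaim would attempt first for pages on `node`."""
    return "demote" if DEMOTION_TARGET[node] is not None else "reclaim"

assert [action_for(n) for n in (0, 1)] == ["demote", "demote"]    # nodes=0,1
assert [action_for(n) for n in (2, 3)] == ["reclaim", "reclaim"]  # nodes=2,3
```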
Yes, demoted pages are indeed counted as reclaimed, but that is not a
major issue because from the external point of view charges are getting
reclaimed. It is the nodes specification that makes the latent problem
much more obvious.
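The accounting distinction at the heart of the disagreement can be modeled minimally. The numbers and the helper below are illustrative, not kernel code: demotion moves a page to another node (the memcg charge is unchanged), while reclaim uncharges it, yet both are reported as "reclaimed" under the current semantics.

```python
# Minimal model of the demotion-counts-as-reclaim semantics: the figure
# reported to userspace no longer says how much the charge dropped.
def reclaim_report(charge, demoted, reclaimed):
    """Return (new_charge, reported) when demotions count as reclaim."""
    return charge - reclaimed, demoted + reclaimed

charge, reported = reclaim_report(charge=1000, demoted=600, reclaimed=400)
assert reported == 1000   # userspace is told 1000 pages were "reclaimed"
assert charge == 600      # but the charge only dropped by 400
```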
>
> > [1] has tried to address the reporting part but
> > there are more issues with that summarized in [2] and follow up emails.
> >
>
> I am the one that put effort into resolving the ambiguity introduced
> by commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> reclaim"") and proposed [1]. Reverting this patch does nothing to
> resolve ambiguity that it did not introduce.
>
> > Let's revert the nodemask based extension of the memcg pro-active
> > reclaim for now until we settle with a more robust semantic.
> >
>
> I do not think we should revert this. It enables a couple of important
> use cases for Google:
>
> 1. Enables us to specifically trigger proactive reclaim in a memcg on
> a memory tiered system by specifying only the lower tiered nodes using
> the nodes= arg.
> 2. Enabled us to specifically trigger proactive demotion in a memcg on
> a memory tiered system by specifying only the top tier nodes using the
> nodes= arg.
That is clear, and the aim of the revert is not to disallow those use
cases. We just need a clear and future-proof interface for that.
Changing the semantics after the fact is a no-go, hence the revert.
>
> Both use cases are broken with this revert, and no progress to resolve
> the ambiguity is made with this revert.
There cannot be any regression with the revert now because the code
hasn't shipped in a released kernel yet.
So let's remove the interface until we can agree on the exact semantics
and build the interface from there.
--
Michal Hocko
SUSE Labs
* Re: [PATCH] Revert "mm: add nodes= arg to memory.reclaim"
2022-12-16 9:54 ` [PATCH] Revert "mm: add nodes= arg to memory.reclaim" Michal Hocko
2022-12-16 12:02 ` Mina Almasry
@ 2022-12-16 12:28 ` Bagas Sanjaya
[not found] ` <Y5xASNe1x8cusiTx-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2 siblings, 0 replies; 35+ messages in thread
From: Bagas Sanjaya @ 2022-12-16 12:28 UTC (permalink / raw)
To: Michal Hocko, Mina Almasry, Andrew Morton
Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
Roman Gushchin, Shakeel Butt, Muchun Song, Huang Ying, Yang Shi,
Yosry Ahmed, weixugc, fvdl, cgroups, linux-doc, linux-kernel,
linux-mm
On 12/16/22 16:54, Michal Hocko wrote:
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index c8ae7c897f14..74cec76be9f2 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1245,13 +1245,17 @@ PAGE_SIZE multiple when read back.
> This is a simple interface to trigger memory reclaim in the
> target cgroup.
>
> - This file accepts a string which contains the number of bytes to
> - reclaim.
> + This file accepts a single key, the number of bytes to reclaim.
> + No nested keys are currently supported.
>
> Example::
>
> echo "1G" > memory.reclaim
>
> + The interface can be later extended with nested keys to
> + configure the reclaim behavior. For example, specify the
> + type of memory to reclaim from (anon, file, ..).
> +
> Please note that the kernel can over or under reclaim from
> the target cgroup. If less bytes are reclaimed than the
> specified amount, -EAGAIN is returned.
> @@ -1263,13 +1267,6 @@ PAGE_SIZE multiple when read back.
> This means that the networking layer will not adapt based on
> reclaim induced by memory.reclaim.
>
> - This file also allows the user to specify the nodes to reclaim from,
> - via the 'nodes=' key, for example::
> -
> - echo "1G nodes=0,1" > memory.reclaim
> -
> - The above instructs the kernel to reclaim memory from nodes 0,1.
> -
> memory.peak
> A read-only single value file which exists on non-root
> cgroups.
Ah! I forgot to add my Reviewed-by: tag when the original patch was
submitted. However, I was Cc'ed on the revert, presumably due to the Cc:
tag in the original.
In any case, for the documentation part:
Acked-by: Bagas Sanjaya <bagasdotme@gmail.com>
--
An old man doll... just what I always wanted! - Clara
* Re: [PATCH] Revert "mm: add nodes= arg to memory.reclaim"
[not found] ` <Y5xASNe1x8cusiTx-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2022-12-16 18:18 ` Andrew Morton
[not found] ` <20221216101820.3f4a370af2c93d3c2e78ed8a-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
0 siblings, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2022-12-16 18:18 UTC (permalink / raw)
To: Michal Hocko
Cc: Mina Almasry, Tejun Heo, Zefan Li, Johannes Weiner,
Jonathan Corbet, Roman Gushchin, Shakeel Butt, Muchun Song,
Huang Ying, Yang Shi, Yosry Ahmed, weixugc-hpIqsD4AKlfQT0dZR+AlfA,
fvdl-hpIqsD4AKlfQT0dZR+AlfA, bagasdotme-Re5JQEeQqe8AvxtiuMwx3w,
cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg
On Fri, 16 Dec 2022 10:54:16 +0100 Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote:
> I have noticed that the patch made it into Linus tree already. Can we
> please revert it because the semantic is not really clear and we should
> really not create yet another user API maintenance problem.
Well dang. I was waiting for the discussion to converge, blissfully
unaware that the thing was sitting in mm-stable :( I guess the
Acked-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
Acked-by: Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Acked-by: Muchun Song <songmuchun-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org>
fooled me.
I think it's a bit premature to revert at this stage. Possibly we can
get to the desired end state by modifying the existing code. Possibly
we can get to the desired end state by reverting this and by adding
something new.
If we can't get to the desired end state at all then yes, I'll send
Linus a revert of this patch later in this -rc cycle.