* [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion
@ 2024-09-20 22:11 kaiyang2
2024-09-20 22:11 ` [RFC PATCH 1/4] Add get_cgroup_local_usage for estimating the top-tier memory usage kaiyang2
` (5 more replies)
0 siblings, 6 replies; 13+ messages in thread
From: kaiyang2 @ 2024-09-20 22:11 UTC
To: linux-mm, cgroups
Cc: roman.gushchin, shakeel.butt, muchun.song, akpm, mhocko,
nehagholkar, abhishekd, hannes, weixugc, rientjes, Kaiyang Zhao
From: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
Currently in Linux, there is no concept of fairness in memory tiering. Depending
on the memory usage and access patterns of other colocated applications, an
application cannot be sure of how much memory in which tier it will get, and how
much its performance will suffer or benefit.
Fairness is, however, important in a multi-tenant system. For example, an
application may need to meet a certain tail latency requirement, which can be
difficult to satisfy without x amount of frequently accessed pages in top-tier
memory. Similarly, an application may want to declare a minimum throughput when
running on a system for capacity planning purposes, but without fairness
controls in memory tiering its throughput can fluctuate wildly as other
applications come and go on the system.
In this proposal, we amend the memory.low control in memcg to protect a cgroup’s
memory usage in top-tier memory. The low protection for top-tier memory is
memory.low scaled by the ratio of top-tier memory to total memory on the system.
This protection is then applied during reclaim of top-tier memory. Promotion by
NUMA balancing is also throttled, through a reduced scan size, when top-tier
memory is contended and the cgroup is over its protection.
Our experiments with microbenchmarks exhibiting a range of memory access
patterns and memory sizes confirmed that when top-tier memory is contended, the
system moves towards a stable memory distribution where each cgroup’s memory
usage in local DRAM converges to its protected amount.
One notable missing part in the patches is determining which NUMA nodes have
top-tier memory; currently they hardcode node 0 as top-tier memory and
node 1 as a CPU-less node backed by CXL memory. We’re working on removing
this artifact and applying the protection correctly to all top-tier nodes in
the system.
Your feedback is greatly appreciated!
Kaiyang Zhao (4):
Add get_cgroup_local_usage for estimating the top-tier memory usage
calculate memory.low for the local node and track its usage
use memory.low local node protection for local node reclaim
reduce NUMA balancing scan size of cgroups over their local memory.low
include/linux/memcontrol.h | 25 ++++++++-----
include/linux/page_counter.h | 16 ++++++---
kernel/sched/fair.c | 54 +++++++++++++++++++++++++---
mm/hugetlb_cgroup.c | 4 +--
mm/memcontrol.c | 68 ++++++++++++++++++++++++++++++------
mm/page_counter.c | 52 +++++++++++++++++++++------
mm/vmscan.c | 19 +++++++---
7 files changed, 192 insertions(+), 46 deletions(-)
--
2.43.0
* [RFC PATCH 1/4] Add get_cgroup_local_usage for estimating the top-tier memory usage
2024-09-20 22:11 [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion kaiyang2
@ 2024-09-20 22:11 ` kaiyang2
2024-09-20 22:11 ` [RFC PATCH 2/4] calculate memory.low for the local node and track its usage kaiyang2
` (4 subsequent siblings)
5 siblings, 0 replies; 13+ messages in thread
From: kaiyang2 @ 2024-09-20 22:11 UTC
To: linux-mm, cgroups
Cc: roman.gushchin, shakeel.butt, muchun.song, akpm, mhocko,
nehagholkar, abhishekd, hannes, weixugc, rientjes, Kaiyang Zhao
From: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
Approximate a cgroup's top-tier memory usage by summing its anon,
file, shmem and slab sizes on the top-tier node.
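As a rough userspace illustration of the estimate described above (not part of the patch), the sketch below sums page-granular counters and converts the byte-granular slab counters to pages. The struct, the function name and the page_size parameter are hypothetical; the kernel code reads these values from the lruvec of node 0 instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical snapshot of one cgroup's counters on the top-tier node. */
struct node_counters_sketch {
	unsigned long anon_pages;               /* in pages */
	unsigned long file_pages;               /* in pages */
	unsigned long shmem_pages;              /* in pages */
	unsigned long slab_reclaimable_bytes;   /* in bytes */
	unsigned long slab_unreclaimable_bytes; /* in bytes */
};

/* Returns the estimated top-tier usage in pages; 0 when no cgroup is given. */
unsigned long local_usage_pages_sketch(const struct node_counters_sketch *c,
				       unsigned long page_size)
{
	unsigned long slab_pages;

	if (c == NULL)
		return 0;
	assert(page_size > 0);

	/* Each slab counter is converted separately, so each one rounds down. */
	slab_pages = c->slab_reclaimable_bytes / page_size +
		     c->slab_unreclaimable_bytes / page_size;

	return c->anon_pages + c->file_pages + c->shmem_pages + slab_pages;
}
```

Because the slab counters are divided individually, partial pages are dropped per counter, which is one reason the result is only an approximation.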
Signed-off-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
---
include/linux/memcontrol.h | 2 ++
mm/memcontrol.c | 24 ++++++++++++++++++++++++
2 files changed, 26 insertions(+)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 34d2da05f2f1..94aba4498fca 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -648,6 +648,8 @@ static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
memcg == target;
}
+unsigned long get_cgroup_local_usage(struct mem_cgroup *memcg, bool flush);
+
static inline bool mem_cgroup_below_low(struct mem_cgroup *target,
struct mem_cgroup *memcg)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f19a58c252f0..20b715441332 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -855,6 +855,30 @@ unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
return READ_ONCE(memcg->vmstats->events_local[i]);
}
+/* Usage is in pages. */
+unsigned long get_cgroup_local_usage(struct mem_cgroup *memcg, bool flush)
+{
+ struct lruvec *lruvec;
+ const int local_nid = 0;
+
+ if (!memcg)
+ return 0;
+
+ if (flush)
+ mem_cgroup_flush_stats_ratelimited(memcg);
+
+ lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(local_nid));
+ unsigned long anon = lruvec_page_state(lruvec, NR_ANON_MAPPED);
+ unsigned long file = lruvec_page_state(lruvec, NR_FILE_PAGES);
+ unsigned long shmem = lruvec_page_state(lruvec, NR_SHMEM);
+ /* Slab sizes are in bytes. */
+ unsigned long slab =
+ lruvec_page_state(lruvec, NR_SLAB_RECLAIMABLE_B) / PAGE_SIZE
+ + lruvec_page_state(lruvec, NR_SLAB_UNRECLAIMABLE_B) / PAGE_SIZE;
+
+ return anon + file + shmem + slab;
+}
+
struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
{
/*
--
2.43.0
* [RFC PATCH 2/4] calculate memory.low for the local node and track its usage
2024-09-20 22:11 [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion kaiyang2
2024-09-20 22:11 ` [RFC PATCH 1/4] Add get_cgroup_local_usage for estimating the top-tier memory usage kaiyang2
@ 2024-09-20 22:11 ` kaiyang2
2024-09-21 23:18 ` kernel test robot
` (2 more replies)
2024-09-20 22:11 ` [RFC PATCH 3/4] use memory.low local node protection for local node reclaim kaiyang2
` (3 subsequent siblings)
5 siblings, 3 replies; 13+ messages in thread
From: kaiyang2 @ 2024-09-20 22:11 UTC
To: linux-mm, cgroups
Cc: roman.gushchin, shakeel.butt, muchun.song, akpm, mhocko,
nehagholkar, abhishekd, hannes, weixugc, rientjes, Kaiyang Zhao
From: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
Add a memory.low for the top-tier node (locallow) and track its usage.
locallow is set by scaling low by the ratio of node 0 capacity to the
combined capacity of node 0 and node 1.
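The scaling described above can be sketched in userspace C as follows. This is only an illustration under stated assumptions: the function name is hypothetical, capacities are passed in as parameters rather than read with si_meminfo_node(), and the widening to unsigned long long is an addition of this sketch to avoid overflow of the intermediate product.

```c
#include <assert.h>

/*
 * Scale a memory.low value (in pages) down to the share that falls on the
 * top-tier node: low * local / (local + remote).
 */
unsigned long scale_low_to_local_sketch(unsigned long low_pages,
					unsigned long local_capacity_pages,
					unsigned long remote_capacity_pages)
{
	unsigned long long total = (unsigned long long)local_capacity_pages +
				   remote_capacity_pages;

	/* A system with no memory on either node is not meaningful here. */
	assert(total > 0);

	return (unsigned long)((unsigned long long)low_pages *
			       local_capacity_pages / total);
}
```

For example, with 64 GiB on node 0, 192 GiB on node 1 and 4 KiB pages, a memory.low of 16 GiB (4194304 pages) would yield a locallow of 1048576 pages, i.e. 4 GiB.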
Signed-off-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
---
include/linux/page_counter.h | 16 ++++++++---
mm/hugetlb_cgroup.c | 4 +--
mm/memcontrol.c | 42 ++++++++++++++++++++++-------
mm/page_counter.c | 52 ++++++++++++++++++++++++++++--------
4 files changed, 88 insertions(+), 26 deletions(-)
diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index 79dbd8bc35a7..aa56c93415ef 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -13,6 +13,7 @@ struct page_counter {
* memcg->memory.usage is a hot member of struct mem_cgroup.
*/
atomic_long_t usage;
+ struct mem_cgroup *memcg; /* memcg that owns this counter */
CACHELINE_PADDING(_pad1_);
/* effective memory.min and memory.min usage tracking */
@@ -25,6 +26,10 @@ struct page_counter {
atomic_long_t low_usage;
atomic_long_t children_low_usage;
+ unsigned long elocallow;
+ atomic_long_t locallow_usage;
+ atomic_long_t children_locallow_usage;
+
unsigned long watermark;
/* Latest cg2 reset watermark */
unsigned long local_watermark;
@@ -36,6 +41,7 @@ struct page_counter {
bool protection_support;
unsigned long min;
unsigned long low;
+ unsigned long locallow;
unsigned long high;
unsigned long max;
struct page_counter *parent;
@@ -52,12 +58,13 @@ struct page_counter {
*/
static inline void page_counter_init(struct page_counter *counter,
struct page_counter *parent,
- bool protection_support)
+ bool protection_support, struct mem_cgroup *memcg)
{
counter->usage = (atomic_long_t)ATOMIC_LONG_INIT(0);
counter->max = PAGE_COUNTER_MAX;
counter->parent = parent;
counter->protection_support = protection_support;
+ counter->memcg = memcg;
}
static inline unsigned long page_counter_read(struct page_counter *counter)
@@ -72,7 +79,8 @@ bool page_counter_try_charge(struct page_counter *counter,
struct page_counter **fail);
void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages);
-void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages);
+void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages,
+ unsigned long nr_pages_local);
static inline void page_counter_set_high(struct page_counter *counter,
unsigned long nr_pages)
@@ -99,11 +107,11 @@ static inline void page_counter_reset_watermark(struct page_counter *counter)
#ifdef CONFIG_MEMCG
void page_counter_calculate_protection(struct page_counter *root,
struct page_counter *counter,
- bool recursive_protection);
+ bool recursive_protection, int is_local);
#else
static inline void page_counter_calculate_protection(struct page_counter *root,
struct page_counter *counter,
- bool recursive_protection) {}
+ bool recursive_protection, int is_local) {}
#endif
#endif /* _LINUX_PAGE_COUNTER_H */
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index d8d0e665caed..0e07a7a1d5b8 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -114,10 +114,10 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup,
}
page_counter_init(hugetlb_cgroup_counter_from_cgroup(h_cgroup,
idx),
- fault_parent, false);
+ fault_parent, false, NULL);
page_counter_init(
hugetlb_cgroup_counter_from_cgroup_rsvd(h_cgroup, idx),
- rsvd_parent, false);
+ rsvd_parent, false, NULL);
limit = round_down(PAGE_COUNTER_MAX,
pages_per_huge_page(&hstates[idx]));
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 20b715441332..d7c5fff12105 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1497,6 +1497,9 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
vm_event_name(memcg_vm_event_stat[i]),
memcg_events(memcg, memcg_vm_event_stat[i]));
}
+
+ seq_buf_printf(s, "local_usage %lu\n",
+ get_cgroup_local_usage(memcg, true));
}
static void memory_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
@@ -3597,8 +3600,8 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
if (parent) {
WRITE_ONCE(memcg->swappiness, mem_cgroup_swappiness(parent));
- page_counter_init(&memcg->memory, &parent->memory, true);
- page_counter_init(&memcg->swap, &parent->swap, false);
+ page_counter_init(&memcg->memory, &parent->memory, true, memcg);
+ page_counter_init(&memcg->swap, &parent->swap, false, NULL);
#ifdef CONFIG_MEMCG_V1
WRITE_ONCE(memcg->oom_kill_disable, READ_ONCE(parent->oom_kill_disable));
page_counter_init(&memcg->kmem, &parent->kmem, false);
@@ -3607,8 +3610,8 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
} else {
init_memcg_stats();
init_memcg_events();
- page_counter_init(&memcg->memory, NULL, true);
- page_counter_init(&memcg->swap, NULL, false);
+ page_counter_init(&memcg->memory, NULL, true, memcg);
+ page_counter_init(&memcg->swap, NULL, false, NULL);
#ifdef CONFIG_MEMCG_V1
page_counter_init(&memcg->kmem, NULL, false);
page_counter_init(&memcg->tcpmem, NULL, false);
@@ -3677,7 +3680,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
memcg1_css_offline(memcg);
page_counter_set_min(&memcg->memory, 0);
- page_counter_set_low(&memcg->memory, 0);
+ page_counter_set_low(&memcg->memory, 0, 0);
zswap_memcg_offline_cleanup(memcg);
@@ -3748,7 +3751,7 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
page_counter_set_max(&memcg->tcpmem, PAGE_COUNTER_MAX);
#endif
page_counter_set_min(&memcg->memory, 0);
- page_counter_set_low(&memcg->memory, 0);
+ page_counter_set_low(&memcg->memory, 0, 0);
page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX);
memcg1_soft_limit_reset(memcg);
page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
@@ -4051,6 +4054,12 @@ static ssize_t memory_min_write(struct kernfs_open_file *of,
return nbytes;
}
+static int memory_locallow_show(struct seq_file *m, void *v)
+{
+ return seq_puts_memcg_tunable(m,
+ READ_ONCE(mem_cgroup_from_seq(m)->memory.locallow));
+}
+
static int memory_low_show(struct seq_file *m, void *v)
{
return seq_puts_memcg_tunable(m,
@@ -4061,7 +4070,8 @@ static ssize_t memory_low_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
- unsigned long low;
+ struct sysinfo si;
+ unsigned long low, locallow, local_capacity, total_capacity;
int err;
buf = strstrip(buf);
@@ -4069,7 +4079,15 @@ static ssize_t memory_low_write(struct kernfs_open_file *of,
if (err)
return err;
- page_counter_set_low(&memcg->memory, low);
+ /* Hardcoded 0 for local node and 1 for remote. */
+ si_meminfo_node(&si, 0);
+ local_capacity = si.totalram; /* In pages. */
+ total_capacity = local_capacity;
+ si_meminfo_node(&si, 1);
+ total_capacity += si.totalram;
+ locallow = low * local_capacity / total_capacity;
+
+ page_counter_set_low(&memcg->memory, low, locallow);
return nbytes;
}
@@ -4394,6 +4412,11 @@ static struct cftype memory_files[] = {
.seq_show = memory_low_show,
.write = memory_low_write,
},
+ {
+ .name = "locallow",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = memory_locallow_show,
+ },
{
.name = "high",
.flags = CFTYPE_NOT_ON_ROOT,
@@ -4483,7 +4506,8 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
if (!root)
root = root_mem_cgroup;
- page_counter_calculate_protection(&root->memory, &memcg->memory, recursive_protection);
+ page_counter_calculate_protection(&root->memory, &memcg->memory,
+ recursive_protection, false);
}
static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
diff --git a/mm/page_counter.c b/mm/page_counter.c
index b249d15af9dd..97205aafab46 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -18,8 +18,10 @@ static bool track_protection(struct page_counter *c)
return c->protection_support;
}
+extern unsigned long get_cgroup_local_usage(struct mem_cgroup *memcg, bool flush);
+
static void propagate_protected_usage(struct page_counter *c,
- unsigned long usage)
+ unsigned long usage, unsigned long local_usage)
{
unsigned long protected, old_protected;
long delta;
@@ -44,6 +46,15 @@ static void propagate_protected_usage(struct page_counter *c,
if (delta)
atomic_long_add(delta, &c->parent->children_low_usage);
}
+
+ protected = min(local_usage, READ_ONCE(c->locallow));
+ old_protected = atomic_long_read(&c->locallow_usage);
+ if (protected != old_protected) {
+ old_protected = atomic_long_xchg(&c->locallow_usage, protected);
+ delta = protected - old_protected;
+ if (delta)
+ atomic_long_add(delta, &c->parent->children_locallow_usage);
+ }
}
/**
@@ -63,7 +74,8 @@ void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
atomic_long_set(&counter->usage, new);
}
if (track_protection(counter))
- propagate_protected_usage(counter, new);
+ propagate_protected_usage(counter, new,
+ get_cgroup_local_usage(counter->memcg, false));
}
/**
@@ -83,7 +95,8 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
new = atomic_long_add_return(nr_pages, &c->usage);
if (protection)
- propagate_protected_usage(c, new);
+ propagate_protected_usage(c, new,
+ get_cgroup_local_usage(counter->memcg, false));
/*
* This is indeed racy, but we can live with some
* inaccuracy in the watermark.
@@ -151,7 +164,8 @@ bool page_counter_try_charge(struct page_counter *counter,
goto failed;
}
if (protection)
- propagate_protected_usage(c, new);
+ propagate_protected_usage(c, new,
+ get_cgroup_local_usage(counter->memcg, false));
/* see comment on page_counter_charge */
if (new > READ_ONCE(c->local_watermark)) {
@@ -238,7 +252,8 @@ void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages)
WRITE_ONCE(counter->min, nr_pages);
for (c = counter; c; c = c->parent)
- propagate_protected_usage(c, atomic_long_read(&c->usage));
+ propagate_protected_usage(c, atomic_long_read(&c->usage),
+ get_cgroup_local_usage(counter->memcg, false));
}
/**
@@ -248,14 +263,17 @@ void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages)
*
* The caller must serialize invocations on the same counter.
*/
-void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages)
+void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages,
+ unsigned long nr_pages_local)
{
struct page_counter *c;
WRITE_ONCE(counter->low, nr_pages);
+ WRITE_ONCE(counter->locallow, nr_pages_local);
for (c = counter; c; c = c->parent)
- propagate_protected_usage(c, atomic_long_read(&c->usage));
+ propagate_protected_usage(c, atomic_long_read(&c->usage),
+ get_cgroup_local_usage(counter->memcg, false));
}
/**
@@ -421,9 +439,9 @@ static unsigned long effective_protection(unsigned long usage,
*/
void page_counter_calculate_protection(struct page_counter *root,
struct page_counter *counter,
- bool recursive_protection)
+ bool recursive_protection, int is_local)
{
- unsigned long usage, parent_usage;
+ unsigned long usage, parent_usage, local_usage, parent_local_usage;
struct page_counter *parent = counter->parent;
/*
@@ -437,16 +455,19 @@ void page_counter_calculate_protection(struct page_counter *root,
return;
usage = page_counter_read(counter);
- if (!usage)
+ local_usage = get_cgroup_local_usage(counter->memcg, true);
+ if (!usage || !local_usage)
return;
if (parent == root) {
counter->emin = READ_ONCE(counter->min);
counter->elow = READ_ONCE(counter->low);
+ counter->elocallow = READ_ONCE(counter->locallow);
return;
}
parent_usage = page_counter_read(parent);
+ parent_local_usage = get_cgroup_local_usage(parent->memcg, true);
WRITE_ONCE(counter->emin, effective_protection(usage, parent_usage,
READ_ONCE(counter->min),
@@ -454,7 +475,16 @@ void page_counter_calculate_protection(struct page_counter *root,
atomic_long_read(&parent->children_min_usage),
recursive_protection));
- WRITE_ONCE(counter->elow, effective_protection(usage, parent_usage,
+ if (is_local)
+ WRITE_ONCE(counter->elocallow,
+ effective_protection(local_usage, parent_local_usage,
+ READ_ONCE(counter->locallow),
+ READ_ONCE(parent->elocallow),
+ atomic_long_read(&parent->children_locallow_usage),
+ recursive_protection));
+ else
+ WRITE_ONCE(counter->elow,
+ effective_protection(usage, parent_usage,
READ_ONCE(counter->low),
READ_ONCE(parent->elow),
atomic_long_read(&parent->children_low_usage),
--
2.43.0
* [RFC PATCH 3/4] use memory.low local node protection for local node reclaim
2024-09-20 22:11 [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion kaiyang2
2024-09-20 22:11 ` [RFC PATCH 1/4] Add get_cgroup_local_usage for estimating the top-tier memory usage kaiyang2
2024-09-20 22:11 ` [RFC PATCH 2/4] calculate memory.low for the local node and track its usage kaiyang2
@ 2024-09-20 22:11 ` kaiyang2
2024-09-22 0:51 ` kernel test robot
` (2 more replies)
2024-09-20 22:11 ` [RFC PATCH 4/4] reduce NUMA balancing scan size of cgroups over their local memory.low kaiyang2
` (2 subsequent siblings)
5 siblings, 3 replies; 13+ messages in thread
From: kaiyang2 @ 2024-09-20 22:11 UTC
To: linux-mm, cgroups
Cc: roman.gushchin, shakeel.butt, muchun.song, akpm, mhocko,
nehagholkar, abhishekd, hannes, weixugc, rientjes, Kaiyang Zhao
From: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
When root memcg (global) reclaim targets the top-tier node, apply the
local memory.low protection instead of the global memory.low protection.
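The selection described above can be illustrated with the following userspace sketch. It is a simplification under stated assumptions: the struct and function names are hypothetical, node 0 is taken as the top-tier node as in the rest of this series, and the mem_cgroup_unprotected() check done by the kernel is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-cgroup view; all values are in pages. */
struct cgroup_prot_sketch {
	unsigned long elow;        /* effective memory.low */
	unsigned long elocallow;   /* effective top-tier memory.low */
	unsigned long usage;       /* total usage */
	unsigned long local_usage; /* estimated top-tier usage */
};

/*
 * Returns true when the cgroup is still within its soft protection.
 * The local protection is used only for root reclaim on node 0.
 */
bool below_low_sketch(const struct cgroup_prot_sketch *cg, int node_id,
		      bool root_reclaim)
{
	bool is_local = (node_id == 0) && root_reclaim;

	assert(cg != NULL);

	if (is_local)
		return cg->elocallow >= cg->local_usage;
	return cg->elow >= cg->usage;
}
```

With this split, a cgroup can be within its overall memory.low and still be considered over its protection on the top-tier node, which makes it a candidate for reclaim there.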
Signed-off-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
---
include/linux/memcontrol.h | 23 ++++++++++++++---------
mm/memcontrol.c | 4 ++--
mm/vmscan.c | 19 ++++++++++++++-----
3 files changed, 30 insertions(+), 16 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 94aba4498fca..256912b91922 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -586,9 +586,9 @@ static inline bool mem_cgroup_disabled(void)
static inline void mem_cgroup_protection(struct mem_cgroup *root,
struct mem_cgroup *memcg,
unsigned long *min,
- unsigned long *low)
+ unsigned long *low, unsigned long *locallow)
{
- *min = *low = 0;
+ *min = *low = *locallow = 0;
if (mem_cgroup_disabled())
return;
@@ -631,10 +631,11 @@ static inline void mem_cgroup_protection(struct mem_cgroup *root,
*min = READ_ONCE(memcg->memory.emin);
*low = READ_ONCE(memcg->memory.elow);
+ *locallow = READ_ONCE(memcg->memory.elocallow);
}
void mem_cgroup_calculate_protection(struct mem_cgroup *root,
- struct mem_cgroup *memcg);
+ struct mem_cgroup *memcg, int is_local);
static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
struct mem_cgroup *memcg)
@@ -651,13 +652,17 @@ static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
unsigned long get_cgroup_local_usage(struct mem_cgroup *memcg, bool flush);
static inline bool mem_cgroup_below_low(struct mem_cgroup *target,
- struct mem_cgroup *memcg)
+ struct mem_cgroup *memcg, int is_local)
{
if (mem_cgroup_unprotected(target, memcg))
return false;
- return READ_ONCE(memcg->memory.elow) >=
- page_counter_read(&memcg->memory);
+ if (is_local)
+ return READ_ONCE(memcg->memory.elocallow) >=
+ get_cgroup_local_usage(memcg, true);
+ else
+ return READ_ONCE(memcg->memory.elow) >=
+ page_counter_read(&memcg->memory);
}
static inline bool mem_cgroup_below_min(struct mem_cgroup *target,
@@ -1159,13 +1164,13 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
static inline void mem_cgroup_protection(struct mem_cgroup *root,
struct mem_cgroup *memcg,
unsigned long *min,
- unsigned long *low)
+ unsigned long *low, unsigned long *locallow)
{
*min = *low = 0;
}
static inline void mem_cgroup_calculate_protection(struct mem_cgroup *root,
- struct mem_cgroup *memcg)
+ struct mem_cgroup *memcg, int is_local)
{
}
@@ -1175,7 +1180,7 @@ static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
return true;
}
static inline bool mem_cgroup_below_low(struct mem_cgroup *target,
- struct mem_cgroup *memcg)
+ struct mem_cgroup *memcg, int is_local)
{
return false;
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d7c5fff12105..61718ba998fe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4495,7 +4495,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
* of a top-down tree iteration, not for isolated queries.
*/
void mem_cgroup_calculate_protection(struct mem_cgroup *root,
- struct mem_cgroup *memcg)
+ struct mem_cgroup *memcg, int is_local)
{
bool recursive_protection =
cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT;
@@ -4507,7 +4507,7 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
root = root_mem_cgroup;
page_counter_calculate_protection(&root->memory, &memcg->memory,
- recursive_protection, false);
+ recursive_protection, is_local);
}
static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ce471d686a88..a2681d52fc5f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2377,6 +2377,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
enum scan_balance scan_balance;
unsigned long ap, fp;
enum lru_list lru;
+ int is_local = (pgdat->node_id == 0) && root_reclaim(sc);
/* If we have no swap space, do not bother scanning anon folios. */
if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id, sc)) {
@@ -2457,12 +2458,14 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
for_each_evictable_lru(lru) {
bool file = is_file_lru(lru);
unsigned long lruvec_size;
- unsigned long low, min;
+ unsigned long low, min, locallow;
unsigned long scan;
lruvec_size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
mem_cgroup_protection(sc->target_mem_cgroup, memcg,
- &min, &low);
+ &min, &low, &locallow);
+ if (is_local)
+ low = locallow;
if (min || low) {
/*
@@ -2494,7 +2497,12 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
* again by how much of the total memory used is under
* hard protection.
*/
- unsigned long cgroup_size = mem_cgroup_size(memcg);
+ unsigned long cgroup_size;
+
+ if (is_local)
+ cgroup_size = get_cgroup_local_usage(memcg, true);
+ else
+ cgroup_size = mem_cgroup_size(memcg);
unsigned long protection;
/* memory.low scaling, make sure we retry before OOM */
@@ -5869,6 +5877,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
};
struct mem_cgroup_reclaim_cookie *partial = &reclaim;
struct mem_cgroup *memcg;
+ int is_local = (pgdat->node_id == 0) && root_reclaim(sc);
/*
* In most cases, direct reclaimers can do partial walks
@@ -5896,7 +5905,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
*/
cond_resched();
- mem_cgroup_calculate_protection(target_memcg, memcg);
+ mem_cgroup_calculate_protection(target_memcg, memcg, is_local);
if (mem_cgroup_below_min(target_memcg, memcg)) {
/*
@@ -5904,7 +5913,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
* If there is no reclaimable memory, OOM.
*/
continue;
- } else if (mem_cgroup_below_low(target_memcg, memcg)) {
+ } else if (mem_cgroup_below_low(target_memcg, memcg, is_local)) {
/*
* Soft protection.
* Respect the protection only as long as
--
2.43.0
* [RFC PATCH 4/4] reduce NUMA balancing scan size of cgroups over their local memory.low
2024-09-20 22:11 [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion kaiyang2
` (2 preceding siblings ...)
2024-09-20 22:11 ` [RFC PATCH 3/4] use memory.low local node protection for local node reclaim kaiyang2
@ 2024-09-20 22:11 ` kaiyang2
2024-10-11 20:51 ` [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion Kaiyang Zhao
2024-11-08 19:01 ` kaiyang2
5 siblings, 0 replies; 13+ messages in thread
From: kaiyang2 @ 2024-09-20 22:11 UTC
To: linux-mm, cgroups
Cc: roman.gushchin, shakeel.butt, muchun.song, akpm, mhocko,
nehagholkar, abhishekd, hannes, weixugc, rientjes, Kaiyang Zhao
From: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
When the top-tier node has less free memory than the promotion watermark,
reduce the scan size of cgroups that are over their local memory.low,
scaling it down as their overage grows. In this case, the top-tier memory
usage of the cgroup should be reduced, and demotion is working towards that
goal. A smaller scan size should slow the rate of promotion for
the cgroup so that it does not work against demotion.
A minimum of 1/16th of sysctl_numa_balancing_scan_size is still allowed
for such cgroups because identifying hot pages trapped in the slow tier is
still a worthy goal in this case (although a secondary objective).
The value 16 is arbitrary and may need tuning.
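The throttling described above can be sketched in userspace C as shown below. This is an illustration rather than the patch itself: the function and macro names are hypothetical, the caller is assumed to have already found the top-tier node below the promotion watermark, and the overflow assertion is an addition of this sketch.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical constant: the floor is 1/16th of the configured scan size. */
#define MIN_SCAN_FRACTION_SKETCH 16UL

/*
 * pages:       configured scan size, in pages
 * local_usage: cgroup's estimated top-tier usage, in pages
 * elocallow:   cgroup's effective top-tier protection, in pages
 */
unsigned long throttled_scan_pages_sketch(unsigned long pages,
					  unsigned long local_usage,
					  unsigned long elocallow)
{
	unsigned long min_scan_pages = pages / MIN_SCAN_FRACTION_SKETCH;
	int i;

	/* Sketch-only guards: keep the products and the +1 below from wrapping. */
	assert(elocallow == 0 || pages <= ULONG_MAX / elocallow);
	assert(local_usage < ULONG_MAX);

	/* Cgroups within their protection keep the full scan size. */
	if (local_usage <= elocallow)
		return pages;

	/* Multiply by (elocallow / local_usage) four times, i.e. 1/x^4. */
	for (i = 0; i < 4; i++)
		pages = pages * elocallow / (local_usage + 1);

	/* Never drop below the floor. */
	return pages > min_scan_pages ? pages : min_scan_pages;
}
```

With a 256 MB scan size and 4 KiB pages (65536 pages), a cgroup that is 10% over its protection keeps roughly 68% of the scan size, while a cgroup at twice its protection lands on the 4096-page floor.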
Signed-off-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
---
kernel/sched/fair.c | 54 ++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 49 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a1b756f927b2..1737b2369f56 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1727,14 +1727,21 @@ static inline bool cpupid_valid(int cpupid)
* advantage of fast memory capacity, all recently accessed slow
* memory pages will be migrated to fast memory node without
* considering hot threshold.
+ * This is also used to detect memory pressure and decide whether
+ * limiting the promotion scan size is needed, for which we don't require
+ * more free pages than the promo watermark.
*/
-static bool pgdat_free_space_enough(struct pglist_data *pgdat)
+static bool pgdat_free_space_enough(struct pglist_data *pgdat,
+ bool require_extra)
{
int z;
unsigned long enough_wmark;
- enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT,
- pgdat->node_present_pages >> 4);
+ if (require_extra)
+ enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT,
+ pgdat->node_present_pages >> 4);
+ else
+ enough_wmark = 0;
for (z = pgdat->nr_zones - 1; z >= 0; z--) {
struct zone *zone = pgdat->node_zones + z;
@@ -1846,7 +1853,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
unsigned int latency, th, def_th;
pgdat = NODE_DATA(dst_nid);
- if (pgdat_free_space_enough(pgdat)) {
+ if (pgdat_free_space_enough(pgdat, true)) {
/* workload changed, reset hot threshold */
pgdat->nbp_threshold = 0;
return true;
@@ -3214,10 +3221,14 @@ static void task_numa_work(struct callback_head *work)
struct vm_area_struct *vma;
unsigned long start, end;
unsigned long nr_pte_updates = 0;
- long pages, virtpages;
+ long pages, virtpages, min_scan_pages;
struct vma_iterator vmi;
bool vma_pids_skipped;
bool vma_pids_forced = false;
+ struct pglist_data *pgdat = NODE_DATA(0); /* hardcoded node 0 */
+ struct mem_cgroup *memcg;
+ unsigned long cgroup_size, cgroup_locallow;
+ const long min_scan_pages_fraction = 16; /* 1/16th of the scan size */
SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));
@@ -3262,6 +3273,39 @@ static void task_numa_work(struct callback_head *work)
pages = sysctl_numa_balancing_scan_size;
pages <<= 20 - PAGE_SHIFT; /* MB in pages */
+
+ min_scan_pages = pages;
+ min_scan_pages /= min_scan_pages_fraction;
+
+ memcg = get_mem_cgroup_from_current();
+ /*
+ * Reduce the scan size when the local node is under pressure
+ * (WMARK_PROMO is not satisfied),
+ * proportional to a cgroup's overage of local memory guarantee.
+ * 10% over: 68% of scan size
+ * 20% over: 48% of scan size
+ * 50% over: 20% of scan size
+ * 100% over: 6% of scan size
+ */
+ if (likely(memcg)) {
+ if (!pgdat_free_space_enough(pgdat, false)) {
+ cgroup_size = get_cgroup_local_usage(memcg, false);
+ /*
+ * Protection needs refreshing, but reclaim on the cgroup
+ * should have refreshed recently.
+ */
+ cgroup_locallow = READ_ONCE(memcg->memory.elocallow);
+ if (cgroup_size > cgroup_locallow) {
+ /* 1/x^4 */
+ for (int i = 0; i < 4; i++)
+ pages = pages * cgroup_locallow / (cgroup_size + 1);
+ /* Lower bound to min_scan_pages. */
+ pages = max(pages, min_scan_pages);
+ }
+ }
+ css_put(&memcg->css);
+ }
+
virtpages = pages * 8; /* Scan up to this much virtual space */
if (!pages)
return;
--
2.43.0
* Re: [RFC PATCH 2/4] calculate memory.low for the local node and track its usage
2024-09-20 22:11 ` [RFC PATCH 2/4] calculate memory.low for the local node and track its usage kaiyang2
@ 2024-09-21 23:18 ` kernel test robot
2024-09-22 8:39 ` kernel test robot
2024-10-15 22:05 ` Gregory Price
2 siblings, 0 replies; 13+ messages in thread
From: kernel test robot @ 2024-09-21 23:18 UTC
To: kaiyang2; +Cc: oe-kbuild-all
Hi,
[This is a private test report for your RFC patch.]
kernel test robot noticed the following build warnings:
[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on linus/master next-20240920]
[cannot apply to tip/sched/core v6.11]
[If your patch is applied to the wrong git tree, kindly drop us a note.
When submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/kaiyang2-cs-cmu-edu/Add-get_cgroup_local_usage-for-estimating-the-top-tier-memory-usage/20240921-061404
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20240920221202.1734227-3-kaiyang2%40cs.cmu.edu
patch subject: [RFC PATCH 2/4] calculate memory.low for the local node and track its usage
config: x86_64-rhel-8.3 (https://download.01.org/0day-ci/archive/20240922/202409220804.TAoLKEBm-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240922/202409220804.TAoLKEBm-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags:
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409220804.TAoLKEBm-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> mm/page_counter.c:268: warning: Function parameter or struct member 'nr_pages_local' not described in 'page_counter_set_low'
>> mm/page_counter.c:443: warning: Function parameter or struct member 'is_local' not described in 'page_counter_calculate_protection'
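Both warnings above are kernel-doc complaints: a W=1 build checks that every parameter of a function with a /** ... */ comment has a matching @name: line. A minimal sketch of the shape of the fix follows. struct page_counter_model and page_counter_set_low_model() are hypothetical userspace stand-ins, and the wording of the new @nr_pages_local line is an assumption about what the parameter means, not text from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page_counter; only two fields. */
struct page_counter_model {
	unsigned long low;
	unsigned long locallow;
};

/**
 * page_counter_set_low_model - set the amount of protected memory
 * @counter: counter
 * @nr_pages: value to set
 * @nr_pages_local: value to set for the local node (wording assumed)
 *
 * The caller must serialize invocations on the same counter.
 */
static void page_counter_set_low_model(struct page_counter_model *counter,
				       unsigned long nr_pages,
				       unsigned long nr_pages_local)
{
	assert(counter != NULL);
	counter->low = nr_pages;
	counter->locallow = nr_pages_local;
}
```

The second warning would presumably be silenced the same way, by adding an @is_local: line to the comment above page_counter_calculate_protection().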
vim +268 mm/page_counter.c
bf8d5d52ffe89a Roman Gushchin 2018-06-07 258
230671533d6463 Roman Gushchin 2018-06-07 259 /**
230671533d6463 Roman Gushchin 2018-06-07 260 * page_counter_set_low - set the amount of protected memory
230671533d6463 Roman Gushchin 2018-06-07 261 * @counter: counter
230671533d6463 Roman Gushchin 2018-06-07 262 * @nr_pages: value to set
230671533d6463 Roman Gushchin 2018-06-07 263 *
230671533d6463 Roman Gushchin 2018-06-07 264 * The caller must serialize invocations on the same counter.
230671533d6463 Roman Gushchin 2018-06-07 265 */
6f4c005a5f8b8f Kaiyang Zhao 2024-09-20 266 void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages,
6f4c005a5f8b8f Kaiyang Zhao 2024-09-20 267 unsigned long nr_pages_local)
230671533d6463 Roman Gushchin 2018-06-07 @268 {
230671533d6463 Roman Gushchin 2018-06-07 269 struct page_counter *c;
230671533d6463 Roman Gushchin 2018-06-07 270
f86b810c2610b0 Chris Down 2020-04-01 271 WRITE_ONCE(counter->low, nr_pages);
6f4c005a5f8b8f Kaiyang Zhao 2024-09-20 272 WRITE_ONCE(counter->locallow, nr_pages_local);
230671533d6463 Roman Gushchin 2018-06-07 273
230671533d6463 Roman Gushchin 2018-06-07 274 for (c = counter; c; c = c->parent)
6f4c005a5f8b8f Kaiyang Zhao 2024-09-20 275 propagate_protected_usage(c, atomic_long_read(&c->usage),
6f4c005a5f8b8f Kaiyang Zhao 2024-09-20 276 get_cgroup_local_usage(counter->memcg, false));
230671533d6463 Roman Gushchin 2018-06-07 277 }
230671533d6463 Roman Gushchin 2018-06-07 278
3e32cb2e0a12b6 Johannes Weiner 2014-12-10 279 /**
3e32cb2e0a12b6 Johannes Weiner 2014-12-10 280 * page_counter_memparse - memparse() for page counter limits
3e32cb2e0a12b6 Johannes Weiner 2014-12-10 281 * @buf: string to parse
650c5e565492f9 Johannes Weiner 2015-02-11 282 * @max: string meaning maximum possible value
3e32cb2e0a12b6 Johannes Weiner 2014-12-10 283 * @nr_pages: returns the result in number of pages
3e32cb2e0a12b6 Johannes Weiner 2014-12-10 284 *
3e32cb2e0a12b6 Johannes Weiner 2014-12-10 285 * Returns -EINVAL, or 0 and @nr_pages on success. @nr_pages will be
3e32cb2e0a12b6 Johannes Weiner 2014-12-10 286 * limited to %PAGE_COUNTER_MAX.
3e32cb2e0a12b6 Johannes Weiner 2014-12-10 287 */
650c5e565492f9 Johannes Weiner 2015-02-11 288 int page_counter_memparse(const char *buf, const char *max,
650c5e565492f9 Johannes Weiner 2015-02-11 289 unsigned long *nr_pages)
3e32cb2e0a12b6 Johannes Weiner 2014-12-10 290 {
3e32cb2e0a12b6 Johannes Weiner 2014-12-10 291 char *end;
3e32cb2e0a12b6 Johannes Weiner 2014-12-10 292 u64 bytes;
3e32cb2e0a12b6 Johannes Weiner 2014-12-10 293
650c5e565492f9 Johannes Weiner 2015-02-11 294 if (!strcmp(buf, max)) {
3e32cb2e0a12b6 Johannes Weiner 2014-12-10 295 *nr_pages = PAGE_COUNTER_MAX;
3e32cb2e0a12b6 Johannes Weiner 2014-12-10 296 return 0;
3e32cb2e0a12b6 Johannes Weiner 2014-12-10 297 }
3e32cb2e0a12b6 Johannes Weiner 2014-12-10 298
3e32cb2e0a12b6 Johannes Weiner 2014-12-10 299 bytes = memparse(buf, &end);
3e32cb2e0a12b6 Johannes Weiner 2014-12-10 300 if (*end != '\0')
3e32cb2e0a12b6 Johannes Weiner 2014-12-10 301 return -EINVAL;
3e32cb2e0a12b6 Johannes Weiner 2014-12-10 302
3e32cb2e0a12b6 Johannes Weiner 2014-12-10 303 *nr_pages = min(bytes / PAGE_SIZE, (u64)PAGE_COUNTER_MAX);
3e32cb2e0a12b6 Johannes Weiner 2014-12-10 304
3e32cb2e0a12b6 Johannes Weiner 2014-12-10 305 return 0;
3e32cb2e0a12b6 Johannes Weiner 2014-12-10 306 }
a8585ac6862198 Maarten Lankhorst 2024-07-03 307
a8585ac6862198 Maarten Lankhorst 2024-07-03 308
941ce635234162 Roman Gushchin 2024-07-26 309 #ifdef CONFIG_MEMCG
a8585ac6862198 Maarten Lankhorst 2024-07-03 310 /*
a8585ac6862198 Maarten Lankhorst 2024-07-03 311 * This function calculates an individual page counter's effective
a8585ac6862198 Maarten Lankhorst 2024-07-03 312 * protection which is derived from its own memory.min/low, its
a8585ac6862198 Maarten Lankhorst 2024-07-03 313 * parent's and siblings' settings, as well as the actual memory
a8585ac6862198 Maarten Lankhorst 2024-07-03 314 * distribution in the tree.
a8585ac6862198 Maarten Lankhorst 2024-07-03 315 *
a8585ac6862198 Maarten Lankhorst 2024-07-03 316 * The following rules apply to the effective protection values:
a8585ac6862198 Maarten Lankhorst 2024-07-03 317 *
a8585ac6862198 Maarten Lankhorst 2024-07-03 318 * 1. At the first level of reclaim, effective protection is equal to
a8585ac6862198 Maarten Lankhorst 2024-07-03 319 * the declared protection in memory.min and memory.low.
a8585ac6862198 Maarten Lankhorst 2024-07-03 320 *
a8585ac6862198 Maarten Lankhorst 2024-07-03 321 * 2. To enable safe delegation of the protection configuration, at
a8585ac6862198 Maarten Lankhorst 2024-07-03 322 * subsequent levels the effective protection is capped to the
a8585ac6862198 Maarten Lankhorst 2024-07-03 323 * parent's effective protection.
a8585ac6862198 Maarten Lankhorst 2024-07-03 324 *
a8585ac6862198 Maarten Lankhorst 2024-07-03 325 * 3. To make complex and dynamic subtrees easier to configure, the
a8585ac6862198 Maarten Lankhorst 2024-07-03 326 * user is allowed to overcommit the declared protection at a given
a8585ac6862198 Maarten Lankhorst 2024-07-03 327 * level. If that is the case, the parent's effective protection is
a8585ac6862198 Maarten Lankhorst 2024-07-03 328 * distributed to the children in proportion to how much protection
a8585ac6862198 Maarten Lankhorst 2024-07-03 329 * they have declared and how much of it they are utilizing.
a8585ac6862198 Maarten Lankhorst 2024-07-03 330 *
a8585ac6862198 Maarten Lankhorst 2024-07-03 331 * This makes distribution proportional, but also work-conserving:
a8585ac6862198 Maarten Lankhorst 2024-07-03 332 * if one counter claims much more protection than it uses memory,
a8585ac6862198 Maarten Lankhorst 2024-07-03 333 * the unused remainder is available to its siblings.
a8585ac6862198 Maarten Lankhorst 2024-07-03 334 *
a8585ac6862198 Maarten Lankhorst 2024-07-03 335 * 4. Conversely, when the declared protection is undercommitted at a
a8585ac6862198 Maarten Lankhorst 2024-07-03 336 * given level, the distribution of the larger parental protection
a8585ac6862198 Maarten Lankhorst 2024-07-03 337 * budget is NOT proportional. A counter's protection from a sibling
a8585ac6862198 Maarten Lankhorst 2024-07-03 338 * is capped to its own memory.min/low setting.
a8585ac6862198 Maarten Lankhorst 2024-07-03 339 *
a8585ac6862198 Maarten Lankhorst 2024-07-03 340 * 5. However, to allow protecting recursive subtrees from each other
a8585ac6862198 Maarten Lankhorst 2024-07-03 341 * without having to declare each individual counter's fixed share
a8585ac6862198 Maarten Lankhorst 2024-07-03 342 * of the ancestor's claim to protection, any unutilized -
a8585ac6862198 Maarten Lankhorst 2024-07-03 343 * "floating" - protection from up the tree is distributed in
a8585ac6862198 Maarten Lankhorst 2024-07-03 344 * proportion to each counter's *usage*. This makes the protection
a8585ac6862198 Maarten Lankhorst 2024-07-03 345 * neutral wrt sibling cgroups and lets them compete freely over
a8585ac6862198 Maarten Lankhorst 2024-07-03 346 * the shared parental protection budget, but it protects the
a8585ac6862198 Maarten Lankhorst 2024-07-03 347 * subtree as a whole from neighboring subtrees.
a8585ac6862198 Maarten Lankhorst 2024-07-03 348 *
a8585ac6862198 Maarten Lankhorst 2024-07-03 349 * Note that 4. and 5. are not in conflict: 4. is about protecting
a8585ac6862198 Maarten Lankhorst 2024-07-03 350 * against immediate siblings whereas 5. is about protecting against
a8585ac6862198 Maarten Lankhorst 2024-07-03 351 * neighboring subtrees.
a8585ac6862198 Maarten Lankhorst 2024-07-03 352 */
a8585ac6862198 Maarten Lankhorst 2024-07-03 353 static unsigned long effective_protection(unsigned long usage,
a8585ac6862198 Maarten Lankhorst 2024-07-03 354 unsigned long parent_usage,
a8585ac6862198 Maarten Lankhorst 2024-07-03 355 unsigned long setting,
a8585ac6862198 Maarten Lankhorst 2024-07-03 356 unsigned long parent_effective,
a8585ac6862198 Maarten Lankhorst 2024-07-03 357 unsigned long siblings_protected,
a8585ac6862198 Maarten Lankhorst 2024-07-03 358 bool recursive_protection)
a8585ac6862198 Maarten Lankhorst 2024-07-03 359 {
a8585ac6862198 Maarten Lankhorst 2024-07-03 360 unsigned long protected;
a8585ac6862198 Maarten Lankhorst 2024-07-03 361 unsigned long ep;
a8585ac6862198 Maarten Lankhorst 2024-07-03 362
a8585ac6862198 Maarten Lankhorst 2024-07-03 363 protected = min(usage, setting);
a8585ac6862198 Maarten Lankhorst 2024-07-03 364 /*
a8585ac6862198 Maarten Lankhorst 2024-07-03 365 * If all cgroups at this level combined claim and use more
a8585ac6862198 Maarten Lankhorst 2024-07-03 366 * protection than what the parent affords them, distribute
a8585ac6862198 Maarten Lankhorst 2024-07-03 367 * shares in proportion to utilization.
a8585ac6862198 Maarten Lankhorst 2024-07-03 368 *
a8585ac6862198 Maarten Lankhorst 2024-07-03 369 * We are using actual utilization rather than the statically
a8585ac6862198 Maarten Lankhorst 2024-07-03 370 * claimed protection in order to be work-conserving: claimed
a8585ac6862198 Maarten Lankhorst 2024-07-03 371 * but unused protection is available to siblings that would
a8585ac6862198 Maarten Lankhorst 2024-07-03 372 * otherwise get a smaller chunk than what they claimed.
a8585ac6862198 Maarten Lankhorst 2024-07-03 373 */
a8585ac6862198 Maarten Lankhorst 2024-07-03 374 if (siblings_protected > parent_effective)
a8585ac6862198 Maarten Lankhorst 2024-07-03 375 return protected * parent_effective / siblings_protected;
a8585ac6862198 Maarten Lankhorst 2024-07-03 376
a8585ac6862198 Maarten Lankhorst 2024-07-03 377 /*
a8585ac6862198 Maarten Lankhorst 2024-07-03 378 * Ok, utilized protection of all children is within what the
a8585ac6862198 Maarten Lankhorst 2024-07-03 379 * parent affords them, so we know whatever this child claims
a8585ac6862198 Maarten Lankhorst 2024-07-03 380 * and utilizes is effectively protected.
a8585ac6862198 Maarten Lankhorst 2024-07-03 381 *
a8585ac6862198 Maarten Lankhorst 2024-07-03 382 * If there is unprotected usage beyond this value, reclaim
a8585ac6862198 Maarten Lankhorst 2024-07-03 383 * will apply pressure in proportion to that amount.
a8585ac6862198 Maarten Lankhorst 2024-07-03 384 *
a8585ac6862198 Maarten Lankhorst 2024-07-03 385 * If there is unutilized protection, the cgroup will be fully
a8585ac6862198 Maarten Lankhorst 2024-07-03 386 * shielded from reclaim, but we do return a smaller value for
a8585ac6862198 Maarten Lankhorst 2024-07-03 387 * protection than what the group could enjoy in theory. This
a8585ac6862198 Maarten Lankhorst 2024-07-03 388 * is okay. With the overcommit distribution above, effective
a8585ac6862198 Maarten Lankhorst 2024-07-03 389 * protection is always dependent on how memory is actually
a8585ac6862198 Maarten Lankhorst 2024-07-03 390 * consumed among the siblings anyway.
a8585ac6862198 Maarten Lankhorst 2024-07-03 391 */
a8585ac6862198 Maarten Lankhorst 2024-07-03 392 ep = protected;
a8585ac6862198 Maarten Lankhorst 2024-07-03 393
a8585ac6862198 Maarten Lankhorst 2024-07-03 394 /*
a8585ac6862198 Maarten Lankhorst 2024-07-03 395 * If the children aren't claiming (all of) the protection
a8585ac6862198 Maarten Lankhorst 2024-07-03 396 * afforded to them by the parent, distribute the remainder in
a8585ac6862198 Maarten Lankhorst 2024-07-03 397 * proportion to the (unprotected) memory of each cgroup. That
a8585ac6862198 Maarten Lankhorst 2024-07-03 398 * way, cgroups that aren't explicitly prioritized wrt each
a8585ac6862198 Maarten Lankhorst 2024-07-03 399 * other compete freely over the allowance, but they are
a8585ac6862198 Maarten Lankhorst 2024-07-03 400 * collectively protected from neighboring trees.
a8585ac6862198 Maarten Lankhorst 2024-07-03 401 *
a8585ac6862198 Maarten Lankhorst 2024-07-03 402 * We're using unprotected memory for the weight so that if
a8585ac6862198 Maarten Lankhorst 2024-07-03 403 * some cgroups DO claim explicit protection, we don't protect
a8585ac6862198 Maarten Lankhorst 2024-07-03 404 * the same bytes twice.
a8585ac6862198 Maarten Lankhorst 2024-07-03 405 *
a8585ac6862198 Maarten Lankhorst 2024-07-03 406 * Check both usage and parent_usage against the respective
a8585ac6862198 Maarten Lankhorst 2024-07-03 407 * protected values. One should imply the other, but they
a8585ac6862198 Maarten Lankhorst 2024-07-03 408 * aren't read atomically - make sure the division is sane.
a8585ac6862198 Maarten Lankhorst 2024-07-03 409 */
a8585ac6862198 Maarten Lankhorst 2024-07-03 410 if (!recursive_protection)
a8585ac6862198 Maarten Lankhorst 2024-07-03 411 return ep;
a8585ac6862198 Maarten Lankhorst 2024-07-03 412
a8585ac6862198 Maarten Lankhorst 2024-07-03 413 if (parent_effective > siblings_protected &&
a8585ac6862198 Maarten Lankhorst 2024-07-03 414 parent_usage > siblings_protected &&
a8585ac6862198 Maarten Lankhorst 2024-07-03 415 usage > protected) {
a8585ac6862198 Maarten Lankhorst 2024-07-03 416 unsigned long unclaimed;
a8585ac6862198 Maarten Lankhorst 2024-07-03 417
a8585ac6862198 Maarten Lankhorst 2024-07-03 418 unclaimed = parent_effective - siblings_protected;
a8585ac6862198 Maarten Lankhorst 2024-07-03 419 unclaimed *= usage - protected;
a8585ac6862198 Maarten Lankhorst 2024-07-03 420 unclaimed /= parent_usage - siblings_protected;
a8585ac6862198 Maarten Lankhorst 2024-07-03 421
a8585ac6862198 Maarten Lankhorst 2024-07-03 422 ep += unclaimed;
a8585ac6862198 Maarten Lankhorst 2024-07-03 423 }
a8585ac6862198 Maarten Lankhorst 2024-07-03 424
a8585ac6862198 Maarten Lankhorst 2024-07-03 425 return ep;
a8585ac6862198 Maarten Lankhorst 2024-07-03 426 }
a8585ac6862198 Maarten Lankhorst 2024-07-03 427
a8585ac6862198 Maarten Lankhorst 2024-07-03 428
a8585ac6862198 Maarten Lankhorst 2024-07-03 429 /**
a8585ac6862198 Maarten Lankhorst 2024-07-03 430 * page_counter_calculate_protection - check if memory consumption is in the normal range
a8585ac6862198 Maarten Lankhorst 2024-07-03 431 * @root: the top ancestor of the sub-tree being checked
a8585ac6862198 Maarten Lankhorst 2024-07-03 432 * @counter: the page_counter to update
a8585ac6862198 Maarten Lankhorst 2024-07-03 433 * @recursive_protection: Whether to use memory_recursiveprot behavior.
a8585ac6862198 Maarten Lankhorst 2024-07-03 434 *
a8585ac6862198 Maarten Lankhorst 2024-07-03 435 * Calculates elow/emin thresholds for given page_counter.
a8585ac6862198 Maarten Lankhorst 2024-07-03 436 *
a8585ac6862198 Maarten Lankhorst 2024-07-03 437 * WARNING: This function is not stateless! It can only be used as part
a8585ac6862198 Maarten Lankhorst 2024-07-03 438 * of a top-down tree iteration, not for isolated queries.
a8585ac6862198 Maarten Lankhorst 2024-07-03 439 */
a8585ac6862198 Maarten Lankhorst 2024-07-03 440 void page_counter_calculate_protection(struct page_counter *root,
a8585ac6862198 Maarten Lankhorst 2024-07-03 441 struct page_counter *counter,
6f4c005a5f8b8f Kaiyang Zhao 2024-09-20 442 bool recursive_protection, int is_local)
a8585ac6862198 Maarten Lankhorst 2024-07-03 @443 {
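As a worked example of the rules in the effective_protection() comment quoted above, the function can be ported to userspace and fed small numbers. This is a sketch: effective_protection_model() copies the logic shown in the listing, and the inputs are made up for illustration. siblings_protected is taken here to be the sum of min(usage, setting) over the children at that level, which is an assumption about how the caller computes it.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace copy of the logic in the listing; values are in pages. */
static unsigned long effective_protection_model(unsigned long usage,
						unsigned long parent_usage,
						unsigned long setting,
						unsigned long parent_effective,
						unsigned long siblings_protected,
						bool recursive_protection)
{
	unsigned long protected = usage < setting ? usage : setting;
	unsigned long ep;

	/* Overcommit: share the parent's budget in proportion to use. */
	if (siblings_protected > parent_effective)
		return protected * parent_effective / siblings_protected;

	ep = protected;
	if (!recursive_protection)
		return ep;

	/* Undercommit: hand out the unclaimed rest by unprotected usage. */
	if (parent_effective > siblings_protected &&
	    parent_usage > siblings_protected &&
	    usage > protected) {
		unsigned long unclaimed;

		unclaimed = parent_effective - siblings_protected;
		unclaimed *= usage - protected;
		unclaimed /= parent_usage - siblings_protected;
		ep += unclaimed;
	}
	return ep;
}

static void effective_protection_examples(void)
{
	/* Rule 3: parent affords 100, children protect 80 + 120 = 200. */
	assert(effective_protection_model(80, 200, 90, 100, 200, false) == 40);
	assert(effective_protection_model(120, 200, 150, 100, 200, false) == 60);

	/* Rule 4: parent affords 300, children protect 100 + 0 = 100. */
	assert(effective_protection_model(150, 200, 100, 300, 100, false) == 100);

	/* Rule 5: the floating 200 is split by unprotected usage, 50 : 50. */
	assert(effective_protection_model(150, 200, 100, 300, 100, true) == 200);
	assert(effective_protection_model(50, 200, 0, 300, 100, true) == 100);
}
```

In the overcommitted case the two children's shares (40 and 60) add up to the parent's 100. In the recursive case the shares (200 and 100) add up to the parent's 300, even though the second child declared no protection of its own.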
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [RFC PATCH 3/4] use memory.low local node protection for local node reclaim
2024-09-20 22:11 ` [RFC PATCH 3/4] use memory.low local node protection for local node reclaim kaiyang2
@ 2024-09-22 0:51 ` kernel test robot
2024-09-22 16:31 ` kernel test robot
2024-10-15 21:52 ` Gregory Price
2 siblings, 0 replies; 13+ messages in thread
From: kernel test robot @ 2024-09-22 0:51 UTC (permalink / raw)
To: kaiyang2; +Cc: oe-kbuild-all
Hi,
[This is a private test report for your RFC patch.]
kernel test robot noticed the following build warnings:
[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on linus/master next-20240920]
[cannot apply to tip/sched/core v6.11]
[If your patch is applied to the wrong git tree, kindly drop us a note.
When submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/kaiyang2-cs-cmu-edu/Add-get_cgroup_local_usage-for-estimating-the-top-tier-memory-usage/20240921-061404
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20240920221202.1734227-4-kaiyang2%40cs.cmu.edu
patch subject: [RFC PATCH 3/4] use memory.low local node protection for local node reclaim
config: x86_64-rhel-8.3 (https://download.01.org/0day-ci/archive/20240922/202409221032.DoTv9B0p-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240922/202409221032.DoTv9B0p-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags:
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409221032.DoTv9B0p-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> mm/memcontrol.c:4499: warning: Function parameter or struct member 'is_local' not described in 'mem_cgroup_calculate_protection'
vim +4499 mm/memcontrol.c
c077719be8e9e6 KAMEZAWA Hiroyuki 2009-01-07 4488
241994ed8649f7 Johannes Weiner 2015-02-11 4489 /**
05395718b2fe48 Mel Gorman 2021-06-30 4490 * mem_cgroup_calculate_protection - check if memory consumption is in the normal range
34c81057927311 Sean Christopherson 2017-07-10 4491 * @root: the top ancestor of the sub-tree being checked
241994ed8649f7 Johannes Weiner 2015-02-11 4492 * @memcg: the memory cgroup to check
241994ed8649f7 Johannes Weiner 2015-02-11 4493 *
230671533d6463 Roman Gushchin 2018-06-07 4494 * WARNING: This function is not stateless! It can only be used as part
230671533d6463 Roman Gushchin 2018-06-07 4495 * of a top-down tree iteration, not for isolated queries.
241994ed8649f7 Johannes Weiner 2015-02-11 4496 */
45c7f7e1ef17f0 Chris Down 2020-08-06 4497 void mem_cgroup_calculate_protection(struct mem_cgroup *root,
3ebe5883ec39d9 Kaiyang Zhao 2024-09-20 4498 struct mem_cgroup *memcg, int is_local)
241994ed8649f7 Johannes Weiner 2015-02-11 @4499 {
a8585ac6862198 Maarten Lankhorst 2024-07-03 4500 bool recursive_protection =
a8585ac6862198 Maarten Lankhorst 2024-07-03 4501 cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT;
230671533d6463 Roman Gushchin 2018-06-07 4502
241994ed8649f7 Johannes Weiner 2015-02-11 4503 if (mem_cgroup_disabled())
45c7f7e1ef17f0 Chris Down 2020-08-06 4504 return;
241994ed8649f7 Johannes Weiner 2015-02-11 4505
34c81057927311 Sean Christopherson 2017-07-10 4506 if (!root)
34c81057927311 Sean Christopherson 2017-07-10 4507 root = root_mem_cgroup;
22f7496f0b9012 Yafang Shao 2020-08-06 4508
6f4c005a5f8b8f Kaiyang Zhao 2024-09-20 4509 page_counter_calculate_protection(&root->memory, &memcg->memory,
3ebe5883ec39d9 Kaiyang Zhao 2024-09-20 4510 recursive_protection, is_local);
241994ed8649f7 Johannes Weiner 2015-02-11 4511 }
241994ed8649f7 Johannes Weiner 2015-02-11 4512
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [RFC PATCH 2/4] calculate memory.low for the local node and track its usage
2024-09-20 22:11 ` [RFC PATCH 2/4] calculate memory.low for the local node and track its usage kaiyang2
2024-09-21 23:18 ` kernel test robot
@ 2024-09-22 8:39 ` kernel test robot
2024-10-15 22:05 ` Gregory Price
2 siblings, 0 replies; 13+ messages in thread
From: kernel test robot @ 2024-09-22 8:39 UTC (permalink / raw)
To: kaiyang2
Cc: oe-lkp, lkp, linux-kernel, linux-mm, cgroups, roman.gushchin,
shakeel.butt, muchun.song, akpm, mhocko, nehagholkar, abhishekd,
hannes, weixugc, rientjes, Kaiyang Zhao, oliver.sang
Hello,
kernel test robot noticed "BUG:kernel_NULL_pointer_dereference,address" on:
commit: 6f4c005a5f8b8ff1ce674731545b302af5f28f3f ("[RFC PATCH 2/4] calculate memory.low for the local node and track its usage")
url: https://github.com/intel-lab-lkp/linux/commits/kaiyang2-cs-cmu-edu/Add-get_cgroup_local_usage-for-estimating-the-top-tier-memory-usage/20240921-061404
base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/all/20240920221202.1734227-3-kaiyang2@cs.cmu.edu/
patch subject: [RFC PATCH 2/4] calculate memory.low for the local node and track its usage
in testcase: boot
compiler: gcc-12
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
(please refer to attached dmesg/kmsg for entire log/backtrace)
+---------------------------------------------+------------+------------+
| | 0af685cc17 | 6f4c005a5f |
+---------------------------------------------+------------+------------+
| boot_successes | 12 | 0 |
| boot_failures | 0 | 12 |
| BUG:kernel_NULL_pointer_dereference,address | 0 | 12 |
| Oops | 0 | 12 |
| RIP:si_meminfo_node | 0 | 12 |
| Kernel_panic-not_syncing:Fatal_exception | 0 | 12 |
+---------------------------------------------+------------+------------+
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags:
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202409221625.1e974ac-oliver.sang@intel.com
[ 14.204830][ T1] BUG: kernel NULL pointer dereference, address: 0000000000000090
[ 14.206729][ T1] #PF: supervisor read access in kernel mode
[ 14.208090][ T1] #PF: error_code(0x0000) - not-present page
[ 14.209393][ T1] PGD 0 P4D 0
[ 14.210212][ T1] Oops: Oops: 0000 [#1] SMP PTI
[ 14.211269][ T1] CPU: 1 UID: 0 PID: 1 Comm: systemd Not tainted 6.11.0-rc6-00570-g6f4c005a5f8b #1
[ 14.213284][ T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 14.215290][ T1] RIP: 0010:si_meminfo_node (arch/x86/include/asm/atomic64_64.h:15 (discriminator 3) include/linux/atomic/atomic-arch-fallback.h:2583 (discriminator 3) include/linux/atomic/atomic-long.h:38 (discriminator 3) include/linux/atomic/atomic-instrumented.h:3189 (discriminator 3) include/linux/mmzone.h:1042 (discriminator 3) mm/show_mem.c:98 (discriminator 3))
[ 14.216523][ T1] Code: 90 90 66 0f 1f 00 0f 1f 44 00 00 48 63 c6 55 31 d2 4c 8b 04 c5 c0 a7 fb 8c 53 48 89 c5 48 89 fb 4c 89 c0 49 8d b8 00 1e 00 00 <48> 8b 88 90 00 00 00 48 05 00 06 00 00 48 01 ca 48 39 f8 75 eb 48
All code
========
0: 90 nop
1: 90 nop
2: 66 0f 1f 00 nopw (%rax)
6: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
b: 48 63 c6 movslq %esi,%rax
e: 55 push %rbp
f: 31 d2 xor %edx,%edx
11: 4c 8b 04 c5 c0 a7 fb mov -0x73045840(,%rax,8),%r8
18: 8c
19: 53 push %rbx
1a: 48 89 c5 mov %rax,%rbp
1d: 48 89 fb mov %rdi,%rbx
20: 4c 89 c0 mov %r8,%rax
23: 49 8d b8 00 1e 00 00 lea 0x1e00(%r8),%rdi
2a:* 48 8b 88 90 00 00 00 mov 0x90(%rax),%rcx <-- trapping instruction
31: 48 05 00 06 00 00 add $0x600,%rax
37: 48 01 ca add %rcx,%rdx
3a: 48 39 f8 cmp %rdi,%rax
3d: 75 eb jne 0x2a
3f: 48 rex.W
Code starting with the faulting instruction
===========================================
0: 48 8b 88 90 00 00 00 mov 0x90(%rax),%rcx
7: 48 05 00 06 00 00 add $0x600,%rax
d: 48 01 ca add %rcx,%rdx
10: 48 39 f8 cmp %rdi,%rax
13: 75 eb jne 0x0
15: 48 rex.W
[ 14.220364][ T1] RSP: 0018:ffffb14b40013d68 EFLAGS: 00010246
[ 14.221717][ T1] RAX: 0000000000000000 RBX: ffffb14b40013d88 RCX: 00000000003a19a2
[ 14.223496][ T1] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000001e00
[ 14.225170][ T1] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000008
[ 14.226964][ T1] R10: 0000000000000008 R11: 0fffffffffffffff R12: ffffb14b40013d88
[ 14.228774][ T1] R13: 00000000003e7ac3 R14: ffffb14b40013e88 R15: ffff98ab0434f7a0
[ 14.230421][ T1] FS: 00007f9569ae9940(0000) GS:ffff98adefd00000(0000) knlGS:0000000000000000
[ 14.234569][ T1] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 14.235900][ T1] CR2: 0000000000000090 CR3: 0000000100072000 CR4: 00000000000006f0
[ 14.237620][ T1] Call Trace:
[ 14.238502][ T1] <TASK>
[ 14.239254][ T1] ? __die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434)
[ 14.240189][ T1] ? page_fault_oops (arch/x86/mm/fault.c:715)
[ 14.241254][ T1] ? exc_page_fault (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 arch/x86/mm/fault.c:1489 arch/x86/mm/fault.c:1539)
[ 14.242297][ T1] ? asm_exc_page_fault (arch/x86/include/asm/idtentry.h:623)
[ 14.243313][ T1] ? si_meminfo_node (arch/x86/include/asm/atomic64_64.h:15 (discriminator 3) include/linux/atomic/atomic-arch-fallback.h:2583 (discriminator 3) include/linux/atomic/atomic-long.h:38 (discriminator 3) include/linux/atomic/atomic-instrumented.h:3189 (discriminator 3) include/linux/mmzone.h:1042 (discriminator 3) mm/show_mem.c:98 (discriminator 3))
[ 14.244443][ T1] ? si_meminfo_node (mm/show_mem.c:114)
[ 14.245460][ T1] memory_low_write (mm/memcontrol.c:4088)
[ 14.246547][ T1] kernfs_fop_write_iter (fs/kernfs/file.c:338)
[ 14.247804][ T1] vfs_write (fs/read_write.c:497 fs/read_write.c:590)
[ 14.248830][ T1] ksys_write (fs/read_write.c:643)
[ 14.249783][ T1] do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
[ 14.250800][ T1] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
[ 14.252260][ T1] RIP: 0033:0x7f956a64b240
[ 14.253276][ T1] Code: 40 00 48 8b 15 c1 9b 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 80 3d a1 23 0e 00 00 74 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 48 89
All code
========
0: 40 00 48 8b add %cl,-0x75(%rax)
4: 15 c1 9b 0d 00 adc $0xd9bc1,%eax
9: f7 d8 neg %eax
b: 64 89 02 mov %eax,%fs:(%rdx)
e: 48 c7 c0 ff ff ff ff mov $0xffffffffffffffff,%rax
15: eb b7 jmp 0xffffffffffffffce
17: 0f 1f 00 nopl (%rax)
1a: 80 3d a1 23 0e 00 00 cmpb $0x0,0xe23a1(%rip) # 0xe23c2
21: 74 17 je 0x3a
23: b8 01 00 00 00 mov $0x1,%eax
28: 0f 05 syscall
2a:* 48 3d 00 f0 ff ff cmp $0xfffffffffffff000,%rax <-- trapping instruction
30: 77 58 ja 0x8a
32: c3 retq
33: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
3a: 48 83 ec 28 sub $0x28,%rsp
3e: 48 rex.W
3f: 89 .byte 0x89
Code starting with the faulting instruction
===========================================
0: 48 3d 00 f0 ff ff cmp $0xfffffffffffff000,%rax
6: 77 58 ja 0x60
8: c3 retq
9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
10: 48 83 ec 28 sub $0x28,%rsp
14: 48 rex.W
15: 89 .byte 0x89
[ 14.257195][ T1] RSP: 002b:00007ffcc66594e8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[ 14.259009][ T1] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f956a64b240
[ 14.260848][ T1] RDX: 0000000000000002 RSI: 00007ffcc6659740 RDI: 000000000000001b
[ 14.262500][ T1] RBP: 00007ffcc6659740 R08: 0000000000000000 R09: 0000000000000001
[ 14.264147][ T1] R10: 00007f956a6c4820 R11: 0000000000000202 R12: 0000000000000002
[ 14.265934][ T1] R13: 000055fd63872c10 R14: 0000000000000002 R15: 00007f956a7219e0
[ 14.267589][ T1] </TASK>
[ 14.268340][ T1] Modules linked in: ip_tables
[ 14.269410][ T1] CR2: 0000000000000090
[ 14.270478][ T1] ---[ end trace 0000000000000000 ]---
[ 14.271717][ T1] RIP: 0010:si_meminfo_node (arch/x86/include/asm/atomic64_64.h:15 (discriminator 3) include/linux/atomic/atomic-arch-fallback.h:2583 (discriminator 3) include/linux/atomic/atomic-long.h:38 (discriminator 3) include/linux/atomic/atomic-instrumented.h:3189 (discriminator 3) include/linux/mmzone.h:1042 (discriminator 3) mm/show_mem.c:98 (discriminator 3))
[ 14.272874][ T1] Code: 90 90 66 0f 1f 00 0f 1f 44 00 00 48 63 c6 55 31 d2 4c 8b 04 c5 c0 a7 fb 8c 53 48 89 c5 48 89 fb 4c 89 c0 49 8d b8 00 1e 00 00 <48> 8b 88 90 00 00 00 48 05 00 06 00 00 48 01 ca 48 39 f8 75 eb 48
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240922/202409221625.1e974ac-oliver.sang@intel.com
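Reading the register dump above: RSI, the second argument to si_meminfo_node() (the node id), is 1, and R08/RAX, loaded from the per-node pointer array, is 0. The fault address 0x90 is then just a field offset from a NULL pointer. So the oops looks like memory_low_write() asking about a node that does not exist on this guest, which presumably has a single NUMA node. The userspace sketch below models that failure mode and a guard against it. All names in it are hypothetical; in the kernel the analogous check would be something like node_online(nid), but whether that is the right fix for this patch is not established by the report.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_NODES 4

/* Hypothetical stand-in for a per-node descriptor. */
struct node_model {
	unsigned long managed_pages;
};

/* Stand-in for the per-node pointer array; absent nodes stay NULL. */
static struct node_model *model_node_data[MODEL_MAX_NODES];

/*
 * Returns 0 and stores the node's page count in *total, or -1 if the
 * node is out of range or absent. Without the NULL check, querying an
 * absent node would dereference NULL, as in the oops above.
 */
static int model_node_total_pages(int nid, unsigned long *total)
{
	assert(total != NULL);

	if (nid < 0 || nid >= MODEL_MAX_NODES || !model_node_data[nid])
		return -1;

	*total = model_node_data[nid]->managed_pages;
	return 0;
}
```

On a model with only node 0 populated, a query for node 1 returns -1 instead of faulting, and the caller can fall back to treating all memory as one tier.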
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [RFC PATCH 3/4] use memory.low local node protection for local node reclaim
2024-09-20 22:11 ` [RFC PATCH 3/4] use memory.low local node protection for local node reclaim kaiyang2
2024-09-22 0:51 ` kernel test robot
@ 2024-09-22 16:31 ` kernel test robot
2024-10-15 21:52 ` Gregory Price
2 siblings, 0 replies; 13+ messages in thread
From: kernel test robot @ 2024-09-22 16:31 UTC (permalink / raw)
To: kaiyang2; +Cc: oe-kbuild-all
Hi,
[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master next-20240920]
[cannot apply to tip/sched/core v6.11]
[If your patch is applied to the wrong git tree, kindly drop us a note.
When submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/kaiyang2-cs-cmu-edu/Add-get_cgroup_local_usage-for-estimating-the-top-tier-memory-usage/20240921-061404
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20240920221202.1734227-4-kaiyang2%40cs.cmu.edu
patch subject: [RFC PATCH 3/4] use memory.low local node protection for local node reclaim
config: x86_64-defconfig (https://download.01.org/0day-ci/archive/20240923/202409230026.mq4sC7is-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-12) 11.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240923/202409230026.mq4sC7is-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags:
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409230026.mq4sC7is-lkp@intel.com/
All errors (new ones prefixed by >>):
mm/vmscan.c: In function 'get_scan_count':
>> mm/vmscan.c:2503:47: error: implicit declaration of function 'get_cgroup_local_usage' [-Werror=implicit-function-declaration]
2503 | cgroup_size = get_cgroup_local_usage(memcg, true);
| ^~~~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
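The same series built with x86_64-rhel-8.3 (warnings only) but fails here with x86_64-defconfig, which suggests the prototype of get_cgroup_local_usage() is only visible under some config option that defconfig leaves off. The report does not say which option. The usual kernel pattern for this is a static inline stub in the header for the disabled case, sketched below in userspace. MODEL_CONFIG_MEMCG, struct mem_cgroup_model and the model_* functions are hypothetical, and returning 0 from the stub is an assumption about what "no usage information" should mean.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct mem_cgroup. */
struct mem_cgroup_model {
	unsigned long local_pages;
};

/*
 * MODEL_CONFIG_MEMCG stands in for a Kconfig symbol. It is left
 * undefined here to model a config with the feature disabled.
 */
#ifdef MODEL_CONFIG_MEMCG
unsigned long model_get_cgroup_local_usage(struct mem_cgroup_model *memcg,
					   bool flush);
#else
/* Stub: callers still compile and see zero local usage. */
static inline unsigned long
model_get_cgroup_local_usage(struct mem_cgroup_model *memcg, bool flush)
{
	(void)memcg;
	(void)flush;
	return 0;
}
#endif

/* A caller that is built under every config, like get_scan_count(). */
static unsigned long model_caller(struct mem_cgroup_model *memcg)
{
	return model_get_cgroup_local_usage(memcg, true);
}

static void model_stub_example(void)
{
	struct mem_cgroup_model m = { 123 };

	/* With the option off, the stub reports no local usage. */
	assert(model_caller(&m) == 0);
}
```

With the stub in place the caller needs no #ifdef of its own, which keeps config conditionals out of mm/vmscan.c.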
vim +/get_cgroup_local_usage +2503 mm/vmscan.c
2360
2361 /*
2362 * Determine how aggressively the anon and file LRU lists should be
2363 * scanned.
2364 *
2365 * nr[0] = anon inactive folios to scan; nr[1] = anon active folios to scan
2366 * nr[2] = file inactive folios to scan; nr[3] = file active folios to scan
2367 */
2368 static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
2369 unsigned long *nr)
2370 {
2371 struct pglist_data *pgdat = lruvec_pgdat(lruvec);
2372 struct mem_cgroup *memcg = lruvec_memcg(lruvec);
2373 unsigned long anon_cost, file_cost, total_cost;
2374 int swappiness = sc_swappiness(sc, memcg);
2375 u64 fraction[ANON_AND_FILE];
2376 u64 denominator = 0; /* gcc */
2377 enum scan_balance scan_balance;
2378 unsigned long ap, fp;
2379 enum lru_list lru;
2380 int is_local = (pgdat->node_id == 0) && root_reclaim(sc);
2381
2382 /* If we have no swap space, do not bother scanning anon folios. */
2383 if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id, sc)) {
2384 scan_balance = SCAN_FILE;
2385 goto out;
2386 }
2387
2388 /*
2389 * Global reclaim will swap to prevent OOM even with no
2390 * swappiness, but memcg users want to use this knob to
2391 * disable swapping for individual groups completely when
2392 * using the memory controller's swap limit feature would be
2393 * too expensive.
2394 */
2395 if (cgroup_reclaim(sc) && !swappiness) {
2396 scan_balance = SCAN_FILE;
2397 goto out;
2398 }
2399
2400 /*
2401 * Do not apply any pressure balancing cleverness when the
2402 * system is close to OOM, scan both anon and file equally
2403 * (unless the swappiness setting disagrees with swapping).
2404 */
2405 if (!sc->priority && swappiness) {
2406 scan_balance = SCAN_EQUAL;
2407 goto out;
2408 }
2409
2410 /*
2411 * If the system is almost out of file pages, force-scan anon.
2412 */
2413 if (sc->file_is_tiny) {
2414 scan_balance = SCAN_ANON;
2415 goto out;
2416 }
2417
2418 /*
2419 * If there is enough inactive page cache, we do not reclaim
2420 * anything from the anonymous working right now.
2421 */
2422 if (sc->cache_trim_mode) {
2423 scan_balance = SCAN_FILE;
2424 goto out;
2425 }
2426
2427 scan_balance = SCAN_FRACT;
2428 /*
2429 * Calculate the pressure balance between anon and file pages.
2430 *
2431 * The amount of pressure we put on each LRU is inversely
2432 * proportional to the cost of reclaiming each list, as
2433 * determined by the share of pages that are refaulting, times
2434 * the relative IO cost of bringing back a swapped out
2435 * anonymous page vs reloading a filesystem page (swappiness).
2436 *
2437 * Although we limit that influence to ensure no list gets
2438 * left behind completely: at least a third of the pressure is
2439 * applied, before swappiness.
2440 *
2441 * With swappiness at 100, anon and file have equal IO cost.
2442 */
2443 total_cost = sc->anon_cost + sc->file_cost;
2444 anon_cost = total_cost + sc->anon_cost;
2445 file_cost = total_cost + sc->file_cost;
2446 total_cost = anon_cost + file_cost;
2447
2448 ap = swappiness * (total_cost + 1);
2449 ap /= anon_cost + 1;
2450
2451 fp = (MAX_SWAPPINESS - swappiness) * (total_cost + 1);
2452 fp /= file_cost + 1;
2453
2454 fraction[0] = ap;
2455 fraction[1] = fp;
2456 denominator = ap + fp;
2457 out:
2458 for_each_evictable_lru(lru) {
2459 bool file = is_file_lru(lru);
2460 unsigned long lruvec_size;
2461 unsigned long low, min, locallow;
2462 unsigned long scan;
2463
2464 lruvec_size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
2465 mem_cgroup_protection(sc->target_mem_cgroup, memcg,
2466 &min, &low, &locallow);
2467 if (is_local)
2468 low = locallow;
2469
2470 if (min || low) {
2471 /*
2472 * Scale a cgroup's reclaim pressure by proportioning
2473 * its current usage to its memory.low or memory.min
2474 * setting.
2475 *
2476 * This is important, as otherwise scanning aggression
2477 * becomes extremely binary -- from nothing as we
2478 * approach the memory protection threshold, to totally
2479 * nominal as we exceed it. This results in requiring
2480 * setting extremely liberal protection thresholds. It
2481 * also means we simply get no protection at all if we
2482 * set it too low, which is not ideal.
2483 *
2484 * If there is any protection in place, we reduce scan
2485 * pressure by how much of the total memory used is
2486 * within protection thresholds.
2487 *
2488 * There is one special case: in the first reclaim pass,
2489 * we skip over all groups that are within their low
2490 * protection. If that fails to reclaim enough pages to
2491 * satisfy the reclaim goal, we come back and override
2492 * the best-effort low protection. However, we still
2493 * ideally want to honor how well-behaved groups are in
2494 * that case instead of simply punishing them all
2495 * equally. As such, we reclaim them based on how much
2496 * memory they are using, reducing the scan pressure
2497 * again by how much of the total memory used is under
2498 * hard protection.
2499 */
2500 unsigned long cgroup_size;
2501
2502 if (is_local)
> 2503 cgroup_size = get_cgroup_local_usage(memcg, true);
2504 else
2505 cgroup_size = mem_cgroup_size(memcg);
2506 unsigned long protection;
2507
2508 /* memory.low scaling, make sure we retry before OOM */
2509 if (!sc->memcg_low_reclaim && low > min) {
2510 protection = low;
2511 sc->memcg_low_skipped = 1;
2512 } else {
2513 protection = min;
2514 }
2515
2516 /* Avoid TOCTOU with earlier protection check */
2517 cgroup_size = max(cgroup_size, protection);
2518
2519 scan = lruvec_size - lruvec_size * protection /
2520 (cgroup_size + 1);
2521
2522 /*
2523 * Minimally target SWAP_CLUSTER_MAX pages to keep
2524 * reclaim moving forwards, avoiding decrementing
2525 * sc->priority further than desirable.
2526 */
2527 scan = max(scan, SWAP_CLUSTER_MAX);
2528 } else {
2529 scan = lruvec_size;
2530 }
2531
2532 scan >>= sc->priority;
2533
2534 /*
2535 * If the cgroup's already been deleted, make sure to
2536 * scrape out the remaining cache.
2537 */
2538 if (!scan && !mem_cgroup_online(memcg))
2539 scan = min(lruvec_size, SWAP_CLUSTER_MAX);
2540
2541 switch (scan_balance) {
2542 case SCAN_EQUAL:
2543 /* Scan lists relative to size */
2544 break;
2545 case SCAN_FRACT:
2546 /*
2547 * Scan types proportional to swappiness and
2548 * their relative recent reclaim efficiency.
2549 * Make sure we don't miss the last page on
2550 * the offlined memory cgroups because of a
2551 * round-off error.
2552 */
2553 scan = mem_cgroup_online(memcg) ?
2554 div64_u64(scan * fraction[file], denominator) :
2555 DIV64_U64_ROUND_UP(scan * fraction[file],
2556 denominator);
2557 break;
2558 case SCAN_FILE:
2559 case SCAN_ANON:
2560 /* Scan one type exclusively */
2561 if ((scan_balance == SCAN_FILE) != file)
2562 scan = 0;
2563 break;
2564 default:
2565 /* Look ma, no brain */
2566 BUG();
2567 }
2568
2569 nr[lru] = scan;
2570 }
2571 }
2572
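To make the proportional scaling in the listing concrete, here is a minimal standalone sketch of the arithmetic at lines 2509-2532. It is userspace C, not kernel code: scan_target() and max_ul() are hypothetical names, and SWAP_CLUSTER_MAX is assumed to be 32, its usual kernel value.

```c
#include <assert.h>
#include <stdbool.h>

#define SWAP_CLUSTER_MAX 32UL	/* assumed value, see lead-in */

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/* Pages to scan on one LRU list, before the anon/file balance is applied. */
static unsigned long scan_target(unsigned long lruvec_size,
				 unsigned long cgroup_size,
				 unsigned long min, unsigned long low,
				 bool low_reclaim, int priority)
{
	unsigned long scan;

	assert(priority >= 0 && priority < (int)(sizeof(unsigned long) * 8));

	if (min || low) {
		/* First pass honors low; the retry pass falls back to min. */
		unsigned long protection =
			(!low_reclaim && low > min) ? low : min;

		/* Usage may have dropped below protection since the check. */
		cgroup_size = max_ul(cgroup_size, protection);

		/* Scan only the share of the list that is not protected. */
		scan = lruvec_size -
		       lruvec_size * protection / (cgroup_size + 1);
		scan = max_ul(scan, SWAP_CLUSTER_MAX);
	} else {
		scan = lruvec_size;
	}

	return scan >> priority;
}
```

For example, a cgroup using 2000 pages with low = 1000 has roughly half of a 1000-page list scanned, while a cgroup entirely within its protection is still scanned at the SWAP_CLUSTER_MAX floor. With the patch applied to a local-node reclaim, cgroup_size and low would be the top-tier usage and the local low value instead.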
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion
2024-09-20 22:11 [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion kaiyang2
` (3 preceding siblings ...)
2024-09-20 22:11 ` [RFC PATCH 4/4] reduce NUMA balancing scan size of cgroups over their local memory.low kaiyang2
@ 2024-10-11 20:51 ` Kaiyang Zhao
2024-11-08 19:01 ` kaiyang2
5 siblings, 0 replies; 13+ messages in thread
From: Kaiyang Zhao @ 2024-10-11 20:51 UTC (permalink / raw)
To: linux-mm, cgroups
Cc: roman.gushchin, shakeel.butt, muchun.song, akpm, mhocko,
nehagholkar, abhishekd, hannes, weixugc, rientjes, gourry
Adding some preliminary results from testing on a *real* system with CXL
memory.
The system has 256GB local DRAM + 64GB CXL memory. We used a microbenchmark
that allocates memory and accesses it at tunable hotness levels. We ran 3 such
microbenchmarks in 3 cgroups. The first container has twice the access
hotness of the second and third containers. All containers have a 100GB
memory.low set, meaning that ~82GB of local DRAM usage is protected.
Case 1: Container 1: Uses 120GB Container 2: Uses 40GB Container 3: Uses 40GB
Without fairness patch: same as with fairness.
With fairness patch: Container 1 has 120GB in local DRAM. Container 2 and 3
each have 40GB in local DRAM. As long as DRAM memory is not under pressure,
containers can exceed the lower guarantee and put everything in DRAM.
Case 2: Container 1: Uses 120GB Container 2: Uses 90GB Container 3: Uses 90GB
Without fairness patch: Container 1 gets 120GB in local DRAM, and Container 2
and 3 are stuck with ~65GB in local DRAM since they have colder data.
With fairness patch: Container 1 starts early and gets all 120GB in DRAM
memory. As container 2 and 3 start, they initially each get ~65GB in DRAM and
~25GB in CXL memory. Promotion attempts trigger local memory reclaim by kswapd,
which trims the DRAM usage by container 1 and increases the DRAM usage of
container 2 and 3. Eventually, the usage of DRAM memory for all 3 containers
converges at ~82GB, and the excess unprotected usage of the 3 containers is in
CXL memory.
Case 3:
Container 1: Uses 120GB Container 2: Uses 70GB Container 3: Uses 70GB
Without fairness patch: Container 1 gets 120GB in local DRAM, and Container 2
and 3 are stuck with ~65GB in local DRAM since they have colder data.
With fairness patch: Although the total memory demand exceeds DRAM capacity, at
the stable state Container 1 is still able to get ~105GB in local DRAM, more
than the lower guarantee. Meanwhile, all memory usage by Container 2 and 3 is
protected from the noisy neighbor Container 1 and resides in DRAM only.
We’re working on getting performance data from more benchmarks and also Meta’s
production workloads. Stay tuned for more results!
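The protected figure quoted above comes from the scaling rule in patch 2/4, locallow = low * local_capacity / total_capacity. A minimal sketch of that arithmetic follows; scale_local_low() is a hypothetical name and the units only need to be consistent (GB here, pages in the patch). With the nominal 256GB + 64GB capacities the rule gives 80GB for a 100GB memory.low; the ~82GB reported presumably reflects the per-node totalram the kernel actually sees rather than the nominal sizes, which is an assumption since the thread does not say.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Scale a memory.low value by the share of total capacity that is
 * top-tier (local) memory, using integer division as the patch does.
 */
static uint64_t scale_local_low(uint64_t low, uint64_t local_capacity,
				uint64_t remote_capacity)
{
	uint64_t total = local_capacity + remote_capacity;

	assert(total > 0);	/* avoid dividing by zero */
	return low * local_capacity / total;
}
```

For instance, scale_local_low(100, 256, 64) evaluates to 80, so each container's guarantee covers about four fifths of its memory.low in DRAM.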
* Re: [RFC PATCH 3/4] use memory.low local node protection for local node reclaim
2024-09-20 22:11 ` [RFC PATCH 3/4] use memory.low local node protection for local node reclaim kaiyang2
2024-09-22 0:51 ` kernel test robot
2024-09-22 16:31 ` kernel test robot
@ 2024-10-15 21:52 ` Gregory Price
2 siblings, 0 replies; 13+ messages in thread
From: Gregory Price @ 2024-10-15 21:52 UTC (permalink / raw)
To: kaiyang2
Cc: linux-mm, cgroups, roman.gushchin, shakeel.butt, muchun.song,
akpm, mhocko, nehagholkar, abhishekd, hannes, weixugc, rientjes
On Fri, Sep 20, 2024 at 10:11:50PM +0000, kaiyang2@cs.cmu.edu wrote:
> From: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
>
> When reclaim targets the top-tier node usage by the root memcg,
> apply local memory.low protection instead of global protection.
>
Changelog probably needs a little more context about the intended
effect of this change. What exactly is the implication of this
change compared to applying it against elow?
> Signed-off-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
> ---
> include/linux/memcontrol.h | 23 ++++++++++++++---------
> mm/memcontrol.c | 4 ++--
> mm/vmscan.c | 19 ++++++++++++++-----
> 3 files changed, 30 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 94aba4498fca..256912b91922 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -586,9 +586,9 @@ static inline bool mem_cgroup_disabled(void)
> static inline void mem_cgroup_protection(struct mem_cgroup *root,
> struct mem_cgroup *memcg,
> unsigned long *min,
> - unsigned long *low)
> + unsigned long *low, unsigned long *locallow)
> {
> - *min = *low = 0;
> + *min = *low = *locallow = 0;
>
"locallow" can be read as "loc allow" or "local low", probably you
want to change all the references to local_low.
Sorry for not saying this on earlier feedback.
> if (mem_cgroup_disabled())
> return;
> @@ -631,10 +631,11 @@ static inline void mem_cgroup_protection(struct mem_cgroup *root,
>
> *min = READ_ONCE(memcg->memory.emin);
> *low = READ_ONCE(memcg->memory.elow);
> + *locallow = READ_ONCE(memcg->memory.elocallow);
> }
>
> void mem_cgroup_calculate_protection(struct mem_cgroup *root,
> - struct mem_cgroup *memcg);
> + struct mem_cgroup *memcg, int is_local);
>
> static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
> struct mem_cgroup *memcg)
> @@ -651,13 +652,17 @@ static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
> unsigned long get_cgroup_local_usage(struct mem_cgroup *memcg, bool flush);
>
> static inline bool mem_cgroup_below_low(struct mem_cgroup *target,
> - struct mem_cgroup *memcg)
> + struct mem_cgroup *memcg, int is_local)
> {
> if (mem_cgroup_unprotected(target, memcg))
> return false;
>
> - return READ_ONCE(memcg->memory.elow) >=
> - page_counter_read(&memcg->memory);
> + if (is_local)
> + return READ_ONCE(memcg->memory.elocallow) >=
> + get_cgroup_local_usage(memcg, true);
> + else
> + return READ_ONCE(memcg->memory.elow) >=
> + page_counter_read(&memcg->memory);
Don't need the else case here if the if block returns.
> }
>
> static inline bool mem_cgroup_below_min(struct mem_cgroup *target,
> @@ -1159,13 +1164,13 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
> static inline void mem_cgroup_protection(struct mem_cgroup *root,
> struct mem_cgroup *memcg,
> unsigned long *min,
> - unsigned long *low)
> + unsigned long *low, unsigned long *locallow)
> {
> *min = *low = 0;
> }
>
> static inline void mem_cgroup_calculate_protection(struct mem_cgroup *root,
> - struct mem_cgroup *memcg)
> + struct mem_cgroup *memcg, int is_local)
> {
> }
>
> @@ -1175,7 +1180,7 @@ static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
> return true;
> }
> static inline bool mem_cgroup_below_low(struct mem_cgroup *target,
> - struct mem_cgroup *memcg)
> + struct mem_cgroup *memcg, int is_local)
> {
> return false;
> }
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d7c5fff12105..61718ba998fe 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4495,7 +4495,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
> * of a top-down tree iteration, not for isolated queries.
> */
> void mem_cgroup_calculate_protection(struct mem_cgroup *root,
> - struct mem_cgroup *memcg)
> + struct mem_cgroup *memcg, int is_local)
> {
> bool recursive_protection =
> cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT;
> @@ -4507,7 +4507,7 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
> root = root_mem_cgroup;
>
> page_counter_calculate_protection(&root->memory, &memcg->memory,
> - recursive_protection, false);
> + recursive_protection, is_local);
> }
>
> static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ce471d686a88..a2681d52fc5f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2377,6 +2377,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> enum scan_balance scan_balance;
> unsigned long ap, fp;
> enum lru_list lru;
> + int is_local = (pgdat->node_id == 0) && root_reclaim(sc);
int should be bool to be more explicit as to what the valid values are.
Should be addressed across the patch set.
>
> /* If we have no swap space, do not bother scanning anon folios. */
> if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id, sc)) {
> @@ -2457,12 +2458,14 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> for_each_evictable_lru(lru) {
> bool file = is_file_lru(lru);
> unsigned long lruvec_size;
> - unsigned long low, min;
> + unsigned long low, min, locallow;
> unsigned long scan;
>
> lruvec_size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
> mem_cgroup_protection(sc->target_mem_cgroup, memcg,
> - &min, &low);
> + &min, &low, &locallow);
> + if (is_local)
> + low = locallow;
>
> if (min || low) {
> /*
> @@ -2494,7 +2497,12 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> * again by how much of the total memory used is under
> * hard protection.
> */
> - unsigned long cgroup_size = mem_cgroup_size(memcg);
> + unsigned long cgroup_size;
> +
> + if (is_local)
> + cgroup_size = get_cgroup_local_usage(memcg, true);
> + else
> + cgroup_size = mem_cgroup_size(memcg);
> unsigned long protection;
>
> /* memory.low scaling, make sure we retry before OOM */
> @@ -5869,6 +5877,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> };
> struct mem_cgroup_reclaim_cookie *partial = &reclaim;
> struct mem_cgroup *memcg;
> + int is_local = (pgdat->node_id == 0) && root_reclaim(sc);
>
> /*
> * In most cases, direct reclaimers can do partial walks
> @@ -5896,7 +5905,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> */
> cond_resched();
>
> - mem_cgroup_calculate_protection(target_memcg, memcg);
> + mem_cgroup_calculate_protection(target_memcg, memcg, is_local);
>
> if (mem_cgroup_below_min(target_memcg, memcg)) {
> /*
> @@ -5904,7 +5913,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> * If there is no reclaimable memory, OOM.
> */
> continue;
> - } else if (mem_cgroup_below_low(target_memcg, memcg)) {
> + } else if (mem_cgroup_below_low(target_memcg, memcg, is_local)) {
> /*
> * Soft protection.
> * Respect the protection only as long as
> --
> 2.43.0
>
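Taken together, the review comments above (local_low naming, bool instead of int, no else after a return) suggest a helper shaped roughly like the following. This is a standalone userspace sketch with a mock structure, not the kernel's struct mem_cgroup; every field and function name in it is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Mock of the few values the check needs; names are illustrative only. */
struct memcg_sketch {
	unsigned long elow;		/* effective memory.low */
	unsigned long elocal_low;	/* effective top-tier share of low */
	unsigned long usage;		/* total usage, in pages */
	unsigned long local_usage;	/* top-tier usage, in pages */
	bool unprotected;		/* stands in for mem_cgroup_unprotected() */
};

static bool below_low(const struct memcg_sketch *memcg, bool is_local)
{
	assert(memcg);

	if (memcg->unprotected)
		return false;

	/* Early return makes a trailing else branch unnecessary. */
	if (is_local)
		return memcg->elocal_low >= memcg->local_usage;

	return memcg->elow >= memcg->usage;
}
```

A cgroup can be over its global low while still within its local low, so the two branches can legitimately disagree for the same cgroup.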
* Re: [RFC PATCH 2/4] calculate memory.low for the local node and track its usage
2024-09-20 22:11 ` [RFC PATCH 2/4] calculate memory.low for the local node and track its usage kaiyang2
2024-09-21 23:18 ` kernel test robot
2024-09-22 8:39 ` kernel test robot
@ 2024-10-15 22:05 ` Gregory Price
2 siblings, 0 replies; 13+ messages in thread
From: Gregory Price @ 2024-10-15 22:05 UTC (permalink / raw)
To: kaiyang2
Cc: linux-mm, cgroups, roman.gushchin, shakeel.butt, muchun.song,
akpm, mhocko, nehagholkar, abhishekd, hannes, weixugc, rientjes
On Fri, Sep 20, 2024 at 10:11:49PM +0000, kaiyang2@cs.cmu.edu wrote:
> From: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
>
> Add a memory.low for the top-tier node (locallow) and track its usage.
> locallow is set by scaling low by the ratio of node 0 capacity and
> node 0 + node 1 capacity.
>
> Signed-off-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
> ---
> include/linux/page_counter.h | 16 ++++++++---
> mm/hugetlb_cgroup.c | 4 +--
> mm/memcontrol.c | 42 ++++++++++++++++++++++-------
> mm/page_counter.c | 52 ++++++++++++++++++++++++++++--------
> 4 files changed, 88 insertions(+), 26 deletions(-)
>
> diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> index 79dbd8bc35a7..aa56c93415ef 100644
> --- a/include/linux/page_counter.h
> +++ b/include/linux/page_counter.h
> @@ -13,6 +13,7 @@ struct page_counter {
> * memcg->memory.usage is a hot member of struct mem_cgroup.
> */
> atomic_long_t usage;
> + struct mem_cgroup *memcg; /* memcg that owns this counter */
Can you make some comments on the lifetime of this new memcg reference?
How is it referenced, how is it cleaned up, etc.
Probably it's worth adding this in a separate patch so it's easier
to review the reference tracking.
> CACHELINE_PADDING(_pad1_);
>
> /* effective memory.min and memory.min usage tracking */
> @@ -25,6 +26,10 @@ struct page_counter {
> atomic_long_t low_usage;
> atomic_long_t children_low_usage;
>
> + unsigned long elocallow;
> + atomic_long_t locallow_usage;
per note on other email - probably want local_low_* instead of locallow.
> + atomic_long_t children_locallow_usage;
> +
> unsigned long watermark;
> /* Latest cg2 reset watermark */
> unsigned long local_watermark;
> @@ -36,6 +41,7 @@ struct page_counter {
> bool protection_support;
> unsigned long min;
> unsigned long low;
> + unsigned long locallow;
> unsigned long high;
> unsigned long max;
> struct page_counter *parent;
> @@ -52,12 +58,13 @@ struct page_counter {
> */
> static inline void page_counter_init(struct page_counter *counter,
> struct page_counter *parent,
> - bool protection_support)
> + bool protection_support, struct mem_cgroup *memcg)
> {
> counter->usage = (atomic_long_t)ATOMIC_LONG_INIT(0);
> counter->max = PAGE_COUNTER_MAX;
> counter->parent = parent;
> counter->protection_support = protection_support;
> + counter->memcg = memcg;
> }
>
> static inline unsigned long page_counter_read(struct page_counter *counter)
> @@ -72,7 +79,8 @@ bool page_counter_try_charge(struct page_counter *counter,
> struct page_counter **fail);
> void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
> void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages);
> -void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages);
> +void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages,
> + unsigned long nr_pages_local);
>
> static inline void page_counter_set_high(struct page_counter *counter,
> unsigned long nr_pages)
> @@ -99,11 +107,11 @@ static inline void page_counter_reset_watermark(struct page_counter *counter)
> #ifdef CONFIG_MEMCG
> void page_counter_calculate_protection(struct page_counter *root,
> struct page_counter *counter,
> - bool recursive_protection);
> + bool recursive_protection, int is_local);
`bool is_local` is preferred
> #else
> static inline void page_counter_calculate_protection(struct page_counter *root,
> struct page_counter *counter,
> - bool recursive_protection) {}
> + bool recursive_protection, int is_local) {}
> #endif
>
> #endif /* _LINUX_PAGE_COUNTER_H */
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> index d8d0e665caed..0e07a7a1d5b8 100644
> --- a/mm/hugetlb_cgroup.c
> +++ b/mm/hugetlb_cgroup.c
> @@ -114,10 +114,10 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup,
> }
> page_counter_init(hugetlb_cgroup_counter_from_cgroup(h_cgroup,
> idx),
> - fault_parent, false);
> + fault_parent, false, NULL);
> page_counter_init(
> hugetlb_cgroup_counter_from_cgroup_rsvd(h_cgroup, idx),
> - rsvd_parent, false);
> + rsvd_parent, false, NULL);
>
> limit = round_down(PAGE_COUNTER_MAX,
> pages_per_huge_page(&hstates[idx]));
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 20b715441332..d7c5fff12105 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1497,6 +1497,9 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
> vm_event_name(memcg_vm_event_stat[i]),
> memcg_events(memcg, memcg_vm_event_stat[i]));
> }
> +
> + seq_buf_printf(s, "local_usage %lu\n",
> + get_cgroup_local_usage(memcg, true));
> }
>
> static void memory_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
> @@ -3597,8 +3600,8 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
> if (parent) {
> WRITE_ONCE(memcg->swappiness, mem_cgroup_swappiness(parent));
>
> - page_counter_init(&memcg->memory, &parent->memory, true);
> - page_counter_init(&memcg->swap, &parent->swap, false);
> + page_counter_init(&memcg->memory, &parent->memory, true, memcg);
> + page_counter_init(&memcg->swap, &parent->swap, false, NULL);
> #ifdef CONFIG_MEMCG_V1
> WRITE_ONCE(memcg->oom_kill_disable, READ_ONCE(parent->oom_kill_disable));
> page_counter_init(&memcg->kmem, &parent->kmem, false);
> @@ -3607,8 +3610,8 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
> } else {
> init_memcg_stats();
> init_memcg_events();
> - page_counter_init(&memcg->memory, NULL, true);
> - page_counter_init(&memcg->swap, NULL, false);
> + page_counter_init(&memcg->memory, NULL, true, memcg);
> + page_counter_init(&memcg->swap, NULL, false, NULL);
> #ifdef CONFIG_MEMCG_V1
> page_counter_init(&memcg->kmem, NULL, false);
> page_counter_init(&memcg->tcpmem, NULL, false);
> @@ -3677,7 +3680,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
> memcg1_css_offline(memcg);
>
> page_counter_set_min(&memcg->memory, 0);
> - page_counter_set_low(&memcg->memory, 0);
> + page_counter_set_low(&memcg->memory, 0, 0);
>
> zswap_memcg_offline_cleanup(memcg);
>
> @@ -3748,7 +3751,7 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
> page_counter_set_max(&memcg->tcpmem, PAGE_COUNTER_MAX);
> #endif
> page_counter_set_min(&memcg->memory, 0);
> - page_counter_set_low(&memcg->memory, 0);
> + page_counter_set_low(&memcg->memory, 0, 0);
> page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX);
> memcg1_soft_limit_reset(memcg);
> page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
> @@ -4051,6 +4054,12 @@ static ssize_t memory_min_write(struct kernfs_open_file *of,
> return nbytes;
> }
>
> +static int memory_locallow_show(struct seq_file *m, void *v)
> +{
> + return seq_puts_memcg_tunable(m,
> + READ_ONCE(mem_cgroup_from_seq(m)->memory.locallow));
> +}
> +
> static int memory_low_show(struct seq_file *m, void *v)
> {
> return seq_puts_memcg_tunable(m,
> @@ -4061,7 +4070,8 @@ static ssize_t memory_low_write(struct kernfs_open_file *of,
> char *buf, size_t nbytes, loff_t off)
> {
> struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> - unsigned long low;
> + struct sysinfo si;
> + unsigned long low, locallow, local_capacity, total_capacity;
> int err;
>
> buf = strstrip(buf);
> @@ -4069,7 +4079,15 @@ static ssize_t memory_low_write(struct kernfs_open_file *of,
> if (err)
> return err;
>
> - page_counter_set_low(&memcg->memory, low);
> + /* Hardcoded 0 for local node and 1 for remote. */
I know we've talked about this before, but this is obviously broken
for multi-socket systems. If so, this needs a FIXME or a TODO at least so that
it's at least obvious that this patch isn't ready for upstream - even as an RFC.
Probably we can't move forward until we figure out how to solve this problem
out ahead of this patch set. Worth discussing this issue explicitly.
Maybe rather than guessing, a preferred node should be set for local and
remote if this mechanism is in use. Otherwise just guessing which node is
local and which is remote seems like it will be wrong - especially for
sufficiently large-threaded processes.
> + si_meminfo_node(&si, 0);
> + local_capacity = si.totalram; /* In pages. */
> + total_capacity = local_capacity;
> + si_meminfo_node(&si, 1);
> + total_capacity += si.totalram;
> + locallow = low * local_capacity / total_capacity;
> +
> + page_counter_set_low(&memcg->memory, low, locallow);
>
> return nbytes;
> }
> @@ -4394,6 +4412,11 @@ static struct cftype memory_files[] = {
> .seq_show = memory_low_show,
> .write = memory_low_write,
> },
> + {
> + .name = "locallow",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .seq_show = memory_locallow_show,
> + },
> {
> .name = "high",
> .flags = CFTYPE_NOT_ON_ROOT,
> @@ -4483,7 +4506,8 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
> if (!root)
> root = root_mem_cgroup;
>
> - page_counter_calculate_protection(&root->memory, &memcg->memory, recursive_protection);
> + page_counter_calculate_protection(&root->memory, &memcg->memory,
> + recursive_protection, false);
> }
>
> static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
> diff --git a/mm/page_counter.c b/mm/page_counter.c
> index b249d15af9dd..97205aafab46 100644
> --- a/mm/page_counter.c
> +++ b/mm/page_counter.c
> @@ -18,8 +18,10 @@ static bool track_protection(struct page_counter *c)
> return c->protection_support;
> }
>
> +extern unsigned long get_cgroup_local_usage(struct mem_cgroup *memcg, bool flush);
> +
> static void propagate_protected_usage(struct page_counter *c,
> - unsigned long usage)
> + unsigned long usage, unsigned long local_usage)
> {
> unsigned long protected, old_protected;
> long delta;
> @@ -44,6 +46,15 @@ static void propagate_protected_usage(struct page_counter *c,
> if (delta)
> atomic_long_add(delta, &c->parent->children_low_usage);
> }
> +
> + protected = min(local_usage, READ_ONCE(c->locallow));
> + old_protected = atomic_long_read(&c->locallow_usage);
> + if (protected != old_protected) {
> + old_protected = atomic_long_xchg(&c->locallow_usage, protected);
> + delta = protected - old_protected;
> + if (delta)
> + atomic_long_add(delta, &c->parent->children_locallow_usage);
> + }
> }
>
> /**
> @@ -63,7 +74,8 @@ void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
> atomic_long_set(&counter->usage, new);
> }
> if (track_protection(counter))
> - propagate_protected_usage(counter, new);
> + propagate_protected_usage(counter, new,
> + get_cgroup_local_usage(counter->memcg, false));
> }
>
> /**
> @@ -83,7 +95,8 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
>
> new = atomic_long_add_return(nr_pages, &c->usage);
> if (protection)
> - propagate_protected_usage(c, new);
> + propagate_protected_usage(c, new,
> + get_cgroup_local_usage(counter->memcg, false));
> /*
> * This is indeed racy, but we can live with some
> * inaccuracy in the watermark.
> @@ -151,7 +164,8 @@ bool page_counter_try_charge(struct page_counter *counter,
> goto failed;
> }
> if (protection)
> - propagate_protected_usage(c, new);
> + propagate_protected_usage(c, new,
> + get_cgroup_local_usage(counter->memcg, false));
>
> /* see comment on page_counter_charge */
> if (new > READ_ONCE(c->local_watermark)) {
> @@ -238,7 +252,8 @@ void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages)
> WRITE_ONCE(counter->min, nr_pages);
>
> for (c = counter; c; c = c->parent)
> - propagate_protected_usage(c, atomic_long_read(&c->usage));
> + propagate_protected_usage(c, atomic_long_read(&c->usage),
> + get_cgroup_local_usage(counter->memcg, false));
> }
>
> /**
> @@ -248,14 +263,17 @@ void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages)
> *
> * The caller must serialize invocations on the same counter.
> */
> -void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages)
> +void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages,
> + unsigned long nr_pages_local)
> {
> struct page_counter *c;
>
> WRITE_ONCE(counter->low, nr_pages);
> + WRITE_ONCE(counter->locallow, nr_pages_local);
>
> for (c = counter; c; c = c->parent)
> - propagate_protected_usage(c, atomic_long_read(&c->usage));
> + propagate_protected_usage(c, atomic_long_read(&c->usage),
> + get_cgroup_local_usage(counter->memcg, false));
> }
>
> /**
> @@ -421,9 +439,9 @@ static unsigned long effective_protection(unsigned long usage,
> */
> void page_counter_calculate_protection(struct page_counter *root,
> struct page_counter *counter,
> - bool recursive_protection)
> + bool recursive_protection, int is_local)
> {
> - unsigned long usage, parent_usage;
> + unsigned long usage, parent_usage, local_usage, parent_local_usage;
> struct page_counter *parent = counter->parent;
>
> /*
> @@ -437,16 +455,19 @@ void page_counter_calculate_protection(struct page_counter *root,
> return;
>
> usage = page_counter_read(counter);
> - if (!usage)
> + local_usage = get_cgroup_local_usage(counter->memcg, true);
> + if (!usage || !local_usage)
> return;
>
> if (parent == root) {
> counter->emin = READ_ONCE(counter->min);
> counter->elow = READ_ONCE(counter->low);
> + counter->elocallow = READ_ONCE(counter->locallow);
> return;
> }
>
> parent_usage = page_counter_read(parent);
> + parent_local_usage = get_cgroup_local_usage(parent->memcg, true);
>
> WRITE_ONCE(counter->emin, effective_protection(usage, parent_usage,
> READ_ONCE(counter->min),
> @@ -454,7 +475,16 @@ void page_counter_calculate_protection(struct page_counter *root,
> atomic_long_read(&parent->children_min_usage),
> recursive_protection));
>
> - WRITE_ONCE(counter->elow, effective_protection(usage, parent_usage,
> + if (is_local)
> + WRITE_ONCE(counter->elocallow,
> + effective_protection(local_usage, parent_local_usage,
> + READ_ONCE(counter->locallow),
> + READ_ONCE(parent->elocallow),
> + atomic_long_read(&parent->children_locallow_usage),
> + recursive_protection));
> + else
> + WRITE_ONCE(counter->elow,
> + effective_protection(usage, parent_usage,
> READ_ONCE(counter->low),
> READ_ONCE(parent->elow),
> atomic_long_read(&parent->children_low_usage),
> --
> 2.43.0
>
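As a starting point for the multi-socket discussion above, one option is to compute the ratio over an explicit set of top-tier nodes instead of hardcoding nodes 0 and 1. The sketch below is standalone userspace C and purely illustrative: struct node_capacity, its top_tier flag, and local_low_for_nodes() are hypothetical, and the thread does not settle how nodes would be designated. It also treats every DRAM node as equally local, which sidesteps rather than answers the question of which socket is local to a given task.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-node description supplied by the caller. */
struct node_capacity {
	uint64_t total_pages;	/* capacity of the node, in pages */
	bool top_tier;		/* true if the node counts as top-tier */
};

/* Scale low by the share of capacity that lives on top-tier nodes. */
static uint64_t local_low_for_nodes(uint64_t low,
				    const struct node_capacity *nodes,
				    size_t nr_nodes)
{
	uint64_t local = 0, total = 0;
	size_t i;

	assert(nodes || nr_nodes == 0);

	for (i = 0; i < nr_nodes; i++) {
		total += nodes[i].total_pages;
		if (nodes[i].top_tier)
			local += nodes[i].total_pages;
	}

	if (!total)
		return 0;	/* no memory described, nothing to protect */

	return low * local / total;
}
```

On a two-node system this reduces to the same low * local_capacity / total_capacity arithmetic as the quoted hunk.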
* Re: [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion
2024-09-20 22:11 [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion kaiyang2
` (4 preceding siblings ...)
2024-10-11 20:51 ` [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion Kaiyang Zhao
@ 2024-11-08 19:01 ` kaiyang2
5 siblings, 0 replies; 13+ messages in thread
From: kaiyang2 @ 2024-11-08 19:01 UTC (permalink / raw)
To: linux-mm, cgroups
Cc: roman.gushchin, shakeel.butt, muchun.song, akpm, mhocko,
nehagholkar, abhishekd, hannes, weixugc, rientjes, gourry,
Kaiyang Zhao
From: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
Adding some performance results from testing on a *real* system with CXL memory
to demonstrate the value of the patches.
The system has 256GB local DRAM + 64GB CXL memory. We stack two workloads
together, each in its own cgroup. One is a microbenchmark that allocates memory
and accesses it at tunable hotness levels. It allocates 256GB of memory and
accesses it in sequential passes with a very hot access pattern (~1 second per
pass). The other workload is 64 instances of 520.omnetpp_r from SPEC CPU 2017,
which use about 14GB of memory in total. We apply memory bandwidth limits (1
Gbps memory bandwidth per logical core) and mitigate LLC contention by
setting cpuset for each cgroup.
Case 1: omnetpp running without the microbenchmark.
It can use all local memory and sees no resource contention. This is
the optimal case.
Avg rate reported by SPEC = 84.7
Case 2: Both workloads run stacked without the fairness patches, with the
microbenchmark started first.
Avg = 62.7 (-25.9%)
Case 3: Set memory.low = 19GB for both workloads. This provides enough local
memory low protection to cover the entire memory usage of omnetpp.
Avg = 75.3 (-11.1%)
Analysis: omnetpp still uses significant CXL memory (up to 3GB) by the time it
finishes because hint faults for it only trigger for a few seconds in the
~20 minute runtime. Due to the short runtime of the workload and how tiering
currently works, it finishes before its memory usage converges to the point
where all of its memory is local. However, this still represents a significant
improvement over case 2.
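To make the "enough local low protection" claim in case 3 concrete, here is a
rough sketch of the arithmetic, assuming the local protection is simply
memory.low scaled by the ratio of top-tier memory to total memory, as the cover
letter describes. The helper names (gib_to_pages, scaled_local_low) and the 4KiB
page size are assumptions made for illustration and are not taken from the
patches.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4KiB base pages, so 1GiB is 262144 pages. */
#define PAGES_PER_GIB (1024ULL * 1024ULL * 1024ULL / 4096ULL)

/* Hypothetical helper: convert a size in GiB to a page count. */
static uint64_t gib_to_pages(uint64_t gib)
{
	return gib * PAGES_PER_GIB;
}

/*
 * Hypothetical helper: scale a memory.low value (in pages) by the share
 * of top-tier memory in total memory. Working in pages keeps the
 * intermediate product well inside 64 bits for the sizes used here.
 */
static uint64_t scaled_local_low(uint64_t low, uint64_t toptier,
				 uint64_t total)
{
	assert(total != 0 && toptier <= total);
	return low * toptier / total;
}
```

With the numbers from this test (memory.low = 19GB, 256GB top-tier, 320GB
total), the scaled value comes out to 19 * 256 / 320 = 15.2GB, which is above
the roughly 14GB that omnetpp uses.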
Case 4: Set memory.low = 19GB for both workloads. Set memory.high = 257GB for
the microbenchmark.
Avg = 84.0 (<1% difference with case 1)
Analysis: by setting both memory.low and memory.high, the usage of local memory
is essentially provisioned for the microbenchmark. Therefore, even if the
microbenchmark starts first, when omnetpp starts it can get all local memory
from the very beginning and achieve performance close to the non-colocated case.
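For reference, the percentages quoted for cases 2 to 4 are relative changes
against the case 1 baseline of 84.7. A minimal sketch of that calculation
follows; the function name relative_change_pct is made up for illustration.

```c
#include <assert.h>

/*
 * Relative change of a measured SPEC rate against a baseline rate,
 * in percent. Negative values mean a slowdown relative to the baseline.
 */
static double relative_change_pct(double baseline, double measured)
{
	assert(baseline > 0.0);
	return (measured - baseline) / baseline * 100.0;
}
```

Calling it with the baseline 84.7 and the rates 62.7, 75.3 and 84.0 gives
values that agree with the figures above to within rounding in the last digit.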
We’re working on getting performance data from Meta’s production workloads.
Stay tuned for more results.
Thread overview: 13+ messages
2024-09-20 22:11 [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion kaiyang2
2024-09-20 22:11 ` [RFC PATCH 1/4] Add get_cgroup_local_usage for estimating the top-tier memory usage kaiyang2
2024-09-20 22:11 ` [RFC PATCH 2/4] calculate memory.low for the local node and track its usage kaiyang2
2024-09-21 23:18 ` kernel test robot
2024-09-22 8:39 ` kernel test robot
2024-10-15 22:05 ` Gregory Price
2024-09-20 22:11 ` [RFC PATCH 3/4] use memory.low local node protection for local node reclaim kaiyang2
2024-09-22 0:51 ` kernel test robot
2024-09-22 16:31 ` kernel test robot
2024-10-15 21:52 ` Gregory Price
2024-09-20 22:11 ` [RFC PATCH 4/4] reduce NUMA balancing scan size of cgroups over their local memory.low kaiyang2
2024-10-11 20:51 ` [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion Kaiyang Zhao
2024-11-08 19:01 ` kaiyang2