[RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion


* [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion
@ 2024-09-20 22:11 kaiyang2
  2024-09-20 22:11 ` [RFC PATCH 1/4] Add get_cgroup_local_usage for estimating the top-tier memory usage kaiyang2
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: kaiyang2 @ 2024-09-20 22:11 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: roman.gushchin, shakeel.butt, muchun.song, akpm, mhocko,
	nehagholkar, abhishekd, hannes, weixugc, rientjes, Kaiyang Zhao

From: Kaiyang Zhao <kaiyang2@cs.cmu.edu>

Currently in Linux, there is no concept of fairness in memory tiering. Depending
on the memory usage and access patterns of other colocated applications, an
application cannot be sure of how much memory in which tier it will get, and how
much its performance will suffer or benefit.

Fairness is, however, important in a multi-tenant system. For example, an
application may need to meet a certain tail latency requirement, which can be
difficult to satisfy without x amount of frequently accessed pages in top-tier
memory. Similarly, an application may want to declare a minimum throughput when
running on a system for capacity planning purposes, but without fairness
controls in memory tiering its throughput can fluctuate wildly as other
applications come and go on the system.

In this proposal, we amend the memory.low control in memcg to protect a cgroup’s
memory usage in top-tier memory. The low protection for top-tier memory is
memory.low scaled by the ratio of top-tier memory to total memory on the system.
The protection is then applied to reclaim for top-tier memory. Promotion by NUMA
balancing is also throttled, through a reduced scan size, when top-tier memory
is contended and the cgroup is over its protection.

Experiments with microbenchmarks exhibiting a range of memory access patterns
and memory sizes confirmed that when top-tier memory is contended, the
system moves towards a stable memory distribution where each cgroup’s memory
usage in local DRAM converges to the protected amount.

One notable missing part in the patches is determining which NUMA nodes have
top-tier memory; currently they hardcode node 0 as top-tier memory and
node 1 as a CPU-less node backed by CXL memory. We’re working on removing
this artifact and applying the protection correctly to all top-tier nodes in
the system.

Your feedback is greatly appreciated!

Kaiyang Zhao (4):
  Add get_cgroup_local_usage for estimating the top-tier memory usage
  calculate memory.low for the local node and track its usage
  use memory.low local node protection for local node reclaim
  reduce NUMA balancing scan size of cgroups over their local memory.low

 include/linux/memcontrol.h   | 25 ++++++++-----
 include/linux/page_counter.h | 16 ++++++---
 kernel/sched/fair.c          | 54 +++++++++++++++++++++++++---
 mm/hugetlb_cgroup.c          |  4 +--
 mm/memcontrol.c              | 68 ++++++++++++++++++++++++++++++------
 mm/page_counter.c            | 52 +++++++++++++++++++++------
 mm/vmscan.c                  | 19 +++++++---
 7 files changed, 192 insertions(+), 46 deletions(-)

-- 
2.43.0


* [RFC PATCH 1/4] Add get_cgroup_local_usage for estimating the top-tier memory usage
  2024-09-20 22:11 [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion kaiyang2
@ 2024-09-20 22:11 ` kaiyang2
  2024-09-20 22:11 ` [RFC PATCH 2/4] calculate memory.low for the local node and track its usage kaiyang2
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: kaiyang2 @ 2024-09-20 22:11 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: roman.gushchin, shakeel.butt, muchun.song, akpm, mhocko,
	nehagholkar, abhishekd, hannes, weixugc, rientjes, Kaiyang Zhao

From: Kaiyang Zhao <kaiyang2@cs.cmu.edu>

Approximate the usage of top-tier memory of a cgroup by its anon,
file, shmem and slab sizes in the top-tier.

Signed-off-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
---
 include/linux/memcontrol.h |  2 ++
 mm/memcontrol.c            | 24 ++++++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 34d2da05f2f1..94aba4498fca 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -648,6 +648,8 @@ static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
 		memcg == target;
 }
 
+unsigned long get_cgroup_local_usage(struct mem_cgroup *memcg, bool flush);
+
 static inline bool mem_cgroup_below_low(struct mem_cgroup *target,
 					struct mem_cgroup *memcg)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f19a58c252f0..20b715441332 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -855,6 +855,30 @@ unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
 	return READ_ONCE(memcg->vmstats->events_local[i]);
 }
 
+/* Usage is in pages. */
+unsigned long get_cgroup_local_usage(struct mem_cgroup *memcg, bool flush)
+{
+	struct lruvec *lruvec;
+	const int local_nid = 0;
+
+	if (!memcg)
+		return 0;
+
+	if (flush)
+		mem_cgroup_flush_stats_ratelimited(memcg);
+
+	lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(local_nid));
+	unsigned long anon = lruvec_page_state(lruvec, NR_ANON_MAPPED);
+	unsigned long file = lruvec_page_state(lruvec, NR_FILE_PAGES);
+	unsigned long shmem = lruvec_page_state(lruvec, NR_SHMEM);
+	/* Slab sizes are in bytes */
+	unsigned long slab =
+		lruvec_page_state(lruvec, NR_SLAB_RECLAIMABLE_B) / PAGE_SIZE
+		+ lruvec_page_state(lruvec, NR_SLAB_UNRECLAIMABLE_B) / PAGE_SIZE;
+
+	return anon + file + shmem + slab;
+}
+
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
 {
 	/*
-- 
2.43.0



* [RFC PATCH 2/4] calculate memory.low for the local node and track its usage
  2024-09-20 22:11 [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion kaiyang2
  2024-09-20 22:11 ` [RFC PATCH 1/4] Add get_cgroup_local_usage for estimating the top-tier memory usage kaiyang2
@ 2024-09-20 22:11 ` kaiyang2
  2024-09-21 23:18   ` kernel test robot
                     ` (2 more replies)
  2024-09-20 22:11 ` [RFC PATCH 3/4] use memory.low local node protection for local node reclaim kaiyang2
                   ` (3 subsequent siblings)
  5 siblings, 3 replies; 13+ messages in thread
From: kaiyang2 @ 2024-09-20 22:11 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: roman.gushchin, shakeel.butt, muchun.song, akpm, mhocko,
	nehagholkar, abhishekd, hannes, weixugc, rientjes, Kaiyang Zhao

From: Kaiyang Zhao <kaiyang2@cs.cmu.edu>

Add a memory.low for the top-tier node (locallow) and track its usage.
locallow is set by scaling low by the ratio of node 0 capacity and
node 0 + node 1 capacity.
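
For illustration only (not part of the patch), the scaling described above can
be sketched in userspace C. The helper name scale_locallow() and the example
capacities are assumptions made for this sketch. The quotient/remainder split
is similar in spirit to the kernel's mult_frac() and is used here because a
plain low * local_capacity / total_capacity may wrap when low is very large.

```c
/*
 * Hypothetical userspace sketch of the locallow scaling:
 *   locallow = low * local_capacity / (local_capacity + remote_capacity)
 * All quantities are in pages.
 */
static unsigned long scale_locallow(unsigned long low,
				    unsigned long local_capacity,
				    unsigned long remote_capacity)
{
	unsigned long total_capacity = local_capacity + remote_capacity;
	unsigned long quot, rem;

	/* No memory reported at all: nothing to protect. */
	if (!total_capacity)
		return 0;

	/*
	 * Divide first, then multiply, so the intermediate product is
	 * less likely to overflow than low * local_capacity.
	 */
	quot = low / total_capacity;
	rem = low % total_capacity;

	return quot * local_capacity + rem * local_capacity / total_capacity;
}
```

With 4 KiB pages, a memory.low of 4 GiB (1048576 pages) on a system with
16 GiB on node 0 and 48 GiB on node 1 would give a locallow of one quarter of
memory.low under this sketch.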

Signed-off-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
---
 include/linux/page_counter.h | 16 ++++++++---
 mm/hugetlb_cgroup.c          |  4 +--
 mm/memcontrol.c              | 42 ++++++++++++++++++++++-------
 mm/page_counter.c            | 52 ++++++++++++++++++++++++++++--------
 4 files changed, 88 insertions(+), 26 deletions(-)

diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index 79dbd8bc35a7..aa56c93415ef 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -13,6 +13,7 @@ struct page_counter {
 	 * memcg->memory.usage is a hot member of struct mem_cgroup.
 	 */
 	atomic_long_t usage;
+	struct mem_cgroup *memcg; /* memcg that owns this counter */
 	CACHELINE_PADDING(_pad1_);
 
 	/* effective memory.min and memory.min usage tracking */
@@ -25,6 +26,10 @@ struct page_counter {
 	atomic_long_t low_usage;
 	atomic_long_t children_low_usage;
 
+	unsigned long elocallow;
+	atomic_long_t locallow_usage;
+	atomic_long_t children_locallow_usage;
+
 	unsigned long watermark;
 	/* Latest cg2 reset watermark */
 	unsigned long local_watermark;
@@ -36,6 +41,7 @@ struct page_counter {
 	bool protection_support;
 	unsigned long min;
 	unsigned long low;
+	unsigned long locallow;
 	unsigned long high;
 	unsigned long max;
 	struct page_counter *parent;
@@ -52,12 +58,13 @@ struct page_counter {
  */
 static inline void page_counter_init(struct page_counter *counter,
 				     struct page_counter *parent,
-				     bool protection_support)
+				     bool protection_support, struct mem_cgroup *memcg)
 {
 	counter->usage = (atomic_long_t)ATOMIC_LONG_INIT(0);
 	counter->max = PAGE_COUNTER_MAX;
 	counter->parent = parent;
 	counter->protection_support = protection_support;
+	counter->memcg = memcg;
 }
 
 static inline unsigned long page_counter_read(struct page_counter *counter)
@@ -72,7 +79,8 @@ bool page_counter_try_charge(struct page_counter *counter,
 			     struct page_counter **fail);
 void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
 void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages);
-void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages);
+void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages,
+					unsigned long nr_pages_local);
 
 static inline void page_counter_set_high(struct page_counter *counter,
 					 unsigned long nr_pages)
@@ -99,11 +107,11 @@ static inline void page_counter_reset_watermark(struct page_counter *counter)
 #ifdef CONFIG_MEMCG
 void page_counter_calculate_protection(struct page_counter *root,
 				       struct page_counter *counter,
-				       bool recursive_protection);
+				       bool recursive_protection, int is_local);
 #else
 static inline void page_counter_calculate_protection(struct page_counter *root,
 						     struct page_counter *counter,
-						     bool recursive_protection) {}
+						     bool recursive_protection, int is_local) {}
 #endif
 
 #endif /* _LINUX_PAGE_COUNTER_H */
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index d8d0e665caed..0e07a7a1d5b8 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -114,10 +114,10 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup,
 		}
 		page_counter_init(hugetlb_cgroup_counter_from_cgroup(h_cgroup,
 								     idx),
-				  fault_parent, false);
+				  fault_parent, false, NULL);
 		page_counter_init(
 			hugetlb_cgroup_counter_from_cgroup_rsvd(h_cgroup, idx),
-			rsvd_parent, false);
+			rsvd_parent, false, NULL);
 
 		limit = round_down(PAGE_COUNTER_MAX,
 				   pages_per_huge_page(&hstates[idx]));
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 20b715441332..d7c5fff12105 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1497,6 +1497,9 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 			       vm_event_name(memcg_vm_event_stat[i]),
 			       memcg_events(memcg, memcg_vm_event_stat[i]));
 	}
+
+	seq_buf_printf(s, "local_usage %lu\n",
+		       get_cgroup_local_usage(memcg, true));
 }
 
 static void memory_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
@@ -3597,8 +3600,8 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	if (parent) {
 		WRITE_ONCE(memcg->swappiness, mem_cgroup_swappiness(parent));
 
-		page_counter_init(&memcg->memory, &parent->memory, true);
-		page_counter_init(&memcg->swap, &parent->swap, false);
+		page_counter_init(&memcg->memory, &parent->memory, true, memcg);
+		page_counter_init(&memcg->swap, &parent->swap, false, NULL);
 #ifdef CONFIG_MEMCG_V1
 		WRITE_ONCE(memcg->oom_kill_disable, READ_ONCE(parent->oom_kill_disable));
 		page_counter_init(&memcg->kmem, &parent->kmem, false);
@@ -3607,8 +3610,8 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	} else {
 		init_memcg_stats();
 		init_memcg_events();
-		page_counter_init(&memcg->memory, NULL, true);
-		page_counter_init(&memcg->swap, NULL, false);
+		page_counter_init(&memcg->memory, NULL, true, memcg);
+		page_counter_init(&memcg->swap, NULL, false, NULL);
 #ifdef CONFIG_MEMCG_V1
 		page_counter_init(&memcg->kmem, NULL, false);
 		page_counter_init(&memcg->tcpmem, NULL, false);
@@ -3677,7 +3680,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	memcg1_css_offline(memcg);
 
 	page_counter_set_min(&memcg->memory, 0);
-	page_counter_set_low(&memcg->memory, 0);
+	page_counter_set_low(&memcg->memory, 0, 0);
 
 	zswap_memcg_offline_cleanup(memcg);
 
@@ -3748,7 +3751,7 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
 	page_counter_set_max(&memcg->tcpmem, PAGE_COUNTER_MAX);
 #endif
 	page_counter_set_min(&memcg->memory, 0);
-	page_counter_set_low(&memcg->memory, 0);
+	page_counter_set_low(&memcg->memory, 0, 0);
 	page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX);
 	memcg1_soft_limit_reset(memcg);
 	page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
@@ -4051,6 +4054,12 @@ static ssize_t memory_min_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static int memory_locallow_show(struct seq_file *m, void *v)
+{
+	return seq_puts_memcg_tunable(m,
+		READ_ONCE(mem_cgroup_from_seq(m)->memory.locallow));
+}
+
 static int memory_low_show(struct seq_file *m, void *v)
 {
 	return seq_puts_memcg_tunable(m,
@@ -4061,7 +4070,8 @@ static ssize_t memory_low_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
-	unsigned long low;
+	struct sysinfo si;
+	unsigned long low, locallow, local_capacity, total_capacity;
 	int err;
 
 	buf = strstrip(buf);
@@ -4069,7 +4079,15 @@ static ssize_t memory_low_write(struct kernfs_open_file *of,
 	if (err)
 		return err;
 
-	page_counter_set_low(&memcg->memory, low);
+	/* Hardcoded 0 for local node and 1 for remote. */
+	si_meminfo_node(&si, 0);
+	local_capacity = si.totalram; /* In pages. */
+	total_capacity = local_capacity;
+	si_meminfo_node(&si, 1);
+	total_capacity += si.totalram;
+	locallow = low * local_capacity / total_capacity;
+
+	page_counter_set_low(&memcg->memory, low, locallow);
 
 	return nbytes;
 }
@@ -4394,6 +4412,11 @@ static struct cftype memory_files[] = {
 		.seq_show = memory_low_show,
 		.write = memory_low_write,
 	},
+	{
+		.name = "locallow",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_locallow_show,
+	},
 	{
 		.name = "high",
 		.flags = CFTYPE_NOT_ON_ROOT,
@@ -4483,7 +4506,8 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 	if (!root)
 		root = root_mem_cgroup;
 
-	page_counter_calculate_protection(&root->memory, &memcg->memory, recursive_protection);
+	page_counter_calculate_protection(&root->memory, &memcg->memory,
+					recursive_protection, false);
 }
 
 static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
diff --git a/mm/page_counter.c b/mm/page_counter.c
index b249d15af9dd..97205aafab46 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -18,8 +18,10 @@ static bool track_protection(struct page_counter *c)
 	return c->protection_support;
 }
 
+extern unsigned long get_cgroup_local_usage(struct mem_cgroup *memcg, bool flush);
+
 static void propagate_protected_usage(struct page_counter *c,
-				      unsigned long usage)
+				      unsigned long usage, unsigned long local_usage)
 {
 	unsigned long protected, old_protected;
 	long delta;
@@ -44,6 +46,15 @@ static void propagate_protected_usage(struct page_counter *c,
 		if (delta)
 			atomic_long_add(delta, &c->parent->children_low_usage);
 	}
+
+	protected = min(local_usage, READ_ONCE(c->locallow));
+	old_protected = atomic_long_read(&c->locallow_usage);
+	if (protected != old_protected) {
+		old_protected = atomic_long_xchg(&c->locallow_usage, protected);
+		delta = protected - old_protected;
+		if (delta)
+			atomic_long_add(delta, &c->parent->children_locallow_usage);
+	}
 }
 
 /**
@@ -63,7 +74,8 @@ void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
 		atomic_long_set(&counter->usage, new);
 	}
 	if (track_protection(counter))
-		propagate_protected_usage(counter, new);
+		propagate_protected_usage(counter, new,
+				get_cgroup_local_usage(counter->memcg, false));
 }
 
 /**
@@ -83,7 +95,8 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
 
 		new = atomic_long_add_return(nr_pages, &c->usage);
 		if (protection)
-			propagate_protected_usage(c, new);
+			propagate_protected_usage(c, new,
+					get_cgroup_local_usage(counter->memcg, false));
 		/*
 		 * This is indeed racy, but we can live with some
 		 * inaccuracy in the watermark.
@@ -151,7 +164,8 @@ bool page_counter_try_charge(struct page_counter *counter,
 			goto failed;
 		}
 		if (protection)
-			propagate_protected_usage(c, new);
+			propagate_protected_usage(c, new,
+					get_cgroup_local_usage(counter->memcg, false));
 
 		/* see comment on page_counter_charge */
 		if (new > READ_ONCE(c->local_watermark)) {
@@ -238,7 +252,8 @@ void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages)
 	WRITE_ONCE(counter->min, nr_pages);
 
 	for (c = counter; c; c = c->parent)
-		propagate_protected_usage(c, atomic_long_read(&c->usage));
+		propagate_protected_usage(c, atomic_long_read(&c->usage),
+				get_cgroup_local_usage(counter->memcg, false));
 }
 
 /**
@@ -248,14 +263,17 @@ void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages)
  *
  * The caller must serialize invocations on the same counter.
  */
-void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages)
+void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages,
+				unsigned long nr_pages_local)
 {
 	struct page_counter *c;
 
 	WRITE_ONCE(counter->low, nr_pages);
+	WRITE_ONCE(counter->locallow, nr_pages_local);
 
 	for (c = counter; c; c = c->parent)
-		propagate_protected_usage(c, atomic_long_read(&c->usage));
+		propagate_protected_usage(c, atomic_long_read(&c->usage),
+				get_cgroup_local_usage(counter->memcg, false));
 }
 
 /**
@@ -421,9 +439,9 @@ static unsigned long effective_protection(unsigned long usage,
  */
 void page_counter_calculate_protection(struct page_counter *root,
 				       struct page_counter *counter,
-				       bool recursive_protection)
+				       bool recursive_protection, int is_local)
 {
-	unsigned long usage, parent_usage;
+	unsigned long usage, parent_usage, local_usage, parent_local_usage;
 	struct page_counter *parent = counter->parent;
 
 	/*
@@ -437,16 +455,19 @@ void page_counter_calculate_protection(struct page_counter *root,
 		return;
 
 	usage = page_counter_read(counter);
-	if (!usage)
+	local_usage = get_cgroup_local_usage(counter->memcg, true);
+	if (!usage || !local_usage)
 		return;
 
 	if (parent == root) {
 		counter->emin = READ_ONCE(counter->min);
 		counter->elow = READ_ONCE(counter->low);
+		counter->elocallow = READ_ONCE(counter->locallow);
 		return;
 	}
 
 	parent_usage = page_counter_read(parent);
+	parent_local_usage = get_cgroup_local_usage(parent->memcg, true);
 
 	WRITE_ONCE(counter->emin, effective_protection(usage, parent_usage,
 			READ_ONCE(counter->min),
@@ -454,7 +475,16 @@ void page_counter_calculate_protection(struct page_counter *root,
 			atomic_long_read(&parent->children_min_usage),
 			recursive_protection));
 
-	WRITE_ONCE(counter->elow, effective_protection(usage, parent_usage,
+	if (is_local)
+		WRITE_ONCE(counter->elocallow,
+			effective_protection(local_usage, parent_local_usage,
+			READ_ONCE(counter->locallow),
+			READ_ONCE(parent->elocallow),
+			atomic_long_read(&parent->children_locallow_usage),
+			recursive_protection));
+	else
+		WRITE_ONCE(counter->elow,
+			effective_protection(usage, parent_usage,
 			READ_ONCE(counter->low),
 			READ_ONCE(parent->elow),
 			atomic_long_read(&parent->children_low_usage),
-- 
2.43.0



* [RFC PATCH 3/4] use memory.low local node protection for local node reclaim
  2024-09-20 22:11 [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion kaiyang2
  2024-09-20 22:11 ` [RFC PATCH 1/4] Add get_cgroup_local_usage for estimating the top-tier memory usage kaiyang2
  2024-09-20 22:11 ` [RFC PATCH 2/4] calculate memory.low for the local node and track its usage kaiyang2
@ 2024-09-20 22:11 ` kaiyang2
  2024-09-22  0:51   ` kernel test robot
                     ` (2 more replies)
  2024-09-20 22:11 ` [RFC PATCH 4/4] reduce NUMA balancing scan size of cgroups over their local memory.low kaiyang2
                   ` (2 subsequent siblings)
  5 siblings, 3 replies; 13+ messages in thread
From: kaiyang2 @ 2024-09-20 22:11 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: roman.gushchin, shakeel.butt, muchun.song, akpm, mhocko,
	nehagholkar, abhishekd, hannes, weixugc, rientjes, Kaiyang Zhao

From: Kaiyang Zhao <kaiyang2@cs.cmu.edu>

When reclaim targets the top-tier node usage by the root memcg,
apply local memory.low protection instead of global protection.
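
For illustration only (not part of the patch), the selection between the two
protections can be sketched in userspace C. The struct cg_snapshot and the
function cg_below_low() are hypothetical names invented for this sketch; the
real check also consults mem_cgroup_unprotected(), which is omitted here.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the per-cgroup values the check consults. */
struct cg_snapshot {
	unsigned long usage;       /* total charged pages */
	unsigned long local_usage; /* pages estimated on the top-tier node */
	unsigned long elow;        /* effective memory.low, in pages */
	unsigned long elocallow;   /* effective top-tier share of memory.low */
};

/*
 * When reclaim is local (root reclaim on the top-tier node), compare the
 * top-tier usage against the local protection; otherwise fall back to the
 * existing comparison of total usage against memory.low.
 */
static bool cg_below_low(const struct cg_snapshot *cg, bool is_local)
{
	if (is_local)
		return cg->elocallow >= cg->local_usage;

	return cg->elow >= cg->usage;
}
```

A cgroup can therefore be over its global memory.low yet still be protected
from top-tier reclaim, as long as its top-tier usage stays within elocallow.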

Signed-off-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
---
 include/linux/memcontrol.h | 23 ++++++++++++++---------
 mm/memcontrol.c            |  4 ++--
 mm/vmscan.c                | 19 ++++++++++++++-----
 3 files changed, 30 insertions(+), 16 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 94aba4498fca..256912b91922 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -586,9 +586,9 @@ static inline bool mem_cgroup_disabled(void)
 static inline void mem_cgroup_protection(struct mem_cgroup *root,
 					 struct mem_cgroup *memcg,
 					 unsigned long *min,
-					 unsigned long *low)
+					 unsigned long *low, unsigned long *locallow)
 {
-	*min = *low = 0;
+	*min = *low = *locallow = 0;
 
 	if (mem_cgroup_disabled())
 		return;
@@ -631,10 +631,11 @@ static inline void mem_cgroup_protection(struct mem_cgroup *root,
 
 	*min = READ_ONCE(memcg->memory.emin);
 	*low = READ_ONCE(memcg->memory.elow);
+	*locallow = READ_ONCE(memcg->memory.elocallow);
 }
 
 void mem_cgroup_calculate_protection(struct mem_cgroup *root,
-				     struct mem_cgroup *memcg);
+				     struct mem_cgroup *memcg, int is_local);
 
 static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
 					  struct mem_cgroup *memcg)
@@ -651,13 +652,17 @@ static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
 unsigned long get_cgroup_local_usage(struct mem_cgroup *memcg, bool flush);
 
 static inline bool mem_cgroup_below_low(struct mem_cgroup *target,
-					struct mem_cgroup *memcg)
+					struct mem_cgroup *memcg, int is_local)
 {
 	if (mem_cgroup_unprotected(target, memcg))
 		return false;
 
-	return READ_ONCE(memcg->memory.elow) >=
-		page_counter_read(&memcg->memory);
+	if (is_local)
+		return READ_ONCE(memcg->memory.elocallow) >=
+			get_cgroup_local_usage(memcg, true);
+	else
+		return READ_ONCE(memcg->memory.elow) >=
+			page_counter_read(&memcg->memory);
 }
 
 static inline bool mem_cgroup_below_min(struct mem_cgroup *target,
@@ -1159,13 +1164,13 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
 static inline void mem_cgroup_protection(struct mem_cgroup *root,
 					 struct mem_cgroup *memcg,
 					 unsigned long *min,
-					 unsigned long *low)
+					 unsigned long *low, unsigned long *locallow)
 {
 	*min = *low = 0;
 }
 
 static inline void mem_cgroup_calculate_protection(struct mem_cgroup *root,
-						   struct mem_cgroup *memcg)
+						   struct mem_cgroup *memcg, int is_local)
 {
 }
 
@@ -1175,7 +1180,7 @@ static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
 	return true;
 }
 static inline bool mem_cgroup_below_low(struct mem_cgroup *target,
-					struct mem_cgroup *memcg)
+					struct mem_cgroup *memcg, int is_local)
 {
 	return false;
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d7c5fff12105..61718ba998fe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4495,7 +4495,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
  *          of a top-down tree iteration, not for isolated queries.
  */
 void mem_cgroup_calculate_protection(struct mem_cgroup *root,
-				     struct mem_cgroup *memcg)
+				     struct mem_cgroup *memcg, int is_local)
 {
 	bool recursive_protection =
 		cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT;
@@ -4507,7 +4507,7 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 		root = root_mem_cgroup;
 
 	page_counter_calculate_protection(&root->memory, &memcg->memory,
-					recursive_protection, false);
+					recursive_protection, is_local);
 }
 
 static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ce471d686a88..a2681d52fc5f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2377,6 +2377,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	enum scan_balance scan_balance;
 	unsigned long ap, fp;
 	enum lru_list lru;
+	int is_local = (pgdat->node_id == 0) && root_reclaim(sc);
 
 	/* If we have no swap space, do not bother scanning anon folios. */
 	if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id, sc)) {
@@ -2457,12 +2458,14 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	for_each_evictable_lru(lru) {
 		bool file = is_file_lru(lru);
 		unsigned long lruvec_size;
-		unsigned long low, min;
+		unsigned long low, min, locallow;
 		unsigned long scan;
 
 		lruvec_size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
 		mem_cgroup_protection(sc->target_mem_cgroup, memcg,
-				      &min, &low);
+				      &min, &low, &locallow);
+		if (is_local)
+			low = locallow;
 
 		if (min || low) {
 			/*
@@ -2494,7 +2497,12 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 			 * again by how much of the total memory used is under
 			 * hard protection.
 			 */
-			unsigned long cgroup_size = mem_cgroup_size(memcg);
+			unsigned long cgroup_size;
+
+			if (is_local)
+				cgroup_size = get_cgroup_local_usage(memcg, true);
+			else
+				cgroup_size = mem_cgroup_size(memcg);
 			unsigned long protection;
 
 			/* memory.low scaling, make sure we retry before OOM */
@@ -5869,6 +5877,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 	};
 	struct mem_cgroup_reclaim_cookie *partial = &reclaim;
 	struct mem_cgroup *memcg;
+	int is_local = (pgdat->node_id == 0) && root_reclaim(sc);
 
 	/*
 	 * In most cases, direct reclaimers can do partial walks
@@ -5896,7 +5905,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 		 */
 		cond_resched();
 
-		mem_cgroup_calculate_protection(target_memcg, memcg);
+		mem_cgroup_calculate_protection(target_memcg, memcg, is_local);
 
 		if (mem_cgroup_below_min(target_memcg, memcg)) {
 			/*
@@ -5904,7 +5913,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 			 * If there is no reclaimable memory, OOM.
 			 */
 			continue;
-		} else if (mem_cgroup_below_low(target_memcg, memcg)) {
+		} else if (mem_cgroup_below_low(target_memcg, memcg, is_local)) {
 			/*
 			 * Soft protection.
 			 * Respect the protection only as long as
-- 
2.43.0



* [RFC PATCH 4/4] reduce NUMA balancing scan size of cgroups over their local memory.low
  2024-09-20 22:11 [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion kaiyang2
                   ` (2 preceding siblings ...)
  2024-09-20 22:11 ` [RFC PATCH 3/4] use memory.low local node protection for local node reclaim kaiyang2
@ 2024-09-20 22:11 ` kaiyang2
  2024-10-11 20:51 ` [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion Kaiyang Zhao
  2024-11-08 19:01 ` kaiyang2
  5 siblings, 0 replies; 13+ messages in thread
From: kaiyang2 @ 2024-09-20 22:11 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: roman.gushchin, shakeel.butt, muchun.song, akpm, mhocko,
	nehagholkar, abhishekd, hannes, weixugc, rientjes, Kaiyang Zhao

From: Kaiyang Zhao <kaiyang2@cs.cmu.edu>

When the top-tier node has less free memory than the promotion watermark,
reduce the scan size of cgroups that are over their local memory.low
in proportion to their overage. In this case, the top-tier memory usage
of the cgroup should be reduced, and demotion is already working towards
that goal. A smaller scan size should slow the rate of promotion for
the cgroup so that it does not work against demotion.

A minimum of 1/16th of sysctl_numa_balancing_scan_size is still allowed
for such cgroups because identifying hot pages trapped in the slow tier is
still a worthy goal in this case (although a secondary objective).
The value 16 is arbitrary and may need tuning.
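
For illustration only (not part of the patch), the throttling arithmetic can
be sketched in userspace C. The function name throttled_scan_pages() is an
assumption made for this sketch; the four-fold multiplication by
locallow / (usage + 1) and the 1/16 floor follow the description above.

```c
/*
 * Hypothetical userspace sketch of the scan-size reduction.
 * pages:    full scan size in pages
 * usage:    cgroup's estimated top-tier usage in pages
 * locallow: cgroup's effective top-tier protection in pages
 */
static unsigned long long throttled_scan_pages(unsigned long long pages,
					       unsigned long long usage,
					       unsigned long long locallow)
{
	/* Floor: 1/16th of the full scan size is always allowed. */
	unsigned long long min_scan_pages = pages / 16;
	int i;

	/* Cgroups within their protection keep the full scan size. */
	if (usage <= locallow)
		return pages;

	/* Scale by (locallow / usage)^4 using integer arithmetic. */
	for (i = 0; i < 4; i++)
		pages = pages * locallow / (usage + 1);

	return pages > min_scan_pages ? pages : min_scan_pages;
}
```

With a scan size of 65536 pages (256 MB at 4 KiB pages), a cgroup that is
10% over its protection keeps roughly 68% of the scan size in this sketch,
while a cgroup that is 100% over falls to the 1/16 floor.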

Signed-off-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
---
 kernel/sched/fair.c | 54 ++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 49 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a1b756f927b2..1737b2369f56 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1727,14 +1727,21 @@ static inline bool cpupid_valid(int cpupid)
  * advantage of fast memory capacity, all recently accessed slow
  * memory pages will be migrated to fast memory node without
  * considering hot threshold.
+ * This is also used for detecting memory pressure and deciding whether
+ * limiting promotion scan size is needed, for which we don't require
+ * more free pages than the promo watermark.
  */
-static bool pgdat_free_space_enough(struct pglist_data *pgdat)
+static bool pgdat_free_space_enough(struct pglist_data *pgdat,
+						bool require_extra)
 {
 	int z;
 	unsigned long enough_wmark;
 
-	enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT,
-			   pgdat->node_present_pages >> 4);
+	if (require_extra)
+		enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT,
+				pgdat->node_present_pages >> 4);
+	else
+		enough_wmark = 0;
 	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
 		struct zone *zone = pgdat->node_zones + z;
 
@@ -1846,7 +1853,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
 		unsigned int latency, th, def_th;
 
 		pgdat = NODE_DATA(dst_nid);
-		if (pgdat_free_space_enough(pgdat)) {
+		if (pgdat_free_space_enough(pgdat, true)) {
 			/* workload changed, reset hot threshold */
 			pgdat->nbp_threshold = 0;
 			return true;
@@ -3214,10 +3221,14 @@ static void task_numa_work(struct callback_head *work)
 	struct vm_area_struct *vma;
 	unsigned long start, end;
 	unsigned long nr_pte_updates = 0;
-	long pages, virtpages;
+	long pages, virtpages, min_scan_pages;
 	struct vma_iterator vmi;
 	bool vma_pids_skipped;
 	bool vma_pids_forced = false;
+	struct pglist_data *pgdat = NODE_DATA(0);  /* hardcoded node 0 */
+	struct mem_cgroup *memcg;
+	unsigned long cgroup_size, cgroup_locallow;
+	const long min_scan_pages_fraction = 16; /* 1/16th of the scan size */
 
 	SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));
 
@@ -3262,6 +3273,39 @@ static void task_numa_work(struct callback_head *work)
 
 	pages = sysctl_numa_balancing_scan_size;
 	pages <<= 20 - PAGE_SHIFT; /* MB in pages */
+
+	min_scan_pages = pages;
+	min_scan_pages /= min_scan_pages_fraction;
+
+	memcg = get_mem_cgroup_from_current();
+	/*
+	 * Reduce the scan size when the local node is under pressure
+	 * (WMARK_PROMO is not satisfied),
+	 * proportional to a cgroup's overage of local memory guarantee.
+	 * 10% over: 68% of scan size
+	 * 20% over: 48% of scan size
+	 * 50% over: 20% of scan size
+	 * 100% over: 6% of scan size
+	 */
+	if (likely(memcg)) {
+		if (!pgdat_free_space_enough(pgdat, false)) {
+			cgroup_size = get_cgroup_local_usage(memcg, false);
+			/*
+			 * Protection needs refreshing, but reclaim on the cgroup
+			 * should have refreshed recently.
+			 */
+			cgroup_locallow = READ_ONCE(memcg->memory.elocallow);
+			if (cgroup_size > cgroup_locallow) {
+				/* 1/x^4 */
+				for (int i = 0; i < 4; i++)
+					pages = pages * cgroup_locallow / (cgroup_size + 1);
+				/* Lower bound to min_scan_pages. */
+				pages = max(pages, min_scan_pages);
+			}
+		}
+		css_put(&memcg->css);
+	}
+
 	virtpages = pages * 8;	   /* Scan up to this much virtual space */
 	if (!pages)
 		return;
-- 
2.43.0
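
The throttle in the task_numa_work() hunk above can be modeled in userspace as follows. This is only a sketch of the arithmetic, not the kernel code: the helper name throttled_scan_pages() is hypothetical, and its usage and protection parameters stand in for cgroup_size and cgroup_locallow from the patch. The check on pgdat_free_space_enough() is left out, so the model assumes the node is already under pressure.

```c
#include <assert.h>

/*
 * Userspace model of the scan-size throttle (hypothetical helper).
 * scan_pages:  unthrottled scan window, in pages
 * usage:       cgroup's top-tier usage, in pages (cgroup_size in the patch)
 * protection:  effective local low protection, in pages (cgroup_locallow)
 */
static long throttled_scan_pages(long scan_pages, unsigned long usage,
				 unsigned long protection)
{
	const long min_fraction = 16;	/* floor is 1/16th of the scan size */
	long min_pages = scan_pages / min_fraction;
	long pages = scan_pages;

	assert(scan_pages >= 0);

	/* At or under the protection: no throttling. */
	if (usage <= protection)
		return pages;

	/* Multiply by (protection / usage) four times, i.e. 1/x^4. */
	for (int i = 0; i < 4; i++)
		pages = pages * protection / (usage + 1);

	/* Never shrink below the floor. */
	return pages > min_pages ? pages : min_pages;
}
```

With a 65536-page window (256 MB of 4 KiB pages), a cgroup 10% over its protection (usage 1100, protection 1000) gets 44598 pages, about 68% of the window, which matches the table in the patch comment. A cgroup 100% over would compute 4087 pages and is raised to the 4096-page floor.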



* Re: [RFC PATCH 2/4] calculate memory.low for the local node and track its usage
  2024-09-20 22:11 ` [RFC PATCH 2/4] calculate memory.low for the local node and track its usage kaiyang2
@ 2024-09-21 23:18   ` kernel test robot
  2024-09-22  8:39   ` kernel test robot
  2024-10-15 22:05   ` Gregory Price
  2 siblings, 0 replies; 13+ messages in thread
From: kernel test robot @ 2024-09-21 23:18 UTC (permalink / raw)
  To: kaiyang2; +Cc: oe-kbuild-all

Hi,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on linus/master next-20240920]
[cannot apply to tip/sched/core v6.11]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/kaiyang2-cs-cmu-edu/Add-get_cgroup_local_usage-for-estimating-the-top-tier-memory-usage/20240921-061404
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20240920221202.1734227-3-kaiyang2%40cs.cmu.edu
patch subject: [RFC PATCH 2/4] calculate memory.low for the local node and track its usage
config: x86_64-rhel-8.3 (https://download.01.org/0day-ci/archive/20240922/202409220804.TAoLKEBm-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240922/202409220804.TAoLKEBm-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409220804.TAoLKEBm-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> mm/page_counter.c:268: warning: Function parameter or struct member 'nr_pages_local' not described in 'page_counter_set_low'
>> mm/page_counter.c:443: warning: Function parameter or struct member 'is_local' not described in 'page_counter_calculate_protection'
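
Warnings of this kind usually go away once every parameter has an @name line in the kernel-doc comment. The sketch below shows the shape of such a comment on a userspace stand-in for page_counter_set_low(). The struct page_counter_model type and the _model suffix are hypothetical, and the wording of the @nr_pages_local description is a guess based on the WRITE_ONCE(counter->locallow, nr_pages_local) line in the patch, not the author's text.

```c
#include <assert.h>

/* Hypothetical stand-in for struct page_counter: only the fields set here. */
struct page_counter_model {
	unsigned long low;
	unsigned long locallow;
};

/**
 * page_counter_set_low_model - set the amount of protected memory
 * @counter: counter
 * @nr_pages: value to set
 * @nr_pages_local: value to set for the local-node protection
 *                  (assumed wording, not taken from the patch)
 *
 * The caller must serialize invocations on the same counter.
 */
static void page_counter_set_low_model(struct page_counter_model *counter,
				       unsigned long nr_pages,
				       unsigned long nr_pages_local)
{
	assert(counter);
	counter->low = nr_pages;
	counter->locallow = nr_pages_local;
}
```

The same pattern would apply to the 'is_local' parameter of page_counter_calculate_protection(); its description is left to the patch author, since its exact semantics are not spelled out in this report.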


vim +268 mm/page_counter.c

bf8d5d52ffe89a Roman Gushchin    2018-06-07  258  
230671533d6463 Roman Gushchin    2018-06-07  259  /**
230671533d6463 Roman Gushchin    2018-06-07  260   * page_counter_set_low - set the amount of protected memory
230671533d6463 Roman Gushchin    2018-06-07  261   * @counter: counter
230671533d6463 Roman Gushchin    2018-06-07  262   * @nr_pages: value to set
230671533d6463 Roman Gushchin    2018-06-07  263   *
230671533d6463 Roman Gushchin    2018-06-07  264   * The caller must serialize invocations on the same counter.
230671533d6463 Roman Gushchin    2018-06-07  265   */
6f4c005a5f8b8f Kaiyang Zhao      2024-09-20  266  void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages,
6f4c005a5f8b8f Kaiyang Zhao      2024-09-20  267  				unsigned long nr_pages_local)
230671533d6463 Roman Gushchin    2018-06-07 @268  {
230671533d6463 Roman Gushchin    2018-06-07  269  	struct page_counter *c;
230671533d6463 Roman Gushchin    2018-06-07  270  
f86b810c2610b0 Chris Down        2020-04-01  271  	WRITE_ONCE(counter->low, nr_pages);
6f4c005a5f8b8f Kaiyang Zhao      2024-09-20  272  	WRITE_ONCE(counter->locallow, nr_pages_local);
230671533d6463 Roman Gushchin    2018-06-07  273  
230671533d6463 Roman Gushchin    2018-06-07  274  	for (c = counter; c; c = c->parent)
6f4c005a5f8b8f Kaiyang Zhao      2024-09-20  275  		propagate_protected_usage(c, atomic_long_read(&c->usage),
6f4c005a5f8b8f Kaiyang Zhao      2024-09-20  276  				get_cgroup_local_usage(counter->memcg, false));
230671533d6463 Roman Gushchin    2018-06-07  277  }
230671533d6463 Roman Gushchin    2018-06-07  278  
3e32cb2e0a12b6 Johannes Weiner   2014-12-10  279  /**
3e32cb2e0a12b6 Johannes Weiner   2014-12-10  280   * page_counter_memparse - memparse() for page counter limits
3e32cb2e0a12b6 Johannes Weiner   2014-12-10  281   * @buf: string to parse
650c5e565492f9 Johannes Weiner   2015-02-11  282   * @max: string meaning maximum possible value
3e32cb2e0a12b6 Johannes Weiner   2014-12-10  283   * @nr_pages: returns the result in number of pages
3e32cb2e0a12b6 Johannes Weiner   2014-12-10  284   *
3e32cb2e0a12b6 Johannes Weiner   2014-12-10  285   * Returns -EINVAL, or 0 and @nr_pages on success.  @nr_pages will be
3e32cb2e0a12b6 Johannes Weiner   2014-12-10  286   * limited to %PAGE_COUNTER_MAX.
3e32cb2e0a12b6 Johannes Weiner   2014-12-10  287   */
650c5e565492f9 Johannes Weiner   2015-02-11  288  int page_counter_memparse(const char *buf, const char *max,
650c5e565492f9 Johannes Weiner   2015-02-11  289  			  unsigned long *nr_pages)
3e32cb2e0a12b6 Johannes Weiner   2014-12-10  290  {
3e32cb2e0a12b6 Johannes Weiner   2014-12-10  291  	char *end;
3e32cb2e0a12b6 Johannes Weiner   2014-12-10  292  	u64 bytes;
3e32cb2e0a12b6 Johannes Weiner   2014-12-10  293  
650c5e565492f9 Johannes Weiner   2015-02-11  294  	if (!strcmp(buf, max)) {
3e32cb2e0a12b6 Johannes Weiner   2014-12-10  295  		*nr_pages = PAGE_COUNTER_MAX;
3e32cb2e0a12b6 Johannes Weiner   2014-12-10  296  		return 0;
3e32cb2e0a12b6 Johannes Weiner   2014-12-10  297  	}
3e32cb2e0a12b6 Johannes Weiner   2014-12-10  298  
3e32cb2e0a12b6 Johannes Weiner   2014-12-10  299  	bytes = memparse(buf, &end);
3e32cb2e0a12b6 Johannes Weiner   2014-12-10  300  	if (*end != '\0')
3e32cb2e0a12b6 Johannes Weiner   2014-12-10  301  		return -EINVAL;
3e32cb2e0a12b6 Johannes Weiner   2014-12-10  302  
3e32cb2e0a12b6 Johannes Weiner   2014-12-10  303  	*nr_pages = min(bytes / PAGE_SIZE, (u64)PAGE_COUNTER_MAX);
3e32cb2e0a12b6 Johannes Weiner   2014-12-10  304  
3e32cb2e0a12b6 Johannes Weiner   2014-12-10  305  	return 0;
3e32cb2e0a12b6 Johannes Weiner   2014-12-10  306  }
a8585ac6862198 Maarten Lankhorst 2024-07-03  307  
a8585ac6862198 Maarten Lankhorst 2024-07-03  308  
941ce635234162 Roman Gushchin    2024-07-26  309  #ifdef CONFIG_MEMCG
a8585ac6862198 Maarten Lankhorst 2024-07-03  310  /*
a8585ac6862198 Maarten Lankhorst 2024-07-03  311   * This function calculates an individual page counter's effective
a8585ac6862198 Maarten Lankhorst 2024-07-03  312   * protection which is derived from its own memory.min/low, its
a8585ac6862198 Maarten Lankhorst 2024-07-03  313   * parent's and siblings' settings, as well as the actual memory
a8585ac6862198 Maarten Lankhorst 2024-07-03  314   * distribution in the tree.
a8585ac6862198 Maarten Lankhorst 2024-07-03  315   *
a8585ac6862198 Maarten Lankhorst 2024-07-03  316   * The following rules apply to the effective protection values:
a8585ac6862198 Maarten Lankhorst 2024-07-03  317   *
a8585ac6862198 Maarten Lankhorst 2024-07-03  318   * 1. At the first level of reclaim, effective protection is equal to
a8585ac6862198 Maarten Lankhorst 2024-07-03  319   *    the declared protection in memory.min and memory.low.
a8585ac6862198 Maarten Lankhorst 2024-07-03  320   *
a8585ac6862198 Maarten Lankhorst 2024-07-03  321   * 2. To enable safe delegation of the protection configuration, at
a8585ac6862198 Maarten Lankhorst 2024-07-03  322   *    subsequent levels the effective protection is capped to the
a8585ac6862198 Maarten Lankhorst 2024-07-03  323   *    parent's effective protection.
a8585ac6862198 Maarten Lankhorst 2024-07-03  324   *
a8585ac6862198 Maarten Lankhorst 2024-07-03  325   * 3. To make complex and dynamic subtrees easier to configure, the
a8585ac6862198 Maarten Lankhorst 2024-07-03  326   *    user is allowed to overcommit the declared protection at a given
a8585ac6862198 Maarten Lankhorst 2024-07-03  327   *    level. If that is the case, the parent's effective protection is
a8585ac6862198 Maarten Lankhorst 2024-07-03  328   *    distributed to the children in proportion to how much protection
a8585ac6862198 Maarten Lankhorst 2024-07-03  329   *    they have declared and how much of it they are utilizing.
a8585ac6862198 Maarten Lankhorst 2024-07-03  330   *
a8585ac6862198 Maarten Lankhorst 2024-07-03  331   *    This makes distribution proportional, but also work-conserving:
a8585ac6862198 Maarten Lankhorst 2024-07-03  332   *    if one counter claims much more protection than it uses memory,
a8585ac6862198 Maarten Lankhorst 2024-07-03  333   *    the unused remainder is available to its siblings.
a8585ac6862198 Maarten Lankhorst 2024-07-03  334   *
a8585ac6862198 Maarten Lankhorst 2024-07-03  335   * 4. Conversely, when the declared protection is undercommitted at a
a8585ac6862198 Maarten Lankhorst 2024-07-03  336   *    given level, the distribution of the larger parental protection
a8585ac6862198 Maarten Lankhorst 2024-07-03  337   *    budget is NOT proportional. A counter's protection from a sibling
a8585ac6862198 Maarten Lankhorst 2024-07-03  338   *    is capped to its own memory.min/low setting.
a8585ac6862198 Maarten Lankhorst 2024-07-03  339   *
a8585ac6862198 Maarten Lankhorst 2024-07-03  340   * 5. However, to allow protecting recursive subtrees from each other
a8585ac6862198 Maarten Lankhorst 2024-07-03  341   *    without having to declare each individual counter's fixed share
a8585ac6862198 Maarten Lankhorst 2024-07-03  342   *    of the ancestor's claim to protection, any unutilized -
a8585ac6862198 Maarten Lankhorst 2024-07-03  343   *    "floating" - protection from up the tree is distributed in
a8585ac6862198 Maarten Lankhorst 2024-07-03  344   *    proportion to each counter's *usage*. This makes the protection
a8585ac6862198 Maarten Lankhorst 2024-07-03  345   *    neutral wrt sibling cgroups and lets them compete freely over
a8585ac6862198 Maarten Lankhorst 2024-07-03  346   *    the shared parental protection budget, but it protects the
a8585ac6862198 Maarten Lankhorst 2024-07-03  347   *    subtree as a whole from neighboring subtrees.
a8585ac6862198 Maarten Lankhorst 2024-07-03  348   *
a8585ac6862198 Maarten Lankhorst 2024-07-03  349   * Note that 4. and 5. are not in conflict: 4. is about protecting
a8585ac6862198 Maarten Lankhorst 2024-07-03  350   * against immediate siblings whereas 5. is about protecting against
a8585ac6862198 Maarten Lankhorst 2024-07-03  351   * neighboring subtrees.
a8585ac6862198 Maarten Lankhorst 2024-07-03  352   */
a8585ac6862198 Maarten Lankhorst 2024-07-03  353  static unsigned long effective_protection(unsigned long usage,
a8585ac6862198 Maarten Lankhorst 2024-07-03  354  					  unsigned long parent_usage,
a8585ac6862198 Maarten Lankhorst 2024-07-03  355  					  unsigned long setting,
a8585ac6862198 Maarten Lankhorst 2024-07-03  356  					  unsigned long parent_effective,
a8585ac6862198 Maarten Lankhorst 2024-07-03  357  					  unsigned long siblings_protected,
a8585ac6862198 Maarten Lankhorst 2024-07-03  358  					  bool recursive_protection)
a8585ac6862198 Maarten Lankhorst 2024-07-03  359  {
a8585ac6862198 Maarten Lankhorst 2024-07-03  360  	unsigned long protected;
a8585ac6862198 Maarten Lankhorst 2024-07-03  361  	unsigned long ep;
a8585ac6862198 Maarten Lankhorst 2024-07-03  362  
a8585ac6862198 Maarten Lankhorst 2024-07-03  363  	protected = min(usage, setting);
a8585ac6862198 Maarten Lankhorst 2024-07-03  364  	/*
a8585ac6862198 Maarten Lankhorst 2024-07-03  365  	 * If all cgroups at this level combined claim and use more
a8585ac6862198 Maarten Lankhorst 2024-07-03  366  	 * protection than what the parent affords them, distribute
a8585ac6862198 Maarten Lankhorst 2024-07-03  367  	 * shares in proportion to utilization.
a8585ac6862198 Maarten Lankhorst 2024-07-03  368  	 *
a8585ac6862198 Maarten Lankhorst 2024-07-03  369  	 * We are using actual utilization rather than the statically
a8585ac6862198 Maarten Lankhorst 2024-07-03  370  	 * claimed protection in order to be work-conserving: claimed
a8585ac6862198 Maarten Lankhorst 2024-07-03  371  	 * but unused protection is available to siblings that would
a8585ac6862198 Maarten Lankhorst 2024-07-03  372  	 * otherwise get a smaller chunk than what they claimed.
a8585ac6862198 Maarten Lankhorst 2024-07-03  373  	 */
a8585ac6862198 Maarten Lankhorst 2024-07-03  374  	if (siblings_protected > parent_effective)
a8585ac6862198 Maarten Lankhorst 2024-07-03  375  		return protected * parent_effective / siblings_protected;
a8585ac6862198 Maarten Lankhorst 2024-07-03  376  
a8585ac6862198 Maarten Lankhorst 2024-07-03  377  	/*
a8585ac6862198 Maarten Lankhorst 2024-07-03  378  	 * Ok, utilized protection of all children is within what the
a8585ac6862198 Maarten Lankhorst 2024-07-03  379  	 * parent affords them, so we know whatever this child claims
a8585ac6862198 Maarten Lankhorst 2024-07-03  380  	 * and utilizes is effectively protected.
a8585ac6862198 Maarten Lankhorst 2024-07-03  381  	 *
a8585ac6862198 Maarten Lankhorst 2024-07-03  382  	 * If there is unprotected usage beyond this value, reclaim
a8585ac6862198 Maarten Lankhorst 2024-07-03  383  	 * will apply pressure in proportion to that amount.
a8585ac6862198 Maarten Lankhorst 2024-07-03  384  	 *
a8585ac6862198 Maarten Lankhorst 2024-07-03  385  	 * If there is unutilized protection, the cgroup will be fully
a8585ac6862198 Maarten Lankhorst 2024-07-03  386  	 * shielded from reclaim, but we do return a smaller value for
a8585ac6862198 Maarten Lankhorst 2024-07-03  387  	 * protection than what the group could enjoy in theory. This
a8585ac6862198 Maarten Lankhorst 2024-07-03  388  	 * is okay. With the overcommit distribution above, effective
a8585ac6862198 Maarten Lankhorst 2024-07-03  389  	 * protection is always dependent on how memory is actually
a8585ac6862198 Maarten Lankhorst 2024-07-03  390  	 * consumed among the siblings anyway.
a8585ac6862198 Maarten Lankhorst 2024-07-03  391  	 */
a8585ac6862198 Maarten Lankhorst 2024-07-03  392  	ep = protected;
a8585ac6862198 Maarten Lankhorst 2024-07-03  393  
a8585ac6862198 Maarten Lankhorst 2024-07-03  394  	/*
a8585ac6862198 Maarten Lankhorst 2024-07-03  395  	 * If the children aren't claiming (all of) the protection
a8585ac6862198 Maarten Lankhorst 2024-07-03  396  	 * afforded to them by the parent, distribute the remainder in
a8585ac6862198 Maarten Lankhorst 2024-07-03  397  	 * proportion to the (unprotected) memory of each cgroup. That
a8585ac6862198 Maarten Lankhorst 2024-07-03  398  	 * way, cgroups that aren't explicitly prioritized wrt each
a8585ac6862198 Maarten Lankhorst 2024-07-03  399  	 * other compete freely over the allowance, but they are
a8585ac6862198 Maarten Lankhorst 2024-07-03  400  	 * collectively protected from neighboring trees.
a8585ac6862198 Maarten Lankhorst 2024-07-03  401  	 *
a8585ac6862198 Maarten Lankhorst 2024-07-03  402  	 * We're using unprotected memory for the weight so that if
a8585ac6862198 Maarten Lankhorst 2024-07-03  403  	 * some cgroups DO claim explicit protection, we don't protect
a8585ac6862198 Maarten Lankhorst 2024-07-03  404  	 * the same bytes twice.
a8585ac6862198 Maarten Lankhorst 2024-07-03  405  	 *
a8585ac6862198 Maarten Lankhorst 2024-07-03  406  	 * Check both usage and parent_usage against the respective
a8585ac6862198 Maarten Lankhorst 2024-07-03  407  	 * protected values. One should imply the other, but they
a8585ac6862198 Maarten Lankhorst 2024-07-03  408  	 * aren't read atomically - make sure the division is sane.
a8585ac6862198 Maarten Lankhorst 2024-07-03  409  	 */
a8585ac6862198 Maarten Lankhorst 2024-07-03  410  	if (!recursive_protection)
a8585ac6862198 Maarten Lankhorst 2024-07-03  411  		return ep;
a8585ac6862198 Maarten Lankhorst 2024-07-03  412  
a8585ac6862198 Maarten Lankhorst 2024-07-03  413  	if (parent_effective > siblings_protected &&
a8585ac6862198 Maarten Lankhorst 2024-07-03  414  	    parent_usage > siblings_protected &&
a8585ac6862198 Maarten Lankhorst 2024-07-03  415  	    usage > protected) {
a8585ac6862198 Maarten Lankhorst 2024-07-03  416  		unsigned long unclaimed;
a8585ac6862198 Maarten Lankhorst 2024-07-03  417  
a8585ac6862198 Maarten Lankhorst 2024-07-03  418  		unclaimed = parent_effective - siblings_protected;
a8585ac6862198 Maarten Lankhorst 2024-07-03  419  		unclaimed *= usage - protected;
a8585ac6862198 Maarten Lankhorst 2024-07-03  420  		unclaimed /= parent_usage - siblings_protected;
a8585ac6862198 Maarten Lankhorst 2024-07-03  421  
a8585ac6862198 Maarten Lankhorst 2024-07-03  422  		ep += unclaimed;
a8585ac6862198 Maarten Lankhorst 2024-07-03  423  	}
a8585ac6862198 Maarten Lankhorst 2024-07-03  424  
a8585ac6862198 Maarten Lankhorst 2024-07-03  425  	return ep;
a8585ac6862198 Maarten Lankhorst 2024-07-03  426  }
a8585ac6862198 Maarten Lankhorst 2024-07-03  427  
a8585ac6862198 Maarten Lankhorst 2024-07-03  428  
a8585ac6862198 Maarten Lankhorst 2024-07-03  429  /**
a8585ac6862198 Maarten Lankhorst 2024-07-03  430   * page_counter_calculate_protection - check if memory consumption is in the normal range
a8585ac6862198 Maarten Lankhorst 2024-07-03  431   * @root: the top ancestor of the sub-tree being checked
a8585ac6862198 Maarten Lankhorst 2024-07-03  432   * @counter: the page_counter the counter to update
a8585ac6862198 Maarten Lankhorst 2024-07-03  433   * @recursive_protection: Whether to use memory_recursiveprot behavior.
a8585ac6862198 Maarten Lankhorst 2024-07-03  434   *
a8585ac6862198 Maarten Lankhorst 2024-07-03  435   * Calculates elow/emin thresholds for given page_counter.
a8585ac6862198 Maarten Lankhorst 2024-07-03  436   *
a8585ac6862198 Maarten Lankhorst 2024-07-03  437   * WARNING: This function is not stateless! It can only be used as part
a8585ac6862198 Maarten Lankhorst 2024-07-03  438   *          of a top-down tree iteration, not for isolated queries.
a8585ac6862198 Maarten Lankhorst 2024-07-03  439   */
a8585ac6862198 Maarten Lankhorst 2024-07-03  440  void page_counter_calculate_protection(struct page_counter *root,
a8585ac6862198 Maarten Lankhorst 2024-07-03  441  				       struct page_counter *counter,
6f4c005a5f8b8f Kaiyang Zhao      2024-09-20  442  				       bool recursive_protection, int is_local)
a8585ac6862198 Maarten Lankhorst 2024-07-03 @443  {
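
To make the rules in the effective_protection() comment above concrete, here is a userspace copy of the same arithmetic with three worked cases. It is a sketch for illustration only: the function name effective_protection_model() and the helper min_ul() are hypothetical, and the numbers are made-up page counts, not measurements.

```c
#include <assert.h>
#include <stdbool.h>

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Userspace model of effective_protection() from the listing above. */
static unsigned long effective_protection_model(unsigned long usage,
						unsigned long parent_usage,
						unsigned long setting,
						unsigned long parent_effective,
						unsigned long siblings_protected,
						bool recursive_protection)
{
	unsigned long prot = min_ul(usage, setting);
	unsigned long ep;

	assert(parent_usage >= usage);

	/* Rule 3: overcommitted level, share the parent's budget. */
	if (siblings_protected > parent_effective)
		return prot * parent_effective / siblings_protected;

	/* Rule 4: undercommitted level, capped to the own setting. */
	ep = prot;
	if (!recursive_protection)
		return ep;

	/* Rule 5: distribute floating protection by unprotected usage. */
	if (parent_effective > siblings_protected &&
	    parent_usage > siblings_protected &&
	    usage > prot) {
		unsigned long unclaimed;

		unclaimed = parent_effective - siblings_protected;
		unclaimed *= usage - prot;
		unclaimed /= parent_usage - siblings_protected;
		ep += unclaimed;
	}

	return ep;
}
```

Worked cases: with a parent budget of 100 and two children that each claim and use 80 (siblings_protected = 160), each child gets 80 * 100 / 160 = 50. With a parent budget of 100, siblings_protected = 60, and a child that claims 30 but uses 50, the child gets 30 without recursive protection. With recursive protection and a parent usage of 140, the same child gets 30 + (100 - 60) * (50 - 30) / (140 - 60) = 40.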

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [RFC PATCH 3/4] use memory.low local node protection for local node reclaim
  2024-09-20 22:11 ` [RFC PATCH 3/4] use memory.low local node protection for local node reclaim kaiyang2
@ 2024-09-22  0:51   ` kernel test robot
  2024-09-22 16:31   ` kernel test robot
  2024-10-15 21:52   ` Gregory Price
  2 siblings, 0 replies; 13+ messages in thread
From: kernel test robot @ 2024-09-22  0:51 UTC (permalink / raw)
  To: kaiyang2; +Cc: oe-kbuild-all

Hi,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on linus/master next-20240920]
[cannot apply to tip/sched/core v6.11]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/kaiyang2-cs-cmu-edu/Add-get_cgroup_local_usage-for-estimating-the-top-tier-memory-usage/20240921-061404
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20240920221202.1734227-4-kaiyang2%40cs.cmu.edu
patch subject: [RFC PATCH 3/4] use memory.low local node protection for local node reclaim
config: x86_64-rhel-8.3 (https://download.01.org/0day-ci/archive/20240922/202409221032.DoTv9B0p-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240922/202409221032.DoTv9B0p-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409221032.DoTv9B0p-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> mm/memcontrol.c:4499: warning: Function parameter or struct member 'is_local' not described in 'mem_cgroup_calculate_protection'


vim +4499 mm/memcontrol.c

c077719be8e9e6 KAMEZAWA Hiroyuki   2009-01-07  4488  
241994ed8649f7 Johannes Weiner     2015-02-11  4489  /**
05395718b2fe48 Mel Gorman          2021-06-30  4490   * mem_cgroup_calculate_protection - check if memory consumption is in the normal range
34c81057927311 Sean Christopherson 2017-07-10  4491   * @root: the top ancestor of the sub-tree being checked
241994ed8649f7 Johannes Weiner     2015-02-11  4492   * @memcg: the memory cgroup to check
241994ed8649f7 Johannes Weiner     2015-02-11  4493   *
230671533d6463 Roman Gushchin      2018-06-07  4494   * WARNING: This function is not stateless! It can only be used as part
230671533d6463 Roman Gushchin      2018-06-07  4495   *          of a top-down tree iteration, not for isolated queries.
241994ed8649f7 Johannes Weiner     2015-02-11  4496   */
45c7f7e1ef17f0 Chris Down          2020-08-06  4497  void mem_cgroup_calculate_protection(struct mem_cgroup *root,
3ebe5883ec39d9 Kaiyang Zhao        2024-09-20  4498  				     struct mem_cgroup *memcg, int is_local)
241994ed8649f7 Johannes Weiner     2015-02-11 @4499  {
a8585ac6862198 Maarten Lankhorst   2024-07-03  4500  	bool recursive_protection =
a8585ac6862198 Maarten Lankhorst   2024-07-03  4501  		cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT;
230671533d6463 Roman Gushchin      2018-06-07  4502  
241994ed8649f7 Johannes Weiner     2015-02-11  4503  	if (mem_cgroup_disabled())
45c7f7e1ef17f0 Chris Down          2020-08-06  4504  		return;
241994ed8649f7 Johannes Weiner     2015-02-11  4505  
34c81057927311 Sean Christopherson 2017-07-10  4506  	if (!root)
34c81057927311 Sean Christopherson 2017-07-10  4507  		root = root_mem_cgroup;
22f7496f0b9012 Yafang Shao         2020-08-06  4508  
6f4c005a5f8b8f Kaiyang Zhao        2024-09-20  4509  	page_counter_calculate_protection(&root->memory, &memcg->memory,
3ebe5883ec39d9 Kaiyang Zhao        2024-09-20  4510  					recursive_protection, is_local);
241994ed8649f7 Johannes Weiner     2015-02-11  4511  }
241994ed8649f7 Johannes Weiner     2015-02-11  4512  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [RFC PATCH 2/4] calculate memory.low for the local node and track its usage
  2024-09-20 22:11 ` [RFC PATCH 2/4] calculate memory.low for the local node and track its usage kaiyang2
  2024-09-21 23:18   ` kernel test robot
@ 2024-09-22  8:39   ` kernel test robot
  2024-10-15 22:05   ` Gregory Price
  2 siblings, 0 replies; 13+ messages in thread
From: kernel test robot @ 2024-09-22  8:39 UTC (permalink / raw)
  To: kaiyang2
  Cc: oe-lkp, lkp, linux-kernel, linux-mm, cgroups, roman.gushchin,
	shakeel.butt, muchun.song, akpm, mhocko, nehagholkar, abhishekd,
	hannes, weixugc, rientjes, Kaiyang Zhao, oliver.sang



Hello,

kernel test robot noticed "BUG:kernel_NULL_pointer_dereference,address" on:

commit: 6f4c005a5f8b8ff1ce674731545b302af5f28f3f ("[RFC PATCH 2/4] calculate memory.low for the local node and track its usage")
url: https://github.com/intel-lab-lkp/linux/commits/kaiyang2-cs-cmu-edu/Add-get_cgroup_local_usage-for-estimating-the-top-tier-memory-usage/20240921-061404
base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/all/20240920221202.1734227-3-kaiyang2@cs.cmu.edu/
patch subject: [RFC PATCH 2/4] calculate memory.low for the local node and track its usage

in testcase: boot

compiler: gcc-12
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)


+---------------------------------------------+------------+------------+
|                                             | 0af685cc17 | 6f4c005a5f |
+---------------------------------------------+------------+------------+
| boot_successes                              | 12         | 0          |
| boot_failures                               | 0          | 12         |
| BUG:kernel_NULL_pointer_dereference,address | 0          | 12         |
| Oops                                        | 0          | 12         |
| RIP:si_meminfo_node                         | 0          | 12         |
| Kernel_panic-not_syncing:Fatal_exception    | 0          | 12         |
+---------------------------------------------+------------+------------+


If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202409221625.1e974ac-oliver.sang@intel.com


[   14.204830][    T1] BUG: kernel NULL pointer dereference, address: 0000000000000090
[   14.206729][    T1] #PF: supervisor read access in kernel mode
[   14.208090][    T1] #PF: error_code(0x0000) - not-present page
[   14.209393][    T1] PGD 0 P4D 0
[   14.210212][    T1] Oops: Oops: 0000 [#1] SMP PTI
[   14.211269][    T1] CPU: 1 UID: 0 PID: 1 Comm: systemd Not tainted 6.11.0-rc6-00570-g6f4c005a5f8b #1
[   14.213284][    T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 14.215290][ T1] RIP: 0010:si_meminfo_node (arch/x86/include/asm/atomic64_64.h:15 (discriminator 3) include/linux/atomic/atomic-arch-fallback.h:2583 (discriminator 3) include/linux/atomic/atomic-long.h:38 (discriminator 3) include/linux/atomic/atomic-instrumented.h:3189 (discriminator 3) include/linux/mmzone.h:1042 (discriminator 3) mm/show_mem.c:98 (discriminator 3)) 
[ 14.216523][ T1] Code: 90 90 66 0f 1f 00 0f 1f 44 00 00 48 63 c6 55 31 d2 4c 8b 04 c5 c0 a7 fb 8c 53 48 89 c5 48 89 fb 4c 89 c0 49 8d b8 00 1e 00 00 <48> 8b 88 90 00 00 00 48 05 00 06 00 00 48 01 ca 48 39 f8 75 eb 48
All code
========
   0:	90                   	nop
   1:	90                   	nop
   2:	66 0f 1f 00          	nopw   (%rax)
   6:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
   b:	48 63 c6             	movslq %esi,%rax
   e:	55                   	push   %rbp
   f:	31 d2                	xor    %edx,%edx
  11:	4c 8b 04 c5 c0 a7 fb 	mov    -0x73045840(,%rax,8),%r8
  18:	8c 
  19:	53                   	push   %rbx
  1a:	48 89 c5             	mov    %rax,%rbp
  1d:	48 89 fb             	mov    %rdi,%rbx
  20:	4c 89 c0             	mov    %r8,%rax
  23:	49 8d b8 00 1e 00 00 	lea    0x1e00(%r8),%rdi
  2a:*	48 8b 88 90 00 00 00 	mov    0x90(%rax),%rcx		<-- trapping instruction
  31:	48 05 00 06 00 00    	add    $0x600,%rax
  37:	48 01 ca             	add    %rcx,%rdx
  3a:	48 39 f8             	cmp    %rdi,%rax
  3d:	75 eb                	jne    0x2a
  3f:	48                   	rex.W

Code starting with the faulting instruction
===========================================
   0:	48 8b 88 90 00 00 00 	mov    0x90(%rax),%rcx
   7:	48 05 00 06 00 00    	add    $0x600,%rax
   d:	48 01 ca             	add    %rcx,%rdx
  10:	48 39 f8             	cmp    %rdi,%rax
  13:	75 eb                	jne    0x0
  15:	48                   	rex.W
[   14.220364][    T1] RSP: 0018:ffffb14b40013d68 EFLAGS: 00010246
[   14.221717][    T1] RAX: 0000000000000000 RBX: ffffb14b40013d88 RCX: 00000000003a19a2
[   14.223496][    T1] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000001e00
[   14.225170][    T1] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000008
[   14.226964][    T1] R10: 0000000000000008 R11: 0fffffffffffffff R12: ffffb14b40013d88
[   14.228774][    T1] R13: 00000000003e7ac3 R14: ffffb14b40013e88 R15: ffff98ab0434f7a0
[   14.230421][    T1] FS:  00007f9569ae9940(0000) GS:ffff98adefd00000(0000) knlGS:0000000000000000
[   14.234569][    T1] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   14.235900][    T1] CR2: 0000000000000090 CR3: 0000000100072000 CR4: 00000000000006f0
[   14.237620][    T1] Call Trace:
[   14.238502][    T1]  <TASK>
[ 14.239254][ T1] ? __die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434) 
[ 14.240189][ T1] ? page_fault_oops (arch/x86/mm/fault.c:715) 
[ 14.241254][ T1] ? exc_page_fault (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 arch/x86/mm/fault.c:1489 arch/x86/mm/fault.c:1539) 
[ 14.242297][ T1] ? asm_exc_page_fault (arch/x86/include/asm/idtentry.h:623) 
[ 14.243313][ T1] ? si_meminfo_node (arch/x86/include/asm/atomic64_64.h:15 (discriminator 3) include/linux/atomic/atomic-arch-fallback.h:2583 (discriminator 3) include/linux/atomic/atomic-long.h:38 (discriminator 3) include/linux/atomic/atomic-instrumented.h:3189 (discriminator 3) include/linux/mmzone.h:1042 (discriminator 3) mm/show_mem.c:98 (discriminator 3)) 
[ 14.244443][ T1] ? si_meminfo_node (mm/show_mem.c:114) 
[ 14.245460][ T1] memory_low_write (mm/memcontrol.c:4088) 
[ 14.246547][ T1] kernfs_fop_write_iter (fs/kernfs/file.c:338) 
[ 14.247804][ T1] vfs_write (fs/read_write.c:497 fs/read_write.c:590) 
[ 14.248830][ T1] ksys_write (fs/read_write.c:643) 
[ 14.249783][ T1] do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) 
[ 14.250800][ T1] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) 
[   14.252260][    T1] RIP: 0033:0x7f956a64b240
[ 14.253276][ T1] Code: 40 00 48 8b 15 c1 9b 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 80 3d a1 23 0e 00 00 74 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 48 89
All code
========
   0:	40 00 48 8b          	add    %cl,-0x75(%rax)
   4:	15 c1 9b 0d 00       	adc    $0xd9bc1,%eax
   9:	f7 d8                	neg    %eax
   b:	64 89 02             	mov    %eax,%fs:(%rdx)
   e:	48 c7 c0 ff ff ff ff 	mov    $0xffffffffffffffff,%rax
  15:	eb b7                	jmp    0xffffffffffffffce
  17:	0f 1f 00             	nopl   (%rax)
  1a:	80 3d a1 23 0e 00 00 	cmpb   $0x0,0xe23a1(%rip)        # 0xe23c2
  21:	74 17                	je     0x3a
  23:	b8 01 00 00 00       	mov    $0x1,%eax
  28:	0f 05                	syscall 
  2a:*	48 3d 00 f0 ff ff    	cmp    $0xfffffffffffff000,%rax		<-- trapping instruction
  30:	77 58                	ja     0x8a
  32:	c3                   	retq   
  33:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
  3a:	48 83 ec 28          	sub    $0x28,%rsp
  3e:	48                   	rex.W
  3f:	89                   	.byte 0x89

Code starting with the faulting instruction
===========================================
   0:	48 3d 00 f0 ff ff    	cmp    $0xfffffffffffff000,%rax
   6:	77 58                	ja     0x60
   8:	c3                   	retq   
   9:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
  10:	48 83 ec 28          	sub    $0x28,%rsp
  14:	48                   	rex.W
  15:	89                   	.byte 0x89
[   14.257195][    T1] RSP: 002b:00007ffcc66594e8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[   14.259009][    T1] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f956a64b240
[   14.260848][    T1] RDX: 0000000000000002 RSI: 00007ffcc6659740 RDI: 000000000000001b
[   14.262500][    T1] RBP: 00007ffcc6659740 R08: 0000000000000000 R09: 0000000000000001
[   14.264147][    T1] R10: 00007f956a6c4820 R11: 0000000000000202 R12: 0000000000000002
[   14.265934][    T1] R13: 000055fd63872c10 R14: 0000000000000002 R15: 00007f956a7219e0
[   14.267589][    T1]  </TASK>
[   14.268340][    T1] Modules linked in: ip_tables
[   14.269410][    T1] CR2: 0000000000000090
[   14.270478][    T1] ---[ end trace 0000000000000000 ]---
[   14.271717][    T1] RIP: 0010:si_meminfo_node (arch/x86/include/asm/atomic64_64.h:15 (discriminator 3) include/linux/atomic/atomic-arch-fallback.h:2583 (discriminator 3) include/linux/atomic/atomic-long.h:38 (discriminator 3) include/linux/atomic/atomic-instrumented.h:3189 (discriminator 3) include/linux/mmzone.h:1042 (discriminator 3) mm/show_mem.c:98 (discriminator 3))
[   14.272874][    T1] Code: 90 90 66 0f 1f 00 0f 1f 44 00 00 48 63 c6 55 31 d2 4c 8b 04 c5 c0 a7 fb 8c 53 48 89 c5 48 89 fb 4c 89 c0 49 8d b8 00 1e 00 00 <48> 8b 88 90 00 00 00 48 05 00 06 00 00 48 01 ca 48 39 f8 75 eb 48
All code
========
   0:	90                   	nop
   1:	90                   	nop
   2:	66 0f 1f 00          	nopw   (%rax)
   6:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
   b:	48 63 c6             	movslq %esi,%rax
   e:	55                   	push   %rbp
   f:	31 d2                	xor    %edx,%edx
  11:	4c 8b 04 c5 c0 a7 fb 	mov    -0x73045840(,%rax,8),%r8
  18:	8c 
  19:	53                   	push   %rbx
  1a:	48 89 c5             	mov    %rax,%rbp
  1d:	48 89 fb             	mov    %rdi,%rbx
  20:	4c 89 c0             	mov    %r8,%rax
  23:	49 8d b8 00 1e 00 00 	lea    0x1e00(%r8),%rdi
  2a:*	48 8b 88 90 00 00 00 	mov    0x90(%rax),%rcx		<-- trapping instruction
  31:	48 05 00 06 00 00    	add    $0x600,%rax
  37:	48 01 ca             	add    %rcx,%rdx
  3a:	48 39 f8             	cmp    %rdi,%rax
  3d:	75 eb                	jne    0x2a
  3f:	48                   	rex.W

Code starting with the faulting instruction
===========================================
   0:	48 8b 88 90 00 00 00 	mov    0x90(%rax),%rcx
   7:	48 05 00 06 00 00    	add    $0x600,%rax
   d:	48 01 ca             	add    %rcx,%rdx
  10:	48 39 f8             	cmp    %rdi,%rax
  13:	75 eb                	jne    0x0
  15:	48                   	rex.W


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240922/202409221625.1e974ac-oliver.sang@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [RFC PATCH 3/4] use memory.low local node protection for local node reclaim
  2024-09-20 22:11 ` [RFC PATCH 3/4] use memory.low local node protection for local node reclaim kaiyang2
  2024-09-22  0:51   ` kernel test robot
@ 2024-09-22 16:31   ` kernel test robot
  2024-10-15 21:52   ` Gregory Price
  2 siblings, 0 replies; 13+ messages in thread
From: kernel test robot @ 2024-09-22 16:31 UTC (permalink / raw)
  To: kaiyang2; +Cc: oe-kbuild-all

Hi,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master next-20240920]
[cannot apply to tip/sched/core v6.11]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/kaiyang2-cs-cmu-edu/Add-get_cgroup_local_usage-for-estimating-the-top-tier-memory-usage/20240921-061404
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20240920221202.1734227-4-kaiyang2%40cs.cmu.edu
patch subject: [RFC PATCH 3/4] use memory.low local node protection for local node reclaim
config: x86_64-defconfig (https://download.01.org/0day-ci/archive/20240923/202409230026.mq4sC7is-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-12) 11.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240923/202409230026.mq4sC7is-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409230026.mq4sC7is-lkp@intel.com/

All errors (new ones prefixed by >>):

   mm/vmscan.c: In function 'get_scan_count':
>> mm/vmscan.c:2503:47: error: implicit declaration of function 'get_cgroup_local_usage' [-Werror=implicit-function-declaration]
    2503 |                                 cgroup_size = get_cgroup_local_usage(memcg, true);
         |                                               ^~~~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors


vim +/get_cgroup_local_usage +2503 mm/vmscan.c

  2360	
  2361	/*
  2362	 * Determine how aggressively the anon and file LRU lists should be
  2363	 * scanned.
  2364	 *
  2365	 * nr[0] = anon inactive folios to scan; nr[1] = anon active folios to scan
  2366	 * nr[2] = file inactive folios to scan; nr[3] = file active folios to scan
  2367	 */
  2368	static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
  2369				   unsigned long *nr)
  2370	{
  2371		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
  2372		struct mem_cgroup *memcg = lruvec_memcg(lruvec);
  2373		unsigned long anon_cost, file_cost, total_cost;
  2374		int swappiness = sc_swappiness(sc, memcg);
  2375		u64 fraction[ANON_AND_FILE];
  2376		u64 denominator = 0;	/* gcc */
  2377		enum scan_balance scan_balance;
  2378		unsigned long ap, fp;
  2379		enum lru_list lru;
  2380		int is_local = (pgdat->node_id == 0) && root_reclaim(sc);
  2381	
  2382		/* If we have no swap space, do not bother scanning anon folios. */
  2383		if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id, sc)) {
  2384			scan_balance = SCAN_FILE;
  2385			goto out;
  2386		}
  2387	
  2388		/*
  2389		 * Global reclaim will swap to prevent OOM even with no
  2390		 * swappiness, but memcg users want to use this knob to
  2391		 * disable swapping for individual groups completely when
  2392		 * using the memory controller's swap limit feature would be
  2393		 * too expensive.
  2394		 */
  2395		if (cgroup_reclaim(sc) && !swappiness) {
  2396			scan_balance = SCAN_FILE;
  2397			goto out;
  2398		}
  2399	
  2400		/*
  2401		 * Do not apply any pressure balancing cleverness when the
  2402		 * system is close to OOM, scan both anon and file equally
  2403		 * (unless the swappiness setting disagrees with swapping).
  2404		 */
  2405		if (!sc->priority && swappiness) {
  2406			scan_balance = SCAN_EQUAL;
  2407			goto out;
  2408		}
  2409	
  2410		/*
  2411		 * If the system is almost out of file pages, force-scan anon.
  2412		 */
  2413		if (sc->file_is_tiny) {
  2414			scan_balance = SCAN_ANON;
  2415			goto out;
  2416		}
  2417	
  2418		/*
  2419		 * If there is enough inactive page cache, we do not reclaim
  2420		 * anything from the anonymous working right now.
  2421		 */
  2422		if (sc->cache_trim_mode) {
  2423			scan_balance = SCAN_FILE;
  2424			goto out;
  2425		}
  2426	
  2427		scan_balance = SCAN_FRACT;
  2428		/*
  2429		 * Calculate the pressure balance between anon and file pages.
  2430		 *
  2431		 * The amount of pressure we put on each LRU is inversely
  2432		 * proportional to the cost of reclaiming each list, as
  2433		 * determined by the share of pages that are refaulting, times
  2434		 * the relative IO cost of bringing back a swapped out
  2435		 * anonymous page vs reloading a filesystem page (swappiness).
  2436		 *
  2437		 * Although we limit that influence to ensure no list gets
  2438		 * left behind completely: at least a third of the pressure is
  2439		 * applied, before swappiness.
  2440		 *
  2441		 * With swappiness at 100, anon and file have equal IO cost.
  2442		 */
  2443		total_cost = sc->anon_cost + sc->file_cost;
  2444		anon_cost = total_cost + sc->anon_cost;
  2445		file_cost = total_cost + sc->file_cost;
  2446		total_cost = anon_cost + file_cost;
  2447	
  2448		ap = swappiness * (total_cost + 1);
  2449		ap /= anon_cost + 1;
  2450	
  2451		fp = (MAX_SWAPPINESS - swappiness) * (total_cost + 1);
  2452		fp /= file_cost + 1;
  2453	
  2454		fraction[0] = ap;
  2455		fraction[1] = fp;
  2456		denominator = ap + fp;
  2457	out:
  2458		for_each_evictable_lru(lru) {
  2459			bool file = is_file_lru(lru);
  2460			unsigned long lruvec_size;
  2461			unsigned long low, min, locallow;
  2462			unsigned long scan;
  2463	
  2464			lruvec_size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
  2465			mem_cgroup_protection(sc->target_mem_cgroup, memcg,
  2466					      &min, &low, &locallow);
  2467			if (is_local)
  2468				low = locallow;
  2469	
  2470			if (min || low) {
  2471				/*
  2472				 * Scale a cgroup's reclaim pressure by proportioning
  2473				 * its current usage to its memory.low or memory.min
  2474				 * setting.
  2475				 *
  2476				 * This is important, as otherwise scanning aggression
  2477				 * becomes extremely binary -- from nothing as we
  2478				 * approach the memory protection threshold, to totally
  2479				 * nominal as we exceed it.  This results in requiring
  2480				 * setting extremely liberal protection thresholds. It
  2481				 * also means we simply get no protection at all if we
  2482				 * set it too low, which is not ideal.
  2483				 *
  2484				 * If there is any protection in place, we reduce scan
  2485				 * pressure by how much of the total memory used is
  2486				 * within protection thresholds.
  2487				 *
  2488				 * There is one special case: in the first reclaim pass,
  2489				 * we skip over all groups that are within their low
  2490				 * protection. If that fails to reclaim enough pages to
  2491				 * satisfy the reclaim goal, we come back and override
  2492				 * the best-effort low protection. However, we still
  2493				 * ideally want to honor how well-behaved groups are in
  2494				 * that case instead of simply punishing them all
  2495				 * equally. As such, we reclaim them based on how much
  2496				 * memory they are using, reducing the scan pressure
  2497				 * again by how much of the total memory used is under
  2498				 * hard protection.
  2499				 */
  2500				unsigned long cgroup_size;
  2501	
  2502				if (is_local)
> 2503					cgroup_size = get_cgroup_local_usage(memcg, true);
  2504				else
  2505					cgroup_size = mem_cgroup_size(memcg);
  2506				unsigned long protection;
  2507	
  2508				/* memory.low scaling, make sure we retry before OOM */
  2509				if (!sc->memcg_low_reclaim && low > min) {
  2510					protection = low;
  2511					sc->memcg_low_skipped = 1;
  2512				} else {
  2513					protection = min;
  2514				}
  2515	
  2516				/* Avoid TOCTOU with earlier protection check */
  2517				cgroup_size = max(cgroup_size, protection);
  2518	
  2519				scan = lruvec_size - lruvec_size * protection /
  2520					(cgroup_size + 1);
  2521	
  2522				/*
  2523				 * Minimally target SWAP_CLUSTER_MAX pages to keep
  2524				 * reclaim moving forwards, avoiding decrementing
  2525				 * sc->priority further than desirable.
  2526				 */
  2527				scan = max(scan, SWAP_CLUSTER_MAX);
  2528			} else {
  2529				scan = lruvec_size;
  2530			}
  2531	
  2532			scan >>= sc->priority;
  2533	
  2534			/*
  2535			 * If the cgroup's already been deleted, make sure to
  2536			 * scrape out the remaining cache.
  2537			 */
  2538			if (!scan && !mem_cgroup_online(memcg))
  2539				scan = min(lruvec_size, SWAP_CLUSTER_MAX);
  2540	
  2541			switch (scan_balance) {
  2542			case SCAN_EQUAL:
  2543				/* Scan lists relative to size */
  2544				break;
  2545			case SCAN_FRACT:
  2546				/*
  2547				 * Scan types proportional to swappiness and
  2548				 * their relative recent reclaim efficiency.
  2549				 * Make sure we don't miss the last page on
  2550				 * the offlined memory cgroups because of a
  2551				 * round-off error.
  2552				 */
  2553				scan = mem_cgroup_online(memcg) ?
  2554				       div64_u64(scan * fraction[file], denominator) :
  2555				       DIV64_U64_ROUND_UP(scan * fraction[file],
  2556							  denominator);
  2557				break;
  2558			case SCAN_FILE:
  2559			case SCAN_ANON:
  2560				/* Scan one type exclusively */
  2561				if ((scan_balance == SCAN_FILE) != file)
  2562					scan = 0;
  2563				break;
  2564			default:
  2565				/* Look ma, no brain */
  2566				BUG();
  2567			}
  2568	
  2569			nr[lru] = scan;
  2570		}
  2571	}
  2572	
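
The scan-pressure scaling described in the comment block around line 2472 can
be illustrated with a small userspace model. This is only a sketch: the
function name and the constant name are hypothetical, a single `protection`
value stands in for the min/low selection done in get_scan_count(), and the
value 32 is assumed to match SWAP_CLUSTER_MAX.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to match the kernel's SWAP_CLUSTER_MAX; hypothetical name. */
#define SCAN_FLOOR 32ULL

/*
 * Model of the protected branch of get_scan_count(): the closer a cgroup's
 * usage is to its protection, the smaller the share of the LRU that is
 * scanned. Simplification: protection == 0 means "no protection at all".
 */
static uint64_t scan_target(uint64_t lruvec_size, uint64_t cgroup_size,
			    uint64_t protection, unsigned int priority)
{
	uint64_t scan;

	assert(priority < 64);

	if (protection == 0)
		return lruvec_size >> priority;

	/* Usage may have dropped below protection since the earlier check. */
	if (cgroup_size < protection)
		cgroup_size = protection;

	scan = lruvec_size - lruvec_size * protection / (cgroup_size + 1);

	/* Keep reclaim moving forward with a minimal batch. */
	if (scan < SCAN_FLOOR)
		scan = SCAN_FLOOR;

	return scan >> priority;
}
```

With patch 3/4 applied and is_local set, `cgroup_size` would be the top-tier
usage and `protection` the top-tier low value, so the same formula throttles
reclaim against the local node only.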

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

* Re: [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion
  2024-09-20 22:11 [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion kaiyang2
                   ` (3 preceding siblings ...)
  2024-09-20 22:11 ` [RFC PATCH 4/4] reduce NUMA balancing scan size of cgroups over their local memory.low kaiyang2
@ 2024-10-11 20:51 ` Kaiyang Zhao
  2024-11-08 19:01 ` kaiyang2
  5 siblings, 0 replies; 13+ messages in thread
From: Kaiyang Zhao @ 2024-10-11 20:51 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: roman.gushchin, shakeel.butt, muchun.song, akpm, mhocko,
	nehagholkar, abhishekd, hannes, weixugc, rientjes, gourry

Adding some preliminary results from testing on a *real* system with CXL
memory.

The system has 256GB local DRAM + 64GB CXL memory. We used a microbenchmark
that allocates memory and accesses it at tunable hotness levels. We ran 3
instances of it, one per cgroup (container). The first container has twice the
access hotness of the second and third containers. Each container has
memory.low set to 100GB, meaning that ~82GB of its local DRAM usage is
protected.
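
A minimal sketch of where the protected figure comes from, assuming the
`low * local_capacity / total_capacity` scaling from patch 2/4. The helper name
is hypothetical and the inputs are in pages (or any unit small enough that the
product does not overflow 64 bits).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: scale memory.low by the share of total capacity that
 * sits in the top tier. Inputs share one unit (e.g. pages).
 */
static uint64_t scale_low_to_top_tier(uint64_t low, uint64_t local_capacity,
				      uint64_t total_capacity)
{
	assert(total_capacity > 0);
	return low * local_capacity / total_capacity;
}
```

With the nominal capacities (256GB local, 320GB total) a 100GB memory.low
scales to 80GB. The ~82GB quoted above presumably reflects the per-node totals
the kernel actually reports on the test machine, which are not listed here.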

Case 1:

Container 1: Uses 120GB
Container 2: Uses 40GB
Container 3: Uses 40GB

Without fairness patch: same as with the fairness patch.

With fairness patch: Container 1 has 120GB in local DRAM. Containers 2 and 3
each have 40GB in local DRAM. As long as DRAM is not under pressure,
containers can exceed the memory.low guarantee and put everything in DRAM.

Case 2:

Container 1: Uses 120GB
Container 2: Uses 90GB
Container 3: Uses 90GB

Without fairness patch: Container 1 gets 120GB in local DRAM, and Containers 2
and 3 are each stuck with ~65GB in local DRAM since they have colder data.

With fairness patch: Container 1 starts early and gets all 120GB in DRAM. As
Containers 2 and 3 start, they initially each get ~65GB in DRAM and ~25GB in
CXL memory. Promotion attempts trigger local memory reclaim by kswapd, which
trims the DRAM usage of Container 1 and increases the DRAM usage of Containers
2 and 3. Eventually, the DRAM usage of all 3 containers converges at ~82GB, and
the excess unprotected usage of all 3 containers is in CXL memory.

Case 3:

Container 1: Uses 120GB
Container 2: Uses 70GB
Container 3: Uses 70GB

Without fairness patch: Container 1 gets 120GB in local DRAM, and Containers 2
and 3 are each stuck with ~65GB in local DRAM since they have colder data.

With fairness patch: Although the total memory demand exceeds DRAM capacity, in
the steady state Container 1 still gets ~105GB in local DRAM, more than the
memory.low guarantee. Meanwhile, all memory used by Containers 2 and 3 is
protected from the noisy neighbor Container 1 and resides entirely in DRAM.

We’re working on getting performance data from more benchmarks and also Meta’s
production workloads. Stay tuned for more results!

* Re: [RFC PATCH 3/4] use memory.low local node protection for local node reclaim
  2024-09-20 22:11 ` [RFC PATCH 3/4] use memory.low local node protection for local node reclaim kaiyang2
  2024-09-22  0:51   ` kernel test robot
  2024-09-22 16:31   ` kernel test robot
@ 2024-10-15 21:52   ` Gregory Price
  2 siblings, 0 replies; 13+ messages in thread
From: Gregory Price @ 2024-10-15 21:52 UTC (permalink / raw)
  To: kaiyang2
  Cc: linux-mm, cgroups, roman.gushchin, shakeel.butt, muchun.song,
	akpm, mhocko, nehagholkar, abhishekd, hannes, weixugc, rientjes

On Fri, Sep 20, 2024 at 10:11:50PM +0000, kaiyang2@cs.cmu.edu wrote:
> From: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
> 
> When reclaim targets the top-tier node usage by the root memcg,
> apply local memory.low protection instead of global protection.
> 

Changelog probably needs a little more context about the intended
effect of this change.  What exactly is the implication of this
change compared to applying it against elow?

> Signed-off-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
> ---
>  include/linux/memcontrol.h | 23 ++++++++++++++---------
>  mm/memcontrol.c            |  4 ++--
>  mm/vmscan.c                | 19 ++++++++++++++-----
>  3 files changed, 30 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 94aba4498fca..256912b91922 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -586,9 +586,9 @@ static inline bool mem_cgroup_disabled(void)
>  static inline void mem_cgroup_protection(struct mem_cgroup *root,
>  					 struct mem_cgroup *memcg,
>  					 unsigned long *min,
> -					 unsigned long *low)
> +					 unsigned long *low, unsigned long *locallow)
>  {
> -	*min = *low = 0;
> +	*min = *low = *locallow = 0;
>  

"locallow" can be read as "loc allow" or "local low", so you probably
want to change all the references to local_low.

Sorry for not saying this in earlier feedback.


>  	if (mem_cgroup_disabled())
>  		return;
> @@ -631,10 +631,11 @@ static inline void mem_cgroup_protection(struct mem_cgroup *root,
>  
>  	*min = READ_ONCE(memcg->memory.emin);
>  	*low = READ_ONCE(memcg->memory.elow);
> +	*locallow = READ_ONCE(memcg->memory.elocallow);
>  }
>  
>  void mem_cgroup_calculate_protection(struct mem_cgroup *root,
> -				     struct mem_cgroup *memcg);
> +				     struct mem_cgroup *memcg, int is_local);
>  
>  static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
>  					  struct mem_cgroup *memcg)
> @@ -651,13 +652,17 @@ static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
>  unsigned long get_cgroup_local_usage(struct mem_cgroup *memcg, bool flush);
>  
>  static inline bool mem_cgroup_below_low(struct mem_cgroup *target,
> -					struct mem_cgroup *memcg)
> +					struct mem_cgroup *memcg, int is_local)
>  {
>  	if (mem_cgroup_unprotected(target, memcg))
>  		return false;
>  
> -	return READ_ONCE(memcg->memory.elow) >=
> -		page_counter_read(&memcg->memory);
> +	if (is_local)
> +		return READ_ONCE(memcg->memory.elocallow) >=
> +			get_cgroup_local_usage(memcg, true);
> +	else
> +		return READ_ONCE(memcg->memory.elow) >=
> +			page_counter_read(&memcg->memory);

Don't need the else case here if the if block returns.
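
Something along these lines, shown as a userspace sketch with a stand-in
struct rather than the kernel types. The struct and its field names are made up
for illustration; only the control flow is the point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the values the check reads; not struct mem_cgroup. */
struct memcg_model {
	bool unprotected;          /* result of mem_cgroup_unprotected() */
	unsigned long elow;        /* effective memory.low */
	unsigned long elocallow;   /* effective top-tier memory.low */
	unsigned long usage;       /* total charged pages */
	unsigned long local_usage; /* pages on the top-tier node */
};

static bool below_low(const struct memcg_model *m, bool is_local)
{
	assert(m != NULL);

	if (m->unprotected)
		return false;

	/* Early return makes the else branch unnecessary. */
	if (is_local)
		return m->elocallow >= m->local_usage;

	return m->elow >= m->usage;
}
```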

>  }
>  
>  static inline bool mem_cgroup_below_min(struct mem_cgroup *target,
> @@ -1159,13 +1164,13 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
>  static inline void mem_cgroup_protection(struct mem_cgroup *root,
>  					 struct mem_cgroup *memcg,
>  					 unsigned long *min,
> -					 unsigned long *low)
> +					 unsigned long *low, unsigned long *locallow)
>  {
>  	*min = *low = 0;
>  }
>  
>  static inline void mem_cgroup_calculate_protection(struct mem_cgroup *root,
> -						   struct mem_cgroup *memcg)
> +						   struct mem_cgroup *memcg, int is_local)
>  {
>  }
>  
> @@ -1175,7 +1180,7 @@ static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
>  	return true;
>  }
>  static inline bool mem_cgroup_below_low(struct mem_cgroup *target,
> -					struct mem_cgroup *memcg)
> +					struct mem_cgroup *memcg, int is_local)
>  {
>  	return false;
>  }
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d7c5fff12105..61718ba998fe 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4495,7 +4495,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
>   *          of a top-down tree iteration, not for isolated queries.
>   */
>  void mem_cgroup_calculate_protection(struct mem_cgroup *root,
> -				     struct mem_cgroup *memcg)
> +				     struct mem_cgroup *memcg, int is_local)
>  {
>  	bool recursive_protection =
>  		cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT;
> @@ -4507,7 +4507,7 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
>  		root = root_mem_cgroup;
>  
>  	page_counter_calculate_protection(&root->memory, &memcg->memory,
> -					recursive_protection, false);
> +					recursive_protection, is_local);
>  }
>  
>  static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ce471d686a88..a2681d52fc5f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2377,6 +2377,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  	enum scan_balance scan_balance;
>  	unsigned long ap, fp;
>  	enum lru_list lru;
> +	int is_local = (pgdat->node_id == 0) && root_reclaim(sc);

int should be bool to be more explicit as to what the valid values are.

Should be addressed across the patch set.

>  
>  	/* If we have no swap space, do not bother scanning anon folios. */
>  	if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id, sc)) {
> @@ -2457,12 +2458,14 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  	for_each_evictable_lru(lru) {
>  		bool file = is_file_lru(lru);
>  		unsigned long lruvec_size;
> -		unsigned long low, min;
> +		unsigned long low, min, locallow;
>  		unsigned long scan;
>  
>  		lruvec_size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
>  		mem_cgroup_protection(sc->target_mem_cgroup, memcg,
> -				      &min, &low);
> +				      &min, &low, &locallow);
> +		if (is_local)
> +			low = locallow;
>  
>  		if (min || low) {
>  			/*
> @@ -2494,7 +2497,12 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  			 * again by how much of the total memory used is under
>  			 * hard protection.
>  			 */
> -			unsigned long cgroup_size = mem_cgroup_size(memcg);
> +			unsigned long cgroup_size;
> +
> +			if (is_local)
> +				cgroup_size = get_cgroup_local_usage(memcg, true);
> +			else
> +				cgroup_size = mem_cgroup_size(memcg);
>  			unsigned long protection;
>  
>  			/* memory.low scaling, make sure we retry before OOM */
> @@ -5869,6 +5877,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>  	};
>  	struct mem_cgroup_reclaim_cookie *partial = &reclaim;
>  	struct mem_cgroup *memcg;
> +	int is_local = (pgdat->node_id == 0) && root_reclaim(sc);
>  
>  	/*
>  	 * In most cases, direct reclaimers can do partial walks
> @@ -5896,7 +5905,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>  		 */
>  		cond_resched();
>  
> -		mem_cgroup_calculate_protection(target_memcg, memcg);
> +		mem_cgroup_calculate_protection(target_memcg, memcg, is_local);
>  
>  		if (mem_cgroup_below_min(target_memcg, memcg)) {
>  			/*
> @@ -5904,7 +5913,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>  			 * If there is no reclaimable memory, OOM.
>  			 */
>  			continue;
> -		} else if (mem_cgroup_below_low(target_memcg, memcg)) {
> +		} else if (mem_cgroup_below_low(target_memcg, memcg, is_local)) {
>  			/*
>  			 * Soft protection.
>  			 * Respect the protection only as long as
> -- 
> 2.43.0
> 

* Re: [RFC PATCH 2/4] calculate memory.low for the local node and track its usage
  2024-09-20 22:11 ` [RFC PATCH 2/4] calculate memory.low for the local node and track its usage kaiyang2
  2024-09-21 23:18   ` kernel test robot
  2024-09-22  8:39   ` kernel test robot
@ 2024-10-15 22:05   ` Gregory Price
  2 siblings, 0 replies; 13+ messages in thread
From: Gregory Price @ 2024-10-15 22:05 UTC (permalink / raw)
  To: kaiyang2
  Cc: linux-mm, cgroups, roman.gushchin, shakeel.butt, muchun.song,
	akpm, mhocko, nehagholkar, abhishekd, hannes, weixugc, rientjes

On Fri, Sep 20, 2024 at 10:11:49PM +0000, kaiyang2@cs.cmu.edu wrote:
> From: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
> 
> Add a memory.low for the top-tier node (locallow) and track its usage.
> locallow is set by scaling low by the ratio of node 0 capacity and
> node 0 + node 1 capacity.
> 
> Signed-off-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
> ---
>  include/linux/page_counter.h | 16 ++++++++---
>  mm/hugetlb_cgroup.c          |  4 +--
>  mm/memcontrol.c              | 42 ++++++++++++++++++++++-------
>  mm/page_counter.c            | 52 ++++++++++++++++++++++++++++--------
>  4 files changed, 88 insertions(+), 26 deletions(-)
> 
> diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> index 79dbd8bc35a7..aa56c93415ef 100644
> --- a/include/linux/page_counter.h
> +++ b/include/linux/page_counter.h
> @@ -13,6 +13,7 @@ struct page_counter {
>  	 * memcg->memory.usage is a hot member of struct mem_cgroup.
>  	 */
>  	atomic_long_t usage;
> +	struct mem_cgroup *memcg; /* memcg that owns this counter */

Can you make some comments on the lifetime of this new memcg reference?

How is it referenced, how is it cleaned up, etc.

Probably it's worth adding this in a separate patch so it's easier
to review the reference tracking.

>  	CACHELINE_PADDING(_pad1_);
>  
>  	/* effective memory.min and memory.min usage tracking */
> @@ -25,6 +26,10 @@ struct page_counter {
>  	atomic_long_t low_usage;
>  	atomic_long_t children_low_usage;
>  
> +	unsigned long elocallow;
> +	atomic_long_t locallow_usage;

per note on other email - probably want local_low_* instead of locallow.

> +	atomic_long_t children_locallow_usage;
> +
>  	unsigned long watermark;
>  	/* Latest cg2 reset watermark */
>  	unsigned long local_watermark;
> @@ -36,6 +41,7 @@ struct page_counter {
>  	bool protection_support;
>  	unsigned long min;
>  	unsigned long low;
> +	unsigned long locallow;
>  	unsigned long high;
>  	unsigned long max;
>  	struct page_counter *parent;
> @@ -52,12 +58,13 @@ struct page_counter {
>   */
>  static inline void page_counter_init(struct page_counter *counter,
>  				     struct page_counter *parent,
> -				     bool protection_support)
> +				     bool protection_support, struct mem_cgroup *memcg)
>  {
>  	counter->usage = (atomic_long_t)ATOMIC_LONG_INIT(0);
>  	counter->max = PAGE_COUNTER_MAX;
>  	counter->parent = parent;
>  	counter->protection_support = protection_support;
> +	counter->memcg = memcg;
>  }
>  
>  static inline unsigned long page_counter_read(struct page_counter *counter)
> @@ -72,7 +79,8 @@ bool page_counter_try_charge(struct page_counter *counter,
>  			     struct page_counter **fail);
>  void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
>  void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages);
> -void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages);
> +void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages,
> +					unsigned long nr_pages_local);
>  
>  static inline void page_counter_set_high(struct page_counter *counter,
>  					 unsigned long nr_pages)
> @@ -99,11 +107,11 @@ static inline void page_counter_reset_watermark(struct page_counter *counter)
>  #ifdef CONFIG_MEMCG
>  void page_counter_calculate_protection(struct page_counter *root,
>  				       struct page_counter *counter,
> -				       bool recursive_protection);
> +				       bool recursive_protection, int is_local);

`bool is_local` is preferred

>  #else
>  static inline void page_counter_calculate_protection(struct page_counter *root,
>  						     struct page_counter *counter,
> -						     bool recursive_protection) {}
> +						     bool recursive_protection, int is_local) {}
>  #endif
>  
>  #endif /* _LINUX_PAGE_COUNTER_H */
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> index d8d0e665caed..0e07a7a1d5b8 100644
> --- a/mm/hugetlb_cgroup.c
> +++ b/mm/hugetlb_cgroup.c
> @@ -114,10 +114,10 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup,
>  		}
>  		page_counter_init(hugetlb_cgroup_counter_from_cgroup(h_cgroup,
>  								     idx),
> -				  fault_parent, false);
> +				  fault_parent, false, NULL);
>  		page_counter_init(
>  			hugetlb_cgroup_counter_from_cgroup_rsvd(h_cgroup, idx),
> -			rsvd_parent, false);
> +			rsvd_parent, false, NULL);
>  
>  		limit = round_down(PAGE_COUNTER_MAX,
>  				   pages_per_huge_page(&hstates[idx]));
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 20b715441332..d7c5fff12105 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1497,6 +1497,9 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
>  			       vm_event_name(memcg_vm_event_stat[i]),
>  			       memcg_events(memcg, memcg_vm_event_stat[i]));
>  	}
> +
> +	seq_buf_printf(s, "local_usage %lu\n",
> +		       get_cgroup_local_usage(memcg, true));
>  }
>  
>  static void memory_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
> @@ -3597,8 +3600,8 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  	if (parent) {
>  		WRITE_ONCE(memcg->swappiness, mem_cgroup_swappiness(parent));
>  
> -		page_counter_init(&memcg->memory, &parent->memory, true);
> -		page_counter_init(&memcg->swap, &parent->swap, false);
> +		page_counter_init(&memcg->memory, &parent->memory, true, memcg);
> +		page_counter_init(&memcg->swap, &parent->swap, false, NULL);
>  #ifdef CONFIG_MEMCG_V1
>  		WRITE_ONCE(memcg->oom_kill_disable, READ_ONCE(parent->oom_kill_disable));
>  		page_counter_init(&memcg->kmem, &parent->kmem, false);
> @@ -3607,8 +3610,8 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  	} else {
>  		init_memcg_stats();
>  		init_memcg_events();
> -		page_counter_init(&memcg->memory, NULL, true);
> -		page_counter_init(&memcg->swap, NULL, false);
> +		page_counter_init(&memcg->memory, NULL, true, memcg);
> +		page_counter_init(&memcg->swap, NULL, false, NULL);
>  #ifdef CONFIG_MEMCG_V1
>  		page_counter_init(&memcg->kmem, NULL, false);
>  		page_counter_init(&memcg->tcpmem, NULL, false);
> @@ -3677,7 +3680,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>  	memcg1_css_offline(memcg);
>  
>  	page_counter_set_min(&memcg->memory, 0);
> -	page_counter_set_low(&memcg->memory, 0);
> +	page_counter_set_low(&memcg->memory, 0, 0);
>  
>  	zswap_memcg_offline_cleanup(memcg);
>  
> @@ -3748,7 +3751,7 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
>  	page_counter_set_max(&memcg->tcpmem, PAGE_COUNTER_MAX);
>  #endif
>  	page_counter_set_min(&memcg->memory, 0);
> -	page_counter_set_low(&memcg->memory, 0);
> +	page_counter_set_low(&memcg->memory, 0, 0);
>  	page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX);
>  	memcg1_soft_limit_reset(memcg);
>  	page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
> @@ -4051,6 +4054,12 @@ static ssize_t memory_min_write(struct kernfs_open_file *of,
>  	return nbytes;
>  }
>  
> +static int memory_locallow_show(struct seq_file *m, void *v)
> +{
> +	return seq_puts_memcg_tunable(m,
> +		READ_ONCE(mem_cgroup_from_seq(m)->memory.locallow));
> +}
> +
>  static int memory_low_show(struct seq_file *m, void *v)
>  {
>  	return seq_puts_memcg_tunable(m,
> @@ -4061,7 +4070,8 @@ static ssize_t memory_low_write(struct kernfs_open_file *of,
>  				char *buf, size_t nbytes, loff_t off)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> -	unsigned long low;
> +	struct sysinfo si;
> +	unsigned long low, locallow, local_capacity, total_capacity;
>  	int err;
>  
>  	buf = strstrip(buf);
> @@ -4069,7 +4079,15 @@ static ssize_t memory_low_write(struct kernfs_open_file *of,
>  	if (err)
>  		return err;
>  
> -	page_counter_set_low(&memcg->memory, low);
> +	/* Hardcoded 0 for local node and 1 for remote. */

I know we've talked about this before, but this is obviously broken for
multi-socket systems.  If so, this needs a FIXME or a TODO at least so that
it's obvious that this patch isn't ready for upstream - even as an RFC.

Probably we can't move forward until we figure out how to solve this problem
ahead of this patch set.  Worth discussing this issue explicitly.

Maybe rather than guessing, a preferred node should be set for local and
remote if this mechanism is in use.  Otherwise just guessing which node is
local and which is remote seems like it will be wrong - especially for
sufficiently large-threaded processes.
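
As one possible alternative to the preferred-node idea, and purely as a
userspace sketch, the capacity ratio could be summed over all nodes by tier
instead of reading nodes 0 and 1. The struct and function names below are
hypothetical. In the kernel the per-node classification might come from
node_is_toptier(), but whether that is the right notion of "local" for a given
cgroup is exactly the open question.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for per-node data; not a kernel structure. */
struct node_model {
	uint64_t total_pages; /* capacity of this node, in pages */
	bool top_tier;        /* assumed classification of the node */
};

/*
 * Scale `low` by (top-tier capacity / total capacity) across all nodes,
 * rather than assuming node 0 is local and node 1 is remote.
 */
static uint64_t local_low_pages(uint64_t low, const struct node_model *nodes,
				size_t nr_nodes)
{
	uint64_t top = 0, total = 0;
	size_t i;

	assert(nodes != NULL || nr_nodes == 0);

	for (i = 0; i < nr_nodes; i++) {
		total += nodes[i].total_pages;
		if (nodes[i].top_tier)
			top += nodes[i].total_pages;
	}

	if (total == 0)
		return 0;

	return low * top / total;
}
```

This still treats every top-tier node as equally "local", which is not true
for a task pinned to one socket, so it does not replace the discussion above.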

> +	si_meminfo_node(&si, 0);
> +	local_capacity = si.totalram; /* In pages. */
> +	total_capacity = local_capacity;
> +	si_meminfo_node(&si, 1);
> +	total_capacity += si.totalram;
> +	locallow = low * local_capacity / total_capacity;
> +
> +	page_counter_set_low(&memcg->memory, low, locallow);
>  
>  	return nbytes;
>  }
> @@ -4394,6 +4412,11 @@ static struct cftype memory_files[] = {
>  		.seq_show = memory_low_show,
>  		.write = memory_low_write,
>  	},
> +	{
> +		.name = "locallow",
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +		.seq_show = memory_locallow_show,
> +	},
>  	{
>  		.name = "high",
>  		.flags = CFTYPE_NOT_ON_ROOT,
> @@ -4483,7 +4506,8 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
>  	if (!root)
>  		root = root_mem_cgroup;
>  
> -	page_counter_calculate_protection(&root->memory, &memcg->memory, recursive_protection);
> +	page_counter_calculate_protection(&root->memory, &memcg->memory,
> +					recursive_protection, false);
>  }
>  
>  static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
> diff --git a/mm/page_counter.c b/mm/page_counter.c
> index b249d15af9dd..97205aafab46 100644
> --- a/mm/page_counter.c
> +++ b/mm/page_counter.c
> @@ -18,8 +18,10 @@ static bool track_protection(struct page_counter *c)
>  	return c->protection_support;
>  }
>  
> +extern unsigned long get_cgroup_local_usage(struct mem_cgroup *memcg, bool flush);
> +
>  static void propagate_protected_usage(struct page_counter *c,
> -				      unsigned long usage)
> +				      unsigned long usage, unsigned long local_usage)
>  {
>  	unsigned long protected, old_protected;
>  	long delta;
> @@ -44,6 +46,15 @@ static void propagate_protected_usage(struct page_counter *c,
>  		if (delta)
>  			atomic_long_add(delta, &c->parent->children_low_usage);
>  	}
> +
> +	protected = min(local_usage, READ_ONCE(c->locallow));
> +	old_protected = atomic_long_read(&c->locallow_usage);
> +	if (protected != old_protected) {
> +		old_protected = atomic_long_xchg(&c->locallow_usage, protected);
> +		delta = protected - old_protected;
> +		if (delta)
> +			atomic_long_add(delta, &c->parent->children_locallow_usage);
> +	}
>  }
>  
>  /**
> @@ -63,7 +74,8 @@ void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
>  		atomic_long_set(&counter->usage, new);
>  	}
>  	if (track_protection(counter))
> -		propagate_protected_usage(counter, new);
> +		propagate_protected_usage(counter, new,
> +				get_cgroup_local_usage(counter->memcg, false));
>  }
>  
>  /**
> @@ -83,7 +95,8 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
>  
>  		new = atomic_long_add_return(nr_pages, &c->usage);
>  		if (protection)
> -			propagate_protected_usage(c, new);
> +			propagate_protected_usage(c, new,
> +					get_cgroup_local_usage(counter->memcg, false));
>  		/*
>  		 * This is indeed racy, but we can live with some
>  		 * inaccuracy in the watermark.
> @@ -151,7 +164,8 @@ bool page_counter_try_charge(struct page_counter *counter,
>  			goto failed;
>  		}
>  		if (protection)
> -			propagate_protected_usage(c, new);
> +			propagate_protected_usage(c, new,
> +					get_cgroup_local_usage(counter->memcg, false));
>  
>  		/* see comment on page_counter_charge */
>  		if (new > READ_ONCE(c->local_watermark)) {
> @@ -238,7 +252,8 @@ void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages)
>  	WRITE_ONCE(counter->min, nr_pages);
>  
>  	for (c = counter; c; c = c->parent)
> -		propagate_protected_usage(c, atomic_long_read(&c->usage));
> +		propagate_protected_usage(c, atomic_long_read(&c->usage),
> +				get_cgroup_local_usage(counter->memcg, false));
>  }
>  
>  /**
> @@ -248,14 +263,17 @@ void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages)
>   *
>   * The caller must serialize invocations on the same counter.
>   */
> -void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages)
> +void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages,
> +				unsigned long nr_pages_local)
>  {
>  	struct page_counter *c;
>  
>  	WRITE_ONCE(counter->low, nr_pages);
> +	WRITE_ONCE(counter->locallow, nr_pages_local);
>  
>  	for (c = counter; c; c = c->parent)
> -		propagate_protected_usage(c, atomic_long_read(&c->usage));
> +		propagate_protected_usage(c, atomic_long_read(&c->usage),
> +				get_cgroup_local_usage(counter->memcg, false));
>  }
>  
>  /**
> @@ -421,9 +439,9 @@ static unsigned long effective_protection(unsigned long usage,
>   */
>  void page_counter_calculate_protection(struct page_counter *root,
>  				       struct page_counter *counter,
> -				       bool recursive_protection)
> +				       bool recursive_protection, int is_local)
>  {
> -	unsigned long usage, parent_usage;
> +	unsigned long usage, parent_usage, local_usage, parent_local_usage;
>  	struct page_counter *parent = counter->parent;
>  
>  	/*
> @@ -437,16 +455,19 @@ void page_counter_calculate_protection(struct page_counter *root,
>  		return;
>  
>  	usage = page_counter_read(counter);
> -	if (!usage)
> +	local_usage = get_cgroup_local_usage(counter->memcg, true);
> +	if (!usage || !local_usage)
>  		return;
>  
>  	if (parent == root) {
>  		counter->emin = READ_ONCE(counter->min);
>  		counter->elow = READ_ONCE(counter->low);
> +		counter->elocallow = READ_ONCE(counter->locallow);
>  		return;
>  	}
>  
>  	parent_usage = page_counter_read(parent);
> +	parent_local_usage = get_cgroup_local_usage(parent->memcg, true);
>  
>  	WRITE_ONCE(counter->emin, effective_protection(usage, parent_usage,
>  			READ_ONCE(counter->min),
> @@ -454,7 +475,16 @@ void page_counter_calculate_protection(struct page_counter *root,
>  			atomic_long_read(&parent->children_min_usage),
>  			recursive_protection));
>  
> -	WRITE_ONCE(counter->elow, effective_protection(usage, parent_usage,
> +	if (is_local)
> +		WRITE_ONCE(counter->elocallow,
> +			effective_protection(local_usage, parent_local_usage,
> +			READ_ONCE(counter->locallow),
> +			READ_ONCE(parent->elocallow),
> +			atomic_long_read(&parent->children_locallow_usage),
> +			recursive_protection));
> +	else
> +		WRITE_ONCE(counter->elow,
> +			effective_protection(usage, parent_usage,
>  			READ_ONCE(counter->low),
>  			READ_ONCE(parent->elow),
>  			atomic_long_read(&parent->children_low_usage),
> -- 
> 2.43.0
> 


* Re: [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion
  2024-09-20 22:11 [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion kaiyang2
                   ` (4 preceding siblings ...)
  2024-10-11 20:51 ` [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion Kaiyang Zhao
@ 2024-11-08 19:01 ` kaiyang2
  5 siblings, 0 replies; 13+ messages in thread
From: kaiyang2 @ 2024-11-08 19:01 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: roman.gushchin, shakeel.butt, muchun.song, akpm, mhocko,
	nehagholkar, abhishekd, hannes, weixugc, rientjes, gourry,
	Kaiyang Zhao

From: Kaiyang Zhao <kaiyang2@cs.cmu.edu>

Adding some performance results from testing on a *real* system with CXL memory
to demonstrate the value of the patches.

The system has 256GB local DRAM + 64GB CXL memory. We stack two workloads
together in two cgroups. One is a microbenchmark that allocates memory and
accesses it at tunable hotness levels. It allocates 256GB of memory and
accesses it in sequential passes with a very hot access pattern (~1 second per
pass). The other workload is 64 instances of 520.omnetpp_r from SPEC CPU 2017,
which uses about 14GB of memory in total. We apply memory bandwidth limits (1
Gbps memory bandwidth per logical core) and LLC contention mitigation by
setting cpuset for each cgroup.

Case 1: omnetpp running without the microbenchmark.
It is able to use all local memory without resource contention. This is
the optimal case.
Avg rate reported by SPEC = 84.7

Case 2: Running the two workloads stacked without the fairness patches, with
the microbenchmark started first.
Avg = 62.7 (-25.9%)

Case 3: Set memory.low = 19GB for both workloads. This is enough local memory
low protection for the entire memory usage of omnetpp.
Avg = 75.3 (-11.1%)
Analysis: omnetpp still uses significant CXL memory (up to 3GB) by the time it
finishes because the hint faults for it only trigger for a few seconds in the
~20 minute runtime. Due to the short runtime of the workload and how tiering
currently works, it finishes before the memory usage converges to the point
where all its memory use is local. However, this still represents a significant
improvement over case 2.
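
As a sanity check on the claim that 19GB is enough, the scaling from patch 2/4
(locallow = low * local_capacity / total_capacity) can be applied to the
numbers above. This is a userspace sketch: local_low() and
case3_local_low_mib() are illustrative names, "GB" is assumed to mean GiB, and
the node capacities are assumed to be exactly 256GB and 64GB, which the
kernel-reported totals will not match exactly.

```c
/*
 * Local (top-tier) share of a memory.low setting, following the scaling in
 * patch 2/4. All arguments use the same unit (MiB below). Returns 0 if there
 * is no memory. Plain multiplication is fine for these magnitudes.
 */
unsigned long long local_low(unsigned long long low,
			     unsigned long long local_capacity,
			     unsigned long long remote_capacity)
{
	unsigned long long total_capacity = local_capacity + remote_capacity;

	if (!total_capacity)
		return 0;
	return low * local_capacity / total_capacity;
}

#define MIB_PER_GIB	1024ULL

/* Case 3 numbers, assuming "GB" in the report means GiB. */
unsigned long long case3_local_low_mib(void)
{
	return local_low(19 * MIB_PER_GIB,	/* memory.low */
			 256 * MIB_PER_GIB,	/* local DRAM */
			 64 * MIB_PER_GIB);	/* CXL memory */
}
```

Under these assumptions case3_local_low_mib() evaluates to 15564 MiB, about
15.2GB, which is above the roughly 14GB that omnetpp uses in total.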

Case 4: Set memory.low = 19GB for both workloads. Set memory.high = 257GB for
the microbenchmark.
Avg = 84.0 (<1% difference with case 1)
Analysis: by setting both memory.low and memory.high, the usage of local memory
is essentially provisioned for the microbenchmark. Therefore, even if the
microbenchmark starts first, when omnetpp starts it can get all local memory
from the very beginning and achieve near non-colocated performance.
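
For reference, a minimal sketch of applying the case 4 settings through the
cgroup v2 filesystem. cgroup_write() and apply_case4() are hypothetical
helpers, the cgroup directories are placeholders supplied by the caller, and
"19G"/"257G" assume that the GB figures above map to the G suffix accepted by
the memcg files. memory.locallow is not written because the patch exposes it
as read-only.

```c
#include <stdio.h>

/*
 * Write @value to the control file @file of the cgroup at @cgroup_dir.
 * Returns 0 on success, -1 on failure.
 */
int cgroup_write(const char *cgroup_dir, const char *file, const char *value)
{
	char path[4096];
	FILE *f;
	int n, ok;

	n = snprintf(path, sizeof(path), "%s/%s", cgroup_dir, file);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fputs(value, f) >= 0;
	/*
	 * fclose() flushes the buffered write, so a value rejected by the
	 * kernel shows up as an error here.
	 */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Apply the case 4 settings; both directory paths are hypothetical. */
int apply_case4(const char *omnetpp_dir, const char *microbench_dir)
{
	if (cgroup_write(omnetpp_dir, "memory.low", "19G"))
		return -1;
	if (cgroup_write(microbench_dir, "memory.low", "19G"))
		return -1;
	return cgroup_write(microbench_dir, "memory.high", "257G");
}
```

A caller would pass the two cgroup directories under the cgroup v2 mount
point, and could read memory.locallow afterwards to see the scaled value.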

We’re working on getting performance data from Meta’s production workloads.
Stay tuned for more results.


Thread overview: 13+ messages
2024-09-20 22:11 [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion kaiyang2
2024-09-20 22:11 ` [RFC PATCH 1/4] Add get_cgroup_local_usage for estimating the top-tier memory usage kaiyang2
2024-09-20 22:11 ` [RFC PATCH 2/4] calculate memory.low for the local node and track its usage kaiyang2
2024-09-21 23:18   ` kernel test robot
2024-09-22  8:39   ` kernel test robot
2024-10-15 22:05   ` Gregory Price
2024-09-20 22:11 ` [RFC PATCH 3/4] use memory.low local node protection for local node reclaim kaiyang2
2024-09-22  0:51   ` kernel test robot
2024-09-22 16:31   ` kernel test robot
2024-10-15 21:52   ` Gregory Price
2024-09-20 22:11 ` [RFC PATCH 4/4] reduce NUMA balancing scan size of cgroups over their local memory.low kaiyang2
2024-10-11 20:51 ` [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion Kaiyang Zhao
2024-11-08 19:01 ` kaiyang2
