* [PATCH v8 00/12] mm/demotion: Memory tiers and demotion
@ 2022-07-04  7:06 Aneesh Kumar K.V
  2022-07-04  7:06 ` [PATCH v8 01/12] mm/demotion: Add support for explicit memory tiers Aneesh Kumar K.V
                   ` (14 more replies)
  0 siblings, 15 replies; 42+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-04  7:06 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V
The current kernel has basic memory tiering support: Inactive
pages on a higher tier NUMA node can be migrated (demoted) to a lower
tier NUMA node to make room for new allocations on the higher tier
NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
migrated (promoted) to a higher tier NUMA node to improve the
performance.
In the current kernel, memory tiers are defined implicitly via a
demotion path relationship between NUMA nodes, which is created during
the kernel initialization and updated when a NUMA node is hot-added or
hot-removed.  The current implementation puts all nodes with CPUs into
the top tier, and builds the tier hierarchy tier-by-tier by establishing
the per-node demotion targets based on the distances between nodes.
This current memory tier kernel interface needs to be improved for
several important use cases:
* The current tier initialization code always initializes
  each memory-only NUMA node into a lower tier.  But a memory-only
  NUMA node may have a high performance memory device (e.g. a DRAM
  device attached via CXL.mem or a DRAM-backed memory-only node on
  a virtual machine) and should be put into a higher tier.
* The current tier hierarchy always puts CPU nodes into the top
  tier. But on a system with HBM (e.g. GPU memory) devices, these
  memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
  with CPUs are better placed in the next lower tier.
* Also because the current tier hierarchy always puts CPU nodes
  into the top tier, when a CPU is hot-added (or hot-removed) and
  turns a CPU-less memory node into a CPU node (or vice
  versa), the memory tier hierarchy changes, even though no
  memory node is added or removed.  This can make the tier
  hierarchy unstable and make it difficult to support tier-based
  memory accounting.
* A higher tier node can only be demoted to selected nodes on the
  next lower tier as defined by the demotion path, not any other
  node from any lower tier.  This strict, hard-coded demotion order
  does not work in all use cases (e.g. some use cases may want to
  allow cross-socket demotion to another node in the same demotion
  tier as a fallback when the preferred demotion node is out of
  space), and has resulted in the feature request for an interface to
  override the system-wide, per-node demotion order from the
  userspace.  This demotion order is also inconsistent with the page
  allocation fallback order when all the nodes in a higher tier are
  out of space: The page allocation can fall back to any node from
  any lower tier, whereas the demotion order doesn't allow that.
* There are no interfaces for userspace to learn about the memory
  tier hierarchy in order to optimize its memory allocations.
This patch series makes the creation of memory tiers explicit, under
the control of userspace or a device driver.
Memory Tier Initialization
==========================
By default, all memory nodes are assigned to the default tier with
tier ID value 200.
A device driver can move its memory nodes up or down from the default
tier.  For example, PMEM can move its memory nodes below the
default tier, whereas a GPU can move its memory nodes above the
default tier.
The kernel initialization code makes the decision on which exact tier
a memory node should be assigned to based on the requests from the
device drivers as well as the memory device hardware information
provided by the firmware.
Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
Memory Allocation for Demotion
==============================
This patch series keeps the demotion target page allocation logic the
same.  The demotion page allocation picks the NUMA node in the next
lower tier that is closest to the NUMA node pages are being demoted from.
This will later be improved to use the same page allocation strategy
as the allocation fallback list.
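The selection rule above can be modelled in userspace as follows.  This
is only a sketch under stated assumptions, not the code from this
series: the four-node topology, the distance matrix and the helper
names next_lower_tier() and demotion_target() are all hypothetical.

```c
#include <assert.h>
#include <limits.h>

#define NR_NODES 4
#define NO_NODE  (-1)

/* Hypothetical topology: tier ID per node (higher ID = faster tier). */
static const int node_tier[NR_NODES] = { 200, 200, 100, 100 };

/* Hypothetical NUMA distance matrix (smaller value = closer). */
static const int node_dist[NR_NODES][NR_NODES] = {
	{ 10, 20, 17, 28 },
	{ 20, 10, 28, 17 },
	{ 17, 28, 10, 28 },
	{ 28, 17, 28, 10 },
};

/* Return the highest tier ID strictly below 'tier', or -1 if none. */
static int next_lower_tier(int tier)
{
	int best = -1;

	for (int n = 0; n < NR_NODES; n++)
		if (node_tier[n] < tier && node_tier[n] > best)
			best = node_tier[n];
	return best;
}

/* Return the closest node in the next lower tier, or NO_NODE. */
static int demotion_target(int node)
{
	int lower, target = NO_NODE, best_dist = INT_MAX;

	assert(node >= 0 && node < NR_NODES);
	lower = next_lower_tier(node_tier[node]);
	if (lower < 0)
		return NO_NODE;	/* already in the lowest tier */

	for (int n = 0; n < NR_NODES; n++) {
		if (node_tier[n] != lower)
			continue;
		if (node_dist[node][n] < best_dist) {
			best_dist = node_dist[node][n];
			target = n;
		}
	}
	return target;
}
```

With this made-up topology, node 0 demotes to node 2 and node 1 to
node 3, while nodes 2 and 3 have no demotion target because no tier
exists below theirs.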
Sysfs Interface:
----------------
Listing the current memory tiers:
:/sys/devices/system/memtier$ ls
default_tier max_tier  memtier1  power  uevent
:/sys/devices/system/memtier$ cat default_tier
memtier200
:/sys/devices/system/memtier$ cat max_tier 
400
:/sys/devices/system/memtier$ 
Per-node memory tier details:
For a CPU-only NUMA node:
:/sys/devices/system/node# cat node0/memtier 
:/sys/devices/system/node# echo 1 > node0/memtier 
:/sys/devices/system/node# cat node0/memtier 
:/sys/devices/system/node# 
For a NUMA node with memory:
:/sys/devices/system/node# cat node1/memtier 
1
:/sys/devices/system/node# ls ../memtier/
default_tier  max_tier  memtier1  power  uevent
:/sys/devices/system/node# echo 2 > node1/memtier 
:/sys/devices/system/node# 
:/sys/devices/system/node# ls ../memtier/
default_tier  max_tier  memtier1  memtier2  power  uevent
:/sys/devices/system/node# cat node1/memtier 
2
:/sys/devices/system/node# 
Removing a memory tier:
:/sys/devices/system/node# cat node1/memtier 
2
:/sys/devices/system/node# echo 1 > node1/memtier 
:/sys/devices/system/node# 
:/sys/devices/system/node# cat node1/memtier 
1
:/sys/devices/system/node# 
:/sys/devices/system/node# ls ../memtier/
default_tier  max_tier  memtier1  power  uevent
:/sys/devices/system/node# 
The above removed memtier2, which was created in the earlier step.
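A userspace program could read these files roughly as follows.  This is
a sketch only: it assumes the default_tier file prints a "memtier<N>"
name as in the listing above, and the helper names
parse_memtier_name(), read_sysfs_line() and default_memory_tier() are
hypothetical, not part of any library.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse "memtier<N>" (optionally newline-terminated); return N or -1. */
static int parse_memtier_name(const char *s)
{
	static const char prefix[] = "memtier";
	char *end;
	long id;

	assert(s != NULL);
	if (strncmp(s, prefix, strlen(prefix)) != 0)
		return -1;
	s += strlen(prefix);
	if (*s < '0' || *s > '9')
		return -1;
	id = strtol(s, &end, 10);
	if ((*end != '\0' && *end != '\n') || id > 1000000)
		return -1;
	return (int)id;
}

/* Read the first line of a sysfs file into buf; return 0 or -1. */
static int read_sysfs_line(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return 0;
}

/* Return the default tier ID, or -1 if the interface is absent. */
static int default_memory_tier(void)
{
	char buf[64];

	if (read_sysfs_line("/sys/devices/system/memtier/default_tier",
			    buf, sizeof(buf)) != 0)
		return -1;
	return parse_memtier_name(buf);
}
```

On a kernel without this series the sysfs file does not exist, so
default_memory_tier() simply returns -1 and the caller can fall back
to treating all nodes as one tier.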
Changes from v7:
* Fix kernel crash with demotion.
* Improve documentation.
Changes from v6:
* Drop the usage of rank.
* Address other review feedback.
Changes from v5:
* Remove patch supporting N_MEMORY node removal from memory tiers. Memory
  tiers are going to be used for features other than demotion. Hence keep
  all N_MEMORY nodes in memory tiers irrespective of whether they want to
  participate in promotion or demotion.
* Add NODE_DATA->memtier
* Rearrange patches to add sysfs files later.
* Add support to create memory tiers from userspace.
* Address other review feedback.
Changes from v4:
* Address review feedback.
* Reverse the meaning of "rank": higher rank value means higher tier.
* Add "/sys/devices/system/memtier/default_tier".
* Add node_is_toptier
v4:
Add support for explicit memory tiers and ranks.
v3:
- Modify patch 1 subject to make it more specific
- Remove /sys/kernel/mm/numa/demotion_targets interface, use
  /sys/devices/system/node/demotion_targets instead and make
  it writable to override node_states[N_DEMOTION_TARGETS].
- Add support to view per node demotion targets via sysfs
v2:
In v1, only the 1st patch of this patch series was sent. It was
implemented to avoid some of the limitations on demotion
target sharing. However, for certain NUMA topologies, the demotion
targets found by that patch were not optimal, so the 1st patch
in this series is modified according to suggestions from Huang
and Baolin. Different examples of demotion list comparison
between the existing implementation and the changed implementation can
be found in the commit message of the 1st patch.
Aneesh Kumar K.V (10):
  mm/demotion: Add support for explicit memory tiers
  mm/demotion: Move memory demotion related code
  mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM
  mm/demotion: Add hotplug callbacks to handle new numa node onlined
  mm/demotion: Build demotion targets based on explicit memory tiers
  mm/demotion: Expose memory tier details via sysfs
  mm/demotion: Add per node memory tier attribute to sysfs
  mm/demotion: Add pg_data_t member to track node memory tier details
  mm/demotion: Update node_is_toptier to work with memory tiers
  mm/demotion: Add sysfs ABI documentation
Jagdish Gediya (2):
  mm/demotion: Demote pages according to allocation fallback order
  mm/demotion: Add documentation for memory tiering
 .../ABI/testing/sysfs-kernel-mm-memory-tiers  |  61 ++
 Documentation/admin-guide/mm/index.rst        |   1 +
 .../admin-guide/mm/memory-tiering.rst         | 192 +++++
 drivers/base/node.c                           |  42 +
 drivers/dax/kmem.c                            |   6 +-
 include/linux/memory-tiers.h                  |  72 ++
 include/linux/migrate.h                       |  15 -
 include/linux/mmzone.h                        |   3 +
 include/linux/node.h                          |   5 -
 mm/Makefile                                   |   1 +
 mm/huge_memory.c                              |   1 +
 mm/memory-tiers.c                             | 791 ++++++++++++++++++
 mm/migrate.c                                  | 453 +---------
 mm/mprotect.c                                 |   1 +
 mm/vmscan.c                                   |  59 +-
 mm/vmstat.c                                   |   4 -
 16 files changed, 1215 insertions(+), 492 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers
 create mode 100644 Documentation/admin-guide/mm/memory-tiering.rst
 create mode 100644 include/linux/memory-tiers.h
 create mode 100644 mm/memory-tiers.c
-- 
2.36.1
* [PATCH v8 01/12] mm/demotion: Add support for explicit memory tiers
From: Aneesh Kumar K.V @ 2022-07-04  7:06 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V, Jagdish Gediya
In the current kernel, memory tiers are defined implicitly via a
demotion path relationship between NUMA nodes, which is created
during the kernel initialization and updated when a NUMA node is
hot-added or hot-removed.  The current implementation puts all
nodes with CPUs into the top tier, and builds the tier hierarchy
tier-by-tier by establishing the per-node demotion targets based
on the distances between nodes.
This current memory tier kernel interface needs to be improved for
several important use cases:
The current tier initialization code always initializes
each memory-only NUMA node into a lower tier.  But a memory-only
NUMA node may have a high performance memory device (e.g. a DRAM
device attached via CXL.mem or a DRAM-backed memory-only node on
a virtual machine) and should be put into a higher tier.
The current tier hierarchy always puts CPU nodes into the top
tier. But on a system with HBM or GPU devices, the
memory-only NUMA nodes mapping these devices should be in the
top tier, and DRAM nodes with CPUs are better placed in the
next lower tier.
With the current kernel, a higher tier node can only be demoted to
selected nodes on the next lower tier as defined by the demotion path,
not any other node from any lower tier.  This strict, hard-coded
demotion order does not work in all use cases (e.g. some use cases may
want to allow cross-socket demotion to another node in the same
demotion tier as a fallback when the preferred demotion node is out of
space).  This demotion order is also inconsistent with the page
allocation fallback order when all the nodes in a higher tier are
out of space: The page allocation can fall back to any node from
any lower tier, whereas the demotion order doesn't allow that.
The current kernel also doesn't provide any interfaces for
userspace to learn about the memory tier hierarchy in order to
optimize its memory allocations.
This patch series addresses the above by defining memory tiers explicitly.
This patch introduces explicit memory tiers. The tier ID value
of a memory tier is used to derive the demotion order between
NUMA nodes.
For example, if we have 3 memtiers: memtier100, memtier200, memtier300
then the memory tier order is: memtier300 -> memtier200 -> memtier100
where memtier300 is the highest tier and memtier100 is the lowest tier.
During reclaim, we migrate pages from fast (higher) tiers to slow (lower)
tiers when the fast (higher) tier is under memory pressure.
This patchset introduces 3 memory tiers (memtier100, memtier200 and memtier300)
which are created by different kernel subsystems. The default memory
tier created by the kernel is memtier200. A kernel parameter is provided
to override the default memory tier.
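The ordering described above can be modelled in userspace as a list of
tier IDs kept sorted from highest to lowest.  This is only a sketch: it
uses a fixed array instead of the kernel's list_head, rejects duplicate
IDs as a simplification, and the names tier_list, tier_insert() and
tier_demotes_to() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TIERS 8

/* Tier IDs kept sorted from highest (fastest) to lowest (slowest). */
struct tier_list {
	int ids[MAX_TIERS];
	size_t count;
};

/* Insert 'id' keeping descending order; return 0, or -1 if full/duplicate. */
static int tier_insert(struct tier_list *tl, int id)
{
	size_t pos = 0;

	assert(tl != NULL);
	if (tl->count == MAX_TIERS)
		return -1;
	/* Skip every tier that ranks above the new one. */
	while (pos < tl->count && tl->ids[pos] > id)
		pos++;
	if (pos < tl->count && tl->ids[pos] == id)
		return -1;
	/* Shift the lower tiers down by one slot to make room. */
	for (size_t i = tl->count; i > pos; i--)
		tl->ids[i] = tl->ids[i - 1];
	tl->ids[pos] = id;
	tl->count++;
	return 0;
}

/* Return the tier that 'id' demotes into, or -1 for the lowest tier. */
static int tier_demotes_to(const struct tier_list *tl, int id)
{
	for (size_t i = 0; i + 1 < tl->count; i++)
		if (tl->ids[i] == id)
			return tl->ids[i + 1];
	return -1;
}
```

Inserting 200, 100 and 300 in that order yields 300, 200, 100, so
tier 300 demotes into tier 200 and tier 100 has nowhere to demote to.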
Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h | 15 +++++++
 mm/Makefile                  |  1 +
 mm/memory-tiers.c            | 78 ++++++++++++++++++++++++++++++++++++
 3 files changed, 94 insertions(+)
 create mode 100644 include/linux/memory-tiers.h
 create mode 100644 mm/memory-tiers.c
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
new file mode 100644
index 000000000000..a81dbc20e0d1
--- /dev/null
+++ b/include/linux/memory-tiers.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MEMORY_TIERS_H
+#define _LINUX_MEMORY_TIERS_H
+
+#ifdef CONFIG_NUMA
+
+#define MEMORY_TIER_HBM_GPU	300
+#define MEMORY_TIER_DRAM	200
+#define MEMORY_TIER_PMEM	100
+
+#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
+#define MAX_MEMORY_TIER_ID	400
+
+#endif	/* CONFIG_NUMA */
+#endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/Makefile b/mm/Makefile
index 6f9ffa968a1a..d30acebc2164 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMTEST)		+= memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_NUMA) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
new file mode 100644
index 000000000000..69a5d81c0a12
--- /dev/null
+++ b/mm/memory-tiers.c
@@ -0,0 +1,78 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/types.h>
+#include <linux/nodemask.h>
+#include <linux/slab.h>
+#include <linux/lockdep.h>
+#include <linux/moduleparam.h>
+#include <linux/memory-tiers.h>
+
+struct memory_tier {
+	struct list_head list;
+	nodemask_t nodelist;
+	int id;
+};
+
+static DEFINE_MUTEX(memory_tier_lock);
+static LIST_HEAD(memory_tiers);
+
+static void insert_memory_tier(struct memory_tier *memtier)
+{
+	struct list_head *ent;
+	struct memory_tier *tmp_memtier;
+
+	lockdep_assert_held_once(&memory_tier_lock);
+
+	list_for_each(ent, &memory_tiers) {
+		tmp_memtier = list_entry(ent, struct memory_tier, list);
+		if (tmp_memtier->id < memtier->id) {
+			list_add_tail(&memtier->list, ent);
+			return;
+		}
+	}
+	list_add_tail(&memtier->list, &memory_tiers);
+}
+
+static struct memory_tier *register_memory_tier(unsigned int tier)
+{
+	struct memory_tier *memtier;
+
+	if (tier > MAX_MEMORY_TIER_ID)
+		return ERR_PTR(-EINVAL);
+
+	memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
+	if (!memtier)
+		return ERR_PTR(-ENOMEM);
+
+	memtier->id   = tier;
+
+	insert_memory_tier(memtier);
+
+	return memtier;
+}
+
+static unsigned int default_memtier = DEFAULT_MEMORY_TIER;
+core_param(default_memory_tier, default_memtier, uint, 0644);
+
+static int __init memory_tier_init(void)
+{
+	struct memory_tier *memtier;
+
+	/*
+	 * Register only default memory tier to hide all empty
+	 * memory tier from sysfs. Since this is early during
+	 * boot, we could avoid holding memory_tier_lock. But
+	 * keep it simple by holding locks. So we can add lock
+	 * held debug checks in other functions.
+	 */
+	mutex_lock(&memory_tier_lock);
+	memtier = register_memory_tier(default_memtier);
+	if (IS_ERR(memtier))
+		panic("%s() failed to register memory tier: %ld\n",
+		      __func__, PTR_ERR(memtier));
+
+	/* CPU only nodes are not part of memory tiers. */
+	memtier->nodelist = node_states[N_MEMORY];
+	mutex_unlock(&memory_tier_lock);
+	return 0;
+}
+subsys_initcall(memory_tier_init);
-- 
2.36.1
* [PATCH v8 02/12] mm/demotion: Move memory demotion related code
From: Aneesh Kumar K.V @ 2022-07-04  7:06 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V
This moves memory demotion related code to mm/memory-tiers.c.
No functional change in this patch.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h |  7 ++++
 include/linux/migrate.h      |  2 --
 mm/memory-tiers.c            | 63 ++++++++++++++++++++++++++++++++++++
 mm/migrate.c                 | 60 +---------------------------------
 mm/vmscan.c                  |  1 +
 5 files changed, 72 insertions(+), 61 deletions(-)
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index a81dbc20e0d1..c47dbe381089 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -2,6 +2,8 @@
 #ifndef _LINUX_MEMORY_TIERS_H
 #define _LINUX_MEMORY_TIERS_H
 
+#include <linux/types.h>
+
 #ifdef CONFIG_NUMA
 
 #define MEMORY_TIER_HBM_GPU	300
@@ -11,5 +13,10 @@
 #define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
 #define MAX_MEMORY_TIER_ID	400
 
+extern bool numa_demotion_enabled;
+
+#else
+
+#define numa_demotion_enabled	false
 #endif	/* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 069a89e847f3..43e737215f33 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -78,7 +78,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA)
 extern void set_migration_target_nodes(void);
 extern void migrate_on_reclaim_init(void);
-extern bool numa_demotion_enabled;
 extern int next_demotion_node(int node);
 #else
 static inline void set_migration_target_nodes(void) {}
@@ -87,7 +86,6 @@ static inline int next_demotion_node(int node)
 {
         return NUMA_NO_NODE;
 }
-#define numa_demotion_enabled  false
 #endif
 
 #ifdef CONFIG_COMPACTION
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 69a5d81c0a12..2dcf70802661 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/types.h>
+#include <linux/device.h>
 #include <linux/nodemask.h>
 #include <linux/slab.h>
 #include <linux/lockdep.h>
@@ -76,3 +77,65 @@ static int __init memory_tier_init(void)
 	return 0;
 }
 subsys_initcall(memory_tier_init);
+
+bool numa_demotion_enabled = false;
+
+#ifdef CONFIG_MIGRATION
+#ifdef CONFIG_SYSFS
+static ssize_t numa_demotion_enabled_show(struct kobject *kobj,
+					  struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%s\n",
+			  numa_demotion_enabled ? "true" : "false");
+}
+
+static ssize_t numa_demotion_enabled_store(struct kobject *kobj,
+					   struct kobj_attribute *attr,
+					   const char *buf, size_t count)
+{
+	ssize_t ret;
+
+	ret = kstrtobool(buf, &numa_demotion_enabled);
+	if (ret)
+		return ret;
+
+	return count;
+}
+
+static struct kobj_attribute numa_demotion_enabled_attr =
+	__ATTR(demotion_enabled, 0644, numa_demotion_enabled_show,
+	       numa_demotion_enabled_store);
+
+static struct attribute *numa_attrs[] = {
+	&numa_demotion_enabled_attr.attr,
+	NULL,
+};
+
+static const struct attribute_group numa_attr_group = {
+	.attrs = numa_attrs,
+};
+
+static int __init numa_init_sysfs(void)
+{
+	int err;
+	struct kobject *numa_kobj;
+
+	numa_kobj = kobject_create_and_add("numa", mm_kobj);
+	if (!numa_kobj) {
+		pr_err("failed to create numa kobject\n");
+		return -ENOMEM;
+	}
+	err = sysfs_create_group(numa_kobj, &numa_attr_group);
+	if (err) {
+		pr_err("failed to register numa group\n");
+		goto delete_obj;
+	}
+	return 0;
+
+delete_obj:
+	kobject_put(numa_kobj);
+	return err;
+}
+subsys_initcall(numa_init_sysfs);
+#endif /* CONFIG_SYSFS */
+#endif
diff --git a/mm/migrate.c b/mm/migrate.c
index 6c1ea61f39d8..fce7d4a9e940 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2509,64 +2509,6 @@ void __init migrate_on_reclaim_init(void)
 	set_migration_target_nodes();
 	cpus_read_unlock();
 }
+#endif /* CONFIG_NUMA */
 
-bool numa_demotion_enabled = false;
-
-#ifdef CONFIG_SYSFS
-static ssize_t numa_demotion_enabled_show(struct kobject *kobj,
-					  struct kobj_attribute *attr, char *buf)
-{
-	return sysfs_emit(buf, "%s\n",
-			  numa_demotion_enabled ? "true" : "false");
-}
-
-static ssize_t numa_demotion_enabled_store(struct kobject *kobj,
-					   struct kobj_attribute *attr,
-					   const char *buf, size_t count)
-{
-	ssize_t ret;
-
-	ret = kstrtobool(buf, &numa_demotion_enabled);
-	if (ret)
-		return ret;
-
-	return count;
-}
-
-static struct kobj_attribute numa_demotion_enabled_attr =
-	__ATTR(demotion_enabled, 0644, numa_demotion_enabled_show,
-	       numa_demotion_enabled_store);
-
-static struct attribute *numa_attrs[] = {
-	&numa_demotion_enabled_attr.attr,
-	NULL,
-};
-
-static const struct attribute_group numa_attr_group = {
-	.attrs = numa_attrs,
-};
-
-static int __init numa_init_sysfs(void)
-{
-	int err;
-	struct kobject *numa_kobj;
 
-	numa_kobj = kobject_create_and_add("numa", mm_kobj);
-	if (!numa_kobj) {
-		pr_err("failed to create numa kobject\n");
-		return -ENOMEM;
-	}
-	err = sysfs_create_group(numa_kobj, &numa_attr_group);
-	if (err) {
-		pr_err("failed to register numa group\n");
-		goto delete_obj;
-	}
-	return 0;
-
-delete_obj:
-	kobject_put(numa_kobj);
-	return err;
-}
-subsys_initcall(numa_init_sysfs);
-#endif /* CONFIG_SYSFS */
-#endif /* CONFIG_NUMA */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f7d9a683e3a7..3a8f78277f99 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -50,6 +50,7 @@
 #include <linux/printk.h>
 #include <linux/dax.h>
 #include <linux/psi.h>
+#include <linux/memory-tiers.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
-- 
2.36.1
* [PATCH v8 03/12] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM
From: Aneesh Kumar K.V @ 2022-07-04  7:06 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V, Jagdish Gediya
By default, all nodes are assigned to DEFAULT_MEMORY_TIER, which
is the memory tier designated for nodes with DRAM.
Set the dax kmem device node's tier to MEMORY_TIER_PMEM. MEMORY_TIER_PMEM
appears below DEFAULT_MEMORY_TIER in the demotion order.
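The node-to-tier bookkeeping this patch adds (create the tier on
demand, move the node, drop a tier once it is empty) can be modelled in
userspace as below.  This is a sketch under stated assumptions, not the
kernel code: it uses one bitmask per tier ID instead of nodemask_t and
a list, supports at most 32 nodes, and the helper names are
hypothetical.

```c
#include <assert.h>

#define MAX_TIER_ID 400
#define NO_TIER     (-1)

/* One bit per node; index is the tier ID. A zero mask means "no such tier". */
static unsigned long tier_nodes[MAX_TIER_ID + 1];

/* Return the tier ID that 'node' belongs to, or NO_TIER. */
static int node_get_tier(int node)
{
	for (int t = 0; t <= MAX_TIER_ID; t++)
		if (tier_nodes[t] & (1UL << node))
			return t;
	return NO_TIER;
}

/* Move 'node' into 'tier', implicitly creating the tier if needed. */
static int node_set_tier(int node, int tier)
{
	int old;

	assert(node >= 0 && node < 32);
	if (tier < 0 || tier > MAX_TIER_ID)
		return -1;	/* models the MAX_MEMORY_TIER_ID check */

	old = node_get_tier(node);
	if (old != NO_TIER)
		tier_nodes[old] &= ~(1UL << node);	/* may empty the old tier */
	tier_nodes[tier] |= 1UL << node;
	return 0;
}

/* A tier exists only while at least one node is assigned to it. */
static int tier_exists(int tier)
{
	return tier >= 0 && tier <= MAX_TIER_ID && tier_nodes[tier] != 0;
}
```

In this model an invalid tier ID is rejected before the node is
touched, so the node stays in its old tier, which corresponds to the
"reset it back to older tier" path in the patch.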
Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 drivers/dax/kmem.c           |  6 ++-
 include/linux/memory-tiers.h |  5 +++
 mm/memory-tiers.c            | 79 ++++++++++++++++++++++++++++++++++++
 3 files changed, 89 insertions(+), 1 deletion(-)
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index a37622060fff..0c03889286ac 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -11,6 +11,7 @@
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/mman.h>
+#include <linux/memory-tiers.h>
 #include "dax-private.h"
 #include "bus.h"
 
@@ -41,6 +42,9 @@ struct dax_kmem_data {
 	struct resource *res[];
 };
 
+static unsigned int dax_kmem_memtier = MEMORY_TIER_PMEM;
+module_param(dax_kmem_memtier, uint, 0644);
+
 static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 {
 	struct device *dev = &dev_dax->dev;
@@ -146,7 +150,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 	}
 
 	dev_set_drvdata(dev, data);
-
+	node_create_and_set_memory_tier(numa_node, dax_kmem_memtier);
 	return 0;
 
 err_request_mem:
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index c47dbe381089..9d36ff13c954 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -14,9 +14,14 @@
 #define MAX_MEMORY_TIER_ID	400
 
 extern bool numa_demotion_enabled;
+int node_create_and_set_memory_tier(int node, int tier);
 
 #else
 
 #define numa_demotion_enabled	false
+static inline int node_create_and_set_memory_tier(int node, int tier)
+{
+	return 0;
+}
 #endif	/* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 2dcf70802661..fc404fcff7ff 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -51,6 +51,85 @@ static struct memory_tier *register_memory_tier(unsigned int tier)
 	return memtier;
 }
 
+static void unregister_memory_tier(struct memory_tier *memtier)
+{
+	list_del(&memtier->list);
+	kfree(memtier);
+}
+
+static struct memory_tier *__node_get_memory_tier(int node)
+{
+	struct memory_tier *memtier;
+
+	list_for_each_entry(memtier, &memory_tiers, list) {
+		if (node_isset(node, memtier->nodelist))
+			return memtier;
+	}
+	return NULL;
+}
+
+static struct memory_tier *__get_memory_tier_from_id(int id)
+{
+	struct memory_tier *memtier;
+
+	list_for_each_entry(memtier, &memory_tiers, list) {
+		if (memtier->id == id)
+			return memtier;
+	}
+	return NULL;
+}
+
+static int __node_create_and_set_memory_tier(int node, int tier)
+{
+	int ret = 0;
+	struct memory_tier *memtier;
+
+	memtier = __get_memory_tier_from_id(tier);
+	if (!memtier) {
+		memtier = register_memory_tier(tier);
+		if (IS_ERR(memtier)) {
+			ret = PTR_ERR(memtier);
+			goto out;
+		}
+	}
+	node_set(node, memtier->nodelist);
+out:
+	return ret;
+}
+
+int node_create_and_set_memory_tier(int node, int tier)
+{
+	struct memory_tier *current_tier;
+	int ret = 0;
+
+	mutex_lock(&memory_tier_lock);
+
+	current_tier = __node_get_memory_tier(node);
+	if (!current_tier) {
+		ret = __node_create_and_set_memory_tier(node, tier);
+		goto out;
+	}
+
+	if (current_tier->id == tier)
+		goto out;
+
+	node_clear(node, current_tier->nodelist);
+
+	ret = __node_create_and_set_memory_tier(node, tier);
+	if (ret) {
+		/* reset it back to older tier */
+		node_set(node, current_tier->nodelist);
+		goto out;
+	}
+	if (nodes_empty(current_tier->nodelist))
+		unregister_memory_tier(current_tier);
+out:
+	mutex_unlock(&memory_tier_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(node_create_and_set_memory_tier);
+
 static unsigned int default_memtier = DEFAULT_MEMORY_TIER;
 core_param(default_memory_tier, default_memtier, uint, 0644);
 
-- 
2.36.1
* [PATCH v8 04/12] mm/demotion: Add hotplug callbacks to handle new numa node onlined
From: Aneesh Kumar K.V @ 2022-07-04  7:06 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V
If a newly onlined NUMA node doesn't have a memory tier assigned,
the kernel adds it to the default memory tier.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 mm/memory-tiers.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index fc404fcff7ff..2147112981a6 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -5,6 +5,7 @@
 #include <linux/slab.h>
 #include <linux/lockdep.h>
 #include <linux/moduleparam.h>
+#include <linux/memory.h>
 #include <linux/memory-tiers.h>
 
 struct memory_tier {
@@ -130,8 +131,73 @@ int node_create_and_set_memory_tier(int node, int tier)
 }
 EXPORT_SYMBOL_GPL(node_create_and_set_memory_tier);
 
+static int __node_set_memory_tier(int node, int tier)
+{
+	int ret = 0;
+	struct memory_tier *memtier;
+
+	memtier = __get_memory_tier_from_id(tier);
+	if (!memtier) {
+		ret = -EINVAL;
+		goto out;
+	}
+	node_set(node, memtier->nodelist);
+out:
+	return ret;
+}
+
+static int node_set_memory_tier(int node, int tier)
+{
+	struct memory_tier *memtier;
+	int ret = 0;
+
+	mutex_lock(&memory_tier_lock);
+	memtier = __node_get_memory_tier(node);
+	if (!memtier)
+		ret = __node_set_memory_tier(node, tier);
+
+	mutex_unlock(&memory_tier_lock);
+
+	return ret;
+}
+
 static unsigned int default_memtier = DEFAULT_MEMORY_TIER;
 core_param(default_memory_tier, default_memtier, uint, 0644);
+/*
+ * This runs whether reclaim-based migration is enabled or not,
+ * which ensures that the user can turn on reclaim-based migration
+ * at any time without needing to recalculate migration targets.
+ */
+static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
+						 unsigned long action, void *_arg)
+{
+	struct memory_notify *arg = _arg;
+
+	/*
+	 * Only update the node migration order when a node is
+	 * changing status, like online->offline.
+	 */
+	if (arg->status_change_nid < 0)
+		return notifier_from_errno(0);
+
+	switch (action) {
+	case MEM_ONLINE:
+		/*
+		 * We ignore the error here, if the node already has the tier
+		 * registered, we will continue to use that for the new memory
+		 * we are adding here.
+		 */
+		node_set_memory_tier(arg->status_change_nid, default_memtier);
+		break;
+	}
+
+	return notifier_from_errno(0);
+}
+
+static void __init migrate_on_reclaim_init(void)
+{
+	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
+}
 
 static int __init memory_tier_init(void)
 {
@@ -153,6 +219,8 @@ static int __init memory_tier_init(void)
 	/* CPU only nodes are not part of memory tiers. */
 	memtier->nodelist = node_states[N_MEMORY];
 	mutex_unlock(&memory_tier_lock);
+
+	migrate_on_reclaim_init();
 	return 0;
 }
 subsys_initcall(memory_tier_init);
-- 
2.36.1
* [PATCH v8 05/12] mm/demotion: Build demotion targets based on explicit memory tiers
  2022-07-04  7:06 [PATCH v8 00/12] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                   ` (3 preceding siblings ...)
  2022-07-04  7:06 ` [PATCH v8 04/12] mm/demotion: Add hotplug callbacks to handle new numa node onlined Aneesh Kumar K.V
@ 2022-07-04  7:06 ` Aneesh Kumar K.V
  2022-07-04  7:06 ` [PATCH v8 06/12] mm/demotion: Expose memory tier details via sysfs Aneesh Kumar K.V
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-04  7:06 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V
This patch switches the demotion target building logic to use memory tiers
instead of NUMA distance. All N_MEMORY NUMA nodes will be placed in the
default tier 200, and additional memory tiers will be added by drivers such as
dax kmem.
This patch builds the demotion target for a NUMA node by looking at all
memory tiers below the tier to which the NUMA node belongs. The closest node
in the immediately following memory tier is used as a demotion target.
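As a rough userspace illustration of the selection rule described above (not the kernel code itself), the sketch below picks the closest node or nodes from the next lower tier. It assumes the distance table from "Example 1" in this patch and uses a plain bitmask in place of nodemask_t; pick_demotion_targets(), node_dist and NR_NODES are hypothetical names, and it scans the table directly rather than calling find_next_best_node().

```c
#include <assert.h>
#include <stdint.h>

#define NR_NODES 4

/* Hypothetical distance table, copied from "Example 1" in this patch. */
static const int node_dist[NR_NODES][NR_NODES] = {
	{ 10, 20, 30, 40 },
	{ 20, 10, 40, 30 },
	{ 30, 40, 10, 40 },
	{ 40, 30, 40, 10 },
};

/*
 * Return a bitmask of the nodes in @next_tier_mask that are closest to
 * @node. Nodes at an equal best distance are all kept, similar to the
 * "preferred" mask that the patch builds per node.
 */
static uint32_t pick_demotion_targets(int node, uint32_t next_tier_mask)
{
	uint32_t preferred = 0;
	int best = -1;

	assert(node >= 0 && node < NR_NODES);

	for (int n = 0; n < NR_NODES; n++) {
		if (!(next_tier_mask & (1u << n)))
			continue;	/* not a member of the next lower tier */
		if (best == -1 || node_dist[node][n] < best) {
			best = node_dist[node][n];
			preferred = 1u << n;	/* new best: restart the mask */
		} else if (node_dist[node][n] == best) {
			preferred |= 1u << n;	/* tie: keep both candidates */
		}
	}
	return preferred;	/* 0 means "no demotion target" */
}
```

With nodes 2 and 3 in the lower tier (mask 0xC), node 0 gets node 2 and node 1 gets node 3, which matches the node_demotion[] values listed for Example 1. A node in the lowest tier is called with an empty mask and gets no target.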
Since we now build demotion targets only for N_MEMORY NUMA nodes,
the CPU hotplug calls are removed in this patch.
A new memory tier can be inserted into the tier hierarchy for a new set
of nodes without affecting the node assignment of any existing memtier,
provided that there is enough gap in the tier ID values for the new memtier.
The absolute value of a memtier's tier ID doesn't necessarily carry any meaning.
Its value relative to other memtiers decides the level of this memtier in the tier
hierarchy.
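To illustrate why only the relative order of tier IDs matters, here is a small userspace sketch of an ordered insert. It assumes the tier list is kept in descending ID order, as insert_memory_tier() in this series appears to do; struct tier_list and tier_insert() are hypothetical names, and an array stands in for the kernel list. The ID 150 used in the note below is purely illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TIERS 8

/* Tier IDs kept in descending order: index 0 is the top tier. */
struct tier_list {
	int ids[MAX_TIERS];
	size_t nr;
};

/*
 * Insert @id so that the list stays sorted from highest to lowest ID.
 * Returns the position of the new tier, or -1 if the ID already exists
 * or the list is full. Existing tiers keep their IDs.
 */
static int tier_insert(struct tier_list *tl, int id)
{
	size_t pos = 0;

	assert(tl != NULL);
	if (tl->nr == MAX_TIERS)
		return -1;

	/* Skip every tier that ranks above the new one. */
	while (pos < tl->nr && tl->ids[pos] > id)
		pos++;
	if (pos < tl->nr && tl->ids[pos] == id)
		return -1;

	/* Shift lower tiers down by one slot and place the new ID. */
	for (size_t i = tl->nr; i > pos; i--)
		tl->ids[i] = tl->ids[i - 1];
	tl->ids[pos] = id;
	tl->nr++;
	return (int)pos;
}
```

Starting from 300, 200 and 100, inserting a hypothetical ID of 150 lands between 200 and 100 without renumbering anything, which is the reason for leaving gaps between the ID values.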
For now, this patch supports hardcoded tier ID values of 300, 200 and 100 for
memory tiers.
Suggested-by: Wei Xu <weixugc@google.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h |  13 ++
 include/linux/migrate.h      |  13 --
 mm/memory-tiers.c            | 227 ++++++++++++++++++++
 mm/migrate.c                 | 394 -----------------------------------
 mm/vmstat.c                  |   4 -
 5 files changed, 240 insertions(+), 411 deletions(-)
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 9d36ff13c954..3234301c2537 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -15,6 +15,14 @@
 
 extern bool numa_demotion_enabled;
 int node_create_and_set_memory_tier(int node, int tier);
+#ifdef CONFIG_MIGRATION
+int next_demotion_node(int node);
+#else
+static inline int next_demotion_node(int node)
+{
+	return NUMA_NO_NODE;
+}
+#endif
 
 #else
 
@@ -23,5 +31,10 @@ static inline int node_create_and_set_memory_tier(int node, int tier)
 {
 	return 0;
 }
+
+static inline int next_demotion_node(int node)
+{
+	return NUMA_NO_NODE;
+}
 #endif	/* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 43e737215f33..93fab62e6548 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -75,19 +75,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 
 #endif /* CONFIG_MIGRATION */
 
-#if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA)
-extern void set_migration_target_nodes(void);
-extern void migrate_on_reclaim_init(void);
-extern int next_demotion_node(int node);
-#else
-static inline void set_migration_target_nodes(void) {}
-static inline void migrate_on_reclaim_init(void) {}
-static inline int next_demotion_node(int node)
-{
-        return NUMA_NO_NODE;
-}
-#endif
-
 #ifdef CONFIG_COMPACTION
 extern int PageMovable(struct page *page);
 extern void __SetPageMovable(struct page *page, struct address_space *mapping);
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 2147112981a6..0596f0b11065 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -6,16 +6,85 @@
 #include <linux/lockdep.h>
 #include <linux/moduleparam.h>
 #include <linux/memory.h>
+#include <linux/random.h>
 #include <linux/memory-tiers.h>
 
+#include "internal.h"
+
 struct memory_tier {
 	struct list_head list;
 	nodemask_t nodelist;
 	int id;
 };
 
+struct demotion_nodes {
+	nodemask_t preferred;
+};
+
+static void establish_migration_targets(void);
 static DEFINE_MUTEX(memory_tier_lock);
 static LIST_HEAD(memory_tiers);
+/*
+ * node_demotion[] examples:
+ *
+ * Example 1:
+ *
+ * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes.
+ *
+ * node distances:
+ * node   0    1    2    3
+ *    0  10   20   30   40
+ *    1  20   10   40   30
+ *    2  30   40   10   40
+ *    3  40   30   40   10
+ *
+ * memory_tiers[0] = <empty>
+ * memory_tiers[1] = 0-1
+ * memory_tiers[2] = 2-3
+ *
+ * node_demotion[0].preferred = 2
+ * node_demotion[1].preferred = 3
+ * node_demotion[2].preferred = <empty>
+ * node_demotion[3].preferred = <empty>
+ *
+ * Example 2:
+ *
+ * Node 0 & 1 are CPU + DRAM nodes, node 2 is memory-only DRAM node.
+ *
+ * node distances:
+ * node   0    1    2
+ *    0  10   20   30
+ *    1  20   10   30
+ *    2  30   30   10
+ *
+ * memory_tiers[0] = <empty>
+ * memory_tiers[1] = 0-2
+ * memory_tiers[2] = <empty>
+ *
+ * node_demotion[0].preferred = <empty>
+ * node_demotion[1].preferred = <empty>
+ * node_demotion[2].preferred = <empty>
+ *
+ * Example 3:
+ *
+ * Node 0 is a CPU + DRAM node, node 1 is an HBM node, node 2 is a PMEM node.
+ *
+ * node distances:
+ * node   0    1    2
+ *    0  10   20   30
+ *    1  20   10   40
+ *    2  30   40   10
+ *
+ * memory_tiers[0] = 1
+ * memory_tiers[1] = 0
+ * memory_tiers[2] = 2
+ *
+ * node_demotion[0].preferred = 2
+ * node_demotion[1].preferred = 0
+ * node_demotion[2].preferred = <empty>
+ *
+ */
+static struct demotion_nodes *node_demotion __read_mostly;
 
 static void insert_memory_tier(struct memory_tier *memtier)
 {
@@ -108,6 +177,7 @@ int node_create_and_set_memory_tier(int node, int tier)
 	current_tier = __node_get_memory_tier(node);
 	if (!current_tier) {
 		ret = __node_create_and_set_memory_tier(node, tier);
+		establish_migration_targets();
 		goto out;
 	}
 
@@ -124,6 +194,8 @@ int node_create_and_set_memory_tier(int node, int tier)
 	}
 	if (nodes_empty(current_tier->nodelist))
 		unregister_memory_tier(current_tier);
+
+	establish_migration_targets();
 out:
 	mutex_unlock(&memory_tier_lock);
 
@@ -153,14 +225,152 @@ static int node_set_memory_tier(int node, int tier)
 
 	mutex_lock(&memory_tier_lock);
 	memtier = __node_get_memory_tier(node);
+	/*
+	 * If the node is already part of a tier, proceed with the
+	 * current tier value, because we might want to establish
+	 * new migration paths now. The node might have been added to a
+	 * tier before it was made part of N_MEMORY, hence
+	 * establish_migration_targets() will have skipped this node.
+	 */
 	if (!memtier)
 		ret = __node_set_memory_tier(node, tier);
+	establish_migration_targets();
 
 	mutex_unlock(&memory_tier_lock);
 
 	return ret;
 }
 
+#ifdef CONFIG_MIGRATION
+/**
+ * next_demotion_node() - Get the next node in the demotion path
+ * @node: The starting node to lookup the next node
+ *
+ * Return: node id for next memory node in the demotion path hierarchy
+ * from @node; NUMA_NO_NODE if @node is terminal.  This does not keep
+ * @node online or guarantee that it *continues* to be the next demotion
+ * target.
+ */
+int next_demotion_node(int node)
+{
+	struct demotion_nodes *nd;
+	int target;
+
+	if (!node_demotion)
+		return NUMA_NO_NODE;
+
+	nd = &node_demotion[node];
+
+	/*
+	 * node_demotion[] is updated without excluding this
+	 * function from running.
+	 *
+	 * Make sure to use RCU over entire code blocks if
+	 * node_demotion[] reads need to be consistent.
+	 */
+	rcu_read_lock();
+	/*
+	 * If there are multiple target nodes, just select one
+	 * target node randomly.
+	 *
+	 * In addition, we can also use round-robin to select
+	 * target node, but we should introduce another variable
+	 * for node_demotion[] to record last selected target node,
+	 * that may cause cache ping-pong due to the changing of
+	 * last target node. Or introducing per-cpu data to avoid
+	 * caching issue, which seems more complicated. So selecting
+	 * target node randomly seems better until now.
+	 */
+	target = node_random(&nd->preferred);
+	rcu_read_unlock();
+
+	return target;
+}
+
+/* Disable reclaim-based migration. */
+static void __disable_all_migrate_targets(void)
+{
+	int node;
+
+	for_each_node_state(node, N_MEMORY)
+		node_demotion[node].preferred = NODE_MASK_NONE;
+}
+
+static void disable_all_migrate_targets(void)
+{
+	__disable_all_migrate_targets();
+
+	/*
+	 * Ensure that the "disable" is visible across the system.
+	 * Readers will see either a combination of before+disable
+	 * state or disable+after.  They will never see before and
+	 * after state together.
+	 */
+	synchronize_rcu();
+}
+#else
+static void disable_all_migrate_targets(void) {}
+#endif
+
+/*
+ * Find an automatic demotion target for all memory
+ * nodes. Failing here is OK.  It might just indicate
+ * being at the end of a chain.
+ */
+static void establish_migration_targets(void)
+{
+	struct memory_tier *memtier;
+	struct demotion_nodes *nd;
+	int target = NUMA_NO_NODE, node;
+	int distance, best_distance;
+	nodemask_t used;
+
+	if (!node_demotion || !IS_ENABLED(CONFIG_MIGRATION))
+		return;
+
+	disable_all_migrate_targets();
+
+	for_each_node_state(node, N_MEMORY) {
+		best_distance = -1;
+		nd = &node_demotion[node];
+
+		memtier = __node_get_memory_tier(node);
+		if (!memtier || list_is_last(&memtier->list, &memory_tiers))
+			continue;
+		/*
+		 * Get the next memtier to find the demotion node list.
+		 */
+		memtier = list_next_entry(memtier, list);
+
+		/*
+		 * find_next_best_node, use 'used' nodemask as a skip list.
+		 * Add all memory nodes except the selected memory tier
+		 * nodelist to skip list so that we find the best node from the
+		 * memtier nodelist.
+		 */
+		nodes_andnot(used, node_states[N_MEMORY], memtier->nodelist);
+
+		/*
+		 * Find all nodes in the memtier nodelist at the same best distance and
+		 * add them to the preferred mask. We randomly select between nodes
+		 * in the preferred mask when allocating pages during demotion.
+		 */
+		do {
+			target = find_next_best_node(node, &used);
+			if (target == NUMA_NO_NODE)
+				break;
+
+			distance = node_distance(node, target);
+			if (distance == best_distance || best_distance == -1) {
+				best_distance = distance;
+				node_set(target, nd->preferred);
+			} else {
+				break;
+			}
+		} while (1);
+	}
+}
+
 static unsigned int default_memtier = DEFAULT_MEMORY_TIER;
 core_param(default_memory_tier, default_memtier, uint, 0644);
 /*
@@ -181,6 +391,17 @@ static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
 		return notifier_from_errno(0);
 
 	switch (action) {
+	case MEM_OFFLINE:
+		/*
+		 * In case we are moving out of N_MEMORY. Keep the node
+		 * in the memory tier so that when we bring memory online,
+		 * they appear in the right memory tier. We still need
+		 * to rebuild the demotion order.
+		 */
+		mutex_lock(&memory_tier_lock);
+		establish_migration_targets();
+		mutex_unlock(&memory_tier_lock);
+		break;
 	case MEM_ONLINE:
 		/*
 		 * We ignore the error here, if the node already have the tier
@@ -196,6 +417,12 @@ static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
 
 static void __init migrate_on_reclaim_init(void)
 {
+
+	if (IS_ENABLED(CONFIG_MIGRATION)) {
+		node_demotion = kcalloc(MAX_NUMNODES, sizeof(struct demotion_nodes),
+					GFP_KERNEL);
+		WARN_ON(!node_demotion);
+	}
 	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
 }
 
diff --git a/mm/migrate.c b/mm/migrate.c
index fce7d4a9e940..c758c9c21d7d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2117,398 +2117,4 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 	return 0;
 }
 #endif /* CONFIG_NUMA_BALANCING */
-
-/*
- * node_demotion[] example:
- *
- * Consider a system with two sockets.  Each socket has
- * three classes of memory attached: fast, medium and slow.
- * Each memory class is placed in its own NUMA node.  The
- * CPUs are placed in the node with the "fast" memory.  The
- * 6 NUMA nodes (0-5) might be split among the sockets like
- * this:
- *
- *	Socket A: 0, 1, 2
- *	Socket B: 3, 4, 5
- *
- * When Node 0 fills up, its memory should be migrated to
- * Node 1.  When Node 1 fills up, it should be migrated to
- * Node 2.  The migration path start on the nodes with the
- * processors (since allocations default to this node) and
- * fast memory, progress through medium and end with the
- * slow memory:
- *
- *	0 -> 1 -> 2 -> stop
- *	3 -> 4 -> 5 -> stop
- *
- * This is represented in the node_demotion[] like this:
- *
- *	{  nr=1, nodes[0]=1 }, // Node 0 migrates to 1
- *	{  nr=1, nodes[0]=2 }, // Node 1 migrates to 2
- *	{  nr=0, nodes[0]=-1 }, // Node 2 does not migrate
- *	{  nr=1, nodes[0]=4 }, // Node 3 migrates to 4
- *	{  nr=1, nodes[0]=5 }, // Node 4 migrates to 5
- *	{  nr=0, nodes[0]=-1 }, // Node 5 does not migrate
- *
- * Moreover some systems may have multiple slow memory nodes.
- * Suppose a system has one socket with 3 memory nodes, node 0
- * is fast memory type, and node 1/2 both are slow memory
- * type, and the distance between fast memory node and slow
- * memory node is same. So the migration path should be:
- *
- *	0 -> 1/2 -> stop
- *
- * This is represented in the node_demotion[] like this:
- *	{ nr=2, {nodes[0]=1, nodes[1]=2} }, // Node 0 migrates to node 1 and node 2
- *	{ nr=0, nodes[0]=-1, }, // Node 1 dose not migrate
- *	{ nr=0, nodes[0]=-1, }, // Node 2 does not migrate
- */
-
-/*
- * Writes to this array occur without locking.  Cycles are
- * not allowed: Node X demotes to Y which demotes to X...
- *
- * If multiple reads are performed, a single rcu_read_lock()
- * must be held over all reads to ensure that no cycles are
- * observed.
- */
-#define DEFAULT_DEMOTION_TARGET_NODES 15
-
-#if MAX_NUMNODES < DEFAULT_DEMOTION_TARGET_NODES
-#define DEMOTION_TARGET_NODES	(MAX_NUMNODES - 1)
-#else
-#define DEMOTION_TARGET_NODES	DEFAULT_DEMOTION_TARGET_NODES
-#endif
-
-struct demotion_nodes {
-	unsigned short nr;
-	short nodes[DEMOTION_TARGET_NODES];
-};
-
-static struct demotion_nodes *node_demotion __read_mostly;
-
-/**
- * next_demotion_node() - Get the next node in the demotion path
- * @node: The starting node to lookup the next node
- *
- * Return: node id for next memory node in the demotion path hierarchy
- * from @node; NUMA_NO_NODE if @node is terminal.  This does not keep
- * @node online or guarantee that it *continues* to be the next demotion
- * target.
- */
-int next_demotion_node(int node)
-{
-	struct demotion_nodes *nd;
-	unsigned short target_nr, index;
-	int target;
-
-	if (!node_demotion)
-		return NUMA_NO_NODE;
-
-	nd = &node_demotion[node];
-
-	/*
-	 * node_demotion[] is updated without excluding this
-	 * function from running.  RCU doesn't provide any
-	 * compiler barriers, so the READ_ONCE() is required
-	 * to avoid compiler reordering or read merging.
-	 *
-	 * Make sure to use RCU over entire code blocks if
-	 * node_demotion[] reads need to be consistent.
-	 */
-	rcu_read_lock();
-	target_nr = READ_ONCE(nd->nr);
-
-	switch (target_nr) {
-	case 0:
-		target = NUMA_NO_NODE;
-		goto out;
-	case 1:
-		index = 0;
-		break;
-	default:
-		/*
-		 * If there are multiple target nodes, just select one
-		 * target node randomly.
-		 *
-		 * In addition, we can also use round-robin to select
-		 * target node, but we should introduce another variable
-		 * for node_demotion[] to record last selected target node,
-		 * that may cause cache ping-pong due to the changing of
-		 * last target node. Or introducing per-cpu data to avoid
-		 * caching issue, which seems more complicated. So selecting
-		 * target node randomly seems better until now.
-		 */
-		index = get_random_int() % target_nr;
-		break;
-	}
-
-	target = READ_ONCE(nd->nodes[index]);
-
-out:
-	rcu_read_unlock();
-	return target;
-}
-
-/* Disable reclaim-based migration. */
-static void __disable_all_migrate_targets(void)
-{
-	int node, i;
-
-	if (!node_demotion)
-		return;
-
-	for_each_online_node(node) {
-		node_demotion[node].nr = 0;
-		for (i = 0; i < DEMOTION_TARGET_NODES; i++)
-			node_demotion[node].nodes[i] = NUMA_NO_NODE;
-	}
-}
-
-static void disable_all_migrate_targets(void)
-{
-	__disable_all_migrate_targets();
-
-	/*
-	 * Ensure that the "disable" is visible across the system.
-	 * Readers will see either a combination of before+disable
-	 * state or disable+after.  They will never see before and
-	 * after state together.
-	 *
-	 * The before+after state together might have cycles and
-	 * could cause readers to do things like loop until this
-	 * function finishes.  This ensures they can only see a
-	 * single "bad" read and would, for instance, only loop
-	 * once.
-	 */
-	synchronize_rcu();
-}
-
-/*
- * Find an automatic demotion target for 'node'.
- * Failing here is OK.  It might just indicate
- * being at the end of a chain.
- */
-static int establish_migrate_target(int node, nodemask_t *used,
-				    int best_distance)
-{
-	int migration_target, index, val;
-	struct demotion_nodes *nd;
-
-	if (!node_demotion)
-		return NUMA_NO_NODE;
-
-	nd = &node_demotion[node];
-
-	migration_target = find_next_best_node(node, used);
-	if (migration_target == NUMA_NO_NODE)
-		return NUMA_NO_NODE;
-
-	/*
-	 * If the node has been set a migration target node before,
-	 * which means it's the best distance between them. Still
-	 * check if this node can be demoted to other target nodes
-	 * if they have a same best distance.
-	 */
-	if (best_distance != -1) {
-		val = node_distance(node, migration_target);
-		if (val > best_distance)
-			goto out_clear;
-	}
-
-	index = nd->nr;
-	if (WARN_ONCE(index >= DEMOTION_TARGET_NODES,
-		      "Exceeds maximum demotion target nodes\n"))
-		goto out_clear;
-
-	nd->nodes[index] = migration_target;
-	nd->nr++;
-
-	return migration_target;
-out_clear:
-	node_clear(migration_target, *used);
-	return NUMA_NO_NODE;
-}
-
-/*
- * When memory fills up on a node, memory contents can be
- * automatically migrated to another node instead of
- * discarded at reclaim.
- *
- * Establish a "migration path" which will start at nodes
- * with CPUs and will follow the priorities used to build the
- * page allocator zonelists.
- *
- * The difference here is that cycles must be avoided.  If
- * node0 migrates to node1, then neither node1, nor anything
- * node1 migrates to can migrate to node0. Also one node can
- * be migrated to multiple nodes if the target nodes all have
- * a same best-distance against the source node.
- *
- * This function can run simultaneously with readers of
- * node_demotion[].  However, it can not run simultaneously
- * with itself.  Exclusion is provided by memory hotplug events
- * being single-threaded.
- */
-static void __set_migration_target_nodes(void)
-{
-	nodemask_t next_pass;
-	nodemask_t this_pass;
-	nodemask_t used_targets = NODE_MASK_NONE;
-	int node, best_distance;
-
-	/*
-	 * Avoid any oddities like cycles that could occur
-	 * from changes in the topology.  This will leave
-	 * a momentary gap when migration is disabled.
-	 */
-	disable_all_migrate_targets();
-
-	/*
-	 * Allocations go close to CPUs, first.  Assume that
-	 * the migration path starts at the nodes with CPUs.
-	 */
-	next_pass = node_states[N_CPU];
-again:
-	this_pass = next_pass;
-	next_pass = NODE_MASK_NONE;
-	/*
-	 * To avoid cycles in the migration "graph", ensure
-	 * that migration sources are not future targets by
-	 * setting them in 'used_targets'.  Do this only
-	 * once per pass so that multiple source nodes can
-	 * share a target node.
-	 *
-	 * 'used_targets' will become unavailable in future
-	 * passes.  This limits some opportunities for
-	 * multiple source nodes to share a destination.
-	 */
-	nodes_or(used_targets, used_targets, this_pass);
-
-	for_each_node_mask(node, this_pass) {
-		best_distance = -1;
-
-		/*
-		 * Try to set up the migration path for the node, and the target
-		 * migration nodes can be multiple, so doing a loop to find all
-		 * the target nodes if they all have a best node distance.
-		 */
-		do {
-			int target_node =
-				establish_migrate_target(node, &used_targets,
-							 best_distance);
-
-			if (target_node == NUMA_NO_NODE)
-				break;
-
-			if (best_distance == -1)
-				best_distance = node_distance(node, target_node);
-
-			/*
-			 * Visit targets from this pass in the next pass.
-			 * Eventually, every node will have been part of
-			 * a pass, and will become set in 'used_targets'.
-			 */
-			node_set(target_node, next_pass);
-		} while (1);
-	}
-	/*
-	 * 'next_pass' contains nodes which became migration
-	 * targets in this pass.  Make additional passes until
-	 * no more migrations targets are available.
-	 */
-	if (!nodes_empty(next_pass))
-		goto again;
-}
-
-/*
- * For callers that do not hold get_online_mems() already.
- */
-void set_migration_target_nodes(void)
-{
-	get_online_mems();
-	__set_migration_target_nodes();
-	put_online_mems();
-}
-
-/*
- * This leaves migrate-on-reclaim transiently disabled between
- * the MEM_GOING_OFFLINE and MEM_OFFLINE events.  This runs
- * whether reclaim-based migration is enabled or not, which
- * ensures that the user can turn reclaim-based migration at
- * any time without needing to recalculate migration targets.
- *
- * These callbacks already hold get_online_mems().  That is why
- * __set_migration_target_nodes() can be used as opposed to
- * set_migration_target_nodes().
- */
-#ifdef CONFIG_MEMORY_HOTPLUG
-static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
-						 unsigned long action, void *_arg)
-{
-	struct memory_notify *arg = _arg;
-
-	/*
-	 * Only update the node migration order when a node is
-	 * changing status, like online->offline.  This avoids
-	 * the overhead of synchronize_rcu() in most cases.
-	 */
-	if (arg->status_change_nid < 0)
-		return notifier_from_errno(0);
-
-	switch (action) {
-	case MEM_GOING_OFFLINE:
-		/*
-		 * Make sure there are not transient states where
-		 * an offline node is a migration target.  This
-		 * will leave migration disabled until the offline
-		 * completes and the MEM_OFFLINE case below runs.
-		 */
-		disable_all_migrate_targets();
-		break;
-	case MEM_OFFLINE:
-	case MEM_ONLINE:
-		/*
-		 * Recalculate the target nodes once the node
-		 * reaches its final state (online or offline).
-		 */
-		__set_migration_target_nodes();
-		break;
-	case MEM_CANCEL_OFFLINE:
-		/*
-		 * MEM_GOING_OFFLINE disabled all the migration
-		 * targets.  Reenable them.
-		 */
-		__set_migration_target_nodes();
-		break;
-	case MEM_GOING_ONLINE:
-	case MEM_CANCEL_ONLINE:
-		break;
-	}
-
-	return notifier_from_errno(0);
-}
-#endif
-
-void __init migrate_on_reclaim_init(void)
-{
-	node_demotion = kcalloc(nr_node_ids,
-				sizeof(struct demotion_nodes),
-				GFP_KERNEL);
-	WARN_ON(!node_demotion);
-#ifdef CONFIG_MEMORY_HOTPLUG
-	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
-#endif
-	/*
-	 * At this point, all numa nodes with memory/CPus have their state
-	 * properly set, so we can build the demotion order now.
-	 * Let us hold the cpu_hotplug lock just, as we could possibily have
-	 * CPU hotplug events during boot.
-	 */
-	cpus_read_lock();
-	set_migration_target_nodes();
-	cpus_read_unlock();
-}
 #endif /* CONFIG_NUMA */
-
-
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 373d2730fcf2..35c6ff97cf29 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -28,7 +28,6 @@
 #include <linux/mm_inline.h>
 #include <linux/page_ext.h>
 #include <linux/page_owner.h>
-#include <linux/migrate.h>
 
 #include "internal.h"
 
@@ -2060,7 +2059,6 @@ static int vmstat_cpu_online(unsigned int cpu)
 
 	if (!node_state(cpu_to_node(cpu), N_CPU)) {
 		node_set_state(cpu_to_node(cpu), N_CPU);
-		set_migration_target_nodes();
 	}
 
 	return 0;
@@ -2085,7 +2083,6 @@ static int vmstat_cpu_dead(unsigned int cpu)
 		return 0;
 
 	node_clear_state(node, N_CPU);
-	set_migration_target_nodes();
 
 	return 0;
 }
@@ -2118,7 +2115,6 @@ void __init init_mm_internals(void)
 
 	start_shepherd_timer();
 #endif
-	migrate_on_reclaim_init();
 #ifdef CONFIG_PROC_FS
 	proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op);
 	proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);
-- 
2.36.1
* [PATCH v8 06/12] mm/demotion: Expose memory tier details via sysfs
  2022-07-04  7:06 [PATCH v8 00/12] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                   ` (4 preceding siblings ...)
  2022-07-04  7:06 ` [PATCH v8 05/12] mm/demotion: Build demotion targets based on explicit memory tiers Aneesh Kumar K.V
@ 2022-07-04  7:06 ` Aneesh Kumar K.V
  2022-07-04  7:06 ` [PATCH v8 07/12] mm/demotion: Add per node memory tier attribute to sysfs Aneesh Kumar K.V
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-04  7:06 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V
This patch adds /sys/devices/system/memtier/ where all memory tier
related details can be found. All created memory tiers will be
listed there as /sys/devices/system/memtier/memtierN/
The nodes which are part of a specific memory tier can be listed
via /sys/devices/system/memtier/memtierN/nodelist
/sys/devices/system/memtier/max_tier shows the max tier ID value
supported.
/sys/devices/system/memtier/default_tier shows the memory tier to which
NUMA nodes get added by default if not assigned a specific memory tier.
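For anyone consuming the nodelist attribute from userspace, a minimal parsing sketch follows. It assumes the attribute prints a range list such as "0-1,3" followed by a newline, which is what the "%*pbl" format used in this patch produces; parse_nodelist() is a hypothetical helper and it only handles node IDs below 64.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a range list such as "0-1,3\n" into a bitmask of node IDs.
 * Returns 0 on success, -1 on malformed input or node IDs >= 64.
 * An empty list yields an empty mask.
 */
static int parse_nodelist(const char *s, uint64_t *mask)
{
	assert(s != NULL && mask != NULL);
	*mask = 0;

	while (*s != '\0' && *s != '\n') {
		char *end;
		long lo, hi;

		lo = hi = strtol(s, &end, 10);
		if (end == s)
			return -1;	/* no digits where a number was expected */
		if (*end == '-') {
			s = end + 1;
			hi = strtol(s, &end, 10);
			if (end == s)
				return -1;	/* range with a missing upper bound */
		}
		if (lo < 0 || hi < lo || hi >= 64)
			return -1;
		for (long n = lo; n <= hi; n++)
			*mask |= UINT64_C(1) << n;

		if (*end == ',')
			end++;
		else if (*end != '\0' && *end != '\n')
			return -1;	/* unexpected separator */
		s = end;
	}
	return 0;
}
```

A caller would read /sys/devices/system/memtier/memtierN/nodelist into a buffer and pass it to the helper; the file I/O is left out so that the parsing logic stays easy to check in isolation.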
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 mm/memory-tiers.c | 93 ++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 87 insertions(+), 6 deletions(-)
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 0596f0b11065..4acf7570ae1b 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -13,14 +13,15 @@
 
 struct memory_tier {
 	struct list_head list;
+	struct device dev;
 	nodemask_t nodelist;
-	int id;
 };
 
 struct demotion_nodes {
 	nodemask_t preferred;
 };
 
+#define to_memory_tier(device) container_of(device, struct memory_tier, dev)
 static void establish_migration_targets(void);
 static DEFINE_MUTEX(memory_tier_lock);
 static LIST_HEAD(memory_tiers);
@@ -86,6 +87,42 @@ static LIST_HEAD(memory_tiers);
  */
 static struct demotion_nodes *node_demotion __read_mostly;
 
+static struct bus_type memory_tier_subsys = {
+	.name = "memtier",
+	.dev_name = "memtier",
+};
+
+static ssize_t nodelist_show(struct device *dev,
+			     struct device_attribute *attr, char *buf)
+{
+	struct memory_tier *memtier = to_memory_tier(dev);
+
+	return sysfs_emit(buf, "%*pbl\n",
+			  nodemask_pr_args(&memtier->nodelist));
+}
+static DEVICE_ATTR_RO(nodelist);
+
+static struct attribute *memory_tier_dev_attrs[] = {
+	&dev_attr_nodelist.attr,
+	NULL
+};
+
+static const struct attribute_group memory_tier_dev_group = {
+	.attrs = memory_tier_dev_attrs,
+};
+
+static const struct attribute_group *memory_tier_dev_groups[] = {
+	&memory_tier_dev_group,
+	NULL
+};
+
+static void memory_tier_device_release(struct device *dev)
+{
+	struct memory_tier *tier = to_memory_tier(dev);
+
+	kfree(tier);
+}
+
 static void insert_memory_tier(struct memory_tier *memtier)
 {
 	struct list_head *ent;
@@ -95,7 +132,7 @@ static void insert_memory_tier(struct memory_tier *memtier)
 
 	list_for_each(ent, &memory_tiers) {
 		tmp_memtier = list_entry(ent, struct memory_tier, list);
-		if (tmp_memtier->id < memtier->id) {
+		if (tmp_memtier->dev.id < memtier->dev.id) {
 			list_add_tail(&memtier->list, ent);
 			return;
 		}
@@ -105,6 +142,7 @@ static void insert_memory_tier(struct memory_tier *memtier)
 
 static struct memory_tier *register_memory_tier(unsigned int tier)
 {
+	int error;
 	struct memory_tier *memtier;
 
 	if (tier > MAX_MEMORY_TIER_ID)
@@ -114,17 +152,26 @@ static struct memory_tier *register_memory_tier(unsigned int tier)
 	if (!memtier)
 		return ERR_PTR(-ENOMEM);
 
-	memtier->id   = tier;
+	memtier->dev.id = tier;
+	memtier->dev.bus = &memory_tier_subsys;
+	memtier->dev.release = memory_tier_device_release;
+	memtier->dev.groups = memory_tier_dev_groups;
 
 	insert_memory_tier(memtier);
 
+	error = device_register(&memtier->dev);
+	if (error) {
+		list_del(&memtier->list);
+		put_device(&memtier->dev);
+		return ERR_PTR(error);
+	}
 	return memtier;
 }
 
 static void unregister_memory_tier(struct memory_tier *memtier)
 {
 	list_del(&memtier->list);
-	kfree(memtier);
+	device_unregister(&memtier->dev);
 }
 
 static struct memory_tier *__node_get_memory_tier(int node)
@@ -143,7 +190,7 @@ static struct memory_tier *__get_memory_tier_from_id(int id)
 	struct memory_tier *memtier;
 
 	list_for_each_entry(memtier, &memory_tiers, list) {
-		if (memtier->id == id)
+		if (memtier->dev.id == id)
 			return memtier;
 	}
 	return NULL;
@@ -181,7 +228,7 @@ int node_create_and_set_memory_tier(int node, int tier)
 		goto out;
 	}
 
-	if (current_tier->id == tier)
+	if (current_tier->dev.id == tier)
 		goto out;
 
 	node_clear(node, current_tier->nodelist);
@@ -426,10 +473,44 @@ static void __init migrate_on_reclaim_init(void)
 	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
 }
 
+static ssize_t
+max_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%d\n", MAX_MEMORY_TIER_ID);
+}
+static DEVICE_ATTR_RO(max_tier);
+
+static ssize_t
+default_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "memtier%d\n", default_memtier);
+}
+static DEVICE_ATTR_RO(default_tier);
+
+static struct attribute *memory_tier_attrs[] = {
+	&dev_attr_max_tier.attr,
+	&dev_attr_default_tier.attr,
+	NULL
+};
+
+static const struct attribute_group memory_tier_attr_group = {
+	.attrs = memory_tier_attrs,
+};
+
+static const struct attribute_group *memory_tier_attr_groups[] = {
+	&memory_tier_attr_group,
+	NULL,
+};
+
 static int __init memory_tier_init(void)
 {
+	int ret;
 	struct memory_tier *memtier;
 
+	ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
+	if (ret)
+		pr_err("%s() failed to register subsystem: %d\n", __func__, ret);
+
 	/*
 	 * Register only default memory tier to hide all empty
 	 * memory tier from sysfs. Since this is early during
-- 
2.36.1
* [PATCH v8 07/12] mm/demotion: Add per node memory tier attribute to sysfs
  2022-07-04  7:06 [PATCH v8 00/12] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                   ` (5 preceding siblings ...)
  2022-07-04  7:06 ` [PATCH v8 06/12] mm/demotion: Expose memory tier details via sysfs Aneesh Kumar K.V
@ 2022-07-04  7:06 ` Aneesh Kumar K.V
  2022-07-04  7:06 ` [PATCH v8 08/12] mm/demotion: Add pg_data_t member to track node memory tier details Aneesh Kumar K.V
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-04  7:06 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V, Jagdish Gediya
Add support to modify the memory tier for a NUMA node.
/sys/devices/system/node/nodeN/memtier
where N = node id
When read, it lists the memory tier that the node belongs to.
When written, the kernel moves the node into the specified
memory tier; the tier assignment of all other nodes is not
affected.
If the memory tier does not exist, it is created.
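For illustration only, a minimal userspace sketch of how a tool might use
this attribute. The helper names (memtier_attr_path, node_read_memtier,
node_write_memtier) are hypothetical and not part of this patch; only the
sysfs path and the plain decimal format come from the description above.
An empty read, as returned for a CPU-only node, is reported as -1 here.

```c
#include <stdio.h>

/*
 * Build the path of the per-node attribute proposed by this patch.
 * Returns 0 on success, -1 if the buffer is too small.
 */
int memtier_attr_path(char *buf, size_t len, int node)
{
	int n = snprintf(buf, len, "/sys/devices/system/node/node%d/memtier", node);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Read the tier ID from the attribute at @path.
 * Returns -1 if the file cannot be opened or holds no number
 * (assumption: a CPU-only node reads back as empty).
 */
int node_read_memtier(const char *path)
{
	FILE *f = fopen(path, "r");
	int tier = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &tier) != 1)
		tier = -1;
	fclose(f);
	return tier;
}

/*
 * Ask the kernel to move the node behind @path into @tier.
 * Returns 0 on success, -1 on failure. The write is buffered, so a
 * rejected value is typically only reported when the stream is closed.
 */
int node_write_memtier(const char *path, int tier)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%d\n", tier) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A caller would build the path for a node with memtier_attr_path() and pass
it to the read or write helper; writing needs root privileges.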
Suggested-by: Wei Xu <weixugc@google.com>
Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 drivers/base/node.c          | 42 ++++++++++++++++++++++++++++++++++++
 include/linux/memory-tiers.h |  2 ++
 mm/memory-tiers.c            | 42 ++++++++++++++++++++++++++++++++++++
 3 files changed, 86 insertions(+)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 0ac6376ef7a1..667f37eecf3a 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -20,6 +20,7 @@
 #include <linux/pm_runtime.h>
 #include <linux/swap.h>
 #include <linux/slab.h>
+#include <linux/memory-tiers.h>
 
 static struct bus_type node_subsys = {
 	.name = "node",
@@ -560,11 +561,52 @@ static ssize_t node_read_distance(struct device *dev,
 }
 static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
 
+#ifdef CONFIG_NUMA
+static ssize_t memtier_show(struct device *dev,
+			    struct device_attribute *attr,
+			    char *buf)
+{
+	int node = dev->id;
+	int tier_index = node_get_memory_tier_id(node);
+
+	/*
+	 * CPU only NUMA node is not part of memory tiers.
+	 */
+	if (tier_index != -1)
+		return sysfs_emit(buf, "%d\n", tier_index);
+	return 0;
+}
+
+static ssize_t memtier_store(struct device *dev,
+			     struct device_attribute *attr,
+			     const char *buf, size_t count)
+{
+	unsigned long tier;
+	int node = dev->id;
+	int ret;
+
+	ret = kstrtoul(buf, 10, &tier);
+	if (ret)
+		return ret;
+
+	ret = node_update_memory_tier(node, tier);
+	if (ret)
+		return ret;
+
+	return count;
+}
+
+static DEVICE_ATTR_RW(memtier);
+#endif
+
 static struct attribute *node_dev_attrs[] = {
 	&dev_attr_meminfo.attr,
 	&dev_attr_numastat.attr,
 	&dev_attr_distance.attr,
 	&dev_attr_vmstat.attr,
+#ifdef CONFIG_NUMA
+	&dev_attr_memtier.attr,
+#endif
 	NULL
 };
 
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 3234301c2537..453f6e5d357c 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -23,6 +23,8 @@ static inline int next_demotion_node(int node)
 	return NUMA_NO_NODE;
 }
 #endif
+int node_get_memory_tier_id(int node);
+int node_update_memory_tier(int node, int tier);
 
 #else
 
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 4acf7570ae1b..b7cb368cb9c0 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -288,6 +288,48 @@ static int node_set_memory_tier(int node, int tier)
 	return ret;
 }
 
+int node_get_memory_tier_id(int node)
+{
+	int tier = -1;
+	struct memory_tier *memtier;
+	/*
+	 * Make sure memory tier is not unregistered
+	 * while it is being read.
+	 */
+	mutex_lock(&memory_tier_lock);
+	memtier = __node_get_memory_tier(node);
+	if (memtier)
+		tier = memtier->dev.id;
+	mutex_unlock(&memory_tier_lock);
+
+	return tier;
+}
+
+int node_update_memory_tier(int node, int tier)
+{
+	struct memory_tier *current_tier;
+	int ret = 0;
+
+	mutex_lock(&memory_tier_lock);
+
+	current_tier = __node_get_memory_tier(node);
+	if (!current_tier || current_tier->dev.id == tier)
+		goto out;
+
+	node_clear(node, current_tier->nodelist);
+
+	ret = __node_create_and_set_memory_tier(node, tier);
+
+	if (nodes_empty(current_tier->nodelist))
+		unregister_memory_tier(current_tier);
+
+	establish_migration_targets();
+out:
+	mutex_unlock(&memory_tier_lock);
+
+	return ret;
+}
+
 #ifdef CONFIG_MIGRATION
 /**
  * next_demotion_node() - Get the next node in the demotion path
-- 
2.36.1
* [PATCH v8 08/12] mm/demotion: Add pg_data_t member to track node memory tier details
  2022-07-04  7:06 [PATCH v8 00/12] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                   ` (6 preceding siblings ...)
  2022-07-04  7:06 ` [PATCH v8 07/12] mm/demotion: Add per node memory tier attribute to sysfs Aneesh Kumar K.V
@ 2022-07-04  7:06 ` Aneesh Kumar K.V
  2022-07-04  7:06 ` [PATCH v8 09/12] mm/demotion: Demote pages according to allocation fallback order Aneesh Kumar K.V
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-04  7:06 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V
Also update different helpers to use NODE_DATA()->memtier. Since the
node-specific memtier can change when a NUMA node is reassigned
to a different memory tier, accessing NODE_DATA()->memtier
needs to happen under an RCU read lock or memory_tier_lock.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h |  11 ++++
 include/linux/mmzone.h       |   3 +
 mm/memory-tiers.c            | 104 +++++++++++++++++++++++++----------
 3 files changed, 89 insertions(+), 29 deletions(-)
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 453f6e5d357c..705b63ee31d5 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -6,6 +6,9 @@
 
 #ifdef CONFIG_NUMA
 
+#include <linux/device.h>
+#include <linux/nodemask.h>
+
 #define MEMORY_TIER_HBM_GPU	300
 #define MEMORY_TIER_DRAM	200
 #define MEMORY_TIER_PMEM	100
@@ -13,6 +16,12 @@
 #define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
 #define MAX_MEMORY_TIER_ID	400
 
+struct memory_tier {
+	struct list_head list;
+	struct device dev;
+	nodemask_t nodelist;
+};
+
 extern bool numa_demotion_enabled;
 int node_create_and_set_memory_tier(int node, int tier);
 #ifdef CONFIG_MIGRATION
@@ -25,6 +34,8 @@ static inline int next_demotion_node(int node)
 #endif
 int node_get_memory_tier_id(int node);
 int node_update_memory_tier(int node, int tier);
+struct memory_tier *node_get_memory_tier(int node);
+void node_put_memory_tier(struct memory_tier *memtier);
 
 #else
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index aab70355d64f..353812495a70 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -928,6 +928,9 @@ typedef struct pglist_data {
 	/* Per-node vmstats */
 	struct per_cpu_nodestat __percpu *per_cpu_nodestats;
 	atomic_long_t		vm_stat[NR_VM_NODE_STAT_ITEMS];
+#ifdef CONFIG_NUMA
+	struct memory_tier __rcu *memtier;
+#endif
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index b7cb368cb9c0..6a2476faf13a 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -1,22 +1,15 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/types.h>
-#include <linux/device.h>
-#include <linux/nodemask.h>
 #include <linux/slab.h>
 #include <linux/lockdep.h>
 #include <linux/moduleparam.h>
 #include <linux/memory.h>
 #include <linux/random.h>
+#include <linux/rcupdate.h>
 #include <linux/memory-tiers.h>
 
 #include "internal.h"
 
-struct memory_tier {
-	struct list_head list;
-	struct device dev;
-	nodemask_t nodelist;
-};
-
 struct demotion_nodes {
 	nodemask_t preferred;
 };
@@ -120,7 +113,7 @@ static void memory_tier_device_release(struct device *dev)
 {
 	struct memory_tier *tier = to_memory_tier(dev);
 
-	kfree(tier);
+	kfree_rcu(tier);
 }
 
 static void insert_memory_tier(struct memory_tier *memtier)
@@ -176,13 +169,18 @@ static void unregister_memory_tier(struct memory_tier *memtier)
 
 static struct memory_tier *__node_get_memory_tier(int node)
 {
-	struct memory_tier *memtier;
+	pg_data_t *pgdat;
 
-	list_for_each_entry(memtier, &memory_tiers, list) {
-		if (node_isset(node, memtier->nodelist))
-			return memtier;
-	}
-	return NULL;
+	pgdat = NODE_DATA(node);
+	if (!pgdat)
+		return NULL;
+	/*
+	 * Since we hold memory_tier_lock, we can avoid
+	 * RCU read locks when accessing the details. No
+	 * parallel updates are possible here.
+	 */
+	return rcu_dereference_check(pgdat->memtier,
+				     lockdep_is_held(&memory_tier_lock));
 }
 
 static struct memory_tier *__get_memory_tier_from_id(int id)
@@ -196,6 +194,33 @@ static struct memory_tier *__get_memory_tier_from_id(int id)
 	return NULL;
 }
 
+/*
+ * Called with memory_tier_lock. Hence the device references cannot
+ * be dropped during this function.
+ */
+static void memtier_node_set(int node, struct memory_tier *memtier)
+{
+	pg_data_t *pgdat;
+	struct memory_tier *current_memtier;
+
+	pgdat = NODE_DATA(node);
+	if (!pgdat)
+		return;
+	/*
+	 * Make sure we mark the memtier NULL before we assign the new memory tier
+	 * to the NUMA node. This makes sure that anybody looking at NODE_DATA
+	 * finds a NULL memtier or the one which is still valid.
+	 */
+	current_memtier = rcu_dereference_check(pgdat->memtier,
+						lockdep_is_held(&memory_tier_lock));
+	rcu_assign_pointer(pgdat->memtier, NULL);
+	if (current_memtier)
+		node_clear(node, current_memtier->nodelist);
+	synchronize_rcu();
+	node_set(node, memtier->nodelist);
+	rcu_assign_pointer(pgdat->memtier, memtier);
+}
+
 static int __node_create_and_set_memory_tier(int node, int tier)
 {
 	int ret = 0;
@@ -209,7 +234,7 @@ static int __node_create_and_set_memory_tier(int node, int tier)
 			goto out;
 		}
 	}
-	node_set(node, memtier->nodelist);
+	memtier_node_set(node, memtier);
 out:
 	return ret;
 }
@@ -231,14 +256,7 @@ int node_create_and_set_memory_tier(int node, int tier)
 	if (current_tier->dev.id == tier)
 		goto out;
 
-	node_clear(node, current_tier->nodelist);
-
 	ret = __node_create_and_set_memory_tier(node, tier);
-	if (ret) {
-		/* reset it back to older tier */
-		node_set(node, current_tier->nodelist);
-		goto out;
-	}
 	if (nodes_empty(current_tier->nodelist))
 		unregister_memory_tier(current_tier);
 
@@ -260,7 +278,7 @@ static int __node_set_memory_tier(int node, int tier)
 		ret = -EINVAL;
 		goto out;
 	}
-	node_set(node, memtier->nodelist);
+	memtier_node_set(node, memtier);
 out:
 	return ret;
 }
@@ -316,10 +334,7 @@ int node_update_memory_tier(int node, int tier)
 	if (!current_tier || current_tier->dev.id == tier)
 		goto out;
 
-	node_clear(node, current_tier->nodelist);
-
 	ret = __node_create_and_set_memory_tier(node, tier);
-
 	if (nodes_empty(current_tier->nodelist))
 		unregister_memory_tier(current_tier);
 
@@ -330,6 +345,34 @@ int node_update_memory_tier(int node, int tier)
 	return ret;
 }
 
+/*
+ * lockless access to memory tier of a NUMA node.
+ */
+struct memory_tier *node_get_memory_tier(int node)
+{
+	pg_data_t *pgdat;
+	struct memory_tier *memtier;
+
+	pgdat = NODE_DATA(node);
+	if (!pgdat)
+		return NULL;
+
+	rcu_read_lock();
+	memtier = rcu_dereference(pgdat->memtier);
+	if (!memtier)
+		goto out;
+
+	get_device(&memtier->dev);
+out:
+	rcu_read_unlock();
+	return memtier;
+}
+
+void node_put_memory_tier(struct memory_tier *memtier)
+{
+	put_device(&memtier->dev);
+}
+
 #ifdef CONFIG_MIGRATION
 /**
  * next_demotion_node() - Get the next node in the demotion path
@@ -546,7 +589,7 @@ static const struct attribute_group *memory_tier_attr_groups[] = {
 
 static int __init memory_tier_init(void)
 {
-	int ret;
+	int ret, node;
 	struct memory_tier *memtier;
 
 	ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
@@ -567,7 +610,10 @@ static int __init memory_tier_init(void)
 		      __func__, PTR_ERR(memtier));
 
 	/* CPU only nodes are not part of memory tiers. */
-	memtier->nodelist = node_states[N_MEMORY];
+	for_each_node_state(node, N_MEMORY) {
+		rcu_assign_pointer(NODE_DATA(node)->memtier, memtier);
+		node_set(node, memtier->nodelist);
+	}
 	mutex_unlock(&memory_tier_lock);
 
 	migrate_on_reclaim_init();
-- 
2.36.1
* [PATCH v8 09/12] mm/demotion: Demote pages according to allocation fallback order
  2022-07-04  7:06 [PATCH v8 00/12] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                   ` (7 preceding siblings ...)
  2022-07-04  7:06 ` [PATCH v8 08/12] mm/demotion: Add pg_data_t member to track node memory tier details Aneesh Kumar K.V
@ 2022-07-04  7:06 ` Aneesh Kumar K.V
  2022-07-04  7:06 ` [PATCH v8 10/12] mm/demotion: Update node_is_toptier to work with memory tiers Aneesh Kumar K.V
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-04  7:06 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Jagdish Gediya, Aneesh Kumar K . V
From: Jagdish Gediya <jvgediya@linux.ibm.com>
Currently, a higher tier node can only be demoted to selected
nodes on the next lower tier as defined by the demotion path.
This strict, hard-coded demotion order does not work in all
use cases (e.g. some use cases may want to allow cross-socket
demotion to another node in the same demotion tier as a fallback
when the preferred demotion node is out of space). This demotion
order is also inconsistent with the page allocation fallback order
when all the nodes in a higher tier are out of space: The page
allocation can fall back to any node from any lower tier, whereas
the demotion order doesn't allow that currently.
This patch adds support to get all the allowed demotion targets
for a memory tier. The demote_page_list() function is now modified
to use this allowed node mask as the fallback allocation mask.
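The mask construction and the two-step allocation described above can be
modelled in userspace as follows. This is only a sketch under stated
assumptions: nodemasks are reduced to 64-bit bitmaps, tiers are kept in an
array sorted from the highest tier to the lowest, and "has free memory" is a
bitmap supplied by the caller. The names struct tier,
build_lower_tier_masks() and pick_demotion_target() are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct tier {
	int id;				/* tier ID, e.g. 300, 200, 100 */
	uint64_t nodelist;		/* bit n set: node n is in this tier */
	uint64_t lower_tier_mask;	/* nodes in all tiers below this one */
};

/*
 * Fill lower_tier_mask for every tier. tiers[] must be sorted from the
 * highest tier to the lowest; n_memory has a bit set for every node that
 * currently has memory.
 */
void build_lower_tier_masks(struct tier *tiers, size_t n, uint64_t n_memory)
{
	uint64_t lower = 0;
	size_t i;

	/* Start with every node that belongs to any tier. */
	for (i = 0; i < n; i++)
		lower |= tiers[i].nodelist;
	/* Drop nodes that are not (yet) memory nodes. */
	lower &= n_memory;

	/* Walking down, strip the current tier; what remains is below it. */
	for (i = 0; i < n; i++) {
		lower &= ~tiers[i].nodelist;
		tiers[i].lower_tier_mask = lower;
	}
}

/*
 * Model of the allocation order: try the preferred node first, then fall
 * back to any allowed node with free memory. Returns -1 if there is no
 * preferred node or nothing suitable is found.
 */
int pick_demotion_target(int preferred, uint64_t allowed, uint64_t has_free)
{
	int node;

	if (preferred < 0 || preferred >= 64)
		return -1;
	if (has_free & (UINT64_C(1) << preferred))
		return preferred;

	allowed &= has_free;
	for (node = 0; node < 64; node++)
		if (allowed & (UINT64_C(1) << node))
			return node;
	return -1;
}
```

With tiers 300 = {node 1}, 200 = {node 0} and 100 = {nodes 2-3}, the masks
come out as {0,2,3}, {2,3} and the empty set respectively, so node 0 can
fall back to node 3 when its preferred target, node 2, is full.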
Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
move allowed mask to memory tier
---
 include/linux/memory-tiers.h | 17 +++++++-
 mm/memory-tiers.c            | 76 +++++++++++++++++++++++++++++++++---
 mm/vmscan.c                  | 58 ++++++++++++++++++++-------
 3 files changed, 129 insertions(+), 22 deletions(-)
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 705b63ee31d5..335d21a30b2c 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -3,11 +3,12 @@
 #define _LINUX_MEMORY_TIERS_H
 
 #include <linux/types.h>
+#include <linux/nodemask.h>
+#include <linux/mmzone.h>
 
 #ifdef CONFIG_NUMA
 
 #include <linux/device.h>
-#include <linux/nodemask.h>
 
 #define MEMORY_TIER_HBM_GPU	300
 #define MEMORY_TIER_DRAM	200
@@ -20,18 +21,25 @@ struct memory_tier {
 	struct list_head list;
 	struct device dev;
 	nodemask_t nodelist;
+	nodemask_t lower_tier_mask;
 };
 
 extern bool numa_demotion_enabled;
 int node_create_and_set_memory_tier(int node, int tier);
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
+void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
 #else
 static inline int next_demotion_node(int node)
 {
 	return NUMA_NO_NODE;
 }
-#endif
+
+static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
+{
+	*targets = NODE_MASK_NONE;
+}
+#endif /* CONFIG_MIGRATION */
 int node_get_memory_tier_id(int node);
 int node_update_memory_tier(int node, int tier);
 struct memory_tier *node_get_memory_tier(int node);
@@ -49,5 +57,10 @@ static inline int next_demotion_node(int node)
 {
 	return NUMA_NO_NODE;
 }
+
+static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
+{
+	*targets = NODE_MASK_NONE;
+}
 #endif	/* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 6a2476faf13a..aecce987df7c 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -374,6 +374,24 @@ void node_put_memory_tier(struct memory_tier *memtier)
 }
 
 #ifdef CONFIG_MIGRATION
+void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
+{
+	struct memory_tier *memtier;
+
+	/*
+	 * pg_data_t.memtier updates include a synchronize_rcu()
+	 * which ensures that we either find NULL or a valid memtier
+	 * in NODE_DATA. Protect the access via rcu_read_lock().
+	 */
+	rcu_read_lock();
+	memtier = rcu_dereference(pgdat->memtier);
+	if (memtier)
+		*targets = memtier->lower_tier_mask;
+	else
+		*targets = NODE_MASK_NONE;
+	rcu_read_unlock();
+}
+
 /**
  * next_demotion_node() - Get the next node in the demotion path
  * @node: The starting node to lookup the next node
@@ -422,10 +440,19 @@ int next_demotion_node(int node)
 /* Disable reclaim-based migration. */
 static void __disable_all_migrate_targets(void)
 {
+	struct memory_tier *memtier;
 	int node;
 
-	for_each_node_state(node, N_MEMORY)
+	for_each_node_state(node, N_MEMORY) {
 		node_demotion[node].preferred = NODE_MASK_NONE;
+		/*
+		 * We are holding memory_tier_lock, so it is safe
+		 * to access pgdat->memtier.
+		 */
+		memtier = rcu_dereference_check(NODE_DATA(node)->memtier,
+						lockdep_is_held(&memory_tier_lock));
+		memtier->lower_tier_mask = NODE_MASK_NONE;
+	}
 }
 
 static void disable_all_migrate_targets(void)
@@ -455,10 +482,26 @@ static void establish_migration_targets(void)
 	struct demotion_nodes *nd;
 	int target = NUMA_NO_NODE, node;
 	int distance, best_distance;
-	nodemask_t used;
-
-	if (!node_demotion || !IS_ENABLED(CONFIG_MIGRATION))
-		return;
+	nodemask_t used, lower_tier = NODE_MASK_NONE;
+
+	if (!node_demotion || !IS_ENABLED(CONFIG_MIGRATION)) {
+
+		for_each_node_state(node, N_MEMORY) {
+			/*
+			 * We are holding memory_tier_lock, so it is safe
+			 * to access pgdat->memtier.
+			 */
+			memtier = rcu_dereference_check(NODE_DATA(node)->memtier,
+							lockdep_is_held(&memory_tier_lock));
+			memtier->lower_tier_mask = NODE_MASK_NONE;
+		}
+		/*
+		 * Wait for read side to work with old values
+		 * or see the updated NODE_MASK_NONE;
+		 */
+		synchronize_rcu();
+		goto build_lower_tier_mask;
+	}
 
 	disable_all_migrate_targets();
 
@@ -501,6 +544,29 @@ static void establish_migration_targets(void)
 			}
 		} while (1);
 	}
+build_lower_tier_mask:
+	/*
+	 * Now build the lower_tier mask for each node collecting node mask from
+	 * all memory tier below it. This allows us to fallback demotion page
+	 * allocation to a set of nodes that is closer to the above selected
+	 * preferred node.
+	 */
+	list_for_each_entry(memtier, &memory_tiers, list)
+		nodes_or(lower_tier, lower_tier, memtier->nodelist);
+	/*
+	 * Removes nodes not yet in N_MEMORY.
+	 */
+	nodes_and(lower_tier, node_states[N_MEMORY], lower_tier);
+
+	list_for_each_entry(memtier, &memory_tiers, list) {
+		/*
+		 * Keep removing the current tier from the lower_tier nodes.
+		 * This will remove all nodes in the current and higher
+		 * memory tiers from the lower_tier mask.
+		 */
+		nodes_andnot(lower_tier, lower_tier, memtier->nodelist);
+		memtier->lower_tier_mask = lower_tier;
+	}
 }
 
 static unsigned int default_memtier = DEFAULT_MEMORY_TIER;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3a8f78277f99..60a5235dd639 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1460,21 +1460,34 @@ static void folio_check_dirty_writeback(struct folio *folio,
 		mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
 }
 
-static struct page *alloc_demote_page(struct page *page, unsigned long node)
+static struct page *alloc_demote_page(struct page *page, unsigned long private)
 {
-	struct migration_target_control mtc = {
-		/*
-		 * Allocate from 'node', or fail quickly and quietly.
-		 * When this happens, 'page' will likely just be discarded
-		 * instead of migrated.
-		 */
-		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
-			    __GFP_THISNODE  | __GFP_NOWARN |
-			    __GFP_NOMEMALLOC | GFP_NOWAIT,
-		.nid = node
-	};
+	struct page *target_page;
+	nodemask_t *allowed_mask;
+	struct migration_target_control *mtc;
+
+	mtc = (struct migration_target_control *)private;
+
+	allowed_mask = mtc->nmask;
+	/*
+	 * Make sure we allocate from the target node first, also trying to
+	 * reclaim pages from the target node via kswapd if we are low on
+	 * free memory on the target node. If we don't do this and we are low
+	 * on free memory on the target memtier, we would start allocating pages
+	 * from lower memory tiers without even forcing a demotion of cold
+	 * pages from the target memtier. This can result in the kernel placing
+	 * hot pages in lower memory tiers.
+	 */
+	mtc->nmask = NULL;
+	mtc->gfp_mask |= __GFP_THISNODE;
+	target_page = alloc_migration_target(page, (unsigned long)mtc);
+	if (target_page)
+		return target_page;
 
-	return alloc_migration_target(page, (unsigned long)&mtc);
+	mtc->gfp_mask &= ~__GFP_THISNODE;
+	mtc->nmask = allowed_mask;
+
+	return alloc_migration_target(page, (unsigned long)mtc);
 }
 
 /*
@@ -1487,6 +1500,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
 {
 	int target_nid = next_demotion_node(pgdat->node_id);
 	unsigned int nr_succeeded;
+	nodemask_t allowed_mask;
+
+	struct migration_target_control mtc = {
+		/*
+		 * Allocate from 'node', or fail quickly and quietly.
+		 * When this happens, 'page' will likely just be discarded
+		 * instead of migrated.
+		 */
+		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
+			__GFP_NOMEMALLOC | GFP_NOWAIT,
+		.nid = target_nid,
+		.nmask = &allowed_mask
+	};
 
 	if (list_empty(demote_pages))
 		return 0;
@@ -1494,10 +1520,12 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
 	if (target_nid == NUMA_NO_NODE)
 		return 0;
 
+	node_get_allowed_targets(pgdat, &allowed_mask);
+
 	/* Demotion ignores all cpuset and mempolicy settings */
 	migrate_pages(demote_pages, alloc_demote_page, NULL,
-			    target_nid, MIGRATE_ASYNC, MR_DEMOTION,
-			    &nr_succeeded);
+		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
+		      &nr_succeeded);
 
 	if (current_is_kswapd())
 		__count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded);
-- 
2.36.1
* [PATCH v8 10/12] mm/demotion: Update node_is_toptier to work with memory tiers
  2022-07-04  7:06 [PATCH v8 00/12] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                   ` (8 preceding siblings ...)
  2022-07-04  7:06 ` [PATCH v8 09/12] mm/demotion: Demote pages according to allocation fallback order Aneesh Kumar K.V
@ 2022-07-04  7:06 ` Aneesh Kumar K.V
  2022-07-04  7:06 ` [PATCH v8 11/12] mm/demotion: Add documentation for memory tiering Aneesh Kumar K.V
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-04  7:06 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V
With memory tiers support we can have memory-only NUMA nodes
in the top tier, for which we want to avoid promotion tracking NUMA
faults. Update node_is_toptier() to work with memory tiers.
All NUMA nodes are top tier nodes by default. Once lower memory
tiers are added, the lowest memory tier that has NUMA nodes with CPUs,
and every memory tier above it, is considered a top memory tier.
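The rule above can be sketched in userspace as follows. This is a model
under assumptions, not the kernel code: nodemasks are 64-bit bitmaps, tiers
are kept in an array sorted from the highest tier to the lowest, and the
names struct tier, find_top_tier_id() and tier_is_toptier() are
hypothetical. The fallback argument stands in for whatever value
top_tier_id holds when no tier contains a CPU node.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct tier {
	int id;			/* tier ID, e.g. 300, 200, 100 */
	uint64_t nodelist;	/* bit n set: node n is in this tier */
};

/*
 * Scan from the lowest tier upwards and return the ID of the first tier
 * that contains a node with CPUs. tiers[] must be sorted from the highest
 * tier to the lowest; n_cpu has a bit set for every node with CPUs.
 */
int find_top_tier_id(const struct tier *tiers, size_t n, uint64_t n_cpu,
		     int fallback)
{
	size_t i = n;

	while (i-- > 0)
		if (tiers[i].nodelist & n_cpu)
			return tiers[i].id;
	return fallback;
}

/* A tier counts as top tier if it is at or above the detected tier. */
bool tier_is_toptier(int tier_id, int top_tier_id)
{
	return tier_id >= top_tier_id;
}
```

With tiers 300 = {node 1, HBM}, 200 = {node 0, CPU + DRAM} and
100 = {nodes 2-3, PMEM}, the detected top tier ID is 200, so tiers 300 and
200 are top tiers and only tier 100 is subject to promotion tracking.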
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h |  6 ++++++
 include/linux/node.h         |  5 -----
 mm/huge_memory.c             |  1 +
 mm/memory-tiers.c            | 41 ++++++++++++++++++++++++++++++++++++
 mm/migrate.c                 |  1 +
 mm/mprotect.c                |  1 +
 6 files changed, 50 insertions(+), 5 deletions(-)
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 335d21a30b2c..ff1a08933575 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -44,6 +44,7 @@ int node_get_memory_tier_id(int node);
 int node_update_memory_tier(int node, int tier);
 struct memory_tier *node_get_memory_tier(int node);
 void node_put_memory_tier(struct memory_tier *memtier);
+bool node_is_toptier(int node);
 
 #else
 
@@ -62,5 +63,10 @@ static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *target
 {
 	*targets = NODE_MASK_NONE;
 }
+
+static inline bool node_is_toptier(int node)
+{
+	return true;
+}
 #endif	/* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/include/linux/node.h b/include/linux/node.h
index 40d641a8bfb0..9ec680dd607f 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -185,9 +185,4 @@ static inline void register_hugetlbfs_with_node(node_registration_func_t reg,
 
 #define to_node(device) container_of(device, struct node, dev)
 
-static inline bool node_is_toptier(int node)
-{
-	return node_state(node, N_CPU);
-}
-
 #endif /* _LINUX_NODE_H_ */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 834f288b3769..8405662646e9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -35,6 +35,7 @@
 #include <linux/numa.h>
 #include <linux/page_owner.h>
 #include <linux/sched/sysctl.h>
+#include <linux/memory-tiers.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index aecce987df7c..7204f7381a15 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -18,6 +18,7 @@ struct demotion_nodes {
 static void establish_migration_targets(void);
 static DEFINE_MUTEX(memory_tier_lock);
 static LIST_HEAD(memory_tiers);
+static int top_tier_id;
 /*
  * node_demotion[] examples:
  *
@@ -373,6 +374,31 @@ void node_put_memory_tier(struct memory_tier *memtier)
 	put_device(&memtier->dev);
 }
 
+bool node_is_toptier(int node)
+{
+	bool toptier;
+	pg_data_t *pgdat;
+	struct memory_tier *memtier;
+
+	pgdat = NODE_DATA(node);
+	if (!pgdat)
+		return false;
+
+	rcu_read_lock();
+	memtier = rcu_dereference(pgdat->memtier);
+	if (!memtier) {
+		toptier = true;
+		goto out;
+	}
+	if (memtier->dev.id >= top_tier_id)
+		toptier = true;
+	else
+		toptier = false;
+out:
+	rcu_read_unlock();
+	return toptier;
+}
+
 #ifdef CONFIG_MIGRATION
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
 {
@@ -545,6 +571,21 @@ static void establish_migration_targets(void)
 		} while (1);
 	}
 build_lower_tier_mask:
+	/*
+	 * Promotion is allowed from a memory tier to a higher
+	 * memory tier only if the memory tier doesn't include
+	 * compute. We want to skip promotion from a memory tier
+	 * if any node that is part of the memory tier has CPUs.
+	 * Once we detect such a memory tier, we consider that tier
+	 * as the top tier from which promotion is not allowed.
+	 */
+	list_for_each_entry_reverse(memtier, &memory_tiers, list) {
+		nodes_and(used, node_states[N_CPU], memtier->nodelist);
+		if (!nodes_empty(used)) {
+			top_tier_id = memtier->dev.id;
+			break;
+		}
+	}
 	/*
 	 * Now build the lower_tier mask for each node collecting node mask from
 	 * all memory tier below it. This allows us to fallback demotion page
diff --git a/mm/migrate.c b/mm/migrate.c
index c758c9c21d7d..1da81136eaaa 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -50,6 +50,7 @@
 #include <linux/memory.h>
 #include <linux/random.h>
 #include <linux/sched/sysctl.h>
+#include <linux/memory-tiers.h>
 
 #include <asm/tlbflush.h>
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ba5592655ee3..92a2fc0fa88b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -31,6 +31,7 @@
 #include <linux/pgtable.h>
 #include <linux/sched/sysctl.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/memory-tiers.h>
 #include <asm/cacheflush.h>
 #include <asm/mmu_context.h>
 #include <asm/tlbflush.h>
-- 
2.36.1
* [PATCH v8 11/12] mm/demotion: Add documentation for memory tiering
  2022-07-04  7:06 [PATCH v8 00/12] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                   ` (9 preceding siblings ...)
  2022-07-04  7:06 ` [PATCH v8 10/12] mm/demotion: Update node_is_toptier to work with memory tiers Aneesh Kumar K.V
@ 2022-07-04  7:06 ` Aneesh Kumar K.V
  2022-07-04  7:06 ` [PATCH v8 12/12] mm/demotion: Add sysfs ABI documentation Aneesh Kumar K.V
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-04  7:06 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Jagdish Gediya, Aneesh Kumar K . V
From: Jagdish Gediya <jvgediya@linux.ibm.com>
All N_MEMORY nodes are divided into 3 memory tiers with the tier ID values
MEMORY_TIER_HBM_GPU, MEMORY_TIER_DRAM and MEMORY_TIER_PMEM. By default,
all nodes are assigned to the default memory tier (MEMORY_TIER_DRAM).
The demotion path for all N_MEMORY nodes is prepared based on the tier ID
values of the memory tiers.
This patch adds documentation introducing memory tiering, its sysfs
interfaces and how demotion is performed based on memory tiers.
[update doc format by Bagas Sanjaya <bagasdotme@gmail.com>]
Suggested-by: Wei Xu <weixugc@google.com>
Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 Documentation/admin-guide/mm/index.rst        |   1 +
 .../admin-guide/mm/memory-tiering.rst         | 192 ++++++++++++++++++
 2 files changed, 193 insertions(+)
 create mode 100644 Documentation/admin-guide/mm/memory-tiering.rst
diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index c21b5823f126..3f211cbca8c3 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -32,6 +32,7 @@ the Linux memory management.
    idle_page_tracking
    ksm
    memory-hotplug
+   memory-tiering
    nommu-mmap
    numa_memory_policy
    numaperf
diff --git a/Documentation/admin-guide/mm/memory-tiering.rst b/Documentation/admin-guide/mm/memory-tiering.rst
new file mode 100644
index 000000000000..107599dbc952
--- /dev/null
+++ b/Documentation/admin-guide/mm/memory-tiering.rst
@@ -0,0 +1,192 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _admin_guide_memory_tiering:
+
+============
+Memory tiers
+============
+
+This document describes explicit memory tiering support along with
+demotion based on memory tiers.
+
+Introduction
+============
+
+Many systems have multiple types of memory devices, e.g. GPU, DRAM and
+PMEM. The memory subsystem of these systems can be called a memory
+tiering system because the performance of each type of
+memory is different. Memory tiers are defined based on the hardware
+capabilities of memory nodes. Each memory tier is assigned a tier ID
+value that determines the memory tier's position in the demotion order.
+
+The memory tier assignment of each node is independent of each
+other. Moving a node from one tier to another doesn't affect
+the tier assignment of any other node.
+
+Memory tiers are used to build the demotion targets for nodes. A node
+can demote its pages to any node in any lower tier.
+
+Memory tier ID
+==============
+
+Memory nodes are divided into 3 types of memory tiers, with the tier IDs
+shown below, based on their hardware characteristics.
+
+
+  * MEMORY_TIER_HBM_GPU
+  * MEMORY_TIER_DRAM
+  * MEMORY_TIER_PMEM
+
+Memory tiers initialization and (re)assignments
+===============================================
+
+By default, all nodes are assigned to the memory tier with the default tier ID
+DEFAULT_MEMORY_TIER, which is 200 (MEMORY_TIER_DRAM). The memory tier of
+a memory node can be modified either through sysfs or from a driver. On
+hotplug, the memory tier with the default tier ID is assigned to the memory node.
+
+
+Sysfs interfaces
+================
+
+The nodes belonging to a specific tier can be read from
+/sys/devices/system/memtier/memtierN/nodelist (read-only)
+
+Examples:
+
+1. On a system where node 0 is a CPU + DRAM node, node 1 is an HBM node and
+   node 2 is a PMEM node, an ideal tier layout will be
+
+   .. code-block:: sh
+
+      $ cat /sys/devices/system/memtier/memtier300/nodelist
+      1
+      $ cat /sys/devices/system/memtier/memtier200/nodelist
+      0
+      $ cat /sys/devices/system/memtier/memtier100/nodelist
+      2
+
+2. On a system where nodes 0 & 1 are CPU + DRAM nodes and nodes 2 & 3 are
+   PMEM nodes.
+
+   .. code-block:: sh
+
+      $ cat /sys/devices/system/memtier/memtier0/nodelist
+      cat: /sys/devices/system/memtier/memtier0/nodelist: No such file or directory
+      $ cat /sys/devices/system/memtier/memtier1/nodelist
+      0-1
+      $ cat /sys/devices/system/memtier/memtier2/nodelist
+      2-3
+
+The default memory tier can be read from
+/sys/devices/system/memtier/default_tier (read-only)
+
+   .. code-block:: sh
+
+      $ cat /sys/devices/system/memtier/default_tier
+      memtier200
+
+The maximum supported memory tier ID can be read from
+/sys/devices/system/memtier/max_tier (read-only)
+
+   .. code-block:: sh
+
+      $ cat /sys/devices/system/memtier/max_tier
+      400
+
+An individual node's memory tier can be read or set using
+/sys/devices/system/node/nodeN/memtier (read-write), where N = node id
+
+When this interface is written, the node is moved from its old memory tier
+to the new memory tier and the demotion targets for all N_MEMORY nodes are
+rebuilt.
+
+For example 1 mentioned above,
+
+   .. code-block:: sh
+
+      $ cat /sys/devices/system/node/node0/memtier
+      1
+      $ cat /sys/devices/system/node/node1/memtier
+      0
+      $ cat /sys/devices/system/node/node2/memtier
+      2
+
+Additional memory tiers can be created by writing a tier ID value that does
+not exist yet to this file. This creates the new memory tier and moves the
+NUMA node into it.
+
+Demotion
+========
+
+In a system with DRAM and persistent memory, once DRAM
+fills up, reclaim will start and some of the DRAM contents will be
+thrown out even if there is space in persistent memory.
+Consequently, allocations will, at some point, start falling over to the slower
+persistent memory.
+
+That has two nasty properties. First, the newer allocations can end up in
+the slower persistent memory. Second, reclaimed data in DRAM are just
+discarded even if there are gobs of space in persistent memory that could
+be used.
+
+Instead of a page being discarded during reclaim, it can be moved to
+persistent memory. Allowing page migration during reclaim enables
+these systems to migrate pages from fast (higher) tiers to slow (lower)
+tiers when the fast (higher) tier is under pressure.
+
+
+Enable/Disable demotion
+-----------------------
+
+Demotion is disabled by default. It can be enabled by writing 1 or true
+(or disabled by writing 0 or false) to the sysfs interface below,
+
+   .. code-block:: sh
+
+      $ echo 1 > /sys/kernel/mm/numa/demotion_enabled
+
+Preferred and allowed demotion nodes
+------------------------------------
+
+Preferred nodes for a specific N_MEMORY node are the best nodes
+from the next possible lower memory tier. Allowed nodes for any
+node are all the nodes available in all possible lower memory
+tiers.
+
+For example, on a system where nodes 0 and 1 are CPU + DRAM nodes and
+nodes 2 and 3 are PMEM nodes,
+
+  * node distances:
+
+    ====  ==   ==   ==   ==
+    node   0    1    2    3
+    ====  ==   ==   ==   ==
+       0  10   20   30   40
+       1  20   10   40   30
+       2  30   40   10   40
+       3  40   30   40   10
+    ====  ==   ==   ==   ==
+
+
+   .. code-block:: none
+
+      memory_tiers[0] = <empty>
+      memory_tiers[1] = 0-1
+      memory_tiers[2] = 2-3
+
+      node_demotion[0].preferred = 2
+      node_demotion[0].allowed   = 2, 3
+      node_demotion[1].preferred = 3
+      node_demotion[1].allowed   = 3, 2
+      node_demotion[2].preferred = <empty>
+      node_demotion[2].allowed   = <empty>
+      node_demotion[3].preferred = <empty>
+      node_demotion[3].allowed   = <empty>
+
+Memory allocation for demotion
+------------------------------
+
+If a page needs to be demoted from any node, the kernel first tries
+to allocate a new page from the node's preferred node and falls back to
+the node's allowed targets in allocation fallback order.
+
-- 
2.36.1
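
The "preferred" and "allowed" lists in the memory-tiering documentation
above can be reproduced with a small user-space sketch. This is only an
illustration in Python, not the kernel implementation from this series:
the function name build_demotion_targets, the highest-tier-first list
layout, and the choice to sort "allowed" by distance are all assumptions
made for the example.

```python
def build_demotion_targets(tiers, distance):
    """Compute per-node demotion targets.

    tiers:    list of node-id lists, ordered from highest tier to lowest
              (hypothetical layout chosen for this sketch).
    distance: square matrix, distance[a][b] is the NUMA distance a -> b.
    """
    targets = {}
    for level, nodes in enumerate(tiers):
        # Only non-empty tiers below the current one can receive pages.
        lower_tiers = [tier for tier in tiers[level + 1:] if tier]
        for node in nodes:
            def by_distance(candidate, src=node):
                return (distance[src][candidate], candidate)

            # Allowed: every node in every lower tier (sorted by distance
            # here purely for readable output; an assumption of the sketch).
            allowed = sorted(
                (n for tier in lower_tiers for n in tier), key=by_distance
            )
            # Preferred: the nearest node(s) in the next lower tier.
            preferred = []
            if lower_tiers:
                best = min(distance[node][n] for n in lower_tiers[0])
                preferred = [n for n in lower_tiers[0] if distance[node][n] == best]
            targets[node] = {"preferred": preferred, "allowed": allowed}
    return targets


# Distance table and tier layout from the documentation example:
# nodes 0-1 are CPU + DRAM, nodes 2-3 are PMEM, the top tier is empty.
DISTANCE = [
    [10, 20, 30, 40],
    [20, 10, 40, 30],
    [30, 40, 10, 40],
    [40, 30, 40, 10],
]
TIERS = [[], [0, 1], [2, 3]]

targets = build_demotion_targets(TIERS, DISTANCE)
for node in sorted(targets):
    print(node, targets[node])
```

With the distance table from the documentation, node 0 prefers node 2
and node 1 prefers node 3, while the PMEM nodes have no targets because
no tier lies below them.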
^ permalink raw reply related	[flat|nested] 42+ messages in thread
* [PATCH v8 12/12] mm/demotion: Add sysfs ABI documentation
  2022-07-04  7:06 [PATCH v8 00/12] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                   ` (10 preceding siblings ...)
  2022-07-04  7:06 ` [PATCH v8 11/12] mm/demotion: Add documentation for memory tiering Aneesh Kumar K.V
@ 2022-07-04  7:06 ` Aneesh Kumar K.V
  2022-07-04 15:00 ` [PATCH v8 00/12] mm/demotion: Memory tiers and demotion Matthew Wilcox
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-04  7:06 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V
Add sysfs ABI documentation.
Signed-off-by: Wei Xu <weixugc@google.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 .../ABI/testing/sysfs-kernel-mm-memory-tiers  | 61 +++++++++++++++++++
 1 file changed, 61 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers b/Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers
new file mode 100644
index 000000000000..843fb59d2f3d
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers
@@ -0,0 +1,61 @@
+What:		/sys/devices/system/memtier/
+Date:		June 2022
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Interface for tiered memory
+
+		This is the directory containing the information about memory tiers.
+
+		Each memory tier has its own subdirectory.
+
+		The order of memory tiers is determined by their tier ID value.
+		A higher tier ID value means a higher tier. For example, memtier300
+		is a higher memory tier than memtier100.
+
+What:		/sys/devices/system/memtier/default_tier
+Date:		June 2022
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Default memory tier
+
+		The default memory tier to which memory is added via hotplug
+		if the NUMA node is not already part of any memory tier.
+
+What:		/sys/devices/system/memtier/max_tier
+Date:		June 2022
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Maximum memory tier ID supported
+
+		The maximum memory tier device ID that can be created. Users can
+		create memory tiers in the range [0 - max_tier].
+
+What:		/sys/devices/system/memtier/memtierN/
+Date:		June 2022
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Directory with details of a specific memory tier
+
+		This is the directory containing the information about a particular
+		memory tier, memtierN, where N is the memtier device ID (e.g. 0, 1).
+
+		The memtier device ID number itself is just an identifier and has no
+		special meaning. Its value relative to other memtiers decides the level
+		of this memtier in the tier hierarchy.
+
+
+What:		/sys/devices/system/memtier/memtierN/nodelist
+Date:		June 2022
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Memory tier nodelist
+
+
+		When read, lists the memory nodes in the specified tier.
+
+What:		/sys/devices/system/node/nodeN/memtier
+Date:		June 2022
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Memory tier details for node N
+
+		When read, lists the device ID of the memory tier that the node belongs
+		to.  Its value is empty for a CPU-only NUMA node.
+
+		When written, the kernel moves the node into the specified memory
+		tier if the move is allowed. The tier assignments of all other
+		nodes are not affected.
-- 
2.36.1
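
A user-space consumer of the ABI described above would mostly need two
things: ordering the memtierN directories by tier ID and expanding a
nodelist value. The following Python sketch shows one way to do that on
plain strings. The helper names are hypothetical, and the nodelist
grammar (comma-separated IDs or ranges such as "0-1,3") is an assumption
based on the "0-1" and "2-3" outputs shown in the documentation.

```python
import re


def parse_nodelist(text):
    """Expand a nodelist string such as "0-1,3" into [0, 1, 3]."""
    nodes = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty file content, e.g. a tier without nodes
        first, _, last = part.partition("-")
        nodes.extend(range(int(first), int(last or first) + 1))
    return nodes


def tier_id(name):
    """Return N for a directory entry named memtierN, else None."""
    match = re.fullmatch(r"memtier(\d+)", name)
    return int(match.group(1)) if match else None


def tiers_highest_first(entries):
    """Order tier IDs so that the highest tier comes first.

    Follows the ABI text: a higher tier ID value means a higher tier.
    """
    ids = [tier_id(entry) for entry in entries]
    return sorted((i for i in ids if i is not None), reverse=True)


# Entry names as they might appear under /sys/devices/system/memtier/
# (sample data for the sketch, not output captured from a real system).
entries = ["default_tier", "max_tier", "memtier100", "memtier200", "power", "uevent"]
print(tiers_highest_first(entries))
print(parse_nodelist("0-1\n"))
```

Working on strings rather than reading sysfs directly keeps the sketch
usable on systems that do not carry this patch series.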
^ permalink raw reply related	[flat|nested] 42+ messages in thread
* Re: [PATCH v8 00/12] mm/demotion: Memory tiers and demotion
  2022-07-04  7:06 [PATCH v8 00/12] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                   ` (11 preceding siblings ...)
  2022-07-04  7:06 ` [PATCH v8 12/12] mm/demotion: Add sysfs ABI documentation Aneesh Kumar K.V
@ 2022-07-04 15:00 ` Matthew Wilcox
  2022-07-05  3:45   ` Alistair Popple
  2022-07-05  4:17   ` Aneesh Kumar K V
  2022-07-05  4:29 ` Huang, Ying
  2022-07-11 15:29 ` Aneesh Kumar K.V
  14 siblings, 2 replies; 42+ messages in thread
From: Matthew Wilcox @ 2022-07-04 15:00 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Johannes Weiner, jvgediya.oss
On Mon, Jul 04, 2022 at 12:36:00PM +0530, Aneesh Kumar K.V wrote:
> * The current tier initialization code always initializes
>   each memory-only NUMA node into a lower tier.  But a memory-only
>   NUMA node may have a high performance memory device (e.g. a DRAM
>   device attached via CXL.mem or a DRAM-backed memory-only node on
>   a virtual machine) and should be put into a higher tier.
> 
> * The current tier hierarchy always puts CPU nodes into the top
>   tier. But on a system with HBM (e.g. GPU memory) devices, these
>   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
>   with CPUs are better to be placed into the next lower tier.
These things that you identify as problems seem perfectly sensible to me.
Memory which is attached to this CPU has the lowest latency and should
be preferred over more remote memory, no matter its bandwidth.
^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: [PATCH v8 00/12] mm/demotion: Memory tiers and demotion
  2022-07-04 15:00 ` [PATCH v8 00/12] mm/demotion: Memory tiers and demotion Matthew Wilcox
@ 2022-07-05  3:45   ` Alistair Popple
  2022-07-05  4:17   ` Aneesh Kumar K V
  1 sibling, 0 replies; 42+ messages in thread
From: Alistair Popple @ 2022-07-05  3:45 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Aneesh Kumar K.V, linux-mm, akpm, Wei Xu, Huang Ying, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Dan Williams, Johannes Weiner, jvgediya.oss
Matthew Wilcox <willy@infradead.org> writes:
> On Mon, Jul 04, 2022 at 12:36:00PM +0530, Aneesh Kumar K.V wrote:
>> * The current tier initialization code always initializes
>>   each memory-only NUMA node into a lower tier.  But a memory-only
>>   NUMA node may have a high performance memory device (e.g. a DRAM
>>   device attached via CXL.mem or a DRAM-backed memory-only node on
>>   a virtual machine) and should be put into a higher tier.
>>
>> * The current tier hierarchy always puts CPU nodes into the top
>>   tier. But on a system with HBM (e.g. GPU memory) devices, these
>>   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
>>   with CPUs are better to be placed into the next lower tier.
>
> These things that you identify as problems seem perfectly sensible to me.
> Memory which is attached to this CPU has the lowest latency and should
> be preferred over more remote memory, no matter its bandwidth.
It is a problem because HBM NUMA node memory is generally also used by
some kind of device/accelerator (e.g. GPU). Typically users would prefer
to keep HBM memory for use by the accelerator rather than random pages
demoted from the CPU as accelerators have orders of magnitude better
performance when accessing local HBM vs. remote memory.
^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: [PATCH v8 00/12] mm/demotion: Memory tiers and demotion
  2022-07-04 15:00 ` [PATCH v8 00/12] mm/demotion: Memory tiers and demotion Matthew Wilcox
  2022-07-05  3:45   ` Alistair Popple
@ 2022-07-05  4:17   ` Aneesh Kumar K V
  1 sibling, 0 replies; 42+ messages in thread
From: Aneesh Kumar K V @ 2022-07-05  4:17 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, akpm, Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Johannes Weiner, jvgediya.oss
On 7/4/22 8:30 PM, Matthew Wilcox wrote:
> On Mon, Jul 04, 2022 at 12:36:00PM +0530, Aneesh Kumar K.V wrote:
>> * The current tier initialization code always initializes
>>   each memory-only NUMA node into a lower tier.  But a memory-only
>>   NUMA node may have a high performance memory device (e.g. a DRAM
>>   device attached via CXL.mem or a DRAM-backed memory-only node on
>>   a virtual machine) and should be put into a higher tier.
>>
>> * The current tier hierarchy always puts CPU nodes into the top
>>   tier. But on a system with HBM (e.g. GPU memory) devices, these
>>   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
>>   with CPUs are better to be placed into the next lower tier.
> 
> These things that you identify as problems seem perfectly sensible to me.
> Memory which is attached to this CPU has the lowest latency and should
> be preferred over more remote memory, no matter its bandwidth.
Allocation will prefer local memory over remote memory. Memory tiers are used during
demotion and currently, the kernel demotes cold pages from DRAM memory to these
special device memories because they appear as memory-only NUMA nodes. In many cases
(e.g. GPU) what is desired is the demotion of cold pages from GPU memory to DRAM or
even slow memory.
This patchset builds a framework to enable such demotion criteria.
-aneesh
^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: [PATCH v8 00/12] mm/demotion: Memory tiers and demotion
  2022-07-04  7:06 [PATCH v8 00/12] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                   ` (12 preceding siblings ...)
  2022-07-04 15:00 ` [PATCH v8 00/12] mm/demotion: Memory tiers and demotion Matthew Wilcox
@ 2022-07-05  4:29 ` Huang, Ying
  2022-07-05  5:22   ` Aneesh Kumar K V
  2022-07-11 15:29 ` Aneesh Kumar K.V
  14 siblings, 1 reply; 42+ messages in thread
From: Huang, Ying @ 2022-07-05  4:29 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss
Hi, Aneesh,
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> The current kernel has the basic memory tiering support: Inactive
> pages on a higher tier NUMA node can be migrated (demoted) to a lower
> tier NUMA node to make room for new allocations on the higher tier
> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> migrated (promoted) to a higher tier NUMA node to improve the
> performance.
>
> In the current kernel, memory tiers are defined implicitly via a
> demotion path relationship between NUMA nodes, which is created during
> the kernel initialization and updated when a NUMA node is hot-added or
> hot-removed.  The current implementation puts all nodes with CPU into
> the top tier, and builds the tier hierarchy tier-by-tier by establishing
> the per-node demotion targets based on the distances between nodes.
>
> This current memory tier kernel interface needs to be improved for
> several important use cases:
>
> * The current tier initialization code always initializes
>   each memory-only NUMA node into a lower tier.  But a memory-only
>   NUMA node may have a high performance memory device (e.g. a DRAM
>   device attached via CXL.mem or a DRAM-backed memory-only node on
>   a virtual machine) and should be put into a higher tier.
>
> * The current tier hierarchy always puts CPU nodes into the top
>   tier. But on a system with HBM (e.g. GPU memory) devices, these
>   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
>   with CPUs are better to be placed into the next lower tier.
>
> * Also because the current tier hierarchy always puts CPU nodes
>   into the top tier, when a CPU is hot-added (or hot-removed) and
>   triggers a memory node from CPU-less into a CPU node (or vice
>   versa), the memory tier hierarchy gets changed, even though no
>   memory node is added or removed.  This can make the tier
>   hierarchy unstable and make it difficult to support tier-based
>   memory accounting.
>
> * A higher tier node can only be demoted to selected nodes on the
>   next lower tier as defined by the demotion path, not any other
>   node from any lower tier.  This strict, hard-coded demotion order
>   does not work in all use cases (e.g. some use cases may want to
>   allow cross-socket demotion to another node in the same demotion
>   tier as a fallback when the preferred demotion node is out of
>   space), and has resulted in the feature request for an interface to
>   override the system-wide, per-node demotion order from the
>   userspace.  This demotion order is also inconsistent with the page
>   allocation fallback order when all the nodes in a higher tier are
>   out of space: The page allocation can fall back to any node from
>   any lower tier, whereas the demotion order doesn't allow that.
>
> * There are no interfaces for the userspace to learn about the memory
>   tier hierarchy in order to optimize its memory allocations.
>
> This patch series make the creation of memory tiers explicit under
> the control of userspace or device driver.
>
> Memory Tier Initialization
> ==========================
>
> By default, all memory nodes are assigned to the default tier with
> tier ID value 200.
>
> A device driver can move up or down its memory nodes from the default
> tier.  For example, PMEM can move down its memory nodes below the
> default tier, whereas GPU can move up its memory nodes above the
> default tier.
>
> The kernel initialization code makes the decision on which exact tier
> a memory node should be assigned to based on the requests from the
> device drivers as well as the memory device hardware information
> provided by the firmware.
>
> Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
>
> Memory Allocation for Demotion
> ==============================
> This patch series keep the demotion target page allocation logic same.
> The demotion page allocation pick the closest NUMA node in the
> next lower tier to the current NUMA node allocating pages from.
>
> This will be later improved to use the same page allocation strategy
> using fallback list.
>
> Sysfs Interface:
> -------------
> Listing current list of memory tiers details:
>
> :/sys/devices/system/memtier$ ls
> default_tier max_tier  memtier1  power  uevent
> :/sys/devices/system/memtier$ cat default_tier
> memtier200
> :/sys/devices/system/memtier$ cat max_tier 
> 400
> :/sys/devices/system/memtier$ 
>
> Per node memory tier details:
>
> For a cpu only NUMA node:
>
> :/sys/devices/system/node# cat node0/memtier 
> :/sys/devices/system/node# echo 1 > node0/memtier 
> :/sys/devices/system/node# cat node0/memtier 
> :/sys/devices/system/node# 
>
> For a NUMA node with memory:
> :/sys/devices/system/node# cat node1/memtier 
> 1
> :/sys/devices/system/node# ls ../memtier/
> default_tier  max_tier  memtier1  power  uevent
> :/sys/devices/system/node# echo 2 > node1/memtier 
> :/sys/devices/system/node# 
> :/sys/devices/system/node# ls ../memtier/
> default_tier  max_tier  memtier1  memtier2  power  uevent
> :/sys/devices/system/node# cat node1/memtier 
> 2
> :/sys/devices/system/node# 
>
> Removing a memory tier
> :/sys/devices/system/node# cat node1/memtier 
> 2
> :/sys/devices/system/node# echo 1 > node1/memtier
Thanks a lot for your patchset.
Per my understanding, we haven't reached consensus on
- how to create the default memory tiers in kernel (via abstract
  distance provided by drivers?  Or use SLIT as the first step?)
- how to override the default memory tiers from user space
As in the following thread and email,
https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
I think that we need to finalize that first?
Best Regards,
Huang, Ying
> :/sys/devices/system/node# 
> :/sys/devices/system/node# cat node1/memtier 
> 1
> :/sys/devices/system/node# 
> :/sys/devices/system/node# ls ../memtier/
> default_tier  max_tier  memtier1  power  uevent
> :/sys/devices/system/node# 
>
> The above resulted in removal of memtier2 which was created in the earlier step.
>
> Changes from v7:
> * Fix kernel crash with demotion.
> * Improve documentation.
>
> Changes from v6:
> * Drop the usage of rank.
> * Address other review feedback.
>
> Changes from v5:
> * Remove patch supporting N_MEMORY node removal from memory tiers. memory tiers
>   are going to be used for features other than demotion. Hence keep all N_MEMORY
>   nodes in memory tiers irrespective of whether they want to participate in promotion or demotion.
> * Add NODE_DATA->memtier
> * Rearrage patches to add sysfs files later.
> * Add support to create memory tiers from userspace.
> * Address other review feedback.
>
>
> Changes from v4:
> * Address review feedback.
> * Reverse the meaning of "rank": higher rank value means higher tier.
> * Add "/sys/devices/system/memtier/default_tier".
> * Add node_is_toptier
>
> v4:
> Add support for explicit memory tiers and ranks.
>
> v3:
> - Modify patch 1 subject to make it more specific
> - Remove /sys/kernel/mm/numa/demotion_targets interface, use
>   /sys/devices/system/node/demotion_targets instead and make
>   it writable to override node_states[N_DEMOTION_TARGETS].
> - Add support to view per node demotion targets via sysfs
>
> v2:
> In v1, only 1st patch of this patch series was sent, which was
> implemented to avoid some of the limitations on the demotion
> target sharing, however for certain numa topology, the demotion
> targets found by that patch was not most optimal, so 1st patch
> in this series is modified according to suggestions from Huang
> and Baolin. Different examples of demotion list comparasion
> between existing implementation and changed implementation can
> be found in the commit message of 1st patch.
>
>
> Aneesh Kumar K.V (10):
>   mm/demotion: Add support for explicit memory tiers
>   mm/demotion: Move memory demotion related code
>   mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM
>   mm/demotion: Add hotplug callbacks to handle new numa node onlined
>   mm/demotion: Build demotion targets based on explicit memory tiers
>   mm/demotion: Expose memory tier details via sysfs
>   mm/demotion: Add per node memory tier attribute to sysfs
>   mm/demotion: Add pg_data_t member to track node memory tier details
>   mm/demotion: Update node_is_toptier to work with memory tiers
>   mm/demotion: Add sysfs ABI documentation
>
> Jagdish Gediya (2):
>   mm/demotion: Demote pages according to allocation fallback order
>   mm/demotion: Add documentation for memory tiering
>
>  .../ABI/testing/sysfs-kernel-mm-memory-tiers  |  61 ++
>  Documentation/admin-guide/mm/index.rst        |   1 +
>  .../admin-guide/mm/memory-tiering.rst         | 192 +++++
>  drivers/base/node.c                           |  42 +
>  drivers/dax/kmem.c                            |   6 +-
>  include/linux/memory-tiers.h                  |  72 ++
>  include/linux/migrate.h                       |  15 -
>  include/linux/mmzone.h                        |   3 +
>  include/linux/node.h                          |   5 -
>  mm/Makefile                                   |   1 +
>  mm/huge_memory.c                              |   1 +
>  mm/memory-tiers.c                             | 791 ++++++++++++++++++
>  mm/migrate.c                                  | 453 +---------
>  mm/mprotect.c                                 |   1 +
>  mm/vmscan.c                                   |  59 +-
>  mm/vmstat.c                                   |   4 -
>  16 files changed, 1215 insertions(+), 492 deletions(-)
>  create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers
>  create mode 100644 Documentation/admin-guide/mm/memory-tiering.rst
>  create mode 100644 include/linux/memory-tiers.h
>  create mode 100644 mm/memory-tiers.c
^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: [PATCH v8 00/12] mm/demotion: Memory tiers and demotion
  2022-07-05  4:29 ` Huang, Ying
@ 2022-07-05  5:22   ` Aneesh Kumar K V
  2022-07-12  1:16     ` Huang, Ying
  0 siblings, 1 reply; 42+ messages in thread
From: Aneesh Kumar K V @ 2022-07-05  5:22 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss
On 7/5/22 9:59 AM, Huang, Ying wrote:
> Hi, Aneesh,
> 
> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> 
>> The current kernel has the basic memory tiering support: Inactive
>> pages on a higher tier NUMA node can be migrated (demoted) to a lower
>> tier NUMA node to make room for new allocations on the higher tier
>> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
>> migrated (promoted) to a higher tier NUMA node to improve the
>> performance.
>>
>> In the current kernel, memory tiers are defined implicitly via a
>> demotion path relationship between NUMA nodes, which is created during
>> the kernel initialization and updated when a NUMA node is hot-added or
>> hot-removed.  The current implementation puts all nodes with CPU into
>> the top tier, and builds the tier hierarchy tier-by-tier by establishing
>> the per-node demotion targets based on the distances between nodes.
>>
>> This current memory tier kernel interface needs to be improved for
>> several important use cases:
>>
>> * The current tier initialization code always initializes
>>   each memory-only NUMA node into a lower tier.  But a memory-only
>>   NUMA node may have a high performance memory device (e.g. a DRAM
>>   device attached via CXL.mem or a DRAM-backed memory-only node on
>>   a virtual machine) and should be put into a higher tier.
>>
>> * The current tier hierarchy always puts CPU nodes into the top
>>   tier. But on a system with HBM (e.g. GPU memory) devices, these
>>   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
>>   with CPUs are better to be placed into the next lower tier.
>>
>> * Also because the current tier hierarchy always puts CPU nodes
>>   into the top tier, when a CPU is hot-added (or hot-removed) and
>>   triggers a memory node from CPU-less into a CPU node (or vice
>>   versa), the memory tier hierarchy gets changed, even though no
>>   memory node is added or removed.  This can make the tier
>>   hierarchy unstable and make it difficult to support tier-based
>>   memory accounting.
>>
>> * A higher tier node can only be demoted to selected nodes on the
>>   next lower tier as defined by the demotion path, not any other
>>   node from any lower tier.  This strict, hard-coded demotion order
>>   does not work in all use cases (e.g. some use cases may want to
>>   allow cross-socket demotion to another node in the same demotion
>>   tier as a fallback when the preferred demotion node is out of
>>   space), and has resulted in the feature request for an interface to
>>   override the system-wide, per-node demotion order from the
>>   userspace.  This demotion order is also inconsistent with the page
>>   allocation fallback order when all the nodes in a higher tier are
>>   out of space: The page allocation can fall back to any node from
>>   any lower tier, whereas the demotion order doesn't allow that.
>>
>> * There are no interfaces for the userspace to learn about the memory
>>   tier hierarchy in order to optimize its memory allocations.
>>
>> This patch series make the creation of memory tiers explicit under
>> the control of userspace or device driver.
>>
>> Memory Tier Initialization
>> ==========================
>>
>> By default, all memory nodes are assigned to the default tier with
>> tier ID value 200.
>>
>> A device driver can move up or down its memory nodes from the default
>> tier.  For example, PMEM can move down its memory nodes below the
>> default tier, whereas GPU can move up its memory nodes above the
>> default tier.
>>
>> The kernel initialization code makes the decision on which exact tier
>> a memory node should be assigned to based on the requests from the
>> device drivers as well as the memory device hardware information
>> provided by the firmware.
>>
>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
>>
>> Memory Allocation for Demotion
>> ==============================
>> This patch series keep the demotion target page allocation logic same.
>> The demotion page allocation pick the closest NUMA node in the
>> next lower tier to the current NUMA node allocating pages from.
>>
>> This will be later improved to use the same page allocation strategy
>> using fallback list.
>>
>> Sysfs Interface:
>> -------------
>> Listing current list of memory tiers details:
>>
>> :/sys/devices/system/memtier$ ls
>> default_tier max_tier  memtier1  power  uevent
>> :/sys/devices/system/memtier$ cat default_tier
>> memtier200
>> :/sys/devices/system/memtier$ cat max_tier 
>> 400
>> :/sys/devices/system/memtier$ 
>>
>> Per node memory tier details:
>>
>> For a cpu only NUMA node:
>>
>> :/sys/devices/system/node# cat node0/memtier 
>> :/sys/devices/system/node# echo 1 > node0/memtier 
>> :/sys/devices/system/node# cat node0/memtier 
>> :/sys/devices/system/node# 
>>
>> For a NUMA node with memory:
>> :/sys/devices/system/node# cat node1/memtier 
>> 1
>> :/sys/devices/system/node# ls ../memtier/
>> default_tier  max_tier  memtier1  power  uevent
>> :/sys/devices/system/node# echo 2 > node1/memtier 
>> :/sys/devices/system/node# 
>> :/sys/devices/system/node# ls ../memtier/
>> default_tier  max_tier  memtier1  memtier2  power  uevent
>> :/sys/devices/system/node# cat node1/memtier 
>> 2
>> :/sys/devices/system/node# 
>>
>> Removing a memory tier
>> :/sys/devices/system/node# cat node1/memtier 
>> 2
>> :/sys/devices/system/node# echo 1 > node1/memtier
> 
> Thanks a lot for your patchset.
> 
> Per my understanding, we haven't reach consensus on
> 
> - how to create the default memory tiers in kernel (via abstract
>   distance provided by drivers?  Or use SLIT as the first step?)
> 
> - how to override the default memory tiers from user space
> 
> As in the following thread and email,
> 
> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
> 
> I think that we need to finalized on that firstly?
I did list the proposal here 
https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated
if the user wants a different tier topology. 
All memory that is not managed by a driver gets added to default_memory_tier, which has a default value of 200.
For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100.
Later, as we learn more about the device attributes (HMAT or something similar) that we might want to use
to control the tier assignment, this can be a range of memory tiers.
Based on the above, I guess we can merge what is posted in this series and later fine-tune/update
the memory tier assignment based on device attributes.
-aneesh
^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: [PATCH v8 00/12] mm/demotion: Memory tiers and demotion
  2022-07-04  7:06 [PATCH v8 00/12] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                   ` (13 preceding siblings ...)
  2022-07-05  4:29 ` Huang, Ying
@ 2022-07-11 15:29 ` Aneesh Kumar K.V
  14 siblings, 0 replies; 42+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-11 15:29 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> The current kernel has the basic memory tiering support: Inactive
> pages on a higher tier NUMA node can be migrated (demoted) to a lower
> tier NUMA node to make room for new allocations on the higher tier
> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> migrated (promoted) to a higher tier NUMA node to improve the
> performance.
>
> In the current kernel, memory tiers are defined implicitly via a
> demotion path relationship between NUMA nodes, which is created during
> the kernel initialization and updated when a NUMA node is hot-added or
> hot-removed.  The current implementation puts all nodes with CPU into
> the top tier, and builds the tier hierarchy tier-by-tier by establishing
> the per-node demotion targets based on the distances between nodes.
>
> This current memory tier kernel interface needs to be improved for
> several important use cases:
>
> * The current tier initialization code always initializes
>   each memory-only NUMA node into a lower tier.  But a memory-only
>   NUMA node may have a high performance memory device (e.g. a DRAM
>   device attached via CXL.mem or a DRAM-backed memory-only node on
>   a virtual machine) and should be put into a higher tier.
>
> * The current tier hierarchy always puts CPU nodes into the top
>   tier. But on a system with HBM (e.g. GPU memory) devices, these
>   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
>   with CPUs are better to be placed into the next lower tier.
>
> * Also because the current tier hierarchy always puts CPU nodes
>   into the top tier, when a CPU is hot-added (or hot-removed) and
>   triggers a memory node from CPU-less into a CPU node (or vice
>   versa), the memory tier hierarchy gets changed, even though no
>   memory node is added or removed.  This can make the tier
>   hierarchy unstable and make it difficult to support tier-based
>   memory accounting.
>
> * A higher tier node can only be demoted to selected nodes on the
>   next lower tier as defined by the demotion path, not any other
>   node from any lower tier.  This strict, hard-coded demotion order
>   does not work in all use cases (e.g. some use cases may want to
>   allow cross-socket demotion to another node in the same demotion
>   tier as a fallback when the preferred demotion node is out of
>   space), and has resulted in the feature request for an interface to
>   override the system-wide, per-node demotion order from the
>   userspace.  This demotion order is also inconsistent with the page
>   allocation fallback order when all the nodes in a higher tier are
>   out of space: The page allocation can fall back to any node from
>   any lower tier, whereas the demotion order doesn't allow that.
>
> * There are no interfaces for the userspace to learn about the memory
>   tier hierarchy in order to optimize its memory allocations.
>
> This patch series make the creation of memory tiers explicit under
> the control of userspace or device driver.
>
> Memory Tier Initialization
> ==========================
>
> By default, all memory nodes are assigned to the default tier with
> tier ID value 200.
>
> A device driver can move up or down its memory nodes from the default
> tier.  For example, PMEM can move down its memory nodes below the
> default tier, whereas GPU can move up its memory nodes above the
> default tier.
>
> The kernel initialization code makes the decision on which exact tier
> a memory node should be assigned to based on the requests from the
> device drivers as well as the memory device hardware information
> provided by the firmware.
>
> Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
>
> Memory Allocation for Demotion
> ==============================
> This patch series keep the demotion target page allocation logic same.
> The demotion page allocation pick the closest NUMA node in the
> next lower tier to the current NUMA node allocating pages from.
>
> This will be later improved to use the same page allocation strategy
> using fallback list.
>
> Sysfs Interface:
> -------------
> Listing current list of memory tiers details:
>
> :/sys/devices/system/memtier$ ls
> default_tier max_tier  memtier1  power  uevent
> :/sys/devices/system/memtier$ cat default_tier
> memtier200
> :/sys/devices/system/memtier$ cat max_tier 
> 400
> :/sys/devices/system/memtier$ 
>
> Per node memory tier details:
>
> For a cpu only NUMA node:
>
> :/sys/devices/system/node# cat node0/memtier 
> :/sys/devices/system/node# echo 1 > node0/memtier 
> :/sys/devices/system/node# cat node0/memtier 
> :/sys/devices/system/node# 
>
> For a NUMA node with memory:
> :/sys/devices/system/node# cat node1/memtier 
> 1
> :/sys/devices/system/node# ls ../memtier/
> default_tier  max_tier  memtier1  power  uevent
> :/sys/devices/system/node# echo 2 > node1/memtier 
> :/sys/devices/system/node# 
> :/sys/devices/system/node# ls ../memtier/
> default_tier  max_tier  memtier1  memtier2  power  uevent
> :/sys/devices/system/node# cat node1/memtier 
> 2
> :/sys/devices/system/node# 
>
> Removing a memory tier
> :/sys/devices/system/node# cat node1/memtier 
> 2
> :/sys/devices/system/node# echo 1 > node1/memtier 
> :/sys/devices/system/node# 
> :/sys/devices/system/node# cat node1/memtier 
> 1
> :/sys/devices/system/node# 
> :/sys/devices/system/node# ls ../memtier/
> default_tier  max_tier  memtier1  power  uevent
> :/sys/devices/system/node# 
>
> The above resulted in removal of memtier2 which was created in the earlier step.
>
> Changes from v7:
> * Fix kernel crash with demotion.
> * Improve documentation.
>
> Changes from v6:
> * Drop the usage of rank.
> * Address other review feedback.
>
> Changes from v5:
> * Remove patch supporting N_MEMORY node removal from memory tiers. memory tiers
>   are going to be used for features other than demotion. Hence keep all N_MEMORY
>   nodes in memory tiers irrespective of whether they want to participate in promotion or demotion.
> * Add NODE_DATA->memtier
> * Rearrage patches to add sysfs files later.
> * Add support to create memory tiers from userspace.
> * Address other review feedback.
>
>
> Changes from v4:
> * Address review feedback.
> * Reverse the meaning of "rank": higher rank value means higher tier.
> * Add "/sys/devices/system/memtier/default_tier".
> * Add node_is_toptier
>
> v4:
> Add support for explicit memory tiers and ranks.
>
> v3:
> - Modify patch 1 subject to make it more specific
> - Remove /sys/kernel/mm/numa/demotion_targets interface, use
>   /sys/devices/system/node/demotion_targets instead and make
>   it writable to override node_states[N_DEMOTION_TARGETS].
> - Add support to view per node demotion targets via sysfs
>
> v2:
> In v1, only 1st patch of this patch series was sent, which was
> implemented to avoid some of the limitations on the demotion
> target sharing, however for certain numa topology, the demotion
> targets found by that patch was not most optimal, so 1st patch
> in this series is modified according to suggestions from Huang
> and Baolin. Different examples of demotion list comparasion
> between existing implementation and changed implementation can
> be found in the commit message of 1st patch.
>
>
> Aneesh Kumar K.V (10):
>   mm/demotion: Add support for explicit memory tiers
>   mm/demotion: Move memory demotion related code
>   mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM
>   mm/demotion: Add hotplug callbacks to handle new numa node onlined
>   mm/demotion: Build demotion targets based on explicit memory tiers
>   mm/demotion: Expose memory tier details via sysfs
>   mm/demotion: Add per node memory tier attribute to sysfs
>   mm/demotion: Add pg_data_t member to track node memory tier details
>   mm/demotion: Update node_is_toptier to work with memory tiers
>   mm/demotion: Add sysfs ABI documentation
>
> Jagdish Gediya (2):
>   mm/demotion: Demote pages according to allocation fallback order
>   mm/demotion: Add documentation for memory tiering
>
>  .../ABI/testing/sysfs-kernel-mm-memory-tiers  |  61 ++
>  Documentation/admin-guide/mm/index.rst        |   1 +
>  .../admin-guide/mm/memory-tiering.rst         | 192 +++++
>  drivers/base/node.c                           |  42 +
>  drivers/dax/kmem.c                            |   6 +-
>  include/linux/memory-tiers.h                  |  72 ++
>  include/linux/migrate.h                       |  15 -
>  include/linux/mmzone.h                        |   3 +
>  include/linux/node.h                          |   5 -
>  mm/Makefile                                   |   1 +
>  mm/huge_memory.c                              |   1 +
>  mm/memory-tiers.c                             | 791 ++++++++++++++++++
>  mm/migrate.c                                  | 453 +---------
>  mm/mprotect.c                                 |   1 +
>  mm/vmscan.c                                   |  59 +-
>  mm/vmstat.c                                   |   4 -
>  16 files changed, 1215 insertions(+), 492 deletions(-)
>  create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers
>  create mode 100644 Documentation/admin-guide/mm/memory-tiering.rst
>  create mode 100644 include/linux/memory-tiers.h
>  create mode 100644 mm/memory-tiers.c
>
  Gentle ping. Any objections to this series?
  -aneesh
* Re: [PATCH v8 00/12] mm/demotion: Memory tiers and demotion
  2022-07-05  5:22   ` Aneesh Kumar K V
@ 2022-07-12  1:16     ` Huang, Ying
  2022-07-12  4:42       ` Aneesh Kumar K V
  0 siblings, 1 reply; 42+ messages in thread
From: Huang, Ying @ 2022-07-12  1:16 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss
Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> On 7/5/22 9:59 AM, Huang, Ying wrote:
>> Hi, Aneesh,
>> 
>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>> 
>>> The current kernel has the basic memory tiering support: Inactive
>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower
>>> tier NUMA node to make room for new allocations on the higher tier
>>> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
>>> migrated (promoted) to a higher tier NUMA node to improve the
>>> performance.
>>>
>>> In the current kernel, memory tiers are defined implicitly via a
>>> demotion path relationship between NUMA nodes, which is created during
>>> the kernel initialization and updated when a NUMA node is hot-added or
>>> hot-removed.  The current implementation puts all nodes with CPU into
>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing
>>> the per-node demotion targets based on the distances between nodes.
>>>
>>> This current memory tier kernel interface needs to be improved for
>>> several important use cases:
>>>
>>> * The current tier initialization code always initializes
>>>   each memory-only NUMA node into a lower tier.  But a memory-only
>>>   NUMA node may have a high performance memory device (e.g. a DRAM
>>>   device attached via CXL.mem or a DRAM-backed memory-only node on
>>>   a virtual machine) and should be put into a higher tier.
>>>
>>> * The current tier hierarchy always puts CPU nodes into the top
>>>   tier. But on a system with HBM (e.g. GPU memory) devices, these
>>>   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
>>>   with CPUs are better to be placed into the next lower tier.
>>>
>>> * Also because the current tier hierarchy always puts CPU nodes
>>>   into the top tier, when a CPU is hot-added (or hot-removed) and
>>>   triggers a memory node from CPU-less into a CPU node (or vice
>>>   versa), the memory tier hierarchy gets changed, even though no
>>>   memory node is added or removed.  This can make the tier
>>>   hierarchy unstable and make it difficult to support tier-based
>>>   memory accounting.
>>>
>>> * A higher tier node can only be demoted to selected nodes on the
>>>   next lower tier as defined by the demotion path, not any other
>>>   node from any lower tier.  This strict, hard-coded demotion order
>>>   does not work in all use cases (e.g. some use cases may want to
>>>   allow cross-socket demotion to another node in the same demotion
>>>   tier as a fallback when the preferred demotion node is out of
>>>   space), and has resulted in the feature request for an interface to
>>>   override the system-wide, per-node demotion order from the
>>>   userspace.  This demotion order is also inconsistent with the page
>>>   allocation fallback order when all the nodes in a higher tier are
>>>   out of space: The page allocation can fall back to any node from
>>>   any lower tier, whereas the demotion order doesn't allow that.
>>>
>>> * There are no interfaces for the userspace to learn about the memory
>>>   tier hierarchy in order to optimize its memory allocations.
>>>
>>> This patch series make the creation of memory tiers explicit under
>>> the control of userspace or device driver.
>>>
>>> Memory Tier Initialization
>>> ==========================
>>>
>>> By default, all memory nodes are assigned to the default tier with
>>> tier ID value 200.
>>>
>>> A device driver can move up or down its memory nodes from the default
>>> tier.  For example, PMEM can move down its memory nodes below the
>>> default tier, whereas GPU can move up its memory nodes above the
>>> default tier.
>>>
>>> The kernel initialization code makes the decision on which exact tier
>>> a memory node should be assigned to based on the requests from the
>>> device drivers as well as the memory device hardware information
>>> provided by the firmware.
>>>
>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
>>>
>>> Memory Allocation for Demotion
>>> ==============================
>>> This patch series keep the demotion target page allocation logic same.
>>> The demotion page allocation pick the closest NUMA node in the
>>> next lower tier to the current NUMA node allocating pages from.
>>>
>>> This will be later improved to use the same page allocation strategy
>>> using fallback list.
>>>
>>> Sysfs Interface:
>>> -------------
>>> Listing current list of memory tiers details:
>>>
>>> :/sys/devices/system/memtier$ ls
>>> default_tier max_tier  memtier1  power  uevent
>>> :/sys/devices/system/memtier$ cat default_tier
>>> memtier200
>>> :/sys/devices/system/memtier$ cat max_tier 
>>> 400
>>> :/sys/devices/system/memtier$ 
>>>
>>> Per node memory tier details:
>>>
>>> For a cpu only NUMA node:
>>>
>>> :/sys/devices/system/node# cat node0/memtier 
>>> :/sys/devices/system/node# echo 1 > node0/memtier 
>>> :/sys/devices/system/node# cat node0/memtier 
>>> :/sys/devices/system/node# 
>>>
>>> For a NUMA node with memory:
>>> :/sys/devices/system/node# cat node1/memtier 
>>> 1
>>> :/sys/devices/system/node# ls ../memtier/
>>> default_tier  max_tier  memtier1  power  uevent
>>> :/sys/devices/system/node# echo 2 > node1/memtier 
>>> :/sys/devices/system/node# 
>>> :/sys/devices/system/node# ls ../memtier/
>>> default_tier  max_tier  memtier1  memtier2  power  uevent
>>> :/sys/devices/system/node# cat node1/memtier 
>>> 2
>>> :/sys/devices/system/node# 
>>>
>>> Removing a memory tier
>>> :/sys/devices/system/node# cat node1/memtier 
>>> 2
>>> :/sys/devices/system/node# echo 1 > node1/memtier
>> 
>> Thanks a lot for your patchset.
>> 
>> Per my understanding, we haven't reach consensus on
>> 
>> - how to create the default memory tiers in kernel (via abstract
>>   distance provided by drivers?  Or use SLIT as the first step?)
>> 
>> - how to override the default memory tiers from user space
>> 
>> As in the following thread and email,
>> 
>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
>> 
>> I think that we need to finalized on that firstly?
>
> I did list the proposal here 
>
> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>
> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated
> if the user wants a different tier topology. 
>
> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200
>
> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100.
> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use
> to control the tier assignment this can be a range of memory tiers. 
>
> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update
> the memory tier assignment based on device attributes.
Sorry for the late reply.
As the first step, it may be better to skip the parts on which we haven't
reached consensus yet, for example, the user space interface to override
the default memory tiers.  And we can use 0, 1, 2 as the default memory
tier IDs.  We can refine/revise the in-kernel implementation, but we
cannot change the user space ABI.
Best Regards,
Huang, Ying
* Re: [PATCH v8 00/12] mm/demotion: Memory tiers and demotion
  2022-07-12  1:16     ` Huang, Ying
@ 2022-07-12  4:42       ` Aneesh Kumar K V
  2022-07-12  5:09         ` Aneesh Kumar K V
  2022-07-12  6:59         ` Huang, Ying
  0 siblings, 2 replies; 42+ messages in thread
From: Aneesh Kumar K V @ 2022-07-12  4:42 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss
On 7/12/22 6:46 AM, Huang, Ying wrote:
> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> 
>> On 7/5/22 9:59 AM, Huang, Ying wrote:
>>> Hi, Aneesh,
>>>
>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>
>>>> The current kernel has the basic memory tiering support: Inactive
>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower
>>>> tier NUMA node to make room for new allocations on the higher tier
>>>> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
>>>> migrated (promoted) to a higher tier NUMA node to improve the
>>>> performance.
>>>>
>>>> In the current kernel, memory tiers are defined implicitly via a
>>>> demotion path relationship between NUMA nodes, which is created during
>>>> the kernel initialization and updated when a NUMA node is hot-added or
>>>> hot-removed.  The current implementation puts all nodes with CPU into
>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing
>>>> the per-node demotion targets based on the distances between nodes.
>>>>
>>>> This current memory tier kernel interface needs to be improved for
>>>> several important use cases:
>>>>
>>>> * The current tier initialization code always initializes
>>>>   each memory-only NUMA node into a lower tier.  But a memory-only
>>>>   NUMA node may have a high performance memory device (e.g. a DRAM
>>>>   device attached via CXL.mem or a DRAM-backed memory-only node on
>>>>   a virtual machine) and should be put into a higher tier.
>>>>
>>>> * The current tier hierarchy always puts CPU nodes into the top
>>>>   tier. But on a system with HBM (e.g. GPU memory) devices, these
>>>>   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
>>>>   with CPUs are better to be placed into the next lower tier.
>>>>
>>>> * Also because the current tier hierarchy always puts CPU nodes
>>>>   into the top tier, when a CPU is hot-added (or hot-removed) and
>>>>   triggers a memory node from CPU-less into a CPU node (or vice
>>>>   versa), the memory tier hierarchy gets changed, even though no
>>>>   memory node is added or removed.  This can make the tier
>>>>   hierarchy unstable and make it difficult to support tier-based
>>>>   memory accounting.
>>>>
>>>> * A higher tier node can only be demoted to selected nodes on the
>>>>   next lower tier as defined by the demotion path, not any other
>>>>   node from any lower tier.  This strict, hard-coded demotion order
>>>>   does not work in all use cases (e.g. some use cases may want to
>>>>   allow cross-socket demotion to another node in the same demotion
>>>>   tier as a fallback when the preferred demotion node is out of
>>>>   space), and has resulted in the feature request for an interface to
>>>>   override the system-wide, per-node demotion order from the
>>>>   userspace.  This demotion order is also inconsistent with the page
>>>>   allocation fallback order when all the nodes in a higher tier are
>>>>   out of space: The page allocation can fall back to any node from
>>>>   any lower tier, whereas the demotion order doesn't allow that.
>>>>
>>>> * There are no interfaces for the userspace to learn about the memory
>>>>   tier hierarchy in order to optimize its memory allocations.
>>>>
>>>> This patch series make the creation of memory tiers explicit under
>>>> the control of userspace or device driver.
>>>>
>>>> Memory Tier Initialization
>>>> ==========================
>>>>
>>>> By default, all memory nodes are assigned to the default tier with
>>>> tier ID value 200.
>>>>
>>>> A device driver can move up or down its memory nodes from the default
>>>> tier.  For example, PMEM can move down its memory nodes below the
>>>> default tier, whereas GPU can move up its memory nodes above the
>>>> default tier.
>>>>
>>>> The kernel initialization code makes the decision on which exact tier
>>>> a memory node should be assigned to based on the requests from the
>>>> device drivers as well as the memory device hardware information
>>>> provided by the firmware.
>>>>
>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
>>>>
>>>> Memory Allocation for Demotion
>>>> ==============================
>>>> This patch series keep the demotion target page allocation logic same.
>>>> The demotion page allocation pick the closest NUMA node in the
>>>> next lower tier to the current NUMA node allocating pages from.
>>>>
>>>> This will be later improved to use the same page allocation strategy
>>>> using fallback list.
>>>>
>>>> Sysfs Interface:
>>>> -------------
>>>> Listing current list of memory tiers details:
>>>>
>>>> :/sys/devices/system/memtier$ ls
>>>> default_tier max_tier  memtier1  power  uevent
>>>> :/sys/devices/system/memtier$ cat default_tier
>>>> memtier200
>>>> :/sys/devices/system/memtier$ cat max_tier 
>>>> 400
>>>> :/sys/devices/system/memtier$ 
>>>>
>>>> Per node memory tier details:
>>>>
>>>> For a cpu only NUMA node:
>>>>
>>>> :/sys/devices/system/node# cat node0/memtier 
>>>> :/sys/devices/system/node# echo 1 > node0/memtier 
>>>> :/sys/devices/system/node# cat node0/memtier 
>>>> :/sys/devices/system/node# 
>>>>
>>>> For a NUMA node with memory:
>>>> :/sys/devices/system/node# cat node1/memtier 
>>>> 1
>>>> :/sys/devices/system/node# ls ../memtier/
>>>> default_tier  max_tier  memtier1  power  uevent
>>>> :/sys/devices/system/node# echo 2 > node1/memtier 
>>>> :/sys/devices/system/node# 
>>>> :/sys/devices/system/node# ls ../memtier/
>>>> default_tier  max_tier  memtier1  memtier2  power  uevent
>>>> :/sys/devices/system/node# cat node1/memtier 
>>>> 2
>>>> :/sys/devices/system/node# 
>>>>
>>>> Removing a memory tier
>>>> :/sys/devices/system/node# cat node1/memtier 
>>>> 2
>>>> :/sys/devices/system/node# echo 1 > node1/memtier
>>>
>>> Thanks a lot for your patchset.
>>>
>>> Per my understanding, we haven't reach consensus on
>>>
>>> - how to create the default memory tiers in kernel (via abstract
>>>   distance provided by drivers?  Or use SLIT as the first step?)
>>>
>>> - how to override the default memory tiers from user space
>>>
>>> As in the following thread and email,
>>>
>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
>>>
>>> I think that we need to finalized on that firstly?
>>
>> I did list the proposal here 
>>
>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>>
>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated
>> if the user wants a different tier topology. 
>>
>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200
>>
>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100.
>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use
>> to control the tier assignment this can be a range of memory tiers. 
>>
>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update
>> the memory tier assignment based on device attributes.
> 
> Sorry for late reply.
> 
> As the first step, it may be better to skip the parts that we haven't
> reached consensus yet, for example, the user space interface to override
> the default memory tiers.  And we can use 0, 1, 2 as the default memory
> tier IDs.  We can refine/revise the in-kernel implementation, but we
> cannot change the user space ABI.
> 
Can you help list the use cases that will be broken by using tier IDs as outlined in this series?
One of the details mentioned earlier was the need to track top-tier memory usage in a
memcg and, IIUC, the patchset posted at https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com
can work with tier IDs too. Let me know if you think otherwise. So at this point
I am not sure which area we are still debating w.r.t. the userspace interface.
I will still keep the default tier IDs with a large range between them. That will allow
us to go back to a tierID-based demotion order if we can. That is much simpler than using tierID and rank
together. If we still want to go back to a rank-based approach, the tierID value won't have much
meaning anyway.
Any feedback on patches 1 - 5, so that I can ask Andrew to merge them?
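As a rough illustration of why a large gap between default tier IDs helps, the Python sketch below looks for an unused ID strictly between two existing tiers. The helper name and its behaviour are hypothetical and only model the numbering argument; they are not part of this series.

```python
# Hypothetical helper; models the ID-spacing argument only, not kernel behaviour.
def tier_between(existing_ids, lower, upper):
    """Return an unused tier ID strictly between lower and upper, or None.

    Prefers the ID closest to the midpoint so that gaps remain on both sides.
    """
    free = [t for t in range(lower + 1, upper) if t not in existing_ids]
    if not free:
        return None
    mid = (lower + upper) / 2
    return min(free, key=lambda t: abs(t - mid))


if __name__ == "__main__":
    # Sparse IDs such as 100 and 200 leave room for a new tier in between.
    print(tier_between({100, 200}, 100, 200))
    # Dense IDs such as 0, 1, 2 leave no room without renumbering.
    print(tier_between({0, 1, 2}, 1, 2))
```

With sparse IDs a new tier can be slotted in while existing IDs keep their values; with dense IDs the existing tiers would have to be renumbered.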
-aneesh
* Re: [PATCH v8 00/12] mm/demotion: Memory tiers and demotion
  2022-07-12  4:42       ` Aneesh Kumar K V
@ 2022-07-12  5:09         ` Aneesh Kumar K V
  2022-07-12 18:02           ` Yang Shi
  2022-07-12  6:59         ` Huang, Ying
  1 sibling, 1 reply; 42+ messages in thread
From: Aneesh Kumar K V @ 2022-07-12  5:09 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss
On 7/12/22 10:12 AM, Aneesh Kumar K V wrote:
> On 7/12/22 6:46 AM, Huang, Ying wrote:
>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>
>>> On 7/5/22 9:59 AM, Huang, Ying wrote:
>>>> Hi, Aneesh,
>>>>
>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>>
>>>>> The current kernel has the basic memory tiering support: Inactive
>>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower
>>>>> tier NUMA node to make room for new allocations on the higher tier
>>>>> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
>>>>> migrated (promoted) to a higher tier NUMA node to improve the
>>>>> performance.
>>>>>
>>>>> In the current kernel, memory tiers are defined implicitly via a
>>>>> demotion path relationship between NUMA nodes, which is created during
>>>>> the kernel initialization and updated when a NUMA node is hot-added or
>>>>> hot-removed.  The current implementation puts all nodes with CPU into
>>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing
>>>>> the per-node demotion targets based on the distances between nodes.
>>>>>
>>>>> This current memory tier kernel interface needs to be improved for
>>>>> several important use cases:
>>>>>
>>>>> * The current tier initialization code always initializes
>>>>>   each memory-only NUMA node into a lower tier.  But a memory-only
>>>>>   NUMA node may have a high performance memory device (e.g. a DRAM
>>>>>   device attached via CXL.mem or a DRAM-backed memory-only node on
>>>>>   a virtual machine) and should be put into a higher tier.
>>>>>
>>>>> * The current tier hierarchy always puts CPU nodes into the top
>>>>>   tier. But on a system with HBM (e.g. GPU memory) devices, these
>>>>>   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
>>>>>   with CPUs are better to be placed into the next lower tier.
>>>>>
>>>>> * Also because the current tier hierarchy always puts CPU nodes
>>>>>   into the top tier, when a CPU is hot-added (or hot-removed) and
>>>>>   triggers a memory node from CPU-less into a CPU node (or vice
>>>>>   versa), the memory tier hierarchy gets changed, even though no
>>>>>   memory node is added or removed.  This can make the tier
>>>>>   hierarchy unstable and make it difficult to support tier-based
>>>>>   memory accounting.
>>>>>
>>>>> * A higher tier node can only be demoted to selected nodes on the
>>>>>   next lower tier as defined by the demotion path, not any other
>>>>>   node from any lower tier.  This strict, hard-coded demotion order
>>>>>   does not work in all use cases (e.g. some use cases may want to
>>>>>   allow cross-socket demotion to another node in the same demotion
>>>>>   tier as a fallback when the preferred demotion node is out of
>>>>>   space), and has resulted in the feature request for an interface to
>>>>>   override the system-wide, per-node demotion order from the
>>>>>   userspace.  This demotion order is also inconsistent with the page
>>>>>   allocation fallback order when all the nodes in a higher tier are
>>>>>   out of space: The page allocation can fall back to any node from
>>>>>   any lower tier, whereas the demotion order doesn't allow that.
>>>>>
>>>>> * There are no interfaces for the userspace to learn about the memory
>>>>>   tier hierarchy in order to optimize its memory allocations.
>>>>>
>>>>> This patch series make the creation of memory tiers explicit under
>>>>> the control of userspace or device driver.
>>>>>
>>>>> Memory Tier Initialization
>>>>> ==========================
>>>>>
>>>>> By default, all memory nodes are assigned to the default tier with
>>>>> tier ID value 200.
>>>>>
>>>>> A device driver can move up or down its memory nodes from the default
>>>>> tier.  For example, PMEM can move down its memory nodes below the
>>>>> default tier, whereas GPU can move up its memory nodes above the
>>>>> default tier.
>>>>>
>>>>> The kernel initialization code makes the decision on which exact tier
>>>>> a memory node should be assigned to based on the requests from the
>>>>> device drivers as well as the memory device hardware information
>>>>> provided by the firmware.
>>>>>
>>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
>>>>>
>>>>> Memory Allocation for Demotion
>>>>> ==============================
>>>>> This patch series keep the demotion target page allocation logic same.
>>>>> The demotion page allocation pick the closest NUMA node in the
>>>>> next lower tier to the current NUMA node allocating pages from.
>>>>>
>>>>> This will be later improved to use the same page allocation strategy
>>>>> using fallback list.
>>>>>
>>>>> Sysfs Interface:
>>>>> -------------
>>>>> Listing current list of memory tiers details:
>>>>>
>>>>> :/sys/devices/system/memtier$ ls
>>>>> default_tier max_tier  memtier1  power  uevent
>>>>> :/sys/devices/system/memtier$ cat default_tier
>>>>> memtier200
>>>>> :/sys/devices/system/memtier$ cat max_tier 
>>>>> 400
>>>>> :/sys/devices/system/memtier$ 
>>>>>
>>>>> Per node memory tier details:
>>>>>
>>>>> For a cpu only NUMA node:
>>>>>
>>>>> :/sys/devices/system/node# cat node0/memtier 
>>>>> :/sys/devices/system/node# echo 1 > node0/memtier 
>>>>> :/sys/devices/system/node# cat node0/memtier 
>>>>> :/sys/devices/system/node# 
>>>>>
>>>>> For a NUMA node with memory:
>>>>> :/sys/devices/system/node# cat node1/memtier 
>>>>> 1
>>>>> :/sys/devices/system/node# ls ../memtier/
>>>>> default_tier  max_tier  memtier1  power  uevent
>>>>> :/sys/devices/system/node# echo 2 > node1/memtier 
>>>>> :/sys/devices/system/node# 
>>>>> :/sys/devices/system/node# ls ../memtier/
>>>>> default_tier  max_tier  memtier1  memtier2  power  uevent
>>>>> :/sys/devices/system/node# cat node1/memtier 
>>>>> 2
>>>>> :/sys/devices/system/node# 
>>>>>
>>>>> Removing a memory tier
>>>>> :/sys/devices/system/node# cat node1/memtier 
>>>>> 2
>>>>> :/sys/devices/system/node# echo 1 > node1/memtier
>>>>
>>>> Thanks a lot for your patchset.
>>>>
>>>> Per my understanding, we haven't reach consensus on
>>>>
>>>> - how to create the default memory tiers in kernel (via abstract
>>>>   distance provided by drivers?  Or use SLIT as the first step?)
>>>>
>>>> - how to override the default memory tiers from user space
>>>>
>>>> As in the following thread and email,
>>>>
>>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
>>>>
>>>> I think that we need to finalized on that firstly?
>>>
>>> I did list the proposal here 
>>>
>>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>>>
>>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated
>>> if the user wants a different tier topology. 
>>>
>>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200
>>>
>>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100.
>>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use
>>> to control the tier assignment this can be a range of memory tiers. 
>>>
>>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update
>>> the memory tier assignment based on device attributes.
>>
>> Sorry for late reply.
>>
>> As the first step, it may be better to skip the parts that we haven't
>> reached consensus yet, for example, the user space interface to override
>> the default memory tiers.  And we can use 0, 1, 2 as the default memory
>> tier IDs.  We can refine/revise the in-kernel implementation, but we
>> cannot change the user space ABI.
>>
> 
> Can you help list the use case that will be broken by using tierID as outlined in this series?
> One of the details that were mentioned earlier was the need to track top-tier memory usage in a
> memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com
> can work with tier IDs too. Let me know if you think otherwise. So at this point
> I am not sure which area we are still debating w.r.t the userspace interface.
> 
> I will still keep the default tier IDs with a large range between them. That will allow
> us to go back to tierID based demotion order if we can. That is much simpler than using tierID and rank
> together. If we still want to go back to rank based approach the tierID value won't have much
> meaning anyway.
> 
> Any feedback on patches 1 - 5, so that I can request Andrew to merge them? 
> 
Looking at this again, I guess we just need to drop patch 7,
"mm/demotion: Add per node memory tier attribute to sysfs"?

We do agree to use the device model to expose memory tiers to userspace, so patch 6 can still be included.
It also exposes max_tier, default_tier, and the node list of a memory tier. All of these are useful
and agreed upon. Hence patch 6 can be merged?

Patches 8 - 10 were done based on requests from others and are independent of how memory tiers
are exposed/created from userspace. Hence they can be merged?

If you agree, I can rebase the series, moving patches 7, 11 and 12 to the end of the series so
that we can skip merging them depending on what we conclude w.r.t. the usage of rank.
-aneesh
* Re: [PATCH v8 00/12] mm/demotion: Memory tiers and demotion
  2022-07-12  4:42       ` Aneesh Kumar K V
  2022-07-12  5:09         ` Aneesh Kumar K V
@ 2022-07-12  6:59         ` Huang, Ying
  2022-07-12  7:31           ` Aneesh Kumar K V
  1 sibling, 1 reply; 42+ messages in thread
From: Huang, Ying @ 2022-07-12  6:59 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss
Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> On 7/12/22 6:46 AM, Huang, Ying wrote:
>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>> 
>>> On 7/5/22 9:59 AM, Huang, Ying wrote:
>>>> Hi, Aneesh,
>>>>
>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>>
>>>>> The current kernel has the basic memory tiering support: Inactive
>>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower
>>>>> tier NUMA node to make room for new allocations on the higher tier
>>>>> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
>>>>> migrated (promoted) to a higher tier NUMA node to improve the
>>>>> performance.
>>>>>
>>>>> In the current kernel, memory tiers are defined implicitly via a
>>>>> demotion path relationship between NUMA nodes, which is created during
>>>>> the kernel initialization and updated when a NUMA node is hot-added or
>>>>> hot-removed.  The current implementation puts all nodes with CPU into
>>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing
>>>>> the per-node demotion targets based on the distances between nodes.
>>>>>
>>>>> This current memory tier kernel interface needs to be improved for
>>>>> several important use cases:
>>>>>
>>>>> * The current tier initialization code always initializes
>>>>>   each memory-only NUMA node into a lower tier.  But a memory-only
>>>>>   NUMA node may have a high performance memory device (e.g. a DRAM
>>>>>   device attached via CXL.mem or a DRAM-backed memory-only node on
>>>>>   a virtual machine) and should be put into a higher tier.
>>>>>
>>>>> * The current tier hierarchy always puts CPU nodes into the top
>>>>>   tier. But on a system with HBM (e.g. GPU memory) devices, these
>>>>>   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
>>>>>   with CPUs are better to be placed into the next lower tier.
>>>>>
>>>>> * Also because the current tier hierarchy always puts CPU nodes
>>>>>   into the top tier, when a CPU is hot-added (or hot-removed) and
>>>>>   triggers a memory node from CPU-less into a CPU node (or vice
>>>>>   versa), the memory tier hierarchy gets changed, even though no
>>>>>   memory node is added or removed.  This can make the tier
>>>>>   hierarchy unstable and make it difficult to support tier-based
>>>>>   memory accounting.
>>>>>
>>>>> * A higher tier node can only be demoted to selected nodes on the
>>>>>   next lower tier as defined by the demotion path, not any other
>>>>>   node from any lower tier.  This strict, hard-coded demotion order
>>>>>   does not work in all use cases (e.g. some use cases may want to
>>>>>   allow cross-socket demotion to another node in the same demotion
>>>>>   tier as a fallback when the preferred demotion node is out of
>>>>>   space), and has resulted in the feature request for an interface to
>>>>>   override the system-wide, per-node demotion order from the
>>>>>   userspace.  This demotion order is also inconsistent with the page
>>>>>   allocation fallback order when all the nodes in a higher tier are
>>>>>   out of space: The page allocation can fall back to any node from
>>>>>   any lower tier, whereas the demotion order doesn't allow that.
>>>>>
>>>>> * There are no interfaces for the userspace to learn about the memory
>>>>>   tier hierarchy in order to optimize its memory allocations.
>>>>>
>>>>> This patch series make the creation of memory tiers explicit under
>>>>> the control of userspace or device driver.
>>>>>
>>>>> Memory Tier Initialization
>>>>> ==========================
>>>>>
>>>>> By default, all memory nodes are assigned to the default tier with
>>>>> tier ID value 200.
>>>>>
>>>>> A device driver can move up or down its memory nodes from the default
>>>>> tier.  For example, PMEM can move down its memory nodes below the
>>>>> default tier, whereas GPU can move up its memory nodes above the
>>>>> default tier.
>>>>>
>>>>> The kernel initialization code makes the decision on which exact tier
>>>>> a memory node should be assigned to based on the requests from the
>>>>> device drivers as well as the memory device hardware information
>>>>> provided by the firmware.
>>>>>
>>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
>>>>>
>>>>> Memory Allocation for Demotion
>>>>> ==============================
>>>>> This patch series keep the demotion target page allocation logic same.
>>>>> The demotion page allocation pick the closest NUMA node in the
>>>>> next lower tier to the current NUMA node allocating pages from.
>>>>>
>>>>> This will be later improved to use the same page allocation strategy
>>>>> using fallback list.
>>>>>
>>>>> Sysfs Interface:
>>>>> -------------
>>>>> Listing current list of memory tiers details:
>>>>>
>>>>> :/sys/devices/system/memtier$ ls
>>>>> default_tier max_tier  memtier1  power  uevent
>>>>> :/sys/devices/system/memtier$ cat default_tier
>>>>> memtier200
>>>>> :/sys/devices/system/memtier$ cat max_tier 
>>>>> 400
>>>>> :/sys/devices/system/memtier$ 
>>>>>
>>>>> Per node memory tier details:
>>>>>
>>>>> For a cpu only NUMA node:
>>>>>
>>>>> :/sys/devices/system/node# cat node0/memtier 
>>>>> :/sys/devices/system/node# echo 1 > node0/memtier 
>>>>> :/sys/devices/system/node# cat node0/memtier 
>>>>> :/sys/devices/system/node# 
>>>>>
>>>>> For a NUMA node with memory:
>>>>> :/sys/devices/system/node# cat node1/memtier 
>>>>> 1
>>>>> :/sys/devices/system/node# ls ../memtier/
>>>>> default_tier  max_tier  memtier1  power  uevent
>>>>> :/sys/devices/system/node# echo 2 > node1/memtier 
>>>>> :/sys/devices/system/node# 
>>>>> :/sys/devices/system/node# ls ../memtier/
>>>>> default_tier  max_tier  memtier1  memtier2  power  uevent
>>>>> :/sys/devices/system/node# cat node1/memtier 
>>>>> 2
>>>>> :/sys/devices/system/node# 
>>>>>
>>>>> Removing a memory tier
>>>>> :/sys/devices/system/node# cat node1/memtier 
>>>>> 2
>>>>> :/sys/devices/system/node# echo 1 > node1/memtier
>>>>
>>>> Thanks a lot for your patchset.
>>>>
>>>> Per my understanding, we haven't reach consensus on
>>>>
>>>> - how to create the default memory tiers in kernel (via abstract
>>>>   distance provided by drivers?  Or use SLIT as the first step?)
>>>>
>>>> - how to override the default memory tiers from user space
>>>>
>>>> As in the following thread and email,
>>>>
>>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
>>>>
>>>> I think that we need to finalized on that firstly?
>>>
>>> I did list the proposal here 
>>>
>>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>>>
>>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated
>>> if the user wants a different tier topology. 
>>>
>>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200
>>>
>>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100.
>>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use
>>> to control the tier assignment this can be a range of memory tiers. 
>>>
>>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update
>>> the memory tier assignment based on device attributes.
>> 
>> Sorry for late reply.
>> 
>> As the first step, it may be better to skip the parts that we haven't
>> reached consensus yet, for example, the user space interface to override
>> the default memory tiers.  And we can use 0, 1, 2 as the default memory
>> tier IDs.  We can refine/revise the in-kernel implementation, but we
>> cannot change the user space ABI.
>> 
>
> Can you help list the use case that will be broken by using tierID as outlined in this series?
> One of the details that were mentioned earlier was the need to track top-tier memory usage in a
> memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com
> can work with tier IDs too. Let me know if you think otherwise. So at this point
> I am not sure which area we are still debating w.r.t the userspace interface.
In

https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/

per my understanding, Johannes suggested overriding the kernel default
memory tiers with an "abstract distance" via the drivers implementing memory
devices.  As you said in another email, that is related to [7/12] of the
series.  And we can table it for the future.

And per my understanding, he also suggested making memory tier IDs
dynamic.  For example, after the "abstract distance" of a driver is
overridden by users, the total number of memory tiers may change,
and the memory tier ID of some nodes may change too.  This will make
memory tier IDs easier to understand, but less stable.  For example,
it will make it harder to specify the per-memory-tier memory partition
for a cgroup.
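
To make the stability concern concrete, here is a toy model in Python. It is only a sketch and not the kernel's algorithm or anything proposed in the series. It assumes one tier per distinct "abstract distance" value and dense tier IDs where 0 is the smallest distance; the function name, the node numbers and the distance values are all made up for illustration.

```python
from typing import Dict


def assign_dynamic_tier_ids(node_distance: Dict[int, int]) -> Dict[int, int]:
    """Map each node to a dense tier ID, one tier per distinct distance.

    Toy model (assumption): tier 0 holds the nodes with the smallest
    abstract distance, tier 1 the next distinct value, and so on.
    """
    ordered = sorted(set(node_distance.values()))
    tier_of_distance = {dist: tier for tier, dist in enumerate(ordered)}
    return {node: tier_of_distance[dist] for node, dist in node_distance.items()}


# Hypothetical system: node 0 is DRAM, node 1 and node 2 belong to two
# different memory device drivers with larger abstract distances.
before = assign_dynamic_tier_ids({0: 100, 1: 200, 2: 300})
# before == {0: 0, 1: 1, 2: 2}, i.e. three tiers

# A user overrides the abstract distance of node 1's driver to match DRAM.
after = assign_dynamic_tier_ids({0: 100, 1: 100, 2: 300})
# after == {0: 0, 1: 0, 2: 1}, i.e. two tiers
```

In this model node 2 was never touched, yet its tier ID changes from 2 to 1 and the number of tiers drops from three to two. Any per-tier cgroup setting keyed on the old ID would now refer to a different set of nodes.
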
> I will still keep the default tier IDs with a large range between them. That will allow
> us to go back to tierID based demotion order if we can. That is much simpler than using tierID and rank
> together. If we still want to go back to rank based approach the tierID value won't have much
> meaning anyway.
I agree to get rid of "rank".
> Any feedback on patches 1 - 5, so that I can request Andrew to merge
> them?
I hope that we can discuss this with Johannes first.  But it appears that
he has been busy recently.
Best Regards,
Huang, Ying
* Re: [PATCH v8 00/12] mm/demotion: Memory tiers and demotion
  2022-07-12  6:59         ` Huang, Ying
@ 2022-07-12  7:31           ` Aneesh Kumar K V
  2022-07-12  8:48             ` Huang, Ying
  0 siblings, 1 reply; 42+ messages in thread
From: Aneesh Kumar K V @ 2022-07-12  7:31 UTC (permalink / raw)
  To: Huang, Ying, Johannes Weiner
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	jvgediya.oss
On 7/12/22 12:29 PM, Huang, Ying wrote:
> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> 
>> On 7/12/22 6:46 AM, Huang, Ying wrote:
>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>>
>>>> On 7/5/22 9:59 AM, Huang, Ying wrote:
>>>>> Hi, Aneesh,
>>>>>
>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>>>
>>>>>> The current kernel has the basic memory tiering support: Inactive
>>>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower
>>>>>> tier NUMA node to make room for new allocations on the higher tier
>>>>>> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
>>>>>> migrated (promoted) to a higher tier NUMA node to improve the
>>>>>> performance.
>>>>>>
>>>>>> In the current kernel, memory tiers are defined implicitly via a
>>>>>> demotion path relationship between NUMA nodes, which is created during
>>>>>> the kernel initialization and updated when a NUMA node is hot-added or
>>>>>> hot-removed.  The current implementation puts all nodes with CPU into
>>>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing
>>>>>> the per-node demotion targets based on the distances between nodes.
>>>>>>
>>>>>> This current memory tier kernel interface needs to be improved for
>>>>>> several important use cases:
>>>>>>
>>>>>> * The current tier initialization code always initializes
>>>>>>   each memory-only NUMA node into a lower tier.  But a memory-only
>>>>>>   NUMA node may have a high performance memory device (e.g. a DRAM
>>>>>>   device attached via CXL.mem or a DRAM-backed memory-only node on
>>>>>>   a virtual machine) and should be put into a higher tier.
>>>>>>
>>>>>> * The current tier hierarchy always puts CPU nodes into the top
>>>>>>   tier. But on a system with HBM (e.g. GPU memory) devices, these
>>>>>>   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
>>>>>>   with CPUs are better to be placed into the next lower tier.
>>>>>>
>>>>>> * Also because the current tier hierarchy always puts CPU nodes
>>>>>>   into the top tier, when a CPU is hot-added (or hot-removed) and
>>>>>>   triggers a memory node from CPU-less into a CPU node (or vice
>>>>>>   versa), the memory tier hierarchy gets changed, even though no
>>>>>>   memory node is added or removed.  This can make the tier
>>>>>>   hierarchy unstable and make it difficult to support tier-based
>>>>>>   memory accounting.
>>>>>>
>>>>>> * A higher tier node can only be demoted to selected nodes on the
>>>>>>   next lower tier as defined by the demotion path, not any other
>>>>>>   node from any lower tier.  This strict, hard-coded demotion order
>>>>>>   does not work in all use cases (e.g. some use cases may want to
>>>>>>   allow cross-socket demotion to another node in the same demotion
>>>>>>   tier as a fallback when the preferred demotion node is out of
>>>>>>   space), and has resulted in the feature request for an interface to
>>>>>>   override the system-wide, per-node demotion order from the
>>>>>>   userspace.  This demotion order is also inconsistent with the page
>>>>>>   allocation fallback order when all the nodes in a higher tier are
>>>>>>   out of space: The page allocation can fall back to any node from
>>>>>>   any lower tier, whereas the demotion order doesn't allow that.
>>>>>>
>>>>>> * There are no interfaces for the userspace to learn about the memory
>>>>>>   tier hierarchy in order to optimize its memory allocations.
>>>>>>
>>>>>> This patch series make the creation of memory tiers explicit under
>>>>>> the control of userspace or device driver.
>>>>>>
>>>>>> Memory Tier Initialization
>>>>>> ==========================
>>>>>>
>>>>>> By default, all memory nodes are assigned to the default tier with
>>>>>> tier ID value 200.
>>>>>>
>>>>>> A device driver can move up or down its memory nodes from the default
>>>>>> tier.  For example, PMEM can move down its memory nodes below the
>>>>>> default tier, whereas GPU can move up its memory nodes above the
>>>>>> default tier.
>>>>>>
>>>>>> The kernel initialization code makes the decision on which exact tier
>>>>>> a memory node should be assigned to based on the requests from the
>>>>>> device drivers as well as the memory device hardware information
>>>>>> provided by the firmware.
>>>>>>
>>>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
>>>>>>
>>>>>> Memory Allocation for Demotion
>>>>>> ==============================
>>>>>> This patch series keep the demotion target page allocation logic same.
>>>>>> The demotion page allocation pick the closest NUMA node in the
>>>>>> next lower tier to the current NUMA node allocating pages from.
>>>>>>
>>>>>> This will be later improved to use the same page allocation strategy
>>>>>> using fallback list.
>>>>>>
>>>>>> Sysfs Interface:
>>>>>> -------------
>>>>>> Listing current list of memory tiers details:
>>>>>>
>>>>>> :/sys/devices/system/memtier$ ls
>>>>>> default_tier max_tier  memtier1  power  uevent
>>>>>> :/sys/devices/system/memtier$ cat default_tier
>>>>>> memtier200
>>>>>> :/sys/devices/system/memtier$ cat max_tier 
>>>>>> 400
>>>>>> :/sys/devices/system/memtier$ 
>>>>>>
>>>>>> Per node memory tier details:
>>>>>>
>>>>>> For a cpu only NUMA node:
>>>>>>
>>>>>> :/sys/devices/system/node# cat node0/memtier 
>>>>>> :/sys/devices/system/node# echo 1 > node0/memtier 
>>>>>> :/sys/devices/system/node# cat node0/memtier 
>>>>>> :/sys/devices/system/node# 
>>>>>>
>>>>>> For a NUMA node with memory:
>>>>>> :/sys/devices/system/node# cat node1/memtier 
>>>>>> 1
>>>>>> :/sys/devices/system/node# ls ../memtier/
>>>>>> default_tier  max_tier  memtier1  power  uevent
>>>>>> :/sys/devices/system/node# echo 2 > node1/memtier 
>>>>>> :/sys/devices/system/node# 
>>>>>> :/sys/devices/system/node# ls ../memtier/
>>>>>> default_tier  max_tier  memtier1  memtier2  power  uevent
>>>>>> :/sys/devices/system/node# cat node1/memtier 
>>>>>> 2
>>>>>> :/sys/devices/system/node# 
>>>>>>
>>>>>> Removing a memory tier
>>>>>> :/sys/devices/system/node# cat node1/memtier 
>>>>>> 2
>>>>>> :/sys/devices/system/node# echo 1 > node1/memtier
>>>>>
>>>>> Thanks a lot for your patchset.
>>>>>
>>>>> Per my understanding, we haven't reach consensus on
>>>>>
>>>>> - how to create the default memory tiers in kernel (via abstract
>>>>>   distance provided by drivers?  Or use SLIT as the first step?)
>>>>>
>>>>> - how to override the default memory tiers from user space
>>>>>
>>>>> As in the following thread and email,
>>>>>
>>>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
>>>>>
>>>>> I think that we need to finalized on that firstly?
>>>>
>>>> I did list the proposal here 
>>>>
>>>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>>>>
>>>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated
>>>> if the user wants a different tier topology. 
>>>>
>>>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200
>>>>
>>>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100.
>>>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use
>>>> to control the tier assignment this can be a range of memory tiers. 
>>>>
>>>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update
>>>> the memory tier assignment based on device attributes.
>>>
>>> Sorry for late reply.
>>>
>>> As the first step, it may be better to skip the parts that we haven't
>>> reached consensus yet, for example, the user space interface to override
>>> the default memory tiers.  And we can use 0, 1, 2 as the default memory
>>> tier IDs.  We can refine/revise the in-kernel implementation, but we
>>> cannot change the user space ABI.
>>>
>>
>> Can you help list the use case that will be broken by using tierID as outlined in this series?
>> One of the details that were mentioned earlier was the need to track top-tier memory usage in a
>> memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com
>> can work with tier IDs too. Let me know if you think otherwise. So at this point
>> I am not sure which area we are still debating w.r.t the userspace interface.
> 
> In
> 
> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
> 
> per my understanding, Johannes suggested to override the kernel default
> memory tiers with "abstract distance" via drivers implementing memory
> devices.  As you said in another email, that is related to [7/12] of the
> series.  And we can table it for future.
> 
> And per my understanding, he also suggested to make memory tier IDs
> dynamic.  For example, after the "abstract distance" of a driver is
> overridden by users, the total number of memory tiers may be changed,
> and the memory tier ID of some nodes may be changed too.  This will make
> memory tier ID easier to be understood, but more unstable.  For example,
> this will make it harder to specify the per-memory-tier memory partition
> for a cgroup.
> 
With all the approaches we have discussed so far, the memory tier of a NUMA node can be changed,
i.e., pgdat->memtier can change at any time. The per-memcg top-tier memory usage tracking patches
posted here https://lore.kernel.org/linux-mm/cefeb63173fa0fac7543315a2abbd4b5a1b25af8.1655242024.git.tim.c.chen@linux.intel.com/
don't consider node movement from one memory tier to another. If we need
a stable pgdat->memtier, we will have to prevent a node's memory tier reassignment
while we have pages from that memory tier charged to a cgroup. This patchset should not
prevent such a restriction.

There are 3 knobs provided in this patchset:

1. A kernel parameter to change the default memory tier. Changing this applies only to new memory that is
   hotplugged. The existing node to memtier mapping remains the same.
2. A module parameter to change the dax kmem memory tier. Same as above.
3. The ability to change the node to memory tier mapping via /sys/devices/system/node/nodeN/memtier. We
   should be able to add any restrictions w.r.t. cgroups there.

Hence my observation is that the requirement for a stable node to memory tier mapping should not
prevent the merging of this patch series.
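
As a rough illustration of knob 3, the per-node attribute could be driven from userspace as sketched below in Python. This assumes the nodeN/memtier file behaves as in the cover letter's shell session: it reads back a tier ID for a node with memory, reads back empty for a CPU-only node, and accepts a tier ID on write. The helper names are hypothetical, and the root directory is a parameter so the code can be tried against a scratch directory rather than a real sysfs.

```python
from pathlib import Path
from typing import Optional

# Location of the per-node attribute as proposed in this series (patch 7).
NODE_ROOT = Path("/sys/devices/system/node")


def memtier_path(node: int, root: Path = NODE_ROOT) -> Path:
    """Build the path of nodeN/memtier under the given root."""
    return Path(root) / f"node{node}" / "memtier"


def read_node_memtier(node: int, root: Path = NODE_ROOT) -> Optional[int]:
    """Return the tier ID of a node, or None when the file reads back
    empty (the cover letter shows this for a CPU-only node)."""
    text = memtier_path(node, root).read_text().strip()
    return int(text) if text else None


def write_node_memtier(node: int, tier: int, root: Path = NODE_ROOT) -> None:
    """Ask for a node to be moved to another tier.

    On a real system this needs root privileges, and the kernel may
    reject the write, in which case an OSError is raised.
    """
    memtier_path(node, root).write_text(f"{tier}\n")
```

Because every reassignment goes through this single write path, a later restriction (for example, refusing the write while pages from the tier are charged to a cgroup) would surface to such a caller simply as a failed write.
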
>> I will still keep the default tier IDs with a large range between them. That will allow
>> us to go back to tierID based demotion order if we can. That is much simpler than using tierID and rank
>> together. If we still want to go back to rank based approach the tierID value won't have much
>> meaning anyway.
> 
> I agree to get rid of "rank".
> 
>> Any feedback on patches 1 - 5, so that I can request Andrew to merge
>> them?
> 
> I hope that we can discuss with Johannes firstly.  But it appears that
> he is busy recently.
> 
-aneesh
* Re: [PATCH v8 00/12] mm/demotion: Memory tiers and demotion
  2022-07-12  7:31           ` Aneesh Kumar K V
@ 2022-07-12  8:48             ` Huang, Ying
  2022-07-12  9:17               ` Aneesh Kumar K V
  0 siblings, 1 reply; 42+ messages in thread
From: Huang, Ying @ 2022-07-12  8:48 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: Johannes Weiner, linux-mm, akpm, Wei Xu, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, jvgediya.oss
Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> On 7/12/22 12:29 PM, Huang, Ying wrote:
>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>> 
>>> On 7/12/22 6:46 AM, Huang, Ying wrote:
>>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>>>
>>>>> On 7/5/22 9:59 AM, Huang, Ying wrote:
>>>>>> Hi, Aneesh,
>>>>>>
>>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>>>>
>>>>>>> The current kernel has the basic memory tiering support: Inactive
>>>>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower
>>>>>>> tier NUMA node to make room for new allocations on the higher tier
>>>>>>> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
>>>>>>> migrated (promoted) to a higher tier NUMA node to improve the
>>>>>>> performance.
>>>>>>>
>>>>>>> In the current kernel, memory tiers are defined implicitly via a
>>>>>>> demotion path relationship between NUMA nodes, which is created during
>>>>>>> the kernel initialization and updated when a NUMA node is hot-added or
>>>>>>> hot-removed.  The current implementation puts all nodes with CPU into
>>>>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing
>>>>>>> the per-node demotion targets based on the distances between nodes.
>>>>>>>
>>>>>>> This current memory tier kernel interface needs to be improved for
>>>>>>> several important use cases:
>>>>>>>
>>>>>>> * The current tier initialization code always initializes
>>>>>>>   each memory-only NUMA node into a lower tier.  But a memory-only
>>>>>>>   NUMA node may have a high performance memory device (e.g. a DRAM
>>>>>>>   device attached via CXL.mem or a DRAM-backed memory-only node on
>>>>>>>   a virtual machine) and should be put into a higher tier.
>>>>>>>
>>>>>>> * The current tier hierarchy always puts CPU nodes into the top
>>>>>>>   tier. But on a system with HBM (e.g. GPU memory) devices, these
>>>>>>>   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
>>>>>>>   with CPUs are better to be placed into the next lower tier.
>>>>>>>
>>>>>>> * Also because the current tier hierarchy always puts CPU nodes
>>>>>>>   into the top tier, when a CPU is hot-added (or hot-removed) and
>>>>>>>   triggers a memory node from CPU-less into a CPU node (or vice
>>>>>>>   versa), the memory tier hierarchy gets changed, even though no
>>>>>>>   memory node is added or removed.  This can make the tier
>>>>>>>   hierarchy unstable and make it difficult to support tier-based
>>>>>>>   memory accounting.
>>>>>>>
>>>>>>> * A higher tier node can only be demoted to selected nodes on the
>>>>>>>   next lower tier as defined by the demotion path, not any other
>>>>>>>   node from any lower tier.  This strict, hard-coded demotion order
>>>>>>>   does not work in all use cases (e.g. some use cases may want to
>>>>>>>   allow cross-socket demotion to another node in the same demotion
>>>>>>>   tier as a fallback when the preferred demotion node is out of
>>>>>>>   space), and has resulted in the feature request for an interface to
>>>>>>>   override the system-wide, per-node demotion order from the
>>>>>>>   userspace.  This demotion order is also inconsistent with the page
>>>>>>>   allocation fallback order when all the nodes in a higher tier are
>>>>>>>   out of space: The page allocation can fall back to any node from
>>>>>>>   any lower tier, whereas the demotion order doesn't allow that.
>>>>>>>
>>>>>>> * There are no interfaces for the userspace to learn about the memory
>>>>>>>   tier hierarchy in order to optimize its memory allocations.
>>>>>>>
>>>>>>> This patch series make the creation of memory tiers explicit under
>>>>>>> the control of userspace or device driver.
>>>>>>>
>>>>>>> Memory Tier Initialization
>>>>>>> ==========================
>>>>>>>
>>>>>>> By default, all memory nodes are assigned to the default tier with
>>>>>>> tier ID value 200.
>>>>>>>
>>>>>>> A device driver can move up or down its memory nodes from the default
>>>>>>> tier.  For example, PMEM can move down its memory nodes below the
>>>>>>> default tier, whereas GPU can move up its memory nodes above the
>>>>>>> default tier.
>>>>>>>
>>>>>>> The kernel initialization code makes the decision on which exact tier
>>>>>>> a memory node should be assigned to based on the requests from the
>>>>>>> device drivers as well as the memory device hardware information
>>>>>>> provided by the firmware.
>>>>>>>
>>>>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
>>>>>>>
>>>>>>> Memory Allocation for Demotion
>>>>>>> ==============================
>>>>>>> This patch series keep the demotion target page allocation logic same.
>>>>>>> The demotion page allocation pick the closest NUMA node in the
>>>>>>> next lower tier to the current NUMA node allocating pages from.
>>>>>>>
>>>>>>> This will be later improved to use the same page allocation strategy
>>>>>>> using fallback list.
>>>>>>>
>>>>>>> Sysfs Interface:
>>>>>>> -------------
>>>>>>> Listing current list of memory tiers details:
>>>>>>>
>>>>>>> :/sys/devices/system/memtier$ ls
>>>>>>> default_tier max_tier  memtier1  power  uevent
>>>>>>> :/sys/devices/system/memtier$ cat default_tier
>>>>>>> memtier200
>>>>>>> :/sys/devices/system/memtier$ cat max_tier 
>>>>>>> 400
>>>>>>> :/sys/devices/system/memtier$ 
>>>>>>>
>>>>>>> Per node memory tier details:
>>>>>>>
>>>>>>> For a cpu only NUMA node:
>>>>>>>
>>>>>>> :/sys/devices/system/node# cat node0/memtier 
>>>>>>> :/sys/devices/system/node# echo 1 > node0/memtier 
>>>>>>> :/sys/devices/system/node# cat node0/memtier 
>>>>>>> :/sys/devices/system/node# 
>>>>>>>
>>>>>>> For a NUMA node with memory:
>>>>>>> :/sys/devices/system/node# cat node1/memtier 
>>>>>>> 1
>>>>>>> :/sys/devices/system/node# ls ../memtier/
>>>>>>> default_tier  max_tier  memtier1  power  uevent
>>>>>>> :/sys/devices/system/node# echo 2 > node1/memtier 
>>>>>>> :/sys/devices/system/node# 
>>>>>>> :/sys/devices/system/node# ls ../memtier/
>>>>>>> default_tier  max_tier  memtier1  memtier2  power  uevent
>>>>>>> :/sys/devices/system/node# cat node1/memtier 
>>>>>>> 2
>>>>>>> :/sys/devices/system/node# 
>>>>>>>
>>>>>>> Removing a memory tier
>>>>>>> :/sys/devices/system/node# cat node1/memtier 
>>>>>>> 2
>>>>>>> :/sys/devices/system/node# echo 1 > node1/memtier
>>>>>>
>>>>>> Thanks a lot for your patchset.
>>>>>>
>>>>>> Per my understanding, we haven't reach consensus on
>>>>>>
>>>>>> - how to create the default memory tiers in kernel (via abstract
>>>>>>   distance provided by drivers?  Or use SLIT as the first step?)
>>>>>>
>>>>>> - how to override the default memory tiers from user space
>>>>>>
>>>>>> As in the following thread and email,
>>>>>>
>>>>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
>>>>>>
>>>>>> I think that we need to finalized on that firstly?
>>>>>
>>>>> I did list the proposal here 
>>>>>
>>>>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>>>>>
>>>>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated
>>>>> if the user wants a different tier topology. 
>>>>>
>>>>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200
>>>>>
>>>>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100.
>>>>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use
>>>>> to control the tier assignment this can be a range of memory tiers. 
>>>>>
>>>>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update
>>>>> the memory tier assignment based on device attributes.
>>>>
>>>> Sorry for late reply.
>>>>
>>>> As the first step, it may be better to skip the parts that we haven't
>>>> reached consensus yet, for example, the user space interface to override
>>>> the default memory tiers.  And we can use 0, 1, 2 as the default memory
>>>> tier IDs.  We can refine/revise the in-kernel implementation, but we
>>>> cannot change the user space ABI.
>>>>
>>>
>>> Can you help list the use case that will be broken by using tierID as outlined in this series?
>>> One of the details that were mentioned earlier was the need to track top-tier memory usage in a
>>> memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com
>>> can work with tier IDs too. Let me know if you think otherwise. So at this point
>>> I am not sure which area we are still debating w.r.t the userspace interface.
>> 
>> In
>> 
>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
>> 
>> per my understanding, Johannes suggested to override the kernel default
>> memory tiers with "abstract distance" via drivers implementing memory
>> devices.  As you said in another email, that is related to [7/12] of the
>> series.  And we can table it for future.
>> 
>> And per my understanding, he also suggested to make memory tier IDs
>> dynamic.  For example, after the "abstract distance" of a driver is
>> overridden by users, the total number of memory tiers may be changed,
>> and the memory tier ID of some nodes may be changed too.  This will make
>> memory tier ID easier to be understood, but more unstable.  For example,
>> this will make it harder to specify the per-memory-tier memory partition
>> for a cgroup.
>> 
>
> With all the approaches we discussed so far, a memory tier of a numa node can be changed.
> ie, pgdat->memtier can change anytime. The per memcg top tier mem usage tracking patches
> posted here
> https://lore.kernel.org/linux-mm/cefeb63173fa0fac7543315a2abbd4b5a1b25af8.1655242024.git.tim.c.chen@linux.intel.com/
> doesn't consider the node movement from one memory tier to another. If we need
> a stable pgdat->memtier we will have to prevent a node memory tier reassignment
> while we have pages from the memory tier charged to a cgroup. This patchset should not
> prevent such a restriction.

Absolute stability doesn't exist even in the "rank" based solution.  But
"rank" can improve stability to some degree.  For example, if we
move the tier of HBM nodes (from below DRAM to above DRAM), the DRAM
nodes can keep their memory tier ID stable.  This may not be a real issue
in the end.  But we need to discuss that.
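
To make the HBM example concrete, here is a minimal sketch in Python. It is
purely illustrative: the tier IDs, the rank values and the helper name
demotion_order() are assumptions made for this example and are not taken
from the patchset. Each tier has a fixed ID plus a separate rank, and the
demotion order follows the rank. Moving HBM from below DRAM to above DRAM
then only changes the HBM rank, so the tier ID seen by the DRAM nodes stays
the same.

```python
# Illustrative model only: the tier IDs, rank values and names are assumptions.
DRAM, PMEM, HBM = 1, 2, 3                          # stable tier IDs
node_tier = {0: DRAM, 1: DRAM, 2: HBM, 3: PMEM}    # node -> tier ID
rank = {DRAM: 200, PMEM: 100, HBM: 150}            # tier ID -> rank; HBM starts below DRAM

def demotion_order(rank):
    """Tier IDs sorted from highest rank to lowest rank."""
    return sorted(rank, key=rank.get, reverse=True)

before = demotion_order(rank)
rank[HBM] = 300          # move HBM above DRAM by changing only its rank
after = demotion_order(rank)

print(before)            # [1, 3, 2]: DRAM, HBM, PMEM
print(after)             # [3, 1, 2]: HBM, DRAM, PMEM
print(node_tier[0])      # 1: the DRAM node keeps the same tier ID
```

With dense, dynamically renumbered IDs the same move would presumably shift
the ID of the DRAM tier as well, which is the instability being discussed.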

Tim has suggested using top-tier(s) memory partition among cgroups.
But I don't think that has been finalized.  We may use per-memory-tier
memory partition among cgroups.  I don't know whether Wei will use that
(it may be implemented in user space).

And if we think stability between nodes and memory tier ID isn't
important, why should we use sparse memory device IDs (that is, 100,
200, 300)?  Why not just 0, 1, 2, ...?  That looks more natural.

> There are 3 knobs provided in this patchset. 
>
> 1. kernel parameter to change default memory tier. Changing this applies only to new memory that is
>  hotplugged. The existing node to memtier mapping remains the same.
>
> 2. module parameter to change dax kmem memory tier. Same as above. 

Why do we need these 2 knobs?  For example, we may use the user space
override mechanism suggested by Johannes.

> 3. Ability to change node to memory tier mapping via /sys/devices/system/node/nodeN/memtier . We
>  should be able to add any restrictions w.r.t cgroup there. 

I think that we have decided to delay this feature ([7/12])?

Best Regards,
Huang, Ying

> Hence my observation is that the requirement for a stable node to memory tier mapping should not
> prevent the merging of this patch series.
>
>
>>> I will still keep the default tier IDs with a large range between them. That will allow
>>> us to go back to tierID based demotion order if we can. That is much simpler than using tierID and rank
>>> together. If we still want to go back to rank based approach the tierID value won't have much
>>> meaning anyway.
>> 
>> I agree to get rid of "rank".
>> 
>>> Any feedback on patches 1 - 5, so that I can request Andrew to merge
>>> them?
>> 
>> I hope that we can discuss with Johannes firstly.  But it appears that
>> he is busy recently.
>> 
>
>
> -aneesh
* Re: [PATCH v8 00/12] mm/demotion: Memory tiers and demotion
  2022-07-12  8:48             ` Huang, Ying
@ 2022-07-12  9:17               ` Aneesh Kumar K V
  2022-07-13  2:59                 ` Huang, Ying
  0 siblings, 1 reply; 42+ messages in thread
From: Aneesh Kumar K V @ 2022-07-12  9:17 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Johannes Weiner, linux-mm, akpm, Wei Xu, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, jvgediya.oss
On 7/12/22 2:18 PM, Huang, Ying wrote:
> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> 
>> On 7/12/22 12:29 PM, Huang, Ying wrote:
>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>>
>>>> On 7/12/22 6:46 AM, Huang, Ying wrote:
>>>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>>>>
>>>>>> On 7/5/22 9:59 AM, Huang, Ying wrote:
>>>>>>> Hi, Aneesh,
>>>>>>>
>>>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>>>>>
>>>>>>>> The current kernel has the basic memory tiering support: Inactive
>>>>>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower
>>>>>>>> tier NUMA node to make room for new allocations on the higher tier
>>>>>>>> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
>>>>>>>> migrated (promoted) to a higher tier NUMA node to improve the
>>>>>>>> performance.
>>>>>>>>
>>>>>>>> In the current kernel, memory tiers are defined implicitly via a
>>>>>>>> demotion path relationship between NUMA nodes, which is created during
>>>>>>>> the kernel initialization and updated when a NUMA node is hot-added or
>>>>>>>> hot-removed.  The current implementation puts all nodes with CPU into
>>>>>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing
>>>>>>>> the per-node demotion targets based on the distances between nodes.
>>>>>>>>
>>>>>>>> This current memory tier kernel interface needs to be improved for
>>>>>>>> several important use cases:
>>>>>>>>
>>>>>>>> * The current tier initialization code always initializes
>>>>>>>>   each memory-only NUMA node into a lower tier.  But a memory-only
>>>>>>>>   NUMA node may have a high performance memory device (e.g. a DRAM
>>>>>>>>   device attached via CXL.mem or a DRAM-backed memory-only node on
>>>>>>>>   a virtual machine) and should be put into a higher tier.
>>>>>>>>
>>>>>>>> * The current tier hierarchy always puts CPU nodes into the top
>>>>>>>>   tier. But on a system with HBM (e.g. GPU memory) devices, these
>>>>>>>>   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
>>>>>>>>   with CPUs are better to be placed into the next lower tier.
>>>>>>>>
>>>>>>>> * Also because the current tier hierarchy always puts CPU nodes
>>>>>>>>   into the top tier, when a CPU is hot-added (or hot-removed) and
>>>>>>>>   triggers a memory node from CPU-less into a CPU node (or vice
>>>>>>>>   versa), the memory tier hierarchy gets changed, even though no
>>>>>>>>   memory node is added or removed.  This can make the tier
>>>>>>>>   hierarchy unstable and make it difficult to support tier-based
>>>>>>>>   memory accounting.
>>>>>>>>
>>>>>>>> * A higher tier node can only be demoted to selected nodes on the
>>>>>>>>   next lower tier as defined by the demotion path, not any other
>>>>>>>>   node from any lower tier.  This strict, hard-coded demotion order
>>>>>>>>   does not work in all use cases (e.g. some use cases may want to
>>>>>>>>   allow cross-socket demotion to another node in the same demotion
>>>>>>>>   tier as a fallback when the preferred demotion node is out of
>>>>>>>>   space), and has resulted in the feature request for an interface to
>>>>>>>>   override the system-wide, per-node demotion order from the
>>>>>>>>   userspace.  This demotion order is also inconsistent with the page
>>>>>>>>   allocation fallback order when all the nodes in a higher tier are
>>>>>>>>   out of space: The page allocation can fall back to any node from
>>>>>>>>   any lower tier, whereas the demotion order doesn't allow that.
>>>>>>>>
>>>>>>>> * There are no interfaces for the userspace to learn about the memory
>>>>>>>>   tier hierarchy in order to optimize its memory allocations.
>>>>>>>>
>>>>>>>> This patch series make the creation of memory tiers explicit under
>>>>>>>> the control of userspace or device driver.
>>>>>>>>
>>>>>>>> Memory Tier Initialization
>>>>>>>> ==========================
>>>>>>>>
>>>>>>>> By default, all memory nodes are assigned to the default tier with
>>>>>>>> tier ID value 200.
>>>>>>>>
>>>>>>>> A device driver can move up or down its memory nodes from the default
>>>>>>>> tier.  For example, PMEM can move down its memory nodes below the
>>>>>>>> default tier, whereas GPU can move up its memory nodes above the
>>>>>>>> default tier.
>>>>>>>>
>>>>>>>> The kernel initialization code makes the decision on which exact tier
>>>>>>>> a memory node should be assigned to based on the requests from the
>>>>>>>> device drivers as well as the memory device hardware information
>>>>>>>> provided by the firmware.
>>>>>>>>
>>>>>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
>>>>>>>>
>>>>>>>> Memory Allocation for Demotion
>>>>>>>> ==============================
>>>>>>>> This patch series keep the demotion target page allocation logic same.
>>>>>>>> The demotion page allocation pick the closest NUMA node in the
>>>>>>>> next lower tier to the current NUMA node allocating pages from.
>>>>>>>>
>>>>>>>> This will be later improved to use the same page allocation strategy
>>>>>>>> using fallback list.
>>>>>>>>
>>>>>>>> Sysfs Interface:
>>>>>>>> -------------
>>>>>>>> Listing current list of memory tiers details:
>>>>>>>>
>>>>>>>> :/sys/devices/system/memtier$ ls
>>>>>>>> default_tier max_tier  memtier1  power  uevent
>>>>>>>> :/sys/devices/system/memtier$ cat default_tier
>>>>>>>> memtier200
>>>>>>>> :/sys/devices/system/memtier$ cat max_tier 
>>>>>>>> 400
>>>>>>>> :/sys/devices/system/memtier$ 
>>>>>>>>
>>>>>>>> Per node memory tier details:
>>>>>>>>
>>>>>>>> For a cpu only NUMA node:
>>>>>>>>
>>>>>>>> :/sys/devices/system/node# cat node0/memtier 
>>>>>>>> :/sys/devices/system/node# echo 1 > node0/memtier 
>>>>>>>> :/sys/devices/system/node# cat node0/memtier 
>>>>>>>> :/sys/devices/system/node# 
>>>>>>>>
>>>>>>>> For a NUMA node with memory:
>>>>>>>> :/sys/devices/system/node# cat node1/memtier 
>>>>>>>> 1
>>>>>>>> :/sys/devices/system/node# ls ../memtier/
>>>>>>>> default_tier  max_tier  memtier1  power  uevent
>>>>>>>> :/sys/devices/system/node# echo 2 > node1/memtier 
>>>>>>>> :/sys/devices/system/node# 
>>>>>>>> :/sys/devices/system/node# ls ../memtier/
>>>>>>>> default_tier  max_tier  memtier1  memtier2  power  uevent
>>>>>>>> :/sys/devices/system/node# cat node1/memtier 
>>>>>>>> 2
>>>>>>>> :/sys/devices/system/node# 
>>>>>>>>
>>>>>>>> Removing a memory tier
>>>>>>>> :/sys/devices/system/node# cat node1/memtier 
>>>>>>>> 2
>>>>>>>> :/sys/devices/system/node# echo 1 > node1/memtier
>>>>>>>
>>>>>>> Thanks a lot for your patchset.
>>>>>>>
>>>>>>> Per my understanding, we haven't reach consensus on
>>>>>>>
>>>>>>> - how to create the default memory tiers in kernel (via abstract
>>>>>>>   distance provided by drivers?  Or use SLIT as the first step?)
>>>>>>>
>>>>>>> - how to override the default memory tiers from user space
>>>>>>>
>>>>>>> As in the following thread and email,
>>>>>>>
>>>>>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
>>>>>>>
>>>>>>> I think that we need to finalized on that firstly?
>>>>>>
>>>>>> I did list the proposal here 
>>>>>>
>>>>>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>>>>>>
>>>>>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated
>>>>>> if the user wants a different tier topology. 
>>>>>>
>>>>>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200
>>>>>>
>>>>>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100.
>>>>>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use
>>>>>> to control the tier assignment this can be a range of memory tiers. 
>>>>>>
>>>>>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update
>>>>>> the memory tier assignment based on device attributes.
>>>>>
>>>>> Sorry for late reply.
>>>>>
>>>>> As the first step, it may be better to skip the parts that we haven't
>>>>> reached consensus yet, for example, the user space interface to override
>>>>> the default memory tiers.  And we can use 0, 1, 2 as the default memory
>>>>> tier IDs.  We can refine/revise the in-kernel implementation, but we
>>>>> cannot change the user space ABI.
>>>>>
>>>>
>>>> Can you help list the use case that will be broken by using tierID as outlined in this series?
>>>> One of the details that were mentioned earlier was the need to track top-tier memory usage in a
>>>> memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com
>>>> can work with tier IDs too. Let me know if you think otherwise. So at this point
>>>> I am not sure which area we are still debating w.r.t the userspace interface.
>>>
>>> In
>>>
>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
>>>
>>> per my understanding, Johannes suggested to override the kernel default
>>> memory tiers with "abstract distance" via drivers implementing memory
>>> devices.  As you said in another email, that is related to [7/12] of the
>>> series.  And we can table it for future.
>>>
>>> And per my understanding, he also suggested to make memory tier IDs
>>> dynamic.  For example, after the "abstract distance" of a driver is
>>> overridden by users, the total number of memory tiers may be changed,
>>> and the memory tier ID of some nodes may be changed too.  This will make
>>> memory tier ID easier to be understood, but more unstable.  For example,
>>> this will make it harder to specify the per-memory-tier memory partition
>>> for a cgroup.
>>>
>>
>> With all the approaches we discussed so far, a memory tier of a numa node can be changed.
>> ie, pgdat->memtier can change anytime. The per memcg top tier mem usage tracking patches
>> posted here
>> https://lore.kernel.org/linux-mm/cefeb63173fa0fac7543315a2abbd4b5a1b25af8.1655242024.git.tim.c.chen@linux.intel.com/
>> doesn't consider the node movement from one memory tier to another. If we need
>> a stable pgdat->memtier we will have to prevent a node memory tier reassignment
>> while we have pages from the memory tier charged to a cgroup. This patchset should not
>> prevent such a restriction.
> 
> Absolute stableness doesn't exist even in "rank" based solution.  But
> "rank" can improve the stableness at some degree.  For example, if we
> move the tier of HBM nodes (from below DRAM to above DRAM), the DRAM
> nodes can keep its memory tier ID stable.  This may be not a real issue
> finally.  But we need to discuss that.
> 

I agree that using ranks gives us the flexibility to change demotion order
without being blocked by cgroup usage. But how frequently do we expect the
tier assignment to change? My expectation is that these reassignments will
be rare once a system is up and running. Hence using tierID for demotion
order won't get in the way of node reassignment much, because we don't
expect to change the node tierID during runtime. In the rare case we do,
we will have to make sure there is no cgroup usage from the specific
memory tier.

Even if we use ranks, won't we have to avoid a rank update if such
an update can change the meaning of top tier, i.e., if a rank update
can result in a node being moved from the top tier to a non-top tier?

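
The kind of check described here could look roughly like the following. This
is a minimal sketch in Python, assuming that a higher ordering value (tier ID
or rank) means faster memory and that "top tier" is simply the highest
populated tier. The function names are hypothetical and do not come from the
series.

```python
def top_tier_nodes(node_tier):
    """Nodes in the highest populated tier (higher value = faster, by assumption)."""
    top = max(node_tier.values())
    return {node for node, tier in node_tier.items() if tier == top}

def changes_top_tier(node_tier, node, new_tier):
    """True if moving `node` to `new_tier` alters which nodes count as top tier."""
    updated = dict(node_tier)
    updated[node] = new_tier
    return top_tier_nodes(updated) != top_tier_nodes(node_tier)

# Hypothetical layout: two DRAM nodes in tier 200, one dax kmem node in tier 100.
node_tier = {0: 200, 1: 200, 2: 100}

print(changes_top_tier(node_tier, 2, 50))    # False: top tier is still nodes 0 and 1
print(changes_top_tier(node_tier, 1, 100))   # True: node 1 leaves the top tier
```

An update for which this returns True is the case that would have to be
refused, or at least handled specially, while a cgroup still has top-tier
usage charged.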
> Tim has suggested to use top-tier(s) memory partition among cgroups.
> But I don't think that has been finalized.  We may use per-memory-tier
> memory partition among cgroups.  I don't know whether Wei will use that
> (may be implemented in the user space).
> 
> And, if we thought stableness between nodes and memory tier ID isn't
> important.  Why should we use sparse memory device IDs (that is, 100,
> 200, 300)?  Why not just 0, 1, 2, ...?  That looks more natural.
> 

The range allows us to use memtier ID for demotion order, i.e., as we start initializing
devices with different attributes via dax kmem, there will be a desire to
assign them to different tierIDs. Having the default memtier ID (DRAM) at 200 enables
us to put these devices in the range [0 - 200) without updating the node to memtier
mapping of existing NUMA nodes (i.e., without updating the default memtier).

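
As a rough illustration of that point, here is a small Python sketch. The
values 200 (default tier) and 100 (dax kmem) are from this series; the node
numbers, the tier ID 150 and the helper names are assumptions made up for the
example. A new device class gets a tier ID inside [0, 200) and the existing
node to memtier mapping is left alone.

```python
DEFAULT_MEMTIER = 200    # default (DRAM) tier ID in this series
DAX_KMEM_MEMTIER = 100   # tier ID dax kmem uses in this series

# Hypothetical starting layout: node -> memtier ID
node_tier = {0: DEFAULT_MEMTIER, 1: DEFAULT_MEMTIER, 2: DAX_KMEM_MEMTIER}

def add_device_node(node_tier, node, tier_id):
    """Return a new mapping with `node` placed in [0, DEFAULT_MEMTIER)."""
    if not 0 <= tier_id < DEFAULT_MEMTIER:
        raise ValueError("tier ID must be in [0, 200)")
    updated = dict(node_tier)
    updated[node] = tier_id
    return updated

def demotion_order(node_tier):
    """Populated tier IDs from highest to lowest (higher ID assumed faster)."""
    return sorted(set(node_tier.values()), reverse=True)

# Hypothetical device that sits between DRAM and the dax kmem tier.
with_new = add_device_node(node_tier, 3, 150)

print(demotion_order(with_new))                           # [200, 150, 100]
print(all(with_new[n] == node_tier[n] for n in node_tier))  # True: old nodes untouched
```

With dense IDs 0, 1, 2 there would be no free slot between two existing
tiers, so fitting in such a device would mean renumbering at least one of
them.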
-aneesh
* Re: [PATCH v8 00/12] mm/demotion: Memory tiers and demotion
  2022-07-12  5:09         ` Aneesh Kumar K V
@ 2022-07-12 18:02           ` Yang Shi
  2022-07-13  3:42             ` Huang, Ying
  0 siblings, 1 reply; 42+ messages in thread
From: Yang Shi @ 2022-07-12 18:02 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: Huang, Ying, Linux MM, Andrew Morton, Wei Xu, Davidlohr Bueso,
	Tim C Chen, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Johannes Weiner, jvgediya.oss
On Mon, Jul 11, 2022 at 10:10 PM Aneesh Kumar K V
<aneesh.kumar@linux.ibm.com> wrote:
>
> On 7/12/22 10:12 AM, Aneesh Kumar K V wrote:
> > On 7/12/22 6:46 AM, Huang, Ying wrote:
> >> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> >>
> >>> On 7/5/22 9:59 AM, Huang, Ying wrote:
> >>>> Hi, Aneesh,
> >>>>
> >>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> >>>>
> >>>>> The current kernel has the basic memory tiering support: Inactive
> >>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower
> >>>>> tier NUMA node to make room for new allocations on the higher tier
> >>>>> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> >>>>> migrated (promoted) to a higher tier NUMA node to improve the
> >>>>> performance.
> >>>>>
> >>>>> In the current kernel, memory tiers are defined implicitly via a
> >>>>> demotion path relationship between NUMA nodes, which is created during
> >>>>> the kernel initialization and updated when a NUMA node is hot-added or
> >>>>> hot-removed.  The current implementation puts all nodes with CPU into
> >>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing
> >>>>> the per-node demotion targets based on the distances between nodes.
> >>>>>
> >>>>> This current memory tier kernel interface needs to be improved for
> >>>>> several important use cases:
> >>>>>
> >>>>> * The current tier initialization code always initializes
> >>>>>   each memory-only NUMA node into a lower tier.  But a memory-only
> >>>>>   NUMA node may have a high performance memory device (e.g. a DRAM
> >>>>>   device attached via CXL.mem or a DRAM-backed memory-only node on
> >>>>>   a virtual machine) and should be put into a higher tier.
> >>>>>
> >>>>> * The current tier hierarchy always puts CPU nodes into the top
> >>>>>   tier. But on a system with HBM (e.g. GPU memory) devices, these
> >>>>>   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> >>>>>   with CPUs are better to be placed into the next lower tier.
> >>>>>
> >>>>> * Also because the current tier hierarchy always puts CPU nodes
> >>>>>   into the top tier, when a CPU is hot-added (or hot-removed) and
> >>>>>   triggers a memory node from CPU-less into a CPU node (or vice
> >>>>>   versa), the memory tier hierarchy gets changed, even though no
> >>>>>   memory node is added or removed.  This can make the tier
> >>>>>   hierarchy unstable and make it difficult to support tier-based
> >>>>>   memory accounting.
> >>>>>
> >>>>> * A higher tier node can only be demoted to selected nodes on the
> >>>>>   next lower tier as defined by the demotion path, not any other
> >>>>>   node from any lower tier.  This strict, hard-coded demotion order
> >>>>>   does not work in all use cases (e.g. some use cases may want to
> >>>>>   allow cross-socket demotion to another node in the same demotion
> >>>>>   tier as a fallback when the preferred demotion node is out of
> >>>>>   space), and has resulted in the feature request for an interface to
> >>>>>   override the system-wide, per-node demotion order from the
> >>>>>   userspace.  This demotion order is also inconsistent with the page
> >>>>>   allocation fallback order when all the nodes in a higher tier are
> >>>>>   out of space: The page allocation can fall back to any node from
> >>>>>   any lower tier, whereas the demotion order doesn't allow that.
> >>>>>
> >>>>> * There are no interfaces for the userspace to learn about the memory
> >>>>>   tier hierarchy in order to optimize its memory allocations.
> >>>>>
> >>>>> This patch series make the creation of memory tiers explicit under
> >>>>> the control of userspace or device driver.
> >>>>>
> >>>>> Memory Tier Initialization
> >>>>> ==========================
> >>>>>
> >>>>> By default, all memory nodes are assigned to the default tier with
> >>>>> tier ID value 200.
> >>>>>
> >>>>> A device driver can move up or down its memory nodes from the default
> >>>>> tier.  For example, PMEM can move down its memory nodes below the
> >>>>> default tier, whereas GPU can move up its memory nodes above the
> >>>>> default tier.
> >>>>>
> >>>>> The kernel initialization code makes the decision on which exact tier
> >>>>> a memory node should be assigned to based on the requests from the
> >>>>> device drivers as well as the memory device hardware information
> >>>>> provided by the firmware.
> >>>>>
> >>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
> >>>>>
> >>>>> Memory Allocation for Demotion
> >>>>> ==============================
> >>>>> This patch series keep the demotion target page allocation logic same.
> >>>>> The demotion page allocation pick the closest NUMA node in the
> >>>>> next lower tier to the current NUMA node allocating pages from.
> >>>>>
> >>>>> This will be later improved to use the same page allocation strategy
> >>>>> using fallback list.
> >>>>>
> >>>>> Sysfs Interface:
> >>>>> -------------
> >>>>> Listing current list of memory tiers details:
> >>>>>
> >>>>> :/sys/devices/system/memtier$ ls
> >>>>> default_tier max_tier  memtier1  power  uevent
> >>>>> :/sys/devices/system/memtier$ cat default_tier
> >>>>> memtier200
> >>>>> :/sys/devices/system/memtier$ cat max_tier
> >>>>> 400
> >>>>> :/sys/devices/system/memtier$
> >>>>>
> >>>>> Per node memory tier details:
> >>>>>
> >>>>> For a cpu only NUMA node:
> >>>>>
> >>>>> :/sys/devices/system/node# cat node0/memtier
> >>>>> :/sys/devices/system/node# echo 1 > node0/memtier
> >>>>> :/sys/devices/system/node# cat node0/memtier
> >>>>> :/sys/devices/system/node#
> >>>>>
> >>>>> For a NUMA node with memory:
> >>>>> :/sys/devices/system/node# cat node1/memtier
> >>>>> 1
> >>>>> :/sys/devices/system/node# ls ../memtier/
> >>>>> default_tier  max_tier  memtier1  power  uevent
> >>>>> :/sys/devices/system/node# echo 2 > node1/memtier
> >>>>> :/sys/devices/system/node#
> >>>>> :/sys/devices/system/node# ls ../memtier/
> >>>>> default_tier  max_tier  memtier1  memtier2  power  uevent
> >>>>> :/sys/devices/system/node# cat node1/memtier
> >>>>> 2
> >>>>> :/sys/devices/system/node#
> >>>>>
> >>>>> Removing a memory tier
> >>>>> :/sys/devices/system/node# cat node1/memtier
> >>>>> 2
> >>>>> :/sys/devices/system/node# echo 1 > node1/memtier
> >>>>
> >>>> Thanks a lot for your patchset.
> >>>>
> >>>> Per my understanding, we haven't reach consensus on
> >>>>
> >>>> - how to create the default memory tiers in kernel (via abstract
> >>>>   distance provided by drivers?  Or use SLIT as the first step?)
> >>>>
> >>>> - how to override the default memory tiers from user space
> >>>>
> >>>> As in the following thread and email,
> >>>>
> >>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
> >>>>
> >>>> I think that we need to finalized on that firstly?
> >>>
> >>> I did list the proposal here
> >>>
> >>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
> >>>
> >>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated
> >>> if the user wants a different tier topology.
> >>>
> >>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200
> >>>
> >>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100.
> >>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use
> >>> to control the tier assignment this can be a range of memory tiers.
> >>>
> >>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update
> >>> the memory tier assignment based on device attributes.
> >>
> >> Sorry for late reply.
> >>
> >> As the first step, it may be better to skip the parts that we haven't
> >> reached consensus yet, for example, the user space interface to override
> >> the default memory tiers.  And we can use 0, 1, 2 as the default memory
> >> tier IDs.  We can refine/revise the in-kernel implementation, but we
> >> cannot change the user space ABI.
> >>
> >
> > Can you help list the use case that will be broken by using tierID as outlined in this series?
> > One of the details that were mentioned earlier was the need to track top-tier memory usage in a
> > memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com
> > can work with tier IDs too. Let me know if you think otherwise. So at this point
> > I am not sure which area we are still debating w.r.t the userspace interface.
> >
> > I will still keep the default tier IDs with a large range between them. That will allow
> > us to go back to tierID based demotion order if we can. That is much simpler than using tierID and rank
> > together. If we still want to go back to rank based approach the tierID value won't have much
> > meaning anyway.
> >
> > Any feedback on patches 1 - 5, so that I can request Andrew to merge them?
> >
>
> Looking at this again, I guess we just need to drop patch 7
> mm/demotion: Add per node memory tier attribute to sysfs ?
>
> We do agree to use the device model to expose memory tiers to userspace so patch 6 can still be included.
> It also exposes max_tier, default_tier, and node list of a memory tier. All these are useful
> and agreed upon. Hence patch 6 can be merged?
>
> patch 8 - 10 -> are done based on the request from others and is independent of how memory tiers
> are exposed/created from userspace. Hence that can be merged?
>
> If you agree I can rebase the series moving patch 7,11,12 as the last patches in the series so
> that we can skip merging them based on what we conclude w.r.t usage of rank.

I think the most controversial part so far is the user visible
interfaces. And IIUC the series could be split roughly into two parts,
patches 1 - 5 and the others. Patches 1 - 5 add the explicit memory tier
support and fix the issue reported by Jagdish. I think we are on the
same page for this part. But I haven't seen any thorough review of
those patches yet, since we got distracted and spent most of the time
discussing the user visible interfaces.

So would it help to move things forward to submit patches 1 - 5 as a
standalone series, get a thorough review, and then get them merged?

>
> -aneesh
>
* Re: [PATCH v8 00/12] mm/demotion: Memory tiers and demotion
  2022-07-12  9:17               ` Aneesh Kumar K V
@ 2022-07-13  2:59                 ` Huang, Ying
  2022-07-13  6:46                   ` Wei Xu
  2022-07-13  9:40                   ` Aneesh Kumar K.V
  0 siblings, 2 replies; 42+ messages in thread
From: Huang, Ying @ 2022-07-13  2:59 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: Johannes Weiner, linux-mm, akpm, Wei Xu, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, jvgediya.oss
Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> On 7/12/22 2:18 PM, Huang, Ying wrote:
>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>> 
>>> On 7/12/22 12:29 PM, Huang, Ying wrote:
>>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>>>
>>>>> On 7/12/22 6:46 AM, Huang, Ying wrote:
>>>>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>>>>>
>>>>>>> On 7/5/22 9:59 AM, Huang, Ying wrote:
>>>>>>>> Hi, Aneesh,
>>>>>>>>
>>>>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>>>>>>
>>>>>>>>> The current kernel has the basic memory tiering support: Inactive
>>>>>>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower
>>>>>>>>> tier NUMA node to make room for new allocations on the higher tier
>>>>>>>>> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
>>>>>>>>> migrated (promoted) to a higher tier NUMA node to improve the
>>>>>>>>> performance.
>>>>>>>>>
>>>>>>>>> In the current kernel, memory tiers are defined implicitly via a
>>>>>>>>> demotion path relationship between NUMA nodes, which is created during
>>>>>>>>> the kernel initialization and updated when a NUMA node is hot-added or
>>>>>>>>> hot-removed.  The current implementation puts all nodes with CPU into
>>>>>>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing
>>>>>>>>> the per-node demotion targets based on the distances between nodes.
>>>>>>>>>
>>>>>>>>> This current memory tier kernel interface needs to be improved for
>>>>>>>>> several important use cases:
>>>>>>>>>
>>>>>>>>> * The current tier initialization code always initializes
>>>>>>>>>   each memory-only NUMA node into a lower tier.  But a memory-only
>>>>>>>>>   NUMA node may have a high performance memory device (e.g. a DRAM
>>>>>>>>>   device attached via CXL.mem or a DRAM-backed memory-only node on
>>>>>>>>>   a virtual machine) and should be put into a higher tier.
>>>>>>>>>
>>>>>>>>> * The current tier hierarchy always puts CPU nodes into the top
>>>>>>>>>   tier. But on a system with HBM (e.g. GPU memory) devices, these
>>>>>>>>>   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
>>>>>>>>>   with CPUs are better to be placed into the next lower tier.
>>>>>>>>>
>>>>>>>>> * Also because the current tier hierarchy always puts CPU nodes
>>>>>>>>>   into the top tier, when a CPU is hot-added (or hot-removed) and
>>>>>>>>>   triggers a memory node from CPU-less into a CPU node (or vice
>>>>>>>>>   versa), the memory tier hierarchy gets changed, even though no
>>>>>>>>>   memory node is added or removed.  This can make the tier
>>>>>>>>>   hierarchy unstable and make it difficult to support tier-based
>>>>>>>>>   memory accounting.
>>>>>>>>>
>>>>>>>>> * A higher tier node can only be demoted to selected nodes on the
>>>>>>>>>   next lower tier as defined by the demotion path, not any other
>>>>>>>>>   node from any lower tier.  This strict, hard-coded demotion order
>>>>>>>>>   does not work in all use cases (e.g. some use cases may want to
>>>>>>>>>   allow cross-socket demotion to another node in the same demotion
>>>>>>>>>   tier as a fallback when the preferred demotion node is out of
>>>>>>>>>   space), and has resulted in the feature request for an interface to
>>>>>>>>>   override the system-wide, per-node demotion order from the
>>>>>>>>>   userspace.  This demotion order is also inconsistent with the page
>>>>>>>>>   allocation fallback order when all the nodes in a higher tier are
>>>>>>>>>   out of space: The page allocation can fall back to any node from
>>>>>>>>>   any lower tier, whereas the demotion order doesn't allow that.
>>>>>>>>>
>>>>>>>>> * There are no interfaces for the userspace to learn about the memory
>>>>>>>>>   tier hierarchy in order to optimize its memory allocations.
>>>>>>>>>
>>>>>>>>> This patch series make the creation of memory tiers explicit under
>>>>>>>>> the control of userspace or device driver.
>>>>>>>>>
>>>>>>>>> Memory Tier Initialization
>>>>>>>>> ==========================
>>>>>>>>>
>>>>>>>>> By default, all memory nodes are assigned to the default tier with
>>>>>>>>> tier ID value 200.
>>>>>>>>>
>>>>>>>>> A device driver can move up or down its memory nodes from the default
>>>>>>>>> tier.  For example, PMEM can move down its memory nodes below the
>>>>>>>>> default tier, whereas GPU can move up its memory nodes above the
>>>>>>>>> default tier.
>>>>>>>>>
>>>>>>>>> The kernel initialization code makes the decision on which exact tier
>>>>>>>>> a memory node should be assigned to based on the requests from the
>>>>>>>>> device drivers as well as the memory device hardware information
>>>>>>>>> provided by the firmware.
>>>>>>>>>
>>>>>>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
>>>>>>>>>
>>>>>>>>> Memory Allocation for Demotion
>>>>>>>>> ==============================
>>>>>>>>> This patch series keep the demotion target page allocation logic same.
>>>>>>>>> The demotion page allocation pick the closest NUMA node in the
>>>>>>>>> next lower tier to the current NUMA node allocating pages from.
>>>>>>>>>
>>>>>>>>> This will be later improved to use the same page allocation strategy
>>>>>>>>> using fallback list.
>>>>>>>>>
>>>>>>>>> Sysfs Interface:
>>>>>>>>> -------------
>>>>>>>>> Listing current list of memory tiers details:
>>>>>>>>>
>>>>>>>>> :/sys/devices/system/memtier$ ls
>>>>>>>>> default_tier max_tier  memtier1  power  uevent
>>>>>>>>> :/sys/devices/system/memtier$ cat default_tier
>>>>>>>>> memtier200
>>>>>>>>> :/sys/devices/system/memtier$ cat max_tier 
>>>>>>>>> 400
>>>>>>>>> :/sys/devices/system/memtier$ 
>>>>>>>>>
>>>>>>>>> Per node memory tier details:
>>>>>>>>>
>>>>>>>>> For a cpu only NUMA node:
>>>>>>>>>
>>>>>>>>> :/sys/devices/system/node# cat node0/memtier 
>>>>>>>>> :/sys/devices/system/node# echo 1 > node0/memtier 
>>>>>>>>> :/sys/devices/system/node# cat node0/memtier 
>>>>>>>>> :/sys/devices/system/node# 
>>>>>>>>>
>>>>>>>>> For a NUMA node with memory:
>>>>>>>>> :/sys/devices/system/node# cat node1/memtier 
>>>>>>>>> 1
>>>>>>>>> :/sys/devices/system/node# ls ../memtier/
>>>>>>>>> default_tier  max_tier  memtier1  power  uevent
>>>>>>>>> :/sys/devices/system/node# echo 2 > node1/memtier 
>>>>>>>>> :/sys/devices/system/node# 
>>>>>>>>> :/sys/devices/system/node# ls ../memtier/
>>>>>>>>> default_tier  max_tier  memtier1  memtier2  power  uevent
>>>>>>>>> :/sys/devices/system/node# cat node1/memtier 
>>>>>>>>> 2
>>>>>>>>> :/sys/devices/system/node# 
>>>>>>>>>
>>>>>>>>> Removing a memory tier
>>>>>>>>> :/sys/devices/system/node# cat node1/memtier 
>>>>>>>>> 2
>>>>>>>>> :/sys/devices/system/node# echo 1 > node1/memtier
>>>>>>>>
>>>>>>>> Thanks a lot for your patchset.
>>>>>>>>
>>>>>>>> Per my understanding, we haven't reach consensus on
>>>>>>>>
>>>>>>>> - how to create the default memory tiers in kernel (via abstract
>>>>>>>>   distance provided by drivers?  Or use SLIT as the first step?)
>>>>>>>>
>>>>>>>> - how to override the default memory tiers from user space
>>>>>>>>
>>>>>>>> As in the following thread and email,
>>>>>>>>
>>>>>>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
>>>>>>>>
>>>>>>>> I think that we need to finalized on that firstly?
>>>>>>>
>>>>>>> I did list the proposal here 
>>>>>>>
>>>>>>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>>>>>>>
>>>>>>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated
>>>>>>> if the user wants a different tier topology. 
>>>>>>>
>>>>>>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200
>>>>>>>
>>>>>>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100.
>>>>>>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use
>>>>>>> to control the tier assignment this can be a range of memory tiers. 
>>>>>>>
>>>>>>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update
>>>>>>> the memory tier assignment based on device attributes.
>>>>>>
>>>>>> Sorry for late reply.
>>>>>>
>>>>>> As the first step, it may be better to skip the parts that we haven't
>>>>>> reached consensus yet, for example, the user space interface to override
>>>>>> the default memory tiers.  And we can use 0, 1, 2 as the default memory
>>>>>> tier IDs.  We can refine/revise the in-kernel implementation, but we
>>>>>> cannot change the user space ABI.
>>>>>>
>>>>>
>>>>> Can you help list the use case that will be broken by using tierID as outlined in this series?
>>>>> One of the details that were mentioned earlier was the need to track top-tier memory usage in a
>>>>> memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com
>>>>> can work with tier IDs too. Let me know if you think otherwise. So at this point
>>>>> I am not sure which area we are still debating w.r.t the userspace interface.
>>>>
>>>> In
>>>>
>>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
>>>>
>>>> per my understanding, Johannes suggested to override the kernel default
>>>> memory tiers with "abstract distance" via drivers implementing memory
>>>> devices.  As you said in another email, that is related to [7/12] of the
>>>> series.  And we can table it for future.
>>>>
>>>> And per my understanding, he also suggested to make memory tier IDs
>>>> dynamic.  For example, after the "abstract distance" of a driver is
>>>> overridden by users, the total number of memory tiers may be changed,
>>>> and the memory tier ID of some nodes may be changed too.  This will make
>>>> memory tier ID easier to be understood, but more unstable.  For example,
>>>> this will make it harder to specify the per-memory-tier memory partition
>>>> for a cgroup.
>>>>
>>>
>>> With all the approaches we discussed so far, a memory tier of a numa node can be changed.
>>> ie, pgdat->memtier can change anytime. The per memcg top tier mem usage tracking patches
>>> posted here
>>> https://lore.kernel.org/linux-mm/cefeb63173fa0fac7543315a2abbd4b5a1b25af8.1655242024.git.tim.c.chen@linux.intel.com/
>>> doesn't consider the node movement from one memory tier to another. If we need
>>> a stable pgdat->memtier we will have to prevent a node memory tier reassignment
>>> while we have pages from the memory tier charged to a cgroup. This patchset should not
>>> prevent such a restriction.
>> 
>> Absolute stableness doesn't exist even in "rank" based solution.  But
>> "rank" can improve the stableness at some degree.  For example, if we
>> move the tier of HBM nodes (from below DRAM to above DRAM), the DRAM
>> nodes can keep its memory tier ID stable.  This may be not a real issue
>> finally.  But we need to discuss that.
>> 
>
> I agree that using ranks gives us the flexibility to change demotion order
> without being blocked by cgroup usage. But how frequently do we expect the
> tier assignment to change? My expectation was these reassignments are going
> to be rare and won't happen frequently after a system is up and running?
> Hence using tierID for demotion order won't prevent a node reassignment
> much because we don't expect to change the node tierID during runtime. In
> the rare case we do, we will have to make sure there is no cgroup usage from
> the specific memory tier. 
>
> Even if we use ranks, we will have to avoid a rank update, if such
> an update can change the meaning of top tier? ie, if a rank update
> can result in a node being moved from top tier to non top tier.
>
>> Tim has suggested to use top-tier(s) memory partition among cgroups.
>> But I don't think that has been finalized.  We may use per-memory-tier
>> memory partition among cgroups.  I don't know whether Wei will use that
>> (may be implemented in the user space).
>> 
>> And, if we thought stableness between nodes and memory tier ID isn't
>> important.  Why should we use sparse memory device IDs (that is, 100,
>> 200, 300)?  Why not just 0, 1, 2, ...?  That looks more natural.
>> 
>
>
> The range allows us to use memtier ID for demotion order. ie, as we start initializing
> devices with different attributes via dax kmem, there will be a desire to
> assign them to different tierIDs. Having default memtier ID (DRAM) at 200 enables
> us to put these devices in the range [0 - 200) without updating the node to memtier
> mapping of existing NUMA nodes (ie, without updating default memtier).
I believe that sparse memory tier IDs can make memory tiers more stable
in some cases.  But this is different from the system suggested by
Johannes.  Per my understanding, Johannes' system works as follows:

- one driver may online different memory types (for example, kmem_dax
  may online HBM, PMEM, etc.)
- one memory type manages several memory nodes (NUMA nodes)
- each memory type has one "abstract distance"
- the "abstract distance" can be offset via a user space override knob
- memory tiers are generated dynamically from the memory types according
  to their "abstract distance" and the overridden "offset"
- the granularity used to group several memory types into one memory
  tier can be overridden via a user space knob

In this way, the memory tiers may change completely after a user space
override.  It may be hard to link the memory tiers before and after the
override.  So we may need to reset all per-memory-tier configuration,
such as cgroup partition limits or interleave weights.

Personally, I think the system above makes sense.  But I think we need
to make sure that it satisfies the requirements.
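
To make the concern about unstable tier IDs concrete, here is a rough
user space sketch of how dynamically generated tiers could behave.  It is
only an illustration, not code from the series or from Johannes'
proposal: the names MemoryType and build_tiers, the distance values, the
offset and the granularity are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryType:
    # Hypothetical model of one memory type; none of these names exist in the kernel.
    name: str
    abstract_distance: int  # made-up units, smaller means faster memory
    nodes: tuple            # NUMA node IDs managed by this memory type


def build_tiers(memory_types, offsets=None, granularity=100):
    """Group memory types into tiers by (distance + offset) // granularity.

    Returns a dict mapping node ID to a dense tier ID, where tier 0 is the
    fastest tier.  Both `offsets` and `granularity` stand in for the user
    space override knobs discussed above.
    """
    offsets = offsets or {}
    buckets = {}
    for mt in memory_types:
        adjusted = mt.abstract_distance + offsets.get(mt.name, 0)
        buckets.setdefault(adjusted // granularity, []).append(mt)

    node_to_tier = {}
    # Tier IDs are assigned densely in order of increasing adjusted distance.
    for tier_id, bucket in enumerate(sorted(buckets)):
        for mt in buckets[bucket]:
            for node in mt.nodes:
                node_to_tier[node] = tier_id
    return node_to_tier


# Assumed example topology: DRAM on nodes 0-1, HBM on node 2, PMEM on node 3.
types = [
    MemoryType("dram", 200, (0, 1)),
    MemoryType("hbm", 250, (2,)),
    MemoryType("pmem", 400, (3,)),
]

before = build_tiers(types)                         # no override
after = build_tiers(types, offsets={"hbm": -150})   # user moves HBM above DRAM
```

With these assumed numbers, DRAM and HBM share the top tier before the
override.  After the override the system has three tiers instead of two,
and the DRAM nodes move from tier 0 to tier 1 although nothing about DRAM
itself changed.  That is the situation in which per-memory-tier
configuration cannot be carried over.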
Best Regards,
Huang, Ying
^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: [PATCH v8 00/12] mm/demotion: Memory tiers and demotion
  2022-07-12 18:02           ` Yang Shi
@ 2022-07-13  3:42             ` Huang, Ying
  2022-07-13  6:38               ` Wei Xu
                                 ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Huang, Ying @ 2022-07-13  3:42 UTC (permalink / raw)
  To: Yang Shi
  Cc: Aneesh Kumar K V, Linux MM, Andrew Morton, Wei Xu,
	Davidlohr Bueso, Tim C Chen, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Johannes Weiner,
	jvgediya.oss
Yang Shi <shy828301@gmail.com> writes:
> On Mon, Jul 11, 2022 at 10:10 PM Aneesh Kumar K V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> On 7/12/22 10:12 AM, Aneesh Kumar K V wrote:
>> > On 7/12/22 6:46 AM, Huang, Ying wrote:
>> >> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>> >>
>> >>> On 7/5/22 9:59 AM, Huang, Ying wrote:
>> >>>> Hi, Aneesh,
>> >>>>
>> >>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>> >>>>
>> >>>>> The current kernel has the basic memory tiering support: Inactive
>> >>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower
>> >>>>> tier NUMA node to make room for new allocations on the higher tier
>> >>>>> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
>> >>>>> migrated (promoted) to a higher tier NUMA node to improve the
>> >>>>> performance.
>> >>>>>
>> >>>>> In the current kernel, memory tiers are defined implicitly via a
>> >>>>> demotion path relationship between NUMA nodes, which is created during
>> >>>>> the kernel initialization and updated when a NUMA node is hot-added or
>> >>>>> hot-removed.  The current implementation puts all nodes with CPU into
>> >>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing
>> >>>>> the per-node demotion targets based on the distances between nodes.
>> >>>>>
>> >>>>> This current memory tier kernel interface needs to be improved for
>> >>>>> several important use cases:
>> >>>>>
>> >>>>> * The current tier initialization code always initializes
>> >>>>>   each memory-only NUMA node into a lower tier.  But a memory-only
>> >>>>>   NUMA node may have a high performance memory device (e.g. a DRAM
>> >>>>>   device attached via CXL.mem or a DRAM-backed memory-only node on
>> >>>>>   a virtual machine) and should be put into a higher tier.
>> >>>>>
>> >>>>> * The current tier hierarchy always puts CPU nodes into the top
>> >>>>>   tier. But on a system with HBM (e.g. GPU memory) devices, these
>> >>>>>   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
>> >>>>>   with CPUs are better to be placed into the next lower tier.
>> >>>>>
>> >>>>> * Also because the current tier hierarchy always puts CPU nodes
>> >>>>>   into the top tier, when a CPU is hot-added (or hot-removed) and
>> >>>>>   triggers a memory node from CPU-less into a CPU node (or vice
>> >>>>>   versa), the memory tier hierarchy gets changed, even though no
>> >>>>>   memory node is added or removed.  This can make the tier
>> >>>>>   hierarchy unstable and make it difficult to support tier-based
>> >>>>>   memory accounting.
>> >>>>>
>> >>>>> * A higher tier node can only be demoted to selected nodes on the
>> >>>>>   next lower tier as defined by the demotion path, not any other
>> >>>>>   node from any lower tier.  This strict, hard-coded demotion order
>> >>>>>   does not work in all use cases (e.g. some use cases may want to
>> >>>>>   allow cross-socket demotion to another node in the same demotion
>> >>>>>   tier as a fallback when the preferred demotion node is out of
>> >>>>>   space), and has resulted in the feature request for an interface to
>> >>>>>   override the system-wide, per-node demotion order from the
>> >>>>>   userspace.  This demotion order is also inconsistent with the page
>> >>>>>   allocation fallback order when all the nodes in a higher tier are
>> >>>>>   out of space: The page allocation can fall back to any node from
>> >>>>>   any lower tier, whereas the demotion order doesn't allow that.
>> >>>>>
>> >>>>> * There are no interfaces for the userspace to learn about the memory
>> >>>>>   tier hierarchy in order to optimize its memory allocations.
>> >>>>>
>> >>>>> This patch series make the creation of memory tiers explicit under
>> >>>>> the control of userspace or device driver.
>> >>>>>
>> >>>>> Memory Tier Initialization
>> >>>>> ==========================
>> >>>>>
>> >>>>> By default, all memory nodes are assigned to the default tier with
>> >>>>> tier ID value 200.
>> >>>>>
>> >>>>> A device driver can move up or down its memory nodes from the default
>> >>>>> tier.  For example, PMEM can move down its memory nodes below the
>> >>>>> default tier, whereas GPU can move up its memory nodes above the
>> >>>>> default tier.
>> >>>>>
>> >>>>> The kernel initialization code makes the decision on which exact tier
>> >>>>> a memory node should be assigned to based on the requests from the
>> >>>>> device drivers as well as the memory device hardware information
>> >>>>> provided by the firmware.
>> >>>>>
>> >>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
>> >>>>>
>> >>>>> Memory Allocation for Demotion
>> >>>>> ==============================
>> >>>>> This patch series keep the demotion target page allocation logic same.
>> >>>>> The demotion page allocation pick the closest NUMA node in the
>> >>>>> next lower tier to the current NUMA node allocating pages from.
>> >>>>>
>> >>>>> This will be later improved to use the same page allocation strategy
>> >>>>> using fallback list.
>> >>>>>
>> >>>>> Sysfs Interface:
>> >>>>> -------------
>> >>>>> Listing current list of memory tiers details:
>> >>>>>
>> >>>>> :/sys/devices/system/memtier$ ls
>> >>>>> default_tier max_tier  memtier1  power  uevent
>> >>>>> :/sys/devices/system/memtier$ cat default_tier
>> >>>>> memtier200
>> >>>>> :/sys/devices/system/memtier$ cat max_tier
>> >>>>> 400
>> >>>>> :/sys/devices/system/memtier$
>> >>>>>
>> >>>>> Per node memory tier details:
>> >>>>>
>> >>>>> For a cpu only NUMA node:
>> >>>>>
>> >>>>> :/sys/devices/system/node# cat node0/memtier
>> >>>>> :/sys/devices/system/node# echo 1 > node0/memtier
>> >>>>> :/sys/devices/system/node# cat node0/memtier
>> >>>>> :/sys/devices/system/node#
>> >>>>>
>> >>>>> For a NUMA node with memory:
>> >>>>> :/sys/devices/system/node# cat node1/memtier
>> >>>>> 1
>> >>>>> :/sys/devices/system/node# ls ../memtier/
>> >>>>> default_tier  max_tier  memtier1  power  uevent
>> >>>>> :/sys/devices/system/node# echo 2 > node1/memtier
>> >>>>> :/sys/devices/system/node#
>> >>>>> :/sys/devices/system/node# ls ../memtier/
>> >>>>> default_tier  max_tier  memtier1  memtier2  power  uevent
>> >>>>> :/sys/devices/system/node# cat node1/memtier
>> >>>>> 2
>> >>>>> :/sys/devices/system/node#
>> >>>>>
>> >>>>> Removing a memory tier
>> >>>>> :/sys/devices/system/node# cat node1/memtier
>> >>>>> 2
>> >>>>> :/sys/devices/system/node# echo 1 > node1/memtier
>> >>>>
>> >>>> Thanks a lot for your patchset.
>> >>>>
>> >>>> Per my understanding, we haven't reach consensus on
>> >>>>
>> >>>> - how to create the default memory tiers in kernel (via abstract
>> >>>>   distance provided by drivers?  Or use SLIT as the first step?)
>> >>>>
>> >>>> - how to override the default memory tiers from user space
>> >>>>
>> >>>> As in the following thread and email,
>> >>>>
>> >>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
>> >>>>
>> >>>> I think that we need to finalized on that firstly?
>> >>>
>> >>> I did list the proposal here
>> >>>
>> >>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>> >>>
>> >>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated
>> >>> if the user wants a different tier topology.
>> >>>
>> >>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200
>> >>>
>> >>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100.
>> >>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use
>> >>> to control the tier assignment this can be a range of memory tiers.
>> >>>
>> >>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update
>> >>> the memory tier assignment based on device attributes.
>> >>
>> >> Sorry for late reply.
>> >>
>> >> As the first step, it may be better to skip the parts that we haven't
>> >> reached consensus yet, for example, the user space interface to override
>> >> the default memory tiers.  And we can use 0, 1, 2 as the default memory
>> >> tier IDs.  We can refine/revise the in-kernel implementation, but we
>> >> cannot change the user space ABI.
>> >>
>> >
>> > Can you help list the use case that will be broken by using tierID as outlined in this series?
>> > One of the details that were mentioned earlier was the need to track top-tier memory usage in a
>> > memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com
>> > can work with tier IDs too. Let me know if you think otherwise. So at this point
>> > I am not sure which area we are still debating w.r.t the userspace interface.
>> >
>> > I will still keep the default tier IDs with a large range between them. That will allow
>> > us to go back to tierID based demotion order if we can. That is much simpler than using tierID and rank
>> > together. If we still want to go back to rank based approach the tierID value won't have much
>> > meaning anyway.
>> >
>> > Any feedback on patches 1 - 5, so that I can request Andrew to merge them?
>> >
>>
>> Looking at this again, I guess we just need to drop patch 7
>> mm/demotion: Add per node memory tier attribute to sysfs ?
>>
>> We do agree to use the device model to expose memory tiers to userspace so patch 6 can still be included.
>> It also exposes max_tier, default_tier, and node list of a memory tier. All these are useful
>> and agreed upon. Hence patch 6 can be merged?
>>
>> patch 8 - 10 -> are done based on the request from others and is independent of how memory tiers
>> are exposed/created from userspace. Hence that can be merged?
>>
>> If you agree I can rebase the series moving patch 7,11,12 as the last patches in the series so
>> that we can skip merging them based on what we conclude w.r.t usage of rank.
>
> I think the most controversial part is the user visible interfaces so
> far. And IIUC the series could be split roughly into two parts, patch
> 1 - 5 and others. The patch 1 -5 added the explicit memory tier
> support and fixed the issue reported by Jagdish. I think we are on the
> same page for this part. But I haven't seen any thorough review on
> those patches yet since we got distracted by spending most time
> discussing about the user visible interfaces.
>
> So would it help to move things forward to submit patch 1 - 5 as a
> standalone series to get thorough review then get merged?
Yes.  I think this is a good idea.  We can discuss the in-kernel
implementation (without the user space interface) in detail and try to
get it merged.

And we can continue our discussion of the user space interface in a
separate thread.
Best Regards,
Huang, Ying
^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: [PATCH v8 00/12] mm/demotion: Memory tiers and demotion
  2022-07-13  3:42             ` Huang, Ying
@ 2022-07-13  6:38               ` Wei Xu
  2022-07-13  6:39               ` Wei Xu
  2022-07-13  7:25               ` Aneesh Kumar K V
  2 siblings, 0 replies; 42+ messages in thread
From: Wei Xu @ 2022-07-13  6:38 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Yang Shi, Aneesh Kumar K V, Linux MM, Andrew Morton,
	Davidlohr Bueso, Tim C Chen, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Johannes Weiner,
	jvgediya.oss
On Tue, Jul 12, 2022 at 8:42 PM Huang, Ying <ying.huang@intel.com> wrote:
> Yang Shi <shy828301@gmail.com> writes:
>
> > On Mon, Jul 11, 2022 at 10:10 PM Aneesh Kumar K V
> > <aneesh.kumar@linux.ibm.com> wrote:
> >>
> >> On 7/12/22 10:12 AM, Aneesh Kumar K V wrote:
> >> > On 7/12/22 6:46 AM, Huang, Ying wrote:
> >> >> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> >> >>
> >> >>> On 7/5/22 9:59 AM, Huang, Ying wrote:
> >> >>>> Hi, Aneesh,
> >> >>>>
> >> >>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> >> >>>>
> >> >>>>> The current kernel has the basic memory tiering support: Inactive
> >> >>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower
> >> >>>>> tier NUMA node to make room for new allocations on the higher tier
> >> >>>>> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> >> >>>>> migrated (promoted) to a higher tier NUMA node to improve the
> >> >>>>> performance.
> >> >>>>>
> >> >>>>> In the current kernel, memory tiers are defined implicitly via a
> >> >>>>> demotion path relationship between NUMA nodes, which is created during
> >> >>>>> the kernel initialization and updated when a NUMA node is hot-added or
> >> >>>>> hot-removed.  The current implementation puts all nodes with CPU into
> >> >>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing
> >> >>>>> the per-node demotion targets based on the distances between nodes.
> >> >>>>>
> >> >>>>> This current memory tier kernel interface needs to be improved for
> >> >>>>> several important use cases:
> >> >>>>>
> >> >>>>> * The current tier initialization code always initializes
> >> >>>>>   each memory-only NUMA node into a lower tier.  But a memory-only
> >> >>>>>   NUMA node may have a high performance memory device (e.g. a DRAM
> >> >>>>>   device attached via CXL.mem or a DRAM-backed memory-only node on
> >> >>>>>   a virtual machine) and should be put into a higher tier.
> >> >>>>>
> >> >>>>> * The current tier hierarchy always puts CPU nodes into the top
> >> >>>>>   tier. But on a system with HBM (e.g. GPU memory) devices, these
> >> >>>>>   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> >> >>>>>   with CPUs are better to be placed into the next lower tier.
> >> >>>>>
> >> >>>>> * Also because the current tier hierarchy always puts CPU nodes
> >> >>>>>   into the top tier, when a CPU is hot-added (or hot-removed) and
> >> >>>>>   triggers a memory node from CPU-less into a CPU node (or vice
> >> >>>>>   versa), the memory tier hierarchy gets changed, even though no
> >> >>>>>   memory node is added or removed.  This can make the tier
> >> >>>>>   hierarchy unstable and make it difficult to support tier-based
> >> >>>>>   memory accounting.
> >> >>>>>
> >> >>>>> * A higher tier node can only be demoted to selected nodes on the
> >> >>>>>   next lower tier as defined by the demotion path, not any other
> >> >>>>>   node from any lower tier.  This strict, hard-coded demotion order
> >> >>>>>   does not work in all use cases (e.g. some use cases may want to
> >> >>>>>   allow cross-socket demotion to another node in the same demotion
> >> >>>>>   tier as a fallback when the preferred demotion node is out of
> >> >>>>>   space), and has resulted in the feature request for an interface to
> >> >>>>>   override the system-wide, per-node demotion order from the
> >> >>>>>   userspace.  This demotion order is also inconsistent with the page
> >> >>>>>   allocation fallback order when all the nodes in a higher tier are
> >> >>>>>   out of space: The page allocation can fall back to any node from
> >> >>>>>   any lower tier, whereas the demotion order doesn't allow that.
> >> >>>>>
> >> >>>>> * There are no interfaces for the userspace to learn about the memory
> >> >>>>>   tier hierarchy in order to optimize its memory allocations.
> >> >>>>>
> >> >>>>> This patch series make the creation of memory tiers explicit under
> >> >>>>> the control of userspace or device driver.
> >> >>>>>
> >> >>>>> Memory Tier Initialization
> >> >>>>> ==========================
> >> >>>>>
> >> >>>>> By default, all memory nodes are assigned to the default tier with
> >> >>>>> tier ID value 200.
> >> >>>>>
> >> >>>>> A device driver can move up or down its memory nodes from the default
> >> >>>>> tier.  For example, PMEM can move down its memory nodes below the
> >> >>>>> default tier, whereas GPU can move up its memory nodes above the
> >> >>>>> default tier.
> >> >>>>>
> >> >>>>> The kernel initialization code makes the decision on which exact tier
> >> >>>>> a memory node should be assigned to based on the requests from the
> >> >>>>> device drivers as well as the memory device hardware information
> >> >>>>> provided by the firmware.
> >> >>>>>
> >> >>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
> >> >>>>>
> >> >>>>> Memory Allocation for Demotion
> >> >>>>> ==============================
> >> >>>>> This patch series keep the demotion target page allocation logic same.
> >> >>>>> The demotion page allocation pick the closest NUMA node in the
> >> >>>>> next lower tier to the current NUMA node allocating pages from.
> >> >>>>>
> >> >>>>> This will be later improved to use the same page allocation strategy
> >> >>>>> using fallback list.
> >> >>>>>
> >> >>>>> Sysfs Interface:
> >> >>>>> -------------
> >> >>>>> Listing current list of memory tiers details:
> >> >>>>>
> >> >>>>> :/sys/devices/system/memtier$ ls
> >> >>>>> default_tier max_tier  memtier1  power  uevent
> >> >>>>> :/sys/devices/system/memtier$ cat default_tier
> >> >>>>> memtier200
> >> >>>>> :/sys/devices/system/memtier$ cat max_tier
> >> >>>>> 400
> >> >>>>> :/sys/devices/system/memtier$
> >> >>>>>
> >> >>>>> Per node memory tier details:
> >> >>>>>
> >> >>>>> For a cpu only NUMA node:
> >> >>>>>
> >> >>>>> :/sys/devices/system/node# cat node0/memtier
> >> >>>>> :/sys/devices/system/node# echo 1 > node0/memtier
> >> >>>>> :/sys/devices/system/node# cat node0/memtier
> >> >>>>> :/sys/devices/system/node#
> >> >>>>>
> >> >>>>> For a NUMA node with memory:
> >> >>>>> :/sys/devices/system/node# cat node1/memtier
> >> >>>>> 1
> >> >>>>> :/sys/devices/system/node# ls ../memtier/
> >> >>>>> default_tier  max_tier  memtier1  power  uevent
> >> >>>>> :/sys/devices/system/node# echo 2 > node1/memtier
> >> >>>>> :/sys/devices/system/node#
> >> >>>>> :/sys/devices/system/node# ls ../memtier/
> >> >>>>> default_tier  max_tier  memtier1  memtier2  power  uevent
> >> >>>>> :/sys/devices/system/node# cat node1/memtier
> >> >>>>> 2
> >> >>>>> :/sys/devices/system/node#
> >> >>>>>
> >> >>>>> Removing a memory tier
> >> >>>>> :/sys/devices/system/node# cat node1/memtier
> >> >>>>> 2
> >> >>>>> :/sys/devices/system/node# echo 1 > node1/memtier
> >> >>>>
> >> >>>> Thanks a lot for your patchset.
> >> >>>>
> >> >>>> Per my understanding, we haven't reach consensus on
> >> >>>>
> >> >>>> - how to create the default memory tiers in kernel (via abstract
> >> >>>>   distance provided by drivers?  Or use SLIT as the first step?)
> >> >>>>
> >> >>>> - how to override the default memory tiers from user space
> >> >>>>
> >> >>>> As in the following thread and email,
> >> >>>>
> >> >>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
> >> >>>>
> >> >>>> I think that we need to finalized on that firstly?
> >> >>>
> >> >>> I did list the proposal here
> >> >>>
> >> >>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
> >> >>>
> >> >>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated
> >> >>> if the user wants a different tier topology.
> >> >>>
> >> >>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200
> >> >>>
> >> >>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100.
> >> >>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use
> >> >>> to control the tier assignment this can be a range of memory tiers.
> >> >>>
> >> >>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update
> >> >>> the memory tier assignment based on device attributes.
> >> >>
> >> >> Sorry for late reply.
> >> >>
> >> >> As the first step, it may be better to skip the parts that we haven't
> >> >> reached consensus yet, for example, the user space interface to override
> >> >> the default memory tiers.  And we can use 0, 1, 2 as the default memory
> >> >> tier IDs.  We can refine/revise the in-kernel implementation, but we
> >> >> cannot change the user space ABI.
> >> >>
> >> >
> >> > Can you help list the use case that will be broken by using tierID as outlined in this series?
> >> > One of the details that were mentioned earlier was the need to track top-tier memory usage in a
> >> > memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com
> >> > can work with tier IDs too. Let me know if you think otherwise. So at this point
> >> > I am not sure which area we are still debating w.r.t the userspace interface.
> >> >
> >> > I will still keep the default tier IDs with a large range between them. That will allow
> >> > us to go back to tierID based demotion order if we can. That is much simpler than using tierID and rank
> >> > together. If we still want to go back to rank based approach the tierID value won't have much
> >> > meaning anyway.
> >> >
> >> > Any feedback on patches 1 - 5, so that I can request Andrew to merge them?
> >> >
> >>
> >> Looking at this again, I guess we just need to drop patch 7
> >> mm/demotion: Add per node memory tier attribute to sysfs ?
> >>
> >> We do agree to use the device model to expose memory tiers to userspace so patch 6 can still be included.
> >> It also exposes max_tier, default_tier, and node list of a memory tier. All these are useful
> >> and agreed upon. Hence patch 6 can be merged?
> >>
> >> patch 8 - 10 -> are done based on the request from others and is independent of how memory tiers
> >> are exposed/created from userspace. Hence that can be merged?
> >>
> >> If you agree I can rebase the series moving patch 7,11,12 as the last patches in the series so
> >> that we can skip merging them based on what we conclude w.r.t usage of rank.
> >
> > I think the most controversial part is the user visible interfaces so
> > far. And IIUC the series could be split roughly into two parts, patch
> > 1 - 5 and others. The patch 1 -5 added the explicit memory tier
> > support and fixed the issue reported by Jagdish. I think we are on the
> > same page for this part. But I haven't seen any thorough review on
> > those patches yet since we got distracted by spending most time
> > discussing about the user visible interfaces.
> >
> > So would it help to move things forward to submit patch 1 - 5 as a
> > standalone series to get thorough review then get merged?
>
> Yes.  I think this is a good idea.  We can discuss the in kernel
> implementation (without user space interface) in details and try to make
> it merged.
>
> And we can continue our discussion of user space interface in a separate
> thread.
>
> Best Regards,
> Huang, Ying
>
>
I also agree that it is a good idea to split this patch series into the
kernel and userspace parts.
The current sysfs interface provides more dynamic memtiers than I had
expected.  Let's discuss that further after the kernel space changes
are finalized.
Wei
* Re: [PATCH v8 00/12] mm/demotion: Memory tiers and demotion
  2022-07-13  3:42             ` Huang, Ying
  2022-07-13  6:38               ` Wei Xu
@ 2022-07-13  6:39               ` Wei Xu
  2022-07-13  7:25               ` Aneesh Kumar K V
  2 siblings, 0 replies; 42+ messages in thread
From: Wei Xu @ 2022-07-13  6:39 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Yang Shi, Aneesh Kumar K V, Linux MM, Andrew Morton,
	Davidlohr Bueso, Tim C Chen, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Johannes Weiner,
	jvgediya.oss
On Tue, Jul 12, 2022 at 8:42 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yang Shi <shy828301@gmail.com> writes:
>
> > On Mon, Jul 11, 2022 at 10:10 PM Aneesh Kumar K V
> > <aneesh.kumar@linux.ibm.com> wrote:
> >>
> >> On 7/12/22 10:12 AM, Aneesh Kumar K V wrote:
> >> > On 7/12/22 6:46 AM, Huang, Ying wrote:
> >> >> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> >> >>
> >> >>> On 7/5/22 9:59 AM, Huang, Ying wrote:
> >> >>>> Hi, Aneesh,
> >> >>>>
> >> >>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> >> >>>>
> >> >>>>> The current kernel has the basic memory tiering support: Inactive
> >> >>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower
> >> >>>>> tier NUMA node to make room for new allocations on the higher tier
> >> >>>>> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> >> >>>>> migrated (promoted) to a higher tier NUMA node to improve the
> >> >>>>> performance.
> >> >>>>>
> >> >>>>> In the current kernel, memory tiers are defined implicitly via a
> >> >>>>> demotion path relationship between NUMA nodes, which is created during
> >> >>>>> the kernel initialization and updated when a NUMA node is hot-added or
> >> >>>>> hot-removed.  The current implementation puts all nodes with CPU into
> >> >>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing
> >> >>>>> the per-node demotion targets based on the distances between nodes.
> >> >>>>>
> >> >>>>> This current memory tier kernel interface needs to be improved for
> >> >>>>> several important use cases:
> >> >>>>>
> >> >>>>> * The current tier initialization code always initializes
> >> >>>>>   each memory-only NUMA node into a lower tier.  But a memory-only
> >> >>>>>   NUMA node may have a high performance memory device (e.g. a DRAM
> >> >>>>>   device attached via CXL.mem or a DRAM-backed memory-only node on
> >> >>>>>   a virtual machine) and should be put into a higher tier.
> >> >>>>>
> >> >>>>> * The current tier hierarchy always puts CPU nodes into the top
> >> >>>>>   tier. But on a system with HBM (e.g. GPU memory) devices, these
> >> >>>>>   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> >> >>>>>   with CPUs are better to be placed into the next lower tier.
> >> >>>>>
> >> >>>>> * Also because the current tier hierarchy always puts CPU nodes
> >> >>>>>   into the top tier, when a CPU is hot-added (or hot-removed) and
> >> >>>>>   triggers a memory node from CPU-less into a CPU node (or vice
> >> >>>>>   versa), the memory tier hierarchy gets changed, even though no
> >> >>>>>   memory node is added or removed.  This can make the tier
> >> >>>>>   hierarchy unstable and make it difficult to support tier-based
> >> >>>>>   memory accounting.
> >> >>>>>
> >> >>>>> * A higher tier node can only be demoted to selected nodes on the
> >> >>>>>   next lower tier as defined by the demotion path, not any other
> >> >>>>>   node from any lower tier.  This strict, hard-coded demotion order
> >> >>>>>   does not work in all use cases (e.g. some use cases may want to
> >> >>>>>   allow cross-socket demotion to another node in the same demotion
> >> >>>>>   tier as a fallback when the preferred demotion node is out of
> >> >>>>>   space), and has resulted in the feature request for an interface to
> >> >>>>>   override the system-wide, per-node demotion order from the
> >> >>>>>   userspace.  This demotion order is also inconsistent with the page
> >> >>>>>   allocation fallback order when all the nodes in a higher tier are
> >> >>>>>   out of space: The page allocation can fall back to any node from
> >> >>>>>   any lower tier, whereas the demotion order doesn't allow that.
> >> >>>>>
> >> >>>>> * There are no interfaces for the userspace to learn about the memory
> >> >>>>>   tier hierarchy in order to optimize its memory allocations.
> >> >>>>>
> >> >>>>> This patch series make the creation of memory tiers explicit under
> >> >>>>> the control of userspace or device driver.
> >> >>>>>
> >> >>>>> Memory Tier Initialization
> >> >>>>> ==========================
> >> >>>>>
> >> >>>>> By default, all memory nodes are assigned to the default tier with
> >> >>>>> tier ID value 200.
> >> >>>>>
> >> >>>>> A device driver can move up or down its memory nodes from the default
> >> >>>>> tier.  For example, PMEM can move down its memory nodes below the
> >> >>>>> default tier, whereas GPU can move up its memory nodes above the
> >> >>>>> default tier.
> >> >>>>>
> >> >>>>> The kernel initialization code makes the decision on which exact tier
> >> >>>>> a memory node should be assigned to based on the requests from the
> >> >>>>> device drivers as well as the memory device hardware information
> >> >>>>> provided by the firmware.
> >> >>>>>
> >> >>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
> >> >>>>>
> >> >>>>> Memory Allocation for Demotion
> >> >>>>> ==============================
> >> >>>>> This patch series keep the demotion target page allocation logic same.
> >> >>>>> The demotion page allocation pick the closest NUMA node in the
> >> >>>>> next lower tier to the current NUMA node allocating pages from.
> >> >>>>>
> >> >>>>> This will be later improved to use the same page allocation strategy
> >> >>>>> using fallback list.
> >> >>>>>
> >> >>>>> Sysfs Interface:
> >> >>>>> -------------
> >> >>>>> Listing current list of memory tiers details:
> >> >>>>>
> >> >>>>> :/sys/devices/system/memtier$ ls
> >> >>>>> default_tier max_tier  memtier1  power  uevent
> >> >>>>> :/sys/devices/system/memtier$ cat default_tier
> >> >>>>> memtier200
> >> >>>>> :/sys/devices/system/memtier$ cat max_tier
> >> >>>>> 400
> >> >>>>> :/sys/devices/system/memtier$
> >> >>>>>
> >> >>>>> Per node memory tier details:
> >> >>>>>
> >> >>>>> For a cpu only NUMA node:
> >> >>>>>
> >> >>>>> :/sys/devices/system/node# cat node0/memtier
> >> >>>>> :/sys/devices/system/node# echo 1 > node0/memtier
> >> >>>>> :/sys/devices/system/node# cat node0/memtier
> >> >>>>> :/sys/devices/system/node#
> >> >>>>>
> >> >>>>> For a NUMA node with memory:
> >> >>>>> :/sys/devices/system/node# cat node1/memtier
> >> >>>>> 1
> >> >>>>> :/sys/devices/system/node# ls ../memtier/
> >> >>>>> default_tier  max_tier  memtier1  power  uevent
> >> >>>>> :/sys/devices/system/node# echo 2 > node1/memtier
> >> >>>>> :/sys/devices/system/node#
> >> >>>>> :/sys/devices/system/node# ls ../memtier/
> >> >>>>> default_tier  max_tier  memtier1  memtier2  power  uevent
> >> >>>>> :/sys/devices/system/node# cat node1/memtier
> >> >>>>> 2
> >> >>>>> :/sys/devices/system/node#
> >> >>>>>
> >> >>>>> Removing a memory tier
> >> >>>>> :/sys/devices/system/node# cat node1/memtier
> >> >>>>> 2
> >> >>>>> :/sys/devices/system/node# echo 1 > node1/memtier
> >> >>>>
> >> >>>> Thanks a lot for your patchset.
> >> >>>>
> >> >>>> Per my understanding, we haven't reach consensus on
> >> >>>>
> >> >>>> - how to create the default memory tiers in kernel (via abstract
> >> >>>>   distance provided by drivers?  Or use SLIT as the first step?)
> >> >>>>
> >> >>>> - how to override the default memory tiers from user space
> >> >>>>
> >> >>>> As in the following thread and email,
> >> >>>>
> >> >>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
> >> >>>>
> >> >>>> I think that we need to finalized on that firstly?
> >> >>>
> >> >>> I did list the proposal here
> >> >>>
> >> >>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
> >> >>>
> >> >>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated
> >> >>> if the user wants a different tier topology.
> >> >>>
> >> >>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200
> >> >>>
> >> >>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100.
> >> >>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use
> >> >>> to control the tier assignment this can be a range of memory tiers.
> >> >>>
> >> >>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update
> >> >>> the memory tier assignment based on device attributes.
> >> >>
> >> >> Sorry for late reply.
> >> >>
> >> >> As the first step, it may be better to skip the parts that we haven't
> >> >> reached consensus yet, for example, the user space interface to override
> >> >> the default memory tiers.  And we can use 0, 1, 2 as the default memory
> >> >> tier IDs.  We can refine/revise the in-kernel implementation, but we
> >> >> cannot change the user space ABI.
> >> >>
> >> >
> >> > Can you help list the use case that will be broken by using tierID as outlined in this series?
> >> > One of the details that were mentioned earlier was the need to track top-tier memory usage in a
> >> > memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com
> >> > can work with tier IDs too. Let me know if you think otherwise. So at this point
> >> > I am not sure which area we are still debating w.r.t the userspace interface.
> >> >
> >> > I will still keep the default tier IDs with a large range between them. That will allow
> >> > us to go back to tierID based demotion order if we can. That is much simpler than using tierID and rank
> >> > together. If we still want to go back to rank based approach the tierID value won't have much
> >> > meaning anyway.
> >> >
> >> > Any feedback on patches 1 - 5, so that I can request Andrew to merge them?
> >> >
> >>
> >> Looking at this again, I guess we just need to drop patch 7
> >> mm/demotion: Add per node memory tier attribute to sysfs ?
> >>
> >> We do agree to use the device model to expose memory tiers to userspace so patch 6 can still be included.
> >> It also exposes max_tier, default_tier, and node list of a memory tier. All these are useful
> >> and agreed upon. Hence patch 6 can be merged?
> >>
> >> patch 8 - 10 -> are done based on the request from others and is independent of how memory tiers
> >> are exposed/created from userspace. Hence that can be merged?
> >>
> >> If you agree I can rebase the series moving patch 7,11,12 as the last patches in the series so
> >> that we can skip merging them based on what we conclude w.r.t usage of rank.
> >
> > I think the most controversial part is the user visible interfaces so
> > far. And IIUC the series could be split roughly into two parts, patch
> > 1 - 5 and others. The patch 1 -5 added the explicit memory tier
> > support and fixed the issue reported by Jagdish. I think we are on the
> > same page for this part. But I haven't seen any thorough review on
> > those patches yet since we got distracted by spending most time
> > discussing about the user visible interfaces.
> >
> > So would it help to move things forward to submit patch 1 - 5 as a
> > standalone series to get thorough review then get merged?
>
> Yes.  I think this is a good idea.  We can discuss the in kernel
> implementation (without user space interface) in details and try to make
> it merged.
>
> And we can continue our discussion of user space interface in a separate
> thread.
>
> Best Regards,
> Huang, Ying
>
I also agree that it is a good idea to split this patch series into
the kernel and userspace parts.
The current sysfs interface provides more dynamic memtiers than I had
expected.  Let's discuss that further after the kernel space changes
are finalized.
Wei
* Re: [PATCH v8 00/12] mm/demotion: Memory tiers and demotion
  2022-07-13  2:59                 ` Huang, Ying
@ 2022-07-13  6:46                   ` Wei Xu
  2022-07-13  8:17                     ` Huang, Ying
  2022-07-13  9:44                     ` Aneesh Kumar K.V
  2022-07-13  9:40                   ` Aneesh Kumar K.V
  1 sibling, 2 replies; 42+ messages in thread
From: Wei Xu @ 2022-07-13  6:46 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Aneesh Kumar K V, Johannes Weiner, Linux MM, Andrew Morton,
	Yang Shi, Davidlohr Bueso, Tim C Chen, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, jvgediya.oss
On Tue, Jul 12, 2022 at 8:03 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>
> > On 7/12/22 2:18 PM, Huang, Ying wrote:
> >> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> >>
> >>> On 7/12/22 12:29 PM, Huang, Ying wrote:
> >>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> >>>>
> >>>>> On 7/12/22 6:46 AM, Huang, Ying wrote:
> >>>>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> >>>>>>
> >>>>>>> On 7/5/22 9:59 AM, Huang, Ying wrote:
> >>>>>>>> Hi, Aneesh,
> >>>>>>>>
> >>>>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> >>>>>>>>
> >>>>>>>>> The current kernel has the basic memory tiering support: Inactive
> >>>>>>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower
> >>>>>>>>> tier NUMA node to make room for new allocations on the higher tier
> >>>>>>>>> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> >>>>>>>>> migrated (promoted) to a higher tier NUMA node to improve the
> >>>>>>>>> performance.
> >>>>>>>>>
> >>>>>>>>> In the current kernel, memory tiers are defined implicitly via a
> >>>>>>>>> demotion path relationship between NUMA nodes, which is created during
> >>>>>>>>> the kernel initialization and updated when a NUMA node is hot-added or
> >>>>>>>>> hot-removed.  The current implementation puts all nodes with CPU into
> >>>>>>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing
> >>>>>>>>> the per-node demotion targets based on the distances between nodes.
> >>>>>>>>>
> >>>>>>>>> This current memory tier kernel interface needs to be improved for
> >>>>>>>>> several important use cases:
> >>>>>>>>>
> >>>>>>>>> * The current tier initialization code always initializes
> >>>>>>>>>   each memory-only NUMA node into a lower tier.  But a memory-only
> >>>>>>>>>   NUMA node may have a high performance memory device (e.g. a DRAM
> >>>>>>>>>   device attached via CXL.mem or a DRAM-backed memory-only node on
> >>>>>>>>>   a virtual machine) and should be put into a higher tier.
> >>>>>>>>>
> >>>>>>>>> * The current tier hierarchy always puts CPU nodes into the top
> >>>>>>>>>   tier. But on a system with HBM (e.g. GPU memory) devices, these
> >>>>>>>>>   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> >>>>>>>>>   with CPUs are better to be placed into the next lower tier.
> >>>>>>>>>
> >>>>>>>>> * Also because the current tier hierarchy always puts CPU nodes
> >>>>>>>>>   into the top tier, when a CPU is hot-added (or hot-removed) and
> >>>>>>>>>   triggers a memory node from CPU-less into a CPU node (or vice
> >>>>>>>>>   versa), the memory tier hierarchy gets changed, even though no
> >>>>>>>>>   memory node is added or removed.  This can make the tier
> >>>>>>>>>   hierarchy unstable and make it difficult to support tier-based
> >>>>>>>>>   memory accounting.
> >>>>>>>>>
> >>>>>>>>> * A higher tier node can only be demoted to selected nodes on the
> >>>>>>>>>   next lower tier as defined by the demotion path, not any other
> >>>>>>>>>   node from any lower tier.  This strict, hard-coded demotion order
> >>>>>>>>>   does not work in all use cases (e.g. some use cases may want to
> >>>>>>>>>   allow cross-socket demotion to another node in the same demotion
> >>>>>>>>>   tier as a fallback when the preferred demotion node is out of
> >>>>>>>>>   space), and has resulted in the feature request for an interface to
> >>>>>>>>>   override the system-wide, per-node demotion order from the
> >>>>>>>>>   userspace.  This demotion order is also inconsistent with the page
> >>>>>>>>>   allocation fallback order when all the nodes in a higher tier are
> >>>>>>>>>   out of space: The page allocation can fall back to any node from
> >>>>>>>>>   any lower tier, whereas the demotion order doesn't allow that.
> >>>>>>>>>
> >>>>>>>>> * There are no interfaces for the userspace to learn about the memory
> >>>>>>>>>   tier hierarchy in order to optimize its memory allocations.
> >>>>>>>>>
> >>>>>>>>> This patch series make the creation of memory tiers explicit under
> >>>>>>>>> the control of userspace or device driver.
> >>>>>>>>>
> >>>>>>>>> Memory Tier Initialization
> >>>>>>>>> ==========================
> >>>>>>>>>
> >>>>>>>>> By default, all memory nodes are assigned to the default tier with
> >>>>>>>>> tier ID value 200.
> >>>>>>>>>
> >>>>>>>>> A device driver can move up or down its memory nodes from the default
> >>>>>>>>> tier.  For example, PMEM can move down its memory nodes below the
> >>>>>>>>> default tier, whereas GPU can move up its memory nodes above the
> >>>>>>>>> default tier.
> >>>>>>>>>
> >>>>>>>>> The kernel initialization code makes the decision on which exact tier
> >>>>>>>>> a memory node should be assigned to based on the requests from the
> >>>>>>>>> device drivers as well as the memory device hardware information
> >>>>>>>>> provided by the firmware.
> >>>>>>>>>
> >>>>>>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
> >>>>>>>>>
> >>>>>>>>> Memory Allocation for Demotion
> >>>>>>>>> ==============================
> >>>>>>>>> This patch series keep the demotion target page allocation logic same.
> >>>>>>>>> The demotion page allocation pick the closest NUMA node in the
> >>>>>>>>> next lower tier to the current NUMA node allocating pages from.
> >>>>>>>>>
> >>>>>>>>> This will be later improved to use the same page allocation strategy
> >>>>>>>>> using fallback list.
> >>>>>>>>>
> >>>>>>>>> Sysfs Interface:
> >>>>>>>>> -------------
> >>>>>>>>> Listing current list of memory tiers details:
> >>>>>>>>>
> >>>>>>>>> :/sys/devices/system/memtier$ ls
> >>>>>>>>> default_tier max_tier  memtier1  power  uevent
> >>>>>>>>> :/sys/devices/system/memtier$ cat default_tier
> >>>>>>>>> memtier200
> >>>>>>>>> :/sys/devices/system/memtier$ cat max_tier
> >>>>>>>>> 400
> >>>>>>>>> :/sys/devices/system/memtier$
> >>>>>>>>>
> >>>>>>>>> Per node memory tier details:
> >>>>>>>>>
> >>>>>>>>> For a cpu only NUMA node:
> >>>>>>>>>
> >>>>>>>>> :/sys/devices/system/node# cat node0/memtier
> >>>>>>>>> :/sys/devices/system/node# echo 1 > node0/memtier
> >>>>>>>>> :/sys/devices/system/node# cat node0/memtier
> >>>>>>>>> :/sys/devices/system/node#
> >>>>>>>>>
> >>>>>>>>> For a NUMA node with memory:
> >>>>>>>>> :/sys/devices/system/node# cat node1/memtier
> >>>>>>>>> 1
> >>>>>>>>> :/sys/devices/system/node# ls ../memtier/
> >>>>>>>>> default_tier  max_tier  memtier1  power  uevent
> >>>>>>>>> :/sys/devices/system/node# echo 2 > node1/memtier
> >>>>>>>>> :/sys/devices/system/node#
> >>>>>>>>> :/sys/devices/system/node# ls ../memtier/
> >>>>>>>>> default_tier  max_tier  memtier1  memtier2  power  uevent
> >>>>>>>>> :/sys/devices/system/node# cat node1/memtier
> >>>>>>>>> 2
> >>>>>>>>> :/sys/devices/system/node#
> >>>>>>>>>
> >>>>>>>>> Removing a memory tier
> >>>>>>>>> :/sys/devices/system/node# cat node1/memtier
> >>>>>>>>> 2
> >>>>>>>>> :/sys/devices/system/node# echo 1 > node1/memtier
> >>>>>>>>
> >>>>>>>> Thanks a lot for your patchset.
> >>>>>>>>
> >>>>>>>> Per my understanding, we haven't reach consensus on
> >>>>>>>>
> >>>>>>>> - how to create the default memory tiers in kernel (via abstract
> >>>>>>>>   distance provided by drivers?  Or use SLIT as the first step?)
> >>>>>>>>
> >>>>>>>> - how to override the default memory tiers from user space
> >>>>>>>>
> >>>>>>>> As in the following thread and email,
> >>>>>>>>
> >>>>>>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
> >>>>>>>>
> >>>>>>>> I think that we need to finalized on that firstly?
> >>>>>>>
> >>>>>>> I did list the proposal here
> >>>>>>>
> >>>>>>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
> >>>>>>>
> >>>>>>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated
> >>>>>>> if the user wants a different tier topology.
> >>>>>>>
> >>>>>>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200
> >>>>>>>
> >>>>>>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100.
> >>>>>>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use
> >>>>>>> to control the tier assignment this can be a range of memory tiers.
> >>>>>>>
> >>>>>>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update
> >>>>>>> the memory tier assignment based on device attributes.
> >>>>>>
> >>>>>> Sorry for late reply.
> >>>>>>
> >>>>>> As the first step, it may be better to skip the parts that we haven't
> >>>>>> reached consensus yet, for example, the user space interface to override
> >>>>>> the default memory tiers.  And we can use 0, 1, 2 as the default memory
> >>>>>> tier IDs.  We can refine/revise the in-kernel implementation, but we
> >>>>>> cannot change the user space ABI.
> >>>>>>
> >>>>>
> >>>>> Can you help list the use case that will be broken by using tierID as outlined in this series?
> >>>>> One of the details that were mentioned earlier was the need to track top-tier memory usage in a
> >>>>> memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com
> >>>>> can work with tier IDs too. Let me know if you think otherwise. So at this point
> >>>>> I am not sure which area we are still debating w.r.t the userspace interface.
> >>>>
> >>>> In
> >>>>
> >>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
> >>>>
> >>>> per my understanding, Johannes suggested to override the kernel default
> >>>> memory tiers with "abstract distance" via drivers implementing memory
> >>>> devices.  As you said in another email, that is related to [7/12] of the
> >>>> series.  And we can table it for future.
> >>>>
> >>>> And per my understanding, he also suggested to make memory tier IDs
> >>>> dynamic.  For example, after the "abstract distance" of a driver is
> >>>> overridden by users, the total number of memory tiers may be changed,
> >>>> and the memory tier ID of some nodes may be changed too.  This will make
> >>>> memory tier ID easier to be understood, but more unstable.  For example,
> >>>> this will make it harder to specify the per-memory-tier memory partition
> >>>> for a cgroup.
> >>>>
> >>>
> >>> With all the approaches we discussed so far, a memory tier of a numa node can be changed.
> >>> ie, pgdat->memtier can change anytime. The per memcg top tier mem usage tracking patches
> >>> posted here
> >>> https://lore.kernel.org/linux-mm/cefeb63173fa0fac7543315a2abbd4b5a1b25af8.1655242024.git.tim.c.chen@linux.intel.com/
> >>> doesn't consider the node movement from one memory tier to another. If we need
> >>> a stable pgdat->memtier we will have to prevent a node memory tier reassignment
> >>> while we have pages from the memory tier charged to a cgroup. This patchset should not
> >>> prevent such a restriction.
> >>
> >> Absolute stability doesn't exist even in the "rank" based solution.  But
> >> "rank" can improve stability to some degree.  For example, if we
> >> move the tier of HBM nodes (from below DRAM to above DRAM), the DRAM
> >> nodes can keep their memory tier ID stable.  This may not be a real
> >> issue in the end.  But we need to discuss that.
> >>
> >
> > I agree that using ranks gives us the flexibility to change demotion order
> > without being blocked by cgroup usage. But how frequently do we expect the
> > tier assignment to change? My expectation was these reassignments are going
> > to be rare and won't happen frequently after a system is up and running?
> > Hence using tierID for demotion order won't prevent a node reassignment
> > much because we don't expect to change the node tierID during runtime. In
> > the rare case we do, we will have to make sure there is no cgroup usage from
> > the specific memory tier.
> >
> > Even if we use ranks, we will have to avoid a rank update, if such
> > an update can change the meaning of top tier? ie, if a rank update
> > can result in a node being moved from top tier to non top tier.
> >
> >> Tim has suggested to use top-tier(s) memory partition among cgroups.
> >> But I don't think that has been finalized.  We may use per-memory-tier
> >> memory partition among cgroups.  I don't know whether Wei will use that
> >> (may be implemented in the user space).
> >>
> >> And, if we think stableness between nodes and memory tier ID isn't
> >> important, why should we use sparse memory device IDs (that is, 100,
> >> 200, 300)?  Why not just 0, 1, 2, ...?  That looks more natural.
> >>
> >
> >
> > The range allows us to use memtier ID for demotion order. ie, as we start initializing
> > devices with different attributes via dax kmem, there will be a desire to
> > assign them to different tierIDs. Having default memtier ID (DRAM) at 200 enables
> > us to put these devices in the range [0 - 200) without updating the node to memtier
> > mapping of existing NUMA nodes (ie, without updating default memtier).
>
> I believe that sparse memory tier IDs can make memory tiers more stable
> in some cases.  But this is different from the system suggested by
> Johannes.  Per my understanding, in Johannes' system,
>
> - one driver may online different memory types (for example, kmem_dax
>   may online HBM, PMEM, etc.)
>
> - one memory type manages several memory nodes (NUMA nodes)
>
> - there is one "abstract distance" for each memory type
>
> - the "abstract distance" can be offset by a user space override knob
>
> - memory tiers are generated dynamically from the different memory
>   types according to their "abstract distance" and overridden "offset"
>
> - the granularity used to group several memory types into one memory
>   tier can be overridden via a user space knob
>
> In this way, the memory tiers may change completely after a user space
> override.  It may be hard to link the memory tiers before and after the
> override.  So we may need to reset all per-memory-tier configuration,
> such as the cgroup partition limit or interleave weight.
>
> Personally, I think the system above makes sense.  But I think we need
> to check whether it satisfies the requirements.
>
> Best Regards,
> Huang, Ying
>
The "memory type" and "abstract distance" concepts sound to me similar
to the memory tier "rank" idea.
We can have some well-defined type/distance/rank values, e.g. HBM,
DRAM, CXL_DRAM, PMEM, CXL_PMEM, which a device can register with.  The
memory tiers will be built from these values.  Whether and how to
collapse several values into a single tier can be made configurable.
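
To make the discussion concrete, here is a minimal sketch of how tiers
could be derived from per-type distance values.  It is illustrative
Python, not code from this series: the numeric distances, the
build_tiers() helper, and its "offsets" and "granularity" parameters are
all hypothetical stand-ins for the driver-registered values and the
user space knobs described above.  Smaller distance is assumed to mean
faster memory.

```python
# Hypothetical abstract-distance values per memory type (smaller = faster).
# These numbers are made up for illustration only.
ABSTRACT_DISTANCE = {
    "HBM": 10,
    "DRAM": 20,
    "CXL_DRAM": 25,
    "PMEM": 40,
    "CXL_PMEM": 45,
}


def build_tiers(node_types, offsets=None, granularity=10):
    """Group NUMA nodes into memory tiers, fastest tier first.

    node_types:  dict mapping node id -> memory type name
    offsets:     optional dict mapping memory type -> user override that
                 is added to the type's abstract distance
    granularity: width of the distance window collapsed into one tier
    """
    offsets = offsets or {}
    buckets = {}
    for node, mtype in node_types.items():
        distance = ABSTRACT_DISTANCE[mtype] + offsets.get(mtype, 0)
        # Types whose distances fall into the same window share a tier.
        buckets.setdefault(distance // granularity, []).append(node)
    # Tiers are just the non-empty windows, ordered by distance.
    return [sorted(buckets[key]) for key in sorted(buckets)]


nodes = {0: "DRAM", 1: "DRAM", 2: "CXL_DRAM", 3: "PMEM", 4: "HBM"}

# Coarse granularity collapses DRAM and CXL_DRAM into one tier.
print(build_tiers(nodes))                        # [[4], [0, 1, 2], [3]]
# Finer granularity splits them into separate tiers.
print(build_tiers(nodes, granularity=5))         # [[4], [0, 1], [2], [3]]
# Offsetting HBM reorders the tiers, so tier positions are not stable.
print(build_tiers(nodes, offsets={"HBM": 20}))   # [[0, 1, 2], [4], [3]]
```

The last call shows the stability concern raised above: after an
override, the DRAM nodes move from the second tier to the first, so any
per-tier configuration keyed on tier position would have to be
revisited.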
Wei
* Re: [PATCH v8 00/12] mm/demotion: Memory tiers and demotion
  2022-07-13  3:42             ` Huang, Ying
  2022-07-13  6:38               ` Wei Xu
  2022-07-13  6:39               ` Wei Xu
@ 2022-07-13  7:25               ` Aneesh Kumar K V
  2022-07-13  8:20                 ` Huang, Ying
  2 siblings, 1 reply; 42+ messages in thread
From: Aneesh Kumar K V @ 2022-07-13  7:25 UTC (permalink / raw)
  To: Huang, Ying, Yang Shi
  Cc: Linux MM, Andrew Morton, Wei Xu, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss
On 7/13/22 9:12 AM, Huang, Ying wrote:
> Yang Shi <shy828301@gmail.com> writes:
> 
>> On Mon, Jul 11, 2022 at 10:10 PM Aneesh Kumar K V
>> <aneesh.kumar@linux.ibm.com> wrote:
>>>
>>> On 7/12/22 10:12 AM, Aneesh Kumar K V wrote:
>>>> On 7/12/22 6:46 AM, Huang, Ying wrote:
>>>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>>>>
>>>>>> On 7/5/22 9:59 AM, Huang, Ying wrote:
>>>>>>> Hi, Aneesh,
>>>>>>>
>>>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>>>>>
>>>>>>>> The current kernel has the basic memory tiering support: Inactive
>>>>>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower
>>>>>>>> tier NUMA node to make room for new allocations on the higher tier
>>>>>>>> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
>>>>>>>> migrated (promoted) to a higher tier NUMA node to improve the
>>>>>>>> performance.
>>>>>>>>
>>>>>>>> In the current kernel, memory tiers are defined implicitly via a
>>>>>>>> demotion path relationship between NUMA nodes, which is created during
>>>>>>>> the kernel initialization and updated when a NUMA node is hot-added or
>>>>>>>> hot-removed.  The current implementation puts all nodes with CPU into
>>>>>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing
>>>>>>>> the per-node demotion targets based on the distances between nodes.
>>>>>>>>
>>>>>>>> This current memory tier kernel interface needs to be improved for
>>>>>>>> several important use cases:
>>>>>>>>
>>>>>>>> * The current tier initialization code always initializes
>>>>>>>>   each memory-only NUMA node into a lower tier.  But a memory-only
>>>>>>>>   NUMA node may have a high performance memory device (e.g. a DRAM
>>>>>>>>   device attached via CXL.mem or a DRAM-backed memory-only node on
>>>>>>>>   a virtual machine) and should be put into a higher tier.
>>>>>>>>
>>>>>>>> * The current tier hierarchy always puts CPU nodes into the top
>>>>>>>>   tier. But on a system with HBM (e.g. GPU memory) devices, these
>>>>>>>>   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
>>>>>>>>   with CPUs are better to be placed into the next lower tier.
>>>>>>>>
>>>>>>>> * Also because the current tier hierarchy always puts CPU nodes
>>>>>>>>   into the top tier, when a CPU is hot-added (or hot-removed) and
>>>>>>>>   triggers a memory node from CPU-less into a CPU node (or vice
>>>>>>>>   versa), the memory tier hierarchy gets changed, even though no
>>>>>>>>   memory node is added or removed.  This can make the tier
>>>>>>>>   hierarchy unstable and make it difficult to support tier-based
>>>>>>>>   memory accounting.
>>>>>>>>
>>>>>>>> * A higher tier node can only be demoted to selected nodes on the
>>>>>>>>   next lower tier as defined by the demotion path, not any other
>>>>>>>>   node from any lower tier.  This strict, hard-coded demotion order
>>>>>>>>   does not work in all use cases (e.g. some use cases may want to
>>>>>>>>   allow cross-socket demotion to another node in the same demotion
>>>>>>>>   tier as a fallback when the preferred demotion node is out of
>>>>>>>>   space), and has resulted in the feature request for an interface to
>>>>>>>>   override the system-wide, per-node demotion order from the
>>>>>>>>   userspace.  This demotion order is also inconsistent with the page
>>>>>>>>   allocation fallback order when all the nodes in a higher tier are
>>>>>>>>   out of space: The page allocation can fall back to any node from
>>>>>>>>   any lower tier, whereas the demotion order doesn't allow that.
>>>>>>>>
>>>>>>>> * There are no interfaces for the userspace to learn about the memory
>>>>>>>>   tier hierarchy in order to optimize its memory allocations.
>>>>>>>>
>>>>>>>> This patch series make the creation of memory tiers explicit under
>>>>>>>> the control of userspace or device driver.
>>>>>>>>
>>>>>>>> Memory Tier Initialization
>>>>>>>> ==========================
>>>>>>>>
>>>>>>>> By default, all memory nodes are assigned to the default tier with
>>>>>>>> tier ID value 200.
>>>>>>>>
>>>>>>>> A device driver can move up or down its memory nodes from the default
>>>>>>>> tier.  For example, PMEM can move down its memory nodes below the
>>>>>>>> default tier, whereas GPU can move up its memory nodes above the
>>>>>>>> default tier.
>>>>>>>>
>>>>>>>> The kernel initialization code makes the decision on which exact tier
>>>>>>>> a memory node should be assigned to based on the requests from the
>>>>>>>> device drivers as well as the memory device hardware information
>>>>>>>> provided by the firmware.
>>>>>>>>
>>>>>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
>>>>>>>>
>>>>>>>> Memory Allocation for Demotion
>>>>>>>> ==============================
>>>>>>>> This patch series keep the demotion target page allocation logic same.
>>>>>>>> The demotion page allocation pick the closest NUMA node in the
>>>>>>>> next lower tier to the current NUMA node allocating pages from.
>>>>>>>>
>>>>>>>> This will be later improved to use the same page allocation strategy
>>>>>>>> using fallback list.
>>>>>>>>
>>>>>>>> Sysfs Interface:
>>>>>>>> -------------
>>>>>>>> Listing current list of memory tiers details:
>>>>>>>>
>>>>>>>> :/sys/devices/system/memtier$ ls
>>>>>>>> default_tier max_tier  memtier1  power  uevent
>>>>>>>> :/sys/devices/system/memtier$ cat default_tier
>>>>>>>> memtier200
>>>>>>>> :/sys/devices/system/memtier$ cat max_tier
>>>>>>>> 400
>>>>>>>> :/sys/devices/system/memtier$
>>>>>>>>
>>>>>>>> Per node memory tier details:
>>>>>>>>
>>>>>>>> For a cpu only NUMA node:
>>>>>>>>
>>>>>>>> :/sys/devices/system/node# cat node0/memtier
>>>>>>>> :/sys/devices/system/node# echo 1 > node0/memtier
>>>>>>>> :/sys/devices/system/node# cat node0/memtier
>>>>>>>> :/sys/devices/system/node#
>>>>>>>>
>>>>>>>> For a NUMA node with memory:
>>>>>>>> :/sys/devices/system/node# cat node1/memtier
>>>>>>>> 1
>>>>>>>> :/sys/devices/system/node# ls ../memtier/
>>>>>>>> default_tier  max_tier  memtier1  power  uevent
>>>>>>>> :/sys/devices/system/node# echo 2 > node1/memtier
>>>>>>>> :/sys/devices/system/node#
>>>>>>>> :/sys/devices/system/node# ls ../memtier/
>>>>>>>> default_tier  max_tier  memtier1  memtier2  power  uevent
>>>>>>>> :/sys/devices/system/node# cat node1/memtier
>>>>>>>> 2
>>>>>>>> :/sys/devices/system/node#
>>>>>>>>
>>>>>>>> Removing a memory tier
>>>>>>>> :/sys/devices/system/node# cat node1/memtier
>>>>>>>> 2
>>>>>>>> :/sys/devices/system/node# echo 1 > node1/memtier
>>>>>>>
>>>>>>> Thanks a lot for your patchset.
>>>>>>>
>>>>>>> Per my understanding, we haven't reach consensus on
>>>>>>>
>>>>>>> - how to create the default memory tiers in kernel (via abstract
>>>>>>>   distance provided by drivers?  Or use SLIT as the first step?)
>>>>>>>
>>>>>>> - how to override the default memory tiers from user space
>>>>>>>
>>>>>>> As in the following thread and email,
>>>>>>>
>>>>>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
>>>>>>>
>>>>>>> I think that we need to finalized on that firstly?
>>>>>>
>>>>>> I did list the proposal here
>>>>>>
>>>>>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>>>>>>
>>>>>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated
>>>>>> if the user wants a different tier topology.
>>>>>>
>>>>>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200
>>>>>>
>>>>>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100.
>>>>>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use
>>>>>> to control the tier assignment this can be a range of memory tiers.
>>>>>>
>>>>>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update
>>>>>> the memory tier assignment based on device attributes.
>>>>>
>>>>> Sorry for late reply.
>>>>>
>>>>> As the first step, it may be better to skip the parts that we haven't
>>>>> reached consensus yet, for example, the user space interface to override
>>>>> the default memory tiers.  And we can use 0, 1, 2 as the default memory
>>>>> tier IDs.  We can refine/revise the in-kernel implementation, but we
>>>>> cannot change the user space ABI.
>>>>>
>>>>
>>>> Can you help list the use case that will be broken by using tierID as outlined in this series?
>>>> One of the details that were mentioned earlier was the need to track top-tier memory usage in a
>>>> memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com
>>>> can work with tier IDs too. Let me know if you think otherwise. So at this point
>>>> I am not sure which area we are still debating w.r.t the userspace interface.
>>>>
>>>> I will still keep the default tier IDs with a large range between them. That will allow
>>>> us to go back to tierID based demotion order if we can. That is much simpler than using tierID and rank
>>>> together. If we still want to go back to rank based approach the tierID value won't have much
>>>> meaning anyway.
>>>>
>>>> Any feedback on patches 1 - 5, so that I can request Andrew to merge them?
>>>>
>>>
>>> Looking at this again, I guess we just need to drop patch 7
>>> mm/demotion: Add per node memory tier attribute to sysfs ?
>>>
>>> We do agree to use the device model to expose memory tiers to userspace so patch 6 can still be included.
>>> It also exposes max_tier, default_tier, and node list of a memory tier. All these are useful
>>> and agreed upon. Hence patch 6 can be merged?
>>>
>>> patch 8 - 10 -> are done based on the request from others and is independent of how memory tiers
>>> are exposed/created from userspace. Hence that can be merged?
>>>
>>> If you agree I can rebase the series moving patch 7,11,12 as the last patches in the series so
>>> that we can skip merging them based on what we conclude w.r.t usage of rank.
>>
>> I think the most controversial part is the user visible interfaces so
>> far. And IIUC the series could be split roughly into two parts, patch
>> 1 - 5 and others. The patch 1 -5 added the explicit memory tier
>> support and fixed the issue reported by Jagdish. I think we are on the
>> same page for this part. But I haven't seen any thorough review on
>> those patches yet since we got distracted by spending most time
>> discussing about the user visible interfaces.
>>
>> So would it help to move things forward to submit patch 1 - 5 as a
>> standalone series to get thorough review then get merged?
> 
> Yes.  I think this is a good idea.  We can discuss the in kernel
> implementation (without user space interface) in details and try to make
> it merged.
> 
> And we can continue our discussion of user space interface in a separate
> thread.
Thanks. I will post patches 1 - 5 as a series for review.
-aneesh
* Re: [PATCH v8 00/12] mm/demotion: Memory tiers and demotion
  2022-07-13  6:46                   ` Wei Xu
@ 2022-07-13  8:17                     ` Huang, Ying
  2022-07-19 14:00                       ` Jonathan Cameron
  2022-07-13  9:44                     ` Aneesh Kumar K.V
  1 sibling, 1 reply; 42+ messages in thread
From: Huang, Ying @ 2022-07-13  8:17 UTC (permalink / raw)
  To: Wei Xu
  Cc: Aneesh Kumar K V, Johannes Weiner, Linux MM, Andrew Morton,
	Yang Shi, Davidlohr Bueso, Tim C Chen, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, jvgediya.oss
Wei Xu <weixugc@google.com> writes:
> On Tue, Jul 12, 2022 at 8:03 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>
>> > On 7/12/22 2:18 PM, Huang, Ying wrote:
>> >> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>> >>
>> >>> On 7/12/22 12:29 PM, Huang, Ying wrote:
>> >>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>> >>>>
>> >>>>> On 7/12/22 6:46 AM, Huang, Ying wrote:
>> >>>>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>> >>>>>>
>> >>>>>>> On 7/5/22 9:59 AM, Huang, Ying wrote:
>> >>>>>>>> Hi, Aneesh,
>> >>>>>>>>
>> >>>>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>> >>>>>>>>
>> >>>>>>>>> The current kernel has the basic memory tiering support: Inactive
>> >>>>>>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower
>> >>>>>>>>> tier NUMA node to make room for new allocations on the higher tier
>> >>>>>>>>> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
>> >>>>>>>>> migrated (promoted) to a higher tier NUMA node to improve the
>> >>>>>>>>> performance.
>> >>>>>>>>>
>> >>>>>>>>> In the current kernel, memory tiers are defined implicitly via a
>> >>>>>>>>> demotion path relationship between NUMA nodes, which is created during
>> >>>>>>>>> the kernel initialization and updated when a NUMA node is hot-added or
>> >>>>>>>>> hot-removed.  The current implementation puts all nodes with CPU into
>> >>>>>>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing
>> >>>>>>>>> the per-node demotion targets based on the distances between nodes.
>> >>>>>>>>>
>> >>>>>>>>> This current memory tier kernel interface needs to be improved for
>> >>>>>>>>> several important use cases:
>> >>>>>>>>>
>> >>>>>>>>> * The current tier initialization code always initializes
>> >>>>>>>>>   each memory-only NUMA node into a lower tier.  But a memory-only
>> >>>>>>>>>   NUMA node may have a high performance memory device (e.g. a DRAM
>> >>>>>>>>>   device attached via CXL.mem or a DRAM-backed memory-only node on
>> >>>>>>>>>   a virtual machine) and should be put into a higher tier.
>> >>>>>>>>>
>> >>>>>>>>> * The current tier hierarchy always puts CPU nodes into the top
>> >>>>>>>>>   tier. But on a system with HBM (e.g. GPU memory) devices, these
>> >>>>>>>>>   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
>> >>>>>>>>>   with CPUs are better to be placed into the next lower tier.
>> >>>>>>>>>
>> >>>>>>>>> * Also because the current tier hierarchy always puts CPU nodes
>> >>>>>>>>>   into the top tier, when a CPU is hot-added (or hot-removed) and
>> >>>>>>>>>   triggers a memory node from CPU-less into a CPU node (or vice
>> >>>>>>>>>   versa), the memory tier hierarchy gets changed, even though no
>> >>>>>>>>>   memory node is added or removed.  This can make the tier
>> >>>>>>>>>   hierarchy unstable and make it difficult to support tier-based
>> >>>>>>>>>   memory accounting.
>> >>>>>>>>>
>> >>>>>>>>> * A higher tier node can only be demoted to selected nodes on the
>> >>>>>>>>>   next lower tier as defined by the demotion path, not any other
>> >>>>>>>>>   node from any lower tier.  This strict, hard-coded demotion order
>> >>>>>>>>>   does not work in all use cases (e.g. some use cases may want to
>> >>>>>>>>>   allow cross-socket demotion to another node in the same demotion
>> >>>>>>>>>   tier as a fallback when the preferred demotion node is out of
>> >>>>>>>>>   space), and has resulted in the feature request for an interface to
>> >>>>>>>>>   override the system-wide, per-node demotion order from the
>> >>>>>>>>>   userspace.  This demotion order is also inconsistent with the page
>> >>>>>>>>>   allocation fallback order when all the nodes in a higher tier are
>> >>>>>>>>>   out of space: The page allocation can fall back to any node from
>> >>>>>>>>>   any lower tier, whereas the demotion order doesn't allow that.
>> >>>>>>>>>
>> >>>>>>>>> * There are no interfaces for the userspace to learn about the memory
>> >>>>>>>>>   tier hierarchy in order to optimize its memory allocations.
>> >>>>>>>>>
>> >>>>>>>>> This patch series make the creation of memory tiers explicit under
>> >>>>>>>>> the control of userspace or device driver.
>> >>>>>>>>>
>> >>>>>>>>> Memory Tier Initialization
>> >>>>>>>>> ==========================
>> >>>>>>>>>
>> >>>>>>>>> By default, all memory nodes are assigned to the default tier with
>> >>>>>>>>> tier ID value 200.
>> >>>>>>>>>
>> >>>>>>>>> A device driver can move up or down its memory nodes from the default
>> >>>>>>>>> tier.  For example, PMEM can move down its memory nodes below the
>> >>>>>>>>> default tier, whereas GPU can move up its memory nodes above the
>> >>>>>>>>> default tier.
>> >>>>>>>>>
>> >>>>>>>>> The kernel initialization code makes the decision on which exact tier
>> >>>>>>>>> a memory node should be assigned to based on the requests from the
>> >>>>>>>>> device drivers as well as the memory device hardware information
>> >>>>>>>>> provided by the firmware.
>> >>>>>>>>>
>> >>>>>>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
>> >>>>>>>>>
>> >>>>>>>>> Memory Allocation for Demotion
>> >>>>>>>>> ==============================
>> >>>>>>>>> This patch series keep the demotion target page allocation logic same.
>> >>>>>>>>> The demotion page allocation pick the closest NUMA node in the
>> >>>>>>>>> next lower tier to the current NUMA node allocating pages from.
>> >>>>>>>>>
>> >>>>>>>>> This will be later improved to use the same page allocation strategy
>> >>>>>>>>> using fallback list.
>> >>>>>>>>>
>> >>>>>>>>> Sysfs Interface:
>> >>>>>>>>> -------------
>> >>>>>>>>> Listing current list of memory tiers details:
>> >>>>>>>>>
>> >>>>>>>>> :/sys/devices/system/memtier$ ls
>> >>>>>>>>> default_tier max_tier  memtier1  power  uevent
>> >>>>>>>>> :/sys/devices/system/memtier$ cat default_tier
>> >>>>>>>>> memtier200
>> >>>>>>>>> :/sys/devices/system/memtier$ cat max_tier
>> >>>>>>>>> 400
>> >>>>>>>>> :/sys/devices/system/memtier$
>> >>>>>>>>>
>> >>>>>>>>> Per node memory tier details:
>> >>>>>>>>>
>> >>>>>>>>> For a cpu only NUMA node:
>> >>>>>>>>>
>> >>>>>>>>> :/sys/devices/system/node# cat node0/memtier
>> >>>>>>>>> :/sys/devices/system/node# echo 1 > node0/memtier
>> >>>>>>>>> :/sys/devices/system/node# cat node0/memtier
>> >>>>>>>>> :/sys/devices/system/node#
>> >>>>>>>>>
>> >>>>>>>>> For a NUMA node with memory:
>> >>>>>>>>> :/sys/devices/system/node# cat node1/memtier
>> >>>>>>>>> 1
>> >>>>>>>>> :/sys/devices/system/node# ls ../memtier/
>> >>>>>>>>> default_tier  max_tier  memtier1  power  uevent
>> >>>>>>>>> :/sys/devices/system/node# echo 2 > node1/memtier
>> >>>>>>>>> :/sys/devices/system/node#
>> >>>>>>>>> :/sys/devices/system/node# ls ../memtier/
>> >>>>>>>>> default_tier  max_tier  memtier1  memtier2  power  uevent
>> >>>>>>>>> :/sys/devices/system/node# cat node1/memtier
>> >>>>>>>>> 2
>> >>>>>>>>> :/sys/devices/system/node#
>> >>>>>>>>>
>> >>>>>>>>> Removing a memory tier
>> >>>>>>>>> :/sys/devices/system/node# cat node1/memtier
>> >>>>>>>>> 2
>> >>>>>>>>> :/sys/devices/system/node# echo 1 > node1/memtier
>> >>>>>>>>
>> >>>>>>>> Thanks a lot for your patchset.
>> >>>>>>>>
>> >>>>>>>> Per my understanding, we haven't reach consensus on
>> >>>>>>>>
>> >>>>>>>> - how to create the default memory tiers in kernel (via abstract
>> >>>>>>>>   distance provided by drivers?  Or use SLIT as the first step?)
>> >>>>>>>>
>> >>>>>>>> - how to override the default memory tiers from user space
>> >>>>>>>>
>> >>>>>>>> As in the following thread and email,
>> >>>>>>>>
>> >>>>>>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
>> >>>>>>>>
>> >>>>>>>> I think that we need to finalized on that firstly?
>> >>>>>>>
>> >>>>>>> I did list the proposal here
>> >>>>>>>
>> >>>>>>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>> >>>>>>>
>> >>>>>>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated
>> >>>>>>> if the user wants a different tier topology.
>> >>>>>>>
>> >>>>>>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200
>> >>>>>>>
>> >>>>>>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100.
>> >>>>>>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use
>> >>>>>>> to control the tier assignment this can be a range of memory tiers.
>> >>>>>>>
>> >>>>>>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update
>> >>>>>>> the memory tier assignment based on device attributes.
>> >>>>>>
>> >>>>>> Sorry for late reply.
>> >>>>>>
>> >>>>>> As the first step, it may be better to skip the parts that we haven't
>> >>>>>> reached consensus yet, for example, the user space interface to override
>> >>>>>> the default memory tiers.  And we can use 0, 1, 2 as the default memory
>> >>>>>> tier IDs.  We can refine/revise the in-kernel implementation, but we
>> >>>>>> cannot change the user space ABI.
>> >>>>>>
>> >>>>>
>> >>>>> Can you help list the use case that will be broken by using tierID as outlined in this series?
>> >>>>> One of the details that were mentioned earlier was the need to track top-tier memory usage in a
>> >>>>> memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com
>> >>>>> can work with tier IDs too. Let me know if you think otherwise. So at this point
>> >>>>> I am not sure which area we are still debating w.r.t the userspace interface.
>> >>>>
>> >>>> In
>> >>>>
>> >>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
>> >>>>
>> >>>> per my understanding, Johannes suggested to override the kernel default
>> >>>> memory tiers with "abstract distance" via drivers implementing memory
>> >>>> devices.  As you said in another email, that is related to [7/12] of the
>> >>>> series.  And we can table it for future.
>> >>>>
>> >>>> And per my understanding, he also suggested to make memory tier IDs
>> >>>> dynamic.  For example, after the "abstract distance" of a driver is
>> >>>> overridden by users, the total number of memory tiers may be changed,
>> >>>> and the memory tier ID of some nodes may be changed too.  This will make
>> >>>> memory tier ID easier to be understood, but more unstable.  For example,
>> >>>> this will make it harder to specify the per-memory-tier memory partition
>> >>>> for a cgroup.
>> >>>>
>> >>>
>> >>> With all the approaches we discussed so far, a memory tier of a numa node can be changed.
>> >>> ie, pgdat->memtier can change anytime. The per memcg top tier mem usage tracking patches
>> >>> posted here
>> >>> https://lore.kernel.org/linux-mm/cefeb63173fa0fac7543315a2abbd4b5a1b25af8.1655242024.git.tim.c.chen@linux.intel.com/
>> >>> doesn't consider the node movement from one memory tier to another. If we need
>> >>> a stable pgdat->memtier we will have to prevent a node memory tier reassignment
>> >>> while we have pages from the memory tier charged to a cgroup. This patchset should not
>> >>> prevent such a restriction.
>> >>
>> >> Absolute stableness doesn't exist even in "rank" based solution.  But
>> >> "rank" can improve the stableness at some degree.  For example, if we
>> >> move the tier of HBM nodes (from below DRAM to above DRAM), the DRAM
>> >> nodes can keep its memory tier ID stable.  This may be not a real issue
>> >> finally.  But we need to discuss that.
>> >>
>> >
>> > I agree that using ranks gives us the flexibility to change demotion order
>> > without being blocked by cgroup usage. But how frequently do we expect the
>> > tier assignment to change? My expectation was these reassignments are going
>> > to be rare and won't happen frequently after a system is up and running?
>> > Hence using tierID for demotion order won't prevent a node reassignment
>> > much because we don't expect to change the node tierID during runtime. In
>> > the rare case we do, we will have to make sure there is no cgroup usage from
>> > the specific memory tier.
>> >
>> > Even if we use ranks, we will have to avoid a rank update, if such
>> > an update can change the meaning of top tier? ie, if a rank update
>> > can result in a node being moved from top tier to non top tier.
>> >
>> >> Tim has suggested to use top-tier(s) memory partition among cgroups.
>> >> But I don't think that has been finalized.  We may use per-memory-tier
>> >> memory partition among cgroups.  I don't know whether Wei will use that
>> >> (may be implemented in the user space).
>> >>
>> >> And, if we thought stableness between nodes and memory tier ID isn't
>> >> important.  Why should we use sparse memory device IDs (that is, 100,
>> >> 200, 300)?  Why not just 0, 1, 2, ...?  That looks more natural.
>> >>
>> >
>> >
>> > The range allows us to use memtier ID for demotion order. ie, as we start initializing
>> > devices with different attributes via dax kmem, there will be a desire to
>> > assign them to different tierIDs. Having default memtier ID (DRAM) at 200 enables
>> > us to put these devices in the range [0 - 200) without updating the node to memtier
>> > mapping of existing NUMA nodes (ie, without updating default memtier).
>>
>> I believe that sparse memory tier IDs can make memory tier more stable
>> in some cases.  But this is different from the system suggested by
>> Johannes.  Per my understanding, with Johannes' system, we will
>>
>> - one driver may online different memory types (such as kmem_dax may
>>   online HBM, PMEM, etc.)
>>
>> - one memory type manages several memory nodes (NUMA nodes)
>>
>> - one "abstract distance" for each memory type
>>
>> - the "abstract distance" can be offset by user space override knob
>>
>> - memory tiers generated dynamically from different memory types according
>>   to "abstract distance" and overridden "offset"
>>
>> - the granularity to group several memory types into one memory tier can
>>   be overridden via user space knob
>>
>> In this way, the memory tiers may change completely after a user space
>> override.  It may be hard to link the memory tiers before and after the
>> override.  So we may need to reset all per-memory-tier configuration,
>> such as cgroup partition limit or interleave weight, etc.
>>
>> Personally, I think the system above makes sense.  But I think we need
>> to make sure whether it satisfies the requirements.
>>
>> Best Regards,
>> Huang, Ying
>>
>
> The "memory type" and "abstract distance" concepts sound to me similar
> to the memory tier "rank" idea.
Yes.  "abstract distance" is similar to "rank".
> We can have some well-defined type/distance/rank values, e.g. HBM,
> DRAM, CXL_DRAM, PMEM, CXL_PMEM, which a device can register with.  The
> memory tiers will be built from these values.  Whether and how to
> collapse several values into a single tier can be configurable.
The memory types are registered by drivers (such as kmem_dax).  And the
distances can come from SLIT, HMAT, and other firmware or driver
specific information sources.
Per my understanding, this solution may make memory tier IDs more
unstable.  For example, the memory tier ID of a node may change after the
user overrides the distance of a memory type.  Although I think
overriding should be a rare operation, will it be a real issue for your
use cases?
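
To make the stability concern concrete, here is a small Python model
(only a sketch, not how the kernel does or would do it) in which tier
IDs are dense numbers derived from the sorted distances of the memory
types in use.  The node numbers, distances, the offset value and the
name `assign_tier_ids` are assumptions made for illustration.

```python
def assign_tier_ids(node_type, type_distance, offsets=None):
    """Return {node: tier_id} with dense tier IDs (0 = smallest distance).

    node_type:     {node id: memory type name}
    type_distance: {memory type name: default abstract distance}
    offsets:       optional {memory type name: user override offset}
    """
    offsets = offsets or {}
    effective = {
        mem_type: distance + offsets.get(mem_type, 0)
        for mem_type, distance in type_distance.items()
    }
    # Only distances of memory types that actually back a node form a tier.
    used = sorted({effective[mem_type] for mem_type in node_type.values()})
    tier_of_distance = {distance: tier for tier, distance in enumerate(used)}
    return {
        node: tier_of_distance[effective[mem_type]]
        for node, mem_type in node_type.items()
    }


# Hypothetical system: two DRAM nodes, one HBM node, one PMEM node.
node_type = {0: "DRAM", 1: "DRAM", 2: "HBM", 3: "PMEM"}
# Assume HBM was initially given a distance that places it below DRAM.
type_distance = {"DRAM": 200, "HBM": 300, "PMEM": 400}

before = assign_tier_ids(node_type, type_distance)
# The user overrides only HBM, moving it above DRAM.
after = assign_tier_ids(node_type, type_distance, offsets={"HBM": -200})

# Nodes whose tier ID differs between the two assignments.
changed = sorted(node for node in before if before[node] != after[node])
```

In this model the DRAM nodes 0 and 1 move from tier 0 to tier 1 even
though only the HBM distance was overridden, so any per-tier
configuration keyed on the old IDs would no longer refer to the same
nodes.
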
Best Regards,
Huang, Ying
* Re: [PATCH v8 00/12] mm/demotion: Memory tiers and demotion
  2022-07-13  7:25               ` Aneesh Kumar K V
@ 2022-07-13  8:20                 ` Huang, Ying
  0 siblings, 0 replies; 42+ messages in thread
From: Huang, Ying @ 2022-07-13  8:20 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: Yang Shi, Linux MM, Andrew Morton, Wei Xu, Davidlohr Bueso,
	Tim C Chen, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Johannes Weiner, jvgediya.oss
Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> On 7/13/22 9:12 AM, Huang, Ying wrote:
>> Yang Shi <shy828301@gmail.com> writes:
>> 
>>> On Mon, Jul 11, 2022 at 10:10 PM Aneesh Kumar K V
>>> <aneesh.kumar@linux.ibm.com> wrote:
>>>>
>>>> On 7/12/22 10:12 AM, Aneesh Kumar K V wrote:
>>>>> On 7/12/22 6:46 AM, Huang, Ying wrote:
>>>>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>>>>>
>>>>>>> On 7/5/22 9:59 AM, Huang, Ying wrote:
>>>>>>>> Hi, Aneesh,
>>>>>>>>
>>>>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>>>>>>
>>>>>>>>> The current kernel has the basic memory tiering support: Inactive
>>>>>>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower
>>>>>>>>> tier NUMA node to make room for new allocations on the higher tier
>>>>>>>>> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
>>>>>>>>> migrated (promoted) to a higher tier NUMA node to improve the
>>>>>>>>> performance.
>>>>>>>>>
>>>>>>>>> In the current kernel, memory tiers are defined implicitly via a
>>>>>>>>> demotion path relationship between NUMA nodes, which is created during
>>>>>>>>> the kernel initialization and updated when a NUMA node is hot-added or
>>>>>>>>> hot-removed.  The current implementation puts all nodes with CPU into
>>>>>>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing
>>>>>>>>> the per-node demotion targets based on the distances between nodes.
>>>>>>>>>
>>>>>>>>> This current memory tier kernel interface needs to be improved for
>>>>>>>>> several important use cases:
>>>>>>>>>
>>>>>>>>> * The current tier initialization code always initializes
>>>>>>>>>   each memory-only NUMA node into a lower tier.  But a memory-only
>>>>>>>>>   NUMA node may have a high performance memory device (e.g. a DRAM
>>>>>>>>>   device attached via CXL.mem or a DRAM-backed memory-only node on
>>>>>>>>>   a virtual machine) and should be put into a higher tier.
>>>>>>>>>
>>>>>>>>> * The current tier hierarchy always puts CPU nodes into the top
>>>>>>>>>   tier. But on a system with HBM (e.g. GPU memory) devices, these
>>>>>>>>>   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
>>>>>>>>>   with CPUs are better to be placed into the next lower tier.
>>>>>>>>>
>>>>>>>>> * Also because the current tier hierarchy always puts CPU nodes
>>>>>>>>>   into the top tier, when a CPU is hot-added (or hot-removed) and
>>>>>>>>>   triggers a memory node from CPU-less into a CPU node (or vice
>>>>>>>>>   versa), the memory tier hierarchy gets changed, even though no
>>>>>>>>>   memory node is added or removed.  This can make the tier
>>>>>>>>>   hierarchy unstable and make it difficult to support tier-based
>>>>>>>>>   memory accounting.
>>>>>>>>>
>>>>>>>>> * A higher tier node can only be demoted to selected nodes on the
>>>>>>>>>   next lower tier as defined by the demotion path, not any other
>>>>>>>>>   node from any lower tier.  This strict, hard-coded demotion order
>>>>>>>>>   does not work in all use cases (e.g. some use cases may want to
>>>>>>>>>   allow cross-socket demotion to another node in the same demotion
>>>>>>>>>   tier as a fallback when the preferred demotion node is out of
>>>>>>>>>   space), and has resulted in the feature request for an interface to
>>>>>>>>>   override the system-wide, per-node demotion order from the
>>>>>>>>>   userspace.  This demotion order is also inconsistent with the page
>>>>>>>>>   allocation fallback order when all the nodes in a higher tier are
>>>>>>>>>   out of space: The page allocation can fall back to any node from
>>>>>>>>>   any lower tier, whereas the demotion order doesn't allow that.
>>>>>>>>>
>>>>>>>>> * There are no interfaces for the userspace to learn about the memory
>>>>>>>>>   tier hierarchy in order to optimize its memory allocations.
>>>>>>>>>
>>>>>>>>> This patch series make the creation of memory tiers explicit under
>>>>>>>>> the control of userspace or device driver.
>>>>>>>>>
>>>>>>>>> Memory Tier Initialization
>>>>>>>>> ==========================
>>>>>>>>>
>>>>>>>>> By default, all memory nodes are assigned to the default tier with
>>>>>>>>> tier ID value 200.
>>>>>>>>>
>>>>>>>>> A device driver can move up or down its memory nodes from the default
>>>>>>>>> tier.  For example, PMEM can move down its memory nodes below the
>>>>>>>>> default tier, whereas GPU can move up its memory nodes above the
>>>>>>>>> default tier.
>>>>>>>>>
>>>>>>>>> The kernel initialization code makes the decision on which exact tier
>>>>>>>>> a memory node should be assigned to based on the requests from the
>>>>>>>>> device drivers as well as the memory device hardware information
>>>>>>>>> provided by the firmware.
>>>>>>>>>
>>>>>>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
>>>>>>>>>
>>>>>>>>> Memory Allocation for Demotion
>>>>>>>>> ==============================
>>>>>>>>> This patch series keep the demotion target page allocation logic same.
>>>>>>>>> The demotion page allocation pick the closest NUMA node in the
>>>>>>>>> next lower tier to the current NUMA node allocating pages from.
>>>>>>>>>
>>>>>>>>> This will be later improved to use the same page allocation strategy
>>>>>>>>> using fallback list.
>>>>>>>>>
>>>>>>>>> Sysfs Interface:
>>>>>>>>> -------------
>>>>>>>>> Listing current list of memory tiers details:
>>>>>>>>>
>>>>>>>>> :/sys/devices/system/memtier$ ls
>>>>>>>>> default_tier max_tier  memtier1  power  uevent
>>>>>>>>> :/sys/devices/system/memtier$ cat default_tier
>>>>>>>>> memtier200
>>>>>>>>> :/sys/devices/system/memtier$ cat max_tier
>>>>>>>>> 400
>>>>>>>>> :/sys/devices/system/memtier$
>>>>>>>>>
>>>>>>>>> Per node memory tier details:
>>>>>>>>>
>>>>>>>>> For a cpu only NUMA node:
>>>>>>>>>
>>>>>>>>> :/sys/devices/system/node# cat node0/memtier
>>>>>>>>> :/sys/devices/system/node# echo 1 > node0/memtier
>>>>>>>>> :/sys/devices/system/node# cat node0/memtier
>>>>>>>>> :/sys/devices/system/node#
>>>>>>>>>
>>>>>>>>> For a NUMA node with memory:
>>>>>>>>> :/sys/devices/system/node# cat node1/memtier
>>>>>>>>> 1
>>>>>>>>> :/sys/devices/system/node# ls ../memtier/
>>>>>>>>> default_tier  max_tier  memtier1  power  uevent
>>>>>>>>> :/sys/devices/system/node# echo 2 > node1/memtier
>>>>>>>>> :/sys/devices/system/node#
>>>>>>>>> :/sys/devices/system/node# ls ../memtier/
>>>>>>>>> default_tier  max_tier  memtier1  memtier2  power  uevent
>>>>>>>>> :/sys/devices/system/node# cat node1/memtier
>>>>>>>>> 2
>>>>>>>>> :/sys/devices/system/node#
>>>>>>>>>
>>>>>>>>> Removing a memory tier
>>>>>>>>> :/sys/devices/system/node# cat node1/memtier
>>>>>>>>> 2
>>>>>>>>> :/sys/devices/system/node# echo 1 > node1/memtier
>>>>>>>>
>>>>>>>> Thanks a lot for your patchset.
>>>>>>>>
>>>>>>>> Per my understanding, we haven't reach consensus on
>>>>>>>>
>>>>>>>> - how to create the default memory tiers in kernel (via abstract
>>>>>>>>   distance provided by drivers?  Or use SLIT as the first step?)
>>>>>>>>
>>>>>>>> - how to override the default memory tiers from user space
>>>>>>>>
>>>>>>>> As in the following thread and email,
>>>>>>>>
>>>>>>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
>>>>>>>>
>>>>>>>> I think that we need to finalized on that firstly?
>>>>>>>
>>>>>>> I did list the proposal here
>>>>>>>
>>>>>>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>>>>>>>
>>>>>>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated
>>>>>>> if the user wants a different tier topology.
>>>>>>>
>>>>>>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200
>>>>>>>
>>>>>>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100.
>>>>>>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use
>>>>>>> to control the tier assignment this can be a range of memory tiers.
>>>>>>>
>>>>>>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update
>>>>>>> the memory tier assignment based on device attributes.
>>>>>>
>>>>>> Sorry for late reply.
>>>>>>
>>>>>> As the first step, it may be better to skip the parts that we haven't
>>>>>> reached consensus yet, for example, the user space interface to override
>>>>>> the default memory tiers.  And we can use 0, 1, 2 as the default memory
>>>>>> tier IDs.  We can refine/revise the in-kernel implementation, but we
>>>>>> cannot change the user space ABI.
>>>>>>
>>>>>
>>>>> Can you help list the use case that will be broken by using tierID as outlined in this series?
>>>>> One of the details that were mentioned earlier was the need to track top-tier memory usage in a
>>>>> memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com
>>>>> can work with tier IDs too. Let me know if you think otherwise. So at this point
>>>>> I am not sure which area we are still debating w.r.t the userspace interface.
>>>>>
>>>>> I will still keep the default tier IDs with a large range between them. That will allow
>>>>> us to go back to tierID based demotion order if we can. That is much simpler than using tierID and rank
>>>>> together. If we still want to go back to rank based approach the tierID value won't have much
>>>>> meaning anyway.
>>>>>
>>>>> Any feedback on patches 1 - 5, so that I can request Andrew to merge them?
>>>>>
>>>>
>>>> Looking at this again, I guess we just need to drop patch 7
>>>> mm/demotion: Add per node memory tier attribute to sysfs ?
>>>>
>>>> We do agree to use the device model to expose memory tiers to userspace so patch 6 can still be included.
>>>> It also exposes max_tier, default_tier, and node list of a memory tier. All these are useful
>>>> and agreed upon. Hence patch 6 can be merged?
>>>>
>>>> patch 8 - 10 -> are done based on the request from others and is independent of how memory tiers
>>>> are exposed/created from userspace. Hence that can be merged?
>>>>
>>>> If you agree I can rebase the series moving patch 7,11,12 as the last patches in the series so
>>>> that we can skip merging them based on what we conclude w.r.t usage of rank.
>>>
>>> I think the most controversial part is the user visible interfaces so
>>> far. And IIUC the series could be split roughly into two parts, patch
>>> 1 - 5 and others. The patch 1 -5 added the explicit memory tier
>>> support and fixed the issue reported by Jagdish. I think we are on the
>>> same page for this part. But I haven't seen any thorough review on
>>> those patches yet since we got distracted by spending most time
>>> discussing about the user visible interfaces.
>>>
>>> So would it help to move things forward to submit patch 1 - 5 as a
>>> standalone series to get thorough review then get merged?
>> 
>> Yes.  I think this is a good idea.  We can discuss the in kernel
>> implementation (without user space interface) in details and try to make
>> it merged.
>> 
>> And we can continue our discussion of user space interface in a separate
>> thread.
>
> Thanks. I will post patch 1 - 5 as a series for review.
I think that you should add 8-10 too, that is, all of the in-kernel
implementation except the user space interface part.  Personally, though,
I think we should squash 8/12.  We can discuss that further during
review.
Best Regards,
Huang, Ying
* Re: [PATCH v8 00/12] mm/demotion: Memory tiers and demotion
  2022-07-13  2:59                 ` Huang, Ying
  2022-07-13  6:46                   ` Wei Xu
@ 2022-07-13  9:40                   ` Aneesh Kumar K.V
  2022-07-14  4:56                     ` Huang, Ying
  1 sibling, 1 reply; 42+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-13  9:40 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Johannes Weiner, linux-mm, akpm, Wei Xu, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, jvgediya.oss
"Huang, Ying" <ying.huang@intel.com> writes:
> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>
>> On 7/12/22 2:18 PM, Huang, Ying wrote:
>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>> 
>>>> On 7/12/22 12:29 PM, Huang, Ying wrote:
>>>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>>>>
>>>>>> On 7/12/22 6:46 AM, Huang, Ying wrote:
>>>>>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>>>>>>
>>>>>>>> On 7/5/22 9:59 AM, Huang, Ying wrote:
>>>>>>>>> Hi, Aneesh,
>>>>>>>>>
>>>>>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>>>>>>>
>>>>>>>>>> The current kernel has the basic memory tiering support: Inactive
>>>>>>>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower
>>>>>>>>>> tier NUMA node to make room for new allocations on the higher tier
>>>>>>>>>> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
>>>>>>>>>> migrated (promoted) to a higher tier NUMA node to improve the
>>>>>>>>>> performance.
>>>>>>>>>>
>>>>>>>>>> In the current kernel, memory tiers are defined implicitly via a
>>>>>>>>>> demotion path relationship between NUMA nodes, which is created during
>>>>>>>>>> the kernel initialization and updated when a NUMA node is hot-added or
>>>>>>>>>> hot-removed.  The current implementation puts all nodes with CPU into
>>>>>>>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing
>>>>>>>>>> the per-node demotion targets based on the distances between nodes.
>>>>>>>>>>
>>>>>>>>>> This current memory tier kernel interface needs to be improved for
>>>>>>>>>> several important use cases:
>>>>>>>>>>
>>>>>>>>>> * The current tier initialization code always initializes
>>>>>>>>>>   each memory-only NUMA node into a lower tier.  But a memory-only
>>>>>>>>>>   NUMA node may have a high performance memory device (e.g. a DRAM
>>>>>>>>>>   device attached via CXL.mem or a DRAM-backed memory-only node on
>>>>>>>>>>   a virtual machine) and should be put into a higher tier.
>>>>>>>>>>
>>>>>>>>>> * The current tier hierarchy always puts CPU nodes into the top
>>>>>>>>>>   tier. But on a system with HBM (e.g. GPU memory) devices, these
>>>>>>>>>>   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
>>>>>>>>>>   with CPUs are better to be placed into the next lower tier.
>>>>>>>>>>
>>>>>>>>>> * Also because the current tier hierarchy always puts CPU nodes
>>>>>>>>>>   into the top tier, when a CPU is hot-added (or hot-removed) and
>>>>>>>>>>   triggers a memory node from CPU-less into a CPU node (or vice
>>>>>>>>>>   versa), the memory tier hierarchy gets changed, even though no
>>>>>>>>>>   memory node is added or removed.  This can make the tier
>>>>>>>>>>   hierarchy unstable and make it difficult to support tier-based
>>>>>>>>>>   memory accounting.
>>>>>>>>>>
>>>>>>>>>> * A higher tier node can only be demoted to selected nodes on the
>>>>>>>>>>   next lower tier as defined by the demotion path, not any other
>>>>>>>>>>   node from any lower tier.  This strict, hard-coded demotion order
>>>>>>>>>>   does not work in all use cases (e.g. some use cases may want to
>>>>>>>>>>   allow cross-socket demotion to another node in the same demotion
>>>>>>>>>>   tier as a fallback when the preferred demotion node is out of
>>>>>>>>>>   space), and has resulted in the feature request for an interface to
>>>>>>>>>>   override the system-wide, per-node demotion order from the
>>>>>>>>>>   userspace.  This demotion order is also inconsistent with the page
>>>>>>>>>>   allocation fallback order when all the nodes in a higher tier are
>>>>>>>>>>   out of space: The page allocation can fall back to any node from
>>>>>>>>>>   any lower tier, whereas the demotion order doesn't allow that.
>>>>>>>>>>
>>>>>>>>>> * There are no interfaces for the userspace to learn about the memory
>>>>>>>>>>   tier hierarchy in order to optimize its memory allocations.
>>>>>>>>>>
>>>>>>>>>> This patch series make the creation of memory tiers explicit under
>>>>>>>>>> the control of userspace or device driver.
>>>>>>>>>>
>>>>>>>>>> Memory Tier Initialization
>>>>>>>>>> ==========================
>>>>>>>>>>
>>>>>>>>>> By default, all memory nodes are assigned to the default tier with
>>>>>>>>>> tier ID value 200.
>>>>>>>>>>
>>>>>>>>>> A device driver can move up or down its memory nodes from the default
>>>>>>>>>> tier.  For example, PMEM can move down its memory nodes below the
>>>>>>>>>> default tier, whereas GPU can move up its memory nodes above the
>>>>>>>>>> default tier.
>>>>>>>>>>
>>>>>>>>>> The kernel initialization code makes the decision on which exact tier
>>>>>>>>>> a memory node should be assigned to based on the requests from the
>>>>>>>>>> device drivers as well as the memory device hardware information
>>>>>>>>>> provided by the firmware.
>>>>>>>>>>
>>>>>>>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
>>>>>>>>>>
>>>>>>>>>> Memory Allocation for Demotion
>>>>>>>>>> ==============================
>>>>>>>>>> This patch series keep the demotion target page allocation logic same.
>>>>>>>>>> The demotion page allocation pick the closest NUMA node in the
>>>>>>>>>> next lower tier to the current NUMA node allocating pages from.
>>>>>>>>>>
>>>>>>>>>> This will be later improved to use the same page allocation strategy
>>>>>>>>>> using fallback list.
>>>>>>>>>>
>>>>>>>>>> Sysfs Interface:
>>>>>>>>>> -------------
>>>>>>>>>> Listing current list of memory tiers details:
>>>>>>>>>>
>>>>>>>>>> :/sys/devices/system/memtier$ ls
>>>>>>>>>> default_tier max_tier  memtier1  power  uevent
>>>>>>>>>> :/sys/devices/system/memtier$ cat default_tier
>>>>>>>>>> memtier200
>>>>>>>>>> :/sys/devices/system/memtier$ cat max_tier 
>>>>>>>>>> 400
>>>>>>>>>> :/sys/devices/system/memtier$ 
>>>>>>>>>>
>>>>>>>>>> Per node memory tier details:
>>>>>>>>>>
>>>>>>>>>> For a cpu only NUMA node:
>>>>>>>>>>
>>>>>>>>>> :/sys/devices/system/node# cat node0/memtier 
>>>>>>>>>> :/sys/devices/system/node# echo 1 > node0/memtier 
>>>>>>>>>> :/sys/devices/system/node# cat node0/memtier 
>>>>>>>>>> :/sys/devices/system/node# 
>>>>>>>>>>
>>>>>>>>>> For a NUMA node with memory:
>>>>>>>>>> :/sys/devices/system/node# cat node1/memtier 
>>>>>>>>>> 1
>>>>>>>>>> :/sys/devices/system/node# ls ../memtier/
>>>>>>>>>> default_tier  max_tier  memtier1  power  uevent
>>>>>>>>>> :/sys/devices/system/node# echo 2 > node1/memtier 
>>>>>>>>>> :/sys/devices/system/node# 
>>>>>>>>>> :/sys/devices/system/node# ls ../memtier/
>>>>>>>>>> default_tier  max_tier  memtier1  memtier2  power  uevent
>>>>>>>>>> :/sys/devices/system/node# cat node1/memtier 
>>>>>>>>>> 2
>>>>>>>>>> :/sys/devices/system/node# 
>>>>>>>>>>
>>>>>>>>>> Removing a memory tier
>>>>>>>>>> :/sys/devices/system/node# cat node1/memtier 
>>>>>>>>>> 2
>>>>>>>>>> :/sys/devices/system/node# echo 1 > node1/memtier
>>>>>>>>>
>>>>>>>>> Thanks a lot for your patchset.
>>>>>>>>>
>>>>>>>>> Per my understanding, we haven't reach consensus on
>>>>>>>>>
>>>>>>>>> - how to create the default memory tiers in kernel (via abstract
>>>>>>>>>   distance provided by drivers?  Or use SLIT as the first step?)
>>>>>>>>>
>>>>>>>>> - how to override the default memory tiers from user space
>>>>>>>>>
>>>>>>>>> As in the following thread and email,
>>>>>>>>>
>>>>>>>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
>>>>>>>>>
>>>>>>>>> I think that we need to finalized on that firstly?
>>>>>>>>
>>>>>>>> I did list the proposal here 
>>>>>>>>
>>>>>>>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>>>>>>>>
>>>>>>>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated
>>>>>>>> if the user wants a different tier topology. 
>>>>>>>>
>>>>>>>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200
>>>>>>>>
>>>>>>>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100.
>>>>>>>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use
>>>>>>>> to control the tier assignment this can be a range of memory tiers. 
>>>>>>>>
>>>>>>>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update
>>>>>>>> the memory tier assignment based on device attributes.
>>>>>>>
>>>>>>> Sorry for late reply.
>>>>>>>
>>>>>>> As the first step, it may be better to skip the parts that we haven't
>>>>>>> reached consensus yet, for example, the user space interface to override
>>>>>>> the default memory tiers.  And we can use 0, 1, 2 as the default memory
>>>>>>> tier IDs.  We can refine/revise the in-kernel implementation, but we
>>>>>>> cannot change the user space ABI.
>>>>>>>
>>>>>>
>>>>>> Can you help list the use case that will be broken by using tierID as outlined in this series?
>>>>>> One of the details that were mentioned earlier was the need to track top-tier memory usage in a
>>>>>> memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com
>>>>>> can work with tier IDs too. Let me know if you think otherwise. So at this point
>>>>>> I am not sure which area we are still debating w.r.t the userspace interface.
>>>>>
>>>>> In
>>>>>
>>>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
>>>>>
>>>>> per my understanding, Johannes suggested to override the kernel default
>>>>> memory tiers with "abstract distance" via drivers implementing memory
>>>>> devices.  As you said in another email, that is related to [7/12] of the
>>>>> series.  And we can table it for future.
>>>>>
>>>>> And per my understanding, he also suggested to make memory tier IDs
>>>>> dynamic.  For example, after the "abstract distance" of a driver is
>>>>> overridden by users, the total number of memory tiers may be changed,
>>>>> and the memory tier ID of some nodes may be changed too.  This will make
>>>>> memory tier ID easier to be understood, but more unstable.  For example,
>>>>> this will make it harder to specify the per-memory-tier memory partition
>>>>> for a cgroup.
>>>>>
>>>>
>>>> With all the approaches we discussed so far, a memory tier of a numa node can be changed.
>>>> ie, pgdat->memtier can change anytime. The per memcg top tier mem usage tracking patches
>>>> posted here
>>>> https://lore.kernel.org/linux-mm/cefeb63173fa0fac7543315a2abbd4b5a1b25af8.1655242024.git.tim.c.chen@linux.intel.com/
>>>> doesn't consider the node movement from one memory tier to another. If we need
>>>> a stable pgdat->memtier we will have to prevent a node memory tier reassignment
>>>> while we have pages from the memory tier charged to a cgroup. This patchset should not
>>>> prevent such a restriction.
>>> 
>>> Absolute stableness doesn't exist even in "rank" based solution.  But
>>> "rank" can improve the stableness at some degree.  For example, if we
>>> move the tier of HBM nodes (from below DRAM to above DRAM), the DRAM
>>> nodes can keep its memory tier ID stable.  This may be not a real issue
>>> finally.  But we need to discuss that.
>>> 
>>
>> I agree that using ranks gives us the flexibility to change demotion order
>> without being blocked by cgroup usage. But how frequently do we expect the
>> tier assignment to change? My expectation was these reassignments are going
>> to be rare and won't happen frequently after a system is up and running?
>> Hence using tierID for demotion order won't prevent a node reassignment
>> much because we don't expect to change the node tierID during runtime. In
>> the rare case we do, we will have to make sure there is no cgroup usage from
>> the specific memory tier. 
>>
>> Even if we use ranks, we will have to avoid a rank update, if such
>> an update can change the meaning of top tier? ie, if a rank update
>> can result in a node being moved from top tier to non top tier.
>>
>>> Tim has suggested to use top-tier(s) memory partition among cgroups.
>>> But I don't think that has been finalized.  We may use per-memory-tier
>>> memory partition among cgroups.  I don't know whether Wei will use that
>>> (may be implemented in the user space).
>>> 
>>> And, if we thought stableness between nodes and memory tier ID isn't
>>> important.  Why should we use sparse memory device IDs (that is, 100,
>>> 200, 300)?  Why not just 0, 1, 2, ...?  That looks more natural.
>>> 
>>
>>
>> The range allows us to use memtier ID for demotion order. ie, as we start initializing
>> devices with different attributes via dax kmem, there will be a desire to
>> assign them to different tierIDs. Having default memtier ID (DRAM) at 200 enables
>> us to put these devices in the range [0 - 200) without updating the node to memtier
>> mapping of existing NUMA nodes (ie, without updating default memtier).
>
> I believe that sparse memory tier IDs can make memory tier more stable
> in some cases.  But this is different from the system suggested by
> Johannes.  Per my understanding, with Johannes' system, we will
>
> - one driver may online different memory types (such as kmem_dax may
>   online HBM, PMEM, etc.)
>
> - one memory type manages several memory nodes (NUMA nodes)
>
> - one "abstract distance" for each memory type
>
> - the "abstract distance" can be offset by user space override knob
>
> - memory tiers generated dynamic from different memory types according
>   "abstract distance" and overridden "offset"
>
> - the granularity to group several memory types into one memory tier can
>   be overridden via user space knob
>
> In this way, the memory tiers may be changed totally after user space
> overridden.  It may be hard to link memory tiers before/after the
> overridden.  So we may need to reset all per-memory-tier configuration,
> such as cgroup paritation limit or interleave weight, etc.
Making sure we all agree on the details.
In the proposal https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
I referred to this as "device attributes" instead of calling it
"abstract distance".
Johannes also suggested that these device attributes ("abstract distance")
be used to derive the memory tier to which a memory type/memory
device will be assigned.
So dax kmem would manage different types of memory and based on the device
attributes, we would assign them to different memory tiers (memory tiers
in the range [0-200)).
Now the additional detail here is that we might add knobs that will be
used by dax kmem to fine-tune the assignment of memory types to memory
tiers. When these knob values are updated, the kernel should rebuild the
entire memory tier hierarchy. (Earlier I was assuming that only newly
added memory devices would be affected by such a change, but I agree it
makes sense to rebuild the entire hierarchy again.) That rebuilding
will, however, be restricted to the dax kmem driver.
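To make the knob behaviour described above concrete, here is a minimal
user-space sketch in Python. It is not kernel code and not what the
series implements. The function names, the integer "distance" attribute
and the `granularity` knob are hypothetical. Only the default tier ID of
200 and the [0-200) range for dax kmem memory come from this discussion.

```python
# Illustrative model only; every name below is hypothetical except the
# default tier ID (200) and the [0, 200) range discussed in this thread.

DEFAULT_MEMTIER = 200   # tier ID for memory not managed by a driver
DAX_KMEM_MIN_TIER = 0   # lowest tier ID available to dax kmem memory


def tier_for_device(distance, granularity):
    """Map a non-negative integer device attribute ("abstract distance")
    to a tier ID in [0, 200). A larger distance gives a lower tier ID."""
    if granularity <= 0:
        raise ValueError("granularity must be positive")
    step = distance // granularity
    return max(DAX_KMEM_MIN_TIER, DEFAULT_MEMTIER - 1 - step)


def rebuild_tiers(nodes, granularity):
    """Recompute the node -> tier ID mapping after a knob update.

    `nodes` maps a node ID to a (driver, distance) pair. Only nodes
    managed by dax kmem are re-evaluated; all other nodes stay in the
    default tier, so their node-to-memtier mapping does not change.
    """
    tiers = {}
    for node, (driver, distance) in nodes.items():
        if driver == "dax_kmem":
            tiers[node] = tier_for_device(distance, granularity)
        else:
            tiers[node] = DEFAULT_MEMTIER
    return tiers
```

In this model, changing `granularity` moves every dax kmem node at
once, while DRAM nodes in the default tier keep tier ID 200.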
>
> Personally, I think the system above makes sense.  But I think we need
> to make sure whether it satisfies the requirements.
-aneesh
* Re: [PATCH v8 00/12] mm/demotion: Memory tiers and demotion
  2022-07-13  6:46                   ` Wei Xu
  2022-07-13  8:17                     ` Huang, Ying
@ 2022-07-13  9:44                     ` Aneesh Kumar K.V
  1 sibling, 0 replies; 42+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-13  9:44 UTC (permalink / raw)
  To: Wei Xu, Huang, Ying
  Cc: Johannes Weiner, Linux MM, Andrew Morton, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, jvgediya.oss
Wei Xu <weixugc@google.com> writes:
> On Tue, Jul 12, 2022 at 8:03 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>
>> > On 7/12/22 2:18 PM, Huang, Ying wrote:
>> >> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>> >>
>> >>> On 7/12/22 12:29 PM, Huang, Ying wrote:
>> >>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>> >>>>
>> >>>>> On 7/12/22 6:46 AM, Huang, Ying wrote:
>> >>>>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>> >>>>>>
>> >>>>>>> On 7/5/22 9:59 AM, Huang, Ying wrote:
>> >>>>>>>> Hi, Aneesh,
>> >>>>>>>>
>> >>>>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>> >>>>>>>>
>> >>>>>>>>> The current kernel has the basic memory tiering support: Inactive
>> >>>>>>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower
>> >>>>>>>>> tier NUMA node to make room for new allocations on the higher tier
>> >>>>>>>>> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
>> >>>>>>>>> migrated (promoted) to a higher tier NUMA node to improve the
>> >>>>>>>>> performance.
>> >>>>>>>>>
>> >>>>>>>>> In the current kernel, memory tiers are defined implicitly via a
>> >>>>>>>>> demotion path relationship between NUMA nodes, which is created during
>> >>>>>>>>> the kernel initialization and updated when a NUMA node is hot-added or
>> >>>>>>>>> hot-removed.  The current implementation puts all nodes with CPU into
>> >>>>>>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing
>> >>>>>>>>> the per-node demotion targets based on the distances between nodes.
>> >>>>>>>>>
>> >>>>>>>>> This current memory tier kernel interface needs to be improved for
>> >>>>>>>>> several important use cases:
>> >>>>>>>>>
>> >>>>>>>>> * The current tier initialization code always initializes
>> >>>>>>>>>   each memory-only NUMA node into a lower tier.  But a memory-only
>> >>>>>>>>>   NUMA node may have a high performance memory device (e.g. a DRAM
>> >>>>>>>>>   device attached via CXL.mem or a DRAM-backed memory-only node on
>> >>>>>>>>>   a virtual machine) and should be put into a higher tier.
>> >>>>>>>>>
>> >>>>>>>>> * The current tier hierarchy always puts CPU nodes into the top
>> >>>>>>>>>   tier. But on a system with HBM (e.g. GPU memory) devices, these
>> >>>>>>>>>   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
>> >>>>>>>>>   with CPUs are better to be placed into the next lower tier.
>> >>>>>>>>>
>> >>>>>>>>> * Also because the current tier hierarchy always puts CPU nodes
>> >>>>>>>>>   into the top tier, when a CPU is hot-added (or hot-removed) and
>> >>>>>>>>>   triggers a memory node from CPU-less into a CPU node (or vice
>> >>>>>>>>>   versa), the memory tier hierarchy gets changed, even though no
>> >>>>>>>>>   memory node is added or removed.  This can make the tier
>> >>>>>>>>>   hierarchy unstable and make it difficult to support tier-based
>> >>>>>>>>>   memory accounting.
>> >>>>>>>>>
>> >>>>>>>>> * A higher tier node can only be demoted to selected nodes on the
>> >>>>>>>>>   next lower tier as defined by the demotion path, not any other
>> >>>>>>>>>   node from any lower tier.  This strict, hard-coded demotion order
>> >>>>>>>>>   does not work in all use cases (e.g. some use cases may want to
>> >>>>>>>>>   allow cross-socket demotion to another node in the same demotion
>> >>>>>>>>>   tier as a fallback when the preferred demotion node is out of
>> >>>>>>>>>   space), and has resulted in the feature request for an interface to
>> >>>>>>>>>   override the system-wide, per-node demotion order from the
>> >>>>>>>>>   userspace.  This demotion order is also inconsistent with the page
>> >>>>>>>>>   allocation fallback order when all the nodes in a higher tier are
>> >>>>>>>>>   out of space: The page allocation can fall back to any node from
>> >>>>>>>>>   any lower tier, whereas the demotion order doesn't allow that.
>> >>>>>>>>>
>> >>>>>>>>> * There are no interfaces for the userspace to learn about the memory
>> >>>>>>>>>   tier hierarchy in order to optimize its memory allocations.
>> >>>>>>>>>
>> >>>>>>>>> This patch series make the creation of memory tiers explicit under
>> >>>>>>>>> the control of userspace or device driver.
>> >>>>>>>>>
>> >>>>>>>>> Memory Tier Initialization
>> >>>>>>>>> ==========================
>> >>>>>>>>>
>> >>>>>>>>> By default, all memory nodes are assigned to the default tier with
>> >>>>>>>>> tier ID value 200.
>> >>>>>>>>>
>> >>>>>>>>> A device driver can move up or down its memory nodes from the default
>> >>>>>>>>> tier.  For example, PMEM can move down its memory nodes below the
>> >>>>>>>>> default tier, whereas GPU can move up its memory nodes above the
>> >>>>>>>>> default tier.
>> >>>>>>>>>
>> >>>>>>>>> The kernel initialization code makes the decision on which exact tier
>> >>>>>>>>> a memory node should be assigned to based on the requests from the
>> >>>>>>>>> device drivers as well as the memory device hardware information
>> >>>>>>>>> provided by the firmware.
>> >>>>>>>>>
>> >>>>>>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
>> >>>>>>>>>
>> >>>>>>>>> Memory Allocation for Demotion
>> >>>>>>>>> ==============================
>> >>>>>>>>> This patch series keep the demotion target page allocation logic same.
>> >>>>>>>>> The demotion page allocation pick the closest NUMA node in the
>> >>>>>>>>> next lower tier to the current NUMA node allocating pages from.
>> >>>>>>>>>
>> >>>>>>>>> This will be later improved to use the same page allocation strategy
>> >>>>>>>>> using fallback list.
>> >>>>>>>>>
>> >>>>>>>>> Sysfs Interface:
>> >>>>>>>>> -------------
>> >>>>>>>>> Listing current list of memory tiers details:
>> >>>>>>>>>
>> >>>>>>>>> :/sys/devices/system/memtier$ ls
>> >>>>>>>>> default_tier max_tier  memtier1  power  uevent
>> >>>>>>>>> :/sys/devices/system/memtier$ cat default_tier
>> >>>>>>>>> memtier200
>> >>>>>>>>> :/sys/devices/system/memtier$ cat max_tier
>> >>>>>>>>> 400
>> >>>>>>>>> :/sys/devices/system/memtier$
>> >>>>>>>>>
>> >>>>>>>>> Per node memory tier details:
>> >>>>>>>>>
>> >>>>>>>>> For a cpu only NUMA node:
>> >>>>>>>>>
>> >>>>>>>>> :/sys/devices/system/node# cat node0/memtier
>> >>>>>>>>> :/sys/devices/system/node# echo 1 > node0/memtier
>> >>>>>>>>> :/sys/devices/system/node# cat node0/memtier
>> >>>>>>>>> :/sys/devices/system/node#
>> >>>>>>>>>
>> >>>>>>>>> For a NUMA node with memory:
>> >>>>>>>>> :/sys/devices/system/node# cat node1/memtier
>> >>>>>>>>> 1
>> >>>>>>>>> :/sys/devices/system/node# ls ../memtier/
>> >>>>>>>>> default_tier  max_tier  memtier1  power  uevent
>> >>>>>>>>> :/sys/devices/system/node# echo 2 > node1/memtier
>> >>>>>>>>> :/sys/devices/system/node#
>> >>>>>>>>> :/sys/devices/system/node# ls ../memtier/
>> >>>>>>>>> default_tier  max_tier  memtier1  memtier2  power  uevent
>> >>>>>>>>> :/sys/devices/system/node# cat node1/memtier
>> >>>>>>>>> 2
>> >>>>>>>>> :/sys/devices/system/node#
>> >>>>>>>>>
>> >>>>>>>>> Removing a memory tier
>> >>>>>>>>> :/sys/devices/system/node# cat node1/memtier
>> >>>>>>>>> 2
>> >>>>>>>>> :/sys/devices/system/node# echo 1 > node1/memtier
>> >>>>>>>>
>> >>>>>>>> Thanks a lot for your patchset.
>> >>>>>>>>
>> >>>>>>>> Per my understanding, we haven't reach consensus on
>> >>>>>>>>
>> >>>>>>>> - how to create the default memory tiers in kernel (via abstract
>> >>>>>>>>   distance provided by drivers?  Or use SLIT as the first step?)
>> >>>>>>>>
>> >>>>>>>> - how to override the default memory tiers from user space
>> >>>>>>>>
>> >>>>>>>> As in the following thread and email,
>> >>>>>>>>
>> >>>>>>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
>> >>>>>>>>
>> >>>>>>>> I think that we need to finalized on that firstly?
>> >>>>>>>
>> >>>>>>> I did list the proposal here
>> >>>>>>>
>> >>>>>>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>> >>>>>>>
>> >>>>>>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated
>> >>>>>>> if the user wants a different tier topology.
>> >>>>>>>
>> >>>>>>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200
>> >>>>>>>
>> >>>>>>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100.
>> >>>>>>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use
>> >>>>>>> to control the tier assignment this can be a range of memory tiers.
>> >>>>>>>
>> >>>>>>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update
>> >>>>>>> the memory tier assignment based on device attributes.
>> >>>>>>
>> >>>>>> Sorry for late reply.
>> >>>>>>
>> >>>>>> As the first step, it may be better to skip the parts that we haven't
>> >>>>>> reached consensus yet, for example, the user space interface to override
>> >>>>>> the default memory tiers.  And we can use 0, 1, 2 as the default memory
>> >>>>>> tier IDs.  We can refine/revise the in-kernel implementation, but we
>> >>>>>> cannot change the user space ABI.
>> >>>>>>
>> >>>>>
>> >>>>> Can you help list the use case that will be broken by using tierID as outlined in this series?
>> >>>>> One of the details that were mentioned earlier was the need to track top-tier memory usage in a
>> >>>>> memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com
>> >>>>> can work with tier IDs too. Let me know if you think otherwise. So at this point
>> >>>>> I am not sure which area we are still debating w.r.t the userspace interface.
>> >>>>
>> >>>> In
>> >>>>
>> >>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
>> >>>>
>> >>>> per my understanding, Johannes suggested to override the kernel default
>> >>>> memory tiers with "abstract distance" via drivers implementing memory
>> >>>> devices.  As you said in another email, that is related to [7/12] of the
>> >>>> series.  And we can table it for future.
>> >>>>
>> >>>> And per my understanding, he also suggested to make memory tier IDs
>> >>>> dynamic.  For example, after the "abstract distance" of a driver is
>> >>>> overridden by users, the total number of memory tiers may be changed,
>> >>>> and the memory tier ID of some nodes may be changed too.  This will make
>> >>>> memory tier ID easier to be understood, but more unstable.  For example,
>> >>>> this will make it harder to specify the per-memory-tier memory partition
>> >>>> for a cgroup.
>> >>>>
>> >>>
>> >>> With all the approaches we discussed so far, a memory tier of a numa node can be changed.
>> >>> ie, pgdat->memtier can change anytime. The per memcg top tier mem usage tracking patches
>> >>> posted here
>> >>> https://lore.kernel.org/linux-mm/cefeb63173fa0fac7543315a2abbd4b5a1b25af8.1655242024.git.tim.c.chen@linux.intel.com/
>> >>> doesn't consider the node movement from one memory tier to another. If we need
>> >>> a stable pgdat->memtier we will have to prevent a node memory tier reassignment
>> >>> while we have pages from the memory tier charged to a cgroup. This patchset should not
>> >>> prevent such a restriction.
>> >>
>> >> Absolute stableness doesn't exist even in "rank" based solution.  But
>> >> "rank" can improve the stableness at some degree.  For example, if we
>> >> move the tier of HBM nodes (from below DRAM to above DRAM), the DRAM
>> >> nodes can keep its memory tier ID stable.  This may be not a real issue
>> >> finally.  But we need to discuss that.
>> >>
>> >
>> > I agree that using ranks gives us the flexibility to change demotion order
>> > without being blocked by cgroup usage. But how frequently do we expect the
>> > tier assignment to change? My expectation was these reassignments are going
>> > to be rare and won't happen frequently after a system is up and running?
>> > Hence using tierID for demotion order won't prevent a node reassignment
>> > much because we don't expect to change the node tierID during runtime. In
>> > the rare case we do, we will have to make sure there is no cgroup usage from
>> > the specific memory tier.
>> >
>> > Even if we use ranks, we will have to avoid a rank update, if such
>> > an update can change the meaning of top tier? ie, if a rank update
>> > can result in a node being moved from top tier to non top tier.
>> >
>> >> Tim has suggested to use top-tier(s) memory partition among cgroups.
>> >> But I don't think that has been finalized.  We may use per-memory-tier
>> >> memory partition among cgroups.  I don't know whether Wei will use that
>> >> (may be implemented in the user space).
>> >>
>> >> And, if we thought stableness between nodes and memory tier ID isn't
>> >> important.  Why should we use sparse memory device IDs (that is, 100,
>> >> 200, 300)?  Why not just 0, 1, 2, ...?  That looks more natural.
>> >>
>> >
>> >
>> > The range allows us to use memtier ID for demotion order. ie, as we start initializing
>> > devices with different attributes via dax kmem, there will be a desire to
>> > assign them to different tierIDs. Having default memtier ID (DRAM) at 200 enables
>> > us to put these devices in the range [0 - 200) without updating the node to memtier
>> > mapping of existing NUMA nodes (ie, without updating default memtier).
>>
>> I believe that sparse memory tier IDs can make memory tier more stable
>> in some cases.  But this is different from the system suggested by
>> Johannes.  Per my understanding, with Johannes' system, we will
>>
>> - one driver may online different memory types (such as kmem_dax may
>>   online HBM, PMEM, etc.)
>>
>> - one memory type manages several memory nodes (NUMA nodes)
>>
>> - one "abstract distance" for each memory type
>>
>> - the "abstract distance" can be offset by user space override knob
>>
>> - memory tiers generated dynamic from different memory types according
>>   "abstract distance" and overridden "offset"
>>
>> - the granularity to group several memory types into one memory tier can
>>   be overridden via user space knob
>>
>> In this way, the memory tiers may be changed totally after user space
>> overridden.  It may be hard to link memory tiers before/after the
>> overridden.  So we may need to reset all per-memory-tier configuration,
>> such as cgroup paritation limit or interleave weight, etc.
>>
>> Personally, I think the system above makes sense.  But I think we need
>> to make sure whether it satisfies the requirements.
>>
>> Best Regards,
>> Huang, Ying
>>
>
> The "memory type" and "abstract distance" concepts sound to me similar
> to the memory tier "rank" idea.
>
> We can have some well-defined type/distance/rank values, e.g. HBM,
> DRAM, CXL_DRAM, PMEM, CXL_PMEM, which a device can register with.  The
> memory tiers will build from these values.  It can be configurable to
> whether/how to collapse several values into a single tier.
But then we also don't want to use it directly for demotion
order. Instead, we can use tierID. The memory type to memory tier assignment
can be fine-tuned using device attributes, "abstract distance", rank,
userspace override, etc.
-aneesh
* Re: [PATCH v8 00/12] mm/demotion: Memory tiers and demotion
  2022-07-13  9:40                   ` Aneesh Kumar K.V
@ 2022-07-14  4:56                     ` Huang, Ying
  2022-07-14  5:29                       ` Aneesh Kumar K V
  0 siblings, 1 reply; 42+ messages in thread
From: Huang, Ying @ 2022-07-14  4:56 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Johannes Weiner, linux-mm, akpm, Wei Xu, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, jvgediya.oss
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> "Huang, Ying" <ying.huang@intel.com> writes:
[snip]
>>
>> I believe that sparse memory tier IDs can make memory tier more stable
>> in some cases.  But this is different from the system suggested by
>> Johannes.  Per my understanding, with Johannes' system, we will
>>
>> - one driver may online different memory types (such as kmem_dax may
>>   online HBM, PMEM, etc.)
>>
>> - one memory type manages several memory nodes (NUMA nodes)
>>
>> - one "abstract distance" for each memory type
>>
>> - the "abstract distance" can be offset by user space override knob
>>
>> - memory tiers generated dynamic from different memory types according
>>   "abstract distance" and overridden "offset"
>>
>> - the granularity to group several memory types into one memory tier can
>>   be overridden via user space knob
>>
>> In this way, the memory tiers may be changed totally after user space
>> overridden.  It may be hard to link memory tiers before/after the
>> overridden.  So we may need to reset all per-memory-tier configuration,
>> such as cgroup paritation limit or interleave weight, etc.
>
> Making sure we all agree on the details.
>
> In the proposal https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
> instead of calling it "abstract distance" I was referring it as device
> attributes.
>
> Johannes also suggested these device attributes/"abstract distance"
> to be used to derive the memory tier to which the memory type/memory
> device will be assigned.
>
> So dax kmem would manage different types of memory and based on the device
> attributes, we would assign them to different memory tiers (memory tiers
> in the range [0-200)).
>
> Now the additional detail here is that we might add knobs that will be
> used by dax kmem to fine-tune memory types to memory tiers assignment.
> On updating these knob values, the kernel should rebuild the entire
> memory tier hierarchy. (earlier I was considering only newly added
> memory devices will get impacted by such a change. But I agree it
> makes sense to rebuild the entire hierarchy again) But that rebuilding
> will be restricted to dax kmem driver.
>
Thanks for the explanation and pointer.  Per my understanding, memory
types and memory devices, including abstract distances, are used to
describe the *physical* memory devices, not *policy*.  We may add more
physical attributes to these memory devices, such as latency,
throughput, etc.  I think we can reach consensus on this point?

In contrast, memory tiers are more about policy, such as
demotion/promotion, interleaving and possible partition among cgroups.
How to derive memory tiers from memory types (or devices)?  We have
multiple choices.

Per my understanding, Johannes suggested using some policy parameters
such as distance granularity (e.g., if granularity is 100, then memory
devices with abstract distance 0-100, 100-200, 200-300, ... will be put
into memory tier 0, 1, 2, ...) to build the memory tiers.  Distance
granularity may not be flexible enough; we may need something like a set
of cutoffs or ranges, e.g., 50, 100, 200, 500, or 0-50, 50-100, 100-200,
200-500, >500.  These policy parameters should be overridable from user
space.

And per my understanding, you suggested placing memory devices into
memory tiers directly via a knob of memory types (or memory devices),
e.g., memory_type/memtier can be written to place the memory devices of
the memory_type into the specified memtier.  Or via
memory_type/distance_offset to do that.
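
To make the two policy schemes above concrete, here is a minimal
userspace sketch in Python (not kernel code) of mapping an abstract
distance to a memory tier, first by a uniform granularity and then by a
set of cutoffs.  The function names are hypothetical, and treating each
range as half-open (so a distance of exactly 100 falls in the 100-200
range) is an assumption; the ranges as written above leave the boundary
case open.

```python
from bisect import bisect_right


def tier_by_granularity(distance, granularity=100):
    """Uniform buckets: [0, 100) -> tier 0, [100, 200) -> tier 1, ...

    Half-open ranges are an assumption made for this sketch.
    """
    if distance < 0 or granularity <= 0:
        raise ValueError("distance must be >= 0 and granularity > 0")
    return distance // granularity


def tier_by_cutoffs(distance, cutoffs=(50, 100, 200, 500)):
    """Non-uniform buckets defined by a sorted tuple of cutoffs.

    With the default cutoffs: [0, 50) -> 0, [50, 100) -> 1,
    [100, 200) -> 2, [200, 500) -> 3, and 500 or more -> 4.
    """
    if distance < 0:
        raise ValueError("distance must be >= 0")
    # bisect_right counts how many cutoffs are <= distance,
    # which is exactly the index of the bucket the distance falls in.
    return bisect_right(cutoffs, distance)
```

With cutoffs, devices at distances 120 and 180 share a tier while a
device at 60 gets its own; with a granularity of 100 they would land in
tiers 1, 1 and 0 respectively.  Either parameter set could be the thing
that user space overrides.
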
Best Regards,
Huang, Ying
[snip]
* Re: [PATCH v8 00/12] mm/demotion: Memory tiers and demotion
  2022-07-14  4:56                     ` Huang, Ying
@ 2022-07-14  5:29                       ` Aneesh Kumar K V
  2022-07-14  7:21                         ` Huang, Ying
  0 siblings, 1 reply; 42+ messages in thread
From: Aneesh Kumar K V @ 2022-07-14  5:29 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Johannes Weiner, linux-mm, akpm, Wei Xu, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, jvgediya.oss
On 7/14/22 10:26 AM, Huang, Ying wrote:
> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> 
>> "Huang, Ying" <ying.huang@intel.com> writes:
> 
> [snip]
> 
>>>
>>> I believe that sparse memory tier IDs can make memory tier more stable
>>> in some cases.  But this is different from the system suggested by
>>> Johannes.  Per my understanding, with Johannes' system, we will
>>>
>>> - one driver may online different memory types (such as kmem_dax may
>>>   online HBM, PMEM, etc.)
>>>
>>> - one memory type manages several memory nodes (NUMA nodes)
>>>
>>> - one "abstract distance" for each memory type
>>>
>>> - the "abstract distance" can be offset by user space override knob
>>>
>>> - memory tiers generated dynamic from different memory types according
>>>   "abstract distance" and overridden "offset"
>>>
>>> - the granularity to group several memory types into one memory tier can
>>>   be overridden via user space knob
>>>
>>> In this way, the memory tiers may be changed totally after user space
>>> overridden.  It may be hard to link memory tiers before/after the
>>> overridden.  So we may need to reset all per-memory-tier configuration,
>>> such as cgroup paritation limit or interleave weight, etc.
>>
>> Making sure we all agree on the details.
>>
>> In the proposal https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>> instead of calling it "abstract distance" I was referring it as device
>> attributes.
>>
>> Johannes also suggested these device attributes/"abstract distance"
>> to be used to derive the memory tier to which the memory type/memory
>> device will be assigned.
>>
>> So dax kmem would manage different types of memory and based on the device
>> attributes, we would assign them to different memory tiers (memory tiers
>> in the range [0-200)).
>>
>> Now the additional detail here is that we might add knobs that will be
>> used by dax kmem to fine-tune memory types to memory tiers assignment.
>> On updating these knob values, the kernel should rebuild the entire
>> memory tier hierarchy. (earlier I was considering only newly added
>> memory devices will get impacted by such a change. But I agree it
>> makes sense to rebuild the entire hierarchy again) But that rebuilding
>> will be restricted to dax kmem driver.
>>
> 
> Thanks for explanation and pointer.  Per my understanding, memory
> types and memory devices including abstract distances are used to
> describe the *physical* memory devices, not *policy*.  We may add more
> physical attributes to these memory devices, such as, latency,
> throughput, etc.  I think we can reach consensus on this point?
> 
> In contrast, memory tiers are more about policy, such as
> demotion/promotion, interleaving and possible partition among cgroups.
> How to derive memory tiers from memory types (or devices)?  We have
> multiple choices.
> 
Agreed to the above.
> Per my understanding, Johannes suggested to use some policy parameters
> such as distance granularity (e.g., if granularity is 100, then memory
> devices with abstract distance 0-100, 100-200, 200-300, ... will be put
> to memory tier 0, 1, 2, ...) to build the memory tiers.  Distance
> granularity may be not flexible enough, we may need something like a set
> of cutoffs or range, e.g., 50, 100, 200, 500, or 0-50, 50-100, 100-200,
> 200-500, >500.  These policy parameters should be overridable from user
> space.
> 
The term "distance" was always confusing to me. Instead, I was generalizing it as an attribute.
The challenge with the term for me was in clarifying: the distance of this memory device from
where? It is much simpler to group devices based on device attributes such as write latency.

So everything you explained above is correct, except we describe it in terms of a
single device attribute or a combination of multiple device attributes. We could convert
a combination of multiple device attributes to an "abstract distance". Such an
"abstract distance" is derived from different device attribute values, with
policy parameters overridable from userspace.
> And per my understanding, you suggested to place memory devices to
> memory tiers directly via a knob of memory types (or memory devices).
> e.g., memory_type/memtier can be written to place the memory devices of
> the memory_type to the specified memtier.  Or via
> memorty_type/distance_offset to do that.
> 
What I explained above is what I would expect the kernel to do by default. Before we can
get there we need a better understanding of which device attribute describes
the grouping of memory devices into a memory tier. Do we need latency-based grouping
or bandwidth-based grouping? Until then, userspace can place these devices into different
memory tiers. Hence the addition of the /sys/devices/system/node/nodeN/memtier write feature,
which moves a memory node to a specific memory tier.

I am not suggesting we override the memory types from userspace.
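
As an illustration only, the nodeN/memtier write described above could
be driven from a script roughly as follows.  This is a sketch that
assumes the sysfs layout proposed in this series (the file may not exist
in that form on any given kernel); the helper names and the `root`
parameter are hypothetical, the latter added so the code can be tried
against a scratch directory.

```python
from pathlib import Path

# Path as quoted in this thread; only valid if the proposed interface exists.
NODE_ROOT = Path("/sys/devices/system/node")


def memtier_path(node, root=NODE_ROOT):
    """Return the per-node memtier file, e.g. <root>/node1/memtier."""
    return Path(root) / f"node{node}" / "memtier"


def get_node_memtier(node, root=NODE_ROOT):
    """Read the tier ID of a node, or None if the file is empty.

    The cover letter shows an empty read for a CPU-only node.
    """
    text = memtier_path(node, root).read_text().strip()
    return int(text) if text else None


def set_node_memtier(node, tier, root=NODE_ROOT):
    """Move a memory node to the given tier (echo <tier> > nodeN/memtier)."""
    memtier_path(node, root).write_text(f"{tier}\n")
```

For example, `set_node_memtier(1, 2)` corresponds to the
`echo 2 > node1/memtier` step in the cover letter, and writing to the
real sysfs file would require root.
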
-aneesh
* Re: [PATCH v8 00/12] mm/demotion: Memory tiers and demotion
  2022-07-14  5:29                       ` Aneesh Kumar K V
@ 2022-07-14  7:21                         ` Huang, Ying
  0 siblings, 0 replies; 42+ messages in thread
From: Huang, Ying @ 2022-07-14  7:21 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: Johannes Weiner, linux-mm, akpm, Wei Xu, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, jvgediya.oss
Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> On 7/14/22 10:26 AM, Huang, Ying wrote:
>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>> 
>>> "Huang, Ying" <ying.huang@intel.com> writes:
>> 
>> [snip]
>> 
>>>>
>>>> I believe that sparse memory tier IDs can make memory tier more stable
>>>> in some cases.  But this is different from the system suggested by
>>>> Johannes.  Per my understanding, with Johannes' system, we will
>>>>
>>>> - one driver may online different memory types (such as kmem_dax may
>>>>   online HBM, PMEM, etc.)
>>>>
>>>> - one memory type manages several memory nodes (NUMA nodes)
>>>>
>>>> - one "abstract distance" for each memory type
>>>>
>>>> - the "abstract distance" can be offset by user space override knob
>>>>
>>>> - memory tiers generated dynamic from different memory types according
>>>>   "abstract distance" and overridden "offset"
>>>>
>>>> - the granularity to group several memory types into one memory tier can
>>>>   be overridden via user space knob
>>>>
>>>> In this way, the memory tiers may be changed totally after user space
>>>> overridden.  It may be hard to link memory tiers before/after the
>>>> overridden.  So we may need to reset all per-memory-tier configuration,
>>>> such as cgroup paritation limit or interleave weight, etc.
>>>
>>> Making sure we all agree on the details.
>>>
>>> In the proposal https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>>> instead of calling it "abstract distance" I was referring it as device
>>> attributes.
>>>
>>> Johannes also suggested these device attributes/"abstract distance"
>>> to be used to derive the memory tier to which the memory type/memory
>>> device will be assigned.
>>>
>>> So dax kmem would manage different types of memory and based on the device
>>> attributes, we would assign them to different memory tiers (memory tiers
>>> in the range [0-200)).
>>>
>>> Now the additional detail here is that we might add knobs that will be
>>> used by dax kmem to fine-tune memory types to memory tiers assignment.
>>> On updating these knob values, the kernel should rebuild the entire
>>> memory tier hierarchy. (earlier I was considering only newly added
>>> memory devices will get impacted by such a change. But I agree it
>>> makes sense to rebuild the entire hierarchy again) But that rebuilding
>>> will be restricted to dax kmem driver.
>>>
>> 
>> Thanks for explanation and pointer.  Per my understanding, memory
>> types and memory devices including abstract distances are used to
>> describe the *physical* memory devices, not *policy*.  We may add more
>> physical attributes to these memory devices, such as, latency,
>> throughput, etc.  I think we can reach consensus on this point?
>> 
>> In contrast, memory tiers are more about policy, such as
>> demotion/promotion, interleaving and possible partition among cgroups.
>> How to derive memory tiers from memory types (or devices)?  We have
>> multiple choices.
>> 
>
> agreed to the above.
>
>> Per my understanding, Johannes suggested to use some policy parameters
>> such as distance granularity (e.g., if granularity is 100, then memory
>> devices with abstract distance 0-100, 100-200, 200-300, ... will be put
>> to memory tier 0, 1, 2, ...) to build the memory tiers.  Distance
>> granularity may be not flexible enough, we may need something like a set
>> of cutoffs or range, e.g., 50, 100, 200, 500, or 0-50, 50-100, 100-200,
>> 200-500, >500.  These policy parameters should be overridable from user
>> space.
>> 
>
> The term distance was always confusing to me. Instead, I was
> generalizing it as an attribute.
Attributes sounds too general to me :-)
> The challenge with the term distance for me was in clarifying the
> distance of this memory device from where? Instead, it is much simpler
> to group devices based on device attributes such as write latency.
Per my understanding, the "distance" here is the distance from local
CPUs, that is, with the influence of NUMA topology removed as much as
possible.

There may be other memory accessing initiators in the system, such as
GPU, etc.  But we don't want to have different memory tiers for each
initiator, so we mainly consider CPUs.  The device drivers of other
initiators may consider other types of memory tiers.

The "distance" characterizes the latency of the memory device under typical
memory throughput in the system.  So it characterizes both latency and
throughput, because the latency will increase with the throughput.  This
is one of the reasons we need to override the default distance, because the
typical memory throughput may be different among different workloads.

The "abstract distance" can come from SLIT or HMAT first.  Then we can
try to explore the other possible sources of information.
> So everything you explained above is correct, except we describe it in terms of a
> single device attribute or a combination of multiple device attributes. We could convert
> a combination of multiple device attribute to an "abstract distance".
Sounds good to me.
> Such an "abstract distance" is derived based on different device
> attribute values with policy parameters overridable from userspace.
I think "abstract distance" is different from policy parameters.
>> And per my understanding, you suggested to place memory devices to
>> memory tiers directly via a knob of memory types (or memory devices).
>> e.g., memory_type/memtier can be written to place the memory devices of
>> the memory_type to the specified memtier.  Or via
>> memorty_type/distance_offset to do that.
>> 
>
> What I explained above is what I would expect the kernel to do by default. Before we can
> reach there we need to get a better understanding of which device attribute describes
> the grouping of memory devices to a memory tier. Do we need latency-based grouping
> or bandwidth-based grouping? Till then userspace can place these devices to different
> memory tiers. Hence the addition of /sys/devices/system/node/nodeN/memtier write feature
> which moves a memory node to a specific memory tier. 
>
> I am not suggesting we override the memory types from userspace.
OK.  I don't think we need this.  We can examine the target solution
above and try to find any issue with it.
Best Regards,
Huang, Ying
* Re: [PATCH v8 00/12] mm/demotion: Memory tiers and demotion
  2022-07-13  8:17                     ` Huang, Ying
@ 2022-07-19 14:00                       ` Jonathan Cameron
  2022-07-25  6:02                         ` Huang, Ying
  0 siblings, 1 reply; 42+ messages in thread
From: Jonathan Cameron @ 2022-07-19 14:00 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Wei Xu, Aneesh Kumar K V, Johannes Weiner, Linux MM,
	Andrew Morton, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Alistair Popple, Dan Williams, jvgediya.oss
On Wed, 13 Jul 2022 16:17:21 +0800
"Huang, Ying" <ying.huang@intel.com> wrote:
> Wei Xu <weixugc@google.com> writes:
> 
> > On Tue, Jul 12, 2022 at 8:03 PM Huang, Ying <ying.huang@intel.com> wrote:  
> >>
> >> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> >>  
> >> > On 7/12/22 2:18 PM, Huang, Ying wrote:  
> >> >> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> >> >>  
> >> >>> On 7/12/22 12:29 PM, Huang, Ying wrote:  
> >> >>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> >> >>>>  
> >> >>>>> On 7/12/22 6:46 AM, Huang, Ying wrote:  
> >> >>>>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> >> >>>>>>  
> >> >>>>>>> On 7/5/22 9:59 AM, Huang, Ying wrote:  
> >> >>>>>>>> Hi, Aneesh,
> >> >>>>>>>>
> >> >>>>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> >> >>>>>>>>  
> >> >>>>>>>>> The current kernel has the basic memory tiering support: Inactive
> >> >>>>>>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower
> >> >>>>>>>>> tier NUMA node to make room for new allocations on the higher tier
> >> >>>>>>>>> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> >> >>>>>>>>> migrated (promoted) to a higher tier NUMA node to improve the
> >> >>>>>>>>> performance.
> >> >>>>>>>>>
> >> >>>>>>>>> In the current kernel, memory tiers are defined implicitly via a
> >> >>>>>>>>> demotion path relationship between NUMA nodes, which is created during
> >> >>>>>>>>> the kernel initialization and updated when a NUMA node is hot-added or
> >> >>>>>>>>> hot-removed.  The current implementation puts all nodes with CPU into
> >> >>>>>>>>> the top tier, and builds the tier hierarchy tier-by-tier by establishing
> >> >>>>>>>>> the per-node demotion targets based on the distances between nodes.
> >> >>>>>>>>>
> >> >>>>>>>>> This current memory tier kernel interface needs to be improved for
> >> >>>>>>>>> several important use cases:
> >> >>>>>>>>>
> >> >>>>>>>>> * The current tier initialization code always initializes
> >> >>>>>>>>>   each memory-only NUMA node into a lower tier.  But a memory-only
> >> >>>>>>>>>   NUMA node may have a high performance memory device (e.g. a DRAM
> >> >>>>>>>>>   device attached via CXL.mem or a DRAM-backed memory-only node on
> >> >>>>>>>>>   a virtual machine) and should be put into a higher tier.
> >> >>>>>>>>>
> >> >>>>>>>>> * The current tier hierarchy always puts CPU nodes into the top
> >> >>>>>>>>>   tier. But on a system with HBM (e.g. GPU memory) devices, these
> >> >>>>>>>>>   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> >> >>>>>>>>>   with CPUs are better to be placed into the next lower tier.
> >> >>>>>>>>>
> >> >>>>>>>>> * Also because the current tier hierarchy always puts CPU nodes
> >> >>>>>>>>>   into the top tier, when a CPU is hot-added (or hot-removed) and
> >> >>>>>>>>>   triggers a memory node from CPU-less into a CPU node (or vice
> >> >>>>>>>>>   versa), the memory tier hierarchy gets changed, even though no
> >> >>>>>>>>>   memory node is added or removed.  This can make the tier
> >> >>>>>>>>>   hierarchy unstable and make it difficult to support tier-based
> >> >>>>>>>>>   memory accounting.
> >> >>>>>>>>>
> >> >>>>>>>>> * A higher tier node can only be demoted to selected nodes on the
> >> >>>>>>>>>   next lower tier as defined by the demotion path, not any other
> >> >>>>>>>>>   node from any lower tier.  This strict, hard-coded demotion order
> >> >>>>>>>>>   does not work in all use cases (e.g. some use cases may want to
> >> >>>>>>>>>   allow cross-socket demotion to another node in the same demotion
> >> >>>>>>>>>   tier as a fallback when the preferred demotion node is out of
> >> >>>>>>>>>   space), and has resulted in the feature request for an interface to
> >> >>>>>>>>>   override the system-wide, per-node demotion order from the
> >> >>>>>>>>>   userspace.  This demotion order is also inconsistent with the page
> >> >>>>>>>>>   allocation fallback order when all the nodes in a higher tier are
> >> >>>>>>>>>   out of space: The page allocation can fall back to any node from
> >> >>>>>>>>>   any lower tier, whereas the demotion order doesn't allow that.
> >> >>>>>>>>>
> >> >>>>>>>>> * There are no interfaces for the userspace to learn about the memory
> >> >>>>>>>>>   tier hierarchy in order to optimize its memory allocations.
> >> >>>>>>>>>
> >> >>>>>>>>> This patch series make the creation of memory tiers explicit under
> >> >>>>>>>>> the control of userspace or device driver.
> >> >>>>>>>>>
> >> >>>>>>>>> Memory Tier Initialization
> >> >>>>>>>>> ==========================
> >> >>>>>>>>>
> >> >>>>>>>>> By default, all memory nodes are assigned to the default tier with
> >> >>>>>>>>> tier ID value 200.
> >> >>>>>>>>>
> >> >>>>>>>>> A device driver can move up or down its memory nodes from the default
> >> >>>>>>>>> tier.  For example, PMEM can move down its memory nodes below the
> >> >>>>>>>>> default tier, whereas GPU can move up its memory nodes above the
> >> >>>>>>>>> default tier.
> >> >>>>>>>>>
> >> >>>>>>>>> The kernel initialization code makes the decision on which exact tier
> >> >>>>>>>>> a memory node should be assigned to based on the requests from the
> >> >>>>>>>>> device drivers as well as the memory device hardware information
> >> >>>>>>>>> provided by the firmware.
> >> >>>>>>>>>
> >> >>>>>>>>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
> >> >>>>>>>>>
> >> >>>>>>>>> Memory Allocation for Demotion
> >> >>>>>>>>> ==============================
> >> >>>>>>>>> This patch series keep the demotion target page allocation logic same.
> >> >>>>>>>>> The demotion page allocation pick the closest NUMA node in the
> >> >>>>>>>>> next lower tier to the current NUMA node allocating pages from.
> >> >>>>>>>>>
> >> >>>>>>>>> This will be later improved to use the same page allocation strategy
> >> >>>>>>>>> using fallback list.
> >> >>>>>>>>>
> >> >>>>>>>>> Sysfs Interface:
> >> >>>>>>>>> -------------
> >> >>>>>>>>> Listing current list of memory tiers details:
> >> >>>>>>>>>
> >> >>>>>>>>> :/sys/devices/system/memtier$ ls
> >> >>>>>>>>> default_tier max_tier  memtier1  power  uevent
> >> >>>>>>>>> :/sys/devices/system/memtier$ cat default_tier
> >> >>>>>>>>> memtier200
> >> >>>>>>>>> :/sys/devices/system/memtier$ cat max_tier
> >> >>>>>>>>> 400
> >> >>>>>>>>> :/sys/devices/system/memtier$
> >> >>>>>>>>>
> >> >>>>>>>>> Per node memory tier details:
> >> >>>>>>>>>
> >> >>>>>>>>> For a cpu only NUMA node:
> >> >>>>>>>>>
> >> >>>>>>>>> :/sys/devices/system/node# cat node0/memtier
> >> >>>>>>>>> :/sys/devices/system/node# echo 1 > node0/memtier
> >> >>>>>>>>> :/sys/devices/system/node# cat node0/memtier
> >> >>>>>>>>> :/sys/devices/system/node#
> >> >>>>>>>>>
> >> >>>>>>>>> For a NUMA node with memory:
> >> >>>>>>>>> :/sys/devices/system/node# cat node1/memtier
> >> >>>>>>>>> 1
> >> >>>>>>>>> :/sys/devices/system/node# ls ../memtier/
> >> >>>>>>>>> default_tier  max_tier  memtier1  power  uevent
> >> >>>>>>>>> :/sys/devices/system/node# echo 2 > node1/memtier
> >> >>>>>>>>> :/sys/devices/system/node#
> >> >>>>>>>>> :/sys/devices/system/node# ls ../memtier/
> >> >>>>>>>>> default_tier  max_tier  memtier1  memtier2  power  uevent
> >> >>>>>>>>> :/sys/devices/system/node# cat node1/memtier
> >> >>>>>>>>> 2
> >> >>>>>>>>> :/sys/devices/system/node#
> >> >>>>>>>>>
> >> >>>>>>>>> Removing a memory tier
> >> >>>>>>>>> :/sys/devices/system/node# cat node1/memtier
> >> >>>>>>>>> 2
> >> >>>>>>>>> :/sys/devices/system/node# echo 1 > node1/memtier  
> >> >>>>>>>>
> >> >>>>>>>> Thanks a lot for your patchset.
> >> >>>>>>>>
> >> >>>>>>>> Per my understanding, we haven't reach consensus on
> >> >>>>>>>>
> >> >>>>>>>> - how to create the default memory tiers in kernel (via abstract
> >> >>>>>>>>   distance provided by drivers?  Or use SLIT as the first step?)
> >> >>>>>>>>
> >> >>>>>>>> - how to override the default memory tiers from user space
> >> >>>>>>>>
> >> >>>>>>>> As in the following thread and email,
> >> >>>>>>>>
> >> >>>>>>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
> >> >>>>>>>>
> >> >>>>>>>> I think that we need to finalized on that firstly?  
> >> >>>>>>>
> >> >>>>>>> I did list the proposal here
> >> >>>>>>>
> >> >>>>>>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
> >> >>>>>>>
> >> >>>>>>> So both the kernel default and driver-specific default tiers now become kernel parameters that can be updated
> >> >>>>>>> if the user wants a different tier topology.
> >> >>>>>>>
> >> >>>>>>> All memory that is not managed by a driver gets added to default_memory_tier which got a default value of 200
> >> >>>>>>>
> >> >>>>>>> For now, the only driver that is updated is dax kmem, which adds the memory it manages to memory tier 100.
> >> >>>>>>> Later as we learn more about the device attributes (HMAT or something similar) that we might want to use
> >> >>>>>>> to control the tier assignment this can be a range of memory tiers.
> >> >>>>>>>
> >> >>>>>>> Based on the above, I guess we can merge what is posted in this series and later fine-tune/update
> >> >>>>>>> the memory tier assignment based on device attributes.  
> >> >>>>>>
> >> >>>>>> Sorry for late reply.
> >> >>>>>>
> >> >>>>>> As the first step, it may be better to skip the parts that we haven't
> >> >>>>>> reached consensus yet, for example, the user space interface to override
> >> >>>>>> the default memory tiers.  And we can use 0, 1, 2 as the default memory
> >> >>>>>> tier IDs.  We can refine/revise the in-kernel implementation, but we
> >> >>>>>> cannot change the user space ABI.
> >> >>>>>>  
> >> >>>>>
> >> >>>>> Can you help list the use case that will be broken by using tierID as outlined in this series?
> >> >>>>> One of the details that were mentioned earlier was the need to track top-tier memory usage in a
> >> >>>>> memcg and IIUC the patchset posted https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com
> >> >>>>> can work with tier IDs too. Let me know if you think otherwise. So at this point
> >> >>>>> I am not sure which area we are still debating w.r.t the userspace interface.  
> >> >>>>
> >> >>>> In
> >> >>>>
> >> >>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
> >> >>>>
> >> >>>> per my understanding, Johannes suggested to override the kernel default
> >> >>>> memory tiers with "abstract distance" via drivers implementing memory
> >> >>>> devices.  As you said in another email, that is related to [7/12] of the
> >> >>>> series.  And we can table it for future.
> >> >>>>
> >> >>>> And per my understanding, he also suggested to make memory tier IDs
> >> >>>> dynamic.  For example, after the "abstract distance" of a driver is
> >> >>>> overridden by users, the total number of memory tiers may be changed,
> >> >>>> and the memory tier ID of some nodes may be changed too.  This will make
> >> >>>> memory tier ID easier to be understood, but more unstable.  For example,
> >> >>>> this will make it harder to specify the per-memory-tier memory partition
> >> >>>> for a cgroup.
> >> >>>>  
> >> >>>
> >> >>> With all the approaches we discussed so far, a memory tier of a numa node can be changed.
> >> >>> ie, pgdat->memtier can change anytime. The per memcg top tier mem usage tracking patches
> >> >>> posted here
> >> >>> https://lore.kernel.org/linux-mm/cefeb63173fa0fac7543315a2abbd4b5a1b25af8.1655242024.git.tim.c.chen@linux.intel.com/
> >> >>> doesn't consider the node movement from one memory tier to another. If we need
> >> >>> a stable pgdat->memtier we will have to prevent a node memory tier reassignment
> >> >>> while we have pages from the memory tier charged to a cgroup. This patchset should not
> >> >>> prevent such a restriction.  
> >> >>
> >> >> Absolute stableness doesn't exist even in "rank" based solution.  But
> >> >> "rank" can improve the stableness at some degree.  For example, if we
> >> >> move the tier of HBM nodes (from below DRAM to above DRAM), the DRAM
> >> >> nodes can keep its memory tier ID stable.  This may be not a real issue
> >> >> finally.  But we need to discuss that.
> >> >>  
> >> >
> >> > I agree that using ranks gives us the flexibility to change demotion order
> >> > without being blocked by cgroup usage. But how frequently do we expect the
> >> > tier assignment to change? My expectation was these reassignments are going
> >> > to be rare and won't happen frequently after a system is up and running?
> >> > Hence using tierID for demotion order won't prevent a node reassignment
> >> > much because we don't expect to change the node tierID during runtime. In
> >> > the rare case we do, we will have to make sure there is no cgroup usage from
> >> > the specific memory tier.
> >> >
> >> > Even if we use ranks, we will have to avoid a rank update, if such
> >> > an update can change the meaning of top tier? ie, if a rank update
> >> > can result in a node being moved from top tier to non top tier.
> >> >  
> >> >> Tim has suggested to use top-tier(s) memory partition among cgroups.
> >> >> But I don't think that has been finalized.  We may use per-memory-tier
> >> >> memory partition among cgroups.  I don't know whether Wei will use that
> >> >> (may be implemented in the user space).
> >> >>
> >> >> And, if we thought stableness between nodes and memory tier ID isn't
> >> >> important.  Why should we use sparse memory device IDs (that is, 100,
> >> >> 200, 300)?  Why not just 0, 1, 2, ...?  That looks more natural.
> >> >>  
> >> >
> >> >
> >> > The range allows us to use memtier ID for demotion order. ie, as we start initializing
> >> > devices with different attributes via dax kmem, there will be a desire to
> >> > assign them to different tierIDs. Having default memtier ID (DRAM) at 200 enables
> >> > us to put these devices in the range [0 - 200) without updating the node to memtier
> >> > mapping of existing NUMA nodes (ie, without updating default memtier).  
> >>
> >> I believe that sparse memory tier IDs can make memory tier more stable
> >> in some cases.  But this is different from the system suggested by
> >> Johannes.  Per my understanding, with Johannes' system, we will
> >>
> >> - one driver may online different memory types (such as kmem_dax may
> >>   online HBM, PMEM, etc.)
> >>
> >> - one memory type manages several memory nodes (NUMA nodes)
> >>
> >> - one "abstract distance" for each memory type
> >>
> >> - the "abstract distance" can be offset by user space override knob
> >>
> >> - memory tiers generated dynamic from different memory types according
> >>   "abstract distance" and overridden "offset"
> >>
> >> - the granularity to group several memory types into one memory tier can
> >>   be overridden via user space knob
> >>
> >> In this way, the memory tiers may be changed totally after user space
> >> overridden.  It may be hard to link memory tiers before/after the
> >> overridden.  So we may need to reset all per-memory-tier configuration,
> >> such as cgroup paritation limit or interleave weight, etc.
> >>
> >> Personally, I think the system above makes sense.  But I think we need
> >> to make sure whether it satisfies the requirements.
> >>
> >> Best Regards,
> >> Huang, Ying
> >>  
> >
> > Th "memory type" and "abstract distance" concepts sound to me similar
> > to the memory tier "rank" idea.  
> 
> Yes.  "abstract distance" is similar as "rank".
> 
> > We can have some well-defined type/distance/rank values, e.g. HBM,
> > DRAM, CXL_DRAM, PMEM, CXL_PMEM, which a device can register with.  The
> > memory tiers will build from these values.  It can be configurable to
> > whether/how to collapse several values into a single tier.  
> 
> The memory types are registered by drivers (such as kmem_dax).  And the
> distances can come from SLIT, HMAT, and other firmware or driver
> specific information sources.
> 
> Per my understanding, this solution may make memory tier IDs more
> unstable.  For example, the memory ID of a node may be changed after the
> user override the distance of a memory type.  Although I think the
> overriding should be a rare operations, will it be a real issue for your
> use cases?
Not sure how common it is, but I'm aware of systems that have dynamic
access characteristics, i.e. the bandwidth and latency of an access
to a given memory device will change dynamically at runtime (typically
due to something like hardware degradation / power saving etc).  That could
lead to memory in use needing to move in 'demotion order'.  We could
handle that with a per device tier and rank that changes...

Just thought I'd throw that out there to add to the complexity ;)
I don't consider it important to support initially but just wanted to
point out this will only get more complex over time.

Jonathan
> 
> Best Regards,
> Huang, Ying
* Re: [PATCH v8 00/12] mm/demotion: Memory tiers and demotion
  2022-07-19 14:00                       ` Jonathan Cameron
@ 2022-07-25  6:02                         ` Huang, Ying
  0 siblings, 0 replies; 42+ messages in thread
From: Huang, Ying @ 2022-07-25  6:02 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Wei Xu, Aneesh Kumar K V, Johannes Weiner, Linux MM,
	Andrew Morton, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Alistair Popple, Dan Williams, jvgediya.oss
Hi, Jonathan,
Jonathan Cameron <Jonathan.Cameron@Huawei.com> writes:
> On Wed, 13 Jul 2022 16:17:21 +0800
> "Huang, Ying" <ying.huang@intel.com> wrote:
>
>> Wei Xu <weixugc@google.com> writes:
[snip]
>> >
>> > Th "memory type" and "abstract distance" concepts sound to me similar
>> > to the memory tier "rank" idea.  
>> 
>> Yes.  "abstract distance" is similar as "rank".
>> 
>> > We can have some well-defined type/distance/rank values, e.g. HBM,
>> > DRAM, CXL_DRAM, PMEM, CXL_PMEM, which a device can register with.  The
>> > memory tiers will build from these values.  It can be configurable to
>> > whether/how to collapse several values into a single tier.  
>> 
>> The memory types are registered by drivers (such as kmem_dax).  And the
>> distances can come from SLIT, HMAT, and other firmware or driver
>> specific information sources.
>> 
>> Per my understanding, this solution may make memory tier IDs more
>> unstable.  For example, the memory ID of a node may be changed after the
>> user override the distance of a memory type.  Although I think the
>> overriding should be a rare operations, will it be a real issue for your
>> use cases?
>
> Not sure how common it is, but I'm aware of systems that have dynamic
> access characteristics.  i.e. the bandwidth and latency of a access
> to a given memory device will change dynamically at runtime (typically
> due to something like hardware degradation / power saving etc).  Potentially
> leading to memory in use needing to move in 'demotion order'.  We could
> handle that with a per device tier and rank that changes...
>
> Just thought I'd throw that out there to add to the complexity ;)
> I don't consider it important to support initially but just wanted to
> point out this will only get more complex over time.
>
Thanks for the information!

If we make the mapping from the abstract distance range to the memory
tier ID stable to some degree, the memory tier IDs can be stable to some
degree as well, e.g.,
abstract distance range         memory tier ID
1-100                           0
101-200                         1
201-300                         2
301-400                         3
401-500                         4
501-                            5
Then if the abstract distance of a memory device changes at run time,
its memory tier ID may change, but the memory tier IDs of other memory
devices remain unchanged.

If so, the memory tier IDs are unstable mainly when we change the
mapping from the abstract distance ranges to the memory tier IDs.
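
For illustration, the range-to-tier mapping in the table above could be
sketched as below.  This is only a sketch under assumptions: the name
adistance_to_tier_id(), the range width of 100 and the cap at tier 5 are
hypothetical and not taken from the patch series, and 500 is treated as
the upper end of tier 4.

```c
/* Hypothetical constants for this sketch; not from the patch series. */
#define ADISTANCE_CHUNK	100	/* width of one abstract distance range */
#define MAX_TIER_ID	5	/* the open-ended last range maps here */

/*
 * Map an abstract distance to a memory tier ID using fixed ranges:
 * 1-100 -> 0, 101-200 -> 1, ..., 401-500 -> 4, anything above -> 5.
 * Returns -1 for distances below 1, which the table does not cover.
 */
static int adistance_to_tier_id(int adistance)
{
	int id;

	if (adistance < 1)
		return -1;

	id = (adistance - 1) / ADISTANCE_CHUNK;
	return id > MAX_TIER_ID ? MAX_TIER_ID : id;
}
```

With such a mapping, a device whose abstract distance moves from 150 to
250 goes from tier 1 to tier 2, while a device at 450 stays in tier 4
because the mapping itself has not changed.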
Best Regards,
Huang, Ying