From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Wei Xu <weixugc@google.com>, Huang Ying <ying.huang@intel.com>,
Yang Shi <shy828301@gmail.com>,
Davidlohr Bueso <dave@stgolabs.net>,
Tim C Chen <tim.c.chen@intel.com>,
Michal Hocko <mhocko@kernel.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Hesham Almatary <hesham.almatary@huawei.com>,
Dave Hansen <dave.hansen@intel.com>,
Jonathan Cameron <Jonathan.Cameron@huawei.com>,
Alistair Popple <apopple@nvidia.com>,
Dan Williams <dan.j.williams@intel.com>,
Johannes Weiner <hannes@cmpxchg.org>,
jvgediya.oss@gmail.com,
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
Jagdish Gediya <jvgediya@linux.ibm.com>
Subject: [PATCH v8 01/12] mm/demotion: Add support for explicit memory tiers
Date: Mon, 4 Jul 2022 12:36:01 +0530 [thread overview]
Message-ID: <20220704070612.299585-2-aneesh.kumar@linux.ibm.com> (raw)
In-Reply-To: <20220704070612.299585-1-aneesh.kumar@linux.ibm.com>
In the current kernel, memory tiers are defined implicitly via a
demotion path relationship between NUMA nodes, which is created
during the kernel initialization and updated when a NUMA node is
hot-added or hot-removed. The current implementation puts all
nodes with CPU into the top tier, and builds the tier hierarchy
tier-by-tier by establishing the per-node demotion targets based
on the distances between nodes.
This current memory tier kernel interface needs to be improved for
several important use cases,
The current tier initialization code always initializes
each memory-only NUMA node into a lower tier. But a memory-only
NUMA node may have a high performance memory device (e.g. a DRAM
device attached via CXL.mem or a DRAM-backed memory-only node on
a virtual machine) and should be put into a higher tier.
The current tier hierarchy always puts CPU nodes into the top
tier. But on a system with HBM or GPU devices, the
memory-only NUMA nodes mapping these devices should be in the
top tier, and DRAM nodes with CPUs are better to be placed into the
next lower tier.
With current kernel higher tier node can only be demoted to selected nodes on the
next lower tier as defined by the demotion path, not any other
node from any lower tier. This strict, hard-coded demotion order
does not work in all use cases (e.g. some use cases may want to
allow cross-socket demotion to another node in the same demotion
tier as a fallback when the preferred demotion node is out of
space), This demotion order is also inconsistent with the page
allocation fallback order when all the nodes in a higher tier are
out of space: The page allocation can fall back to any node from
any lower tier, whereas the demotion order doesn't allow that.
The current kernel also don't provide any interfaces for the
userspace to learn about the memory tier hierarchy in order to
optimize its memory allocations.
This patch series address the above by defining memory tiers explicitly.
This patch introduce explicity memory tiers. The tier ID value
of a memory tier is used to derive the demotion order between
NUMA nodes.
For example, if we have 3 memtiers: memtier100, memtier200, memiter300
then the memory tier order is: memtier300 -> memtier200 -> memtier100
where memtier300 is the highest tier and memtier100 is the lowest tier.
While reclaim we migrate pages from fast(higher) tiers to slow(lower)
tiers when the fast(higher) tier is under memory pressure.
This patchset introduce 3 memory tiers (memtier100, memtier200 and memtier300)
which are created by different kernel subsystems. The default memory
tier created by the kernel is memtier200. A kernel parameter is provided
to override the default memory tier.
Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
include/linux/memory-tiers.h | 15 +++++++
mm/Makefile | 1 +
mm/memory-tiers.c | 78 ++++++++++++++++++++++++++++++++++++
3 files changed, 94 insertions(+)
create mode 100644 include/linux/memory-tiers.h
create mode 100644 mm/memory-tiers.c
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
new file mode 100644
index 000000000000..a81dbc20e0d1
--- /dev/null
+++ b/include/linux/memory-tiers.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MEMORY_TIERS_H
+#define _LINUX_MEMORY_TIERS_H
+
+#ifdef CONFIG_NUMA
+
+#define MEMORY_TIER_HBM_GPU 300
+#define MEMORY_TIER_DRAM 200
+#define MEMORY_TIER_PMEM 100
+
+#define DEFAULT_MEMORY_TIER MEMORY_TIER_DRAM
+#define MAX_MEMORY_TIER_ID 400
+
+#endif /* CONFIG_NUMA */
+#endif /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/Makefile b/mm/Makefile
index 6f9ffa968a1a..d30acebc2164 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
obj-$(CONFIG_FAILSLAB) += failslab.o
obj-$(CONFIG_MEMTEST) += memtest.o
obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_NUMA) += memory-tiers.o
obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
new file mode 100644
index 000000000000..69a5d81c0a12
--- /dev/null
+++ b/mm/memory-tiers.c
@@ -0,0 +1,78 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/types.h>
+#include <linux/nodemask.h>
+#include <linux/slab.h>
+#include <linux/lockdep.h>
+#include <linux/moduleparam.h>
+#include <linux/memory-tiers.h>
+
+struct memory_tier {
+ struct list_head list;
+ nodemask_t nodelist;
+ int id;
+};
+
+static DEFINE_MUTEX(memory_tier_lock);
+static LIST_HEAD(memory_tiers);
+
+static void insert_memory_tier(struct memory_tier *memtier)
+{
+ struct list_head *ent;
+ struct memory_tier *tmp_memtier;
+
+ lockdep_assert_held_once(&memory_tier_lock);
+
+ list_for_each(ent, &memory_tiers) {
+ tmp_memtier = list_entry(ent, struct memory_tier, list);
+ if (tmp_memtier->id < memtier->id) {
+ list_add_tail(&memtier->list, ent);
+ return;
+ }
+ }
+ list_add_tail(&memtier->list, &memory_tiers);
+}
+
+static struct memory_tier *register_memory_tier(unsigned int tier)
+{
+ struct memory_tier *memtier;
+
+ if (tier > MAX_MEMORY_TIER_ID)
+ return ERR_PTR(-EINVAL);
+
+ memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
+ if (!memtier)
+ return ERR_PTR(-ENOMEM);
+
+ memtier->id = tier;
+
+ insert_memory_tier(memtier);
+
+ return memtier;
+}
+
+static unsigned int default_memtier = DEFAULT_MEMORY_TIER;
+core_param(default_memory_tier, default_memtier, uint, 0644);
+
+static int __init memory_tier_init(void)
+{
+ struct memory_tier *memtier;
+
+ /*
+ * Register only default memory tier to hide all empty
+ * memory tier from sysfs. Since this is early during
+ * boot, we could avoid holding memtory_tier_lock. But
+ * keep it simple by holding locks. So we can add lock
+ * held debug checks in other functions.
+ */
+ mutex_lock(&memory_tier_lock);
+ memtier = register_memory_tier(default_memtier);
+ if (IS_ERR(memtier))
+ panic("%s() failed to register memory tier: %ld\n",
+ __func__, PTR_ERR(memtier));
+
+ /* CPU only nodes are not part of memory tiers. */
+ memtier->nodelist = node_states[N_MEMORY];
+ mutex_unlock(&memory_tier_lock);
+ return 0;
+}
+subsys_initcall(memory_tier_init);
--
2.36.1
next prev parent reply other threads:[~2022-07-04 7:08 UTC|newest]
Thread overview: 42+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-07-04 7:06 [PATCH v8 00/12] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
2022-07-04 7:06 ` Aneesh Kumar K.V [this message]
2022-07-04 7:06 ` [PATCH v8 02/12] mm/demotion: Move memory demotion related code Aneesh Kumar K.V
2022-07-04 7:06 ` [PATCH v8 03/12] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM Aneesh Kumar K.V
2022-07-04 7:06 ` [PATCH v8 04/12] mm/demotion: Add hotplug callbacks to handle new numa node onlined Aneesh Kumar K.V
2022-07-04 7:06 ` [PATCH v8 05/12] mm/demotion: Build demotion targets based on explicit memory tiers Aneesh Kumar K.V
2022-07-04 7:06 ` [PATCH v8 06/12] mm/demotion: Expose memory tier details via sysfs Aneesh Kumar K.V
2022-07-04 7:06 ` [PATCH v8 07/12] mm/demotion: Add per node memory tier attribute to sysfs Aneesh Kumar K.V
2022-07-04 7:06 ` [PATCH v8 08/12] mm/demotion: Add pg_data_t member to track node memory tier details Aneesh Kumar K.V
2022-07-04 7:06 ` [PATCH v8 09/12] mm/demotion: Demote pages according to allocation fallback order Aneesh Kumar K.V
2022-07-04 7:06 ` [PATCH v8 10/12] mm/demotion: Update node_is_toptier to work with memory tiers Aneesh Kumar K.V
2022-07-04 7:06 ` [PATCH v8 11/12] mm/demotion: Add documentation for memory tiering Aneesh Kumar K.V
2022-07-04 7:06 ` [PATCH v8 12/12] mm/demotion: Add sysfs ABI documentation Aneesh Kumar K.V
2022-07-04 15:00 ` [PATCH v8 00/12] mm/demotion: Memory tiers and demotion Matthew Wilcox
2022-07-05 3:45 ` Alistair Popple
2022-07-05 4:17 ` Aneesh Kumar K V
2022-07-05 4:29 ` Huang, Ying
2022-07-05 5:22 ` Aneesh Kumar K V
2022-07-12 1:16 ` Huang, Ying
2022-07-12 4:42 ` Aneesh Kumar K V
2022-07-12 5:09 ` Aneesh Kumar K V
2022-07-12 18:02 ` Yang Shi
2022-07-13 3:42 ` Huang, Ying
2022-07-13 6:38 ` Wei Xu
2022-07-13 6:39 ` Wei Xu
2022-07-13 7:25 ` Aneesh Kumar K V
2022-07-13 8:20 ` Huang, Ying
2022-07-12 6:59 ` Huang, Ying
2022-07-12 7:31 ` Aneesh Kumar K V
2022-07-12 8:48 ` Huang, Ying
2022-07-12 9:17 ` Aneesh Kumar K V
2022-07-13 2:59 ` Huang, Ying
2022-07-13 6:46 ` Wei Xu
2022-07-13 8:17 ` Huang, Ying
2022-07-19 14:00 ` Jonathan Cameron
2022-07-25 6:02 ` Huang, Ying
2022-07-13 9:44 ` Aneesh Kumar K.V
2022-07-13 9:40 ` Aneesh Kumar K.V
2022-07-14 4:56 ` Huang, Ying
2022-07-14 5:29 ` Aneesh Kumar K V
2022-07-14 7:21 ` Huang, Ying
2022-07-11 15:29 ` Aneesh Kumar K.V
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20220704070612.299585-2-aneesh.kumar@linux.ibm.com \
--to=aneesh.kumar@linux.ibm.com \
--cc=Jonathan.Cameron@huawei.com \
--cc=akpm@linux-foundation.org \
--cc=apopple@nvidia.com \
--cc=dan.j.williams@intel.com \
--cc=dave.hansen@intel.com \
--cc=dave@stgolabs.net \
--cc=hannes@cmpxchg.org \
--cc=hesham.almatary@huawei.com \
--cc=jvgediya.oss@gmail.com \
--cc=jvgediya@linux.ibm.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@kernel.org \
--cc=shy828301@gmail.com \
--cc=tim.c.chen@intel.com \
--cc=weixugc@google.com \
--cc=ying.huang@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).