From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
Wei Xu <weixugc@google.com>, Yang Shi <shy828301@gmail.com>,
Davidlohr Bueso <dave@stgolabs.net>,
Tim C Chen <tim.c.chen@intel.com>,
Michal Hocko <mhocko@kernel.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Hesham Almatary <hesham.almatary@huawei.com>,
Dave Hansen <dave.hansen@intel.com>,
Jonathan Cameron <Jonathan.Cameron@huawei.com>,
Alistair Popple <apopple@nvidia.com>,
Dan Williams <dan.j.williams@intel.com>,
Johannes Weiner <hannes@cmpxchg.org>,
jvgediya.oss@gmail.com, Jagdish Gediya <jvgediya@linux.ibm.com>
Subject: Re: [PATCH v9 1/8] mm/demotion: Add support for explicit memory tiers
Date: Fri, 15 Jul 2022 15:57:13 +0530 [thread overview]
Message-ID: <87sfn2u0vy.fsf@linux.ibm.com> (raw)
In-Reply-To: <3659f1bb-a82e-1aad-f297-808a2c17687d@linux.ibm.com>
Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
....
>
>> You dropped the original sysfs interface patches from the series, but
>> the kernel internal implementation is still for the original sysfs
>> interface. For example, memory tier ID is for the original sysfs
>> interface, not for the new proposed sysfs interface. So I suggest you
>> to implement with the new interface in mind. What do you think about
>> the following design?
>>
>
> Sorry I am not able to follow you here. This patchset completely drops
> exposing memory tiers to userspace via sysfs. Instead it allow
> creation of memory tiers with specific tierID from within the kernel/device driver.
> Default tierID is 200 and dax kmem creates memory tier with tierID 100.
>
>
>> - Each NUMA node belongs to a memory type, and each memory type
>> corresponds to a "abstract distance", so each NUMA node corresonds to
>> a "distance". For simplicity, we can start with static distances, for
>> example, DRAM (default): 150, PMEM: 250. The distance of each NUMA
>> node can be recorded in a global array,
>>
>> int node_distances[MAX_NUMNODES];
>>
>> or, just
>>
>> pgdat->distance
>>
>
> I don't follow this. I guess you are trying to have a different design.
> Would it be much easier if you can write this in the form of a patch?
>
>
>> - Each memory tier corresponds to a range of distance, for example,
>> 0-100, 100-200, 200-300, >300, we can start with static ranges too.
>>
>> - The core API of memory tier could be
>>
>> struct memory_tier *find_create_memory_tier(int distance);
>>
>> it will find the memory tier which covers "distance" in the memory
>> tier list, or create a new memory tier if not found.
>>
>
> I was expecting this to be internal to dax kmem. How dax kmem maps
> "abstract distance" to a memory tier. At this point this patchset is
> keeping all that for a future patchset.
>
This shows how i was expecting "abstract distance" to be integrated.
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index 82cae08976bc..1281aec63986 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -1332,6 +1332,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
ndr_desc.mapping = &mapping;
ndr_desc.num_mappings = 1;
ndr_desc.nd_set = &p->nd_set;
+ ndr_desc.memtier_distance = PMEM_MEMTIER_DEFAULT_DISTANCE;
if (p->hcall_flush_required) {
set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index ae5f4acf2675..7b8cf1f15562 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2641,6 +2641,10 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
NUMA_NO_NODE, ndr_desc->numa_node, &res.start, &res.end);
}
+ /*
+ * We may want to look at SLIT/HMAT to fine tune this
+ */
+ ndr_desc->memtier_distance = PMEM_MEMTIER_DEFAULT_DISTANCE;
/*
* Persistence domain bits are hierarchical, if
* ACPI_NFIT_CAPABILITY_CACHE_FLUSH is set then
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 1dad813ee4a6..708a40cf29c0 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -570,8 +570,9 @@ static void dax_region_unregister(void *region)
}
struct dax_region *alloc_dax_region(struct device *parent, int region_id,
- struct range *range, int target_node, unsigned int align,
- unsigned long flags)
+ struct range *range, int target_node,
+ int memtier_distance, unsigned int align,
+ unsigned long flags)
{
struct dax_region *dax_region;
@@ -599,6 +600,7 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id,
dax_region->align = align;
dax_region->dev = parent;
dax_region->target_node = target_node;
+ dax_region->memtier_distance = memtier_distance;
ida_init(&dax_region->ida);
dax_region->res = (struct resource) {
.start = range->start,
@@ -1370,6 +1372,7 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
dev_dax->dax_dev = dax_dev;
dev_dax->target_node = dax_region->target_node;
+ dev_dax->memtier_distance = dax_region->memtier_distance;
dev_dax->align = dax_region->align;
ida_init(&dev_dax->ida);
kref_get(&dax_region->kref);
diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index fbb940293d6d..3de4292392dd 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -13,8 +13,9 @@ void dax_region_put(struct dax_region *dax_region);
#define IORESOURCE_DAX_STATIC (1UL << 0)
struct dax_region *alloc_dax_region(struct device *parent, int region_id,
- struct range *range, int target_node, unsigned int align,
- unsigned long flags);
+ struct range *range, int target_node,
+ int memtier_distance, unsigned int align,
+ unsigned long flags);
struct dev_dax_data {
struct dax_region *dax_region;
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 1c974b7caae6..5db382c78d0e 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -31,6 +31,7 @@ void dax_bus_exit(void);
struct dax_region {
int id;
int target_node;
+ int memtier_distance;
struct kref kref;
struct device *dev;
unsigned int align;
@@ -64,6 +65,7 @@ struct dev_dax {
struct dax_device *dax_dev;
unsigned int align;
int target_node;
+ int memtier_distance;
int id;
struct ida ida;
struct device dev;
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index 1bf040dbc834..b9f80971c07b 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -26,7 +26,7 @@ static int dax_hmem_probe(struct platform_device *pdev)
range.start = res->start;
range.end = res->end;
dax_region = alloc_dax_region(dev, pdev->id, &range, mri->target_node,
- PMD_SIZE, 0);
+ mri->memtier_distance, PMD_SIZE, 0);
if (!dax_region)
return -ENOMEM;
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 0c03889286ac..32878bd96f09 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -45,13 +45,18 @@ struct dax_kmem_data {
static unsigned int dax_kmem_memtier = MEMORY_TIER_PMEM;
module_param(dax_kmem_memtier, uint, 0644);
+int find_memtier_from_distance(struct dev_dax *dev_dax)
+{
+ return dax_kmem_memtier + dev_dax->memtier_distance;
+}
+
static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
{
struct device *dev = &dev_dax->dev;
unsigned long total_len = 0;
struct dax_kmem_data *data;
int i, rc, mapped = 0;
- int numa_node;
+ int numa_node, mem_tier;
/*
* Ensure good NUMA information for the persistent memory.
@@ -150,7 +155,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
}
dev_set_drvdata(dev, data);
- node_create_and_set_memory_tier(numa_node, dax_kmem_memtier);
+ mem_tier = find_memtier_from_distance(dev_dax);
+ node_create_and_set_memory_tier(numa_node, mem_tier);
return 0;
err_request_mem:
diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index f050ea78bb83..1b51fc0490de 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -54,8 +54,10 @@ static struct dev_dax *__dax_pmem_probe(struct device *dev)
range = pgmap.range;
range.start += offset;
dax_region = alloc_dax_region(dev, region_id, &range,
- nd_region->target_node, le32_to_cpu(pfn_sb->align),
- IORESOURCE_DAX_STATIC);
+ nd_region->target_node,
+ nd_region->memtier_distance,
+ le32_to_cpu(pfn_sb->align),
+ IORESOURCE_DAX_STATIC);
if (!dax_region)
return ERR_PTR(-ENOMEM);
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index ec5219680092..cf7a379a2220 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -416,6 +416,7 @@ struct nd_region {
u64 ndr_size;
u64 ndr_start;
int id, num_lanes, ro, numa_node, target_node;
+ int memtier_distance;
void *provider_data;
struct kernfs_node *bb_state;
struct badblocks bb;
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index d976260eca7a..f2067de8d660 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -1019,6 +1019,7 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
nd_region->ro = ro;
nd_region->numa_node = ndr_desc->numa_node;
nd_region->target_node = ndr_desc->target_node;
+ nd_region->memtier_distance = ndr_desc->memtier_distance;
ida_init(&nd_region->ns_ida);
ida_init(&nd_region->btt_ida);
ida_init(&nd_region->pfn_ida);
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index 0d61e07b6827..bf20e018074f 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -121,6 +121,7 @@ struct nd_region_desc {
int num_lanes;
int numa_node;
int target_node;
+ int memtier_distance;
unsigned long flags;
struct device_node *of_node;
int (*flush)(struct nd_region *nd_region, struct bio *bio);
@@ -224,6 +225,8 @@ struct nvdimm_fw_ops {
int (*arm)(struct nvdimm *nvdimm, enum nvdimm_fwa_trigger arg);
};
+#define PMEM_MEMTIER_DEFAULT_DISTANCE 10
+
void badrange_init(struct badrange *badrange);
int badrange_add(struct badrange *badrange, u64 addr, u64 length);
void badrange_forget(struct badrange *badrange, phys_addr_t start,
diff --git a/include/linux/memregion.h b/include/linux/memregion.h
index c04c4fd2e209..5850e2bbbfed 100644
--- a/include/linux/memregion.h
+++ b/include/linux/memregion.h
@@ -6,6 +6,7 @@
struct memregion_info {
int target_node;
+ int memtier_distance;
};
#ifdef CONFIG_MEMREGION
next prev parent reply other threads:[~2022-07-15 10:27 UTC|newest]
Thread overview: 29+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-07-14 4:53 [PATCH v9 0/8] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
2022-07-14 4:53 ` [PATCH v9 1/8] mm/demotion: Add support for explicit memory tiers Aneesh Kumar K.V
2022-07-15 7:53 ` Huang, Ying
2022-07-15 9:08 ` Aneesh Kumar K V
2022-07-15 9:24 ` Aneesh Kumar K V
2022-07-15 10:27 ` Aneesh Kumar K.V [this message]
2022-07-18 6:08 ` Huang, Ying
2022-07-18 6:57 ` Huang, Ying
2022-07-18 8:00 ` Aneesh Kumar K V
2022-07-18 8:55 ` Huang, Ying
2022-07-15 16:59 ` Wei Xu
2022-07-18 5:28 ` Huang, Ying
2022-07-18 5:58 ` Alistair Popple
2022-07-18 6:56 ` Aneesh Kumar K V
2022-07-14 4:53 ` [PATCH v9 2/8] mm/demotion: Move memory demotion related code Aneesh Kumar K.V
2022-07-14 4:53 ` [PATCH v9 3/8] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM Aneesh Kumar K.V
2022-07-14 4:53 ` [PATCH v9 4/8] mm/demotion: Add hotplug callbacks to handle new numa node onlined Aneesh Kumar K.V
2022-07-15 4:38 ` Alistair Popple
2022-07-15 7:23 ` Aneesh Kumar K.V
2022-07-14 4:53 ` [PATCH v9 5/8] mm/demotion: Build demotion targets based on explicit memory tiers Aneesh Kumar K.V
2022-07-15 4:47 ` Alistair Popple
2022-07-15 7:21 ` Aneesh Kumar K.V
2022-07-18 5:41 ` Alistair Popple
2022-07-14 4:53 ` [PATCH v9 6/8] mm/demotion: Add pg_data_t member to track node memory tier details Aneesh Kumar K.V
2022-07-15 5:49 ` Alistair Popple
2022-07-15 7:19 ` Aneesh Kumar K.V
2022-07-18 5:22 ` Alistair Popple
2022-07-14 4:53 ` [PATCH v9 7/8] mm/demotion: Demote pages according to allocation fallback order Aneesh Kumar K.V
2022-07-14 4:53 ` [PATCH v9 8/8] mm/demotion: Update node_is_toptier to work with memory tiers Aneesh Kumar K.V
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87sfn2u0vy.fsf@linux.ibm.com \
--to=aneesh.kumar@linux.ibm.com \
--cc=Jonathan.Cameron@huawei.com \
--cc=akpm@linux-foundation.org \
--cc=apopple@nvidia.com \
--cc=dan.j.williams@intel.com \
--cc=dave.hansen@intel.com \
--cc=dave@stgolabs.net \
--cc=hannes@cmpxchg.org \
--cc=hesham.almatary@huawei.com \
--cc=jvgediya.oss@gmail.com \
--cc=jvgediya@linux.ibm.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@kernel.org \
--cc=shy828301@gmail.com \
--cc=tim.c.chen@intel.com \
--cc=weixugc@google.com \
--cc=ying.huang@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.