From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
To: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
Huang Ying <ying.huang@intel.com>,
Greg Thelen <gthelen@google.com>, Yang Shi <shy828301@gmail.com>,
Davidlohr Bueso <dave@stgolabs.net>,
Tim C Chen <tim.c.chen@intel.com>,
Brice Goglin <brice.goglin@gmail.com>,
Michal Hocko <mhocko@kernel.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Hesham Almatary <hesham.almatary@huawei.com>,
Dave Hansen <dave.hansen@intel.com>,
Alistair Popple <apopple@nvidia.com>,
Dan Williams <dan.j.williams@intel.com>,
Feng Tang <feng.tang@intel.com>,
Jagdish Gediya <jvgediya@linux.ibm.com>,
Baolin Wang <baolin.wang@linux.alibaba.com>,
David Rientjes <rientjes@google.com>
Subject: Re: [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs
Date: Mon, 06 Jun 2022 23:16:15 +0530 [thread overview]
Message-ID: <87ee01ofbs.fsf@linux.ibm.com> (raw)
In-Reply-To: <efede910-e0d7-02e6-d536-c25a7225d88c@linux.ibm.com>
Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> On 6/6/22 9:46 PM, Jonathan Cameron wrote:
>> On Mon, 6 Jun 2022 21:31:16 +0530
>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> wrote:
>>
>>> On 6/6/22 8:29 PM, Jonathan Cameron wrote:
>>>> On Fri, 3 Jun 2022 14:10:47 +0530
>>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> wrote:
>>>>
>>>>> On 5/27/22 7:45 PM, Jonathan Cameron wrote:
>>>>>> On Fri, 27 May 2022 17:55:23 +0530
>>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:
>>>>>>
>>>>>>> From: Jagdish Gediya <jvgediya@linux.ibm.com>
>>>>>>>
>>>>>>> Add support to read/write the memory tier index for a NUMA node.
>>>>>>>
>>>>>>> /sys/devices/system/node/nodeN/memtier
>>>>>>>
>>>>>>> where N = node id
>>>>>>>
>>>>>>> When read, it lists the memory tier that the node belongs to.
>>>>>>>
>>>>>>> When written, the kernel moves the node into the specified
>>>>>>> memory tier; the tier assignment of all other nodes is not
>>>>>>> affected.
>>>>>>>
>>>>>>> If the memory tier does not exist, writing to the above file
>>>>>>> create the tier and assign the NUMA node to that tier.
>>>>>> creates
>>>>>>
>>>>>> There was some discussion in v2 of Wei Xu's RFC that what matters
>>>>>> for creation is the rank, not the tier number.
>>>>>>
>>>>>> My suggestion is to move to an explicit creation file such as
>>>>>> memtier/create_tier_from_rank,
>>>>>> to which writing the rank results in a new tier
>>>>>> with the next device ID and the requested rank.
>>>>>
>>>>> I think the below workflow is much simpler.
>>>>>
>>>>> :/sys/devices/system# cat memtier/memtier1/nodelist
>>>>> 1-3
>>>>> :/sys/devices/system# cat node/node1/memtier
>>>>> 1
>>>>> :/sys/devices/system# ls memtier/memtier*
>>>>> nodelist power rank subsystem uevent
>>>>> /sys/devices/system# ls memtier/
>>>>> default_rank max_tier memtier1 power uevent
>>>>> :/sys/devices/system# echo 2 > node/node1/memtier
>>>>> :/sys/devices/system#
>>>>>
>>>>> :/sys/devices/system# ls memtier/
>>>>> default_rank max_tier memtier1 memtier2 power uevent
>>>>> :/sys/devices/system# cat memtier/memtier1/nodelist
>>>>> 2-3
>>>>> :/sys/devices/system# cat memtier/memtier2/nodelist
>>>>> 1
>>>>> :/sys/devices/system#
>>>>>
>>>>> i.e., to create a tier we just write the tier id/tier index to the
>>>>> node/nodeN/memtier file. That will create a new memory tier if needed
>>>>> and add the node to that specific memory tier. Since for now we have a
>>>>> 1:1 mapping between tier index and rank value, we can derive the
>>>>> rank value from the memory tier index.
>>>>>
>>>>> For dynamic memory tier support, we can assign a rank value such that
>>>>> a new memory tier always comes last in the demotion order.
>>>>
>>>> I'm not keen on having to pass through an intermediate state where
>>>> the rank may well be wrong, but I guess it's not that harmful even
>>>> if it feels wrong ;)
>>>>
>>>
>>> Any new memory tier added can be given the lowest rank (rank 0) and hence
>>> will appear as the highest memory tier in demotion order.
>>
>> Depends on driver interaction - if new memory is CXL attached or
>> GPU attached, chances are the driver has an input on which tier
>> it is put in by default.
>>
>>> User can then
>>> assign the right rank value to the memory tier? Also, the actual demotion
>>> target paths are built during memory block onlining, which in most cases
>>> would happen after we properly verify that the device got assigned to
>>> the right memory tier with the correct rank value?
>>
>> Agreed, though that may change the model of how memory is brought online
>> somewhat.
>>
>>>
>>>> Races are potentially a bit of a pain though depending on what we
>>>> expect the usage model to be.
>>>>
>>>> There are patterns (CXL regions for example) of guaranteeing the
>>>> 'right' device is created by doing something like
>>>>
>>>> cat create_tier > temp.txt
>>>> #(temp gets 2 for example on first call then
>>>> # next read of this file gets 3 etc)
>>>>
>>>> cat temp.txt > create_tier
>>>> # will fail if there hasn't been a read of the same value
>>>>
>>>> Assuming all software keeps to the model, then there are no
>>>> race conditions over creation. Otherwise we have two new
>>>> devices turn up very close to each other and userspace scripting
>>>> tries to create two new tiers - if it races they may end up in
>>>> the same tier when that wasn't the intent. Then code to set
>>>> the rank also races and we get two potentially very different
>>>> memories in a tier with a randomly selected rank.
>>>>
>>>> Fun and games... And a fine illustration why sysfs based 'device'
>>>> creation is tricky to get right (and lots of cases in the kernel
>>>> don't).
>>>>
>>>
>>> I would expect userspace to be careful and verify the memory tier and
>>> rank value before we online the memory blocks backed by the device. Even
>>> if we race, the result would be two devices not intended to be part of
>>> the same memory tier appearing in the same tier. But then we won't be
>>> building demotion targets yet. So userspace could verify this and move
>>> the nodes out of the memory tier. Once that is verified, memory blocks can be
>>> onlined.
>>
>> The race is there and not avoidable as far as I can see. Two processes A and B.
>>
>> A checks for a spare tier number
>> B checks for a spare tier number
>> A tries to assign node 3 to new tier 2 (new tier created)
>> B tries to assign node 4 to new tier 2 (accidentally hits existing tier - as this
>> is the same method we'd use to put it in the existing tier we can't tell this
>> write was meant to create a new tier).
>> A writes rank 100 to tier 2
>> A checks rank for tier 2 and finds it is 100 as expected.
>> B writes rank 200 to tier 2 (it could check if it is still the default, but even that is racy).
>> B checks the rank for tier 2 and finds it is 200 as expected.
>> A onlines memory.
>> B onlines memory.
>>
>> Both think they got what they wanted, but A definitely didn't.
>>
>> One work around is the read / write approach and create_tier.
>>
>> A reads create_tier - gets 2.
>> B reads create_tier - gets 3.
>> A writes 2 to create_tier as that's what it read.
>> B writes 3 to create_tier as that's what it read.
>>
>> Continue with the created tiers. Obviously this can exhaust tiers, but if
>> it is root-only, one could just create lots anyway, so no worse off.
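
Just to check I am reading the read/write handshake correctly, below is a
rough, untested sketch of what such a create_tier attribute pair could look
like. next_free_tier_id() is a made-up helper, and the locking and the
register_memory_tier(tier, rank) signature (as in the diff further down) are
assumptions, so treat this purely as an illustration of the handshake:

static int last_offered_tier = -1;

static ssize_t create_tier_show(struct device *dev,
                                struct device_attribute *attr, char *buf)
{
        int tier;

        mutex_lock(&memory_tier_lock);
        /* Offer the next free tier id and remember what we offered. */
        tier = next_free_tier_id();
        last_offered_tier = tier;
        mutex_unlock(&memory_tier_lock);

        return sysfs_emit(buf, "%d\n", tier);
}

static ssize_t create_tier_store(struct device *dev,
                                 struct device_attribute *attr,
                                 const char *buf, size_t count)
{
        int ret, tier;

        ret = kstrtoint(buf, 10, &tier);
        if (ret)
                return ret;

        mutex_lock(&memory_tier_lock);
        /* Only honour a write that echoes back the value last read. */
        if (tier != last_offered_tier) {
                ret = -EBUSY;
                goto out;
        }
        /* New tiers start at the default (lowest) rank. */
        if (!register_memory_tier(tier, 0)) {
                ret = -ENOMEM;
                goto out;
        }
        last_offered_tier = -1;
        ret = count;
out:
        mutex_unlock(&memory_tier_lock);
        return ret;
}
static DEVICE_ATTR_RW(create_tier);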
>>
>>>
>>> Having said that can you outline the usage of
>>> memtier/create_tier_from_rank ?
>>
>> There are corner cases to deal with...
>>
>> A writes 100 to create_tier_from_rank.
>> A goes looking for matching tier - finds it: tier2
>> B writes 200 to create_tier_from_rank
>> B goes looking for matching tier - finds it: tier3
>>
>> rest is fine as operating on different tiers.
>>
>> Trickier is
>> A writes 100 to create_tier_from_rank - succeed.
>> B writes 100 to create_tier_from_rank - Could fail, or could just eat it?
>>
>> Logically this is the same as a separate create_tier and then a write
>> of the rank, just in one operation, but then you need to search
>> for the right one. As such, perhaps a create_tier
>> that does the read/write pair as above is the best solution.
>>
>
> This all is good when we allow dynamic rank values. But currently we are
> restricting ourselves to three rank values, as below:
>
> rank memtier
> 300 memtier0
> 200 memtier1
> 100 memtier2
>
> Now with the above, how do we define a write to create_tier_from_rank?
> What should the behavior be if the user writes a value other than the
> rank values defined above? Also, enforcing the above three rank values as
> the supported set implies teaching userspace about them. I am trying to
> see how to fit create_tier_from_rank in without requiring the above.
>
> Can we look at implementing create_tier_from_rank when we start
> supporting dynamic tiers/rank values? i.e.,
>
> we still allow node/nodeN/memtier. But with dynamic tiers, a race-free
> way to get a new memory tier would be echo rank >
> memtier/create_tier_from_rank. We could also say memtier0/1/2 are
> kernel-defined memory tiers, and writing to memtier/create_tier_from_rank
> will create new memory tiers above memtier2 with the rank value specified?
>
To keep it compatible we could do this, i.e., we just allow creation of
one additional memory tier (memtier3) via the above interface.
:/sys/devices/system/memtier# ls -al
total 0
drwxr-xr-x 4 root root 0 Jun 6 17:39 .
drwxr-xr-x 10 root root 0 Jun 6 17:39 ..
--w------- 1 root root 4096 Jun 6 17:40 create_tier_from_rank
-r--r--r-- 1 root root 4096 Jun 6 17:40 default_tier
-r--r--r-- 1 root root 4096 Jun 6 17:40 max_tier
drwxr-xr-x 3 root root 0 Jun 6 17:39 memtier1
drwxr-xr-x 2 root root 0 Jun 6 17:40 power
-rw-r--r-- 1 root root 4096 Jun 6 17:39 uevent
:/sys/devices/system/memtier# echo 20 > create_tier_from_rank
:/sys/devices/system/memtier# ls
create_tier_from_rank default_tier max_tier memtier1 memtier3 power uevent
:/sys/devices/system/memtier# cat memtier3/rank
20
:/sys/devices/system/memtier# echo 20 > create_tier_from_rank
bash: echo: write error: No space left on device
:/sys/devices/system/memtier#
Is this good?
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 0468af60d427..a4150120ba24 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -13,7 +13,7 @@
#define MEMORY_RANK_PMEM 100
#define DEFAULT_MEMORY_TIER MEMORY_TIER_DRAM
-#define MAX_MEMORY_TIERS 3
+#define MAX_MEMORY_TIERS 4
extern bool numa_demotion_enabled;
extern nodemask_t promotion_mask;
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index c6eb223a219f..7fdee0c4c4ea 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -169,7 +169,8 @@ static void insert_memory_tier(struct memory_tier *memtier)
list_add_tail(&memtier->list, &memory_tiers);
}
-static struct memory_tier *register_memory_tier(unsigned int tier)
+static struct memory_tier *register_memory_tier(unsigned int tier,
+ unsigned int rank)
{
int error;
struct memory_tier *memtier;
@@ -182,7 +183,7 @@ static struct memory_tier *register_memory_tier(unsigned int tier)
return NULL;
memtier->dev.id = tier;
- memtier->rank = get_rank_from_tier(tier);
+ memtier->rank = rank;
memtier->dev.bus = &memory_tier_subsys;
memtier->dev.release = memory_tier_device_release;
memtier->dev.groups = memory_tier_dev_groups;
@@ -218,9 +219,53 @@ default_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
}
static DEVICE_ATTR_RO(default_tier);
+
+static struct memory_tier *__get_memory_tier_from_id(int id);
+static ssize_t create_tier_from_rank_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ int ret, rank;
+ struct memory_tier *memtier;
+
+ ret = kstrtouint(buf, 10, &rank);
+ if (ret)
+ return ret;
+
+ if (rank == MEMORY_RANK_HBM_GPU ||
+ rank == MEMORY_RANK_DRAM ||
+ rank == MEMORY_RANK_PMEM)
+ return -EINVAL;
+
+ mutex_lock(&memory_tier_lock);
+ /*
+ * For now we only support creation of one additional tier via
+ * this interface.
+ */
+ memtier = __get_memory_tier_from_id(3);
+ if (!memtier) {
+ memtier = register_memory_tier(3, rank);
+ if (!memtier) {
+ ret = -EINVAL;
+ goto out;
+ }
+ } else {
+ ret = -ENOSPC;
+ goto out;
+ }
+
+ ret = count;
+out:
+ mutex_unlock(&memory_tier_lock);
+ return ret;
+}
+static DEVICE_ATTR_WO(create_tier_from_rank);
+
+
static struct attribute *memory_tier_attrs[] = {
&dev_attr_max_tier.attr,
&dev_attr_default_tier.attr,
+ &dev_attr_create_tier_from_rank.attr,
NULL
};
@@ -302,7 +347,7 @@ static int __node_set_memory_tier(int node, int tier)
memtier = __get_memory_tier_from_id(tier);
if (!memtier) {
- memtier = register_memory_tier(tier);
+ memtier = register_memory_tier(tier, get_rank_from_tier(tier));
if (!memtier) {
ret = -EINVAL;
goto out;
@@ -651,7 +696,8 @@ static int __init memory_tier_init(void)
* Register only default memory tier to hide all empty
* memory tier from sysfs.
*/
- memtier = register_memory_tier(DEFAULT_MEMORY_TIER);
+ memtier = register_memory_tier(DEFAULT_MEMORY_TIER,
+ get_rank_from_tier(DEFAULT_MEMORY_TIER));
if (!memtier)
panic("%s() failed to register memory tier: %d\n", __func__, ret);