linux-mm.kvack.org archive mirror
From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
To: Aneesh Kumar K V <aneesh.kumar@linux.ibm.com>
Cc: <linux-mm@kvack.org>, <akpm@linux-foundation.org>,
	Huang Ying <ying.huang@intel.com>,
	Greg Thelen <gthelen@google.com>, Yang Shi <shy828301@gmail.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Tim C Chen <tim.c.chen@intel.com>,
	Brice Goglin <brice.goglin@gmail.com>,
	Michal Hocko <mhocko@kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Hesham Almatary <hesham.almatary@huawei.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Alistair Popple <apopple@nvidia.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Feng Tang <feng.tang@intel.com>,
	Jagdish Gediya <jvgediya@linux.ibm.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	David Rientjes <rientjes@google.com>
Subject: Re: [RFC PATCH v4 5/7] mm/demotion: Add support to associate rank with memory tier
Date: Mon, 30 May 2022 13:36:57 +0100
Message-ID: <20220530133657.00001164@Huawei.com>
In-Reply-To: <37d91345-306d-e308-61c1-50e0d76992c0@linux.ibm.com>

On Fri, 27 May 2022 21:15:09 +0530
Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> wrote:

> On 5/27/22 8:15 PM, Jonathan Cameron wrote:
> > On Fri, 27 May 2022 17:55:26 +0530
> > "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:
> >   
> >> The rank approach allows us to keep memory tier device IDs stable even if there
> >> is a need to change the tier ordering among different memory tiers. e.g. DRAM
> >> nodes with CPUs will always be on memtier1, no matter how many tiers are higher
> >> or lower than these nodes. A new memory tier can be inserted into the tier
> >> hierarchy for a new set of nodes without affecting the node assignment of any
> >> existing memtier, provided that there is enough gap in the rank values for the
> >> new memtier.
> >>
> >> The absolute value of "rank" of a memtier doesn't necessarily carry any meaning.
> >> Its value relative to other memtiers decides the level of this memtier in the tier
> >> hierarchy.
> >>
> >> For now, this patch supports hardcoded rank values, which are 100, 200, and 300
> >> for memory tiers 0, 1, and 2 respectively.
> >>
> >> Below is the sysfs interface to read the rank value of a memory tier:
> >> /sys/devices/system/memtier/memtierN/rank
> >>
> >> This interface is read-only for now. Write support can be added when there is
> >> a need for more than three memory tiers with flexible ordering among them.
> >> Rank can be used there because rank, not the memory tier device ID, now
> >> decides the memory tier ordering.
> >>
> >> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>  
> > 
> > I'd squash a lot of this with the original patch introducing tiers. As things
> > stand we have 2 tricky to follow patches covering the same code rather than
> > one that would be simpler.
> >   
> 
> Sure. Will do that in the next update.
> 
> > Jonathan
> >   
> >> ---
> >>   drivers/base/node.c     |   5 +-
> >>   drivers/dax/kmem.c      |   2 +-
> >>   include/linux/migrate.h |  17 ++--
> >>   mm/migrate.c            | 218 ++++++++++++++++++++++++----------------
> >>   4 files changed, 144 insertions(+), 98 deletions(-)
> >>
> >> diff --git a/drivers/base/node.c b/drivers/base/node.c
> >> index cf4a58446d8c..892f7c23c94e 100644
> >> --- a/drivers/base/node.c
> >> +++ b/drivers/base/node.c
> >> @@ -567,8 +567,11 @@ static ssize_t memtier_show(struct device *dev,
> >>   			    char *buf)
> >>   {
> >>   	int node = dev->id;
> >> +	int tier_index = node_get_memory_tier_id(node);
> >>   
> >> -	return sysfs_emit(buf, "%d\n", node_get_memory_tier(node));
> >> +	if (tier_index != -1)
> >> +		return sysfs_emit(buf, "%d\n", tier_index);  
> > I think failure to get a tier is an error. So if it happens, return an error code.
> > Also, it is preferred to handle errors out of line, as that is more idiomatic
> > and reviewers read it quicker.
> > 
> > 	if (tier_index == -1)
> > 		return -EINVAL;
> > 
> > 	return sysfs_emit()...
> >   
> >> +	return 0;
> >>   }
> >>     
> 
> 
> That was needed to handle NUMA nodes that are not part of any memory
> tier, like a CPU-only NUMA node or a NUMA node that doesn't want to
> participate in memory demotion.
> 
> 
> 
> >>   static ssize_t memtier_store(struct device *dev,
> >> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> >> index 991782aa2448..79953426ddaf 100644
> >> --- a/drivers/dax/kmem.c
> >> +++ b/drivers/dax/kmem.c
> >> @@ -149,7 +149,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> >>   	dev_set_drvdata(dev, data);
> >>     
> 
> 
> ...
> 
> >>   
> >> -static DEVICE_ATTR_RO(default_tier);
> >> +static DEVICE_ATTR_RO(default_rank);
> >>   
> >>   static struct attribute *memoty_tier_attrs[] = {
> >> -	&dev_attr_max_tiers.attr,
> >> -	&dev_attr_default_tier.attr,
> >> +	&dev_attr_max_tier.attr,
> >> +	&dev_attr_default_rank.attr,  
> > 
> > hmm. Not sure why rename to tier rather than tiers.
> > 
> > Also, I think the default should be a tier, not a rank.  If someone later
> > wants to change the rank of tier1 that's up to them, but any new hotplugged
> > memory should still end up in there by default.
> >   
> 
> Didn't we say the tier index/device ID is a meaningless entity that
> controls just the naming, i.e. for memtier128, 128 doesn't mean anything.

> Instead it is the rank value associated with memtier128 that controls the
> demotion order? If so, what we want to tell userspace is the max tier
> index userspace can expect and the default rank value to which
> memory will be added by hotplug.

Sort of.  I think we want default to refer to a particular tier, probably
at all times, thus allowing the administrator to potentially change what the
rank of that default group is for everything currently in it and anything
added later.  So I would keep the default as pointing to a particular
tier. This also reflects the earlier discussion about having multiple tiers
with the same rank. I would allow that, as it makes for a cleaner interface
if we make rank writeable in the future. If that happens, what does
default rank mean? Which of the tiers is used?

For other cases, rank is the value that matters for ordering, but the particular
tier is what a driver etc. uses.

The reason is to allow an admin to change the rank of (for example)
all GPU memory, such that it affects the GPU memory already present and
any added in the future (rather than a new tier being created with whatever
the GPU driver thinks the rank should be).  The way I think about this
means that default should be the same - tied to a particular tier, not
a particular rank.  If software wants the current rank of the default tier
then it can go look it up in the tier. 
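
To illustrate, here is a minimal user-space sketch of that model. It is not
kernel code: struct memtier, find_tier(), set_tier_rank() and default_rank()
are hypothetical names used only for this example, and the rank values are the
hardcoded 100/200/300 from this patch. The default is stored as a tier ID, so
the default rank is always derived by looking it up in the tier.

```c
#include <stddef.h>

/* Hypothetical user-space model, not the kernel's data structures. */
struct memtier {
	int id;   /* stable device ID, the N in memtierN */
	int rank; /* ordering key, could become writeable later */
};

/* Hardcoded values from the patch: tiers 0, 1, 2 get ranks 100, 200, 300. */
static struct memtier tiers[] = {
	{ .id = 0, .rank = 100 },
	{ .id = 1, .rank = 200 },
	{ .id = 2, .rank = 300 },
};

#define NR_TIERS (sizeof(tiers) / sizeof(tiers[0]))

/* The default refers to a tier, not to a rank value. */
static int default_tier_id = 1;

static struct memtier *find_tier(int id)
{
	for (size_t i = 0; i < NR_TIERS; i++)
		if (tiers[i].id == id)
			return &tiers[i];
	return NULL;
}

/* Changing a rank affects everything in the tier, present and future. */
static int set_tier_rank(int id, int rank)
{
	struct memtier *t = find_tier(id);

	if (!t)
		return -1;
	t->rank = rank;
	return 0;
}

/* Software that wants the default rank looks it up through the tier. */
static int default_rank(void)
{
	struct memtier *t = find_tier(default_tier_id);

	return t ? t->rank : -1;
}
```

With this shape, two tiers sharing a rank is not ambiguous for hotplug, because
new memory is placed by tier ID and the rank is only ever read back from it.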

> 
> But yes, tier index 1 and default rank 200 are reserved and created by
> default.
> 
> 
> ....
> 
> >>   	/*
> >>   	 * if node is already part of the tier proceed with the
> >>   	 * current tier value, because we might want to establish
> >> @@ -2411,15 +2452,17 @@ int node_set_memory_tier(int node, int tier)
> >>   	 * before it was made part of N_MEMORY, hence estabilish_migration_targets
> >>   	 * will have skipped this node.
> >>   	 */
> >> -	if (current_tier != -1)
> >> -		tier = current_tier;
> >> -	ret = __node_set_memory_tier(node, tier);
> >> +	if (memtier)
> >> +		establish_migration_targets();
> >> +	else {
> >> +		/* For now rank value and tier value is same. */  
> > 
> > We should avoid baking that in...  
> 
> 
> Making it dynamic adds lots of complexity, such as an ida alloc for the tier
> index etc. I didn't want to get there unless we are sure we need a dynamic
> number of tiers.

Agreed it's more complex (though not very).  I'm just suggesting dropping
the comment.

If it were me, I'd make tier0 the default with the mid rank. Then tier1
as slower and tier2 as faster.  Hopefully that would avoid any userspace
code making assumptions about the ordering.
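
As a rough user-space illustration of that numbering (hypothetical names
throughout, and assuming purely for the example that a higher rank means
faster memory), the ordering comes out of a sort on rank and the tier IDs end
up in no particular order:

```c
#include <stdlib.h>

/* Hypothetical model: tier ID is only a name, rank decides the order. */
struct memtier {
	int id;
	int rank;
};

/* Sort highest rank first. ASSUMPTION: higher rank == faster memory. */
static int cmp_rank_desc(const void *a, const void *b)
{
	const struct memtier *ta = a;
	const struct memtier *tb = b;

	return (tb->rank > ta->rank) - (tb->rank < ta->rank);
}

/* Reorder the array in place so that index 0 is the top of the hierarchy. */
static void sort_by_rank(struct memtier *tiers, size_t n)
{
	qsort(tiers, n, sizeof(*tiers), cmp_rank_desc);
}
```

Feeding it tier0 with the mid rank, tier1 with the lowest and tier2 with the
highest gives the order tier2, tier0, tier1, so nothing can be inferred from
the tier number alone.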

Jonathan




> 
> -aneesh
> 


