Re: [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs


From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
To: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: <linux-mm@kvack.org>, <akpm@linux-foundation.org>,
	Huang Ying <ying.huang@intel.com>,
	Greg Thelen <gthelen@google.com>, Yang Shi <shy828301@gmail.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Tim C Chen <tim.c.chen@intel.com>,
	Brice Goglin <brice.goglin@gmail.com>,
	Michal Hocko <mhocko@kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Hesham Almatary <hesham.almatary@huawei.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Alistair Popple <apopple@nvidia.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Feng Tang <feng.tang@intel.com>,
	Jagdish Gediya <jvgediya@linux.ibm.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	David Rientjes <rientjes@google.com>
Subject: Re: [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs
Date: Tue, 7 Jun 2022 15:32:01 +0100
Message-ID: <20220607153201.00004a8d@Huawei.com>
In-Reply-To: <87ee01ofbs.fsf@linux.ibm.com>

On Mon, 06 Jun 2022 23:16:15 +0530
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:

> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> 
> > On 6/6/22 9:46 PM, Jonathan Cameron wrote:  
> >> On Mon, 6 Jun 2022 21:31:16 +0530
> >> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> wrote:
> >>   
> >>> On 6/6/22 8:29 PM, Jonathan Cameron wrote:  
> >>>> On Fri, 3 Jun 2022 14:10:47 +0530
> >>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> wrote:
> >>>>      
> >>>>> On 5/27/22 7:45 PM, Jonathan Cameron wrote:  
> >>>>>> On Fri, 27 May 2022 17:55:23 +0530
> >>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:
> >>>>>>         
> >>>>>>> From: Jagdish Gediya <jvgediya@linux.ibm.com>
> >>>>>>>
> >>>>>>> Add support to read/write the memory tier index for a NUMA node.
> >>>>>>>
> >>>>>>> /sys/devices/system/node/nodeN/memtier
> >>>>>>>
> >>>>>>> where N = node id
> >>>>>>>
> >>>>>>> When read, it lists the memory tier that the node belongs to.
> >>>>>>>
> >>>>>>> When written, the kernel moves the node into the specified
> >>>>>>> memory tier, the tier assignment of all other nodes are not
> >>>>>>> affected.
> >>>>>>>
> >>>>>>> If the memory tier does not exist, writing to the above file
> >>>>>>> create the tier and assign the NUMA node to that tier.  
> >>>>>> creates
> >>>>>>
> >>>>>> There was some discussion in v2 of Wei Xu's RFC that what matters
> >>>>>> for creation is the rank, not the tier number.
> >>>>>>
> >>>>>> My suggestion is move to an explicit creation file such as
> >>>>>> memtier/create_tier_from_rank
> >>>>>> to which writing the rank results in a new tier
> >>>>>> with the next device ID and requested rank.  
> >>>>>
> >>>>> I think the below workflow is much simpler.
> >>>>>
> >>>>> :/sys/devices/system# cat memtier/memtier1/nodelist
> >>>>> 1-3
> >>>>> :/sys/devices/system# cat node/node1/memtier
> >>>>> 1
> >>>>> :/sys/devices/system# ls memtier/memtier*
> >>>>> nodelist  power  rank  subsystem  uevent
> >>>>> /sys/devices/system# ls memtier/
> >>>>> default_rank  max_tier  memtier1  power  uevent
> >>>>> :/sys/devices/system# echo 2 > node/node1/memtier
> >>>>> :/sys/devices/system#
> >>>>>
> >>>>> :/sys/devices/system# ls memtier/
> >>>>> default_rank  max_tier  memtier1  memtier2  power  uevent
> >>>>> :/sys/devices/system# cat memtier/memtier1/nodelist
> >>>>> 2-3
> >>>>> :/sys/devices/system# cat memtier/memtier2/nodelist
> >>>>> 1
> >>>>> :/sys/devices/system#
> >>>>>
> >>>>> ie, to create a tier we just write the tier id/tier index to
> >>>>> node/nodeN/memtier file. That will create a new memory tier if needed
> >>>>> and add the node to that specific memory tier. Since for now we are
> >>>>> having 1:1 mapping between tier index to rank value, we can derive the
> >>>>> rank value from the memory tier index.
> >>>>>
> >>>>> For dynamic memory tier support, we can assign a rank value such that
> >>>>> new memory tiers are always created such that it comes last in the
> >>>>> demotion order.  
> >>>>
> >>>> I'm not keen on having to pass through an intermediate state where
> >>>> the rank may well be wrong, but I guess it's not that harmful even
> >>>> if it feels wrong ;)
> >>>>      
> >>>
> >>> Any new memory tier added can be of lowest rank (rank - 0) and hence
> >>> will appear as the highest memory tier in demotion order.  
> >> 
> >> Depends on driver interaction - if new memory is CXL attached or
> >> GPU attached, chances are the driver has an input on which tier
> >> it is put in by default.
> >>   
> >>> User can then
> >>> assign the right rank value to the memory tier? Also the actual demotion
> >>> target paths are built during memory block online which in most case
> >>> would happen after we properly verify that the device got assigned to
> >>> the right memory tier with correct rank value?  
> >> 
> >> Agreed, though that may change the model of how memory is brought online
> >> somewhat.
> >>   
> >>>  
> >>>> Races are potentially a bit of a pain though depending on what we
> >>>> expect the usage model to be.
> >>>>
> >>>> There are patterns (CXL regions for example) of guaranteeing the
> >>>> 'right' device is created by doing something like
> >>>>
> >>>> cat create_tier > temp.txt
> >>>> #(temp gets 2 for example on first call then
> >>>> # next read of this file gets 3 etc)
> >>>>
> >>>> cat temp.txt > create_tier
> >>>> # will fail if there hasn't been a read of the same value
> >>>>
> >>>> Assuming all software keeps to the model, then there are no
> >>>> race conditions over creation.  Otherwise we have two new
> >>>> devices turn up very close to each other and userspace scripting
> >>>> tries to create two new tiers - if it races they may end up in
> >>>> the same tier when that wasn't the intent.  Then code to set
> >>>> the rank also races and we get two potentially very different
> >>>> memories in a tier with a randomly selected rank.
> >>>>
> >>>> Fun and games...  And a fine illustration why sysfs based 'device'
> >>>> creation is tricky to get right (and lots of cases in the kernel
> >>>> don't).
> >>>>      
> >>>
> >>> I would expect userspace to be careful and verify the memory tier and
> >>> rank value before we online the memory blocks backed by the device. Even
> >>> if we race, the result would be two device not intended to be part of
> >>> the same memory tier appearing at the same tier. But then we won't be
> >>> building demotion targets yet. So userspace could verify this, move the
> >>> nodes out of the memory tier. Once it is verified, memory blocks can be
> >>> onlined.  
> >> 
> >> The race is there and not avoidable as far as I can see. Two processes A and B.
> >> 
> >> A checks for a spare tier number
> >> B checks for a spare tier number
> >> A tries to assign node 3 to new tier 2 (new tier created)
> >> B tries to assign node 4 to new tier 2 (accidentally hits existing tier - as this
> >> is the same method we'd use to put it in the existing tier we can't tell this
> >> write was meant to create a new tier).
> >> A writes rank 100 to tier 2
> >> A checks rank for tier 2 and finds it is 100 as expected.
> >> B writes rank 200 to tier 2 (it could check if still default but even that is racy)
> >> B checks rank for tier 2 and finds it is 200 as expected.
> >> A onlines memory.
> >> B onlines memory.
> >> 
> >> Both think they got what they wanted, but A definitely didn't.
> >> 
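
To make the interleaving above concrete, here is a toy userspace model of
it. This is only a sketch: tier_model, assign_node(), find_spare_tier() and
replay_race() are made-up names for illustration, not anything in the patch,
and DEFAULT_RANK is an assumed placeholder value. It replays the A/B steps
in order and shows that both writers land in the same tier with B's rank.

```c
#define MAX_TIERS 8
#define DEFAULT_RANK 0	/* assumed placeholder rank for a fresh tier */

/* Toy userspace model; none of these names exist in the kernel patch. */
struct tier_model {
	int exists[MAX_TIERS];
	int rank[MAX_TIERS];
	int node_count[MAX_TIERS];
};

/*
 * Mimics a write to node/nodeN/memtier: the tier is created if missing.
 * Returns 1 if the tier was created, 0 if it already existed. The caller
 * of the real sysfs file gets no such hint, which is the problem.
 */
int assign_node(struct tier_model *m, int tier)
{
	int created = !m->exists[tier];

	if (created) {
		m->exists[tier] = 1;
		m->rank[tier] = DEFAULT_RANK;
	}
	m->node_count[tier]++;
	return created;
}

/* Mimics "checks for a spare tier number": lowest id not yet present. */
int find_spare_tier(const struct tier_model *m)
{
	for (int i = 0; i < MAX_TIERS; i++)
		if (!m->exists[i])
			return i;
	return -1;
}

/* Replays the A/B interleaving; returns the final rank of A's tier. */
int replay_race(struct tier_model *m)
{
	int a_tier = find_spare_tier(m);	/* A checks */
	int b_tier = find_spare_tier(m);	/* B checks, sees the same id */

	assign_node(m, a_tier);		/* A: node 3, tier gets created */
	assign_node(m, b_tier);		/* B: node 4, silently joins it */
	m->rank[a_tier] = 100;		/* A writes 100 and verifies it */
	m->rank[b_tier] = 200;		/* B overwrites with 200 */
	return m->rank[a_tier];
}
```

With tiers 0 and 1 already present, both lookups return 2, so the model ends
with two nodes in tier 2 and a rank that A never asked for.
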
> >> One work around is the read / write approach and create_tier.
> >> 
> >> A reads create_tier - gets 2.
> >> B reads create_tier - gets 3.
> >> A writes 2 to create_tier as that's what it read.
> >> B writes 3 to create_tier as that's what it read.
> >> 
> >> continue with created tiers.  Obviously can exhaust tiers, but if this is
> >> root only, could just create lots anyway so no worse off.
> >>     
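
A rough model of the read / write pairing, again purely illustrative: the
struct and function names below are invented for this sketch, and a real
implementation would need the read and write handlers to run under a lock.
Each read hands out a fresh id, and a write only succeeds for an id that
was previously read and not yet used.

```c
#define MAX_TIERS 8

/* Toy model of a read-then-write create_tier attribute (hypothetical). */
struct create_tier_model {
	int next_id;			/* id handed out by the next read */
	int reserved[MAX_TIERS];	/* 1 once a read has returned this id */
	int created[MAX_TIERS];		/* 1 once the matching write succeeded */
};

/*
 * Models "cat create_tier": every read reserves and returns a fresh id,
 * or -1 once the ids are exhausted.
 */
int create_tier_read(struct create_tier_model *m)
{
	if (m->next_id >= MAX_TIERS)
		return -1;
	m->reserved[m->next_id] = 1;
	return m->next_id++;
}

/*
 * Models "cat temp.txt > create_tier": succeeds only for an id that was
 * handed out by a read and has not been created yet.
 */
int create_tier_write(struct create_tier_model *m, int id)
{
	if (id < 0 || id >= MAX_TIERS || !m->reserved[id] || m->created[id])
		return -1;
	m->created[id] = 1;
	return 0;
}
```

Because no two reads return the same id, A and B cannot end up creating or
joining the same new tier by accident, whatever order their writes arrive in.
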
> >>>
> >>> Having said that can you outline the usage of
> >>> memtier/create_tier_from_rank ?  
> >> 
> >> There are corner cases to deal with...
> >> 
> >> A writes 100 to create_tier_from_rank.
> >> A goes looking for matching tier - finds it: tier2
> >> B writes 200 to create_tier_from_rank
> >> B goes looking for matching tier - finds it: tier3
> >> 
> >> rest is fine as operating on different tiers.
> >> 
> >> Trickier is
> >> A writes 100 to create_tier_from_rank  - succeed.
> >> B writes 100 to create_tier_from_rank  - Could fail, or could just eat it?
> >> 
> >> Logically this is same as separate create_tier and then a write
> >> of rank, but in one operation, but then you need to search
> >> for the right one.  As such, perhaps a create_tier
> >> that does the read/write pair as above is the best solution.
> >>   
> >
> > This all is good when we allow dynamic rank values. But currently we are 
> > restricting ourselves to three rank values as below:
> >
> > rank   memtier
> > 300    memtier0
> > 200    memtier1
> > 100    memtier2
> >
> > Now with the above, how do we define a write to create_tier_from_rank. 
> > What should be the behavior if user write value other than above defined 
> > rank values? Also enforcing the above three rank values as supported 
> > implies teaching userspace about them. I am trying to see how to fit
> > create_tier_from_rank without requiring the above.
> >
> > Can we look at implementing create_tier_from_rank when we start 
> > supporting dynamic tiers/rank values? ie,
> >
> > we still allow node/nodeN/memtier. But with dynamic tiers a race free
> > way to get a new memory tier would be echo rank > 
> > memtier/create_tier_from_rank. We could also say, memtier0/1/2 are 
> > kernel defined memory tiers. Writing to memtier/create_tier_from_rank 
> > will create new memory tiers above memtier2 with the rank value specified?
> >  
> 
> To keep it compatible we could do this. ie, we just allow creation of
> one additional memory tier (memtier3) via the above interface.

Two options: either have no dynamic tier creation for now (which I'm
fine with), or define an interface that will work long term. The latter
means allowing lots of dynamic tiers from day one (where "lots" is at least 2,
to prove the interface still works if the limit is scaled up in future).

A halfway house where we potentially need to change the interface later
is not a good stopgap.

That means we should either deal with the race conditions today, or
declare that we don't care about them. That might be a valid position, but
userspace needs to be aware of it: ultimately it means userspace must
mediate tier creation via some userspace software component.
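
If we go the "userspace mediates" route, one possible shape for that
component is a shared advisory lock taken around the whole
check / assign / set-rank sequence. The sketch below uses flock(2); the
helper name with_tier_lock(), the example callback and the lock file path
are assumptions made up for illustration, and this only helps if every
tool that touches the tiers agrees to take the same lock.

```c
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Runs fn(arg) while holding an exclusive advisory lock on lock_path.
 * Returns fn's result, or -1 if the lock file could not be opened or
 * locked (so fn should use other values to report its own errors).
 */
int with_tier_lock(const char *lock_path, int (*fn)(void *), void *arg)
{
	int ret;
	int fd = open(lock_path, O_CREAT | O_RDWR, 0600);

	if (fd < 0)
		return -1;
	if (flock(fd, LOCK_EX) < 0) {	/* blocks until other holders release */
		close(fd);
		return -1;
	}
	ret = fn(arg);	/* e.g. pick a tier, assign the node, set the rank */
	flock(fd, LOCK_UN);
	close(fd);
	return ret;
}

/* Example critical section: just counts how often it ran. */
int count_call(void *arg)
{
	(*(int *)arg)++;
	return 0;
}
```

A caller would wrap its sysfs writes in a callback and invoke something like
with_tier_lock("/run/lock/memtier.lock", do_tier_setup, &ctx), where both
the path and do_tier_setup are placeholders.
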

Jonathan

> 
> 
> :/sys/devices/system/memtier# ls -al
> total 0
> drwxr-xr-x  4 root root    0 Jun  6 17:39 .
> drwxr-xr-x 10 root root    0 Jun  6 17:39 ..
> --w-------  1 root root 4096 Jun  6 17:40 create_tier_from_rank
> -r--r--r--  1 root root 4096 Jun  6 17:40 default_tier
> -r--r--r--  1 root root 4096 Jun  6 17:40 max_tier
> drwxr-xr-x  3 root root    0 Jun  6 17:39 memtier1
> drwxr-xr-x  2 root root    0 Jun  6 17:40 power
> -rw-r--r--  1 root root 4096 Jun  6 17:39 uevent
> :/sys/devices/system/memtier# echo 20 > create_tier_from_rank 
> :/sys/devices/system/memtier# ls
> create_tier_from_rank  default_tier  max_tier  memtier1  memtier3  power  uevent
> :/sys/devices/system/memtier# cat memtier3/rank 
> 20
> :/sys/devices/system/memtier# echo 20 > create_tier_from_rank 
> bash: echo: write error: No space left on device
> :/sys/devices/system/memtier# 
> 
> is this good? 
> 
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index 0468af60d427..a4150120ba24 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -13,7 +13,7 @@
>  #define MEMORY_RANK_PMEM	100
>  
>  #define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
> -#define MAX_MEMORY_TIERS  3
> +#define MAX_MEMORY_TIERS  4
>  
>  extern bool numa_demotion_enabled;
>  extern nodemask_t promotion_mask;
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index c6eb223a219f..7fdee0c4c4ea 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -169,7 +169,8 @@ static void insert_memory_tier(struct memory_tier *memtier)
>  	list_add_tail(&memtier->list, &memory_tiers);
>  }
>  
> -static struct memory_tier *register_memory_tier(unsigned int tier)
> +static struct memory_tier *register_memory_tier(unsigned int tier,
> +						unsigned int rank)
>  {
>  	int error;
>  	struct memory_tier *memtier;
> @@ -182,7 +183,7 @@ static struct memory_tier *register_memory_tier(unsigned int tier)
>  		return NULL;
>  
>  	memtier->dev.id = tier;
> -	memtier->rank = get_rank_from_tier(tier);
> +	memtier->rank = rank;
>  	memtier->dev.bus = &memory_tier_subsys;
>  	memtier->dev.release = memory_tier_device_release;
>  	memtier->dev.groups = memory_tier_dev_groups;
> @@ -218,9 +219,53 @@ default_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
>  }
>  static DEVICE_ATTR_RO(default_tier);
>  
> +
> +static struct memory_tier *__get_memory_tier_from_id(int id);
> +static ssize_t create_tier_from_rank_store(struct device *dev,
> +					   struct device_attribute *attr,
> +					   const char *buf, size_t count)
> +{
> +	int ret, rank;
> +	struct memory_tier *memtier;
> +
> +	ret = kstrtouint(buf, 10, &rank);
> +	if (ret)
> +		return ret;
> +
> +	if (rank == MEMORY_RANK_HBM_GPU ||
> +	    rank == MEMORY_TIER_DRAM ||
> +	    rank == MEMORY_RANK_PMEM)
> +		return -EINVAL;
> +
> +	mutex_lock(&memory_tier_lock);
> +	/*
> +	 * For now we only support creation of one additional tier via
> +	 * this interface.
> +	 */
> +	memtier = __get_memory_tier_from_id(3);
> +	if (!memtier) {
> +		memtier = register_memory_tier(3, rank);
> +		if (!memtier) {
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +	} else {
> +		ret = -ENOSPC;
> +		goto out;
> +	}
> +
> +	ret = count;
> +out:
> +	mutex_unlock(&memory_tier_lock);
> +	return ret;
> +}
> +static DEVICE_ATTR_WO(create_tier_from_rank);
> +
> +
>  static struct attribute *memory_tier_attrs[] = {
>  	&dev_attr_max_tier.attr,
>  	&dev_attr_default_tier.attr,
> +	&dev_attr_create_tier_from_rank.attr,
>  	NULL
>  };
>  
> @@ -302,7 +347,7 @@ static int __node_set_memory_tier(int node, int tier)
>  
>  	memtier = __get_memory_tier_from_id(tier);
>  	if (!memtier) {
> -		memtier = register_memory_tier(tier);
> +		memtier = register_memory_tier(tier, get_rank_from_tier(tier));
>  		if (!memtier) {
>  			ret = -EINVAL;
>  			goto out;
> @@ -651,7 +696,8 @@ static int __init memory_tier_init(void)
>  	 * Register only default memory tier to hide all empty
>  	 * memory tier from sysfs.
>  	 */
> -	memtier = register_memory_tier(DEFAULT_MEMORY_TIER);
> +	memtier = register_memory_tier(DEFAULT_MEMORY_TIER,
> +				       get_rank_from_tier(DEFAULT_MEMORY_TIER));
>  	if (!memtier)
>  		panic("%s() failed to register memory tier: %d\n", __func__, ret);
>  
>

