From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
To: Dave Jiang <dave.jiang@intel.com>
Cc: "Huang, Ying" <ying.huang@intel.com>, <linux-cxl@vger.kernel.org>,
"Greg Kroah-Hartman" <gregkh@linuxfoundation.org>,
"Rafael J. Wysocki" <rafael@kernel.org>,
<dan.j.williams@intel.com>, <ira.weiny@intel.com>,
<vishal.l.verma@intel.com>, <alison.schofield@intel.com>,
<dave@stgolabs.net>
Subject: Re: [PATCH v3 3/3] cxl: Add memory hotplug notifier for cxl region
Date: Tue, 9 Jan 2024 16:27:53 +0000 [thread overview]
Message-ID: <20240109162753.00005b2b@Huawei.com> (raw)
In-Reply-To: <38be52cf-a4ea-402c-9b14-47a80427f0c8@intel.com>
On Mon, 8 Jan 2024 11:18:33 -0700
Dave Jiang <dave.jiang@intel.com> wrote:
> On 1/8/24 05:15, Jonathan Cameron wrote:
> > On Mon, 08 Jan 2024 14:49:03 +0800
> > "Huang, Ying" <ying.huang@intel.com> wrote:
> >
> >> Dave Jiang <dave.jiang@intel.com> writes:
> >>
> >>> When the CXL region is formed, the driver computes the performance
> >>> data for the region. However, this data is not available in the node
> >>> data collection populated by the HMAT during kernel
> >>> initialization. Add a memory hotplug notifier to update the node's
> >>> hmem_attrs with the newly calculated region
> >>> performance data. The CXL region is created under a specific CFMWS. The
> >>> node for the CFMWS is created during SRAT parsing by acpi_parse_cfmws().
> >>> Additional regions may overwrite the initial data, but since this is
> >>> for the same proximity domain it is not a concern for now.
> >>>
> >>> The node_set_perf_attrs() symbol is exported to allow updating the perf
> >>> attributes for a node. The sysfs path
> >>> /sys/devices/system/node/nodeX/access0/initiators/* is created by
> >>> node_set_perf_attrs() for the various attributes, where nodeX matches
> >>> the proximity domain of the CXL region.
> >
> > As per the discussion below: why is access1 not also relevant for CXL memory?
> > (it's probably more relevant than access0 in many cases!)
> >
> > For historical reference, I wanted access0 to be the CPU-only one, but
> > review feedback was that access0 was already defined as 'initiator based'
> > so we couldn't just make the 0-indexed one the case most people care about.
> > Hence we grew access1 to cover the CPU-only case, which most software cares
> > about.
> >
> >>>
> >>> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> >>> Cc: Rafael J. Wysocki <rafael@kernel.org>
> >>> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
> >>> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
> >>> ---
> >>> v3:
> >>> - Change EXPORT_SYMBOL_NS_GPL(,CXL) to EXPORT_SYMBOL_GPL() (Jonathan)
> >>> - use read_bandwidth as check for valid coords (Jonathan)
> >>> - Remove setting of coord access level 1. (Jonathan)
> >>> ---
> >>> drivers/base/node.c | 1 +
> >>> drivers/cxl/core/region.c | 42 ++++++++++++++++++++++++++++++++++++++++++
> >>> drivers/cxl/cxl.h | 3 +++
> >>> 3 files changed, 46 insertions(+)
> >>>
> >>> diff --git a/drivers/base/node.c b/drivers/base/node.c
> >>> index cb2b6cc7f6e6..48e5cb292765 100644
> >>> --- a/drivers/base/node.c
> >>> +++ b/drivers/base/node.c
> >>> @@ -215,6 +215,7 @@ void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord,
> >>> 			}
> >>> 		}
> >>> 	}
> >>> +EXPORT_SYMBOL_GPL(node_set_perf_attrs);
> >>>
> >>> /**
> >>> * struct node_cache_info - Internal tracking for memory node caches
> >>> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> >>> index d28d24524d41..bee65f535d6c 100644
> >>> --- a/drivers/cxl/core/region.c
> >>> +++ b/drivers/cxl/core/region.c
> >>> @@ -4,6 +4,7 @@
> >>> #include <linux/genalloc.h>
> >>> #include <linux/device.h>
> >>> #include <linux/module.h>
> >>> +#include <linux/memory.h>
> >>> #include <linux/slab.h>
> >>> #include <linux/uuid.h>
> >>> #include <linux/sort.h>
> >>> @@ -2972,6 +2973,42 @@ static int is_system_ram(struct resource *res, void *arg)
> >>> return 1;
> >>> }
> >>>
> >>> +static int cxl_region_perf_attrs_callback(struct notifier_block *nb,
> >>> +					  unsigned long action, void *arg)
> >>> +{
> >>> +	struct cxl_region *cxlr = container_of(nb, struct cxl_region,
> >>> +					       memory_notifier);
> >>> +	struct cxl_region_params *p = &cxlr->params;
> >>> +	struct cxl_endpoint_decoder *cxled = p->targets[0];
> >>> +	struct cxl_decoder *cxld = &cxled->cxld;
> >>> +	struct memory_notify *mnb = arg;
> >>> +	int nid = mnb->status_change_nid;
> >>> +	int region_nid;
> >>> +
> >>> +	if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
> >>> +		return NOTIFY_DONE;
> >>> +
> >>> +	region_nid = phys_to_target_node(cxld->hpa_range.start);
> >>> +	if (nid != region_nid)
> >>> +		return NOTIFY_DONE;
> >>> +
> >>> +	/* Don't set if there's no coordinate information */
> >>> +	if (!cxlr->coord.write_bandwidth)
> >>> +		return NOTIFY_DONE;
> >>
> >> Although you said you would use "read_bandwidth" in the changelog, you
> >> actually didn't do that.
> >>
> >>> +
> >>> +	node_set_perf_attrs(nid, &cxlr->coord, 0);
> >>> +	node_set_perf_attrs(nid, &cxlr->coord, 1);
> >>
> >> And this.
> >>
> >> But I don't think it's good to remove access level 1. According to
> >> commit b9fffe47212c ("node: Add access1 class to represent CPU to memory
> >> characteristics"), access level 1 is for performance from CPU to
> >> memory. So we should keep access level 1. For a CXL memory device,
> >> access levels 0 and 1 should be equivalent. Will the code
> >> be used for something like a GPU connected via CXL, where the access
> >> level 0 may be for the performance from the GPU to the memory?
> >>
> > I disagree. They are no more equivalent than they are on any other complex system.
> >
> > e.g. a CXL root port described using the generic port infrastructure may be
> > on a different die (IO dies are a common architecture) in the package
> > than the CPU cores, and that IO die may well have generic initiators that
> > are much nearer than the CPU cores.
> >
> > In those cases access0 will cover initiators on the IO die but access1 will
> > cover the nearest CPU cores (initiators).
> >
> > Both should arguably be there for CXL memory as both are as relevant as
> > they are for any other memory.
> >
> > If / when we get some GPUs etc on CXL that are initiators this will all
> > get a lot more fun but for now we can kick that into the long grass.
>
>
> With the current way of storing HMAT target information, only the
> best performance data is stored (access0). For conventional memory,
> the existing HMAT handling code also sets access1 if the associated
> initiator node contains a CPU. The currently calculated full CXL
> path is the access0 data. I think what's missing is a check for
> whether the associated initiator node is also a CPU node, setting
> access1 conditionally based on that. Maybe if that conditional gets
> added then this is OK for what we have now?
You also need the access1 initiators to be figured out (nearest
one that has a CPU) - so two separate sets of calculations.
Could short cut the maths if they happen to be the same node of
course.
>
> If/when non-CPU initiators show up for CXL, we'll need to change
> the way we store the initiator-to-generic-target table data and how
> we calculate and set up access0 vs access1. Maybe that can be done
> as a later iteration?
I'm not that bothered yet about CXL initiators - the issue today
is initiators on a different node on the host side of the root ports.
For giggles, the NVIDIA Grace proposals for how they manage their
GPU partitioning will create a bunch of GI nodes that may well
be nearer to the CXL ports - I've no idea!
https://lore.kernel.org/qemu-devel/20231225045603.7654-1-ankita@nvidia.com/
Jonathan
>
> DJ
>
> >
> > Jonathan
> >
> >
> >> --
> >> Best Regards,
> >> Huang, Ying
> >>
> >>> +
> >>> +	return NOTIFY_OK;
> >>> +}
> >>> +
> >>> +static void remove_coord_notifier(void *data)
> >>> +{
> >>> +	struct cxl_region *cxlr = data;
> >>> +
> >>> +	unregister_memory_notifier(&cxlr->memory_notifier);
> >>> +}
> >>> +
> >>> static int cxl_region_probe(struct device *dev)
> >>> {
> >>> 	struct cxl_region *cxlr = to_cxl_region(dev);
> >>> @@ -2997,6 +3034,11 @@ static int cxl_region_probe(struct device *dev)
> >>> 		goto out;
> >>> 	}
> >>>
> >>> +	cxlr->memory_notifier.notifier_call = cxl_region_perf_attrs_callback;
> >>> +	cxlr->memory_notifier.priority = HMAT_CALLBACK_PRI;
> >>> +	register_memory_notifier(&cxlr->memory_notifier);
> >>> +	rc = devm_add_action_or_reset(&cxlr->dev, remove_coord_notifier, cxlr);
> >>> +
> >>> 	/*
> >>> 	 * From this point on any path that changes the region's state away from
> >>> 	 * CXL_CONFIG_COMMIT is also responsible for releasing the driver.
> >>>
> >>> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> >>> index 4639d0d6ef54..2498086c8edc 100644
> >>> --- a/drivers/cxl/cxl.h
> >>> +++ b/drivers/cxl/cxl.h
> >>> @@ -6,6 +6,7 @@
> >>>
> >>> #include <linux/libnvdimm.h>
> >>> #include <linux/bitfield.h>
> >>> +#include <linux/notifier.h>
> >>> #include <linux/bitops.h>
> >>> #include <linux/log2.h>
> >>> #include <linux/node.h>
> >>> @@ -520,6 +521,7 @@ struct cxl_region_params {
> >>> * @flags: Region state flags
> >>> * @params: active + config params for the region
> >>> * @coord: QoS access coordinates for the region
> >>> + * @memory_notifier: notifier for setting the access coordinates to node
> >>>  */
> >>> struct cxl_region {
> >>> struct device dev;
> >>> @@ -531,6 +533,7 @@ struct cxl_region {
> >>> unsigned long flags;
> >>> struct cxl_region_params params;
> >>> struct access_coordinate coord;
> >>> + struct notifier_block memory_notifier;
> >>> };
> >>>
> >>> struct cxl_nvdimm_bridge {
> >>
> >
> >