From: Dan Williams <dan.j.williams@intel.com>
To: Jonathan Cameron <Jonathan.Cameron@huawei.com>,
Dave Jiang <dave.jiang@intel.com>
Cc: "Huang, Ying" <ying.huang@intel.com>, <linux-cxl@vger.kernel.org>,
"Greg Kroah-Hartman" <gregkh@linuxfoundation.org>,
"Rafael J. Wysocki" <rafael@kernel.org>,
<dan.j.williams@intel.com>, <ira.weiny@intel.com>,
<vishal.l.verma@intel.com>, <alison.schofield@intel.com>,
<dave@stgolabs.net>
Subject: Re: [PATCH v3 3/3] cxl: Add memory hotplug notifier for cxl region
Date: Tue, 9 Jan 2024 11:28:22 -0800 [thread overview]
Message-ID: <659d9e563e1fd_24a8294b7@dwillia2-xfh.jf.intel.com.notmuch> (raw)
In-Reply-To: <20240109162753.00005b2b@Huawei.com>
Jonathan Cameron wrote:
> On Mon, 8 Jan 2024 11:18:33 -0700
> Dave Jiang <dave.jiang@intel.com> wrote:
>
> > On 1/8/24 05:15, Jonathan Cameron wrote:
> > > On Mon, 08 Jan 2024 14:49:03 +0800
> > > "Huang, Ying" <ying.huang@intel.com> wrote:
> > >
> > >> Dave Jiang <dave.jiang@intel.com> writes:
> > >>
> > >>> When the CXL region is formed, the driver computes the performance
> > >>> data for the region. However this data is not available at the node data
> > >>> collection that has been populated by the HMAT during kernel
> > >>> initialization. Add a memory hotplug notifier to update the performance
> > >>> data to the node hmem_attrs to expose the newly calculated region
> > >>> performance data. The CXL region is created under specific CFMWS. The
> > >>> node for the CFMWS is created during SRAT parsing by acpi_parse_cfmws().
> > >>> Additional regions may overwrite the initial data, but since this is
> > >>> for the same proximity domain it's a don't care for now.
> > >>>
> > >>> node_set_perf_attrs() symbol is exported to allow update of perf attribs
> > >>> for a node. The sysfs path of
> > >>> /sys/devices/system/node/nodeX/access0/initiators/* is created by
> > >>> node_set_perf_attrs() for the various attributes, where nodeX is matched
> > >>> to the proximity domain of the CXL region.
> > >
> > > As per discussion below. Why is access1 not also relevant for CXL memory?
> > > (it's probably more relevant than access0 in many cases!)
> > >
> > > For historical references, I wanted access0 to be the CPU only one, but
> > > review feedback was that access0 was already defined as 'initiator based'
> > > so we couldn't just make the 0 indexed one the case most people care about.
> > > Hence we grew access1 to cover the CPU only case which most software cares
> > > about.
> > >
> > >>>
> > >>> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > >>> Cc: Rafael J. Wysocki <rafael@kernel.org>
> > >>> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
> > >>> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
> > >>> ---
> > >>> v3:
> > >>> - Change EXPORT_SYMBOL_NS_GPL(,CXL) to EXPORT_SYMBOL_GPL() (Jonathan)
> > >>> - use read_bandwidth as check for valid coords (Jonathan)
> > >>> - Remove setting of coord access level 1. (Jonathan)
> > >>> ---
> > >>> drivers/base/node.c | 1 +
> > >>> drivers/cxl/core/region.c | 42 ++++++++++++++++++++++++++++++++++++++++++
> > >>> drivers/cxl/cxl.h | 3 +++
> > >>> 3 files changed, 46 insertions(+)
> > >>>
> > >>> diff --git a/drivers/base/node.c b/drivers/base/node.c
> > >>> index cb2b6cc7f6e6..48e5cb292765 100644
> > >>> --- a/drivers/base/node.c
> > >>> +++ b/drivers/base/node.c
> > >>> @@ -215,6 +215,7 @@ void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord,
> > >>> }
> > >>> }
> > >>> }
> > >>> +EXPORT_SYMBOL_GPL(node_set_perf_attrs);
> > >>>
> > >>> /**
> > >>> * struct node_cache_info - Internal tracking for memory node caches
> > >>> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > >>> index d28d24524d41..bee65f535d6c 100644
> > >>> --- a/drivers/cxl/core/region.c
> > >>> +++ b/drivers/cxl/core/region.c
> > >>> @@ -4,6 +4,7 @@
> > >>> #include <linux/genalloc.h>
> > >>> #include <linux/device.h>
> > >>> #include <linux/module.h>
> > >>> +#include <linux/memory.h>
> > >>> #include <linux/slab.h>
> > >>> #include <linux/uuid.h>
> > >>> #include <linux/sort.h>
> > >>> @@ -2972,6 +2973,42 @@ static int is_system_ram(struct resource *res, void *arg)
> > >>> return 1;
> > >>> }
> > >>>
> > >>> +static int cxl_region_perf_attrs_callback(struct notifier_block *nb,
> > >>> + unsigned long action, void *arg)
> > >>> +{
> > >>> + struct cxl_region *cxlr = container_of(nb, struct cxl_region,
> > >>> + memory_notifier);
> > >>> + struct cxl_region_params *p = &cxlr->params;
> > >>> + struct cxl_endpoint_decoder *cxled = p->targets[0];
> > >>> + struct cxl_decoder *cxld = &cxled->cxld;
> > >>> + struct memory_notify *mnb = arg;
> > >>> + int nid = mnb->status_change_nid;
> > >>> + int region_nid;
> > >>> +
> > >>> + if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
> > >>> + return NOTIFY_DONE;
> > >>> +
> > >>> + region_nid = phys_to_target_node(cxld->hpa_range.start);
> > >>> + if (nid != region_nid)
> > >>> + return NOTIFY_DONE;
> > >>> +
> > >>> + /* Don't set if there's no coordinate information */
> > >>> + if (!cxlr->coord.write_bandwidth)
> > >>> + return NOTIFY_DONE;
> > >>
> > >> Although you said you will use "read_bandwidth" in changelog, you
> > >> actually didn't do that.
> > >>
> > >>> +
> > >>> + node_set_perf_attrs(nid, &cxlr->coord, 0);
> > >>> + node_set_perf_attrs(nid, &cxlr->coord, 1);
> > >>
> > >> And this.
> > >>
> > >> But I don't think it's good to remove access level 1. According to
> > >> commit b9fffe47212c ("node: Add access1 class to represent CPU to memory
> > >> characteristics"). Access level 1 is for performance from CPU to
> > >> memory. So, we should keep access level 1. For CXL memory device,
> > >> access level 0 and access level 1 should be equivalent. Will the code
> > >> be used for something like GPU connected via CXL? Where the access
> > >> level 0 may be for the performance from GPU to the memory.
> > >>
> > > I disagree. They are no more equivalent than they are on any other complex system.
> > >
> > > e.g. A CXL root port being described using generic Port infrastructure may be
> > > on a different die (IO dies are a common architecture) in the package
> > > than the CPU cores and that IO die may well have generic initiators that
> > > are much nearer than the CPU cores.
> > >
> > > In those cases access0 will cover initiators on the IO die but access1 will
> > > cover the nearest CPU cores (initiators).
> > >
> > > Both should arguably be there for CXL memory as both are as relevant as
> > > they are for any other memory.
> > >
> > > If / when we get some GPUs etc on CXL that are initiators this will all
> > > get a lot more fun but for now we can kick that into the long grass.
> >
> >
> > With the current way of storing HMAT targets information, only the
> > best performance data is stored (access0). The existing HMAT handling
> > code also sets the access1 if the associated initiator node contains
> > a CPU for conventional memory. The current calculated full CXL path
> > is the access0 data. I think what's missing is the check to see if
> > the associated initiator node is also a CPU node and sets access1
> > conditionally based on that. Maybe if that conditional gets added
> > then that is ok for what we have now?
>
> You also need the access1 initiators to be figured out (nearest
> one that has a CPU) - so two separate sets of calculations.
> Could short cut the maths if they happen to be the same node of
> course.
Where is "access1" coming from? The generic port is the only performance
profile that is being calculated by the CDAT code and there is no other
initiator.
Now if "access1" is a convention of "that's the CPU" then we should skip
emitting access0 altogether and reserve that for some future accelerator
case that can define a better access profile talking to its own local
memory. Otherwise having access0 and access1 when the only initiator is
the generic port (which includes all CPUs attached to that generic port)
does not resolve for me.
> > If/When the non-CPU initiators shows up for CXL, we'll need to change
> > the way to store the initiator to generic target table data and how
> > we calculate and setup access0 vs access1. Maybe that can be done as
> > a later iteration?
>
> I'm not that bothered yet about CXL initiators - the issue today
> is ones on a different node the host side of the root ports.
>
> For giggles the NVIDIA Grace proposals for how they manage their
> GPU partitioning will create a bunch of GI nodes that may well
> be nearer to the CXL ports - I've no idea!
> https://lore.kernel.org/qemu-devel/20231225045603.7654-1-ankita@nvidia.com/
It seems sad that we, as an industry, went through all this trouble to
define an enumerable CXL bus only to fall back to ACPI for enumeration.
The Linux reaction to CFMWS takes a "Linux likely needs *at least* this many
memory target nodes considered at the beginning of time" approach, with a
"circle back to the dynamic node creation problem later if it proves
insufficient" caveat. The NVIDIA proposal appears to be crossing that
threshold, and I will go invite them to do the work to dynamically
enumerate initiators into the Linux tracking structures.
As for where this leaves this patchset, it is clear from this
conversation that v6.9 is a better target for clarifying this NUMA
information, but I think it is ok to move ahead with the base CDAT
parsing for v6.8 (the bits that are already exposed to linux-next). Any
objections?