* [PATCH v2 1/3] cxl/region: Calculate performance data for a region
2023-12-15 23:15 [PATCH v2 0/3] cxl: Add support to report region access coordinates to numa nodes Dave Jiang
@ 2023-12-15 23:15 ` Dave Jiang
2023-12-19 14:51 ` Jonathan Cameron
2023-12-15 23:16 ` [PATCH v2 2/3] cxl/region: Add sysfs attribute for locality attributes of CXL regions Dave Jiang
2023-12-15 23:16 ` [PATCH v2 3/3] cxl: Add memory hotplug notifier for cxl region Dave Jiang
2 siblings, 1 reply; 11+ messages in thread
From: Dave Jiang @ 2023-12-15 23:15 UTC (permalink / raw)
To: linux-cxl
Cc: dan.j.williams, ira.weiny, vishal.l.verma, alison.schofield,
jonathan.cameron, dave, brice.goglin, nifan.cxl
Calculate and store the performance data for a CXL region. Find the worst
read and write latency among all the included ranges from each of the
devices that contribute to the region and designate that as the region's
latency data. Sum the read and write bandwidth data of each contributing
device region; that sum is the total bandwidth for the region.
The perf list is expected to be constructed before the endpoint decoders
are registered, and thus there should be no early reading of the entries
from the region assembly action. The region QoS calculation function is
called under the protection of cxl_dpa_rwsem, which ensures that all
DPA-associated work has completed.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
v2:
- Move cxled declaration (Fan)
- Move calculate function to core/cdat.c
- Make cxlr->coord a struct instead of allocated (Dan)
- Remove list_empty() check (Dan)
- Move calculation to cxl_region_attach() under cxl_dpa_rwsem (Dan)
- Normalize perf numbers to HMAT coords (Brice, Dan)
---
drivers/cxl/core/cdat.c | 53 +++++++++++++++++++++++++++++++++++++++++++++
drivers/cxl/core/region.c | 2 ++
drivers/cxl/cxl.h | 5 ++++
3 files changed, 60 insertions(+)
diff --git a/drivers/cxl/core/cdat.c b/drivers/cxl/core/cdat.c
index 5fe57fe5e2ee..29bba04306e9 100644
--- a/drivers/cxl/core/cdat.c
+++ b/drivers/cxl/core/cdat.c
@@ -547,3 +547,56 @@ void cxl_switch_parse_cdat(struct cxl_port *port)
EXPORT_SYMBOL_NS_GPL(cxl_switch_parse_cdat, CXL);
MODULE_IMPORT_NS(CXL);
+
+void cxl_region_perf_data_calculate(struct cxl_region *cxlr,
+ struct cxl_endpoint_decoder *cxled)
+{
+ struct list_head *perf_list;
+ struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
+ struct cxl_dev_state *cxlds = cxlmd->cxlds;
+ struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlds);
+ struct range dpa = {
+ .start = cxled->dpa_res->start,
+ .end = cxled->dpa_res->end,
+ };
+ struct cxl_dpa_perf *perf;
+ bool found = false;
+
+ switch (cxlr->mode) {
+ case CXL_DECODER_RAM:
+ perf_list = &mds->ram_perf_list;
+ break;
+ case CXL_DECODER_PMEM:
+ perf_list = &mds->pmem_perf_list;
+ break;
+ default:
+ return;
+ }
+
+ list_for_each_entry(perf, perf_list, list) {
+ if (range_contains(&perf->dpa_range, &dpa)) {
+ found = true;
+ break;
+ }
+ }
+
+ if (!found)
+ return;
+
+ /* Get total bandwidth and the worst latency for the cxl region */
+ cxlr->coord.read_latency = max_t(unsigned int,
+ cxlr->coord.read_latency,
+ perf->coord.read_latency);
+ cxlr->coord.write_latency = max_t(unsigned int,
+ cxlr->coord.write_latency,
+ perf->coord.write_latency);
+ cxlr->coord.read_bandwidth += perf->coord.read_bandwidth;
+ cxlr->coord.write_bandwidth += perf->coord.write_bandwidth;
+
+ /*
+ * Convert latency to nanosec from picosec to be consistent with HMAT
+ * attributes.
+ */
+ cxlr->coord.read_latency = DIV_ROUND_UP(cxlr->coord.read_latency, 1000);
+ cxlr->coord.write_latency = DIV_ROUND_UP(cxlr->coord.write_latency, 1000);
+}
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 56e575c79bb4..be7383e74ef5 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -1721,6 +1721,8 @@ static int cxl_region_attach(struct cxl_region *cxlr,
return -EINVAL;
}
+ cxl_region_perf_data_calculate(cxlr, cxled);
+
if (test_bit(CXL_REGION_F_AUTO, &cxlr->flags)) {
int i;
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 492dbf63935f..4639d0d6ef54 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -519,6 +519,7 @@ struct cxl_region_params {
* @cxlr_pmem: (for pmem regions) cached copy of the nvdimm bridge
* @flags: Region state flags
* @params: active + config params for the region
+ * @coord: QoS access coordinates for the region
*/
struct cxl_region {
struct device dev;
@@ -529,6 +530,7 @@ struct cxl_region {
struct cxl_pmem_region *cxlr_pmem;
unsigned long flags;
struct cxl_region_params params;
+ struct access_coordinate coord;
};
struct cxl_nvdimm_bridge {
@@ -879,6 +881,9 @@ void cxl_switch_parse_cdat(struct cxl_port *port);
int cxl_endpoint_get_perf_coordinates(struct cxl_port *port,
struct access_coordinate *coord);
+void cxl_region_perf_data_calculate(struct cxl_region *cxlr,
+ struct cxl_endpoint_decoder *cxled);
+
/*
* Unit test builds overrides this to __weak, find the 'strong' version
* of these symbols in tools/testing/cxl/.
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH v2 1/3] cxl/region: Calculate performance data for a region
2023-12-15 23:15 ` [PATCH v2 1/3] cxl/region: Calculate performance data for a region Dave Jiang
@ 2023-12-19 14:51 ` Jonathan Cameron
2023-12-21 22:51 ` Dave Jiang
0 siblings, 1 reply; 11+ messages in thread
From: Jonathan Cameron @ 2023-12-19 14:51 UTC (permalink / raw)
To: Dave Jiang
Cc: linux-cxl, dan.j.williams, ira.weiny, vishal.l.verma,
alison.schofield, dave, brice.goglin, nifan.cxl
On Fri, 15 Dec 2023 16:15:59 -0700
Dave Jiang <dave.jiang@intel.com> wrote:
> Calculate and store the performance data for a CXL region. Find the worst
> read and write latency for all the included ranges from each of the devices
> that attributes to the region and designate that as the latency data. Sum
> all the read and write bandwidth data for each of the device region and
> that is the total bandwidth for the region.
>
> The perf list is expected to be constructed before the endpoint decoders
> are registered and thus there should be no early reading of the entries
> from the region assemble action. The calling of the region qos calculate
> function is under the protection of cxl_dpa_rwsem and will ensure that
> all DPA associated work has completed.
>
> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Trivial comments inline. With the HMAT reference tweaked,
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> ---
> v2:
> - Move cxled declaration (Fan)
> - Move calculate function to core/cdat.c
> - Make cxlr->coord a struct instead of allocated (Dan)
> - Remove list_empty() check (Dan)
> - Move calculation to cxl_region_attach() under cxl_dpa_rwsem (Dan)
> - Normalize perf numbers to HMAT coords (Brice, Dan)
> ---
> drivers/cxl/core/cdat.c | 53 +++++++++++++++++++++++++++++++++++++++++++++
> drivers/cxl/core/region.c | 2 ++
> drivers/cxl/cxl.h | 5 ++++
> 3 files changed, 60 insertions(+)
>
> diff --git a/drivers/cxl/core/cdat.c b/drivers/cxl/core/cdat.c
> index 5fe57fe5e2ee..29bba04306e9 100644
> --- a/drivers/cxl/core/cdat.c
> +++ b/drivers/cxl/core/cdat.c
> @@ -547,3 +547,56 @@ void cxl_switch_parse_cdat(struct cxl_port *port)
> EXPORT_SYMBOL_NS_GPL(cxl_switch_parse_cdat, CXL);
>
> MODULE_IMPORT_NS(CXL);
> +
> +void cxl_region_perf_data_calculate(struct cxl_region *cxlr,
> + struct cxl_endpoint_decoder *cxled)
> +{
> + struct list_head *perf_list;
> + struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> + struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlds);
> + struct range dpa = {
> + .start = cxled->dpa_res->start,
> + .end = cxled->dpa_res->end,
> + };
> + struct cxl_dpa_perf *perf;
> + bool found = false;
> +
> + switch (cxlr->mode) {
> + case CXL_DECODER_RAM:
> + perf_list = &mds->ram_perf_list;
> + break;
> + case CXL_DECODER_PMEM:
> + perf_list = &mds->pmem_perf_list;
> + break;
> + default:
> + return;
> + }
> +
> + list_for_each_entry(perf, perf_list, list) {
> + if (range_contains(&perf->dpa_range, &dpa)) {
> + found = true;
> + break;
> + }
> + }
> +
> + if (!found)
> + return;
Could use
if (list_entry_is_head())
return;
and drop the found variable. Though that is a little bit specific to the
internals of the list infrastructure so maybe adding a variable is better..
There is precedence for both approaches in tree.
> +
> + /* Get total bandwidth and the worst latency for the cxl region */
> + cxlr->coord.read_latency = max_t(unsigned int,
> + cxlr->coord.read_latency,
> + perf->coord.read_latency);
> + cxlr->coord.write_latency = max_t(unsigned int,
> + cxlr->coord.write_latency,
> + perf->coord.write_latency);
> + cxlr->coord.read_bandwidth += perf->coord.read_bandwidth;
> + cxlr->coord.write_bandwidth += perf->coord.write_bandwidth;
> +
> + /*
> + * Convert latency to nanosec from picosec to be consistent with HMAT
HMAT version what? You may ask why is there a breaking change in the HMAT definition
between 6.2 and 6.3 but I'd rather you didn't :(
> + * attributes.
> + */
> + cxlr->coord.read_latency = DIV_ROUND_UP(cxlr->coord.read_latency, 1000);
> + cxlr->coord.write_latency = DIV_ROUND_UP(cxlr->coord.write_latency, 1000);
> +}
* Re: [PATCH v2 1/3] cxl/region: Calculate performance data for a region
2023-12-19 14:51 ` Jonathan Cameron
@ 2023-12-21 22:51 ` Dave Jiang
2024-01-08 13:58 ` Jonathan Cameron
0 siblings, 1 reply; 11+ messages in thread
From: Dave Jiang @ 2023-12-21 22:51 UTC (permalink / raw)
To: Jonathan Cameron
Cc: linux-cxl, dan.j.williams, ira.weiny, vishal.l.verma,
alison.schofield, dave, brice.goglin, nifan.cxl
On 12/19/23 07:51, Jonathan Cameron wrote:
> On Fri, 15 Dec 2023 16:15:59 -0700
> Dave Jiang <dave.jiang@intel.com> wrote:
>
>> Calculate and store the performance data for a CXL region. Find the worst
>> read and write latency for all the included ranges from each of the devices
>> that attributes to the region and designate that as the latency data. Sum
>> all the read and write bandwidth data for each of the device region and
>> that is the total bandwidth for the region.
>>
>> The perf list is expected to be constructed before the endpoint decoders
>> are registered and thus there should be no early reading of the entries
>> from the region assemble action. The calling of the region qos calculate
>> function is under the protection of cxl_dpa_rwsem and will ensure that
>> all DPA associated work has completed.
>>
>> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
>
> Trivial comments inline. With the HMAT reference tweaked,
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>
>> ---
>> v2:
>> - Move cxled declaration (Fan)
>> - Move calculate function to core/cdat.c
>> - Make cxlr->coord a struct instead of allocated (Dan)
>> - Remove list_empty() check (Dan)
>> - Move calculation to cxl_region_attach() under cxl_dpa_rwsem (Dan)
>> - Normalize perf numbers to HMAT coords (Brice, Dan)
>> ---
>> drivers/cxl/core/cdat.c | 53 +++++++++++++++++++++++++++++++++++++++++++++
>> drivers/cxl/core/region.c | 2 ++
>> drivers/cxl/cxl.h | 5 ++++
>> 3 files changed, 60 insertions(+)
>>
>> diff --git a/drivers/cxl/core/cdat.c b/drivers/cxl/core/cdat.c
>> index 5fe57fe5e2ee..29bba04306e9 100644
>> --- a/drivers/cxl/core/cdat.c
>> +++ b/drivers/cxl/core/cdat.c
>> @@ -547,3 +547,56 @@ void cxl_switch_parse_cdat(struct cxl_port *port)
>> EXPORT_SYMBOL_NS_GPL(cxl_switch_parse_cdat, CXL);
>>
>> MODULE_IMPORT_NS(CXL);
>> +
>> +void cxl_region_perf_data_calculate(struct cxl_region *cxlr,
>> + struct cxl_endpoint_decoder *cxled)
>> +{
>> + struct list_head *perf_list;
>> + struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
>> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
>> + struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlds);
>> + struct range dpa = {
>> + .start = cxled->dpa_res->start,
>> + .end = cxled->dpa_res->end,
>> + };
>> + struct cxl_dpa_perf *perf;
>> + bool found = false;
>> +
>> + switch (cxlr->mode) {
>> + case CXL_DECODER_RAM:
>> + perf_list = &mds->ram_perf_list;
>> + break;
>> + case CXL_DECODER_PMEM:
>> + perf_list = &mds->pmem_perf_list;
>> + break;
>> + default:
>> + return;
>> + }
>> +
>> + list_for_each_entry(perf, perf_list, list) {
>> + if (range_contains(&perf->dpa_range, &dpa)) {
>> + found = true;
>> + break;
>> + }
>> + }
>> +
>> + if (!found)
>> + return;
>
> Could use
> if (list_entry_is_head())
> return;
> and drop the found variable. Though that is a little bit specific to the
> internals of the list infrastructure so maybe adding a variable is better..
> There is precedence for both approaches in tree.
>
Hmm....maybe not having to rely on list internals makes it a little easier to read?
>> +
>> + /* Get total bandwidth and the worst latency for the cxl region */
>> + cxlr->coord.read_latency = max_t(unsigned int,
>> + cxlr->coord.read_latency,
>> + perf->coord.read_latency);
>> + cxlr->coord.write_latency = max_t(unsigned int,
>> + cxlr->coord.write_latency,
>> + perf->coord.write_latency);
>> + cxlr->coord.read_bandwidth += perf->coord.read_bandwidth;
>> + cxlr->coord.write_bandwidth += perf->coord.write_bandwidth;
>> +
>> + /*
>> + * Convert latency to nanosec from picosec to be consistent with HMAT
>
> HMAT version what? You may ask why is there a breaking change in the HMAT definition
> between 6.2 and 6.3 but I'd rather you didn't :(
Do you mean between revision 1 vs 2? I see different code for parsing it in the hmat_normalize() call depending on 1 vs 2. My ACPI r6.5 doc says the HMAT revision included is 2.
Assuming the final HMAT latency coordinates are always in nanoseconds and our raw data calculation is always in picoseconds, the HMAT version doesn't really matter at this location, right? I think the hmat_normalize() call in HMAT ensures that all latency data are nanosecond based. Should I just say "calculated data resulted from HMAT" to make it clear it's not data straight from the tables?
>
>
>> + * attributes.
>> + */
>> + cxlr->coord.read_latency = DIV_ROUND_UP(cxlr->coord.read_latency, 1000);
>> + cxlr->coord.write_latency = DIV_ROUND_UP(cxlr->coord.write_latency, 1000);
>> +}
>
* Re: [PATCH v2 1/3] cxl/region: Calculate performance data for a region
2023-12-21 22:51 ` Dave Jiang
@ 2024-01-08 13:58 ` Jonathan Cameron
0 siblings, 0 replies; 11+ messages in thread
From: Jonathan Cameron @ 2024-01-08 13:58 UTC (permalink / raw)
To: Dave Jiang
Cc: linux-cxl, dan.j.williams, ira.weiny, vishal.l.verma,
alison.schofield, dave, brice.goglin, nifan.cxl
On Thu, 21 Dec 2023 15:51:06 -0700
Dave Jiang <dave.jiang@intel.com> wrote:
> On 12/19/23 07:51, Jonathan Cameron wrote:
> > On Fri, 15 Dec 2023 16:15:59 -0700
> > Dave Jiang <dave.jiang@intel.com> wrote:
> >
> >> Calculate and store the performance data for a CXL region. Find the worst
> >> read and write latency for all the included ranges from each of the devices
> >> that attributes to the region and designate that as the latency data. Sum
> >> all the read and write bandwidth data for each of the device region and
> >> that is the total bandwidth for the region.
> >>
> >> The perf list is expected to be constructed before the endpoint decoders
> >> are registered and thus there should be no early reading of the entries
> >> from the region assemble action. The calling of the region qos calculate
> >> function is under the protection of cxl_dpa_rwsem and will ensure that
> >> all DPA associated work has completed.
> >>
> >> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
> >
> > Trivial comments inline. With the HMAT reference tweaked,
> > Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> >
> >> ---
> >> v2:
> >> - Move cxled declaration (Fan)
> >> - Move calculate function to core/cdat.c
> >> - Make cxlr->coord a struct instead of allocated (Dan)
> >> - Remove list_empty() check (Dan)
> >> - Move calculation to cxl_region_attach() under cxl_dpa_rwsem (Dan)
> >> - Normalize perf numbers to HMAT coords (Brice, Dan)
> >> ---
> >> drivers/cxl/core/cdat.c | 53 +++++++++++++++++++++++++++++++++++++++++++++
> >> drivers/cxl/core/region.c | 2 ++
> >> drivers/cxl/cxl.h | 5 ++++
> >> 3 files changed, 60 insertions(+)
> >>
> >> diff --git a/drivers/cxl/core/cdat.c b/drivers/cxl/core/cdat.c
> >> index 5fe57fe5e2ee..29bba04306e9 100644
> >> --- a/drivers/cxl/core/cdat.c
> >> +++ b/drivers/cxl/core/cdat.c
> >> @@ -547,3 +547,56 @@ void cxl_switch_parse_cdat(struct cxl_port *port)
> >> EXPORT_SYMBOL_NS_GPL(cxl_switch_parse_cdat, CXL);
> >>
> >> MODULE_IMPORT_NS(CXL);
> >> +
> >> +void cxl_region_perf_data_calculate(struct cxl_region *cxlr,
> >> + struct cxl_endpoint_decoder *cxled)
> >> +{
> >> + struct list_head *perf_list;
> >> + struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> >> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> >> + struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlds);
> >> + struct range dpa = {
> >> + .start = cxled->dpa_res->start,
> >> + .end = cxled->dpa_res->end,
> >> + };
> >> + struct cxl_dpa_perf *perf;
> >> + bool found = false;
> >> +
> >> + switch (cxlr->mode) {
> >> + case CXL_DECODER_RAM:
> >> + perf_list = &mds->ram_perf_list;
> >> + break;
> >> + case CXL_DECODER_PMEM:
> >> + perf_list = &mds->pmem_perf_list;
> >> + break;
> >> + default:
> >> + return;
> >> + }
> >> +
> >> + list_for_each_entry(perf, perf_list, list) {
> >> + if (range_contains(&perf->dpa_range, &dpa)) {
> >> + found = true;
> >> + break;
> >> + }
> >> + }
> >> +
> >> + if (!found)
> >> + return;
> >
> > Could use
> > if (list_entry_is_head())
> > return;
> > and drop the found variable. Though that is a little bit specific to the
> > internals of the list infrastructure so maybe adding a variable is better..
> > There is precedence for both approaches in tree.
> >
>
> Hmm....maybe not having to rely on list internals makes it a little easier to read?
Maybe :) Up to you.
>
> >> +
> >> + /* Get total bandwidth and the worst latency for the cxl region */
> >> + cxlr->coord.read_latency = max_t(unsigned int,
> >> + cxlr->coord.read_latency,
> >> + perf->coord.read_latency);
> >> + cxlr->coord.write_latency = max_t(unsigned int,
> >> + cxlr->coord.write_latency,
> >> + perf->coord.write_latency);
> >> + cxlr->coord.read_bandwidth += perf->coord.read_bandwidth;
> >> + cxlr->coord.write_bandwidth += perf->coord.write_bandwidth;
> >> +
> >> + /*
> >> + * Convert latency to nanosec from picosec to be consistent with HMAT
> >
> > HMAT version what? You may ask why is there a breaking change in the HMAT definition
> > between 6.2 and 6.3 but I'd rather you didn't :(
>
> Do you mean between revision 1 vs 2?
> I see different code for parsing
> it in hmat_normalize() call depending on 1 vs 2.My ACPI r6.5 doc says
> the HMAT revision included is 2. Assuming the final HMAT latency
> coordinates are always in nanoseconds and our raw data calculation is
> always in picoseconds, the HMAT version doesn't really impact at this
> location right? I think the hmat_normalize() call in HMAT will ensure
> that all latency data are nanoseconds base. Should I just say
> "calculated data resulted from HMAT" to make it clear it's not data
> straight from the tables?
>
yes. That works nicely.
> >
> >
> >> + * attributes.
> >> + */
> >> + cxlr->coord.read_latency =
> >> DIV_ROUND_UP(cxlr->coord.read_latency, 1000);
> >> + cxlr->coord.write_latency =
> >> DIV_ROUND_UP(cxlr->coord.write_latency, 1000); +}
> >
>
* [PATCH v2 2/3] cxl/region: Add sysfs attribute for locality attributes of CXL regions
2023-12-15 23:15 [PATCH v2 0/3] cxl: Add support to report region access coordinates to numa nodes Dave Jiang
2023-12-15 23:15 ` [PATCH v2 1/3] cxl/region: Calculate performance data for a region Dave Jiang
@ 2023-12-15 23:16 ` Dave Jiang
2023-12-19 14:58 ` Jonathan Cameron
2023-12-15 23:16 ` [PATCH v2 3/3] cxl: Add memory hotplug notifier for cxl region Dave Jiang
2 siblings, 1 reply; 11+ messages in thread
From: Dave Jiang @ 2023-12-15 23:16 UTC (permalink / raw)
To: linux-cxl
Cc: dan.j.williams, ira.weiny, vishal.l.verma, alison.schofield,
jonathan.cameron, dave, brice.goglin, nifan.cxl
Add read/write latency and bandwidth sysfs attributes for the enabled CXL
region. The bandwidth is the aggregated bandwidth of all devices that
contribute to the CXL region. The latency is the worst latency amongst
all the devices that contribute to the CXL region.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
v2:
- Add units for documentation (Brice, Dan)
- Add explanation initiator/target relation. (Brice)
- Fix issue in commit log (Fan)
---
Documentation/ABI/testing/sysfs-bus-cxl | 56 +++++++++++++++++++++++++++++++
drivers/cxl/core/region.c | 24 +++++++++++++
2 files changed, 80 insertions(+)
diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index fff2581b8033..e859f466a6b5 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -552,3 +552,59 @@ Description:
attribute is only visible for devices supporting the
capability. The retrieved errors are logged as kernel
events when cxl_poison event tracing is enabled.
+
+
+What: /sys/bus/cxl/devices/regionZ/read_bandwidth
+Date: Apr, 2023
+KernelVersion: v6.8
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RO) The aggregated read bandwidth of the region. The number is
+ the accumulated read bandwidth of all CXL memory devices that
+ contribute to the region in MB/s. Should be equivalent to
+ attributes in /sys/devices/system/node/nodeX/accessY/. See
+ Documentation/ABI/stable/sysfs-devices-node.
+ The host bus latency in the calculation is from proximity
+ domain 0 to the host bus proximity domain.
+
+
+What: /sys/bus/cxl/devices/regionZ/write_bandwidth
+Date: Apr, 2023
+KernelVersion: v6.8
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RO) The aggregated write bandwidth of the region. The number is
+ the accumulated write bandwidth of all CXL memory devices that
+ contribute to the region in MB/s. Should be equivalent to
+ attributes in /sys/devices/system/node/nodeX/accessY/. See
+ Documentation/ABI/stable/sysfs-devices-node.
+ The host bus latency in the calculation is from proximity
+ domain 0 to the host bus proximity domain.
+
+
+What: /sys/bus/cxl/devices/regionZ/read_latency
+Date: Apr, 2023
+KernelVersion: v6.8
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RO) The read latency of the region. The number is
+ the worst read latency of all CXL memory devices that
+ contribute to the region in nanoseconds. Should be
+ equivalent to attributes in /sys/devices/system/node/nodeX/accessY/.
+ See Documentation/ABI/stable/sysfs-devices-node.
+ The host bus latency in the calculation is from proximity
+ domain 0 to the host bus proximity domain.
+
+
+What: /sys/bus/cxl/devices/regionZ/write_latency
+Date: Apr, 2023
+KernelVersion: v6.8
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RO) The write latency of the region. The number is
+ the worst write latency of all CXL memory devices that
+ contribute to the region in nanoseconds. Should be
+ equivalent to attributes in /sys/devices/system/node/nodeX/accessY/.
+ See Documentation/ABI/stable/sysfs-devices-node.
+ The host bus latency in the calculation is from proximity
+ domain 0 to the host bus proximity domain.
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index be7383e74ef5..d97fa5f32e86 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -645,6 +645,26 @@ static ssize_t size_show(struct device *dev, struct device_attribute *attr,
}
static DEVICE_ATTR_RW(size);
+#define ACCESS_ATTR(attrib) \
+static ssize_t attrib##_show(struct device *dev, \
+ struct device_attribute *attr, \
+ char *buf) \
+{ \
+ struct cxl_region *cxlr = to_cxl_region(dev); \
+ \
+ if (cxlr->coord.write_bandwidth == 0) \
+ return 0; \
+ \
+ return sysfs_emit(buf, "%u\n", \
+ cxlr->coord.attrib); \
+} \
+static DEVICE_ATTR_RO(attrib)
+
+ACCESS_ATTR(read_bandwidth);
+ACCESS_ATTR(read_latency);
+ACCESS_ATTR(write_bandwidth);
+ACCESS_ATTR(write_latency);
+
static struct attribute *cxl_region_attrs[] = {
&dev_attr_uuid.attr,
&dev_attr_commit.attr,
@@ -653,6 +673,10 @@ static struct attribute *cxl_region_attrs[] = {
&dev_attr_resource.attr,
&dev_attr_size.attr,
&dev_attr_mode.attr,
+ &dev_attr_read_bandwidth.attr,
+ &dev_attr_write_bandwidth.attr,
+ &dev_attr_read_latency.attr,
+ &dev_attr_write_latency.attr,
NULL,
};
* Re: [PATCH v2 2/3] cxl/region: Add sysfs attribute for locality attributes of CXL regions
2023-12-15 23:16 ` [PATCH v2 2/3] cxl/region: Add sysfs attribute for locality attributes of CXL regions Dave Jiang
@ 2023-12-19 14:58 ` Jonathan Cameron
0 siblings, 0 replies; 11+ messages in thread
From: Jonathan Cameron @ 2023-12-19 14:58 UTC (permalink / raw)
To: Dave Jiang
Cc: linux-cxl, dan.j.williams, ira.weiny, vishal.l.verma,
alison.schofield, dave, brice.goglin, nifan.cxl
On Fri, 15 Dec 2023 16:16:05 -0700
Dave Jiang <dave.jiang@intel.com> wrote:
> Add read/write latencies and bandwidth sysfs attributes for the enabled CXL
> region. The bandwidth is the aggregated bandwidth of all devices that
> contribute to the CXL region. The latency is the worst latency of the
> device amongst all the devices that contribute to the CXL region.
>
> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
A few comments inline.
Jonathan
> ---
> v2:
> - Add units for documentation (Brice, Dan)
> - Add explanation initiator/target relation. (Brice)
> - Fix issue in commit log (Fan)
> ---
> Documentation/ABI/testing/sysfs-bus-cxl | 56 +++++++++++++++++++++++++++++++
> drivers/cxl/core/region.c | 24 +++++++++++++
> 2 files changed, 80 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index fff2581b8033..e859f466a6b5 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -552,3 +552,59 @@ Description:
> attribute is only visible for devices supporting the
> capability. The retrieved errors are logged as kernel
> events when cxl_poison event tracing is enabled.
> +
> +
> +What: /sys/bus/cxl/devices/regionZ/read_bandwidth
> +Date: Apr, 2023
> +KernelVersion: v6.8
> +Contact: linux-cxl@vger.kernel.org
> +Description:
> + (RO) The aggregated read bandwidth of the region. The number is
> + the accumulated read bandwidth of all CXL memory devices that
> + contributes to the region in MB/s. Should be equivalent to
> + attributes in /sys/devices/system/node/nodeX/accessY/. See
> + Documentation/ABI/stable/sysfs-devices-node.
> + The host bus latency in the calculation is from proximity
> + domain 0 to the host bus proximity domain.
If it's equivalent to /sys/devices/system/node/nodeX/accessY, then it isn't
the domain 0 value, it's the domain Y one. (IIRC how that works!)
> +
> +
> +What: /sys/bus/cxl/devices/regionZ/write_bandwidth
> +Date: Apr, 2023
> +KernelVersion: v6.8
> +Contact: linux-cxl@vger.kernel.org
> +Description:
> + (RO) The aggregated write bandwidth of the region. The number is
> + the accumulated write bandwidth of all CXL memory devices that
> + contributes to the region in MB/s. Should be equivalent to
> + attributes in /sys/devices/system/node/nodeX/accessY/. See
> + Documentation/ABI/stable/sysfs-devices-node.
> + The host bus latency in the calculation is from proximity
> + domain 0 to the host bus proximity domain.
> +
> +
> +What: /sys/bus/cxl/devices/regionZ/read_latency
> +Date: Apr, 2023
> +KernelVersion: v6.8
> +Contact: linux-cxl@vger.kernel.org
> +Description:
> + (RO) The read latency of the region. The number is
> + the worst read latency of all CXL memory devices that
> + contributes to the region in nanoseconds. Should be
> + equivalent to attributes in /sys/devices/system/node/nodeX/accessY/.
> + See Documentation/ABI/stable/sysfs-devices-node.
> + The host bus latency in the calculation is from proximity
> + domain 0 to the host bus proximity domain.
> +
> +
> +What: /sys/bus/cxl/devices/regionZ/write_latency
> +Date: Apr, 2023
> +KernelVersion: v6.8
> +Contact: linux-cxl@vger.kernel.org
> +Description:
> + (RO) The write latency of the region. The number is
> + the worst write latency of all CXL memory devices that
> + contributes to the region in nanoseconds. Should be
> + equivalent to attributes in /sys/devices/system/node/nodeX/accessY/.
> + See Documentation/ABI/stable/sysfs-devices-node.
> + The host bus latency in the calculation is from proximity
> + domain 0 to the host bus proximity domain.
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index be7383e74ef5..d97fa5f32e86 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -645,6 +645,26 @@ static ssize_t size_show(struct device *dev, struct device_attribute *attr,
> }
> static DEVICE_ATTR_RW(size);
>
> +#define ACCESS_ATTR(attrib) \
> +static ssize_t attrib##_show(struct device *dev, \
> + struct device_attribute *attr, \
> + char *buf) \
> +{ \
> + struct cxl_region *cxlr = to_cxl_region(dev); \
> + \
> + if (cxlr->coord.write_bandwidth == 0) \
Fun question for the long run: does a RO DC region ever supply a write
bandwidth? Checking coord.attrib (the field actually being shown) rather
than always write_bandwidth seems better in general.
Also, I'd prefer returning an error to say the value isn't provided,
rather than returning success with a zero-length string.
Better still, I'd hide these attributes entirely if the value isn't
available. Is that tricky to do for some reason? There is already a
cxl_region_visible() callback.
> + return 0; \
> + \
> + return sysfs_emit(buf, "%u\n", \
> + cxlr->coord.attrib); \
Oddly short line wrap.
> +} \
> +static DEVICE_ATTR_RO(attrib)
> +
> +ACCESS_ATTR(read_bandwidth);
> +ACCESS_ATTR(read_latency);
> +ACCESS_ATTR(write_bandwidth);
> +ACCESS_ATTR(write_latency);
> +
> static struct attribute *cxl_region_attrs[] = {
> &dev_attr_uuid.attr,
> &dev_attr_commit.attr,
> @@ -653,6 +673,10 @@ static struct attribute *cxl_region_attrs[] = {
> &dev_attr_resource.attr,
> &dev_attr_size.attr,
> &dev_attr_mode.attr,
> + &dev_attr_read_bandwidth.attr,
> + &dev_attr_write_bandwidth.attr,
> + &dev_attr_read_latency.attr,
> + &dev_attr_write_latency.attr,
> NULL,
> };
>
>
>
* [PATCH v2 3/3] cxl: Add memory hotplug notifier for cxl region
2023-12-15 23:15 [PATCH v2 0/3] cxl: Add support to report region access coordinates to numa nodes Dave Jiang
2023-12-15 23:15 ` [PATCH v2 1/3] cxl/region: Calculate performance data for a region Dave Jiang
2023-12-15 23:16 ` [PATCH v2 2/3] cxl/region: Add sysfs attribute for locality attributes of CXL regions Dave Jiang
@ 2023-12-15 23:16 ` Dave Jiang
2023-12-19 15:15 ` Jonathan Cameron
2 siblings, 1 reply; 11+ messages in thread
From: Dave Jiang @ 2023-12-15 23:16 UTC (permalink / raw)
To: linux-cxl
Cc: Greg Kroah-Hartman, Rafael J. Wysocki, Huang, Ying,
dan.j.williams, ira.weiny, vishal.l.verma, alison.schofield,
jonathan.cameron, dave, brice.goglin, nifan.cxl
When a CXL region is formed, the driver computes the performance data
for the region. However, this data is not reflected in the node data
collection that the HMAT populated during kernel initialization. Add a
memory hotplug notifier to update the performance data in the node
hmem_attrs and expose the newly calculated region performance data. The
CXL region is created under a specific CFMWS. The node for that CFMWS is
created during SRAT parsing by acpi_parse_cfmws(). The notifier runs
once only and turns itself off after the initial run. Additional regions
may overwrite the initial data, but since this is for the same proximity
domain it's a don't-care for now.
node_set_perf_attrs() is exported to allow update of perf attribs for a
node. Given that only CXL is using this, export only to CXL namespace.
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
v2:
- Fix notifier return values (Dan)
- Use devm_add_action_or_reset() instead of adding a remove callback (Dan)
- Add Ying review tag
---
drivers/base/node.c | 1 +
drivers/cxl/core/region.c | 42 ++++++++++++++++++++++++++++++++++++++++++
drivers/cxl/cxl.h | 3 +++
3 files changed, 46 insertions(+)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index cb2b6cc7f6e6..f5b5a3f11894 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -215,6 +215,7 @@ void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord,
}
}
}
+EXPORT_SYMBOL_NS_GPL(node_set_perf_attrs, CXL);
/**
* struct node_cache_info - Internal tracking for memory node caches
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index d97fa5f32e86..1765bf716484 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -4,6 +4,7 @@
#include <linux/genalloc.h>
#include <linux/device.h>
#include <linux/module.h>
+#include <linux/memory.h>
#include <linux/slab.h>
#include <linux/uuid.h>
#include <linux/sort.h>
@@ -2960,6 +2961,42 @@ static int is_system_ram(struct resource *res, void *arg)
return 1;
}
+static int cxl_region_perf_attrs_callback(struct notifier_block *nb,
+ unsigned long action, void *arg)
+{
+ struct cxl_region *cxlr = container_of(nb, struct cxl_region,
+ memory_notifier);
+ struct cxl_region_params *p = &cxlr->params;
+ struct cxl_endpoint_decoder *cxled = p->targets[0];
+ struct cxl_decoder *cxld = &cxled->cxld;
+ struct memory_notify *mnb = arg;
+ int nid = mnb->status_change_nid;
+ int region_nid;
+
+ if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
+ return NOTIFY_DONE;
+
+ region_nid = phys_to_target_node(cxld->hpa_range.start);
+ if (nid != region_nid)
+ return NOTIFY_DONE;
+
+ /* Don't set if there's no coordinate information */
+ if (!cxlr->coord.write_bandwidth)
+ return NOTIFY_DONE;
+
+ node_set_perf_attrs(nid, &cxlr->coord, 0);
+ node_set_perf_attrs(nid, &cxlr->coord, 1);
+
+ return NOTIFY_OK;
+}
+
+static void remove_coord_notifier(void *data)
+{
+ struct cxl_region *cxlr = data;
+
+ unregister_memory_notifier(&cxlr->memory_notifier);
+}
+
static int cxl_region_probe(struct device *dev)
{
struct cxl_region *cxlr = to_cxl_region(dev);
@@ -2985,6 +3022,11 @@ static int cxl_region_probe(struct device *dev)
goto out;
}
+ cxlr->memory_notifier.notifier_call = cxl_region_perf_attrs_callback;
+ cxlr->memory_notifier.priority = HMAT_CALLBACK_PRI;
+ register_memory_notifier(&cxlr->memory_notifier);
+ rc = devm_add_action_or_reset(&cxlr->dev, remove_coord_notifier, cxlr);
+
/*
* From this point on any path that changes the region's state away from
* CXL_CONFIG_COMMIT is also responsible for releasing the driver.
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 4639d0d6ef54..2498086c8edc 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -6,6 +6,7 @@
#include <linux/libnvdimm.h>
#include <linux/bitfield.h>
+#include <linux/notifier.h>
#include <linux/bitops.h>
#include <linux/log2.h>
#include <linux/node.h>
@@ -520,6 +521,7 @@ struct cxl_region_params {
* @flags: Region state flags
* @params: active + config params for the region
* @coord: QoS access coordinates for the region
+ * @memory_notifier: notifier for setting the access coordinates to node
*/
struct cxl_region {
struct device dev;
@@ -531,6 +533,7 @@ struct cxl_region {
unsigned long flags;
struct cxl_region_params params;
struct access_coordinate coord;
+ struct notifier_block memory_notifier;
};
struct cxl_nvdimm_bridge {
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH v2 3/3] cxl: Add memory hotplug notifier for cxl region
2023-12-15 23:16 ` [PATCH v2 3/3] cxl: Add memory hotplug notifier for cxl region Dave Jiang
@ 2023-12-19 15:15 ` Jonathan Cameron
2023-12-22 18:17 ` Dave Jiang
0 siblings, 1 reply; 11+ messages in thread
From: Jonathan Cameron @ 2023-12-19 15:15 UTC (permalink / raw)
To: Dave Jiang
Cc: linux-cxl, Greg Kroah-Hartman, Rafael J. Wysocki, Huang, Ying,
dan.j.williams, ira.weiny, vishal.l.verma, alison.schofield, dave,
brice.goglin, nifan.cxl
On Fri, 15 Dec 2023 16:16:11 -0700
Dave Jiang <dave.jiang@intel.com> wrote:
> When the CXL region is formed, the driver would computed the performance
> data for the region. However this data is not available at the node data
> collection that has been populated by the HMAT during kernel
> initialization. Add a memory hotplug notifier to update the performance
> data to the node hmem_attrs to expose the newly calculated region
> performance data. The CXL region is created under specific CFMWS. The
> node for the CFMWS is created during SRAT parsing by acpi_parse_cfmws().
> The notifier will run once only and turn itself off after the initial
> run. Additional regions may overwrite the initial data, but since this is
> for the same poximity domain it's a don't care for now.
proximity
>
> node_set_perf_attrs() is exported to allow update of perf attribs for a
> node. Given that only CXL is using this, export only to CXL namespace.
>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Rafael J. Wysocki <rafael@kernel.org>
> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
What is the end result of this?
/sys/devices/system/node/node/access0/
/sys/devices/system/node/node/access1/
With just the bandwidths and latencies?
No targets or initiators under accessX/targets or accessX/initiators?
Or have those been set up earlier? In which case do we handle
the worse bandwidth being inside the host CPU?
> ---
> v2:
> - Fix notifier return values (Dan)
> - Use devm_add_action_or_reset() instead of adding a remove callback (Dan)
> - Add Ying review tag
> ---
> drivers/base/node.c | 1 +
> drivers/cxl/core/region.c | 42 ++++++++++++++++++++++++++++++++++++++++++
> drivers/cxl/cxl.h | 3 +++
> 3 files changed, 46 insertions(+)
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index cb2b6cc7f6e6..f5b5a3f11894 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -215,6 +215,7 @@ void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord,
> }
> }
> }
> +EXPORT_SYMBOL_NS_GPL(node_set_perf_attrs, CXL);
This feels ugly as namespaces usually about what is providing the facility not
a 'who can use it' control.
Also, I'm aware of at least one other user who will want this in the not
too distant future. So if we want to namespace it, I'd prefer a NODE namespace
or something along those lines.
>
> /**
> * struct node_cache_info - Internal tracking for memory node caches
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index d97fa5f32e86..1765bf716484 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -4,6 +4,7 @@
> #include <linux/genalloc.h>
> #include <linux/device.h>
> #include <linux/module.h>
> +#include <linux/memory.h>
> #include <linux/slab.h>
> #include <linux/uuid.h>
> #include <linux/sort.h>
> @@ -2960,6 +2961,42 @@ static int is_system_ram(struct resource *res, void *arg)
> return 1;
> }
>
> +static int cxl_region_perf_attrs_callback(struct notifier_block *nb,
> + unsigned long action, void *arg)
> +{
> + struct cxl_region *cxlr = container_of(nb, struct cxl_region,
> + memory_notifier);
> + struct cxl_region_params *p = &cxlr->params;
> + struct cxl_endpoint_decoder *cxled = p->targets[0];
> + struct cxl_decoder *cxld = &cxled->cxld;
> + struct memory_notify *mnb = arg;
> + int nid = mnb->status_change_nid;
> + int region_nid;
> +
> + if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
> + return NOTIFY_DONE;
> +
> + region_nid = phys_to_target_node(cxld->hpa_range.start);
> + if (nid != region_nid)
> + return NOTIFY_DONE;
> +
> + /* Don't set if there's no coordinate information */
> + if (!cxlr->coord.write_bandwidth)
> + return NOTIFY_DONE;
Could future-proof this a bit to allow for RO memory by using read_bandwidth here.
> +
> + node_set_perf_attrs(nid, &cxlr->coord, 0);
> + node_set_perf_attrs(nid, &cxlr->coord, 1);
Hmm. The assumption that the access attributes from non-CPU requesters are the same
as the CPU's bothers me a little.
> +
> + return NOTIFY_OK;
> +}
> +
> +static void remove_coord_notifier(void *data)
> +{
> + struct cxl_region *cxlr = data;
> +
> + unregister_memory_notifier(&cxlr->memory_notifier);
> +}
> +
> static int cxl_region_probe(struct device *dev)
> {
> struct cxl_region *cxlr = to_cxl_region(dev);
> @@ -2985,6 +3022,11 @@ static int cxl_region_probe(struct device *dev)
> goto out;
> }
>
> + cxlr->memory_notifier.notifier_call = cxl_region_perf_attrs_callback;
> + cxlr->memory_notifier.priority = HMAT_CALLBACK_PRI;
> + register_memory_notifier(&cxlr->memory_notifier);
> + rc = devm_add_action_or_reset(&cxlr->dev, remove_coord_notifier, cxlr);
> +
> /*
> * From this point on any path that changes the region's state away from
> * CXL_CONFIG_COMMIT is also responsible for releasing the driver.
>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v2 3/3] cxl: Add memory hotplug notifier for cxl region
2023-12-19 15:15 ` Jonathan Cameron
@ 2023-12-22 18:17 ` Dave Jiang
2024-01-08 13:56 ` Jonathan Cameron
0 siblings, 1 reply; 11+ messages in thread
From: Dave Jiang @ 2023-12-22 18:17 UTC (permalink / raw)
To: Jonathan Cameron
Cc: linux-cxl, Greg Kroah-Hartman, Rafael J. Wysocki, Huang, Ying,
dan.j.williams, ira.weiny, vishal.l.verma, alison.schofield, dave,
brice.goglin, nifan.cxl
On 12/19/23 08:15, Jonathan Cameron wrote:
> On Fri, 15 Dec 2023 16:16:11 -0700
> Dave Jiang <dave.jiang@intel.com> wrote:
>
>> When the CXL region is formed, the driver would computed the performance
>> data for the region. However this data is not available at the node data
>> collection that has been populated by the HMAT during kernel
>> initialization. Add a memory hotplug notifier to update the performance
>> data to the node hmem_attrs to expose the newly calculated region
>> performance data. The CXL region is created under specific CFMWS. The
>> node for the CFMWS is created during SRAT parsing by acpi_parse_cfmws().
>> The notifier will run once only and turn itself off after the initial
>> run. Additional regions may overwrite the initial data, but since this is
>> for the same poximity domain it's a don't care for now.
>
> proximity
>
>>
>> node_set_perf_attrs() is exported to allow update of perf attribs for a
>> node. Given that only CXL is using this, export only to CXL namespace.
>>
>> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>> Cc: Rafael J. Wysocki <rafael@kernel.org>
>> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
>> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
>
> What is end result of this?
>
> /sys/devices/system/node/node/access0/
> /sys/devices/system/node/node/access1/
> With just the bandwidths and latencies?
> No targets or initiators under accessX/targets or accessX/initiators?
# tree ./devices/system/node/node2/access0
./devices/system/node/node2/access0
├── initiators
│ ├── node1 -> ../../../node1
│ ├── read_bandwidth
│ ├── read_latency
│ ├── write_bandwidth
│ └── write_latency
├── power
│ ├── async
│ ├── runtime_active_kids
│ ├── runtime_enabled
│ ├── runtime_status
│ └── runtime_usage
├── targets
└── uevent
>
> Or have those been set up earlier? In which case do we handle
> the worse bandwidth being inside the host CPU?
I think it gets set up via the memory online callback notifier that the region driver registered.
>
>> ---
>> v2:
>> - Fix notifier return values (Dan)
>> - Use devm_add_action_or_reset() instead of adding a remove callback (Dan)
>> - Add Ying review tag
>> ---
>> drivers/base/node.c | 1 +
>> drivers/cxl/core/region.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>> drivers/cxl/cxl.h | 3 +++
>> 3 files changed, 46 insertions(+)
>>
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index cb2b6cc7f6e6..f5b5a3f11894 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -215,6 +215,7 @@ void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord,
>> }
>> }
>> }
>> +EXPORT_SYMBOL_NS_GPL(node_set_perf_attrs, CXL);
> This feels ugly as namespaces usually about what is providing the facility not
> a 'who can use it' control.
>
> Also, I'm aware of at least one other user who will want this in the not
> too distant future. So if we want to namespace it, I'd prefer a NODE namespace
> or something along those lines.
I'll just make it a normal export if we are anticipating another user.
>
>>
>> /**
>> * struct node_cache_info - Internal tracking for memory node caches
>> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
>> index d97fa5f32e86..1765bf716484 100644
>> --- a/drivers/cxl/core/region.c
>> +++ b/drivers/cxl/core/region.c
>> @@ -4,6 +4,7 @@
>> #include <linux/genalloc.h>
>> #include <linux/device.h>
>> #include <linux/module.h>
>> +#include <linux/memory.h>
>> #include <linux/slab.h>
>> #include <linux/uuid.h>
>> #include <linux/sort.h>
>> @@ -2960,6 +2961,42 @@ static int is_system_ram(struct resource *res, void *arg)
>> return 1;
>> }
>>
>> +static int cxl_region_perf_attrs_callback(struct notifier_block *nb,
>> + unsigned long action, void *arg)
>> +{
>> + struct cxl_region *cxlr = container_of(nb, struct cxl_region,
>> + memory_notifier);
>> + struct cxl_region_params *p = &cxlr->params;
>> + struct cxl_endpoint_decoder *cxled = p->targets[0];
>> + struct cxl_decoder *cxld = &cxled->cxld;
>> + struct memory_notify *mnb = arg;
>> + int nid = mnb->status_change_nid;
>> + int region_nid;
>> +
>> + if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
>> + return NOTIFY_DONE;
>> +
>> + region_nid = phys_to_target_node(cxld->hpa_range.start);
>> + if (nid != region_nid)
>> + return NOTIFY_DONE;
>> +
>> + /* Don't set if there's no coordinate information */
>> + if (!cxlr->coord.write_bandwidth)
>> + return NOTIFY_DONE;
>
> Could future proof a bit to allow for RO memory by using read_bandwith here.
Yes. I didn't realize there would be RO memory. I just assumed that bandwidth would always be > 0 for a valid set of data.
>
>> +
>> + node_set_perf_attrs(nid, &cxlr->coord, 0);
>> + node_set_perf_attrs(nid, &cxlr->coord, 1);
>
> Hmm. Assumption that the access attributes from no CPU requesters is the same
> as the CPU bothers me a little.
I wasn't too sure about updating this. Should I only update access 0?
>
>> +
>> + return NOTIFY_OK;
>> +}
>> +
>> +static void remove_coord_notifier(void *data)
>> +{
>> + struct cxl_region *cxlr = data;
>> +
>> + unregister_memory_notifier(&cxlr->memory_notifier);
>> +}
>> +
>> static int cxl_region_probe(struct device *dev)
>> {
>> struct cxl_region *cxlr = to_cxl_region(dev);
>> @@ -2985,6 +3022,11 @@ static int cxl_region_probe(struct device *dev)
>> goto out;
>> }
>>
>> + cxlr->memory_notifier.notifier_call = cxl_region_perf_attrs_callback;
>> + cxlr->memory_notifier.priority = HMAT_CALLBACK_PRI;
>> + register_memory_notifier(&cxlr->memory_notifier);
>> + rc = devm_add_action_or_reset(&cxlr->dev, remove_coord_notifier, cxlr);
>> +
>> /*
>> * From this point on any path that changes the region's state away from
>> * CXL_CONFIG_COMMIT is also responsible for releasing the driver.
>
>>
>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v2 3/3] cxl: Add memory hotplug notifier for cxl region
2023-12-22 18:17 ` Dave Jiang
@ 2024-01-08 13:56 ` Jonathan Cameron
0 siblings, 0 replies; 11+ messages in thread
From: Jonathan Cameron @ 2024-01-08 13:56 UTC (permalink / raw)
To: Dave Jiang
Cc: linux-cxl, Greg Kroah-Hartman, Rafael J. Wysocki, Huang, Ying,
dan.j.williams, ira.weiny, vishal.l.verma, alison.schofield, dave,
brice.goglin, nifan.cxl
On Fri, 22 Dec 2023 11:17:15 -0700
Dave Jiang <dave.jiang@intel.com> wrote:
> On 12/19/23 08:15, Jonathan Cameron wrote:
> > On Fri, 15 Dec 2023 16:16:11 -0700
> > Dave Jiang <dave.jiang@intel.com> wrote:
> >
> >> When the CXL region is formed, the driver would computed the performance
> >> data for the region. However this data is not available at the node data
> >> collection that has been populated by the HMAT during kernel
> >> initialization. Add a memory hotplug notifier to update the performance
> >> data to the node hmem_attrs to expose the newly calculated region
> >> performance data. The CXL region is created under specific CFMWS. The
> >> node for the CFMWS is created during SRAT parsing by acpi_parse_cfmws().
> >> The notifier will run once only and turn itself off after the initial
> >> run. Additional regions may overwrite the initial data, but since this is
> >> for the same poximity domain it's a don't care for now.
> >
> > proximity
> >
> >>
> >> node_set_perf_attrs() is exported to allow update of perf attribs for a
> >> node. Given that only CXL is using this, export only to CXL namespace.
> >>
> >> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> >> Cc: Rafael J. Wysocki <rafael@kernel.org>
> >> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
> >> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
> >
> > What is end result of this?
> >
> > /sys/devices/system/node/node/access0/
> > /sys/devices/system/node/node/access1/
> > With just the bandwidths and latencies?
> > No targets or initiators under accessX/targets or accessX/initiators?
>
> # tree ./devices/system/node/node2/access0
> ./devices/system/node/node2/access0
> ├── initiators
> │ ├── node1 -> ../../../node1
> │ ├── read_bandwidth
> │ ├── read_latency
> │ ├── write_bandwidth
> │ └── write_latency
> ├── power
> │ ├── async
> │ ├── runtime_active_kids
> │ ├── runtime_enabled
> │ ├── runtime_status
> │ └── runtime_usage
> ├── targets
> └── uevent
>
>
> >
> > Or have those been set up earlier? In which case do we handle
> > the worse bandwidth being inside the host CPU?
>
> I think it gets setup via the memory online callback notifier the region driver registered.
Hmm. For access0 this is a problem in the long term as the nearest
initiator might be below the port. Still lots of stuff to do before
that works anyway.
access1 is fine in general as CPUs won't be below the port unless CXL
gains a lot of new functionality in CXL rev X :)
>
> >
> >> ---
> >> v2:
> >> - Fix notifier return values (Dan)
> >> - Use devm_add_action_or_reset() instead of adding a remove callback (Dan)
> >> - Add Ying review tag
> >> ---
> >> drivers/base/node.c | 1 +
> >> drivers/cxl/core/region.c | 42 ++++++++++++++++++++++++++++++++++++++++++
> >> drivers/cxl/cxl.h | 3 +++
> >> 3 files changed, 46 insertions(+)
> >>
> >> diff --git a/drivers/base/node.c b/drivers/base/node.c
> >> index cb2b6cc7f6e6..f5b5a3f11894 100644
> >> --- a/drivers/base/node.c
> >> +++ b/drivers/base/node.c
> >> @@ -215,6 +215,7 @@ void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord,
> >> }
> >> }
> >> }
> >> +EXPORT_SYMBOL_NS_GPL(node_set_perf_attrs, CXL);
> > This feels ugly as namespaces usually about what is providing the facility not
> > a 'who can use it' control.
> >
> > Also, I'm aware of at least one other user who will want this in the not
> > too distant future. So if we want to namespace it, I'd prefer a NODE namespace
> > or something along those lines.
>
> I'll just make it normal export if we are anticipating another user.
>
> >
> >>
> >> /**
> >> * struct node_cache_info - Internal tracking for memory node caches
> >> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> >> index d97fa5f32e86..1765bf716484 100644
> >> --- a/drivers/cxl/core/region.c
> >> +++ b/drivers/cxl/core/region.c
> >> @@ -4,6 +4,7 @@
> >> #include <linux/genalloc.h>
> >> #include <linux/device.h>
> >> #include <linux/module.h>
> >> +#include <linux/memory.h>
> >> #include <linux/slab.h>
> >> #include <linux/uuid.h>
> >> #include <linux/sort.h>
> >> @@ -2960,6 +2961,42 @@ static int is_system_ram(struct resource *res, void *arg)
> >> return 1;
> >> }
> >>
> >> +static int cxl_region_perf_attrs_callback(struct notifier_block *nb,
> >> + unsigned long action, void *arg)
> >> +{
> >> + struct cxl_region *cxlr = container_of(nb, struct cxl_region,
> >> + memory_notifier);
> >> + struct cxl_region_params *p = &cxlr->params;
> >> + struct cxl_endpoint_decoder *cxled = p->targets[0];
> >> + struct cxl_decoder *cxld = &cxled->cxld;
> >> + struct memory_notify *mnb = arg;
> >> + int nid = mnb->status_change_nid;
> >> + int region_nid;
> >> +
> >> + if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
> >> + return NOTIFY_DONE;
> >> +
> >> + region_nid = phys_to_target_node(cxld->hpa_range.start);
> >> + if (nid != region_nid)
> >> + return NOTIFY_DONE;
> >> +
> >> + /* Don't set if there's no coordinate information */
> >> + if (!cxlr->coord.write_bandwidth)
> >> + return NOTIFY_DONE;
> >
> > Could future proof a bit to allow for RO memory by using read_bandwith here.
>
> Yes. I didn't realize there will be RO memory. I just assumed that bandwidth would always be > 0 for a valid set of data.
>
> >
> >> +
> >> + node_set_perf_attrs(nid, &cxlr->coord, 0);
> >> + node_set_perf_attrs(nid, &cxlr->coord, 1);
> >
> > Hmm. Assumption that the access attributes from no CPU requesters is the same
> > as the CPU bothers me a little.
>
> I wasn't too sure about updating this. Should I only update access 0?
They shouldn't be updated to the same thing. It should depend on which
node is in their initiators directory. It can be different between access0
and access1
> >
> >> +
> >> + return NOTIFY_OK;
> >> +}
> >> +
> >> +static void remove_coord_notifier(void *data)
> >> +{
> >> + struct cxl_region *cxlr = data;
> >> +
> >> + unregister_memory_notifier(&cxlr->memory_notifier);
> >> +}
> >> +
> >> static int cxl_region_probe(struct device *dev)
> >> {
> >> struct cxl_region *cxlr = to_cxl_region(dev);
> >> @@ -2985,6 +3022,11 @@ static int cxl_region_probe(struct device *dev)
> >> goto out;
> >> }
> >>
> >> + cxlr->memory_notifier.notifier_call = cxl_region_perf_attrs_callback;
> >> + cxlr->memory_notifier.priority = HMAT_CALLBACK_PRI;
> >> + register_memory_notifier(&cxlr->memory_notifier);
> >> + rc = devm_add_action_or_reset(&cxlr->dev, remove_coord_notifier, cxlr);
> >> +
> >> /*
> >> * From this point on any path that changes the region's state away from
> >> * CXL_CONFIG_COMMIT is also responsible for releasing the driver.
> >
> >>
> >
^ permalink raw reply [flat|nested] 11+ messages in thread