[PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems


* [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems
@ 2024-06-10 18:35 Tony Luck
  2024-06-10 18:35 ` [PATCH v20 01/18] x86/resctrl: Prepare for new domain scope Tony Luck
                   ` (18 more replies)
  0 siblings, 19 replies; 61+ messages in thread
From: Tony Luck @ 2024-06-10 18:35 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches, Tony Luck

This series is based on top of tip x86/cache commit f385f0246394
("x86/resctrl: Replace open coded cacheinfo searches")

The Sub-NUMA cluster feature on some Intel processors partitions the CPUs
that share an L3 cache into two or more sets. This plays havoc with the
Resource Director Technology (RDT) monitoring features. Prior to this
series, Intel advised that SNC and RDT are incompatible.

Some of these CPUs support an MSR that can partition the RMID counters
in the same way. This allows monitoring features to be used. Legacy
monitoring files provide the sum of counters from each SNC node for
backwards compatibility. Additional files per SNC node provide details
per node.

Memory bandwidth allocation features continue to operate at
the scope of the L3 cache.

L3 cache occupancy and allocation operate on the portion of
L3 cache available for each SNC node.
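
To make the relationship between the legacy files and the per-node files
concrete, here is a minimal user-space sketch. It is not kernel code:
struct snc_node_count, l3_legacy_count() and snc_node_count_read() are
hypothetical names invented for this illustration, and the counters are
plain integers rather than values read from hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-SNC-node counter snapshot; not a kernel structure. */
struct snc_node_count {
	int		node_id;
	uint64_t	bytes;
};

/* Legacy (per-L3) view: sum the counters of every SNC node sharing the L3. */
uint64_t l3_legacy_count(const struct snc_node_count *nodes, size_t n)
{
	uint64_t sum = 0;
	size_t i;

	assert(nodes != NULL || n == 0);
	for (i = 0; i < n; i++)
		sum += nodes[i].bytes;

	return sum;
}

/* Per-node view: report a single SNC node. Returns 0, or -1 if not found. */
int snc_node_count_read(const struct snc_node_count *nodes, size_t n,
			int node_id, uint64_t *out)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (nodes[i].node_id == node_id) {
			*out = nodes[i].bytes;
			return 0;
		}
	}

	return -1;
}
```

With two SNC nodes behind one L3 cache, the legacy file would correspond
to l3_legacy_count() over both entries, while each new per-node file
would correspond to one snc_node_count_read() call.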

Signed-off-by: Tony Luck <tony.luck@intel.com>

---
Changes since v19: https://lore.kernel.org/all/20240528222006.58283-1-tony.luck@intel.com/

1-4:	Refactor on top of <linux/cacheinfo.h> change.
	Nothing functional.

5:	No change

6:	Updated commit message with note about RMID Sharing mode.
	Renamed __rmid_read() to __rmid_read_phys() and performed
	translation from logical RMID to physical RMID at callsites.
	Updated comment for __rmid_read_phys() with explanation of
	logical/physical RMIDs. Consistently use "SNC node" and avoid
	"SNC domain". Add specifics for non-SNC mode.
	Joined split line on __rmid_read() definition (even with the
	added "_phys" in its name it still fits on one line).

7:	No change

8:	get_cpu_cacheinfo_level() moved to <linux/cacheinfo.h>
	(currently in tip x86/cache).
	No other changes.

9:	Dropped the "sumdomains" field from struct rmid_read (a NULL
	domain field now indicates that summing is needed).
	Fix kerneldoc comments for struct rmid_read.
	Updated commit comments with more "why" than "what".

10:	No change

11:	Fix commit comments per suggestions
	1) Added some "why it is OK to take a bit from evtid"
	2) s/The stolen bit is given to/Give the bit to/
	3) Don't use "l3_cache_id" (which looks like a variable)

12:	Fix commit message.
	s/kernfs_find_and_get_ns()/kernfs_find_and_get()/
	Add kernfs_put() to drop hold from kernfs_find_and_get()
	Drop useless "/* create the directory */" comment.

13:	Add kernfs_put() to drop hold from kernfs_find_and_get() [two places]

14:	Add cpumask parameter to mon_event_read() so SNC decisions are
	all in rdtgroup_mondata_show() instead of spread between functions.
	Add comments in rdtgroup_mondata_show() to explain the sum vs. no-sum
	cases.
	Moved the mon_event_read() call into both arms of the if-else
	instead of "d = NULL; goto got_cacheinfo;"

15:	New (replaces 15-17). Make __mon_event_count() do the sum across
	domains (at filesystem level). Move the CPU/domain sanity check out
	of resctrl_arch_rmid_read() and into __mon_event_count()
	with separate scope tests for single domain vs. sum over
	domains.

16:	[Was 18] Update commit message with details about MSR 0xCA0, what changes
	when bit 0 is cleared, and why this is necessary.
	Dropped "Add an architecture specific hook" language from
	commit message.

17:	[Was 19] Drop "and enabling" from shortlog (enabling done by
	previous commit).
	Added checks that cpumask_weight() isn't returning zero (to keep
	static checkers from warning of possible divide by zero).

18:	[Was 20] Fix some "Sub-NUMA" references to say "Sub-NUMA Cluster"
	Added document section on effect of SNC mode on MBA and L3 CAT.
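
The note for patch 17 about guarding against a zero cpumask_weight() can
be illustrated with a small stand-alone sketch. It assumes the SNC node
count is derived as a ratio of two CPU counts, which is what a possible
divide by zero suggests but which this cover letter does not spell out.
snc_nodes_per_l3() is a hypothetical helper, not the function in the
patch, and falling back to 1 (no SNC) is an assumption of the sketch.

```c
#include <assert.h>

/*
 * Hypothetical helper: derive the number of SNC nodes per L3 cache from
 * the number of CPUs sharing the L3 and the number of CPUs in one node.
 * Falls back to 1 (treated here as "SNC not enabled") when either count
 * is zero, so the division below can never divide by zero.
 */
int snc_nodes_per_l3(unsigned int cpus_sharing_l3, unsigned int cpus_in_node)
{
	unsigned int ratio;

	if (cpus_sharing_l3 == 0 || cpus_in_node == 0)
		return 1;

	/* A node larger than the L3 domain cannot be an SNC sub-node. */
	if (cpus_in_node > cpus_sharing_l3)
		return 1;

	ratio = cpus_sharing_l3 / cpus_in_node;
	assert(ratio >= 1);	/* guaranteed by the guards above */

	return (int)ratio;
}
```

The early returns make the zero case explicit, which is also what lets a
static checker see that the divisor is non-zero.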

Tony Luck (18):
  x86/resctrl: Prepare for new domain scope
  x86/resctrl: Prepare to split rdt_domain structure
  x86/resctrl: Prepare for different scope for control/monitor
    operations
  x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  x86/resctrl: Add node-scope to the options for feature scope
  x86/resctrl: Introduce snc_nodes_per_l3_cache
  x86/resctrl: Block use of mba_MBps mount option on Sub-NUMA Cluster
    (SNC) systems
  x86/resctrl: Prepare for new Sub-NUMA Cluster (SNC) monitor files
  x86/resctrl: Add a new field to struct rmid_read for summation of
    domains
  x86/resctrl: Refactor mkdir_mondata_subdir() with a helper function
  x86/resctrl: Allocate a new field in union mon_data_bits
  x86/resctrl: Create Sub-NUMA Cluster (SNC) monitor files
  x86/resctrl: Handle removing directories in Sub-NUMA Cluster (SNC)
    mode
  x86/resctrl: Fill out rmid_read structure for smp_call*() to read a
    counter
  x86/resctrl: Make __mon_event_count() handle sum domains
  x86/resctrl: Enable RMID shared RMID mode on Sub-NUMA Cluster (SNC)
    systems
  x86/resctrl: Sub-NUMA Cluster (SNC) detection
  x86/resctrl: Update documentation with Sub-NUMA cluster changes

 Documentation/arch/x86/resctrl.rst        |  27 ++
 include/linux/resctrl.h                   |  87 ++++--
 arch/x86/include/asm/msr-index.h          |   1 +
 arch/x86/kernel/cpu/resctrl/internal.h    |  93 +++++--
 arch/x86/kernel/cpu/resctrl/core.c        | 312 ++++++++++++++++------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  85 +++---
 arch/x86/kernel/cpu/resctrl/monitor.c     | 242 ++++++++++++++---
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  27 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 272 ++++++++++++-------
 9 files changed, 835 insertions(+), 311 deletions(-)


base-commit: f385f024639431bec3e70c33cdbc9563894b3ee5
-- 
2.45.0



* [PATCH v20 01/18] x86/resctrl: Prepare for new domain scope
  2024-06-10 18:35 [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
@ 2024-06-10 18:35 ` Tony Luck
  2024-06-20 21:12   ` Reinette Chatre
  2024-06-10 18:35 ` [PATCH v20 02/18] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
                   ` (17 subsequent siblings)
  18 siblings, 1 reply; 61+ messages in thread
From: Tony Luck @ 2024-06-10 18:35 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches, Tony Luck

Resctrl resources operate on subsets of CPUs in the system with the
defining attribute of each subset being an instance of a particular
level of cache. E.g. all CPUs sharing an L3 cache would be part of the
same domain.

In preparation for features that are scoped at the NUMA node level,
change the code from explicit references to "cache_level" to a more
generic scope. At this point the only options for this scope are groups
of CPUs that share an L2 cache or L3 cache.

Clean up the error handling when looking up domains. Report invalid ids
before calling rdt_find_domain() in preparation for better messages when
scope can be other than cache scope. This means that rdt_find_domain()
will never return an error, so remove the error checks from the callsites.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h                   |  9 ++++-
 arch/x86/kernel/cpu/resctrl/core.c        | 46 ++++++++++++++++-------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  2 +-
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  6 ++-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  5 ++-
 5 files changed, 49 insertions(+), 19 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index a365f67131ec..ed693bfe474d 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -150,13 +150,18 @@ struct resctrl_membw {
 struct rdt_parse_data;
 struct resctrl_schema;
 
+enum resctrl_scope {
+	RESCTRL_L2_CACHE = 2,
+	RESCTRL_L3_CACHE = 3,
+};
+
 /**
  * struct rdt_resource - attributes of a resctrl resource
  * @rid:		The index of the resource
  * @alloc_capable:	Is allocation available on this machine
  * @mon_capable:	Is monitor feature available on this machine
  * @num_rmid:		Number of RMIDs available
- * @cache_level:	Which cache level defines scope of this resource
+ * @scope:		Scope of this resource
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
  * @domains:		RCU list of all domains for this resource
@@ -174,7 +179,7 @@ struct rdt_resource {
 	bool			alloc_capable;
 	bool			mon_capable;
 	int			num_rmid;
-	int			cache_level;
+	enum resctrl_scope	scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
 	struct list_head	domains;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index a113d9aba553..f85b2ff40eef 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -68,7 +68,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L3,
 			.name			= "L3",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L3),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
@@ -82,7 +82,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L2,
 			.name			= "L2",
-			.cache_level		= 2,
+			.scope			= RESCTRL_L2_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_L2),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
@@ -96,7 +96,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_MBA,
 			.name			= "MB",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_MBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
@@ -108,7 +108,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_SMBA,
 			.name			= "SMBA",
-			.cache_level		= 3,
+			.scope			= RESCTRL_L3_CACHE,
 			.domains		= domain_init(RDT_RESOURCE_SMBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
@@ -392,9 +392,6 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
 	struct rdt_domain *d;
 	struct list_head *l;
 
-	if (id < 0)
-		return ERR_PTR(-ENODEV);
-
 	list_for_each(l, &r->domains) {
 		d = list_entry(l, struct rdt_domain, list);
 		/* When id is found, return its domain. */
@@ -484,6 +481,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
 	return 0;
 }
 
+static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
+{
+	switch (scope) {
+	case RESCTRL_L2_CACHE:
+	case RESCTRL_L3_CACHE:
+		return get_cpu_cacheinfo_id(cpu, scope);
+	default:
+		break;
+	}
+
+	return -EINVAL;
+}
+
 /*
  * domain_add_cpu - Add a cpu to a resource's domain list.
  *
@@ -499,7 +509,7 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
  */
 static void domain_add_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	int id = get_domain_id_from_scope(cpu, r->scope);
 	struct list_head *add_pos = NULL;
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
@@ -507,12 +517,14 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 	lockdep_assert_held(&domain_list_lock);
 
-	d = rdt_find_domain(r, id, &add_pos);
-	if (IS_ERR(d)) {
-		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+	if (id < 0) {
+		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->scope, r->name);
 		return;
 	}
 
+	d = rdt_find_domain(r, id, &add_pos);
+
 	if (d) {
 		cpumask_set_cpu(cpu, &d->cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
@@ -552,15 +564,21 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+	int id = get_domain_id_from_scope(cpu, r->scope);
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
 
 	lockdep_assert_held(&domain_list_lock);
 
+	if (id < 0) {
+		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->scope, r->name);
+		return;
+	}
+
 	d = rdt_find_domain(r, id, NULL);
-	if (IS_ERR_OR_NULL(d)) {
-		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+	if (!d) {
+		pr_warn("Couldn't find domain with id=%d for CPU %d\n", id, cpu);
 		return;
 	}
 	hw_dom = resctrl_to_arch_dom(d);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index b7291f60399c..2bf021d42500 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -577,7 +577,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 
 	r = &rdt_resources_all[resid].r_resctrl;
 	d = rdt_find_domain(r, domid, NULL);
-	if (IS_ERR_OR_NULL(d)) {
+	if (!d) {
 		ret = -ENOENT;
 		goto out;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 1bbfd3c1e300..201011f0ed0b 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,9 +292,13 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
  */
 static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 {
+	enum resctrl_scope scope = plr->s->res->scope;
 	struct cacheinfo *ci;
 	int ret;
 
+	if (WARN_ON_ONCE(scope != RESCTRL_L2_CACHE && scope != RESCTRL_L3_CACHE))
+		return -ENODEV;
+
 	/* Pick the first cpu we find that is associated with the cache. */
 	plr->cpu = cpumask_first(&plr->d->cpu_mask);
 
@@ -305,7 +309,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 		goto out_region;
 	}
 
-	ci = get_cpu_cacheinfo_level(plr->cpu, plr->s->res->cache_level);
+	ci = get_cpu_cacheinfo_level(plr->cpu, scope);
 	if (ci) {
 		plr->line_size = ci->coherency_line_size;
 		plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index cb68a121dabb..50f5876a3020 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1454,8 +1454,11 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 	struct cacheinfo *ci;
 	int num_b;
 
+	if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+		return size;
+
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
-	ci = get_cpu_cacheinfo_level(cpumask_any(&d->cpu_mask), r->cache_level);
+	ci = get_cpu_cacheinfo_level(cpumask_any(&d->cpu_mask), r->scope);
 	if (ci)
 		size = ci->size / r->cache.cbm_len * num_b;
 
-- 
2.45.0
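
The flow introduced by this patch (map a CPU to a domain id according to
the resource scope, reject a negative id, then search the sorted domain
list) can be modelled in user space as follows. This is a sketch under
stated assumptions: mock_cache_id() stands in for get_cpu_cacheinfo_id()
with an invented topology, a sorted array replaces the kernel list, and
struct domain and find_domain() are simplified stand-ins for the kernel
types.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum resctrl_scope {
	RESCTRL_L2_CACHE = 2,
	RESCTRL_L3_CACHE = 3,
};

/* Hypothetical topology: 4 CPUs share each L2, 16 CPUs share each L3. */
static int mock_cache_id(int cpu, int level)
{
	if (cpu < 0)
		return -1;

	return level == 2 ? cpu / 4 : cpu / 16;
}

/* Mirrors the shape of get_domain_id_from_scope() in the patch. */
int domain_id_from_scope(int cpu, enum resctrl_scope scope)
{
	switch (scope) {
	case RESCTRL_L2_CACHE:
	case RESCTRL_L3_CACHE:
		return mock_cache_id(cpu, scope);
	default:
		break;
	}

	return -EINVAL;
}

struct domain {
	int id;
};

/*
 * Search an array sorted by id. Returns the matching domain or NULL, never
 * an error value. On a miss, *pos (if non-NULL) receives the index at
 * which a new domain with this id would be inserted.
 */
struct domain *find_domain(struct domain *doms, size_t n, int id, size_t *pos)
{
	size_t i;

	assert(id >= 0);	/* callers reject invalid ids before searching */

	for (i = 0; i < n; i++) {
		if (doms[i].id == id)
			return &doms[i];
		if (id < doms[i].id)
			break;
	}

	if (pos)
		*pos = i;

	return NULL;
}
```

Because the id is validated before the search, the lookup only has two
outcomes (found or NULL), which is why the callsites in the diff above
can test "!d" instead of IS_ERR_OR_NULL(d).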



* Re: [PATCH v20 01/18] x86/resctrl: Prepare for new domain scope
  2024-06-10 18:35 ` [PATCH v20 01/18] x86/resctrl: Prepare for new domain scope Tony Luck
@ 2024-06-20 21:12   ` Reinette Chatre
  0 siblings, 0 replies; 61+ messages in thread
From: Reinette Chatre @ 2024-06-20 21:12 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches

Hi Tony,

On 6/10/24 11:35 AM, Tony Luck wrote:
> Resctrl resources operate on subsets of CPUs in the system with the
> defining attribute of each subset being an instance of a particular
> level of cache. E.g. all CPUs sharing an L3 cache would be part of the
> same domain.
> 
> In preparation for features that are scoped at the NUMA node level
> change the code from explicit references to "cache_level" to a more
> generic scope. At this point the only options for this scope are groups
> of CPUs that share an L2 cache or L3 cache.
> 
> Clean up the error handling when looking up domains. Report invalid id's
> before calling rdt_find_domain() in preparation for better messages when
> scope can be other than cache scope. This means that rdt_find_domain()
> will never return an error. So remove checks for error from the callsites.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---

Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Reinette


* [PATCH v20 02/18] x86/resctrl: Prepare to split rdt_domain structure
  2024-06-10 18:35 [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
  2024-06-10 18:35 ` [PATCH v20 01/18] x86/resctrl: Prepare for new domain scope Tony Luck
@ 2024-06-10 18:35 ` Tony Luck
  2024-06-20 21:13   ` Reinette Chatre
  2024-06-10 18:35 ` [PATCH v20 03/18] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 61+ messages in thread
From: Tony Luck @ 2024-06-10 18:35 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches, Tony Luck

The rdt_domain structure is used for both control and monitor features.
It is about to be split into separate structures for these two usages
because the scope for control and monitoring features for a resource
will be different for future resources.

To allow for common code that scans a list of domains looking for a
specific domain id, move all the common fields ("list", "id", "cpu_mask")
into their own structure within the rdt_domain structure.
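
As an illustration of why a shared header helps, the sketch below shows
one lookup routine serving two different domain types. It is a
simplified user-space model, not the kernel code: the header here keeps
only an id and a singly linked "next" pointer (the real rdt_domain_hdr
uses struct list_head and also carries cpu_mask), and ctrl_domain,
mon_domain, domain_of(), find_hdr() and mon_domain_from_hdr() are
hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified common header; "next" stands in for struct list_head. */
struct domain_hdr {
	struct domain_hdr	*next;
	int			id;
};

/* Two hypothetical domain types that embed the same header. */
struct ctrl_domain {
	struct domain_hdr	hdr;
	unsigned int		ctrl_val;
};

struct mon_domain {
	struct domain_hdr	hdr;
	unsigned long long	bytes;
};

/* container_of()-style conversion from the embedded header to its owner. */
#define domain_of(ptr, type) \
	((type *)((char *)(ptr) - offsetof(type, hdr)))

/* One search routine works for any list of headers, whatever the owner. */
struct domain_hdr *find_hdr(struct domain_hdr *head, int id)
{
	for (; head; head = head->next) {
		if (head->id == id)
			return head;
	}

	return NULL;
}

/* Typed accessor; the caller must pass a header embedded in a mon_domain. */
struct mon_domain *mon_domain_from_hdr(struct domain_hdr *h)
{
	assert(h != NULL);

	return domain_of(h, struct mon_domain);
}
```

The search code never needs to know whether it is walking control or
monitor domains; only the final conversion is type specific.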

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h                   | 16 ++++--
 arch/x86/kernel/cpu/resctrl/core.c        | 26 +++++-----
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 24 ++++-----
 arch/x86/kernel/cpu/resctrl/monitor.c     | 14 +++---
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 14 +++---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 60 +++++++++++------------
 6 files changed, 81 insertions(+), 73 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index ed693bfe474d..f63fcf17a3bc 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -59,10 +59,20 @@ struct resctrl_staged_config {
 };
 
 /**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_domain_hdr - common header for different domain types
  * @list:		all instances of this resource
  * @id:			unique id for this instance
  * @cpu_mask:		which CPUs share this resource
+ */
+struct rdt_domain_hdr {
+	struct list_head		list;
+	int				id;
+	struct cpumask			cpu_mask;
+};
+
+/**
+ * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * @hdr:		common header for different domain types
  * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
  * @mbm_total:		saved state for MBM total bandwidth
  * @mbm_local:		saved state for MBM local bandwidth
@@ -77,9 +87,7 @@ struct resctrl_staged_config {
  *			by closid
  */
 struct rdt_domain {
-	struct list_head		list;
-	int				id;
-	struct cpumask			cpu_mask;
+	struct rdt_domain_hdr		hdr;
 	unsigned long			*rmid_busy_llc;
 	struct mbm_state		*mbm_total;
 	struct mbm_state		*mbm_local;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index f85b2ff40eef..96fff44f9d03 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -355,9 +355,9 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
 
 	lockdep_assert_cpus_held();
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		/* Find the domain that contains this CPU */
-		if (cpumask_test_cpu(cpu, &d->cpu_mask))
+		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
 			return d;
 	}
 
@@ -393,12 +393,12 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
 	struct list_head *l;
 
 	list_for_each(l, &r->domains) {
-		d = list_entry(l, struct rdt_domain, list);
+		d = list_entry(l, struct rdt_domain, hdr.list);
 		/* When id is found, return its domain. */
-		if (id == d->id)
+		if (id == d->hdr.id)
 			return d;
 		/* Stop searching when finding id's position in sorted list. */
-		if (id < d->id)
+		if (id < d->hdr.id)
 			break;
 	}
 
@@ -526,7 +526,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 	d = rdt_find_domain(r, id, &add_pos);
 
 	if (d) {
-		cpumask_set_cpu(cpu, &d->cpu_mask);
+		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
 			rdt_domain_reconfigure_cdp(r);
 		return;
@@ -537,8 +537,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		return;
 
 	d = &hw_dom->d_resctrl;
-	d->id = id;
-	cpumask_set_cpu(cpu, &d->cpu_mask);
+	d->hdr.id = id;
+	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 
 	rdt_domain_reconfigure_cdp(r);
 
@@ -552,11 +552,11 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		return;
 	}
 
-	list_add_tail_rcu(&d->list, add_pos);
+	list_add_tail_rcu(&d->hdr.list, add_pos);
 
 	err = resctrl_online_domain(r, d);
 	if (err) {
-		list_del_rcu(&d->list);
+		list_del_rcu(&d->hdr.list);
 		synchronize_rcu();
 		domain_free(hw_dom);
 	}
@@ -583,10 +583,10 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	}
 	hw_dom = resctrl_to_arch_dom(d);
 
-	cpumask_clear_cpu(cpu, &d->cpu_mask);
-	if (cpumask_empty(&d->cpu_mask)) {
+	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
+	if (cpumask_empty(&d->hdr.cpu_mask)) {
 		resctrl_offline_domain(r, d);
-		list_del_rcu(&d->list);
+		list_del_rcu(&d->hdr.list);
 		synchronize_rcu();
 
 		/*
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 2bf021d42500..6246f48b0449 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -69,7 +69,7 @@ int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
 
 	cfg = &d->staged_config[s->conf_type];
 	if (cfg->have_new_ctrl) {
-		rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+		rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
 		return -EINVAL;
 	}
 
@@ -148,7 +148,7 @@ int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
 
 	cfg = &d->staged_config[s->conf_type];
 	if (cfg->have_new_ctrl) {
-		rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+		rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id);
 		return -EINVAL;
 	}
 
@@ -231,8 +231,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
 		return -EINVAL;
 	}
 	dom = strim(dom);
-	list_for_each_entry(d, &r->domains, list) {
-		if (d->id == dom_id) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
+		if (d->hdr.id == dom_id) {
 			data.buf = dom;
 			data.rdtgrp = rdtgrp;
 			if (r->parse_ctrlval(&data, s, d))
@@ -280,7 +280,7 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 	u32 idx = get_config_index(closid, t);
 	struct msr_param msr_param;
 
-	if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
+	if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
 		return -EINVAL;
 
 	hw_dom->ctrl_val[idx] = cfg_val;
@@ -306,7 +306,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 	/* Walking r->domains, ensure it can't race with cpuhp */
 	lockdep_assert_cpus_held();
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		msr_param.res = NULL;
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
@@ -330,7 +330,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 			}
 		}
 		if (msr_param.res)
-			smp_call_function_any(&d->cpu_mask, rdt_ctrl_update, &msr_param, 1);
+			smp_call_function_any(&d->hdr.cpu_mask, rdt_ctrl_update, &msr_param, 1);
 	}
 
 	return 0;
@@ -450,7 +450,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 	lockdep_assert_cpus_held();
 
 	seq_printf(s, "%*s:", max_name_width, schema->name);
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -460,7 +460,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 			ctrl_val = resctrl_arch_get_config(r, dom, closid,
 							   schema->conf_type);
 
-		seq_printf(s, r->format_str, dom->id, max_data_width,
+		seq_printf(s, r->format_str, dom->hdr.id, max_data_width,
 			   ctrl_val);
 		sep = true;
 	}
@@ -489,7 +489,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
 			} else {
 				seq_printf(s, "%s:%d=%x\n",
 					   rdtgrp->plr->s->res->name,
-					   rdtgrp->plr->d->id,
+					   rdtgrp->plr->d->hdr.id,
 					   rdtgrp->plr->cbm);
 			}
 		} else {
@@ -537,7 +537,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 		return;
 	}
 
-	cpu = cpumask_any_housekeeping(&d->cpu_mask, RESCTRL_PICK_ANY_CPU);
+	cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask, RESCTRL_PICK_ANY_CPU);
 
 	/*
 	 * cpumask_any_housekeeping() prefers housekeeping CPUs, but
@@ -546,7 +546,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 	 * counters on some platforms if its called in IRQ context.
 	 */
 	if (tick_nohz_full_cpu(cpu))
-		smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
+		smp_call_function_any(&d->hdr.cpu_mask, mon_event_count, rr, 1);
 	else
 		smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
 
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 2345e6836593..ab8a198d88b3 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -281,7 +281,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
 
 	resctrl_arch_rmid_read_context_check();
 
-	if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
+	if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
 		return -EINVAL;
 
 	ret = __rmid_read(rmid, eventid, &msr_val);
@@ -364,7 +364,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
 			 * CLOSID and RMID because there may be dependencies between them
 			 * on some architectures.
 			 */
-			trace_mon_llc_occupancy_limbo(entry->closid, entry->rmid, d->id, val);
+			trace_mon_llc_occupancy_limbo(entry->closid, entry->rmid, d->hdr.id, val);
 		}
 
 		if (force_free || !rmid_dirty) {
@@ -490,7 +490,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 	idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);
 
 	entry->busy = 0;
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		/*
 		 * For the first limbo RMID in the domain,
 		 * setup up the limbo worker.
@@ -801,7 +801,7 @@ void cqm_handle_limbo(struct work_struct *work)
 	__check_limbo(d, false);
 
 	if (has_busy_rmid(d)) {
-		d->cqm_work_cpu = cpumask_any_housekeeping(&d->cpu_mask,
+		d->cqm_work_cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask,
 							   RESCTRL_PICK_ANY_CPU);
 		schedule_delayed_work_on(d->cqm_work_cpu, &d->cqm_limbo,
 					 delay);
@@ -825,7 +825,7 @@ void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms,
 	unsigned long delay = msecs_to_jiffies(delay_ms);
 	int cpu;
 
-	cpu = cpumask_any_housekeeping(&dom->cpu_mask, exclude_cpu);
+	cpu = cpumask_any_housekeeping(&dom->hdr.cpu_mask, exclude_cpu);
 	dom->cqm_work_cpu = cpu;
 
 	if (cpu < nr_cpu_ids)
@@ -868,7 +868,7 @@ void mbm_handle_overflow(struct work_struct *work)
 	 * Re-check for housekeeping CPUs. This allows the overflow handler to
 	 * move off a nohz_full CPU quickly.
 	 */
-	d->mbm_work_cpu = cpumask_any_housekeeping(&d->cpu_mask,
+	d->mbm_work_cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask,
 						   RESCTRL_PICK_ANY_CPU);
 	schedule_delayed_work_on(d->mbm_work_cpu, &d->mbm_over, delay);
 
@@ -897,7 +897,7 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms,
 	 */
 	if (!resctrl_mounted || !resctrl_arch_mon_capable())
 		return;
-	cpu = cpumask_any_housekeeping(&dom->cpu_mask, exclude_cpu);
+	cpu = cpumask_any_housekeeping(&dom->hdr.cpu_mask, exclude_cpu);
 	dom->mbm_work_cpu = cpu;
 
 	if (cpu < nr_cpu_ids)
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 201011f0ed0b..df45c839a58f 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -221,7 +221,7 @@ static int pseudo_lock_cstates_constrain(struct pseudo_lock_region *plr)
 	int cpu;
 	int ret;
 
-	for_each_cpu(cpu, &plr->d->cpu_mask) {
+	for_each_cpu(cpu, &plr->d->hdr.cpu_mask) {
 		pm_req = kzalloc(sizeof(*pm_req), GFP_KERNEL);
 		if (!pm_req) {
 			rdt_last_cmd_puts("Failure to allocate memory for PM QoS\n");
@@ -300,7 +300,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 		return -ENODEV;
 
 	/* Pick the first cpu we find that is associated with the cache. */
-	plr->cpu = cpumask_first(&plr->d->cpu_mask);
+	plr->cpu = cpumask_first(&plr->d->hdr.cpu_mask);
 
 	if (!cpu_online(plr->cpu)) {
 		rdt_last_cmd_printf("CPU %u associated with cache not online\n",
@@ -854,10 +854,10 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	 * associated with them.
 	 */
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(d_i, &r->domains, list) {
+		list_for_each_entry(d_i, &r->domains, hdr.list) {
 			if (d_i->plr)
 				cpumask_or(cpu_with_psl, cpu_with_psl,
-					   &d_i->cpu_mask);
+					   &d_i->hdr.cpu_mask);
 		}
 	}
 
@@ -865,7 +865,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	 * Next test if new pseudo-locked region would intersect with
 	 * existing region.
 	 */
-	if (cpumask_intersects(&d->cpu_mask, cpu_with_psl))
+	if (cpumask_intersects(&d->hdr.cpu_mask, cpu_with_psl))
 		ret = true;
 
 	free_cpumask_var(cpu_with_psl);
@@ -1197,7 +1197,7 @@ static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel)
 	}
 
 	plr->thread_done = 0;
-	cpu = cpumask_first(&plr->d->cpu_mask);
+	cpu = cpumask_first(&plr->d->hdr.cpu_mask);
 	if (!cpu_online(cpu)) {
 		ret = -ENODEV;
 		goto out;
@@ -1527,7 +1527,7 @@ static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
 	 * may be scheduled elsewhere and invalidate entries in the
 	 * pseudo-locked region.
 	 */
-	if (!cpumask_subset(current->cpus_ptr, &plr->d->cpu_mask)) {
+	if (!cpumask_subset(current->cpus_ptr, &plr->d->hdr.cpu_mask)) {
 		mutex_unlock(&rdtgroup_mutex);
 		return -EINVAL;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 50f5876a3020..b6ba77cdf0e8 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -98,7 +98,7 @@ void rdt_staged_configs_clear(void)
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(dom, &r->domains, list)
+		list_for_each_entry(dom, &r->domains, hdr.list)
 			memset(dom->staged_config, 0, sizeof(dom->staged_config));
 	}
 }
@@ -317,7 +317,7 @@ static int rdtgroup_cpus_show(struct kernfs_open_file *of,
 				rdt_last_cmd_puts("Cache domain offline\n");
 				ret = -ENODEV;
 			} else {
-				mask = &rdtgrp->plr->d->cpu_mask;
+				mask = &rdtgrp->plr->d->hdr.cpu_mask;
 				seq_printf(s, is_cpu_list(of) ?
 					   "%*pbl\n" : "%*pb\n",
 					   cpumask_pr_args(mask));
@@ -1021,12 +1021,12 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 	cpus_read_lock();
 	mutex_lock(&rdtgroup_mutex);
 	hw_shareable = r->cache.shareable_bits;
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_putc(seq, ';');
 		sw_shareable = 0;
 		exclusive = 0;
-		seq_printf(seq, "%d=", dom->id);
+		seq_printf(seq, "%d=", dom->hdr.id);
 		for (i = 0; i < closids_supported(); i++) {
 			if (!closid_allocated(i))
 				continue;
@@ -1343,7 +1343,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 		if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
 			continue;
 		has_cache = true;
-		list_for_each_entry(d, &r->domains, list) {
+		list_for_each_entry(d, &r->domains, hdr.list) {
 			ctrl = resctrl_arch_get_config(r, d, closid,
 						       s->conf_type);
 			if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1458,7 +1458,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 		return size;
 
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
-	ci = get_cpu_cacheinfo_level(cpumask_any(&d->cpu_mask), r->scope);
+	ci = get_cpu_cacheinfo_level(cpumask_any(&d->hdr.cpu_mask), r->scope);
 	if (ci)
 		size = ci->size / r->cache.cbm_len * num_b;
 
@@ -1502,7 +1502,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 			size = rdtgroup_cbm_to_size(rdtgrp->plr->s->res,
 						    rdtgrp->plr->d,
 						    rdtgrp->plr->cbm);
-			seq_printf(s, "%d=%u\n", rdtgrp->plr->d->id, size);
+			seq_printf(s, "%d=%u\n", rdtgrp->plr->d->hdr.id, size);
 		}
 		goto out;
 	}
@@ -1514,7 +1514,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 		type = schema->conf_type;
 		sep = false;
 		seq_printf(s, "%*s:", max_name_width, schema->name);
-		list_for_each_entry(d, &r->domains, list) {
+		list_for_each_entry(d, &r->domains, hdr.list) {
 			if (sep)
 				seq_putc(s, ';');
 			if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1532,7 +1532,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 				else
 					size = rdtgroup_cbm_to_size(r, d, ctrl);
 			}
-			seq_printf(s, "%d=%u", d->id, size);
+			seq_printf(s, "%d=%u", d->hdr.id, size);
 			sep = true;
 		}
 		seq_putc(s, '\n');
@@ -1592,7 +1592,7 @@ static void mon_event_config_read(void *info)
 
 static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
 {
-	smp_call_function_any(&d->cpu_mask, mon_event_config_read, mon_info, 1);
+	smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1);
 }
 
 static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
@@ -1604,7 +1604,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 	cpus_read_lock();
 	mutex_lock(&rdtgroup_mutex);
 
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -1612,7 +1612,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 		mon_info.evtid = evtid;
 		mondata_config_read(dom, &mon_info);
 
-		seq_printf(s, "%d=0x%02x", dom->id, mon_info.mon_config);
+		seq_printf(s, "%d=0x%02x", dom->hdr.id, mon_info.mon_config);
 		sep = true;
 	}
 	seq_puts(s, "\n");
@@ -1678,7 +1678,7 @@ static void mbm_config_write_domain(struct rdt_resource *r,
 	 * are scoped at the domain level. Writing any of these MSRs
 	 * on one CPU is observed by all the CPUs in the domain.
 	 */
-	smp_call_function_any(&d->cpu_mask, mon_event_config_write,
+	smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_write,
 			      &mon_info, 1);
 
 	/*
@@ -1728,8 +1728,8 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 		return -EINVAL;
 	}
 
-	list_for_each_entry(d, &r->domains, list) {
-		if (d->id == dom_id) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
+		if (d->hdr.id == dom_id) {
 			mbm_config_write_domain(r, d, evtid, val);
 			goto next;
 		}
@@ -2276,14 +2276,14 @@ static int set_cache_qos_cfg(int level, bool enable)
 		return -ENOMEM;
 
 	r_l = &rdt_resources_all[level].r_resctrl;
-	list_for_each_entry(d, &r_l->domains, list) {
+	list_for_each_entry(d, &r_l->domains, hdr.list) {
 		if (r_l->cache.arch_has_per_cpu_cfg)
 			/* Pick all the CPUs in the domain instance */
-			for_each_cpu(cpu, &d->cpu_mask)
+			for_each_cpu(cpu, &d->hdr.cpu_mask)
 				cpumask_set_cpu(cpu, cpu_mask);
 		else
 			/* Pick one CPU from each domain instance to update MSR */
-			cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
+			cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
 	}
 
 	/* Update QOS_CFG MSR on all the CPUs in cpu_mask */
@@ -2312,7 +2312,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
 static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
 {
 	u32 num_closid = resctrl_arch_get_num_closid(r);
-	int cpu = cpumask_any(&d->cpu_mask);
+	int cpu = cpumask_any(&d->hdr.cpu_mask);
 	int i;
 
 	d->mbps_val = kcalloc_node(num_closid, sizeof(*d->mbps_val),
@@ -2361,7 +2361,7 @@ static int set_mba_sc(bool mba_sc)
 
 	r->membw.mba_sc = mba_sc;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		for (i = 0; i < num_closid; i++)
 			d->mbps_val[i] = MBA_MAX_MBPS;
 	}
@@ -2700,7 +2700,7 @@ static int rdt_get_tree(struct fs_context *fc)
 
 	if (is_mbm_enabled()) {
 		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-		list_for_each_entry(dom, &r->domains, list)
+		list_for_each_entry(dom, &r->domains, hdr.list)
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL,
 						   RESCTRL_PICK_ANY_CPU);
 	}
@@ -2827,13 +2827,13 @@ static int reset_all_ctrls(struct rdt_resource *r)
 	 * CBMs in all domains to the maximum mask value. Pick one CPU
 	 * from each domain to update the MSRs below.
 	 */
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 
 		for (i = 0; i < hw_res->num_closid; i++)
 			hw_dom->ctrl_val[i] = r->default_ctrl;
 		msr_param.dom = d;
-		smp_call_function_any(&d->cpu_mask, rdt_ctrl_update, &msr_param, 1);
+		smp_call_function_any(&d->hdr.cpu_mask, rdt_ctrl_update, &msr_param, 1);
 	}
 
 	return 0;
@@ -3031,7 +3031,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 	char name[32];
 	int ret;
 
-	sprintf(name, "mon_%s_%02d", r->name, d->id);
+	sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
 	/* create the directory */
 	kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
 	if (IS_ERR(kn))
@@ -3047,7 +3047,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 	}
 
 	priv.u.rid = r->rid;
-	priv.u.domid = d->id;
+	priv.u.domid = d->hdr.id;
 	list_for_each_entry(mevt, &r->evt_list, list) {
 		priv.u.evtid = mevt->evtid;
 		ret = mon_addfile(kn, mevt->name, priv.priv);
@@ -3098,7 +3098,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 	/* Walking r->domains, ensure it can't race with cpuhp */
 	lockdep_assert_cpus_held();
 
-	list_for_each_entry(dom, &r->domains, list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
 		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
 		if (ret)
 			return ret;
@@ -3257,7 +3257,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
 	 */
 	tmp_cbm = cfg->new_ctrl;
 	if (bitmap_weight(&tmp_cbm, r->cache.cbm_len) < r->cache.min_cbm_bits) {
-		rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->id);
+		rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->hdr.id);
 		return -ENOSPC;
 	}
 	cfg->have_new_ctrl = true;
@@ -3280,7 +3280,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 	struct rdt_domain *d;
 	int ret;
 
-	list_for_each_entry(d, &s->res->domains, list) {
+	list_for_each_entry(d, &s->res->domains, hdr.list) {
 		ret = __init_one_rdt_domain(d, s, closid);
 		if (ret < 0)
 			return ret;
@@ -3295,7 +3295,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 	struct resctrl_staged_config *cfg;
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
 		if (is_mba_sc(r)) {
 			d->mbps_val[closid] = MBA_MAX_MBPS;
 			continue;
@@ -3941,7 +3941,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
 	 * per domain monitor data directories.
 	 */
 	if (resctrl_mounted && resctrl_arch_mon_capable())
-		rmdir_mondata_subdir_allrdtgrp(r, d->id);
+		rmdir_mondata_subdir_allrdtgrp(r, d->hdr.id);
 
 	if (is_mbm_enabled())
 		cancel_delayed_work(&d->mbm_over);
-- 
2.45.0



* Re: [PATCH v20 02/18] x86/resctrl: Prepare to split rdt_domain structure
  2024-06-10 18:35 ` [PATCH v20 02/18] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
@ 2024-06-20 21:13   ` Reinette Chatre
  0 siblings, 0 replies; 61+ messages in thread
From: Reinette Chatre @ 2024-06-20 21:13 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches

Hi Tony,

On 6/10/24 11:35 AM, Tony Luck wrote:
> The rdt_domain structure is used for both control and monitor features.
> It is about to be split into separate structures for these two usages
> because the scope for control and monitoring features for a resource
> will be different for future resources.
> 
> To allow for common code that scans a list of domains looking for a
> specific domain id, move all the common fields ("list", "id", "cpu_mask")
> into their own structure within the rdt_domain structure.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---

Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Reinette


* [PATCH v20 03/18] x86/resctrl: Prepare for different scope for control/monitor operations
  2024-06-10 18:35 [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
  2024-06-10 18:35 ` [PATCH v20 01/18] x86/resctrl: Prepare for new domain scope Tony Luck
  2024-06-10 18:35 ` [PATCH v20 02/18] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
@ 2024-06-10 18:35 ` Tony Luck
  2024-06-20 21:13   ` Reinette Chatre
  2024-06-10 18:35 ` [PATCH v20 04/18] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
                   ` (15 subsequent siblings)
  18 siblings, 1 reply; 61+ messages in thread
From: Tony Luck @ 2024-06-10 18:35 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches, Tony Luck

Resctrl assumes that control and monitor operations on a resource are
performed at the same scope.

Prepare for systems that use different scopes. Specifically, Intel needs
to split the RDT_RESOURCE_L3 resource to use L3 scope for cache control
and NODE scope for cache occupancy and memory bandwidth monitoring.

Create separate domain lists for control and monitor operations.

Note that errors during initialization of either control or monitor
functions on a domain would previously result in that domain being
excluded from both control and monitor operations. Now that the domains
are allocated independently, it is no longer necessary to disable both
control and monitor operations if either fails.
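
For illustration only, here is a minimal userspace sketch of the pattern
this change relies on: a common header carrying a type tag, one search
helper that works on either list, and a type check before converting the
header back to its containing structure. It is not the kernel code. The
names domain_hdr, ctrl_domain, mon_domain, find_domain(),
to_ctrl_domain(), to_mon_domain() and check_example() are hypothetical,
the list is a simplified singly linked list rather than a list_head, and
the separate control and monitor structures are an assumption made to
show why the type check matters.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

enum domain_type { CTRL_DOMAIN, MON_DOMAIN };

/* Common header, loosely modelled on struct rdt_domain_hdr (no cpu_mask). */
struct domain_hdr {
	struct domain_hdr *next;	/* simplified list linkage, sorted by id */
	int id;
	enum domain_type type;
};

/* Hypothetical containers that embed the common header. */
struct ctrl_domain {
	struct domain_hdr hdr;
	unsigned int ctrl_val;
};

struct mon_domain {
	struct domain_hdr hdr;
	unsigned long long mbm_total;
};

/* Search an id-sorted list. Only header fields are touched. */
static struct domain_hdr *find_domain(struct domain_hdr *head, int id)
{
	struct domain_hdr *h;

	for (h = head; h; h = h->next) {
		if (h->id == id)
			return h;
		if (h->id > id)		/* passed the slot where id would be */
			break;
	}
	return NULL;
}

/* Convert a header to its container, refusing a header of the wrong type. */
static struct ctrl_domain *to_ctrl_domain(struct domain_hdr *h)
{
	if (!h || h->type != CTRL_DOMAIN)
		return NULL;
	return container_of(h, struct ctrl_domain, hdr);
}

static struct mon_domain *to_mon_domain(struct domain_hdr *h)
{
	if (!h || h->type != MON_DOMAIN)
		return NULL;
	return container_of(h, struct mon_domain, hdr);
}

/* Small self-check: two control domains on one list, one monitor domain on another. */
static void check_example(void)
{
	struct ctrl_domain c1 = { .hdr = { NULL, 1, CTRL_DOMAIN }, .ctrl_val = 0xf0 };
	struct ctrl_domain c0 = { .hdr = { &c1.hdr, 0, CTRL_DOMAIN }, .ctrl_val = 0x0f };
	struct mon_domain m2 = { .hdr = { NULL, 2, MON_DOMAIN }, .mbm_total = 42 };

	assert(to_ctrl_domain(find_domain(&c0.hdr, 1)) == &c1);
	assert(find_domain(&c0.hdr, 2) == NULL);
	assert(to_mon_domain(find_domain(&m2.hdr, 2)) == &m2);
	assert(to_ctrl_domain(find_domain(&m2.hdr, 2)) == NULL);
}
```

The sketch returns NULL on a type mismatch, where the patch warns once
and bails out. Either way the point is the same: a header found on the
wrong list is never converted to the wrong container.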

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h                   |  25 ++-
 arch/x86/kernel/cpu/resctrl/internal.h    |   7 +-
 arch/x86/kernel/cpu/resctrl/core.c        | 224 +++++++++++++++++-----
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  12 +-
 arch/x86/kernel/cpu/resctrl/monitor.c     |   4 +-
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |   4 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  60 +++---
 7 files changed, 240 insertions(+), 96 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index f63fcf17a3bc..96ddf9ff3183 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -58,15 +58,22 @@ struct resctrl_staged_config {
 	bool			have_new_ctrl;
 };
 
+enum resctrl_domain_type {
+	RESCTRL_CTRL_DOMAIN,
+	RESCTRL_MON_DOMAIN,
+};
+
 /**
  * struct rdt_domain_hdr - common header for different domain types
  * @list:		all instances of this resource
  * @id:			unique id for this instance
+ * @type:		type of this instance
  * @cpu_mask:		which CPUs share this resource
  */
 struct rdt_domain_hdr {
 	struct list_head		list;
 	int				id;
+	enum resctrl_domain_type	type;
 	struct cpumask			cpu_mask;
 };
 
@@ -169,10 +176,12 @@ enum resctrl_scope {
  * @alloc_capable:	Is allocation available on this machine
  * @mon_capable:	Is monitor feature available on this machine
  * @num_rmid:		Number of RMIDs available
- * @scope:		Scope of this resource
+ * @ctrl_scope:		Scope of this resource for control functions
+ * @mon_scope:		Scope of this resource for monitor functions
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
- * @domains:		RCU list of all domains for this resource
+ * @ctrl_domains:	RCU list of all control domains for this resource
+ * @mon_domains:	RCU list of all monitor domains for this resource
  * @name:		Name to use in "schemata" file.
  * @data_width:		Character width of data when displaying
  * @default_ctrl:	Specifies default cache cbm or memory B/W percent.
@@ -187,10 +196,12 @@ struct rdt_resource {
 	bool			alloc_capable;
 	bool			mon_capable;
 	int			num_rmid;
-	enum resctrl_scope	scope;
+	enum resctrl_scope	ctrl_scope;
+	enum resctrl_scope	mon_scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
-	struct list_head	domains;
+	struct list_head	ctrl_domains;
+	struct list_head	mon_domains;
 	char			*name;
 	int			data_width;
 	u32			default_ctrl;
@@ -236,8 +247,10 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 
 u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
 			    u32 closid, enum resctrl_conf_type type);
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
 void resctrl_online_cpu(unsigned int cpu);
 void resctrl_offline_cpu(unsigned int cpu);
 
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index f1d926832ec8..377679b79919 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -558,8 +558,8 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
 int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
 int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
 			     umode_t mask);
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
-				   struct list_head **pos);
+struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
+				       struct list_head **pos);
 ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
 int rdtgroup_schemata_show(struct kernfs_open_file *of,
@@ -578,7 +578,8 @@ int rdt_pseudo_lock_init(void);
 void rdt_pseudo_lock_release(void);
 int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
 void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_domain *get_mon_domain_from_cpu(int cpu, struct rdt_resource *r);
 int closids_supported(void);
 void closid_free(int closid);
 int alloc_rmid(u32 closid);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 96fff44f9d03..edd9b2bfb53d 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -60,7 +60,8 @@ static void mba_wrmsr_intel(struct msr_param *m);
 static void cat_wrmsr(struct msr_param *m);
 static void mba_wrmsr_amd(struct msr_param *m);
 
-#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
+#define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
+#define mon_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.mon_domains)
 
 struct rdt_hw_resource rdt_resources_all[] = {
 	[RDT_RESOURCE_L3] =
@@ -68,8 +69,10 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L3,
 			.name			= "L3",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_L3),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.mon_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_L3),
+			.mon_domains		= mon_domain_init(RDT_RESOURCE_L3),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
 			.fflags			= RFTYPE_RES_CACHE,
@@ -82,8 +85,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_L2,
 			.name			= "L2",
-			.scope			= RESCTRL_L2_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_L2),
+			.ctrl_scope		= RESCTRL_L2_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_L2),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
 			.fflags			= RFTYPE_RES_CACHE,
@@ -96,8 +99,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_MBA,
 			.name			= "MB",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_MBA),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_MBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
 			.fflags			= RFTYPE_RES_MB,
@@ -108,8 +111,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
 		.r_resctrl = {
 			.rid			= RDT_RESOURCE_SMBA,
 			.name			= "SMBA",
-			.scope			= RESCTRL_L3_CACHE,
-			.domains		= domain_init(RDT_RESOURCE_SMBA),
+			.ctrl_scope		= RESCTRL_L3_CACHE,
+			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_SMBA),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
 			.fflags			= RFTYPE_RES_MB,
@@ -349,13 +352,28 @@ static void cat_wrmsr(struct msr_param *m)
 		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
 }
 
-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
 	struct rdt_domain *d;
 
 	lockdep_assert_cpus_held();
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
+		/* Find the domain that contains this CPU */
+		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
+			return d;
+	}
+
+	return NULL;
+}
+
+struct rdt_domain *get_mon_domain_from_cpu(int cpu, struct rdt_resource *r)
+{
+	struct rdt_domain *d;
+
+	lockdep_assert_cpus_held();
+
+	list_for_each_entry(d, &r->mon_domains, hdr.list) {
 		/* Find the domain that contains this CPU */
 		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
 			return d;
@@ -379,26 +397,26 @@ void rdt_ctrl_update(void *arg)
 }
 
 /*
- * rdt_find_domain - Find a domain in a resource that matches input resource id
+ * rdt_find_domain - Search for a domain id in a resource domain list.
  *
- * Search resource r's domain list to find the resource id. If the resource
- * id is found in a domain, return the domain. Otherwise, if requested by
- * caller, return the first domain whose id is bigger than the input id.
- * The domain list is sorted by id in ascending order.
+ * Search the domain list to find the domain id. If the domain id is
+ * found, return the domain. NULL otherwise.  If the domain id is not
+ * found (and NULL returned) then the first domain with id bigger than
+ * the input id can be returned to the caller via @pos.
  */
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
-				   struct list_head **pos)
+struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
+				       struct list_head **pos)
 {
-	struct rdt_domain *d;
+	struct rdt_domain_hdr *d;
 	struct list_head *l;
 
-	list_for_each(l, &r->domains) {
-		d = list_entry(l, struct rdt_domain, hdr.list);
+	list_for_each(l, h) {
+		d = list_entry(l, struct rdt_domain_hdr, list);
 		/* When id is found, return its domain. */
-		if (id == d->hdr.id)
+		if (id == d->id)
 			return d;
 		/* Stop searching when finding id's position in sorted list. */
-		if (id < d->hdr.id)
+		if (id < d->id)
 			break;
 	}
 
@@ -494,38 +512,29 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 	return -EINVAL;
 }
 
-/*
- * domain_add_cpu - Add a cpu to a resource's domain list.
- *
- * If an existing domain in the resource r's domain list matches the cpu's
- * resource id, add the cpu in the domain.
- *
- * Otherwise, a new domain is allocated and inserted into the right position
- * in the domain list sorted by id in ascending order.
- *
- * The order in the domain list is visible to users when we print entries
- * in the schemata file and schemata input is validated to have the same order
- * as this list.
- */
-static void domain_add_cpu(int cpu, struct rdt_resource *r)
+static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
-	int id = get_domain_id_from_scope(cpu, r->scope);
+	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
 	struct list_head *add_pos = NULL;
 	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
 	struct rdt_domain *d;
 	int err;
 
 	lockdep_assert_held(&domain_list_lock);
 
 	if (id < 0) {
-		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
-			     cpu, r->scope, r->name);
+		pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->ctrl_scope, r->name);
 		return;
 	}
 
-	d = rdt_find_domain(r, id, &add_pos);
+	hdr = rdt_find_domain(&r->ctrl_domains, id, &add_pos);
+	if (hdr) {
+		if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+			return;
+		d = container_of(hdr, struct rdt_domain, hdr);
 
-	if (d) {
 		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
 			rdt_domain_reconfigure_cdp(r);
@@ -538,23 +547,70 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 	d = &hw_dom->d_resctrl;
 	d->hdr.id = id;
+	d->hdr.type = RESCTRL_CTRL_DOMAIN;
 	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 
 	rdt_domain_reconfigure_cdp(r);
 
-	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
+	if (domain_setup_ctrlval(r, d)) {
 		domain_free(hw_dom);
 		return;
 	}
 
-	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+	list_add_tail_rcu(&d->hdr.list, add_pos);
+
+	err = resctrl_online_ctrl_domain(r, d);
+	if (err) {
+		list_del_rcu(&d->hdr.list);
+		synchronize_rcu();
+		domain_free(hw_dom);
+	}
+}
+
+static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
+{
+	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct list_head *add_pos = NULL;
+	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
+	struct rdt_domain *d;
+	int err;
+
+	lockdep_assert_held(&domain_list_lock);
+
+	if (id < 0) {
+		pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->mon_scope, r->name);
+		return;
+	}
+
+	hdr = rdt_find_domain(&r->mon_domains, id, &add_pos);
+	if (hdr) {
+		if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+			return;
+		d = container_of(hdr, struct rdt_domain, hdr);
+
+		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+		return;
+	}
+
+	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
+	if (!hw_dom)
+		return;
+
+	d = &hw_dom->d_resctrl;
+	d->hdr.id = id;
+	d->hdr.type = RESCTRL_MON_DOMAIN;
+	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+
+	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
 		domain_free(hw_dom);
 		return;
 	}
 
 	list_add_tail_rcu(&d->hdr.list, add_pos);
 
-	err = resctrl_online_domain(r, d);
+	err = resctrl_online_mon_domain(r, d);
 	if (err) {
 		list_del_rcu(&d->hdr.list);
 		synchronize_rcu();
@@ -562,30 +618,45 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 	}
 }
 
-static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+static void domain_add_cpu(int cpu, struct rdt_resource *r)
+{
+	if (r->alloc_capable)
+		domain_add_cpu_ctrl(cpu, r);
+	if (r->mon_capable)
+		domain_add_cpu_mon(cpu, r);
+}
+
+static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
-	int id = get_domain_id_from_scope(cpu, r->scope);
+	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
 	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
 	struct rdt_domain *d;
 
 	lockdep_assert_held(&domain_list_lock);
 
 	if (id < 0) {
-		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
-			     cpu, r->scope, r->name);
+		pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->ctrl_scope, r->name);
 		return;
 	}
 
-	d = rdt_find_domain(r, id, NULL);
-	if (!d) {
-		pr_warn("Couldn't find domain with id=%d for CPU %d\n", id, cpu);
+	hdr = rdt_find_domain(&r->ctrl_domains, id, NULL);
+	if (!hdr) {
+		pr_warn("Can't find control domain for id=%d for CPU %d for resource %s\n",
+			id, cpu, r->name);
 		return;
 	}
+
+	if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+		return;
+
+	d = container_of(hdr, struct rdt_domain, hdr);
 	hw_dom = resctrl_to_arch_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
 	if (cpumask_empty(&d->hdr.cpu_mask)) {
-		resctrl_offline_domain(r, d);
+		resctrl_offline_ctrl_domain(r, d);
 		list_del_rcu(&d->hdr.list);
 		synchronize_rcu();
 
@@ -601,6 +672,53 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	}
 }
 
+static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
+{
+	int id = get_domain_id_from_scope(cpu, r->mon_scope);
+	struct rdt_hw_domain *hw_dom;
+	struct rdt_domain_hdr *hdr;
+	struct rdt_domain *d;
+
+	lockdep_assert_held(&domain_list_lock);
+
+	if (id < 0) {
+		pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->mon_scope, r->name);
+		return;
+	}
+
+	hdr = rdt_find_domain(&r->mon_domains, id, NULL);
+	if (!hdr) {
+		pr_warn("Can't find monitor domain for id=%d for CPU %d for resource %s\n",
+			id, cpu, r->name);
+		return;
+	}
+
+	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+		return;
+
+	d = container_of(hdr, struct rdt_domain, hdr);
+	hw_dom = resctrl_to_arch_dom(d);
+
+	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
+	if (cpumask_empty(&d->hdr.cpu_mask)) {
+		resctrl_offline_mon_domain(r, d);
+		list_del_rcu(&d->hdr.list);
+		synchronize_rcu();
+		domain_free(hw_dom);
+
+		return;
+	}
+}
+
+static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+{
+	if (r->alloc_capable)
+		domain_remove_cpu_ctrl(cpu, r);
+	if (r->mon_capable)
+		domain_remove_cpu_mon(cpu, r);
+}
+
 static void clear_closid_rmid(int cpu)
 {
 	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 6246f48b0449..8cc36723f077 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -231,7 +231,7 @@ static int parse_line(char *line, struct resctrl_schema *s,
 		return -EINVAL;
 	}
 	dom = strim(dom);
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (d->hdr.id == dom_id) {
 			data.buf = dom;
 			data.rdtgrp = rdtgrp;
@@ -306,7 +306,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 	/* Walking r->domains, ensure it can't race with cpuhp */
 	lockdep_assert_cpus_held();
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 		msr_param.res = NULL;
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
@@ -450,7 +450,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 	lockdep_assert_cpus_held();
 
 	seq_printf(s, "%*s:", max_name_width, schema->name);
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -556,6 +556,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 {
 	struct kernfs_open_file *of = m->private;
+	struct rdt_domain_hdr *hdr;
 	u32 resid, evtid, domid;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
@@ -576,11 +577,12 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 	evtid = md.u.evtid;
 
 	r = &rdt_resources_all[resid].r_resctrl;
-	d = rdt_find_domain(r, domid, NULL);
-	if (!d) {
+	hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
+	if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
 		ret = -ENOENT;
 		goto out;
 	}
+	d = container_of(hdr, struct rdt_domain, hdr);
 
 	mon_event_read(&rr, r, d, rdtgrp, evtid, false);
 
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ab8a198d88b3..82a44de8136f 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -490,7 +490,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 	idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);
 
 	entry->busy = 0;
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->mon_domains, hdr.list) {
 		/*
 		 * For the first limbo RMID in the domain,
 		 * setup up the limbo worker.
@@ -687,7 +687,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
 	idx = resctrl_arch_rmid_idx_encode(closid, rmid);
 	pmbm_data = &dom_mbm->mbm_local[idx];
 
-	dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
+	dom_mba = get_ctrl_domain_from_cpu(smp_processor_id(), r_mba);
 	if (!dom_mba) {
 		pr_warn_once("Failure to get domain for MBA update\n");
 		return;
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index df45c839a58f..bdcf95f561d4 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,7 +292,7 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
  */
 static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 {
-	enum resctrl_scope scope = plr->s->res->scope;
+	enum resctrl_scope scope = plr->s->res->ctrl_scope;
 	struct cacheinfo *ci;
 	int ret;
 
@@ -854,7 +854,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	 * associated with them.
 	 */
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(d_i, &r->domains, hdr.list) {
+		list_for_each_entry(d_i, &r->ctrl_domains, hdr.list) {
 			if (d_i->plr)
 				cpumask_or(cpu_with_psl, cpu_with_psl,
 					   &d_i->hdr.cpu_mask);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index b6ba77cdf0e8..17d4610eecf5 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -98,7 +98,7 @@ void rdt_staged_configs_clear(void)
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(dom, &r->domains, hdr.list)
+		list_for_each_entry(dom, &r->ctrl_domains, hdr.list)
 			memset(dom->staged_config, 0, sizeof(dom->staged_config));
 	}
 }
@@ -1021,7 +1021,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 	cpus_read_lock();
 	mutex_lock(&rdtgroup_mutex);
 	hw_shareable = r->cache.shareable_bits;
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
 		if (sep)
 			seq_putc(seq, ';');
 		sw_shareable = 0;
@@ -1343,7 +1343,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 		if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
 			continue;
 		has_cache = true;
-		list_for_each_entry(d, &r->domains, hdr.list) {
+		list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 			ctrl = resctrl_arch_get_config(r, d, closid,
 						       s->conf_type);
 			if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1454,11 +1454,11 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 	struct cacheinfo *ci;
 	int num_b;
 
-	if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+	if (WARN_ON_ONCE(r->ctrl_scope != RESCTRL_L2_CACHE && r->ctrl_scope != RESCTRL_L3_CACHE))
 		return size;
 
 	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
-	ci = get_cpu_cacheinfo_level(cpumask_any(&d->hdr.cpu_mask), r->scope);
+	ci = get_cpu_cacheinfo_level(cpumask_any(&d->hdr.cpu_mask), r->ctrl_scope);
 	if (ci)
 		size = ci->size / r->cache.cbm_len * num_b;
 
@@ -1514,7 +1514,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 		type = schema->conf_type;
 		sep = false;
 		seq_printf(s, "%*s:", max_name_width, schema->name);
-		list_for_each_entry(d, &r->domains, hdr.list) {
+		list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 			if (sep)
 				seq_putc(s, ';');
 			if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1604,7 +1604,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 	cpus_read_lock();
 	mutex_lock(&rdtgroup_mutex);
 
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -1728,7 +1728,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 		return -EINVAL;
 	}
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->mon_domains, hdr.list) {
 		if (d->hdr.id == dom_id) {
 			mbm_config_write_domain(r, d, evtid, val);
 			goto next;
@@ -2276,7 +2276,7 @@ static int set_cache_qos_cfg(int level, bool enable)
 		return -ENOMEM;
 
 	r_l = &rdt_resources_all[level].r_resctrl;
-	list_for_each_entry(d, &r_l->domains, hdr.list) {
+	list_for_each_entry(d, &r_l->ctrl_domains, hdr.list) {
 		if (r_l->cache.arch_has_per_cpu_cfg)
 			/* Pick all the CPUs in the domain instance */
 			for_each_cpu(cpu, &d->hdr.cpu_mask)
@@ -2361,7 +2361,7 @@ static int set_mba_sc(bool mba_sc)
 
 	r->membw.mba_sc = mba_sc;
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		for (i = 0; i < num_closid; i++)
 			d->mbps_val[i] = MBA_MAX_MBPS;
 	}
@@ -2700,7 +2700,7 @@ static int rdt_get_tree(struct fs_context *fc)
 
 	if (is_mbm_enabled()) {
 		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-		list_for_each_entry(dom, &r->domains, hdr.list)
+		list_for_each_entry(dom, &r->mon_domains, hdr.list)
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL,
 						   RESCTRL_PICK_ANY_CPU);
 	}
@@ -2824,10 +2824,10 @@ static int reset_all_ctrls(struct rdt_resource *r)
 
 	/*
 	 * Disable resource control for this resource by setting all
-	 * CBMs in all domains to the maximum mask value. Pick one CPU
+	 * CBMs in all ctrl_domains to the maximum mask value. Pick one CPU
 	 * from each domain to update the MSRs below.
 	 */
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		hw_dom = resctrl_to_arch_dom(d);
 
 		for (i = 0; i < hw_res->num_closid; i++)
@@ -3098,7 +3098,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 	/* Walking r->domains, ensure it can't race with cpuhp */
 	lockdep_assert_cpus_held();
 
-	list_for_each_entry(dom, &r->domains, hdr.list) {
+	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
 		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
 		if (ret)
 			return ret;
@@ -3280,7 +3280,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 	struct rdt_domain *d;
 	int ret;
 
-	list_for_each_entry(d, &s->res->domains, hdr.list) {
+	list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
 		ret = __init_one_rdt_domain(d, s, closid);
 		if (ret < 0)
 			return ret;
@@ -3295,7 +3295,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 	struct resctrl_staged_config *cfg;
 	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->domains, hdr.list) {
+	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (is_mba_sc(r)) {
 			d->mbps_val[closid] = MBA_MAX_MBPS;
 			continue;
@@ -3926,15 +3926,19 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
 	kfree(d->mbm_local);
 }
 
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 {
 	mutex_lock(&rdtgroup_mutex);
 
 	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
 		mba_sc_domain_destroy(r, d);
 
-	if (!r->mon_capable)
-		goto out_unlock;
+	mutex_unlock(&rdtgroup_mutex);
+}
+
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+	mutex_lock(&rdtgroup_mutex);
 
 	/*
 	 * If resctrl is mounted, remove all the
@@ -3960,7 +3964,6 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
 
 	domain_destroy_mon_state(d);
 
-out_unlock:
 	mutex_unlock(&rdtgroup_mutex);
 }
 
@@ -3995,7 +3998,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 {
 	int err = 0;
 
@@ -4004,11 +4007,18 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
 	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA) {
 		/* RDT_RESOURCE_MBA is never mon_capable */
 		err = mba_sc_domain_allocate(r, d);
-		goto out_unlock;
 	}
 
-	if (!r->mon_capable)
-		goto out_unlock;
+	mutex_unlock(&rdtgroup_mutex);
+
+	return err;
+}
+
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+	int err;
+
+	mutex_lock(&rdtgroup_mutex);
 
 	err = domain_setup_mon_state(r, d);
 	if (err)
@@ -4073,7 +4083,7 @@ void resctrl_offline_cpu(unsigned int cpu)
 	if (!l3->mon_capable)
 		goto out_unlock;
 
-	d = get_domain_from_cpu(cpu, l3);
+	d = get_mon_domain_from_cpu(cpu, l3);
 	if (d) {
 		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
 			cancel_delayed_work(&d->mbm_over);
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH v20 03/18] x86/resctrl: Prepare for different scope for control/monitor operations
  2024-06-10 18:35 ` [PATCH v20 03/18] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
@ 2024-06-20 21:13   ` Reinette Chatre
  0 siblings, 0 replies; 61+ messages in thread
From: Reinette Chatre @ 2024-06-20 21:13 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches

Hi Tony,

On 6/10/24 11:35 AM, Tony Luck wrote:
> Resctrl assumes that control and monitor operations on a resource are
> performed at the same scope.
> 
> Prepare for systems that use different scope (specifically Intel needs
> to split the RDT_RESOURCE_L3 resource to use L3 scope for cache control
> and NODE scope for cache occupancy and memory bandwidth monitoring).
> 
> Create separate domain lists for control and monitor operations.
> 
> Note that errors during initialization of either control or monitor
> functions on a domain would previously result in that domain being
> excluded from both control and monitor operations. Now the domains are
> allocated independently it is no longer required to disable both control
> and monitor operations if either fail.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---

Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Reinette

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v20 04/18] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  2024-06-10 18:35 [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (2 preceding siblings ...)
  2024-06-10 18:35 ` [PATCH v20 03/18] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
@ 2024-06-10 18:35 ` Tony Luck
  2024-06-20 21:14   ` Reinette Chatre
  2024-06-10 18:35 ` [PATCH v20 05/18] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
                   ` (14 subsequent siblings)
  18 siblings, 1 reply; 61+ messages in thread
From: Tony Luck @ 2024-06-10 18:35 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches, Tony Luck

The same rdt_domain structure is used for both control and monitor
functions. This wastes memory, because some of the fields are used only
by control functions while most are used only by monitor functions.

Split into separate rdt_ctrl_domain and rdt_mon_domain structures with
just the fields required for control and monitoring respectively.

Similarly, split the rdt_hw_domain structure into rdt_hw_ctrl_domain
and rdt_hw_mon_domain.
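
As a rough userspace illustration of the split described above (not the
kernel code itself), the sketch below shows two domain types sharing a
common header, with a type check before converting a header pointer back
to its containing structure. All names here (domain_hdr, ctrl_domain,
mon_domain, legacy_domain, hdr_to_ctrl, hdr_to_mon) are hypothetical
stand-ins, the field sets are abbreviated, and container_of() is a
simplified local macro rather than the kernel's definition.

```c
#include <stddef.h>

/* Simplified local stand-in for the kernel's container_of() (assumption). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical tag recording which kind of domain embeds the header. */
enum domain_type { CTRL_DOMAIN, MON_DOMAIN };

/* Hypothetical common header shared by both domain types. */
struct domain_hdr {
	int			id;
	enum domain_type	type;
};

/* Before the split: one structure carries both sets of fields. */
struct legacy_domain {
	struct domain_hdr	hdr;
	unsigned long		*rmid_busy_llc;	/* monitor only */
	int			mbm_work_cpu;	/* monitor only */
	int			cqm_work_cpu;	/* monitor only */
	unsigned int		*mbps_val;	/* control only */
};

/* After the split: each structure keeps only the fields it needs. */
struct ctrl_domain {
	struct domain_hdr	hdr;
	unsigned int		*mbps_val;
};

struct mon_domain {
	struct domain_hdr	hdr;
	unsigned long		*rmid_busy_llc;
	int			mbm_work_cpu;
	int			cqm_work_cpu;
};

/* Convert a header to its control domain, or NULL on a type mismatch. */
static struct ctrl_domain *hdr_to_ctrl(struct domain_hdr *hdr)
{
	if (hdr->type != CTRL_DOMAIN)
		return NULL;
	return container_of(hdr, struct ctrl_domain, hdr);
}

/* Convert a header to its monitor domain, or NULL on a type mismatch. */
static struct mon_domain *hdr_to_mon(struct domain_hdr *hdr)
{
	if (hdr->type != MON_DOMAIN)
		return NULL;
	return container_of(hdr, struct mon_domain, hdr);
}
```

With distinct types, passing a monitor domain to a control helper becomes
a compile-time type error instead of a silent misuse of unused fields,
and each allocation is smaller than the combined structure would be.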

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h                   | 48 ++++++++-------
 arch/x86/kernel/cpu/resctrl/internal.h    | 62 ++++++++++++--------
 arch/x86/kernel/cpu/resctrl/core.c        | 71 ++++++++++++-----------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 28 ++++-----
 arch/x86/kernel/cpu/resctrl/monitor.c     | 40 ++++++-------
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  6 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 64 ++++++++++----------
 7 files changed, 174 insertions(+), 145 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 96ddf9ff3183..aa2c22a8e37b 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -78,7 +78,23 @@ struct rdt_domain_hdr {
 };
 
 /**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_ctrl_domain - group of CPUs sharing a resctrl control resource
+ * @hdr:		common header for different domain types
+ * @plr:		pseudo-locked region (if any) associated with domain
+ * @staged_config:	parsed configuration to be applied
+ * @mbps_val:		When mba_sc is enabled, this holds the array of user
+ *			specified control values for mba_sc in MBps, indexed
+ *			by closid
+ */
+struct rdt_ctrl_domain {
+	struct rdt_domain_hdr		hdr;
+	struct pseudo_lock_region	*plr;
+	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
+	u32				*mbps_val;
+};
+
+/**
+ * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
  * @hdr:		common header for different domain types
  * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
  * @mbm_total:		saved state for MBM total bandwidth
@@ -87,13 +103,8 @@ struct rdt_domain_hdr {
  * @cqm_limbo:		worker to periodically read CQM h/w counters
  * @mbm_work_cpu:	worker CPU for MBM h/w counters
  * @cqm_work_cpu:	worker CPU for CQM h/w counters
- * @plr:		pseudo-locked region (if any) associated with domain
- * @staged_config:	parsed configuration to be applied
- * @mbps_val:		When mba_sc is enabled, this holds the array of user
- *			specified control values for mba_sc in MBps, indexed
- *			by closid
  */
-struct rdt_domain {
+struct rdt_mon_domain {
 	struct rdt_domain_hdr		hdr;
 	unsigned long			*rmid_busy_llc;
 	struct mbm_state		*mbm_total;
@@ -102,9 +113,6 @@ struct rdt_domain {
 	struct delayed_work		cqm_limbo;
 	int				mbm_work_cpu;
 	int				cqm_work_cpu;
-	struct pseudo_lock_region	*plr;
-	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
-	u32				*mbps_val;
 };
 
 /**
@@ -208,7 +216,7 @@ struct rdt_resource {
 	const char		*format_str;
 	int			(*parse_ctrlval)(struct rdt_parse_data *data,
 						 struct resctrl_schema *s,
-						 struct rdt_domain *d);
+						 struct rdt_ctrl_domain *d);
 	struct list_head	evt_list;
 	unsigned long		fflags;
 	bool			cdp_capable;
@@ -242,15 +250,15 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
  * Update the ctrl_val and apply this config right now.
  * Must be called on one of the domain's CPUs.
  */
-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type t, u32 cfg_val);
 
-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type type);
-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
 void resctrl_online_cpu(unsigned int cpu);
 void resctrl_offline_cpu(unsigned int cpu);
 
@@ -279,7 +287,7 @@ void resctrl_offline_cpu(unsigned int cpu);
  * Return:
  * 0 on success, or -EIO, -EINVAL etc on error.
  */
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 			   u32 closid, u32 rmid, enum resctrl_event_id eventid,
 			   u64 *val, void *arch_mon_ctx);
 
@@ -312,7 +320,7 @@ static inline void resctrl_arch_rmid_read_context_check(void)
  *
  * This can be called from any CPU.
  */
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     u32 closid, u32 rmid,
 			     enum resctrl_event_id eventid);
 
@@ -325,7 +333,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
  *
  * This can be called from any CPU.
  */
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d);
 
 extern unsigned int resctrl_rmid_realloc_threshold;
 extern unsigned int resctrl_rmid_realloc_limit;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 377679b79919..135190e0711c 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -147,7 +147,7 @@ union mon_data_bits {
 struct rmid_read {
 	struct rdtgroup		*rgrp;
 	struct rdt_resource	*r;
-	struct rdt_domain	*d;
+	struct rdt_mon_domain	*d;
 	enum resctrl_event_id	evtid;
 	bool			first;
 	int			err;
@@ -232,7 +232,7 @@ struct mongroup {
  */
 struct pseudo_lock_region {
 	struct resctrl_schema	*s;
-	struct rdt_domain	*d;
+	struct rdt_ctrl_domain	*d;
 	u32			cbm;
 	wait_queue_head_t	lock_thread_wq;
 	int			thread_done;
@@ -355,25 +355,41 @@ struct arch_mbm_state {
 };
 
 /**
- * struct rdt_hw_domain - Arch private attributes of a set of CPUs that share
- *			  a resource
+ * struct rdt_hw_ctrl_domain - Arch private attributes of a set of CPUs that share
+ *			       a resource for a control function
  * @d_resctrl:	Properties exposed to the resctrl file system
  * @ctrl_val:	array of cache or mem ctrl values (indexed by CLOSID)
+ *
+ * Members of this structure are accessed via helpers that provide abstraction.
+ */
+struct rdt_hw_ctrl_domain {
+	struct rdt_ctrl_domain		d_resctrl;
+	u32				*ctrl_val;
+};
+
+/**
+ * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
+ *			      a resource for a monitor function
+ * @d_resctrl:	Properties exposed to the resctrl file system
  * @arch_mbm_total:	arch private state for MBM total bandwidth
  * @arch_mbm_local:	arch private state for MBM local bandwidth
  *
  * Members of this structure are accessed via helpers that provide abstraction.
  */
-struct rdt_hw_domain {
-	struct rdt_domain		d_resctrl;
-	u32				*ctrl_val;
+struct rdt_hw_mon_domain {
+	struct rdt_mon_domain		d_resctrl;
 	struct arch_mbm_state		*arch_mbm_total;
 	struct arch_mbm_state		*arch_mbm_local;
 };
 
-static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
+static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctrl_domain *r)
+{
+	return container_of(r, struct rdt_hw_ctrl_domain, d_resctrl);
+}
+
+static inline struct rdt_hw_mon_domain *resctrl_to_arch_mon_dom(struct rdt_mon_domain *r)
 {
-	return container_of(r, struct rdt_hw_domain, d_resctrl);
+	return container_of(r, struct rdt_hw_mon_domain, d_resctrl);
 }
 
 /**
@@ -385,7 +401,7 @@ static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
  */
 struct msr_param {
 	struct rdt_resource	*res;
-	struct rdt_domain	*dom;
+	struct rdt_ctrl_domain	*dom;
 	u32			low;
 	u32			high;
 };
@@ -458,9 +474,9 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r
 }
 
 int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
-	      struct rdt_domain *d);
+	      struct rdt_ctrl_domain *d);
 int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
-	     struct rdt_domain *d);
+	     struct rdt_ctrl_domain *d);
 
 extern struct mutex rdtgroup_mutex;
 
@@ -564,22 +580,22 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
 int rdtgroup_schemata_show(struct kernfs_open_file *of,
 			   struct seq_file *s, void *v);
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
 			   unsigned long cbm, int closid, bool exclusive);
-unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_domain *d,
+unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 				  unsigned long cbm);
 enum rdtgrp_mode rdtgroup_mode_by_closid(int closid);
 int rdtgroup_tasks_assigned(struct rdtgroup *r);
 int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp);
 int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp);
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm);
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d);
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm);
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d);
 int rdt_pseudo_lock_init(void);
 void rdt_pseudo_lock_release(void);
 int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
 void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
-struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
-struct rdt_domain *get_mon_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_mon_domain *get_mon_domain_from_cpu(int cpu, struct rdt_resource *r);
 int closids_supported(void);
 void closid_free(int closid);
 int alloc_rmid(u32 closid);
@@ -590,19 +606,19 @@ bool __init rdt_cpu_has(int flag);
 void mon_event_count(void *info);
 int rdtgroup_mondata_show(struct seq_file *m, void *arg);
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
-		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
+		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
 		    int evtid, int first);
-void mbm_setup_overflow_handler(struct rdt_domain *dom,
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom,
 				unsigned long delay_ms,
 				int exclude_cpu);
 void mbm_handle_overflow(struct work_struct *work);
 void __init intel_rdt_mbm_apply_quirk(void);
 bool is_mba_sc(struct rdt_resource *r);
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms,
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms,
 			     int exclude_cpu);
 void cqm_handle_limbo(struct work_struct *work);
-bool has_busy_rmid(struct rdt_domain *d);
-void __check_limbo(struct rdt_domain *d, bool force_free);
+bool has_busy_rmid(struct rdt_mon_domain *d);
+void __check_limbo(struct rdt_mon_domain *d, bool force_free);
 void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
 void __init thread_throttle_mode_init(void);
 void __init mbm_config_rftype_init(const char *config);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index edd9b2bfb53d..b4f2be776408 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -309,8 +309,8 @@ static void rdt_get_cdp_l2_config(void)
 
 static void mba_wrmsr_amd(struct msr_param *m)
 {
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(m->dom);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(m->dom);
 	unsigned int i;
 
 	for (i = m->low; i < m->high; i++)
@@ -333,8 +333,8 @@ static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
 
 static void mba_wrmsr_intel(struct msr_param *m)
 {
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(m->dom);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(m->dom);
 	unsigned int i;
 
 	/*  Write the delay values for mba. */
@@ -344,17 +344,17 @@ static void mba_wrmsr_intel(struct msr_param *m)
 
 static void cat_wrmsr(struct msr_param *m)
 {
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(m->dom);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(m->dom);
 	unsigned int i;
 
 	for (i = m->low; i < m->high; i++)
 		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
 }
 
-struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	lockdep_assert_cpus_held();
 
@@ -367,9 +367,9 @@ struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
 	return NULL;
 }
 
-struct rdt_domain *get_mon_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_mon_domain *get_mon_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 
 	lockdep_assert_cpus_held();
 
@@ -440,18 +440,23 @@ static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
 		*dc = r->default_ctrl;
 }
 
-static void domain_free(struct rdt_hw_domain *hw_dom)
+static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
+{
+	kfree(hw_dom->ctrl_val);
+	kfree(hw_dom);
+}
+
+static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
 {
 	kfree(hw_dom->arch_mbm_total);
 	kfree(hw_dom->arch_mbm_local);
-	kfree(hw_dom->ctrl_val);
 	kfree(hw_dom);
 }
 
-static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	struct msr_param m;
 	u32 *dc;
 
@@ -476,7 +481,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
  * @num_rmid:	The size of the MBM counter array
  * @hw_dom:	The domain that owns the allocated arrays
  */
-static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
+static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
 {
 	size_t tsize;
 
@@ -515,10 +520,10 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct list_head *add_pos = NULL;
-	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int err;
 
 	lockdep_assert_held(&domain_list_lock);
@@ -533,7 +538,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 	if (hdr) {
 		if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
 			return;
-		d = container_of(hdr, struct rdt_domain, hdr);
+		d = container_of(hdr, struct rdt_ctrl_domain, hdr);
 
 		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
@@ -553,7 +558,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 	rdt_domain_reconfigure_cdp(r);
 
 	if (domain_setup_ctrlval(r, d)) {
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 		return;
 	}
 
@@ -563,7 +568,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 	if (err) {
 		list_del_rcu(&d->hdr.list);
 		synchronize_rcu();
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 	}
 }
 
@@ -571,9 +576,9 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->mon_scope);
 	struct list_head *add_pos = NULL;
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_mon_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	int err;
 
 	lockdep_assert_held(&domain_list_lock);
@@ -588,7 +593,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	if (hdr) {
 		if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
 			return;
-		d = container_of(hdr, struct rdt_domain, hdr);
+		d = container_of(hdr, struct rdt_mon_domain, hdr);
 
 		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 		return;
@@ -604,7 +609,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 
 	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
-		domain_free(hw_dom);
+		mon_domain_free(hw_dom);
 		return;
 	}
 
@@ -614,7 +619,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	if (err) {
 		list_del_rcu(&d->hdr.list);
 		synchronize_rcu();
-		domain_free(hw_dom);
+		mon_domain_free(hw_dom);
 	}
 }
 
@@ -629,9 +634,9 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	lockdep_assert_held(&domain_list_lock);
 
@@ -651,8 +656,8 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 	if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
 		return;
 
-	d = container_of(hdr, struct rdt_domain, hdr);
-	hw_dom = resctrl_to_arch_dom(d);
+	d = container_of(hdr, struct rdt_ctrl_domain, hdr);
+	hw_dom = resctrl_to_arch_ctrl_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
 	if (cpumask_empty(&d->hdr.cpu_mask)) {
@@ -661,12 +666,12 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 		synchronize_rcu();
 
 		/*
-		 * rdt_domain "d" is going to be freed below, so clear
+		 * rdt_ctrl_domain "d" is going to be freed below, so clear
 		 * its pointer from pseudo_lock_region struct.
 		 */
 		if (d->plr)
 			d->plr->d = NULL;
-		domain_free(hw_dom);
+		ctrl_domain_free(hw_dom);
 
 		return;
 	}
@@ -675,9 +680,9 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->mon_scope);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_mon_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 
 	lockdep_assert_held(&domain_list_lock);
 
@@ -697,15 +702,15 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
 	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
 		return;
 
-	d = container_of(hdr, struct rdt_domain, hdr);
-	hw_dom = resctrl_to_arch_dom(d);
+	d = container_of(hdr, struct rdt_mon_domain, hdr);
+	hw_dom = resctrl_to_arch_mon_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
 	if (cpumask_empty(&d->hdr.cpu_mask)) {
 		resctrl_offline_mon_domain(r, d);
 		list_del_rcu(&d->hdr.list);
 		synchronize_rcu();
-		domain_free(hw_dom);
+		mon_domain_free(hw_dom);
 
 		return;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 8cc36723f077..3b9383612c35 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -60,7 +60,7 @@ static bool bw_validate(char *buf, unsigned long *data, struct rdt_resource *r)
 }
 
 int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
-	     struct rdt_domain *d)
+	     struct rdt_ctrl_domain *d)
 {
 	struct resctrl_staged_config *cfg;
 	u32 closid = data->rdtgrp->closid;
@@ -139,7 +139,7 @@ static bool cbm_validate(char *buf, u32 *data, struct rdt_resource *r)
  * resource type.
  */
 int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
-	      struct rdt_domain *d)
+	      struct rdt_ctrl_domain *d)
 {
 	struct rdtgroup *rdtgrp = data->rdtgrp;
 	struct resctrl_staged_config *cfg;
@@ -208,8 +208,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
 	struct resctrl_staged_config *cfg;
 	struct rdt_resource *r = s->res;
 	struct rdt_parse_data data;
+	struct rdt_ctrl_domain *d;
 	char *dom = NULL, *id;
-	struct rdt_domain *d;
 	unsigned long dom_id;
 
 	/* Walking r->domains, ensure it can't race with cpuhp */
@@ -272,11 +272,11 @@ static u32 get_config_index(u32 closid, enum resctrl_conf_type type)
 	}
 }
 
-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type t, u32 cfg_val)
 {
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	u32 idx = get_config_index(closid, t);
 	struct msr_param msr_param;
 
@@ -297,17 +297,17 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 {
 	struct resctrl_staged_config *cfg;
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct msr_param msr_param;
+	struct rdt_ctrl_domain *d;
 	enum resctrl_conf_type t;
-	struct rdt_domain *d;
 	u32 idx;
 
 	/* Walking r->domains, ensure it can't race with cpuhp */
 	lockdep_assert_cpus_held();
 
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
-		hw_dom = resctrl_to_arch_dom(d);
+		hw_dom = resctrl_to_arch_ctrl_dom(d);
 		msr_param.res = NULL;
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
 			cfg = &hw_dom->d_resctrl.staged_config[t];
@@ -430,10 +430,10 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 	return ret ?: nbytes;
 }
 
-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type type)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	u32 idx = get_config_index(closid, type);
 
 	return hw_dom->ctrl_val[idx];
@@ -442,7 +442,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
 static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int closid)
 {
 	struct rdt_resource *r = schema->res;
-	struct rdt_domain *dom;
+	struct rdt_ctrl_domain *dom;
 	bool sep = false;
 	u32 ctrl_val;
 
@@ -514,7 +514,7 @@ static int smp_mon_event_count(void *arg)
 }
 
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
-		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
+		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
 		    int evtid, int first)
 {
 	int cpu;
@@ -557,11 +557,11 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 {
 	struct kernfs_open_file *of = m->private;
 	struct rdt_domain_hdr *hdr;
+	struct rdt_mon_domain *d;
 	u32 resid, evtid, domid;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
 	union mon_data_bits md;
-	struct rdt_domain *d;
 	struct rmid_read rr;
 	int ret = 0;
 
@@ -582,7 +582,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 		ret = -ENOENT;
 		goto out;
 	}
-	d = container_of(hdr, struct rdt_domain, hdr);
+	d = container_of(hdr, struct rdt_mon_domain, hdr);
 
 	mon_event_read(&rr, r, d, rdtgrp, evtid, false);
 
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 82a44de8136f..89d7e6fcbaa1 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -209,7 +209,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	return 0;
 }
 
-static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
+static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_dom,
 						 u32 rmid,
 						 enum resctrl_event_id eventid)
 {
@@ -228,11 +228,11 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
 	return NULL;
 }
 
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     u32 unused, u32 rmid,
 			     enum resctrl_event_id eventid)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 	struct arch_mbm_state *am;
 
 	am = get_arch_mbm_state(hw_dom, rmid, eventid);
@@ -248,9 +248,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
  * Assumes that hardware counters are also reset and thus that there is
  * no need to record initial non-zero counts.
  */
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 
 	if (is_mbm_total_enabled())
 		memset(hw_dom->arch_mbm_total, 0,
@@ -269,12 +269,12 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
 	return chunks >> shift;
 }
 
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 			   u32 unused, u32 rmid, enum resctrl_event_id eventid,
 			   u64 *val, void *ignored)
 {
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	struct arch_mbm_state *am;
 	u64 msr_val, chunks;
 	int ret;
@@ -320,7 +320,7 @@ static void limbo_release_entry(struct rmid_entry *entry)
  * decrement the count. If the busy count gets to zero on an RMID, we
  * free the RMID
  */
-void __check_limbo(struct rdt_domain *d, bool force_free)
+void __check_limbo(struct rdt_mon_domain *d, bool force_free)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
 	u32 idx_limit = resctrl_arch_system_num_rmid_idx();
@@ -378,7 +378,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
 	resctrl_arch_mon_ctx_free(r, QOS_L3_OCCUP_EVENT_ID, arch_mon_ctx);
 }
 
-bool has_busy_rmid(struct rdt_domain *d)
+bool has_busy_rmid(struct rdt_mon_domain *d)
 {
 	u32 idx_limit = resctrl_arch_system_num_rmid_idx();
 
@@ -479,7 +479,7 @@ int alloc_rmid(u32 closid)
 static void add_rmid_to_limbo(struct rmid_entry *entry)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 	u32 idx;
 
 	lockdep_assert_held(&rdtgroup_mutex);
@@ -531,7 +531,7 @@ void free_rmid(u32 closid, u32 rmid)
 		list_add_tail(&entry->list, &rmid_free_lru);
 }
 
-static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 closid,
+static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 closid,
 				       u32 rmid, enum resctrl_event_id evtid)
 {
 	u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
@@ -667,12 +667,12 @@ void mon_event_count(void *info)
  * throttle MSRs already have low percentage values.  To avoid
  * unnecessarily restricting such rdtgroups, we also increase the bandwidth.
  */
-static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
+static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
 {
 	u32 closid, rmid, cur_msr_val, new_msr_val;
 	struct mbm_state *pmbm_data, *cmbm_data;
+	struct rdt_ctrl_domain *dom_mba;
 	struct rdt_resource *r_mba;
-	struct rdt_domain *dom_mba;
 	u32 cur_bw, user_bw, idx;
 	struct list_head *head;
 	struct rdtgroup *entry;
@@ -733,7 +733,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
 	resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val);
 }
 
-static void mbm_update(struct rdt_resource *r, struct rdt_domain *d,
+static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d,
 		       u32 closid, u32 rmid)
 {
 	struct rmid_read rr;
@@ -791,12 +791,12 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d,
 void cqm_handle_limbo(struct work_struct *work)
 {
 	unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 
 	cpus_read_lock();
 	mutex_lock(&rdtgroup_mutex);
 
-	d = container_of(work, struct rdt_domain, cqm_limbo.work);
+	d = container_of(work, struct rdt_mon_domain, cqm_limbo.work);
 
 	__check_limbo(d, false);
 
@@ -819,7 +819,7 @@ void cqm_handle_limbo(struct work_struct *work)
  * @exclude_cpu:   Which CPU the handler should not run on,
  *		   RESCTRL_PICK_ANY_CPU to pick any CPU.
  */
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms,
+void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms,
 			     int exclude_cpu)
 {
 	unsigned long delay = msecs_to_jiffies(delay_ms);
@@ -836,9 +836,9 @@ void mbm_handle_overflow(struct work_struct *work)
 {
 	unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL);
 	struct rdtgroup *prgrp, *crgrp;
+	struct rdt_mon_domain *d;
 	struct list_head *head;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
 
 	cpus_read_lock();
 	mutex_lock(&rdtgroup_mutex);
@@ -851,7 +851,7 @@ void mbm_handle_overflow(struct work_struct *work)
 		goto out_unlock;
 
 	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-	d = container_of(work, struct rdt_domain, mbm_over.work);
+	d = container_of(work, struct rdt_mon_domain, mbm_over.work);
 
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
 		mbm_update(r, d, prgrp->closid, prgrp->mon.rmid);
@@ -885,7 +885,7 @@ void mbm_handle_overflow(struct work_struct *work)
  * @exclude_cpu:   Which CPU the handler should not run on,
  *		   RESCTRL_PICK_ANY_CPU to pick any CPU.
  */
-void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms,
+void mbm_setup_overflow_handler(struct rdt_mon_domain *dom, unsigned long delay_ms,
 				int exclude_cpu)
 {
 	unsigned long delay = msecs_to_jiffies(delay_ms);
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index bdcf95f561d4..70f0069b87d8 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -809,7 +809,7 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp)
  * Return: true if @cbm overlaps with pseudo-locked region on @d, false
  * otherwise.
  */
-bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm)
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm)
 {
 	unsigned int cbm_len;
 	unsigned long cbm_b;
@@ -836,11 +836,11 @@ bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm
  *         if it is not possible to test due to memory allocation issue,
  *         false otherwise.
  */
-bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d)
 {
+	struct rdt_ctrl_domain *d_i;
 	cpumask_var_t cpu_with_psl;
 	struct rdt_resource *r;
-	struct rdt_domain *d_i;
 	bool ret = false;
 
 	/* Walking r->domains, ensure it can't race with cpuhp */
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 17d4610eecf5..eb3bbfa96d5a 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -92,8 +92,8 @@ void rdt_last_cmd_printf(const char *fmt, ...)
 
 void rdt_staged_configs_clear(void)
 {
+	struct rdt_ctrl_domain *dom;
 	struct rdt_resource *r;
-	struct rdt_domain *dom;
 
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -1012,7 +1012,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
 	unsigned long sw_shareable = 0, hw_shareable = 0;
 	unsigned long exclusive = 0, pseudo_locked = 0;
 	struct rdt_resource *r = s->res;
-	struct rdt_domain *dom;
+	struct rdt_ctrl_domain *dom;
 	int i, hwb, swb, excl, psl;
 	enum rdtgrp_mode mode;
 	bool sep = false;
@@ -1243,7 +1243,7 @@ static int rdt_has_sparse_bitmasks_show(struct kernfs_open_file *of,
  *
  * Return: false if CBM does not overlap, true if it does.
  */
-static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d,
+static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 				    unsigned long cbm, int closid,
 				    enum resctrl_conf_type type, bool exclusive)
 {
@@ -1298,7 +1298,7 @@ static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d
  *
  * Return: true if CBM overlap detected, false if there is no overlap
  */
-bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
 			   unsigned long cbm, int closid, bool exclusive)
 {
 	enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -1329,10 +1329,10 @@ bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
 static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 {
 	int closid = rdtgrp->closid;
+	struct rdt_ctrl_domain *d;
 	struct resctrl_schema *s;
 	struct rdt_resource *r;
 	bool has_cache = false;
-	struct rdt_domain *d;
 	u32 ctrl;
 
 	/* Walking r->domains, ensure it can't race with cpuhp */
@@ -1448,7 +1448,7 @@ static ssize_t rdtgroup_mode_write(struct kernfs_open_file *of,
  * bitmap functions work correctly.
  */
 unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
-				  struct rdt_domain *d, unsigned long cbm)
+				  struct rdt_ctrl_domain *d, unsigned long cbm)
 {
 	unsigned int size = 0;
 	struct cacheinfo *ci;
@@ -1476,9 +1476,9 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
 {
 	struct resctrl_schema *schema;
 	enum resctrl_conf_type type;
+	struct rdt_ctrl_domain *d;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
-	struct rdt_domain *d;
 	unsigned int size;
 	int ret = 0;
 	u32 closid;
@@ -1590,7 +1590,7 @@ static void mon_event_config_read(void *info)
 	mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
 }
 
-static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
+static void mondata_config_read(struct rdt_mon_domain *d, struct mon_config_info *mon_info)
 {
 	smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1);
 }
@@ -1598,7 +1598,7 @@ static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mo
 static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
 {
 	struct mon_config_info mon_info = {0};
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	bool sep = false;
 
 	cpus_read_lock();
@@ -1657,7 +1657,7 @@ static void mon_event_config_write(void *info)
 }
 
 static void mbm_config_write_domain(struct rdt_resource *r,
-				    struct rdt_domain *d, u32 evtid, u32 val)
+				    struct rdt_mon_domain *d, u32 evtid, u32 val)
 {
 	struct mon_config_info mon_info = {0};
 
@@ -1698,7 +1698,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
 	char *dom_str = NULL, *id_str;
 	unsigned long dom_id, val;
-	struct rdt_domain *d;
+	struct rdt_mon_domain *d;
 
 	/* Walking r->domains, ensure it can't race with cpuhp */
 	lockdep_assert_cpus_held();
@@ -2257,9 +2257,9 @@ static inline bool is_mba_linear(void)
 static int set_cache_qos_cfg(int level, bool enable)
 {
 	void (*update)(void *arg);
+	struct rdt_ctrl_domain *d;
 	struct rdt_resource *r_l;
 	cpumask_var_t cpu_mask;
-	struct rdt_domain *d;
 	int cpu;
 
 	/* Walking r->domains, ensure it can't race with cpuhp */
@@ -2309,7 +2309,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
 		l3_qos_cfg_update(&hw_res->cdp_enabled);
 }
 
-static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
+static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	u32 num_closid = resctrl_arch_get_num_closid(r);
 	int cpu = cpumask_any(&d->hdr.cpu_mask);
@@ -2327,7 +2327,7 @@ static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d)
 }
 
 static void mba_sc_domain_destroy(struct rdt_resource *r,
-				  struct rdt_domain *d)
+				  struct rdt_ctrl_domain *d)
 {
 	kfree(d->mbps_val);
 	d->mbps_val = NULL;
@@ -2353,7 +2353,7 @@ static int set_mba_sc(bool mba_sc)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
 	u32 num_closid = resctrl_arch_get_num_closid(r);
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int i;
 
 	if (!supports_mba_mbps() || mba_sc == is_mba_sc(r))
@@ -2625,7 +2625,7 @@ static int rdt_get_tree(struct fs_context *fc)
 {
 	struct rdt_fs_context *ctx = rdt_fc2context(fc);
 	unsigned long flags = RFTYPE_CTRL_BASE;
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	struct rdt_resource *r;
 	int ret;
 
@@ -2810,9 +2810,9 @@ static int rdt_init_fs_context(struct fs_context *fc)
 static int reset_all_ctrls(struct rdt_resource *r)
 {
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	struct rdt_hw_domain *hw_dom;
+	struct rdt_hw_ctrl_domain *hw_dom;
 	struct msr_param msr_param;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int i;
 
 	/* Walking r->domains, ensure it can't race with cpuhp */
@@ -2828,7 +2828,7 @@ static int reset_all_ctrls(struct rdt_resource *r)
 	 * from each domain to update the MSRs below.
 	 */
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
-		hw_dom = resctrl_to_arch_dom(d);
+		hw_dom = resctrl_to_arch_ctrl_dom(d);
 
 		for (i = 0; i < hw_res->num_closid; i++)
 			hw_dom->ctrl_val[i] = r->default_ctrl;
@@ -3021,7 +3021,7 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
 }
 
 static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
-				struct rdt_domain *d,
+				struct rdt_mon_domain *d,
 				struct rdt_resource *r, struct rdtgroup *prgrp)
 {
 	union mon_data_bits priv;
@@ -3070,7 +3070,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
  * and "monitor" groups with given domain id.
  */
 static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
-					   struct rdt_domain *d)
+					   struct rdt_mon_domain *d)
 {
 	struct kernfs_node *parent_kn;
 	struct rdtgroup *prgrp, *crgrp;
@@ -3092,7 +3092,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 				       struct rdt_resource *r,
 				       struct rdtgroup *prgrp)
 {
-	struct rdt_domain *dom;
+	struct rdt_mon_domain *dom;
 	int ret;
 
 	/* Walking r->domains, ensure it can't race with cpuhp */
@@ -3197,7 +3197,7 @@ static u32 cbm_ensure_valid(u32 _val, struct rdt_resource *r)
  * Set the RDT domain up to start off with all usable allocations. That is,
  * all shareable and unused bits. All-zero CBM is invalid.
  */
-static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
+static int __init_one_rdt_domain(struct rdt_ctrl_domain *d, struct resctrl_schema *s,
 				 u32 closid)
 {
 	enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
@@ -3277,7 +3277,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s,
  */
 static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 {
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 	int ret;
 
 	list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
@@ -3293,7 +3293,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
 static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
 {
 	struct resctrl_staged_config *cfg;
-	struct rdt_domain *d;
+	struct rdt_ctrl_domain *d;
 
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
 		if (is_mba_sc(r)) {
@@ -3919,14 +3919,14 @@ static void __init rdtgroup_setup_default(void)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-static void domain_destroy_mon_state(struct rdt_domain *d)
+static void domain_destroy_mon_state(struct rdt_mon_domain *d)
 {
 	bitmap_free(d->rmid_busy_llc);
 	kfree(d->mbm_total);
 	kfree(d->mbm_local);
 }
 
-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	mutex_lock(&rdtgroup_mutex);
 
@@ -3936,7 +3936,7 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	mutex_lock(&rdtgroup_mutex);
 
@@ -3967,7 +3967,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	u32 idx_limit = resctrl_arch_system_num_rmid_idx();
 	size_t tsize;
@@ -3998,7 +3998,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
 {
 	int err = 0;
 
@@ -4014,7 +4014,7 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
 	return err;
 }
 
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
 {
 	int err;
 
@@ -4069,8 +4069,8 @@ static void clear_childcpus(struct rdtgroup *r, unsigned int cpu)
 void resctrl_offline_cpu(unsigned int cpu)
 {
 	struct rdt_resource *l3 = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	struct rdt_mon_domain *d;
 	struct rdtgroup *rdtgrp;
-	struct rdt_domain *d;
 
 	mutex_lock(&rdtgroup_mutex);
 	list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH v20 04/18] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  2024-06-10 18:35 ` [PATCH v20 04/18] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
@ 2024-06-20 21:14   ` Reinette Chatre
  0 siblings, 0 replies; 61+ messages in thread
From: Reinette Chatre @ 2024-06-20 21:14 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches

Hi Tony,

On 6/10/24 11:35 AM, Tony Luck wrote:
> The same rdt_domain structure is used for both control and monitor
> functions. But this results in wasted memory as some of the fields are
> only used by control functions, while most are only used for monitor
> functions.

To me the "wasted memory" motivation is secondary to the potential
confusion caused by the validity of struct members depending on which
list the struct belongs to. This may be obvious when all operations are
on the list, but once a list entry is passed through a few layers it
becomes harder to decipher.

> 
> Split into separate rdt_ctrl_domain and rdt_mon_domain structures with
> just the fields required for control and monitoring respectively.
> 
> Similar split of the rdt_hw_domain structure into rdt_hw_ctrl_domain
> and rdt_hw_mon_domain.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---

Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Reinette

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v20 05/18] x86/resctrl: Add node-scope to the options for feature scope
  2024-06-10 18:35 [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (3 preceding siblings ...)
  2024-06-10 18:35 ` [PATCH v20 04/18] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
@ 2024-06-10 18:35 ` Tony Luck
  2024-06-20 21:15   ` Reinette Chatre
  2024-06-10 18:35 ` [PATCH v20 06/18] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 61+ messages in thread
From: Tony Luck @ 2024-06-10 18:35 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches, Tony Luck

All currently supported resctrl features have domains scoped to either
the L2 or the L3 cache.

Add RESCTRL_L3_NODE as a new option for features that are scoped at the
same granularity as NUMA nodes. This is needed for Intel's Sub-NUMA
Cluster (SNC) feature where monitoring features are divided between
nodes that share an L3 cache.
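
As a rough standalone illustration of the difference between cache scope
and node scope (not the kernel implementation), the sketch below uses a
made-up four-CPU topology with one L3 cache split into two SNC nodes.
All sketch_* names and the topology table are assumptions for this
example only.

```c
#include <assert.h>

/* Mirrors the enum layout after this patch; the node scope implicitly gets 4. */
enum sketch_scope {
	SKETCH_L2_CACHE = 2,
	SKETCH_L3_CACHE = 3,
	SKETCH_L3_NODE,
};

/* Hypothetical per-CPU record standing in for cacheinfo and NUMA lookups. */
struct sketch_cpu_topo {
	int l2_id;
	int l3_id;
	int node_id;
};

/* Made-up topology: one L3 cache (id 0) shared by two SNC nodes. */
static const struct sketch_cpu_topo sketch_cpus[4] = {
	{ .l2_id = 0, .l3_id = 0, .node_id = 0 },
	{ .l2_id = 1, .l3_id = 0, .node_id = 0 },
	{ .l2_id = 2, .l3_id = 0, .node_id = 1 },
	{ .l2_id = 3, .l3_id = 0, .node_id = 1 },
};

/* Return the domain id for @cpu at the requested scope, -1 if unknown. */
static int sketch_domain_id(int cpu, enum sketch_scope scope)
{
	assert(cpu >= 0 && cpu < 4);

	switch (scope) {
	case SKETCH_L2_CACHE:
		return sketch_cpus[cpu].l2_id;
	case SKETCH_L3_CACHE:
		return sketch_cpus[cpu].l3_id;
	case SKETCH_L3_NODE:
		return sketch_cpus[cpu].node_id;
	}
	return -1;
}
```

In this toy topology CPUs 0 and 2 land in the same domain at L3 cache
scope but in different domains at node scope, which is the split that
SNC monitoring needs.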

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h            | 1 +
 arch/x86/kernel/cpu/resctrl/core.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index aa2c22a8e37b..64b6ad1b22a1 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -176,6 +176,7 @@ struct resctrl_schema;
 enum resctrl_scope {
 	RESCTRL_L2_CACHE = 2,
 	RESCTRL_L3_CACHE = 3,
+	RESCTRL_L3_NODE,
 };
 
 /**
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index b4f2be776408..b86c525d0620 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -510,6 +510,8 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 	case RESCTRL_L2_CACHE:
 	case RESCTRL_L3_CACHE:
 		return get_cpu_cacheinfo_id(cpu, scope);
+	case RESCTRL_L3_NODE:
+		return cpu_to_node(cpu);
 	default:
 		break;
 	}
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH v20 05/18] x86/resctrl: Add node-scope to the options for feature scope
  2024-06-10 18:35 ` [PATCH v20 05/18] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
@ 2024-06-20 21:15   ` Reinette Chatre
  0 siblings, 0 replies; 61+ messages in thread
From: Reinette Chatre @ 2024-06-20 21:15 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches

Hi Tony,

On 6/10/24 11:35 AM, Tony Luck wrote:
> All currently supported resctrl features have domains scoped to either
> the L2 or the L3 cache.
> 
> Add RESCTRL_L3_NODE as a new option for features that are scoped at the
> same granularity as NUMA nodes. This is needed for Intel's Sub-NUMA
> Cluster (SNC) feature where monitoring features are divided between
> nodes that share an L3 cache.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---

Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Reinette

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v20 06/18] x86/resctrl: Introduce snc_nodes_per_l3_cache
  2024-06-10 18:35 [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (4 preceding siblings ...)
  2024-06-10 18:35 ` [PATCH v20 05/18] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
@ 2024-06-10 18:35 ` Tony Luck
  2024-06-17 22:36   ` Moger, Babu
  2024-06-20 21:19   ` Reinette Chatre
  2024-06-10 18:35 ` [PATCH v20 07/18] x86/resctrl: Block use of mba_MBps mount option on Sub-NUMA Cluster (SNC) systems Tony Luck
                   ` (12 subsequent siblings)
  18 siblings, 2 replies; 61+ messages in thread
From: Tony Luck @ 2024-06-10 18:35 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches, Tony Luck

Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
and memory controllers on a socket into two or more groups. These are
presented to the operating system as NUMA nodes.

This may enable some workloads to have slightly lower latency to memory
as the memory controller(s) in an SNC node are electrically closer to the
CPU cores on that SNC node. This benefit may be offset by lower bandwidth
since the memory accesses for each core can only be interleaved between
the memory controllers on the same SNC node.

Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
to track L3 cache occupancy and memory bandwidth. There is an MSR that
controls how the RMIDs are shared between SNC nodes.

The default mode divides them numerically. E.g. when there are two SNC
nodes on a socket, the lower-numbered half of the RMIDs is given to the
first node and the remainder to the second node. This would be difficult
to use with the Linux resctrl interface as specific RMID values assigned
to resctrl groups are not visible to users.

RMID sharing mode divides the physical RMIDs evenly between SNC nodes
but uses a logical RMID in the IA32_PQR_ASSOC MSR. For example a system
with 200 physical RMIDs (as enumerated by CPUID leaf 0xF) that has two
SNC nodes per L3 cache instance would have 100 logical RMIDs available
for Linux to use. A task running on SNC node 0 with RMID 5 would
accumulate LLC occupancy and MBM bandwidth data in physical RMID 5.
Another task using RMID 5, but running on SNC node 1 would accumulate
data in physical RMID 105.
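
The arithmetic in this example can be sketched in a few lines of
standalone C. This is an illustration only, not the kernel code: the
helper name sketch_logical_to_physical() and its parameters are
assumptions made for this example, and the SNC node index is taken
relative to the L3 cache instance.

```c
#include <assert.h>

/*
 * Hypothetical helper, for illustration only.
 * num_phys_rmids: physical RMIDs enumerated for the L3 cache
 * snc_nodes:      number of SNC nodes sharing that L3 cache
 * node_in_l3:     0-based index of the SNC node within the L3 cache
 * lrmid:          logical RMID, as loaded into IA32_PQR_ASSOC
 */
static unsigned int sketch_logical_to_physical(unsigned int num_phys_rmids,
					       unsigned int snc_nodes,
					       unsigned int node_in_l3,
					       unsigned int lrmid)
{
	unsigned int num_logical;

	assert(snc_nodes > 0 && node_in_l3 < snc_nodes);

	/* Physical RMIDs are divided evenly between the SNC nodes. */
	num_logical = num_phys_rmids / snc_nodes;
	assert(lrmid < num_logical);

	/* Each node owns a contiguous block of num_logical physical RMIDs. */
	return lrmid + node_in_l3 * num_logical;
}
```

With the numbers used above, sketch_logical_to_physical(200, 2, 0, 5)
gives 5 and sketch_logical_to_physical(200, 2, 1, 5) gives 105.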

Even with this renumbering, SNC mode requires several changes in resctrl
behavior for correct operation.

Add a static global to arch/x86/kernel/cpu/resctrl/monitor.c to indicate
how many SNC nodes share an L3 cache instance.  Initialize this to
"1". Runtime detection of SNC mode will adjust this value.

Update all places to take appropriate action when SNC mode is enabled:
1) The number of logical RMIDs per L3 cache available for use is the
   number of physical RMIDs divided by the number of SNC nodes.
2) Likewise the "mon_scale" value must be divided by the number of SNC
   nodes.
3) Add a function to convert from logical RMID values (assigned to
   tasks and loaded into the IA32_PQR_ASSOC MSR on context switch)
   to physical RMID values to load into IA32_QM_EVTSEL MSR when
   reading counters on each SNC node.
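
Items 1 and 2 amount to two integer divisions performed once at setup.
A minimal standalone sketch follows. It uses a hypothetical struct and
function name rather than the kernel's own types, and takes the highest
RMID and the occupancy scale as plain parameters.

```c
#include <assert.h>

/* Hypothetical container for the two values adjusted in SNC mode. */
struct sketch_mon_cfg {
	unsigned int num_rmid;	/* logical RMIDs usable per L3 cache */
	unsigned int mon_scale;	/* scale factor applied to raw counts */
};

static struct sketch_mon_cfg sketch_mon_cfg_init(unsigned int max_rmid,
						 unsigned int occ_scale,
						 unsigned int snc_nodes)
{
	struct sketch_mon_cfg cfg;

	assert(snc_nodes > 0);

	/* max_rmid is the highest RMID, so the count is max_rmid + 1. */
	cfg.num_rmid = (max_rmid + 1) / snc_nodes;
	cfg.mon_scale = occ_scale / snc_nodes;

	return cfg;
}
```

With snc_nodes set to 1 both values pass through unchanged, which
matches the non-SNC default of snc_nodes_per_l3_cache.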

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/resctrl/monitor.c | 56 ++++++++++++++++++++++++---
 1 file changed, 50 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 89d7e6fcbaa1..f2fd35d294f2 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -97,6 +97,8 @@ unsigned int resctrl_rmid_realloc_limit;
 
 #define CF(cf)	((unsigned long)(1048576 * (cf) + 0.5))
 
+static int snc_nodes_per_l3_cache = 1;
+
 /*
  * The correction factor table is documented in Documentation/arch/x86/resctrl.rst.
  * If rmid > rmid threshold, MBM total and local values should be multiplied
@@ -185,7 +187,43 @@ static inline struct rmid_entry *__rmid_entry(u32 idx)
 	return entry;
 }
 
-static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
+/*
+ * When Sub-NUMA Cluster (SNC) mode is not enabled (as indicated by
+ * "snc_nodes_per_l3_cache == 1") no translation of the RMID value is
+ * needed. The physical RMID is the same as the logical RMID.
+ *
+ * On a platform with SNC mode enabled, Linux enables RMID sharing mode
+ * via MSR 0xCA0 (see the "RMID Sharing Mode" section in the "Intel
+ * Resource Director Technology Architecture Specification" for a full
+ * description of RMID sharing mode).
+ *
+ * In RMID sharing mode there are fewer "logical RMID" values available
+ * to accumulate data ("physical RMIDs" are divided evenly between SNC
+ * nodes that share an L3 cache). Linux creates an rdt_mon_domain for
+ * each SNC node.
+ *
+ * The value loaded into IA32_PQR_ASSOC is the "logical RMID".
+ *
+ * Data is collected independently on each SNC node and can be retrieved
+ * using the "physical RMID" value computed by this function and loaded
+ * into IA32_QM_EVTSEL. @cpu can be any CPU in the SNC node.
+ *
+ * The scope of the IA32_QM_EVTSEL and IA32_QM_CTR MSRs is at the L3
+ * cache.  So a "physical RMID" may be read from any CPU that shares
+ * the L3 cache with the desired SNC node, not just from a CPU in
+ * the specific SNC node.
+ */
+static int logical_rmid_to_physical_rmid(int cpu, int lrmid)
+{
+	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+
+	if (snc_nodes_per_l3_cache == 1)
+		return lrmid;
+
+	return lrmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
+}
+
+static int __rmid_read_phys(u32 prmid, enum resctrl_event_id eventid, u64 *val)
 {
 	u64 msr_val;
 
@@ -197,7 +235,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
 	 * are error bits.
 	 */
-	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+	wrmsr(MSR_IA32_QM_EVTSEL, eventid, prmid);
 	rdmsrl(MSR_IA32_QM_CTR, msr_val);
 
 	if (msr_val & RMID_VAL_ERROR)
@@ -233,14 +271,17 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     enum resctrl_event_id eventid)
 {
 	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
+	int cpu = cpumask_any(&d->hdr.cpu_mask);
 	struct arch_mbm_state *am;
+	u32 prmid;
 
 	am = get_arch_mbm_state(hw_dom, rmid, eventid);
 	if (am) {
 		memset(am, 0, sizeof(*am));
 
+		prmid = logical_rmid_to_physical_rmid(cpu, rmid);
 		/* Record any initial, non-zero count value. */
-		__rmid_read(rmid, eventid, &am->prev_msr);
+		__rmid_read_phys(prmid, eventid, &am->prev_msr);
 	}
 }
 
@@ -275,8 +316,10 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 {
 	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	int cpu = cpumask_any(&d->hdr.cpu_mask);
 	struct arch_mbm_state *am;
 	u64 msr_val, chunks;
+	u32 prmid;
 	int ret;
 
 	resctrl_arch_rmid_read_context_check();
@@ -284,7 +327,8 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 	if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
 		return -EINVAL;
 
-	ret = __rmid_read(rmid, eventid, &msr_val);
+	prmid = logical_rmid_to_physical_rmid(cpu, rmid);
+	ret = __rmid_read_phys(prmid, eventid, &msr_val);
 	if (ret)
 		return ret;
 
@@ -1022,8 +1066,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 	int ret;
 
 	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
-	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
-	r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
+	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
 	hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
 
 	if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
-- 
2.45.0


* Re: [PATCH v20 06/18] x86/resctrl: Introduce snc_nodes_per_l3_cache
  2024-06-10 18:35 ` [PATCH v20 06/18] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
@ 2024-06-17 22:36   ` Moger, Babu
  2024-06-18 22:58     ` Reinette Chatre
  2024-06-20 21:19   ` Reinette Chatre
  1 sibling, 1 reply; 61+ messages in thread
From: Moger, Babu @ 2024-06-17 22:36 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman,
	Peter Newman, James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches

Hi Tony,

On 6/10/2024 1:35 PM, Tony Luck wrote:
> Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
> and memory controllers on a socket into two or more groups. These are
> presented to the operating system as NUMA nodes.
> 
> This may enable some workloads to have slightly lower latency to memory
> as the memory controller(s) in an SNC node are electrically closer to the
> CPU cores on that SNC node. This cost may be offset by lower bandwidth
> since the memory accesses for each core can only be interleaved between
> the memory controllers on the same SNC node.
> 
> Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
> to track L3 cache occupancy and memory bandwidth. There is an MSR that
> controls how the RMIDs are shared between SNC nodes.
> 
> The default mode divides them numerically. E.g. when there are two SNC
> nodes on a socket the lower number half of the RMIDs are given to the
> first node, the remainder to the second node. This would be difficult
> to use with the Linux resctrl interface as specific RMID values assigned
> to resctrl groups are not visible to users.
> 
> RMID sharing mode divides the physical RMIDs evenly between SNC nodes
> but uses a logical RMID in the IA32_PQR_ASSOC MSR. For example a system
> with 200 physical RMIDs (as enumerated by CPUID leaf 0xF) that has two
> SNC nodes per L3 cache instance would have 100 logical RMIDs available
> for Linux to use. A task running on SNC node 0 with RMID 5 would
> accumulate LLC occupancy and MBM bandwidth data in physical RMID 5.
> Another task using RMID 5, but running on SNC node 1 would accumulate
> data in physical RMID 105.

The distinction between physical and logical RMIDs is confusing. For me,
they are all physical RMIDs. You are basically using a different
physical range based on the SNC node. It would be helpful if you
simplified the description.
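
For reference, the numbering in the quoted example can be checked with a
small stand-alone sketch. This is not the kernel code: snc_phys_rmid() and
its parameters are hypothetical names, and it assumes (as the patch does)
that the SNC node index within an L3 cache is the NUMA node number modulo
the number of SNC nodes per L3.

```c
/*
 * Hypothetical stand-alone model of RMID sharing mode. The names are
 * illustrative only and are not the kernel's.
 *
 * node:         NUMA node number of the CPU being monitored
 * nodes_per_l3: SNC nodes sharing one L3 cache (1 when SNC is off)
 * num_rmid:     RMIDs usable by Linux on each SNC node
 * rmid:         RMID as assigned to the task (loaded into IA32_PQR_ASSOC)
 *
 * Returns the RMID value that would select that node's counters.
 */
static unsigned int snc_phys_rmid(unsigned int node, unsigned int nodes_per_l3,
				  unsigned int num_rmid, unsigned int rmid)
{
	/* SNC disabled: both numbering schemes are identical. */
	if (nodes_per_l3 == 1)
		return rmid;

	/* Each SNC node owns a contiguous block of num_rmid counters. */
	return rmid + (node % nodes_per_l3) * num_rmid;
}
```

With two SNC nodes and 100 RMIDs per node, RMID 5 maps to 5 on node 0 and
to 105 on node 1, which matches the numbers in the commit message.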

> 
> Even with this renumbering SNC mode requires several changes in resctrl
> behavior for correct operation.
> 
> Add a static global to arch/x86/kernel/cpu/resctrl/monitor.c to indicate

The file name is not helpful. Just mention the name of the global.

> how many SNC domains share an L3 cache instance.  Initialize this to
> "1". Runtime detection of SNC mode will adjust this value.
> 
> Update all places to take appropriate action when SNC mode is enabled:
> 1) The number of logical RMIDs per L3 cache available for use is the
>     number of physical RMIDs divided by the number of SNC nodes.
> 2) Likewise the "mon_scale" value must be divided by the number of SNC
>     nodes.
> 3) Add a function to convert from logical RMID values (assigned to
>     tasks and loaded into the IA32_PQR_ASSOC MSR on context switch)
>     to physical RMID values to load into IA32_QM_EVTSEL MSR when
>     reading counters on each SNC node.

Please simplify the description. Calling them logical and physical is confusing.

> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>   arch/x86/kernel/cpu/resctrl/monitor.c | 56 ++++++++++++++++++++++++---
>   1 file changed, 50 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 89d7e6fcbaa1..f2fd35d294f2 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -97,6 +97,8 @@ unsigned int resctrl_rmid_realloc_limit;
>   
>   #define CF(cf)	((unsigned long)(1048576 * (cf) + 0.5))
>   
> +static int snc_nodes_per_l3_cache = 1;
> +
>   /*
>    * The correction factor table is documented in Documentation/arch/x86/resctrl.rst.
>    * If rmid > rmid threshold, MBM total and local values should be multiplied
> @@ -185,7 +187,43 @@ static inline struct rmid_entry *__rmid_entry(u32 idx)
>   	return entry;
>   }
>   
> -static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> +/*
> + * When Sub-NUMA Cluster (SNC) mode is not enabled (as indicated by
> + * "snc_nodes_per_l3_cache  == 1") no translation of the RMID value is
> + * needed. The physical RMID is the same as the logical RMID.
> + *
> + * On a platform with SNC mode enabled, Linux enables RMID sharing mode
> + * via MSR 0xCA0 (see the "RMID Sharing Mode" section in the "Intel
> + * Resource Director Technology Architecture Specification" for a full
> + * description of RMID sharing mode).
> + *
> + * In RMID sharing mode there are fewer "logical RMID" values available
> + * to accumulate data ("physical RMIDs" are divided evenly between SNC
> + * nodes that share an L3 cache). Linux creates an rdt_mon_domain for
> + * each SNC node.
> + *
> + * The value loaded into IA32_PQR_ASSOC is the "logical RMID".
> + *
> + * Data is collected independently on each SNC node and can be retrieved
> + * using the "physical RMID" value computed by this function and loaded
> + * into IA32_QM_EVTSEL. @cpu can be any CPU in the SNC node.
> + *
> + * The scope of the IA32_QM_EVTSEL and IA32_QM_CTR MSRs is at the L3
> + * cache.  So a "physical RMID" may be read from any CPU that shares
> + * the L3 cache with the desired SNC node, not just from a CPU in
> + * the specific SNC node.
> + */
> +static int logical_rmid_to_physical_rmid(int cpu, int lrmid)

How about this? (or something similar)

static int get_snc_node_rmid(int cpu, int rmid)

> +{
> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> +
> +	if (snc_nodes_per_l3_cache  == 1)
> +		return lrmid;
> +
> +	return lrmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
> +}
> +
> +static int __rmid_read_phys(u32 prmid, enum resctrl_event_id eventid, u64 *val)

You don't need to write a new function. Just update the rmid.


>   {
>   	u64 msr_val;
>   
> @@ -197,7 +235,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
>   	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
>   	 * are error bits.
>   	 */
> -	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
> +	wrmsr(MSR_IA32_QM_EVTSEL, eventid, prmid);
>   	rdmsrl(MSR_IA32_QM_CTR, msr_val);
>   
>   	if (msr_val & RMID_VAL_ERROR)
> @@ -233,14 +271,17 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
>   			     enum resctrl_event_id eventid)
>   {
>   	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
> +	int cpu = cpumask_any(&d->hdr.cpu_mask);
>   	struct arch_mbm_state *am;
> +	u32 prmid;

snc_rmid?

>   
>   	am = get_arch_mbm_state(hw_dom, rmid, eventid);
>   	if (am) {
>   		memset(am, 0, sizeof(*am));
>   
> +		prmid = logical_rmid_to_physical_rmid(cpu, rmid);
>   		/* Record any initial, non-zero count value. */
> -		__rmid_read(rmid, eventid, &am->prev_msr);
> +		__rmid_read_phys(prmid, eventid, &am->prev_msr);

How about this? Feel free to simplify.

           if (snc_nodes_per_l3_cache > 1) {
                   snc_rmid = get_snc_node_rmid(cpu, rmid);
                   __rmid_read(snc_rmid, eventid, &am->prev_msr);
           } else {
                   __rmid_read(rmid, eventid, &am->prev_msr);
           }

>   	}
>   }
>   
> @@ -275,8 +316,10 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
>   {
>   	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
>   	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> +	int cpu = cpumask_any(&d->hdr.cpu_mask);
>   	struct arch_mbm_state *am;
>   	u64 msr_val, chunks;
> +	u32 prmid;
>   	int ret;
>   
>   	resctrl_arch_rmid_read_context_check();
> @@ -284,7 +327,8 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
>   	if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
>   		return -EINVAL;
>   
> -	ret = __rmid_read(rmid, eventid, &msr_val);
> +	prmid = logical_rmid_to_physical_rmid(cpu, rmid);
> +	ret = __rmid_read_phys(prmid, eventid, &msr_val);

Same as above.

>   	if (ret)
>   		return ret;
>   
> @@ -1022,8 +1066,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
>   	int ret;
>   
>   	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
> -	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
> -	r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
> +	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
> +	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
>   	hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
>   
>   	if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)

-- 
- Babu Moger

* Re: [PATCH v20 06/18] x86/resctrl: Introduce snc_nodes_per_l3_cache
  2024-06-17 22:36   ` Moger, Babu
@ 2024-06-18 22:58     ` Reinette Chatre
  2024-06-19 14:43       ` Moger, Babu
  0 siblings, 1 reply; 61+ messages in thread
From: Reinette Chatre @ 2024-06-18 22:58 UTC (permalink / raw)
  To: babu.moger, Tony Luck, Fenghua Yu, Maciej Wieczor-Retman,
	Peter Newman, James Morse, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches

Hi Babu and Tony,

On 6/17/24 3:36 PM, Moger, Babu wrote:
> On 6/10/2024 1:35 PM, Tony Luck wrote:

>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index 89d7e6fcbaa1..f2fd35d294f2 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -97,6 +97,8 @@ unsigned int resctrl_rmid_realloc_limit;
>>   #define CF(cf)    ((unsigned long)(1048576 * (cf) + 0.5))
>> +static int snc_nodes_per_l3_cache = 1;
>> +
>>   /*
>>    * The correction factor table is documented in Documentation/arch/x86/resctrl.rst.
>>    * If rmid > rmid threshold, MBM total and local values should be multiplied
>> @@ -185,7 +187,43 @@ static inline struct rmid_entry *__rmid_entry(u32 idx)
>>       return entry;
>>   }
>> -static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
>> +/*
>> + * When Sub-NUMA Cluster (SNC) mode is not enabled (as indicated by
>> + * "snc_nodes_per_l3_cache  == 1") no translation of the RMID value is
>> + * needed. The physical RMID is the same as the logical RMID.
>> + *
>> + * On a platform with SNC mode enabled, Linux enables RMID sharing mode
>> + * via MSR 0xCA0 (see the "RMID Sharing Mode" section in the "Intel
>> + * Resource Director Technology Architecture Specification" for a full
>> + * description of RMID sharing mode).
>> + *
>> + * In RMID sharing mode there are fewer "logical RMID" values available
>> + * to accumulate data ("physical RMIDs" are divided evenly between SNC
>> + * nodes that share an L3 cache). Linux creates an rdt_mon_domain for
>> + * each SNC node.
>> + *
>> + * The value loaded into IA32_PQR_ASSOC is the "logical RMID".
>> + *
>> + * Data is collected independently on each SNC node and can be retrieved
>> + * using the "physical RMID" value computed by this function and loaded
>> + * into IA32_QM_EVTSEL. @cpu can be any CPU in the SNC node.
>> + *
>> + * The scope of the IA32_QM_EVTSEL and IA32_QM_CTR MSRs is at the L3
>> + * cache.  So a "physical RMID" may be read from any CPU that shares
>> + * the L3 cache with the desired SNC node, not just from a CPU in
>> + * the specific SNC node.
>> + */
>> +static int logical_rmid_to_physical_rmid(int cpu, int lrmid)
> 
> How about ? (or something similar)
> 
> static int get_snc_node_rmid(int cpu, int rmid)
> 
>> +{
>> +    struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>> +
>> +    if (snc_nodes_per_l3_cache  == 1)

(nit: unnecessary space)

>> +        return lrmid;
>> +
>> +    return lrmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
>> +}
>> +
>> +static int __rmid_read_phys(u32 prmid, enum resctrl_event_id eventid, u64 *val)
> 
> You don't need to write new function.  Just update the rmid.
> 
> 
>>   {
>>       u64 msr_val;
>> @@ -197,7 +235,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
>>        * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
>>        * are error bits.
>>        */
>> -    wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
>> +    wrmsr(MSR_IA32_QM_EVTSEL, eventid, prmid);
>>       rdmsrl(MSR_IA32_QM_CTR, msr_val);
>>       if (msr_val & RMID_VAL_ERROR)
>> @@ -233,14 +271,17 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
>>                    enum resctrl_event_id eventid)
>>   {
>>       struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
>> +    int cpu = cpumask_any(&d->hdr.cpu_mask);
>>       struct arch_mbm_state *am;
>> +    u32 prmid;
> 
> snc_rmid?
> 
>>       am = get_arch_mbm_state(hw_dom, rmid, eventid);
>>       if (am) {
>>           memset(am, 0, sizeof(*am));
>> +        prmid = logical_rmid_to_physical_rmid(cpu, rmid);
>>           /* Record any initial, non-zero count value. */
>> -        __rmid_read(rmid, eventid, &am->prev_msr);
>> +        __rmid_read_phys(prmid, eventid, &am->prev_msr);
> 
> How about ? Feel free to simplify.
> 
>            if (snc_nodes_per_l3_cache > 1) {
>                   snc_rmid = get_snc_node_rmid(cpu, rmid);
>                  __rmid_read(snc_rmid, eventid, &am->prev_msr);
>            } else {
>                __rmid_read(rmid, eventid, &am->prev_msr);
>            }
> 

When considering something like this I think it would be better to contain the
SNC check in a single place so that every place that needs to read an RMID
does not have to remember to copy the same "if (snc_nodes_per_l3_cache > 1)"
check. That is essentially what logical_rmid_to_physical_rmid() does in this
patch, so it just becomes a question of what names to pick for variables and
functions.

I do prefer a name like __rmid_read_phys() with a unique "prmid" parameter since
that should prompt the developer to give a second thought to which RMID is
provided, instead of just blindly calling __rmid_read(), which implies that it
reads the data for the RMID used by resctrl without considering that a
conversion may be needed.

I do understand and agree that "logical" vs "physical" is not intuitive here but
to that end I find that the comments explain the distinction well. If there are
better suggestions then they are surely welcome.

In summary, I do think that the "__rmid_read()" function needs a name change to
make clear that it may not be reading the RMID used internally by resctrl, and
this function should be accompanied by a function with a similar term in its
name that does the conversion and includes the SNC check.
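
To illustrate the structure I have in mind, here is a stand-alone model
(not kernel code). Everything prefixed with "model_" is hypothetical, the
counter array stands in for the IA32_QM_EVTSEL/IA32_QM_CTR MSR pair, and
the node argument stands in for cpu_to_node(cpu). It is only a sketch of
where the SNC check lives, not a proposal for the final names.

```c
/* Stand-alone model, not kernel code. All "model_" names are hypothetical. */
#define MODEL_NUM_PHYS_RMID 200

static int model_snc_nodes_per_l3 = 1;	/* 1 means SNC is disabled */
static unsigned long long model_counters[MODEL_NUM_PHYS_RMID];

/* The only function that knows about SNC. */
static unsigned int model_rmid_to_phys_rmid(int node, unsigned int rmid)
{
	unsigned int num_rmid = MODEL_NUM_PHYS_RMID / model_snc_nodes_per_l3;

	if (model_snc_nodes_per_l3 == 1)
		return rmid;

	return rmid + (node % model_snc_nodes_per_l3) * num_rmid;
}

/* Takes a physical RMID; the "prmid" name is a reminder to convert first. */
static unsigned long long model_rmid_read_phys(unsigned int prmid)
{
	return model_counters[prmid];
}

/* Callers convert once and never test for SNC themselves. */
static unsigned long long model_read(int node, unsigned int rmid)
{
	return model_rmid_read_phys(model_rmid_to_phys_rmid(node, rmid));
}

/* Second caller: record the initial count, as a reset path would. */
static void model_reset(int node, unsigned int rmid, unsigned long long *prev)
{
	*prev = model_rmid_read_phys(model_rmid_to_phys_rmid(node, rmid));
}
```

Both callers stay identical whether SNC is enabled or not, which is the
property I would like to keep regardless of the names chosen.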

Reinette

* Re: [PATCH v20 06/18] x86/resctrl: Introduce snc_nodes_per_l3_cache
  2024-06-18 22:58     ` Reinette Chatre
@ 2024-06-19 14:43       ` Moger, Babu
  0 siblings, 0 replies; 61+ messages in thread
From: Moger, Babu @ 2024-06-19 14:43 UTC (permalink / raw)
  To: Reinette Chatre, Tony Luck, Fenghua Yu, Maciej Wieczor-Retman,
	Peter Newman, James Morse, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches



On 6/18/24 17:58, Reinette Chatre wrote:
> Hi Babu and Tony,
> 
> On 6/17/24 3:36 PM, Moger, Babu wrote:
>> On 6/10/2024 1:35 PM, Tony Luck wrote:
> 
>>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c
>>> b/arch/x86/kernel/cpu/resctrl/monitor.c
>>> index 89d7e6fcbaa1..f2fd35d294f2 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>>> @@ -97,6 +97,8 @@ unsigned int resctrl_rmid_realloc_limit;
>>>   #define CF(cf)    ((unsigned long)(1048576 * (cf) + 0.5))
>>> +static int snc_nodes_per_l3_cache = 1;
>>> +
>>>   /*
>>>    * The correction factor table is documented in
>>> Documentation/arch/x86/resctrl.rst.
>>>    * If rmid > rmid threshold, MBM total and local values should be
>>> multiplied
>>> @@ -185,7 +187,43 @@ static inline struct rmid_entry *__rmid_entry(u32
>>> idx)
>>>       return entry;
>>>   }
>>> -static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
>>> +/*
>>> + * When Sub-NUMA Cluster (SNC) mode is not enabled (as indicated by
>>> + * "snc_nodes_per_l3_cache  == 1") no translation of the RMID value is
>>> + * needed. The physical RMID is the same as the logical RMID.
>>> + *
>>> + * On a platform with SNC mode enabled, Linux enables RMID sharing mode
>>> + * via MSR 0xCA0 (see the "RMID Sharing Mode" section in the "Intel
>>> + * Resource Director Technology Architecture Specification" for a full
>>> + * description of RMID sharing mode).
>>> + *
>>> + * In RMID sharing mode there are fewer "logical RMID" values available
>>> + * to accumulate data ("physical RMIDs" are divided evenly between SNC
>>> + * nodes that share an L3 cache). Linux creates an rdt_mon_domain for
>>> + * each SNC node.
>>> + *
>>> + * The value loaded into IA32_PQR_ASSOC is the "logical RMID".
>>> + *
>>> + * Data is collected independently on each SNC node and can be retrieved
>>> + * using the "physical RMID" value computed by this function and loaded
>>> + * into IA32_QM_EVTSEL. @cpu can be any CPU in the SNC node.
>>> + *
>>> + * The scope of the IA32_QM_EVTSEL and IA32_QM_CTR MSRs is at the L3
>>> + * cache.  So a "physical RMID" may be read from any CPU that shares
>>> + * the L3 cache with the desired SNC node, not just from a CPU in
>>> + * the specific SNC node.
>>> + */
>>> +static int logical_rmid_to_physical_rmid(int cpu, int lrmid)
>>
>> How about ? (or something similar)
>>
>> static int get_snc_node_rmid(int cpu, int rmid)
>>
>>> +{
>>> +    struct rdt_resource *r =
>>> &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>>> +
>>> +    if (snc_nodes_per_l3_cache  == 1)
> 
> (nit: unnecessary space)
> 
>>> +        return lrmid;
>>> +
>>> +    return lrmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) *
>>> r->num_rmid;
>>> +}
>>> +
>>> +static int __rmid_read_phys(u32 prmid, enum resctrl_event_id eventid,
>>> u64 *val)
>>
>> You don't need to write new function.  Just update the rmid.
>>
>>
>>>   {
>>>       u64 msr_val;
>>> @@ -197,7 +235,7 @@ static int __rmid_read(u32 rmid, enum
>>> resctrl_event_id eventid, u64 *val)
>>>        * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
>>>        * are error bits.
>>>        */
>>> -    wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
>>> +    wrmsr(MSR_IA32_QM_EVTSEL, eventid, prmid);
>>>       rdmsrl(MSR_IA32_QM_CTR, msr_val);
>>>       if (msr_val & RMID_VAL_ERROR)
>>> @@ -233,14 +271,17 @@ void resctrl_arch_reset_rmid(struct rdt_resource
>>> *r, struct rdt_mon_domain *d,
>>>                    enum resctrl_event_id eventid)
>>>   {
>>>       struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
>>> +    int cpu = cpumask_any(&d->hdr.cpu_mask);
>>>       struct arch_mbm_state *am;
>>> +    u32 prmid;
>>
>> snc_rmid?
>>
>>>       am = get_arch_mbm_state(hw_dom, rmid, eventid);
>>>       if (am) {
>>>           memset(am, 0, sizeof(*am));
>>> +        prmid = logical_rmid_to_physical_rmid(cpu, rmid);
>>>           /* Record any initial, non-zero count value. */
>>> -        __rmid_read(rmid, eventid, &am->prev_msr);
>>> +        __rmid_read_phys(prmid, eventid, &am->prev_msr);
>>
>> How about ? Feel free to simplify.
>>
>>            if (snc_nodes_per_l3_cache > 1) {
>>                   snc_rmid = get_snc_node_rmid(cpu, rmid);
>>                  __rmid_read(snc_rmid, eventid, &am->prev_msr);
>>            } else {
>>                __rmid_read(rmid, eventid, &am->prev_msr);
>>            }
>>
> 
> When considering something like this I think it would be better to contain
> the
> SNC checking in a single place so that all places needing to read RMID
> need not
> remember to have the same copied "if (snc_nodes_per_l3_cache > 1)" check.
> This then essentially becomes logical_rmid_to_physical_rmid() in this
> patch so
> now it just becomes a question about what name to pick for variables and
> functions.
> 
> I do prefer a name like __rmid_read_phys()  with a unique "prmid"
> parameter since that
> should prompt developer to give a second thought to what rmid parameter is
> provided
> instead of just blindly calling __rmid_read() that implies that it is
> reading the
> data for the RMID used by resctrl without considering that a conversion
> may be needed.

Ok. That sounds reasonable.

> 
> I do understand and agree that "logical" vs "physical" is not intuitive
> here but
> to that end I find that the comments explain the distinction well. If
> there are
> better suggestions then they are surely welcome.
> 
> In summary, I do think that the "__rmid_read()" function needs a name
> change to make
> clear that it may not be reading the RMID used internally by resctrl and
> this function
> should be accompanied by a function with similar term in its name that
> does the
> conversion and includes the SNC check.
> 
> Reinette
> 

-- 
Thanks
Babu Moger

* Re: [PATCH v20 06/18] x86/resctrl: Introduce snc_nodes_per_l3_cache
  2024-06-10 18:35 ` [PATCH v20 06/18] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
  2024-06-17 22:36   ` Moger, Babu
@ 2024-06-20 21:19   ` Reinette Chatre
  1 sibling, 0 replies; 61+ messages in thread
From: Reinette Chatre @ 2024-06-20 21:19 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches

Hi Tony,

On 6/10/24 11:35 AM, Tony Luck wrote:
> Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
> and memory controllers on a socket into two or more groups. These are
> presented to the operating system as NUMA nodes.
> 
> This may enable some workloads to have slightly lower latency to memory
> as the memory controller(s) in an SNC node are electrically closer to the
> CPU cores on that SNC node. This cost may be offset by lower bandwidth
> since the memory accesses for each core can only be interleaved between
> the memory controllers on the same SNC node.
> 
> Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
> to track L3 cache occupancy and memory bandwidth. There is an MSR that
> controls how the RMIDs are shared between SNC nodes.
> 
> The default mode divides them numerically. E.g. when there are two SNC
> nodes on a socket the lower number half of the RMIDs are given to the
> first node, the remainder to the second node. This would be difficult
> to use with the Linux resctrl interface as specific RMID values assigned
> to resctrl groups are not visible to users.
> 
> RMID sharing mode divides the physical RMIDs evenly between SNC nodes
> but uses a logical RMID in the IA32_PQR_ASSOC MSR. For example a system
> with 200 physical RMIDs (as enumerated by CPUID leaf 0xF) that has two
> SNC nodes per L3 cache instance would have 100 logical RMIDs available
> for Linux to use. A task running on SNC node 0 with RMID 5 would
> accumulate LLC occupancy and MBM bandwidth data in physical RMID 5.
> Another task using RMID 5, but running on SNC node 1 would accumulate
> data in physical RMID 105.
> 
> Even with this renumbering SNC mode requires several changes in resctrl
> behavior for correct operation.
> 
> Add a static global to arch/x86/kernel/cpu/resctrl/monitor.c to indicate
> how many SNC domains share an L3 cache instance.  Initialize this to
> "1". Runtime detection of SNC mode will adjust this value.
> 
> Update all places to take appropriate action when SNC mode is enabled:
> 1) The number of logical RMIDs per L3 cache available for use is the
>     number of physical RMIDs divided by the number of SNC nodes.
> 2) Likewise the "mon_scale" value must be divided by the number of SNC
>     nodes.
> 3) Add a function to convert from logical RMID values (assigned to
>     tasks and loaded into the IA32_PQR_ASSOC MSR on context switch)
>     to physical RMID values to load into IA32_QM_EVTSEL MSR when
>     reading counters on each SNC node.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>   arch/x86/kernel/cpu/resctrl/monitor.c | 56 ++++++++++++++++++++++++---
>   1 file changed, 50 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 89d7e6fcbaa1..f2fd35d294f2 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -97,6 +97,8 @@ unsigned int resctrl_rmid_realloc_limit;
>   
>   #define CF(cf)	((unsigned long)(1048576 * (cf) + 0.5))
>   
> +static int snc_nodes_per_l3_cache = 1;
> +
>   /*
>    * The correction factor table is documented in Documentation/arch/x86/resctrl.rst.
>    * If rmid > rmid threshold, MBM total and local values should be multiplied
> @@ -185,7 +187,43 @@ static inline struct rmid_entry *__rmid_entry(u32 idx)
>   	return entry;
>   }
>   
> -static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> +/*
> + * When Sub-NUMA Cluster (SNC) mode is not enabled (as indicated by
> + * "snc_nodes_per_l3_cache  == 1") no translation of the RMID value is

(nit: same unnecessary space as in code)

> + * needed. The physical RMID is the same as the logical RMID.
> + *
> + * On a platform with SNC mode enabled, Linux enables RMID sharing mode
> + * via MSR 0xCA0 (see the "RMID Sharing Mode" section in the "Intel
> + * Resource Director Technology Architecture Specification" for a full
> + * description of RMID sharing mode).
> + *
> + * In RMID sharing mode there are fewer "logical RMID" values available
> + * to accumulate data ("physical RMIDs" are divided evenly between SNC
> + * nodes that share an L3 cache). Linux creates an rdt_mon_domain for
> + * each SNC node.
> + *
> + * The value loaded into IA32_PQR_ASSOC is the "logical RMID".
> + *
> + * Data is collected independently on each SNC node and can be retrieved
> + * using the "physical RMID" value computed by this function and loaded
> + * into IA32_QM_EVTSEL. @cpu can be any CPU in the SNC node.
> + *
> + * The scope of the IA32_QM_EVTSEL and IA32_QM_CTR MSRs is at the L3
> + * cache.  So a "physical RMID" may be read from any CPU that shares
> + * the L3 cache with the desired SNC node, not just from a CPU in
> + * the specific SNC node.
> + */
> +static int logical_rmid_to_physical_rmid(int cpu, int lrmid)

It is not clear to me where we are in the discussion about the naming. If
"logical" vs "physical" becomes an issue then perhaps "logical" can just be
dropped, resulting in just "rmid_to_phys_rmid()" (to match
__rmid_read_phys())? I'm also ok with what you have here.

> +{
> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> +
> +	if (snc_nodes_per_l3_cache  == 1)

(nit: extra space here)

> +		return lrmid;
> +
> +	return lrmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
> +}
> +
> +static int __rmid_read_phys(u32 prmid, enum resctrl_event_id eventid, u64 *val)
>   {
>   	u64 msr_val;
>   
> @@ -197,7 +235,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
>   	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
>   	 * are error bits.
>   	 */
> -	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
> +	wrmsr(MSR_IA32_QM_EVTSEL, eventid, prmid);
>   	rdmsrl(MSR_IA32_QM_CTR, msr_val);
>   
>   	if (msr_val & RMID_VAL_ERROR)
> @@ -233,14 +271,17 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
>   			     enum resctrl_event_id eventid)
>   {
>   	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
> +	int cpu = cpumask_any(&d->hdr.cpu_mask);
>   	struct arch_mbm_state *am;
> +	u32 prmid;
>   
>   	am = get_arch_mbm_state(hw_dom, rmid, eventid);
>   	if (am) {
>   		memset(am, 0, sizeof(*am));
>   
> +		prmid = logical_rmid_to_physical_rmid(cpu, rmid);
>   		/* Record any initial, non-zero count value. */
> -		__rmid_read(rmid, eventid, &am->prev_msr);
> +		__rmid_read_phys(prmid, eventid, &am->prev_msr);
>   	}
>   }
>   
> @@ -275,8 +316,10 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
>   {
>   	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
>   	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> +	int cpu = cpumask_any(&d->hdr.cpu_mask);
>   	struct arch_mbm_state *am;
>   	u64 msr_val, chunks;
> +	u32 prmid;
>   	int ret;
>   
>   	resctrl_arch_rmid_read_context_check();
> @@ -284,7 +327,8 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
>   	if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
>   		return -EINVAL;
>   
> -	ret = __rmid_read(rmid, eventid, &msr_val);
> +	prmid = logical_rmid_to_physical_rmid(cpu, rmid);
> +	ret = __rmid_read_phys(prmid, eventid, &msr_val);
>   	if (ret)
>   		return ret;
>   
> @@ -1022,8 +1066,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
>   	int ret;
>   
>   	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
> -	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
> -	r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
> +	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
> +	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
>   	hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
>   
>   	if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)

Apart from the two spacing related nits this looks good to me.

| Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Reinette

* [PATCH v20 07/18] x86/resctrl: Block use of mba_MBps mount option on Sub-NUMA Cluster (SNC) systems
  2024-06-10 18:35 [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (5 preceding siblings ...)
  2024-06-10 18:35 ` [PATCH v20 06/18] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
@ 2024-06-10 18:35 ` Tony Luck
  2024-06-20 21:21   ` Reinette Chatre
  2024-06-10 18:35 ` [PATCH v20 08/18] x86/resctrl: Prepare for new Sub-NUMA Cluster (SNC) monitor files Tony Luck
                   ` (11 subsequent siblings)
  18 siblings, 1 reply; 61+ messages in thread
From: Tony Luck @ 2024-06-10 18:35 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches, Tony Luck

When SNC is enabled there is a mismatch between the MBA control function
which operates at L3 cache scope and the MBM monitor functions which
measure memory bandwidth on each SNC node.

Block use of the mba_MBps mount option when the scopes for MBA and MBM do
not match.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index eb3bbfa96d5a..a0a43dbe011b 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2339,10 +2339,12 @@ static void mba_sc_domain_destroy(struct rdt_resource *r,
  */
 static bool supports_mba_mbps(void)
 {
+	struct rdt_resource *rmbm = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
 
 	return (is_mbm_local_enabled() &&
-		r->alloc_capable && is_mba_linear());
+		r->alloc_capable && is_mba_linear() &&
+		r->ctrl_scope == rmbm->mon_scope);
 }
 
 /*
-- 
2.45.0


* Re: [PATCH v20 07/18] x86/resctrl: Block use of mba_MBps mount option on Sub-NUMA Cluster (SNC) systems
  2024-06-10 18:35 ` [PATCH v20 07/18] x86/resctrl: Block use of mba_MBps mount option on Sub-NUMA Cluster (SNC) systems Tony Luck
@ 2024-06-20 21:21   ` Reinette Chatre
  2024-06-20 22:07     ` Luck, Tony
  0 siblings, 1 reply; 61+ messages in thread
From: Reinette Chatre @ 2024-06-20 21:21 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches

Hi Tony,

On 6/10/24 11:35 AM, Tony Luck wrote:
> When SNC is enabled there is a mismatch between the MBA control function
> which operates at L3 cache scope and the MBM monitor functions which
> measure memory bandwidth on each SNC node.
> 
> Block use of the mba_MBps when scopes for MBA/MBM do not match.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>   arch/x86/kernel/cpu/resctrl/rdtgroup.c | 4 +++-
>   1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index eb3bbfa96d5a..a0a43dbe011b 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -2339,10 +2339,12 @@ static void mba_sc_domain_destroy(struct rdt_resource *r,
>    */
>   static bool supports_mba_mbps(void)
>   {
> +	struct rdt_resource *rmbm = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>   	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
>   
>   	return (is_mbm_local_enabled() &&
> -		r->alloc_capable && is_mba_linear());
> +		r->alloc_capable && is_mba_linear() &&
> +		r->ctrl_scope == rmbm->mon_scope);
>   }
>   
>   /*

The function comment of supports_mba_mbps() needs an update to accompany
this new requirement.

I also think that the "mba_MBps" mount option is now complicated enough to
warrant a clear error to user space when using it fails. invalfc() is
available for this and enables user space to get a detailed log message
from a read() on an fd created by fsopen().

Perhaps something like (please check line length and feel free to improve
since as is it may be quite cryptic):
	rdt_parse_param(...)
	{


	...
	case Opt_mba_mbps:
		if (!supports_mba_mbps())
			return invalfc(fc, "mba_MBps requires both MBM and (linear scale) MBA at L3 scope");
	...
	}


Reinette


^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH v20 07/18] x86/resctrl: Block use of mba_MBps mount option on Sub-NUMA Cluster (SNC) systems
  2024-06-20 21:21   ` Reinette Chatre
@ 2024-06-20 22:07     ` Luck, Tony
  2024-06-20 22:12       ` Luck, Tony
  2024-06-21  1:56       ` Reinette Chatre
  0 siblings, 2 replies; 61+ messages in thread
From: Luck, Tony @ 2024-06-20 22:07 UTC (permalink / raw)
  To: Chatre, Reinette, Yu, Fenghua, Wieczor-Retman, Maciej,
	Peter Newman, James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86@kernel.org, linux-kernel@vger.kernel.org,
	patches@lists.linux.dev

> > When SNC is enabled there is a mismatch between the MBA control function
> > which operates at L3 cache scope and the MBM monitor functions which
> > measure memory bandwidth on each SNC node.
> >
> > Block use of the mba_MBps when scopes for MBA/MBM do not match.
> >
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > ---
> >   arch/x86/kernel/cpu/resctrl/rdtgroup.c | 4 +++-
> >   1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > index eb3bbfa96d5a..a0a43dbe011b 100644
> > --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > @@ -2339,10 +2339,12 @@ static void mba_sc_domain_destroy(struct rdt_resource *r,
> >    */
> >   static bool supports_mba_mbps(void)
> >   {
> > +   struct rdt_resource *rmbm = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> >     struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
> >
> >     return (is_mbm_local_enabled() &&
> > -           r->alloc_capable && is_mba_linear());
> > +           r->alloc_capable && is_mba_linear() &&
> > +           r->ctrl_scope == rmbm->mon_scope);
> >   }
> >
> >   /*
>
> The function comments of supports_mba_mbps() needs an update to accompany
> this new requirement.

Will add comment on extra requirement.

> I also think that the "mba_MBps" mount option is now complicated enough to
> warrant a clear error to user space when using it fails. invalfc() is
> available for this and enables user space to get detailed log message
> from a read() on an fd created by fsopen().
>
> Perhaps something like (please check line length and feel free to improve
> since as is it may quite cryptic):
>       rdt_parse_param(...)
>       {
>
>
>       ...
>       case Opt_mba_mbps:
>               if (!supports_mba_mbps())
>                       return invalfc(fc, "mba_MBps requires both MBM and (linear scale) MBA at L3 scope");
>       ...
>       }

Line length is indeed a problem (108 characters). The usual line-split methods barely help, as moving the
string to the next line and aligning it with the "(" only saves 4 characters.

How about this (suggestions for a shorter variable name welcome; the line is 97 characters):

static char mba_mbps_invalid[] = "mba_MBps requires both MBM and (linear scale) MBA at L3 scope";

rdt_parse_param(...)
{
	...
	case Opt_mba_mbps:
		if (!supports_mba_mbps())
			return invalfc(fc, mba_mbps_invalid);
	...
}

-Tony
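
The widths quoted above (108 and 97 characters) can be reproduced with a
small helper that expands tabs to 8-column stops, the tab width kernel
code uses. This is a sketch for checking the arithmetic, not part of the
patch; visual_width() and the two sample line constants are names
invented here.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Visual width of a source line when each tab advances to the next
 * multiple of tabstop columns. Assumes plain ASCII text.
 */
size_t visual_width(const char *line, size_t tabstop)
{
	size_t col = 0;

	assert(line != NULL && tabstop > 0);

	for (; *line; line++) {
		if (*line == '\t')
			col += tabstop - (col % tabstop);
		else
			col++;
	}
	return col;
}

/* The originally suggested call, indented by three tabs. */
const char line_inline[] =
	"\t\t\treturn invalfc(fc, \"mba_MBps requires both MBM and (linear scale) MBA at L3 scope\");";

/* The file-scope string variant, with no indentation. */
const char line_named[] =
	"static char mba_mbps_invalid[] = \"mba_MBps requires both MBM and (linear scale) MBA at L3 scope\";";
```

Calling visual_width(line_inline, 8) gives 108 and
visual_width(line_named, 8) gives 97, matching the figures in this mail.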

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH v20 07/18] x86/resctrl: Block use of mba_MBps mount option on Sub-NUMA Cluster (SNC) systems
  2024-06-20 22:07     ` Luck, Tony
@ 2024-06-20 22:12       ` Luck, Tony
  2024-06-21  1:56       ` Reinette Chatre
  1 sibling, 0 replies; 61+ messages in thread
From: Luck, Tony @ 2024-06-20 22:12 UTC (permalink / raw)
  To: Chatre, Reinette, Yu, Fenghua, Wieczor-Retman, Maciej,
	Peter Newman, James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86@kernel.org, linux-kernel@vger.kernel.org,
	patches@lists.linux.dev

> static char mba_mbps_invalid[] = "mba_MBps requires both MBM and (linear scale) MBA at L3 scope";

checkpatch recommends "static const char ..." pushing this over 100 chars :-(

-Tony

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v20 07/18] x86/resctrl: Block use of mba_MBps mount option on Sub-NUMA Cluster (SNC) systems
  2024-06-20 22:07     ` Luck, Tony
  2024-06-20 22:12       ` Luck, Tony
@ 2024-06-21  1:56       ` Reinette Chatre
  2024-06-21 15:24         ` Tony Luck
  1 sibling, 1 reply; 61+ messages in thread
From: Reinette Chatre @ 2024-06-21  1:56 UTC (permalink / raw)
  To: Luck, Tony, Yu, Fenghua, Wieczor-Retman, Maciej, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86@kernel.org, linux-kernel@vger.kernel.org,
	patches@lists.linux.dev

Hi Tony,

On 6/20/24 3:07 PM, Luck, Tony wrote:
>>> When SNC is enabled there is a mismatch between the MBA control function
>>> which operates at L3 cache scope and the MBM monitor functions which
>>> measure memory bandwidth on each SNC node.
>>>
>>> Block use of the mba_MBps when scopes for MBA/MBM do not match.
>>>
>>> Signed-off-by: Tony Luck <tony.luck@intel.com>
>>> ---
>>>    arch/x86/kernel/cpu/resctrl/rdtgroup.c | 4 +++-
>>>    1 file changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> index eb3bbfa96d5a..a0a43dbe011b 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> @@ -2339,10 +2339,12 @@ static void mba_sc_domain_destroy(struct rdt_resource *r,
>>>     */
>>>    static bool supports_mba_mbps(void)
>>>    {
>>> +   struct rdt_resource *rmbm = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>>>      struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
>>>
>>>      return (is_mbm_local_enabled() &&
>>> -           r->alloc_capable && is_mba_linear());
>>> +           r->alloc_capable && is_mba_linear() &&
>>> +           r->ctrl_scope == rmbm->mon_scope);
>>>    }
>>>
>>>    /*
>>
>> The function comments of supports_mba_mbps() needs an update to accompany
>> this new requirement.
> 
> Will add comment on extra requirement.
> 
>> I also think that the "mba_MBps" mount option is now complicated enough to
>> warrant a clear error to user space when using it fails. invalfc() is
>> available for this and enables user space to get detailed log message
>> from a read() on an fd created by fsopen().
>>
>> Perhaps something like (please check line length and feel free to improve
>> since as is it may quite cryptic):
>>        rdt_parse_param(...)
>>        {
>>
>>
>>        ...
>>        case Opt_mba_mbps:
>>                if (!supports_mba_mbps())
>>                        return invalfc(fc, "mba_MBps requires both MBM and (linear scale) MBA at L3 scope");
>>        ...
>>        }
> 
> Line length is indeed a problem (108 characters). Usual line split methods barely help as the moving the
> string to the next line and aligning with the "(" only saves 4 characters.
> 
> How about this (suggestions for a shorter variable name - line is 97 characters)
> 
> static char mba_mbps_invalid[] = "mba_MBps requires both MBM and (linear scale) MBA at L3 scope";
> 
> rdt_parse_param(...)
> {
> 	...
> 	case Opt_mba_mbps:
> 		if (!supports_mba_mbps())
> 			return invalfc(fc, mba_mbps_invalid);
> 	...
> }

On 6/20/24 3:12 PM, Luck, Tony wrote:
>> static char mba_mbps_invalid[] = "mba_MBps requires both MBM and (linear scale) MBA at L3 scope";
> 
> checkpatch recommends "static const char ..." pushing this over 100 chars :-(
> 

How about something like below that reaches 96:

	case Opt_mba_mbps:
		if (!supports_mba_mbps())
			return invalfc(fc,
				       "mba_MBps requires both MBM and linear MBA at L3 scope");


Reinette


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v20 07/18] x86/resctrl: Block use of mba_MBps mount option on Sub-NUMA Cluster (SNC) systems
  2024-06-21  1:56       ` Reinette Chatre
@ 2024-06-21 15:24         ` Tony Luck
  2024-06-21 17:10           ` Reinette Chatre
  0 siblings, 1 reply; 61+ messages in thread
From: Tony Luck @ 2024-06-21 15:24 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Yu, Fenghua, Wieczor-Retman, Maciej, Peter Newman, James Morse,
	Babu Moger, Drew Fustini, Dave Martin, x86@kernel.org,
	linux-kernel@vger.kernel.org, patches@lists.linux.dev

On Thu, Jun 20, 2024 at 06:56:56PM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 6/20/24 3:07 PM, Luck, Tony wrote:
> > > > When SNC is enabled there is a mismatch between the MBA control function
> > > > which operates at L3 cache scope and the MBM monitor functions which
> > > > measure memory bandwidth on each SNC node.
> > > > 
> > > > Block use of the mba_MBps when scopes for MBA/MBM do not match.
> > > > 
> > > > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > > > ---
> > > >    arch/x86/kernel/cpu/resctrl/rdtgroup.c | 4 +++-
> > > >    1 file changed, 3 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > > > index eb3bbfa96d5a..a0a43dbe011b 100644
> > > > --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > > > +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > > > @@ -2339,10 +2339,12 @@ static void mba_sc_domain_destroy(struct rdt_resource *r,
> > > >     */
> > > >    static bool supports_mba_mbps(void)
> > > >    {
> > > > +   struct rdt_resource *rmbm = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> > > >      struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
> > > > 
> > > >      return (is_mbm_local_enabled() &&
> > > > -           r->alloc_capable && is_mba_linear());
> > > > +           r->alloc_capable && is_mba_linear() &&
> > > > +           r->ctrl_scope == rmbm->mon_scope);
> > > >    }
> > > > 
> > > >    /*
> > > 
> > > The function comments of supports_mba_mbps() needs an update to accompany
> > > this new requirement.
> > 
> > Will add comment on extra requirement.
> > 
> > > I also think that the "mba_MBps" mount option is now complicated enough to
> > > warrant a clear error to user space when using it fails. invalfc() is
> > > available for this and enables user space to get detailed log message
> > > from a read() on an fd created by fsopen().
> > > 
> > > Perhaps something like (please check line length and feel free to improve
> > > since as is it may quite cryptic):
> > >        rdt_parse_param(...)
> > >        {
> > > 
> > > 
> > >        ...
> > >        case Opt_mba_mbps:
> > >                if (!supports_mba_mbps())
> > >                        return invalfc(fc, "mba_MBps requires both MBM and (linear scale) MBA at L3 scope");
> > >        ...
> > >        }
> > 
> > Line length is indeed a problem (108 characters). Usual line split methods barely help as the moving the
> > string to the next line and aligning with the "(" only saves 4 characters.
> > 
> > How about this (suggestions for a shorter variable name - line is 97 characters)
> > 
> > static char mba_mbps_invalid[] = "mba_MBps requires both MBM and (linear scale) MBA at L3 scope";
> > 
> > rdt_parse_param(...)
> > {
> > 	...
> > 	case Opt_mba_mbps:
> > 		if (!supports_mba_mbps())
> > 			return invalfc(fc, mba_mbps_invalid);
> > 	...
> > }
> 
> On 6/20/24 3:12 PM, Luck, Tony wrote:
> > > static char mba_mbps_invalid[] = "mba_MBps requires both MBM and (linear scale) MBA at L3 scope";
> > 
> > checkpatch recommends "static const char ..." pushing this over 100 chars :-(
> > 
> 
> How about something like below that reaches 96:
> 
> 	case Opt_mba_mbps:
> 		if (!supports_mba_mbps())
> 			return invalfc(fc,
> 				       "mba_MBps requires both MBM and linear MBA at L3 scope");
> 

Reinette,

Alternative option. Move the messaging into supports_mba_mbps() and
split into shorter pieces for each reason. The other callers of
supports_mba_mbps() that are just re-checking status would pass
a NULL argument.

If this looks reasonable I can do it in two patches. First to add
invalfc() for the existing cases. Second to add the SNC change.

-Tony

---

static bool supports_mba_mbps(struct fs_context *fc)
{
	struct rdt_resource *rmbm = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;

	if (!is_mbm_local_enabled()) {
		if (fc)
			invalfc(fc, "mba_MBps requires local MBM");
		return false;
	}
	if (!r->alloc_capable) {
		if (fc)
			invalfc(fc, "mba_MBps requires MBA");
		return false;
	}
	if (!is_mba_linear()) {
		if (fc)
			invalfc(fc, "mba_MBps requires linear MBA");
		return false;
	}
	if (r->ctrl_scope != rmbm->mon_scope) {
		if (fc)
			invalfc(fc, "mba_MBps requires MBM/MBA at L3 scope");
		return false;
	}

	return true;
}

rdt_parse_param(...)
{
	...
	case Opt_mba_mbps:
		if (!supports_mba_mbps(fc))
			return -EINVAL;
	...
}
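
The per-reason reporting proposed above can be exercised outside the
kernel with a mock context. This is a sketch under stated assumptions:
struct mock_fc, struct mock_state and fail() are hypothetical, the mock
only records the message text, and the real invalfc() logging and its
-EINVAL return value are not modelled. The message strings are the ones
from the proposal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct fs_context: keeps the last error text. */
struct mock_fc {
	const char *last_error;
};

/* Reduced model of the four conditions being checked. */
struct mock_state {
	bool mbm_local_enabled;
	bool mba_alloc_capable;
	bool mba_linear;
	bool scopes_match;	/* r->ctrl_scope == rmbm->mon_scope */
};

/* Record a reason only when the caller supplied a context. */
static bool fail(struct mock_fc *fc, const char *reason)
{
	if (fc)
		fc->last_error = reason;
	return false;
}

/* Callers that only re-check status pass fc == NULL and get no message. */
bool mock_supports_mba_mbps(struct mock_fc *fc, const struct mock_state *s)
{
	assert(s != NULL);

	if (!s->mbm_local_enabled)
		return fail(fc, "mba_MBps requires local MBM");
	if (!s->mba_alloc_capable)
		return fail(fc, "mba_MBps requires MBA");
	if (!s->mba_linear)
		return fail(fc, "mba_MBps requires linear MBA");
	if (!s->scopes_match)
		return fail(fc, "mba_MBps requires MBM/MBA at L3 scope");

	return true;
}
```

Only the first failing condition is reported, so a system that lacks
both local MBM and linear MBA would see only the local MBM message.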

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v20 07/18] x86/resctrl: Block use of mba_MBps mount option on Sub-NUMA Cluster (SNC) systems
  2024-06-21 15:24         ` Tony Luck
@ 2024-06-21 17:10           ` Reinette Chatre
  0 siblings, 0 replies; 61+ messages in thread
From: Reinette Chatre @ 2024-06-21 17:10 UTC (permalink / raw)
  To: Tony Luck
  Cc: Yu, Fenghua, Wieczor-Retman, Maciej, Peter Newman, James Morse,
	Babu Moger, Drew Fustini, Dave Martin, x86@kernel.org,
	linux-kernel@vger.kernel.org, patches@lists.linux.dev

Hi Tony,

On 6/21/24 8:24 AM, Tony Luck wrote:
> On Thu, Jun 20, 2024 at 06:56:56PM -0700, Reinette Chatre wrote:
>> On 6/20/24 3:07 PM, Luck, Tony wrote:
>>>>> When SNC is enabled there is a mismatch between the MBA control function
>>>>> which operates at L3 cache scope and the MBM monitor functions which
>>>>> measure memory bandwidth on each SNC node.
>>>>>
>>>>> Block use of the mba_MBps when scopes for MBA/MBM do not match.
>>>>>
>>>>> Signed-off-by: Tony Luck <tony.luck@intel.com>
>>>>> ---
>>>>>     arch/x86/kernel/cpu/resctrl/rdtgroup.c | 4 +++-
>>>>>     1 file changed, 3 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>>> index eb3bbfa96d5a..a0a43dbe011b 100644
>>>>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>>> @@ -2339,10 +2339,12 @@ static void mba_sc_domain_destroy(struct rdt_resource *r,
>>>>>      */
>>>>>     static bool supports_mba_mbps(void)
>>>>>     {
>>>>> +   struct rdt_resource *rmbm = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>>>>>       struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
>>>>>
>>>>>       return (is_mbm_local_enabled() &&
>>>>> -           r->alloc_capable && is_mba_linear());
>>>>> +           r->alloc_capable && is_mba_linear() &&
>>>>> +           r->ctrl_scope == rmbm->mon_scope);
>>>>>     }
>>>>>
>>>>>     /*
>>>>
>>>> The function comments of supports_mba_mbps() needs an update to accompany
>>>> this new requirement.
>>>
>>> Will add comment on extra requirement.
>>>
>>>> I also think that the "mba_MBps" mount option is now complicated enough to
>>>> warrant a clear error to user space when using it fails. invalfc() is
>>>> available for this and enables user space to get detailed log message
>>>> from a read() on an fd created by fsopen().
>>>>
>>>> Perhaps something like (please check line length and feel free to improve
>>>> since as is it may quite cryptic):
>>>>         rdt_parse_param(...)
>>>>         {
>>>>
>>>>
>>>>         ...
>>>>         case Opt_mba_mbps:
>>>>                 if (!supports_mba_mbps())
>>>>                         return invalfc(fc, "mba_MBps requires both MBM and (linear scale) MBA at L3 scope");
>>>>         ...
>>>>         }
>>>
>>> Line length is indeed a problem (108 characters). Usual line split methods barely help as the moving the
>>> string to the next line and aligning with the "(" only saves 4 characters.
>>>
>>> How about this (suggestions for a shorter variable name - line is 97 characters)
>>>
>>> static char mba_mbps_invalid[] = "mba_MBps requires both MBM and (linear scale) MBA at L3 scope";
>>>
>>> rdt_parse_param(...)
>>> {
>>> 	...
>>> 	case Opt_mba_mbps:
>>> 		if (!supports_mba_mbps())
>>> 			return invalfc(fc, mba_mbps_invalid);
>>> 	...
>>> }
>>
>> On 6/20/24 3:12 PM, Luck, Tony wrote:
>>>> static char mba_mbps_invalid[] = "mba_MBps requires both MBM and (linear scale) MBA at L3 scope";
>>>
>>> checkpatch recommends "static const char ..." pushing this over 100 chars :-(
>>>
>>
>> How about something like below that reaches 96:
>>
>> 	case Opt_mba_mbps:
>> 		if (!supports_mba_mbps())
>> 			return invalfc(fc,
>> 				       "mba_MBps requires both MBM and linear MBA at L3 scope");
>>
> 
> Reinette,
> 
> Alternative option. Move the messaging into supports_mba_mbps() and
> split into shorter pieces for each reason. The other callers of
> supports_mba_mbps() that are just re-checking status would pass
> a NULL argument.

This fragmentation of the mount parameter checking, which moves part of its
error reporting into generic code, does not look ideal to me.

Looking at the information provided in the messages you created I can think
of two more options:

rdt_parse_param(...)
{
	...
	const char *msg;
	...

	case Opt_mba_mbps:
		msg = "mba_MBps requires local MBM and linear scale MBA at L3 scope";
		if (!supports_mba_mbps())
			return invalfc(fc, "%s", msg);
	...
}


rdt_parse_param(...)
{
	...
	case Opt_mba_mbps:
		if (!supports_mba_mbps()) {
			errorfc(fc,
				"mba_MBps requires local MBM and linear scale MBA at L3 scope");
			return -EINVAL;
		}
	...
}

Reinette

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v20 08/18] x86/resctrl: Prepare for new Sub-NUMA Cluster (SNC) monitor files
  2024-06-10 18:35 [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (6 preceding siblings ...)
  2024-06-10 18:35 ` [PATCH v20 07/18] x86/resctrl: Block use of mba_MBps mount option on Sub-NUMA Cluster (SNC) systems Tony Luck
@ 2024-06-10 18:35 ` Tony Luck
  2024-06-20 21:22   ` Reinette Chatre
  2024-06-10 18:35 ` [PATCH v20 09/18] x86/resctrl: Add a new field to struct rmid_read for summation of domains Tony Luck
                   ` (10 subsequent siblings)
  18 siblings, 1 reply; 61+ messages in thread
From: Tony Luck @ 2024-06-10 18:35 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches, Tony Luck

When SNC is enabled monitoring data is collected at the SNC node
granularity, but must be reported at L3-cache granularity for
backwards compatibility in addition to reporting at the node
level.

Add a "ci" field to the rdt_mon_domain structure to save the
cache information about the enclosing L3 cache for the domain.
This provides:

1) The cache id which is needed to compose the name of the legacy
monitoring directory, and to determine which domains should be
summed to provide L3-scoped data.

2) The shared_cpu_map which is needed to determine which CPUs can
be used to read the RMID counters with the MSR interface.

This is the first step to an eventual goal of monitor reporting files
like this (for a system with two SNC nodes per L3):

$ cd /sys/fs/resctrl/mon_data
$ tree mon_L3_00
mon_L3_00			<- 00 here is L3 cache id
├── llc_occupancy		\  These files provide legacy support
├── mbm_local_bytes		 > for non-SNC aware monitor apps
├── mbm_total_bytes		/  that expect data at L3 cache level
├── mon_sub_L3_00		<- 00 here is SNC node id
│   ├── llc_occupancy		\  These files are finer grained
│   ├── mbm_local_bytes		 > data from each SNC node
│   └── mbm_total_bytes		/
└── mon_sub_L3_01
    ├── llc_occupancy		\
    ├── mbm_local_bytes		 > As above, but for node 1.
    └── mbm_total_bytes		/

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h                   | 2 ++
 arch/x86/kernel/cpu/resctrl/internal.h    | 1 +
 arch/x86/kernel/cpu/resctrl/core.c        | 7 ++++++-
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 1 -
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 1 -
 5 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 64b6ad1b22a1..d733e1f6485d 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -96,6 +96,7 @@ struct rdt_ctrl_domain {
 /**
  * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
  * @hdr:		common header for different domain types
+ * @ci:			cache info for this domain
  * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
  * @mbm_total:		saved state for MBM total bandwidth
  * @mbm_local:		saved state for MBM local bandwidth
@@ -106,6 +107,7 @@ struct rdt_ctrl_domain {
  */
 struct rdt_mon_domain {
 	struct rdt_domain_hdr		hdr;
+	struct cacheinfo		*ci;
 	unsigned long			*rmid_busy_llc;
 	struct mbm_state		*mbm_total;
 	struct mbm_state		*mbm_local;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 135190e0711c..99f601d05f3b 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -2,6 +2,7 @@
 #ifndef _ASM_X86_RESCTRL_INTERNAL_H
 #define _ASM_X86_RESCTRL_INTERNAL_H
 
+#include <linux/cacheinfo.h>
 #include <linux/resctrl.h>
 #include <linux/sched.h>
 #include <linux/kernfs.h>
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index b86c525d0620..95ef8fe3cb50 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -19,7 +19,6 @@
 #include <linux/cpu.h>
 #include <linux/slab.h>
 #include <linux/err.h>
-#include <linux/cacheinfo.h>
 #include <linux/cpuhotplug.h>
 
 #include <asm/cpu_device_id.h>
@@ -608,6 +607,12 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	d = &hw_dom->d_resctrl;
 	d->hdr.id = id;
 	d->hdr.type = RESCTRL_MON_DOMAIN;
+	d->ci = get_cpu_cacheinfo_level(cpu, RESCTRL_L3_CACHE);
+	if (!d->ci) {
+		pr_warn_once("Can't find L3 cache for CPU:%d resource %s\n", cpu, r->name);
+		mon_domain_free(hw_dom);
+		return;
+	}
 	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 
 	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 70f0069b87d8..e69489d48625 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -11,7 +11,6 @@
 
 #define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
 
-#include <linux/cacheinfo.h>
 #include <linux/cpu.h>
 #include <linux/cpumask.h>
 #include <linux/debugfs.h>
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index a0a43dbe011b..869dd1973b5d 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -12,7 +12,6 @@
 
 #define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
 
-#include <linux/cacheinfo.h>
 #include <linux/cpu.h>
 #include <linux/debugfs.h>
 #include <linux/fs.h>
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH v20 08/18] x86/resctrl: Prepare for new Sub-NUMA Cluster (SNC) monitor files
  2024-06-10 18:35 ` [PATCH v20 08/18] x86/resctrl: Prepare for new Sub-NUMA Cluster (SNC) monitor files Tony Luck
@ 2024-06-20 21:22   ` Reinette Chatre
  0 siblings, 0 replies; 61+ messages in thread
From: Reinette Chatre @ 2024-06-20 21:22 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches

Hi Tony,

On 6/10/24 11:35 AM, Tony Luck wrote:
> When SNC is enabled monitoring data is collected at the SNC node
> granularity, but must be reported at L3-cache granularity for
> backwards compatibility in addition to reporting at the node
> level.
> 
> Add a "ci" field to the rdt_mon_domain structure to save the
> cache information about the enclosing L3 cache for the domain.
> This provides:
> 
> 1) The cache id which is needed to compose the name of the legacy
> monitoring directory, and to determine which domains should be
> summed to provide L3-scoped data.
> 
> 2) The shared_cpu_map which is needed to determine which CPUs can
> be used to read the RMID counters with the MSR interface.
> 
> This is the first step to an eventual goal of monitor reporting files
> like this (for a system with two SNC nodes per L3):
> 
> $ cd /sys/fs/resctrl/mon_data
> $ tree mon_L3_00
> mon_L3_00			<- 00 here is L3 cache id
> ├── llc_occupancy		\  These files provide legacy support
> ├── mbm_local_bytes		 > for non-SNC aware monitor apps
> ├── mbm_total_bytes		/  that expect data at L3 cache level
> ├── mon_sub_L3_00		<- 00 here is SNC node id
> │   ├── llc_occupancy		\  These files are finer grained
> │   ├── mbm_local_bytes		 > data from each SNC node
> │   └── mbm_total_bytes		/
> └── mon_sub_L3_01
>      ├── llc_occupancy		\
>      ├── mbm_local_bytes		 > As above, but for node 1.
>      └── mbm_total_bytes		/
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>   include/linux/resctrl.h                   | 2 ++
>   arch/x86/kernel/cpu/resctrl/internal.h    | 1 +
>   arch/x86/kernel/cpu/resctrl/core.c        | 7 ++++++-
>   arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 1 -
>   arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 1 -
>   5 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 64b6ad1b22a1..d733e1f6485d 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -96,6 +96,7 @@ struct rdt_ctrl_domain {
>   /**
>    * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
>    * @hdr:		common header for different domain types
> + * @ci:			cache info for this domain
>    * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
>    * @mbm_total:		saved state for MBM total bandwidth
>    * @mbm_local:		saved state for MBM local bandwidth
> @@ -106,6 +107,7 @@ struct rdt_ctrl_domain {
>    */
>   struct rdt_mon_domain {
>   	struct rdt_domain_hdr		hdr;
> +	struct cacheinfo		*ci;
>   	unsigned long			*rmid_busy_llc;
>   	struct mbm_state		*mbm_total;
>   	struct mbm_state		*mbm_local;

With struct cacheinfo used here I expected cacheinfo.h to be included in
include/linux/resctrl.h since it is now needed by resctrl fs?

Reinette

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v20 09/18] x86/resctrl: Add a new field to struct rmid_read for summation of domains
  2024-06-10 18:35 [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (7 preceding siblings ...)
  2024-06-10 18:35 ` [PATCH v20 08/18] x86/resctrl: Prepare for new Sub-NUMA Cluster (SNC) monitor files Tony Luck
@ 2024-06-10 18:35 ` Tony Luck
  2024-06-20 21:22   ` Reinette Chatre
  2024-06-10 18:35 ` [PATCH v20 10/18] x86/resctrl: Refactor mkdir_mondata_subdir() with a helper function Tony Luck
                   ` (9 subsequent siblings)
  18 siblings, 1 reply; 61+ messages in thread
From: Tony Luck @ 2024-06-10 18:35 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches, Tony Luck

When a user reads a monitor file rdtgroup_mondata_show() calls
mon_event_read() to package up all the required details into an rmid_read
structure which is passed across the smp_call*() infrastructure to code
that will read data from hardware and return the value (or error status)
in the rmid_read structure.

Sub-NUMA Cluster (SNC) mode adds files with new semantics. These require
the smp_call-ed code to sum event data from all domains that share an
L3 cache.

Add a pointer to the L3 "cacheinfo" structure to struct rmid_read
for the data collection routines to use to pick the domains to be
summed.

Reinette suggested that the rmid_read structure has become complex
enough to warrant documentation of each of its fields. Add the kerneldoc
documentation for struct rmid_read.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/resctrl/internal.h | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 99f601d05f3b..d29c7b58c151 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -145,12 +145,25 @@ union mon_data_bits {
 	} u;
 };
 
+/**
+ * struct rmid_read - Data passed across smp_call*() to read event count
+ * @rgrp:	Resctrl group
+ * @r:		Resource
+ * @d:		Domain. If NULL then sum all domains in @r sharing L3 @ci.id
+ * @evtid:	Which monitor event to read
+ * @first:	Initializes MBM counter when true
+ * @ci:		Cacheinfo for L3. Used when summing domains
+ * @err:	Return error indication
+ * @val:	Return value of event counter
+ * @arch_mon_ctx: Hardware monitor allocated for this read request (MPAM only)
+ */
 struct rmid_read {
 	struct rdtgroup		*rgrp;
 	struct rdt_resource	*r;
 	struct rdt_mon_domain	*d;
 	enum resctrl_event_id	evtid;
 	bool			first;
+	struct cacheinfo	*ci;
 	int			err;
 	u64			val;
 	void			*arch_mon_ctx;
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH v20 09/18] x86/resctrl: Add a new field to struct rmid_read for summation of domains
  2024-06-10 18:35 ` [PATCH v20 09/18] x86/resctrl: Add a new field to struct rmid_read for summation of domains Tony Luck
@ 2024-06-20 21:22   ` Reinette Chatre
  2024-06-20 22:42     ` Luck, Tony
  0 siblings, 1 reply; 61+ messages in thread
From: Reinette Chatre @ 2024-06-20 21:22 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches

Hi Tony,

On 6/10/24 11:35 AM, Tony Luck wrote:
> When a user reads a monitor file rdtgroup_mondata_show() calls
> mon_event_read() to package up all the required details into an rmid_read
> structure which is passed across the smp_call*() infrastructure to code
> that will read data from hardware and return the value (or error status)
> in the rmid_read structure.
> 
> Sub-NUMA Cluster (SNC) mode adds files with new semantics. These require
> the smp_call-ed code to sum event data from all domains that share an
> L3 cache.
> 
> Add a pointer to the L3 "cacheinfo" structure to struct rmid_read
> for the data collection routines to use to pick the domains to be
> summed.
> 
> Reinette suggested that the rmid_read structure has become complex
> enough to warrant documentation of each of its fields. Add the kerneldoc
> documentation for struct rmid_read.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>   arch/x86/kernel/cpu/resctrl/internal.h | 13 +++++++++++++
>   1 file changed, 13 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 99f601d05f3b..d29c7b58c151 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -145,12 +145,25 @@ union mon_data_bits {
>   	} u;
>   };
>   
> +/**
> + * struct rmid_read - Data passed across smp_call*() to read event count
> + * @rgrp:	Resctrl group
> + * @r:		Resource
> + * @d:		Domain. If NULL then sum all domains in @r sharing L3 @ci.id
> + * @evtid:	Which monitor event to read
> + * @first:	Initializes MBM counter when true
> + * @ci:		Cacheinfo for L3. Used when summing domains
> + * @err:	Return error indication
> + * @val:	Return value of event counter
> + * @arch_mon_ctx: Hardware monitor allocated for this read request (MPAM only)
> + */

Thank you for adding the kerneldoc. I understand that this file is not
consistent on how these kerneldoc are formatted, but could you please
pick whether you think sentences need to end with a period and then stick
to it in this portion?

>   struct rmid_read {
>   	struct rdtgroup		*rgrp;
>   	struct rdt_resource	*r;
>   	struct rdt_mon_domain	*d;
>   	enum resctrl_event_id	evtid;
>   	bool			first;
> +	struct cacheinfo	*ci;
>   	int			err;
>   	u64			val;
>   	void			*arch_mon_ctx;

Reinette


* RE: [PATCH v20 09/18] x86/resctrl: Add a new field to struct rmid_read for summation of domains
  2024-06-20 21:22   ` Reinette Chatre
@ 2024-06-20 22:42     ` Luck, Tony
  2024-06-21  1:59       ` Reinette Chatre
  0 siblings, 1 reply; 61+ messages in thread
From: Luck, Tony @ 2024-06-20 22:42 UTC (permalink / raw)
  To: Chatre, Reinette, Yu, Fenghua, Wieczor-Retman, Maciej,
	Peter Newman, James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86@kernel.org, linux-kernel@vger.kernel.org,
	patches@lists.linux.dev

> > When a user reads a monitor file rdtgroup_mondata_show() calls
> > mon_event_read() to package up all the required details into an rmid_read
> > structure which is passed across the smp_call*() infrastructure to code
> > that will read data from hardware and return the value (or error status)
> > in the rmid_read structure.
> >
> > Sub-NUMA Cluster (SNC) mode adds files with new semantics. These require
> > the smp_call-ed code to sum event data from all domains that share an
> > L3 cache.
> >
> > Add a pointer to the L3 "cacheinfo" structure to struct rmid_read
> > for the data collection routines to use to pick the domains to be
> > summed.
> >
> > Reinette suggested that the rmid_read structure has become complex
> > enough to warrant documentation of each of its fields. Add the kerneldoc
> > documentation for struct rmid_read.
> >
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > ---
> >   arch/x86/kernel/cpu/resctrl/internal.h | 13 +++++++++++++
> >   1 file changed, 13 insertions(+)
> >
> > diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> > index 99f601d05f3b..d29c7b58c151 100644
> > --- a/arch/x86/kernel/cpu/resctrl/internal.h
> > +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> > @@ -145,12 +145,25 @@ union mon_data_bits {
> >     } u;
> >   };
> >
> > +/**
> > + * struct rmid_read - Data passed across smp_call*() to read event count
> > + * @rgrp:  Resctrl group
> > + * @r:             Resource
> > + * @d:             Domain. If NULL then sum all domains in @r sharing L3 @ci.id
> > + * @evtid: Which monitor event to read
> > + * @first: Initializes MBM counter when true
> > + * @ci:            Cacheinfo for L3. Used when summing domains
> > + * @err:   Return error indication
> > + * @val:   Return value of event counter
> > + * @arch_mon_ctx: Hardware monitor allocated for this read request (MPAM only)
> > + */
>
> Thank you for adding the kerneldoc. I understand that this file is not
> consistent on how these kerneldoc are formatted, but could you please
> pick whether you think sentences need to end with a period and then stick
> to it in this portion?

This is about the @d and @ci entries that have a "sentence" ending with period,
and then more text that doesn't (matching other lines in this block).

Maybe some other punctuation to split the parts?  Do you like "colon"

* @d:         Domain: If NULL then sum all domains in @r sharing L3 @ci.id
* @ci:        Cacheinfo for L3: Used when summing domains

or maybe "dash"

* @d:         Domain - If NULL then sum all domains in @r sharing L3 @ci.id
* @ci:        Cacheinfo for L3 - Used when summing domains

Or something else?

>
> >   struct rmid_read {
> >     struct rdtgroup         *rgrp;
> >     struct rdt_resource     *r;
> >     struct rdt_mon_domain   *d;
> >     enum resctrl_event_id   evtid;
> >     bool                    first;
> > +   struct cacheinfo        *ci;
> >     int                     err;
> >     u64                     val;
> >     void                    *arch_mon_ctx;

-Tony


* Re: [PATCH v20 09/18] x86/resctrl: Add a new field to struct rmid_read for summation of domains
  2024-06-20 22:42     ` Luck, Tony
@ 2024-06-21  1:59       ` Reinette Chatre
  2024-06-21 16:07         ` Luck, Tony
  0 siblings, 1 reply; 61+ messages in thread
From: Reinette Chatre @ 2024-06-21  1:59 UTC (permalink / raw)
  To: Luck, Tony, Yu, Fenghua, Wieczor-Retman, Maciej, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86@kernel.org, linux-kernel@vger.kernel.org,
	patches@lists.linux.dev

Hi Tony,

On 6/20/24 3:42 PM, Luck, Tony wrote:
>>> When a user reads a monitor file rdtgroup_mondata_show() calls
>>> mon_event_read() to package up all the required details into an rmid_read
>>> structure which is passed across the smp_call*() infrastructure to code
>>> that will read data from hardware and return the value (or error status)
>>> in the rmid_read structure.
>>>
>>> Sub-NUMA Cluster (SNC) mode adds files with new semantics. These require
>>> the smp_call-ed code to sum event data from all domains that share an
>>> L3 cache.
>>>
>>> Add a pointer to the L3 "cacheinfo" structure to struct rmid_read
>>> for the data collection routines to use to pick the domains to be
>>> summed.
>>>
>>> Reinette suggested that the rmid_read structure has become complex
>>> enough to warrant documentation of each of its fields. Add the kerneldoc
>>> documentation for struct rmid_read.
>>>
>>> Signed-off-by: Tony Luck <tony.luck@intel.com>
>>> ---
>>>    arch/x86/kernel/cpu/resctrl/internal.h | 13 +++++++++++++
>>>    1 file changed, 13 insertions(+)
>>>
>>> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
>>> index 99f601d05f3b..d29c7b58c151 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/internal.h
>>> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
>>> @@ -145,12 +145,25 @@ union mon_data_bits {
>>>      } u;
>>>    };
>>>
>>> +/**
>>> + * struct rmid_read - Data passed across smp_call*() to read event count
>>> + * @rgrp:  Resctrl group
>>> + * @r:             Resource
>>> + * @d:             Domain. If NULL then sum all domains in @r sharing L3 @ci.id
>>> + * @evtid: Which monitor event to read
>>> + * @first: Initializes MBM counter when true
>>> + * @ci:            Cacheinfo for L3. Used when summing domains
>>> + * @err:   Return error indication
>>> + * @val:   Return value of event counter
>>> + * @arch_mon_ctx: Hardware monitor allocated for this read request (MPAM only)
>>> + */
>>
>> Thank you for adding the kerneldoc. I understand that this file is not
>> consistent on how these kerneldoc are formatted, but could you please
>> pick whether you think sentences need to end with a period and then stick
>> to it in this portion?
> 
> This is about the @d and @ci entries that have a "sentence" ending with period,
> and then more text that doesn't (matching other lines in this block).

Correct.

> 
> Maybe some other punctuation to split the parts?  Do you like "colon"
> 
> * @d:         Domain: If NULL then sum all domains in @r sharing L3 @ci.id
> * @ci:        Cacheinfo for L3: Used when summing domains
> 
> or maybe "dash"
> 
> * @d:         Domain - If NULL then sum all domains in @r sharing L3 @ci.id
> * @ci:        Cacheinfo for L3 - Used when summing domains
> 
> Or something else?

I do not think there is a need to introduce new syntax. It will be easiest
to just have all sentences end with a period. The benefit of this is that it
encourages useful full sentence descriptions. For example, below is a _draft_ of
such a description. Please note that I wrote it quickly and hope it will be improved
(and corrected!). The goal of it being here is to give ideas on how this kerneldoc
can be written to be useful and consistent.

/**
  * struct rmid_read - Data passed across smp_call*() to read event count
  * @rgrp:  Resource group for which the counter is being read. If it is a parent
  *	   resource group then its event count is summed with the count from all
  *	   its child resource groups.
  * @r:	   Resource describing the properties of the event being read.
  * @d:	   Domain that the counter should be read from. If NULL then sum all
  *	   domains in @r sharing L3 @ci.id.
  * @evtid: Which monitor event to read.
  * @first: Initialize MBM counter when true.
  * @ci:    Cacheinfo for L3. Only set when @d is NULL. Used when summing domains.
  * @err:   Error encountered when reading counter.
  * @val:   Returned value of event counter. If @rgrp is a parent resource group,
  *	   @val contains the sum of event counts from its child resource groups.
  *	   If @d is NULL, @val contains the sum of all domains in @r sharing @ci.id,
  *	   (summed across child resource groups if @rgrp is a parent resource group).
  * @arch_mon_ctx: Hardware monitor allocated for this read request (MPAM only).
  */

Reinette


* RE: [PATCH v20 09/18] x86/resctrl: Add a new field to struct rmid_read for summation of domains
  2024-06-21  1:59       ` Reinette Chatre
@ 2024-06-21 16:07         ` Luck, Tony
  2024-06-21 17:10           ` Reinette Chatre
  0 siblings, 1 reply; 61+ messages in thread
From: Luck, Tony @ 2024-06-21 16:07 UTC (permalink / raw)
  To: Chatre, Reinette, Yu, Fenghua, Wieczor-Retman, Maciej,
	Peter Newman, James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86@kernel.org, linux-kernel@vger.kernel.org,
	patches@lists.linux.dev

> I do not think there is a need to introduce new syntax. It will be easiest
> to just have all sentences end with a period. The benefit of this is that it
> encourages useful full sentence descriptions. For example, below is a _draft_ of
> such a description. Please note that I wrote it quickly and hope it will be improved
> (and corrected!). The goal of it being here is to give ideas on how this kerneldoc
> can be written to be useful and consistent.
>
> /**
>   * struct rmid_read - Data passed across smp_call*() to read event count

Should this end with a period too?  In the resctrl code a few cases use ".",
most don't. So no period matches resctrl style. But the example in
Documentation/doc-guide/kernel-doc.rst does end with a period.

>   * @rgrp:  Resource group for which the counter is being read. If it is a parent
>   *      resource group then its event count is summed with the count from all
>   *      its child resource groups.
>   * @r:          Resource describing the properties of the event being read.
>   * @d:          Domain that the counter should be read from. If NULL then sum all
>   *      domains in @r sharing L3 @ci.id.
>   * @evtid: Which monitor event to read.
>   * @first: Initialize MBM counter when true.
>   * @ci:    Cacheinfo for L3. Only set when @d is NULL. Used when summing domains.
>   * @err:   Error encountered when reading counter.
>   * @val:   Returned value of event counter. If @rgrp is a parent resource group,
>   *      @val contains the sum of event counts from its child resource groups.
>   *      If @d is NULL, @val contains the sum of all domains in @r sharing @ci.id,
>   *      (summed across child resource groups if @rgrp is a parent resource group).
>   * @arch_mon_ctx: Hardware monitor allocated for this read request (MPAM only).
>   */

This all looks good to me.  Since you have supplied 99% of the content for this
patch in the series I should assign authorship to you (which requires your
Signed-off-by tag). Is that OK? Should I split into two parts? First to add the
kerneldoc (by you). Second to add the new field (by me).

-Tony


* Re: [PATCH v20 09/18] x86/resctrl: Add a new field to struct rmid_read for summation of domains
  2024-06-21 16:07         ` Luck, Tony
@ 2024-06-21 17:10           ` Reinette Chatre
  0 siblings, 0 replies; 61+ messages in thread
From: Reinette Chatre @ 2024-06-21 17:10 UTC (permalink / raw)
  To: Luck, Tony, Yu, Fenghua, Wieczor-Retman, Maciej, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86@kernel.org, linux-kernel@vger.kernel.org,
	patches@lists.linux.dev

Hi Tony,

On 6/21/24 9:07 AM, Luck, Tony wrote:
>> I do not think there is a need to introduce new syntax. It will be easiest
>> to just have all sentences end with a period. The benefit of this is that it
>> encourages useful full sentence descriptions. For example, below is a _draft_ of
>> such a description. Please note that I wrote it quickly and hope it will be improved
>> (and corrected!). The goal of it being here is to give ideas on how this kerneldoc
>> can be written to be useful and consistent.
>>
>> /**
>>    * struct rmid_read - Data passed across smp_call*() to read event count
> 
> Should this end with a period too?  In the resctrl code a few cases use ".",
> most don't. So no period matches resctrl style. But the example in
> Documentation/doc-guide/kernel-doc.rst does end with a period.

Having a period would be ideal, but since that does not match the existing
style it may look out of place. I thus do not have a strong opinion here.

> 
>>    * @rgrp:  Resource group for which the counter is being read. If it is a parent
>>    *      resource group then its event count is summed with the count from all
>>    *      its child resource groups.
>>    * @r:          Resource describing the properties of the event being read.
>>    * @d:          Domain that the counter should be read from. If NULL then sum all
>>    *      domains in @r sharing L3 @ci.id.
>>    * @evtid: Which monitor event to read.
>>    * @first: Initialize MBM counter when true.
>>    * @ci:    Cacheinfo for L3. Only set when @d is NULL. Used when summing domains.
>>    * @err:   Error encountered when reading counter.
>>    * @val:   Returned value of event counter. If @rgrp is a parent resource group,
>>    *      @val contains the sum of event counts from its child resource groups.

contains -> includes (to indicate it contains the count from parent as well as children)

>>    *      If @d is NULL, @val contains the sum of all domains in @r sharing @ci.id,
>>    *      (summed across child resource groups if @rgrp is a parent resource group).
>>    * @arch_mon_ctx: Hardware monitor allocated for this read request (MPAM only).
>>    */
> 
> This all looks good to me.  Since you have supplied 99% of the content for this
> patch in the series I should assign authorship to you (which requires your
> Signed-off-by tag). Is that OK? Should I split into two parts? First to add the
> kerneldoc (by you). Second to add the new field (by me).

No need to split the patch. You can keep authorship. You are welcome to add:

Co-developed-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>

Reinette



* [PATCH v20 10/18] x86/resctrl: Refactor mkdir_mondata_subdir() with a helper function
  2024-06-10 18:35 [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (8 preceding siblings ...)
  2024-06-10 18:35 ` [PATCH v20 09/18] x86/resctrl: Add a new field to struct rmid_read for summation of domains Tony Luck
@ 2024-06-10 18:35 ` Tony Luck
  2024-06-20 21:23   ` Reinette Chatre
  2024-06-10 18:35 ` [PATCH v20 11/18] x86/resctrl: Allocate a new field in union mon_data_bits Tony Luck
                   ` (8 subsequent siblings)
  18 siblings, 1 reply; 61+ messages in thread
From: Tony Luck @ 2024-06-10 18:35 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches, Tony Luck

In Sub-NUMA Cluster (SNC) mode Linux must create the monitor
files in the original "mon_L3_XX" directories and also in each
of the "mon_sub_L3_YY" directories.

Refactor mkdir_mondata_subdir() to move the creation of monitoring files
into a helper function to avoid the need to duplicate code later.

No functional change.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 45 ++++++++++++++++----------
 1 file changed, 28 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 869dd1973b5d..66acbad1c585 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3021,14 +3021,37 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
 	}
 }
 
+static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
+			     struct rdt_resource *r, struct rdtgroup *prgrp)
+{
+	union mon_data_bits priv;
+	struct mon_evt *mevt;
+	struct rmid_read rr;
+	int ret;
+
+	if (WARN_ON(list_empty(&r->evt_list)))
+		return -EPERM;
+
+	priv.u.rid = r->rid;
+	priv.u.domid = d->hdr.id;
+	list_for_each_entry(mevt, &r->evt_list, list) {
+		priv.u.evtid = mevt->evtid;
+		ret = mon_addfile(kn, mevt->name, priv.priv);
+		if (ret)
+			return ret;
+
+		if (is_mbm_event(mevt->evtid))
+			mon_event_read(&rr, r, d, prgrp, mevt->evtid, true);
+	}
+
+	return 0;
+}
+
 static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 				struct rdt_mon_domain *d,
 				struct rdt_resource *r, struct rdtgroup *prgrp)
 {
-	union mon_data_bits priv;
 	struct kernfs_node *kn;
-	struct mon_evt *mevt;
-	struct rmid_read rr;
 	char name[32];
 	int ret;
 
@@ -3042,22 +3065,10 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 	if (ret)
 		goto out_destroy;
 
-	if (WARN_ON(list_empty(&r->evt_list))) {
-		ret = -EPERM;
+	ret = mon_add_all_files(kn, d, r, prgrp);
+	if (ret)
 		goto out_destroy;
-	}
 
-	priv.u.rid = r->rid;
-	priv.u.domid = d->hdr.id;
-	list_for_each_entry(mevt, &r->evt_list, list) {
-		priv.u.evtid = mevt->evtid;
-		ret = mon_addfile(kn, mevt->name, priv.priv);
-		if (ret)
-			goto out_destroy;
-
-		if (is_mbm_event(mevt->evtid))
-			mon_event_read(&rr, r, d, prgrp, mevt->evtid, true);
-	}
 	kernfs_activate(kn);
 	return 0;
 
-- 
2.45.0



* Re: [PATCH v20 10/18] x86/resctrl: Refactor mkdir_mondata_subdir() with a helper function
  2024-06-10 18:35 ` [PATCH v20 10/18] x86/resctrl: Refactor mkdir_mondata_subdir() with a helper function Tony Luck
@ 2024-06-20 21:23   ` Reinette Chatre
  0 siblings, 0 replies; 61+ messages in thread
From: Reinette Chatre @ 2024-06-20 21:23 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches

Hi Tony,

On 6/10/24 11:35 AM, Tony Luck wrote:
> In Sub-NUMA Cluster (SNC) mode Linux must create the monitor
> files in the original "mon_L3_XX" directories and also in each
> of the "mon_sub_L3_YY" directories.
> 
> Refactor mkdir_mondata_subdir() to move the creation of monitoring files
> into a helper function to avoid the need to duplicate code later.
> 
> No functional change.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---

Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Reinette


* [PATCH v20 11/18] x86/resctrl: Allocate a new field in union mon_data_bits
  2024-06-10 18:35 [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (9 preceding siblings ...)
  2024-06-10 18:35 ` [PATCH v20 10/18] x86/resctrl: Refactor mkdir_mondata_subdir() with a helper function Tony Luck
@ 2024-06-10 18:35 ` Tony Luck
  2024-06-20 21:28   ` Reinette Chatre
  2024-06-10 18:35 ` [PATCH v20 12/18] x86/resctrl: Create Sub-NUMA Cluster (SNC) monitor files Tony Luck
                   ` (7 subsequent siblings)
  18 siblings, 1 reply; 61+ messages in thread
From: Tony Luck @ 2024-06-10 18:35 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches, Tony Luck

When Sub-NUMA Cluster (SNC) mode is enabled the legacy monitor reporting
files must report the sum of the data from all of the SNC nodes that
share the L3 cache that is referenced by the monitor file.

Resctrl squeezes all the attributes of these files into 32 bits so they
can be stored in the "priv" field of struct kernfs_node.

Currently only three monitor events are defined by enum resctrl_event_id
so reducing it from 8 bits to 7 bits still provides more than enough
space to represent all the known event types. But note that this choice
was arbitrary. The "rid" field is also far wider than needed for the
current number of resource id types.  This structure is purely internal
to resctrl, no ABI issues with modifying it. Subsequent changes may
rearrange the allocation of bits between each of the fields as needed.

Give the bit to a new "sum" field that indicates that reading this file
must sum across SNC nodes. This bit also indicates that the domid field
is the id of an L3 cache (instead of a domain id) to find which domains
must be summed.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/resctrl/internal.h | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index d29c7b58c151..77da29ced7eb 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -133,14 +133,20 @@ struct mon_evt {
  *                     as kernfs private data
  * @rid:               Resource id associated with the event file
  * @evtid:             Event id associated with the event file
- * @domid:             The domain to which the event file belongs
+ * @sum:               Set when event must be summed across multiple
+ *                     domains.
+ * @domid:             When @sum is zero this is the domain to which
+ *                     the event file belongs. When @sum is one this
+ *                     is the id of the L3 cache that all domains to be
+ *                     summed share.
  * @u:                 Name of the bit fields struct
  */
 union mon_data_bits {
 	void *priv;
 	struct {
 		unsigned int rid		: 10;
-		enum resctrl_event_id evtid	: 8;
+		enum resctrl_event_id evtid	: 7;
+		unsigned int sum		: 1;
 		unsigned int domid		: 14;
 	} u;
 };
-- 
2.45.0
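
As an aside for readers, the bit budget described in the commit message can
be tried out in user space. The sketch below mirrors the field widths from
the diff (10 + 7 + 1 + 14 = 32 bits) but is otherwise hypothetical:
union demo_mon_data_bits, demo_pack() and demo_unpack() are illustrative
names, and evtid is declared as a plain unsigned int here rather than as
enum resctrl_event_id. Bit-field layout is implementation-defined in C, so
treat the exact bit positions as an assumption about common compilers.

```c
#include <stddef.h>

/* Hypothetical copy of the patched layout: 10 + 7 + 1 + 14 = 32 bits. */
union demo_mon_data_bits {
	void *priv;
	struct {
		unsigned int rid	: 10;
		unsigned int evtid	: 7;
		unsigned int sum	: 1;
		unsigned int domid	: 14;
	} u;
};

/* Pack the attributes into a pointer-sized cookie, as done for "priv". */
static void *demo_pack(unsigned int rid, unsigned int evtid,
		       unsigned int sum, unsigned int domid)
{
	union demo_mon_data_bits b;

	b.priv = NULL;		/* clear every bit, including unused ones */
	b.u.rid = rid;
	b.u.evtid = evtid;
	b.u.sum = sum;		/* 1: domid holds an L3 cache id */
	b.u.domid = domid;

	return b.priv;
}

/* Recover the bit fields from the cookie. */
static union demo_mon_data_bits demo_unpack(void *priv)
{
	union demo_mon_data_bits b;

	b.priv = priv;
	return b;
}
```

A cookie built with demo_pack(1, 3, 1, 5) unpacks to rid 1, evtid 3, sum 1
and domid 5. With 7 bits, evtid can still hold values up to 127, which is
far more than the three events the commit message mentions.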



* Re: [PATCH v20 11/18] x86/resctrl: Allocate a new field in union mon_data_bits
  2024-06-10 18:35 ` [PATCH v20 11/18] x86/resctrl: Allocate a new field in union mon_data_bits Tony Luck
@ 2024-06-20 21:28   ` Reinette Chatre
  0 siblings, 0 replies; 61+ messages in thread
From: Reinette Chatre @ 2024-06-20 21:28 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches

Hi Tony,

(looks like we are stuck with "allocate")

On 6/10/24 11:35 AM, Tony Luck wrote:
> When Sub-NUMA Cluster (SNC) mode is enabled the legacy monitor reporting
> files must report the sum of the data from all of the SNC nodes that
> share the L3 cache that is referenced by the monitor file.
> 
> Resctrl squeezes all the attributes of these files into 32 bits so they
> can be stored in the "priv" field of struct kernfs_node.
> 
> Currently only three monitor events are defined by enum resctrl_event_id
> so reducing it from 8 bits to 7 bits still provides more than enough
> space to represent all the known event types. But note that this choice
> was arbitrary. The "rid" field is also far wider than needed for the
> current number of resource id types.  This structure is purely internal
> to resctrl, no ABI issues with modifying it. Subsequent changes may
> rearrange the allocation of bits between each of the fields as needed.
> 
> Give the bit to a new "sum" field that indicates that reading this file
> must sum across SNC nodes. This bit also indicates that the domid field
> is the id of an L3 cache (instead of a domain id) to find which domains
> must be summed.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>   arch/x86/kernel/cpu/resctrl/internal.h | 10 ++++++++--
>   1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index d29c7b58c151..77da29ced7eb 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -133,14 +133,20 @@ struct mon_evt {
>    *                     as kernfs private data
>    * @rid:               Resource id associated with the event file
>    * @evtid:             Event id associated with the event file
> - * @domid:             The domain to which the event file belongs
> + * @sum:               Set when event must be summed across multiple
> + *                     domains.
> + * @domid:             When @sum is zero this is the domain to which
> + *                     the event file belongs. When @sum is one this
> + *                     is the id of the L3 cache that all domains to be
> + *                     summed share.
>    * @u:                 Name of the bit fields struct
>    */

It is not obvious to me how to best maintain consistency with existing
kerneldoc. Perhaps just let @sum not end in period? No strong opinion here.

Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Reinette


* [PATCH v20 12/18] x86/resctrl: Create Sub-NUMA Cluster (SNC) monitor files
  2024-06-10 18:35 [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (10 preceding siblings ...)
  2024-06-10 18:35 ` [PATCH v20 11/18] x86/resctrl: Allocate a new field in union mon_data_bits Tony Luck
@ 2024-06-10 18:35 ` Tony Luck
  2024-06-20 21:30   ` Reinette Chatre
  2024-06-10 18:35 ` [PATCH v20 13/18] x86/resctrl: Handle removing directories in Sub-NUMA Cluster (SNC) mode Tony Luck
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 61+ messages in thread
From: Tony Luck @ 2024-06-10 18:35 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches, Tony Luck

When SNC mode is enabled, create subdirectories and files to monitor
at the SNC node granularity. Legacy behavior is preserved by tagging
the monitor files at the L3 granularity with the "sum" attribute.
When the user reads these files the kernel will read monitor data
from all SNC nodes that share the same L3 cache instance and return
the aggregated value to the user.

Note that the "domid" field for files that must sum across SNC domains
has the L3 cache instance id, while non-summing files use the domain id.

The "sum" files do not need to make a call to mon_event_read() to
initialize the MBM counters. This will be handled by initializing the
individual SNC nodes that share the L3.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 62 +++++++++++++++++++-------
 1 file changed, 46 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 66acbad1c585..fc7f3f139800 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3022,7 +3022,8 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
 }
 
 static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
-			     struct rdt_resource *r, struct rdtgroup *prgrp)
+			     struct rdt_resource *r, struct rdtgroup *prgrp,
+			     bool do_sum)
 {
 	union mon_data_bits priv;
 	struct mon_evt *mevt;
@@ -3033,14 +3034,15 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
 		return -EPERM;
 
 	priv.u.rid = r->rid;
-	priv.u.domid = d->hdr.id;
+	priv.u.domid = do_sum ? d->ci->id : d->hdr.id;
+	priv.u.sum = do_sum;
 	list_for_each_entry(mevt, &r->evt_list, list) {
 		priv.u.evtid = mevt->evtid;
 		ret = mon_addfile(kn, mevt->name, priv.priv);
 		if (ret)
 			return ret;
 
-		if (is_mbm_event(mevt->evtid))
+		if (!do_sum && is_mbm_event(mevt->evtid))
 			mon_event_read(&rr, r, d, prgrp, mevt->evtid, true);
 	}
 
@@ -3051,23 +3053,51 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 				struct rdt_mon_domain *d,
 				struct rdt_resource *r, struct rdtgroup *prgrp)
 {
-	struct kernfs_node *kn;
+	struct kernfs_node *kn, *ckn;
 	char name[32];
-	int ret;
+	bool snc_mode;
+	int ret = 0;
 
-	sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
-	/* create the directory */
-	kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
-	if (IS_ERR(kn))
-		return PTR_ERR(kn);
+	lockdep_assert_held(&rdtgroup_mutex);
 
-	ret = rdtgroup_kn_set_ugid(kn);
-	if (ret)
-		goto out_destroy;
+	snc_mode = r->mon_scope != RESCTRL_L3_CACHE;
+	sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
+	kn = kernfs_find_and_get(parent_kn, name);
+	if (kn) {
+		/*
+		 * rdtgroup_mutex will prevent this directory from being
+		 * removed. No need to keep this hold.
+		 */
+		kernfs_put(kn);
+	} else {
+		kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
+		if (IS_ERR(kn))
+			return PTR_ERR(kn);
 
-	ret = mon_add_all_files(kn, d, r, prgrp);
-	if (ret)
-		goto out_destroy;
+		ret = rdtgroup_kn_set_ugid(kn);
+		if (ret)
+			goto out_destroy;
+		ret = mon_add_all_files(kn, d, r, prgrp, snc_mode);
+		if (ret)
+			goto out_destroy;
+	}
+
+	if (snc_mode) {
+		sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
+		ckn = kernfs_create_dir(kn, name, parent_kn->mode, prgrp);
+		if (IS_ERR(ckn)) {
+			ret = -EINVAL;
+			goto out_destroy;
+		}
+
+		ret = rdtgroup_kn_set_ugid(ckn);
+		if (ret)
+			goto out_destroy;
+
+		ret = mon_add_all_files(ckn, d, r, prgrp, false);
+		if (ret)
+			goto out_destroy;
+	}
 
 	kernfs_activate(kn);
 	return 0;
-- 
2.45.0



* Re: [PATCH v20 12/18] x86/resctrl: Create Sub-NUMA Cluster (SNC) monitor files
  2024-06-10 18:35 ` [PATCH v20 12/18] x86/resctrl: Create Sub-NUMA Cluster (SNC) monitor files Tony Luck
@ 2024-06-20 21:30   ` Reinette Chatre
  0 siblings, 0 replies; 61+ messages in thread
From: Reinette Chatre @ 2024-06-20 21:30 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches

Hi Tony,

On 6/10/24 11:35 AM, Tony Luck wrote:
> When SNC mode is enabled, create subdirectories and files to monitor
> at the SNC node granularity. Legacy behavior is preserved by tagging
> the monitor files at the L3 granularity with the "sum" attribute.
> When the user reads these files the kernel will read monitor data
> from all SNC nodes that share the same L3 cache instance and return
> the aggregated value to the user.
> 
> Note that the "domid" field for files that must sum across SNC domains
> has the L3 cache instance id, while non-summing files use the domain id.
> 
> The "sum" files do not need to make a call to mon_event_read() to
> initialize the MBM counters. This will be handled by initializing the
> individual SNC nodes that share the L3.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>   arch/x86/kernel/cpu/resctrl/rdtgroup.c | 62 +++++++++++++++++++-------
>   1 file changed, 46 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 66acbad1c585..fc7f3f139800 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -3022,7 +3022,8 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
>   }
>   
>   static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
> -			     struct rdt_resource *r, struct rdtgroup *prgrp)
> +			     struct rdt_resource *r, struct rdtgroup *prgrp,
> +			     bool do_sum)
>   {
>   	union mon_data_bits priv;
>   	struct mon_evt *mevt;
> @@ -3033,14 +3034,15 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
>   		return -EPERM;
>   
>   	priv.u.rid = r->rid;
> -	priv.u.domid = d->hdr.id;
> +	priv.u.domid = do_sum ? d->ci->id : d->hdr.id;
> +	priv.u.sum = do_sum;
>   	list_for_each_entry(mevt, &r->evt_list, list) {
>   		priv.u.evtid = mevt->evtid;
>   		ret = mon_addfile(kn, mevt->name, priv.priv);
>   		if (ret)
>   			return ret;
>   
> -		if (is_mbm_event(mevt->evtid))
> +		if (!do_sum && is_mbm_event(mevt->evtid))
>   			mon_event_read(&rr, r, d, prgrp, mevt->evtid, true);
>   	}
>   
> @@ -3051,23 +3053,51 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
>   				struct rdt_mon_domain *d,
>   				struct rdt_resource *r, struct rdtgroup *prgrp)
>   {
> -	struct kernfs_node *kn;
> +	struct kernfs_node *kn, *ckn;
>   	char name[32];
> -	int ret;
> +	bool snc_mode;
> +	int ret = 0;
>   
> -	sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
> -	/* create the directory */
> -	kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
> -	if (IS_ERR(kn))
> -		return PTR_ERR(kn);
> +	lockdep_assert_held(&rdtgroup_mutex);
>   
> -	ret = rdtgroup_kn_set_ugid(kn);
> -	if (ret)
> -		goto out_destroy;
> +	snc_mode = r->mon_scope != RESCTRL_L3_CACHE;

I think that testing that it _should_ be of particular scope will be
easier to understand than testing what it should not be. Can this instead
be a positive check of:
	snc_mode = r->mon_scope == RESCTRL_L3_NODE;

> +	sprintf(name, "mon_%s_%02d", r->name, d->ci->id);

I find this to be too subtle and potentially confusing since it uses
d->ci->id interchangeably for both SNC and non-SNC mode. I understand
that in non-SNC mode the domain id will be the same as the cache id
but I would prefer that the code use the data structures as intended
instead of relying on backdoor assumptions. Something like:

	sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci->id : d->hdr.id);
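
To make the naming concrete, here is a minimal user-space sketch of the
two directory names involved. It is not code from the patch: struct
demo_domain, demo_top_name() and demo_sub_name() are hypothetical
stand-ins, with cache_id and domain_id playing the roles of d->ci->id
and d->hdr.id.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the two ids carried by a monitor domain. */
struct demo_domain {
	int cache_id;	/* plays the role of d->ci->id */
	int domain_id;	/* plays the role of d->hdr.id */
};

/* Name of the L3-level directory, e.g. "mon_L3_00". */
static int demo_top_name(char *buf, size_t len, const char *res,
			 const struct demo_domain *d, bool snc_mode)
{
	assert(buf && res && d);
	/* Use the cache id only when SNC is active, else the domain id. */
	return snprintf(buf, len, "mon_%s_%02d", res,
			snc_mode ? d->cache_id : d->domain_id);
}

/* Name of the per-SNC-node directory, e.g. "mon_sub_L3_01". */
static int demo_sub_name(char *buf, size_t len, const char *res,
			 const struct demo_domain *d)
{
	assert(buf && res && d);
	return snprintf(buf, len, "mon_sub_%s_%02d", res, d->domain_id);
}
```

With SNC enabled, several domains share one cache id, so they all map
to the same top-level name and differ only in the "mon_sub_" name.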

> +	kn = kernfs_find_and_get(parent_kn, name);
> +	if (kn) {
> +		/*
> +		 * rdtgroup_mutex will prevent this directory from being
> +		 * removed. No need to keep this hold.
> +		 */
> +		kernfs_put(kn);
> +	} else {
> +		kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
> +		if (IS_ERR(kn))
> +			return PTR_ERR(kn);
>   
> -	ret = mon_add_all_files(kn, d, r, prgrp);
> -	if (ret)
> -		goto out_destroy;
> +		ret = rdtgroup_kn_set_ugid(kn);
> +		if (ret)
> +			goto out_destroy;
> +		ret = mon_add_all_files(kn, d, r, prgrp, snc_mode);
> +		if (ret)
> +			goto out_destroy;
> +	}
> +
> +	if (snc_mode) {
> +		sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
> +		ckn = kernfs_create_dir(kn, name, parent_kn->mode, prgrp);
> +		if (IS_ERR(ckn)) {
> +			ret = -EINVAL;
> +			goto out_destroy;
> +		}
> +
> +		ret = rdtgroup_kn_set_ugid(ckn);
> +		if (ret)
> +			goto out_destroy;
> +
> +		ret = mon_add_all_files(ckn, d, r, prgrp, false);
> +		if (ret)
> +			goto out_destroy;
> +	}
>   
>   	kernfs_activate(kn);
>   	return 0;

Reinette


* [PATCH v20 13/18] x86/resctrl: Handle removing directories in Sub-NUMA Cluster (SNC) mode
  2024-06-10 18:35 [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (11 preceding siblings ...)
  2024-06-10 18:35 ` [PATCH v20 12/18] x86/resctrl: Create Sub-NUMA Cluster (SNC) monitor files Tony Luck
@ 2024-06-10 18:35 ` Tony Luck
  2024-06-20 21:30   ` Reinette Chatre
  2024-06-10 18:35 ` [PATCH v20 14/18] x86/resctrl: Fill out rmid_read structure for smp_call*() to read a counter Tony Luck
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 61+ messages in thread
From: Tony Luck @ 2024-06-10 18:35 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches, Tony Luck

In SNC mode there are multiple subdirectories in each L3 level monitor
directory (one for each SNC node). If all the CPUs in an SNC node are
taken offline, just remove the SNC directory for that node. In
non-SNC mode, or when the last SNC node directory is removed, also
remove the L3 monitor directory.
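
The removal rule can be sketched in isolation as follows. This is an
illustration only, assuming the decision depends solely on how many
subdirectories the L3 level directory currently holds; enum
demo_rm_action and demo_pick_removal() are hypothetical names, not part
of the patch.

```c
#include <assert.h>

enum demo_rm_action {
	DEMO_RM_L3_DIR,		/* remove the whole L3 level directory */
	DEMO_RM_SNC_SUBDIR,	/* remove only the SNC node subdirectory */
};

/*
 * subdirs: number of child directories under the L3 level monitor
 * directory. In non-SNC mode there are none; in SNC mode there is one
 * for each SNC node that still has CPUs online.
 */
static enum demo_rm_action demo_pick_removal(int subdirs)
{
	assert(subdirs >= 0);
	/* Zero or one child left: nothing else keeps the parent alive. */
	return subdirs <= 1 ? DEMO_RM_L3_DIR : DEMO_RM_SNC_SUBDIR;
}
```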

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 41 +++++++++++++++++++++-----
 1 file changed, 34 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index fc7f3f139800..5142ce43ac13 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3004,20 +3004,47 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
 
 /*
  * Remove all subdirectories of mon_data of ctrl_mon groups
- * and monitor groups with given domain id.
+ * and monitor groups for the given domain.
  */
 static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
-					   unsigned int dom_id)
+					   struct rdt_mon_domain *d)
 {
 	struct rdtgroup *prgrp, *crgrp;
+	struct kernfs_node *kn;
+	char subname[32];
 	char name[32];
 
+	sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
+	if (r->mon_scope != RESCTRL_L3_CACHE) {
+		/*
+		 * SNC mode: Unless the last domain is being removed must
+		 * just remove the SNC subdomain.
+		 */
+		sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id);
+	}
+
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
-		sprintf(name, "mon_%s_%02d", r->name, dom_id);
-		kernfs_remove_by_name(prgrp->mon.mon_data_kn, name);
+		kn = kernfs_find_and_get(prgrp->mon.mon_data_kn, name);
+		if (!kn)
+			continue;
+		kernfs_put(kn);
+
+		if (kn->dir.subdirs <= 1)
+			kernfs_remove(kn);
+		else
+			kernfs_remove_by_name(kn, subname);
 
-		list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list)
-			kernfs_remove_by_name(crgrp->mon.mon_data_kn, name);
+		list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list) {
+			kn = kernfs_find_and_get(crgrp->mon.mon_data_kn, name);
+			if (!kn)
+				continue;
+			kernfs_put(kn);
+
+			if (kn->dir.subdirs <= 1)
+				kernfs_remove(kn);
+			else
+				kernfs_remove_by_name(kn, subname);
+		}
 	}
 }
 
@@ -3987,7 +4014,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
 	 * per domain monitor data directories.
 	 */
 	if (resctrl_mounted && resctrl_arch_mon_capable())
-		rmdir_mondata_subdir_allrdtgrp(r, d->hdr.id);
+		rmdir_mondata_subdir_allrdtgrp(r, d);
 
 	if (is_mbm_enabled())
 		cancel_delayed_work(&d->mbm_over);
-- 
2.45.0



* Re: [PATCH v20 13/18] x86/resctrl: Handle removing directories in Sub-NUMA Cluster (SNC) mode
  2024-06-10 18:35 ` [PATCH v20 13/18] x86/resctrl: Handle removing directories in Sub-NUMA Cluster (SNC) mode Tony Luck
@ 2024-06-20 21:30   ` Reinette Chatre
  0 siblings, 0 replies; 61+ messages in thread
From: Reinette Chatre @ 2024-06-20 21:30 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches

Hi Tony,

On 6/10/24 11:35 AM, Tony Luck wrote:
> In SNC mode there are multiple subdirectories in each L3 level monitor
> directory (one for each SNC node). If all the CPUs in an SNC node are
> taken offline, just remove the SNC  directory for that node. In

(nit: watch for random extra spaces)

> non-SNC mode, or when the last SNC node directory is removed, also
> remove the L3 monitor directory.

Perhaps drop the "also" since it is not relevant to non-SNC mode?

> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>   arch/x86/kernel/cpu/resctrl/rdtgroup.c | 41 +++++++++++++++++++++-----
>   1 file changed, 34 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index fc7f3f139800..5142ce43ac13 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -3004,20 +3004,47 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
>   
>   /*
>    * Remove all subdirectories of mon_data of ctrl_mon groups
> - * and monitor groups with given domain id.
> + * and monitor groups for the given domain.
>    */
>   static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> -					   unsigned int dom_id)
> +					   struct rdt_mon_domain *d)
>   {
>   	struct rdtgroup *prgrp, *crgrp;
> +	struct kernfs_node *kn;
> +	char subname[32];
>   	char name[32];
>   
> +	sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
> +	if (r->mon_scope != RESCTRL_L3_CACHE) {

Same comments about positive check and subtle assignment as in previous
patch.

> +		/*
> +		 * SNC mode: Unless the last domain is being removed must
> +		 * just remove the SNC subdomain.
> +		 */

Can this comment be moved to be part of the top function comments? It is
not relevant to code being commented here and only seems to be here to
avoid duplicating it in the spots where it is relevant.

> +		sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id);
> +	}
> +

Reinette


* [PATCH v20 14/18] x86/resctrl: Fill out rmid_read structure for smp_call*() to read a counter
  2024-06-10 18:35 [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (12 preceding siblings ...)
  2024-06-10 18:35 ` [PATCH v20 13/18] x86/resctrl: Handle removing directories in Sub-NUMA Cluster (SNC) mode Tony Luck
@ 2024-06-10 18:35 ` Tony Luck
  2024-06-20 21:31   ` Reinette Chatre
  2024-06-10 18:35 ` [PATCH v20 15/18] x86/resctrl: Make __mon_event_count() handle sum domains Tony Luck
                   ` (4 subsequent siblings)
  18 siblings, 1 reply; 61+ messages in thread
From: Tony Luck @ 2024-06-10 18:35 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches, Tony Luck

mon_event_read() fills out most fields of the struct rmid_read that is
passed via an smp_call*() function to a CPU that is part of the correct
domain to read the monitor counters.

With Sub-NUMA Cluster (SNC) mode there are now two cases to handle:

1) Reading a file that returns a value for a single domain.
   + Choose the CPU to execute from the domain cpu_mask

2) Reading a file that must sum across domains sharing an L3 cache
   instance.
   + Indicate to called code that a sum is needed by passing a NULL
     rdt_mon_domain pointer.
   + Choose the CPU from the L3 shared_cpu_map.
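
The two cases can be sketched in user space as below. This is a
simplified model, not the kernel code: the demo_* types are
hypothetical, and an unsigned long bitmap stands in for a cpumask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_cache {
	int id;
	unsigned long shared_cpu_map;	/* CPUs sharing this L3 */
};

struct demo_domain {
	int id;
	unsigned long cpu_mask;		/* CPUs in this domain */
	struct demo_cache *ci;
};

struct demo_request {
	struct demo_domain *d;	/* NULL means "sum across the cache" */
	struct demo_cache *ci;	/* only set when d is NULL */
	unsigned long cpus;	/* CPUs eligible to perform the read */
};

static struct demo_request demo_prepare(struct demo_domain *d, bool sum)
{
	struct demo_request rq = { 0 };

	assert(d && d->ci);
	if (sum) {
		/* Case 2: any CPU sharing the L3 may do the read. */
		rq.ci = d->ci;
		rq.cpus = d->ci->shared_cpu_map;
	} else {
		/* Case 1: the read must run on a CPU in the domain. */
		rq.d = d;
		rq.cpus = d->cpu_mask;
	}
	return rq;
}
```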

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/resctrl/internal.h    |  2 +-
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 39 ++++++++++++++++++-----
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  2 +-
 3 files changed, 33 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 77da29ced7eb..75bb1afc4842 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -627,7 +627,7 @@ void mon_event_count(void *info);
 int rdtgroup_mondata_show(struct seq_file *m, void *arg);
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
-		    int evtid, int first);
+		    cpumask_t *cpumask, int evtid, int first);
 void mbm_setup_overflow_handler(struct rdt_mon_domain *dom,
 				unsigned long delay_ms,
 				int exclude_cpu);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 3b9383612c35..5a43931fd423 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -515,7 +515,7 @@ static int smp_mon_event_count(void *arg)
 
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
-		    int evtid, int first)
+		    cpumask_t *cpumask, int evtid, int first)
 {
 	int cpu;
 
@@ -537,7 +537,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 		return;
 	}
 
-	cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask, RESCTRL_PICK_ANY_CPU);
+	cpu = cpumask_any_housekeeping(cpumask, RESCTRL_PICK_ANY_CPU);
 
 	/*
 	 * cpumask_any_housekeeping() prefers housekeeping CPUs, but
@@ -546,7 +546,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 	 * counters on some platforms if its called in IRQ context.
 	 */
 	if (tick_nohz_full_cpu(cpu))
-		smp_call_function_any(&d->hdr.cpu_mask, mon_event_count, rr, 1);
+		smp_call_function_any(cpumask, mon_event_count, rr, 1);
 	else
 		smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
 
@@ -575,16 +575,39 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 	resid = md.u.rid;
 	domid = md.u.domid;
 	evtid = md.u.evtid;
-
 	r = &rdt_resources_all[resid].r_resctrl;
-	hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
-	if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
+
+	if (md.u.sum) {
+		/*
+		 * This file requires summing across all SNC domains that share
+		 * the L3 cache id that was provided in the "domid" field of the
+		 * mon_data_bits union. Search all domains in the resource for
+		 * one that matches this cache id.
+		 */
+		list_for_each_entry(d, &r->mon_domains, hdr.list) {
+			if (d->ci->id == domid) {
+				rr.ci = d->ci;
+				mon_event_read(&rr, r, NULL, rdtgrp, &d->ci->shared_cpu_map, evtid, false);
+				goto checkresult;
+			}
+		}
 		ret = -ENOENT;
 		goto out;
+	} else {
+		/*
+		 * This file provides data from a single domain. Search
+		 * the resource to find the domain with "domid".
+		 */
+		hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
+		if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
+			ret = -ENOENT;
+			goto out;
+		}
+		d = container_of(hdr, struct rdt_mon_domain, hdr);
+		mon_event_read(&rr, r, d, rdtgrp, &d->hdr.cpu_mask, evtid, false);
 	}
-	d = container_of(hdr, struct rdt_mon_domain, hdr);
 
-	mon_event_read(&rr, r, d, rdtgrp, evtid, false);
+checkresult:
 
 	if (rr.err == -EIO)
 		seq_puts(m, "Error\n");
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 5142ce43ac13..cd54ca6cc563 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3070,7 +3070,7 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
 			return ret;
 
 		if (!do_sum && is_mbm_event(mevt->evtid))
-			mon_event_read(&rr, r, d, prgrp, mevt->evtid, true);
+			mon_event_read(&rr, r, d, prgrp, &d->hdr.cpu_mask, mevt->evtid, true);
 	}
 
 	return 0;
-- 
2.45.0



* Re: [PATCH v20 14/18] x86/resctrl: Fill out rmid_read structure for smp_call*() to read a counter
  2024-06-10 18:35 ` [PATCH v20 14/18] x86/resctrl: Fill out rmid_read structure for smp_call*() to read a counter Tony Luck
@ 2024-06-20 21:31   ` Reinette Chatre
  0 siblings, 0 replies; 61+ messages in thread
From: Reinette Chatre @ 2024-06-20 21:31 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches

Hi Tony,

On 6/10/24 11:35 AM, Tony Luck wrote:
> mon_event_read() fills out most fields of the struct rmid_read that is
> passed via an smp_call*() function to a CPU that is part of the correct
> domain to read the monitor counters.
> 
> With Sub-NUMA Cluster (SNC) mode there are now two cases to handle:
> 
> 1) Reading a file that returns a value for a single domain.
>     + Choose the CPU to execute from the domain cpu_mask
> 
> 2) Reading a file that must sum across domains sharing an L3 cache
>     instance.
>     + Indicate to called code that a sum is needed by passing a NULL
>       rdt_mon_domain pointer.
>     + Choose the CPU from the L3 shared_cpu_map.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>

The kerneldoc in patch #9 introduces new requirements related to
struct rmid_read wrt how NULL values are interpreted and
used. This makes it essential that struct rmid_read is always initialized
correctly and should no longer consist of whatever is on the stack. I
mentioned in response to v19 that static checkers found issues here.
I understand that mbm_update() always sets the domain in
struct rmid_read, but I do not find it acceptable that it
passes garbage as the cacheinfo pointer based on subtle assumptions on
when/how __mon_event_count() uses this field.
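
As a rough illustration of the kind of initialization meant here (a
sketch only, with a hypothetical struct demo_read standing in for
struct rmid_read), assigning a compound literal guarantees that every
field not named explicitly is zero or NULL:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for struct rmid_read. */
struct demo_read {
	void *d;		/* domain, or NULL when summing */
	void *ci;		/* cacheinfo, or NULL for a single domain */
	int evtid;
	int err;
	unsigned long long val;
};

/* Prepare a single-domain read; all other fields are cleared. */
static void demo_read_init_single(struct demo_read *rr, void *d, int evtid)
{
	assert(rr && d);
	/* Members not listed in the literal are zero-initialized. */
	*rr = (struct demo_read){ .d = d, .evtid = evtid };
}
```

A caller that builds the struct on the stack then never hands stale
stack contents to the code that interprets the NULL fields.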

> ---
>   arch/x86/kernel/cpu/resctrl/internal.h    |  2 +-
>   arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 39 ++++++++++++++++++-----
>   arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  2 +-
>   3 files changed, 33 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 77da29ced7eb..75bb1afc4842 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -627,7 +627,7 @@ void mon_event_count(void *info);
>   int rdtgroup_mondata_show(struct seq_file *m, void *arg);
>   void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>   		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
> -		    int evtid, int first);
> +		    cpumask_t *cpumask, int evtid, int first);
>   void mbm_setup_overflow_handler(struct rdt_mon_domain *dom,
>   				unsigned long delay_ms,
>   				int exclude_cpu);
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index 3b9383612c35..5a43931fd423 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -515,7 +515,7 @@ static int smp_mon_event_count(void *arg)
>   
>   void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>   		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
> -		    int evtid, int first)
> +		    cpumask_t *cpumask, int evtid, int first)
>   {
>   	int cpu;
>   
> @@ -537,7 +537,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>   		return;
>   	}
>   
> -	cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask, RESCTRL_PICK_ANY_CPU);
> +	cpu = cpumask_any_housekeeping(cpumask, RESCTRL_PICK_ANY_CPU);
>   
>   	/*
>   	 * cpumask_any_housekeeping() prefers housekeeping CPUs, but
> @@ -546,7 +546,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>   	 * counters on some platforms if its called in IRQ context.
>   	 */
>   	if (tick_nohz_full_cpu(cpu))
> -		smp_call_function_any(&d->hdr.cpu_mask, mon_event_count, rr, 1);
> +		smp_call_function_any(cpumask, mon_event_count, rr, 1);
>   	else
>   		smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
>   
> @@ -575,16 +575,39 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
>   	resid = md.u.rid;
>   	domid = md.u.domid;
>   	evtid = md.u.evtid;
> -
>   	r = &rdt_resources_all[resid].r_resctrl;
> -	hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
> -	if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
> +
> +	if (md.u.sum) {
> +		/*
> +		 * This file requires summing across all SNC domains that share
> +		 * the L3 cache id that was provided in the "domid" field of the
> +		 * mon_data_bits union. Search all domains in the resource for
> +		 * one that matches this cache id.
> +		 */
> +		list_for_each_entry(d, &r->mon_domains, hdr.list) {
> +			if (d->ci->id == domid) {
> +				rr.ci = d->ci;
> +				mon_event_read(&rr, r, NULL, rdtgrp, &d->ci->shared_cpu_map, evtid, false);

Please split this line, it is over 100

Reinette


* [PATCH v20 15/18] x86/resctrl: Make __mon_event_count() handle sum domains
  2024-06-10 18:35 [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (13 preceding siblings ...)
  2024-06-10 18:35 ` [PATCH v20 14/18] x86/resctrl: Fill out rmid_read structure for smp_call*() to read a counter Tony Luck
@ 2024-06-10 18:35 ` Tony Luck
  2024-06-20 21:31   ` Reinette Chatre
  2024-06-10 18:35 ` [PATCH v20 16/18] x86/resctrl: Enable RMID shared RMID mode on Sub-NUMA Cluster (SNC) systems Tony Luck
                   ` (3 subsequent siblings)
  18 siblings, 1 reply; 61+ messages in thread
From: Tony Luck @ 2024-06-10 18:35 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches, Tony Luck

Legacy resctrl monitor files must provide the sum of event values across
all Sub-NUMA Cluster (SNC) domains that share an L3 cache instance.

There are now two cases:
1) A specific domain is provided in struct rmid_read
   This is either a non-SNC system, or the request is to read data
   from just one SNC node.
2) Domain pointer is NULL. In this case the cacheinfo field in struct
   rmid_read indicates that all SNC nodes that share that L3 cache
   instance should have the event read and return the sum of all
   values.

Update the CPU sanity check. The existing check that an event is read
from a CPU in the requested domain still applies when reading a single
domain. But when summing across domains a more relaxed check that the
current CPU is in the scope of the L3 cache instance is appropriate
since the MSRs to read events are scoped at L3 cache level.
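
The summing case can be illustrated with a small user-space model. It
assumes an array of per-node readings in place of the kernel's domain
list, and struct demo_node and demo_sum_cache() are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per SNC node: the L3 it belongs to and its counter value. */
struct demo_node {
	int cache_id;
	uint64_t count;
};

/* Sum the counters of every node that shares the given L3 cache id. */
static uint64_t demo_sum_cache(const struct demo_node *nodes, size_t n,
			       int cache_id)
{
	uint64_t sum = 0;

	assert(nodes || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (nodes[i].cache_id != cache_id)
			continue;	/* node belongs to another L3 */
		sum += nodes[i].count;
	}
	return sum;
}
```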

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/resctrl/monitor.c | 40 +++++++++++++++++++++------
 1 file changed, 32 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index f2fd35d294f2..c4d9a8df8d2d 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -324,9 +324,6 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 
 	resctrl_arch_rmid_read_context_check();
 
-	if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
-		return -EINVAL;
-
 	prmid = logical_rmid_to_physical_rmid(cpu, rmid);
 	ret = __rmid_read_phys(prmid, eventid, &msr_val);
 	if (ret)
@@ -592,6 +589,8 @@ static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 closid,
 
 static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
 {
+	int cpu = smp_processor_id();
+	struct rdt_mon_domain *d;
 	struct mbm_state *m;
 	u64 tval = 0;
 
@@ -603,12 +602,37 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
 		return 0;
 	}
 
-	rr->err = resctrl_arch_rmid_read(rr->r, rr->d, closid, rmid, rr->evtid,
-					 &tval, rr->arch_mon_ctx);
-	if (rr->err)
-		return rr->err;
+	if (rr->d) {
+		/* Reading a single domain, must be on a CPU in that domain */
+		if (!cpumask_test_cpu(cpu, &rr->d->hdr.cpu_mask))
+			return -EINVAL;
+		rr->err = resctrl_arch_rmid_read(rr->r, rr->d, closid, rmid,
+						 rr->evtid, &tval, rr->arch_mon_ctx);
+		if (rr->err)
+			return rr->err;
+
+		rr->val += tval;
+
+		return 0;
+	}
 
-	rr->val += tval;
+	/* Summing domains that share a cache, must be on a CPU for that cache */
+	if (!cpumask_test_cpu(cpu, &rr->ci->shared_cpu_map))
+		return -EINVAL;
+
+	/*
+	 * Legacy files must report the sum of an event across all
+	 * domains that share the same L3 cache instance.
+	 */
+	list_for_each_entry(d, &rr->r->mon_domains, hdr.list) {
+		if (d->ci->id != rr->ci->id)
+			continue;
+		rr->err = resctrl_arch_rmid_read(rr->r, d, closid, rmid,
+						 rr->evtid, &tval, rr->arch_mon_ctx);
+		if (rr->err)
+			return rr->err;
+		rr->val += tval;
+	}
 
 	return 0;
 }
-- 
2.45.0



* Re: [PATCH v20 15/18] x86/resctrl: Make __mon_event_count() handle sum domains
  2024-06-10 18:35 ` [PATCH v20 15/18] x86/resctrl: Make __mon_event_count() handle sum domains Tony Luck
@ 2024-06-20 21:31   ` Reinette Chatre
  0 siblings, 0 replies; 61+ messages in thread
From: Reinette Chatre @ 2024-06-20 21:31 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches

Hi Tony,

On 6/10/24 11:35 AM, Tony Luck wrote:
> Legacy resctrl monitor files must provide the sum of event values across
> all Sub-NUMA Cluster (SNC) domains that share an L3 cache instance.
> 
> There are now two cases:
> 1) A specific domain is provided in struct rmid_read
>     This is either a non-SNC system, or the request is to read data
>     from just one SNC node.
> 2) Domain pointer is NULL. In this case the cacheinfo field in struct
>     rmid_read indicates that all SNC nodes that share that L3 cache
>     instance should have the event read and return the sum of all
>     values.
> 
> Update the CPU sanity check. The existing check that an event is read
> from a CPU in the requested domain still applies when reading a single
> domain. But when summing across domains a more relaxed check that the
> current CPU is in the scope of the L3 cache instance is appropriate
> since the MSRs to read events are scoped at L3 cache level.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>   arch/x86/kernel/cpu/resctrl/monitor.c | 40 +++++++++++++++++++++------
>   1 file changed, 32 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index f2fd35d294f2..c4d9a8df8d2d 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -324,9 +324,6 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
>   
>   	resctrl_arch_rmid_read_context_check();
>   
> -	if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
> -		return -EINVAL;
> -
>   	prmid = logical_rmid_to_physical_rmid(cpu, rmid);
>   	ret = __rmid_read_phys(prmid, eventid, &msr_val);
>   	if (ret)
> @@ -592,6 +589,8 @@ static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 closid,
>   
>   static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
>   {
> +	int cpu = smp_processor_id();
> +	struct rdt_mon_domain *d;
>   	struct mbm_state *m;
>   	u64 tval = 0;
>   
> @@ -603,12 +602,37 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
>   		return 0;
>   	}
>   
> -	rr->err = resctrl_arch_rmid_read(rr->r, rr->d, closid, rmid, rr->evtid,
> -					 &tval, rr->arch_mon_ctx);
> -	if (rr->err)
> -		return rr->err;
> +	if (rr->d) {
> +		/* Reading a single domain, must be on a CPU in that domain */

(nit: Please let this sentence as well as the later comment end with a period.)

> +		if (!cpumask_test_cpu(cpu, &rr->d->hdr.cpu_mask))
> +			return -EINVAL;
> +		rr->err = resctrl_arch_rmid_read(rr->r, rr->d, closid, rmid,
> +						 rr->evtid, &tval, rr->arch_mon_ctx);
> +		if (rr->err)
> +			return rr->err;
> +
> +		rr->val += tval;
> +
> +		return 0;
> +	}
>   
> -	rr->val += tval;
> +	/* Summing domains that share a cache, must be on a CPU for that cache */
> +	if (!cpumask_test_cpu(cpu, &rr->ci->shared_cpu_map))
> +		return -EINVAL;
> +
> +	/*
> +	 * Legacy files must report the sum of an event across all
> +	 * domains that share the same L3 cache instance.
> +	 */
> +	list_for_each_entry(d, &rr->r->mon_domains, hdr.list) {
> +		if (d->ci->id != rr->ci->id)
> +			continue;
> +		rr->err = resctrl_arch_rmid_read(rr->r, d, closid, rmid,
> +						 rr->evtid, &tval, rr->arch_mon_ctx);
> +		if (rr->err)
> +			return rr->err;

An error here means that the hardware does not have data available to return. Should
the unavailability of data for one domain be considered an error for all? Note how
this is handled in mon_event_count() where the error is discarded if any of the monitor
event reads succeeded ... should this be done here also?
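
One possible shape of that behavior, sketched in user space under the
assumption that a failed read of one node should not hide data from the
others. The names (struct demo_reading, demo_sum_tolerant()) are
hypothetical and this is not what the patch currently does:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Result of reading one SNC node: err is 0 or a negative errno. */
struct demo_reading {
	int cache_id;
	int err;
	uint64_t count;
};

/*
 * Sum all successful readings for one L3. Return 0 if at least one
 * node could be read, otherwise the first error that was seen.
 */
static int demo_sum_tolerant(const struct demo_reading *rd, size_t n,
			     int cache_id, uint64_t *val)
{
	uint64_t sum = 0;
	int first_err = 0;
	bool any_ok = false;

	assert(val && (rd || n == 0));
	for (size_t i = 0; i < n; i++) {
		if (rd[i].cache_id != cache_id)
			continue;
		if (rd[i].err) {
			if (!first_err)
				first_err = rd[i].err;
			continue;	/* skip the node without data */
		}
		sum += rd[i].count;
		any_ok = true;
	}
	*val = sum;
	return any_ok ? 0 : first_err;
}
```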

> +		rr->val += tval;
> +	}
>   
>   	return 0;
>   }


Reinette


* [PATCH v20 16/18] x86/resctrl: Enable RMID shared RMID mode on Sub-NUMA Cluster (SNC) systems
  2024-06-10 18:35 [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (14 preceding siblings ...)
  2024-06-10 18:35 ` [PATCH v20 15/18] x86/resctrl: Make __mon_event_count() handle sum domains Tony Luck
@ 2024-06-10 18:35 ` Tony Luck
  2024-06-20 21:32   ` Reinette Chatre
  2024-06-10 18:35 ` [PATCH v20 17/18] x86/resctrl: Sub-NUMA Cluster (SNC) detection Tony Luck
                   ` (2 subsequent siblings)
  18 siblings, 1 reply; 61+ messages in thread
From: Tony Luck @ 2024-06-10 18:35 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches, Tony Luck

Hardware has two RMID configuration options for SNC systems. The default
mode divides RMID counters between SNC nodes. E.g. with 200 RMIDs and
two SNC nodes per L3 cache, RMIDs 0..99 are used on node 0, and 100..199
on node 1. This isn't very compatible with Linux resctrl usage. On this
example system a process using RMID 5 would only update monitor counters
while running on SNC node 0.

The other mode is "RMID Sharing Mode". This is enabled by clearing bit
0 of the RMID_SNC_CONFIG (0xCA0) model specific register. In this mode
the number of logical RMIDs is the number of physical RMIDs (from CPUID
leaf 0xF) divided by the number of SNC nodes per L3 cache instance. A
process can use the same RMID across different SNC nodes.
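
The arithmetic of the two modes can be sketched as plain C. This is an
illustration only: it performs no MSR access, and the demo_* helpers
are hypothetical names rather than functions from this patch.

```c
#include <assert.h>
#include <stdint.h>

/* MSR value with bit 0 cleared, i.e. RMID Sharing Mode selected. */
static uint64_t demo_enable_rmid_sharing(uint64_t msr_val)
{
	return msr_val & ~(1ULL << 0);
}

/* Logical RMIDs available in RMID Sharing Mode. */
static uint32_t demo_logical_rmids(uint32_t physical_rmids,
				   uint32_t snc_nodes_per_l3)
{
	assert(snc_nodes_per_l3 > 0);
	return physical_rmids / snc_nodes_per_l3;
}

/* Default mode: the SNC node whose counters a given RMID updates. */
static uint32_t demo_default_mode_node(uint32_t rmid,
				       uint32_t physical_rmids,
				       uint32_t snc_nodes_per_l3)
{
	uint32_t per_node = demo_logical_rmids(physical_rmids,
						snc_nodes_per_l3);

	assert(per_node > 0 && rmid < per_node * snc_nodes_per_l3);
	return rmid / per_node;
}
```

For the example system above (200 RMIDs, two SNC nodes per L3), RMID 5
falls on node 0 in the default mode, and sharing mode leaves 100
logical RMIDs.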

See the "Intel Resource Director Technology Architecture Specification"
for additional details.

When SNC is enabled, update the MSR when a monitor domain is marked
online. Tehcnically this is overkill. It only needs to be done once
per L3 cache instance rather than per SNC domain. But there is no harm
in doing it more than once, and this is not in a critical path.
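
A small user-space sketch of why repeating the update is harmless: clearing a bit that is already clear leaves the value unchanged. A plain variable stands in for the MSR here, and the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit 0 selects the default (non-shared) RMID mode when set. */
#define RMID_SNC_CONFIG_BIT0	(UINT64_C(1) << 0)

/*
 * Read-modify-write that clears bit 0 and preserves all other bits.
 * The pointer dereferences stand in for rdmsrl()/wrmsrl().
 */
void enable_rmid_sharing(uint64_t *msr)
{
	uint64_t val;

	assert(msr != NULL);
	val = *msr;			/* "rdmsrl" */
	val &= ~RMID_SNC_CONFIG_BIT0;
	*msr = val;			/* "wrmsrl" */
}
```

Calling enable_rmid_sharing() once per SNC domain rather than once per L3 cache instance gives the same final register value.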

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/msr-index.h       |  1 +
 arch/x86/kernel/cpu/resctrl/internal.h |  2 ++
 arch/x86/kernel/cpu/resctrl/core.c     |  2 ++
 arch/x86/kernel/cpu/resctrl/monitor.c  | 26 ++++++++++++++++++++++++++
 4 files changed, 31 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index e022e6eb766c..3cb8dd6311c3 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1164,6 +1164,7 @@
 #define MSR_IA32_QM_CTR			0xc8e
 #define MSR_IA32_PQR_ASSOC		0xc8f
 #define MSR_IA32_L3_CBM_BASE		0xc90
+#define MSR_RMID_SNC_CONFIG		0xca0
 #define MSR_IA32_L2_CBM_BASE		0xd10
 #define MSR_IA32_MBA_THRTL_BASE		0xd50
 
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 75bb1afc4842..324cf05858f5 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -529,6 +529,8 @@ static inline bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l)
 
 int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
 
+void arch_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d);
+
 /*
  * To return the common struct rdt_resource, which is contained in struct
  * rdt_hw_resource, walk the resctrl member of struct rdt_hw_resource.
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 95ef8fe3cb50..1930fce9dfe9 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -615,6 +615,8 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	}
 	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 
+	arch_mon_domain_online(r, d);
+
 	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
 		mon_domain_free(hw_dom);
 		return;
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index c4d9a8df8d2d..efbb84c00d79 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1082,6 +1082,32 @@ static void l3_mon_evt_init(struct rdt_resource *r)
 		list_add_tail(&mbm_local_event.list, &r->evt_list);
 }
 
+/*
+ * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
+ * which indicates that RMIDs are configured in legacy mode.
+ * This mode is incompatible with Linux resctrl semantics
+ * as RMIDs are partitioned between SNC nodes, which requires
+ * a user to know which RMID is allocated to a task.
+ * Clearing bit 0 reconfigures the RMID counters for use
+ * in Sub-NUMA Cluster mode. This mode is better for Linux.
+ * The RMID space is divided between all SNC nodes with the
+ * RMIDs renumbered to start from zero in each node when
+ * counting operations from tasks. Code to read the counters
+ * must adjust RMID counter numbers based on SNC node. See
+ * logical_rmid_to_physical_rmid() for code that does this.
+ */
+void arch_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d)
+{
+	u64 val;
+
+	if (snc_nodes_per_l3_cache == 1)
+		return;
+
+	rdmsrl(MSR_RMID_SNC_CONFIG, val);
+	val &= ~BIT_ULL(0);
+	wrmsrl(MSR_RMID_SNC_CONFIG, val);
+}
+
 int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 {
 	unsigned int mbm_offset = boot_cpu_data.x86_cache_mbm_width_offset;
-- 
2.45.0


* Re: [PATCH v20 16/18] x86/resctrl: Enable RMID shared RMID mode on Sub-NUMA Cluster (SNC) systems
  2024-06-10 18:35 ` [PATCH v20 16/18] x86/resctrl: Enable RMID shared RMID mode on Sub-NUMA Cluster (SNC) systems Tony Luck
@ 2024-06-20 21:32   ` Reinette Chatre
  0 siblings, 0 replies; 61+ messages in thread
From: Reinette Chatre @ 2024-06-20 21:32 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches

Hi Tony,

shortlog: "RMID shared RMID mode" -> "RMID shared mode" or "shared RMID mode"?

On 6/10/24 11:35 AM, Tony Luck wrote:
> Hardware has two RMID configuration options for SNC systems. The default
> mode divides RMID counters between SNC nodes. E.g. with 200 RMIDs and
> two SNC nodes per L3 cache RMIDs 0..99 are used on node 0, and 100..199
> on node 1. This isn't very compatible with Linux resctrl usage. On this

Could we head off potential tangents with "This isn't very compatible"
changed to "This isn't compatible"?

> example system a process using RMID 5 would only update monitor counters
> while running on SNC node 0.
> 
> The other mode is "RMID Sharing Mode". This is enabled by clearing bit
> 0 of the RMID_SNC_CONFIG (0xCA0) model specific register. In this mode
> the number of logical RMIDs is the number of physical RMIDs (from CPUID
> leaf 0xF) divided by the number of SNC nodes per L3 cache instance. A
> process can use the same RMID across different SNC nodes.
> 
> See the "Intel Resource Director Technology Architecture Specification"
> for additional details.
> 
> When SNC is enabled, update the MSR when a monitor domain is marked
> online. Tehcnically this is overkill. It only needs to be done once

Tehcnically -> Technically

> per L3 cache instance rather than per SNC domain. But there is no harm
> in doing it more than once, and this is not in a critical path.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>   arch/x86/include/asm/msr-index.h       |  1 +
>   arch/x86/kernel/cpu/resctrl/internal.h |  2 ++
>   arch/x86/kernel/cpu/resctrl/core.c     |  2 ++
>   arch/x86/kernel/cpu/resctrl/monitor.c  | 26 ++++++++++++++++++++++++++
>   4 files changed, 31 insertions(+)
> 
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index e022e6eb766c..3cb8dd6311c3 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -1164,6 +1164,7 @@
>   #define MSR_IA32_QM_CTR			0xc8e
>   #define MSR_IA32_PQR_ASSOC		0xc8f
>   #define MSR_IA32_L3_CBM_BASE		0xc90
> +#define MSR_RMID_SNC_CONFIG		0xca0
>   #define MSR_IA32_L2_CBM_BASE		0xd10
>   #define MSR_IA32_MBA_THRTL_BASE		0xd50
>   
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 75bb1afc4842..324cf05858f5 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -529,6 +529,8 @@ static inline bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l)
>   
>   int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
>   
> +void arch_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d);
> +
>   /*
>    * To return the common struct rdt_resource, which is contained in struct
>    * rdt_hw_resource, walk the resctrl member of struct rdt_hw_resource.
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 95ef8fe3cb50..1930fce9dfe9 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -615,6 +615,8 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
>   	}
>   	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
>   
> +	arch_mon_domain_online(r, d);
> +
>   	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
>   		mon_domain_free(hw_dom);
>   		return;
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index c4d9a8df8d2d..efbb84c00d79 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -1082,6 +1082,32 @@ static void l3_mon_evt_init(struct rdt_resource *r)
>   		list_add_tail(&mbm_local_event.list, &r->evt_list);
>   }
>   
> +/*
> + * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
> + * which indicates that RMIDs are configured in legacy mode.
> + * This mode is incompatible with Linux resctrl semantics
> + * as RMIDs are partitioned between SNC nodes, which requires
> + * a user to know which RMID is allocated to a task.
> + * Clearing bit 0 reconfigures the RMID counters for use

"Clearing bit 0 configures RMID sharing mode for use ..."? It is
strange to me that this whole comment has no mention of
"RMID sharing mode" that seems to be goal of this change.

> + * in Sub-NUMA Cluster mode. This mode is better for Linux.
> + * The RMID space is divided between all SNC nodes with the
> + * RMIDs renumbered to start from zero in each node when
> + * counting operations from tasks. Code to read the counters
> + * must adjust RMID counter numbers based on SNC node. See
> + * logical_rmid_to_physical_rmid() for code that does this.
> + */
> +void arch_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d)
> +{
> +	u64 val;
> +
> +	if (snc_nodes_per_l3_cache == 1)
> +		return;
> +
> +	rdmsrl(MSR_RMID_SNC_CONFIG, val);
> +	val &= ~BIT_ULL(0);
> +	wrmsrl(MSR_RMID_SNC_CONFIG, val);
> +}
> +
>   int __init rdt_get_mon_l3_config(struct rdt_resource *r)
>   {
>   	unsigned int mbm_offset = boot_cpu_data.x86_cache_mbm_width_offset;

Patch looks good to me.

Reinette

* [PATCH v20 17/18] x86/resctrl: Sub-NUMA Cluster (SNC) detection
  2024-06-10 18:35 [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (15 preceding siblings ...)
  2024-06-10 18:35 ` [PATCH v20 16/18] x86/resctrl: Enable RMID shared RMID mode on Sub-NUMA Cluster (SNC) systems Tony Luck
@ 2024-06-10 18:35 ` Tony Luck
  2024-06-20 21:34   ` Reinette Chatre
  2024-06-21 17:05   ` Markus Elfring
  2024-06-10 18:35 ` [PATCH v20 18/18] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
  2024-06-13 19:17 ` [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems Moger, Babu
  18 siblings, 2 replies; 61+ messages in thread
From: Tony Luck @ 2024-06-10 18:35 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches, Tony Luck

There isn't a simple hardware bit that indicates whether a CPU is
running in Sub-NUMA Cluster (SNC) mode. Infer the state by comparing
the number of CPUs sharing the L3 cache with CPU0 to the number of CPUs in
the same NUMA node as CPU0.
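
The inference can be sketched as a pure function in user-space C. The function name is hypothetical, and the accepted range of 1 to 4 nodes follows the sanity check in the patch below; anything else is treated as "no SNC".

```c
#include <assert.h>

/*
 * Infer the number of SNC nodes per L3 cache from two CPU counts:
 * CPUs sharing the L3 cache with CPU0, and CPUs in CPU0's NUMA node.
 * Returns 1 (no SNC) when the inputs are unusable or the ratio is
 * outside the plausible range.
 */
int snc_nodes_from_counts(int cpus_per_l3, int cpus_per_node)
{
	int ratio;

	if (cpus_per_l3 <= 0 || cpus_per_node <= 0)
		return 1;

	ratio = cpus_per_l3 / cpus_per_node;
	if (ratio < 1 || ratio > 4)
		return 1;	/* improbable result, ignore it */

	assert(ratio >= 1 && ratio <= 4);
	return ratio;
}
```

For example, 64 CPUs sharing an L3 cache with 32 CPUs in the NUMA node gives two SNC nodes per L3 cache, while equal counts give one (SNC disabled).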

If SNC mode is detected, print a single informational message to the
console.

Add the missing definition of pr_fmt() to monitor.c. This wasn't
noticed before as there are only "can't happen" console messages
from this file.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/resctrl/monitor.c | 66 +++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index efbb84c00d79..9835706ef772 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -15,6 +15,8 @@
  * Software Developer Manual June 2016, volume 3, section 17.17.
  */
 
+#define pr_fmt(fmt)	"resctrl: " fmt
+
 #include <linux/cpu.h>
 #include <linux/module.h>
 #include <linux/sizes.h>
@@ -1108,6 +1110,68 @@ void arch_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d)
 	wrmsrl(MSR_RMID_SNC_CONFIG, val);
 }
 
+/* CPU models that support MSR_RMID_SNC_CONFIG */
+static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
+	X86_MATCH_VFM(INTEL_ICELAKE_X, 0),
+	X86_MATCH_VFM(INTEL_SAPPHIRERAPIDS_X, 0),
+	X86_MATCH_VFM(INTEL_EMERALDRAPIDS_X, 0),
+	X86_MATCH_VFM(INTEL_GRANITERAPIDS_X, 0),
+	X86_MATCH_VFM(INTEL_ATOM_CRESTMONT_X, 0),
+	{}
+};
+
+/*
+ * There isn't a simple hardware bit that indicates whether a CPU is running
+ * in Sub-NUMA Cluster (SNC) mode. Infer the state by comparing the
+ * number CPUs sharing the L3 cache with CPU0 to the number of CPUs in
+ * the same NUMA node as CPU0.
+ * It is not possible to accurately determine SNC state if the system is
+ * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes
+ * to L3 caches. It will be OK if system is booted with hyperthreading
+ * disabled (since this doesn't affect the ratio).
+ */
+static __init int snc_get_config(void)
+{
+	struct cacheinfo *ci = get_cpu_cacheinfo_level(0, RESCTRL_L3_CACHE);
+	const cpumask_t *node0_cpumask;
+	int cpus_per_node, cpus_per_l3;
+	int ret;
+
+	if (!x86_match_cpu(snc_cpu_ids) || !ci)
+		return 1;
+
+	cpus_read_lock();
+	if (num_online_cpus() != num_present_cpus())
+		pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
+	cpus_read_unlock();
+
+	node0_cpumask = cpumask_of_node(cpu_to_node(0));
+
+	cpus_per_node = cpumask_weight(node0_cpumask);
+	cpus_per_l3 = cpumask_weight(&ci->shared_cpu_map);
+
+	if (!cpus_per_node || !cpus_per_l3)
+		return 1;
+
+	ret = cpus_per_l3 / cpus_per_node;
+
+	/* sanity check: Only valid results are 1, 2, 3, 4 */
+	switch (ret) {
+	case 1:
+		break;
+	case 2 ... 4:
+		pr_info("Sub-NUMA Cluster mode detected with %d nodes per L3 cache\n", ret);
+		rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_L3_NODE;
+		break;
+	default:
+		pr_warn("Ignore improbable SNC node count %d\n", ret);
+		ret = 1;
+		break;
+	}
+
+	return ret;
+}
+
 int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 {
 	unsigned int mbm_offset = boot_cpu_data.x86_cache_mbm_width_offset;
@@ -1115,6 +1179,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 	unsigned int threshold;
 	int ret;
 
+	snc_nodes_per_l3_cache = snc_get_config();
+
 	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
 	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
 	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
-- 
2.45.0


* Re: [PATCH v20 17/18] x86/resctrl: Sub-NUMA Cluster (SNC) detection
  2024-06-10 18:35 ` [PATCH v20 17/18] x86/resctrl: Sub-NUMA Cluster (SNC) detection Tony Luck
@ 2024-06-20 21:34   ` Reinette Chatre
  2024-06-21 17:05   ` Markus Elfring
  1 sibling, 0 replies; 61+ messages in thread
From: Reinette Chatre @ 2024-06-20 21:34 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches

Hi Tony,

On 6/10/24 11:35 AM, Tony Luck wrote:
> There isn't a simple hardware bit that indicates whether a CPU is
> running in Sub-NUMA Cluster (SNC) mode. Infer the state by comparing
> the number of CPUs sharing the L3 cache with CPU0 to the number of CPUs in
> the same NUMA node as CPU0.
> 
> If SNC mode is detected, print a single informational message to the
> console.
> 
> Add the missing definition of pr_fmt() to monitor.c. This wasn't
> noticed before as there are only "can't happen" console messages
> from this file.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>   arch/x86/kernel/cpu/resctrl/monitor.c | 66 +++++++++++++++++++++++++++
>   1 file changed, 66 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index efbb84c00d79..9835706ef772 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -15,6 +15,8 @@
>    * Software Developer Manual June 2016, volume 3, section 17.17.
>    */
>   
> +#define pr_fmt(fmt)	"resctrl: " fmt
> +
>   #include <linux/cpu.h>
>   #include <linux/module.h>
>   #include <linux/sizes.h>
> @@ -1108,6 +1110,68 @@ void arch_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d)
>   	wrmsrl(MSR_RMID_SNC_CONFIG, val);
>   }
>   
> +/* CPU models that support MSR_RMID_SNC_CONFIG */
> +static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
> +	X86_MATCH_VFM(INTEL_ICELAKE_X, 0),
> +	X86_MATCH_VFM(INTEL_SAPPHIRERAPIDS_X, 0),
> +	X86_MATCH_VFM(INTEL_EMERALDRAPIDS_X, 0),
> +	X86_MATCH_VFM(INTEL_GRANITERAPIDS_X, 0),
> +	X86_MATCH_VFM(INTEL_ATOM_CRESTMONT_X, 0),
> +	{}
> +};
> +
> +/*
> + * There isn't a simple hardware bit that indicates whether a CPU is running
> + * in Sub-NUMA Cluster (SNC) mode. Infer the state by comparing the
> + * number CPUs sharing the L3 cache with CPU0 to the number of CPUs in

"number CPUs sharing" -> "number of CPUs sharing"?

> + * the same NUMA node as CPU0.
> + * It is not possible to accurately determine SNC state if the system is
> + * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes
> + * to L3 caches. It will be OK if system is booted with hyperthreading
> + * disabled (since this doesn't affect the ratio).
> + */
> +static __init int snc_get_config(void)
> +{
> +	struct cacheinfo *ci = get_cpu_cacheinfo_level(0, RESCTRL_L3_CACHE);
> +	const cpumask_t *node0_cpumask;
> +	int cpus_per_node, cpus_per_l3;
> +	int ret;
> +
> +	if (!x86_match_cpu(snc_cpu_ids) || !ci)
> +		return 1;
> +
> +	cpus_read_lock();
> +	if (num_online_cpus() != num_present_cpus())
> +		pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
> +	cpus_read_unlock();
> +
> +	node0_cpumask = cpumask_of_node(cpu_to_node(0));
> +
> +	cpus_per_node = cpumask_weight(node0_cpumask);
> +	cpus_per_l3 = cpumask_weight(&ci->shared_cpu_map);
> +
> +	if (!cpus_per_node || !cpus_per_l3)
> +		return 1;
> +
> +	ret = cpus_per_l3 / cpus_per_node;
> +
> +	/* sanity check: Only valid results are 1, 2, 3, 4 */
> +	switch (ret) {
> +	case 1:
> +		break;
> +	case 2 ... 4:
> +		pr_info("Sub-NUMA Cluster mode detected with %d nodes per L3 cache\n", ret);
> +		rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_L3_NODE;
> +		break;
> +	default:
> +		pr_warn("Ignore improbable SNC node count %d\n", ret);
> +		ret = 1;
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
>   int __init rdt_get_mon_l3_config(struct rdt_resource *r)
>   {
>   	unsigned int mbm_offset = boot_cpu_data.x86_cache_mbm_width_offset;
> @@ -1115,6 +1179,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
>   	unsigned int threshold;
>   	int ret;
>   
> +	snc_nodes_per_l3_cache = snc_get_config();
> +
>   	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
>   	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
>   	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;

With typo fixed:

| Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Reinette

* Re: [PATCH v20 17/18] x86/resctrl: Sub-NUMA Cluster (SNC) detection
  2024-06-10 18:35 ` [PATCH v20 17/18] x86/resctrl: Sub-NUMA Cluster (SNC) detection Tony Luck
  2024-06-20 21:34   ` Reinette Chatre
@ 2024-06-21 17:05   ` Markus Elfring
  2024-06-21 17:14     ` Luck, Tony
  1 sibling, 1 reply; 61+ messages in thread
From: Markus Elfring @ 2024-06-21 17:05 UTC (permalink / raw)
  To: Tony Luck, x86, Babu Moger, Dave Martin, Drew Fustini, Fenghua Yu,
	James Morse, Maciej Wieczor-Retman, Peter Newman, Reinette Chatre
  Cc: LKML, patches

…
> Add the missing definition of pr_fmt() to monitor.c. …

What do you think about adding a “Fixes” tag accordingly?


…
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
…
> +static __init int snc_get_config(void)
> +{
…
> +	cpus_read_lock();
> +	if (num_online_cpus() != num_present_cpus())
> +		pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
> +	cpus_read_unlock();
…

Would you be interested in applying a statement like “guard(cpus_read_lock)();”?
https://elixir.bootlin.com/linux/v6.10-rc4/source/include/linux/cleanup.h#L133

Regards,
Markus

* RE: [PATCH v20 17/18] x86/resctrl: Sub-NUMA Cluster (SNC) detection
  2024-06-21 17:05   ` Markus Elfring
@ 2024-06-21 17:14     ` Luck, Tony
  0 siblings, 0 replies; 61+ messages in thread
From: Luck, Tony @ 2024-06-21 17:14 UTC (permalink / raw)
  To: Markus Elfring, x86@kernel.org, Babu Moger, Dave Martin,
	Drew Fustini, Yu, Fenghua, James Morse, Wieczor-Retman, Maciej,
	Peter Newman, Chatre, Reinette
  Cc: LKML, patches@lists.linux.dev

> > Add the missing definition of pr_fmt() to monitor.c. …
>
> What do you think about adding a “Fixes” tag accordingly?

Until this patch there were only "can't happen" pr_info()/pr_warn()
messages. So no real benefit from having this backported.

If it were to be backported, it would need to be split out from the
rest of this patch, as the rest of the changes are dependent
on the previous 16 patches in this series.

> > +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> …
> > +static __init int snc_get_config(void)
> > +{
> …
> > +   cpus_read_lock();
> > +   if (num_online_cpus() != num_present_cpus())
> > +           pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
> > +   cpus_read_unlock();
> …
>
> Would you be interested in applying a statement like “guard(cpus_read_lock)();”?
> https://elixir.bootlin.com/linux/v6.10-rc4/source/include/linux/cleanup.h#L133

IMHO it would be better to convert resctrl to using the cleanup.h helpers
as a separate series rather than having just one place use it.
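
For readers unfamiliar with the idea behind those helpers: they build on the compiler's cleanup attribute (a GCC/Clang extension) so that a release function runs automatically when a variable goes out of scope. The following user-space sketch uses hypothetical names and a counter in place of a real lock. It is not the kernel's guard() implementation.

```c
#include <assert.h>

int lock_depth;	/* stand-in for a lock: >0 means "held" */

void fake_lock(void)
{
	lock_depth++;
}

/* Cleanup handlers receive a pointer to the variable leaving scope. */
void fake_unlock(int *unused)
{
	(void)unused;
	lock_depth--;
}

/* Take the fake lock now and release it when the enclosing scope ends. */
#define FAKE_GUARD() \
	int guard_token __attribute__((cleanup(fake_unlock), unused)) = \
		(fake_lock(), 0)

/* Returns 1 if the two counts differ, with no explicit unlock on any path. */
int counts_differ_locked(int online, int present)
{
	FAKE_GUARD();

	assert(lock_depth == 1);
	if (online != present)
		return 1;	/* early return: fake_unlock() still runs */
	return 0;
}
```

The early return shows the benefit: every exit path releases the lock without a matching unlock call.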

-Tony

* [PATCH v20 18/18] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2024-06-10 18:35 [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (16 preceding siblings ...)
  2024-06-10 18:35 ` [PATCH v20 17/18] x86/resctrl: Sub-NUMA Cluster (SNC) detection Tony Luck
@ 2024-06-10 18:35 ` Tony Luck
  2024-06-20 21:35   ` Reinette Chatre
  2024-06-13 19:17 ` [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems Moger, Babu
  18 siblings, 1 reply; 61+ messages in thread
From: Tony Luck @ 2024-06-10 18:35 UTC (permalink / raw)
  To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches, Tony Luck

With Sub-NUMA Cluster (SNC) mode enabled the scope of monitoring resources
is per-NODE instead of per-L3 cache. Backwards compatibility is mainatined
by providing files in the mon_L3_XX directories that sum event counts
for all SNC nodes sharing an L3 cache.

New files provide per-SNC node event counts.

Users should be aware that SNC mode also affects the amount of L3 cache
available for allocation within each SNC node.
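
As a worked example of that effect, the amount of cache represented by one allocation mask bit can be computed as below. The function name is hypothetical and the integer arithmetic is a simplification. The numbers match the example in the documentation text added by this patch.

```c
#include <assert.h>

/*
 * Cache represented by a single bit of the allocation bitmask, in MB.
 * With SNC enabled the per-bit amount is divided by the number of
 * SNC nodes per L3 cache.
 */
unsigned int mb_per_cbm_bit(unsigned int cache_mb, unsigned int cbm_bits,
			    unsigned int snc_nodes)
{
	assert(cbm_bits > 0 && snc_nodes > 0);
	return cache_mb / cbm_bits / snc_nodes;
}
```

A 100MB cache with 10-bit masks gives 10MB per bit without SNC, and 5MB per bit with two SNC nodes per L3 cache.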

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 Documentation/arch/x86/resctrl.rst | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index 627e23869bca..6695b7bc698d 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -375,6 +375,10 @@ When monitoring is enabled all MON groups will also contain:
 	all tasks in the group. In CTRL_MON groups these files provide
 	the sum for all tasks in the CTRL_MON group and all tasks in
 	MON groups. Please see example section for more details on usage.
+	On systems with Sub-NUMA Cluster (SNC) enabled there are extra
+	directories for each node (located within the "mon_L3_XX" directory
+	for the L3 cache they occupy). These are named "mon_sub_L3_YY"
+	where "YY" is the node number.
 
 "mon_hw_id":
 	Available only with debug option. The identifier used by hardware
@@ -484,6 +488,29 @@ if non-contiguous 1s value is supported. On a system with a 20-bit mask
 each bit represents 5% of the capacity of the cache. You could partition
 the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
 
+Notes on Sub-NUMA Cluster mode
+==============================
+When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA
+nodes much more readily than between regular NUMA nodes since the CPUs
+on Sub-NUMA nodes share the same L3 cache and the system may report
+the NUMA distance between Sub-NUMA nodes with a lower value than used
+for regular NUMA nodes.
+
+The top-level monitoring files in each "mon_L3_XX" directory provide
+the sum of data across all SNC nodes sharing an L3 cache instance.
+Users who bind tasks to the CPUs of a specific Sub-NUMA node can read
+the "llc_occupancy", "mbm_total_bytes", and "mbm_local_bytes" in the
+"mon_sub_L3_YY" directories to get node local data.
+
+Memory bandwidth allocation is still performed at the L3 cache
+level. I.e. throttling controls are applied to all SNC nodes.
+
+L3 cache allocation bitmaps also apply to all SNC nodes. But note that
+the amount of L3 cache represented by each bit is divided by the number
+of SNC nodes per L3 cache. E.g. with a 100MB cache on a system with 10-bit
+allocation masks each bit normally represents 10MB. With SNC mode enabled
+with two SNC nodes per L3 cache, each bit would only represent 5MB.
+
 Memory bandwidth Allocation and monitoring
 ==========================================
 
-- 
2.45.0


* Re: [PATCH v20 18/18] x86/resctrl: Update documentation with Sub-NUMA cluster changes
  2024-06-10 18:35 ` [PATCH v20 18/18] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
@ 2024-06-20 21:35   ` Reinette Chatre
  0 siblings, 0 replies; 61+ messages in thread
From: Reinette Chatre @ 2024-06-20 21:35 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
	James Morse, Babu Moger, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches

Hi Tony,

On 6/10/24 11:35 AM, Tony Luck wrote:
> With Sub-NUMA Cluster (SNC) mode enabled the scope of monitoring resources
> is per-NODE instead of per-L3 cache. Backwards compatibility is mainatined

mainatined -> maintained

> by providing files in the mon_L3_XX directories that sum event counts
> for all SNC nodes sharing an L3 cache.
> 
> New files provide per-SNC node event counts.
> 
> Users should be aware that SNC mode also affects the amount of L3 cache
> available for allocation within each SNC node.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>   Documentation/arch/x86/resctrl.rst | 27 +++++++++++++++++++++++++++
>   1 file changed, 27 insertions(+)
> 
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index 627e23869bca..6695b7bc698d 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -375,6 +375,10 @@ When monitoring is enabled all MON groups will also contain:
>   	all tasks in the group. In CTRL_MON groups these files provide
>   	the sum for all tasks in the CTRL_MON group and all tasks in
>   	MON groups. Please see example section for more details on usage.
> +	On systems with Sub-NUMA Cluster (SNC) enabled there are extra
> +	directories for each node (located within the "mon_L3_XX" directory
> +	for the L3 cache they occupy). These are named "mon_sub_L3_YY"
> +	where "YY" is the node number.
>   
>   "mon_hw_id":
>   	Available only with debug option. The identifier used by hardware
> @@ -484,6 +488,29 @@ if non-contiguous 1s value is supported. On a system with a 20-bit mask
>   each bit represents 5% of the capacity of the cache. You could partition
>   the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
>   
> +Notes on Sub-NUMA Cluster mode
> +==============================
> +When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA
> +nodes much more readily than between regular NUMA nodes since the CPUs
> +on Sub-NUMA nodes share the same L3 cache and the system may report
> +the NUMA distance between Sub-NUMA nodes with a lower value than used
> +for regular NUMA nodes.
> +
> +The top-level monitoring files in each "mon_L3_XX" directory provide
> +the sum of data across all SNC nodes sharing an L3 cache instance.
> +Users who bind tasks to the CPUs of a specific Sub-NUMA node can read
> +the "llc_occupancy", "mbm_total_bytes", and "mbm_local_bytes" in the
> +"mon_sub_L3_YY" directories to get node local data.
> +
> +Memory bandwidth allocation is still performed at the L3 cache
> +level. I.e. throttling controls are applied to all SNC nodes.
> +
> +L3 cache allocation bitmaps also apply to all SNC nodes. But note that
> +the amount of L3 cache represented by each bit is divided by the number
> +of SNC nodes per L3 cache. E.g. with a 100MB cache on a system with 10-bit
> +allocation masks each bit normally represents 10MB. With SNC mode enabled
> +with two SNC nodes per L3 cache, each bit would only represent 5MB.

"each bit would only represent 5MB" -> "each bit represents 5MB" or
"each bit only represents 5MB" or "each bit represents only 5MB"?

> +
>   Memory bandwidth Allocation and monitoring
>   ==========================================
>   

| Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Reinette

* Re: [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems
  2024-06-10 18:35 [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
                   ` (17 preceding siblings ...)
  2024-06-10 18:35 ` [PATCH v20 18/18] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
@ 2024-06-13 19:17 ` Moger, Babu
  2024-06-13 20:32   ` Reinette Chatre
  18 siblings, 1 reply; 61+ messages in thread
From: Moger, Babu @ 2024-06-13 19:17 UTC (permalink / raw)
  To: Tony Luck, Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman,
	Peter Newman, James Morse, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches

Hi Reinette,

I may be a little bit out of sync here. Also, sorry to come back late in the
series.

Looking at the series again, I see this approach adds lots of code.
Look at this structure.


@@ -187,10 +196,12 @@ struct rdt_resource {
 	bool			alloc_capable;
 	bool			mon_capable;
 	int			num_rmid;
-	enum resctrl_scope	scope;
+	enum resctrl_scope	ctrl_scope;
+	enum resctrl_scope	mon_scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
-	struct list_head	domains;
+	struct list_head	ctrl_domains;
+	struct list_head	mon_domains;
 	char			*name;
 	int			data_width;
 	u32			default_ctrl;

There are two scope fields.
There are two domains fields.

These are very confusing and very hard to maintain. Also, I am not sure if
these fields are useful for anything other than the SNC feature. This approach
adds quite a bit of code for no specific advantage.

Why don't we just split the RDT_RESOURCE_L3 resource
into separate resources, one for control, one for monitoring.
We already have "control" only resources (MBA, SMBA, L2). Let's create a new
"monitor" only resource. I feel it will be a much cleaner approach.

Tony has already tried that approach and showed that it is much simpler.

v15-RFC :
https://lore.kernel.org/lkml/20240130222034.37181-1-tony.luck@intel.com/

What do you think?

Thanks
Babu


On 6/10/24 13:35, Tony Luck wrote:
> This series based on top of tip x86/cache commit f385f0246394
> ("x86/resctrl: Replace open coded cacheinfo searches")
> 
> The Sub-NUMA cluster feature on some Intel processors partitions the CPUs
> that share an L3 cache into two or more sets. This plays havoc with the
> Resource Director Technology (RDT) monitoring features.  Prior to this
> patch Intel has advised that SNC and RDT are incompatible.
> 
> Some of these CPUs support an MSR that can partition the RMID counters
> in the same way. This allows monitoring features to be used. Legacy
> monitoring files provide the sum of counters from each SNC node for
> backwards compatibility. Additional  files per SNC node provide details
> per node.
> 
> Memory bandwidth allocation features continue to operate at
> the scope of the L3 cache.
> 
> L3 cache occupancy and allocation operate on the portion of
> L3 cache available for each SNC node.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> 
> ---
> Changes since v19: https://lore.kernel.org/all/20240528222006.58283-1-tony.luck@intel.com/
> 
> 1-4:	Refactor on top of <linux/cacheinfo.h> change.
> 	Nothing functional.
> 
> 5:	No change
> 
> 6:	Updated commit message with note about RMID Sharing mode.
> 	Renamed __rmid_read() to __rmid_read_phys() and performed
> 	translation from logical RMID to physical RMID at callsites.
> 	Updated comment for __rmid_read_phys() with explanation of
> 	logical/physical RMIDs. Consistently use "SNC node" avoid
> 	SNC domain. Add specifics for non-SNC mode.
> 	Joined split line on __rmid_read() definition (even with the
> 	added "_phys" to its name still fits on one line.
> 
> 7:	No change
> 
> 8:	get_cpu_cacheinfo_level() moved to <linux/cacheinfo.h>
> 	currently in tip x86/cache
> 	no other changes
> 
> 9:	Dropped the "sumdomains" field from struct rmid_read (a NULL
> 	domain field now indicates that summing is needed).
> 	Fix kerneldoc comments for struct rmid_read.
> 	Updated commit comments with more "why" than "what".
> 
> 10:	No change
> 
> 11:	Fix commit comments per suggestions
> 	1) Added some "why it is OK to take a bit from evtid"
> 	2) s/The stolen bit is given to/Give the bit to/
> 	3) Don't use "l3_cache_id" (which looks like a variable)
> 
> 12:	Fix commit message.
> 	s/kernfs_find_and_get_ns()/kernfs_find_and_get()/
> 	Add kernfs_put() to drop hold from kernfs_find_and_get()
> 	Drop useless "/* create the directory */" comment.
> 
> 13:	Add kernfs_put() to drop hold from kernfs_find_and_get() [two places]
> 
> 14:	Add cpumask parameter to mon_event_read() so SNC decsions are
> 	all in rdtgroup_mondata_show() instead of spread between functions.
> 	Add comments in rdtgroup_mondata_show() to explain the sum vs. no-sum
> 	cases.
> 	Moved the mon_event_read() call into both arms of the if-else
> 	instead of "d = NULL; goto got_cacheinfo;"
> 
> 15:	New (replaces 15-17). Make __mon_event_read() do the sum across
> 	domains (at filesystem level). Move the CPU/domain sanity check out
> 	of resctrl_arch_rmid_read() and into __mon_event_read()
> 	with separate scope tests for single domain vs. sum over
> 	domains.
> 
> 16:	[Was 18] Update commit message with details about MSR 0xCA0, what changes
> 	when bit 0 is cleared, and why this is necessary.
> 	Dropped "Add an architecture specific hook" language from
> 	commit message.
> 
> 17:	[Was 19] Drop "and enabling" from shortlog (enabling done by
> 	previous commit).
> 	Added checks that cpumask_weight() isn't returning zero (to keep
> 	static checkers from warning of possible divide by zero).
> 
> 18:	[Was 20] Fix some "Sub-NUMA" references to say "Sub-NUMA Cluster"
> 	Added document section on effect of SNC mode on MBA and L3 CAT.
> 
> Tony Luck (18):
>   x86/resctrl: Prepare for new domain scope
>   x86/resctrl: Prepare to split rdt_domain structure
>   x86/resctrl: Prepare for different scope for control/monitor
>     operations
>   x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
>   x86/resctrl: Add node-scope to the options for feature scope
>   x86/resctrl: Introduce snc_nodes_per_l3_cache
>   x86/resctrl: Block use of mba_MBps mount option on Sub-NUMA Cluster
>     (SNC) systems
>   x86/resctrl: Prepare for new Sub-NUMA Cluster (SNC) monitor files
>   x86/resctrl: Add a new field to struct rmid_read for summation of
>     domains
>   x86/resctrl: Refactor mkdir_mondata_subdir() with a helper function
>   x86/resctrl: Allocate a new field in union mon_data_bits
>   x86/resctrl: Create Sub-NUMA Cluster (SNC) monitor files
>   x86/resctrl: Handle removing directories in Sub-NUMA Cluster (SNC)
>     mode
>   x86/resctrl: Fill out rmid_read structure for smp_call*() to read a
>     counter
>   x86/resctrl: Make __mon_event_count() handle sum domains
>   x86/resctrl: Enable RMID shared RMID mode on Sub-NUMA Cluster (SNC)
>     systems
>   x86/resctrl: Sub-NUMA Cluster (SNC) detection
>   x86/resctrl: Update documentation with Sub-NUMA cluster changes
> 
>  Documentation/arch/x86/resctrl.rst        |  27 ++
>  include/linux/resctrl.h                   |  87 ++++--
>  arch/x86/include/asm/msr-index.h          |   1 +
>  arch/x86/kernel/cpu/resctrl/internal.h    |  93 +++++--
>  arch/x86/kernel/cpu/resctrl/core.c        | 312 ++++++++++++++++------
>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  85 +++---
>  arch/x86/kernel/cpu/resctrl/monitor.c     | 242 ++++++++++++++---
>  arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  27 +-
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 272 ++++++++++++-------
>  9 files changed, 835 insertions(+), 311 deletions(-)
> 
> 
> base-commit: f385f024639431bec3e70c33cdbc9563894b3ee5

-- 
Thanks
Babu Moger


* Re: [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems
  2024-06-13 19:17 ` [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems Moger, Babu
@ 2024-06-13 20:32   ` Reinette Chatre
  2024-06-13 21:02     ` Luck, Tony
  2024-06-14 16:27     ` Moger, Babu
  0 siblings, 2 replies; 61+ messages in thread
From: Reinette Chatre @ 2024-06-13 20:32 UTC (permalink / raw)
  To: babu.moger, Tony Luck, Fenghua Yu, Maciej Wieczor-Retman,
	Peter Newman, James Morse, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches

Hi Babu,

On 6/13/24 12:17 PM, Moger, Babu wrote:
> I may be little bit out of sync here. Also, sorry to come back late in the
> series.
> 
> Looking at the series again, I see this approach adds lots of code.
> Look at this structure.
> 
> 
> @@ -187,10 +196,12 @@ struct rdt_resource {
>   	bool			alloc_capable;
>   	bool			mon_capable;
>   	int			num_rmid;
> -	enum resctrl_scope	scope;
> +	enum resctrl_scope	ctrl_scope;
> +	enum resctrl_scope	mon_scope;
>   	struct resctrl_cache	cache;
>   	struct resctrl_membw	membw;
> -	struct list_head	domains;
> +	struct list_head	ctrl_domains;
> +	struct list_head	mon_domains;
>   	char			*name;
>   	int			data_width;
>   	u32			default_ctrl;
> 
> There are two scope fields.
> There are two domains fields.
> 
> These are very confusing and very hard to maintain. Also, I am not sure if
> these fields are useful for anything other than SNC feature. This approach
> adds quite a bit of code for no specific advantage.
> 
> Why don't we just split the RDT_RESOURCE_L3 resource
> into separate resources, one for control, one for monitoring.
> We already have "control" only resources (MBA, SMBA, L2). Lets create new
> "monitor" only resource. I feel it will be much cleaner approach.
> 
> Tony has already tried that approach and showed that it is much simpler.
> 
> v15-RFC :
> https://lore.kernel.org/lkml/20240130222034.37181-1-tony.luck@intel.com/
> 
> What do you think?
> 

Some highlights of my thoughts in response to that series, but the whole thread
may be of interest to you:
https://lore.kernel.org/lkml/78c88c6d-2e8d-42d1-a6f2-1c5ac38fb258@intel.com/
https://lore.kernel.org/lkml/59944211-d34a-4ba3-a1de-095822c0b3f0@intel.com/

Reinette


* RE: [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems
  2024-06-13 20:32   ` Reinette Chatre
@ 2024-06-13 21:02     ` Luck, Tony
  2024-06-14 16:27     ` Moger, Babu
  1 sibling, 0 replies; 61+ messages in thread
From: Luck, Tony @ 2024-06-13 21:02 UTC (permalink / raw)
  To: Chatre, Reinette, babu.moger@amd.com, Yu, Fenghua,
	Wieczor-Retman, Maciej, Peter Newman, James Morse, Drew Fustini,
	Dave Martin
  Cc: x86@kernel.org, linux-kernel@vger.kernel.org,
	patches@lists.linux.dev

> Looking at the series again, I see this approach adds lots of code.
> Look at this structure.
> 
> 
> @@ -187,10 +196,12 @@ struct rdt_resource {
>   	bool			alloc_capable;
>   	bool			mon_capable;
>   	int			num_rmid;
> -	enum resctrl_scope	scope;
> +	enum resctrl_scope	ctrl_scope;
> +	enum resctrl_scope	mon_scope;
>   	struct resctrl_cache	cache;
>   	struct resctrl_membw	membw;
> -	struct list_head	domains;
> +	struct list_head	ctrl_domains;
> +	struct list_head	mon_domains;
>   	char			*name;
>   	int			data_width;
>   	u32			default_ctrl;
> 
> There are two scope fields.
> There are two domains fields.

I might at some future time split struct rdt_resource into struct
rdt_ctrl_resource and struct rdt_mon_resource.  That would get rid of
multiple scope and domain fields. There are also a bunch of fields that
are specific to just ctrl or mon functions.

That would require other churn. E.g. getting rid of the
rdt_resources_all[] array and the macros that scan it to perform
various actions. Replace with two lists, one each for active ctrl/mon
resources. Once James' patches to split into architecture vs. generic
parts are applied this might be useful so that CPU vendors could add
resources that didn't have equivalents for other architectures.

There isn't a pressing need for that today. But splitting the rdt_domain
structure now makes for a good base to build on later (if needed).
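
A minimal sketch of the "two lists" idea described above, not code from
this series. The struct names rdt_ctrl_resource and rdt_mon_resource come
from the text; their layout, the hand-rolled singly linked lists, and the
register_*()/count_*() helpers are assumptions made only for illustration
(the kernel would use struct list_head and its own registration paths).

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's enum resctrl_scope. */
enum resctrl_scope { SCOPE_L2_CACHE, SCOPE_L3_CACHE, SCOPE_NODE };

/* Hypothetical control-only resource: it carries a single scope. */
struct rdt_ctrl_resource {
	const char *name;
	enum resctrl_scope scope;
	struct rdt_ctrl_resource *next;	/* link in the active ctrl list */
};

/* Hypothetical monitor-only resource: monitor-specific fields live here. */
struct rdt_mon_resource {
	const char *name;
	enum resctrl_scope scope;
	int num_rmid;
	struct rdt_mon_resource *next;	/* link in the active mon list */
};

/* Two lists take the place of one array scanned by macros. */
static struct rdt_ctrl_resource *active_ctrl;
static struct rdt_mon_resource *active_mon;

static void register_ctrl(struct rdt_ctrl_resource *r)
{
	assert(r && r->name);
	r->next = active_ctrl;
	active_ctrl = r;
}

static void register_mon(struct rdt_mon_resource *r)
{
	assert(r && r->name);
	r->next = active_mon;
	active_mon = r;
}

/* Walk only the resources that are relevant to the caller. */
static int count_ctrl(void)
{
	int n = 0;

	for (struct rdt_ctrl_resource *r = active_ctrl; r; r = r->next)
		n++;
	return n;
}

static int count_mon(void)
{
	int n = 0;

	for (struct rdt_mon_resource *r = active_mon; r; r = r->next)
		n++;
	return n;
}
```

With this shape, L3 would register once on each list (possibly with a
different scope on each), while MBA, SMBA and L2 would appear only on the
ctrl list.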

-Tony


* Re: [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems
  2024-06-13 20:32   ` Reinette Chatre
  2024-06-13 21:02     ` Luck, Tony
@ 2024-06-14 16:27     ` Moger, Babu
  2024-06-14 16:46       ` Reinette Chatre
  1 sibling, 1 reply; 61+ messages in thread
From: Moger, Babu @ 2024-06-14 16:27 UTC (permalink / raw)
  To: Reinette Chatre, Tony Luck, Fenghua Yu, Maciej Wieczor-Retman,
	Peter Newman, James Morse, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches

Hi Reinette,

On 6/13/24 15:32, Reinette Chatre wrote:
> Hi Babu,
> 
> On 6/13/24 12:17 PM, Moger, Babu wrote:
>> I may be little bit out of sync here. Also, sorry to come back late in the
>> series.
>>
>> Looking at the series again, I see this approach adds lots of code.
>> Look at this structure.
>>
>>
>> @@ -187,10 +196,12 @@ struct rdt_resource {
>>       bool            alloc_capable;
>>       bool            mon_capable;
>>       int            num_rmid;
>> -    enum resctrl_scope    scope;
>> +    enum resctrl_scope    ctrl_scope;
>> +    enum resctrl_scope    mon_scope;
>>       struct resctrl_cache    cache;
>>       struct resctrl_membw    membw;
>> -    struct list_head    domains;
>> +    struct list_head    ctrl_domains;
>> +    struct list_head    mon_domains;
>>       char            *name;
>>       int            data_width;
>>       u32            default_ctrl;
>>
>> There are two scope fields.
>> There are two domains fields.
>>
>> These are very confusing and very hard to maintain. Also, I am not sure if
>> these fields are useful for anything other than SNC feature. This approach
>> adds quite a bit of code for no specific advantage.
>>
>> Why don't we just split the RDT_RESOURCE_L3 resource
>> into separate resources, one for control, one for monitoring.
>> We already have "control" only resources (MBA, SMBA, L2). Lets create new
>> "monitor" only resource. I feel it will be much cleaner approach.
>>
>> Tony has already tried that approach and showed that it is much simpler.
>>
>> v15-RFC :
>> https://lore.kernel.org/lkml/20240130222034.37181-1-tony.luck@intel.com/
>>
>> What do you think?
>>
> 
> Some highlights of my thoughts in response to that series, but the whole
> thread
> may be of interest to you:
> https://lore.kernel.org/lkml/78c88c6d-2e8d-42d1-a6f2-1c5ac38fb258@intel.com/
> https://lore.kernel.org/lkml/59944211-d34a-4ba3-a1de-095822c0b3f0@intel.com/
> 

Went through the thread, in summary:

The main concerns are related to duplication of code and data structures.

The solutions are

a) Split the domains.
This is what this series is doing now. This creates members like
ctrl_scope, mon_scope, ctrl_domains, etc. These fields are added to all
the resources (MBA, SMBA and L2). Then there is an additional domain header.


b) Split the resource.
  Split RDT_RESOURCE_L3 into two, one for "monitor" and one for "control".
  There will be one domain structure for "monitor" and  one for "control"

Both these approaches have code and data duplication. So, there is no
difference that way.

But complexity-wise, approach (a) adds quite a bit of complexity. Doesn't it?

For me, solution (b) looks simple and easy. Eventually, we may have to
restructure these data structures anyway. I feel it is the right direction.

-- 
Thanks
Babu Moger


* Re: [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems
  2024-06-14 16:27     ` Moger, Babu
@ 2024-06-14 16:46       ` Reinette Chatre
  2024-06-14 21:29         ` Moger, Babu
  0 siblings, 1 reply; 61+ messages in thread
From: Reinette Chatre @ 2024-06-14 16:46 UTC (permalink / raw)
  To: babu.moger, Tony Luck, Fenghua Yu, Maciej Wieczor-Retman,
	Peter Newman, James Morse, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches

Hi Babu,

On 6/14/24 9:27 AM, Moger, Babu wrote:
> Hi Reinette,
> 
> On 6/13/24 15:32, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 6/13/24 12:17 PM, Moger, Babu wrote:
>>> I may be little bit out of sync here. Also, sorry to come back late in the
>>> series.
>>>
>>> Looking at the series again, I see this approach adds lots of code.
>>> Look at this structure.
>>>
>>>
>>> @@ -187,10 +196,12 @@ struct rdt_resource {
>>>        bool            alloc_capable;
>>>        bool            mon_capable;
>>>        int            num_rmid;
>>> -    enum resctrl_scope    scope;
>>> +    enum resctrl_scope    ctrl_scope;
>>> +    enum resctrl_scope    mon_scope;
>>>        struct resctrl_cache    cache;
>>>        struct resctrl_membw    membw;
>>> -    struct list_head    domains;
>>> +    struct list_head    ctrl_domains;
>>> +    struct list_head    mon_domains;
>>>        char            *name;
>>>        int            data_width;
>>>        u32            default_ctrl;
>>>
>>> There are two scope fields.
>>> There are two domains fields.
>>>
>>> These are very confusing and very hard to maintain. Also, I am not sure if
>>> these fields are useful for anything other than SNC feature. This approach
>>> adds quite a bit of code for no specific advantage.
>>>
>>> Why don't we just split the RDT_RESOURCE_L3 resource
>>> into separate resources, one for control, one for monitoring.
>>> We already have "control" only resources (MBA, SMBA, L2). Lets create new
>>> "monitor" only resource. I feel it will be much cleaner approach.
>>>
>>> Tony has already tried that approach and showed that it is much simpler.
>>>
>>> v15-RFC :
>>> https://lore.kernel.org/lkml/20240130222034.37181-1-tony.luck@intel.com/
>>>
>>> What do you think?
>>>
>>
>> Some highlights of my thoughts in response to that series, but the whole
>> thread
>> may be of interest to you:
>> https://lore.kernel.org/lkml/78c88c6d-2e8d-42d1-a6f2-1c5ac38fb258@intel.com/
>> https://lore.kernel.org/lkml/59944211-d34a-4ba3-a1de-095822c0b3f0@intel.com/
>>
> 
> Went through the thread, in summary:
> 
> The main concerns are related to duplication of code and data structures.
> 
> The solutions are
> 
> a) Split the domains.
> This is what this series is doing now. This creates members like
> ctrl_scope, mon_scope, ctrl_domains etc.. These fields are added to all
> the resources (MBA, SMBA and L2). Then there is additional domain header.
> 
> 
> b) Split the resource.
>    Split RDT_RESOURCE_L3 into two, one for "monitor" and one for "control".
>    There will be one domain structure for "monitor" and  one for "control"
> 
> Both these approaches have code and data duplication. So, there is no
> difference that way.

Could you please elaborate where code and data duplication of (a) is?

> 
> But complexity wise, approach (a) adds quite bit of complexity. Doesn't it?

"complex" is a subjective term. Could you please elaborate what is complex
about this? Is your concern about the size of the patch? To me that is
not a concern when considering the end result of how the resctrl structures
are organized.

> 
> For me, solution (b) looks simple and easy. Eventually, we may have to
> restructure these data structures anyways. I feel it is the right direction.
> 

I understand that it is tempting to look for the smallest patch possible but we
really need to ensure that any work integrates well into resctrl. Doing
so may end up with larger patches but in the end it makes the data structures
and code easier to understand. I specifically find the duplication of structures
troublesome since that requires developers to always be on high alert of
what code is being worked on and what flows the particular code participates in
since the duplication results in members of structures being invalid based on which
code flow is used. To me this is an unnecessary burden on developers and against
your original goal of making resctrl easier to maintain.

Reinette


* Re: [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems
  2024-06-14 16:46       ` Reinette Chatre
@ 2024-06-14 21:29         ` Moger, Babu
  2024-06-14 21:40           ` Luck, Tony
  2024-06-14 23:11           ` Reinette Chatre
  0 siblings, 2 replies; 61+ messages in thread
From: Moger, Babu @ 2024-06-14 21:29 UTC (permalink / raw)
  To: Reinette Chatre, babu.moger, Tony Luck, Fenghua Yu,
	Maciej Wieczor-Retman, Peter Newman, James Morse, Drew Fustini,
	Dave Martin
  Cc: x86, linux-kernel, patches

Hi Reinette,

On 6/14/2024 11:46 AM, Reinette Chatre wrote:
> Hi Babu,
> 
> On 6/14/24 9:27 AM, Moger, Babu wrote:
>> Hi Reinette,
>>
>> On 6/13/24 15:32, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> On 6/13/24 12:17 PM, Moger, Babu wrote:
>>>> I may be little bit out of sync here. Also, sorry to come back late 
>>>> in the
>>>> series.
>>>>
>>>> Looking at the series again, I see this approach adds lots of code.
>>>> Look at this structure.
>>>>
>>>>
>>>> @@ -187,10 +196,12 @@ struct rdt_resource {
>>>>        bool            alloc_capable;
>>>>        bool            mon_capable;
>>>>        int            num_rmid;
>>>> -    enum resctrl_scope    scope;
>>>> +    enum resctrl_scope    ctrl_scope;
>>>> +    enum resctrl_scope    mon_scope;
>>>>        struct resctrl_cache    cache;
>>>>        struct resctrl_membw    membw;
>>>> -    struct list_head    domains;
>>>> +    struct list_head    ctrl_domains;
>>>> +    struct list_head    mon_domains;
>>>>        char            *name;
>>>>        int            data_width;
>>>>        u32            default_ctrl;
>>>>
>>>> There are two scope fields.
>>>> There are two domains fields.
>>>>
>>>> These are very confusing and very hard to maintain. Also, I am not 
>>>> sure if
>>>> these fields are useful for anything other than SNC feature. This 
>>>> approach
>>>> adds quite a bit of code for no specific advantage.
>>>>
>>>> Why don't we just split the RDT_RESOURCE_L3 resource
>>>> into separate resources, one for control, one for monitoring.
>>>> We already have "control" only resources (MBA, SMBA, L2). Lets 
>>>> create new
>>>> "monitor" only resource. I feel it will be much cleaner approach.
>>>>
>>>> Tony has already tried that approach and showed that it is much 
>>>> simpler.
>>>>
>>>> v15-RFC :
>>>> https://lore.kernel.org/lkml/20240130222034.37181-1-tony.luck@intel.com/ 
>>>>
>>>>
>>>> What do you think?
>>>>
>>>
>>> Some highlights of my thoughts in response to that series, but the whole
>>> thread
>>> may be of interest to you:
>>> https://lore.kernel.org/lkml/78c88c6d-2e8d-42d1-a6f2-1c5ac38fb258@intel.com/ 
>>>
>>> https://lore.kernel.org/lkml/59944211-d34a-4ba3-a1de-095822c0b3f0@intel.com/ 
>>>
>>>
>>
>> Went through the thread, in summary:
>>
>> The main concerns are related to duplication of code and data structures.
>>
>> The solutions are
>>
>> a) Split the domains.
>> This is what this series is doing now. This creates members like
>> ctrl_scope, mon_scope, ctrl_domains etc.. These fields are added to all
>> the resources (MBA, SMBA and L2). Then there is additional domain header.
>>
>>
>> b) Split the resource.
>>    Split RDT_RESOURCE_L3 into two, one for "monitor" and one for 
>> "control".
>>    There will be one domain structure for "monitor" and  one for 
>> "control"
>>
>> Both these approaches have code and data duplication. So, there is no
>> difference that way.
> 
> Could you please elaborate where code and data duplication of (a) is?

We have ctrl_scope, mon_scope, ctrl_domains, mon_domains.  Only one
resource, RDT_RESOURCE_L3, is going to use these fields. The rest of the
resources don't need these fields. But these fields are part of all the
resources.

I am not too worried about the size of the patch.  But I don't foresee
these fields being used anytime soon in these resources (MBA, L2,
SMBA). Why add them now? In future we may have to clean up all these anyway.

> 
>>
>> But complexity wise, approach (a) adds quite bit of complexity. 
>> Doesn't it?
> 
> "complex" is a subjective term. Could you please elaborate what is complex
> about this? Is your concern about the size of the patch? To me that is
> not a concern when considering the end result of how the resctrl structures
> are organized.
> 
>>
>> For me, solution (b) looks simple and easy. Eventually, we may have to
>> restructure these data structures anyways. I feel it is the right 
>> direction.
>>
> 
> I understand that it is tempting to look for smallest patch possible but we
> really need to ensure that any work integrates well into resctrl. Doing
> so may end up with larger patches but in the end it makes the data 
> structures
> and code easier to understand. I specifically find the duplication of 
> structures
> troublesome since that requires developers to always be on high alert of
> what code is being worked on and what flows the particular code 
> participates in
> since the duplication results in members of structures be invalid based 
> on which
> code flow is used. To me this is an unnecessary burden on developers and 
> against
> your original goal of making resctrl easier to maintain.
> 
> Reinette

-- 
- Babu Moger


* RE: [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems
  2024-06-14 21:29         ` Moger, Babu
@ 2024-06-14 21:40           ` Luck, Tony
  2024-06-14 22:31             ` Moger, Babu
  2024-06-14 23:11           ` Reinette Chatre
  1 sibling, 1 reply; 61+ messages in thread
From: Luck, Tony @ 2024-06-14 21:40 UTC (permalink / raw)
  To: babu.moger@amd.com, Chatre, Reinette, Yu, Fenghua,
	Wieczor-Retman, Maciej, Peter Newman, James Morse, Drew Fustini,
	Dave Martin
  Cc: x86@kernel.org, linux-kernel@vger.kernel.org,
	patches@lists.linux.dev

> We have ctrl_scope, mon_scope, ctrl_domains. mon_domains.  Only one 
> resource, RDT_RESOURCE_L3 is going to use these fields. Rest of the 
> resources don't need these fields. But these fields are part of all the 
> resources.
>
> I am not too worried about the size of the patch.  But, I don't foresee 
> these fields will be used anytime soon in these resources(MBA. L3. 
> SMBA). Why add it now? In future we may have to cleanup all these anyways.

Babu,

I mentioned yesterday that future patches could split struct rdt_resource. I was noodling
at doing so back in February. Patches (messy, not finished or fit for consumption) are
here (just to give you an idea where things might build from here):

git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git rdt_split_resource

-Tony


* Re: [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems
  2024-06-14 21:40           ` Luck, Tony
@ 2024-06-14 22:31             ` Moger, Babu
  0 siblings, 0 replies; 61+ messages in thread
From: Moger, Babu @ 2024-06-14 22:31 UTC (permalink / raw)
  To: Luck, Tony, babu.moger@amd.com, Chatre, Reinette, Yu, Fenghua,
	Wieczor-Retman, Maciej, Peter Newman, James Morse, Drew Fustini,
	Dave Martin
  Cc: x86@kernel.org, linux-kernel@vger.kernel.org,
	patches@lists.linux.dev



On 6/14/2024 4:40 PM, Luck, Tony wrote:
>> We have ctrl_scope, mon_scope, ctrl_domains. mon_domains.  Only one
>> resource, RDT_RESOURCE_L3 is going to use these fields. Rest of the
>> resources don't need these fields. But these fields are part of all the
>> resources.
>>
>> I am not too worried about the size of the patch.  But, I don't foresee
>> these fields will be used anytime soon in these resources(MBA. L3.
>> SMBA). Why add it now? In future we may have to cleanup all these anyways.
> 
> Babu,
> 
> I mentioned yesterday that future patches could split struct rdt_resource. I was noodling
> at doing so back in February. Patches (messy, not finished or fit for consumption) are
> here (just to give you an idea where things might build from here):
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git rdt_split_resource

I see, it is similar to splitting the resource (option b, v15-RFC) and
a little more. I think we should move in that direction eventually.
-- 
Babu Moger


* Re: [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems
  2024-06-14 21:29         ` Moger, Babu
  2024-06-14 21:40           ` Luck, Tony
@ 2024-06-14 23:11           ` Reinette Chatre
  2024-06-17 14:06             ` Moger, Babu
  1 sibling, 1 reply; 61+ messages in thread
From: Reinette Chatre @ 2024-06-14 23:11 UTC (permalink / raw)
  To: babu.moger, Tony Luck, Fenghua Yu, Maciej Wieczor-Retman,
	Peter Newman, James Morse, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches

Hi Babu,

On 6/14/24 2:29 PM, Moger, Babu wrote:
> Hi Reinette,
> 
> On 6/14/2024 11:46 AM, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 6/14/24 9:27 AM, Moger, Babu wrote:
>>> Hi Reinette,
>>>
>>> On 6/13/24 15:32, Reinette Chatre wrote:
>>>> Hi Babu,
>>>>
>>>> On 6/13/24 12:17 PM, Moger, Babu wrote:
>>>>> I may be little bit out of sync here. Also, sorry to come back late in the
>>>>> series.
>>>>>
>>>>> Looking at the series again, I see this approach adds lots of code.
>>>>> Look at this structure.
>>>>>
>>>>>
>>>>> @@ -187,10 +196,12 @@ struct rdt_resource {
>>>>>        bool            alloc_capable;
>>>>>        bool            mon_capable;
>>>>>        int            num_rmid;
>>>>> -    enum resctrl_scope    scope;
>>>>> +    enum resctrl_scope    ctrl_scope;
>>>>> +    enum resctrl_scope    mon_scope;
>>>>>        struct resctrl_cache    cache;
>>>>>        struct resctrl_membw    membw;
>>>>> -    struct list_head    domains;
>>>>> +    struct list_head    ctrl_domains;
>>>>> +    struct list_head    mon_domains;
>>>>>        char            *name;
>>>>>        int            data_width;
>>>>>        u32            default_ctrl;
>>>>>
>>>>> There are two scope fields.
>>>>> There are two domains fields.
>>>>>
>>>>> These are very confusing and very hard to maintain. Also, I am not sure if
>>>>> these fields are useful for anything other than SNC feature. This approach
>>>>> adds quite a bit of code for no specific advantage.
>>>>>
>>>>> Why don't we just split the RDT_RESOURCE_L3 resource
>>>>> into separate resources, one for control, one for monitoring.
>>>>> We already have "control" only resources (MBA, SMBA, L2). Lets create new
>>>>> "monitor" only resource. I feel it will be much cleaner approach.
>>>>>
>>>>> Tony has already tried that approach and showed that it is much simpler.
>>>>>
>>>>> v15-RFC :
>>>>> https://lore.kernel.org/lkml/20240130222034.37181-1-tony.luck@intel.com/
>>>>>
>>>>> What do you think?
>>>>>
>>>>
>>>> Some highlights of my thoughts in response to that series, but the whole
>>>> thread
>>>> may be of interest to you:
>>>> https://lore.kernel.org/lkml/78c88c6d-2e8d-42d1-a6f2-1c5ac38fb258@intel.com/
>>>> https://lore.kernel.org/lkml/59944211-d34a-4ba3-a1de-095822c0b3f0@intel.com/
>>>>
>>>
>>> Went through the thread, in summary:
>>>
>>> The main concerns are related to duplication of code and data structures.
>>>
>>> The solutions are
>>>
>>> a) Split the domains.
>>> This is what this series is doing now. This creates members like
>>> ctrl_scope, mon_scope, ctrl_domains etc.. These fields are added to all
>>> the resources (MBA, SMBA and L2). Then there is additional domain header.
>>>
>>>
>>> b) Split the resource.
>>>    Split RDT_RESOURCE_L3 into two, one for "monitor" and one for "control".
>>>    There will be one domain structure for "monitor" and  one for "control"
>>>
>>> Both these approaches have code and data duplication. So, there is no
>>> difference that way.
>>
>> Could you please elaborate where code and data duplication of (a) is?
> 
> We have ctrl_scope, mon_scope, ctrl_domains. mon_domains.  Only one
> resource, RDT_RESOURCE_L3 is going to use these fields. Rest of the
> resources don't need these fields. But these fields are part of all
> the resources.

Correct. There are two new empty fields per resource that does
not support monitoring. Having the new mon_domains list has
the benefit of eliminating monitoring fields from all the domains
forming part of resources that do not support monitoring. More
details are below, but the additional pointer results in a significant
net reduction of unused fields. Having the new mon_scope field has
the benefit of supporting SNC.

> 
> I am not too worried about the size of the patch.  But, I don't
> foresee these fields will be used anytime soon in these
> resources(MBA. L3. SMBA). Why add it now? In future we may have to
> cleanup all these anyways.

This work does indeed go through the effort to _eliminate_ unused fields.
Note how all domains of all resources (whether they support monitoring or
not) currently have to carry a significant number of monitoring fields.
These can be found in both struct rdt_domain (*rmid_busy_llc, *mbm_total,
*mbm_local, mbm_over, cqm_limbo, mbm_work_cpu, cqm_work_cpu)  as
well as struct rdt_hw_domain (*arch_mbm_total, *arch_mbm_local).

For a resource that does not support monitoring it is of course
unnecessary to carry all of this for _every_ domain instance and
after this series it no longer will.
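
A minimal sketch of that effect, assuming simplified stand-in types rather
than the real resctrl definitions. The names domain_hdr, ctrl_domain,
mon_domain and mon_domain_from_hdr() are hypothetical; the monitoring field
names are taken from the list above, with their types reduced to plain
pointers and ints.

```c
#include <assert.h>
#include <stddef.h>

enum domain_type { CTRL_DOMAIN, MON_DOMAIN };

/* Fields every domain needs, whichever kind it is. */
struct domain_hdr {
	int id;
	enum domain_type type;
};

/* Control domain: carries no monitoring state at all. */
struct ctrl_domain {
	struct domain_hdr hdr;
	unsigned int *ctrl_val;
};

/* Monitor domain: the only place the monitoring fields live. */
struct mon_domain {
	struct domain_hdr hdr;
	unsigned long *rmid_busy_llc;
	void *mbm_total;
	void *mbm_local;
	int mbm_work_cpu;
	int cqm_work_cpu;
};

/* Recover the monitor domain from its embedded common header. */
static struct mon_domain *mon_domain_from_hdr(struct domain_hdr *h)
{
	assert(h && h->type == MON_DOMAIN);
	return (struct mon_domain *)((char *)h - offsetof(struct mon_domain, hdr));
}
```

In this sketch a control-only resource such as MBA allocates only
struct ctrl_domain instances, so none of its domains pay for the
monitoring members.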

Reinette


* Re: [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems
  2024-06-14 23:11           ` Reinette Chatre
@ 2024-06-17 14:06             ` Moger, Babu
  0 siblings, 0 replies; 61+ messages in thread
From: Moger, Babu @ 2024-06-17 14:06 UTC (permalink / raw)
  To: Reinette Chatre, Tony Luck, Fenghua Yu, Maciej Wieczor-Retman,
	Peter Newman, James Morse, Drew Fustini, Dave Martin
  Cc: x86, linux-kernel, patches

Hi Reinette,

On 6/14/24 18:11, Reinette Chatre wrote:
> Hi Babu,
> 
> On 6/14/24 2:29 PM, Moger, Babu wrote:
>> Hi Reinette,
>>
>> On 6/14/2024 11:46 AM, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> On 6/14/24 9:27 AM, Moger, Babu wrote:
>>>> Hi Reinette,
>>>>
>>>> On 6/13/24 15:32, Reinette Chatre wrote:
>>>>> Hi Babu,
>>>>>
>>>>> On 6/13/24 12:17 PM, Moger, Babu wrote:
>>>>>> I may be little bit out of sync here. Also, sorry to come back late
>>>>>> in the
>>>>>> series.
>>>>>>
>>>>>> Looking at the series again, I see this approach adds lots of code.
>>>>>> Look at this structure.
>>>>>>
>>>>>>
>>>>>> @@ -187,10 +196,12 @@ struct rdt_resource {
>>>>>>        bool            alloc_capable;
>>>>>>        bool            mon_capable;
>>>>>>        int            num_rmid;
>>>>>> -    enum resctrl_scope    scope;
>>>>>> +    enum resctrl_scope    ctrl_scope;
>>>>>> +    enum resctrl_scope    mon_scope;
>>>>>>        struct resctrl_cache    cache;
>>>>>>        struct resctrl_membw    membw;
>>>>>> -    struct list_head    domains;
>>>>>> +    struct list_head    ctrl_domains;
>>>>>> +    struct list_head    mon_domains;
>>>>>>        char            *name;
>>>>>>        int            data_width;
>>>>>>        u32            default_ctrl;
>>>>>>
>>>>>> There are two scope fields.
>>>>>> There are two domains fields.
>>>>>>
>>>>>> These are very confusing and very hard to maintain. Also, I am not
>>>>>> sure if these fields are useful for anything other than the SNC
>>>>>> feature. This approach adds quite a bit of code for no specific
>>>>>> advantage.
>>>>>>
>>>>>> Why don't we just split the RDT_RESOURCE_L3 resource
>>>>>> into separate resources, one for control, one for monitoring.
>>>>>> We already have "control" only resources (MBA, SMBA, L2). Let's
>>>>>> create a new "monitor" only resource. I feel it will be a much
>>>>>> cleaner approach.
>>>>>>
>>>>>> Tony has already tried that approach and showed that it is much
>>>>>> simpler.
>>>>>>
>>>>>> v15-RFC :
>>>>>> https://lore.kernel.org/lkml/20240130222034.37181-1-tony.luck@intel.com/
>>>>>>
>>>>>> What do you think?
>>>>>>
>>>>>
>>>>> Some highlights of my thoughts in response to that series, but the whole
>>>>> thread
>>>>> may be of interest to you:
>>>>> https://lore.kernel.org/lkml/78c88c6d-2e8d-42d1-a6f2-1c5ac38fb258@intel.com/
>>>>> https://lore.kernel.org/lkml/59944211-d34a-4ba3-a1de-095822c0b3f0@intel.com/
>>>>>
>>>>
>>>> Went through the thread, in summary:
>>>>
>>>> The main concerns are related to duplication of code and data structures.
>>>>
>>>> The solutions are
>>>>
>>>> a) Split the domains.
>>>> This is what this series is doing now. This creates members like
>>>> ctrl_scope, mon_scope, ctrl_domains etc. These fields are added to all
>>>> the resources (MBA, SMBA and L2). Then there is an additional domain header.
>>>>
>>>>
>>>> b) Split the resource.
>>>>    Split RDT_RESOURCE_L3 into two, one for "monitor" and one for
>>>>    "control". There will be one domain structure for "monitor" and
>>>>    one for "control".
>>>>
>>>> Both these approaches have code and data duplication. So, there is no
>>>> difference that way.
>>>
>>> Could you please elaborate where code and data duplication of (a) is?
>>
>> We have ctrl_scope, mon_scope, ctrl_domains, mon_domains.  Only one
>> resource, RDT_RESOURCE_L3, is going to use these fields. The rest of
>> the resources don't need these fields. But these fields are part of
>> all the resources.
> 
> Correct. There are two new empty fields per resource that does
> not support monitoring. Having the new mon_domains list results in
> the benefit of eliminating monitoring fields from all the domains
> forming part of resources that do not support monitoring. Providing
> more details below but the additional pointer results in a significant
> net reduction of unused fields. Having the new mon_scope field results
> in the benefit to support SNC.
> 
>>
>> I am not too worried about the size of the patch.  But, I don't
>> foresee these fields being used anytime soon in these
>> resources (MBA, L3, SMBA). Why add them now? In the future we may
>> have to clean up all of these anyway.
> 
> This work does indeed go through the effort to _eliminate_ unused fields.
> Note how all domains of all resources (whether they support monitoring or
> not) currently have to carry a significant number of monitoring fields.
> These can be found in both struct rdt_domain (*rmid_busy_llc, *mbm_total,
> *mbm_local, mbm_over, cqm_limbo, mbm_work_cpu, cqm_work_cpu)  as
> well as struct rdt_hw_domain (*arch_mbm_total, *arch_mbm_local).
> 
> For a resource that does not support monitoring it is of course
> unnecessary to carry all of this for _every_ domain instance and
> after this series it no longer will.

Yes. I see that. Thanks for the explanation. Let's go ahead with the
series. This feature has been pending for a while. I will provide my
comments on the series.
-- 
Thanks
Babu Moger
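
The trade-off discussed above (option (a), splitting the domain
structures) can be sketched in a few lines of C. This is an illustrative
userspace sketch under stated assumptions, not the code from the series:
the header/control/monitor split follows the discussion, and the
monitoring field names are taken from the list Reinette quotes for
struct rdt_domain. Everything else is a hypothetical stand-in: the types
list_node and work_stub, the ctrl_val member, the struct names
legacy_domain, ctrl_domain and mon_domain, and the helper
bytes_saved_per_ctrl_domain().

```c
#include <stddef.h>

/* Simplified stand-ins for kernel types (assumptions, illustration only). */
struct list_node { struct list_node *next, *prev; };
struct work_stub { void *data; };	/* stand-in for struct delayed_work */

/* Fields every domain needs, whether it is used for control or monitoring. */
struct domain_hdr {
	struct list_node list;
	int id;
	unsigned long cpu_mask;		/* stand-in for struct cpumask */
};

/* Pre-split layout: every domain carries the monitoring fields. */
struct legacy_domain {
	struct domain_hdr hdr;
	unsigned int *ctrl_val;		/* hypothetical control state */
	unsigned long *rmid_busy_llc;
	void *mbm_total;
	void *mbm_local;
	struct work_stub mbm_over;
	struct work_stub cqm_limbo;
	int mbm_work_cpu;
	int cqm_work_cpu;
};

/* Post-split: a control domain has only the header and control state. */
struct ctrl_domain {
	struct domain_hdr hdr;
	unsigned int *ctrl_val;		/* hypothetical control state */
};

/* Post-split: a monitor domain has the header and the monitoring state. */
struct mon_domain {
	struct domain_hdr hdr;
	unsigned long *rmid_busy_llc;
	void *mbm_total;
	void *mbm_local;
	struct work_stub mbm_over;
	struct work_stub cqm_limbo;
	int mbm_work_cpu;
	int cqm_work_cpu;
};

/* Bytes no longer carried by each domain of a control-only resource. */
static size_t bytes_saved_per_ctrl_domain(void)
{
	return sizeof(struct legacy_domain) - sizeof(struct ctrl_domain);
}
```

In this sketch the cost of the split is one extra list head and one extra
scope value per resource, while the saving returned by
bytes_saved_per_ctrl_domain() applies to every domain instance of a
resource that does not support monitoring.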

Thread overview: 61+ messages
2024-06-10 18:35 [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems Tony Luck
2024-06-10 18:35 ` [PATCH v20 01/18] x86/resctrl: Prepare for new domain scope Tony Luck
2024-06-20 21:12   ` Reinette Chatre
2024-06-10 18:35 ` [PATCH v20 02/18] x86/resctrl: Prepare to split rdt_domain structure Tony Luck
2024-06-20 21:13   ` Reinette Chatre
2024-06-10 18:35 ` [PATCH v20 03/18] x86/resctrl: Prepare for different scope for control/monitor operations Tony Luck
2024-06-20 21:13   ` Reinette Chatre
2024-06-10 18:35 ` [PATCH v20 04/18] x86/resctrl: Split the rdt_domain and rdt_hw_domain structures Tony Luck
2024-06-20 21:14   ` Reinette Chatre
2024-06-10 18:35 ` [PATCH v20 05/18] x86/resctrl: Add node-scope to the options for feature scope Tony Luck
2024-06-20 21:15   ` Reinette Chatre
2024-06-10 18:35 ` [PATCH v20 06/18] x86/resctrl: Introduce snc_nodes_per_l3_cache Tony Luck
2024-06-17 22:36   ` Moger, Babu
2024-06-18 22:58     ` Reinette Chatre
2024-06-19 14:43       ` Moger, Babu
2024-06-20 21:19   ` Reinette Chatre
2024-06-10 18:35 ` [PATCH v20 07/18] x86/resctrl: Block use of mba_MBps mount option on Sub-NUMA Cluster (SNC) systems Tony Luck
2024-06-20 21:21   ` Reinette Chatre
2024-06-20 22:07     ` Luck, Tony
2024-06-20 22:12       ` Luck, Tony
2024-06-21  1:56       ` Reinette Chatre
2024-06-21 15:24         ` Tony Luck
2024-06-21 17:10           ` Reinette Chatre
2024-06-10 18:35 ` [PATCH v20 08/18] x86/resctrl: Prepare for new Sub-NUMA Cluster (SNC) monitor files Tony Luck
2024-06-20 21:22   ` Reinette Chatre
2024-06-10 18:35 ` [PATCH v20 09/18] x86/resctrl: Add a new field to struct rmid_read for summation of domains Tony Luck
2024-06-20 21:22   ` Reinette Chatre
2024-06-20 22:42     ` Luck, Tony
2024-06-21  1:59       ` Reinette Chatre
2024-06-21 16:07         ` Luck, Tony
2024-06-21 17:10           ` Reinette Chatre
2024-06-10 18:35 ` [PATCH v20 10/18] x86/resctrl: Refactor mkdir_mondata_subdir() with a helper function Tony Luck
2024-06-20 21:23   ` Reinette Chatre
2024-06-10 18:35 ` [PATCH v20 11/18] x86/resctrl: Allocate a new field in union mon_data_bits Tony Luck
2024-06-20 21:28   ` Reinette Chatre
2024-06-10 18:35 ` [PATCH v20 12/18] x86/resctrl: Create Sub-NUMA Cluster (SNC) monitor files Tony Luck
2024-06-20 21:30   ` Reinette Chatre
2024-06-10 18:35 ` [PATCH v20 13/18] x86/resctrl: Handle removing directories in Sub-NUMA Cluster (SNC) mode Tony Luck
2024-06-20 21:30   ` Reinette Chatre
2024-06-10 18:35 ` [PATCH v20 14/18] x86/resctrl: Fill out rmid_read structure for smp_call*() to read a counter Tony Luck
2024-06-20 21:31   ` Reinette Chatre
2024-06-10 18:35 ` [PATCH v20 15/18] x86/resctrl: Make __mon_event_count() handle sum domains Tony Luck
2024-06-20 21:31   ` Reinette Chatre
2024-06-10 18:35 ` [PATCH v20 16/18] x86/resctrl: Enable RMID shared RMID mode on Sub-NUMA Cluster (SNC) systems Tony Luck
2024-06-20 21:32   ` Reinette Chatre
2024-06-10 18:35 ` [PATCH v20 17/18] x86/resctrl: Sub-NUMA Cluster (SNC) detection Tony Luck
2024-06-20 21:34   ` Reinette Chatre
2024-06-21 17:05   ` Markus Elfring
2024-06-21 17:14     ` Luck, Tony
2024-06-10 18:35 ` [PATCH v20 18/18] x86/resctrl: Update documentation with Sub-NUMA cluster changes Tony Luck
2024-06-20 21:35   ` Reinette Chatre
2024-06-13 19:17 ` [PATCH v20 00/18] Add support for Sub-NUMA cluster (SNC) systems Moger, Babu
2024-06-13 20:32   ` Reinette Chatre
2024-06-13 21:02     ` Luck, Tony
2024-06-14 16:27     ` Moger, Babu
2024-06-14 16:46       ` Reinette Chatre
2024-06-14 21:29         ` Moger, Babu
2024-06-14 21:40           ` Luck, Tony
2024-06-14 22:31             ` Moger, Babu
2024-06-14 23:11           ` Reinette Chatre
2024-06-17 14:06             ` Moger, Babu
