* [PATCH v11 01/31] x86,fs/resctrl: Improve domain type checking
2025-09-25 20:02 [PATCH v11 00/31] x86,fs/resctrl telemetry monitoring Tony Luck
@ 2025-09-25 20:02 ` Tony Luck
2025-10-03 15:28 ` Reinette Chatre
2025-09-25 20:02 ` [PATCH v11 02/31] x86/resctrl: Move L3 initialization into new helper function Tony Luck
` (29 subsequent siblings)
30 siblings, 1 reply; 84+ messages in thread
From: Tony Luck @ 2025-09-25 20:02 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Each resctrl resource has a list of domain structures. These all begin
with a common rdt_domain_hdr.
Improve type checking of these headers by adding the resource id. Add
domain_header_is_valid() before each call to container_of() to ensure
the domain is the expected type.
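The pattern the changelog describes can be sketched in stand-alone user-space C as follows. This is a simplified model only: the enum values, `struct dom_hdr`, `struct mon_domain` and `to_mon_domain()` are hypothetical stand-ins, not the real resctrl structures; only the idea (validate both type and resource id before `container_of()`) mirrors the patch.

```c
/* Simplified model of the pattern from this patch: a common header
 * embedded in each domain struct, validated before container_of().
 * Hypothetical names (dom_hdr, mon_domain, ...), not kernel code. */
#include <stddef.h>
#include <stdbool.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

enum dom_type { DOM_CTRL, DOM_MON };
enum res_id   { RES_L3, RES_MBA };

struct dom_hdr {
	enum dom_type type;
	enum res_id rid;
};

struct mon_domain {
	struct dom_hdr hdr;	/* must be first for this sketch */
	int counter;
};

/* Mirrors domain_header_is_valid(): check both fields before a downcast. */
static bool hdr_is_valid(const struct dom_hdr *hdr,
			 enum dom_type type, enum res_id rid)
{
	return hdr->type == type && hdr->rid == rid;
}

/* Safe downcast: NULL instead of a bogus pointer on a mismatch. */
static struct mon_domain *to_mon_domain(struct dom_hdr *hdr)
{
	if (!hdr_is_valid(hdr, DOM_MON, RES_L3))
		return NULL;
	return container_of(hdr, struct mon_domain, hdr);
}
```

In the kernel the mismatch path additionally fires WARN_ON_ONCE(); the sketch just returns NULL so the guarantee is visible to the caller.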
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 9 +++++++++
arch/x86/kernel/cpu/resctrl/core.c | 10 ++++++----
fs/resctrl/ctrlmondata.c | 2 +-
3 files changed, 16 insertions(+), 5 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index a7d92718b653..dfc91c5e8483 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -131,15 +131,24 @@ enum resctrl_domain_type {
* @list: all instances of this resource
* @id: unique id for this instance
* @type: type of this instance
+ * @rid: resource id for this instance
* @cpu_mask: which CPUs share this resource
*/
struct rdt_domain_hdr {
struct list_head list;
int id;
enum resctrl_domain_type type;
+ enum resctrl_res_level rid;
struct cpumask cpu_mask;
};
+static inline bool domain_header_is_valid(struct rdt_domain_hdr *hdr,
+ enum resctrl_domain_type type,
+ enum resctrl_res_level rid)
+{
+ return !WARN_ON_ONCE(hdr->type != type || hdr->rid != rid);
+}
+
/**
* struct rdt_ctrl_domain - group of CPUs sharing a resctrl control resource
* @hdr: common header for different domain types
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 06ca5a30140c..8be2619db2e7 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -459,7 +459,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
hdr = resctrl_find_domain(&r->ctrl_domains, id, &add_pos);
if (hdr) {
- if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+ if (!domain_header_is_valid(hdr, RESCTRL_CTRL_DOMAIN, r->rid))
return;
d = container_of(hdr, struct rdt_ctrl_domain, hdr);
@@ -476,6 +476,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
d = &hw_dom->d_resctrl;
d->hdr.id = id;
d->hdr.type = RESCTRL_CTRL_DOMAIN;
+ d->hdr.rid = r->rid;
cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
rdt_domain_reconfigure_cdp(r);
@@ -515,7 +516,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
hdr = resctrl_find_domain(&r->mon_domains, id, &add_pos);
if (hdr) {
- if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
return;
d = container_of(hdr, struct rdt_mon_domain, hdr);
@@ -533,6 +534,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
d = &hw_dom->d_resctrl;
d->hdr.id = id;
d->hdr.type = RESCTRL_MON_DOMAIN;
+ d->hdr.rid = r->rid;
ci = get_cpu_cacheinfo_level(cpu, RESCTRL_L3_CACHE);
if (!ci) {
pr_warn_once("Can't find L3 cache for CPU:%d resource %s\n", cpu, r->name);
@@ -593,7 +595,7 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
return;
}
- if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+ if (!domain_header_is_valid(hdr, RESCTRL_CTRL_DOMAIN, r->rid))
return;
d = container_of(hdr, struct rdt_ctrl_domain, hdr);
@@ -639,7 +641,7 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
return;
}
- if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
return;
d = container_of(hdr, struct rdt_mon_domain, hdr);
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index 0d0ef54fc4de..f248eaf50d3c 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -649,7 +649,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
* the resource to find the domain with "domid".
*/
hdr = resctrl_find_domain(&r->mon_domains, domid, NULL);
- if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
+ if (!hdr || !domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, resid)) {
ret = -ENOENT;
goto out;
}
--
2.51.0
* Re: [PATCH v11 01/31] x86,fs/resctrl: Improve domain type checking
2025-09-25 20:02 ` [PATCH v11 01/31] x86,fs/resctrl: Improve domain type checking Tony Luck
@ 2025-10-03 15:28 ` Reinette Chatre
0 siblings, 0 replies; 84+ messages in thread
From: Reinette Chatre @ 2025-10-03 15:28 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 9/25/25 1:02 PM, Tony Luck wrote:
> Each resctrl resource has a list of domain structures. These all begin
> with a common rdt_domain_hdr.
>
> Improve type checking of these headers by adding the resource id. Add
> domain_header_is_valid() before each call to container_of() to ensure
> the domain is the expected type.
Apart from the "short and sweet" guidance [1], which the changelog above
follows, the same guidance concluded with a summary (in bold) that the
*why* of the patch should be clear. This changelog is missing the "why".
Here is an attempt to address that; please feel free to rewrite and improve:
Every resctrl resource has a list of domain structures. struct rdt_ctrl_domain
and struct rdt_mon_domain both begin with struct rdt_domain_hdr
with rdt_domain_hdr::type used in validity checks before accessing
the domain of a particular type.
Add the resource id to struct rdt_domain_hdr in preparation for a new
monitoring domain structure that will be associated with a new monitoring
resource. Improve existing domain validity checks with a new helper
domain_header_is_valid() that checks both domain type and resource id.
domain_header_is_valid() should be used before every call to container_of()
that accesses a domain structure.
[1] https://lore.kernel.org/all/20250923100956.GAaNJx9BYhXKkfNJ71@fat_crate.local/
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
Patch looks good to me.
Reinette
* [PATCH v11 02/31] x86/resctrl: Move L3 initialization into new helper function
2025-09-25 20:02 [PATCH v11 00/31] x86,fs/resctrl telemetry monitoring Tony Luck
2025-09-25 20:02 ` [PATCH v11 01/31] x86,fs/resctrl: Improve domain type checking Tony Luck
@ 2025-09-25 20:02 ` Tony Luck
2025-10-03 15:28 ` Reinette Chatre
2025-09-25 20:02 ` [PATCH v11 03/31] x86,fs/resctrl: Refactor domain_remove_cpu_mon() ready for new domain types Tony Luck
` (28 subsequent siblings)
30 siblings, 1 reply; 84+ messages in thread
From: Tony Luck @ 2025-09-25 20:02 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Carve out the resource monitoring domain init code into a separate helper
in order to be able to initialize new types of monitoring domains besides
the usual L3 ones.
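The dispatch shape this carve-out enables can be sketched in stand-alone C. Everything here is a hypothetical, simplified stand-in (the enum, the counters, the function bodies); only the name l3_mon_domain_setup() and the switch-on-resource-id structure come from the patch.

```c
/* Sketch: domain_add_cpu_mon() switches on the resource id so that
 * resources other than L3 can later plug in their own setup helpers.
 * Hypothetical user-space model, not kernel code. */

enum res_id { RES_L3, RES_UNKNOWN };

static int l3_setups, unknown_warnings;

/* Stand-in for l3_mon_domain_setup(): resource-specific work only. */
static void l3_mon_domain_setup(int cpu)
{
	(void)cpu;
	l3_setups++;
}

static void domain_add_cpu_mon(int cpu, enum res_id rid)
{
	switch (rid) {
	case RES_L3:
		l3_mon_domain_setup(cpu);
		break;
	default:
		unknown_warnings++;	/* pr_warn_once() in the kernel */
		break;
	}
}
```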
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/core.c | 64 ++++++++++++++++--------------
1 file changed, 34 insertions(+), 30 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 8be2619db2e7..d422ae3b7ed6 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -496,37 +496,13 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
}
}
-static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
+static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct list_head *add_pos)
{
- int id = get_domain_id_from_scope(cpu, r->mon_scope);
- struct list_head *add_pos = NULL;
struct rdt_hw_mon_domain *hw_dom;
- struct rdt_domain_hdr *hdr;
struct rdt_mon_domain *d;
struct cacheinfo *ci;
int err;
- lockdep_assert_held(&domain_list_lock);
-
- if (id < 0) {
- pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
- cpu, r->mon_scope, r->name);
- return;
- }
-
- hdr = resctrl_find_domain(&r->mon_domains, id, &add_pos);
- if (hdr) {
- if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
- return;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
-
- cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
- /* Update the mbm_assign_mode state for the CPU if supported */
- if (r->mon.mbm_cntr_assignable)
- resctrl_arch_mbm_cntr_assign_set_one(r);
- return;
- }
-
hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
if (!hw_dom)
return;
@@ -534,7 +510,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
d = &hw_dom->d_resctrl;
d->hdr.id = id;
d->hdr.type = RESCTRL_MON_DOMAIN;
- d->hdr.rid = r->rid;
+ d->hdr.rid = RDT_RESOURCE_L3;
ci = get_cpu_cacheinfo_level(cpu, RESCTRL_L3_CACHE);
if (!ci) {
pr_warn_once("Can't find L3 cache for CPU:%d resource %s\n", cpu, r->name);
@@ -544,10 +520,6 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
d->ci_id = ci->id;
cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
- /* Update the mbm_assign_mode state for the CPU if supported */
- if (r->mon.mbm_cntr_assignable)
- resctrl_arch_mbm_cntr_assign_set_one(r);
-
arch_mon_domain_online(r, d);
if (arch_domain_mbm_alloc(r->mon.num_rmid, hw_dom)) {
@@ -565,6 +537,38 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
}
}
+static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
+{
+ int id = get_domain_id_from_scope(cpu, r->mon_scope);
+ struct list_head *add_pos = NULL;
+ struct rdt_domain_hdr *hdr;
+
+ lockdep_assert_held(&domain_list_lock);
+
+ if (id < 0) {
+ pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->mon_scope, r->name);
+ return;
+ }
+
+ hdr = resctrl_find_domain(&r->mon_domains, id, &add_pos);
+ if (hdr)
+ cpumask_set_cpu(cpu, &hdr->cpu_mask);
+
+ switch (r->rid) {
+ case RDT_RESOURCE_L3:
+ /* Update the mbm_assign_mode state for the CPU if supported */
+ if (r->mon.mbm_cntr_assignable)
+ resctrl_arch_mbm_cntr_assign_set_one(r);
+ if (!hdr)
+ l3_mon_domain_setup(cpu, id, r, add_pos);
+ break;
+ default:
+ pr_warn_once("Unknown resource rid=%d\n", r->rid);
+ break;
+ }
+}
+
static void domain_add_cpu(int cpu, struct rdt_resource *r)
{
if (r->alloc_capable)
--
2.51.0
* Re: [PATCH v11 02/31] x86/resctrl: Move L3 initialization into new helper function
2025-09-25 20:02 ` [PATCH v11 02/31] x86/resctrl: Move L3 initialization into new helper function Tony Luck
@ 2025-10-03 15:28 ` Reinette Chatre
0 siblings, 0 replies; 84+ messages in thread
From: Reinette Chatre @ 2025-10-03 15:28 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 9/25/25 1:02 PM, Tony Luck wrote:
> Carve out the resource monitoring domain init code into a separate helper
> in order to be able to initialize new types of monitoring domains besides
> the usual L3 ones.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reinette
* [PATCH v11 03/31] x86,fs/resctrl: Refactor domain_remove_cpu_mon() ready for new domain types
2025-09-25 20:02 [PATCH v11 00/31] x86,fs/resctrl telemetry monitoring Tony Luck
2025-09-25 20:02 ` [PATCH v11 01/31] x86,fs/resctrl: Improve domain type checking Tony Luck
2025-09-25 20:02 ` [PATCH v11 02/31] x86/resctrl: Move L3 initialization into new helper function Tony Luck
@ 2025-09-25 20:02 ` Tony Luck
2025-10-03 15:29 ` Reinette Chatre
2025-09-25 20:02 ` [PATCH v11 04/31] x86/resctrl: Clean up domain_remove_cpu_ctrl() Tony Luck
` (27 subsequent siblings)
30 siblings, 1 reply; 84+ messages in thread
From: Tony Luck @ 2025-09-25 20:02 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
New telemetry events will be associated with a new package scoped
resource with new domain structures.
Refactor domain_remove_cpu_mon() so all the L3 processing is separate
from general actions of clearing the CPU bit in the mask.
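The refactored remove path described above can be modeled in a few lines of stand-alone C. This is a simplified sketch with hypothetical types (a plain bitmask stands in for struct cpumask, and the free helper is a stub); only the ordering, i.e. generic CPU-clearing with an early return first, then a switch on the resource id for resource-specific teardown, reflects the patch.

```c
/* Minimal model of the refactored domain_remove_cpu_mon(): the generic
 * action happens once, then per-resource teardown is dispatched.
 * Hypothetical user-space names, not kernel code. */

enum res_id { RES_L3, RES_OTHER };

struct dom_hdr {
	unsigned int cpu_mask;	/* stand-in bitmask for struct cpumask */
	enum res_id rid;
};

static int l3_domains_freed;

static void l3_mon_domain_free(struct dom_hdr *hdr)
{
	(void)hdr;
	l3_domains_freed++;
}

static void domain_remove_cpu_mon(int cpu, struct dom_hdr *hdr)
{
	hdr->cpu_mask &= ~(1u << cpu);	/* cpumask_clear_cpu() */
	if (hdr->cpu_mask)		/* domain still has CPUs */
		return;

	switch (hdr->rid) {
	case RES_L3:
		l3_mon_domain_free(hdr);
		break;
	default:
		break;			/* pr_warn_once() in the kernel */
	}
}
```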
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/core.c | 21 +++++++++++++--------
1 file changed, 13 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index d422ae3b7ed6..b471918bced6 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -645,20 +645,25 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
return;
}
- if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
+ cpumask_clear_cpu(cpu, &hdr->cpu_mask);
+ if (!cpumask_empty(&hdr->cpu_mask))
return;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
- hw_dom = resctrl_to_arch_mon_dom(d);
+ switch (r->rid) {
+ case RDT_RESOURCE_L3:
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
+ return;
- cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
- if (cpumask_empty(&d->hdr.cpu_mask)) {
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
+ hw_dom = resctrl_to_arch_mon_dom(d);
resctrl_offline_mon_domain(r, d);
- list_del_rcu(&d->hdr.list);
+ list_del_rcu(&hdr->list);
synchronize_rcu();
mon_domain_free(hw_dom);
-
- return;
+ break;
+ default:
+ pr_warn_once("Unknown resource rid=%d\n", r->rid);
+ break;
}
}
--
2.51.0
* Re: [PATCH v11 03/31] x86,fs/resctrl: Refactor domain_remove_cpu_mon() ready for new domain types
2025-09-25 20:02 ` [PATCH v11 03/31] x86,fs/resctrl: Refactor domain_remove_cpu_mon() ready for new domain types Tony Luck
@ 2025-10-03 15:29 ` Reinette Chatre
0 siblings, 0 replies; 84+ messages in thread
From: Reinette Chatre @ 2025-10-03 15:29 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
Note that the subject prefix indicates resctrl fs code is changed in this patch,
but it only changes arch code.
On 9/25/25 1:02 PM, Tony Luck wrote:
> New telemetry events will be associated with a new package scoped
> resource with new domain structures.
>
> Refactor domain_remove_cpu_mon() so all the L3 processing is separate
"L3 processing" -> "L3 domain processing" ?
> from general actions of clearing the CPU bit in the mask.
"general actions" -> "general domain actions" ?
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/kernel/cpu/resctrl/core.c | 21 +++++++++++++--------
> 1 file changed, 13 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index d422ae3b7ed6..b471918bced6 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -645,20 +645,25 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
> return;
> }
>
> - if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
> + cpumask_clear_cpu(cpu, &hdr->cpu_mask);
> + if (!cpumask_empty(&hdr->cpu_mask))
> return;
>
> - d = container_of(hdr, struct rdt_mon_domain, hdr);
> - hw_dom = resctrl_to_arch_mon_dom(d);
> + switch (r->rid) {
> + case RDT_RESOURCE_L3:
This function evolves to where its local declarations contain a mix of domain structures
of different resources. I think it will make this function easier to understand if the
resource specific structures are declared local to the code block dedicated to that resource.
case RDT_RESOURCE_L3: {
struct rdt_hw_mon_domain *hw_dom;
struct rdt_mon_domain *d;
...
break;
}
> + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
> + return;
>
> - cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
> - if (cpumask_empty(&d->hdr.cpu_mask)) {
> + d = container_of(hdr, struct rdt_mon_domain, hdr);
> + hw_dom = resctrl_to_arch_mon_dom(d);
> resctrl_offline_mon_domain(r, d);
> - list_del_rcu(&d->hdr.list);
> + list_del_rcu(&hdr->list);
> synchronize_rcu();
> mon_domain_free(hw_dom);
> -
> - return;
> + break;
> + default:
> + pr_warn_once("Unknown resource rid=%d\n", r->rid);
> + break;
> }
> }
>
Reinette
* [PATCH v11 04/31] x86/resctrl: Clean up domain_remove_cpu_ctrl()
2025-09-25 20:02 [PATCH v11 00/31] x86,fs/resctrl telemetry monitoring Tony Luck
` (2 preceding siblings ...)
2025-09-25 20:02 ` [PATCH v11 03/31] x86,fs/resctrl: Refactor domain_remove_cpu_mon() ready for new domain types Tony Luck
@ 2025-09-25 20:02 ` Tony Luck
2025-10-03 15:30 ` Reinette Chatre
2025-09-25 20:02 ` [PATCH v11 05/31] x86,fs/resctrl: Refactor domain create/remove using struct rdt_domain_hdr Tony Luck
` (26 subsequent siblings)
30 siblings, 1 reply; 84+ messages in thread
From: Tony Luck @ 2025-09-25 20:02 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
For symmetry with domain_remove_cpu_mon() refactor
domain_remove_cpu_ctrl() to take an early return when removing
a CPU does not empty the domain.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
arch/x86/kernel/cpu/resctrl/core.c | 29 ++++++++++++++---------------
1 file changed, 14 insertions(+), 15 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index b471918bced6..28c8e28bb1dd 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -599,28 +599,27 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
return;
}
+ cpumask_clear_cpu(cpu, &hdr->cpu_mask);
+ if (!cpumask_empty(&hdr->cpu_mask))
+ return;
+
if (!domain_header_is_valid(hdr, RESCTRL_CTRL_DOMAIN, r->rid))
return;
d = container_of(hdr, struct rdt_ctrl_domain, hdr);
hw_dom = resctrl_to_arch_ctrl_dom(d);
- cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
- if (cpumask_empty(&d->hdr.cpu_mask)) {
- resctrl_offline_ctrl_domain(r, d);
- list_del_rcu(&d->hdr.list);
- synchronize_rcu();
-
- /*
- * rdt_ctrl_domain "d" is going to be freed below, so clear
- * its pointer from pseudo_lock_region struct.
- */
- if (d->plr)
- d->plr->d = NULL;
- ctrl_domain_free(hw_dom);
+ resctrl_offline_ctrl_domain(r, d);
+ list_del_rcu(&hdr->list);
+ synchronize_rcu();
- return;
- }
+ /*
+ * rdt_ctrl_domain "d" is going to be freed below, so clear
+ * its pointer from pseudo_lock_region struct.
+ */
+ if (d->plr)
+ d->plr->d = NULL;
+ ctrl_domain_free(hw_dom);
}
static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
--
2.51.0
* Re: [PATCH v11 04/31] x86/resctrl: Clean up domain_remove_cpu_ctrl()
2025-09-25 20:02 ` [PATCH v11 04/31] x86/resctrl: Clean up domain_remove_cpu_ctrl() Tony Luck
@ 2025-10-03 15:30 ` Reinette Chatre
0 siblings, 0 replies; 84+ messages in thread
From: Reinette Chatre @ 2025-10-03 15:30 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 9/25/25 1:02 PM, Tony Luck wrote:
> For symmetry with domain_remove_cpu_mon() refactor
> domain_remove_cpu_ctrl() to take an early return when removing
> a CPU does not empty the domain.
These changelog lines are getting shorter and shorter. I do not know if you noticed
but almost all the ABMC changelogs were reformatted during merge to use closer to 80
characters per line, sometimes more. You can avoid that extra churn by ensuring the
changelogs make use of 80 columns. There was a brief exchange about this at
https://lore.kernel.org/lkml/20250916105447.GCaMlB976WLxHHeNMD@fat_crate.local/
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reinette
* [PATCH v11 05/31] x86,fs/resctrl: Refactor domain create/remove using struct rdt_domain_hdr
2025-09-25 20:02 [PATCH v11 00/31] x86,fs/resctrl telemetry monitoring Tony Luck
` (3 preceding siblings ...)
2025-09-25 20:02 ` [PATCH v11 04/31] x86/resctrl: Clean up domain_remove_cpu_ctrl() Tony Luck
@ 2025-09-25 20:02 ` Tony Luck
2025-10-03 15:33 ` Reinette Chatre
2025-09-25 20:03 ` [PATCH v11 06/31] x86,fs/resctrl: Use struct rdt_domain_hdr when reading counters Tony Luck
` (25 subsequent siblings)
30 siblings, 1 reply; 84+ messages in thread
From: Tony Luck @ 2025-09-25 20:02 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Up until now, all monitoring events were associated with the L3 resource
and it made sense to use the L3 specific "struct rdt_mon_domain *"
arguments to functions manipulating domains.
To simplify enabling enumeration of domains for events in other resources,
change the calling convention to pass the generic struct rdt_domain_hdr
and use that to find the domain specific structure where needed.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 4 +-
fs/resctrl/internal.h | 2 +-
arch/x86/kernel/cpu/resctrl/core.c | 4 +-
fs/resctrl/ctrlmondata.c | 15 ++++---
fs/resctrl/rdtgroup.c | 65 ++++++++++++++++++++----------
5 files changed, 58 insertions(+), 32 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index dfc91c5e8483..0b55809af5d7 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -504,9 +504,9 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 closid, enum resctrl_conf_type type);
int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr);
void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr);
void resctrl_online_cpu(unsigned int cpu);
void resctrl_offline_cpu(unsigned int cpu);
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index cf1fd82dc5a9..22fdb3a9b6f4 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -362,7 +362,7 @@ void mon_event_count(void *info);
int rdtgroup_mondata_show(struct seq_file *m, void *arg);
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
- struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
+ struct rdt_domain_hdr *hdr, struct rdtgroup *rdtgrp,
cpumask_t *cpumask, int evtid, int first);
int resctrl_mon_resource_init(void);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 28c8e28bb1dd..2d93387b9251 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -529,7 +529,7 @@ static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct
list_add_tail_rcu(&d->hdr.list, add_pos);
- err = resctrl_online_mon_domain(r, d);
+ err = resctrl_online_mon_domain(r, &d->hdr);
if (err) {
list_del_rcu(&d->hdr.list);
synchronize_rcu();
@@ -655,7 +655,7 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
d = container_of(hdr, struct rdt_mon_domain, hdr);
hw_dom = resctrl_to_arch_mon_dom(d);
- resctrl_offline_mon_domain(r, d);
+ resctrl_offline_mon_domain(r, hdr);
list_del_rcu(&hdr->list);
synchronize_rcu();
mon_domain_free(hw_dom);
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index f248eaf50d3c..3ceef35208be 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -547,11 +547,16 @@ struct rdt_domain_hdr *resctrl_find_domain(struct list_head *h, int id,
}
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
- struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
+ struct rdt_domain_hdr *hdr, struct rdtgroup *rdtgrp,
cpumask_t *cpumask, int evtid, int first)
{
+ struct rdt_mon_domain *d;
int cpu;
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
+ return;
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
+
/* When picking a CPU from cpu_mask, ensure it can't race with cpuhp */
lockdep_assert_cpus_held();
@@ -598,7 +603,6 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
enum resctrl_event_id evtid;
struct rdt_domain_hdr *hdr;
struct rmid_read rr = {0};
- struct rdt_mon_domain *d;
struct rdtgroup *rdtgrp;
int domid, cpu, ret = 0;
struct rdt_resource *r;
@@ -623,6 +627,8 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
r = resctrl_arch_get_resource(resid);
if (md->sum) {
+ struct rdt_mon_domain *d;
+
/*
* This file requires summing across all domains that share
* the L3 cache id that was provided in the "domid" field of the
@@ -649,12 +655,11 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
* the resource to find the domain with "domid".
*/
hdr = resctrl_find_domain(&r->mon_domains, domid, NULL);
- if (!hdr || !domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, resid)) {
+ if (!hdr) {
ret = -ENOENT;
goto out;
}
- d = container_of(hdr, struct rdt_mon_domain, hdr);
- mon_event_read(&rr, r, d, rdtgrp, &d->hdr.cpu_mask, evtid, false);
+ mon_event_read(&rr, r, hdr, rdtgrp, &hdr->cpu_mask, evtid, false);
}
checkresult:
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 0320360cd7a6..e3b83e48f2d9 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -3164,13 +3164,18 @@ static void mon_rmdir_one_subdir(struct kernfs_node *pkn, char *name, char *subn
* when last domain being summed is removed.
*/
static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
- struct rdt_mon_domain *d)
+ struct rdt_domain_hdr *hdr)
{
struct rdtgroup *prgrp, *crgrp;
+ struct rdt_mon_domain *d;
char subname[32];
bool snc_mode;
char name[32];
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
+ return;
+
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
snc_mode = r->mon_scope == RESCTRL_L3_NODE;
sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci_id : d->hdr.id);
if (snc_mode)
@@ -3184,19 +3189,18 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
}
}
-static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
+static int mon_add_all_files(struct kernfs_node *kn, struct rdt_domain_hdr *hdr,
struct rdt_resource *r, struct rdtgroup *prgrp,
- bool do_sum)
+ int domid, bool do_sum)
{
struct rmid_read rr = {0};
struct mon_data *priv;
struct mon_evt *mevt;
- int ret, domid;
+ int ret;
for_each_mon_event(mevt) {
if (mevt->rid != r->rid || !mevt->enabled)
continue;
- domid = do_sum ? d->ci_id : d->hdr.id;
priv = mon_get_kn_priv(r->rid, domid, mevt, do_sum);
if (WARN_ON_ONCE(!priv))
return -EINVAL;
@@ -3206,23 +3210,28 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
return ret;
if (!do_sum && resctrl_is_mbm_event(mevt->evtid))
- mon_event_read(&rr, r, d, prgrp, &d->hdr.cpu_mask, mevt->evtid, true);
+ mon_event_read(&rr, r, hdr, prgrp, &hdr->cpu_mask, mevt->evtid, true);
}
return 0;
}
static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
- struct rdt_mon_domain *d,
+ struct rdt_domain_hdr *hdr,
struct rdt_resource *r, struct rdtgroup *prgrp)
{
struct kernfs_node *kn, *ckn;
+ struct rdt_mon_domain *d;
char name[32];
bool snc_mode;
int ret = 0;
lockdep_assert_held(&rdtgroup_mutex);
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
+ return -EINVAL;
+
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
snc_mode = r->mon_scope == RESCTRL_L3_NODE;
sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci_id : d->hdr.id);
kn = kernfs_find_and_get(parent_kn, name);
@@ -3240,13 +3249,13 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
ret = rdtgroup_kn_set_ugid(kn);
if (ret)
goto out_destroy;
- ret = mon_add_all_files(kn, d, r, prgrp, snc_mode);
+ ret = mon_add_all_files(kn, hdr, r, prgrp, hdr->id, snc_mode);
if (ret)
goto out_destroy;
}
if (snc_mode) {
- sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
+ sprintf(name, "mon_sub_%s_%02d", r->name, hdr->id);
ckn = kernfs_create_dir(kn, name, parent_kn->mode, prgrp);
if (IS_ERR(ckn)) {
ret = -EINVAL;
@@ -3257,7 +3266,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
if (ret)
goto out_destroy;
- ret = mon_add_all_files(ckn, d, r, prgrp, false);
+ ret = mon_add_all_files(ckn, hdr, r, prgrp, hdr->id, false);
if (ret)
goto out_destroy;
}
@@ -3275,7 +3284,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
* and "monitor" groups with given domain id.
*/
static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
- struct rdt_mon_domain *d)
+ struct rdt_domain_hdr *hdr)
{
struct kernfs_node *parent_kn;
struct rdtgroup *prgrp, *crgrp;
@@ -3283,12 +3292,12 @@ static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
parent_kn = prgrp->mon.mon_data_kn;
- mkdir_mondata_subdir(parent_kn, d, r, prgrp);
+ mkdir_mondata_subdir(parent_kn, hdr, r, prgrp);
head = &prgrp->mon.crdtgrp_list;
list_for_each_entry(crgrp, head, mon.crdtgrp_list) {
parent_kn = crgrp->mon.mon_data_kn;
- mkdir_mondata_subdir(parent_kn, d, r, crgrp);
+ mkdir_mondata_subdir(parent_kn, hdr, r, crgrp);
}
}
}
@@ -3297,14 +3306,14 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
struct rdt_resource *r,
struct rdtgroup *prgrp)
{
- struct rdt_mon_domain *dom;
+ struct rdt_domain_hdr *hdr;
int ret;
/* Walking r->domains, ensure it can't race with cpuhp */
lockdep_assert_cpus_held();
- list_for_each_entry(dom, &r->mon_domains, hdr.list) {
- ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
+ list_for_each_entry(hdr, &r->mon_domains, list) {
+ ret = mkdir_mondata_subdir(parent_kn, hdr, r, prgrp);
if (ret)
return ret;
}
@@ -4187,8 +4196,10 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain
mutex_unlock(&rdtgroup_mutex);
}
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
{
+ struct rdt_mon_domain *d;
+
mutex_lock(&rdtgroup_mutex);
/*
@@ -4196,8 +4207,12 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
* per domain monitor data directories.
*/
if (resctrl_mounted && resctrl_arch_mon_capable())
- rmdir_mondata_subdir_allrdtgrp(r, d);
+ rmdir_mondata_subdir_allrdtgrp(r, hdr);
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
+ goto out_unlock;
+
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
if (resctrl_is_mbm_enabled())
cancel_delayed_work(&d->mbm_over);
if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) && has_busy_rmid(d)) {
@@ -4214,7 +4229,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
}
domain_destroy_mon_state(d);
-
+out_unlock:
mutex_unlock(&rdtgroup_mutex);
}
@@ -4287,12 +4302,17 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d
return err;
}
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
{
- int err;
+ struct rdt_mon_domain *d;
+ int err = -EINVAL;
mutex_lock(&rdtgroup_mutex);
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
+ goto out_unlock;
+
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
err = domain_setup_mon_state(r, d);
if (err)
goto out_unlock;
@@ -4306,6 +4326,7 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID))
INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo);
+ err = 0;
/*
* If the filesystem is not mounted then only the default resource group
* exists. Creation of its directories is deferred until mount time
@@ -4313,7 +4334,7 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
* If resctrl is mounted, add per domain monitor data directories.
*/
if (resctrl_mounted && resctrl_arch_mon_capable())
- mkdir_mondata_subdir_allrdtgrp(r, d);
+ mkdir_mondata_subdir_allrdtgrp(r, hdr);
out_unlock:
mutex_unlock(&rdtgroup_mutex);
--
2.51.0
^ permalink raw reply related [flat|nested] 84+ messages in thread

* Re: [PATCH v11 05/31] x86,fs/resctrl: Refactor domain create/remove using struct rdt_domain_hdr
2025-09-25 20:02 ` [PATCH v11 05/31] x86,fs/resctrl: Refactor domain create/remove using struct rdt_domain_hdr Tony Luck
@ 2025-10-03 15:33 ` Reinette Chatre
2025-10-03 22:55 ` Luck, Tony
0 siblings, 1 reply; 84+ messages in thread
From: Reinette Chatre @ 2025-10-03 15:33 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 9/25/25 1:02 PM, Tony Luck wrote:
> Up until now, all monitoring events were associated with the L3 resource
> and it made sense to use the L3 specific "struct rdt_mon_domain *"
> arguments to functions manipulating domains.
>
> To simplify enabling of enumeration of domains for events in other
What does "enabling of enumeration of domains" mean?
> resources change the calling convention to pass the generic struct
> rdt_domain_hdr and use that to find the domain specific structure
> where needed.
I think it will be helpful to highlight that this is a stepping stone
that highlights what domain management code is L3 specific and thus in
need of further refactoring to support new domain types vs. what is generic.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
...
> diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
> index f248eaf50d3c..3ceef35208be 100644
> --- a/fs/resctrl/ctrlmondata.c
> +++ b/fs/resctrl/ctrlmondata.c
> @@ -547,11 +547,16 @@ struct rdt_domain_hdr *resctrl_find_domain(struct list_head *h, int id,
> }
>
> void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> - struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
> + struct rdt_domain_hdr *hdr, struct rdtgroup *rdtgrp,
> cpumask_t *cpumask, int evtid, int first)
> {
> + struct rdt_mon_domain *d;
> int cpu;
>
> + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
hdr can be NULL here so this is not safe. I understand this is removed in the next
patch but it is difficult to reason about the code if the steps are not solid.
> + return;
> + d = container_of(hdr, struct rdt_mon_domain, hdr);
> +
> /* When picking a CPU from cpu_mask, ensure it can't race with cpuhp */
> lockdep_assert_cpus_held();
>
> @@ -598,7 +603,6 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> enum resctrl_event_id evtid;
> struct rdt_domain_hdr *hdr;
> struct rmid_read rr = {0};
> - struct rdt_mon_domain *d;
> struct rdtgroup *rdtgrp;
> int domid, cpu, ret = 0;
> struct rdt_resource *r;
> @@ -623,6 +627,8 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> r = resctrl_arch_get_resource(resid);
>
> if (md->sum) {
> + struct rdt_mon_domain *d;
> +
> /*
> * This file requires summing across all domains that share
> * the L3 cache id that was provided in the "domid" field of the
> @@ -649,12 +655,11 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> * the resource to find the domain with "domid".
> */
> hdr = resctrl_find_domain(&r->mon_domains, domid, NULL);
> - if (!hdr || !domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, resid)) {
> + if (!hdr) {
> ret = -ENOENT;
> goto out;
> }
> - d = container_of(hdr, struct rdt_mon_domain, hdr);
> - mon_event_read(&rr, r, d, rdtgrp, &d->hdr.cpu_mask, evtid, false);
> + mon_event_read(&rr, r, hdr, rdtgrp, &hdr->cpu_mask, evtid, false);
> }
>
> checkresult:
> diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
> index 0320360cd7a6..e3b83e48f2d9 100644
> --- a/fs/resctrl/rdtgroup.c
> +++ b/fs/resctrl/rdtgroup.c
> @@ -3164,13 +3164,18 @@ static void mon_rmdir_one_subdir(struct kernfs_node *pkn, char *name, char *subn
> * when last domain being summed is removed.
> */
> static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> - struct rdt_mon_domain *d)
> + struct rdt_domain_hdr *hdr)
> {
> struct rdtgroup *prgrp, *crgrp;
> + struct rdt_mon_domain *d;
> char subname[32];
> bool snc_mode;
> char name[32];
>
> + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
> + return;
> +
> + d = container_of(hdr, struct rdt_mon_domain, hdr);
> snc_mode = r->mon_scope == RESCTRL_L3_NODE;
> sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci_id : d->hdr.id);
> if (snc_mode)
> @@ -3184,19 +3189,18 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> }
> }
>
> -static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
> +static int mon_add_all_files(struct kernfs_node *kn, struct rdt_domain_hdr *hdr,
> struct rdt_resource *r, struct rdtgroup *prgrp,
> - bool do_sum)
> + int domid, bool do_sum)
> {
> struct rmid_read rr = {0};
> struct mon_data *priv;
> struct mon_evt *mevt;
> - int ret, domid;
> + int ret;
>
> for_each_mon_event(mevt) {
> if (mevt->rid != r->rid || !mevt->enabled)
> continue;
> - domid = do_sum ? d->ci_id : d->hdr.id;
Looks like an unrelated change. Would this not be more appropriate for "fs/resctrl: Refactor Sub-NUMA
Cluster (SNC) in mkdir/rmdir code flow"?
> priv = mon_get_kn_priv(r->rid, domid, mevt, do_sum);
> if (WARN_ON_ONCE(!priv))
> return -EINVAL;
> @@ -3206,23 +3210,28 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
> return ret;
>
> if (!do_sum && resctrl_is_mbm_event(mevt->evtid))
> - mon_event_read(&rr, r, d, prgrp, &d->hdr.cpu_mask, mevt->evtid, true);
> + mon_event_read(&rr, r, hdr, prgrp, &hdr->cpu_mask, mevt->evtid, true);
> }
>
> return 0;
> }
>
> static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> - struct rdt_mon_domain *d,
> + struct rdt_domain_hdr *hdr,
> struct rdt_resource *r, struct rdtgroup *prgrp)
> {
> struct kernfs_node *kn, *ckn;
> + struct rdt_mon_domain *d;
> char name[32];
> bool snc_mode;
> int ret = 0;
>
> lockdep_assert_held(&rdtgroup_mutex);
>
> + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
> + return -EINVAL;
> +
> + d = container_of(hdr, struct rdt_mon_domain, hdr);
> snc_mode = r->mon_scope == RESCTRL_L3_NODE;
> sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci_id : d->hdr.id);
> kn = kernfs_find_and_get(parent_kn, name);
> @@ -3240,13 +3249,13 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> ret = rdtgroup_kn_set_ugid(kn);
> if (ret)
> goto out_destroy;
> - ret = mon_add_all_files(kn, d, r, prgrp, snc_mode);
> + ret = mon_add_all_files(kn, hdr, r, prgrp, hdr->id, snc_mode);
This does not seem right ... looks like this aims to do some of the SNC enabling but
the domain id is always set to the domain of the node and does not distinguish between
the L3 id and node id?
> if (ret)
> goto out_destroy;
> }
>
> if (snc_mode) {
> - sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
> + sprintf(name, "mon_sub_%s_%02d", r->name, hdr->id);
> ckn = kernfs_create_dir(kn, name, parent_kn->mode, prgrp);
> if (IS_ERR(ckn)) {
> ret = -EINVAL;
> @@ -3257,7 +3266,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> if (ret)
> goto out_destroy;
>
> - ret = mon_add_all_files(ckn, d, r, prgrp, false);
> + ret = mon_add_all_files(ckn, hdr, r, prgrp, hdr->id, false);
> if (ret)
> goto out_destroy;
> }
> @@ -3275,7 +3284,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> * and "monitor" groups with given domain id.
> */
> static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> - struct rdt_mon_domain *d)
> + struct rdt_domain_hdr *hdr)
> {
> struct kernfs_node *parent_kn;
> struct rdtgroup *prgrp, *crgrp;
> @@ -3283,12 +3292,12 @@ static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
>
> list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> parent_kn = prgrp->mon.mon_data_kn;
> - mkdir_mondata_subdir(parent_kn, d, r, prgrp);
> + mkdir_mondata_subdir(parent_kn, hdr, r, prgrp);
>
> head = &prgrp->mon.crdtgrp_list;
> list_for_each_entry(crgrp, head, mon.crdtgrp_list) {
> parent_kn = crgrp->mon.mon_data_kn;
> - mkdir_mondata_subdir(parent_kn, d, r, crgrp);
> + mkdir_mondata_subdir(parent_kn, hdr, r, crgrp);
> }
> }
> }
> @@ -3297,14 +3306,14 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
> struct rdt_resource *r,
> struct rdtgroup *prgrp)
> {
> - struct rdt_mon_domain *dom;
> + struct rdt_domain_hdr *hdr;
> int ret;
>
> /* Walking r->domains, ensure it can't race with cpuhp */
> lockdep_assert_cpus_held();
>
> - list_for_each_entry(dom, &r->mon_domains, hdr.list) {
> - ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
> + list_for_each_entry(hdr, &r->mon_domains, list) {
> + ret = mkdir_mondata_subdir(parent_kn, hdr, r, prgrp);
> if (ret)
> return ret;
> }
> @@ -4187,8 +4196,10 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain
> mutex_unlock(&rdtgroup_mutex);
> }
>
> -void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
> +void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
> {
> + struct rdt_mon_domain *d;
> +
> mutex_lock(&rdtgroup_mutex);
>
> /*
> @@ -4196,8 +4207,12 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
> * per domain monitor data directories.
> */
> if (resctrl_mounted && resctrl_arch_mon_capable())
> - rmdir_mondata_subdir_allrdtgrp(r, d);
> + rmdir_mondata_subdir_allrdtgrp(r, hdr);
>
> + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
> + goto out_unlock;
> +
One logical change per patch please.
While all other L3 specific functions modified to receive hdr as parameter are changed to use
container_of() at beginning of function to highlight that the functions are L3 specific ...
resctrl_offline_mon_domain() is changed differently. Looks like this changes the flow to
sneak in some PERF_PKG enabling for convenience and thus makes this patch harder to understand.
Splitting resctrl_offline_mon_domain() to handle different domain types seems more appropriate
for "x86/resctrl: Handle domain creation/deletion for RDT_RESOURCE_PERF_PKG" where it should be
clear what changes are made to support PERF_PKG. In this patch, in this stage of series, the
entire function can be L3 specific.
> + d = container_of(hdr, struct rdt_mon_domain, hdr);
> if (resctrl_is_mbm_enabled())
> cancel_delayed_work(&d->mbm_over);
> if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) && has_busy_rmid(d)) {
> @@ -4214,7 +4229,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
> }
>
> domain_destroy_mon_state(d);
> -
> +out_unlock:
> mutex_unlock(&rdtgroup_mutex);
> }
>
> @@ -4287,12 +4302,17 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d
> return err;
> }
>
> -int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
> +int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
> {
> - int err;
> + struct rdt_mon_domain *d;
> + int err = -EINVAL;
>
> mutex_lock(&rdtgroup_mutex);
>
> + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
> + goto out_unlock;
> +
> + d = container_of(hdr, struct rdt_mon_domain, hdr);
> err = domain_setup_mon_state(r, d);
> if (err)
> goto out_unlock;
> @@ -4306,6 +4326,7 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
> if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID))
> INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo);
>
> + err = 0;
Considering the earlier exit on "if (err)", err can be expected to be 0 here?
> /*
> * If the filesystem is not mounted then only the default resource group
> * exists. Creation of its directories is deferred until mount time
> @@ -4313,7 +4334,7 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
> * If resctrl is mounted, add per domain monitor data directories.
> */
> if (resctrl_mounted && resctrl_arch_mon_capable())
> - mkdir_mondata_subdir_allrdtgrp(r, d);
> + mkdir_mondata_subdir_allrdtgrp(r, hdr);
>
> out_unlock:
> mutex_unlock(&rdtgroup_mutex);
Reinette
* Re: [PATCH v11 05/31] x86,fs/resctrl: Refactor domain create/remove using struct rdt_domain_hdr
* Re: [PATCH v11 05/31] x86,fs/resctrl: Refactor domain create/remove using struct rdt_domain_hdr
2025-10-03 15:33 ` Reinette Chatre
@ 2025-10-03 22:55 ` Luck, Tony
2025-10-06 21:32 ` Reinette Chatre
0 siblings, 1 reply; 84+ messages in thread
From: Luck, Tony @ 2025-10-03 22:55 UTC (permalink / raw)
To: Reinette Chatre
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen Yu, x86, linux-kernel,
patches
On Fri, Oct 03, 2025 at 08:33:00AM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 9/25/25 1:02 PM, Tony Luck wrote:
> > Up until now, all monitoring events were associated with the L3 resource
> > and it made sense to use the L3 specific "struct rdt_mon_domain *"
> > arguments to functions manipulating domains.
> >
> > To simplify enabling of enumeration of domains for events in other
>
> What does "enabling of enumeration of domains" mean?
Is this better?
To prepare for events in resources other than L3, change the calling convention
to pass the generic struct rdt_domain_hdr and use that to find the domain specific
structure where needed.
>
> > resources change the calling convention to pass the generic struct
> > rdt_domain_hdr and use that to find the domain specific structure
> > where needed.
>
> I think it will be helpful to highlight that this is a stepping stone
> that highlights what domain management code is L3 specific and thus in
> need of further refactoring to support new domain types vs. what is generic.
Above re-wording covers this.
>
> >
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > ---
>
> ...
>
> > diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
> > index f248eaf50d3c..3ceef35208be 100644
> > --- a/fs/resctrl/ctrlmondata.c
> > +++ b/fs/resctrl/ctrlmondata.c
> > @@ -547,11 +547,16 @@ struct rdt_domain_hdr *resctrl_find_domain(struct list_head *h, int id,
> > }
> >
> > void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> > - struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
> > + struct rdt_domain_hdr *hdr, struct rdtgroup *rdtgrp,
> > cpumask_t *cpumask, int evtid, int first)
> > {
> > + struct rdt_mon_domain *d;
> > int cpu;
> >
> > + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
>
> hdr can be NULL here so this is not safe. I understand this is removed in the next
> patch but it is difficult to reason about the code if the steps are not solid.
Will fix.
>
> > + return;
> > + d = container_of(hdr, struct rdt_mon_domain, hdr);
> > +
> > /* When picking a CPU from cpu_mask, ensure it can't race with cpuhp */
> > lockdep_assert_cpus_held();
> >
> > @@ -598,7 +603,6 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> > enum resctrl_event_id evtid;
> > struct rdt_domain_hdr *hdr;
> > struct rmid_read rr = {0};
> > - struct rdt_mon_domain *d;
> > struct rdtgroup *rdtgrp;
> > int domid, cpu, ret = 0;
> > struct rdt_resource *r;
> > @@ -623,6 +627,8 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> > r = resctrl_arch_get_resource(resid);
> >
> > if (md->sum) {
> > + struct rdt_mon_domain *d;
> > +
> > /*
> > * This file requires summing across all domains that share
> > * the L3 cache id that was provided in the "domid" field of the
> > @@ -649,12 +655,11 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> > * the resource to find the domain with "domid".
> > */
> > hdr = resctrl_find_domain(&r->mon_domains, domid, NULL);
> > - if (!hdr || !domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, resid)) {
> > + if (!hdr) {
> > ret = -ENOENT;
> > goto out;
> > }
> > - d = container_of(hdr, struct rdt_mon_domain, hdr);
> > - mon_event_read(&rr, r, d, rdtgrp, &d->hdr.cpu_mask, evtid, false);
> > + mon_event_read(&rr, r, hdr, rdtgrp, &hdr->cpu_mask, evtid, false);
> > }
> >
> > checkresult:
> > diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
> > index 0320360cd7a6..e3b83e48f2d9 100644
> > --- a/fs/resctrl/rdtgroup.c
> > +++ b/fs/resctrl/rdtgroup.c
> > @@ -3164,13 +3164,18 @@ static void mon_rmdir_one_subdir(struct kernfs_node *pkn, char *name, char *subn
> > * when last domain being summed is removed.
> > */
> > static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> > - struct rdt_mon_domain *d)
> > + struct rdt_domain_hdr *hdr)
> > {
> > struct rdtgroup *prgrp, *crgrp;
> > + struct rdt_mon_domain *d;
> > char subname[32];
> > bool snc_mode;
> > char name[32];
> >
> > + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
> > + return;
> > +
> > + d = container_of(hdr, struct rdt_mon_domain, hdr);
> > snc_mode = r->mon_scope == RESCTRL_L3_NODE;
> > sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci_id : d->hdr.id);
> > if (snc_mode)
> > @@ -3184,19 +3189,18 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> > }
> > }
> >
> > -static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
> > +static int mon_add_all_files(struct kernfs_node *kn, struct rdt_domain_hdr *hdr,
> > struct rdt_resource *r, struct rdtgroup *prgrp,
> > - bool do_sum)
> > + int domid, bool do_sum)
> > {
> > struct rmid_read rr = {0};
> > struct mon_data *priv;
> > struct mon_evt *mevt;
> > - int ret, domid;
> > + int ret;
> >
> > for_each_mon_event(mevt) {
> > if (mevt->rid != r->rid || !mevt->enabled)
> > continue;
> > - domid = do_sum ? d->ci_id : d->hdr.id;
>
> Looks like an unrelated change. Would this not be more appropriate for "fs/resctrl: Refactor Sub-NUMA
> Cluster (SNC) in mkdir/rmdir code flow"?
Agreed. I'll cut this out and move to the later patch.
>
> > priv = mon_get_kn_priv(r->rid, domid, mevt, do_sum);
> > if (WARN_ON_ONCE(!priv))
> > return -EINVAL;
> > @@ -3206,23 +3210,28 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
> > return ret;
> >
> > if (!do_sum && resctrl_is_mbm_event(mevt->evtid))
> > - mon_event_read(&rr, r, d, prgrp, &d->hdr.cpu_mask, mevt->evtid, true);
> > + mon_event_read(&rr, r, hdr, prgrp, &hdr->cpu_mask, mevt->evtid, true);
> > }
> >
> > return 0;
> > }
> >
> > static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> > - struct rdt_mon_domain *d,
> > + struct rdt_domain_hdr *hdr,
> > struct rdt_resource *r, struct rdtgroup *prgrp)
> > {
> > struct kernfs_node *kn, *ckn;
> > + struct rdt_mon_domain *d;
> > char name[32];
> > bool snc_mode;
> > int ret = 0;
> >
> > lockdep_assert_held(&rdtgroup_mutex);
> >
> > + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
> > + return -EINVAL;
> > +
> > + d = container_of(hdr, struct rdt_mon_domain, hdr);
> > snc_mode = r->mon_scope == RESCTRL_L3_NODE;
> > sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci_id : d->hdr.id);
> > kn = kernfs_find_and_get(parent_kn, name);
> > @@ -3240,13 +3249,13 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> > ret = rdtgroup_kn_set_ugid(kn);
> > if (ret)
> > goto out_destroy;
> > - ret = mon_add_all_files(kn, d, r, prgrp, snc_mode);
> > + ret = mon_add_all_files(kn, hdr, r, prgrp, hdr->id, snc_mode);
>
> This does not seem right ... looks like this aims to do some of the SNC enabling but
> the domain id is always set to the domain of the node and does not distinguish between
> the L3 id and node id?
Also move to later patch (and get it right)
>
> > if (ret)
> > goto out_destroy;
> > }
> >
> > if (snc_mode) {
> > - sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
> > + sprintf(name, "mon_sub_%s_%02d", r->name, hdr->id);
> > ckn = kernfs_create_dir(kn, name, parent_kn->mode, prgrp);
> > if (IS_ERR(ckn)) {
> > ret = -EINVAL;
> > @@ -3257,7 +3266,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> > if (ret)
> > goto out_destroy;
> >
> > - ret = mon_add_all_files(ckn, d, r, prgrp, false);
> > + ret = mon_add_all_files(ckn, hdr, r, prgrp, hdr->id, false);
> > if (ret)
> > goto out_destroy;
> > }
> > @@ -3275,7 +3284,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> > * and "monitor" groups with given domain id.
> > */
> > static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> > - struct rdt_mon_domain *d)
> > + struct rdt_domain_hdr *hdr)
> > {
> > struct kernfs_node *parent_kn;
> > struct rdtgroup *prgrp, *crgrp;
> > @@ -3283,12 +3292,12 @@ static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> >
> > list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> > parent_kn = prgrp->mon.mon_data_kn;
> > - mkdir_mondata_subdir(parent_kn, d, r, prgrp);
> > + mkdir_mondata_subdir(parent_kn, hdr, r, prgrp);
> >
> > head = &prgrp->mon.crdtgrp_list;
> > list_for_each_entry(crgrp, head, mon.crdtgrp_list) {
> > parent_kn = crgrp->mon.mon_data_kn;
> > - mkdir_mondata_subdir(parent_kn, d, r, crgrp);
> > + mkdir_mondata_subdir(parent_kn, hdr, r, crgrp);
> > }
> > }
> > }
> > @@ -3297,14 +3306,14 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
> > struct rdt_resource *r,
> > struct rdtgroup *prgrp)
> > {
> > - struct rdt_mon_domain *dom;
> > + struct rdt_domain_hdr *hdr;
> > int ret;
> >
> > /* Walking r->domains, ensure it can't race with cpuhp */
> > lockdep_assert_cpus_held();
> >
> > - list_for_each_entry(dom, &r->mon_domains, hdr.list) {
> > - ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
> > + list_for_each_entry(hdr, &r->mon_domains, list) {
> > + ret = mkdir_mondata_subdir(parent_kn, hdr, r, prgrp);
> > if (ret)
> > return ret;
> > }
> > @@ -4187,8 +4196,10 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain
> > mutex_unlock(&rdtgroup_mutex);
> > }
> >
> > -void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
> > +void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
> > {
> > + struct rdt_mon_domain *d;
> > +
> > mutex_lock(&rdtgroup_mutex);
> >
> > /*
> > @@ -4196,8 +4207,12 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
> > * per domain monitor data directories.
> > */
> > if (resctrl_mounted && resctrl_arch_mon_capable())
> > - rmdir_mondata_subdir_allrdtgrp(r, d);
> > + rmdir_mondata_subdir_allrdtgrp(r, hdr);
> >
> > + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
> > + goto out_unlock;
> > +
>
> One logical change per patch please.
>
> While all other L3 specific functions modified to receive hdr as parameter are changed to use
> container_of() at beginning of function to highlight that the functions are L3 specific ...
> resctrl_offline_mon_domain() is changed differently. Looks like this changes the flow to
> sneak in some PERF_PKG enabling for convenience and thus makes this patch harder to understand.
> Splitting resctrl_offline_mon_domain() to handle different domain types seems more appropriate
> for "x86/resctrl: Handle domain creation/deletion for RDT_RESOURCE_PERF_PKG" where it should be
> clear what changes are made to support PERF_PKG. In this patch, in this stage of series, the
> entire function can be L3 specific.
Will move to later patch and make the offline/online patches have the same
style.
>
> > + d = container_of(hdr, struct rdt_mon_domain, hdr);
> > if (resctrl_is_mbm_enabled())
> > cancel_delayed_work(&d->mbm_over);
> > if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) && has_busy_rmid(d)) {
> > @@ -4214,7 +4229,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
> > }
> >
> > domain_destroy_mon_state(d);
> > -
> > +out_unlock:
> > mutex_unlock(&rdtgroup_mutex);
> > }
> >
> > @@ -4287,12 +4302,17 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d
> > return err;
> > }
> >
> > -int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
> > +int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
> > {
> > - int err;
> > + struct rdt_mon_domain *d;
> > + int err = -EINVAL;
> >
> > mutex_lock(&rdtgroup_mutex);
> >
> > + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
> > + goto out_unlock;
> > +
> > + d = container_of(hdr, struct rdt_mon_domain, hdr);
> > err = domain_setup_mon_state(r, d);
> > if (err)
> > goto out_unlock;
> > @@ -4306,6 +4326,7 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
> > if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID))
> > INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo);
> >
> > + err = 0;
>
> Considering the earlier exit on "if (err)", err can be expected to be 0 here?
Yes. Dropped this superfluous assignment.
>
> > /*
> > * If the filesystem is not mounted then only the default resource group
> > * exists. Creation of its directories is deferred until mount time
> > @@ -4313,7 +4334,7 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
> > * If resctrl is mounted, add per domain monitor data directories.
> > */
> > if (resctrl_mounted && resctrl_arch_mon_capable())
> > - mkdir_mondata_subdir_allrdtgrp(r, d);
> > + mkdir_mondata_subdir_allrdtgrp(r, hdr);
> >
> > out_unlock:
> > mutex_unlock(&rdtgroup_mutex);
>
> Reinette
-Tony
* Re: [PATCH v11 05/31] x86,fs/resctrl: Refactor domain create/remove using struct rdt_domain_hdr
2025-10-03 22:55 ` Luck, Tony
@ 2025-10-06 21:32 ` Reinette Chatre
0 siblings, 0 replies; 84+ messages in thread
From: Reinette Chatre @ 2025-10-06 21:32 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen Yu, x86, linux-kernel,
patches
Hi Tony,
On 10/3/25 3:55 PM, Luck, Tony wrote:
> On Fri, Oct 03, 2025 at 08:33:00AM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 9/25/25 1:02 PM, Tony Luck wrote:
>>> Up until now, all monitoring events were associated with the L3 resource
>>> and it made sense to use the L3 specific "struct rdt_mon_domain *"
>>> arguments to functions manipulating domains.
>>>
>>> To simplify enabling of enumeration of domains for events in other
>>
>> What does "enabling of enumeration of domains" mean?
>
> Is this better?
>
> To prepare for events in resources other than L3, change the calling convention
> to pass the generic struct rdt_domain_hdr and use that to find the domain specific
> structure where needed.
I interpret above as a solution that is unrelated to the problem because the problem
is stated as "prepare for events in *resources*" while the solution changes how
*domain* structures are accessed.
Here is an attempt to make the problem and solution clear; please feel free to change:
Up until now, all monitoring events were associated with the L3 resource
and it made sense to use the L3 specific "struct rdt_mon_domain *"
argument to functions operating on domains.
Telemetry events will be tied to a new resource with its instances
represented by a new domain structure that, just like struct rdt_mon_domain,
starts with the generic struct rdt_domain_hdr.
Prepare to support domains belonging to different resources by
changing the calling convention of functions operating on domains.
Pass the generic header and use that to find the domain specific
structure where needed.
sidenote: I changed "manipulating" to "operating on" even though Boris wrote it since a
few of the functions changed in this patch do not manipulate the domains but instead use
them as a reference.
Reinette
* [PATCH v11 06/31] x86,fs/resctrl: Use struct rdt_domain_hdr when reading counters
2025-09-25 20:02 [PATCH v11 00/31] x86,fs/resctrl telemetry monitoring Tony Luck
` (4 preceding siblings ...)
2025-09-25 20:02 ` [PATCH v11 05/31] x86,fs/resctrl: Refactor domain create/remove using struct rdt_domain_hdr Tony Luck
@ 2025-09-25 20:03 ` Tony Luck
2025-10-03 15:34 ` Reinette Chatre
2025-09-25 20:03 ` [PATCH v11 07/31] x86,fs/resctrl: Rename struct rdt_mon_domain and rdt_hw_mon_domain Tony Luck
` (24 subsequent siblings)
30 siblings, 1 reply; 84+ messages in thread
From: Tony Luck @ 2025-09-25 20:03 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Use a generic struct rdt_domain_hdr representing a generic domain
header in struct rmid_read in order to support other telemetry events'
domains besides an L3 one. Adjust the code interacting with it to the
new struct layout.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 8 +++----
fs/resctrl/internal.h | 18 +++++++-------
arch/x86/kernel/cpu/resctrl/monitor.c | 17 +++++++++++---
fs/resctrl/ctrlmondata.c | 7 +-----
fs/resctrl/monitor.c | 34 +++++++++++++++++----------
5 files changed, 50 insertions(+), 34 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 0b55809af5d7..0fef3045cac3 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -514,7 +514,7 @@ void resctrl_offline_cpu(unsigned int cpu);
* resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
* for this resource and domain.
* @r: resource that the counter should be read from.
- * @d: domain that the counter should be read from.
+ * @hdr: Header of domain that the counter should be read from.
* @closid: closid that matches the rmid. Depending on the architecture, the
* counter may match traffic of both @closid and @rmid, or @rmid
* only.
@@ -535,7 +535,7 @@ void resctrl_offline_cpu(unsigned int cpu);
* Return:
* 0 on success, or -EIO, -EINVAL etc on error.
*/
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain_hdr *hdr,
u32 closid, u32 rmid, enum resctrl_event_id eventid,
u64 *val, void *arch_mon_ctx);
@@ -630,7 +630,7 @@ void resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
* assigned to the RMID, event pair for this resource
* and domain.
* @r: Resource that the counter should be read from.
- * @d: Domain that the counter should be read from.
+ * @hdr: Header of domain that the counter should be read from.
* @closid: CLOSID that matches the RMID.
* @rmid: The RMID to which @cntr_id is assigned.
* @cntr_id: The counter to read.
@@ -644,7 +644,7 @@ void resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
* Return:
* 0 on success, or -EIO, -EINVAL etc on error.
*/
-int resctrl_arch_cntr_read(struct rdt_resource *r, struct rdt_mon_domain *d,
+int resctrl_arch_cntr_read(struct rdt_resource *r, struct rdt_domain_hdr *hdr,
u32 closid, u32 rmid, int cntr_id,
enum resctrl_event_id eventid, u64 *val);
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 22fdb3a9b6f4..698ed84fd073 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -106,24 +106,26 @@ struct mon_data {
* resource group then its event count is summed with the count from all
* its child resource groups.
* @r: Resource describing the properties of the event being read.
- * @d: Domain that the counter should be read from. If NULL then sum all
- * domains in @r sharing L3 @ci.id
+ * @hdr: Header of domain that the counter should be read from. If NULL then
+ * sum all domains in @r sharing L3 @ci.id
* @evtid: Which monitor event to read.
* @first: Initialize MBM counter when true.
- * @ci: Cacheinfo for L3. Only set when @d is NULL. Used when summing domains.
+ * @ci: Cacheinfo for L3. Only set when @hdr is NULL. Used when summing
+ * domains.
* @is_mbm_cntr: true if "mbm_event" counter assignment mode is enabled and it
* is an MBM event.
* @err: Error encountered when reading counter.
- * @val: Returned value of event counter. If @rgrp is a parent resource group,
- * @val includes the sum of event counts from its child resource groups.
- * If @d is NULL, @val includes the sum of all domains in @r sharing @ci.id,
- * (summed across child resource groups if @rgrp is a parent resource group).
+ * @val: Returned value of event counter. If @rgrp is a parent resource
+ * group, @val includes the sum of event counts from its child
+ * resource groups. If @hdr is NULL, @val includes the sum of all
+ * domains in @r sharing @ci.id, (summed across child resource groups
+ * if @rgrp is a parent resource group).
* @arch_mon_ctx: Hardware monitor allocated for this read request (MPAM only).
*/
struct rmid_read {
struct rdtgroup *rgrp;
struct rdt_resource *r;
- struct rdt_mon_domain *d;
+ struct rdt_domain_hdr *hdr;
enum resctrl_event_id evtid;
bool first;
struct cacheinfo *ci;
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index c8945610d455..cee1cd7fbdce 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -238,17 +238,23 @@ static u64 get_corrected_val(struct rdt_resource *r, struct rdt_mon_domain *d,
return chunks * hw_res->mon_scale;
}
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain_hdr *hdr,
u32 unused, u32 rmid, enum resctrl_event_id eventid,
u64 *val, void *ignored)
{
- int cpu = cpumask_any(&d->hdr.cpu_mask);
+ struct rdt_mon_domain *d;
u64 msr_val;
u32 prmid;
+ int cpu;
int ret;
resctrl_arch_rmid_read_context_check();
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
+ return -EINVAL;
+
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
+ cpu = cpumask_any(&hdr->cpu_mask);
prmid = logical_rmid_to_physical_rmid(cpu, rmid);
ret = __rmid_read_phys(prmid, eventid, &msr_val);
if (ret)
@@ -312,13 +318,18 @@ void resctrl_arch_reset_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
}
}
-int resctrl_arch_cntr_read(struct rdt_resource *r, struct rdt_mon_domain *d,
+int resctrl_arch_cntr_read(struct rdt_resource *r, struct rdt_domain_hdr *hdr,
u32 unused, u32 rmid, int cntr_id,
enum resctrl_event_id eventid, u64 *val)
{
+ struct rdt_mon_domain *d;
u64 msr_val;
int ret;
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
+ return -EINVAL;
+
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
ret = __cntr_id_read(cntr_id, &msr_val);
if (ret)
return ret;
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index 3ceef35208be..7b9fc5d3bdc8 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -550,13 +550,8 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
struct rdt_domain_hdr *hdr, struct rdtgroup *rdtgrp,
cpumask_t *cpumask, int evtid, int first)
{
- struct rdt_mon_domain *d;
int cpu;
- if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
- return;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
-
/* When picking a CPU from cpu_mask, ensure it can't race with cpuhp */
lockdep_assert_cpus_held();
@@ -566,7 +561,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
rr->rgrp = rdtgrp;
rr->evtid = evtid;
rr->r = r;
- rr->d = d;
+ rr->hdr = hdr;
rr->first = first;
if (resctrl_arch_mbm_cntr_assign_enabled(r) &&
resctrl_is_mbm_event(evtid)) {
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 4076336fbba6..32116361a5f6 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -159,7 +159,7 @@ void __check_limbo(struct rdt_mon_domain *d, bool force_free)
break;
entry = __rmid_entry(idx);
- if (resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid,
+ if (resctrl_arch_rmid_read(r, &d->hdr, entry->closid, entry->rmid,
QOS_L3_OCCUP_EVENT_ID, &val,
arch_mon_ctx)) {
rmid_dirty = true;
@@ -424,8 +424,12 @@ static int __mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
int err, ret;
u64 tval = 0;
+ if (!domain_header_is_valid(rr->hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
+ return -EINVAL;
+ d = container_of(rr->hdr, struct rdt_mon_domain, hdr);
+
if (rr->is_mbm_cntr) {
- cntr_id = mbm_cntr_get(rr->r, rr->d, rdtgrp, rr->evtid);
+ cntr_id = mbm_cntr_get(rr->r, d, rdtgrp, rr->evtid);
if (cntr_id < 0) {
rr->err = -ENOENT;
return -EINVAL;
@@ -434,24 +438,24 @@ static int __mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
if (rr->first) {
if (rr->is_mbm_cntr)
- resctrl_arch_reset_cntr(rr->r, rr->d, closid, rmid, cntr_id, rr->evtid);
+ resctrl_arch_reset_cntr(rr->r, d, closid, rmid, cntr_id, rr->evtid);
else
- resctrl_arch_reset_rmid(rr->r, rr->d, closid, rmid, rr->evtid);
- m = get_mbm_state(rr->d, closid, rmid, rr->evtid);
+ resctrl_arch_reset_rmid(rr->r, d, closid, rmid, rr->evtid);
+ m = get_mbm_state(d, closid, rmid, rr->evtid);
if (m)
memset(m, 0, sizeof(struct mbm_state));
return 0;
}
- if (rr->d) {
+ if (rr->hdr) {
/* Reading a single domain, must be on a CPU in that domain. */
- if (!cpumask_test_cpu(cpu, &rr->d->hdr.cpu_mask))
+ if (!cpumask_test_cpu(cpu, &rr->hdr->cpu_mask))
return -EINVAL;
if (rr->is_mbm_cntr)
- rr->err = resctrl_arch_cntr_read(rr->r, rr->d, closid, rmid, cntr_id,
+ rr->err = resctrl_arch_cntr_read(rr->r, rr->hdr, closid, rmid, cntr_id,
rr->evtid, &tval);
else
- rr->err = resctrl_arch_rmid_read(rr->r, rr->d, closid, rmid,
+ rr->err = resctrl_arch_rmid_read(rr->r, rr->hdr, closid, rmid,
rr->evtid, &tval, rr->arch_mon_ctx);
if (rr->err)
return rr->err;
@@ -477,10 +481,10 @@ static int __mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
if (d->ci_id != rr->ci->id)
continue;
if (rr->is_mbm_cntr)
- err = resctrl_arch_cntr_read(rr->r, d, closid, rmid, cntr_id,
+ err = resctrl_arch_cntr_read(rr->r, &d->hdr, closid, rmid, cntr_id,
rr->evtid, &tval);
else
- err = resctrl_arch_rmid_read(rr->r, d, closid, rmid,
+ err = resctrl_arch_rmid_read(rr->r, &d->hdr, closid, rmid,
rr->evtid, &tval, rr->arch_mon_ctx);
if (!err) {
rr->val += tval;
@@ -511,9 +515,13 @@ static void mbm_bw_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
u64 cur_bw, bytes, cur_bytes;
u32 closid = rdtgrp->closid;
u32 rmid = rdtgrp->mon.rmid;
+ struct rdt_mon_domain *d;
struct mbm_state *m;
- m = get_mbm_state(rr->d, closid, rmid, rr->evtid);
+ if (!domain_header_is_valid(rr->hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
+ return;
+ d = container_of(rr->hdr, struct rdt_mon_domain, hdr);
+ m = get_mbm_state(d, closid, rmid, rr->evtid);
if (WARN_ON_ONCE(!m))
return;
@@ -686,7 +694,7 @@ static void mbm_update_one_event(struct rdt_resource *r, struct rdt_mon_domain *
struct rmid_read rr = {0};
rr.r = r;
- rr.d = d;
+ rr.hdr = &d->hdr;
rr.evtid = evtid;
if (resctrl_arch_mbm_cntr_assign_enabled(r)) {
rr.is_mbm_cntr = true;
--
2.51.0
^ permalink raw reply related [flat|nested] 84+ messages in thread

* Re: [PATCH v11 06/31] x86,fs/resctrl: Use struct rdt_domain_hdr when reading counters
2025-09-25 20:03 ` [PATCH v11 06/31] x86,fs/resctrl: Use struct rdt_domain_hdr when reading counters Tony Luck
@ 2025-10-03 15:34 ` Reinette Chatre
2025-10-03 22:59 ` Luck, Tony
0 siblings, 1 reply; 84+ messages in thread
From: Reinette Chatre @ 2025-10-03 15:34 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 9/25/25 1:03 PM, Tony Luck wrote:
> Use a generic struct rdt_domain_hdr representing a generic domain
> header in struct rmid_read in order to support other telemetry events'
> domains besides an L3 one. Adjust the code interacting with it to the
> new struct layout.
I'd propose a small amend to be more specific and not assume reader knows
what rmid_read is used for:
struct rmid_read contains data passed around to read event counts. Use the
generic domain header struct rdt_domain_hdr in struct rmid_read in order to
support other telemetry events' domains besides an L3 one. Adjust the code
interacting with it to the new struct layout.
> diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
> index 3ceef35208be..7b9fc5d3bdc8 100644
> --- a/fs/resctrl/ctrlmondata.c
> +++ b/fs/resctrl/ctrlmondata.c
> @@ -550,13 +550,8 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> struct rdt_domain_hdr *hdr, struct rdtgroup *rdtgrp,
> cpumask_t *cpumask, int evtid, int first)
> {
> - struct rdt_mon_domain *d;
> int cpu;
>
> - if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
> - return;
> - d = container_of(hdr, struct rdt_mon_domain, hdr);
> -
Problematic snippet removed here ...
> /* When picking a CPU from cpu_mask, ensure it can't race with cpuhp */
> lockdep_assert_cpus_held();
>
> @@ -566,7 +561,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> rr->rgrp = rdtgrp;
> rr->evtid = evtid;
> rr->r = r;
> - rr->d = d;
> + rr->hdr = hdr;
> rr->first = first;
> if (resctrl_arch_mbm_cntr_assign_enabled(r) &&
> resctrl_is_mbm_event(evtid)) {
> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> index 4076336fbba6..32116361a5f6 100644
> --- a/fs/resctrl/monitor.c
> +++ b/fs/resctrl/monitor.c
> @@ -159,7 +159,7 @@ void __check_limbo(struct rdt_mon_domain *d, bool force_free)
> break;
>
> entry = __rmid_entry(idx);
> - if (resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid,
> + if (resctrl_arch_rmid_read(r, &d->hdr, entry->closid, entry->rmid,
> QOS_L3_OCCUP_EVENT_ID, &val,
> arch_mon_ctx)) {
> rmid_dirty = true;
> @@ -424,8 +424,12 @@ static int __mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
> int err, ret;
> u64 tval = 0;
>
> + if (!domain_header_is_valid(rr->hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
> + return -EINVAL;
> + d = container_of(rr->hdr, struct rdt_mon_domain, hdr);
> +
... but now the problem is moved to __mon_event_count(), where rr->hdr can be NULL and the
domain_header_is_valid() check dereferences a NULL pointer when SNC is enabled?
Am I missing something here? Does this work on SNC?
Reinette
* Re: [PATCH v11 06/31] x86,fs/resctrl: Use struct rdt_domain_hdr when reading counters
2025-10-03 15:34 ` Reinette Chatre
@ 2025-10-03 22:59 ` Luck, Tony
0 siblings, 0 replies; 84+ messages in thread
From: Luck, Tony @ 2025-10-03 22:59 UTC (permalink / raw)
To: Reinette Chatre
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen Yu, x86, linux-kernel,
patches
On Fri, Oct 03, 2025 at 08:34:01AM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 9/25/25 1:03 PM, Tony Luck wrote:
> > Use a generic struct rdt_domain_hdr representing a generic domain
> > header in struct rmid_read in order to support other telemetry events'
> > domains besides an L3 one. Adjust the code interacting with it to the
> > new struct layout.
>
> I'd propose a small amend to be more specific and not assume reader knows
> what rmid_read is used for:
>
> struct rmid_read contains data passed around to read event counts. Use the
> generic domain header struct rdt_domain_hdr in struct rmid_read in order to
> support other telemetry events' domains besides an L3 one. Adjust the code
> interacting with it to the new struct layout.
Looks good. Thanks.
>
>
> > diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
> > index 3ceef35208be..7b9fc5d3bdc8 100644
> > --- a/fs/resctrl/ctrlmondata.c
> > +++ b/fs/resctrl/ctrlmondata.c
> > @@ -550,13 +550,8 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> > struct rdt_domain_hdr *hdr, struct rdtgroup *rdtgrp,
> > cpumask_t *cpumask, int evtid, int first)
> > {
> > - struct rdt_mon_domain *d;
> > int cpu;
> >
> > - if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
> > - return;
> > - d = container_of(hdr, struct rdt_mon_domain, hdr);
> > -
>
> Problematic snippet removed here ...
>
Yup.
> > /* When picking a CPU from cpu_mask, ensure it can't race with cpuhp */
> > lockdep_assert_cpus_held();
> >
> > @@ -566,7 +561,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> > rr->rgrp = rdtgrp;
> > rr->evtid = evtid;
> > rr->r = r;
> > - rr->d = d;
> > + rr->hdr = hdr;
> > rr->first = first;
> > if (resctrl_arch_mbm_cntr_assign_enabled(r) &&
> > resctrl_is_mbm_event(evtid)) {
> > diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> > index 4076336fbba6..32116361a5f6 100644
> > --- a/fs/resctrl/monitor.c
> > +++ b/fs/resctrl/monitor.c
> > @@ -159,7 +159,7 @@ void __check_limbo(struct rdt_mon_domain *d, bool force_free)
> > break;
> >
> > entry = __rmid_entry(idx);
> > - if (resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid,
> > + if (resctrl_arch_rmid_read(r, &d->hdr, entry->closid, entry->rmid,
> > QOS_L3_OCCUP_EVENT_ID, &val,
> > arch_mon_ctx)) {
> > rmid_dirty = true;
> > @@ -424,8 +424,12 @@ static int __mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
> > int err, ret;
> > u64 tval = 0;
> >
> > + if (!domain_header_is_valid(rr->hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
> > + return -EINVAL;
> > + d = container_of(rr->hdr, struct rdt_mon_domain, hdr);
> > +
>
> ... but now the problem is moved to __mon_event_count() where rr->hdr can be NULL and the
> domain_header_is_valid() check is referencing NULL pointer when SNC is enabled?
> Am I missing something here? Does this work on SNC?
You are right. This likely breaks SNC. I'll add a check for "!hdr" and
move this check inside the "if (rr->is_mbm_cntr)" block, with a duplicate
inside the "if (rr->first)" block. The duplication will be cleaned up by a
later patch that refactors __mon_event_count().
>
> Reinette
>
-Tony
>
* [PATCH v11 07/31] x86,fs/resctrl: Rename struct rdt_mon_domain and rdt_hw_mon_domain
2025-09-25 20:02 [PATCH v11 00/31] x86,fs/resctrl telemetry monitoring Tony Luck
` (5 preceding siblings ...)
2025-09-25 20:03 ` [PATCH v11 06/31] x86,fs/resctrl: Use struct rdt_domain_hdr when reading counters Tony Luck
@ 2025-09-25 20:03 ` Tony Luck
2025-10-03 23:24 ` Reinette Chatre
2025-09-25 20:03 ` [PATCH v11 08/31] x86,fs/resctrl: Rename some L3 specific functions Tony Luck
` (23 subsequent siblings)
30 siblings, 1 reply; 84+ messages in thread
From: Tony Luck @ 2025-09-25 20:03 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
The upcoming telemetry event monitoring is not tied to the L3
resource and will have new domain structures.
Rename the L3 resource-specific domain data structures to include
"l3_" in their names to avoid confusion between the different
resource-specific domain structures:
rdt_mon_domain -> rdt_l3_mon_domain
rdt_hw_mon_domain -> rdt_hw_l3_mon_domain
No functional change.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 20 ++++----
arch/x86/kernel/cpu/resctrl/internal.h | 16 +++---
fs/resctrl/internal.h | 8 +--
arch/x86/kernel/cpu/resctrl/core.c | 14 +++---
arch/x86/kernel/cpu/resctrl/monitor.c | 36 +++++++-------
fs/resctrl/ctrlmondata.c | 2 +-
fs/resctrl/monitor.c | 68 +++++++++++++-------------
fs/resctrl/rdtgroup.c | 36 +++++++-------
8 files changed, 100 insertions(+), 100 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 0fef3045cac3..66569662efee 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -178,7 +178,7 @@ struct mbm_cntr_cfg {
};
/**
- * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
+ * struct rdt_l3_mon_domain - group of CPUs sharing a resctrl monitor resource
* @hdr: common header for different domain types
* @ci_id: cache info id for this domain
* @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
@@ -192,7 +192,7 @@ struct mbm_cntr_cfg {
* @cntr_cfg: array of assignable counters' configuration (indexed
* by counter ID)
*/
-struct rdt_mon_domain {
+struct rdt_l3_mon_domain {
struct rdt_domain_hdr hdr;
unsigned int ci_id;
unsigned long *rmid_busy_llc;
@@ -364,10 +364,10 @@ struct resctrl_cpu_defaults {
};
struct resctrl_mon_config_info {
- struct rdt_resource *r;
- struct rdt_mon_domain *d;
- u32 evtid;
- u32 mon_config;
+ struct rdt_resource *r;
+ struct rdt_l3_mon_domain *d;
+ u32 evtid;
+ u32 mon_config;
};
/**
@@ -582,7 +582,7 @@ struct rdt_domain_hdr *resctrl_find_domain(struct list_head *h, int id,
*
* This can be called from any CPU.
*/
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
u32 closid, u32 rmid,
enum resctrl_event_id eventid);
@@ -595,7 +595,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
*
* This can be called from any CPU.
*/
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d);
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_l3_mon_domain *d);
/**
* resctrl_arch_reset_all_ctrls() - Reset the control for each CLOSID to its
@@ -621,7 +621,7 @@ void resctrl_arch_reset_all_ctrls(struct rdt_resource *r);
*
* This can be called from any CPU.
*/
-void resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
+void resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
enum resctrl_event_id evtid, u32 rmid, u32 closid,
u32 cntr_id, bool assign);
@@ -659,7 +659,7 @@ int resctrl_arch_cntr_read(struct rdt_resource *r, struct rdt_domain_hdr *hdr,
*
* This can be called from any CPU.
*/
-void resctrl_arch_reset_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
+void resctrl_arch_reset_cntr(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
u32 closid, u32 rmid, int cntr_id,
enum resctrl_event_id eventid);
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 9f4c2f0aaf5c..6eca3d522fcc 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -60,17 +60,17 @@ struct rdt_hw_ctrl_domain {
};
/**
- * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
- * a resource for a monitor function
- * @d_resctrl: Properties exposed to the resctrl file system
+ * struct rdt_hw_l3_mon_domain - Arch private attributes of a set of CPUs that share
+ * a resource for a monitor function
+ * @d_resctrl: Properties exposed to the resctrl file system
* @arch_mbm_states: Per-event pointer to the MBM event's saved state.
* An MBM event's state is an array of struct arch_mbm_state
* indexed by RMID on x86.
*
* Members of this structure are accessed via helpers that provide abstraction.
*/
-struct rdt_hw_mon_domain {
- struct rdt_mon_domain d_resctrl;
+struct rdt_hw_l3_mon_domain {
+ struct rdt_l3_mon_domain d_resctrl;
struct arch_mbm_state *arch_mbm_states[QOS_NUM_L3_MBM_EVENTS];
};
@@ -79,9 +79,9 @@ static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctr
return container_of(r, struct rdt_hw_ctrl_domain, d_resctrl);
}
-static inline struct rdt_hw_mon_domain *resctrl_to_arch_mon_dom(struct rdt_mon_domain *r)
+static inline struct rdt_hw_l3_mon_domain *resctrl_to_arch_mon_dom(struct rdt_l3_mon_domain *r)
{
- return container_of(r, struct rdt_hw_mon_domain, d_resctrl);
+ return container_of(r, struct rdt_hw_l3_mon_domain, d_resctrl);
}
/**
@@ -135,7 +135,7 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r
extern struct rdt_hw_resource rdt_resources_all[];
-void arch_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d);
+void arch_mon_domain_online(struct rdt_resource *r, struct rdt_l3_mon_domain *d);
/* CPUID.(EAX=10H, ECX=ResID=1).EAX */
union cpuid_0x10_1_eax {
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 698ed84fd073..d9e291d94926 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -369,7 +369,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
int resctrl_mon_resource_init(void);
-void mbm_setup_overflow_handler(struct rdt_mon_domain *dom,
+void mbm_setup_overflow_handler(struct rdt_l3_mon_domain *dom,
unsigned long delay_ms,
int exclude_cpu);
@@ -377,14 +377,14 @@ void mbm_handle_overflow(struct work_struct *work);
bool is_mba_sc(struct rdt_resource *r);
-void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms,
+void cqm_setup_limbo_handler(struct rdt_l3_mon_domain *dom, unsigned long delay_ms,
int exclude_cpu);
void cqm_handle_limbo(struct work_struct *work);
-bool has_busy_rmid(struct rdt_mon_domain *d);
+bool has_busy_rmid(struct rdt_l3_mon_domain *d);
-void __check_limbo(struct rdt_mon_domain *d, bool force_free);
+void __check_limbo(struct rdt_l3_mon_domain *d, bool force_free);
void resctrl_file_fflags_init(const char *config, unsigned long fflags);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 2d93387b9251..42f4f702eeec 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -363,7 +363,7 @@ static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
kfree(hw_dom);
}
-static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
+static void mon_domain_free(struct rdt_hw_l3_mon_domain *hw_dom)
{
int idx;
@@ -400,7 +400,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *
* @num_rmid: The size of the MBM counter array
* @hw_dom: The domain that owns the allocated arrays
*/
-static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
+static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_l3_mon_domain *hw_dom)
{
size_t tsize = sizeof(*hw_dom->arch_mbm_states[0]);
enum resctrl_event_id eventid;
@@ -498,8 +498,8 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct list_head *add_pos)
{
- struct rdt_hw_mon_domain *hw_dom;
- struct rdt_mon_domain *d;
+ struct rdt_hw_l3_mon_domain *hw_dom;
+ struct rdt_l3_mon_domain *d;
struct cacheinfo *ci;
int err;
@@ -625,9 +625,9 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
{
int id = get_domain_id_from_scope(cpu, r->mon_scope);
- struct rdt_hw_mon_domain *hw_dom;
+ struct rdt_hw_l3_mon_domain *hw_dom;
+ struct rdt_l3_mon_domain *d;
struct rdt_domain_hdr *hdr;
- struct rdt_mon_domain *d;
lockdep_assert_held(&domain_list_lock);
@@ -653,7 +653,7 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
return;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
+ d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
hw_dom = resctrl_to_arch_mon_dom(d);
resctrl_offline_mon_domain(r, hdr);
list_del_rcu(&hdr->list);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index cee1cd7fbdce..b448e6816fe7 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -109,7 +109,7 @@ static inline u64 get_corrected_mbm_count(u32 rmid, unsigned long val)
*
* In RMID sharing mode there are fewer "logical RMID" values available
* to accumulate data ("physical RMIDs" are divided evenly between SNC
- * nodes that share an L3 cache). Linux creates an rdt_mon_domain for
+ * nodes that share an L3 cache). Linux creates an rdt_l3_mon_domain for
* each SNC node.
*
* The value loaded into IA32_PQR_ASSOC is the "logical RMID".
@@ -157,7 +157,7 @@ static int __rmid_read_phys(u32 prmid, enum resctrl_event_id eventid, u64 *val)
return 0;
}
-static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_dom,
+static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_l3_mon_domain *hw_dom,
u32 rmid,
enum resctrl_event_id eventid)
{
@@ -171,11 +171,11 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_do
return state ? &state[rmid] : NULL;
}
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
u32 unused, u32 rmid,
enum resctrl_event_id eventid)
{
- struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
+ struct rdt_hw_l3_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
int cpu = cpumask_any(&d->hdr.cpu_mask);
struct arch_mbm_state *am;
u32 prmid;
@@ -194,9 +194,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
* Assumes that hardware counters are also reset and thus that there is
* no need to record initial non-zero counts.
*/
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_l3_mon_domain *d)
{
- struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
+ struct rdt_hw_l3_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
enum resctrl_event_id eventid;
int idx;
@@ -217,10 +217,10 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
return chunks >> shift;
}
-static u64 get_corrected_val(struct rdt_resource *r, struct rdt_mon_domain *d,
+static u64 get_corrected_val(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
u32 rmid, enum resctrl_event_id eventid, u64 msr_val)
{
- struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
+ struct rdt_hw_l3_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
struct arch_mbm_state *am;
u64 chunks;
@@ -242,7 +242,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain_hdr *hdr,
u32 unused, u32 rmid, enum resctrl_event_id eventid,
u64 *val, void *ignored)
{
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
u64 msr_val;
u32 prmid;
int cpu;
@@ -253,7 +253,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain_hdr *hdr,
if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
return -EINVAL;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
+ d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
cpu = cpumask_any(&hdr->cpu_mask);
prmid = logical_rmid_to_physical_rmid(cpu, rmid);
ret = __rmid_read_phys(prmid, eventid, &msr_val);
@@ -302,11 +302,11 @@ static int __cntr_id_read(u32 cntr_id, u64 *val)
return 0;
}
-void resctrl_arch_reset_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
+void resctrl_arch_reset_cntr(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
u32 unused, u32 rmid, int cntr_id,
enum resctrl_event_id eventid)
{
- struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
+ struct rdt_hw_l3_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
struct arch_mbm_state *am;
am = get_arch_mbm_state(hw_dom, rmid, eventid);
@@ -322,14 +322,14 @@ int resctrl_arch_cntr_read(struct rdt_resource *r, struct rdt_domain_hdr *hdr,
u32 unused, u32 rmid, int cntr_id,
enum resctrl_event_id eventid, u64 *val)
{
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
u64 msr_val;
int ret;
if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
return -EINVAL;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
+ d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
ret = __cntr_id_read(cntr_id, &msr_val);
if (ret)
return ret;
@@ -353,7 +353,7 @@ int resctrl_arch_cntr_read(struct rdt_resource *r, struct rdt_domain_hdr *hdr,
* must adjust RMID counter numbers based on SNC node. See
* logical_rmid_to_physical_rmid() for code that does this.
*/
-void arch_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d)
+void arch_mon_domain_online(struct rdt_resource *r, struct rdt_l3_mon_domain *d)
{
if (snc_nodes_per_l3_cache > 1)
msr_clear_bit(MSR_RMID_SNC_CONFIG, 0);
@@ -505,7 +505,7 @@ static void resctrl_abmc_set_one_amd(void *arg)
*/
static void _resctrl_abmc_enable(struct rdt_resource *r, bool enable)
{
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
lockdep_assert_cpus_held();
@@ -544,11 +544,11 @@ static void resctrl_abmc_config_one_amd(void *info)
/*
* Send an IPI to the domain to assign the counter to RMID, event pair.
*/
-void resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
+void resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
enum resctrl_event_id evtid, u32 rmid, u32 closid,
u32 cntr_id, bool assign)
{
- struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
+ struct rdt_hw_l3_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
union l3_qos_abmc_cfg abmc_cfg = { 0 };
struct arch_mbm_state *am;
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index 7b9fc5d3bdc8..c95f8eb8e731 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -622,7 +622,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
r = resctrl_arch_get_resource(resid);
if (md->sum) {
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
/*
* This file requires summing across all domains that share
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 32116361a5f6..88b990e939ea 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -130,7 +130,7 @@ static void limbo_release_entry(struct rmid_entry *entry)
* decrement the count. If the busy count gets to zero on an RMID, we
* free the RMID
*/
-void __check_limbo(struct rdt_mon_domain *d, bool force_free)
+void __check_limbo(struct rdt_l3_mon_domain *d, bool force_free)
{
struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
@@ -188,7 +188,7 @@ void __check_limbo(struct rdt_mon_domain *d, bool force_free)
resctrl_arch_mon_ctx_free(r, QOS_L3_OCCUP_EVENT_ID, arch_mon_ctx);
}
-bool has_busy_rmid(struct rdt_mon_domain *d)
+bool has_busy_rmid(struct rdt_l3_mon_domain *d)
{
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
@@ -289,7 +289,7 @@ int alloc_rmid(u32 closid)
static void add_rmid_to_limbo(struct rmid_entry *entry)
{
struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
u32 idx;
lockdep_assert_held(&rdtgroup_mutex);
@@ -342,7 +342,7 @@ void free_rmid(u32 closid, u32 rmid)
list_add_tail(&entry->list, &rmid_free_lru);
}
-static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 closid,
+static struct mbm_state *get_mbm_state(struct rdt_l3_mon_domain *d, u32 closid,
u32 rmid, enum resctrl_event_id evtid)
{
u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
@@ -362,7 +362,7 @@ static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 closid,
* Return:
* Valid counter ID on success, or -ENOENT on failure.
*/
-static int mbm_cntr_get(struct rdt_resource *r, struct rdt_mon_domain *d,
+static int mbm_cntr_get(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
struct rdtgroup *rdtgrp, enum resctrl_event_id evtid)
{
int cntr_id;
@@ -389,7 +389,7 @@ static int mbm_cntr_get(struct rdt_resource *r, struct rdt_mon_domain *d,
* Return:
* Valid counter ID on success, or -ENOSPC on failure.
*/
-static int mbm_cntr_alloc(struct rdt_resource *r, struct rdt_mon_domain *d,
+static int mbm_cntr_alloc(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
struct rdtgroup *rdtgrp, enum resctrl_event_id evtid)
{
int cntr_id;
@@ -408,7 +408,7 @@ static int mbm_cntr_alloc(struct rdt_resource *r, struct rdt_mon_domain *d,
/*
* mbm_cntr_free() - Clear the counter ID configuration details in the domain @d.
*/
-static void mbm_cntr_free(struct rdt_mon_domain *d, int cntr_id)
+static void mbm_cntr_free(struct rdt_l3_mon_domain *d, int cntr_id)
{
memset(&d->cntr_cfg[cntr_id], 0, sizeof(*d->cntr_cfg));
}
@@ -418,7 +418,7 @@ static int __mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
int cpu = smp_processor_id();
u32 closid = rdtgrp->closid;
u32 rmid = rdtgrp->mon.rmid;
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
int cntr_id = -ENOENT;
struct mbm_state *m;
int err, ret;
@@ -426,7 +426,7 @@ static int __mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
if (!domain_header_is_valid(rr->hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
return -EINVAL;
- d = container_of(rr->hdr, struct rdt_mon_domain, hdr);
+ d = container_of(rr->hdr, struct rdt_l3_mon_domain, hdr);
if (rr->is_mbm_cntr) {
cntr_id = mbm_cntr_get(rr->r, d, rdtgrp, rr->evtid);
@@ -515,12 +515,12 @@ static void mbm_bw_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
u64 cur_bw, bytes, cur_bytes;
u32 closid = rdtgrp->closid;
u32 rmid = rdtgrp->mon.rmid;
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
struct mbm_state *m;
if (!domain_header_is_valid(rr->hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
return;
- d = container_of(rr->hdr, struct rdt_mon_domain, hdr);
+ d = container_of(rr->hdr, struct rdt_l3_mon_domain, hdr);
m = get_mbm_state(d, closid, rmid, rr->evtid);
if (WARN_ON_ONCE(!m))
return;
@@ -620,7 +620,7 @@ static struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu,
* throttle MSRs already have low percentage values. To avoid
* unnecessarily restricting such rdtgroups, we also increase the bandwidth.
*/
-static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
+static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_l3_mon_domain *dom_mbm)
{
u32 closid, rmid, cur_msr_val, new_msr_val;
struct mbm_state *pmbm_data, *cmbm_data;
@@ -688,7 +688,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val);
}
-static void mbm_update_one_event(struct rdt_resource *r, struct rdt_mon_domain *d,
+static void mbm_update_one_event(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
struct rdtgroup *rdtgrp, enum resctrl_event_id evtid)
{
struct rmid_read rr = {0};
@@ -720,7 +720,7 @@ static void mbm_update_one_event(struct rdt_resource *r, struct rdt_mon_domain *
resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx);
}
-static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d,
+static void mbm_update(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
struct rdtgroup *rdtgrp)
{
/*
@@ -741,12 +741,12 @@ static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d,
void cqm_handle_limbo(struct work_struct *work)
{
unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
cpus_read_lock();
mutex_lock(&rdtgroup_mutex);
- d = container_of(work, struct rdt_mon_domain, cqm_limbo.work);
+ d = container_of(work, struct rdt_l3_mon_domain, cqm_limbo.work);
__check_limbo(d, false);
@@ -769,7 +769,7 @@ void cqm_handle_limbo(struct work_struct *work)
* @exclude_cpu: Which CPU the handler should not run on,
* RESCTRL_PICK_ANY_CPU to pick any CPU.
*/
-void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms,
+void cqm_setup_limbo_handler(struct rdt_l3_mon_domain *dom, unsigned long delay_ms,
int exclude_cpu)
{
unsigned long delay = msecs_to_jiffies(delay_ms);
@@ -786,7 +786,7 @@ void mbm_handle_overflow(struct work_struct *work)
{
unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL);
struct rdtgroup *prgrp, *crgrp;
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
struct list_head *head;
struct rdt_resource *r;
@@ -801,7 +801,7 @@ void mbm_handle_overflow(struct work_struct *work)
goto out_unlock;
r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
- d = container_of(work, struct rdt_mon_domain, mbm_over.work);
+ d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work);
list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
mbm_update(r, d, prgrp);
@@ -835,7 +835,7 @@ void mbm_handle_overflow(struct work_struct *work)
* @exclude_cpu: Which CPU the handler should not run on,
* RESCTRL_PICK_ANY_CPU to pick any CPU.
*/
-void mbm_setup_overflow_handler(struct rdt_mon_domain *dom, unsigned long delay_ms,
+void mbm_setup_overflow_handler(struct rdt_l3_mon_domain *dom, unsigned long delay_ms,
int exclude_cpu)
{
unsigned long delay = msecs_to_jiffies(delay_ms);
@@ -1090,7 +1090,7 @@ ssize_t resctrl_mbm_assign_on_mkdir_write(struct kernfs_open_file *of, char *buf
* mbm_cntr_free_all() - Clear all the counter ID configuration details in the
* domain @d. Called when mbm_assign_mode is changed.
*/
-static void mbm_cntr_free_all(struct rdt_resource *r, struct rdt_mon_domain *d)
+static void mbm_cntr_free_all(struct rdt_resource *r, struct rdt_l3_mon_domain *d)
{
memset(d->cntr_cfg, 0, sizeof(*d->cntr_cfg) * r->mon.num_mbm_cntrs);
}
@@ -1099,7 +1099,7 @@ static void mbm_cntr_free_all(struct rdt_resource *r, struct rdt_mon_domain *d)
* resctrl_reset_rmid_all() - Reset all non-architecture states for all the
* supported RMIDs.
*/
-static void resctrl_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
+static void resctrl_reset_rmid_all(struct rdt_resource *r, struct rdt_l3_mon_domain *d)
{
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
enum resctrl_event_id evt;
@@ -1120,7 +1120,7 @@ static void resctrl_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain
* Assign the counter if @assign is true else unassign the counter. Reset the
* associated non-architectural state.
*/
-static void rdtgroup_assign_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
+static void rdtgroup_assign_cntr(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
enum resctrl_event_id evtid, u32 rmid, u32 closid,
u32 cntr_id, bool assign)
{
@@ -1140,7 +1140,7 @@ static void rdtgroup_assign_cntr(struct rdt_resource *r, struct rdt_mon_domain *
* Return:
* 0 on success, < 0 on failure.
*/
-static int rdtgroup_alloc_assign_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
+static int rdtgroup_alloc_assign_cntr(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
struct rdtgroup *rdtgrp, struct mon_evt *mevt)
{
int cntr_id;
@@ -1175,7 +1175,7 @@ static int rdtgroup_alloc_assign_cntr(struct rdt_resource *r, struct rdt_mon_dom
* Return:
* 0 on success, < 0 on failure.
*/
-static int rdtgroup_assign_cntr_event(struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
+static int rdtgroup_assign_cntr_event(struct rdt_l3_mon_domain *d, struct rdtgroup *rdtgrp,
struct mon_evt *mevt)
{
struct rdt_resource *r = resctrl_arch_get_resource(mevt->rid);
@@ -1225,7 +1225,7 @@ void rdtgroup_assign_cntrs(struct rdtgroup *rdtgrp)
* rdtgroup_free_unassign_cntr() - Unassign and reset the counter ID configuration
* for the event pointed to by @mevt within the domain @d and resctrl group @rdtgrp.
*/
-static void rdtgroup_free_unassign_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
+static void rdtgroup_free_unassign_cntr(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
struct rdtgroup *rdtgrp, struct mon_evt *mevt)
{
int cntr_id;
@@ -1246,7 +1246,7 @@ static void rdtgroup_free_unassign_cntr(struct rdt_resource *r, struct rdt_mon_d
* the event structure @mevt from the domain @d and the group @rdtgrp. Unassign
* the counters from all the domains if @d is NULL else unassign from @d.
*/
-static void rdtgroup_unassign_cntr_event(struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
+static void rdtgroup_unassign_cntr_event(struct rdt_l3_mon_domain *d, struct rdtgroup *rdtgrp,
struct mon_evt *mevt)
{
struct rdt_resource *r = resctrl_arch_get_resource(mevt->rid);
@@ -1321,7 +1321,7 @@ static int resctrl_parse_mem_transactions(char *tok, u32 *val)
static void rdtgroup_update_cntr_event(struct rdt_resource *r, struct rdtgroup *rdtgrp,
enum resctrl_event_id evtid)
{
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
int cntr_id;
list_for_each_entry(d, &r->mon_domains, hdr.list) {
@@ -1427,7 +1427,7 @@ ssize_t resctrl_mbm_assign_mode_write(struct kernfs_open_file *of, char *buf,
size_t nbytes, loff_t off)
{
struct rdt_resource *r = rdt_kn_parent_priv(of->kn);
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
int ret = 0;
bool enable;
@@ -1500,7 +1500,7 @@ int resctrl_num_mbm_cntrs_show(struct kernfs_open_file *of,
struct seq_file *s, void *v)
{
struct rdt_resource *r = rdt_kn_parent_priv(of->kn);
- struct rdt_mon_domain *dom;
+ struct rdt_l3_mon_domain *dom;
bool sep = false;
cpus_read_lock();
@@ -1524,7 +1524,7 @@ int resctrl_available_mbm_cntrs_show(struct kernfs_open_file *of,
struct seq_file *s, void *v)
{
struct rdt_resource *r = rdt_kn_parent_priv(of->kn);
- struct rdt_mon_domain *dom;
+ struct rdt_l3_mon_domain *dom;
bool sep = false;
u32 cntrs, i;
int ret = 0;
@@ -1565,7 +1565,7 @@ int resctrl_available_mbm_cntrs_show(struct kernfs_open_file *of,
int mbm_L3_assignments_show(struct kernfs_open_file *of, struct seq_file *s, void *v)
{
struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
struct rdtgroup *rdtgrp;
struct mon_evt *mevt;
int ret = 0;
@@ -1628,7 +1628,7 @@ static struct mon_evt *mbm_get_mon_event_by_name(struct rdt_resource *r, char *n
return NULL;
}
-static int rdtgroup_modify_assign_state(char *assign, struct rdt_mon_domain *d,
+static int rdtgroup_modify_assign_state(char *assign, struct rdt_l3_mon_domain *d,
struct rdtgroup *rdtgrp, struct mon_evt *mevt)
{
int ret = 0;
@@ -1654,7 +1654,7 @@ static int rdtgroup_modify_assign_state(char *assign, struct rdt_mon_domain *d,
static int resctrl_parse_mbm_assignment(struct rdt_resource *r, struct rdtgroup *rdtgrp,
char *event, char *tok)
{
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
unsigned long dom_id = 0;
char *dom_str, *id_str;
struct mon_evt *mevt;
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index e3b83e48f2d9..1b4f4bd63143 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -1618,7 +1618,7 @@ static void mondata_config_read(struct resctrl_mon_config_info *mon_info)
static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
{
struct resctrl_mon_config_info mon_info;
- struct rdt_mon_domain *dom;
+ struct rdt_l3_mon_domain *dom;
bool sep = false;
cpus_read_lock();
@@ -1666,7 +1666,7 @@ static int mbm_local_bytes_config_show(struct kernfs_open_file *of,
}
static void mbm_config_write_domain(struct rdt_resource *r,
- struct rdt_mon_domain *d, u32 evtid, u32 val)
+ struct rdt_l3_mon_domain *d, u32 evtid, u32 val)
{
struct resctrl_mon_config_info mon_info = {0};
@@ -1707,8 +1707,8 @@ static void mbm_config_write_domain(struct rdt_resource *r,
static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
{
char *dom_str = NULL, *id_str;
+ struct rdt_l3_mon_domain *d;
unsigned long dom_id, val;
- struct rdt_mon_domain *d;
/* Walking r->domains, ensure it can't race with cpuhp */
lockdep_assert_cpus_held();
@@ -2716,7 +2716,7 @@ static int rdt_get_tree(struct fs_context *fc)
{
struct rdt_fs_context *ctx = rdt_fc2context(fc);
unsigned long flags = RFTYPE_CTRL_BASE;
- struct rdt_mon_domain *dom;
+ struct rdt_l3_mon_domain *dom;
struct rdt_resource *r;
int ret;
@@ -3167,7 +3167,7 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
struct rdt_domain_hdr *hdr)
{
struct rdtgroup *prgrp, *crgrp;
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
char subname[32];
bool snc_mode;
char name[32];
@@ -3175,7 +3175,7 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
return;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
+ d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
snc_mode = r->mon_scope == RESCTRL_L3_NODE;
sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci_id : d->hdr.id);
if (snc_mode)
@@ -3221,7 +3221,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
struct rdt_resource *r, struct rdtgroup *prgrp)
{
struct kernfs_node *kn, *ckn;
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
char name[32];
bool snc_mode;
int ret = 0;
@@ -3231,7 +3231,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
return -EINVAL;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
+ d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
snc_mode = r->mon_scope == RESCTRL_L3_NODE;
sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci_id : d->hdr.id);
kn = kernfs_find_and_get(parent_kn, name);
@@ -4174,7 +4174,7 @@ static void rdtgroup_setup_default(void)
mutex_unlock(&rdtgroup_mutex);
}
-static void domain_destroy_mon_state(struct rdt_mon_domain *d)
+static void domain_destroy_mon_state(struct rdt_l3_mon_domain *d)
{
int idx;
@@ -4198,7 +4198,7 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain
void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
{
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
mutex_lock(&rdtgroup_mutex);
@@ -4212,7 +4212,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
goto out_unlock;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
+ d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
if (resctrl_is_mbm_enabled())
cancel_delayed_work(&d->mbm_over);
if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) && has_busy_rmid(d)) {
@@ -4246,7 +4246,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
*
* Returns 0 for success, or -ENOMEM.
*/
-static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
+static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_l3_mon_domain *d)
{
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
size_t tsize = sizeof(*d->mbm_states[0]);
@@ -4304,7 +4304,7 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d
int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
{
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
int err = -EINVAL;
mutex_lock(&rdtgroup_mutex);
@@ -4312,7 +4312,7 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr
if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
goto out_unlock;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
+ d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
err = domain_setup_mon_state(r, d);
if (err)
goto out_unlock;
@@ -4360,10 +4360,10 @@ static void clear_childcpus(struct rdtgroup *r, unsigned int cpu)
}
}
-static struct rdt_mon_domain *get_mon_domain_from_cpu(int cpu,
- struct rdt_resource *r)
+static struct rdt_l3_mon_domain *get_mon_domain_from_cpu(int cpu,
+ struct rdt_resource *r)
{
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
lockdep_assert_cpus_held();
@@ -4379,7 +4379,7 @@ static struct rdt_mon_domain *get_mon_domain_from_cpu(int cpu,
void resctrl_offline_cpu(unsigned int cpu)
{
struct rdt_resource *l3 = resctrl_arch_get_resource(RDT_RESOURCE_L3);
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
struct rdtgroup *rdtgrp;
mutex_lock(&rdtgroup_mutex);
--
2.51.0
* Re: [PATCH v11 07/31] x86,fs/resctrl: Rename struct rdt_mon_domain and rdt_hw_mon_domain
2025-09-25 20:03 ` [PATCH v11 07/31] x86,fs/resctrl: Rename struct rdt_mon_domain and rdt_hw_mon_domain Tony Luck
@ 2025-10-03 23:24 ` Reinette Chatre
0 siblings, 0 replies; 84+ messages in thread
From: Reinette Chatre @ 2025-10-03 23:24 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 9/25/25 1:03 PM, Tony Luck wrote:
> The upcoming telemetry event monitoring is not tied to the L3
> resource and will have new domain structures.
>
> Rename the L3 resource specific domain data structures to include
> "l3_" in their names to avoid confusion between the different
> resource specific domain structures:
> rdt_mon_domain -> rdt_l3_mon_domain
> rdt_hw_mon_domain -> rdt_hw_l3_mon_domain
>
> No functional change.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reinette
* [PATCH v11 08/31] x86,fs/resctrl: Rename some L3 specific functions
2025-09-25 20:02 [PATCH v11 00/31] x86,fs/resctrl telemetry monitoring Tony Luck
` (6 preceding siblings ...)
2025-09-25 20:03 ` [PATCH v11 07/31] x86,fs/resctrl: Rename struct rdt_mon_domain and rdt_hw_mon_domain Tony Luck
@ 2025-09-25 20:03 ` Tony Luck
2025-10-03 23:24 ` Reinette Chatre
2025-09-25 20:03 ` [PATCH v11 09/31] fs/resctrl: Make event details accessible to functions when reading events Tony Luck
` (22 subsequent siblings)
30 siblings, 1 reply; 84+ messages in thread
From: Tony Luck @ 2025-09-25 20:03 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
With the arrival of monitor events tied to new domains associated with
a different resource, it would be clearer if the L3 resource-specific
functions were more accurately named.
Rename three groups of functions:
Functions that allocate/free architecture per-RMID MBM state information:
arch_domain_mbm_alloc() -> l3_mon_domain_mbm_alloc()
mon_domain_free() -> l3_mon_domain_free()
Functions that allocate/free filesystem per-RMID MBM state information:
domain_setup_mon_state() -> domain_setup_l3_mon_state()
domain_destroy_mon_state() -> domain_destroy_l3_mon_state()
Initialization/exit:
rdt_get_mon_l3_config() -> rdt_get_l3_mon_config()
resctrl_mon_resource_init() -> resctrl_l3_mon_resource_init()
resctrl_mon_resource_exit() -> resctrl_l3_mon_resource_exit()
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/internal.h | 2 +-
fs/resctrl/internal.h | 6 +++---
arch/x86/kernel/cpu/resctrl/core.c | 18 +++++++++---------
arch/x86/kernel/cpu/resctrl/monitor.c | 2 +-
fs/resctrl/monitor.c | 6 +++---
fs/resctrl/rdtgroup.c | 22 +++++++++++-----------
6 files changed, 28 insertions(+), 28 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 6eca3d522fcc..14fadcff0d2b 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -208,7 +208,7 @@ union l3_qos_abmc_cfg {
void rdt_ctrl_update(void *arg);
-int rdt_get_mon_l3_config(struct rdt_resource *r);
+int rdt_get_l3_mon_config(struct rdt_resource *r);
bool rdt_cpu_has(int flag);
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index d9e291d94926..88b4489b68e1 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -357,7 +357,9 @@ int alloc_rmid(u32 closid);
void free_rmid(u32 closid, u32 rmid);
-void resctrl_mon_resource_exit(void);
+int resctrl_l3_mon_resource_init(void);
+
+void resctrl_l3_mon_resource_exit(void);
void mon_event_count(void *info);
@@ -367,8 +369,6 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
struct rdt_domain_hdr *hdr, struct rdtgroup *rdtgrp,
cpumask_t *cpumask, int evtid, int first);
-int resctrl_mon_resource_init(void);
-
void mbm_setup_overflow_handler(struct rdt_l3_mon_domain *dom,
unsigned long delay_ms,
int exclude_cpu);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 42f4f702eeec..4762790c6e62 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -363,7 +363,7 @@ static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
kfree(hw_dom);
}
-static void mon_domain_free(struct rdt_hw_l3_mon_domain *hw_dom)
+static void l3_mon_domain_free(struct rdt_hw_l3_mon_domain *hw_dom)
{
int idx;
@@ -396,11 +396,11 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *
}
/**
- * arch_domain_mbm_alloc() - Allocate arch private storage for the MBM counters
+ * l3_mon_domain_mbm_alloc() - Allocate arch private storage for the MBM counters
* @num_rmid: The size of the MBM counter array
* @hw_dom: The domain that owns the allocated arrays
*/
-static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_l3_mon_domain *hw_dom)
+static int l3_mon_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_l3_mon_domain *hw_dom)
{
size_t tsize = sizeof(*hw_dom->arch_mbm_states[0]);
enum resctrl_event_id eventid;
@@ -514,7 +514,7 @@ static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct
ci = get_cpu_cacheinfo_level(cpu, RESCTRL_L3_CACHE);
if (!ci) {
pr_warn_once("Can't find L3 cache for CPU:%d resource %s\n", cpu, r->name);
- mon_domain_free(hw_dom);
+ l3_mon_domain_free(hw_dom);
return;
}
d->ci_id = ci->id;
@@ -522,8 +522,8 @@ static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct
arch_mon_domain_online(r, d);
- if (arch_domain_mbm_alloc(r->mon.num_rmid, hw_dom)) {
- mon_domain_free(hw_dom);
+ if (l3_mon_domain_mbm_alloc(r->mon.num_rmid, hw_dom)) {
+ l3_mon_domain_free(hw_dom);
return;
}
@@ -533,7 +533,7 @@ static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct
if (err) {
list_del_rcu(&d->hdr.list);
synchronize_rcu();
- mon_domain_free(hw_dom);
+ l3_mon_domain_free(hw_dom);
}
}
@@ -658,7 +658,7 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
resctrl_offline_mon_domain(r, hdr);
list_del_rcu(&hdr->list);
synchronize_rcu();
- mon_domain_free(hw_dom);
+ l3_mon_domain_free(hw_dom);
break;
default:
pr_warn_once("Unknown resource rid=%d\n", r->rid);
@@ -906,7 +906,7 @@ static __init bool get_rdt_mon_resources(void)
if (!ret)
return false;
- return !rdt_get_mon_l3_config(r);
+ return !rdt_get_l3_mon_config(r);
}
static __init void __check_quirks_intel(void)
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index b448e6816fe7..ea81305fbc5d 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -422,7 +422,7 @@ static __init int snc_get_config(void)
return ret;
}
-int __init rdt_get_mon_l3_config(struct rdt_resource *r)
+int __init rdt_get_l3_mon_config(struct rdt_resource *r)
{
unsigned int mbm_offset = boot_cpu_data.x86_cache_mbm_width_offset;
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 88b990e939ea..54ae3494adfe 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -1750,7 +1750,7 @@ ssize_t mbm_L3_assignments_write(struct kernfs_open_file *of, char *buf,
}
/**
- * resctrl_mon_resource_init() - Initialise global monitoring structures.
+ * resctrl_l3_mon_resource_init() - Initialise global monitoring structures.
*
* Allocate and initialise global monitor resources that do not belong to a
* specific domain. i.e. the rmid_ptrs[] used for the limbo and free lists.
@@ -1761,7 +1761,7 @@ ssize_t mbm_L3_assignments_write(struct kernfs_open_file *of, char *buf,
*
* Returns 0 for success, or -ENOMEM.
*/
-int resctrl_mon_resource_init(void)
+int resctrl_l3_mon_resource_init(void)
{
struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
int ret;
@@ -1813,7 +1813,7 @@ int resctrl_mon_resource_init(void)
return 0;
}
-void resctrl_mon_resource_exit(void)
+void resctrl_l3_mon_resource_exit(void)
{
struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 1b4f4bd63143..88b80944cf85 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -4174,7 +4174,7 @@ static void rdtgroup_setup_default(void)
mutex_unlock(&rdtgroup_mutex);
}
-static void domain_destroy_mon_state(struct rdt_l3_mon_domain *d)
+static void domain_destroy_l3_mon_state(struct rdt_l3_mon_domain *d)
{
int idx;
@@ -4228,13 +4228,13 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
cancel_delayed_work(&d->cqm_limbo);
}
- domain_destroy_mon_state(d);
+ domain_destroy_l3_mon_state(d);
out_unlock:
mutex_unlock(&rdtgroup_mutex);
}
/**
- * domain_setup_mon_state() - Initialise domain monitoring structures.
+ * domain_setup_l3_mon_state() - Initialise domain monitoring structures.
* @r: The resource for the newly online domain.
* @d: The newly online domain.
*
@@ -4242,11 +4242,11 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
* Called when the first CPU of a domain comes online, regardless of whether
* the filesystem is mounted.
* During boot this may be called before global allocations have been made by
- * resctrl_mon_resource_init().
+ * resctrl_l3_mon_resource_init().
*
* Returns 0 for success, or -ENOMEM.
*/
-static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_l3_mon_domain *d)
+static int domain_setup_l3_mon_state(struct rdt_resource *r, struct rdt_l3_mon_domain *d)
{
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
size_t tsize = sizeof(*d->mbm_states[0]);
@@ -4313,7 +4313,7 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr
goto out_unlock;
d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
- err = domain_setup_mon_state(r, d);
+ err = domain_setup_l3_mon_state(r, d);
if (err)
goto out_unlock;
@@ -4429,13 +4429,13 @@ int resctrl_init(void)
thread_throttle_mode_init();
- ret = resctrl_mon_resource_init();
+ ret = resctrl_l3_mon_resource_init();
if (ret)
return ret;
ret = sysfs_create_mount_point(fs_kobj, "resctrl");
if (ret) {
- resctrl_mon_resource_exit();
+ resctrl_l3_mon_resource_exit();
return ret;
}
@@ -4470,7 +4470,7 @@ int resctrl_init(void)
cleanup_mountpoint:
sysfs_remove_mount_point(fs_kobj, "resctrl");
- resctrl_mon_resource_exit();
+ resctrl_l3_mon_resource_exit();
return ret;
}
@@ -4506,7 +4506,7 @@ static bool resctrl_online_domains_exist(void)
* When called by the architecture code, all CPUs and resctrl domains must be
* offline. This ensures the limbo and overflow handlers are not scheduled to
* run, meaning the data structures they access can be freed by
- * resctrl_mon_resource_exit().
+ * resctrl_l3_mon_resource_exit().
*
* After resctrl_exit() returns, the architecture code should return an
* error from all resctrl_arch_ functions that can do this.
@@ -4533,5 +4533,5 @@ void resctrl_exit(void)
* it can be used to umount resctrl.
*/
- resctrl_mon_resource_exit();
+ resctrl_l3_mon_resource_exit();
}
--
2.51.0
* Re: [PATCH v11 08/31] x86,fs/resctrl: Rename some L3 specific functions
2025-09-25 20:03 ` [PATCH v11 08/31] x86,fs/resctrl: Rename some L3 specific functions Tony Luck
@ 2025-10-03 23:24 ` Reinette Chatre
0 siblings, 0 replies; 84+ messages in thread
From: Reinette Chatre @ 2025-10-03 23:24 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 9/25/25 1:03 PM, Tony Luck wrote:
> With the arrival of monitor events tied to new domains associated with
> a different resource, it would be clearer if the L3 resource-specific
> functions were more accurately named.
>
> Rename three groups of functions:
>
> Functions that allocate/free architecture per-RMID MBM state information:
> arch_domain_mbm_alloc() -> l3_mon_domain_mbm_alloc()
> mon_domain_free() -> l3_mon_domain_free()
>
> Functions that allocate/free filesystem per-RMID MBM state information:
> domain_setup_mon_state() -> domain_setup_l3_mon_state()
> domain_destroy_mon_state() -> domain_destroy_l3_mon_state()
>
> Initialization/exit:
> rdt_get_mon_l3_config() -> rdt_get_l3_mon_config()
> resctrl_mon_resource_init() -> resctrl_l3_mon_resource_init()
> resctrl_mon_resource_exit() -> resctrl_l3_mon_resource_exit()
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reinette
* [PATCH v11 09/31] fs/resctrl: Make event details accessible to functions when reading events
2025-09-25 20:02 [PATCH v11 00/31] x86,fs/resctrl telemetry monitoring Tony Luck
` (7 preceding siblings ...)
2025-09-25 20:03 ` [PATCH v11 08/31] x86,fs/resctrl: Rename some L3 specific functions Tony Luck
@ 2025-09-25 20:03 ` Tony Luck
2025-10-03 23:27 ` Reinette Chatre
2025-09-25 20:03 ` [PATCH v11 10/31] x86,fs/resctrl: Handle events that can be read from any CPU Tony Luck
` (21 subsequent siblings)
30 siblings, 1 reply; 84+ messages in thread
From: Tony Luck @ 2025-09-25 20:03 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Reading monitoring data from MMIO requires more context to locate the
correct memory address. struct mon_evt is the appropriate place for
this event-specific context.
Prepare for addition of extra fields to mon_evt by changing the calling
conventions to pass a pointer to the mon_evt structure instead of just
the event id.
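Not part of the patch: a minimal userspace C sketch of the convention change
described above. The struct fields and helpers here are hypothetical
stand-ins, not the kernel's definitions; they only illustrate why passing the
whole descriptor lets later patches add per-event context without touching
every prototype again.

```c
#include <assert.h>

/* Hypothetical, simplified stand-ins for the kernel types. */
enum resctrl_event_id {
	QOS_L3_OCCUP_EVENT_ID = 1,
	QOS_L3_MBM_TOTAL_EVENT_ID = 2,
};

struct mon_evt {
	enum resctrl_event_id evtid;
	/* future event-specific context lands here, e.g. an MMIO offset */
	unsigned long mmio_offset;	/* hypothetical extra field */
};

/* Old convention: only the id is visible inside the reader. */
static int read_event_by_id(enum resctrl_event_id evtid)
{
	return (int)evtid;
}

/* New convention: the full descriptor travels with the call, so any
 * extra context is reachable without a side lookup table.
 */
static int read_event(const struct mon_evt *evt)
{
	return (int)evt->evtid + (int)evt->mmio_offset;
}
```

With the old convention, adding the MMIO offset would have meant threading a
second parameter through every caller; with the pointer, only struct mon_evt
and the readers that care need to change.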
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
fs/resctrl/internal.h | 10 +++++-----
fs/resctrl/ctrlmondata.c | 18 +++++++++---------
fs/resctrl/monitor.c | 24 ++++++++++++------------
fs/resctrl/rdtgroup.c | 6 +++---
4 files changed, 29 insertions(+), 29 deletions(-)
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 88b4489b68e1..12a2ab7e3c9b 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -81,7 +81,7 @@ extern struct mon_evt mon_event_all[QOS_NUM_EVENTS];
* struct mon_data - Monitoring details for each event file.
* @list: Member of the global @mon_data_kn_priv_list list.
* @rid: Resource id associated with the event file.
- * @evtid: Event id associated with the event file.
+ * @evt: Event structure associated with the event file.
* @sum: Set when event must be summed across multiple
* domains.
* @domid: When @sum is zero this is the domain to which
@@ -95,7 +95,7 @@ extern struct mon_evt mon_event_all[QOS_NUM_EVENTS];
struct mon_data {
struct list_head list;
enum resctrl_res_level rid;
- enum resctrl_event_id evtid;
+ struct mon_evt *evt;
int domid;
bool sum;
};
@@ -108,7 +108,7 @@ struct mon_data {
* @r: Resource describing the properties of the event being read.
* @hdr: Header of domain that the counter should be read from. If NULL then
* sum all domains in @r sharing L3 @ci.id
- * @evtid: Which monitor event to read.
+ * @evt: Which monitor event to read.
* @first: Initialize MBM counter when true.
* @ci: Cacheinfo for L3. Only set when @hdr is NULL. Used when summing
* domains.
@@ -126,7 +126,7 @@ struct rmid_read {
struct rdtgroup *rgrp;
struct rdt_resource *r;
struct rdt_domain_hdr *hdr;
- enum resctrl_event_id evtid;
+ struct mon_evt *evt;
bool first;
struct cacheinfo *ci;
bool is_mbm_cntr;
@@ -367,7 +367,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg);
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
struct rdt_domain_hdr *hdr, struct rdtgroup *rdtgrp,
- cpumask_t *cpumask, int evtid, int first);
+ cpumask_t *cpumask, struct mon_evt *evt, int first);
void mbm_setup_overflow_handler(struct rdt_l3_mon_domain *dom,
unsigned long delay_ms,
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index c95f8eb8e731..77602563cb1f 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -548,7 +548,7 @@ struct rdt_domain_hdr *resctrl_find_domain(struct list_head *h, int id,
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
struct rdt_domain_hdr *hdr, struct rdtgroup *rdtgrp,
- cpumask_t *cpumask, int evtid, int first)
+ cpumask_t *cpumask, struct mon_evt *evt, int first)
{
int cpu;
@@ -559,15 +559,15 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
* Setup the parameters to pass to mon_event_count() to read the data.
*/
rr->rgrp = rdtgrp;
- rr->evtid = evtid;
+ rr->evt = evt;
rr->r = r;
rr->hdr = hdr;
rr->first = first;
if (resctrl_arch_mbm_cntr_assign_enabled(r) &&
- resctrl_is_mbm_event(evtid)) {
+ resctrl_is_mbm_event(evt->evtid)) {
rr->is_mbm_cntr = true;
} else {
- rr->arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, evtid);
+ rr->arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, evt->evtid);
if (IS_ERR(rr->arch_mon_ctx)) {
rr->err = -EINVAL;
return;
@@ -588,20 +588,20 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
if (rr->arch_mon_ctx)
- resctrl_arch_mon_ctx_free(r, evtid, rr->arch_mon_ctx);
+ resctrl_arch_mon_ctx_free(r, evt->evtid, rr->arch_mon_ctx);
}
int rdtgroup_mondata_show(struct seq_file *m, void *arg)
{
struct kernfs_open_file *of = m->private;
enum resctrl_res_level resid;
- enum resctrl_event_id evtid;
struct rdt_domain_hdr *hdr;
struct rmid_read rr = {0};
struct rdtgroup *rdtgrp;
int domid, cpu, ret = 0;
struct rdt_resource *r;
struct cacheinfo *ci;
+ struct mon_evt *evt;
struct mon_data *md;
rdtgrp = rdtgroup_kn_lock_live(of->kn);
@@ -618,7 +618,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
resid = md->rid;
domid = md->domid;
- evtid = md->evtid;
+ evt = md->evt;
r = resctrl_arch_get_resource(resid);
if (md->sum) {
@@ -638,7 +638,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
continue;
rr.ci = ci;
mon_event_read(&rr, r, NULL, rdtgrp,
- &ci->shared_cpu_map, evtid, false);
+ &ci->shared_cpu_map, evt, false);
goto checkresult;
}
}
@@ -654,7 +654,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
ret = -ENOENT;
goto out;
}
- mon_event_read(&rr, r, hdr, rdtgrp, &hdr->cpu_mask, evtid, false);
+ mon_event_read(&rr, r, hdr, rdtgrp, &hdr->cpu_mask, evt, false);
}
checkresult:
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 54ae3494adfe..ee08ffbacc2b 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -429,7 +429,7 @@ static int __mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
d = container_of(rr->hdr, struct rdt_l3_mon_domain, hdr);
if (rr->is_mbm_cntr) {
- cntr_id = mbm_cntr_get(rr->r, d, rdtgrp, rr->evtid);
+ cntr_id = mbm_cntr_get(rr->r, d, rdtgrp, rr->evt->evtid);
if (cntr_id < 0) {
rr->err = -ENOENT;
return -EINVAL;
@@ -438,10 +438,10 @@ static int __mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
if (rr->first) {
if (rr->is_mbm_cntr)
- resctrl_arch_reset_cntr(rr->r, d, closid, rmid, cntr_id, rr->evtid);
+ resctrl_arch_reset_cntr(rr->r, d, closid, rmid, cntr_id, rr->evt->evtid);
else
- resctrl_arch_reset_rmid(rr->r, d, closid, rmid, rr->evtid);
- m = get_mbm_state(d, closid, rmid, rr->evtid);
+ resctrl_arch_reset_rmid(rr->r, d, closid, rmid, rr->evt->evtid);
+ m = get_mbm_state(d, closid, rmid, rr->evt->evtid);
if (m)
memset(m, 0, sizeof(struct mbm_state));
return 0;
@@ -453,10 +453,10 @@ static int __mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
return -EINVAL;
if (rr->is_mbm_cntr)
rr->err = resctrl_arch_cntr_read(rr->r, rr->hdr, closid, rmid, cntr_id,
- rr->evtid, &tval);
+ rr->evt->evtid, &tval);
else
rr->err = resctrl_arch_rmid_read(rr->r, rr->hdr, closid, rmid,
- rr->evtid, &tval, rr->arch_mon_ctx);
+ rr->evt->evtid, &tval, rr->arch_mon_ctx);
if (rr->err)
return rr->err;
@@ -482,10 +482,10 @@ static int __mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
continue;
if (rr->is_mbm_cntr)
err = resctrl_arch_cntr_read(rr->r, &d->hdr, closid, rmid, cntr_id,
- rr->evtid, &tval);
+ rr->evt->evtid, &tval);
else
err = resctrl_arch_rmid_read(rr->r, &d->hdr, closid, rmid,
- rr->evtid, &tval, rr->arch_mon_ctx);
+ rr->evt->evtid, &tval, rr->arch_mon_ctx);
if (!err) {
rr->val += tval;
ret = 0;
@@ -521,7 +521,7 @@ static void mbm_bw_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
if (!domain_header_is_valid(rr->hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
return;
d = container_of(rr->hdr, struct rdt_l3_mon_domain, hdr);
- m = get_mbm_state(d, closid, rmid, rr->evtid);
+ m = get_mbm_state(d, closid, rmid, rr->evt->evtid);
if (WARN_ON_ONCE(!m))
return;
@@ -695,11 +695,11 @@ static void mbm_update_one_event(struct rdt_resource *r, struct rdt_l3_mon_domai
rr.r = r;
rr.hdr = &d->hdr;
- rr.evtid = evtid;
+ rr.evt = &mon_event_all[evtid];
if (resctrl_arch_mbm_cntr_assign_enabled(r)) {
rr.is_mbm_cntr = true;
} else {
- rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid);
+ rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, evtid);
if (IS_ERR(rr.arch_mon_ctx)) {
pr_warn_ratelimited("Failed to allocate monitor context: %ld",
PTR_ERR(rr.arch_mon_ctx));
@@ -717,7 +717,7 @@ static void mbm_update_one_event(struct rdt_resource *r, struct rdt_l3_mon_domai
mbm_bw_count(rdtgrp, &rr);
if (rr.arch_mon_ctx)
- resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx);
+ resctrl_arch_mon_ctx_free(rr.r, evtid, rr.arch_mon_ctx);
}
static void mbm_update(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 88b80944cf85..dc289b03c3d1 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -3038,7 +3038,7 @@ static struct mon_data *mon_get_kn_priv(enum resctrl_res_level rid, int domid,
list_for_each_entry(priv, &mon_data_kn_priv_list, list) {
if (priv->rid == rid && priv->domid == domid &&
- priv->sum == do_sum && priv->evtid == mevt->evtid)
+ priv->sum == do_sum && priv->evt == mevt)
return priv;
}
@@ -3049,7 +3049,7 @@ static struct mon_data *mon_get_kn_priv(enum resctrl_res_level rid, int domid,
priv->rid = rid;
priv->domid = domid;
priv->sum = do_sum;
- priv->evtid = mevt->evtid;
+ priv->evt = mevt;
list_add_tail(&priv->list, &mon_data_kn_priv_list);
return priv;
@@ -3210,7 +3210,7 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_domain_hdr *hdr,
return ret;
if (!do_sum && resctrl_is_mbm_event(mevt->evtid))
- mon_event_read(&rr, r, hdr, prgrp, &hdr->cpu_mask, mevt->evtid, true);
+ mon_event_read(&rr, r, hdr, prgrp, &hdr->cpu_mask, mevt, true);
}
return 0;
--
2.51.0
^ permalink raw reply related [flat|nested] 84+ messages in thread

* Re: [PATCH v11 09/31] fs/resctrl: Make event details accessible to functions when reading events
2025-09-25 20:03 ` [PATCH v11 09/31] fs/resctrl: Make event details accessible to functions when reading events Tony Luck
@ 2025-10-03 23:27 ` Reinette Chatre
0 siblings, 0 replies; 84+ messages in thread
From: Reinette Chatre @ 2025-10-03 23:27 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 9/25/25 1:03 PM, Tony Luck wrote:
> Reading monitoring data from MMIO requires more context to be able to
To tie this together with the "instead of just the event id" used later
I'd propose:
Reading monitoring event data from MMIO requires more context
than the event id to be able to ...
> read the correct memory location. struct mon_evt is the appropriate
> place for this event specific context.
>
> Prepare for addition of extra fields to mon_evt by changing the calling
"mon_evt" -> "struct mon_evt"
> conventions to pass a pointer to the mon_evt structure instead of just
> the event id.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
| Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reinette
* [PATCH v11 10/31] x86,fs/resctrl: Handle events that can be read from any CPU
2025-09-25 20:02 [PATCH v11 00/31] x86,fs/resctrl telemetry monitoring Tony Luck
` (8 preceding siblings ...)
2025-09-25 20:03 ` [PATCH v11 09/31] fs/resctrl: Make event details accessible to functions when reading events Tony Luck
@ 2025-09-25 20:03 ` Tony Luck
2025-10-03 23:32 ` Reinette Chatre
2025-09-25 20:03 ` [PATCH v11 11/31] x86,fs/resctrl: Support binary fixed point event counters Tony Luck
` (20 subsequent siblings)
30 siblings, 1 reply; 84+ messages in thread
From: Tony Luck @ 2025-09-25 20:03 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
resctrl assumes that monitor events can only be read from a CPU in the
cpumask_t set of each domain. This is true for x86 events accessed
with an MSR interface, but may not be true for other access methods such
as MMIO.
Add a flag to struct mon_evt, settable by architecture code, to indicate
there are no restrictions on which CPU can read that event.
Bypass all the smp_call*() code for events that can be read on any CPU
and call mon_event_count() directly from mon_event_read().
Simplify CPU checking in __mon_event_count() with a helper.
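The dispatch decision described above can be modeled in plain userspace C. This is a hedged sketch, not kernel code: cpumasks are reduced to 64-bit bitmasks, the structures contain only the fields the check needs, and all names here are illustrative rather than the actual resctrl API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Simplified model of the CPU check added by this patch. Real resctrl
 * uses struct cpumask and struct rmid_read; here a domain is a plain
 * 64-bit CPU bitmask.
 */
struct model_evt {
	bool any_cpu;		/* event readable from any CPU */
};

struct model_read {
	struct model_evt *evt;
	uint64_t *domain_mask;	/* CPUs of the target domain, or NULL */
	uint64_t shared_mask;	/* CPUs sharing the L3 when summing */
};

static bool cpu_on_correct_domain(struct model_read *rr, int cpu)
{
	/* Any CPU is OK for this event */
	if (rr->evt->any_cpu)
		return true;

	/* Single domain: must be on a CPU in that domain. */
	if (rr->domain_mask)
		return (*rr->domain_mask >> cpu) & 1;

	/* Summing domains that share a cache: must be on a CPU of that cache. */
	return (rr->shared_mask >> cpu) & 1;
}
```

An MMIO-style event (any_cpu set) passes the check regardless of the calling CPU, while an MSR-style event only passes when the caller is inside the relevant bitmask, mirroring the two removed cpumask_test_cpu() checks.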
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 2 +-
fs/resctrl/internal.h | 2 ++
arch/x86/kernel/cpu/resctrl/core.c | 6 ++---
fs/resctrl/ctrlmondata.c | 6 +++++
fs/resctrl/monitor.c | 43 ++++++++++++++++++++++--------
5 files changed, 44 insertions(+), 15 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 66569662efee..22edd8d131d8 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -409,7 +409,7 @@ u32 resctrl_arch_get_num_closid(struct rdt_resource *r);
u32 resctrl_arch_system_num_rmid_idx(void);
int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
-void resctrl_enable_mon_event(enum resctrl_event_id eventid);
+void resctrl_enable_mon_event(enum resctrl_event_id eventid, bool any_cpu);
bool resctrl_is_mon_event_enabled(enum resctrl_event_id eventid);
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 12a2ab7e3c9b..40b76eaa33d0 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -61,6 +61,7 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
* READS_TO_REMOTE_MEM) being tracked by @evtid.
* Only valid if @evtid is an MBM event.
* @configurable: true if the event is configurable
+ * @any_cpu: true if the event can be read from any CPU
* @enabled: true if the event is enabled
*/
struct mon_evt {
@@ -69,6 +70,7 @@ struct mon_evt {
char *name;
u32 evt_cfg;
bool configurable;
+ bool any_cpu;
bool enabled;
};
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 4762790c6e62..8db941fef7a0 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -889,15 +889,15 @@ static __init bool get_rdt_mon_resources(void)
bool ret = false;
if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC)) {
- resctrl_enable_mon_event(QOS_L3_OCCUP_EVENT_ID);
+ resctrl_enable_mon_event(QOS_L3_OCCUP_EVENT_ID, false);
ret = true;
}
if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
- resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID);
+ resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID, false);
ret = true;
}
if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL)) {
- resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID);
+ resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID, false);
ret = true;
}
if (rdt_cpu_has(X86_FEATURE_ABMC))
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index 77602563cb1f..fbf55e61445c 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -574,6 +574,11 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
}
}
+ if (evt->any_cpu) {
+ mon_event_count(rr);
+ goto out_ctx_free;
+ }
+
cpu = cpumask_any_housekeeping(cpumask, RESCTRL_PICK_ANY_CPU);
/*
@@ -587,6 +592,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
else
smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
+out_ctx_free:
if (rr->arch_mon_ctx)
resctrl_arch_mon_ctx_free(r, evt->evtid, rr->arch_mon_ctx);
}
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index ee08ffbacc2b..6f8a9b5a2f6b 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -413,9 +413,33 @@ static void mbm_cntr_free(struct rdt_l3_mon_domain *d, int cntr_id)
memset(&d->cntr_cfg[cntr_id], 0, sizeof(*d->cntr_cfg));
}
+/*
+ * Called from preemptible context via a direct call of mon_event_count() for
+ * events that can be read on any CPU.
+ * Called from preemptible but non-migratable process context (mon_event_count()
+ * via smp_call_on_cpu()) OR non-preemptible context (mon_event_count() via
+ * smp_call_function_any()) for events that need to be read on a specific CPU.
+ */
+static bool cpu_on_correct_domain(struct rmid_read *rr)
+{
+ int cpu;
+
+ /* Any CPU is OK for this event */
+ if (rr->evt->any_cpu)
+ return true;
+
+ cpu = smp_processor_id();
+
+ /* Single domain. Must be on a CPU in that domain. */
+ if (rr->hdr)
+ return cpumask_test_cpu(cpu, &rr->hdr->cpu_mask);
+
+ /* Summing domains that share a cache, must be on a CPU for that cache. */
+ return cpumask_test_cpu(cpu, &rr->ci->shared_cpu_map);
+}
+
static int __mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
{
- int cpu = smp_processor_id();
u32 closid = rdtgrp->closid;
u32 rmid = rdtgrp->mon.rmid;
struct rdt_l3_mon_domain *d;
@@ -424,6 +448,9 @@ static int __mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
int err, ret;
u64 tval = 0;
+ if (!cpu_on_correct_domain(rr))
+ return -EINVAL;
+
if (!domain_header_is_valid(rr->hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
return -EINVAL;
d = container_of(rr->hdr, struct rdt_l3_mon_domain, hdr);
@@ -448,9 +475,6 @@ static int __mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
}
if (rr->hdr) {
- /* Reading a single domain, must be on a CPU in that domain. */
- if (!cpumask_test_cpu(cpu, &rr->hdr->cpu_mask))
- return -EINVAL;
if (rr->is_mbm_cntr)
rr->err = resctrl_arch_cntr_read(rr->r, rr->hdr, closid, rmid, cntr_id,
rr->evt->evtid, &tval);
@@ -465,10 +489,6 @@ static int __mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
return 0;
}
- /* Summing domains that share a cache, must be on a CPU for that cache. */
- if (!cpumask_test_cpu(cpu, &rr->ci->shared_cpu_map))
- return -EINVAL;
-
/*
* Legacy files must report the sum of an event across all
* domains that share the same L3 cache instance.
@@ -957,7 +977,7 @@ struct mon_evt mon_event_all[QOS_NUM_EVENTS] = {
},
};
-void resctrl_enable_mon_event(enum resctrl_event_id eventid)
+void resctrl_enable_mon_event(enum resctrl_event_id eventid, bool any_cpu)
{
if (WARN_ON_ONCE(eventid < QOS_FIRST_EVENT || eventid >= QOS_NUM_EVENTS))
return;
@@ -966,6 +986,7 @@ void resctrl_enable_mon_event(enum resctrl_event_id eventid)
return;
}
+ mon_event_all[eventid].any_cpu = any_cpu;
mon_event_all[eventid].enabled = true;
}
@@ -1791,9 +1812,9 @@ int resctrl_l3_mon_resource_init(void)
if (r->mon.mbm_cntr_assignable) {
if (!resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID))
- resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID);
+ resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID, false);
if (!resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID))
- resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID);
+ resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID, false);
mon_event_all[QOS_L3_MBM_TOTAL_EVENT_ID].evt_cfg = r->mon.mbm_cfg_mask;
mon_event_all[QOS_L3_MBM_LOCAL_EVENT_ID].evt_cfg = r->mon.mbm_cfg_mask &
(READS_TO_LOCAL_MEM |
--
2.51.0
* Re: [PATCH v11 10/31] x86,fs/resctrl: Handle events that can be read from any CPU
2025-09-25 20:03 ` [PATCH v11 10/31] x86,fs/resctrl: Handle events that can be read from any CPU Tony Luck
@ 2025-10-03 23:32 ` Reinette Chatre
0 siblings, 0 replies; 84+ messages in thread
From: Reinette Chatre @ 2025-10-03 23:32 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 9/25/25 1:03 PM, Tony Luck wrote:
> resctrl assumes that monitor events can only be read from a CPU in the
> cpumask_t set of each domain. This is true for x86 events accessed
> with an MSR interface, but may not be true for other access methods such
> as MMIO.
>
> Add a flag to struct mon_evt, settable by architecture code, to indicate
"Add a flag to struct mon_evt" -> "Add flag mon_evt::any_cpu"
> there are no restrictions on which CPU can read that event.
Or rather:
Introduce and use flag mon_evt::any_cpu, settable by architecture,
that indicates there are no restrictions on which CPU can read that
event.
>
> Bypass all the smp_call*() code for events that can be read on any CPU
> and call mon_event_count() directly from mon_event_read().
Above (from "Bypass ...") can be dropped since it is clear from patch.
>
> Simplify CPU checking in __mon_event_count() with a helper.
Above can be seen from patch but when trying to do so it is not clear
why this helper is needed and indicates that this is missing "why".
Proposal:
Refactor the CPU checking to avoid always calling smp_processor_id()
now that events can be read from preemptible context.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
Reinette
* [PATCH v11 11/31] x86,fs/resctrl: Support binary fixed point event counters
2025-09-25 20:02 [PATCH v11 00/31] x86,fs/resctrl telemetry monitoring Tony Luck
` (9 preceding siblings ...)
2025-09-25 20:03 ` [PATCH v11 10/31] x86,fs/resctrl: Handle events that can be read from any CPU Tony Luck
@ 2025-09-25 20:03 ` Tony Luck
2025-09-25 20:03 ` [PATCH v11 12/31] x86,fs/resctrl: Add an architectural hook called for each mount Tony Luck
` (19 subsequent siblings)
30 siblings, 0 replies; 84+ messages in thread
From: Tony Luck @ 2025-09-25 20:03 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
resctrl assumes that all monitor events can be displayed as unsigned
decimal integers.
Hardware counters may provide some telemetry events with greater
precision where the event is not a simple count, but a measurement
of some sort (e.g. Joules of energy consumed).
Add a new argument to resctrl_enable_mon_event() for architecture code
to inform the file system that the value for a counter is a fixed-point
value with a specific number of binary places.
Only allow the architecture to use floating point format on events that
the file system has marked with mon_evt::is_floating_point.
Display fixed point values rounded to an appropriate number of decimal
places for the precision of the number of binary places provided. Add
one extra decimal place for every three additional binary places,
except for low precision binary values where exact representation
is possible:
1 binary place is 0.0 or 0.5 => 1 decimal place
2 binary places is 0.0, 0.25, 0.5, 0.75 => 2 decimal places
3 binary places is 0.0, 0.125, etc. => 3 decimal places
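The rounding scheme above can be demonstrated with a small userspace sketch of the conversion. Assumptions are flagged: the decplaces[] table is truncated to 4 fractional bits, GENMASK is replaced by a plain shift-mask, and format_fixed_point() is a hypothetical stand-in for the patch's print_event_value().

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Decimal places for 0..4 fractional binary bits (subset of the
 * patch's decplaces[] table). */
static const unsigned int decplaces[] = { 0, 1, 2, 3, 3 };

/* Format a fixed-point value with 'binary_bits' fractional bits,
 * rounding to the table's number of decimal places and trimming
 * trailing zeroes, mirroring the logic of print_event_value(). */
static void format_fixed_point(char *buf, size_t len,
			       unsigned int binary_bits, uint64_t val)
{
	unsigned long long pow10 = 1, frac;
	char fracbuf[16];

	if (!binary_bits) {
		snprintf(buf, len, "%llu.0", (unsigned long long)val);
		return;
	}

	for (unsigned int i = 0; i < decplaces[binary_bits]; i++)
		pow10 *= 10;

	frac = val & ((1ull << binary_bits) - 1);	/* fractional bits only */
	frac *= pow10;					/* scale to decimal */
	frac += 1ull << (binary_bits - 1);		/* round to nearest */
	frac >>= binary_bits;				/* back to an integer */

	snprintf(fracbuf, sizeof(fracbuf), "%0*llu",
		 (int)decplaces[binary_bits], frac);

	/* Trim trailing zeroes, always keeping at least one digit. */
	for (int i = (int)decplaces[binary_bits] - 1; i > 0; i--) {
		if (fracbuf[i] != '0')
			break;
		fracbuf[i] = '\0';
	}
	snprintf(buf, len, "%llu.%s",
		 (unsigned long long)(val >> binary_bits), fracbuf);
}
```

For example, with 2 binary places the raw value 13 (binary 11.01) formats as "3.25", and with 1 binary place the value 7 (binary 11.1) formats as "3.5".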
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
include/linux/resctrl.h | 3 +-
fs/resctrl/internal.h | 8 +++
arch/x86/kernel/cpu/resctrl/core.c | 6 +--
fs/resctrl/ctrlmondata.c | 84 ++++++++++++++++++++++++++++++
fs/resctrl/monitor.c | 14 +++--
5 files changed, 107 insertions(+), 8 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 22edd8d131d8..de66928e9430 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -409,7 +409,8 @@ u32 resctrl_arch_get_num_closid(struct rdt_resource *r);
u32 resctrl_arch_system_num_rmid_idx(void);
int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
-void resctrl_enable_mon_event(enum resctrl_event_id eventid, bool any_cpu);
+void resctrl_enable_mon_event(enum resctrl_event_id eventid, bool any_cpu,
+ unsigned int binary_bits);
bool resctrl_is_mon_event_enabled(enum resctrl_event_id eventid);
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 40b76eaa33d0..f5189b6771a0 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -62,6 +62,9 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
* Only valid if @evtid is an MBM event.
* @configurable: true if the event is configurable
* @any_cpu: true if the event can be read from any CPU
+ * @is_floating_point: event values are displayed in floating point format
+ * @binary_bits: number of fixed-point binary bits from architecture,
+ * only valid if @is_floating_point is true
* @enabled: true if the event is enabled
*/
struct mon_evt {
@@ -71,6 +74,8 @@ struct mon_evt {
u32 evt_cfg;
bool configurable;
bool any_cpu;
+ bool is_floating_point;
+ unsigned int binary_bits;
bool enabled;
};
@@ -79,6 +84,9 @@ extern struct mon_evt mon_event_all[QOS_NUM_EVENTS];
#define for_each_mon_event(mevt) for (mevt = &mon_event_all[QOS_FIRST_EVENT]; \
mevt < &mon_event_all[QOS_NUM_EVENTS]; mevt++)
+/* Limit for mon_evt::binary_bits */
+#define MAX_BINARY_BITS 27
+
/**
* struct mon_data - Monitoring details for each event file.
* @list: Member of the global @mon_data_kn_priv_list list.
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 8db941fef7a0..ccba27df3ea6 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -889,15 +889,15 @@ static __init bool get_rdt_mon_resources(void)
bool ret = false;
if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC)) {
- resctrl_enable_mon_event(QOS_L3_OCCUP_EVENT_ID, false);
+ resctrl_enable_mon_event(QOS_L3_OCCUP_EVENT_ID, false, 0);
ret = true;
}
if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
- resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID, false);
+ resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID, false, 0);
ret = true;
}
if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL)) {
- resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID, false);
+ resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID, false, 0);
ret = true;
}
if (rdt_cpu_has(X86_FEATURE_ABMC))
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index fbf55e61445c..ae43e09fa5e5 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -17,6 +17,7 @@
#include <linux/cpu.h>
#include <linux/kernfs.h>
+#include <linux/math.h>
#include <linux/seq_file.h>
#include <linux/slab.h>
#include <linux/tick.h>
@@ -597,6 +598,87 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
resctrl_arch_mon_ctx_free(r, evt->evtid, rr->arch_mon_ctx);
}
+/*
+ * Decimal place precision to use for each number of fixed-point
+ * binary bits.
+ */
+static unsigned int decplaces[MAX_BINARY_BITS + 1] = {
+ [1] = 1,
+ [2] = 2,
+ [3] = 3,
+ [4] = 3,
+ [5] = 3,
+ [6] = 3,
+ [7] = 3,
+ [8] = 3,
+ [9] = 3,
+ [10] = 4,
+ [11] = 4,
+ [12] = 4,
+ [13] = 5,
+ [14] = 5,
+ [15] = 5,
+ [16] = 6,
+ [17] = 6,
+ [18] = 6,
+ [19] = 7,
+ [20] = 7,
+ [21] = 7,
+ [22] = 8,
+ [23] = 8,
+ [24] = 8,
+ [25] = 9,
+ [26] = 9,
+ [27] = 9
+};
+
+static void print_event_value(struct seq_file *m, unsigned int binary_bits, u64 val)
+{
+ unsigned long long frac;
+ char buf[10];
+
+ if (!binary_bits) {
+ seq_printf(m, "%llu.0\n", val);
+ return;
+ }
+
+ /* Mask off the integer part of the fixed-point value. */
+ frac = val & GENMASK_ULL(binary_bits - 1, 0);
+
+ /*
+ * Multiply by 10^{desired decimal places}. The integer part of
+ * the fixed point value is now almost what is needed.
+ */
+ frac *= int_pow(10ull, decplaces[binary_bits]);
+
+ /*
+ * Round to nearest by adding a value that would be a "1" in the
+ * binary_bits + 1 place. Integer part of fixed point value is
+ * now the needed value.
+ */
+ frac += 1ull << (binary_bits - 1);
+
+ /*
+ * Extract the integer part of the value. This is the decimal
+ * representation of the original fixed-point fractional value.
+ */
+ frac >>= binary_bits;
+
+ /*
+ * "frac" is now in the range [0 .. 10^decplaces). I.e. string
+ * representation will fit into chosen number of decimal places.
+ */
+ snprintf(buf, sizeof(buf), "%0*llu", decplaces[binary_bits], frac);
+
+ /* Trim trailing zeroes */
+ for (int i = decplaces[binary_bits] - 1; i > 0; i--) {
+ if (buf[i] != '0')
+ break;
+ buf[i] = '\0';
+ }
+ seq_printf(m, "%llu.%s\n", val >> binary_bits, buf);
+}
+
int rdtgroup_mondata_show(struct seq_file *m, void *arg)
{
struct kernfs_open_file *of = m->private;
@@ -675,6 +757,8 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
seq_puts(m, "Unavailable\n");
else if (rr.err == -ENOENT)
seq_puts(m, "Unassigned\n");
+ else if (evt->is_floating_point)
+ print_event_value(m, evt->binary_bits, rr.val);
else
seq_printf(m, "%llu\n", rr.val);
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 6f8a9b5a2f6b..e354f01df615 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -977,16 +977,22 @@ struct mon_evt mon_event_all[QOS_NUM_EVENTS] = {
},
};
-void resctrl_enable_mon_event(enum resctrl_event_id eventid, bool any_cpu)
+void resctrl_enable_mon_event(enum resctrl_event_id eventid, bool any_cpu, unsigned int binary_bits)
{
- if (WARN_ON_ONCE(eventid < QOS_FIRST_EVENT || eventid >= QOS_NUM_EVENTS))
+ if (WARN_ON_ONCE(eventid < QOS_FIRST_EVENT || eventid >= QOS_NUM_EVENTS ||
+ binary_bits > MAX_BINARY_BITS))
return;
if (mon_event_all[eventid].enabled) {
pr_warn("Duplicate enable for event %d\n", eventid);
return;
}
+ if (binary_bits && !mon_event_all[eventid].is_floating_point) {
+ pr_warn("Event %d may not be floating point\n", eventid);
+ return;
+ }
mon_event_all[eventid].any_cpu = any_cpu;
+ mon_event_all[eventid].binary_bits = binary_bits;
mon_event_all[eventid].enabled = true;
}
@@ -1812,9 +1818,9 @@ int resctrl_l3_mon_resource_init(void)
if (r->mon.mbm_cntr_assignable) {
if (!resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID))
- resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID, false);
+ resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID, false, 0);
if (!resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID))
- resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID, false);
+ resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID, false, 0);
mon_event_all[QOS_L3_MBM_TOTAL_EVENT_ID].evt_cfg = r->mon.mbm_cfg_mask;
mon_event_all[QOS_L3_MBM_LOCAL_EVENT_ID].evt_cfg = r->mon.mbm_cfg_mask &
(READS_TO_LOCAL_MEM |
--
2.51.0
* [PATCH v11 12/31] x86,fs/resctrl: Add an architectural hook called for each mount
2025-09-25 20:02 [PATCH v11 00/31] x86,fs/resctrl telemetry monitoring Tony Luck
` (10 preceding siblings ...)
2025-09-25 20:03 ` [PATCH v11 11/31] x86,fs/resctrl: Support binary fixed point event counters Tony Luck
@ 2025-09-25 20:03 ` Tony Luck
2025-09-25 20:03 ` [PATCH v11 13/31] x86,fs/resctrl: Add and initialize rdt_resource for package scope monitor Tony Luck
` (18 subsequent siblings)
30 siblings, 0 replies; 84+ messages in thread
From: Tony Luck @ 2025-09-25 20:03 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Enumeration of Intel telemetry events is an asynchronous process involving
several mutually dependent drivers added as auxiliary devices during
the device_initcall() phase of Linux boot. The process finishes after
the probe functions of these drivers complete. But this happens after
resctrl_arch_late_init() is executed.
Tracing the enumeration process shows that it does complete a full seven
seconds before the earliest possible mount of the resctrl file system
(when included in /etc/fstab for automatic mount by systemd).
Add a hook at the beginning of the mount code that will be used
to check for telemetry events and initialize them if any are found.
Call the hook on every attempted mount. The expectation is that
most actions (like enumeration) will only need to be performed
on the first call.
The resctrl file system calls the hook with no locks held. Architecture
code is responsible for any required locking.
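The "perform one-time work on the first mount" guard that the hook relies on can be sketched with C11 atomics. This is an illustrative userspace model: atomic_compare_exchange_strong() plays the role of the kernel's atomic_try_cmpxchg(), and example_pre_mount() is a hypothetical stand-in for resctrl_arch_pre_mount().

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * An atomic compare-and-exchange guarantees that exactly one caller
 * performs the one-time work, even if several mount attempts race.
 */
static atomic_int only_once;
static int init_calls;	/* counts how often the one-time work ran */

static void example_pre_mount(void)
{
	int old = 0;

	/* Only the first caller to swap 0 -> 1 proceeds. */
	if (!atomic_compare_exchange_strong(&only_once, &old, 1))
		return;

	init_calls++;	/* stands in for telemetry event discovery */
}
```

Subsequent calls see the flag already set and return immediately, so repeated or concurrent mounts cannot re-run the discovery work.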
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
include/linux/resctrl.h | 6 ++++++
arch/x86/kernel/cpu/resctrl/core.c | 9 +++++++++
fs/resctrl/rdtgroup.c | 2 ++
3 files changed, 17 insertions(+)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index de66928e9430..6350064ac8be 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -511,6 +511,12 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
void resctrl_online_cpu(unsigned int cpu);
void resctrl_offline_cpu(unsigned int cpu);
+/*
+ * Architecture hook called at beginning of each file system mount attempt.
+ * No locks are held.
+ */
+void resctrl_arch_pre_mount(void);
+
/**
* resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
* for this resource and domain.
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index ccba27df3ea6..ee6d53aae455 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -717,6 +717,15 @@ static int resctrl_arch_offline_cpu(unsigned int cpu)
return 0;
}
+void resctrl_arch_pre_mount(void)
+{
+ static atomic_t only_once = ATOMIC_INIT(0);
+ int old = 0;
+
+ if (!atomic_try_cmpxchg(&only_once, &old, 1))
+ return;
+}
+
enum {
RDT_FLAG_CMT,
RDT_FLAG_MBM_TOTAL,
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index dc289b03c3d1..72ae7224a2da 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2720,6 +2720,8 @@ static int rdt_get_tree(struct fs_context *fc)
struct rdt_resource *r;
int ret;
+ resctrl_arch_pre_mount();
+
cpus_read_lock();
mutex_lock(&rdtgroup_mutex);
/*
--
2.51.0
* [PATCH v11 13/31] x86,fs/resctrl: Add and initialize rdt_resource for package scope monitor
2025-09-25 20:02 [PATCH v11 00/31] x86,fs/resctrl telemetry monitoring Tony Luck
` (11 preceding siblings ...)
2025-09-25 20:03 ` [PATCH v11 12/31] x86,fs/resctrl: Add an architectural hook called for each mount Tony Luck
@ 2025-09-25 20:03 ` Tony Luck
2025-09-25 20:03 ` [PATCH v11 14/31] x86/resctrl: Discover hardware telemetry events Tony Luck
` (17 subsequent siblings)
30 siblings, 0 replies; 84+ messages in thread
From: Tony Luck @ 2025-09-25 20:03 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Add a new PERF_PKG resource and introduce package level scope for
monitoring telemetry events so that CPU hot plug notifiers can build
domains at the package granularity.
Use the physical package ID available via topology_physical_package_id()
to identify the monitoring domains with package level scope. This enables
user space to use:
/sys/devices/system/cpu/cpuX/topology/physical_package_id
to identify the monitoring domain a CPU is associated with.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
include/linux/resctrl.h | 2 ++
fs/resctrl/internal.h | 2 ++
arch/x86/kernel/cpu/resctrl/core.c | 10 ++++++++++
fs/resctrl/rdtgroup.c | 2 ++
4 files changed, 16 insertions(+)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 6350064ac8be..ff67224b80c8 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -53,6 +53,7 @@ enum resctrl_res_level {
RDT_RESOURCE_L2,
RDT_RESOURCE_MBA,
RDT_RESOURCE_SMBA,
+ RDT_RESOURCE_PERF_PKG,
/* Must be the last */
RDT_NUM_RESOURCES,
@@ -267,6 +268,7 @@ enum resctrl_scope {
RESCTRL_L2_CACHE = 2,
RESCTRL_L3_CACHE = 3,
RESCTRL_L3_NODE,
+ RESCTRL_PACKAGE,
};
/**
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index f5189b6771a0..96d97f4ff957 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -255,6 +255,8 @@ struct rdtgroup {
#define RFTYPE_ASSIGN_CONFIG BIT(11)
+#define RFTYPE_RES_PERF_PKG BIT(11)
+
#define RFTYPE_CTRL_INFO (RFTYPE_INFO | RFTYPE_CTRL)
#define RFTYPE_MON_INFO (RFTYPE_INFO | RFTYPE_MON)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index ee6d53aae455..64c6f507b7bc 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -100,6 +100,14 @@ struct rdt_hw_resource rdt_resources_all[RDT_NUM_RESOURCES] = {
.schema_fmt = RESCTRL_SCHEMA_RANGE,
},
},
+ [RDT_RESOURCE_PERF_PKG] =
+ {
+ .r_resctrl = {
+ .name = "PERF_PKG",
+ .mon_scope = RESCTRL_PACKAGE,
+ .mon_domains = mon_domain_init(RDT_RESOURCE_PERF_PKG),
+ },
+ },
};
u32 resctrl_arch_system_num_rmid_idx(void)
@@ -433,6 +441,8 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
return get_cpu_cacheinfo_id(cpu, scope);
case RESCTRL_L3_NODE:
return cpu_to_node(cpu);
+ case RESCTRL_PACKAGE:
+ return topology_physical_package_id(cpu);
default:
break;
}
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 72ae7224a2da..6e8937f94e7a 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2330,6 +2330,8 @@ static unsigned long fflags_from_resource(struct rdt_resource *r)
case RDT_RESOURCE_MBA:
case RDT_RESOURCE_SMBA:
return RFTYPE_RES_MB;
+ case RDT_RESOURCE_PERF_PKG:
+ return RFTYPE_RES_PERF_PKG;
}
return WARN_ON_ONCE(1);
--
2.51.0
* [PATCH v11 14/31] x86/resctrl: Discover hardware telemetry events
2025-09-25 20:02 [PATCH v11 00/31] x86,fs/resctrl telemetry monitoring Tony Luck
` (12 preceding siblings ...)
2025-09-25 20:03 ` [PATCH v11 13/31] x86,fs/resctrl: Add and initialize rdt_resource for package scope monitor Tony Luck
@ 2025-09-25 20:03 ` Tony Luck
2025-10-03 23:35 ` Reinette Chatre
2025-09-25 20:03 ` [PATCH v11 15/31] x86,fs/resctrl: Fill in details of events for guid 0x26696143 and 0x26557651 Tony Luck
` (16 subsequent siblings)
30 siblings, 1 reply; 84+ messages in thread
From: Tony Luck @ 2025-09-25 20:03 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Each CPU collects data for telemetry events that it sends to the nearest
telemetry event aggregator either when the value of IA32_PQR_ASSOC.RMID
changes, or when a two millisecond timer expires.
The telemetry event aggregators maintain per-RMID per-event counts of the
totals seen across all the CPUs. There may be more than one set of telemetry
event aggregators per package.
There are separate sets of aggregators for each type of event, but all
aggregators for a given type are symmetric, keeping counts for the same
set of events for the CPUs that provide data to them.
Each telemetry event aggregator is responsible for a specific group of
events. E.g. on the Intel Clearwater Forest CPU there are two types of
aggregators. One type tracks a pair of energy-related events. The other
type tracks a subset of "perf" type events.
The event counts are made available to Linux in a region of MMIO space
for each aggregator. All details about the layout of counters in each
aggregator MMIO region are described in XML files published by Intel and
made available in a GitHub repository [1].
The key to matching a specific telemetry aggregator to the XML file that
describes the MMIO layout is a 32-bit value. The Linux telemetry subsystem
refers to this as a "guid" while the XML files call it a "uniqueid".
Each XML file provides the following information:
1) Which telemetry events are included in the group.
2) The order in which the event counters appear for each RMID.
3) The value type of each event counter (integer or fixed-point).
4) The number of RMIDs supported.
5) Which additional aggregator status registers are included.
6) The total size of the MMIO region for an aggregator.
The INTEL_PMT_TELEMETRY driver enumerates support for telemetry events.
This driver provides intel_pmt_get_regions_by_feature() to list all
available telemetry event aggregators. The list includes the "guid",
the base address in MMIO space for the region where the event counters
are exposed, and the package id where all the CPUs that report to this
aggregator are located.
Add a new Kconfig option CONFIG_X86_CPU_RESCTRL_INTEL_AET for the Intel
specific parts of telemetry code. This depends on the INTEL_PMT_TELEMETRY
and INTEL_TPMI drivers being built into the kernel for enumeration of
telemetry features.
Use INTEL_PMT_TELEMETRY's intel_pmt_get_regions_by_feature() with
each per-RMID telemetry feature id to obtain a private copy of
struct pmt_feature_group that contains all discovered/enumerated
telemetry aggregator data for all event groups (known and unknown
to resctrl) of that feature id. Further processing on this structure
will enable all supported events in resctrl. Return the structure to
INTEL_PMT_TELEMETRY at resctrl exit time.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://github.com/intel/Intel-PMT # [1]
---
Note that checkpatch complains about this:
DEFINE_FREE(intel_pmt_put_feature_group, struct pmt_feature_group *,
if (!IS_ERR_OR_NULL(_T))
intel_pmt_put_feature_group(_T))
with:
CHECK: Alignment should match open parenthesis
But if the alignment is fixed, it then complains:
WARNING: Statements should start on a tabstop
---
arch/x86/kernel/cpu/resctrl/internal.h | 8 ++
arch/x86/kernel/cpu/resctrl/core.c | 5 +
arch/x86/kernel/cpu/resctrl/intel_aet.c | 144 ++++++++++++++++++++++++
arch/x86/Kconfig | 13 +++
arch/x86/kernel/cpu/resctrl/Makefile | 1 +
5 files changed, 171 insertions(+)
create mode 100644 arch/x86/kernel/cpu/resctrl/intel_aet.c
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 14fadcff0d2b..886261a82b81 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -217,4 +217,12 @@ void __init intel_rdt_mbm_apply_quirk(void);
void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
void resctrl_arch_mbm_cntr_assign_set_one(struct rdt_resource *r);
+#ifdef CONFIG_X86_CPU_RESCTRL_INTEL_AET
+bool intel_aet_get_events(void);
+void __exit intel_aet_exit(void);
+#else
+static inline bool intel_aet_get_events(void) { return false; }
+static inline void __exit intel_aet_exit(void) { }
+#endif
+
#endif /* _ASM_X86_RESCTRL_INTERNAL_H */
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 64c6f507b7bc..9003a6344410 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -734,6 +734,9 @@ void resctrl_arch_pre_mount(void)
if (!atomic_try_cmpxchg(&only_once, &old, 1))
return;
+
+ if (!intel_aet_get_events())
+ return;
}
enum {
@@ -1091,6 +1094,8 @@ late_initcall(resctrl_arch_late_init);
static void __exit resctrl_arch_exit(void)
{
+ intel_aet_exit();
+
cpuhp_remove_state(rdt_online);
resctrl_exit();
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
new file mode 100644
index 000000000000..966c840f0d6b
--- /dev/null
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -0,0 +1,144 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Resource Director Technology(RDT)
+ * - Intel Application Energy Telemetry
+ *
+ * Copyright (C) 2025 Intel Corporation
+ *
+ * Author:
+ * Tony Luck <tony.luck@intel.com>
+ */
+
+#define pr_fmt(fmt) "resctrl: " fmt
+
+#include <linux/array_size.h>
+#include <linux/cleanup.h>
+#include <linux/cpu.h>
+#include <linux/err.h>
+#include <linux/init.h>
+#include <linux/intel_pmt_features.h>
+#include <linux/intel_vsec.h>
+#include <linux/overflow.h>
+#include <linux/resctrl.h>
+#include <linux/stddef.h>
+#include <linux/types.h>
+
+#include "internal.h"
+
+/**
+ * struct event_group - All information about a group of telemetry events.
+ * @pfg: Points to the aggregated telemetry space information
+ * returned by the intel_pmt_get_regions_by_feature()
+ * call to the INTEL_PMT_TELEMETRY driver that contains
+ * data for all telemetry regions of a specific type.
+ * Valid if the system supports the event group.
+ * NULL otherwise.
+ * @guid: Unique number per XML description file.
+ */
+struct event_group {
+ /* Data fields for additional structures to manage this group. */
+ struct pmt_feature_group *pfg;
+
+ /* Remaining fields initialized from XML file. */
+ u32 guid;
+};
+
+/*
+ * Link: https://github.com/intel/Intel-PMT
+ * File: xml/CWF/OOBMSM/RMID-ENERGY/cwf_aggregator.xml
+ */
+static struct event_group energy_0x26696143 = {
+ .guid = 0x26696143,
+};
+
+/*
+ * Link: https://github.com/intel/Intel-PMT
+ * File: xml/CWF/OOBMSM/RMID-PERF/cwf_aggregator.xml
+ */
+static struct event_group perf_0x26557651 = {
+ .guid = 0x26557651,
+};
+
+static struct event_group *known_energy_event_groups[] = {
+ &energy_0x26696143,
+};
+
+static struct event_group *known_perf_event_groups[] = {
+ &perf_0x26557651,
+};
+
+#define for_each_enabled_event_group(_peg, _grp) \
+ for (_peg = (_grp); _peg < &_grp[ARRAY_SIZE(_grp)]; _peg++) \
+ if ((*_peg)->pfg)
+
+/* Stub for now */
+static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
+{
+ return false;
+}
+
+DEFINE_FREE(intel_pmt_put_feature_group, struct pmt_feature_group *,
+ if (!IS_ERR_OR_NULL(_T))
+ intel_pmt_put_feature_group(_T))
+
+/*
+ * Make a request to the INTEL_PMT_TELEMETRY driver for a copy of the
+ * pmt_feature_group for a specific feature. If there is one, the returned
+ * structure has an array of telemetry_region structures. Each describes
+ * one telemetry aggregator.
+ * Try to use every telemetry aggregator with a known guid.
+ */
+static bool get_pmt_feature(enum pmt_feature_id feature, struct event_group **evgs,
+ unsigned int num_evg)
+{
+ struct pmt_feature_group *p __free(intel_pmt_put_feature_group) = NULL;
+ struct event_group **peg;
+ bool ret;
+
+ p = intel_pmt_get_regions_by_feature(feature);
+
+ if (IS_ERR_OR_NULL(p))
+ return false;
+
+ for (peg = evgs; peg < &evgs[num_evg]; peg++) {
+ ret = enable_events(*peg, p);
+ if (ret) {
+ (*peg)->pfg = no_free_ptr(p);
+ return true;
+ }
+ }
+
+ return false;
+}
+
+/*
+ * Ask INTEL_PMT_TELEMETRY driver for all the RMID based telemetry groups
+ * that it supports.
+ */
+bool intel_aet_get_events(void)
+{
+ bool ret1, ret2;
+
+ ret1 = get_pmt_feature(FEATURE_PER_RMID_ENERGY_TELEM,
+ known_energy_event_groups,
+ ARRAY_SIZE(known_energy_event_groups));
+ ret2 = get_pmt_feature(FEATURE_PER_RMID_PERF_TELEM,
+ known_perf_event_groups,
+ ARRAY_SIZE(known_perf_event_groups));
+
+ return ret1 || ret2;
+}
+
+void __exit intel_aet_exit(void)
+{
+ struct event_group **peg;
+
+ for_each_enabled_event_group(peg, known_energy_event_groups) {
+ intel_pmt_put_feature_group((*peg)->pfg);
+ (*peg)->pfg = NULL;
+ }
+ for_each_enabled_event_group(peg, known_perf_event_groups) {
+ intel_pmt_put_feature_group((*peg)->pfg);
+ (*peg)->pfg = NULL;
+ }
+}
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 52c8910ba2ef..ce9d086625c1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -525,6 +525,19 @@ config X86_CPU_RESCTRL
Say N if unsure.
+config X86_CPU_RESCTRL_INTEL_AET
+ bool "Intel Application Energy Telemetry"
+ depends on X86_CPU_RESCTRL && CPU_SUP_INTEL && INTEL_PMT_TELEMETRY=y && INTEL_TPMI=y
+ help
+ Enable per-RMID telemetry events in resctrl.
+
+ Intel feature that collects per-RMID execution data
+ about energy consumption, a measure of frequency-independent
+ activity, and other performance metrics. Data is aggregated
+ per package.
+
+ Say N if unsure.
+
config X86_FRED
bool "Flexible Return and Event Delivery"
depends on X86_64
diff --git a/arch/x86/kernel/cpu/resctrl/Makefile b/arch/x86/kernel/cpu/resctrl/Makefile
index d8a04b195da2..273ddfa30836 100644
--- a/arch/x86/kernel/cpu/resctrl/Makefile
+++ b/arch/x86/kernel/cpu/resctrl/Makefile
@@ -1,6 +1,7 @@
# SPDX-License-Identifier: GPL-2.0
obj-$(CONFIG_X86_CPU_RESCTRL) += core.o rdtgroup.o monitor.o
obj-$(CONFIG_X86_CPU_RESCTRL) += ctrlmondata.o
+obj-$(CONFIG_X86_CPU_RESCTRL_INTEL_AET) += intel_aet.o
obj-$(CONFIG_RESCTRL_FS_PSEUDO_LOCK) += pseudo_lock.o
# To allow define_trace.h's recursive include:
--
2.51.0
* Re: [PATCH v11 14/31] x86/resctrl: Discover hardware telemetry events
2025-09-25 20:03 ` [PATCH v11 14/31] x86/resctrl: Discover hardware telemetry events Tony Luck
@ 2025-10-03 23:35 ` Reinette Chatre
2025-10-06 18:19 ` Luck, Tony
0 siblings, 1 reply; 84+ messages in thread
From: Reinette Chatre @ 2025-10-03 23:35 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 9/25/25 1:03 PM, Tony Luck wrote:
> Each CPU collects data for telemetry events that it sends to the nearest
> telemetry event aggregator either when the value of IA32_PQR_ASSOC.RMID
Please note that one of the "Touchups" done during merge of [1] was to
use full names for registers in descriptions. Considering this,
"IA32_PQR_ASSOC.RMID" -> "MSR_IA32_PQR_ASSOC.RMID
(also please make same change in cover letter)
> changes, or when a two millisecond timer expires.
>
...
> +
> +/**
> + * struct event_group - All information about a group of telemetry events.
> + * @pfg: Points to the aggregated telemetry space information
> + * returned by the intel_pmt_get_regions_by_feature()
> + * call to the INTEL_PMT_TELEMETRY driver that contains
> + * data for all telemetry regions of a specific type.
> + * Valid if the system supports the event group.
> + * NULL otherwise.
> + * @guid: Unique number per XML description file.
> + */
> +struct event_group {
> + /* Data fields for additional structures to manage this group. */
> + struct pmt_feature_group *pfg;
> +
> + /* Remaining fields initialized from XML file. */
> + u32 guid;
> +};
...
> +
> +/*
> + * Make a request to the INTEL_PMT_TELEMETRY driver for a copy of the
> + * pmt_feature_group for a specific feature. If there is one, the returned
> + * structure has an array of telemetry_region structures. Each describes
> + * one telemetry aggregator.
> + * Try to use every telemetry aggregator with a known guid.
The guid is associated with struct event_group and every telemetry region has
its own guid. It is not clear to me why the guid is not associated with pmt_feature_group.
To me this implies that a pmt_feature_group may contain telemetry regions that have
different guids.
This is not fully apparent in this patch but as this code evolves I do not think
the scenario where telemetry regions have different supported (by resctrl) guid is handled
by this enumeration.
If I understand correctly, all telemetry regions of a given pmt_feature_group will be
matched against a single supported guid at a time and all telemetry regions with that
guid will be considered usable and any others considered unusable without further processing
of that pmt_feature_group. If there is more than one matching guid supported by resctrl,
then only events of the first one will be enumerated?
> + */
> +static bool get_pmt_feature(enum pmt_feature_id feature, struct event_group **evgs,
> + unsigned int num_evg)
> +{
> + struct pmt_feature_group *p __free(intel_pmt_put_feature_group) = NULL;
> + struct event_group **peg;
> + bool ret;
> +
> + p = intel_pmt_get_regions_by_feature(feature);
> +
> + if (IS_ERR_OR_NULL(p))
> + return false;
> +
> + for (peg = evgs; peg < &evgs[num_evg]; peg++) {
> + ret = enable_events(*peg, p);
> + if (ret) {
> + (*peg)->pfg = no_free_ptr(p);
> + return true;
> + }
> + }
> +
> + return false;
> +}
Reinette
[1] https://lore.kernel.org/all/175793566119.709179.8448328033383658699.tip-bot2@tip-bot2/
* Re: [PATCH v11 14/31] x86/resctrl: Discover hardware telemetry events
2025-10-03 23:35 ` Reinette Chatre
@ 2025-10-06 18:19 ` Luck, Tony
2025-10-06 21:33 ` Reinette Chatre
0 siblings, 1 reply; 84+ messages in thread
From: Luck, Tony @ 2025-10-06 18:19 UTC (permalink / raw)
To: Reinette Chatre
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen Yu, x86, linux-kernel,
patches
On Fri, Oct 03, 2025 at 04:35:11PM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 9/25/25 1:03 PM, Tony Luck wrote:
> > Each CPU collects data for telemetry events that it sends to the nearest
> > telemetry event aggregator either when the value of IA32_PQR_ASSOC.RMID
>
> Please note that one of the "Touchups" done during merge of [1] was to
> use full names for registers in descriptions. Considering this,
> "IA32_PQR_ASSOC.RMID" -> "MSR_IA32_PQR_ASSOC.RMID
>
> (also please make same change in cover letter)
Will do.
>
> > changes, or when a two millisecond timer expires.
> >
>
> ...
>
> > +
> > +/**
> > + * struct event_group - All information about a group of telemetry events.
> > + * @pfg: Points to the aggregated telemetry space information
> > + * returned by the intel_pmt_get_regions_by_feature()
> > + * call to the INTEL_PMT_TELEMETRY driver that contains
> > + * data for all telemetry regions of a specific type.
> > + * Valid if the system supports the event group.
> > + * NULL otherwise.
> > + * @guid: Unique number per XML description file.
> > + */
> > +struct event_group {
> > + /* Data fields for additional structures to manage this group. */
> > + struct pmt_feature_group *pfg;
> > +
> > + /* Remaining fields initialized from XML file. */
> > + u32 guid;
> > +};
>
>
> ...
>
> > +
> > +/*
> > + * Make a request to the INTEL_PMT_TELEMETRY driver for a copy of the
> > + * pmt_feature_group for a specific feature. If there is one, the returned
> > + * structure has an array of telemetry_region structures. Each describes
> > + * one telemetry aggregator.
> > + * Try to use every telemetry aggregator with a known guid.
>
> The guid is associated with struct event_group and every telemetry region has
> its own guid. It is not clear to me why the guid is not associated with pmt_feature_group.
> To me this implies that a pmt_feature_group may contain telemetry regions that have
> different guids.
>
> This is not fully apparent in this patch but as this code evolves I do not think
> the scenario where telemetry regions have different supported (by resctrl) guid is handled
> by this enumeration.
> If I understand correctly, all telemetry regions of a given pmt_feature_group will be
> matched against a single supported guid at a time and all telemetry regions with that
> guid will be considered usable and any other considered unusable without further processing
> of that pmt_feature_group. If there are more than one matching guid supported by resctrl
> then only events of the first one will be enumerated?
>
> > + */
> > +static bool get_pmt_feature(enum pmt_feature_id feature, struct event_group **evgs,
> > + unsigned int num_evg)
> > +{
> > + struct pmt_feature_group *p __free(intel_pmt_put_feature_group) = NULL;
> > + struct event_group **peg;
> > + bool ret;
> > +
> > + p = intel_pmt_get_regions_by_feature(feature);
> > +
> > + if (IS_ERR_OR_NULL(p))
> > + return false;
> > +
> > + for (peg = evgs; peg < &evgs[num_evg]; peg++) {
> > + ret = enable_events(*peg, p);
> > + if (ret) {
> > + (*peg)->pfg = no_free_ptr(p);
> > + return true;
> > + }
> > + }
> > +
> > + return false;
> > +}
Perhaps David wants to cope with a future system that supports multiple
guids?
You are right that my code will not handle this. It will just enable
the first recognised guid and ignore any others.
How about this. Take an extra reference on any pmt_feature_group
structures that include a known guid (to keep the accounting right
when intel_aet_exit() is called). This simplifies the function so
I don't need the __free() handler that confuses checkpatch.pl :-)
/*
 * Make a request to the INTEL_PMT_TELEMETRY driver for a copy of the
 * pmt_feature_group for a specific feature. If there is one, the returned
 * structure has an array of telemetry_region structures, each element of
 * the array describes one telemetry aggregator.
 * A single pmt_feature_group may include multiple different guids.
 * Try to use every telemetry aggregator with a known guid.
 */
static bool get_pmt_feature(enum pmt_feature_id feature, struct event_group **evgs,
			    unsigned int num_evg)
{
	struct pmt_feature_group *p = intel_pmt_get_regions_by_feature(feature);
	struct event_group **peg;
	bool ret = false;

	if (IS_ERR_OR_NULL(p))
		return false;

	for (peg = evgs; peg < &evgs[num_evg]; peg++) {
		if (enable_events(*peg, p)) {
			kref_get(&p->kref);
			(*peg)->pfg = no_free_ptr(p);
			ret = true;
		}
	}
	intel_pmt_put_feature_group(p);

	return ret;
}
> Reinette
>
>
> [1] https://lore.kernel.org/all/175793566119.709179.8448328033383658699.tip-bot2@tip-bot2/
* Re: [PATCH v11 14/31] x86/resctrl: Discover hardware telemetry events
2025-10-06 18:19 ` Luck, Tony
@ 2025-10-06 21:33 ` Reinette Chatre
2025-10-06 21:47 ` Luck, Tony
0 siblings, 1 reply; 84+ messages in thread
From: Reinette Chatre @ 2025-10-06 21:33 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen Yu, x86, linux-kernel,
patches
Hi Tony,
On 10/6/25 11:19 AM, Luck, Tony wrote:
> On Fri, Oct 03, 2025 at 04:35:11PM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 9/25/25 1:03 PM, Tony Luck wrote:
>>> +
>>> +/*
>>> + * Make a request to the INTEL_PMT_TELEMETRY driver for a copy of the
>>> + * pmt_feature_group for a specific feature. If there is one, the returned
>>> + * structure has an array of telemetry_region structures. Each describes
>>> + * one telemetry aggregator.
>>> + * Try to use every telemetry aggregator with a known guid.
>>
>> The guid is associated with struct event_group and every telemetry region has
>> its own guid. It is not clear to me why the guid is not associated with pmt_feature_group.
>> To me this implies that a pmt_feature_group may contain telemetry regions that have
>> different guids.
>>
>> This is not fully apparent in this patch but as this code evolves I do not think
>> the scenario where telemetry regions have different supported (by resctrl) guid is handled
>> by this enumeration.
>> If I understand correctly, all telemetry regions of a given pmt_feature_group will be
>> matched against a single supported guid at a time and all telemetry regions with that
>> guid will be considered usable and any other considered unusable without further processing
>> of that pmt_feature_group. If there are more than one matching guid supported by resctrl
>> then only events of the first one will be enumerated?
>>
>>> + */
>>> +static bool get_pmt_feature(enum pmt_feature_id feature, struct event_group **evgs,
>>> + unsigned int num_evg)
>>> +{
>>> + struct pmt_feature_group *p __free(intel_pmt_put_feature_group) = NULL;
>>> + struct event_group **peg;
>>> + bool ret;
>>> +
>>> + p = intel_pmt_get_regions_by_feature(feature);
>>> +
>>> + if (IS_ERR_OR_NULL(p))
>>> + return false;
>>> +
>>> + for (peg = evgs; peg < &evgs[num_evg]; peg++) {
>>> + ret = enable_events(*peg, p);
>>> + if (ret) {
>>> + (*peg)->pfg = no_free_ptr(p);
>>> + return true;
>>> + }
>>> + }
>>> +
>>> + return false;
>>> +}
>
> Perhaps David wants to cope with a future system that supports multiple
> guids?
>
> You are right that my code will not handle this. It will just enable
> the first recognised guid and ignore any others.
>
> How about this. Take an extra reference on any pmt_feature_group
> structures that include a known guid (to keep the accounting right
> when intel_aet_exit() is called). This simplifies the function so
> I don't need the __free() handler that confuses checkpatch.pl :-)
>
>
> /*
> * Make a request to the INTEL_PMT_TELEMETRY driver for a copy of the
> * pmt_feature_group for a specific feature. If there is one, the returned
> * structure has an array of telemetry_region structures, each element of
> * the array describes one telemetry aggregator.
> * A single pmt_feature_group may include multiple different guids.
> * Try to use every telemetry aggregator with a known guid.
> */
> static bool get_pmt_feature(enum pmt_feature_id feature, struct event_group **evgs,
> unsigned int num_evg)
> {
> struct pmt_feature_group *p = intel_pmt_get_regions_by_feature(feature);
> struct event_group **peg;
> bool ret = false;
>
> if (IS_ERR_OR_NULL(p))
> return false;
>
> for (peg = evgs; peg < &evgs[num_evg]; peg++) {
> if (enable_events(*peg, p)) {
> kref_get(&p->kref);
This is not clear to me ... would enable_events() still mark all telemetry_regions
that do not match the event_group's guid as unusable? It seems to me that if more
than one event_group refers to the same pmt_feature_group then the first one to match
will "win" and make the other event_group's telemetry regions unusable.
> (*peg)->pfg = no_free_ptr(p);
> ret = true;
> }
> }
> intel_pmt_put_feature_group(p);
>
> return ret;
> }
>
Reinette
* Re: [PATCH v11 14/31] x86/resctrl: Discover hardware telemetry events
2025-10-06 21:33 ` Reinette Chatre
@ 2025-10-06 21:47 ` Luck, Tony
2025-10-07 20:47 ` Luck, Tony
0 siblings, 1 reply; 84+ messages in thread
From: Luck, Tony @ 2025-10-06 21:47 UTC (permalink / raw)
To: Reinette Chatre
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen Yu, x86, linux-kernel,
patches
On Mon, Oct 06, 2025 at 02:33:00PM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 10/6/25 11:19 AM, Luck, Tony wrote:
> > On Fri, Oct 03, 2025 at 04:35:11PM -0700, Reinette Chatre wrote:
> >> Hi Tony,
> >>
> >> On 9/25/25 1:03 PM, Tony Luck wrote:
> >>> +
> >>> +/*
> >>> + * Make a request to the INTEL_PMT_TELEMETRY driver for a copy of the
> >>> + * pmt_feature_group for a specific feature. If there is one, the returned
> >>> + * structure has an array of telemetry_region structures. Each describes
> >>> + * one telemetry aggregator.
> >>> + * Try to use every telemetry aggregator with a known guid.
> >>
> >> The guid is associated with struct event_group and every telemetry region has
> >> its own guid. It is not clear to me why the guid is not associated with pmt_feature_group.
> >> To me this implies that a pmt_feature_group may contain telemetry regions that have
> >> different guids.
> >>
> >> This is not fully apparent in this patch but as this code evolves I do not think
> >> the scenario where telemetry regions have different supported (by resctrl) guid is handled
> >> by this enumeration.
> >> If I understand correctly, all telemetry regions of a given pmt_feature_group will be
> >> matched against a single supported guid at a time and all telemetry regions with that
> >> guid will be considered usable and any other considered unusable without further processing
> >> of that pmt_feature_group. If there are more than one matching guid supported by resctrl
> >> then only events of the first one will be enumerated?
> >>
> >>> + */
> >>> +static bool get_pmt_feature(enum pmt_feature_id feature, struct event_group **evgs,
> >>> + unsigned int num_evg)
> >>> +{
> >>> + struct pmt_feature_group *p __free(intel_pmt_put_feature_group) = NULL;
> >>> + struct event_group **peg;
> >>> + bool ret;
> >>> +
> >>> + p = intel_pmt_get_regions_by_feature(feature);
> >>> +
> >>> + if (IS_ERR_OR_NULL(p))
> >>> + return false;
> >>> +
> >>> + for (peg = evgs; peg < &evgs[num_evg]; peg++) {
> >>> + ret = enable_events(*peg, p);
> >>> + if (ret) {
> >>> + (*peg)->pfg = no_free_ptr(p);
> >>> + return true;
> >>> + }
> >>> + }
> >>> +
> >>> + return false;
> >>> +}
> >
> > Perhaps David wants to cope with a future system that supports multiple
> > guids?
> >
> > You are right that my code will not handle this. It will just enable
> > the first recognised guid and ignore any others.
> >
> > How about this. Take an extra reference on any pmt_feature_group
> > structures that include a known guid (to keep the accounting right
> > when intel_aet_exit() is called). This simplifies the function so
> > I don't need the __free() handler that confuses checkpatch.pl :-)
> >
> >
> > /*
> > * Make a request to the INTEL_PMT_TELEMETRY driver for a copy of the
> > * pmt_feature_group for a specific feature. If there is one, the returned
> > * structure has an array of telemetry_region structures, each element of
> > * the array describes one telemetry aggregator.
> > * A single pmt_feature_group may include multiple different guids.
> > * Try to use every telemetry aggregator with a known guid.
> > */
> > static bool get_pmt_feature(enum pmt_feature_id feature, struct event_group **evgs,
> > unsigned int num_evg)
> > {
> > struct pmt_feature_group *p = intel_pmt_get_regions_by_feature(feature);
> > struct event_group **peg;
> > bool ret = false;
> >
> > if (IS_ERR_OR_NULL(p))
> > return false;
> >
> > for (peg = evgs; peg < &evgs[num_evg]; peg++) {
> > if (enable_events(*peg, p)) {
> > kref_get(&p->kref);
>
> This is not clear to me ... would enable_events() still mark all telemetry_regions
> that do not match the event_group's guid as unusable? It seems to me that if more
> than one event_group refers to the same pmt_feature_group then the first one to match
> will "win" and make the other event_group's telemetry regions unusable.
Extra context needed. Sorry.
I'm changing enable_events() to only mark telemetry regions as
unusable if they have a bad package id, or the MMIO size doesn't match.
I.e. they truly are bad.
A mismatch on guid will skip them while associating with a specific
event_group, but leave them as usable.
This means that intel_aet_read_event() now has to check the guid as
well as !addr.
An alternative approach would be to ask the PMT code for separate
copies of the pmt_feature_group to attach to each event_group. I
didn't like this, do you think it would be better?
>
> > (*peg)->pfg = no_free_ptr(p);
> > ret = true;
> > }
> > }
> > intel_pmt_put_feature_group(p);
> >
> > return ret;
> > }
> >
>
> Reinette
>
-Tony
* Re: [PATCH v11 14/31] x86/resctrl: Discover hardware telemetry events
2025-10-06 21:47 ` Luck, Tony
@ 2025-10-07 20:47 ` Luck, Tony
2025-10-08 17:12 ` Reinette Chatre
0 siblings, 1 reply; 84+ messages in thread
From: Luck, Tony @ 2025-10-07 20:47 UTC (permalink / raw)
To: Reinette Chatre
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen Yu, x86, linux-kernel,
patches
On Mon, Oct 06, 2025 at 02:47:15PM -0700, Luck, Tony wrote:
> On Mon, Oct 06, 2025 at 02:33:00PM -0700, Reinette Chatre wrote:
> > Hi Tony,
> > > static bool get_pmt_feature(enum pmt_feature_id feature, struct event_group **evgs,
> > > unsigned int num_evg)
> > > {
> > > struct pmt_feature_group *p = intel_pmt_get_regions_by_feature(feature);
> > > struct event_group **peg;
> > > bool ret = false;
> > >
> > > if (IS_ERR_OR_NULL(p))
> > > return false;
> > >
> > > for (peg = evgs; peg < &evgs[num_evg]; peg++) {
> > > if (enable_events(*peg, p)) {
> > > kref_get(&p->kref);
> >
> > This is not clear to me ... would enable_events() still mark all telemetry_regions
> > that do not match the event_group's guid as unusable? It seems to me that if more
> > than one event_group refers to the same pmt_feature_group then the first one to match
> > will "win" and make the other event_group's telemetry regions unusable.
>
> Extra context needed. Sorry.
>
> I'm changing enable_events() to only mark telemetry_regions as
> unusable if they have a bad package id, or the MMIO size doesn't match.
> I.e. they truly are bad.
>
> A mismatch on guid will skip them while associating with a specific
> event_group, but leave them as usable.
>
> This means that intel_aet_read_event() now has to check the guid as
> well as !addr.
>
> An alternative approach would be to ask the PMT code for separate
> copies of the pmt_feature_group to attach to each event_group. I
> didn't like this, do you think it would be better?
Working through more patches in the series, I've come to the one
that adjusts the number of RMIDs. The alternative approach of
having a separate copy of the pmt_feature_group is suddenly looking
more attractive.
So the code would become:
static bool get_pmt_feature(enum pmt_feature_id feature, struct event_group **evgs,
unsigned int num_evg)
{
struct pmt_feature_group *p;
struct event_group **peg;
bool ret = false;
for (peg = evgs; peg < &evgs[num_evg]; peg++) {
p = intel_pmt_get_regions_by_feature(feature);
if (IS_ERR_OR_NULL(p))
return false;
if (enable_events(*peg, p)) {
(*peg)->pfg = p;
ret = true;
} else {
intel_pmt_put_feature_group(p);
}
}
intel_pmt_put_feature_group(p);
return ret;
}
>
> >
> > > (*peg)->pfg = no_free_ptr(p);
> > > ret = true;
> > > }
> > > }
> > > intel_pmt_put_feature_group(p);
> > >
> > > return ret;
> > > }
> > >
> >
> > Reinette
> >
>
> -Tony
-Tony
* Re: [PATCH v11 14/31] x86/resctrl: Discover hardware telemetry events
2025-10-07 20:47 ` Luck, Tony
@ 2025-10-08 17:12 ` Reinette Chatre
2025-10-08 17:20 ` Luck, Tony
0 siblings, 1 reply; 84+ messages in thread
From: Reinette Chatre @ 2025-10-08 17:12 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen Yu, x86, linux-kernel,
patches
Hi Tony,
On 10/7/25 1:47 PM, Luck, Tony wrote:
> On Mon, Oct 06, 2025 at 02:47:15PM -0700, Luck, Tony wrote:
>> On Mon, Oct 06, 2025 at 02:33:00PM -0700, Reinette Chatre wrote:
>>> Hi Tony,
>>>> static bool get_pmt_feature(enum pmt_feature_id feature, struct event_group **evgs,
>>>> unsigned int num_evg)
>>>> {
>>>> struct pmt_feature_group *p = intel_pmt_get_regions_by_feature(feature);
>>>> struct event_group **peg;
>>>> bool ret = false;
>>>>
>>>> if (IS_ERR_OR_NULL(p))
>>>> return false;
>>>>
>>>> for (peg = evgs; peg < &evgs[num_evg]; peg++) {
>>>> if (enable_events(*peg, p)) {
>>>> kref_get(&p->kref);
>>>
>>> This is not clear to me ... would enable_events() still mark all telemetry_regions
>>> that do not match the event_group's guid as unusable? It seems to me that if more
>>> than one event_group refers to the same pmt_feature_group then the first one to match
>>> will "win" and make the other event_group's telemetry regions unusable.
>>
>> Extra context needed. Sorry.
>>
>> I'm changing enable_events() to only mark telemetry_regions as
>> unusable if they have a bad package id, or the MMIO size doesn't match.
>> I.e. they truly are bad.
>>
>> A mismatch on guid will skip them while associating with a specific
>> event_group, but leave them as usable.
>>
>> This means that intel_aet_read_event() now has to check the guid as
>> well as !addr.
>>
>> An alternative approach would be to ask the PMT code for separate
>> copies of the pmt_feature_group to attach to each event_group. I
>> didn't like this, do you think it would be better?
>
> Working through more patches in the series, I've come to the one
> that adjusts the number of RMIDs. The alternative approach of
I see, with the number of RMIDs a property of the event group itself this
seems reasonable. While there is duplication of pmt_feature_group I am not
able to tell if this is a big issue since I am not clear on how/if systems
will be built this way.
> having a separate copy of the pmt_feature_group is suddenly looking
> more attractive.
>
> So the code would become:
>
>
> static bool get_pmt_feature(enum pmt_feature_id feature, struct event_group **evgs,
> unsigned int num_evg)
> {
> struct pmt_feature_group *p;
> struct event_group **peg;
> bool ret = false;
>
> for (peg = evgs; peg < &evgs[num_evg]; peg++) {
> p = intel_pmt_get_regions_by_feature(feature);
> if (IS_ERR_OR_NULL(p))
> return false;
>
> if (enable_events(*peg, p)) {
> (*peg)->pfg = p;
> ret = true;
> } else {
> intel_pmt_put_feature_group(p);
> }
> }
> intel_pmt_put_feature_group(p);
I am not able to tell why this "put" is needed? I assume the "put" of a
pmt_feature_group assigned to an event_group will still be done in
intel_aet_exit()?
Reinette
* RE: [PATCH v11 14/31] x86/resctrl: Discover hardware telemetry events
2025-10-08 17:12 ` Reinette Chatre
@ 2025-10-08 17:20 ` Luck, Tony
0 siblings, 0 replies; 84+ messages in thread
From: Luck, Tony @ 2025-10-08 17:20 UTC (permalink / raw)
To: Chatre, Reinette
Cc: Fenghua Yu, Wieczor-Retman, Maciej, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen, Yu C, x86@kernel.org,
linux-kernel@vger.kernel.org, patches@lists.linux.dev
> > static bool get_pmt_feature(enum pmt_feature_id feature, struct event_group **evgs,
> > unsigned int num_evg)
> > {
> > struct pmt_feature_group *p;
> > struct event_group **peg;
> > bool ret = false;
> >
> > for (peg = evgs; peg < &evgs[num_evg]; peg++) {
> > p = intel_pmt_get_regions_by_feature(feature);
> > if (IS_ERR_OR_NULL(p))
> > return false;
> >
> > if (enable_events(*peg, p)) {
> > (*peg)->pfg = p;
> > ret = true;
> > } else {
> > intel_pmt_put_feature_group(p);
> > }
> > }
> > intel_pmt_put_feature_group(p);
>
> I am not able to tell why this "put" is needed? I assume the "put" of a
> pmt_feature_group assigned to an event_group will still be done in
> intel_aet_exit()?
Reinette
That "put" was left over from the previous version. You are right it
isn't needed. The "put" will be done in intel_aet_exit().
-Tony
* [PATCH v11 15/31] x86,fs/resctrl: Fill in details of events for guid 0x26696143 and 0x26557651
2025-09-25 20:02 [PATCH v11 00/31] x86,fs/resctrl telemetry monitoring Tony Luck
` (13 preceding siblings ...)
2025-09-25 20:03 ` [PATCH v11 14/31] x86/resctrl: Discover hardware telemetry events Tony Luck
@ 2025-09-25 20:03 ` Tony Luck
2025-09-25 20:03 ` [PATCH v11 16/31] x86,fs/resctrl: Add architectural event pointer Tony Luck
` (15 subsequent siblings)
30 siblings, 0 replies; 84+ messages in thread
From: Tony Luck @ 2025-09-25 20:03 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
The Intel Clearwater Forest CPU supports two RMID-based PMT feature
groups documented in the xml/CWF/OOBMSM/RMID-ENERGY/cwf_aggregator.xml
and xml/CWF/OOBMSM/RMID-PERF/cwf_aggregator.xml files in the Intel PMT
GIT repository [1].
The counter offsets in MMIO space are arranged in groups for each RMID.
E.g. the "energy" counters for guid 0x26696143 are arranged like this:
MMIO offset:0x0000 Counter for RMID 0 PMT_EVENT_ENERGY
MMIO offset:0x0008 Counter for RMID 0 PMT_EVENT_ACTIVITY
MMIO offset:0x0010 Counter for RMID 1 PMT_EVENT_ENERGY
MMIO offset:0x0018 Counter for RMID 1 PMT_EVENT_ACTIVITY
...
MMIO offset:0x23F0 Counter for RMID 575 PMT_EVENT_ENERGY
MMIO offset:0x23F8 Counter for RMID 575 PMT_EVENT_ACTIVITY
After all counters there are three status registers that provide
indications of how many times an aggregator was unable to process
event counts, the time stamp for the most recent loss of data, and
the time stamp of the most recent successful update.
MMIO offset:0x2400 AGG_DATA_LOSS_COUNT
MMIO offset:0x2408 AGG_DATA_LOSS_TIMESTAMP
MMIO offset:0x2410 LAST_UPDATE_TIMESTAMP
Define these events in the file system code and add the events
to the event_group structures.
PMT_EVENT_ENERGY and PMT_EVENT_ACTIVITY are produced in fixed-point
format. File system code must output them as floating point values.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://github.com/intel/Intel-PMT # [1]
---
include/linux/resctrl_types.h | 11 +++++++
arch/x86/kernel/cpu/resctrl/intel_aet.c | 43 +++++++++++++++++++++++++
fs/resctrl/monitor.c | 35 +++++++++++---------
3 files changed, 74 insertions(+), 15 deletions(-)
diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h
index acfe07860b34..a5f56faa18d2 100644
--- a/include/linux/resctrl_types.h
+++ b/include/linux/resctrl_types.h
@@ -50,6 +50,17 @@ enum resctrl_event_id {
QOS_L3_MBM_TOTAL_EVENT_ID = 0x02,
QOS_L3_MBM_LOCAL_EVENT_ID = 0x03,
+ /* Intel Telemetry Events */
+ PMT_EVENT_ENERGY,
+ PMT_EVENT_ACTIVITY,
+ PMT_EVENT_STALLS_LLC_HIT,
+ PMT_EVENT_C1_RES,
+ PMT_EVENT_UNHALTED_CORE_CYCLES,
+ PMT_EVENT_STALLS_LLC_MISS,
+ PMT_EVENT_AUTO_C6_RES,
+ PMT_EVENT_UNHALTED_REF_CYCLES,
+ PMT_EVENT_UOPS_RETIRED,
+
/* Must be the last */
QOS_NUM_EVENTS,
};
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index 966c840f0d6b..f9b5f6cd08f8 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -13,6 +13,7 @@
#include <linux/array_size.h>
#include <linux/cleanup.h>
+#include <linux/compiler_types.h>
#include <linux/cpu.h>
#include <linux/err.h>
#include <linux/init.h>
@@ -20,11 +21,27 @@
#include <linux/intel_vsec.h>
#include <linux/overflow.h>
#include <linux/resctrl.h>
+#include <linux/resctrl_types.h>
#include <linux/stddef.h>
#include <linux/types.h>
#include "internal.h"
+/**
+ * struct pmt_event - Telemetry event.
+ * @id: Resctrl event id.
+ * @idx: Counter index within each per-RMID block of counters.
+ * @bin_bits: Zero for integer valued events, else number of bits in fraction
+ * part of fixed-point.
+ */
+struct pmt_event {
+ enum resctrl_event_id id;
+ unsigned int idx;
+ unsigned int bin_bits;
+};
+
+#define EVT(_id, _idx, _bits) { .id = _id, .idx = _idx, .bin_bits = _bits }
+
/**
* struct event_group - All information about a group of telemetry events.
* @pfg: Points to the aggregated telemetry space information
@@ -34,6 +51,9 @@
* Valid if the system supports the event group.
* NULL otherwise.
* @guid: Unique number per XML description file.
+ * @mmio_size: Number of bytes of MMIO registers for this group.
+ * @num_events: Number of events in this group.
+ * @evts: Array of event descriptors.
*/
struct event_group {
/* Data fields for additional structures to manage this group. */
@@ -41,14 +61,26 @@ struct event_group {
/* Remaining fields initialized from XML file. */
u32 guid;
+ size_t mmio_size;
+ unsigned int num_events;
+ struct pmt_event evts[] __counted_by(num_events);
};
+#define XML_MMIO_SIZE(num_rmids, num_events, num_extra_status) \
+ (((num_rmids) * (num_events) + (num_extra_status)) * sizeof(u64))
+
/*
* Link: https://github.com/intel/Intel-PMT
* File: xml/CWF/OOBMSM/RMID-ENERGY/cwf_aggregator.xml
*/
static struct event_group energy_0x26696143 = {
.guid = 0x26696143,
+ .mmio_size = XML_MMIO_SIZE(576, 2, 3),
+ .num_events = 2,
+ .evts = {
+ EVT(PMT_EVENT_ENERGY, 0, 18),
+ EVT(PMT_EVENT_ACTIVITY, 1, 18),
+ }
};
/*
@@ -57,6 +89,17 @@ static struct event_group energy_0x26696143 = {
*/
static struct event_group perf_0x26557651 = {
.guid = 0x26557651,
+ .mmio_size = XML_MMIO_SIZE(576, 7, 3),
+ .num_events = 7,
+ .evts = {
+ EVT(PMT_EVENT_STALLS_LLC_HIT, 0, 0),
+ EVT(PMT_EVENT_C1_RES, 1, 0),
+ EVT(PMT_EVENT_UNHALTED_CORE_CYCLES, 2, 0),
+ EVT(PMT_EVENT_STALLS_LLC_MISS, 3, 0),
+ EVT(PMT_EVENT_AUTO_C6_RES, 4, 0),
+ EVT(PMT_EVENT_UNHALTED_REF_CYCLES, 5, 0),
+ EVT(PMT_EVENT_UOPS_RETIRED, 6, 0),
+ }
};
static struct event_group *known_energy_event_groups[] = {
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index e354f01df615..d44b764853bf 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -954,27 +954,32 @@ static void dom_data_exit(struct rdt_resource *r)
mutex_unlock(&rdtgroup_mutex);
}
+#define MON_EVENT(_eventid, _name, _res, _fp) \
+ [_eventid] = { \
+ .name = _name, \
+ .evtid = _eventid, \
+ .rid = _res, \
+ .is_floating_point = _fp, \
+}
+
/*
* All available events. Architecture code marks the ones that
* are supported by a system using resctrl_enable_mon_event()
* to set .enabled.
*/
struct mon_evt mon_event_all[QOS_NUM_EVENTS] = {
- [QOS_L3_OCCUP_EVENT_ID] = {
- .name = "llc_occupancy",
- .evtid = QOS_L3_OCCUP_EVENT_ID,
- .rid = RDT_RESOURCE_L3,
- },
- [QOS_L3_MBM_TOTAL_EVENT_ID] = {
- .name = "mbm_total_bytes",
- .evtid = QOS_L3_MBM_TOTAL_EVENT_ID,
- .rid = RDT_RESOURCE_L3,
- },
- [QOS_L3_MBM_LOCAL_EVENT_ID] = {
- .name = "mbm_local_bytes",
- .evtid = QOS_L3_MBM_LOCAL_EVENT_ID,
- .rid = RDT_RESOURCE_L3,
- },
+ MON_EVENT(QOS_L3_OCCUP_EVENT_ID, "llc_occupancy", RDT_RESOURCE_L3, false),
+ MON_EVENT(QOS_L3_MBM_TOTAL_EVENT_ID, "mbm_total_bytes", RDT_RESOURCE_L3, false),
+ MON_EVENT(QOS_L3_MBM_LOCAL_EVENT_ID, "mbm_local_bytes", RDT_RESOURCE_L3, false),
+ MON_EVENT(PMT_EVENT_ENERGY, "core_energy", RDT_RESOURCE_PERF_PKG, true),
+ MON_EVENT(PMT_EVENT_ACTIVITY, "activity", RDT_RESOURCE_PERF_PKG, true),
+ MON_EVENT(PMT_EVENT_STALLS_LLC_HIT, "stalls_llc_hit", RDT_RESOURCE_PERF_PKG, false),
+ MON_EVENT(PMT_EVENT_C1_RES, "c1_res", RDT_RESOURCE_PERF_PKG, false),
+ MON_EVENT(PMT_EVENT_UNHALTED_CORE_CYCLES, "unhalted_core_cycles", RDT_RESOURCE_PERF_PKG, false),
+ MON_EVENT(PMT_EVENT_STALLS_LLC_MISS, "stalls_llc_miss", RDT_RESOURCE_PERF_PKG, false),
+ MON_EVENT(PMT_EVENT_AUTO_C6_RES, "c6_res", RDT_RESOURCE_PERF_PKG, false),
+ MON_EVENT(PMT_EVENT_UNHALTED_REF_CYCLES, "unhalted_ref_cycles", RDT_RESOURCE_PERF_PKG, false),
+ MON_EVENT(PMT_EVENT_UOPS_RETIRED, "uops_retired", RDT_RESOURCE_PERF_PKG, false),
};
void resctrl_enable_mon_event(enum resctrl_event_id eventid, bool any_cpu, unsigned int binary_bits)
--
2.51.0
* [PATCH v11 16/31] x86,fs/resctrl: Add architectural event pointer
2025-09-25 20:02 [PATCH v11 00/31] x86,fs/resctrl telemetry monitoring Tony Luck
` (14 preceding siblings ...)
2025-09-25 20:03 ` [PATCH v11 15/31] x86,fs/resctrl: Fill in details of events for guid 0x26696143 and 0x26557651 Tony Luck
@ 2025-09-25 20:03 ` Tony Luck
2025-10-03 23:38 ` Reinette Chatre
2025-09-25 20:03 ` [PATCH v11 17/31] x86/resctrl: Find and enable usable telemetry events Tony Luck
` (14 subsequent siblings)
30 siblings, 1 reply; 84+ messages in thread
From: Tony Luck @ 2025-09-25 20:03 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
The resctrl file system layer passes the domain, RMID, and event id to
resctrl_arch_rmid_read() to fetch an event counter.
Fetching a telemetry event counter requires additional information that
is private to the architecture, for example, the offset into MMIO space
from where the counter should be read.
Add mon_evt::arch_priv void pointer. Architecture code can initialize
this when marking each event enabled.
File system code passes this pointer to resctrl_arch_rmid_read().
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 7 +++++--
fs/resctrl/internal.h | 4 ++++
arch/x86/kernel/cpu/resctrl/core.c | 6 +++---
arch/x86/kernel/cpu/resctrl/monitor.c | 2 +-
fs/resctrl/monitor.c | 18 ++++++++++++------
5 files changed, 25 insertions(+), 12 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index ff67224b80c8..111c8f1dc77e 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -412,7 +412,7 @@ u32 resctrl_arch_system_num_rmid_idx(void);
int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
void resctrl_enable_mon_event(enum resctrl_event_id eventid, bool any_cpu,
- unsigned int binary_bits);
+ unsigned int binary_bits, void *arch_priv);
bool resctrl_is_mon_event_enabled(enum resctrl_event_id eventid);
@@ -529,6 +529,9 @@ void resctrl_arch_pre_mount(void);
* only.
* @rmid: rmid of the counter to read.
* @eventid: eventid to read, e.g. L3 occupancy.
+ * @arch_priv: Architecture private data for this event.
+ * The @arch_priv provided by the architecture via
+ * resctrl_enable_mon_event().
* @val: result of the counter read in bytes.
* @arch_mon_ctx: An architecture specific value from
* resctrl_arch_mon_ctx_alloc(), for MPAM this identifies
@@ -546,7 +549,7 @@ void resctrl_arch_pre_mount(void);
*/
int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain_hdr *hdr,
u32 closid, u32 rmid, enum resctrl_event_id eventid,
- u64 *val, void *arch_mon_ctx);
+ void *arch_priv, u64 *val, void *arch_mon_ctx);
/**
* resctrl_arch_rmid_read_context_check() - warn about invalid contexts
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 96d97f4ff957..aee6c4684f81 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -66,6 +66,9 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
* @binary_bits: number of fixed-point binary bits from architecture,
* only valid if @is_floating_point is true
* @enabled: true if the event is enabled
+ * @arch_priv: Architecture private data for this event.
+ * The @arch_priv provided by the architecture via
+ * resctrl_enable_mon_event().
*/
struct mon_evt {
enum resctrl_event_id evtid;
@@ -77,6 +80,7 @@ struct mon_evt {
bool is_floating_point;
unsigned int binary_bits;
bool enabled;
+ void *arch_priv;
};
extern struct mon_evt mon_event_all[QOS_NUM_EVENTS];
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 9003a6344410..588de539a739 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -911,15 +911,15 @@ static __init bool get_rdt_mon_resources(void)
bool ret = false;
if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC)) {
- resctrl_enable_mon_event(QOS_L3_OCCUP_EVENT_ID, false, 0);
+ resctrl_enable_mon_event(QOS_L3_OCCUP_EVENT_ID, false, 0, NULL);
ret = true;
}
if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
- resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID, false, 0);
+ resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID, false, 0, NULL);
ret = true;
}
if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL)) {
- resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID, false, 0);
+ resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID, false, 0, NULL);
ret = true;
}
if (rdt_cpu_has(X86_FEATURE_ABMC))
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ea81305fbc5d..175488185b06 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -240,7 +240,7 @@ static u64 get_corrected_val(struct rdt_resource *r, struct rdt_l3_mon_domain *d
int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain_hdr *hdr,
u32 unused, u32 rmid, enum resctrl_event_id eventid,
- u64 *val, void *ignored)
+ void *arch_priv, u64 *val, void *ignored)
{
struct rdt_l3_mon_domain *d;
u64 msr_val;
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index d44b764853bf..1eb054749d20 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -137,9 +137,11 @@ void __check_limbo(struct rdt_l3_mon_domain *d, bool force_free)
struct rmid_entry *entry;
u32 idx, cur_idx = 1;
void *arch_mon_ctx;
+ void *arch_priv;
bool rmid_dirty;
u64 val = 0;
+ arch_priv = mon_event_all[QOS_L3_OCCUP_EVENT_ID].arch_priv;
arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, QOS_L3_OCCUP_EVENT_ID);
if (IS_ERR(arch_mon_ctx)) {
pr_warn_ratelimited("Failed to allocate monitor context: %ld",
@@ -160,7 +162,7 @@ void __check_limbo(struct rdt_l3_mon_domain *d, bool force_free)
entry = __rmid_entry(idx);
if (resctrl_arch_rmid_read(r, &d->hdr, entry->closid, entry->rmid,
- QOS_L3_OCCUP_EVENT_ID, &val,
+ QOS_L3_OCCUP_EVENT_ID, arch_priv, &val,
arch_mon_ctx)) {
rmid_dirty = true;
} else {
@@ -480,7 +482,8 @@ static int __mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
rr->evt->evtid, &tval);
else
rr->err = resctrl_arch_rmid_read(rr->r, rr->hdr, closid, rmid,
- rr->evt->evtid, &tval, rr->arch_mon_ctx);
+ rr->evt->evtid, rr->evt->arch_priv,
+ &tval, rr->arch_mon_ctx);
if (rr->err)
return rr->err;
@@ -505,7 +508,8 @@ static int __mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
rr->evt->evtid, &tval);
else
err = resctrl_arch_rmid_read(rr->r, &d->hdr, closid, rmid,
- rr->evt->evtid, &tval, rr->arch_mon_ctx);
+ rr->evt->evtid, rr->evt->arch_priv,
+ &tval, rr->arch_mon_ctx);
if (!err) {
rr->val += tval;
ret = 0;
@@ -982,7 +986,8 @@ struct mon_evt mon_event_all[QOS_NUM_EVENTS] = {
MON_EVENT(PMT_EVENT_UOPS_RETIRED, "uops_retired", RDT_RESOURCE_PERF_PKG, false),
};
-void resctrl_enable_mon_event(enum resctrl_event_id eventid, bool any_cpu, unsigned int binary_bits)
+void resctrl_enable_mon_event(enum resctrl_event_id eventid, bool any_cpu,
+ unsigned int binary_bits, void *arch_priv)
{
if (WARN_ON_ONCE(eventid < QOS_FIRST_EVENT || eventid >= QOS_NUM_EVENTS ||
binary_bits > MAX_BINARY_BITS))
@@ -998,6 +1003,7 @@ void resctrl_enable_mon_event(enum resctrl_event_id eventid, bool any_cpu, unsig
mon_event_all[eventid].any_cpu = any_cpu;
mon_event_all[eventid].binary_bits = binary_bits;
+ mon_event_all[eventid].arch_priv = arch_priv;
mon_event_all[eventid].enabled = true;
}
@@ -1823,9 +1829,9 @@ int resctrl_l3_mon_resource_init(void)
if (r->mon.mbm_cntr_assignable) {
if (!resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID))
- resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID, false, 0);
+ resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID, false, 0, NULL);
if (!resctrl_is_mon_event_enabled(QOS_L3_MBM_LOCAL_EVENT_ID))
- resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID, false, 0);
+ resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID, false, 0, NULL);
mon_event_all[QOS_L3_MBM_TOTAL_EVENT_ID].evt_cfg = r->mon.mbm_cfg_mask;
mon_event_all[QOS_L3_MBM_LOCAL_EVENT_ID].evt_cfg = r->mon.mbm_cfg_mask &
(READS_TO_LOCAL_MEM |
--
2.51.0
* Re: [PATCH v11 16/31] x86,fs/resctrl: Add architectural event pointer
2025-09-25 20:03 ` [PATCH v11 16/31] x86,fs/resctrl: Add architectural event pointer Tony Luck
@ 2025-10-03 23:38 ` Reinette Chatre
0 siblings, 0 replies; 84+ messages in thread
From: Reinette Chatre @ 2025-10-03 23:38 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 9/25/25 1:03 PM, Tony Luck wrote:
> The resctrl file system layer passes the domain, RMID, and event id to
> resctrl_arch_rmid_read() to fetch an event counter.
>
> Fetching a telemetry event counter requires additional information that
> is private to the architecture, for example, the offset into MMIO space
> from where counter should be read.
>
> Add mon_evt::arch_priv void pointer. Architecture code can initialize
> this when marking each event enabled.
>
> File system code passes this pointer to resctrl_arch_rmid_read().
A suggestion to avoid describing code that can be seen from the patch:
The resctrl file system layer passes the domain, RMID, and event id to
the architecture to fetch an event counter.
Fetching a telemetry event counter requires additional information that
is private to the architecture, for example, the offset into MMIO space
from where the counter should be read.
Add mon_evt::arch_priv that architecture can use for any private
data related to the event. resctrl filesystem initializes mon_evt::arch_priv
when the architecture enables the event and passes it back to architecture
when needing to fetch an event counter.
>
> Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
Reinette
* [PATCH v11 17/31] x86/resctrl: Find and enable usable telemetry events
2025-09-25 20:02 [PATCH v11 00/31] x86,fs/resctrl telemetry monitoring Tony Luck
` (15 preceding siblings ...)
2025-09-25 20:03 ` [PATCH v11 16/31] x86,fs/resctrl: Add architectural event pointer Tony Luck
@ 2025-09-25 20:03 ` Tony Luck
2025-10-03 23:52 ` Reinette Chatre
2025-09-25 20:03 ` [PATCH v11 18/31] fs/resctrl: Refactor L3 specific parts of __mon_event_count() Tony Luck
` (13 subsequent siblings)
30 siblings, 1 reply; 84+ messages in thread
From: Tony Luck @ 2025-09-25 20:03 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
The INTEL_PMT_TELEMETRY driver provides telemetry region structures of the
types requested by resctrl.
Scan these structures to discover which pass sanity checks to derive
a list of valid regions:
1) They have guid known to resctrl.
2) They have a valid package ID.
3) The enumerated size of the MMIO region matches the expected
value from the XML description file.
4) At least one region passes the above checks.
For each valid region enable all the events in the associated
event_group::evts[].
Pass a pointer to the pmt_event structure of the event within the struct
event_group that resctrl stores in mon_evt::arch_priv. resctrl passes
this pointer back when asking to read the event data which enables the
data to be found in MMIO.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/intel_aet.c | 38 +++++++++++++++++++++++--
1 file changed, 36 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index f9b5f6cd08f8..98ba9ba05ee5 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -20,9 +20,11 @@
#include <linux/intel_pmt_features.h>
#include <linux/intel_vsec.h>
#include <linux/overflow.h>
+#include <linux/printk.h>
#include <linux/resctrl.h>
#include <linux/resctrl_types.h>
#include <linux/stddef.h>
+#include <linux/topology.h>
#include <linux/types.h>
#include "internal.h"
@@ -114,12 +116,44 @@ static struct event_group *known_perf_event_groups[] = {
for (_peg = (_grp); _peg < &_grp[ARRAY_SIZE(_grp)]; _peg++) \
if ((*_peg)->pfg)
-/* Stub for now */
-static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
+static bool skip_telem_region(struct telemetry_region *tr, struct event_group *e)
{
+ if (tr->guid != e->guid)
+ return true;
+ if (tr->plat_info.package_id >= topology_max_packages()) {
+ pr_warn("Bad package %u in guid 0x%x\n", tr->plat_info.package_id,
+ tr->guid);
+ return true;
+ }
+ if (tr->size != e->mmio_size) {
+ pr_warn("MMIO space wrong size (%zu bytes) for guid 0x%x. Expected %zu bytes.\n",
+ tr->size, e->guid, e->mmio_size);
+ return true;
+ }
+
return false;
}
+static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
+{
+ bool usable_events = false;
+
+ for (int i = 0; i < p->count; i++) {
+ if (skip_telem_region(&p->regions[i], e))
+ continue;
+ usable_events = true;
+ }
+
+ if (!usable_events)
+ return false;
+
+ for (int j = 0; j < e->num_events; j++)
+ resctrl_enable_mon_event(e->evts[j].id, true,
+ e->evts[j].bin_bits, &e->evts[j]);
+
+ return true;
+}
+
DEFINE_FREE(intel_pmt_put_feature_group, struct pmt_feature_group *,
if (!IS_ERR_OR_NULL(_T))
intel_pmt_put_feature_group(_T))
--
2.51.0
^ permalink raw reply related [flat|nested] 84+ messages in thread* Re: [PATCH v11 17/31] x86/resctrl: Find and enable usable telemetry events
2025-09-25 20:03 ` [PATCH v11 17/31] x86/resctrl: Find and enable usable telemetry events Tony Luck
@ 2025-10-03 23:52 ` Reinette Chatre
2025-10-06 19:58 ` Luck, Tony
0 siblings, 1 reply; 84+ messages in thread
From: Reinette Chatre @ 2025-10-03 23:52 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 9/25/25 1:03 PM, Tony Luck wrote:
> The INTEL_PMT_TELEMETRY driver provides telemetry region structures of the
> types requested by resctrl.
>
> Scan these structures to discover which pass sanity checks to derive
> a list of valid regions:
The "to derive a list of valid regions" does not align with the
"At least one region passes the above checks" requirement. If this is about
valid (usable?) regions then I think (4) should be dropped. If this is instead about
valid events then above should be reworded to say that instead.
>
> 1) They have guid known to resctrl.
> 2) They have a valid package ID.
> 3) The enumerated size of the MMIO region matches the expected
> value from the XML description file.
> 4) At least one region passes the above checks.
>
Everything below is clear by looking at the patch. It can also be seen from the patch
that enabling is done only once if there is *any* valid region instead of "for each
valid region". One thing that may be useful to add is "why" all events
can be enabled. If I understand correctly it can be something like:
Enable events that usable telemetry regions are responsible for.
> For each valid region enable all the events in the associated
> event_group::evts[].
>
> Pass a pointer to the pmt_event structure of the event within the struct
> event_group that resctrl stores in mon_evt::arch_priv. resctrl passes
> this pointer back when asking to read the event data which enables the
> data to be found in MMIO.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/kernel/cpu/resctrl/intel_aet.c | 38 +++++++++++++++++++++++--
> 1 file changed, 36 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> index f9b5f6cd08f8..98ba9ba05ee5 100644
> --- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
> +++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> @@ -20,9 +20,11 @@
> #include <linux/intel_pmt_features.h>
> #include <linux/intel_vsec.h>
> #include <linux/overflow.h>
> +#include <linux/printk.h>
> #include <linux/resctrl.h>
> #include <linux/resctrl_types.h>
> #include <linux/stddef.h>
> +#include <linux/topology.h>
> #include <linux/types.h>
>
> #include "internal.h"
> @@ -114,12 +116,44 @@ static struct event_group *known_perf_event_groups[] = {
> for (_peg = (_grp); _peg < &_grp[ARRAY_SIZE(_grp)]; _peg++) \
> if ((*_peg)->pfg)
>
> -/* Stub for now */
> -static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
> +static bool skip_telem_region(struct telemetry_region *tr, struct event_group *e)
> {
> + if (tr->guid != e->guid)
> + return true;
> + if (tr->plat_info.package_id >= topology_max_packages()) {
> + pr_warn("Bad package %u in guid 0x%x\n", tr->plat_info.package_id,
> + tr->guid);
> + return true;
> + }
I have not encountered any mention of the possibility that packages may differ
in which telemetry region types they support. For example, could it be possible for package
A to have usable regions of the PERF type but package B doesn't? From what I can tell
INTEL_PMT_TELEMETRY supports layouts where this can be possible. If I understand correctly
this implementation will create event files for these domains but when the user attempts to
read the data it will fail. Can this work add some snippet about possibility of this
scenario and if/how it is supported?
> + if (tr->size != e->mmio_size) {
> + pr_warn("MMIO space wrong size (%zu bytes) for guid 0x%x. Expected %zu bytes.\n",
> + tr->size, e->guid, e->mmio_size);
> + return true;
> + }
> +
> return false;
> }
>
> +static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
> +{
> + bool usable_events = false;
> +
> + for (int i = 0; i < p->count; i++) {
> + if (skip_telem_region(&p->regions[i], e))
> + continue;
> + usable_events = true;
A previous concern [1] was why this loop does not break out at this point. I think it will
help to make this clear if marking a telemetry region as unusable (mark_telem_region_unusable())
is done in this patch. Doing so makes the "usable" and "unusable" distinction in one
patch while making clear that the loop needs to complete.
> + }
> +
> + if (!usable_events)
> + return false;
> +
> + for (int j = 0; j < e->num_events; j++)
> + resctrl_enable_mon_event(e->evts[j].id, true,
> + e->evts[j].bin_bits, &e->evts[j]);
> +
> + return true;
> +}
> +
> DEFINE_FREE(intel_pmt_put_feature_group, struct pmt_feature_group *,
> if (!IS_ERR_OR_NULL(_T))
> intel_pmt_put_feature_group(_T))
Reinette
[1] https://lore.kernel.org/lkml/9ac43e78-8955-db5d-61be-e08008e41f0d@linux.intel.com/
^ permalink raw reply [flat|nested] 84+ messages in thread

* Re: [PATCH v11 17/31] x86/resctrl: Find and enable usable telemetry events
2025-10-03 23:52 ` Reinette Chatre
@ 2025-10-06 19:58 ` Luck, Tony
2025-10-06 21:33 ` Reinette Chatre
0 siblings, 1 reply; 84+ messages in thread
From: Luck, Tony @ 2025-10-06 19:58 UTC (permalink / raw)
To: Reinette Chatre
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen Yu, x86, linux-kernel,
patches
On Fri, Oct 03, 2025 at 04:52:01PM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 9/25/25 1:03 PM, Tony Luck wrote:
> > The INTEL_PMT_TELEMETRY driver provides telemetry region structures of the
> > types requested by resctrl.
> >
> > Scan these structures to discover which pass sanity checks to derive
> > a list of valid regions:
>
> The "to derive a list of valid regions" does not align with the
> "At least one region passes the above checks" requirement. If this is about
> valid (usable?) regions then I think (4) should be dropped. If this is instead about
> valid events then above should be reworded to say that instead.
Will drop "4".
>
> >
> > 1) They have guid known to resctrl.
> > 2) They have a valid package ID.
> > 3) The enumerated size of the MMIO region matches the expected
> > value from the XML description file.
> > 4) At least one region passes the above checks.
> >
>
> Everything below is clear by looking at the patch. It can also be seen from patch
> that enabling is done only once if there is *any* valid region instead of "for each
> valid region". One thing that may be useful to add is "why" all events
> can be enabled. If I understand correctly it can be something like:
>
> Enable events that usable telemetry regions are responsible for.
Looks better. I will use this.
>
> > For each valid region enable all the events in the associated
> > event_group::evts[].
> >
> > Pass a pointer to the pmt_event structure of the event within the struct
> > event_group that resctrl stores in mon_evt::arch_priv. resctrl passes
> > this pointer back when asking to read the event data which enables the
> > data to be found in MMIO.
> >
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > ---
> > arch/x86/kernel/cpu/resctrl/intel_aet.c | 38 +++++++++++++++++++++++--
> > 1 file changed, 36 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> > index f9b5f6cd08f8..98ba9ba05ee5 100644
> > --- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
> > +++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> > @@ -20,9 +20,11 @@
> > #include <linux/intel_pmt_features.h>
> > #include <linux/intel_vsec.h>
> > #include <linux/overflow.h>
> > +#include <linux/printk.h>
> > #include <linux/resctrl.h>
> > #include <linux/resctrl_types.h>
> > #include <linux/stddef.h>
> > +#include <linux/topology.h>
> > #include <linux/types.h>
> >
> > #include "internal.h"
> > @@ -114,12 +116,44 @@ static struct event_group *known_perf_event_groups[] = {
> > for (_peg = (_grp); _peg < &_grp[ARRAY_SIZE(_grp)]; _peg++) \
> > if ((*_peg)->pfg)
> >
> > -/* Stub for now */
> > -static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
> > +static bool skip_telem_region(struct telemetry_region *tr, struct event_group *e)
> > {
> > + if (tr->guid != e->guid)
> > + return true;
> > + if (tr->plat_info.package_id >= topology_max_packages()) {
> > + pr_warn("Bad package %u in guid 0x%x\n", tr->plat_info.package_id,
> > + tr->guid);
> > + return true;
> > + }
>
> I have not encountered any mention of the possibility that packages may differ
> in which telemetry region types they support. For example, could it be possible for package
> A to have usable regions of the PERF type but package B doesn't? From what I can tell
> INTEL_PMT_TELEMETRY supports layouts where this can be possible. If I understand correctly
> this implementation will create event files for these domains but when the user attempts to
> read the data it will fail. Can this work add some snippet about possibility of this
> scenario and if/how it is supported?
Yes, this is architecturally possible. But I do not expect that systems will
be built that do this. You are right that such a system will create files that
always return "Unavailable" when read.
Is it sufficient to document this in the commit message?
I don't feel that it would be worthwhile to suppress creation of these files for
a "can't happen" situation. I'm not sure that doing so would be significantly
better from a user interface perspective. Users would get slightly more notice
(-ENOENT when trying to open the file). But the code would require
architecture calls from file system code to check which files need to be created
separately for each domain.
>
> > + if (tr->size != e->mmio_size) {
> > + pr_warn("MMIO space wrong size (%zu bytes) for guid 0x%x. Expected %zu bytes.\n",
> > + tr->size, e->guid, e->mmio_size);
> > + return true;
> > + }
> > +
> > return false;
> > }
> >
> > +static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
> > +{
> > + bool usable_events = false;
> > +
> > + for (int i = 0; i < p->count; i++) {
> > + if (skip_telem_region(&p->regions[i], e))
> > + continue;
> > + usable_events = true;
>
> A previous concern [1] was why this loop does not break out at this point. I think it will
> help to make this clear if marking a telemetry region as unusable (mark_telem_region_unusable())
> is done in this patch. Doing so makes the "usable" and "unusable" distinction in one
> patch while making clear that the loop needs to complete.
Ok. I'll pull mark_telem_region_unusable() into this patch.
> > + }
> > +
> > + if (!usable_events)
> > + return false;
> > +
> > + for (int j = 0; j < e->num_events; j++)
> > + resctrl_enable_mon_event(e->evts[j].id, true,
> > + e->evts[j].bin_bits, &e->evts[j]);
> > +
> > + return true;
> > +}
> > +
> > DEFINE_FREE(intel_pmt_put_feature_group, struct pmt_feature_group *,
> > if (!IS_ERR_OR_NULL(_T))
> > intel_pmt_put_feature_group(_T))
>
> Reinette
>
> [1] https://lore.kernel.org/lkml/9ac43e78-8955-db5d-61be-e08008e41f0d@linux.intel.com/
-Tony
* Re: [PATCH v11 17/31] x86/resctrl: Find and enable usable telemetry events
2025-10-06 19:58 ` Luck, Tony
@ 2025-10-06 21:33 ` Reinette Chatre
2025-10-06 21:54 ` Luck, Tony
0 siblings, 1 reply; 84+ messages in thread
From: Reinette Chatre @ 2025-10-06 21:33 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen Yu, x86, linux-kernel,
patches
Hi Tony,
On 10/6/25 12:58 PM, Luck, Tony wrote:
> On Fri, Oct 03, 2025 at 04:52:01PM -0700, Reinette Chatre wrote:
>> On 9/25/25 1:03 PM, Tony Luck wrote:
>>> @@ -114,12 +116,44 @@ static struct event_group *known_perf_event_groups[] = {
>>> for (_peg = (_grp); _peg < &_grp[ARRAY_SIZE(_grp)]; _peg++) \
>>> if ((*_peg)->pfg)
>>>
>>> -/* Stub for now */
>>> -static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
>>> +static bool skip_telem_region(struct telemetry_region *tr, struct event_group *e)
>>> {
>>> + if (tr->guid != e->guid)
>>> + return true;
>>> + if (tr->plat_info.package_id >= topology_max_packages()) {
>>> + pr_warn("Bad package %u in guid 0x%x\n", tr->plat_info.package_id,
>>> + tr->guid);
>>> + return true;
>>> + }
>>
>> I have not encountered any mention of the possibility that packages may differ
>> in which telemetry region types they support. For example, could it be possible for package
>> A to have usable regions of the PERF type but package B doesn't? From what I can tell
>> INTEL_PMT_TELEMETRY supports layouts where this can be possible. If I understand correctly
>> this implementation will create event files for these domains but when the user attempts to
>> read the data it will fail. Can this work add some snippet about possibility of this
>> scenario and if/how it is supported?
>
> Yes, this is architecturally possible. But I do not expect that systems will
> be built that do this. You are right that such a system will create files that
> always return "Unavailable" when read.
>
> Is it sufficient to document this in the commit message?
>
> I don't feel that it would be worthwhile to suppress creation of these files for
> a "can't happen" situation. I'm not sure that doing so would be significantly
> better from a user interface perspective. Users would get slightly more notice
> (-ENOENT when trying to open the file). But the code would require
> architecture calls from file system code to check which files need to be created
> separately for each domain.
I think it is sufficient to document this in the commit message to help create
confidence in the robustness of support for different scenarios. I have not encountered such
a system but could this scenario be similar to one where a two socket system supports MBM
but only one socket has memory populated? I do not know what reading the counter MSR will
return in this case though.
>
>>
>>> + if (tr->size != e->mmio_size) {
>>> + pr_warn("MMIO space wrong size (%zu bytes) for guid 0x%x. Expected %zu bytes.\n",
>>> + tr->size, e->guid, e->mmio_size);
>>> + return true;
>>> + }
>>> +
>>> return false;
>>> }
>>>
>>> +static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
>>> +{
>>> + bool usable_events = false;
>>> +
>>> + for (int i = 0; i < p->count; i++) {
>>> + if (skip_telem_region(&p->regions[i], e))
>>> + continue;
>>> + usable_events = true;
>>
>> A previous concern [1] was why this loop does not break out at this point. I think it will
>> help to make this clear if marking a telemetry region as unusable (mark_telem_region_unusable())
>> is done in this patch. Doing so makes the "usable" and "unusable" distinction in one
>> patch while making clear that the loop needs to complete.
>
> Ok. I'll pull mark_telem_region_unusable() into this patch.
Thank you.
Reinette
* RE: [PATCH v11 17/31] x86/resctrl: Find and enable usable telemetry events
2025-10-06 21:33 ` Reinette Chatre
@ 2025-10-06 21:54 ` Luck, Tony
0 siblings, 0 replies; 84+ messages in thread
From: Luck, Tony @ 2025-10-06 21:54 UTC (permalink / raw)
To: Chatre, Reinette
Cc: Fenghua Yu, Wieczor-Retman, Maciej, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen, Yu C, x86@kernel.org,
linux-kernel@vger.kernel.org, patches@lists.linux.dev
> > I don't feel that it would be worthwhile to suppress creation of these files for
> > a "can't happen" situation. I'm not sure that doing so would be significantly
> > better from a user interface perspective. Users would get slightly more notice
> > (-ENOENT when trying to open the file). But the code would require
> > architecture calls from file system code to check which files need to be created
> > separately for each domain.
>
> I think it is sufficient to document this in the commit message to help create
> confidence in robustness in support of different scenarios. I have not encountered such
> a system but could this scenario be similar to one where a two socket system supports MBM
> but only one socket has memory populated? I do not know what reading the counter MSR will
> return in this case though.
The counting h/w is likely unaware of whether DIMM slots are populated or not. My guess
would be that in this case the counters would read as zero forever, rather than "unavailable".
CXL supports hot-add memory. So whether a node has memory could change at runtime
on a system with CXL support.
-Tony
* [PATCH v11 18/31] fs/resctrl: Refactor L3 specific parts of __mon_event_count()
2025-09-25 20:02 [PATCH v11 00/31] x86,fs/resctrl telemetry monitoring Tony Luck
` (16 preceding siblings ...)
2025-09-25 20:03 ` [PATCH v11 17/31] x86/resctrl: Find and enable usable telemetry events Tony Luck
@ 2025-09-25 20:03 ` Tony Luck
2025-10-03 23:56 ` Reinette Chatre
2025-09-25 20:03 ` [PATCH v11 19/31] x86/resctrl: Read telemetry events Tony Luck
` (12 subsequent siblings)
30 siblings, 1 reply; 84+ messages in thread
From: Tony Luck @ 2025-09-25 20:03 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
The "MBM counter assignment" and "reset counter on first read" features
are only applicable to the RDT_RESOURCE_L3 resource.
Add a check for the RDT_RESOURCE_L3 resource.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
fs/resctrl/monitor.c | 38 ++++++++++++++++++++------------------
1 file changed, 20 insertions(+), 18 deletions(-)
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 1eb054749d20..d484983c0f02 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -453,27 +453,29 @@ static int __mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
if (!cpu_on_correct_domain(rr))
return -EINVAL;
- if (!domain_header_is_valid(rr->hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
- return -EINVAL;
- d = container_of(rr->hdr, struct rdt_l3_mon_domain, hdr);
-
- if (rr->is_mbm_cntr) {
- cntr_id = mbm_cntr_get(rr->r, d, rdtgrp, rr->evt->evtid);
- if (cntr_id < 0) {
- rr->err = -ENOENT;
+ if (rr->r->rid == RDT_RESOURCE_L3) {
+ if (!domain_header_is_valid(rr->hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
return -EINVAL;
+ d = container_of(rr->hdr, struct rdt_l3_mon_domain, hdr);
+
+ if (rr->is_mbm_cntr) {
+ cntr_id = mbm_cntr_get(rr->r, d, rdtgrp, rr->evt->evtid);
+ if (cntr_id < 0) {
+ rr->err = -ENOENT;
+ return -EINVAL;
+ }
}
- }
- if (rr->first) {
- if (rr->is_mbm_cntr)
- resctrl_arch_reset_cntr(rr->r, d, closid, rmid, cntr_id, rr->evt->evtid);
- else
- resctrl_arch_reset_rmid(rr->r, d, closid, rmid, rr->evt->evtid);
- m = get_mbm_state(d, closid, rmid, rr->evt->evtid);
- if (m)
- memset(m, 0, sizeof(struct mbm_state));
- return 0;
+ if (rr->first) {
+ if (rr->is_mbm_cntr)
+ resctrl_arch_reset_cntr(rr->r, d, closid, rmid, cntr_id, rr->evt->evtid);
+ else
+ resctrl_arch_reset_rmid(rr->r, d, closid, rmid, rr->evt->evtid);
+ m = get_mbm_state(d, closid, rmid, rr->evt->evtid);
+ if (m)
+ memset(m, 0, sizeof(struct mbm_state));
+ return 0;
+ }
}
if (rr->hdr) {
--
2.51.0
* Re: [PATCH v11 18/31] fs/resctrl: Refactor L3 specific parts of __mon_event_count()
2025-09-25 20:03 ` [PATCH v11 18/31] fs/resctrl: Refactor L3 specific parts of __mon_event_count() Tony Luck
@ 2025-10-03 23:56 ` Reinette Chatre
0 siblings, 0 replies; 84+ messages in thread
From: Reinette Chatre @ 2025-10-03 23:56 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 9/25/25 1:03 PM, Tony Luck wrote:
> The "MBM counter assignment" and "reset counter on first read" features
> are only applicable to the RDT_RESOURCE_L3 resource.
>
> Add a check for the RDT_RESOURCE_L3 resource.
Why?
"Add a check for the RDT_RESOURCE_L3 resource" can be seen from code.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> fs/resctrl/monitor.c | 38 ++++++++++++++++++++------------------
> 1 file changed, 20 insertions(+), 18 deletions(-)
>
> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> index 1eb054749d20..d484983c0f02 100644
> --- a/fs/resctrl/monitor.c
> +++ b/fs/resctrl/monitor.c
> @@ -453,27 +453,29 @@ static int __mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
> if (!cpu_on_correct_domain(rr))
> return -EINVAL;
>
> - if (!domain_header_is_valid(rr->hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
> - return -EINVAL;
> - d = container_of(rr->hdr, struct rdt_l3_mon_domain, hdr);
> -
> - if (rr->is_mbm_cntr) {
> - cntr_id = mbm_cntr_get(rr->r, d, rdtgrp, rr->evt->evtid);
> - if (cntr_id < 0) {
> - rr->err = -ENOENT;
> + if (rr->r->rid == RDT_RESOURCE_L3) {
> + if (!domain_header_is_valid(rr->hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
> return -EINVAL;
This snippet keeps moving around but it remains problematic from what I can tell since rr->hdr
can still be NULL here.
> + d = container_of(rr->hdr, struct rdt_l3_mon_domain, hdr);
> +
> + if (rr->is_mbm_cntr) {
> + cntr_id = mbm_cntr_get(rr->r, d, rdtgrp, rr->evt->evtid);
> + if (cntr_id < 0) {
> + rr->err = -ENOENT;
> + return -EINVAL;
> + }
> }
> - }
>
> - if (rr->first) {
> - if (rr->is_mbm_cntr)
> - resctrl_arch_reset_cntr(rr->r, d, closid, rmid, cntr_id, rr->evt->evtid);
> - else
> - resctrl_arch_reset_rmid(rr->r, d, closid, rmid, rr->evt->evtid);
> - m = get_mbm_state(d, closid, rmid, rr->evt->evtid);
> - if (m)
> - memset(m, 0, sizeof(struct mbm_state));
> - return 0;
> + if (rr->first) {
> + if (rr->is_mbm_cntr)
> + resctrl_arch_reset_cntr(rr->r, d, closid, rmid, cntr_id, rr->evt->evtid);
> + else
> + resctrl_arch_reset_rmid(rr->r, d, closid, rmid, rr->evt->evtid);
> + m = get_mbm_state(d, closid, rmid, rr->evt->evtid);
> + if (m)
> + memset(m, 0, sizeof(struct mbm_state));
> + return 0;
> + }
> }
__mon_event_count() is now unreasonably complicated. One motivation from the changelog is that
"MBM counter assignment is only applicable to RDT_RESOURCE_L3", but the change then adds a
RDT_RESOURCE_L3 check to only some of the parts of __mon_event_count() that deal with
counter assignment and leaves other parts to rely on rmid_read properties. It also
fails to mention or make explicit that all the special SNC handling is only applicable to
L3, leaving that code looking generic, so a reader needs to be very familiar with
the code to know differently. I already find SNC a challenge and this increases that
complexity unnecessarily.
PERF_PKG only needs to reach a handful of lines in __mon_event_count() to read a counter.
I thus think the best way forward is to split __mon_event_count() and move all the L3 code
into its own function.
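For illustration, a split along those lines might look roughly like the following
(hypothetical helper name __mon_event_count_l3(), elided details marked "...";
a sketch against the quoted code, not a compilable patch):

```c
/* All L3-only logic (counter assignment, first-read reset, SNC
 * handling) would move into a single helper, keeping the generic
 * path short for PERF_PKG and other non-L3 resources. */
static int __mon_event_count_l3(struct rdtgroup *rdtgrp, struct rmid_read *rr)
{
	struct rdt_l3_mon_domain *d;

	if (!domain_header_is_valid(rr->hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
		return -EINVAL;
	d = container_of(rr->hdr, struct rdt_l3_mon_domain, hdr);
	/* ... counter assignment, rr->first reset, SNC summing ... */
	return 0;
}

static int __mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
{
	if (!cpu_on_correct_domain(rr))
		return -EINVAL;

	if (rr->r->rid == RDT_RESOURCE_L3)
		return __mon_event_count_l3(rdtgrp, rr);

	/* Non-L3 resources: plain counter read, no L3/SNC special cases. */
	/* ... */
	return 0;
}
```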
Reinette
* [PATCH v11 19/31] x86/resctrl: Read telemetry events
2025-09-25 20:02 [PATCH v11 00/31] x86,fs/resctrl telemetry monitoring Tony Luck
` (17 preceding siblings ...)
2025-09-25 20:03 ` [PATCH v11 18/31] fs/resctrl: Refactor L3 specific parts of __mon_event_count() Tony Luck
@ 2025-09-25 20:03 ` Tony Luck
2025-09-25 20:03 ` [PATCH v11 20/31] fs/resctrl: Refactor Sub-NUMA Cluster (SNC) in mkdir/rmdir code flow Tony Luck
` (11 subsequent siblings)
30 siblings, 0 replies; 84+ messages in thread
From: Tony Luck @ 2025-09-25 20:03 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Telemetry events are enabled during the first mount of the resctrl
file system.
Mark telemetry regions that did not pass the sanity checks by
clearing their MMIO address fields so that they will not be
used when reading events.
Introduce intel_aet_read_event() to read telemetry events for resource
RDT_RESOURCE_PERF_PKG. There may be multiple aggregators tracking each
package, so scan all of them and add up all counters. Aggregators may
return an invalid data indication if they have received no records for
a given RMID. Return success to the user if one or more aggregators
provide valid data.
Resctrl now uses readq() so depends on X86_64. Update Kconfig.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/internal.h | 7 +++
arch/x86/kernel/cpu/resctrl/intel_aet.c | 65 ++++++++++++++++++++++++-
arch/x86/kernel/cpu/resctrl/monitor.c | 3 ++
arch/x86/Kconfig | 2 +-
4 files changed, 75 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 886261a82b81..97616c81682b 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -220,9 +220,16 @@ void resctrl_arch_mbm_cntr_assign_set_one(struct rdt_resource *r);
#ifdef CONFIG_X86_CPU_RESCTRL_INTEL_AET
bool intel_aet_get_events(void);
void __exit intel_aet_exit(void);
+int intel_aet_read_event(int domid, u32 rmid, enum resctrl_event_id evtid,
+ void *arch_priv, u64 *val);
#else
static inline bool intel_aet_get_events(void) { return false; }
static inline void __exit intel_aet_exit(void) { }
+static inline int intel_aet_read_event(int domid, u32 rmid, enum resctrl_event_id evtid,
+ void *arch_priv, u64 *val)
+{
+ return -EINVAL;
+}
#endif
#endif /* _ASM_X86_RESCTRL_INTERNAL_H */
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index 98ba9ba05ee5..d53211ac6204 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -12,13 +12,17 @@
#define pr_fmt(fmt) "resctrl: " fmt
#include <linux/array_size.h>
+#include <linux/bits.h>
#include <linux/cleanup.h>
#include <linux/compiler_types.h>
+#include <linux/container_of.h>
#include <linux/cpu.h>
#include <linux/err.h>
+#include <linux/errno.h>
#include <linux/init.h>
#include <linux/intel_pmt_features.h>
#include <linux/intel_vsec.h>
+#include <linux/io.h>
#include <linux/overflow.h>
#include <linux/printk.h>
#include <linux/resctrl.h>
@@ -134,13 +138,28 @@ static bool skip_telem_region(struct telemetry_region *tr, struct event_group *e
return false;
}
+/*
+ * Clear the address field of regions that did not pass the checks in
+ * skip_telem_region() so they will not be used by intel_aet_read_event().
+ * This is safe to do because intel_pmt_get_regions_by_feature() allocates
+ * a new pmt_feature_group structure to return to each caller and only makes
+ * use of the pmt_feature_group::kref field when intel_pmt_put_feature_group()
+ * returns the structure.
+ */
+static void mark_telem_region_unusable(struct telemetry_region *tr)
+{
+ tr->addr = NULL;
+}
+
static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
{
bool usable_events = false;
for (int i = 0; i < p->count; i++) {
- if (skip_telem_region(&p->regions[i], e))
+ if (skip_telem_region(&p->regions[i], e)) {
+ mark_telem_region_unusable(&p->regions[i]);
continue;
+ }
usable_events = true;
}
@@ -219,3 +238,47 @@ void __exit intel_aet_exit(void)
(*peg)->pfg = NULL;
}
}
+
+#define DATA_VALID BIT_ULL(63)
+#define DATA_BITS GENMASK_ULL(62, 0)
+
+/*
+ * Read counter for an event on a domain (summing all aggregators
+ * on the domain). If an aggregator hasn't received any data for a
+ * specific RMID, the MMIO read indicates that data is not valid.
+ * Return success if at least one aggregator has valid data.
+ */
+int intel_aet_read_event(int domid, u32 rmid, enum resctrl_event_id eventid,
+ void *arch_priv, u64 *val)
+{
+ struct pmt_event *pevt = arch_priv;
+ struct event_group *e;
+ bool valid = false;
+ u64 evtcount;
+ void *pevt0;
+ u32 idx;
+
+ pevt0 = pevt - pevt->idx;
+ e = container_of(pevt0, struct event_group, evts);
+ idx = rmid * e->num_events;
+ idx += pevt->idx;
+
+ if (idx * sizeof(u64) + sizeof(u64) > e->mmio_size) {
+ pr_warn_once("MMIO index %u out of range\n", idx);
+ return -EIO;
+ }
+
+ for (int i = 0; i < e->pfg->count; i++) {
+ if (!e->pfg->regions[i].addr)
+ continue;
+ if (e->pfg->regions[i].plat_info.package_id != domid)
+ continue;
+ evtcount = readq(e->pfg->regions[i].addr + idx * sizeof(u64));
+ if (!(evtcount & DATA_VALID))
+ continue;
+ *val += evtcount & DATA_BITS;
+ valid = true;
+ }
+
+ return valid ? 0 : -EINVAL;
+}
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 175488185b06..7d14ae6a9737 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -250,6 +250,9 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain_hdr *hdr,
resctrl_arch_rmid_read_context_check();
+ if (r->rid == RDT_RESOURCE_PERF_PKG)
+ return intel_aet_read_event(hdr->id, rmid, eventid, arch_priv, val);
+
if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
return -EINVAL;
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ce9d086625c1..6e0ec28ee904 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -527,7 +527,7 @@ config X86_CPU_RESCTRL
config X86_CPU_RESCTRL_INTEL_AET
bool "Intel Application Energy Telemetry"
- depends on X86_CPU_RESCTRL && CPU_SUP_INTEL && INTEL_PMT_TELEMETRY=y && INTEL_TPMI=y
+ depends on X86_64 && X86_CPU_RESCTRL && CPU_SUP_INTEL && INTEL_PMT_TELEMETRY=y && INTEL_TPMI=y
help
Enable per-RMID telemetry events in resctrl.
--
2.51.0
* [PATCH v11 20/31] fs/resctrl: Refactor Sub-NUMA Cluster (SNC) in mkdir/rmdir code flow
2025-09-25 20:02 [PATCH v11 00/31] x86,fs/resctrl telemetry monitoring Tony Luck
` (18 preceding siblings ...)
2025-09-25 20:03 ` [PATCH v11 19/31] x86/resctrl: Read telemetry events Tony Luck
@ 2025-09-25 20:03 ` Tony Luck
2025-10-03 23:58 ` Reinette Chatre
2025-09-25 20:03 ` [PATCH v11 21/31] x86/resctrl: Handle domain creation/deletion for RDT_RESOURCE_PERF_PKG Tony Luck
` (10 subsequent siblings)
30 siblings, 1 reply; 84+ messages in thread
From: Tony Luck @ 2025-09-25 20:03 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
SNC is only present in the RDT_RESOURCE_L3 domain.
Refactor code that makes and removes directories under "mon_data" to
special case the L3 resource.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
fs/resctrl/rdtgroup.c | 50 +++++++++++++++++++++++++++----------------
1 file changed, 32 insertions(+), 18 deletions(-)
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 6e8937f94e7a..cab5cb9e6c93 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -3155,6 +3155,7 @@ static void mon_rmdir_one_subdir(struct kernfs_node *pkn, char *name, char *subn
return;
kernfs_put(kn);
+ /* Subdirectories are only present on SNC enabled systems */
if (kn->dir.subdirs <= 1)
kernfs_remove(kn);
else
@@ -3171,19 +3172,24 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
struct rdt_domain_hdr *hdr)
{
struct rdtgroup *prgrp, *crgrp;
- struct rdt_l3_mon_domain *d;
+ int domid = hdr->id;
char subname[32];
- bool snc_mode;
char name[32];
- if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
- return;
+ if (r->rid == RDT_RESOURCE_L3) {
+ struct rdt_l3_mon_domain *d;
- d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
- snc_mode = r->mon_scope == RESCTRL_L3_NODE;
- sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci_id : d->hdr.id);
- if (snc_mode)
- sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id);
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
+ return;
+
+ d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
+ /* SNC mode? */
+ if (r->mon_scope == RESCTRL_L3_NODE) {
+ domid = d->ci_id;
+ sprintf(subname, "mon_sub_%s_%02d", r->name, hdr->id);
+ }
+ }
+ sprintf(name, "mon_%s_%02d", r->name, domid);
list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
mon_rmdir_one_subdir(prgrp->mon.mon_data_kn, name, subname);
@@ -3213,7 +3219,7 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_domain_hdr *hdr,
if (ret)
return ret;
- if (!do_sum && resctrl_is_mbm_event(mevt->evtid))
+ if (r->rid == RDT_RESOURCE_L3 && !do_sum && resctrl_is_mbm_event(mevt->evtid))
mon_event_read(&rr, r, hdr, prgrp, &hdr->cpu_mask, mevt, true);
}
@@ -3225,19 +3231,27 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
struct rdt_resource *r, struct rdtgroup *prgrp)
{
struct kernfs_node *kn, *ckn;
- struct rdt_l3_mon_domain *d;
+ bool snc_mode = false;
+ int domid = hdr->id;
char name[32];
- bool snc_mode;
int ret = 0;
lockdep_assert_held(&rdtgroup_mutex);
- if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
- return -EINVAL;
+ if (r->rid == RDT_RESOURCE_L3) {
+ snc_mode = r->mon_scope == RESCTRL_L3_NODE;
+ if (snc_mode) {
+ struct rdt_l3_mon_domain *d;
+
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
+ return -EINVAL;
+
+ d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
+ domid = d->ci_id;
+ }
+ }
+ sprintf(name, "mon_%s_%02d", r->name, domid);
- d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
- snc_mode = r->mon_scope == RESCTRL_L3_NODE;
- sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci_id : d->hdr.id);
kn = kernfs_find_and_get(parent_kn, name);
if (kn) {
/*
@@ -3253,7 +3267,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
ret = rdtgroup_kn_set_ugid(kn);
if (ret)
goto out_destroy;
- ret = mon_add_all_files(kn, hdr, r, prgrp, hdr->id, snc_mode);
+ ret = mon_add_all_files(kn, hdr, r, prgrp, domid, snc_mode);
if (ret)
goto out_destroy;
}
--
2.51.0
* Re: [PATCH v11 20/31] fs/resctrl: Refactor Sub-NUMA Cluster (SNC) in mkdir/rmdir code flow
2025-09-25 20:03 ` [PATCH v11 20/31] fs/resctrl: Refactor Sub-NUMA Cluster (SNC) in mkdir/rmdir code flow Tony Luck
@ 2025-10-03 23:58 ` Reinette Chatre
2025-10-06 23:10 ` Luck, Tony
0 siblings, 1 reply; 84+ messages in thread
From: Reinette Chatre @ 2025-10-03 23:58 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 9/25/25 1:03 PM, Tony Luck wrote:
> SNC is only present in the RDT_RESOURCE_L3 domain.
>
> Refactor code that makes and removes directories under "mon_data" to
"makes and removes directories" -> "makes and removes SNC directories"?
> special case the L3 resource.
Why?
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> fs/resctrl/rdtgroup.c | 50 +++++++++++++++++++++++++++----------------
> 1 file changed, 32 insertions(+), 18 deletions(-)
>
> diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
> index 6e8937f94e7a..cab5cb9e6c93 100644
> --- a/fs/resctrl/rdtgroup.c
> +++ b/fs/resctrl/rdtgroup.c
> @@ -3155,6 +3155,7 @@ static void mon_rmdir_one_subdir(struct kernfs_node *pkn, char *name, char *subn
> return;
> kernfs_put(kn);
>
> + /* Subdirectories are only present on SNC enabled systems */
> if (kn->dir.subdirs <= 1)
> kernfs_remove(kn);
> else
> @@ -3171,19 +3172,24 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> struct rdt_domain_hdr *hdr)
> {
> struct rdtgroup *prgrp, *crgrp;
> - struct rdt_l3_mon_domain *d;
> + int domid = hdr->id;
> char subname[32];
> - bool snc_mode;
> char name[32];
>
> - if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
> - return;
> + if (r->rid == RDT_RESOURCE_L3) {
> + struct rdt_l3_mon_domain *d;
>
> - d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
> - snc_mode = r->mon_scope == RESCTRL_L3_NODE;
> - sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci_id : d->hdr.id);
> - if (snc_mode)
> - sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id);
> + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
> + return;
> +
> + d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
> + /* SNC mode? */
> + if (r->mon_scope == RESCTRL_L3_NODE) {
> + domid = d->ci_id;
> + sprintf(subname, "mon_sub_%s_%02d", r->name, hdr->id);
> + }
> + }
> + sprintf(name, "mon_%s_%02d", r->name, domid);
>
> list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> mon_rmdir_one_subdir(prgrp->mon.mon_data_kn, name, subname);
> @@ -3213,7 +3219,7 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_domain_hdr *hdr,
> if (ret)
> return ret;
>
> - if (!do_sum && resctrl_is_mbm_event(mevt->evtid))
> + if (r->rid == RDT_RESOURCE_L3 && !do_sum && resctrl_is_mbm_event(mevt->evtid))
> mon_event_read(&rr, r, hdr, prgrp, &hdr->cpu_mask, mevt, true);
> }
>
> @@ -3225,19 +3231,27 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> struct rdt_resource *r, struct rdtgroup *prgrp)
> {
> struct kernfs_node *kn, *ckn;
> - struct rdt_l3_mon_domain *d;
> + bool snc_mode = false;
> + int domid = hdr->id;
> char name[32];
> - bool snc_mode;
> int ret = 0;
>
> lockdep_assert_held(&rdtgroup_mutex);
>
> - if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
> - return -EINVAL;
> + if (r->rid == RDT_RESOURCE_L3) {
> + snc_mode = r->mon_scope == RESCTRL_L3_NODE;
> + if (snc_mode) {
> + struct rdt_l3_mon_domain *d;
> +
> + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
> + return -EINVAL;
> +
> + d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
> + domid = d->ci_id;
> + }
> + }
> + sprintf(name, "mon_%s_%02d", r->name, domid);
>
> - d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
> - snc_mode = r->mon_scope == RESCTRL_L3_NODE;
> - sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci_id : d->hdr.id);
> kn = kernfs_find_and_get(parent_kn, name);
> if (kn) {
> /*
> @@ -3253,7 +3267,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> ret = rdtgroup_kn_set_ugid(kn);
> if (ret)
> goto out_destroy;
> - ret = mon_add_all_files(kn, hdr, r, prgrp, hdr->id, snc_mode);
> + ret = mon_add_all_files(kn, hdr, r, prgrp, domid, snc_mode);
> if (ret)
> goto out_destroy;
> }
mkdir_mondata_subdir(), similar to __mon_event_count(), is now unreasonably
complicated. Just like in that earlier change this inconsistently adds
RDT_RESOURCE_L3 checks, not to separate L3 code but instead to benefit PERF_PKG
enabling to reach the handful of lines needed by it.
Here too I think the best way forward is to split mkdir_mondata_subdir().
rmdir_mondata_subdir_allrdtgrp() may also do with a split ... most of the
code within it is dedicated to SNC and mon_rmdir_one_subdir() only exists
because of SNC ... any other usage can just call kernfs_remove_by_name(), no?
Enabling SNC is already complicated and I think that PERF_PKG trying to wedge
itself into that is just too confusing. I expect separating the two should
simplify this a lot.
Reinette
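The directory naming under discussion boils down to plain snprintf() formatting:
under SNC the parent directory is named after the cache instance id while the
per-node subdirectory keeps the domain header id. A minimal user-space model of
just that formatting (simplified signature, illustrative values, not the kernel
code):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Model of the mon_data directory naming: under SNC the parent directory
 * uses the cache instance id (ci_id) and the per-node subdirectory uses
 * the domain header id (hdr_id). Buffer size matches the char name[32]
 * in the quoted kernel code.
 */
static void snc_dir_names(const char *rname, int ci_id, int hdr_id,
			  char name[32], char subname[32])
{
	snprintf(name, 32, "mon_%s_%02d", rname, ci_id);
	snprintf(subname, 32, "mon_sub_%s_%02d", rname, hdr_id);
}
```

So an SNC node with header id 3 belonging to cache instance 1 of the L3
resource lands in "mon_L3_01/mon_sub_L3_03".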
^ permalink raw reply [flat|nested] 84+ messages in thread

* Re: [PATCH v11 20/31] fs/resctrl: Refactor Sub-NUMA Cluster (SNC) in mkdir/rmdir code flow
2025-10-03 23:58 ` Reinette Chatre
@ 2025-10-06 23:10 ` Luck, Tony
2025-10-08 17:12 ` Reinette Chatre
0 siblings, 1 reply; 84+ messages in thread
From: Luck, Tony @ 2025-10-06 23:10 UTC (permalink / raw)
To: Reinette Chatre
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen Yu, x86, linux-kernel,
patches
On Fri, Oct 03, 2025 at 04:58:45PM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 9/25/25 1:03 PM, Tony Luck wrote:
> > SNC is only present in the RDT_RESOURCE_L3 domain.
> >
> > Refactor code that makes and removes directories under "mon_data" to
>
> "makes and removes directories" -> "makes and removes SNC directories"?
>
> > special case the L3 resource.
>
> Why?
>
> >
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > ---
> > fs/resctrl/rdtgroup.c | 50 +++++++++++++++++++++++++++----------------
> > 1 file changed, 32 insertions(+), 18 deletions(-)
> >
> > diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
> > index 6e8937f94e7a..cab5cb9e6c93 100644
> > --- a/fs/resctrl/rdtgroup.c
> > +++ b/fs/resctrl/rdtgroup.c
> > @@ -3155,6 +3155,7 @@ static void mon_rmdir_one_subdir(struct kernfs_node *pkn, char *name, char *subn
> > return;
> > kernfs_put(kn);
> >
> > + /* Subdirectories are only present on SNC enabled systems */
> > if (kn->dir.subdirs <= 1)
> > kernfs_remove(kn);
> > else
> > @@ -3171,19 +3172,24 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> > struct rdt_domain_hdr *hdr)
> > {
> > struct rdtgroup *prgrp, *crgrp;
> > - struct rdt_l3_mon_domain *d;
> > + int domid = hdr->id;
> > char subname[32];
> > - bool snc_mode;
> > char name[32];
> >
> > - if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
> > - return;
> > + if (r->rid == RDT_RESOURCE_L3) {
> > + struct rdt_l3_mon_domain *d;
> >
> > - d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
> > - snc_mode = r->mon_scope == RESCTRL_L3_NODE;
> > - sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci_id : d->hdr.id);
> > - if (snc_mode)
> > - sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id);
> > + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
> > + return;
> > +
> > + d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
> > + /* SNC mode? */
> > + if (r->mon_scope == RESCTRL_L3_NODE) {
> > + domid = d->ci_id;
> > + sprintf(subname, "mon_sub_%s_%02d", r->name, hdr->id);
> > + }
> > + }
> > + sprintf(name, "mon_%s_%02d", r->name, domid);
> >
> > list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> > mon_rmdir_one_subdir(prgrp->mon.mon_data_kn, name, subname);
> > @@ -3213,7 +3219,7 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_domain_hdr *hdr,
> > if (ret)
> > return ret;
> >
> > - if (!do_sum && resctrl_is_mbm_event(mevt->evtid))
> > + if (r->rid == RDT_RESOURCE_L3 && !do_sum && resctrl_is_mbm_event(mevt->evtid))
> > mon_event_read(&rr, r, hdr, prgrp, &hdr->cpu_mask, mevt, true);
> > }
> >
> > @@ -3225,19 +3231,27 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> > struct rdt_resource *r, struct rdtgroup *prgrp)
> > {
> > struct kernfs_node *kn, *ckn;
> > - struct rdt_l3_mon_domain *d;
> > + bool snc_mode = false;
> > + int domid = hdr->id;
> > char name[32];
> > - bool snc_mode;
> > int ret = 0;
> >
> > lockdep_assert_held(&rdtgroup_mutex);
> >
> > - if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
> > - return -EINVAL;
> > + if (r->rid == RDT_RESOURCE_L3) {
> > + snc_mode = r->mon_scope == RESCTRL_L3_NODE;
> > + if (snc_mode) {
> > + struct rdt_l3_mon_domain *d;
> > +
> > + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
> > + return -EINVAL;
> > +
> > + d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
> > + domid = d->ci_id;
> > + }
> > + }
> > + sprintf(name, "mon_%s_%02d", r->name, domid);
> >
> > - d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
> > - snc_mode = r->mon_scope == RESCTRL_L3_NODE;
> > - sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci_id : d->hdr.id);
> > kn = kernfs_find_and_get(parent_kn, name);
> > if (kn) {
> > /*
> > @@ -3253,7 +3267,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> > ret = rdtgroup_kn_set_ugid(kn);
> > if (ret)
> > goto out_destroy;
> > - ret = mon_add_all_files(kn, hdr, r, prgrp, hdr->id, snc_mode);
> > + ret = mon_add_all_files(kn, hdr, r, prgrp, domid, snc_mode);
> > if (ret)
> > goto out_destroy;
> > }
>
> mkdir_mondata_subdir(), similar to __mon_event_count(), is now unreasonably
> complicated. Just like in that earlier change this inconsistently adds
> RDT_RESOURCE_L3 checks, not to separate L3 code but instead to benefit PERF_PKG
> enabling to reach the handful of lines needed by it.
> Here too I think the best way forward is to split mkdir_mondata_subdir().
>
> rmdir_mondata_subdir_allrdtgrp() may also do with a split ... most of the
> code within it is dedicated to SNC and mon_rmdir_one_subdir() only exists
> because of SNC ... any other usage can just call kernfs_remove_by_name(), no?
>
> Enabling SNC is already complicated and I think that PERF_PKG trying to wedge
> itself into that is just too confusing. I expect separating the two should
> simplify this a lot.
Ok. Splitting these makes sense. I'm terrible at naming. So I
tentatively have:
static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
struct rdt_domain_hdr *hdr,
struct rdt_resource *r, struct rdtgroup *prgrp)
{
lockdep_assert_held(&rdtgroup_mutex);
if (r->mon_scope == RESCTRL_L3_NODE)
return mkdir_mondata_subdir_snc(parent_kn, hdr, r, prgrp);
else
return mkdir_mondata_subdir_normal(parent_kn, hdr, r, prgrp);
}
and:
static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
struct rdt_domain_hdr *hdr)
{
if (r->mon_scope == RESCTRL_L3_NODE)
rmdir_mondata_subdir_allrdtgrp_snc(r, hdr);
else
rmdir_mondata_subdir_allrdtgrp_normal(r, hdr);
}
Better suggestions gratefully accepted.
>
> Reinette
>
* Re: [PATCH v11 20/31] fs/resctrl: Refactor Sub-NUMA Cluster (SNC) in mkdir/rmdir code flow
2025-10-06 23:10 ` Luck, Tony
@ 2025-10-08 17:12 ` Reinette Chatre
2025-10-08 21:15 ` Luck, Tony
0 siblings, 1 reply; 84+ messages in thread
From: Reinette Chatre @ 2025-10-08 17:12 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen Yu, x86, linux-kernel,
patches
Hi Tony,
On 10/6/25 4:10 PM, Luck, Tony wrote:
> On Fri, Oct 03, 2025 at 04:58:45PM -0700, Reinette Chatre wrote:
>> On 9/25/25 1:03 PM, Tony Luck wrote:
>>> @@ -3253,7 +3267,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
>>> ret = rdtgroup_kn_set_ugid(kn);
>>> if (ret)
>>> goto out_destroy;
>>> - ret = mon_add_all_files(kn, hdr, r, prgrp, hdr->id, snc_mode);
>>> + ret = mon_add_all_files(kn, hdr, r, prgrp, domid, snc_mode);
>>> if (ret)
>>> goto out_destroy;
>>> }
>>
>> mkdir_mondata_subdir(), similar to __mon_event_count(), is now unreasonably
>> complicated. Just like in that earlier change this inconsistently adds
>> RDT_RESOURCE_L3 checks, not to separate L3 code but instead to benefit PERF_PKG
>> enabling to reach the handful of lines needed by it.
>> Here too I think the best way forward is to split mkdir_mondata_subdir().
>>
>> rmdir_mondata_subdir_allrdtgrp() may also do with a split ... most of the
>> code within it is dedicated to SNC and mon_rmdir_one_subdir() only exists
>> because of SNC ... any other usage can just call kernfs_remove_by_name(), no?
>>
>> SNC is already complicated enabling and I think that PERF_PKG trying to wedge
>> itself into that is just too confusing. I expect separating this should simplify
>> this a lot.
>
> Ok. Splitting these makes sense. I'm terrible at naming. So I
> tentatively have:
>
> static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> struct rdt_domain_hdr *hdr,
> struct rdt_resource *r, struct rdtgroup *prgrp)
> {
> lockdep_assert_held(&rdtgroup_mutex);
>
> if (r->mon_scope == RESCTRL_L3_NODE)
> return mkdir_mondata_subdir_snc(parent_kn, hdr, r, prgrp);
> else
> return mkdir_mondata_subdir_normal(parent_kn, hdr, r, prgrp);
> }
>
> and:
>
> static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> struct rdt_domain_hdr *hdr)
> {
> if (r->mon_scope == RESCTRL_L3_NODE)
> rmdir_mondata_subdir_allrdtgrp_snc(r, hdr);
> else
> rmdir_mondata_subdir_allrdtgrp_normal(r, hdr);
> }
>
> Better suggestions gratefully accepted.
It is not quite obvious to me how it all will turn out from here with the
addition of support for PERF_PKG. By just considering the above I think
that it helps to match the naming pattern with partners,
for example rmdir_mondata_subdir_allrdtgrp() as you have that matches
mkdir_mondata_subdir_allrdtgrp() that is not listed here. The problem is
that the new rmdir_mondata_subdir_allrdtgrp() is L3 specific while
mkdir_mondata_subdir_allrdtgrp() remains generic. I thus think that it may make
the code easier to follow if the L3 specific functions have _l3_ in the
name as you established in patch #8. So perhaps above should be
rmdir_l3_mondata_subdir_allrdtgrp() instead and then there may be a new
rmdir_mondata_subdir_allrdtgrp() that will be the new generic function
that calls the resource specific ones?
This could be extended to the new mkdir_mondata_subdir() above where
it is named mkdir_l3_mondata_subdir() called by generic mkdir_mondata_subdir()?
Reinette
* Re: [PATCH v11 20/31] fs/resctrl: Refactor Sub-NUMA Cluster (SNC) in mkdir/rmdir code flow
2025-10-08 17:12 ` Reinette Chatre
@ 2025-10-08 21:15 ` Luck, Tony
2025-10-08 22:12 ` Reinette Chatre
0 siblings, 1 reply; 84+ messages in thread
From: Luck, Tony @ 2025-10-08 21:15 UTC (permalink / raw)
To: Reinette Chatre
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen Yu, x86, linux-kernel,
patches
On Wed, Oct 08, 2025 at 10:12:36AM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 10/6/25 4:10 PM, Luck, Tony wrote:
> > On Fri, Oct 03, 2025 at 04:58:45PM -0700, Reinette Chatre wrote:
> >> On 9/25/25 1:03 PM, Tony Luck wrote:
>
> >>> @@ -3253,7 +3267,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> >>> ret = rdtgroup_kn_set_ugid(kn);
> >>> if (ret)
> >>> goto out_destroy;
> >>> - ret = mon_add_all_files(kn, hdr, r, prgrp, hdr->id, snc_mode);
> >>> + ret = mon_add_all_files(kn, hdr, r, prgrp, domid, snc_mode);
> >>> if (ret)
> >>> goto out_destroy;
> >>> }
> >>
> >> mkdir_mondata_subdir(), similar to __mon_event_count(), is now unreasonably
> >> complicated. Just like in that earlier change this inconsistently adds
> >> RDT_RESOURCE_L3 checks, not to separate L3 code but instead to benefit PERF_PKG
> >> enabling to reach the handful of lines needed by it.
> >> Here too I think the best way forward is to split mkdir_mondata_subdir().
> >>
> >> rmdir_mondata_subdir_allrdtgrp() may also do with a split ... most of the
> >> code within it is dedicated to SNC and mon_rmdir_one_subdir() only exists
> >> because of SNC ... any other usage can just call kernfs_remove_by_name(), no?
> >>
> >> Enabling SNC is already complicated and I think that PERF_PKG trying to wedge
> >> itself into that is just too confusing. I expect separating the two should
> >> simplify this a lot.
> >
> > Ok. Splitting these makes sense. I'm terrible at naming. So I
> > tentatively have:
> >
> > static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> > struct rdt_domain_hdr *hdr,
> > struct rdt_resource *r, struct rdtgroup *prgrp)
> > {
> > lockdep_assert_held(&rdtgroup_mutex);
> >
> > if (r->mon_scope == RESCTRL_L3_NODE)
> > return mkdir_mondata_subdir_snc(parent_kn, hdr, r, prgrp);
> > else
> > return mkdir_mondata_subdir_normal(parent_kn, hdr, r, prgrp);
> > }
> >
> > and:
> >
> > static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> > struct rdt_domain_hdr *hdr)
> > {
> > if (r->mon_scope == RESCTRL_L3_NODE)
> > rmdir_mondata_subdir_allrdtgrp_snc(r, hdr);
> > else
> > rmdir_mondata_subdir_allrdtgrp_normal(r, hdr);
> > }
> >
> > Better suggestions gratefully accepted.
>
> It is not quite obvious to me how it all will turn out from here with the
> addition of support for PERF_PKG. By just considering the above I think
> that it helps to match the naming pattern with partners,
> for example rmdir_mondata_subdir_allrdtgrp() as you have that matches
> mkdir_mondata_subdir_allrdtgrp() that is not listed here. The problem is
> that the new rmdir_mondata_subdir_allrdtgrp() is L3 specific while
> mkdir_mondata_subdir_allrdtgrp() remains generic. I thus think that it may make
> the code easier to follow if the L3 specific functions have _l3_ in the
> name as you established in patch #8. So perhaps above should be
> rmdir_l3_mondata_subdir_allrdtgrp() instead and then there may be a new
> rmdir_mondata_subdir_allrdtgrp() that will be the new generic function
> that calls the resource specific ones?
>
> This could be extended to the new mkdir_mondata_subdir() above where
> it is named mkdir_l3_mondata_subdir() called by generic mkdir_mondata_subdir()?
Reinette
Making and removing the mon_data directories is the same for non-SNC L3
and PERF_PKG. The only "l3" connection is that SNC only occurs on L3.
So maybe my refactor should look like:
static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
struct rdt_domain_hdr *hdr,
struct rdt_resource *r, struct rdtgroup *prgrp)
{
lockdep_assert_held(&rdtgroup_mutex);
if (r->mon_scope == RESCTRL_L3_NODE)
return mkdir_mondata_subdir_snc(parent_kn, hdr, r, prgrp);
... pruned version of original code without SNC bits ...
}
and:
static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
struct rdt_domain_hdr *hdr)
{
if (r->mon_scope == RESCTRL_L3_NODE) {
rmdir_mondata_subdir_allrdtgrp_snc(r, hdr);
return;
}
... pruned version of original code without SNC bits ...
}
-Tony
* Re: [PATCH v11 20/31] fs/resctrl: Refactor Sub-NUMA Cluster (SNC) in mkdir/rmdir code flow
2025-10-08 21:15 ` Luck, Tony
@ 2025-10-08 22:12 ` Reinette Chatre
2025-10-08 22:29 ` Luck, Tony
0 siblings, 1 reply; 84+ messages in thread
From: Reinette Chatre @ 2025-10-08 22:12 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen Yu, x86, linux-kernel,
patches
Hi Tony,
On 10/8/25 2:15 PM, Luck, Tony wrote:
> On Wed, Oct 08, 2025 at 10:12:36AM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 10/6/25 4:10 PM, Luck, Tony wrote:
>>> On Fri, Oct 03, 2025 at 04:58:45PM -0700, Reinette Chatre wrote:
>>>> On 9/25/25 1:03 PM, Tony Luck wrote:
>>
>>>>> @@ -3253,7 +3267,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
>>>>> ret = rdtgroup_kn_set_ugid(kn);
>>>>> if (ret)
>>>>> goto out_destroy;
>>>>> - ret = mon_add_all_files(kn, hdr, r, prgrp, hdr->id, snc_mode);
>>>>> + ret = mon_add_all_files(kn, hdr, r, prgrp, domid, snc_mode);
>>>>> if (ret)
>>>>> goto out_destroy;
>>>>> }
>>>>
>>>> mkdir_mondata_subdir(), similar to __mon_event_count(), is now unreasonably
>>>> complicated. Just like in that earlier change this inconsistently adds
>>>> RDT_RESOURCE_L3 checks, not to separate L3 code but instead to benefit PERF_PKG
>>>> enabling to reach the handful of lines needed by it.
>>>> Here too I think the best way forward is to split mkdir_mondata_subdir().
>>>>
>>>> rmdir_mondata_subdir_allrdtgrp() may also do with a split ... most of the
>>>> code within it is dedicated to SNC and mon_rmdir_one_subdir() only exists
>>>> because of SNC ... any other usage can just call kernfs_remove_by_name(), no?
>>>>
> >>>> Enabling SNC is already complicated and I think that PERF_PKG trying to wedge
> >>>> itself into that is just too confusing. I expect separating the two should
> >>>> simplify this a lot.
>>>
>>> Ok. Splitting these makes sense. I'm terrible at naming. So I
>>> tentatively have:
>>>
>>> static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
>>> struct rdt_domain_hdr *hdr,
>>> struct rdt_resource *r, struct rdtgroup *prgrp)
>>> {
>>> lockdep_assert_held(&rdtgroup_mutex);
>>>
>>> if (r->mon_scope == RESCTRL_L3_NODE)
>>> return mkdir_mondata_subdir_snc(parent_kn, hdr, r, prgrp);
>>> else
>>> return mkdir_mondata_subdir_normal(parent_kn, hdr, r, prgrp);
>>> }
>>>
>>> and:
>>>
>>> static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
>>> struct rdt_domain_hdr *hdr)
>>> {
>>> if (r->mon_scope == RESCTRL_L3_NODE)
>>> rmdir_mondata_subdir_allrdtgrp_snc(r, hdr);
>>> else
>>> rmdir_mondata_subdir_allrdtgrp_normal(r, hdr);
>>> }
>>>
>>> Better suggestions gratefully accepted.
>>
>> It is not quite obvious to me how it all will turn out from here with the
>> addition of support for PERF_PKG. By just considering the above I think
>> that it helps to match the naming pattern with partners,
>> for example rmdir_mondata_subdir_allrdtgrp() as you have that matches
>> mkdir_mondata_subdir_allrdtgrp() that is not listed here. The problem is
>> that the new rmdir_mondata_subdir_allrdtgrp() is L3 specific while
>> mkdir_mondata_subdir_allrdtgrp() remains generic. I thus think that it may make
>> the code easier to follow if the L3 specific functions have _l3_ in the
>> name as you established in patch #8. So perhaps above should be
>> rmdir_l3_mondata_subdir_allrdtgrp() instead and then there may be a new
>> rmdir_mondata_subdir_allrdtgrp() that will be the new generic function
>> that calls the resource specific ones?
>>
>> This could be extended to the new mkdir_mondata_subdir() above where
>> it is named mkdir_l3_mondata_subdir() called by generic mkdir_mondata_subdir()?
>
> Reinette
>
> Making and removing the mon_data directories is the same for non-SNC L3
> and PERF_PKG. The only "l3" connection is that SNC only occurs on L3.
Thank you for clarifying. I was not able to keep this flow in my head.
>
> So maybe my refactor should look like:
>
>
> static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> struct rdt_domain_hdr *hdr,
> struct rdt_resource *r, struct rdtgroup *prgrp)
> {
> lockdep_assert_held(&rdtgroup_mutex);
>
> if (r->mon_scope == RESCTRL_L3_NODE)
> return mkdir_mondata_subdir_snc(parent_kn, hdr, r, prgrp);
>
> ... pruned version of original code without SNC bits ...
> }
>
> and:
>
> static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> struct rdt_domain_hdr *hdr)
> {
> if (r->mon_scope == RESCTRL_L3_NODE) {
> rmdir_mondata_subdir_allrdtgrp_snc(r, hdr);
> return;
> }
>
> ... pruned version of original code without SNC bits ...
> }
Indeed, this will keep the functions generic in the sense that it operates
on all resource types. This looks good since I think once the SNC code is taken
what remains should be easy to follow.
I think it may also help to (in addition to the mon_scope check) add a RDT_RESOURCE_L3
check before the SNC code to keep the pattern that SNC only applies to the L3 resource.
Reinette
* RE: [PATCH v11 20/31] fs/resctrl: Refactor Sub-NUMA Cluster (SNC) in mkdir/rmdir code flow
2025-10-08 22:12 ` Reinette Chatre
@ 2025-10-08 22:29 ` Luck, Tony
2025-10-09 2:16 ` Reinette Chatre
0 siblings, 1 reply; 84+ messages in thread
From: Luck, Tony @ 2025-10-08 22:29 UTC (permalink / raw)
To: Chatre, Reinette
Cc: Fenghua Yu, Wieczor-Retman, Maciej, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen, Yu C, x86@kernel.org,
linux-kernel@vger.kernel.org, patches@lists.linux.dev
> > static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
> > struct rdt_domain_hdr *hdr,
> > struct rdt_resource *r, struct rdtgroup *prgrp)
> > {
> > lockdep_assert_held(&rdtgroup_mutex);
> >
> > if (r->mon_scope == RESCTRL_L3_NODE)
> > return mkdir_mondata_subdir_snc(parent_kn, hdr, r, prgrp);
> >
> > ... pruned version of original code without SNC bits ...
> > }
> >
> > and:
> >
> > static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> > struct rdt_domain_hdr *hdr)
> > {
> > if (r->mon_scope == RESCTRL_L3_NODE) {
> > rmdir_mondata_subdir_allrdtgrp_snc(r, hdr);
> > return;
> > }
> >
> > ... pruned version of original code without SNC bits ...
> > }
>
> Indeed, this will keep the functions generic in the sense that it operates
> on all resource types. This looks good since I think once the SNC code is taken
> what remains should be easy to follow.
> I think it may also help to (in addition to the mon_scope check) add a RDT_RESOURCE_L3
> check before the SNC code to keep the pattern that SNC only applies to the L3 resource.
Reinette,
The SNC versions to make and remove directories need to get the rdt_l3_mon_domain from
hdr. So they both begin with the standard:
if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
return -EINVAL;
d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
[though rmdir function returns void, so doesn't have that "-EINVAL".]
-Tony
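The prologue quoted above checks both the domain type and the resource id
before container_of() is applied. A user-space model of that check, with
simplified stand-in types and without the kernel's WARN_ON_ONCE() wrapping
(names illustrative, not the kernel definitions):

```c
#include <assert.h>
#include <stdbool.h>

enum domain_type { RESCTRL_CTRL_DOMAIN, RESCTRL_MON_DOMAIN };

/* Stand-in for the common header embedded in every domain structure. */
struct rdt_domain_hdr {
	int id;
	enum domain_type type;
	int rid;	/* resource id, e.g. RDT_RESOURCE_L3 */
};

/*
 * Both the domain type and the resource id must match before the caller
 * may safely container_of() the header back to the full domain struct.
 */
static bool domain_header_is_valid(const struct rdt_domain_hdr *hdr,
				   enum domain_type type, int rid)
{
	return hdr->type == type && hdr->rid == rid;
}
```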
* Re: [PATCH v11 20/31] fs/resctrl: Refactor Sub-NUMA Cluster (SNC) in mkdir/rmdir code flow
2025-10-08 22:29 ` Luck, Tony
@ 2025-10-09 2:16 ` Reinette Chatre
2025-10-09 17:45 ` Luck, Tony
0 siblings, 1 reply; 84+ messages in thread
From: Reinette Chatre @ 2025-10-09 2:16 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Wieczor-Retman, Maciej, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen, Yu C, x86@kernel.org,
linux-kernel@vger.kernel.org, patches@lists.linux.dev
Hi Tony,
On 10/8/25 3:29 PM, Luck, Tony wrote:
>>> static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
>>> struct rdt_domain_hdr *hdr,
>>> struct rdt_resource *r, struct rdtgroup *prgrp)
>>> {
>>> lockdep_assert_held(&rdtgroup_mutex);
>>>
>>> if (r->mon_scope == RESCTRL_L3_NODE)
>>> return mkdir_mondata_subdir_snc(parent_kn, hdr, r, prgrp);
>>>
>>> ... pruned version of original code without SNC bits ...
>>> }
>>>
>>> and:
>>>
>>> static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
>>> struct rdt_domain_hdr *hdr)
>>> {
>>> if (r->mon_scope == RESCTRL_L3_NODE) {
>>> rmdir_mondata_subdir_allrdtgrp_snc(r, hdr);
>>> return;
>>> }
>>>
>>> ... pruned version of original code without SNC bits ...
>>> }
>>
>> Indeed, this will keep the functions generic in the sense that it operates
>> on all resource types. This looks good since I think once the SNC code is taken
>> what remains should be easy to follow.
>> I think it may also help to (in addition to the mon_scope check) add a RDT_RESOURCE_L3
>> check before the SNC code to keep the pattern that SNC only applies to the L3 resource.
>
> Reinette,
>
> The SNC versions to make and remove directories need to get the rdt_l3_mon_domain from
> hdr. So they both begin with the standard:
>
> if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
> return -EINVAL;
>
> d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
>
> [though rmdir function returns void, so doesn't have that "-EINVAL".]
Understood. This is not about correctness but making the code easier to understand.
What I am aiming for is consistency in the code where the pattern
in existing flows use the resource ID as check to direct code flow to resource
specific code. In the above flow it uses the monitoring scope. This works of course,
but it is an implicit check because the L3 resource is the only one that currently
supports the "node" scope and does so when SNC is enabled.
My preference is for the code to be consistent in the patterns used, and I find doing so
makes the code easier to read and understand.
Reinette
* Re: [PATCH v11 20/31] fs/resctrl: Refactor Sub-NUMA Cluster (SNC) in mkdir/rmdir code flow
2025-10-09 2:16 ` Reinette Chatre
@ 2025-10-09 17:45 ` Luck, Tony
2025-10-09 20:29 ` Reinette Chatre
0 siblings, 1 reply; 84+ messages in thread
From: Luck, Tony @ 2025-10-09 17:45 UTC (permalink / raw)
To: Reinette Chatre
Cc: Fenghua Yu, Wieczor-Retman, Maciej, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen, Yu C, x86@kernel.org,
linux-kernel@vger.kernel.org, patches@lists.linux.dev
On Wed, Oct 08, 2025 at 07:16:07PM -0700, Reinette Chatre wrote:
> Understood. This is not about correctness but making the code easier to understand.
> What I am aiming for is consistency in the code where the pattern
> in existing flows use the resource ID as check to direct code flow to resource
> specific code. In the above flow it uses the monitoring scope. This works of course,
> but it is an implicit check because the L3 resource is the only one that currently
> supports the "node" scope and does so when SNC is enabled.
> My preference is for the code to be consistent in the patterns used, and I find doing so
> makes the code easier to read and understand.
>
Reinette,
Should I address this "only one that currently" issue now? Maybe by
adding a bool "rdt_resource::snc_mode" so the implicit
if (r->mon_scope == RESCTRL_L3_NODE)
changes to:
if (r->snc_mode)
There are only two places where SNC mode is checked in this way. The
others rely on seeing that mon_data::sum is set, or that rr->hdr is
NULL. So it seems like a very small improvement.
If we ever add a node scoped resource that isn't related to SNC, it
would be needed at that point. But I'm not sure why hardware would
ever do that.
-Tony
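The two dispatch styles being weighed here can be modeled in a few lines of
user-space C. The enum values and struct layout below are simplified stand-ins
for the kernel's, purely to show the implicit scope check against the proposed
explicit flag:

```c
#include <assert.h>
#include <stdbool.h>

enum resctrl_scope { RESCTRL_L3_CACHE, RESCTRL_L3_NODE, RESCTRL_PACKAGE };

struct rdt_resource {
	int rid;			/* resource id */
	enum resctrl_scope mon_scope;	/* monitoring scope */
	bool snc_mode;			/* proposed explicit SNC flag */
};

/* Implicit check: node-scoped monitoring currently implies SNC. */
static bool snc_by_scope(const struct rdt_resource *r)
{
	return r->mon_scope == RESCTRL_L3_NODE;
}

/* Explicit check via the proposed rdt_resource::snc_mode flag. */
static bool snc_by_flag(const struct rdt_resource *r)
{
	return r->snc_mode;
}
```

The two agree only as long as SNC is the sole user of node scope, which is the
"only one that currently" coupling discussed above.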
* Re: [PATCH v11 20/31] fs/resctrl: Refactor Sub-NUMA Cluster (SNC) in mkdir/rmdir code flow
2025-10-09 17:45 ` Luck, Tony
@ 2025-10-09 20:29 ` Reinette Chatre
2025-10-09 21:31 ` Luck, Tony
0 siblings, 1 reply; 84+ messages in thread
From: Reinette Chatre @ 2025-10-09 20:29 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Wieczor-Retman, Maciej, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen, Yu C, x86@kernel.org,
linux-kernel@vger.kernel.org, patches@lists.linux.dev
Hi Tony,
On 10/9/25 10:45 AM, Luck, Tony wrote:
> On Wed, Oct 08, 2025 at 07:16:07PM -0700, Reinette Chatre wrote:
>> Understood. This is not about correctness but making the code easier to understand.
>> What I am aiming for is consistency in the code where the pattern
>> in existing flows use the resource ID as check to direct code flow to resource
>> specific code. In the above flow it uses the monitoring scope. This works of course,
>> but it is an implicit check because the L3 resource is the only one that currently
>> supports the "node" scope and does so when SNC is enabled.
>> My preference is for the code to be consistent in the patterns used, and I find doing so
>> makes the code easier to read and understand.
>>
> Reinette,
>
> Should I address this "only one that currently" issue now? Maybe by
> adding a bool "rdt_resource:snc_mode" so the implicit
>
> if (r->mon_scope == RESCTRL_L3_NODE)
>
> changes to:
>
> if (r->snc_mode)
>
> There are only two places where SNC mode is checked in this way. The
> others rely on seeing that mon_data::sum is set, or that rr->hdr is
> NULL. So it seems like a very small improvement.
This is not about SNC mode or not but instead about this code being L3
resource specific.
I see the mon_data::sum and rr->hdr checks as supporting a separate
feature that was introduced to support SNC - it should not be used as
a check for SNC support even though it currently implies this due to SNC
being the only user. Could we not, hypothetically, even use these properties
in the region aware MBM work?
> If we ever add a node scoped resource that isn't related to SNC, it
> would be needed at that point. But I'm not sure why hardware would
> ever do that.
Right. This is not about just what is needed to enable this feature but
about making the code easy to follow for those that attempt to understand,
debug, and/or build on top.
Reinette
* RE: [PATCH v11 20/31] fs/resctrl: Refactor Sub-NUMA Cluster (SNC) in mkdir/rmdir code flow
2025-10-09 20:29 ` Reinette Chatre
@ 2025-10-09 21:31 ` Luck, Tony
2025-10-09 21:46 ` Reinette Chatre
0 siblings, 1 reply; 84+ messages in thread
From: Luck, Tony @ 2025-10-09 21:31 UTC (permalink / raw)
To: Chatre, Reinette
Cc: Fenghua Yu, Wieczor-Retman, Maciej, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen, Yu C, x86@kernel.org,
linux-kernel@vger.kernel.org, patches@lists.linux.dev
> > There are only two places where SNC mode is checked in this way. The
> > others rely on seeing that mon_data::sum is set, or that rr->hdr is
> > NULL. So it seems like a very small improvement.
>
> This is not about SNC mode or not but instead about this code being L3
> resource specific.
>
> I see the mon_data::sum and rr->hdr checks as supporting a separate
> feature that was introduced to support SNC - it should not be used as
> a check for SNC support even though it currently implies this due to SNC
> being the only user. Could we not, hypothetically, even use these properties
> in the region aware MBM work?
>
> > If we ever add a node scoped resource that isn't related to SNC, it
> > would be needed at that point. But I'm not sure why hardware would
> > ever do that.
>
> Right. This is not about just what is needed to enable this feature but
> about making the code easy to follow for those that attempt to understand,
> debug, and/or build on top.
Reinette,
Region aware MBM work will need to sum things to support legacy resctrl
"mbm_total_bytes". But while SNC sums across domains that share the
same cache id inside the same resource, we may be summing across
different resources (assuming we go with separate resources per region)
or summing across regions within a domain (if we bundle the regions into a new
struct rdt_region_mon_domain).
So __mon_event_count() will need to get additional refactoring and helper
functions, and struct mon_data an additional field to say that this other
"sum" function must be used.
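For illustration only, the extra mon_data field and dispatch described above
might look roughly like the sketch below. All names here (mon_sum_kind,
sum_kind, the placeholder helpers) are hypothetical stand-ins, not the actual
kernel structures or values:

```c
#include <assert.h>

/*
 * Hypothetical sketch only: simplified stand-ins for the kernel's
 * struct mon_data and __mon_event_count(), illustrating a "which kind
 * of sum" field instead of the current boolean mon_data::sum.
 */
enum mon_sum_kind {
	MON_SUM_NONE,		/* read one domain directly */
	MON_SUM_SNC,		/* sum SNC nodes sharing an L3 cache id */
	MON_SUM_REGION,		/* sum regions/resources for mbm_total_bytes */
};

struct mon_data {
	int domid;
	enum mon_sum_kind sum_kind;
};

/* Placeholder counts so the dispatch below is observable. */
static unsigned long long read_one_domain(int domid)  { (void)domid; return 100; }
static unsigned long long sum_snc_domains(int domid)  { (void)domid; return 200; }
static unsigned long long sum_regions(int domid)      { (void)domid; return 300; }

/* A refactored __mon_event_count() could dispatch on the sum kind. */
static unsigned long long mon_event_count(const struct mon_data *md)
{
	switch (md->sum_kind) {
	case MON_SUM_SNC:
		return sum_snc_domains(md->domid);
	case MON_SUM_REGION:
		return sum_regions(md->domid);
	case MON_SUM_NONE:
	default:
		return read_one_domain(md->domid);
	}
}
```

A dispatch like this keeps the SNC summing path and any future region-aware
summing path separate, rather than overloading one boolean flag.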
-Tony
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v11 20/31] fs/resctrl: Refactor Sub-NUMA Cluster (SNC) in mkdir/rmdir code flow
2025-10-09 21:31 ` Luck, Tony
@ 2025-10-09 21:46 ` Reinette Chatre
2025-10-09 22:08 ` Luck, Tony
0 siblings, 1 reply; 84+ messages in thread
From: Reinette Chatre @ 2025-10-09 21:46 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Wieczor-Retman, Maciej, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen, Yu C, x86@kernel.org,
linux-kernel@vger.kernel.org, patches@lists.linux.dev
Hi Tony,
On 10/9/25 2:31 PM, Luck, Tony wrote:
>>> There are only two places where SNC mode is checked in this way. The
>>> others rely on seeing that mon_data::sum is set, or that rr->hdr is
>>> NULL. So it seems like a very small improvement.
>>
>> This is not about SNC mode or not but instead about this code being L3
>> resource specific.
>>
>> I see the mon_data::sum and rr->hdr checks as supporting a separate
>> feature that was introduced to support SNC - it should not be used as
>> a check for SNC support even though it currently implies this due to SNC
>> being the only user. Could we not, hypothetically, even use these properties
>> in the region aware MBM work?
>>
>>> If we ever add a node scoped resource that isn't related to SNC, it
>>> would be needed at that point. But I'm not sure why hardware would
>>> ever do that.
>>
>> Right. This is not about just what is needed to enable this feature but
>> about making the code easy to follow for those that attempt to understand,
>> debug, and/or build on top.
>
> Reinette,
>
> Region aware MBM work will need to sum things to support legacy resctrl
> "mbm_total_bytes". But while SNC sums across domains that share the
> same cache id inside the same resource, we may be summing across
> different resources (assuming we go with separate resources per region)
> or summing across regions within a domain (if we bundle the regions into a new
> struct rdt_region_mon_domain).
>
> So __mon_event_count() will need to get additional refactoring and helper
> functions, and struct mon_data an additional field to say that this other
> "sum" function must be used.
>
I did not mean to imply that this can be supported without refactoring. It does
seem as though you agree that mon_data::sum may be used for something
other than SNC and thus that using mon_data::sum as a check for SNC is not ideal.
Reinette
^ permalink raw reply [flat|nested] 84+ messages in thread
* RE: [PATCH v11 20/31] fs/resctrl: Refactor Sub-NUMA Cluster (SNC) in mkdir/rmdir code flow
2025-10-09 21:46 ` Reinette Chatre
@ 2025-10-09 22:08 ` Luck, Tony
2025-10-10 0:16 ` Reinette Chatre
0 siblings, 1 reply; 84+ messages in thread
From: Luck, Tony @ 2025-10-09 22:08 UTC (permalink / raw)
To: Chatre, Reinette
Cc: Fenghua Yu, Wieczor-Retman, Maciej, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen, Yu C, x86@kernel.org,
linux-kernel@vger.kernel.org, patches@lists.linux.dev
> I did not mean to imply that this can be supported without refactoring. It does
> seem as though you agree that mon_data::sum may be used for something
> other than SNC and thus that using mon_data::sum as a check for SNC is not ideal.
Reinette,
Yes, we are in agreement about non-SNC future usage.
Is it sufficient that I plant some WARN_ON_ONCE() in places where the
code assumes that mon_data::sum is only used by RDT_RESOURCE_L3
or for SNC?
Such code can be fixed by future patches that want to use mon_data::sum
for other things.
-Tony
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v11 20/31] fs/resctrl: Refactor Sub-NUMA Cluster (SNC) in mkdir/rmdir code flow
2025-10-09 22:08 ` Luck, Tony
@ 2025-10-10 0:16 ` Reinette Chatre
2025-10-10 1:14 ` Luck, Tony
0 siblings, 1 reply; 84+ messages in thread
From: Reinette Chatre @ 2025-10-10 0:16 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Wieczor-Retman, Maciej, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen, Yu C, x86@kernel.org,
linux-kernel@vger.kernel.org, patches@lists.linux.dev
Hi Tony,
On 10/9/25 3:08 PM, Luck, Tony wrote:
>> I did not mean to imply that this can be supported without refactoring. It does
>> seem as though you agree that mon_data::sum may be used for something
>> other than SNC and thus that using mon_data::sum as a check for SNC is not ideal.
>
> Reinette,
>
> Yes, we are in agreement about non-SNC future usage.
>
> Is it sufficient that I plant some WARN_ON_ONCE() in places where the
> code assumes that mon_data::sum is only used by RDT_RESOURCE_L3
> or for SNC?
From what I understand this series does this already? I think this only applies to
rdtgroup_mondata_show() that does below ("L3 specific" comments added by me just for this example)
in this series:
rdtgroup_mondata_show()
{
...
if (md->sum) {
struct rdt_l3_mon_domain *d;
if (WARN_ON_ONCE(resid != RDT_RESOURCE_L3)) {
...
}
list_for_each_entry(d, &r->mon_domains, hdr.list) {
if (d->ci_id == domid) { /* L3 specific field */
...
/* L3 specific */
ci = get_cpu_cacheinfo_level(cpu, RESCTRL_L3_CACHE);
}
}
...
}
This seems reasonable since the flow is different from the typical "check resource"
followed by a domain_header_is_valid() that a refactor to support another resource
would probably do as you state below.
>
> Such code can be fixed by future patches that want to use mon_data::sum
> for other things.
This discussion digressed a bit. The discussion started with a request to add a check
for the L3 resource before calling rmdir_mondata_subdir_allrdtgrp_snc().
I see this as something like:
if (r->rid == RDT_RESOURCE_L3 && r->mon_scope == RESCTRL_L3_NODE) {
rmdir_mondata_subdir_allrdtgrp_snc(r, hdr);
...
}
I understand that rmdir_mondata_subdir_allrdtgrp_snc() may look something like below
but I still find the flow easier to follow if a resource check is done before calling
rmdir_mondata_subdir_allrdtgrp_snc().
rmdir_mondata_subdir_allrdtgrp_snc(r, hdr)
{
if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
return -EINVAL;
d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
...
}
Reinette
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v11 20/31] fs/resctrl: Refactor Sub-NUMA Cluster (SNC) in mkdir/rmdir code flow
2025-10-10 0:16 ` Reinette Chatre
@ 2025-10-10 1:14 ` Luck, Tony
2025-10-10 1:54 ` Reinette Chatre
0 siblings, 1 reply; 84+ messages in thread
From: Luck, Tony @ 2025-10-10 1:14 UTC (permalink / raw)
To: Reinette Chatre
Cc: Fenghua Yu, Wieczor-Retman, Maciej, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen, Yu C, x86@kernel.org,
linux-kernel@vger.kernel.org, patches@lists.linux.dev
On Thu, Oct 09, 2025 at 05:16:00PM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 10/9/25 3:08 PM, Luck, Tony wrote:
> >> I did not mean to imply that this can be supported without refactoring. It does
> >> seem as though you agree that mon_data::sum may be used for something
> >> other than SNC and thus that using mon_data::sum as a check for SNC is not ideal.
> >
> > Reinette,
> >
> > Yes, we are in agreement about non-SNC future usage.
> >
> > Is it sufficient that I plant some WARN_ON_ONCE() in places where the
> > code assumes that mon_data::sum is only used by RDT_RESOURCE_L3
> > or for SNC?
>
> From what I understand this series does this already? I think this only applies to
> rdtgroup_mondata_show() that does below ("L3 specific" comments added by me just for this example)
> in this series:
>
> rdtgroup_mondata_show()
> {
> ...
> if (md->sum) {
> struct rdt_l3_mon_domain *d;
>
> if (WARN_ON_ONCE(resid != RDT_RESOURCE_L3)) {
Exactly what I now have.
> ...
My "..." is:
return -EINVAL;
> }
>
> list_for_each_entry(d, &r->mon_domains, hdr.list) {
> if (d->ci_id == domid) { /* L3 specific field */
> ...
> /* L3 specific */
> ci = get_cpu_cacheinfo_level(cpu, RESCTRL_L3_CACHE);
> }
> }
> ...
> }
>
> This seems reasonable since the flow is different from the typical "check resource"
> followed by a domain_header_is_valid() that a refactor to support another resource
> would probably do as you state below.
I looked around to see if there were any other places that needed this,
but they all have checks for RDT_RESOURCE_L3 by the end of the series.
I've added a check in __mon_event_count() in patch 13 that gets deleted
in patch 18 when the L3 code is split out into a separate function.
> >
> > Such code can be fixed by future patches that want to use mon_data::sum
> > for other things.
>
> This discussion digressed a bit. The discussion started with a request to add a check
> for the L3 resource before calling rmdir_mondata_subdir_allrdtgrp_snc().
> I see this as something like:
> if (r->rid == RDT_RESOURCE_L3 && r->mon_scope == RESCTRL_L3_NODE) {
I'll add this. Same is needed in mkdir_mondata_subdir().
> rmdir_mondata_subdir_allrdtgrp_snc(r, hdr);
> ...
> }
>
> I understand that rmdir_mondata_subdir_allrdtgrp_snc() may look something like below
> but I still find the flow easier to follow if a resource check is done before calling
> rmdir_mondata_subdir_allrdtgrp_snc().
>
> rmdir_mondata_subdir_allrdtgrp_snc(r, hdr)
> {
> if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
> return -EINVAL;
>
> d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
> ...
>
> }
>
> Reinette
-Tony
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v11 20/31] fs/resctrl: Refactor Sub-NUMA Cluster (SNC) in mkdir/rmdir code flow
2025-10-10 1:14 ` Luck, Tony
@ 2025-10-10 1:54 ` Reinette Chatre
0 siblings, 0 replies; 84+ messages in thread
From: Reinette Chatre @ 2025-10-10 1:54 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Wieczor-Retman, Maciej, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen, Yu C, x86@kernel.org,
linux-kernel@vger.kernel.org, patches@lists.linux.dev
Hi Tony,
On 10/9/25 6:14 PM, Luck, Tony wrote:
> On Thu, Oct 09, 2025 at 05:16:00PM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 10/9/25 3:08 PM, Luck, Tony wrote:
>>>> I did not mean to imply that this can be supported without refactoring. It does
>>>> seem as though you agree that mon_data::sum may be used for something
>>>> other than SNC and thus that using mon_data::sum as a check for SNC is not ideal.
>>>
>>> Reinette,
>>>
>>> Yes, we are in agreement about non-SNC future usage.
>>>
>>> Is it sufficient that I plant some WARN_ON_ONCE() in places where the
>>> code assumes that mon_data::sum is only used by RDT_RESOURCE_L3
>>> or for SNC?
>>
>> From what I understand this series does this already? I think this only applies to
>> rdtgroup_mondata_show() that does below ("L3 specific" comments added by me just for this example)
>> in this series:
>>
>> rdtgroup_mondata_show()
>> {
>> ...
>> if (md->sum) {
>> struct rdt_l3_mon_domain *d;
>>
>> if (WARN_ON_ONCE(resid != RDT_RESOURCE_L3)) {
>
> Exactly what I now have.
>> ...
> My "..." is:
> return -EINVAL;
>> }
>>
>> list_for_each_entry(d, &r->mon_domains, hdr.list) {
>> if (d->ci_id == domid) { /* L3 specific field */
>> ...
>> /* L3 specific */
>> ci = get_cpu_cacheinfo_level(cpu, RESCTRL_L3_CACHE);
>> }
>> }
>> ...
>> }
>>
>> This seems reasonable since the flow is different from the typical "check resource"
>> followed by a domain_header_is_valid() that a refactor to support another resource
>> would probably do as you state below.
>
> I looked around to see if there were any other places that needed this,
> but they all have checks for RDT_RESOURCE_L3 by the end of the series.
Thank you for checking. This seems like a good pattern to use consistently.
> I've added a check in __mon_event_count() in patch 13 that gets deleted
> in patch 18 when the L3 code is split out into a separate function.
>
>>>
>>> Such code can be fixed by future patches that want to use mon_data::sum
>>> for other things.
>>
>> This discussion digressed a bit. The discussion started with a request to add a check
>> for the L3 resource before calling rmdir_mondata_subdir_allrdtgrp_snc().
>> I see this as something like:
>> if (r->rid == RDT_RESOURCE_L3 && r->mon_scope == RESCTRL_L3_NODE) {
>
> I'll add this. Same is needed in mkdir_mondata_subdir().
>
Thank you very much.
Reinette
^ permalink raw reply [flat|nested] 84+ messages in thread
* [PATCH v11 21/31] x86/resctrl: Handle domain creation/deletion for RDT_RESOURCE_PERF_PKG
2025-09-25 20:02 [PATCH v11 00/31] x86,fs/resctrl telemetry monitoring Tony Luck
` (19 preceding siblings ...)
2025-09-25 20:03 ` [PATCH v11 20/31] fs/resctrl: Refactor Sub-NUMA Cluster (SNC) in mkdir/rmdir code flow Tony Luck
@ 2025-09-25 20:03 ` Tony Luck
2025-10-04 0:00 ` Reinette Chatre
2025-09-25 20:03 ` [PATCH v11 22/31] x86/resctrl: Add energy/perf choices to rdt boot option Tony Luck
` (9 subsequent siblings)
30 siblings, 1 reply; 84+ messages in thread
From: Tony Luck @ 2025-09-25 20:03 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
The L3 resource has several requirements for domains. There are per-domain
structures that hold the 64-bit values of counters, and elements to keep
track of the overflow and limbo threads.
None of these are needed for the PERF_PKG resource. The hardware counters
are wide enough that they do not wrap around for decades.
Define a new rdt_perf_pkg_mon_domain structure which just consists of
the standard rdt_domain_hdr to keep track of domain id and CPU mask.
Support the PERF_PKG resource in the CPU online/offline handlers.
Add WARN checks to code that sums domains for Sub-NUMA cluster to
confirm the resource ID is RDT_RESOURCE_L3.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/internal.h | 13 +++++++++++
arch/x86/kernel/cpu/resctrl/core.c | 15 +++++++++++++
arch/x86/kernel/cpu/resctrl/intel_aet.c | 29 +++++++++++++++++++++++++
fs/resctrl/ctrlmondata.c | 5 +++++
fs/resctrl/rdtgroup.c | 10 +++++++++
5 files changed, 72 insertions(+)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 97616c81682b..b920f54f8736 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -84,6 +84,14 @@ static inline struct rdt_hw_l3_mon_domain *resctrl_to_arch_mon_dom(struct rdt_l3
return container_of(r, struct rdt_hw_l3_mon_domain, d_resctrl);
}
+/**
+ * struct rdt_perf_pkg_mon_domain - CPUs sharing a package scoped resctrl monitor resource
+ * @hdr: common header for different domain types
+ */
+struct rdt_perf_pkg_mon_domain {
+ struct rdt_domain_hdr hdr;
+};
+
/**
* struct msr_param - set a range of MSRs from a domain
* @res: The resource to use
@@ -222,6 +230,8 @@ bool intel_aet_get_events(void);
void __exit intel_aet_exit(void);
int intel_aet_read_event(int domid, u32 rmid, enum resctrl_event_id evtid,
void *arch_priv, u64 *val);
+void intel_aet_mon_domain_setup(int cpu, int id, struct rdt_resource *r,
+ struct list_head *add_pos);
#else
static inline bool intel_aet_get_events(void) { return false; }
static inline void __exit intel_aet_exit(void) { }
@@ -230,6 +240,9 @@ static inline int intel_aet_read_event(int domid, u32 rmid, enum resctrl_event_i
{
return -EINVAL;
}
+
+static inline void intel_aet_mon_domain_setup(int cpu, int id, struct rdt_resource *r,
+ struct list_head *add_pos) { }
#endif
#endif /* _ASM_X86_RESCTRL_INTERNAL_H */
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 588de539a739..5dff83e763a5 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -573,6 +573,10 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
if (!hdr)
l3_mon_domain_setup(cpu, id, r, add_pos);
break;
+ case RDT_RESOURCE_PERF_PKG:
+ if (!hdr)
+ intel_aet_mon_domain_setup(cpu, id, r, add_pos);
+ break;
default:
pr_warn_once("Unknown resource rid=%d\n", r->rid);
break;
@@ -635,6 +639,7 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
{
int id = get_domain_id_from_scope(cpu, r->mon_scope);
+ struct rdt_perf_pkg_mon_domain *pkgd;
struct rdt_hw_l3_mon_domain *hw_dom;
struct rdt_l3_mon_domain *d;
struct rdt_domain_hdr *hdr;
@@ -670,6 +675,16 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
synchronize_rcu();
l3_mon_domain_free(hw_dom);
break;
+ case RDT_RESOURCE_PERF_PKG:
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_PERF_PKG))
+ return;
+
+ pkgd = container_of(hdr, struct rdt_perf_pkg_mon_domain, hdr);
+ resctrl_offline_mon_domain(r, hdr);
+ list_del_rcu(&hdr->list);
+ synchronize_rcu();
+ kfree(pkgd);
+ break;
default:
pr_warn_once("Unknown resource rid=%d\n", r->rid);
break;
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index d53211ac6204..dc0d16af66be 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -17,16 +17,21 @@
#include <linux/compiler_types.h>
#include <linux/container_of.h>
#include <linux/cpu.h>
+#include <linux/cpumask.h>
#include <linux/err.h>
#include <linux/errno.h>
+#include <linux/gfp_types.h>
#include <linux/init.h>
#include <linux/intel_pmt_features.h>
#include <linux/intel_vsec.h>
#include <linux/io.h>
#include <linux/overflow.h>
#include <linux/printk.h>
+#include <linux/rculist.h>
+#include <linux/rcupdate.h>
#include <linux/resctrl.h>
#include <linux/resctrl_types.h>
+#include <linux/slab.h>
#include <linux/stddef.h>
#include <linux/topology.h>
#include <linux/types.h>
@@ -282,3 +287,27 @@ int intel_aet_read_event(int domid, u32 rmid, enum resctrl_event_id eventid,
return valid ? 0 : -EINVAL;
}
+
+void intel_aet_mon_domain_setup(int cpu, int id, struct rdt_resource *r,
+ struct list_head *add_pos)
+{
+ struct rdt_perf_pkg_mon_domain *d;
+ int err;
+
+ d = kzalloc_node(sizeof(*d), GFP_KERNEL, cpu_to_node(cpu));
+ if (!d)
+ return;
+
+ d->hdr.id = id;
+ d->hdr.type = RESCTRL_MON_DOMAIN;
+ d->hdr.rid = r->rid;
+ cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+ list_add_tail_rcu(&d->hdr.list, add_pos);
+
+ err = resctrl_online_mon_domain(r, &d->hdr);
+ if (err) {
+ list_del_rcu(&d->hdr.list);
+ synchronize_rcu();
+ kfree(d);
+ }
+}
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index ae43e09fa5e5..f7fbfc4d258d 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -712,6 +712,11 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
if (md->sum) {
struct rdt_l3_mon_domain *d;
+ if (WARN_ON_ONCE(resid != RDT_RESOURCE_L3)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
/*
* This file requires summing across all domains that share
* the L3 cache id that was provided in the "domid" field of the
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index cab5cb9e6c93..fa6dfebea6b2 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -3040,6 +3040,9 @@ static struct mon_data *mon_get_kn_priv(enum resctrl_res_level rid, int domid,
lockdep_assert_held(&rdtgroup_mutex);
+ if (WARN_ON_ONCE(do_sum && rid != RDT_RESOURCE_L3))
+ return NULL;
+
list_for_each_entry(priv, &mon_data_kn_priv_list, list) {
if (priv->rid == rid && priv->domid == domid &&
priv->sum == do_sum && priv->evt == mevt)
@@ -4227,6 +4230,9 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
if (resctrl_mounted && resctrl_arch_mon_capable())
rmdir_mondata_subdir_allrdtgrp(r, hdr);
+ if (r->rid != RDT_RESOURCE_L3)
+ goto out_unlock;
+
if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
goto out_unlock;
@@ -4327,6 +4333,9 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr
mutex_lock(&rdtgroup_mutex);
+ if (r->rid != RDT_RESOURCE_L3)
+ goto mkdir;
+
if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
goto out_unlock;
@@ -4344,6 +4353,7 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr
if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID))
INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo);
+mkdir:
err = 0;
/*
* If the filesystem is not mounted then only the default resource group
--
2.51.0
^ permalink raw reply related [flat|nested] 84+ messages in thread
* Re: [PATCH v11 21/31] x86/resctrl: Handle domain creation/deletion for RDT_RESOURCE_PERF_PKG
2025-09-25 20:03 ` [PATCH v11 21/31] x86/resctrl: Handle domain creation/deletion for RDT_RESOURCE_PERF_PKG Tony Luck
@ 2025-10-04 0:00 ` Reinette Chatre
0 siblings, 0 replies; 84+ messages in thread
From: Reinette Chatre @ 2025-10-04 0:00 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
(subject prefix missing the resctrl fs changes)
On 9/25/25 1:03 PM, Tony Luck wrote:
> The L3 resource has several requirements for domains. There are per-domain
> structures that hold the 64-bit values of counters, and elements to keep
> track of the overflow and limbo threads.
>
> None of these are needed for the PERF_PKG resource. The hardware counters
> are wide enough that they do not wrap around for decades.
>
> Define a new rdt_perf_pkg_mon_domain structure which just consists of
> the standard rdt_domain_hdr to keep track of domain id and CPU mask.
>
> Support the PERF_PKG resource in the CPU online/offline handlers.
Above can be seen from the patch. It would be helpful to highlight what this
support involves since it is not obvious from changes that just add
gotos.
>
> Add WARN checks to code that sums domains for Sub-NUMA cluster to
> confirm the resource ID is RDT_RESOURCE_L3.
Above is clear from the patch. Is there a "why"? It is not clear what these
counter-reading changes have to do with the domain creation and deletion
topic of this patch.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 13 +++++++++++
> arch/x86/kernel/cpu/resctrl/core.c | 15 +++++++++++++
> arch/x86/kernel/cpu/resctrl/intel_aet.c | 29 +++++++++++++++++++++++++
> fs/resctrl/ctrlmondata.c | 5 +++++
> fs/resctrl/rdtgroup.c | 10 +++++++++
> 5 files changed, 72 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 97616c81682b..b920f54f8736 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -84,6 +84,14 @@ static inline struct rdt_hw_l3_mon_domain *resctrl_to_arch_mon_dom(struct rdt_l3
> return container_of(r, struct rdt_hw_l3_mon_domain, d_resctrl);
> }
>
> +/**
> + * struct rdt_perf_pkg_mon_domain - CPUs sharing a package scoped resctrl monitor resource
> + * @hdr: common header for different domain types
> + */
> +struct rdt_perf_pkg_mon_domain {
> + struct rdt_domain_hdr hdr;
> +};
> +
> /**
> * struct msr_param - set a range of MSRs from a domain
> * @res: The resource to use
> @@ -222,6 +230,8 @@ bool intel_aet_get_events(void);
> void __exit intel_aet_exit(void);
> int intel_aet_read_event(int domid, u32 rmid, enum resctrl_event_id evtid,
> void *arch_priv, u64 *val);
> +void intel_aet_mon_domain_setup(int cpu, int id, struct rdt_resource *r,
> + struct list_head *add_pos);
> #else
> static inline bool intel_aet_get_events(void) { return false; }
> static inline void __exit intel_aet_exit(void) { }
> @@ -230,6 +240,9 @@ static inline int intel_aet_read_event(int domid, u32 rmid, enum resctrl_event_i
> {
> return -EINVAL;
> }
> +
> +static inline void intel_aet_mon_domain_setup(int cpu, int id, struct rdt_resource *r,
> + struct list_head *add_pos) { }
> #endif
>
> #endif /* _ASM_X86_RESCTRL_INTERNAL_H */
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 588de539a739..5dff83e763a5 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -573,6 +573,10 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> if (!hdr)
> l3_mon_domain_setup(cpu, id, r, add_pos);
> break;
> + case RDT_RESOURCE_PERF_PKG:
> + if (!hdr)
> + intel_aet_mon_domain_setup(cpu, id, r, add_pos);
> + break;
> default:
> pr_warn_once("Unknown resource rid=%d\n", r->rid);
> break;
> @@ -635,6 +639,7 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
> static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
> {
> int id = get_domain_id_from_scope(cpu, r->mon_scope);
> + struct rdt_perf_pkg_mon_domain *pkgd;
Please move this declaration to resource specific case statement block.
> struct rdt_hw_l3_mon_domain *hw_dom;
> struct rdt_l3_mon_domain *d;
> struct rdt_domain_hdr *hdr;
> @@ -670,6 +675,16 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
> synchronize_rcu();
> l3_mon_domain_free(hw_dom);
> break;
> + case RDT_RESOURCE_PERF_PKG:
> + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_PERF_PKG))
> + return;
> +
> + pkgd = container_of(hdr, struct rdt_perf_pkg_mon_domain, hdr);
> + resctrl_offline_mon_domain(r, hdr);
> + list_del_rcu(&hdr->list);
> + synchronize_rcu();
> + kfree(pkgd);
> + break;
> default:
> pr_warn_once("Unknown resource rid=%d\n", r->rid);
> break;
> diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> index d53211ac6204..dc0d16af66be 100644
> --- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
> +++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> @@ -17,16 +17,21 @@
> #include <linux/compiler_types.h>
> #include <linux/container_of.h>
> #include <linux/cpu.h>
> +#include <linux/cpumask.h>
> #include <linux/err.h>
> #include <linux/errno.h>
> +#include <linux/gfp_types.h>
> #include <linux/init.h>
> #include <linux/intel_pmt_features.h>
> #include <linux/intel_vsec.h>
> #include <linux/io.h>
> #include <linux/overflow.h>
> #include <linux/printk.h>
> +#include <linux/rculist.h>
> +#include <linux/rcupdate.h>
> #include <linux/resctrl.h>
> #include <linux/resctrl_types.h>
> +#include <linux/slab.h>
> #include <linux/stddef.h>
> #include <linux/topology.h>
> #include <linux/types.h>
> @@ -282,3 +287,27 @@ int intel_aet_read_event(int domid, u32 rmid, enum resctrl_event_id eventid,
>
> return valid ? 0 : -EINVAL;
> }
> +
> +void intel_aet_mon_domain_setup(int cpu, int id, struct rdt_resource *r,
> + struct list_head *add_pos)
> +{
> + struct rdt_perf_pkg_mon_domain *d;
> + int err;
> +
> + d = kzalloc_node(sizeof(*d), GFP_KERNEL, cpu_to_node(cpu));
> + if (!d)
> + return;
> +
> + d->hdr.id = id;
> + d->hdr.type = RESCTRL_MON_DOMAIN;
> + d->hdr.rid = r->rid;
In this series l3_mon_domain_setup() received an update to hardcode the resource. Can this do
the same to be consistent?
Reinette
^ permalink raw reply [flat|nested] 84+ messages in thread
* [PATCH v11 22/31] x86/resctrl: Add energy/perf choices to rdt boot option
2025-09-25 20:02 [PATCH v11 00/31] x86,fs/resctrl telemetry monitoring Tony Luck
` (20 preceding siblings ...)
2025-09-25 20:03 ` [PATCH v11 21/31] x86/resctrl: Handle domain creation/deletion for RDT_RESOURCE_PERF_PKG Tony Luck
@ 2025-09-25 20:03 ` Tony Luck
2025-09-25 20:03 ` [PATCH v11 23/31] x86/resctrl: Handle number of RMIDs supported by telemetry resources Tony Luck
` (8 subsequent siblings)
30 siblings, 0 replies; 84+ messages in thread
From: Tony Luck @ 2025-09-25 20:03 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Legacy resctrl features are enumerated by X86_FEATURE_* flags. These
may be overridden by quirks to disable features in the case of errata.
Users can use kernel command line options to either disable a feature,
or to force enable a feature that was disabled by a quirk.
Provide similar functionality for hardware features that do not have an
X86_FEATURE_* flag. Unlike other features that are tied to X86_FEATURE_*
flags, these must be queried by name. Add rdt_is_feature_enabled()
to check whether quirks or kernel command line have disabled a feature.
Users may force a feature to be disabled. E.g. "rdt=!perf" will ensure
that none of the perf telemetry events are enabled.
Resctrl architecture code may disable a feature that does not provide
full functionality. Users may override that decision. E.g. "rdt=energy"
will enable any available energy telemetry events even if they do not
provide full functionality.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
.../admin-guide/kernel-parameters.txt | 2 +-
arch/x86/kernel/cpu/resctrl/internal.h | 2 ++
arch/x86/kernel/cpu/resctrl/core.c | 29 +++++++++++++++++++
3 files changed, 32 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 889e68e83682..74bc150b53f7 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6155,7 +6155,7 @@
rdt= [HW,X86,RDT]
Turn on/off individual RDT features. List is:
cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, l2cdp,
- mba, smba, bmec, abmc.
+ mba, smba, bmec, abmc, energy, perf.
E.g. to turn on cmt and turn off mba use:
rdt=cmt,!mba
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index b920f54f8736..e3710b9f993e 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -225,6 +225,8 @@ void __init intel_rdt_mbm_apply_quirk(void);
void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
void resctrl_arch_mbm_cntr_assign_set_one(struct rdt_resource *r);
+bool rdt_is_feature_enabled(char *name);
+
#ifdef CONFIG_X86_CPU_RESCTRL_INTEL_AET
bool intel_aet_get_events(void);
void __exit intel_aet_exit(void);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 5dff83e763a5..f749a871e8c5 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -766,6 +766,8 @@ enum {
RDT_FLAG_SMBA,
RDT_FLAG_BMEC,
RDT_FLAG_ABMC,
+ RDT_FLAG_ENERGY,
+ RDT_FLAG_PERF,
};
#define RDT_OPT(idx, n, f) \
@@ -792,6 +794,8 @@ static struct rdt_options rdt_options[] __ro_after_init = {
RDT_OPT(RDT_FLAG_SMBA, "smba", X86_FEATURE_SMBA),
RDT_OPT(RDT_FLAG_BMEC, "bmec", X86_FEATURE_BMEC),
RDT_OPT(RDT_FLAG_ABMC, "abmc", X86_FEATURE_ABMC),
+ RDT_OPT(RDT_FLAG_ENERGY, "energy", 0),
+ RDT_OPT(RDT_FLAG_PERF, "perf", 0),
};
#define NUM_RDT_OPTIONS ARRAY_SIZE(rdt_options)
@@ -841,6 +845,31 @@ bool rdt_cpu_has(int flag)
return ret;
}
+/*
+ * Hardware features that do not have X86_FEATURE_* bits. There is no
+ * "hardware does not support this at all" case. Assume that the caller
+ * has already determined that hardware support is present and just needs
+ * to check if the feature has been disabled by a quirk that has not been
+ * overridden by a command line option.
+ */
+bool rdt_is_feature_enabled(char *name)
+{
+ struct rdt_options *o;
+ bool ret = true;
+
+ for (o = rdt_options; o < &rdt_options[NUM_RDT_OPTIONS]; o++) {
+ if (!strcmp(name, o->name)) {
+ if (o->force_off)
+ ret = false;
+ if (o->force_on)
+ ret = true;
+ break;
+ }
+ }
+
+ return ret;
+}
+
bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt)
{
if (!rdt_cpu_has(X86_FEATURE_BMEC))
--
2.51.0
^ permalink raw reply related [flat|nested] 84+ messages in thread* [PATCH v11 23/31] x86/resctrl: Handle number of RMIDs supported by telemetry resources
2025-09-25 20:02 [PATCH v11 00/31] x86,fs/resctrl telemetry monitoring Tony Luck
` (21 preceding siblings ...)
2025-09-25 20:03 ` [PATCH v11 22/31] x86/resctrl: Add energy/perf choices to rdt boot option Tony Luck
@ 2025-09-25 20:03 ` Tony Luck
2025-10-04 0:06 ` Reinette Chatre
2025-09-25 20:03 ` [PATCH v11 24/31] fs/resctrl: Move allocation/free of closid_num_dirty_rmid[] Tony Luck
` (7 subsequent siblings)
30 siblings, 1 reply; 84+ messages in thread
From: Tony Luck @ 2025-09-25 20:03 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
There are now three meanings for "number of RMIDs":
1) The number for legacy features enumerated by CPUID leaf 0xF. This
is the maximum number of distinct values that can be loaded into the
IA32_PQR_ASSOC MSR. Note that systems with Sub-NUMA Cluster mode enabled
will force scaling down the CPUID enumerated value by the number of SNC
nodes per L3-cache.
2) The number of registers in MMIO space for each event. This
is enumerated in the XML files and is the value initialized into
event_group::num_rmids.
3) The number of "hardware counters" (this isn't a strictly accurate
description of how things work, but serves as a useful analogy that
does describe the limitations) feeding into those MMIO registers. This
is enumerated in telemetry_region::num_rmids returned from the call to
intel_pmt_get_regions_by_feature().
Event groups with insufficient "hardware counters" to track all RMIDs
are difficult for users to use, since the system may reassign "hardware
counters" at any time. This means that users cannot reliably collect
two consecutive event counts to compute the rate at which events are
occurring.
Introduce rdt_set_feature_disabled() to mark any under-resourced event
groups (those with telemetry_region::num_rmids < event_group::num_rmids)
as unusable. Note that the rdt_options[] structure must now be writable
at run-time. The request to disable will be overridden if the user
explicitly requests to enable using the "rdt=" Linux boot argument.
This will result in the available number of monitoring resource groups
being limited by the under-resourced event groups.
Scan all enabled event groups and assign the RDT_RESOURCE_PERF_PKG
resource "num_rmids" value to the smallest of these values as this value
will be used later to compare against the number of RMIDs supported
by other resources to determine how many monitoring resource groups
are supported.
N.B. Change type of rdt_resource::num_rmid to u32 to match type of
event_group::num_rmids so that min(r->num_rmid, e->num_rmids) won't
complain about mixing signed and unsigned types. Print r->num_rmid as
unsigned value in rdt_num_rmids_show().
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 2 +-
arch/x86/kernel/cpu/resctrl/internal.h | 2 ++
arch/x86/kernel/cpu/resctrl/core.c | 18 +++++++++-
arch/x86/kernel/cpu/resctrl/intel_aet.c | 48 +++++++++++++++++++++++++
fs/resctrl/rdtgroup.c | 2 +-
5 files changed, 69 insertions(+), 3 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 111c8f1dc77e..c7b5e56d25bb 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -292,7 +292,7 @@ enum resctrl_schema_fmt {
* events of monitor groups created via mkdir.
*/
struct resctrl_mon {
- int num_rmid;
+ u32 num_rmid;
unsigned int mbm_cfg_mask;
int num_mbm_cntrs;
bool mbm_cntr_assignable;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index e3710b9f993e..cea76f88422c 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -227,6 +227,8 @@ void resctrl_arch_mbm_cntr_assign_set_one(struct rdt_resource *r);
bool rdt_is_feature_enabled(char *name);
+void rdt_set_feature_disabled(char *name);
+
#ifdef CONFIG_X86_CPU_RESCTRL_INTEL_AET
bool intel_aet_get_events(void);
void __exit intel_aet_exit(void);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index f749a871e8c5..5b7f9a44d562 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -782,7 +782,7 @@ struct rdt_options {
bool force_off, force_on;
};
-static struct rdt_options rdt_options[] __ro_after_init = {
+static struct rdt_options rdt_options[] = {
RDT_OPT(RDT_FLAG_CMT, "cmt", X86_FEATURE_CQM_OCCUP_LLC),
RDT_OPT(RDT_FLAG_MBM_TOTAL, "mbmtotal", X86_FEATURE_CQM_MBM_TOTAL),
RDT_OPT(RDT_FLAG_MBM_LOCAL, "mbmlocal", X86_FEATURE_CQM_MBM_LOCAL),
@@ -845,6 +845,22 @@ bool rdt_cpu_has(int flag)
return ret;
}
+/*
+ * Can be called during feature enumeration if sanity check of
+ * a feature's parameters indicates problems with the feature.
+ */
+void rdt_set_feature_disabled(char *name)
+{
+ struct rdt_options *o;
+
+ for (o = rdt_options; o < &rdt_options[NUM_RDT_OPTIONS]; o++) {
+ if (!strcmp(name, o->name)) {
+ o->force_off = true;
+ return;
+ }
+ }
+}
+
/*
* Hardware features that do not have X86_FEATURE_* bits. There is no
* "hardware does not support this at all" case. Assume that the caller
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index dc0d16af66be..039e63d8c2e7 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -25,6 +25,7 @@
#include <linux/intel_pmt_features.h>
#include <linux/intel_vsec.h>
#include <linux/io.h>
+#include <linux/minmax.h>
#include <linux/overflow.h>
#include <linux/printk.h>
#include <linux/rculist.h>
@@ -55,6 +56,7 @@ struct pmt_event {
/**
* struct event_group - All information about a group of telemetry events.
+ * @name: Name for this group (used by boot rdt= option)
* @pfg: Points to the aggregated telemetry space information
* returned by the intel_pmt_get_regions_by_feature()
* call to the INTEL_PMT_TELEMETRY driver that contains
@@ -62,16 +64,22 @@ struct pmt_event {
* Valid if the system supports the event group.
* NULL otherwise.
* @guid: Unique number per XML description file.
+ * @num_rmids: Number of RMIDs supported by this group. May be
+ * adjusted downwards if enumeration from
+ * intel_pmt_get_regions_by_feature() indicates fewer
+ * RMIDs can be tracked simultaneously.
* @mmio_size: Number of bytes of MMIO registers for this group.
* @num_events: Number of events in this group.
* @evts: Array of event descriptors.
*/
struct event_group {
/* Data fields for additional structures to manage this group. */
+ char *name;
struct pmt_feature_group *pfg;
/* Remaining fields initialized from XML file. */
u32 guid;
+ u32 num_rmids;
size_t mmio_size;
unsigned int num_events;
struct pmt_event evts[] __counted_by(num_events);
@@ -85,7 +93,9 @@ struct event_group {
* File: xml/CWF/OOBMSM/RMID-ENERGY/cwf_aggregator.xml
*/
static struct event_group energy_0x26696143 = {
+ .name = "energy",
.guid = 0x26696143,
+ .num_rmids = 576,
.mmio_size = XML_MMIO_SIZE(576, 2, 3),
.num_events = 2,
.evts = {
@@ -99,7 +109,9 @@ static struct event_group energy_0x26696143 = {
* File: xml/CWF/OOBMSM/RMID-PERF/cwf_aggregator.xml
*/
static struct event_group perf_0x26557651 = {
+ .name = "perf",
.guid = 0x26557651,
+ .num_rmids = 576,
.mmio_size = XML_MMIO_SIZE(576, 7, 3),
.num_events = 7,
.evts = {
@@ -156,21 +168,57 @@ static void mark_telem_region_unusable(struct telemetry_region *tr)
tr->addr = NULL;
}
+static bool all_regions_have_sufficient_rmid(struct event_group *e, struct pmt_feature_group *p)
+{
+ struct telemetry_region *tr;
+
+ for (int i = 0; i < p->count; i++) {
+ tr = &p->regions[i];
+ if (skip_telem_region(tr, e))
+ continue;
+
+ if (tr->num_rmids < e->num_rmids)
+ return false;
+ }
+
+ return true;
+}
+
static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_PERF_PKG].r_resctrl;
bool usable_events = false;
+ /* Disable feature if insufficient RMIDs */
+ if (!all_regions_have_sufficient_rmid(e, p))
+ rdt_set_feature_disabled(e->name);
+
+ /* User can override above disable from kernel command line */
+ if (!rdt_is_feature_enabled(e->name))
+ return false;
+
for (int i = 0; i < p->count; i++) {
if (skip_telem_region(&p->regions[i], e)) {
mark_telem_region_unusable(&p->regions[i]);
continue;
}
+
+ /*
+ * e->num_rmids only adjusted lower if user forces an unusable
+ * region to be usable
+ */
+ e->num_rmids = min(e->num_rmids, p->regions[i].num_rmids);
usable_events = true;
}
if (!usable_events)
return false;
+ if (r->mon.num_rmid)
+ r->mon.num_rmid = min(r->mon.num_rmid, e->num_rmids);
+ else
+ r->mon.num_rmid = e->num_rmids;
+
for (int j = 0; j < e->num_events; j++)
resctrl_enable_mon_event(e->evts[j].id, true,
e->evts[j].bin_bits, &e->evts[j]);
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index fa6dfebea6b2..19efb345c4a6 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -1135,7 +1135,7 @@ static int rdt_num_rmids_show(struct kernfs_open_file *of,
{
struct rdt_resource *r = rdt_kn_parent_priv(of->kn);
- seq_printf(seq, "%d\n", r->mon.num_rmid);
+ seq_printf(seq, "%u\n", r->mon.num_rmid);
return 0;
}
--
2.51.0
* Re: [PATCH v11 23/31] x86/resctrl: Handle number of RMIDs supported by telemetry resources
2025-09-25 20:03 ` [PATCH v11 23/31] x86/resctrl: Handle number of RMIDs supported by telemetry resources Tony Luck
@ 2025-10-04 0:06 ` Reinette Chatre
0 siblings, 0 replies; 84+ messages in thread
From: Reinette Chatre @ 2025-10-04 0:06 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
(nit in subject ... "resources" -> "resource" ... with caveat that
the term "telemetry resource" is not used much at all in this series)
On 9/25/25 1:03 PM, Tony Luck wrote:
> There are now three meanings for "number of RMIDs":
>
> 1) The number for legacy features enumerated by CPUID leaf 0xF. This
> is the maximum number of distinct values that can be loaded into the
> IA32_PQR_ASSOC MSR. Note that systems with Sub-NUMA Cluster mode enabled
"the IA32_PQR_ASSOC MSR" -> "MSR_IA32_PQR_ASSOC"
> will force scaling down the CPUID enumerated value by the number of SNC
> nodes per L3-cache.
>
> 2) The number of registers in MMIO space for each event. This
> is enumerated in the XML files and is the value initialized into
> event_group::num_rmids.
>
> 3) The number of "hardware counters" (this isn't a strictly accurate
> description of how things work, but serves as a useful analogy that
> does describe the limitations) feeding to those MMIO registers. This
> is enumerated in telemetry_region::num_rmids returned from the call to
> intel_pmt_get_regions_by_feature()
>
> Event groups with insufficient "hardware counters" to track all RMIDs
> are difficult for users to use, since the system may reassign "hardware
> counters" at any time. This means that users cannot reliably collect
> two consecutive event counts to compute the rate at which events are
> occurring.
>
> Introduce rdt_set_feature_disabled() to mark any under-resourced event
> groups (those with telemetry_region::num_rmids < event_group::num_rmids)
Would it be more accurate to say
"(those with telemetry_region::num_rmids < event_group::num_rmids for any
of the event group's telemetry regions)"
> as unusable. Note that the rdt_options[] structure must now be writable
> at run-time. The request to disable will be overridden if the user
"Override the request ..."?
> explicitly requests to enable using the "rdt=" Linux boot argument.
> This will result in the available number of monitoring resource groups
> being limited by the under-resourced event groups.
needs imperative ... how about something like (for text starting with "The
request to disable ..."):
Limit an event group's number of possible monitor resource groups
to the lowest number of "hardware counters" if the user explicitly
requests to enable an under-resourced event group.
...
> @@ -156,21 +168,57 @@ static void mark_telem_region_unusable(struct telemetry_region *tr)
> tr->addr = NULL;
> }
>
> +static bool all_regions_have_sufficient_rmid(struct event_group *e, struct pmt_feature_group *p)
> +{
> + struct telemetry_region *tr;
> +
> + for (int i = 0; i < p->count; i++) {
> + tr = &p->regions[i];
> + if (skip_telem_region(tr, e))
> + continue;
> +
> + if (tr->num_rmids < e->num_rmids)
> + return false;
> + }
> +
> + return true;
> +}
> +
> static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
> {
> + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_PERF_PKG].r_resctrl;
> bool usable_events = false;
>
> + /* Disable feature if insufficient RMIDs */
> + if (!all_regions_have_sufficient_rmid(e, p))
> + rdt_set_feature_disabled(e->name);
> +
> + /* User can override above disable from kernel command line */
> + if (!rdt_is_feature_enabled(e->name))
> + return false;
> +
> for (int i = 0; i < p->count; i++) {
> if (skip_telem_region(&p->regions[i], e)) {
> mark_telem_region_unusable(&p->regions[i]);
> continue;
> }
It is unexpected to me that skip_telem_region() needs to be run twice, with the
second pass marking regions as unusable. I think it will be simpler to just run
skip_telem_region() once to determine which telemetry regions are unusable, mark them as
such at that time, and from that point forward just interact with the usable telemetry
regions?
> +
> + /*
> + * e->num_rmids only adjusted lower if user forces an unusable
> + * region to be usable
In this function usable/unusable regions have a distinct meaning that is different
from what this comment intends since insufficient rmid does not make a region
"unusable" per skip_telem_region(). Perhaps something like:
e->num_rmids only adjusted lower if user (via rdt= kernel parameter) forces
an event group with insufficient RMID to be enabled.
Reinette
* [PATCH v11 24/31] fs/resctrl: Move allocation/free of closid_num_dirty_rmid[]
2025-09-25 20:02 [PATCH v11 00/31] x86,fs/resctrl telemetry monitoring Tony Luck
` (22 preceding siblings ...)
2025-09-25 20:03 ` [PATCH v11 23/31] x86/resctrl: Handle number of RMIDs supported by telemetry resources Tony Luck
@ 2025-09-25 20:03 ` Tony Luck
2025-10-04 0:09 ` Reinette Chatre
2025-09-25 20:03 ` [PATCH v11 25/31] fs,x86/resctrl: Compute number of RMIDs as minimum across resources Tony Luck
` (6 subsequent siblings)
30 siblings, 1 reply; 84+ messages in thread
From: Tony Luck @ 2025-09-25 20:03 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
closid_num_dirty_rmid[] is allocated in dom_data_init() during resctrl
initialization and freed by dom_data_exit() during resctrl exit giving
it the same life cycle as rmid_ptrs[].
Move closid_num_dirty_rmid[] allocation/free out to
resctrl_l3_mon_resource_init() and resctrl_l3_mon_resource_exit() in
preparation for rmid_ptrs[] to be allocated on resctrl mount in support
of the new telemetry events.
Keep the rdtgroup_mutex protection around the allocation/free of
closid_num_dirty_rmid[] as ARM needs this to guarantee memory
ordering.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
fs/resctrl/monitor.c | 77 ++++++++++++++++++++++++++++----------------
1 file changed, 49 insertions(+), 28 deletions(-)
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index d484983c0f02..5960a0afd0ca 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -883,36 +883,14 @@ void mbm_setup_overflow_handler(struct rdt_l3_mon_domain *dom, unsigned long del
static int dom_data_init(struct rdt_resource *r)
{
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
- u32 num_closid = resctrl_arch_get_num_closid(r);
struct rmid_entry *entry = NULL;
int err = 0, i;
u32 idx;
mutex_lock(&rdtgroup_mutex);
- if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) {
- u32 *tmp;
-
- /*
- * If the architecture hasn't provided a sanitised value here,
- * this may result in larger arrays than necessary. Resctrl will
- * use a smaller system wide value based on the resources in
- * use.
- */
- tmp = kcalloc(num_closid, sizeof(*tmp), GFP_KERNEL);
- if (!tmp) {
- err = -ENOMEM;
- goto out_unlock;
- }
-
- closid_num_dirty_rmid = tmp;
- }
rmid_ptrs = kcalloc(idx_limit, sizeof(struct rmid_entry), GFP_KERNEL);
if (!rmid_ptrs) {
- if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) {
- kfree(closid_num_dirty_rmid);
- closid_num_dirty_rmid = NULL;
- }
err = -ENOMEM;
goto out_unlock;
}
@@ -948,11 +926,6 @@ static void dom_data_exit(struct rdt_resource *r)
if (!r->mon_capable)
goto out_unlock;
- if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) {
- kfree(closid_num_dirty_rmid);
- closid_num_dirty_rmid = NULL;
- }
-
kfree(rmid_ptrs);
rmid_ptrs = NULL;
@@ -1789,6 +1762,43 @@ ssize_t mbm_L3_assignments_write(struct kernfs_open_file *of, char *buf,
return ret ?: nbytes;
}
+static int closid_num_dirty_rmid_alloc(struct rdt_resource *r)
+{
+ if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) {
+ u32 num_closid = resctrl_arch_get_num_closid(r);
+ u32 *tmp;
+
+ /* For ARM memory ordering access to closid_num_dirty_rmid */
+ mutex_lock(&rdtgroup_mutex);
+
+ /*
+ * If the architecture hasn't provided a sanitised value here,
+ * this may result in larger arrays than necessary. Resctrl will
+ * use a smaller system wide value based on the resources in
+ * use.
+ */
+ tmp = kcalloc(num_closid, sizeof(*tmp), GFP_KERNEL);
+ if (!tmp)
+ return -ENOMEM;
+
+ closid_num_dirty_rmid = tmp;
+
+ mutex_unlock(&rdtgroup_mutex);
+ }
+
+ return 0;
+}
+
+static void closid_num_dirty_rmid_free(void)
+{
+ if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) {
+ mutex_lock(&rdtgroup_mutex);
+ kfree(closid_num_dirty_rmid);
+ closid_num_dirty_rmid = NULL;
+ mutex_unlock(&rdtgroup_mutex);
+ }
+}
+
/**
* resctrl_l3_mon_resource_init() - Initialise global monitoring structures.
*
@@ -1809,10 +1819,16 @@ int resctrl_l3_mon_resource_init(void)
if (!r->mon_capable)
return 0;
- ret = dom_data_init(r);
+ ret = closid_num_dirty_rmid_alloc(r);
if (ret)
return ret;
+ ret = dom_data_init(r);
+ if (ret) {
+ closid_num_dirty_rmid_free();
+ return ret;
+ }
+
if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_TOTAL_EVENT_ID)) {
mon_event_all[QOS_L3_MBM_TOTAL_EVENT_ID].configurable = true;
resctrl_file_fflags_init("mbm_total_bytes_config",
@@ -1857,5 +1873,10 @@ void resctrl_l3_mon_resource_exit(void)
{
struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
+ if (!r->mon_capable)
+ return;
+
+ closid_num_dirty_rmid_free();
+
dom_data_exit(r);
}
--
2.51.0
* Re: [PATCH v11 24/31] fs/resctrl: Move allocation/free of closid_num_dirty_rmid[]
2025-09-25 20:03 ` [PATCH v11 24/31] fs/resctrl: Move allocation/free of closid_num_dirty_rmid[] Tony Luck
@ 2025-10-04 0:09 ` Reinette Chatre
0 siblings, 0 replies; 84+ messages in thread
From: Reinette Chatre @ 2025-10-04 0:09 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 9/25/25 1:03 PM, Tony Luck wrote:
> closid_num_dirty_rmid[] is allocated in dom_data_init() during resctrl
> initialization and freed by dom_data_exit() during resctrl exit giving
> it the same life cycle as rmid_ptrs[].
>
> Move closid_num_dirty_rmid[] allocaction/free out to
> resctrl_l3_mon_resource_init() and resctrl_l3_mon_resource_exit() in
> preparation for rmid_ptrs[] to be allocated on resctrl mount in support
> of the new telemetry events.
>
> Keep the rdtgroup_mutex protection around the allocation/free of
> closid_num_dirty_rmid[] as ARM needs this to guarantee memory
> ordering.
I think this is heavy on describing the code that we were asked to avoid.
I amended the changelog below in an attempt to address this, please feel
free to improve:
closid_num_dirty_rmid[] and rmid_ptrs[] are allocated together
during resctrl initialization and freed together during resctrl exit.
Telemetry events are enumerated on resctrl mount so only at resctrl
mount will the number of RMID supported by all monitoring resources
and needed as size for rmid_ptrs[] be known.
Separate closid_num_dirty_rmid[] and rmid_ptrs[] allocation and free
in preparation for rmid_ptrs[] to be allocated on resctrl mount.
Keep the rdtgroup_mutex protection around the allocation and free of
closid_num_dirty_rmid[] as ARM needs this to guarantee memory
ordering.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
Patch looks good to me.
Reinette
* [PATCH v11 25/31] fs,x86/resctrl: Compute number of RMIDs as minimum across resources
2025-09-25 20:02 [PATCH v11 00/31] x86,fs/resctrl telemetry monitoring Tony Luck
` (23 preceding siblings ...)
2025-09-25 20:03 ` [PATCH v11 24/31] fs/resctrl: Move allocation/free of closid_num_dirty_rmid[] Tony Luck
@ 2025-09-25 20:03 ` Tony Luck
2025-10-04 0:10 ` Reinette Chatre
2025-09-25 20:03 ` [PATCH v11 26/31] fs/resctrl: Move RMID initialization to first mount Tony Luck
` (5 subsequent siblings)
30 siblings, 1 reply; 84+ messages in thread
From: Tony Luck @ 2025-09-25 20:03 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
resctrl assumes that only the L3 resource supports monitor events, so
it simply takes the rdt_resource::num_rmid from RDT_RESOURCE_L3 as
the system's number of RMIDs.
The addition of telemetry events in a different resource breaks that
assumption.
Compute the number of available RMIDs as the minimum value across
all mon_capable resources (analogous to how the number of CLOSIDs
is computed across alloc_capable resources).
Note that mount time enumeration of the telemetry resource means that
this number can be reduced. If this happens, then some memory will
be wasted as the allocations for rdt_l3_mon_domain::mbm_states[] and
rdt_l3_mon_domain::rmid_busy_llc created during resctrl initialization
will be larger than needed.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/core.c | 15 +++++++++++++--
fs/resctrl/rdtgroup.c | 6 ++++++
2 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 5b7f9a44d562..1d43087c5975 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -110,12 +110,23 @@ struct rdt_hw_resource rdt_resources_all[RDT_NUM_RESOURCES] = {
},
};
+/**
+ * resctrl_arch_system_num_rmid_idx() - Compute number of supported RMIDs
+ * (minimum across all mon_capable resources)
+ *
+ * Return: Number of supported RMIDs at time of call. Note that mount time
+ * enumeration of resources may reduce the number.
+ */
u32 resctrl_arch_system_num_rmid_idx(void)
{
- struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ u32 num_rmids = U32_MAX;
+ struct rdt_resource *r;
+
+ for_each_mon_capable_rdt_resource(r)
+ num_rmids = min(num_rmids, r->mon.num_rmid);
/* RMID are independent numbers for x86. num_rmid_idx == num_rmid */
- return r->mon.num_rmid;
+ return num_rmids == U32_MAX ? 0 : num_rmids;
}
struct rdt_resource *resctrl_arch_get_resource(enum resctrl_res_level l)
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 19efb345c4a6..5e3ee4b8f70b 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -4268,6 +4268,12 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
* During boot this may be called before global allocations have been made by
* resctrl_l3_mon_resource_init().
*
+ * Called during CPU online that may run as soon as CPU online callbacks
+ * are set up during resctrl initialization. The number of supported RMIDs
+ * may be reduced if additional mon_capable resources are enumerated
+ * at mount time. This means the rdt_l3_mon_domain::mbm_states[] and
+ * rdt_l3_mon_domain::rmid_busy_llc allocations may be larger than needed.
+ *
* Returns 0 for success, or -ENOMEM.
*/
static int domain_setup_l3_mon_state(struct rdt_resource *r, struct rdt_l3_mon_domain *d)
--
2.51.0
* Re: [PATCH v11 25/31] fs,x86/resctrl: Compute number of RMIDs as minimum across resources
2025-09-25 20:03 ` [PATCH v11 25/31] fs,x86/resctrl: Compute number of RMIDs as minimum across resources Tony Luck
@ 2025-10-04 0:10 ` Reinette Chatre
0 siblings, 0 replies; 84+ messages in thread
From: Reinette Chatre @ 2025-10-04 0:10 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
Please use a consistent subject prefix (x86,fs/resctrl) when patch changes both arch
and fs code.
On 9/25/25 1:03 PM, Tony Luck wrote:
> resctrl assumes that only the L3 resource supports monitor events, so
> it simply takes the rdt_resource::num_rmid from RDT_RESOURCE_L3 as
> the system's number of RMIDs.
>
> The addition of telemetry events in a different resource breaks that
> assumption.
>
> Compute the number of available RMIDs as the minimum value across
> all mon_capable resources (analogous to how the number of CLOSIDs
> is computed across alloc_capable resources).
>
> Note that mount time enumeration of the telemetry resource means that
> this number can be reduced. If this happens, then some memory will
> be wasted as the allocations for rdt_l3_mon_domain::mbm_states[] and
> rdt_l3_mon_domain::rmid_busy_llc created during resctrl initialization
> will be larger than needed.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
| Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reinette
* [PATCH v11 26/31] fs/resctrl: Move RMID initialization to first mount
2025-09-25 20:02 [PATCH v11 00/31] x86,fs/resctrl telemetry monitoring Tony Luck
` (24 preceding siblings ...)
2025-09-25 20:03 ` [PATCH v11 25/31] fs,x86/resctrl: Compute number of RMIDs as minimum across resources Tony Luck
@ 2025-09-25 20:03 ` Tony Luck
2025-10-04 0:12 ` Reinette Chatre
2025-09-25 20:03 ` [PATCH v11 27/31] x86/resctrl: Enable RDT_RESOURCE_PERF_PKG Tony Luck
` (4 subsequent siblings)
30 siblings, 1 reply; 84+ messages in thread
From: Tony Luck @ 2025-09-25 20:03 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
L3 monitor features are enumerated during resctrl initialization, and
rmid_ptrs[], which tracks all RMIDs and whose size depends on the
number of supported RMIDs, is allocated at that time.
Telemetry monitor features are enumerated during first resctrl mount and
may support a different number of RMIDs compared to L3 monitor features.
Delay allocation and initialization of rmid_ptrs[] until first mount.
Since the number of RMIDs cannot change on later mounts, keep the same
set of rmid_ptrs[] until resctrl_exit(). This is required because the
limbo handler keeps running after resctrl is unmounted and may still
need to access rmid_ptrs[] as it continues tracking busy RMIDs.
Rename routines to match what they now do:
dom_data_init() -> setup_rmid_lru_list()
dom_data_exit() -> free_rmid_lru_list()
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
fs/resctrl/internal.h | 4 ++++
fs/resctrl/monitor.c | 50 +++++++++++++++++++------------------------
fs/resctrl/rdtgroup.c | 5 +++++
3 files changed, 31 insertions(+), 28 deletions(-)
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index aee6c4684f81..223a6cc6a64a 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -369,6 +369,10 @@ int closids_supported(void);
void closid_free(int closid);
+int setup_rmid_lru_list(void);
+
+void free_rmid_lru_list(void);
+
int alloc_rmid(u32 closid);
void free_rmid(u32 closid, u32 rmid);
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 5960a0afd0ca..c0e1b672afce 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -880,20 +880,27 @@ void mbm_setup_overflow_handler(struct rdt_l3_mon_domain *dom, unsigned long del
schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
}
-static int dom_data_init(struct rdt_resource *r)
+int setup_rmid_lru_list(void)
{
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
struct rmid_entry *entry = NULL;
- int err = 0, i;
u32 idx;
+ int i;
- mutex_lock(&rdtgroup_mutex);
+ if (!resctrl_arch_mon_capable())
+ return 0;
+
+ /*
+ * Called on every mount, but the number of RMIDs cannot change
+ * after the first mount, so keep using the same set of rmid_ptrs[]
+ * until resctrl_exit().
+ */
+ if (rmid_ptrs)
+ return 0;
rmid_ptrs = kcalloc(idx_limit, sizeof(struct rmid_entry), GFP_KERNEL);
- if (!rmid_ptrs) {
- err = -ENOMEM;
- goto out_unlock;
- }
+ if (!rmid_ptrs)
+ return -ENOMEM;
for (i = 0; i < idx_limit; i++) {
entry = &rmid_ptrs[i];
@@ -906,30 +913,24 @@ static int dom_data_init(struct rdt_resource *r)
/*
* RESCTRL_RESERVED_CLOSID and RESCTRL_RESERVED_RMID are special and
* are always allocated. These are used for the rdtgroup_default
- * control group, which will be setup later in resctrl_init().
+ * control group, which was setup earlier in rdtgroup_setup_default().
*/
idx = resctrl_arch_rmid_idx_encode(RESCTRL_RESERVED_CLOSID,
RESCTRL_RESERVED_RMID);
entry = __rmid_entry(idx);
list_del(&entry->list);
-out_unlock:
- mutex_unlock(&rdtgroup_mutex);
-
- return err;
+ return 0;
}
-static void dom_data_exit(struct rdt_resource *r)
+void free_rmid_lru_list(void)
{
- mutex_lock(&rdtgroup_mutex);
-
- if (!r->mon_capable)
- goto out_unlock;
+ if (!resctrl_arch_mon_capable())
+ return;
+ mutex_lock(&rdtgroup_mutex);
kfree(rmid_ptrs);
rmid_ptrs = NULL;
-
-out_unlock:
mutex_unlock(&rdtgroup_mutex);
}
@@ -1803,7 +1804,8 @@ static void closid_num_dirty_rmid_free(void)
* resctrl_l3_mon_resource_init() - Initialise global monitoring structures.
*
* Allocate and initialise global monitor resources that do not belong to a
- * specific domain. i.e. the rmid_ptrs[] used for the limbo and free lists.
+ * specific domain. i.e. the closid_num_dirty_rmid[] used to find the CLOSID
+ * with the cleanest set of RMIDs.
* Called once during boot after the struct rdt_resource's have been configured
* but before the filesystem is mounted.
* Resctrl's cpuhp callbacks may be called before this point to bring a domain
@@ -1823,12 +1825,6 @@ int resctrl_l3_mon_resource_init(void)
if (ret)
return ret;
- ret = dom_data_init(r);
- if (ret) {
- closid_num_dirty_rmid_free();
- return ret;
- }
-
if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_TOTAL_EVENT_ID)) {
mon_event_all[QOS_L3_MBM_TOTAL_EVENT_ID].configurable = true;
resctrl_file_fflags_init("mbm_total_bytes_config",
@@ -1877,6 +1873,4 @@ void resctrl_l3_mon_resource_exit(void)
return;
closid_num_dirty_rmid_free();
-
- dom_data_exit(r);
}
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 5e3ee4b8f70b..f82bdb8f6f1d 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2734,6 +2734,10 @@ static int rdt_get_tree(struct fs_context *fc)
goto out;
}
+ ret = setup_rmid_lru_list();
+ if (ret)
+ goto out;
+
ret = rdtgroup_setup_root(ctx);
if (ret)
goto out;
@@ -4568,4 +4572,5 @@ void resctrl_exit(void)
*/
resctrl_l3_mon_resource_exit();
+ free_rmid_lru_list();
}
--
2.51.0
^ permalink raw reply related [flat|nested] 84+ messages in thread

* Re: [PATCH v11 26/31] fs/resctrl: Move RMID initialization to first mount
2025-09-25 20:03 ` [PATCH v11 26/31] fs/resctrl: Move RMID initialization to first mount Tony Luck
@ 2025-10-04 0:12 ` Reinette Chatre
0 siblings, 0 replies; 84+ messages in thread
From: Reinette Chatre @ 2025-10-04 0:12 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 9/25/25 1:03 PM, Tony Luck wrote:
> L3 monitor features are enumerated during resctrl initialization.
> rmid_ptrs[], which tracks all RMIDs and whose size depends on the
> number of supported RMIDs, is allocated at that time.
>
> Telemetry monitor features are enumerated during first resctrl mount and
> may support a different number of RMIDs compared to L3 monitor features.
>
> Delay allocation and initialization of rmid_ptrs[] until first mount.
> Since the number of RMIDs cannot change on later mounts, keep the same
> set of rmid_ptrs[] until resctrl_exit(). This is required because the
> limbo handler keeps running after resctrl is unmounted and may still
> need to access rmid_ptrs[] while it continues to track busy RMIDs.
>
> Rename routines to match what they now do:
> dom_data_init() -> setup_rmid_lru_list()
> dom_data_exit() -> free_rmid_lru_list()
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> fs/resctrl/internal.h | 4 ++++
> fs/resctrl/monitor.c | 50 +++++++++++++++++++------------------------
> fs/resctrl/rdtgroup.c | 5 +++++
> 3 files changed, 31 insertions(+), 28 deletions(-)
>
> diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
> index aee6c4684f81..223a6cc6a64a 100644
> --- a/fs/resctrl/internal.h
> +++ b/fs/resctrl/internal.h
> @@ -369,6 +369,10 @@ int closids_supported(void);
>
> void closid_free(int closid);
>
> +int setup_rmid_lru_list(void);
> +
> +void free_rmid_lru_list(void);
> +
> int alloc_rmid(u32 closid);
>
> void free_rmid(u32 closid, u32 rmid);
> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> index 5960a0afd0ca..c0e1b672afce 100644
> --- a/fs/resctrl/monitor.c
> +++ b/fs/resctrl/monitor.c
> @@ -880,20 +880,27 @@ void mbm_setup_overflow_handler(struct rdt_l3_mon_domain *dom, unsigned long del
> schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
> }
>
> -static int dom_data_init(struct rdt_resource *r)
> +int setup_rmid_lru_list(void)
> {
> u32 idx_limit = resctrl_arch_system_num_rmid_idx();
Should this be done after the resctrl_arch_mon_capable() check? Seems
unnecessary to determine the number of RMIDs on a system that does not
support monitoring.
> struct rmid_entry *entry = NULL;
> - int err = 0, i;
> u32 idx;
> + int i;
>
> - mutex_lock(&rdtgroup_mutex);
> + if (!resctrl_arch_mon_capable())
> + return 0;
> +
> + /*
> + * Called on every mount, but the number of RMIDs cannot change
> + * after the first mount, so keep using the same set of rmid_ptrs[]
> + * until resctrl_exit().
Could you please add the motivation that the limbo handler accesses rmid_ptrs[]
after unmount? That seems a much stronger motivation for why this is not freed on
unmount and thus valuable for anyone who wants to refactor this later.
> + */
> + if (rmid_ptrs)
> + return 0;
>
> rmid_ptrs = kcalloc(idx_limit, sizeof(struct rmid_entry), GFP_KERNEL);
> - if (!rmid_ptrs) {
> - err = -ENOMEM;
> - goto out_unlock;
> - }
> + if (!rmid_ptrs)
> + return -ENOMEM;
>
> for (i = 0; i < idx_limit; i++) {
> entry = &rmid_ptrs[i];
Reinette
* [PATCH v11 27/31] x86/resctrl: Enable RDT_RESOURCE_PERF_PKG
2025-09-25 20:02 [PATCH v11 00/31] x86,fs/resctrl telemetry monitoring Tony Luck
` (25 preceding siblings ...)
2025-09-25 20:03 ` [PATCH v11 26/31] fs/resctrl: Move RMID initialization to first mount Tony Luck
@ 2025-09-25 20:03 ` Tony Luck
2025-10-04 0:23 ` Reinette Chatre
2025-09-25 20:03 ` [PATCH v11 28/31] fs/resctrl: Provide interface to create architecture specific debugfs area Tony Luck
` (3 subsequent siblings)
30 siblings, 1 reply; 84+ messages in thread
From: Tony Luck @ 2025-09-25 20:03 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Mark the RDT_RESOURCE_PERF_PKG resource as mon_capable and set the global
rdt_mon_capable flag.
Call domain_add_cpu_mon() for each online CPU to allocate all domains
for RDT_RESOURCE_PERF_PKG, since they were not created during resctrl
initialization because telemetry enumeration is delayed until first mount.
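The "run once" gate in resctrl_arch_pre_mount() is built on
atomic_try_cmpxchg(). A minimal userspace sketch of the same pattern,
using C11 stdatomic and a hypothetical pre_mount_first_call() helper
(not the kernel code itself):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model of the only_once guard: the compare-exchange
 * succeeds only for the first caller that still sees the old value 0. */
static atomic_int only_once;

static bool pre_mount_first_call(void)
{
	int old = 0;

	/* Swap 0 -> 1 exactly once; every later call sees 1 and bails. */
	return atomic_compare_exchange_strong(&only_once, &old, 1);
}
```

The kernel helper has the same single-winner semantics, which is why the
enumeration and domain setup below cannot run twice even with racing mounts.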
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/core.c | 17 ++++++++++++++++-
arch/x86/kernel/cpu/resctrl/intel_aet.c | 5 +++++
2 files changed, 21 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 1d43087c5975..48ed6242d136 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -755,14 +755,29 @@ static int resctrl_arch_offline_cpu(unsigned int cpu)
void resctrl_arch_pre_mount(void)
{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_PERF_PKG].r_resctrl;
static atomic_t only_once = ATOMIC_INIT(0);
- int old = 0;
+ int cpu, old = 0;
if (!atomic_try_cmpxchg(&only_once, &old, 1))
return;
if (!intel_aet_get_events())
return;
+
+ if (!r->mon_capable)
+ return;
+
+ /*
+ * Late discovery of telemetry events means the domains for the
+ * resource were not built. Do that now.
+ */
+ cpus_read_lock();
+ mutex_lock(&domain_list_lock);
+ for_each_online_cpu(cpu)
+ domain_add_cpu_mon(cpu, r);
+ mutex_unlock(&domain_list_lock);
+ cpus_read_unlock();
}
enum {
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index 039e63d8c2e7..f6afe862b9de 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -214,6 +214,9 @@ static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
if (!usable_events)
return false;
+ r->mon_capable = true;
+ rdt_mon_capable = true;
+
if (r->mon.num_rmid)
r->mon.num_rmid = min(r->mon.num_rmid, e->num_rmids);
else
@@ -223,6 +226,8 @@ static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
resctrl_enable_mon_event(e->evts[j].id, true,
e->evts[j].bin_bits, &e->evts[j]);
+ pr_info("%s %s monitoring detected\n", r->name, e->name);
+
return true;
}
--
2.51.0
* Re: [PATCH v11 27/31] x86/resctrl: Enable RDT_RESOURCE_PERF_PKG
2025-09-25 20:03 ` [PATCH v11 27/31] x86/resctrl: Enable RDT_RESOURCE_PERF_PKG Tony Luck
@ 2025-10-04 0:23 ` Reinette Chatre
0 siblings, 0 replies; 84+ messages in thread
From: Reinette Chatre @ 2025-10-04 0:23 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 9/25/25 1:03 PM, Tony Luck wrote:
> Mark the RDT_RESOURCE_PERF_PKG resource as mon_capable and set the global
> rdt_mon_capable flag.
Above is clear from patch.
>
> Call domain_add_cpu_mon() for each online CPU to allocate all domains
> for the RDT_RESOURCE_PERF_PKG since they were not created during resctrl
> initialization because of the enumeration delay until first mount.
Attempt at alternative:
Since telemetry events are enumerated on resctrl mount, the RDT_RESOURCE_PERF_PKG
resource is not considered "monitoring capable" during early resctrl initialization.
This means that the domain list for RDT_RESOURCE_PERF_PKG is not built when the CPU
hot plug notifiers are registered and run for the first time right after resctrl
initialization.
Mark the RDT_RESOURCE_PERF_PKG as "monitoring capable" upon successful telemetry event
enumeration to ensure future CPU hotplug events include this resource and initialize its
domain list for CPUs that are already online.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/kernel/cpu/resctrl/core.c | 17 ++++++++++++++++-
> arch/x86/kernel/cpu/resctrl/intel_aet.c | 5 +++++
> 2 files changed, 21 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 1d43087c5975..48ed6242d136 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -755,14 +755,29 @@ static int resctrl_arch_offline_cpu(unsigned int cpu)
>
> void resctrl_arch_pre_mount(void)
> {
> + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_PERF_PKG].r_resctrl;
> static atomic_t only_once = ATOMIC_INIT(0);
> - int old = 0;
> + int cpu, old = 0;
>
> if (!atomic_try_cmpxchg(&only_once, &old, 1))
> return;
>
> if (!intel_aet_get_events())
> return;
> +
> + if (!r->mon_capable)
> + return;
Is this necessary? Can r->mon_capable be false if intel_aet_get_events() fails?
> +
> + /*
> + * Late discovery of telemetry events means the domains for the
> + * resource were not built. Do that now.
> + */
> + cpus_read_lock();
hmmm ... until this point CPUs can come and go. This means that from the moment
r->mon_capable is set resctrl_arch_online_cpu() may run and thus domain_add_cpu_mon()
could be called twice for PERF_PKG? If all the second run does is set (again) a bit
in the cpumask then that *may* be ok (but should be documented) but the flow does not
seem safe to end up like that (more below)
> + mutex_lock(&domain_list_lock);
> + for_each_online_cpu(cpu)
> + domain_add_cpu_mon(cpu, r);
> + mutex_unlock(&domain_list_lock);
> + cpus_read_unlock();
> }
>
> enum {
> diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> index 039e63d8c2e7..f6afe862b9de 100644
> --- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
> +++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> @@ -214,6 +214,9 @@ static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
> if (!usable_events)
> return false;
>
> + r->mon_capable = true;
> + rdt_mon_capable = true;
> +
> if (r->mon.num_rmid)
> r->mon.num_rmid = min(r->mon.num_rmid, e->num_rmids);
> else
> @@ -223,6 +226,8 @@ static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
> resctrl_enable_mon_event(e->evts[j].id, true,
> e->evts[j].bin_bits, &e->evts[j]);
I notice that the mon_capable flags are set *before* the events are enabled. If the first
CPU of a package comes online between setting the flag and enabling the events then the initial
domain creation will not be correct?
What if the mon_capable flags are set in resctrl_arch_pre_mount() after a successful
intel_aet_get_events()? Perhaps with CPU hotplug lock held? From what I can tell doing so will
impact the debugfs flow since that depends on the resource being mon_capable. Would there be a
problem with delaying the debugfs setup until after domain list is built?
>
> + pr_info("%s %s monitoring detected\n", r->name, e->name);
> +
> return true;
> }
>
Reinette
* [PATCH v11 28/31] fs/resctrl: Provide interface to create architecture specific debugfs area
2025-09-25 20:02 [PATCH v11 00/31] x86,fs/resctrl telemetry monitoring Tony Luck
` (26 preceding siblings ...)
2025-09-25 20:03 ` [PATCH v11 27/31] x86/resctrl: Enable RDT_RESOURCE_PERF_PKG Tony Luck
@ 2025-09-25 20:03 ` Tony Luck
2025-09-25 20:03 ` [PATCH v11 29/31] x86/resctrl: Add debugfs files to show telemetry aggregator status Tony Luck
` (2 subsequent siblings)
30 siblings, 0 replies; 84+ messages in thread
From: Tony Luck @ 2025-09-25 20:03 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
All files below /sys/fs/resctrl are considered user ABI.
This leaves no place for architectures to provide additional
interfaces.
Add resctrl_debugfs_mon_info_arch_mkdir() which creates a directory in
the debugfs file system for a monitoring resource. Naming follows the
layout of the main resctrl hierarchy:
/sys/kernel/debug/resctrl/info/{resource}_MON/{arch}
The {arch} last level directory name matches the output of
the user level "uname -m" command.
Architecture code may use this directory for debug information,
or for minor tuning of features. It must not be used for basic
feature enabling as debugfs may not be configured/mounted on
production systems.
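The directory layout above can be sketched as plain string formatting;
mon_info_arch_path(), resname, and machine are illustrative stand-ins
(the real code builds the directories with debugfs_create_dir() and
utsname()->machine rather than formatting a path string):

```c
#include <stdio.h>
#include <string.h>

/* Sketch of the naming scheme only: resname stands in for r->name and
 * machine for the "uname -m" value. */
static void mon_info_arch_path(char *buf, size_t len,
			       const char *resname, const char *machine)
{
	snprintf(buf, len, "/sys/kernel/debug/resctrl/info/%s_MON/%s",
		 resname, machine);
}
```

For the PERF_PKG resource on x86 this yields
"/sys/kernel/debug/resctrl/info/PERF_PKG_MON/x86_64".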
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 10 ++++++++++
fs/resctrl/rdtgroup.c | 29 +++++++++++++++++++++++++++++
2 files changed, 39 insertions(+)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index c7b5e56d25bb..d4be0f54c7e8 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -678,6 +678,16 @@ void resctrl_arch_reset_cntr(struct rdt_resource *r, struct rdt_l3_mon_domain *d
extern unsigned int resctrl_rmid_realloc_threshold;
extern unsigned int resctrl_rmid_realloc_limit;
+/**
+ * resctrl_debugfs_mon_info_arch_mkdir() - Create a debugfs info directory.
+ * Removed by resctrl_exit().
+ * @r: Resource (must be mon_capable).
+ *
+ * Return: NULL if resource is not monitoring capable,
+ * dentry pointer on success, or ERR_PTR(-ERROR) on failure.
+ */
+struct dentry *resctrl_debugfs_mon_info_arch_mkdir(struct rdt_resource *r);
+
int resctrl_init(void);
void resctrl_exit(void);
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index f82bdb8f6f1d..16b088c5f2be 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -24,6 +24,7 @@
#include <linux/sched/task.h>
#include <linux/slab.h>
#include <linux/user_namespace.h>
+#include <linux/utsname.h>
#include <uapi/linux/magic.h>
@@ -75,6 +76,8 @@ static void rdtgroup_destroy_root(void);
struct dentry *debugfs_resctrl;
+static struct dentry *debugfs_resctrl_info;
+
/*
* Memory bandwidth monitoring event to use for the default CTRL_MON group
* and each new CTRL_MON group created by the user. Only relevant when
@@ -4513,6 +4516,31 @@ int resctrl_init(void)
return ret;
}
+/*
+ * Create /sys/kernel/debug/resctrl/info/{r->name}_MON/{arch} directory
+ * by request, for architecture code to use for debugging or minor tuning.
+ * Basic functionality of features must not be controlled by files
+ * added to this directory as debugfs may not be configured/mounted
+ * on production systems.
+ */
+struct dentry *resctrl_debugfs_mon_info_arch_mkdir(struct rdt_resource *r)
+{
+ struct dentry *moninfodir;
+ char name[32];
+
+ if (!r->mon_capable)
+ return NULL;
+
+ if (!debugfs_resctrl_info)
+ debugfs_resctrl_info = debugfs_create_dir("info", debugfs_resctrl);
+
+ sprintf(name, "%s_MON", r->name);
+
+ moninfodir = debugfs_create_dir(name, debugfs_resctrl_info);
+
+ return debugfs_create_dir(utsname()->machine, moninfodir);
+}
+
static bool resctrl_online_domains_exist(void)
{
struct rdt_resource *r;
@@ -4564,6 +4592,7 @@ void resctrl_exit(void)
debugfs_remove_recursive(debugfs_resctrl);
debugfs_resctrl = NULL;
+ debugfs_resctrl_info = NULL;
unregister_filesystem(&rdt_fs_type);
/*
--
2.51.0
* [PATCH v11 29/31] x86/resctrl: Add debugfs files to show telemetry aggregator status
2025-09-25 20:02 [PATCH v11 00/31] x86,fs/resctrl telemetry monitoring Tony Luck
` (27 preceding siblings ...)
2025-09-25 20:03 ` [PATCH v11 28/31] fs/resctrl: Provide interface to create architecture specific debugfs area Tony Luck
@ 2025-09-25 20:03 ` Tony Luck
2025-10-04 0:23 ` Reinette Chatre
2025-09-25 20:03 ` [PATCH v11 30/31] x86,fs/resctrl: Update Documentation for package events Tony Luck
2025-09-25 20:03 ` [PATCH v11 31/31] fs/resctrl: Some kerneldoc updates Tony Luck
30 siblings, 1 reply; 84+ messages in thread
From: Tony Luck @ 2025-09-25 20:03 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Each telemetry aggregator provides three status registers at the top
end of MMIO space after all the per-RMID per-event counters:
data_loss_count: This counts the number of times that this aggregator
failed to accumulate a counter value supplied by a CPU core.
data_loss_timestamp: This is a "timestamp" from a free-running
25MHz uncore timer indicating when the most recent data loss occurred.
last_update_timestamp: Another 25MHz timestamp indicating when the
most recent counter update was successfully applied.
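The register placement can be sketched as offset arithmetic; the
-24/-16/-8 offsets match what make_status_files() uses in the patch
below, while the helper names are illustrative:

```c
#include <stdint.h>

/* The three 8-byte status registers sit at the very top of each
 * aggregator's MMIO region, just below info_end = base + mmio_size. */
static uint64_t data_loss_count_off(uint64_t mmio_size)
{
	return mmio_size - 24;	/* first of the three registers */
}

static uint64_t data_loss_timestamp_off(uint64_t mmio_size)
{
	return mmio_size - 16;	/* middle register */
}

static uint64_t last_update_timestamp_off(uint64_t mmio_size)
{
	return mmio_size - 8;	/* last 8 bytes of the region */
}
```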
Create files in /sys/kernel/debug/resctrl/info/PERF_PKG_MON/x86_64/
to display the value of each of these status registers for each aggregator
in each enabled event group. The prefix for each file name describes
the type of aggregator, which package it is located on, and an opaque
instance number to provide a unique file name when there are multiple
aggregators on a package.
The suffix is one of the three strings listed above. An example name is:
energy_pkg0_agg2_data_loss_count
These files are removed along with all other debugfs entries by the
call to debugfs_remove_recursive() in resctrl_exit().
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/intel_aet.c | 51 +++++++++++++++++++++++++
1 file changed, 51 insertions(+)
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index f6afe862b9de..f84935c57b67 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -18,8 +18,11 @@
#include <linux/container_of.h>
#include <linux/cpu.h>
#include <linux/cpumask.h>
+#include <linux/debugfs.h>
+#include <linux/dcache.h>
#include <linux/err.h>
#include <linux/errno.h>
+#include <linux/fs.h>
#include <linux/gfp_types.h>
#include <linux/init.h>
#include <linux/intel_pmt_features.h>
@@ -33,6 +36,7 @@
#include <linux/resctrl.h>
#include <linux/resctrl_types.h>
#include <linux/slab.h>
+#include <linux/sprintf.h>
#include <linux/stddef.h>
#include <linux/topology.h>
#include <linux/types.h>
@@ -184,9 +188,50 @@ static bool all_regions_have_sufficient_rmid(struct event_group *e, struct pmt_f
return true;
}
+static int status_read(void *priv, u64 *val)
+{
+ void __iomem *info = (void __iomem *)priv;
+
+ *val = readq(info);
+
+ return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(status_fops, status_read, NULL, "%llu\n");
+
+static void make_status_files(struct dentry *dir, struct event_group *e, int pkg,
+ int instance, void *info_end)
+{
+ char name[64];
+
+ sprintf(name, "%s_pkg%d_agg%d_data_loss_count", e->name, pkg, instance);
+ debugfs_create_file(name, 0400, dir, info_end - 24, &status_fops);
+
+ sprintf(name, "%s_pkg%d_agg%d_data_loss_timestamp", e->name, pkg, instance);
+ debugfs_create_file(name, 0400, dir, info_end - 16, &status_fops);
+
+ sprintf(name, "%s_pkg%d_agg%d_last_update_timestamp", e->name, pkg, instance);
+ debugfs_create_file(name, 0400, dir, info_end - 8, &status_fops);
+}
+
+static void create_debug_event_status_files(struct dentry *dir, struct event_group *e,
+ struct pmt_feature_group *p)
+{
+ void *info_end;
+
+ for (int i = 0; i < p->count; i++) {
+ if (!p->regions[i].addr)
+ continue;
+ info_end = (void __force *)p->regions[i].addr + e->mmio_size;
+ make_status_files(dir, e, p->regions[i].plat_info.package_id,
+ i, info_end);
+ }
+}
+
static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_PERF_PKG].r_resctrl;
+ static struct dentry *infodir;
bool usable_events = false;
/* Disable feature if insufficient RMIDs */
@@ -226,6 +271,12 @@ static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
resctrl_enable_mon_event(e->evts[j].id, true,
e->evts[j].bin_bits, &e->evts[j]);
+ if (!infodir)
+ infodir = resctrl_debugfs_mon_info_arch_mkdir(r);
+
+ if (!IS_ERR_OR_NULL(infodir))
+ create_debug_event_status_files(infodir, e, p);
+
pr_info("%s %s monitoring detected\n", r->name, e->name);
return true;
--
2.51.0
* Re: [PATCH v11 29/31] x86/resctrl: Add debugfs files to show telemetry aggregator status
2025-09-25 20:03 ` [PATCH v11 29/31] x86/resctrl: Add debugfs files to show telemetry aggregator status Tony Luck
@ 2025-10-04 0:23 ` Reinette Chatre
0 siblings, 0 replies; 84+ messages in thread
From: Reinette Chatre @ 2025-10-04 0:23 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 9/25/25 1:03 PM, Tony Luck wrote:
> +
> +static void make_status_files(struct dentry *dir, struct event_group *e, int pkg,
> + int instance, void *info_end)
> +{
> + char name[64];
> +
> + sprintf(name, "%s_pkg%d_agg%d_data_loss_count", e->name, pkg, instance);
Please keep the type used for package id consistent throughout the series.
Reinette
* [PATCH v11 30/31] x86,fs/resctrl: Update Documentation for package events
2025-09-25 20:02 [PATCH v11 00/31] x86,fs/resctrl telemetry monitoring Tony Luck
` (28 preceding siblings ...)
2025-09-25 20:03 ` [PATCH v11 29/31] x86/resctrl: Add debugfs files to show telemetry aggregator status Tony Luck
@ 2025-09-25 20:03 ` Tony Luck
2025-10-04 0:25 ` Reinette Chatre
2025-09-25 20:03 ` [PATCH v11 31/31] fs/resctrl: Some kerneldoc updates Tony Luck
30 siblings, 1 reply; 84+ messages in thread
From: Tony Luck @ 2025-09-25 20:03 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Update the resctrl filesystem documentation with details of the
resctrl files that support telemetry events.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
Documentation/filesystems/resctrl.rst | 100 ++++++++++++++++++++++----
1 file changed, 87 insertions(+), 13 deletions(-)
diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
index 006d23af66e1..cb6da9614f58 100644
--- a/Documentation/filesystems/resctrl.rst
+++ b/Documentation/filesystems/resctrl.rst
@@ -168,13 +168,12 @@ with respect to allocation:
bandwidth percentages are directly applied to
the threads running on the core
-If RDT monitoring is available there will be an "L3_MON" directory
+If L3 monitoring is available there will be an "L3_MON" directory
with the following files:
"num_rmids":
- The number of RMIDs available. This is the
- upper bound for how many "CTRL_MON" + "MON"
- groups can be created.
+ The number of RMIDs supported by hardware for
+ L3 monitoring events.
"mon_features":
Lists the monitoring events if
@@ -400,6 +399,19 @@ with the following files:
bytes) at which a previously used LLC_occupancy
counter can be considered for re-use.
+If telemetry monitoring is available there will be a "PERF_PKG_MON" directory
+with the following files:
+
+"num_rmids":
+ The number of RMIDs supported by hardware for
+ telemetry monitoring events.
+
+"mon_features":
+ Lists the telemetry monitoring events that are enabled on this system.
+
+The upper bound for how many "CTRL_MON" + "MON" groups can be created
+is the smaller of the L3_MON and PERF_PKG_MON "num_rmids" values.
+
Finally, in the top level of the "info" directory there is a file
named "last_cmd_status". This is reset with every "command" issued
via the file system (making new directories or writing to any of the
@@ -505,15 +517,40 @@ When control is enabled all CTRL_MON groups will also contain:
When monitoring is enabled all MON groups will also contain:
"mon_data":
- This contains a set of files organized by L3 domain and by
- RDT event. E.g. on a system with two L3 domains there will
- be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
- directories have one file per event (e.g. "llc_occupancy",
- "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
- files provide a read out of the current value of the event for
- all tasks in the group. In CTRL_MON groups these files provide
- the sum for all tasks in the CTRL_MON group and all tasks in
- MON groups. Please see example section for more details on usage.
+ This contains directories for each monitor domain. One set for
+ each instance of an L3 cache, another set for each processor
+ package. The L3 cache directories are named "mon_L3_00",
+ "mon_L3_01" etc. The package directories "mon_PERF_PKG_00",
+ "mon_PERF_PKG_01" etc.
+
+ Within each directory there is one file per event. For
+ example the L3 directories may contain "llc_occupancy", "mbm_total_bytes",
+ and "mbm_local_bytes". The PERF_PKG directories may contain "core_energy",
+ "activity", etc. The info/`*`/mon_features files provide the full
+ list of event/file names.
+
+ "core energy" reports a floating point number for the energy (in Joules)
+ consumed by cores (registers, arithmetic units, TLB and L1/L2 caches)
+ during execution of instructions summed across all logical CPUs on a
+ package for the current RMID.
+
+ "activity" also reports a floating point value (in Farads).
+ This provides an estimate of work done independent of the
+ frequency that the CPUs used for execution.
+
+ Note that these two counters only measure energy/activity
+ in the "core" of the CPU (arithmetic units, TLB, L1 and L2
+ caches, etc.). They do not include L3 cache, memory, I/O
+ devices etc.
+
+ All other events report decimal integer values.
+
+ In a MON group these files provide a read out of the current
+ value of the event for all tasks in the group. In CTRL_MON groups
+ these files provide the sum for all tasks in the CTRL_MON group
+ and all tasks in MON groups. Please see example section for more
+ details on usage.
+
On systems with Sub-NUMA Cluster (SNC) enabled there are extra
directories for each node (located within the "mon_L3_XX" directory
for the L3 cache they occupy). These are named "mon_sub_L3_YY"
@@ -1506,6 +1543,43 @@ Example with C::
resctrl_release_lock(fd);
}
+Debugfs
+=======
+In addition to the use of debugfs for tracing of pseudo-locking
+performance, architecture code may create debugfs directories
+associated with monitoring features for a specific resource.
+
+The full pathname for these is in the form:
+
+ /sys/kernel/debug/resctrl/info/{resource_name}_MON/{arch}/
+
+The presence, names, and format of these files may vary
+between architectures even if the same resource is present.
+
+PERF_PKG_MON/x86_64
+-------------------
+Three status files are present per telemetry aggregator
+instance. The prefix of
+each file name describes the type ("energy" or "perf"), which
+processor package it belongs to, and the instance number of
+the aggregator. For example: "energy_pkg1_agg2".
+
+The suffix describes which data is reported in the file and
+is one of:
+
+data_loss_count:
+ This counts the number of times that this aggregator
+ failed to accumulate a counter value supplied by a CPU.
+
+data_loss_timestamp:
+ This is a "timestamp" from a free running 25MHz uncore
+ timer indicating when the most recent data loss occurred.
+
+last_update_timestamp:
+ Another 25MHz timestamp indicating when the
+ most recent counter update was successfully applied.
+
+
Examples for RDT Monitoring along with allocation usage
=======================================================
Reading monitored data
--
2.51.0
* Re: [PATCH v11 30/31] x86,fs/resctrl: Update Documentation for package events
2025-09-25 20:03 ` [PATCH v11 30/31] x86,fs/resctrl: Update Documentation for package events Tony Luck
@ 2025-10-04 0:25 ` Reinette Chatre
0 siblings, 0 replies; 84+ messages in thread
From: Reinette Chatre @ 2025-10-04 0:25 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
Two nits in subject:
"Documentation" -> "documentation"
"package events" -> "telemetry events"?
(this is the one and only instance of "package event" in this
series and does not match changelog that follows)
On 9/25/25 1:03 PM, Tony Luck wrote:
> Update resctrl filesystem documentation with the details about the
> resctrl files that support telemetry events.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> Documentation/filesystems/resctrl.rst | 100 ++++++++++++++++++++++----
> 1 file changed, 87 insertions(+), 13 deletions(-)
>
> diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
> index 006d23af66e1..cb6da9614f58 100644
> --- a/Documentation/filesystems/resctrl.rst
> +++ b/Documentation/filesystems/resctrl.rst
> @@ -168,13 +168,12 @@ with respect to allocation:
> bandwidth percentages are directly applied to
> the threads running on the core
>
> -If RDT monitoring is available there will be an "L3_MON" directory
> +If L3 monitoring is available there will be an "L3_MON" directory
> with the following files:
>
> "num_rmids":
> - The number of RMIDs available. This is the
> - upper bound for how many "CTRL_MON" + "MON"
> - groups can be created.
> + The number of RMIDs supported by hardware for
> + L3 monitoring events.
>
> "mon_features":
> Lists the monitoring events if
> @@ -400,6 +399,19 @@ with the following files:
> bytes) at which a previously used LLC_occupancy
> counter can be considered for re-use.
>
> +If telemetry monitoring is available there will be a "PERF_PKG_MON" directory
> +with the following files:
> +
> +"num_rmids":
> + The number of RMIDs supported by hardware for
> + telemetry monitoring events.
There may be some additional detail about how num_rmids is determined that could be valuable
to user space, since from what I understand user space seems to have some control over this
number in addition to it being "supported by hardware".
For example, if the PERF event group has more RMID than the ENERGY event group
and the user needs to do significant monitoring of PERF then it may be useful to know
that by disabling ENERGY it could be possible to increase the number of RMIDs in order
to do that monitoring.
Additionally, from patch #23 we learned that "supported by hardware" can have different meanings ...
it could be the number of RMIDs "supported" or it could mean the number of RMIDs
that can be reliably "counted". A user force-enabling an under-resourced event group will
thus encounter a num_rmids that does not match the (XML) spec.
> +
> +"mon_features":
> + Lists the telemetry monitoring events that are enabled on this system.
> +
> +The upper bound for how many "CTRL_MON" + "MON" groups can be created
> +is the smaller of the L3_MON and PERF_PKG_MON "num_rmids" values.
> +
> Finally, in the top level of the "info" directory there is a file
> named "last_cmd_status". This is reset with every "command" issued
> via the file system (making new directories or writing to any of the
> @@ -505,15 +517,40 @@ When control is enabled all CTRL_MON groups will also contain:
> When monitoring is enabled all MON groups will also contain:
>
> "mon_data":
> - This contains a set of files organized by L3 domain and by
> - RDT event. E.g. on a system with two L3 domains there will
> - be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
> - directories have one file per event (e.g. "llc_occupancy",
> - "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
> - files provide a read out of the current value of the event for
> - all tasks in the group. In CTRL_MON groups these files provide
> - the sum for all tasks in the CTRL_MON group and all tasks in
> - MON groups. Please see example section for more details on usage.
> + This contains directories for each monitor domain. One set for
> + each instance of an L3 cache, another set for each processor
> + package. The L3 cache directories are named "mon_L3_00",
I still do not understand the "set" terminology. There is just one directory
per domain, no? For example, "This contains a directory for each monitoring domain of
a monitoring capable resource. One directory for each instance of an L3 cache
if L3 monitoring is available, another directory for each processor package if
telemetry monitoring is available."
> + "mon_L3_01" etc. The package directories "mon_PERF_PKG_00",
> + "mon_PERF_PKG_01" etc.
> +
> + Within each directory there is one file per event. For
> + example the L3 directories may contain "llc_occupancy", "mbm_total_bytes",
> + and "mbm_local_bytes". The PERF_PKG directories may contain "core_energy",
> + "activity", etc. The info/`*`/mon_features files provide the full
> + list of event/file names.
> +
> + "core energy" reports a floating point number for the energy (in Joules)
> + consumed by cores (registers, arithmetic units, TLB and L1/L2 caches)
> + during execution of instructions summed across all logical CPUs on a
> + package for the current RMID.
> +
> + "activity" also reports a floating point value (in Farads).
> + This provides an estimate of work done independent of the
> + frequency that the CPUs used for execution.
> +
> + Note that these two counters only measure energy/activity
To help be specific:
""core energy" and "activity" only measure ..."
> + in the "core" of the CPU (arithmetic units, TLB, L1 and L2
> + caches, etc.). They do not include L3 cache, memory, I/O
> + devices etc.
> +
> + All other events report decimal integer values.
> +
> + In a MON group these files provide a read out of the current
> + value of the event for all tasks in the group. In CTRL_MON groups
> + these files provide the sum for all tasks in the CTRL_MON group
> + and all tasks in MON groups. Please see example section for more
> + details on usage.
> +
Please have this text line length be consistent with surrounding text.
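As a usage illustration of the mon_data layout described in the hunk above, here is a sketch that walks every event file in a monitor group. The group path and the domain/event names are assumptions taken from the documentation text; the actual set depends on the hardware:

```shell
#!/bin/sh
# Sketch: read every event file in a monitor group's mon_data tree.
# GRP is hypothetical; on a real system it is a CTRL_MON or MON group
# directory under the resctrl mount point.
GRP=/sys/fs/resctrl                    # assumed: the default group
for dom in "$GRP"/mon_data/mon_L3_* "$GRP"/mon_data/mon_PERF_PKG_*; do
    [ -d "$dom" ] || continue          # skip domain types not present
    for evt in "$dom"/*; do
        [ -r "$evt" ] || continue
        # L3 events report decimal integers; the PERF_PKG "core_energy"
        # and "activity" files report floating point values
        # (Joules / Farads respectively).
        printf '%s/%s: %s\n' "$(basename "$dom")" \
               "$(basename "$evt")" "$(cat "$evt")"
    done
done
```

In a MON group each file read returns the current value for the group's tasks; in a CTRL_MON group it returns the sum over the CTRL_MON group and its child MON groups, as the documentation text states.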
> On systems with Sub-NUMA Cluster (SNC) enabled there are extra
> directories for each node (located within the "mon_L3_XX" directory
> for the L3 cache they occupy). These are named "mon_sub_L3_YY"
> @@ -1506,6 +1543,43 @@ Example with C::
> resctrl_release_lock(fd);
> }
>
> +Debugfs
> +=======
> +In addition to the use of debugfs for tracing of pseudo-locking
> +performance, architecture code may create debugfs directories
> +associated with monitoring features for a specific resource.
> +
> +The full pathname for these is in the form:
> +
> + /sys/kernel/debug/resctrl/info/{resource_name}_MON/{arch}/
> +
> +The presence, names, and format of these files may vary
> +between architectures even if the same resource is present.
> +
> +PERF_PKG_MON/x86_64
> +-------------------
> +Three files are present per telemetry aggregator instance
> +that show status. The prefix of
Please be consistent with line length and do not trim lines so short.
> +each file name describes the type ("energy" or "perf") which
> +processor package it belongs to, and the instance number of
> +the aggregator. For example: "energy_pkg1_agg2".
> +
> +The suffix describes which data is reported in the file and
> +is one of:
> +
> +data_loss_count:
> + This counts the number of times that this aggregator
> + failed to accumulate a counter value supplied by a CPU.
> +
> +data_loss_timestamp:
> + This is a "timestamp" from a free running 25MHz uncore
> + timer indicating when the most recent data loss occurred.
> +
> +last_update_timestamp:
> + Another 25MHz timestamp indicating when the
> + most recent counter update was successfully applied.
> +
> +
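A sketch of inspecting these debugfs status files from a shell. The directory comes from the pathname form given above; the "prefix_suffix" join for file names (e.g. "energy_pkg1_agg2_data_loss_count") is an assumption inferred from the text, and the directory only exists on capable hardware with debugfs mounted:

```shell
#!/bin/sh
# Sketch: dump the per-aggregator telemetry status files described above.
# DIR and the prefix_suffix file naming are assumptions taken from the
# documentation text, not verified against a running kernel.
DIR=/sys/kernel/debug/resctrl/info/PERF_PKG_MON/x86_64
for suffix in data_loss_count data_loss_timestamp last_update_timestamp; do
    for f in "$DIR"/*_"$suffix"; do
        [ -r "$f" ] || continue      # debugfs absent or hw not capable
        printf '%s: %s\n' "$(basename "$f")" "$(cat "$f")"
    done
done
```

Since the two timestamp files count ticks of a free running 25 MHz uncore timer, the interval between two readings in seconds is (t2 - t1) / 25000000.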
> Examples for RDT Monitoring along with allocation usage
> =======================================================
> Reading monitored data
Reinette
^ permalink raw reply [flat|nested] 84+ messages in thread
* [PATCH v11 31/31] fs/resctrl: Some kerneldoc updates
2025-09-25 20:02 [PATCH v11 00/31] x86,fs/resctrl telemetry monitoring Tony Luck
` (29 preceding siblings ...)
2025-09-25 20:03 ` [PATCH v11 30/31] x86,fs/resctrl: Update Documentation for package events Tony Luck
@ 2025-09-25 20:03 ` Tony Luck
2025-10-04 0:26 ` Reinette Chatre
30 siblings, 1 reply; 84+ messages in thread
From: Tony Luck @ 2025-09-25 20:03 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
resctrl event monitoring on Sub-NUMA Cluster (SNC) systems sums the
counts for events across all nodes sharing an L3 cache.
Update the kerneldoc for rmid_read::sum and the do_sum argument to
mon_get_kn_priv() to say these are only used on the RDT_RESOURCE_L3
resource.
Add Return: value descriptions for l3_mon_domain_mbm_alloc(),
resctrl_l3_mon_resource_init(), and domain_setup_l3_mon_state().
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
fs/resctrl/internal.h | 4 ++--
arch/x86/kernel/cpu/resctrl/core.c | 2 ++
fs/resctrl/monitor.c | 2 +-
fs/resctrl/rdtgroup.c | 5 +++--
4 files changed, 8 insertions(+), 5 deletions(-)
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 223a6cc6a64a..0dd89d3fa31a 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -96,8 +96,8 @@ extern struct mon_evt mon_event_all[QOS_NUM_EVENTS];
* @list: Member of the global @mon_data_kn_priv_list list.
* @rid: Resource id associated with the event file.
* @evt: Event structure associated with the event file.
- * @sum: Set when event must be summed across multiple
- * domains.
+ * @sum: Set for RDT_RESOURCE_L3 when event must be summed
+ * across multiple domains.
* @domid: When @sum is zero this is the domain to which
* the event file belongs. When @sum is one this
* is the id of the L3 cache that all domains to be
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 48ed6242d136..78c176e15b93 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -418,6 +418,8 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *
* l3_mon_domain_mbm_alloc() - Allocate arch private storage for the MBM counters
* @num_rmid: The size of the MBM counter array
* @hw_dom: The domain that owns the allocated arrays
+ *
+ * Return: %0 for success; Error code otherwise.
*/
static int l3_mon_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_l3_mon_domain *hw_dom)
{
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index c0e1b672afce..4cc310b9e78e 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -1811,7 +1811,7 @@ static void closid_num_dirty_rmid_free(void)
* Resctrl's cpuhp callbacks may be called before this point to bring a domain
* online.
*
- * Returns 0 for success, or -ENOMEM.
+ * Return: %0 for success; Error code otherwise.
*/
int resctrl_l3_mon_resource_init(void)
{
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 16b088c5f2be..04765dad3d31 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -3037,7 +3037,8 @@ static void rmdir_all_sub(void)
* @rid: The resource id for the event file being created.
* @domid: The domain id for the event file being created.
* @mevt: The type of event file being created.
- * @do_sum: Whether SNC summing monitors are being created.
+ * @do_sum: Whether SNC summing monitors are being created. Only set
+ * when @rid == RDT_RESOURCE_L3.
*/
static struct mon_data *mon_get_kn_priv(enum resctrl_res_level rid, int domid,
struct mon_evt *mevt,
@@ -4281,7 +4282,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
* at mount time. This means the rdt_l3_mon_domain::mbm_states[] and
* rdt_l3_mon_domain::rmid_busy_llc allocations may be larger than needed.
*
- * Returns 0 for success, or -ENOMEM.
+ * Return: %0 for success; Error code otherwise.
*/
static int domain_setup_l3_mon_state(struct rdt_resource *r, struct rdt_l3_mon_domain *d)
{
--
2.51.0
* Re: [PATCH v11 31/31] fs/resctrl: Some kerneldoc updates
2025-09-25 20:03 ` [PATCH v11 31/31] fs/resctrl: Some kerneldoc updates Tony Luck
@ 2025-10-04 0:26 ` Reinette Chatre
2025-10-06 16:54 ` Luck, Tony
0 siblings, 1 reply; 84+ messages in thread
From: Reinette Chatre @ 2025-10-04 0:26 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 9/25/25 1:03 PM, Tony Luck wrote:
> resctrl event monitoring on Sub-NUMA Cluster (SNC) systems sums the
> counts for events across all nodes sharing an L3 cache.
>
> Update the kerneldoc for rmid_read::sum and the do_sum argument to
> mon_get_kn_priv() to say these are only used on the RDT_RESOURCE_L3
> resource.
This is clear from the patch. Why is this needed as part of
telemetry event enabling? Perhaps this can be combined with the
unrelated SNC warnings found in "x86/resctrl: Handle domain creation/deletion
for RDT_RESOURCE_PERF_PKG" to be a patch dedicated to addressing SNC
topics related to telemetry events?
>
> Add Return: value description for l3_mon_domain_mbm_alloc(),
> resctrl_l3_mon_resource_init(), and domain_setup_l3_mon_state()
Appreciate the cleanups but please have series start with cleanups instead of end.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
Reinette
* Re: [PATCH v11 31/31] fs/resctrl: Some kerneldoc updates
2025-10-04 0:26 ` Reinette Chatre
@ 2025-10-06 16:54 ` Luck, Tony
2025-10-06 21:34 ` Reinette Chatre
0 siblings, 1 reply; 84+ messages in thread
From: Luck, Tony @ 2025-10-06 16:54 UTC (permalink / raw)
To: Reinette Chatre
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen Yu, x86, linux-kernel,
patches
On Fri, Oct 03, 2025 at 05:26:45PM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 9/25/25 1:03 PM, Tony Luck wrote:
> > resctrl event monitoring on Sub-NUMA Cluster (SNC) systems sums the
> > counts for events across all nodes sharing an L3 cache.
> >
> > Update the kerneldoc for rmid_read::sum and the do_sum argument to
> > mon_get_kn_priv() to say these are only used on the RDT_RESOURCE_L3
> > resource.
>
> This is clear from the patch. Why is this needed as part of
> telemetry event enabling? Perhaps this can be combined with the
> unrelated SNC warnings found in "x86/resctrl: Handle domain creation/deletion
> for RDT_RESOURCE_PERF_PKG" to be a patch dedicated to addressing SNC
> topics related to telemetry events?
I will add an SNC cleanup patch to the series and make these changes there.
>
> >
> > Add Return: value description for l3_mon_domain_mbm_alloc(),
> > resctrl_l3_mon_resource_init(), and domain_setup_l3_mon_state()
>
> Appreciate the cleanups but please have series start with cleanups instead of end.
Can I bundle these cleanups with patch 8 that renames these functions?
>
> >
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > ---
>
> Reinette
* Re: [PATCH v11 31/31] fs/resctrl: Some kerneldoc updates
2025-10-06 16:54 ` Luck, Tony
@ 2025-10-06 21:34 ` Reinette Chatre
0 siblings, 0 replies; 84+ messages in thread
From: Reinette Chatre @ 2025-10-06 21:34 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen Yu, x86, linux-kernel,
patches
Hi Tony,
On 10/6/25 9:54 AM, Luck, Tony wrote:
> On Fri, Oct 03, 2025 at 05:26:45PM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 9/25/25 1:03 PM, Tony Luck wrote:
>>> resctrl event monitoring on Sub-NUMA Cluster (SNC) systems sums the
>>> counts for events across all nodes sharing an L3 cache.
>>>
>>> Update the kerneldoc for rmid_read::sum and the do_sum argument to
>>> mon_get_kn_priv() to say these are only used on the RDT_RESOURCE_L3
>>> resource.
>>
>> This is clear from the patch. Why is this needed as part of
>> telemetry event enabling? Perhaps this can be combined with the
>> unrelated SNC warnings found in "x86/resctrl: Handle domain creation/deletion
>> for RDT_RESOURCE_PERF_PKG" to be a patch dedicated to addressing SNC
>> topics related to telemetry events?
>
> I will add an SNC cleanup patch to the series and make these changes there.
Thank you very much.
>>
>>>
>>> Add Return: value description for l3_mon_domain_mbm_alloc(),
>>> resctrl_l3_mon_resource_init(), and domain_setup_l3_mon_state()
>>
>> Appreciate the cleanups but please have series start with cleanups instead of end.
>
> Can I bundle these cleanups with patch 8 that renames these functions?
Good question. It is ok with me. I am not aware of concerns with doing something like
this.
Reinette