* [PATCH v13 01/32] x86,fs/resctrl: Improve domain type checking
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
@ 2025-10-29 16:20 ` Tony Luck
2025-10-29 16:20 ` [PATCH v13 02/32] x86/resctrl: Move L3 initialization into new helper function Tony Luck
` (32 subsequent siblings)
33 siblings, 0 replies; 85+ messages in thread
From: Tony Luck @ 2025-10-29 16:20 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Every resctrl resource has a list of domain structures. struct rdt_ctrl_domain
and struct rdt_mon_domain both begin with a struct rdt_domain_hdr whose
rdt_domain_hdr::type field is used in validity checks before accessing a
domain of a particular type.
Add the resource id to struct rdt_domain_hdr in preparation for a new monitoring
domain structure that will be associated with a new monitoring resource. Improve
existing domain validity checks with a new helper domain_header_is_valid()
that checks both domain type and resource id. domain_header_is_valid() should
be used before every call to container_of() that accesses a domain structure.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
include/linux/resctrl.h | 9 +++++++++
arch/x86/kernel/cpu/resctrl/core.c | 10 ++++++----
fs/resctrl/ctrlmondata.c | 2 +-
3 files changed, 16 insertions(+), 5 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index a7d92718b653..dfc91c5e8483 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -131,15 +131,24 @@ enum resctrl_domain_type {
* @list: all instances of this resource
* @id: unique id for this instance
* @type: type of this instance
+ * @rid: resource id for this instance
* @cpu_mask: which CPUs share this resource
*/
struct rdt_domain_hdr {
struct list_head list;
int id;
enum resctrl_domain_type type;
+ enum resctrl_res_level rid;
struct cpumask cpu_mask;
};
+static inline bool domain_header_is_valid(struct rdt_domain_hdr *hdr,
+ enum resctrl_domain_type type,
+ enum resctrl_res_level rid)
+{
+ return !WARN_ON_ONCE(hdr->type != type || hdr->rid != rid);
+}
+
/**
* struct rdt_ctrl_domain - group of CPUs sharing a resctrl control resource
* @hdr: common header for different domain types
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 06ca5a30140c..8be2619db2e7 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -459,7 +459,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
hdr = resctrl_find_domain(&r->ctrl_domains, id, &add_pos);
if (hdr) {
- if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+ if (!domain_header_is_valid(hdr, RESCTRL_CTRL_DOMAIN, r->rid))
return;
d = container_of(hdr, struct rdt_ctrl_domain, hdr);
@@ -476,6 +476,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
d = &hw_dom->d_resctrl;
d->hdr.id = id;
d->hdr.type = RESCTRL_CTRL_DOMAIN;
+ d->hdr.rid = r->rid;
cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
rdt_domain_reconfigure_cdp(r);
@@ -515,7 +516,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
hdr = resctrl_find_domain(&r->mon_domains, id, &add_pos);
if (hdr) {
- if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
return;
d = container_of(hdr, struct rdt_mon_domain, hdr);
@@ -533,6 +534,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
d = &hw_dom->d_resctrl;
d->hdr.id = id;
d->hdr.type = RESCTRL_MON_DOMAIN;
+ d->hdr.rid = r->rid;
ci = get_cpu_cacheinfo_level(cpu, RESCTRL_L3_CACHE);
if (!ci) {
pr_warn_once("Can't find L3 cache for CPU:%d resource %s\n", cpu, r->name);
@@ -593,7 +595,7 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
return;
}
- if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+ if (!domain_header_is_valid(hdr, RESCTRL_CTRL_DOMAIN, r->rid))
return;
d = container_of(hdr, struct rdt_ctrl_domain, hdr);
@@ -639,7 +641,7 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
return;
}
- if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
return;
d = container_of(hdr, struct rdt_mon_domain, hdr);
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index 0d0ef54fc4de..f248eaf50d3c 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -649,7 +649,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
* the resource to find the domain with "domid".
*/
hdr = resctrl_find_domain(&r->mon_domains, domid, NULL);
- if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
+ if (!hdr || !domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, resid)) {
ret = -ENOENT;
goto out;
}
--
2.51.0
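The check-then-convert pattern this patch formalizes can be sketched in isolation as a small userspace program. This is a hypothetical simplification, not kernel code: the types mirror the kernel names but carry only the fields needed here, `container_of()` is redefined locally, and the plain boolean check stands in for the kernel's `WARN_ON_ONCE()` wrapper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

enum resctrl_domain_type { RESCTRL_CTRL_DOMAIN, RESCTRL_MON_DOMAIN };
enum resctrl_res_level { RDT_RESOURCE_L3, RDT_RESOURCE_MBA };

/* Common header embedded at the start of every domain structure. */
struct rdt_domain_hdr {
	int id;
	enum resctrl_domain_type type;
	enum resctrl_res_level rid;
};

struct rdt_mon_domain {
	struct rdt_domain_hdr hdr;
	int ci_id;	/* L3 cache instance id */
};

/* Kernel version wraps this condition in WARN_ON_ONCE(). */
static inline bool domain_header_is_valid(struct rdt_domain_hdr *hdr,
					  enum resctrl_domain_type type,
					  enum resctrl_res_level rid)
{
	return hdr->type == type && hdr->rid == rid;
}

/*
 * Hypothetical accessor: validate both domain type and resource id
 * before container_of() recovers the enclosing structure.
 */
static int mon_domain_ci_id(struct rdt_domain_hdr *hdr)
{
	struct rdt_mon_domain *d;

	if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
		return -1;
	d = container_of(hdr, struct rdt_mon_domain, hdr);
	return d->ci_id;
}
```

A header whose type matches but whose resource id does not is rejected, which is exactly the case the old type-only `WARN_ON_ONCE()` checks could not catch.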
* [PATCH v13 02/32] x86/resctrl: Move L3 initialization into new helper function
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
2025-10-29 16:20 ` [PATCH v13 01/32] x86,fs/resctrl: Improve domain type checking Tony Luck
@ 2025-10-29 16:20 ` Tony Luck
2025-10-29 16:20 ` [PATCH v13 03/32] x86/resctrl: Refactor domain_remove_cpu_mon() ready for new domain types Tony Luck
` (31 subsequent siblings)
33 siblings, 0 replies; 85+ messages in thread
From: Tony Luck @ 2025-10-29 16:20 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Carve out the resource monitoring domain init code into a separate helper
in order to be able to initialize new types of monitoring domains besides
the usual L3 ones.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
arch/x86/kernel/cpu/resctrl/core.c | 64 ++++++++++++++++--------------
1 file changed, 34 insertions(+), 30 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 8be2619db2e7..d422ae3b7ed6 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -496,37 +496,13 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
}
}
-static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
+static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct list_head *add_pos)
{
- int id = get_domain_id_from_scope(cpu, r->mon_scope);
- struct list_head *add_pos = NULL;
struct rdt_hw_mon_domain *hw_dom;
- struct rdt_domain_hdr *hdr;
struct rdt_mon_domain *d;
struct cacheinfo *ci;
int err;
- lockdep_assert_held(&domain_list_lock);
-
- if (id < 0) {
- pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
- cpu, r->mon_scope, r->name);
- return;
- }
-
- hdr = resctrl_find_domain(&r->mon_domains, id, &add_pos);
- if (hdr) {
- if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
- return;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
-
- cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
- /* Update the mbm_assign_mode state for the CPU if supported */
- if (r->mon.mbm_cntr_assignable)
- resctrl_arch_mbm_cntr_assign_set_one(r);
- return;
- }
-
hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
if (!hw_dom)
return;
@@ -534,7 +510,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
d = &hw_dom->d_resctrl;
d->hdr.id = id;
d->hdr.type = RESCTRL_MON_DOMAIN;
- d->hdr.rid = r->rid;
+ d->hdr.rid = RDT_RESOURCE_L3;
ci = get_cpu_cacheinfo_level(cpu, RESCTRL_L3_CACHE);
if (!ci) {
pr_warn_once("Can't find L3 cache for CPU:%d resource %s\n", cpu, r->name);
@@ -544,10 +520,6 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
d->ci_id = ci->id;
cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
- /* Update the mbm_assign_mode state for the CPU if supported */
- if (r->mon.mbm_cntr_assignable)
- resctrl_arch_mbm_cntr_assign_set_one(r);
-
arch_mon_domain_online(r, d);
if (arch_domain_mbm_alloc(r->mon.num_rmid, hw_dom)) {
@@ -565,6 +537,38 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
}
}
+static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
+{
+ int id = get_domain_id_from_scope(cpu, r->mon_scope);
+ struct list_head *add_pos = NULL;
+ struct rdt_domain_hdr *hdr;
+
+ lockdep_assert_held(&domain_list_lock);
+
+ if (id < 0) {
+ pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->mon_scope, r->name);
+ return;
+ }
+
+ hdr = resctrl_find_domain(&r->mon_domains, id, &add_pos);
+ if (hdr)
+ cpumask_set_cpu(cpu, &hdr->cpu_mask);
+
+ switch (r->rid) {
+ case RDT_RESOURCE_L3:
+ /* Update the mbm_assign_mode state for the CPU if supported */
+ if (r->mon.mbm_cntr_assignable)
+ resctrl_arch_mbm_cntr_assign_set_one(r);
+ if (!hdr)
+ l3_mon_domain_setup(cpu, id, r, add_pos);
+ break;
+ default:
+ pr_warn_once("Unknown resource rid=%d\n", r->rid);
+ break;
+ }
+}
+
static void domain_add_cpu(int cpu, struct rdt_resource *r)
{
if (r->alloc_capable)
--
2.51.0
* [PATCH v13 03/32] x86/resctrl: Refactor domain_remove_cpu_mon() ready for new domain types
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
2025-10-29 16:20 ` [PATCH v13 01/32] x86,fs/resctrl: Improve domain type checking Tony Luck
2025-10-29 16:20 ` [PATCH v13 02/32] x86/resctrl: Move L3 initialization into new helper function Tony Luck
@ 2025-10-29 16:20 ` Tony Luck
2025-10-29 16:20 ` [PATCH v13 04/32] x86/resctrl: Clean up domain_remove_cpu_ctrl() Tony Luck
` (30 subsequent siblings)
33 siblings, 0 replies; 85+ messages in thread
From: Tony Luck @ 2025-10-29 16:20 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
New telemetry events will be associated with a new package scoped resource with
new domain structures.
Refactor domain_remove_cpu_mon() so that all L3 domain processing is separate
from the generic domain action of clearing the CPU bit in the mask.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
arch/x86/kernel/cpu/resctrl/core.c | 27 +++++++++++++++++----------
1 file changed, 17 insertions(+), 10 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index d422ae3b7ed6..c7bfaa391e9f 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -626,9 +626,7 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
{
int id = get_domain_id_from_scope(cpu, r->mon_scope);
- struct rdt_hw_mon_domain *hw_dom;
struct rdt_domain_hdr *hdr;
- struct rdt_mon_domain *d;
lockdep_assert_held(&domain_list_lock);
@@ -645,20 +643,29 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
return;
}
- if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, r->rid))
+ cpumask_clear_cpu(cpu, &hdr->cpu_mask);
+ if (!cpumask_empty(&hdr->cpu_mask))
return;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
- hw_dom = resctrl_to_arch_mon_dom(d);
+ switch (r->rid) {
+ case RDT_RESOURCE_L3: {
+ struct rdt_hw_mon_domain *hw_dom;
+ struct rdt_mon_domain *d;
- cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
- if (cpumask_empty(&d->hdr.cpu_mask)) {
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
+ return;
+
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
+ hw_dom = resctrl_to_arch_mon_dom(d);
resctrl_offline_mon_domain(r, d);
- list_del_rcu(&d->hdr.list);
+ list_del_rcu(&hdr->list);
synchronize_rcu();
mon_domain_free(hw_dom);
-
- return;
+ break;
+ }
+ default:
+ pr_warn_once("Unknown resource rid=%d\n", r->rid);
+ break;
}
}
--
2.51.0
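The shape of the refactored remove path can be sketched as a self-contained userspace program. This is an illustrative simplification under stated assumptions: a plain `unsigned long` bitmask stands in for `struct cpumask`, and a counter stands in for the real `list_del_rcu()`/`mon_domain_free()` teardown.

```c
#include <assert.h>

enum resctrl_res_level { RDT_RESOURCE_L3, RDT_RESOURCE_MBA };

struct rdt_domain_hdr {
	unsigned long cpu_mask;	/* stand-in for struct cpumask */
	enum resctrl_res_level rid;
};

static int l3_frees;

/* Stand-in for resctrl_offline_mon_domain() + mon_domain_free(). */
static void l3_mon_domain_free(struct rdt_domain_hdr *hdr)
{
	(void)hdr;
	l3_frees++;
}

/*
 * Generic part first: drop the CPU from the shared mask and return
 * early while other CPUs remain in the domain. Only an emptied
 * domain reaches the resource-specific teardown, dispatched on the
 * resource id so new resources can add their own case later.
 */
static void domain_remove_cpu_mon(int cpu, struct rdt_domain_hdr *hdr)
{
	hdr->cpu_mask &= ~(1UL << cpu);
	if (hdr->cpu_mask)
		return;

	switch (hdr->rid) {
	case RDT_RESOURCE_L3:
		l3_mon_domain_free(hdr);
		break;
	default:
		break;	/* pr_warn_once() in the kernel version */
	}
}
```

Removing a CPU that leaves the mask non-empty never touches the L3-specific code, matching the early-return structure the patch introduces.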
* [PATCH v13 04/32] x86/resctrl: Clean up domain_remove_cpu_ctrl()
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
` (2 preceding siblings ...)
2025-10-29 16:20 ` [PATCH v13 03/32] x86/resctrl: Refactor domain_remove_cpu_mon() ready for new domain types Tony Luck
@ 2025-10-29 16:20 ` Tony Luck
2025-10-29 16:20 ` [PATCH v13 05/32] x86,fs/resctrl: Refactor domain create/remove using struct rdt_domain_hdr Tony Luck
` (29 subsequent siblings)
33 siblings, 0 replies; 85+ messages in thread
From: Tony Luck @ 2025-10-29 16:20 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
For symmetry with domain_remove_cpu_mon(), refactor domain_remove_cpu_ctrl()
to return early when removing a CPU does not empty the domain.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
arch/x86/kernel/cpu/resctrl/core.c | 29 ++++++++++++++---------------
1 file changed, 14 insertions(+), 15 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index c7bfaa391e9f..b28663757dcf 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -599,28 +599,27 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
return;
}
+ cpumask_clear_cpu(cpu, &hdr->cpu_mask);
+ if (!cpumask_empty(&hdr->cpu_mask))
+ return;
+
if (!domain_header_is_valid(hdr, RESCTRL_CTRL_DOMAIN, r->rid))
return;
d = container_of(hdr, struct rdt_ctrl_domain, hdr);
hw_dom = resctrl_to_arch_ctrl_dom(d);
- cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
- if (cpumask_empty(&d->hdr.cpu_mask)) {
- resctrl_offline_ctrl_domain(r, d);
- list_del_rcu(&d->hdr.list);
- synchronize_rcu();
-
- /*
- * rdt_ctrl_domain "d" is going to be freed below, so clear
- * its pointer from pseudo_lock_region struct.
- */
- if (d->plr)
- d->plr->d = NULL;
- ctrl_domain_free(hw_dom);
+ resctrl_offline_ctrl_domain(r, d);
+ list_del_rcu(&hdr->list);
+ synchronize_rcu();
- return;
- }
+ /*
+ * rdt_ctrl_domain "d" is going to be freed below, so clear
+ * its pointer from pseudo_lock_region struct.
+ */
+ if (d->plr)
+ d->plr->d = NULL;
+ ctrl_domain_free(hw_dom);
}
static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
--
2.51.0
* [PATCH v13 05/32] x86,fs/resctrl: Refactor domain create/remove using struct rdt_domain_hdr
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
` (3 preceding siblings ...)
2025-10-29 16:20 ` [PATCH v13 04/32] x86/resctrl: Clean up domain_remove_cpu_ctrl() Tony Luck
@ 2025-10-29 16:20 ` Tony Luck
2025-11-12 19:18 ` Reinette Chatre
2025-10-29 16:20 ` [PATCH v13 06/32] fs/resctrl: Split L3 dependent parts out of __mon_event_count() Tony Luck
` (28 subsequent siblings)
33 siblings, 1 reply; 85+ messages in thread
From: Tony Luck @ 2025-10-29 16:20 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Up until now, all monitoring events were associated with the L3 resource and it
made sense to use the L3 specific "struct rdt_mon_domain *" argument to functions
operating on domains.
Telemetry events will be tied to a new resource with its instances represented
by a new domain structure that, just like struct rdt_mon_domain, starts with
the generic struct rdt_domain_hdr.
Prepare to support domains belonging to different resources by changing the
calling convention of functions operating on domains. Pass the generic header
and use that to find the domain specific structure where needed.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 4 +-
fs/resctrl/internal.h | 2 +-
arch/x86/kernel/cpu/resctrl/core.c | 4 +-
fs/resctrl/ctrlmondata.c | 14 +++++--
fs/resctrl/rdtgroup.c | 65 +++++++++++++++++++++---------
5 files changed, 61 insertions(+), 28 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index dfc91c5e8483..0b55809af5d7 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -504,9 +504,9 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
u32 closid, enum resctrl_conf_type type);
int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr);
void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr);
void resctrl_online_cpu(unsigned int cpu);
void resctrl_offline_cpu(unsigned int cpu);
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index cf1fd82dc5a9..22fdb3a9b6f4 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -362,7 +362,7 @@ void mon_event_count(void *info);
int rdtgroup_mondata_show(struct seq_file *m, void *arg);
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
- struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
+ struct rdt_domain_hdr *hdr, struct rdtgroup *rdtgrp,
cpumask_t *cpumask, int evtid, int first);
int resctrl_mon_resource_init(void);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index b28663757dcf..44495bb915d5 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -529,7 +529,7 @@ static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct
list_add_tail_rcu(&d->hdr.list, add_pos);
- err = resctrl_online_mon_domain(r, d);
+ err = resctrl_online_mon_domain(r, &d->hdr);
if (err) {
list_del_rcu(&d->hdr.list);
synchronize_rcu();
@@ -656,7 +656,7 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
d = container_of(hdr, struct rdt_mon_domain, hdr);
hw_dom = resctrl_to_arch_mon_dom(d);
- resctrl_offline_mon_domain(r, d);
+ resctrl_offline_mon_domain(r, hdr);
list_del_rcu(&hdr->list);
synchronize_rcu();
mon_domain_free(hw_dom);
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index f248eaf50d3c..a2ea6a66fa67 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -547,14 +547,21 @@ struct rdt_domain_hdr *resctrl_find_domain(struct list_head *h, int id,
}
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
- struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
+ struct rdt_domain_hdr *hdr, struct rdtgroup *rdtgrp,
cpumask_t *cpumask, int evtid, int first)
{
+ struct rdt_mon_domain *d = NULL;
int cpu;
/* When picking a CPU from cpu_mask, ensure it can't race with cpuhp */
lockdep_assert_cpus_held();
+ if (hdr) {
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
+ return;
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
+ }
+
/*
* Setup the parameters to pass to mon_event_count() to read the data.
*/
@@ -649,12 +656,11 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
* the resource to find the domain with "domid".
*/
hdr = resctrl_find_domain(&r->mon_domains, domid, NULL);
- if (!hdr || !domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, resid)) {
+ if (!hdr) {
ret = -ENOENT;
goto out;
}
- d = container_of(hdr, struct rdt_mon_domain, hdr);
- mon_event_read(&rr, r, d, rdtgrp, &d->hdr.cpu_mask, evtid, false);
+ mon_event_read(&rr, r, hdr, rdtgrp, &hdr->cpu_mask, evtid, false);
}
checkresult:
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 0320360cd7a6..f5a65c48bcab 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -3164,13 +3164,18 @@ static void mon_rmdir_one_subdir(struct kernfs_node *pkn, char *name, char *subn
* when last domain being summed is removed.
*/
static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
- struct rdt_mon_domain *d)
+ struct rdt_domain_hdr *hdr)
{
struct rdtgroup *prgrp, *crgrp;
+ struct rdt_mon_domain *d;
char subname[32];
bool snc_mode;
char name[32];
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
+ return;
+
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
snc_mode = r->mon_scope == RESCTRL_L3_NODE;
sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci_id : d->hdr.id);
if (snc_mode)
@@ -3184,15 +3189,20 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
}
}
-static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
+static int mon_add_all_files(struct kernfs_node *kn, struct rdt_domain_hdr *hdr,
struct rdt_resource *r, struct rdtgroup *prgrp,
bool do_sum)
{
struct rmid_read rr = {0};
+ struct rdt_mon_domain *d;
struct mon_data *priv;
struct mon_evt *mevt;
int ret, domid;
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
+ return -EINVAL;
+
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
for_each_mon_event(mevt) {
if (mevt->rid != r->rid || !mevt->enabled)
continue;
@@ -3206,23 +3216,28 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
return ret;
if (!do_sum && resctrl_is_mbm_event(mevt->evtid))
- mon_event_read(&rr, r, d, prgrp, &d->hdr.cpu_mask, mevt->evtid, true);
+ mon_event_read(&rr, r, hdr, prgrp, &hdr->cpu_mask, mevt->evtid, true);
}
return 0;
}
static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
- struct rdt_mon_domain *d,
+ struct rdt_domain_hdr *hdr,
struct rdt_resource *r, struct rdtgroup *prgrp)
{
struct kernfs_node *kn, *ckn;
+ struct rdt_mon_domain *d;
char name[32];
bool snc_mode;
int ret = 0;
lockdep_assert_held(&rdtgroup_mutex);
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
+ return -EINVAL;
+
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
snc_mode = r->mon_scope == RESCTRL_L3_NODE;
sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci_id : d->hdr.id);
kn = kernfs_find_and_get(parent_kn, name);
@@ -3240,13 +3255,13 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
ret = rdtgroup_kn_set_ugid(kn);
if (ret)
goto out_destroy;
- ret = mon_add_all_files(kn, d, r, prgrp, snc_mode);
+ ret = mon_add_all_files(kn, hdr, r, prgrp, snc_mode);
if (ret)
goto out_destroy;
}
if (snc_mode) {
- sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
+ sprintf(name, "mon_sub_%s_%02d", r->name, hdr->id);
ckn = kernfs_create_dir(kn, name, parent_kn->mode, prgrp);
if (IS_ERR(ckn)) {
ret = -EINVAL;
@@ -3257,7 +3272,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
if (ret)
goto out_destroy;
- ret = mon_add_all_files(ckn, d, r, prgrp, false);
+ ret = mon_add_all_files(ckn, hdr, r, prgrp, false);
if (ret)
goto out_destroy;
}
@@ -3275,7 +3290,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
* and "monitor" groups with given domain id.
*/
static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
- struct rdt_mon_domain *d)
+ struct rdt_domain_hdr *hdr)
{
struct kernfs_node *parent_kn;
struct rdtgroup *prgrp, *crgrp;
@@ -3283,12 +3298,12 @@ static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
parent_kn = prgrp->mon.mon_data_kn;
- mkdir_mondata_subdir(parent_kn, d, r, prgrp);
+ mkdir_mondata_subdir(parent_kn, hdr, r, prgrp);
head = &prgrp->mon.crdtgrp_list;
list_for_each_entry(crgrp, head, mon.crdtgrp_list) {
parent_kn = crgrp->mon.mon_data_kn;
- mkdir_mondata_subdir(parent_kn, d, r, crgrp);
+ mkdir_mondata_subdir(parent_kn, hdr, r, crgrp);
}
}
}
@@ -3297,14 +3312,14 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
struct rdt_resource *r,
struct rdtgroup *prgrp)
{
- struct rdt_mon_domain *dom;
+ struct rdt_domain_hdr *hdr;
int ret;
/* Walking r->domains, ensure it can't race with cpuhp */
lockdep_assert_cpus_held();
- list_for_each_entry(dom, &r->mon_domains, hdr.list) {
- ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
+ list_for_each_entry(hdr, &r->mon_domains, list) {
+ ret = mkdir_mondata_subdir(parent_kn, hdr, r, prgrp);
if (ret)
return ret;
}
@@ -4187,16 +4202,23 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain
mutex_unlock(&rdtgroup_mutex);
}
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
{
+ struct rdt_mon_domain *d;
+
mutex_lock(&rdtgroup_mutex);
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
+ goto out_unlock;
+
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
+
/*
* If resctrl is mounted, remove all the
* per domain monitor data directories.
*/
if (resctrl_mounted && resctrl_arch_mon_capable())
- rmdir_mondata_subdir_allrdtgrp(r, d);
+ rmdir_mondata_subdir_allrdtgrp(r, hdr);
if (resctrl_is_mbm_enabled())
cancel_delayed_work(&d->mbm_over);
@@ -4214,7 +4236,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
}
domain_destroy_mon_state(d);
-
+out_unlock:
mutex_unlock(&rdtgroup_mutex);
}
@@ -4287,12 +4309,17 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d
return err;
}
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
{
- int err;
+ struct rdt_mon_domain *d;
+ int err = -EINVAL;
mutex_lock(&rdtgroup_mutex);
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
+ goto out_unlock;
+
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
err = domain_setup_mon_state(r, d);
if (err)
goto out_unlock;
@@ -4313,7 +4340,7 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
* If resctrl is mounted, add per domain monitor data directories.
*/
if (resctrl_mounted && resctrl_arch_mon_capable())
- mkdir_mondata_subdir_allrdtgrp(r, d);
+ mkdir_mondata_subdir_allrdtgrp(r, hdr);
out_unlock:
mutex_unlock(&rdtgroup_mutex);
--
2.51.0
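The new calling convention can be sketched end to end in a hypothetical userspace program: shared code walks generic headers and passes them down, while only leaf code that needs L3-specific fields validates the header and recovers the full domain with `container_of()`. A plain pointer array replaces the kernel's RCU-protected linked list, and the leaf function's name merely echoes the kernel one; none of this is the actual implementation.

```c
#include <assert.h>
#include <stddef.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

enum resctrl_domain_type { RESCTRL_CTRL_DOMAIN, RESCTRL_MON_DOMAIN };
enum resctrl_res_level { RDT_RESOURCE_L3 };

struct rdt_domain_hdr {
	int id;
	enum resctrl_domain_type type;
	enum resctrl_res_level rid;
};

struct rdt_mon_domain {
	struct rdt_domain_hdr hdr;
	int ci_id;	/* L3 cache instance id */
};

/* Leaf: needs d->ci_id, so it validates and converts. */
static int mkdir_mondata_subdir(struct rdt_domain_hdr *hdr)
{
	struct rdt_mon_domain *d;

	if (hdr->type != RESCTRL_MON_DOMAIN || hdr->rid != RDT_RESOURCE_L3)
		return -22;	/* -EINVAL */
	d = container_of(hdr, struct rdt_mon_domain, hdr);
	return d->ci_id;
}

/* Shared walker: only ever touches the generic header. */
static int sum_ci_ids(struct rdt_domain_hdr **hdrs, int n)
{
	int i, sum = 0;

	for (i = 0; i < n; i++)
		sum += mkdir_mondata_subdir(hdrs[i]);
	return sum;
}
```

The walker compiles without any knowledge of `struct rdt_mon_domain`, which is what lets a future resource add its own domain type behind the same header.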
* Re: [PATCH v13 05/32] x86,fs/resctrl: Refactor domain create/remove using struct rdt_domain_hdr
2025-10-29 16:20 ` [PATCH v13 05/32] x86,fs/resctrl: Refactor domain create/remove using struct rdt_domain_hdr Tony Luck
@ 2025-11-12 19:18 ` Reinette Chatre
0 siblings, 0 replies; 85+ messages in thread
From: Reinette Chatre @ 2025-11-12 19:18 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 10/29/25 9:20 AM, Tony Luck wrote:
> diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
> index 0320360cd7a6..f5a65c48bcab 100644
> --- a/fs/resctrl/rdtgroup.c
> +++ b/fs/resctrl/rdtgroup.c
> @@ -3164,13 +3164,18 @@ static void mon_rmdir_one_subdir(struct kernfs_node *pkn, char *name, char *subn
> * when last domain being summed is removed.
> */
> static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
> - struct rdt_mon_domain *d)
> + struct rdt_domain_hdr *hdr)
> {
> struct rdtgroup *prgrp, *crgrp;
> + struct rdt_mon_domain *d;
> char subname[32];
> bool snc_mode;
> char name[32];
>
> + if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
> + return;
> +
> + d = container_of(hdr, struct rdt_mon_domain, hdr);
> snc_mode = r->mon_scope == RESCTRL_L3_NODE;
> sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci_id : d->hdr.id);
Please note that this patch is inconsistent in how the functions are modified to access
members of struct rdt_domain_hdr now that it is provided as a parameter. For example, the
above d->hdr.id is unchanged while a similar line in mkdir_mondata_subdir() changes the
d->hdr.id to hdr->id.
This becomes irrelevant when considering the refactoring that comes later in the series but
a reviewer cannot be expected to know that at this point.
Reinette
* [PATCH v13 06/32] fs/resctrl: Split L3 dependent parts out of __mon_event_count()
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
` (4 preceding siblings ...)
2025-10-29 16:20 ` [PATCH v13 05/32] x86,fs/resctrl: Refactor domain create/remove using struct rdt_domain_hdr Tony Luck
@ 2025-10-29 16:20 ` Tony Luck
2025-10-29 16:20 ` [PATCH v13 07/32] x86,fs/resctrl: Use struct rdt_domain_hdr when reading counters Tony Luck
` (27 subsequent siblings)
33 siblings, 0 replies; 85+ messages in thread
From: Tony Luck @ 2025-10-29 16:20 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Carve out the L3 resource specific event reading code into a separate
helper to support reading event data from a new monitoring resource.
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
fs/resctrl/monitor.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 572a9925bd6c..179962a81362 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -413,7 +413,7 @@ static void mbm_cntr_free(struct rdt_mon_domain *d, int cntr_id)
memset(&d->cntr_cfg[cntr_id], 0, sizeof(*d->cntr_cfg));
}
-static int __mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
+static int __l3_mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
{
int cpu = smp_processor_id();
u32 closid = rdtgrp->closid;
@@ -494,6 +494,18 @@ static int __mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
return ret;
}
+static int __mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
+{
+ switch (rr->r->rid) {
+ case RDT_RESOURCE_L3:
+ return __l3_mon_event_count(rdtgrp, rr);
+
+ default:
+ rr->err = -EINVAL;
+ return -EINVAL;
+ }
+}
+
/*
* mbm_bw_count() - Update bw count from values previously read by
* __mon_event_count().
--
2.51.0
* [PATCH v13 07/32] x86,fs/resctrl: Use struct rdt_domain_hdr when reading counters
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
` (5 preceding siblings ...)
2025-10-29 16:20 ` [PATCH v13 06/32] fs/resctrl: Split L3 dependent parts out of __mon_event_count() Tony Luck
@ 2025-10-29 16:20 ` Tony Luck
2025-11-12 19:19 ` Reinette Chatre
2025-10-29 16:20 ` [PATCH v13 08/32] x86,fs/resctrl: Rename struct rdt_mon_domain and rdt_hw_mon_domain Tony Luck
` (26 subsequent siblings)
33 siblings, 1 reply; 85+ messages in thread
From: Tony Luck @ 2025-10-29 16:20 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Convert the whole call sequence from mon_event_read() to resctrl_arch_rmid_read()
to pass resource independent struct rdt_domain_hdr instead of an L3 specific
domain structure to prepare for monitoring events in other resources.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 4 +--
fs/resctrl/internal.h | 18 ++++++-----
arch/x86/kernel/cpu/resctrl/monitor.c | 12 +++++--
fs/resctrl/ctrlmondata.c | 9 +-----
fs/resctrl/monitor.c | 46 +++++++++++++++++----------
5 files changed, 52 insertions(+), 37 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 0b55809af5d7..1a33d5e6ae23 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -514,7 +514,7 @@ void resctrl_offline_cpu(unsigned int cpu);
* resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
* for this resource and domain.
* @r: resource that the counter should be read from.
- * @d: domain that the counter should be read from.
+ * @hdr: Header of domain that the counter should be read from.
* @closid: closid that matches the rmid. Depending on the architecture, the
* counter may match traffic of both @closid and @rmid, or @rmid
* only.
@@ -535,7 +535,7 @@ void resctrl_offline_cpu(unsigned int cpu);
* Return:
* 0 on success, or -EIO, -EINVAL etc on error.
*/
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain_hdr *hdr,
u32 closid, u32 rmid, enum resctrl_event_id eventid,
u64 *val, void *arch_mon_ctx);
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 22fdb3a9b6f4..698ed84fd073 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -106,24 +106,26 @@ struct mon_data {
* resource group then its event count is summed with the count from all
* its child resource groups.
* @r: Resource describing the properties of the event being read.
- * @d: Domain that the counter should be read from. If NULL then sum all
- * domains in @r sharing L3 @ci.id
+ * @hdr: Header of domain that the counter should be read from. If NULL then
+ * sum all domains in @r sharing L3 @ci.id
* @evtid: Which monitor event to read.
* @first: Initialize MBM counter when true.
- * @ci: Cacheinfo for L3. Only set when @d is NULL. Used when summing domains.
+ * @ci: Cacheinfo for L3. Only set when @hdr is NULL. Used when summing
+ * domains.
* @is_mbm_cntr: true if "mbm_event" counter assignment mode is enabled and it
* is an MBM event.
* @err: Error encountered when reading counter.
- * @val: Returned value of event counter. If @rgrp is a parent resource group,
- * @val includes the sum of event counts from its child resource groups.
- * If @d is NULL, @val includes the sum of all domains in @r sharing @ci.id,
- * (summed across child resource groups if @rgrp is a parent resource group).
+ * @val: Returned value of event counter. If @rgrp is a parent resource
+ * group, @val includes the sum of event counts from its child
+ * resource groups. If @hdr is NULL, @val includes the sum of all
+ * domains in @r sharing @ci.id, (summed across child resource groups
+ * if @rgrp is a parent resource group).
* @arch_mon_ctx: Hardware monitor allocated for this read request (MPAM only).
*/
struct rmid_read {
struct rdtgroup *rgrp;
struct rdt_resource *r;
- struct rdt_mon_domain *d;
+ struct rdt_domain_hdr *hdr;
enum resctrl_event_id evtid;
bool first;
struct cacheinfo *ci;
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index fe1a2aa53c16..982dcf23183c 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -238,19 +238,25 @@ static u64 get_corrected_val(struct rdt_resource *r, struct rdt_mon_domain *d,
return chunks * hw_res->mon_scale;
}
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain_hdr *hdr,
u32 unused, u32 rmid, enum resctrl_event_id eventid,
u64 *val, void *ignored)
{
- struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
- int cpu = cpumask_any(&d->hdr.cpu_mask);
+ struct rdt_hw_mon_domain *hw_dom;
struct arch_mbm_state *am;
+ struct rdt_mon_domain *d;
u64 msr_val;
u32 prmid;
+ int cpu;
int ret;
resctrl_arch_rmid_read_context_check();
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
+ return -EINVAL;
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
+ hw_dom = resctrl_to_arch_mon_dom(d);
+ cpu = cpumask_any(&hdr->cpu_mask);
prmid = logical_rmid_to_physical_rmid(cpu, rmid);
ret = __rmid_read_phys(prmid, eventid, &msr_val);
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index a2ea6a66fa67..ad347ab4ed29 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -550,25 +550,18 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
struct rdt_domain_hdr *hdr, struct rdtgroup *rdtgrp,
cpumask_t *cpumask, int evtid, int first)
{
- struct rdt_mon_domain *d = NULL;
int cpu;
/* When picking a CPU from cpu_mask, ensure it can't race with cpuhp */
lockdep_assert_cpus_held();
- if (hdr) {
- if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
- return;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
- }
-
/*
* Setup the parameters to pass to mon_event_count() to read the data.
*/
rr->rgrp = rdtgrp;
rr->evtid = evtid;
rr->r = r;
- rr->d = d;
+ rr->hdr = hdr;
rr->first = first;
if (resctrl_arch_mbm_cntr_assign_enabled(r) &&
resctrl_is_mbm_event(evtid)) {
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 179962a81362..911a10aa6920 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -159,7 +159,7 @@ void __check_limbo(struct rdt_mon_domain *d, bool force_free)
break;
entry = __rmid_entry(idx);
- if (resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid,
+ if (resctrl_arch_rmid_read(r, &d->hdr, entry->closid, entry->rmid,
QOS_L3_OCCUP_EVENT_ID, &val,
arch_mon_ctx)) {
rmid_dirty = true;
@@ -413,19 +413,19 @@ static void mbm_cntr_free(struct rdt_mon_domain *d, int cntr_id)
memset(&d->cntr_cfg[cntr_id], 0, sizeof(*d->cntr_cfg));
}
-static int __l3_mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
+static int __l3_mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr,
+ struct rdt_mon_domain *d)
{
int cpu = smp_processor_id();
u32 closid = rdtgrp->closid;
u32 rmid = rdtgrp->mon.rmid;
- struct rdt_mon_domain *d;
int cntr_id = -ENOENT;
struct mbm_state *m;
int err, ret;
u64 tval = 0;
if (rr->is_mbm_cntr) {
- cntr_id = mbm_cntr_get(rr->r, rr->d, rdtgrp, rr->evtid);
+ cntr_id = mbm_cntr_get(rr->r, d, rdtgrp, rr->evtid);
if (cntr_id < 0) {
rr->err = -ENOENT;
return -EINVAL;
@@ -434,24 +434,24 @@ static int __l3_mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
if (rr->first) {
if (rr->is_mbm_cntr)
- resctrl_arch_reset_cntr(rr->r, rr->d, closid, rmid, cntr_id, rr->evtid);
+ resctrl_arch_reset_cntr(rr->r, d, closid, rmid, cntr_id, rr->evtid);
else
- resctrl_arch_reset_rmid(rr->r, rr->d, closid, rmid, rr->evtid);
- m = get_mbm_state(rr->d, closid, rmid, rr->evtid);
+ resctrl_arch_reset_rmid(rr->r, d, closid, rmid, rr->evtid);
+ m = get_mbm_state(d, closid, rmid, rr->evtid);
if (m)
memset(m, 0, sizeof(struct mbm_state));
return 0;
}
- if (rr->d) {
+ if (d) {
/* Reading a single domain, must be on a CPU in that domain. */
- if (!cpumask_test_cpu(cpu, &rr->d->hdr.cpu_mask))
+ if (!cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
return -EINVAL;
if (rr->is_mbm_cntr)
- rr->err = resctrl_arch_cntr_read(rr->r, rr->d, closid, rmid, cntr_id,
+ rr->err = resctrl_arch_cntr_read(rr->r, d, closid, rmid, cntr_id,
rr->evtid, &tval);
else
- rr->err = resctrl_arch_rmid_read(rr->r, rr->d, closid, rmid,
+ rr->err = resctrl_arch_rmid_read(rr->r, rr->hdr, closid, rmid,
rr->evtid, &tval, rr->arch_mon_ctx);
if (rr->err)
return rr->err;
@@ -480,7 +480,7 @@ static int __l3_mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
err = resctrl_arch_cntr_read(rr->r, d, closid, rmid, cntr_id,
rr->evtid, &tval);
else
- err = resctrl_arch_rmid_read(rr->r, d, closid, rmid,
+ err = resctrl_arch_rmid_read(rr->r, &d->hdr, closid, rmid,
rr->evtid, &tval, rr->arch_mon_ctx);
if (!err) {
rr->val += tval;
@@ -497,8 +497,18 @@ static int __l3_mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
static int __mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
{
switch (rr->r->rid) {
- case RDT_RESOURCE_L3:
- return __l3_mon_event_count(rdtgrp, rr);
+ case RDT_RESOURCE_L3: {
+ struct rdt_mon_domain *d = NULL;
+
+ if (rr->hdr) {
+ if (!domain_header_is_valid(rr->hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3)) {
+ rr->err = -EIO;
+ return -EINVAL;
+ }
+ d = container_of(rr->hdr, struct rdt_mon_domain, hdr);
+ }
+ return __l3_mon_event_count(rdtgrp, rr, d);
+ }
default:
rr->err = -EINVAL;
@@ -523,9 +533,13 @@ static void mbm_bw_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
u64 cur_bw, bytes, cur_bytes;
u32 closid = rdtgrp->closid;
u32 rmid = rdtgrp->mon.rmid;
+ struct rdt_mon_domain *d;
struct mbm_state *m;
- m = get_mbm_state(rr->d, closid, rmid, rr->evtid);
+ if (!domain_header_is_valid(rr->hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
+ return;
+ d = container_of(rr->hdr, struct rdt_mon_domain, hdr);
+ m = get_mbm_state(d, closid, rmid, rr->evtid);
if (WARN_ON_ONCE(!m))
return;
@@ -698,7 +712,7 @@ static void mbm_update_one_event(struct rdt_resource *r, struct rdt_mon_domain *
struct rmid_read rr = {0};
rr.r = r;
- rr.d = d;
+ rr.hdr = &d->hdr;
rr.evtid = evtid;
if (resctrl_arch_mbm_cntr_assign_enabled(r)) {
rr.is_mbm_cntr = true;
--
2.51.0
* Re: [PATCH v13 07/32] x86,fs/resctrl: Use struct rdt_domain_hdr when reading counters
2025-10-29 16:20 ` [PATCH v13 07/32] x86,fs/resctrl: Use struct rdt_domain_hdr when reading counters Tony Luck
@ 2025-11-12 19:19 ` Reinette Chatre
0 siblings, 0 replies; 85+ messages in thread
From: Reinette Chatre @ 2025-11-12 19:19 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 10/29/25 9:20 AM, Tony Luck wrote:
> @@ -497,8 +497,18 @@ static int __l3_mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
> static int __mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
> {
> switch (rr->r->rid) {
> - case RDT_RESOURCE_L3:
> - return __l3_mon_event_count(rdtgrp, rr);
> + case RDT_RESOURCE_L3: {
> + struct rdt_mon_domain *d = NULL;
> +
> + if (rr->hdr) {
> + if (!domain_header_is_valid(rr->hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3)) {
> + rr->err = -EIO;
> + return -EINVAL;
> + }
> + d = container_of(rr->hdr, struct rdt_mon_domain, hdr);
> + }
> + return __l3_mon_event_count(rdtgrp, rr, d);
> + }
I tried running this series through a static checker and it flagged a few issues related to
this flow. The issues appear to be false positives but they demonstrate that this code is
becoming very hard to understand. Consider, for example, how __l3_mon_event_count() is
structured while thinking about d==NULL:
__l3_mon_event_count()
{
...
if (rr->is_mbm_cntr) {
/* dereferences d */
}
if (rr->first) {
/* dereferences d */
return 0;
}
if (d) {
/* dereferences d */
return 0;
}
/* sum code */
}
I believe it will be difficult for somebody to trace that rr->is_mbm_cntr and rr->first cannot
be true if d==NULL (the static checker issues support this). The "if (d)" test that follows
these checks just adds to the difficulty by implying that d could indeed be NULL before then.
I see two options to address this. I tried both and the static checker was ok with either. I find the
second option easier to understand than the first, but I share both for context:
option 1:
To make it obvious when the domain can be NULL:
__l3_mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
{
...
if (rr->hdr) {
if (!domain_header_is_valid(rr->hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3)) {
rr->err = -EIO;
return -EINVAL;
}
d = container_of(rr->hdr, struct rdt_mon_domain, hdr);
if (rr->is_mbm_cntr) {
/* dereferences d */
}
if (rr->first) {
/* dereferences d */
return 0;
}
/* dereferences d */
return 0;
}
/* sum code */
}
While easier to understand, the above does not make the code easier to read. The function is already quite long
and this adds an additional indentation level. This does not seem necessary since the rr->hdr != NULL
scenario really just looks like a "function within a function" since it does a "return".
This brings me to:
option 2:
Split __l3_mon_event_count() into, for example, __l3_mon_event_count() that handles the rr->hdr!=NULL
flow and __l3_mon_event_count_sum() that handles the rr->hdr==NULL flow.
This can be called from __mon_event_count():
if (rr->hdr)
return __l3_mon_event_count(rdtgrp, rr);
else
return __l3_mon_event_count_sum(rdtgrp, rr);
Option 2 looks like the better option to me. What do you think?
Reinette
* [PATCH v13 08/32] x86,fs/resctrl: Rename struct rdt_mon_domain and rdt_hw_mon_domain
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
` (6 preceding siblings ...)
2025-10-29 16:20 ` [PATCH v13 07/32] x86,fs/resctrl: Use struct rdt_domain_hdr when reading counters Tony Luck
@ 2025-10-29 16:20 ` Tony Luck
2025-11-13 4:01 ` Reinette Chatre
2025-10-29 16:20 ` [PATCH v13 09/32] x86,fs/resctrl: Rename some L3 specific functions Tony Luck
` (25 subsequent siblings)
33 siblings, 1 reply; 85+ messages in thread
From: Tony Luck @ 2025-10-29 16:20 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
The upcoming telemetry event monitoring is not tied to the L3
resource and will have new domain structures.
Rename the L3-specific domain data structures to include
"l3_" in their names to avoid confusion between the different
resource-specific domain structures:
rdt_mon_domain -> rdt_l3_mon_domain
rdt_hw_mon_domain -> rdt_hw_l3_mon_domain
No functional change.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 22 ++++----
arch/x86/kernel/cpu/resctrl/internal.h | 16 +++---
fs/resctrl/internal.h | 8 +--
arch/x86/kernel/cpu/resctrl/core.c | 14 +++---
arch/x86/kernel/cpu/resctrl/monitor.c | 36 ++++++-------
fs/resctrl/ctrlmondata.c | 2 +-
fs/resctrl/monitor.c | 70 +++++++++++++-------------
fs/resctrl/rdtgroup.c | 40 +++++++--------
8 files changed, 104 insertions(+), 104 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 1a33d5e6ae23..a07542957e5a 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -178,7 +178,7 @@ struct mbm_cntr_cfg {
};
/**
- * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
+ * struct rdt_l3_mon_domain - group of CPUs sharing a resctrl monitor resource
* @hdr: common header for different domain types
* @ci_id: cache info id for this domain
* @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
@@ -192,7 +192,7 @@ struct mbm_cntr_cfg {
* @cntr_cfg: array of assignable counters' configuration (indexed
* by counter ID)
*/
-struct rdt_mon_domain {
+struct rdt_l3_mon_domain {
struct rdt_domain_hdr hdr;
unsigned int ci_id;
unsigned long *rmid_busy_llc;
@@ -364,10 +364,10 @@ struct resctrl_cpu_defaults {
};
struct resctrl_mon_config_info {
- struct rdt_resource *r;
- struct rdt_mon_domain *d;
- u32 evtid;
- u32 mon_config;
+ struct rdt_resource *r;
+ struct rdt_l3_mon_domain *d;
+ u32 evtid;
+ u32 mon_config;
};
/**
@@ -582,7 +582,7 @@ struct rdt_domain_hdr *resctrl_find_domain(struct list_head *h, int id,
*
* This can be called from any CPU.
*/
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
u32 closid, u32 rmid,
enum resctrl_event_id eventid);
@@ -595,7 +595,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
*
* This can be called from any CPU.
*/
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d);
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_l3_mon_domain *d);
/**
* resctrl_arch_reset_all_ctrls() - Reset the control for each CLOSID to its
@@ -621,7 +621,7 @@ void resctrl_arch_reset_all_ctrls(struct rdt_resource *r);
*
* This can be called from any CPU.
*/
-void resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
+void resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
enum resctrl_event_id evtid, u32 rmid, u32 closid,
u32 cntr_id, bool assign);
@@ -644,7 +644,7 @@ void resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
* Return:
* 0 on success, or -EIO, -EINVAL etc on error.
*/
-int resctrl_arch_cntr_read(struct rdt_resource *r, struct rdt_mon_domain *d,
+int resctrl_arch_cntr_read(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
u32 closid, u32 rmid, int cntr_id,
enum resctrl_event_id eventid, u64 *val);
@@ -659,7 +659,7 @@ int resctrl_arch_cntr_read(struct rdt_resource *r, struct rdt_mon_domain *d,
*
* This can be called from any CPU.
*/
-void resctrl_arch_reset_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
+void resctrl_arch_reset_cntr(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
u32 closid, u32 rmid, int cntr_id,
enum resctrl_event_id eventid);
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 9f4c2f0aaf5c..6eca3d522fcc 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -60,17 +60,17 @@ struct rdt_hw_ctrl_domain {
};
/**
- * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
- * a resource for a monitor function
- * @d_resctrl: Properties exposed to the resctrl file system
+ * struct rdt_hw_l3_mon_domain - Arch private attributes of a set of CPUs that share
+ * a resource for a monitor function
+ * @d_resctrl: Properties exposed to the resctrl file system
* @arch_mbm_states: Per-event pointer to the MBM event's saved state.
* An MBM event's state is an array of struct arch_mbm_state
* indexed by RMID on x86.
*
* Members of this structure are accessed via helpers that provide abstraction.
*/
-struct rdt_hw_mon_domain {
- struct rdt_mon_domain d_resctrl;
+struct rdt_hw_l3_mon_domain {
+ struct rdt_l3_mon_domain d_resctrl;
struct arch_mbm_state *arch_mbm_states[QOS_NUM_L3_MBM_EVENTS];
};
@@ -79,9 +79,9 @@ static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctr
return container_of(r, struct rdt_hw_ctrl_domain, d_resctrl);
}
-static inline struct rdt_hw_mon_domain *resctrl_to_arch_mon_dom(struct rdt_mon_domain *r)
+static inline struct rdt_hw_l3_mon_domain *resctrl_to_arch_mon_dom(struct rdt_l3_mon_domain *r)
{
- return container_of(r, struct rdt_hw_mon_domain, d_resctrl);
+ return container_of(r, struct rdt_hw_l3_mon_domain, d_resctrl);
}
/**
@@ -135,7 +135,7 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r
extern struct rdt_hw_resource rdt_resources_all[];
-void arch_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d);
+void arch_mon_domain_online(struct rdt_resource *r, struct rdt_l3_mon_domain *d);
/* CPUID.(EAX=10H, ECX=ResID=1).EAX */
union cpuid_0x10_1_eax {
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 698ed84fd073..d9e291d94926 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -369,7 +369,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
int resctrl_mon_resource_init(void);
-void mbm_setup_overflow_handler(struct rdt_mon_domain *dom,
+void mbm_setup_overflow_handler(struct rdt_l3_mon_domain *dom,
unsigned long delay_ms,
int exclude_cpu);
@@ -377,14 +377,14 @@ void mbm_handle_overflow(struct work_struct *work);
bool is_mba_sc(struct rdt_resource *r);
-void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms,
+void cqm_setup_limbo_handler(struct rdt_l3_mon_domain *dom, unsigned long delay_ms,
int exclude_cpu);
void cqm_handle_limbo(struct work_struct *work);
-bool has_busy_rmid(struct rdt_mon_domain *d);
+bool has_busy_rmid(struct rdt_l3_mon_domain *d);
-void __check_limbo(struct rdt_mon_domain *d, bool force_free);
+void __check_limbo(struct rdt_l3_mon_domain *d, bool force_free);
void resctrl_file_fflags_init(const char *config, unsigned long fflags);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 44495bb915d5..8137c1442139 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -363,7 +363,7 @@ static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
kfree(hw_dom);
}
-static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
+static void mon_domain_free(struct rdt_hw_l3_mon_domain *hw_dom)
{
int idx;
@@ -400,7 +400,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *
* @num_rmid: The size of the MBM counter array
* @hw_dom: The domain that owns the allocated arrays
*/
-static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
+static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_l3_mon_domain *hw_dom)
{
size_t tsize = sizeof(*hw_dom->arch_mbm_states[0]);
enum resctrl_event_id eventid;
@@ -498,8 +498,8 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct list_head *add_pos)
{
- struct rdt_hw_mon_domain *hw_dom;
- struct rdt_mon_domain *d;
+ struct rdt_hw_l3_mon_domain *hw_dom;
+ struct rdt_l3_mon_domain *d;
struct cacheinfo *ci;
int err;
@@ -648,13 +648,13 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
switch (r->rid) {
case RDT_RESOURCE_L3: {
- struct rdt_hw_mon_domain *hw_dom;
- struct rdt_mon_domain *d;
+ struct rdt_hw_l3_mon_domain *hw_dom;
+ struct rdt_l3_mon_domain *d;
if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
return;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
+ d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
hw_dom = resctrl_to_arch_mon_dom(d);
resctrl_offline_mon_domain(r, hdr);
list_del_rcu(&hdr->list);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 982dcf23183c..8b293fc4e946 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -109,7 +109,7 @@ static inline u64 get_corrected_mbm_count(u32 rmid, unsigned long val)
*
* In RMID sharing mode there are fewer "logical RMID" values available
* to accumulate data ("physical RMIDs" are divided evenly between SNC
- * nodes that share an L3 cache). Linux creates an rdt_mon_domain for
+ * nodes that share an L3 cache). Linux creates an rdt_l3_mon_domain for
* each SNC node.
*
* The value loaded into IA32_PQR_ASSOC is the "logical RMID".
@@ -157,7 +157,7 @@ static int __rmid_read_phys(u32 prmid, enum resctrl_event_id eventid, u64 *val)
return 0;
}
-static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_dom,
+static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_l3_mon_domain *hw_dom,
u32 rmid,
enum resctrl_event_id eventid)
{
@@ -171,11 +171,11 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_do
return state ? &state[rmid] : NULL;
}
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
u32 unused, u32 rmid,
enum resctrl_event_id eventid)
{
- struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
+ struct rdt_hw_l3_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
int cpu = cpumask_any(&d->hdr.cpu_mask);
struct arch_mbm_state *am;
u32 prmid;
@@ -194,9 +194,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
* Assumes that hardware counters are also reset and thus that there is
* no need to record initial non-zero counts.
*/
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_l3_mon_domain *d)
{
- struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
+ struct rdt_hw_l3_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
enum resctrl_event_id eventid;
int idx;
@@ -217,10 +217,10 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
return chunks >> shift;
}
-static u64 get_corrected_val(struct rdt_resource *r, struct rdt_mon_domain *d,
+static u64 get_corrected_val(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
u32 rmid, enum resctrl_event_id eventid, u64 msr_val)
{
- struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
+ struct rdt_hw_l3_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
struct arch_mbm_state *am;
u64 chunks;
@@ -242,9 +242,9 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain_hdr *hdr,
u32 unused, u32 rmid, enum resctrl_event_id eventid,
u64 *val, void *ignored)
{
- struct rdt_hw_mon_domain *hw_dom;
+ struct rdt_hw_l3_mon_domain *hw_dom;
+ struct rdt_l3_mon_domain *d;
struct arch_mbm_state *am;
- struct rdt_mon_domain *d;
u64 msr_val;
u32 prmid;
int cpu;
@@ -254,7 +254,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain_hdr *hdr,
if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
return -EINVAL;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
+ d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
hw_dom = resctrl_to_arch_mon_dom(d);
cpu = cpumask_any(&hdr->cpu_mask);
prmid = logical_rmid_to_physical_rmid(cpu, rmid);
@@ -308,11 +308,11 @@ static int __cntr_id_read(u32 cntr_id, u64 *val)
return 0;
}
-void resctrl_arch_reset_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
+void resctrl_arch_reset_cntr(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
u32 unused, u32 rmid, int cntr_id,
enum resctrl_event_id eventid)
{
- struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
+ struct rdt_hw_l3_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
struct arch_mbm_state *am;
am = get_arch_mbm_state(hw_dom, rmid, eventid);
@@ -324,7 +324,7 @@ void resctrl_arch_reset_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
}
}
-int resctrl_arch_cntr_read(struct rdt_resource *r, struct rdt_mon_domain *d,
+int resctrl_arch_cntr_read(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
u32 unused, u32 rmid, int cntr_id,
enum resctrl_event_id eventid, u64 *val)
{
@@ -354,7 +354,7 @@ int resctrl_arch_cntr_read(struct rdt_resource *r, struct rdt_mon_domain *d,
* must adjust RMID counter numbers based on SNC node. See
* logical_rmid_to_physical_rmid() for code that does this.
*/
-void arch_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d)
+void arch_mon_domain_online(struct rdt_resource *r, struct rdt_l3_mon_domain *d)
{
if (snc_nodes_per_l3_cache > 1)
msr_clear_bit(MSR_RMID_SNC_CONFIG, 0);
@@ -515,7 +515,7 @@ static void resctrl_abmc_set_one_amd(void *arg)
*/
static void _resctrl_abmc_enable(struct rdt_resource *r, bool enable)
{
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
lockdep_assert_cpus_held();
@@ -554,11 +554,11 @@ static void resctrl_abmc_config_one_amd(void *info)
/*
* Send an IPI to the domain to assign the counter to RMID, event pair.
*/
-void resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
+void resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
enum resctrl_event_id evtid, u32 rmid, u32 closid,
u32 cntr_id, bool assign)
{
- struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
+ struct rdt_hw_l3_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
union l3_qos_abmc_cfg abmc_cfg = { 0 };
struct arch_mbm_state *am;
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index ad347ab4ed29..b74c69f2d54e 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -596,9 +596,9 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
struct kernfs_open_file *of = m->private;
enum resctrl_res_level resid;
enum resctrl_event_id evtid;
+ struct rdt_l3_mon_domain *d;
struct rdt_domain_hdr *hdr;
struct rmid_read rr = {0};
- struct rdt_mon_domain *d;
struct rdtgroup *rdtgrp;
int domid, cpu, ret = 0;
struct rdt_resource *r;
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 911a10aa6920..dd295f31ec49 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -130,7 +130,7 @@ static void limbo_release_entry(struct rmid_entry *entry)
* decrement the count. If the busy count gets to zero on an RMID, we
* free the RMID
*/
-void __check_limbo(struct rdt_mon_domain *d, bool force_free)
+void __check_limbo(struct rdt_l3_mon_domain *d, bool force_free)
{
struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
@@ -188,7 +188,7 @@ void __check_limbo(struct rdt_mon_domain *d, bool force_free)
resctrl_arch_mon_ctx_free(r, QOS_L3_OCCUP_EVENT_ID, arch_mon_ctx);
}
-bool has_busy_rmid(struct rdt_mon_domain *d)
+bool has_busy_rmid(struct rdt_l3_mon_domain *d)
{
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
@@ -289,7 +289,7 @@ int alloc_rmid(u32 closid)
static void add_rmid_to_limbo(struct rmid_entry *entry)
{
struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
u32 idx;
lockdep_assert_held(&rdtgroup_mutex);
@@ -342,7 +342,7 @@ void free_rmid(u32 closid, u32 rmid)
list_add_tail(&entry->list, &rmid_free_lru);
}
-static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 closid,
+static struct mbm_state *get_mbm_state(struct rdt_l3_mon_domain *d, u32 closid,
u32 rmid, enum resctrl_event_id evtid)
{
u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
@@ -362,7 +362,7 @@ static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 closid,
* Return:
* Valid counter ID on success, or -ENOENT on failure.
*/
-static int mbm_cntr_get(struct rdt_resource *r, struct rdt_mon_domain *d,
+static int mbm_cntr_get(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
struct rdtgroup *rdtgrp, enum resctrl_event_id evtid)
{
int cntr_id;
@@ -389,7 +389,7 @@ static int mbm_cntr_get(struct rdt_resource *r, struct rdt_mon_domain *d,
* Return:
* Valid counter ID on success, or -ENOSPC on failure.
*/
-static int mbm_cntr_alloc(struct rdt_resource *r, struct rdt_mon_domain *d,
+static int mbm_cntr_alloc(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
struct rdtgroup *rdtgrp, enum resctrl_event_id evtid)
{
int cntr_id;
@@ -408,13 +408,13 @@ static int mbm_cntr_alloc(struct rdt_resource *r, struct rdt_mon_domain *d,
/*
* mbm_cntr_free() - Clear the counter ID configuration details in the domain @d.
*/
-static void mbm_cntr_free(struct rdt_mon_domain *d, int cntr_id)
+static void mbm_cntr_free(struct rdt_l3_mon_domain *d, int cntr_id)
{
memset(&d->cntr_cfg[cntr_id], 0, sizeof(*d->cntr_cfg));
}
static int __l3_mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr,
- struct rdt_mon_domain *d)
+ struct rdt_l3_mon_domain *d)
{
int cpu = smp_processor_id();
u32 closid = rdtgrp->closid;
@@ -498,14 +498,14 @@ static int __mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
{
switch (rr->r->rid) {
case RDT_RESOURCE_L3: {
- struct rdt_mon_domain *d = NULL;
+ struct rdt_l3_mon_domain *d = NULL;
if (rr->hdr) {
if (!domain_header_is_valid(rr->hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3)) {
rr->err = -EIO;
return -EINVAL;
}
- d = container_of(rr->hdr, struct rdt_mon_domain, hdr);
+ d = container_of(rr->hdr, struct rdt_l3_mon_domain, hdr);
}
return __l3_mon_event_count(rdtgrp, rr, d);
}
@@ -533,12 +533,12 @@ static void mbm_bw_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
u64 cur_bw, bytes, cur_bytes;
u32 closid = rdtgrp->closid;
u32 rmid = rdtgrp->mon.rmid;
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
struct mbm_state *m;
if (!domain_header_is_valid(rr->hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
return;
- d = container_of(rr->hdr, struct rdt_mon_domain, hdr);
+ d = container_of(rr->hdr, struct rdt_l3_mon_domain, hdr);
m = get_mbm_state(d, closid, rmid, rr->evtid);
if (WARN_ON_ONCE(!m))
return;
@@ -638,7 +638,7 @@ static struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu,
* throttle MSRs already have low percentage values. To avoid
* unnecessarily restricting such rdtgroups, we also increase the bandwidth.
*/
-static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
+static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_l3_mon_domain *dom_mbm)
{
u32 closid, rmid, cur_msr_val, new_msr_val;
struct mbm_state *pmbm_data, *cmbm_data;
@@ -706,7 +706,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val);
}
-static void mbm_update_one_event(struct rdt_resource *r, struct rdt_mon_domain *d,
+static void mbm_update_one_event(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
struct rdtgroup *rdtgrp, enum resctrl_event_id evtid)
{
struct rmid_read rr = {0};
@@ -738,7 +738,7 @@ static void mbm_update_one_event(struct rdt_resource *r, struct rdt_mon_domain *
resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx);
}
-static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d,
+static void mbm_update(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
struct rdtgroup *rdtgrp)
{
/*
@@ -759,12 +759,12 @@ static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d,
void cqm_handle_limbo(struct work_struct *work)
{
unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
cpus_read_lock();
mutex_lock(&rdtgroup_mutex);
- d = container_of(work, struct rdt_mon_domain, cqm_limbo.work);
+ d = container_of(work, struct rdt_l3_mon_domain, cqm_limbo.work);
__check_limbo(d, false);
@@ -787,7 +787,7 @@ void cqm_handle_limbo(struct work_struct *work)
* @exclude_cpu: Which CPU the handler should not run on,
* RESCTRL_PICK_ANY_CPU to pick any CPU.
*/
-void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms,
+void cqm_setup_limbo_handler(struct rdt_l3_mon_domain *dom, unsigned long delay_ms,
int exclude_cpu)
{
unsigned long delay = msecs_to_jiffies(delay_ms);
@@ -804,7 +804,7 @@ void mbm_handle_overflow(struct work_struct *work)
{
unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL);
struct rdtgroup *prgrp, *crgrp;
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
struct list_head *head;
struct rdt_resource *r;
@@ -819,7 +819,7 @@ void mbm_handle_overflow(struct work_struct *work)
goto out_unlock;
r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
- d = container_of(work, struct rdt_mon_domain, mbm_over.work);
+ d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work);
list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
mbm_update(r, d, prgrp);
@@ -853,7 +853,7 @@ void mbm_handle_overflow(struct work_struct *work)
* @exclude_cpu: Which CPU the handler should not run on,
* RESCTRL_PICK_ANY_CPU to pick any CPU.
*/
-void mbm_setup_overflow_handler(struct rdt_mon_domain *dom, unsigned long delay_ms,
+void mbm_setup_overflow_handler(struct rdt_l3_mon_domain *dom, unsigned long delay_ms,
int exclude_cpu)
{
unsigned long delay = msecs_to_jiffies(delay_ms);
@@ -1108,7 +1108,7 @@ ssize_t resctrl_mbm_assign_on_mkdir_write(struct kernfs_open_file *of, char *buf
* mbm_cntr_free_all() - Clear all the counter ID configuration details in the
* domain @d. Called when mbm_assign_mode is changed.
*/
-static void mbm_cntr_free_all(struct rdt_resource *r, struct rdt_mon_domain *d)
+static void mbm_cntr_free_all(struct rdt_resource *r, struct rdt_l3_mon_domain *d)
{
memset(d->cntr_cfg, 0, sizeof(*d->cntr_cfg) * r->mon.num_mbm_cntrs);
}
@@ -1117,7 +1117,7 @@ static void mbm_cntr_free_all(struct rdt_resource *r, struct rdt_mon_domain *d)
* resctrl_reset_rmid_all() - Reset all non-architecture states for all the
* supported RMIDs.
*/
-static void resctrl_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
+static void resctrl_reset_rmid_all(struct rdt_resource *r, struct rdt_l3_mon_domain *d)
{
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
enum resctrl_event_id evt;
@@ -1138,7 +1138,7 @@ static void resctrl_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain
* Assign the counter if @assign is true else unassign the counter. Reset the
* associated non-architectural state.
*/
-static void rdtgroup_assign_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
+static void rdtgroup_assign_cntr(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
enum resctrl_event_id evtid, u32 rmid, u32 closid,
u32 cntr_id, bool assign)
{
@@ -1158,7 +1158,7 @@ static void rdtgroup_assign_cntr(struct rdt_resource *r, struct rdt_mon_domain *
* Return:
* 0 on success, < 0 on failure.
*/
-static int rdtgroup_alloc_assign_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
+static int rdtgroup_alloc_assign_cntr(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
struct rdtgroup *rdtgrp, struct mon_evt *mevt)
{
int cntr_id;
@@ -1193,7 +1193,7 @@ static int rdtgroup_alloc_assign_cntr(struct rdt_resource *r, struct rdt_mon_dom
* Return:
* 0 on success, < 0 on failure.
*/
-static int rdtgroup_assign_cntr_event(struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
+static int rdtgroup_assign_cntr_event(struct rdt_l3_mon_domain *d, struct rdtgroup *rdtgrp,
struct mon_evt *mevt)
{
struct rdt_resource *r = resctrl_arch_get_resource(mevt->rid);
@@ -1243,7 +1243,7 @@ void rdtgroup_assign_cntrs(struct rdtgroup *rdtgrp)
* rdtgroup_free_unassign_cntr() - Unassign and reset the counter ID configuration
* for the event pointed to by @mevt within the domain @d and resctrl group @rdtgrp.
*/
-static void rdtgroup_free_unassign_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
+static void rdtgroup_free_unassign_cntr(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
struct rdtgroup *rdtgrp, struct mon_evt *mevt)
{
int cntr_id;
@@ -1264,7 +1264,7 @@ static void rdtgroup_free_unassign_cntr(struct rdt_resource *r, struct rdt_mon_d
* the event structure @mevt from the domain @d and the group @rdtgrp. Unassign
* the counters from all the domains if @d is NULL else unassign from @d.
*/
-static void rdtgroup_unassign_cntr_event(struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
+static void rdtgroup_unassign_cntr_event(struct rdt_l3_mon_domain *d, struct rdtgroup *rdtgrp,
struct mon_evt *mevt)
{
struct rdt_resource *r = resctrl_arch_get_resource(mevt->rid);
@@ -1339,7 +1339,7 @@ static int resctrl_parse_mem_transactions(char *tok, u32 *val)
static void rdtgroup_update_cntr_event(struct rdt_resource *r, struct rdtgroup *rdtgrp,
enum resctrl_event_id evtid)
{
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
int cntr_id;
list_for_each_entry(d, &r->mon_domains, hdr.list) {
@@ -1445,7 +1445,7 @@ ssize_t resctrl_mbm_assign_mode_write(struct kernfs_open_file *of, char *buf,
size_t nbytes, loff_t off)
{
struct rdt_resource *r = rdt_kn_parent_priv(of->kn);
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
int ret = 0;
bool enable;
@@ -1518,7 +1518,7 @@ int resctrl_num_mbm_cntrs_show(struct kernfs_open_file *of,
struct seq_file *s, void *v)
{
struct rdt_resource *r = rdt_kn_parent_priv(of->kn);
- struct rdt_mon_domain *dom;
+ struct rdt_l3_mon_domain *dom;
bool sep = false;
cpus_read_lock();
@@ -1542,7 +1542,7 @@ int resctrl_available_mbm_cntrs_show(struct kernfs_open_file *of,
struct seq_file *s, void *v)
{
struct rdt_resource *r = rdt_kn_parent_priv(of->kn);
- struct rdt_mon_domain *dom;
+ struct rdt_l3_mon_domain *dom;
bool sep = false;
u32 cntrs, i;
int ret = 0;
@@ -1583,7 +1583,7 @@ int resctrl_available_mbm_cntrs_show(struct kernfs_open_file *of,
int mbm_L3_assignments_show(struct kernfs_open_file *of, struct seq_file *s, void *v)
{
struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
struct rdtgroup *rdtgrp;
struct mon_evt *mevt;
int ret = 0;
@@ -1646,7 +1646,7 @@ static struct mon_evt *mbm_get_mon_event_by_name(struct rdt_resource *r, char *n
return NULL;
}
-static int rdtgroup_modify_assign_state(char *assign, struct rdt_mon_domain *d,
+static int rdtgroup_modify_assign_state(char *assign, struct rdt_l3_mon_domain *d,
struct rdtgroup *rdtgrp, struct mon_evt *mevt)
{
int ret = 0;
@@ -1672,7 +1672,7 @@ static int rdtgroup_modify_assign_state(char *assign, struct rdt_mon_domain *d,
static int resctrl_parse_mbm_assignment(struct rdt_resource *r, struct rdtgroup *rdtgrp,
char *event, char *tok)
{
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
unsigned long dom_id = 0;
char *dom_str, *id_str;
struct mon_evt *mevt;
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index f5a65c48bcab..ea30a9d1ea9b 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -1618,7 +1618,7 @@ static void mondata_config_read(struct resctrl_mon_config_info *mon_info)
static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
{
struct resctrl_mon_config_info mon_info;
- struct rdt_mon_domain *dom;
+ struct rdt_l3_mon_domain *dom;
bool sep = false;
cpus_read_lock();
@@ -1666,7 +1666,7 @@ static int mbm_local_bytes_config_show(struct kernfs_open_file *of,
}
static void mbm_config_write_domain(struct rdt_resource *r,
- struct rdt_mon_domain *d, u32 evtid, u32 val)
+ struct rdt_l3_mon_domain *d, u32 evtid, u32 val)
{
struct resctrl_mon_config_info mon_info = {0};
@@ -1707,8 +1707,8 @@ static void mbm_config_write_domain(struct rdt_resource *r,
static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
{
char *dom_str = NULL, *id_str;
+ struct rdt_l3_mon_domain *d;
unsigned long dom_id, val;
- struct rdt_mon_domain *d;
/* Walking r->domains, ensure it can't race with cpuhp */
lockdep_assert_cpus_held();
@@ -2716,7 +2716,7 @@ static int rdt_get_tree(struct fs_context *fc)
{
struct rdt_fs_context *ctx = rdt_fc2context(fc);
unsigned long flags = RFTYPE_CTRL_BASE;
- struct rdt_mon_domain *dom;
+ struct rdt_l3_mon_domain *dom;
struct rdt_resource *r;
int ret;
@@ -3167,7 +3167,7 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
struct rdt_domain_hdr *hdr)
{
struct rdtgroup *prgrp, *crgrp;
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
char subname[32];
bool snc_mode;
char name[32];
@@ -3175,7 +3175,7 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
return;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
+ d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
snc_mode = r->mon_scope == RESCTRL_L3_NODE;
sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci_id : d->hdr.id);
if (snc_mode)
@@ -3193,8 +3193,8 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_domain_hdr *hdr,
struct rdt_resource *r, struct rdtgroup *prgrp,
bool do_sum)
{
+ struct rdt_l3_mon_domain *d;
struct rmid_read rr = {0};
- struct rdt_mon_domain *d;
struct mon_data *priv;
struct mon_evt *mevt;
int ret, domid;
@@ -3202,7 +3202,7 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_domain_hdr *hdr,
if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
return -EINVAL;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
+ d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
for_each_mon_event(mevt) {
if (mevt->rid != r->rid || !mevt->enabled)
continue;
@@ -3227,7 +3227,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
struct rdt_resource *r, struct rdtgroup *prgrp)
{
struct kernfs_node *kn, *ckn;
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
char name[32];
bool snc_mode;
int ret = 0;
@@ -3237,7 +3237,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
return -EINVAL;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
+ d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
snc_mode = r->mon_scope == RESCTRL_L3_NODE;
sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci_id : d->hdr.id);
kn = kernfs_find_and_get(parent_kn, name);
@@ -4180,7 +4180,7 @@ static void rdtgroup_setup_default(void)
mutex_unlock(&rdtgroup_mutex);
}
-static void domain_destroy_mon_state(struct rdt_mon_domain *d)
+static void domain_destroy_mon_state(struct rdt_l3_mon_domain *d)
{
int idx;
@@ -4204,14 +4204,14 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain
void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
{
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
mutex_lock(&rdtgroup_mutex);
if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
goto out_unlock;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
+ d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
/*
* If resctrl is mounted, remove all the
@@ -4253,7 +4253,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
*
* Returns 0 for success, or -ENOMEM.
*/
-static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
+static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_l3_mon_domain *d)
{
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
size_t tsize = sizeof(*d->mbm_states[0]);
@@ -4311,7 +4311,7 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d
int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr)
{
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
int err = -EINVAL;
mutex_lock(&rdtgroup_mutex);
@@ -4319,7 +4319,7 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr
if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
goto out_unlock;
- d = container_of(hdr, struct rdt_mon_domain, hdr);
+ d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
err = domain_setup_mon_state(r, d);
if (err)
goto out_unlock;
@@ -4366,10 +4366,10 @@ static void clear_childcpus(struct rdtgroup *r, unsigned int cpu)
}
}
-static struct rdt_mon_domain *get_mon_domain_from_cpu(int cpu,
- struct rdt_resource *r)
+static struct rdt_l3_mon_domain *get_mon_domain_from_cpu(int cpu,
+ struct rdt_resource *r)
{
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
lockdep_assert_cpus_held();
@@ -4385,7 +4385,7 @@ static struct rdt_mon_domain *get_mon_domain_from_cpu(int cpu,
void resctrl_offline_cpu(unsigned int cpu)
{
struct rdt_resource *l3 = resctrl_arch_get_resource(RDT_RESOURCE_L3);
- struct rdt_mon_domain *d;
+ struct rdt_l3_mon_domain *d;
struct rdtgroup *rdtgrp;
mutex_lock(&rdtgroup_mutex);
--
2.51.0
* Re: [PATCH v13 08/32] x86,fs/resctrl: Rename struct rdt_mon_domain and rdt_hw_mon_domain
2025-10-29 16:20 ` [PATCH v13 08/32] x86,fs/resctrl: Rename struct rdt_mon_domain and rdt_hw_mon_domain Tony Luck
@ 2025-11-13 4:01 ` Reinette Chatre
0 siblings, 0 replies; 85+ messages in thread
From: Reinette Chatre @ 2025-11-13 4:01 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 10/29/25 9:20 AM, Tony Luck wrote:
> The upcoming telemetry event monitoring are not tied to the L3
"are not tied" -> "is not tied"?
> resource and will have a new domain structures.
"a new domain structures" -> "new domain structures"?
Could you please review all changelogs, even those that already have
RB tag, to make full use of line length? This is required by tip and having
patches formatted correctly will reduce fixups needed later.
(This repeats
https://lore.kernel.org/lkml/22ee0370-2cdd-435e-a7d4-81dc0c3df547@intel.com/)
>
> Rename the L3 resource specific domain data structures to include
> "l3_" in their names to avoid confusion between the different
> resource specific domain structures:
> rdt_mon_domain -> rdt_l3_mon_domain
> rdt_hw_mon_domain -> rdt_hw_l3_mon_domain
>
> No functional change.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
...
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 1a33d5e6ae23..a07542957e5a 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -178,7 +178,7 @@ struct mbm_cntr_cfg {
> };
>
> /**
> - * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
> + * struct rdt_l3_mon_domain - group of CPUs sharing a resctrl monitor resource
nit: The "a resctrl monitor" in the description can now be updated to reflect that
it is now L3 resource specific. For example, "group of CPUs sharing RDT_RESOURCE_L3
monitoring"
> * @hdr: common header for different domain types
> * @ci_id: cache info id for this domain
> * @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
> @@ -192,7 +192,7 @@ struct mbm_cntr_cfg {
> * @cntr_cfg: array of assignable counters' configuration (indexed
> * by counter ID)
> */
> -struct rdt_mon_domain {
> +struct rdt_l3_mon_domain {
> struct rdt_domain_hdr hdr;
> unsigned int ci_id;
> unsigned long *rmid_busy_llc;
...
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 9f4c2f0aaf5c..6eca3d522fcc 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -60,17 +60,17 @@ struct rdt_hw_ctrl_domain {
> };
>
> /**
> - * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
> - * a resource for a monitor function
> - * @d_resctrl: Properties exposed to the resctrl file system
> + * struct rdt_hw_l3_mon_domain - Arch private attributes of a set of CPUs that share
> + * a resource for a monitor function
Similar here. This is no longer just "a resource" but can only be an L3 resource.
Reinette
* [PATCH v13 09/32] x86,fs/resctrl: Rename some L3 specific functions
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
` (7 preceding siblings ...)
2025-10-29 16:20 ` [PATCH v13 08/32] x86,fs/resctrl: Rename struct rdt_mon_domain and rdt_hw_mon_domain Tony Luck
@ 2025-10-29 16:20 ` Tony Luck
2025-11-13 4:01 ` Reinette Chatre
2025-10-29 16:20 ` [PATCH v13 10/32] fs/resctrl: Make event details accessible to functions when reading events Tony Luck
` (24 subsequent siblings)
33 siblings, 1 reply; 85+ messages in thread
From: Tony Luck @ 2025-10-29 16:20 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
With the arrival of monitor events tied to new domains associated with
a different resource it would be clearer if the L3 resource specific
functions are more accurately named.
Rename three groups of functions:
Functions that allocate/free architecture per-RMID MBM state information:
arch_domain_mbm_alloc() -> l3_mon_domain_mbm_alloc()
mon_domain_free() -> l3_mon_domain_free()
Functions that allocate/free filesystem per-RMID MBM state information:
domain_setup_mon_state() -> domain_setup_l3_mon_state()
domain_destroy_mon_state() -> domain_destroy_l3_mon_state()
Initialization/exit:
rdt_get_mon_l3_config() -> rdt_get_l3_mon_config()
resctrl_mon_resource_init() -> resctrl_l3_mon_resource_init()
resctrl_mon_resource_exit() -> resctrl_l3_mon_resource_exit()
Ensure kernel-doc descriptions of these functions' return values are
present and correctly formatted.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/internal.h | 2 +-
fs/resctrl/internal.h | 6 +++---
arch/x86/kernel/cpu/resctrl/core.c | 20 +++++++++++---------
arch/x86/kernel/cpu/resctrl/monitor.c | 2 +-
fs/resctrl/monitor.c | 8 ++++----
fs/resctrl/rdtgroup.c | 24 ++++++++++++------------
6 files changed, 32 insertions(+), 30 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 6eca3d522fcc..14fadcff0d2b 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -208,7 +208,7 @@ union l3_qos_abmc_cfg {
void rdt_ctrl_update(void *arg);
-int rdt_get_mon_l3_config(struct rdt_resource *r);
+int rdt_get_l3_mon_config(struct rdt_resource *r);
bool rdt_cpu_has(int flag);
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index d9e291d94926..88b4489b68e1 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -357,7 +357,9 @@ int alloc_rmid(u32 closid);
void free_rmid(u32 closid, u32 rmid);
-void resctrl_mon_resource_exit(void);
+int resctrl_l3_mon_resource_init(void);
+
+void resctrl_l3_mon_resource_exit(void);
void mon_event_count(void *info);
@@ -367,8 +369,6 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
struct rdt_domain_hdr *hdr, struct rdtgroup *rdtgrp,
cpumask_t *cpumask, int evtid, int first);
-int resctrl_mon_resource_init(void);
-
void mbm_setup_overflow_handler(struct rdt_l3_mon_domain *dom,
unsigned long delay_ms,
int exclude_cpu);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 8137c1442139..99d1048d29d1 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -363,7 +363,7 @@ static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
kfree(hw_dom);
}
-static void mon_domain_free(struct rdt_hw_l3_mon_domain *hw_dom)
+static void l3_mon_domain_free(struct rdt_hw_l3_mon_domain *hw_dom)
{
int idx;
@@ -396,11 +396,13 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *
}
/**
- * arch_domain_mbm_alloc() - Allocate arch private storage for the MBM counters
+ * l3_mon_domain_mbm_alloc() - Allocate arch private storage for the MBM counters
* @num_rmid: The size of the MBM counter array
* @hw_dom: The domain that owns the allocated arrays
+ *
+ * Return: 0 for success, or -ENOMEM.
*/
-static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_l3_mon_domain *hw_dom)
+static int l3_mon_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_l3_mon_domain *hw_dom)
{
size_t tsize = sizeof(*hw_dom->arch_mbm_states[0]);
enum resctrl_event_id eventid;
@@ -514,7 +516,7 @@ static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct
ci = get_cpu_cacheinfo_level(cpu, RESCTRL_L3_CACHE);
if (!ci) {
pr_warn_once("Can't find L3 cache for CPU:%d resource %s\n", cpu, r->name);
- mon_domain_free(hw_dom);
+ l3_mon_domain_free(hw_dom);
return;
}
d->ci_id = ci->id;
@@ -522,8 +524,8 @@ static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct
arch_mon_domain_online(r, d);
- if (arch_domain_mbm_alloc(r->mon.num_rmid, hw_dom)) {
- mon_domain_free(hw_dom);
+ if (l3_mon_domain_mbm_alloc(r->mon.num_rmid, hw_dom)) {
+ l3_mon_domain_free(hw_dom);
return;
}
@@ -533,7 +535,7 @@ static void l3_mon_domain_setup(int cpu, int id, struct rdt_resource *r, struct
if (err) {
list_del_rcu(&d->hdr.list);
synchronize_rcu();
- mon_domain_free(hw_dom);
+ l3_mon_domain_free(hw_dom);
}
}
@@ -659,7 +661,7 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
resctrl_offline_mon_domain(r, hdr);
list_del_rcu(&hdr->list);
synchronize_rcu();
- mon_domain_free(hw_dom);
+ l3_mon_domain_free(hw_dom);
break;
}
default:
@@ -908,7 +910,7 @@ static __init bool get_rdt_mon_resources(void)
if (!ret)
return false;
- return !rdt_get_mon_l3_config(r);
+ return !rdt_get_l3_mon_config(r);
}
static __init void __check_quirks_intel(void)
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 8b293fc4e946..2d1453c905bc 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -423,7 +423,7 @@ static __init int snc_get_config(void)
return ret;
}
-int __init rdt_get_mon_l3_config(struct rdt_resource *r)
+int __init rdt_get_l3_mon_config(struct rdt_resource *r)
{
unsigned int mbm_offset = boot_cpu_data.x86_cache_mbm_width_offset;
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index dd295f31ec49..84e3d3ce66cf 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -1768,7 +1768,7 @@ ssize_t mbm_L3_assignments_write(struct kernfs_open_file *of, char *buf,
}
/**
- * resctrl_mon_resource_init() - Initialise global monitoring structures.
+ * resctrl_l3_mon_resource_init() - Initialise global monitoring structures.
*
* Allocate and initialise global monitor resources that do not belong to a
* specific domain. i.e. the rmid_ptrs[] used for the limbo and free lists.
@@ -1777,9 +1777,9 @@ ssize_t mbm_L3_assignments_write(struct kernfs_open_file *of, char *buf,
* Resctrl's cpuhp callbacks may be called before this point to bring a domain
* online.
*
- * Returns 0 for success, or -ENOMEM.
+ * Return: 0 for success, or -ENOMEM.
*/
-int resctrl_mon_resource_init(void)
+int resctrl_l3_mon_resource_init(void)
{
struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
int ret;
@@ -1829,7 +1829,7 @@ int resctrl_mon_resource_init(void)
return 0;
}
-void resctrl_mon_resource_exit(void)
+void resctrl_l3_mon_resource_exit(void)
{
struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index ea30a9d1ea9b..f57775c40d14 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -4180,7 +4180,7 @@ static void rdtgroup_setup_default(void)
mutex_unlock(&rdtgroup_mutex);
}
-static void domain_destroy_mon_state(struct rdt_l3_mon_domain *d)
+static void domain_destroy_l3_mon_state(struct rdt_l3_mon_domain *d)
{
int idx;
@@ -4235,13 +4235,13 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
cancel_delayed_work(&d->cqm_limbo);
}
- domain_destroy_mon_state(d);
+ domain_destroy_l3_mon_state(d);
out_unlock:
mutex_unlock(&rdtgroup_mutex);
}
/**
- * domain_setup_mon_state() - Initialise domain monitoring structures.
+ * domain_setup_l3_mon_state() - Initialise domain monitoring structures.
* @r: The resource for the newly online domain.
* @d: The newly online domain.
*
@@ -4249,11 +4249,11 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
* Called when the first CPU of a domain comes online, regardless of whether
* the filesystem is mounted.
* During boot this may be called before global allocations have been made by
- * resctrl_mon_resource_init().
+ * resctrl_l3_mon_resource_init().
*
- * Returns 0 for success, or -ENOMEM.
+ * Return: 0 for success, or -ENOMEM.
*/
-static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_l3_mon_domain *d)
+static int domain_setup_l3_mon_state(struct rdt_resource *r, struct rdt_l3_mon_domain *d)
{
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
size_t tsize = sizeof(*d->mbm_states[0]);
@@ -4320,7 +4320,7 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr
goto out_unlock;
d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
- err = domain_setup_mon_state(r, d);
+ err = domain_setup_l3_mon_state(r, d);
if (err)
goto out_unlock;
@@ -4435,13 +4435,13 @@ int resctrl_init(void)
thread_throttle_mode_init();
- ret = resctrl_mon_resource_init();
+ ret = resctrl_l3_mon_resource_init();
if (ret)
return ret;
ret = sysfs_create_mount_point(fs_kobj, "resctrl");
if (ret) {
- resctrl_mon_resource_exit();
+ resctrl_l3_mon_resource_exit();
return ret;
}
@@ -4476,7 +4476,7 @@ int resctrl_init(void)
cleanup_mountpoint:
sysfs_remove_mount_point(fs_kobj, "resctrl");
- resctrl_mon_resource_exit();
+ resctrl_l3_mon_resource_exit();
return ret;
}
@@ -4512,7 +4512,7 @@ static bool resctrl_online_domains_exist(void)
* When called by the architecture code, all CPUs and resctrl domains must be
* offline. This ensures the limbo and overflow handlers are not scheduled to
* run, meaning the data structures they access can be freed by
- * resctrl_mon_resource_exit().
+ * resctrl_l3_mon_resource_exit().
*
* After resctrl_exit() returns, the architecture code should return an
* error from all resctrl_arch_ functions that can do this.
@@ -4539,5 +4539,5 @@ void resctrl_exit(void)
* it can be used to umount resctrl.
*/
- resctrl_mon_resource_exit();
+ resctrl_l3_mon_resource_exit();
}
--
2.51.0
* Re: [PATCH v13 09/32] x86,fs/resctrl: Rename some L3 specific functions
2025-10-29 16:20 ` [PATCH v13 09/32] x86,fs/resctrl: Rename some L3 specific functions Tony Luck
@ 2025-11-13 4:01 ` Reinette Chatre
0 siblings, 0 replies; 85+ messages in thread
From: Reinette Chatre @ 2025-11-13 4:01 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 10/29/25 9:20 AM, Tony Luck wrote:
> With the arrival of monitor events tied to new domains associated with
> a different resource it would be clearer if the L3 resource specific
> functions are more accurately named.
>
> Rename three groups of functions:
>
> Functions that allocate/free architecture per-RMID MBM state information:
> arch_domain_mbm_alloc() -> l3_mon_domain_mbm_alloc()
> mon_domain_free() -> l3_mon_domain_free()
>
> Functions that allocate/free filesystem per-RMID MBM state information:
> domain_setup_mon_state() -> domain_setup_l3_mon_state()
> domain_destroy_mon_state() -> domain_destroy_l3_mon_state()
>
> Initialization/exit:
> rdt_get_mon_l3_config() -> rdt_get_l3_mon_config()
> resctrl_mon_resource_init() -> resctrl_l3_mon_resource_init()
> resctrl_mon_resource_exit() -> resctrl_l3_mon_resource_exit()
>
> Ensure kernel-doc descriptions of these functions' return values are
> present and correctly formatted.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
With changelog line lengths fixed:
| Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reinette
* [PATCH v13 10/32] fs/resctrl: Make event details accessible to functions when reading events
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
` (8 preceding siblings ...)
2025-10-29 16:20 ` [PATCH v13 09/32] x86,fs/resctrl: Rename some L3 specific functions Tony Luck
@ 2025-10-29 16:20 ` Tony Luck
2025-10-29 16:20 ` [PATCH v13 11/32] x86,fs/resctrl: Handle events that can be read from any CPU Tony Luck
` (23 subsequent siblings)
33 siblings, 0 replies; 85+ messages in thread
From: Tony Luck @ 2025-10-29 16:20 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Reading monitoring event data from MMIO requires more context than the event id in
order to read the correct memory location. struct mon_evt is the appropriate place
for this event-specific context.
Prepare for addition of extra fields to struct mon_evt by changing the calling
conventions to pass a pointer to the mon_evt structure instead of just the
event id.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
fs/resctrl/internal.h | 10 +++++-----
fs/resctrl/ctrlmondata.c | 18 +++++++++---------
fs/resctrl/monitor.c | 24 ++++++++++++------------
fs/resctrl/rdtgroup.c | 6 +++---
4 files changed, 29 insertions(+), 29 deletions(-)
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 88b4489b68e1..12a2ab7e3c9b 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -81,7 +81,7 @@ extern struct mon_evt mon_event_all[QOS_NUM_EVENTS];
* struct mon_data - Monitoring details for each event file.
* @list: Member of the global @mon_data_kn_priv_list list.
* @rid: Resource id associated with the event file.
- * @evtid: Event id associated with the event file.
+ * @evt: Event structure associated with the event file.
* @sum: Set when event must be summed across multiple
* domains.
* @domid: When @sum is zero this is the domain to which
@@ -95,7 +95,7 @@ extern struct mon_evt mon_event_all[QOS_NUM_EVENTS];
struct mon_data {
struct list_head list;
enum resctrl_res_level rid;
- enum resctrl_event_id evtid;
+ struct mon_evt *evt;
int domid;
bool sum;
};
@@ -108,7 +108,7 @@ struct mon_data {
* @r: Resource describing the properties of the event being read.
* @hdr: Header of domain that the counter should be read from. If NULL then
* sum all domains in @r sharing L3 @ci.id
- * @evtid: Which monitor event to read.
+ * @evt: Which monitor event to read.
* @first: Initialize MBM counter when true.
* @ci: Cacheinfo for L3. Only set when @hdr is NULL. Used when summing
* domains.
@@ -126,7 +126,7 @@ struct rmid_read {
struct rdtgroup *rgrp;
struct rdt_resource *r;
struct rdt_domain_hdr *hdr;
- enum resctrl_event_id evtid;
+ struct mon_evt *evt;
bool first;
struct cacheinfo *ci;
bool is_mbm_cntr;
@@ -367,7 +367,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg);
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
struct rdt_domain_hdr *hdr, struct rdtgroup *rdtgrp,
- cpumask_t *cpumask, int evtid, int first);
+ cpumask_t *cpumask, struct mon_evt *evt, int first);
void mbm_setup_overflow_handler(struct rdt_l3_mon_domain *dom,
unsigned long delay_ms,
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index b74c69f2d54e..c3656812848b 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -548,7 +548,7 @@ struct rdt_domain_hdr *resctrl_find_domain(struct list_head *h, int id,
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
struct rdt_domain_hdr *hdr, struct rdtgroup *rdtgrp,
- cpumask_t *cpumask, int evtid, int first)
+ cpumask_t *cpumask, struct mon_evt *evt, int first)
{
int cpu;
@@ -559,15 +559,15 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
* Setup the parameters to pass to mon_event_count() to read the data.
*/
rr->rgrp = rdtgrp;
- rr->evtid = evtid;
+ rr->evt = evt;
rr->r = r;
rr->hdr = hdr;
rr->first = first;
if (resctrl_arch_mbm_cntr_assign_enabled(r) &&
- resctrl_is_mbm_event(evtid)) {
+ resctrl_is_mbm_event(evt->evtid)) {
rr->is_mbm_cntr = true;
} else {
- rr->arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, evtid);
+ rr->arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, evt->evtid);
if (IS_ERR(rr->arch_mon_ctx)) {
rr->err = -EINVAL;
return;
@@ -588,14 +588,13 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
if (rr->arch_mon_ctx)
- resctrl_arch_mon_ctx_free(r, evtid, rr->arch_mon_ctx);
+ resctrl_arch_mon_ctx_free(r, evt->evtid, rr->arch_mon_ctx);
}
int rdtgroup_mondata_show(struct seq_file *m, void *arg)
{
struct kernfs_open_file *of = m->private;
enum resctrl_res_level resid;
- enum resctrl_event_id evtid;
struct rdt_l3_mon_domain *d;
struct rdt_domain_hdr *hdr;
struct rmid_read rr = {0};
@@ -603,6 +602,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
int domid, cpu, ret = 0;
struct rdt_resource *r;
struct cacheinfo *ci;
+ struct mon_evt *evt;
struct mon_data *md;
rdtgrp = rdtgroup_kn_lock_live(of->kn);
@@ -619,7 +619,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
resid = md->rid;
domid = md->domid;
- evtid = md->evtid;
+ evt = md->evt;
r = resctrl_arch_get_resource(resid);
if (md->sum) {
@@ -637,7 +637,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
continue;
rr.ci = ci;
mon_event_read(&rr, r, NULL, rdtgrp,
- &ci->shared_cpu_map, evtid, false);
+ &ci->shared_cpu_map, evt, false);
goto checkresult;
}
}
@@ -653,7 +653,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
ret = -ENOENT;
goto out;
}
- mon_event_read(&rr, r, hdr, rdtgrp, &hdr->cpu_mask, evtid, false);
+ mon_event_read(&rr, r, hdr, rdtgrp, &hdr->cpu_mask, evt, false);
}
checkresult:
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 84e3d3ce66cf..5cf928e10eaf 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -425,7 +425,7 @@ static int __l3_mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr,
u64 tval = 0;
if (rr->is_mbm_cntr) {
- cntr_id = mbm_cntr_get(rr->r, d, rdtgrp, rr->evtid);
+ cntr_id = mbm_cntr_get(rr->r, d, rdtgrp, rr->evt->evtid);
if (cntr_id < 0) {
rr->err = -ENOENT;
return -EINVAL;
@@ -434,10 +434,10 @@ static int __l3_mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr,
if (rr->first) {
if (rr->is_mbm_cntr)
- resctrl_arch_reset_cntr(rr->r, d, closid, rmid, cntr_id, rr->evtid);
+ resctrl_arch_reset_cntr(rr->r, d, closid, rmid, cntr_id, rr->evt->evtid);
else
- resctrl_arch_reset_rmid(rr->r, d, closid, rmid, rr->evtid);
- m = get_mbm_state(d, closid, rmid, rr->evtid);
+ resctrl_arch_reset_rmid(rr->r, d, closid, rmid, rr->evt->evtid);
+ m = get_mbm_state(d, closid, rmid, rr->evt->evtid);
if (m)
memset(m, 0, sizeof(struct mbm_state));
return 0;
@@ -449,10 +449,10 @@ static int __l3_mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr,
return -EINVAL;
if (rr->is_mbm_cntr)
rr->err = resctrl_arch_cntr_read(rr->r, d, closid, rmid, cntr_id,
- rr->evtid, &tval);
+ rr->evt->evtid, &tval);
else
rr->err = resctrl_arch_rmid_read(rr->r, rr->hdr, closid, rmid,
- rr->evtid, &tval, rr->arch_mon_ctx);
+ rr->evt->evtid, &tval, rr->arch_mon_ctx);
if (rr->err)
return rr->err;
@@ -478,10 +478,10 @@ static int __l3_mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr,
continue;
if (rr->is_mbm_cntr)
err = resctrl_arch_cntr_read(rr->r, d, closid, rmid, cntr_id,
- rr->evtid, &tval);
+ rr->evt->evtid, &tval);
else
err = resctrl_arch_rmid_read(rr->r, &d->hdr, closid, rmid,
- rr->evtid, &tval, rr->arch_mon_ctx);
+ rr->evt->evtid, &tval, rr->arch_mon_ctx);
if (!err) {
rr->val += tval;
ret = 0;
@@ -539,7 +539,7 @@ static void mbm_bw_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
if (!domain_header_is_valid(rr->hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
return;
d = container_of(rr->hdr, struct rdt_l3_mon_domain, hdr);
- m = get_mbm_state(d, closid, rmid, rr->evtid);
+ m = get_mbm_state(d, closid, rmid, rr->evt->evtid);
if (WARN_ON_ONCE(!m))
return;
@@ -713,11 +713,11 @@ static void mbm_update_one_event(struct rdt_resource *r, struct rdt_l3_mon_domai
rr.r = r;
rr.hdr = &d->hdr;
- rr.evtid = evtid;
+ rr.evt = &mon_event_all[evtid];
if (resctrl_arch_mbm_cntr_assign_enabled(r)) {
rr.is_mbm_cntr = true;
} else {
- rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid);
+ rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, evtid);
if (IS_ERR(rr.arch_mon_ctx)) {
pr_warn_ratelimited("Failed to allocate monitor context: %ld",
PTR_ERR(rr.arch_mon_ctx));
@@ -735,7 +735,7 @@ static void mbm_update_one_event(struct rdt_resource *r, struct rdt_l3_mon_domai
mbm_bw_count(rdtgrp, &rr);
if (rr.arch_mon_ctx)
- resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx);
+ resctrl_arch_mon_ctx_free(rr.r, evtid, rr.arch_mon_ctx);
}
static void mbm_update(struct rdt_resource *r, struct rdt_l3_mon_domain *d,
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index f57775c40d14..e0eb766c5cf4 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -3038,7 +3038,7 @@ static struct mon_data *mon_get_kn_priv(enum resctrl_res_level rid, int domid,
list_for_each_entry(priv, &mon_data_kn_priv_list, list) {
if (priv->rid == rid && priv->domid == domid &&
- priv->sum == do_sum && priv->evtid == mevt->evtid)
+ priv->sum == do_sum && priv->evt == mevt)
return priv;
}
@@ -3049,7 +3049,7 @@ static struct mon_data *mon_get_kn_priv(enum resctrl_res_level rid, int domid,
priv->rid = rid;
priv->domid = domid;
priv->sum = do_sum;
- priv->evtid = mevt->evtid;
+ priv->evt = mevt;
list_add_tail(&priv->list, &mon_data_kn_priv_list);
return priv;
@@ -3216,7 +3216,7 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_domain_hdr *hdr,
return ret;
if (!do_sum && resctrl_is_mbm_event(mevt->evtid))
- mon_event_read(&rr, r, hdr, prgrp, &hdr->cpu_mask, mevt->evtid, true);
+ mon_event_read(&rr, r, hdr, prgrp, &hdr->cpu_mask, mevt, true);
}
return 0;
--
2.51.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH v13 11/32] x86,fs/resctrl: Handle events that can be read from any CPU
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
` (9 preceding siblings ...)
2025-10-29 16:20 ` [PATCH v13 10/32] fs/resctrl: Make event details accessible to functions when reading events Tony Luck
@ 2025-10-29 16:20 ` Tony Luck
2025-10-30 6:14 ` Chen, Yu C
2025-11-13 4:02 ` Reinette Chatre
2025-10-29 16:20 ` [PATCH v13 12/32] x86,fs/resctrl: Support binary fixed point event counters Tony Luck
` (22 subsequent siblings)
33 siblings, 2 replies; 85+ messages in thread
From: Tony Luck @ 2025-10-29 16:20 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
resctrl assumes that monitor events can only be read from a CPU in the
cpumask_t set of each domain. This is true for x86 events accessed
with an MSR interface, but may not be true for other access methods such
as MMIO.
Introduce and use flag mon_evt::any_cpu, settable by architecture, that
indicates there are no restrictions on which CPU can read that event.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 2 +-
fs/resctrl/internal.h | 2 ++
arch/x86/kernel/cpu/resctrl/core.c | 6 +++---
fs/resctrl/ctrlmondata.c | 6 ++++++
fs/resctrl/monitor.c | 3 ++-
5 files changed, 14 insertions(+), 5 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index a07542957e5a..702205505dc9 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -409,7 +409,7 @@ u32 resctrl_arch_get_num_closid(struct rdt_resource *r);
u32 resctrl_arch_system_num_rmid_idx(void);
int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
-void resctrl_enable_mon_event(enum resctrl_event_id eventid);
+void resctrl_enable_mon_event(enum resctrl_event_id eventid, bool any_cpu);
bool resctrl_is_mon_event_enabled(enum resctrl_event_id eventid);
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 12a2ab7e3c9b..40b76eaa33d0 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -61,6 +61,7 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
* READS_TO_REMOTE_MEM) being tracked by @evtid.
* Only valid if @evtid is an MBM event.
* @configurable: true if the event is configurable
+ * @any_cpu: true if the event can be read from any CPU
* @enabled: true if the event is enabled
*/
struct mon_evt {
@@ -69,6 +70,7 @@ struct mon_evt {
char *name;
u32 evt_cfg;
bool configurable;
+ bool any_cpu;
bool enabled;
};
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 99d1048d29d1..78ad493dcc01 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -893,15 +893,15 @@ static __init bool get_rdt_mon_resources(void)
bool ret = false;
if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC)) {
- resctrl_enable_mon_event(QOS_L3_OCCUP_EVENT_ID);
+ resctrl_enable_mon_event(QOS_L3_OCCUP_EVENT_ID, false);
ret = true;
}
if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
- resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID);
+ resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID, false);
ret = true;
}
if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL)) {
- resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID);
+ resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID, false);
ret = true;
}
if (rdt_cpu_has(X86_FEATURE_ABMC))
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index c3656812848b..883be6f0810f 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -574,6 +574,11 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
}
}
+ if (evt->any_cpu) {
+ mon_event_count(rr);
+ goto out_ctx_free;
+ }
+
cpu = cpumask_any_housekeeping(cpumask, RESCTRL_PICK_ANY_CPU);
/*
@@ -587,6 +592,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
else
smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
+out_ctx_free:
if (rr->arch_mon_ctx)
resctrl_arch_mon_ctx_free(r, evt->evtid, rr->arch_mon_ctx);
}
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 5cf928e10eaf..6eab98b47816 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -975,7 +975,7 @@ struct mon_evt mon_event_all[QOS_NUM_EVENTS] = {
},
};
-void resctrl_enable_mon_event(enum resctrl_event_id eventid)
+void resctrl_enable_mon_event(enum resctrl_event_id eventid, bool any_cpu)
{
if (WARN_ON_ONCE(eventid < QOS_FIRST_EVENT || eventid >= QOS_NUM_EVENTS))
return;
@@ -984,6 +984,7 @@ void resctrl_enable_mon_event(enum resctrl_event_id eventid)
return;
}
+ mon_event_all[eventid].any_cpu = any_cpu;
mon_event_all[eventid].enabled = true;
}
--
2.51.0
* Re: [PATCH v13 11/32] x86,fs/resctrl: Handle events that can be read from any CPU
2025-10-29 16:20 ` [PATCH v13 11/32] x86,fs/resctrl: Handle events that can be read from any CPU Tony Luck
@ 2025-10-30 6:14 ` Chen, Yu C
2025-10-30 15:54 ` Luck, Tony
2025-11-13 4:02 ` Reinette Chatre
1 sibling, 1 reply; 85+ messages in thread
From: Chen, Yu C @ 2025-10-30 6:14 UTC (permalink / raw)
To: Tony Luck
Cc: x86, linux-kernel, patches, Reinette Chatre, James Morse,
Fenghua Yu, Dave Martin, Peter Newman, Babu Moger, Drew Fustini,
Maciej Wieczor-Retman
Hi Tony,
On 10/30/2025 12:20 AM, Tony Luck wrote:
> resctrl assumes that monitor events can only be read from a CPU in the
> cpumask_t set of each domain. This is true for x86 events accessed
> with an MSR interface, but may not be true for other access methods such
> as MMIO.
>
> Introduce and use flag mon_evt::any_cpu, settable by architecture, that
> indicates there are no restrictions on which CPU can read that event.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
[snip]
> -void resctrl_enable_mon_event(enum resctrl_event_id eventid)
> +void resctrl_enable_mon_event(enum resctrl_event_id eventid, bool any_cpu)
> {
> if (WARN_ON_ONCE(eventid < QOS_FIRST_EVENT || eventid >= QOS_NUM_EVENTS))
> return;
> @@ -984,6 +984,7 @@ void resctrl_enable_mon_event(enum resctrl_event_id eventid)
> return;
> }
>
> + mon_event_all[eventid].any_cpu = any_cpu;
> mon_event_all[eventid].enabled = true;
> }
>
It seems that cpu_on_correct_domain() was dropped because the
refactor of __mon_event_count() in patch 0006 means it is no
longer needed. But we still invoke smp_processor_id() in a preemptible
context in __l3_mon_event_count() before further checks, which
causes a warning.
[ 4266.361951] BUG: using smp_processor_id() in preemptible [00000000]
code: grep/1603
[ 4266.363231] caller is __l3_mon_event_count+0x30/0x2a0
[ 4266.364250] Call Trace:
[ 4266.364262] <TASK>
[ 4266.364273] dump_stack_lvl+0x53/0x70
[ 4266.364289] check_preemption_disabled+0xca/0xe0
[ 4266.364303] __l3_mon_event_count+0x30/0x2a0
[ 4266.364320] mon_event_count+0x22/0x90
[ 4266.364334] rdtgroup_mondata_show+0x108/0x390
[ 4266.364353] seq_read_iter+0x10d/0x450
[ 4266.364368] vfs_read+0x215/0x330
[ 4266.364386] ksys_read+0x6b/0xe0
[ 4266.364401] do_syscall_64+0x57/0xd70
thanks,
Chenyu
* Re: [PATCH v13 11/32] x86,fs/resctrl: Handle events that can be read from any CPU
2025-10-30 6:14 ` Chen, Yu C
@ 2025-10-30 15:54 ` Luck, Tony
2025-10-30 16:18 ` Chen, Yu C
0 siblings, 1 reply; 85+ messages in thread
From: Luck, Tony @ 2025-10-30 15:54 UTC (permalink / raw)
To: Chen, Yu C
Cc: x86, linux-kernel, patches, Reinette Chatre, James Morse,
Fenghua Yu, Dave Martin, Peter Newman, Babu Moger, Drew Fustini,
Maciej Wieczor-Retman
On Thu, Oct 30, 2025 at 02:14:27PM +0800, Chen, Yu C wrote:
> Hi Tony,
>
> On 10/30/2025 12:20 AM, Tony Luck wrote:
> > resctrl assumes that monitor events can only be read from a CPU in the
> > cpumask_t set of each domain. This is true for x86 events accessed
> > with an MSR interface, but may not be true for other access methods such
> > as MMIO.
> >
> > Introduce and use flag mon_evt::any_cpu, settable by architecture, that
> > indicates there are no restrictions on which CPU can read that event.
> >
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
>
> [snip]
>
> > -void resctrl_enable_mon_event(enum resctrl_event_id eventid)
> > +void resctrl_enable_mon_event(enum resctrl_event_id eventid, bool any_cpu)
> > {
> > if (WARN_ON_ONCE(eventid < QOS_FIRST_EVENT || eventid >= QOS_NUM_EVENTS))
> > return;
> > @@ -984,6 +984,7 @@ void resctrl_enable_mon_event(enum resctrl_event_id eventid)
> > return;
> > }
> > + mon_event_all[eventid].any_cpu = any_cpu;
> > mon_event_all[eventid].enabled = true;
> > }
>
> It seems that cpu_on_correct_domain() was dropped because the
> refactor of __mon_event_count() in patch 0006 means it is no
> longer needed. But we still invoke smp_processor_id() in a preemptible
> context in __l3_mon_event_count() before further checks, which
> causes a warning.
> [ 4266.361951] BUG: using smp_processor_id() in preemptible [00000000] code:
> grep/1603
> [ 4266.363231] caller is __l3_mon_event_count+0x30/0x2a0
> [ 4266.364250] Call Trace:
> [ 4266.364262] <TASK>
> [ 4266.364273] dump_stack_lvl+0x53/0x70
> [ 4266.364289] check_preemption_disabled+0xca/0xe0
> [ 4266.364303] __l3_mon_event_count+0x30/0x2a0
> [ 4266.364320] mon_event_count+0x22/0x90
> [ 4266.364334] rdtgroup_mondata_show+0x108/0x390
> [ 4266.364353] seq_read_iter+0x10d/0x450
> [ 4266.364368] vfs_read+0x215/0x330
> [ 4266.364386] ksys_read+0x6b/0xe0
> [ 4266.364401] do_syscall_64+0x57/0xd70
I didn't notice this in my testing. Is this in your region aware
tree? If you are still using RDT_RESOURCE_L3 then I can see how
you got this call trace.
Maybe you need to dig cpu_on_correct_domain() back up and apply
it to __l3_mon_event_count()?
-Tony
* Re: [PATCH v13 11/32] x86,fs/resctrl: Handle events that can be read from any CPU
2025-10-30 15:54 ` Luck, Tony
@ 2025-10-30 16:18 ` Chen, Yu C
0 siblings, 0 replies; 85+ messages in thread
From: Chen, Yu C @ 2025-10-30 16:18 UTC (permalink / raw)
To: Luck, Tony
Cc: x86, linux-kernel, patches, Reinette Chatre, James Morse,
Fenghua Yu, Dave Martin, Peter Newman, Babu Moger, Drew Fustini,
Maciej Wieczor-Retman
On 10/30/2025 11:54 PM, Luck, Tony wrote:
> On Thu, Oct 30, 2025 at 02:14:27PM +0800, Chen, Yu C wrote:
>> Hi Tony,
>>
>> On 10/30/2025 12:20 AM, Tony Luck wrote:
>>> resctrl assumes that monitor events can only be read from a CPU in the
>>> cpumask_t set of each domain. This is true for x86 events accessed
>>> with an MSR interface, but may not be true for other access methods such
>>> as MMIO.
>>>
>>> Introduce and use flag mon_evt::any_cpu, settable by architecture, that
>>> indicates there are no restrictions on which CPU can read that event.
>>>
>>> Signed-off-by: Tony Luck <tony.luck@intel.com>
>>
>> [snip]
>>
>>> -void resctrl_enable_mon_event(enum resctrl_event_id eventid)
>>> +void resctrl_enable_mon_event(enum resctrl_event_id eventid, bool any_cpu)
>>> {
>>> if (WARN_ON_ONCE(eventid < QOS_FIRST_EVENT || eventid >= QOS_NUM_EVENTS))
>>> return;
>>> @@ -984,6 +984,7 @@ void resctrl_enable_mon_event(enum resctrl_event_id eventid)
>>> return;
>>> }
>>> + mon_event_all[eventid].any_cpu = any_cpu;
>>> mon_event_all[eventid].enabled = true;
>>> }
>>
>> It seems that cpu_on_correct_domain() was dropped because the
>> refactor of __mon_event_count() in patch 0006 means it is no
>> longer needed. But we still invoke smp_processor_id() in a preemptible
>> context in __l3_mon_event_count() before further checks, which
>> causes a warning.
>> [ 4266.361951] BUG: using smp_processor_id() in preemptible [00000000] code:
>> grep/1603
>> [ 4266.363231] caller is __l3_mon_event_count+0x30/0x2a0
>> [ 4266.364250] Call Trace:
>> [ 4266.364262] <TASK>
>> [ 4266.364273] dump_stack_lvl+0x53/0x70
>> [ 4266.364289] check_preemption_disabled+0xca/0xe0
>> [ 4266.364303] __l3_mon_event_count+0x30/0x2a0
>> [ 4266.364320] mon_event_count+0x22/0x90
>> [ 4266.364334] rdtgroup_mondata_show+0x108/0x390
>> [ 4266.364353] seq_read_iter+0x10d/0x450
>> [ 4266.364368] vfs_read+0x215/0x330
>> [ 4266.364386] ksys_read+0x6b/0xe0
>> [ 4266.364401] do_syscall_64+0x57/0xd70
>
> I didn't notice this in my testing. Is this in your region aware
> tree? If you are still using RDT_RESOURCE_L3 then I can see how
> you got this call trace.
>
Yes, it was tested on the region aware tree.
> Maybe you need to dig cpu_on_correct_domain() back up and apply
> it to __l3_mon_event_count()?
>
Got it, will do.
Thanks,
Chenyu
* Re: [PATCH v13 11/32] x86,fs/resctrl: Handle events that can be read from any CPU
2025-10-29 16:20 ` [PATCH v13 11/32] x86,fs/resctrl: Handle events that can be read from any CPU Tony Luck
2025-10-30 6:14 ` Chen, Yu C
@ 2025-11-13 4:02 ` Reinette Chatre
1 sibling, 0 replies; 85+ messages in thread
From: Reinette Chatre @ 2025-11-13 4:02 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 10/29/25 9:20 AM, Tony Luck wrote:
> resctrl assumes that monitor events can only be read from a CPU in the
> cpumask_t set of each domain. This is true for x86 events accessed
> with an MSR interface, but may not be true for other access methods such
> as MMIO.
>
> Introduce and use flag mon_evt::any_cpu, settable by architecture, that
> indicates there are no restrictions on which CPU can read that event.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
This implies a generic change while __l3_mon_event_count() cannot support it.
While I understand why cpu_on_correct_domain() was removed, the existing check
requires DEBUG_PREEMPT to be set, so I think it would be helpful to have something
like a WARN_ON_ONCE(rr->evt->any_cpu) before calling __l3_mon_event_count() to help
catch issues.
Reinette
* [PATCH v13 12/32] x86,fs/resctrl: Support binary fixed point event counters
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
` (10 preceding siblings ...)
2025-10-29 16:20 ` [PATCH v13 11/32] x86,fs/resctrl: Handle events that can be read from any CPU Tony Luck
@ 2025-10-29 16:20 ` Tony Luck
2025-11-05 14:42 ` Dave Martin
2025-11-12 13:08 ` David Laight
2025-10-29 16:20 ` [PATCH v13 13/32] x86,fs/resctrl: Add an architectural hook called for each mount Tony Luck
` (21 subsequent siblings)
33 siblings, 2 replies; 85+ messages in thread
From: Tony Luck @ 2025-10-29 16:20 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
resctrl assumes that all monitor events can be displayed as unsigned
decimal integers.
Hardware architecture counters may provide some telemetry events with
greater precision where the event is not a simple count, but is a
measurement of some sort (e.g. Joules for energy consumed).
Add a new argument to resctrl_enable_mon_event() for architecture code
to inform the file system that the value for a counter is a fixed-point
value with a specific number of binary places.
Only allow architecture to use floating point format on events that the
file system has marked with mon_evt::is_floating_point.
Display fixed point values with values rounded to an appropriate number
of decimal places for the precision of the number of binary places
provided. Add one extra decimal place for every three additional binary
places, except for low precision binary values where exact representation
is possible:
1 binary place is 0.0 or 0.5 => 1 decimal place
2 binary places is 0.0, 0.25, 0.5, 0.75 => 2 decimal places
3 binary places is 0.0, 0.125, etc. => 3 decimal places
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
include/linux/resctrl.h | 3 +-
fs/resctrl/internal.h | 8 +++
arch/x86/kernel/cpu/resctrl/core.c | 6 +--
fs/resctrl/ctrlmondata.c | 84 ++++++++++++++++++++++++++++++
fs/resctrl/monitor.c | 10 +++-
5 files changed, 105 insertions(+), 6 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 702205505dc9..a7e5a546152d 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -409,7 +409,8 @@ u32 resctrl_arch_get_num_closid(struct rdt_resource *r);
u32 resctrl_arch_system_num_rmid_idx(void);
int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
-void resctrl_enable_mon_event(enum resctrl_event_id eventid, bool any_cpu);
+void resctrl_enable_mon_event(enum resctrl_event_id eventid, bool any_cpu,
+ unsigned int binary_bits);
bool resctrl_is_mon_event_enabled(enum resctrl_event_id eventid);
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 40b76eaa33d0..f5189b6771a0 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -62,6 +62,9 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
* Only valid if @evtid is an MBM event.
* @configurable: true if the event is configurable
* @any_cpu: true if the event can be read from any CPU
+ * @is_floating_point: event values are displayed in floating point format
+ * @binary_bits: number of fixed-point binary bits from architecture,
+ * only valid if @is_floating_point is true
* @enabled: true if the event is enabled
*/
struct mon_evt {
@@ -71,6 +74,8 @@ struct mon_evt {
u32 evt_cfg;
bool configurable;
bool any_cpu;
+ bool is_floating_point;
+ unsigned int binary_bits;
bool enabled;
};
@@ -79,6 +84,9 @@ extern struct mon_evt mon_event_all[QOS_NUM_EVENTS];
#define for_each_mon_event(mevt) for (mevt = &mon_event_all[QOS_FIRST_EVENT]; \
mevt < &mon_event_all[QOS_NUM_EVENTS]; mevt++)
+/* Limit for mon_evt::binary_bits */
+#define MAX_BINARY_BITS 27
+
/**
* struct mon_data - Monitoring details for each event file.
* @list: Member of the global @mon_data_kn_priv_list list.
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 78ad493dcc01..c435319552be 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -893,15 +893,15 @@ static __init bool get_rdt_mon_resources(void)
bool ret = false;
if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC)) {
- resctrl_enable_mon_event(QOS_L3_OCCUP_EVENT_ID, false);
+ resctrl_enable_mon_event(QOS_L3_OCCUP_EVENT_ID, false, 0);
ret = true;
}
if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
- resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID, false);
+ resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID, false, 0);
ret = true;
}
if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL)) {
- resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID, false);
+ resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID, false, 0);
ret = true;
}
if (rdt_cpu_has(X86_FEATURE_ABMC))
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index 883be6f0810f..290a959776de 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -17,6 +17,7 @@
#include <linux/cpu.h>
#include <linux/kernfs.h>
+#include <linux/math.h>
#include <linux/seq_file.h>
#include <linux/slab.h>
#include <linux/tick.h>
@@ -597,6 +598,87 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
resctrl_arch_mon_ctx_free(r, evt->evtid, rr->arch_mon_ctx);
}
+/*
+ * Decimal place precision to use for each number of fixed-point
+ * binary bits.
+ */
+static unsigned int decplaces[MAX_BINARY_BITS + 1] = {
+ [1] = 1,
+ [2] = 2,
+ [3] = 3,
+ [4] = 3,
+ [5] = 3,
+ [6] = 3,
+ [7] = 3,
+ [8] = 3,
+ [9] = 3,
+ [10] = 4,
+ [11] = 4,
+ [12] = 4,
+ [13] = 5,
+ [14] = 5,
+ [15] = 5,
+ [16] = 6,
+ [17] = 6,
+ [18] = 6,
+ [19] = 7,
+ [20] = 7,
+ [21] = 7,
+ [22] = 8,
+ [23] = 8,
+ [24] = 8,
+ [25] = 9,
+ [26] = 9,
+ [27] = 9
+};
+
+static void print_event_value(struct seq_file *m, unsigned int binary_bits, u64 val)
+{
+ unsigned long long frac;
+ char buf[10];
+
+ if (!binary_bits) {
+ seq_printf(m, "%llu.0\n", val);
+ return;
+ }
+
+ /* Mask off the integer part of the fixed-point value. */
+ frac = val & GENMASK_ULL(binary_bits, 0);
+
+ /*
+ * Multiply by 10^{desired decimal places}. The integer part of
+ * the fixed point value is now almost what is needed.
+ */
+ frac *= int_pow(10ull, decplaces[binary_bits]);
+
+ /*
+ * Round to nearest by adding a value that would be a "1" in the
+ * binary_bits + 1 place. Integer part of fixed point value is
+ * now the needed value.
+ */
+ frac += 1ull << (binary_bits - 1);
+
+ /*
+ * Extract the integer part of the value. This is the decimal
+ * representation of the original fixed-point fractional value.
+ */
+ frac >>= binary_bits;
+
+ /*
+ * "frac" is now in the range [0 .. 10^decplaces). I.e. string
+ * representation will fit into chosen number of decimal places.
+ */
+ snprintf(buf, sizeof(buf), "%0*llu", decplaces[binary_bits], frac);
+
+ /* Trim trailing zeroes */
+ for (int i = decplaces[binary_bits] - 1; i > 0; i--) {
+ if (buf[i] != '0')
+ break;
+ buf[i] = '\0';
+ }
+ seq_printf(m, "%llu.%s\n", val >> binary_bits, buf);
+}
+
int rdtgroup_mondata_show(struct seq_file *m, void *arg)
{
struct kernfs_open_file *of = m->private;
@@ -674,6 +756,8 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
seq_puts(m, "Unavailable\n");
else if (rr.err == -ENOENT)
seq_puts(m, "Unassigned\n");
+ else if (evt->is_floating_point)
+ print_event_value(m, evt->binary_bits, rr.val);
else
seq_printf(m, "%llu\n", rr.val);
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 6eab98b47816..7d1b65316bc8 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -975,16 +975,22 @@ struct mon_evt mon_event_all[QOS_NUM_EVENTS] = {
},
};
-void resctrl_enable_mon_event(enum resctrl_event_id eventid, bool any_cpu)
+void resctrl_enable_mon_event(enum resctrl_event_id eventid, bool any_cpu, unsigned int binary_bits)
{
- if (WARN_ON_ONCE(eventid < QOS_FIRST_EVENT || eventid >= QOS_NUM_EVENTS))
+ if (WARN_ON_ONCE(eventid < QOS_FIRST_EVENT || eventid >= QOS_NUM_EVENTS ||
+ binary_bits > MAX_BINARY_BITS))
return;
if (mon_event_all[eventid].enabled) {
pr_warn("Duplicate enable for event %d\n", eventid);
return;
}
+ if (binary_bits && !mon_event_all[eventid].is_floating_point) {
+ pr_warn("Event %d may not be floating point\n", eventid);
+ return;
+ }
mon_event_all[eventid].any_cpu = any_cpu;
+ mon_event_all[eventid].binary_bits = binary_bits;
mon_event_all[eventid].enabled = true;
}
--
2.51.0
* Re: [PATCH v13 12/32] x86,fs/resctrl: Support binary fixed point event counters
2025-10-29 16:20 ` [PATCH v13 12/32] x86,fs/resctrl: Support binary fixed point event counters Tony Luck
@ 2025-11-05 14:42 ` Dave Martin
2025-11-05 23:31 ` Luck, Tony
2025-11-10 16:52 ` Luck, Tony
2025-11-12 13:08 ` David Laight
1 sibling, 2 replies; 85+ messages in thread
From: Dave Martin @ 2025-11-05 14:42 UTC (permalink / raw)
To: Tony Luck
Cc: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Chen Yu, x86, linux-kernel,
patches
Hi Tony,
A few drive-by nits from me -- apologies, I hadn't looked at this in a
while.
On Wed, Oct 29, 2025 at 09:20:55AM -0700, Tony Luck wrote:
> resctrl assumes that all monitor events can be displayed as unsigned
> decimal integers.
>
> Hardware architecture counters may provide some telemetry events with
> greater precision where the event is not a simple count, but is a
> measurement of some sort (e.g. Joules for energy consumed).
>
> Add a new argument to resctrl_enable_mon_event() for architecture code
> to inform the file system that the value for a counter is a fixed-point
> value with a specific number of binary places.
> Only allow architecture to use floating point format on events that the
> file system has marked with mon_evt::is_floating_point.
>
> Display fixed point values with values rounded to an appropriate number
> of decimal places for the precision of the number of binary places
> provided. Add one extra decimal place for every three additional binary
(Is this just informal wording? If not, it's wrong...)
> places, except for low precision binary values where exact representation
> is possible:
>
> 1 binary place is 0.0 or 0.5 => 1 decimal place
> 2 binary places is 0.0, 0.25, 0.5, 0.75 => 2 decimal places
> 3 binary places is 0.0, 0.125, etc. => 3 decimal places
What's the rationale for this special treatment? I don't see any
previous discussion (apologies if I missed it).
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
> ---
> include/linux/resctrl.h | 3 +-
> fs/resctrl/internal.h | 8 +++
> arch/x86/kernel/cpu/resctrl/core.c | 6 +--
> fs/resctrl/ctrlmondata.c | 84 ++++++++++++++++++++++++++++++
> fs/resctrl/monitor.c | 10 +++-
> 5 files changed, 105 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 702205505dc9..a7e5a546152d 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -409,7 +409,8 @@ u32 resctrl_arch_get_num_closid(struct rdt_resource *r);
> u32 resctrl_arch_system_num_rmid_idx(void);
> int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
>
> -void resctrl_enable_mon_event(enum resctrl_event_id eventid, bool any_cpu);
> +void resctrl_enable_mon_event(enum resctrl_event_id eventid, bool any_cpu,
> + unsigned int binary_bits);
>
> bool resctrl_is_mon_event_enabled(enum resctrl_event_id eventid);
>
> diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
> index 40b76eaa33d0..f5189b6771a0 100644
> --- a/fs/resctrl/internal.h
> +++ b/fs/resctrl/internal.h
> @@ -62,6 +62,9 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
> * Only valid if @evtid is an MBM event.
> * @configurable: true if the event is configurable
> * @any_cpu: true if the event can be read from any CPU
> + * @is_floating_point: event values are displayed in floating point format
Nit: Maybe rebrand this as is_fixed_point, or is_fractional, or similar?
The print syntax is just a decimal fraction, and the hardware
representation is fixed-point. Nothing floats.
> + * @binary_bits: number of fixed-point binary bits from architecture,
> + * only valid if @is_floating_point is true
> * @enabled: true if the event is enabled
> */
> struct mon_evt {
> @@ -71,6 +74,8 @@ struct mon_evt {
> u32 evt_cfg;
> bool configurable;
> bool any_cpu;
> + bool is_floating_point;
> + unsigned int binary_bits;
> bool enabled;
> };
>
> @@ -79,6 +84,9 @@ extern struct mon_evt mon_event_all[QOS_NUM_EVENTS];
> #define for_each_mon_event(mevt) for (mevt = &mon_event_all[QOS_FIRST_EVENT]; \
> mevt < &mon_event_all[QOS_NUM_EVENTS]; mevt++)
>
> +/* Limit for mon_evt::binary_bits */
> +#define MAX_BINARY_BITS 27
> +
Could this be up to 30?
(The formatting code relies on the product of the maximum fraction
value with 10^decplaces[] not exceeding a u64, so I think 30 bits
fits? But this only has to be as large as the largest value required
by some supported piece of hardware... I didn't go check on that.)
> /**
> * struct mon_data - Monitoring details for each event file.
> * @list: Member of the global @mon_data_kn_priv_list list.
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 78ad493dcc01..c435319552be 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -893,15 +893,15 @@ static __init bool get_rdt_mon_resources(void)
> bool ret = false;
>
> if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC)) {
> - resctrl_enable_mon_event(QOS_L3_OCCUP_EVENT_ID, false);
> + resctrl_enable_mon_event(QOS_L3_OCCUP_EVENT_ID, false, 0);
> ret = true;
> }
> if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
> - resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID, false);
> + resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID, false, 0);
> ret = true;
> }
> if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL)) {
> - resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID, false);
> + resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID, false, 0);
> ret = true;
> }
> if (rdt_cpu_has(X86_FEATURE_ABMC))
> diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
> index 883be6f0810f..290a959776de 100644
> --- a/fs/resctrl/ctrlmondata.c
> +++ b/fs/resctrl/ctrlmondata.c
> @@ -17,6 +17,7 @@
>
> #include <linux/cpu.h>
> #include <linux/kernfs.h>
> +#include <linux/math.h>
> #include <linux/seq_file.h>
> #include <linux/slab.h>
> #include <linux/tick.h>
> @@ -597,6 +598,87 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> resctrl_arch_mon_ctx_free(r, evt->evtid, rr->arch_mon_ctx);
> }
>
> +/*
> + * Decimal place precision to use for each number of fixed-point
> + * binary bits.
> + */
> +static unsigned int decplaces[MAX_BINARY_BITS + 1] = {
^ const
Also, maybe explicitly initialise
[0] = 1,
here? (See print_event_value().)
> + [1] = 1,
> + [2] = 2,
> + [3] = 3,
> + [4] = 3,
> + [5] = 3,
> + [6] = 3,
> + [7] = 3,
> + [8] = 3,
> + [9] = 3,
> + [10] = 4,
Why these specific values?
ceil(binary_bits * log10(2)) makes sense if we want to expose all
available hardware precision with as few digits as possible.
floor(binary_bits * log10(2)) makes sense if we want to expose as many
digits as possible without advertising spurious precision.
Disregarding the special-casing for binary_bits <= 3, still neither
option quite seems to match this list.
Rounding up means that the hardware value can be reconstructed, but
only if userspace knows the value of binary_bits. Should that be
exposed?
> + [11] = 4,
> + [12] = 4,
> + [13] = 5,
> + [14] = 5,
> + [15] = 5,
> + [16] = 6,
> + [17] = 6,
> + [18] = 6,
> + [19] = 7,
> + [20] = 7,
> + [21] = 7,
> + [22] = 8,
> + [23] = 8,
> + [24] = 8,
> + [25] = 9,
> + [26] = 9,
> + [27] = 9
Documenting the rule for generating these may be a good idea unless we
are sure that no more entries will ever be added.
> +};
> +
> +static void print_event_value(struct seq_file *m, unsigned int binary_bits, u64 val)
> +{
> + unsigned long long frac;
> + char buf[10];
In place of the magic number 10, how about
decplaces[MAX_BINARY_BITS] + 1 ?
(I think the compiler should accept that as an initialiser if the array
is const.)
> +
> + if (!binary_bits) {
> + seq_printf(m, "%llu.0\n", val);
> + return;
> + }
Can an initialiser for decplaces[0] reduce the special-casing for
binary_bits == 0?
> +
> + /* Mask off the integer part of the fixed-point value. */
> + frac = val & GENMASK_ULL(binary_bits, 0);
Should this be GENMASK_ULL(binary_bits - 1, 0)?
Should we be telling userspace the binary_bits value? It is not
(exactly) deducible from the number of decimal places printed.
It depends on the use cases and what the code is trying to achieve, but
this does not seem to be described in detail, unless I've missed it
somewhere.
> +
> + /*
> + * Multiply by 10^{desired decimal places}. The integer part of
> + * the fixed point value is now almost what is needed.
> + */
> + frac *= int_pow(10ull, decplaces[binary_bits]);
> +
> + /*
> + * Round to nearest by adding a value that would be a "1" in the
> + * binary_bits + 1 place. Integer part of fixed point value is
> + * now the needed value.
> + */
> + frac += 1ull << (binary_bits - 1);
> +
> + /*
> + * Extract the integer part of the value. This is the decimal
> + * representation of the original fixed-point fractional value.
> + */
> + frac >>= binary_bits;
> +
> + /*
> + * "frac" is now in the range [0 .. 10^decplaces). I.e. string
> + * representation will fit into chosen number of decimal places.
> + */
> + snprintf(buf, sizeof(buf), "%0*llu", decplaces[binary_bits], frac);
> +
> + /* Trim trailing zeroes */
Why?
Would it be better to present the values with consistent precision?
There's no reason why a telemetry counter should settle for any length
of time at a tidy value, so the precision represented by the trailing
zeros is always significant.
The hardware precision doesn't go up and down depending on the precise
value of the counter...
> + for (int i = decplaces[binary_bits] - 1; i > 0; i--) {
> + if (buf[i] != '0')
> + break;
> + buf[i] = '\0';
> + }
> + seq_printf(m, "%llu.%s\n", val >> binary_bits, buf);
> +}
> +
[...]
Cheers
---Dave
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH v13 12/32] x86,fs/resctrl: Support binary fixed point event counters
2025-11-05 14:42 ` Dave Martin
@ 2025-11-05 23:31 ` Luck, Tony
2025-11-06 0:09 ` Reinette Chatre
` (2 more replies)
2025-11-10 16:52 ` Luck, Tony
1 sibling, 3 replies; 85+ messages in thread
From: Luck, Tony @ 2025-11-05 23:31 UTC (permalink / raw)
To: Dave Martin
Cc: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Chen Yu, x86, linux-kernel,
patches
Hi Dave,
Thanks for taking time to review. You did unearth one big bug
and I'm super-grateful for that.
On Wed, Nov 05, 2025 at 02:42:18PM +0000, Dave Martin wrote:
> Hi Tony,
>
> A few drive-by nits from me -- apologies, I hadn't looked at this in a
> while.
>
> On Wed, Oct 29, 2025 at 09:20:55AM -0700, Tony Luck wrote:
> > resctrl assumes that all monitor events can be displayed as unsigned
> > decimal integers.
> >
> > Hardware architecture counters may provide some telemetry events with
> > greater precision where the event is not a simple count, but is a
> > measurement of some sort (e.g. Joules for energy consumed).
> >
> > Add a new argument to resctrl_enable_mon_event() for architecture code
> > to inform the file system that the value for a counter is a fixed-point
> > value with a specific number of binary places.
> > Only allow architecture to use floating point format on events that the
> > file system has marked with mon_evt::is_floating_point.
> >
> > Display fixed point values with values rounded to an appropriate number
> > of decimal places for the precision of the number of binary places
> > provided. Add one extra decimal place for every three additional binary
>
> (Is this just informal wording? If not, it's wrong...)
Informal. It isn't far off from the table. Once out of the small numbers
the number of decimal places does increment after each group of three.
>
> > places, except for low precision binary values where exact representation
> > is possible:
> >
> > 1 binary place is 0.0 or 0.5 => 1 decimal place
> > 2 binary places is 0.0, 0.25, 0.5, 0.75 => 2 decimal places
> > 3 binary places is 0.0, 0.125, etc. => 3 decimal places
>
> What's the rationale for this special treatment? I don't see any
> previous discussion (apologies if I missed it).
The strict log10(2) calculations below throw away some precision from
these cases. I thought that was bad.
> >
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
> > ---
> > include/linux/resctrl.h | 3 +-
> > fs/resctrl/internal.h | 8 +++
> > arch/x86/kernel/cpu/resctrl/core.c | 6 +--
> > fs/resctrl/ctrlmondata.c | 84 ++++++++++++++++++++++++++++++
> > fs/resctrl/monitor.c | 10 +++-
> > 5 files changed, 105 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> > index 702205505dc9..a7e5a546152d 100644
> > --- a/include/linux/resctrl.h
> > +++ b/include/linux/resctrl.h
> > @@ -409,7 +409,8 @@ u32 resctrl_arch_get_num_closid(struct rdt_resource *r);
> > u32 resctrl_arch_system_num_rmid_idx(void);
> > int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
> >
> > -void resctrl_enable_mon_event(enum resctrl_event_id eventid, bool any_cpu);
> > +void resctrl_enable_mon_event(enum resctrl_event_id eventid, bool any_cpu,
> > + unsigned int binary_bits);
> >
> > bool resctrl_is_mon_event_enabled(enum resctrl_event_id eventid);
> >
> > diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
> > index 40b76eaa33d0..f5189b6771a0 100644
> > --- a/fs/resctrl/internal.h
> > +++ b/fs/resctrl/internal.h
> > @@ -62,6 +62,9 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
> > * Only valid if @evtid is an MBM event.
> > * @configurable: true if the event is configurable
> > * @any_cpu: true if the event can be read from any CPU
> > + * @is_floating_point: event values are displayed in floating point format
>
> Nit: Maybe rebrand this as is_fixed_point, or is_fractional, or similar?
>
> The print syntax is just a decimal fraction, and the hardware
> representation is fixed-point. Nothing floats.
You are right. I can change from is_floating_point to is_fixed_point.
> > + * @binary_bits: number of fixed-point binary bits from architecture,
> > + * only valid if @is_floating_point is true
> > * @enabled: true if the event is enabled
> > */
> > struct mon_evt {
> > @@ -71,6 +74,8 @@ struct mon_evt {
> > u32 evt_cfg;
> > bool configurable;
> > bool any_cpu;
> > + bool is_floating_point;
> > + unsigned int binary_bits;
> > bool enabled;
> > };
> >
> > @@ -79,6 +84,9 @@ extern struct mon_evt mon_event_all[QOS_NUM_EVENTS];
> > #define for_each_mon_event(mevt) for (mevt = &mon_event_all[QOS_FIRST_EVENT]; \
> > mevt < &mon_event_all[QOS_NUM_EVENTS]; mevt++)
> >
> > +/* Limit for mon_evt::binary_bits */
> > +#define MAX_BINARY_BITS 27
> > +
>
> Could this be up to 30?
Yes.
> (The formatting code relies on the product of the maximum fraction
> value with 10^decplaces[] not exceeding a u64, so I think 30 bits
> fits? But this only has to be as large as the largest value required
> by some supported piece of hardware... I didn't go check on that.)
I only have one data point. The Intel telemetry events are using 18
binary places.
> > /**
> > * struct mon_data - Monitoring details for each event file.
> > * @list: Member of the global @mon_data_kn_priv_list list.
> > diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> > index 78ad493dcc01..c435319552be 100644
> > --- a/arch/x86/kernel/cpu/resctrl/core.c
> > +++ b/arch/x86/kernel/cpu/resctrl/core.c
> > @@ -893,15 +893,15 @@ static __init bool get_rdt_mon_resources(void)
> > bool ret = false;
> >
> > if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC)) {
> > - resctrl_enable_mon_event(QOS_L3_OCCUP_EVENT_ID, false);
> > + resctrl_enable_mon_event(QOS_L3_OCCUP_EVENT_ID, false, 0);
> > ret = true;
> > }
> > if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
> > - resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID, false);
> > + resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID, false, 0);
> > ret = true;
> > }
> > if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL)) {
> > - resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID, false);
> > + resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID, false, 0);
> > ret = true;
> > }
> > if (rdt_cpu_has(X86_FEATURE_ABMC))
> > diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
> > index 883be6f0810f..290a959776de 100644
> > --- a/fs/resctrl/ctrlmondata.c
> > +++ b/fs/resctrl/ctrlmondata.c
> > @@ -17,6 +17,7 @@
> >
> > #include <linux/cpu.h>
> > #include <linux/kernfs.h>
> > +#include <linux/math.h>
> > #include <linux/seq_file.h>
> > #include <linux/slab.h>
> > #include <linux/tick.h>
> > @@ -597,6 +598,87 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> > resctrl_arch_mon_ctx_free(r, evt->evtid, rr->arch_mon_ctx);
> > }
> >
> > +/*
> > + * Decimal place precision to use for each number of fixed-point
> > + * binary bits.
> > + */
> > +static unsigned int decplaces[MAX_BINARY_BITS + 1] = {
>
> ^ const
OK
>
> Also, maybe explicitly initialise
>
> [0] = 1,
OK (though this might only occur if there is an event that resctrl says
must be fixed point, with a h/w implementation that provides a simple
integer).
> here? (See print_event_value().)
>
> > + [1] = 1,
> > + [2] = 2,
> > + [3] = 3,
> > + [4] = 3,
> > + [5] = 3,
> > + [6] = 3,
> > + [7] = 3,
> > + [8] = 3,
> > + [9] = 3,
> > + [10] = 4,
>
> Why these specific values?
For 1, 2, 3 binary bits you get an exact decimal representation
with 1, 2, 3 decimal places. I kept the "3" going from 4 to 9
bits because it should output at least as many places as 3 bits.
After that I started stepping every 3 extra bits.
> ceil(binary_bits * log10(2)) makes sense if we want to expose all
> available hardware precision with as few digits as possible.
>
> floor(binary_bits * log10(2)) makes sense if we want to expose as many
> digits as possible without advertising spurious precision.
>
> Disregarding the special-casing for binary_bits <= 3, still neither
> option quite seems to match this list.
Side-by-side comparison:
#include <stdio.h>
#include <math.h>
static unsigned int tony[] = {
[0] = 0, [1] = 1, [2] = 2, [3] = 3, [4] = 3, [5] = 3,
[6] = 3, [7] = 3, [8] = 3, [9] = 3, [10] = 4, [11] = 4,
[12] = 4, [13] = 5, [14] = 5, [15] = 5, [16] = 6, [17] = 6,
[18] = 6, [19] = 7, [20] = 7, [21] = 7, [22] = 8, [23] = 8,
[24] = 8, [25] = 9, [26] = 9, [27] = 9
};
int main(void)
{
int binary_bits;
double log10_2 = log10(2.0);
printf("bits:\tceil\tfloor\ttony\n");
for (binary_bits = 0; binary_bits < 28; binary_bits++)
printf("%d:\t%d\t%d\t%d\n",
binary_bits,
(int)ceil(binary_bits * log10_2),
(int)floor(binary_bits * log10_2),
tony[binary_bits]);
return 0;
}
bits: ceil floor tony
0: 0 0 0
1: 1 0 1
2: 1 0 2
3: 1 0 3
4: 2 1 3
5: 2 1 3
6: 2 1 3
7: 3 2 3
8: 3 2 3
9: 3 2 3
10: 4 3 4
11: 4 3 4
12: 4 3 4
13: 4 3 5
14: 5 4 5
15: 5 4 5
16: 5 4 6
17: 6 5 6
18: 6 5 6
19: 6 5 7
20: 7 6 7
21: 7 6 7
22: 7 6 8
23: 7 6 8
24: 8 7 8
25: 8 7 9
26: 8 7 9
27: 9 8 9
I'm not a fan of the "floor" option. Looks like it loses precision. Terrible for
1-3 binary bits. Also not what I'd like for the bits==18 case that I currently
care about.
"ceil" is good for bits > 6. Almost matches my numbers (except I jump
to one more decimal place one binary bit earlier).
What do you think of me swapping out the values from 7 upwards for the
ceil values and documenting that 0..6 are hand-picked, but 7 and up are
ceil(binary_bits * log10_2)?
>
> Rounding up means that the hardware value can be reconstructed, but
> only if userspace knows the value of binary_bits. Should that be
> exposed?
I'm not sure I see when users would need to reconstruct the h/w value.
General use case for these resctrl events is: read1, sleepN, read2 &
compute rate = (read2 - read1) / N
In the case of the Intel telemetry events there is some jitter around
the timing of the reads (since events may only be updated every 2ms).
So the error bars get big if "N" is small. Which all leads me to believe
that a "good enough" approach to representing the event values will
be close enough for all use cases.
>
> > + [11] = 4,
> > + [12] = 4,
> > + [13] = 5,
> > + [14] = 5,
> > + [15] = 5,
> > + [16] = 6,
> > + [17] = 6,
> > + [18] = 6,
> > + [19] = 7,
> > + [20] = 7,
> > + [21] = 7,
> > + [22] = 8,
> > + [23] = 8,
> > + [24] = 8,
> > + [25] = 9,
> > + [26] = 9,
> > + [27] = 9
>
> Documenting the rule for generating these may be a good idea unless we
> are sure that no more entries will ever be added.
Above proposal - use the ceil function for bits >= 7.
> > +};
> > +
> > +static void print_event_value(struct seq_file *m, unsigned int binary_bits, u64 val)
> > +{
> > + unsigned long long frac;
> > + char buf[10];
>
> In place of the magic number 10, how about
> decplaces[MAX_BINARY_BITS] + 1 ?
>
> (I think the compiler should accept that as an initialiser if the array
> is const.)
If the compiler doesn't barf, then OK.
> > +
> > + if (!binary_bits) {
> > + seq_printf(m, "%llu.0\n", val);
> > + return;
> > + }
>
> Can an initialiser for decplaces[0] reduce the special-casing for
> binary_bits == 0?
I'll check and see.
> > +
> > + /* Mask off the integer part of the fixed-point value. */
> > + frac = val & GENMASK_ULL(binary_bits, 0);
>
> Should this be GENMASK_ULL(binary_bits - 1, 0)?
Oops. I think you are right.
> Should we be telling userspace the binary_bits value? It is not
> (exactly) deducible from the number of decimal places printed.
I could add another info file for fixed_point events to display this.
But I'm not convinced that it would result in users doing anything
different.
Assume you just did the "read1, sleepN, read2" and got values of
235.617542 and 338.964815, tell me how things would be different
if an info file said that binary_bits was 17 vs. 19?
> It depends on the use cases and what the code is trying to achieve, but
> this does not seem to be described in detail, unless I've missed it
> somewhere.
>
> > +
> > + /*
> > + * Multiply by 10^{desired decimal places}. The integer part of
> > + * the fixed point value is now almost what is needed.
> > + */
> > + frac *= int_pow(10ull, decplaces[binary_bits]);
> > +
> > + /*
> > + * Round to nearest by adding a value that would be a "1" in the
> > + * binary_bits + 1 place. Integer part of fixed point value is
> > + * now the needed value.
> > + */
> > + frac += 1ull << (binary_bits - 1);
> > +
> > + /*
> > + * Extract the integer part of the value. This is the decimal
> > + * representation of the original fixed-point fractional value.
> > + */
> > + frac >>= binary_bits;
> > +
> > + /*
> > + * "frac" is now in the range [0 .. 10^decplaces). I.e. string
> > + * representation will fit into chosen number of decimal places.
> > + */
> > + snprintf(buf, sizeof(buf), "%0*llu", decplaces[binary_bits], frac);
> > +
> > + /* Trim trailing zeroes */
>
> Why?
It felt good. I'm not wedded to this. Maybe saving a few cycles of
kernel CPU time by dropping this would be good.
> Would it be better to present the values with consistent precision?
Humans might notice the difference. Apps reading the file aren't going
to care.
> There's no reason why a telemetry counter should settle for any length
> of time at a tidy value, so the precision represented by the trailing
> zeros is always significant.
But x1 = atof("1.5") and x2 = atof("1.500000") ... can the subsequent
use of x1 tell that there was less precision than x2?
>
> The hardware precision doesn't go up and down depending on the precise
> value of the counter...
>
> > + for (int i = decplaces[binary_bits] - 1; i > 0; i--) {
> > + if (buf[i] != '0')
> > + break;
> > + buf[i] = '\0';
> > + }
> > + seq_printf(m, "%llu.%s\n", val >> binary_bits, buf);
> > +}
> > +
>
> [...]
>
> Cheers
> ---Dave
-Tony
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH v13 12/32] x86,fs/resctrl: Support binary fixed point event counters
2025-11-05 23:31 ` Luck, Tony
@ 2025-11-06 0:09 ` Reinette Chatre
2025-11-11 17:22 ` Dave Martin
2025-11-06 2:27 ` Luck, Tony
2025-11-11 17:16 ` Dave Martin
2 siblings, 1 reply; 85+ messages in thread
From: Reinette Chatre @ 2025-11-06 0:09 UTC (permalink / raw)
To: Luck, Tony, Dave Martin
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Chen Yu, x86, linux-kernel, patches
Hi Dave and Tony,
On 11/5/25 3:31 PM, Luck, Tony wrote:
> On Wed, Nov 05, 2025 at 02:42:18PM +0000, Dave Martin wrote:
>> On Wed, Oct 29, 2025 at 09:20:55AM -0700, Tony Luck wrote:
...
>>> diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
>>> index 40b76eaa33d0..f5189b6771a0 100644
>>> --- a/fs/resctrl/internal.h
>>> +++ b/fs/resctrl/internal.h
>>> @@ -62,6 +62,9 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
>>> * Only valid if @evtid is an MBM event.
>>> * @configurable: true if the event is configurable
>>> * @any_cpu: true if the event can be read from any CPU
>>> + * @is_floating_point: event values are displayed in floating point format
>>
>> Nit: Maybe rebrand this as is_fixed_point, or is_fractional, or similar?
>>
>> The print syntax is just a decimal fraction, and the hardware
>> representation is fixed-point. Nothing floats.
>
> You are right. I can change from is_floating_point to is_fixed_point.
>
This is a fs property though, not hardware. It highlights that the value is displayed in
floating point format, which is the closest resctrl comes to establishing a "contract" with
user space on what format to expect when reading the data, backed by a matching update
to resctrl.rst for the events that have this hardcoded by the fs.
Whether an architecture uses fixed point format or some other mechanism to determine the
value eventually exposed to user space is unique to the architecture.
Reinette
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH v13 12/32] x86,fs/resctrl: Support binary fixed point event counters
2025-11-06 0:09 ` Reinette Chatre
@ 2025-11-11 17:22 ` Dave Martin
2025-11-12 16:12 ` Reinette Chatre
0 siblings, 1 reply; 85+ messages in thread
From: Dave Martin @ 2025-11-11 17:22 UTC (permalink / raw)
To: Reinette Chatre
Cc: Luck, Tony, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Chen Yu, x86, linux-kernel,
patches
Hi,
On Wed, Nov 05, 2025 at 04:09:28PM -0800, Reinette Chatre wrote:
> Hi Dave and Tony,
>
> On 11/5/25 3:31 PM, Luck, Tony wrote:
> > On Wed, Nov 05, 2025 at 02:42:18PM +0000, Dave Martin wrote:
> >> On Wed, Oct 29, 2025 at 09:20:55AM -0700, Tony Luck wrote:
>
> ...
>
> >>> diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
> >>> index 40b76eaa33d0..f5189b6771a0 100644
> >>> --- a/fs/resctrl/internal.h
> >>> +++ b/fs/resctrl/internal.h
> >>> @@ -62,6 +62,9 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
> >>> * Only valid if @evtid is an MBM event.
> >>> * @configurable: true if the event is configurable
> >>> * @any_cpu: true if the event can be read from any CPU
> >>> + * @is_floating_point: event values are displayed in floating point format
> >>
> >> Nit: Maybe rebrand this as is_fixed_point, or is_fractional, or similar?
> >>
> >> The print syntax is just a decimal fraction, and the hardware
> >> representation is fixed-point. Nothing floats.
> >
> > You are right. I can change from is_floating_point to is_fixed_point.
> >
>
> This is a fs property though, not hardware. It highlights that the value is displayed in
> floating point format, which is the closest resctrl comes to establishing a "contract" with
> user space on what format to expect when reading the data, backed by a matching update
> to resctrl.rst for the events that have this hardcoded by the fs.
> Whether an architecture uses fixed point format or some other mechanism to determine the
> value eventually exposed to user space is unique to the architecture.
Sure, getting the documentation right is the most important thing,
while the internal name for this property is not ABI.
(I don't strongly object to "is_floating_point", even if we expose this
in the filesystem, so long as we document carefully what it means.)
Cheers
---Dave
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH v13 12/32] x86,fs/resctrl: Support binary fixed point event counters
2025-11-11 17:22 ` Dave Martin
@ 2025-11-12 16:12 ` Reinette Chatre
0 siblings, 0 replies; 85+ messages in thread
From: Reinette Chatre @ 2025-11-12 16:12 UTC (permalink / raw)
To: Dave Martin
Cc: Luck, Tony, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Chen Yu, x86, linux-kernel,
patches
Hi Dave,
On 11/11/25 9:22 AM, Dave Martin wrote:
> Hi,
>
> On Wed, Nov 05, 2025 at 04:09:28PM -0800, Reinette Chatre wrote:
>> Hi Dave and Tony,
>>
>> On 11/5/25 3:31 PM, Luck, Tony wrote:
>>> On Wed, Nov 05, 2025 at 02:42:18PM +0000, Dave Martin wrote:
>>>> On Wed, Oct 29, 2025 at 09:20:55AM -0700, Tony Luck wrote:
>>
>> ...
>>
>>>>> diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
>>>>> index 40b76eaa33d0..f5189b6771a0 100644
>>>>> --- a/fs/resctrl/internal.h
>>>>> +++ b/fs/resctrl/internal.h
>>>>> @@ -62,6 +62,9 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
>>>>> * Only valid if @evtid is an MBM event.
>>>>> * @configurable: true if the event is configurable
>>>>> * @any_cpu: true if the event can be read from any CPU
>>>>> + * @is_floating_point: event values are displayed in floating point format
>>>>
>>>> Nit: Maybe rebrand this as is_fixed_point, or is_fractional, or similar?
>>>>
>>>> The print syntax is just a decimal fraction, and the hardware
>>>> representation is fixed-point. Nothing floats.
>>>
>>> You are right. I can change from is_floating_point to is_fixed_point.
>>>
>>
>> This is a fs property though, not hardware. It highlights that the value is displayed in
>> floating point format, which is the closest resctrl comes to establishing a "contract" with
>> user space on what format to expect when reading the data, backed by a matching update
>> to resctrl.rst for the events that have this hardcoded by the fs.
>> Whether an architecture uses fixed point format or some other mechanism to determine the
>> value eventually exposed to user space is unique to the architecture.
>
> Sure, getting the documentation right is the most important thing,
> while the internal name for this property is not ABI.
>
> (I don't strongly object to "is_floating_point", even if we expose this
> in the filesystem, so long as we document carefully what it means.)
Highlighting the member name and description in fs/resctrl/internal.h:
@is_floating_point: event values are displayed in floating point format
I consider it important that the description highlights that the event will be displayed to
user space as floating point. struct mon_evt that contains this member is internal to resctrl fs
and there is no helper available to arch with which @is_floating_point can be changed since
this is a contract with user space. I find having the member name match that description
and contract easier to read.
The documentation (resctrl.rst) is updated in patch #32 with below to make this clear:
"core energy" reports a floating point number for the energy (in Joules) ...
...
"activity" also reports a floating point value (in Farads).
I agree that internal names are not ABI and this is evident with the only internal
connection to a value displayed as floating point being an internal fixed point fraction
number. This can change any time. We have to draw the line somewhere to make it clear
how resctrl interacts with user space and I find the event's display property to be
appropriate for this.
Reinette
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH v13 12/32] x86,fs/resctrl: Support binary fixed point event counters
2025-11-05 23:31 ` Luck, Tony
2025-11-06 0:09 ` Reinette Chatre
@ 2025-11-06 2:27 ` Luck, Tony
2025-11-11 17:31 ` Dave Martin
2025-11-11 17:16 ` Dave Martin
2 siblings, 1 reply; 85+ messages in thread
From: Luck, Tony @ 2025-11-06 2:27 UTC (permalink / raw)
To: Dave Martin
Cc: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Chen Yu, x86, linux-kernel,
patches
On Wed, Nov 05, 2025 at 03:31:07PM -0800, Luck, Tony wrote:
> > > +
> > > + if (!binary_bits) {
> > > + seq_printf(m, "%llu.0\n", val);
> > > + return;
> > > + }
I can't completely escape a test for !binary_bits. Most of the
flow works ok (doing nothing, working towards frac == 0 when
it comes time for the snprintf()).
But the round-up code:
frac += 1ull << (binary_bits - 1);
goes badly wrong if binary_bits == 0.
I could write it like this:
static void print_event_value(struct seq_file *m, unsigned int binary_bits, u64 val)
{
char buf[decplaces[MAX_BINARY_BITS] + 1];
unsigned long long frac = 0;
if (binary_bits) {
/* Mask off the integer part of the fixed-point value. */
frac = val & GENMASK_ULL(binary_bits - 1, 0);
/*
* Multiply by 10^{desired decimal places}. The integer part of
* the fixed point value is now almost what is needed.
*/
frac *= int_pow(10ull, decplaces[binary_bits]);
/*
* Round to nearest by adding a value that would be a "1" in the
* binary_bits + 1 place. Integer part of fixed point value is
* now the needed value.
*/
frac += 1ull << (binary_bits - 1);
/*
* Extract the integer part of the value. This is the decimal
* representation of the original fixed-point fractional value.
*/
frac >>= binary_bits;
}
/*
* "frac" is now in the range [0 .. 10^decplaces). I.e. string
* representation will fit into chosen number of decimal places.
*/
snprintf(buf, sizeof(buf), "%0*llu", decplaces[binary_bits], frac);
seq_printf(m, "%llu.%s\n", val >> binary_bits, buf);
}
-Tony
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH v13 12/32] x86,fs/resctrl: Support binary fixed point event counters
2025-11-06 2:27 ` Luck, Tony
@ 2025-11-11 17:31 ` Dave Martin
2025-11-14 18:39 ` Luck, Tony
0 siblings, 1 reply; 85+ messages in thread
From: Dave Martin @ 2025-11-11 17:31 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Chen Yu, x86, linux-kernel,
patches
Hi,
On Wed, Nov 05, 2025 at 06:27:48PM -0800, Luck, Tony wrote:
> On Wed, Nov 05, 2025 at 03:31:07PM -0800, Luck, Tony wrote:
> > > > +
> > > > + if (!binary_bits) {
> > > > + seq_printf(m, "%llu.0\n", val);
> > > > + return;
> > > > + }
>
> I can't completely escape a test for !binary_bits. Most of the
> flow works ok (doing nothing, working towards frac == 0 when
> it comes time for the snprintf()).
>
> But the round-up code:
>
> frac += 1ull << (binary_bits - 1);
>
> goes badly wrong if binary_bits == 0.
>
> I could write it like this:
>
>
> static void print_event_value(struct seq_file *m, unsigned int binary_bits, u64 val)
> {
> char buf[decplaces[MAX_BINARY_BITS] + 1];
> unsigned long long frac = 0;
>
> if (binary_bits) {
> /* Mask off the integer part of the fixed-point value. */
> frac = val & GENMASK_ULL(binary_bits - 1, 0);
>
> /*
> * Multiply by 10^{desired decimal places}. The integer part of
> * the fixed point value is now almost what is needed.
> */
> frac *= int_pow(10ull, decplaces[binary_bits]);
I guess there was already a discussion on whether it is worth
precomputing this multiplier.
int_pow() is not free, but if implemented in the standard way, it
should be pretty fast on 64-bit arches (which is all we care about).
(I've not checked.)
> /*
> * Round to nearest by adding a value that would be a "1" in the
> * binary_bits + 1 place. Integer part of fixed point value is
> * now the needed value.
> */
> frac += 1ull << (binary_bits - 1);
>
> /*
> * Extract the integer part of the value. This is the decimal
> * representation of the original fixed-point fractional value.
> */
> frac >>= binary_bits;
Looks reasonable. It's your call whether this is simpler, I guess.
> }
>
> /*
> * "frac" is now in the range [0 .. 10^decplaces). I.e. string
> * representation will fit into chosen number of decimal places.
> */
> snprintf(buf, sizeof(buf), "%0*llu", decplaces[binary_bits], frac);
>
> seq_printf(m, "%llu.%s\n", val >> binary_bits, buf);
Can we get rid of buf, actually?
I don't see why we can't just do
seq_printf(m, "%llu.%0*llu",
val >> binary_bits, decplaces[binary_bits], frac);
...?
This avoids having to care about the size of buf.
seq_file's crystal ball knows how to make its buffer large enough.
Cheers
---Dave
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH v13 12/32] x86,fs/resctrl: Support binary fixed point event counters
2025-11-11 17:31 ` Dave Martin
@ 2025-11-14 18:39 ` Luck, Tony
0 siblings, 0 replies; 85+ messages in thread
From: Luck, Tony @ 2025-11-14 18:39 UTC (permalink / raw)
To: Dave Martin
Cc: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Chen Yu, x86, linux-kernel,
patches
Hi Dave,
On Tue, Nov 11, 2025 at 05:31:12PM +0000, Dave Martin wrote:
> Hi,
>
> On Wed, Nov 05, 2025 at 06:27:48PM -0800, Luck, Tony wrote:
> > On Wed, Nov 05, 2025 at 03:31:07PM -0800, Luck, Tony wrote:
> > > > > +
> > > > > + if (!binary_bits) {
> > > > > + seq_printf(m, "%llu.0\n", val);
> > > > > + return;
> > > > > + }
> >
> > I can't completely escape a test for !binary_bits. Most of the
> > flow works ok (doing nothing, working towards frac == 0 when
> > it comes time for the snprintf()).
> >
> > But the round-up code:
> >
> > frac += 1ull << (binary_bits - 1);
> >
> > goes badly wrong if binary_bits == 0.
> >
> > I could write it like this:
> >
> >
> > static void print_event_value(struct seq_file *m, unsigned int binary_bits, u64 val)
> > {
> > char buf[decplaces[MAX_BINARY_BITS] + 1];
> > unsigned long long frac = 0;
> >
> > if (binary_bits) {
> > /* Mask off the integer part of the fixed-point value. */
> > frac = val & GENMASK_ULL(binary_bits - 1, 0);
> >
> > /*
> > * Multiply by 10^{desired decimal places}. The integer part of
> > * the fixed point value is now almost what is needed.
> > */
> > frac *= int_pow(10ull, decplaces[binary_bits]);
>
> I guess there was already a discussion on whether it is worth
> precomputing this multiplier.
>
> int_pow() is not free, but if implemented in the standard way, it
> should be pretty fast on 64-bit arches (which is all we care about).
Earlier versions of the patch had the precomputed value. Reinette
pointed me to int_pow(). It is in lib/math/int_pow.c and does seem
to be pretty efficient.
>
> (I've not checked.)
>
> > /*
> > * Round to nearest by adding a value that would be a "1" in the
> > * binary_bits + 1 place. Integer part of fixed point value is
> > * now the needed value.
> > */
> > frac += 1ull << (binary_bits - 1);
> >
> > /*
> > * Extract the integer part of the value. This is the decimal
> > * representation of the original fixed-point fractional value.
> > */
> > frac >>= binary_bits;
>
> Looks reasonable. It's your call whether this is simpler, I guess.
>
> > }
> >
> > /*
> > * "frac" is now in the range [0 .. 10^decplaces). I.e. string
> > * representation will fit into chosen number of decimal places.
> > */
> > snprintf(buf, sizeof(buf), "%0*llu", decplaces[binary_bits], frac);
> >
> > seq_printf(m, "%llu.%s\n", val >> binary_bits, buf);
>
> Can we get rid of buf, actually?
>
> I don't see why we can't just do
>
> seq_printf(m, "%llu.%0*llu",
> val >> binary_bits, decplaces[binary_bits], frac);
The buf[] was only there for trimming the trailing zeroes. Now that is
gone the result can be sent directly to seq_printf() as you suggest.
>
> ...?
>
> This avoids having to care about the size of buf.
>
> seq_file's crystal ball knows how to make its buffer large enough.
>
> Cheers
> ---Dave
-Tony
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH v13 12/32] x86,fs/resctrl: Support binary fixed point event counters
2025-11-05 23:31 ` Luck, Tony
2025-11-06 0:09 ` Reinette Chatre
2025-11-06 2:27 ` Luck, Tony
@ 2025-11-11 17:16 ` Dave Martin
2025-11-14 18:51 ` Luck, Tony
2 siblings, 1 reply; 85+ messages in thread
From: Dave Martin @ 2025-11-11 17:16 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Chen Yu, x86, linux-kernel,
patches
Hi Tony,
On Wed, Nov 05, 2025 at 03:31:07PM -0800, Luck, Tony wrote:
> Hi Dave,
>
> Thanks for taking time to review. You did unearth one big bug
> and I'm super-grateful for that.
>
> On Wed, Nov 05, 2025 at 02:42:18PM +0000, Dave Martin wrote:
> > Hi Tony,
> >
> > A few drive-by nits from me -- apologies, I hadn't looked at this in a
> > while.
> >
> > On Wed, Oct 29, 2025 at 09:20:55AM -0700, Tony Luck wrote:
> > > resctrl assumes that all monitor events can be displayed as unsigned
> > > decimal integers.
> > >
> > > Hardware architecture counters may provide some telemetry events with
> > > greater precision where the event is not a simple count, but is a
> > > measurement of some sort (e.g. Joules for energy consumed).
> > >
> > > Add a new argument to resctrl_enable_mon_event() for architecture code
> > > to inform the file system that the value for a counter is a fixed-point
> > > value with a specific number of binary places.
> > > Only allow architecture to use floating point format on events that the
> > > file system has marked with mon_evt::is_floating_point.
> > >
> > > Display fixed point values with values rounded to an appropriate number
> > > of decimal places for the precision of the number of binary places
> > > provided. Add one extra decimal place for every three additional binary
> >
> > (Is this just informal wording? If not, it's wrong...)
>
> Informal. It isn't far off from the table. Once out of the small numbers
> the number of decimal places does increment after each group of three.
>
> >
> > > places, except for low precision binary values where exact representation
> > > is possible:
> > >
> > > 1 binary place is 0.0 or 0.5 => 1 decimal place
> > > 2 binary places is 0.0, 0.25, 0.5, 0.75 => 2 decimal places
> > > 3 binary places is 0.0, 0.125, etc. => 3 decimal places
> >
> > What's the rationale for this special treatment? I don't see any
> > previous discussion (apologies if I missed it).
>
> The strict log10(2) calculations below throw away some precision from
> these cases. I thought that was bad.
It depends what is meant by "precision".
We can't magic up accuracy that isn't present in the counters, just by
including extra digits when formatting.
So long as we format values in such a way that every counter value is
formatted in a unique way, we are as precise as it is possible to be.
If I didn't confuse myself, ceil(binary_bits * log10(2)) is the
smallest number of fractional decimal digits that provide this
guarantee.
(This may seem pedantic -- partly, I was wondering what was so special
about implementations with fewer than 3 binary places that they needed
special treatment -- I think that still hasn't been answered?)
[...]
> > > +/* Limit for mon_evt::binary_bits */
> > > +#define MAX_BINARY_BITS 27
> > > +
> >
> > Could this be up to 30?
>
> Yes.
>
> > (The formatting code relies on the product of the maximum fraction
> > value with 10^decplaces[] not exceeding a u64, so I think 30 bits
> > fits? But this only has to be as large as the largest value required
> > by some supported piece of hardware... I didn't go check on that.)
>
> I only have one data point. The Intel telemetry events are using 18
> binary places.
Ah, right.
[...]
> > > diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
> > > index 883be6f0810f..290a959776de 100644
> > > --- a/fs/resctrl/ctrlmondata.c
> > > +++ b/fs/resctrl/ctrlmondata.c
[...]
> > > @@ -597,6 +598,87 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> > > resctrl_arch_mon_ctx_free(r, evt->evtid, rr->arch_mon_ctx);
> > > }
> > >
> > > +/*
> > > + * Decimal place precision to use for each number of fixed-point
> > > + * binary bits.
> > > + */
> > > +static unsigned int decplaces[MAX_BINARY_BITS + 1] = {
> >
> > ^ const
>
> OK
>
> >
> > Also, maybe explicitly initialise
> >
> > [0] = 1,
>
> OK (though this might only occur if there is an event that resctrl says
> must be fixed point, with a h/w implementation that provides a simple
> integer).
>
> > here? (See print_event_value().)
> >
> > > + [1] = 1,
> > > + [2] = 2,
> > > + [3] = 3,
> > > + [4] = 3,
> > > + [5] = 3,
> > > + [6] = 3,
> > > + [7] = 3,
> > > + [8] = 3,
> > > + [9] = 3,
> > > + [10] = 4,
> >
> > Why these specific values?
>
> For 1, 2, 3 binary bits you get an exact decimal representation
> with 1, 2, 3 decimal places. I kept the "3" going from 4 to 9
> bits because it should output at least as many places as 3 bits.
>
> After that I started stepping every 3 extra bits.
>
> > ceil(binary_bits * log10(2)) makes sense if we want to expose all
> > available hardware precision with as few digits as possible.
> >
> > floor(binary_bits * log10(2)) makes sense if we want expose as many
> > digits as possible without advertising spurious precision.
> >
> > Disregarding the special-casing for binary_bits <= 3, still neither
> > option quite seems to match this list.
>
> Side-by-side comparison:
>
> #include <stdio.h>
> #include <math.h>
>
> static unsigned int tony[] = {
> [0] = 0, [1] = 1, [2] = 2, [3] = 3, [4] = 3, [5] = 3,
> [6] = 3, [7] = 3, [8] = 3, [9] = 3, [10] = 4, [11] = 4,
> [12] = 4, [13] = 5, [14] = 5, [15] = 5, [16] = 6, [17] = 6,
> [18] = 6, [19] = 7, [20] = 7, [21] = 7, [22] = 8, [23] = 8,
> [24] = 8, [25] = 9, [26] = 9, [27] = 9
> };
>
> int main(void)
> {
> int binary_bits;
> double log10_2 = log10(2.0);
>
> printf("bits:\tceil\tfloor\ttony\n");
> for (binary_bits = 0; binary_bits < 28; binary_bits++)
> printf("%d:\t%d\t%d\t%d\n",
> binary_bits,
> (int)ceil(binary_bits * log10_2),
> (int)floor(binary_bits * log10_2),
> tony[binary_bits]);
>
> return 0;
> }
>
> bits: ceil floor tony
> 0: 0 0 0
> 1: 1 0 1
> 2: 1 0 2
> 3: 1 0 3
> 4: 2 1 3
> 5: 2 1 3
> 6: 2 1 3
> 7: 3 2 3
> 8: 3 2 3
> 9: 3 2 3
> 10: 4 3 4
> 11: 4 3 4
> 12: 4 3 4
> 13: 4 3 5
> 14: 5 4 5
> 15: 5 4 5
> 16: 5 4 6
> 17: 6 5 6
> 18: 6 5 6
> 19: 6 5 7
> 20: 7 6 7
> 21: 7 6 7
> 22: 7 6 8
> 23: 7 6 8
> 24: 8 7 8
> 25: 8 7 9
> 26: 8 7 9
> 27: 9 8 9
>
> I'm not a fan of the "floor" option. Looks like it loses precision. Terrible for
Loses precision, but does not advertise bogus precision
beyond the precision in the original value. (This is why it is not
standard to print doubles with more than 15 significant digits, even
though 17 significant digits are needed for bit-exact reproduction.)
I don't know whether this matters relative to the use cases, but it
would be nice to have some rationale.
> 1-3 binary bits. Also not what I'd like for the bits==18 case that I currently
> care about.
>
> "ceil" is good for bits > 6. Almost matches my numbers (except I jump
> to one more decimal place one binary bit earlier).
>
> What do you think of me swapping out the values from 7 upwards for the
> ceil values and documenting that 0..6 are hand-picked, but 7 and up are
> ceil(binary_bits * log10_2)?
If there is sound rationale for hand-picking some values then yes.
I haven't yet been convinced that there is ;)
(The 7 times table could doubtless be made to look nicer by hand-
picking some entries. But it wouldn't be the 7 times table any more.)
> > Rounding up means that the hardware value can be reconstructed, but
> > only if userspace knows the value of binary_bits. Should that be
> > exposed?
>
> I'm not sure I see when users would need to reconstruct the h/w value.
> General use case for these resctrl events is: read1, sleepN, read2 &
> compute rate = (read2 - read1) / N
If userspace can reconstruct the original values, it can do this
calculation more accurately.
Since the values yielded by read1 and read2 might not differ by very
much, the relative error introduced by formatting the values in decimal
_might_ be significant.
(If we include enough decimal digits that there is no error, userspace
will see unexpectedly coarse granularity in the delta read2 - read1.
And this is only practical when the number of fractional bits is small.)
Again, I don't know whether this matters for use cases, but minimising
the number of magic numbers and arbitrary tradeoffs feels like it would
hide fewer potential surprises...
> In the case of the Intel telemetry events there is some jitter around
> the timing of the reads (since events may only be updated every 2ms).
> So the error bars get big if "N" is small. Which all leads me to believe
> that a "good enough" approach to representing the event values will
> be close enough for all use cases.
Probably (and in any case, userspace is likely to be a giant hack
rather than rigorous statistical analysis).
Still, telling the userspace the actual precision the hardware supports
feels easy to do.
(It could be added later on as an extension, though.)
> > > + [11] = 4,
> > > + [12] = 4,
> > > + [13] = 5,
> > > + [14] = 5,
> > > + [15] = 5,
> > > + [16] = 6,
> > > + [17] = 6,
> > > + [18] = 6,
> > > + [19] = 7,
> > > + [20] = 7,
> > > + [21] = 7,
> > > + [22] = 8,
> > > + [23] = 8,
> > > + [24] = 8,
> > > + [25] = 9,
> > > + [26] = 9,
> > > + [27] = 9
> >
> > Documenting the rule for generating these may be a good idea unless we
> > are sure that no more entries will ever be added.
>
> Above proposal - use the ceil function for bits >= 7.
>
> > > +};
> > > +
> > > +static void print_event_value(struct seq_file *m, unsigned int binary_bits, u64 val)
> > > +{
> > > + unsigned long long frac;
> > > + char buf[10];
> >
> > In place of the magic number 10, how about
> > decplaces[MAX_BINARY_BITS] + 1 ?
> >
> > (I think the compiler should accept that as an initialiser if the array
> > is const.)
>
> If the compiler doesn't barf, then OK.
>
> > > +
> > > + if (!binary_bits) {
> > > + seq_printf(m, "%llu.0\n", val);
> > > + return;
> > > + }
> >
> > Can an initialiser for decplaces[0] reduce the special-casing for
> > binary_bits == 0?
>
> I'll check and see.
>
> > > +
> > > + /* Mask off the integer part of the fixed-point value. */
> > > + frac = val & GENMASK_ULL(binary_bits, 0);
> >
> > Should this be GENMASK_ULL(binary_bits - 1, 0)?
>
> Oops. I think you are right.
>
> > Should we be telling userspace the binary_bits value? It is not
> > (exactly) deducible from the number of decimal places printed.
>
> I could add another info file for fixed_point events to display this.
> But I'm not convinced that it would result in users doing anything
> different.
>
> Assume you just did the "read1, sleepN, read2" and got values of
> 235.617542 and 338.964815, tell me how things would be different
> if an info file said that binary_bits was 17 vs. 19?
It changes the error bars, no?
For 17 bits, ± .00000381 (approx.)
For 19 bits, ± .000000953 (approx.)
(i.e., ± 0.5 times the least-significant bit).
Whether it is important / useful to know this is usecase dependent,
though.
[...]
> > > + /* Trim trailing zeroes */
> >
> > Why?
>
> It felt good. I'm not wedded to this. Maybe saving a few cycles of
> kernel CPU time by dropping this would be good.
>
> > Would it be better to present the values with consistent precision?
>
> Humans might notice the difference. Apps reading the file aren't going
> to care.
I noticed ;) In that, there is explicit code here that seems to have
no function other than to make the output worse (i.e., more
unpredictable and with no obvious gain in usefulness).
If the number of digits is the only clue to the size of the error bars
in the readings, userspace code might well care about this.
>
> > There's no reason why a telemetry counter should settle for any length
> > of time at a tidy value, so the precision represented by the trailing
> > zeros is always significant.
>
> But x1 = atof("1.5") and x2 = atof("1.500000") ... can the subsequent
> use of x1 tell that there was less precision than x2?
Exactly. If knowledge of the error bars is needed, just knowing the
nearest real number to the measured value is insufficient.
But the number of digits is all we seem to be giving userspace to go on
here -- and we're not presenting that in a predictable way, either (?)
Cheers
---Dave
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH v13 12/32] x86,fs/resctrl: Support binary fixed point event counters
2025-11-11 17:16 ` Dave Martin
@ 2025-11-14 18:51 ` Luck, Tony
0 siblings, 0 replies; 85+ messages in thread
From: Luck, Tony @ 2025-11-14 18:51 UTC (permalink / raw)
To: Dave Martin
Cc: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Chen Yu, x86, linux-kernel,
patches
On Tue, Nov 11, 2025 at 05:16:22PM +0000, Dave Martin wrote:
... snip
> > I'm not a fan of the "floor" option. Looks like it loses precision. Terrible for
>
> Loses precision, but does not advertise bogus precision
> beyond the precision in the original value. (This is why it is not
> standard to print doubles with more than 15 significant digits, even
> though 17 significant digits are needed for bit-exact reproduction.)
>
> I don't know whether this matters relative to the use cases, but it
> would be nice to have some rationale.
>
> > 1-3 binary bits. Also not what I'd like for the bits==18 case that I currently
> > care about.
> >
> > "ceil" is good for bits > 6. Almost matches my numbers (except I jump
> > to one more decimal place one binary bit earlier).
> >
> > What do you think of me swapping out the values from 7 upwards for the
> > ceil values and documenting that 0..6 are hand-picked, but 7 and up are
> > ceil(binary_bits * log10_2)?
>
> If there is sound rationale for hand-picking some values then yes.
>
> I haven't yet been convinced that there is ;)
I don't have a rationale and I've been doing the thing I tell others
not to do "getting attached to code that I wrote". I will switch the
whole table to the ceil(binary_bits * log10_2) values.
One exception for binary_bits == 0. Back in the v6 version of these
patches I printed as a plain integer. Reinette commented[1]:
At this time I understand that it will be clear for which
events user space expects floating point numbers. If the architecture in
turn does not support any "binary bits" then I think resctrl
should still print a floating point number ("x.0") to match user space
expectation.
-Tony
Link: https://lore.kernel.org/all/8214ae1f-d64c-496c-b41d-13b31250acea@intel.com/ [1]
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH v13 12/32] x86,fs/resctrl: Support binary fixed point event counters
2025-11-05 14:42 ` Dave Martin
2025-11-05 23:31 ` Luck, Tony
@ 2025-11-10 16:52 ` Luck, Tony
2025-11-11 17:34 ` Dave Martin
1 sibling, 1 reply; 85+ messages in thread
From: Luck, Tony @ 2025-11-10 16:52 UTC (permalink / raw)
To: Dave Martin
Cc: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Chen Yu, x86, linux-kernel,
patches
On Wed, Nov 05, 2025 at 02:42:18PM +0000, Dave Martin wrote:
> > +static void print_event_value(struct seq_file *m, unsigned int binary_bits, u64 val)
> > +{
> > + unsigned long long frac;
> > + char buf[10];
>
> In place of the magic number 10, how about
> decplaces[MAX_BINARY_BITS] + 1 ?
>
> (I think the compiler should accept that as an initialiser if the array
> is const.)
The compiler (gcc 15.2.1) accepts without any warnings. But generates
different code.
sparse complains:
fs/resctrl/ctrlmondata.c:640:45: warning: Variable length array is used.
I may change the hard coded constant to 21 (guaranteed to be big enough
for a "long long" plus terminating NUL byte.)
-Tony
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH v13 12/32] x86,fs/resctrl: Support binary fixed point event counters
2025-11-10 16:52 ` Luck, Tony
@ 2025-11-11 17:34 ` Dave Martin
0 siblings, 0 replies; 85+ messages in thread
From: Dave Martin @ 2025-11-11 17:34 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Chen Yu, x86, linux-kernel,
patches
Hi Tony,
On Mon, Nov 10, 2025 at 08:52:52AM -0800, Luck, Tony wrote:
> On Wed, Nov 05, 2025 at 02:42:18PM +0000, Dave Martin wrote:
> > > +static void print_event_value(struct seq_file *m, unsigned int binary_bits, u64 val)
> > > +{
> > > + unsigned long long frac;
> > > + char buf[10];
> >
> > In place of the magic number 10, how about
> > decplaces[MAX_BINARY_BITS] + 1 ?
> >
> > (I think the compiler should accept that as an initialiser if the array
> > is const.)
>
> The compiler (gcc 15.2.1) accepts without any warnings. But generates
> different code.
>
> sparse complains:
> fs/resctrl/ctrlmondata.c:640:45: warning: Variable length array is used.
Hmmm. Shame.
(Of course, this is only a warning. sparse may not know how to
determine that the resulting buffer is limited to a sane size, but
looking at the code makes it pretty obvious. Perhaps best avoided,
though.)
> I may change the hard coded constant to 21 (guaranteed to be big enough
> for a "long long" plus terminating NUL byte.)
I guess. We may be able to sidestep this, though (see my other reply
about getting rid of buf[] altogether.)
Cheers
---Dave
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH v13 12/32] x86,fs/resctrl: Support binary fixed point event counters
2025-10-29 16:20 ` [PATCH v13 12/32] x86,fs/resctrl: Support binary fixed point event counters Tony Luck
2025-11-05 14:42 ` Dave Martin
@ 2025-11-12 13:08 ` David Laight
1 sibling, 0 replies; 85+ messages in thread
From: David Laight @ 2025-11-12 13:08 UTC (permalink / raw)
To: Tony Luck
Cc: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu, x86,
linux-kernel, patches
On Wed, 29 Oct 2025 09:20:55 -0700
Tony Luck <tony.luck@intel.com> wrote:
> resctrl assumes that all monitor events can be displayed as unsigned
> decimal integers.
>
> Hardware architecture counters may provide some telemetry events with
> greater precision where the event is not a simple count, but is a
> measurement of some sort (e.g. Joules for energy consumed).
>
> Add a new argument to resctrl_enable_mon_event() for architecture code
> to inform the file system that the value for a counter is a fixed-point
> value with a specific number of binary places.
> Only allow architecture to use floating point format on events that the
> file system has marked with mon_evt::is_floating_point.
>
> Display fixed point values with values rounded to an appropriate number
> of decimal places for the precision of the number of binary places
> provided. Add one extra decimal place for every three additional binary
> places, except for low precision binary values where exact representation
> is possible:
>
> 1 binary place is 0.0 or 0.5 => 1 decimal place
> 2 binary places is 0.0, 0.25, 0.5, 0.75 => 2 decimal places
> 3 binary places is 0.0, 0.125, etc. => 3 decimal places
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
> ---
> include/linux/resctrl.h | 3 +-
> fs/resctrl/internal.h | 8 +++
> arch/x86/kernel/cpu/resctrl/core.c | 6 +--
> fs/resctrl/ctrlmondata.c | 84 ++++++++++++++++++++++++++++++
> fs/resctrl/monitor.c | 10 +++-
> 5 files changed, 105 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 702205505dc9..a7e5a546152d 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -409,7 +409,8 @@ u32 resctrl_arch_get_num_closid(struct rdt_resource *r);
> u32 resctrl_arch_system_num_rmid_idx(void);
> int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
>
> -void resctrl_enable_mon_event(enum resctrl_event_id eventid, bool any_cpu);
> +void resctrl_enable_mon_event(enum resctrl_event_id eventid, bool any_cpu,
> + unsigned int binary_bits);
>
> bool resctrl_is_mon_event_enabled(enum resctrl_event_id eventid);
>
> diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
> index 40b76eaa33d0..f5189b6771a0 100644
> --- a/fs/resctrl/internal.h
> +++ b/fs/resctrl/internal.h
> @@ -62,6 +62,9 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
> * Only valid if @evtid is an MBM event.
> * @configurable: true if the event is configurable
> * @any_cpu: true if the event can be read from any CPU
> + * @is_floating_point: event values are displayed in floating point format
> + * @binary_bits: number of fixed-point binary bits from architecture,
> + * only valid if @is_floating_point is true
> * @enabled: true if the event is enabled
> */
> struct mon_evt {
> @@ -71,6 +74,8 @@ struct mon_evt {
> u32 evt_cfg;
> bool configurable;
> bool any_cpu;
> + bool is_floating_point;
> + unsigned int binary_bits;
> bool enabled;
> };
Nit: You've added 4 bytes of padding.
David
^ permalink raw reply [flat|nested] 85+ messages in thread
* [PATCH v13 13/32] x86,fs/resctrl: Add an architectural hook called for each mount
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
` (11 preceding siblings ...)
2025-10-29 16:20 ` [PATCH v13 12/32] x86,fs/resctrl: Support binary fixed point event counters Tony Luck
@ 2025-10-29 16:20 ` Tony Luck
2025-10-29 16:20 ` [PATCH v13 14/32] x86,fs/resctrl: Add and initialize rdt_resource for package scope monitor Tony Luck
` (20 subsequent siblings)
33 siblings, 0 replies; 85+ messages in thread
From: Tony Luck @ 2025-10-29 16:20 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Enumeration of Intel telemetry events is an asynchronous process involving
several mutually dependent drivers added as auxiliary devices during
the device_initcall() phase of Linux boot. The process finishes after
the probe functions of these drivers complete. But this happens after
resctrl_arch_late_init() is executed.
Tracing the enumeration process shows that it does complete a full seven
seconds before the earliest possible mount of the resctrl file system
(when included in /etc/fstab for automatic mount by systemd).
Add a hook at the beginning of the mount code that will be used
to check for telemetry events and initialize if any are found.
Call the hook on every attempted mount. Expectations are that
most actions (like enumeration) will only need to be performed
on the first call.
resctrl filesystem calls the hook with no locks held. Architecture code
is responsible for any required locking.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
include/linux/resctrl.h | 6 ++++++
arch/x86/kernel/cpu/resctrl/core.c | 9 +++++++++
fs/resctrl/rdtgroup.c | 2 ++
3 files changed, 17 insertions(+)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index a7e5a546152d..1634db6176c3 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -511,6 +511,12 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
void resctrl_online_cpu(unsigned int cpu);
void resctrl_offline_cpu(unsigned int cpu);
+/*
+ * Architecture hook called at beginning of each file system mount attempt.
+ * No locks are held.
+ */
+void resctrl_arch_pre_mount(void);
+
/**
* resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
* for this resource and domain.
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index c435319552be..980a4d9e5267 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -721,6 +721,15 @@ static int resctrl_arch_offline_cpu(unsigned int cpu)
return 0;
}
+void resctrl_arch_pre_mount(void)
+{
+ static atomic_t only_once = ATOMIC_INIT(0);
+ int old = 0;
+
+ if (!atomic_try_cmpxchg(&only_once, &old, 1))
+ return;
+}
+
enum {
RDT_FLAG_CMT,
RDT_FLAG_MBM_TOTAL,
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index e0eb766c5cf4..a4d4d4080e87 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2720,6 +2720,8 @@ static int rdt_get_tree(struct fs_context *fc)
struct rdt_resource *r;
int ret;
+ resctrl_arch_pre_mount();
+
cpus_read_lock();
mutex_lock(&rdtgroup_mutex);
/*
--
2.51.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH v13 14/32] x86,fs/resctrl: Add and initialize rdt_resource for package scope monitor
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
` (12 preceding siblings ...)
2025-10-29 16:20 ` [PATCH v13 13/32] x86,fs/resctrl: Add an architectural hook called for each mount Tony Luck
@ 2025-10-29 16:20 ` Tony Luck
2025-11-13 4:04 ` Reinette Chatre
2025-10-29 16:20 ` [PATCH v13 15/32] fs/resctrl: Cleanup as L3 is no longer the only monitor resource Tony Luck
` (19 subsequent siblings)
33 siblings, 1 reply; 85+ messages in thread
From: Tony Luck @ 2025-10-29 16:20 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Add a new PERF_PKG resource and introduce package level scope for
monitoring telemetry events so that CPU hot plug notifiers can build
domains at the package granularity.
Use the physical package ID available via topology_physical_package_id()
to identify the monitoring domains with package level scope. This enables
user space to use:
/sys/devices/system/cpu/cpuX/topology/physical_package_id
to identify the monitoring domain a CPU is associated with.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 2 ++
fs/resctrl/internal.h | 2 ++
arch/x86/kernel/cpu/resctrl/core.c | 10 ++++++++++
fs/resctrl/rdtgroup.c | 2 ++
4 files changed, 16 insertions(+)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 1634db6176c3..9fe6205743b7 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -53,6 +53,7 @@ enum resctrl_res_level {
RDT_RESOURCE_L2,
RDT_RESOURCE_MBA,
RDT_RESOURCE_SMBA,
+ RDT_RESOURCE_PERF_PKG,
/* Must be the last */
RDT_NUM_RESOURCES,
@@ -267,6 +268,7 @@ enum resctrl_scope {
RESCTRL_L2_CACHE = 2,
RESCTRL_L3_CACHE = 3,
RESCTRL_L3_NODE,
+ RESCTRL_PACKAGE,
};
/**
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index f5189b6771a0..96d97f4ff957 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -255,6 +255,8 @@ struct rdtgroup {
#define RFTYPE_ASSIGN_CONFIG BIT(11)
+#define RFTYPE_RES_PERF_PKG BIT(11)
+
#define RFTYPE_CTRL_INFO (RFTYPE_INFO | RFTYPE_CTRL)
#define RFTYPE_MON_INFO (RFTYPE_INFO | RFTYPE_MON)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 980a4d9e5267..af555dadf024 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -100,6 +100,14 @@ struct rdt_hw_resource rdt_resources_all[RDT_NUM_RESOURCES] = {
.schema_fmt = RESCTRL_SCHEMA_RANGE,
},
},
+ [RDT_RESOURCE_PERF_PKG] =
+ {
+ .r_resctrl = {
+ .name = "PERF_PKG",
+ .mon_scope = RESCTRL_PACKAGE,
+ .mon_domains = mon_domain_init(RDT_RESOURCE_PERF_PKG),
+ },
+ },
};
u32 resctrl_arch_system_num_rmid_idx(void)
@@ -435,6 +443,8 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
return get_cpu_cacheinfo_id(cpu, scope);
case RESCTRL_L3_NODE:
return cpu_to_node(cpu);
+ case RESCTRL_PACKAGE:
+ return topology_physical_package_id(cpu);
default:
break;
}
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index a4d4d4080e87..26f0d1f93da2 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2330,6 +2330,8 @@ static unsigned long fflags_from_resource(struct rdt_resource *r)
case RDT_RESOURCE_MBA:
case RDT_RESOURCE_SMBA:
return RFTYPE_RES_MB;
+ case RDT_RESOURCE_PERF_PKG:
+ return RFTYPE_RES_PERF_PKG;
}
return WARN_ON_ONCE(1);
--
2.51.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* Re: [PATCH v13 14/32] x86,fs/resctrl: Add and initialize rdt_resource for package scope monitor
2025-10-29 16:20 ` [PATCH v13 14/32] x86,fs/resctrl: Add and initialize rdt_resource for package scope monitor Tony Luck
@ 2025-11-13 4:04 ` Reinette Chatre
0 siblings, 0 replies; 85+ messages in thread
From: Reinette Chatre @ 2025-11-13 4:04 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
in subject "rdt_resource" -> "struct rdt_resource"
On 10/29/25 9:20 AM, Tony Luck wrote:
> diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
> index f5189b6771a0..96d97f4ff957 100644
> --- a/fs/resctrl/internal.h
> +++ b/fs/resctrl/internal.h
> @@ -255,6 +255,8 @@ struct rdtgroup {
>
> #define RFTYPE_ASSIGN_CONFIG BIT(11)
>
> +#define RFTYPE_RES_PERF_PKG BIT(11)
> +
This needs a new bit number after rebase on the assignable counter work.
Reinette
^ permalink raw reply [flat|nested] 85+ messages in thread
* [PATCH v13 15/32] fs/resctrl: Cleanup as L3 is no longer the only monitor resource
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
` (13 preceding siblings ...)
2025-10-29 16:20 ` [PATCH v13 14/32] x86,fs/resctrl: Add and initialize rdt_resource for package scope monitor Tony Luck
@ 2025-10-29 16:20 ` Tony Luck
2025-11-13 4:05 ` Reinette Chatre
2025-10-29 16:20 ` [PATCH v13 16/32] x86/resctrl: Discover hardware telemetry events Tony Luck
` (18 subsequent siblings)
33 siblings, 1 reply; 85+ messages in thread
From: Tony Luck @ 2025-10-29 16:20 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
The feature to sum event data across multiple domains supports systems
with Sub-NUMA Cluster (SNC) mode enabled. The top-level monitoring files
in each "mon_L3_XX" directory provide the sum of data across all SNC
nodes sharing an L3 cache instance while the "mon_sub_L3_YY" sub-directories
provide the event data of the individual nodes.
SNC is only associated with the L3 resource and domains and as a result
the flow handling the sum of event data implicitly assumes it is
working with the L3 resource and domains.
Reading of telemetry events does not require summing event data, so this
feature can remain dedicated to SNC and keep the implicit assumption
of working with the L3 resource and domains.
Add a WARN to where the implicit assumption of working with the L3 resource
is made and add comments on how the structure controlling the event sum
feature is used.
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
fs/resctrl/internal.h | 4 ++--
fs/resctrl/ctrlmondata.c | 8 +++++++-
fs/resctrl/rdtgroup.c | 3 ++-
3 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 96d97f4ff957..39bdaf45fa2a 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -92,8 +92,8 @@ extern struct mon_evt mon_event_all[QOS_NUM_EVENTS];
* @list: Member of the global @mon_data_kn_priv_list list.
* @rid: Resource id associated with the event file.
* @evt: Event structure associated with the event file.
- * @sum: Set when event must be summed across multiple
- * domains.
+ * @sum: Set for RDT_RESOURCE_L3 when event must be summed
+ * across multiple domains.
* @domid: When @sum is zero this is the domain to which
* the event file belongs. When @sum is one this
* is the id of the L3 cache that all domains to be
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index 290a959776de..f7fbfc4d258d 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -683,7 +683,6 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
{
struct kernfs_open_file *of = m->private;
enum resctrl_res_level resid;
- struct rdt_l3_mon_domain *d;
struct rdt_domain_hdr *hdr;
struct rmid_read rr = {0};
struct rdtgroup *rdtgrp;
@@ -711,6 +710,13 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
r = resctrl_arch_get_resource(resid);
if (md->sum) {
+ struct rdt_l3_mon_domain *d;
+
+ if (WARN_ON_ONCE(resid != RDT_RESOURCE_L3)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
/*
* This file requires summing across all domains that share
* the L3 cache id that was provided in the "domid" field of the
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 26f0d1f93da2..fa1398787e83 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -3030,7 +3030,8 @@ static void rmdir_all_sub(void)
* @rid: The resource id for the event file being created.
* @domid: The domain id for the event file being created.
* @mevt: The type of event file being created.
- * @do_sum: Whether SNC summing monitors are being created.
+ * @do_sum: Whether SNC summing monitors are being created. Only set
+ * when @rid == RDT_RESOURCE_L3.
*/
static struct mon_data *mon_get_kn_priv(enum resctrl_res_level rid, int domid,
struct mon_evt *mevt,
--
2.51.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* Re: [PATCH v13 15/32] fs/resctrl: Cleanup as L3 is no longer the only monitor resource
2025-10-29 16:20 ` [PATCH v13 15/32] fs/resctrl: Cleanup as L3 is no longer the only monitor resource Tony Luck
@ 2025-11-13 4:05 ` Reinette Chatre
0 siblings, 0 replies; 85+ messages in thread
From: Reinette Chatre @ 2025-11-13 4:05 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
I do not see how the subject matches this patch. How about something like:
"fs/resctrl: Emphasize that L3 monitoring resource is required for summing domains"
On 10/29/25 9:20 AM, Tony Luck wrote:
> The feature to sum event data across multiple domains supports systems
> with Sub-NUMA Cluster (SNC) mode enabled. The top-level monitoring files
> in each "mon_L3_XX" directory provide the sum of data across all SNC
> nodes sharing an L3 cache instance while the "mon_sub_L3_YY" sub-directories
> provide the event data of the individual nodes.
>
> SNC is only associated with the L3 resource and domains and as a result
> the flow handling the sum of event data implicitly assumes it is
> working with the L3 resource and domains.
>
> Reading of telemetry events do not require to sum event data so this
> feature can remain dedicated to SNC and keep the implicit assumption
> of working with the L3 resource and domains.
>
> Add a WARN to where the implicit assumption of working with the L3 resource
> is made and add comments on how the structure controlling the event sum
> feature is used.
>
> Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
| Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reinette
^ permalink raw reply [flat|nested] 85+ messages in thread
* [PATCH v13 16/32] x86/resctrl: Discover hardware telemetry events
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
` (14 preceding siblings ...)
2025-10-29 16:20 ` [PATCH v13 15/32] fs/resctrl: Cleanup as L3 is no longer the only monitor resource Tony Luck
@ 2025-10-29 16:20 ` Tony Luck
2025-11-13 4:11 ` Reinette Chatre
2025-10-29 16:21 ` [PATCH v13 17/32] x86,fs/resctrl: Fill in details of events for guid 0x26696143 and 0x26557651 Tony Luck
` (17 subsequent siblings)
33 siblings, 1 reply; 85+ messages in thread
From: Tony Luck @ 2025-10-29 16:20 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Each CPU collects data for telemetry events that it sends to the nearest
telemetry event aggregator either when the value of MSR_IA32_PQR_ASSOC.RMID
changes, or when a two millisecond timer expires.
There is a guid and an MMIO region associated with each aggregator. The
combination of the guid and the size of the MMIO region links to an XML description
of the set of telemetry events tracked by the aggregator. XML files are published
by Intel in a GitHub repository [1].
The telemetry event aggregators maintain per-RMID, per-event totals across
all the CPUs that report to them. There may be multiple telemetry
event aggregators per package.
There are separate sets of aggregators for each type of event, but all
aggregators with the same guid are symmetric keeping counts for the same
set of events for the CPUs that provide data to them.
The XML file for each aggregator provides the following information:
1) Which telemetry events are included in the group.
2) The order in which the event counters appear for each RMID.
3) The value type of each event counter (integer or fixed-point).
4) The number of RMIDs supported.
5) Which additional aggregator status registers are included.
6) The total size of the MMIO region for an aggregator.
The resctrl implementation condenses the relevant information from the
XML file into some of the fields of struct event_group.
The INTEL_PMT_TELEMETRY driver enumerates support for telemetry events. This
driver provides intel_pmt_get_regions_by_feature() to list all available telemetry
event aggregators of a given enum pmt_feature_id type. The list includes the
"guid", the base address in MMIO space for the region where the event counters
are exposed, and the package id where all the CPUs that report to this
aggregator are located.
A theoretical example struct pmt_feature_group returned from the INTEL_PMT_TELEMETRY
driver for events of type FEATURE_PER_RMID_PERF_TELEM could look like this:
+-------------------------------+
| count = 6 |
+-------------------------------+
| [0] guid_1 size_1 pkg1 addr_1 |
+-------------------------------+
| [1] guid_1 size_1 pkg2 addr_2 |
+-------------------------------+
| [2] guid_2 size_2 pkg1 addr_3 |
+-------------------------------+
| [3] guid_2 size_2 pkg1 addr_4 |
+-------------------------------+
| [4] guid_2 size_2 pkg2 addr_5 |
+-------------------------------+
| [5] guid_2 size_2 pkg2 addr_6 |
+-------------------------------+
This provides details for "perf" aggregators with two guids. If resctrl
has an event_group for both of these guids it will get two copies of this
struct pmt_feature_group by calling intel_pmt_get_regions_by_feature()
once for each. event_group::pfg will point to the copy acquired from
each call.
On the call for guid_1 it will see there is just one aggregator per package,
so resctrl can read event counts from MMIO addr_1 on package 1 and
addr_2 on package 2.
There are two aggregators listed on each package for guid_2. So resctrl must
read counters from addr_3 and addr_4 and sum them to provide the result for package
1. Similarly addr_5 and addr_6 must be read and summed for event counts on
package 2.
resctrl will silently ignore unknown guid values.
Add a new Kconfig option CONFIG_X86_CPU_RESCTRL_INTEL_AET for the Intel-specific
parts of the telemetry code. This depends on the INTEL_PMT_TELEMETRY and INTEL_TPMI
drivers being built into the kernel for enumeration of telemetry features.
Call INTEL_PMT_TELEMETRY's intel_pmt_get_regions_by_feature() for each guid
known to resctrl (using the appropriate enum pmt_feature_id argument for that
guid) to obtain a private copy of struct pmt_feature_group that contains all
discovered/enumerated telemetry aggregator data for all event groups (known and
unknown to resctrl) of that pmt_feature_id. Further processing on this structure
will enable all supported events in resctrl.
Return the struct pmt_feature_group to INTEL_PMT_TELEMETRY at resctrl exit time.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://github.com/intel/Intel-PMT # [1]
---
arch/x86/kernel/cpu/resctrl/internal.h | 8 ++
arch/x86/kernel/cpu/resctrl/core.c | 5 +
arch/x86/kernel/cpu/resctrl/intel_aet.c | 122 ++++++++++++++++++++++++
arch/x86/Kconfig | 13 +++
arch/x86/kernel/cpu/resctrl/Makefile | 1 +
5 files changed, 149 insertions(+)
create mode 100644 arch/x86/kernel/cpu/resctrl/intel_aet.c
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 14fadcff0d2b..886261a82b81 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -217,4 +217,12 @@ void __init intel_rdt_mbm_apply_quirk(void);
void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
void resctrl_arch_mbm_cntr_assign_set_one(struct rdt_resource *r);
+#ifdef CONFIG_X86_CPU_RESCTRL_INTEL_AET
+bool intel_aet_get_events(void);
+void __exit intel_aet_exit(void);
+#else
+static inline bool intel_aet_get_events(void) { return false; }
+static inline void __exit intel_aet_exit(void) { }
+#endif
+
#endif /* _ASM_X86_RESCTRL_INTERNAL_H */
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index af555dadf024..648f44cff52c 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -738,6 +738,9 @@ void resctrl_arch_pre_mount(void)
if (!atomic_try_cmpxchg(&only_once, &old, 1))
return;
+
+ if (!intel_aet_get_events())
+ return;
}
enum {
@@ -1095,6 +1098,8 @@ late_initcall(resctrl_arch_late_init);
static void __exit resctrl_arch_exit(void)
{
+ intel_aet_exit();
+
cpuhp_remove_state(rdt_online);
resctrl_exit();
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
new file mode 100644
index 000000000000..02bbe7872fcf
--- /dev/null
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -0,0 +1,122 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Resource Director Technology(RDT)
+ * - Intel Application Energy Telemetry
+ *
+ * Copyright (C) 2025 Intel Corporation
+ *
+ * Author:
+ * Tony Luck <tony.luck@intel.com>
+ */
+
+#define pr_fmt(fmt) "resctrl: " fmt
+
+#include <linux/array_size.h>
+#include <linux/cleanup.h>
+#include <linux/cpu.h>
+#include <linux/err.h>
+#include <linux/init.h>
+#include <linux/intel_pmt_features.h>
+#include <linux/intel_vsec.h>
+#include <linux/overflow.h>
+#include <linux/resctrl.h>
+#include <linux/stddef.h>
+#include <linux/types.h>
+
+#include "internal.h"
+
+/**
+ * struct event_group - All information about a group of telemetry events.
+ * @feature: Argument to intel_pmt_get_regions_by_feature() to
+ * discover if this event_group is supported.
+ * @pfg: Points to the aggregated telemetry space information
+ * returned by the intel_pmt_get_regions_by_feature()
+ * call to the INTEL_PMT_TELEMETRY driver that contains
+ * data for all telemetry regions of a specific type.
+ * Valid if the system supports the event group.
+ * NULL otherwise.
+ * @guid: Unique number per XML description file.
+ */
+struct event_group {
+ /* Data fields for additional structures to manage this group. */
+ enum pmt_feature_id feature;
+ struct pmt_feature_group *pfg;
+
+ /* Remaining fields initialized from XML file. */
+ u32 guid;
+};
+
+/*
+ * Link: https://github.com/intel/Intel-PMT
+ * File: xml/CWF/OOBMSM/RMID-ENERGY/cwf_aggregator.xml
+ */
+static struct event_group energy_0x26696143 = {
+ .feature = FEATURE_PER_RMID_ENERGY_TELEM,
+ .guid = 0x26696143,
+};
+
+/*
+ * Link: https://github.com/intel/Intel-PMT
+ * File: xml/CWF/OOBMSM/RMID-PERF/cwf_aggregator.xml
+ */
+static struct event_group perf_0x26557651 = {
+ .feature = FEATURE_PER_RMID_PERF_TELEM,
+ .guid = 0x26557651,
+};
+
+static struct event_group *known_event_groups[] = {
+ &energy_0x26696143,
+ &perf_0x26557651,
+};
+
+#define for_each_event_group(_peg) \
+ for (_peg = known_event_groups; \
+ _peg < &known_event_groups[ARRAY_SIZE(known_event_groups)]; \
+ _peg++)
+
+/* Stub for now */
+static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
+{
+ return false;
+}
+
+/*
+ * Make a request to the INTEL_PMT_TELEMETRY driver for a copy of the
+ * pmt_feature_group for each known feature. If there is one, the returned
+ * structure has an array of telemetry_region structures, each element of
+ * the array describes one telemetry aggregator.
+ * A single pmt_feature_group may include multiple different guids.
+ * Try to use every telemetry aggregator with a known guid.
+ */
+bool intel_aet_get_events(void)
+{
+ struct pmt_feature_group *p;
+ struct event_group **peg;
+ bool ret = false;
+
+ for_each_event_group(peg) {
+ p = intel_pmt_get_regions_by_feature((*peg)->feature);
+ if (IS_ERR_OR_NULL(p))
+ continue;
+ if (enable_events(*peg, p)) {
+ (*peg)->pfg = p;
+ ret = true;
+ } else {
+ intel_pmt_put_feature_group(p);
+ }
+ }
+
+ return ret;
+}
+
+void __exit intel_aet_exit(void)
+{
+ struct event_group **peg;
+
+ for_each_event_group(peg) {
+ if ((*peg)->pfg) {
+ intel_pmt_put_feature_group((*peg)->pfg);
+ (*peg)->pfg = NULL;
+ }
+ }
+}
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fa3b616af03a..da5775056ec8 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -536,6 +536,19 @@ config X86_CPU_RESCTRL
Say N if unsure.
+config X86_CPU_RESCTRL_INTEL_AET
+ bool "Intel Application Energy Telemetry"
+ depends on X86_CPU_RESCTRL && CPU_SUP_INTEL && INTEL_PMT_TELEMETRY=y && INTEL_TPMI=y
+ help
+ Enable per-RMID telemetry events in resctrl.
+
+ Intel feature that collects per-RMID execution data
+ about energy consumption, measure of frequency independent
+ activity and other performance metrics. Data is aggregated
+ per package.
+
+ Say N if unsure.
+
config X86_FRED
bool "Flexible Return and Event Delivery"
depends on X86_64
diff --git a/arch/x86/kernel/cpu/resctrl/Makefile b/arch/x86/kernel/cpu/resctrl/Makefile
index d8a04b195da2..273ddfa30836 100644
--- a/arch/x86/kernel/cpu/resctrl/Makefile
+++ b/arch/x86/kernel/cpu/resctrl/Makefile
@@ -1,6 +1,7 @@
# SPDX-License-Identifier: GPL-2.0
obj-$(CONFIG_X86_CPU_RESCTRL) += core.o rdtgroup.o monitor.o
obj-$(CONFIG_X86_CPU_RESCTRL) += ctrlmondata.o
+obj-$(CONFIG_X86_CPU_RESCTRL_INTEL_AET) += intel_aet.o
obj-$(CONFIG_RESCTRL_FS_PSEUDO_LOCK) += pseudo_lock.o
# To allow define_trace.h's recursive include:
--
2.51.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* Re: [PATCH v13 16/32] x86/resctrl: Discover hardware telemetry events
2025-10-29 16:20 ` [PATCH v13 16/32] x86/resctrl: Discover hardware telemetry events Tony Luck
@ 2025-11-13 4:11 ` Reinette Chatre
0 siblings, 0 replies; 85+ messages in thread
From: Reinette Chatre @ 2025-11-13 4:11 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 10/29/25 9:20 AM, Tony Luck wrote:
> Each CPU collects data for telemetry events that it sends to the nearest
> telemetry event aggregator either when the value of MSR_IA32_PQR_ASSOC.RMID
> changes, or when a two millisecond timer expires.
>
> There is a guid and an MMIO region associated with each aggregator. The
Could this be:
There is a feature type ("energy" or "perf"), guid, and MMIO region
associated with each aggregator. This combination links to an XML ...
> combination of the guid and the size of the MMIO region link to an XML description
> of the set of telemetry events tracked by the aggregator. XML files are published
> by Intel in a GitHub repository [1].
>
> The telemetry event aggregators maintain per-RMID per-event counts of the
> total seen for all the CPUs. There may be multiple telemetry
> event aggregators per package.
>
> There are separate sets of aggregators for each type of event, but all
"type of event" -> "feature type, for example "perf" or "energy". All ..."
Would this be accurate:
There are separate sets of aggregators for each feature type. Aggregators
in a set may have different guids. All aggregators with the same feature
type and guid are symmetric ...
> aggregators with the same guid are symmetric keeping counts for the same
> set of events for the CPUs that provide data to them.
>
> The XML file for each aggregator provides the following information:
Would this be accurate?
"0) Feature type of the events ("perf" or "energy")"?
> 1) Which telemetry events are included in the group.
"included in the group" -> "tracked by the aggregator"?
> 2) The order in which the event counters appear for each RMID.
> 3) The value type of each event counter (integer or fixed-point).
> 4) The number of RMIDs supported.
> 5) Which additional aggregator status registers are included.
> 6) The total size of the MMIO region for an aggregator.
>
> The resctrl implementation condenses the relevant information from the
> XML file into some of the fields of struct event_group.
(above implies struct event_group already exists)
How about:
"Introduce struct event_group that condenses the relevant information
from an XML file. Hereafter an "event group" refers to a group of events of
a particular feature type ("energy" or "perf") with a particular guid."
(Above also tries to give definition to "event group" mentioned in v12, please
do correct and feel free to improve)
> The INTEL_PMT_TELEMETRY driver enumerates support for telemetry events. This
> driver provides intel_pmt_get_regions_by_feature() to list all available telemetry
> event aggregators of a given enum pmt_feature_id type. The list includes the
(at this point it should be clear what "feature type" means).
"enum pmt_feature_id type" -> "feature type"
> "guid", the base address in MMIO space for the region where the event counters
> are exposed, and the package id where the all the CPUs that report to this
> aggregator are located.
>
The example below describes behavior of implementation in a particular scenario before introducing
the implementation itself. I think the general changelog can help to explain that such a scenario is
possible and I tried to do so with the earlier sample text. If you feel that this example is
still helpful then I would propose it be placed at the end of the changelog, after general
description of implementation, with a heading that reader can use to decide whether to read or
skip what follows. Placing it in maintainer notes is also an option.
> A theoretical example struct pmt_feature_group returned from the INTEL_PMT_TELEMETRY
> driver for events of type FEATURE_PER_RMID_PERF_TELEM could look like this:
>
> +-------------------------------+
> | count = 6 |
> +-------------------------------+
> | [0] guid_1 size_1 pkg1 addr_1 |
> +-------------------------------+
> | [1] guid_1 size_1 pkg2 addr_2 |
> +-------------------------------+
> | [2] guid_2 size_2 pkg1 addr_3 |
> +-------------------------------+
> | [3] guid_2 size_2 pkg1 addr_4 |
> +-------------------------------+
> | [4] guid_2 size_2 pkg2 addr_5 |
> +-------------------------------+
> | [5] guid_2 size_2 pkg2 addr_6 |
> +-------------------------------+
>
> This provides details for "perf" aggregators with two guids. If resctrl
> has an event_group for both of these guids it will get two copies of this
> struct pmt_feature_group by calling intel_pmt_get_regions_by_feature()
> once for each. event_group::pfg will point to the copy acquired from
> each call.
>
> On the call for guid1 it will see there is just one aggregator per package for
> guid_1. So resctrl can read event counts from the MMIO addr_1 on package 1 and
> addr_2 on package 2.
>
> There are two aggregators listed on each package for guid_2. So resctrl must
> read counters from addr_3 and addr_4 and sum them to provide result for package
> 1. Similarly addr_5 and addr_6 must be read and summed for event counts on
> package 2.
>
> resctrl will silently ignore unknown guid values.
>
> Add a new Kconfig option CONFIG_X86_CPU_RESCTRL_INTEL_AET for the Intel specific
> parts of telemetry code. This depends on the INTEL_PMT_TELEMETRY and INTEL_TPMI
> drivers being built-in to the kernel for enumeration of telemetry features.
>
The paragraph below seems better suited to follow right after the "The INTEL_PMT_TELEMETRY
driver enumerates support ..." paragraph.
> Call INTEL_PMT_TELEMETRY's intel_pmt_get_regions_by_feature() for each guid
> known to resctrl (using the appropriate enum pmt_feature_id argument for that
> guid) to obtain a private copy of struct pmt_feature_group that contains all
"each guid known to resctrl (using the appropriate enum pmt_feature_id argument for that
guid)" -> "each event group"?
> discovered/enumerated telemetry aggregator data for all event groups (known and
> unknown to resctrl) of that pmt_feature_id. Further processing on this structure
> will enable all supported events in resctrl.
I think the above tries to be too specific in what the code does while also trying to
explain the flow at a high level. I find the result difficult to parse. Consider something like:
Call INTEL_PMT_TELEMETRY's intel_pmt_get_regions_by_feature() for each event group
to obtain a private copy of that event group's aggregator data. Duplicate the aggregator
data between event groups that have the same feature type but different guid. Further
processing on this private copy will be unique to the event group.
Return the aggregator data to INTEL_PMT_TELEMETRY at resctrl exit time.
>
> Return the struct pmt_feature_group to INTEL_PMT_TELEMETRY at resctrl exit time.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> Link: https://github.com/intel/Intel-PMT # [1]
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 8 ++
> arch/x86/kernel/cpu/resctrl/core.c | 5 +
> arch/x86/kernel/cpu/resctrl/intel_aet.c | 122 ++++++++++++++++++++++++
> arch/x86/Kconfig | 13 +++
> arch/x86/kernel/cpu/resctrl/Makefile | 1 +
> 5 files changed, 149 insertions(+)
> create mode 100644 arch/x86/kernel/cpu/resctrl/intel_aet.c
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 14fadcff0d2b..886261a82b81 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -217,4 +217,12 @@ void __init intel_rdt_mbm_apply_quirk(void);
> void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
> void resctrl_arch_mbm_cntr_assign_set_one(struct rdt_resource *r);
>
> +#ifdef CONFIG_X86_CPU_RESCTRL_INTEL_AET
> +bool intel_aet_get_events(void);
> +void __exit intel_aet_exit(void);
> +#else
> +static inline bool intel_aet_get_events(void) { return false; }
> +static inline void __exit intel_aet_exit(void) { }
> +#endif
> +
> #endif /* _ASM_X86_RESCTRL_INTERNAL_H */
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index af555dadf024..648f44cff52c 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -738,6 +738,9 @@ void resctrl_arch_pre_mount(void)
>
> if (!atomic_try_cmpxchg(&only_once, &old, 1))
> return;
> +
> + if (!intel_aet_get_events())
> + return;
> }
>
> enum {
> @@ -1095,6 +1098,8 @@ late_initcall(resctrl_arch_late_init);
>
> static void __exit resctrl_arch_exit(void)
> {
> + intel_aet_exit();
> +
> cpuhp_remove_state(rdt_online);
>
> resctrl_exit();
> diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> new file mode 100644
> index 000000000000..02bbe7872fcf
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> @@ -0,0 +1,122 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Resource Director Technology(RDT)
> + * - Intel Application Energy Telemetry
> + *
> + * Copyright (C) 2025 Intel Corporation
> + *
> + * Author:
> + * Tony Luck <tony.luck@intel.com>
> + */
> +
> +#define pr_fmt(fmt) "resctrl: " fmt
> +
> +#include <linux/array_size.h>
> +#include <linux/cleanup.h>
> +#include <linux/cpu.h>
> +#include <linux/err.h>
> +#include <linux/init.h>
> +#include <linux/intel_pmt_features.h>
> +#include <linux/intel_vsec.h>
> +#include <linux/overflow.h>
> +#include <linux/resctrl.h>
> +#include <linux/stddef.h>
> +#include <linux/types.h>
> +
> +#include "internal.h"
> +
> +/**
> + * struct event_group - All information about a group of telemetry events.
Trying to answer my own question from v12. How about:
"Events with the same feature type ("energy" or "perf") and guid"
> + * @feature: Argument to intel_pmt_get_regions_by_feature() to
> + * discover if this event_group is supported.
"discover if this event_group is supported" - this does not seem accurate (and
contradicts the @pfg description)
How about just:
"Type of events, for example FEATURE_PER_RMID_PERF_TELEM or FEATURE_PER_RMID_ENERGY_TELEM, in this group."
> + * @pfg: Points to the aggregated telemetry space information
> + * returned by the intel_pmt_get_regions_by_feature()
> + * call to the INTEL_PMT_TELEMETRY driver that contains
> + * data for all telemetry regions of a specific type.
"of a specific type" -> "of type @feature"
> + * Valid if the system supports the event group.
> + * NULL otherwise.
> + * @guid: Unique number per XML description file.
> + */
> +struct event_group {
> + /* Data fields for additional structures to manage this group. */
> + enum pmt_feature_id feature;
> + struct pmt_feature_group *pfg;
> +
> + /* Remaining fields initialized from XML file. */
> + u32 guid;
> +};
> +
> +/*
> + * Link: https://github.com/intel/Intel-PMT
> + * File: xml/CWF/OOBMSM/RMID-ENERGY/cwf_aggregator.xml
> + */
> +static struct event_group energy_0x26696143 = {
> + .feature = FEATURE_PER_RMID_ENERGY_TELEM,
> + .guid = 0x26696143,
> +};
> +
> +/*
> + * Link: https://github.com/intel/Intel-PMT
> + * File: xml/CWF/OOBMSM/RMID-PERF/cwf_aggregator.xml
> + */
> +static struct event_group perf_0x26557651 = {
> + .feature = FEATURE_PER_RMID_PERF_TELEM,
> + .guid = 0x26557651,
> +};
> +
> +static struct event_group *known_event_groups[] = {
> + &energy_0x26696143,
> + &perf_0x26557651,
> +};
Placing all event groups in same array is a great enhancement.
> +
> +#define for_each_event_group(_peg) \
> + for (_peg = known_event_groups; \
> + _peg < &known_event_groups[ARRAY_SIZE(known_event_groups)]; \
> + _peg++)
> +
> +/* Stub for now */
> +static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
> +{
> + return false;
> +}
> +
> +/*
> + * Make a request to the INTEL_PMT_TELEMETRY driver for a copy of the
> + * pmt_feature_group for each known feature. If there is one, the returned
> + * structure has an array of telemetry_region structures, each element of
> + * the array describes one telemetry aggregator.
> + * A single pmt_feature_group may include multiple different guids.
> + * Try to use every telemetry aggregator with a known guid.
Same feedback as v12.
> + */
> +bool intel_aet_get_events(void)
> +{
Reinette
^ permalink raw reply [flat|nested] 85+ messages in thread
* [PATCH v13 17/32] x86,fs/resctrl: Fill in details of events for guid 0x26696143 and 0x26557651
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
` (15 preceding siblings ...)
2025-10-29 16:20 ` [PATCH v13 16/32] x86/resctrl: Discover hardware telemetry events Tony Luck
@ 2025-10-29 16:21 ` Tony Luck
2025-11-13 22:38 ` Reinette Chatre
2025-10-29 16:21 ` [PATCH v13 18/32] x86,fs/resctrl: Add architectural event pointer Tony Luck
` (16 subsequent siblings)
33 siblings, 1 reply; 85+ messages in thread
From: Tony Luck @ 2025-10-29 16:21 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
The Intel Clearwater Forest CPU supports two RMID-based PMT feature
groups documented in the xml/CWF/OOBMSM/RMID-ENERGY/cwf_aggregator.xml and
xml/CWF/OOBMSM/RMID-PERF/cwf_aggregator.xml files in the Intel PMT GIT repository
[1].
The struct pmt_feature_group provided by the INTEL_PMT_TELEMETRY driver lists the
guid and other information for each aggregator of a given type (energy or perf)
present on the system.
resctrl has a condensed form of the XML description in struct
event_group. An event group is enabled if the pfg field points
to a struct pmt_feature_group.
The counter offsets in MMIO space are arranged in groups for each RMID.
E.g. the "energy" counters for guid 0x26696143 are arranged like this:
MMIO offset:0x0000 Counter for RMID 0 PMT_EVENT_ENERGY
MMIO offset:0x0008 Counter for RMID 0 PMT_EVENT_ACTIVITY
MMIO offset:0x0010 Counter for RMID 1 PMT_EVENT_ENERGY
MMIO offset:0x0018 Counter for RMID 1 PMT_EVENT_ACTIVITY
...
MMIO offset:0x23F0 Counter for RMID 575 PMT_EVENT_ENERGY
MMIO offset:0x23F8 Counter for RMID 575 PMT_EVENT_ACTIVITY
After all counters there are three status registers that provide
indications of how many times an aggregator was unable to process
event counts, the time stamp for the most recent loss of data, and
the time stamp of the most recent successful update.
MMIO offset:0x2400 AGG_DATA_LOSS_COUNT
MMIO offset:0x2408 AGG_DATA_LOSS_TIMESTAMP
MMIO offset:0x2410 LAST_UPDATE_TIMESTAMP
Define these events in the file system code and add the events
to the event_group structures.
PMT_EVENT_ENERGY and PMT_EVENT_ACTIVITY are produced in fixed-point
format. File system code must output them as floating-point values.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://github.com/intel/Intel-PMT # [1]
---
include/linux/resctrl_types.h | 11 +++++++
arch/x86/kernel/cpu/resctrl/intel_aet.c | 43 +++++++++++++++++++++++++
fs/resctrl/monitor.c | 35 +++++++++++---------
3 files changed, 74 insertions(+), 15 deletions(-)
diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h
index acfe07860b34..a5f56faa18d2 100644
--- a/include/linux/resctrl_types.h
+++ b/include/linux/resctrl_types.h
@@ -50,6 +50,17 @@ enum resctrl_event_id {
QOS_L3_MBM_TOTAL_EVENT_ID = 0x02,
QOS_L3_MBM_LOCAL_EVENT_ID = 0x03,
+ /* Intel Telemetry Events */
+ PMT_EVENT_ENERGY,
+ PMT_EVENT_ACTIVITY,
+ PMT_EVENT_STALLS_LLC_HIT,
+ PMT_EVENT_C1_RES,
+ PMT_EVENT_UNHALTED_CORE_CYCLES,
+ PMT_EVENT_STALLS_LLC_MISS,
+ PMT_EVENT_AUTO_C6_RES,
+ PMT_EVENT_UNHALTED_REF_CYCLES,
+ PMT_EVENT_UOPS_RETIRED,
+
/* Must be the last */
QOS_NUM_EVENTS,
};
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index 02bbe7872fcf..5aec929c3441 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -13,6 +13,7 @@
#include <linux/array_size.h>
#include <linux/cleanup.h>
+#include <linux/compiler_types.h>
#include <linux/cpu.h>
#include <linux/err.h>
#include <linux/init.h>
@@ -20,11 +21,27 @@
#include <linux/intel_vsec.h>
#include <linux/overflow.h>
#include <linux/resctrl.h>
+#include <linux/resctrl_types.h>
#include <linux/stddef.h>
#include <linux/types.h>
#include "internal.h"
+/**
+ * struct pmt_event - Telemetry event.
+ * @id: Resctrl event id.
+ * @idx: Counter index within each per-RMID block of counters.
> + * @bin_bits: Zero for integer-valued events, else number of bits in fraction
+ * part of fixed-point.
+ */
+struct pmt_event {
+ enum resctrl_event_id id;
+ unsigned int idx;
+ unsigned int bin_bits;
+};
+
+#define EVT(_id, _idx, _bits) { .id = _id, .idx = _idx, .bin_bits = _bits }
+
/**
* struct event_group - All information about a group of telemetry events.
* @feature: Argument to intel_pmt_get_regions_by_feature() to
@@ -36,6 +53,9 @@
* Valid if the system supports the event group.
* NULL otherwise.
* @guid: Unique number per XML description file.
+ * @mmio_size: Number of bytes of MMIO registers for this group.
+ * @num_events: Number of events in this group.
+ * @evts: Array of event descriptors.
*/
struct event_group {
/* Data fields for additional structures to manage this group. */
@@ -44,8 +64,14 @@ struct event_group {
/* Remaining fields initialized from XML file. */
u32 guid;
+ size_t mmio_size;
+ unsigned int num_events;
+ struct pmt_event evts[] __counted_by(num_events);
};
+#define XML_MMIO_SIZE(num_rmids, num_events, num_extra_status) \
+ (((num_rmids) * (num_events) + (num_extra_status)) * sizeof(u64))
+
/*
* Link: https://github.com/intel/Intel-PMT
* File: xml/CWF/OOBMSM/RMID-ENERGY/cwf_aggregator.xml
@@ -53,6 +79,12 @@ struct event_group {
static struct event_group energy_0x26696143 = {
.feature = FEATURE_PER_RMID_ENERGY_TELEM,
.guid = 0x26696143,
+ .mmio_size = XML_MMIO_SIZE(576, 2, 3),
+ .num_events = 2,
+ .evts = {
+ EVT(PMT_EVENT_ENERGY, 0, 18),
+ EVT(PMT_EVENT_ACTIVITY, 1, 18),
+ }
};
/*
@@ -62,6 +94,17 @@ static struct event_group energy_0x26696143 = {
static struct event_group perf_0x26557651 = {
.feature = FEATURE_PER_RMID_PERF_TELEM,
.guid = 0x26557651,
+ .mmio_size = XML_MMIO_SIZE(576, 7, 3),
+ .num_events = 7,
+ .evts = {
+ EVT(PMT_EVENT_STALLS_LLC_HIT, 0, 0),
+ EVT(PMT_EVENT_C1_RES, 1, 0),
+ EVT(PMT_EVENT_UNHALTED_CORE_CYCLES, 2, 0),
+ EVT(PMT_EVENT_STALLS_LLC_MISS, 3, 0),
+ EVT(PMT_EVENT_AUTO_C6_RES, 4, 0),
+ EVT(PMT_EVENT_UNHALTED_REF_CYCLES, 5, 0),
+ EVT(PMT_EVENT_UOPS_RETIRED, 6, 0),
+ }
};
static struct event_group *known_event_groups[] = {
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 7d1b65316bc8..ea7cc0a3340c 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -952,27 +952,32 @@ static void dom_data_exit(struct rdt_resource *r)
mutex_unlock(&rdtgroup_mutex);
}
+#define MON_EVENT(_eventid, _name, _res, _fp) \
+ [_eventid] = { \
+ .name = _name, \
+ .evtid = _eventid, \
+ .rid = _res, \
+ .is_floating_point = _fp, \
+}
+
/*
* All available events. Architecture code marks the ones that
* are supported by a system using resctrl_enable_mon_event()
* to set .enabled.
*/
struct mon_evt mon_event_all[QOS_NUM_EVENTS] = {
- [QOS_L3_OCCUP_EVENT_ID] = {
- .name = "llc_occupancy",
- .evtid = QOS_L3_OCCUP_EVENT_ID,
- .rid = RDT_RESOURCE_L3,
- },
- [QOS_L3_MBM_TOTAL_EVENT_ID] = {
- .name = "mbm_total_bytes",
- .evtid = QOS_L3_MBM_TOTAL_EVENT_ID,
- .rid = RDT_RESOURCE_L3,
- },
- [QOS_L3_MBM_LOCAL_EVENT_ID] = {
- .name = "mbm_local_bytes",
- .evtid = QOS_L3_MBM_LOCAL_EVENT_ID,
- .rid = RDT_RESOURCE_L3,
- },
+ MON_EVENT(QOS_L3_OCCUP_EVENT_ID, "llc_occupancy", RDT_RESOURCE_L3, false),
+ MON_EVENT(QOS_L3_MBM_TOTAL_EVENT_ID, "mbm_total_bytes", RDT_RESOURCE_L3, false),
+ MON_EVENT(QOS_L3_MBM_LOCAL_EVENT_ID, "mbm_local_bytes", RDT_RESOURCE_L3, false),
+ MON_EVENT(PMT_EVENT_ENERGY, "core_energy", RDT_RESOURCE_PERF_PKG, true),
+ MON_EVENT(PMT_EVENT_ACTIVITY, "activity", RDT_RESOURCE_PERF_PKG, true),
+ MON_EVENT(PMT_EVENT_STALLS_LLC_HIT, "stalls_llc_hit", RDT_RESOURCE_PERF_PKG, false),
+ MON_EVENT(PMT_EVENT_C1_RES, "c1_res", RDT_RESOURCE_PERF_PKG, false),
+ MON_EVENT(PMT_EVENT_UNHALTED_CORE_CYCLES, "unhalted_core_cycles", RDT_RESOURCE_PERF_PKG, false),
+ MON_EVENT(PMT_EVENT_STALLS_LLC_MISS, "stalls_llc_miss", RDT_RESOURCE_PERF_PKG, false),
+ MON_EVENT(PMT_EVENT_AUTO_C6_RES, "c6_res", RDT_RESOURCE_PERF_PKG, false),
+ MON_EVENT(PMT_EVENT_UNHALTED_REF_CYCLES, "unhalted_ref_cycles", RDT_RESOURCE_PERF_PKG, false),
+ MON_EVENT(PMT_EVENT_UOPS_RETIRED, "uops_retired", RDT_RESOURCE_PERF_PKG, false),
};
void resctrl_enable_mon_event(enum resctrl_event_id eventid, bool any_cpu, unsigned int binary_bits)
--
2.51.0
* Re: [PATCH v13 17/32] x86,fs/resctrl: Fill in details of events for guid 0x26696143 and 0x26557651
2025-10-29 16:21 ` [PATCH v13 17/32] x86,fs/resctrl: Fill in details of events for guid 0x26696143 and 0x26557651 Tony Luck
@ 2025-11-13 22:38 ` Reinette Chatre
0 siblings, 0 replies; 85+ messages in thread
From: Reinette Chatre @ 2025-11-13 22:38 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 10/29/25 9:21 AM, Tony Luck wrote:
> The Intel Clearwater Forest CPU supports two RMID-based PMT feature
> groups documented in the xml/CWF/OOBMSM/RMID-ENERGY/cwf_aggregator.xml and
It makes the series easier to read when it uses consistent terms throughout and
avoids needing to keep redefining things. With "feature type" defined in previous patch
I think "PMT feature group" can be replaced with "feature type" and then the second
and third paragraph can be dropped?
To help keep things consistent this could be something like below that also gives
users a link that works since, for example, appending the provided
"xml/CWF/OOBMSM/RMID-ENERGY/cwf_aggregator.xml" to provided
https://github.com/intel/Intel-PMT results in a "Not Found".
The telemetry event aggregators of the Intel Clearwater Forest CPU
support two RMID-based feature types: "energy" with guid 0x26696143 [1], and
"perf" with guid 0x26557651 [2].
Link: https://github.com/intel/Intel-PMT/blob/main/xml/CWF/OOBMSM/RMID-ENERGY/cwf_aggregator.xml # [1]
Link: https://github.com/intel/Intel-PMT/blob/main/xml/CWF/OOBMSM/RMID-PERF/cwf_aggregator.xml # [2]
> xml/CWF/OOBMSM/RMID-PERF/cwf_aggregator.xml files in the Intel PMT GIT repository
> [1].
Taking a step back, there seems to be an ordering issue since the previous patch
already uses the two guids "introduced" here.
>
> The struct pmt_feature_group provided by the INTEL_PMT_TELEMETRY driver lists the
> guid and other information for each aggregator of a given type (energy or perf)
> present on the system.
>
> resctrl has a condensed form of the XML description in struct
> event_group. An event group is enabled if the pfg field points
> to a struct pmt_feature_group.
Above two paragraphs can be dropped. (They were anyway just pasted in without maintaining style or
connection with the rest of the changelog.)
>
> The counter offsets in MMIO space are arranged in groups for each RMID.
(to remind reader the context)
"The counter offsets in MMIO space" -> "The event counter offsets in an aggregator's MMIO space"
>
> E.g. the "energy" counters for guid 0x26696143 are arranged like this:
>
> MMIO offset:0x0000 Counter for RMID 0 PMT_EVENT_ENERGY
> MMIO offset:0x0008 Counter for RMID 0 PMT_EVENT_ACTIVITY
> MMIO offset:0x0010 Counter for RMID 1 PMT_EVENT_ENERGY
> MMIO offset:0x0018 Counter for RMID 1 PMT_EVENT_ACTIVITY
> ...
> MMIO offset:0x23F0 Counter for RMID 575 PMT_EVENT_ENERGY
> MMIO offset:0x23F8 Counter for RMID 575 PMT_EVENT_ACTIVITY
>
> After all counters there are three status registers that provide
> indications of how many times an aggregator was unable to process
> event counts, the time stamp for the most recent loss of data, and
> the time stamp of the most recent successful update.
>
> MMIO offset:0x2400 AGG_DATA_LOSS_COUNT
> MMIO offset:0x2408 AGG_DATA_LOSS_TIMESTAMP
> MMIO offset:0x2410 LAST_UPDATE_TIMESTAMP
>
> Define these events in the file system code and add the events
(to be specific about which events)
"these events" -> "the events tracked by the aggregators"?
> to the event_group structures.
>
> PMT_EVENT_ENERGY and PMT_EVENT_ACTIVITY are produced in fixed point
> format. File system code must output as floating point values.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> Link: https://github.com/intel/Intel-PMT # [1]
> ---
I think it will be more appropriate if the event groups initialized here,
for example "static struct event_group energy_0x26696143", are defined
as part of this patch also.
Reinette
* [PATCH v13 18/32] x86,fs/resctrl: Add architectural event pointer
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
` (16 preceding siblings ...)
2025-10-29 16:21 ` [PATCH v13 17/32] x86,fs/resctrl: Fill in details of events for guid 0x26696143 and 0x26557651 Tony Luck
@ 2025-10-29 16:21 ` Tony Luck
2025-10-29 16:21 ` [PATCH v13 19/32] x86/resctrl: Find and enable usable telemetry events Tony Luck
` (15 subsequent siblings)
33 siblings, 0 replies; 85+ messages in thread
From: Tony Luck @ 2025-10-29 16:21 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
The resctrl file system layer passes the domain, RMID, and event id to the
architecture to fetch an event counter.
Fetching a telemetry event counter requires additional information that is
private to the architecture, for example, the offset into MMIO space from
where the counter should be read.
Add mon_evt::arch_priv that the architecture can use for any private data
related to the event. The resctrl filesystem initializes mon_evt::arch_priv
when the architecture enables the event and passes it back to the
architecture when it needs to fetch an event counter.
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
include/linux/resctrl.h | 7 +++++--
fs/resctrl/internal.h | 4 ++++
arch/x86/kernel/cpu/resctrl/core.c | 6 +++---
arch/x86/kernel/cpu/resctrl/monitor.c | 2 +-
fs/resctrl/monitor.c | 14 ++++++++++----
5 files changed, 23 insertions(+), 10 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 9fe6205743b7..34ad0f5f1309 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -412,7 +412,7 @@ u32 resctrl_arch_system_num_rmid_idx(void);
int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
void resctrl_enable_mon_event(enum resctrl_event_id eventid, bool any_cpu,
- unsigned int binary_bits);
+ unsigned int binary_bits, void *arch_priv);
bool resctrl_is_mon_event_enabled(enum resctrl_event_id eventid);
@@ -529,6 +529,9 @@ void resctrl_arch_pre_mount(void);
* only.
* @rmid: rmid of the counter to read.
* @eventid: eventid to read, e.g. L3 occupancy.
+ * @arch_priv: Architecture private data for this event.
+ * The @arch_priv provided by the architecture via
+ * resctrl_enable_mon_event().
* @val: result of the counter read in bytes.
* @arch_mon_ctx: An architecture specific value from
* resctrl_arch_mon_ctx_alloc(), for MPAM this identifies
@@ -546,7 +549,7 @@ void resctrl_arch_pre_mount(void);
*/
int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain_hdr *hdr,
u32 closid, u32 rmid, enum resctrl_event_id eventid,
- u64 *val, void *arch_mon_ctx);
+ void *arch_priv, u64 *val, void *arch_mon_ctx);
/**
* resctrl_arch_rmid_read_context_check() - warn about invalid contexts
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 39bdaf45fa2a..46fd648a2961 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -66,6 +66,9 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
* @binary_bits: number of fixed-point binary bits from architecture,
* only valid if @is_floating_point is true
* @enabled: true if the event is enabled
+ * @arch_priv: Architecture private data for this event.
+ * The @arch_priv provided by the architecture via
+ * resctrl_enable_mon_event().
*/
struct mon_evt {
enum resctrl_event_id evtid;
@@ -77,6 +80,7 @@ struct mon_evt {
bool is_floating_point;
unsigned int binary_bits;
bool enabled;
+ void *arch_priv;
};
extern struct mon_evt mon_event_all[QOS_NUM_EVENTS];
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 648f44cff52c..d759093e7dce 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -915,15 +915,15 @@ static __init bool get_rdt_mon_resources(void)
bool ret = false;
if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC)) {
- resctrl_enable_mon_event(QOS_L3_OCCUP_EVENT_ID, false, 0);
+ resctrl_enable_mon_event(QOS_L3_OCCUP_EVENT_ID, false, 0, NULL);
ret = true;
}
if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
- resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID, false, 0);
+ resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID, false, 0, NULL);
ret = true;
}
if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL)) {
- resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID, false, 0);
+ resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID, false, 0, NULL);
ret = true;
}
if (rdt_cpu_has(X86_FEATURE_ABMC))
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 2d1453c905bc..2f62a834787d 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -240,7 +240,7 @@ static u64 get_corrected_val(struct rdt_resource *r, struct rdt_l3_mon_domain *d
int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain_hdr *hdr,
u32 unused, u32 rmid, enum resctrl_event_id eventid,
- u64 *val, void *ignored)
+ void *arch_priv, u64 *val, void *ignored)
{
struct rdt_hw_l3_mon_domain *hw_dom;
struct rdt_l3_mon_domain *d;
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index ea7cc0a3340c..9cc54d04b2ac 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -137,9 +137,11 @@ void __check_limbo(struct rdt_l3_mon_domain *d, bool force_free)
struct rmid_entry *entry;
u32 idx, cur_idx = 1;
void *arch_mon_ctx;
+ void *arch_priv;
bool rmid_dirty;
u64 val = 0;
+ arch_priv = mon_event_all[QOS_L3_OCCUP_EVENT_ID].arch_priv;
arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, QOS_L3_OCCUP_EVENT_ID);
if (IS_ERR(arch_mon_ctx)) {
pr_warn_ratelimited("Failed to allocate monitor context: %ld",
@@ -160,7 +162,7 @@ void __check_limbo(struct rdt_l3_mon_domain *d, bool force_free)
entry = __rmid_entry(idx);
if (resctrl_arch_rmid_read(r, &d->hdr, entry->closid, entry->rmid,
- QOS_L3_OCCUP_EVENT_ID, &val,
+ QOS_L3_OCCUP_EVENT_ID, arch_priv, &val,
arch_mon_ctx)) {
rmid_dirty = true;
} else {
@@ -452,7 +454,8 @@ static int __l3_mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr,
rr->evt->evtid, &tval);
else
rr->err = resctrl_arch_rmid_read(rr->r, rr->hdr, closid, rmid,
- rr->evt->evtid, &tval, rr->arch_mon_ctx);
+ rr->evt->evtid, rr->evt->arch_priv,
+ &tval, rr->arch_mon_ctx);
if (rr->err)
return rr->err;
@@ -481,7 +484,8 @@ static int __l3_mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr,
rr->evt->evtid, &tval);
else
err = resctrl_arch_rmid_read(rr->r, &d->hdr, closid, rmid,
- rr->evt->evtid, &tval, rr->arch_mon_ctx);
+ rr->evt->evtid, rr->evt->arch_priv,
+ &tval, rr->arch_mon_ctx);
if (!err) {
rr->val += tval;
ret = 0;
@@ -980,7 +984,8 @@ struct mon_evt mon_event_all[QOS_NUM_EVENTS] = {
MON_EVENT(PMT_EVENT_UOPS_RETIRED, "uops_retired", RDT_RESOURCE_PERF_PKG, false),
};
-void resctrl_enable_mon_event(enum resctrl_event_id eventid, bool any_cpu, unsigned int binary_bits)
+void resctrl_enable_mon_event(enum resctrl_event_id eventid, bool any_cpu,
+ unsigned int binary_bits, void *arch_priv)
{
if (WARN_ON_ONCE(eventid < QOS_FIRST_EVENT || eventid >= QOS_NUM_EVENTS ||
binary_bits > MAX_BINARY_BITS))
@@ -996,6 +1001,7 @@ void resctrl_enable_mon_event(enum resctrl_event_id eventid, bool any_cpu, unsig
mon_event_all[eventid].any_cpu = any_cpu;
mon_event_all[eventid].binary_bits = binary_bits;
+ mon_event_all[eventid].arch_priv = arch_priv;
mon_event_all[eventid].enabled = true;
}
--
2.51.0
* [PATCH v13 19/32] x86/resctrl: Find and enable usable telemetry events
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
` (17 preceding siblings ...)
2025-10-29 16:21 ` [PATCH v13 18/32] x86,fs/resctrl: Add architectural event pointer Tony Luck
@ 2025-10-29 16:21 ` Tony Luck
2025-11-13 22:46 ` Reinette Chatre
2025-10-29 16:21 ` [PATCH v13 20/32] x86/resctrl: Read " Tony Luck
` (14 subsequent siblings)
33 siblings, 1 reply; 85+ messages in thread
From: Tony Luck @ 2025-10-29 16:21 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
resctrl requests a copy of struct pmt_feature_group from the INTEL_PMT_TELEMETRY
driver for each event_group known to resctrl.
Scan pmt_feature_group::regions[] and mark those that fail the following tests:
1) guid does not match the guid for the event_group.
2) Package ID is invalid.
3) The enumerated size of the MMIO region does not match the expected
value from the XML description file.
If there are any regions that pass all of these checks, enable each of the
telemetry events in event_group::evts[].
Note that it is architecturally possible that some telemetry events are only
supported by a subset of the packages in the system. It is not expected that
systems will ever do this. If they do the user will see event files in resctrl
that always return "Unavailable".
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/intel_aet.c | 58 ++++++++++++++++++++++++-
1 file changed, 56 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index 5aec929c3441..e8da70eaa7c6 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -20,9 +20,11 @@
#include <linux/intel_pmt_features.h>
#include <linux/intel_vsec.h>
#include <linux/overflow.h>
+#include <linux/printk.h>
#include <linux/resctrl.h>
#include <linux/resctrl_types.h>
#include <linux/stddef.h>
+#include <linux/topology.h>
#include <linux/types.h>
#include "internal.h"
@@ -117,12 +119,64 @@ static struct event_group *known_event_groups[] = {
_peg < &known_event_groups[ARRAY_SIZE(known_event_groups)]; \
_peg++)
-/* Stub for now */
-static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
+/*
+ * Clear the address field of regions that did not pass the checks in
+ * skip_telem_region() so they will not be used by intel_aet_read_event().
+ * This is safe to do because intel_pmt_get_regions_by_feature() allocates
+ * a new pmt_feature_group structure to return to each caller and only makes
+ * use of the pmt_feature_group::kref field when intel_pmt_put_feature_group()
+ * returns the structure.
+ */
+static void mark_telem_region_unusable(struct telemetry_region *tr)
+{
+ tr->addr = NULL;
+}
+
+static bool skip_telem_region(struct telemetry_region *tr, struct event_group *e)
{
+ if (tr->guid != e->guid)
+ return true;
+ if (tr->plat_info.package_id >= topology_max_packages()) {
+ pr_warn("Bad package %u in guid 0x%x\n", tr->plat_info.package_id,
+ tr->guid);
+ return true;
+ }
+ if (tr->size != e->mmio_size) {
+ pr_warn("MMIO space wrong size (%zu bytes) for guid 0x%x. Expected %zu bytes.\n",
+ tr->size, e->guid, e->mmio_size);
+ return true;
+ }
+
return false;
}
+static bool group_has_usable_regions(struct event_group *e, struct pmt_feature_group *p)
+{
+ bool usable_regions = false;
+
+ for (int i = 0; i < p->count; i++) {
+ if (skip_telem_region(&p->regions[i], e)) {
+ mark_telem_region_unusable(&p->regions[i]);
+ continue;
+ }
+ usable_regions = true;
+ }
+
+ return usable_regions;
+}
+
+static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
+{
+ if (!group_has_usable_regions(e, p))
+ return false;
+
+ for (int j = 0; j < e->num_events; j++)
+ resctrl_enable_mon_event(e->evts[j].id, true,
+ e->evts[j].bin_bits, &e->evts[j]);
+
+ return true;
+}
+
/*
* Make a request to the INTEL_PMT_TELEMETRY driver for a copy of the
* pmt_feature_group for each known feature. If there is one, the returned
--
2.51.0
* Re: [PATCH v13 19/32] x86/resctrl: Find and enable usable telemetry events
2025-10-29 16:21 ` [PATCH v13 19/32] x86/resctrl: Find and enable usable telemetry events Tony Luck
@ 2025-11-13 22:46 ` Reinette Chatre
0 siblings, 0 replies; 85+ messages in thread
From: Reinette Chatre @ 2025-11-13 22:46 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 10/29/25 9:21 AM, Tony Luck wrote:
> resctrl requests a copy of struct pmt_feature_group from the INTEL_PMT_TELEMETRY
> driver for each event_group known to resctrl.
>
> Scan pmt_feature_group::regions[] and mark those that fail the following tests:
>
> 1) guid does not match the guid for the event_group.
> 2) Package ID is invalid.
> 3) The enumerated size of the MMIO region does not match the expected
> value from the XML description file.
>
> If there are any regions that pass all of these checks enable each of the
> telemetry events in event_group::evts[].
Above mostly just describes what can be seen from the patch. Below is a draft of an
attempt to change this:
Every event group has a private copy of the data of all telemetry event
aggregators (aka "telemetry regions") tracking its feature type. Included
may be regions that have the same feature type but track a different
guid from the event group's.
Traverse the event group's telemetry region data and mark all regions that
are not usable by the event group as unusable by clearing those regions'
MMIO addresses. A region is considered unusable if:
1) guid does not match the guid of the event group.
2) Package ID is invalid.
3) The enumerated size of the MMIO region does not match the expected
value from the XML description file.
Hereafter any telemetry region with an MMIO address is considered valid
for the event group it is associated with.
Enable all the event group's events as long as there is at least one usable
region from where data for its events can be read.
>
> Note that it is architecturally possible that some telemetry events are only
> supported by a subset of the packages in the system. It is not expected that
> systems will ever do this. If they do the user will see event files in resctrl
> that always return "Unavailable".
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
Patch looks good to me.
Reinette
* [PATCH v13 20/32] x86/resctrl: Read telemetry events
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
` (18 preceding siblings ...)
2025-10-29 16:21 ` [PATCH v13 19/32] x86/resctrl: Find and enable usable telemetry events Tony Luck
@ 2025-10-29 16:21 ` Tony Luck
2025-11-13 22:47 ` Reinette Chatre
2025-10-29 16:21 ` [PATCH v13 21/32] fs/resctrl: Refactor mkdir_mondata_subdir() Tony Luck
` (13 subsequent siblings)
33 siblings, 1 reply; 85+ messages in thread
From: Tony Luck @ 2025-10-29 16:21 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Introduce intel_aet_read_event() to read telemetry events for resource
RDT_RESOURCE_PERF_PKG. There may be multiple aggregators tracking each
package, so scan all of them and add up all counters. Aggregators may
return an invalid data indication if they have received no records for
a given RMID. The user will see "Unavailable" if none of the aggregators
on a package provide valid counts.
Resctrl now uses readq(), so it depends on X86_64. Update Kconfig.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/internal.h | 7 ++++
arch/x86/kernel/cpu/resctrl/intel_aet.c | 52 +++++++++++++++++++++++++
arch/x86/kernel/cpu/resctrl/monitor.c | 4 ++
fs/resctrl/monitor.c | 14 +++++++
arch/x86/Kconfig | 2 +-
5 files changed, 78 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 886261a82b81..97616c81682b 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -220,9 +220,16 @@ void resctrl_arch_mbm_cntr_assign_set_one(struct rdt_resource *r);
#ifdef CONFIG_X86_CPU_RESCTRL_INTEL_AET
bool intel_aet_get_events(void);
void __exit intel_aet_exit(void);
+int intel_aet_read_event(int domid, u32 rmid, enum resctrl_event_id evtid,
+ void *arch_priv, u64 *val);
#else
static inline bool intel_aet_get_events(void) { return false; }
static inline void __exit intel_aet_exit(void) { }
+static inline int intel_aet_read_event(int domid, u32 rmid, enum resctrl_event_id evtid,
+ void *arch_priv, u64 *val)
+{
+ return -EINVAL;
+}
#endif
#endif /* _ASM_X86_RESCTRL_INTERNAL_H */
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index e8da70eaa7c6..f64fb7d0c8a9 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -12,13 +12,17 @@
#define pr_fmt(fmt) "resctrl: " fmt
#include <linux/array_size.h>
+#include <linux/bits.h>
#include <linux/cleanup.h>
#include <linux/compiler_types.h>
+#include <linux/container_of.h>
#include <linux/cpu.h>
#include <linux/err.h>
+#include <linux/errno.h>
#include <linux/init.h>
#include <linux/intel_pmt_features.h>
#include <linux/intel_vsec.h>
+#include <linux/io.h>
#include <linux/overflow.h>
#include <linux/printk.h>
#include <linux/resctrl.h>
@@ -217,3 +221,51 @@ void __exit intel_aet_exit(void)
}
}
}
+
+#define DATA_VALID BIT_ULL(63)
+#define DATA_BITS GENMASK_ULL(62, 0)
+
+/*
+ * Read counter for an event on a domain (summing all aggregators
+ * on the domain). If an aggregator hasn't received any data for a
+ * specific RMID, the MMIO read indicates that data is not valid.
+ * Return success if at least one aggregator has valid data.
+ */
+int intel_aet_read_event(int domid, u32 rmid, enum resctrl_event_id eventid,
+ void *arch_priv, u64 *val)
+{
+ struct pmt_event *pevt = arch_priv;
+ struct event_group *e;
+ bool valid = false;
+ u64 total = 0;
+ u64 evtcount;
+ void *pevt0;
+ u32 idx;
+
+ pevt0 = pevt - pevt->idx;
+ e = container_of(pevt0, struct event_group, evts);
+ idx = rmid * e->num_events;
+ idx += pevt->idx;
+
+ if (idx * sizeof(u64) + sizeof(u64) > e->mmio_size) {
+ pr_warn_once("MMIO index %u out of range\n", idx);
+ return -EIO;
+ }
+
+ for (int i = 0; i < e->pfg->count; i++) {
+ if (!e->pfg->regions[i].addr)
+ continue;
+ if (e->pfg->regions[i].plat_info.package_id != domid)
+ continue;
+ evtcount = readq(e->pfg->regions[i].addr + idx * sizeof(u64));
+ if (!(evtcount & DATA_VALID))
+ continue;
+ total += evtcount & DATA_BITS;
+ valid = true;
+ }
+
+ if (valid)
+ *val = total;
+
+ return valid ? 0 : -EINVAL;
+}
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 2f62a834787d..3f511543748d 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -251,6 +251,10 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain_hdr *hdr,
int ret;
resctrl_arch_rmid_read_context_check();
+
+ if (r->rid == RDT_RESOURCE_PERF_PKG)
+ return intel_aet_read_event(hdr->id, rmid, eventid, arch_priv, val);
+
if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
return -EINVAL;
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 9cc54d04b2ac..a04c1724fc44 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -514,6 +514,20 @@ static int __mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
return __l3_mon_event_count(rdtgrp, rr, d);
}
+ case RDT_RESOURCE_PERF_PKG: {
+ u64 tval = 0;
+
+ rr->err = resctrl_arch_rmid_read(rr->r, rr->hdr, rdtgrp->closid,
+ rdtgrp->mon.rmid, rr->evt->evtid,
+ rr->evt->arch_priv,
+ &tval, rr->arch_mon_ctx);
+ if (rr->err)
+ return rr->err;
+
+ rr->val += tval;
+
+ return 0;
+ }
default:
rr->err = -EINVAL;
return -EINVAL;
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index da5775056ec8..60ace4427ede 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -538,7 +538,7 @@ config X86_CPU_RESCTRL
config X86_CPU_RESCTRL_INTEL_AET
bool "Intel Application Energy Telemetry"
- depends on X86_CPU_RESCTRL && CPU_SUP_INTEL && INTEL_PMT_TELEMETRY=y && INTEL_TPMI=y
+ depends on X86_64 && X86_CPU_RESCTRL && CPU_SUP_INTEL && INTEL_PMT_TELEMETRY=y && INTEL_TPMI=y
help
Enable per-RMID telemetry events in resctrl.
--
2.51.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* Re: [PATCH v13 20/32] x86/resctrl: Read telemetry events
2025-10-29 16:21 ` [PATCH v13 20/32] x86/resctrl: Read " Tony Luck
@ 2025-11-13 22:47 ` Reinette Chatre
0 siblings, 0 replies; 85+ messages in thread
From: Reinette Chatre @ 2025-11-13 22:47 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 10/29/25 9:21 AM, Tony Luck wrote:
> @@ -217,3 +221,51 @@ void __exit intel_aet_exit(void)
> }
> }
> }
> +
> +#define DATA_VALID BIT_ULL(63)
> +#define DATA_BITS GENMASK_ULL(62, 0)
> +
> +/*
> + * Read counter for an event on a domain (summing all aggregators
> + * on the domain). If an aggregator hasn't received any data for a
> + * specific RMID, the MMIO read indicates that data is not valid.
> + * Return success if at least one aggregator has valid data.
> + */
> +int intel_aet_read_event(int domid, u32 rmid, enum resctrl_event_id eventid,
> + void *arch_priv, u64 *val)
Is eventid needed? It could perhaps be used as a sanity check of
pevt->id, but if it is not used then it can be dropped.
Reinette
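The valid-bit decoding discussed above can be sketched in userspace C. This is a minimal illustration of the DATA_VALID/DATA_BITS layout and the summing loop from intel_aet_read_event() in the patch; the helper name `sum_aggregators` is illustrative, not part of the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Layout from the patch: bit 63 flags valid data, bits 62:0 hold the count. */
#define DATA_VALID (1ULL << 63)
#define DATA_BITS  (DATA_VALID - 1)	/* same value as GENMASK_ULL(62, 0) */

/*
 * Sum the aggregators that report valid data for one RMID, as
 * intel_aet_read_event() does across the per-package MMIO regions.
 * Returns false when no aggregator had valid data (the -EINVAL case).
 */
static bool sum_aggregators(const uint64_t *regs, int n, uint64_t *total)
{
	bool valid = false;

	*total = 0;
	for (int i = 0; i < n; i++) {
		if (!(regs[i] & DATA_VALID))
			continue;
		*total += regs[i] & DATA_BITS;
		valid = true;
	}
	return valid;
}
```

As in the patch, a reading of all-invalid aggregators leaves the caller's value untouched and reports failure.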
^ permalink raw reply [flat|nested] 85+ messages in thread
* [PATCH v13 21/32] fs/resctrl: Refactor mkdir_mondata_subdir()
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
` (19 preceding siblings ...)
2025-10-29 16:21 ` [PATCH v13 20/32] x86/resctrl: Read " Tony Luck
@ 2025-10-29 16:21 ` Tony Luck
2025-11-13 22:48 ` Reinette Chatre
2025-10-29 16:21 ` [PATCH v13 22/32] fs/resctrl: Refactor rmdir_mondata_subdir_allrdtgrp() Tony Luck
` (12 subsequent siblings)
33 siblings, 1 reply; 85+ messages in thread
From: Tony Luck @ 2025-10-29 16:21 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Population of a monitor group's mon_data directory is unreasonably complicated
because of the support for Sub-NUMA Cluster (SNC) mode.
Split out the SNC code into a helper function to make it easier to add support
for a new telemetry resource.
Move all the duplicated code to make and set owner of domain directories
into the mon_add_all_files() helper and rename to _mkdir_mondata_subdir().
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
fs/resctrl/rdtgroup.c | 108 +++++++++++++++++++++++-------------------
1 file changed, 58 insertions(+), 50 deletions(-)
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index fa1398787e83..bcb76dc818c0 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -3194,57 +3194,65 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
}
}
-static int mon_add_all_files(struct kernfs_node *kn, struct rdt_domain_hdr *hdr,
- struct rdt_resource *r, struct rdtgroup *prgrp,
- bool do_sum)
+/*
+ * Create a directory for a domain and populate it with monitor files. Create
+ * summing monitors when @hdr is NULL. No need to initialize summing monitors.
+ */
+static struct kernfs_node *_mkdir_mondata_subdir(struct kernfs_node *parent_kn, char *name,
+ struct rdt_domain_hdr *hdr,
+ struct rdt_resource *r,
+ struct rdtgroup *prgrp, int domid)
{
- struct rdt_l3_mon_domain *d;
struct rmid_read rr = {0};
+ struct kernfs_node *kn;
struct mon_data *priv;
struct mon_evt *mevt;
- int ret, domid;
+ int ret;
- if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
- return -EINVAL;
+ kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
+ if (IS_ERR(kn))
+ return kn;
+
+ ret = rdtgroup_kn_set_ugid(kn);
+ if (ret)
+ goto out_destroy;
- d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
for_each_mon_event(mevt) {
if (mevt->rid != r->rid || !mevt->enabled)
continue;
- domid = do_sum ? d->ci_id : d->hdr.id;
- priv = mon_get_kn_priv(r->rid, domid, mevt, do_sum);
- if (WARN_ON_ONCE(!priv))
- return -EINVAL;
+ priv = mon_get_kn_priv(r->rid, domid, mevt, !hdr);
+ if (WARN_ON_ONCE(!priv)) {
+ ret = -EINVAL;
+ goto out_destroy;
+ }
ret = mon_addfile(kn, mevt->name, priv);
if (ret)
- return ret;
+ goto out_destroy;
- if (!do_sum && resctrl_is_mbm_event(mevt->evtid))
+ if (hdr && resctrl_is_mbm_event(mevt->evtid))
mon_event_read(&rr, r, hdr, prgrp, &hdr->cpu_mask, mevt, true);
}
- return 0;
+ return kn;
+out_destroy:
+ kernfs_remove(kn);
+ return ERR_PTR(ret);
}
-static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
- struct rdt_domain_hdr *hdr,
- struct rdt_resource *r, struct rdtgroup *prgrp)
+static int mkdir_mondata_subdir_snc(struct kernfs_node *parent_kn,
+ struct rdt_domain_hdr *hdr,
+ struct rdt_resource *r, struct rdtgroup *prgrp)
{
- struct kernfs_node *kn, *ckn;
+ struct kernfs_node *ckn, *kn;
struct rdt_l3_mon_domain *d;
char name[32];
- bool snc_mode;
- int ret = 0;
-
- lockdep_assert_held(&rdtgroup_mutex);
if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
return -EINVAL;
d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
- snc_mode = r->mon_scope == RESCTRL_L3_NODE;
- sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci_id : d->hdr.id);
+ sprintf(name, "mon_%s_%02d", r->name, d->ci_id);
kn = kernfs_find_and_get(parent_kn, name);
if (kn) {
/*
@@ -3253,41 +3261,41 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
*/
kernfs_put(kn);
} else {
- kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
+ kn = _mkdir_mondata_subdir(parent_kn, name, NULL, r, prgrp, d->ci_id);
if (IS_ERR(kn))
return PTR_ERR(kn);
+ }
- ret = rdtgroup_kn_set_ugid(kn);
- if (ret)
- goto out_destroy;
- ret = mon_add_all_files(kn, hdr, r, prgrp, snc_mode);
- if (ret)
- goto out_destroy;
+ sprintf(name, "mon_sub_%s_%02d", r->name, hdr->id);
+ ckn = _mkdir_mondata_subdir(kn, name, hdr, r, prgrp, hdr->id);
+ if (IS_ERR(ckn)) {
+ kernfs_remove(kn);
+ return PTR_ERR(ckn);
}
- if (snc_mode) {
- sprintf(name, "mon_sub_%s_%02d", r->name, hdr->id);
- ckn = kernfs_create_dir(kn, name, parent_kn->mode, prgrp);
- if (IS_ERR(ckn)) {
- ret = -EINVAL;
- goto out_destroy;
- }
+ kernfs_activate(kn);
+ return 0;
+}
- ret = rdtgroup_kn_set_ugid(ckn);
- if (ret)
- goto out_destroy;
+static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
+ struct rdt_domain_hdr *hdr,
+ struct rdt_resource *r, struct rdtgroup *prgrp)
+{
+ struct kernfs_node *kn;
+ char name[32];
- ret = mon_add_all_files(ckn, hdr, r, prgrp, false);
- if (ret)
- goto out_destroy;
- }
+ lockdep_assert_held(&rdtgroup_mutex);
+
+ if (r->rid == RDT_RESOURCE_L3 && r->mon_scope == RESCTRL_L3_NODE)
+ return mkdir_mondata_subdir_snc(parent_kn, hdr, r, prgrp);
+
+ sprintf(name, "mon_%s_%02d", r->name, hdr->id);
+ kn = _mkdir_mondata_subdir(parent_kn, name, hdr, r, prgrp, hdr->id);
+ if (IS_ERR(kn))
+ return PTR_ERR(kn);
kernfs_activate(kn);
return 0;
-
-out_destroy:
- kernfs_remove(kn);
- return ret;
}
/*
--
2.51.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* Re: [PATCH v13 21/32] fs/resctrl: Refactor mkdir_mondata_subdir()
2025-10-29 16:21 ` [PATCH v13 21/32] fs/resctrl: Refactor mkdir_mondata_subdir() Tony Luck
@ 2025-11-13 22:48 ` Reinette Chatre
0 siblings, 0 replies; 85+ messages in thread
From: Reinette Chatre @ 2025-11-13 22:48 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 10/29/25 9:21 AM, Tony Luck wrote:
> Population of a monitor group's mon_data directory is unreasonably complicated
> because of the support for Sub-NUMA Cluster (SNC) mode.
>
> Split out the SNC code into a helper function to make it easier to add support
> for a new telemetry resource.
>
> Move all the duplicated code to make and set owner of domain directories
> into the mon_add_all_files() helper and rename to _mkdir_mondata_subdir().
>
> Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reinette
^ permalink raw reply [flat|nested] 85+ messages in thread
* [PATCH v13 22/32] fs/resctrl: Refactor rmdir_mondata_subdir_allrdtgrp()
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
` (20 preceding siblings ...)
2025-10-29 16:21 ` [PATCH v13 21/32] fs/resctrl: Refactor mkdir_mondata_subdir() Tony Luck
@ 2025-10-29 16:21 ` Tony Luck
2025-11-13 22:48 ` Reinette Chatre
2025-10-29 16:21 ` [PATCH v13 23/32] x86,fs/resctrl: Handle domain creation/deletion for RDT_RESOURCE_PERF_PKG Tony Luck
` (11 subsequent siblings)
33 siblings, 1 reply; 85+ messages in thread
From: Tony Luck @ 2025-10-29 16:21 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Clearing a monitor group's mon_data directory is complicated because of the
support for Sub-NUMA Cluster (SNC) mode.
Refactor the SNC case into a helper function to make it easier to add support
for a new telemetry resource.
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
fs/resctrl/rdtgroup.c | 42 +++++++++++++++++++++++++++++++-----------
1 file changed, 31 insertions(+), 11 deletions(-)
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index bcb76dc818c0..4f461ec773d6 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -3163,28 +3163,24 @@ static void mon_rmdir_one_subdir(struct kernfs_node *pkn, char *name, char *subn
}
/*
- * Remove all subdirectories of mon_data of ctrl_mon groups
- * and monitor groups for the given domain.
- * Remove files and directories containing "sum" of domain data
- * when last domain being summed is removed.
+ * Remove files and directories for one SNC node. If it is the last node
+ * sharing an L3 cache, then remove the upper level directory containing
+ * the "sum" files too.
*/
-static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
- struct rdt_domain_hdr *hdr)
+static void rmdir_mondata_subdir_allrdtgrp_snc(struct rdt_resource *r,
+ struct rdt_domain_hdr *hdr)
{
struct rdtgroup *prgrp, *crgrp;
struct rdt_l3_mon_domain *d;
char subname[32];
- bool snc_mode;
char name[32];
if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
return;
d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
- snc_mode = r->mon_scope == RESCTRL_L3_NODE;
- sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci_id : d->hdr.id);
- if (snc_mode)
- sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id);
+ sprintf(name, "mon_%s_%02d", r->name, d->ci_id);
+ sprintf(subname, "mon_sub_%s_%02d", r->name, hdr->id);
list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
mon_rmdir_one_subdir(prgrp->mon.mon_data_kn, name, subname);
@@ -3194,6 +3190,30 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
}
}
+/*
+ * Remove all subdirectories of mon_data of ctrl_mon groups
+ * and monitor groups for the given domain.
+ */
+static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
+ struct rdt_domain_hdr *hdr)
+{
+ struct rdtgroup *prgrp, *crgrp;
+ char name[32];
+
+ if (r->rid == RDT_RESOURCE_L3 && r->mon_scope == RESCTRL_L3_NODE) {
+ rmdir_mondata_subdir_allrdtgrp_snc(r, hdr);
+ return;
+ }
+
+ sprintf(name, "mon_%s_%02d", r->name, hdr->id);
+ list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
+ kernfs_remove_by_name(prgrp->mon.mon_data_kn, name);
+
+ list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list)
+ kernfs_remove_by_name(crgrp->mon.mon_data_kn, name);
+ }
+}
+
/*
* Create a directory for a domain and populate it with monitor files. Create
* summing monitors when @hdr is NULL. No need to initialize summing monitors.
--
2.51.0
^ permalink raw reply related [flat|nested] 85+ messages in thread* Re: [PATCH v13 22/32] fs/resctrl: Refactor rmdir_mondata_subdir_allrdtgrp()
2025-10-29 16:21 ` [PATCH v13 22/32] fs/resctrl: Refactor rmdir_mondata_subdir_allrdtgrp() Tony Luck
@ 2025-11-13 22:48 ` Reinette Chatre
0 siblings, 0 replies; 85+ messages in thread
From: Reinette Chatre @ 2025-11-13 22:48 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 10/29/25 9:21 AM, Tony Luck wrote:
> Clearing a monitor group's mon_data directory is complicated because of the
> support for Sub-NUMA Cluster (SNC) mode.
>
> Refactor the SNC case into a helper function to make it easier to add support
> for a new telemetry resource.
>
> Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reinette
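The directory-naming scheme that patches 21 and 22 centralize can be sketched as follows. This is a userspace illustration of the sprintf() name construction from the diffs ("mon_<resource>_<id>" for the parent, "mon_sub_<resource>_<id>" for each SNC node); the helper name `mon_dir_names` is illustrative only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Names used by mkdir_mondata_subdir()/rmdir_mondata_subdir_allrdtgrp():
 * the parent directory is named after the L3 cache id (d->ci_id) and, in
 * SNC mode, each node gets a child named after the node id (hdr->id).
 * Buffer size matches the "char name[32]" in the patch.
 */
static void mon_dir_names(const char *res, int ci_id, int node_id,
			  char name[32], char subname[32])
{
	snprintf(name, 32, "mon_%s_%02d", res, ci_id);
	snprintf(subname, 32, "mon_sub_%s_%02d", res, node_id);
}
```

For an L3 resource on cache id 1 with SNC node id 3 this produces "mon_L3_01" and "mon_sub_L3_03", matching the layout the removal path must mirror.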
^ permalink raw reply [flat|nested] 85+ messages in thread
* [PATCH v13 23/32] x86,fs/resctrl: Handle domain creation/deletion for RDT_RESOURCE_PERF_PKG
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
` (21 preceding siblings ...)
2025-10-29 16:21 ` [PATCH v13 22/32] fs/resctrl: Refactor rmdir_mondata_subdir_allrdtgrp() Tony Luck
@ 2025-10-29 16:21 ` Tony Luck
2025-10-29 16:21 ` [PATCH v13 24/32] x86/resctrl: Add energy/perf choices to rdt boot option Tony Luck
` (10 subsequent siblings)
33 siblings, 0 replies; 85+ messages in thread
From: Tony Luck @ 2025-10-29 16:21 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
The L3 resource has several requirements for domains. There are per-domain
structures that hold the 64-bit values of counters, and elements to keep
track of the overflow and limbo threads.
None of these are needed for the PERF_PKG resource. The hardware counters
are wide enough that they do not wrap around for decades.
Define a new rdt_perf_pkg_mon_domain structure which just consists of the
standard rdt_domain_hdr to keep track of domain id and CPU mask.
Update resctrl_online_mon_domain() for RDT_RESOURCE_PERF_PKG. The only action
needed for this resource is to create and populate domain directories if a
domain is added while resctrl is mounted.
Similarly resctrl_offline_mon_domain() only needs to remove domain
directories.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
arch/x86/kernel/cpu/resctrl/internal.h | 13 +++++++++++
arch/x86/kernel/cpu/resctrl/core.c | 17 +++++++++++++++
arch/x86/kernel/cpu/resctrl/intel_aet.c | 29 +++++++++++++++++++++++++
fs/resctrl/rdtgroup.c | 17 ++++++++++-----
4 files changed, 71 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 97616c81682b..b920f54f8736 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -84,6 +84,14 @@ static inline struct rdt_hw_l3_mon_domain *resctrl_to_arch_mon_dom(struct rdt_l3
return container_of(r, struct rdt_hw_l3_mon_domain, d_resctrl);
}
+/**
+ * struct rdt_perf_pkg_mon_domain - CPUs sharing a package scoped resctrl monitor resource
+ * @hdr: common header for different domain types
+ */
+struct rdt_perf_pkg_mon_domain {
+ struct rdt_domain_hdr hdr;
+};
+
/**
* struct msr_param - set a range of MSRs from a domain
* @res: The resource to use
@@ -222,6 +230,8 @@ bool intel_aet_get_events(void);
void __exit intel_aet_exit(void);
int intel_aet_read_event(int domid, u32 rmid, enum resctrl_event_id evtid,
void *arch_priv, u64 *val);
+void intel_aet_mon_domain_setup(int cpu, int id, struct rdt_resource *r,
+ struct list_head *add_pos);
#else
static inline bool intel_aet_get_events(void) { return false; }
static inline void __exit intel_aet_exit(void) { }
@@ -230,6 +240,9 @@ static inline int intel_aet_read_event(int domid, u32 rmid, enum resctrl_event_i
{
return -EINVAL;
}
+
+static inline void intel_aet_mon_domain_setup(int cpu, int id, struct rdt_resource *r,
+ struct list_head *add_pos) { }
#endif
#endif /* _ASM_X86_RESCTRL_INTERNAL_H */
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index d759093e7dce..ba1ddc2eec15 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -575,6 +575,10 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
if (!hdr)
l3_mon_domain_setup(cpu, id, r, add_pos);
break;
+ case RDT_RESOURCE_PERF_PKG:
+ if (!hdr)
+ intel_aet_mon_domain_setup(cpu, id, r, add_pos);
+ break;
default:
pr_warn_once("Unknown resource rid=%d\n", r->rid);
break;
@@ -674,6 +678,19 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
l3_mon_domain_free(hw_dom);
break;
}
+ case RDT_RESOURCE_PERF_PKG: {
+ struct rdt_perf_pkg_mon_domain *pkgd;
+
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_PERF_PKG))
+ return;
+
+ pkgd = container_of(hdr, struct rdt_perf_pkg_mon_domain, hdr);
+ resctrl_offline_mon_domain(r, hdr);
+ list_del_rcu(&hdr->list);
+ synchronize_rcu();
+ kfree(pkgd);
+ break;
+ }
default:
pr_warn_once("Unknown resource rid=%d\n", r->rid);
break;
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index f64fb7d0c8a9..781ca8ede39e 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -17,16 +17,21 @@
#include <linux/compiler_types.h>
#include <linux/container_of.h>
#include <linux/cpu.h>
+#include <linux/cpumask.h>
#include <linux/err.h>
#include <linux/errno.h>
+#include <linux/gfp_types.h>
#include <linux/init.h>
#include <linux/intel_pmt_features.h>
#include <linux/intel_vsec.h>
#include <linux/io.h>
#include <linux/overflow.h>
#include <linux/printk.h>
+#include <linux/rculist.h>
+#include <linux/rcupdate.h>
#include <linux/resctrl.h>
#include <linux/resctrl_types.h>
+#include <linux/slab.h>
#include <linux/stddef.h>
#include <linux/topology.h>
#include <linux/types.h>
@@ -269,3 +274,27 @@ int intel_aet_read_event(int domid, u32 rmid, enum resctrl_event_id eventid,
return valid ? 0 : -EINVAL;
}
+
+void intel_aet_mon_domain_setup(int cpu, int id, struct rdt_resource *r,
+ struct list_head *add_pos)
+{
+ struct rdt_perf_pkg_mon_domain *d;
+ int err;
+
+ d = kzalloc_node(sizeof(*d), GFP_KERNEL, cpu_to_node(cpu));
+ if (!d)
+ return;
+
+ d->hdr.id = id;
+ d->hdr.type = RESCTRL_MON_DOMAIN;
+ d->hdr.rid = RDT_RESOURCE_PERF_PKG;
+ cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+ list_add_tail_rcu(&d->hdr.list, add_pos);
+
+ err = resctrl_online_mon_domain(r, &d->hdr);
+ if (err) {
+ list_del_rcu(&d->hdr.list);
+ synchronize_rcu();
+ kfree(d);
+ }
+}
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 4f461ec773d6..84336b6e1679 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -4241,11 +4241,6 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
mutex_lock(&rdtgroup_mutex);
- if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
- goto out_unlock;
-
- d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
-
/*
* If resctrl is mounted, remove all the
* per domain monitor data directories.
@@ -4253,6 +4248,13 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
if (resctrl_mounted && resctrl_arch_mon_capable())
rmdir_mondata_subdir_allrdtgrp(r, hdr);
+ if (r->rid != RDT_RESOURCE_L3)
+ goto out_unlock;
+
+ if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
+ goto out_unlock;
+
+ d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
if (resctrl_is_mbm_enabled())
cancel_delayed_work(&d->mbm_over);
if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID) && has_busy_rmid(d)) {
@@ -4349,6 +4351,9 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr
mutex_lock(&rdtgroup_mutex);
+ if (r->rid != RDT_RESOURCE_L3)
+ goto mkdir;
+
if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
goto out_unlock;
@@ -4366,6 +4371,8 @@ int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *hdr
if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID))
INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo);
+mkdir:
+ err = 0;
/*
* If the filesystem is not mounted then only the default resource group
* exists. Creation of its directories is deferred until mount time
--
2.51.0
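The error-unwind ordering in intel_aet_mon_domain_setup() above (publish the domain on the resource's list, then unpublish and free it if the online callback fails) can be sketched in userspace. A plain singly linked list stands in for the kernel's RCU-protected list, and `domain_setup` is an illustrative stand-in, not the kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct domain {
	int id;
	struct domain *next;
};

static struct domain *domains;

static void dom_list_add(struct domain *d)
{
	d->next = domains;
	domains = d;
}

static void dom_list_del(struct domain *d)
{
	struct domain **p;

	for (p = &domains; *p; p = &(*p)->next) {
		if (*p == d) {
			*p = d->next;
			return;
		}
	}
}

/*
 * Mirrors intel_aet_mon_domain_setup(): allocate, initialize, publish on
 * the list, then unwind (delete from list, free) if the online callback
 * fails. online_ok models the resctrl_online_mon_domain() return value.
 */
static bool domain_setup(int id, bool online_ok)
{
	struct domain *d = calloc(1, sizeof(*d));

	if (!d)
		return false;
	d->id = id;
	dom_list_add(d);
	if (!online_ok) {
		dom_list_del(d);
		free(d);
		return false;
	}
	return true;
}
```

The kernel version additionally needs synchronize_rcu() between list deletion and kfree() so that concurrent readers traversing the list cannot touch freed memory.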
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH v13 24/32] x86/resctrl: Add energy/perf choices to rdt boot option
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
` (22 preceding siblings ...)
2025-10-29 16:21 ` [PATCH v13 23/32] x86,fs/resctrl: Handle domain creation/deletion for RDT_RESOURCE_PERF_PKG Tony Luck
@ 2025-10-29 16:21 ` Tony Luck
2025-10-29 16:21 ` [PATCH v13 25/32] x86/resctrl: Handle number of RMIDs supported by RDT_RESOURCE_PERF_PKG Tony Luck
` (9 subsequent siblings)
33 siblings, 0 replies; 85+ messages in thread
From: Tony Luck @ 2025-10-29 16:21 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Legacy resctrl features are enumerated by X86_FEATURE_* flags. These
may be overridden by quirks to disable features in the case of errata.
Users can use kernel command line options to either disable a feature,
or to force enable a feature that was disabled by a quirk.
Provide similar functionality for hardware features that do not have an
X86_FEATURE_* flag. Unlike other features that are tied to X86_FEATURE_*
flags, these must be queried by name. Add rdt_is_feature_enabled()
to check whether quirks or kernel command line have disabled a feature.
Users may force a feature to be disabled. E.g. "rdt=!perf" will ensure
that none of the perf telemetry events are enabled.
Resctrl architecture code may disable a feature that does not provide
full functionality. Users may override that decision. E.g. "rdt=energy"
will enable any available energy telemetry events even if they do not
provide full functionality.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
.../admin-guide/kernel-parameters.txt | 2 +-
arch/x86/kernel/cpu/resctrl/internal.h | 2 ++
arch/x86/kernel/cpu/resctrl/core.c | 29 +++++++++++++++++++
3 files changed, 32 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 6c42061ca20e..bb8f5d73ebf8 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6207,7 +6207,7 @@
rdt= [HW,X86,RDT]
Turn on/off individual RDT features. List is:
cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, l2cdp,
- mba, smba, bmec, abmc.
+ mba, smba, bmec, abmc, energy, perf.
E.g. to turn on cmt and turn off mba use:
rdt=cmt,!mba
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index b920f54f8736..e3710b9f993e 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -225,6 +225,8 @@ void __init intel_rdt_mbm_apply_quirk(void);
void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
void resctrl_arch_mbm_cntr_assign_set_one(struct rdt_resource *r);
+bool rdt_is_feature_enabled(char *name);
+
#ifdef CONFIG_X86_CPU_RESCTRL_INTEL_AET
bool intel_aet_get_events(void);
void __exit intel_aet_exit(void);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index ba1ddc2eec15..7013911d3575 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -772,6 +772,8 @@ enum {
RDT_FLAG_SMBA,
RDT_FLAG_BMEC,
RDT_FLAG_ABMC,
+ RDT_FLAG_ENERGY,
+ RDT_FLAG_PERF,
};
#define RDT_OPT(idx, n, f) \
@@ -798,6 +800,8 @@ static struct rdt_options rdt_options[] __ro_after_init = {
RDT_OPT(RDT_FLAG_SMBA, "smba", X86_FEATURE_SMBA),
RDT_OPT(RDT_FLAG_BMEC, "bmec", X86_FEATURE_BMEC),
RDT_OPT(RDT_FLAG_ABMC, "abmc", X86_FEATURE_ABMC),
+ RDT_OPT(RDT_FLAG_ENERGY, "energy", 0),
+ RDT_OPT(RDT_FLAG_PERF, "perf", 0),
};
#define NUM_RDT_OPTIONS ARRAY_SIZE(rdt_options)
@@ -847,6 +851,31 @@ bool rdt_cpu_has(int flag)
return ret;
}
+/*
+ * Hardware features that do not have X86_FEATURE_* bits. There is no
+ * "hardware does not support this at all" case. Assume that the caller
+ * has already determined that hardware support is present and just needs
+ * to check if the feature has been disabled by a quirk that has not been
+ * overridden by a command line option.
+ */
+bool rdt_is_feature_enabled(char *name)
+{
+ struct rdt_options *o;
+ bool ret = true;
+
+ for (o = rdt_options; o < &rdt_options[NUM_RDT_OPTIONS]; o++) {
+ if (!strcmp(name, o->name)) {
+ if (o->force_off)
+ ret = false;
+ if (o->force_on)
+ ret = true;
+ break;
+ }
+ }
+
+ return ret;
+}
+
bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt)
{
if (!rdt_cpu_has(X86_FEATURE_BMEC))
--
2.51.0
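The force_off/force_on precedence that rdt_is_feature_enabled() implements above can be sketched in userspace: a quirk may set force_off (as rdt_set_feature_disabled() in the next patch does), but "rdt=<name>" on the command line sets force_on, which wins because it is tested last. The option table and helper names here are illustrative stand-ins for the kernel's rdt_options[].

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

struct rdt_option {
	const char *name;
	bool force_off, force_on;
};

static struct rdt_option options[] = {
	{ "energy", false, false },
	{ "perf",   false, false },
};

/* Mirrors rdt_set_feature_disabled(): a quirk turns the feature off. */
static void set_feature_disabled(const char *name)
{
	for (size_t i = 0; i < sizeof(options) / sizeof(options[0]); i++) {
		if (!strcmp(name, options[i].name)) {
			options[i].force_off = true;
			return;
		}
	}
}

/*
 * Mirrors rdt_is_feature_enabled(): default enabled, force_off from a
 * quirk disables, force_on from the command line overrides the quirk.
 */
static bool feature_enabled(const char *name)
{
	bool ret = true;

	for (size_t i = 0; i < sizeof(options) / sizeof(options[0]); i++) {
		if (!strcmp(name, options[i].name)) {
			if (options[i].force_off)
				ret = false;
			if (options[i].force_on)
				ret = true;
			break;
		}
	}
	return ret;
}
```

Note the asymmetry with legacy features: since these have no X86_FEATURE_* bit, an unknown name simply reports enabled, and the "hardware does not support this" case is handled by the caller before the lookup.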
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH v13 25/32] x86/resctrl: Handle number of RMIDs supported by RDT_RESOURCE_PERF_PKG
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
` (23 preceding siblings ...)
2025-10-29 16:21 ` [PATCH v13 24/32] x86/resctrl: Add energy/perf choices to rdt boot option Tony Luck
@ 2025-10-29 16:21 ` Tony Luck
2025-11-13 22:51 ` Reinette Chatre
2025-10-29 16:21 ` [PATCH v13 26/32] fs/resctrl: Move allocation/free of closid_num_dirty_rmid[] Tony Luck
` (8 subsequent siblings)
33 siblings, 1 reply; 85+ messages in thread
From: Tony Luck @ 2025-10-29 16:21 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
There are now three meanings for "number of RMIDs":
1) The number for legacy features enumerated by CPUID leaf 0xF. This
is the maximum number of distinct values that can be loaded into
MSR_IA32_PQR_ASSOC. Note that systems with Sub-NUMA Cluster mode enabled
will force scaling down the CPUID enumerated value by the number of SNC
nodes per L3-cache.
2) The number of registers in MMIO space for each event. This
is enumerated in the XML files and is the value initialized into
event_group::num_rmids.
3) The number of "hardware counters" (this isn't a strictly accurate
description of how things work, but serves as a useful analogy that
does describe the limitations) feeding into those MMIO registers. This
is enumerated in telemetry_region::num_rmids returned from the call to
intel_pmt_get_regions_by_feature().
Event groups with insufficient "hardware counters" to track all RMIDs
are difficult for users to use, since the system may reassign "hardware
counters" at any time. This means that users cannot reliably collect
two consecutive event counts to compute the rate at which events are
occurring.
Introduce rdt_set_feature_disabled() to mark any under-resourced event groups
(those with telemetry_region::num_rmids < event_group::num_rmids for any of
the event group's telemetry regions) as unusable. Note that the rdt_options[]
structure must now be writable at run-time.
Limit an under-resourced event group's number of possible monitor
resource groups to the lowest number of "hardware counters" if the
user explicitly requests to enable it.
Scan all enabled event groups and assign the RDT_RESOURCE_PERF_PKG
resource "num_rmids" value to the smallest of these values as this value
will be used later to compare against the number of RMIDs supported
by other resources to determine how many monitoring resource groups
are supported.
N.B. Change type of rdt_resource::num_rmid to u32 to match type of
event_group::num_rmids so that min(r->num_rmid, e->num_rmids) won't
complain about mixing signed and unsigned types.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 2 +-
arch/x86/kernel/cpu/resctrl/internal.h | 2 +
arch/x86/kernel/cpu/resctrl/core.c | 18 +++++++-
arch/x86/kernel/cpu/resctrl/intel_aet.c | 55 +++++++++++++++++++++++++
fs/resctrl/rdtgroup.c | 2 +-
5 files changed, 76 insertions(+), 3 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 34ad0f5f1309..a2bf335052d6 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -292,7 +292,7 @@ enum resctrl_schema_fmt {
* events of monitor groups created via mkdir.
*/
struct resctrl_mon {
- int num_rmid;
+ u32 num_rmid;
unsigned int mbm_cfg_mask;
int num_mbm_cntrs;
bool mbm_cntr_assignable;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index e3710b9f993e..cea76f88422c 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -227,6 +227,8 @@ void resctrl_arch_mbm_cntr_assign_set_one(struct rdt_resource *r);
bool rdt_is_feature_enabled(char *name);
+void rdt_set_feature_disabled(char *name);
+
#ifdef CONFIG_X86_CPU_RESCTRL_INTEL_AET
bool intel_aet_get_events(void);
void __exit intel_aet_exit(void);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 7013911d3575..a8eb197e27db 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -788,7 +788,7 @@ struct rdt_options {
bool force_off, force_on;
};
-static struct rdt_options rdt_options[] __ro_after_init = {
+static struct rdt_options rdt_options[] = {
RDT_OPT(RDT_FLAG_CMT, "cmt", X86_FEATURE_CQM_OCCUP_LLC),
RDT_OPT(RDT_FLAG_MBM_TOTAL, "mbmtotal", X86_FEATURE_CQM_MBM_TOTAL),
RDT_OPT(RDT_FLAG_MBM_LOCAL, "mbmlocal", X86_FEATURE_CQM_MBM_LOCAL),
@@ -851,6 +851,22 @@ bool rdt_cpu_has(int flag)
return ret;
}
+/*
+ * Can be called during feature enumeration if sanity check of
+ * a feature's parameters indicates problems with the feature.
+ */
+void rdt_set_feature_disabled(char *name)
+{
+ struct rdt_options *o;
+
+ for (o = rdt_options; o < &rdt_options[NUM_RDT_OPTIONS]; o++) {
+ if (!strcmp(name, o->name)) {
+ o->force_off = true;
+ return;
+ }
+ }
+}
+
/*
* Hardware features that do not have X86_FEATURE_* bits. There is no
* "hardware does not support this at all" case. Assume that the caller
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index 781ca8ede39e..252a3fd4260c 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -25,6 +25,7 @@
#include <linux/intel_pmt_features.h>
#include <linux/intel_vsec.h>
#include <linux/io.h>
+#include <linux/minmax.h>
#include <linux/overflow.h>
#include <linux/printk.h>
#include <linux/rculist.h>
@@ -57,6 +58,7 @@ struct pmt_event {
* struct event_group - All information about a group of telemetry events.
* @feature: Argument to intel_pmt_get_regions_by_feature() to
* discover if this event_group is supported.
+ * @name: Name for this group (used by boot rdt= option)
* @pfg: Points to the aggregated telemetry space information
* returned by the intel_pmt_get_regions_by_feature()
* call to the INTEL_PMT_TELEMETRY driver that contains
@@ -64,6 +66,10 @@ struct pmt_event {
* Valid if the system supports the event group.
* NULL otherwise.
* @guid: Unique number per XML description file.
+ * @num_rmids: Number of RMIDs supported by this group. May be
+ * adjusted downwards if enumeration from
+ * intel_pmt_get_regions_by_feature() indicates fewer
+ * RMIDs can be tracked simultaneously.
* @mmio_size: Number of bytes of MMIO registers for this group.
* @num_events: Number of events in this group.
* @evts: Array of event descriptors.
@@ -71,10 +77,12 @@ struct pmt_event {
struct event_group {
/* Data fields for additional structures to manage this group. */
enum pmt_feature_id feature;
+ char *name;
struct pmt_feature_group *pfg;
/* Remaining fields initialized from XML file. */
u32 guid;
+ u32 num_rmids;
size_t mmio_size;
unsigned int num_events;
struct pmt_event evts[] __counted_by(num_events);
@@ -89,7 +97,9 @@ struct event_group {
*/
static struct event_group energy_0x26696143 = {
.feature = FEATURE_PER_RMID_ENERGY_TELEM,
+ .name = "energy",
.guid = 0x26696143,
+ .num_rmids = 576,
.mmio_size = XML_MMIO_SIZE(576, 2, 3),
.num_events = 2,
.evts = {
@@ -104,7 +114,9 @@ static struct event_group energy_0x26696143 = {
*/
static struct event_group perf_0x26557651 = {
.feature = FEATURE_PER_RMID_PERF_TELEM,
+ .name = "perf",
.guid = 0x26557651,
+ .num_rmids = 576,
.mmio_size = XML_MMIO_SIZE(576, 7, 3),
.num_events = 7,
.evts = {
@@ -174,11 +186,54 @@ static bool group_has_usable_regions(struct event_group *e, struct pmt_feature_g
return usable_regions;
}
+static bool all_regions_have_sufficient_rmid(struct event_group *e, struct pmt_feature_group *p)
+{
+ struct telemetry_region *tr;
+ bool ret = true;
+
+ for (int i = 0; i < p->count; i++) {
+ if (!p->regions[i].addr)
+ continue;
+ tr = &p->regions[i];
+ if (tr->num_rmids < e->num_rmids)
+ ret = false;
+ }
+
+ return ret;
+}
+
static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_PERF_PKG].r_resctrl;
+
if (!group_has_usable_regions(e, p))
return false;
+ /* Disable feature if insufficient RMIDs */
+ if (!all_regions_have_sufficient_rmid(e, p))
+ rdt_set_feature_disabled(e->name);
+
+ /* User can override above disable from kernel command line */
+ if (!rdt_is_feature_enabled(e->name))
+ return false;
+
+ for (int i = 0; i < p->count; i++) {
+ if (!p->regions[i].addr)
+ continue;
+ /*
+ * e->num_rmids only adjusted lower if user (via rdt= kernel
+ * parameter) forces an event group with insufficient RMID
+ * to be enabled.
+ */
+ e->num_rmids = min(e->num_rmids, p->regions[i].num_rmids);
+ }
+
+ if (r->mon.num_rmid)
+ r->mon.num_rmid = min(r->mon.num_rmid, e->num_rmids);
+ else
+ r->mon.num_rmid = e->num_rmids;
+
for (int j = 0; j < e->num_events; j++)
resctrl_enable_mon_event(e->evts[j].id, true,
e->evts[j].bin_bits, &e->evts[j]);
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 84336b6e1679..b67faf6a5012 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -1135,7 +1135,7 @@ static int rdt_num_rmids_show(struct kernfs_open_file *of,
{
struct rdt_resource *r = rdt_kn_parent_priv(of->kn);
- seq_printf(seq, "%d\n", r->mon.num_rmid);
+ seq_printf(seq, "%u\n", r->mon.num_rmid);
return 0;
}
--
2.51.0
* Re: [PATCH v13 25/32] x86/resctrl: Handle number of RMIDs supported by RDT_RESOURCE_PERF_PKG
2025-10-29 16:21 ` [PATCH v13 25/32] x86/resctrl: Handle number of RMIDs supported by RDT_RESOURCE_PERF_PKG Tony Luck
@ 2025-11-13 22:51 ` Reinette Chatre
2025-11-14 21:55 ` Luck, Tony
0 siblings, 1 reply; 85+ messages in thread
From: Reinette Chatre @ 2025-11-13 22:51 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 10/29/25 9:21 AM, Tony Luck wrote:
> There are now three meanings for "number of RMIDs":
>
> 1) The number for legacy features enumerated by CPUID leaf 0xF. This
> is the maximum number of distinct values that can be loaded into
> MSR_IA32_PQR_ASSOC. Note that systems with Sub-NUMA Cluster mode enabled
> will force scaling down the CPUID enumerated value by the number of SNC
> nodes per L3-cache.
>
> 2) The number of registers in MMIO space for each event. This
> is enumerated in the XML files and is the value initialized into
> event_group::num_rmids.
>
> 3) The number of "hardware counters" (this isn't a strictly accurate
> description of how things work, but serves as a useful analogy that
> does describe the limitations) feeding to those MMIO registers. This
> is enumerated in telemetry_region::num_rmids returned from the call to
> intel_pmt_get_regions_by_feature()
>
> Event groups with insufficient "hardware counters" to track all RMIDs
> are difficult for users to use, since the system may reassign "hardware
> counters" at any time. This means that users cannot reliably collect
> two consecutive event counts to compute the rate at which events are
> occurring.
>
> Introduce rdt_set_feature_disabled() to mark any under-resourced event groups
> (those with telemetry_region::num_rmids < event_group::num_rmids for any of
Extra space above, also note how line lengths differ between paragraphs.
> the event group's telemetry regions) as unusable. Note that the rdt_options[]
> structure must now be writable at run-time.
>
> Limit an under-resourced event group's number of possible monitor
> resource groups to the lowest number of "hardware counters" if the
> user explicitly requests to enable it.
>
> Scan all enabled event groups and assign the RDT_RESOURCE_PERF_PKG
> resource "num_rmids" value to the smallest of these values as this value
> will be used later to compare against the number of RMIDs supported
> by other resources to determine how many monitoring resource groups
> are supported.
>
> N.B. Change type of rdt_resource::num_rmid to u32 to match type of
resctrl_mon::num_rmid
> event_group::num_rmids so that min(r->num_rmid, e->num_rmids) won't
> complain about mixing signed and unsigned types.
May be worthwhile to highlight that resctrl_mon::num_rmid is already
used as a u32 so changing its type does not need to cascade into all
users. Something like "Change type of resctrl_mon::num_rmid to u32 to
match its usage and the type of ..."
> diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> index 781ca8ede39e..252a3fd4260c 100644
> --- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
> +++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> @@ -25,6 +25,7 @@
> #include <linux/intel_pmt_features.h>
> #include <linux/intel_vsec.h>
> #include <linux/io.h>
> +#include <linux/minmax.h>
> #include <linux/overflow.h>
> #include <linux/printk.h>
> #include <linux/rculist.h>
> @@ -57,6 +58,7 @@ struct pmt_event {
> * struct event_group - All information about a group of telemetry events.
> * @feature: Argument to intel_pmt_get_regions_by_feature() to
> * discover if this event_group is supported.
> + * @name: Name for this group (used by boot rdt= option)
> * @pfg: Points to the aggregated telemetry space information
> * returned by the intel_pmt_get_regions_by_feature()
> * call to the INTEL_PMT_TELEMETRY driver that contains
> @@ -64,6 +66,10 @@ struct pmt_event {
> * Valid if the system supports the event group.
> * NULL otherwise.
> * @guid: Unique number per XML description file.
> + * @num_rmids: Number of RMIDs supported by this group. May be
> + * adjusted downwards if enumeration from
> + * intel_pmt_get_regions_by_feature() indicates fewer
> + * RMIDs can be tracked simultaneously.
> * @mmio_size: Number of bytes of MMIO registers for this group.
> * @num_events: Number of events in this group.
> * @evts: Array of event descriptors.
> @@ -71,10 +77,12 @@ struct pmt_event {
> struct event_group {
> /* Data fields for additional structures to manage this group. */
> enum pmt_feature_id feature;
> + char *name;
> struct pmt_feature_group *pfg;
>
> /* Remaining fields initialized from XML file. */
> u32 guid;
> + u32 num_rmids;
This addition is to resctrl where there is already a resctrl_mon::num_rmid.
Can this naming be made consistent with resctrl instead of INTEL_PMT_TELEMETRY?
> size_t mmio_size;
> unsigned int num_events;
> struct pmt_event evts[] __counted_by(num_events);
> @@ -89,7 +97,9 @@ struct event_group {
> */
> static struct event_group energy_0x26696143 = {
> .feature = FEATURE_PER_RMID_ENERGY_TELEM,
> + .name = "energy",
> .guid = 0x26696143,
> + .num_rmids = 576,
> .mmio_size = XML_MMIO_SIZE(576, 2, 3),
> .num_events = 2,
> .evts = {
> @@ -104,7 +114,9 @@ static struct event_group energy_0x26696143 = {
> */
> static struct event_group perf_0x26557651 = {
> .feature = FEATURE_PER_RMID_PERF_TELEM,
> + .name = "perf",
> .guid = 0x26557651,
> + .num_rmids = 576,
> .mmio_size = XML_MMIO_SIZE(576, 7, 3),
> .num_events = 7,
> .evts = {
> @@ -174,11 +186,54 @@ static bool group_has_usable_regions(struct event_group *e, struct pmt_feature_g
> return usable_regions;
> }
>
> +static bool all_regions_have_sufficient_rmid(struct event_group *e, struct pmt_feature_group *p)
> +{
> + struct telemetry_region *tr;
> + bool ret = true;
> +
> + for (int i = 0; i < p->count; i++) {
> + if (!p->regions[i].addr)
> + continue;
> + tr = &p->regions[i];
> + if (tr->num_rmids < e->num_rmids)
> + ret = false;
> + }
> +
> + return ret;
> +}
> +
> static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
> {
> + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_PERF_PKG].r_resctrl;
> +
> if (!group_has_usable_regions(e, p))
> return false;
>
> + /* Disable feature if insufficient RMIDs */
> + if (!all_regions_have_sufficient_rmid(e, p))
> + rdt_set_feature_disabled(e->name);
> +
> + /* User can override above disable from kernel command line */
> + if (!rdt_is_feature_enabled(e->name))
> + return false;
Considering this from the user's perspective I do not think there is an easy way for user space
to know that the feature was force disabled and that there is an option to override this.
What is the use case considered here? If I understand correctly there may be a way to deduce this
by consulting files in /sys/class/intel_pmt/ where the telemetry regions' number of supported
RMID is printed. A user can compare that to the XML files to determine that there is an issue
that can explain resctrl not exposing the feature. Considering this difficulty I wonder if it
may be helpful to print an error message here? This is of course not perfect since the kernel
log cannot be guaranteed to forever contain it ... but it may help?
Reinette
* Re: [PATCH v13 25/32] x86/resctrl: Handle number of RMIDs supported by RDT_RESOURCE_PERF_PKG
2025-11-13 22:51 ` Reinette Chatre
@ 2025-11-14 21:55 ` Luck, Tony
2025-11-14 23:26 ` Reinette Chatre
0 siblings, 1 reply; 85+ messages in thread
From: Luck, Tony @ 2025-11-14 21:55 UTC (permalink / raw)
To: Reinette Chatre
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen Yu, x86, linux-kernel,
patches
On Thu, Nov 13, 2025 at 02:51:45PM -0800, Reinette Chatre wrote:
> > static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
> > {
> > + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_PERF_PKG].r_resctrl;
> > +
> > if (!group_has_usable_regions(e, p))
> > return false;
> >
> > + /* Disable feature if insufficient RMIDs */
> > + if (!all_regions_have_sufficient_rmid(e, p))
> > + rdt_set_feature_disabled(e->name);
> > +
> > + /* User can override above disable from kernel command line */
> > + if (!rdt_is_feature_enabled(e->name))
> > + return false;
>
> Considering this from the user's perspective I do not think there is an easy way for user space
> to know that the feature was force disabled and that there is an option to override this.
> What is the use case considered here? If I understand correctly there may be a way to deduce this
> by consulting files in /sys/class/intel_pmt/ where the telemetry regions' number of supported
> RMID is printed. A user can compare that to the XML files to determine that there is an issue
> that can explain resctrl not exposing the feature. Considering this difficulty I wonder if it
> may be helpful to print an error message here? This is of course not perfect since the kernel
> log cannot be guaranteed to forever contain it ... but it may help?
Reinette,
Good idea. Here's a draft. Comments welcome to improve the user message
that would look like:
resctrl: Feature energy guid=0x26696143 not enabled due to insufficient RMIDs
static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_PERF_PKG].r_resctrl;
bool warn_disable = false;
if (!group_has_usable_regions(e, p))
return false;
/* Disable feature if insufficient RMIDs */
if (!all_regions_have_sufficient_rmid(e, p)) {
warn_disable = true;
rdt_set_feature_disabled(e->name);
}
/* User can override above disable from kernel command line */
if (!rdt_is_feature_enabled(e->name)) {
if (warn_disable)
pr_info("Feature %s guid=0x%x not enabled due to insufficient RMIDs\n",
e->name, e->guid);
return false;
}
...
}
-Tony
* Re: [PATCH v13 25/32] x86/resctrl: Handle number of RMIDs supported by RDT_RESOURCE_PERF_PKG
2025-11-14 21:55 ` Luck, Tony
@ 2025-11-14 23:26 ` Reinette Chatre
2025-11-17 16:37 ` Luck, Tony
0 siblings, 1 reply; 85+ messages in thread
From: Reinette Chatre @ 2025-11-14 23:26 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen Yu, x86, linux-kernel,
patches
Hi Tony,
On 11/14/25 1:55 PM, Luck, Tony wrote:
>
> resctrl: Feature energy guid=0x26696143 not enabled due to insufficient RMIDs
>
>
> static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
> {
> struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_PERF_PKG].r_resctrl;
> bool warn_disable = false;
>
> if (!group_has_usable_regions(e, p))
> return false;
>
> /* Disable feature if insufficient RMIDs */
> if (!all_regions_have_sufficient_rmid(e, p)) {
> warn_disable = true;
> rdt_set_feature_disabled(e->name);
> }
>
> /* User can override above disable from kernel command line */
> if (!rdt_is_feature_enabled(e->name)) {
> if (warn_disable)
> pr_info("Feature %s guid=0x%x not enabled due to insufficient RMIDs\n",
> e->name, e->guid);
> return false;
> }
> ...
> }
Thank you for considering. This looks good to me.
I now realize that if a system supports, for example, two energy guid and only one has insufficient
RMID then one or both may be disabled by default depending on which resctrl attempts to enable
first. This is arbitrary based on where the event group appears in the array.
How a system with two guid of the same feature type would work is not clear to me though. Looks
like they cannot share events at all since an event is uniquely associated with a struct pmt_event
that can belong to only one event group. If they may share events then enable_events()->resctrl_enable_mon_event()
will complain loudly but still proceed and allow the event group to be enabled.
I think the resctrl_enable_mon_event() warnings were added to support enabling of new features
so that the WARNs can catch issues during development ... now it may encounter issues when a
kernel with this implementation is run on a system that supports a single feature with
multiple guid. Do you have more insight in how the "single feature with multiple guid" may look to
better prepare resctrl to handle them?
Should "enable_events" be split so that a feature can be disabled for all its event groups if
any of them cannot be enabled due to insufficient RMIDs?
Perhaps resctrl_enable_mon_event() should also now return success/fail so that an event group
cannot be enabled if its events cannot be enabled?
Finally, a system with two guid of the same feature type will end up printing duplicate
"<feature type> monitoring detected" that could be more descriptive?
Reinette
* Re: [PATCH v13 25/32] x86/resctrl: Handle number of RMIDs supported by RDT_RESOURCE_PERF_PKG
2025-11-14 23:26 ` Reinette Chatre
@ 2025-11-17 16:37 ` Luck, Tony
2025-11-17 17:31 ` Reinette Chatre
0 siblings, 1 reply; 85+ messages in thread
From: Luck, Tony @ 2025-11-17 16:37 UTC (permalink / raw)
To: Reinette Chatre
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen Yu, x86, linux-kernel,
patches
On Fri, Nov 14, 2025 at 03:26:42PM -0800, Reinette Chatre wrote:
> Hi Tony,
>
> On 11/14/25 1:55 PM, Luck, Tony wrote:
> >
> > resctrl: Feature energy guid=0x26696143 not enabled due to insufficient RMIDs
> >
> >
> > static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
> > {
> > struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_PERF_PKG].r_resctrl;
> > bool warn_disable = false;
> >
> > if (!group_has_usable_regions(e, p))
> > return false;
> >
> > /* Disable feature if insufficient RMIDs */
> > if (!all_regions_have_sufficient_rmid(e, p)) {
> > warn_disable = true;
> > rdt_set_feature_disabled(e->name);
> > }
> >
> > /* User can override above disable from kernel command line */
> > if (!rdt_is_feature_enabled(e->name)) {
> > if (warn_disable)
> > pr_info("Feature %s guid=0x%x not enabled due to insufficient RMIDs\n",
> > e->name, e->guid);
> > return false;
> > }
> > ...
> > }
>
> Thank you for considering. This looks good to me.
>
> I now realize that if a system supports, for example, two energy guid and only one has insufficient
> RMID then one or both may be disabled by default depending on which resctrl attempts to enable
> first. This is arbitrary based on where the event group appears in the array.
intel_pmt_get_regions_by_feature() does return arrays of telemetry_region
with different guids today, but not currently for the "RMID" features.
So this could be a problem in the future.
I think I need to drop the "rdt=perf,!energy" command line control as
being too coarse. Instead add a new boot argument. E.g.
rdtguid=0x26696143,!0x26557651
to give the user control per-guid instead of per-pmt_feature_id. Users
can discover which guids are supported on a system by looking in
/sys/bus/auxiliary/devices/intel_vsec.discovery.*/intel_pmt/features*/per_rmid*
where there are "guids" and "num_rmids" files.
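The proposed option could be parsed along these lines. This is only a userspace sketch of the hypothetical rdtguid= format (helper name and struct are made up here); in-kernel code would more likely use strsep() and kstrtou32() from a __setup() handler:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct guid_opt {
	uint32_t guid;
	bool force_off;		/* token carried a leading '!' */
};

/*
 * Parse a list like "0x26696143,!0x26557651" into opts[].
 * Returns the number of tokens parsed, or -1 on a malformed
 * token or if more than @max tokens are present.
 */
static int parse_rdtguid(char *str, struct guid_opt *opts, int max)
{
	int n = 0;

	for (char *tok = strtok(str, ","); tok; tok = strtok(NULL, ",")) {
		char *end;

		if (n == max)
			return -1;
		opts[n].force_off = (*tok == '!');
		if (opts[n].force_off)
			tok++;
		opts[n].guid = (uint32_t)strtoul(tok, &end, 16);
		if (end == tok || *end)
			return -1;
		n++;
	}
	return n;
}
```

Each parsed entry would then be compared against event_group::guid during enumeration, with force_off playing the role rdt_set_feature_disabled() plays for the name-based options.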
> How a system with two guid of the same feature type would work is not clear to me though. Looks
> like they cannot share events at all since an event is uniquely associated with a struct pmt_event
> that can belong to only one event group. If they may share events then enable_events()->resctrl_enable_mon_event()
> will complain loudly but still proceed and allow the event group to be enabled.
I can't see a good reason why the same event would be enabled under
different guids present on the same system. We can revisit my assumption
if the "Duplicate enable for event" message shows up.
> I think the resctrl_enable_mon_event() warnings were added to support enabling of new features
> so that the WARNs can catch issues during development ... now it may encounter issues when a
> kernel with this implementation is run on a system that supports a single feature with
> multiple guid. Do you have more insight in how the "single feature with multiple guid" may look to
> better prepare resctrl to handle them?
>
> Should "enable_events" be split so that a feature can be disabled for all its event groups if
> any of them cannot be enabled due to insufficient RMIDs?
> Perhaps resctrl_enable_mon_event() should also now return success/fail so that an event group
> cannot be enabled if its events cannot be enabled?
> Finally, a system with two guid of the same feature type will end up printing duplicate
> "<feature type> monitoring detected" that could be more descriptive?
I need to add the guid to that message.
>
> Reinette
-Tony
* Re: [PATCH v13 25/32] x86/resctrl: Handle number of RMIDs supported by RDT_RESOURCE_PERF_PKG
2025-11-17 16:37 ` Luck, Tony
@ 2025-11-17 17:31 ` Reinette Chatre
2025-11-17 18:52 ` Luck, Tony
0 siblings, 1 reply; 85+ messages in thread
From: Reinette Chatre @ 2025-11-17 17:31 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen Yu, x86, linux-kernel,
patches
Hi Tony,
On 11/17/25 8:37 AM, Luck, Tony wrote:
> On Fri, Nov 14, 2025 at 03:26:42PM -0800, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 11/14/25 1:55 PM, Luck, Tony wrote:
>>>
>>> resctrl: Feature energy guid=0x26696143 not enabled due to insufficient RMIDs
>>>
>>>
>>> static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
>>> {
>>> struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_PERF_PKG].r_resctrl;
>>> bool warn_disable = false;
>>>
>>> if (!group_has_usable_regions(e, p))
>>> return false;
>>>
>>> /* Disable feature if insufficient RMIDs */
>>> if (!all_regions_have_sufficient_rmid(e, p)) {
>>> warn_disable = true;
>>> rdt_set_feature_disabled(e->name);
>>> }
>>>
>>> /* User can override above disable from kernel command line */
>>> if (!rdt_is_feature_enabled(e->name)) {
>>> if (warn_disable)
>>> pr_info("Feature %s guid=0x%x not enabled due to insufficient RMIDs\n",
>>> e->name, e->guid);
>>> return false;
>>> }
>>> ...
>>> }
>>
>> Thank you for considering. This looks good to me.
>>
>> I now realize that if a system supports, for example, two energy guid and only one has insufficient
>> RMID then one or both may be disabled by default depending on which resctrl attempts to enable
>> first. This is arbitrary based on where the event group appears in the array.
>
> intel_pmt_get_regions_by_feature() does return arrays of telemetry_region
> with different guids today, but not currently for the "RMID" features.
> So this could be a problem in the future.
>
> I think I need to drop the "rdt=perf,!energy" command line control as
> being too coarse. Instead add a new boot argument. E.g.
>
> rdtguid=0x26696143,!0x26557651
>
> to give the user control per-guid instead of per-pmt_feature_id. Users
> can discover which guids are supported on a system by looking in
> /sys/bus/auxiliary/devices/intel_vsec.discovery.*/intel_pmt/features*/per_rmid*
> where there are "guids" and "num_rmids" files.
Should disable/enable be per RMID telemetry feature? I do not see anything preventing a system from
using the same guid for different RMID telemetry features.
I think it will be useful to look at how other kernel parameters distinguish different
categories of parameters so that resctrl can be consistent here. Looks like an underscore is
most useful and also flexible since it allows both a dash and underscore to be used.
Another alternative that is common in kernel parameters is to use ":". For example,
rdt=energy:0x26696143
With something like the above, the user can, for example, use just "energy" to disable all RMID energy
telemetry, or be specific about which guid should be disabled. This seems to fit well with existing
rdt parameters and be quite flexible.
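The matching for such tokens could look like the sketch below (hedged: the helper name is hypothetical, only the two event_group fields needed for matching are mirrored, and the "name" vs "name:0xguid" semantics are as suggested above):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Mirrors the two fields of struct event_group used for matching. */
struct event_group {
	const char *name;	/* e.g. "energy" */
	uint32_t guid;		/* e.g. 0x26696143 */
};

/*
 * Does option token "name" or "name:0xguid" apply to this group?
 * A bare feature name matches every guid of that feature; a
 * name:guid token matches only the one event group.
 */
static bool rdt_opt_matches(const char *tok, const struct event_group *e)
{
	const char *colon = strchr(tok, ':');
	size_t len = colon ? (size_t)(colon - tok) : strlen(tok);

	if (len != strlen(e->name) || strncmp(tok, e->name, len))
		return false;
	if (!colon)
		return true;	/* "energy" matches all energy guids */
	return strtoul(colon + 1, NULL, 16) == e->guid;
}
```

With this shape, "energy" keeps the coarse per-feature control of the existing rdt= options, while "energy:0x26696143" narrows it to one guid without needing a second kernel parameter.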
>
>> How a system with two guid of the same feature type would work is not clear to me though. Looks
>> like they cannot share events at all since an event is uniquely associated with a struct pmt_event
>> that can belong to only one event group. If they may share events then enable_events()->resctrl_enable_mon_event()
>> will complain loudly but still proceed and allow the event group to be enabled.
>
> I can't see a good reason why the same event would be enabled under
> different guids present on the same system. We can revisit my assumption
> if the "Duplicate enable for event" message shows up.
This would be difficult to handle at that time, no? From what I can tell this would enable
an unusable event group to actually be enabled resulting in untested and invalid flows.
I think it will be safer to not enable an event group in this scenario, which seems to match your
expectation that this would be unexpected. The "Duplicate enable for event" message will still
appear and we can still revisit those assumptions when they do, but the systems encountering
them will not be running with enabled event groups that are not actually fully enabled.
>
>> I think the resctrl_enable_mon_event() warnings were added to support enabling of new features
>> so that the WARNs can catch issues during development ... now it may encounter issues when a
>> kernel with this implementation is run on a system that supports a single feature with
>> multiple guid. Do you have more insight in how the "single feature with multiple guid" may look to
>> better prepare resctrl to handle them?
>>
>> Should "enable_events" be split so that a feature can be disabled for all its event groups if
>> any of them cannot be enabled due to insufficient RMIDs?
>> Perhaps resctrl_enable_mon_event() should also now return success/fail so that an event group
>> cannot be enabled if its events cannot be enabled?
>> Finally, a system with two guid of the same feature type will end up printing duplicate
>> "<feature type> monitoring detected" that could be more descriptive?
>
> I need to add the guid to that message.
Sounds good. Thank you.
Reinette
* Re: [PATCH v13 25/32] x86/resctrl: Handle number of RMIDs supported by RDT_RESOURCE_PERF_PKG
2025-11-17 17:31 ` Reinette Chatre
@ 2025-11-17 18:52 ` Luck, Tony
2025-11-18 16:48 ` Reinette Chatre
0 siblings, 1 reply; 85+ messages in thread
From: Luck, Tony @ 2025-11-17 18:52 UTC (permalink / raw)
To: Reinette Chatre
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen Yu, x86, linux-kernel,
patches
On Mon, Nov 17, 2025 at 09:31:41AM -0800, Reinette Chatre wrote:
> Hi Tony,
>
> On 11/17/25 8:37 AM, Luck, Tony wrote:
> > On Fri, Nov 14, 2025 at 03:26:42PM -0800, Reinette Chatre wrote:
> >> Hi Tony,
> >>
> >> On 11/14/25 1:55 PM, Luck, Tony wrote:
> >>>
> >>> resctrl: Feature energy guid=0x26696143 not enabled due to insufficient RMIDs
> >>>
> >>>
> >>> static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
> >>> {
> >>> struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_PERF_PKG].r_resctrl;
> >>> bool warn_disable = false;
> >>>
> >>> if (!group_has_usable_regions(e, p))
> >>> return false;
> >>>
> >>> /* Disable feature if insufficient RMIDs */
> >>> if (!all_regions_have_sufficient_rmid(e, p)) {
> >>> warn_disable = true;
> >>> rdt_set_feature_disabled(e->name);
> >>> }
> >>>
> >>> /* User can override above disable from kernel command line */
> >>> if (!rdt_is_feature_enabled(e->name)) {
> >>> if (warn_disable)
> >>> pr_info("Feature %s guid=0x%x not enabled due to insufficient RMIDs\n",
> >>> e->name, e->guid);
> >>> return false;
> >>> }
> >>> ...
> >>> }
> >>
> >> Thank you for considering. This looks good to me.
> >>
> >> I now realize that if a system supports, for example, two energy guid and only one has insufficient
> >> RMID then one or both may be disabled by default depending on which resctrl attempts to enable
> >> first. This is arbitrary based on where the event group appears in the array.
> >
> > intel_pmt_get_regions_by_feature() does return arrays of telemetry_region
> > with different guids today, but not currently for the "RMID" features.
> > So this could be a problem in the future.
> >
> > I think I need to drop the "rdt=perf,!energy" command line control as
> > being too coarse. Instead add a new boot argument. E.g.
> >
> > rdtguid=0x26696143,!0x26557651
> >
> > to give the user control per-guid instead of per-pmt_feature_id. Users
> > can discover which guids are supported on a system by looking in
> > /sys/bus/auxiliary/devices/intel_vsec.discovery.*/intel_pmt/features*/per_rmid*
> > where there are "guids" and "num_rmids" files.
>
> Should disable/enable be per RMID telemetry feature? I do not see anything preventing a system from
> using the same guid for different RMID telemetry features.
>
> I think it will be useful to look at how other kernel parameters distinguish different
> categories of parameters so that resctrl can be consistent here. Looks like an underscore is
> most useful and also flexible since it allows both a dash and underscore to be used.
>
> Another alternative that is common in kernel parameters is to use ":". For example,
> rdt=energy:0x26696143
>
> With something like above user can, for example, use just "energy" to disable all RMID energy
> telemetry or be specific to which guid should be disabled. This seems to fit well with existing
> rdt parameters and be quite flexible.
See rough patch at foot of this e-mail. It's just on top of my WIP v14,
but if it looks like the right direction I will merge it into the series
in the patch that adds the energy/perf options.
> >
> >> How a system with two guid of the same feature type would work is not clear to me though. Looks
> >> like they cannot share events at all since an event is uniquely associated with a struct pmt_event
> >> that can belong to only one event group. If they may share events then enable_events()->resctrl_enable_mon_event()
> >> will complain loudly but still proceed and allow the event group to be enabled.
> >
> > I can't see a good reason why the same event would be enabled under
> > different guids present on the same system. We can revisit my assumption
> > if the "Duplicate enable for event" message shows up.
>
> This would be difficult to handle at that time, no? From what I can tell this would enable
> an unusable event group to actually be enabled resulting in untested and invalid flows.
> I think it will be safer to not enable an event group in this scenario, which seems to match your
> expectation that this would be unexpected. The "Duplicate enable for event" message will still
> appear and we can still revisit those assumptions when they do, but the systems encountering
> them will not be running with enabled event groups that are not actually fully enabled.
There's a hardware cost to including an event in an aggregator.
Including the same event in multiple aggregators described by
different guids is really something that should never happen.
Just printing a warning and skipping the event seems an adequate
defense.
>
> >
> >> I think the resctrl_enable_mon_event() warnings were added to support enabling of new features
> >> so that the WARNs can catch issues during development ... now it may encounter issues when a
> >> kernel with this implementation is run on a system that supports a single feature with
> >> multiple guid. Do you have more insight in how the "single feature with multiple guid" may look to
> >> better prepare resctrl to handle them?
> >>
> >> Should "enable_events" be split so that a feature can be disabled for all its event groups if
> >> any of them cannot be enabled due to insufficient RMIDs?
> >> Perhaps resctrl_enable_mon_event() should also now return success/fail so that an event group
> >> cannot be enabled if its events cannot be enabled?
> >> Finally, a system with two guid of the same feature type will end up printing duplicate
> >> "<feature type> monitoring detected" that could be more descriptive?
> >
> > I need to add the guid to that message.
>
> Sounds good. Thank you.
>
> Reinette
-Tony
---
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 08eb78acb988..25df1abc1537 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -241,6 +241,7 @@ int intel_aet_read_event(int domid, u32 rmid, void *arch_priv, u64 *val);
void intel_aet_mon_domain_setup(int cpu, int id, struct rdt_resource *r,
struct list_head *add_pos);
void intel_aet_add_debugfs(void);
+void intel_aet_option(bool force_off, const char *option, const char *suboption);
#else
static inline bool intel_aet_get_events(void) { return false; }
static inline void __exit intel_aet_exit(void) { }
@@ -252,6 +253,7 @@ static inline int intel_aet_read_event(int domid, u32 rmid, void *arch_priv, u64
static inline void intel_aet_mon_domain_setup(int cpu, int id, struct rdt_resource *r,
struct list_head *add_pos) { }
static inline void intel_aet_add_debugfs(void) { }
+static inline void intel_aet_option(bool force_off, const char *option, const char *suboption) { };
#endif
#endif /* _ASM_X86_RESCTRL_INTERNAL_H */
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 5cae4119686e..68195f458c0b 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -842,6 +842,7 @@ static struct rdt_options rdt_options[] = {
static int __init set_rdt_options(char *str)
{
struct rdt_options *o;
+ char *suboption;
bool force_off;
char *tok;
@@ -851,6 +852,11 @@ static int __init set_rdt_options(char *str)
force_off = *tok == '!';
if (force_off)
tok++;
+ suboption = strpbrk(tok, ":");
+ if (suboption) {
+ *suboption++ = '\0';
+ intel_aet_option(force_off, tok, suboption);
+ }
for (o = rdt_options; o < &rdt_options[NUM_RDT_OPTIONS]; o++) {
if (strcmp(tok, o->name) == 0) {
if (force_off)
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index 6028bfec229b..b3c61bcd3e8f 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -69,6 +69,9 @@ struct pmt_event {
* data for all telemetry regions type @feature.
* Valid if the system supports the event group.
* NULL otherwise.
+ * @force_off: Set true when "rdt" command line disables this @guid.
+ * @force_on: Set true when "rdt" command line overrides disable of
+ * this @guid due to insufficeint @num_rmid.
* @guid: Unique number per XML description file.
* @num_rmid: Number of RMIDs supported by this group. May be
* adjusted downwards if enumeration from
@@ -83,6 +86,7 @@ struct event_group {
enum pmt_feature_id feature;
char *name;
struct pmt_feature_group *pfg;
+ bool force_off, force_on;
/* Remaining fields initialized from XML file. */
u32 guid;
@@ -144,6 +148,26 @@ static struct event_group *known_event_groups[] = {
_peg < &known_event_groups[ARRAY_SIZE(known_event_groups)]; \
_peg++)
+void intel_aet_option(bool force_off, const char *option, const char *suboption)
+{
+ struct event_group **peg;
+ u32 guid;
+
+ if (kstrtou32(suboption, 16, &guid))
+ return;
+
+ for_each_event_group(peg) {
+ if (!strcmp(option, (*peg)->name))
+ continue;
+ if ((*peg)->guid != guid)
+ continue;
+ if (force_off)
+ (*peg)->force_off = true;
+ else
+ (*peg)->force_on = true;
+ }
+}
+
/*
* Clear the address field of regions that did not pass the checks in
* skip_telem_region() so they will not be used by intel_aet_read_event().
@@ -252,6 +276,9 @@ static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_PERF_PKG].r_resctrl;
bool warn_disable = false;
+ if (e->force_off)
+ return false;
+
if (!group_has_usable_regions(e, p))
return false;
@@ -262,7 +289,7 @@ static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
}
/* User can override above disable from kernel command line */
- if (!rdt_is_feature_enabled(e->name)) {
+ if (!rdt_is_feature_enabled(e->name) && !e->force_on) {
if (warn_disable)
pr_info("Feature %s guid=0x%x not enabled due to insufficient RMIDs\n",
e->name, e->guid);
^ permalink raw reply related [flat|nested] 85+ messages in thread

* Re: [PATCH v13 25/32] x86/resctrl: Handle number of RMIDs supported by RDT_RESOURCE_PERF_PKG
2025-11-17 18:52 ` Luck, Tony
@ 2025-11-18 16:48 ` Reinette Chatre
2025-11-18 17:35 ` Luck, Tony
0 siblings, 1 reply; 85+ messages in thread
From: Reinette Chatre @ 2025-11-18 16:48 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen Yu, x86, linux-kernel,
patches
Hi Tony,
On 11/17/25 10:52 AM, Luck, Tony wrote:
> On Mon, Nov 17, 2025 at 09:31:41AM -0800, Reinette Chatre wrote:
>> On 11/17/25 8:37 AM, Luck, Tony wrote:
>>> On Fri, Nov 14, 2025 at 03:26:42PM -0800, Reinette Chatre wrote:
>>>> On 11/14/25 1:55 PM, Luck, Tony wrote:
>>>> How a system with two guid of the same feature type would work is not clear to me though. Looks
>>>> like they cannot share events at all since an event is uniquely associated with a struct pmt_event
>>>> that can belong to only one event group. If they may share events then enable_events()->resctrl_enable_mon_event()
>>>> will complain loudly but still proceed and allow the event group to be enabled.
>>>
>>> I can't see a good reason why the same event would be enabled under
>>> different guids present on the same system. We can revisit my assumption
>>> if the "Duplicate enable for event" message shows up.
>>
>> This would be difficult to handle at that time, no? From what I can tell this would enable
>> an unusable event group to actually be enabled resulting in untested and invalid flows.
> >> I think it will be safer to not enable an event group in this scenario and seems to match your
>> expectation that this would be unexpected. The "Duplicate enable for event" message will still
>> appear and we can still revisit those assumptions when they do, but the systems encountering
>> them will not be running with enabled event groups that are not actually fully enabled.
>
> There's a hardware cost to including an event in an aggregator.
> > Including the same event in multiple aggregators described by
> different guids is really something that should never happen.
> Just printing a warning and skipping the event seems an adequate
> defense.
My concern is that after skipping the event there is a misleading message claiming the event group
was enabled successfully.
...
> ---
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 08eb78acb988..25df1abc1537 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -241,6 +241,7 @@ int intel_aet_read_event(int domid, u32 rmid, void *arch_priv, u64 *val);
> void intel_aet_mon_domain_setup(int cpu, int id, struct rdt_resource *r,
> struct list_head *add_pos);
> void intel_aet_add_debugfs(void);
> +void intel_aet_option(bool force_off, const char *option, const char *suboption);
> #else
> static inline bool intel_aet_get_events(void) { return false; }
> static inline void __exit intel_aet_exit(void) { }
> @@ -252,6 +253,7 @@ static inline int intel_aet_read_event(int domid, u32 rmid, void *arch_priv, u64
> static inline void intel_aet_mon_domain_setup(int cpu, int id, struct rdt_resource *r,
> struct list_head *add_pos) { }
> static inline void intel_aet_add_debugfs(void) { }
> +static inline void intel_aet_option(bool force_off, const char *option, const char *suboption) { };
(nit: stray semicolon)
> #endif
>
> #endif /* _ASM_X86_RESCTRL_INTERNAL_H */
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 5cae4119686e..68195f458c0b 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -842,6 +842,7 @@ static struct rdt_options rdt_options[] = {
> static int __init set_rdt_options(char *str)
> {
> struct rdt_options *o;
> + char *suboption;
> bool force_off;
> char *tok;
>
> @@ -851,6 +852,11 @@ static int __init set_rdt_options(char *str)
> force_off = *tok == '!';
> if (force_off)
> tok++;
> + suboption = strpbrk(tok, ":");
> + if (suboption) {
> + *suboption++ = '\0';
This looks like an open code of strsep()?
> + intel_aet_option(force_off, tok, suboption);
> + }
I think this can be simplified. It also looks possible to follow some patterns of existing
option handling.
By adding the force_on/force_off members to struct event_group I do not see the perf and
energy options needed in rdt_options[] anymore. rdt_set_feature_disabled() and
rdt_is_feature_enabled() now also seems unnecessary because the event_group::force_on and
event_group::force_off are sufficient.
It looks to me that the entire token can be passed here to intel_aet_option() and it
returns 1 to indicate that it was able to "handle" the token, 0 otherwise. If intel_aet_option()
was able to handle the option then it is not necessary to do further parsing.
"handle" means that it could successfully initialize the new members of struct event_group.
So instead, how about something like:
if (intel_aet_option(force_off, tok))
continue;
> for (o = rdt_options; o < &rdt_options[NUM_RDT_OPTIONS]; o++) {
> if (strcmp(tok, o->name) == 0) {
> if (force_off)
> diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> index 6028bfec229b..b3c61bcd3e8f 100644
> --- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
> +++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> @@ -69,6 +69,9 @@ struct pmt_event {
> * data for all telemetry regions type @feature.
> * Valid if the system supports the event group.
> * NULL otherwise.
> + * @force_off: Set true when "rdt" command line disables this @guid.
To be consistent, can also be true if event group was found to have insufficient RMID.
> + * @force_on: Set true when "rdt" command line overrides disable of
> + * this @guid due to insufficeint @num_rmid.
"Set" can be dropped to be explicit of state instead of potentially confusing "Set" to
be a verb.
insufficeint -> insufficient
> * @guid: Unique number per XML description file.
> * @num_rmid: Number of RMIDs supported by this group. May be
> * adjusted downwards if enumeration from
> @@ -83,6 +86,7 @@ struct event_group {
> enum pmt_feature_id feature;
> char *name;
> struct pmt_feature_group *pfg;
> + bool force_off, force_on;
>
> /* Remaining fields initialized from XML file. */
> u32 guid;
> @@ -144,6 +148,26 @@ static struct event_group *known_event_groups[] = {
> _peg < &known_event_groups[ARRAY_SIZE(known_event_groups)]; \
> _peg++)
>
> +void intel_aet_option(bool force_off, const char *option, const char *suboption)
> +{
> + struct event_group **peg;
> + u32 guid;
> +
Can use strsep() here to split provided token into name and guid. Take care to
check if guid NULL before attempting kstrtou32().
> + if (kstrtou32(suboption, 16, &guid))
> + return;
> +
> + for_each_event_group(peg) {
> + if (!strcmp(option, (*peg)->name))
!strcmp() -> strcmp()?
> + continue;
> + if ((*peg)->guid != guid)
> + continue;
If no guid provided then all event groups with matching name can have
force_on/force_off member set to support user providing, for example: "!perf" to
disable all perf event groups.
> + if (force_off)
> + (*peg)->force_off = true;
> + else
> + (*peg)->force_on = true;
> + }
> +}
> +
> /*
> * Clear the address field of regions that did not pass the checks in
> * skip_telem_region() so they will not be used by intel_aet_read_event().
> @@ -252,6 +276,9 @@ static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
> struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_PERF_PKG].r_resctrl;
> bool warn_disable = false;
>
> + if (e->force_off)
> + return false;
> +
> if (!group_has_usable_regions(e, p))
> return false;
>
rdt_set_feature_disabled() that is around here can be replaced with setting
event_group::force_off
> @@ -262,7 +289,7 @@ static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
> }
>
> /* User can override above disable from kernel command line */
> - if (!rdt_is_feature_enabled(e->name)) {
> + if (!rdt_is_feature_enabled(e->name) && !e->force_on) {
rdt_is_feature_enabled() can be replaced with check of event_group::force_off
> if (warn_disable)
> pr_info("Feature %s guid=0x%x not enabled due to insufficient RMIDs\n",
> e->name, e->guid);
>
>
I believe the changes I mention would simplify the implementation a lot while making it
more powerful, supporting, for example, the "!perf" use case. What do you think?
Reinette
* Re: [PATCH v13 25/32] x86/resctrl: Handle number of RMIDs supported by RDT_RESOURCE_PERF_PKG
2025-11-18 16:48 ` Reinette Chatre
@ 2025-11-18 17:35 ` Luck, Tony
2025-11-18 18:11 ` Reinette Chatre
0 siblings, 1 reply; 85+ messages in thread
From: Luck, Tony @ 2025-11-18 17:35 UTC (permalink / raw)
To: Reinette Chatre
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen Yu, x86, linux-kernel,
patches
On Tue, Nov 18, 2025 at 08:48:18AM -0800, Reinette Chatre wrote:
> Hi Tony,
>
> On 11/17/25 10:52 AM, Luck, Tony wrote:
> > On Mon, Nov 17, 2025 at 09:31:41AM -0800, Reinette Chatre wrote:
> >> On 11/17/25 8:37 AM, Luck, Tony wrote:
> >>> On Fri, Nov 14, 2025 at 03:26:42PM -0800, Reinette Chatre wrote:
> >>>> On 11/14/25 1:55 PM, Luck, Tony wrote:
> >>>> How a system with two guid of the same feature type would work is not clear to me though. Looks
> >>>> like they cannot share events at all since an event is uniquely associated with a struct pmt_event
> >>>> that can belong to only one event group. If they may share events then enable_events()->resctrl_enable_mon_event()
> >>>> will complain loudly but still proceed and allow the event group to be enabled.
> >>>
> >>> I can't see a good reason why the same event would be enabled under
> >>> different guids present on the same system. We can revisit my assumption
> >>> if the "Duplicate enable for event" message shows up.
> >>
> >> This would be difficult to handle at that time, no? From what I can tell this would enable
> >> an unusable event group to actually be enabled resulting in untested and invalid flows.
> >> I think it will be safer to not enable an event group in this scenario and seems to match your
> >> expectation that this would be unexpected. The "Duplicate enable for event" message will still
> >> appear and we can still revisit those assumptions when they do, but the systems encountering
> >> them will not be running with enabled event groups that are not actually fully enabled.
> >
> > There's a hardware cost to including an event in an aggregator.
> > Including the same event in multiple aggregators described by
> > different guids is really something that should never happen.
> > Just printing a warning and skipping the event seems an adequate
> > defense.
>
> My concern is that after skipping the event there is a misleading message claiming the event group was
> enabled successfully.
I can change resctrl_enable_mon_event() to return a "bool" to say
whether each event was successfully enabled.
Then change to:
int skipped_events = 0;
for (int j = 0; j < e->num_events; j++) {
if (!resctrl_enable_mon_event(e->evts[j].id, true,
e->evts[j].bin_bits, &e->evts[j]))
skipped_events++;
}
if (e->num_events == skipped_events) {
pr_info("No events enabled in %s %s:0x%x\n", r->name, e->name, e->guid);
return false;
}
if (skipped_events)
> pr_info("%s %s:0x%x monitoring detected (skipped %d events)\n",
> r->name, e->name, e->guid, skipped_events);
else
pr_info("%s %s:0x%x monitoring detected\n", r->name, e->name, e->guid);
>
> ...
>
> > ---
> >
> > diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> > index 08eb78acb988..25df1abc1537 100644
> > --- a/arch/x86/kernel/cpu/resctrl/internal.h
> > +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> > @@ -241,6 +241,7 @@ int intel_aet_read_event(int domid, u32 rmid, void *arch_priv, u64 *val);
> > void intel_aet_mon_domain_setup(int cpu, int id, struct rdt_resource *r,
> > struct list_head *add_pos);
> > void intel_aet_add_debugfs(void);
> > +void intel_aet_option(bool force_off, const char *option, const char *suboption);
> > #else
> > static inline bool intel_aet_get_events(void) { return false; }
> > static inline void __exit intel_aet_exit(void) { }
> > @@ -252,6 +253,7 @@ static inline int intel_aet_read_event(int domid, u32 rmid, void *arch_priv, u64
> > static inline void intel_aet_mon_domain_setup(int cpu, int id, struct rdt_resource *r,
> > struct list_head *add_pos) { }
> > static inline void intel_aet_add_debugfs(void) { }
> > +static inline void intel_aet_option(bool force_off, const char *option, const char *suboption) { };
>
> (nit: stray semicolon)
>
> > #endif
> >
> > #endif /* _ASM_X86_RESCTRL_INTERNAL_H */
> > diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> > index 5cae4119686e..68195f458c0b 100644
> > --- a/arch/x86/kernel/cpu/resctrl/core.c
> > +++ b/arch/x86/kernel/cpu/resctrl/core.c
> > @@ -842,6 +842,7 @@ static struct rdt_options rdt_options[] = {
> > static int __init set_rdt_options(char *str)
> > {
> > struct rdt_options *o;
> > + char *suboption;
> > bool force_off;
> > char *tok;
> >
> > @@ -851,6 +852,11 @@ static int __init set_rdt_options(char *str)
> > force_off = *tok == '!';
> > if (force_off)
> > tok++;
> > + suboption = strpbrk(tok, ":");
> > + if (suboption) {
> > + *suboption++ = '\0';
>
> This looks like an open code of strsep()?
>
> > + intel_aet_option(force_off, tok, suboption);
> > + }
>
> I think this can be simplified. It also looks possible to follow some patterns of existing
> option handling.
>
> By adding the force_on/force_off members to struct event_group I do not see the perf and
> energy options needed in rdt_options[] anymore. rdt_set_feature_disabled() and
> rdt_is_feature_enabled() now also seems unnecessary because the event_group::force_on and
> event_group::force_off are sufficient.
>
> It looks to me that the entire token can be passed here to intel_aet_option() and it
> returns 1 to indicate that it was able to "handle" the token, 0 otherwise. If intel_aet_option()
> was able to handle the option then it is not necessary to do further parsing.
> "handle" means that it could successfully initialize the new members of struct event_group.
>
> So instead, how about something like:
> if (intel_aet_option(force_off, tok))
> continue;
>
> > for (o = rdt_options; o < &rdt_options[NUM_RDT_OPTIONS]; o++) {
> > if (strcmp(tok, o->name) == 0) {
> > if (force_off)
> > diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> > index 6028bfec229b..b3c61bcd3e8f 100644
> > --- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
> > +++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
> > @@ -69,6 +69,9 @@ struct pmt_event {
> > * data for all telemetry regions type @feature.
> > * Valid if the system supports the event group.
> > * NULL otherwise.
> > + * @force_off: Set true when "rdt" command line disables this @guid.
>
> To be consistent, can also be true if event group was found to have insufficient RMID.
>
> > + * @force_on: Set true when "rdt" command line overrides disable of
> > + * this @guid due to insufficeint @num_rmid.
>
> "Set" can be dropped to be explicit of state instead of potentially confusing "Set" to
> be a verb.
>
> insufficeint -> insufficient
>
> > * @guid: Unique number per XML description file.
> > * @num_rmid: Number of RMIDs supported by this group. May be
> > * adjusted downwards if enumeration from
> > @@ -83,6 +86,7 @@ struct event_group {
> > enum pmt_feature_id feature;
> > char *name;
> > struct pmt_feature_group *pfg;
> > + bool force_off, force_on;
> >
> > /* Remaining fields initialized from XML file. */
> > u32 guid;
> > @@ -144,6 +148,26 @@ static struct event_group *known_event_groups[] = {
> > _peg < &known_event_groups[ARRAY_SIZE(known_event_groups)]; \
> > _peg++)
> >
> > +void intel_aet_option(bool force_off, const char *option, const char *suboption)
> > +{
> > + struct event_group **peg;
> > + u32 guid;
> > +
>
> Can use strsep() here to split provided token into name and guid. Take care to
> check if guid NULL before attempting kstrtou32().
>
> > + if (kstrtou32(suboption, 16, &guid))
> > + return;
> > +
> > + for_each_event_group(peg) {
> > + if (!strcmp(option, (*peg)->name))
>
> !strcmp() -> strcmp()?
>
> > + continue;
> > + if ((*peg)->guid != guid)
> > + continue;
>
> If no guid provided then all event groups with matching name can have
> force_on/force_off member set to support user providing, for example: "!perf" to
> disable all perf event groups.
>
> > + if (force_off)
> > + (*peg)->force_off = true;
> > + else
> > + (*peg)->force_on = true;
> > + }
> > +}
> > +
> > /*
> > * Clear the address field of regions that did not pass the checks in
> > * skip_telem_region() so they will not be used by intel_aet_read_event().
> > @@ -252,6 +276,9 @@ static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
> > struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_PERF_PKG].r_resctrl;
> > bool warn_disable = false;
> >
> > + if (e->force_off)
> > + return false;
> > +
> > if (!group_has_usable_regions(e, p))
> > return false;
> >
>
> rdt_set_feature_disabled() that is around here can be replaced with setting
> event_group::force_off
>
> > @@ -262,7 +289,7 @@ static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
> > }
> >
> > /* User can override above disable from kernel command line */
> > - if (!rdt_is_feature_enabled(e->name)) {
> > + if (!rdt_is_feature_enabled(e->name) && !e->force_on) {
>
> rdt_is_feature_enabled() can be replaced with check of event_group::force_off
>
> > if (warn_disable)
> > pr_info("Feature %s guid=0x%x not enabled due to insufficient RMIDs\n",
> > e->name, e->guid);
> >
> >
>
> I believe the changes I mention would simplify the implementation a lot while making it
> more powerful, supporting, for example, the "!perf" use case. What do you think?
Agreed. Simpler and more flexible. I'll make these changes.
>
> Reinette
-Tony
* Re: [PATCH v13 25/32] x86/resctrl: Handle number of RMIDs supported by RDT_RESOURCE_PERF_PKG
2025-11-18 17:35 ` Luck, Tony
@ 2025-11-18 18:11 ` Reinette Chatre
0 siblings, 0 replies; 85+ messages in thread
From: Reinette Chatre @ 2025-11-18 18:11 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Maciej Wieczor-Retman, Peter Newman, James Morse,
Babu Moger, Drew Fustini, Dave Martin, Chen Yu, x86, linux-kernel,
patches
Hi Tony,
On 11/18/25 9:35 AM, Luck, Tony wrote:
> On Tue, Nov 18, 2025 at 08:48:18AM -0800, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 11/17/25 10:52 AM, Luck, Tony wrote:
>>> On Mon, Nov 17, 2025 at 09:31:41AM -0800, Reinette Chatre wrote:
>>>> On 11/17/25 8:37 AM, Luck, Tony wrote:
>>>>> On Fri, Nov 14, 2025 at 03:26:42PM -0800, Reinette Chatre wrote:
>>>>>> On 11/14/25 1:55 PM, Luck, Tony wrote:
>>>>>> How a system with two guid of the same feature type would work is not clear to me though. Looks
>>>>>> like they cannot share events at all since an event is uniquely associated with a struct pmt_event
>>>>>> that can belong to only one event group. If they may share events then enable_events()->resctrl_enable_mon_event()
>>>>>> will complain loudly but still proceed and allow the event group to be enabled.
>>>>>
>>>>> I can't see a good reason why the same event would be enabled under
>>>>> different guids present on the same system. We can revisit my assumption
>>>>> if the "Duplicate enable for event" message shows up.
>>>>
>>>> This would be difficult to handle at that time, no? From what I can tell this would enable
>>>> an unusable event group to actually be enabled resulting in untested and invalid flows.
> >>> I think it will be safer to not enable an event group in this scenario and seems to match your
>>>> expectation that this would be unexpected. The "Duplicate enable for event" message will still
>>>> appear and we can still revisit those assumptions when they do, but the systems encountering
>>>> them will not be running with enabled event groups that are not actually fully enabled.
>>>
>>> There's a hardware cost to including an event in an aggregator.
> >>> Including the same event in multiple aggregators described by
>>> different guids is really something that should never happen.
>>> Just printing a warning and skipping the event seems an adequate
>>> defense.
>>
> >> My concern is that after skipping the event there is a misleading message claiming the event group was
> >> enabled successfully.
>
> I can change resctrl_enable_mon_event() to return a "bool" to say
> whether each event was successfully enabled.
>
> Then change to:
>
> int skipped_events = 0;
>
> for (int j = 0; j < e->num_events; j++) {
> if (!resctrl_enable_mon_event(e->evts[j].id, true,
> e->evts[j].bin_bits, &e->evts[j]))
> skipped_events++;
> }
>
> if (e->num_events == skipped_events) {
> pr_info("No events enabled in %s %s:0x%x\n", r->name, e->name, e->guid);
> return false;
> }
>
> if (skipped_events)
> > pr_info("%s %s:0x%x monitoring detected (skipped %d events)\n",
> > r->name, e->name, e->guid, skipped_events);
> else
> pr_info("%s %s:0x%x monitoring detected\n", r->name, e->name, e->guid);
This looks good to me. Thank you. I am not able to tell from this snippet, but since enabling of
an event group can fail at this point, I think the r->mon.num_rmid initialization should
now be moved later so that a failing event group will not impact the number of RMIDs the
resource can use.
Reinette
* [PATCH v13 26/32] fs/resctrl: Move allocation/free of closid_num_dirty_rmid[]
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
` (24 preceding siblings ...)
2025-10-29 16:21 ` [PATCH v13 25/32] x86/resctrl: Handle number of RMIDs supported by RDT_RESOURCE_PERF_PKG Tony Luck
@ 2025-10-29 16:21 ` Tony Luck
2025-11-13 22:51 ` Reinette Chatre
2025-10-29 16:21 ` [PATCH v13 27/32] x86,fs/resctrl: Compute number of RMIDs as minimum across resources Tony Luck
` (7 subsequent siblings)
33 siblings, 1 reply; 85+ messages in thread
From: Tony Luck @ 2025-10-29 16:21 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
closid_num_dirty_rmid[] and rmid_ptrs[] are allocated together during resctrl
initialization and freed together during resctrl exit.
Telemetry events are enumerated at resctrl mount time, so the number of RMIDs
supported by all monitoring resources, which determines the size of rmid_ptrs[],
is only known at mount.
Separate closid_num_dirty_rmid[] and rmid_ptrs[] allocation and free in
preparation for rmid_ptrs[] to be allocated on resctrl mount.
Keep the rdtgroup_mutex protection around the allocation and free of
closid_num_dirty_rmid[] as ARM needs this to guarantee memory ordering.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
fs/resctrl/monitor.c | 79 ++++++++++++++++++++++++++++----------------
1 file changed, 51 insertions(+), 28 deletions(-)
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index a04c1724fc44..9f5097a4be82 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -893,36 +893,14 @@ void mbm_setup_overflow_handler(struct rdt_l3_mon_domain *dom, unsigned long del
static int dom_data_init(struct rdt_resource *r)
{
u32 idx_limit = resctrl_arch_system_num_rmid_idx();
- u32 num_closid = resctrl_arch_get_num_closid(r);
struct rmid_entry *entry = NULL;
int err = 0, i;
u32 idx;
mutex_lock(&rdtgroup_mutex);
- if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) {
- u32 *tmp;
-
- /*
- * If the architecture hasn't provided a sanitised value here,
- * this may result in larger arrays than necessary. Resctrl will
- * use a smaller system wide value based on the resources in
- * use.
- */
- tmp = kcalloc(num_closid, sizeof(*tmp), GFP_KERNEL);
- if (!tmp) {
- err = -ENOMEM;
- goto out_unlock;
- }
-
- closid_num_dirty_rmid = tmp;
- }
rmid_ptrs = kcalloc(idx_limit, sizeof(struct rmid_entry), GFP_KERNEL);
if (!rmid_ptrs) {
- if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) {
- kfree(closid_num_dirty_rmid);
- closid_num_dirty_rmid = NULL;
- }
err = -ENOMEM;
goto out_unlock;
}
@@ -958,11 +936,6 @@ static void dom_data_exit(struct rdt_resource *r)
if (!r->mon_capable)
goto out_unlock;
- if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) {
- kfree(closid_num_dirty_rmid);
- closid_num_dirty_rmid = NULL;
- }
-
kfree(rmid_ptrs);
rmid_ptrs = NULL;
@@ -1799,6 +1772,45 @@ ssize_t mbm_L3_assignments_write(struct kernfs_open_file *of, char *buf,
return ret ?: nbytes;
}
+static int closid_num_dirty_rmid_alloc(struct rdt_resource *r)
+{
+ if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) {
+ u32 num_closid = resctrl_arch_get_num_closid(r);
+ u32 *tmp;
+
+ /* For ARM memory ordering access to closid_num_dirty_rmid */
+ mutex_lock(&rdtgroup_mutex);
+
+ /*
+ * If the architecture hasn't provided a sanitised value here,
+ * this may result in larger arrays than necessary. Resctrl will
+ * use a smaller system wide value based on the resources in
+ * use.
+ */
+ tmp = kcalloc(num_closid, sizeof(*tmp), GFP_KERNEL);
+ if (!tmp) {
+ mutex_unlock(&rdtgroup_mutex);
+ return -ENOMEM;
+ }
+
+ closid_num_dirty_rmid = tmp;
+
+ mutex_unlock(&rdtgroup_mutex);
+ }
+
+ return 0;
+}
+
+static void closid_num_dirty_rmid_free(void)
+{
+ if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) {
+ mutex_lock(&rdtgroup_mutex);
+ kfree(closid_num_dirty_rmid);
+ closid_num_dirty_rmid = NULL;
+ mutex_unlock(&rdtgroup_mutex);
+ }
+}
+
/**
* resctrl_l3_mon_resource_init() - Initialise global monitoring structures.
*
@@ -1819,10 +1831,16 @@ int resctrl_l3_mon_resource_init(void)
if (!r->mon_capable)
return 0;
- ret = dom_data_init(r);
+ ret = closid_num_dirty_rmid_alloc(r);
if (ret)
return ret;
+ ret = dom_data_init(r);
+ if (ret) {
+ closid_num_dirty_rmid_free();
+ return ret;
+ }
+
if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_TOTAL_EVENT_ID)) {
mon_event_all[QOS_L3_MBM_TOTAL_EVENT_ID].configurable = true;
resctrl_file_fflags_init("mbm_total_bytes_config",
@@ -1865,5 +1883,10 @@ void resctrl_l3_mon_resource_exit(void)
{
struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
+ if (!r->mon_capable)
+ return;
+
+ closid_num_dirty_rmid_free();
+
dom_data_exit(r);
}
--
2.51.0
* Re: [PATCH v13 26/32] fs/resctrl: Move allocation/free of closid_num_dirty_rmid[]
2025-10-29 16:21 ` [PATCH v13 26/32] fs/resctrl: Move allocation/free of closid_num_dirty_rmid[] Tony Luck
@ 2025-11-13 22:51 ` Reinette Chatre
0 siblings, 0 replies; 85+ messages in thread
From: Reinette Chatre @ 2025-11-13 22:51 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 10/29/25 9:21 AM, Tony Luck wrote:
> closid_num_dirty_rmid[] and rmid_ptrs[] are allocated together during resctrl
> initialization and freed together during resctrl exit.
>
> Telemetry events are enumerated on resctrl mount so only at resctrl mount will
> the number of RMID supported by all monitoring resources and needed as size for
> rmid_ptrs[] be known.
>
> Separate closid_num_dirty_rmid[] and rmid_ptrs[] allocation and free in
> preparation for rmid_ptrs[] to be allocated on resctrl mount.
>
> Keep the rdtgroup_mutex protection around the allocation and free of
> closid_num_dirty_rmid[] as ARM needs this to guarantee memory ordering.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reinette
* [PATCH v13 27/32] x86,fs/resctrl: Compute number of RMIDs as minimum across resources
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
` (25 preceding siblings ...)
2025-10-29 16:21 ` [PATCH v13 26/32] fs/resctrl: Move allocation/free of closid_num_dirty_rmid[] Tony Luck
@ 2025-10-29 16:21 ` Tony Luck
2025-10-29 16:21 ` [PATCH v13 28/32] fs/resctrl: Move RMID initialization to first mount Tony Luck
` (6 subsequent siblings)
33 siblings, 0 replies; 85+ messages in thread
From: Tony Luck @ 2025-10-29 16:21 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
resctrl assumes that only the L3 resource supports monitor events, so
it simply takes the rdt_resource::num_rmid from RDT_RESOURCE_L3 as
the system's number of RMIDs.
The addition of telemetry events in a different resource breaks that
assumption.
Compute the number of available RMIDs as the minimum value across
all mon_capable resources (analogous to how the number of CLOSIDs
is computed across alloc_capable resources).
Note that mount time enumeration of the telemetry resource means that
this number can be reduced. If this happens, then some memory will
be wasted as the allocations for rdt_l3_mon_domain::mbm_states[] and
rdt_l3_mon_domain::rmid_busy_llc created during resctrl initialization
will be larger than needed.
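The minimum-across-resources computation described above can be sketched in plain user-space C. This is an illustrative stand-in, not the kernel implementation: the struct and function names are simplified versions of rdt_resource and resctrl_arch_system_num_rmid_idx(), and the RMID counts are made-up examples.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the relevant fields of struct rdt_resource. */
struct resource {
	bool mon_capable;
	uint32_t num_rmid;
};

/*
 * Usable RMID count is the minimum across all mon_capable resources,
 * or 0 when no resource is mon_capable (the UINT32_MAX sentinel was
 * never lowered).
 */
static uint32_t system_num_rmid(const struct resource *res, int nr)
{
	uint32_t num_rmids = UINT32_MAX;

	for (int i = 0; i < nr; i++) {
		if (!res[i].mon_capable)
			continue;
		if (res[i].num_rmid < num_rmids)
			num_rmids = res[i].num_rmid;
	}

	return num_rmids == UINT32_MAX ? 0 : num_rmids;
}
```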
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
arch/x86/kernel/cpu/resctrl/core.c | 15 +++++++++++++--
fs/resctrl/rdtgroup.c | 6 ++++++
2 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index a8eb197e27db..9eb1bca9436b 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -110,12 +110,23 @@ struct rdt_hw_resource rdt_resources_all[RDT_NUM_RESOURCES] = {
},
};
+/**
+ * resctrl_arch_system_num_rmid_idx() - Compute number of supported RMIDs
+ * (minimum across all mon_capable resources)
+ *
+ * Return: Number of supported RMIDs at time of call. Note that mount time
+ * enumeration of resources may reduce the number.
+ */
u32 resctrl_arch_system_num_rmid_idx(void)
{
- struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ u32 num_rmids = U32_MAX;
+ struct rdt_resource *r;
+
+ for_each_mon_capable_rdt_resource(r)
+ num_rmids = min(num_rmids, r->mon.num_rmid);
/* RMID are independent numbers for x86. num_rmid_idx == num_rmid */
- return r->mon.num_rmid;
+ return num_rmids == U32_MAX ? 0 : num_rmids;
}
struct rdt_resource *resctrl_arch_get_resource(enum resctrl_res_level l)
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index b67faf6a5012..15b2765898d6 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -4286,6 +4286,12 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain_hdr *h
* During boot this may be called before global allocations have been made by
* resctrl_l3_mon_resource_init().
*
+ * Called during CPU online that may run as soon as CPU online callbacks
+ * are set up during resctrl initialization. The number of supported RMIDs
+ * may be reduced if additional mon_capable resources are enumerated
+ * at mount time. This means the rdt_l3_mon_domain::mbm_states[] and
+ * rdt_l3_mon_domain::rmid_busy_llc allocations may be larger than needed.
+ *
* Return: 0 for success, or -ENOMEM.
*/
static int domain_setup_l3_mon_state(struct rdt_resource *r, struct rdt_l3_mon_domain *d)
--
2.51.0
* [PATCH v13 28/32] fs/resctrl: Move RMID initialization to first mount
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
` (26 preceding siblings ...)
2025-10-29 16:21 ` [PATCH v13 27/32] x86,fs/resctrl: Compute number of RMIDs as minimum across resources Tony Luck
@ 2025-10-29 16:21 ` Tony Luck
2025-10-29 16:21 ` [PATCH v13 29/32] x86/resctrl: Enable RDT_RESOURCE_PERF_PKG Tony Luck
` (5 subsequent siblings)
33 siblings, 0 replies; 85+ messages in thread
From: Tony Luck @ 2025-10-29 16:21 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
L3 monitor features are enumerated during resctrl initialization.
rmid_ptrs[], which tracks all RMIDs and whose size depends on the
number of supported RMIDs, is allocated at the same time.
Telemetry monitor features are enumerated during first resctrl mount and
may support a different number of RMIDs compared to L3 monitor features.
Delay allocation and initialization of rmid_ptrs[] until first mount.
Since the number of RMIDs cannot change on later mounts, keep the same
set of rmid_ptrs[] until resctrl_exit(). This is required because the
limbo handler keeps running after resctrl is unmounted and still needs
to access rmid_ptrs[] while tracking busy RMIDs.
Rename routines to match what they now do:
dom_data_init() -> setup_rmid_lru_list()
dom_data_exit() -> free_rmid_lru_list()
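The "allocate once on first mount, keep until exit" pattern above can be sketched in user-space C. This is a simplified illustration only: locking and the rmid_entry list initialization are omitted, and the names mirror but are not the kernel functions.

```c
#include <stddef.h>
#include <stdlib.h>

static int *rmid_ptrs;   /* stand-in for the kernel's rmid_ptrs[] */
static size_t rmid_count;

/*
 * Idempotent first-mount allocation: later mounts see rmid_ptrs
 * already set and keep using the same array.
 */
static int setup_rmid_table(size_t idx_limit)
{
	if (rmid_ptrs)
		return 0;	/* already set up on an earlier mount */

	rmid_ptrs = calloc(idx_limit, sizeof(*rmid_ptrs));
	if (!rmid_ptrs)
		return -1;
	rmid_count = idx_limit;
	return 0;
}

/* Only called at exit, after the limbo handler no longer needs it. */
static void free_rmid_table(void)
{
	free(rmid_ptrs);
	rmid_ptrs = NULL;
	rmid_count = 0;
}
```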
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
fs/resctrl/internal.h | 4 ++++
fs/resctrl/monitor.c | 54 ++++++++++++++++++++-----------------------
fs/resctrl/rdtgroup.c | 5 ++++
3 files changed, 34 insertions(+), 29 deletions(-)
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 46fd648a2961..0dd89d3fa31a 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -369,6 +369,10 @@ int closids_supported(void);
void closid_free(int closid);
+int setup_rmid_lru_list(void);
+
+void free_rmid_lru_list(void);
+
int alloc_rmid(u32 closid);
void free_rmid(u32 closid, u32 rmid);
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 9f5097a4be82..448f490ba344 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -890,20 +890,29 @@ void mbm_setup_overflow_handler(struct rdt_l3_mon_domain *dom, unsigned long del
schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
}
-static int dom_data_init(struct rdt_resource *r)
+int setup_rmid_lru_list(void)
{
- u32 idx_limit = resctrl_arch_system_num_rmid_idx();
struct rmid_entry *entry = NULL;
- int err = 0, i;
+ u32 idx_limit;
u32 idx;
+ int i;
- mutex_lock(&rdtgroup_mutex);
+ if (!resctrl_arch_mon_capable())
+ return 0;
+ /*
+ * Called on every mount, but the number of RMIDs cannot change
+ * after the first mount, so keep using the same set of rmid_ptrs[]
+ * until resctrl_exit(). Note that the limbo handler continues to
+ * access rmid_ptrs[] after resctrl is unmounted.
+ */
+ if (rmid_ptrs)
+ return 0;
+
+ idx_limit = resctrl_arch_system_num_rmid_idx();
rmid_ptrs = kcalloc(idx_limit, sizeof(struct rmid_entry), GFP_KERNEL);
- if (!rmid_ptrs) {
- err = -ENOMEM;
- goto out_unlock;
- }
+ if (!rmid_ptrs)
+ return -ENOMEM;
for (i = 0; i < idx_limit; i++) {
entry = &rmid_ptrs[i];
@@ -916,30 +925,24 @@ static int dom_data_init(struct rdt_resource *r)
/*
* RESCTRL_RESERVED_CLOSID and RESCTRL_RESERVED_RMID are special and
* are always allocated. These are used for the rdtgroup_default
- * control group, which will be setup later in resctrl_init().
+ * control group, which was setup earlier in rdtgroup_setup_default().
*/
idx = resctrl_arch_rmid_idx_encode(RESCTRL_RESERVED_CLOSID,
RESCTRL_RESERVED_RMID);
entry = __rmid_entry(idx);
list_del(&entry->list);
-out_unlock:
- mutex_unlock(&rdtgroup_mutex);
-
- return err;
+ return 0;
}
-static void dom_data_exit(struct rdt_resource *r)
+void free_rmid_lru_list(void)
{
- mutex_lock(&rdtgroup_mutex);
-
- if (!r->mon_capable)
- goto out_unlock;
+ if (!resctrl_arch_mon_capable())
+ return;
+ mutex_lock(&rdtgroup_mutex);
kfree(rmid_ptrs);
rmid_ptrs = NULL;
-
-out_unlock:
mutex_unlock(&rdtgroup_mutex);
}
@@ -1815,7 +1818,8 @@ static void closid_num_dirty_rmid_free(void)
* resctrl_l3_mon_resource_init() - Initialise global monitoring structures.
*
* Allocate and initialise global monitor resources that do not belong to a
- * specific domain. i.e. the rmid_ptrs[] used for the limbo and free lists.
+ * specific domain. i.e. the closid_num_dirty_rmid[] used to find the CLOSID
+ * with the cleanest set of RMIDs.
* Called once during boot after the struct rdt_resource's have been configured
* but before the filesystem is mounted.
* Resctrl's cpuhp callbacks may be called before this point to bring a domain
@@ -1835,12 +1839,6 @@ int resctrl_l3_mon_resource_init(void)
if (ret)
return ret;
- ret = dom_data_init(r);
- if (ret) {
- closid_num_dirty_rmid_free();
- return ret;
- }
-
if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_TOTAL_EVENT_ID)) {
mon_event_all[QOS_L3_MBM_TOTAL_EVENT_ID].configurable = true;
resctrl_file_fflags_init("mbm_total_bytes_config",
@@ -1887,6 +1885,4 @@ void resctrl_l3_mon_resource_exit(void)
return;
closid_num_dirty_rmid_free();
-
- dom_data_exit(r);
}
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 15b2765898d6..17f5b986f210 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2734,6 +2734,10 @@ static int rdt_get_tree(struct fs_context *fc)
goto out;
}
+ ret = setup_rmid_lru_list();
+ if (ret)
+ goto out;
+
ret = rdtgroup_setup_root(ctx);
if (ret)
goto out;
@@ -4586,4 +4590,5 @@ void resctrl_exit(void)
*/
resctrl_l3_mon_resource_exit();
+ free_rmid_lru_list();
}
--
2.51.0
* [PATCH v13 29/32] x86/resctrl: Enable RDT_RESOURCE_PERF_PKG
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
` (27 preceding siblings ...)
2025-10-29 16:21 ` [PATCH v13 28/32] fs/resctrl: Move RMID initialization to first mount Tony Luck
@ 2025-10-29 16:21 ` Tony Luck
2025-11-13 22:52 ` Reinette Chatre
2025-10-29 16:21 ` [PATCH v13 30/32] fs/resctrl: Provide interface to create architecture specific debugfs area Tony Luck
` (4 subsequent siblings)
33 siblings, 1 reply; 85+ messages in thread
From: Tony Luck @ 2025-10-29 16:21 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Since telemetry events are enumerated on resctrl mount, the RDT_RESOURCE_PERF_PKG
resource is not considered "monitoring capable" during early resctrl initialization.
This means that the domain list for RDT_RESOURCE_PERF_PKG is not built when the CPU
hotplug notifiers are registered and run for the first time right after resctrl
initialization.
Mark the RDT_RESOURCE_PERF_PKG as "monitoring capable" upon successful telemetry
event enumeration to ensure future CPU hotplug events include this resource and
initialize its domain list for CPUs that are already online.
Print a console message announcing the name of the telemetry feature detected.
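The patch below guards the mount-time work with an atomic_try_cmpxchg() so it runs exactly once. The equivalent once-only guard can be sketched with C11 atomics (illustrative only; the kernel uses its own atomic_t API, not <stdatomic.h>):

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int only_once;	/* zero-initialized, like ATOMIC_INIT(0) */

/*
 * Returns true for exactly one caller: the compare-exchange flips
 * 0 -> 1 once, and every later attempt fails because the value is
 * already 1.
 */
static bool first_caller(void)
{
	int old = 0;

	return atomic_compare_exchange_strong(&only_once, &old, 1);
}
```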
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/core.c | 16 +++++++++++++++-
arch/x86/kernel/cpu/resctrl/intel_aet.c | 2 ++
2 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 9eb1bca9436b..7a9c7e6ad712 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -761,14 +761,28 @@ static int resctrl_arch_offline_cpu(unsigned int cpu)
void resctrl_arch_pre_mount(void)
{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_PERF_PKG].r_resctrl;
static atomic_t only_once = ATOMIC_INIT(0);
- int old = 0;
+ int cpu, old = 0;
if (!atomic_try_cmpxchg(&only_once, &old, 1))
return;
if (!intel_aet_get_events())
return;
+
+ /*
+ * Late discovery of telemetry events means the domains for the
+ * resource were not built. Do that now.
+ */
+ cpus_read_lock();
+ mutex_lock(&domain_list_lock);
+ r->mon_capable = true;
+ rdt_mon_capable = true;
+ for_each_online_cpu(cpu)
+ domain_add_cpu_mon(cpu, r);
+ mutex_unlock(&domain_list_lock);
+ cpus_read_unlock();
}
enum {
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index 252a3fd4260c..2f4f8fb317d7 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -238,6 +238,8 @@ static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
resctrl_enable_mon_event(e->evts[j].id, true,
e->evts[j].bin_bits, &e->evts[j]);
+ pr_info("%s %s monitoring detected\n", r->name, e->name);
+
return true;
}
--
2.51.0
* Re: [PATCH v13 29/32] x86/resctrl: Enable RDT_RESOURCE_PERF_PKG
2025-10-29 16:21 ` [PATCH v13 29/32] x86/resctrl: Enable RDT_RESOURCE_PERF_PKG Tony Luck
@ 2025-11-13 22:52 ` Reinette Chatre
0 siblings, 0 replies; 85+ messages in thread
From: Reinette Chatre @ 2025-11-13 22:52 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 10/29/25 9:21 AM, Tony Luck wrote:
> Since telemetry events are enumerated on resctrl mount the RDT_RESOURCE_PERF_PKG
> resource is not considered "monitoring capable" during early resctrl initialization.
> This means that the domain list for RDT_RESOURCE_PERF_PKG is not built when the CPU
> hot plug notifiers are registered and run for the first time right after resctrl
> initialization.
>
> Mark the RDT_RESOURCE_PERF_PKG as "monitoring capable" upon successful telemetry
> event enumeration to ensure future CPU hotplug events include this resource and
> initialize its domain list for CPUs that are already online.
>
> Print to console log announcing the name of the telemetry feature detected.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reinette
* [PATCH v13 30/32] fs/resctrl: Provide interface to create architecture specific debugfs area
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
` (28 preceding siblings ...)
2025-10-29 16:21 ` [PATCH v13 29/32] x86/resctrl: Enable RDT_RESOURCE_PERF_PKG Tony Luck
@ 2025-10-29 16:21 ` Tony Luck
2025-10-29 16:21 ` [PATCH v13 31/32] x86/resctrl: Add debugfs files to show telemetry aggregator status Tony Luck
` (3 subsequent siblings)
33 siblings, 0 replies; 85+ messages in thread
From: Tony Luck @ 2025-10-29 16:21 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
All files below /sys/fs/resctrl are considered user ABI.
This leaves no place for architectures to provide additional
interfaces.
Add resctrl_debugfs_mon_info_arch_mkdir() which creates a directory in
the debugfs file system for a monitoring resource. Naming follows the
layout of the main resctrl hierarchy:
/sys/kernel/debug/resctrl/info/{resource}_MON/{arch}
The {arch} last level directory name matches the output of
the user level "uname -m" command.
Architecture code may use this directory for debug information,
or for minor tuning of features. It must not be used for basic
feature enabling as debugfs may not be configured/mounted on
production systems.
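The directory naming described above can be sketched in user-space C. This is an illustration of the path layout only (the hypothetical mon_info_dir() helper is not kernel code); the kernel builds the path with debugfs_create_dir() and utsname()->machine.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/utsname.h>

/*
 * Build the "<resource>_MON/<arch>" path component used under
 * /sys/kernel/debug/resctrl/info/, where <arch> matches the output
 * of "uname -m".
 */
static void mon_info_dir(char *buf, size_t len, const char *resname)
{
	struct utsname u;

	uname(&u);
	snprintf(buf, len, "%s_MON/%s", resname, u.machine);
}
```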
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
include/linux/resctrl.h | 10 ++++++++++
fs/resctrl/rdtgroup.c | 29 +++++++++++++++++++++++++++++
2 files changed, 39 insertions(+)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index a2bf335052d6..3d176f4d6b6e 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -678,6 +678,16 @@ void resctrl_arch_reset_cntr(struct rdt_resource *r, struct rdt_l3_mon_domain *d
extern unsigned int resctrl_rmid_realloc_threshold;
extern unsigned int resctrl_rmid_realloc_limit;
+/**
+ * resctrl_debugfs_mon_info_arch_mkdir() - Create a debugfs info directory.
+ * Removed by resctrl_exit().
+ * @r: Resource (must be mon_capable).
+ *
+ * Return: NULL if resource is not monitoring capable,
+ * dentry pointer on success, or ERR_PTR(-ERROR) on failure.
+ */
+struct dentry *resctrl_debugfs_mon_info_arch_mkdir(struct rdt_resource *r);
+
int resctrl_init(void);
void resctrl_exit(void);
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 17f5b986f210..0257c9938455 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -24,6 +24,7 @@
#include <linux/sched/task.h>
#include <linux/slab.h>
#include <linux/user_namespace.h>
+#include <linux/utsname.h>
#include <uapi/linux/magic.h>
@@ -75,6 +76,8 @@ static void rdtgroup_destroy_root(void);
struct dentry *debugfs_resctrl;
+static struct dentry *debugfs_resctrl_info;
+
/*
* Memory bandwidth monitoring event to use for the default CTRL_MON group
* and each new CTRL_MON group created by the user. Only relevant when
@@ -4531,6 +4534,31 @@ int resctrl_init(void)
return ret;
}
+/*
+ * Create /sys/kernel/debug/resctrl/info/{r->name}_MON/{arch} directory
+ * by request for architecture to use for debugging or minor tuning.
+ * Basic functionality of features must not be controlled by files
+ * added to this directory as debugfs may not be configured/mounted
+ * on production systems.
+ */
+struct dentry *resctrl_debugfs_mon_info_arch_mkdir(struct rdt_resource *r)
+{
+ struct dentry *moninfodir;
+ char name[32];
+
+ if (!r->mon_capable)
+ return NULL;
+
+ if (!debugfs_resctrl_info)
+ debugfs_resctrl_info = debugfs_create_dir("info", debugfs_resctrl);
+
+ sprintf(name, "%s_MON", r->name);
+
+ moninfodir = debugfs_create_dir(name, debugfs_resctrl_info);
+
+ return debugfs_create_dir(utsname()->machine, moninfodir);
+}
+
static bool resctrl_online_domains_exist(void)
{
struct rdt_resource *r;
@@ -4582,6 +4610,7 @@ void resctrl_exit(void)
debugfs_remove_recursive(debugfs_resctrl);
debugfs_resctrl = NULL;
+ debugfs_resctrl_info = NULL;
unregister_filesystem(&rdt_fs_type);
/*
--
2.51.0
* [PATCH v13 31/32] x86/resctrl: Add debugfs files to show telemetry aggregator status
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
` (29 preceding siblings ...)
2025-10-29 16:21 ` [PATCH v13 30/32] fs/resctrl: Provide interface to create architecture specific debugfs area Tony Luck
@ 2025-10-29 16:21 ` Tony Luck
2025-11-13 22:53 ` Reinette Chatre
2025-10-29 16:21 ` [PATCH v13 32/32] x86,fs/resctrl: Update documentation for telemetry events Tony Luck
` (2 subsequent siblings)
33 siblings, 1 reply; 85+ messages in thread
From: Tony Luck @ 2025-10-29 16:21 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Each telemetry aggregator provides three status registers at the top
end of MMIO space after all the per-RMID per-event counters:
data_loss_count: This counts the number of times that this aggregator
failed to accumulate a counter value supplied by a CPU core.
data_loss_timestamp: This is a "timestamp" from a free running
25MHz uncore timer indicating when the most recent data loss occurred.
last_update_timestamp: Another 25MHz timestamp indicating when the
most recent counter update was successfully applied.
Create files in /sys/kernel/debug/resctrl/info/PERF_PKG_MON/x86_64/
to display the value of each of these status registers for each aggregator
in each enabled event group. The prefix for each file name describes
the type of aggregator, which package it is located on, and an opaque
instance number to provide a unique file name when there are multiple
aggregators on a package.
The suffix is one of the three strings listed above. An example name is:
energy_pkg0_agg2_data_loss_count
These files are removed along with all other debugfs entries by the
call to debugfs_remove_recursive() in resctrl_exit().
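The register layout described above puts the three status registers in the last 24 bytes of each aggregator's MMIO region, which matches the info_end - 24/-16/-8 pointers in the patch below. A small sketch of the offset arithmetic (helper names here are illustrative, not kernel symbols):

```c
#include <assert.h>
#include <stddef.h>

/*
 * The three 8-byte status registers sit at the very top of each
 * aggregator's MMIO region, after all per-RMID per-event counters.
 * Offsets are counted back from the end of the region.
 */
static size_t data_loss_count_off(size_t mmio_size)
{
	return mmio_size - 24;
}

static size_t data_loss_timestamp_off(size_t mmio_size)
{
	return mmio_size - 16;
}

static size_t last_update_timestamp_off(size_t mmio_size)
{
	return mmio_size - 8;
}
```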
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
arch/x86/kernel/cpu/resctrl/internal.h | 2 +
arch/x86/kernel/cpu/resctrl/core.c | 2 +
arch/x86/kernel/cpu/resctrl/intel_aet.c | 60 +++++++++++++++++++++++++
3 files changed, 64 insertions(+)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index cea76f88422c..8d4bdae735e4 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -236,6 +236,7 @@ int intel_aet_read_event(int domid, u32 rmid, enum resctrl_event_id evtid,
void *arch_priv, u64 *val);
void intel_aet_mon_domain_setup(int cpu, int id, struct rdt_resource *r,
struct list_head *add_pos);
+void intel_aet_add_debugfs(void);
#else
static inline bool intel_aet_get_events(void) { return false; }
static inline void __exit intel_aet_exit(void) { }
@@ -247,6 +248,7 @@ static inline int intel_aet_read_event(int domid, u32 rmid, enum resctrl_event_i
static inline void intel_aet_mon_domain_setup(int cpu, int id, struct rdt_resource *r,
struct list_head *add_pos) { }
+static inline void intel_aet_add_debugfs(void) { }
#endif
#endif /* _ASM_X86_RESCTRL_INTERNAL_H */
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 7a9c7e6ad712..e96e5662e863 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -783,6 +783,8 @@ void resctrl_arch_pre_mount(void)
domain_add_cpu_mon(cpu, r);
mutex_unlock(&domain_list_lock);
cpus_read_unlock();
+
+ intel_aet_add_debugfs();
}
enum {
diff --git a/arch/x86/kernel/cpu/resctrl/intel_aet.c b/arch/x86/kernel/cpu/resctrl/intel_aet.c
index 2f4f8fb317d7..c9f2d8de2c60 100644
--- a/arch/x86/kernel/cpu/resctrl/intel_aet.c
+++ b/arch/x86/kernel/cpu/resctrl/intel_aet.c
@@ -18,8 +18,11 @@
#include <linux/container_of.h>
#include <linux/cpu.h>
#include <linux/cpumask.h>
+#include <linux/debugfs.h>
+#include <linux/dcache.h>
#include <linux/err.h>
#include <linux/errno.h>
+#include <linux/fs.h>
#include <linux/gfp_types.h>
#include <linux/init.h>
#include <linux/intel_pmt_features.h>
@@ -33,6 +36,7 @@
#include <linux/resctrl.h>
#include <linux/resctrl_types.h>
#include <linux/slab.h>
+#include <linux/sprintf.h>
#include <linux/stddef.h>
#include <linux/topology.h>
#include <linux/types.h>
@@ -203,6 +207,46 @@ static bool all_regions_have_sufficient_rmid(struct event_group *e, struct pmt_f
return ret;
}
+static int status_read(void *priv, u64 *val)
+{
+ void __iomem *info = (void __iomem *)priv;
+
+ *val = readq(info);
+
+ return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(status_fops, status_read, NULL, "%llu\n");
+
+static void make_status_files(struct dentry *dir, struct event_group *e, u8 pkg,
+ int instance, void *info_end)
+{
+ char name[64];
+
+ sprintf(name, "%s_pkg%u_agg%d_data_loss_count", e->name, pkg, instance);
+ debugfs_create_file(name, 0400, dir, info_end - 24, &status_fops);
+
+ sprintf(name, "%s_pkg%u_agg%d_data_loss_timestamp", e->name, pkg, instance);
+ debugfs_create_file(name, 0400, dir, info_end - 16, &status_fops);
+
+ sprintf(name, "%s_pkg%u_agg%d_last_update_timestamp", e->name, pkg, instance);
+ debugfs_create_file(name, 0400, dir, info_end - 8, &status_fops);
+}
+
+static void create_debug_event_status_files(struct dentry *dir, struct event_group *e)
+{
+ struct pmt_feature_group *p = e->pfg;
+ void *info_end;
+
+ for (int i = 0; i < p->count; i++) {
+ if (!p->regions[i].addr)
+ continue;
+ info_end = (void __force *)p->regions[i].addr + e->mmio_size;
+ make_status_files(dir, e, p->regions[i].plat_info.package_id,
+ i, info_end);
+ }
+}
+
static bool enable_events(struct event_group *e, struct pmt_feature_group *p)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_PERF_PKG].r_resctrl;
@@ -355,3 +399,19 @@ void intel_aet_mon_domain_setup(int cpu, int id, struct rdt_resource *r,
kfree(d);
}
}
+
+void intel_aet_add_debugfs(void)
+{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_PERF_PKG].r_resctrl;
+ struct event_group **peg;
+ struct dentry *infodir;
+
+ infodir = resctrl_debugfs_mon_info_arch_mkdir(r);
+
+ if (IS_ERR_OR_NULL(infodir))
+ return;
+
+ for_each_event_group(peg)
+ if ((*peg)->pfg)
+ create_debug_event_status_files(infodir, *peg);
+}
--
2.51.0
* Re: [PATCH v13 31/32] x86/resctrl: Add debugfs files to show telemetry aggregator status
2025-10-29 16:21 ` [PATCH v13 31/32] x86/resctrl: Add debugfs files to show telemetry aggregator status Tony Luck
@ 2025-11-13 22:53 ` Reinette Chatre
0 siblings, 0 replies; 85+ messages in thread
From: Reinette Chatre @ 2025-11-13 22:53 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 10/29/25 9:21 AM, Tony Luck wrote:
> Each telemetry aggregator provides three status registers at the top
> end of MMIO space after all the per-RMID per-event counters:
>
> data_loss_count: This counts the number of times that this aggregator
> failed to accumulate a counter value supplied by a CPU core.
>
> data_loss_timestamp: This is a "timestamp" from a free running
> 25MHz uncore timer indicating when the most recent data loss occurred.
>
> last_update_timestamp: Another 25MHz timestamp indicating when the
> most recent counter update was successfully applied.
>
> Create files in /sys/kernel/debug/resctrl/info/PERF_PKG_MON/x86_64/
> to display the value of each of these status registers for each aggregator
> in each enabled event group. The prefix for each file name describes
> the type of aggregator, which package it is located on, and an opaque
> instance number to provide a unique file name when there are multiple
> aggregators on a package.
>
> The suffix is one of the three strings listed above. An example name is:
>
> energy_pkg0_agg2_data_loss_count
Would files named like above have enough information when considering the
theoretical struct pmt_feature_group from patch #16? In that example there
are perf aggregators with two guids. As mentioned above, the aggregator
instance is opaque, so a user may not know which guid a file like the
above refers to.
Reinette
* [PATCH v13 32/32] x86,fs/resctrl: Update documentation for telemetry events
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
` (30 preceding siblings ...)
2025-10-29 16:21 ` [PATCH v13 31/32] x86/resctrl: Add debugfs files to show telemetry aggregator status Tony Luck
@ 2025-10-29 16:21 ` Tony Luck
2025-11-13 22:56 ` Reinette Chatre
2025-10-29 18:59 ` [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Luck, Tony
2025-11-16 17:35 ` Drew Fustini
33 siblings, 1 reply; 85+ messages in thread
From: Tony Luck @ 2025-10-29 16:21 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches, Tony Luck
Update resctrl filesystem documentation with the details about the
resctrl files that support telemetry events.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
Documentation/filesystems/resctrl.rst | 102 +++++++++++++++++++++++---
1 file changed, 90 insertions(+), 12 deletions(-)
diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
index b7f35b07876a..ea4995f402c9 100644
--- a/Documentation/filesystems/resctrl.rst
+++ b/Documentation/filesystems/resctrl.rst
@@ -168,13 +168,12 @@ with respect to allocation:
bandwidth percentages are directly applied to
the threads running on the core
-If RDT monitoring is available there will be an "L3_MON" directory
+If L3 monitoring is available there will be an "L3_MON" directory
with the following files:
"num_rmids":
- The number of RMIDs available. This is the
- upper bound for how many "CTRL_MON" + "MON"
- groups can be created.
+ The number of RMIDs supported by hardware for
+ L3 monitoring events.
"mon_features":
Lists the monitoring events if
@@ -400,6 +399,24 @@ with the following files:
bytes) at which a previously used LLC_occupancy
counter can be considered for re-use.
+If telemetry monitoring is available there will be a "PERF_PKG_MON" directory
+with the following files:
+
+"num_rmids":
+ The number of RMIDs for telemetry monitoring events. By default,
+ resctrl will not enable telemetry events of a particular type
+ ("perf" or "energy") if the number of RMIDs supported for that
+ type is lower than the number of RMIDs supported by hardware
+ for L3 monitoring events. The user can force-enable each type
+ of telemetry events with the "rdt=" boot command line option,
+ but this may reduce the number of "MON" groups that can be created.
+
+"mon_features":
+ Lists the telemetry monitoring events that are enabled on this system.
+
+The upper bound for how many "CTRL_MON" + "MON" groups can be created
+is the smaller of the L3_MON and PERF_PKG_MON "num_rmids" values.
+
Finally, in the top level of the "info" directory there is a file
named "last_cmd_status". This is reset with every "command" issued
via the file system (making new directories or writing to any of the
@@ -505,15 +522,40 @@ When control is enabled all CTRL_MON groups will also contain:
When monitoring is enabled all MON groups will also contain:
"mon_data":
- This contains a set of files organized by L3 domain and by
- RDT event. E.g. on a system with two L3 domains there will
- be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
- directories have one file per event (e.g. "llc_occupancy",
- "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
- files provide a read out of the current value of the event for
- all tasks in the group. In CTRL_MON groups these files provide
- the sum for all tasks in the CTRL_MON group and all tasks in
+ This contains one directory per monitor domain.
+
+ If L3 monitoring is enabled, there will be a "mon_L3_XX" directory for
+ each instance of an L3 cache. Each directory contains files for the enabled
+ L3 events (e.g. "llc_occupancy", "mbm_total_bytes", and "mbm_local_bytes").
+
+ If telemetry monitoring is enabled, there will be a "mon_PERF_PKG_YY"
+ directory for each physical processor package. Each directory contains
+ files for the enabled telemetry events (e.g. "core_energy", "activity",
+ "uops_retired", etc.).
+
+ The info/`*`/mon_features files provide the full list of enabled
+ event/file names.
+
+ "core_energy" reports a floating point number for the energy (in Joules)
+ consumed by cores (registers, arithmetic units, TLB and L1/L2 caches)
+ during execution of instructions summed across all logical CPUs on a
+ package for the current monitoring group.
+
+ "activity" also reports a floating point value (in Farads). This provides
+ an estimate of work done independent of the frequency at which the CPUs
+ executed.
+
+ Note that "core_energy" and "activity" only measure energy/activity in the
+ "core" of the CPU (arithmetic units, TLB, L1 and L2 caches, etc.). They
+ do not include L3 cache, memory, I/O devices etc.
+
+ All other events report decimal integer values.
+
+ In a MON group these files provide a read out of the current value of
+ the event for all tasks in the group. In CTRL_MON groups these files
+ provide the sum for all tasks in the CTRL_MON group and all tasks in
MON groups. Please see example section for more details on usage.
+
On systems with Sub-NUMA Cluster (SNC) enabled there are extra
directories for each node (located within the "mon_L3_XX" directory
for the L3 cache they occupy). These are named "mon_sub_L3_YY"
@@ -1506,6 +1548,42 @@ Example with C::
resctrl_release_lock(fd);
}
+Debugfs
+=======
+In addition to the use of debugfs for tracing of pseudo-locking performance,
+architecture code may create debugfs directories associated with monitoring
+features for a specific resource.
+
+The full pathname for these is in the form:
+
+ /sys/kernel/debug/resctrl/info/{resource_name}_MON/{arch}/
+
+The presence, names, and format of these files may vary between architectures
+even if the same resource is present.
+
+PERF_PKG_MON/x86_64
+-------------------
+Three status files are present per telemetry aggregator instance.
+The prefix of each file name encodes the type ("energy" or "perf"),
+the processor package it belongs to, and the instance number of the
+aggregator. For example: "energy_pkg1_agg2".
+
+The suffix describes which data is reported in the file and is one of:
+
+data_loss_count:
+ This counts the number of times that this aggregator
+ failed to accumulate a counter value supplied by a CPU.
+
+data_loss_timestamp:
+ This is a "timestamp" from a free-running 25MHz uncore
+ timer indicating when the most recent data loss occurred.
+
+last_update_timestamp:
+ Another 25MHz timestamp indicating when the
+ most recent counter update was successfully applied.
+
+
Examples for RDT Monitoring along with allocation usage
=======================================================
Reading monitored data
--
2.51.0
^ permalink raw reply related [flat|nested] 85+ messages in thread
* Re: [PATCH v13 32/32] x86,fs/resctrl: Update documentation for telemetry events
2025-10-29 16:21 ` [PATCH v13 32/32] x86,fs/resctrl: Update documentation for telemetry events Tony Luck
@ 2025-11-13 22:56 ` Reinette Chatre
0 siblings, 0 replies; 85+ messages in thread
From: Reinette Chatre @ 2025-11-13 22:56 UTC (permalink / raw)
To: Tony Luck, Fenghua Yu, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 10/29/25 9:21 AM, Tony Luck wrote:
> @@ -400,6 +399,24 @@ with the following files:
> bytes) at which a previously used LLC_occupancy
> counter can be considered for re-use.
>
> +If telemetry monitoring is available there will be an "PERF_PKG_MON" directory
"an" -> "a"?
> +with the following files:
> +
> +"num_rmids":
> + The number of RMIDs for telemetry monitoring events. By default,
> + resctrl will not enable telemetry events of a particular type
> + ("perf" or "energy") if the number of RMIDs supported for that
> + type is lower than the number of RMIDs supported by hardware
> + for L3 monitoring events. The user can force-enable each type
It is not clear to me how the number of L3 monitoring events is relevant here. This
is addressed later with the "The upper bound for how many "CTRL_MON" + "MON" can be
created ...", no?
How about something like: "if the number of RMIDs that can be tracked concurrently
for that type is lower than the total number of RMIDs supported by that type."?
(I am sure it can be improved)
> + of telemetry events with the "rdt=" boot command line option,
> + but this may reduce the number of "MON" groups that can be created.
Since this includes "CTRL_MON" and "MON" groups it may be simpler to just say "monitoring
groups".
> +
> +"mon_features":
> + Lists the telemetry monitoring events that are enabled on this system.
> +
> +The upper bound for how many "CTRL_MON" + "MON" can be created
> +is the smaller of the L3_MON and PERF_PKG_MON "num_rmids" values.
> +
> Finally, in the top level of the "info" directory there is a file
> named "last_cmd_status". This is reset with every "command" issued
> via the file system (making new directories or writing to any of the
...
> +Debugfs
> +=======
> +In addition to the use of debugfs for tracing of pseudo-locking performance,
> +architecture code may create debugfs directories associated with monitoring
> +features for a specific resource.
> +
> +The full pathname for these is in the form:
> +
> + /sys/kernel/debug/resctrl/info/{resource_name}_MON/{arch}/
> +
> +The presence, names, and format of these files may vary between architectures
> +even if the same resource is present.
> +
> +PERF_PKG_MON/x86_64
> +-------------------
> +Three files are present per telemetry aggregator instance that show status.
> +The prefix of each file name describes the type ("energy" or "perf") which
> +processor package it belongs to, and the instance number of the aggregator.
> +For example: "energy_pkg1_agg2".
> +
> +The suffix describes which data is reported in the file and
> +is one of:
(nit: unnecessary line break)
> +
> +data_loss_count:
> + This counts the number of times that this aggregator
> + failed to accumulate a counter value supplied by a CPU.
> +
> +data_loss_timestamp:
> + This is a "timestamp" from a free running 25MHz uncore
> + timer indicating when the most recent data loss occurred.
> +
> +last_update_timestamp:
> + Another 25MHz timestamp indicating when the
> + most recent counter update was successfully applied.
> +
> +
> Examples for RDT Monitoring along with allocation usage
> =======================================================
> Reading monitored data
Reinette
* Re: [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
` (31 preceding siblings ...)
2025-10-29 16:21 ` [PATCH v13 32/32] x86,fs/resctrl: Update documentation for telemetry events Tony Luck
@ 2025-10-29 18:59 ` Luck, Tony
2025-11-05 15:33 ` Moger, Babu
2025-11-16 17:35 ` Drew Fustini
33 siblings, 1 reply; 85+ messages in thread
From: Luck, Tony @ 2025-10-29 18:59 UTC (permalink / raw)
To: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu
Cc: x86, linux-kernel, patches
I took a stab at applying the AET patches on top of Babu's v10
SDCIAE series https://lore.kernel.org/all/cover.1761090859.git.babu.moger@amd.com/
There are only a couple of easy to resolve conflicts.
I pushed a branch with v6.18-rc3 + SDCIAE + AET here:
git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git sdciae-aet
-Tony
* Re: [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring
2025-10-29 18:59 ` [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Luck, Tony
@ 2025-11-05 15:33 ` Moger, Babu
2025-11-05 15:41 ` Luck, Tony
2025-12-17 0:28 ` Luck, Tony
0 siblings, 2 replies; 85+ messages in thread
From: Moger, Babu @ 2025-11-05 15:33 UTC (permalink / raw)
To: Luck, Tony, Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman,
Peter Newman, James Morse, Babu Moger, Drew Fustini, Dave Martin,
Chen Yu
Cc: x86, linux-kernel, patches
Hi Tony,
On 10/29/2025 1:59 PM, Luck, Tony wrote:
> I took a stab at applying the AET patches on top of Babu's v10
> SDCIAE series https://lore.kernel.org/all/cover.1761090859.git.babu.moger@amd.com/
>
> There are only a couple of easy to resolve conflicts.
>
> I pushed a branch with v6.18-rc3 + SDCIAE + AET here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git sdciae-aet
>
I ran your code on my AMD system. It appears to be working fine.
I don't have the hardware that supports these features. It would be
helpful to list the resctrl interface files (files in the info directory
and files in each group) and give an example of how it looks on a system
that supports these features. It could be in the cover letter or the
resctrl.rst file. It will also help with reviewing the code.
thanks
Babu
* RE: [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring
2025-11-05 15:33 ` Moger, Babu
@ 2025-11-05 15:41 ` Luck, Tony
2025-12-17 0:28 ` Luck, Tony
1 sibling, 0 replies; 85+ messages in thread
From: Luck, Tony @ 2025-11-05 15:41 UTC (permalink / raw)
To: Moger, Babu, Fenghua Yu, Chatre, Reinette, Wieczor-Retman, Maciej,
Peter Newman, James Morse, Babu Moger, Drew Fustini, Dave Martin,
Chen, Yu C
Cc: x86@kernel.org, linux-kernel@vger.kernel.org,
patches@lists.linux.dev
> > I took a stab at applying the AET patches on top of Babu's v10
> > SDCIAE series https://lore.kernel.org/all/cover.1761090859.git.babu.moger@amd.com/
> >
> > There are only a couple of easy to resolve conflicts.
> >
> > I pushed a branch with v6.18-rc3 + SDCIAE + AET here:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git sdciae-aet
> >
>
> I ran your code on my AMD system. It appears to be working fine.
> I don't have the hardware that supports these features. It would be
> helpful to list the resctrl interface files (files in info directory and
> files each group) and give an example of how it looks in the system that
> supports these features. It can be in cover letter or resctrl.rst file
> as well. It will also help to review the code.
Babu,
Thanks for testing. Good to know that I didn't break your code while
merging mine on top.
Good suggestion. I'll add some examples to the cover letter when
I spin v14 after I get feedback on this series.
-Tony
* Re: [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring
2025-11-05 15:33 ` Moger, Babu
2025-11-05 15:41 ` Luck, Tony
@ 2025-12-17 0:28 ` Luck, Tony
2025-12-17 16:44 ` Moger, Babu
1 sibling, 1 reply; 85+ messages in thread
From: Luck, Tony @ 2025-12-17 0:28 UTC (permalink / raw)
To: Moger, Babu
Cc: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu, x86,
linux-kernel, patches
On Wed, Nov 05, 2025 at 09:33:56AM -0600, Moger, Babu wrote:
> Hi Tony,
>
> On 10/29/2025 1:59 PM, Luck, Tony wrote:
> > I took a stab at applying the AET patches on top of Babu's v10
> > SDCIAE series https://lore.kernel.org/all/cover.1761090859.git.babu.moger@amd.com/
> >
> > There are only a couple of easy to resolve conflicts.
> >
> > I pushed a branch with v6.18-rc3 + SDCIAE + AET here:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git sdciae-aet
> >
>
> I ran your code on my AMD system. It appears to be working fine.
> I don't have the hardware that supports these features. It would be helpful
> to list the resctrl interface files (files in info directory and files each
> group) and give an example of how it looks in the system that supports these
> features. It can be in cover letter or resctrl.rst file as well. It will
> also help to review the code.
Reinette reminded me that I didn't provide the examples that I promised.
Here's what I plan to add to the v17 cover letter. Is this enough
detail?
-Tony
Examples:
--------
As with other resctrl monitoring features first create CTRL_MON or MON
directories and assign the tasks of interest to the group.
Energy events:
-------------
There are two events associated with energy consumption in the core.
The "core_energy" event reports directly in Joules. To compute
power, take the difference between two samples and divide by the
time between them. E.g.
$ cat core_energy; sleep 10; cat core_energy
94499439.510380
94499607.019680
$ bc -q
scale=3
(94499607.019680 - 94499439.510380) / 10
16.750
So 16.75 Watts in this example.
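The same arithmetic can be scripted. A minimal sketch with awk, using the
two sample readings above (note awk rounds to three places where bc's
scale=3 truncates, so it prints 16.751 rather than 16.750):

```shell
# Sketch: average core power over the sampling interval.
# The two readings and the 10 second interval are the sample
# values from the example above.
e1=94499439.510380
e2=94499607.019680
awk -v a="$e1" -v b="$e2" -v t=10 'BEGIN { printf "%.3f Watts\n", (b - a) / t }'
# prints: 16.751 Watts
```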
Note that different runs of the same workload may report different
energy consumption. This happens when cores shift to different
voltage/frequency profiles due to overall system load.
The "activity" event reports energy usage in a manner independent
of voltage and frequency. This may be useful for developers to
assess how modifications to a program (e.g. linking against a library
optimized to use AVX instructions) affect energy consumption. So
read the "activity" at the start and end of program execution and
compute the difference.
Perf events:
-----------
The other telemetry events largely duplicate events available using
"perf", but avoid of reading the perf counters on every context switch.
This may be a significant improvement when monitoring highly multi-threaded
applications. E.g. to find the ratio of core cycles to reference cycles:
$ cat unhalted_core_cycles unhalted_ref_cycles
1312249223146571
1660157011698276
$ { run application here }
$ cat unhalted_core_cycles unhalted_ref_cycles
1313573565617233
1661511224019444
$ bc -q
scale = 3
(1661511224019444 - 1660157011698276) / (1313573565617233 - 1312249223146571)
1.022
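The ratio computation can likewise be scripted without an interactive bc
session. A sketch using the counter samples above; int(r * 1000) / 1000
reproduces bc's scale=3 behavior (truncation, not rounding):

```shell
# Sketch: ratio of reference-cycle delta to core-cycle delta, using
# the counter values from the example above.
c1=1312249223146571; c2=1313573565617233   # unhalted_core_cycles samples
r1=1660157011698276; r2=1661511224019444   # unhalted_ref_cycles samples
awk -v c1="$c1" -v c2="$c2" -v r1="$r1" -v r2="$r2" \
    'BEGIN { r = (r2 - r1) / (c2 - c1); printf "%.3f\n", int(r * 1000) / 1000 }'
# prints: 1.022
```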
* Re: [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring
2025-12-17 0:28 ` Luck, Tony
@ 2025-12-17 16:44 ` Moger, Babu
2025-12-17 17:08 ` Luck, Tony
0 siblings, 1 reply; 85+ messages in thread
From: Moger, Babu @ 2025-12-17 16:44 UTC (permalink / raw)
To: Luck, Tony
Cc: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu, x86,
linux-kernel, patches
Hi Tony,
On 12/16/2025 6:28 PM, Luck, Tony wrote:
> On Wed, Nov 05, 2025 at 09:33:56AM -0600, Moger, Babu wrote:
>> Hi Tony,
>>
>> On 10/29/2025 1:59 PM, Luck, Tony wrote:
>>> I took a stab at applying the AET patches on top of Babu's v10
>>> SDCIAE series https://lore.kernel.org/all/cover.1761090859.git.babu.moger@amd.com/
>>>
>>> There are only a couple of easy to resolve conflicts.
>>>
>>> I pushed a branch with v6.18-rc3 + SDCIAE + AET here:
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git sdciae-aet
>>>
>>
>> I ran your code on my AMD system. It appears to be working fine.
>> I don't have the hardware that supports these features. It would be helpful
>> to list the resctrl interface files (files in info directory and files each
>> group) and give an example of how it looks in the system that supports these
>> features. It can be in cover letter or resctrl.rst file as well. It will
>> also help to review the code.
>
> Reinette reminded me that I didn't provide the examples that I promised.
>
>
> Here's what I plan to add to the v17 cover letter. Is this enough
> detail?
>
> -Tony
>
>
> Examples:
> --------
>
> As with other resctrl monitoring features first create CTRL_MON or MON
> directories and assign the tasks of interest to the group.
>
> Energy events:
> -------------
>
> There are two events associated with energy consumption in the core.
> The "core_energy" event reports out directly in Joules. To compute
> power just take the difference between two samples and divide by the
> time between them. E.g.
>
> $ cat core_energy; sleep 10; cat core_energy
Please use the full path (/sys/fs/resctrl/test/mon_data/xx/<file_name>).
Otherwise looks good to me.
Thanks
Babu
> 94499439.510380
> 94499607.019680
> $ bc -q
> scale=3
> (94499607.019680 - 94499439.510380) / 10
> 16.750
>
> So 16.75 Watts in this example.
>
> Note that different runs of the same workload may report different
> energy consumption. This happens when cores shift to different
> voltage/frequency profiles due to overall system load.
>
> The "activity" event reports energy usage in a manner independent
> of voltage and frequency. This may be useful for developers to
> assess how modifications to a program (e.g. attaching to a library
> optimized to use AVX instructions) affect energy consumption. So
> read the "activity" at the start and end of program execution and
> compute the difference.
>
> Perf events:
> -----------
>
> The other telemetry events largely duplicate events available using
> "perf", but avoid reading the perf counters on every context switch.
> This may be a significant improvement when monitoring highly multi-threaded
> applications. E.g. to find the ratio of core cycles to reference cycles:
>
> $ cat unhalted_core_cycles unhalted_ref_cycles
> 1312249223146571
> 1660157011698276
> $ { run application here }
> $ cat unhalted_core_cycles unhalted_ref_cycles
> 1313573565617233
> 1661511224019444
> $ bc -q
> scale = 3
> (1661511224019444 - 1660157011698276) / (1313573565617233 - 1312249223146571)
> 1.022
>
>
* RE: [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring
2025-12-17 16:44 ` Moger, Babu
@ 2025-12-17 17:08 ` Luck, Tony
0 siblings, 0 replies; 85+ messages in thread
From: Luck, Tony @ 2025-12-17 17:08 UTC (permalink / raw)
To: Moger, Babu
Cc: Fenghua Yu, Chatre, Reinette, Wieczor-Retman, Maciej,
Peter Newman, James Morse, Babu Moger, Drew Fustini, Dave Martin,
Chen, Yu C, x86@kernel.org, linux-kernel@vger.kernel.org,
patches@lists.linux.dev
>> $ cat core_energy; sleep 10; cat core_energy
>
> Please use the full path (/sys/fs/resctrl/test/mon_data/xx/<file_name>).
I'll add to the intro paragraph to set a shell variable and use that in the example
to keep the line lengths reasonable like this:
$ cat $dir/core_energy; sleep 10; cat $dir/core_energy
> Otherwise looks good to me.
Thanks
-Tony
* Re: [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring
2025-10-29 16:20 [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Tony Luck
` (32 preceding siblings ...)
2025-10-29 18:59 ` [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring Luck, Tony
@ 2025-11-16 17:35 ` Drew Fustini
2025-11-17 16:52 ` Luck, Tony
33 siblings, 1 reply; 85+ messages in thread
From: Drew Fustini @ 2025-11-16 17:35 UTC (permalink / raw)
To: Tony Luck
Cc: Fenghua Yu, Reinette Chatre, Maciej Wieczor-Retman, Peter Newman,
James Morse, Babu Moger, Drew Fustini, Dave Martin, Chen Yu, x86,
linux-kernel, patches
On Wed, Oct 29, 2025 at 09:20:43AM -0700, Tony Luck wrote:
> Patches based on v6.18-rc3
[snip]
>
> Background
> ----------
> On Intel systems that support per-RMID telemetry monitoring each logical
> processor keeps a local count for various events. When the
> MSR_IA32_PQR_ASSOC.RMID value for the logical processor changes (or when a
> two millisecond counter expires) these event counts are transmitted to
> an event aggregator on the same package as the processor together with
> the current RMID value. The event counters are reset to zero to begin
> counting again.
Do you have any suggestion of which Xeon parts I should be looking for
if I want to try out this feature?
I'm looking at bare metal hosting providers for an Intel server that I
can try out features like SNC and this. I look up the parts on ARK but
it shows yes/no for RDT, which is not very specific. I'm hoping to do
something better than just hoping cpuinfo has what I want once the
machine is provisioned :)
Thanks,
Drew
* RE: [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring
2025-11-16 17:35 ` Drew Fustini
@ 2025-11-17 16:52 ` Luck, Tony
2025-11-18 23:03 ` Drew Fustini
0 siblings, 1 reply; 85+ messages in thread
From: Luck, Tony @ 2025-11-17 16:52 UTC (permalink / raw)
To: Drew Fustini
Cc: Fenghua Yu, Chatre, Reinette, Wieczor-Retman, Maciej,
Peter Newman, James Morse, Babu Moger, Drew Fustini, Dave Martin,
Chen, Yu C, x86@kernel.org, linux-kernel@vger.kernel.org,
patches@lists.linux.dev
> > On Intel systems that support per-RMID telemetry monitoring each logical
> > processor keeps a local count for various events. When the
> > MSR_IA32_PQR_ASSOC.RMID value for the logical processor changes (or when a
> > two millisecond counter expires) these event counts are transmitted to
> > an event aggregator on the same package as the processor together with
> > the current RMID value. The event counters are reset to zero to begin
> > counting again.
>
> Do you have any suggestion of which Xeon parts I should be looking for
> if I want to try out this feature?
>
> I'm looking at bare metal hosting providers for an Intel server that I
> can try out features like SNC and this. I look up the parts on ARK but
> it shows yes/no for RDT which is not very specific. I'm hoping to do
> something better than just hoping cpuinfo has what I want once the
> machine is provisioned :)
>
Drew,
Telemetry monitoring based on RMIDs is only present on Clearwater Forest
systems today. Those haven't been released into the wild yet.
-Tony
* Re: [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring
2025-11-17 16:52 ` Luck, Tony
@ 2025-11-18 23:03 ` Drew Fustini
2025-11-18 23:12 ` Luck, Tony
0 siblings, 1 reply; 85+ messages in thread
From: Drew Fustini @ 2025-11-18 23:03 UTC (permalink / raw)
To: Luck, Tony
Cc: Drew Fustini, Fenghua Yu, Chatre, Reinette,
Wieczor-Retman, Maciej, Peter Newman, James Morse, Babu Moger,
Drew Fustini, Dave Martin, Chen, Yu C, x86@kernel.org,
linux-kernel@vger.kernel.org, patches@lists.linux.dev
On Mon, Nov 17, 2025 at 04:52:09PM +0000, Luck, Tony wrote:
> Drew,
>
> Telemetry monitoring based on RMIDs is only present on Clearwater Forest
> systems today. Those haven't been released into the wild yet.
>
> -Tony
Thanks for letting me know. Do you know what Xeon parts I should look
for if I want to try out the SNC support?
Drew
* RE: [PATCH v13 00/32] x86,fs/resctrl telemetry monitoring
2025-11-18 23:03 ` Drew Fustini
@ 2025-11-18 23:12 ` Luck, Tony
0 siblings, 0 replies; 85+ messages in thread
From: Luck, Tony @ 2025-11-18 23:12 UTC (permalink / raw)
To: Drew Fustini
Cc: Drew Fustini, Fenghua Yu, Chatre, Reinette,
Wieczor-Retman, Maciej, Peter Newman, James Morse, Babu Moger,
Drew Fustini, Dave Martin, Chen, Yu C, x86@kernel.org,
linux-kernel@vger.kernel.org, patches@lists.linux.dev
> Thanks for letting me know. Do you know what Xeon parts I should look
> for if I want to try out the SNC support?
Drew
Granite Rapids (a.k.a. "Intel(R) Xeon(R) 6" with P-cores ... there's also a Xeon 6 with E-cores called Sierra Forest).
-Tony